A Self Validation Network for Object-Level Human Attention Estimation

10/31/2019 ∙ by Zehua Zhang, et al. ∙ 8

Due to the foveated nature of the human vision system, people can focus their visual attention on only a small region of their visual field at a time, which usually contains a single object. Estimating this object of attention in first-person (egocentric) videos is useful for many human-centered real-world applications such as augmented reality and driver assistance systems. A straightforward solution is to pick the object whose bounding box is hit by the gaze, where the gaze point comes from a traditional eye gaze estimator and object candidates are generated by an off-the-shelf object detector. However, such an approach can fail because it addresses the where and the what problems separately, even though they are highly related, chicken-and-egg problems. In this paper, we propose a novel unified model that incorporates both spatial and temporal evidence in identifying as well as locating the attended object in first-person videos. It introduces a novel Self Validation Module that enforces and leverages consistency of the where and the what concepts. We evaluate on two public datasets, demonstrating that the Self Validation Module significantly benefits both training and testing and that our model outperforms the state-of-the-art.




1 Introduction

Figure 1: Among the many objects appearing in an egocentric video frame of a person’s field of view, we want to identify and locate the object to which the person is visually attending. Combining traditional eye gaze estimators and existing object detectors can fail when eye gaze prediction (blue dot) is slightly incorrect, such as when (a) it falls in the intersection of two object bounding boxes or (b) it lies between two bounding boxes sharing the same class. Red boxes show the actual attended object according to ground truth gaze and yellow dashed boxes show incorrect predictions.

Humans can focus their visual attention on only a small part of their surroundings at any moment, and thus have to choose what to pay attention to in real time Mozer and Sitton. Driven by the tasks and intentions we have in mind, we manage attention with our foveated visual system by adjusting our head pose and our gaze point in order to focus on the most relevant object in the environment at any moment in time Lazzari et al. (2009); Bowman et al. (2009); Hayhoe and Ballard (2005); Vidoni et al. (2009); Perone et al. (2008).

This close relationship between intention, attention, and semantic objects has inspired a variety of work in computer vision, including image classification Karessli et al. (2017), object detection Papadopoulos et al. (2014); Karthikeyan et al. (2013); Shcherbatyi et al. (2015); Rutishauser et al. (2004), action recognition Li et al. (2018c); Baradel et al. (2018); Pirsiavash and Ramanan (2012); Ma et al. (2016), action prediction Shen et al. (2018), video summarization Lee et al. (2012), visual search modeling Sattar et al. (2015), and irrelevant frame removal Liu et al. (2010), in which the attended object estimation serves as auxiliary information. Despite being a key component of these papers, how to identify and locate the important object is seldom studied explicitly. This problem in and of itself is of broad potential use in real-world applications such as driver assistance systems and intelligent human-like robots.

In this paper, we discuss how to identify and locate the attended object in first-person videos. Recorded by head-mounted cameras along with eye trackers, first-person videos capture an approximation of what people see in their fields of view as they go about their lives, yielding interesting data for studying real-time human attention. In contrast to gaze studies of static images or pre-recorded videos, first-person video is unique in that there is exactly one correct point of attention in each frame, as a camera wearer can only gaze at one point at a time. Accordingly, one and only one gazed object exists for each frame, reflecting the camera wearer’s real-time attention and intention. We will use the term object of interest to refer to the attended object in our later discussion.

Some recent work Zhang et al. (2018); Huang et al. (2018); Zhang et al. (2017a) has discussed estimating probability maps of ego-attention or predicting gaze points in egocentric videos. However, people think not in terms of points in their field of view, but in terms of the objects that they are attending to. Of course, the object of interest could be obtained by first estimating the gaze with a gaze estimator and generating object candidates from an off-the-shelf object detector, and then picking the object that the estimated gaze falls in. Because this bottom-up approach estimates where and what separately, it could be doomed to fail if the eye gaze prediction is slightly inaccurate, such as falling between two objects or in the intersection of multiple object bounding boxes (Figure 1). To assure consistency, one may think of performing anchor-level attention estimation and directly predicting the attended box by modifying existing object detectors. The class can be predicted either simultaneously with the anchor-level attention estimation using the same set of features, as in SSD Liu et al. (2016), or afterwards using the features pooled within the attended box, as in Faster-RCNN Ren et al. (2015). Either way, these methods still do not yield satisfying performance, as we will show in Sec. 4.2, because they lack the ability to leverage the consistency to refine the results.

We propose to identify and locate the object of interest by jointly estimating where it is within the frame as well as recognizing what its identity is. In particular, we propose a novel model — which we cheekily call Mindreader Net or Mr. Net — to jointly solve the problem. Our model incorporates both spatial evidence within frames and temporal evidence across frames, in a network architecture (which we call the Cogged Spatial-Temporal Module) with separate spatial and temporal branches to avoid feature entanglement.

A key feature of our model is that it explicitly enforces and leverages a simple but extremely useful constraint: our estimate of what is being attended should be located in exactly the position of where we estimate the attention to be. This Self Validation Module first computes similarities between the global object of interest class prediction vector and each local anchor box class prediction vector as the attention validation score to update the anchor attention score prediction, and then, with the updated anchor attention score, we select the attended anchor and use its corresponding class prediction score to update the global object of interest class prediction. With global context originally incorporated by extracting features from the whole clip using 3D convolution, the Self Validation Module helps the network focus on the local context in a spatially-local anchor box and a temporally-local frame.

We evaluate the approach on two existing first-person video datasets that include attended object ground truth annotations. We show that our approach outperforms baselines, and that our Self Validation Module not only improves performance by refining the outputs with visual consistency during testing, but also helps bridge multiple components together during training to guide the model to learn a highly meaningful latent representation. More information is available at http://vision.soic.indiana.edu/mindreader/.

2 Related Work

Compared with many efforts to understand human attention by modeling eye gaze Zhang et al. (2018, 2017a); Huang et al. (2018); Li et al. (2014, 2013); Itti et al. (1998); Harel et al. (2007); Hou et al. (2012); Huang et al. (2015); Pan et al. (2016); Treisman and Gelade (1980); Torralba et al. (2006); Ba and Odobez (2011); Borji et al. (2012); Yamada et al. (2012) or saliency Judd et al. (2009); Zhao et al. (2015); Hou et al. (2017); Li et al. (2018b); Liu and Han (2016); Li et al. (2017); Song et al. (2018); Li et al. (2018a); Zhang et al. (2017b), there are relatively few papers that detect object-level attention. Lee et al. (2012) address video summarization with hand-crafted features to detect important people and objects, while object-level reasoning plays a key role in the work of Baradel et al. (2018) on understanding videos through interactions of important objects. In the particular case of egocentric video, Pirsiavash and Ramanan (2012) and Ma et al. (2016) detect objects in hands as a proxy for attended objects to help action recognition. However, eye gaze usually precedes hand motion and thus objects in hand are not always those being visually attended (Fig. 1(a)). Shen et al. (2018) combine eye gaze ground truth and detected object bounding boxes to extract attended object information for future action prediction. EgoNet Bertasius et al. (2016), among the first papers to focus on important object detection in first-person videos, combines visual appearance and 3D layout information to generate probability maps of object importance. Multiple objects can be detected in a single frame, making their results more similar to saliency than to human attention in egocentric videos.

Perhaps the most related work to ours is Bertasius et al.’s Visual-Spatial Network (VSN) Bertasius et al. (2017), which proposes an unsupervised method for important object detection in first-person videos that incorporates the idea of consistency between the where and what concepts to facilitate learning. However, VSN requires a much more complicated training strategy of switching the cascade order of the two pathways multiple times, whereas we present a unified framework that can be learned end-to-end.

3 Our approach

Given a video captured with a head-mounted camera, our goal is to detect the object that is visually attended in each frame. This is challenging because egocentric videos can be highly cluttered, with many competing objects vying for attention. We thus incorporate temporal cues that consider multiple frames at a time. We first consider performing detection for the middle frame of a short input sequence (as in Ma et al. (2016)), and then further develop it to work online (considering only past information) by performing detection on the last frame. Our novel model consists of two main parts (Figure 2), which we call the Cogged Spatial-Temporal Module and the Self Validation Module.

Figure 2: The architecture of our proposed Mindreader Net. Numbers indicate output size of each component (where $c$ is the number of object classes). Softmax is applied before computing the losses on global classification, anchor box classification, and attention (which is first flattened to be 8732-d). Please refer to supplementary materials for details about the Cogged Spatial-Temporal Module.

3.1 Cogged Spatial-Temporal Module

The Cogged Spatial-Temporal Module consists of a spatial and a temporal branch. The “cogs” refer to the way that the outputs of each layer of the two branches are combined together, reminiscent of the interlocking cogs of two gears (Figure 2). Please see supplementary material for more details.

The Spatial Gear Branch, inspired by SSD300 Liu et al. (2016), takes a single frame of size $h \times w$ and performs spatial prediction of local anchor box offsets and anchor box classes. It is expected to work as an object detector, although we only have ground truth for the objects of interest to train it, so we do not add an extra background class as in Liu et al. (2016), and only compute losses for the spatial-based tasks on the matched positive anchors. We use atrous Chen et al. (2018); Yu and Koltun (2015) VGG16 Simonyan and Zisserman (2014) as the backbone and follow a similar anchor box setting as Liu et al. (2016). We also apply the same multi-anchor matching strategy. With the spatial branch, we obtain anchor box offset predictions $O \in \mathbb{R}^{a \times 4}$ and class predictions $C_{box} \in \mathbb{R}^{a \times c}$, where $a$ is the number of anchor boxes and $c$ is the number of classes in our problem. Following SSD300 Liu et al. (2016), we have $a = 8732$, $h = 300$, and $w = 300$.

The Temporal Gear Branch takes $N$ continuous RGB frames as well as the corresponding optical flow fields, both of spatial resolution $h \times w$ (with $N = 15$, set empirically). We use Inception-V1 Szegedy et al. (2015) I3D Carreira and Zisserman (2017) as the backbone of our temporal branch. With aggregated global features from 3D convolution, we obtain global object of interest class predictions $C_{global}$ and anchor box attention predictions $A$. We match the ground truth box only to the anchor with the greatest overlap (intersection over union). The matching strategy is empirical and discussed in Section 4.3.
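
The one-best matching rule for the attention predictor can be illustrated with a minimal sketch in plain Python. The corner-coordinate box format `(x_min, y_min, x_max, y_max)` and the helper names `iou`, `one_best_match`, and `attention_target` are assumptions made for illustration; they are not taken from the authors' implementation.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def one_best_match(gt_box, anchors):
    """Index of the single anchor with the greatest IoU with the ground truth box."""
    overlaps = [iou(gt_box, anchor) for anchor in anchors]
    return max(range(len(anchors)), key=lambda i: overlaps[i])


def attention_target(gt_box, anchors):
    """One-hot training target over anchors: 1 for the matched anchor, 0 elsewhere."""
    best = one_best_match(gt_box, anchors)
    return [1.0 if i == best else 0.0 for i in range(len(anchors))]


if __name__ == "__main__":
    # Three hypothetical anchors and one ground truth box.
    anchors = [(0, 0, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30)]
    print(attention_target((4, 4, 14, 14), anchors))
```

Under this rule exactly one anchor receives a positive attention label per frame, which mirrors the premise that there is one and only one gazed object in each frame.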

3.2 Self Validation Module

The Self Validation Module connects the above branches and delivers global and local context between the two branches at both spatial (e.g., whole frame versus an anchor box) and temporal (e.g., whole sequence versus a single frame) levels. It incorporates the constraint on consistency between where and what by embedding a double validation mechanism: what→where and where→what.


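What→where. With the outputs of the Cogged Spatial-Temporal Module, we compute the cosine similarities between the global class prediction $C_{global}$ and the class prediction for each anchor box, $C_{box}^{i}$, yielding an attention validation score for each box $i$,

$$V_{attn}^{i} = \frac{C_{global} \cdot C_{box}^{i}}{\lVert C_{global} \rVert \, \lVert C_{box}^{i} \rVert}. \qquad (1)$$

Then the attention validation vector $V_{attn} = [V_{attn}^{1}, \dots, V_{attn}^{a}]$ over the $a$ anchor boxes is used to update the anchor box attention scores $A$ by element-wise summation, $A' = A + V_{attn}$. Since $-1 \le V_{attn}^{i} \le 1$, we make the optimization easier by rescaling each $A_{i}$ to the range $[-1, 1]$,

$$A' = R(A) + V_{attn} = \frac{A - \min(A)}{\max(A) - \min(A)} \times 2 - 1 + V_{attn}, \qquad (2)$$

where $\max(\cdot)$ and $\min(\cdot)$ are element-wise vector operations.

Where→what. Intuitively, obtaining the attended anchor box index $m$ is a simple matter of computing $m = \arg\max_{i} A'_{i}$, and the class validation score is simply $V_{class} = C_{box}^{m}$. Similarly, after rescaling, we take an element-wise summation of $V_{class}$ and $C_{global}$ to update the global object of interest class prediction ($R(\cdot)$ in Equation 2), $C'_{global} = R(C_{global}) + R(V_{class})$. However, the hard argmax is not differentiable, and thus gradients are not able to backpropagate properly during training. We thus use soft argmax. Softmax is applied to the updated anchor box attention score $A'$ to produce a weighting vector $\tilde{A}'$ for class validation score estimation,

$$\tilde{V}_{class} = \sum_{i=1}^{a} \tilde{A}'_{i} \, C_{box}^{i}, \quad \text{with} \quad \tilde{A}'_{i} = \frac{e^{A'_{i}}}{\sum_{j=1}^{a} e^{A'_{j}}}. \qquad (3)$$

Now we replace $V_{class}$ with $\tilde{V}_{class}$ to update $C_{global}$, $C'_{global} = R(C_{global}) + R(\tilde{V}_{class})$.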
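
The two validation steps can be sketched in a few lines of plain Python. This is only an illustration of the computation described above, not the authors' Keras code: the function names, the list-based data layout, the min-max form of the rescaling, and the choice to map a constant vector to zeros are all assumptions.

```python
import math


def rescale(values):
    """Min-max rescale a list of scores to the range [-1, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention for a constant vector (not specified in the text).
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def softmax(values):
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_validate(c_global, c_box, attention):
    """Return (updated anchor attention scores, updated global class scores)."""
    # what -> where: agreement between the global class prediction and each
    # anchor's class prediction is added to the rescaled attention scores.
    v_attn = [cosine(c_global, c_i) for c_i in c_box]
    a_new = [r + v for r, v in zip(rescale(attention), v_attn)]

    # where -> what: soft argmax, i.e. a softmax-weighted sum of anchor
    # class predictions, added to the rescaled global class prediction.
    weights = softmax(a_new)
    v_class = [sum(w * c_i[k] for w, c_i in zip(weights, c_box))
               for k in range(len(c_global))]
    c_new = [g + v for g, v in zip(rescale(c_global), rescale(v_class))]
    return a_new, c_new


if __name__ == "__main__":
    # Toy example: 3 anchors, 3 classes (all numbers are made up).
    c_global = [0.1, 0.7, 0.2]
    c_box = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
    attention = [0.5, 0.4, 0.1]
    a_new, c_new = self_validate(c_global, c_box, attention)
    print(a_new.index(max(a_new)), c_new.index(max(c_new)))
```

In this toy input the raw attention scores favor anchor 0, whose class prediction disagrees with the global class prediction; after validation the highest attention score moves to anchor 1, whose class agrees with it.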

This soft what→where validation is closely related to the soft attention mechanism widely used in many recent papers Sukhbaatar et al. (2015); Bahdanau et al. (2014); Cho et al. (2015); Xu et al. (2015); Luong et al. (2015); Vaswani et al. (2017). While soft attention learns the mapping itself inside the model, we explicitly incorporate the coherence of the where and what concepts into our model to self-validate the output during both training and testing. In contrast to soft attention, which describes relationships between, e.g., words or graph nodes, this self-validation mechanism naturally mirrors the visual consistency of our foveated vision system.

3.3 Implementation and training details

We implemented our model with Keras Chollet et al. (2017) and TensorFlow Abadi et al. (2016). A batch normalization layer Ioffe and Szegedy (2015) is inserted after each layer in both spatial and temporal backbones. Batch normalization is not used in the four prediction heads. We found that pretraining the spatial branch helps the model converge faster. No extra data is introduced as we still only use the labels of the objects of interest for pretraining. VGG16 Simonyan and Zisserman (2014) is initialized with weights pretrained on ImageNet Deng et al. (2009). We use Sun et al.’s method Sun et al. (2018); Niklaus (2018) to extract optical flow and follow Carreira and Zisserman (2017) to truncate the maps to $[-20, 20]$ and then rescale them to $[-1, 1]$. The RGB input to the Temporal Gear Branch is rescaled to $[-1, 1]$ Carreira and Zisserman (2017), while for the Spatial Gear Branch the RGB input is normalized to have 0 mean and the channels are permuted to BGR.
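
The input preprocessing can be sketched as follows. This is a rough illustration under stated assumptions: the truncation bound of 20 and the target range of [-1, 1] follow the convention of the public I3D reference implementation, the inputs are assumed to be 8-bit RGB values, and the per-channel mean passed to `preprocess_rgb_spatial` is a hypothetical placeholder. All three function names are invented for this example.

```python
def preprocess_flow(flow_values, bound=20.0):
    """Truncate flow values to [-bound, bound], then rescale to [-1, 1]."""
    return [max(-bound, min(bound, v)) / bound for v in flow_values]


def preprocess_rgb_temporal(pixels):
    """Rescale 8-bit RGB values in [0, 255] to [-1, 1] for the temporal branch."""
    return [2.0 * p / 255.0 - 1.0 for p in pixels]


def preprocess_rgb_spatial(pixel_rgb, channel_mean_rgb):
    """Subtract a per-channel mean and permute RGB to BGR for the spatial branch."""
    r, g, b = (p - m for p, m in zip(pixel_rgb, channel_mean_rgb))
    return (b, g, r)


if __name__ == "__main__":
    print(preprocess_flow([-35.0, 0.0, 10.0, 50.0]))
    print(preprocess_rgb_temporal([0, 128, 255]))
    # The channel mean below is a made-up value used only for the demo.
    print(preprocess_rgb_spatial((100.0, 150.0, 200.0), (10.0, 20.0, 30.0)))
```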

When training the whole model, the spatial branch is initialized with the pretrained weights from above. The I3D backbone is initialized with weights pretrained on Kinetics Kay et al. (2017) and ImageNet Deng et al. (2009), while other parts are randomly initialized. We use stochastic gradient descent with momentum, learning rate decay, and a weight regularizer. The loss function combines four parts with empirically set weights: global classification, attention, anchor box classification, and box regression. The global classification and attention terms apply cross entropy loss, computed on the updated predictions of object of interest class and anchor box attention. The anchor box classification term is the total cross entropy loss and the box regression term is the total box regression loss, both taken over only the matched anchors used for training the anchor box class predictor and the anchor box offset predictor. The box regression loss follows Ren et al. (2015); Liu et al. (2016) and we refer readers there for details. The Self Validation Module contains no trainable parameters, making it very flexible so that it can be added to training or testing at any time. It is even possible to stack multiple Self Validation Modules or use only half of one.

During testing, the anchor with the highest anchor box attention score is selected as the attended anchor. The corresponding anchor box offset prediction indicates where the object of interest is, while the argmax of the global object of interest class score gives its class.
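
A minimal sketch of this test-time selection, in plain Python, is shown below. The function name `predict`, the argument layout, and the toy class names are hypothetical; decoding the selected offsets into image coordinates depends on the anchor box parameterization and is deliberately left out.

```python
def predict(attention, box_offsets, c_global, class_names):
    """Pick the attended anchor and the object-of-interest class at test time."""
    # Where: the anchor with the highest (updated) attention score.
    m = max(range(len(attention)), key=lambda i: attention[i])
    # What: the argmax of the (updated) global class scores.
    k = max(range(len(c_global)), key=lambda j: c_global[j])
    return {"anchor_index": m, "offsets": box_offsets[m], "label": class_names[k]}


if __name__ == "__main__":
    # Toy values for three anchors and three classes (all made up).
    result = predict(
        attention=[1.28, 1.49, -0.25],
        box_offsets=[(0.0, 0.1, 0.0, 0.2), (0.05, -0.02, 0.1, 0.0), (0.0, 0.0, 0.0, 0.0)],
        c_global=[-0.38, 2.0, -1.67],
        class_names=["car", "rabbit", "cup"],
    )
    print(result)
```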

4 Experiments

We evaluate our model on identifying attended objects in two first-person datasets collected in very different contexts: child and adult toy play, and adults in kitchens.

ATT Zhang et al. (2018) (Adult-Toddler Toy play) consists of first-person videos from head-mounted cameras of parents and toddlers playing with 24 toys in a simulated home environment. The dataset consists of 20 synchronized video pairs (child head cameras and parent head cameras), although we only use the parent videos. The object being attended is determined using gaze tracking. We randomly split the samples in each object class into a training set and a testing set, with each sample consisting of 15 continuous frames. We do not restrict the object of interest to remain the same in each sample sequence and only use the label of the object of interest for training.

Epic-Kitchen Dataset Damen et al. (2018) contains 55 hours of first-person video from 32 participants in their own kitchens. The dataset includes annotations on the “active” objects related to the person’s current action. We use this as a proxy for the attended object by selecting only frames containing one active object and assuming that it is attended. Object classes with too few samples are also excluded. We randomly select 90% of samples for training and use the rest for testing.

For evaluation, we report accuracy: the number of correct predictions over the number of samples. A prediction is considered correct if it has both (a) the correct class prediction and (b) an IoU between the estimated and the ground truth boxes above a threshold. Similar to Lin et al. (2014), we report accuracies at IOU thresholds of 0.5 and 0.75, as well as a mean accuracy computed by averaging accuracies at 10 IOU thresholds evenly distributed from 0.5 to 0.95. Accuracy thus measures the ability to correctly predict both what and where is being attended.
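
The metric can be written down as a short sketch. It assumes that each prediction has already been reduced to a pair (class correct?, IoU with the ground truth box), and that the mean accuracy uses COCO-style thresholds from 0.5 to 0.95 in steps of 0.05; the names `accuracy_at`, `mean_accuracy`, and `COCO_STYLE_THRESHOLDS` are illustrative.

```python
# Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95.
COCO_STYLE_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def accuracy_at(predictions, threshold):
    """Fraction of samples with the correct class AND IoU above the threshold.

    predictions: list of (class_correct: bool, iou: float) pairs.
    """
    correct = sum(1 for class_ok, iou in predictions if class_ok and iou > threshold)
    return correct / len(predictions)


def mean_accuracy(predictions, thresholds=COCO_STYLE_THRESHOLDS):
    """Average of accuracy_at over a list of IoU thresholds."""
    return sum(accuracy_at(predictions, t) for t in thresholds) / len(thresholds)


if __name__ == "__main__":
    # Four made-up samples: the third has a good box but the wrong class,
    # the fourth has the right class but a poor box.
    preds = [(True, 0.92), (True, 0.62), (False, 0.95), (True, 0.30)]
    print(accuracy_at(preds, 0.5), accuracy_at(preds, 0.75), mean_accuracy(preds))
```

A sample with a well-localized box but the wrong class counts as an error at every threshold, which is why the metric penalizes mismatches between what and where.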

4.1 Baselines

We evaluate against several strong baselines.

Gaze + GT bounding box, inspired by Li et al. (2014), applies Zhang et al.’s gaze prediction method Zhang et al. (2018) (since it has state-of-the-art performance on ATT) and directly uses ground truth object bounding boxes. This is equivalent to having a perfect object detector, resulting in a very strong baseline. We use two different methods to match the predicted eye gaze to the object boxes: (1) Hit: only boxes in which the gaze falls are considered matched, and if the estimated gaze point is within multiple boxes, the accuracy score is averaged by the number of matched boxes; and (2) Closest: the box whose center is the closest to the predicted gaze is considered to be matched.

I3D-based SSD Carreira and Zisserman (2017); Liu et al. (2016) tries to overcome the discrepancy caused by solving the where and what problems separately by directly performing anchor-level attention estimation with an SSD that uses an I3D backbone. The anchor box setting is similar to SSD300 Liu et al. (2016). For each anchor we predict an attention score, a class score, and box offsets.

Cascade model contains a temporal branch with an I3D backbone and a spatial branch with a VGG16 backbone. From the temporal branch, the important anchor as well as its box offsets are predicted, and then features are pooled Ren et al. (2015); He et al. (2017) from the spatial branch for classification.

Object in hands + GT bounding box, inspired by Pirsiavash and Ramanan (2012); Ma et al. (2016); Furnari et al. (2017), tries to detect the object of interest by detecting the object in hand. We use several variants; the “either handed model” is the strongest, and uses both the ground truth object boxes and the ground truth label of the object in hands. When two hands hold different objects, the model always picks the one yielding higher accuracy, thus reflecting the best performance we can obtain with this baseline. Please refer to the supplementary materials for details of other variants.

Center GT box uses the ground truth object boxes and labels to select the object closest to the frame center, inspired by the fact that people tend to adjust their head pose so that their gaze is near the center of their view Li et al. (2013).
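
The two gaze-to-box matching rules of the first baseline can be sketched as below. The box format `(x_min, y_min, x_max, y_max)`, the function names, and the per-sample reading of "averaged by the number of matched boxes" (a hit on k boxes that include the ground truth contributes 1/k) are assumptions for illustration.

```python
def hit_matches(gaze, boxes):
    """Indices of all boxes that contain the gaze point."""
    x, y = gaze
    return [i for i, (x0, y0, x1, y1) in enumerate(boxes)
            if x0 <= x <= x1 and y0 <= y <= y1]


def hit_score(gaze, boxes, gt_index):
    """Per-sample score for the Hit rule (assumed reading: 1/k for k hit boxes)."""
    matched = hit_matches(gaze, boxes)
    if gt_index not in matched:
        return 0.0
    return 1.0 / len(matched)


def closest_match(gaze, boxes):
    """Index of the box whose center is nearest to the gaze point (Closest rule)."""
    x, y = gaze

    def squared_distance(box):
        cx = (box[0] + box[2]) / 2.0
        cy = (box[1] + box[3]) / 2.0
        return (cx - x) ** 2 + (cy - y) ** 2

    return min(range(len(boxes)), key=lambda i: squared_distance(boxes[i]))


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (8, 0, 20, 10), (30, 30, 40, 40)]
    # Gaze in the intersection of the first two boxes.
    print(hit_matches((9, 5), boxes), hit_score((9, 5), boxes, 0), closest_match((9, 5), boxes))
    # Gaze that falls on no box at all.
    print(hit_matches((25, 20), boxes), hit_score((25, 20), boxes, 2), closest_match((25, 20), boxes))
```

The second call shows why Closest can recover cases where Hit fails outright: a gaze point that lands between boxes still gets assigned to the nearest one.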

4.2 Results on ATT dataset

| Method | Acc (IOU 0.5) | Acc (IOU 0.75) | Mean Acc |
|---|---|---|---|
| Our Mr. Net | 74.27 | 46.78 | 44.78 |
| Gaze Zhang et al. (2018) + GT Box + Hit | 25.26 | 25.26 | 25.26 |
| Gaze Zhang et al. (2018) + GT Box + Closest | 35.86 | 35.86 | 35.86 |
| I3D Carreira and Zisserman (2017)-based SSD Liu et al. (2016) | 70.11 | 42.10 | 40.85 |
| Cascade Model | 66.97 | 45.10 | 41.93 |
| OIH Detectors + WH Classifier | 37.16 | 37.16 | 37.16 |
| Left Handed Model | 38.31 | 38.31 | 38.31 |
| Right Handed Model | 39.00 | 39.00 | 39.00 |
| OIH GT + WH Classifier | 40.83 | 40.83 | 40.83 |
| Either Handed Model | 42.94 | 42.94 | 42.94 |
| Center GT Box | 23.97 | 23.97 | 23.97 |

Table 1: Accuracy of our method compared to others, on the ATT dataset. OIH represents Object-in-Hand, while WH means Which-Hand.
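| Streams | Self validation in training | Self validation in testing | Acc (IOU 0.5) | Acc (IOU 0.75) | Mean Acc |
|---|---|---|---|---|---|
| Two | yes | yes | 74.27 | 46.78 | 44.78 |
| Two | yes | half | | | 43.88 |
| Two | yes | no | 68.19 | 42.83 | 41.18 |
| Two | no | yes | 67.18 | 40.06 | 39.48 |
| Two | no | half | | | 37.87 |
| Two | no | no | 62.33 | 38.31 | 37.18 |
| RGB | yes | yes | 74.59 | 43.15 | 42.48 |
| Flow | yes | yes | 64.30 | 38.63 | 37.60 |
| Flow | no | yes | | | 25.10 |
| Flow | no | no | | | 18.40 |

Table 2: Ablation results. Testing with half means that the model is tested with only what→where validation.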

Table 1 presents quantitative results of our Mindreader Net and baselines on the ATT dataset. By both enforcing and leveraging visual consistency, our method outperformed even the either-handed model in terms of mean accuracy, although that baseline is built upon several strong oracles: a perfect object detector, two perfect object-in-hand detectors, and a perfect which-hand classifier. Other methods without perfect object detectors suffer from a rapid drop in accuracy as the IOU threshold becomes higher. For example, when the IOU threshold reaches 0.75, the either-handed model (42.94) already has no obvious advantage compared with I3D-based SSD (42.10), and the Cascade model achieves a much higher score (45.10). When the threshold becomes 0.5, not only our Mindreader Net but also Cascade and I3D-based SSD outperform the either-handed model by a significant margin. Though the accuracy of the cascade model at an IOU threshold of 0.5 is lower than that of I3D-based SSD by about 3 points (66.97 versus 70.11), its accuracy at 0.75 and its mean accuracy are higher, suggesting that bad box predictions with low IOU confuse the class head of the cascade model, but that having a separate spatial branch to overcome feature entanglement improves the overall performance with higher-quality predictions.

We also observed that the Closest variant of the Gaze + GT Box model is about 10 points better than the Hit variant (35.86 versus 25.26). This suggests that gaze prediction often misses the ground truth box by a small amount or falls in the intersection of several bounding boxes, reflecting the discrepancy between the where and the what concepts in existing eye gaze estimation algorithms.

Sample results of our model compared with other baselines are shown in Figure 3. Regular gaze prediction models fail in (c) & (d), supporting our hypothesis about the drawback of estimating where and what independently: the model is not robust to small errors in gaze estimation (recall that the gaze-based baseline uses ground truth bounding boxes, so failures must be caused by gaze estimation). In particular, the estimated gaze falls on 3 objects in (c), slightly closer to the center of the rabbit; in (d), the eye gaze does not fall on any object. More unified models (I3D-based SSD, the cascade model, and our model) thus achieve better performance. In (a) & (b), our model outperforms I3D-based SSD and Cascade. Because a Self Validation Module is applied to inject consistency, our Mr. Net performs better when many objects including the object of interest are close to each other.

Figure 4 illustrates how various parts of our model work. Image (a) shows the intermediate anchor attention score from the temporal branch, visualized as the top 5 attended anchors with attention scores. These are anchor-level attention and no box offsets are predicted here. Image (b) shows visualizations of the predicted anchor offsets and box class score from the spatial branch (only of the top 5 attended anchors). We do not have negative samples or a background class for training the spatial branch and thus there are some false positives. Image (c) combines output from both branches; this is also the final prediction of the model trained with the Self Validation Module but tested without it in the ablation studies in Section 4.3. The predicted class is obtained from the global class prediction, and we combine the anchor attention scores and the anchor box offsets to get the location. A discrepancy arises in this example, as the class prediction is correct but the location is not. Image (d) shows the prediction of our full model. By applying double self validation, the full model correctly predicts location and class.

Some failure cases of our model are shown in Figure 5: (a) heavy occlusion, (b) ambiguity of which held object is attended, (c) the model favors the object that is reached for, and (d) an extremely difficult case where the parent’s reach is occluded by an object held by the child.

Figure 3: Sample results of our Mr. Net and baselines on ATT dataset. Detections are in blue, ground truth in red, and the predicted gaze of gaze-based methods in yellow.
Figure 4: Illustration of how parts of our model work.
Figure 5: Some failure cases of our model, with detections in blue and ground truth in red.
Figure 6: Sample results of Mr. Net on Epic-Kitchens.

4.3 Ablation studies

We conduct several ablation studies to evaluate the importance of the parts of our model.

Hard argmax vs. soft argmax during testing. The soft version of what→where is necessary for gradient backpropagation during training, but there is no such issue in testing. We therefore tested our full model with both hard argmax and soft argmax. When doing the same experiments with other model settings, we observed similar results.

Self Validation Module. To study the importance of the Self Validation Module, we conduct five experiments: (1) Train and test the model without the Self Validation Module; (2) Train the model without the Self Validation Module but test with only the what→where validation (the first step of Self Validation); (3) Train the model without Self Validation but test with it; (4) Train the model with Self Validation but test with only what→where validation; (5) Train the model with Self Validation but test without it. As shown in Table 2, the Self Validation Module yields consistent performance gain. If we train the model with Self Validation but remove it during testing, the remaining model still outperforms other models trained without the module. This implies that embedding the Self Validation Module during training helps learn a better model by bridging each component and providing guidance of how components are related to each other. Even when Self Validation is removed during testing, consistency is still maintained between the temporal and the spatial branches. Also, recall that when training the model with the Self Validation Module, the loss is computed based on the final output, and thus when we test the full model without Self Validation, the output is actually a latent representation in our full model. This suggests that our Self Validation Module encourages the model to learn a highly semantically-meaningful latent representation. Furthermore, the consistency injected by Self Validation helps prevent overfitting, while significant overfitting was observed without the Self Validation Module during training.

Validation method for what→where. We used element-wise summation for what→where validation. Another strategy is to treat the attention validation vector as an attention vector, in which case rescaling is unnecessary. We repeated experiments using this technique and observed a slight drop in accuracy, which may be because the double softmax inside the Self Validation Module increases optimization difficulty.

Single stream versus two streams. We conducted experiments to study the effect of each stream in our task. As Table 2 shows, a single optical flow stream performs much worse than single RGB or two-stream, indicating that object appearance is very important for problems related to object detection. However, it still achieved acceptable results since the network can refer to the spatial branch for appearance information through the Self Validation Module. To test this, we removed the Self Validation Module from the single flow stream model during training. When testing this model directly, we observed a very poor mean accuracy of 18.40; adding the Self Validation Module back during testing yields a large gain to 25.10.

Alternative matching strategy for box attention prediction. For the anchor box attention predictor, we perform experiments with different anchor matching strategies. When multi anchor matching is used, we do hard negative mining as suggested in Liu et al. (2016) with the negative:positive ratio set to 3. The model with the multi anchor matching strategy performs worse than the model with one-best anchor matching. We tried other negative:positive ratios (e.g. 5, 10, 20) and still found that the one-best anchor matching strategy works better. This may be because we have an acceptable number of anchor boxes; with more anchor boxes, multi matching may work better.

Object of interest class prediction. We explore where to place the global object of interest class predictor. When we connect it to the temporal branch after the fused block 5, we obtain better accuracy than when it is placed after the conv block 8 at the end of the temporal branch. This implies that for detecting the object of interest among others, a higher spatial resolution of the feature map is helpful.

| Model | Acc (IOU 0.5) | Acc (IOU 0.75) | Mean Acc |
|---|---|---|---|
| Mr. Net | 71.34 | 38.26 | 39.04 |
| Gaze Zhang et al. (2018) + GT Boxes Hit | 26.46 | 26.46 | 26.46 |
| Gaze Zhang et al. (2018) + GT Boxes Closest | 36.81 | 36.81 | 36.81 |
| I3D Carreira and Zisserman (2017)-based SSD Liu et al. (2016) | 67.43 | 37.90 | 37.22 |
| Cascade Model | 65.96 | 38.01 | 37.93 |

Table 3: Results of online detection.

| Method | Acc (IOU 0.5) | Acc (IOU 0.75) | Mean Acc |
|---|---|---|---|
| Our Mr. Net | 57.18 | 31.00 | 31.20 |
| I3D Carreira and Zisserman (2017)-based SSD Liu et al. (2016) | 47.58 | 24.38 | 25.42 |
| Cascade Model | 51.20 | 28.18 | 28.36 |

Table 4: Accuracies on the Epic-Kitchen dataset.

4.4 Online Detection

Our model can be easily modified to do online detection, in which only previous frames are available. We modified the model to detect the object of interest in the last frame of a given sequence. As shown in Table 3, all models except the Gaze + GT boxes model suffer a drop in scores, indicating that online detection is more difficult. The gaze prediction model that we use Zhang et al. (2018), however, is trained to predict eye gaze in each frame of the video sequence and thus works for both online and offline tasks, so its performance remains stable.

4.5 Results on the Epic-Kitchens Dataset

We show the generalizability of our model by performing experiments on Epic-Kitchens Damen et al. (2018). Results of applying our model, the I3D-based SSD model, and the Cascade model to this dataset are shown in Table 4. On this dataset, the of the Cascade model is higher than that of the I3D + SSD model. The reason may be that objects are sparser in this dataset, so poorly predicted boxes are less likely to lead to wrong classifications. Sample results are shown in Figure 6.

5 Conclusion

We considered the problem of detecting the attended object in cluttered first-person views. We proposed a novel unified model with a Self Validation Module that leverages the visual consistency of the human vision system. The module jointly optimizes the class and the attention estimates as self validation. Experiments on two public datasets show that our model outperforms other state-of-the-art methods by a large margin.

6 Acknowledgements

This work was supported in part by the National Science Foundation (CAREER IIS-1253549), the National Institutes of Health (R01 HD074601, R01 HD093792), NVidia, Google, and the IU Office of the Vice Provost for Research, the College of Arts and Sciences, and the School of Informatics, Computing, and Engineering through the Emerging Areas of Research Project “Learning: Brains, Machines, and Children.”


  • M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng (2016) TensorFlow: a system for large-scale machine learning. In USENIX Conference on Operating Systems Design and Implementation, pp. 265–283. Cited by: §3.3.
  • S. O. Ba and J. M. Odobez (2011) Multiperson visual focus of attention from head pose and meeting contextual cues. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 33 (1), pp. 101–116. Cited by: §2.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §3.2.
  • F. Baradel, N. Neverova, C. Wolf, J. Mille, and G. Mori (2018) Object level visual reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 105–121. Cited by: §1, §2.
  • G. Bertasius, H. S. Park, S. X. Yu, and J. Shi (2016) First person action-object detection with egonet. arXiv preprint arXiv:1603.04908. Cited by: §2.
  • G. Bertasius, H. Soo Park, S. X. Yu, and J. Shi (2017) Unsupervised learning of important objects from first-person videos. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1956–1964. Cited by: §2.
  • A. Borji, D. N. Sihite, and L. Itti (2012) Probabilistic learning of task-specific visual attention. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • M. C. Bowman, R. S. Johannson, and J. R. Flanagan (2009) Eye–hand coordination in a sequential target contact task. Experimental brain research 195 (2), pp. 273–283. Cited by: §1.
  • J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308. Cited by: §3.1, §3.3, §4.1, Table 2, Table 4, Table 4.
  • L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2018) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §3.1.
  • K. Cho, A. Courville, and Y. Bengio (2015) Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia 17 (11), pp. 1875–1886. Cited by: §3.2.
  • F. Chollet, J. Allaire, et al. (2017) R interface to keras. GitHub. Note: https://github.com/rstudio/keras Cited by: §3.3.
  • D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray (2018) Scaling egocentric vision: the epic-kitchens dataset. In European Conference on Computer Vision (ECCV), Cited by: §4.5, §4.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §3.3, §3.3.
  • A. Furnari, S. Battiato, K. Grauman, and G. M. Farinella (2017) Next-active-object prediction from egocentric videos. J. Vis. Comun. Image Represent. 49 (C), pp. 401–411. External Links: ISSN 1047-3203 Cited by: §4.1.
  • J. Harel, C. Koch, and P. Perona (2007) Graph-based visual saliency. In Advances in Neural Information Processing Systems (NeurIPS), pp. 545–552. Cited by: §2.
  • M. Hayhoe and D. Ballard (2005) Eye movements in natural behavior. Trends in cognitive sciences 9 (4), pp. 188–194. Cited by: §1.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §4.1.
  • Q. Hou, M. Cheng, X. Hu, A. Borji, Z. Tu, and P. H. Torr (2017) Deeply supervised salient object detection with short connections. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3203–3212. Cited by: §2.
  • X. Hou, J. Harel, and C. Koch (2012) Image signature: highlighting sparse salient regions. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 34 (1), pp. 194–201. Cited by: §2.
  • X. Huang, C. Shen, X. Boix, and Q. Zhao (2015) Salicon: reducing the semantic gap in saliency prediction by adapting deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 262–270. Cited by: §2.
  • Y. Huang, M. Cai, Z. Li, and Y. Sato (2018) Predicting gaze in egocentric video by learning task-dependent attention transition. In European Conference on Computer Vision (ECCV), pp. 754–769. Cited by: §1, §2.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), Cited by: §3.3.
  • L. Itti, C. Koch, and E. Niebur (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 20 (11), pp. 1254–1259 (English (US)). Cited by: §2.
  • T. Judd, K. Ehinger, F. Durand, and A. Torralba (2009) Learning to predict where humans look. In IEEE International Conference on Computer Vision (ICCV), pp. 2106–2113. Cited by: §2.
  • N. Karessli, Z. Akata, B. Schiele, and A. Bulling (2017) Gaze embeddings for zero-shot image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4525–4534. Cited by: §1.
  • S. Karthikeyan, V. Jagadeesh, R. Shenoy, M. Eckstein, and B. Manjunath (2013) From where and how to what we see. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 625–632. Cited by: §1.
  • W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §3.3.
  • S. Lazzari, D. Mottet, and J. Vercher (2009) Eye-hand coordination in rhythmical pointing. Journal of motor behavior 41 (4), pp. 294–304. Cited by: §1.
  • Y. J. Lee, J. Ghosh, and K. Grauman (2012) Discovering important people and objects for egocentric video summarization. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1346–1353. Cited by: §1, §2.
  • G. Li, Y. Xie, L. Lin, and Y. Yu (2017) Instance-level salient object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2386–2395. Cited by: §2.
  • G. Li, Y. Xie, T. Wei, K. Wang, and L. Lin (2018a) Flow guided recurrent neural encoder for video salient object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3243–3252. Cited by: §2.
  • X. Li, F. Yang, H. Cheng, W. Liu, and D. Shen (2018b) Contour knowledge transfer for salient object detection. In European Conference on Computer Vision (ECCV), pp. 355–370. Cited by: §2.
  • Y. Li, A. Fathi, and J. M. Rehg (2013) Learning to predict gaze in egocentric video. In IEEE International Conference on Computer Vision (ICCV), Cited by: §2, §4.1.
  • Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille (2014) The secrets of salient object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 280–287. Cited by: §2, §4.1.
  • Y. Li, M. Liu, and J. M. Rehg (2018c) In the eye of beholder: joint learning of gaze and actions in first person video. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 619–635. Cited by: §1.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European Conference on Computer Vision (ECCV), pp. 740–755. Cited by: §4.
  • D. Liu, G. Hua, and T. Chen (2010) A hierarchical visual model for video object summarization. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 32 (12), pp. 2178–2190. Cited by: §1.
  • N. Liu and J. Han (2016) Dhsnet: deep hierarchical saliency network for salient object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 678–686. Cited by: §2.
  • W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European Conference on Computer Vision (ECCV), pp. 21–37. Cited by: §1, §3.1, §3.3, §4.1, §4.3, Table 2, Table 4, Table 4.
  • M. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025. Cited by: §3.2.
  • M. Ma, H. Fan, and K. M. Kitani (2016) Going deeper into first-person activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1894–1903. Cited by: §1, §2, §3, §4.1.
  • [43] M. C. Mozer and M. Sitton Computational modeling of spatial attention. Attention 9, pp. 341–393. Cited by: §1.
  • S. Niklaus (2018) A reimplementation of PWC-Net using PyTorch. Note: https://github.com/sniklaus/pytorch-pwc Cited by: §3.3.
  • J. Pan, E. Sayrol, X. Giró i Nieto, K. McGuinness, and N. E. O’Connor (2016) Shallow and deep convolutional networks for saliency prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 598–606. Cited by: §2.
  • D. P. Papadopoulos, A. D. Clarke, F. Keller, and V. Ferrari (2014) Training object class detectors from eye tracking data. In European Conference on Computer Vision (ECCV), pp. 361–376. Cited by: §1.
  • S. Perone, K. L. Madole, S. Ross-Sheehy, M. Carey, and L. M. Oakes (2008) The relation between infants’ activity with objects and attention to object appearance.. Developmental psychology 44 (5), pp. 1242. Cited by: §1.
  • H. Pirsiavash and D. Ramanan (2012) Detecting activities of daily living in first-person camera views.. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2847–2854. Cited by: §1, §2, §4.1.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1, §3.3, §4.1.
  • U. Rutishauser, D. Walther, C. Koch, and P. Perona (2004) Is bottom-up attention useful for object recognition?. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2, pp. II–II. Cited by: §1.
  • H. Sattar, S. Muller, M. Fritz, and A. Bulling (2015) Prediction of search targets from fixations in open-world settings. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 981–990. Cited by: §1.
  • I. Shcherbatyi, A. Bulling, and M. Fritz (2015) Gazedpm: early integration of gaze information in deformable part models. arXiv preprint arXiv:1505.05753. Cited by: §1.
  • Y. Shen, B. Ni, Z. Li, and N. Zhuang (2018) Egocentric activity prediction via event modulated attention. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 197–212. Cited by: §1, §2.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.1, §3.3.
  • H. Song, W. Wang, S. Zhao, J. Shen, and K. Lam (2018) Pyramid dilated deeper convlstm for video salient object detection. In European Conference on Computer Vision (ECCV), pp. 715–731. Cited by: §2.
  • S. Sukhbaatar, J. Weston, R. Fergus, et al. (2015) End-to-end memory networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 2440–2448. Cited by: §3.2.
  • D. Sun, X. Yang, M. Liu, and J. Kautz (2018) PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.3.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §3.1.
  • A. Torralba, M. S. Castelhano, A. Oliva, and J. M. Henderson (2006) Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search. Psychological Review 113, pp. 2006. Cited by: §2.
  • A. M. Treisman and G. Gelade (1980) A feature-integration theory of attention. Cognitive Psychology 12 (1), pp. 97 – 136. Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5998–6008. Cited by: §3.2.
  • E. D. Vidoni, J. S. McCarley, J. D. Edwards, and L. A. Boyd (2009) Manual and oculomotor performance develop contemporaneously but independently during continuous tracking. Experimental brain research 195 (4), pp. 611–620. Cited by: §1.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: §3.2.
  • K. Yamada, Y. Sugano, T. Okabe, Y. Sato, A. Sugimoto, and K. Hiraki (2012) Attention prediction in egocentric video using motion and visual saliency. In Advances in Image and Video Technology, Y. Ho (Ed.), pp. 277–288. Cited by: §2.
  • F. Yu and V. Koltun (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122. Cited by: §3.1.
  • M. Zhang, K. T. Ma, J. H. Lim, Q. Zhao, and J. Feng (2017a) Deep future gaze: gaze anticipation on egocentric videos using adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
  • P. Zhang, D. Wang, H. Lu, H. Wang, and X. Ruan (2017b) Amulet: aggregating multi-level convolutional features for salient object detection. In IEEE International Conference on Computer Vision (ICCV), pp. 202–211. Cited by: §2.
  • Z. Zhang, S. Bambach, C. Yu, and D. J. Crandall (2018) From coarse attention to fine-grained gaze: a two-stage 3d fully convolutional network for predicting eye gaze in first person video. In British Machine Vision Conference (BMVC), Cited by: §1, §2, §4.1, §4.4, Table 2, Table 4, §4.
  • R. Zhao, W. Ouyang, H. Li, and X. Wang (2015) Saliency detection by multi-context deep learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1265–1274. Cited by: §2.

7 Supplementary Material

7.1 The architecture of the Cogged Spatial-Temporal Module

Figure 7: The architecture of the Cogged Spatial-Temporal Module. The number below each component indicates its output dimension. is the number of classes. All fusion is performed by element-wise summation. When the module is trained without being followed by the Self Validation Module, before computing , and , Softmax is applied (the attention prediction is first flattened into an 8732-d vector).
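The flatten-then-Softmax step in the caption can be sketched as follows. The figure 8732 coincides with the default anchor count of SSD300 (38·38·4 + 19·19·6 + 10·10·6 + 5·5·6 + 3·3·4 + 1·1·4); whether this model uses exactly that per-layer layout is an assumption, and the names `SSD300_LAYOUT`, `num_anchors`, and `flatten_and_softmax` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# (feature-map side length, anchors per cell) for the SSD300 defaults.
# Assumption: the attention predictor shares this anchor layout.
SSD300_LAYOUT: List[Tuple[int, int]] = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]


def num_anchors(layout: Sequence[Tuple[int, int]]) -> int:
    """Total number of anchor boxes across all feature maps."""
    return sum(side * side * per_cell for side, per_cell in layout)


def flatten_and_softmax(score_maps: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate per-map anchor scores and normalize them with Softmax."""
    flat = [score for score_map in score_maps for score in score_map]
    peak = max(flat)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in flat]
    total = sum(exps)
    return [value / total for value in exps]
```

After this step the attention scores over all anchors sum to one, so they can be read as a distribution over candidate boxes.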

7.2 Visualization of some baselines

Figure 8: Visualizations of (a) the Gaze + Box model, (b) the I3D-based SSD, (c) the Cascade Model, and (d) our Mindreader. Note that in our experiments with the Gaze + Box model, we directly use the ground truth bounding boxes of each object instead of the object detector outputs depicted in (a). The box regression head is omitted for simplicity.
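For reference, the Gaze + Box baseline can be sketched as two simple geometric rules over ground truth boxes. The exact rules behind the "Hit" and "Closest" variants are not spelled out in this section, so the definitions below are assumptions: "hit" lists every box containing the gaze point, and "closest" picks the box with the smallest point-to-rectangle distance. All function names are hypothetical.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def boxes_hit(gaze: Point, boxes: List[Box]) -> List[int]:
    """Indices of all boxes that contain the gaze point (may be empty or several)."""
    gx, gy = gaze
    return [
        i for i, (x0, y0, x1, y1) in enumerate(boxes)
        if x0 <= gx <= x1 and y0 <= gy <= y1
    ]


def distance_to_box(gaze: Point, box: Box) -> float:
    """Euclidean distance from the gaze point to the box (0 if inside)."""
    gx, gy = gaze
    x0, y0, x1, y1 = box
    dx = max(x0 - gx, 0.0, gx - x1)
    dy = max(y0 - gy, 0.0, gy - y1)
    return math.hypot(dx, dy)


def closest_box(gaze: Point, boxes: List[Box]) -> Optional[int]:
    """Index of the box nearest to the gaze point, or None if there are no boxes."""
    if not boxes:
        return None
    return min(range(len(boxes)), key=lambda i: distance_to_box(gaze, boxes[i]))
```

An empty or multi-element result from `boxes_hit` corresponds to the ambiguous cases in which a slightly inaccurate gaze estimate falls between boxes or inside overlapping ones.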

7.3 Hand based model settings

We train two object-in-hand detectors (one for the left hand and one for the right hand), both using the ResNet-50 backbone, and one which-hand classifier with the I3D backbone that decides which hand holds the object of interest when the two hands hold different objects. During testing, if only one object-in-hand detector predicts an object in hand, or if both hands hold the same object, we accept that prediction as the object of interest and combine it with the ground truth bounding box to form the final output. Otherwise, we apply the which-hand classifier to decide which object to take. We obtain testing accuracy of , and for the object-in-left-hand detector, the object-in-right-hand detector and the object-in-which-hand classifier respectively.
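The test-time decision rule of this hand-based baseline can be sketched as below. The function and argument names are hypothetical, detector outputs are simplified to class labels (`None` meaning no object in that hand), and returning `None` when neither hand holds an object is an assumption, since the text does not describe that case.

```python
from typing import Callable, Optional


def object_of_interest(
    left_obj: Optional[str],
    right_obj: Optional[str],
    which_hand: Callable[[], str],
) -> Optional[str]:
    """Combine the two object-in-hand predictions into one attended object.

    `which_hand` stands in for the which-hand classifier and returns
    "left" or "right"; it is only consulted when the hands disagree.
    """
    if left_obj is None and right_obj is None:
        return None  # assumption: no prediction when both hands are empty
    if left_obj is None:
        return right_obj
    if right_obj is None:
        return left_obj
    if left_obj == right_obj:
        return left_obj
    # The hands hold different objects: defer to the which-hand classifier.
    return left_obj if which_hand() == "left" else right_obj
```

The returned label would then be paired with the ground truth bounding box of that object to form the final output.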

To further strengthen the baseline, we directly use the ground truth of objects in hands and add 4 more settings: (1) Right handed model, which uses the ground truth object-in-hand labels and always favours the right hand when the two hands hold different objects; (2) Left handed model, which is the same as (1) but always favours the left hand; (3) Model with object-in-hand ground truth and which-hand classifier, which applies the which-hand classifier to decide which object to take when the two hands hold different objects; (4) Either handed model, which uses the ground truth object-in-hand labels and, when the two hands hold different objects, always takes the one resulting in higher as the prediction. Note that (4) represents the best performance that hand-based methods can achieve in theory, as it uses all of the ground truth.
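The fixed-preference and oracle settings differ only in how a conflict between the two hands is resolved, which the following hedged sketch makes explicit. The names `resolve_conflict`, `policy`, and `score` are hypothetical; `score` is a placeholder for whatever evaluation measure the either handed model maximizes.

```python
from typing import Callable, Optional


def resolve_conflict(
    left_obj: str,
    right_obj: str,
    policy: str,
    score: Optional[Callable[[str], float]] = None,
) -> str:
    """Pick one object when the two hands hold different objects.

    policy: "right" (setting 1), "left" (setting 2), or "either" (setting 4).
    `score` is only needed for "either" and rates a candidate against the
    ground truth (placeholder for the evaluation measure).
    """
    if policy == "right":
        return right_obj
    if policy == "left":
        return left_obj
    if policy == "either":
        if score is None:
            raise ValueError("the 'either' policy needs a score function")
        # Oracle choice: keep whichever candidate scores higher.
        return max((left_obj, right_obj), key=score)
    raise ValueError(f"unknown policy: {policy}")
```

Because the "either" policy consults the ground truth through `score`, it acts as an upper bound for the other settings rather than a deployable method.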

Figure 9: More qualitative results of our model on the ATT dataset.