Do Pedestrians Pay Attention? Eye Contact Detection in the Wild

12/08/2021
by   Younes Belkada, et al.

In urban or crowded environments, humans rely on eye contact for fast and efficient communication with nearby people. Autonomous agents also need to detect eye contact to interact with pedestrians and safely navigate around them. In this paper, we focus on eye contact detection in the wild, i.e., real-world scenarios for autonomous vehicles with no control over the environment or the distance of pedestrians. We introduce a model that leverages semantic keypoints to detect eye contact and show that this high-level representation (i) achieves state-of-the-art results on the publicly-available dataset JAAD, and (ii) conveys better generalization properties than leveraging raw images in an end-to-end network. To study domain adaptation, we create LOOK: a large-scale dataset for eye contact detection in the wild, which focuses on diverse and unconstrained scenarios for real-world generalization. The source code and the LOOK dataset are publicly shared towards an open science mission.


I Introduction

When walking or driving, people use eye contact to communicate intentions, pay attention to their environments, or acknowledge the presence of others. Autonomous agents also need to understand this implicit channel of communication to move naturally around humans and avoid collisions [28, 31]. Eye contact detection is especially useful for autonomous vehicles, as they need to understand whether a pedestrian intends to cross the street in front of the vehicle [29, 36]. Similarly, smaller robots moving in crowds should be capable of detecting whether pedestrians have noticed them and are more likely to actively avoid them [11, 12]. Finally, even in smart cities, eye contact detection can be useful to better understand pedestrians’ behaviors, e.g., identify where their attention goes or what public signs they are looking at.

Although humans make eye contact with each other at all times, detecting this action in the wild, i.e., with no constraint on the environment such as exemplified in Figure 1, presents a few challenges. First, the action can be quick and subtle, happening with small head movements lasting as short as a few milliseconds. Because of this small window, both spatially and temporally, the detection is hard, and can easily be affected by environmental conditions, such as lighting or distances of pedestrians. Furthermore, eye contact has received little attention and few datasets have been annotated with it [29, 27], when compared to more popular vision tasks such as object detection [4] or 2D pose estimation [18]. All these reasons make it more difficult for autonomous systems to detect eye contact effectively and to generalize to new environments.

Fig. 1: Typical scene for eye contact detection in the wild, where pedestrians might be far from the camera and heavily occluded. Our method estimates, from predicted body poses, whether people are paying attention (shown in green) to the ego camera through eye contact, or are distracted (shown in red). This information can then help to better forecast their behaviors and to reduce the risk of collision with a self-driving agent. (Image under license CC-0: https://jooinn.com/images1280_/people-walking-on-pedestrian-lane-during-daytime.jpg.)

In order to mitigate these issues, we propose to detect eye contact from high-level semantic keypoints, as displayed in Figure 1. Although one could expect images to be a key input representation for eye contact detection, we show that we can use keypoints as input to escape the image domain, and process them with a simple, yet effective neural architecture. For this, we first rely on a pose estimation step, which extracts semantic keypoints for all pedestrians in an image, using the off-the-shelf pose detector OpenPifPaf [13]. Using pose features as input rather than images presents several advantages. As pose needs less resolution than gaze, while also being annotated on more diverse datasets, it should be less affected by noise from environmental conditions, and predictions should generalize better to different scenarios and environments. Poses are also much lower-dimensional than images, and do not require as much network capacity to be processed properly. This allows the use of lighter networks, which should help prevent overfitting on the few annotated scenarios. Finally, by leveraging these high-level features, we remove background information and reduce the effects of changes in image statistics, allowing our model to focus solely on eye contact detection.
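For illustration, the first stage could be sketched as follows with the OpenPifPaf Predictor interface; the checkpoint name and image path are placeholders, not the exact configuration used in our experiments.

```python
# Minimal sketch of the pose-extraction stage (checkpoint and image path are
# placeholders; the exact pipeline used for the experiments may differ).
import numpy as np
import openpifpaf
from PIL import Image

predictor = openpifpaf.Predictor(checkpoint='shufflenetv2k30')
image = Image.open('street_scene.jpg').convert('RGB')
predictions, _, _ = predictor.pil_image(image)

# Each prediction carries a (17, 3) array of COCO keypoints: (x, y, confidence).
poses = [np.asarray(pred.data) for pred in predictions]
print(f'{len(poses)} pedestrians detected')
```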

Since there are not many datasets covering a large variety of scenarios for eye contact, and these usually include a limited number of pedestrians [29, 27], we argue that if a model is to be trained on them and deployed in the real world, it then must be particularly able to generalize well to new, uncontrolled conditions. We suggest evaluating this through cross-dataset generalization, and we annotate three common autonomous driving datasets with this new task, namely KITTI [6], nuScenes [3], and JRDB [20], to diversify the scenarios involving eye contact. When evaluating our models, we show that using semantic keypoints leads to models generalizing better to various datasets and scenarios. We publicly release the annotations as a new dataset, which we refer to as LOOK (dataset: https://looking-vita-epfl.github.io), as well as the source code (https://github.com/vita-epfl/looking), towards an open science mission.

To summarize, our contributions are as follows:

  • We propose a deep learning model leveraging semantic keypoints, specially adapted to the challenges of eye contact detection;

  • We publicly release LOOK, a diverse, large-scale dataset for eye contact detection in the wild, with numerous unique pedestrians and a focus on generalization across domains and scenarios, by annotating three common autonomous driving datasets;

  • We suggest an evaluation protocol for eye contact with real-world generalization in mind, and show that our approach yields state-of-the-art results and strong generalization compared to image-based methods.

II Related Work

II-A General eye contact

Gaze estimation has received a lot of attention from the Computer Vision community, as a simpler alternative to eye tracking. As for most other tasks, all leading approaches now rely on Deep Learning to get state-of-the-art results on the various benchmarks [39]. In this paper, we focus on eye contact detection, which can be considered as a special case of gaze estimation. Multiple works tackle this problem, e.g., Smith et al. [34] focus on gaze locking from the visual appearance of the eyes by masking out their surroundings, while Park et al. [24] transform single eye images into simplified pictorial representations to regress the gaze angle. Others also focus on real-time inference. Fischer et al. [5] use a cascade of networks to localize heads and face landmarks, align them to a predefined normalized face image to extract eye patches, and then compute gaze with a deep network. Rowntree et al. [32] also use two networks for head detection and gaze estimation in order to speed up the overall pipeline.

The major issue with these works is that the benchmarks and the methods are not applied to in-the-wild applications for autonomous vehicles, where the resolution of the pedestrian is low. They usually focus on situations where people are rather close to the camera (e.g., inside a vehicle, in front of a computer), where the heads occupy larger regions in the images, and with simple or plain backgrounds, sometimes in controlled setups. On the other hand, we focus on eye contact detection in the wild, where there is no prior constraint on the type of environment pedestrians are in.

II-B Eye contact between pedestrians and vehicles

From a driver’s perspective, detecting eye contact is an important cue that indicates pedestrians’ awareness of the traffic and future crossing intentions. However, few datasets have annotated this action. JAAD [29] and PIE [27] are two such datasets, both focusing on pedestrians likely to cross the road in front of vehicles. They therefore allow learning and evaluating eye contact directly from images. Rasouli et al. [29] use an AlexNet [15] to classify cropped images of pedestrians’ heads but require bounding boxes to be given. Varytimidis et al. [36] have a similar approach where they use an SVM to classify features from a convolutional network applied to head crops, and then process their predictions with contextual information. Mordan et al. [22] jointly detect pedestrians and eye contact, along with other attributes, in a single network forward pass using multi-task fields.

In the context of pedestrian crossings, multiple works (in addition to the previous ones) use eye contact as an intermediate feature to better predict pedestrians’ future behaviors. Kooij et al. [11] estimate head orientation as a cue for pedestrians’ situational awareness, and use it with other indicators to predict their paths around the road with a Dynamic Bayesian Network. Other approaches use a similar strategy for pedestrian awareness, e.g., Hariyono et al. [7] for estimating the risk of collision, and Kwak et al. [17] for predicting pedestrian intention at nighttime.

Eye contact detection is directly related to whether pedestrians pay attention to the incoming traffic. In practice, detecting that they do not pay attention is just as important. One of the main causes of distraction is the use of a phone that draws their attention away from the road. Identifying phone-related activities is therefore an insightful cue to detect. Rangesh et al. [26] show the practical importance of having gaze annotations, both for eye contact between drivers and pedestrians and for phone-related distractions. Going further to recognize actions involving a phone, Saenz et al. [33] use a two-branch convolutional network to predict distracted behaviors due to phone usage from stereo image pairs.

While detecting phone-related activities or pedestrian intentions is a crucial task, eye contact detection remains an essential channel of communication. Pedestrians may have the intention to cross, but have they seen the upcoming car they should yield to? Conversely, people may be holding a phone but still pay attention to the upcoming traffic.

III LOOK Dataset

We argue that eye contact detection is a crucial yet under-explored task to enable autonomous agents to safely navigate around pedestrians. To promote research in this area, we show that the current datasets are not sufficiently diverse for data-driven methods, and we create a new large-scale dataset for eye contact detection in the wild.

Dataset Frames Instances [% looking] Pedestrians
JAAD [30] 82K 133K [18%] 686
PIE [27] 909K 739K [9%] 1,842
Our LOOK-KITTI [6] 1,391 4,630 [17%] 425
Our LOOK-JRDB [20] 9,441 39K [18%] 399
Our LOOK-nuScenes [3] 2,216 13K [9%] 7,100
Our LOOK 13K 57K [16%] 7,944
TABLE I: Dataset statistics. Frames is the total number of frames in the datasets. Pedestrians indicates the number of unique pedestrians, while Instances counts the number of occurrences of pedestrians in all frames. In brackets, we mention the percentage of instances that are looking at the camera. JAAD [30] and PIE [27] datasets include a very large number of instances but in comparison a very small number of different pedestrians. In contrast, our LOOK dataset includes in total 7,944 unique pedestrians from three continents, enabling exhaustive studies on cross-dataset generalization.
Fig. 2: Modular architecture: the input of our keypoint-based model is the set of 2D joints extracted from a raw image, and the output is the binary flag indicating whether a person is looking at the camera. A Fully connected block outputs 256 features and includes a fully connected layer (FC), a Batch Normalization layer (BN) [9], a ReLU activation function, and dropout [35]. Optionally, the features obtained from the semantic keypoints are concatenated with the features obtained from the head crops. We experiment with two types of fusion in the early (O1) or late (O2) layers, and with different convolutional architectures, such as ResNet-18 [8] or ResNeXt-50 [37], as backbones for the crop-based module.

III-A Existing datasets

To the best of our knowledge, only two datasets contain annotations for the eye contact detection task: the JAAD dataset [30] and the PIE dataset [27]. JAAD consists of 390K instances of pedestrians labeled with bounding boxes and behavior annotations, of which 17K instances have been labeled as looking at the driver (i.e., at the camera in the car). The dataset is large in size but limited in diversity. It is made of 346 video clips of 5-10 seconds recorded with an on-board camera at 30fps in North America and Europe. Thus, many frames show the same people in similar scenes, and the number of unique pedestrians is only 686.

PIE [27] is also a recent dataset for pedestrian intention estimation. It shares many characteristics with its predecessor JAAD [30]. It is recorded using an on-board camera at 30fps and consists of 6 hours of continuous footage in downtown Toronto, Canada. Out of 700K annotated pedestrian instances, there are 1,842 unique pedestrians, of which fewer than 180 are looking at the camera.

III-B Benchmark selection

We have built a new large-scale dataset for eye contact detection in the wild by selecting publicly available images from three existing datasets: KITTI [6], nuScenes [3], and JRDB [20]. The first two are autonomous driving datasets made of images taken from a driver's perspective. The latter consists of videos taken from a small robot moving in crowded spaces on the Stanford University campus. In total, we have labeled 13,048 images from four different cities (Boston, Singapore, Tübingen, Palo Alto) across three continents. We aim for diversity, selecting pedestrian areas [6], crowded images from six cameras around the car [3], and indoor environments from a robot perspective [20]. In total, we have labeled around 8,000 unique pedestrians, making it the most diverse dataset for eye contact detection in the wild. Examples from the LOOK dataset are shown in Figures 4(a), 4(b) and 4(d).

We provide, together with the dataset annotation, the training and testing splits. We make sure that splits do not contain overlapping scenes and that the test set is sufficiently diverse, including 22% of the total number of unique pedestrians over 15% of the frames.

III-C Annotation pipeline

Our LOOK dataset has been annotated using the Amazon Mechanical Turk (AMT) platform. Each image has been annotated by four workers, who could select whether a person was looking at the camera, looking somewhere else, or neither of the two in case of ambiguity. We then include in the dataset only the labels with a consensus of at least three out of four annotators. This threshold has been selected by reviewing edge cases where not all the workers agreed on an instance.

To promote the creation of an ever-growing open-source dataset, we have also developed and released a labeling tool to ease the annotation process. The tool leverages the off-the-shelf pose detector OpenPifPaf [13] to locate the 2D bounding boxes of pedestrians. It then runs a model pre-trained on the JAAD [30] and PIE [27] datasets (more details in Section V) to provide a first guess. This pipeline allows annotators to simply check the predictions and correct the wrong ones.

To count the number of pedestrians, we use the tracking identification number for the JRDB [20] dataset, while for the KITTI dataset [6] we count them manually. In the case of the nuScenes dataset [3], we leverage the metadata associated with each frame. We approximate the number of unique pedestrians by only counting once the instances that appear multiple times in the same camera within a 5-second time window. We ran a sensitivity analysis on the time window and opted for 5 seconds, as the images are recorded from a moving vehicle in the majority of scenes.

IV Eye Contact Detection

The goal of our method is to detect from images whether humans are looking at the camera or somewhere else. We tackle autonomous driving scenarios, i.e., outdoor scenes where people may be several meters away from the camera. Our approach consists of two steps. First, we use a 2D pose detector to obtain a low-dimensional representation from the image domain, which we call semantic keypoints. The keypoints are a convenient representation that provides invariance to many factors, e.g., background artifacts, clothes, weather conditions. Second, we feed the extracted keypoints to a simple feed-forward neural network that detects the presence of eye contact. In addition, we also explore multi-modal representations by combining the keypoint representations with the features obtained from image crops, and we explore different fusion techniques. A diagram of our modular architecture can be found in Figure 2.

Eye Contact Classification (AP) [Pedestrian Detection Recall], evaluated on JAAD [30], LOOK-KITTI [6], LOOK-JRDB [20], LOOK-nuScenes [3], and LOOK (columns in this order); each row lists Method (Input):

Trained on JAAD [30]:
  Rasouli [30] (Crops): 75.4 [80.1]  65.9 [99.8]  87.2 [98.2]  78.7 [89.8]  77.3 [95.9]
  MTL-Fields [22] (Images): 82.6 [92.4]  89.7 [93.1]  82.1 [81.9]  92.0 [71.8]  87.9 [82.3]
  Our method (Keypoints): 85.9 [80.1]  91.6 [99.8]  94.8 [98.2]  91.0 [89.8]  92.5 [95.9]
Trained on LOOK:
  Rasouli [30] (Crops): 71.0 [80.1]  76.8 [99.8]  89.5 [98.2]  82.9 [89.8]  83.1 [95.9]
  MTL-Fields [22] (Images): 80.7 [79.0]  95.1 [96.5]  95.2 [93.0]  93.4 [68.4]  94.6 [86.0]
  Our method (Keypoints): 86.0 [80.1]  96.4 [99.8]  97.1 [98.2]  95.1 [89.8]  96.2 [95.9]
TABLE II: Comparing our proposed method and baseline results on JAAD [30] and on our LOOK dataset. We evaluate eye contact classification using the average precision (AP) metric. For a fair comparison, we also report the recall of the detected pedestrians for each method. All approaches have been trained for classification either on JAAD solely or on our LOOK dataset, and we evaluate them on both JAAD and LOOK. Our method is only trained on keypoints and reaches state-of-the-art results on the eye contact detection task on both the JAAD and LOOK datasets when compared with image- and crop-based methods. It also shows the best generalization properties when evaluated on a different dataset. The keypoints are obtained by running an off-the-shelf pose estimator [13] without re-training or adapting it to the different datasets.

IV-A Keypoint-based method

We escape the image domain using 2D keypoints: a low-dimensional representation obtained through the off-the-shelf pose detector OpenPifPaf [13, 14], which was designed for crowded scenes and low-resolution images. The output of our network is the binary flag indicating whether a person is looking at the camera. To create the training and testing dataset, we match the bounding boxes enclosing the keypoints with the ground-truth bounding boxes using their intersection over union (IoU). We select the instances with the highest matching IoU above 0.3 for each ground truth.
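A minimal sketch of this matching step is given below; it assumes boxes in (x1, y1, x2, y2) pixel format and (17, 3) keypoint arrays, and the helper names are hypothetical.

```python
import numpy as np

def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2); returns intersection over union.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def enclosing_box(keypoints):
    # keypoints: (17, 3) array of (x, y, confidence); keep only detected joints.
    visible = keypoints[keypoints[:, 2] > 0]
    return (visible[:, 0].min(), visible[:, 1].min(),
            visible[:, 0].max(), visible[:, 1].max())

def match_poses_to_ground_truth(poses, gt_boxes, threshold=0.3):
    # For each ground-truth box, keep the pose with the highest IoU above the threshold.
    pairs = []
    for gt in gt_boxes:
        scores = [iou(enclosing_box(p), gt) for p in poses]
        best = int(np.argmax(scores)) if scores else -1
        if best >= 0 and scores[best] >= threshold:
            pairs.append((best, gt))
    return pairs
```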

Keypoints are especially useful to prevent overfitting. To further increase generalization properties, we normalize the keypoints and zero-center them on the y-axis. Normalization prevents scale differences from biasing the results, while the vertical location of a person in the image plane does not add any information regarding eye contact detection. The x-coordinate in the image plane, on the other hand, may help infer the relative head and body orientations with respect to the camera. For every keypoint $(u_i, v_i)$ with pixel coordinates in the image plane, we apply the following transformation:

$$\bar{u}_i = \frac{u_i}{W}, \qquad \bar{v}_i = \frac{v_i - \bar{v}}{h}, \tag{1}$$

where $\bar{v}$ is the mean of the $y$-coordinates of the left and right hips of the instance, $h$ the height of the enclosing box given by the keypoints, and $W$ the width of the input image. In practice, the normalization removes information on the size of the person as well as on the vertical location in the image plane.
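A minimal sketch of this normalization, assuming the COCO keypoint order (hips at indices 11 and 12) and a flattened 51-dimensional input vector (17 keypoints with x, y, and confidence), could look as follows; the small epsilon is an assumption to avoid division by zero.

```python
import numpy as np

def normalize_keypoints(kps, image_width):
    # kps: (17, 3) array of (x, y, confidence) in COCO order; hips are joints 11 and 12.
    # x is scaled by the image width (keeping horizontal location); y is zero-centered
    # on the hip mean and scaled by the height of the enclosing box, removing
    # information about the person's size and vertical location.
    x, y, c = kps[:, 0], kps[:, 1], kps[:, 2]
    hip_mean_y = (y[11] + y[12]) / 2.0
    box_height = y.max() - y.min() + 1e-9
    x_norm = x / image_width
    y_norm = (y - hip_mean_y) / box_height
    return np.stack([x_norm, y_norm, c], axis=1).reshape(-1)  # 51-dimensional input
```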

Our architecture is composed of a simple fully-connected network with residual blocks [8], and includes batch normalization [9] after every fully connected layer as well as dropout [35]. The structure is inspired by the success of 2D keypoints in 3D vision tasks, especially Martinez et al. [21] for 3D pose estimation and Bertoni et al. [1] for human 3D localization. The residual blocks increase performance and help avoid overfitting, while the model, which contains approximately 411K training parameters, is fast and has a low memory footprint. Its building blocks are shown in Figure 2.
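As a hedged PyTorch sketch, such a network could be implemented as below; the 256-unit width follows the Figure 2 caption, while the number of residual blocks and the dropout rate are assumptions chosen so that the parameter count lands near the reported figure, not necessarily the released implementation.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Two (Linear -> BatchNorm -> ReLU -> Dropout) stages with a skip connection.
    def __init__(self, width=256, p_dropout=0.2):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(width, width), nn.BatchNorm1d(width), nn.ReLU(), nn.Dropout(p_dropout),
            nn.Linear(width, width), nn.BatchNorm1d(width), nn.ReLU(), nn.Dropout(p_dropout),
        )

    def forward(self, x):
        return x + self.block(x)

class KeypointEyeContactModel(nn.Module):
    # 51-dim input (17 keypoints x (x, y, confidence)) -> one eye-contact logit.
    def __init__(self, input_dim=51, width=256, n_blocks=3, p_dropout=0.2):
        super().__init__()
        self.input_layer = nn.Sequential(
            nn.Linear(input_dim, width), nn.BatchNorm1d(width), nn.ReLU(), nn.Dropout(p_dropout))
        self.blocks = nn.Sequential(*[ResidualBlock(width, p_dropout) for _ in range(n_blocks)])
        self.head = nn.Linear(width, 1)

    def forward(self, x):
        return self.head(self.blocks(self.input_layer(x)))
```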

Method JAAD [30] PIE [27] LOOK-KITTI [6] LOOK-nuScenes [3] LOOK-JRDB [20] LOOK
Crops (ResNet-18 [8]) 78.1 73.5 76.7 81.7 92.0 83.5
Crops (ResNeXt-50 [37]) 79.7 74.2 72.0 85.7 92.5 83.4
Eyes Crops 77.4 70.6 77.1 84.7 83.6 81.8
Keypoints 85.9 83.8 91.6 91.0 94.8 92.5
Body Keypoints 76.4 72.6 79.3 80.7 75.4 78.5
Head Keypoints 86.3 84.0 90.9 90.2 95.1 92.0
Keypoints & Crops (ResNet-18 [8], Fusion: O1) 78.0 75.2 79.7 85.3 91.6 85.5
Keypoints & Crops (ResNet-18 [8], Fusion: O2) 78.7 75.6 78.9 84.3 92.7 85.4
Keypoints & Crops (ResNeXt-50 [37], Fusion: O1) 79.5 75.1 73.6 85.8 92.1 83.8
Keypoints & Crops (ResNeXt-50[37], Fusion: O2) 80.6 75.9 74.1 86.2 93.2 84.5
Keypoints & Eyes Crops (Fusion: O1) 83.9 79.9 87.0 91.2 92.5 90.2
TABLE III: Impact of different architectures on the AP metric for eye contact classification (%) on different datasets. All methods have been trained on the JAAD dataset [30] only. Crops stands for adapting the crop-based model first introduced by Rasouli et al. [30] with a ResNet [8] or ResNeXt [37] architecture. Keypoints stands for our simple architecture trained only with keypoints as input, either all 17 keypoints of the human body or a subset of them: Body Keypoints includes all keypoints except the head ones, while Head Keypoints includes the ears, eyes, and nose locations. Keypoints & Crops stands for our fusion-based approach combining keypoints and crops in a single representation. When training only on the 5 head keypoints, we obtain the best results on JAAD [30], but training on all the keypoints generalizes better across datasets.

IV-B Combined method

We argue that 2D keypoints are a low-dimensional representation that contains enough information to understand whether a person is looking at the camera. This is motivated by the application we are targeting: autonomous driving scenarios, where people are often far away from the camera and the pupils may not be distinguishable. To test our hypothesis, we develop a modular architecture to optionally include visual information from the head region of a pedestrian. We create a combined method that encodes features from both the keypoints and the cropped head region. While the former branch does not change, for the crops we select the upper third of the bounding box [30] enclosing the keypoints, and we extract the features using a convolutional backbone. We explore different backbone architectures (i.e., AlexNet [15], ResNet [8], and ResNeXt [37]) and different fusion techniques. As visually described in Figure 2, we experiment with early fusion and with late fusion. In the former option (O1), we sum the visual features extracted from a convolutional backbone with the raw features extracted from the 2D keypoints. In the latter option (O2), we concatenate the visual features with the features extracted from the last layer of the fully-connected architecture. Our training schedule, inspired by Zamir et al. [38], consists of two steps. We first initialize the keypoint-based and the crop-based branches by training them independently. We then freeze all the layers before the concatenation and train the remaining ones. At both stages, we use the binary cross-entropy loss. The combined method allows us to verify whether adding visual information to the keypoint-based method increases the performance.
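A minimal sketch of the late-fusion option (O2) is given below; it assumes a ResNet-18 crop branch and a keypoint branch exposing its 256-dimensional last-layer features, and the freezing granularity is an assumption.

```python
import torch
import torch.nn as nn
import torchvision

class LateFusionModel(nn.Module):
    # O2 (late) fusion sketch: crop features and keypoint features are concatenated
    # and classified by a small trainable head; both pre-trained branches are frozen,
    # mirroring the two-step schedule described above.
    def __init__(self, keypoint_branch, keypoint_dim=256):
        super().__init__()
        backbone = torchvision.models.resnet18(weights='IMAGENET1K_V1')
        self.crop_branch = nn.Sequential(*list(backbone.children())[:-1])  # 512-dim features
        self.keypoint_branch = keypoint_branch  # assumed to return last-layer features
        for module in (self.crop_branch, self.keypoint_branch):
            for param in module.parameters():
                param.requires_grad = False
        self.head = nn.Linear(512 + keypoint_dim, 1)

    def forward(self, head_crop, keypoints):
        crop_features = self.crop_branch(head_crop).flatten(1)
        keypoint_features = self.keypoint_branch(keypoints)
        return self.head(torch.cat([crop_features, keypoint_features], dim=1))
```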

V Experiments

V-A Experimental setup

Evaluation metrics. We evaluate pedestrian detection and eye contact classification separately. Some previous methods [30] do not include pedestrian detection, using a box classification approach where the ground-truth boxes are given. In our case, a detection step is also included to ensure a fair comparison among different methods. To disentangle the contributions of pedestrian detection and eye contact classification, we split the two tasks. To evaluate the detection results, we use the recall metric with an intersection over union (IoU) threshold of 0.5. In contrast to [4, 22], we do not use the Average Precision (AP) metric for detection, as we only focus on instances labeled with the eye contact attribute, and in any given dataset very far instances are not annotated with it. In this case, the AP metric may penalize extra detections. In the classification setup, we evaluate the set of detected instances that match a ground truth, and we use the AP metric to evaluate the classification of the binary attribute of looking at the camera or not.

As shown in Table I, each dataset is unbalanced toward a majority of people not looking at the camera. Following again the procedure of [22], we compute results on a balanced test set where negative instances are randomly sampled. The sampling is done 10 times to reduce the variance and the results are averaged.
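A minimal sketch of this balanced evaluation, assuming per-instance binary labels and model scores and using scikit-learn, is shown below; the function name and seeding are illustrative.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def balanced_ap(labels, scores, n_repeats=10, seed=0):
    # labels: 1 = looking at the camera, 0 = not looking; scores: model confidences.
    # Negatives are subsampled to match the number of positives, 10 times, and the
    # resulting APs are averaged.
    labels, scores = np.asarray(labels), np.asarray(scores)
    rng = np.random.default_rng(seed)
    positives = np.flatnonzero(labels == 1)
    negatives = np.flatnonzero(labels == 0)
    aps = []
    for _ in range(n_repeats):
        sampled = rng.choice(negatives, size=len(positives), replace=False)
        idx = np.concatenate([positives, sampled])
        aps.append(average_precision_score(labels[idx], scores[idx]))
    return float(np.mean(aps))
```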

Regarding the training and testing splits, for the JAAD dataset [30] the official split is composed of 177 videos for training, 29 videos for validation, and 117 videos for testing. For the PIE dataset we use, as recommended, set01, set02, and set04 for training, set05 and set06 for validation, and set03 for testing.

Implementation details. To obtain input-output pairs of 2D joints and binary labels, we apply the off-the-shelf pose detector OpenPifPaf [13, 14] and match our detections with the ground-truth boxes provided by each dataset. We train our keypoint-based architecture for 20 epochs, using the binary cross-entropy loss with the Adam optimizer [10], a learning rate of 0.0001, and mini-batches of 64 instances. The crop model is trained for 20 epochs, using the binary cross-entropy loss, the SGD optimizer [2] with Nesterov momentum [23], a learning rate of 0.0001, and mini-batches of 32 instances. For the combined architecture, we pre-train both branches and freeze the early layers before the fusion of the features. We train the last layers with an SGD optimizer, a learning rate of 0.00001, and mini-batches of 128 instances.
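For illustration, a minimal training-loop sketch for the keypoint branch with the hyperparameters listed above is given below; the dataset object and device handling are placeholders.

```python
import torch
from torch.utils.data import DataLoader

def train_keypoint_model(model, train_dataset, device='cuda'):
    # As reported above: 20 epochs, binary cross-entropy, Adam, lr 1e-4, batch size 64.
    loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = torch.nn.BCEWithLogitsLoss()
    model.to(device).train()
    for epoch in range(20):
        for keypoints, labels in loader:
            keypoints, labels = keypoints.to(device), labels.float().to(device)
            optimizer.zero_grad()
            loss = criterion(model(keypoints).squeeze(1), labels)
            loss.backward()
            optimizer.step()
    return model
```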

The code, available online, is developed using PyTorch [25]. We do not apply any data augmentation procedure on the 2D poses.

V-B Baselines

We argue that eye contact detection is a crucial task yet to be solved to develop safe autonomous vehicles. However, to the best of our knowledge, very few methods have reported results on the eye contact task in JAAD [30] or PIE [27]. Rasouli et al. [30] proposed to use image crops of people as inputs to an AlexNet architecture [16] followed by fully connected layers. Their published results on the JAAD dataset [30] used a smaller version of the dataset and a random split of its instances. Hence, the same unique pedestrian may appear in both the training and testing sets at different time frames. For a fair comparison, we have re-implemented this method and evaluated it on the recently released official JAAD split to prevent any contamination of the testing set.

In addition, we compare against the very recent MTL-Fields developed by Mordan et al. [22]. It is a field-based approach that leverages multiple pedestrian attributes in a multi-task fashion, including eye contact, to understand the visual appearances and behaviors of pedestrians. Contrary to Rasouli et al. [30], who operate on image crops of people and therefore discard the context around them, MTL-Fields keeps full images in order to understand the scenes and learn interactions between pedestrians. As their code is open-source, we train a network and evaluate it with our setup for eye contact detection.

Our baselines. One of our goals is to compare the properties of keypoints and crops for the eye contact task. Thus, we develop a modular architecture that is either based on keypoints only, or can combine keypoints and visual information together, and we benchmark it with the following baselines:

  • Crops: when referring to methods using crops only, we consider the approach of Rasouli et al. [30] with a more recent ResNet [8] or ResNeXt [37] backbone. Rasouli et al. [30] train their model with ground-truth crops without including detection results. For a fair comparison, we train and evaluate the model on the same set of instances provided by the OpenPifPaf detector [14].

  • Head & Body Keypoints: we test our keypoint-based architecture with subsets of keypoints, either only including the keypoints of the head region (eyes, nose, ears), or only the ones from the rest of the body. The goal is to analyze whether the body orientation also provides informative cues, or whether the head keypoints suffice for the eye contact detection task.

  • Keypoints & Eye Crops: we test whether adding visual information about the region around the eyes could be informative and less prone to overfitting than head crops. From the 2D keypoint locations of the eyes and ears, we crop a small region around the pupils and resize it to a fixed patch of 3x10x30 pixels (see the sketch after this list). The model architecture is consistent with the one shown in Figure 2, but we substitute the head crops with the eye crops, and the convolutional backbone with a fully-connected block.
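The sketch below illustrates the eye-crop extraction mentioned in the last item, assuming COCO keypoint indices (eyes at 1-2, ears at 3-4) and an OpenCV resize; the margin around the eye region is an assumption.

```python
import numpy as np
import cv2

def eye_crop(image, keypoints, out_size=(30, 10)):
    # image: HxWx3 array; keypoints: (17, 3) COCO array (joints 1-4 are eyes and ears).
    # Crop a small band around the eyes/ears and resize it to a fixed 10x30 patch,
    # returned channel-first as 3x10x30 for the fully-connected block.
    pts = keypoints[1:5]
    x_min, x_max = pts[:, 0].min(), pts[:, 0].max()
    y_min, y_max = pts[:, 1].min(), pts[:, 1].max()
    margin = 0.3 * max(x_max - x_min, 1.0)               # illustrative margin
    x1, y1 = max(int(x_min - margin), 0), max(int(y_min - margin), 0)
    x2, y2 = int(x_max + margin), int(y_max + margin)
    patch = cv2.resize(image[y1:y2, x1:x2], out_size)    # (10, 30, 3)
    return np.transpose(patch, (2, 0, 1))                 # (3, 10, 30)
```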

JAAD [30]
Training Datasets Instances Crops Keypoints
JAAD [30] 50K 79.7 ( - ) 85.9 ( - )
LOOK-nuScenes [3] 10K 71.1 (-8.6) 84.6 (-1.3)
LOOK 41K 73.7 (-6.0) 86.0 (+0.1)
LOOK + PIE [27] 61K 75.1 (-4.6) 87.7 (+1.8)
TABLE IV: Evaluating cross-dataset results for our best crop- and keypoint-based methods using the AP binary classification metric (%) on the JAAD dataset [30]. In parentheses, the relative difference with respect to the same method trained on the JAAD dataset [30]. Instances counts the total number of training instances.

V-C Quantitative results

In Table II, we show the results of training and evaluating each method on the same dataset (either JAAD [30] or our LOOK dataset), as well as cross-dataset results. Our method achieves the best performance among the baselines, with at least a 5% improvement both when testing on the same dataset and when evaluating cross-dataset generalization properties. More notably, our model is able to generalize well on our LOOK dataset when trained on the JAAD dataset [30] only, as it reaches an AP of 92.5%, almost on par with baselines trained directly on the LOOK dataset. Qualitative examples from different datasets are shown in Figure 4.

Our recall results are shared with the Rasouli [30] baseline, as we train and evaluate their model on the same instances detected by the off-the-shelf pose detector OpenPifPaf [13]. We observe that MTL-Fields [22] achieves higher recall on JAAD when trained on the same dataset, but its recall drops by 15% when trained on a different dataset. Our method, on the other hand, maintains high recall when testing domain adaptation, as it leverages an off-the-shelf pose estimator [13] trained and optimized on a general-purpose dataset [18].

Fig. 3: Visual illustration of the normalized magnitude of the gradients of the loss function with respect to each keypoint during training. The keypoints related to the head (eyes and ears) are the ones that most affect the loss function.
JAAD [30]
Method / Box Height [px] 240+ 160-240 110-160 0-110 All
Crops (ResNet-18) 80.4 79.5 80.4 73.4 78.1
Crops (ResNeXt-50) 79.0 81.2 83.0 75.3 79.7
Eyes Crops 74.4 79.1 81.3 74.7 77.4
Keypoints 87.9 88.7 87.0 78.7 85.9
Head Keypoints 90.4 89.7 86.7 76.5 86.3
Keypoints & Crops 80.5 82.2 83.6 75.9 80.6
Keypoints & Eyes Crops 85.3 86.4 85.0 79.4 83.9
TABLE V: Average precision (AP) in percentage (%) as a function of the bounding box height in pixels for the JAAD dataset [30]. Each cluster corresponds to one quartile of the distribution. For the crop-based methods, we consider our ResNeXt-50 model with late fusion when not otherwise specified.

(a) LOOK-KITTI [6]
(b) LOOK-nuScenes [3]
(c) JAAD [30]
(d) LOOK-JRDB [20]
Fig. 4: Qualitative results for the eye contact detection task on multiple datasets. People with green poses are predicted as looking at the camera, people with red poses as not looking.

V-D Cross-dataset generalization

We further study cross-dataset generalization with our new LOOK dataset in Tables III and IV. First of all, we investigate the performances of eleven crop- and keypoint-based methods in Table III. We train the models on the JAAD dataset [30] and evaluate them on JAAD [30], PIE [27], and our LOOK dataset. Keypoint-based models perform and generalize better than crop-based ones, consistently with the results obtained in Table II. Surprisingly, adding visual information to the keypoints in our combined models (which we refer to as Keypoints & Crops) degrades the performance, as it leads to stronger overfitting. Among the combined models, the best results are achieved by using features from the eye region instead of the head region. We attribute this result to the generalization properties of keypoints, as this low-dimensional representation does not overfit on background scenes or specific faces. Yet even the simpler models with no visual information achieve the best results on all the datasets but LOOK-nuScenes [3].

As an additional experiment, in Table IV we train our best keypoint-based and crop-based methods on datasets other than JAAD [30], and evaluate them on JAAD. We train our keypoint-based model on 10K instances from the nuScenes dataset [3], and we obtain less than a 2% difference compared to training it on the 50K instances of the JAAD dataset [30]. The best crop-based model, on the other hand, achieves an AP of 71.1%, down from an original 79.7%. When increasing the number of instances from multiple datasets, the performance of the crop-based model never reaches the baseline result of training on JAAD only. The keypoint-based model, on the other hand, achieves its best performance on JAAD [30] when trained on different datasets.

V-E Additional studies

The role of distance. We test the hypothesis that crop-based methods may be most effective when people are closer to the camera, while keypoints may be more useful when people are far away and details of the face are less informative. We obtain the distribution of bounding box heights for all the instances of the JAAD test set [30], and evaluate each quartile separately. As we show in Table V, the hypothesis is not verified: keypoints remain more effective than crops even for people close to the camera. This result may not be intuitive at first sight, but we are analyzing autonomous driving datasets, where even close people may be several meters away from the camera, and detecting the direction of the pupil may not be feasible. Keypoints provide a simple yet effective representation in these scenarios.
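A minimal sketch of this per-quartile evaluation, assuming per-instance bounding-box heights, labels, and scores, is shown below; the helper name is illustrative.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def ap_by_height_quartile(heights, labels, scores):
    # Split the test instances into quartiles of bounding-box height (in pixels)
    # and compute the AP separately for each bin (0 = smallest boxes, 3 = largest).
    heights, labels, scores = map(np.asarray, (heights, labels, scores))
    edges = np.quantile(heights, [0.25, 0.5, 0.75])
    bins = np.digitize(heights, edges)
    return {b: average_precision_score(labels[bins == b], scores[bins == b])
            for b in range(4)}
```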

Saliency map. To verify the impact of each keypoint on the final decision of the model, we compute the absolute value of the gradient of the objective function with respect to each input node at every epoch on the training set:

$$ G_i^{(n)} = \left| \frac{\partial \mathcal{L}}{\partial k_i^{(n)}} \right|, \tag{2} $$

where $G_i^{(n)}$ represents the impact of the $i$-th keypoint $k_i$ with its three components: the $u$ and $v$ coordinates and the confidence score $c$. We then average this value by taking the mean absolute value across all the training instances, which consist of $N$ samples. The results are illustrated in Figure 3. The dominant keypoints are the eyes and ears, as shown by the magnitude of the gradients of the loss function with respect to each keypoint.
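A hedged sketch of this saliency computation, assuming the 51-dimensional keypoint input and a logit-output model as in the earlier sketches, is given below.

```python
import torch

def keypoint_saliency(model, loader, device='cuda'):
    # Accumulate |dL/dx| for each of the 17x3 input components, then average over
    # the training samples and sum the (x, y, confidence) components per keypoint.
    criterion = torch.nn.BCEWithLogitsLoss(reduction='sum')
    model.to(device).eval()
    total, count = torch.zeros(51, device=device), 0
    for keypoints, labels in loader:
        keypoints = keypoints.to(device).requires_grad_(True)
        labels = labels.float().to(device)
        loss = criterion(model(keypoints).squeeze(1), labels)
        loss.backward()
        total += keypoints.grad.abs().sum(dim=0)
        count += keypoints.shape[0]
    per_component = (total / count).view(17, 3)
    return per_component.sum(dim=1).cpu()  # one saliency value per COCO keypoint
```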

Average Run Time (ms)
Method AP (%) Detection Classification Total
Rasouli [30] (S-30) 71.0 602 39 672
MTL-Fields [22] 80.7 573
Our Method (S-30) 85.9 602 0.8 626
Our Method (R-50) 82.6 305 0.8 328
TABLE VI: Average run time performances for a single image on the JAAD test set. The detection steps for both Rasouli [30] and our method are calculated with the off-the-shelf pose detector OpenPifPaf [13], using either a ResNet-50 (R-50) [8] or a ShuffleNetV2K30 (S-30) [19] backbone.

Run Time. Our experiments have been conducted on a machine with a single NVIDIA GeForce GTX 1080 Ti and an Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz. In Table VI, we compare the run times of several methods on the test set of the JAAD dataset [30]. As the original method from Rasouli et al. did not include a detection step, we use the same backbone to extract the poses for our method and the crops for Rasouli's. For MTL-Fields, detection and classification are performed in a single stage. Our method excels in the classification step, with less than 1 ms of inference time, as it uses low-dimensional keypoints. Regarding the detection step, our method is agnostic to the pose detector. We have tested it with OpenPifPaf [13] using two different backbones and achieved the fastest run time with a ResNet-50 (R-50) [8].

VI Conclusions

Eye contact detection is a practically important task to better understand and forecast human behaviors. In particular, autonomous robots need to solve this task to navigate safely around humans. We have introduced a new deep learning approach for eye contact detection in the wild, i.e., with no prior knowledge of the environment, which is suited to the multiple challenges associated with this task. We start by extracting semantic keypoints from images and use them as low-dimensional, high-level features to escape the image domain and focus on relevant information. We have then compared several architectures to process this representation, including using it in addition to selected image crops. We have also publicly released LOOK, a large-scale dataset for eye contact detection in the wild. We designed it with real-world generalization in mind, by annotating three common autonomous driving datasets to allow cross-dataset training and evaluation, and by focusing on multiple scenarios and diverse environments. We have evaluated our method and several approaches from the literature on LOOK to create a benchmark for this task, and have shown state-of-the-art results with robust generalization across datasets compared to image-based approaches. We hope that this new benchmark can help foster further research from the community on this important but overlooked topic.

References

  • [1] L. Bertoni, S. Kreiss, and A. Alahi (2019-10) MonoLoco: monocular 3d pedestrian localization and uncertainty estimation. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §IV-A.
  • [2] L. Bottou (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pp. 177–186. Cited by: §V-A.
  • [3] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2019) NuScenes: a multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027. Cited by: §I, §III-B, §III-C, TABLE I, TABLE II, TABLE III, 3(b), §V-D, §V-D, TABLE IV.
  • [4] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2015) The pascal visual object classes challenge: a retrospective. International journal of computer vision 111 (1), pp. 98–136. Cited by: §I, §V-A.
  • [5] T. Fischer, H. J. Chang, and Y. Demiris (2018) RT-GENE: real-time eye gaze estimation in natural environments. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 334–352. Cited by: §II-A.
  • [6] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset. International Journal of Robotics Research (IJRR). Cited by: §I, §III-B, §III-C, TABLE I, TABLE II, TABLE III, 3(a).
  • [7] J. Hariyono, A. Shahbaz, L. Kurnianggoro, and K. Jo (2016) Estimation of collision risk for improving driver’s safety. In Conference of the IEEE Industrial Electronics Society (IECON), pp. 901–906. Cited by: §II-B.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: Fig. 2, §IV-A, §IV-B, TABLE III, 1st item, §V-E, TABLE VI.
  • [9] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: Fig. 2, §IV-A.
  • [10] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §V-A.
  • [11] J. F. P. Kooij, N. Schneider, F. Flohr, and D. M. Gavrila (2014) Context-based pedestrian path prediction. In Proceedings of the IEEE European Conference on Computer Vision (ECCV), pp. 618–633. Cited by: §I, §II-B.
  • [12] P. Kothari, S. Kreiss, and A. Alahi (2021) Human trajectory forecasting in crowds: a deep learning perspective. IEEE Transactions on Intelligent Transportation Systems (T-ITS). Cited by: §I.
  • [13] S. Kreiss, L. Bertoni, and A. Alahi (2019) Pifpaf: composite fields for human pose estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11977–11986. Cited by: §I, §III-C, §IV-A, TABLE II, §V-A, §V-C, §V-E, TABLE VI.
  • [14] S. Kreiss, L. Bertoni, and A. Alahi (2021-03) OpenPifPaf: Composite Fields for Semantic Keypoint Detection and Spatio-Temporal Association. arXiv preprint arXiv:2103.02440. Cited by: §IV-A, 1st item, §V-A.
  • [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), Vol. 25, pp. 1097–1105. External Links: Link Cited by: §II-B, §IV-B.
  • [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems (NeurIPS) 25, pp. 1097–1105. Cited by: §V-B.
  • [17] J. Kwak, B. C. Ko, and J. Nam (2017) Pedestrian intention prediction based on dynamic fuzzy automata for vehicle driving at nighttime. Infrared Physics & Technology 81, pp. 41–51. Cited by: §II-B.
  • [18] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In The European Conference on Computer Vision (ECCV), Cited by: §I, §V-C.
  • [19] N. Ma, X. Zhang, H. Zheng, and J. Sun (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In The European Conference on Computer Vision (ECCV), pp. 116–131. Cited by: TABLE VI.
  • [20] R. Martin-Martin*, M. Patel*, H. Rezatofighi*, A. Shenoi, J. Gwak, E. Frankel, A. Sadeghian, and S. Savarese (2021) JRDB: a dataset and benchmark of egocentric robot visual perception of humans in built environments. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Cited by: §I, §III-B, §III-C, TABLE I, TABLE II, TABLE III, 3(d).
  • [21] J. Martinez, R. Hossain, J. Romero, and J. J. Little (2017) A simple yet effective baseline for 3d human pose estimation. In The IEEE International Conference on Computer Vision (ICCV), pp. 2659–2668. Cited by: §IV-A.
  • [22] T. Mordan, M. Cord, P. Pérez, and A. Alahi (2020) Detecting 32 pedestrian attributes for autonomous vehicles. arXiv preprint arXiv:2012.02647. Cited by: §II-B, TABLE II, §V-A, §V-A, §V-B, §V-C, TABLE VI.
  • [23] Y. E. Nesterov (1983) A method for solving the convex programming problem with convergence rate o (1/k^ 2). In Soviet Mathematics Doklady, Vol. 269, pp. 543–547. Cited by: §V-A.
  • [24] S. Park, A. Spurr, and O. Hilliges (2018-09) Deep pictorial gaze estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §II-A.
  • [25] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §V-A.
  • [26] A. Rangesh and M. M. Trivedi (2018) When vehicles see pedestrians with phones: a multicue framework for recognizing phone-based activities of pedestrians. IEEE Transactions on Intelligent Vehicles 3 (2), pp. 218–227. Cited by: §II-B.
  • [27] A. Rasouli, I. Kotseruba, T. Kunic, and J. K. Tsotsos (2019) PIE: a large-scale dataset and models for pedestrian intention estimation and trajectory prediction. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 6262–6271. Cited by: §I, §I, §II-B, §III-A, §III-A, §III-C, TABLE I, TABLE III, §V-B, §V-D, TABLE IV.
  • [28] A. Rasouli, I. Kotseruba, and J. K. Tsotsos (2017) Agreeing to cross: how drivers and pedestrians communicate. In IEEE Intelligent Vehicles Symposium (IV), pp. 264–269. Cited by: §I.
  • [29] A. Rasouli, I. Kotseruba, and J. K. Tsotsos (2017) Are they going to cross? a benchmark dataset and baseline for pedestrian crosswalk behavior. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 206–213. Cited by: §I, §I, §I, §II-B.
  • [30] A. Rasouli, I. Kotseruba, and J. K. Tsotsos (2017) Are they going to cross? a benchmark dataset and baseline for pedestrian crosswalk behavior. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 206–213. Cited by: §III-A, §III-A, §III-C, TABLE I, §IV-B, TABLE II, TABLE III, 3(c), 1st item, §V-A, §V-A, §V-B, §V-B, §V-C, §V-C, §V-D, §V-D, §V-E, §V-E, TABLE IV, TABLE V, TABLE VI.
  • [31] A. Rasouli and J. K. Tsotsos (2019) Autonomous vehicles that interact with pedestrians: a survey of theory and practice. IEEE Transactions on Intelligent Transportation Systems (T-ITS) 21 (3), pp. 900–918. Cited by: §I.
  • [32] T. Rowntree, C. Pontecorvo, and I. Reid (2019) Real-time human gaze estimation. In 2019 Digital Image Computing: Techniques and Applications (DICTA), Vol. , pp. 1–7. External Links: Document Cited by: §II-A.
  • [33] H. Saenz, H. Sun, L. Wu, X. Zhou, and H. Yu (2021) Detecting phone-related pedestrian distracted behaviours via a two-branch convolutional neural network. IET Intelligent Transport Systems 15 (1), pp. 147–158. Cited by: §II-B.
  • [34] B.A. Smith, Q. Yin, S.K. Feiner, and S.K. Nayar (2013-10) Gaze Locking: Passive Eye Contact Detection for Human-Object Interaction. In ACM Symposium on User Interface Software and Technology (UIST), pp. 271–280. Cited by: §II-A.
  • [35] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: Fig. 2, §IV-A.
  • [36] D. Varytimidis, F. Alonso-Fernandez, B. Duran, and C. Englund (2018) Action and intention recognition of pedestrians in urban traffic. In 14th International Conference on Signal-Image Technology & Internet-based Systems (SITIS), pp. 676–682. Cited by: §I, §II-B.
  • [37] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500. Cited by: Fig. 2, §IV-B, TABLE III, 1st item.
  • [38] A. R. Zamir, A. Sax, N. Cheerla, R. Suri, Z. Cao, J. Malik, and L. J. Guibas (2020) Robust learning through cross-task consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11197–11206. Cited by: §IV-B.
  • [39] X. Zhang, S. Park, T. Beeler, D. Bradley, S. Tang, and O. Hilliges (2020) ETH-xgaze: a large scale dataset for gaze estimation under extreme head pose and gaze variation. In European Conference on Computer Vision (ECCV), Cited by: §II-A.