In urban or crowded environments, humans rely on eye contact for fast and efficient communication with nearby people. Autonomous agents also need to detect eye contact to interact with pedestrians and safely navigate around them. In this paper, we focus on eye contact detection in the wild, i.e., real-world scenarios for autonomous vehicles with no control over the environment or the distance of pedestrians. We introduce a model that leverages semantic keypoints to detect eye contact and show that this high-level representation (i) achieves state-of-the-art results on the publicly-available dataset JAAD, and (ii) conveys better generalization properties than leveraging raw images in an end-to-end network. To study domain adaptation, we create LOOK: a large-scale dataset for eye contact detection in the wild, which focuses on diverse and unconstrained scenarios for real-world generalization. The source code and the LOOK dataset are publicly shared towards an open science mission.
When walking or driving, people use eye contact to communicate intentions, pay attention to their environments, or acknowledge the presence of others. Autonomous agents also need to understand this implicit channel of communication to move naturally around humans and avoid collisions [28, 31]. Eye contact detection is especially useful for autonomous vehicles, as they need to understand whether a pedestrian intends to cross the street in front of the vehicle [29, 36]. Similarly, smaller robots moving in crowds should be capable of detecting whether pedestrians have noticed them and are more likely to actively avoid them [11, 12]. Finally, even in smart cities, eye contact detection can be useful to better understand pedestrians’ behaviors, e.g., identify where their attentions go or what public signs they are looking at.
Although humans make eye contact with each other at all times, detecting this action in the wild, i.e., with no constraint on the environment such as exemplified in Figure 1, presents a few challenges. First, the action can be quick and subtle, happening with small head movements lasting as little as a few milliseconds. Because of this small window, both spatially and temporally, detection is hard and can easily be affected by environmental conditions, such as lighting or the distance of pedestrians. Furthermore, eye contact has received little attention, and few datasets have been annotated with it [29, 27] compared to more popular vision tasks such as object detection or 2D pose estimation. All these reasons make it more difficult for autonomous systems to detect eye contact effectively and to generalize to new environments.
To mitigate these issues, we propose to detect eye contact from high-level semantic keypoints, as displayed in Figure 1. Although one could expect images to be a key input representation for eye contact detection, we show that we can use keypoints as input to escape the image domain, and process them with a simple yet effective neural architecture. For this, we first rely on a pose estimation step, which extracts semantic keypoints for all pedestrians in an image using the off-the-shelf pose detector OpenPifPaf. Using pose features as input rather than images presents several advantages. As pose needs less resolution than gaze, while also being annotated on more diverse datasets, it should be less affected by noise from environmental conditions, and predictions should generalize better to different scenarios and environments. Poses are also much lower-dimensional than images, and do not require as much network capacity to be processed properly. This allows the use of lighter networks, which should help prevent overfitting on the few scenarios annotated. Finally, by leveraging these high-level features, we remove background information and reduce effects from changes in image statistics, allowing our model to focus solely on eye contact detection.
Since there are not many datasets covering a large variety of scenarios for eye contact, and these usually include a limited number of pedestrians [29, 27], we argue that if a model is to be trained on them and deployed in the real world, it must generalize particularly well to new, uncontrolled conditions. We suggest evaluating this through cross-dataset generalization, and we annotate three common autonomous driving datasets with this new task, namely KITTI, nuScenes, and JRDB, to diversify the scenarios involving eye contact. When evaluating our models, we show that using semantic keypoints leads to models generalizing better to various datasets and scenarios. We publicly release the annotations as a new dataset, which we refer to as LOOK (dataset: https://looking-vita-epfl.github.io), as well as the source code (https://github.com/vita-epfl/looking), towards an open science mission.
To summarize, our contributions are as follows:
We propose a deep learning model leveraging semantic keypoints, specially adapted to the challenges of eye contact detection;
We publicly release LOOK, a diverse, large-scale dataset for eye contact detection in the wild, with numerous unique pedestrians and a focus on generalization across domains and scenarios, by annotating three common autonomous driving datasets;
We suggest an evaluation protocol for eye contact with real-world generalization in mind, and show that our approach yields state-of-the-art results and strong generalization compared to image-based methods.
Gaze estimation has received a lot of attention from the Computer Vision community, as a simpler alternative to eye tracking. As for most other tasks, all leading approaches now rely on Deep Learning to get state-of-the-art results on the various benchmarks. In this paper, we focus on eye contact detection, which can be considered a special case of gaze estimation. Multiple works tackle this problem: Smith et al. focus on gaze locking from the visual appearance of the eyes by masking out their surroundings, and Park et al. transform single eye images into simplified pictorial representations to regress the angle of the gaze. Others also focus on real-time inference. Fischer et al. use a cascade of networks to localize heads and face landmarks, align them to a predefined normalized face image for extracting eye patches, and then compute gaze with a deep network. Rowntree et al. also use two networks for head detection and gaze estimation in order to speed up the overall pipeline.
The major issue with these works is that the benchmarks and the methods are not applied to in-the-wild applications for autonomous vehicles, where the resolution of the pedestrian is low. They usually focus on situations where people are rather close to the camera (e.g., inside a vehicle, in front of a computer), where the heads occupy larger regions in the images, and with simple or plain backgrounds, sometimes in controlled setups. On the other hand, we focus on eye contact detection in the wild, where there is no prior constraint on the type of environment pedestrians are in.
From a driver’s perspective, detecting eye contact is an important cue that indicates pedestrians’ awareness of the traffic and future crossing intentions. However, few datasets have annotated this action. JAAD and PIE are two such datasets, both focusing on pedestrians likely to cross the road in front of vehicles. They therefore allow learning and evaluating eye contact directly from images. Rasouli et al. use an AlexNet to classify cropped images of pedestrians’ heads but require bounding boxes to be given. Varytimidis et al. have a similar approach where they use an SVM to classify features from a convolutional network applied to head crops, and then process their predictions with contextual information. Mordan et al. jointly detect pedestrians and eye contact, along with other attributes, in a single network forward pass using multi-task fields.
In the context of pedestrian crossings, multiple works (in addition to the previous ones) use eye contact as an intermediate feature to better predict pedestrians’ future behaviors. Kooij et al. estimate head orientation as a cue for pedestrians’ situational awareness, and use it with other indicators to predict their paths around the road with a Dynamic Bayesian Network. Other approaches use a similar strategy for pedestrian awareness, e.g., Hariyono et al. for estimating the risk of collision, and Kwak et al. for predicting pedestrian intention at night time.
Eye contact detection is directly related to whether pedestrians pay attention to the incoming traffic. In practice, detecting that they do not pay attention is as important. One of the main reasons for that is the use of a phone that draws their attention away from the road. Identifying phone-related activities is therefore an insightful cue to detect. Rangesh et al.  show the practical importance of having gaze annotations, both for eye contact between drivers and pedestrians, or phone-related distractions. Going further to recognize actions implying a phone, Saenz et al.  use a two-branch convolutional network to predict distracted behaviors due to phone usage from stereo image pairs.
While detecting phone-related activities or pedestrian intentions are crucial tasks, eye-contact detection remains an essential channel of communication. Pedestrians may have the intention to cross but have they seen the upcoming car they should yield to? Contrarily, people may hold the phone but still pay attention to the upcoming traffic.
We argue that eye contact detection is a crucial yet under-explored task to enable autonomous agents to safely navigate around pedestrians. To promote research in this area, we show that the current datasets are not sufficiently diverse for data-driven methods, and we create a new large-scale dataset for eye contact detection in the wild.
|Dataset||Frames||Instances [% looking]||Pedestrians|
|JAAD ||82K||133K [18%]||686|
|PIE ||909K||739K [9%]||1,842|
|Our LOOK-KITTI ||1,391||4,630 [17%]||425|
|Our LOOK-JRDB ||9,441||39K [18%]||399|
|Our LOOK-nuScenes ||2,216||13K [9%]||7,100|
|Our LOOK||13K||57K [16%]||7,944|
To the best of our knowledge, only two datasets contain annotations for the eye contact detection task: JAAD dataset , and PIE dataset . JAAD consists of 390K instances of pedestrians labeled with bounding boxes and behaviour annotation, of which 17K instances have been labeled as looking at the driver (i.e., at the camera in the car). The dataset is large in size but limited in diversity. It is made of 346 video clips of 5-10 seconds recorded with an on-board camera at 30fps in North America and Europe. Thus, many frames show the same people in similar scenes, and the number of unique pedestrians looking at the camera is 686.
PIE is also a recent dataset for pedestrian intention estimation. It shares many of the characteristics of its predecessor JAAD. It is recorded using an on-board camera at 30fps and consists of 6 hours of continuous footage in downtown Toronto, Canada. Out of 700K annotated pedestrian instances, there are 1,842 unique pedestrians, of which fewer than 180 are looking at the camera.
We have built a new large-scale dataset for eye contact detection in the wild by selecting publicly available images from three existing datasets: KITTI, nuScenes and JRDB. The first two are autonomous driving datasets and are made of images taken from a driver perspective. The last one consists of videos taken from a small robot moving in crowded spaces on the Stanford University campus. In total, we have labeled 13,048 images from four different cities (Boston, Singapore, Tübingen, Palo Alto) in three continents. We aim for diversity, selecting pedestrian areas, crowded images from six cameras around the car, and indoor environments from a robot perspective. In total we have labeled around 8,000 unique pedestrians, making it the most diverse dataset for eye contact detection in the wild. Examples from the LOOK dataset are shown in Figures 3(a), 3(b) and 3(d).
We provide, together with the dataset annotation, the training and testing splits. We make sure that splits do not contain overlapping scenes and that the test set is sufficiently diverse, including 22% of the total number of unique pedestrians over 15% of the frames.
Our LOOK dataset has been annotated using the Amazon Mechanical Turk (AMT) platform. Each image has been annotated by four workers, who could select whether a person was looking at the camera, looking somewhere else, or neither in case of ambiguity. We then include in the dataset only the labels with a consensus of at least three out of four annotators. This threshold has been selected by reviewing edge cases where not all the workers agree on a selected instance.
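The consensus rule above can be illustrated with a minimal sketch. The label strings and the function name `consensus_label` are hypothetical, and how a consensus on the ambiguous option is handled is not stated in the text, so this sketch simply returns whichever label reaches the threshold.

```python
from collections import Counter


def consensus_label(votes, min_agreement=3):
    """Return the most frequent label if enough workers agree, else None.

    `votes` holds one label per worker (four per instance in the text).
    Label names are illustrative placeholders, not the dataset's own.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else None


# Three of four workers agree, so the label is kept.
kept = consensus_label(["looking", "looking", "looking", "not_looking"])
# Only two workers agree, so the instance is dropped.
dropped = consensus_label(["looking", "looking", "not_looking", "ambiguous"])
```

With four workers and a threshold of three, at most one label can reach consensus, so the result does not depend on tie-breaking.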
To promote the creation of an ever-growing open-source dataset, we have also developed and released a labeling tool to ease the annotation process. The tool leverages the off-the-shelf pose detector OpenPifPaf to locate the 2D bounding boxes of pedestrians. It then runs a model pre-trained on the JAAD and PIE datasets (detailed in Section V) to provide a first guess. This pipeline allows annotators to only check and, where needed, correct wrong predictions.
To count the number of pedestrians, we use the tracking identification number for the JRDB dataset, while for the KITTI dataset we count them manually. In the case of the nuScenes dataset, we leverage the metadata associated with each frame. We approximate the number of unique pedestrians by counting only once the instances that appear in the same camera multiple times within a 5-second time window. We ran a sensitivity analysis on the time window and opted for 5 seconds, as the images are recorded from a moving vehicle in the majority of scenes.
The goal of our method is to detect from images whether humans are looking at the camera or somewhere else. We tackle autonomous driving scenarios, i.e., outdoor scenes where people may be several meters away from the camera. Our approach consists of two steps. First, we use a 2D pose detector to obtain a low-dimensional representation from the image domain, which we call semantic keypoints. The keypoints are a convenient representation that provides invariance to many factors, e.g., background artifacts, clothes, weather conditions. Second, we feed the extracted keypoints to a simple feed-forward neural network that detects the presence of eye contact. In addition, we explore multi-modal representations by combining the keypoint representations with the features obtained from crops of images, and we explore different fusion techniques. A diagram of our modular architecture can be found in Figure 2.
Eye contact classification (AP) [pedestrian detection recall] by test dataset:
|Training Dataset||Method||Input||JAAD||LOOK-KITTI||LOOK-JRDB||LOOK-nuScenes||LOOK|
|JAAD||Rasouli||Crops||75.4 [80.1]||65.9 [99.8]||87.2 [98.2]||78.7 [89.8]||77.3 [95.9]|
|JAAD||MTL-Fields||Images||82.6 [92.4]||89.7 [93.1]||82.1 [81.9]||92.0 [71.8]||87.9 [82.3]|
|JAAD||Our method||Keypoints||85.9 [80.1]||91.6 [99.8]||94.8 [98.2]||91.0 [89.8]||92.5 [95.9]|
|LOOK||Rasouli||Crops||71.0 [80.1]||76.8 [99.8]||89.5 [98.2]||82.9 [89.8]||83.1 [95.9]|
|LOOK||MTL-Fields||Images||80.7 [79.0]||95.1 [96.5]||95.2 [93.0]||93.4 [68.4]||94.6 [86.0]|
|LOOK||Our method||Keypoints||86.0 [80.1]||96.4 [99.8]||97.1 [98.2]||95.1 [89.8]||96.2 [95.9]|
We escape the image domain using 2D keypoints: a low-dimensional representation obtained through the off-the-shelf pose detector OpenPifPaf [13, 14], which was designed for crowded scenes and low-resolution images. The output of our network is the binary flag indicating whether a person is looking at the camera. To create the training and testing dataset, we match the bounding boxes enclosing the keypoints with the ground-truth bounding boxes using their intersection over union (IoU). We select the instances with the highest matching IoU above 0.3 for each ground truth.
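The matching step can be sketched as follows. This is a minimal illustration, not the released implementation: boxes are assumed to be `(x1, y1, x2, y2)` tuples, the helper names are hypothetical, and the sketch does not enforce a one-to-one assignment between detections and ground truths.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def keypoints_to_box(keypoints):
    """Tightest box enclosing keypoints given as (x, y) or (x, y, conf)."""
    xs = [k[0] for k in keypoints]
    ys = [k[1] for k in keypoints]
    return (min(xs), min(ys), max(xs), max(ys))


def match_to_ground_truth(pred_boxes, gt_boxes, threshold=0.3):
    """For each ground-truth box, index of the best prediction above the
    IoU threshold, or None when no prediction overlaps enough."""
    matches = []
    for gt in gt_boxes:
        best_idx, best_iou = None, threshold
        for i, pred in enumerate(pred_boxes):
            score = iou(pred, gt)
            if score > best_iou:
                best_idx, best_iou = i, score
        matches.append(best_idx)
    return matches
```

Ground truths that receive `None` are missed detections; these are what the recall metric counts against the detector.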
Keypoints are especially useful to prevent overfitting. To further improve generalization, we normalize the keypoints and zero-center them on the y-axis. Normalization prevents scale differences from biasing the results, while the vertical location of a person in the image plane does not add any information regarding eye contact detection. The x-coordinate in the image plane, on the other hand, may help infer the relative head and body orientations with respect to the camera. Every keypoint, given in pixel coordinates in the image plane, is transformed using the mean of the coordinates of the left and right hips of the instance, the width and height of the enclosing box given by the keypoints, and the width of the input image. In practice, the normalization removes information on the size of the person as well as on the vertical location in the image plane.
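The normalization described above can be sketched in code. The exact formula is not reproduced in this text, so the form below is an assumption built only from the listed ingredients (hip center, enclosing-box width and height, image width): coordinates are centered on the hips and divided by the box size, and the horizontal position of the hip center, scaled by the image width, is added back to x. The COCO keypoint ordering (left hip at index 11, right hip at index 12) is also assumed.

```python
LEFT_HIP, RIGHT_HIP = 11, 12  # assumed COCO 17-keypoint ordering


def normalize_keypoints(keypoints, image_width):
    """Normalize (x, y) pixel keypoints of one person.

    Assumed form, not necessarily the paper's exact equation:
        x' = (x - x_c) / w + x_c / W
        y' = (y - y_c) / h
    where (x_c, y_c) is the hip center, (w, h) the size of the box
    enclosing the keypoints, and W the image width.
    """
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    w = (max(xs) - min(xs)) or 1.0  # guard against degenerate boxes
    h = (max(ys) - min(ys)) or 1.0
    x_c = (keypoints[LEFT_HIP][0] + keypoints[RIGHT_HIP][0]) / 2.0
    y_c = (keypoints[LEFT_HIP][1] + keypoints[RIGHT_HIP][1]) / 2.0
    return [((x - x_c) / w + x_c / image_width, (y - y_c) / h)
            for x, y in keypoints]
```

Under this assumed form, moving a person vertically or rescaling them around their hip center leaves the output unchanged, which matches the stated goal of removing person size and vertical location while keeping horizontal position.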
Our architecture is composed of a simple fully-connected network with residual blocks, and includes batch-normalization after every fully connected layer as well as dropout. The structure is inspired by the success in 3D vision tasks using 2D keypoints, especially Martinez et al. for 3D pose estimation, and Bertoni et al. for human 3D localization. The residual blocks improve performance and help avoid overfitting, while the model, which contains approximately 411K trainable parameters, is fast and has a low memory footprint. Its building blocks are shown in Figure 2.
|Method||JAAD ||PIE ||LOOK-KITTI ||LOOK-nuScenes ||LOOK-JRDB ||LOOK|
|Crops (ResNeXt-50 )||79.7||74.2||72.0||85.7||92.5||83.4|
|Keypoints & Crops (ResNet-18 , Fusion: O1)||78.0||75.2||79.7||85.3||91.6||85.5|
|Keypoints & Crops (ResNet-18 , Fusion: O2)||78.7||75.6||78.9||84.3||92.7||85.4|
|Keypoints & Crops (ResNeXt-50 , Fusion: O1)||79.5||75.1||73.6||85.8||92.1||83.8|
|Keypoints & Crops (ResNeXt-50, Fusion: O2)||80.6||75.9||74.1||86.2||93.2||84.5|
|Keypoints & Eyes Crops (Fusion: O1)||83.9||79.9||87.0||91.2||92.5||90.2|
We argue that 2D keypoints are a low-dimensional representation that contains enough information to understand whether a person is looking at the camera. This is motivated by the application we are targeting: autonomous driving scenarios, where people are often further away from the camera and the pupils may not be distinguishable. To test our hypothesis, we develop a modular architecture to optionally include visual information from the head region of a pedestrian. We create a combined method that encodes features from both the keypoints and the cropped head region. While the former branch does not change, we select for the crops the upper third of the bounding box enclosing the keypoints, and we extract the features using a convolutional backbone. We explore different backbone architectures (i.e., AlexNet, ResNet and ResNeXt) and different fusion techniques. As visually described in Figure 2, we experiment with early fusion and with late fusion. In the former option (O1), we sum the visual features extracted from a convolutional backbone with the raw features extracted from the 2D keypoints. In the latter option (O2), we concatenate the visual features together with the features extracted from the last layer of the fully-connected architecture. Our training schedule, inspired by Zamir et al., consists of two steps. We first initialize the keypoint-based and the crop-based branches by training them independently. We then freeze all the layers before the fusion and train the remaining ones. At both stages, we use the binary cross-entropy loss. The combined method allows us to verify whether adding visual information to the keypoint-based method increases the performance.
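The head crop and the two fusion options can be illustrated with a short sketch. The function names are hypothetical, feature vectors are plain lists, and the sum-based option assumes both branches already produce vectors of the same length (how the dimensions are aligned is not described here).

```python
def head_crop_box(box):
    """Upper third of a person box (x1, y1, x2, y2).

    Image coordinates are assumed to grow downward, so the head region
    starts at the top edge y1.
    """
    x1, y1, x2, y2 = box
    return (x1, y1, x2, y1 + (y2 - y1) / 3.0)


def fuse_sum(keypoint_features, visual_features):
    """Option O1 (early fusion): element-wise sum of the two feature vectors."""
    if len(keypoint_features) != len(visual_features):
        raise ValueError("sum fusion needs feature vectors of equal length")
    return [k + v for k, v in zip(keypoint_features, visual_features)]


def fuse_concat(keypoint_features, visual_features):
    """Option O2 (late fusion): concatenation of the two feature vectors."""
    return list(keypoint_features) + list(visual_features)
```

Summation keeps the fused dimension fixed but forces both branches into a shared feature space, while concatenation keeps the branches separate and leaves the mixing to the layers trained after fusion.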
Evaluation metrics. We evaluate pedestrian detection and eye contact classification separately. Some previous methods do not include pedestrian detection and instead use a box classification approach, where the ground-truth boxes are given. In our case, a detection step is also included to ensure fair comparison among different methods. To disentangle the contributions of pedestrian detection and eye contact classification, we split the two tasks. To evaluate the detection results, we use the recall metric with a threshold on intersection over union (IoU) of 0.5. Unlike [4, 22], we do not use the Average Precision (AP) metric for detection, as we only focus on instances labeled with the eye contact attribute, and in any given dataset very far instances are not annotated for it. In this case, the AP metric may penalize extra detections. In the classification setup, we evaluate the set of detected instances that match a ground truth, and we use the AP metric to evaluate the classification of the binary attribute of looking or not at the camera.
We compute results on a balanced test set where negative instances are randomly sampled. The sampling is done 10 times to reduce the variance and the results are averaged.
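The balanced evaluation can be sketched as below. This is an illustration under assumptions: AP is computed as the mean of the precision at the rank of each positive (one common non-interpolated definition, which may differ from the variant used for the reported numbers), labels are 1 for looking and 0 otherwise, and the function names and seed are hypothetical.

```python
import random


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def balanced_ap(labels, scores, n_rounds=10, seed=0):
    """Average AP over several random balanced subsets of the test set.

    Each round keeps all positives and samples as many negatives,
    without replacement.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    aps = []
    for _ in range(n_rounds):
        idx = pos + rng.sample(neg, min(len(pos), len(neg)))
        aps.append(average_precision([labels[i] for i in idx],
                                     [scores[i] for i in idx]))
    return sum(aps) / len(aps)
```

Balancing matters because AP depends on the positive rate: with only 9% to 18% of instances labeled as looking, unbalanced AP values would not be comparable across datasets.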
Regarding the training and testing splits, the official split of the JAAD dataset is composed of 177 videos for training, 29 videos for validation, and 117 videos for testing. For the PIE dataset we use, as recommended, set01, set02 and set04 for training, set05 and set06 for validation, and set03 for testing.
Training uses Nesterov momentum, a learning rate of 0.0001, and mini-batches of 32 instances. For the combined architecture, we pre-train both branches and freeze the early layers before the fusion of the features. We train the last layers with an SGD optimizer, a learning rate of 0.00001, and mini-batches of 128 instances.
We argue that eye contact detection is a crucial task yet to be solved to develop safe autonomous vehicles. However, to the best of our knowledge, very few methods have reported results on the eye contact task in JAAD  or PIE . Rasouli et al.  proposed to use image crops of people as inputs to an AlexNet architecture  followed by fully connected layers. Their published results on the JAAD dataset  used a smaller version of the dataset and randomly split the instances of the dataset. Hence, the same unique pedestrian in different time frames may appear both in training and testing sets. For a fair comparison, we have re-implemented this method and evaluated it on the recently released official JAAD split to prevent any contamination of the testing set.
In addition, we compare against the very recent MTL-Fields developed by Mordan et al. It is a field-based approach that leverages multiple pedestrian attributes in a multi-task fashion, including eye contact, to understand the visual appearances and behaviors of pedestrians. Unlike Rasouli et al., who operate on image crops of people and therefore discard the context around them, MTL-Fields keeps full images in order to understand the scenes and learn interactions between pedestrians. As their code is open-source, we train a network and evaluate it with our setup for eye contact detection.
Our baselines. One of our goals is to compare the properties of keypoints and crops for the eye contact task. Thus, we develop a modular architecture that is either based on keypoints only, or can combine keypoints and visual information together, and we benchmark it with the following baselines:
Crops: when referring to methods using crops only, we consider the approach of Rasouli et al.  with a more recent ResNet  or ResNeXt  backbone. Rasouli et al.  train their model with ground-truth crops without including detection results. For a fair comparison, we train and evaluate the model including the same set of instances provided by the OpenPifPaf detector .
Head & Body Keypoints: we test our keypoint-based architecture with subsets of keypoints, either only including the keypoints of the head region (eyes, nose, ears), or only the ones from the rest of the body. The goal is to analyze whether the body orientation also provides informative cues, or the head keypoints suffice for the eye contact detection task.
Keypoints & Eye Crops: we test whether adding visual information about the region around the eyes could be informative and less prone to overfitting than head crops. From the 2D keypoint locations of the eyes and ears, we crop a small region around the pupils and resize it to a fixed patch of 3x10x30 pixels. The model architecture is consistent with the one shown in Figure 2, but we replace the head crops with the eye crops, and the convolutional backbone with a fully connected block.
|Training Dataset||Instances||Crops (AP on JAAD)||Keypoints (AP on JAAD)|
|JAAD ||50K||79.7 ( - )||85.9 ( - )|
|LOOK-nuScenes ||10K||71.1 (-8.6)||84.6 (-1.3)|
|LOOK||41K||73.7 (-6.0)||86.0 (+0.1)|
|LOOK + PIE ||61K||75.1 (-4.6)||87.7 (+1.8)|
In Table II, we show the results of training and evaluating each method on the same dataset (either JAAD  or our LOOK dataset) as well as cross-dataset results. Our method achieves the best performances when compared to the other baselines, achieving at least a 5% improvement both when testing on the same dataset and when evaluating cross-dataset generalization properties. More notably, our model is able to generalize well on our LOOK dataset when trained on the JAAD dataset  only, as it reaches an AP of 92.5%; almost on par when compared against baselines trained on the LOOK dataset. Qualitative examples from different datasets are shown in Figure 4.
Our recall results are shared with the Rasouli baseline, as we train and evaluate their model on the same instances detected by the off-the-shelf pose detector OpenPifPaf. We observe that MTL-Fields achieves higher recall on JAAD when trained on the same dataset, but the same recall drops by 15% when trained on a different dataset. Our method, on the other hand, maintains high recall when testing domain adaptation, as it leverages an off-the-shelf pose estimator trained and optimized on a general-purpose dataset, which is beneficial for domain adaptation.
|Method / Box Height [px]||240+||160-240||110-160||0-110||All|
|Keypoints & Crops||80.5||82.2||83.6||75.9||80.6|
|Keypoints & Eyes Crops||85.3||86.4||85.0||79.4||83.9|
Each cluster corresponds to one quartile of the distribution. For the crop-based methods, we consider our ResNeXt-50 model with late fusion unless otherwise specified.
We further study cross-dataset generalization with our new LOOK dataset in Tables III and IV. First of all, we investigate the performance of eleven crop- and keypoint-based methods in Table III. We train the models on the JAAD dataset and evaluate them on JAAD, PIE, and our LOOK dataset. Keypoint-based models perform and generalize better than crop-based ones, consistent with the results obtained in Table II. Surprisingly, combining visual information with the keypoints in our combined models (which we refer to as Keypoints & Crops) degrades performance, as it leads to stronger overfitting. Among the combined models, the best results are achieved only by using features from the eye region instead of the head region. We attribute this result to the generalization properties of keypoints, as this low-dimensional representation does not overfit on background scenes or specific faces. Yet even simpler models with no visual information achieve the best results on all the datasets except LOOK-nuScenes.
As an additional experiment in Table IV, we train our best keypoint-based and crop-based methods on datasets other than JAAD, and evaluate them on JAAD. We train our keypoint-based model on 10K instances from the nuScenes dataset, and we obtain less than 2% difference compared to training it on the 50K instances of the JAAD dataset. The best crop-based model, on the other hand, achieves an AP of 71.1%, down from an original 79.7%. When increasing the number of instances from multiple datasets, the performance of the crop-based model never reaches the baseline result of being trained on JAAD only. The keypoint-based model, on the other hand, achieves its best performance on JAAD when trained on different datasets.
The role of distance. We test the hypothesis that crop-based methods may be most effective when people are closer to the camera, while keypoints may be more useful when people are far away and details of the face are less informative. We obtain the distribution of bounding box heights for all the instances of the JAAD test set , and evaluate each quartile separately. As we show in Table V, the hypothesis is not verified: keypoints remain more effective than crops even for people close to the camera. This result may not be intuitive at first sight, but we are analyzing autonomous driving datasets, where even close people may be several meters away from the camera, and detecting the direction of the pupil may not be feasible. Keypoints provide a simple yet effective representation in these scenarios.
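The per-quartile split can be sketched with the standard library. This is an illustrative grouping only: the function name is hypothetical, and the cut points follow the default method of `statistics.quantiles`, which may differ slightly from the bin edges used for the reported table.

```python
import bisect
import statistics


def group_by_height_quartile(box_heights):
    """Map quartile index (0 = shortest boxes, 3 = tallest) to instance indices."""
    cuts = statistics.quantiles(box_heights, n=4)  # three cut points
    groups = {0: [], 1: [], 2: [], 3: []}
    for index, height in enumerate(box_heights):
        groups[bisect.bisect_right(cuts, height)].append(index)
    return groups


heights = [80, 250, 120, 170, 300, 100, 150, 200]  # box heights in pixels
groups = group_by_height_quartile(heights)
```

Each group can then be scored separately with the same classification metric, so that tall boxes (people close to the camera) and short boxes (people far away) are compared on equal footing.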
Saliency map. To verify the impact of each keypoint on the final decision of the model, we compute the absolute value of the gradient of the objective function with respect to each input node for every epoch on the training set. Each keypoint contributes three input components: its x and y coordinates and its confidence score. We then average this value by taking the mean absolute value across all the training instances. The results are illustrated in Figure 3. The dominant keypoints are the eyes and ears, as shown by the magnitude of the gradients of the loss function with respect to each keypoint.
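The gradient-based saliency can be illustrated on a toy model. As a stand-in for the actual network, the sketch below uses a single logistic unit trained with binary cross-entropy, for which the gradient of the loss with respect to input j has the closed form (p - y) * w_j; a real implementation would obtain the same quantity through automatic differentiation. All names are hypothetical.

```python
import math


def input_saliency(weights, bias, inputs, targets):
    """Mean absolute gradient of the BCE loss with respect to each input.

    Stand-in model (an assumption, not the paper's network):
        p = sigmoid(w . x + b), so dL/dx_j = (p - y) * w_j.
    """
    totals = [0.0] * len(weights)
    for x, y in zip(inputs, targets):
        z = bias + sum(w * v for w, v in zip(weights, x))
        p = 1.0 / (1.0 + math.exp(-z))
        for j, w in enumerate(weights):
            totals[j] += abs((p - y) * w)
    return [t / len(inputs) for t in totals]


# Toy inputs: one keypoint with components (x, y, confidence).
saliency = input_saliency(
    weights=[2.0, 0.0, -1.0],
    bias=0.1,
    inputs=[[0.5, 1.0, 0.2], [-0.3, 0.4, 0.9]],
    targets=[1, 0],
)
```

In this toy case the second input has zero weight and therefore zero saliency, while the first input is twice as salient as the third, mirroring how larger gradient magnitudes flag the inputs the decision depends on most.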
|Average Run Time (ms)|
|Rasouli  (S-30)||71.0||602||39||672|
|Our Method (S-30)||85.9||602||0.8||626|
|Our Method (R-50)||82.6||305||0.8||328|
Run Time. Our experiments have been conducted using a machine with a single NVIDIA GeForce GTX 1080 Ti and Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz. In Table VI, we compare the run time performance of several methods on the test set of the JAAD dataset. As the original method from Rasouli et al. did not include a detection step, we use the same backbone to extract the poses for our method and the crops for theirs. For MTL-Fields, detection and classification are performed in a single stage. Our method excels in the classification step with less than 1 ms of inference time, as it uses low-dimensional keypoints. Regarding the detection step, our method is agnostic to the pose detector. We have tested it with OpenPifPaf using two different backbones and achieved the fastest run time with a ResNet-50 (R-50).
Eye contact detection is a practically important task to better understand and forecast human behaviors. In particular, autonomous robots need to solve this task to navigate safely around humans. We have introduced a new deep learning approach for eye contact detection in the wild, i.e., with no prior knowledge of the environment, which is suited to the multiple challenges associated with this task. We start by extracting semantic keypoints from images, and use them as low-dimensional, high-level features to escape the image domain and focus on relevant information. Then, we have compared several architectures to process this representation, including using it in addition to selected image crops. We have also publicly released LOOK, a large-scale dataset for eye contact detection in the wild. We designed it with real-world generalization in mind, by annotating three common autonomous driving datasets to consider cross-dataset training and evaluation, and focus on multiple scenarios and diverse environments. We evaluated our method and several approaches from the literature on LOOK to create a benchmark for this task, and showed state-of-the-art results with robust generalization across datasets compared to image-based approaches. We hope that this new benchmark can help foster further research from the community on this important but overlooked topic.
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
Microsoft coco: common objects in context. In The European Conference on Computer Vision (ECCV).