Gaze estimation code. The PyTorch implementation of "MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation".
Gaze estimation reveals where a person is looking. It is an important clue for understanding human intention. The recent development of deep learning has revolutionized many computer vision tasks, and appearance-based gaze estimation is no exception. However, the field lacks a guideline for designing deep learning algorithms for gaze estimation tasks. In this paper, we present a comprehensive review of appearance-based gaze estimation methods with deep learning. We summarize the processing pipeline and discuss these methods from four perspectives: deep feature extraction, deep neural network architecture design, personal calibration, and device and platform. Since data pre-processing and post-processing methods are crucial for gaze estimation, we also survey face/eye detection methods, data rectification methods, 2D/3D gaze conversion methods, and gaze origin conversion methods. To fairly compare the performance of various gaze estimation approaches, we characterize all the publicly available gaze estimation datasets and collect the code of typical gaze estimation algorithms. We implement these codes and set up a benchmark that converts the results of different methods into the same evaluation metrics. This paper serves not only as a reference for developing deep learning-based gaze estimation methods but also as a guideline for future gaze estimation research. Implemented methods and data processing codes are available at http://phi-ai.org/GazeHub.
Gaze estimation code. The PyTorch implementation of "It’s written all over your face: Full-face appearance-based gaze estimation".
Gaze estimation code. The PyTorch implementation of "Eye Tracking for Everyone".
Gaze estimation code. The PyTorch implementation of "Appearance-Based Gaze Estimation Using Dilated-Convolutions".
Gaze estimation code. The PyTorch implementation of "Gaze360: Physically Unconstrained Gaze Estimation in the Wild".
Eye gaze is one of the most important non-verbal communication cues. It contains rich information of human intent that enables researchers to gain insights into human cognition [1, 2] and behavior [3, 4]. It is widely demanded by various applications, e.g., human-computer interaction [5, 6, 7] and head-mounted devices [8, 9, 10]. To enable such applications, accurate gaze estimation methods are critical.
Over the last decades, a plethora of gaze estimation methods has been proposed. These methods usually fall into three categories: 3D eye model recovery-based methods, 2D eye feature regression-based methods and appearance-based methods. 3D eye model recovery-based methods construct a geometric 3D eye model and estimate gaze directions based on the model. The 3D eye model is usually person-specific due to the diversity of human eyes. Therefore, these methods usually require personal calibration to recover person-specific parameters such as iris radius and kappa angle. The 3D eye model recovery-based methods usually achieve reasonable accuracy, but they require dedicated devices such as infrared cameras. The 2D eye feature regression-based methods usually have the same device requirements as 3D eye model recovery-based methods. They directly use detected geometric eye features such as the pupil center and glint to regress the point of gaze (PoG). They do not require geometric calibration for converting gaze directions into PoG.
Appearance-based methods do not require dedicated devices. Instead, they use off-the-shelf web cameras to capture human eye appearance and regress gaze from the appearance. Although the setup is simple, it usually requires the following components: 1) An effective feature extractor to extract gaze features from high-dimensional raw image data. Some feature extractors such as histograms of oriented gradients are used in conventional methods. However, they cannot effectively extract high-level gaze features from images. 2) A robust regression function to learn the mapping from appearance to human gaze. It is non-trivial to map the high-dimensional eye appearance to the low-dimensional gaze. Many regression functions have been used to regress gaze from appearance, e.g., local linear interpolation, adaptive linear regression and Gaussian process regression, but the regression performance is barely satisfactory. 3) A large number of training samples to learn the regression function. Conventional methods usually collect personal samples with a time-consuming personal calibration and learn a person-specific gaze estimation model. Some studies seek to reduce the number of training samples. Lu et al. propose an adaptive linear regression method to select an optimal set of sparsest training samples for interpolation. However, this also limits the usage in real-world applications.
Recently, deep learning-based gaze estimation approaches have become a research hotspot. Compared with conventional appearance-based methods, deep learning-based methods demonstrate many advantages. 1) They can extract high-level abstract gaze features from high-dimensional images. 2) They learn a highly non-linear mapping function from eye appearance to gaze. These advantages make deep learning-based methods more robust and accurate than conventional methods. Conventional appearance-based methods often suffer a performance drop under head motion, while deep learning-based methods tolerate head movement to some extent. Deep learning-based methods also improve cross-subject gaze estimation performance by a large margin. These improvements greatly expand the application range of appearance-based gaze estimation methods.
In this paper, we provide a comprehensive review of appearance-based gaze estimation methods in deep learning. As shown in Fig. 1, we discuss these methods from four perspectives: 1) deep feature extraction, 2) deep neural network architecture design, 3) personal calibration, and 4) device and platform. From the deep feature extraction perspective, we describe how current methods extract effective features. We divide the raw input into eye images, face images and videos, and review the algorithms for extracting high-level features from each of the three. From the deep neural network architecture design perspective, we review advanced CNN models. According to the supervision mechanism, we respectively review supervised, self-supervised, semi-supervised and unsupervised gaze estimation methods. We also describe different CNN architectures in gaze estimation, including multi-task CNNs and recurrent CNNs. In addition, some methods integrate CNN models with prior knowledge of gaze. These methods are also introduced in this part. From the personal calibration perspective, we describe how to use calibration samples to further improve the performance of CNNs. We also introduce methods that integrate a user-unaware calibration sample collection mechanism. Finally, from the device and platform perspective, we consider different cameras, i.e., RGB cameras, IR cameras and depth cameras, and different platforms, i.e., computers, mobile devices and head-mounted devices. We review the advanced methods that use these cameras and that are proposed for these platforms.
Besides deep learning-based gaze estimation methods, we also focus on the practice of gaze estimation. We first review the data pre-processing methods of gaze estimation, including face and eye detection methods and common data rectification methods. Then, considering the various forms of human gaze, e.g., gaze direction and PoG, we further provide data post-processing methods, which describe the geometric conversion between these forms. We also build gaze estimation benchmarks based on the data post-processing methods. We collect and implement the code of typical gaze estimation methods and evaluate them on various datasets. For the different kinds of gaze estimation methods, we use the data post-processing methods to convert their results for comparison. The benchmark provides a comprehensive and fair comparison between state-of-the-art gaze estimation methods.
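As one illustration of the geometric conversion between gaze direction and PoG mentioned above, the following minimal sketch intersects a 3D gaze ray with a screen plane. The function names, the shared coordinate system, and the description of the screen by a point and a normal vector are assumptions made for illustration; they are not taken from the benchmark code.

```python
def _dot(a, b):
    """Dot product of two 3-element sequences."""
    return sum(x * y for x, y in zip(a, b))


def gaze_to_pog(origin, direction, plane_point, plane_normal):
    """Intersect the gaze ray origin + t * direction with a screen plane.

    All arguments are 3-element sequences expressed in the same
    (assumed) camera coordinate system. Returns the 3D intersection
    point, or None if the ray is parallel to the plane or the plane
    lies behind the gaze origin.
    """
    denom = _dot(direction, plane_normal)
    if abs(denom) < 1e-12:
        return None  # gaze ray is parallel to the screen plane
    diff = [p - o for p, o in zip(plane_point, origin)]
    t = _dot(diff, plane_normal) / denom
    if t < 0:
        return None  # the screen is behind the gaze origin
    return [o + t * d for o, d in zip(origin, direction)]
```

For example, a gaze origin 600 mm in front of a screen lying in the plane z = 0, with direction (0.1, 0, -1), yields a point 60 mm along the x axis of the screen. Converting that 3D point to pixel coordinates would additionally require the screen size and resolution, which this sketch leaves out.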
The paper is organized as follows. Section II introduces the background of gaze estimation, including the development and categories of gaze estimation methods. Section III reviews the state-of-the-art deep learning-based methods. In Section IV, we introduce the public datasets as well as data pre-processing and post-processing methods. We also build the benchmark in this section. In Section V, we summarize the development of current deep learning-based methods and recommend future research directions. This paper can serve not only as a reference for developing deep learning-based gaze estimation methods, but also as a guideline for future gaze estimation research.
Gaze estimation research has a long history. Figure 2 illustrates the development of gaze estimation methods. Early gaze estimation methods rely on detecting eye movement patterns such as fixation, saccade and smooth pursuit. They attach sensors around the eye and use potential differences to measure eye movement [17, 18]. With the development of computer vision technology, modern eye-tracking devices have emerged. These devices usually estimate gaze using the eye/face images captured by a camera. In general, there are two types of such devices: the remote eye tracker and the head-mounted eye tracker. The remote eye tracker usually keeps a certain distance from the user, typically 60 cm. The head-mounted eye tracker usually mounts the cameras on a frame of glasses. Compared to the intrusive eye tracking devices, modern eye trackers with computer vision-based methods greatly enlarge the range of application.
Computer vision-based methods can be further divided into three types: the 2D eye feature regression method, the 3D eye model recovery method and the appearance-based method. The first two types of methods estimate gaze based on detecting geometric features such as contours, reflections and eye corners. The geometric features can be accurately extracted with the assistance of dedicated devices, e.g., infrared cameras. In detail, the 2D eye feature regression method learns a mapping function from the geometric features to the human gaze, e.g., polynomials [19, 20] and neural networks. The 3D eye model recovery method builds a subject-specific geometric eye model to estimate the human gaze. The eye model is fitted with geometric features, such as the infrared corneal reflections [22, 23], pupil center and iris contours. In addition, the eye model contains subject-specific parameters such as cornea radius and kappa angles. Therefore, it usually requires time-consuming personal calibration to estimate these subject-specific parameters for each subject.
Appearance-based methods directly learn a mapping function from images to human gaze. Different from 2D eye feature regression methods, appearance-based methods do not require dedicated devices for detecting geometric features. They use image features such as image pixels or deep features to regress gaze. Various regression models have been used, e.g., the neural network, the Gaussian process regression model, the adaptive linear regression model and the convolutional neural network. However, this is still a challenging task due to the complex eye appearance.
Appearance-based methods directly learn the mapping function from eye appearance to human gaze. As early as 1994, Baluja et al. propose a neural network and collect 2,000 samples for training. Tan et al. use a linear function to interpolate unknown gaze position using 252 training samples. Early appearance-based methods usually learn a subject-specific mapping function. They require a time-consuming calibration to collect the training samples of the specific subject. To reduce the number of training samples, Williams et al. introduce semi-supervised Gaussian process regression methods. Sugano et al. propose a method that combines gaze estimation with saliency. Lu et al. propose an adaptive linear regression method to select an optimal set of sparsest training samples for interpolation. However, these methods only show reasonable performance in a constrained environment, i.e., fixed head pose and the specific subject. Their performance significantly degrades when tested in an unconstrained environment. This problem remains challenging in appearance-based gaze estimation.
To address the performance degradation problem across subjects, Funes et al. presented a cross-subject training method. However, the reported mean error is larger than 10 degrees. Sugano et al. introduce a learning-by-synthesis method. They use a large amount of synthetic cross-subject data to train their model. Lu et al. employ a sparse auto-encoder to learn a set of bases from eye image patches and reconstruct the eye image using these bases. To tackle the head motion problem, Sugano et al. cluster the training samples with similar head poses and interpolate the gaze in a local manifold. Lu et al. suggest initiating the estimation with the original training images and compensating for the bias via regression. Lu et al. further propose a novel gaze estimation method that handles free head motion via eye image synthesis using a single camera.
Appearance-based gaze estimation suffers from many challenges, such as head motion and subject differences, especially in the unconstrained environment. These factors have a large impact on eye appearance and make it more complex. Conventional appearance-based methods cannot handle these challenges gracefully due to their weak fitting ability.
Convolutional neural networks (CNNs) have been used in many computer vision tasks and demonstrate outstanding performance. Zhang et al. propose the first CNN-based gaze estimation method to regress gaze directions from eye images. They use a simple CNN, and its performance surpasses most of the conventional appearance-based approaches. Following this study, an increasing number of improvements and extensions of CNN-based gaze estimation methods have emerged. Face images and videos are used as input to the CNN for gaze estimation. These inputs provide more valuable information than eye images alone. Some methods are proposed for handling the challenges in an unconstrained environment. For example, Cheng et al. use asymmetric regression to handle extreme head poses and illumination conditions. Park et al. learn a pictorial eye representation to alleviate the personal appearance difference. Calibration-based methods are also proposed to learn a subject-specific CNN model [39, 40]. The vulnerability of appearance-based gaze estimation has also been studied.
We survey current deep learning-based gaze estimation methods in this section. We introduce these methods from four perspectives: deep feature extraction, deep neural network architecture design, personal calibration, and device and platform. Figure 3 gives an overview of this section.
Feature extraction is critical in most of the learning-based tasks. Effectively extracting features from eye appearance is challenging due to the complex eye appearance. The quality of the extracted features determines the gaze estimation accuracy. Here, we summarize the feature extraction method according to the types of input into the deep neural network: eye images, face images and videos.
The gaze direction is highly correlated with the eye appearance. Any perturbation in gaze direction results in eye appearance changes. For example, the rotation of the eyeball changes the location of the iris and the shape of the eyelid, and these appearance changes reflect the change in gaze direction. This relationship makes it possible to estimate gaze from eye appearance. Conventional methods usually estimate gaze from high-dimensional raw image features. These features are directly generated from eye images by raster scanning all the pixels [15, 44]. The features are highly redundant and cannot handle environmental changes.
Deep learning-based methods automatically extract deep features from eye images. Zhang et al. proposed the first deep learning-based gaze estimation method. They employ a CNN to extract the features from grey-scale single eye images and concatenate these features with an estimated head pose. As with most deep learning tasks, a deeper network structure and a larger receptive field allow more informative features to be extracted. Zhang et al. further extend their previous work and present GazeNet, a 13-convolutional-layer neural network inherited from a 16-layer VGG network, as shown in Fig. 4 (a). They demonstrate that GazeNet outperforms the earlier LeNet-based approach. Chen et al. use dilated convolutions to extract high-level eye features, which efficiently increases the receptive field size of the convolutional filters without reducing spatial resolution.
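To make the effect of dilation on the receptive field concrete, the short sketch below computes the receptive field of a stack of stride-1 convolutions. The layer configurations in the usage note are hypothetical and are not the ones used by Chen et al.; the formula is the standard one for stride-1 layers, where each layer adds (kernel_size - 1) * dilation pixels.

```python
def receptive_field(layers):
    """Receptive field (in pixels, along one axis) of stacked stride-1 convolutions.

    layers: iterable of (kernel_size, dilation) pairs, one per layer.
    With stride 1 the spatial resolution is preserved, and each layer
    widens the receptive field by (kernel_size - 1) * dilation.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf
```

Under these assumptions, three plain 3x3 layers, `receptive_field([(3, 1), (3, 1), (3, 1)])`, cover 7 pixels, while the same three layers with dilations 1, 2 and 4 cover 15 pixels with the same number of weights.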
Early deep learning-based methods estimate the gaze from a single eye image. Recent studies found that concatenating the features of two eyes helps to improve the gaze estimation accuracy [47, 48]. Fischer et al. employ two VGG-16 networks to extract the individual features from two eye images, and concatenate the two eye features for regression. Cheng et al. build a four-stream CNN for extracting features from two eye images. Two streams are used for extracting individual features from the left and right eye images, and the other two streams are used for extracting joint features of the two eye images. They claim that the two eyes are asymmetric. Thus, they propose an asymmetric regression and evaluation network to extract different features from the two eyes. However, the studies in [47, 48] simply concatenate the left and right eye features to form new feature vectors, whereas more recent studies propose to use attention mechanisms for fusing the two eye features. Cheng et al. argue that the weights of the two eye features are determined by face images due to the specific task considered, so they assign weights with the guidance of facial features. Bao et al. propose a self-attention mechanism to fuse the two eye features. They concatenate the feature maps of the two eyes and use a convolution layer to generate the weights of the feature map.
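The difference between plain concatenation and weighted fusion can be sketched as follows. This is a simplified stand-in for attention-based fusion, not the method of any cited paper: the two scalar scores are assumed to come from some upstream module (for example one driven by facial features), and the function name is hypothetical.

```python
import math


def fuse_eye_features(left, right, left_score, right_score):
    """Fuse two equal-length feature vectors with softmax attention weights.

    left_score and right_score are unnormalized scalar scores assumed to be
    produced elsewhere. Returns the fused vector and the two weights.
    """
    if len(left) != len(right):
        raise ValueError("feature vectors must have the same length")
    # Subtract the maximum score before exponentiating for numerical stability.
    m = max(left_score, right_score)
    e_left = math.exp(left_score - m)
    e_right = math.exp(right_score - m)
    w_left = e_left / (e_left + e_right)
    w_right = 1.0 - w_left
    fused = [w_left * a + w_right * b for a, b in zip(left, right)]
    return fused, (w_left, w_right)
```

With equal scores the result is the element-wise mean of the two vectors; as one score grows, the fused feature approaches the feature of the corresponding eye, which is the behavior a concatenation-only design cannot express.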
The above-mentioned methods extract general features from eye images, while some works explore extracting special features to handle head motion and subject differences. Extracting subject-invariant gaze features has become a research hotspot. Eye appearance varies considerably across different people. The ultimate solution is to collect training data that covers the whole data space; however, this is practically impossible. Several studies have attempted to extract subject-invariant features from eye images [38, 51, 40]. Park et al. convert the original eye images into a unified gaze representation, which is a pictorial representation of the eyeball, the iris and the pupil. They regress the gaze direction from the pictorial representation. Wang et al. propose an adversarial learning approach to extract the domain/person-invariant feature. Another study uses an autoencoder to learn a compact latent representation of gaze, head pose and appearance. It introduces a geometric constraint on gaze representations, i.e., the rotation matrix between the two given images transforms the gaze representation of one image to another. In addition, some methods use a GAN to pre-process eye images to handle specific environmental factors. Kim et al. utilize a GAN to convert low-light eye images into bright eye images. Rangesh et al. use a GAN to remove eyeglasses.
Besides the supervised approaches for extracting gaze features, unannotated eye images have also been used for learning gaze representations. Yu et al. propose to use the difference of gaze representations from two eyes as input to a gaze redirection network. They use the unannotated eye images to perform the unsupervised gaze representation learning.
Face images contain head pose information that also contributes to gaze estimation. Conventional methods have explored extracting features from face images. They usually extract features such as head pose and facial landmarks [55, 56, 57]. The early eye image-based method uses the estimated head pose as an additional input. However, this feature has been shown to be useless for deep learning-based methods. Some studies directly use face images as input and employ a CNN to automatically extract deep facial features [43, 35], as shown in Fig. 4 (b). These approaches demonstrate improved performance over approaches that only use eye images.
Face images contain redundant information. Researchers have attempted to filter out the useless features in face images [43, 58]. Zhang et al. propose a spatial weighting mechanism to efficiently encode the location of the face into a standard CNN architecture. The system learns spatial weights based on the activation maps of the convolutional layers. This helps to suppress the noise and enhance the contribution of the highly activated regions. Zhang et al. propose a learning-based region selection method. They dynamically select suitable sub-regions from the facial region for gaze estimation. Cheng et al. propose a plug-and-play self-adversarial network to purify facial features. Their network simultaneously removes all image features and preserves gaze-relevant features. As a result, this optimization mechanism implicitly removes the gaze-irrelevant features and improves the robustness of gaze estimation networks.
Some studies crop the eye image out of the face images and directly feed it into the network. These works usually use a three-stream network to extract features from face images, left and right eye images, respectively as shown in Fig. 4 (c) [35, 46, 61, 62, 63]. Besides, Deng et al.  decompose gaze directions into the head rotation and eyeball rotation. They use face images to estimate the head rotation and eye images to estimate the eyeball rotation. These two rotations are aggregated into a gaze vector through a gaze transformation layer. Cheng et al.  propose a coarse-to-fine gaze estimation method. They first use a CNN to extract facial features from face images and estimate a basic gaze direction, then they refine the basic gaze direction using eye features. The whole process is generalized as a bi-gram model and they use GRU  to build the network.
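The decomposition of gaze into head rotation and eyeball rotation can be illustrated with a small geometric sketch. It assumes the head pose is given as pitch and yaw angles in radians, that the eyeball direction is a unit vector expressed in the head coordinate system, and that rotations compose as yaw about the y axis after pitch about the x axis. These conventions and all function names are illustrative choices, not the definition of the gaze transformation layer in the cited work.

```python
import math


def rot_x(angle):
    """Rotation matrix about the x axis (pitch)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rot_y(angle):
    """Rotation matrix about the y axis (yaw)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def mat_mul(a, b):
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def mat_vec(m, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def compose_gaze(head_pitch, head_yaw, eye_dir_in_head):
    """Rotate the eyeball direction from head coordinates into camera coordinates."""
    head_rot = mat_mul(rot_y(head_yaw), rot_x(head_pitch))
    return mat_vec(head_rot, eye_dir_in_head)
```

With a frontal head pose the eyeball direction passes through unchanged, and because the head rotation is orthonormal the output remains a unit vector whenever the input is one.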
Facial landmarks have also been used as additional features to model the head pose and eye position. Palmero et al. directly combine individual streams (face, eyes region and face landmarks) in a CNN . Dias et al. extract the facial landmarks and directly regress gaze from the landmarks . The network outputs the gaze direction as well as an estimation of its own prediction uncertainty. Jyoti et al. further extract geometric features from the facial landmark locations . The geometric feature includes the angles between the pupil center as the reference point and the facial landmarks of the eyes and the tip of the nose. The detected facial landmarks can also be used for unsupervised gaze representation learning. Dubey et al.  collect the face images from the web and annotate their gaze zone based on the detected landmarks. They perform gaze zone classification tasks on the dataset for unsupervised gaze representation learning. In addition, since the cropped face image does not contain face position information, Krafka et al.  propose the iTracker system, which combines the information from left/right eye images, face images as well as face grid information. The face grid indicates the position of the face region in the captured image and it is usually used in PoG estimation.
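The face grid described above can be sketched as a binary mask over a coarse grid of the captured frame. The grid size of 25, the (x, y, w, h) pixel box format and the function name are assumptions chosen for illustration; this is not the iTracker preprocessing code.

```python
import math


def face_grid(frame_w, frame_h, face_box, grid_size=25):
    """Binary grid marking where the face box lies in the captured frame.

    face_box is (x, y, w, h) in pixels, with (x, y) the top-left corner.
    Returns a grid_size x grid_size list of lists containing 0 or 1.
    """
    x, y, w, h = face_box
    grid = [[0] * grid_size for _ in range(grid_size)]
    # Convert the pixel bounds of the box to grid cell indices.
    col_start = int(x * grid_size / frame_w)
    row_start = int(y * grid_size / frame_h)
    col_end = int(math.ceil((x + w) * grid_size / frame_w))
    row_end = int(math.ceil((y + h) * grid_size / frame_h))
    # Clip to the grid so boxes that extend past the frame do not fail.
    for r in range(max(row_start, 0), min(row_end, grid_size)):
        for c in range(max(col_start, 0), min(col_end, grid_size)):
            grid[r][c] = 1
    return grid
```

Because the mask encodes both where the face is and how large it appears, it gives the network a rough cue about head position and distance that the cropped face image alone does not carry.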
Besides the static features obtained from the images, temporal information from videos also contributes to better gaze estimates. Recurrent neural networks (RNNs), e.g., long short-term memory (LSTM), have been widely used in video processing [36, 70]. As shown in Fig. 5, these methods usually use a CNN to extract the features from the face images at each frame, and then input these features into an RNN. The temporal information is automatically captured by the RNN for gaze estimation.
Temporal features such as the optical flow and eye movement dynamics have been used to improve gaze estimation accuracy. The optical flow provides the motion information between the frames. Wang et al. use optical flow constraints with 2D facial features to reconstruct the 3D face structure based on the input video frames. Eye movement dynamics, such as fixation, saccade and smooth pursuit, have also been used to improve gaze estimation accuracy. Wang et al. propose to leverage eye movement to generalize eye tracking algorithms to new subjects. They use a dynamic gaze transition network to capture the underlying eye movement dynamics, which serves as prior knowledge. They also propose a static gaze estimation network, which estimates gaze based on a static frame. By combining these two networks, they achieve better estimation accuracy than with the static gaze estimation network alone. The combination of the two networks is solved as a standard inference problem of a linear dynamic system or Kalman filter.
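The idea of combining per-frame static estimates with a dynamic model through Kalman filtering can be sketched in one dimension. This is a deliberate simplification: the dynamic model here is a random walk with hand-set noise variances, whereas the cited work learns the gaze transition with a network. The function name and default variances are hypothetical.

```python
def kalman_smooth(static_estimates, process_var=1.0, meas_var=4.0):
    """Filter a sequence of per-frame gaze angles with a 1D Kalman filter.

    static_estimates: gaze angles (e.g. yaw in degrees) from a static estimator.
    process_var: assumed variance of frame-to-frame gaze change.
    meas_var: assumed variance of the static estimator's error.
    Returns the filtered angle for every frame.
    """
    if not static_estimates:
        return []
    state = static_estimates[0]  # initialize with the first measurement
    var = meas_var
    filtered = [state]
    for z in static_estimates[1:]:
        # Predict: the random-walk model keeps the state and grows the variance.
        var = var + process_var
        # Update: blend the prediction with the new static estimate.
        gain = var / (var + meas_var)
        state = state + gain * (z - state)
        var = (1.0 - gain) * var
        filtered.append(state)
    return filtered
```

A larger `meas_var` makes the filter trust the dynamic prior more and smooths the output, while a larger `process_var` lets the output follow rapid gaze changes such as saccades more closely.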
Convolutional neural networks have been widely used in many computer vision tasks, such as object recognition [74, 75] and image segmentation [76, 77]. They also demonstrate superior performance in the field of gaze estimation. In this section, we first review the existing gaze estimation methods from the learning strategy perspective, i.e., the supervised CNNs and the semi-/self-/un-supervised CNNs. Then we introduce the different network architectures, i.e., multi-task CNNs and recurrent CNNs for gaze estimation. In the last part of this section, we discuss the CNNs that integrate prior knowledge to improve performance.
Supervised CNNs are the most commonly used networks in appearance-based gaze estimation [26, 78, 79, 80]. Fig. 4 also shows the typical architecture of a supervised gaze estimation CNN. The network is trained using image samples with ground truth gaze directions. The gaze estimation problem is essentially learning a mapping function from raw images to the human gaze. Therefore, similar to other computer vision tasks, a deeper CNN architecture usually achieves better performance. A number of CNN architectures that have been proposed for typical computer vision tasks also show great success in the gaze estimation task, e.g., LeNet, AlexNet, VGG, ResNet18 and ResNet50. Besides, some well-designed modules also help to improve the estimation accuracy [46, 49, 83, 84]. For example, Chen et al. propose to use dilated convolution to extract features from eye images, and Cheng et al. propose an attention module for fusing two eye features.
To supervise the CNN during training, the system requires a large-scale labeled dataset. Several large-scale datasets have been proposed, such as MPIIGaze and GazeCapture. However, it is difficult and time-consuming to collect enough gaze data in practical applications. Inspired by the physiological eye model, some researchers propose to synthesize labeled photo-realistic images [30, 86, 87]. These methods usually build eye-region models and render new images from these models. One such method is proposed by Sugano et al. They synthesize dense multi-view eye images by recovering the 3D shape of eye regions, where they use a patch-based multi-view stereo algorithm to reconstruct the 3D shape from eight multi-view images. However, they did not consider environmental changes. Wood et al. propose to synthesize close-up eye images for a wide range of head poses, gaze directions and illuminations to develop a robust gaze estimation algorithm. Following this work, Wood et al. further propose another system named UnityEye to rapidly synthesize large amounts of eye images of various eye regions. To make the synthesized images more realistic, Shrivastava et al91]. These methods serve as data augmentation tools to improve the performance of gaze estimation.
Semi-supervised, self-supervised and unsupervised CNNs rely more on unlabeled images to boost the gaze estimation performance. Collecting large-scale labeled images is expensive. In contrast, unlabeled images are cost-efficient to collect, since they can be easily captured using web cameras.
Semi-supervised CNNs require both labeled and unlabeled images for optimizing networks. Wang et al. propose an adversarial learning approach for semi-supervised learning to improve the model performance on the target subject/dataset. As shown in Fig. 6, it requires labeled images in the training set as well as unlabeled images of the target subject/dataset. They annotate the source of unlabeled images as “target” and that of labeled images as “training set”. To be more specific, they use the labeled data to supervise the gaze estimation network and design an adversarial module for semi-supervised learning. Given the features used for gaze estimation, the adversarial module tries to distinguish their source, while the gaze estimation network aims to extract subject/dataset-invariant features to fool the module.
Self-supervised CNNs aim to formulate a pretext auxiliary learning task to improve the estimation performance. Cheng et al. propose a self-supervised asymmetry regression network for gaze estimation. As shown in Fig. 7, the network contains a regression network to estimate the two eyes’ gaze directions, and an evaluation network to assess the reliability of the two eyes. During training, the result of the regression network is used to supervise the evaluation network, and the accuracy of the evaluation network determines the learning rate in the regression network. They simultaneously train the two networks and improve the regression performance without additional inference parameters. Xiong et al. introduce a random effect parameter to learn the person-specific information in gaze estimation. During training, they utilize the variational expectation-maximization algorithm [94] to estimate the parameters of the random effect network. After training, they use another network to predict the random effect based on the feature representation of eye images. The self-supervised strategy predicts the random effects to enhance the accuracy for unseen subjects. He et al. introduce a person-specific user embedding mechanism. They concatenate the user embedding with appearance features to estimate gaze. They also build a teacher-student network, where the teacher network optimizes the user embedding during training and the student network learns the user embedding from the teacher network.
Unsupervised CNNs only require unlabeled data for training. Nevertheless, it is hard to optimize CNNs without the ground truth, so many specific tasks are designed for unsupervised CNNs. Dubey et al. collect unlabeled facial images from webpages. They roughly annotate the gaze region based on the detected landmarks. Therefore, they can perform the classical supervised task for gaze representation learning. Yu et al. utilize a pre-trained gaze redirection network to perform unsupervised gaze representation learning. As shown in Fig. 8, they use the gaze representation difference of the input and target images as the redirection variables. Given the input image and the gaze representation difference, the gaze network reconstructs the target image. Therefore, the reconstruction task supervises the optimization of the gaze representation network. Note that these approaches learn the gaze representation, but they also require a few labeled samples to fine-tune the final gaze estimator.
Multi-task learning usually contains multiple tasks that provide related domain information as inductive bias to improve model generalization [96, 97]. Some auxiliary tasks are proposed for improving model generalization in gaze estimation. Lian et al. propose a multi-task multi-view network for gaze estimation . They estimate gaze directions based on single-view eye images and PoG from multi-view eye images. They also propose another multi-task CNN to estimate PoG using depth images. They design an additional task to leverage facial features to refine depth images. The network produces four features for gaze estimation, which are extracted from the facial images, the left/right eye images and the depth images.
Some works seek to decompose the gaze into multiple related features and construct multi-task CNNs to estimate these features. Yu et al. introduce a constrained landmark-gaze model for modeling the joint variation of eye landmark locations and gaze directions. As shown in Fig. 9, they build a multi-task CNN to estimate the coefficients of the landmark-gaze model as well as the scale and translation information to align eye landmarks. Finally, the landmark-gaze model serves as a decoder to calculate gaze from the estimated parameters. Deng et al. decompose the gaze direction into eyeball movement and head pose. They design a multi-task CNN to estimate the eyeball movement from eye images and the head pose from facial images. The gaze direction is computed from eyeball movement and head pose using a geometric transformation. Wu et al. propose a multi-task CNN that simultaneously segments the eye part, detects the IR LED glints, and estimates the pupil and cornea center. The gaze direction is recovered from the reconstructed eye model.
Other works perform multiple gaze-related tasks simultaneously. Recasens et al. present an approach for following gaze in video by predicting where a person (in the video) is looking, even when the object is in a different frame. They build a CNN to predict the gaze location in each frame and the probability that each frame contains the gazed object. Also, visual saliency shows strong correlation with human gaze in scene images [102, 103]. Other work estimates the general visual attention and human gaze directions in images at the same time. Kellnhofer et al. propose a dynamic 3D gaze network that includes temporal information. They use a bi-LSTM to process a sequence of 7 frames. The extracted feature is used to estimate not only the gaze direction of the central frame but also the gaze uncertainty.
Human eye gaze is continuous. This inspires researchers to improve gaze estimation performance by using temporal information. Recently, recurrent neural networks have shown great capability in handling sequential data. Thus, some researchers employ recurrent CNNs to estimate the gaze in videos [66, 36, 70].
Here, we give a typical example of the data processing workflow. Given a sequence of frames $\{X_1, X_2, \ldots, X_N\}$, a united CNN $f_U$ is used to extract a feature vector from each frame, i.e., $x_t = f_U(X_t)$. These feature vectors are fed into a recurrent neural network $f_R$ and the network outputs the gaze vector, i.e., $g_i = f_R(x_1, x_2, \ldots, x_N)$, where the index $i$ can be set according to specific tasks, e.g., $i = N$ (the last frame) or the index of the central frame. An example is also shown in Fig. 5.
Different types of input have been explored to extract features. Kellnhofer et al. directly extract features from facial images. Zhou et al. combine the features extracted from facial and eye images. Palmero et al. use facial images, binocular images and facial landmarks to generate the feature vectors. Different RNN structures have also been explored, such as GRU, LSTM and bi-LSTM. Cheng et al. leverage the recurrent CNN to improve gaze estimation performance from static images rather than videos. They formulate gaze estimation as a sequential coarse-to-fine process and use a GRU to relate the basic gaze direction estimated from facial images and the gaze residual estimated from eye images.
Decomposition of Gaze Direction. The human gaze can be decomposed into the head pose and the eyeball orientation. Deng et al. use two CNNs to estimate the head pose from facial images and the eyeball orientation from eye images, respectively. Then, they integrate the two results into the final gaze with a geometric transformation.
Anatomical Eye Model. The human eye is composed of the eyeball, the iris, the pupil center, etc. Park et al. propose a pictorial gaze representation based on the eye model to predict the gaze direction. They render the eye model to generate a pictorial image, which eliminates the appearance variance. They use one CNN to map the original images into the pictorial images and another CNN to estimate gaze directions from the pictorial image.
Eye Movement Pattern. Common eye movements, such as fixation, saccade and smooth pursuit, are independent of viewing content and subjects. Wang et al. propose to incorporate the generic eye movement pattern in dynamic gaze estimation. They recover the eye movement pattern from videos and use a CNN to estimate gaze from static images.
Two-eye asymmetry property. Cheng et al. discover the "two-eye asymmetry" property: the appearances of the two eyes are different while their gaze directions are approximately the same. Based on this observation, they propose to treat the two eyes asymmetrically in the CNN. They design an asymmetry regression network that adaptively weights the two eyes based on their performance. They also design an evaluation network for evaluating the asymmetric state of the regression network.
Gaze data distribution. The basic assumption of most regression models is that the data are independent and identically distributed (i.i.d.); however, gaze data is not i.i.d. Xiong et al. discuss the non-i.i.d. problem and design a mixed-effect model to take the person-specific information into account.
Inter-subject bias. Chen et al. observe the inter-subject bias in most datasets . They make the assumption that there exists a subject-dependent bias that cannot be estimated from images. Thus, they propose a gaze decomposition method. They decompose the gaze into the subject-dependent bias and the subject-independent gaze estimated from images. During test, they use some image samples to calibrate the subject-dependent bias.
It is non-trivial to learn an accurate and universal gaze estimation model. Conventional 3D eye model recovery methods usually build a unified gaze model including subject-specific parameters such as eyeball radius . They perform a personal calibration to estimate these subject-specific parameters. In the field of deep learning-based gaze estimation, personal calibration is also explored to improve person-specific performance. Fig. 10 shows a common pipeline of personal calibration in deep learning.
The calibration problem can be considered as a domain adaption problem, where the training set is the source domain and the test set is the target domain. The test set usually contains unseen subjects (the cross-person problem) or unseen environments (the cross-dataset problem). Researchers aim to improve the performance in the target domain using the calibration samples.
The common approach to domain adaption is to fine-tune the model in the target domain [35, 108, 109]. This is simple but effective. Krafka et al. replace the fully-connected layer with an SVM and fine-tune the SVM layer to predict the gaze location. Zhang et al. split the CNN into three parts: the encoder, the feature extractor, and the decoder. They fine-tune the encoder and decoder in each target domain. Zhang et al. also learn a third-order polynomial mapping function between the estimated and ground-truth 2D gaze locations. Some studies introduce person-specific features for gaze estimation [110, 111] and learn them during fine-tuning. Linden et al. introduce a user embedding for recording personal information. They obtain the user embedding of unseen subjects by fine-tuning with calibration samples. Chen et al. observe that different subjects have different gaze distributions. They use the calibration samples to estimate, for each subject, the bias between the estimated gaze and the ground truth, and use this bias to refine the estimates. In addition, Yu et al. generate additional calibration samples by synthesizing gaze-redirected eye images from the existing calibration samples. The generated samples are also directly used for training. These methods all need labeled samples for supervised calibration.
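The bias-based refinement mentioned above can be illustrated with a few lines of Python. This is a minimal sketch and not the cited authors' implementation: it assumes gaze is stored as a (yaw, pitch) pair in degrees and that the bias is a simple mean offset, and the function names `estimate_bias` and `apply_bias` are hypothetical.

```python
def estimate_bias(predictions, ground_truths):
    """Mean offset (ground truth minus prediction) over the calibration samples.

    Assumption: each gaze is a sequence of angles, e.g. (yaw, pitch) in degrees.
    """
    n = len(predictions)
    dims = len(predictions[0])
    return [
        sum(gt[d] - pr[d] for pr, gt in zip(predictions, ground_truths)) / n
        for d in range(dims)
    ]


def apply_bias(prediction, bias):
    """Refine a new estimate by adding the subject-dependent bias."""
    return [p + b for p, b in zip(prediction, bias)]


# A few labeled calibration samples from one subject (made-up values).
calib_pred = [(1.0, 2.0), (3.0, 4.0)]
calib_true = [(2.0, 1.0), (4.0, 3.0)]
bias = estimate_bias(calib_pred, calib_true)
refined = apply_bias((0.0, 0.0), bias)
```

With more calibration samples the mean offset becomes more stable; with a single sample it simply copies that sample's error.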
Besides the supervised calibration methods, there are some unsupervised calibration methods. These methods use unlabeled calibration samples to improve performance. They usually seek to align the features in different domains. Wang et al. propose an adversarial method for aligning features. They build a discriminator to judge the source of images from the extracted feature. The feature extractor has to confuse the discriminator, i.e., the generated feature should be domain-invariant. The adversarial method is semi-supervised and does not require labeled calibration samples. Guo et al. use source samples to form a locally linear representation of each target domain prediction in gaze space. The same linear relationships are applied in the feature space to generate the feature representation of target samples. Meanwhile, they minimize the difference between the generated feature and the extracted feature of each target sample for alignment. Cheng et al. propose a domain generalization method. They improve the cross-dataset performance without knowing the target dataset or touching any new samples. They propose a self-adversarial framework to remove the gaze-irrelevant features in face images. Since the gaze pattern is invariant across domains, they align the features in different domains. Cui et al. define a new adaption problem: adaptation from adults to children. They use a conventional domain adaption method, the geodesic flow kernel, to transfer the features in the adult domain into the children domain.
Meta learning and metric learning also show great potential in domain adaption-based gaze estimation. Park et al. propose a meta learning-based calibration approach. They train a highly adaptable gaze estimation network through meta learning. The network can be converted into a person-specific network once it is trained with samples of the target person. Liu et al. propose a differential CNN based on metric learning. The network predicts the gaze difference between two eye images. For inference, they have a set of subject-specific calibration images. Given a new image, the network estimates the differences between the given image and each calibration image, and takes the average of the resulting estimates as the final estimated gaze.
Most calibration-based methods require labeled samples. However, it is difficult to acquire enough labeled samples in practical applications. Collecting calibration samples in a user-unaware manner is an alternative solution [116, 117, 118].
Some researchers implicitly collect calibration data when users are using computers. Salvalaio et al. propose to collect data when the user is clicking a mouse, based on the assumption that users are gazing at the position of the cursor when clicking. They use online learning to fine-tune their model with the calibration samples.
Other studies investigate the relation between the gaze points and the saliency maps [102, 103]. Chang et al. utilize saliency information to adapt the gaze estimation algorithm to a new user without explicit calibration. They transform the saliency map into a differentiable loss map that can be used to optimize the CNN models. Wang et al. introduce a stochastic calibration procedure. They minimize the difference between the probability distribution of the predicted gaze and the ground truth.
|Feature||Eye image||||—||[42, 119, 113]||[38, 47, 48, 95, 120]||[40, 46, 51, 98, 100, 115, 121]||[49, 50, 79, 52, 53, 80, 54]||—|
|Facial image||—||||[43, 64, 101]||[66, 68, 104, 108]||[5, 46, 69, 92, 99, 110, 111, 116, 118, 122]||[49, 50, 67, 59, 61, 83, 62, 123, 78, 124, 63, 112, 82, 37, 107]|||
|Video||—||—||—||||[36, 70, 71, 72]||||—|
|Model||Supervised CNN||||||[43, 64, 42, 101, 113, 119]||[38, 47, 66, 68, 95, 104, 108, 120]||[5, 36, 40, 46, 70, 71, 72, 98, 99, 100, 115, 116, 118, 121, 122]||[82, 50, 49, 37, 59, 61, 83, 62, 123, 78, 79, 124, 52, 53, 80, 63, 125, 107]||—|
|Semi-/Self-/Un-Supervised CNN||—||—||—||||[51, 69, 92, 110, 111]||[54, 112, 37]|||
|Multi-task CNN||—||—||[101, 64]||[95, 104]||[36, 98, 99, 100]||||—|
|Recurrent CNN||—||—||—||||[36, 70]||[49, 125]||—|
|CNN with Other Priors||—||—||||[38, 48]||[72, 92]||||—|
|Calibration||Domain Adaption||—||||||||[5, 39, 40, 110, 111, 115]||[107, 109, 112]||—|
|User-unaware Data Collection||—||||—||—||[116, 118]||—||—|
|Camera||Single camera||||||[43, 64, 42, 113]||[38, 47, 48, 66, 68, 95, 104, 108, 120]||[5, 36, 40, 46, 69, 70, 71, 72, 100, 111, 115, 116, 118, 121, 122]||[59, 49, 50, 61, 83, 62, 123, 78, 79, 124, 52, 53, 80, 63, 125, 82, 37, 107]|||
|IR Camera||—||—||—||—||[100, 121]||||—|
|Near-eye Camera||—||—||||—||[100, 121]||—||—|
|Platform||Computer||||—||[43, 64, 42, 113]||[38, 47, 48, 66, 68, 95, 104, 108]||[5, 36, 40, 46, 69, 70, 71, 72, 98, 99, 115, 116, 118]||[49, 59, 61, 83, 62, 123, 78, 79, 52, 53, 80, 63, 125, 82, 37, 107]|||
|Mobile Device||—||||—||||[111, 122]||[50, 124]||—|
|Head-mounted device||—||—||||||[100, 121]||—||—|
The majority of gaze estimation systems use a single RGB camera to capture eye images, while some studies use different camera settings, e.g., using multiple cameras to capture multi-view images [98, 119], using infrared (IR) cameras to handle low illumination condition [100, 121], and using RGBD cameras to provide the depth information . Different cameras and their captured images are shown in Fig. 11.
Tonsen et al. embed multiple millimeter-sized RGB cameras into a normal glasses frame. They use multi-layer perceptrons to process the eye images captured by different cameras, and concatenate the extracted features to estimate gaze. Lian et al. mount three cameras at the bottom of a screen. They build a multi-branch network to extract the features of each view and concatenate them to estimate the 2D gaze position on the screen. Wu et al. collect gaze data using near-eye IR cameras. They use a CNN to detect the location of glints, pupil centers and corneas from IR images. Then, they build an eye model using the detected features and estimate gaze from the eye model. Kim et al. collect a large-scale dataset of near-eye IR eye images. They synthesize additional IR eye images that cover large variations in face shape, gaze direction, pupil and iris, etc. Lian et al. use RGBD cameras to capture depth facial images. They extract the depth information of eye regions and concatenate it with RGB image features to estimate gaze.
Eye gaze can be used to estimate human intent in various applications, e.g., product design evaluation , marketing studies  and human-computer interaction [128, 129, 7]. These applications can be categorized into three types of platforms: computers, mobile devices and head-mounted devices. We summarize the characteristics of these platforms in Fig. 12.
The computer is the most common platform for appearance-based gaze estimation. The cameras are usually placed below/above the computer screen [26, 130, 48, 49, 38]. Some works focus on using deeper neural networks [26, 43, 47] or extra modules [38, 48, 49] to improve gaze estimation performance, while the other studies seek to use custom devices for gaze estimation, such as multi-cameras and RGBD cameras [98, 99].
The mobile device is another common platform for gaze estimation [35, 50, 124]. Such devices often contain front cameras, but the computational resources are limited. These systems usually estimate PoG instead of gaze directions due to the difficulty of geometric calibration. Krafka et al. propose a PoG estimation method for mobile devices, named iTracker . They combine the facial image, two eye images and the face grid to estimate the gaze. The face grid encodes the position of the face in captured images and is proved to be effective for gaze estimation in mobile devices in many works [111, 50]. He et al. propose a more accurate and faster method based on iTracker . They replace the face grid with a more sensitive eye corner landmark feature. Guo et al. propose a generalized gaze estimation method . They propose a tolerant training scheme, the knowledge distillation framework. They observe the notable jittering problem in gaze point estimates and propose to use adversarial training to address this problem.
The head-mounted device usually employs near-eye cameras to capture eye images. Tonsen et al. embed millimetre-sized RGB cameras into a normal glasses frame. In order to compensate for the low resolution of the captured images, they use multiple cameras to capture multi-view images and use a neural network to regress gaze from these images. IR cameras are also employed by head-mounted devices. Wu et al. collect the MagicEyes dataset using IR cameras. They propose EyeNet, a neural network that solves multiple heterogeneous tasks related to eye gaze estimation for an off-axis camera setting. They use the CNN to model the 3D cornea and the 3D pupil and estimate the gaze from these two 3D models. Lemley et al. use a single near-eye image as input to the neural network and directly regress gaze. Kim et al. follow a similar approach and collect the NVGaze dataset.
We also summarize and categorize the existing CNN-based gaze estimation methods in Tab. I. Note that many methods do not specify a platform [26, 49]. We categorize all of these methods as "computer" in the platform row. In summary, there is an increasing trend in developing supervised or semi-/self-/un-supervised CNN structures to estimate gaze. More recent research interest shifts to different calibration approaches through domain adaptation or user-unaware data collection. The first CNN-based gaze direction estimation method was proposed by Zhang et al. in 2015, and the first CNN-based PoG estimation method was proposed by Krafka et al. in 2016. These two studies both provide large-scale gaze datasets, MPIIGaze and GazeCapture, which have been widely used for evaluating gaze estimation algorithms in later studies.
Raw images often contain unnecessary information for gaze estimation, such as the background. Directly using raw images to regress gaze not only increases the computational resource but also brings nuisance factors such as changes in scenes. Therefore, face or eye detection is usually applied in raw images to prune unnecessary information. Generally, researchers first perform face alignment in raw images to obtain facial landmarks and crop face/eye images using these landmarks. Several face alignment methods have been proposed recently [137, 138, 139]. We list some typical face alignment methods in Tab. II.
After the facial landmarks are obtained, face or eye images are cropped accordingly. There is no protocol to regulate the cropping procedure. We provide a common cropping procedure here as an example. We let $x_i \in \mathbb{R}^2$ be the x, y-coordinates of the $i$th facial landmark in a raw image $I$. The center point $x$ is calculated as $x = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $n$ is the number of facial landmarks. The face image is defined as a square region with the center $x$ and a width $w$. The width $w$ is usually set empirically, for example as a fixed multiple of the maximum distance between the landmarks. The eye cropping is similar to face cropping, while the eye region is usually defined as a rectangle with the center set as the centroid of the eye landmarks. The width of the rectangle is set based on the distance between the eye corners, e.g., 1.2 times that distance.
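The cropping procedure above can be sketched as follows. This is only an illustration under stated assumptions: the face multiplier `scale=1.5` and the eye-box `aspect=0.6` (height over width) are hypothetical defaults that do not come from the text, while the eye-width factor 1.2 is the example value given above. The function names are also hypothetical.

```python
import math


def face_crop_box(landmarks, scale=1.5):
    """Square face box (left, top, right, bottom) centered on the landmark mean.

    The width is `scale` times the maximum pairwise landmark distance.
    `scale=1.5` is an assumed value chosen for illustration.
    """
    n = len(landmarks)
    cx = sum(x for x, _ in landmarks) / n
    cy = sum(y for _, y in landmarks) / n
    max_dist = max(math.dist(a, b) for a in landmarks for b in landmarks)
    half = scale * max_dist / 2
    return (cx - half, cy - half, cx + half, cy + half)


def eye_crop_box(eye_landmarks, corner_a, corner_b, scale=1.2, aspect=0.6):
    """Rectangular eye box centered on the centroid of the eye landmarks.

    The width is `scale` times the distance between the two eye corners.
    `aspect` (height / width) is an assumed value chosen for illustration.
    """
    n = len(eye_landmarks)
    cx = sum(x for x, _ in eye_landmarks) / n
    cy = sum(y for _, y in eye_landmarks) / n
    width = scale * math.dist(corner_a, corner_b)
    height = aspect * width
    return (cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2)


face_box = face_crop_box([(0, 0), (4, 0), (0, 4), (4, 4)])
eye_box = eye_crop_box([(10, 10), (30, 10), (20, 5), (20, 15)], (10, 10), (30, 10))
```

In practice the box is usually clipped to the image bounds and the crop is resized to the network's input resolution; both steps are omitted here.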
Gaze data is usually collected in the laboratory environment. Subjects are required to fix their head on a chin rest . Recent research gradually shifts the attention from the constrained gaze estimation to unconstrained gaze estimation. The unconstrained gaze estimation introduces many environmental factors such as illumination and background. These factors increase the complexity of eye appearance and complicate the gaze estimation problem. Although the CNNs have a strong fitting ability, it is still difficult to achieve accurate gaze estimation in an unconstrained environment. The goal of data rectification is to eliminate the environmental factors by data pre-processing methods and to simplify the gaze regression problem. Current data rectification methods mainly focus on head pose and illumination factors.
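|$p \in \mathbb{R}^2$, gaze targets.|
|$g \in \mathbb{R}^3$, gaze directions.|
|$o \in \mathbb{R}^3$, origins of gaze directions.|
|$t \in \mathbb{R}^3$, targets of gaze directions.|
|$R_s \in \mathbb{R}^{3 \times 3}$, the rotation matrix of SCS w.r.t. CCS.|
|$T_s \in \mathbb{R}^3$, the translation matrix of SCS w.r.t. CCS.|
|$n \in \mathbb{R}^3$, the normal vector of the x-y plane of SCS.|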
The head pose can be decomposed into the rotation and translation of the head. A change of head pose degrades the eye appearance and introduces ambiguity in it. To handle head pose changes, Sugano et al. propose to rectify the eye image by rotating a virtual camera so that it points at the same reference point on the human face. They assume that the captured eye image is a plane in 3D space, so the rotation of the virtual camera can be performed as a perspective transformation on the image. The whole data rectification process is shown in Figure 13. They compute the transformation matrix $M = SR$, where $R$ is the rotation matrix and $S$ is the scale matrix. $R$ also indicates the rotated camera coordinate system. The $z$-axis $z_c$ of the rotated camera coordinate system is defined as the line from the camera to the reference point, where the reference point is usually set as the face center or eye center. It means that the rotated camera is pointing towards the reference point. The rotated $x$-axis $x_c$ is defined as the $x$-axis of the head coordinate system so that the appearance captured by the rotated camera is facing the front. The rotated $y$-axis can be computed by $y_c = z_c \times x_c$, and $x_c$ is recalculated by $x_c = y_c \times z_c$ to maintain orthogonality. As a result, the rotation matrix is $R = \left[\frac{x_c}{\|x_c\|}, \frac{y_c}{\|y_c\|}, \frac{z_c}{\|z_c\|}\right]^T$. The scale matrix $S$ maintains the distance between the virtual camera and the reference point, and is defined as $S = \mathrm{diag}(1, 1, \frac{d_n}{d_o})$, where $d_o$ is the original distance between the camera and the reference point, and $d_n$ is the new distance that can be adjusted manually. They apply a perspective transformation on images with $W = C_n M C_o^{-1}$, where $C_o$ is the intrinsic matrix of the original camera and $C_n$ is the intrinsic matrix of the new camera. Gaze directions can also be calculated in the rotated camera coordinate system as $\hat{g} = M g$. The method eliminates the ambiguity caused by different head positions and aligns the intrinsic matrices of the cameras. It also rotates the captured image to cancel the degree of freedom of roll in head rotation. Zhang et al. further explore the method. They argue that scaling cannot change the gaze direction vector, so the gaze direction is computed by $\hat{g} = R g$.
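The rectification described above can be sketched in plain Python. This is an illustrative sketch, not the cited authors' code. It assumes that the rows of the rotation matrix are the normalized axes of the rotated camera, that the scale matrix is diag(1, 1, d_new / d_old), that the image warp is C_new · M · C_old⁻¹, and that the intrinsic matrices have zero skew. All function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [c / n for c in v]


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(m, v):
    return [dot(row, v) for row in m]


def rectification(reference_point, head_x_axis, d_new):
    """Return (R, M) for a reference point given in the original camera frame.

    Fails with a division by zero if head_x_axis is parallel to the
    viewing direction, which this sketch does not handle.
    """
    d_old = math.sqrt(dot(reference_point, reference_point))
    z_c = normalize(reference_point)            # look at the reference point
    y_c = normalize(cross(z_c, head_x_axis))    # y = z x head-x
    x_c = normalize(cross(y_c, z_c))            # recompute x for orthogonality
    R = [x_c, y_c, z_c]                         # rows are the new axes
    S = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, d_new / d_old]]
    return R, matmul(S, R)


def inverse_intrinsics(C):
    """Closed-form inverse of a zero-skew intrinsic matrix."""
    fx, fy, cx, cy = C[0][0], C[1][1], C[0][2], C[1][2]
    return [[1 / fx, 0.0, -cx / fx], [0.0, 1 / fy, -cy / fy], [0.0, 0.0, 1.0]]


def warp_matrix(C_old, C_new, M):
    """Perspective warp W = C_new * M * C_old^-1 applied to image pixels."""
    return matmul(matmul(C_new, M), inverse_intrinsics(C_old))


def rectify_gaze(g, R):
    """Rotation-only variant: scaling is not applied to the gaze vector."""
    return normalize(matvec(R, g))
```

A quick sanity check of such a sketch is that `M` maps the reference point to `(0, 0, d_new)` and that the warp sends the projected reference point to the principal point of the new camera.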
Illumination also influences the appearance of the human eye. To handle this, researchers usually take gray-scale images rather than RGB images as input and apply histogram equalization in the gray-scale images to enhance the image.
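As a rough illustration of the histogram equalization step, the sketch below applies the standard global, CDF-based remapping to an 8-bit gray-scale image stored as a non-empty list of rows. The function name `equalize_histogram` is hypothetical, and real pipelines typically rely on an image library rather than pure Python.

```python
def equalize_histogram(image, levels=256):
    """Global histogram equalization for a gray-scale image (list of rows).

    Assumes a non-empty image with integer pixel values in [0, levels - 1].
    """
    pixels = [v for row in image for v in row]
    hist = [0] * levels
    for v in pixels:
        hist[v] += 1

    # Cumulative distribution of intensities.
    cdf, total = [], 0
    for count in hist:
        total += count
        cdf.append(total)

    cdf_min = next(c for c in cdf if c > 0)
    n = len(pixels)
    if n == cdf_min:  # constant image: nothing to stretch
        return [row[:] for row in image]

    # Map each intensity so that the output CDF is approximately linear.
    lut = [max(0, round((c - cdf_min) * (levels - 1) / (n - cdf_min)))
           for c in cdf]
    return [[lut[v] for v in row] for row in image]


equalized = equalize_histogram([[50, 50, 100, 150, 200]])
```

The remapping spreads the occupied intensity range over the full 0 to 255 scale, which reduces the effect of global brightness differences between images.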
Various applications demand different forms of gaze estimates. For example, in a real-world interaction task, it requires 3D gaze directions to indicate the human intent [142, 10], while it requires 2D PoG for the screen-based interaction [7, 143]. In this section, we introduce how to convert different forms of gaze estimates by post-processing. We list the symbols in Tab. III and illustrate the symbols in Fig. 14. We denote the PoG as 2D gaze and the gaze direction as 3D gaze in this section.
The 2D gaze estimation algorithm usually estimates gaze targets on a computer screen [122, 72, 35, 116, 144], while the 3D gaze estimation algorithm estimates gaze directions in 3D space [42, 43, 36, 49, 92]. We first introduce how to convert between the 2D gaze and the 3D gaze.
Given a 2D gaze target $p = (u, v)$ on the screen, our goal is to compute the corresponding 3D gaze direction $g$. The processing pipeline is that we first compute the 3D gaze target $t$ and the 3D gaze origin $o$ in the camera coordinate system (CCS). The gaze direction can then be computed as $g = \frac{t - o}{\|t - o\|}$ (1).
|Columbia , 2013, ** (Columbia University)||K images||✓||✓||Collected in laboratory; 5 head pose and 21 gaze directions per head pose.||https://cs.columbia.edu/CAVE/databases/columbia_gaze|
|UTMultiview , 2014, (The University of Tokyo; Microsoft Research Asia)||M images||✓||✓||Collected in laboratory; Fixed head pose; Multiview eye images; Synthesis eye images.||https://ut-vision.org/datasets|
|EyeDiap , 2014, *** (Idiap Research Institute)||videos||✓||✓||✓||Collected in laboratory; Free head pose; Additional depth videos.||https://idiap.ch/dataset/eyediap|
|MPIIGaze , 2015, *** (Max Planck Institute)||K images||✓||✓||Collected by laptops in daily life; Free head pose and illumination.||https://mpi-inf.mpg.de/mpiigaze|
|GazeCapture , 2016,* (University of Georgia; MIT; Max Planck Institute)||M images||✓||✓||Collected by mobile devices in daily life; Variable lighting condition and head motion.||https://gazecapture.csail.mit.edu|
|MPIIFaceGaze , *** 2017, (Max Planck Institute)||K images||✓||✓||✓||Collected by laptops in daily life; Free head pose and illumination.||footnote1|
|InvisibleEye , 2017, (Max Planck Institute; Osaka University)||17||280K Images||✓||Collected in laboratory; Multiple near-eye camera; Low resolution cameras .||https://mpi-inf.mpg.de/invisibleeye|
|TabletGaze , 2017,* (Rice University)||videos||✓||✓||Collected by tablets in laboratory; Four postures to hold the tablets; Free head pose.||https://sh.rice.edu/cognitive-engagement/tabletgaze|
|RT-Gene , 2018,**** (Imperial College London)||K images||✓||✓||Collected in laboratory; Free head pose; Annotated with mobile eye-tracker; Use GAN to remove the eye-tracker in face images.||https://github.com/Tobias-Fischer/rt_gene|
|Gaze360 , 2019, **** (MIT; Toyota Research Institute)||K images||✓||✓||Collected in indoor and outdoor environments; A wide range of head poses and distances between subjects and cameras.||https://gaze360.csail.mit.edu|
|NVGaze , 2019, *** (NVIDIA; UNC)||30||4.5M images||✓||Collected in laboratory; Near-eye Images; Infrared illumination.||https://sites.google.com/nvidia.com/nvgaze|
|ShanghaiTechGaze ,* 2019, (ShanghaiTech University; UESTC)||K images||✓||✓||Collected in laboratory; Free head pose; Multiview gaze dataset.||https://github.com/dongzelian/multi-view-gaze|
|ETH-XGaze , 2020, * (ETH Zurich; Google)||110||1.1M images||✓||✓||✓||Collected in laboratory; High-resolution images; Extreme head pose; 16 illumination conditions.||https://ait.ethz.ch/projects/2020/ETH-XGaze|
|EVE , 2020,****** (ETH Zurich)||54||K videos||✓||✓||✓||Collected in laboratory; Free head pose; Free view; Annotated with desktop eye tracker; Pupil size annotation.||https://ait.ethz.ch/projects/2020/EVE/|
|MPIIGaze ||EyeDiap ||
|Proposed for task A||Direct results of task A||Converted results from task A|
|Proposed for task B||Converted results from task B||Direct results of task B|
We will continue to add new methods and datasets. Please keep track of our website for the latest progress.
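To derive the 3D gaze target $t$, we obtain the pose $\{R_s, T_s\}$ of the screen coordinate system (SCS) w.r.t. CCS by geometric calibration, where $R_s$ is the rotation matrix and $T_s$ is the translation matrix. The target $t$ is computed as $t = R_s [u, v, 0]^T + T_s$, where $(u, v)$ are the coordinates of the 2D gaze target $p$ and the additional $0$ is the $z$-axis coordinate of $p$ in SCS. The 3D gaze origin $o$ is usually defined as the face center or the eye center. It can be estimated by landmark detection algorithms or stereo measurement methods.
On the other hand, given a 3D gaze direction $g$, we aim to compute the corresponding 2D target point $p$ on the screen. Note that we also need to acquire the screen pose $\{R_s, T_s\}$ as well as the origin point $o$ as mentioned previously. We first compute the intersection of the gaze direction and the screen, i.e., the 3D gaze target $t$, in CCS, and then we convert the 3D gaze target to the 2D gaze target using the pose $\{R_s, T_s\}$.
To deduce the equation of the screen plane, we take $n = (n_x, n_y, n_z)$ as the third column of $R_s$, where $n$ is the normal vector of the screen plane. $T_s$ also represents a point on the screen plane. Therefore, the equation of the screen plane is $n_x x + n_y y + n_z z = n \cdot T_s$ (2).
Given a gaze direction $g = (g_x, g_y, g_z)$ and the origin point $o = (o_x, o_y, o_z)$, we can write the equation of the line of sight as $\frac{x - o_x}{g_x} = \frac{y - o_y}{g_y} = \frac{z - o_z}{g_z}$ (3). Solving Eq. (2) and Eq. (3) gives the intersection $t$, and the 2D gaze target $p$ is given by the first two components of $R_s^{-1}(t - T_s)$.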
Conventional gaze estimation methods usually estimate gaze directions w.r.t. each eye. They define the origin of gaze directions as each eye center [42, 38, 51, 115]. Recently, more attention has been paid to gaze estimation using face images. These methods usually estimate one gaze direction w.r.t. the whole face and define the gaze direction as the vector starting from the face center to the gaze target [49, 40, 43, 47]. Here, we introduce a gaze origin conversion method to bridge the gap between these two types of gaze estimates.
We first compute the pose $\{R_s, T_s\}$ of SCS and the origin $o$ of the predicted gaze direction $g$ through calibration. Then we can write Eq. (2) and Eq. (3) based on these parameters. The 3D gaze target point $t$ can be calculated by solving Eq. (2) and Eq. (3). Next, we obtain the new origin $o_n$ of the gaze direction through 3D landmark detection. The new gaze direction can be computed by $g_n = \frac{t - o_n}{\|t - o_n\|}$.
|Methods / Datasets||Task C: Estimate 2D PoG.|
|MPIIFaceGaze ||EyeDiap ||GazeCapture |
|Proposed for task C||Direct results of task C|
|Itracker ||7.67 cm||10.13 cm||2.81 cm||1.86 cm|
|AFF-Net ||4.21 cm||9.25 cm||2.30 cm||1.62 cm|
|SAGE ||N/A||N/A||2.72 cm||1.78 cm|
|TAT ||N/A||N/A||2.66 cm||1.77 cm|
|Proposed for task A||Converted results from task A (In Tab. V)|
|Mnist ||7.29 cm||9.06 cm||N/A||N/A|
|GazeNet ||6.62 cm||8.51 cm||N/A||N/A|
|Proposed for task B||Converted results from task B (In Tab. V)|
|Dilated-Net ||5.07 cm||7.36 cm||N/A||N/A|
|Gaze360 ||4.66 cm||6.37 cm||N/A||N/A|
|RT-Gene ||5.36 cm||7.19 cm||N/A||N/A|
|FullFace ||5.65 cm||7.70 cm||N/A||N/A|
|CA-Net ||4.90 cm||6.30 cm||N/A||N/A|
Two types of metric are used for performance evaluation: the angular error and the Euclidean distance. Two kinds of evaluation protocols are commonly used: within-dataset evaluation and cross-dataset evaluation.
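The angular error is usually used for measuring the accuracy of 3D gaze estimation methods [49, 40, 42]. Assuming the actual gaze direction is $g \in \mathbb{R}^3$ and the estimated gaze direction is $\hat{g} \in \mathbb{R}^3$, the angular error can be computed as $\arccos\left(\frac{g \cdot \hat{g}}{\|g\| \, \|\hat{g}\|}\right)$.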
The Euclidean distance has been used for measuring the accuracy of 2D gaze estimation methods in [35, 116, 144]. We denote the actual gaze position as $p \in \mathbb{R}^2$ and the estimated gaze position as $\hat{p} \in \mathbb{R}^2$. We can compute the Euclidean distance as $\|p - \hat{p}\|_2$.
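Both metrics are short enough to write down directly. The sketch below is a plain-Python illustration; the function names are hypothetical, the angular error is reported in degrees, and the cosine is clamped to [-1, 1] to guard against floating-point rounding.

```python
import math


def angular_error_deg(g, g_hat):
    """Angle in degrees between the actual and the estimated 3D gaze direction."""
    dot = sum(a * b for a, b in zip(g, g_hat))
    norm = math.sqrt(sum(a * a for a in g)) * math.sqrt(sum(b * b for b in g_hat))
    cos = max(-1.0, min(1.0, dot / norm))   # clamp against rounding errors
    return math.degrees(math.acos(cos))


def euclidean_distance(p, p_hat):
    """Distance between the actual and the estimated 2D gaze position."""
    return math.dist(p, p_hat)


same = angular_error_deg([0.0, 0.0, -1.0], [0.0, 0.0, -2.0])
orthogonal = angular_error_deg([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
offset = euclidean_distance((0.0, 0.0), (3.0, 4.0))
```

The unit of the Euclidean distance follows the unit of the gaze positions (for example centimeters on the screen), so results are only comparable when both positions use the same unit.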
The within-dataset evaluation assesses the model performance on the unseen subjects from the same dataset. The dataset is divided into the training set and the test set according to the subjects. There is no intersection of subjects between the training set and test set. Note that, most of the gaze datasets provide within-dataset evaluation protocol. They divide the data into training set and test set in advance.
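A subject-wise split can be sketched as below. The leave-one-subject-out scheme shown here is one common way to guarantee disjoint subjects and is an assumption of this sketch, since each dataset defines its own protocol. The function name and the `(subject_id, sample)` layout are hypothetical.

```python
def leave_one_subject_out(samples):
    """Yield (held_out_subject, train, test) with disjoint subjects.

    `samples` is a list of (subject_id, sample) pairs.
    """
    subjects = sorted({subject for subject, _ in samples})
    for held_out in subjects:
        train = [x for subject, x in samples if subject != held_out]
        test = [x for subject, x in samples if subject == held_out]
        yield held_out, train, test


data = [("p00", "a"), ("p00", "b"), ("p01", "c"), ("p02", "d")]
folds = list(leave_one_subject_out(data))
```

Splitting by sample instead of by subject would leak person-specific appearance into the test set and make the reported error too optimistic.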
The cross-dataset evaluation assesses the model performance on the unseen environment. The model is trained on one dataset and tested on another dataset.
Many large-scale gaze datasets have been proposed. In this survey, we try our best to summarize all the public datasets on gaze estimation, as shown in Tab. IV. The gaze direction distribution and head pose distribution of these datasets are shown in Fig. 15. Note that the Gaze360 dataset does not provide head pose information. We also discuss three typical datasets that are widely used in gaze estimation studies.
Zhang et al. proposed the MPIIGaze dataset. It is the most popular dataset for appearance-based gaze estimation methods. The MPIIGaze dataset contains a total of 213,659 images collected from 15 subjects. The images were collected in daily life over several months with no constraint on the head pose. As a result, they cover different illumination conditions and head poses. The MPIIGaze dataset provides both 2D and 3D gaze annotation. It also provides a standard evaluation set, which contains 15 subjects and 3,000 images for each subject. The images of each subject consist of 1,500 left-eye images and 1,500 right-eye images. The authors further extend the original dataset in [43, 42]. The original MPIIGaze dataset only provides binocular eye images, while the extensions supply the corresponding face images and manual landmark annotations.
The EyeDiap dataset consists of 94 video clips from 16 participants. Unlike MPIIGaze, the EyeDiap dataset was collected in a laboratory environment. It has three visual target sessions: a continuous moving target, a discrete moving target, and a floating ball. For each subject, a total of six sessions were recorded under two head movement conditions: static head pose and free head movement. Two cameras were used for data collection: an RGBD camera and an HD camera. The disadvantage of this dataset is that it lacks variation in illumination.
The GazeCapture dataset was collected through crowdsourcing. It contains a total of 2,445,504 images from 1,474 participants. All images were collected using mobile phones or tablets. Each participant was required to gaze at a circle shown on the device without any constraint on head movement. As a result, the GazeCapture dataset covers various lighting conditions and head motions. The GazeCapture dataset does not provide 3D coordinates of the targets. It is usually used to evaluate unconstrained 2D gaze point estimation methods.
In addition to the datasets mentioned above, several datasets have been proposed recently. In 2018, Fischer et al. proposed the RT-Gene dataset. This dataset provides accurate 3D gaze data since gaze is collected with a dedicated eye tracking device. In 2019, Kellnhofer et al. proposed the Gaze360 dataset. The dataset consists of 238 subjects in indoor and outdoor environments, with 3D gaze across a wide range of head poses and distances. In 2020, Zhang et al. proposed the ETH-XGaze dataset. This dataset provides high-resolution images that cover extreme head poses. It also contains 16 illumination conditions for exploring the effects of illumination.
The system setups of gaze estimation models differ. 2D PoG estimation and 3D gaze direction estimation are two popular gaze estimation tasks. In addition, with regard to 3D gaze direction estimation, some methods are designed to estimate gaze from eye images and define the origin of the gaze direction as the eye center. However, this definition is not suitable for methods that estimate gaze from face images. These methods therefore slightly change the definition and define the origin of the gaze direction as the face center. These different task definitions are a barrier to comparing gaze estimation methods. In this section, we overcome this barrier with data post-processing methods and build a comprehensive benchmark.
We conduct benchmarks on three common gaze estimation tasks: a) estimating gaze directions originating from eye centers; b) estimating gaze directions originating from face centers; c) estimating the 2D PoG. The results are shown in Tab. V and Tab. VI. We implement typical gaze estimation methods for the three tasks. We use the gaze origin conversion method to convert results between task a and task b, and the 2D/3D gaze conversion method to convert results between task c and tasks a/b. The two conversion methods are introduced in Section IV-B. The data pre-processing code for each dataset and the implemented methods are available at http://phi-ai.org/GazeHub.
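The geometric idea behind such conversions can be sketched with a ray-plane intersection. This is a generic illustration under stated assumptions, not necessarily the exact procedure of Section IV-B. It assumes the screen is a known plane in the camera coordinate system, and all function names and numeric values are hypothetical. Intersecting a gaze ray with the screen plane gives the 3D gaze target, from which a direction with a different origin can be recomputed.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def subtract(a, b):
    return tuple(x - y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return tuple(c / norm for c in v)


def gaze_target_on_plane(origin, direction, plane_point, plane_normal):
    """Intersect the gaze ray origin + t * direction with the screen plane."""
    denom = dot(direction, plane_normal)
    if abs(denom) < 1e-12:
        raise ValueError("gaze direction is parallel to the plane")
    t = dot(subtract(plane_point, origin), plane_normal) / denom
    if t < 0:
        raise ValueError("gaze ray points away from the plane")
    return tuple(o + t * d for o, d in zip(origin, direction))


def direction_from_origin(new_origin, target):
    """Unit gaze direction from a new origin (e.g. face center) to the target."""
    return normalize(subtract(target, new_origin))


# Hypothetical setup: screen plane z = 0, distances in cm.
plane_point, plane_normal = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
eye_center, eye_gaze = (3.0, 0.0, 60.0), (0.0, 0.0, -1.0)
face_center = (0.0, 0.0, 60.0)

target = gaze_target_on_plane(eye_center, eye_gaze, plane_point, plane_normal)
face_gaze = direction_from_origin(face_center, target)
print(target, face_gaze)
```

Mapping the 3D target to 2D screen coordinates additionally requires the screen pose relative to the camera, which is omitted here.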
In this survey, we present a comprehensive overview of deep learning-based gaze estimation methods. Unlike conventional gaze estimation methods that require dedicated devices, deep learning-based approaches regress the gaze from the eye appearance captured by web cameras. This makes the algorithms easy to deploy in real-world applications. We introduce gaze estimation methods from four perspectives: deep feature extraction, deep neural network architecture design, personal calibration, and device and platform. We summarize the public datasets for appearance-based gaze estimation and provide benchmarks to compare the state-of-the-art algorithms. This survey can serve as a guideline for future gaze estimation research.
Here, we further suggest several future directions for deep learning-based gaze estimation. 1) Extracting more robust gaze features. An ideal gaze estimation method should be accurate across different subjects, head poses, and environments. Therefore, an environment-invariant gaze feature is critical. 2) Improving performance with fast and simple calibration. There is a trade-off between system performance and calibration time: a longer calibration leads to more accurate estimates. How to achieve satisfactory performance with a fast calibration procedure is a promising direction. 3) Interpreting the learned features. Deep learning approaches often serve as a black box in gaze estimation. Interpreting the features learned by these methods can bring insight into deep learning-based gaze estimation.
IEEE on Computer Vision and Pattern Recognition (CVPR), 2006.
D. M. Stampe, “Heuristic filtering and reliable calibration methods for video-based pupil-tracking systems,” Behavior Research Methods, Instruments, & Computers, vol. 25, no. 2, pp. 137–142, 1993.
Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020.
K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” arXiv preprint, 2014.
K. P. Murphy and S. Russell, “Dynamic bayesian networks: representation, inference and learning,” 2002.
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
A. Bublea and C. D. Căleanu, “Deep learning based eye gaze tracking for automotive applications: An auto-keras approach,” in The International Symposium on Electronics and Telecommunications (ISETC). IEEE, 2020, pp. 1–4.
M. J. Beal, “Variational algorithms for approximate bayesian inference,” Ph.D. dissertation, UCL (University College London), 2003.
B. Klein Salvalaio and G. de Oliveira Ramos, “Self-adaptive appearance-based eye-tracking with online transfer learning,” in 2019 8th Brazilian Conference on Intelligent Systems (BRACIS), Oct 2019, pp. 383–388.
K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.