Automatic Gaze Analysis: A Survey of Deep Learning based Approaches

Eye gaze analysis is an important research problem in the field of computer vision and Human-Computer Interaction (HCI). Even with significant progress in the last few years, automatic gaze analysis still remains challenging due to the individuality of eyes, eye-head interplay, occlusion, image quality, and illumination conditions. There are several open questions including what are the important cues to interpret gaze direction in an unconstrained environment without prior knowledge and how to encode them in real-time. We review the progress across a range of gaze analysis tasks and applications to shed light on these fundamental questions; identify effective methods in gaze analysis and provide possible future directions. We analyze recent gaze estimation and segmentation methods, especially in the unsupervised and weakly supervised domain, based on their advantages and reported evaluation metrics. Our analysis shows that the development of a robust and generic gaze analysis method still needs to address real-world challenges such as unconstrained setup and learning with less supervision. We conclude by discussing future research directions for designing a real-world gaze analysis system that can propagate to other domains including computer vision, AR (Augmented Reality), VR (Virtual Reality), and HCI (Human Computer Interaction).


1 Introduction

Humans perceive their environment through voluntary or involuntary eye movements that receive, fixate on and track visual stimuli. Eye movement can also occur in response to an auditory or cognitive stimulus, and it plays an important role in day-to-day communication and social interaction. Information gained from the eyes’ movement can, therefore, be beneficial for understanding complex mental states, including visual attention [liu2011visual] and human cognition (emotions, beliefs and desires) [frischen2007gaze].

Automatic eye gaze analysis develops techniques to estimate the position of target objects by observing the eyes’ movement. However, accurate gaze analysis is a complex and difficult problem. An accurate gaze analysis method should be able to disentangle gaze, while being resilient to a broad array of challenges including eye-head interplay, illumination variations, eye registration errors, occlusions, and identity bias. Furthermore, research [purves2015perception] has shown how human gaze follows an arbitrary trajectory during eye movements, making any prediction or dead reckoning a further challenge.

Research in eye gaze analysis mainly involves two broad areas: eye gaze estimation and eye segmentation. There are three aspects of eye gaze analysis: registration, representation and recognition. The first step, registration, involves the detection of eyes or eye-related key points. In the second step, the detected eye is projected to a meaningful feature space, which is termed representation. In the final stage of eye gaze estimation, the corresponding gaze direction or gaze location is predicted from the representative features. For the eye segmentation task, on the other hand, the intermediate representations are mapped to a segmentation mask, which is widely used in several applications [jain2006biometrics, chaudhary2019ritnet].

Research interest in automatic eye gaze analysis is established in several disciplines. It primarily originates from computer vision-related assistive technology [borgestig2016eye, corno2002cost] which further propagates through human-computer interaction [joseph2020potential], consumer behavior analysis [wedel2017review], virtual and augmented reality [patney2016perceptually, azuma1997survey], egocentric vision [ragusa2020ego] and other domains [eckstein2017beyond, miller2011persistence].

A brief chronology of seminal gaze analysis methods with important milestones is presented in Fig. 1. The first gaze analysis was performed back in 1879 [javal1878essai]. Since then, the study of gaze analysis has been driven by its use in modelling human visual attention. The first eye tracker was introduced in 1908, followed by the ‘Purkinje Image’ [cornsweet1973accurate], ‘Bright Pupil’ [merchant1974remote] and IR based eye tracking [abadi1981listening]. However, such tracking devices are costly and require specific controlled settings. To overcome their limitations, most traditional gaze analysis models rely on handcrafted low-level features (e.g., color [hansen2009eye], shape [hansen2005eye, hansen2009eye] and appearance [smith2013gaze]) and certain geometrical heuristics [sugano2014learning], mostly to handle generic unconstrained settings [zhu2007novel, sugano2014learning]. Since 2015, a deep learning-based paradigm shift has taken place in gaze analysis [zhang2015appearance, krafka2016eye, park2018deep], similar to other computer vision tasks. With deep learning-based models, the challenges associated with variation in lighting, camera setup, eye-head interplay etc. have been greatly reduced over the past few years. However, this performance enhancement comes at the cost of large-scale annotated data, which is expensive to acquire. Recently, learning with limited annotation has gained increasing popularity [dubey2019unsupervised, yu2019unsupervised, park2019few].


Figure 1: A brief chronology of seminal gaze analysis works. The very first gaze pattern modelling dates back to the work of Javal et al. in 1879 [javal1878essai]. One of the first deep learning driven appearance based gaze estimation models was proposed in  [zhang2015appearance].

This paper surveys different eye gaze analysis papers by isolating their fundamental components and eye movement patterns, and discusses how each component addresses the aforementioned challenges in eye gaze analysis. The paper discusses new trends and developments in the field of computer vision and the AR/VR domain from the perspective of gaze analysis. Recent gaze analysis techniques in the unsupervised, self-supervised and weakly-supervised domains that aim at capturing eye movement dynamics are discussed, along with validation protocols and evaluation metrics tailored for gaze analysis. Data capturing devices, including RGB/IR cameras, tablet/laptop cameras, ladybug cameras and other gaze trackers (including video-oculography [hansen2009eye]), are also discussed.

Given the rapid progress in the computer vision field (refer to Fig. 1), thorough guidance via exhaustive survey/review articles is increasingly useful. In 2010 and 2013, Hansen et al. [hansen2009eye] and Chennamma et al. [chennamma2013survey] reviewed the state-of-the-art eye detection and gaze tracking techniques. These reviews provide a holistic view of hardware, user interfaces, eye detection and gaze mapping techniques. Since these reviews predate the deep learning era, they mostly cover features leveraged from handcrafted techniques. Afterwards, in 2016, [jing2016survey] reviewed methods for 2D and 3D gaze estimation. In 2017, Kar et al. [kar2017review] provided insights into the issues related to algorithms, system configurations and user conditions. In 2020, a more comprehensive and detailed study of deep learning-based gaze estimation methods was presented in [cazzato2020look]. However, with the recent advancements in the computer vision community, learning with less supervision has become a popular topic. Moreover, all of the existing reviews focus only on gaze estimation and ignore significant works in eye segmentation, gaze zone estimation, gaze trajectory prediction, gaze redirection and multimodal/cross-modal gaze estimation. The contributions of the paper are summarized below:

  1. A systematic review of automated gaze analysis. Specifically, we categorize and summarize existing methods by considering data capturing sensors, platforms, popular gaze estimation tasks in computer vision, level of supervision and learning paradigm. The proposed taxonomies aim to help researchers gain a deeper understanding of the key components in gaze analysis.

  2. Different popular tasks under one framework. To the best of our knowledge, we are the first to put the different popular eye and gaze related tasks under one framework. Apart from gaze estimation, we consider gaze trajectory, eye segmentation and gaze zone estimation tasks.

  3. Applications. We explore major applications of gaze analysis using computer vision, i.e. Augmented and Virtual Reality [patney2016perceptually, clay2019eye], Driver Engagement [ghosh2020speak2label, vora2018driver] and Healthcare [harezlak2018application, kempinski2016system].

  4. Privacy Concerns. We also provide a brief review of the privacy concerns around gaze data and their possible implications.

  5. Overview of open issues and future directions. We review several issues associated with current gaze analysis frameworks (i.e. model design, dataset collection, etc.) and discuss possible future research directions.

Figure 2: Top Left: Overview of the human visual system, eye modelling and eye movement. For computer vision based automated eye gaze analysis, we consider an image containing eyes (left) as input. Such methods mostly analyze the visible eye regions (middle) and predict a 2D/3D gaze vector as output. However, predicting the true gaze direction (right) also depends on unobservable factors, which require person-specific information and other factors [park2020representation]. Bottom Left: Apart from static image-based gaze estimation, dynamic eye movement is another line of research in computer vision that provides cues regarding human behavioural traits. Right: The actual modelling of eye gaze with respect to eye anatomy. We only highlight the relevant parts, i.e. pupil, cornea, iris, sclera, fovea, LOS and LOG. The angle between LOG and LOS is called the angle of kappa (κ).

The outline of this paper is as follows: Sec. 2 provides a brief discussion of vision-based gaze analysis systems, eye modelling and problem settings. The popular tasks studied in computer vision and related domains are reviewed in Sec. 3. A detailed overview of the gaze analysis framework is presented in Sec. 4. The details regarding the validation of the frameworks are summarized in Sec. 5. Applications of eye gaze estimation are versatile and discussed in Sec. 6. The privacy concerns arising from the increased use of technology in this domain are discussed in Sec. 7. We finally provide prospective research directions and discuss open challenges in the domain in Sec.  8.

2 Preliminaries

The human visual system involves complex cognitive processes. Thus, understanding and modelling human gaze has become a fundamental research problem in psychology, neurology, cognitive science, and computer vision. Below, we provide a brief description of the Human Visual System and Eye Modelling (Sec. 2.1), Eye Movements (Sec. 2.2), Problem Settings in Automated Eye Gaze Analysis (Sec. 2.3) and the associated Challenges (Sec. 2.4).

2.1 Human Visual System and Eye Modelling

Computer vision based human visual perception methods estimate gaze quantitatively from image or video data. These methods rely on the visible region of the eyes, where the iris and sclera are only partially visible, even though both visible and completely unobservable factors are needed to determine gaze (see Fig. 2, Top Left). Such occluded information can be estimated by learning a better prior on human eyes, by analyzing the patterns over time, or via representation learning over large-scale data. For gaze estimation, the geometric shape of the eyeball is considered spherical with a radius of 12-13 mm. The gaze direction is modelled with reference to the optical axis or visual axis (see Fig. 2, Right). The line of gaze (LoG) is the line connecting the pupil, cornea and eyeball center. Similarly, the line of sight (LoS) is the line connecting the fovea and the center of the cornea. Generally, the LoS is considered the true direction of gaze. The intersection point of the visual and optical axes is called the nodal point of the eye (anatomically the cornea center), and it usually has a subject-dependent angular offset. This is the main reason gaze tracking devices require subject-dependent calibration. According to prior studies [carpenter1988movements, guestrin2006general], the fovea is located around 4-5° horizontally and 1.5° vertically below the optical axis, and this offset may vary by up to 3° among subjects [guestrin2006general]. Additionally, head pose also plays an important role in gaze analysis. The axis passing through the 3D location of the eyeball center is the indicator of the head-pose direction [hansen2009eye]. Most of the time, the combined direction of the LoS and head pose indicates where the person is looking.
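
To make the 3D representation concrete, the short sketch below converts pitch/yaw gaze angles into a unit gaze vector and computes the angular error between two directions, the standard evaluation quantity for 3D gaze estimation. The axis convention (negative z pointing towards the camera) is an assumption for illustration; datasets differ in their coordinate conventions.

```python
import numpy as np

def angles_to_vector(pitch, yaw):
    """Convert gaze angles (radians) to a 3D unit gaze vector.

    Assumes a camera-facing convention (negative z towards the camera);
    the exact convention varies between datasets.
    """
    return np.array([
        -np.cos(pitch) * np.sin(yaw),
        -np.sin(pitch),
        -np.cos(pitch) * np.cos(yaw),
    ])

def angular_error_deg(g_pred, g_true):
    """Angular error (degrees) between two gaze vectors."""
    g_pred = g_pred / np.linalg.norm(g_pred)
    g_true = g_true / np.linalg.norm(g_true)
    cos_sim = np.clip(np.dot(g_pred, g_true), -1.0, 1.0)
    return np.degrees(np.arccos(cos_sim))

# Example: an offset of ~5 deg yaw / -1.5 deg pitch, similar in spirit to the
# subject-dependent optical/visual axis offset (angle kappa) described above.
optical = angles_to_vector(np.radians(0.0), np.radians(0.0))
visual = angles_to_vector(np.radians(-1.5), np.radians(5.0))
print(f"offset: {angular_error_deg(optical, visual):.2f} deg")
```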

2.2 Eye Movements

Humans perceive their environment via eye movements, including the voluntary or involuntary movement of the eyes, which helps in acquiring, fixating on and tracking visual stimuli (see Fig. 2). Generally, eye movements are divided into the categories discussed next.

Saccade. Saccades are rapid and reflexive eye movements mostly used for adjusting to a new location in the visual environment. They can be executed voluntarily or invoked as part of an optokinetic response [duchowski2017eye], and they last 10 to 100 ms.

Smooth Pursuit. Smooth pursuit movements occur while tracking a moving target. This involuntary action depends on the range of the target’s motion, as human eyes can follow the velocity of a moving target only to some extent.

Fixations (Microsaccades, Drift, and Tremor). Fixations are eye movements in which the focus of attention is stabilized over a stationary object of interest. Fixations are characterized by three types of miniature eye movements: tremor, drift and microsaccades [duchowski2017eye]. During fixations, these miniature eye movements occur due to noise in the control system that holds gaze steady. This noise is confined to the area of fixation, around 5° of visual angle. To simplify the underlying natural process, this noise is ignored during fixation.

2.3 Gaze Estimation: Problem Setting

The main task of gaze estimation is to determine the line of sight of the pupil. Fig. 4 depicts a typical visual sensor based real-time gaze estimation setup consisting of user, data capturing sensor(s) and visual plane. The main calibration factors in this setting are:


  • Estimation of camera calibration parameters, which include the intrinsic camera parameters.

  • Estimation of geometric calibration parameters, which include the relative positions of the camera, light source and screen.

  • Estimation of personal calibration parameters, which include head pose and eye-specific parameters such as cornea curvature, the nodal point of the eye, etc.

In some applications, the calibration parameters are estimated in task-specific settings. For example, in gaze-based interaction devices, users are requested to fixate their gaze on some predefined points. Similarly, for subject-specific calibration, the user-specific information needs to be registered once in the device. With the advances in computer vision and deep learning, gaze estimation techniques are nowadays developed on the basis of appearance-based features, which do not require an explicit calibration step. Recently, Santini et al. [santini2017calibme] proposed a fast and unsupervised gaze tracker calibration technique for gaze-based pervasive human-computer interaction to overcome this burden.
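
Once the geometric calibration (screen pose with respect to the camera) and a 3D gaze origin and direction are available, the point of regard follows from a simple ray-plane intersection. The sketch below illustrates this computation with toy values; the coordinate system, distances and variable names are assumptions for illustration only.

```python
import numpy as np

def point_of_gaze(eye_center, gaze_dir, plane_point, plane_normal):
    """Intersect a gaze ray (origin + direction) with the visual plane.

    All quantities are expressed in the camera coordinate system and assume
    the geometric calibration (screen pose w.r.t. the camera) is known.
    """
    gaze_dir = gaze_dir / np.linalg.norm(gaze_dir)
    denom = np.dot(plane_normal, gaze_dir)
    if abs(denom) < 1e-6:
        raise ValueError("Gaze ray is parallel to the visual plane")
    t = np.dot(plane_normal, plane_point - eye_center) / denom
    return eye_center + t * gaze_dir

# Toy setup: a screen 60 cm in front of the camera, facing the user.
pog = point_of_gaze(
    eye_center=np.array([0.0, 0.0, 0.0]),       # 3D eye position (cm)
    gaze_dir=np.array([0.1, -0.05, 1.0]),       # estimated gaze direction
    plane_point=np.array([0.0, 0.0, 60.0]),     # a point on the screen plane
    plane_normal=np.array([0.0, 0.0, -1.0]),    # screen plane normal
)
print(pog)
```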


Figure 3: The plot shows the popularity of different data capturing devices across research articles over the past 10 years. Here, HMD: Head Mounted Device, RGBD: RGB Depth camera.

Role of Data Capturing Sensors. Nowadays, mostly visual stimuli are used in computer vision based gaze estimation techniques, although earlier methods relied on specialized hardware. Such methods are termed intrusive techniques and require physical contact with human skin or eyes. The widely used devices and sensors are head-mounted devices, electrodes, or scleral coils [xia2007ir, tsukada2011illumination]. These devices can cause an unpleasant user experience, and the accuracy of these systems depends on the tolerance and other characteristics of the devices. Image processing based gaze estimation methods fall under the non-intrusive category and do not require physical contact [leo2014unsupervised]. These methods face several challenges, including partial occlusion of the iris by the eyelid, illumination conditions, head pose, specular reflection when the user wears glasses, the inability to use standard shape fitting for iris boundary detection, and other effects such as motion blur and over-saturation of the image [leo2014unsupervised]. To deal with these challenges, most existing gaze estimation methods have been developed in constrained environments with fixed head pose, controlled illumination conditions and camera angle. Such methods require a large amount of high-resolution labeled images. Robust gaze estimation needs accurate pupil-center localization, and fast and accurate pupil-center localization is still a challenging task [gou2017joint], particularly for low-resolution images. The trade-offs of widely used sensors are summarized in Fig. 3.

Role of Headpose. Gaze estimation is a challenging task mostly due to the eye-head interplay, and head pose plays a crucial role in it. The gaze direction of a subject is determined by the combined position and orientation of the head and the eyeball. One can change gaze direction via eyeball and pupil movement while keeping the head stationary, by moving the head, or by moving both. This process is usually subject dependent: people adjust their head pose and eye gaze to maintain a comfortable posture. Thus, gaze estimation needs to consider both eye gaze and head pose at inference time. For this reason, it is common to incorporate head-pose information into gaze estimation methods implicitly or explicitly [zhang2015appearance, zhang2017mpiigaze].

Figure 4: Overview of gaze estimation setups (see Sec. 2.3 for more details). A traditional gaze analysis setup considers the effect of head, visual plane and camera coordinates. The gaze analysis tasks include gaze zone, point of regard and gaze trajectory estimation (see Sec. 3). The gaze vector is defined by the angles (θ, φ) in a polar coordinate system, as shown in the gaze direction part.

Role of Visual Plane. The visual plane is the plane containing the gaze target point, i.e. where the subject is looking, which is often termed the Point of Gaze (PoG). The distance between the user and the visual plane varies widely in real-world settings. Thus, recent deep learning-based methods mostly do not rely on the distance or placement of the visual plane. The most common gaze analysis setup uses an RGB camera placed 20-70 cm from the user in an unconstrained setting. In different real-world settings, the visual plane could be a desktop (~60 cm), mobile phone (~20 cm), car (~50 cm), etc. An overview is presented in Table I.

Platform | Dist. | VA | UC | Papers
Desktop, TV Panels | 30-50, 200-500 | 40°, 40°-60° | Static, Sitting, Upright | [zhang2019evaluation, zhang2015appearance, zhang2017mpiigaze, zhang2017s, zhu2017monocular, cui2017specialized, park2018deep, FischerECCV2018, cheng2018appearance, palmero2018recurrent, jyoti2018automatic, yu2018deep, chong2018connecting, zhang2018training, park2019few, chen2018appearance, zhou2019learning, wang2019neuro, lian2018multiview, lian2019rgbd, liu2019differential, cheng2020coarse, zhang2020learning]
HMD | 2-5 | 55°-75° | Independent (Leanback, Sitting, Upright) | [jha2018probabilistic, hu2020dgaze, garbin2019openeds, palmero2020openeds2020]
Automotive | 50 | 40°-60° | Mobile, Sitting, Upright | [ghosh2020speak2label, tawari2014driver, tawari2014robust, vasli2016driver, fridman2015driver, fridman2016owl, choi2016real, lee2011real]
Handheld | 20-40 | 5°-12° | Leanfwd, Sitting, Standing, Mobile | [krafka2016eye, zhang2018training, he2019device, huang2015tabletgaze, guo2019generalized, bao2021adaptive]
ET/FV | - | - | Leanfwd, Sitting, Standing, Upright | [gaze360_2019, dubey2019unsupervised]
Table I: Attributes of different platforms widely used in gaze analysis. Here, Dist.: distance (in cm), VA: viewing Angle (in °), HMD: Head Mounted Devices, FV: Free Viewing, UC: User Condition, ET: External Target.

2.4 Challenges

In this section, we discuss the major challenges associated with eye gaze analysis.

Data Annotation. Generally, deep learning based methods require a large amount of annotated data for generalized representation learning. Curating large-scale annotated gaze datasets is non-trivial [gaze360_2019, zhang2015appearance, huang2015tabletgaze], time consuming and requires expensive equipment. Current dataset recording paradigms based on wearable sensors may lead to an uncomfortable user experience and require expert knowledge. Another common aspect of current datasets is the constrained environment in which they are recorded. Recently, a few datasets [gaze360_2019, huang2015tabletgaze] have been proposed to address this gap by recording in unconstrained environments. However, it is assumed that participants fixate their gaze as per the given instructions [gaze360_2019, zhang2015appearance, huang2015tabletgaze]. Despite these attempts [gaze360_2019, zhang2015appearance, huang2015tabletgaze], data annotation still remains complex, noisy and time-consuming. Self-, weakly- and un-supervised learning paradigms [dubey2019unsupervised, yu2019unsupervised] could help address the dataset creation and annotation challenges.

Subjective Bias. Another challenge for automatic gaze analysis methods is subjective bias. As the nodal point of the human eye has a subject-specific offset, it is challenging to learn the variations across a large number of subjects with different eye characteristics. In an ideal scenario, a gaze analysis method should encode rich features corresponding to the eye region appearance, which provide relevant information for gaze analysis. To address this challenge, few-shot learning based approaches are widely adopted [park2019few, yu2019improving], where the motivation is to adapt to a new subject with minimal subject-specific information. Moreover, combining classical eye-model based approaches and geometrical constraints with learning-based appearance encoding [wang2018hierarchical] has the potential to generalize well across subjects, which could be another way to deal with subjective bias.

Eye Blink. Blinks are an involuntary and periodic motion of the eyelids. They pose a challenge for eye gaze analysis in the form of missing data. A few recent works [gaze360_2019, ghosh2020speak2label] assume that head pose is a suitable replacement for eye gaze during eye-closure events, based on a common line of sight between a subject’s head pose and eye gaze. However, a large shift in gaze is possible once the subject opens their eyes again. To simplify the situation, many image based eye analysis methods [huang2015tabletgaze, jyoti2018automatic] do not use eye blink data during their training phase. Another set of approaches [vora2017generalizing, vora2018driver] considers eye blink as a separate class. A possibility for real-world deployment of such a system is to generate eye gaze labels by interpolating from neighbouring frames’ labels when an eye blink is detected.

Data Attributes. Several factors such as eye-head interplay, occlusion, blurred images and illumination can influence the performance of a gaze analysis model. The presence of any subset of these attributes can degrade the performance of a system [gaze360_2019, huang2015tabletgaze]. Many methods use face alignment [zhang2015appearance, krafka2016eye] and 3D head pose estimation [zhang2015appearance] as pre-processing steps. However, face alignment on images captured in unconstrained environments may introduce noise into the system. To overcome this, recent approaches [gaze360_2019, zhang2020eth, jyoti2018automatic, dubey2019unsupervised] avoid these pre-processing steps and show improved eye gaze prediction performance.

Another critical challenge in gaze estimation is the eye-head interplay. Prior studies generally address this issue via implicit training [krafka2016eye, zhang2017everyday] or by providing the head pose information separately as a feature [zhang2015appearance]. Similarly, it is challenging to estimate gaze under partial occlusion. When the yaw head rotation is greater than 90°, one side of the face becomes occluded w.r.t. the camera. A few prior works [krafka2016eye, zhang2015appearance] avoid these scenarios by disregarding such frames. However, Kellnhofer et al. [gaze360_2019] argue that when the head yaw angle is in the range 90°-135°, the partial visibility still provides relevant information about the gaze direction. This study also proposes quantile regression via a pinball loss to mitigate the effect of partial occlusion in the training data. Despite all of these attempts, gaze estimation still remains challenging in the presence of these attributes. There is still scope to reduce their effects and make gaze analysis models more robust for real-world deployment.

Application Specific Challenges. Gaze analysis also has application-specific requirements, for example, coarse or fine gaze estimation in AR, VR, robotics, egocentric vision and HCI. The algorithm behind any eye-tracking device needs to fit the constraints of its application environment.

3 Eye Gaze Analysis in Computer Vision

Here, we provide a breakdown of different gaze analysis tasks for vision-based applications.

3.1 Eye Gaze Estimation

Most existing studies formulate gaze estimation either as predicting the gaze direction in 3D space or as predicting the point of regard in 2D/3D coordinates (see Fig. 4). Statistical gaze modeling mainly estimates the relation between the input visual data and the point of regard/gaze direction. In some cases, the exact gaze position or location is not required; gaze zone estimation is then an interesting alternative for many applications [vora2018driver, ghosh2020speak2label, kaur2018prediction]. We can divide the gaze estimation methods into the following types:

Geometric. The geometric methods compute the gaze direction from a geometric model of the eye (see Fig. 2), where the anatomical structure of the eye is used to obtain the 3D gaze direction or gaze vector. These methods were widely used in the pre-deep-learning era [hansen2009eye]. Recent deep learning-based approaches implicitly model these parameters during the learning process, and therefore do not explicitly require the complicated processing of subject-specific parameters such as cornea radii, cornea center, angle of kappa (see Fig. 2), iris radius, the distance between the pupil center and cornea center, etc. These measurements can also contain inherent noise.

Regression. The regression-based methods [morimoto2005eye, zhang2015appearance, park2018deep, yu2019unsupervised] mainly map visual stimuli (images or image-related features) to gaze coordinates or gaze angles in 2D/3D. The output mapping is application-specific. For example, 2D/3D gaze coordinates mainly map a person’s focus of attention to screen coordinates (for human-computer interaction based applications such as engagement, attention and driver monitoring). The regression-based methods can be divided into two types: the parametric approaches [morimoto2005eye, yu2019unsupervised], which assume the mapping to be a polynomial and estimate the parameters of the polynomial equation, and the non-parametric approaches, which learn the mapping directly instead of explicitly calculating the intersection between the gaze direction and the gazed object [zhang2015appearance, park2018deep, jyoti2018automatic]. The neural network-based approaches fall into the latter category.
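
To illustrate the parametric flavour, the sketch below fits a second-order polynomial that maps a 2D eye feature (e.g. a pupil-glint vector) to screen coordinates via least squares, in the spirit of classical calibration-based regression. The feature definition and the synthetic calibration data are assumptions made purely for the example.

```python
import numpy as np

def poly_features(x, y):
    """Second-order polynomial expansion of a 2D eye feature."""
    return np.stack([np.ones_like(x), x, y, x * y, x**2, y**2], axis=1)

def fit_parametric_mapping(eye_xy, screen_xy):
    """Least-squares fit of polynomial coefficients mapping eye features
    to 2D screen coordinates, as done during user calibration."""
    A = poly_features(eye_xy[:, 0], eye_xy[:, 1])
    coeffs, *_ = np.linalg.lstsq(A, screen_xy, rcond=None)
    return coeffs                                   # shape (6, 2)

def predict_pog(eye_xy, coeffs):
    return poly_features(eye_xy[:, 0], eye_xy[:, 1]) @ coeffs

# Toy calibration: 9 fixation targets with synthetic eye features.
rng = np.random.default_rng(0)
eye = rng.uniform(-1, 1, size=(9, 2))
screen = 500 * eye + 20 * eye**2 + rng.normal(0, 1, size=(9, 2))
coeffs = fit_parametric_mapping(eye, screen)
print(predict_pog(eye[:2], coeffs))
```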

Trajectory Prediction. Eye gaze estimation has potential applications in AR/VR, especially in Foveated Rendering (FR) and Attention Tunneling (AT), where predicting the future eye trajectory is highly desirable. To meet this requirement, a new research direction, future gaze trajectory prediction, has recently been introduced [palmero2020openeds2020]. Here, possible future gaze locations are estimated based on the prior gaze points, the content of the visual plane, or their combination. The problem can thus be formulated as follows: given a number of prior gaze points, the algorithm predicts the gaze direction of future frames in a person-specific setting.
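
A minimal sketch of this formulation is given below: an LSTM encodes the observed 2D gaze points and a linear head regresses the next few points, trained with an L2 loss. The network sizes, the prediction horizon and the decision to ignore scene content are assumptions for illustration, not the protocol of any particular challenge entry.

```python
import torch
import torch.nn as nn

class GazeTrajectoryPredictor(nn.Module):
    """Predict the next `horizon` gaze points from the previous T points."""

    def __init__(self, hidden=64, horizon=5):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.LSTM(input_size=2, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2 * horizon)

    def forward(self, past_gaze):                    # (B, T, 2)
        _, (h, _) = self.encoder(past_gaze)
        out = self.head(h[-1])                       # (B, 2 * horizon)
        return out.view(-1, self.horizon, 2)         # future gaze points

model = GazeTrajectoryPredictor()
past = torch.randn(8, 20, 2)                         # 20 observed gaze points
future = model(past)                                 # (8, 5, 2)
loss = nn.functional.mse_loss(future, torch.randn_like(future))
```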

Gaze Zone. In many gaze estimation based applications, such as driver gaze monitoring [ghosh2020speak2label, vora2018driver, vora2017generalizing, dubey2019unsupervised], gaming platforms [corcoran2012real] and website design [chu2009using], the exact position or angle of the line of sight is not required. In these cases, the gaze zone is estimated instead. Here, a gaze zone refers to an area in 2D or 3D space. For example, for driver gaze estimation, the possible regions where the driver can look while driving are termed gaze zones.

Gaze Redirection. Due to the challenges posed by different gaze conditions, on-the-fly generation is gaining popularity [chen2020mggr, palmero2020openeds2020]. It aims to capture subject-specific signals from a few eye images of an individual and generate realistic eye images for the same individual under different eye states (gaze direction, camera position, eye openness, etc.). Gaze redirection can be performed in both controlled and uncontrolled ways [zheng2020self, chen2020mggr, chen2021coarse]. Apart from this, eye rendering is another research direction that generates realistic eyes given the appearance and gaze direction of a person.

Unconstrained Gaze Estimation. 1) Single Person Setting: In webcam or RGB camera based gaze estimation, model-based eye tracking [baltruvsaitis2016openface, baltrusaitis2018openface] is mostly utilized. These methods use geometric eye modelling to perform eye tracking, which has the inherent advantages of requiring no training data and, most importantly, being fast. However, they rely on accurate eye localization and key point detection, which is hard to achieve in real-world environments. Although deep learning-based methods [king2009dlib, park2018learning] have mitigated this issue to some extent, it still remains a challenge as they do not generalize well across different settings. 2) Multi-Person Setting: In an unconstrained multi-person setting, it is very difficult to track the eyes. For example, in a social interaction scenario, understanding the gaze behaviour of each person provides important cues to interpret social dynamics. To this end, a new research direction has been introduced where the problem is defined as whether people are Looking At Each Other (LAEO) in a given video sequence [marin2019laeo, marin2021pami, Kothari_2021_CVPR]. Similarly, gaze communication [fan2019understanding] is another line of research aligned with this field.

3.2 Eye Segmentation

The main task of eye segmentation is pixel-wise or region-wise differentiation of the visible eye parts. In general, the eye region is divided into three parts: the sclera (the white region of the eye), the iris (the coloured ring of tissue around the pupil) and the pupil (the dark region at the centre of the iris). Prior studies [sankowski2010reliable, radu2015robust, das2017sserbc, lucio2018fully] on eye segmentation mainly explore segmenting the iris and sclera regions. A few studies [garbin2019openeds, palmero2020openeds2020] include the pupil region in the segmentation task as well. Eye segmentation is widely used in biometric systems [das2013sclera] and as a prior for synthetic eye generation [chaudhary2019ritnet].

4 Eye Gaze Analysis Framework

We break down a gaze analysis framework into its fundamental components (Fig. 5): eye/face registration, representation and inference, and discuss their roles below. As there is a high overlap between the representation and inference modules, we collectively refer to them as the learning paradigm.

4.1 Registration

Eye registration is the first stage of gaze analysis and requires detection of the eye and the relevant regions of interest. Here, we explore the prior literature on eye registration processes by discussing their advantages and limitations, their ability to detect eye components in different challenging conditions and computational complexity.

Eye Detection Methods. The main aim of an eye detection algorithm is to accurately identify the eye region in an input image under challenging conditions such as occlusion, eye openness, variability in eye size, head pose, illumination and viewing angle, while balancing the trade-off between appearance, dynamic variation and computational complexity. Prior works on eye detection can be divided into three categories: shape-based [hansen2005eye], appearance-based [liang2013appearance, wang2016hybrid, zhang2019evaluation, baltrusaitis2018openface] and hybrid methods [wang2016hybrid]. The most popular libraries for eye and facial point detection are Dlib (https://github.com/davisking/dlib) [king2009dlib], OpenFace (https://github.com/cmusatyalab/openface) [baltrusaitis2018openface, baltruvsaitis2016openface], MTCNN (https://github.com/kpzhang93/MTCNN_face_detection_alignment) [zhang2016joint], Dual Shot Face Detector (https://github.com/Tencent/FaceDetection-DSFD) [li2018dsfd] and FaceX-Zoo (https://github.com/JDAI-CV/FaceX-Zoo) [wang2021facex]. Apart from these, learning-based pupil localization methods utilize ensembles of randomized trees [markuvs2014eye], local self-similarity matching [leo2014unsupervised], adaptive gradient boosting [tian2016accurate], Hough regression forests [kacete2016real], deep learning based landmark localization models [park2018learning, king2009dlib], heterogeneous CNN models [choi2019accurate], etc.
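
For reference, a minimal eye-registration sketch using Dlib is shown below; it crops the left/right eye patches from the standard 68-point landmark layout (indices 36-41 and 42-47). The image path and padding are placeholders, and the 68-point predictor file has to be obtained from the Dlib model zoo.

```python
import cv2
import dlib
import numpy as np

# The 68-point landmark model file must be downloaded separately.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_eye_patches(image_bgr, pad=5):
    """Crop left/right eye patches via the 68-point landmark scheme."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    patches = []
    for face in detector(gray, 1):
        shape = predictor(gray, face)
        pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(68)],
                       dtype=np.int32)
        for lo, hi in [(36, 42), (42, 48)]:          # eye landmark ranges
            x, y, w, h = cv2.boundingRect(pts[lo:hi])
            y0, x0 = max(0, y - pad), max(0, x - pad)
            patches.append(image_bgr[y0:y + h + pad, x0:x + w + pad])
    return patches

patches = extract_eye_patches(cv2.imread("face.jpg"))  # hypothetical image path
```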

Figure 5: A generic gaze analysis framework has different components including registration, gaze representations and inference. Although in the deep learning based approaches, there is a high overlap between the representation and inference module. Refer Sec. 4 for more details.

Eye Blink Detection. Eye blinks are an involuntary and periodic activity that can help judge the cognitive state of a person (e.g. driver fatigue [pandey2021real], lie detection [monaro2020using]). KLT trackers and various sensors are also widely used to obtain eye motion information for blink tracking [drutarovsky2014eye]. Existing eye blink detection approaches mostly solve a binary classification problem (blink/no blink) in either a heuristic-based or data-driven way. The heuristic-based approaches mainly include motion localization [drutarovsky2014eye] and template matching [krolak2012eye]. As these methods rely heavily on pre-defined thresholds, they can be sensitive to subjective bias, illumination and head pose. To overcome this limitation, data-driven approaches infer blinks on the basis of appearance-based temporal motion features [cech2016real, drutarovsky2014eye] or spatial features [daza2020mebal]. In a hybrid approach [hu2019towards], a multi-scale LSTM based framework is used to detect eye blinks using both spatial and temporal information.
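
A common landmark-based heuristic in this family is the Eye Aspect Ratio (EAR): the eye is considered closed when the ratio of vertical to horizontal eye opening stays below a threshold for a few consecutive frames. The sketch below illustrates the idea; the threshold and frame count are illustrative and would normally be tuned per dataset, which reflects the sensitivity to pre-defined thresholds noted above.

```python
import numpy as np

def eye_aspect_ratio(eye):
    """Eye Aspect Ratio (EAR) from 6 eye landmarks ordered as in the
    68-point scheme: vertical opening divided by horizontal width."""
    v1 = np.linalg.norm(eye[1] - eye[5])
    v2 = np.linalg.norm(eye[2] - eye[4])
    h = np.linalg.norm(eye[0] - eye[3])
    return (v1 + v2) / (2.0 * h)

def detect_blinks(ear_sequence, threshold=0.2, min_consec_frames=2):
    """Count blinks: EAR below `threshold` for `min_consec_frames` frames."""
    blinks, run = 0, 0
    for ear in ear_sequence:
        if ear < threshold:
            run += 1
        else:
            if run >= min_consec_frames:
                blinks += 1
            run = 0
    return blinks
```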


Figure 6: A brief outline of different pipelines used for gaze analysis tasks. Refer Sec. 4.2 for more details of the networks.

4.2 Representative Deep Network Architectures

In this section, we provide a generic formulation and representation of gaze analysis. Given an RGB image, usually of the face or eye regions, a deep learning-based model maps it to a task-specific label space. Based on the primary network architectures adopted in the literature, we classify the models into the following categories: CNN-based, Multi-Branch network-based, Temporal-based and VAE/GAN-based. An overview is shown in Fig. 6.

4.2.1 CNN-based

Most recent solutions adopt a CNN-based architecture [zhang2015appearance, zhang2017s, zhang2017mpiigaze, krafka2016eye, park2018deep, wang2018hierarchical], which aims to learn an end-to-end spatial representation followed by gaze prediction. The model used is often a modified version of a popular vision CNN (e.g. AlexNet [dubey2019unsupervised], VGG [FischerECCV2018], ResNet-18 [gaze360_2019, wu2019eyenet], ResNet-50 [zhang2020eth], Capsule network [dubey2019unsupervised]). These CNNs can be single-stream [zhang2015appearance, zhang2017s], multi-stream [jyoti2018automatic, krafka2016eye], or prior-based networks [park2018deep]. They learn from a single stream of RGB images (e.g. the face, left or right eye patch), from multiple streams of information (e.g. face and eye patches), or from prior knowledge based on eye anatomy or geometrical constraints.

GazeNet. It is the extended version of the first deep learning-based gaze estimation method [zhang2015appearance], which takes a grayscale eye patch image as input and maps it to an angular gaze vector. The mapping function consists of 5 convolutional layers followed by 2 FC layers, inherited from a LeNet-style architecture. As head pose provides relevant cues for gaze direction, the head-pose vector is also injected into the FC layer for better inference (Refer top left image in Fig. 6). The extended version [zhang2017mpiigaze] replaces this with a 13-convolutional-layer network adapted from VGG, which boosts the performance. To train these models, the sum of the individual losses between the predicted and actual gaze angle vectors is used.
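
A rough sketch of such a single-stream regressor is given below: a small CNN over a grayscale eye patch, with the head-pose angles concatenated at the fully connected stage before regressing the two gaze angles. The layer sizes are illustrative, not the published GazeNet configuration.

```python
import torch
import torch.nn as nn

class GazeNetSketch(nn.Module):
    """Eye-patch CNN with head pose injected at the FC stage (illustrative)."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(20, 50, 5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(nn.Flatten(), nn.LazyLinear(500), nn.ReLU())
        self.head = nn.Linear(500 + 2, 2)            # + 2D head-pose vector

    def forward(self, eye_patch, head_pose):
        x = self.fc(self.features(eye_patch))
        x = torch.cat([x, head_pose], dim=1)
        return self.head(x)                          # (pitch, yaw)

model = GazeNetSketch()
gaze = model(torch.randn(4, 1, 36, 60), torch.randn(4, 2))
loss = nn.functional.mse_loss(gaze, torch.zeros_like(gaze))
```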

Spatial Weight CNN. This is a full-face appearance-based gaze estimation method [zhang2017s] which uses a spatial weighting mechanism to encode the important locations of the facial image within a standard CNN architecture (Refer top row, second column image in Fig. 6). The spatial weights mechanism includes three additional $1 \times 1$ convolutional layers, each followed by a ReLU activation. Given a $C \times H \times W$ activation map $U$ as input (where $C$, $H$ and $W$ are the number of feature channels, height and width of the output), the spatial weights module learns a weight map $W$ which is multiplied element-wise with the original activation across the channel dimension, i.e. $V_c = W \odot U_c$ for each channel $c$. Thus, the model learns to give more weight to specific regions and to suppress unwanted noise in the input. For 2D gaze estimation, the distance between the predicted and ground-truth gaze positions in the target screen coordinate system is used as the loss. Similarly, the distance between the predicted and ground-truth gaze angle vectors in the normalized space is used for 3D gaze estimation.
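
A sketch of such a spatial-weighting block is given below: a few 1x1 convolutions produce a single-channel weight map that is multiplied element-wise with every channel of the activation. The intermediate channel widths are illustrative rather than the published configuration.

```python
import torch
import torch.nn as nn

class SpatialWeights(nn.Module):
    """1x1 convolutions -> single-channel weight map -> channel-wise product."""

    def __init__(self, channels):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1), nn.ReLU(),
            nn.Conv2d(channels // 2, channels // 2, kernel_size=1), nn.ReLU(),
            nn.Conv2d(channels // 2, 1, kernel_size=1), nn.ReLU(),
        )

    def forward(self, u):                 # u: (B, C, H, W)
        w = self.weight_net(u)            # (B, 1, H, W) spatial weight map
        return u * w                      # broadcast over the channel dim

feat = torch.randn(2, 64, 13, 13)
weighted = SpatialWeights(64)(feat)
```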

Dilated Convolution. Another interesting architecture for gaze estimation uses dilated convolutional layers, which preserve spatial resolution while increasing the size of the receptive field without increasing the number of parameters [chen2018appearance]. Given an input feature map $x$, a kernel of size $k_h \times k_w \times c$ ($k_h$: height, $k_w$: width, $c$: channels, with weights $w$ and bias $b$) and dilation rates $(r_h, r_w)$, the output feature map $y$ can be defined as:

$$y(i, j) = b + \sum_{c} \sum_{m=1}^{k_h} \sum_{n=1}^{k_w} w_{m,n,c} \, x_c(i + r_h \cdot m, \; j + r_w \cdot n).$$

The dilated convolution is applied to the facial and left/right eye patches before inferring the gaze. For training the network, a cross-entropy loss is used in the label space. Representation learning via MinENet [perry2019minenet] also relies on dilated and asymmetric convolutions to provide context to the segmented regions of the eye by increasing the receptive-field capacity of the model to learn contextual information.
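
In modern frameworks the dilation rate is a single argument. The snippet below shows a 3x3 convolution with dilation 2, which covers a 5x5 neighbourhood with only nine weights while keeping the output resolution unchanged; the tensor sizes are chosen arbitrarily for illustration.

```python
import torch
import torch.nn as nn

# A 3x3 kernel with dilation 2 has a 5x5 receptive field but only 9 weights;
# padding=2 keeps the spatial resolution of the feature map unchanged.
dilated = nn.Conv2d(in_channels=64, out_channels=64,
                    kernel_size=3, dilation=2, padding=2)

x = torch.randn(1, 64, 36, 60)            # eye-patch sized feature map
y = dilated(x)
print(y.shape)                            # torch.Size([1, 64, 36, 60])
```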

Bayesian CNN. Another variant of CNN is Bayesian CNN which is used for robust and generalizable eye-tracking under different conditions [wang2019generalizing]. It captures the probabilistic relationships between eye appearance and its landmarks. Compared to the point-based eye landmark estimation methods, the BNN model can generalize better and it is also more robust under challenging real-world conditions. Additionally, the extended version of the BCNN (i.e. the single-stage model to multi-stage, yielding the cascade BCNN) allows feeding the uncertainty information from the current stage to the next stage to progressively improve the gaze estimation accuracy. This could be an interesting area for further study.

Pictorial Gaze. The pictorial gaze network [park2018deep] consists of two parts: 1) regression from the eye patch image to an intermediate gazemap, followed by 2) regression from the gazemap to the gaze direction (Refer second row, second column image in Fig. 6). The gazemap is an intermediate pictorial representation of a simple model of the human eyeball and iris rendered as an $m \times n$ image: the eyeball is drawn as a circle of projected diameter $2r$, and the iris centre coordinates are obtained by projecting the gaze direction $g = (\theta, \phi)$ onto the image plane, so the iris appears as an ellipse with a major-axis diameter of $r$ and a minor-axis diameter that shrinks as the gaze turns away from the camera.

The first part is implemented via a stacked hourglass architecture, which is assumed to encode complex spatial relations, including the locations of occluded key points. The hourglass network predicts the gazemaps via 3 hourglass modules with intermediate supervision applied to the gazemap outputs. This sub-network is optimized by minimizing the cross-entropy loss between predicted and ground-truth gazemaps over all pixels. For the second part, i.e. regression to the ground-truth gaze, a DenseNet architecture maps the gazemap to the gaze vector, and the network is trained with a gaze direction regression loss between the predicted and ground-truth gaze directions.

Ize-Net. This framework performs coarse-to-fine gaze representation learning. The main idea is to learn a coarse gaze representation by dividing the gaze locations into gaze zones. The proposed network [dubey2019unsupervised] (Refer second row, right image in Fig. 6) is a combination of convolutional and primary capsule layers. Similar to GazeNet, it contains five convolutional layers, each followed by batch normalization and max pooling. After the convolutional layers, a primary capsule layer is appended, whose job is to take the features learned by the convolutional layers and produce combinations of them that take face symmetry into account. Further, the output of the primary capsule layer is flattened and fed to FC layers of dimension 1024 and 512 before prediction. This network is trained for coarse gaze zone prediction and fine-tuned for downstream 2D/3D gaze estimation.

EyeNet. EyeNet consists of modified residual units as the backbone, attention blocks and a multi-scale supervision architecture. The network is robust to low resolution, image blur, glints, illumination, off-angle and off-axis views, reflections, glasses and different iris colours.

4.2.2 Multi-Branch network based

There are several works [FischerECCV2018, krafka2016eye, jyoti2018automatic] which utilize multiple inputs for better inference.

iTracker. The iTracker framework [krafka2016eye] takes the detected left eye, right eye and face images, along with the face location in the original frame encoded as a binary mask, as input and predicts the 2D gaze location relative to the camera (in cm). The model is jointly trained with a Euclidean loss on the x and y gaze positions. The overview of the framework is shown in the second row, third column image in Fig. 6.

Multi-Branch Design. Similar to iTracker, Jyoti et al. [jyoti2018automatic] propose a framework which takes the full face, the left and right eye patches and a joint eye patch as input for inferring gaze (Refer Fig. 6, second row, left image). To train this network, the mean squared error between the true and predicted gaze point/direction is used.

Two-Stream VGG Network. In [FischerECCV2018], a two-stream VGG network is used for gaze inference while taking left and right eye patch as input. Similar to prior works, it utilizes the sum of the individual losses between the predicted and ground truth gaze vectors to train the ensemble network.
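
A minimal sketch of such a multi-branch design is shown below: separate CNN streams encode the face and the two eyes, and their features are concatenated before regression. The layer sizes are illustrative, and the face-grid branch used by iTracker is omitted for brevity.

```python
import torch
import torch.nn as nn

def small_cnn(out_dim=128):
    """Shared building block for each input stream (sizes illustrative)."""
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, out_dim), nn.ReLU(),
    )

class MultiBranchGaze(nn.Module):
    """Face + left/right eye streams, concatenated before gaze regression."""

    def __init__(self):
        super().__init__()
        self.face_net = small_cnn()
        self.eye_net = small_cnn()        # shared weights for both eyes
        self.regressor = nn.Sequential(
            nn.Linear(3 * 128, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, face, left_eye, right_eye):
        feats = torch.cat([self.face_net(face),
                           self.eye_net(left_eye),
                           self.eye_net(right_eye)], dim=1)
        return self.regressor(feats)      # 2D gaze point (or gaze angles)

model = MultiBranchGaze()
out = model(torch.randn(2, 3, 112, 112),
            torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64))
```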

4.2.3 Temporal-based

Since the human gaze is continuous, some works model it temporally. Given a sequence of frames, the task is to estimate the gaze direction of the concerned person. For this modelling, popular recurrent neural network structures have been explored (e.g. GRU [Park2020ECCV], LSTM/bi-LSTM [gaze360_2019]).

Pinball LSTM. To capture the continuous nature of gaze, the pinball LSTM [gaze360_2019] is proposed: a video-based gaze estimation model using a bidirectional LSTM that considers both past and future inputs. The framework utilizes sequences of 7 frames to predict the gaze of the central frame. Fig. 6, first row, third column illustrates the architecture of the model. The facial region from each frame is provided as input to a backbone CNN with ResNet-18 architecture, which produces 256-dimensional high-level features. These features are fed to a two-layer bidirectional LSTM. Finally, the features from the LSTMs are concatenated and passed through an FC layer to predict the gaze direction together with an error quantile estimate, i.e. $(\hat{\theta}, \hat{\phi}, \sigma)$, where $(\hat{\theta}, \hat{\phi})$ is the predicted gaze direction in spherical coordinates corresponding to the ground-truth gaze vector in the eye coordinate system, and $\sigma$ is the offset from the predicted gaze such that $\hat{\theta} + \sigma$ and $\hat{\phi} + \sigma$ lie at the 90% quantiles of its distribution and $\hat{\theta} - \sigma$ and $\hat{\phi} - \sigma$ at the 10% quantiles. The pinball loss is computed as follows: given the ground-truth angle $\theta$ (and analogously for $\phi$), the loss for the quantile $\tau$ can be written as

$$\mathcal{L}_{\tau,\theta} = \max\big(\tau \, q_{\tau,\theta}, \, (\tau - 1) \, q_{\tau,\theta}\big),$$

where $q_{\tau,\theta} = \theta - (\hat{\theta} - \sigma)$ for $\tau = 0.1$ and $q_{\tau,\theta} = \theta - (\hat{\theta} + \sigma)$ otherwise (i.e. $\tau = 0.9$). This loss forces $\hat{\theta}$ and $\hat{\phi}$ to converge to their ground-truth values while $\sigma$ captures the estimation uncertainty.
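
A compact implementation of this quantile objective might look like the sketch below, where the network is assumed to output a gaze estimate and a non-negative sigma per sample; the 0.1/0.9 quantile pair follows the description above.

```python
import torch

def pinball_loss(gaze_pred, sigma, gaze_true, quantiles=(0.1, 0.9)):
    """Quantile (pinball) loss: gaze_pred - sigma and gaze_pred + sigma are
    treated as the lower and upper quantile estimates of the gaze angles."""
    q_lo, q_hi = quantiles
    losses = []
    for tau, pred_q in ((q_lo, gaze_pred - sigma), (q_hi, gaze_pred + sigma)):
        err = gaze_true - pred_q
        losses.append(torch.max(tau * err, (tau - 1.0) * err))
    return (losses[0] + losses[1]).mean()

gaze_pred = torch.randn(8, 2)             # (pitch, yaw) predictions
sigma = torch.rand(8, 1)                  # predicted uncertainty
loss = pinball_loss(gaze_pred, sigma, torch.randn(8, 2))
```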

4.2.4 VAE/GAN-based

Variational autoencoders and GANs have been used for unsupervised or self-supervised representation learning (Refer Fig. 6). Here, the latent-space features of the autoencoder model are used for gaze estimation inference [park2019few, yu2019unsupervised, zheng2020self]. Apart from representation learning, VAE and GAN based models are widely used for gaze redirection tasks [shrivastava2017learning, zheng2020self, chen2021coarse, chen2020mggr].

DT-ED. For better representation learning, variational autoencoders have been utilized [park2019few, zheng2020self]. Moreover, gaze redirection is quite popular for learning a generalizable latent embedding space that represents gaze in a person-independent manner. Park et al. [park2019few] extend the transforming encoder-decoder architecture to consider three important factors relevant to the gaze estimation setting, i.e. gaze direction, head orientation, and other factors related to the appearance of the eye region in the given facial images. The framework disentangles these three factors by explicitly applying constraints related to gaze and head-pose rotations. This architecture is termed the Disentangling Transforming Encoder-Decoder (DT-ED). DT-ED takes an input image and maps it to a latent code $z$ via an encoder; a decoder then maps the (transformed) code back to a redirected image. The latent embedding consists of three parts representing appearance ($z_a$), gaze ($z_g$) and head pose ($z_h$), so that $z$ can be expressed as $z = [z_a; z_g; z_h]$. The gaze is estimated from the $z_g$ part of the latent embedding. The illustration is shown in Fig. 6 (bottom row, middle image).
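
The sketch below captures only the latent-factorization idea: the encoder output is split into appearance, gaze and head-pose parts, and only the gaze part feeds a small regressor. The dimensions and the MLP are illustrative, not the published DT-ED configuration.

```python
import torch
import torch.nn as nn

class LatentSplitHead(nn.Module):
    """Split a latent code into (appearance, gaze, head pose) and regress
    gaze from the gaze part only (dimensions are illustrative)."""

    def __init__(self, dim_app=64, dim_gaze=6, dim_head=6):
        super().__init__()
        self.dims = (dim_app, dim_gaze, dim_head)
        self.gaze_mlp = nn.Sequential(nn.Linear(dim_gaze, 32), nn.ReLU(),
                                      nn.Linear(32, 2))   # (pitch, yaw)

    def forward(self, z):                                  # z: (B, sum(dims))
        z_app, z_gaze, z_head = torch.split(z, self.dims, dim=1)
        return self.gaze_mlp(z_gaze), (z_app, z_gaze, z_head)

z = torch.randn(4, 64 + 6 + 6)            # latent code from some encoder
gaze, parts = LatentSplitHead()(z)
```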

ST-ED. Similarly, the Self-Transforming Encoder-Decoder (ST-ED) architecture [zheng2020self] (Refer Fig. 6, bottom row, right image) takes a pair of images as input, disentangles the subject’s personal non-varying embeddings, and considers pseudo-label conditions together with the corresponding embedding representations for each image. Transforming based on the pseudo condition labels during training accounts for several extraneous factors in the absence of ground truth.

Gaze Redirection Network. The main idea behind the unsupervised gaze redirection network [yu2019unsupervised] is to capture a generic eye representation as well as to perform redirection (Refer Fig. 6, bottom row, left image). The framework takes an eye patch as input and predicts the redirected eye patch as output while preserving the specified difference in rotation. In this work, gaze redirection is used as a pretext task for representation learning.

RITnet. RITnet [chaudhary2019ritnet] (Refer Fig. 6, top row, right image) is a hybrid of U-Net and DenseNet built upon Fully Convolutional Networks (FCN). To balance the trade-off between performance and computational complexity, it consists of 5 Down-Blocks in the encoder and 4 Up-Blocks in the decoder, where the last layer of the encoder is termed the bottleneck layer. Each Down-Block has 5 convolutional layers with LeakyReLU activation, and the layers share connections with previous layers similar to the DenseNet architecture. Similarly, each Up-Block has 4 convolutional layers with LeakyReLU activation. All Up-Blocks have skip connections with their corresponding Down-Blocks, which is an effective strategy for representation learning. To train the model, the following loss functions are used: 1) the standard cross-entropy loss (CEL) is applied pixel-wise to categorize each pixel into four categories (i.e. background, iris, sclera, and pupil); 2) the Generalized Dice Loss (GDL) penalizes pixels on the basis of the overlap between the ground-truth pixels and the corresponding predictions; 3) the Boundary Aware Loss (BAL) weights each pixel in terms of its distance to its two nearest neighbouring regions, which helps to avoid CEL confusion near boundaries; and 4) the Surface Loss (SL) helps to recover small regions and contours via distance-based scaling. The overall loss is a weighted combination of these four terms.

RITnet [chaudhary2019ritnet] uses the boundary aware loss and surface loss to enhance eye segmentation maps for better representation learning and class-specific categorization of the pixels. Similarly, another lightweight model [kim2019eye] uses MobileNet with depth-wise separable convolutions for efficiency. It also employs a squeeze-and-excitation (SE) module to improve features by modelling channel interdependence. Moreover, heuristic filtering of connected components is utilized to enforce biological coherence in the network. A few works [rot2018deep, luo2019ibug] utilize a multi-class classification strategy for rich representation learning.
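
As a rough illustration of this kind of multi-term segmentation objective, the sketch below combines pixel-wise cross-entropy with a soft Dice term over the four eye classes. The boundary-aware and surface losses are omitted and the weights are placeholders, so this is a simplification of the RITnet objective rather than a reimplementation.

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(logits, target, num_classes=4, eps=1e-6):
    """Multi-class soft Dice loss over background/iris/sclera/pupil."""
    probs = logits.softmax(dim=1)
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(2, 3))
    union = probs.sum(dim=(2, 3)) + onehot.sum(dim=(2, 3))
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

def segmentation_loss(logits, target, w_ce=1.0, w_dice=1.0):
    """Cross-entropy + Dice; weights are illustrative placeholders."""
    return w_ce * F.cross_entropy(logits, target) + \
           w_dice * soft_dice_loss(logits, target)

logits = torch.randn(2, 4, 64, 96)        # (B, classes, H, W)
target = torch.randint(0, 4, (2, 64, 96))
loss = segmentation_loss(logits, target)
```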

Other Statistical Modelling. Statistical inference based mappings have been performed with k-nearest neighbours (KNN) [huang2015tabletgaze], support vector regression [smith2013gaze, huang2015tabletgaze] and random forests [huang2015tabletgaze, sugano2014learning]. A brief overview of these methods is summarised in Table II. Pre-deep-learning work on semantic eye segmentation mainly focused on iris or sclera segmentation via Fuzzy C-Means clustering, Otsu’s binarization, K-NN [das2016ssrbc], etc. The Sclera Segmentation Challenge has been organized since 2015 to promote development in this area [das2016ssrbc, das2017sserbc, das2019sclera]. Recently, the OpenEDS challenge was organized in 2019 by Facebook Research, in which eye segmentation was one of the sub-challenges. Most of the methods in this challenge use deep learning techniques [chaudhary2019ritnet, boutros2019eye, perry2019minenet].

Ref. | Reg. | Represent. | Level of Sup. | Model | Prediction | Validation | Plat. | Publ. | Year
[smith2013gaze] | Face [Omron] | Appear. | Fully-Sup. | SVM | Gaze locking | [smith2013gaze] | Scr. | UIST | 2013
[funes2014eyediap] | 3D MM | Appear. | Fully-Sup. | Convex Hull | 3-D GV | [funes2014eyediap] | ET | ETRA | 2014
[sugano2014learning] | Face, Eye [sugano2014learning] | Appear. | Fully-Sup. | RRF | 3-D GV | [sugano2014learning] | Any | CVPR | 2014
[wood2015rendering] | Eye | Appear. | Fully-Sup. | CNN+CLNF | 3-D GV | [zhang2015appearance] | Any | ICCV | 2015
[zhang2015appearance] | Face, L/R Eye | Appear. | Fully-Sup. | CNN [zhang2015appearance] | 3-D GV | [zhang2015appearance] | Scr. | CVPR | 2015
[krafka2016eye] | Face, L/R Eye | Appear. | Fully-Sup. | iTracker [krafka2016eye] | 2-D Scr. | [huang2015tabletgaze, krafka2016eye] | HH | CVPR | 2016
[ganin2016deepwarp] | Eye | Appear. | Fully-Sup. | CNN [krafka2016eye] | GR Img. | [ganin2016deepwarp] | Any | ECCV | 2016
[huang2015tabletgaze] | Eye [YuS] | Appear. | Fully-Sup. | SVR | 2-D Scr. | [huang2015tabletgaze] | HH | MVA | 2017
[FischerECCV2018] | Eye [xiang2017joint] | Appear. | Fully-Sup. | VGG-16+FC [FischerECCV2018] | 3-D GV | [zhang2015appearance, FischerECCV2018] | Scr. | ECCV | 2018
[park2018deep] | Eyes | Appear. | Fully-Sup. | CNN | 3-D GV | [zhang2015appearance, funes2014eyediap] | Scr. | ECCV | 2018
[jyoti2018automatic] | Face [baltruvsaitis2016openface] | Geo.+Appear. | Fully-Sup. | CNN [jyoti2018automatic] | 3-D GV | [huang2015tabletgaze, smith2013gaze] | Desk. | ICPR | 2018
[wang2018hierarchical] | Eye | Geo.+Appear. | Fully-Sup. | HGSM+c-BiGAN | Eye, GV | [zhang2015appearance, funes2014eyediap] | Any | CVPR | 2018
[chen2018appearance] | Face, L/R Eye | Appear. | Fully-Sup. | Dilated CNN | 3-D GV | [zhang2015appearance, krafka2016eye, smith2013gaze] | Scr. | ACCV | 2018
[park2019few] | Face | Appear. | Few-Shot | DT-ED+ML | 3-D GV | [zhang2015appearance, krafka2016eye] | Scr. | ICCV | 2019
[gaze360_2019] | Face | Appear. | Fully-Sup. | Pinball LSTM | 3-D GV | [zhang2015appearance, smith2013gaze, huang2015tabletgaze] | ET | ICCV | 2019
[garbin2019openeds] | Eye | Appear. | Fully-Sup. | SegNet [badrinarayanan2017segnet] | Seg. Map | [garbin2019openeds] | HMD | ICCVW | 2019
[wang2019neuro] | Face, L/R Eye | Appear. | Fully-Sup. | DGTN | GV | [wang2019neuro] | Desk. | CVPR | 2019
[xiong2019mixed] | Face | Appear. | Fully-Sup. | MeNet | 3-D GV | [zhang2015appearance, sugano2014learning, krafka2016eye] | Scr. | CVPR | 2019
[wang2019generalizing] | Face, Eye | Appear. | Semi/Unsup. | BCNN | 3-D GV | [zhang2015appearance, funes2014eyediap] | Desk. | CVPR | 2019
[chaudhary2019ritnet] | Eyes | Appear. | Fully-Sup. | Hybrid U-net | Seg. Map | [garbin2019openeds] | HMD | ICCVW | 2019
[kansal2019eyenet] | Eyes | Appear. | Fully-Sup. | Modified ResNet | Seg. Map | [garbin2019openeds] | HMD | ICCVW | 2019
[boutros2019eye] | Eyes | Appear. | Fully-Sup. | Eye-MMS | Seg. Map | [garbin2019openeds] | HMD | ICCVW | 2019
[perry2019minenet] | Eyes | Appear. | Fully-Sup. | Dilated CNN | Seg. Map | [garbin2019openeds] | HMD | ICCVW | 2019
[yu2019improving] | Eyes | Appear.+Seg. | Few-Shot | GR | 2-D GV | [zhang2015appearance, smith2013gaze] | Any | CVPR | 2019
[yu2019unsupervised] | Eyes | Appear. | Unsup. | GR | 2-D GV | [zhang2015appearance, smith2013gaze] | Any | CVPR | 2019
[dubey2019unsupervised] | Face, Eye | Appear. | Unsup. | IzeNet | 3-D GV | [smith2013gaze, huang2015tabletgaze] | FV | IJCNN | 2019
[buhler2019content] | Eye | Appear.+Seg. | Fully-Sup. | Seg2Eye | Eye Img. | [garbin2019openeds] | HMD | ICCVW | 2019
[zhu2020hierarchical] | Eye Seq. | Appear. | Unsup. | Hier. HMM | Eye Move. | [komogortsev2013automated] | Any | ECCVW | 2019
[shen2020domain] | Eye | Appear. | Semi/Unsup. | mSegNet+Discre. | Seg. Map | [garbin2019openeds] | HMD | ECCVW | 2019
[perry2020eyeseg] | Eye | Appear. | Few-Shot | EyeSeg | Seg. Map | [garbin2019openeds] | HMD | ECCVW | 2019
[zheng2020self] | Face | Appear. | Fully-Sup. | ST-ED | GR | [krafka2016eye, smith2013gaze, funes2014eyediap] | Scr. | NeurIPS | 2020
[palmero2020openeds2020] | Eye | Appear. | Fully-Sup. | Modified ResNet | GR Img. | [palmero2020openeds2020] | HMD | ECCVW | 2020
[Park2020ECCV] | Eyes | Appear. | Fully-Sup. | ResNet-18+GRU | PoG, 3-D GV | [Park2020ECCV] | Scr. | ECCV | 2020
[zhang2020eth] | Face | Appear. | Fully-Sup. | ResNet-50 | 3-D GV | [zhang2015appearance, krafka2016eye, gaze360_2019, funes2014eyediap] | Scr. | ECCV | 2020
[dias2020gaze] | Face | Appear. | Semi-Sup. | GRN | GV | [NIPS2015_ec895663] | FV | WACV | 2020
[zhang2020learning] | Face, Eye | Appear. | Fully-Sup. | RSN+GazeNet | GV | [funes2014eyediap, zhang2015appearance, krafka2016eye] | Scr. | BMVC | 2020
[cheng2020coarse] | Face, Eye | Appear. | Fully-Sup. | CA-Net | GV | [zhang2015appearance, funes2014eyediap] | Scr. | AAAI | 2020
[cheng2020gaze] | Face, Eye | Appear. | Fully-Sup. | FAR-Net | GV | [funes2014eyediap, zhang2015appearance, FischerECCV2018] | Scr. | TIP | 2020
[chen2021coarse] | Eye | Appear.+AEM | Fully-Sup. | MT c-GAN | Eye Img. | [zhang2015appearance, smith2013gaze, sugano2014learning] | Scr. | WACV | 2021
[bao2021adaptive] | Face, Eye | Appear. | Fully-Sup. | AFF-Net | Scr., GV | [krafka2016eye, zhang2017s] | Scr. | Arxiv | 2021
[cheng2021puregaze] | Face | Appear. | Unsup. | PureGaze | Face, GV | [zhang2020eth, gaze360_2019, zhang2015appearance, sugano2014learning] | Scr. | Arxiv | 2021
[Kothari_2021_CVPR] | Face | Appear. | Weakly-Sup. | ResNet-18+LSTM | GV | [zhang2020eth, gaze360_2019, Kothari_2021_CVPR, krafka2016eye] | Any | CVPR | 2021
[marin2021pami] | Face | Appear. | Fully-Sup. | LAEO-Net++ | LAEO | [marin2019laeo] | Any | TPAMI | 2021
Table II: A comparison of gaze analysis methods with respect to registration (Reg.), representation (Represent.), Level of Supervision, Model, Prediction, validation on benchmark datasets (validation), Platforms, Publication venue (Publ.) and year. Here, GV: Gaze Vector, Scr.: Screen, LOSO: Leave One Subject Out, LPIPS: Learned Perceptual Image Patch Similarity, MM: Morphable Model, RRF: Random Regression Forest, AEM: Anatomic Eye Model, GRN: Gaze Regression Network, ET: External Target, FV: Free Viewing, HH: HandHeld Device, HMD: Head Mounted Device, Seg.: Segmentation and GR: Gaze Redirection, LAEO: Looking At Each Other.

4.2.5 Discussion

To summarize the recent deep-network-based gaze analysis methods, we present the main takeaway points as follows:


  • Gaze estimation methods fall into two broad categories. 1) 2D gaze estimation: these methods map the input image to a 2D Point of Regard (PoR) on the visual plane, which can be the observed object or a screen; non-deep-learning methods and early deep learning methods [hansen2009eye, zhang2015appearance, zhang2017mpiigaze, zhang2017s] mostly perform this mapping. 2) 3D gaze estimation: these methods predict the gaze vector, i.e., the line joining the pupil center with the point of regard, instead of the 2D PoR; recent works [zhang2020eth, park2018deep, park2019few, Park2020ECCV, Kothari_2021_CVPR] mainly rely on 3D gaze estimation. Depending on the application and requirements, one or the other formulation is used (a minimal sketch for converting yaw/pitch angles to a 3-D gaze vector is given after this list).

  • Single-branch CNN-based architectures [zhang2015appearance, zhang2017s, zhang2017mpiigaze, krafka2016eye, park2018deep, wang2018hierarchical] have been widely used over the past few years, with progressive improvements on benchmark datasets. The input to these networks is restricted to a single eye, an eye patch, or the face. To further boost performance, multi-branch networks have been proposed that utilize eyes, face, geometric constraints, and the visual plane grid as inputs.

  • Both single- and multi-branch networks depend on spatial information, whereas eye movement is dynamic in nature. Thus, a few recently proposed architectures [gaze360_2019, wang2018hierarchical] also exploit temporal information for inference.

  • From a representation learning perspective, VAE- and GAN-based architectures [park2019few, yu2019unsupervised, zheng2020self] have been explored. However, these architectures can have higher time complexity compared to single- or multi-branch CNNs.

  • Prior-based appearance encoding is another line of approaches for learning a rich feature representation. A few works define priors based on eye anatomy [park2018deep] or geometric constraints [cheng2020gaze] as biases for better generalization. Instead of direct appearance encoding, Park et al. [park2018deep] propose an intermediate pictorial representation, termed a 'gazemap' (refer Fig. 6), of the eye to simplify the gaze estimation task. Similarly, the 'two-eye asymmetry' property is utilized for gaze estimation [cheng2020gaze], where the underlying hypothesis is that despite differences in the appearance of the two eyes due to environmental factors, their gaze directions remain approximately the same. CNN-based regression models are often assumed to be independent of identity distribution; however, due to the subject-specific offset of the nodal point of the eyes, gaze datasets carry identity-specific bias. Xiong et al. [xiong2019mixed] inject this bias as a prior by mixing different models. Similarly, to handle this offset, the gaze is decomposed into subject-independent and subject-dependent components for performance enhancement and better generalization [chen2020offset].

  • To train deep learning based models, regression losses such as the L2 loss [zhang2015appearance, zhang2017s, zhang2017mpiigaze] and cosine-similarity-based losses [zhang2020eth, Park2020ECCV] are mostly used. In addition, a pinball loss [gaze360_2019] has been proposed to model the uncertainty in gaze estimation, especially in unconstrained settings (see the sketch after this list).

  • Similarly, for deep learning based eye segmentation approaches, the mapping from eye image to segmentation is mostly performed in a non-parametric way that implicitly encodes shape, geometry, appearance, and other factors [luo2020shape, kansal2019eyenet, rot2018deep, chaudhary2019ritnet, wu2019eyenet, garbin2019openeds]. The most popular network architectures for eye segmentation are U-Net [ronneberger2015u], a modified version of SegNet [garbin2019openeds], RITnet [chaudhary2019ritnet], and EyeNet [kansal2019eyenet]. Such encoder-decoder architectures can have high time and space complexity; however, recent methods [kansal2019eyenet, chaudhary2019ritnet] account for these factors without compromising performance.
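To make the 2D/3D distinction and the pinball loss mentioned above concrete, the following minimal Python sketch is our own illustration rather than code from any cited work: the (yaw, pitch)-to-vector conversion assumes one common camera-centred axis convention, and the quantile value and the simplified single-angle form of the pinball loss are assumptions in the spirit of [gaze360_2019].

```python
import numpy as np

def yaw_pitch_to_vector(yaw, pitch):
    """Convert gaze angles (radians) to a unit 3-D gaze vector.

    Assumes a camera-centred convention (x right, y down, z away from the
    camera); individual papers may use a different axis convention.
    """
    x = -np.cos(pitch) * np.sin(yaw)
    y = -np.sin(pitch)
    z = -np.cos(pitch) * np.cos(yaw)
    return np.stack([x, y, z], axis=-1)

def pinball_loss(angle_true, angle_pred, sigma_pred, tau=0.1):
    """Simplified quantile (pinball) loss over a predicted angle and its
    predicted error bound sigma, penalising the tau and (1 - tau) quantiles."""
    err_lo = angle_true - (angle_pred - sigma_pred)
    err_hi = angle_true - (angle_pred + sigma_pred)
    loss_lo = np.maximum(tau * err_lo, (tau - 1.0) * err_lo)
    loss_hi = np.maximum((1.0 - tau) * err_hi, -tau * err_hi)
    return float(np.mean(loss_lo + loss_hi))

# Example: a gaze direction 10 degrees to the right and 5 degrees downward.
g = yaw_pitch_to_vector(np.deg2rad(10.0), np.deg2rad(-5.0))
```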

4.3 Level of Supervision

Based on the type of supervision, training procedures can be classified into the following categories: fully-supervised and semi-/self-/weakly-/un-supervised.

4.3.1 Fully-Supervised.

The supervised learning paradigm is the most commonly used training framework in the gaze estimation literature [smith2013gaze, sugano2014learning, zhang2015appearance, huang2015tabletgaze, park2018deep, park2018learning] and the eye segmentation literature [chaudhary2019ritnet, kansal2019eyenet, boutros2019eye, perry2019minenet, das2013sclera, das2016ssrbc, das2017sserbc, das2019sclera]. Since fully-supervised methods require a large amount of accurately annotated data, which is resource-expensive and time-consuming to obtain, the research community is moving towards learning with less supervision.

Multi-Task Learning. Multi-task learning incorporates different tasks that provide auxiliary information as a bias to improve model performance. The auxiliary information can be Gaze+Landmark [yu2018deep], PoG+Screen saliency [Park2020ECCV, wang2019inferring], Gaze+Depth [lian2019rgbd], Gaze+Headpose [zhu2017monocular], Segmentation+Gaze [wu2019eyenet], and Gaze direction+Gaze uncertainty [gaze360_2019]. These gaze-aligned tasks facilitate strong representation learning with additional task-based supervision.
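As a concrete illustration of gaze-aligned multi-task supervision, the PyTorch-style sketch below is our own simplification, not the architecture of any cited work: a shared backbone feeds a gaze-regression head and an auxiliary landmark head, and the two losses are combined with a weighting factor. The layer sizes, the landmark auxiliary task, and the weight lambda_aux are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskGazeNet(nn.Module):
    """Shared backbone with a gaze head and an auxiliary landmark head."""

    def __init__(self, feat_dim=128, n_landmarks=8):
        super().__init__()
        # Toy backbone; in practice this would be a CNN over eye/face crops.
        self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.gaze_head = nn.Linear(feat_dim, 2)               # yaw, pitch
        self.aux_head = nn.Linear(feat_dim, 2 * n_landmarks)  # (x, y) landmarks

    def forward(self, x):
        feat = self.backbone(x)
        return self.gaze_head(feat), self.aux_head(feat)

def multitask_loss(gaze_pred, gaze_gt, aux_pred, aux_gt, lambda_aux=0.5):
    # Main gaze regression loss plus a weighted auxiliary-task loss.
    return F.l1_loss(gaze_pred, gaze_gt) + lambda_aux * F.mse_loss(aux_pred, aux_gt)

# Example forward/backward pass on random placeholder data.
model = MultiTaskGazeNet()
imgs = torch.rand(4, 3, 36, 60)                   # batch of eye crops
gaze, lmk = model(imgs)
loss = multitask_loss(gaze, torch.rand(4, 2), lmk, torch.rand(4, 16))
loss.backward()
```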

4.3.2 Semi-/Self-/weakly-/unsupervised.

Several studies [sugano2014learning, benfold2011unsupervised, zhang2017everyday, santini2017calibme, park2019few, karessli2017gaze, he2019device] explore gaze estimation in unsupervised and semi-supervised settings to reduce the data annotation burden. These approaches are mainly based on 'learning-by-synthesis' [sugano2014learning], hierarchical generative models [wang2018hierarchical], conditional random fields [benfold2011unsupervised], unsupervised gaze target discovery [zhang2017everyday], few-shot learning [park2019few, yu2019improving], and self-/unsupervised learning [yu2019unsupervised, dubey2019unsupervised]. For eye segmentation, a few studies [shen2020domain, perry2020eyeseg] similarly explore semi-supervised and few-shot settings to reduce the annotation burden.

Auxiliary Task/Pretext Tasks. Self-supervised learning has become popular for representation learning. It requires pseudo labels for a pre-designed (pretext) task, which is usually aligned with gaze estimation. Dubey et al. [dubey2019unsupervised] propose a pretext task in which the visual region of the person is divided into zones using geometric constraints, and the resulting pseudo labels are used for representation learning. Yu et al. [yu2019unsupervised] use subject-specific gaze redirection as a pretext task. Self-supervised representation learning has the potential to eliminate the major drawback of gaze data annotation, which is a difficult and error-prone process.

Figure 7: Data collection procedures in different settings for benchmark datasets. From left to right, the examples are from the CAVE [smith2013gaze], ETH-XGaze [zhang2020eth], MPII [zhang2017mpiigaze], and Gaze360 [gaze360_2019] datasets; the setup becomes progressively less constrained from left to right. Images are taken from the respective datasets [gaze360_2019, zhang2020eth, zhang2017mpiigaze, smith2013gaze]. Refer Table III for more details.

5 Validation

Here, we review the commonly followed evaluation procedures on various datasets along with the metrics adopted in the literature.

5.1 Datasets for Gaze Analysis

With the rapid progress in the gaze analysis domain, several datasets have been proposed for different gaze analysis tasks (see Sec. 3). Data collection techniques have evolved from constrained lab environments [smith2013gaze] to unconstrained indoor [huang2015tabletgaze, zhang2017mpiigaze, zhang2017s, FischerECCV2018] and outdoor settings [gaze360_2019] (Refer Fig. 7). We provide a detailed overview of the datasets in Table III. Compared with early datasets [smith2013gaze, funes2014eyediap], recently released datasets [gaze360_2019, Park2020ECCV] are typically more advanced, with less bias, higher complexity, and larger scale, making them better suited for training and evaluation. We describe a few important datasets below:

CAVE [smith2013gaze] contains 5,880 images of 56 subjects with different gaze directions and head poses. There are 21 different gaze directions for each person and the data was collected in a constrained lab environment, with 7 horizontal and 3 vertical gaze locations.

The Eyediap dataset [FunesMora_ETRA_2014] was designed to overcome the main challenges associated with the head pose, person and 3D target variations along with changes in ambient and sensing conditions.

TabletGaze [huang2015tabletgaze] is a large unconstrained dataset of 51 subjects with 4 different postures and 35 gaze locations collected using a tablet in an indoor environment. TabletGaze dataset is also collected in a grid format.

MPII [zhang2017mpiigaze] contains 213,659 images collected from 15 subjects during natural everyday activity in front of a laptop over a three-month duration; the data was collected by showing random points on the laptop screen to the participants. Further, Zhang et al. [zhang2017s] curate the MPIIFaceGaze dataset with the hypothesis that gaze can be predicted more accurately when the entire face is considered.

RT-GENE dataset [FischerECCV2018] is recorded in a more naturalistic environment with varied gaze and head pose angles. The ground truth annotation was done using a motion capture system with mobile eye-tracking glasses.

Gaze360 [gaze360_2019] is a large-scale gaze estimation dataset collected from 238 subjects in unconstrained indoor and outdoor settings with a wide range of head pose.

ETH-XGaze [zhang2020eth] is a large-scale dataset collected in a constrained environment with a wide range of head poses and high-resolution images. The dataset contains images captured from different camera positions and under different illumination conditions to make the data more challenging.

EVE [Park2020ECCV] is also collected in a constrained indoor setting with different camera views to map human gaze to screen coordinates.

Similar to eye gaze estimation, several benchmark datasets have been proposed over the past few years for eye and sclera segmentation. The datasets collected for sclera segmentation are mostly captured in constrained environments with very few subjects [derakhshani2006new, derakhshani2007texture, crihalmeanu2009enhancement]. More challenging publicly available datasets were released in the sclera recognition challenges [das2016ssrbc, das2017sserbc, das2019sclera]. Recently, a large-scale dataset termed OpenEDS: Open Eye Dataset [garbin2019openeds] was released, containing eye images collected using a VR head-mounted display equipped with two synchronized eye-facing cameras running at 200 Hz. The data was collected under controlled illumination and contains 12,759 images with eye segmentation masks from 152 participants.

Dataset | # Sub | Label | Modality | Head-Pose (yaw°, pitch°) | Gaze (yaw°, pitch°) | Env. | Baseline | # Data | Year
CAVE [smith2013gaze] | 56 | 3-D | Image | °, ° | °, ° | In | SVM (Eval.: Cross-val) | Total: 5,880 | 2013
EYEDIAP [funes2014eyediap] | 16 | 3-D | Image (HD and VGA) | °, ° | °, ° | In | Convex Hull (Eval.: Hold out) | Total: 237 min | 2014
UT MV [sugano2014learning] | 50 | 3-D | Image | °, ° | °, ° | In | Random Reg. Forests (Eval.: Hold out) | Total: 64,000 | 2014
OMEG [he2015omeg] | 50 | 3-D | Image | °, ° | ° to °, ° to ° | In | SVR (Eval.: LOSO) | Total: 44,827 | 2015
MPIIGaze [zhang2015appearance] | 15 | 3-D | Image | °, ° | °, ° | In | CNN variant [zhang2015appearance] (Eval.: LOSO) | Total: 213,659 | 2015
GazeFollow [nips15_recasens] | 130,339 | 3-D | Image (Variable) | Variable | Variable | Both | CNN variant [nips15_recasens] (Eval.: Hold out) | Total: 122,143 | 2015
SynthesEye [wood2015rendering] | NA | 3-D | Image | °, ° | °, ° | Syn | CNN [wood2015rendering] (Eval.: Hold out) | Total: 11,400 | 2015
GazeCapture [krafka2016eye] | 1450 | 2-D | Image | °, ° | °, ° | Both | CNN [krafka2016eye] (Eval.: Hold out) | Total: 2,445,504 | 2016
UnityEyes [wood2016learning] | NA | 3-D | Image | Variable | Variable | Syn | KNN (Eval.: NA) | Total: 1,000,000 | 2016
TabletGaze [huang2015tabletgaze] | 51 | 2-D Scr. | Video | °, ° | °, ° | In | SVR (Eval.: Cross-val) | Total: 816 Seq., 300,000 img. | 2017
MPIIFaceGaze [zhang2017s] | 15 | 3-D | Image | °, ° | °, ° | In | CNN variant [zhang2017s] (Eval.: LOSO) | Total: 213,659 | 2017
InvisibleEye [tonsen2017invisibleeye] | 17 | 2-D Scr. | Image | Unknown | pixel VF | In | ANN [tonsen2017invisibleeye] (Eval.: Hold out) | Total: 280,000 | 2017
RT-GENE [FischerECCV2018] | 15 | 3-D | Image | °, ° | °, ° | In | CNN [FischerECCV2018] (Eval.: Cross val) | Total: 122,531 | 2018
Gaze360 [gaze360_2019] | 238 | 3-D | Image | °, u/k | °, ° | Both | Pinball LSTM (Eval.: Hold out) | Total: 172,000 | 2019
RT-BENE [cortacero2019rt] | 17 | EB | Image | °, ° | °, ° | In | CNNs (Eval.: Cross val) | Total: 243,714 | 2019
NV Gaze [kim2019nvgaze] | 30 | 3-D, Seg. | Image (Synthetic) | Unknown | °° VF | Both | CNN [laine2017production] (Eval.: Hold out) | Total: 2,500,000 | 2019
HUST-LEBW [hu2019towards] | 172 | EB | Video | Variable | Variable | Both | MS-LSTM (Eval.: Hold out) | Total: 673 | 2019
VACATION [fan2019understanding] | 206,774 | GC | Video | Variable | Variable | Both | GNN (Eval.: Hold out) | Total: 96,993 | 2019
OpenEDS-19 [garbin2019openeds] Track 1: Semantic Segmentation | 152 | Seg. | Image | Unknown | Unknown | In | SegNet [badrinarayanan2017segnet] (Eval.: Hold out) | Total: 12,759 (in # SegSeq [garbin2019openeds]) | 2019
OpenEDS-19 [garbin2019openeds] Track 2: Synthetic Eye Generation | 152 | Gen. | Image | Unknown | Unknown | In | Eval.: Hold out | Total: 252,690 | 2019
OpenEDS-20 [palmero2020openeds2020] Track 1: Gaze Prediction | 90 | 3-D | Image | Unknown | °, ° | In | Modified ResNet (Eval.: Hold out) | Total: 8,960 Seq., 550,400 img. | 2020
OpenEDS-20 [palmero2020openeds2020] Track 2: Sparse Temporal Semantic Segmentation | 90 | Seg. | Image | Unknown | °, ° | In | SegNet [badrinarayanan2017segnet] (power-efficient version) (Eval.: Hold out) | Total: 200 Seq., 29,500 img. | 2020
mEBAL [daza2020mebal] | 38 | EB | Image | Variable | Variable | In | VGG-16 variant (Eval.: Hold out) | Total: 756,000 | 2020
ETH-XGaze [zhang2020eth] | 110 | 3-D | Image | °, ° | °, ° | In | ResNet-50 (Eval.: Hold out) | Total: 1,083,492 | 2020
EVE [Park2020ECCV] | 54 | 3-D | Image | °, ° | °, ° | In | ResNet-18 (Eval.: Hold out) | Total: 12,308,334 | 2020
GW [kothari2020gaze] | 19 | GE | Image | Variable | Variable | In | RNN (Eval.: Hold out) | Total: 5,800,000 | 2020
LAEO [Kothari_2021_CVPR] | 485 | 3-D | Image (Variable) | Variable | Variable | Both | ResNet-18+LSTM (Eval.: Hold out) | Total: 800,000 | 2021
GOO [tomas2021goo] | 100 | 3-D | Image (Variable) | Variable | Variable | Both | ResNet-50 (Eval.: Hold out) | Total: 201,552 | 2021
OpenNEEDS [emery2021openneeds] | 44 | 3-D | Image | Variable | Variable | VR | GBRT (Eval.: Hold out) | Total: 2,086,507 | 2021
Table III: A comparison of gaze datasets with respect to several attributes (i.e. number of subjects (# subjects), gaze labels, modality, headpose and gaze angle in yaw and pitch axis, environment (Env.), baseline method, data statistics (# of data), and year of publication.) The abbreviations used are: In: Indoor, Out: Outdoor, Both: Indoor + Outdoor, Gen.: Generation, u/k: unknown, Seq.: Sequence, VF: Visual Field, EB: Eye Blink, GE: Gaze Event [kothari2020gaze], GBRT: Gradient Boosting Regression Trees, GC: Gaze Communication, GNN: Graph Neural Network and Seg.: Segmentation.

Data Generation/Gaze Redirection.

Since eye gaze data collection and annotation is an expensive and time-consuming process, the research community has moved towards data generation for benchmarking with large variation in data attributes. Prior works in this domain generate both synthetic and real images, mostly using Generative Adversarial Networks (GANs). To capture possible rotational variation in the image, gaze redirection techniques [ganin2016deepwarp, he2019photo, wood2018gazedirector, yu2019improving, kaur2021subject] are quite popular. An early work on gaze manipulation [wolf2010eye] uses pre-recordings of several potential eye replacements at test time. Kononenko et al. [kononenko2015learning] propose warping-based gaze redirection using supervised learning, which learns a flow field to move the eye pupil and relevant pixels from the input image to the output image. Gaze redirection methods may struggle with extrapolation since they depend on the training samples and training procedure, and earlier works suffer from low-quality generation and low redirection precision. To overcome this, Chen et al. [chen2020mggr] propose a MultiModal-Guided Gaze Redirection (MGGR) framework which uses gaze-map images and target angles to adjust a given eye appearance via learning. Other approaches are mainly based on random forests [kononenko2015learning] and style transfer [sela2017gazegan]: the random forest decides the possible gaze direction, while style transfer mainly encodes appearance-based features. Sela et al. [sela2017gazegan] propose a GAN-based framework to generate a large dataset of high-resolution eye images with diversity in subjects, head pose, camera settings, and realism. However, GAN-based methods lack the capability to preserve content (i.e., eye shape) for benchmarking. Buhler et al. [buhler2019content] synthesize person-specific eye images from a given semantic segmentation mask while preserving the style and content of the reference images. In summary, although a lot of effort has been made to generate realistic eye images, these images are not yet used for benchmarking due to several limitations (imperfect gaze direction, image quality).

5.2 Evaluation Strategy

In this section, we describe the most widely used gaze metrics in the gaze analysis domain.

Gaze Estimation. The most common practice is to measure gaze estimation accuracy/error in terms of angular error (in degrees) [park2018deep, zhang2020eth, park2019few, Park2020ECCV] or gaze location error (in pixels or cm/mm) [huang2015tabletgaze, Park2020ECCV]. The angular error between the actual gaze direction $\mathbf{g}$ and the predicted gaze direction $\hat{\mathbf{g}}$ is defined as $\arccos\left(\frac{\mathbf{g}\cdot\hat{\mathbf{g}}}{\lVert\mathbf{g}\rVert\,\lVert\hat{\mathbf{g}}\rVert}\right)$. For gaze location, the Euclidean distance is measured between the actual and predicted point of gaze (PoG).
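As a minimal illustration with assumed input conventions (3-D gaze vectors and screen-space PoG coordinates given as NumPy arrays), the following snippet computes both error measures:

```python
import numpy as np

def angular_error_deg(g_true, g_pred):
    """Angle (degrees) between ground-truth and predicted 3-D gaze vectors."""
    g_true = g_true / np.linalg.norm(g_true, axis=-1, keepdims=True)
    g_pred = g_pred / np.linalg.norm(g_pred, axis=-1, keepdims=True)
    cos_sim = np.clip(np.sum(g_true * g_pred, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos_sim))

def pog_error(p_true, p_pred):
    """Euclidean distance between true and predicted points of gaze,
    in whatever unit (pixels, mm, cm) the coordinates are given in."""
    return np.linalg.norm(p_true - p_pred, axis=-1)
```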

Gaze Redirection. Gaze redirection is evaluated both quantitatively and qualitatively [zheng2020self, chen2020mggr, chen2021coarse]. The quantitative analysis is done in terms of the angular gaze redirection error between the predicted values and their intended target values. Since the movement of the eye pupil is pre-defined in this task, this angular error only weakly quantifies how well the redirection is performed, and the angle measurement itself has some inherent noise. For qualitative analysis, the Learned Perceptual Image Patch Similarity (LPIPS) metric is used, which measures paired image similarity in the gaze redirection task.
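For the LPIPS part of the evaluation, the widely used lpips Python package can be applied roughly as below; this is a usage sketch, and the AlexNet backbone choice, the tensor shapes, and the random placeholder images are assumptions (real usage would pass redirected and target eye images scaled to [-1, 1]).

```python
import torch
import lpips  # pip install lpips

# LPIPS with an AlexNet backbone; a VGG backbone is another common choice.
loss_fn = lpips.LPIPS(net='alex')

# Placeholder (N, 3, H, W) image tensors in [-1, 1]; in practice these would
# be the redirected eye image and its ground-truth target.
img_target = torch.rand(1, 3, 64, 64) * 2 - 1
img_redirected = torch.rand(1, 3, 64, 64) * 2 - 1

distance = loss_fn(img_target, img_redirected)  # lower = more similar
print(distance.item())
```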

Eye Segmentation. The commonly used evaluation metric for eye segmentation methods is the mean Intersection over Union (mIoU). For the recent OpenEDS challenge [garbin2019openeds], the mIoU metric is calculated over all classes, and the model size (S) is measured as the number of trainable parameters in megabytes (MB).
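A hedged sketch of per-class IoU averaged into mIoU for eye segmentation masks follows; the four-class, OpenEDS-style label set (background, sclera, iris, pupil) and integer label maps are assumptions.

```python
import numpy as np

def mean_iou(pred, target, num_classes=4):
    """Mean Intersection-over-Union over classes present in pred or target.

    pred and target are integer label maps of identical shape, e.g. with
    classes {0: background, 1: sclera, 2: iris, 3: pupil}.
    """
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                 # skip classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))
```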

Figure 8: Importance of eye tracking in the AR and VR industry. From left to right: Oculus Rift, HTC Vive, HoloLens, and Magic Leap. The HoloLens image is taken from microsoft.com; the other images are under Creative Commons licenses.

6 Applications

6.1 Eye Gaze in Augmented Reality and Virtual Reality

We are witnessing great progress in the adoption of VR and AR technology. Eye tracking has the potential to revolutionize the AR/VR space since it can enhance the device's awareness by learning about the user's attention at any given point in time. Consequently, optimization based on the user's focus reduces the device's power consumption [10.1145/3308755.3308765, patney2016towards, palmero2020openeds2020]. In this section, we cover the importance of eye-tracking technology and how it enables a better user experience in AR and VR devices. A few widely used eye-tracking devices are shown in Fig. 8.

Foveated Rendering is a rendering process designed to show in full detail only the portion of the scene the user is looking at [patney2016towards, garbin2019openeds, palmero2020openeds2020]. The focus region follows the user's visual field, so graphics displayed with foveated rendering better match the way we see objects. The process offers three important benefits. 1. Improved image quality: it can enable 4K displays on current-generation graphics processing units (GPUs) without degrading performance. 2. Lower cost: end-users can run AR/VR applications on low-cost hardware without compromising performance. 3. Increased frames per second (FPS): the end-user can run at a higher frame rate using the same graphical settings. There are two types of foveated rendering: dynamic and static. Dynamic foveated rendering follows the user's gaze trajectory using eye tracking and renders a sharp image only in the required region, but such eye tracking is challenging in many scenarios. Static foveated rendering, on the other hand, keeps a fixed area of highest resolution at the center of the viewer's device irrespective of the user's gaze; since it depends on the user's head movements, it faces a challenge from eye-head interplay, as image quality is drastically reduced when the user looks away from the center of the field of view. A key purpose of accurate eye position estimation in terms of interpupillary distance (IPD) is to enhance the user experience by providing high image quality in the subject's visual focus area. This requires person-specific calibration, as IPD varies considerably from person to person, so generalizing across users remains an open challenge for the gaze analysis community [10.1145/3313831.3376260]. To address the related pain point of distinguishing different users, many VR and AR headsets use iris-based user identification, which enables user-specific recommendations and other features that enhance the user experience [10.1145/3277644.3277771].
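As a toy illustration of the dynamic variant (our own simplification, not code from any headset SDK; the pixels-per-degree value and the 5°/15° eccentricity thresholds are assumptions), one can assign a per-pixel detail level from the angular distance to the tracked gaze point:

```python
import numpy as np

def foveation_level(px, py, gaze_px, gaze_py, ppd=40.0):
    """Return 0 (full detail), 1 (half) or 2 (quarter resolution) per pixel.

    ppd approximates the display's pixels-per-degree; the 5 and 15 degree
    eccentricity thresholds are purely illustrative.
    """
    eccentricity_deg = np.hypot(px - gaze_px, py - gaze_py) / ppd
    return np.digitize(eccentricity_deg, bins=[5.0, 15.0])

# Example: detail levels for a coarse grid of pixels around a gaze point.
ys, xs = np.mgrid[0:1080:270, 0:1920:480]
levels = foveation_level(xs, ys, gaze_px=960, gaze_py=540)
```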

6.2 Driver Engagement

With the progress in autonomous and smart cars, the need for automatic driver monitoring has become apparent, and researchers have been working on this problem for several years [vasli2016driver, tawari2014driver, ghosh2020speak2label, jha2020multimodal]. In the literature, the problem is treated as a gaze zone estimation problem. A summary of driver gaze estimation benchmarks is shown in Table IV. The proposed methods can be classified into two categories:

Sensor-Based Tracking. These methods mainly utilize hardware devices with dedicated integrated sensors for monitoring the driver's gaze in real-time. Such devices require accurate pre-calibration and are additionally expensive. A few examples of these sensors are infrared (IR) cameras [johns2007monitoring], head-mounted devices [jha2018probabilistic, jha2017challenges], and other systems [feng2013low, zhang2017exploring].

References | # Sub | # Zones | Illumination | Labelling
Choi et al. [choi2016real] | 4 | 8 | Bright & Dim | 3D Gyro.
Lee et al. [lee2011real] | 12 | 18 | Day | Manual
Fridman et al. [fridman2015driver] | 50 | 6 | Day | Manual
Tawari et al. [tawari2014driver] | 6 | 8 | Day | Manual
Vora et al. [vora2018driver] | 10 | 7 | Diff. day times | Manual
Jha et al. [jha2018probabilistic] | 16 | 18 | Day | Head-band
Wang et al. [wang2019continuous] | 3 | 9 | Day | Motion Sensor
DGW [ghosh2020speak2label] | 338 | 9 | Diff. day times | Automatic
MGM [jha2020multimodal] | 60 | 21 | Diff. day times | Multiple Sensors
Table IV: Comparison of driver gaze estimation datasets with respect to number of subjects (# Sub), number of zones (# Zones), illumination conditions and labelling procedure.

Image processing and vision-based methods. These mainly fall into two groups: head-pose-only methods [lee2011real, tawari2014robust, mukherjee2015deep, wang2019continuous] and methods based on both head pose and eye gaze [vasli2016driver, tawari2014driver, tawari2014robust, fridman2015driver, fridman2016owl, choi2016real]. The driver's head pose provides only partial information about gaze direction, since there is an interplay between eyeball movement and head pose [fridman2016owl]. Hence, methods relying on head pose alone may fail to disambiguate eye movements under a fixed head pose, and methods relying on both head pose and eye gaze are therefore more robust.

6.3 Eye Gaze in Healthcare and Wellbeing

Eye gaze is widely used in the healthcare domain to enhance diagnostic performance. Eye movement patterns serve as behavioral biomarkers of various mental health problems, including depression [alghowinem2013eye], post-traumatic stress disorder [milanak2018ptsd], and Parkinson's disease [harezlak2018application]. Similarly, individuals diagnosed with Autism Spectrum Disorder display gaze avoidance in social scenes [harezlak2018application]. Even intoxication from alcohol and/or other drugs is reflected in eye and gaze properties, notably decreased accuracy and speed of saccades, changes in pupil size, and an impaired ability to fixate on moving objects. A recent survey [harezlak2018application] discusses further potential applications in healthcare, including concussion [kempinski2016system] and multiple sclerosis [avital2015method].

Physiological Signals. A gaze estimation system could serve as a communication channel for severely disabled people who cannot perform any type of gesture or speech. Sakurai et al. [sakurai2016study] developed an eye-tracking method using a compact and lightweight electrooculogram (EOG) sensor. This prototype was further improved by using the EOG component that correlates strongly with changes in eye movement [sakurai2017gaze] (Refer Fig. 9). The setup can detect object scanning using only eye and facial muscle movements. The experimental results open the possibility of eye tracking via EOG signals and a Kinect sensor, and research along this direction can be extremely useful for disabled people.

Figure 9: Electro-oculogram (EOG) based gaze estimation method [sakurai2017gaze]. This prototype opens the possibility of communication for severely disabled people. Refer Sec. 6.3 for more details.

7 Privacy in gaze estimation

Due to rapid progress over the past few years, gaze estimation technologies have become more reliable, cheap, and compact, and they see increasing use in many fields such as gaming, marketing, driver safety, and healthcare. These expanding uses of the technology raise serious privacy concerns, as gaze patterns can reveal much more information than a user wishes and expects to give away. By portraying the sensitivity of gaze-tracking data, this section provides a brief overview of the privacy concerns and consequent implications of gaze estimation and eye tracking. Fig. 10 shows an overview of the privacy concerns, including common data capturing scenarios and their possible implications. A recent analysis [kroger2019does] of the literature shows that eye-tracking data may implicitly contain information about a user's biometric identity [john2019eyeveil], personal attributes (such as gender, age, ethnicity, personality traits, intoxication, emotional state, and skills) [erbilek2013age, moss2012eye, hoppe2018eye], and physical and mental health [harezlak2018application, alghowinem2013eye]. Some eye-tracking measures may even reveal underlying cognitive processes and mental and physical well-being [eckstein2017beyond]. The widespread adoption of eye tracking has the potential to improve our lives in many ways, but the technology can also pose a substantial threat to privacy. Thus, it is necessary to understand the sensitivity of gaze data from a holistic perspective to prevent its misuse.

Figure 10: The possible privacy concerns related to gaze analysis framework [kroger2019does]. Please refer Sec. 7 for more details.

8 Conclusion and Future Direction

Eye gaze analysis has found applications in several domains, mainly assistive technology and human-computer interaction, and the range of gaze-related applications is growing rapidly, opening many research opportunities for the community. In this paper, we presented an overall review of gaze analysis frameworks from different perspectives. Beginning with the preliminaries of gaze modelling and eye movement, we elaborated on the challenges in this field, gave an overview of gaze analysis frameworks, and discussed possible applications in different domains. For eye analysis, prior works mainly explore geometric and appearance properties. Despite recent progress, eye gaze analysis remains challenging due to eye-head interplay, occlusion, and the other challenges mentioned in Sec. 2.4, leaving considerable scope for future development. Moreover, most of the proposed datasets in this domain are collected in constrained environments. To overcome these limitations, generative adversarial network-based data generation approaches have come into play, although due to several image quality-related issues these generated datasets are not yet used for benchmarking. Another line of work proposes automatic, heuristic-based labelling of images, which can greatly reduce the data annotation burden. Future directions for eye and gaze trackers include:


  • Gaze Analysis in Unconstrained Setups: The most precise methods for eye gaze estimation rely on intrusive sensors, IR cameras, and RGB-D cameras. The main drawback of these systems is that their performance degrades in real-world settings, and future gaze estimation models should account for such situations. Although there are several current efforts in this direction, further research is needed. Moreover, most current gaze estimation benchmark datasets require a proper geometric arrangement as well as user cooperation (e.g., CAVE, TabletGaze, MPII, Eyediap, ETH-XGaze). It would be an interesting direction to explore gaze estimation in more flexible settings.

  • Learning with Less Supervision: With the surge of unsupervised, self-supervised, and weakly supervised techniques in this domain, more exploration in this direction is required to eliminate the dependency on ground-truth gaze labels, which can be error-prone due to data acquisition limitations.

  • Gaze Inference: Apart from localizing the eye and determining gaze, gaze patterns provide vital cues for encoding the cognitive and affective states of the concerned person. More exploration and cross-domain research on encoding visual perception could be another direction.

  • AR/VR: Eye tracking has potential applications in AR/VR, including Foveated Rendering (FR) and attention tunneling. Gaze-based interaction requires low-latency gaze estimation. In these applications, the visual environment presents a high-quality image at the point where the user is looking while blurring the peripheral region; the intuition is to reduce power consumption without compromising perceptual quality or user experience. However, eye movements are fast and involuntary, which restricts the use of these techniques (in FR) due to the subsequent delays in eye-tracking pipelines. To address this issue, a new research direction, future gaze trajectory prediction, has recently been introduced [palmero2020openeds2020]. More exploration along this direction is highly desirable.

  • Eye Model and Learning Based Hybrid Approaches: Traditional geometric eye-model-based and appearance-guided learning-based approaches have complementary advantages. Geometric eye-model-based methods do not require training data and have strong generalization capability, but they rely heavily on accurate eye landmark localization, which is challenging in real-world settings where the subject may exhibit extreme head poses, occlusion, difficult illumination, and other environmental factors. On the other hand, learning-based approaches can encode eye appearance features but do not generalize well across different setups. Thus, a hybrid model that takes advantage of both could be a promising research direction for gaze estimation and eye tracking.

  • Multi-modal/Cross-modal Gaze Estimation: Over the past decade, head gesture synthesis has become an interesting line of research. Prior works in this area have mainly used handcrafted audio features such as energy-based features [ben2013articulatory], MFCC (Mel Frequency Cepstral Coefficient) [ding2015head], LPC (Linear Predictive Coding) [ding2015head], and filter banks [ding2015head, ding2015blstm] to generate realistic head gestures. The main challenge in this domain is audio data annotation for head motion synthesis, which is a noisy and error-prone process. Prior works approach this problem via multi-stream HMMs [ben2013articulatory], MLP-based regression modelling [ding2015head], bi-LSTMs [ding2015blstm], and Conditional Variational Autoencoders (CVAE) [greenwood2017predicting]. In the vision domain, mainly visual stimuli are utilized for gaze estimation. Although the audio signal is non-trivial to use for gaze estimation, it has the potential to coarsely indicate gaze direction. Research along this direction has the potential to estimate gaze in challenging situations where visual stimuli fail.

The techniques surveyed in this paper focus on eye gaze estimation and eye segmentation from different perspectives; however, these techniques can also be useful for other computer vision and HCI tasks. Gaze analysis and its widespread applications form a unique and well-defined topic that has already influenced recent technologies. Scholarly interest in gaze estimation is established across a large number of disciplines: it primarily originates from vision-related assistive technology and propagates into other domains, attracting substantial future research attention across various fields.

References