Evaluation of Appearance-Based Methods and Implications for Gaze-Based Applications

01/30/2019 ∙ by Xucong Zhang, et al. ∙ University of Stuttgart, Max Planck Society, Osaka University

Appearance-based gaze estimation methods that only require an off-the-shelf camera have significantly improved but they are still not yet widely used in the human-computer interaction (HCI) community. This is partly because it remains unclear how they perform compared to model-based approaches as well as dominant, special-purpose eye tracking equipment. To address this limitation, we evaluate the performance of state-of-the-art appearance-based gaze estimation for interaction scenarios with and without personal calibration, indoors and outdoors, for different sensing distances, as well as for users with and without glasses. We discuss the obtained findings and their implications for the most important gaze-based applications, namely explicit eye input, attentive user interfaces, gaze-based user modelling, and passive eye monitoring. To democratise the use of appearance-based gaze estimation and interaction in HCI, we finally present OpenGaze (www.opengaze.org), the first software toolkit for appearance-based gaze estimation and interaction.

1. Introduction

Figure 1. We study the gap between dominant eye tracking using special-purpose equipment (right) and appearance-based gaze estimation using off-the-shelf cameras and machine learning (left) in terms of accuracy (gaze estimation accuracy), sensing distance, usability (personal calibration), and robustness (glasses and indoor/outdoor use).

Eye gaze has a long history as a modality in human-computer interaction (HCI), whether for attentive user interfaces (Bulling, 2016), gaze interaction (Vidal et al., 2013; Zhang et al., 2017a), or eye-based user modelling (Xu et al., 2016; Kosch et al., 2018). A key requirement of gaze-based applications is special-purpose eye tracking equipment, either worn on the body (head-mounted) or placed in the environment (remote). Despite the fact that the costs of hardware and software have decreased, particularly over the last couple of years, this requirement still represents a major hurdle for wider adoption of gaze in HCI research and practical applications. Another hurdle is the need for expert knowledge on how to set up and operate these trackers to obtain accurate gaze estimates, i.e. how to calibrate them properly to each individual user.

With the goal to address these limitations, research in computer vision has focused on developing gaze estimation methods that are calibration-free and that only require off-the-shelf RGB cameras, such as those readily integrated in an ever-increasing number of personal devices or ambient displays (Zhang et al., 2018c; Wood and Bulling, 2014; Huang et al., 2015). While model-based methods fit a geometric eye model to the eye image, appearance-based methods directly regress from eye images to gaze directions using machine learning (Hansen and Ji, 2010). For a long time, these methods remained far inferior to special-purpose eye trackers, particularly in terms of gaze estimation error and robustness to head pose variations. However, appearance-based gaze estimation methods have recently improved significantly (Zhang et al., 2017d, 2018c) and promise a wide range of new applications, for example in attentive user interfaces (Sugano et al., 2016; Zhang et al., 2017c), mobile gaze interaction (Khamis et al., 2018) or social signal processing (Müller et al., 2018b).

Despite their potential, particularly given expected future improvements and the availability of even larger-scale training data, appearance-based gaze estimation methods are not yet widely used in HCI. We believe this is partly because it currently remains unclear how they perform compared to dominant, special-purpose eye tracking equipment. In the gaze estimation literature, evaluations have often been performed only within the category to which the proposed method belongs. We are not aware of a single, principled comparison of model- and appearance-based gaze estimation methods with a common stationary eye tracker. Another likely reason is that using these methods remains challenging for user interface and interaction designers. While source code for some current methods is available (Zhang et al., 2017d, 2018c; Park et al., 2018), the code has been written by computer vision experts for evaluation purposes. The code is typically not optimised for real-time use, does not implement all functionality required for interactive applications in a single pipeline, or cannot be easily extended or integrated into other software or user interface frameworks.

This work aims to provide a basis for developers from the HCI community to integrate the appearance-based gaze estimation method into interactive applications. In order to achieve this goal, we make the following contributions: First, we evaluate the accuracy of state-of-the-art appearance-based gaze estimation for interaction scenarios with and without personal calibration, indoors and outdoors, for different interaction distances, as well as for users with and without glasses. We compare accuracy with a state-of-the-art model-based gaze estimation method (Park et al., 2018) and, for the first time, with a commercial eye tracker. Second, we discuss the obtained findings and their implications for the most important gaze-based applications (Majaranta and Bulling, 2014) ranging from explicit eye input, to attentive user interfaces and gaze-based user modelling, to passive eye monitoring. Third, to democratise the use of appearance-based gaze estimation and interaction in HCI, we present OpenGaze, the first software toolkit for appearance-based gaze estimation and interaction that is specifically developed for user interface designers. The framework implements the full gaze estimation pipeline, is easily extensible and integratable, and is usable by non-experts.

2. Related Work

We start with a general introduction of the various gaze-based interactions, followed by pertinent studies on gaze estimation methods.

2.1. Gaze-based human-computer interaction

Taking eye gaze from users as a command to computers is the most intuitive gaze-aware application. The typical usage of eye gaze information is as a replacement of the mouse, such as typing words with the eye (Mott et al., 2017), indicating user attention (Nguyen and Liu, 2016), and selecting items (Zhang et al., 2014). Researchers have investigated daily human-computer interactions using different eye movements, such as fixations (Majaranta et al., 2009; Higuch et al., 2016; Mott et al., 2017), smooth pursuit (Esteves et al., 2015), and eye gestures (Mardanbegi et al., 2012).

In addition, gaze information has also shown significant potential for user understanding. Most intuitively, eye tracking techniques have been used to capture and infer user behaviours, such as eye contact (Zhang et al., 2017c) and daily activities (Bulling et al., 2013; Steil and Bulling, 2015). Eye tracking data has been also used to recognise users’ latent states, including interest and engagement (Li et al., 2017; Lagun et al., 2014), affective states (Müller et al., 2018a), cognitive states (Huang et al., 2016b; Matthews et al., 1991), and attentive states (Vertegaal et al., 2003; Faber et al., 2017). It has been pointed out that eye tracking data can even be associated with mental disorders, such as Alzheimer’s disease (Hutton et al., 1984), Parkinson’s disease (Kuechenmeister et al., 1977), and schizophrenia (Holzman et al., 1974). Furthermore, eye tracking data holds rich personal information, including personality traits (Hoppe et al., 2018), gender (Sammaknejad et al., 2017), and user identity (Cantoni et al., 2015).

These gaze-based applications have been studied across different platforms, underlining the significance of gaze as an interaction modality, and as a rich source of information on users as well as their mental and physical states, in both stationary and mobile settings. The most prevalent examples include use on personal devices, such as desktops and laptops (Zhang et al., 2018a; Huang et al., 2016a), tablets (Zhang et al., 2018a; Wood and Bulling, 2014), and mobile phones (Huang et al., 2017). More recently, new gaze-aware applications have been emerging and eye tracking devices have been integrated in public displays (Sugano et al., 2016), head-mounted VR devices (Piumsomboon et al., 2017), and vehicles (Palinko et al., 2010). However, application scenarios are still strongly influenced by the technical requirements of, mostly commercial, eye tracking devices. The use of camera-based, in particular appearance-based, gaze estimation in interactive applications has not been fully explored due to the lack of a complete, extensible, and cross-platform software toolkit.

2.2. Gaze estimation methods

Gaze estimation methods can be categorised into feature-based, model-based, and appearance-based approaches (Hansen and Ji, 2010). Feature-based methods use eye features for gaze direction regression, such as corneal reflections caused by reflections of an external light source on the cornea (Zhu and Ji, 2005; Zhu et al., 2006). Feature-based methods are commonly used in commercial eye trackers, such as the entry-level eye tracker Tobii EyeX.

Model-based methods first detect visual features, such as pupil, eyeball centre and eye corners, and then fit a geometric 3D eyeball model to them to estimate gaze (Chen and Ji, 2008). While early model-based methods required high-resolution cameras and infrared light sources (Ishikawa et al., 2004; Yamazoe et al., 2008), recent approaches only use input images from a single webcam (Valenti et al., 2012; Wood and Bulling, 2014). More recent works leverage machine learning to improve the accuracy of eye feature detection, for example to train eye feature detectors with a large amount of synthetic data (Baltrusaitis et al., 2018; Park et al., 2018).

Appearance-based methods also only require images obtained from an off-the-shelf camera, but directly learn a mapping from 2D input images to gaze directions using machine learning (Tan et al., 2002). Since there is no explicit eye feature detection step involved, this family of methods can typically handle input images with lower resolution and quality than model-based methods. Recent works leveraged both large-scale training data and deep learning to significantly improve the gaze estimation accuracy in more challenging real-world settings (Zhang et al., 2018c; Shrivastava et al., 2017; Zhang et al., 2017d). These advances have enabled a range of new applications, such as in eye contact detection (Zhang et al., 2017c; Müller et al., 2018b) or attention analysis on public displays (Sugano et al., 2016). Further new applications can be expected given the ever-increasing number of camera-equipped devices and displays, particularly mobile devices (Khamis et al., 2018).

Mainly because these three families of gaze estimation methods have different requirements in terms of hardware and deployment setting, they have never been compared with each other in a principled way. Consequently, interaction designers currently lack guidance on which methods they should choose for their particular applications. As discussed above, this prevents the exploration of gaze interaction applications taking the full advantages of these different gaze estimation methods.

3. Dataset for Evaluation

In gaze estimation research in computer vision, the primary experiment of interest is typically a performance comparison between different gaze estimation methods (Zhang et al., 2018c; Shrivastava et al., 2017). One likely reason for this is the lack of a suitable dataset that facilitates such a comparison. We therefore collected a dataset specifically geared to study performance of the different methods with respect to core affordances important in gaze interaction research: specifically, the distance between user and camera, number of required calibration samples, use of the method indoors or outdoors, as well as whether the user wears glasses or not.

For data recording, we used a Logitech C910 webcam with a resolution of pixels. We chose Tobii EyeX as the representative for feature-based gaze estimation (commercial eye tracking) because it is affordable, has recently become popular and is used in a range of research (e.g. Kurauchi et al., 2016; Schenk et al., 2017). Data collection was performed with 20 participants (10 female, aged between 21 and 45 years) whom we recruited through university mailing lists and notice boards. Our participants were from six different countries, and four of them wore glasses during the recording. During data collection, we labelled ground-truth gaze locations by showing the target stimuli on the screen as a circle shrinking to a dot that the participants were instructed to look at. The screen pose was measured beforehand using mirror-based calibration (Rodrigues et al., 2010), and ground-truth gaze locations were recorded in the 3D camera coordinate system.

Figure 2. Our data collection setting. Participants stood at a pre-defined distance and were instructed to look at a dot on the screen. Data recording was performed using a Tobii EyeX eye tracker and a Logitech C910 webcam.

Distances

Interaction distance is one of the most important factors to discuss the versatility of gaze estimation methods. Distance can vary even for the same device, such as a mobile phone that is held at different distances, and even more so across different devices, such as a mobile phone or a public display. Most commercial eye trackers, as represented by Tobii EyeX, are made to work optimally when the distance from the user is cm. In contrast, webcam-based gaze estimation could output estimates for a large range of distances as long as the target faces are detected in the input image. To evaluate the performance of gaze estimation methods at different distances, we collected the sample data with different distances between participants and cameras. The recording setting is illustrated in Figure 2. We showed the stimuli on a 55-inch public display, and mounted the webcam and Tobii EyeX below it. We chose distances of 30, 50, 75, 110, 140 and 180 cm, where 50 and 75 cm fall inside the operational distance range of Tobii EyeX. In order to make sure the ranges of view angles stay the same for different distances, the stimuli were displayed inside pre-defined regions corresponding to each distance. They roughly correspond to 8.4, 12, 21, 32, 40 and 50 inches for 30, 50, 75, 110, 140 and 180 cm, respectively.
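As a rough check of the region sizes listed above, the visual angle subtended by each region can be computed from its size and the viewing distance. The following sketch assumes that the quoted sizes are screen diagonals in inches and that the viewer is centred in front of the region; neither assumption is stated explicitly in the text, and the function and variable names are illustrative only.

```python
import math

INCH_TO_CM = 2.54

# Distance in cm -> stimulus region size in inches, as listed in the text.
REGIONS = {30: 8.4, 50: 12, 75: 21, 110: 32, 140: 40, 180: 50}


def visual_angle_deg(size_inch, distance_cm):
    """Angle subtended by a region of the given size (assumed to be a diagonal),
    seen from a point centred in front of it at the given distance."""
    half_size_cm = size_inch * INCH_TO_CM / 2.0
    return math.degrees(2.0 * math.atan(half_size_cm / distance_cm))


ANGLES = {d: visual_angle_deg(s, d) for d, s in REGIONS.items()}

for distance_cm, angle in ANGLES.items():
    print(f"{distance_cm:>3} cm: {angle:.1f} degrees")
```

Under these assumptions the regions subtend between roughly 34 and 41 degrees, which is consistent with the statement that the sizes only roughly equalise the range of view angles.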

Outdoor settings

Ideally, gaze estimation methods should also yield robust performance regardless of whether they are used for interaction indoors or outdoors. Therefore, we recorded two sessions with a laptop, one indoors and one outdoors. Due to the limited space, we mounted the Tobii EyeX below the laptop screen and a webcam above the screen. We instructed participants to stand or sit at around 50 cm from the cameras, and repeated the recording in both indoor and outdoor environments. During recording for the outdoor setting, the participants were free to choose one of three locations outside our lab, and the recordings were done at different times of day.

3.1. Procedure

During recording, participants were asked to stand or sit at certain distances, and were instructed to look at a shrinking circle at random locations and click the mouse when the circle became a dot. For each distance, we collected 80 samples: 60 samples were used for personal calibration and the remaining 20 for testing. We continuously recorded a video stream with the webcam, together with gaze estimates from the Tobii EyeX. The time stamps were logged individually for mouse click events, video frames, and outputs of the Tobii EyeX.
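Because the three streams are time-stamped separately, each mouse click has to be associated with a video frame and a tracker output before a sample can be formed. The text does not describe how this was done; the sketch below shows one plausible approach, nearest-timestamp matching with a maximum tolerated gap. The function names and the `max_gap` value are assumptions made for illustration.

```python
from bisect import bisect_left


def nearest_index(sorted_timestamps, t):
    """Index of the entry in a sorted timestamp list that is closest to t."""
    if not sorted_timestamps:
        raise ValueError("timestamp list is empty")
    i = bisect_left(sorted_timestamps, t)
    if i == 0:
        return 0
    if i == len(sorted_timestamps):
        return i - 1
    before, after = sorted_timestamps[i - 1], sorted_timestamps[i]
    return i if after - t < t - before else i - 1


def align_click(click_time, frame_times, tracker_times, max_gap=0.05):
    """Return (frame index, tracker index) closest to a click, or None if either
    stream has no entry within max_gap seconds (a hypothetical tolerance)."""
    frame_i = nearest_index(frame_times, click_time)
    tracker_i = nearest_index(tracker_times, click_time)
    if abs(frame_times[frame_i] - click_time) > max_gap:
        return None
    if abs(tracker_times[tracker_i] - click_time) > max_gap:
        return None
    return frame_i, tracker_i


# Illustrative timestamps in seconds, not taken from the recorded data.
FRAME_TIMES = [0.000, 0.033, 0.066, 0.100]
TRACKER_TIMES = [0.00, 0.02, 0.04, 0.06, 0.08]
MATCH = align_click(0.065, FRAME_TIMES, TRACKER_TIMES)
```

Rejecting matches beyond a tolerance is one way to discard clicks for which the tracker produced no valid output, as can happen outdoors.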

Since the calibration procedure implemented in the Tobii SDK is black-box and could be different from our implementation, we collected samples both with and without personal calibration from the Tobii SDK for comparison. Specifically, for distances of 50 and 75 cm, we first recorded 20 samples with the calibration profile from another person. These samples were used as a test set for the Tobii EyeX without any personal calibration. Then we performed the personal calibration provided by the Tobii SDK, which includes seven calibration points. Finally, we recorded the 80 samples with Tobii EyeX together with the webcam.

Figure 3. Data samples from one of our participants. Left: samples with recording distances marked at the top of the images. Right: samples under indoor and outdoor conditions, as marked at the top of images. As can be seen, our recorded data includes varying face sizes caused by different distances, and illumination conditions in indoor and outdoor settings.

Figure 3 shows example images recorded from the webcam at different distances, as well as in indoor and outdoor settings. The recording distances and conditions are marked on top of the samples. As can be seen from these images, the distances lead to different face sizes, which can affect the input quality for gaze estimation methods. The indoor and outdoor settings have very different illumination conditions, which directly affect the appearance of the faces. In addition, sunlight in the outdoor setting interferes with the active infrared light of the Tobii EyeX, which resulted in more invalid gaze data than in the indoor setting.

4. Experiments

The main goal of our experiments was to study the accuracy gap of state-of-the-art appearance-based gaze estimation (represented by MPIIFaceGaze (Zhang et al., 2017d)) with a model-based counterpart (represented by GazeML (Park et al., 2018)) as well as a commercial eye tracker (Tobii EyeX). The GazeML pupil detector was trained on large-scale synthetic eye images (Wood et al., 2015) with deep convolutional neural networks. The method takes the vector from the estimated 3D eyeball centre to the detected pupil location as the estimated gaze direction. We therefore deemed GazeML to represent the state of the art in model-based gaze estimation, because it reported (confirmed in our own comparisons) better accuracy than the gaze estimation method (Wood et al., 2015) implemented in the widely used OpenFace toolkit (Baltrusaitis et al., 2018). We trained the MPIIFaceGaze method using two commonly-used gaze datasets with full-face images, MPIIFaceGaze dataset (Zhang et al., 2017d) and EYEDIAP dataset (Mora et al., 2014). According to the training data distribution, this pre-trained model can handle head poses between horizontally and vertically under challenging real-world illuminations.

Eye tracking accuracy is often measured in terms of 2D gaze estimation error on the screen calculated as the differences between ground-truth and estimated gaze direction. However, since 2D on-screen error measurement also depends on the distance between the screen and user, it cannot be used to compare accuracy on our data with varying distances. Instead, we measured the 3D gaze estimation error in degrees, i.e. the difference between the estimated and the ground-truth 3D gaze vectors. 2D gaze points on the screen can be converted to 3D vectors in the camera coordinate system by using the screen-camera relationship. The on-screen gaze location represents the point end of the gaze vector, while the face centre serves as the starting point (Zhang et al., 2017d). Given that MPIIFaceGaze can output gaze vectors in the camera coordinate system, those can be directly compared with the ground truth vectors. GazeML outputs two gaze vectors, one for each eye. We first projected them to the screen plane to obtain two intersecting points, and then took the middle point of the two intersection points as the point end of the gaze vector. Tobii EyeX outputs 2D on-screen locations that were used as the point end of the gaze vector.
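The error measure described above can be sketched in a few lines. The code below is a minimal illustration rather than the evaluation code used for the experiments: it assumes that all points are expressed in the camera coordinate system, that the screen is a plane given by a point and a normal, and that all names (`angular_error_deg`, `intersect_plane`, and so on) are hypothetical.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def angular_error_deg(estimated, ground_truth):
    """Angle in degrees between two 3D gaze vectors (need not be unit length)."""
    norm = math.sqrt(_dot(estimated, estimated) * _dot(ground_truth, ground_truth))
    cos_angle = max(-1.0, min(1.0, _dot(estimated, ground_truth) / norm))
    return math.degrees(math.acos(cos_angle))


def gaze_vector(face_centre, screen_point):
    """Gaze vector starting at the face centre and ending at an on-screen point."""
    return [s - f for s, f in zip(screen_point, face_centre)]


def intersect_plane(origin, direction, plane_point, plane_normal):
    """Point where a gaze ray hits the screen plane."""
    denom = _dot(direction, plane_normal)
    if abs(denom) < 1e-12:
        raise ValueError("gaze ray is parallel to the screen plane")
    offset = [p - o for p, o in zip(plane_point, origin)]
    t = _dot(offset, plane_normal) / denom
    return [o + t * d for o, d in zip(origin, direction)]


def midpoint(p, q):
    return [(a + b) / 2.0 for a, b in zip(p, q)]


def on_screen_error_cm(angle_deg, distance_cm):
    """On-screen offset produced by an angular error at a given distance."""
    return distance_cm * math.tan(math.radians(angle_deg))


# Illustrative geometry in cm: screen plane z = 0, user 60 cm in front of it.
SCREEN_POINT, SCREEN_NORMAL = [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]
FACE_CENTRE = [0.0, 0.0, 60.0]
left_hit = intersect_plane([-3.0, 0.0, 60.0], [0.05, 0.0, -1.0], SCREEN_POINT, SCREEN_NORMAL)
right_hit = intersect_plane([3.0, 0.0, 60.0], [0.0, 0.0, -1.0], SCREEN_POINT, SCREEN_NORMAL)
estimated = gaze_vector(FACE_CENTRE, midpoint(left_hit, right_hit))
truth = gaze_vector(FACE_CENTRE, [0.0, 0.0, 0.0])
ERROR_DEG = angular_error_deg(estimated, truth)
```

The helper `on_screen_error_cm` makes the dependence on distance explicit: an angular error of 2.5 degrees corresponds to an on-screen offset of about 1.3 cm at 30 cm but about 7.9 cm at 180 cm, which is why 2D errors cannot be compared across distances.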

4.1. Distances between user and camera

Figure 4. Gaze estimation errors of different methods in degrees across distances between the user and camera. Dots are results averaged across all 20 participants for each distance, and we linked them by lines.

We first evaluated accuracy of the different methods across different distances between user and camera. With our recorded data, we used all of the 60 samples to perform personal calibration for all methods, and tested them on the remaining 20 samples. To show the full ability of Tobii EyeX, we first used its own personal calibration provided by Tobii SDK which requires seven calibration points. We then applied the personal calibration with an additional 53 samples. We conducted this calibration procedure at each distance. The errors were averaged across all participants.

The results are summarised in Figure 4. There are statistically significant differences between the three methods (t-test). While Tobii EyeX performed the best, with a gaze estimation error of 1.2 degrees at a distance of 50 cm and 0.8 degrees at 75 cm, its tracking range is severely limited compared to the other methods. The appearance-based method (MPIIFaceGaze) achieved the second-best result, with mean gaze estimation errors from 2.3 to 3.1 degrees, and this accuracy is robust across the full distance range from 30 cm to 180 cm with only minor variation. The model-based method (GazeML) achieved the worst accuracy, with mean gaze estimation errors ranging from 3.8 degrees to 12.1 degrees. In contrast to MPIIFaceGaze, GazeML’s accuracy was also sensitive to the distance: the larger the distance, the worse its accuracy. This is most likely caused by the fact that accurate pupil detection and eyeball centre estimation, on which these types of methods crucially rely, become increasingly difficult with larger distances.

In summary, this evaluation shows that while there is still an accuracy gap of around two degrees between the appearance-based method and Tobii EyeX, the former has a much larger operational range. This finding underlines the practical usefulness of appearance-based gaze estimation, in particular for interactive applications where the ability to track robustly across a large interaction space is important and where gaze estimation error can be compensated for, e.g. on interactive public displays using pursuits (Vidal et al., 2013).

4.2. Number of calibration samples

Figure 5. Gaze estimation errors of different methods in degrees across numbers of personal calibration samples. Dots are results averaged across all 20 participants, and we linked them with lines.

The number of required calibration samples is an important factor for usability. Calibration with a large number of samples can be time-consuming and prohibitive for certain applications where spontaneous interaction is crucial, e.g. on gaze-enabled public displays (Zhang et al., 2013). Therefore, we evaluated the influence of the number of calibration samples on gaze estimation accuracy. We analysed accuracy while varying the number of samples used for calibration, from zero, to one, two, three, four, five, seven, 10, 15, 20, 30, 40, 50 and 60. For the calibration-free case (zero calibration samples) we directly used the raw gaze estimates as calculated by GazeML and MPIIFaceGaze, while Tobii EyeX was uncalibrated. We opted for a distance of 75 cm because this is exactly within the optimal tracking range for Tobii EyeX.

Figure 5 summarises the results and reveals a number of interesting insights. As expected, the calibration-free setting yields a large gaze estimation error, between 5.5 degrees of visual angle (for Tobii EyeX) and 12.1 degrees (for model-based GazeML). The appearance-based method (MPIIFaceGaze) achieved 6.4 degrees. These differences were statistically significant (t-test). However, accuracy gets even worse with a single calibration sample, for which the third-order polynomial mapping function is underdetermined. With an increasing number of calibration samples, gaze estimation error decreases for all methods, down to 1.1 degrees for Tobii EyeX, and 2.5 degrees for MPIIFaceGaze. These results show that current appearance-based methods (MPIIFaceGaze) can achieve accuracy competitive with Tobii EyeX, even with only four calibration samples. This is exciting: being competitive in terms of both accuracy and usability, current appearance-based methods seem feasible for a range of everyday gaze interfaces, such as on camera-equipped personal devices.
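To illustrate why a third-order polynomial mapping is underdetermined for very few calibration samples, the sketch below fits a cubic correction to one gaze coordinate by least squares. The exact form of the mapping function is not given in the text, so the per-axis cubic with four coefficients, the small ridge term that keeps the system solvable, and all function names are assumptions made for this example.

```python
def cubic_features(v):
    """Monomials 1, v, v^2, v^3 of a scalar gaze coordinate."""
    return [1.0, v, v * v, v * v * v]


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x


def fit_cubic(estimated, target, ridge=1e-6):
    """Least-squares cubic mapping from raw estimates to calibration targets.
    The ridge term (an assumption) keeps the normal equations solvable when
    there are fewer samples than the four coefficients."""
    rows = [cubic_features(v) for v in estimated]
    ata = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
            for j in range(4)] for i in range(4)]
    atb = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(4)]
    return solve_linear(ata, atb)


def apply_cubic(coeffs, v):
    return sum(c * f for c, f in zip(coeffs, cubic_features(v)))


def true_mapping(v):
    """Synthetic ground-truth distortion used only for this illustration."""
    return 0.5 + 1.2 * v - 0.1 * v ** 3


RAW = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
WELL_DETERMINED = fit_cubic(RAW, [true_mapping(v) for v in RAW])
ONE_SAMPLE = fit_cubic([1.0], [true_mapping(1.0)])
```

With six samples the fit recovers the synthetic mapping closely. With a single sample, four coefficients are constrained by one equation, so the fit reproduces that sample but is essentially arbitrary elsewhere, which is in line with the increased error reported for one calibration sample.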

4.3. Indoor and outdoor settings

Figure 6. Gaze estimation errors for indoor and outdoor settings. Bars show mean error across all participants; error bars indicate standard deviations. The numbers above the bars indicate accuracy differences from indoor to outdoor setting in percent.

We then evaluated the impact on accuracy of different illumination conditions, a common problem when moving between indoor and outdoor interaction settings. For this evaluation, we used the data collected for indoor and outdoor settings described in Section 3. We again used all 60 samples for personal calibration and evaluated the different methods for both settings. The results of this evaluation are summarised in Figure 6, where bars show the gaze estimation error in degrees across all 20 participants, and error bars show standard deviations. The figure also shows the relative accuracy differences between indoors and outdoors in percent.

As can be seen from the figure, the best accuracy was again achieved by Tobii EyeX (indoors: 1.1 degrees, outdoors: 1.3 degrees), followed by the appearance-based method (MPIIFaceGaze) (indoors: 2.8 degrees, outdoors: 3.2 degrees), and the model-based method (GazeML) (indoors: 4.9 degrees, outdoors: 5.7 degrees). Although the accuracy differences between indoors and outdoors are not statistically significant (t-test), better accuracy tends to be achieved in the indoor environment, likely as a result of changing illumination conditions outdoors.
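The relative differences between settings can be reproduced approximately from the mean errors quoted above. The formula used for the figure is not stated, so the sketch below assumes the change is expressed relative to the indoor error; the function name is illustrative, and the results may differ slightly from the figure because the quoted means are rounded.

```python
def relative_change_percent(baseline_error, other_error):
    """Change of other_error relative to baseline_error, in percent."""
    return (other_error - baseline_error) / baseline_error * 100.0


# Mean errors in degrees (indoors, outdoors), as quoted in the text.
ERRORS_DEG = {
    "Tobii EyeX": (1.1, 1.3),
    "MPIIFaceGaze": (2.8, 3.2),
    "GazeML": (4.9, 5.7),
}

CHANGES = {
    method: relative_change_percent(indoors, outdoors)
    for method, (indoors, outdoors) in ERRORS_DEG.items()
}

for method, change in CHANGES.items():
    print(f"{method}: {change:.1f} % higher error outdoors")
```

Under this assumption all three methods degrade by a similar relative amount outdoors, roughly 14 to 18 percent.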

4.4. With and without glasses

Figure 7. Gaze estimation errors for participants wearing glasses or not. Bars show mean error across all participants; error bars indicate standard deviations. The numbers above the bars indicate accuracy differences between not wearing glasses and wearing glasses in percent.

Glasses can have a significant effect on gaze estimation error due to the strong reflections and distortions they may cause. In our dataset, four participants wore glasses while the rest did not. To evaluate the impact of glasses, we analysed the error after personal calibration with 60 samples, at distances of 50 and 75 cm, and in both indoor and outdoor settings. The results of this evaluation are shown in Figure 7, where bars show the gaze estimation error in degrees across participants when wearing or not wearing glasses, and the error bars show standard deviations. These differences were statistically significant (t-test). The figure also shows the relative accuracy differences between wearing glasses and not wearing glasses in percent.

As we can see from the figure, glasses have a stronger effect on gaze estimation accuracy than illumination conditions (see Figure 6). The gaze estimation errors were 5.4 (without) and 6.3 (with) degrees for the model-based method (GazeML), 2.8 and 4.5 degrees for the appearance-based method (MPIIFaceGaze), and 1.0 and 1.4 degrees for Tobii EyeX. The difference in error between wearing and not wearing glasses is larger for the appearance-based method (MPIIFaceGaze) than for Tobii EyeX. As in the previous evaluation of indoor and outdoor settings, the likely reason for this is the training data, which in this case does not contain a sufficient number of images of people wearing glasses. As a result, the appearance-based method cannot handle these cases as well. Another reason could be that Tobii EyeX uses infrared light, which filters out most reflections on the glasses.

5. Implications for gaze applications

Gaze estimation devices, especially the dominant commercial eye trackers, have facilitated the development of gaze-based interactive systems in past years. In this section, we discuss the implications of our findings for the most important gaze-based applications as well as the potential of appearance-based methods for application scenarios that only require a single off-the-shelf camera.

5.1. Gaze applications

Gaze-based interactive applications can be divided into four groups: explicit eye input, attentive user interfaces, gaze-based user modelling, and passive eye monitoring (Majaranta and Bulling, 2014). Explicit eye input applications use gaze input to command and control the computer. Attentive user interfaces do not expect explicit commands from the user but instead use natural eye movements subtly in the background. Gaze-based user modelling uses gaze information to understand user behaviour, cognitive processes, and intentions; this usually utilises data from short time periods. Passive eye monitoring refers to off-line analysis of long-term gaze data for diagnostic applications.

While all of the above categories take gaze information from users, they place different requirements on the properties of the gaze estimation methods. In this section, we summarise the requirements of applications regardless of the technical limitations of existing eye tracking methods. In this way, we clarify how the application scenarios can be extended by using appearance-based gaze estimation beyond the limitations of commonly-used commercial eye trackers. We show the relationships between different gaze-based applications and affordances in Figure 8, and explain them in the following.

Figure 8. Relationship of different gaze-based applications and affordances. We show three typical gaze-based applications and the different levels of each affordance.

Accuracy

Applications that rely on explicit eye input usually require high-accuracy gaze estimates, such as for eye typing (Kurauchi et al., 2016; Mott et al., 2017), authentication (Khamis et al., 2016), or system control (Nguyen and Liu, 2016). However, the allowed gaze estimation error depends on the sizes of the gaze targets. These gaze targets could be fine-level details on the screen (D’Angelo and Gergle, 2018), closely connected (Zhang et al., 2017b) or separate content on the screen (D’Angelo and Gergle, 2016), large physical objects (Andrist et al., 2017), or rough gaze direction  (Otsuki et al., 2017; Zhang et al., 2017a). Gaze-based user modelling and passive eye monitoring require gaze estimation to detect gaze patterns instead of individual points. These gaze patterns could be large regions on the screen, such as during saliency prediction (Xu et al., 2016); relative eye movements for inferring everyday activities (Steil and Bulling, 2015; Sattar et al., 2015), cognitive load and processes (Tessendorf et al., 2011; Bulling and Zander, 2014), or mobile interaction (Vaitukaitis and Bulling, 2012); or off-line user behaviour analysis for game play (Newn et al., 2018). For attentive user interfaces, usually it is sufficient to detect the attention of the user (Alt et al., 2016) with binary eye contact detection (Smith et al., 2013; Dickie et al., 2004; Zhang et al., 2017c; Müller et al., 2018b).

Usability

We rate an application as having high usability if it can work in a calibration-free fashion and can be used by multiple users simultaneously. Since explicit eye input usually assumes a single target user, it is relatively straightforward to include a personal calibration process. Such personal calibration is also necessitated by the high accuracy requirement discussed above. While attentive user interfaces also require specific object calibration for each different camera-object relationship (Smith et al., 2013), the underlying use case scenario demands pervasive multi-user gaze estimation without personal calibration. For gaze-based user modelling and passive eye monitoring, relative eye movement could be sufficient, considering that applications have already been implemented in a calibration-free fashion (Zhang et al., 2013). They can also be naturally extended to multi-user scenarios.

Robustness

Performance consistency between indoor and outdoor environments is becoming increasingly important with the popularisation of personal mobile devices. It matters for passive eye monitoring, which usually requires long-term recording throughout daily life (Steil and Bulling, 2015). Explicit eye input and gaze-based user modelling also require such consistency, as users could run these applications anywhere on their mobile devices. In contrast, attentive user interfaces can operate within a stable scene if the target object is stationary. In addition, many people wear glasses. Glasses can cause problems for gaze-based interaction, since thick frames, distortion, and reflections can impair the quality of gaze estimates. All types of gaze-based applications must therefore be robust to users with and without glasses.

5.2. Extension of application scenarios

Our experimental results show that current appearance-based gaze estimation can achieve reasonable accuracy. Figure 4 shows that the appearance-based method achieves an accuracy of around two to three degrees, which is good enough for some applications. Figure 5 and Figure 6 show that appearance-based gaze estimation is comparable to Tobii EyeX in terms of its requirements for calibration samples and its robustness to indoor and outdoor environments. Appearance-based gaze estimation could thus be used in applications requiring explicit eye input, such as object selection (D’Angelo and Gergle, 2016; Andrist et al., 2017) or gaze pointing (Otsuki et al., 2017; Zhang et al., 2017a). Appearance-based methods are already suitable for applications that only require measuring relative changes in gaze direction over time, such as gaze-based user modelling and passive eye monitoring, or detecting gaze patterns such as smooth pursuit eye movements (Esteves et al., 2015). The latest attentive user interfaces could also use appearance-based gaze estimation methods, such as for eye contact detection (Smith et al., 2013; Zhang et al., 2017c; Müller et al., 2018b) or attention forecasting (Steil et al., 2018). This suggests that webcams could replace commercial eye trackers for some applications, and even enable new application scenarios such as online software-based services.

Figure 4 also shows the advantage of appearance-based gaze estimation in its large operating range of distances between user and camera: it maintains consistent gaze estimation accuracy across the different distances. This enables gaze-based applications on different devices, from a mobile phone at a short distance to a large TV at a long distance. Current gaze-based applications at these short and long distances are limited to rough gaze direction obtained with previous model-based gaze estimation methods (Zhang et al., 2017a; Zhang et al., 2015). Appearance-based gaze estimation can achieve a gaze estimation error of around two to three degrees for these devices, which allows researchers to use the estimated gaze points for fine-level interaction, such as object selection or attention measurement.
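To make the two to three degree figure more tangible, the angular error can be converted into an approximate on-screen offset for a given viewing distance. The sketch below assumes the user looks roughly perpendicularly at the screen, so the offset is the distance multiplied by the tangent of the error. The device distances are illustrative assumptions, not values from our experiments.

```python
import math


def on_screen_error_cm(angular_error_deg, distance_cm):
    """Approximate on-screen offset caused by a given angular gaze error.

    Assumes a roughly perpendicular viewing direction, so that
    offset = distance * tan(error).
    """
    return distance_cm * math.tan(math.radians(angular_error_deg))


# Illustrative viewing distances in centimetres (assumptions, not measured values).
devices = {"phone": 30.0, "laptop": 60.0, "TV": 200.0}

for name, distance in devices.items():
    low = on_screen_error_cm(2.0, distance)
    high = on_screen_error_cm(3.0, distance)
    print(f"{name:>6} at {distance:5.0f} cm: {low:4.1f} to {high:4.1f} cm offset")
```

Under these assumptions, gaze targets need to be noticeably larger than the computed offset to be selected reliably, which is why the same angular accuracy supports much finer targets on a handheld device than on a distant display.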

Another major limitation of current commercial eye trackers is that they usually only work with a single user due to the camera’s limited angle of view. This is not an issue for gaze estimation with a webcam, which can output gaze information for multiple people. It therefore enables new applications that involve multiple persons, whose gaze information can be obtained with a single webcam. This has been demonstrated in previous work that performs eye contact detection with webcams where more than one user can appear in the input image (Müller et al., 2018b).

Above all, gaze estimation with a single webcam instead of an additional commercial eye tracker enables new forms of applications. It allows researchers to develop gaze-based applications with common devices, such as cellphones, tablets, laptops and TVs. Participants can stick with their own personal devices and run the gaze-based software to perform the interaction without any additional hardware requirement. This is the key advantage of using a single webcam for gaze estimation instead of the current commercial eye trackers.

6. The OpenGaze software toolkit

As shown before, appearance-based gaze estimation has significant potential to facilitate gaze-based interactive applications on the millions of camera-equipped devices already used worldwide today, such as mobile phones or laptops. However, most existing methods – if code is available for these at all – were published with research-oriented implementations and there is no easy-to-use software toolkit available that is specifically geared to HCI purposes. It is also challenging for designers to integrate existing computer vision and machine learning pipelines into end-user applications.

We therefore extended the MPIIFaceGaze method into a complete open source toolkit for gaze-based applications. The goal of our OpenGaze software toolkit is to provide an easy way for HCI researchers and designers to use appearance-based gaze estimation techniques, and to enable a wider range of eye tracking applications using off-the-shelf cameras. We designed OpenGaze with four main objectives in mind: 1) to implement state-of-the-art appearance-based gaze estimation for HCI researchers, 2) to make the functionality easy to use and work out-of-the-box for rapid development, 3) to be extensible to include more functions, and 4) to be flexible, so that developers can replace any component with their own methods.

The overall pipeline of OpenGaze is shown in Figure 9. Unlike the dominant commercial eye trackers, OpenGaze takes ordinary RGB images as input, such as a video stream from a camera, recorded videos, a directory of images, or a single image. Given an input frame or image, OpenGaze first detects faces and facial landmarks, which are used to estimate the 3D head pose and to perform data normalisation. The data normalisation procedure essentially crops the face image with a normalised camera to cancel out some of the appearance variations caused by head pose. The cropped face image is then fed to the appearance-based gaze estimation method. The output of the gaze estimation model is the gaze direction in the camera coordinate system, which can be further projected to the screen coordinate system. OpenGaze has user-friendly APIs that let developers perform their desired functions with minimal coding effort, and also provides easy-to-install packages including pre-compiled libraries to facilitate the use of gaze estimation in interactive applications.

Figure 9. Taking an image as input, our OpenGaze toolkit first detects the faces and facial landmarks (a) and then crops the face image using data normalisation (Zhang et al., 2018b). The appearance-based gaze estimation model predicts the gaze direction in the camera coordinate system from the normalised face image. The direction is finally converted to the screen coordinate system.

6.1. Face and facial landmark detection

Given an input image, the first step is to detect the face as well as the facial landmarks of the user. OpenGaze integrates OpenFace 2.0 (Baltrusaitis et al., 2018) for facial landmark detection, which in turn relies on the widely used dlib computer vision library (King, 2009) to detect the faces in the input image. OpenFace also assigns a unique ID to each detected face via temporal tracking. The detected facial landmarks (four eye corners and two mouth corners) are mainly used to estimate the 3D head pose, including head rotation and translation. This is achieved by fitting a pre-defined generic 3D face model to the detected facial landmarks, with the initial solution estimated using the EPnP algorithm (Lepetit et al., 2009). The estimated 3D head pose is used in data normalisation, and the 3D face centre (the centroid of the six facial landmarks) is taken as the origin of gaze directions.
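As a minimal sketch of how the face centre follows from the head pose, the six landmarks of the generic face model can be transformed into the camera coordinate system and averaged. The model coordinates, the head pose, and the function names below are hypothetical placeholders chosen for illustration; they are not the values or the API used in OpenGaze.

```python
def mat_vec(R, v):
    """Multiply a 3x3 matrix (list of rows) with a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


def face_centre(model_points, R, t):
    """Map generic model landmarks into camera coordinates and return their centroid.

    R and t are the head rotation and translation, i.e. the head pose
    obtained from fitting the 3D face model to the detected 2D landmarks.
    """
    cam_points = [[a + b for a, b in zip(mat_vec(R, p), t)] for p in model_points]
    n = len(cam_points)
    return [sum(p[k] for p in cam_points) / n for k in range(3)]


# Hypothetical generic model in millimetres: four eye corners, two mouth corners.
MODEL_POINTS = [
    [-45.0, -30.0, 0.0], [-15.0, -30.0, 0.0],
    [15.0, -30.0, 0.0], [45.0, -30.0, 0.0],
    [-25.0, 40.0, 0.0], [25.0, 40.0, 0.0],
]

# Hypothetical head pose: frontal head, 600 mm in front of the camera.
R_head = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_head = [0.0, 0.0, 600.0]

centre = face_centre(MODEL_POINTS, R_head, t_head)
print(centre)
```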

6.2. Data normalisation

As input for appearance-based gaze estimation, the system then crops and resizes the face image according to the facial landmarks. However, since appearance-based 3D gaze estimation is a geometric task, inappropriate cropping and resizing can significantly affect the estimation accuracy. Data normalisation was proposed to train appearance-based gaze estimators efficiently, and it cancels out the geometric variability by warping input images to a normalised space (Zhang et al., 2018b). OpenGaze implements the data normalisation scheme as pre-processing for appearance-based gaze estimation. Specifically, OpenGaze first crops the face image after rotating the camera so that the x-axis of the camera coordinate system is perpendicular to the y-axis of the head coordinate system. Then it scales the image so that the normalised camera is located at a fixed distance away from the face centre. In this way, the input image has only 2 degrees of freedom in head pose for all kinds of cameras with different intrinsic parameters.
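The rotation and scaling described above can be sketched as follows. This is one reading of the description, not the OpenGaze implementation: the z-axis of the normalised camera is pointed at the face centre, the x-axis is chosen perpendicular to the head's y-axis, and a scale factor along z moves the face centre to a fixed distance. The function names and the numeric inputs are assumptions, and the image warp that applies this transform to the pixels is omitted.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalisation_transform(face_centre, head_y_axis, norm_distance):
    """Return (R, scale) for the normalised camera.

    The rows of R are the axes of the normalised camera expressed in the
    original camera coordinate system. `scale` is applied along z so that
    the face centre ends up at `norm_distance`.
    """
    z_axis = normalize(face_centre)                  # look straight at the face centre
    x_axis = normalize(cross(head_y_axis, z_axis))   # perpendicular to the head y-axis
    y_axis = cross(z_axis, x_axis)                   # completes a right-handed frame
    distance = math.sqrt(dot(face_centre, face_centre))
    return [x_axis, y_axis, z_axis], norm_distance / distance


# Hypothetical inputs: face centre in mm and a slightly tilted head y-axis.
centre = [50.0, -20.0, 500.0]
head_y = [0.1, 0.99, 0.05]
R, scale = normalisation_transform(centre, head_y, norm_distance=600.0)

rotated = [dot(row, centre) for row in R]
print(rotated[0], rotated[1], scale * rotated[2])
```

After the transform, the face centre lies on the optical axis of the normalised camera at the chosen distance, so the only remaining head pose variation in the cropped image is the head rotation relative to that axis.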

6.3. Gaze estimation

OpenGaze implements an appearance-based gaze estimation method that reports state-of-the-art accuracy (Zhang et al., 2017d). In this method, the whole face image is fed into the convolutional neural network to output 3D gaze directions. OpenGaze uses the same neural network architecture as in (Zhang et al., 2017d), which is based on the AlexNet architecture (Krizhevsky et al., 2012). The toolkit comes with the model used in our experiments, which was pre-trained on two commonly used gaze datasets with full-face images, the MPIIFaceGaze dataset (Zhang et al., 2017d) and the EYEDIAP dataset (Mora et al., 2014). Therefore, while the toolkit is flexible enough to replace the network with user-trained ones, there is generally no need to train the network from scratch, and developers can directly use the pre-trained model. It is important to note that OpenGaze is fully extensible, e.g. it allows developers to add new network architectures and train models on other datasets.

6.4. Projection on screen

The gaze direction is estimated by the appearance-based gaze estimator in the normalised space, and OpenGaze projects it back to the original camera coordinate system. OpenGaze also provides APIs to project the 3D gaze direction from the camera coordinate system to the 2D screen coordinate system, and vice versa. To project the gaze direction to the 2D screen coordinate system, OpenGaze requires the camera intrinsic parameters and the camera-screen relationship, i.e., the rotation and translation between the camera and screen coordinate systems. The camera intrinsic parameters can be obtained with the camera calibration function in OpenCV (Bradski, 2000) by moving a calibration pattern in front of the camera. The camera-screen relationship can be calculated with a mirror-based camera-screen calibration method (Rodrigues et al., 2010). This requires showing a camera calibration pattern on the screen and then moving a planar mirror in front of the camera so that the camera captures several calibration samples with a full view of the calibration pattern.
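Geometrically, projecting a gaze direction onto the screen amounts to intersecting the gaze ray with the screen plane. The sketch below illustrates this under stated assumptions: the screen is the plane z = 0 of the screen coordinate system, and `R_screen` and `t_screen` (hypothetical names) map screen coordinates to camera coordinates. It is not the OpenGaze API, and the example geometry is invented for illustration.

```python
def mat_vec(R, v):
    """Multiply a 3x3 matrix (list of rows) with a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


def transpose(R):
    return [list(col) for col in zip(*R)]


def gaze_to_screen(origin_cam, direction_cam, R_screen, t_screen):
    """Intersect a gaze ray with the screen plane.

    Assumes p_cam = R_screen * p_screen + t_screen and that the screen is
    the plane z = 0 in screen coordinates. The direction does not need to
    be unit length. Returns (x, y) in screen units, or None if the ray is
    parallel to the screen or points away from it.
    """
    Rt = transpose(R_screen)  # inverse of a rotation matrix
    o = mat_vec(Rt, [a - b for a, b in zip(origin_cam, t_screen)])
    d = mat_vec(Rt, direction_cam)
    if abs(d[2]) < 1e-12:
        return None
    lam = -o[2] / d[2]
    if lam <= 0:
        return None
    return (o[0] + lam * d[0], o[1] + lam * d[1])


# Hypothetical setup in mm: screen axes aligned with the camera axes, with the
# screen origin 250 mm to the left of and 10 mm below the camera.
R_screen = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_screen = [-250.0, 10.0, 0.0]

# Face centre 600 mm in front of the camera, looking slightly downwards.
point = gaze_to_screen([0.0, 0.0, 600.0], [0.0, 0.1, -1.0], R_screen, t_screen)
print(point)
```

Converting the resulting position from physical units to pixels additionally requires the physical size and resolution of the screen.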

6.5. Personal calibration

Cross-person gaze estimation is the ultimate goal of data-driven appearance-based gaze estimation, and as described above, OpenGaze comes with a pre-trained generic gaze estimator that works across users, environments, and cameras without any personal calibration (Zhang et al., 2018c). However, if the application allows for additional calibration data collection, the gaze estimation accuracy can be significantly improved by personal calibration. To make the estimated gaze usable for interactive applications, OpenGaze further provides a personal calibration scheme to make corrections to raw gaze estimates from the appearance-based gaze estimation model. To collect the ground-truth calibration samples, OpenGaze provides a GUI. During personal calibration, OpenGaze shows a shrinking circle on the screen, and the user has to fixate on the circle until it becomes a dot and then confirm that he/she is looking at the dot by clicking the mouse within half a second. Meanwhile, OpenGaze captures the face image from the webcam and associates it with the dot position on the screen. These samples are used to find a third-order polynomial mapping function between the estimated and ground-truth 2D gaze locations in the screen coordinate system.
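One possible form of such a mapping is sketched below: a full third-order polynomial in the two estimated screen coordinates, fitted separately for each output axis by least squares. The exact polynomial terms and solver used in OpenGaze are not specified here, so the feature set, the normal-equation solver, and the synthetic data are assumptions for illustration. Coordinates are normalised to the range 0 to 1 to keep the fit numerically well behaved.

```python
def cubic_features(x, y):
    """All monomials in x and y up to total degree three (10 terms)."""
    return [1.0, x, y, x * x, x * y, y * y, x ** 3, x * x * y, x * y * y, y ** 3]


def solve_linear(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (M[i][n] - sum(M[i][j] * w[j] for j in range(i + 1, n))) / M[i][i]
    return w


def fit_calibration(estimated, truth):
    """Least-squares fit of one cubic polynomial per screen axis."""
    F = [cubic_features(x, y) for x, y in estimated]
    k = len(F[0])
    AtA = [[sum(f[i] * f[j] for f in F) for j in range(k)] for i in range(k)]
    weights = []
    for axis in range(2):
        Atb = [sum(f[i] * t[axis] for f, t in zip(F, truth)) for i in range(k)]
        weights.append(solve_linear(AtA, Atb))
    return weights


def apply_calibration(weights, x, y):
    """Map a raw gaze estimate to a corrected screen location."""
    f = cubic_features(x, y)
    return tuple(sum(w * v for w, v in zip(ws, f)) for ws in weights)


def distort(x, y):
    """Synthetic person-specific offset used only to generate example data."""
    return (0.05 + 0.9 * x + 0.08 * y * y, -0.03 + 1.1 * y - 0.05 * x * y)


# Synthetic calibration samples on a 5x5 grid of normalised screen positions.
estimated = [(i / 4.0, j / 4.0) for i in range(5) for j in range(5)]
truth = [distort(x, y) for x, y in estimated]

weights = fit_calibration(estimated, truth)
print(apply_calibration(weights, 0.3, 0.7))
```

A full cubic in two variables has ten coefficients per axis, so at least ten well-spread calibration samples are needed for this particular form; with fewer samples a lower-order mapping would be required.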

6.6. Implementation and speed

With extensibility in mind, we implemented each of the above components as separate C++ classes with interfaces for communicating with each other. Developers can replace components or add new ones if desired. In our tests, OpenGaze ran at 13 fps on a webcam stream on a desktop machine with a 3.50GHz CPU and a GeForce GTX TITAN Black GPU with 6GB memory. Note that the most time-consuming step is face and facial landmark detection, which can only reach 17 fps. Replacing these components, which is possible in our toolkit, is expected to yield a much faster speed. In addition, the gaze estimation network can also be replaced with more compact versions to achieve higher speed and memory efficiency.
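The two reported frame rates allow a rough per-frame time budget to be derived. The sketch below assumes that the pipeline stages run strictly one after another, so that their per-frame times add up; this is an assumption for illustration. The 60 fps detector is a hypothetical replacement, not a measured result.

```python
def frame_time_ms(fps):
    """Per-frame processing time in milliseconds for a given frame rate."""
    return 1000.0 / fps


total_ms = frame_time_ms(13.0)       # full pipeline, as reported above
detection_ms = frame_time_ms(17.0)   # face and landmark detection, as reported above
remaining_ms = total_ms - detection_ms  # everything else, under the sequential assumption


def pipeline_fps(detection_fps, other_ms):
    """Overall frame rate if only the detection stage is swapped out."""
    return 1000.0 / (frame_time_ms(detection_fps) + other_ms)


print(f"detection: {detection_ms:.1f} ms, remaining stages: {remaining_ms:.1f} ms")
print(f"with a hypothetical 60 fps detector: {pipeline_fps(60.0, remaining_ms):.1f} fps")
```

Under this assumption, detection accounts for roughly three quarters of the per-frame time, which is why replacing it is the most promising way to speed up the pipeline.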

7. Conclusion

In this work, we compared the appearance-based method with a model-based method and a commercial eye tracker, and we showed that it achieves better performance than the model-based method, and a larger operational range than the commercial eye tracker. Following the result, we further discussed design implications for the most important gaze-based applications. We present the first software toolkit OpenGaze which provides an easy-to-use webcam-based gaze estimation method for interaction designers. The goal of the toolkit is to be applied to a diverse range of interactive systems, and we evaluated the performance of state-of-the-art appearance-based gaze estimation across different affordances. We believe our OpenGaze enables new forms of application scenarios for both HCI designers and researchers.

8. Acknowledgments

This work was supported by the European Research Council (ERC; grant agreement 801708) as well as by a JST CREST research grant (JPMJCR14E1), Japan.

References

  • Alt et al. (2016) Florian Alt, Andreas Bulling, Lukas Mecke, and Daniel Buschek. 2016. Attention, please! Comparing Features for Measuring Audience Attention Towards Pervasive Displays. In Proc. ACM SIGCHI Conference on Designing Interactive Systems (DIS). 823–828. https://doi.org/10.1145/2901790.2901897
  • Andrist et al. (2017) Sean Andrist, Michael Gleicher, and Bilge Mutlu. 2017. Looking Coordinated: Bidirectional Gaze Mechanisms for Collaborative Interaction with Virtual Characters. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, 2571–2582.
  • Baltrusaitis et al. (2018) Tadas Baltrusaitis, Amir Zadeh, Yao Chong Lim, and Louis-Philippe Morency. 2018. OpenFace 2.0: Facial Behavior Analysis Toolkit. In Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE International Conference on. IEEE, 59–66.
  • Bradski (2000) G. Bradski. 2000. The OpenCV Library. Dr. Dobb’s Journal of Software Tools (2000).
  • Bulling (2016) Andreas Bulling. 2016. Pervasive Attentive User Interfaces. IEEE Computer 49, 1 (2016), 94–98. https://doi.org/10.1109/MC.2016.32
  • Bulling et al. (2013) Andreas Bulling, Christian Weichel, and Hans Gellersen. 2013. EyeContext: Recognition of High-level Contextual Cues from Human Visual Behaviour. In Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI). 305–308. https://doi.org/10.1145/2470654.2470697
  • Bulling and Zander (2014) Andreas Bulling and Thorsten O. Zander. 2014. Cognition-Aware Computing. IEEE Pervasive Computing 13, 3 (2014), 80–83. https://doi.org/10.1109/mprv.2014.42
  • Cantoni et al. (2015) Virginio Cantoni, Chiara Galdi, Michele Nappi, Marco Porta, and Daniel Riccio. 2015. GANT: Gaze analysis technique for human identification. Pattern Recognition 48, 4 (2015), 1027–1038. https://doi.org/10.1016/j.patcog.2014.02.017
  • Chen and Ji (2008) Jixu Chen and Qiang Ji. 2008. 3D gaze estimation with a single camera without IR illumination. In Pattern Recognition, 2008. ICPR 2008. 19th International Conference on. IEEE, 1–4.
  • D’Angelo and Gergle (2016) Sarah D’Angelo and Darren Gergle. 2016. Gazed and confused: Understanding and designing shared gaze for remote collaboration. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, 2492–2496.
  • D’Angelo and Gergle (2018) Sarah D’Angelo and Darren Gergle. 2018. An Eye For Design: Gaze Visualizations for Remote Collaborative Work. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 349.
  • Dickie et al. (2004) Connor Dickie, Roel Vertegaal, Jeffrey S Shell, Changuk Sohn, Daniel Cheng, and Omar Aoudeh. 2004. Eye contact sensing glasses for attention-sensitive wearable video blogging. In CHI’04 extended abstracts on Human factors in computing systems. ACM, 769–770.
  • Esteves et al. (2015) Augusto Esteves, Eduardo Velloso, Andreas Bulling, and Hans Gellersen. 2015. Orbits: Enabling Gaze Interaction in Smart Watches using Moving Targets. In Proc. ACM Symposium on User Interface Software and Technology (UIST). 457–466. https://doi.org/10.1145/2807442.2807499
  • Faber et al. (2017) Myrthe Faber, Robert Bixler, and Sidney K D’Mello. 2017. An automated behavioral measure of mind wandering during computerized reading. Behavior Research Methods (2017), 1–17. https://doi.org/10.3758/s13428-017-0857-y.
  • Hansen and Ji (2010) Dan Witzner Hansen and Qiang Ji. 2010. In the eye of the beholder: A survey of models for eyes and gaze. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 3 (2010), 478–500.
  • Higuch et al. (2016) Keita Higuch, Ryo Yonetani, and Yoichi Sato. 2016. Can Eye Help You?: Effects of Visualizing Eye Fixations on Remote Collaboration Scenarios for Physical Tasks. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, 5180–5190.
  • Holzman et al. (1974) Philip S Holzman, Leonard R Proctor, Deborah L Levy, Nicholas J Yasillo, Herbert Y Meltzer, and Stephen W Hurt. 1974. Eye-tracking dysfunctions in schizophrenic patients and their relatives. Archives of general psychiatry 31, 2 (1974), 143–151.
  • Hoppe et al. (2018) Sabrina Hoppe, Tobias Loetscher, Stephanie A Morey, and Andreas Bulling. 2018. Eye movements during everyday behavior predict personality traits. Frontiers in Human Neuroscience 12 (2018), 105. https://doi.org/10.3389/fnhum.2018.00105
  • Huang et al. (2016a) Michael Xuelin Huang, Tiffany CK Kwok, Grace Ngai, Stephen CF Chan, and Hong Va Leong. 2016a. Building a personalized, auto-calibrating eye tracker from user interactions. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, 5169–5179.
  • Huang et al. (2016b) Michael Xuelin Huang, Jiajia Li, Grace Ngai, and Hong Va Leong. 2016b. StressClick: Sensing Stress from Gaze-Click Patterns. In Proceedings of the 2016 ACM on Multimedia Conference. ACM, 1395–1404.
  • Huang et al. (2017) Michael Xuelin Huang, Jiajia Li, Grace Ngai, and Hong Va Leong. 2017. Screenglint: Practical, in-situ gaze estimation on smartphones. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, 2546–2557.
  • Huang et al. (2015) Qiong Huang, Ashok Veeraraghavan, and Ashutosh Sabharwal. 2015. TabletGaze: unconstrained appearance-based gaze estimation in mobile tablets. arXiv preprint arXiv:1508.01244 (2015).
  • Hutton et al. (1984) J Thomas Hutton, JA Nagel, and Ruth B Loewenson. 1984. Eye tracking dysfunction in Alzheimer-type dementia. Neurology 34, 1 (1984), 99–99.
  • Ishikawa et al. (2004) Takahiro Ishikawa, Simon Baker, Iain Matthews, and Takeo Kanade. 2004. Passive driver gaze tracking with active appearance models. In Proceedings of the 11th world congress on intelligent transportation systems, Vol. 3. 41–43.
  • Khamis et al. (2018) Mohamed Khamis, Florian Alt, and Andreas Bulling. 2018. The Past, Present, and Future of Gaze-enabled Handheld Mobile Devices: Survey and Lessons Learned. In Proc. International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI). 38:1–38:17. https://doi.org/10.1145/3229434.3229452
  • Khamis et al. (2016) Mohamed Khamis, Florian Alt, Mariam Hassib, Emanuel von Zezschwitz, Regina Hasholzner, and Andreas Bulling. 2016. GazeTouchPass: Multimodal Authentication Using Gaze and Touch on Mobile Devices. In Ext. Abstr. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI). 2156–2164. https://doi.org/10.1145/2851581.2892314
  • King (2009) Davis E. King. 2009. Dlib-ml: A Machine Learning Toolkit. Journal of Machine Learning Research 10 (2009), 1755–1758.
  • Kosch et al. (2018) Thomas Kosch, Mariam Hassib, Pawel W Wozniak, Daniel Buschek, and Florian Alt. 2018. Your Eyes Tell: Leveraging Smooth Pursuit for Assessing Cognitive Workload. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 436.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
  • Kuechenmeister et al. (1977) Craig A Kuechenmeister, Patrick H Linton, Thelma V Mueller, and Hilton B White. 1977. Eye tracking in relation to age, sex, and illness. Archives of General Psychiatry 34, 5 (1977), 578–579.
  • Kurauchi et al. (2016) Andrew Kurauchi, Wenxin Feng, Ajjen Joshi, Carlos Morimoto, and Margrit Betke. 2016. EyeSwipe: Dwell-free text entry using gaze paths. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, 1952–1956.
  • Lagun et al. (2014) Dmitry Lagun, Chih-Hung Hsieh, Dale Webster, and Vidhya Navalpakkam. 2014. Towards better measurement of attention and satisfaction in mobile search. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval. ACM, 113–122.
  • Lepetit et al. (2009) Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. 2009. Epnp: An accurate o (n) solution to the pnp problem. International journal of computer vision 81, 2 (2009), 155.
  • Li et al. (2017) Yixuan Li, Pingmei Xu, Dmitry Lagun, and Vidhya Navalpakkam. 2017. Towards measuring and inferring user interest from gaze. In Proceedings of the 26th International Conference on World Wide Web Companion. International World Wide Web Conferences Steering Committee, 525–533.
  • Majaranta et al. (2009) Päivi Majaranta, Ulla-Kaija Ahola, and Oleg Špakov. 2009. Fast gaze typing with an adjustable dwell time. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 357–360.
  • Majaranta and Bulling (2014) Päivi Majaranta and Andreas Bulling. 2014. Eye tracking and eye-based human–computer interaction. In Advances in physiological computing. Springer, 39–65.
  • Mardanbegi et al. (2012) Diako Mardanbegi, Dan Witzner Hansen, and Thomas Pederson. 2012. Eye-based head gestures. In Proceedings of the symposium on eye tracking research and applications. ACM, 139–146.
  • Matthews et al. (1991) G Matthews, W Middleton, B Gilmartin, and MA Bullimore. 1991. Pupillary diameter and cognitive load. Journal of Psychophysiology (1991).
  • Mora et al. (2014) Kenneth Alberto Funes Mora, Florent Monay, and Jean-Marc Odobez. 2014. Eyediap: A database for the development and evaluation of gaze estimation algorithms from rgb and rgb-d cameras. In Proceedings of the Symposium on Eye Tracking Research and Applications. ACM, 255–258.
  • Mott et al. (2017) Martez E Mott, Shane Williams, Jacob O Wobbrock, and Meredith Ringel Morris. 2017. Improving dwell-based gaze typing with dynamic, cascading dwell times. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, 2558–2570.
  • Müller et al. (2018a) Philipp Müller, Michael Xuelin Huang, and Andreas Bulling. 2018a. Detecting Low Rapport During Natural Interactions in Small Groups from Non-Verbal Behaviour. In 23rd International Conference on Intelligent User Interfaces. ACM, 153–164.
  • Müller et al. (2018b) Philipp Müller, Michael Xuelin Huang, Xucong Zhang, and Andreas Bulling. 2018b. Robust Eye Contact Detection in Natural Multi-Person Interactions Using Gaze and Speaking Behaviour. In Proc. International Symposium on Eye Tracking Research and Applications (ETRA). 31:1–31:10. https://doi.org/10.1145/3204493.3204549
  • Newn et al. (2018) Joshua Newn, Fraser Allison, Eduardo Velloso, and Frank Vetere. 2018. Looks can be deceiving: Using gaze visualisation to predict and mislead opponents in strategic gameplay. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 261.
  • Nguyen and Liu (2016) Cuong Nguyen and Feng Liu. 2016. Gaze-based Notetaking for Learning from Lecture Videos. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, 2093–2097.
  • Otsuki et al. (2017) Mai Otsuki, Taiki Kawano, Keita Maruyama, Hideaki Kuzuoka, and Yusuke Suzuki. 2017. ThirdEye: Simple Add-on Display to Represent Remote Participant’s Gaze Direction in Video Communication. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, 5307–5312.
  • Palinko et al. (2010) Oskar Palinko, Andrew L Kun, Alexander Shyrokov, and Peter Heeman. 2010. Estimating cognitive load using remote eye tracking in a driving simulator. In Proceedings of the 2010 symposium on eye-tracking research & applications. ACM, 141–144.
  • Park et al. (2018) Seonwook Park, Xucong Zhang, Andreas Bulling, and Otmar Hilliges. 2018. Learning to Find Eye Region Landmarks for Remote Gaze Estimation in Unconstrained Settings. In Proc. International Symposium on Eye Tracking Research and Applications (ETRA). 21:1–21:10. https://doi.org/10.1145/3204493.3204545
  • Piumsomboon et al. (2017) Thammathip Piumsomboon, Gun Lee, Robert W Lindeman, and Mark Billinghurst. 2017. Exploring natural eye-gaze-based interaction for immersive virtual reality. In 3D User Interfaces (3DUI), 2017 IEEE Symposium on. IEEE, 36–39.
  • Rodrigues et al. (2010) Rui Rodrigues, Joao P. Barreto, and Urbano Nunes. 2010. Camera pose estimation using images of planar mirror reflections. In Proceedings of the 11th European Conference on Computer Vision. 382–395.
  • Sammaknejad et al. (2017) Negar Sammaknejad, Hamidreza Pouretemad, Changiz Eslahchi, Alireza Salahirad, and Ashkan Alinejad. 2017. Gender classification based on eye movements: A processing effect during passive face viewing. Advances in cognitive psychology 13, 3 (2017), 232.
  • Sattar et al. (2015) Hosnieh Sattar, Sabine Müller, Mario Fritz, and Andreas Bulling. 2015. Prediction of Search Targets From Fixations in Open-world Settings. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 981–990. https://doi.org/10.1109/CVPR.2015.7298700
  • Schenk et al. (2017) Simon Schenk, Marc Dreiser, Gerhard Rigoll, and Michael Dorr. 2017. GazeEverywhere: Enabling Gaze-only User Interaction on an Unmodified Desktop PC in Everyday Scenarios. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, 3034–3044.
  • Shrivastava et al. (2017) Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Josh Susskind, Wenda Wang, and Russ Webb. 2017. Learning from Simulated and Unsupervised Images through Adversarial Training. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on.
  • Smith et al. (2013) Brian A Smith, Qi Yin, Steven K Feiner, and Shree K Nayar. 2013. Gaze locking: passive eye contact detection for human-object interaction. In Proceedings of the 26th annual ACM symposium on User interface software and technology. ACM, 271–280.
  • Steil and Bulling (2015) Julian Steil and Andreas Bulling. 2015. Discovery of everyday human activities from long-term visual behaviour using topic models. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 75–85. https://doi.org/10.1145/2750858.2807520
  • Steil et al. (2018) Julian Steil, Philipp Müller, Yusuke Sugano, and Andreas Bulling. 2018. Forecasting User Attention During Everyday Mobile Interactions Using Device-Integrated and Wearable Sensors. In Proc. International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI) (2018-04-16). 1:1–1:13. https://doi.org/10.1145/3229434.3229439
  • Sugano et al. (2016) Yusuke Sugano, Xucong Zhang, and Andreas Bulling. 2016. Aggregaze: Collective estimation of audience attention on public displays. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology. ACM, 821–831.
  • Tan et al. (2002) Kar-Han Tan, David J Kriegman, and Narendra Ahuja. 2002. Appearance-based eye gaze estimation. In Applications of Computer Vision, 2002.(WACV 2002). Proceedings. Sixth IEEE Workshop on. IEEE, 191–195.
  • Tessendorf et al. (2011) Bernd Tessendorf, Andreas Bulling, Daniel Roggen, Thomas Stiefmeier, Manuela Feilner, Peter Derleth, and Gerhard Tröster. 2011. Recognition of Hearing Needs From Body and Eye Movements to Improve Hearing Instruments. In Proc. International Conference on Pervasive Computing (Pervasive). 314–331. https://doi.org/10.1007/978-3-642-21726-5_20
  • Vaitukaitis and Bulling (2012) Vytautas Vaitukaitis and Andreas Bulling. 2012. Eye Gesture Recognition on Portable Devices. In Proc. International Workshop on Pervasive Eye Tracking and Mobile Gaze-Based Interaction (PETMEI). 711–714. https://doi.org/10.1145/2370216.2370370
  • Valenti et al. (2012) Roberto Valenti, Nicu Sebe, and Theo Gevers. 2012. Combining head pose and eye location information for gaze estimation. IEEE Transactions on Image Processing 21, 2 (2012), 802–815.
  • Vertegaal et al. (2003) Roel Vertegaal et al. 2003. Attentive user interfaces. Commun. ACM 46, 3 (2003), 30–33.
  • Vidal et al. (2013) Mélodie Vidal, Andreas Bulling, and Hans Gellersen. 2013. Pursuits: Spontaneous Interaction with Displays based on Smooth Pursuit Eye Movement and Moving Targets. In Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp). 439–448. https://doi.org/10.1145/2468356.2479632
  • Wood et al. (2015) Erroll Wood, Tadas Baltrusaitis, Xucong Zhang, Yusuke Sugano, Peter Robinson, and Andreas Bulling. 2015. Rendering of eyes for eye-shape registration and gaze estimation. In Proceedings of the IEEE International Conference on Computer Vision. 3756–3764.
  • Wood and Bulling (2014) Erroll Wood and Andreas Bulling. 2014. Eyetab: Model-based gaze estimation on unmodified tablet computers. In Proceedings of the Symposium on Eye Tracking Research and Applications. ACM, 207–210.
  • Xu et al. (2016) Pingmei Xu, Yusuke Sugano, and Andreas Bulling. 2016. Spatio-temporal modeling and prediction of visual attention in graphical user interfaces. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, 3299–3310.
  • Yamazoe et al. (2008) Hirotake Yamazoe, Akira Utsumi, Tomoko Yonezawa, and Shinji Abe. 2008. Remote gaze estimation with a single camera based on facial-feature tracking without special calibration actions. In Proceedings of the 2008 symposium on Eye tracking research & applications. ACM, 245–250.
  • Zhang et al. (2018a) Xucong Zhang, Michael Xuelin Huang, Yusuke Sugano, and Andreas Bulling. 2018a. Training Person-Specific Gaze Estimators from Interactions with Multiple Devices. In Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI). 624:1–624:12. https://doi.org/10.1145/3173574.3174198
  • Zhang et al. (2017a) Xiaoyi Zhang, Harish Kulkarni, and Meredith Ringel Morris. 2017a. Smartphone-Based Gaze Gesture Communication for People with Motor Disabilities. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, 2878–2889.
  • Zhang et al. (2017c) Xucong Zhang, Yusuke Sugano, and Andreas Bulling. 2017c. Everyday Eye Contact Detection Using Unsupervised Gaze Target Discovery. In Proc. ACM Symposium on User Interface Software and Technology (UIST). 193–203. https://doi.org/10.1145/3126594.3126614
  • Zhang et al. (2018b) Xucong Zhang, Yusuke Sugano, and Andreas Bulling. 2018b. Revisiting data normalization for appearance-based gaze estimation. In Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications. ACM, 12.
  • Zhang et al. (2017d) Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. 2017d. It’s written all over your face: Full-face appearance-based gaze estimation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2299–2308.
  • Zhang et al. (2018c) Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. 2018c. MPIIGaze: Real-world dataset and deep appearance-based gaze estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018).
  • Zhang et al. (2013) Yanxia Zhang, Andreas Bulling, and Hans Gellersen. 2013. SideWays: A Gaze Interface for Spontaneous Interaction with Situated Displays. In Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI). 851–860. https://doi.org/10.1145/2470654.2470775
  • Zhang et al. (2015) Yanxia Zhang, Ming Ki Chong, Jörg Müller, Andreas Bulling, and Hans Gellersen. 2015. Eye tracking for public displays in the wild. Personal and Ubiquitous Computing 19, 5 (2015), 967–981. https://doi.org/10.1007/s00779-015-0866-8
  • Zhang et al. (2014) Yanxia Zhang, Hans Jörg Müller, Ming Ki Chong, Andreas Bulling, and Hans Gellersen. 2014. GazeHorizon: Enabling Passers-by to Interact with Public Displays by Gaze. In Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp). 559–563. https://doi.org/10.1145/2632048.2636071
  • Zhang et al. (2017b) Yanxia Zhang, Ken Pfeuffer, Ming Ki Chong, Jason Alexander, Andreas Bulling, and Hans Gellersen. 2017b. Look together: using gaze for assisting co-located collaborative search. Personal and Ubiquitous Computing 21, 1 (2017), 173–186.
  • Zhu and Ji (2005) Zhiwei Zhu and Qiang Ji. 2005. Eye gaze tracking under natural head movements. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), Vol. 1. IEEE, 918–923.
  • Zhu et al. (2006) Zhiwei Zhu, Qiang Ji, and Kristin P Bennett. 2006. Nonlinear eye gaze mapping function estimation via support vector regression. In 18th International Conference on Pattern Recognition (ICPR 2006), Vol. 1. IEEE, 1132–1135.