PureGaze: Purifying Gaze Feature for Generalizable Gaze Estimation

03/24/2021 · Yihua Cheng et al., Beihang University

Gaze estimation methods learn eye gaze from facial features. However, among the rich information in a facial image, the truly gaze-relevant features correspond only to subtle changes in the eye region, while gaze-irrelevant features such as illumination, personal appearance and even facial expression may affect the learning in unexpected ways. This is a major reason why existing methods show significant performance degradation in cross-domain/dataset evaluation. In this paper, we tackle the domain generalization problem in cross-domain gaze estimation for unknown target domains. Specifically, we realize domain generalization by gaze feature purification: we eliminate gaze-irrelevant factors such as illumination and identity to improve cross-dataset performance without knowing the target dataset. We design a plug-and-play self-adversarial framework for gaze feature purification. The framework enhances not only our baseline but also existing gaze estimation methods, directly and significantly. Our method achieves state-of-the-art performance on different benchmarks. Meanwhile, the purification is easily explainable via visualization.


1 Introduction

Human gaze provides important cues for understanding human cognition [13] and behavior [7]. It enables researchers to gain insights into many areas such as saliency detection [18, 19], virtual reality [21] and first-person video analysis [22]. Recently, appearance-based gaze estimation with deep learning has become a hot topic. These methods leverage convolutional neural networks (CNNs) to estimate gaze from human appearance [12, 20] and achieve accurate performance.

Figure 1: We propose a domain-generalization framework for gaze estimation. Our method is only trained in the source domain and brings improvement in cross-domain performance without knowing the target domains. The key idea of our method is to purify the gaze feature with self-adversarial learning. The visualization result shows gaze-irrelevant factors such as illumination and identity are eliminated from the extracted feature.
Figure 2: Overview of the gaze feature purification. Our goal is to preserve the gaze-relevant feature and eliminate gaze-irrelevant features. Therefore, we define two adversarial tasks, which are to preserve gaze information and to remove general facial image information. By simultaneously optimizing the two tasks, we implicitly purify the gaze feature without explicitly defining the gaze-irrelevant feature.

CNN-based gaze estimation requires a large number of samples for training, but collecting gaze samples is difficult and time-consuming. This challenge can be ignored in a fixed environment, but becomes a bottleneck when gaze estimation is required in a new environment. The changed environment introduces many unexpected factors such as different illumination, and thus degrades the performance of the pre-trained model. Recent methods usually handle the cross-environment problem (we refer to it as the cross-domain problem in the rest of the paper) as a domain adaptation problem. Research aims to adapt the model trained in the source domain to the target domain. Zhang et al. [23] fine-tune the model in target domains with 200 calibration samples. Wang et al. [17] and Kellnhofer et al. [9] propose to use adversarial learning to align the features in the two domains.

In this paper, we pursue a new direction to solve the problem. We propose a domain-generalization method for improving cross-domain performance. Our method does not require any images or labels from target domains, but aims to learn a generalized model in the source domain for any "unseen" target domain. The intrinsic gaze pattern is indeed invariant across domains, but there are domain differences in gaze-irrelevant factors such as illumination and identity. These factors are usually domain-specific and are directly blended into the captured appearance images. This in-depth fusion makes them difficult to eliminate during feature extraction. As a result, the pre-trained model usually learns a joint distribution of gaze and these factors, i.e., it overfits to the source domain and therefore collapses in target domains.

As shown in Fig. 1, the key idea of our method is to purify the gaze feature, i.e., we eliminate gaze-irrelevant factors such as illumination and identity. The purified feature is more generalized than the original feature and naturally brings improvement in cross-domain performance. To be specific, we propose a plug-and-play self-adversarial framework. As shown in Fig. 2, the framework contains two adversarial tasks, which are to preserve gaze information and to remove general facial image information. By simultaneously optimizing the two tasks, we implicitly purify the gaze feature without explicitly defining the gaze-irrelevant feature. We also realize the framework with a practical neural network. As shown in Fig. 3, the two adversarial tasks are respectively approximated as a gaze estimation task and an adversarial reconstruction task. We propose the final PureGaze, which performs the two tasks simultaneously to purify the gaze feature. PureGaze contains a plug-and-play SA-Module, which can be used to enhance existing gaze estimation methods directly and significantly.

The contributions of this work are threefold:

  • We propose a plug-and-play domain-generalization framework for gaze estimation methods. It improves cross-dataset performance without knowing the target dataset or touching any new samples.

  • The domain generalizability comes from our proposed gaze feature purification. We design a self-adversarial framework to purify the gaze features, which eliminates gaze-irrelevant factors such as illumination and identity. The purification is easily explainable via visualization.

  • Our method achieves the state-of-the-art performance in different benchmarks. Our plug-and-play module also enhances existing gaze estimation methods directly and significantly.

2 Related Works

Appearance-based gaze estimation methods aim to infer human gaze from appearance [10]. They usually learn a mapping function that directly maps appearance to gaze. Recently, many methods leverage CNNs to model the mapping function and achieve outstanding performance [16, 17].

CNN-based Gaze Estimation. Zhang et al. propose the first CNN-based gaze estimation method, which estimates gaze from eye images [25]. Later, many similar methods are proposed. For example, Cheng et al. explore the asymmetry between two eyes [4, 5]. Park et al. generate a pictorial gaze representation to handle subject variance [12]. Fischer et al. leverage two VGG networks [14] to process two eye images [8]. Bao et al. [1] propose a self-attention mechanism to fuse two-eye features. Recently, face images have proven effective for gaze estimation. Zhang et al. propose a spatial-weights CNN that uses a self-attention mechanism to weight facial feature maps [26]. Chen et al. leverage dilated convolutions to estimate gaze [2]. Cheng et al. utilize facial images to estimate gaze and refine the estimate with eye images [3].

Cross-domain Gaze Estimation. Cross-domain performance is usually measured by cross-dataset evaluation in the gaze estimation field. Almost all methods perform well within datasets while degrading severely in cross-dataset evaluation [25, 4, 17]. To solve this problem, Zhang et al. [23] fine-tune the pre-trained model in the target domain. Wang et al. [17] and Kellnhofer et al. [9] propose to use classical adversarial learning to align the features of the source and target domains. These methods require data from the target domain, which is not always user-friendly.

3 Overview

3.1 Definition of the Proposed Purification

We first formulate the proposed self-adversarial framework in this section. Without loss of generality, we reformulate the gaze estimation problem as

$\boldsymbol{g} = f_r\big(f_e(\boldsymbol{I})\big),$    (1)

where $f_e$ is a feature extraction function (e.g., a neural network) and $f_r$ is a regression function. We use $\mathcal{Z}$ to denote the extracted feature, i.e., $\mathcal{Z} = f_e(\boldsymbol{I})$.

We slightly abuse the notation $\mathcal{F}$ to represent the set of all features in one image. We can simply divide the whole image feature $\mathcal{F}$ into two subsets, the gaze-relevant feature $\mathcal{F}_{g}$ and the gaze-irrelevant feature $\mathcal{F}_{\bar{g}}$. It is easy to get the following relation:

$\mathcal{F} = \mathcal{F}_{g} \cup \mathcal{F}_{\bar{g}}, \quad \mathcal{F}_{g} \cap \mathcal{F}_{\bar{g}} = \varnothing.$    (2)

Our goal is to find the optimal $f_e$ that extracts a purified feature $\mathcal{Z}$, where $\mathcal{Z}$ does not contain gaze-irrelevant features, i.e., $\mathcal{Z} \cap \mathcal{F}_{\bar{g}} = \varnothing$. Besides, we believe features that have only weak relations with gaze should also be eliminated to improve generalization.

3.2 Self-adversarial Framework

As shown in Fig. 2, we design two adversarial tasks. The first task is to minimize the mutual information (MI) between the image feature and the extracted feature, i.e.,

$\min_{f_e}\ \mathrm{MI}(\mathcal{F};\ \mathcal{Z}).$    (3)

The function $\mathrm{MI}(\cdot;\cdot)$ computes the MI between $\mathcal{F}$ and $\mathcal{Z}$. It indicates the relation between $\mathcal{F}$ and $\mathcal{Z}$, e.g., $\mathrm{MI}(\mathcal{F};\mathcal{Z}) = 0$ if $\mathcal{Z}$ is independent of $\mathcal{F}$. This task means the extracted feature should contain less image information.

The other task is to maximize the MI between the gaze-relevant feature and the extracted feature, i.e.,

$\max_{f_e}\ \mathrm{MI}(\mathcal{F}_{g};\ \mathcal{Z}).$    (4)

This constraint means the extracted feature should contain more gaze-relevant information.

3.3 Learning to Purify in the Framework

We simultaneously optimize Equ. (3) and Equ. (4). In other words, the extracted feature needs to contain more gaze information (Equ. (4)) and less image information (Equ. (3)). The two optimization tasks compose a self-adversarial learning on the extracted feature. During the self-adversarial learning, the gaze-irrelevant feature is eliminated to satisfy Equ. (3) and the gaze-relevant feature is preserved to satisfy Equ. (4). In other words, we purify the extracted feature with the self-adversarial framework.

In addition, Equ. (3) and Equ. (4) implicate a minimax problem over the gaze-relevant feature $\mathcal{F}_{g}$. It is intuitive that the extracted feature will gradually discard some gaze-relevant information to decrease image information, i.e., to satisfy Equ. (3). Meanwhile, to satisfy Equ. (4), the features having only weak relations with gaze will be discarded first.

4 PureGaze

Figure 3: The architecture of PureGaze. It consists of two weight-sharing backbones (the convolutional layers of ResNet-18) for feature extraction, one two-layer MLP (multi-layer perceptron) for gaze estimation, and one SA-Module (N=5) for recovering images.

In the previous section, we proposed two key adversarial tasks, i.e., Equ. (3) and Equ. (4). The two tasks perform self-adversarial learning to purify the extracted feature. In this section, we propose PureGaze based on the two tasks. We realize the two tasks of the self-adversarial framework with two practical tasks, gaze estimation and adversarial reconstruction (the detailed deduction is shown in the Supplementary Material).

4.1 Two Adversarial Tasks

Gaze estimation: We use the gaze estimation task to preserve gaze information in the extracted feature, i.e., Equ. (4). Indeed, the task can be realized with any gaze estimation network. In this paper, we simply divide the gaze estimation network into two subnets, a backbone for extracting features and an MLP for regressing gaze from the feature (Fig. 3(a)). We use a gaze loss function $\mathcal{L}_{gaze}$ such as L1 loss to optimize the two subnets. The two subnets cooperate to preserve gaze information.

Adversarial reconstruction: We use the adversarial reconstruction task to remove general image information from the extracted feature, i.e., Equ. (3). The key idea of this task is that we do not want the feature to contain any image information; as a result, the model should fail to recover the input image from the extracted feature. We first propose an SA-Module for reconstruction. The architecture of the SA-Module is shown in Fig. 3(c). It contains a block for upsampling and a convolution layer to align the channels with the image's. The network architecture for adversarial reconstruction is shown in Fig. 3(b). We use a backbone for feature extraction and our SA-Module for recovering images. Meanwhile, to achieve adversarial reconstruction, we assign an adversarial task to the backbone and the SA-Module. The SA-Module tries to recover images; we use a reconstruction loss function $\mathcal{L}_{rec}$ such as pixel-wise MSE loss to optimize it. The backbone tries to prevent the reconstruction; we use an adversarial loss $\mathcal{L}_{adv}$ to optimize it, where

$\mathcal{L}_{adv} = -\mathcal{L}_{rec}.$    (5)

The backbone and the SA-Module are thus adversarial with respect to the reconstruction, i.e., the backbone removes general image information from the extracted feature via adversarial learning.
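The following PyTorch sketch illustrates the adversarial relationship described above; the toy module definitions, tensor sizes and loss wiring are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the backbone and the SA-Module.
backbone = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU())
sa_module = nn.Sequential(nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
                          nn.Conv2d(16, 3, 3, padding=1))

image = torch.rand(4, 3, 224, 224)   # dummy batch
feature = backbone(image)            # extracted feature Z
recon = sa_module(feature)           # attempted reconstruction

loss_rec = nn.functional.mse_loss(recon, image)  # optimizes the SA-Module (recover the image)
loss_adv = -loss_rec                             # Equ. (5): optimizes the backbone (prevent recovery)
```

In training, loss_rec updates only the SA-Module parameters while loss_adv updates only the backbone, so the two modules compete over how much image information survives in the feature.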

4.2 Architecture of PureGaze

The architecture of PureGaze is shown in the left part of Fig. 3. We build two networks, one for gaze estimation and one for adversarial reconstruction, with the same backbone. The weights of the two backbones are shared; in other words, the two networks use the same backbone for feature extraction.

PureGaze thus contains three components: the backbone for feature extraction, the MLP for gaze estimation and the SA-Module for reconstruction. The loss functions of the three parts are

$\mathcal{L}_{backbone} = \alpha\,\mathcal{L}_{gaze} + \beta\,\mathcal{L}_{adv},$    (6)
$\mathcal{L}_{MLP} = \mathcal{L}_{gaze},$    (7)
$\mathcal{L}_{SA} = \mathcal{L}_{rec},$    (8)

where $\alpha$ and $\beta$ are hyper-parameters.

In this paper, we use L1 loss for gaze estimation and pixel-wise MSE for reconstruction:

$\mathcal{L}_{gaze} = \big\lVert \boldsymbol{g} - \hat{\boldsymbol{g}} \big\rVert_{1},$    (9)
$\mathcal{L}_{rec} = \frac{1}{HW}\sum_{p}\big(\boldsymbol{I}(p) - \hat{\boldsymbol{I}}(p)\big)^{2},$    (10)

where $\boldsymbol{g}$ and $\hat{\boldsymbol{g}}$ are the ground-truth and estimated gaze, and $\boldsymbol{I}(p)$ and $\hat{\boldsymbol{I}}(p)$ are the pixel values of the original and reconstructed images.
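As a concrete reference, here is a minimal PyTorch sketch of the losses in Equ. (6)–(10); the hyper-parameter values are placeholders, not the ones used in the paper.

```python
import torch

def gaze_loss(pred, gt):
    # Equ. (9): L1 loss between predicted and ground-truth gaze.
    return torch.mean(torch.abs(pred - gt))

def rec_loss(recon, image):
    # Equ. (10): pixel-wise MSE between reconstructed and original images.
    return torch.mean((recon - image) ** 2)

alpha, beta = 1.0, 1.0   # illustrative hyper-parameters only

def backbone_loss(pred, gt, recon, image):
    # Equ. (6): gaze term plus the adversarial term L_adv = -L_rec (Equ. 5).
    return alpha * gaze_loss(pred, gt) - beta * rec_loss(recon, image)

# Equ. (7): the MLP is trained with gaze_loss alone.
# Equ. (8): the SA-Module is trained with rec_loss alone.
```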

4.3 Purifying Feature in Training.

In this part, we explain the purification in PureGaze for a deeper understanding of our method.

PureGaze uses the backbone to extract the feature. The backbone has two goals: minimizing $\mathcal{L}_{gaze}$ and minimizing $\mathcal{L}_{adv}$. Minimizing $\mathcal{L}_{gaze}$ means the backbone should extract the gaze feature, while minimizing $\mathcal{L}_{adv}$ means the backbone should not extract image features. The two goals are not cooperative but adversarial. This self-adversarial learning in the backbone purifies the extracted feature. In addition, $\mathcal{L}_{adv}$ could easily be satisfied by learning a locally optimal solution that merely cheats the SA-Module; the adversarial task between the backbone and the SA-Module avoids this local optimum.

4.4 Two Auxiliary Modules

Local Purification Loss: It is intuitive that the eye region is more important than other facial regions for gaze estimation. Therefore, we want the network to pay more attention to purifying the feature of the eye region. One simple solution is to directly use eye images as input. However, we believe this is unreasonable since other facial regions also provide useful information for gaze estimation.

We propose to leverage an attention map to focus the purification on a local region. The attention map is only applied to $\mathcal{L}_{adv}$, i.e., the loss function of the backbone is modified as

$\mathcal{L}_{backbone} = \alpha\,\mathcal{L}_{gaze} + \beta\,\mathcal{L}_{adv}\big(\boldsymbol{W}\odot\boldsymbol{I},\ \boldsymbol{W}\odot\hat{\boldsymbol{I}}\big),$    (11)

where $\boldsymbol{W}$ is the attention map and $\odot$ denotes element-wise multiplication. The modification assigns different weights to image regions, restricting the task of removing general image information. The restricted task and the gaze estimation task therefore compose a new, local self-adversarial purification. In addition, one advantage of this modification is that the attention map does not affect the gaze estimation network.

In this paper, we use a mixture of Gaussian distributions to generate the attention map. We use the two eye centers' coordinates as the means, and the variance of the distribution can be customized.
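A small NumPy sketch of one way to build such a map is given below; the image size, eye-center coordinates and normalization are illustrative assumptions.

```python
import numpy as np

def attention_map(h, w, eye_centers, sigma=20.0):
    """Mixed-Gaussian attention map: one 2D Gaussian per eye center (the means),
    with a user-chosen spread `sigma` in pixels."""
    ys, xs = np.mgrid[0:h, 0:w]
    amap = np.zeros((h, w), dtype=np.float32)
    for cx, cy in eye_centers:
        amap += np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return amap / amap.max()   # normalize to [0, 1]

# Usage: weight the pixel-wise adversarial term of Equ. (11) with this map.
W = attention_map(224, 224, eye_centers=[(80, 100), (144, 100)], sigma=20.0)
```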

Truncated Adversarial Loss: The adversarial reconstruction task plays an important role in PureGaze. It ensures that the extracted feature contains less image information. In PureGaze, the loss function $\mathcal{L}_{adv}$ is used to prevent the reconstruction. A smaller value of $\mathcal{L}_{adv}$ indicates a larger pixel difference between the generated and original images. However, we think it is redundant to produce a very large pixel difference. The reason is that $\mathcal{L}_{adv}$ is designed to prevent the SA-Module from recovering the original image, not to produce an "inverse" version of the original image.

Therefore, we leverage a threshold $\epsilon$ to truncate the adversarial loss $\mathcal{L}_{adv}$. More concretely, $\mathcal{L}_{adv}$ is set to zero if the pixel difference is larger than $\epsilon$. The final loss function of the backbone is

$\mathcal{L}_{backbone} = \alpha\,\mathcal{L}_{gaze} + \beta\,\mathbb{1}\big[\mathcal{L}_{rec} \le \epsilon\big]\,\mathcal{L}_{adv}\big(\boldsymbol{W}\odot\boldsymbol{I},\ \boldsymbol{W}\odot\hat{\boldsymbol{I}}\big),$    (12)

where $\mathbb{1}[\cdot]$ is the indicator function and $\epsilon$ is the threshold.
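A hedged PyTorch sketch of this truncation is shown below; the default threshold value and the exact form of the weighted pixel difference are assumptions for illustration.

```python
import torch

def truncated_adv_loss(recon, image, weight, eps=0.75):
    """Sketch of Equ. (12)'s adversarial term: once the (attention-weighted)
    pixel difference already exceeds the threshold eps, the term is zeroed."""
    pixel_diff = torch.mean(weight * (recon - image) ** 2)
    indicator = (pixel_diff <= eps).float()   # 1 only while reconstruction is still too good
    return -indicator * pixel_diff            # negated, truncated reconstruction loss
```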

Methods G→M G→D E→M E→D
RT-Gene [8] - -
Dilated-Net [2] - -
Full-Face [26]
CA-Net [3] - -
PinBall [9]
ETH [24] - -
Modified-adv - -
Baseline (ours)
PureGaze (ours)
Table 1: Cross-dataset performance. Our method shows the state-of-the-art performance in all four tasks.

4.5 Implementation Detail

Indeed, the backbone, MLP and SA-Module are arbitrary; we only need to ensure that the SA-Module outputs an image of the same size as the input image (simply change the parameter N in Fig. 3(c)). In this paper, we use the convolutional part of ResNet-18 as the backbone and a two-layer MLP for gaze estimation. As for the SA-Module, we use a five-block SA-Module (N=5). The numbers of feature maps are 256, 128, 64, 32 and 16 for each block (from bottom to top) and 3 for the last convolutional layer. We use Adam for optimization with the same learning rate for all three networks. We empirically set the hyper-parameters $\alpha$ and $\beta$; the threshold $\epsilon$ is set to 0.75, and the variance $\sigma$ of the attention map is 20 pixels.
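The sketch below shows one way such an SA-Module could be assembled from the stated channel counts; the use of bilinear upsampling, batch normalization and a 512-channel ResNet-18 input are assumptions.

```python
import torch.nn as nn

class SAModule(nn.Module):
    """N=5 SA-Module sketch: each block upsamples x2 and reduces channels
    (256 -> 128 -> 64 -> 32 -> 16); a final convolution maps to 3 image channels."""
    def __init__(self, in_channels=512, channels=(256, 128, 64, 32, 16)):
        super().__init__()
        layers, prev = [], in_channels
        for c in channels:
            layers += [nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
                       nn.Conv2d(prev, c, kernel_size=3, padding=1),
                       nn.BatchNorm2d(c),
                       nn.ReLU(inplace=True)]
            prev = c
        self.blocks = nn.Sequential(*layers)
        self.to_rgb = nn.Conv2d(prev, 3, kernel_size=3, padding=1)

    def forward(self, feature):
        return self.to_rgb(self.blocks(feature))

# With a ResNet-18 backbone on 224x224 inputs, the 512x7x7 feature map is
# upsampled five times back to a 3x224x224 image.
```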

Methods E→M E→D G→M G→D
Full-Face [26]
Full-Face+SA (ours)
CA-Net [3] - -
CA-Net+SA (ours) - -
Baseline (ours)
PureGaze (ours)
Table 2: We apply the self-adversarial framework into other advanced gaze estimation methods. Our framework directly enhances existing gaze estimation methods.
(a) Purified feature: less identity & more accurate gaze information.
(b) Purified feature: eliminated illumination & enhanced gaze information.
Figure 4: We visualize the purified feature and the original feature via reconstruction. a) The purified feature contains less identity information than the original feature. Besides, it is interesting that the head rest is captured and blended into the original images of the first and fourth columns; the original feature also contains the head-rest information, while our method eliminates it. b) Our method eliminates the illumination factor. We also manually augment the original image in the fourth row. The result shows that our method accurately captures the gaze information within the dashed area. Note that the gaze information is not only preserved but also enhanced: the purified feature captures more accurate gaze information than the original feature.

5 Experiments

5.1 Data-preprocessing

Task definitions. We select two datasets, Gaze360 [9] and ETH-XGaze [24], as training sets, since they have a large number of subjects and a wide range of gaze directions and head poses. We test our model on two popular datasets, MPIIGaze [26] and EyeDiap [11]. In total, we conduct four cross-dataset tasks, denoted as E (ETH-XGaze) → M (MPIIGaze), E → D (EyeDiap), G (Gaze360) → M, and G → D.

Data Preparing. Gaze360 [9] contains a total of 172K images from 238 subjects. Note that some of the images in Gaze360 only capture the back of the subject's head; these images are not suitable for appearance-based methods. Therefore, we first clean the dataset with a simple rule: we remove the images without face detection results, based on the provided face detection annotation. ETH-XGaze [24] contains a total of 1.1M images from 110 subjects. It provides a training set containing 80 subjects; we split off 5 subjects for validation and use the others for training. MPIIGaze [26] is prepared based on the standard protocol; we collect a total of 45K images from 15 subjects. EyeDiap [11] provides a total of 94 video clips from 16 subjects. We follow the common steps to prepare the data as in [26, 3]. Concretely, we select the VGA videos of the screen-target sessions and sample one image every fifteen frames. We also truncate the data to ensure that the number of images from every subject is the same.

Data rectification. Data rectification is performed to shrink the data space and simplify the gaze estimation task. Sugano et al. propose a rectification method to alleviate the effect caused by camera pose and to align cropped images via head pose [15]. We follow this method to process MPIIGaze and EyeDiap. ETH-XGaze was already rectified before publication, so we use the provided data for experiments. Gaze360 only rectifies gaze directions to cancel the effect caused by camera pose; we directly use the provided data since reliable head pose annotation is not available.

5.2 Comparison Methods

Baseline: We remove the SA-Module from PureGaze and denote the resulting network as the Baseline. The performance difference between PureGaze and the Baseline is therefore caused by the SA-Module. We denote the feature extracted by the Baseline as the original feature, and the feature extracted by PureGaze as the purified feature.

SOTA Methods: We use five methods for comparison: Full-Face [26], RT-Gene [8], Dilated-Net [2], CA-Net [3], and PinBall [9]. These methods all perform well on many benchmarks. In particular, CA-Net [3] and PinBall [9] respectively hold the SOTA performance on different benchmarks. We implement Full-Face and Dilated-Net in PyTorch, and use the official code of RT-Gene, CA-Net and PinBall for experiments.

Another plug-and-play method: In addition, we modify conventional adversarial learning (Modified-adv) [17, 9] for a fair comparison. Conventional adversarial learning uses a discriminator to align the features of the source and target domains. We instead use a discriminator to align personal features within the source domain, i.e., we build a discriminator that classifies subjects from the extracted feature. Note that our method and Modified-adv are both plug-and-play methods; we use the same backbone for a fair comparison in the following experiments.

5.3 Cross-dataset Evaluation

We first conduct experiments on the four cross-dataset tasks. The experimental results are shown in Tab. 1. Note that Dilated-Net, CA-Net and RT-Gene are not suitable for ETH-XGaze, since ETH-XGaze cannot always provide reliable eye images. Besides, the ETH-XGaze dataset employs an off-the-shelf ResNet-50 as its baseline (denoted as ETH); we download the official pre-trained model and evaluate it on MPIIGaze and EyeDiap for reference. For a fair comparison, we replace the backbone in our method and Modified-adv with ResNet-50 when training on ETH-XGaze.

Our method clearly achieves state-of-the-art performance in all four tasks. The reason is that our method learns to extract more generalized and purified features, which decreases the differences between datasets. RT-Gene has large cross-dataset errors since it estimates gaze only from eye images. CA-Net shows state-of-the-art performance in within-dataset evaluation [3], but performs worst in cross-dataset evaluation.

It is interesting that Modified-adv does not bring any performance improvement; it even makes the performance worse. This is reasonable since the training set contains many subjects, and the training process already implicitly aligns different subjects' features, i.e., the model can perform well for all subjects in the training set. This also highlights the advantage of our method: the key idea is purifying the feature rather than aligning it, which ensures that our method extracts a generalized, purified feature.

5.4 Plug Existing Gaze Estimation Methods

As a plug-and-play framework, our self-adversarial framework can easily be applied to other gaze estimation methods. In this section, we show the plug-and-play performance by applying our framework to Full-Face [26] and CA-Net [3]. To do so, we feed their final facial feature maps into the SA-Module and simply add the two loss functions $\mathcal{L}_{rec}$ and $\mathcal{L}_{adv}$. We use the suffix +SA to represent the modified methods, and sketch the wrapping below.
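A minimal sketch of this wrapping in PyTorch follows; the interface of the host network (returning both gaze and its facial feature maps) is an assumption made for illustration.

```python
import torch.nn as nn

class WithSA(nn.Module):
    """Wrap an existing gaze network with the SA-Module: the host's final facial
    feature maps are reconstructed, and L_rec / L_adv are added during training."""
    def __init__(self, host, sa_module):
        super().__init__()
        self.host = host      # hypothetical module returning (gaze, feature_maps)
        self.sa = sa_module   # reconstruction head

    def forward(self, image):
        gaze, feat = self.host(image)
        recon = self.sa(feat)
        return gaze, recon

# Training adds L_rec for the SA-Module and L_adv = -L_rec for the host's
# feature extractor on top of the host's original gaze loss.
```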

The results are shown in Tab. 2. It is interesting that CA-Net has the worst performance in the G→M task, yet achieves the best performance after applying our framework. Its performance in the G→D task is also clearly improved. These results show the effectiveness of our self-adversarial framework. In addition, Full-Face shows relatively good performance in the E→M, G→M and G→D tasks, and our framework further improves its performance in all three.

Overall, our framework brings performance improvement in all cases without additional training data or inference parameters. It is a key advantage of our method.

(a) Pre-trained on ETH
(b) Pre-trained on ETH
(c) Pre-trained on Gaze360
(d) Pre-trained on Gaze360
Figure 5: We visualize the extracted feature on different datasets via t-SNE. We separately sample 1,000 images from the ETH-XGaze, Gaze360, MPIIGaze and EyeDiap datasets, and use pre-trained models to extract features. It is obvious that our method shortens the distance between different datasets.

5.5 Visualize Extracted Feature via Reconstruction

To verify the key idea of gaze feature purification, we visualize the purified features for further understanding. We provide reconstruction results of the purified features and the original features for comparison. For the purified feature, we directly show the output of the SA-Module. For the original feature, we freeze the parameters of the pre-trained baseline and simply train an SA-Module to reconstruct images from it.
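The following sketch shows how such a visualization decoder could be fitted on top of a frozen backbone; the optimizer setting and data loader are placeholders.

```python
import torch

def fit_visualization_decoder(backbone, sa_module, loader, lr=1e-4, epochs=1):
    """Freeze a pre-trained backbone and train only an SA-Module to reconstruct
    images from its (unpurified) features, with no adversarial term."""
    for p in backbone.parameters():
        p.requires_grad = False
    optimizer = torch.optim.Adam(sa_module.parameters(), lr=lr)
    for _ in range(epochs):
        for image, _label in loader:
            with torch.no_grad():
                feat = backbone(image)
            recon = sa_module(feat)
            loss = torch.mean((recon - image) ** 2)   # plain reconstruction loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return sa_module
```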

According to the visualization results shown in Fig. 4, we can draw the following conclusions:

  • The purified feature contains less identity information than original feature. The reconstructed face appearances are approximately the same for each subject.

  • Our method eliminates the illumination factor in the captured images. Besides, it is interesting that our method also accurately recovers a bright gaze region from low-light images. This means our method can effectively extract gaze information within the dashed area.

  • Besides the illumination and identity factors, our method also eliminates other gaze-irrelevant features such as the head rest in Fig. 4(a). Although we do not specify the eliminated factors, the method still automatically purifies the learned feature with self-adversarial learning. This is another advantage of our method.

  • The gaze information is not only preserved but also enhanced. The purified feature captures more accurate gaze information than the original feature.

Figure 6: We show the performance improvement of PureGaze compared with the baseline. PureGaze improves the performance in extreme illumination conditions. This conclusion also matches the feature visualization results.
(a) Fine-tuning on MPIIGaze
(b) Fine-tuning on EyeDiap
Figure 7: We further fine-tune our PureGaze and the baseline in two target domains to show the advantage of the purified feature. The fine-tuned models are trained on ETH-XGaze. In the fine-tuning stage, we discard the SA-Module in PureGaze, i.e., PureGaze and the baseline have the same architecture but different weights. It is obvious that the purified feature always leads to more accurate results.

5.6 Visualize Feature Space via t-SNE [6]

Our method improves cross-dataset performance without knowing the target datasets. In this section, we qualitatively analyze the learned feature space across different datasets to show the advantage of our method. We separately sample 1,000 images from each dataset and visualize their feature embeddings via t-SNE. The result is shown in Fig. 5. We show the feature space of the baseline in the first column of Fig. 5 and ours in the second column.

According to Fig. 5, our method learns a more compact feature space than the baseline. This is because our method eliminates some gaze-irrelevant features, which are usually dataset-specific. In other words, our method reduces the differences between datasets without touching them.
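For reference, a visualization of this kind can be produced with a short scikit-learn/matplotlib script like the sketch below; pooling the backbone features into vectors and the t-SNE settings are assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_feature_space(features_by_dataset):
    """Embed pooled backbone features from several datasets with t-SNE and
    color the points by dataset, in the spirit of Fig. 5.
    `features_by_dataset` maps a dataset name to an (N, D) feature array."""
    names = list(features_by_dataset)
    feats = np.concatenate([features_by_dataset[n] for n in names], axis=0)
    labels = np.concatenate([[i] * len(features_by_dataset[n]) for i, n in enumerate(names)])
    emb = TSNE(n_components=2, perplexity=30, init='pca').fit_transform(feats)
    for i, n in enumerate(names):
        plt.scatter(emb[labels == i, 0], emb[labels == i, 1], s=4, label=n)
    plt.legend()
    plt.show()
```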

5.7 Gaze Estimation Improvement by Purification

The feature reconstruction experiments show that our method can remove the illumination factor from the extracted features. In this section, we provide the distribution of performance improvement across different illumination intensities for quantitative analysis. We train the model on ETH-XGaze and test it on MPIIGaze because of its rich illumination variation. We first cluster the images into 51 clusters according to their mean intensity, then remove clusters with fewer than 7 images and compute the average accuracy per cluster.
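The sketch below illustrates this analysis with simple equal-width intensity binning as a stand-in for the clustering described above; the array names and binning scheme are assumptions.

```python
import numpy as np

def improvement_by_illumination(images, err_base, err_pure, n_bins=51, min_count=7):
    """Group test images by mean intensity and report the average error
    improvement (baseline minus PureGaze) per group, discarding small groups."""
    intensity = np.array([img.mean() for img in images])          # mean gray level per image
    bins = np.linspace(intensity.min(), intensity.max(), n_bins + 1)
    idx = np.clip(np.digitize(intensity, bins) - 1, 0, n_bins - 1)
    result = {}
    for b in range(n_bins):
        mask = idx == b
        if mask.sum() >= min_count:
            result[b] = float(np.mean(err_base[mask] - err_pure[mask]))
    return result
```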

We illustrate the performance improvement of PureGaze over the baseline in Fig. 6. It is interesting that our method improves the performance in extreme illumination conditions. This is because our method tries to remove the gaze-irrelevant illumination information from the extracted feature, and therefore becomes more robust than the baseline, especially under extreme illumination. These results prove the advantage of the purified feature.

5.8 Evaluate the Potential of Purified Feature.

The purified feature contains less gaze-irrelevant information, which makes it more generalized and reasonable than the original feature. To prove the advantage of the purified feature, we take the PureGaze and baseline models trained on ETH-XGaze and fine-tune them on MPIIGaze and EyeDiap. Note that the SA-Module in PureGaze is removed during fine-tuning for a fair comparison. We randomly select five images per subject for fine-tuning and repeat the whole process five times. The average performance is reported as the final result in Fig. 7.

The model fine-tuned from the purified feature always outperforms the baseline in the two tasks. This shows that the purified feature differs essentially from the original feature: it contains less gaze-irrelevant information and can more easily achieve higher performance.

5.9 Within-dataset Result

Although our method is proposed to solve the cross-dataset problem, we provide the within-dataset performance on four datasets for reference. The result is shown in Tab. 3. Compared with the baseline, PureGaze shows an improvement on EyeDiap. However, it also shows a slight performance decrease on MPIIGaze and Gaze360. This is because PureGaze also eliminates some features with only weak relations to gaze, which improves generalization but brings a slight within-dataset performance decrease. We also summarize the SOTA performance in Tab. 3 for reference.

Methods MPIIGaze EyeDiap Gaze360 ETH-XGaze
Baseline
PureGaze
SOTA  [3]  [3]  [9]  [24]
Table 3: We provide the within-dataset performance of PureGaze and summarize the SOTA performance in datasets.

6 Conclusion

In this paper, we propose a domain-generalization method for gaze estimation. The key idea of our method is gaze feature purification. We propose a plug-and-play self-adversarial framework for purification. The framework contains two adversarial tasks: to preserve gaze information and to remove general image information. By simultaneously optimizing the two tasks, we eliminate the gaze-irrelevant factors without explicitly defining them. We also propose PureGaze, which achieves state-of-the-art performance on several benchmarks. PureGaze contains a plug-and-play SA-Module that is easily added to existing methods to improve their cross-dataset performance.

Appendix A

A.1 Evaluating Two Auxiliary Modules

We propose two auxiliary modules to enhance PureGaze. Both modules have custom parameters, i.e., the variance $\sigma$ and the threshold $\epsilon$. In this section, we provide detailed experiments to evaluate the two modules with different parameters.

We evaluate four values of $\sigma$, as well as the performance without the attention map. The results are shown in Fig. 8(a) and Fig. 8(b); we illustrate the generated attention maps at the top of the two figures. The results show that local purification does not always bring a performance improvement on the Gaze360 dataset. This is probably because of the large camera-user distance in the dataset: the large distance leads to a small and blurred eye region, which carries less gaze information. However, when $\sigma$ is set to 20, our method shows the best performance on both datasets. This proves the effectiveness of the local purification.

We also evaluate four values of $\epsilon$. As shown in Fig. 8(c) and Fig. 8(d), a large $\epsilon$ usually improves the performance.

A.2 Feature Space across Subjects

In the main manuscript, we show that our method learns a compact feature space across datasets. We further illustrate the learned feature space across subjects in this section. The result is shown in Fig. 9. We randomly sample 100 images from each subject, and use the model pre-trained on ETH-XGaze to extract the features. We use t-SNE [6] to visualize the feature space. We respectively visualize the feature spaces of the baseline and of PureGaze for comparison. The result of the baseline is shown in the first column of Fig. 9, and the result of PureGaze is shown in the second column. Again, the only difference between the baseline and PureGaze is whether we add the SA-Module.

According to Fig. 9, the baseline clearly has a more spread-out feature space: the features of one subject are usually clustered together, e.g., the red, green and blue points in Fig. 9(a). This phenomenon is even more obvious in Fig. 9(c). In contrast, our method learns a compact feature space in which the feature embeddings of different subjects are mixed. This demonstrates that our method reduces the differences between the learned features of different subjects.

(a) Training on Gaze360
(b) Training on ETH
(c) Training on Gaze360
(d) Training on ETH
Figure 8: We evaluate the two auxiliary modules in PureGaze. The first row shows the experiments on the local purification loss, and the second row shows the experiments on the truncated adversarial loss. The result is best when $\sigma$ is 20 and $\epsilon$ is 0.75.

A.3 Visualization in Dark Environments

The reconstruction visualizations show that our method eliminates the illumination factor. In this section, we provide more challenging cases to demonstrate this ability.

(a) Result on MPIIGaze (w/o SA)
(b) Result on MPIIGaze (with SA)
(c) Result on EyeDiap (w/o SA)
(d) Result on EyeDiap (with SA)
Figure 9: We visualize the feature space across subjects via tSNE. The first column shows the result of the baseline and the second column shows the result of PureGaze. The PureGaze learns a more compact feature space with SA-Module. The result proves our method can reduce the difference between subjects in feature aspect.
Figure 10: We show the visualization results in very dark environments. The first row contains the original images. The second row contains the augmented images; we manually augment these images to clearly display the image content. The third row contains the reconstruction results of the purified feature. It is obvious that our method can eliminate the illumination factor in very dark environments and accurately enhance the gaze information.

The original images are shown in the first row of Fig. 10. The subjects in these images are invisible due to the extremely dark environment. Therefore, we manually augment these images and show them in the second row. The reconstruction results of PureGaze are shown in the third row. Even in very dark environments, our method still eliminates the illumination factor and captures gaze-relevant information. In addition, comparing the reconstructed images with the augmented images, it is obvious that our method accurately captures the eye movements.

Figure 11: We perform within-dataset evaluation on ETH-XGaze and count the average accuracy in each illumination interval. Since the illumination factor can be well fitted within datasets, PureGaze and the baseline have similar average accuracy in many intervals. However, in the low-light intervals, PureGaze performs better than the baseline.

A.4 Discussion about PureGaze

Our method learns a generalized and purified gaze feature to improve cross-dataset gaze estimation performance. In this section, we conduct experiments to discuss the difference between PureGaze and common methods. We first train PureGaze and the baseline on ETH-XGaze. We then evaluate the two models on the validation set of ETH-XGaze, recording the gaze error and the illumination intensity of each image. We group the results into intervals of 10 illumination-intensity units and summarize the average accuracy in Fig. 11.

PureGaze and the baseline have similar average accuracy in many intervals because the baseline can fit the illumination factor well within the dataset. In the low-light intervals, PureGaze performs better than the baseline due to the purified feature. This result also indicates the difference between PureGaze and most common gaze estimation methods: most methods fit environmental factors during training, and therefore achieve reasonable within-dataset performance while degrading in cross-dataset experiments. Our PureGaze eliminates the environmental factors and regresses gaze from the purified feature, so it naturally performs better in cross-dataset experiments.

A.5 Deduction from the Framework to PureGaze

In the main manuscript, we propose a self-adversarial framework containing two adversarial tasks:

$\min_{f_e}\ \mathrm{MI}(\mathcal{F};\ \mathcal{Z}),$    (13)

and

$\max_{f_e}\ \mathrm{MI}(\mathcal{F}_{g};\ \mathcal{Z}).$    (14)

Here, we introduce how to deduct the PureGaze from the two formulations.

To realize the framework, we first simplify Equ. (13) and Equ. (14). The mutual information can be further deduced as

$\mathrm{MI}(X;\ \mathcal{Z}) = \mathbb{E}_{p(x,\,z)}\big[\log p(x \mid z)\big] + H(X),$    (15)

where $H(X)$ is the entropy of $X$ and is constant with respect to $f_e$. Substituting Equ. (15) into Equ. (13) and Equ. (14), we get

$\min_{f_e}\ \mathbb{E}\big[\log p(\boldsymbol{I} \mid \mathcal{Z})\big]$    (16)

and

$\max_{f_e}\ \mathbb{E}\big[\log p(\boldsymbol{g} \mid \mathcal{Z})\big].$    (17)

Equ. (16) means we should minimize the probability $p(\boldsymbol{I} \mid \mathcal{Z})$. We approximate this probability with a reconstruction network $f_d$ (the SA-Module). To minimize $p(\boldsymbol{I} \mid \mathcal{Z})$, we want $f_d$ to fail at reconstructing images from the extracted feature; in other words, the feature extraction network $f_e$ should extract a feature that is independent of the image. Certainly, it is easy for the feature extraction network to fool a pre-trained reconstruction network, and it is time-consuming to train a dedicated reconstruction network at each iteration. Thus, $f_e$ and $f_d$ are designed to perform an adversarial reconstruction task: $f_d$ tries to reconstruct images from the feature, while $f_e$ tries to prevent the reconstruction. With the adversarial reconstruction, $f_e$ will discard image information so that $f_d$ cannot reconstruct images, i.e., $p(\boldsymbol{I} \mid \mathcal{Z})$ is minimized.

Equ. (17) means we should maximize the probability $p(\boldsymbol{g} \mid \mathcal{Z})$, i.e., given the extracted feature $\mathcal{Z}$, we should accurately recover gaze information from it. We approximate $p(\boldsymbol{g} \mid \mathcal{Z})$ with a gaze regression network $f_r$, which aims to accurately estimate gaze from the extracted feature.

A.6 Pseudo Code

Figure 12: The pseudo code of PureGaze in a PyTorch style. We use gloss to optimize self.feature and self.MLP, and use saloss to optimize self.sa.

We also provide the pseudo code of the self-adversarial framework (PureGaze) in Fig. 12. The framework can easily be applied to other methods by simply replacing the self.feature module.
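Since only the figure caption is reproduced here, the following is a hedged re-sketch of such a training step in PyTorch style, keeping the module names from the caption (self.feature, self.MLP, self.sa); the pooling, loss weight and optimizer split are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PureGazeSketch(nn.Module):
    def __init__(self, backbone, mlp, sa_module):
        super().__init__()
        self.feature, self.MLP, self.sa = backbone, mlp, sa_module

    def forward(self, image):
        feat = self.feature(image)
        gaze = self.MLP(torch.flatten(F.adaptive_avg_pool2d(feat, 1), 1))
        recon = self.sa(feat)
        return gaze, recon

def train_step(model, image, label, opt_g, opt_sa, beta=1.0):
    # gloss optimizes self.feature and self.MLP (gaze loss + adversarial term).
    gaze, recon = model(image)
    gloss = torch.mean(torch.abs(gaze - label)) - beta * torch.mean((recon - image) ** 2)
    opt_g.zero_grad()
    gloss.backward()
    opt_g.step()

    # saloss optimizes self.sa (plain reconstruction loss) on a fresh forward pass.
    _, recon = model(image)
    saloss = torch.mean((recon - image) ** 2)
    opt_sa.zero_grad()
    saloss.backward()
    opt_sa.step()
```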

References

  • [1] Y. Bao, Y. Cheng, Y. Liu, and F. Lu (2020) Adaptive feature fusion network for gaze tracking in mobile tablets. In The International Conference on Pattern Recognition (ICPR). Cited by: §2.
  • [2] Z. Chen and B. E. Shi (2019) Appearance-based gaze estimation using dilated-convolutions. In Asian Conference on Computer Vision (ACCV), C.V. Jawahar, H. Li, G. Mori, and K. Schindler (Eds.), Cham, pp. 309–324. External Links: ISBN 978-3-030-20876-9. Cited by: §2, Table 1, §5.2.
  • [3] Y. Cheng, S. Huang, F. Wang, C. Qian, and F. Lu (2020) A coarse-to-fine adaptive network for appearance-based gaze estimation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). Cited by: §2, Table 1, Table 2, §5.1, §5.2, §5.3, §5.4, Table 3.
  • [4] Y. Cheng, F. Lu, and X. Zhang (2018-09) Appearance-based gaze estimation via evaluation-guided asymmetric regression. In The European Conference on Computer Vision (ECCV), Cited by: §2, §2.
  • [5] Y. Cheng, X. Zhang, F. Lu, and Y. Sato (2020) Gaze estimation by exploring two-eye asymmetry. IEEE Transactions on Image Processing 29 (), pp. 5259–5272. Cited by: §2.
  • [6] L. van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (11). Cited by: §5.6.
  • [7] P. A. Dias, D. Malafronte, H. Medeiros, and F. Odone (2020-03) Gaze estimation for assisted living environments. In The IEEE Winter Conference on Applications of Computer Vision (WACV), Cited by: §1.
  • [8] T. Fischer, H. J. Chang, and Y. Demiris (2018-09) RT-gene: real-time eye gaze estimation in natural environments. In The European Conference on Computer Vision (ECCV), Cited by: §2, Table 1, §5.2.
  • [9] P. Kellnhofer, A. Recasens, S. Stent, W. Matusik, and A. Torralba (2019-10) Gaze360: physically unconstrained gaze estimation in the wild. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2, Table 1, §5.1, §5.1, §5.2, §5.2, Table 3.
  • [10] F. Lu, Y. Sugano, T. Okabe, and Y. Sato (2014) Adaptive linear regression for appearance-based gaze estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (10), pp. 2033–2046. Cited by: §2.
  • [11] K. A. F. Mora, F. Monay, and J. Odobez (2014-03) EYEDIAP: a database for the development and evaluation of gaze estimation algorithms from rgb and rgb-d cameras. In Proceedings of the 2014 ACM Symposium on Eye Tracking Research & Applications, External Links: Document Cited by: §5.1, §5.1.
  • [12] S. Park, A. Spurr, and O. Hilliges (2018-09) Deep pictorial gaze estimation. In The European Conference on Computer Vision (ECCV), Cited by: §1, §2.
  • [13] R. Rahal and S. Fiedler (2019) Understanding cognitive and affective mechanisms in social psychology through eye-tracking. Journal of Experimental Social Psychology 85, pp. 103842. External Links: ISSN 0022-1031, Document, Link Cited by: §1.
  • [14] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §2.
  • [15] Y. Sugano, Y. Matsushita, and Y. Sato (2014-06) Learning-by-synthesis for appearance-based 3d gaze estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §5.1.
  • [16] K. Wang, H. Su, and Q. Ji (2019-06) Neuro-inspired eye tracking with eye movement dynamics. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [17] K. Wang, R. Zhao, H. Su, and Q. Ji (2019-06) Generalizing eye tracking with bayesian adversarial learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §2, §5.2.
  • [18] W. Wang, J. Shen, X. Dong, A. Borji, and R. Yang (2019) Inferring salient objects from human fixations. IEEE Transactions on Pattern Analysis and Machine Intelligence. External Links: Document, ISSN 1939-3539 Cited by: §1.
  • [19] W. Wang and J. Shen (2018) Deep visual attention prediction. IEEE Transactions on Image Processing 27 (5), pp. 2368–2378. Cited by: §1.
  • [20] Y. Xiong, H. J. Kim, and V. Singh (2019-06) Mixed effects neural networks (menets) with applications to gaze estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [21] Y. Xu, Y. Dong, J. Wu, Z. Sun, Z. Shi, J. Yu, and S. Gao (2018-06) Gaze prediction in dynamic 360 immersive videos. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [22] H. Yu, M. Cai, Y. Liu, and F. Lu (2020) First- and third-person video co-analysis by learning spatial-temporal joint attention. IEEE Transactions on Pattern Analysis and Machine Intelligence (), pp. 1–1. External Links: Document Cited by: §1.
  • [23] X. Zhang, M. X. Huang, Y. Sugano, and A. Bulling (2018) Training person-specific gaze estimators from user interactions with multiple devices. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI ’18, New York, NY, USA. External Links: ISBN 9781450356206, Link, Document Cited by: §1, §2.
  • [24] X. Zhang, S. Park, T. Beeler, D. Bradley, S. Tang, and O. Hilliges (2020) ETH-xgaze: a large scale dataset for gaze estimation under extreme head pose and gaze variation. In The European Conference on Computer Vision (ECCV), Cited by: Table 1, §5.1, §5.1, Table 3.
  • [25] X. Zhang, Y. Sugano, M. Fritz, and A. Bulling (2015-06) Appearance-based gaze estimation in the wild. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §2.
  • [26] X. Zhang, Y. Sugano, M. Fritz, and A. Bulling (2017-07) It’s written all over your face: full-face appearance-based gaze estimation. In The IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 2299–2308. Cited by: §2, Table 1, Table 2, §5.1, §5.1, §5.2, §5.4.