Appearance-Based Gaze Estimation Using Dilated-Convolutions

03/18/2019 ∙ Zhaokang Chen et al. ∙ The Hong Kong University of Science and Technology

Appearance-based gaze estimation has attracted increasing attention because of its wide range of applications, and the use of deep convolutional neural networks has improved its accuracy significantly. To push the estimation accuracy further, we focus on extracting better features from eye images. Relatively large changes in gaze angles may result in relatively small changes in eye appearance. We argue that current architectures for gaze estimation may not be able to capture such small changes, as they apply multiple pooling layers or other downsampling layers so that the spatial resolution of the high-level layers is reduced significantly. To evaluate whether features extracted at high resolution can benefit gaze estimation, we adopt dilated-convolutions to extract high-level features without reducing spatial resolution. In cross-subject experiments on the Columbia Gaze dataset for eye contact detection and the MPIIGaze dataset for 3D gaze vector regression, the resulting Dilated-Nets achieve significant (up to 20.8%) gains over similar networks without dilated-convolutions. Our proposed Dilated-Net achieves state-of-the-art results on both the Columbia Gaze and the MPIIGaze datasets.


1 Introduction

Gaze tracking has long been an important research topic, as it has many promising real-world applications, such as gaze-based interfaces [20, 5], foveated rendering in virtual reality [19], behavioral analysis [11] and human-robot interaction [12]. Early gaze tracking techniques imposed strong constraints, e.g., requiring the user to face the tracker frontally and to keep the eyes inside a certain region, which limited their use to relatively controlled environments. To apply gaze tracking in more flexible, real-world environments, researchers have proposed many methods to relax these constraints and move towards unconstrained gaze tracking, e.g., [22, 27, 36, 1, 21, 6, 28].

Unconstrained gaze tracking refers to calibration-free, subject-, viewpoint- and illumination-independent gaze tracking [21]. Appearance-based gaze estimation is a promising approach to unconstrained gaze tracking: it estimates the 2D gaze target position on a given plane, or the 3D gaze angles, from images captured by RGB sensors. Its key advantage is that it requires no expensive custom hardware, only off-the-shelf cameras, which are inexpensive and widely available. However, it is a challenging problem, as it must account for several factors, such as differences in individual appearance, head-eye relationships, gaze ranges and illumination conditions [36].

In recent years, with the success of deep convolutional neural networks (CNNs) in computer vision, researchers have started to apply deep CNNs to appearance-based gaze tracking. Thanks to the large number of publicly available, high quality real and synthetic datasets [34, 15, 27, 25, 8, 23], deep CNNs have demonstrated good performance, but there is still room for improvement.

In this article, we propose to improve the accuracy of appearance-based gaze estimation by extracting higher resolution features from the eye images using deep neural networks. Given that eye images with different gaze angles may differ by only a few pixels (see Fig. 1), we argue that extracting features at high resolution could improve accuracy by capturing small appearance changes. To extract high-level features at high spatial resolution, we apply dilated-convolutions (also known as atrous-convolutions), which efficiently increase the receptive field sizes of the convolutional filters without reducing the spatial resolution. The main contributions of this article are the proposed Dilated-Net and a quantitative evaluation of the use of high resolution features on the Columbia Gaze [25] and MPIIGaze [34] datasets. In cross-subject experiments, the proposed Dilated-Nets significantly outperform CNNs with similar architectures, with gains of up to 20.8% depending on the task, and achieve state-of-the-art results on both datasets. The results demonstrate that the use of high resolution features benefits gaze estimation.

2 Related Work

2.1 Appearance-Based Gaze Estimation

Methods for appearance-based gaze estimation learn a mapping function from images to gaze estimates, where the estimation target is normally defined either as a gaze target on a given plane (2D estimation) or as a gaze direction vector in camera coordinates (3D estimation). Appearance-based methods are attracting increasing attention because they use inputs from off-the-shelf cameras, which are widely available. Given enough training data, they may be able to achieve unconstrained gaze estimation.

Several methods in computer vision have been applied to this problem, e.g., Random Forests [27], k-Nearest Neighbors [27, 22], Support Vector Regression [22] and, recently, deep CNNs. Zhang et al. proposed the first deep CNN to estimate 3D gaze angles [34, 36]. Their network takes the left eye image and the estimated head pose angles as input. They showed that the use of deep CNNs trained on a large amount of data improves accuracy significantly. To exploit information outside the eye region, Krafka et al. proposed a multi-region CNN that takes an image of the face, images of both eyes and a face grid as input to estimate the gaze target on phone and tablet screens [15]. Zhang et al. proposed a network that takes the full face image as input and uses a spatial weights method to emphasize features extracted from particular regions [35]. This work showed that regions of the face other than the eyes also contain information about gaze angles. To further improve accuracy, other work has concentrated on modeling the head-eye relationship. Ranjan et al. applied a branching architecture, where parameters are switched by clustering head pose angles into different groups [21]. Deng and Zhu trained two networks to estimate the head pose angles in camera coordinates and the gaze angles in head coordinates separately; the final gaze angles in camera coordinates were obtained by combining the two estimates geometrically [6].

Instead of estimating continuous gaze directions, some work has treated gaze tracking as a classification problem by dividing the gaze directions into discrete blocks. For example, George and Routray applied a CNN to classify eye images into 3 or 7 target regions [9]. The binary classification problem, referred to as gaze locking or eye contact detection, detects whether the user is looking at the camera. A Support Vector Machine (SVM) [25], Random Forests [31] and a CNN with multi-region input [18] have been applied to this problem.

While the recent trend has been to investigate how information from regions other than the eyes can benefit gaze estimation [35, 21, 6], here we focus on how better features extracted from the eye images can be used to benefit multi-region networks.

Figure 1: Images of two left eyes and their absolute difference, from the Columbia Gaze dataset [25]. (a) A left eye image at one combination of horizontal and vertical gaze angles. (b) A left eye image at a different combination of gaze angles. (c) The absolute difference between (a) and (b), scaled for better illustration.

2.2 Dilated-Convolutions

Dilated-convolutions were first introduced in the field of computer vision to extract dense features for dense label prediction, i.e., semantic segmentation [32, 3]. Given a convolutional kernel of size h × w × c (height × width × channels), the key idea of dilated-convolutions is to insert spaces (zeros) between the weights so that the kernel covers a region larger than h × w. Dilated-convolutions therefore increase the size of the receptive field without reducing the spatial resolution or increasing the number of parameters. Comprehensive studies of dilated-convolutions in semantic segmentation were reported in [29, 4], where the results show that dilated-convolutions improve performance significantly. Recently, Yu et al. proposed dilated residual networks [33] and showed that they outperform their non-dilated counterparts in image classification and object localization on the ImageNet dataset [7].

3 Methodology

3.1 Issue of Spatial Resolution

When a person looks at two different locations with his/her head fixed, the appearance of the eyes changes. However, these differences can be subtle, as shown in Fig. 1. A change in horizontal gaze angle may alter only a few pixels. Other small changes, e.g., in the openness and the shape of the eyes, also carry information about gaze direction. Intuitively, extracting high-level features at high resolution should better capture these subtle differences.

Most current CNN architectures use multiple downsampling layers, e.g., convolutional layers with large stride and pooling layers. In this article, we use max-pooling layers as an example for discussion because they are commonly used both in general and in gaze estimation [36, 35, 15, 6, 21]; similar considerations apply to convolutional layers with large stride. The use of max-pooling layers progressively reduces the spatial resolution of the feature maps. This enables the networks to tolerate small variations in position, increases the effective size of the receptive field (RF) at higher layers and reduces the number of parameters in the networks. However, the drawback is that spatial information is lost during pooling. For example, 75% of the activations are discarded when a 2×2 pooling window with a stride of 2 is used. To illustrate, Fig. 2(a) shows the RFs resulting from first applying a convolution, followed by max-pooling with a stride of 2, followed by another convolution. Inserting the pooling layer increases the size of the RF: a kernel applied to the higher-level feature map has a larger RF on the lower-level feature map. However, the lower-level locations that pass information on to the higher levels vary with the input. Successively applying max-pooling layers therefore results in a loss of important spatial information, which we expect to degrade the performance of gaze estimation.

Figure 2: Receptive fields for three different combinations of layers: (a) convolution, max-pooling, convolution; (b) dilated-convolution, dilated-convolution; (c) convolution, convolution. The grid on the left represents the lower-level feature map. The grid on the right represents the output of the max-pooling (a) or of the second convolution (b, c). Locations in dark blue show the locations weighted by the convolution operating on the right grid, and the corresponding locations in the left grid. Light blue shows the effective size of the RF in the lower layer due to the first convolution. The stride for convolutions and dilated-convolutions is 1 and the stride for max-pooling is 2.
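To make the comparison concrete, the following sketch computes the standard receptive-field arithmetic (effective RF size and the spacing between adjacent RF centers) for the three stacks in Fig. 2. The 3×3 kernels and the 2×2 pooling window are assumptions chosen for illustration, not the paper's exact configuration.

```python
# A minimal sketch of receptive-field arithmetic for the layer stacks in Fig. 2.
def receptive_field(layers):
    """layers: list of (kernel_size, stride, dilation) tuples, ordered low to high."""
    rf, jump = 1, 1
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1          # effective kernel size after dilation
        rf += (k_eff - 1) * jump         # RF grows by (k_eff - 1) lower-level steps
        jump *= s                        # spacing between adjacent RF centers
    return rf, jump

# (a) 3x3 conv -> 2x2 max-pool (stride 2) -> 3x3 conv: large RF, centers 2 px apart
print(receptive_field([(3, 1, 1), (2, 2, 1), (3, 1, 1)]))   # (8, 2)
# (b) two 3x3 dilated convs (stride 1), e.g. dilation 1 then 2: large RF, centers 1 px apart
print(receptive_field([(3, 1, 1), (3, 1, 2)]))              # (7, 1)
# (c) two plain 3x3 convs (stride 1): RF grows only linearly
print(receptive_field([(3, 1, 1), (3, 1, 1)]))              # (5, 1)
```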

3.2 Dilated-Convolutions

Dilated-convolutional layers preserve spatial resolution while increasing the size of the RF without a large increase in the number of parameters. Given an input feature map I, a kernel of size h × w (with weights W and bias b) and dilation rates (r_h, r_w), the output feature map O of a dilated-convolutional operation can be calculated by

    O(x, y) = \sum_{i=1}^{h} \sum_{j=1}^{w} I(x + r_h \cdot i,\; y + r_w \cdot j)\, W(i, j) + b        (1)

where (x, y) represents the position in the corresponding feature map. Equation (1) shows that the dilation rates determine the amount by which the size of the RF increases. Dilation rates larger than one allow the network to enlarge the RF without decreasing the spatial resolution (in contrast to pooling layers) or increasing the number of parameters.

Fig. 2(b) shows the RF resulting from one dilated-convolutional layer followed by another with a larger dilation rate. Because of the dilation, a kernel applied to the higher-level feature map corresponds to a large RF on the lower-level feature map, while the spatial resolution is preserved. The lower-level locations feeding into the higher-level units are also constant, independent of the input.

Fig. 2(c) shows the result of successively applying two ordinary convolutional layers while maintaining the spatial resolution. The corresponding RF on the lower-level map is much smaller. Stacking convolutional layers increases the size of the RF only linearly, which makes it hard to cover large regions at higher layers.
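As a quick illustration of this property (a sketch, not the paper's code), the PyTorch snippet below compares a plain 3×3 convolution with a dilated 3×3 convolution: with appropriate padding both preserve the spatial resolution of the input, and they have the same number of parameters, but the dilated version covers a 5×5 region. The tensor sizes are arbitrary placeholders.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 36, 60)                                      # e.g. a 36x60 eye feature map

plain   = nn.Conv2d(64, 64, kernel_size=3, padding=1)               # RF 3x3
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)   # RF 5x5

print(plain(x).shape, dilated(x).shape)       # both torch.Size([1, 64, 36, 60])
print(sum(p.numel() for p in plain.parameters()) ==
      sum(p.numel() for p in dilated.parameters()))                 # True: same parameter count
```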

3.3 Dilated-Nets

3.3.1 Multi-Region Dilated-Net.

Our proposed architecture, which we refer to in our results as Dilated-Net (multi), is shown in Fig. 3. It takes an image of the face and images of both eyes as input and feeds them to a face network and two eye networks, respectively. The general architecture is inspired by iTracker [15]. However, here the eye networks adopt dilated-convolutional layers to extract high resolution features.

Figure 3: Architecture of the multi-region Dilated-Net. The numbers after convolutional layers (Conv) represent the number of filters. The numbers after dilated-convolutional layers (Dilated-Conv) represent the number of filters and the dilation rates. A 3×3 filter size is used except that Conv-F3_4 and Conv-E2_3 use 1×1 filters. FC denotes a fully connected layer.

The face network is a VGG-like network [24] which consists of four blocks of stacked convolutional layers (Conv), each followed by a max-pooling layer, as well as two fully connected layers (FC). The weights of the first seven convolutional layers are transferred from the first seven layers of VGG-16 pre-trained on the ImageNet dataset. We insert a 1×1 convolutional layer (Network in Network [16]) after the last transferred layer to reduce the number of channels.

The two eye networks have identical architectures and share the same parameters in all convolutional and dilated-convolutional layers. Each network starts with four convolutional layers with a max-pooling layer in the middle, followed by a 1×1 convolutional layer, four dilated-convolutional layers (Dilated-Conv) and one fully connected layer. The dilation rates of the four dilated-convolutional layers are designed according to the hybrid dilated convolution (HDC) criterion in [29], so that the RF of each layer covers a square region without any holes in it. The weights of the first four convolutional layers are transferred from the first four layers of VGG-16 pre-trained on the ImageNet dataset.

We concatenate the outputs of FC-F6, FC-EL7 and FC-ER7 to combine the features from the different inputs. The concatenated vector is fed to FC-2 and then to an output layer.

We use the Rectified Linear Unit (ReLU) as the activation function for all convolutional and fully connected layers. Zero padding is applied to the convolutional layers to preserve the spatial dimensions. No padding is applied to the dilated-convolutional layers, in order to reduce the output dimensions and the computation. We apply batch renormalization [13] to all layers trained from scratch. Dropout is applied to all fully connected layers.
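For readers who prefer code, the PyTorch sketch below outlines the shape of one eye branch under the description above. The channel widths, the single-channel input, the kernel sizes and the dilation rates (2, 3, 4, 5) are placeholders, since the exact values in Fig. 3 are not reproduced here; batch renormalization and dropout are omitted for brevity.

```python
import torch
import torch.nn as nn

class EyeBranch(nn.Module):
    """Hedged sketch of one eye network: convs + mid max-pool, 1x1 conv, dilated convs, FC."""
    def __init__(self, fc_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            # four VGG-style convolutional layers with one max-pooling layer in the middle
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(inplace=True),     # input channels assumed
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, stride=2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
            # 1x1 convolution to reduce the number of channels
            nn.Conv2d(128, 64, 1), nn.ReLU(inplace=True),
            # four dilated-convolutional layers, no padding (placeholder dilation rates)
            nn.Conv2d(64, 64, 3, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, dilation=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, dilation=4), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, dilation=5), nn.ReLU(inplace=True),
        )
        self.fc = nn.LazyLinear(fc_dim)          # plays the role of FC-EL7 / FC-ER7

    def forward(self, eye):                      # eye: (N, 1, H, W) eye crop
        f = self.features(eye)
        return torch.relu(self.fc(f.flatten(1)))
```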

3.3.2 Single-Eye Dilated-Net.

To compare with networks that use only the left eye image and the estimated head pose as input, e.g., [36, 21], we shrank the Dilated-Net (multi) in Fig. 3 down to a single eye network and concatenated the estimated head pose angles with FC-EL7. In our results, we refer to this network as Dilated-Net (single).

3.3.3 CNN without Dilated-Convolutions.

To quantify the improvement achieved by dilated-convolutions, we use a deep CNN with a similar architecture but no dilated-convolutions. It replaces the four dilated-convolutional layers with four convolutional layers and three max-pooling layers located at the beginning, in the middle and at the end of the four convolutional layers, respectively. The size of the final feature maps is the same as in the Dilated-Nets, and this CNN has the same number of parameters as the corresponding Dilated-Net.

One key difference between the CNN and the Dilated-Net is shown in Fig. 4, which shows the sizes and centers of the RFs of the third max-pooling layer in the CNN and of the Dilated-Conv-E4 layer in the Dilated-Net. In the CNN, the progressive use of max-pooling results in a large distance (8 px) between the centers of adjacent RFs. For the Dilated-Net, the distance between adjacent centers remains small (2 px).

Figure 4: Sizes (blue squares) and centers (white dots) of the receptive fields. Top row: the third max-pooling layer in our CNN (RF centers 8 px apart). Bottom row: the Dilated-Conv-E4 layer in our Dilated-Net (RF centers 2 px apart).

3.4 Preprocessing

We apply a two-step preprocessing method. In the first step, we apply the same image normalization method used in [21, 36, 35]. This method virtually rotates and translates the camera so that the virtual camera faces a reference point at a fixed distance, and cancels out the roll angle of the head. The reference point is set to the center of the left eye for Dilated-Net (single) and to the center of the face for Dilated-Net (multi). The images are normalized by perspective warping, converted to gray scale and histogram-equalized. The estimated head pose angles and the ground-truth gaze angles are also normalized.

In the second step, we obtain the eye images and the face image from the warped images based on facial landmarks. When the landmarks are detected automatically, we use dlib [14]. For each eye image, we use the eye center as the image center and warp the image by an affine transformation so that the eye corners are at fixed positions. For the face image, we fix the position of the midpoint between the two eye centers and scale the image so that the horizontal distance between the two eye centers is constant.
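A hedged sketch of the eye-crop step is shown below: a similarity transform maps the two detected eye corners onto fixed positions in the output image. The output size (96×64) and the target corner coordinates are illustrative assumptions, not the paper's values.

```python
import cv2
import numpy as np

def eye_warp_matrix(inner, outer, out_size=(96, 64)):
    """2x3 affine matrix mapping the eye corners (x, y) to fixed, symmetric positions."""
    w, h = out_size
    src = np.float32([inner, outer])
    dst = np.float32([[0.75 * w, 0.5 * h], [0.25 * w, 0.5 * h]])   # assumed corner targets
    v_src, v_dst = src[1] - src[0], dst[1] - dst[0]
    scale = np.linalg.norm(v_dst) / np.linalg.norm(v_src)          # isotropic scale
    ang = np.arctan2(v_dst[1], v_dst[0]) - np.arctan2(v_src[1], v_src[0])
    c, s = scale * np.cos(ang), scale * np.sin(ang)
    A = np.array([[c, -s], [s, c]], dtype=np.float32)              # rotation + scale
    t = dst[0] - A @ src[0]                                        # translation
    return np.hstack([A, t[:, None]]).astype(np.float32)

# usage (face_img is the normalized, histogram-equalized face image):
# eye_img = cv2.warpAffine(face_img, eye_warp_matrix(inner_pt, outer_pt), (96, 64))
```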

4 Experiments

4.1 Cross-Subject Evaluation

We performed cross-subject experiments on the Columbia Gaze dataset [25] for eye contact detection and the MPIIGaze dataset [34] for 3D gaze regression.

4.1.1 Columbia Gaze Dataset.

This dataset was collected for the task of classifying whether the subject is looking at the camera (gaze locking). It comprises 5880 full-face images of 56 people (24 female, 21 with glasses) taken in a controlled environment. For each person, images were collected for every combination of five horizontal head poses, seven horizontal gaze directions and three vertical gaze directions. Among the 105 images of each person, five are gaze locking (0° horizontal and 0° vertical gaze direction, one per head pose).

 

Name | Method | Input | Training set | PR-AUC | Best F1-score
GL [25] | PCA+MDA+SVM | Two eyes | Columbia | 0.08 | 0.15
GL [25] | PCA+SVM | Two eyes | Columbia | 0.16 | 0.25
OpenFace [1] | Model-based | Two eyes | SynthesEyes [30] | 0.05 | 0.10
CNN (single) (ours) | CNN | Left eye + estimated head pose | Columbia | 0.40 | 0.44
Dilated-Net (single) (ours) | Dilated-CNN | Left eye + estimated head pose | Columbia | 0.42 | 0.48
CNN (multi) (ours) | CNN | Two eyes + face | Columbia | 0.48 | 0.52
Dilated-Net (multi) (ours) | Dilated-CNN | Two eyes + face | Columbia | 0.58 | 0.62

Table 1: Results on the Columbia Gaze dataset.

For cross-subject evaluation, we divided the 56 subjects into 11 groups: ten groups contained five subjects and one group contained six. The numbers of male/female subjects with/without glasses were balanced across groups. We conducted leave-one-group-out cross-validation. In each fold, if a validation set was needed, we randomly selected one group from the training set. Since the ratio between negative and positive samples is highly unbalanced, we upsampled the positive examples to balance the two classes by randomly disturbing the facial landmarks.
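A minimal sketch of the leave-one-group-out protocol, using scikit-learn, is shown below; the assignment of subjects to the 11 balanced groups is assumed to be given as a `groups` array, and the feature array is a placeholder.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def cross_subject_splits(n_samples, groups):
    """groups[i] in {0, ..., 10}: group index of the subject who produced sample i."""
    splitter = LeaveOneGroupOut()
    X = np.zeros((n_samples, 1))                      # placeholder features
    for train_idx, test_idx in splitter.split(X, groups=groups):
        yield train_idx, test_idx                     # train on 10 groups, test on 1
```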

We used a sigmoid output unit. During training, we used cross-entropy as the loss function and stochastic gradient descent with momentum (0.9) to train the network (mini-batch size 64). We used an initial learning rate of 0.01 and multiplied it by 0.5 after every 3000 iterations.
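The sketch below mirrors this training setup, assuming a PyTorch implementation; the tiny stand-in network and the random mini-batches are placeholders for the multi-region Dilated-Net and the real data loader.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 96, 1))          # stand-in network
batches = [(torch.randn(64, 1, 64, 96), torch.randint(0, 2, (64,)))
           for _ in range(10)]                                       # mini-batch size 64

criterion = nn.BCEWithLogitsLoss()      # sigmoid output + binary cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3000, gamma=0.5)

for eyes, labels in batches:
    loss = criterion(model(eyes).squeeze(1), labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                    # multiply the learning rate by 0.5 every 3000 iterations
```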

We re-implemented the gaze locking method (GL) [25] as a baseline; it reduces the intensity features by Principal Component Analysis (PCA) and Multiple Discriminant Analysis (MDA), and uses an SVM [2] as the final classifier. We also applied OpenFace 2.0 [1] to estimate 3D gaze vectors and calculated the cosine of the angular error with respect to the ground truth.

Figure 5: Precision-recall curve of different models on the Columbia Gaze dataset. The circle on each curve indicates the location where the best F1-score is obtained.

We present the average testing results over the 11 folds as precision-recall (PR) curves in Fig. 5. We also report the area under the curve (PR-AUC) and the best F1-score in Table 1. OpenFace performs the worst, mostly because it is a model-based method trained on a different dataset. All deep networks performed better than the two SVM-based methods, indicating that deep networks have a better capacity for appearance-based gaze estimation.

Among the four deep networks, Dilated-Net (multi) performed the best and CNN (single) performed the worst. The single-eye networks perform worse than the multi-region networks because the latter capture additional information from the face, which is consistent with [35]. Compared with the second best model, CNN (multi), Dilated-Net (multi) improved PR-AUC by 0.10 (a 20.8% relative gain) and the best F1-score by 0.10. Fig. 5 shows that Dilated-Net (multi) achieves higher precision for nearly all values of recall, indicating that it generally achieves a better trade-off between precision and recall.

4.1.2 MPIIGaze Dataset.

The MPIIGaze dataset [34] was collected for the task of estimating continuous gaze direction angles. It contains images of 15 subjects (six female, five with glasses). It provides an "Evaluation Subset", which contains randomly selected samples for each subject with automatically detected as well as manually annotated facial landmarks. As in [36, 35, 21], we trained and tested our Dilated-Net on this "Evaluation Subset"; we refer to the versions with automatically detected and with manually annotated landmarks as MPIIGaze and MPIIGaze+, respectively.

We conducted leave-one-subject-out cross-validation. In each fold, we randomly chose the data of three subjects from the training set for validation, and trained all our networks with the same validation set. We used a linear layer to output the estimated yaw and pitch gaze angles. The Euclidean distance between the estimated gaze angles and the ground-truth angles in the normalized space was used as the loss function. We trained all the networks using the Adam optimizer with a mini-batch size of 64, an initial learning rate of 0.001, and a learning rate decay of 0.1 after every 8000 iterations.
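The sketch below illustrates how the mean angular error reported in Tables 2 and 3 can be computed from predicted and ground-truth (yaw, pitch) pairs. The angle-to-vector convention is the one commonly used with this kind of normalization and is stated here as an assumption rather than taken from the paper.

```python
import numpy as np

def angles_to_vector(angles):
    """angles: (N, 2) array of (yaw, pitch) in radians -> (N, 3) unit gaze vectors."""
    yaw, pitch = angles[:, 0], angles[:, 1]
    return np.stack([np.cos(pitch) * np.sin(yaw),
                     np.sin(pitch),
                     np.cos(pitch) * np.cos(yaw)], axis=1)

def mean_angular_error(pred, gt):
    """Mean angle (degrees) between predicted and ground-truth gaze vectors."""
    a, b = angles_to_vector(pred), angles_to_vector(gt)
    cos = np.clip(np.sum(a * b, axis=1), -1.0, 1.0)   # vectors are unit length
    return np.degrees(np.arccos(cos)).mean()

# training loss sketch: Euclidean distance between predicted and ground-truth
# angles in the normalized space, e.g. np.linalg.norm(pred - gt, axis=1).mean()
```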

 

Name | Architecture | Input | Pre-train | MPIIGaze | MPIIGaze+
GazeNet [36] | VGG-16 | Left eye + estimated head pose | ImageNet | |
Branched CNN [21] | AlexNet | Left eye + estimated head pose | ImageNet / RFC [26] / RFC + 1M Synth | |
Branched CNN [21] | AlexNet + branch | Left eye + estimated head pose | ImageNet / RFC / RFC + 1M Synth | |
CNN (single) (ours) | VGG-16 | Left eye + estimated head pose | ImageNet | |
Dilated-Net (single) (ours) | Dilated-CNN | Left eye + estimated head pose | ImageNet | |

Table 2: Mean Angular Errors in Normalized Space Using Eye Center as Origin.

 

Name | Architecture | Input | Pre-train | MPIIGaze+
GazeNet [34] | AlexNet | Left eye + estimated head pose | ImageNet |
iTracker [15, 35] | AlexNet | Two eyes + face | ImageNet |
Spatial weights CNN [35] | AlexNet | Face | ImageNet |
Spatial weights CNN [35] | AlexNet + spatial weights | Face | ImageNet |
CNN (multi) (ours) | VGG-16 | Two eyes + face | ImageNet |
Dilated-Net (multi) (ours) | Dilated-CNN | Two eyes + face | ImageNet |

Table 3: Mean Angular Errors in Original Space Using Face Center as Origin.

For the networks that use a single eye image and the estimated head pose as input, we compared with GazeNet [36] and the state-of-the-art branched CNN [21]. For the networks that use images of the full face, we compared with the re-implementation of iTracker [15] reported in [35] and the state-of-the-art spatial weights CNN [35]. Note that [21, 36] reported the angular errors in the normalized space, whereas [35] reported the results in the original space (camera coordinates).

We present the mean angular errors across the 15 subjects in Table 2 and Table 3. In Table 2, our Dilated-Net (single) achieved the best performance on both MPIIGaze and MPIIGaze+. It outperformed the second best method, the branched CNN without head-pose-dependent branching, on both subsets, even though the branched CNN was pre-trained on the more closely related RFC and 1M synthetic images. The gain was larger when considering only the networks pre-trained on ImageNet; in this case, it outperformed the second best network, CNN (single), on both MPIIGaze and MPIIGaze+.

Figure 6: Mean angular error of different subjects on the MPIIGaze dataset in the original space. Error bars indicate standard errors computed across subjects.

Similar results can be observed in Table 3. Compared to the networks that use similar input (face + two eyes), our proposed Dilated-Net (multi) outperformed both iTracker (AlexNet) and CNN (multi). In Fig. 6, we further compare the average error for each subject: Dilated-Net (multi) outperformed CNN (multi) for 12 out of the 15 subjects. Dilated-convolutions thus improve accuracy for most subjects, despite variations in individual appearance.

Finally, we note that our Dilated-Net (multi) achieved the same results as the state-of-the-art spatial weights CNN. Compared to the spatial weights CNN, our Dilated-Net has several advantages, including a smaller input size, a lower input resolution and a much smaller number of parameters. This suggests that the Dilated-Net might achieve better performance on low resolution images.

4.2 Comparing Dilated-CNN with CNN

To better understand the differences between Dilated-Net (multi) and CNN (multi), we studied the features learned by the final convolutional layers and evaluated their importance. Both networks produce final feature maps of the same size, and the sizes of their RFs are also similar, but the RFs are centered at different locations: the center locations of the CNN units spread over the entire eye image (white dots in Fig. 7(a)), whereas the center locations of the Dilated-Net units are concentrated near the image center (red dots).

We performed an ablation study to determine the contribution of features from different spatial locations, in which we retrained only the parameters of the fully connected layers. We left the face network unchanged and evaluated on MPIIGaze+. The average angular errors are presented in Fig. 7(b) for three cases: (1) using all spatial locations, (2) eliminating the boundary locations and using only a central array of locations, and (3) using only an even smaller central array. For the CNN, eliminating the boundary features actually improves performance. This may be due to the removal of person-specific features, enabling better generalization. For the Dilated-Net, we see a degradation in performance as features from more locations are removed. This indicates that, despite the significant overlap of the RFs due to the close center spacing, the features at different locations are not redundant. Note that the Dilated-Net using only the features in the central region still outperforms the best performing CNN.

Figure 7: (a) The corresponding center locations of the learned features of the final (dilated-)convolutional layers of the CNN (white) and the Dilated-Net (red). (b) The average angular errors as a function of the remaining features.

We applied t-Distributed Stochastic Neighbor Embedding (t-SNE) [17] to reduce the feature vector at each spatial location to one dimension, and used Pearson's r to evaluate the linear correlation between these 1D features and the horizontal gaze angle, with the head pose and the vertical gaze angle fixed. The CNN features were less correlated with gaze, even when restricting attention to the central array, whereas the Dilated-Net features were more strongly correlated with the horizontal gaze angle at all 24 locations.
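A hedged sketch of this per-location analysis is given below; the array shapes and the use of scikit-learn and SciPy are assumptions, and the exact t-SNE settings are illustrative rather than those used in the experiment.

```python
import numpy as np
from sklearn.manifold import TSNE
from scipy.stats import pearsonr

def location_correlations(feats, yaw):
    """feats: (N, H, W, C) final-layer activations; yaw: (N,) horizontal gaze angles."""
    n, h, w, c = feats.shape
    r_map = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            # reduce the C-dimensional feature at this location to 1D with t-SNE
            one_d = TSNE(n_components=1, method="exact").fit_transform(feats[:, i, j, :])
            r_map[i, j], _ = pearsonr(one_d.ravel(), yaw)   # Pearson's r per location
    return r_map
```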

4.3 The Effect of Landmark Precision

To evaluate the influence of facial landmark detection, we randomly disturbed the landmarks by a small fraction of the eye image size in each direction. In cross-subject evaluation on MPIIGaze+, the performance of both the Dilated-Net (multi) and the CNN (multi) degrades slightly. While the Dilated-Net is more sensitive to the landmark precision, it remains robust and maintains its performance gains over the CNN.
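For concreteness, a minimal sketch of such a perturbation is shown below; the 5% jitter magnitude is an assumed placeholder, not the fraction used in the experiment.

```python
import numpy as np

def disturb_landmarks(landmarks, eye_width, frac=0.05, rng=None):
    """landmarks: (K, 2) array of (x, y) positions; returns a randomly jittered copy."""
    rng = rng or np.random.default_rng()
    noise = rng.uniform(-frac * eye_width, frac * eye_width, size=np.shape(landmarks))
    return np.asarray(landmarks) + noise
```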

4.4 The Use of ResNet

To study whether a similar improvement can be obtained with a more advanced architecture, we trained a modified ResNet-50 [10] and a Dilated-ResNet on MPIIGaze+ and tested them on the Columbia Gaze dataset. We changed the stride of the first layer to one and modified the last two residual blocks into dilated residual blocks. Ranked from lowest to highest average angular error, the models were the Dilated-ResNet, the Dilated-VGG, the ResNet and the VGG. While some improvement is achieved by replacing VGG with the more advanced ResNet, a greater improvement is achieved by adding dilated-convolutions to VGG. In addition, our results indicate that introducing dilated-convolutions to ResNet further improves performance.
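A close analogue of this modification can be expressed with torchvision's ResNet-50, as in the hedged sketch below: the last two residual stages swap stride for dilation and the stem stride is set to one. ImageNet pre-training is omitted here, and the two-output head for yaw and pitch is an assumption, not necessarily the paper's exact configuration.

```python
import torch.nn as nn
from torchvision.models import resnet50

def dilated_resnet50(num_outputs=2):
    # dilate the last two residual stages instead of striding them
    net = resnet50(replace_stride_with_dilation=[False, True, True])
    net.conv1.stride = (1, 1)                              # first-layer stride set to 1
    net.fc = nn.Linear(net.fc.in_features, num_outputs)    # yaw and pitch output (assumed)
    return net
```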

5 Conclusion

We applied dilated-convolutions in deep neural networks to improve appearance-based gaze estimation. Dilated-convolutions allow the networks to extract high-level features at high resolution from eye images, so that the networks can capture small appearance changes. We conducted cross-subject experiments on the Columbia Gaze and the MPIIGaze datasets. Our results indicate significant gains from the use of dilated-convolutions compared to CNNs with similar architectures but without them. These high resolution features improve the accuracy of gaze estimation, and our proposed multi-region Dilated-Net achieved state-of-the-art results on both datasets.

Moving forward, we plan to apply our gaze estimation method in real-world settings for human-machine and human-robot interaction. As the gaze trajectory is an excellent cue to user intent, the results of gaze tracking or eye contact detection can be used to estimate the user's intent, enabling systems to react more naturally and to provide appropriate assistance.

References

  • [1] Baltrusaitis, T., Zadeh, A., Lim, Y.C., Morency, L.P.: OpenFace 2.0: Facial behavior analysis toolkit. In: IEEE International Conference and Workshops on Automatic Face and Gesture Recognition. pp. 59–66. IEEE (2018)
  • [2] Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2(3),  27 (2011)
  • [3] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062 (2014)
  • [4] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4), 834–848 (2018)
  • [5] Chen, Z., Shi, B.E.: Using variable dwell time to accelerate gaze-based web browsing with two-step selection. International Journal of Human–Computer Interaction pp. 1–16 (2018)
  • [6] Deng, H., Zhu, W.: Monocular free-head 3D gaze tracking with deep learning and geometry constraints. In: IEEE International Conference on Computer Vision. pp. 3162–3171. IEEE (2017)

  • [7] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255. IEEE (2009)

  • [8] Funes Mora, K.A., Monay, F., Odobez, J.M.: Eyediap: A database for the development and evaluation of gaze estimation algorithms from rgb and rgb-d cameras. In: ACM Symposium on Eye Tracking Research & Applications. pp. 255–258. ACM (2014)
  • [9] George, A., Routray, A.: Real-time eye gaze direction classification using convolutional neural network. In: International Conference on Signal Processing and Communications. pp. 1–5. IEEE (2016)
  • [10] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385 (2015)
  • [11] Hoppe, S., Loetscher, T., Morey, S.A., Bulling, A.: Eye movements during everyday behavior predict personality traits. Frontiers in Human Neuroscience 12,  105 (2018)
  • [12] Huang, C.M., Mutlu, B.: Anticipatory robot control for efficient human-robot collaboration. In: ACM/IEEE International Conference on Human Robot Interaction. pp. 83–90. IEEE (2016)
  • [13] Ioffe, S.: Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In: Advances in Neural Information Processing Systems. pp. 1942–1950. MIT Press (2017)

  • [14] King, D.E.: Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research 10(Jul), 1755–1758 (2009)
  • [15] Krafka, K., Khosla, A., Kellnhofer, P., Kannan, H., Bhandarkar, S., Matusik, W., Torralba, A.: Eye tracking for everyone. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 2176–2184. IEEE (2016)
  • [16] Lin, M., Chen, Q., Yan, S.: Network in network. arXiv preprint arXiv:1312.4400 (2013)
  • [17] Maaten, L.v.d., Hinton, G.: Visualizing data using t-sne. Journal of Machine Learning Research 9(Nov), 2579–2605 (2008)
  • [18] Parekh, V., Subramanian, R., Jawahar, C.: Eye contact detection via deep neural networks. In: International Conference on Human-Computer Interaction. pp. 366–374. Springer (2017)
  • [19] Patney, A., Salvi, M., Kim, J., Kaplanyan, A., Wyman, C., Benty, N., Luebke, D., Lefohn, A.: Towards foveated rendering for gaze-tracked virtual reality. ACM Transactions on Graphics 35(6),  179 (2016)
  • [20] Pi, J., Shi, B.E.: Probabilistic adjustment of dwell time for eye typing. In: International Conference on Human System Interactions. pp. 251–257. IEEE (2017)
  • [21] Ranjan, R., De Mello, S., Kautz, J.: Light-weight head pose invariant gaze tracking. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 2156–2164. IEEE (2018)
  • [22] Schneider, T., Schauerte, B., Stiefelhagen, R.: Manifold alignment for person independent appearance-based gaze estimation. In: International Conference on Pattern Recognition. pp. 1167–1172. IEEE (2014)
  • [23] Shrivastava, A., Pfister, T., Tuzel, O., Susskind, J., Wang, W., Webb, R.: Learning from simulated and unsupervised images through adversarial training. In: IEEE International Conference on Computer Vision. pp. 2242–2251. IEEE (2017)
  • [24] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  • [25] Smith, B.A., Yin, Q., Feiner, S.K., Nayar, S.K.: Gaze locking: passive eye contact detection for human-object interaction. In: ACM Symposium on User Interface Software and Technology. pp. 271–280. ACM (2013)
  • [26] Su, H., Qi, C.R., Li, Y., Guibas, L.J.: Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3D model views. In: IEEE International Conference on Computer Vision. pp. 2686–2694 (2015)
  • [27] Sugano, Y., Matsushita, Y., Sato, Y.: Learning-by-synthesis for appearance-based 3D gaze estimation. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 1821–1828. IEEE (2014)
  • [28] Wang, H., Pi, J., Qin, T., Shen, S., Shi, B.E.: SLAM-based localization of 3D gaze using a mobile eye tracker. In: ACM Symposium on Eye Tracking Research & Applications. p. 65. ACM (2018)
  • [29] Wang, P., Chen, P., Yuan, Y., Liu, D., Huang, Z., Hou, X., Cottrell, G.: Understanding convolution for semantic segmentation. arXiv preprint arXiv:1702.08502 (2017)
  • [30] Wood, E., Baltrusaitis, T., Zhang, X., Sugano, Y., Robinson, P., Bulling, A.: Rendering of eyes for eye-shape registration and gaze estimation. In: IEEE International Conference on Computer Vision. pp. 3756–3764. IEEE (2015)
  • [31] Ye, Z., Li, Y., Liu, Y., Bridges, C., Rozga, A., Rehg, J.M.: Detecting bids for eye contact using a wearable camera. In: IEEE International Conference and Workshops on Automatic Face and Gesture Recognition. pp. 1–8. IEEE (2015)
  • [32] Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)
  • [33] Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. In: IEEE International Conference on Computer Vision. pp. 636–644. IEEE (2017)
  • [34] Zhang, X., Sugano, Y., Fritz, M., Bulling, A.: Appearance-based gaze estimation in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 4511–4520. IEEE (2015)
  • [35] Zhang, X., Sugano, Y., Fritz, M., Bulling, A.: It’s written all over your face: Full-face appearance-based gaze estimation. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 2299–2308. IEEE (2017)
  • [36] Zhang, X., Sugano, Y., Fritz, M., Bulling, A.: MPIIGaze: Real-world dataset and deep appearance-based gaze estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017)