Recent appearance-based gaze estimation works under outdoor conditions by training on large-scale annotated real-image datasets. However, annotating training data requires substantial manual labor. To avoid this cost, training on synthetic images is preferred, because the annotations come for free. This solution has a drawback: the distributions of real and synthetic images differ considerably. The traditional remedy is to use unlabeled real data to improve the realism of synthetic images from the simulator, as in SimGANs; however, such methods learn only global features and ignore local ones. In the gaze estimation task, after refinement with SimGANs, the shape or edge of the pupil may be changed, and the gaze estimation error then increases due to a wrong pupil-center location. Moreover, these methods cannot be applied to outdoor (field) scenes because of their long training times and weak adaptability to varying field conditions. We take the opposite direction: we purify real images by extracting discriminative and robust features, converting outdoor real images into indoor-style synthetic images. Synthetic images are more regular and easier to learn from, and their annotations are automatically available.
To avoid changing the shape of the pupil or its edge, we propose a mask-guided style transfer network that learns both local and global features. Local features are handled by obtaining the attention region (pupil or iris) via segmentation. Fortunately, with the rapid development of deep-learning-based image segmentation methods [2, 3], including FCN, SegNet, U-Net, and Mask R-CNN, we can obtain much better masks. To learn the style and content information from synthetic images, we first introduce the segmentation masks to construct RGB-mask pairs as inputs; we then design a mask-guided style transfer network that learns style features separately from the attention and background (bkgd) regions, and content features from the full image and the attention region. For feature extraction, our work is most directly related to the work initiated by Gatys et al.
, in which the feature maps of a discriminatively trained deep convolutional neural network were used to achieve breakthrough performance in painting style transfer. We train a feed-forward feature-extraction network for image transformation tasks. Our network aims to learn as much as possible under the synthetic distribution, to minimize the loss of transmitted content, and to address the lack of spatial-alignment information caused by the Gram matrix. To achieve this goal, we propose a loss network with a novel task-guided loss, in which the attention region, the background region, and the full image region are each handled by a different task loss.
Our contributions are threefold:
1. We take the first step toward considering the attention region in the style transfer task and propose a mask-guided style transfer network that purifies the real image, making it similar to indoor conditions while retaining annotation information. Unlike previous work that refines synthetic images with global features only, we purify real images with both local and global features.
2. Our network considers not only the RGB color channels but also uses segmentation masks to construct RGB-mask pairs as inputs. We learn style features separately from the attention and background (bkgd) regions, and content features from the full image and the attention region.
3. We propose a hybrid (qualitative and quantitative) evaluation methodology for experiments on two tasks. The results show that the proposed architecture purifies the real image significantly better than the baseline methods; meanwhile, we achieve state-of-the-art results on the gaze estimation task.
2 Proposed Method
As shown in Figure 1, the network contains three multi-scale stages and a loss net to learn the final features. There are three main streams, extracted from different regions of the image: the full stream, the attention stream, and the background stream. The full stream learns features from the raw images, while the attention stream and the background stream learn attention and background features using attention maps. The attention maps are generated by the attention subnet; the streams are designed to retain the content of the input image and to transfer the style of the style image, guided by the input-image mask and style-image mask.
2.1 Attention Subnet
Given the input (style) image pair (RGB-mask) as input, the attention subnet produces attention maps, which can be denoted as

$A = \sigma(W \ast X + b),$

where $\sigma$ is the sigmoid function and $W$ and $b$ are the convolutional filter weights and bias. Conversely, the background maps, denoted $B$, form a contrastive attention pair with $A$: each location $(i, j)$ in the pair of attention maps and background maps should meet the constraint

$A(i, j) + B(i, j) = 1.$

Thus, the attention and background streams can be denoted as

$F_{att} = A \odot F, \quad F_{bkgd} = B \odot F,$

where $\odot$ denotes the spatial weighting (element-wise multiplication) operation.
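As a minimal numerical sketch of this decomposition (NumPy, with a single-channel feature map and scalar `w`, `b` standing in for the attention subnet's learned 1x1-convolution weights, which are our illustrative stand-ins, not the trained parameters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_streams(feat, w=1.0, b=0.0):
    """Split a feature map into attention and background streams.

    feat : (H, W) single-channel feature map
    w, b : illustrative 1x1-convolution weight and bias
    """
    A = sigmoid(w * feat + b)    # attention map
    B = 1.0 - A                  # background map; A + B == 1 at every (i, j)
    return A * feat, B * feat    # spatially weighted streams

feat = np.array([[0.5, -1.0], [2.0, 0.0]])
att, bkgd = attention_streams(feat)
# Because A + B == 1 everywhere, the two streams always recompose the input.
```

The complementarity constraint guarantees that no spatial position is dropped: whatever the attention stream discards, the background stream keeps.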
2.2 Loss network with region-level task-guided loss
With the attention maps described in the last subsection, we further introduce a region-level triplet loss to enhance contrastive feature learning. After the attention operation, the features from the three main streams can be denoted $F_{full}$, $F_{att}$, and $F_{bkgd}$; they are used to calculate the region-level task-guided loss for two tasks: content preservation and style transfer.
2.2.1 Feature reconstruction loss
The traditional feature reconstruction loss, known as the content loss, takes only the input image and tries to minimize the loss between the content of the input and output images, without considering where the content reconstructions are encoded. We address this problem with segmentation masks for the input images, so that the local features of the pupil region are accounted for when calculating the loss of $F_{full}$ and $F_{att}$. To visualize the image information encoded at different layers for the masked input image, we perform gradient descent on a white-noise image to find another image that matches the feature responses of the original masked image. We then define the squared-error loss between the two feature representations:

$\ell_{content} = \sum_l \Big( \alpha_l \sum_{i,j} \big(F^l_{ij} - P^l_{ij}\big)^2 + \sum_{c=1}^{C} \beta_l \sum_{i,j} \big( (F^l M_c)_{ij} - (P^l M_c)_{ij} \big)^2 \Big),$

where $C$ is the number of channels in the semantic segmentation mask, $l$ indicates the $l$-th convolutional layer of the deep convolutional neural network, $F^l M_c$ is the output-image feature in layer $l$ masked by channel $c$, $P^l M_c$ is the input-image feature in layer $l$ masked by channel $c$, $\alpha_l$ is the weight configuring the layer preferences of the global loss, and $\beta_l$ is the weight configuring the layer preferences of the local losses.
Each layer $l$ with $N_l$ distinct filters has $N_l$ feature maps, each of size $M_l$, where $M_l$ is the height times the width of the feature map. The responses in layer $l$ can thus be stored in a matrix $F^l \in \mathbb{R}^{N_l \times M_l}$, where $F^l_{ij}$ is the activation of the $i$-th filter at position $j$ in layer $l$. Minimizing $\ell_{content}$ preserves the image content and overall spatial structure, but not color, texture, and exact shape.
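A hedged NumPy sketch of this masked content loss for a single layer (the names `masked_content_loss`, `alpha`, and `beta` are ours; the real network computes this on CNN feature maps rather than raw arrays):

```python
import numpy as np

def masked_content_loss(F, P, masks, alpha=1.0, beta=1.0):
    """Squared-error content loss over one layer, global term plus one local
    term per mask channel.

    F, P  : (N, M) feature matrices of the generated and input images
            (N filters, M = height x width positions)
    masks : (C, M) binary segmentation masks, one row per channel
    alpha : weight of the global (full-image) term
    beta  : weight of the local (masked-region) terms
    """
    loss = alpha * np.sum((F - P) ** 2)        # global content term
    for c in range(masks.shape[0]):            # one local term per mask channel
        loss += beta * np.sum(((F - P) * masks[c]) ** 2)
    return 0.5 * loss
```

Positions inside a mask channel (e.g. the pupil) contribute twice, once globally and once locally, which is how the local pupil features get extra weight.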
2.2.2 Style reconstruction loss
Feature Gram matrices are effective at representing texture because, through spatial averaging, they capture global statistics across the image. Since textures are static, averaging over positions is required, but it makes Gram matrices fully blind to the global arrangement of objects inside the reference image. So if we want to keep the global arrangement of objects, we must make the Gram matrices more controllable by computing them over exact regions of the image rather than over the entire image.
Instead of taking the input image and style image as inputs, we take the input image and style image together with their masks $M_{in}$ and $M_{style}$ as paired inputs. To learn the skin style and pupil style separately, we denote the pupil region as the attention region and extract attention maps from both the style image and the input image; likewise, the skin region is denoted as the background region, and background maps are produced from the style image and the input image. We then define the squared-error loss between the two region feature representations:

$\ell_{style} = \sum_l \sum_{r \in \{att,\, bkgd\}} \frac{w_l}{4 N_l^2 M_l^2} \big\| G^l_r - \hat{G}^l_r \big\|_F^2,$

where $C$ is the number of channels in the semantic segmentation mask and $l$ indicates the $l$-th convolutional layer of the deep convolutional neural network. Each layer $l$ with $N_l$ distinct filters has $N_l$ feature maps, each of size $M_l$, where $M_l$ is the height times the width of the feature map, so the responses in layer $l$ can be stored in a matrix $F^l \in \mathbb{R}^{N_l \times M_l}$, where $F^l_{ik}$ is the activation of the $i$-th filter at position $k$ in layer $l$. The region Gram matrix $G^l_r$ can be denoted as follows:

$(G^l_r)_{ij} = \sum_k (F^l_r)_{ik} (F^l_r)_{jk},$

where $F^l_r$ is the layer-$l$ feature restricted to region $r$ by the corresponding mask channel $c$, $\hat{G}^l_r$ is the region Gram matrix of the style image, and $w_l$ is the weight configuring the layer preferences of the global and local losses.
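The region-wise Gram statistics can be sketched as follows (NumPy; `region_gram` and `masked_style_loss` are our illustrative names, and the $1/(4 N^2 M^2)$ normalization follows the usual Gatys-style convention, which may differ from the paper's exact constants):

```python
import numpy as np

def region_gram(F, mask):
    """Gram matrix of layer features F (N, M) restricted to a binary region
    mask of shape (M,): positions outside the region are zeroed first."""
    Fm = F * mask
    return Fm @ Fm.T      # G[i, j] = sum_k F[i, k] * F[j, k] over the region

def masked_style_loss(F, S, mask_in, mask_style):
    """Squared Frobenius distance between the region Grams of generated (F)
    and style (S) features, each restricted by its own mask."""
    N, M = F.shape
    G_f = region_gram(F, mask_in)
    G_s = region_gram(S, mask_style)
    return np.sum((G_f - G_s) ** 2) / (4.0 * N ** 2 * M ** 2)
```

Summing this loss over the attention region and the background region (with their respective masks) gives the pupil style and skin style terms separately.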
We formulate the style transfer objective by combining the two components:

$\ell_{total} = \alpha\, \ell_{content} + \beta\, \ell_{style},$

where $L$ is the total number of convolutional layers, $l$ indicates the $l$-th convolutional layer of the deep convolutional neural network, and $\alpha_l$ and $\beta_l$ are the weights configuring layer preferences. $\ell_{content}$ is the content loss (Eq. (10)) and $\ell_{style}$ is the style loss (Eq. (13)); $\alpha$ and $\beta$ are scalars, and the hyperparameters are exactly the same in all cases. We find that unconstrained optimization of Equation 20 typically results in images whose pixels fall outside the range [0, 255]. For a fairer comparison with our method, whose output is constrained to this range, we minimize Equation 20 for the baseline using projected L-BFGS. The image $O$ is generated by solving

$O = \arg\min_I \ell_{total}(I),$

where $I$ is initialized with white noise. An advantage of this solution is that the mask need not be very precise; it not only retains the desired structural features but also enhances the estimation of the pupil and iris information during the reconstruction of the style.
3 Experimental Results
3.1 Style Transfer
The purpose of the style transfer task is to generate an image that combines the content of the target content image (the real image) with the style of the target style image (the synthetic image). We train an image transformation network for each of several hand-selected style targets and compare our results with the baseline methods of Gatys et al. and Feifei Li et al., both of which we re-implemented. To make a fairer comparison with our method, whose output is constrained to [0, 255], for the baselines we minimize Equation 1 and Equation 6 using projected L-BFGS, clipping the image to the range [0, 255] at each iteration. In most cases, the optimization converges to satisfactory results within 500 iterations.
We compare the proposed style transfer method with the methods proposed by Gatys et al. and Feifei Li et al. Figure 4 shows qualitative results in the indoor and outdoor scenes for the UnityEyes and LPW datasets, respectively. From the LPW dataset, we choose six different real images, (a) through (f), to represent six different outdoor conditions. From the UnityEyes dataset, we select two synthetic images with different distributions as the style images, styles A and B, neither of which has the same gaze angle as any of the six real images. As can be seen from (a), (b), and (c), the proposed method is less affected by external factors such as illumination, and is similar to the results obtained by Gatys et al. and Feifei Li et al. However, looking closely, the proposed method is better at preserving the color information of the style image. From (d), (e), and (f), it is obvious that Gatys et al. and Feifei Li et al. are so strongly influenced by light and other factors that the pupil and iris regions cannot be completely separated; in (e) part of the pupil area is even lost. Moreover, their distribution of pupil and iris regions is dramatically different from the style image. In comparison, our proposed method not only separates the pupil and iris regions more easily but also produces a distribution of pupil and iris regions more similar to the style image.
Table 2 compares the runtime of our method with those of Gatys et al. and Feifei Li et al. for several image sizes. Across all image sizes, compared with 400 iterations of the baseline methods, our method is three orders of magnitude faster than Gatys et al., and we achieve better qualitative results (Fig. 6) than Feifei Li et al. at a tolerable speed. Our method processes images at 20 FPS, making it feasible to run in real time or on video.
3.2 Appearance-based Gaze Estimation
To verify the effectiveness of the proposed method for gaze estimation, three public datasets (UTView, SynthesEyes, UnityEyes) are used to train estimators with k-NN and CNN [18, 19, 20, 21, 22]. The MPIIGaze dataset and the purified MPIIGaze dataset (purified by the proposed method) are used to test accuracy.
| Method | Error on MPIIGaze (°) | Error on purified MPIIGaze (°) |
|---|---|---|
| KNN with UTView | 16.2 | 13.6 |
| CNN with UTView | 13.9 | 11.7 |
| KNN with UnityEyes | 12.5 | 9.9 |
| CNN with UnityEyes | 11.2 | 7.8 |
| KNN with SynthesEyes | 11.4 | 8.0 |
| CNN with SynthesEyes | 13.5 | 8.8 |
Five gaze estimation methods are used as baselines. Three of them are common methods: Support Vector Regression (SVR), Adaptive Linear Regression (ALR), and Random Forest (RF). The other two are reproduced for fair comparison with the state of the art. The first is a simple cascaded method that uses multiple k-NN (k-Nearest Neighbor) classifiers to select neighbors in a feature space joining head pose, pupil center, and eye appearance. The second trains a simple convolutional neural network (CNN) to predict the eye gaze direction by minimizing a regression loss. As shown in Table 2, we compare the performance of these two gaze estimation methods on different datasets, where "Method" denotes a gaze estimation method together with its training set. The gaze estimation accuracy on each dataset improves by at least three degrees, which means that our proposed method greatly improves performance when testing on the purified output. This improvement has practical value for human-computer interaction.
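A minimal sketch of the k-NN baseline's neighbor-selection idea (plain k-NN regression over a joint feature vector; all names here are ours, and the reproduced method's cascade and feature space are richer than this):

```python
import numpy as np

def knn_gaze(train_feats, train_gazes, query, k=3):
    """Predict a gaze angle by averaging the gaze labels of the k nearest
    training samples in feature space (Euclidean distance).

    train_feats : (n, d) joint features (e.g. head pose + pupil + appearance)
    train_gazes : (n, 2) gaze angle labels (yaw, pitch)
    query       : (d,) feature vector of the test eye image
    """
    dists = np.linalg.norm(train_feats - query, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k closest samples
    return train_gazes[nearest].mean(axis=0)
```

Because the prediction is an average over neighbors from the synthetic training set, any distribution gap between real test images and synthetic training images directly corrupts the neighbor selection, which is what purification is meant to reduce.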
| Method | Test set | Error (°) |
|---|---|---|
| CNN with UnityEyes | Raw MPIIGaze | 11.2 |
| CNN with Refined UnityEyes | Raw MPIIGaze | 9.9 |
| CNN with UnityEyes | Purified MPIIGaze | 7.8 |
Furthermore, to show that purifying the real dataset achieves better performance than refining the synthetic dataset on the gaze estimation task, we compare the proposed method with the state-of-the-art synthetic-refinement method, SimGANs. To make the comparison fair, we reproduce SimGANs: unlike the traditional setting of training on the synthetic UnityEyes dataset and testing on the real MPIIGaze dataset, SimGANs refines UnityEyes to make the synthetic images more realistic and then tests on the real MPIIGaze dataset. In contrast, the proposed method purifies the real MPIIGaze test set without modifying the synthetic training dataset. From Table 3, we can see that purifying the real test set outperforms both the traditional setting and SimGANs on gaze estimation.
We manually annotate the pupil center of 200 real and purified images by fitting an ellipse to the pupil, which serves as an approximation of the gaze direction (and is difficult for humans to label accurately), in order to quantify that there is no significant change in the ground-truth gaze direction. The absolute difference of the estimated pupil center between the real and corresponding purified images is very small: 0.8 ± 1.1 px (eye width = 55 px).
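This consistency check can be approximated with centroids of binary pupil masks (a simplification of the ellipse fitting described above; `pupil_center` and `pupil_center_diff` are our illustrative names):

```python
import numpy as np

def pupil_center(mask):
    """Centroid (row, col) of a binary pupil mask, a cheap stand-in for
    the center of a fitted ellipse."""
    ys, xs = np.nonzero(mask)
    return np.array([ys.mean(), xs.mean()])

def pupil_center_diff(mask_real, mask_purified):
    """Euclidean distance in pixels between the pupil centers of the
    real image and its purified counterpart."""
    return np.linalg.norm(pupil_center(mask_real) - pupil_center(mask_purified))
```

A small average distance over the annotated pairs indicates that purification left the pupil location, and hence the ground-truth gaze, essentially unchanged.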
4 Conclusion
This paper purifies the real image by weakening its distribution, which is a better choice than improving the realism of the synthetic image. We have applied this method to style transfer and gaze estimation tasks, where we achieved comparable performance and drastically improved speed compared with existing methods. The performance evaluation indicates that the purified MPIIGaze dataset (purified by our proposed method) records a smaller angular error on the gaze estimation task than the raw MPIIGaze dataset.
In the future, we intend to explore modeling the real-time gaze estimation system based on the proposed method and improve the speed of purifying videos.
-  Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb, “Learning from simulated and unsupervised images through adversarial training,” pp. 2107–2116, 2017.
-  Lin Wu, Yang Wang, Xue Li, and Junbin Gao, “Deep attention-based spatially recursive networks for fine-grained visual recognition,” IEEE Transactions on Cybernetics, vol. 49, no. 5, pp. 1791–1802, 2019.
-  Lin Wu, Yang Wang, Junbin Gao, and Xue Li, “Where-and-when to look: Deep siamese attention networks for video-based person re-identification,” IEEE Transactions on Multimedia, 2018.
-  Jonathan Long, Evan Shelhamer, and Trevor Darrell, “Fully convolutional networks for semantic segmentation,” pp. 3431–3440, 2015.
-  Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
-  Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-net: Convolutional networks for biomedical image segmentation,” pp. 234–241, 2015.
-  Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, “Mask r-cnn,” pp. 2961–2969, 2017.
-  Leon A Gatys, Alexander S Ecker, and Matthias Bethge, “A neural algorithm of artistic style,” arXiv preprint arXiv:1508.06576, 2015.
-  Justin Johnson, Alexandre Alahi, and Li Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” pp. 694–711, 2016.
-  Erroll Wood, Tadas Baltrušaitis, Louis-Philippe Morency, Peter Robinson, and Andreas Bulling, “Learning an appearance-based gaze estimator from one million synthesised images,” pp. 131–138, 2016.
-  Marc Tonsen, Xucong Zhang, Yusuke Sugano, and Andreas Bulling, “Labelled pupils in the wild: a dataset for studying pupil detection in unconstrained environments,” pp. 139–142, 2016.
-  Yusuke Sugano, Yasuyuki Matsushita, and Yoichi Sato, “Learning-by-synthesis for appearance-based 3d gaze estimation,” pp. 1821–1828, 2014.
-  Erroll Wood, Tadas Baltrusaitis, Xucong Zhang, Yusuke Sugano, Peter Robinson, and Andreas Bulling, “Rendering of eyes for eye-shape registration and gaze estimation,” pp. 3756–3764, 2015.
-  Lin Feng, Huibing Wang, Bo Jin, Haohao Li, Mingliang Xue, and Le Wang, “Learning a distance metric by balancing kl-divergence for imbalanced datasets,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, no. 99, pp. 1–12, 2018.
-  Yang Wang, Lin Wu, Xuemin Lin, and Junbin Gao, “Multiview spectral clustering via structured low-rank matrix factorization,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 10, pp. 4833–4843, 2018.
-  Yang Wang, Xuemin Lin, Lin Wu, Wenjie Zhang, Qing Zhang, and Xiaodi Huang, “Robust subspace clustering for multi-view data by exploiting correlation consensus,” IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 3939–3949, 2015.
-  Huibing Wang, Lin Feng, Jing Zhang, and Yang Liu, “Semantic discriminative metric learning for image similarity measurement,” IEEE Transactions on Multimedia, vol. 18, no. 8, pp. 1579–1589, 2016.
-  Lin Wu, Yang Wang, and Ling Shao, “Cycle-consistent deep generative hashing for cross-modal retrieval,” IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 1602–1612, 2019.
-  Qichang Hu, Huibing Wang, Teng Li, and Chunhua Shen, “Deep cnns with spatially weighted pooling for fine-grained car recognition,” IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 11, pp. 3147–3156, 2017.
-  Lin Wu, Yang Wang, Ling Shao, and Meng Wang, “3-d personvlad: Learning deep global representations for video-based person reidentification,” IEEE transactions on neural networks and learning systems, 2019.
-  Yang Wang, Xuemin Lin, Lin Wu, and Wenjie Zhang, “Effective multi-query expansions: Collaborative deep networks for robust landmark retrieval,” IEEE Transactions on Image Processing, vol. 26, no. 3, pp. 1393–1404, 2017.
-  Yang Wang, Wenjie Zhang, Lin Wu, Xuemin Lin, Meng Fang, and Shirui Pan, “Iterative views agreement: An iterative low-rank based structured optimization method to multi-view spectral clustering,” International Joint Conference on Artificial Intelligence, 2016, pp. 2153–2159.
-  Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling, “Mpiigaze: Real-world dataset and deep appearance-based gaze estimation,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 1, pp. 162–175, 2019.
-  Yafei Wang, Tongtong Zhao, Xueyan Ding, Jinjia Peng, Jiming Bian, and Xianping Fu, “Learning a gaze estimator with neighbor selection from large-scale synthetic eye images,” Knowledge-Based Systems, vol. 139, pp. 41–49, 2018.
-  Tongtong Zhao, Yuxiao Yan, JinJia Peng, HaoHui Wei, and Xianping Fu, “Refining synthetic images with semantic layouts by adversarial training,” pp. 863–878, 2018.
-  Tongtong Zhao, Yafei Wang, and Xianping Fu, “Refining eye synthetic images via coarse-to-fine adversarial networks for appearance-based gaze estimation,” pp. 419–428, 2017.
-  Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling, “Appearance-based gaze estimation in the wild,” pp. 4511–4520, 2015.
-  Tongtong Zhao, Yuxiao Yan, Ibrahim Shehi Shehu, and Xianping Fu, “Image purification networks: Real-time style transfer with semantics through feed-forward synthesis,” pp. 1–7, 2018.
-  Tongtong Zhao, Yuxiao Yan, Jinjia Peng, Zetian Mi, and Xianping Fu, “Guiding intelligent surveillance system by learning-by-synthesis gaze estimation,” Pattern Recognition Letters, 2019.