Gaze redirection is a new research topic in computer vision and computer graphics that aims to manipulate a given eye gaze toward a desired direction according to a reference angle (see Fig. 1). This task is important in many real-world scenarios. For example, when taking a group photo, it rarely happens that everyone is looking at the camera at the same time, and adjusting each person’s eye gaze to the same direction (the camera) will make the photo look better and more acceptable to users. In another scenario, when talking over a video conferencing system, eye contact is important as it can express attentiveness, confidence and specific requirements. However, due to the location disparity between the video screen and the camera, the participants do not have direct eye contact.
Traditional methods are based on a 3D model that re-renders the entire input region [1, 32]. These methods suffer from two major problems: it is not easy to render the entire input region, and they require heavy instrumentation. Another type of gaze redirection uses machine learning for image re-synthesis, such as DeepWarp [6] or PRGAN [10]. DeepWarp employs a neural network to predict a dense flow field, which is used to warp the input image for gaze redirection. However, this method cannot generate perceptually plausible samples, because relying only on the pixel-wise difference between the synthesized and ground-truth images is insufficient. Additionally, PRGAN proposes a GAN-based model with a cycle-consistent loss for monocular gaze redirection; it can synthesize samples with high quality and redirection precision, but the results are still far from the requirements imposed by many application scenarios.
To further improve gaze redirection, we developed a coarse-to-fine strategy and combined flow learning with adversarial learning to produce higher quality and more precise redirection results. As shown in Fig. 2, our model consists of three parts. The first one is a coarse-grained model which is an encoder-decoder architecture with flow learning for modeling the eye spatial transformation. Specifically, this network is fed with source images and with difference angles between target and source. Second, in order to refine the warped results with target angles, we propose using a conditional architecture for the generator to learn the residual image between the warped output and the ground truth aiming to reduce the unwanted artifacts in texture and the distortions in shape. Finally, a discriminator network for adversarial and gaze regression learning is designed to ensure the refined results have the same distribution and desirable angles as the ground truth. Additionally, we propose utilizing the gazemap which represents the visual results of gaze numeric value as input to guide the entire synthesis process and to produce more accurate results. The gazemap can provide additional spatial and semantic information of target angles. Note that in this way we have a multimodal guidance because we use images and numeric values together.
The main contributions of our work are:
We propose a coarse-to-fine eye gaze redirection model combining flow learning and adversarial learning.
We have developed a multimodal-guided eye gaze redirection framework exploiting the gazemap as a condition in addition to numeric angles.
We conducted a comprehensive experimental evaluation demonstrating the superiority of our proposed model in terms of image quality of eye gaze reconstruction and angle redirection precision.
2 Related Work
Generative Adversarial Networks (GANs) [8] are powerful generative models which have shown promising results in various tasks such as high-level semantics or style transfer [35, 13, 42, 18, 3, 29, 21], image animation, image in-painting and super-resolution [37, 17, 12, 36, 26], but also in classical detection and segmentation tasks [30, 22, 19]. A typical GAN framework contains a generative model G and a discriminative model D. The two models play a min-max two-player game in which D learns to distinguish between real and fake samples, and G generates fake samples that fool D into mistaking them for real ones. In this paper, we use adversarial learning to improve the visual quality of gaze redirection results.
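For reference, the min-max game described above is usually written in the standard form below. This is the generic GAN objective, not necessarily the exact adversarial loss used later in this paper; $p_{\text{data}}$ and $p_z$ denote the data and noise distributions and are notation introduced here for illustration.

```latex
\min_{G} \max_{D} \;
\mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
+ \mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(G(z))\big)\big]
```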
Facial Attribute Manipulation, an interesting multi-domain image-to-image translation problem, aims at modifying the semantic content of a facial image according to a specified attribute while keeping other unrelated regions unchanged. Most works [3, 13, 20, 24, 9, 25, 33, 2, 38, 11, 41] are based on GANs and have achieved impressive facial attribute manipulation results. However, these methods tend to learn style or texture translation and are not good at obtaining high-quality, natural geometry translations. To alleviate this problem, [31] proposed a geometry-aware flow, learned under geometry guidance from facial landmarks, to warp the input. [34] also exploits the flow field to perform spontaneous motion, which achieves higher-quality facial attribute manipulation. Eye gaze redirection can be considered one type of facial attribute manipulation. To the best of our knowledge, our model is the first to combine flow learning and adversarial learning for the eye gaze redirection task.
Gaze Redirection. Traditional methods are based on a 3D model that re-renders the entire input region. [1] uses an example-based approach for deforming the eyelids and slides the iris across the model surface with texture-coordinate interpolation. GazeDirector [32] models the eye region in 3D to recover the shape, pose, and appearance of the eye, and then applies an acquired dense flow field corresponding to eyelid motion to the input image to warp the eyelids. Finally, the redirected eyeball model is rendered into the output image.
Recently, machine learning based methods have shown remarkable results using large training sets labelled with eye angles and head pose information. DeepWarp [6] uses a deep convolutional network with a coarse-to-fine warping operation to generate redirection results. However, these warping methods, which rely on pixel-wise differences between the synthesized and ground-truth images, have difficulties in generating photo-realistic images, and they fail in the presence of large redirection angles due to dis-occlusion problems. Recently, PRGAN [10] adopted a GAN-based model with a cycle-consistent loss for gaze redirection and succeeded in generating better-quality results, but they are still far from satisfactory. [37] proposes a GAN-based model, GazeGAN, which is based on inpainting to learn from the face image how to fill in the missing corrected eye gaze. However, GazeGAN is only suited for gaze correction when the subject is staring at the camera (gaze locking) and is not suitable for gaze redirection, which aims to adjust the eye gaze to any direction.
Compared to the previous methods, our model performs a coarse-to-fine learning process and combines flow field learning for spatial transformation with adversarial learning for recovering the finer texture details. Moreover, we are the first to propose utilizing the gazemap as an input to provide additional spatial and semantic information for gaze redirection. Experimentally, we found that this is beneficial for attaining more precise redirection results.
The pipeline of the proposed method is shown in Fig. 2. It is mainly divided into two learning stages. In the coarse learning stage, an encoder-decoder architecture is proposed to generate coarse-grained results by learning the flow field to warp the input. In the fine learning stage, a multi-task cGAN is designed, which uses a generator network with conditional residual image learning to refine the coarse output by recovering the finer texture details and eliminating the distortion in the eye geometry. Moreover, we propose to employ the multimodal input to guide both the coarse and fine processes to further improve the precision of gaze redirection, as shown in Fig. 3. Before introducing the details, we first clarify the notation.
Two angle domains: source domain A and target domain B. Note that paired samples exist in the two domains.
An input eye image from domain A is associated with its corresponding angles, representing the eyeball pitch and yaw; samples from domain B are defined similarly. Each input eye image has a width, a height, and a number of channels. Samples from the two domains are paired, with different labeled angles. Our model learns the gaze redirection from A to B.
The difference angle vector denotes the difference between the angles in domain B and the angles in domain A.
The gazemap is a two-channel map (eyeball and iris) generated from the angles by a simple mapping. Note that the gazemap is domain-specific and each instance from the same domain has the same gazemap.
3.1 MultiModal-Guided Flow Learning for Coarse-Grained MGGR
To redirect the gaze angles from domain A to domain B, our encoder takes both the input image and the corresponding head pose as inputs. Then, we employ the decoder to attain the coarse-grained output from the encoded code and the multimodal angle guidance. As shown in Fig. 2, the guidance is concatenated into different scales of the decoder to strengthen the guiding ability of the conditional information.
The decoder outputs the learned flow field from domain A to domain B. Similar to DeepWarp [6], we generate the flow field for warping the input to efficiently learn the spatial transformation.
In detail, the last convolutional layer of the decoder produces a dense flow field (a two-channel map), which is applied to warp the input images by means of a bilinear sampler. Here, the sampling procedure samples the pixels of the input image at the positions determined by the flow field.
The warped result is the coarse output. The sampling is applied to every image channel and uses bilinear interpolation to avoid positions with illegal values in the warping process. We use the pixel-wise distance between the coarse output and the ground truth as the training objective function.
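The warping step described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the flow field is assumed to store per-pixel (dx, dy) offsets, out-of-range sampling positions are clamped to the image border, and the function names `bilinear_sample` and `warp_with_flow` are hypothetical. A real system would apply the same operation per channel with a tensor library.

```python
import math


def bilinear_sample(image, x, y):
    """Sample a single-channel image (list of rows) at a real-valued (x, y)."""
    h, w = len(image), len(image[0])
    # Assumption: clamp positions so that illegal coordinates fall on the border.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - ax) * image[y0][x0] + ax * image[y0][x1]
    bottom = (1 - ax) * image[y1][x0] + ax * image[y1][x1]
    return (1 - ay) * top + ay * bottom


def warp_with_flow(image, flow):
    """Warp an image with a dense flow field, where flow[i][j] = (dx, dy)."""
    h, w = len(image), len(image[0])
    return [
        [bilinear_sample(image, j + flow[i][j][0], i + flow[i][j][1])
         for j in range(w)]
        for i in range(h)
    ]
```

With an all-zero flow field the output equals the input, which is a convenient sanity check when wiring such a sampler into a decoder.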
MultiModal Guidance Module with GazeMap.
As shown in Fig. 3, we use multimodal guidance as a condition for improving the visual quality of gaze redirection. In addition to the numeric gaze value, the gazemap image generated from it is integrated into the multimodal term to provide additional spatial and semantic information about the angle direction.
Different from the previous models [6, 10] in gaze redirection, we first take the difference angle vector as input, instead of the target angle, to better preserve identity. Next, we generate the corresponding gazemaps of the source and target angles by a synthesis process (details can be found below). Then, we concatenate the two gazemaps into one term. Finally, the multimodal angle guidance is produced from the difference angle vector and the concatenated gazemaps.
How is the gazemap corresponding to a gaze angle generated, and what are the details of the mapping? Similar to [23], our gazemap is also a two-channel boolean image: one channel is for the eyeball, which is assumed to be a perfect sphere, and the other channel is for the iris, which is assumed to be a perfect circle. For a given output map size and projected eyeball diameter, the coordinates of the iris center are calculated from the input gaze angle.
The iris is drawn as an ellipse. Note that the synthesized gazemap depends only on the angle value, not on the input samples.
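A gazemap of the kind described above can be sketched as follows. The specific geometry is an assumption modeled on a common pictorial gaze-map construction and is not confirmed by this text: the projected eyeball diameter is taken as 1.2 times the map height, the iris center moves with the sine of the angles, and the ellipse is kept axis-aligned for simplicity. The function name `make_gazemap`, the default map size, and the sign convention are all hypothetical.

```python
import math


def make_gazemap(pitch, yaw, height=36, width=60, eyeball_scale=1.2):
    """Return (eyeball, iris) boolean maps for gaze angles given in radians."""
    r = eyeball_scale * height / 2.0            # assumed projected eyeball radius
    orbit = r * math.cos(math.asin(0.5))        # assumed iris-centre distance
    # Assumed iris-centre coordinates as a function of the gaze angles.
    u = width / 2.0 - orbit * math.sin(yaw) * math.cos(pitch)
    v = height / 2.0 - orbit * math.sin(pitch)
    # Assumed ellipse semi-axes; simplified to be axis-aligned.
    a = r / 2.0
    b = max(r * abs(math.cos(pitch) * math.cos(yaw)) / 2.0, 1e-6)
    cx, cy = width / 2.0, height / 2.0
    eyeball = [[(x - cx) ** 2 + (y - cy) ** 2 <= r * r for x in range(width)]
               for y in range(height)]
    iris = [[((x - u) / a) ** 2 + ((y - v) / b) ** 2 <= 1.0 for x in range(width)]
            for y in range(height)]
    return eyeball, iris
```

Because the maps depend only on the angles, they can be precomputed once per distinct angle pair rather than per training sample.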
3.2 Multi-task cGAN for Fine-grained MGGR
The warped result is inevitably blurry when using only the pixel-wise loss. Additionally, it also suffers from unwanted artifacts and unnatural distortions in the shape of the iris for large redirection angles. To remove these problems, we employ a generator G to refine the output of the decoder. Instead of manipulating the whole image directly, we use G to learn the corresponding residual image, defined as the difference between the coarse output and the ground truth. In this way, the manipulation can be performed with modest pixel modifications that provide high-frequency details while preserving the identity information in shape. The learned residual image is added to the coarse output of the decoder network to produce the refined output.
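In symbols, the refinement step can be sketched as below. The notation is an assumption introduced here for illustration: $\tilde{x}_b$ is the coarse (warped) output, $R$ is the residual image predicted by the generator, $x_b$ is the ground truth, and $\hat{x}_b$ is the refined output.

```latex
\hat{x}_b = \tilde{x}_b + R, \qquad R \approx x_b - \tilde{x}_b
```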
Conditional Residual Learning. Learning the corresponding residual image is not a simple task, as it requires the generator to recognize subtle differences. Additionally, previous works [42, 6] indicate that introducing suitable conditional information improves the performance of the generator. Consequently, we employ the input image and the head pose as condition inputs for G. We also take the multimodal angle guidance as input to provide stronger conditional information. Similarly to the coarse process, an image reconstruction loss based on the pixel-wise distance between the refined output and the ground truth is used in the conditional residual learning phase.
The reconstruction loss above penalizes pixel-wise discrepancies and causes blurry results. To overcome this issue, we adopted the perceptual loss proposed in [14] to improve the visual quality of the results. We use a VGG-16 model [27] pre-trained on ImageNet [4]. We encourage the generated images and ground-truth images to have the same representations under this network, using a feature reconstruction loss computed on the output of a chosen layer; in our experiments, we use the activation of the 5th layer. The loss also involves the Gram matrix of these features, the details of which can be found in [7].
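The two ingredients mentioned above, a feature reconstruction term and a Gram-matrix term, can be sketched in plain Python as follows. This is a hedged illustration rather than the paper's exact loss: the features are assumed to be already extracted by a pre-trained network and flattened to one list per channel, the Gram matrix is normalized by the number of channels times the number of positions (a common convention), and the function names are hypothetical.

```python
def gram_matrix(features):
    """Gram matrix of a feature map given as C channels of flattened activations."""
    channels = len(features)
    size = len(features[0])
    norm = float(channels * size)  # assumed normalization by C * H * W
    return [
        [sum(a * b for a, b in zip(features[i], features[j])) / norm
         for j in range(channels)]
        for i in range(channels)
    ]


def feature_loss(feat_a, feat_b):
    """Mean squared difference between two feature maps of equal shape."""
    diffs = [(a - b) ** 2
             for row_a, row_b in zip(feat_a, feat_b)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def style_loss(feat_a, feat_b):
    """Squared Frobenius distance between the Gram matrices of two feature maps."""
    gram_a, gram_b = gram_matrix(feat_a), gram_matrix(feat_b)
    return sum((x - y) ** 2
               for row_a, row_b in zip(gram_a, gram_b)
               for x, y in zip(row_a, row_b))
```

The feature term compares activations position by position, while the Gram term compares channel correlations and is therefore insensitive to where in the image a texture appears.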
Multi-task Discriminator Learning. We design a multi-task discriminator D in our model. Unlike G, which uses multiple terms as conditions, D does not take them as input. Additionally, D is used not only to perform adversarial learning but also to regress the gaze angle. Note that the adversarial and regression branches share most of the layers, with the exception of the last two layers. A regression loss is defined on the predicted gaze angles.
An adversarial loss is defined for G and D.
Overall Objective Functions. As mentioned above, we use the coarse reconstruction loss to train the encoder-decoder for attaining the coarse-grained results.
The overall objective functions for D and G are weighted sums of the loss terms defined above.
The weights are hyper-parameters controlling the contribution of each loss term. Note that one of these terms is used only for optimizing a single network and not for updating the other two networks.
We first introduce the dataset used for evaluation, the training details, baseline models and metrics. We then compare the proposed model with two baselines doing qualitative and quantitative assessments for gaze redirection. Next, we present an ablation study to demonstrate the effect of each component in our model, e.g., flow learning, residual image learning and gazemap guidance. Finally, we investigate the efficiency of our model. We refer to the full model as MGGR and the encoder-decoder with coarse-grained results as MGGRC.
4.1 Experimental Settings
Dataset. We use the Columbia gaze dataset [28], containing 5,880 images of 56 people with varying gaze directions and head poses. For each subject, there are 5 head directions (0°, ±15°, ±30°) and 21 gaze directions (0°, ±5°, ±10°, ±15° for yaw angles and 0°, ±10° for pitch angles). In our experiments, we use the same dataset settings as PRGAN [10]. In detail, we use a subset of 50 people (1-50) for training and the rest (51-56) for testing. To extract the eye regions from the face image, we employ an external face alignment library, dlib [15]. Fixed-size image patches are cropped as the input images for training and testing. Both the pixel values of the images and the gaze directions were normalized to a fixed range. Other publicly available gaze datasets, e.g., MPIIGaze [40] or EYEDIAP [5], provide only low-resolution images and were not considered.
Training Details. We train MGGRC, the generator, and the discriminator separately: MGGRC is trained first, followed by G and D. The optimizer is Adam. The batch size is 8 for all experiments. The learning rate is 0.0001 for MGGRC and 0.0002 for G and D during the first 20,000 iterations, and is then linearly decayed to 0 over the remaining iterations. The loss-weight hyper-parameters are kept fixed in all our experiments.
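The learning-rate schedule described above (constant for the first 20,000 iterations, then linear decay to 0) can be sketched as follows. The total number of iterations is not stated here, so `total_steps` is a hypothetical parameter with a placeholder default, and the function name `learning_rate` is also an assumption.

```python
def learning_rate(step, base_lr, constant_steps=20000, total_steps=40000):
    """Constant learning rate, then linear decay to 0 at total_steps.

    total_steps is a placeholder; the actual training length is not given.
    """
    if total_steps <= constant_steps:
        raise ValueError("total_steps must exceed constant_steps")
    if step < constant_steps:
        return base_lr
    remaining = total_steps - constant_steps
    fraction = max(0.0, (total_steps - step) / remaining)
    return base_lr * fraction
```

For example, with `base_lr=0.0002` the rate stays at 0.0002 up to iteration 20,000 and reaches half that value midway through the decay phase.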
Baseline Models. We adopt DeepWarp [6] and PRGAN [10] as the baseline models. For PRGAN, we use its released code and train it using the default parameters. We reimplemented DeepWarp, as its code is not available. In detail, unlike the original DeepWarp, which is used only for gaze redirection in a single direction, we trained DeepWarp for gaze redirection in arbitrary directions. Additionally, DeepWarp uses 7 eye landmarks as input, including the pupil center. However, detecting the pupil center is very challenging. Thus, we computed the geometric center of the other 6 points as a rough estimate of the pupil center.
Metrics. It remains an open problem to effectively evaluate the appearance consistency and redirection precision of the generated images. Traditional metrics such as PSNR and MS-SSIM are not well correlated with perceptual image quality. Similar to PRGAN, we adopted the LPIPS metric [39] to compute the perceptual similarity in feature space and evaluate the quality of the redirection results. Additionally, we use GazeNet [40] as our gaze estimator; we did not use DPGE [23], for which the code is not publicly available. We pre-trained GazeNet on the MPIIGaze dataset and then trained it on the Columbia dataset.
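As an illustration of how a gaze estimation error can be computed from the estimator's output, the sketch below measures the angle between two gaze directions given as (pitch, yaw) pairs in radians. The text does not specify the exact error definition, so this angular-error formulation, the pitch and yaw to vector convention, and the function names are assumptions.

```python
import math


def gaze_vector(pitch, yaw):
    """Convert (pitch, yaw) in radians to a unit 3D gaze vector (assumed convention)."""
    return (
        -math.cos(pitch) * math.sin(yaw),
        -math.sin(pitch),
        -math.cos(pitch) * math.cos(yaw),
    )


def angular_error_deg(pred, target):
    """Angle in degrees between predicted and target (pitch, yaw) gaze directions."""
    a = gaze_vector(*pred)
    b = gaze_vector(*target)
    dot = sum(x * y for x, y in zip(a, b))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(dot))
```

The result does not depend on the sign convention chosen for the vector, as long as the same convention is applied to both the prediction and the target.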
We first introduce the details of the qualitative and quantitative evaluations. For each head pose, we divide all redirection angles into ten target groups by the sum of the direction differences in both pitch and yaw (e.g., a sum of 0 indicates that the angle differences between domain A and domain B are 0 in both the vertical and horizontal directions). The test results of every group are used for quantitative evaluations. Additionally, we select 10 redirection angles, each given as a pair of angles, as target angles for qualitative evaluations.
Qualitative Results. In the 5th row of Fig. 4 we show the redirection results of MGGR. The visually plausible results in texture and shape and the high redirection precision validate the effectiveness of the proposed model. Additionally, compared to MGGRC (which has no refinement generator module), we conclude that our refined model provides more detailed texture information and eliminates unwanted artifacts and unnatural distortions in the iris shape.
As shown in the 2nd and the 4th rows of Fig. 4, we observe that both DeepWarp and MGGRC redirect the input eye to the target angles, which demonstrates the ability of the flow field to model spatial transformation. Still, DeepWarp has several obvious disadvantages (marked by the yellow box in Fig. 4, with the corresponding zoom-in shown in the bottom row); for example, its textures are more blurry. In contrast, our coarse-grained MGGRC achieves better performance. We attribute this to the fact that the pixel-wise loss works better with the encoder-decoder architecture than with a fully convolutional architecture without scale variation.
As shown in the 3rd and 5th rows of Fig. 4, both PRGAN and MGGR achieve high-quality redirection results with visually plausible textures and natural shape transformation for the iris. However, compared with MGGR, PRGAN suffers from two serious problems: (1) lower quality with poor identity preservation (marked by the red box on the left); (2) incorrect redirection angles and blurry boundaries causing distortion of the eyeball (marked by the yellow box, with zoom-in results in the bottom row). Our results are better because the coarse-to-fine learning process with conditional residual learning is able to recognize subtle redirection angles and, as such, achieves more precise results.
Quantitative evaluation. Fig. 5 shows the curves of gaze estimation error and LPIPS metric on gaze redirection results for different models. The three columns show the curves of redirection results for three head pose angles. It can be observed from the 1st row of Fig. 5 that MGGR achieves a much lower gaze estimation error than DeepWarp and is superior to PRGAN in most cases, indicating that our model achieves more precise gaze redirection results. Additionally, without the refinement process, MGGRC has a much higher gaze error, especially for large gaze directions (e.g., 50°). This is because of the presence of some artifacts and unnatural shapes in its redirected samples.
The 2nd row of Fig. 5 shows the curves of LPIPS scores. Here, we see that MGGR leads to much smaller scores than DeepWarp. Additionally, our model also has lower LPIPS scores than PRGAN, indicating that our method can generate a new eye image which is more perceptually similar to the ground truth. Yet, MGGR has a higher gaze error or LPIPS score in some cases, especially for redirection results at particular head poses. Overall, as shown in Table 1, our model achieves an LPIPS score of 0.0333, lower than 0.0946 for DeepWarp and 0.0409 for PRGAN, and a gaze error of 5.15, lower than 14.18 for DeepWarp and 5.37 for PRGAN.
User Study. We conducted a user study to evaluate the proposed model under human perception. In detail, we divided the gaze redirection results on the test data into three groups by the head pose of the input image and randomly selected 20 samples generated by all methods for each group. Then, for each image, 10 users were asked to indicate the gaze image that looks most similar to the ground truth. Table 2 shows the results of this user study. We can observe that our method outperforms PRGAN and DeepWarp in two of the three head-pose groups. Although PRGAN obtains the majority of votes for the best redirection results in the remaining group, MGGR is selected as the best model on average.
4.3 Ablation Study
Perceptual Loss with Pre-trained VGG model. In Fig. 6, we observe that MGGR without the perceptual loss attains results very close to those of the full model. However, some of its results have more artifacts (marked with the red box in the 2nd column). Additionally, as shown in Fig. 7, the gaze estimation error and LPIPS score are larger when this perceptual loss is removed. It can be concluded that the perceptual loss slightly improves the visual quality and the redirection precision of the generated samples, although the impact of removing it is small.
Residual Learning. We eliminated the residual strategy in Eq. 6 to evaluate its effect. As shown in Fig. 6, the results are very blurry with lots of artifacts. The quantitative evaluations in Fig. 7 are consistent with the visual results.
Flow Learning. Our encoder-decoder network predicts the flow field to warp the input for rapidly learning the spatial transformation in shape. As shown in Fig. 6, our full model achieves more natural results for the iris shape. Additionally, the quantitative results in Fig. 7 demonstrate the effectiveness of flow learning in improving the redirection precision.
GazeMap as Guidance. We propose a multimodal guidance that combines the numeric angle value with the gazemap, which provides spatial and semantic information, to further improve the redirection precision of the proposed model. When the gazemap is removed from this guidance (see the 6th column in Fig. 6), the visual results present more distortions in shape compared with the full model. Additionally, the quantitative results in Fig. 7 demonstrate the effect of the gazemap in improving the redirection precision.
4.4 Analysis of Model by Iteration
In Fig. 8 (top), we can observe that MGGR achieves a lower gaze estimation error than PRGAN at the same iteration count, with one exception. Moreover, MGGR at an earlier iteration count has almost the same performance as at a later one (which is not the case for PRGAN), demonstrating the training efficiency of our model. This is because flow learning, with its fast spatial transformation, and residual learning improve the efficiency of learning the gaze redirection.
In Fig. 8 (bottom), we show the LPIPS evaluation scores, which lead to a similar conclusion as the gaze estimation scores. Additionally, MGGR at an earlier iteration count attains lower scores than at a later one, indicating that there is no point in doing more iterations.
In this paper, we have presented a multimodal-guided gaze redirection model with coarse-to-fine learning. Specifically, the encoder-decoder learns to warp the input by the flow field for a coarse-grained gaze redirection. Then, the generator refines the coarse output to improve the quality of gaze-redirection by removing unwanted artifacts in texture and distortions in shape. The refined model consists of a generator with conditional residual learning and a discriminator for adversarial learning. Moreover, we combine the gazemap and the numeric angle in a multimodal guidance to further improve the quality of gaze redirection. The qualitative and quantitative evaluations well validate the effectiveness of the proposed model and demonstrate that our model obtains better results than the baselines both in visual quality and in redirection precision. In the future, we will consider exploring gaze redirection in the wild.
[1] (2009) Example-based rendering of eye movements. 28 (2), pp. 659–666.
[2] (2019) TC-GAN: triangle cycle-consistent GANs for face frontalization with facial features preserved. In ACM Multimedia, pp. 220–228.
[3] (2018) StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In CVPR.
[4] (2009) ImageNet: a large-scale hierarchical image database. In CVPR.
[5] (2014) EYEDIAP: a database for the development and evaluation of gaze estimation algorithms from RGB and RGB-D cameras. In Symposium on Eye Tracking Research and Applications.
[6] (2016) DeepWarp: photorealistic image resynthesis for gaze manipulation. In ECCV.
[7] Image style transfer using convolutional neural networks. In CVPR, pp. 2414–2423.
[8] (2014) Generative adversarial nets. In NIPS, pp. 2672–2680.
[9] (2019) AttGAN: facial attribute editing by only changing what you want. IEEE Transactions on Image Processing 28 (11), pp. 5464–5478.
[10] (2019) Photo-realistic monocular gaze redirection using generative adversarial networks. In ICCV, pp. 6932–6941.
[11] (2019) AttGAN: facial attribute editing by only changing what you want. IEEE Transactions on Image Processing 28 (11), pp. 5464–5478.
[12] (2017) Globally and locally consistent image completion. ACM Transactions on Graphics (ToG) 36 (4), pp. 107.
[13] (2018) Sparsely grouped multi-task generative adversarial networks for facial attribute manipulation. In ACM Multimedia.
[14] (2016) Perceptual losses for real-time style transfer and super-resolution. In ECCV.
[15] (2009) Dlib-ml: a machine learning toolkit. Journal of Machine Learning Research 10 (Jul).
[16] (2015) Learning to look up: realtime monocular gaze correction using machine learning. In CVPR.
[17] Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4681–4690.
[18] (2018) Diverse image-to-image translation via disentangled representations. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 35–51.
[19] (2017) Perceptual generative adversarial networks for small object detection. CoRR abs/1706.05274.
[20] (2019) STGAN: a unified selective transfer network for arbitrary image attribute editing. arXiv preprint arXiv:1904.09709.
[21] (2019) Gesture-to-gesture translation in the wild via category-independent conditional maps. In ACM Multimedia, pp. 1916–1924.
[22] (2016) Semantic segmentation using adversarial networks. CoRR abs/1611.08408.
[23] (2018) Deep pictorial gaze estimation. In ECCV, pp. 721–738.
[24] (2016) Invertible conditional GANs for image editing. arXiv preprint arXiv:1611.06355.
[25] (2019) GANimation: one-shot anatomically consistent facial animation.
[26] (2019) First order motion model for image animation. In NIPS, pp. 7135–7145.
[27] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[28] (2013) Gaze locking: passive eye contact detection for human-object interaction. In ACM Symposium on User Interface Software and Technology, pp. 271–280.
[29] (2019) Cycle in cycle generative adversarial networks for keypoint-guided image generation. In ACM Multimedia, pp. 2052–2060.
[30] (2017) A-Fast-RCNN: hard positive generation via adversary for object detection. In Conference on Computer Vision and Pattern Recognition (CVPR).
[31] (2019) Instance level facial attributes transfer with geometry-aware flow. In AAAI.
[32] (2018) GazeDirector: fully articulated eye gaze redirection in video. In Computer Graphics Forum, Vol. 37, pp. 217–225.
[33] (2020) Cascade EF-GAN: progressive facial expression editing with local focuses. arXiv preprint arXiv:2003.05905.
[34] (2019) Attribute-driven spontaneous motion in unpaired image translation. In ICCV, pp. 5923–5932.
[35] (2019) TET-GAN: text effects transfer via stylization and destylization. In AAAI.
[36] Generative image inpainting with contextual attention. arXiv preprint arXiv:1801.07892.
[37] (2019) GazeCorrection: self-guided eye manipulation in the wild using self-supervised generative adversarial networks. arXiv preprint arXiv:1906.00805.
[38] (2017) ST-GAN: unsupervised facial image semantic transformation using generative adversarial networks. In ACML, pp. 248–263.
[39] The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pp. 586–595.
[40] (2017) MPIIGaze: real-world dataset and deep appearance-based gaze estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (1), pp. 162–175.
[41] (2020) A survey of deep facial attribute analysis. IJCV, pp. 1–33.
[42] (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232.