Purifying Naturalistic Images through a Real-time Style Transfer Semantics Network

03/14/2019 ∙ by Tongtong Zhao, et al.

Recently, progress in learning-by-synthesis has made it possible to train models on synthetic images, which effectively reduces the cost of human and material resources. However, because the distribution of synthetic images differs from that of real images, the desired performance still cannot be achieved. Real images exhibit multiple forms of light orientation, while synthetic images exhibit a uniform light orientation; these characteristics are typical of outdoor and indoor scenes, respectively. To address this problem, previous methods learned a model to improve the realism of synthetic images. Different from those methods, this paper takes the first step toward purifying real images: through a style transfer task, the distribution of outdoor real images is converted into that of indoor synthetic images, thereby reducing the influence of light. To this end, this paper proposes a real-time style transfer network that preserves the content information of an input real image (e.g., gaze direction, pupil center position) while inferring the style information of a synthetic style image (e.g., image color structure, semantic features). In addition, the network accelerates the convergence of the model and adapts to multi-scale images. Experiments were performed using mixed (qualitative and quantitative) methods to demonstrate the possibility of purifying real images under complex outdoor conditions. Qualitatively, we compare the proposed method with available methods on a series of indoor and outdoor scenes from the LPW dataset. Quantitatively, we evaluate the purified images by training gaze estimation models in a cross-dataset setting. The results show a significant improvement over baseline methods trained on raw real images.


1 Introduction

The accuracy of gaze estimation under indoor conditions is currently about 0.5 to 1 degree, which is commendable, but the accuracy under outdoor conditions is still unsatisfactory, between 8 and 10 degrees. Appearance-based gaze estimation has recently progressed under outdoor conditions thanks to large-scale annotated real-image training sets and the rise of high-capacity deep convolutional networks. However, annotating such training sets requires a great deal of manual labor. To avoid this cost, training on synthetic images is preferred because the annotations are automatically available. There are four main types of eye image synthesis methods: optical flow, three-dimensional eye reconstruction, eye model methods, and GANs (Generative Adversarial Networks). But this solution has a drawback: the distributions of real and synthetic images are quite different. The distribution of synthetic images tends toward indoor lighting, with slight variations depending on the synthesis method. On the other hand, due to the interference of light and other external factors, the distribution of real images is more complicated (it tends toward outdoor lighting) and thus difficult to learn. Therefore, when training on synthetic images, performance on real scenes or real image datasets will not be satisfactory. One solution is to narrow this gap by improving the simulator, which can be expensive and time consuming. Another is to use unlabeled real data to improve the realism of the synthetic images produced by the simulator. This approach is difficult to apply to outdoor (in-the-field) scenes because of its long training time and weak adaptability to the varied situations encountered in the field.

Unlike previous solutions, we view the change in image distribution between real and synthetic images as a gap to be bridged from the other side. Because the distribution of synthetic images is more regular and easier to learn, rather than trying to improve the realism of the synthetic image, we prefer to purify the real image so that it resembles an indoor scene, while retaining annotation information such as the gaze direction. Considering the distribution of an image as its style information and its annotation information as its content information, the problem can be cast as a style transfer task. Based on this, we propose a controllable neural transfer architecture to purify real images.

Our work is similar to that of Gatys et al. [Gatys2015], who proposed a method that uses neural networks to capture the style of artistic images and transfer it to real-world photographs, and to that of Feifei Li et al. [lifeifei2016], who proposed using a perceptual loss function to train feed-forward networks for image transformation tasks. Our approach not only uses high-level feature representations of images from the hidden layers of the VGG convolutional network to separate and recombine content and style, but also trains a feed-forward network to better compute the content and style losses. The main difference from Gatys et al. [Gatys2015] and Feifei Li et al. [lifeifei2016] is that images used for gaze prediction require more precise content information and place more emphasis on preserving the spatial arrangement of the image. We therefore consider the variation in image distribution between real and synthetic images as the gap between indoor and outdoor conditions and propose a real-time style transfer network with semantics for purifying real images.

Our image purification architecture can be divided into three parts: a coarse segmentation network, a feature extraction network and a loss network. We train the coarse segmentation network to segment the pupil and iris regions and to avoid "orphan semantic labels" that appear only in the input image. Due to outdoor lighting effects, the "orphan" label is usually the pupil region, so we constrain the pupil semantic region to the center of the iris region. We also observed that the segmentation does not require pixel precision, because the final output is constrained by our loss network.

For feature extraction, our work is most directly related to that of Gatys et al. [Gatys2015], in which the feature maps of a discriminatively trained deep convolutional neural network are used to achieve breakthrough performance in painterly style transfer. We train a feed-forward feature extraction network for the image transformation task. However, we do not directly use the network of Feifei Li et al. [lifeifei2016]; instead we modify it to remove checkerboard artifacts. Our image transformation network consists of four residual blocks. All non-residual convolutional layers are followed by batch normalization and a ReLU non-linearity, except for the output layer, which uses a scaled tanh to ensure that the output has pixels in the range [0, 255]. The first and last layers use larger kernels; all other deconvolutional layers use kernels with padding 1 and the convolutional layers use kernels with padding 0. The main difference from the network of Feifei Li et al. [lifeifei2016] is that our network aims to learn as much as possible under the premise of the synthetic distribution, to minimize the loss of content during transfer, and to solve the problem of insufficient spatial arrangement information caused by the Gram matrix. To achieve this goal, we propose a loss network with a novel loss function that makes some key modifications to the standard perceptual loss so as to preserve the content of the real image and the distribution of the synthetic image as fully as possible. Our network not only considers the red, green and blue (RGB) color channels, but uses both color and semantic representations for style transfer. Through the semantic features, we can encode the spatial arrangement information and avoid the spatial configuration of the image being destroyed by the style transformation.

Our contributions in this paper are threefold:

1. We take the first step of considering the variation in image distribution between real and synthetic images as the gap between indoor and outdoor conditions, and propose an image purification network architecture that purifies the real image, making it similar to indoor conditions while retaining its annotation information.

2. We propose a loss network with a novel loss function, with some key modifications to the standard perceptual loss, to preserve the content of the real image and the distribution of the synthetic image as fully as possible. Our network not only considers the RGB color channels, but uses both color and semantic feature representations for style transfer. Through the semantic features, we can encode the spatial arrangement information and avoid the spatial configuration of the image being destroyed by the style transfer.

3. We conduct experiments with a hybrid (qualitative and quantitative) research method. The results show that the proposed architecture purifies real images significantly better than existing methods. Using different gaze estimation methods, we achieve improved results in cross-dataset evaluation.

2 Related Works

In general, there are two main types of eye gaze estimation methods: feature-based and appearance-based [hansen2010in]. Feature-based methods aim to identify local features of the eye, such as contours, eye corners, and reflections, from eye images. The pupil and corneal reflections are commonly used for eye localization. Calibration with high-resolution cameras and other specialized hardware, such as synchronized cameras and light sources, can extract more precise geometric properties. However, these gaze features are not stable enough under natural light.

2.1 Appearance-based gaze estimation

The appearance-based approach is believed to work better under natural light. These methods use the image content as input and map it directly to screen coordinates. A regression model of the gaze can then be constructed, from which we obtain the gaze position or the two-dimensional rotation angle when a new eye image is given as input. Compared with feature-based methods, the appearance-based approach does not require any dedicated hardware and exhibits good robustness to outliers. Most appearance-based methods extract feature vectors from cropped eye images and map them into a low-dimensional space, in which the regression model is then constructed.

Recent studies aim to represent the appearance better. Lu et al. [lu2014adaptive] proposed a low-dimensional feature extraction method that divides the eye region into three rows and five columns and computes the gray value and the percentage for each subregion, yielding a 15-dimensional feature vector. However, this feature does not apply to eye images under free head movement. Wang et al. [wang2016appearance] introduced deep features extracted from convolutional neural networks. The deep features have sparse characteristics and provide an effective solution for gaze estimation.

2.2 Eye image synthesis

There are four main categories of eye image synthesis methods: optical flow [lu2015gaze, Wang2014hierarchical], 3D eye reconstruction [Sugano2014learning, Wood2015rendering], model-based methods [Wood2016a3d], and GANs (Generative Adversarial Networks) [Shrivastava2016learning].

For eye image synthesis, Lu et al. [lu2015gaze] used 1D flows to simulate the appearance distortion caused by head pose movement, and Wang et al. [Wang2014hierarchical] introduced 2D interpolation to synthesize the eye appearance variation caused by eyeball movement. These optical flow methods treat eye image synthesis as an optical shift of the original image and cannot be used under large head rotations. Generating eye images by 3D eye reconstruction depends heavily on a pre-trained 3D face model. Sugano et al. [Sugano2014learning] recovered multi-view eye images from 3D shapes of the eye region reconstructed with an 8-camera capture system, while Wood et al. [Wood2015rendering] relied on high-quality head scans to collect high-resolution eye images. In order to generate multi-part eye images, Wood et al. [Wood2016a3d] also presented a morphable model of the facial eye region together with an anatomy-based eyeball model. Model-based methods tune parameters to obtain high-resolution eye images that coincide with the ground-truth situation. Shrivastava et al. [Shrivastava2016learning] used GANs to generate synthetic eye images from unlabeled real data and learned a refiner model that improves the realism of these synthetic images. Although GANs can output different synthetic images for the same input, it is still not possible to control them to generate an image with a specific gaze angle.

2.3 Learning-by-synthesis

Learning-based methods perform well in appearance-based gaze estimation but require large amounts of training data. Learning-by-synthesis approaches were proposed to solve this problem. Wood et al. [Wood2016learning] presented a novel method to synthesize large amounts of variable eye region images as training data, which addressed the limitations of learning-by-synthesis with respect to appearance variability and the distribution of head poses and gaze angles. Other works learn a feature representation in feature space. For instance, Wang et al. [Wang2014hierarchical] proposed an appearance-based gaze estimation method using supervised adaptive feature extraction [AIterative, ALearning, AMultiview, ARobust, ASemantic] and a hierarchical mapping model [ACycle-Consistent, ADeep, AEffective], in which an appearance synthesis method is used to increase the sample density. Zhang et al. [wild2017] introduced a CNN-based gaze estimation method that concatenates the head pose vector in a hidden layer of the neural network; this change improved the performance of CNN-based gaze estimation trained on a synthetic image dataset. Sugano et al. [Sugano2014learning] presented a learning-by-synthesis approach for appearance-based gaze estimation and trained a 3D gaze estimator with a large amount of cross-subject training data. In their experiments, k-nearest neighbor was selected for comparison, from which we can see that k-NN regression estimators perform well with a large number of dense training samples.

2.4 Style Transfer

Previous methods learn a model to improve the realism of synthetic images; instead, we take the first step of purifying real images to weaken the influence of light and convert the distribution of outdoor real images into that of indoor synthetic images. This can be seen as a style transfer task. Global style transfer algorithms process an image by applying a spatially invariant transfer function; Reinhard et al. [Reinhard2001] match the means and standard deviations between the input and reference style images after converting them into a decorrelated color space. Local style transfer algorithms based on spatial color mappings are more expressive and can handle a broad class of applications such as the transfer of artistic edits [Gatys2015, Selim2016] and weather and season changes [Gardner2015]. Our work is similar to that of Gatys et al. [Gatys2015], who proposed a novel approach using neural networks to capture the style of artistic images and transfer it to real-world photographs, and to that of Feifei Li et al. [lifeifei2016], who proposed using a perceptual loss function to train feed-forward networks for image transformation tasks. Our approach not only uses high-level feature representations of images from the hidden layers of the VGG convolutional network to separate and reassemble content and style, but also trains a feed-forward network to better compute the content and style losses.

3 Proposed Method

We first briefly review the style transfer approach introduced by Feifei Li et al. [lifeifei2016], which transfers a reference style image S onto an input image I and generates a stylized image O by minimizing perceptual losses. We also briefly review the style transfer approach introduced by Gatys et al. [Gatys2015], which transfers the reference style image S onto the input image I and generates a stylized image O by minimizing an objective function consisting of a content loss and a style loss.

Figure 1: Overview of the proposed method. Our method can be divided into three parts: a coarse segmentation network, a feature extraction network and a loss network.

The key idea of Feifei Li et al. [lifeifei2016] is that the method consists of two components: an image transformation network f_W and a loss network \phi. f_W is a deep residual convolutional network with weights W; it transforms I into O via the mapping O = f_W(I). The loss network \phi is used to minimize the losses between O and I and between O and S with the perceptual loss method. The loss between O and I is denoted the feature reconstruction loss, which can be represented as:

\ell_{feat}^{\phi,j}(O, I) = \frac{1}{C_j H_j W_j} \left\| \phi_j(O) - \phi_j(I) \right\|_2^2    (1)

where j indexes a convolutional layer and \phi_j(x) is a feature map of shape C_j \times H_j \times W_j. The loss between O and S is denoted the style reconstruction loss, which is the squared Frobenius norm of the difference between the Gram matrices (similar to [Gatys2015]) of O and S:

\ell_{style}^{\phi,j}(O, S) = \left\| G_j^{\phi}(O) - G_j^{\phi}(S) \right\|_F^2    (2)

The Gram matrix G_j^{\phi}(x) can be computed efficiently by reshaping \phi_j(x) into a matrix \psi of shape C_j \times H_j W_j; then G_j^{\phi}(x) = \psi \psi^{\top} / (C_j H_j W_j). The image O is generated by solving the problem

O = \arg\min_{y} \; \lambda_c \, \ell_{feat}^{\phi,j}(y, I) + \lambda_s \, \ell_{style}^{\phi,j}(y, S) + \lambda_{TV} \, \ell_{TV}(y)    (3)

where \lambda_c, \lambda_s and \lambda_{TV} are scalars, the image is initialized with white noise, and optimization is performed using L-BFGS.
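For concreteness, the following is a minimal PyTorch sketch of the feature and style reconstruction losses of Eqs. (1)-(2) and the normalized Gram matrix; the function names are illustrative and this is not the authors' released implementation.

```python
import torch

def gram_matrix(feat):
    # feat: (B, C, H, W) feature map phi_j(x) taken from a pretrained VGG layer
    b, c, h, w = feat.size()
    psi = feat.view(b, c, h * w)                         # reshape to (B, C, H*W)
    return psi.bmm(psi.transpose(1, 2)) / (c * h * w)    # (B, C, C) normalized Gram matrix

def feature_reconstruction_loss(feat_o, feat_i):
    # Eq. (1): normalized squared L2 distance between the feature maps of O and I
    b, c, h, w = feat_o.size()
    return torch.sum((feat_o - feat_i) ** 2) / (b * c * h * w)

def style_reconstruction_loss(feat_o, feat_s):
    # Eq. (2): squared Frobenius norm of the Gram-matrix difference of O and S
    g_o, g_s = gram_matrix(feat_o), gram_matrix(feat_s)
    return torch.sum((g_o - g_s) ** 2) / feat_o.size(0)
```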

The key idea of Gatys et al. [Gatys2015] is that the features extracted by a convolutional network carry information about the content of the image, while the correlations of these features encode the style. The objective function can be represented as:

L_{total} = \sum_{l=1}^{L} \alpha_l \, L_{content}^{l} + \beta_l \, L_{style}^{l}    (4)

where L is the total number of convolutional layers, l indicates the l-th convolutional layer of the deep convolutional neural network, and \alpha_l and \beta_l are the weights that configure the layer preferences. Each layer l with N_l distinct filters has N_l feature maps, each of size M_l, where M_l is the height times the width of the feature map. The responses in layer l can therefore be stored in a matrix F^l \in \mathbb{R}^{N_l \times M_l}, where F_{ij}^{l} is the activation of the i-th filter at position j in layer l. The content loss, denoted L_{content}^{l}, is simply the mean squared error between F^l[O] and F^l[I]:

L_{content}^{l} = \frac{1}{2 N_l M_l} \sum_{i,j} \left( F_{ij}^{l}[O] - F_{ij}^{l}[I] \right)^2    (5)

The style loss, denoted L_{style}^{l}, can be represented as:

L_{style}^{l} = \frac{1}{2 N_l^2} \sum_{i,j} \left( G_{ij}^{l}[O] - G_{ij}^{l}[S] \right)^2    (6)

where the Gram matrix G^l is defined as the inner product between the vectorized feature maps: G_{ij}^{l} = \sum_{k} F_{ik}^{l} F_{jk}^{l}.

Our proposed network (Fig. 1) takes two images together with their masks: an input image, which is a real eye image from a driving-environment video or a real eye image dataset, and a stylized, retouched image from a synthetic image dataset, referred to as the reference style image. We use the result to train the gaze estimator, as we seek to transfer the style of the reference onto the input image while keeping its content and spatial information, which are essential for appearance-based gaze estimation. The proposed architecture can be divided into three parts: a coarse segmentation network, a feature extraction network and a loss network.

Figure 2: Example of "orphan semantic labels". Only the iris region (red) is present in the image, and the coarse segmentation network could not label the pupil region (white).

3.1 Coarse segmentation network

Following Long et al. [Long2015], we train a fully convolutional network as the coarse segmentation network. Given that the task is line-of-sight estimation from human eye images, and to simplify the task, we mark only two kinds of regions in the eye image: the pupil (white) and the iris (red). However, many eye images are affected by light and other factors, and sometimes the pupil and the iris cannot be completely separated. As Fig. 2 shows, only the iris region is present in the image and the coarse segmentation network could not label the pupil region; we call such labels "orphan semantic labels".

Figure 3: Coarse segmentation on the LPW dataset (a) and the MPIIGaze dataset (b). The pupil region is labelled white and the iris region red. Although the test datasets are captured under different illumination conditions, the proposed network achieves good results without "orphan semantic labels".

To avoid "orphan semantic labels", which are caused by the effect of outdoor illumination on the pupil region, we constrain the pupil semantic region to the center of the iris region. Segmentation results without "orphan semantic labels" are shown in Fig. 3. We have also observed that the segmentation does not need to be pixel-accurate, since the output is eventually constrained by our loss network.
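This constraint can be implemented as a simple post-processing step on the coarse segmentation mask. The sketch below is an assumed illustration: the integer label values and the fixed pupil radius are placeholders rather than values taken from the paper.

```python
import numpy as np

IRIS, PUPIL = 1, 2   # assumed integer labels for the two annotated regions

def fix_orphan_pupil(mask, pupil_radius=5):
    """If no pupil pixels were predicted, place a small pupil disc at the iris centroid."""
    mask = mask.copy()
    if (mask == PUPIL).any() or not (mask == IRIS).any():
        return mask                                   # nothing to fix
    ys, xs = np.nonzero(mask == IRIS)
    cy, cx = ys.mean(), xs.mean()                     # centroid of the iris region
    yy, xx = np.mgrid[:mask.shape[0], :mask.shape[1]]
    disc = (yy - cy) ** 2 + (xx - cx) ** 2 <= pupil_radius ** 2
    mask[disc & (mask == IRIS)] = PUPIL               # constrain the pupil to the iris centre
    return mask
```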

3.2 Feature extraction network

For feature extraction, our work is inspired by Gatys et al. [Gatys2015], which employs the feature maps of discriminatively trained deep convolutional neural networks to achieve ground-breaking performance for painterly style transfer [Selim2016]. The main difference in our work is that our approach learns as much as possible under the premise of the synthetic distribution, minimizes the loss of content during transfer, and addresses the lack of spatial arrangement information caused by the Gram matrix.

Figure 4: Overview of the feature extraction network. The first and last layers use larger kernels; all other deconvolutional layers use kernels with padding 1 and the convolutional layers use kernels with padding 0.

Similar to Feifei Li et al. [lifeifei2016], our feature extraction network roughly follows the architectural guidelines set forth by [DCGAN2015]. Feifei Li et al. [lifeifei2016] propose an image transformation network that eschews pooling layers, instead using strided and fractionally strided convolutions for in-network downsampling and upsampling. Their image transformation network comprises five residual blocks; all non-residual convolutional layers are followed by batch normalization and ReLU nonlinearities, with the exception of the output layer. However, from [checkboard2016] we know that the standard approach of producing images with deconvolution has some conceptually simple issues that lead to artifacts in the produced images. Inspired by [checkboard2016], we modified the structure of the image transformation network [lifeifei2016] to obtain our feature extraction network. The structure is shown in Fig. 4 and Table 1.
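One common remedy suggested by [checkboard2016] is to replace strided deconvolution with nearest-neighbour upsampling followed by an ordinary convolution. The PyTorch sketch below shows that remedy; it is a sketch of the idea, with an illustrative kernel size and scale factor, rather than the exact layer configuration of Fig. 4 and Table 1.

```python
import torch.nn as nn

class UpsampleConv(nn.Module):
    """Nearest-neighbour upsampling followed by a stride-1 convolution, a common
    substitute for strided deconvolution that avoids checkerboard artifacts
    (kernel size and scale factor here are illustrative)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, scale=2):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale, mode="nearest")
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride=1,
                              padding=kernel_size // 2)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(self.up(x))))
```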

The body of our network comprises four residual blocks. All non-residual convolutional layers are followed by batch normalization and ReLU nonlinearities, with the exception of the output layer, which instead uses a scaled tanh to ensure that the output has pixels in the range [0, 255].

The first and last layers use larger kernels; all other deconvolutional layers use kernels with padding 1 and the convolutional layers use kernels with padding 0. We use a residual block design similar to [lifeifei2016], but with dropout followed by spatial batch normalization and a ReLU nonlinearity in order to avoid overfitting, as shown in Fig. 5.

Figure 5: The structure of the residual block. Each of our residual blocks contains two convolutional layers with the same number of filters, similar to [lifeifei2016] but with dropout followed by spatial batch normalization and a ReLU nonlinearity in order to avoid overfitting.
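A minimal PyTorch sketch of this residual block is given below; the 3x3 kernel size and the dropout rate are illustrative assumptions, not values stated in the text.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two convolutions with the same number of filters, each followed by
    dropout, spatial batch normalization and ReLU, plus an identity skip."""
    def __init__(self, channels, dropout=0.5):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Dropout2d(dropout),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Dropout2d(dropout),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)   # identity shortcut around the two conv layers
```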

3.3 Loss network

We propose a loss network with a novel loss function. The loss network is a pre-trained VGG-19 [Simonyan2014] network, and we make some key modifications to the standard perceptual losses to keep the content of the real images and the distribution of the synthetic images to the fullest extent.

As shown in Fig. 6, our loss network can be divided into two parts: the style reconstruction loss (a) and the feature reconstruction loss (b). The feature reconstruction loss, denoted \ell_{feat}, is the sum of a global term and a local term; likewise, the style reconstruction loss, denoted \ell_{style}, is the sum of a global term and a local term. As Fig. 6 shows, instead of taking only the RGB color channels into consideration, our network utilizes representations of both color and semantic features for style transfer. With the semantic features, we can encode the spatial arrangement information and avoid the spatial configuration of the image being disrupted by the style transformation.

Figure 6: Overview of the loss network. The loss network can be divided into two parts: the style reconstruction loss (a) and the feature reconstruction loss (b). The feature reconstruction loss \ell_{feat} is the sum of its global and local terms; likewise, the style reconstruction loss \ell_{style} is the sum of its global and local terms.

3.3.1 Feature reconstruction loss

A limitation of the general content loss is that the image structure is not considered when encoding content reconstructions. We address this problem with the image segmentation masks of the input images: with the segmentation mask, the content we wish to preserve can be retained more effectively. To visualise the image information that is encoded at different layers for an input image with masks, we perform gradient descent on a white-noise image to find another image that matches the feature responses of the original masked image. We then define the squared-error loss between the two feature representations as the sum of a global term and a local term:

\ell_{feat} = \ell_{feat}^{global} + \ell_{feat}^{local}    (7)

\ell_{feat}^{global} = \sum_{l=1}^{L} \alpha_l \, L_{content}^{l}(O, I)    (8)

\ell_{feat}^{local} = \sum_{l=1}^{L} \sum_{c=1}^{C} \beta_l \, L_{content}^{l,c}(O, I)    (9)

where C is the number of channels in the semantic segmentation mask, l indicates the l-th convolutional layer of the deep convolutional neural network, and M^{l,c}[\cdot] is the segmentation mask at layer l for channel c. \alpha_l is the weight that configures the layer preference of the global loss, computed between the raw input image and the features extracted by the feature extraction network. \beta_l is the weight that configures the layer preference of the local losses, computed between the segmented input image and the features extracted by the feature extraction network when given the segmented image as input.

As before, each layer l with N_l distinct filters has N_l feature maps, each of size M_l, where M_l is the height times the width of the feature map, so the responses in layer l can be stored in a matrix F^l \in \mathbb{R}^{N_l \times M_l}, where F_{ij}^{l} is the activation of the i-th filter at position j in layer l. The per-layer content losses are the mean squared errors between the (masked) feature maps of O and I:

L_{content}^{l}(O, I) = \frac{1}{2 N_l M_l} \sum_{i,j} \left( F_{ij}^{l}[O] - F_{ij}^{l}[I] \right)^2    (10)

L_{content}^{l,c}(O, I) = \frac{1}{2 N_l M_l} \sum_{i,j} \left( F_{ij}^{l}[O] \, M_{ij}^{l,c}[I] - F_{ij}^{l}[I] \, M_{ij}^{l,c}[I] \right)^2    (11)

By minimizing \ell_{feat}, the image content and overall spatial structure are preserved, but color, texture and exact shape are not. Using this feature reconstruction loss to train our image transformation network encourages the output image O to be perceptually similar to the input image, without forcing them to match exactly.
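The sketch below illustrates one way to compute the combination of the global term and the mask-restricted local terms described above; the element-wise masking and the per-layer weights are modelled on this description and on Luan et al. [Deep2017], and are an assumption rather than the reference implementation.

```python
import torch
import torch.nn.functional as F

def masked_feature_loss(feats_o, feats_i, masks_i, alphas, betas):
    """feats_o / feats_i: lists of (B, C_l, H_l, W_l) VGG feature maps of O and I.
    masks_i: (B, C, H, W) float segmentation masks (values in {0, 1}) of the input image.
    alphas / betas: per-layer weights for the global and local terms."""
    loss = 0.0
    for f_o, f_i, a, b in zip(feats_o, feats_i, alphas, betas):
        # global term: plain squared error between the feature maps
        loss = loss + a * F.mse_loss(f_o, f_i)
        # local terms: restrict the same comparison to each semantic region
        m = F.interpolate(masks_i, size=f_o.shape[-2:], mode="nearest")
        for c in range(m.size(1)):
            mc = m[:, c:c + 1]                      # (B, 1, H_l, W_l) mask channel
            loss = loss + b * F.mse_loss(f_o * mc, f_i * mc)
    return loss
```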

3.3.2 Style reconstruction loss

Feature Gram matrices are effective at representing texture because they capture global statistics across the image through spatial averaging. Since textures are static, averaging over positions is required, but it also makes the Gram matrix completely blind to the global arrangement of objects inside the reference image. If we want to keep the global arrangement of objects and make the Gram matrices controllable enough to be computed over exact regions of the image, we need to add spatial information to the image. Luan et al. [Deep2017] present a method that adds masks to the input image as additional channels and augments the neural style algorithm by concatenating the segmentation channels. Inspired by this, we add the mask as the spatial information needed to compute the Gram matrices over exact regions of the image, so the style loss can be written as:

\ell_{style} = \ell_{style}^{global} + \ell_{style}^{local}    (12)

\ell_{style}^{global} = \sum_{l=1}^{L} \gamma_l \, L_{style}^{l}(O, S)    (13)

\ell_{style}^{local} = \sum_{l=1}^{L} \sum_{c=1}^{C} \delta_l \, L_{style}^{l,c}(O, S)    (14)

where C is the number of channels in the semantic segmentation mask and l indicates the l-th convolutional layer of the deep convolutional neural network. Each layer l with N_l distinct filters has N_l feature maps, each of size M_l, where M_l is the height times the width of the feature map, so the responses in layer l can be stored in a matrix F^l \in \mathbb{R}^{N_l \times M_l}, where F_{ij}^{l} is the activation of the i-th filter at position j in layer l. The masked feature maps, their Gram matrices and the per-layer local style loss are

F^{l,c}[O] = F^{l}[O] \, M^{l,c}[I], \qquad F^{l,c}[S] = F^{l}[S] \, M^{l,c}[S]    (15)

G^{l,c}[\cdot] = F^{l,c}[\cdot] \, \left( F^{l,c}[\cdot] \right)^{\top}    (16)

L_{style}^{l,c}(O, S) = \frac{1}{2 N_l^2} \sum_{i,j} \left( G_{ij}^{l,c}[O] - G_{ij}^{l,c}[S] \right)^2    (17)

where M^{l,c}[\cdot] is the segmentation mask at layer l for channel c. \gamma_l is the weight that configures the layer preference of the global loss, computed between the raw input image and the features extracted by the feature extraction network, and \delta_l is the weight that configures the layer preference of the local losses, computed between the segmented input image and the features extracted by the feature extraction network when given the segmented image as input.
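Analogously, the masked Gram matrices can be formed by multiplying each feature map with a downsampled mask channel before taking the inner product. The sketch below is an illustrative version of the global and local style terms; the normalization and variable names are assumptions.

```python
import torch
import torch.nn.functional as F

def gram(feat):
    b, c, h, w = feat.size()
    psi = feat.view(b, c, h * w)
    return psi.bmm(psi.transpose(1, 2)) / (c * h * w)

def masked_style_loss(feats_o, feats_s, masks_i, masks_s, gammas, deltas):
    """Global Gram loss plus per-channel masked Gram losses between O and the style image S.
    masks_i / masks_s: (B, C, H, W) float segmentation masks of the input and style images."""
    loss = 0.0
    for f_o, f_s, g_w, d_w in zip(feats_o, feats_s, gammas, deltas):
        loss = loss + g_w * F.mse_loss(gram(f_o), gram(f_s))           # global style term
        m_o = F.interpolate(masks_i, size=f_o.shape[-2:], mode="nearest")
        m_s = F.interpolate(masks_s, size=f_s.shape[-2:], mode="nearest")
        for c in range(m_o.size(1)):                                    # local, mask-restricted terms
            loss = loss + d_w * F.mse_loss(gram(f_o * m_o[:, c:c + 1]),
                                           gram(f_s * m_s[:, c:c + 1]))
    return loss
```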

We formulate the style transfer objective by combining the two components:

L_{total} = \lambda_c \, \ell_{feat} + \lambda_s \, \ell_{style}    (18)

where \ell_{feat} is the feature reconstruction loss of Eq. (7) and \ell_{style} is the style reconstruction loss of Eq. (12), each summed over the L convolutional layers with the layer-preference weights defined above, and \lambda_c and \lambda_s are scalars. In all cases the hyperparameters \lambda_c and \lambda_s are kept exactly the same. We find that unconstrained optimization of Equation (18) typically results in images whose pixels fall outside the range [0, 255]. For a fairer comparison with our method, whose output is constrained to this range, for the baseline we minimize Equation (18) using projected L-BFGS. The image O is generated by solving the problem

O = \arg\min_{y} \; \lambda_c \, \ell_{feat}(y, I) + \lambda_s \, \ell_{style}(y, S)    (19)

where the image is initialized with white noise. An advantage of this formulation is that the requirement on the mask is not too strict: it not only retains the desired structural features, but also enhances the estimation of the pupil and iris information during the style reconstruction.

Layer Activation size
Input
Reflection Padding()
conv
conv
conv
Residual block, 128 filters
Residual block, 128 filters
Residual block, 128 filters
Residual block, 128 filters
deconv
deconv
deconv
Output
Table 1: Network architecture used for the feature extraction network.

4 Experimental Results

We experimented with two tasks: style transfer and appearance-based gaze estimation. Previous style transfer work has used optimization to generate images; our feed-forward architecture gives similar qualitative results but is three orders of magnitude faster. Previous work on appearance-based gaze estimation has used refined synthetic images for training and real images for testing, or trained with real images and tested with refined synthetic images. By using simulated data or purified real data for training, and purified real data for testing, we obtain encouraging qualitative and quantitative results.

4.1 Style Transfer

The purpose of the style transfer task is to generate an image that combines the content of the target content image (the real image) with the style of the target style image (the synthetic image). We train an image transformation network for each of several hand-selected style targets and compare our results with the baseline methods of Gatys et al. [Gatys2015] and Feifei Li et al. [lifeifei2016], which we re-implemented. In order to make a fairer comparison with our method, whose output is constrained to [0, 255], for the baselines we minimize Equation (3) and Equation (4) using projected L-BFGS, clamping the image to the range [0, 255] at each iteration. In most cases the optimization converges to satisfactory results within 500 iterations.
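The projected L-BFGS baseline can be sketched as follows: the image itself is the optimization variable and its pixels are clamped back to [0, 255] after every step. The loss_fn argument stands for the combined objective and is an assumed placeholder.

```python
import torch

def stylize_with_projected_lbfgs(loss_fn, init_img, n_iters=500):
    """Baseline optimization: minimize the combined loss directly over the image,
    projecting pixels back to [0, 255] after every L-BFGS step."""
    img = init_img.clone().requires_grad_(True)
    opt = torch.optim.LBFGS([img], max_iter=1)
    for _ in range(n_iters):
        def closure():
            opt.zero_grad()
            loss = loss_fn(img)
            loss.backward()
            return loss
        opt.step(closure)
        with torch.no_grad():
            img.clamp_(0.0, 255.0)   # projection step onto the valid pixel range
    return img.detach()
```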

Figure 7: Comparison with available style transfer methods on the public LPW dataset. (a), (b), (c), (d), (e) and (f) show the purified results for six outdoor conditions from the LPW dataset under three different styles from the UnityEyes dataset. Styles A, B and C represent three different distributions of indoor conditions. For the baselines, the distribution of the pupil and iris regions is dramatically different from the style image, whereas the proposed method separates the pupil and iris regions easily and produces a distribution of these regions similar to the style image.

We resize each of the 80 thousand training images to a fixed resolution and train our network with a batch size of 4 for 50,000 iterations, giving roughly two epochs over the training data. We use the Adam optimizer. The output images are regularized with total variation regularization. We choose one convolutional layer as the local content representation and five convolutional layers as the local style representation, and likewise one convolutional layer as the global content representation and five convolutional layers as the global style representation. Our implementation uses Torch7 and cuDNN; training takes roughly 3 hours on a single GTX Titan X GPU.
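For reference, a minimal sketch of the total variation regularizer applied to the output images; the strength with which it is weighted is a tuned hyperparameter and is not shown here.

```python
import torch

def total_variation(img):
    """Total variation regularizer encouraging spatial smoothness in the output image.
    img: (B, C, H, W) tensor."""
    dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().sum()   # vertical differences
    dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().sum()   # horizontal differences
    return dh + dw
```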

Figure 8: The proposed style transfer network and the methods of Gatys et al. [Gatys2015] and Feifei Li et al. [lifeifei2016] minimize the same objective. We compare their objective values on 50 images; dashed lines and error bars show standard deviations. Our networks are trained at a fixed resolution but generalize to larger images.

Fig. 7 compares the proposed style transfer method with the methods of Gatys et al. [Gatys2015] and Feifei Li et al. [lifeifei2016] across a series of indoor and outdoor scenes from the UnityEyes [Wood2016learning] and LPW [LPW2016] datasets, respectively. (a), (b), (c), (d), (e) and (f) represent six different outdoor conditions from the LPW dataset. Styles A, B and C from the UnityEyes dataset represent three different distributions of indoor conditions; on close inspection, none of these styles has a gaze angle similar to the real images.

From (a), (b) and (c), it can be observed that the proposed method is less affected by light and achieves results similar to Gatys et al. [Gatys2015] and Feifei Li et al. [lifeifei2016], but better preserves the color information of the style image. From (d), (e) and (f), it can be seen that Gatys et al. [Gatys2015] and Feifei Li et al. [lifeifei2016] are influenced by light and other factors, so the pupil and the iris cannot be completely separated; moreover, the distribution of the pupil and iris regions is dramatically different from the style image. The proposed method, in contrast, separates the pupil and iris regions easily, and the distribution of these regions is similar to the style image.

Furthermore, it can be observed that no matter how the style image changes, the distribution of the purified image follows that of the style image, changing only slightly with different style images. The distributions produced by Gatys et al. [Gatys2015] and Feifei Li et al. [lifeifei2016] are more complex, because of light and other external interference, making them difficult to learn for gaze estimation tasks. Note that the proposed method preserves the annotation information while purifying the illumination of the real images.

As evidenced by Gatys et al. [Gatys2015] and Feifei Li et al. [lifeifei2016] and reproduced in Figure 8, an image that minimizes the style reconstruction loss preserves the style characteristics of the target image but not its spatial structure, and reconstruction from higher layers transfers larger-scale structure from the target image. The baseline and our method both minimize Equation (19). The baseline performs explicit optimization over the output image, while our method is trained to find a solution for any content image in a single forward pass. We can therefore quantitatively compare the two methods by measuring the degree to which they successfully minimize Equation (19).

We used Pablo Picasso's Muse as the style image and ran our method and the baseline methods on 50 images from the MS-COCO validation set. For the baseline methods, we record the value of the objective function at each optimization iteration. For our method, we record the value of Equation (19) for each image. From Figure 8 we can see that Feifei Li et al. [lifeifei2016] achieve higher losses, while our method achieves a loss comparable to 0 to 80 iterations of explicit optimization.

Although our networks are trained to minimize Equation (19) at a fixed resolution, they also successfully minimize the objective when applied to larger images. We repeat the same quantitative evaluation for 50 images at two higher resolutions; the results are shown in Figure 8. Even at higher resolutions, our method achieves a loss comparable to 50 to 100 iterations of the baseline method.

Method                                     Time at three resolutions (low to high)
Gatys et al.                               12.69 s    45.88 s    171.55 s
Feifei Li et al.                           0.023 s    0.08 s     0.35 s
Speedup (proposed vs. Gatys et al.)
Speedup (proposed vs. Feifei Li et al.)
Table 2: Speed (in seconds) of our style transfer network vs. Gatys et al. [Gatys2015] and Feifei Li et al. [lifeifei2016] for various resolutions. Across all image sizes, compared to 400 iterations of the baseline method, our method is three orders of magnitude faster than Gatys et al. [Gatys2015], and we achieve better qualitative results (Fig. 7) than Feifei Li et al. [lifeifei2016] at a tolerable speed. Our method processes images at 20 FPS, making it feasible to run in real time or on video. All benchmarks use a Titan X GPU.

Table 2 compares the runtime of our method with that of Gatys et al. [Gatys2015] and Feifei Li et al. [lifeifei2016] for several image sizes. Across all image sizes, compared to 400 iterations of the baseline method, our method is three orders of magnitude faster than Gatys et al. [Gatys2015], and we achieve better qualitative results (Fig. 7) than Feifei Li et al. [lifeifei2016] at a tolerable speed. Our method processes images at 20 FPS, making it feasible to run in real time or on video.

4.2 Appearance-based Gaze Estimation

We evaluate our method for appearance-based gaze estimation on the MPIIGaze dataset and on the purified MPIIGaze dataset, using baseline methods.

Figure 9: Example output of the proposed method on the LPW gaze estimation dataset. The skin texture and the iris region in the purified real images are qualitatively much more similar to the synthetic images than to the real images.

In order to verify the effectiveness of the proposed method for gaze estimation, three public datasets (UTView [Sugano2014learning], SynthesEyes [Wood2015rendering], UnityEyes [Wood2016learning]) are used to train estimators with k-NN [wang2017] and a CNN, and the MPIIGaze dataset [MPIIGaze2017] and the purified MPIIGaze dataset (purified by the proposed method) are used to test the accuracy. The eye gaze estimation network is similar to [IJCNN2018, PR2018]: the input is a gray-scale image that is passed through five convolutional layers followed by three fully connected layers, the last one encoding the 3-dimensional gaze vector: (1) Conv 32 filters, 3x3; (2) Conv 32, 3x3; (3) Conv 64, 3x3; (4) Max-pooling, 3x3; (5) Conv 80, 3x3; (6) Conv 192, 3x3; (7) Max-pooling, 2x2; (8) FC 9600; (9) FC 1000; (10) FC 3; (11) Euclidean loss. All networks are trained with a constant learning rate and a batch size of 512 until the validation error converges.
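A PyTorch sketch of this gaze estimation network is given below. The padding, the use of nn.LazyLinear for the first fully connected layer, and the placement of ReLU activations are assumptions made so the sketch is self-contained; only the layer list above is taken from the text.

```python
import torch
import torch.nn as nn

class GazeCNN(nn.Module):
    """Gaze estimator following the layer list above: five conv layers, two
    max-pooling layers and three fully connected layers ending in a 3-D gaze vector."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(inplace=True),    # (1) Conv 32, 3x3
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),   # (2) Conv 32, 3x3
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),   # (3) Conv 64, 3x3
            nn.MaxPool2d(3),                                          # (4) Max-pooling 3x3
            nn.Conv2d(64, 80, 3, padding=1), nn.ReLU(inplace=True),   # (5) Conv 80, 3x3
            nn.Conv2d(80, 192, 3, padding=1), nn.ReLU(inplace=True),  # (6) Conv 192, 3x3
            nn.MaxPool2d(2),                                          # (7) Max-pooling 2x2
        )
        self.regressor = nn.Sequential(
            nn.LazyLinear(9600), nn.ReLU(inplace=True),               # (8) FC 9600
            nn.Linear(9600, 1000), nn.ReLU(inplace=True),             # (9) FC 1000
            nn.Linear(1000, 3),                                       # (10) FC 3: gaze vector
        )

    def forward(self, x):
        x = self.features(x)
        return self.regressor(torch.flatten(x, 1))

# (11) Euclidean loss between predicted and ground-truth gaze vectors, e.g.:
# loss = torch.nn.functional.mse_loss(model(images), gaze_targets)
```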

Fig. 9 shows examples of real, synthetic and purified real images from the eye gaze dataset. We observe a significant qualitative improvement of the real images: the proposed method successfully captures the skin texture, sensor noise and the appearance of the iris region of the synthetic images. Note that our method preserves the annotation information (gaze direction) while purifying the illumination.

Method                              MPIIGaze    Purified MPIIGaze
Support Vector Regression (SVR)     16.5        14.3
Adaptive Linear Regression (ALR)    16.4        13.9
Random Forest (RF)                  15.4        14.2
k-NN with UTView                    16.2        13.6
CNN with UTView                     13.9        11.7
k-NN with UnityEyes                 12.5        9.9
CNN with UnityEyes                  9.9         7.8
k-NN with SynthesEyes               11.4        8.0
CNN with SynthesEyes                13.5        8.8
Table 3: Test performance (mean angular error in degrees) on MPIIGaze and purified MPIIGaze; purified MPIIGaze is the dataset purified by the proposed method. "Method" denotes the training set used together with the gaze estimation method. Note how purifying the real dataset leads to improved performance.
Figure 10: Quantitative results for appearance-based gaze estimation on the MPIIGaze dataset and the purified MPIIGaze dataset. The plot shows cumulative curves as a function of degree error with respect to the ground-truth eye gaze direction, for different numbers of test examples.

Five gaze estimation methods are used as baseline estimation methods. In addition to common methods such as Support Vector Regression (SVR), Adaptive Linear Regression (ALR) and Random Forest (RF), two methods are reproduced for a fair comparison with the state of the art. The first is a simple cascaded method [wang2017, ACML2018, ICIMCS2017] that uses multiple k-NN (k-Nearest Neighbor) classifiers to select neighbors in a feature space that jointly encodes head pose, pupil center and eye appearance. The other is a simple convolutional neural network (CNN) [wild2017, IJCNN2018, PR2018] trained to predict the eye gaze direction with a Euclidean loss. We train on UnityEyes, UTView and SynthesEyes and test on MPIIGaze and on purified MPIIGaze, which is purified by the proposed method. When testing on the MPIIGaze dataset, the training data can be either a raw real dataset or a synthetic dataset; when testing on the purified MPIIGaze dataset, the training data is either a purified real dataset or a raw synthetic dataset. Table 3 compares the performance of these gaze estimation methods with different datasets; "Method" denotes the training set used together with the gaze estimation method. A large improvement is observed when testing on the output of the proposed method: the gaze estimation error decreases for every training set, by roughly one to five degrees. This improvement shows the practical value of our method in many HCI tasks.
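The degree error reported in Fig. 10 and Table 3 is the angle between the predicted and ground-truth 3-D gaze vectors; a small helper for computing it is sketched below.

```python
import torch

def angular_error_degrees(pred, gt):
    """Angle, in degrees, between predicted and ground-truth 3-D gaze vectors,
    as used for the cumulative error curves."""
    pred = pred / pred.norm(dim=-1, keepdim=True)
    gt = gt / gt.norm(dim=-1, keepdim=True)
    cos = (pred * gt).sum(dim=-1).clamp(-1.0, 1.0)   # clamp guards against rounding error
    return torch.rad2deg(torch.acos(cos))
```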

To verify that the ground-truth gaze direction does not change significantly, we manually labeled the ground-truth pupil centers in 200 real and purified images by fitting an ellipse to the pupil. This is an approximation of the gaze direction, which is difficult for humans to label accurately. The absolute difference between the estimated pupil centers of real and corresponding purified images is quite small: 0.8 ± 1.1 pixels (eye width = 55 px).

5 Conclusion

This paper took the first step toward purifying real images by weakening their distribution, which is a better choice than improving the realism of synthetic images. We applied this method to style transfer and gaze estimation tasks, where we achieved comparable performance and drastically improved speed compared with existing methods. The performance evaluation indicates that the MPIIGaze dataset purified by our method yields a smaller angular error when used for the gaze estimation task than the raw MPIIGaze dataset.

In future work, we intend to explore real-time gaze estimation systems based on the proposed method and to improve the speed of purifying videos.

Acknowledgments

The authors sincerely thank the editors and anonymous reviewers for their very helpful and kind comments, which helped improve the presentation of this paper. This work was supported in part by the National Natural Science Foundation of China under Grants 61370142 and 61802043, by the Fundamental Research Funds for the Central Universities under Grant 3132016352, by the Fundamental Research of the Ministry of Transport of P. R. China under Grant 2015329225300, by the Dalian Science and Technology Innovation Fund 2018J12GX037 and the Dalian Leading Talent Grant, and by the Foundation of the Liaoning Key Research and Development Program.

References

  • (1) Lu F, Sugano Y, Okabe T, et al. Gaze estimation from eye appearance: a head pose-free method via eye image synthesis[J]. IEEE Transactions on Image Processing, 2015, 24(11): 3680-3693.
  • (2) Wang X, Xue K, Nam D, et al. Hierarchical gaze estimation based on adaptive feature learning[C]//Image Processing (ICIP), 2014 IEEE International Conference on. IEEE, 2014: 3347-3351.
  • (3) Sugano Y, Matsushita Y, Sato Y. Learning-by-synthesis for appearance-based 3d gaze estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014: 1821-1828.
  • (4) Wood E, Baltrusaitis T, Zhang X, et al. Rendering of eyes for eye-shape registration and gaze estimation[C]//Proceedings of the IEEE International Conference on Computer Vision. 2015: 3756-3764.
  • (5) Wood E, Baltrušaitis T, Morency L P, et al. A 3D morphable eye region model for gaze estimation[C]//European Conference on Computer Vision. Springer International Publishing, 2016: 297-313.
  • (6) Hansen D W, Ji Q. In the eye of the beholder: a survey of models for eyes and gaze[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2010, 32(3):478-500.
  • (7) Shrivastava A, Pfister T, Tuzel O, et al. Learning from simulated and unsupervised images through adversarial training[J]. arXiv preprint arXiv:1612.07828, 2016.
  • (8) Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C]// Computer Vision and Pattern Recognition. IEEE, 2015:3431-3440.
  • (9) Wang Y, Zhao T, Ding X, et al. Learning a gaze estimator with neighbor selection from large-scale synthetic eye images[J]. Knowledge-Based Systems, 2017.
  • (10) Wood E, Baltrusaitis T, Morency L P, et al. Learning an appearance-based gaze estimator from one million synthesised images[C]//Proceedings of the Ninth Biennial ACM Symposium on Eye Tracking Research & Applications. ACM, 2016: 131-138.
  • (11) Zhang X, Sugano Y, Fritz M, et al. MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • (12) Zhang X, Sugano Y, Fritz M, et al. Appearance-based gaze estimation in the wild[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 4511-4520.
  • (13) Johnson J, Alahi A, Li F F. Perceptual losses for real-time style transfer and super-resolution[C]//European Conference on Computer Vision. Springer, Cham, 2016: 694-711.
  • (14) Gatys L A, Ecker A S, Bethge M. A neural algorithm of artistic style[J]. arXiv preprint arXiv:1508.06576, 2015.
  • (15) Wu L, Wang Y, Shao L. Cycle-Consistent Deep Generative Hashing for Cross-Modal Retrieval[J]. IEEE Transactions on Image Processing 28 (4), 1602-1612.
  • (16) Wang Y, Lin X, Wu L, et al. Effective multi-query expansions: Collaborative deep networks for robust landmark retrieval[J]. IEEE Transactions on Image Processing 26 (3), 1393-1404.
  • (17) Wang Y, Lin X, Wu L, et al. Robust subspace clustering for multi-view data by exploiting correlation consensus[J]. IEEE Transactions on Image Processing, 2015, 24(11): 3939-3949.
  • (18) Wang Y, Wu L, Lin X, et al. Multiview spectral clustering via structured low-rank matrix factorization[J]. IEEE Transactions on Neural Networks and Learning Systems, 2018, 29(10): 4833-4843.
  • (19) Wang Y, Zhang W, Wu L, et al. Iterative views agreement: An iterative low-rank based structured optimization method to multi-view spectral clustering[C]//IJCAI 16 - The 25th International Joint Conference on Artificial Intelligence. 2016: 2153-2159.
  • (20) Wang H, Feng L, Zhang J, et al. Semantic discriminative metric learning for image similarity measurement[J]. IEEE Transactions on Multimedia, 2016, 18(8): 1579-1589.
  • (21) Hu Q, Wang H, Li T, et al. Deep cnns with spatially weighted pooling for fine-grained car recognition[J]. IEEE Transactions on Intelligent Transportation Systems, 2017, 18(11): 3147-3156.
  • (22) Feng L, Wang H, Jin B, et al. Learning a Distance Metric by Balancing KL-Divergence for Imbalanced Datasets[J]. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2018.
  • (23) Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
  • (24) Tonsen M, Zhang X, Sugano Y, et al. Labelled pupils in the wild: a dataset for studying pupil detection in unconstrained environments[C]//Proceedings of the Ninth Biennial ACM Symposium on Eye Tracking Research & Applications. ACM, 2016: 139-142.
  • (25) Selim A, Elgharib M, Doyle L. Painting style transfer for head portraits using convolutional neural networks[J]. ACM Transactions on Graphics (ToG), 2016, 35(4): 129.
  • (26) Reinhard E, Adhikhmin M, Gooch B, et al. Color transfer between images[J]. IEEE Computer graphics and applications, 2001, 21(5): 34-41.
  • (27) Gardner J R, Upchurch P, Kusner M J, et al. Deep manifold traversal: Changing labels with convolutional features[J]. arXiv preprint arXiv:1511.06421, 2015.
  • (28) Radford A, Metz L, Chintala S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks[J]. Computer Science, 2015.
  • (29) Odena A, Dumoulin V, Olah C. Deconvolution and checkerboard artifacts[J]. Distill, 2016, 1(10): e3.
  • (30) Lu F, Sugano Y, Okabe T, et al. Adaptive Linear Regression for Appearance-Based Gaze Estimation.[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2014, 36(10):2033.
  • (31) Wang Y, Shen T, Yuan G, et al. Appearance-based gaze estimation using deep features and random forest regression[J]. Knowledge-Based Systems, 2016, 110(C):293-301.
  • (32) Chang Z, Qiu Q, Sapiro G. Synthesis-Based Low-Cost Gaze Analysis[C]// International Conference on Human-Computer Interaction. Springer International Publishing, 2016:95-100.
  • (33) Luan F, Paris S, Shechtman E, et al. Deep Photo Style Transfer[J]. arXiv preprint arXiv:1703.07511, 2017.
  • (34) Zhao T, Yan Y, Shehu I S, et al. Image Purification Networks: Real-time Style Transfer with Semantics through Feed-forward Synthesis[C]//2018 International Joint Conference on Neural Networks (IJCNN). IEEE, 2018: 1-7.
  • (35) Zhao T, Wang Y, Fu X. Refining Eye Synthetic Images via Coarse-to-Fine Adversarial Networks for Appearance-Based Gaze Estimation[C]//International Conference on Internet Multimedia Computing and Service. Springer, Singapore, 2017: 419-428.
  • (36) Zhao T, Yan Y, Peng J, et al. Guiding Intelligent Surveillance System by learning-by-synthesis gaze estimation[J]. arXiv preprint arXiv:1810.03286, 2018.
  • (37) Zhao T, Yan Y, Peng J J, et al. Refining synthetic images with semantic layouts by adversarial training[C]//Asian Conference on Machine Learning. 2018: 863-878.