Unsupervised Domain Adaptation for Learning Eye Gaze from a Million Synthetic Images: An Adversarial Approach

10/18/2018 ∙ by Avisek Lahiri, et al.

With contemporary advancements in graphics engines, a recent trend in the deep learning community is to train models on automatically annotated simulated examples and apply them to real data at test time. This alleviates the burden of manual annotation. However, there is an inherent distribution difference between images coming from a graphics engine and from the real world. Such a domain gap deteriorates the test-time performance of models trained on synthetic examples. In this paper we address this issue with unsupervised adversarial feature adaptation across the synthetic and real domains for the specific use case of eye gaze estimation, an essential component of various downstream HCI tasks. We initially learn a gaze estimator on annotated synthetic samples rendered from a 3D game engine and then adapt the features of unannotated real samples via a zero-sum min-max adversarial game against a domain discriminator, following the recent paradigm of generative adversarial networks. Such adversarial adaptation forces features of both domains to be indistinguishable, which enables regression models trained on the synthetic domain to be used on real samples. On the challenging MPIIGaze real life dataset, we outperform recent fully supervised methods trained on manually annotated real samples by appreciable margins and also achieve 13% more relative gain after adaptation compared to the current benchmark method of SimGAN.

1. Introduction

A major reason for the contemporary success of deep learning models has been the availability of large annotated datasets. It is undeniable that without an abundance of labeled data, deep learning would not have reached its current pinnacle of success in numerous fields such as object recognition (He et al., 2016; Krizhevsky et al., 2012), object detection (Girshick, 2015; Ren et al., 2015; He et al., 2017), and action recognition (Wang et al., 2015; Rahmani et al., 2018). Large datasets such as ImageNet (Russakovsky et al., 2015), MS-COCO (Lin et al., 2014), PASCAL VOC (Everingham et al., 2010), and YouTube-8M (Abu-El-Haija et al., 2016) have played a vital role in this progress. Often these datasets consist of millions of annotations, which require both time and money. The question of the hour is ‘Can we train deep nets in smarter ways?’ One genre of approach which is quite popular these days is to resort to automated labeled data generation from video game engines. With the rapid progress of graphics research, contemporary engines are capable of rendering high quality visual samples. For example, recent works (Richter et al., 2016; Johnson-Roberson et al., 2017; Lee et al., 2017) show the possibility of collecting a practically unlimited amount of simulated driving scenario data from video games. Similar efforts were also seen for autonomous drones (Shah et al., 2017) and truck driving (Im, 2017).

Figure 2. Visualization of two large scale gaze estimation datasets. (a): UnityEyes (Shrivastava et al., 2017) synthetic dataset with the simulator GUI and some exemplary synthetic samples; (b): MPIIGaze (Zhang et al., 2015) dataset with typical data capture environments and real samples. The core question in this paper is, ‘Can we learn a gaze estimation model from an automatically annotated dataset such as UnityEyes and apply it on a real world dataset such as MPIIGaze with zero supervision from the latter?’

While the prospect of learning from simulated data may look promising, we take a step back and ask, ‘Is this really a free lunch?’ Samples from simulation engines come from a different distribution compared to real world samples. Thus discriminative models trained on synthetic data are expected to perform sub-optimally on real world data compared to a model trained solely on annotated real samples. There are two ways to tackle this problem, viz., a) improve the fidelity of the graphics engine itself, which requires a lot of computationally expensive optimization and is time consuming, or b) project real and synthetic samples to a domain invariant representation space. In this paper, we focus on the second aspect for the particular use case of learning gaze estimation from synthetic samples generated by the Unity game engine and applying it on the real life ‘in-the-wild’ gaze data of MPIIGaze.

We pose the above problem as an unsupervised domain adaptation problem and leverage the recent concept of generative adversarial networks (GAN) (Goodfellow et al., 2014) to match the feature distributions of synthetic and real samples. It is a three stage process as depicted in Fig. 3. We perceive a deep neural network as consisting of two modules: a feature representer and a gaze regressor. In unsupervised domain adaptation, we assume the presence of labeled data from a source domain; in our case it is the simulated/synthetic domain. We train a Source gaze Estimator (SE) on UnityEyes. In Stage 2, we fix SE and initialize a target representer with the weights of SE. However, there are no labels available in the target domain. So, intermediate features of the target and source networks are fed to an adversarial domain classifier which predicts class belongingness based on the features. Gradients from the domain classifier are used for updating the target features. This step pushes the feature distribution of real samples towards that of synthetic samples. In Stage 3, features from the Target Representer are used in conjunction with the regression section of the Source Estimator to predict gaze on real test data. It is assumed that in Stage 2, features of real and synthetic samples have become indistinguishable and thus it makes sense to use the higher order regression specific fully connected layers from the source domain. We show that our model achieves 43% relative improvement after domain adaptation compared to 30% relative improvement achieved by the state-of-the-art method of Shrivastava et al. (Shrivastava et al., 2017) (SimGAN) on the challenging MPIIGaze real gaze dataset.
Contributions:

  • This is the first demonstration of unsupervised (no annotation on real data) adversarial feature adaptation for 3D eye gaze estimation across simulated and real world samples

  • A data driven adaptive feature importance learning framework is introduced for assigning dynamic importance to different layers of a deep neural net during adaptation

  • Going against the usual trend of ‘gradient reversal’ (Ganin et al., 2016) in adversarial adaptation, we empirically show that freezing the source distribution prior to adaptation yields better post-adaptation performance

  • We achieve 43% relative improvement post adaptation compared to 30% improvement by the current state-of-the-art method of SimGAN (Shrivastava et al., 2017)

The rest of the paper is organized as follows. In Sec. 2, we briefly summarize some recent works on unsupervised domain adaptation and cross domain learning. Sec. 3 details our proposed approach. In Sec. 4, we provide a detailed description of the gaze predictor and domain discriminator networks along with other training details. Sec. 5 presents our experimental findings, and finally we conclude the paper with future directions in Sec. 6.

Figure 3. Stepwise flow of our model for adapting a gaze estimator from the automatically annotated synthetic domain to the unannotated real domain. Step 1: The source estimator network is trained on labels of synthetic data; it consists of feature representation layers followed by regression specialized layers. Step 2: The source estimator is frozen. A similar network (the target representer) is initialized with the corresponding source parameters. Combinations of layers from the source and target representers are fed to a domain discriminator which distinguishes features from the two domains. The target representer and domain discriminator are iteratively updated in an adversarial game paradigm (Goodfellow et al., 2014). It is expected that feature representations of real and synthetic samples will become indistinguishable at the termination of this stage. Step 3: For inference on real samples, features are taken from the target representer while the regression specialized fully connected layers of the source estimator are used for gaze estimation.

2. Related Works

2.1. Unsupervised domain adaptation

Domain adaptation at the feature level has been a recent genre of interest in computer vision. Closely related to our approach is the concept of Domain Adversarial Networks (DANN) (Ganin et al., 2016) by Ganin et al. for learning domain invariant features. The source network and target network share the initial few layers for feature adaptation. The source network is trained on the source task while simultaneously a domain classifier discriminates the two classes of features. Our approach is fundamentally different from (Ganin et al., 2016) in the sense that we initially fix the source distribution and treat it as a stationary distribution, which we then try to approximate with the dynamic target distribution through adversarial training. Our approach is thus more aligned with the original formulation of GAN (Goodfellow et al., 2014), in which the objective of the generator is to approximate a stationary natural distribution (in our case the distribution of synthetic samples’ features). A similar approach to (Ganin et al., 2016) was also exploited by Kamnitsas et al. (Kamnitsas et al., 2017) for brain lesion segmentation across different datasets, and it was reported that simultaneous training of the source loss with the domain adversarial loss requires very specific scheduling of each component. As shown in Fig. 3, our three stage training is straightforward and does not require examining individual components to trigger/dampen any part of training. Ghifary et al. (Ghifary et al., 2016) extended DANN by replacing maximization of the domain classification loss with minimization of the Maximum Mean Discrepancy (MMD) metric (Gretton et al., 2012) between features of samples from the two domains.

Another paradigm of feature adaptation using deep learning is to fix the feature representations from both domains and then find subspaces that align the domains (Caseiro et al., 2015; Gopalan et al., 2011). This kind of strategy was also recently applied on deep features by CORAL (Sun et al., 2016), which minimizes the Frobenius norm between a linear projection of the covariance feature matrix of the source domain and the target covariance matrix.

2.2. Learning across synthetic and real domains

Learning from simulated/synthetic data has been an active area of research in recent times. Wood et al. (Wood et al., 2016) used the Unity game engine to generate one million synthetic eye samples to learn a gaze estimator and achieved state-of-the-art performance on appearance based gaze estimation. Synthetic data coming from video games is being actively used in semantic understanding of street videos (Richter et al., 2016; Johnson-Roberson et al., 2017; Lee et al., 2017). This is particularly helpful because collecting street videos is tedious and sometimes impossible. For example, in (Lee et al., 2017), the authors simulated car crashes in video games to predict vehicle collisions in real life. These methods were trained purely on such artificial data and had no access to real datasets. Recently, Shrivastava et al. (Shrivastava et al., 2017) (SimGAN) proposed an adversarial pixel domain adaptation to exploit samples from both the synthetic and real domains. Their idea was to use a ‘pixel level refiner’ network to adversarially transform annotated synthetic data coming from UnityEyes to be visually indistinguishable from real samples of MPIIGaze. A regression model trained on such a transformed image dataset is expected to perform better on real samples. At the same time, a similar approach was proposed by Bousmalis et al. (Bousmalis et al., 2017) for pixel level domain adaptation with an adversarial loss. We take a complementary approach to both (Bousmalis et al., 2017; Shrivastava et al., 2017) in the sense that we adapt the feature representations of the two domains instead of the pixel space. Our intuition is that close adherence of visual properties between two domains might not necessarily indicate comparable performance on discriminative tasks (Salimans et al., 2016). Thus, instead of pixel space adaptation, it is more intuitive to adapt the discriminative features directly related to the task at hand. Our approach encourages features of the two domains to be similar not just based on visual appearance but also utilizes labeled data in the source domain to learn task specific transferable features. This should help in gaining a better relative improvement after adaptation, and indeed we will see in Sec. 5.6 that our method achieves 43% relative improvement after adaptation compared to 30% by (Shrivastava et al., 2017).

3. Approach

3.1. Background on Generative Adversarial Network (GAN)

A generative adversarial network engages two parametrized models, viz., a discriminator $D$ and a generator $G$, in a two-player min-max game. Realized as a feed forward neural net, the generator takes a latent noise vector $z$ drawn from a prior noise distribution $p_z(z)$. Following (Goodfellow et al., 2014), $p_z = \mathcal{U}(-1, 1)$ (uniform distribution) and the generator maps it onto an image, $x_g = G(z)$. The other network, the discriminator, has the task of discriminating samples coming from the true data distribution $p_{data}$ and the generated distribution $p_g$. Specifically, generator and discriminator play the following game on the value function $V(D, G)$:

(1)   $\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$

This min-max game has its global optimum when $p_g = p_{data}$, and this optimum is attainable when both discriminator and generator have enough capacity (Goodfellow et al., 2014). Empirically, it has been observed that for the generator it is prudent to maximize $\log D(G(z))$ instead of minimizing $\log(1 - D(G(z)))$.
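To make this game concrete, here is a minimal PyTorch-style sketch of one discriminator step and one non-saturating generator step for Eq. 1; the framework choice, toy network shapes, and hyperparameters are our own illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

# Toy generator and discriminator; shapes are illustrative only. D outputs a logit.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 28 * 28), nn.Tanh())
D = nn.Sequential(nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.rand(64, 28 * 28) * 2 - 1   # stand-in for a batch from p_data
z = torch.rand(64, 100) * 2 - 1          # z ~ U(-1, 1), the prior of Eq. 1

# Discriminator step: ascend E[log D(x)] + E[log(1 - D(G(z)))].
d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(G(z).detach()), torch.zeros(64, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step, non-saturating trick: maximize log D(G(z)) by labelling fakes as real,
# instead of minimizing log(1 - D(G(z))).
g_loss = bce(D(G(z)), torch.ones(64, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```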

3.2. Unsupervised domain adaptation

The general formulation of unsupervised domain adaptation can be stated as follows. We assume a labeled dataset, often known as the source dataset, $X_S = \{(x_i^s, g_i^s)\}$. In our case, $X_S$ is the UnityEyes dataset, on which we have automated regression labels (3D gaze vectors $g_i^s$) for each image. The unannotated real samples of MPIIGaze constitute the target dataset, $X_T = \{x_j^t\}$.

3.2.1. Training source regression model:

Using the labeled data, we can learn a parametrized source eye gaze regression function, $F_S(x^s; \theta_S)$, where $R_S(x^s)$ denotes the representation of a source image $x^s$. We break down $\theta_S$ into two components, viz., a) the feature extraction/representation section ($\theta_S^{repr}$) and b) the layers specialized for 3D gaze regression ($\theta_S^{reg}$). Together, these parameters are grouped as $\theta_S = \{\theta_S^{repr}, \theta_S^{reg}\}$. The source regression network is optimized using the usual discriminative loss, $L_{reg}$:

(2)   $L_{reg}(\theta_S) = \mathbb{E}_{(x^s, g^s) \sim X_S}\left[ d\big(F_S(x^s; \theta_S), g^s\big) \right]$

where $d(\cdot, \cdot)$ can be any distance metric. In our case we have taken the Euclidean norm between the normalized predicted and ground truth gaze vectors.
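As a concrete reading of Eq. 2, the sketch below (our own PyTorch formulation; the function names are hypothetical) computes the Euclidean loss between unit-normalized predicted and ground truth gaze vectors, together with the angular error used later for evaluation.

```python
import torch
import torch.nn.functional as F

def gaze_regression_loss(pred, target):
    """Eq. 2 with d(.,.) = Euclidean norm between unit-normalized 3D gaze vectors."""
    pred = F.normalize(pred, dim=1)        # unit-normalize the predicted gaze
    target = F.normalize(target, dim=1)    # ground truth gaze as a unit vector
    return torch.linalg.norm(pred - target, dim=1).mean()

def mean_angular_error_deg(pred, target):
    """Evaluation metric: mean angle (in degrees) between predicted and true gaze vectors."""
    cos = (F.normalize(pred, dim=1) * F.normalize(target, dim=1)).sum(dim=1)
    return torch.rad2deg(torch.acos(cos.clamp(-1.0, 1.0))).mean()

pred = torch.randn(8, 3)   # output of the final 3-unit regression layer
gt = torch.randn(8, 3)     # automatically generated UnityEyes gaze labels
loss = gaze_regression_loss(pred, gt)
```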

3.2.2. Domain representation and adaptation:

For a supervised model such as the source regressor, it is straightforward to represent input images as a function of the discriminatively trained convolution layers; different layers of the network give different orders of task specific representations. However, due to the absence of annotated data, obtaining a corresponding representation for the target domain, $R_T$, is a bit tricky. Assuming (we will touch upon this in the upcoming section) we have some way of representing source and target samples with parametrized networks $R_S$ and $R_T$ respectively, our aim is to train a domain discriminator, $D$, a standard binary classifier, which distinguishes between source and target samples based on $R_S(x^s)$ and $R_T(x^t)$. Specifically, $D$ is optimized by minimizing the usual binary classification loss, $L_D$:

(3)   $L_D = -\mathbb{E}_{x^s \sim X_S}\left[\log D\big(R_S(x^s)\big)\right] - \mathbb{E}_{x^t \sim X_T}\left[\log\Big(1 - D\big(R_T(x^t)\big)\Big)\right]$

Eq. 3 is suited for training a domain classifier under the assumption that we have finalized the domain representations $R_S$ and $R_T$. However, the domain representations themselves need to be optimized so as to maximize domain confusion for the discriminator. This is because, if the representation of the unlabeled target domain is indistinguishable from that of the source domain, then the regression section ($\theta_S^{reg}$) of the source domain can operate on the feature section of the target domain for predicting 3D gaze. Thus, in general, the adversarial feature adaptation criteria can be written as:

(4)   $\min_{D} L_D, \qquad \min_{R_S, R_T} L_M(X_S, X_T, D) \quad \text{s.t.} \quad \psi(R_S, R_T)$

where $L_M$ is the feature mapping loss under the constraints of $\psi(R_S, R_T)$.

Returning to the question of ‘How to represent source and target images?’: previous works on transfer learning and domain adaptation prefer to initialize the representation of the target domain to be exactly the same as that of the source domain, but leverage different formulations of the constraint $\psi(R_S, R_T)$ to regularize target representation learning. Usually, $\psi$ is imposed as a layerwise constraint; to be specific, a substantial number of approaches (Tzeng et al., 2015; Ganin et al., 2016) consider exact layerwise equality between the representations of the two domains. Thus, for a multi layer neural network with $K$ layers, the constraint on layer $i$, $\psi_i$, can be expressed as:

(5)   $\psi_i(\theta_S^i, \theta_T^i): \quad \theta_S^i = \theta_T^i$

This genre of approach is termed ‘fully constrained’ adaptation, wherein adaptation is performed over all the layers of the representation. From a practical perspective, such layerwise equality can be imposed by weight sharing. However, fully sharing weights across two domains can lead to sub-optimal performance because a single network has to handle two different domains of input.

To mitigate this, recent efforts focus on learning partially shared representations across domains. In such scenarios, $\psi$ is only imposed on the shared layers of the two networks. In (Rozantsev et al., 2018), the authors show that partial alignment of network weights leads to efficient learning in both semi supervised and unsupervised settings. Influenced by this recent trend, we also choose partial adaptation of the source and target network layers. The selection protocol for the adapted layers is described in Sec. 5.4.

3.2.3. Adversarial loss for feature alignment:

Once we decide how to represent $R_S$ and $R_T$ and the mode of alignment (fully shared or partial), we have to decide the functional form of the constraint $\psi$. A very basic approach is to impose an $L_2$ loss between the shared layers (Tzeng et al., 2015). While simple from an implementation point of view, recent works (Mathieu et al., 2015) have shown that the $L_2$ loss is rather conservative and yields an averaged solution not lying on the original data manifold. In our case this would mean that, while adapting target features with respect to source features using an $L_2$ equality constraint, the optimizer would settle for a low risk feature generation which is not viable for either source or target even though the expected empirical loss is minimized. Such a shortcoming can be alleviated by leveraging the adversarial loss, which encourages solutions to lie near the natural data manifold. Early work on the gradient reversal layer (Ganin et al., 2016) proposed to use the exact zero-sum min-max game formulation of GAN (Goodfellow et al., 2014) by formulating

(6)   $L_M = -L_D$

However, the problem with Eq. 6 is that during the initial phase of training it is very easy for the domain discriminator to distinguish the representations of the two domains, and this leads to small magnitudes of gradients flowing to the target network, which is trying to update itself based on these adversarial gradients. We instead follow the numerical trick in (Goodfellow et al., 2014) by formulating

(7)   $L_M = -\mathbb{E}_{x^t \sim X_T}\left[\log D\big(R_T(x^t)\big)\right]$

Eq. 7 has the same fixed point properties as Eq. 6 but provides higher magnitudes of gradients towards the beginning of training. Note that unlike some recent adversarial adaptation approaches such as (Ganin et al., 2016; Kamnitsas et al., 2017), where the source and target distributions are simultaneously updated, the presented approach keeps the already learnt source distribution constant and tries to align the target distribution to the source. This is more in the spirit of the original GAN formulation, where the objective was to approximate a stationary distribution.
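The practical difference between the saturating objective of Eq. 6 and the non-saturating variant of Eq. 7 can be written down directly; in this sketch (our own formulation, assuming the domain discriminator D outputs logits) `target_feats` stands for the adapted features R_T(x^t) of a batch of real samples.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def target_loss_saturating(D, target_feats):
    # Eq. 6: L_M = -L_D. Restricted to the target term, this minimizes E[log(1 - D(R_T(x^t)))],
    # whose gradients vanish early on, when D confidently rejects target features.
    return -bce(D(target_feats), torch.zeros(len(target_feats), 1))

def target_loss_nonsaturating(D, target_feats):
    # Eq. 7: minimize -E[log D(R_T(x^t))], i.e. label the real-domain features as "synthetic".
    # Same fixed point as Eq. 6, but with stronger gradients at the start of adaptation.
    return bce(D(target_feats), torch.ones(len(target_feats), 1))
```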

3.2.4. Stepwise optimization

Combining Eqs. 2, 3 and 7, and noting that the source representation $R_S$ is parametrized by $\theta_S^{repr}$ while the target representation $R_T$ is parametrized by $\theta_T^{repr}$, the overall optimization criteria for our entire framework can be written as:

(8)   $\min_{\theta_S} L_{reg}, \qquad \min_{D} L_D, \qquad \min_{\theta_T^{repr}} L_M$

This system of objectives is optimized in the following steps. To begin with, we optimize $L_{reg}$ independently on the labeled source domain, i.e., on the UnityEyes dataset. We then fix both $\theta_S^{repr}$ and $\theta_S^{reg}$ and do not update them for the remaining steps of the pipeline. Since we fix $R_S$, optimizing $L_M$ essentially amounts to optimizing over possible alignments for $R_T$ to become indistinguishable from $R_S$. We follow the iterative optimization procedure in (Goodfellow et al., 2014) to optimize $L_D$ and $L_M$. Specifically, in the update step for $L_M$, the parameters of the target network adapt to align the target feature representation (coming from several shared layers, as will be described in Sec. 5.4) with the source representation. In contrast, the update step of $L_D$ forces the domain classifier $D$ to distinguish between features coming from the source and target domains.
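A compact sketch of this schedule is given below, assuming PyTorch and stand-in modules (the actual architectures appear in Sec. 4.1); Stage 1 is ordinary supervised training on UnityEyes, so only the Stage 2 alternation and the Stage 3 stitching are spelled out. All module names are our own placeholders.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in modules; the real networks are described in Sec. 4.1.
source_repr = nn.Sequential(nn.Flatten(), nn.Linear(35 * 55, 128), nn.LeakyReLU(0.2))
source_reg = nn.Sequential(nn.Linear(128, 3))                       # regression-specialized layers
domain_disc = nn.Sequential(nn.Linear(128, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))

# Stage 1 (assumed already done): source_repr + source_reg trained with L_reg on UnityEyes.
# Stage 2: freeze the source network and initialize the target representer from it.
for p in list(source_repr.parameters()) + list(source_reg.parameters()):
    p.requires_grad_(False)
target_repr = copy.deepcopy(source_repr)
for p in target_repr.parameters():
    p.requires_grad_(True)

bce = nn.BCEWithLogitsLoss()
opt_t = torch.optim.Adam(target_repr.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(domain_disc.parameters(), lr=1e-4)

def stage2_step(x_syn, x_real):
    """One alternation of the adversarial game between target representer and domain classifier."""
    f_s, f_t = source_repr(x_syn), target_repr(x_real)
    # Update D (Eq. 3): synthetic features are labelled 1, real (target) features 0.
    d_loss = bce(domain_disc(f_s.detach()), torch.ones(len(x_syn), 1)) + \
             bce(domain_disc(f_t.detach()), torch.zeros(len(x_real), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Update the target representer (Eq. 7): flipped labels make real features look synthetic.
    t_loss = bce(domain_disc(target_repr(x_real)), torch.ones(len(x_real), 1))
    opt_t.zero_grad(); t_loss.backward(); opt_t.step()

# Stage 3: at test time, target features are stitched to the frozen source regression layers.
def predict_gaze(x_real):
    with torch.no_grad():
        return F.normalize(source_reg(target_repr(x_real)), dim=1)
```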

3.2.5. Data driven adaptive feature importance

Given the entire parameter set $\theta_T$ of the target network, the target feature representation $R_T(x^t)$ basically consists of a concatenation of a subset of layers from the representer part of the network, $\theta_T^{repr}$. The naive way would be to adapt all of these layers with equal importance; this has been the general trend in the domain adaptation literature (Tzeng et al., 2017). However, in the absence of any prior, it is prudent to learn the importance of each layer from data. This is made possible by associating a learnable importance vector $\alpha$, which is updated by the adversarial gradients during adaptation. Specifically, with this importance vector, the modified feature representation of $x^t$ can be written as

(9)   $\tilde{R}_T(x^t) = \alpha \odot R_T(x^t)$

where $\odot$ is the Hadamard product operator between $\alpha$ and $R_T(x^t)$.
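A sketch of this importance weighting, in our own PyTorch formulation, is shown below; whether the weighting acts per layer or per feature dimension is a design choice, and here one weight per feature dimension of the concatenated representation is assumed.

```python
import torch
import torch.nn as nn

class WeightedFeatureConcat(nn.Module):
    """Concatenate the adapted feature maps (flattened) and scale them with a learnable
    importance vector alpha via a Hadamard product, in the spirit of Eq. 9."""

    def __init__(self, feat_dim):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(feat_dim))  # learned jointly with adaptation

    def forward(self, feature_maps):
        flat = torch.cat([f.flatten(start_dim=1) for f in feature_maps], dim=1)
        return self.alpha * flat                         # element-wise (Hadamard) weighting

# Example: weighting two adapted convolutional feature maps before the domain discriminator.
f1, f2 = torch.randn(4, 64, 8, 8), torch.randn(4, 192, 8, 8)
mixer = WeightedFeatureConcat(64 * 8 * 8 + 192 * 8 * 8)
weighted = mixer([f1, f2])   # alpha receives gradients from the adversarial game
```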

4. Implementation Details

4.1. Network Architectures

Gaze Estimation Network: We use the CNN architecture for gaze estimation reported in (Shrivastava et al., 2017). In Table 1 we show the details of the layers of the network. The network takes input images of dimension 35×55, which are processed by five layers of 3×3 convolution with stride 1. To ensure invariance to local perturbations, two max pooling layers are also introduced. We consider the network up to and including the first fully connected layer (of 9600 units) as the feature representation block of the entire gaze estimator. Next comes the regression specialized fully connected section of the network, consisting of the last two fully connected layers with 1000 and 3 units respectively. Each layer throughout the network, except the last, is followed by a leaky ReLU non-linearity with a leak (negative) slope of 0.2. The output of the last fully connected layer is not followed by any non-linearity; it is unit normalized before we calculate the Euclidean loss between the predicted and ground truth gaze vectors.

Discriminator Network: The exact architecture of the discriminator depends on the genre of approach we undertake for adapting the features. For single level adaptation, our discriminator is a 2D CNN with 3×3 convolution kernels and stride 2; each such layer reduces the spatial resolution by 2× along each dimension. The first convolution has 16 channels and we double the channel count in each subsequent layer. This is done thrice, reducing the overall spatial dimensions by 8× along each axis. This is followed by a fully connected layer with one output node which yields the probability of the incoming features belonging to the synthetic class. For adapting features from two layers (stacked along the channel dimension), we use 3D CNNs for better exploitation of feature changes along the depth (channel) dimension. Specifically, the smaller feature maps are resized to match the resolution of the bigger maps and concatenated along the channel dimension. We again follow the above principle of stagewise reduction of spatial resolution by repeated application of 3×3×3 (depth, height, width) kernels with a stride of 1×2×2. This is again followed by a fully connected layer with a single node. Leaky ReLU with a negative slope of 0.2 is used after each layer, except the last layer, which uses a sigmoid non-linearity.
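The single-level variant can be read as the following sketch (our own PyTorch rendering; the padding and the example input feature-map shape are assumptions not specified in the text).

```python
import torch
import torch.nn as nn

class SingleLevelDomainDiscriminator(nn.Module):
    """2D CNN domain discriminator for single-level feature adaptation: three 3x3,
    stride-2 convolutions (16 channels, doubled each layer) followed by a single-node FC."""

    def __init__(self, in_channels, feat_h, feat_w):
        super().__init__()
        layers, ch = [], in_channels
        for out_ch in (16, 32, 64):                       # channels: 16 -> 32 -> 64
            layers += [nn.Conv2d(ch, out_ch, kernel_size=3, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]                 # stride 2 halves each spatial dim
            ch = out_ch
        self.conv = nn.Sequential(*layers)                # overall 8x spatial reduction
        self.fc = nn.Linear(64 * ((feat_h + 7) // 8) * ((feat_w + 7) // 8), 1)

    def forward(self, feat):
        h = self.conv(feat).flatten(start_dim=1)
        return torch.sigmoid(self.fc(h))                  # probability of the synthetic class

# Example: features of a 192-channel convolutional layer at an assumed 10x20 resolution.
disc = SingleLevelDomainDiscriminator(192, 10, 20)
prob_synthetic = disc(torch.randn(8, 192, 10, 20))
```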

I/P Channels   O/P Channels   Operation   Kernel   Stride   Name
1              32             Conv        3×3      1        C
32             32             Conv        3×3      1        C
32             64             Conv        3×3      1        C
64             64             MaxPool     3×3      2        P
64             80             Conv        3×3      1        C
80             192            Conv        3×3      1        C
192            192            MaxPool     2×2      2        P
                         Fully Connected (9600)             FC
                         Fully Connected (1000)             FC
                         Fully Connected (3)                FC
Unit Normalization + Euclidean Loss
Table 1. Architecture of the gaze regression network.
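Read as code, Table 1 corresponds roughly to the following PyTorch sketch; the unpadded convolutions and the interpretation of FC (9600) as a 9600-unit linear layer are our assumptions, chosen so that a 1×35×55 input flattens to exactly 9600 features (192×5×10), with leaky ReLU (slope 0.2) after every layer except the last, as stated above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeNet(nn.Module):
    """Sketch of the gaze regression network of Table 1 (padding assumptions are ours)."""

    def __init__(self):
        super().__init__()
        act = nn.LeakyReLU(0.2)
        self.features = nn.Sequential(            # feature representation block
            nn.Conv2d(1, 32, 3), act,             # Conv 3x3, 1 -> 32
            nn.Conv2d(32, 32, 3), act,            # Conv 3x3, 32 -> 32
            nn.Conv2d(32, 64, 3), act,            # Conv 3x3, 32 -> 64
            nn.MaxPool2d(3, stride=2),            # MaxPool 3x3, stride 2
            nn.Conv2d(64, 80, 3), act,            # Conv 3x3, 64 -> 80
            nn.Conv2d(80, 192, 3), act,           # Conv 3x3, 80 -> 192
            nn.MaxPool2d(2, stride=2),            # MaxPool 2x2, stride 2
            nn.Flatten(),
            nn.Linear(9600, 9600), act,           # FC (9600)
        )
        self.regressor = nn.Sequential(           # regression-specialized layers
            nn.Linear(9600, 1000), act,           # FC (1000)
            nn.Linear(1000, 3),                   # FC (3), no non-linearity
        )

    def forward(self, x):
        gaze = self.regressor(self.features(x))
        return F.normalize(gaze, dim=1)           # unit-normalized 3D gaze vector

gaze = GazeNet()(torch.randn(2, 1, 35, 55))       # -> shape (2, 3)
```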

5. Experiments

5.1. Training Details

Source gaze estimation:

Source domain gaze estimation on UnityEyes follows the usual supervised learning approach. We used the Adam optimizer (Kingma and Ba, 2014) for mini batch gradient descent to optimize the source parameter set $\theta_S$. The batch size was 512 and the learning rate was kept at 0.001. Training was stopped when the average error saturated around 2° on a held out validation set of 10,000 UnityEyes samples.
Adversarial Feature Adaptation: During adversarial feature adaptation, the parameters ($\theta_S$) of the source network are frozen. The target representer parameters ($\theta_T^{repr}$) are initialized with their respective components from the source domain. Here also, we used the Adam optimizer to update the target representer based on single or multi level adaptation. Following the iterative training procedure in (Goodfellow et al., 2014), we update the target representer in one step and the domain discriminator in the next step. The learning rate was set to 0.0001 for both competing networks and the batch size was 64. It was particularly important to introduce dropout (Srivastava et al., 2014) in the discriminator network; otherwise the discriminator becomes too powerful and the adaptation stage diverges. Specifically, we used a dropout rate of 25% for convolutional layers and 50% for the fully connected layer.

Single Level Adaptation
Adapted layer:    C     C     C     C     C
Mean error (°):   12.7  12.8  12.5  12.1  12.0

Double Level Adaptation
Adapted layers:   CC    CC    CC    C     CC    CC    CC    CC    CC    CC
Mean error (°):   12.6  12.3  11.9  10.2  12.4  11.7  10.7  12.1  8.8   11.8

Table 2. Self comparison of mean angle error (in °) on the MPIIGaze test set after adversarial feature adaptation (before adaptation: mean error of 14.5°) for different choices of adapted feature maps. Single level adaptation adapts only one specific feature layer across the real and synthetic domains; double level refers to adaptation by concatenating feature maps from two different levels. C refers to a convolution layer of the gaze estimator architecture. See Table 1 for details of each layer.

5.2. Dataset description

UnityEyes (Wood et al., 2016): For the source domain, we have used the automated synthetic eye gaze generation engine of UnityEyes. As shown in Fig. 2, the framework provides a graphical user interface to set up the ranges of camera and eye gaze directions. The graphics engine then randomly generates gaze samples within these ranges at 480×640 resolution. Default settings were used following discussion with the authors of (Shrivastava et al., 2017). We generated 1 million synthetic annotated examples within 7 hours, which shows the effectiveness of using graphics engines for annotated data generation. We also kept 10,000 samples as a validation set. Images were center cropped to 35×55.
MPIIGaze (Zhang et al., 2015): We use this dataset as the target domain and explicitly do not use its labels. MPIIGaze is the largest ‘in-the-wild’ eye gaze dataset, consisting of data captured on consumer laptops in random everyday unconstrained environments. There are a total of 213,659 images from 15 participants, with 80,000 samples for testing. The dataset therefore captures appreciable real world variation such as different poses, lighting, indoor/outdoor settings, and times of day. Images of UnityEyes are first converted to grey scale to be compatible with the MPIIGaze samples. The original range of pixel values [0, 255] was scaled to [-1, 1] for images of both domains as a pre-processing step.
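A sketch of this pre-processing, assuming torchvision transforms (the library choice and exact transform chain are ours):

```python
import torchvision.transforms as T

# Grayscale conversion, 35x55 center crop, and scaling of pixel values to [-1, 1].
preprocess = T.Compose([
    T.Grayscale(num_output_channels=1),   # match MPIIGaze's grey scale images
    T.CenterCrop((35, 55)),               # (height, width) crop used for both domains
    T.ToTensor(),                         # [0, 255] -> [0, 1]
    T.Normalize(mean=[0.5], std=[0.5]),   # [0, 1] -> [-1, 1]
])
```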

5.3. Pre adaptation performance

The source regression network discussed in Sec. 4.1 was trained for 80,000 iterations until the mean error converged at around 1.9° on the held out validation set of UnityEyes. Before any adaptation, the source regression network incurs a mean error of 14.5° on the MPIIGaze test set. We fix the source network and make a copy as an initializer for the target network.

5.4. Selecting layers for adaptation

The natural question which first arises for our approach is, ‘Which layer(s) to adapt?’ Choosing appropriate layers for transfer learning/domain adaptation is still an open problem, mostly studied in the context of object recognition and detection. For example, Tzeng et al. (Tzeng et al., 2014) showed that adapting the last three fully connected layers of AlexNet gives the best performance for cross domain classification. Tzeng et al. (Tzeng et al., 2017), from which our work is adapted, utilized the last fully connected layers of a classification framework for adaptation. This makes sense for object recognition/detection because the higher order features are agnostic to local image statistics; the deeper layers are concerned with capturing a global understanding of an object. However, in our case the scenario is different. Gaze prediction requires a network to analyze local image textures yet manifest robustness to local perturbations. Such lower order features are mainly derived from shallower levels of the network, while we need to resort to deeper channels for local invariance. Thus there is a need to combine the best of both worlds. Our initial experiments of adapting the last two fully connected layers were not promising, with post adaptation errors of 14.3° and 14.1° respectively; this shows that the deepest fully connected sections are task specialized. Thus we keep the last two fully connected layers (of 1000 and 3 units) as the regression specialized section, while the layers up to and including the first fully connected layer (of 9600 units) are kept as the feature representer.

In Table 2 we report the mean error in degrees on the MPIIGaze test set. Note that for every setting reported in Table 2 we have also adapted the first fully connected layer by reshaping it and concatenating it with the convolutional feature maps. From Table 2 we see that adapting multiple layers yields better results compared to adapting only single layers. By adapting the best performing combination of two convolutional feature maps together with the first fully connected layer, we achieved the lowest error of 8.8°. Instead of the vanilla GAN loss formulation for adapting features, we also tried the Wasserstein GAN (Arjovsky et al., 2017) loss formulation, and it helped in reducing the mean angle error to 8.2°, a relative improvement of 43.5% starting from 14.5° before adaptation.

For completeness of analysis, we note that we also experimented with adapting combinations of three and four feature maps. However, the adversarial adaptation phase did not converge properly under such settings, leading to negligible post adaptation improvements. Thus, moving forward, we have not included those configurations in further analysis.

5.5. Comparison with ‘Gradient Reversal’ (Ganin et al., 2016)

Following the usual trend of applying the gradient reversal technique of (Ganin et al., 2016) for domain adversarial learning, we also initially trained the source regression model and the target feature alignment module (aligning a combination of two convolutional layers) simultaneously. However, as also reported by (Kamnitsas et al., 2017), this leads to training instability. For example, with this vanilla strategy we could manage only up to 6° error on the source test set, while the error on the target domain was around 20°. Thus neither the source task nor the adaptation was successful. Following (Kamnitsas et al., 2017), we then trained the source regression model alone for a few epochs before pitching the adversarial learning in conjunction with the source task. This culminated in a mean error of 3° on the source test set and 13.3° on the target test set. In view of both absolute test set performance and ease of training the network components, our method clearly has a significant edge over the gradient reversal technique.

5.6. Comparison with state-of-the-art

In Table 3 we compare our method with recent state-of-the-art methods on the MPIIGaze test set. We report results in two parts. The first part consists of methods which were trained on manually annotated gaze datasets. It is encouraging to see that our method, which involves no human annotation, appreciably surpasses these fully supervised methods. Schneider et al. (Schneider et al., 2014) presented a manifold alignment method for learning person independent, calibration free gaze estimation using a variety of low level features such as Local Binary Patterns (LBP) and the Discrete Cosine Transform (DCT), with different regression frameworks such as regression forests and Support Vector Regression (SVR), trained on the Columbia gaze dataset (Smith et al., 2013). In Table 3 we report their best results with SVR. Lu et al. (Lu et al., 2014) map high dimensional eye features to low dimensional gaze positions with adaptive linear regression (ALR). The ALR helps in selecting sparse training examples via an ℓ1 optimization for high fidelity gaze estimation. Sugano et al. (Sugano et al., 2014) created a massive 3D reconstructed, fully calibrated eye gaze dataset from head and eye pose readings of 50 subjects. The calibration includes 160 different gaze directions and 8 head poses, with a total of 64,000 eye samples. They then learn a random forest regression model on their rendered 3D gaze models for predicting subject independent 3D gaze. Zhang et al. (Zhang et al., 2015) released the largest real life eye gaze dataset to date, MPIIGaze, and trained a multi modal deep neural network using labeled information of both head pose and eye gaze.

In the second part we compare models which have used labels produced only by automatic rendering engines. The seminal work of Wood et al. (Wood et al., 2016) released the UnityEyes synthetic 3D eye gaze dataset and achieves 9.9° error, already an improvement of 4° over the best performing fully supervised method of Zhang et al. (Zhang et al., 2015). As of today, SimGAN (Shrivastava et al., 2017), with its adversarial pixel domain adaptation across UnityEyes and MPIIGaze, is the benchmark for gaze estimation on MPIIGaze. Before adaptation, SimGAN achieves an error of 11.2°, which goes down to 7.8° after adaptation, a relative improvement of 30%. It is to be noted that we intentionally kept the gaze predictor network the same as SimGAN (Shrivastava et al., 2017) to get the same baseline performance before adaptation. However, SimGAN’s reported pre-adaptation error of 11.2° was not reproducible by us with the limited information made public. Before adaptation we attain a mean error of 14.5° on MPIIGaze. After adaptation, our GAN and WGAN based models achieve mean errors of 8.8° and 8.2° respectively. Thus our WGAN based framework achieves a 43% relative improvement over its pre-adaptation performance, whereas SimGAN achieves a relative improvement of 30%. In Fig. 1 we visualize some examples showing that after adaptation the predicted gaze vectors come closer to the ground truth vectors compared to the vectors before adaptation.

Training Genre                     Method                                                  Error (°)
Manually Annotated Real Samples    Schneider et al. (Schneider et al., 2014)               16.5
                                   Lu et al. (Lu et al., 2014)                             16.4
                                   Sugano et al. (Sugano et al., 2014)                     15.4
                                   Zhang et al. (Zhang et al., 2015)                       13.9
Auto Annotated Synthetic Samples   Wood et al. (Wood et al., 2016)                         9.9
                                   SimGAN (Shrivastava et al., 2017) (Before Adaptation)   11.2
                                   SimGAN (After Adaptation)                               7.8
                                   Ours (Before Adaptation)                                14.4
                                   Ours (Adaptation with GAN)                              8.8
                                   Ours (Adaptation with WGAN)                             8.2
Table 3. Comparison of mean angle error (in °) of state-of-the-art algorithms on the MPIIGaze test set. Our best model achieves 8.2° error after adversarial feature space adaptation, a relative improvement of around 43% compared to 14.5° before adaptation. In comparison, SimGAN achieves a relative improvement of 30% after adversarial pixel space adaptation.

6. Discussion and Conclusion

In this paper, we presented an unsupervised domain adaptation paradigm for learning to predict real life ‘in-the-wild’ 3D eye gaze by leveraging a large number of completely unannotated real gaze samples and a pool of one million automatically labeled, graphics engine generated synthetic samples. Going against the traditional ‘gradient reversal’ (Ganin et al., 2016) genre of adversarial adaptation, wherein both source and target distributions are non-stationary and simultaneously updated, we chose to follow a more GAN-like (Goodfellow et al., 2014) approach of fixing the source distribution and approximating this stationary distribution with a dynamic target distribution. Also, quite contrary to the recent approach of (Tzeng et al., 2017), where the authors advocate adapting only the last layer of a deep neural net, we show that for low level and fine grained vision applications such as gaze prediction, it is more prudent to adapt a multi-depth feature representation (aligning features from different depths). Lastly, we showed that in the absence of any prior assumption on the importance of a layer for adaptation, it is beneficial to jointly learn the relative importance of each layer along with feature alignment. Our method achieves a very competitive absolute performance (8.2° post adaptation) compared to the recent benchmark of SimGAN (7.8° post adaptation). More importantly, our method yields a relative improvement of 43% with respect to pre adaptation performance, compared to only 30% relative improvement by SimGAN. Our findings suggest that it might be more prudent to tackle domain adaptation in feature space than in absolute pixel space as done in SimGAN. Since our work is the first attempt at adversarial feature adaptation across UnityEyes and MPIIGaze, an immediate extension would be to combine our method with the pixel adaptation approach of SimGAN. The two methods are complementary, and a joint optimization of pixel and feature adaptation would be an interesting direction.

References

  • Abu-El-Haija et al. (2016) Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. 2016. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675 (2016).
  • Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein gan. arXiv preprint arXiv:1701.07875 (2017).
  • Bousmalis et al. (2017) Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. 2017. Unsupervised pixel-level domain adaptation with generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1. 7.
  • Caseiro et al. (2015) Rui Caseiro, Joao F Henriques, Pedro Martins, and Jorge Batista. 2015. Beyond the shortest path: Unsupervised domain adaptation by sampling subspaces along the spline flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3846–3854.
  • Everingham et al. (2010) Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2010. The pascal visual object classes (voc) challenge. International journal of computer vision 88, 2 (2010), 303–338.
  • Ganin et al. (2016) Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17, 1 (2016), 2096–2030.
  • Ghifary et al. (2016) Muhammad Ghifary, W Bastiaan Kleijn, Mengjie Zhang, David Balduzzi, and Wen Li. 2016. Deep reconstruction-classification networks for unsupervised domain adaptation. In European Conference on Computer Vision. Springer, 597–613.
  • Girshick (2015) Ross Girshick. 2015. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision. 1440–1448.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural information processing systems. 2672–2680.
  • Gopalan et al. (2011) Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. 2011. Domain adaptation for object recognition: An unsupervised approach. In Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 999–1006.
  • Gretton et al. (2012) Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. 2012. A kernel two-sample test. Journal of Machine Learning Research 13, Mar (2012), 723–773.
  • He et al. (2017) Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 2980–2988.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • Im (2017) Gyuri Im. 2017. A toolkit for controlling Euro Truck Simulator 2 with Python to develop self-driving algorithms. (2017). https://github.com/marsauto/europilot
  • Johnson-Roberson et al. (2017) Matthew Johnson-Roberson, Charles Barto, Rounak Mehta, Sharath Nittur Sridhar, Karl Rosaen, and Ram Vasudevan. 2017. Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks?. In Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 746–753.
  • Kamnitsas et al. (2017) Konstantinos Kamnitsas, Christian Baumgartner, Christian Ledig, Virginia Newcombe, Joanna Simpson, Andrew Kane, David Menon, Aditya Nori, Antonio Criminisi, Daniel Rueckert, and others. 2017. Unsupervised domain adaptation in brain lesion segmentation with adversarial networks. In International Conference on Information Processing in Medical Imaging. Springer, 597–609.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
  • Lee et al. (2017) Kangwook Lee, Hoon Kim, and Changho Suh. 2017. Crash to not crash: Playing video games to predict vehicle collisions. In ICML Workshop on Machine Learning for Autonomous Vehicles.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European conference on computer vision. Springer, 740–755.
  • Lu et al. (2014) Feng Lu, Yusuke Sugano, Takahiro Okabe, and Yoichi Sato. 2014. Adaptive linear regression for appearance-based gaze estimation. IEEE transactions on pattern analysis and machine intelligence 36, 10 (2014), 2033–2046.
  • Mathieu et al. (2015) Michael Mathieu, Camille Couprie, and Yann LeCun. 2015. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440 (2015).
  • Rahmani et al. (2018) Hossein Rahmani, Ajmal Mian, and Mubarak Shah. 2018. Learning a deep model for human action recognition from novel viewpoints. IEEE transactions on pattern analysis and machine intelligence 40, 3 (2018), 667–681.
  • Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems. 91–99.
  • Richter et al. (2016) Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. 2016. Playing for data: Ground truth from computer games. In European Conference on Computer Vision. Springer, 102–118.
  • Rozantsev et al. (2018) Artem Rozantsev, Mathieu Salzmann, and Pascal Fua. 2018. Beyond sharing weights for deep domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018).
  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, and others. 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211–252.
  • Salimans et al. (2016) Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training gans. In Advances in Neural Information Processing Systems. 2234–2242.
  • Schneider et al. (2014) Timo Schneider, Boris Schauerte, and Rainer Stiefelhagen. 2014. Manifold alignment for person independent appearance-based gaze estimation. In Pattern Recognition (ICPR), 2014 22nd International Conference on. IEEE, 1167–1172.
  • Shah et al. (2017) Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. 2017. AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles. In Field and Service Robotics. https://arxiv.org/abs/1705.05065
  • Shrivastava et al. (2017) Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. 2017. Learning from Simulated and Unsupervised Images through Adversarial Training. In CVPR, Vol. 2. 5.
  • Smith et al. (2013) Brian A Smith, Qi Yin, Steven K Feiner, and Shree K Nayar. 2013. Gaze locking: passive eye contact detection for human-object interaction. In Proceedings of the 26th annual ACM symposium on User interface software and technology. ACM, 271–280.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
  • Sugano et al. (2014) Yusuke Sugano, Yasuyuki Matsushita, and Yoichi Sato. 2014. Learning-by-synthesis for appearance-based 3d gaze estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1821–1828.
  • Sun et al. (2016) Baochen Sun, Jiashi Feng, and Kate Saenko. 2016. Return of frustratingly easy domain adaptation. In AAAI, Vol. 6. 8.
  • Tzeng et al. (2015) Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. 2015. Simultaneous deep transfer across domains and tasks. In Proceedings of the IEEE International Conference on Computer Vision. 4068–4076.
  • Tzeng et al. (2017) Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. 2017. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), Vol. 1. 4.
  • Tzeng et al. (2014) Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. 2014. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474 (2014).
  • Wang et al. (2015) Limin Wang, Yu Qiao, and Xiaoou Tang. 2015. Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4305–4314.
  • Wood et al. (2016) Erroll Wood, Tadas Baltrušaitis, Louis-Philippe Morency, Peter Robinson, and Andreas Bulling. 2016. Learning an Appearance-Based Gaze Estimator from One Million Synthesised Images. In Proceedings of the Ninth Biennial ACM Symposium on Eye Tracking Research & Applications. 131–138.
  • Zhang et al. (2015) Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. 2015. Appearance-based gaze estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4511–4520.