Recent generative models can synthesize realistic images by using adversarial methods. However, realism is not the only requirement for computer vision research, as content and style can play important roles for specific tasks, such as regression tasks that require high accuracy, e.g. hand joint regression and eye gaze estimation. In this paper, we study the task of generating realistic near-eye images while preserving the content defined by a semantic segmentation mask and the style defined by a few images from a target person. We propose two methods to tackle this task. Our first method uses image refinement and is the winning solution of the OpenEDS Synthetic Eye Generation Challenge 2019 (https://research.fb.com/programs/openeds-challenge), and our second method is a novel architecture for ensuring preservation of desired content and style. Because the first method optimizes for a very specific error metric, its generated images show blurry regions. We therefore propose the second, more principled method for image synthesis, which produces realistic high-quality images that still satisfy both content and style, and furthermore allows for interpolation between styles. This method, Seg2Eye (Style and Semantic Segmentation preserving GAN), uses content-preserving spatially adaptive normalization blocks (SPADE) [Park2019SPADE] alongside style-preserving adaptive instance normalization layers (AdaIN) [Huang2017AdaIN, Huang2018ECCV, karras2018stylegan] to inject both content and style information at different feature map scales. It is simple yet highly effective when applied in conjunction with a style consistency loss. In addition, style injection in Seg2Eye is performed from latent embeddings of multiple reference style images from the target person. This allows for control as well as sampling from the learned latent space to synthesize entirely new people (see Figure 1).
2 Related Work
Recent works in gaze estimation modify existing synthetic or real eye images through domain randomization [Park2018ETRA], style transfer [Shrivastava2017CVPR, Sela2017arXiv, Lee2018ICLR] and gaze re-direction [Ganin2016ECCV, wood2018gazedirector, Yu2019CVPR, He2019ICCV, Park2019ICCV] to yield data for training more robust and accurate models. However, the style transfer methods can be poor at preserving content (i.e. eye shape), and re-direction methods can struggle to extrapolate beyond the gaze directions available during training.
Prior work in image translation often trains domain-specific models for cross-domain style transfer [isola2017pix2pix, wang2018pix2pixhd], uses Adaptive Instance Normalization (AdaIN) to allow real-time control of visual style [Li2017AAAI, Huang2017AdaIN], and more recently injects style [karras2018stylegan] or content information (via SPADE blocks [Park2019SPADE]) at multiple feature map scales. Our proposed Seg2Eye approach combines the power of AdaIN and SPADE blocks to control the style and content of generated eye images, respectively.
The provided dataset for the challenge, OpenEDS [Garbin2019arXiv], is a collection of near infra-red eye images from people captured by a virtual-reality headset (similar to [Kim2019CHI]) and includes segmentation map annotations for pupil, iris and sclera regions for a subset of images and corneal topography data for of the participants. The unlabeled dataset has two subsets: a “generative” subset with images, and a “sequence” dataset with frames (from seconds videos collected at Hz).
In this section, we describe how we won the OpenEDS challenge, and our suggested approach to generating realistic eye images while preserving both content and style.
4.1 Eye Segmentation and Similarity Ranking
The OpenEDS dataset exhibits two modes, which we assume to come from left vs. right eyes. Feeding a left-eye image to generate a synthetic right-eye image degrades performance, because sources of high error (glints, black regions from the head mount) are not apparent in the segmentation maps and do not appear consistently in the unlabeled images. Hence, there is a need to carefully select images that come from the same mode. In addition, the more similar the selected unlabeled image is to the target image, the easier it is for the network to produce high-quality output.
In order to find unlabeled images that are similar to a segmentation mask, we train a DeepLab v3+ network [Chen2016DeepLab:CRFs, chen2017rethinking] to predict pseudo-labels (segmentation masks) on the unlabeled dataset and compare them to the target segmentation mask. We use mean squared error as the similarity measure. The unlabeled images are then ranked by the similarity of their predicted segmentation mask to the target segmentation mask. We found that the matching performs better when each segmentation mask region is colored with the mean value of the respective region across all persons in the training set.
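The ranking step can be illustrated with a minimal NumPy sketch. It is not the implementation used for the challenge: the function name `rank_by_mask_similarity` and the `class_values` lookup table (standing in for the per-region mean intensities) are hypothetical, and the masks are assumed to be integer label maps of equal size.

```python
import numpy as np

def rank_by_mask_similarity(target_mask, candidate_masks, class_values=None):
    """Rank candidate masks by mean squared error to the target (best first).

    target_mask:     (H, W) integer label map.
    candidate_masks: (N, H, W) integer label maps (e.g. predicted pseudo-labels).
    class_values:    optional per-class intensities used to "color" the masks
                     before comparison (hypothetical stand-in for region means).
    """
    target = np.asarray(target_mask)
    candidates = np.asarray(candidate_masks)
    if class_values is not None:
        lut = np.asarray(class_values, dtype=np.float64)
        target = lut[target]          # replace each label by its intensity
        candidates = lut[candidates]
    target = target.astype(np.float64)
    candidates = candidates.astype(np.float64)
    # One MSE value per candidate, averaged over all pixels.
    mse = ((candidates - target[None]) ** 2).mean(axis=(1, 2))
    order = np.argsort(mse, kind="stable")
    return order, mse
```

The first entries of `order` are the candidates whose pseudo-labels are closest to the target mask; a subset such as the top-ranked images can then serve as reference or style images.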
4.2 Refiner Network
Learning a Residual Map.
The refiner is a DeepLab v3+ network [chen2017rethinking] with modified inputs and a different loss function. It learns a residual map from the target segmentation map, a similar reference image from the same mode and the pseudo-label of that reference image. The predicted residual map is added to the reference image to produce the final output image. We train the refiner end-to-end with mean L2 error as the optimization objective.
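The residual formulation can be sketched as follows. This is only an illustration: `predict_residual` is a hypothetical placeholder for the trained network, the inputs are assumed to be single-channel arrays of equal size, and reading "mean L2 error" as the mean squared pixel difference is an assumption.

```python
import numpy as np

def refine(reference_image, target_mask, reference_pseudo_label, predict_residual):
    """Add a predicted residual map to the reference image.

    predict_residual is a placeholder for the refiner network: it receives the
    stacked inputs of shape (3, H, W) and returns a residual of shape (H, W).
    """
    inputs = np.stack(
        [target_mask, reference_image, reference_pseudo_label], axis=0
    ).astype(np.float64)
    residual = predict_residual(inputs)
    return np.asarray(reference_image, dtype=np.float64) + residual

def mean_squared_error(output, target):
    """Training objective in this sketch (assumed reading of 'mean L2 error')."""
    diff = np.asarray(output, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    return float(np.mean(diff ** 2))
```

With a residual of zero the output is simply the reference image, which is why a well-matched reference makes the refiner's task easier.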
4.3 Seg2Eye Network
While the refinement approach (Section 4.2) wins the challenge, the images it produces are not comparable in visual fidelity to the near-IR eye images in the dataset (Figure 3). Furthermore, the refinement approach is unable to synthesize new styles (or person identities). Hence, we propose a generative adversarial network approach, which learns latent style embeddings of unlabeled images and merges them in order to produce photo-realistic outputs.
The basis of our method is a mixture between GauGAN [Park2019SPADE] and StyleGAN [karras2018stylegan]. GauGAN is a recently proposed generative adversarial network (GAN) with strong segmentation mask (content) consistency, whereas StyleGAN learns a suitable latent representation of style and injects it via AdaIN [Huang2017AdaIN].
Stylizing the Output.
Figure 2c illustrates the information flow from the encoder to the generator. We calculate the latent style code by sampling a set of images $\{s_1, \dots, s_n\}$ from a specified target person and embedding them via an encoder network $E$. The style codes are aggregated by taking the element-wise maximum: $z = \max_{i} E(s_i)$.
This is inspired by the set-based face recognition literature [Parkhi2015BMVC, Schroff2015CVPR], where variations in appearance and head pose in the real world and the consequent loss of information can be mitigated by merging information from multiple images of the same person. Our style encoder uses spectral instance normalization in order to preserve style information [Huang2017AdaIN, Li2017AAAI]. The aggregated style code is used to calculate the parameters of AdaIN in the generator blocks. In this way, we can create a realistic eye image of a person with just a few unlabeled eye images and perform latent walks in style space.
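As a small illustration of the aggregation step, the following sketch takes already-computed style codes (one row per style image) and reduces them with an element-wise maximum. The function name is hypothetical, and the encoder itself is not modelled here.

```python
import numpy as np

def aggregate_style_codes(style_codes):
    """Element-wise maximum over a set of style codes.

    style_codes: array-like of shape (n_images, code_dim), one encoder output
    per style image of the same person.
    """
    codes = np.asarray(style_codes, dtype=np.float64)
    if codes.ndim != 2:
        raise ValueError("expected an array of shape (n_images, code_dim)")
    return codes.max(axis=0)
```

Because the maximum is taken per dimension, the result does not depend on the order of the style images and has the same size regardless of how many images are sampled.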
In order to produce images following the input segmentation mask, our generators repeatedly apply spatially adaptive normalization (SPADE) [Park2019SPADE]. In comparison to batch or instance norm, where the adaptive parameters modify channels of feature maps as a whole, the SPADE layer learns a scale and offset for each element in the feature map based on the reference segmentation mask.
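The difference between channel-wise and spatially adaptive modulation can be made concrete with a simplified sketch. It is not the SPADE implementation of [Park2019SPADE]: the learned convolutions that map the mask to scale and offset maps are replaced by a hypothetical per-class lookup (`maps_from_mask`), and plain instance normalization is used for the parameter-free normalization step.

```python
import numpy as np

def instance_normalize(features, eps=1e-5):
    """Normalize each channel of a (C, H, W) array to zero mean, unit variance."""
    mean = features.mean(axis=(1, 2), keepdims=True)
    var = features.var(axis=(1, 2), keepdims=True)
    return (features - mean) / np.sqrt(var + eps)

def channelwise_modulation(features, gamma, beta):
    """One scale and offset per channel: gamma and beta have shape (C,)."""
    normalized = instance_normalize(features)
    return normalized * gamma[:, None, None] + beta[:, None, None]

def maps_from_mask(mask, class_gamma, class_beta):
    """Hypothetical stand-in for SPADE's learned convolutions.

    mask is an (H, W) integer label map; class_gamma and class_beta have shape
    (num_classes, C). Every pixel receives the parameters of its class.
    """
    gamma_map = np.moveaxis(class_gamma[mask], -1, 0)  # (C, H, W)
    beta_map = np.moveaxis(class_beta[mask], -1, 0)    # (C, H, W)
    return gamma_map, beta_map

def spatial_modulation(features, gamma_map, beta_map):
    """One scale and offset per element: both maps have shape (C, H, W)."""
    return instance_normalize(features) * gamma_map + beta_map
```

In the channel-wise case every pixel of a channel is modulated identically, whereas in the spatial case pixels belonging to different mask regions (e.g. pupil vs. sclera) can receive different scales and offsets.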
Our generator model is an extension of GauGAN [Park2019SPADE], a model that applies spatially adaptive normalization (SPADE) to generate synthetic images given a semantic segmentation mask. We modify the SPADE block to inject both content and style. Our combined normalization, the SPADE+Style Block, takes three inputs: a semantic segmentation map $m$, a style vector $z$ and the actual feature maps $h$. The input feature maps take two different paths and are merged again by addition before leaving the block. One path is an AdaIN [Huang2017AdaIN] layer where the parameters are computed from the style vector. The AdaIN layer applies a learned affine transform to adjust the dimensionality of the style vector to the correct number of channels. The other path is a SPADE block as described by [Park2019SPADE], but we apply spectral instance norm instead of synchronized batch norm. The output is divided by two, which we found to stabilize the training. In short, we compute $h_{\mathrm{out}} = \frac{1}{2}\left(\mathrm{SPADE}(h, m) + \mathrm{AdaIN}(h, z)\right)$.
Following GauGAN, our generator starts from a small segmentation map and then doubles feature map resolution in a series of SPADE+Style ResBlocks. Each SPADE+Style ResBlock consists of two SPADE+Style Blocks and a residual connection. Figure 2 shows the elements of a SPADE+Style (a) Block and (b) ResBlock.
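The two-path structure of the block can be sketched in a few lines of NumPy. This is a simplified illustration under several assumptions: the scale and offset maps of the SPADE path are passed in directly rather than predicted from the mask, `weight` and `bias` are hypothetical parameters of the affine transform in the AdaIN path, and plain instance normalization replaces the normalization layers of the actual model.

```python
import numpy as np

def instance_normalize(h, eps=1e-5):
    """Normalize each channel of a (C, H, W) array to zero mean, unit variance."""
    mean = h.mean(axis=(1, 2), keepdims=True)
    var = h.var(axis=(1, 2), keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

def adain_path(h, style, weight, bias):
    """Style path: an affine map turns the style vector into 2*C parameters."""
    channels = h.shape[0]
    params = weight @ style + bias            # shape (2*C,)
    scale, shift = params[:channels], params[channels:]
    return instance_normalize(h) * scale[:, None, None] + shift[:, None, None]

def spade_path(h, gamma_map, beta_map):
    """Content path: per-element scale and offset maps of shape (C, H, W)."""
    return instance_normalize(h) * gamma_map + beta_map

def spade_style_block(h, gamma_map, beta_map, style, weight, bias):
    """Merge both paths by addition and divide the result by two."""
    content = spade_path(h, gamma_map, beta_map)
    styled = adain_path(h, style, weight, bias)
    return 0.5 * (content + styled)
```

Halving the sum keeps the output on the same scale as a single normalization path, which is consistent with the stabilizing effect described in the text.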
We train Seg2Eye on paired labeled samples from the training subset of OpenEDS, and sample style images from the top 200 images from the similarity ranking stage (Section 4.1). Further implementation information can be found in our supplementary materials.
4.3.1 Objective Function
The adversarial loss follows GauGAN [Park2019SPADE] and thus takes as input the concatenation of the semantic segmentation mask and the generated image. We also apply an L1 consistency loss on the discriminator feature maps. Concretely, let $D_i$ extract the feature map of the $i$-th discriminator layer and let $x$ and $\hat{x}$ be the real and generated image. Our feature map consistency loss is computed as the sum of the intermediate feature map consistency terms. Let $N_D$ be the number of feature maps in the discriminator. We compute the loss as $\mathcal{L}_{\mathrm{FM}} = \sum_{i=1}^{N_D} \lVert D_i(x) - D_i(\hat{x}) \rVert_1$.
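A feature map consistency loss of this kind can be sketched as follows, assuming the discriminator feature maps for the real and generated image have already been extracted into two lists. Averaging the absolute differences within each layer (rather than summing them) is an assumption of this sketch, as is the function name.

```python
import numpy as np

def feature_matching_loss(real_features, fake_features):
    """Sum over layers of the mean absolute difference between feature maps.

    real_features, fake_features: lists with one array per discriminator layer,
    computed for the real and the generated image respectively.
    """
    if len(real_features) != len(fake_features):
        raise ValueError("both inputs must contain one entry per layer")
    total = 0.0
    for real, fake in zip(real_features, fake_features):
        diff = np.asarray(real, dtype=np.float64) - np.asarray(fake, dtype=np.float64)
        total += np.abs(diff).mean()  # L1 term for this layer
    return float(total)
```

The loss is zero only when the generated image produces the same intermediate discriminator activations as the real image at every layer.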
As we have paired training samples, we apply a simple L2 consistency loss on the generated image $\hat{x}$ vs. the target image $x$: $\mathcal{L}_{L2} = \lVert x - \hat{x} \rVert_2$.
Style Consistency Loss.
We compute the style consistency loss by passing the generated image $\hat{x}$ through the style encoder $E$. Let $z$ be the aggregated style vector as described above and $\hat{z} = E(\hat{x})$ be the style vector for the generated image. We compute the latent style code consistency loss between $z$ and $\hat{z}$. In addition, we compute a consistency loss on the Gram matrix of the feature maps of the encoder.
Let $E_i$ extract the $i$-th feature map of the encoder $E$. We calculate the encoder Gram matrix consistency loss over these feature maps, where $N_E$ is the number of feature maps in the encoder and $\mathcal{G}$ is the function to compute the Gram matrix [Gatys2016CVPR].
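To illustrate a Gram matrix consistency term, the sketch below compares the Gram matrices of two lists of encoder feature maps. Several details are assumptions rather than the formulation used in this work: the Gram matrix is normalized by the number of spatial positions, the per-layer distance is a mean absolute difference, and the function names are hypothetical.

```python
import numpy as np

def gram_matrix(feature_map):
    """Channel-by-channel correlation of a (C, H, W) feature map."""
    channels = feature_map.shape[0]
    flat = feature_map.reshape(channels, -1)   # (C, H*W)
    return flat @ flat.T / flat.shape[1]       # (C, C), normalized by H*W

def gram_consistency_loss(features_a, features_b):
    """Sum over encoder layers of the mean absolute Gram matrix difference."""
    if len(features_a) != len(features_b):
        raise ValueError("both inputs must contain one entry per layer")
    total = 0.0
    for a, b in zip(features_a, features_b):
        total += np.abs(gram_matrix(a) - gram_matrix(b)).mean()
    return float(total)
```

Since the Gram matrix sums over spatial positions, it discards where a feature occurs and keeps only which features co-occur, which is why it is commonly used as a descriptor of style rather than content.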
Full Training Objective.
The full training objective for the generator $G$ (given a discriminator $D$) is a weighted sum of the adversarial loss and the consistency losses described above.
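As a sketch only, an objective of this kind could be written as follows. The weight symbols $\lambda_{\mathrm{FM}}$, $\lambda_{L2}$ and $\lambda_{\mathrm{style}}$, the loss names, and the grouping of the two style terms under a single weight are all placeholders introduced here for illustration; no weight values are implied.

```latex
\mathcal{L}_{G} = \mathcal{L}_{\mathrm{adv}}(G, D)
  + \lambda_{\mathrm{FM}} \, \mathcal{L}_{\mathrm{FM}}
  + \lambda_{L2} \, \mathcal{L}_{L2}
  + \lambda_{\mathrm{style}} \left( \mathcal{L}_{z} + \mathcal{L}_{\mathrm{Gram}} \right)
```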
The per-image objective function of the OpenEDS Synthetic Eye Generation Challenge (http://evalai.cloudcv.org/web/challenges/challenge-page/354/evaluation) is given as: . For this challenge, we developed multiple models, two of which we present in this paper. Section 5.1 describes the results of the model that optimizes for the target metric and wins the challenge, and Section 5.2 discusses an alternative solution to the problem.
5.1 Refiner Network
Our refinement model (cf. Section 4.2) achieved the lowest score of all teams at . The scores of the second (PAU) and third teams (tomcarrot) were and , respectively. The baseline provided by the challenge organizers was . Although the refiner network won the challenge, we found that its generated images contain blurry regions and ghosting effects (see Figure 3). We believe that the reason for these effects lies in the L2 optimization objective, which encourages “washed out” regions.
5.2 Seg2Eye Network
The original GauGAN architecture starts from just a single downsampled style image, whose style is supposed to be preserved. When directly applied to the OpenEDS dataset, we found that all output images followed the same style, i.e., a complete lack of style preservation. We introduced our novel SPADE+Style blocks to incorporate style information via AdaIN, specifically allowing for the merging of style information from several images via a style encoder and max-reduction. We found that this improves model performance in terms of both L2 score and visual quality. However, without any additional encouragement, the network did not learn a consistent latent style space that allowed for interpolation between styles. For this reason, we added a style consistency loss on both the latent code and the Gram matrix of the encoder feature maps. In this scenario, style was applied considerably better and we could perform latent space walks as shown in Figure 1.
The final Seg2Eye approach does not achieve the lowest score of all approaches, but produces realistic images of high perceptual quality. Figure 4 illustrates some example inputs and outputs. It can be seen that visual fidelity is vastly improved compared to the Refiner Network. Additionally, Figure 1 shows a style interpolation between two people, demonstrating that the model has learned a meaningful representation of style. We also believe that this is more in line with the expected final usage of such a method in generating vast amounts of training data in a controlled manner.
In this paper, we described both the winning approach to the OpenEDS Synthetic Eye Generation Challenge and a principled approach to generating person-specific eyes given a semantic segmentation map. Our method, Seg2Eye, is inspired by [karras2018stylegan] and suggests a modification to spatially adaptive normalization as introduced by [Park2019SPADE] that leads to a more consistent application of style.
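A latent walk between two people can be sketched as an interpolation between their aggregated style codes, with each intermediate code then fed to the generator together with a fixed segmentation mask. Linear interpolation and the function name are assumptions of this sketch; the generator is not modelled.

```python
import numpy as np

def interpolate_styles(style_a, style_b, steps):
    """Return `steps` style codes moving linearly from style_a to style_b."""
    a = np.asarray(style_a, dtype=np.float64)
    b = np.asarray(style_b, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("style codes must have the same shape")
    if steps < 2:
        raise ValueError("need at least two steps to include both endpoints")
    weights = np.linspace(0.0, 1.0, steps)
    # Row k is (1 - w_k) * a + w_k * b, so the first row is a and the last is b.
    return np.stack([(1.0 - w) * a + w * b for w in weights])
```

Each row of the result is one style code along the walk; the first and last rows reproduce the two original styles.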
In future work, one should explore different ways to combine the content and style information in the Seg2Eye generator, or combine style information from multiple style images. Furthermore, to truly take advantage of Seg2Eye, an interpretable latent space of eye shapes should be learned such that plausible and high fidelity IR images of eyes can be created from entirely unseen people.
This work was supported in part by the ERC Grant OPTINT (StG-2016-717054).