Style Separation and Synthesis via Generative Adversarial Networks

11/07/2018 ∙ by Rui Zhang, et al. ∙ Institute of Computing Technology, Chinese Academy of Sciences; Qihoo 360 Technology Co. Ltd.

Style synthesis has attracted great interest recently, while few works focus on its dual problem, "style separation". In this paper, we propose the Style Separation and Synthesis Generative Adversarial Network (S3-GAN) to simultaneously implement style separation and style synthesis on object photographs of specific categories. Based on the assumption that object photographs lie on a manifold and that contents and styles are independent, we employ the S3-GAN to build mappings between the manifold and a latent vector space for separating and synthesizing contents and styles. The S3-GAN consists of an encoder network, a generator network, and an adversarial network. The encoder network performs style separation by mapping an object photograph to a latent vector, whose two halves represent the content and style, respectively. The generator network performs style synthesis by taking a concatenated vector as input, which contains the style half-vector of the style target image and the content half-vector of the content target image. An adversarial network is further imposed on the images produced by the generator network to make them more photo-realistic. Experiments on the CelebA and UT Zappos 50K datasets demonstrate that the S3-GAN can perform style separation and synthesis simultaneously and capture various styles in a single model.


1. Introduction

Style synthesis (Gatys et al., 2016), also known as style transfer or texture synthesis, has attracted enormous attention recently. The goal of style synthesis is to generate a new image that migrates the style (e.g., colors, textures) of a style target image while maintaining the content (e.g., edges, shapes) of a content target image. Approaches based on Convolutional Neural Networks (CNNs) (Krizhevsky et al., 2012; Simonyan and Zisserman, 2014) achieve remarkable success on style synthesis and generate astonishing results (Gatys et al., 2016; Johnson et al., 2016). Most of these works focus on migrating the styles of artistic works to photographs. However, all objects in photographs have their individual styles, which could also be migrated to other photographs. Moreover, the success of style synthesis shows that the content and style of an image are independent.

Thus, how to learn individual representations of content and style from a given image is the dual problem of style synthesis. We name this problem "style separation". Existing works focus on style synthesis and pay little attention to style separation. For example, some methods (Johnson et al., 2016; Ulyanov et al., 2016) can represent styles with learned feedforward networks, but they cannot represent image contents at the same time.

Figure 1. The proposed S3-GAN employs a pair of encoder and generator to build mappings between the object photograph manifold and a latent vector space. The encoder performs style separation by encoding an object photograph into a latent vector, half of which represents the style and the other half the content. The generator produces the result of style synthesis from the concatenated vector.

In this work, we aim to implement style separation and style synthesis simultaneously for object photographs. To this end, we propose a novel network named the Style Separation and Synthesis Generative Adversarial Network (S3-GAN). The S3-GAN is trained on specific categories of objects (e.g., faces, shoes), since GANs can generate realistic images in specific domains. Inspired by (Gatys et al., 2016), we define the structures of objects as "content" (e.g., the identities and poses of faces, the shapes of shoes) and the colors and textures of objects as "style" (e.g., the skin color and hair color of faces, the colors and patterns of shoes). Based on the assumption that object photographs lie on a high-dimensional manifold, the S3-GAN employs a pair of encoder and generator to build mappings between the manifold and a latent vector space, as illustrated in Figure 1. The encoder is used for style separation: it maps a given photograph to the latent space. As content and style are independent, we enforce that half of the latent vector represents the style and the other half represents the content. The generator performs style synthesis by taking a concatenated vector as input, which contains the style half-vector of the style target image and the content half-vector of the content target image. The object photograph generated from the concatenated vector has a style similar to the style target image while preserving the content of the content target image.

The proposed S3-GAN differs substantially from existing style synthesis approaches (Gatys et al., 2016, 2015; Li and Wand, 2016a; Johnson et al., 2016; Ulyanov et al., 2016). Some of them are iterative optimization methods (Gatys et al., 2016, 2015; Li and Wand, 2016a), which generate high-quality images at a high computational cost. The other approaches employ feedforward networks to generate images close to the given style target images (Johnson et al., 2016; Ulyanov et al., 2016). These methods can produce results in real time, but each model only handles a single specific style. By comparison, the proposed S3-GAN can handle various styles of objects with a single model, and efficiently synthesizes different styles by concatenating half-vectors of different styles and performing a single forward propagation.

The proposed S3-GAN is derived from GANs, but it differs from existing GAN-based methods. GAN-based methods achieve impressive success in image generation and editing (Isola et al., 2017; Kim et al., 2017; Zhou et al., 2017). However, most of them build mappings between two application-specific domains for image-to-image translation, which can be regarded as a conversion between two styles. In contrast, the proposed S3-GAN builds transfers among various styles. The half-vectors of styles can be treated as conditions to generate images of the associated styles. Thus, the translation between any two styles can be accomplished simply by replacing the style half-vectors.

We evaluate the proposed S3-GAN on photographs of two specific categories of objects: faces from the CelebA dataset (Liu et al., 2015) and shoes from the UT Zappos50K dataset (Yu and Grauman, 2014). Experimental results show the effectiveness of the proposed method for both style separation and style synthesis. The main contributions of our work can be summarized as follows:

  • We propose a novel S3-GAN framework for style separation and synthesis. Extensive experiments on photographs of faces and shoes demonstrate the effectiveness of the S3-GAN.

  • The S3-GAN performs style separation with an encoder, which builds the mapping from the object photograph manifold to a latent vector space. For a given object photograph, half of its latent vector is the style representation, and the other half is the content representation.

  • The S3-GAN performs style synthesis with a generator. By concatenating the style half-vector of the style target image and the content half-vector of the content target image, the generator maps the concatenated vector back to the object photograph manifold to produce the style synthesis result.

2. Related Work

2.1. Style Synthesis

Style synthesis can be regarded as a generalization of texture synthesis. The previous texture synthesis methods mainly apply low-level image features to grow textures and preserve image structures (Efros and Freeman, 2001; Hertzmann et al., 2001; Efros and Leung, 1999).

Recently, approaches based on CNNs have generated astonishing results. These approaches employ perceptual losses measured from CNN features to estimate the style and content similarity between generated images and target images.

Gatys et al. (2016, 2015) propose optimization-based methods that directly minimize the perceptual losses through an iterative process. Li and Wand (2016a) extend these works by matching neural patches with Markov Random Fields (MRFs). The optimization-based methods are computationally expensive, since the pixel values of the synthesis results are gradually optimized over hundreds of backward propagations.

To speed up the process of style synthesis, approaches based on feedforward networks have been proposed (Johnson et al., 2016; Ulyanov et al., 2016; Li and Wand, 2016b). These approaches learn feedforward networks that minimize the perceptual losses for a specific style target image and arbitrary content target images. The stylized results of given photographs can thus be obtained through a single forward propagation, saving the computational cost of iterative optimization. However, each model of these methods can only represent a single style; for a new style, the feedforward network has to be retrained.

Until very recently, some approaches attempt to capture multiple styles in a single feedforward network, representing styles with multiple filter banks (Chen et al., 2017), conditional instance normalization (Dumoulin et al., 2016), or binary selection units (Li et al., 2017a). Other approaches try to represent arbitrary styles in a single model by learning general mappings (Ghiasi et al., 2017), adaptive instance normalization (Huang and Belongie, 2017), or feature transforms (Li et al., 2017b).

In this paper, we propose the S3-GAN to implement both style separation and style synthesis. The contents and styles of object photographs are represented as latent vectors. The S3-GAN can not only perform style synthesis through a single forward propagation, but also capture various styles in a single model.

2.2. Generative Adversarial Networks

GANs are among the most successful generative models for producing photo-realistic images. Standard GANs (Goodfellow et al., 2014; Radford et al., 2016) learn a generator and a discriminator through a min-max two-player game. The generator produces plausible images from random noise, while the discriminator distinguishes the generated images from real samples. The training process of the original GAN is unstable, and many approaches have been proposed for improvement, such as WGAN (Arjovsky et al., 2017), WGAN-GP (Gulrajani et al., 2017), EBGAN (Zhao et al., 2016), and LS-GAN (Qi, 2017).

Figure 2. The architecture of the proposed S3-GAN consists of the encoder $E$, generator $G$, discriminator $D$ and perceptual network $\Phi$. The encoder acquires the representations for style separation by mapping the target images $x_c$ and $x_s$ to the latent vectors $z_c$ and $z_s$. The generator produces the result of style synthesis from the concatenated vector $[z_c^{C}, z_s^{S}]$. The discriminator evaluates the adversarial loss to help generate plausible images. The perceptual network is applied to obtain the perceptual losses, including the content perceptual loss and the style perceptual loss. The reconstruction loss and the total variation loss are added to the objective function for supplementation (the total variation loss is omitted in the figure for simplicity).

Moreover, approaches based on Conditional GANs (CGANs) (Mirza and Osindero, 2014) have been successfully applied to many tasks. These approaches condition GANs on discrete labels (Mirza and Osindero, 2014), text (Reed et al., 2016) and images. Among them, CGANs conditioned on images accomplish image-to-image translation (Isola et al., 2017) with an additional encoder, which is introduced to obtain conditions from the input images. These frameworks are widely used to tackle many challenging tasks, such as image inpainting (Pathak et al., 2016; Yang et al., 2017), super-resolution (Ledig et al., 2017), age progression and regression (Zhang et al., 2017), style transfer (Zhu et al., 2017), scene synthesis (Wang and Gupta, 2016), cross-modal retrieval (Chi and Peng, 2018; Zhang et al., 2018a; Yao et al., 2017) and face attribute manipulation (Shen and Liu, 2017; Kim et al., 2017; Zhou et al., 2017). Moreover, domain-adaptation approaches (Tsai et al., 2018; Zhang et al., 2018b) employ GANs to adapt features and boost models for traditional tasks, such as semantic segmentation. Some other approaches (Ma et al., 2017; Siarohin et al., 2018) utilize GANs to generate human images of arbitrary poses and benefit related tasks such as person re-identification. Most of these approaches perform image-to-image translation by building mappings between two application-specific domains.

In this paper, the proposed S3-GAN is trained to represent the domain consisting of object photographs of a specific category. This domain can be divided into many sub-domains by different styles. The S3-GAN can perform mappings between any pair of sub-domains to accomplish arbitrary style transfer.

3. Proposed Approaches

In this section, we first formulate the latent vector space that is introduced to disentangle the content and style representations. Then we demonstrate the pipeline of the S3-GAN and describe each component in detail. Finally, we present all the individual loss functions utilized to optimize the S3-GAN.

3.1. Formulation

We assume that the object photographs of a specific category lie on a high-dimensional manifold $\mathcal{M}$ in the photograph domain. Objects with the same style or the same content will be clustered into the sub-domains of the associated styles or contents.

Since it is difficult to directly model photographs on the manifold $\mathcal{M}$, we build mappings between $\mathcal{M}$ and a latent vector space $\mathcal{Z} = \mathbb{R}^{2d}$, where $2d$ is the dimension of the vectors in $\mathcal{Z}$. Considering that contents and styles are independent, we attempt to disentangle the representations of contents and styles into different dimensions of the latent vectors in $\mathcal{Z}$. Suppose that for a given object photograph $x \in \mathcal{M}$, its associated latent vector in $\mathcal{Z}$ is $z = [z^{C}, z^{S}]$, where $z^{C}$ and $z^{S}$ are the sub-vectors representing its content and style, respectively. For simplicity, we set the content and style sub-vectors to equal dimensionality $d$. Therefore, $\mathcal{M}(z^{S})$ (or $\mathcal{M}(z^{C})$) can represent the sub-domain containing all the objects showing different contents (or styles) but the same style (or content) as $x$. For any style sub-vector $\tilde{z}^{S}$, $[z^{C}, \tilde{z}^{S}]$ corresponds to the intersection of the sub-domain of style $\tilde{z}^{S}$ and the sub-domain $\mathcal{M}(z^{C})$. Thus, $[z^{C}, \tilde{z}^{S}]$ can represent the result of modifying the style of $x$ to $\tilde{z}^{S}$ while preserving the content of $x$.
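To make this latent-space convention concrete, the following minimal Python sketch (PyTorch-style; the dimensionality and variable names are illustrative assumptions, not the paper's original notation) shows how a latent vector is split into content and style halves and how a swapped vector for style synthesis is formed:

```python
import torch

d = 512                      # dimensionality of each half (illustrative choice)
z_c = torch.randn(2 * d)     # latent vector of the content target, z_c = E(x_c)
z_s = torch.randn(2 * d)     # latent vector of the style target,   z_s = E(x_s)

# Style separation: split each latent vector into a content half and a style half.
zc_content, zc_style = z_c[:d], z_c[d:]
zs_content, zs_style = z_s[:d], z_s[d:]

# Style synthesis: keep the content half of x_c, take the style half of x_s,
# and decode the concatenation with the generator, x_syn = G([z_c^C, z_s^S]).
z_syn = torch.cat([zc_content, zs_style], dim=0)
```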

3.2. Architecture

The proposed S3-GAN employs the framework of GANs to learn the mapping from the manifold $\mathcal{M}$ to the latent vector space $\mathcal{Z}$, as well as to generate realistic images from $\mathcal{Z}$. The pipeline of the S3-GAN consists of four components: the encoder, generator, discriminator and perceptual network, as shown in Figure 2. The encoder and generator are applied in both the training and test stages to perform style separation and style synthesis, while the discriminator and perceptual network are employed only in the training stage to optimize the objective function.

We learn the encoder $E$ to build the mapping from the manifold $\mathcal{M}$ to the latent vector space $\mathcal{Z}$. For any content target image $x_c$ and style target image $x_s$, their corresponding latent vectors in $\mathcal{Z}$ are denoted as:

$z_c = E(x_c) = [z_c^{C}, z_c^{S}]$    (1)
$z_s = E(x_s) = [z_s^{C}, z_s^{S}]$    (2)

Thus, style separation is implemented by the encoder $E$: $z_c^{C}$ (or $z_s^{C}$) is the content representation and $z_c^{S}$ (or $z_s^{S}$) is the style representation of the object photograph $x_c$ (or $x_s$).

Conversely, we also learn the generator $G$ to build the mapping from the latent vector space $\mathcal{Z}$ back to the manifold $\mathcal{M}$:

$\hat{x}_c = G(z_c) = G([z_c^{C}, z_c^{S}])$    (3)
$\hat{x}_s = G(z_s) = G([z_s^{C}, z_s^{S}])$    (4)

where $\hat{x}_c$ (or $\hat{x}_s$) is the reconstruction having the same content and style as $x_c$ (or $x_s$). The latent sub-vectors of the content and style target images can be utilized as conditions to generate the result of style synthesis. Therefore, the generator can produce a synthetic photograph $x_{syn}$ by concatenating the associated sub-vectors, $z_c^{C}$ from the content target $x_c$ and $z_s^{S}$ from the style target $x_s$:

$x_{syn} = G([z_c^{C}, z_s^{S}])$    (5)

Inspired by the framework of GANs, we also introduce the discriminator $D$ to classify whether an image is real or fake (i.e., produced by the generator). The synthetic result $x_{syn}$ and real photographs randomly sampled from the training set are fed into the discriminator to compute the adversarial loss. Indistinguishable object photographs will be generated during the optimization of the min-max game.

Moreover, we introduce the perceptual network $\Phi$ to evaluate and improve the style synthesis results. $\Phi$ is employed to extract features from the synthetic result $x_{syn}$, the content target $x_c$ and the style target $x_s$ to evaluate the perceptual losses, including the content perceptual loss and the style perceptual loss. The perceptual losses enforce $x_{syn}$ to acquire the style of $x_s$ while preserving the content of $x_c$.

3.3. Loss Functions

Figure 2 also presents the losses for optimizing the proposed S3-GAN. The objective function is a weighted sum of five losses: the adversarial loss, content perceptual loss, style perceptual loss, reconstruction loss and total variation loss. They are described in detail in the following.

3.3.1. Adversarial Loss

We apply the discriminator $D$ to evaluate the adversarial loss $\mathcal{L}_{adv}$. The adversarial loss of the original GAN (Goodfellow et al., 2014) is based on the Kullback-Leibler (KL) divergence. However, when the discriminator is quickly trained towards optimality, the KL divergence leads to a constant and causes the vanishing gradient problem, which restrains the updating of the generator. To tackle this problem, we adopt the adversarial loss of the recently proposed WGAN (Arjovsky et al., 2017), which is based on the Earth Mover (EM) distance.

We denote the distribution of the training data (i.e., the object photographs of a specific category) on the manifold $\mathcal{M}$ as $p_{data}$, and random sampling from it as $x \sim p_{data}$. Thus, the adversarial loss is:

$\mathcal{L}_{adv} = \mathbb{E}_{x \sim p_{data}}[D(x)] - \mathbb{E}[D(x_{syn})]$    (6)

where $x_{syn}$ is the generated result of style synthesis, as formulated in Eq. (1), Eq. (2) and Eq. (5). A min-max objective function is employed to optimize the adversarial loss:

$\min_{E,G} \max_{D} \mathcal{L}_{adv}$    (7)

where $G$ tries to minimize $\mathcal{L}_{adv}$ so as to generate images $x_{syn}$ that look indistinguishable from images in the training set, while $D$ tries to maximize $\mathcal{L}_{adv}$ so as to separate the generated image $x_{syn}$ from the real sample $x$.

The adversarial loss ensures that the generated images reside on the manifold $\mathcal{M}$ and forces them to be indistinguishable from real images. Thus, we exploit this loss function to produce realistic images. Blurry images look obviously fake, so they are suppressed by the adversarial loss.
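For reference, a minimal sketch of this adversarial objective is given below (PyTorch-style Python rather than the paper's TensorFlow implementation; the Lipschitz constraint required by WGAN, e.g. weight clipping, is assumed to be handled elsewhere):

```python
import torch

def adversarial_losses(D, real_images, synthesized):
    """WGAN-style critic and generator losses, a hedged sketch of Eq. (6)-(7)."""
    # The critic maximizes E[D(x)] - E[D(x_syn)], i.e. minimizes its negation.
    d_loss = -(D(real_images).mean() - D(synthesized.detach()).mean())
    # The encoder/generator try to make D(x_syn) large, i.e. indistinguishable from real.
    g_loss = -D(synthesized).mean()
    return d_loss, g_loss
```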

3.3.2. Content Perceptual Loss

The generated image $x_{syn}$ is intended to be stylistically similar to the style target $x_s$ and to preserve the content of the content target $x_c$. Since ground truths for style synthesis are not provided in the training set, we employ the perceptual network $\Phi$ and utilize its feature representations to penalize the differences between generated images and target images, incorporating the prior knowledge of style synthesis (Gatys et al., 2016; Johnson et al., 2016).

Generated results are expected to match the feature responses of the target images. Let $\phi_l(x)$ be the feature map extracted from layer $l$ of the perceptual network $\Phi$ for the input image $x$. The content perceptual loss is defined as the squared Euclidean distance between feature responses:

$\mathcal{L}_{content} = \sum_{l \in \ell_c} \| \phi_l(x_{syn}) - \phi_l(x_c) \|_2^2$    (8)

where $\ell_c$ denotes the set of layers utilized to evaluate the content perceptual loss.

Considering the design of neural networks, higher layers capture semantic-level information, including shapes and spatial structures, while ignoring low-level information such as colors and textures. Therefore, we calculate $\mathcal{L}_{content}$ on higher layers of $\Phi$, so that the generated image $x_{syn}$ preserves the content of the content target $x_c$.
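A hedged sketch of the content perceptual loss follows; the feature extractor `phi` is an assumed interface that returns a dictionary of feature maps keyed by layer name (e.g., a VGG-19 wrapper), and the default layer follows Section 4.1.3:

```python
import torch

def content_perceptual_loss(phi, x_syn, x_c, content_layers=("relu4_2",)):
    """Squared Euclidean distance between feature maps, as in Eq. (8)."""
    feats_syn, feats_c = phi(x_syn), phi(x_c)
    loss = 0.0
    for layer in content_layers:
        loss = loss + torch.sum((feats_syn[layer] - feats_c[layer]) ** 2)
    return loss
```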

3.3.3. Style Perceptual Loss

Suppose the feature map $\phi_l(x)$ from layer $l$ for input $x$ has the shape $C_l \times H_l \times W_l$. The style perceptual loss is calculated as the squared Frobenius distance between Gram matrices:

$\mathcal{L}_{style} = \sum_{l \in \ell_s} \| \mathcal{G}_l(x_{syn}) - \mathcal{G}_l(x_s) \|_F^2$    (9)

where $\ell_s$ denotes the set of layers applied for the style perceptual loss. The Gram matrix $\mathcal{G}_l(x)$ is a $C_l \times C_l$ matrix inspired by the uncentered covariance of the feature map along the channel dimension. Its element at $(i, j)$ is denoted as:

$\mathcal{G}_l(x)_{i,j} = \sum_{h=1}^{H_l} \sum_{w=1}^{W_l} \phi_l(x)_{i,h,w}\, \phi_l(x)_{j,h,w}$    (10)

The Gram matrix focuses on features from different channels that activate together, omitting the spatial information of images. Thus, the style perceptual loss based on Gram matrices maintains the style of the style target $x_s$ and ignores its content. In contrast to the content perceptual loss, $\mathcal{L}_{style}$ is calculated on lower layers of $\Phi$ to focus on low-level information, including style-related colors and textures.
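The Gram-matrix computation and the style perceptual loss can be sketched as follows (same assumed `phi` interface as in the content-loss sketch; batched inputs are assumed, and the default layers follow Section 4.1.3):

```python
import torch

def gram_matrix(feat):
    """Batched Gram matrices of N x C x H x W feature maps, as in Eq. (10)."""
    n, c, h, w = feat.shape
    f = feat.reshape(n, c, h * w)
    return torch.bmm(f, f.transpose(1, 2))          # N x C x C

def style_perceptual_loss(phi, x_syn, x_s,
                          style_layers=("relu1_1", "relu2_1", "relu3_1")):
    """Squared Frobenius distance between Gram matrices, as in Eq. (9)."""
    feats_syn, feats_s = phi(x_syn), phi(x_s)
    loss = 0.0
    for layer in style_layers:
        g_syn = gram_matrix(feats_syn[layer])
        g_s = gram_matrix(feats_s[layer])
        loss = loss + torch.sum((g_syn - g_s) ** 2)
    return loss
```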

3.3.4. Reconstruction Loss

We can also obtain the reconstructions $\hat{x}_c$ of $x_c$ and $\hat{x}_s$ of $x_s$, as formulated in Eq. (3) and Eq. (4). The reconstruction loss calculated from the original images and the reconstructed images is added to the objective function for supplementation, denoted as:

$\mathcal{L}_{rec} = \| x_c - G(E(x_c)) \|_1 + \| x_s - G(E(x_s)) \|_1$    (11)

We apply the L1 distance rather than L2 in $\mathcal{L}_{rec}$, because L1 results in less blurring.

The reconstruction loss ensures that the encoder and the generator form a pair of mutually inverse mappings. Considering that the ground truths of style synthesis are not given in the training set, the reconstruction also provides an analogous ground-truth output, which accelerates the training process and improves realism. In addition, although the reconstruction loss may over-smooth and lead to blurry images, serious artifacts are prevented by an appropriate loss weight and the constraint of the adversarial loss.
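A minimal sketch of this L1 reconstruction term, assuming `E` and `G` are callable encoder and generator modules (averaging over pixels instead of summing is an illustrative choice):

```python
import torch

def reconstruction_loss(E, G, x_c, x_s):
    """L1 reconstruction loss, as in Eq. (11): E and G should act as inverse mappings."""
    x_c_rec = G(E(x_c))
    x_s_rec = G(E(x_s))
    return torch.mean(torch.abs(x_c - x_c_rec)) + torch.mean(torch.abs(x_s - x_s_rec))
```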

3.3.5. Total Variation Loss

Another auxiliary loss function is the total variation loss $\mathcal{L}_{tv}$, which encourages spatial smoothness of the generated results and reduces spike artifacts. It applies a total variation regularizer to both the synthesis results and the reconstruction results, formulated as:

$TV(x) = \sum_{i,j} \big( (x_{i,j+1} - x_{i,j})^2 + (x_{i+1,j} - x_{i,j})^2 \big)$    (12)
$\mathcal{L}_{tv} = TV(x_{syn}) + TV(\hat{x}_c) + TV(\hat{x}_s)$    (13)
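A sketch of this regularizer is given below (the squared-difference form matches the reconstruction of Eq. (12) above, which is itself an assumption; the exact variant is not recoverable from the text):

```python
import torch

def total_variation(x):
    """Total variation of a batched N x C x H x W image tensor, as in Eq. (12)."""
    dh = x[:, :, 1:, :] - x[:, :, :-1, :]
    dw = x[:, :, :, 1:] - x[:, :, :, :-1]
    return (dh ** 2).sum() + (dw ** 2).sum()

def total_variation_loss(x_syn, x_c_rec, x_s_rec):
    """TV regularizer applied to both synthesis and reconstruction results, as in Eq. (13)."""
    return total_variation(x_syn) + total_variation(x_c_rec) + total_variation(x_s_rec)
```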

3.3.6. Full Objective Function

The full objective function is the weighted sum of all the losses defined above, denoted as:

$\mathcal{L} = \lambda_{adv}\mathcal{L}_{adv} + \lambda_{c}\mathcal{L}_{content} + \lambda_{s}\mathcal{L}_{style} + \lambda_{rec}\mathcal{L}_{rec} + \lambda_{tv}\mathcal{L}_{tv}$    (14)

where $\lambda_{adv}$, $\lambda_{c}$, $\lambda_{s}$, $\lambda_{rec}$ and $\lambda_{tv}$ are the loss weights which control the relative importance of each term in the objective function. The optimization solves the min-max problem:

$\min_{E,G} \max_{D} \mathcal{L}$    (15)

4. Experiments

In this section, we perform experiments on object photographs of two specific categories, including faces from CelebA dataset (Liu et al., 2015) and shoes from UT Zappos50K dataset (Yu and Grauman, 2014).

4.1. Experimental Settings

4.1.1. CelebA Dataset

The CelebA dataset consists of more than 200K celebrity images of 10K identities. We crop the 128×128 center part of the aligned face images in the CelebA dataset for pre-processing. We randomly select 2K images for testing, and the remaining images are employed as training samples. The content and style target image pairs are randomly selected, while the forty face attributes and five key points annotated in the CelebA dataset are not utilized.

4.1.2. UT Zappos50K Dataset

The UT Zappos50K dataset is collected from the online shopping website Zappos.com. It contains 50K catalog shoe images pictured in the same orientation with blank backgrounds. The images are scaled to 128×128 before being fed to the network. We randomly split the images into two parts: 2K images for testing and the remaining 48K for training. We also randomly select the content and style target image pairs and ignore the meta-data (e.g., shoe type, materials, gender) of the images.
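The pre-processing described above can be sketched with torchvision transforms (an illustrative sketch; the paper's pipeline is implemented in TensorFlow, and the [0, 1] normalization follows Section 4.1.3):

```python
from torchvision import transforms

# CelebA: crop the 128x128 center of the aligned face images.
celeba_transform = transforms.Compose([
    transforms.CenterCrop(128),
    transforms.ToTensor(),          # converts to a [0, 1] float tensor
])

# UT Zappos50K: scale catalog shoe images to 128x128.
zappos_transform = transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.ToTensor(),
])
```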

Encoder
Layer                    Filter Size / Stride    Activation Size
Input color image        -                       3×128×128
Conv, BN, Leaky ReLU     64×4×4 / 2              64×64×64
Conv, BN, Leaky ReLU     128×4×4 / 2             128×32×32
Conv, BN, Leaky ReLU     256×4×4 / 2             256×16×16
Conv, BN, Leaky ReLU     512×4×4 / 2             512×8×8
Conv, BN, Leaky ReLU     1024×4×4 / 2            1024×4×4

Generator
Layer                    Filter Size / Stride    Activation Size
Input latent vector      -                       1024×4×4
Deconv, BN, ReLU         512×4×4 / 2             512×8×8
Deconv, BN, ReLU         256×4×4 / 2             256×16×16
Deconv, BN, ReLU         128×4×4 / 2             128×32×32
Deconv, BN, ReLU         64×4×4 / 2              64×64×64
Deconv, Tanh             3×4×4 / 2               3×128×128

Discriminator
Layer                    Filter Size / Stride    Activation Size
Input color image        -                       3×128×128
Conv, Leaky ReLU         64×4×4 / 2              64×64×64
Conv, BN, Leaky ReLU     128×4×4 / 2             128×32×32
Conv, BN, Leaky ReLU     256×4×4 / 2             256×16×16
Conv, BN, Leaky ReLU     512×4×4 / 2             512×8×8
Conv, BN, Leaky ReLU     1024×4×4 / 2            1024×4×4
Fully Connected          1×(1024×4×4)            1

Table 1. The detailed structure of the S3-GAN, including the encoder E, generator G and discriminator D. BN: Batch Normalization.

4.1.3. Implementation Details

The detailed structures of the encoder $E$, generator $G$ and discriminator $D$ are specified in Table 1. $E$ and $D$ apply the "Convolution, Batch Normalization, Leaky ReLU" module, while $G$ exploits the "Deconvolution, Batch Normalization, ReLU" module. Strides of 2 are utilized in both the convolutional and deconvolutional layers to down-sample or up-sample the feature maps. In particular, the output layer of $G$ utilizes Tanh as the activation function instead of ReLU. In addition, Batch Normalization (Ioffe and Szegedy, 2015) is removed in the generator output layer and the discriminator input layer, since directly applying Batch Normalization to all layers may lead to sample oscillation and model instability. Images are normalized to [0, 1] before being input to $E$ or $D$, and the outputs of $G$ are re-scaled to [0, 255]. For a given image $x$, the output 1024×4×4 latent vector of $E$ is split along the channel dimension into two 512×4×4 half-vectors, which serve as the content and style representations, respectively.
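A PyTorch-style sketch of the three networks in Table 1 is given below (the paper's implementation uses TensorFlow; a padding of 1 and a Leaky ReLU slope of 0.2 are assumptions not specified in the table):

```python
import torch.nn as nn

def conv_block(c_in, c_out, bn=True):
    layers = [nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1)]
    if bn:
        layers.append(nn.BatchNorm2d(c_out))
    layers.append(nn.LeakyReLU(0.2, inplace=True))
    return layers

def deconv_block(c_in, c_out):
    return [nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]

# Encoder E: 3x128x128 image -> 1024x4x4 latent vector (Table 1).
encoder = nn.Sequential(
    *conv_block(3, 64), *conv_block(64, 128), *conv_block(128, 256),
    *conv_block(256, 512), *conv_block(512, 1024),
)

# Generator G: 1024x4x4 latent vector -> 3x128x128 image; the output layer
# uses Tanh and no Batch Normalization, as described above.
generator = nn.Sequential(
    *deconv_block(1024, 512), *deconv_block(512, 256), *deconv_block(256, 128),
    *deconv_block(128, 64),
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1), nn.Tanh(),
)

# Discriminator D: 3x128x128 image -> scalar; no BN in the input layer.
discriminator = nn.Sequential(
    *conv_block(3, 64, bn=False), *conv_block(64, 128), *conv_block(128, 256),
    *conv_block(256, 512), *conv_block(512, 1024),
    nn.Flatten(), nn.Linear(1024 * 4 * 4, 1),
)
```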

During training, the Adam optimizer (Kingma and Ba, 2015) with a mini-batch of 16 samples is adopted. We employ the VGG-19 network (Simonyan and Zisserman, 2014) pre-trained on the ImageNet dataset (Deng et al., 2009) as the perceptual network $\Phi$. The weights of the encoder $E$, generator $G$ and discriminator $D$ are initialized from a zero-centered Gaussian distribution with appropriate deviations (Glorot and Bengio, 2010). The learning rate is fixed at 0.001 for 30 epochs. The loss weights are chosen empirically from the experiments: since the raw values of the perceptual losses are much larger than those of the other losses, the weights of the perceptual losses are set smaller so that all the weighted losses are of the same order of magnitude. We compute the content perceptual loss at layer relu4_2 and the style perceptual loss at layers relu1_1, relu2_1 and relu3_1. We adopt the alternating training approach of GANs, alternating between one gradient descent step on $D$ and two steps on $E$ and $G$. Our experiments are implemented on the TensorFlow platform. All of our networks are trained and tested on one NVIDIA Tesla K40 GPU.
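Putting the pieces together, a hedged sketch of one training iteration is given below; it reuses the network and loss sketches above, and the loss-weight values are purely illustrative, since the paper's exact weights are not reproduced here:

```python
import itertools
import torch

# Illustrative loss weights (not the paper's values) and optimizers.
lambda_adv, lambda_c, lambda_s, lambda_rec, lambda_tv = 1.0, 1e-5, 1e-5, 1.0, 1e-6
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
opt_eg = torch.optim.Adam(
    itertools.chain(encoder.parameters(), generator.parameters()), lr=1e-3)

def synthesize(x_c, x_s):
    """x_syn = G([z_c^C, z_s^S]) with the halves split along the channel dimension."""
    zc_content, _ = torch.chunk(encoder(x_c), 2, dim=1)
    _, zs_style = torch.chunk(encoder(x_s), 2, dim=1)
    return generator(torch.cat([zc_content, zs_style], dim=1))

def train_step(x_c, x_s, x_real, phi):
    x_syn = synthesize(x_c, x_s)

    # One discriminator step (WGAN critic loss; a Lipschitz constraint on D,
    # e.g. weight clipping, is assumed to be applied elsewhere).
    d_loss = -(discriminator(x_real).mean() - discriminator(x_syn.detach()).mean())
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Two encoder/generator steps on the full weighted objective of Eq. (14).
    for _ in range(2):
        x_syn = synthesize(x_c, x_s)
        x_c_rec, x_s_rec = generator(encoder(x_c)), generator(encoder(x_s))
        loss = (lambda_adv * (-discriminator(x_syn).mean())
                + lambda_c * content_perceptual_loss(phi, x_syn, x_c)
                + lambda_s * style_perceptual_loss(phi, x_syn, x_s)
                + lambda_rec * reconstruction_loss(encoder, generator, x_c, x_s)
                + lambda_tv * total_variation_loss(x_syn, x_c_rec, x_s_rec))
        opt_eg.zero_grad()
        loss.backward()
        opt_eg.step()
```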

Figure 3. Visualization of the content and style representation on face images of CelebA Dataset. From top to bottom are: original images, content representation and style representation.
Figure 4. Visualization of the content and style representation on shoe images of UT Zappos50K Dataset. From top to bottom are: original images, content representation and style representation.

4.2. Results and Comparisons

4.2.1. Visualization of Content and Style Representations

The encoder $E$ of the S3-GAN performs style separation by encoding an image into a latent vector, half of which represents the style and the other half the content. To demonstrate that the S3-GAN has the capacity for style separation, we visualize the content and style representations produced by the encoder $E$. For visualization, we preserve the content or style half-vector and simply fill the other half with zeros. We then feed the new vector into the generator $G$ to obtain the visualization of the content or style. As shown in Figures 3 and 4, the visualizations of the content representations preserve the structural information but discard the color information, while the style representations maintain the color information but ignore the structural information. For example, the content representations preserve the identities and poses of the original faces in Figure 3 and the shapes and structures of the original shoes in Figure 4. In contrast, the style representations present the skin color and hair color of the style target faces in Figure 3 and the colors and textures of the style target shoes in Figure 4. These content and style representations are powerful for synthesizing new images. The above experiments show that the content and style representations are complementary and can be captured by the learned encoder.
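The visualization procedure described above amounts to zeroing one half of the latent vector before decoding; a minimal sketch (assuming the encoder and generator modules sketched in Section 4.1.3):

```python
import torch

def visualize_representation(E, G, x, part="content"):
    """Visualize one half of the separated representation by zeroing the other
    half and decoding with the generator (a sketch of the procedure above)."""
    z_content, z_style = torch.chunk(E(x), 2, dim=1)
    if part == "content":
        z = torch.cat([z_content, torch.zeros_like(z_style)], dim=1)
    else:
        z = torch.cat([torch.zeros_like(z_content), z_style], dim=1)
    return G(z)
```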

Figure 5. Illustration of style synthesis on CelebA dataset. From top to bottom: style target images, content target images, synthesis results, logarithms of content distances and style distances. For the fourth and fifth rows, pink bars are distances between content target images and synthesis results, while blue bars are distances between style target images and synthesis results.
Figure 6. Illustration of style synthesis on UT Zappos50K dataset. From top to bottom: style target images, content target images, synthesis results, logarithms of content distances and style distances. For the fourth and fifth rows, pink bars are distances between content target images and synthesis results, while blue bars are distances between style target images and synthesis results.
Figure 7. Illustration of style synthesis on different styles. The first row shows different style target images, while the second row shows the content target image and style synthesis results.
Figure 8. Illustration of style synthesis on different contents. The first row shows different content target images, while the second row shows the style target image and style synthesis results.

4.2.2. Results of Style Synthesis

The generator $G$ of the S3-GAN produces the style synthesis result from the vector obtained by concatenating the content and style half-vectors. Both qualitative and quantitative evaluations of the synthesis results on the CelebA and UT Zappos50K datasets are presented in Figures 5 and 6. From the first three rows of these two figures, we can observe that the synthesized images exhibit the obvious style of the style target images and preserve the content of the content target images. For example, the synthesized faces in the third row of Figure 5 exhibit the skin colors and hair colors of the style target faces, while preserving the identities, poses and expressions of the content target faces. Similarly, the synthesized shoes in the third row of Figure 6 show the colors of the style target shoes and the structures and shapes of the content target shoes. Furthermore, for target and synthesized images, the lower the content/style perceptual distances (in Eq. (8) and Eq. (9)) are, the more similar the contents/styles are. As shown in the fourth rows of Figures 5 and 6, the synthesized images have a lower content perceptual distance to the content target images than to the style target images. From the fifth rows of Figures 5 and 6, we can observe that the synthesized images have a lower style perceptual distance to the style target images than to the content target images. We conclude that the style synthesis of the S3-GAN has the following two advantages: 1) for the content target images, it captures only the content and abandons the style information; 2) for the style target images, it maintains the style and ignores the content information.

4.2.3. Diversity

Furthermore, we analyze the diversity of the S3-GAN from the following two aspects: 1) we apply various style target images to the same content target image to generate synthesized images. As shown in Figure 7, the generated images maintain a structure similar to the original content target image and show different colors and textures according to the different style target images. 2) we use the same style target image and different content target images to synthesize images. As shown in Figure 8, the colors and textures of the generated images are the same as those of the style target image, while the shapes and structures differ. The above results demonstrate the diversity of the proposed S3-GAN. In other words, the S3-GAN can capture various styles and contents in a single model.

Figure 9. Comparison results with four popular style synthesis approaches (Gatys et al., 2016), (Johnson et al., 2016), (Li et al., 2017b) and (Huang and Belongie, 2017).
Method                                     Confidence    Content    Style
Gatys et al. (Gatys et al., 2016)          0.947         5.56       3.18
Johnson et al. (Johnson et al., 2016)      0.938         5.64       3.24
Li et al. (Li et al., 2017b)               0.967         5.66       3.67
Huang et al. (Huang and Belongie, 2017)    0.953         5.36       3.49
S3-GAN (ours)                              0.984         5.25       3.22
Table 2. Confidence scores and perceptual distances of the S3-GAN compared with four popular style synthesis approaches (Gatys et al., 2016), (Johnson et al., 2016), (Li et al., 2017b) and (Huang and Belongie, 2017). Confidence: averaged confidence score of face detection. Content: averaged logarithm of the content perceptual distance. Style: averaged logarithm of the style perceptual distance.

4.2.4. Comparisons

We compare the proposed method with four other popular style synthesis approaches (Gatys et al., 2016), (Johnson et al., 2016), (Li et al., 2017b) and (Huang and Belongie, 2017). As shown in Figure 9, the images generated by the S3-GAN are more visually realistic because they contain distinguishable details. In contrast, the images generated by the four existing approaches are blurry and distorted, and lose many details of the content target images. Inspired by (Wang and Gupta, 2016) and (Li et al., 2017b), we also perform a quantitative experiment. We randomly select 10 images as style target images and another 100 images as content target images, and then generate 1000 synthesized images using the five approaches. To measure the realism of the generated images, we employ the popular MTCNN (Zhang et al., 2016) to perform face detection on them. The more realistic a generated image is, the higher its confidence score is. Thus, we employ the confidence score of face detection, i.e., the softmax probability, to represent the quality of a generated image. As shown in Table 2, the confidence score of the proposed method is higher than those of the four compared methods, which means that the images generated by the proposed method are more realistic. Furthermore, we employ the averaged logarithms of the content/style perceptual distances in Eq. (8) and Eq. (9) to measure the similarity between the synthesis results and the content/style target images. As shown in Table 2, the averaged logarithm of the content perceptual distance of our method is lower than those of the four compared methods. Meanwhile, the averaged logarithm of the style perceptual distance is lower than those of three of the compared methods (Johnson et al., 2016), (Li et al., 2017b) and (Huang and Belongie, 2017). Note that the averaged logarithm of the style perceptual distance of our method is slightly higher than that of (Gatys et al., 2016), because (Gatys et al., 2016) only focuses on transferring style information. The above comparisons show that the images generated by the proposed method can well represent the content and style of the target images.

Figure 10. Effect of different objective functions. $\mathcal{L}_{per}$ is the sum of the content perceptual loss in Eq. (8) and the style perceptual loss in Eq. (9). $\mathcal{L}_{rec}$ is the reconstruction loss in Eq. (11). $\mathcal{L}_{adv}$ is the adversarial loss in Eq. (6).

4.3. Discussion and Analysis

4.3.1. Analysis of the objective function

In this work, the objective function (Eq. (14)) consists of several loss terms, such as the content perceptual loss $\mathcal{L}_{content}$, style perceptual loss $\mathcal{L}_{style}$, reconstruction loss $\mathcal{L}_{rec}$, and adversarial loss $\mathcal{L}_{adv}$. We therefore analyze the effect of each loss on the generated images, and the related results are summarized in Figure 10. From Figure 10, we observe that the images generated using only the perceptual losses ($\mathcal{L}_{per}$) are blurry and distorted, and lack many important details. The reason is that the perceptual loss is a global constraint and has a limited ability to capture subtle information. By adding the reconstruction loss $\mathcal{L}_{rec}$ to the perceptual losses, the generated images are still blurry but more reasonable. In contrast, the images generated by combining the adversarial loss $\mathcal{L}_{adv}$ with the perceptual losses are sharper and more realistic, but lose much important detail compared with the content target images. For example, slight differences in the eyes, eyebrows, and mouths between the generated and content target images can change the identities of the original faces. As the reconstruction loss ensures that the encoder and generator are a pair of inverse mappings, it can compensate for the incorrect details caused by the adversarial loss. Therefore, the reconstruction loss is complementary to the adversarial loss. As shown in Figure 10, fusing the reconstruction loss, adversarial loss, and perceptual losses tackles the above-mentioned problem, and the generated images have sharper and more accurate details.

4.3.2. Analysis of Content and Style Interpolation

We analyze the assumption that object photographs lie on the manifold $\mathcal{M}$ by illustrating the results of content and style interpolation, as shown in Figure 11. The images in the bottom-left and top-right corners are the reconstructions of the two target faces, while the images in the bottom-right and top-left corners are the synthesis results of swapping contents and styles. The horizontal (or vertical) axis indicates the traversal of style (or content), i.e., the images in each row (or column) are style (or content) interpolation results with fixed content (or style). These results show that the contents and styles of the images on the manifold $\mathcal{M}$ are independent. Moreover, the learned encoder and generator build mappings between the manifold $\mathcal{M}$ and the latent space $\mathcal{Z}$, and successfully obtain the representations of content and style.
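The interpolation grid of Figure 11 can be reproduced by linearly interpolating the content halves along one axis and the style halves along the other; a minimal sketch (function and variable names are illustrative, reusing the encoder/generator modules sketched earlier):

```python
import torch

def interpolate_grid(E, G, x_a, x_b, steps=5):
    """Traverse content along one axis and style along the other by linearly
    interpolating the two halves of the latent vectors of x_a and x_b."""
    za_c, za_s = torch.chunk(E(x_a), 2, dim=1)
    zb_c, zb_s = torch.chunk(E(x_b), 2, dim=1)
    rows = []
    for i in range(steps):                      # content axis
        alpha = i / (steps - 1)
        z_c = (1 - alpha) * za_c + alpha * zb_c
        row = []
        for j in range(steps):                  # style axis
            beta = j / (steps - 1)
            z_s = (1 - beta) * za_s + beta * zb_s
            row.append(G(torch.cat([z_c, z_s], dim=1)))
        rows.append(torch.cat(row, dim=0))
    return torch.stack(rows)                    # steps x steps grid of images
```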

Figure 11. Illustration of the learned face manifold and analysis of content and style interpolation. The horizontal axis indicates the traversing of style, and the vertical axis indicates the traversing of content.

5. Conclusion

In this paper, we propose the S3-GAN to implement style separation and style synthesis simultaneously. We assume that the object photographs of a specific category lie on a manifold and that the content and style of an object are independent. We learn an encoder to build the mapping from the manifold to a latent space, in which the content and style of an object are represented by the two halves of its associated latent vector. Thus, style separation is performed by the encoder. We also learn a generator for the inverse mapping, so that the result of style synthesis can be generated by concatenating the style half-vector of the style target image and the content half-vector of the content target image. Experiments on the CelebA and UT Zappos 50K datasets demonstrate the satisfactory results of the proposed S3-GAN.

References

  • Arjovsky et al. (2017) Martín Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017).
  • Chen et al. (2017) Dongdong Chen, Lu Yuan, Jing Liao, Nenghai Yu, and Gang Hua. 2017. StyleBank: An Explicit Representation for Neural Image Style Transfer. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Chi and Peng (2018) Jingze Chi and Yuxin Peng. 2018. Dual Adversarial Networks for Zero-shot Cross-media Retrieval. In International Joint Conference on Artificial Intelligence.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. 2009. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Dumoulin et al. (2016) Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. 2016. A Learned Representation For Artistic Style. In International Conference on Learning Representations.
  • Efros and Freeman (2001) Alexei A. Efros and William T. Freeman. 2001. Image quilting for texture synthesis and transfer. In Conference on Computer Graphics and Interactive Techniques, SIGGRAPH.
  • Efros and Leung (1999) Alexei A. Efros and Thomas K. Leung. 1999. Texture Synthesis by Non-parametric Sampling. In IEEE International Conference on Computer Vision.
  • Gatys et al. (2015) Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. 2015. Texture Synthesis Using Convolutional Neural Networks. In Advances in Neural Information Processing Systems.
  • Gatys et al. (2016) Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. 2016. Image Style Transfer Using Convolutional Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Ghiasi et al. (2017) Golnaz Ghiasi, Honglak Lee, Manjunath Kudlur, Vincent Dumoulin, and Jonathon Shlens. 2017. Exploring the structure of a real-time, arbitrary neural artistic stylization network. In British Machine Vision Conference.
  • Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics (AISTATS).
  • Goodfellow et al. (2014) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems.
  • Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. 2017. Improved Training of Wasserstein GANs. In Advances in Neural Information Processing Systems.
  • Hertzmann et al. (2001) Aaron Hertzmann, Charles E. Jacobs, Nuria Oliver, Brian Curless, and David Salesin. 2001. Image analogies. In Conference on Computer Graphics and Interactive Techniques, SIGGRAPH.
  • Huang and Belongie (2017) Xun Huang and Serge J. Belongie. 2017. Arbitrary Style Transfer in Real-Time with Adaptive Instance Normalization. In IEEE International Conference on Computer Vision.
  • Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning.
  • Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Johnson et al. (2016) Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In European Conference on Computer Vision.
  • Kim et al. (2017) Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. 2017. Learning to Discover Cross-Domain Relations with Generative Adversarial Networks. In International Conference on Machine Learning.
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems.
  • Ledig et al. (2017) Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. 2017. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Li and Wand (2016a) Chuan Li and Michael Wand. 2016a. Combining Markov Random Fields and Convolutional Neural Networks for Image Synthesis. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Li and Wand (2016b) Chuan Li and Michael Wand. 2016b. Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial Networks. In European Conference on Computer Vision.
  • Li et al. (2017a) Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. 2017a. Diversified Texture Synthesis with Feed-Forward Networks. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Li et al. (2017b) Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. 2017b. Universal Style Transfer via Feature Transforms. In Advances in Neural Information Processing Systems.
  • Liu et al. (2015) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep Learning Face Attributes in the Wild. In IEEE International Conference on Computer Vision.
  • Ma et al. (2017) Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. 2017. Pose Guided Person Image Generation. In Advances in Neural Information Processing Systems.
  • Mirza and Osindero (2014) Mehdi Mirza and Simon Osindero. 2014. Conditional Generative Adversarial Nets. arXiv preprint arXiv:1411.1784 (2014).
  • Pathak et al. (2016) Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. 2016. Context Encoders: Feature Learning by Inpainting. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Qi (2017) Guo-Jun Qi. 2017. Loss-Sensitive Generative Adversarial Networks on Lipschitz Densities. arXiv preprint arXiv:1701.06264 (2017).
  • Radford et al. (2016) Alec Radford, Luke Metz, and Soumith Chintala. 2016. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. In International Conference on Learning Representations.
  • Reed et al. (2016) Scott E. Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016. Generative Adversarial Text to Image Synthesis. In International Conference on Machine Learning.
  • Shen and Liu (2017) Wei Shen and Rujie Liu. 2017. Learning Residual Images for Face Attribute Manipulation. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Siarohin et al. (2018) Aliaksandr Siarohin, Enver Sangineto, Stéphane Lathuilière, and Nicu Sebe. 2018. Deformable GANs for Pose-based Human Image Generation. arXiv preprint arXiv:1801.00055 (2018).
  • Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556 (2014).
  • Tsai et al. (2018) Yi-Hsuan Tsai, Wei-Chih Hung, Samuel Schulter, Kihyuk Sohn, Ming-Hsuan Yang, and Manmohan Chandraker. 2018. Learning to Adapt Structured Output Space for Semantic Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Ulyanov et al. (2016) Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, and Victor S. Lempitsky. 2016. Texture Networks: Feed-forward Synthesis of Textures and Stylized Images. In International Conference on Machine Learning.
  • Wang and Gupta (2016) Xiaolong Wang and Abhinav Gupta. 2016. Generative Image Modeling Using Style and Structure Adversarial Networks. In European Conference on Computer Vision.
  • Yang et al. (2017) Chao Yang, Xin Lu, Zhe Lin, Eli Shechtman, Oliver Wang, and Hao Li. 2017. High-Resolution Image Inpainting using Multi-Scale Neural Patch Synthesis. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Yao et al. (2017) Hantao Yao, Shiliang Zhang, Yongdong Zhang, Jintao Li, and Qi Tian. 2017. One-Shot Fine-Grained Instance Retrieval. In ACM Multimedia Conference.
  • Yu and Grauman (2014) Aron Yu and Kristen Grauman. 2014. Fine-Grained Visual Comparisons with Local Learning. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Zhang et al. (2018a) Jian Zhang, Yuxin Peng, and Mingkuan Yuan. 2018a. Unsupervised Generative Adversarial Cross-Modal Hashing. In The Thirty-Second AAAI Conference on Artificial Intelligence.
  • Zhang et al. (2016) Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. 2016. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Process. Lett. (2016).
  • Zhang et al. (2018b) Yiheng Zhang, Zhaofan Qiu, Ting Yao, Dong Liu, and Tao Mei. 2018b. Fully Convolutional Adaptation Networks for Semantic Segmentation. (2018).
  • Zhang et al. (2017) Zhifei Zhang, Yang Song, and Hairong Qi. 2017. Age Progression/Regression by Conditional Adversarial Autoencoder. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Zhao et al. (2016) Junbo Jake Zhao, Michaël Mathieu, and Yann LeCun. 2016. Energy-based Generative Adversarial Network. arXiv preprint arXiv:1609.03126 (2016).
  • Zhou et al. (2017) Shuchang Zhou, Taihong Xiao, Yi Yang, Dieqiao Feng, Qinyao He, and Weiran He. 2017. GeneGAN: Learning Object Transfiguration and Attribute Subspace from Unpaired Data. arXiv preprint arXiv:1705.04932 (2017).
  • Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In IEEE International Conference on Computer Vision.