Exploring Unlabeled Faces for Novel Attribute Discovery

12/06/2019 ∙ by Hyojin Bahng, et al.

Despite remarkable success in unpaired image-to-image translation, existing systems still require a large amount of labeled images. This is a bottleneck for their real-world applications; in practice, a model trained on the labeled CelebA dataset does not work well for test images from a different distribution, greatly limiting their application to unlabeled images of a much larger quantity. In this paper, we attempt to alleviate this necessity for labeled data in the facial image translation domain. We aim to explore the degree to which we can discover novel attributes from unlabeled faces and perform high-quality translation. To this end, we use prior knowledge about the visual world as guidance to discover novel attributes and transfer them via a novel normalization method. Experiments show that our method trained on unlabeled data produces high-quality translations that preserve identity and are as perceptually realistic as, or better than, those of state-of-the-art methods trained on labeled data.


1 Introduction

In recent years, unsupervised image-to-image translation has improved dramatically [CycleGAN2017, StarGAN2018, DRIT, huang2018munit]. Existing translation methods use the term unsupervised to mean translation with unpaired training data (i.e., given images in domains X and Y with no information on which matches which). However, existing systems are, in essence, still trained with supervision, as they require a large amount of labeled images to perform translation. This becomes a bottleneck for real-world application; in practice, a model trained on the labeled CelebA dataset [liu2015faceattributes] does not work well for images from a different test distribution due to dataset bias [torralba2011unbiased, wang2019detecting]. For instance, a model trained on CelebA images is biased towards Western, celebrity faces, which necessitates collecting, labeling, and training on new data to match a different test distribution. Hence, the need for labels greatly limits their application to unlabeled images of a much larger quantity.

In this paper, we attempt to alleviate the necessity for labeled data by automatically discovering novel attributes from unlabeled images, moving towards unpaired and unlabeled multi-domain image-to-image translation. In particular, we focus on translation of facial images, as they require annotation of multiple attributes (e.g., 40 attributes for 202,599 images in CelebA), which makes labeling labor- and time-intensive. While existing benchmark datasets attempt to label as many attributes as they can, we notice that much is still left unnamed; e.g., CelebA only contains the 'pale skin' attribute among all possible skin colors. This makes us wonder: can't we make the attributes "emerge" from the data?

This paper aims to explore the degree to which we can discover novel attributes from unlabeled faces, and proposes our model XploreGAN to this end. We utilize pre-trained CNN features, making the most of what has already been learned about the visual world. Note that the classes used for CNN pre-training (ImageNet classes) differ from the unlabeled data (facial attributes). The goal is to transfer not the specific classes themselves, but general knowledge on what properties make a good class [han2019learning]. We use this knowledge as guidance to group a new set of unlabeled faces, where each group shares a common attribute, and transfer that attribute to an input image via our newly proposed attribute summary instance normalization (ASIN). Unlike previous style normalization methods that generate affine parameters from a single image [dumoulin2017learned, huang2017arbitrary], resulting in translation of entangled attributes (e.g., hair color, skin color, and gender) that exist in the style image, ASIN summarizes the common feature (e.g., blond hair) among a group of images (cluster) and transfers only that common attribute (style) to the input (content). Experiments show that XploreGAN trained on unlabeled data produces high-quality translation results as good as, or better than, state-of-the-art methods trained with labeled data. To the best of our knowledge, this is the first method that moves towards both unpaired and unlabeled image-to-image translation.

2 Proposed Method

While existing methods achieve multi-domain translation using facial datasets that annotate each image with multiple labels (i.e., a one-to-many mapping), we modify this assumption to achieve high-quality performance without utilizing any attribute labels. We first utilize a pre-trained feature space as guidance to cluster unlabeled images by their common attributes. Using the cluster assignments as pseudo-labels, we apply our newly proposed attribute summary instance normalization (ASIN) to summarize the common attribute (e.g., blond hair) among the images in each cluster and perform high-quality translation.

2.1 Clustering for attribute discovery

CNN features pre-trained on ImageNet [deng2009imagenet] have been used to assess perceptual similarity among images [johnson2016perceptual, zhang2018unreasonable]. In other words, images with similar pre-trained features are perceived as similar by humans. Exploiting this property, we propose to discover novel attributes in unlabeled data by clustering their feature vectors obtained from pre-trained networks and using the cluster assignments as pseudo-labels for attributes. In other words, we utilize the pre-trained feature space as guidance to group images by their dominant attributes.

We adopt a standard clustering algorithm, k-means, and partition the features from the pre-trained network into k groups by solving

\min_{C \in \mathbb{R}^{d \times k}} \frac{1}{N} \sum_{n=1}^{N} \min_{y_n \in \{0,1\}^{k},\, y_n^{\top} \mathbf{1}_k = 1} \big\| f(x_n) - C y_n \big\|_2^2,    (1)

where f(x_n) denotes the pre-trained feature of image x_n and the columns of C are the cluster centroids. Solving this problem yields a set of cluster assignments {y_n}, centroids {\mu_c}, and their standard deviations {\sigma_c}. We use the cluster assignments as pseudo-labels for training the auxiliary classifier of the discriminator, and use \mu_c and \sigma_c for conditioning the normalization layers of the generator.

2.2 Attribute summary instance normalization

Normalization layers play a significant role in modeling style. As [huang2017arbitrary] puts it, a single network can "generate images in completely different styles by using the same convolutional parameters but different affine parameters in instance normalization (IN) layers". In other words, to inject a style into a content image, it suffices to tune the scaling and shifting parameters specific to that style after normalizing the content image.

Previous style normalization methods generate affine parameters from a single image instance [dumoulin2017learned, huang2017arbitrary], resulting in translation of entangled attributes (e.g., hair color/shape, skin color, and gender) that exist in the given style image. In contrast, our approach summarizes and transfers the common attribute (e.g., blond hair) within a group of images by generating affine parameters from the feature statistics of each cluster. We call this attribute summary instance normalization (ASIN). We use a multilayer perceptron (MLP) to map cluster statistics to the affine parameters of the normalization layer, defined as

\mathrm{ASIN}(z, c) = \gamma_c \left( \frac{z - \mu(z)}{\sigma(z)} \right) + \beta_c, \quad \text{where } (\gamma_c, \beta_c) = \mathrm{MLP}(\mu_c, \sigma_c),    (2)

where z denotes a generator activation, \mu(z) and \sigma(z) are its per-channel mean and standard deviation as in instance normalization, and (\mu_c, \sigma_c) are the statistics of the target cluster c. As the generator is trained to generalize the common feature within each subset of images (cluster), ASIN allows us to discover multiple attributes in unlabeled data. ASIN can also be used in supervised settings to summarize the common attribute among images with the same label (e.g., black hair). One may generate the affine parameters from both the centroid and variance of each cluster, from the centroid alone, or from the domain pseudo-label (i.e., the cluster assignment). We use the first option in the remainder of the paper for clarity.
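As an illustration, a minimal PyTorch sketch of an ASIN layer might look as follows, assuming the affine parameters are predicted by a small MLP from the target cluster's centroid and standard deviation; module and argument names are our own, and the official architecture may differ.

```python
# A minimal PyTorch sketch of an ASIN layer, assuming the affine parameters
# (gamma_c, beta_c) are predicted by an MLP from the target cluster's centroid
# mu_c and standard deviation sigma_c. Module and argument names are ours.
import torch
import torch.nn as nn

class ASIN(nn.Module):
    def __init__(self, num_features: int, stat_dim: int, hidden: int = 256):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_features, affine=False)
        # maps concatenated cluster statistics to per-channel (gamma, beta)
        self.mlp = nn.Sequential(
            nn.Linear(2 * stat_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 2 * num_features))

    def forward(self, x, mu_c, sigma_c):
        gamma, beta = self.mlp(torch.cat([mu_c, sigma_c], dim=1)).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return gamma * self.norm(x) + beta         # Eq. (2)
```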

2.3 Objective function

Cluster classification loss.

To translate an input image x into a target domain c, we adopt a domain classification loss [StarGAN2018] so that generated images are properly classified as their target domain. However, unlike previous multi-domain translation approaches that rely on given labels for classification [StarGAN2018, pumarola2018ganimation], we use cluster assignments as pseudo-labels for each attribute. We optimize the discriminator to classify real images into their original domain c' with the loss

\mathcal{L}_{cls}^{r} = \mathbb{E}_{x, c'}\big[-\log D_{cls}(c' \mid x)\big].    (3)

Similarly, we optimize the generator to classify fake images into their target domain c with the loss

\mathcal{L}_{cls}^{f} = \mathbb{E}_{x, c}\big[-\log D_{cls}(c \mid G(x, \mu_c, \sigma_c))\big].    (4)

The cluster statistics (\mu_c, \sigma_c) act as the conditional information for translating images into their corresponding pseudo-domains.
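A minimal sketch of these classification losses, assuming the discriminator returns a real/fake score together with cluster logits (this interface is our assumption, not necessarily the authors' implementation):

```python
# Sketch of the cluster classification losses (Eqs. 3-4), using k-means
# pseudo-labels as classification targets.
import torch.nn.functional as F

def d_cls_loss(D, x_real, y_orig):
    _, logits = D(x_real)
    return F.cross_entropy(logits, y_orig)     # classify real images to their original cluster

def g_cls_loss(D, x_fake, y_target):
    _, logits = D(x_fake)
    return F.cross_entropy(logits, y_target)   # classify translated images to the target cluster
```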

Reconstruction and latent loss.

Our generator should be sensitive to changes in content but robust to other variations. To make translated images preserve the content of their input images while changing only domain-related details, we adopt a cycle consistency loss [kim2017learning, CycleGAN2017] for the generator, defined as

\mathcal{L}_{rec} = \mathbb{E}_{x, c, c'}\big[\, \| x - G(G(x, \mu_c, \sigma_c), \mu_{c'}, \sigma_{c'}) \|_1 \,\big],    (5)

where the generator is given the fake image and the original cluster statistics (\mu_{c'}, \sigma_{c'}) and aims to reconstruct the original real image x. We use the L1 norm for the reconstruction loss.

However, solely using a pixel-level reconstruction loss does not guarantee that translated images preserve the high-level content of their original images in settings where a single generator must learn a large number of domains simultaneously (e.g., more than 40). Inspired by [yang2018learning], we adopt a latent loss that minimizes the distance between real and fake images in feature space:

\mathcal{L}_{lat} = \mathbb{E}_{x, c}\big[\, \| E_G(x) - E_G(G(x, \mu_c, \sigma_c)) \| \,\big],    (6)

where E_G denotes the encoder of G. The latent loss ensures that real and fake images have similar high-level feature representations (i.e., are perceptually similar) even though they may differ considerably at the pixel level.
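A brief sketch of these two losses under our assumed interface, where G takes an image plus target cluster statistics and G.encode exposes the encoder features; the squared-L2 distance used for the latent term here is one common choice, not necessarily the authors' exact one.

```python
# Sketch of the cycle-consistency (Eq. 5) and latent (Eq. 6) losses.
import torch

def rec_and_latent_loss(G, x_real, mu_src, sig_src, mu_tgt, sig_tgt):
    x_fake = G(x_real, mu_tgt, sig_tgt)                 # translate to target cluster
    x_rec = G(x_fake, mu_src, sig_src)                  # translate back with source statistics
    l_rec = torch.mean(torch.abs(x_real - x_rec))       # L1 cycle-consistency
    l_lat = torch.mean((G.encode(x_real) - G.encode(x_fake)) ** 2)  # feature-space distance
    return l_rec, l_lat
```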

Adversarial loss.

We adopt the adversarial loss used in GANs to make the generated images indistinguishable from real images. The generator tries to generate a realistic image given the input image x and the target cluster statistics (\mu_c, \sigma_c), while the discriminator tries to distinguish between generated and real images. To stabilize GAN training, we adopt the Wasserstein GAN objective with gradient penalty [arjovsky2017wasserstein, gulrajani2017improved], defined as

\mathcal{L}_{adv} = \mathbb{E}_{x}\big[D_{src}(x)\big] - \mathbb{E}_{x, c}\big[D_{src}(G(x, \mu_c, \sigma_c))\big] - \lambda_{gp}\, \mathbb{E}_{\hat{x}}\big[\big(\|\nabla_{\hat{x}} D_{src}(\hat{x})\|_2 - 1\big)^2\big],    (7)

where \hat{x} is sampled uniformly along straight lines between pairs of real and generated images.
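The gradient-penalty term in Eq. (7) can be sketched as follows (standard WGAN-GP; the discriminator interface follows the assumption above):

```python
# Standard WGAN-GP gradient penalty used in Eq. (7).
import torch

def gradient_penalty(D, x_real, x_fake):
    alpha = torch.rand(x_real.size(0), 1, 1, 1, device=x_real.device)
    x_hat = (alpha * x_real + (1 - alpha) * x_fake).requires_grad_(True)
    src, _ = D(x_hat)
    grad = torch.autograd.grad(outputs=src.sum(), inputs=x_hat, create_graph=True)[0]
    grad_norm = grad.view(grad.size(0), -1).norm(2, dim=1)
    return ((grad_norm - 1) ** 2).mean()
```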

Full objective function.

Finally, our full objective functions for D and G can be written as

\mathcal{L}_{D} = -\mathcal{L}_{adv} + \lambda_{cls}\, \mathcal{L}_{cls}^{r},    (8)
\mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda_{cls}\, \mathcal{L}_{cls}^{f} + \lambda_{rec}\, \mathcal{L}_{rec} + \lambda_{lat}\, \mathcal{L}_{lat}.    (9)

The hyperparameters \lambda_{cls}, \lambda_{rec}, and \lambda_{lat} control the relative importance of each loss term and are fixed across all experiments. At test time, we use the pseudo-labels to generate translated results. We find that the pseudo-labels indeed correspond to meaningful facial attributes; results are demonstrated in Section 3.
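Assembling the pieces sketched above into Eqs. (8)-(9) is then straightforward; the weight values themselves are not reproduced here.

```python
# Assembling Eqs. (8)-(9); the lambda_* weights are the hyperparameters
# referred to in the text (their exact values are not reproduced here).
def full_objectives(l_adv, l_cls_r, l_cls_f, l_rec, l_lat,
                    lambda_cls, lambda_rec, lambda_lat):
    loss_d = -l_adv + lambda_cls * l_cls_r                                           # Eq. (8)
    loss_g = l_adv + lambda_cls * l_cls_f + lambda_rec * l_rec + lambda_lat * l_lat  # Eq. (9)
    return loss_d, loss_g
```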

2.4 Implementation details

Clustering stage.

We use the final convolutional activations (i.e., conv5 for BagNet-17 and ResNet-50) to cluster images according to high-level attributes. We use BagNet-17 [brendel2018approximating] pre-trained on ImageNet (IN) [deng2009imagenet] as the feature extractor for the FFHQ [karras2018style] and CelebA [liu2015faceattributes] datasets, and ResNet-50 [he2016deep] pre-trained on Stylized ImageNet (SIN) [geirhos2018imagenet] as the feature extractor for the EmotioNet [fabian2016emotionet] dataset. The former is effective at detecting local texture cues, while the latter ignores texture cues and detects global shapes effectively. For clustering, the extracted features are L2-normalized and PCA-reduced to 256 dimensions. We utilize the k-means implementation by Johnson et al. [JDH17], choosing k according to the image resolution of each dataset.
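Under these assumptions, the clustering stage could be sketched with the faiss library roughly as follows; the function name and preprocessing order are ours.

```python
# A sketch of the clustering stage: conv5 features are L2-normalized,
# PCA-reduced to 256 dimensions, and clustered with the faiss k-means
# of Johnson et al. [JDH17].
import numpy as np
import faiss

def cluster_features(feats: np.ndarray, k: int):
    feats = np.ascontiguousarray(feats.astype('float32'))
    feats /= np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8   # L2-normalize
    pca = faiss.PCAMatrix(feats.shape[1], 256)                     # reduce to 256 dims
    pca.train(feats)
    feats = pca.apply_py(feats)
    kmeans = faiss.Kmeans(feats.shape[1], k, niter=20)
    kmeans.train(feats)
    _, assignments = kmeans.index.search(feats, 1)                 # nearest centroid = pseudo-label
    return assignments.ravel(), kmeans.centroids
```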

Translation stage.

Adapted from StarGAN [StarGAN2018], our encoder has two convolutional layers for downsampling followed by six residual blocks [he2016deep] with spectral normalization [miyato2018spectral]. Our decoder has six residual blocks with attribute summary instance normalization (ASIN), with per-pixel noise [karras2018style] added after each convolutional layer, followed by two transposed convolutional layers for upsampling. We also adopt stochastic variation [karras2018style] to improve generation of fine, stochastic details. For the discriminator, we use PatchGANs [li2016precomputed, isola2017image, zhu2017unpaired] to classify whether local image patches are real or fake. The MLP that predicts the affine parameters for ASIN consists of seven layers for the FFHQ and EmotioNet datasets and three layers for the CelebA dataset. For training, we use the Adam optimizer with a mini-batch size of 32, a learning rate of 0.0001, and fixed Adam decay rates β₁ and β₂.
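For illustration, a decoder residual block with ASIN and per-pixel noise might be sketched as follows, reusing the ASIN module sketched in Section 2.2; the exact layer ordering of the official model may differ.

```python
# Illustrative sketch of a decoder residual block with ASIN and per-pixel noise.
import torch
import torch.nn as nn

class ASINResBlock(nn.Module):
    def __init__(self, channels: int, stat_dim: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.asin1 = ASIN(channels, stat_dim)
        self.asin2 = ASIN(channels, stat_dim)
        # learned per-channel scaling of the injected noise, as in StyleGAN
        self.noise1 = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.noise2 = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x, mu_c, sigma_c):
        h = self.conv1(x)
        h = h + self.noise1 * torch.randn_like(h)        # per-pixel noise after conv
        h = torch.relu(self.asin1(h, mu_c, sigma_c))
        h = self.conv2(h)
        h = h + self.noise2 * torch.randn_like(h)
        h = self.asin2(h, mu_c, sigma_c)
        return x + h                                     # residual connection
```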

Figure 1: Comparison of style normalization methods. As AdaIN is conditioned on a single image instance to transfer style, it tends to translate entangled attributes of the style image (last three rows). In contrast, ASIN summarizes a common attribute within a group (cluster) of images and transfers that specific feature, while keeping all other attributes (identity) of the content image intact.
Figure 2: Translation results from multiple datasets. XploreGAN can discover various attributes in data such as diverse hair colors, ethnicity, degree of age, and facial expressions from unlabeled images. Note that labels in the figure are assigned post-hoc to enhance the interpretability of the results.
Figure 3: Baseline comparisons. Facial attribute translation results on the CelebA dataset. (a) compares multi-domain translation quality (H: hair color, G: gender, A: aged), and (b) compares multimodal translation quality. Each result of our model is generated from a single cluster's statistics.

3 Experiments

3.1 Datasets

Flickr-Faces-HQ (FFHQ) [karras2018style] is a high-quality human face image dataset with 70,000 images, offering a large variety in age, ethnicity, and background. The dataset is not provided with any attribute labels.

CelebFaces Attributes (CelebA) [liu2015faceattributes] is a large-scale face dataset with 202,599 celebrity images, each annotated with 40 binary attribute labels. In our experiments, we do not utilize the attribute labels for training our model.

EmotioNet [fabian2016emotionet] contains 950,000 face images with diverse facial expressions. The facial expressions are annotated with action units, yet we do not utilize them for training our model.

3.2 Baseline models

We compare with baseline models that utilize unpaired yet labeled datasets. All baseline experiments are conducted using the original code and hyperparameters. As XploreGAN does not use any labels during training, at test time we select the pseudo-labels that best correspond to the labels used by the baseline models (e.g., the pseudo-label that best corresponds to 'blond'). Each result of our model is generated from a single cluster's statistics.

StarGAN is a state-of-the-art multi-domain image translation model that uses attribute labels during training.

DRIT and MUNIT are state-of-the-art models that perform multimodal image translation between two domains.

3.3 Comparison on style normalization

We show a qualitative comparison of group-based ASIN and instance-based AdaIN. For a fair comparison, we substitute the ASIN layers with AdaIN as implemented in [huang2018munit] while keeping all other network architecture and training settings fixed. As shown in Fig. 1, AdaIN depends on a single image instance to transfer style and thus translates entangled attributes (e.g., hair color/shape, gender, background color; last three rows of Fig. 1) that exist in the reference image. In contrast, ASIN summarizes the common attribute within a group of images (e.g., hair color) and transfers that specific attribute. This makes it easy for users to transfer a particular attribute they desire while keeping all other attributes (identity) of the content image intact.

3.4 Qualitative evaluation

As shown in Fig. 3, we qualitatively compare facial attribute translation results on the CelebA dataset. All baseline models are trained using the attribute labels, while XploreGAN is trained on unlabeled data. As we increase the number of clusters k, we can discover multiple subsets of a single attribute (e.g., diverse styles of 'women'; further discussed in Section 3.6). This can be thought of as discovering modes in the data. Thus, we can compare our model not only to multi-domain translation but also to multimodal translation between two domains. Fig. 3 shows that our method can generate translation results of as high quality as other models trained with labels. Also, Fig. 2 shows that XploreGAN can perform high-quality translation across various datasets (FFHQ [karras2018style], CelebA [liu2015faceattributes], and EmotioNet [fabian2016emotionet]). We present additional qualitative results in the Appendix.

3.5 Quantitative evaluation

A high-quality image translation should i) transfer the target attribute well, ii) preserve the identity of the input image, and iii) look realistic to human eyes. We quantitatively measure these three aspects via attribute classification, face verification, and a user study.

Attribute classification. To measure how well a model transfers attributes, we compare the classification accuracy of synthesized images on facial attributes. We train a binary classifier for each of the selected attributes (blond, brown, aged, male, and female) in the CelebA dataset (70%/30% split for training and test sets), which yields an average accuracy of 95.8% on real test images. We train all models on the same training set and perform image translation on the same test set. Finally, we measure the classification accuracy of translated images using the trained classifiers. Surprisingly, XploreGAN outperforms all baseline models on almost all attribute translations, as shown in Table 1. This shows that our method trained on unlabeled data can perform translation as well as, or sometimes even better than, models trained on labeled data.

Method Blond Brown Aged Male Female
Ours 90.2 77.4 90.0 99.7 99.6
StarGAN 90.0 86.1 88.4 97.5 98.0
MUNIT - - - 95.7 99.1
DRIT - - - 98.8 98.5
Real Image 97.2 92.4 93.3 98.5 97.6
Table 1: Classification performance for translated images, evaluated on five CelebA attributes.
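A sketch of the attribute-classification metric above, assuming a trained binary attribute classifier clf that returns one logit per image; names are illustrative.

```python
# Sketch of the attribute-classification metric: accuracy of a binary
# attribute classifier on translated images whose target attribute is
# known by construction.
import torch

@torch.no_grad()
def attribute_accuracy(clf, translated_images, target_labels) -> float:
    logits = clf(translated_images).squeeze(1)
    preds = (torch.sigmoid(logits) > 0.5).long()
    return (preds == target_labels).float().mean().item()
```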

Identity preservation. We measure the identity preservation performance of translated images using a state-of-the-art face verification model. We use ArcFace [deng2018arcface] pre-trained on the MS-Celeb-1M dataset [guo2016msceleb], which shows an average accuracy of 89.76% on the CelebA test set. We then perform image translation on the same unseen test set for five facial attributes (blond hair, brown hair, aged, male, and female). To measure how well a translated image preserves the identity of the input image, we measure face verification accuracy on pairs of real and fake images using the pre-trained verification model. As shown in Table 2, our method produces translation results that preserve the identity of the input image as well as, or sometimes better than, most baseline models trained on attribute labels. Although the multimodal image translation models (MUNIT and DRIT) show high classification performance (i.e., they transfer target attributes well), we observe that they tend to modify the input to an extent that greatly hinders identity preservation.

Method Blond Brown Aged Male Female
Ours 99.3 99.4 99.1 90.1 94.8
StarGAN 96.8 99.0 98.8 97.5 93.7
MUNIT - - - 9.7 16.3
DRIT - - - 72.2 62.0
Table 2: Identity preservation performance of translated images from different methods shown by facial verification accuracy.
Method Hair Aged Gender H+G H+A
Ours 54.7 43.8 64.5 89.6 53.1
StarGAN 45.3 56.2 14.6 10.4 46.9
MUNIT - - 4.2 - -
DRIT - - 16.7 - -
Table 3: Results from the user study. Last two columns correspond to simultaneous translations of multiple domains. (H+G: Hair+Gender, H+A: Hair+Aged)

User study. To evaluate whether the translated outputs look realistic to human eyes, we conduct a user study with 32 participants. Users are asked to choose which output is most successful in producing a high-quality image while preserving content and transferring the target attribute well. Twenty questions were given for each of the six attributes, for a total of 120 questions. Note that MUNIT and DRIT produce multimodal outputs, so a single image is randomly chosen for the user study. Table 3 shows that our model performs as well as supervised models across diverse attributes. Though StarGAN achieves promising results, its outputs for H+G frequently have green artifacts, which decreases user preference.

3.6 Analysis on the clustering stage

Figure 4: Comparing different pre-trained feature spaces: (a) BagNet-17 (IN), (b) ResNet-50 (SIN). Different pre-trained feature spaces yield highly different attribute clusters. (a) Texture-biased representation: as ImageNet-pre-trained BagNets are constrained to look at small local features, they are effective at detecting fine texture cues (e.g., skin color, age, hair color, and lighting). (b) Shape-biased representation: ResNets trained on Stylized-ImageNet (SIN) ignore texture cues and focus on global shape information (e.g., facial expressions, gestures, and viewpoints). We used BagNet-17 as the feature extractor for the CelebA dataset (first four rows) and ResNet-50 pre-trained on SIN for the EmotioNet dataset (last four rows).

Comparison on pre-trained feature spaces.

The pre-trained feature space provides a guideline for grouping the unlabeled images. We find that differences in model architecture and pre-training dataset lead to significantly different feature spaces, i.e., representation bias. We exploit this "skewness" towards recognizing certain types of features (e.g., texture or shape) to group novel images in different directions. We mainly compare two feature spaces: texture-biased BagNets and shape-biased ResNets. It has been found that ImageNet-pre-trained CNNs are strongly biased towards recognizing textures rather than shapes [geirhos2018imagenet]. In relation to this characteristic, BagNets [brendel2018approximating] are designed to be more sensitive to local textures than vanilla ResNets [he2016deep] by limiting the receptive field size; they focus on small local image features without taking their larger spatial relationships into account. On the other hand, ResNets trained on Stylized ImageNet [geirhos2018imagenet] (denoted ResNet (SIN)) ignore texture cues altogether and focus on the global shapes of images.

Fig. 4 shows the characteristics of these two feature spaces. BagNets trained on ImageNet are effective at detecting detailed texture cues (e.g., skin color and texture, degree of age, hair color/shape, lighting). However, BagNets and vanilla ResNets are ineffective at detecting facial emotions, as they tend to produce clustering results biased towards local texture cues. For this purpose, we find that ResNet (SIN) is highly effective at ignoring texture cues and focusing on global shape information (e.g., facial expressions, gestures, viewpoints). Considering these qualities, we adopt BagNet-17 (IN) as the feature extractor for the CelebA [liu2015faceattributes] and FFHQ [karras2018style] datasets, and ResNet-50 (SIN) for the EmotioNet [fabian2016emotionet] dataset.

Figure 5: Effects of increasing the number of clusters. One can discover hidden attribute subsets previously entangled in a single cluster.

Choosing the number of clusters. As shown in Fig. 5, a single blond cluster is subdivided into specific types of blond hair as the number of clusters k increases. Small values of k produce compact clusters with highly distinctive features, while large values of k produce clusters with similar yet more detailed features. In reality, there is no ground truth for the optimal number of clusters. In other words, how a human labeler defines a single attribute in a given dataset is highly subjective (e.g., 'pale makeup' can itself be a single attribute, or it may be further subdivided into 'pale skin', 'wearing eyeshadow', and 'wearing lipstick' depending on the labeler's preference). In our model, users can indirectly control this degree of division by adjusting the number of clusters k.

4 Related Work

Generative adversarial networks (GANs). GANs [goodfellow2014generative] have achieved remarkable success in image generation. The key to its success is the adversarial loss, where the discriminator tries to distinguish between real and fake images while the generator tries to fool the discriminator by producing realistic fake images. Several studies leverage conditional GANs in order to generate samples conditioned on the class [mirza2014conditional, odena2016semi, odena2016conditional], text description [reed2016generative, han2017stackgan, Tao18attngan], domain information [StarGAN2018, pumarola2018ganimation], input image [isola2017image], or color features [bahng2018coloring]. In this paper, we adopt the adversarial loss conditioned on cluster statistics to generate corresponding translated images indistinguishable from real images.

Unpaired image-to-image translation. Image-to-image translation [isola2017image, zhu2017toward] has recently shown remarkable success. CycleGAN [CycleGAN2017] extends image-to-image translation to unpaired settings, which broadens the application of such models to more datasets. Multi-domain image-to-image translation models [StarGAN2018, pumarola2018ganimation] generate diverse outputs when given domain labels. DRIT [DRIT] and MUNIT [huang2018munit] further develop image translation to produce random multimodal outputs using unpaired data. Most existing image-to-image translation models rely on labeled data. Unlike previous approaches that treat the term 'unpaired' as synonymous with unsupervised, we define unsupervised to encompass both unpaired and unlabeled. Under this definition, no previous work on image-to-image translation has tackled such a setting.

Clustering for discovering the unknown.

Clustering is a powerful unsupervised learning method that groups data by similarity. It has been used to discover novel object classes in images [liu2016unsupervised] and videos [osep2019large, triebel2010segmentation, herbst2011rgb, xie2018object]. Instead of discovering new object classes, our work aims to discover attributes within unlabeled data through clustering. Finding attributes is a complicated task, as a single image can have multiple different attributes. To the best of our knowledge, our work is the first to perform image-to-image translation using newly discovered attributes from unlabeled data.

Instance normalization for style transfer.

Batch normalization (BN) was originally introduced to ease the training of neural networks; it normalizes each feature channel by the mean and standard deviation computed over a mini-batch of images. Instance normalization (IN) [ulyanov2016instance] instead computes the mean and standard deviation from each individual sample. Extending IN, conditional instance normalization [dumoulin2017learned] learns a different set of affine parameters for each style. Adaptive instance normalization (AdaIN) [huang2017arbitrary] performs normalization without additional trainable parameters, to which MUNIT adds trainable parameters for stronger translation ability. In contrast to existing normalization methods that perform style transfer from single image instances, our attribute summary instance normalization (ASIN) uses cluster statistics to summarize the common attribute within each cluster and enables translation of fine, detailed attributes.

5 Conclusion

In this paper, we attempt to alleviate the necessity for labeled data in the facial image translation domain. Provided with raw, unlabeled data, we propose an unpaired and unlabeled multi-domain image-to-image translation method. We utilize prior knowledge from pre-trained feature spaces to group unseen, unlabeled images. Attribute summary instance normalization (ASIN) can effectively summarize the common attribute within clusters, enabling high-quality translation of specific attributes. We demonstrate that our model can produce results as good as, or sometimes better than, most state-of-the-art methods.

References