Toward Zero-Shot Unsupervised Image-to-Image Translation

07/28/2020, by Yuanqi Chen et al., Peking University

Recent studies have shown remarkable success in unsupervised image-to-image translation. However, without access to enough images of the target classes, learning a mapping from source classes to target classes suffers from mode collapse, which limits the application of existing methods. In this work, we propose a zero-shot unsupervised image-to-image translation framework to address this limitation by associating categories with their side information, such as attributes. To generalize the translator to previously unseen classes, we introduce two strategies for exploiting the space spanned by the semantic attributes. Specifically, we propose to preserve semantic relations to the visual space and to expand the attribute space by utilizing attribute vectors of unseen classes, thus encouraging the translator to explore the modes of unseen classes. Quantitative and qualitative results on different datasets demonstrate the effectiveness of the proposed approach. Moreover, we show that our framework can be applied to many tasks, such as zero-shot classification and fashion design.


1 Introduction

Deep learning models have achieved great success in image-to-image (I2I) translation tasks. Recently, a line of work has aimed to learn mappings among multiple classes Choi et al. (2018); Yu et al. (2018); Liu et al. (2018). However, the issue of class imbalance usually degrades the performance of these methods. In the facial attribute transfer task, for example, there are far fewer samples of people with glasses than of people without glasses. As a result, image translation models tend to be biased towards the majority class and generate glasses with only a few styles Mariani et al. (2018); Ali-Gombe and Elyan (2019).

In this work, we explore the extreme case of the class imbalance issue: the zero-shot setting. We begin by constructing a multi-domain I2I translation model as a baseline to align the semantic attributes with the corresponding visual characteristics for seen classes. However, when translating the input image to unseen classes, the translator suffers from mode collapse, where the output images collapse to a few modes specified by some seen classes. As shown in Fig. 1, although we want to generate images of the chestnut-sided warbler, mockingbird, black-billed cuckoo, and Bohemian waxwing, the four outputs all tend to have the same appearance as the warbling vireo, which is one of the seen classes and has an appearance similar to the above four unseen classes.

Learning to map images from seen classes to unseen classes is challenging. First, paired training images are difficult or even impossible to collect. For example, it is hard to obtain an image pair of birds from different species but with the same pose and the same background. Moreover, this task requires generalization capability, while most existing techniques do not meet this requirement and suffer from the mode collapse problem described above. Recently, OST Benaim and Wolf (2018) and FUNIT Liu et al. (2019b) were proposed to address these issues in a one-shot and a few-shot manner, respectively. Different from these two methods, we take a step further and focus on zero-shot unsupervised I2I translation.

Figure 1: The limitation of existing image-to-image translation methods. They are successful in translating the input image to seen classes if these classes have enough samples during training time. However, when generating images of previous unseen classes, these methods suffer from the mode collapse problem, where the translator only produces outputs from a few modes of the data distribution.

To generalize the translator to unseen classes, we hypothesize that by exploiting the space spanned by the semantic attributes, the model can generate images of unseen classes conditioned on the corresponding attributes. Specifically, we propose to preserve semantic relations to the visual space and to expand the attribute space by utilizing attribute vectors of unseen classes. As pointed out in Annadani and Biswas (2018), it is crucial to inherit the properties of the semantic space for zero-shot learning. To this end, we first define similarity metrics in the attribute space and the visual space separately, and then introduce a regularization term for preserving semantic relations. As for expanding the attribute space, since attribute vectors of unseen classes have no corresponding images during training, we propose an attribute regression loss for unseen semantic attribute vectors to bridge the semantic attributes and the visual characteristics effectively.

Later in the experiments, the effectiveness of our proposed method is evaluated on two datasets via quantitative metrics and qualitative comparisons. We also show that the proposed method can be applied to many tasks, such as zero-shot classification and fashion design.

The contributions of this work are summarized as follows:

  • We propose a framework for zero-shot unsupervised image-to-image translation, which alleviates the problem of mode collapse when synthesizing images of unseen classes.

  • By preserving semantic relations and expanding attribute space, our proposed model is efficient in utilizing the attribute space, and both quantitative and qualitative results on different datasets demonstrate its effectiveness.

  • We demonstrate the application of our framework on the zero-shot classification task and fashion design task. We achieve competitive performance compared with existing methods.

2 Related Work

2.1 Image-to-Image Translation

aims to translate images from one visual domain to another. Many computer vision tasks can be handled in the image-to-image (I2I) translation framework, e.g., image colorization Isola et al. (2017), image deblurring Lu et al. (2019), and image super-resolution Ledig et al. (2017). To learn convincing mappings across image domains from unpaired images, CycleGAN Zhu et al. (2017a), DiscoGAN Kim et al. (2017) and DualGAN Yi et al. (2017) introduce a cycle-consistency constraint and train two cross-domain translation models. Recent works Choi et al. (2018); Yu et al. (2018); Liu et al. (2018) extend the I2I framework from two domains to multiple domains under a single unified framework. To learn an interpretable representation and further improve model performance on the I2I task, there is a vast literature on disentangling representations Liu et al. (2018, 2019b); Lample et al. (2017); Huang et al. (2018); Lee et al. (2018); Singh et al. (2019); Yu et al. (2019). MUNIT Huang et al. (2018) and DRIT Lee et al. (2018) focus on disentangling images into domain-invariant and domain-specific representations for producing diverse translation outputs. FineGAN Singh et al. (2019) disentangles the background, object shape, and object appearance to hierarchically generate images of fine-grained object categories.

However, these methods fail to generalize to unseen domains based on prior knowledge of seen domains. To improve the generalization capability of the I2I framework, OST Benaim and Wolf (2018) achieves one-shot cross-domain translation using a single source class image and many target class images. However, OST can only map images between two classes. By extracting appearance patterns from the target class images, FUNIT Liu et al. (2019b) is capable of translating images of seen classes to analogous images of previously unseen classes. Different from OST and FUNIT, to further improve the machine imagination capability, we assume that the images of target classes are unavailable even at test time and explore zero-shot unsupervised image-to-image translation.

2.2 Zero-Shot Classification

aims to learn a model with generalization ability that can recognize unseen objects given only their semantic descriptions. Since the seen and unseen objects are connected only in the semantic space and are disjoint in the visual space, early works Romera-Paredes and Torr (2015); Socher et al. (2013); Lampert et al. (2013); Frome et al. (2013); Akata et al. (2015b, a) learn a visual-semantic mapping from the seen samples. At test time, unseen objects are projected from the visual space to the semantic space and then classified by their semantic attributes. To address the hubness problem arising in the above methods, Shigeto et al. (2015); Zhang et al. (2017) propose to use the visual space as the embedding space. Another approach is to augment the training set with synthesized samples for unseen classes. Several recent works Kumar Verma et al. (2018); Xian et al. (2018b); Huang et al. (2019); Li et al. (2019); Elhoseiny and Elfeki (2019) focus on generating new visual features conditioned on the semantic attributes of novel classes. With the synthesized data of unseen classes, zero-shot classification becomes a conventional classification problem. There are also some works Joseph et al. (2018); Kim et al. (2020) that try to generate images of unseen classes rather than synthesizing at the feature level. Different from these methods, our goal is to translate images of seen classes to unseen classes while retaining class-independent information.

3 Proposed Method

The definition of zero-shot image-to-image translation is as follows. Let $\mathcal{S} = \{(x, y, a)\}$ be a set of seen samples consisting of image $x$, class label $y \in \mathcal{Y}^{s}$, and semantic attribute $a \in \mathcal{A}^{s}$. The attribute $a$ is the class embedding of class $y$ that models the semantic relationship between classes. As for unseen classes, we have no access to the unseen image set at training time, and an auxiliary training set is $\mathcal{U} = \{(y^{u}, a^{u})\}$ with $y^{u} \in \mathcal{Y}^{u}$ and $a^{u} \in \mathcal{A}^{u}$. In the zero-shot learning setting, the seen and the unseen classes are disjoint, i.e., $\mathcal{Y}^{s} \cap \mathcal{Y}^{u} = \emptyset$. The goal of zero-shot image-to-image translation is to learn a mapping $G: (x, a) \mapsto \tilde{x}$ with $a \in \mathcal{A}^{s} \cup \mathcal{A}^{u}$, which indicates that the translator should be capable of conducting the mapping for both seen and unseen classes.

We begin by constructing a multi-domain I2I translation model as a baseline in Section 3.1, and then introduce the proposed strategies that enable generalizing to unseen classes in Section 3.2.

3.1 Multi-Domain I2I Translation

We first consider how to learn the mapping for seen classes, i.e., $G: (x, a) \mapsto \tilde{x}$ with $a \in \mathcal{A}^{s}$, and build a multi-domain I2I translation model as a baseline. Although it lacks the generalization capability to unseen classes, this baseline model learns to align the semantic attributes with the corresponding visual characteristics, e.g., it learns the relationship between the attribute "white wings" and its visual representation.

The proposed baseline model consists of a conditional generator $G$ and a multi-task discriminator $D$. $G$ learns to translate an input image $x$ to an output image conditioned on the target attribute vector $a'$, i.e., $\tilde{x} = G(x, a')$. It is noteworthy that $x$ and $a'$ are both sampled from the set of seen samples $\mathcal{S}$. To make full use of the information in the semantic attribute $a'$, each residual block in the decoder of $G$ is equipped with adaptive instance normalization (AdaIN) Huang and Belongie (2017); Huang et al. (2018) for information injection. As for the discriminator $D$, we equip it with an auxiliary attribute regressor $R$ Zhu et al. (2017b); Chen et al. (2016) to discriminate whether the output image has the visual characteristics of the conditional attribute $a'$.
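As a concrete illustration, the sketch below shows, in PyTorch, one way the attribute vector could be injected through AdaIN residual blocks in the decoder. Class names, the hidden width, and the grouping of affine parameters are illustrative assumptions; the actual layer specification is given in Appendix A.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance normalization: normalize each feature map, then apply a
    per-channel scale and shift predicted from the attribute vector."""
    def forward(self, x, gamma, beta):
        mu = x.mean(dim=(2, 3), keepdim=True)
        sigma = x.std(dim=(2, 3), keepdim=True) + 1e-5
        return gamma[..., None, None] * (x - mu) / sigma + beta[..., None, None]

class AdaINResBlock(nn.Module):
    """Decoder residual block whose normalization parameters come from the attribute."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, 1, 1)
        self.conv2 = nn.Conv2d(channels, channels, 3, 1, 1)
        self.adain = AdaIN()
        self.act = nn.ReLU(inplace=True)

    def forward(self, x, params):
        g1, b1, g2, b2 = params          # affine parameters for the two convolutions
        h = self.act(self.adain(self.conv1(x), g1, b1))
        h = self.adain(self.conv2(h), g2, b2)
        return x + h

class AttributeMLP(nn.Module):
    """Maps the attribute vector a to the AdaIN affine parameters of all blocks."""
    def __init__(self, attr_dim, channels=512, num_blocks=2, hidden=256):
        super().__init__()
        self.num_blocks = num_blocks
        self.net = nn.Sequential(
            nn.Linear(attr_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, num_blocks * 4 * channels))

    def forward(self, a):
        chunks = self.net(a).chunk(self.num_blocks * 4, dim=1)
        # group per block: [(gamma1, beta1, gamma2, beta2), ...]
        return [tuple(chunks[i * 4:(i + 1) * 4]) for i in range(self.num_blocks)]
```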

3.1.1 Adversarial Loss.

To match the distribution of synthesized images to the real data of the corresponding class, we adopt an adversarial loss

$\mathcal{L}_{adv} = \mathbb{E}_{x}\left[\log D_{y}(x)\right] + \mathbb{E}_{x, a'}\left[\log\left(1 - D_{y'}(G(x, a'))\right)\right]$  (1)

where $D_{y}$ tries to distinguish between real and generated images of the given class label $y$, while $G$ attempts to generate realistic images.
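A sketch of Eq. (1), assuming the multi-task discriminator outputs one real/fake map per seen class and the map of the relevant class is selected (FUNIT-style). The cross-entropy form is shown for readability; Section 4.1 swaps in a hinge loss.

```python
import torch
import torch.nn.functional as F

def d_adv_loss(d_real_logits, d_fake_logits, y_real, y_target):
    """d_*_logits: (B, num_seen_classes, H, W); y_*: (B,) class indices."""
    real = d_real_logits[torch.arange(len(y_real)), y_real]       # maps of the true classes
    fake = d_fake_logits[torch.arange(len(y_target)), y_target]   # maps of the target classes
    return F.binary_cross_entropy_with_logits(real, torch.ones_like(real)) + \
           F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake))

def g_adv_loss(d_fake_logits, y_target):
    fake = d_fake_logits[torch.arange(len(y_target)), y_target]
    return F.binary_cross_entropy_with_logits(fake, torch.ones_like(fake))
```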

3.1.2 Attribute Regression Loss.

To encourage the generator to utilize the attribute vector, we introduce an attribute regression by adding an auxiliary regressor $R$ on top of $D$. The attribute regression loss of generated images can be written as

$\mathcal{L}_{reg}^{f} = \mathbb{E}_{x, a'}\left[\lVert R(G(x, a')) - a' \rVert_{2}^{2}\right]$  (2)

where $R$ approximates the posterior $p(a \mid x)$. On the other hand, the attribute regression loss of real images is defined as

$\mathcal{L}_{reg}^{r} = \mathbb{E}_{x, a}\left[\lVert R(x) - a \rVert_{2}^{2}\right]$  (3)
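A sketch of Eqs. (2) and (3), assuming an L2 regression penalty; $R$ is the auxiliary regressor on top of the discriminator.

```python
import torch.nn.functional as F

def attr_reg_fake(regressor, fake_images, target_attr):
    # Eq. (2): generated images should carry the conditioning attribute a'.
    return F.mse_loss(regressor(fake_images), target_attr)

def attr_reg_real(regressor, real_images, real_attr):
    # Eq. (3): the regressor is trained on real images and their attributes.
    return F.mse_loss(regressor(real_images), real_attr)
```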

3.1.3 Self-Reconstruction Loss.

In addition to the above losses, we impose a self-reconstruction loss to facilitate the training process, which can be written as

$\mathcal{L}_{rec} = \mathbb{E}_{x, a}\left[\lVert G(x, a) - x \rVert_{1}\right]$  (4)

where $a$ is the attribute vector of the input image's own class.
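A corresponding sketch of Eq. (4), assuming an L1 reconstruction penalty.

```python
import torch.nn.functional as F

def self_reconstruction_loss(generator, x, own_attr):
    # Translating an image with the attribute of its own class should reproduce it.
    return F.l1_loss(generator(x, own_attr), x)
```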
Figure 2: Overview. (a) The multi-domain I2I translation model learns to align the semantic attributes with the corresponding visual characteristics for seen classes. (b) To preserve semantic relations, for two sampled classes $c_i$ and $c_j$, we first calculate their similarity in the attribute space and in the visual space, and then regularize the two similarities to be close to each other. (c) To expand the attribute space, we utilize an attribute $a^{u}$ sampled from the unseen classes and introduce an attribute regression loss that provides effective incentives to synthesize images with the visual characteristics of $a^{u}$.

3.2 Generalizing to Zero-Shot I2I Translation

Recall that our goal is to learn a mapping $G: (x, a) \mapsto \tilde{x}$ for both seen and unseen classes. With the above multi-domain I2I translator, we can conduct this mapping for seen classes and synthesize their images with high fidelity. However, when it comes to unseen classes, the translator suffers from mode collapse Goodfellow et al. (2014); Yang et al. (2019); Mao et al. (2019). As shown in Fig. 1, although the attribute vectors from different unseen classes are injected, the output images collapse to a few modes specified by some seen classes, which indicates that the translator tends to ignore some information in the attribute vectors of unseen classes. We hypothesize that by exploiting the space spanned by the semantic attributes, the model can generate images of unseen classes conditioned on the corresponding attributes.

3.2.1 Preserving Semantic Relations.

For unseen classes, as there are no images available during the training process, we need to use the attribute vectors to provide effective guidance on where to map the input image. To this end, we propose to preserve the relations in the attribute space to the visual space. We begin by defining the similarity metrics in the attribute space and the visual space separately, for any two sampled classes $c_i$ and $c_j$. For the relations in the attribute space, we define the semantic relationship measure of two attribute vectors $a_i$ and $a_j$ using cosine similarity Annadani and Biswas (2018):

$s_{att}(a_i, a_j) = \dfrac{a_i \cdot a_j}{\lVert a_i \rVert \, \lVert a_j \rVert}$  (5)

However, due to the high dimensionality of images, it is hard to measure the relationship in the visual space directly. To this end, we propose a simple yet effective approach to inherit the relations of the attribute space in the visual space. As our proposed generator is style-based, given the input image $x$ and the target attribute $a_i$, AdaIN Huang and Belongie (2017); Huang et al. (2018) first learns a set of affine transformation parameters $(\gamma_i, \beta_i)$ computed from $a_i$ via a multi-layer fully connected network. It then adjusts the activations $z$ to fit the data distribution specified by $a_i$:

$\mathrm{AdaIN}(z, \gamma_i, \beta_i) = \gamma_i \left( \dfrac{z - \mu(z)}{\sigma(z)} \right) + \beta_i$  (6)

where each feature map of $z$ is normalized separately, and then scaled and biased using the corresponding scalar components from $(\gamma_i, \beta_i)$. In other words, for any two sampled classes $c_i$ and $c_j$, their learned affine transformation parameters $p_i = (\gamma_i, \beta_i)$ and $p_j = (\gamma_j, \beta_j)$ describe the characteristics of their data distributions in the visual space. In this way, we can utilize the similarity between $p_i$ and $p_j$ to approximate the similarity between the visual subspaces of $c_i$ and $c_j$:

$s_{vis}(c_i, c_j) = \dfrac{p_i \cdot p_j}{\lVert p_i \rVert \, \lVert p_j \rVert}$  (7)
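A sketch of Eqs. (5) and (7), assuming the AttributeMLP interface sketched in Section 3.1 and flattening each block's (gamma, beta) parameters into a single vector per class.

```python
import torch
import torch.nn.functional as F

def attribute_similarity(a_i, a_j):
    # Eq. (5): cosine similarity in the attribute space.
    return F.cosine_similarity(a_i, a_j, dim=-1)

def visual_similarity(mlp, a_i, a_j):
    # Eq. (7): cosine similarity between the AdaIN affine parameters predicted from
    # a_i and a_j, used as a proxy for the similarity of the visual subspaces.
    p_i = torch.cat([torch.cat(block, dim=-1) for block in mlp(a_i)], dim=-1)
    p_j = torch.cat([torch.cat(block, dim=-1) for block in mlp(a_j)], dim=-1)
    return F.cosine_similarity(p_i, p_j, dim=-1)
```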

We then introduce a regularization term to preserve the relations in the attribute space to the visual space:

$\mathcal{L}_{pre} = \lVert s_{vis}(c_i, c_j) - s_{att}(a_i, a_j) \rVert_{1}$  (8)
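Building on the two similarity helpers above, Eq. (8) could be sketched as follows; penalizing the absolute difference is an assumption about the exact form.

```python
import torch

def semantic_relation_loss(mlp, a_i, a_j):
    # Eq. (8): push the visual-space similarity toward the attribute-space similarity.
    s_att = attribute_similarity(a_i, a_j)      # Eq. (5)
    s_vis = visual_similarity(mlp, a_i, a_j)    # Eq. (7)
    return torch.abs(s_vis - s_att).mean()
```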

The proposed strategy shares some similarities with DistanceGAN Benaim and Wolf (2017), DSGAN Yang et al. (2019), and MSGAN Mao et al. (2019). DistanceGAN learns the mapping between the source domain and the target domain in a one-sided unsupervised way by preserving the distance between two samples in the source visual domain to the target visual domain. Different from it, we preserve the relations in the attribute space to the visual space. To produce diverse outputs, DSGAN and MSGAN propose to maximize the distance between generated images with respect to the distance between the corresponding latent codes. Instead of maximizing the ratio of the distances in the visual space and the latent space, our proposed strategy encourages the generator to keep this ratio at 1, aligning the semantic attributes with the corresponding visual characteristics. Moreover, we calculate the distance in the visual space using high-level statistics rather than raw RGB values, which alleviates the bias caused by class-invariant information such as pose and lighting.

3.2.2 Expanding Attribute Space.

The mode collapse problem is similar to phenomena observed in imbalanced learning Kim et al. (2020) and zero-shot classification Chao et al. (2016); Song et al. (2018). There exists a strong mapping bias in the phase of bridging the semantic attributes and the visual characteristics, which causes a degradation in performance when generating images of unseen classes. In the training phase of the above multi-domain I2I translation model, as we only see samples from seen classes, the generator tends to map the input image to visual subspaces specified by the seen classes. In this way, given a novel semantic attribute from an unseen class, the output image, although realistic, tends to exhibit the visual characteristics of some seen classes rather than those of the given class.

To address the above problem, we explore the use of the semantic attributes in $\mathcal{A}^{u}$ to expand the attribute space. Since a novel attribute $a^{u}$ has no corresponding images during training, the above adversarial training cannot be applied in this case. For effective incentives to synthesize images with the visual characteristics of $a^{u}$, we propose an attribute regression loss for unseen semantic attributes:

$\mathcal{L}_{reg}^{u} = \mathbb{E}_{x, a^{u}}\left[\lVert R(G(x, a^{u})) - a^{u} \rVert_{2}^{2}\right]$  (9)

where $R$ approximates the posterior $p(a \mid x)$. Similar to InfoGAN Chen et al. (2016), the above loss term maximizes the mutual information between the attribute vector $a^{u}$ and the generated image $G(x, a^{u})$. In this way, the generator has access to the semantic attributes in $\mathcal{A}^{u}$ during the training phase, so these semantic attributes are no longer novel to $G$ during the testing phase, which helps to alleviate the mapping bias for unseen classes.
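A sketch of Eq. (9): attribute vectors are additionally drawn from the auxiliary set of unseen classes and supervised only through the regressor $R$, since no real images of these classes exist for an adversarial term. The L2 form is an assumption.

```python
import random
import torch.nn.functional as F

def unseen_attr_reg_loss(generator, regressor, x, unseen_attr_bank):
    a_u = random.choice(unseen_attr_bank)   # one unseen-class attribute vector, shape (n_a,)
    a_u = a_u.expand(x.size(0), -1)         # broadcast to the batch
    fake = generator(x, a_u)
    return F.mse_loss(regressor(fake), a_u)
```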

Finally, the objective functions to optimize $D$ and $G$ are written as

$\mathcal{L}_{D} = -\mathcal{L}_{adv} + \lambda_{reg}\,\mathcal{L}_{reg}^{r}$  (10)
$\mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda_{reg}\left(\mathcal{L}_{reg}^{f} + \mathcal{L}_{reg}^{u}\right) + \lambda_{rec}\,\mathcal{L}_{rec} + \lambda_{pre}\,\mathcal{L}_{pre}$  (11)

where the hyper-parameters $\lambda_{reg}$, $\lambda_{rec}$, and $\lambda_{pre}$ control the importance of the corresponding terms.
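Reusing the loss helpers sketched above, a hypothetical training iteration assembling Eqs. (10) and (11) might look as follows. The sample_target helper (drawing a target seen class and attribute for each input) and the dictionary of loss weights are illustrative, not part of the paper; optimizer settings are given in Section 4.1.

```python
def training_step(G, D, R, mlp, opt_G, opt_D, batch, unseen_attr_bank, lambdas):
    x, y, a = batch                    # seen images, class labels, attributes
    y2, a2 = sample_target(batch)      # hypothetical: target seen class and attribute

    # ---- discriminator update, Eq. (10) ----
    fake = G(x, a2).detach()
    loss_D = d_adv_loss(D(x), D(fake), y, y2) + lambdas["reg"] * attr_reg_real(R, x, a)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # ---- generator update, Eq. (11) ----
    fake = G(x, a2)
    loss_G = (g_adv_loss(D(fake), y2)
              + lambdas["reg"] * (attr_reg_fake(R, fake, a2)
                                  + unseen_attr_reg_loss(G, R, x, unseen_attr_bank))
              + lambdas["rec"] * self_reconstruction_loss(G, x, a)
              + lambdas["pre"] * semantic_relation_loss(mlp, a, a2))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```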

4 Experiments

4.1 Implementation Details

The generator $G$ contains an encoder-decoder architecture, and each residual block in the decoder of $G$ is equipped with adaptive instance normalization (AdaIN) Huang and Belongie (2017); Huang et al. (2018) for information injection. For the discriminator $D$, we adopt the multi-task adversarial discriminator architecture of FUNIT Liu et al. (2019b), which leverages PatchGANs Isola et al. (2017) to classify whether local image patches are real or fake conditioned on the given class label. We build our model on the hinge variant of GANs Miyato et al. (2018); Zhang et al. (2019), which uses a hinge loss instead of a cross-entropy loss, and we use the real gradient penalty regularization Mescheder et al. (2018) to stabilize training. The hyper-parameters $\lambda_{reg}$, $\lambda_{rec}$, and $\lambda_{pre}$ are kept fixed across all experiments. We train all our models with the Adam optimizer Kingma and Ba (2015) with a learning rate of 0.0001 and fixed exponential decay rates on a single NVIDIA V100 GPU. We refer the reader to our supplementary materials for more details about the network architecture.
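A minimal sketch of the hinge adversarial losses and the real (R1) gradient penalty mentioned above; the penalty weight gamma is an assumption.

```python
import torch
import torch.nn.functional as F

def hinge_d_loss(real_logits, fake_logits):
    return F.relu(1.0 - real_logits).mean() + F.relu(1.0 + fake_logits).mean()

def hinge_g_loss(fake_logits):
    return -fake_logits.mean()

def r1_penalty(real_logits, real_images, gamma=10.0):
    # real_images must have requires_grad_(True) set before the forward pass that
    # produced real_logits (Mescheder et al., 2018).
    grad = torch.autograd.grad(outputs=real_logits.sum(), inputs=real_images,
                               create_graph=True)[0]
    return 0.5 * gamma * grad.pow(2).flatten(1).sum(1).mean()
```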

4.2 Datasets

4.2.1 Cub.

The Caltech-UCSD Birds-200-2011 (CUB) Wah et al. (2011) dataset consists of 11,788 images from 200 bird species. Each image is annotated with 312 attribute labels by MTurk workers. The training set and test set of CUB contain 150 and 50 species, respectively.

4.2.2 Flo.

Oxford Flowers (FLO) Nilsback and Zisserman (2008) is another dataset commonly used in zero-shot learning tasks. It contains 102 flower categories with 8,189 images. For this dataset, we use the text embeddings provided by Reed et al. (2016) as side information. FLO is split into 82 training classes and 20 test classes.

4.3 Metrics

4.3.1 Fréchet Inception Distance.

To quantify performance, we adopt the Fréchet Inception Distance (FID) Heusel et al. (2017) to evaluate the quality of generated images. For each unseen class, we first randomly sample images from the seen classes as input images, with a number similar to the number of real images in that unseen class. Then we synthesize novel images conditioned on the semantic attribute of the class, based on these sampled images. After synthesis, we compute the FID score between the distributions of generated and real images over all unseen classes. For seen classes, we conduct the same operation and report the FID score to evaluate performance on seen classes. The lower the FID score, the better the quality of the generated images.
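A sketch of this evaluation protocol; batching is omitted and compute_fid is a hypothetical wrapper around any standard FID implementation.

```python
import random

def fid_unseen(generator, seen_images, unseen_classes, real_images_by_class, attrs):
    generated, real = [], []
    for c in unseen_classes:
        n = len(real_images_by_class[c])                # match the real-image count
        inputs = random.sample(seen_images, n)          # seen-class images as inputs
        generated += [generator(x, attrs[c]) for x in inputs]
        real += real_images_by_class[c]
    return compute_fid(generated, real)                 # hypothetical FID helper
```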

4.3.2 Classification Accuracy.

To measure whether a translation output belongs to the target class, following FUNIT Liu et al. (2019b), we adopt an Inception-V3 Szegedy et al. (2016) classifier trained on all the classes of the dataset. We report both Top-1 and Top-5 accuracies for unseen classes and seen classes. The higher the accuracy, the more relevant the generated images are to their target class. We also report these two metrics for the ground truth (GT) in the test set.

4.3.3 Human Perception.

To judge the visual realism of the generated images, we perform a user study on Amazon Mechanical Turk (AMT) platform. For each task, we randomly generate 2,500 questions by sampling images of all unseen classes. The workers are given four target class images and a series of translation outputs generated by different methods. They are given unlimited time to choose which synthesis is more similar to the target class images. Each question is answered by 5 different workers.

4.4 Main Results

Since there is no previous unsupervised I2I translation method designed for our setting, we compare with FUNIT Liu et al. (2019b), the most closely related state-of-the-art method, which performs few-shot unsupervised I2I translation. At test time, given a few images from the target category, FUNIT extracts appearance patterns from them and applies these patterns to input images for the translation task, while we assume that the images of the target category are unavailable even at test time. We evaluate its performance under the 1-shot and 5-shot settings, denoted FUNIT-1 and FUNIT-5, respectively. As a fair comparison, we also compare our proposed framework against StarGAN Choi et al. (2018), which is the state of the art for multi-class unsupervised I2I translation.

4.4.1 Qualitative Evaluation.

As the qualitative comparison in Fig. 3 shows, the outputs of FUNIT-1 and FUNIT-5, although realistic, cannot well preserve the background of the input images, which is caused by their example-guided image translation. When extracting appearance patterns from example images, the model extracts not only class-specific characteristics of the target category but also other information in the example images, e.g., the background. The more example images are available, the more characteristics of the target category are captured; thus, the outputs of FUNIT-5 tend to be more relevant to their target categories than those of FUNIT-1. As for the fair comparison, we observe that StarGAN produces results with significant artifacts and cannot obtain the visual characteristics of the target class from its semantic attribute. We conjecture that this is because StarGAN lacks sufficient incentives to utilize the attribute vector. In contrast, for both unseen and seen classes, our method generates realistic images that are relevant to their target categories.

As shown in Fig. 4, when translating an input image to multiple target categories, our results contain the required visual characteristics while retaining other class-independent information (e.g., pose, background). This indicates that our method achieves feature disentanglement: by encouraging the generator to exploit the semantic attribute vectors, the class-dependent information of the output comes from the semantic attribute vector, while the class-independent information is extracted from the input image.

Figure 3: Qualitative comparison on CUB and FLO. For each dataset, the first column shows the input images sampled from seen classes, and the second column represents the characteristics of the target category. Each of the remaining columns shows the outputs from a method. The blue and orange borders indicate that the target category is sampled from unseen classes and seen classes, respectively.
Figure 4: Qualitative results on CUB. The first row shows the input images sampled from seen classes. The first column represents the characteristics of the target categories, and the blue and orange borders indicate that the target category is sampled from unseen classes and seen classes, respectively. The remaining images are the translation results of our proposed method.
                       CUB                                         FLO
           Unseen                Seen                  Unseen                Seen
           Top-1  Top-5  FID     Top-1  Top-5  FID     Top-1  Top-5  FID     Top-1  Top-5  FID
FUNIT-1     4.68  22.80  41.48   29.53  55.73  32.33    6.90  32.00  77.77   49.17  71.12  37.56
FUNIT-5     7.48  31.96  39.09   41.59  68.92  29.34    9.00  40.70  75.87   64.10  83.22  36.27
StarGAN     4.52  17.08 142.32    6.47  20.72 136.77    2.30  12.50  75.99    3.10  12.81  40.86
Ours       16.88  57.48  40.78   59.81  83.49  34.89    9.30  40.90  73.17   49.07  73.07  37.44
GT         82.84  95.65      -   84.69  96.50      -   97.59 100.00      -   97.86  99.68      -
Table 1: Quantitative comparison on CUB and FLO.

4.4.2 Quantitative Evaluation.

As shown in Table 1, our proposed method achieves the best performance on all metrics over the fair baseline and shows competitive results against FUNIT-1 and FUNIT-5, especially on unseen classes. On the FLO dataset, the overall performance is inferior to that on CUB. This is because the text embeddings used in FLO carry less information than the semantic attributes used in CUB and are not sufficient to represent the visual characteristics of a category. As the total number of real images from unseen classes in FLO is smaller than in CUB, the statistics of the unseen-class distribution are more biased and the FID score is higher. As shown by the user study in Table 2, our model outperforms the zero-shot baseline and achieves results comparable to the few-shot methods. This indicates that the proposed learning strategies expand the sampling space and improve the model's capacity for generating unseen images.

Method     CUB     FLO
FUNIT-1    27.8%   21.8%
FUNIT-5    34.2%   27.8%
StarGAN     7.8%   14.3%
Ours       30.2%   36.1%
Table 2: Human preference score.

                      Unseen                 Seen
                      Top-1  Top-5  FID      Top-1  Top-5  FID
Baseline               7.32  37.48  46.35    57.24  81.11  39.22
Baseline+Preserving   12.36  46.52  42.50    59.69  83.92  38.02
Baseline+Expanding    15.44  54.24  41.81    54.92  79.31  37.13
Ours                  16.88  57.48  40.78    59.81  83.49  34.89
Table 3: Ablation study on CUB.

4.5 Ablation Study

The intuitive approach to zero-shot I2I is to model the mapping between the attribute space and the visual space. One option is to find a shared space for the visual and attribute modalities Lin et al. (2019), but we find it difficult to learn relevant representations of spatial structures when the visual samples are missing. In this section, we validate the importance of the two proposed strategies, namely preserving semantic relations and expanding the attribute space. We perform an ablation study on CUB with two variants of our baseline, denoted Baseline+Preserving and Baseline+Expanding, respectively.

Figure 5: Qualitative comparison for the ablation study. The image with an orange border is the input image, and the other images in the first column represent the characteristics of the target categories, which are sampled from unseen classes. Each of the remaining columns shows the outputs of a method.

The quantitative results in Table 3 indicate the effectiveness of our proposed learning strategies. By adding either of the two strategies to the baseline model, the performance gap between unseen classes and seen classes is narrowed. By combining the two strategies, our proposed method obtains the best results on almost all metrics. Fig. 5 shows the qualitative comparison for the ablation study. Without the proposed strategies for generalizing to unseen classes, the baseline model suffers from mode collapse: all of its outputs have a similar appearance and cannot be well classified into the target categories. In contrast, the results of our proposed method contain the visual characteristics of the target categories.

Figure 6: Interpolation between two attributes of seen and unseen classes. The images with an orange border are the input images sampled from seen classes, while the images with a blue border represent the characteristics of the target classes.

4.6 Attribute Interpolation

In this section, by linearly interpolating between two attributes of seen and unseen classes, we visualize the intermediate results of the mapping from seen classes to unseen classes. As shown in Fig. 6, the interpolation results demonstrate the continuity in the attribute space and verify the effectiveness of our proposed two strategies for exploiting the attribute space.
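A sketch of the interpolation procedure used to produce Fig. 6: the generator is queried with convex combinations of a seen-class attribute and an unseen-class attribute.

```python
import torch

def interpolate_attributes(generator, x, a_seen, a_unseen, steps=8):
    outputs = []
    for t in torch.linspace(0.0, 1.0, steps):
        a = (1.0 - t) * a_seen + t * a_unseen
        outputs.append(generator(x, a))
    return outputs
```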

4.7 Effect of the Number of Seen Classes

To analyze the effect of the number of seen classes, we train our model with different dataset splits on the CUB dataset. As shown in Table 4, all the performance metrics are positively correlated with the number of seen classes. The more classes available during training, the more visual patterns are learned and the better these patterns are linked to the semantic attributes, leading to better performance at test time. An outlier is the FID score on unseen classes when the number of seen classes is 180, which is 64.59 and larger than the others. We conjecture that this is because when the number of seen classes becomes 180, the number of unseen classes is reduced to 20, so the size of the unseen dataset is not adequate for computing an unbiased FID score.

4.8 Application

In the above, we have verified the effectiveness of our proposed strategies. In this section, we explore the application of our method in zero-shot classification and fashion design.

4.8.1 Zero-Shot Classification.

We demonstrate that the proposed zero-shot unsupervised I2I translation scheme can benefit zero-shot classification. After training our image translation model, we first generate new samples for each unseen class to augment the training dataset, so that zero-shot classification becomes a conventional classification problem. Then we train a classifier on this augmented dataset, which contains samples for both seen and unseen classes. Similar to the deep CNN encoder used in recent zero-shot classification methods Xian et al. (2018a, b), we adopt a ResNet-101 He et al. (2016) as our classifier.
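A sketch of the augmentation pipeline described above, assuming a fixed number of synthesized samples per unseen class; the input selection and training details are simplified.

```python
import torch
import torchvision

@torch.no_grad()
def build_augmented_dataset(generator, seen_images, unseen_attrs, per_class=100):
    augmented = []
    for label, attr in unseen_attrs.items():      # unseen class label -> attribute vector
        for x in seen_images[:per_class]:         # seen-class images used as translation inputs
            augmented.append((generator(x, attr), label))
    return augmented

# The classifier is a standard ResNet-101 over the full label space (200 classes for CUB),
# trained with cross-entropy on seen images plus the augmented unseen-class samples.
classifier = torchvision.models.resnet101(num_classes=200)
```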

Seen classes   Unseen                 Seen
               Top-1  Top-5  FID      Top-1  Top-5  FID
90             10.53  43.09  42.96    56.22  79.73  41.84
120            15.85  52.38  42.10    56.40  81.22  39.06
150            16.88  57.48  40.78    59.81  83.49  34.89
180            22.20  66.20  64.59    61.26  85.24  34.53
Table 4: Effect of the number of seen classes on CUB.

Method                              U      S      H
f-CLSWGAN Xian et al. (2018b)      43.7   57.7   49.7
LisGAN Li et al. (2019)            46.5   57.9   51.6
CADA-VAE Schonfeld et al. (2019)   51.6   53.5   52.4
ZSL-ABP Zhu et al. (2019)          47.0   54.8   50.6
SABR-I Paul et al. (2019)          55.0   58.7   56.8
AREN Xie et al. (2019)             38.9   78.7   52.1
Ours                               49.7   65.9   56.7
Table 5: Results of zero-shot classification on CUB.

We validate the classification performance under the generalized zero-shot learning (GZSL) setting. During the test phase, we use images from both seen and unseen classes, and the label space is the combination of seen and unseen classes, $\mathcal{Y}^{s} \cup \mathcal{Y}^{u}$. The task of the classifier is to learn a mapping $f: \mathcal{X} \rightarrow \mathcal{Y}^{s} \cup \mathcal{Y}^{u}$. The performance of the classifier is evaluated by the harmonic mean (H) of the top-1 classification accuracies on seen and unseen classes, denoted S and U.
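For reference, the harmonic mean is computed from the seen accuracy S and unseen accuracy U as

$H = \dfrac{2 \cdot S \cdot U}{S + U}$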

We compare our model with recent state-of-the-art methods, and the results in Table 5 show that our proposed method is comparable to these methods on CUB. Our model achieves a competitive harmonic mean accuracy, which indicates that it strikes a balance between seen and unseen classes. This experiment validates the effectiveness of our proposed strategies, and exploring the use of these strategies for feature-level synthesis is an interesting direction for future work.

4.8.2 Fashion Design.

Automatic fashion design has recently attracted great attention due to its lucrative applications Liu et al. (2019a). Given the semantic description of a novel handbag, our task is to synthesize images with the required attributes from input images. The model is trained on the data from Zhu et al. (2016). The example results shown in Fig. 7 demonstrate the effectiveness of our method. Compared with our baseline, when translating a source image to target novel classes, the outputs produced by our method capture realistic and relevant characteristics. This is because the baseline model lacks sufficient incentives to utilize the attribute vector, whereas with the proposed strategies the model is able to bridge the semantic attributes and the visual characteristics effectively. More details about this experiment can be found in our supplementary materials.

Figure 7: Example results of handbag translation. The input images are sampled from seen classes, while the target images are sampled from unseen classes.

5 Conclusion

In this paper, we explore the phenomenon that multi-domain translation models suffer from mode collapse when synthesizing images of unseen classes. To address this issue, we extend the multi-domain baseline to a zero-shot paradigm by preserving semantic relations and expanding the attribute space. Experiments demonstrate that the proposed method can generate realistic images that are relevant to their target categories for both unseen and seen classes. In addition, we show that our model benefits applications such as zero-shot classification and fashion design.

Acknowledgement

This work was supported in part by Key-Area Research and Development Program of Guangdong Province (No. 2019B121204008), Shenzhen Municipal Science and Technology Program (No. JCYJ20170818141146428), National Natural Science Foundation of China and Guangdong Province Scientific Research on Big Data (No. U1611461).

References

  • Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid (2015a) Label-embedding for image classification. PAMI. Cited by: §2.2.
  • Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele (2015b) Evaluation of output embeddings for fine-grained image classification. In CVPR, Cited by: §2.2.
  • A. Ali-Gombe and E. Elyan (2019) MFC-gan: class-imbalanced dataset classification using multiple fake class generative adversarial network. Neurocomputing. Cited by: §1.
  • Y. Annadani and S. Biswas (2018) Preserving semantic relations for zero-shot learning. In CVPR, Cited by: §1, §3.2.1.
  • S. Benaim and L. Wolf (2017) One-sided unsupervised domain mapping. In NIPS, Cited by: §3.2.1.
  • S. Benaim and L. Wolf (2018) One-shot unsupervised cross domain translation. In NIPS, Cited by: §1, §2.1.
  • W. Chao, S. Changpinyo, B. Gong, and F. Sha (2016) An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV, Cited by: §3.2.2.
  • X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In NIPS, Cited by: §3.1, §3.2.2.
  • Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2018) Stargan: unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, Cited by: §1, §2.1, §4.4.
  • M. Elhoseiny and M. Elfeki (2019) Creativity inspired zero-shot learning. In ICCV, Cited by: §2.2.
  • A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov (2013) Devise: a deep visual-semantic embedding model. In NIPS, Cited by: §2.2.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, Cited by: §3.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §4.8.1.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NIPS, Cited by: §4.3.1.
  • H. Huang, C. Wang, P. S. Yu, and C. Wang (2019) Generative dual adversarial network for generalized zero-shot learning. In CVPR, Cited by: §2.2.
  • X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, Cited by: §3.1, §3.2.1, §4.1.
  • X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In ECCV, Cited by: §2.1, §3.1, §3.2.1, §4.1.
  • P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR, Cited by: §2.1, §4.1.
  • K. Joseph, A. Pal, S. Rajanala, and V. N. Balasubramanian (2018) Zero-shot image generation by distilling concepts from multiple captions. In ICML Workshop, Cited by: §2.2.
  • H. Kim, J. Lee, and H. Byun (2020) Unseen image generating domain-free networks for generalized zero-shot learning. Neurocomputing. Cited by: §2.2, §3.2.2.
  • T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim (2017) Learning to discover cross-domain relations with generative adversarial networks. In ICML, Cited by: §2.1.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §4.1.
  • V. Kumar Verma, G. Arora, A. Mishra, and P. Rai (2018) Generalized zero-shot learning via synthesized examples. In CVPR, Cited by: §2.2.
  • C. H. Lampert, H. Nickisch, and S. Harmeling (2013) Attribute-based classification for zero-shot visual object categorization. PAMI. Cited by: §2.2.
  • G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer, and M. Ranzato (2017) Fader networks: manipulating images by sliding attributes. In NIPS, Cited by: §2.1.
  • C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, Cited by: §2.1.
  • H. Lee, H. Tseng, J. Huang, M. Singh, and M. Yang (2018) Diverse image-to-image translation via disentangled representations. In ECCV, Cited by: §2.1.
  • J. Li, M. Jing, K. Lu, Z. Ding, L. Zhu, and Z. Huang (2019) Leveraging the invariant side of generative zero-shot learning. In CVPR, Cited by: §2.2, Table 5.
  • J. Lin, Y. Xia, S. Liu, T. Qin, and Z. Chen (2019) ZstGAN: an adversarial approach for unsupervised zero-shot image-to-image translation. In arXiv:1906.00184, Cited by: §4.5.
  • A. H. Liu, Y. Liu, Y. Yeh, and Y. F. Wang (2018) A unified feature disentangler for multi-domain image translation and manipulation. In NIPS, Cited by: §1, §2.1.
  • L. Liu, H. Zhang, Y. Ji, and Q. J. Wu (2019a) Toward ai fashion design: an attribute-gan model for clothing match. Neurocomputing. Cited by: §4.8.2.
  • M. Liu, X. Huang, A. Mallya, T. Karras, T. Aila, J. Lehtinen, and J. Kautz (2019b) Few-shot unsupervised image-to-image translation. In ICCV, Cited by: §1, §2.1, §2.1, §4.1, §4.3.2, §4.4.
  • B. Lu, J. Chen, and R. Chellappa (2019) Unsupervised domain-specific deblurring via disentangled representations. In CVPR, Cited by: §2.1.
  • Q. Mao, H. Lee, H. Tseng, S. Ma, and M. Yang (2019) Mode seeking generative adversarial networks for diverse image synthesis. In CVPR, Cited by: §3.2.1, §3.2.
  • G. Mariani, F. Scheidegger, R. Istrate, C. Bekas, and C. Malossi (2018) Bagan: data augmentation with balancing gan. arXiv:1803.09655. Cited by: §1.
  • L. Mescheder, A. Geiger, and S. Nowozin (2018) Which training methods for gans do actually converge?. In ICML, Cited by: §4.1.
  • T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. In ICLR, Cited by: §4.1.
  • M. Nilsback and A. Zisserman (2008) Automated flower classification over a large number of classes. In ICVGIP, Cited by: §4.2.2.
  • A. Paul, N. C. Krishnan, and P. Munjal (2019) Semantically aligned bias reducing zero shot learning. In CVPR, Cited by: Table 5.
  • S. Reed, Z. Akata, H. Lee, and B. Schiele (2016) Learning deep representations of fine-grained visual descriptions. In CVPR, Cited by: §4.2.2.
  • B. Romera-Paredes and P. Torr (2015) An embarrassingly simple approach to zero-shot learning. In ICML, Cited by: §2.2.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. IJCV. Cited by: Appendix B.
  • E. Schonfeld, S. Ebrahimi, S. Sinha, T. Darrell, and Z. Akata (2019) Generalized zero-and few-shot learning via aligned variational autoencoders. In CVPR, Cited by: Table 5.
  • Y. Shigeto, I. Suzuki, K. Hara, M. Shimbo, and Y. Matsumoto (2015) Ridge regression, hubness, and zero-shot learning. In ECML-KDD, Cited by: §2.2.
  • K. K. Singh, U. Ojha, and Y. J. Lee (2019) FineGAN: unsupervised hierarchical disentanglement for fine-grained object generation and discovery. In CVPR, Cited by: §2.1.
  • R. Socher, M. Ganjoo, C. D. Manning, and A. Ng (2013) Zero-shot learning through cross-modal transfer. In NIPS, Cited by: §2.2.
  • J. Song, C. Shen, Y. Yang, Y. Liu, and M. Song (2018) Transductive unbiased embedding for zero-shot learning. In CVPR, Cited by: §3.2.2.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In CVPR, Cited by: Appendix B, §4.3.2.
  • C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The caltech-ucsd birds-200-2011 dataset. Cited by: §4.2.1.
  • Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata (2018a) Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. PAMI. Cited by: §4.8.1.
  • Y. Xian, T. Lorenz, B. Schiele, and Z. Akata (2018b) Feature generating networks for zero-shot learning. In CVPR, Cited by: §2.2, §4.8.1, Table 5.
  • G. Xie, L. Liu, X. Jin, F. Zhu, Z. Zhang, J. Qin, Y. Yao, and L. Shao (2019) Attentive region embedding network for zero-shot learning. In CVPR, Cited by: Table 5.
  • D. Yang, S. Hong, Y. Jang, T. Zhao, and H. Lee (2019) Diversity-sensitive conditional generative adversarial networks. In ICLR, Cited by: §3.2.1, §3.2.
  • Z. Yi, H. Zhang, P. Tan, and M. Gong (2017) Dualgan: unsupervised dual learning for image-to-image translation. In ICCV, Cited by: §2.1.
  • X. Yu, X. Cai, Z. Ying, T. Li, and G. Li (2018) SingleGAN: image-to-image translation by a single-generator network using multiple generative adversarial learning. In ACCV, Cited by: §1, §2.1.
  • X. Yu, Y. Chen, T. Li, S. Liu, and G. Li (2019) Multi-mapping image-to-image translation via learning disentanglement. In NIPS, Cited by: §2.1.
  • H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2019) Self-attention generative adversarial networks. In ICML, Cited by: §4.1.
  • L. Zhang, T. Xiang, and S. Gong (2017) Learning a deep embedding model for zero-shot learning. In CVPR, Cited by: §2.2.
  • J. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros (2016) Generative visual manipulation on the natural image manifold. In ECCV, Cited by: Appendix B, §4.8.2.
  • J. Zhu, T. Park, P. Isola, and A. A. Efros (2017a) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, Cited by: §2.1.
  • J. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman (2017b) Toward multimodal image-to-image translation. In NIPS, Cited by: §3.1.
  • Y. Zhu, J. Xie, B. Liu, and A. Elgammal (2019) Learning feature-to-feature translator by alternating back-propagation for generative zero-shot learning. In ICCV, Cited by: Table 5.

Appendix A Network Architecture

The architecture details are shown in Tables 6, 7 and 8. In addition to the main architecture, the generator contains a 3-layer MLP that learns a set of affine transformation parameters from the attribute $a$. Notation: $h$ and $w$: height and width of the input image, $n_a$: the dimension of attribute $a$, $n_c$: the number of seen classes, N: the number of output channels, K: kernel size, S: stride size, P: padding size, FC: fully connected layer, IN: instance normalization, AdaIN: adaptive instance normalization, LReLU: Leaky ReLU with a negative slope of 0.2.

Part              Input -> Output Shape                       Layer Information
Input Layer       (h, w, 3) -> (h, w, 64)                     CONV-(N64, K7x7, S1, P3)
                  (h, w, 64) -> (h, w, 64)                    ResBlock: LReLU, CONV-(N64, K3x3, S1, P1)
                  (h, w, 64) -> (h, w, 128)                   ResBlock: LReLU, CONV-(N128, K3x3, S1, P1)
                  (h, w, 128) -> (h/2, w/2, 128)              AvgPool-(K3x3, S2, P1)
                  (h/2, w/2, 128) -> (h/2, w/2, 128)          ResBlock: LReLU, CONV-(N128, K3x3, S1, P1)
                  (h/2, w/2, 128) -> (h/2, w/2, 256)          ResBlock: LReLU, CONV-(N256, K3x3, S1, P1)
                  (h/2, w/2, 256) -> (h/4, w/4, 256)          AvgPool-(K3x3, S2, P1)
Hidden Layers     (h/4, w/4, 256) -> (h/4, w/4, 256)          ResBlock: LReLU, CONV-(N256, K3x3, S1, P1)
                  (h/4, w/4, 256) -> (h/4, w/4, 512)          ResBlock: LReLU, CONV-(N512, K3x3, S1, P1)
                  (h/4, w/4, 512) -> (h/8, w/8, 512)          AvgPool-(K3x3, S2, P1)
                  (h/8, w/8, 512) -> (h/8, w/8, 512)          ResBlock: LReLU, CONV-(N512, K3x3, S1, P1)
                  (h/8, w/8, 512) -> (h/8, w/8, 1024)         ResBlock: LReLU, CONV-(N1024, K3x3, S1, P1)
                  (h/8, w/8, 1024) -> (h/16, w/16, 1024)      AvgPool-(K3x3, S2, P1)
                  (h/16, w/16, 1024) -> (h/16, w/16, 1024)    ResBlock: LReLU, CONV-(N1024, K3x3, S1, P1)
                  (h/16, w/16, 1024) -> (h/16, w/16, 1024)    ResBlock: LReLU, CONV-(N1024, K3x3, S1, P1)
Output Layer (D)  (h/16, w/16, 1024) -> (h/16, w/16, n_c)     LReLU, CONV-(N(n_c), K1x1, S1, P0)
                  (h/16, w/16, 1024) -> (h/16, w/16, ·)       LReLU, CONV-(N(·), K1x1, S1, P0)
Output Layer (R)  (h/16, w/16, ·) -> (·)                      LReLU, GlobalAvgPool
                  (·) -> (n_a)                                FC-(·, n_a), Sigmoid
Table 6: Architecture of the discriminator.
Part           Input -> Output Shape                        Layer Information
Input Layer    (h, w, 3) -> (h, w, 64)                      CONV-(N64, K7x7, S1, P3), IN, ReLU
               (h, w, 64) -> (h/2, w/2, 128)                CONV-(N128, K4x4, S2, P1), IN, ReLU
Down-sampling  (h/2, w/2, 128) -> (h/4, w/4, 256)           CONV-(N256, K4x4, S2, P1), IN, ReLU
               (h/4, w/4, 256) -> (h/8, w/8, 512)           CONV-(N512, K4x4, S2, P1), IN, ReLU
               (h/8, w/8, 512) -> (h/8, w/8, 512)           ResBlock: CONV-(N512, K3x3, S1, P1), IN, ReLU
Bottleneck     (h/8, w/8, 512) -> (h/8, w/8, 512)           ResBlock: CONV-(N512, K3x3, S1, P1), IN, ReLU
               (h/8, w/8, 512) + (2048) -> (h/8, w/8, 512)  ResBlock: CONV-(N512, K3x3, S1, P1), AdaIN, ReLU
               (h/8, w/8, 512) + (2048) -> (h/8, w/8, 512)  ResBlock: CONV-(N512, K3x3, S1, P1), AdaIN, ReLU
               (h/8, w/8, 512) -> (h/4, w/4, 256)           Upsample(2), CONV-(N256, K5x5, S1, P2), IN, ReLU
Up-sampling    (h/4, w/4, 256) -> (h/2, w/2, 128)           Upsample(2), CONV-(N128, K5x5, S1, P2), IN, ReLU
               (h/2, w/2, 128) -> (h, w, 64)                Upsample(2), CONV-(N64, K5x5, S1, P2), IN, ReLU
Output Layer   (h, w, 64) -> (h, w, 3)                      CONV-(N3, K7x7, S1, P3), Tanh
Table 7: Main architecture of the generator.
Part           Input -> Output Shape   Layer Information
Input Layer    (n_a) -> (256)          FC-(n_a, 256), ReLU
Hidden Layer   (256) -> (256)          FC-(256, 256), ReLU
Output Layer   (256) -> (4096)         FC-(256, 4096)
Table 8: Architecture of the MLP in the generator.
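Read as code, Table 8 corresponds to a module of the following form (a sketch; attr_dim stands for $n_a$). The 4096 outputs match the two AdaIN residual blocks in Table 7: 2 blocks x 2 convolutions x 512 channels x {scale, bias} = 4096.

```python
import torch.nn as nn

def build_attribute_mlp(attr_dim):
    # 3-layer MLP mapping the attribute vector to the AdaIN affine parameters.
    return nn.Sequential(
        nn.Linear(attr_dim, 256), nn.ReLU(inplace=True),
        nn.Linear(256, 256), nn.ReLU(inplace=True),
        nn.Linear(256, 4096))
```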

Appendix B Experiment of Fashion Design

The dataset is from Zhu et al. (2016). We randomly select a subset of it, which contains 10,000 images of various handbags. We adopt an Inception-V3 Szegedy et al. (2016) network pre-trained on the ImageNet Russakovsky et al. (2015) dataset to extract embeddings of these images. The 2048-dimensional embeddings serve as semantic descriptions of the images. We then run K-means clustering on the embeddings of all images and obtain 50 classes. The dataset is randomly split into 40 training classes and 10 test classes.
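A sketch of this clustering step; the 299x299 input size and ImageNet normalization are standard Inception-V3 preprocessing assumptions.

```python
import torch
import torchvision
from sklearn.cluster import KMeans

inception = torchvision.models.inception_v3(pretrained=True, aux_logits=True)
inception.fc = torch.nn.Identity()        # expose the 2048-dim pooled features
inception.eval()

@torch.no_grad()
def embed(images):
    """images: (N, 3, 299, 299) tensor, ImageNet-normalized; returns (N, 2048) features."""
    return inception(images).cpu().numpy()

# embeddings: (10000, 2048) array over the handbag subset
# pseudo_labels = KMeans(n_clusters=50, random_state=0).fit_predict(embeddings)
```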

Fig. 8 shows some examples of the clustering results. Most images have consistent characteristics within classes. However, a few classes, such as Class 33 and Class 48, are not clustered well and exhibit various patterns, which degrades the performance of zero-shot image-to-image translation. We conjecture that this is because our embeddings are not sufficient to represent the visual characteristics of the handbag images. Fine-tuning the Inception-V3 on the handbag dataset or finding more suitable semantic descriptions for this application may help address this issue and boost performance.

Figure 8: Examples of the clustering results.

Appendix C Additional Qualitative Results

Figure 9: Qualitative results of fashion design. The first row shows the input images sampled from seen classes. The first column represents the characteristics of the target categories, and the blue and orange borders indicate that the target category is sampled from unseen classes and seen classes, respectively. The remaining images are the translation results of our proposed method.
Figure 10: Qualitative results on FLO. The first row shows the input images sampled from seen classes. The first column represents the characteristics of the target categories, and the blue and orange borders indicate that the target category is sampled from unseen classes and seen classes, respectively. The remaining images are the translation results of our proposed method.