Fashion-Gen: The Generative Fashion Dataset and Challenge

06/21/2018 · by Negar Rostamzadeh et al.

We introduce a new dataset of 293,008 high definition (1360 x 1360 pixels) fashion images paired with item descriptions provided by professional stylists. Each item is photographed from a variety of angles. We provide baseline results on 1) high-resolution image generation, and 2) image generation conditioned on the given text descriptions. We invite the community to improve upon these baselines. In this paper, we also outline the details of a challenge that we are launching based upon this dataset.







1 Introduction

Machine learning has recently been employed in many applications pertaining to the fashion industry. Use cases include style matching (Bossard et al., 2012; Kalantidis et al., 2013; Liu et al., 2016), recommendation systems on e-commerce sites (Chen et al., 2012; Xiao et al., 2015; Kiapour et al., 2015; Chen et al., 2015; Simo-Serra et al., 2015), trend prediction, letting customers virtually try on clothes (Han et al., 2017), and clothing type classification (Liu et al., 2012; Liang et al., 2016; Veit et al., 2015; Zhu et al., 2017b).

The availability of large-scale datasets such as DeepFashion (Liu et al., 2016) has fueled recent progress in applying deep learning to fashion tasks. However, there are still many aspects of the industry that computer vision methods have not been applied to. In this paper we explore the task of assisting fashion designers to share their ideas with others by translating verbal descriptions to images. Thus, given a description of a particular item, we generate images of clothes and accessories matching the description.

To explore these research directions we introduce here a new dataset of almost 300k high definition training images of clothes, and accessories accompanied by detailed design descriptions. Each description is provided by professional designers and contains fine-grained design details. Each product is photographed from multiple angles against a standardized background under consistent lighting conditions and annotated with matching items recommended by a stylist. See Figure 1 for examples.

In this paper we provide: 1) statistical details of the dataset, 2) detailed comparisons with existing datasets, 3) an introduction to the competition that we are launching on the task of text to image generation, with a brief explanation of the competition criteria and evaluation process, and 4) high-resolution image generation results using an approach based on the progressive growing of GANs (Karras et al., 2017), and text-to-image translation results using StackGAN-v1 (Zhang et al., 2017a), and StackGAN-v2 (Huang et al., 2017).

Figure 1: Panels a, b, and c present samples from the dataset. Each description is associated with all the images below it, and each item (e.g., a, b) is photographed from different angles. We also provide each image's attributes and its relationships to other objects in the dataset.

The paper is organized as follows: Section 2 discusses related work. Section 3 introduces the Fashion dataset, describes the collection procedure, and provides a statistical analysis of the dataset. Section 4 details our newly introduced challenge. (The competition is part of the first Workshop on Computer Vision for Fashion, Art and Design at ECCV; see the challenge website for submission details.) In Section 5, we describe baseline approaches and the evaluation process, including human evaluation. Section 6 concludes the paper and discusses future work.

2 Related Work

We first provide a summary of generative models used in text-to-image synthesis and then discuss related datasets.

2.1 Applications of Generative models

Generative Adversarial Networks (Goodfellow et al., 2014) have been used in a wide range of applications, including photo-realistic image super-resolution (Ledig et al., 2016; Sønderby et al., 2016), video generation (Denton & Fergus, 2018; Denton et al., 2017), inpainting (Belghazi et al., 2018), image-to-image translation (Isola et al., 2017; Zhu et al., 2017a; Taigman et al., 2016), and text-to-image synthesis (Zhang et al., 2017a; Huang et al., 2017; Reed et al., 2016a, b; Zhang et al., 2018).

Although state of the art generative models can already generate polished realistic images (Karras et al., 2017), conditional generation and translation tasks are still far from high quality. We hypothesize that this shortcoming is due to a shortage of large, clean datasets.

2.2 Related datasets

To the best of our knowledge, none of the datasets currently used for text-to-image synthesis was collected specifically to explore that task. Below we discuss existing datasets that have been used for text-to-image and attributes-to-image synthesis, focusing on the attributes that matter most for image synthesis.

Dataset                               | Images  | Resolution         | Descriptions | Binary attributes | Categories | Poses    | Items
CelebA                                | 202,599 | 43x55 to 6732x8984 | no           | 40                | no         | multiple | 10,177
CelebA-HQ                             | 30,000  | 1024x1024          | no           | 40                | no         | multiple | unknown
DeepFashion - Fashion Image Synthesis | 78,979  | 300x300            | multiple     | 1000              | 50         | multiple | unknown
MS COCO                               | 328,000 | varying            | 5 per image  | no                | 80         | single   | unknown
Caltech-UCSD Birds-200-2011           | 11,788  | varying            | no           | 312               | 200        | single   | unknown
Flowers Oxford-102                    | 8,189   | varying            | no           | no                | 102        | single   | unknown
Fashion dataset (ours)                | 325,536 | 1360x1360          | yes          | no                | 48         | multiple | 78,850
Table 1: Comparison of datasets
Figure 2: Distribution of the data per category. Note that the axis is in log scale.

Caltech-UCSD Birds-200-2011 (Reed et al., 2016a) was originally created for categorizing bird species, localizing their body parts, and classifying attributes. The dataset consists of 12k images depicting 200 bird species with 28 attributes. More recent work employs this dataset for image-synthesis tasks conditioned on text describing the attributes.

MS COCO (Lin et al., 2014) was originally created as a benchmark for image captioning. While some works use this dataset for text-to-image generation, the generated images lack fine-grained details and capture only high-level information, because the textual descriptions are themselves very high-level.
Flowers Oxford-102 (Nilsback & Zisserman, 2008) consists of 102 categories of flowers and was proposed for the task of fine-grained image classification. (Reed et al., 2016a) collected 5 descriptions for each image in the dataset to augment it for the task of text to image generation.
CelebA (Liu et al., 2015) contains pictures of 10k celebrities, with 20 images per person (200k images in total). Each image in CelebA is annotated with 40 attributes.
DeepFashion (Liu et al., 2016) contains over 200k images downloaded from a variety of sources, with varying image sizes, qualities, and poses. Each image is annotated with a range of attributes. This publicly available dataset was mainly employed for clothing retrieval and classification. To extend it to the task of text-to-image generation, 79k of its images were later annotated with more descriptive text (Zhu et al., 2017b).

3 Our Fashion Dataset

The advantages of our new Fashion dataset over other contemporary datasets are as follows:

  • The dataset consists of images ( images for training, for validation, for test), which is larger than other available datasets for the task of text to image translation.

  • We provide full-HD images photographed under consistent studio conditions. No other dataset offers comparable resolution together with consistent photographing conditions.

  • All fashion items are photographed from to different angles depending on the category of the item. To our knowledge, this is the first dataset of this scale consisting of multiple angles of each item.

  • Each product belongs to a main category and a more fine-grained category (i.e., a subcategory). There are main categories and fine-grained categories in the dataset. The name and density of each category are plotted in Figure 2, and a separate table presents the number of images by category and subcategory.

  • Each fashion item is paired with paragraph-length descriptive captions sourced from experts (professional designers). The distribution of the length of descriptions is presented in Figure 4.

  • For each item, we also provide metadata such as stylist-recommended matching items, the fashion season, the designer, and the brand. We also provide the distribution of colors extracted from the text descriptions, presented in Figure 3.

Figure 3: Distribution of the data based on colors. Note that the axis is in log scale.
Figure 4: Distribution of the description lengths.

4 Our Challenge

In addition to releasing a rich dataset, we are launching a challenge that uses our Fashion dataset for the task of text-to-image synthesis. To the best of our knowledge this is the first challenge on this task. Additionally, we encourage participants to take advantage of all the information in the dataset, e.g. pose or category. We provide a framework that enables researchers to easily compare the performance of their models with an evaluation metric based on the Inception Score (Salimans et al., 2016). The Inception model we use for the experiments presented in Section 5 was trained on the training set to classify images into the categories presented in Figure 2. For the final challenge evaluation we will also provide Inception scores from a model trained on the test set. However, there are a number of issues to consider when using Inception scores for evaluating generative models (Barratt & Sharma, 2018). For example, different implementations of the same model trained on the same dataset can yield significantly different Inception scores. For these and other reasons our challenge will also include a human evaluation, as outlined below.

Our automated evaluation platform computes and displays the Inception score for each submission and compiles the best scores in a leaderboard. We provide a comprehensive template and an easy-to-use service for submitting a Docker container that runs a participant's code and evaluates its performance on an Amazon Web Services cloud instance. Our test set, which won't be released, consists of descriptions of clothing items and is supplied at runtime to the challengers' Docker containers.
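The challenge metric can be sketched in a few lines. Below is a minimal NumPy sketch of the Inception Score computation, assuming the classifier's per-image class probabilities are already available; the actual challenge uses an Inception model trained on the Fashion categories, which is not reproduced here.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from class probabilities.

    probs: array of shape (N, C); each row is the classifier's predicted
    class distribution p(y|x) for one generated image.
    Returns exp(mean_x KL(p(y|x) || p(y))), where p(y) is the marginal
    over the whole batch. Higher is better: it rewards images that are
    each confidently classified (low-entropy rows) while the batch as a
    whole covers many classes (high-entropy marginal).
    """
    probs = np.asarray(probs, dtype=np.float64)
    marginal = probs.mean(axis=0)  # p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))
```

With C classes, the score ranges from 1 (all rows identical to the marginal) up to C (each image confidently assigned to a class, all classes covered).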

Human Evaluation setup:
Inception scores do not consider the correspondence between the text and the generated image. As such, the competition results will also be evaluated by humans, and since Inception scores have other known issues (Barratt & Sharma, 2018), as discussed above, the competition winner will be determined by this human evaluation. During the human-evaluation phase, a fixed subset of the test set will be randomly selected and the corresponding generated images given to a human evaluation system. Each evaluator will be given a text and the images generated for it by each submission, and will rank these image sets as first, second, and third best with respect to the given text. Each such task will be given to different human evaluators, and the scores given to each image set will then be aggregated to compute final scores under the human evaluation.

5 Experiments with the Dataset

In this section, we present two sets of experiments: 1) Generating high-resolution images by using the progressive GAN (P-GAN) growing technique of  Karras et al. (2017), and 2) text-to-image synthesis using StackGAN-v1  (Zhang et al., 2017a) and StackGAN-v2  (Zhang et al., 2017b).

5.1 Generating high-resolution images using P-GANs

The primary idea of Progressive Growing of GANs (Karras et al., 2017) is to grow the generator and discriminator gradually and symmetrically in order to produce high-resolution images. P-GAN starts with very low-resolution images, and each new layer of the model improves quality and adds fine-grained detail to the image generated at the prior stage. Experiments on the CelebA dataset (Liu et al., 2015) showed promising results, and we similarly employ P-GAN to generate images using our Fashion dataset as training data. To do this, we follow the same experimental setup and architectural details as the original P-GAN paper (Karras et al., 2017), using the code provided by its authors.
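The growing schedule and layer fade-in described above can be sketched as follows. The 4-to-1024 doubling schedule and the linear blend mirror the published P-GAN recipe, but the exact resolutions used for our dataset (whose native resolution is 1360x1360) are an assumption here, and the real training loop (losses, minibatch schedules) is omitted.

```python
def progressive_schedule(start=4, target=1024):
    """Resolutions at which new generator/discriminator layers are added:
    each training phase doubles the output resolution."""
    res, phases = start, []
    while res <= target:
        phases.append(res)
        res *= 2
    return phases

def fade_in(old_output, new_output, alpha):
    """When a new layer is added, its output is blended in linearly:
    alpha ramps from 0 (previous, smaller network only) to 1 (new layer
    fully active), avoiding a sudden shock to the trained layers."""
    return (1.0 - alpha) * old_output + alpha * new_output
```

At each phase, training alternates between fading in the new layer (alpha rising from 0 to 1) and stabilizing at the new resolution before the next doubling.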

Figure 8 shows examples of images generated by P-GAN. The images exhibit global coherence and span a variety of poses and attributes ranging from color and category to accessory textures and characteristics of fashion designs.

In order to quantitatively evaluate the quality of our generated images, we compute the Inception score for a down-sampled version of the generated images (see Table 2). The Inception score of the images generated using P-GAN is very close to that of the original images. Generated samples are presented in Figure 5.

Figure 5: Images generated by the P-GAN approach (Karras et al., 2017)

5.2 Text-to-Image synthesis:

We employed two architectures, StackGAN-v1 (Zhang et al., 2017a) and StackGAN-v2 (Zhang et al., 2017b), to generate images conditioned on their descriptions.

StackGAN-v1 decomposes conditional image generation into two stages. First, the Stage-I GAN sketches a low-resolution image with the overall shape and colors, conditioned on the text embedding and a random noise vector. Subsequently, the Stage-II GAN refines this low-resolution image, conditioned on the result of the first stage and the same text embedding, and generates a higher-resolution image.

StackGAN-v2 follows a similar architecture consisting of multiple chained generators and discriminators. The input of each stage of the chain is the output of the previous stage. One of the major differences between StackGAN-v2 and StackGAN-v1 is that these stages are trained jointly, whereas in StackGAN-v1, they are trained independently.
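The two-stage data flow above can be sketched with toy stand-in generators. The embedding and noise dimensions (128 and 100) and the 64-to-256 resolutions are illustrative assumptions rather than the exact values used in our experiments, and the real generators are deep convolutional networks, not random outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM, NOISE_DIM = 128, 100   # assumed sizes, for illustration only
LOW_RES, HIGH_RES = 64, 256     # assumed Stage-I / Stage-II resolutions

def stage1_generator(text_emb, noise):
    """Stage-I: text embedding + noise -> low-resolution sketch with the
    overall shape and colors. A real model is a conv net; this stand-in
    only reproduces the shapes of the data flow."""
    assert text_emb.shape == (EMB_DIM,) and noise.shape == (NOISE_DIM,)
    return rng.random((LOW_RES, LOW_RES, 3))

def stage2_generator(low_res_img, text_emb):
    """Stage-II: refine the Stage-I output, conditioned on the same text
    embedding, into a higher-resolution image."""
    assert low_res_img.shape == (LOW_RES, LOW_RES, 3)
    assert text_emb.shape == (EMB_DIM,)
    return rng.random((HIGH_RES, HIGH_RES, 3))

def stackgan_v1(text_emb, noise):
    """Chain the two stages: the output of Stage-I feeds Stage-II."""
    low = stage1_generator(text_emb, noise)
    return stage2_generator(low, text_emb)
```

In StackGAN-v2 the same chained structure applies, but the stages are trained jointly rather than one after the other.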

In our experiments, we found that the method by which we encode the textual descriptions has a substantial impact on the quality of the generated images. Here, we discuss the text embeddings we used.

Text embedding:
Both StackGAN-v1 and StackGAN-v2 condition the image generation process on a text embedding of the corresponding image description, originally generated by a pre-trained char-CNN-RNN encoder (Reed et al., 2016a). It is important for the embedding of the description to correctly relate to the visual content of the product image. We conducted our experiments using encoders of widely varying complexity, namely averaging word vectors, concatenating word vectors, a slightly modified encoder from the Transformer architecture (Vaswani et al., 2017), and a bidirectional LSTM (Schuster & Paliwal, 1997).

We experimented with both pre-training these models (the pre-training step consisted of training the encoder on a classification task: given an item description, predict its category) and training them jointly with the GAN. For the Transformer encoder, the text embedding is the output of the Transformer's encoder; for the bi-LSTM, it is the projected concatenation of the last hidden states of the forward and backward LSTMs. The final text embedding size for the Transformer is and for bi-LSTM.
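Two of the encoders above can be sketched directly. The shapes below are illustrative, and the projection matrix stands in for a learned linear layer; the LSTM hidden states themselves would come from a trained recurrent network.

```python
import numpy as np

def average_embedding(word_vecs):
    """Simplest baseline encoder we tried: average the word vectors of
    the description into a single fixed-size vector."""
    return np.asarray(word_vecs, dtype=np.float64).mean(axis=0)

def bilstm_embedding(h_forward, h_backward, projection):
    """Bi-LSTM text embedding: project the concatenation of the last
    forward and backward hidden states. `projection` stands in for a
    learned linear layer mapping 2*hidden_dim to the embedding size."""
    return projection @ np.concatenate([h_forward, h_backward])
```

Concatenating word vectors (with padding to a fixed length) and the Transformer encoder follow the same pattern: each maps a variable-length token sequence to one fixed-size vector that conditions the GAN.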

We arrived at three conclusions based on our empirical experiments. First, we found that the bi-LSTM model achieves the highest category classification accuracy on the validation dataset in the pre-training process. As can be seen in Figure 9, the t-SNE  (van der Maaten & Hinton, 2008) visualization of text embeddings shows relatively good separation of the categories. Secondly, we found that irrespective of the encoder architecture, pre-training the encoder model results in better correspondence between the descriptions and generated images. Finally, we found that overall, using the pre-trained bi-LSTM with fixed weights as the encoder leads to better results both visually and quantitatively.

The Inception scores reported in table 2 were obtained with the pre-trained bi-LSTM encoder (with fixed weights during the training of GAN).

Figure 6: Images generated from the StackGAN-v1 model with pre-trained bi-LSTM text encoder.
Figure 7: Images generated from the StackGAN-v2 model with pre-trained bi-LSTM text encoder.

Implementation details:
Throughout all the experiments, the descriptions were lowercased, tokenized, and cleared of stop words (the Python NLTK module was used to tokenize the descriptions by word and remove stop words). We used the first 15 tokens of each description as the input sequence to the encoder model.
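A dependency-free sketch of this preprocessing is below; it assumes a small illustrative stop-word list and a simple regex tokenizer rather than NLTK's full English stop-word list and word tokenizer, which the experiments actually used.

```python
import re

# Illustrative stop words only; the experiments used NLTK's English list.
STOP_WORDS = {"a", "an", "and", "the", "in", "is", "it", "of", "with"}
MAX_TOKENS = 15  # the encoders see the first 15 tokens

def preprocess(description):
    """Lowercase, tokenize, drop stop words, truncate to MAX_TOKENS."""
    tokens = re.findall(r"[a-z0-9']+", description.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens[:MAX_TOKENS]
```

For example, "A long-sleeve cotton shirt with buttons" becomes the token sequence fed to the encoder, with "a" and "with" removed.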

StackGAN-v1: We used the same overall architecture as Zhang et al. (2017a), with the code provided by the authors of the StackGAN-v1 paper on GitHub. The first stage was trained for epochs, and the second stage for epochs. The results can be seen in Figure 6.

StackGAN-v2: After careful experimentation, we ended up using the same architecture and hyper-parameters as Zhang et al. (2017b), with the code provided by the authors of the StackGAN-v2 paper. The results can be seen in Figure 7.

Model                             | Inception Score
Real data (Fashion)               |
StackGAN-v1 (Zhang et al., 2017a) |
StackGAN-v2 (Zhang et al., 2017b) |
P-GAN (Karras et al., 2017)       |
Table 2: Inception Scores on the validation set, computed with an Inception model trained on the Fashion train set.
Figure 8: Generated images from the P-GAN approach (Karras et al., 2017)
Figure 9: t-SNE visualization of the validation text embedding obtained by the bi-LSTM encoder

Table 2 shows, first, that StackGAN-v1 attains a better Inception Score than StackGAN-v2, even though StackGAN-v2 produces visually better images; the discrepancy is due to a significant mode collapse we encountered with StackGAN-v2. Another interesting observation is that most of the faces generated by StackGAN-v1 and StackGAN-v2 are blurry. This suggests that, since the images are conditioned on the text, the models focus more on the clothing than on facial detail.

6 Conclusion

Recent progress in generative modeling has great potential to give designers tools for rapidly visualizing and modifying ideas. While recent advances allow generative models to produce images of unprecedented realism, the quality of images generated from textual descriptions has so far remained far from realistic. We believe that the lack of good datasets has made it difficult to develop strong models for this task. In this paper, we introduced a new fashion-themed text-to-image generation dataset, with high-quality images and extensive annotations provided by fashion experts. We provided results for two sets of experiments: generating high-resolution images without textual descriptions as input, and generating realistic images conditioned on product descriptions, using the Fashion dataset as training data in both cases. We also provided experiments with the StackGAN-v1 and StackGAN-v2 models using various text encoders.

To help stimulate further research on conditional generative models, we release our dataset as part of a challenge. Detailed submission instructions are provided, and our API computes the Inception score using a model trained on the Fashion dataset. Submissions with the highest-quality images, as judged by human evaluators, will be selected as winners of the challenge organized around this new dataset.

7 Acknowledgement

We give our special thanks to Alex Shee for his help and support. We also thank Timnit Gebru and Archy de Berker for comments that greatly improved the manuscript. We would also like to express our gratitude to Chelsea Moran, Valerie Becaert, Vincent Hoe-Tin-Noe, Misha Benjamin, Pedro Oliveira Pinheiro, David Vazquez, Francis Duplessis, Ishmael Belghazi, Caroline Bourbonniere and Xavier Snelgrove for their support and feedback during the course of this research. Finally, we thank SSENSE for open-sourcing their data to the research community.