Fashioning with Networks: Neural Style Transfer to Design Clothes

07/31/2017 ∙ by Prutha Date, et al. ∙ University of Maryland, Baltimore County

Convolutional Neural Networks have been highly successful in performing a host of computer vision tasks such as object recognition, object detection, image segmentation and texture synthesis. In 2015, Gatys et al. [7] showed how the style of a painter can be extracted from an image of a painting and applied to an ordinary photograph, thus recreating the photo in the style of the painter. The method has been successfully applied to a wide range of images and has since spawned multiple applications and mobile apps. In this paper, the neural style transfer algorithm is applied to fashion so as to synthesize new custom clothes. We construct an approach to personalize and generate new custom clothes based on a user’s preferences and by learning the user’s fashion choices from a limited set of clothes from their closet. The approach is evaluated by analyzing the generated images of clothes and how well they align with the user’s fashion style.




1 Introduction

There have recently been impressive advances in computer vision tasks such as object recognition, detection and segmentation [16][21][3]. The revolution started with Krizhevsky et al. [16] substantially improving object recognition on the ImageNet challenge using convolutional neural networks (CNNs). This led to research and subsequent improvements in many tasks related to fashion, such as classification of clothes, predicting different kinds of attributes of a specific piece of clothing, and improving the retrieval of images [19][14][2][29][15]. Giants of e-commerce are expanding their investment in fashion. Recently, Amazon patented a system to manufacture clothes on demand [22]. They have also started shipping their virtual assistant Echo with an integrated camera that takes a picture of the user’s outfit and rates its style [9]. StitchFix aims to simplify the user’s online shopping experience. As the online fashion industry looks to improve the kind of clothes that are recommended to users, understanding the personal style preferences of users and recommending custom designs becomes an important task.

Personalization and recommendation models are a well-researched area that includes methods ranging from collaborative filtering [18] to content-based recommendation systems (e.g., probabilistic graph models, neural networks), as well as hybrid systems that combine both. Collaborative filtering [18] analyzes user behaviour and preferences and aligns users to predefined patterns so as to recommend a product. Content-based methods recommend a product based on the attributes or features that the user is searching for. A hybrid system (knowledge-based system [27]) incorporates both user preferences and product features to recommend an item to the user.

Figure 1: (a) and (b) provide the shape & style respectively (c) Final Design

While the above techniques retrieve the product (or its image), we seek to synthesize new personalized merchandise. Texture synthesis tries to learn the underlying texture of an image in order to generate new samples with the same texture. The research in this space [6] is largely focused on parametric and non-parametric methods. Non-parametric methods resample specific pixels from the image or adopt specific patches from the original to generate the new image [5][28][17][4]. Parametric methods define a statistical model that represents the texture [13][10][20]. In 2015, Gatys et al. [6] designed a new parametric model for texture synthesis based on convolutional neural networks. They model the style of an image by extracting the feature maps generated when the image is fed through a pre-trained CNN, in this instance a 19-layer VGGNet. They successfully separate the style and content of an arbitrary image and demonstrate how a second image can be stylized using the textures of the first.

Figure 2: Overall System Architecture. $A$ is the set of all attributes in the dataset [19]; $U \subseteq A$ is the set of attributes given by the user. $\mathcal{L}_{pstyle}$ is the total loss between the Gram matrix of the (iteratively modified) UCO image and the Gram matrices from the user’s personal style store (for the attributes in $U$). In the first phase, the user provides the system access to his / her closet images, from which the user’s fashion preferences are learned. In phase two, the user gives his / her choices (attributes such as Striped Top or Chiffon) together with the desired outline of the piece of clothing to get the new custom design.

Although Convolutional Neural Networks provide state-of-the-art performance for multiple computer vision tasks, their complexity and opacity have been a substantial research question. Visualizing the features learned by the network has been addressed in multiple efforts. Zeiler et al. [30] use a deconvolution network to reconstruct the features learned in each layer of the CNN. Simonyan et al. [25] backpropagate the gradients generated for a class with respect to the input image to create an artificial image (the initial image is just random noise) that represents the class in the network. The separation of style and content in an image by Gatys et al. [6] shows the variant (content) and invariant (style) parts of the image.

Our contribution in this paper is a pipeline to learn the user’s unique fashion sense and generate new design patterns based on their preferences. Figure 1 shows a sample clothing item generated using neural style transfer. The first clothing item given by the user provides the shape for the new dress. The second is initially provided by the user from his/her closet to learn their preference. The third is the final generated design for the user (the generated sample contains styles from multiple pieces of the user’s clothing).

The following sections discuss the related work, how neural style transfer works, our system architecture, experiments conducted and their results.

2 Related Work

Prior research on fashion data in the computer vision community has dealt with a whole range of challenges including clothes classification, predicting attributes of clothes and the retrieval of items [14][2][29][15]. Liu et al. [19] create a robust fashion dataset of about 800,000 images that contains annotations for various types of clothes, their attributes and the location of landmarks, as well as cross-domain pairs. They also design a CNN to predict attributes and landmarks. The architecture is based on a 16-layer VGGNet and adds convolution and fully-connected layers to train a network to predict them. Isola et al. [11] perform image-to-image translation using a conditional adversarial network. They perform experiments to generate various fashion accessories when provided with a sketch of the item.

We use a 19-layer pre-trained VGGNet [25] that is trained on the ImageNet dataset [24]. The network consists of 16 convolutional layers and 3 fully-connected layers. It is trained to predict 1000 classes (from the ImageNet challenge). The network is known to be robust, and the features it generates have been used to solve multiple downstream tasks. Gatys et al. use the pre-trained VGGNet to extract style and content features.

Johnson et al. [12] create an image transformation network trained to transform an image with a given style. A feed-forward transformation network is trained to run in real time using perceptual loss functions that depend on high-level features from a pre-trained loss network, rather than a per-pixel loss function based on low-level pixel information. The trained network does not start transforming the image from white noise but generates the output directly, thus speeding up the process.

Gatys et al. [7, 8] describe the process of using image representations encoded by multiple layers of VGGNet to separate the content and style of images and recombine them to form new images. The idea of style extraction is based on the texture synthesis process that represents the texture as a Gram matrix of the feature maps generated from each convolutional layer. The style is extracted as a weighted set of Gram matrices across all convolutional layers of the pre-trained VGGNet when it processes an image. The content is obtained from feature maps extracted from the higher layers of the network when the image is processed. The style and content losses are computed as the mean squared error (MSE) between the feature maps and Gram matrices of the original image and those of a randomly generated image (initialized from white noise). Minimizing the loss transforms the white noise into a new artistic image.

We use the method described above to generate new fashion designs.

3 Preliminaries

This section describes how the style and content are extracted from an image using neural style transfer [7]. We use the implementation given by [26], which uses a pre-trained 19-layer VGGNet model (VGG-19) and takes a content image and a set of style images as input.

Consider an input image $x$ and a convolutional neural network $C$. Every convolution layer $l$ in the network has $N_l$ distinct filters. Upon completion of the convolution operation (and the activation function being applied), let the feature map computed have height $H_l$ and width $W_l$. The flattened map (into a single vector) has a size of $M_l = H_l \times W_l$. Thus, the feature maps at every layer can be given as $F^l \in \mathbb{R}^{N_l \times M_l}$, where $F^l_{ij}$ represents the activation of the $i$-th filter at position $j$.
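The shape bookkeeping above can be illustrated with a minimal NumPy sketch. It assumes the activations of one layer are available as an array of shape (filters, height, width); the function name `flatten_feature_maps` and the toy array are illustrative only and are not taken from the implementation in [26].

```python
import numpy as np

def flatten_feature_maps(activations):
    """Reshape (N_l, H_l, W_l) activations into the (N_l, M_l) matrix F^l.

    Assumption: channels-first layout; M_l = H_l * W_l.
    """
    n_l, height, width = activations.shape
    return activations.reshape(n_l, height * width)

# Toy layer with 2 filters and a 3 x 4 spatial grid.
activations = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
F = flatten_feature_maps(activations)
print(F.shape)
```

Each row of `F` is then one flattened feature map, which is the form the Gram matrix computation expects.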

3.1 Style Extraction

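The Gram matrix at layer $l$ is given by $G^l \in \mathbb{R}^{N_l \times N_l}$, where $G^l_{ij}$ is calculated by the dot product of the flattened feature maps $F^l_i$ and $F^l_j$ for layer $l$:

$$G^l_{ij} = \sum_{k=1}^{M_l} F^l_{ik} F^l_{jk} \qquad (1)$$

The dot product measures how strongly two feature maps are activated together across spatial positions. Thus an entry of the Gram matrix is large when two features consistently co-occur and close to 0 when they do not, while information about where in the image the features occur is discarded.

Consider two images $a$ (input image used to transfer the style) and $x$ (a randomly generated image from white noise). Let their corresponding Gram matrices be $A^l$ and $G^l$. The style loss function is then computed for every layer as the mean squared error (MSE) between $A^l$ and $G^l$ as

$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^l_{ij} - A^l_{ij} \right)^2, \qquad \mathcal{L}_{style}(a, x) = \sum_{l} w_l E_l \qquad (2)$$

where $w_l$ is the weight given to layer $l$ and $\mathcal{L}_{style}$ is the style loss.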
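As a rough illustration of the Gram matrix and the per-layer style loss, the following NumPy sketch follows the normalization used by Gatys et al. [7]. It assumes each layer's feature maps are already flattened to shape (filters, positions); the random arrays stand in for real VGG activations, and the function names are hypothetical.

```python
import numpy as np

def gram_matrix(feature_maps):
    """Gram matrix of flattened feature maps with shape (N_l, M_l)."""
    return feature_maps @ feature_maps.T

def layer_style_loss(style_features, generated_features):
    """Normalized squared error between the two Gram matrices of one layer."""
    n_l, m_l = style_features.shape
    a = gram_matrix(style_features)
    g = gram_matrix(generated_features)
    return np.sum((g - a) ** 2) / (4.0 * n_l ** 2 * m_l ** 2)

def style_loss(style_layers, generated_layers, layer_weights):
    """Weighted sum of the per-layer losses."""
    return sum(
        w * layer_style_loss(s, g)
        for w, s, g in zip(layer_weights, style_layers, generated_layers)
    )

# Random stand-ins for the activations of two layers (not real VGG features).
rng = np.random.default_rng(0)
style = [rng.random((4, 36)), rng.random((8, 9))]
noise = [rng.random((4, 36)), rng.random((8, 9))]
weights = [0.5, 0.5]

print(style_loss(style, noise, weights))
print(style_loss(style, style, weights))  # identical features give zero loss
```

Because the Gram matrix sums over positions, shuffling the columns of a feature matrix leaves the loss unchanged, which is why this term captures texture rather than layout.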

3.2 Content Extraction

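The feature maps from the higher layers in the model give a representation of the image that is more biased towards the content [6]. We use the feature representations of the conv_4_2 layer to extract content. Given the feature representations in layer $l$ of the original image $p$ and the generated white noise image $x$ as $P^l$ and $F^l$ respectively, we define the content loss as the squared difference between the two, following [7]:

$$\mathcal{L}_{content}(p, x, l) = \frac{1}{2} \sum_{i,j} \left( F^l_{ij} - P^l_{ij} \right)^2$$

The derivative of this loss with respect to the feature map at layer $l$ gives the gradient used to minimize the loss:

$$\frac{\partial \mathcal{L}_{content}}{\partial F^l_{ij}} = \begin{cases} \left( F^l - P^l \right)_{ij} & \text{if } F^l_{ij} > 0 \\ 0 & \text{if } F^l_{ij} < 0 \end{cases}$$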
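A small NumPy sketch of the content loss and its gradient may help make the two expressions concrete. It assumes the half-sum-of-squares form from Gatys et al. [7] and rectified (ReLU) activations, so the gradient is zeroed wherever the generated activation is not positive; the arrays are toy values, not real layer outputs.

```python
import numpy as np

def content_loss(original_features, generated_features):
    """Half the sum of squared differences between the feature maps."""
    return 0.5 * np.sum((generated_features - original_features) ** 2)

def content_loss_gradient(original_features, generated_features):
    """Gradient with respect to the generated features.

    Assumption: ReLU activations, so entries that are not positive get
    a zero gradient.
    """
    diff = generated_features - original_features
    return np.where(generated_features > 0, diff, 0.0)

P = np.array([[1.0, 2.0], [0.0, 3.0]])  # features of the original image
F = np.array([[2.0, 0.0], [1.0, 3.0]])  # features of the generated image

loss = content_loss(P, F)
grad = content_loss_gradient(P, F)
print(loss)
print(grad)
```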

Figure 3: Evaluation model for predicting attribute labels on separate training and test generation images

4 System Architecture

Figure 2 shows the entire pipeline to personalize and design custom clothes for the user. The architecture has four modules: preprocessing, personal style store creation, style transfer, and post-processing to generate the final design. The following subsections discuss these modules in more detail.

To minimize the complexity of the problem, we consider images from the DeepFashion dataset [19] that have a white background. The images contain only clothing objects with no humans or other artifacts. They are only upper-body or full-body apparel pieces.

4.1 Preprocessing

All images are resized to 512 × 512. The image is resized not by expanding or contracting it, but by creating a temporary white background image of the above-mentioned size and placing the original image at its center. This brings the image to the expected size without deforming it. In addition, the mask of the image is extracted and stored using the GrabCut utility [23]. This mask is used in the postprocessing step to get rid of patterns lying beyond the contours of the apparel. The attributes for the clothes are assumed to be provided; automatically labeling them is beyond the scope of this paper.
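The padding step can be sketched as follows. This is a minimal NumPy illustration, assuming the garment image already fits inside the 512 × 512 canvas and is stored as a height × width × channels array; the function name `pad_to_canvas` is hypothetical, and mask extraction with GrabCut is not shown.

```python
import numpy as np

def pad_to_canvas(image, size=512, fill=255):
    """Center an image on a white square canvas without rescaling it.

    Assumption: the image is no larger than the canvas in either dimension.
    """
    height, width = image.shape[:2]
    if height > size or width > size:
        raise ValueError("image does not fit inside the canvas")
    canvas = np.full((size, size) + image.shape[2:], fill, dtype=image.dtype)
    top = (size - height) // 2
    left = (size - width) // 2
    canvas[top:top + height, left:left + width] = image
    return canvas

# A black 300 x 200 RGB rectangle stands in for a garment photo.
garment = np.zeros((300, 200, 3), dtype=np.uint8)
padded = pad_to_canvas(garment)
print(padded.shape)
```

Keeping the offsets `top` and `left` around would make it straightforward to crop the stylized output back to the original dimensions later.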

4.2 Creating a Personal Style Store

To learn the user’s fashion preferences, the user initially provides the set of clothes from his / her closet. The Gram matrices (Eq. 1) of all the clothes, along with their annotated attributes, are calculated. TensorFlow [1] allows us to partially evaluate the style loss in Eq. 2: the Gram matrices for the style image $a$ are computed first, and those for the generated image $x$ are computed later. The partially computed style losses are thus stored in a dictionary with the associated attributes. A personal style store is constructed for each user.
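One way to picture the personal style store is as a dictionary from attribute to precomputed style-side Gram matrices. The sketch below is an assumption about the data layout, not the paper's TensorFlow code: it stores plain NumPy Gram matrices per layer rather than partially evaluated loss functions, and the names `build_style_store` and `closet` are hypothetical.

```python
from collections import defaultdict

import numpy as np

def gram_matrix(feature_maps):
    """Gram matrix of flattened feature maps with shape (N_l, M_l)."""
    return feature_maps @ feature_maps.T

def build_style_store(closet):
    """Map each attribute to the per-layer Gram matrices of its garments.

    closet: iterable of (attributes, layer_features) pairs, where
    layer_features is a list of (N_l, M_l) arrays, one per layer.
    """
    store = defaultdict(list)
    for attributes, layer_features in closet:
        grams = [gram_matrix(f) for f in layer_features]
        for attribute in attributes:
            store[attribute].append(grams)
    return dict(store)

# Random stand-ins for the features of two garments over two layers.
rng = np.random.default_rng(0)
closet = [
    ({"striped", "chiffon"}, [rng.random((4, 9)), rng.random((8, 4))]),
    ({"striped"}, [rng.random((4, 9)), rng.random((8, 4))]),
]
style_store = build_style_store(closet)
print({attribute: len(items) for attribute, items in style_store.items()})
```

Precomputing the style side once per garment means only the Gram matrices of the generated image need to be recomputed during optimization.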

4.3 Style Transfer

To perform style transfer, two inputs are necessary. As shown in Figure 2, the user inputs a list of attributes that he/she would like in the new garment. This list can contain attributes like print and stripes, or fabrics such as chiffon. In the current system, style is learned only for the attribute types texture and fabric. The dress shape is not considered a representation of the style of that object. Apart from these attributes, the user also gives an image that contains the shape of the dress they desire. This is called the User Chosen Outline (UCO). Let the attributes of the dresses in the closet be $A$. The selected user attributes are $U$, where $U \subseteq A$. The set of style loss functions having the corresponding attributes is selected from the user’s personal style store. Although the styles extracted from the user’s closet as a whole represent his/her fashion sense, we pick the style functions of the chosen attributes because we assume the user’s mental model of the dress is likely to be similar to the styles extracted for those attributes. All selected functions are then combined to get a single representation of the user’s fashion choices.
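The selection step can be sketched in a few lines of Python. This is an illustrative reading of the paragraph above, assuming the style store is a dictionary keyed by attribute; the string values are placeholders for stored style functions, and `select_style_functions` is a hypothetical name.

```python
def select_style_functions(style_store, closet_attributes, user_attributes):
    """Return the stored style entries for the attributes the user chose.

    Raises ValueError if a requested attribute is not among the closet
    attributes, since the chosen set must be a subset of them.
    An item carrying several selected attributes appears once per attribute.
    """
    closet_attributes = set(closet_attributes)
    user_attributes = set(user_attributes)
    missing = user_attributes - closet_attributes
    if missing:
        raise ValueError(f"attributes not found in closet: {sorted(missing)}")
    selected = []
    for attribute in sorted(user_attributes):
        selected.extend(style_store.get(attribute, []))
    return selected

# Placeholder strings stand in for stored style loss functions.
style_store = {
    "striped": ["style_fn_1", "style_fn_2"],
    "chiffon": ["style_fn_3"],
    "knit": ["style_fn_4"],
}
selected = select_style_functions(
    style_store, style_store.keys(), {"striped", "chiffon"}
)
print(selected)
```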

For a style image $a$ and the initialized image $x$, the style loss can be given as,

$$\mathcal{L}_{style}(a, x) = \sum_{l} w_l E_l$$

where $\mathcal{L}_{style}$ is the style loss for a single image, $E_l$ is the Gram-matrix error at layer $l$ and $w_l$ is the weight of that layer.

The combined loss is given by:

$$\mathcal{L}_{pstyle}(x) = \sum_{i} \lambda_i \, \mathcal{L}_{style}(a_i, x)$$

Here, $\mathcal{L}_{pstyle}$ is the style loss computed over the selected functions, $a_i$ is the $i$-th selected style image and $\lambda_i$ is its weight.

The number of images picked for every attribute depends on the distribution of that attribute across the entire list of images present. The higher the frequency of an attribute in the distribution, the stronger the bias towards that label, which suppresses the effect of the others. This makes certain image characteristics more pronounced in the final dress than others. Hence, to offset the bias, the weight $\lambda_i$ is utilized.

The total loss is the weighted sum of the style and content losses obtained.

$$\mathcal{L}_{total}(C, x) = \alpha \, \mathcal{L}_{content}(C, x) + \beta \, \mathcal{L}_{pstyle}(x)$$

Here, $\alpha$ and $\beta$ are the weights assigned to the content and style losses respectively. $C$ is the user chosen outline (UCO). An L-BFGS optimizer is used to minimize the total loss, that is, to jointly minimize the content and style losses. The output image is then post-processed to get the final image.
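The arithmetic of combining the losses can be shown with a short Python sketch. The paper does not state how the per-image weights are chosen, so the inverse-frequency rule below is purely an assumption for illustration, as are the function names and the toy loss values.

```python
from collections import Counter

def inverse_frequency_weights(image_attributes):
    """Hypothetical weighting: each image gets 1 / (count of its attribute),
    so every attribute contributes the same total weight."""
    counts = Counter(image_attributes)
    return [1.0 / counts[attribute] for attribute in image_attributes]

def combined_style_loss(style_losses, image_weights):
    """Weighted sum of the per-image style losses."""
    return sum(w * loss for w, loss in zip(image_weights, style_losses))

def total_loss(content_loss, combined_style, alpha, beta):
    """Weighted sum of the content loss and the combined style loss."""
    return alpha * content_loss + beta * combined_style

# Three striped garments and one knit garment, with toy loss values.
image_attributes = ["striped", "striped", "striped", "knit"]
style_losses = [3.0, 3.0, 3.0, 2.0]

weights = inverse_frequency_weights(image_attributes)
combined = combined_style_loss(style_losses, weights)
total = total_loss(0.5, combined, alpha=1.0, beta=10.0)
print(combined, total)
```

Without the weights, the three striped garments would contribute 9.0 of an 11.0 style loss; with them, the striped and knit attributes contribute 3.0 and 2.0 respectively.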

Figure 4: Multiple styles reinforced in a content image

4.4 Postprocessing

The output image contains patches of patterns transferred across the entire image. We resize the image to its original dimensions and apply the mask (of the UCO image) extracted to white out the background and get the transformed clothing object as the final resultant dress.
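Applying the stored mask can be sketched with NumPy as below. The sketch assumes the mask is a 2-D array with nonzero values inside the garment outline and the same height and width as the image; `apply_mask` is a hypothetical name, and resizing back to the original dimensions is not shown.

```python
import numpy as np

def apply_mask(image, mask, background=255):
    """Keep pixels inside the mask and paint everything else white."""
    result = np.full_like(image, background)
    keep = mask.astype(bool)
    result[keep] = image[keep]
    return result

# Toy 4 x 4 stylized output with a 2 x 2 garment region in the middle.
stylized = np.full((4, 4, 3), 7, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1

final = apply_mask(stylized, mask)
print(final[:, :, 0])
```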

5 Experiments & Results

We present two approaches to evaluate the results of personalization using style transfer.

5.1 Predicting Attribute labels

Quantitative evaluation of personalization models is a challenging task. A standard approach is to create a survey on Mechanical Turk and ask users whether the styles have been transferred properly and whether the new dress designs are personalized given a wardrobe. But fashion presents a unique challenge, as it is highly dependent on the user’s taste for different kinds of clothing. Instead, a different tack is applied.

Figure 3 shows how the evaluation is performed. We check whether style is imparted to the given UCO image by verifying that a classifier is able to identify the style attributes present in it. An SVM is trained to learn the attributes of the clothes present in the user’s closet using the features generated from a 16-layer VGGNet (our system uses the 19-layer network for fashioning the clothes). The test dataset is created by generating random combinations of attributes (these combinations are likely not present in the training image closet). For these random combinations of attributes, the new dress images are generated. Once they are featurized by a pre-trained VGG-16, we check the SVM’s ability to predict the combinations of attributes.
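Drawing random attribute combinations for the test set could look like the sketch below. The combination size, the seed and the attribute names are assumptions for illustration; the paper does not specify how the combinations are sampled.

```python
import random

def random_attribute_combinations(attributes, combo_size, count, seed=0):
    """Draw `count` random attribute combinations of a fixed size.

    Assumption: combinations are sampled uniformly without replacement
    within each combination; duplicates across combinations are allowed.
    """
    rng = random.Random(seed)
    pool = sorted(attributes)
    return [frozenset(rng.sample(pool, combo_size)) for _ in range(count)]

attributes = {"striped", "chiffon", "knit", "print"}
combinations = random_attribute_combinations(attributes, combo_size=2, count=5)
for combination in combinations:
    print(sorted(combination))
```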

The UCO images and the set of images used for styling are maintained separately. There are a total of 400 UCOs and 100 images from the user’s wardrobe. Two kinds of tests are considered in the experiment. In the first, the test images are generated from a set of images that is separate from the styles extracted from the training data but has similar attributes. In the second, the test images are generated from the styles extracted from the training data itself. Figure 5 shows the F1-score for a varying number of generated test images. The consistent performance above the baseline suggests that the style is likely transferred and that the SVM is able to classify based on the features generated.
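For reference, an F1-score over predicted attribute sets can be computed as in the following sketch. The paper does not say whether the score is micro- or macro-averaged, so micro-averaging is an assumption here, and the attribute sets are made-up examples rather than results from the experiments.

```python
def micro_f1(true_sets, predicted_sets):
    """Micro-averaged F1 over multi-label attribute predictions."""
    tp = fp = fn = 0
    for truth, predicted in zip(true_sets, predicted_sets):
        tp += len(truth & predicted)   # attributes correctly predicted
        fp += len(predicted - truth)   # predicted but not present
        fn += len(truth - predicted)   # present but not predicted
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up example: two generated garments and the SVM's predictions.
truth = [{"striped", "chiffon"}, {"knit"}]
predicted = [{"striped"}, {"knit", "print"}]
score = micro_f1(truth, predicted)
print(round(score, 3))
```

Here two of the three predicted attributes are correct and two of the three true attributes are recovered, so precision and recall are both 2/3.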

Our experiments with increasing the number of images used to gain more styles showed a drop in the F1 score, suggesting that an increasing number of style functions degrades the quality of the result and makes it difficult to identify patterns. Hence it is necessary to limit the number of style functions used to generate the new dress.

Figure 5: Bar-chart showing F1-scores for the baseline and our model on actual test data using separate training and test generation images, and using same images for training and test data generation
Figure 6: Styles extracted from multiple images for the same attribute "knit"

5.2 Qualitative evaluation

We analyze the quality of dress images by seeing how similar they are to the style images used in the personalization process. The quality of the generated image is impacted by a number of factors, and the effect of various hyper-parameters is measured. Figure 4 shows an image of a sheer draped blouse changed to adopt the styles extracted from a couple of images. The result is a blend of patterns borrowed from the given style images.

A single style superimposed on the same content image, but using multiple distinct style images, produces interesting results. Figure 6 presents the style of four different knit garments over a tank top. Four different textures of the same fabric produce distinct results.

6 Conclusions & Future Work

In this paper, we show an initial pipeline to generate new designs for clothes based on the preferences of the user. The results indicate that style transfer happens successfully and is personalized for the closet of a user. In the future, we would like to improve the performance of the pipeline, as it is time-consuming to generate a new design. We also plan to experiment with better methods to personalize and generate designs with higher resolutions.


  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from
  • [2] L. Bossard, M. Dantone, C. Leistner, C. Wengert, T. Quack, and L. Van Gool. Apparel classification with style. In Asian conference on computer vision, pages 321–335. Springer, 2012.
  • [3] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. CoRR, abs/1606.00915, 2016.
  • [4] A. A. Efros and W. T. Freeman. Image quilting for texture synthesis and transfer. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 341–346. ACM, 2001.
  • [5] A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, pages 1033–1038. IEEE, 1999.
  • [6] L. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems, pages 262–270, 2015.
  • [7] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
  • [8] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
  • [9] B. Heater. Amazon’s new echo look has a built-in camera for style selfies. Accessed: 2017-06-02.
  • [10] D. J. Heeger and J. R. Bergen. Pyramid-based texture analysis/synthesis. In Proceedings of the 22nd annual conference on Computer graphics and interactive techniques, pages 229–238. ACM, 1995.
  • [11] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arxiv, 2016.
  • [12] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, 2016.
  • [13] B. Julesz. Visual pattern discrimination. IRE transactions on Information Theory, 8(2):84–92, 1962.
  • [14] Y. Kalantidis, L. Kennedy, and L.-J. Li. Getting the look: clothing recognition and segmentation for automatic product suggestions in everyday photos. In Proceedings of the 3rd ACM conference on International conference on multimedia retrieval, pages 105–112. ACM, 2013.
  • [15] M. H. Kiapour, K. Yamaguchi, A. C. Berg, and T. L. Berg. Hipster wars: Discovering elements of fashion styles. In European conference on computer vision, pages 472–488. Springer, 2014.
  • [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [17] V. Kwatra, A. Schödl, I. Essa, G. Turk, and A. Bobick. Graphcut textures: image and video synthesis using graph cuts. In ACM Transactions on Graphics (ToG), volume 22, pages 277–286. ACM, 2003.
  • [18] G. Linden, B. Smith, and J. York. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet computing, 7(1):76–80, 2003.
  • [19] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1096–1104, 2016.
  • [20] J. Portilla and E. P. Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. International journal of computer vision, 40(1):49–70, 2000.
  • [21] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
  • [22] J. D. REY. Amazon won a patent for an on-demand clothing manufacturing warehouse. Accessed: 2017-06-02.
  • [23] C. Rother, V. Kolmogorov, and A. Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. In ACM transactions on graphics (TOG), volume 23, pages 309–314. ACM, 2004.
  • [24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
  • [26] C. Smith. neural-style-tf., 2016.
  • [27] S. Trewin. Knowledge-based recommender systems. Encyclopedia of library and information science, 69(Supplement 32):180, 2000.
  • [28] L.-Y. Wei and M. Levoy. Fast texture synthesis using tree-structured vector quantization. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages 479–488. ACM Press/Addison-Wesley Publishing Co., 2000.
  • [29] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2691–2699, 2015.
  • [30] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. CoRR, abs/1311.2901, 2013.