1.1 Image content challenges in the fashion business
In modern fashion e-commerce, scale and speed are both an opportunity and a challenge. Creating image content for the huge number of diverse articles means that traditional methods (photo shoots with professional models) are a bottleneck: they are slow and expensive. Suppose that several thousand articles arrive in a new batch. Photographing them standalone on clothes hangers is relatively easy and cheap, but arranging photo shoots with professional models is time consuming and expensive. Leveraging available data and reusing images of human models and products would therefore be very useful for a fashion business.
As a related application, consider the “virtual try-on” problem: giving customers the ability to upload their own picture and see how they would look with different fashion products, in a sort of “magic mirror” (https://www.slideshare.net/metatechnology/magic-mirror-for-fashion-stores), would open exciting new possibilities in the near future. However, this is currently a technical challenge that has not been solved convincingly.
One approach would be to model humans and articles as 3D objects and render them photo-realistically with computer graphics methods. However, the engineering challenges and computational costs (e.g. for high quality 3D scans) make that approach expensive to build and maintain. Instead, the large 2D image collections characteristic of fashion companies contain rich structure that can be exploited by modern generative deep learning methods.
By learning a generative model for “dressing” an image of a human model, given a fashion article photo specifying what to wear, we enable a novel image editing method with great potential for fashion businesses. Our method can power a lean system capable of creating thousands of images per second, using image data already available in typical fashion e-commerce companies as a by-product of standard content production processes. It thus has the potential to allow further scaling in parts of current fashion businesses.
To avoid confusion, in this paper we write “human” instead of “image of a human fashion model”, and use the word “model” only for neural network models.
1.2 Image-to-image translation
Many image processing problems can be modeled as image-to-image translation problems (e.g. colorizing black-and-white images, or translating a sketch of a scene into a color photograph), where an image is fed as input and the output is the modified image.
Convolutional neural networks (CNNs) are powerful deep learning models that can solve such problems. However, usual CNN architectures require prespecified loss functions, and such losses may not be suitable for producing sharp and realistic images; e.g. a Euclidean loss leads to blurry images.
The more recent model class of Generative Adversarial Networks (GANs) trains a generator that learns a data distribution from example data, together with a discriminator that attempts to distinguish generated from training data. These models effectively learn loss functions that adapt to the data, which makes them well suited for image-to-image translation tasks where the goal is to create sharp images that look indistinguishable from the training examples.
The conditional version of the GAN (cGAN) learns to generate images as a function of conditioning information from a dataset, instead of only random noise from a prior as in standard GANs. Image-to-image translation with cGANs has been studied extensively; later work further improves the types of losses available for such models and allows advanced sketch and texture control for interactive image editing. The requirement of datasets with paired input and output images as ground truth is sometimes a limitation: for many interesting image translation problems no such paired datasets exist. A novel method of unpaired training addresses such cases: given two (unpaired) image domains, and a suitably regularized generator function, a mapping between the domains can be learned without explicitly paired examples. For example, given a set of horse images and a second set of zebra images, the model can learn to translate horses into zebras.
2 The CAGAN model
2.1 Painting humans wearing fashion articles as an image analogy problem
For the task of painting a given image of a human with a given image of a certain article, one could potentially train supervised learning methods. However, ground truth data corresponding to the desired outcome does not exist, and it is not practical to generate it manually. Instead, in a typical fashion company, rich image content is produced constantly in photo shoots: millions of photos of human models wearing articles of clothing, and also close-up pictures of the articles without humans. CAGAN uses image data of this form, i.e. pairs (y_i, x_i), where the index i indicates a pair of human image and article image. These images have a relation: the image y_i contains a human wearing a certain fashion product on their body, and x_i is another image of the same fashion product shown alone.
The mapping between the standalone article image x_i and its appearance rendered on a human in y_i is typically distorted by occlusion, illumination, 3D rotation and deformation. We can use data to learn this relation and then, given a new clothing article x_j (not in the training set), create an image of a human body wearing the appropriately rendered article. This is an image analogy problem: find an image y_j in the same relation to x_j as y_i is to x_i for all training data pairs (y_i, x_i). Note that our method, in contrast to the classical image analogy method, makes use of the whole dataset to infer the most plausible relation.
However, generating a completely new photorealistic image containing a human who wears a given fashion article is a complex task. In particular, plausible faces are very hard to generate. Fortunately, in our case it is beneficial to restrict the problem by staying close to the input image, as described in the next section.
2.2 Relaxing the task: swapping clothes on existing humans
We only need to make sure the fashion article looks well painted, and can reuse an existing human model image instead of creating a complete output image from scratch. This is easier from a machine learning point of view. In addition, it also corresponds to the fashion customization use case from the introduction: how will a particular customer look in a given piece of clothing?
We want to train an image-to-image translation network G that exchanges one piece of clothing x_i for a new one, x_j, on a given human image y_i. Note that there are never examples of y_j, the modified human image with the swapped fashion item we would like to see. We infer this indirectly from the data, which allows us to learn the relation between an article and its properly dressed appearance on a model. The priors and model architecture force the neural model to augment the human image and paint only some parts of it with the new article, which we consider implicit relation learning. Note that this is more complex than unpaired two-domain translation, because in our case the image domains are specified implicitly by the conditioning, rather than given explicitly as two sets of training images.
Our method uses end-to-end learning to sidestep localization and segmentation challenges and directly predict images with suitable properties: smooth-looking fashion models wearing the specified fashion articles. For this, we need to find where the old article is located and replace it with the new article. As conditioning we directly use a standalone article image, of the type usually available in an online shop. Such images show the article in some detail, but the transformation that corrects for illumination, occlusion, 3D rotation and deformation so that the article fits the output is not known. Our method automatically infers an appropriate segmentation map from both the article and human images, and uses it to generate an appropriate-looking image consistent with the conditioning.
2.3 CAGAN model and training loss function
Training of the CAGAN model involves learning a generator G to generate plausible images which fool a discriminator D. The discriminator needs to answer two questions:
does an image look reasonable, i.e. indistinguishable from the training distribution of human images y_i?
does the article x_i look well painted on the human model image y_i, i.e. is the relation of x_i and y_i consistent with the relations observed in the dataset?
The second criterion cannot be modeled directly with a Euclidean loss, since we do not have ground truth examples for all combinations of every article with every human: the data contains only a few examples of specific humans, each wearing only a few clothes.
For training G and D we define a loss which contains several terms, weighted by constants γ_i and γ_c:

L(G, D) = L_cGAN(G, D) + γ_i L_id(G) + γ_c L_cyc(G).  (1)
The most important term is the adversarial loss L_cGAN, which involves the generator and the discriminator:

L_cGAN(G, D) = E_{(x_i, y_i), x_j ~ p_data} [ log D(y_i, x_i) + log(1 − D(G(y_i, x_i, x_j), x_j)) + log(1 − D(y_i, x_j)) ],  (2)

where the notation (x_i, y_i), x_j ~ p_data means uniformly sampling elements with indices i and j from the dataset, under fulfillment of the constraint i ≠ j. We marginalise over the indices of the spatial dimensions of the output of D, because the discriminator does not output a single number per image, but a whole spatial field of classifications. The network D is otherwise a typical discriminative network. Such a local-consistency discriminator works quite well; see also [4, 5, 1], which discuss how discriminating patches from a bigger image and marginalizing over the spatial positions is fast and efficient for many image discrimination tasks.
The terms of loss (2) are very similar to the classical GAN loss, which learns to distinguish true data from generated examples. However, the last term of L_cGAN defines another type of negative example that comes from the distribution of swapped data: a human y_i and an article x_j, different from x_i and not appearing on the human. The motivation for that term is that we need to determine whether the article is really worn by the human, i.e. the discriminator needs to learn whether the relation of x and y corresponds to the dataset. Related work on conditional generative models also finds such negative examples to be beneficial.
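To make the structure of this loss concrete, here is a minimal NumPy sketch of the three-term discriminator objective. This is illustrative only: the function and variable names are our own, and the discriminator d is assumed to return a spatial field of probabilities in (0, 1) that we average over positions.

```python
import numpy as np

def cagan_d_loss(d, y_i, x_i, y_gen, x_j, eps=1e-8):
    # Term 1: real pair (y_i, x_i) -- should be classified as real.
    real = np.mean(np.log(d(y_i, x_i) + eps))
    # Term 2: generated human wearing x_j -- should be classified as fake.
    fake = np.mean(np.log(1.0 - d(y_gen, x_j) + eps))
    # Term 3: swapped negative -- a real human y_i paired with an article
    # x_j that is NOT worn in the image, also classified as fake.
    swap = np.mean(np.log(1.0 - d(y_i, x_j) + eps))
    # D minimises the negated sum; averaging each spatial field of patch
    # classifications implements the marginalisation described in the text.
    return -(real + fake + swap)
```

With an uninformative discriminator that outputs 0.5 everywhere, each term contributes log 2, giving a loss of about 2.08.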
The generator network G is designed to produce a 4-channel output: a 1-channel matting mask α and a 3-channel color image ŷ. The final generated image of a human with swapped clothes is the convex combination αŷ + (1 − α)y_i of the predicted and original images. In order to ensure a range of (0, 1) for the mask values, we transform them with the sigmoid function. Such joint prediction of a blending mask and a predicted image is similar to previous layered generation approaches.
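As a sketch, the masked composition could look as follows in NumPy; the array layout and names are our assumptions (channel 0 holds the mask logits, channels 1-3 the painted image):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compose(raw_out, y):
    """raw_out: (4, H, W) raw generator output; y: (3, H, W) original human image."""
    alpha = sigmoid(raw_out[0:1])   # 1-channel matting mask, squashed into (0, 1)
    y_hat = raw_out[1:4]            # 3-channel painted colour image
    # Convex combination: paint where alpha ~ 1, keep the original where alpha ~ 0.
    return alpha * y_hat + (1.0 - alpha) * y
```

When the mask saturates at 1 the output equals the painted image; when it saturates at 0 the original human image passes through unchanged.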
The term L_id in Eq. (1) regularizes the outputs of G to change as little as possible from the original human image, since we know we only change one piece of clothing at a time. It has the effect of keeping the model from painting parts of the human body that are not relevant for the swapped clothes:

L_id(G) = E_{(x_i, y_i), x_j ~ p_data} [ ‖α‖_1 ],  (3)

where ‖·‖_1 is the L1 norm and α is the matting mask produced by G(y_i, x_i, x_j). Our preliminary results indicate that different norms (e.g. total variation) can improve the results further.
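Under the reading that this regularizer is an L1 penalty on the matting mask, it can be sketched as:

```python
import numpy as np

def identity_loss(alpha):
    # Penalise the mask itself: driving alpha towards 0 keeps the convex
    # combination of painted and original images close to the original.
    return np.abs(alpha).mean()
```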
We also introduce a cycle loss L_cyc in Eq. (1) that should force consistent results when swapping clothes:

L_cyc(G) = E_{(x_i, y_i), x_j ~ p_data} [ ‖y_i − G(G(y_i, x_i, x_j), x_j, x_i)‖_1 ].  (4)

Figure 2 illustrates how the cycle loss is defined, similar to cycle consistency in unpaired image translation. The motivation is that it should make the generator more stable: changing clothes on a human should only swap the relevant articles and leave the other parts of the image unchanged. If the generator creates an image that modifies the original image y_i in regions unrelated to the article x_j, then the reverse swapping operation will generate an image that is penalized for deviating from y_i.
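The cycle term can be sketched as a round trip through the generator; this is a hedged NumPy illustration with our own naming, where g takes a human image plus the articles to swap out and in:

```python
import numpy as np

def cycle_loss(g, y, x_i, x_j):
    y_swapped = g(y, x_i, x_j)       # dress the human in the new article x_j
    y_back = g(y_swapped, x_j, x_i)  # swap back to the original article x_i
    # L1 penalty on any deviation from the original human image.
    return np.abs(y - y_back).mean()
```

A generator that leaves the image unchanged incurs zero cycle loss, while one that perturbs regions unrelated to the article is penalized.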
Our setup used an Nvidia K80 GPU with 8 GB of memory, and the code was implemented in Theano. For training we use ADAM with a minibatch size of 16. We use instance normalisation at all layers except the first and last, and ReLU as the activation function in all layers. Usually several hours of training (10,000 gradient steps) were enough to get reasonable results. In our experience both regularization terms in Eq. (1) are beneficial for the article swapping task, but we did not investigate their optimal strengths in detail.
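Instance normalisation, used at all but the first and last layers, normalises each channel of each sample independently over its spatial positions; a minimal version (omitting the learnable scale and shift that implementations typically add) is:

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """x: (N, C, H, W). Normalise every (sample, channel) slice separately."""
    mean = x.mean(axis=(2, 3), keepdims=True)  # per-instance, per-channel mean
    var = x.var(axis=(2, 3), keepdims=True)    # per-instance, per-channel variance
    return (x - mean) / np.sqrt(var + eps)
```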
The discriminator D uses convolutions with stride 2; the generator G also has convolutions with stride 2, plus deconvolutions for upsampling. We double the number of channels when the spatial resolution decreases. Figure 3 shows the spatial sizes and channel counts when training on 128x96 pixel images. For D it was enough to use 4 layers with a receptive field of only 63x63 pixels, enforcing local consistency when discriminating images.
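The receptive field of such a convolutional stack can be computed with a small helper; this is illustrative, and the kernel sizes in the examples below are assumptions, not the paper's exact configuration:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, ordered input to output."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the field by (k-1) input strides
        jump *= s             # stride compounds the spacing between output pixels
    return rf
```

For example, four stride-2 layers with 4x4 kernels give a 46x46 pixel receptive field.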
For G we use an encoder-decoder architecture with skip connections. In addition, we always use the last 6 channels of every intermediate layer (in both G and D) to store downsampled copies of the conditioning inputs. This improves the convergence of the models and the image quality; we suppose that in this way information from the conditioning is better preserved in the deep network.
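This trick of carrying the conditioning through the network can be sketched as concatenating downsampled copies of two 3-channel input images (6 extra channels) onto an intermediate feature map. A hedged NumPy illustration, using simple strided downsampling in place of whatever resampling the real network would use:

```python
import numpy as np

def with_conditioning(h, img_a, img_b, factor):
    """h: (C, H, W) feature map; img_a, img_b: (3, H*factor, W*factor) inputs."""
    a_small = img_a[:, ::factor, ::factor]  # nearest-neighbour downsampling
    b_small = img_b[:, ::factor, ::factor]
    # Reserve 6 extra channels of the layer for the conditioning copies.
    return np.concatenate([h, a_small, b_small], axis=0)
```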
The training data contained 15,000 images of humans (frontal view) and paired upper-body garments (pullovers and hoodies). The data was provided by Zalando SE (www.zalando.de), one of the biggest e-commerce fashion companies. We used the data in RGB color space, and tested both 128x96 and 256x192 pixel resolutions.
3.2 Generation results: images of human models with swapped clothes
In general, our model shows the correct behavior: generated images keep the looks of the human image and only swap the relevant article. Figure 4 shows our results on 3 randomly chosen models. In-place color changes are easier than texture and geometric deformations.
Figure 5 shows results at a higher resolution. We see that the masks can be of very good quality, which is impressive given that the segmentation is learned implicitly by the CAGAN objective. We note that the images we obtained from Zalando have a consistent background; street-view images would likely be more difficult, since background clutter makes segmentation more complex.
Figure 6(c) shows how we can swap different clothes on the same human, or apply the same fashion article to different humans (d). This also shows clearly that our model generalizes well and can combine any human and article (coming from the training distribution) in a visually appealing way. As a side note, we obtained similar performance using images not contained in the original training set, as long as the photoshoot style is consistent with the training data: the CAGAN does not memorize the training data. Note, however, that the model is good at transferring the colors and rough structures of clothes but not the fine textures; see Figure 6(c), top row 2nd image (from left to right) and bottom row 3rd image.
4.1 Related methods
Another recent method for generating people with clothes uses segmented data from the Chictopia dataset, and applies auto-encoders and an image translation cGAN to generate images of humans using segmented body regions as conditioning images. However, it lacks the ability to specify exactly what piece of clothing to generate: it generates some upper-body garment on the specified segmentation mask region, but cannot control how it looks exactly. In addition, the requirement for semantic segmentation data can be a limitation in practice: fashion companies do not usually gather such data, and inferring and labeling it can lead to extra costs, since per-pixel segmentation is expensive.
In contrast, using readily available fashion images, as CAGAN does, is a more natural fit for the fashion domain: such data already exists in huge quantities, and it also allows one to specify precisely which article to paint.
4.2 Future work
We showed the concept and first results of a novel type of GAN that can swap clothes on humans and offer new possibilities to image manipulation for fashion purposes. We plan to continue this work and improve upon it both in model architecture and application.
We want to use more data and try more ambitious fashion article swapping scenarios. Being able to change all fashion item categories (upper/lower body garments, shoes, accessories) on a human picture is essential for the full “virtual try-on” experience. We will also examine in detail how much good segmentations of human model images can improve the overall results. Having at least a foreground-background segmentation can indeed be beneficial. We can easily augment CAGAN with segmentation masks if they are provided for the human model pictures, either by directly overwriting the mask or by using them as a prior.
We plan to improve the neural network architecture in several ways. (i) Examine other color spaces, e.g. Lab instead of RGB. (ii) Test whether using texture descriptors can improve the results when swapping clothing with specific textile patterns; currently the CAGAN is inaccurate with complex textures. (iii) Analyze whether an embedding of the conditioning information (the article images x) can lead to better flow of information in the neural network and faster convergence. Right now the generator needs both to analyze the visual descriptors of the article image and to localize it on the human image. Having an embedding of the visual description of the article may help the earlier layers to focus on the correct region where the old fashion article is located.
-  U. Bergmann, N. Jetchev, and R. Vollgraf. Learning texture manifolds with the Periodic Spatial GAN. In Proceedings of The 34th International Conference on Machine Learning, 2017.
-  I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, 2014.
-  A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin. Image analogies. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 327–340. ACM, 2001.
-  P. Isola, J. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. CoRR, abs/1611.07004, 2016.
-  N. Jetchev, U. Bergmann, and R. Vollgraf. Texture synthesis with spatial generative adversarial networks. CoRR, abs/1611.08207, 2016.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
-  C. Lassner, G. Pons-Moll, and P. V. Gehler. A generative model of people in clothing. CoRR, abs/1705.04098, 2017.
-  X. Liang, C. Xu, X. Shen, J. Yang, S. Liu, J. Tang, L. Lin, and S. Yan. Human parsing with contextualized convolutional neural network. In ICCV, pages 1386–1394, 2015.
-  M. Mirza and S. Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014.
-  A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.
-  S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text-to-image synthesis. In Proceedings of The 33rd International Conference on Machine Learning, 2016.
-  W. Xian, P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays. TextureGAN: Controlling Deep Image Synthesis with Texture Patches. ArXiv e-prints, June 2017.
-  J. Yang, A. Kannan, D. Batra, and D. Parikh. LR-GAN: layered recursive generative adversarial networks for image generation. CoRR, abs/1703.01560, 2017.
-  R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV, 2016.
-  J. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR, abs/1703.10593, 2017.