Disentangling Multiple Conditional Inputs in GANs
In this paper, we propose a method that disentangles the effects of multiple input conditions in Generative Adversarial Networks (GANs). In particular, we demonstrate our method in controlling color, texture, and shape of a generated garment image for computer-aided fashion design. To disentangle the effect of input attributes, we customize conditional GANs with consistency loss functions. In our experiments, we tune one input at a time and show that we can guide our network to generate novel and realistic images of clothing articles. In addition, we present a fashion design process that estimates the input attributes of an existing garment and modifies them using our generator.
The process of fashion design requires an extensive amount of knowledge in the creation and production of garments. Designers often need to closely follow current trends and predict what will be popular in the future. Therefore, to stay ahead of the curve for commercial success, a well-structured and agile design process is crucial. A machine-assisted design approach that combines human experience with deep learning can help designers rapidly visualize an original garment and can save time on design iteration cycles.
In recent years, deep learning techniques have been used in various fashion-related topics, such as article representation and retrieval (Liu et al., 2016; Simo-Serra and Ishikawa, 2016; Bracher et al., 2016). In addition, variants of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) opened up image generation and manipulation possibilities that are, among other things, especially effective in fashion visualization and design (Zhu et al., 2016; Isola et al., 2017; Shuhui Jiang, 2017; Zhu et al., 2017; Jetchev and Bergmann, 2017).
GANs are a deep learning architecture that learns to map an easy-to-sample latent probability distribution into a complex and high-dimensional distribution, such as that of images. In their bare form, GANs do not provide out-of-the-box control over the generation process. Researchers have achieved control of image generation by using GANs that are conditioned on a categorical input (Mirza and Osindero, 2014; Odena et al., 2017). In this paper, we employ conditional GANs to control the visual attributes, such as color, texture, and shape, of a generated garment.
One of the main challenges of conditional image generation with GANs is to isolate the effects of input attributes on the final image. For example, we want the color of an article to stay constant when we tune its texture and/or shape. One possibility would be to employ Adversarial Autoencoders (Makhzani et al., 2016) or DNA-GAN (Xiao et al., 2018) to disentangle the inputs. However, this requires an exhaustive dataset. In other words, we need to have images of garments with comprehensive color, texture, and shape combinations. Unfortunately, most clothing articles are produced in a limited range of design options.
In this paper, we introduce a conditional GAN architecture and a training procedure that independently controls the color, texture, and shape characteristics of a generated garment image. In order to disentangle the influence of generator inputs, for each attribute, we add a consistency loss function at the output of our generator. We show that when we tune one attribute, the other two characteristics of the generated image stay visually stable. We then demonstrate a simple fashion-design process by first estimating the characteristics of a real article, and then modifying these properties using our generator inputs. Finally, we discuss possible future directions to add more control over the generation process.
An overview of our method can be found in Figure 1. Our generator has three inputs. The first one represents the average color of an article, $c \sim P_c$. The color is defined as a 3-dimensional vector of RGB values. We choose $P_c$ to be uniform over the interval [-1, 1], so that we can generate clothing articles with any desired color. The second input is a 512-dimensional latent vector $t$ that represents the texture and the local structure of an article. The third input is a binary segmentation mask $m \sim P_m$ that represents the shape of an article, where $P_m$ is the distribution of segmentation masks of real articles. In order to feed the shape input, we use a mask embedding network that transforms the binary image into a 512-dimensional vector. Note that the mask embedding network is a part of the image synthesis and is jointly trained with our generator. All three attributes are concatenated into a 1027-dimensional vector and are passed into the generator, resulting in $\tilde{x} = G(c, t, m)$, where $\tilde{x}$ is a generated image.
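The way the three inputs add up to 1027 dimensions (3 + 512 + 512) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: `embed_mask` is a hypothetical stand-in (a fixed random projection) for the learned mask embedding network, the standard normal texture prior and the 16×16 toy mask size are assumptions made only for illustration.

```python
import numpy as np

COLOR_DIM = 3      # average RGB color, each channel in [-1, 1]
TEXTURE_DIM = 512  # latent texture vector
SHAPE_DIM = 512    # output size of the mask embedding network


def embed_mask(mask: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the learned mask embedding network.

    Uses a fixed random projection followed by L2 normalization; the
    network in the paper is trained jointly with the generator.
    """
    proj_rng = np.random.default_rng(42)  # fixed seed: same projection on every call
    weights = proj_rng.standard_normal((mask.size, SHAPE_DIM))
    embedding = mask.reshape(-1).astype(np.float64) @ weights
    return embedding / (np.linalg.norm(embedding) + 1e-12)


def make_generator_input(color, texture, mask):
    """Concatenate color, texture, and embedded shape into one vector."""
    return np.concatenate([color, texture, embed_mask(mask)])


rng = np.random.default_rng(0)
color = rng.uniform(-1.0, 1.0, size=COLOR_DIM)
texture = rng.standard_normal(TEXTURE_DIM)  # assumed standard normal prior
mask = np.zeros((16, 16), dtype=np.uint8)   # toy resolution, for illustration only
mask[4:12, 5:11] = 1

g_input = make_generator_input(color, texture, mask)
print(g_input.shape)  # (1027,)
```

Concatenation keeps the three attributes in separate, fixed slices of the input vector, which makes it straightforward to vary one of them while holding the other two constant.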
Our discriminator (or critic) outputs a 4-dimensional vector and is trained to perform two tasks: it distinguishes (in 1-dim) between real and fake images and it estimates the average color (in 3-dim) of an input article. This is very similar to Auxiliary Classifier (AC) GANs (Odena et al., 2017), except that we replace the categorical classification with a color regression.
In order to train our GAN architecture, we use the WGAN-GP loss (Gulrajani et al., 2017). The first component of this loss represents a distance between real and generated image distributions, and is defined as follows:

$$L_w = \mathbb{E}_{\tilde{x} \sim P_g}\left[D_w(\tilde{x})\right] - \mathbb{E}_{x \sim P_r}\left[D_w(x)\right] \tag{1}$$

where $L_w$ is the Wasserstein loss in (Gulrajani et al., 2017), $D_w$ is the real-vs-fake output of our discriminator, $x$ is a real image that comes from the distribution $P_r$, and $\tilde{x}$ is a generated image that comes from $P_g$, the distribution induced by our generator $G$. As mentioned before, the auxiliary output of the discriminator attempts to correctly estimate the average color of real and generated article images, which is computed using the following function:

$$C(x, m) = \frac{1}{N_m} \sum_{i,j} m(i,j)\, x(i,j) \tag{2}$$

Here, $C$ is a function that calculates the average color of a real or fake article $x$ using its corresponding segmentation mask $m$. The sum is computed over image pixel locations $(i,j)$ and then normalized by $N_m$, which is the number of pixels inside the binary segmentation mask. From this point forward, we will drop the pixel locations from our notation. The auxiliary loss is defined as:

$$L_c = \mathbb{E}_{x \sim P_r}\left[\lVert D_c(x) - C(x, m) \rVert\right] + \mathbb{E}_{\tilde{x} \sim P_g}\left[\lVert D_c(\tilde{x}) - C(\tilde{x}, m) \rVert\right] \tag{3}$$

Here, $L_c$ is the auxiliary loss for color estimation, $D_c$ is the auxiliary output of the discriminator, and $m$ denotes the mask that corresponds to each image. The discriminator is trained by optimizing the following function:

$$L_D = L_w + \lambda_p L_p + L_c \tag{4}$$

where $L_p$ is the gradient penalty term from (Gulrajani et al., 2017) and $\lambda_p$ is its weight.
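The masked average color described above (a sum over pixel locations, normalized by the number of pixels inside the mask) can be sketched in a few lines of NumPy. The function name `average_color` and the `(H, W, 3)` array layout are assumptions made for this illustration.

```python
import numpy as np


def average_color(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Mean RGB color over the pixels where mask == 1.

    image: (H, W, 3) float array; mask: (H, W) binary array.
    """
    mask = mask.astype(image.dtype)
    n_inside = mask.sum()  # number of pixels inside the mask
    if n_inside == 0:
        raise ValueError("mask selects no pixels")
    # Broadcast the mask over the color channels, then sum over pixel locations.
    return (image * mask[..., None]).sum(axis=(0, 1)) / n_inside


image = np.ones((4, 4, 3))           # white background
image[1:3, 1:3] = [0.5, -0.5, 0.0]   # 2x2 colored "garment"
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1

garment_color = average_color(image, mask)
print(garment_color)
```

Because background pixels are multiplied by zero, the white border does not contribute to the result, so the function returns the garment color rather than the image-wide mean.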
We have multiple generator inputs that influence the appearance of a synthesized image. In order to disentangle the effects of these inputs, we use three loss functions that work together and minimize the crosstalk when we tune the attributes individually.
We want to control the average color of the generated article through the 3-dimensional color input. Consider that we generate two images, namely $\tilde{x}_1$ and $\tilde{x}_2$, that are synthesized with the same color and different texture and shape inputs. We can consider these images as independent random variables that are identically distributed with $P_{g|c}$. We can achieve color disentanglement at the output of our generator by enforcing the average color of the generated article images to be the same:

$$L_{cc} = \mathbb{E}_{c \sim P_c}\left[\mathbb{E}_{\tilde{x}_1, \tilde{x}_2 \sim P_{g|c}}\left[\lVert C(\tilde{x}_1, m_1) - C(\tilde{x}_2, m_2) \rVert\right]\right] \tag{5}$$

where $L_{cc}$ is defined as the color consistency loss, $C$ is the average color function of equation 2, and $m_1$ and $m_2$ are the masks used to generate the two images. The inner expectation ensures that for a given color input $c$, all the generated images have approximately the same average color. The outer expectation provides color consistency for all colors.
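The idea of color consistency, comparing the masked average colors of two images generated from the same color input, can be sketched on toy data. This is a minimal illustration and not the training code: the Euclidean norm, the helper `toy_article`, and the flat-colored toy images are assumptions.

```python
import numpy as np


def average_color(image, mask):
    """Mean RGB color over the pixels selected by a binary mask."""
    weights = mask.astype(image.dtype)[..., None]
    return (image * weights).sum(axis=(0, 1)) / weights.sum()


def color_consistency(image_a, mask_a, image_b, mask_b):
    """Distance between the masked average colors of two images.

    The Euclidean norm is an assumption made for this sketch.
    """
    diff = average_color(image_a, mask_a) - average_color(image_b, mask_b)
    return float(np.linalg.norm(diff))


def toy_article(color, rows, cols, size=8):
    """Hypothetical helper: a flat-colored rectangle on a white background."""
    image = np.ones((size, size, 3))
    mask = np.zeros((size, size))
    image[rows, cols] = color
    mask[rows, cols] = 1
    return image, mask


red = np.array([0.8, -0.6, -0.6])
blue = np.array([-0.6, -0.6, 0.8])

img1, m1 = toy_article(red, slice(1, 7), slice(2, 6))   # tall shape
img2, m2 = toy_article(red, slice(2, 5), slice(1, 7))   # wide shape, same color
img3, m3 = toy_article(blue, slice(2, 5), slice(1, 7))  # wide shape, other color

same_color = color_consistency(img1, m1, img2, m2)
different_color = color_consistency(img1, m1, img3, m3)
print(same_color, different_color)
```

The penalty is close to zero when two differently shaped articles share an average color and grows when the colors drift apart, which is the behavior the loss is meant to reward and punish, respectively.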
In order to provide texture consistency, we need to preserve the local structure (garment pattern, wrinkles, and shading), even when we change the color and shape inputs. In photo-realistic style transfer (Luan et al., 2017), Laplacian matting matrices (Levin et al., 2006) are employed to retain the details of an input image. These very sparse matrices represent the local structure around each pixel and can be used to measure structural similarity between a source and a target image. In our method, we adopt a similar approach to ensure texture consistency. Let $\tilde{x}_1$ and $\tilde{x}_2$ be two images that are generated with the same texture, but different colors and shapes. Similar to what we do in color consistency, we consider these images as independent random variables that are identically distributed with $P_{g|t}$. Prior to loss computation, we define the following operations:

$$V_k = V[\tilde{x}_k], \qquad M_k = M[\tilde{x}_k], \qquad k \in \{1, 2\} \tag{6}$$

where $V[\cdot]$ is an operator that flattens an input image into a matrix and $M[\cdot]$ computes the Laplacian matting matrix described in (Levin et al., 2006). In order to minimize texture inconsistencies, we propose the following loss function:

$$L_{tc} = \mathbb{E}_{t}\left[\mathbb{E}_{\tilde{x}_1, \tilde{x}_2 \sim P_{g|t}}\left[\mathrm{Tr}\left(V_1^{\top} M_2 V_1\right) + \mathrm{Tr}\left(V_2^{\top} M_1 V_2\right)\right]\right] \tag{7}$$

where $L_{tc}$ is the texture consistency loss, and $\mathrm{Tr}$ is the trace of a matrix. In equation 7, the inner expectation minimizes local structural differences between two images that are generated with the same texture input $t$, regardless of their colors and shapes. The outer expectation ensures that consistency exists for all texture inputs.
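The shape consistency is accomplished by generating background pixels for the locations that are outside the segmentation mask. This can be achieved by using the following loss function on the generated image $\tilde{x}$:

$$L_s = \mathbb{E}_{\tilde{x} \sim P_g}\left[\lVert \bar{m} \odot (\tilde{x} - b) \rVert_1\right] \tag{8}$$

Here, $\bar{m}$ is the binary complement of the input segmentation mask, $\odot$ denotes element-wise multiplication, $\lVert \cdot \rVert_1$ is the L1 norm, and $b$ is the background color, which, in our case, is white.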
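The shape loss can be illustrated with a short NumPy sketch that sums absolute deviations from the background color over the pixels outside the mask. The function name and the choice of `1.0` as "white" (which assumes images scaled to [-1, 1]) are assumptions for this example.

```python
import numpy as np


def shape_consistency(image, mask, background=1.0):
    """L1 distance to the background color, taken outside the mask only.

    image: (H, W, 3) array; mask: (H, W) binary array.
    background=1.0 assumes white in a [-1, 1] value range.
    """
    outside = 1.0 - mask.astype(image.dtype)  # binary complement of the mask
    return float(np.abs(outside[..., None] * (image - background)).sum())


mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1

clean = np.ones((4, 4, 3))
clean[1:3, 1:3] = 0.2    # garment pixels stay inside the mask

leaky = clean.copy()
leaky[0, 0] = 0.0        # one non-white pixel outside the mask

clean_loss = shape_consistency(clean, mask)
leaky_loss = shape_consistency(leaky, mask)
print(clean_loss, leaky_loss)  # 0.0 3.0
```

Pixels inside the mask never contribute, so the loss constrains only where the garment may appear and leaves its color and texture to the other loss terms.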
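In our experiments, we observed that the color consistency loss in equation 5 sometimes causes the average color of the generated articles to collapse into a single color. Therefore, in order to avoid trivial solutions and to stabilize the training, we put an additional average color check right after the generator as follows:

$$L_{gc} = \mathbb{E}_{c \sim P_c}\left[\mathbb{E}_{\tilde{x} \sim P_{g|c}}\left[\lVert C(\tilde{x}, m) - c \rVert\right]\right] \tag{9}$$

where $C(\tilde{x}, m)$ is the masked average color of the generated image $\tilde{x}$ and $c$ is the color input that was used to generate it.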
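We aggregate the aforementioned losses and train our generator using the following function:

$$L_G = -\mathbb{E}_{\tilde{x} \sim P_g}\left[D_w(\tilde{x})\right] + \lambda_{cc} L_{cc} + \lambda_{tc} L_{tc} + \lambda_s L_s + \lambda_{gc} L_{gc} \tag{10}$$

Here, $D_w$ is the real-vs-fake output of the discriminator, and $\lambda_{cc}$, $\lambda_{tc}$, $\lambda_s$, and $\lambda_{gc}$ are the weights for the loss functions for color, texture, shape, and generator check, respectively.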
In our experiments, we used a server with an Intel Xeon CPU (E5-2630), 256GB of system memory, and an NVIDIA P100 GPU with 12GB of graphics memory. We modified the TensorFlow code from (Karras et al., 2018), where they used temporal smoothing over the generator to create photorealistic images; our source code is available at https://github.com/zalandoresearch/disentangling_conditional_gans. Unlike that work, we did not progressively grow our model. Instead, we trained a network to directly generate images at the final resolution. Both our generator and discriminator layers are the same as in (Karras et al., 2018). Our mask embedding network has the same architecture as our discriminator, except for the last layer, which outputs a 512-dimensional vector and normalizes it with the L2-norm.
We train our model using ADAM as optimizer (Kingma and Ba, 2014) with the following parameters: learning rate, , , . The weights of our generator loss functions are .
Our training dataset is composed of over 120,000 images of dresses that are downloaded from Zalando’s website (www.zalando.de). These images are padded into squares of a fixed pixel size. At each training iteration, we sample two random colors, textures, and masks, and feed them into our generator using all combinations, which gives us a batch size of $2^3 = 8$. This combinatorial batch, with the help of our loss functions, disentangles the effects of input attributes, because we check whether the effect of a certain attribute stays roughly the same for each combination.
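The combinatorial batch can be illustrated with the standard library alone: two values for each of the three attributes yield 2 × 2 × 2 = 8 generator inputs. The string placeholders below are hypothetical stand-ins for the sampled color vectors, texture vectors, and masks.

```python
from collections import defaultdict
from itertools import product

# Hypothetical placeholders for two sampled values of each attribute.
colors = ["color_a", "color_b"]
textures = ["texture_a", "texture_b"]
masks = ["mask_a", "mask_b"]

# Every (color, texture, mask) combination becomes one generator input.
batch = list(product(colors, textures, masks))
print(len(batch))  # 8

# Group the batch by color: each group shares a color but varies in
# texture and shape, which is what a color consistency check compares.
by_color = defaultdict(list)
for color, texture, mask in batch:
    by_color[color].append((texture, mask))

for color, members in by_color.items():
    print(color, len(members))
```

Grouping by texture or by mask works the same way, so a single batch of eight images provides comparison sets for all three consistency losses.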
In our experiments, for input and output images, we used the standard RGB color space representation. One can easily use CIE-LAB color space with small modifications to our architecture.
We demonstrate the effect of color tuning in Figure 2, where we randomly select three colors (represented as a colored square at the top-left corner) and generate three articles using the same texture and shape. We observe that the average colors of the generated articles are approximately the same as their corresponding input colors. In addition, the local structure and the shape are very stable and mostly unaffected by the input color changes.
The color distribution of real articles is not uniform. This can limit the generator output to a certain set of colors, as the discriminator might reject a plausible looking dress with an unlikely color. In Figure 3, for around 3000 generated images, we plot the log-likelihood of their input colors against their discriminator scores. We measure the color likelihood by fitting a Gaussian Mixture Model (16 components) on the colors of real articles. We can observe that, although there is a small linear correlation, our discriminator is not strongly biased towards colors of the real articles. This is especially important when designers want to create novel garments.
In order to show the texture control, we randomly sample three latent vectors and generate images using the same color and shape. We can see in Figure 4 that different texture inputs create distinct local structures and patterns, yet the average color and the shape of the articles are equivalent.
In our generator, the shape of a generated article is guided by providing a binary mask. Similar to color and texture, we want to adjust the shape of the generated article independently of the color and texture inputs. In Figure 5, we demonstrate three synthesized images that have the same color and texture, but different shapes. We observe that, although not pixel precise, the garment outlines faithfully follow the input shape mask, which is sufficient for design visualizations. One can see that the local structure of an article cannot stay perfectly unchanged when we change its shape. However, our generator creates realistic article images with different shapes yet similar-looking textures. These images are superior to cut-out versions of 2D pattern maps.
Next, we investigate the stability of the generated images when we input masks that do not come from the real mask distribution. In Figure 6, we can see that, although the mask is hand drawn and unlikely to belong to a real article, its corresponding article looks plausible. This property can be used in a fashion design user interface to define article outlines.
In order to modify images of real fashion articles, we need to find their corresponding color, texture, and shape inputs. We calculate the shape (or segmentation mask) of a garment by using a simple neural network that is trained on article images and their corresponding binary masks. Alternatively, one can obtain the mask of an image by using an interactive method such as GrabCut (Rother et al., 2004). Once we have the shape estimate $\hat{m}$, we can estimate the average color $\hat{c}$ using equation 2. The texture (or the local structure) of an article can be obtained by solving the following optimization problem:

$$\hat{t} = \arg\min_{t} \; \lambda_{tc} L_{tc}(x, \tilde{x}) + \lambda_{kl} L_{kl}(t) \tag{11}$$

where $x$ is the real article image, $\tilde{x} = G(\hat{c}, t, \hat{m})$, $L_{tc}$ is the texture consistency loss, and $L_{kl}$ computes the KL-divergence between the texture vector and a zero-mean, unit-variance normal distribution as follows:

$$L_{kl}(t) = \frac{1}{2}\left(\mu^2 + \sigma^2 - \log \sigma^2 - 1\right) \tag{12}$$

Here, $\mu$ and $\sigma^2$ are the mean and variance values that are calculated over the individual elements of the texture vector $t$. The KL-divergence regularizes the estimated $\hat{t}$ to be close to the distribution used during training of the GAN. The two terms are weighted by $\lambda_{tc}$ and $\lambda_{kl}$ during the optimization. In Figure 7, we demonstrate a real article and its reconstructed version using the estimated input tuple $(\hat{c}, \hat{t}, \hat{m})$. We can see that the color and shape inputs are accurately reflected in the reconstructed version. In addition, the estimated texture is able to capture the horizontal line in the middle, and the shading/wrinkling in the lower part of the garment.
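The KL-divergence regularizer can be sketched with the closed form for a univariate Gaussian against a standard normal, $\frac{1}{2}(\mu^2 + \sigma^2 - \log \sigma^2 - 1)$, with the mean and variance taken over the elements of the texture vector. The function name and the use of the population variance are assumptions made for this example.

```python
import math
from statistics import fmean, pvariance


def kl_to_standard_normal(texture):
    """KL( N(mu, var) || N(0, 1) ) with mu and var estimated from the elements.

    Uses the population variance, which is an assumption of this sketch.
    """
    mu = fmean(texture)
    var = pvariance(texture, mu)
    if var <= 0.0:
        raise ValueError("texture vector must not be constant")
    return 0.5 * (mu * mu + var - math.log(var) - 1.0)


matched = kl_to_standard_normal([-1.0, 1.0])  # mean 0, variance 1
shifted = kl_to_standard_normal([1.0, 3.0])   # mean 2, variance 1
print(matched, shifted)  # 0.0 2.0
```

The penalty vanishes only when the element statistics match a zero-mean, unit-variance distribution, so minimizing it pulls the estimated texture vector back toward the region of latent space the generator saw during training.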
In Figure 8, we modify the input attributes to the generator and observe the process is robust and each attribute separately affects the reconstructed article.
Our generator only ensures that the average color within the mask will be the same as the input color. This assumption limits the generator and prevents it from capturing some parts of the real image distribution $P_r$. For example, the multi-colored garment in Figure 9 is not accurately reconstructed, due to its complex structure.
It is possible to extend the color input from a single color to a collection of colors or to a color histogram. In this case, however, a single shape mask might not be enough to check if the colors are correctly generated at correct image locations. Instead, a full color segmentation of the image might be required.
The Laplacian matting matrix in our texture consistency loss is computed using a neighborhood around each pixel. If we increase the neighborhood size, we can capture larger structures on the articles without changing the size of the Laplacian matting matrix. However, the sparsity of this matrix decreases quadratically with the neighborhood size, which would dramatically increase the computation time and memory requirements.
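The quadratic growth in density can be made concrete with a small calculation. In the matting Laplacian of (Levin et al., 2006), two pixels have a nonzero entry only if some window contains both of them, so with square `w × w` windows an interior pixel is coupled to at most `(2w - 1)²` pixels, itself included. The function name below is hypothetical, and the bound ignores image borders, where rows have fewer entries.

```python
def max_nonzeros_per_row(window: int) -> int:
    """Upper bound on nonzero entries per row of a matting Laplacian.

    window: side length of the square neighborhood (odd, >= 1).
    Two pixels share a window only if they are at most window - 1
    pixels apart along each axis.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    reach = window - 1
    return (2 * reach + 1) ** 2


for w in (3, 5, 7):
    print(w, max_nonzeros_per_row(w))
# 3 25
# 5 81
# 7 169
```

The matrix itself stays the same size, one row and one column per pixel, but the number of stored entries per row more than triples when going from a 3×3 to a 5×5 neighborhood.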
In this paper, we presented a generative adversarial network architecture and a corresponding training procedure that takes color, texture, and binary shape mask as input, and outputs an image of a fashion article. We showed that by using our consistency loss functions, we were able to disentangle the effects of generator inputs, which enabled us to independently tune the attributes of a generated image. Our generator presents an opportunity to easily design and modify fashion images.
The attributes we presented here are only a subset of the characteristics that fashion designers require. We plan to add more sophisticated control over the generation process by extending our method to multiple color inputs and by allowing texture input directly from an image or another article.
Image-to-Image Translation with Conditional Adversarial Networks.
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence.
Fashion Style in 128 Floats: Joint Ranking and Classification Using Weak Data for Feature Extraction. In IEEE Conference on Computer Vision and Pattern Recognition.