
SieveNet: A Unified Framework for Robust Image-Based Virtual Try-On

by Surgan Jandial, et al.

Image-based virtual try-on for fashion has gained considerable attention recently. The task requires fitting an in-shop clothing item onto the image of a target model. An efficient framework for this is composed of two stages: (1) warping (transforming) the try-on cloth to align with the pose and shape of the target model, and (2) a texture transfer module to seamlessly integrate the warped try-on cloth onto the target model image. Existing methods suffer from artifacts and distortions in their try-on output. In this work, we present SieveNet, a framework for robust image-based virtual try-on. First, we introduce a multi-stage coarse-to-fine warping network to better model fine-grained intricacies (while transforming the try-on cloth) and train it with a novel perceptual geometric matching loss. Next, we introduce a try-on cloth conditioned segmentation mask prior to improve the texture transfer network. Finally, we introduce a duelling triplet loss strategy for training the texture translation network, which further improves the quality of the generated try-on results. We present extensive qualitative and quantitative evaluations of each component of the proposed pipeline and show significant performance improvements over the current state-of-the-art method.





1 Introduction

Providing interactive shopping experiences is an important problem for online fashion commerce. Consequently, several recent efforts have been directed towards delivering smart, intuitive online experiences including clothing retrieval [13, 4], fine-grained tagging [1, 15], compatibility prediction [5, 24, 23, 9, 2] and virtual try-on [25, 7]. Virtual try-on is the visualization of fashion products in a personalized setting. The problem consists of trying on a specific garment on the image of a person. It is especially important for online fashion commerce because it compensates for the lack of a direct physical experience of in-store shopping.

Recent methods based on deep neural networks [25, 7] formulate the problem as one of conditional image generation. As depicted in Figure 1, the objective is to synthesize a new image (henceforth referred to as the try-on output) from two images, a try-on cloth and a target model image, such that in the try-on output the target model is wearing the try-on cloth while the original body shape, pose and other model details (e.g. bottom, face) are preserved.

Figure 1: The task of image-based virtual try-on involves synthesizing a try-on output where the target model is wearing the try-on cloth while other characteristics of the model and cloth are preserved.

A successful virtual try-on experience depends on synthesizing images free from artifacts arising from improper positioning or shaping of the try-on garment, and from inefficient composition resulting in blurry or bleeding garment textures in the final try-on output. Current solutions [7, 25] suffer from these problems, especially when the try-on cloth is subject to extreme deformations or when the characteristics of the try-on cloth and the original clothing item on the target model differ. For example, transferring a half-sleeves shirt image to a target model originally in a full-sleeves shirt often results in texture bleeding and incorrect warping. To alleviate these problems, we propose:

  1. A multi-stage coarse-to-fine warping module trained with a novel perceptual geometric matching loss to better model fine intricacies while transforming the try-on cloth image to align with the shape of the target model,

  2. A conditional segmentation mask generation module to assist in handling complexities arising from complex pose, occlusion and bleeding during the texture transfer process, and

  3. A duelling triplet loss strategy for training the texture translation network to further improve quality of the final try-on result.

We show significant qualitative and quantitative improvement over the current state-of-the-art method for image-based virtual try-on. An overview of our SieveNet framework is presented in Figure 2 and the training pipeline is detailed in Figure 3.

Figure 2: Inference Pipeline of the SieveNet framework
Figure 3: An overview of the training pipeline of SieveNet, containing (A) Coarse-to-Fine Warping Module, (B) Conditional Segmentation Mask Generation Module, and (C) Segmentation Assisted Texture Translation Module.

2 Related Work

Our work is related to existing methods for conditional person image synthesis that use pose and shape information to produce images of humans, and to existing virtual try-on methods - most notably [25].

Conditional Image Synthesis

Ma et al. [14] proposed a framework for generating human images with pose guidance, along with a refinement network trained using an adversarial loss. Deformable GANs [22] attempted to alleviate the misalignment problem between different poses by applying affine transformations to coarse rectangular regions and warping the parts at the pixel level. In [6], Esser et al. introduced a variational U-Net [19] to synthesize the person image by restructuring the shape with stickman pose guidance. [17] applied CycleGAN directly to manipulate pose. However, all of these methods fail to preserve the texture details of the clothes in the output and therefore cannot be applied directly to the virtual try-on problem.

Virtual Try-On

Initial works on virtual try-on were based on 3D modeling techniques and computer graphics. Sekine et al. [21] introduced a virtual fitting system that captures 3D measurements of body shape via depth images for adjusting 2D clothing images. Pons-Moll et al. [16] used a 3D scanner to automatically capture real clothing and estimate body shape and pose. Compared to graphics models, image-based generative models provide a more economical and computationally efficient solution. Jetchev et al. [11] proposed a conditional analogy GAN to swap fashion articles between models without using person representations. They do not take pose variation into consideration and, during inference, require paired images of in-shop clothes and a wearer, which limits their applicability in practical scenarios. In [7], Han et al. introduced a virtual try-on network to transfer a desired garment onto a person image. It uses an encoder-decoder with skip connections to produce a coarse clothing mask and a coarse rendered person image. It then uses a thin-plate spline (TPS) based spatial transformation (from [10]) to align the garment image with the pose of the person, and finally a refinement stage to overlay the warped garment image onto the coarse person image to produce the final try-on image. Most recently, Wang et al. [25] presented an improvement over [7] by directly predicting the TPS parameters from the pose and shape information and the try-on cloth image. Both of these methods suffer from geometric misalignment and blurry or bleeding textures in cases where the target model is occluded or where pose or garment shape variation is high. Our method, aligned to the approach in [25], improves upon all of these methods. SieveNet learns the TPS parameters in multiple stages to handle fine-grained shape intricacies, and uses a conditional segmentation mask generation step to aid in handling pose variation and occlusions and to improve textures. In Section 5.2, we compare our results with [25].

3 Proposed Methodology

The overall process (Figure 2) comprises two main stages: warping the try-on cloth to align with the pose and shape of the target model, and transferring the texture from the warped output onto the target model to generate the final try-on image. We introduce three major refinements into this process. To capture fine details in the geometric warping stage, we use a two-stage spatial-transformer based warp module (Section 3.2). To prevent the garment textures from bleeding onto skin and other areas, we introduce a conditional segmentation mask generation module (Section 3.4.1) that computes an expected semantic segmentation mask reflecting the bounds of the target garment on the model, which in turn assists the texture translation network in producing realistic try-on results. We also propose two new loss computations: a perceptual geometric matching loss (Section 3.3) to improve the warping output, and a duelling triplet loss strategy (Section 3.4.3) to improve the output from the texture translation network.

3.1 Inputs

The framework takes as input the try-on cloth image and a 19-channel pose and body-shape map generated as described in [25]. The latter is a cloth-agnostic person representation created from the model image to overcome the unavailability of ideal training triplets, as discussed in [25]. A human-parsing semantic segmentation mask is also used as ground truth during training of the conditional segmentation mask generation module (described in Sections 3.3 and 3.4.1). For training, the task is set up so that the data consists of paired examples in which the model image shows the model wearing the corresponding clothing product.

3.2 Coarse-to-Fine Warping

The first stage of the framework warps the try-on product image to align with the pose and shape of the target model, using the pose and body-shape priors as guidance. Warping is achieved using thin-plate spline (TPS) based spatial transformers [10], as introduced in [25], with a key difference: we learn the transformation parameters in a two-stage cascaded structure and use a novel perceptual geometric matching loss for training.

3.2.1 Tackling Occlusion and Pose-variation

We posit that accurate warping requires accounting for intricate modifications resulting from two major factors:

  1. Large variations in shape or pose between the try-on cloth image and the corresponding regions in the model image.

  2. Occlusions in the model image. For example, the long hair of a person may occlude part of the garment near the top.

The warping module is formulated as a two-stage network to overcome these problems of occlusion and pose variation. The first stage predicts a coarse-level transformation, and the second stage predicts fine-level corrections on top of it. The transformation parameters from the coarse-level regression network are used to warp the product image into an approximate warp output. This output is then used to compute the fine-level transformation parameters, but the corresponding final warp output is computed by applying the fine-level transformation to the initial try-on cloth, not to the coarse warp output; this avoids the artifacts that result from applying the spatial transformer's interpolation twice. To facilitate the expected hierarchical behaviour, residual connections are introduced to offset the parameters of the fine transformation with those of the coarse transformation. The network structure is schematized in Figure 3 (A). An ablation study supporting the design of the network and losses is in Section 5.3.
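The residual two-stage parameterization can be sketched with a toy numpy example. The affine transform here is a stand-in for the paper's TPS warp, and all parameter values are illustrative, not taken from the paper:

```python
import numpy as np

def warp_points(points, theta):
    """Apply a 2x3 affine transform -- a toy stand-in for the TPS-based
    spatial transformer used in the paper."""
    A, t = theta[:, :2], theta[:, 2]
    return points @ A.T + t

# Stage 1 predicts coarse parameters; stage 2 predicts a residual that is
# ADDED to the coarse parameters (the residual connection described above),
# so the fine warp refines rather than replaces the coarse warp.
theta_coarse = np.array([[1.1, 0.00, 0.5],
                         [0.0, 0.90, -0.2]])
delta_fine = np.array([[-0.1, 0.02, -0.5],
                       [0.01, 0.10, 0.2]])
theta_fine = theta_coarse + delta_fine

cloth_pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
coarse_warp = warp_points(cloth_pts, theta_coarse)  # approximate warp output
# The fine transform is applied to the ORIGINAL cloth points, not to
# coarse_warp, mirroring how the paper avoids interpolating twice.
fine_warp = warp_points(cloth_pts, theta_fine)
```

Note that the second warp reuses the original cloth as input; only the parameters, not the pixels, are composed across the two stages.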

3.3 Perceptual Geometric Matching Loss

During training, the interim and final outputs of the warping stage are both subject to a matching loss against the try-on cloth segmented out of the model image. The overall warping loss augments these matching terms with a novel perceptual geometric matching loss, whose intuition is to have the second-stage warp incrementally improve upon the first-stage warp. The perceptual geometric matching loss comprises two components, computed using the cloth worn on the target model and the binary mask representing that cloth.


Minimizing the first component pushes the second-stage output closer to the ground truth than the first-stage output; a scalar multiplicative margin is applied to enforce a stricter bound on this difference.

Figure 4: Visualization of the Perceptual Geometric Matching Loss in VGG-19 Feature Space.

For the second component, the two warp outputs and the ground-truth cloth are first mapped to the VGG-19 activation space, and the loss then aligns the corresponding difference vectors among these three representations in the feature space. Minimizing this alignment term facilitates the goal of the first component.
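Since the exact equations did not survive extraction, the following numpy sketch captures one plausible reading of the two components; the hinge form, the margin value k, and the use of plain arrays in place of VGG-19 feature maps are all assumptions for illustration:

```python
import numpy as np

def l1(a, b):
    return float(np.abs(a - b).mean())

def perceptual_geometric_matching_loss(f_coarse, f_fine, f_gt, k=3.0):
    """Toy sketch of the two components described above.

    push:  a hinge that is zero only when the fine warp is closer to the
           ground truth than the coarse warp by the multiplicative margin k.
    align: aligns the improvement direction (fine - coarse) with the
           remaining gap (gt - coarse) in the (stand-in) feature space.
    """
    push = max(0.0, k * l1(f_fine, f_gt) - l1(f_coarse, f_gt))
    align = l1(f_fine - f_coarse, f_gt - f_coarse)
    return push + align

f_gt = np.ones((8, 8))
f_coarse = np.zeros((8, 8))
# When the fine warp matches the ground truth exactly, both terms vanish.
loss_perfect = perceptual_geometric_matching_loss(f_coarse, f_gt, f_gt)
```

A fine warp that fails to improve on the coarse warp incurs both a hinge penalty and an alignment penalty, matching the intuition stated above.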

3.4 Texture Transfer

Once the product image is warped to align with the pose and shape of the target model, the next stage transfers the warped product to the model image. This stage computes a rendered model image, and a fractional composition mask to compose the warped product image onto the rendered model image. We break down this stage into two steps - conditional segmentation mask prediction and segmentation assisted texture translation.

Figure 5: Illustration of the Conditional Segmentation Mask Prediction Network

3.4.1 Conditional Segmentation Mask Prediction

A key problem with existing methods is their inability to accurately honor the bounds between the clothing product and human skin. Product pixels often bleed into skin pixels (or vice versa), and in cases of self-occlusion (such as folded arms), the skin pixels may get replaced entirely. This problem is exacerbated when the try-on clothing item has a significantly different shape than the clothing in the model image, and further aggravated when the target model is in a complex pose. To help mitigate these problems of bleeding and self-occlusion, as well as to handle variable and complex poses, we introduce a conditional segmentation mask prediction network.

Figure 3 (B) illustrates the schematics of the network. It takes the pose and shape priors and the product image as input to generate an “expected” segmentation mask. This try-on clothing conditioned segmentation mask represents the expected segmentation of the generated try-on output in which the target model is wearing the try-on cloth. Since we are constrained to train with coupled data, this generated segmentation mask is matched against the ground-truth segmentation mask itself. We highlight that the network is nevertheless able to generalize to unseen models at inference, since it learns from a sparse, clothing-agnostic input that does not include any effects of the worn cloth in the target model image or its segmentation mask (to avoid learning the identity mapping). At inference time, the generated mask is used directly downstream. Figure 5 shows examples of corrected segmentation masks generated with our network, and ablation studies supporting the use of the conditional segmentation mask are in Section 5.3.

The network (a 12-layer U-Net [19] like architecture) is trained with a weighted cross-entropy loss, which is the standard cross-entropy loss for semantic segmentation with increased weights for skin and background classes. The weight of the skin is increased to better handle occlusion cases, and the background weight is increased to stem bleeding of the skin pixels into the background.
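The weighted cross-entropy described above can be sketched as follows; the class indices and weight values are illustrative (the text does not specify them):

```python
import numpy as np

def weighted_cross_entropy(logits, target, class_weights):
    """Per-pixel cross-entropy where each class contributes with its own
    weight; skin and background would get weights > 1 as described above.

    logits: (C, H, W) raw scores; target: (H, W) integer class ids;
    class_weights: (C,) weight per class.
    """
    z = logits - logits.max(axis=0, keepdims=True)          # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=0, keepdims=True))
    h, w = target.shape
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    picked = log_probs[target, rows, cols]                  # log p(true class) per pixel
    weights = class_weights[target]                         # per-pixel class weight
    return float(-(weights * picked).sum() / weights.sum())
```

Up-weighting a class makes its pixels dominate the average, so mistakes on skin and background are penalized more than mistakes on other classes.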

3.4.2 Segmentation Assisted Texture Translation

The last stage of the framework uses the expected segmentation mask, the warped product image, and the unaffected regions of the model image to produce the final try-on image. The network is a 12-layer U-Net [19] that takes the following inputs:

  • The warped product image,

  • The expected segmentation mask, and

  • Pixels of the model image for the unaffected regions (Texture Translation Priors in Figure 3), e.g. the face and the bottom cloth if a top garment is being tried on.

The network produces two output images: an RGB rendered person image and a fractional composition mask. The final try-on image is obtained by compositing the warped product image over the rendered person image with the mask, i.e. mask-weighted pixels come from the warped product image and the remaining (one minus mask) weight comes from the rendered person image.
Because the unaffected parts of the model image are provided as prior, the proposed framework is also able to better translate texture of auxiliary products such as bottoms onto the final generated try-on image (unlike in [25] and [7]).
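The composition step can be sketched directly; this follows the standard mask-weighted blend implied by the description of the fractional composition mask:

```python
import numpy as np

def compose_tryon(warped_cloth, rendered_person, mask):
    """Fractional composition described above: the predicted composition
    mask selects warped-cloth pixels, and the rendered person image shows
    through elsewhere. Arrays are (H, W, 3) floats in [0, 1]; mask is
    (H, W, 1) in [0, 1]."""
    return mask * warped_cloth + (1.0 - mask) * rendered_person
```

A fractional (rather than binary) mask lets the network blend softly at garment boundaries, which helps avoid hard seams between the cloth and skin regions.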

The output of the network is subject to matching losses based on a pixel-wise distance and a perceptual distance computed on VGG-19 activations.
Training happens in multiple phases. The first phase is a conditioning phase that minimizes the matching losses to produce reasonable results. Subsequent phases, each lasting a fixed number of steps, employ the matching losses augmented with a triplet loss (Section 3.4.3) to fine-tune the results further. This strategy improves the output significantly (see the ablation study in Section 5.3).

3.4.3 Duelling Triplet Loss Strategy

A triplet loss is characterized by an anchor, a positive and a negative (w.r.t. the anchor), with the objective of simultaneously pushing the anchor result towards the positive and away from the negative. In the duelling triplet loss strategy, we pit the output of the network with its current weights (anchor) against the output of the network with weights from the previous phase (negative), and push it towards the ground truth (positive). As training progresses, this online hard-negative mining strategy helps push the results closer to the ground truth, with the negative updated at discrete step intervals. The overall loss during the fine-tuning phases is then the matching loss augmented with this duelling triplet term.
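A minimal sketch of the duelling triplet term, assuming an L1 distance and an illustrative margin (the paper's exact distance and margin are not reproduced in this text):

```python
import numpy as np

def l1(a, b):
    return float(np.abs(a - b).mean())

def duelling_triplet_loss(current_out, previous_out, ground_truth, margin=0.1):
    """Sketch of the strategy above: the current network's output (anchor)
    is pulled toward the ground truth (positive) and pushed away from the
    frozen output of the previous training phase (negative)."""
    return max(0.0, l1(current_out, ground_truth)
                    - l1(current_out, previous_out) + margin)
```

The loss is zero once the current output is closer to the ground truth than to the previous-phase output by the margin, so each phase must strictly improve on the last.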

Figure 6: SieveNet can generate more realistic try-on results compared to the state-of-the-art CP-VTON.

4 Experiments

4.1 Datasets

We use the dataset collected by Han et al. [7] for training and testing. It contains around 19,000 images of front-facing female models and the corresponding isolated upper-clothing product images. There are 16,253 cleaned pairs, split into a training set of 14,221 pairs and a testing set of 2,032 pairs. The images in the testing set are rearranged into unpaired sets for qualitative evaluation and kept paired for quantitative evaluation.

4.2 Implementation Details

All experiments are conducted on four NVIDIA 1080Ti GPUs on a machine with 16 GB of RAM, using the Adam optimizer [12].

4.3 Quantitative Metrics

To effectively compare the proposed approach against the current state-of-the-art, we report our performance using various metrics including Structural Similarity (SSIM) [26], Multiscale-SSIM (MS-SSIM) [27], Fréchet Inception Distance (FID) [8], Peak Signal to Noise Ratio (PSNR), and Inception Score (IS) [20]. We adapt the Inception Score metric as a measure of generated image quality by estimating the similarity of the generated image distribution to the ground-truth distribution. For computing the pairwise MS-SSIM and SSIM metrics, we use the paired test data.
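As a concrete example of the simplest of these metrics, PSNR between a generated try-on image and its ground truth can be computed from its standard definition (this is the generic formula, not anything SieveNet-specific):

```python
import numpy as np

def psnr(img_a, img_b, max_val=1.0):
    """Peak Signal-to-Noise Ratio (higher is better): 10 * log10 of the
    squared peak value over the mean squared error between the images."""
    mse = float(np.mean((np.asarray(img_a) - np.asarray(img_b)) ** 2))
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```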

4.4 Baselines

CP-VTON [25] and VITON [7] are the latest image-based virtual try-on methods, with CP-VTON being the current state-of-the-art. In particular, [7] directly applies shape context [3] matching to compute the transformation mapping. By contrast, [25] estimates the transformation mapping using a convolutional network and performs better than [7]. We therefore use results from CP-VTON [25] as our baseline.

5 Results

The task of virtual try-on can be broadly broken down into two stages, warping of the product image and texture transfer of the warped product image onto the target model image. We conduct extensive quantitative and qualitative evaluations for both stages to validate the effectiveness of our contributions (coarse-to-fine warping trained with perceptual geometric matching loss, try-on cloth conditioned segmentation mask prior, and the duelling triplet loss strategy for training the texture translation network) over the existing baseline CP-VTON [25].

5.1 Quantitative Results

Table 1 summarizes the performance of our proposed framework against CP-VTON on benchmark metrics for image quality (IS, FID and PSNR) and pair-wise structural similarity (SSIM and MS-SSIM). To highlight the benefit of our contributions in warp and texture transfer, we experiment with different warping and texture transfer configurations (combining modules from CP-VTON with our modules). All scores progressively improve as we swap in our modules. Using our final configuration of coarse-to-fine warp (C2F) and segmentation assisted texture translation with duelling triplet strategy (SATT-D) improved FID from 20.331 (for CP-VTON) to 14.65. Also, PSNR increased by around 17%, from 14.544 to 16.98. While a higher Inception Score (IS) is not necessarily representative of output quality for virtual try-on, we argue that the proposed approach better models the ground-truth distribution, as it produces an IS (2.82 ± 0.09) that is closer to the IS of the ground-truth images in the test set (2.83 ± 0.07) than CP-VTON's (2.66 ± 0.14). These quantitative claims are further substantiated in subsequent sections, where we qualitatively highlight the benefit of each of the components.

Method                    SSIM   MS-SSIM  FID     PSNR    IS
GMM + TOM (CP-VTON)       0.698  0.746    20.331  14.544  2.66 ± 0.14
GMM + SATT                0.751  0.787    15.89   16.05   2.84 ± 0.13
C2F + SATT                0.755  0.794    14.79   16.39   2.80 ± 0.08
C2F + SATT-D (SieveNet)   0.766  0.809    14.65   16.98   2.82 ± 0.09
Table 1: Quantitative comparison of the proposed framework vs. CP-VTON. GMM and TOM are the warping and texture transfer modules from CP-VTON. C2F is the coarse-to-fine warp network and SATT is the segmentation assisted texture translation network we introduce in this framework. SATT-D is SATT trained with the duelling triplet loss strategy.

5.2 Qualitative Results

Figure 6 presents a comparison of results from the proposed framework with those of CP-VTON. The results compare the impact on different aspects of quality: skin generation (row 1), handling occlusion (row 2), variation in poses (row 3), preserving unaffected regions and better geometric warping (row 4), avoiding bleeding (row 5), and overall image quality (row 6). Across all aspects, our method produces better results than CP-VTON for most of the test images. These observations are corroborated by the quantitative results reported in Table 1.

5.3 Ablation Studies

In this section, we present a series of ablation studies to qualitatively highlight the impact of each of our contributions: the coarse-to-fine warp, the try-on product conditioned segmentation prediction, and the duelling triplet loss strategy for training the texture translation module.

Figure 7: Comparison of our C2F warp results with GMM warp results. Warped clothes are directly overlaid onto the target persons for visual inspection. C2F produces robust warp results, as seen in the preservation of text (row 1) and horizontal stripes (rows 2 and 3), along with better fitting; GMM produces highly unnatural results.
Impact of Coarse-to-Fine Warp
Figure 8: Using the conditional segmentation mask as prior to texture transfer aids in better handling of complex pose, occlusion and helps avoid bleeding.

Figure 7 presents sample results comparing outputs of the proposed coarse-to-fine warp approach against the geometric matching module (GMM) used in CP-VTON [25]. Learning warp parameters in a multi-stage framework helps in better handling of large variations in model pose and body-shape in comparison to the single stage warp in [25]. The coarse-to-fine (C2F) warp module trained with our proposed perceptual geometric matching loss does a better job at preserving textures and patterns on warping. This is further corroborated through the quantitative results in Table 1 (row 2 vs row 3).

Figure 9: Finetuning texture translation with the duelling triplet strategy refines quality of generated images by handling occlusion and avoiding bleeding.
Impact of Duelling Triplet Loss

In Figure 9, we present sample results depicting the particular benefit of training the texture translation network with the duelling triplet strategy. As highlighted by the results, using the proposed triplet loss for online hard negative mining in the fine-tuning stage refines the quality of the generated results. This arises from better handling of occlusion, bleeding and skin generation. These observations are corroborated by results in Table 1 (row 3 vs 4).

Impact of Conditional Segmentation Mask Prediction

Figure 8 presents results obtained by training the texture transfer module of CP-VTON (TOM) [25] with an additional prior of the try-on cloth conditioned segmentation mask. It can be observed that this improves handling of skin generation, bleeding and complexity of poses. Providing the expected segmentation mask of the try-on output image as prior equips the generation process to better handle these issues. These observations are corroborated through results in Table 1 (row 1 vs 2).

Impact of Adversarial Loss on Texture Transfer

Many recent works on conditional image generation [25, 18, 14] employ a discriminator network to help improve quality of generated results. However, we observe that fine-tuning with the duelling triplet strategy instead results in better handling of texture and blurring in generated images without the need for any additional trainable parameters. Sample results in Figure 10 corroborate the claim.

Figure 10: Proposed Duelling Triplet Loss helps in better handling of texture and avoiding blurry effects in generated results than the GAN Loss.
(a) Failure in correctly occluding the back portion of the t-shirt.
(b) Failure in predicting the correct segmentation mask owing to errors in key-point prediction.
Figure 11: Failure Cases
Failure Cases

While SieveNet performs significantly better than existing methods, it has certain limitations. Figure 11 highlights some specific failure cases. In some cases, the generated result is unnatural due to the presence of artifacts (such as the gray neckline of the t-shirt in row 1) that appear in the output even though the correct fit and texture are achieved. This problem could be alleviated if localized, fine-grained key-points were available. Further, the texture quality of the try-on output may be affected by errors in the predicted conditional segmentation mask, which arise from errors in predicting pose key-points; for instance, in model images with low-contrast regions (row 2). Using dense pose information or a pose prediction network can help alleviate this problem.

6 Conclusion

In this work, we propose SieveNet, a fully learnable image-based virtual try-on framework. We introduce a coarse-to-fine cloth warping network trained with a novel perceptual geometric matching loss to better model fine-grained intricacies while transforming the try-on cloth image to align with shape of the target model. Next, we achieve accurate texture transfer using a try-on cloth conditioned segmentation mask prior and training the texture translation network with a novel duelling triplet loss strategy. We report qualitatively and quantitatively superior results over the state-of-the-art methods.


  • [1] K. E. Ak, A. A. Kassim, J. Hwee Lim, and J. Yew Tham (2018-06) Learning attribute representations with localization for flexible fashion search. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1.
  • [2] K. Ayush Context aware recommendations embedded in augmented viewpoint to retarget consumers in v-commerce. Cited by: §1.
  • [3] S. Belongie, J. Malik, and J. Puzicha (2001) Shape context: a new descriptor for shape matching and object recognition. In Advances in neural information processing systems, pp. 831–837. Cited by: §4.4.
  • [4] A. Chopra, A. Sinha, H. Gupta, M. Sarkar, K. Ayush, and B. Krishnamurthy (2019) Powering robust fashion retrieval with information rich feature embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0. Cited by: §1.
  • [5] G. Cucurull, P. Taslakian, and D. Vazquez (2019-06) Context-aware visual compatibility prediction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [6] P. Esser, E. Sutter, and B. Ommer (2018) A variational u-net for conditional appearance and shape generation. Cited by: §2.
  • [7] X. Han, Z. Wu, Z. Wu, R. Yu, and L. S. Davis (2017) VITON: an image-based virtual try-on network. CoRR abs/1711.08447. External Links: Link, 1711.08447 Cited by: §1, §1, §1, §2, §3.4.2, §4.1, §4.4.
  • [8] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 6626–6637. External Links: Link Cited by: §4.3.
  • [9] G. Hiranandani, K. Ayush, C. Varsha, A. Sinha, P. Maneriker, and S. V. R. Maram (2017) [POSTER] enhanced personalized targeting using augmented reality. In 2017 IEEE International Symposium on Mixed and Augmented Reality (ISMAR-Adjunct), pp. 69–74. Cited by: §1.
  • [10] M. Jaderberg, K. Simonyan, A. Zisserman, and k. kavukcuoglu (2015) Spatial transformer networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 2017–2025. External Links: Link Cited by: §2, §3.2.
  • [11] N. Jetchev and U. Bergmann (2017-10) The conditional analogy gan: swapping fashion articles on people images. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Cited by: §2.
  • [12] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
  • [13] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang (2016-06) DeepFashion: powering robust clothes recognition and retrieval with rich annotations. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [14] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool (2017) Pose guided person image generation. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 406–416. External Links: Link Cited by: §2, §5.3.
  • [15] K. Mahajan, T. Khurana, A. Chopra, I. Gupta, C. Arora, and A. Rai (2018-10) Pose aware fine-grained visual classification using pose experts. pp. 2381–2385. External Links: Document Cited by: §1.
  • [16] G. Pons-Moll, S. Pujades, S. Hu, and M. Black (2017) ClothCap: seamless 4d clothing capture and retargeting. ACM Transactions on Graphics, (Proc. SIGGRAPH) 36 (4). Note: Two first authors contributed equally External Links: Link Cited by: §2.
  • [17] A. Pumarola, A. Agudo, A. Sanfeliu, and F. Moreno-Noguer (2018) Unsupervised person image synthesis in arbitrary poses. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8620–8628. Cited by: §2.
  • [18] A. Raj, P. Sangkloy, H. Chang, J. Hays, D. Ceylan, and J. Lu (2018) SwapNet: image based garment transfer. In ECCV, Cited by: §5.3.
  • [19] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §2, §3.4.1, §3.4.2.
  • [20] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, and X. Chen (2016) Improved techniques for training gans. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 2234–2242. External Links: Link Cited by: §4.3.
  • [21] M. Sekine, K. Sugita, F. Perbet, B. Stenger, and M. Nishiyama (2014) Virtual fitting by single-shot body shape estimation. In Int. Conf. on 3D Body Scanning Technologies, pp. 406–413. Cited by: §2.
  • [22] A. Siarohin, E. Sangineto, S. Lathuilière, and N. Sebe (2018) Deformable gans for pose-based human image generation. CoRR abs/1801.00055. External Links: Link, 1801.00055 Cited by: §2.
  • [23] K. Tanmay and K. Ayush (2019) Augmented reality based recommendations based on perceptual shape style compatibility with objects in the viewpoint and color compatibility with the background. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §1.
  • [24] M. I. Vasileva, B. A. Plummer, K. Dusad, S. Rajpal, R. Kumar, and D. Forsyth (2018) Learning type-aware embeddings for fashion compatibility. In ECCV, Cited by: §1.
  • [25] B. Wang, H. Zhang, X. Liang, Y. Chen, L. Lin, and M. Yang (2018) Toward characteristic-preserving image-based virtual try-on network. CoRR abs/1807.07688. External Links: Link, 1807.07688 Cited by: §1, §1, §1, §2, §2, §3.1, §3.2, §3.4.2, §4.4, §5.3, §5.3, §5.3, §5.
  • [26] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. Cited by: §4.3.
  • [27] Z. Wang, E. P. Simoncelli, and A. C. Bovik (2003) Multi-scale structural similarity for image quality assessment. In Proc. IEEE Asilomar Conference on Signals, Systems, and Computers, pp. 1398–1402. Cited by: §4.3.