Down to the Last Detail: Virtual Try-on with Detail Carving

12/13/2019 ∙ by Jiahang Wang, et al.

Virtual try-on under arbitrary poses has attracted considerable research attention because of its many potential applications. However, existing methods can hardly preserve the details in clothing texture and facial identity (face, hair) while fitting novel clothes and poses onto a person. In this paper, we propose a novel multi-stage framework to synthesize person images in which rich details in salient regions are well preserved. Specifically, the framework decomposes the generation into spatial alignment followed by coarse-to-fine generation. To better preserve the details in salient areas such as clothing and facial areas, we propose a Tree-Block (tree dilated fusion block) to harness multi-scale features in the generator networks. With end-to-end training of multiple stages, the whole framework can be jointly optimized for results with significantly better visual fidelity and richer details. Extensive experiments on standard datasets demonstrate that our proposed framework achieves state-of-the-art performance, especially in preserving the visual details in clothing texture and facial identity. Our implementation will be publicly available soon.




1 Introduction

People tend to buy fashion items online because it is less time-consuming. However, without physical try-on, the look of a garment on a real consumer is often inconsistent with its look on the model. A virtual try-on system can bridge the gap between online and offline shopping and provide a more realistic shopping experience. In this work, we propose a multi-stage solution to synthesize person images given poses and clothes, as in Fig. 1.

Existing methods aim to generate person images based on either a new cloth or a novel pose. Many approaches [15, 9, 16] attempt to synthesize reasonable images given the target pose, while VITON [3] and CP-VTON [17] mainly focus on generating images given a new cloth image. However, these works cannot simply be transferred to the task of synthesizing person images under arbitrary poses and clothes. In reality, people would like to know how they look wearing new clothes in different poses without taking too many photos, a requirement that existing methods can hardly meet. In this work, we address the problem of generating person images conditioned on new clothes and poses.

Figure 1: Virtual try-on results of fitting the target cloth and pose (top row) onto a reference person image (left column) by our method. Zoom in for more details in the electronic version.

Because of the large variations in deformation and articulation involved in synthesizing a new photo-realistic person image from a target pose and an in-shop clothing image, the key challenges are twofold. First, detailed texture information, such as logos and characters on the cloth, can hardly be preserved by a standard convolutional network: the human body is non-rigid, and a standard convolution cannot handle large deformations because of the strong restriction of its kernel size and receptive field. Second, salient regions (such as facial details and hair), which attract most of the user's attention, contain too many artifacts in one-stage generation without refinement. Some traditional methods build a 3D human model and render output images [8, 11], which can solve these problems to a certain extent. However, the huge labor costs and expensive devices limit their practical application.

To address the aforementioned problems, we propose a novel multi-stage framework that generates person images based on the target pose and in-shop clothing image in a coarse-to-fine manner. Notably, rich details in salient regions are also well preserved. Different from existing methods, which directly use the landmarks of the human pose, we first design a parsing transformation network to predict the target semantic map, which spatially aligns the corresponding body parts and provides more structural information about the shapes of the torso and limbs. To generate more reasonable results with detailed information, we fuse the spatially aligned cloth with the coarse rendered image through our appearance generation network, which is built on a novel tree dilated fusion block (Tree-Block).

Our virtual try-on network not only overlays new clothes onto the corresponding region of the person under arbitrary poses without resorting to 3D information, but also preserves and enhances rich details in salient regions, i.e., cloth textures and facial identity. Moreover, we address the aforementioned challenges in a coarse-to-fine manner using spatial alignment, multi-scale contextual feature aggregation and salient region enhancement. The main contributions of this paper are summarized as follows:

  • We propose a multi-stage network that generates new person images conditioned on arbitrary poses and in-shop clothing images through spatial alignment and coarse-to-fine generation.

  • We propose a novel Tree-Block that aggregates multi-scale features to capture more spatial information for dealing with large misalignment and visual variations. High-quality details of clothing texture and face can be well preserved to improve the user experience.

  • We design a careful training strategy to jointly optimize all modules in an end-to-end manner, which further improves generation results.

2 Related Work

2.1 Image Synthesis

Advanced methods such as Generative Adversarial Networks (GANs) [2] and Variational Autoencoders (VAEs) [7] have demonstrated convincing results in a variety of image generation tasks. To incorporate given information as a prior that guides the image generation process, conditional GANs [4] synthesize images based on the given information. GAN-like models consist of a generator and a discriminator ‘contesting’ with each other during optimization. The generator tries to make the generated images indistinguishable from real images, while the discriminator aims to distinguish real images from fake ones. Besides, some image-to-image translation methods, such as pix2pix [5] and CycleGAN [18], can successfully translate photo-realistic images from one domain to another. Inspired by this, we apply adversarial loss in our implementation.

2.2 Pose-guided Person Image Generation

Based on the training strategy, the methods for this task can be mainly separated into two groups. The first is supervised person image generation conditioned on the target pose using paired data. PG2 [9] proposes to first synthesize a coarse result and then refine the generated image with adversarial training. Deformable GAN [15] introduces deformable skip connections in the generator to deal with pixel-to-pixel misalignment caused by pose differences. Progressive training with an attention mechanism has been proposed so that each transfer focuses on certain regions while the person image is generated progressively [19]. The second group is unsupervised person image synthesis. A bidirectional generator [12] was utilized so that the whole pipeline can be trained in an unsupervised manner. A semantic parsing transformation network was employed in a unified unsupervised person image generation framework [16] to guide the generator at a precise region level.

2.3 Virtual Try-on

The most popular image-based virtual try-on models are VITON [3] and CP-VTON [17]. Both use geometric warping to deal with the misalignment between the in-shop cloth and the reference person. However, they can only present results from a single viewpoint and cannot be flexibly applied to arbitrary poses.

3 Proposed Method

Given a target pose $p_t$, an in-shop cloth image $c_t$ and a reference image $I_r$ under pose $p_r$, our goal is to generate an output image $\hat{I}_t$ that shows the person in $I_r$ wearing the clothing of $c_t$ under the target pose $p_t$. The generation can be formulated as $\hat{I}_t = G(I_r, c_t, p_t)$. The whole framework is shown in Fig. 2.

Figure 2: The framework for our whole pipeline. Our framework is mainly composed of four parts: parsing transformation, clothing spatial alignment, detailed appearance generation, and face refinement. Details of Tree-Block in blue are illustrated in Fig. 3

3.1 Tree Dilated Fusion Block (Tree-Block)

We propose a novel tree dilated fusion block, abbreviated as Tree-Block, which aggregates multi-scale features and captures more spatial information with dilated convolutions. Since larger receptive fields are considered during generation, Tree-Block is more capable of dealing with large spatial deformation and variation. As illustrated in Fig. 3, we define the input of Tree-Block as $x$, and the input passes through several dilated convolutions in parallel. The multi-scale features of each dilated convolutional block can be calculated as:

$$f_i = W_i \otimes_{d_i} x + b_i$$

where $i \in \{1, \dots, n\}$, $\otimes$ denotes the convolutional operation, $d_i$ is the dilation rate, $f_i$ is the output feature of the dilated convolution with rate $d_i$, $n$ is the total number of parallel dilated convolutions in the Tree-Block, and $W_i$ and $b_i$ are the parameters of the convolutional operation. The multi-scale features $f_1, \dots, f_n$ are fused through tree-structured aggregation: two feature maps $f_i$ and $f_j$ are merged by a fusion operation that applies a convolution kernel $K$. Similar to standard ResNet blocks, we also add a skip connection from the input $x$ to generate the final output.

Figure 3: Comparison of Standard ResNet Block and our Tree-Block. The green block denotes the input and output features, and the yellow block denotes convolutional blocks.
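The structure described in this section (parallel dilated convolutions, tree-structured fusion, skip connection) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it works on 1-D signals instead of 2-D feature maps, shares one hand-set kernel across branches, assumes a pairwise (binary) fusion tree, and replaces the learned fusion kernel with a fixed average. All function names are hypothetical.

```python
from typing import List, Sequence

Signal = List[float]


def dilated_conv1d(x: Sequence[float], kernel: Sequence[float],
                   dilation: int, bias: float = 0.0) -> Signal:
    """Zero-padded, same-length 1-D dilated convolution (cross-correlation)."""
    centre = len(kernel) // 2
    out = []
    for t in range(len(x)):
        acc = bias
        for j, w in enumerate(kernel):
            idx = t + (j - centre) * dilation  # taps are spaced `dilation` apart
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out


def fuse(a: Signal, b: Signal) -> Signal:
    """Assumed stand-in for the learned fusion kernel: a fixed average."""
    return [0.5 * (u + v) for u, v in zip(a, b)]


def tree_aggregate(features: List[Signal]) -> Signal:
    """Fuse features pairwise, level by level, until one feature remains."""
    level = list(features)
    while len(level) > 1:
        nxt = [fuse(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:  # an unpaired feature moves up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]


def tree_block(x: Signal, kernel: Sequence[float],
               dilations: Sequence[int]) -> Signal:
    """Parallel dilated branches, tree fusion, then a ResNet-style skip."""
    branches = [dilated_conv1d(x, kernel, d) for d in dilations]
    fused = tree_aggregate(branches)
    return [xi + fi for xi, fi in zip(x, fused)]


def receptive_field(kernel_size: int, dilation: int) -> int:
    """Span covered by one dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


impulse = [0, 0, 0, 0, 1, 0, 0, 0, 0]
response = tree_block(impulse, kernel=[1, 1, 1], dilations=[1, 2])
```

With a kernel of size 3, dilation rates 1, 2 and 4 cover spans of 3, 5 and 9 positions, which is why the parallel branches see progressively larger context at no extra parameter cost.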

3.2 Coarse-to-fine Person Image Synthesis

3.2.1 Coarse Alignment

We refer to the parsing transformation network and the clothing spatial alignment module together as coarse alignment.

To guide the generator to synthesize at a more precise region level, a parsing transformation network is introduced to predict the semantic map $\hat{S}_t$ under the target pose according to the reference semantic map $S_r$, the target pose $p_t$ and the target in-shop cloth mask $M_c$. The architecture of our parsing transformation network is based on U-net [14]. Taking the concatenation of $S_r$, $p_t$ and $M_c$ as input, the proposed network learns the mapping to $\hat{S}_t$. Similar to the previous method [16], cross entropy loss and adversarial loss are adopted as our optimization objectives.
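As an illustration of the cross entropy term used for semantic map prediction, the following minimal sketch computes the mean per-pixel cross entropy between predicted class probabilities and ground-truth labels. The class names, the flat pixel layout and the function name are assumptions made for this example only.

```python
import math
from typing import Sequence


def parsing_cross_entropy(probs: Sequence[Sequence[float]],
                          labels: Sequence[int],
                          eps: float = 1e-12) -> float:
    """Mean negative log-probability of the true class over all pixels."""
    total = 0.0
    for pixel_probs, label in zip(probs, labels):
        # Clamp to eps so a zero probability does not produce log(0).
        total += -math.log(max(pixel_probs[label], eps))
    return total / len(labels)


# Two pixels, three hypothetical classes (background, upper clothes, face).
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]
loss = parsing_cross_entropy(probs, labels)
```

The loss is zero only when every pixel assigns probability 1 to its true class, so it pushes the predicted semantic map toward the target parsing region by region.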

In addition, due to the non-rigid nature of person image generation and the large pixel-to-pixel misalignment, it is hard for a convolutional network to preserve the details of the person image, especially the texture of the cloth. Therefore, following the cloth warping method of [17], we warp the target in-shop cloth beforehand and spatially align it to the target pose with the clothing spatial alignment module in Fig. 2.

3.2.2 Detailed Appearance Generation

We replace the standard ResNet blocks in pix2pix [5] with our proposed Tree-Block as the generator to learn how to color the transformed semantic map. As shown in part III of Fig. 2, the reference image with its cloth information masked out $I_r'$, the transformed semantic map $\hat{S}_t$, and the spatially aligned cloth $\hat{c}_t$ are concatenated as input, and the network outputs the coarse rendered image $I_{coarse}$ and the composition mask $M$. The composition mask is regarded as the fusion weight of $\hat{c}_t$ and $I_{coarse}$, and the final result $\hat{I}_t$ can be written as:

$$\hat{I}_t = M \odot \hat{c}_t + (1 - M) \odot I_{coarse}$$

where $\odot$ denotes element-wise multiplication.
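The mask-based composition can be sketched in a few lines. This is a minimal single-channel illustration on nested lists, assuming that the composition mask weights the spatially aligned cloth and its complement weights the coarse rendered image; the function and argument names are hypothetical.

```python
from typing import List

Image = List[List[float]]


def compose(mask: Image, warped_cloth: Image, coarse: Image) -> Image:
    """Blend per pixel: mask * warped_cloth + (1 - mask) * coarse."""
    return [
        [m * c + (1.0 - m) * r for m, c, r in zip(mask_row, cloth_row, coarse_row)]
        for mask_row, cloth_row, coarse_row in zip(mask, warped_cloth, coarse)
    ]


mask = [[1.0, 0.0], [0.5, 0.25]]
warped_cloth = [[0.8, 0.8], [0.8, 0.8]]
coarse = [[0.2, 0.2], [0.2, 0.4]]
result = compose(mask, warped_cloth, coarse)
```

Where the mask is 1 the output copies the warped cloth pixel unchanged, which is how texture such as logos can bypass the generator; where it is 0 the output falls back to the coarse rendering.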

Attention Loss. In order to preserve more details of the cloth, the composition should select the spatially aligned cloth as much as possible. The attention loss (Eq. 5) is defined as the difference between the composition mask $M$ and the target cloth mask $M_t$, and we also use total variation regularization to penalize the gradients of the composition mask $M$.
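A minimal sketch of such an attention loss is given below. The choice of an L1 difference, the anisotropic total variation, the per-pixel normalization and the `tv_weight` parameter are all assumptions for illustration; the exact norms and weights used by the authors are not recoverable from the text.

```python
from typing import List

Mask = List[List[float]]


def l1_difference(mask: Mask, target: Mask) -> float:
    """Mean absolute difference between the composition mask and the cloth mask."""
    h, w = len(mask), len(mask[0])
    total = sum(abs(mask[i][j] - target[i][j]) for i in range(h) for j in range(w))
    return total / (h * w)


def total_variation(mask: Mask) -> float:
    """Sum of absolute differences between neighbouring pixels, per pixel."""
    h, w = len(mask), len(mask[0])
    horiz = sum(abs(mask[i][j + 1] - mask[i][j]) for i in range(h) for j in range(w - 1))
    vert = sum(abs(mask[i + 1][j] - mask[i][j]) for i in range(h - 1) for j in range(w))
    return (horiz + vert) / (h * w)


def attention_loss(mask: Mask, target: Mask, tv_weight: float = 1.0) -> float:
    """Mask-difference term plus a weighted smoothness penalty on the mask."""
    return l1_difference(mask, target) + tv_weight * total_variation(mask)


mask = [[1.0, 1.0], [0.0, 0.0]]
cloth_mask = [[1.0, 1.0], [1.0, 1.0]]
loss = attention_loss(mask, cloth_mask)
```

The first term pulls the mask toward the cloth region, while the total variation term discourages a noisy, speckled mask that would produce visible seams in the composite.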


Pixel-level Loss. Smooth $L_1$ loss [13] is exploited in our experiments instead of $L_1$ loss or MSE loss. Smooth $L_1$ loss is more robust and less sensitive to outliers than the $L_2$ loss, and converges faster than the $L_1$ loss.
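For reference, the commonly used definition of the smooth L1 function is quadratic for residuals smaller than 1 in magnitude and linear beyond that. The sketch below applies it to flat lists of pixel values; averaging over pixels and the function names are assumptions of this example.

```python
from typing import Sequence


def smooth_l1(x: float) -> float:
    """0.5 * x^2 if |x| < 1, otherwise |x| - 0.5."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def pixel_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean smooth L1 loss over corresponding pixel values."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target)) / len(pred)


small = smooth_l1(0.5)     # quadratic regime: gentle gradient near zero
outlier = smooth_l1(-3.0)  # linear regime: grows like |x|, not x^2
loss = pixel_loss([0.0, 1.0], [0.5, 4.0])
```

A residual of 3 costs 2.5 here versus 9 under a squared error, which is the sense in which the loss is less sensitive to outliers, while the quadratic region keeps the gradient small near zero.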


Perceptual Loss. Similar to previous style transfer methods, we also apply perceptual loss [6], including feature reconstruction loss and style reconstruction loss, denoted as $L_{feat}$ and $L_{style}$ respectively, to maintain high-level image content and overall spatial structure. In our training process, computing the loss on one specific layer of VGG19 leads to the best result.
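The two perceptual terms can be sketched as follows, following the general definitions of [6]: the feature reconstruction loss compares feature maps directly, and the style reconstruction loss compares their Gram matrices. Plain nested lists stand in for VGG19 activations here, and the normalization constants and function names are assumptions of this example.

```python
from typing import List

# One feature map: a list of channels, each a flat list of spatial activations.
Features = List[List[float]]


def feature_loss(fa: Features, fb: Features) -> float:
    """Mean squared difference between two feature maps."""
    n = sum(len(channel) for channel in fa)
    return sum((a - b) ** 2 for ca, cb in zip(fa, fb) for a, b in zip(ca, cb)) / n


def gram(f: Features) -> List[List[float]]:
    """Channel-by-channel inner products, normalized by the number of elements."""
    c = len(f)
    n = c * len(f[0])
    return [[sum(a * b for a, b in zip(f[i], f[j])) / n for j in range(c)]
            for i in range(c)]


def style_loss(fa: Features, fb: Features) -> float:
    """Squared Frobenius norm of the difference between Gram matrices."""
    ga, gb = gram(fa), gram(fb)
    return sum((x - y) ** 2 for ra, rb in zip(ga, gb) for x, y in zip(ra, rb))


feats = [[1.0, 2.0], [3.0, 4.0]]
shuffled = [[2.0, 1.0], [4.0, 3.0]]  # same activations, spatial positions swapped
content_gap = feature_loss(feats, shuffled)
style_gap = style_loss(feats, shuffled)
```

Swapping spatial positions changes the feature reconstruction loss but leaves the Gram matrices identical, which shows why the two terms are complementary: one constrains where content appears, the other constrains texture statistics.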

Adversarial Loss. A discriminator $D$ is introduced to distinguish between real image pairs and fake image pairs, which leads to the adversarial loss $L_{adv}$. We calculate the adversarial loss following the least squares formulation of [10].
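A least squares adversarial loss in the style of [10] can be sketched as below. The target labels (1 for real, 0 for fake) are a common choice and an assumption here, as are the function names; discriminator outputs are represented as plain lists of scores.

```python
from typing import Sequence


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def lsgan_discriminator_loss(d_real: Sequence[float],
                             d_fake: Sequence[float]) -> float:
    """Push scores on real pairs toward 1 and scores on fake pairs toward 0."""
    real_term = mean([(d - 1.0) ** 2 for d in d_real])
    fake_term = mean([d ** 2 for d in d_fake])
    return 0.5 * (real_term + fake_term)


def lsgan_generator_loss(d_fake: Sequence[float]) -> float:
    """Push discriminator scores on generated pairs toward 1."""
    return 0.5 * mean([(d - 1.0) ** 2 for d in d_fake])


perfect_d = lsgan_discriminator_loss([1.0, 1.0], [0.0, 0.0])
undecided_d = lsgan_discriminator_loss([0.5], [0.5])
fooled_g = lsgan_generator_loss([1.0])
```

Unlike the sigmoid cross entropy loss, the squared penalty keeps growing for samples far from the decision boundary, which is the property the least squares formulation relies on for more stable gradients.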


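Overall Objectives. The appearance generation network is optimized with the combination of the attention loss, pixel-level loss, perceptual loss and adversarial loss described above (Eq. 8).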


3.2.3 Face Refinement

We add a specialized GAN-based refinement network to add more detail and realism to the face region, as shown in part IV of Fig. 2, and we also adopt Tree-Block to replace the standard ResNet block. We denote the facial regions of the reference image and the target image as $F_r$ and $F_t$; given these, the refinement network generates the refined face $\hat{F}_t$ directly.

The objectives for the refined face, denoted as $L_{face}$, are similar to those of the appearance generation network, including pixel-level loss, feature reconstruction loss and adversarial loss. To obtain a smoother and more reasonable refined person image, we add another pixel-level loss between the refined image and the ground truth of the target person image (Eq. 9).

Thus, the overall objective of the face refinement network (Eq. 10) combines the face objectives with this additional pixel-level loss.


3.3 End-to-End Jointly Training

In our multi-stage framework, the result of the parsing transformation network directly and heavily influences the final output, since the shape and contour of the final generated image are guided by the semantic map, especially for the regions of face and hair. Training the networks only separately may therefore lead to unpleasing synthesized images. Hence, after training each part separately, we jointly train all the modules with the overall objective in Eq. 8 to reduce the influence of an inaccurate semantic map and to fine-tune all the networks together. An overview of the general training procedure is given in Algorithm 1.

1:  Train the parsing transformation network and the clothing spatial alignment module as in Sec. 3.2.1.
2:  With the cloth-masked reference image, the transformed semantic map and the spatially aligned cloth as input, fix the parsing transformation network and the clothing spatial alignment module, and train the appearance generation network of Sec. 3.2.2 to optimize Eq. 8.
3:  With the facial regions of the reference image and the target image as input, fix the parsing transformation network, the clothing spatial alignment module and the appearance generation network, and train the face refinement network of Sec. 3.2.3 to optimize Eq. 10.
4:  Jointly train the whole framework as in Eq. 8.
5:  Output the refined target person image.
Algorithm 1 Training procedure of the virtual try-on network
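The staged schedule of Algorithm 1 can be written down as data, which makes explicit which modules receive gradient updates in each stage. This is only an organizational sketch: the module identifiers, the `Stage` class and the `not_updated` helper are hypothetical, and no actual training is performed.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical identifiers for the four parts of the pipeline.
MODULES: Tuple[str, ...] = ("parsing", "alignment", "appearance", "face")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: Tuple[str, ...]  # modules whose weights are updated
    objective: str              # which objective the stage optimizes


SCHEDULE: List[Stage] = [
    Stage("coarse alignment", ("parsing", "alignment"), "losses of Sec. 3.2.1"),
    Stage("appearance generation", ("appearance",), "Eq. 8"),
    Stage("face refinement", ("face",), "Eq. 10"),
    Stage("joint fine-tuning", MODULES, "Eq. 8"),
]


def not_updated(stage: Stage) -> Tuple[str, ...]:
    """Modules that are held fixed (or not yet used) during a stage."""
    return tuple(m for m in MODULES if m not in stage.trainable)


for stage in SCHEDULE:
    print(f"{stage.name}: train {stage.trainable}, hold {not_updated(stage)}")
```

Only the last stage updates every module at once, so errors in the predicted semantic map can be corrected by gradients coming from the appearance objective.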

4 Experiments

4.1 Datasets and Settings

The dataset used for experiments is one part of the MPV dataset [1], consisting of 14,754 pairs of top clothing images and positive perspective images of female models. The resolution of images in the dataset is . We split the whole dataset with 12,410 pairs for training and 2,344 pairs for testing. By default, the learning rates for the discriminator and the generator are 0.0002. We adopt ADAM optimizer to train our network with , .

Figure 4: Results of our method and ablation study. Our full pipeline synthesizes more realistic images with convincing details, such as the texture / character of cloth, facial and hair regions, as highlighted in red boxes.

4.2 Qualitative Results

The following baseline methods are included for comparison:

CP-VTON. As CP-VTON [17] was originally designed for the simpler try-on task of overlaying a new cloth on the reference person, we adapt it by adding the target pose as input and then retrain it.

SRB-VTON. We define the try-on network that uses only Standard ResNet Blocks as SRB-VTON. In this baseline, we replace the Tree-Block in our whole pipeline with the standard ResNet block as in pix2pix [5], and the other modules are kept the same as in our proposed method.

We visually compare the established baselines with the proposed method in Fig. 5, which shows that our model generates more reasonable results with convincing details. Both the baselines and our model can accomplish virtual try-on under arbitrary poses, but CP-VTON cannot preserve texture and color well. Although SRB-VTON surpasses CP-VTON visually, it cannot align the in-shop cloth to the target pose naturally and also generates many artifacts, which are highlighted with red boxes. Compared with SRB-VTON, our full network preserves the texture well, enhances the facial details, and better aligns the in-shop cloth to new poses owing to the introduction of Tree-Block and clothing spatial alignment.

4.3 Quantitative Results

We apply the Structural SIMilarity (SSIM) and Inception Score (IS) for quantitative evaluation, summarized in Table 1. To reduce the influence of the background, we also calculate mask-SSIM and mask-IS by masking out the background. The results show that our proposed method achieves the best scores among the compared methods by generating better visual details. Moreover, our results are closer to the ground-truth real data under all evaluation metrics.

Method      SSIM    mask-SSIM   IS              mask-IS
CP-VTON     0.563   0.548       3.012 ± 0.127   2.937 ± 0.130
SRB-VTON    0.701   0.745       3.145 ± 0.115   3.156 ± 0.147
Ours        0.723   0.784       3.265 ± 0.116   3.322 ± 0.132
Real data   1       1           3.333 ± 0.166   3.465 ± 0.147
Table 1: Evaluation on the test split of MPV dataset.
Figure 5: Comparison of results with other state-of-the-art methods. Our framework with Tree-Block is good at generating rich details, as shown in red boxes.
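To make the idea behind the masked metric in Table 1 concrete, the sketch below computes a simplified SSIM and its masked variant. It is an illustration under stated assumptions, not the evaluation code behind the reported numbers: SSIM is computed over a single global window rather than averaged over local windows, images are flat lists of intensities in [0, 1], and the background is masked by setting it to a constant.

```python
from typing import Sequence


def ssim_global(x: Sequence[float], y: Sequence[float],
                data_range: float = 1.0) -> float:
    """Single-window SSIM with the usual constants K1 = 0.01 and K2 = 0.03."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx * mx + my * my + c1) * (vx + vy + c2))


def apply_mask(img: Sequence[float], mask: Sequence[int], fill: float = 0.0):
    """Keep foreground pixels (mask == 1) and overwrite the background."""
    return [p if m else fill for p, m in zip(img, mask)]


def mask_ssim(x: Sequence[float], y: Sequence[float], mask: Sequence[int]) -> float:
    """SSIM after the background has been masked out in both images."""
    return ssim_global(apply_mask(x, mask), apply_mask(y, mask))


generated = [0.2, 0.4, 0.6, 0.8]
reference = [0.2, 0.4, 0.1, 0.9]
person_mask = [1, 1, 0, 0]  # the last two pixels are background
full_score = ssim_global(generated, reference)
masked_score = mask_ssim(generated, reference, person_mask)
```

In this toy case the two images differ only in the background, so the masked score reaches 1 while the unmasked score does not, which is the effect the masked metrics are meant to isolate.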

4.4 Discussion and Ablation study

We perform an ablation study on held-out test data to verify the role and effectiveness of each part of our model. The ablation study includes four parts: removing the Tree-Block (which is the same as our SRB-VTON baseline), removing the attention loss (Eq. 5), removing the face refinement network, and removing the end-to-end joint training strategy. From the ablation results illustrated in Fig. 4, we draw the following observations: 1) If we replace the Tree-Block with the standard ResNet block, it is difficult to naturally align pixels to the target image (i.e., the lower body and the contour between the spatially aligned cloth and the human body) and to preserve the identity (i.e., the face) of the person. 2) The attention loss tends to select more from the spatially aligned cloth and helps reduce artifacts in the cloth region. 3) The face refinement module enhances the facial details, especially the hair and eyes. 4) The end-to-end training strategy improves the synthesized results, since optimizing the whole framework together leads to a more reasonable predicted semantic map and more convincing details, as shown in Fig. 6.

Overall, our full proposed model performs better both qualitatively and quantitatively, and the attention loss, the face refinement module and the end-to-end training strategy all contribute to the performance of the whole framework.

Method                 SSIM    mask-SSIM   IS              mask-IS
w/o Tree-Block         0.701   0.745       3.145 ± 0.115   3.156 ± 0.147
w/o attention loss     0.712   0.764       3.212 ± 0.117   3.239 ± 0.131
w/o face refinement    0.717   0.778       3.301 ± 0.123   3.296 ± 0.105
w/o end2end training   0.721   0.782       3.245 ± 0.092   3.279 ± 0.114
Ours                   0.723   0.784       3.265 ± 0.116   3.322 ± 0.132
Table 2: Ablation study of our framework.
Figure 6: Results with different training schemes. Compared with training without joint optimization, end-to-end training yields a more precise semantic map, as shown in yellow boxes, and improves the synthesized details.

5 Conclusion

In this paper, we propose a multi-stage framework that recasts person image generation as semantic-map-based image synthesis. Owing to the proposed Tree-Block and the attention loss, we can preserve more details and better fuse the spatially aligned cloth with the coarse rendered person image. In addition, the face refinement network with Tree-Block enhances the salient facial regions and greatly improves visual results. Our end-to-end training scheme also improves the quality of generated images.


  • [1] H. Dong, X. Liang, X. Shen, B. Wang, H. Lai, J. Zhu, Z. Hu, and J. Yin (2019) Towards multi-pose guided virtual try-on network. In ICCV, pp. 9026–9035.
  • [2] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, pp. 2672–2680.
  • [3] X. Han, Z. Wu, Z. Wu, R. Yu, and L. S. Davis (2018) VITON: an image-based virtual try-on network. In CVPR.
  • [4] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR, pp. 1125–1134.
  • [5] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR.
  • [6] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In ECCV, pp. 694–711.
  • [7] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • [8] Z. Lahner, D. Cremers, and T. Tung (2018) Deepwrinkles: accurate and realistic clothing modeling. In ECCV, pp. 667–684.
  • [9] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool (2017) Pose guided person image generation. In NIPS, pp. 406–416.
  • [10] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley (2017) Least squares generative adversarial networks. In ICCV, pp. 2794–2802.
  • [11] G. Pons-Moll, S. Pujades, S. Hu, and M. J. Black (2017) ClothCap: seamless 4d clothing capture and retargeting. ACM Transactions on Graphics (TOG) 36 (4), pp. 73.
  • [12] A. Pumarola, A. Agudo, A. Sanfeliu, and F. Moreno-Noguer (2018) Unsupervised person image synthesis in arbitrary poses. In ICCV, pp. 8620–8628.
  • [13] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NIPS, pp. 91–99.
  • [14] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241.
  • [15] A. Siarohin, E. Sangineto, S. Lathuilière, and N. Sebe (2018) Deformable gans for pose-based human image generation. In CVPR, pp. 3408–3416.
  • [16] S. Song, W. Zhang, J. Liu, and T. Mei (2019) Unsupervised person image generation with semantic parsing transformation. In CVPR, pp. 2357–2366.
  • [17] B. Wang, H. Zheng, X. Liang, Y. Chen, and L. Lin (2018) Toward characteristic-preserving image-based virtual try-on network. In ECCV, pp. 589–604.
  • [18] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.
  • [19] Z. Zhu, T. Huang, B. Shi, M. Yu, B. Wang, and X. Bai (2019) Progressive pose attention transfer for person image generation. In CVPR, pp. 2347–2356.