With recent socio-cultural events accelerating the shift towards online commerce, there is an increasing interest in providing smart and intuitive experiences [sievenet, tagging_icip, ayushcontext, cvpr_attr, chopra2019powering, Lang_2020_CVPR]
that can compensate for the lack of in-store interaction. Virtual try-on is concerned with the visualization of clothes in a personalized setting and is of great importance to a plethora of real-world applications. While attractive even before the renaissance of deep learning [tanaka2009texture, hauswiesner2013virtual, ehara2006texture], recent advances in generative networks have inspired researchers to pursue image-based virtual try-on [acgpn, sievenet, cpvton, han2019clothflow, vtnfp], based solely on RGB images, by formulating the problem as one of conditional image synthesis.
Given as input the images of an isolated in-shop garment and a target model, the objective of image-based virtual try-on is to synthesise a perceptually convincing new image (referred to as the try-on output) in which the target model is wearing the in-shop garment (Figure 1). Recent methods employ a two-step process consisting of: a) warping of the in-shop garment to align with the pose and body shape of the target model, and b) texture fusion of the warped garment and target model images to generate the try-on output. A successful try-on experience depends upon synthesizing a sharp, realistic image that preserves the textural and geometric integrity of both the garment and the model. Issues arise from improper warping or incorrect texture fusion due to the non-rigid nature of garments and the lack of understanding of the 3D geometry of the garment and the model, resulting in unconvincing rendering of granular clothing details. Alleviating these concerns is the focus of this work.
Recent research [viton, cpvton, sievenet, acgpn] has been directed towards these challenges. [viton, cpvton] proposed thin-plate spline (TPS) based warping of the garment image. [sievenet, acgpn] improve the stability of TPS warping via multi-stage cascaded parameter estimation, and second order difference constraints respectively. However, TPS based warping leads to inaccurate transformation estimation when large geometric deformation is required, since each parameter defines the spatial deformation for a coarse block of pixels. To address this issue, [han2019clothflow] proposes to use dense, per-pixel appearance flow [appearanceFlow]
prediction to spatially deform the garment image. However, owing to the high degree of freedom and the absence of proper regularization, this method often causes drastic deformation during warping, resulting in significant textural artefacts. To address both issues - the inability of TPS to handle large deformations, and over-warping with appearance flows - we introduce Gated Appearance Flow (GAF), which regularizes per-pixel appearance flow by aggregating candidate flow estimates predicted across multiple scales.
Next, to improve texture fusion, especially the issue of bleeding colors, [sievenet, acgpn] propose to use an a priori estimate of the target clothing segmentation for the try-on output as conditioning. However, this method results in ambiguities in depth perception and body-part ordering because of the absence of 3D geometric priors. This is prominently visible in the generation of necklines and in handling cases with occlusion; for example, a part of the garment that should go behind the neck appears in front. To encode 3D geometry information, we combine UV projection maps with dense body-part segmentation (obtained via DensePose [densepose]) as priors during warping and texture fusion.
Our contributions can be summarized as follows:
We propose ZFlow, an end-to-end try-on framework, that utilizes gated appearance flow estimates and dense geometric priors to render high quality try-on results.
We present extensive quantitative and qualitative comparisons as well as a detailed user study to show significant improvement over state-of-the-art methods.
We present ablation studies to analyse the impact of different design choices in ZFlow. We further reinforce the efficacy of GAF by adapting it to improve the state-of-the-art for human pose transfer.
2 Related Work
Virtual Try-On Progress in deep learning has motivated 2D image-based try-on as a scalable alternative to older methods ([sekine2014virtual, pons2017clothcap, tanaka2009texture, zhou2012image]) that used 3D scanners for virtual fitting of clothing items. Most of these new 2D image-based methods [viton, cpvton, acgpn, sievenet, han2019clothflow] pose the problem as that of synthesizing a realistic image of a model from a reference image and an isolated garment image. VITON [viton] uses a Thin-Plate Spline (TPS) based warping method to deform the garment images and maps the warped garment onto the model image using an encoder-decoder refinement module. CP-VTON [cpvton] improves over [viton]
using a neural network to regress the transformation parameters of TPS. SieveNet [sievenet] improves over [cpvton, viton] by estimating TPS parameters over multiple interconnected stages, and also proposes a conditional layout constraint to better handle pose variation, bleeding and occlusion during texture fusion. ACGPN [acgpn] utilizes a similar layout constraint and also imposes a second-order constraint on TPS warping to preserve local patterns. However, these methods can only model limited geometric changes and often unnaturally deform clothing due to the limited degrees of freedom of the TPS transformation. ClothFlow [han2019clothflow] uses a per-pixel appearance flow [appearanceFlow] (instead of TPS) predicted over multiple cascaded stages, and also utilizes the conditional layout constraint as in [sievenet, acgpn]. Appearance flow [appearanceFlow] spatially deforms a source scene to a target scene by computing a pixel-wise 2D deformation field; this is conceptually distinct from optical flow, and we refer the reader to [appflowdiff] for a discussion of the difference. The high degree of freedom in per-pixel flow estimation, as well as the limited (3D) structural information, often results in geometric misalignment and unnatural, bleeding textures. We propose ZFlow, an end-to-end framework which seeks to preserve geometric and textural integrity through a combination of gated aggregation of hierarchical flow estimates across pixel-block levels (Gated Appearance Flow) and dense structural priors (Dense Geometric Priors) at various stages of the network.
3D Pose Representation The optimal choice of 3D representation for neural networks is an open problem. Recent works in single-image 3D reconstruction have explored voxel, point cloud, octree, surface and volumetric representations [varol2018bodynet, loper2015smpl, yao2019densebody, alp2018densepose, zheng2019deephuman, kanazawa2018end, jackson20183d]. Surface-based representation methods [alp2018densepose, yao2019densebody] use UV maps [feng2018joint] to establish dense correspondence between pixels and the human body surface. To preserve the geometric integrity (depth-ordering, pose, skin and neckline reconstruction) of the try-on output in our image-based setup, we use dense geometric priors in the form of UV maps and body-part segmentation masks obtained from a pre-trained DensePose [densepose]. These priors help in handling complex poses even under heavy occlusion.
Human Pose Transfer Given a reference image of a person and a target pose, the task is to synthesize an image of the model in the desired pose. [ma2017pose]
uses a two-stage, guided image-to-image translation network to generate the target. Recent work [siarohin2018deformable, dong2018soft, balakrishnan2018synthesizing, han2019clothflow, li2019dense, grigorev2018coordinate] incorporates spatial deformation from the source to the target for better perceptual quality. ClothFlow [han2019clothflow] predicts a dense appearance flow over multiple interconnected stages using a stacked network to warp source clothing pixels. Dense Intrinsic Flow (DIF) [li2019dense] introduced a flow regression module to map input and target skeleton poses with a 3D appearance flow, which it then uses to perform feature warping on the input image and generate a photo-realistic target image. We validate the efficacy of Gated Appearance Flow by adapting it for the regression of 3D flows in [li2019dense]. We note subsequent work in pose transfer [hpt-sota] but highlight that [li2019dense] is ideal for our objective of validating the efficacy of GAF.
ZFlow takes as input images of a target model and an isolated garment product, and generates the try-on output, in which the target model is wearing the garment. This transformation is composed of two key phases: (A) Garment Warping, which deforms the garment to align with the pose of the model and generates the warped garment, and (B) Texture Fusion, which composes the warped garment with the model image to generate the try-on output over two steps: (B-1) conditional segmentation and (B-2) segmentation-assisted fusion (as in Figure 2).
3.1 Garment Warping
The garment image is warped based on the pose and shape of the target model to produce a warped garment image. For this, we propose Gated Appearance Flow, which estimates per-pixel warp parameters by aggregating candidate estimates predicted across multiple scales (pixel-block sizes).
3.1.1 Enriched Input
Because training triplets in which the same model wears two different garments are unavailable, contemporary methods use as input a clothing-agnostic prior of the target model along with the garment image. We extend the conventional binary (1-channel) body shape, (18-channel) pose map and (3-channel) head region used previously [acgpn, sievenet, han2019clothflow] with an additional dense (11-channel) body-part segmentation of the model to provide richer structural priors. This subtle enhancement, as we delineate in section 6, cascades through the network and results in significantly fewer artefacts in the output.
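Concretely, the enriched clothing-agnostic prior amounts to a channel-wise concatenation of the four inputs. A minimal numpy sketch (array names and contents are illustrative, not the authors' code):

```python
import numpy as np

# Illustrative sketch: the clothing-agnostic prior concatenates a 1-channel
# body shape, an 18-channel pose map, a 3-channel head region and the
# 11-channel dense body-part segmentation along the channel axis.
H, W = 256, 192  # image size used by the VITON dataset

body_shape = np.zeros((1, H, W), dtype=np.float32)   # binary silhouette
pose_map   = np.zeros((18, H, W), dtype=np.float32)  # one heatmap per keypoint
head       = np.zeros((3, H, W), dtype=np.float32)   # RGB head region
body_parts = np.zeros((11, H, W), dtype=np.float32)  # dense body-part masks

def build_prior(body_shape, pose_map, head, body_parts):
    """Stack all structural priors along the channel axis."""
    return np.concatenate([body_shape, pose_map, head, body_parts], axis=0)

prior = build_prior(body_shape, pose_map, head, body_parts)
```

The result is a 33-channel conditioning tensor fed to the warping network.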
3.1.2 Gated Appearance Flow
This module predicts per-pixel appearance flow (pixel displacements) for warping the garment image by aggregating candidate flow estimates across multiple scales. The process comprises first predicting the flow estimates and then aggregating them via a gating mechanism, with losses that ensure smoothness (and regularity) of the flow predictions.
Multi-scale Appearance Flow Prediction
The backbone network is a 12-layer Skip-UNet [UNet]. Given an input RGB image, the last decoding layers are used to predict candidate flow maps such that each predicted map is double the size of the previous one. All maps are then interpolated to identical height and width, generating a pyramid of candidate flow maps that corresponds to a structural hierarchy.
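A minimal numpy sketch of assembling such a pyramid, assuming nearest-neighbour resizing and a four-scale setup purely for brevity (the network itself would use differentiable bilinear interpolation):

```python
import numpy as np

def upsample_nearest(flow, out_h, out_w):
    """Nearest-neighbour resize of a (2, h, w) flow map to (2, out_h, out_w).
    Displacements are rescaled so they remain valid at the new resolution."""
    _, h, w = flow.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    resized = flow[:, rows][:, :, cols]  # fancy indexing returns a copy
    resized[0] *= out_w / w              # x-displacement component
    resized[1] *= out_h / h              # y-displacement component
    return resized

# four candidate maps, each double the size of the previous one
candidates = [np.ones((2, 32 * 2**i, 24 * 2**i), dtype=np.float32)
              for i in range(4)]
pyramid = [upsample_nearest(f, 256, 192) for f in candidates]
```

Every map in the pyramid now shares the full output resolution, so they can be compared and gated per pixel.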
Appearance Flow Aggregation
The candidate flows are combined to obtain an aggregate per-pixel appearance flow using a convolutional gated recurrent network (ConvGRU) [conv-GRU] (summarized in Figure 2(A)). Intuitively, this is a per-pixel selection process that determines the aggregate flow by gating (admitting or dismissing) pixel flow estimates corresponding to different radial neighborhoods (the multiple scales). This prevents over-warping of the garment image by regularizing the high degrees of freedom in dense per-pixel appearance flow. We corroborate this position with extensive ablation studies in section 6.1, where we propose and contrast several alternative flow aggregation mechanisms.
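To make the gating arithmetic concrete, here is a toy per-pixel recurrent aggregation over the scale pyramid. The actual module uses a learned ConvGRU with convolutional gates; we collapse those convolutions to fixed scalar weights purely for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_aggregate(flow_pyramid, wz=0.5, wr=0.5, wh=1.0):
    """Toy per-pixel gated recurrent aggregation over candidate flows.

    wz, wr, wh stand in for the learned convolutional gate weights of a
    ConvGRU. At each scale the update gate z decides, per pixel, how much
    of the new candidate flow to admit into the running aggregate.
    """
    h = np.zeros_like(flow_pyramid[0])  # hidden state = aggregate flow
    for x in flow_pyramid:              # coarse-to-fine candidate flows
        z = sigmoid(wz * (x + h))       # update gate
        r = sigmoid(wr * (x + h))       # reset gate
        h_tilde = np.tanh(wh * (x + r * h))
        h = (1 - z) * h + z * h_tilde   # gated per-pixel blend
    return h
```

The gated blend keeps the aggregate flow bounded even when individual candidate estimates are extreme, which is the regularizing effect GAF relies on.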
Garment Image Warping
Next, the aggregate appearance flow map is used to warp the garment image and mask, yielding the warped garment image and the warped binary garment mask respectively. Additionally, the intermediate flow maps are also used to produce intermediate warped images and masks.
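Backward warping with a per-pixel flow can be sketched as follows; nearest-neighbour sampling keeps the sketch dependency-free, whereas the model itself would use differentiable bilinear sampling:

```python
import numpy as np

def warp_with_flow(image, flow):
    """Backward-warp an (H, W, C) image with a per-pixel flow (2, H, W).

    flow[0] / flow[1] give x / y source displacements per output pixel.
    Out-of-bounds source coordinates are clamped to the image border.
    """
    H, W = image.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.round(xs + flow[0]).astype(int), 0, W - 1)
    src_y = np.clip(np.round(ys + flow[1]).astype(int), 0, H - 1)
    return image[src_y, src_x]
```

With a zero flow the garment is reproduced unchanged; non-zero displacements pull each output pixel from its flowed source location.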
Each of the warped images (final and intermediate) is subject to an L1 loss and a perceptual similarity loss [VGGCNN] with respect to the garment regions of the model image. Each predicted warped mask is subject to a reconstruction loss with respect to the ground-truth garment mask. The predicted flow maps are subject to a total variation loss to ensure spatial smoothness of the flow predictions. These terms together constitute the combined warping loss.
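Under the description above, the combined warping loss takes a form along the following lines (the weights $\lambda$, the superscript indexing of intermediate outputs, and the summation over scales $k$ are our assumptions; the paper's exact equation is not reproduced here):

```latex
\mathcal{L}_{warp} \;=\; \sum_{k} \Big( \lambda_{1}\, \mathcal{L}_{1}\big(I^{k}_{wrp}, I_{gt}\big)
\;+\; \lambda_{p}\, \mathcal{L}_{perc}\big(I^{k}_{wrp}, I_{gt}\big)
\;+\; \lambda_{m}\, \mathcal{L}_{mask}\big(M^{k}_{wrp}, M_{gt}\big) \Big)
\;+\; \lambda_{tv}\, \mathcal{L}_{TV}(f)
```

Here $I^{k}_{wrp}$ and $M^{k}_{wrp}$ denote the warped image and mask at scale $k$, $I_{gt}$ and $M_{gt}$ the garment regions of the model image and the ground-truth garment mask, and $f$ the predicted flow maps.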
Validation with Human Pose Transfer
For an extended validation of GAF's efficacy in estimating appearance flows, we use it to regress 3D flows for human pose transfer. The task involves producing an image of a person in a target pose from a reference image. We note that, in contrast to virtual try-on where GAF warps the garment based on the model pose, here it warps the target model pose itself. DIF [li2019dense] is a recent method for pose transfer that first regresses a 3D appearance flow mapping the input to the target pose and then performs feature warping on the input using the flow estimates. We swap in our proposed GAF for 3D flow regression while retaining the feature warping module of DIF. We observe significant qualitative improvement in the generated image and discuss the results in section 6.
3.2 Texture Fusion
Once the warped garment is obtained, the final try-on output is generated over two steps (Figure 2, B-1 and B-2): first, a conditional mask is predicted that corresponds to the clothing segmentation of the target model after the garment change; then, this mask is combined with the warped garment and the texture and geometry priors to produce the try-on output.
3.2.1 Conditional Segmentation
The inputs to this module are the garment image and the Dense Garment-Agnostic Representation (DGAR). The DGAR encodes the geometry of the target person and is agnostic to the specific garment the model is wearing. This is important to prevent over-fitting, as the pipeline is trained on paired data where the input and output are the same images (and hence have the same segmentation mask). The network architecture is a Skip-UNet [UNet] with six encoder and decoder layers, and the output is a 7-channel clothing segmentation mask.
The module is trained with a weighted cross-entropy loss with respect to the ground-truth garment segmentation mask obtained with a pre-trained human parser (as used in [sievenet, acgpn, han2019clothflow]). The weights for the skin and background classes are increased (to 3.0 in our experiments) for better handling of bleeding and of self-occlusion, where the pose of the person causes certain parts of the garment or body to remain hidden from view.
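A minimal numpy sketch of such a class-weighted pixel-wise cross-entropy (the class layout and weight values here are illustrative):

```python
import numpy as np

def weighted_cross_entropy(probs, target, class_weights):
    """Pixel-wise weighted cross-entropy.

    probs:         (C, H, W) predicted class probabilities (post-softmax)
    target:        (H, W) integer ground-truth class labels
    class_weights: length-C weights; e.g. skin/background upweighted to 3.0
    """
    p_true = np.take_along_axis(probs, target[None], axis=0)[0]  # (H, W)
    w = np.asarray(class_weights)[target]                        # (H, W)
    return float(np.mean(-w * np.log(p_true + 1e-12)))
```

Upweighting a class multiplies its per-pixel penalty, pushing the segmentation network to resolve skin/background boundaries more carefully.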
We observe that using the Dense Garment-Agnostic Representation improves depth perception and the handling of occlusion in the conditional segmentation, which results in try-on outputs with fewer artefacts. We discuss this further in section 6.2.
3.2.2 Segmentation-Assisted Dense Fusion
This stage generates the final try-on output. The network architecture for this stage is also a Skip-UNet [UNet] with six encoder and decoder layers. The network inputs include the outputs of the previous stages and a texture translation prior representing the non-garment pixels of the model image. To include the 3D geometry of the model, we also input a dense prior (called IUV Priors) composed of the UV map and body-part segmentation of the target model. We note that the body-part segmentation is a function of the body geometry (agnostic to the specific garments) and differs from the clothing segmentation, which changes with the garment (both are useful for try-on). The try-on output is defined as

$$I_{tryon} = M_{out} \odot I_{wrp} + (1 - M_{out}) \odot I_{rp}$$

where $M_{out}$ and $I_{rp}$ are generated by the network: $M_{out}$ is a composite mask for the garment pixels in the try-on output, and $I_{rp}$ is a rendered person comprising all target-model pixels except the garment. To preserve the structural and geometric integrity of the try-on output, we also constrain the network to reconstruct the input clothing segmentation and IUV priors, which are unchanged during this step.
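The compositing step itself is simple per-pixel alpha blending, as this sketch shows:

```python
import numpy as np

def compose_tryon(m_out, i_wrp, i_rp):
    """Alpha-composite warped garment and rendered person:
    I_tryon = M_out * I_wrp + (1 - M_out) * I_rp, with M_out in [0, 1]."""
    return m_out * i_wrp + (1.0 - m_out) * i_rp

# mask = 1 keeps the warped-garment pixel, mask = 0 keeps the person pixel
m = np.array([[1.0, 0.0]])
g = np.full((1, 2), 10.0)  # stand-in garment intensities
p = np.full((1, 2), 20.0)  # stand-in rendered-person intensities
out = compose_tryon(m, g, p)
```

Because the network predicts $M_{out}$ with continuous values, boundary pixels can blend garment and person smoothly rather than switching hard.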
The try-on output is subject to L1, perceptual similarity [VGGCNN] and edge losses with respect to the model image. The edge loss, based on Sobel filters, improves the quality of the reproduced textures. Finally, the reconstructed clothing segmentation and IUV priors are subject to reconstruction losses against their corresponding network inputs. This reconstruction loss combines cross-entropy for the categorical masks and smooth L1 for the UV map.
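A minimal numpy sketch of a Sobel-based edge loss of this kind (our illustration of the idea, not the paper's exact loss definition):

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
SOBEL_Y = SOBEL_X.T

def conv2d_valid(img, kernel):
    """Naive 'valid'-mode 2-D correlation, sufficient for the sketch."""
    H, W = img.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1), dtype=np.float32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def edge_loss(pred, target):
    """L1 distance between Sobel gradient maps of prediction and target."""
    loss = 0.0
    for k in (SOBEL_X, SOBEL_Y):
        loss += np.mean(np.abs(conv2d_valid(pred, k) - conv2d_valid(target, k)))
    return loss
```

Penalizing gradient-map differences rather than raw intensities emphasizes high-frequency detail such as stripes and text on the garment.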
We observe that conditioning texture fusion on these geometric priors via the reconstruction loss improves the quality of the try-on output through improved depth perception and structural coherence; we explain this effect with evidence in section 6.
In this section, we formalise the setup for our experiments with virtual try-on and human pose transfer.
For image-based virtual try-on, we use the VITON dataset [viton] to ensure consistency with baseline methods. It contains 19000 images of front-facing female models and corresponding isolated upper-clothing garment images of size 256x192. There are 16253 cleaned pairs, which are split into train and test sets of 14221 and 2032 pairs respectively. We also separate out 500 pairs from the train set into a validation set used exclusively for quantitative analysis. The images in the test set are rearranged into unpaired sets for qualitative evaluation. For human pose transfer, we use the In-shop Clothes benchmark from the DeepFashion dataset [deepfashion], which contains 52712 in-shop clothes images and 200000 cross-pose pairs of size 256x256. Following the setup in DIF [li2019dense], we select 89262 pairs and 12000 pairs for train and test respectively.
For virtual try-on, we use SSIM [seshadrinathan2008unifying], FID [heusel2017gans] and PSNR [hore2010image] of the warped garment and the try-on output. We avoid the inception score (IS) following the considerations presented in [barratt2018note]. For human pose transfer, we evaluate performance using SSIM [seshadrinathan2008unifying] and PSNR [hore2010image]. These metrics are chosen to ensure consistent comparison with prior work.
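For reference, PSNR follows directly from the mean squared error; a minimal sketch:

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Higher PSNR indicates smaller pixel-wise reconstruction error against the ground-truth image.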
For virtual try-on, we compare performance with several recent state-of-the-art methods including CP-VTON [cpvton], SieveNet [sievenet], ClothFlow [han2019clothflow], VTNFP [vtnfp] and ACGPN [acgpn]. For [cpvton, sievenet, acgpn], we use author provided implementations and perform extensive qualitative and quantitative comparisons.
Table 1 compares performance of ZFlow against state-of-the-art baselines for virtual try-on. We report performance for TPS-based baselines [cpvton, sievenet] using author-provided implementations. In comparison to [cpvton, sievenet], ZFlow achieves a significantly better SSIM of 0.885, PSNR of 25.46 and FID of 15.17, compared to the next best values (SSIM=0.845, PSNR=23.60 and FID=23.68). We note that ZFlow with GAF significantly outperforms ClothFlow [han2019clothflow], which uses vanilla per-pixel appearance flow based warping for the garment image. Since the official code for ClothFlow [han2019clothflow] was not available, we implemented it as described and reproduce the stated SSIM values.
Figure 3 illustrates a qualitative comparison with SieveNet [sievenet], CP-VTON [cpvton] and ACGPN [acgpn], the baselines with available code implementations. We contrast the try-on outputs along varying dimensions of quality, including factors that determine the realism of the generated image as a whole as well as the local geometry, colors and patterns.
Rows (1-5) demonstrate improvement in geometric integrity - the accurate representation of the geometry of the target model, the garment, and their interaction in the try-on output. Specifically, we observe that ZFlow improves the handling of extreme pose (row 1), depth-ordering of body parts, especially hands and the neck region (row 2), skin generation for correct visibility of the target garment and human skin (row 3), and neckline reproduction and shoulder correction in coherence with the garment's structure (rows 4, 5). We highlight the improved neckline reproduction and depth-ordering in row 5, where none of the baselines is able to disambiguate the front and back of the garment neckline.
Rows (6-10) demonstrate improvement in texture integrity, which concerns the accurate reproduction of the patterns and colors of in-shop garments in the try-on output, and the handling of related artefacts. Specifically, we observe that ZFlow improves the reproduction of pattern and texture (stripes in rows 6, 7), print design of the garment (graphic in row 8) and text written on the garment (row 9), and prevents color bleeding across part boundaries (row 10).
Shadows and highlights in the generated image, especially along the boundaries of body parts, are also important to correctly represent the dynamics of the actual scene. Row 11 demonstrates improvement along this dimension.
We conduct a survey with 70 volunteers from 3 continents, 5 countries and 10 institutions, spanning diverse ages, genders and occupations. As in [acgpn], we use pairwise comparison: each user is shown 100 distinct result pairs randomly sampled from 2000 test-set results. Each pair consists of one ZFlow result and one sampled from the results of one of three baselines [acgpn, sievenet, cpvton]. The in-shop garment and target model images are also shown for each result pair. Every volunteer is asked to select the better of the two outputs in each pair, with unlimited time. Results in Table 2 show an overwhelmingly clear preference for ZFlow in all pairwise comparisons.
| Baseline | Prefer Baseline | Prefer ZFlow |

| Garment Warping | Texture Fusion | SSIM (warp) | PSNR (warp) | SSIM (try-on) | PSNR (try-on) | FID |
|---|---|---|---|---|---|---|
| GAF | BaseFuse + edge loss | 0.871 | 23.28 | 0.875 | 25.02 | 19.39 |
| GAF | BaseFuse + edge + IUV reconstruction losses | 0.871 | 23.28 | 0.876 | 25.12 | 18.74 |
| ZFlow (end-to-end training) | | 0.871 | 23.28 | 0.885 | 25.46 | 15.17 |
6 Ablation Studies
In this section, we analyse the impact of different contributions of ZFlow and summarize results in Table 3.
6.1 Gated Appearance Flow (GAF)
First, we demonstrate the impact of GAF for garment image warping by comparing it to an existing per-pixel appearance flow based warping technique proposed in ClothFlow [han2019clothflow]. Next, to justify our choice of using a ConvGRU layer for aggregating hierarchical candidate appearance flow estimates, we propose alternate flow-aggregation schemes and report comparison with ConvGRU.
ClothFlow and GAF
Rows 1 and 2 of Table 3 compare the use of per-pixel appearance flow for garment image warping as described in [han2019clothflow] with the proposed gated aggregation of hierarchical flow estimates (GAF). GAF clearly outperforms the vanilla warping method, corroborating our position that gated aggregation yields superior results both for the warping stage and for the try-on output.
Design choices for GAF
In rows 3, 4 and 5 of Table 3, we compare the following schemes for gated aggregation: i) Residual Gating, performing a residual sum (the operation from [tirg]) on the flow estimates of the last two decoding layers; ii) ConvLSTM for flow-estimate aggregation over three layers (3 scales); and iii) ConvGRU for aggregating flow estimates. The results clearly indicate that ConvGRU produces the best results of the three and hence is used in GAF.
Further, we note that all three aggregation schemes significantly outperform ClothFlow on metrics for both the warped garment and the try-on output. For instance, ConvGRU improves the warped garment SSIM (from 0.835 to 0.871) and PSNR (from 20.54 to 23.14) over ClothFlow [han2019clothflow]. This benefit translates to the try-on output, where we observe consistent gains in SSIM (from 0.843 to 0.865), PSNR (from 23.60 to 24.47) and FID (from 23.48 to 18.89).
GAF in Human Pose Transfer
As an additional test of the efficacy of the proposed appearance flow aggregation, we adapt it for flow regression in the task of Human Pose Transfer, building upon the DIF baseline [li2019dense]. This results in both qualitative (Figure 4) and quantitative (Table 4) improvements in the pose-transfer output. Figure 4 presents evidence of significantly improved skin generation (row 1), texture (row 2) and reduced bleeding (rows 1, 2) in the generated image. We corroborate this with results in Table 4, which indicate considerable improvement in SSIM (from 0.778 to 0.791) and PSNR (from 18.59 to 19.26). We also note the significant gain over ClothFlow [han2019clothflow], which also uses flow regression, as validation of the efficacy of GAF.
6.2 Input Priors, Losses and Training
Dense Garment-Agnostic Representation
The DGAR is proposed as a structural prior for garment warping and conditional segmentation. Figure 5 shows that this improves depth perception, skin generation (row 1) and neckline reconstruction (row 2) in the try-on output. We note similar improvements during garment warping (qualitative results in the appendix), corroborated by the increased PSNR of the warped garment (row 5 vs 6 in Table 3).
IUV Priors, composed of the UV projection map and body-part segmentation, encode the 3D geometry of the target model during texture fusion. The network is trained to reconstruct these priors along with the try-on output. Figure 6 shows that conditioning on these IUV Priors via the reconstruction loss improves the generation of the neckline and skin (row 1) and depth perception (row 2) in the output. This is corroborated by improved PSNR (25.02 to 25.12) and FID (19.39 to 18.74) of the try-on output (row 6 vs 7 in Table 3).
The edge loss, based on Sobel filters, is used to better preserve high-frequency details during texture fusion. Table 3 shows that this improves the SSIM (from 0.865 to 0.875) and PSNR (from 24.47 to 25.02) of the try-on output.
End-to-end Fine Tuning
The end-to-end fine-tuning of the entire ZFlow network (including the warping and texture fusion modules) improves SSIM (0.876 to 0.885), PSNR (25.12 to 25.46) and FID (from 18.74 to 15.17) of the try-on output as indicated in Table 3 (row 7 vs 8).
We introduce ZFlow, an end-to-end try-on framework which utilizes a combination of gated aggregation of hierarchical flow estimates (Gated Appearance Flow) and dense geometric priors (DGAR and IUV Priors) to reduce undesirable output artefacts. We highlight the effectiveness of ZFlow through comparisons with the state-of-the-art and detailed ablation studies, and validate the efficacy of GAF as a general technique by applying it to human pose transfer.