1 Introduction
Inspired by the huge success of deep neural networks over the past decade, clothes style transfer has attracted intensive attention from the research community; it aims to generate realistic person scenes that satisfy the demands of diverse attributes [2020MUST, 2019VTNFP, 2018VITON]. Owing to alluring application prospects in controllable person manipulation, virtual try-on, clothes texture editing, etc., many efforts have been made to improve its performance [2019VTNFP, 2018VITON, 2020Deep, 2020Controllable]. More importantly, video is a finer carrier for delivering visual experience than static images in many tasks (e.g. person pose animation [2020Deep]). However, video-based person generation has not yet been investigated in depth [2019DwNet].

Existing approaches for clothes style transfer can be roughly divided into two classes: warping-based [2019VTNFP, 2018VITON, 2020Towards, 2018Human, 2017ClothCap, 2016Detailed] and image generation-based models [2020MUST, 2020Deep, 2020Controllable]. The former benefit from preserving informative and detailed features of the clothes. The authors of [2018VITON] estimate a thin-plate-spline (TPS) transformation to warp the clothing items and refine the coarse results with a composition mask, while Yu et al. [2019VTNFP] devise a body segmentation map prediction module to delineate body parts and warped clothing regions. Despite the steady progress these methods have achieved, it remains difficult to directly transform spatially misaligned body parts into the desired shapes owing to the non-rigid nature of the human body, so they cannot attain fully satisfactory performance. To mitigate this issue, several approaches [2020Towards, 2018Human, 2017ClothCap, 2016Detailed] transfer clothes from one person to another by estimating a complicated 3D human mesh and warping the textures to fit the body topology. Nevertheless, these approaches fail to capture the sophisticated interplay of intrinsic shape and appearance, and thus produce unrealistic syntheses with deformed textures.
Profiting from the invention of GANs [2016Image, 2017High, 2019Few], image generation-based approaches have prevailed in image synthesis tasks [2020MUST, 2020Deep, 2020Controllable, 2018Everybody, 2017Deformable, 2018A, 2019Progressive]. A core driving force behind these methods is to impose pose guidance on the image generation process, so that photo-realistic person images with arbitrary poses are rendered while the clothes styles are simultaneously controlled by exploiting various human attributes. Early models [2018Everybody, 2017Deformable, 2018A, 2019Progressive] focus only on keeping the posture and identity without allowing user manipulation of human attributes, e.g. head, pants and upper clothes, whereas several recent works [2020MUST, 2020Deep, 2020Controllable] treat the clothes as a type of texture style and try to encode it. In particular, thanks to its high flexibility and strong generalization ability to unseen clothes styles [2020Controllable, 2019A], Adaptive Instance Normalization (AdaIN) [2017Arbitrary] and its variants have become the de facto standard for extracting appearance features [2020MUST, 2020Controllable]. While these methods obtain acceptable performance, they have an obvious drawback: they often perform only coarse-grained feature extraction and cannot restore fine-grained details well, which easily causes blurry clothes and a loss of identity information.

In the field of video-based generation, existing methods formulate the problem as the generation of an image sequence and produce results frame by frame. For example, resorting to a set of sparse trajectories, [2019Animating] and [2020First] leverage zeroth-order and first-order Taylor expansions, respectively, to approximate the complex transformations. However, they rarely consider spatio-temporal coherence, which harms the consistency of movements. Although the sequential generator proposed by Wang et al. [2019Few] produces relatively smooth videos, it is applied only to single-person pose transfer.

Such limitations and drawbacks drive our exploration of clothes style transfer for person video generation, with the goal of improving the quality and consistency of the generated videos. In this paper, we propose a novel framework for person video generation with clothes style transfer. It consists of four components: a Pose Landmark Encoder, a Characteristic Identity Encoder, a Clothes Style Encoder and a Shared Decoder, as illustrated in Figure 1. Specifically, we formulate feature learning as three disentangled sub-tasks to avoid the learning conflicts present in single-column structures. This disentangled multi-branch scheme is inspired by the natural insight that different salient areas respond to different degrees to multiple tasks [2020Revisiting]. Furthermore, to guarantee spatio-temporal consistency in the generated videos and to implicitly capture local frame dependencies, we devise a variant of discriminator, named the inner-frame discriminator, which takes the cross-frame difference as input. Besides, background/scenario replacement has become a new trend in person video generation because of its value for privacy and entertainment [2020Real]; unfortunately, existing approaches [2020MUST, 2020Controllable] are scenario-agnostic. Motivated by this observation, we design a strategy that makes our framework scenario-aware and improves the flexibility of video generation. To our knowledge, this is the first investigation that addresses spatio-temporal consistency and scenario awareness simultaneously for person video generation with clothes style transfer. Extensive experiments conducted on the TEDXPeople benchmark [2021StylePeople] demonstrate that our proposed method outperforms existing popular approaches in terms of image quality, similarity to real images and video smoothness, while embracing the merit of scenario awareness for matting.
2 Methodology
2.1 Disentangled Multi-branch Encoders
To prevent the model from suffering from the learning conflicts that often occur in single-column structures, and to capture fine-grained cues for the downstream decoder, we disentangle different spatial information from the main input and construct a multi-branch structure, which takes pose landmark images, characteristic identity images and clothes style images as input. The pose landmark images are obtained with the widely used human pose estimator of [openpose]. Following previous works [2020MUST, 2020Deep, 2020Controllable], we build our pose landmark encoder upon a three-layer down-sampling network to encode the pose frame sequence. The characteristic identity encoder aims to boost the identity information and avoid possible distortion, especially in the face region. Specifically, the identity images are disentangled by leveraging the human segmentation method of [2018Encoder]. Unlike the pose landmark encoder, which needs to extract more abundant and sophisticated features, the identity encoder only provides weak cues for identity, so we narrow its network width to half that of the pose encoder and thereby reduce the number of parameters. For the clothes style encoder, the same segmentor [2018Encoder] is employed to generate the clothes input images. Besides, during the training phase, a stochastic geometric-transformation regularizer is adopted to strengthen the model's generalization to arbitrary clothes inputs. In addition, to further increase the model's representation capacity, a combination of spatial and channel attention mechanisms [2018CBAM] is applied in each encoder layer to adaptively emphasize informative filters or regions by softly recalibrating the corresponding units.
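To make the multi-branch layout concrete, the following is a minimal PyTorch sketch of how the three encoders and the per-layer channel/spatial attention could be organized; the channel widths, kernel sizes and the simplified CBAM-style attention block are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal PyTorch sketch of the disentangled multi-branch encoders (Sec. 2.1).
# Channel widths, layer counts and the simplified CBAM-style attention are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

class CBAMBlock(nn.Module):
    """Simplified channel + spatial attention in the spirit of CBAM [2018CBAM]."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel_mlp(x)                       # channel re-calibration
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.amax(1, keepdim=True)], dim=1)
        return x * self.spatial_conv(pooled)              # spatial re-calibration

def down_block(in_ch, out_ch):
    """One down-sampling stage followed by attention."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.InstanceNorm2d(out_ch),
        nn.ReLU(inplace=True),
        CBAMBlock(out_ch),
    )

class MultiBranchEncoders(nn.Module):
    """Pose, identity and clothes branches processed in parallel."""
    def __init__(self, base=64):
        super().__init__()
        # Pose landmark encoder: three down-sampling layers.
        self.pose = nn.Sequential(down_block(3, base),
                                  down_block(base, base * 2),
                                  down_block(base * 2, base * 4))
        # Identity encoder: half the width of the pose encoder (weak cues).
        self.identity = nn.Sequential(down_block(3, base // 2),
                                      down_block(base // 2, base),
                                      down_block(base, base * 2))
        # Clothes style encoder.
        self.clothes = nn.Sequential(down_block(3, base),
                                     down_block(base, base * 2),
                                     down_block(base * 2, base * 4))

    def forward(self, pose_img, id_img, clothes_img):
        return self.pose(pose_img), self.identity(id_img), self.clothes(clothes_img)
```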
2.2 Shared Decoder with Scenario Awareness
After flowing through the multi-branch encoders in parallel, the disentangled sources are converted into fine-grained feature representations. These features are then integrated and further processed in the high-level shared decoder. Skip connections are used here to fuse multi-level features while avoiding the issue of gradient vanishing [2016Deep]. To give the framework the flexibility of arbitrarily changing scenarios, we design an auxiliary output branch that predicts a scenario mask $m$. The final composition can be formulated as $\hat{I} = (1 - m) \odot I_g + m \odot I_s$, where $I_g$ and $m$ denote the generated frame and the alpha matte for the new scenario, respectively, and $I_s$ represents an arbitrary scenario image.
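As a concrete illustration of the scenario-aware composition above, here is a short PyTorch sketch; the tensor names and the convention that the matte equals 1 inside the new scenario region are assumptions, not the authors' exact formulation.

```python
# Sketch of the scenario-aware composition (Sec. 2.2). The tensor names and the
# orientation of the predicted matte m (1 = new scenario region) are assumptions.
import torch

def composite_frame(generated: torch.Tensor,   # (B, 3, H, W) generated person frame
                    matte: torch.Tensor,       # (B, 1, H, W) alpha matte in [0, 1]
                    scenario: torch.Tensor     # (B, 3, H, W) arbitrary scenario image
                    ) -> torch.Tensor:
    """Blend the generated frame with a new scenario using the predicted matte."""
    return (1.0 - matte) * generated + matte * scenario
```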
2.3 Inner-frame Discriminator
In general, two kinds of discriminators, a pose discriminator $D_{pose}$ and an appearance discriminator $D_{app}$, prevail in current algorithms for pose-guided person image generation, such as [2020Controllable, 2019Progressive, 2018Unsupervised]. In detail, $D_{pose}$ is leveraged to guarantee consistency between the pose of the generated image and the target pose, while $D_{app}$ concentrates on the quality and realism of the generated images. To exploit the merits of both, we adopt both discriminators in our framework.
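A minimal sketch of how these two conventional discriminators could be instantiated is given below; the PatchGAN-style backbone, the 3-channel pose map and the channel-wise concatenation of the generated frame with the target pose are common choices assumed here, not necessarily the paper's exact design.

```python
# Sketch of the two conventional discriminators (Sec. 2.3): D_pose checks pose
# consistency via image/pose concatenation, D_app checks image realism.
# The PatchGAN-style backbone is an assumption, not the paper's exact design.
import torch
import torch.nn as nn

def patch_discriminator(in_ch: int) -> nn.Module:
    layers, ch = [], 64
    layers += [nn.Conv2d(in_ch, ch, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True)]
    for _ in range(3):
        layers += [nn.Conv2d(ch, ch * 2, 4, 2, 1),
                   nn.InstanceNorm2d(ch * 2),
                   nn.LeakyReLU(0.2, inplace=True)]
        ch *= 2
    layers += [nn.Conv2d(ch, 1, 4, 1, 1)]            # patch-wise real/fake scores
    return nn.Sequential(*layers)

D_pose = patch_discriminator(in_ch=3 + 3)            # generated frame + rendered pose map
D_app  = patch_discriminator(in_ch=3)                # generated frame only

def pose_logits(frame: torch.Tensor, pose_map: torch.Tensor) -> torch.Tensor:
    return D_pose(torch.cat([frame, pose_map], dim=1))
```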


As described in Section 1, spatio-temporal consistency of the generated videos is insufficiently considered in existing approaches. To alleviate this issue, we devise an inner-frame discriminator $D_{if}$ that works in parallel with $D_{pose}$ and $D_{app}$; it takes the cross-frame difference as input and contributes to better video smoothness. In other words, it acts as a strong constraint on adjacent frames that suppresses sudden drastic changes. Formally, given two adjacent frames $x_t$ and $x_{t+1}$ in the same video, the inner-frame discriminator can be written as $D_{if}(x_{t+1} - x_t)$. Besides, the sequence discriminator of [2020Deep, 2019Few] is also incorporated into our method to enhance the visual quality of the generated videos, since it enables the framework to implicitly consider the inter-frame relationships. Figure 2 delineates the scheme of these two discriminators.
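The idea of discriminating on cross-frame differences can be sketched as follows; the small convolutional backbone and the LSGAN-style objective are assumptions for illustration, not the paper's exact architecture.

```python
# Sketch of the inner-frame discriminator idea (Sec. 2.3): the discriminator sees
# cross-frame differences, so real and generated temporal changes are compared
# directly. The tiny conv backbone and LSGAN-style loss are assumptions.
import torch
import torch.nn as nn

class InnerFrameDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 1, 4, 1, 1),
        )

    def forward(self, frame_t, frame_t1):
        # The input is the cross-frame difference, not the frames themselves.
        return self.net(frame_t1 - frame_t)

def inner_frame_d_loss(D, real_t, real_t1, fake_t, fake_t1):
    """LSGAN-style discriminator objective on temporal differences (a sketch)."""
    real_score = D(real_t, real_t1)
    fake_score = D(fake_t.detach(), fake_t1.detach())
    return ((real_score - 1) ** 2).mean() + (fake_score ** 2).mean()
```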
2.4 Objective Functions
The overall training objective supervises the optimization of our framework so that it generates high-quality person videos conforming to the input clothes style as well as the identity. Additionally, the posture of the generated person should be compatible with that of the input pose images. The loss function of the whole network is formulated as:
$$\mathcal{L} = \lambda_{adv}\mathcal{L}_{adv} + \lambda_{rec}\mathcal{L}_{rec} + \lambda_{style}\mathcal{L}_{style} + \lambda_{vid}\mathcal{L}_{vid} + \lambda_{local}\mathcal{L}_{local} \qquad (1)$$
where the $\lambda$'s are weights that control the influence of the different loss terms. $\mathcal{L}_{adv}$ denotes the adversarial loss, implemented with the LSGAN objective [2017Least]. The reconstruction loss $\mathcal{L}_{rec}$ impels the generated frames to be close to the ground truths, following [2016Perceptual] and [2014Very], while the style loss $\mathcal{L}_{style}$ aims to maintain style coherence. More importantly, the video loss $\mathcal{L}_{vid}$ is realized by the inner-frame and sequence discriminators described in Section 2.3 and improves video quality, especially spatio-temporal coherence. Finally, the realism and identity preservation of the generated images rely heavily on the synthesis quality of a few key regions, such as the head and hands. Therefore, to improve the overall generation quality, we introduce a local characteristic discriminator that takes the head/hand segmentation results as input and computes the local characteristic loss term $\mathcal{L}_{local}$.
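A minimal sketch of how the weighted objective in Eq. (1) could be assembled is shown below; the dictionary keys and example weights are placeholders, since the paper's actual weight values are not reproduced here.

```python
# Sketch of assembling the weighted objective of Eq. (1). The keys and the
# example weights below are placeholders; the paper's actual weight values
# are not stated here.
import torch

def total_loss(losses: dict, weights: dict) -> torch.Tensor:
    """Both dicts are keyed by 'adv', 'rec', 'style', 'vid', 'local'."""
    return sum(weights[k] * losses[k] for k in weights)

# Example usage with dummy scalar losses:
dummy = {k: torch.tensor(1.0) for k in ('adv', 'rec', 'style', 'vid', 'local')}
example_weights = {'adv': 1.0, 'rec': 10.0, 'style': 10.0, 'vid': 1.0, 'local': 1.0}
loss = total_loss(dummy, example_weights)
```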
3 Experiments
3.1 Dataset and Implementation Details
We evaluate our method on the TEDXPeople dataset provided in [2021StylePeople], which contains 48,188 videos of TED and TED-X talks. We randomly select 5,000 videos from the entire dataset at a resolution of 256×256; 85% of them are used for training and the rest for validating clothes style transfer during the inference phase. Besides, images from the ImageNet dataset [ILSVRC15] are used as random scenario inputs during training. The weights for the loss terms are set to . We adopt the Adam optimizer [2014Adam] with momentum set to 0.5 and train our model for around 100k iterations.
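The training setup described above could look roughly as follows in PyTorch; beta1 = 0.5 reflects the stated momentum, while the learning rate, beta2 and the dummy model/loss are assumptions used only to keep the snippet runnable.

```python
# Sketch of the training setup: Adam with beta1 = 0.5 for roughly 100k iterations.
# The learning rate, beta2 and the dummy model/loss are assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))   # stand-in for the full framework
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.5, 0.999))

for step in range(100_000):
    frames = torch.randn(4, 3, 256, 256)                # dummy 256x256 batch
    loss = model(frames).abs().mean()                    # in practice: Eq. (1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```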
3.2 Comparison with SOTA Models
In our experiments, Structural Similarity (SSIM) [2004Image] and Peak Signal-to-Noise Ratio (PSNR) are employed to measure the similarity between the generated images and the ground truths, whereas the Fréchet Inception Distance (FID) [2017GANs] assesses the realism and consistency of the generated images [2020MUST, 2020Deep]. We evaluate the spatio-temporal consistency of the generated videos with the Fréchet Video Distance (FVD) [2018FVD] (a sketch of the SSIM/PSNR computation follows Table 1). We compare our method with two existing popular approaches, MUST [2020MUST] and ADGAN [2020Controllable]. The quantitative results are reported in Table 1 and visual examples are provided in Figure 3(a); both demonstrate the superiority of our proposed framework. We further validate the scenario awareness of our framework, with several example results given in Figure 3(b). As can be observed, the identity in the generated videos remains the same as that of the source persons, and our approach can manipulate clothes styles and background scenarios flexibly. Video demos are available at https://github.com/XSimba123/demos-of-csf-sa/

Method | SSIM | PSNR | FID | FVD
---|---|---|---|---
ADGAN [2020Controllable] | 0.787 | 19.0 | 19.1 | 0.732
MUST [2020MUST] | 0.794 | 19.3 | 11.7 | 0.804
Our approach | 0.841 | 23.9 | 11.3 | 0.261
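Below is a minimal sketch of how the per-frame SSIM and PSNR entries in Table 1 could be computed with scikit-image; FID and FVD require pretrained Inception/I3D features and are omitted from this sketch.

```python
# Sketch of the per-frame SSIM/PSNR computation behind Table 1 using
# scikit-image. FID and FVD need pretrained Inception/I3D features and are
# therefore omitted here.
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def frame_similarity(generated: np.ndarray, target: np.ndarray):
    """Both inputs are H x W x 3 uint8 frames."""
    ssim = structural_similarity(generated, target, channel_axis=-1)
    psnr = peak_signal_noise_ratio(target, generated, data_range=255)
    return ssim, psnr
```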
3.3 Ablation Study
To better understand each component of our model, we conduct ablation studies on the TEDXPeople benchmark (see Table 2). The results show that the multi-encoder structure (I) and the attention module (II) contribute greatly to visual quality, the FVD metric shows that the two spatio-temporal constraints (III) bring a significant improvement in video consistency, and the local characteristic discriminator (IV) also boosts the similarity metrics.
I | II | III | IV | SSIM | PSNR | FID | FVD
---|---|---|---|---|---|---|---
 | | | | 0.801 | 20.5 | 16.7 | 0.643
✓ | | | | 0.804 | 21.5 | 14.6 | 0.498
✓ | ✓ | | | 0.818 | 21.8 | 12.8 | 0.427
✓ | ✓ | ✓ | | 0.826 | 22.3 | 12.9 | 0.282
✓ | ✓ | ✓ | ✓ | 0.841 | 23.9 | 11.3 | 0.261
4 Conclusion
In this work, we propose an approach for scenario-aware clothes style transfer in person videos and show that it outperforms baseline and recent approaches. Our model uses a multi-encoder framework to handle different types of features, and the proposed inner-frame discriminator greatly enhances the temporal coherence of the generated videos. Moreover, our design supports scenario-aware video generation. Experimental results on the TEDXPeople dataset confirm the effectiveness and robustness of the proposed framework.