The source code for paper "Deep Image Spatial Transformation for Person Image Generation"
Pose-guided person image generation transforms a source person image into a target pose. This task requires spatial manipulation of the source data. However, Convolutional Neural Networks are limited by their lack of ability to spatially transform the inputs. In this paper, we propose a differentiable global-flow local-attention framework to reassemble the inputs at the feature level. Specifically, our model first calculates the global correlations between sources and targets to predict flow fields. Then, the flowed local patch pairs are extracted from the feature maps to calculate the local attention coefficients. Finally, we warp the source features using a content-aware sampling method with the obtained local attention coefficients. The results of both subjective and objective experiments demonstrate the superiority of our model. Besides, additional results on video animation and view synthesis show that our model is applicable to other tasks requiring spatial transformation. Our source code is available at https://github.com/RenYurui/Global-Flow-Local-Attention.
Image spatial transformation describes the spatial deformation between sources and targets. Such deformation can be caused by object motions or viewpoint changes. Many conditional image generation tasks can be seen as a type of spatial transformation task. For example, pose-guided person image generation [18, 24, 26, 38] transforms a person image from a source pose to a target pose while retaining the appearance details. This task can be tackled by reasonably reassembling the input data in the spatial domain.
However, Convolutional Neural Networks are spatially invariant to the input data, since they calculate outputs in a position-independent manner. This property benefits tasks that require reasoning about images, such as classification [13, 27], segmentation [4, 7], and detection. However, it limits the networks by depriving them of the ability to spatially rearrange the input data. Spatial Transformer Networks (STN) solve this problem by introducing a Spatial Transformer module into standard neural networks. This module regresses global transformation parameters and warps input features using an affine transformation. However, since it assumes a global affine transformation between sources and targets, this method cannot deal with the transformations of non-rigid objects.
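The STN-style warp described above can be sketched in a few lines. This is a minimal NumPy illustration (the function name `affine_grid` and the normalized coordinate convention are our assumptions, not the paper's code): a single 2x3 matrix maps every output coordinate to a source coordinate, so the entire feature map shares one global transform.

```python
import numpy as np

def affine_grid(theta, height, width):
    """Return the source sampling coordinate for each output position.

    theta: (2, 3) affine matrix acting on homogeneous coordinates
    in the normalized range [-1, 1], as in STN-style warping.
    """
    ys, xs = np.meshgrid(np.linspace(-1, 1, height),
                         np.linspace(-1, 1, width), indexing="ij")
    # homogeneous coordinates (x, y, 1) at every output position
    coords = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # (H, W, 3)
    return coords @ theta.T                                  # (H, W, 2)

# the identity transform samples each position from itself
identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
grid = affine_grid(identity, 4, 4)
```

Because a single `theta` is shared by all positions, this warp cannot express the per-region deformations of non-rigid objects, which is the limitation the paragraph above points out.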
The attention mechanism allows networks to make use of non-local information, which gives them the ability to build long-term correlations. It has been proved efficient in many tasks such as natural language processing, image recognition [31, 9], and image generation. However, for spatial transformation tasks in which target images are the deformation results of source images, each output position has a clear one-to-one relationship with the source positions. Therefore, the attention coefficient matrix between the source and target should be a sparse matrix instead of a dense matrix.
Flow-based operations force the attention coefficient matrix to be sparse by sampling a very local source patch for each output position. These methods predict 2-D coordinate vectors specifying which positions in the sources should be sampled to generate the targets. However, in order to stabilize training, most traditional flow-based methods [37, 3] warp the input data at the pixel level, which prevents the networks from generating new content. Meanwhile, large motions are difficult to extract due to the requirement of generating full-resolution flow fields. Warping the inputs at the feature level can solve these problems. However, the networks easily become stuck in bad local minima [21, 32] for two reasons. Since both the input features and the flow fields change during training, in the early stages the features cannot obtain reasonable gradients without correct flow fields, and the network cannot extract similarities to generate correct flow fields without reasonable features. The poor gradients provided by the commonly used bilinear sampling method further lead to instability in training. See Section Warping at the Feature Level for more discussion.
To deal with these problems, in this paper we combine flow-based operations with attention mechanisms. We propose a novel global-flow local-attention framework that forces each output location to be related only to a local feature patch of the sources. The architecture of our model can be found in Figure 2. Specifically, our network can be divided into two parts: the Flow Field Estimator and the Target Image Generator. The Flow Field Estimator is responsible for extracting the global correlations and generating flow fields. The Target Image Generator synthesizes the final results using local attention. To avoid the poor gradient propagation of bilinear sampling, we propose a content-aware sampling method to calculate the local attention coefficients.
We compare our model with several state-of-the-art methods. The results of both subjective and objective experiments show the superior performance of our model. We also conduct comprehensive ablation studies to verify our hypotheses. Besides, we apply our model to other tasks requiring spatial transformation, including view synthesis and video animation. The results show the versatility of our module. The main contributions of our paper can be summarized as follows:
A global-flow local-attention framework is proposed for pose-guided person image generation tasks. Experiments demonstrate the effectiveness of the proposed method.
The carefully-designed framework and content-aware sampling operation ensure that our model is able to warp and reasonably reassemble the input data at the feature level. This not only enables the model to generate new content, but also reduces the difficulty of the flow field estimation task.
Additional experiments over view synthesis and video animation show that our model can be flexibly applied to different spatial transformation tasks and use different structure guidance.
Pose-guided Person Image Generation. An early attempt on the pose-guided person image generation task proposes a two-stage network to first generate a coarse image with the target pose and then refine the result in an adversarial way. Essner et al. try to disentangle the appearance and pose of person images. Their model enables both conditional image generation and transformation. However, they use U-Net based skip connections, which may lead to feature misalignments. Siarohin et al. solve this problem by introducing deformable skip connections to spatially transform the textures. Their model decomposes the overall deformation into a set of local affine transformations (e.g. arms and legs). Although it works well for person image generation, the requirement of pre-defined transformation components limits its application. Zhu et al. propose a more flexible method by using a progressive attention module to transform the source data. However, useful information may be lost during multiple transfers, which may result in blurry details. Han et al. use a flow-based method to transform the source information. However, they warp the sources at the pixel level, which means that further refinement networks are required to fill the holes caused by occlusion. Liu et al. and Li et al. warp the inputs at the feature level. But both of them need additional 3D human models to calculate the flow fields between sources and targets, which limits the application of these models. Our model does not require any supplementary information and obtains the flow fields in an unsupervised manner.
Image Spatial Transformation. Many methods have been proposed to give Convolutional Neural Networks spatial transformation capabilities. Jaderberg et al. introduce a differentiable Spatial Transformer module that estimates global transformation parameters and warps the features with an affine transformation. Several variants have been proposed to improve the performance. Zhang et al. add control points for free-form deformation. Another variant sends the transformation parameters instead of the transformed features to the network to avoid sampling errors. Jiang et al. demonstrate the poor gradient propagation of the commonly used bilinear sampling and propose a linearized multi-sampling method for spatial transformation.
Flow-based methods are more flexible than affine transformation methods and can deal with complex deformations. Appearance flow predicts flow fields and generates the targets by warping the sources. However, it warps image pixels instead of features, which prevents the model from generating new content. Besides, it requires the model to predict flow fields with the same resolution as the result images, which makes it difficult to capture large motions [39, 20]. Vid2vid deals with these problems by first predicting the ground-truth flow fields using FlowNet and then training its flow estimator in a supervised manner. It also uses a generator for occluded content generation. Warping the sources at the feature level can avoid these problems. In order to stabilize training, some papers obtain the flow fields using additional assumptions or supplementary information. One work assumes that keypoints are located on object parts that are locally rigid, and generates dense flow fields from sparse keypoints. Papers [16, 14] use 3D human models and visibility maps to calculate the flow fields between sources and targets. Another work proposes a sampling correctness loss to constrain the flow fields and achieves good results.
For the pose-guided person image generation task, target images are the deformation results of source images, which means that each position of the targets is related only to a local region of the sources. Therefore, we design a global-flow local-attention framework to force the attention coefficient matrix to be sparse. Our network architecture is shown in Figure 2. It consists of two modules: the Flow Field Estimator and the Target Image Generator. The Flow Field Estimator is responsible for estimating the motions between sources and targets. It generates global flow fields $w$ and occlusion masks $m$ for the local attention blocks. With $w$ and $m$, the Target Image Generator extracts the corresponding local feature block pair centered at each location to calculate the output feature using local attention. The following describes each module in detail. Please note that, to simplify the notation, we describe the network with a single local attention block. As shown in Figure 2, our model can be extended to use multiple attention blocks at different scales.
Let $s_s$ and $s_t$ denote the structure guidance of the source image $x_s$ and the target image respectively. The Flow Field Estimator predicts the motions between $s_s$ and $s_t$ in an unsupervised manner. It takes $x_s$, $s_s$ and $s_t$ as inputs and generates flow fields $w$ and occlusion masks $m$, where $w$ contains the coordinate offsets between sources and targets. The occlusion mask $m$, with continuous values between 0 and 1, indicates whether the information of a target position exists in the sources. The branches generating $w$ and $m$ share all weights other than their output layers.
Different from some previous flow-based methods [37, 3, 30], our model warps features instead of pixels, which enables it to generate new content and reduces the difficulty of the flow field estimation. However, networks easily become stuck in bad local minima when warping at the feature level [32, 21]. Therefore, we use a sampling correctness loss to constrain $w$. It calculates the similarity between the warped source and the ground-truth target at the VGG feature level. Let $v_s$ and $v_t$ denote the features generated by a specific layer of VGG19, and let $v_{s,w}$ be the result of warping $v_s$ with the flow field $w$. The sampling correctness loss calculates the relative cosine similarity between $v_{s,w}$ and $v_t$, where $\mu(\cdot)$ denotes the cosine similarity. The coordinate set $\Omega$ contains all positions in the feature maps, and $v_t^l$ denotes the feature of $v_t$ located at coordinate $l$. A normalization term, computed as the maximum similarity achievable for each target feature over the source features, is used to avoid the bias brought by occlusion.
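The loss above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: features are flattened to `(N, C)` arrays, and the normalization term is taken to be the best cosine similarity each target feature can reach over all source features (the exact form in the paper may differ); the function names are ours.

```python
import numpy as np

def cosine(a, b):
    """Row-wise cosine similarity between two (N, C) feature arrays."""
    return (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) *
                              np.linalg.norm(b, axis=-1) + 1e-8)

def sampling_correctness_loss(warped, source, target):
    """Relative cosine similarity between warped-source and target features."""
    mu = cosine(warped, target)                          # (N,)
    # normalization: best achievable similarity of each target feature
    # over all source features (guards against occluded positions)
    sims = (target @ source.T) / (
        np.linalg.norm(target, axis=1, keepdims=True) *
        np.linalg.norm(source, axis=1)[None, :] + 1e-8)
    mu_max = sims.max(axis=1)
    # small when the warp samples the semantically matching region
    return np.mean(np.exp(-mu / (mu_max + 1e-8)))
```

A perfect warp drives the ratio to 1 at every position, so the loss approaches its minimum; a scrambled warp yields a larger value.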
The sampling correctness loss can constrain the flow fields to sample semantically similar regions. However, as the deformations of image neighborhoods are highly correlated, it would be beneficial to exploit this relationship. Therefore, we further add a regularization term to our flow fields. This regularization term punishes local regions where the transformation is not an affine transformation. Let $c_t$ be the 2D coordinate matrix of the target feature map. The corresponding source coordinate matrix can be written as $c_s = c_t + w$. We use $c_t^l$ to denote the local patch of $c_t$ centered at location $l$, and $c_s^l$ the corresponding source patch. Our regularization assumes that the transformation between $c_t^l$ and $c_s^l$ is affine,

$$c_s^l = A_t^l \theta_l,$$

where $A_t^l$ stacks the homogeneous coordinates of the target patch and $\theta_l$ holds the affine parameters. The estimated affine transformation parameters can be solved using least-squares estimation as

$$\hat{\theta}_l = \left((A_t^l)^T A_t^l\right)^{-1} (A_t^l)^T c_s^l.$$

Our regularization is calculated as the $\ell_2$ distance of the error $c_s^l - A_t^l \hat{\theta}_l$.
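The least-squares step above is a few lines in practice. A minimal NumPy sketch (the function name `affine_residual` is ours): fit an affine map from the target patch coordinates to the flowed source coordinates and return the squared residual, which is zero exactly when the local deformation is affine.

```python
import numpy as np

def affine_residual(target_coords, source_coords):
    """Squared error of the best affine fit from target to source coords.

    target_coords, source_coords: (k*k, 2) local patch coordinates.
    """
    ones = np.ones((target_coords.shape[0], 1))
    A = np.hstack([target_coords, ones])           # homogeneous (k*k, 3)
    # least-squares estimate theta = (A^T A)^-1 A^T B
    theta, *_ = np.linalg.lstsq(A, source_coords, rcond=None)
    error = source_coords - A @ theta
    return np.sum(error ** 2)
```

Patches whose flow corresponds to a true affine motion incur no penalty, while sharp, inconsistent jumps in the flow field are punished.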
LPIPS and FID are objective metrics. JND is obtained by human subjective studies; it represents the probability that the generated images are mistaken for real images.
With the flow fields $w$ and occlusion masks $m$, our Target Image Generator is responsible for synthesizing the results using local attention. As shown in Figure 2, it takes $x_s$, $s_t$, $w$ and $m$ as inputs and generates the result image. Two branches are designed in the generator for different purposes. One is used to generate the feature of the target image using the structure information $s_t$, while the other extracts the feature from the source image $x_s$. As shown in Figure 3, our local attention block transforms vivid textures from the source features to the target features. We first extract local patches from the source and target feature maps respectively; the source patch is extracted using bilinear sampling, as the flowed coordinates may not be integers. Then, a kernel prediction network $M$ is used to predict the local kernel $k_l$.
The softmax function is used as the non-linear activation of the output layer of $M$. This operation forces the sum of $k_l$ to 1, which stabilizes the backward gradients. Finally, the flowed feature located at coordinate $l$ is calculated using content-aware attention over the extracted source feature patch, where $\odot$ denotes element-wise multiplication over the spatial domain and $\mathcal{P}$ represents the global average pooling operation. The warped feature map is obtained by repeating the previous steps for each location $l$.
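One such attention step can be sketched as below. This is an illustrative NumPy version under our own naming (`local_attention`, `raw_kernel`): the predicted kernel is softmax-normalized so its weights sum to 1, then applied over the spatial domain of the extracted source patch to produce one output feature vector.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over all entries of x."""
    e = np.exp(x - x.max())
    return e / e.sum()

def local_attention(source_patch, raw_kernel):
    """Content-aware pooling of one extracted source patch.

    source_patch: (k, k, C) feature patch at the flowed location.
    raw_kernel:   (k, k) predicted kernel logits for this location.
    """
    kernel = softmax(raw_kernel)                   # weights sum to 1
    # element-wise multiply over the spatial domain, then pool
    return (source_patch * kernel[..., None]).sum(axis=(0, 1))
```

With uniform logits this reduces to plain average pooling of the patch; a sharply peaked kernel instead selects a single source position, so the sampling receptive field is adjustable rather than fixed as in bilinear sampling.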
However, not all content of the target images can be found in the source images, because of occlusion or movements. In order to enable our Target Image Generator to generate new content, the occlusion mask $m$, with continuous values between 0 and 1, is used to select features between the warped source features and the generated target features.
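The mask-guided selection can be sketched as a convex combination of the two feature maps. This is an assumption-laden NumPy illustration (names and the exact blending direction are ours): values of the mask near 1 keep the warped source feature, values near 0 fall back to the generated target feature so occluded content can be synthesized from scratch.

```python
import numpy as np

def fuse(warped_src, target_feat, mask):
    """Blend warped source features with generated target features.

    warped_src, target_feat: (H, W, C) feature maps.
    mask: (H, W) occlusion mask with continuous values in [0, 1].
    """
    m = mask[..., None]                      # broadcast over channels
    return m * warped_src + (1.0 - m) * target_feat
```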
We train the network using a joint loss consisting of a reconstruction loss, an adversarial loss, a perceptual loss, and a style loss. The reconstruction loss is written as the $\ell_1$ distance between the generated image $\hat{x}_t$ and the ground truth $x_t$. The generative adversarial framework is employed to mimic the distribution of the ground truth $x_t$. The adversarial loss is written as

$$L_{adv} = \mathbb{E}\left[\log D(x_t)\right] + \mathbb{E}\left[\log\left(1 - D(\hat{x}_t)\right)\right],$$

where $D$ is the discriminator of the Target Image Generator. We also use the perceptual loss and style loss introduced in prior work on style transfer. The perceptual loss calculates the $\ell_1$ distance between activation maps of a pre-trained network,

$$L_{perc} = \sum_i \left\| \phi_i(\hat{x}_t) - \phi_i(x_t) \right\|_1,$$

where $\phi_i$ is the activation map of the $i$-th layer of the pre-trained network. The style loss calculates the statistical error between the activation maps as

$$L_{style} = \sum_i \left\| \mathcal{G}(\phi_i(\hat{x}_t)) - \mathcal{G}(\phi_i(x_t)) \right\|_1,$$

where $\mathcal{G}$ is the Gram matrix constructed from the activation maps. We train our model using a weighted sum of these losses as the overall objective.
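The perceptual and style terms can be sketched directly. A minimal NumPy version, assuming `acts_gen` and `acts_gt` are lists of `(H, W, C)` activation maps from a pre-trained network (the helper names are ours, and we use mean-reduced `l1` distances for simplicity):

```python
import numpy as np

def gram(act):
    """Gram matrix of one (H, W, C) activation map."""
    c = act.reshape(-1, act.shape[-1])       # (H*W, C)
    return c.T @ c / c.shape[0]              # (C, C) channel correlations

def perceptual_loss(acts_gen, acts_gt):
    """l1 distance between activation maps, summed over layers."""
    return sum(np.abs(a - b).mean() for a, b in zip(acts_gen, acts_gt))

def style_loss(acts_gen, acts_gt):
    """l1 distance between Gram matrices, summed over layers."""
    return sum(np.abs(gram(a) - gram(b)).mean()
               for a, b in zip(acts_gen, acts_gt))
```

The Gram matrix discards spatial layout and keeps channel co-occurrence statistics, which is why this term matches texture style rather than exact pixel placement.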
Datasets. Two datasets are used in our experiments: the person re-identification dataset Market-1501 and the DeepFashion In-shop Clothes Retrieval Benchmark. Market-1501 contains 32,668 low-resolution images (128 × 64). The images vary in terms of viewpoint, background, illumination, etc. The DeepFashion dataset contains 52,712 high-quality model images with clean backgrounds. We split the datasets following previous work; the personal identities of the training and testing sets do not overlap.
Metrics. We use the Learned Perceptual Image Patch Similarity (LPIPS) metric to calculate the reconstruction error. LPIPS computes the distance between the generated images and reference images in a perceptual feature domain, indicating the perceptual difference between the inputs. Meanwhile, the Fréchet Inception Distance (FID) is employed to measure the realism of the generated images. It calculates the Wasserstein-2 distance between the distributions of the generated images and the ground-truth images; better scores are assigned to smaller distribution errors. Besides, we perform a Just Noticeable Difference (JND) test to evaluate subjective quality. Volunteers are asked to choose the more realistic image from a pair of ground-truth and generated images. We report the fooling rate as the evaluation result.
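The Fréchet (Wasserstein-2) distance underlying FID has a closed form for Gaussians. As a simplified illustration we sketch the diagonal-covariance case in NumPy (the real FID fits a full covariance to Inception features and needs a matrix square root; the function name is ours):

```python
import numpy as np

def frechet_diag(mu1, var1, mu2, var2):
    """Frechet distance between two diagonal Gaussians.

    d^2 = |mu1 - mu2|^2 + tr(C1 + C2 - 2 * sqrt(C1 * C2)),
    specialized to diagonal covariances var1, var2.
    """
    return (np.sum((mu1 - mu2) ** 2)
            + np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2)))
```

Identical distributions score 0, and the distance grows with both the mean shift and the covariance mismatch, matching the "smaller is better" convention above.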
Network Implementation and Training Details. Auto-encoder structures are employed to design our Flow Field Estimator and Target Image Generator, with the residual block as the basic component. The kernel prediction network is designed as a fully connected net; we implement it with convolutional layers, which can obtain the kernels for different locations in parallel. For the Fashion dataset, two local attention blocks are used at two feature-map resolutions, with correspondingly sized local patches. For Market-1501, we use a single local attention block. We train our model in stages: the Flow Field Estimator is first trained to generate flow fields, and then the whole model is trained in an end-to-end manner with the ADAM optimizer. More details can be found in Section Implementation Details.
We compare our method with several state-of-the-art methods including Def-GAN, VU-Net, Pose-Attn and Intr-Flow. The quantitative evaluation results are shown in Table 1. For the Market-1501 dataset, we follow previous work and calculate the mask-LPIPS to alleviate the influence of the background. It can be seen that our model achieves competitive results on both datasets, which means that our model can generate realistic results with fewer perceptual reconstruction errors.
As objective metrics may not be sensitive to some artifacts, their results may mismatch actual subjective perception. Therefore, we conduct a just noticeable difference test on Amazon Mechanical Turk (MTurk). This experiment requires volunteers to choose the more realistic image from pairs of real and generated images. The test is performed on sampled images for each model and dataset, and each image is compared multiple times by different volunteers. The evaluation results are shown in Table 1. It can be seen that our model achieves the best result on the challenging Fashion dataset and competitive results on the Market-1501 dataset.
Typical results of the different methods are provided in Figure 4. For the Fashion dataset, VU-Net and Pose-Attn struggle to generate complex textures since these models lack efficient spatial transformation blocks. Def-GAN defines local affine transformation components (e.g. arms and legs) and can generate correct textures. However, the pre-defined affine transformations are not sufficient to represent complex spatial variance, which limits the performance of the model. The flow-based model Intr-Flow is able to generate vivid textures for front-pose images, but it may fail to generate realistic results for side-pose images due to the requirement of generating full-resolution flow fields. Meanwhile, this model needs 3D human models to generate the ground-truth flow fields for training. Our model regresses flow fields in an unsupervised manner and generates realistic images: not only the global pattern but also details such as the lace of clothes and shoelaces are reconstructed. For the Market-1501 dataset, our model generates correct poses with vivid backgrounds, while artifacts can be found in the results of the competitors, such as the sharp edges of Pose-Attn and the halo effects of Def-GAN.
The numbers of model parameters are also provided in Table 1 to evaluate the computational complexity. Thanks to our efficient attention blocks, our model does not require a large number of convolution layers. Thus, we achieve high performance with less than half of the parameters of the competitors.
In this subsection, we train several ablation models to verify our assumptions and evaluate the contribution of each component.
Baseline. Our baseline model is an auto-encoder convolutional network without any attention blocks. The source image $x_s$ and the structure guidances $s_s$ and $s_t$ are directly concatenated as the model inputs.
Global Attention Model (Global-Attn). The Global-Attn model is designed to compare the global-attention block with our local-attention block. We use a network architecture similar to our Target Image Generator for this model. The local attention blocks are replaced by global attention blocks, where the attention coefficients are calculated from the similarities between the source features and target features.
Bilinear Sampling Model (Bi-Sample). The Bi-Sample model is designed to evaluate the contribution of our content-aware sampling method described in Section 3.2. Both the Flow Field Estimator and the Target Image Generator are employed in this model. However, we use bilinear sampling as the sampling method in our local-attention blocks.
Full Model (Ours). We use both the flow-based local-attention blocks and the content-aware sampling method in this model.
The evaluation results of the ablation study are shown in Table 2. Compared with the Baseline model, the performance of the Global-Attn model is degraded, which means that an ill-suited attention block cannot efficiently transform the information. Improvements are obtained by flow-based methods such as the Bi-Sample model and our full model, which force the attention coefficient matrix to be sparse. However, the Bi-Sample model uses a pre-defined sampling method with a limited sampling receptive field, which may lead to unstable training. Our full model uses a content-aware sampling operation with an adjustable receptive field, which brings a further performance gain.
A subjective comparison of these ablation models can be found in Figure 5. The Baseline and Global-Attn models can generate correct structures. However, the textures of the source images are not well maintained, because these models generate images by first extracting global features and then propagating the information to specific locations; this process leads to the loss of details. The flow-based methods spatially transform the features and are able to reconstruct vivid details. However, the Bi-Sample model, which uses the pre-defined bilinear sampling method, cannot find the exact sampling locations, leading to artifacts in the final results.
We further provide visualizations of the attention maps in Figure 6. It can be seen that the Global-Attn model struggles to exclude irrelevant information; therefore, the extracted features are difficult to use for generating specific textures. The Bi-Sample model assigns a local patch to each generated location. However, incorrect features are often flowed due to the limited sampling receptive field. Our full model, using the content-aware sampling method, can flexibly adjust the sampling weights and avoid these artifacts.
In this section, we demonstrate the versatility of our global-flow local-attention module. Since our model does not require any additional information other than images and structure guidance, it can be flexibly applied to tasks requiring spatial transformation. Two example tasks are shown as follows.
View Synthesis. View synthesis requires generating novel views of objects or scenes from arbitrary input views. Since the appearance of different views is highly correlated, the existing information can be reassembled to generate the targets. We directly train our model on the ShapeNet dataset and generate novel target views from a single input view. The results can be found in Figure 7, with the results of appearance flow provided as a comparison. It can be seen that appearance flow struggles to generate occluded content, as it warps image pixels instead of features, while our model generates reasonable results.
Image Animation. Given an input image and a guidance video sequence depicting the structure movements, the image animation task requires generating a video containing the specified movements. This task can be solved by spatially moving the appearance of the sources. We train our model with the real videos in the FaceForensics dataset, which contains 1000 videos of news briefings from different reporters. The face regions are cropped for this task, and we use edge maps as the structure guidance. For each frame, the input source frame and the previously generated frames are used as references. The flow fields are calculated for each reference. We also employ an additional discriminator to capture the distributions of the residuals between video frames. The results can be found in Figure 8. It can be seen that our model generates realistic result videos with vivid movements. More applications can be found in Section Additional Results.
In this paper, we propose a global-flow local-attention module for person image generation. Our model can spatially transform the useful information from sources to targets. In order to train the model to reasonably transform and reassemble the inputs at the feature level in an unsupervised manner, we analyze the specific reasons causing instability in training. Meanwhile, we propose a framework to solve these problems. Our model first extracts global correlations between sources and targets to obtain flow fields. Then, the local attention is used to sample source features in a content-aware manner. Experiments show that our model can generate target images with correct poses while maintaining details. In addition, the ablation study shows that our improvements help the network find more reasonable sampling positions. Finally, we show that our model can be easily extended to address other spatial deformation tasks such as view synthesis and video animation.
Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (ECCV), pages 694–711. Springer, 2016.
StructureFlow: Image inpainting via structure-aware appearance flow. In IEEE International Conference on Computer Vision (ICCV), 2019.
The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 586–595, 2018.
Networks easily become stuck in bad local minima when warping the inputs at the feature level using flow-based methods with the traditional warping operation. Two main problems are pointed out in our paper; in this section, we explain them further.
Figure 9 shows the warping process of traditional bilinear sampling. For each location $l$ in the output features, a sampling position $l' = l + w_l$ is assigned by the offsets in the flow fields $w$. Then, the output feature is obtained by bilinearly interpolating the local region of the input features around $l'$, using the four integer neighbors obtained by rounding the coordinates up ($\lceil\cdot\rceil$) and down ($\lfloor\cdot\rfloor$). The backward gradients follow directly from this interpolation: the gradient of the flow fields is calculated as a difference between adjacent input features, and the gradient of the input features is calculated from the fractional offsets. The other terms can be obtained in a similar way.
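The interpolation and its flow gradient can be made concrete with a small sketch. This is an illustrative NumPy version (function names ours): `bilinear` samples one feature map at a non-integer location, and `grad_x` gives the analytic derivative with respect to the x sampling coordinate, which is exactly a difference of adjacent features.

```python
import numpy as np

def bilinear(feat, x, y):
    """Bilinearly sample a 2-D feature map feat[y, x] at (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    top = (1 - dx) * feat[y0, x0] + dx * feat[y0, x0 + 1]
    bot = (1 - dx) * feat[y0 + 1, x0] + dx * feat[y0 + 1, x0 + 1]
    return (1 - dy) * top + dy * bot

def grad_x(feat, x, y):
    """Analytic d(bilinear)/dx: a difference of adjacent features."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dy = y - y0
    return ((1 - dy) * (feat[y0, x0 + 1] - feat[y0, x0])
            + dy * (feat[y0 + 1, x0 + 1] - feat[y0 + 1, x0]))
```

Because the gradient is a difference of neighboring features, strongly correlated neighbors yield near-zero gradients, and the signal never reaches beyond the 2x2 sampling cell, which is the instability discussed above.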
Let us first explain the first problem. Unlike image pixels, image features change during the training process. The flow fields need reasonable input features to obtain correct gradients: their gradients are calculated from the differences between adjacent features, so if the input features are meaningless, the network cannot obtain correct flow fields. Meanwhile, the gradients of the input features are calculated from the sampling offsets, so they cannot obtain reasonable gradients without correct flow fields. In the worst case, where the model only uses the warped results as output and does not use any skip connections, the warping operation stops the gradients from back-propagating.
The other problem is caused by the limited receptive field of bilinear sampling. Even if we obtain meaningful input features by pre-training, we still cannot obtain stable gradient propagation. The gradient of the flow fields is calculated from the difference between adjacent features. However, since adjacent features are extracted from adjacent image patches, they often have strong correlations. Therefore, the gradients may be small at most positions, and large motions are hard to capture.
In our paper, we mentioned that the deformations of image neighborhoods are highly correlated and proposed a regularization loss to exploit this relationship. In this section, we further discuss this regularization loss. Our loss is based on an assumption: although the deformations of whole images are complex, the deformations of local regions such as arms and clothes are always simple and can be modeled using affine transformations. Based on this assumption, our regularization loss punishes local regions where the transformation is not affine. Figure 10 gives an example. Our regularization loss first extracts the local 2D coordinate matrices of the target and the flowed source. Then we estimate the affine transformation parameters by least squares. Finally, the residual error is taken as the regularization loss. Our loss assigns large errors to local regions that do not conform to the affine transformation assumption, thereby forcing the network to change the sampling regions.
We provide an ablation study to show the effect of the regularization loss. A model is trained without using the regularization term. The results are shown in Figure 11. It can be seen that with our regularization loss the flow fields are smoother and unnecessary jumps are avoided. We use the obtained flow fields to warp the source images directly to show the sampling correctness. Without the regularization loss, incorrect sampling regions are assigned due to sharp jumps in the flow fields, while relatively good results are obtained by using the affine assumption as a prior.
We provide additional comparison results in this section. The qualitative results are shown in Figure 12.
We provide additional results of the image animation task in Figure 15.
Basically, the auto-encoder structure is employed to design our network. We use the residual blocks shown in Figure 16 to build our model. Each convolutional layer is followed by instance normalization, and we use Leaky-ReLU as the activation function. Spectral normalization is employed in the discriminator to alleviate the notorious training instability of generative adversarial networks. The architecture of our model is shown in Figure 17. We note that since the images of the Market-1501 dataset are of low resolution, we use only one local attention block for this dataset. We design the kernel prediction network in the local attention block as a fully connected network, taking the concatenated local patches as input and outputting the attention kernel. Since it needs to predict attention kernels for all locations in the feature maps, we implement this network with a convolutional layer, which takes advantage of the parallel computing power of GPUs.
We train our model in stages. The Flow Field Estimator is first trained to generate flow fields; then we train the whole model in an end-to-end manner. We adopt the ADAM optimizer, with the discriminator trained at a learning rate of one-tenth that of the generator. The batch size is set to 8 for all experiments.