Attention-based Fusion for Multi-source Human Image Generation

05/07/2019 ∙ by Stéphane Lathuilière, et al. ∙ Università di Trento

We present a generalization of the person-image generation task, in which a human image is generated conditioned on a target pose and a set X of source appearance images. In this way, we can exploit multiple, possibly complementary images of the same person which are usually available at training and at testing time. The solution we propose is mainly based on a local attention mechanism which selects relevant information from different source image regions, avoiding the necessity to build specific generators for each specific cardinality of X. The empirical evaluation of our method shows the practical interest of addressing the person-image generation problem in a multi-source setting.




1 Introduction

The person image generation task, as proposed by Ma et al. [19], consists in generating “person images in arbitrary poses, based on an image of that person and a novel pose”. This task has recently attracted a lot of interest in the community because of its potential applications, such as computer-graphics based manipulations [34] or data augmentation for training person re-identification [41, 16] or human pose estimation [5] systems. Previous works in this field [19, 15, 39, 26, 3, 25] assume that the generation task is conditioned on two variables: the appearance image of a person (we call this variable the source image) and a target pose, automatically extracted from a different image of the same person using a Human Pose Estimator (HPE).

Using abundant person-specific data, the quality of the generated images can potentially be improved. For instance, a training dataset specific to each target person can be recorded [6]. Another solution is to build a full 3D model of the target person [17]. However, these approaches lack flexibility and require expensive data collection.

Figure 1: Multi-source Human Image Generation: an image of a person in a novel pose is generated from a set of images of the same person.

In this work we propose a different direction which relies on a small, variable number of source images (e.g., from 2 to 10). We call the corresponding task multi-source human image generation. As far as we know, no previous work has investigated this direction yet. We believe this generalization of the person-image generation task is interesting because multiple source images, when available, can provide richer appearance information. This data redundancy can possibly be exploited by the generator in order to compensate for partial occlusions, self-occlusions or noise in the source images. More formally, we define our multi-source human image generation task as follows. We assume that a set $\mathcal{X}$ of $M$ ($M \geq 1$) source images is given and that these images depict the same person with the same overall appearance (e.g., the same clothes, haircut, etc.). Besides, a unique target body pose $p_\tau$ is provided, typically extracted from a target image not contained in $\mathcal{X}$. The multi-source human image generation task consists in generating a new image $\hat{x}$ with an appearance similar to the general appearance pattern represented in $\mathcal{X}$ but in the pose $p_\tau$ (see Fig. 1). Note that $M$ is not fixed a priori, and we believe this characteristic of the task is important for practical applications, in which the same dataset can contain multiple source images of the same person but with unknown and variable cardinalities.

Most previous methods on single-source human image generation [26, 15, 19, 34, 39, 9, 25, 16] are based on variants of the U-Net generator proposed by Isola et al. [13]. A common, general idea in these methods is that the conditioning information (e.g., the source image and/or the target pose) is transformed into the desired synthetic image using the U-Net skip connections, which shuttle information between those layers in the encoder and in the decoder having a corresponding resolution (see Sec. 3). However, when the cardinality $M$ of the source images is not fixed a priori, as in our proposed task, a “plain” U-Net architecture cannot be used, since its number of input neurons is fixed a priori. For this reason, we propose to modify the U-Net generator by introducing an attention mechanism. Attention is widely used to handle a variable-length input in a deep network [2, 36, 33, 32, 10, 31] and, without loss of generality, it can be thought of as a mechanism in which multiple input representations are averaged (i.e., summed) using some saliency criterion emphasizing the importance of specific representations with respect to the others. In this paper we propose to use attention in order to let the generator decide which specific image locations of each source image are the most trustworthy and informative at different convolutional layer resolutions. Specifically, we keep the standard encoder-decoder general partition typical of the U-Net (see Sec. 3) but we propose three novelties. First, we introduce an attention-based decoder ($A$) which fuses the feature representations of each source. Second, we encode the target pose and each source image with an encoder ($E$) which processes each source image $x^i$ independently of the others and locally deforms each $x^i$, performing a target-pose driven geometric “normalization” of $x^i$. Once normalized, the source images can be compared to each other in $A$, assigning location- and source-specific saliency weights which are used for fusion. Finally, we use a multi-source adversarial loss that employs a single conditional discriminator to handle any arbitrary number of source images.

2 Related work

Most image generation approaches are based either on Variational Autoencoders (VAEs) [14] or on Generative Adversarial Networks (GANs) [11]. GANs have been extended to conditional GANs [28], where the image generation depends on some input variable. For instance, in [13], an input image is “translated” into a different representation using a U-Net generator.

The person generation task (Sec. 1) is a specific case of a conditioned generation process, where the conditioning variables are the source and the target images. Most of the previous works use conditional GANs and a U-Net architecture. For instance, Ma et al. [19] propose a two-step training procedure: pose generation and texture refinement, both obtained using a U-Net architecture. Recently, this work has been extended in [20] by learning disentangled representations of the pose, the foreground and the background. Following [19], several methods for pose-guided image generation have been recently proposed [15, 39, 26, 3, 25]. All these approaches are based on the U-Net. However, the original U-Net, having a fixed number of input images, cannot be directly used for multi-source image generation as defined in Sec. 1. Siarohin et al. [26] modify the U-Net using deformable skip connections which align the input image features with the target pose. In this work we use an encoder similar to their proposal in order to align the source images with the target pose, but we introduce a pose stream which measures the similarity between the source and the target pose. Moreover, like the aforementioned works, [26] is single-source and uses a “standard” U-Net decoder [13].

Other works on image-generation rely on a strong supervision during training or testing. For instance, Neverova et al. [21] use a dense-pose estimator [12] trained using image-to-surface correspondences [12]. Dong et al. [8] use an externally trained model for image segmentation in order to improve the generation process. Zanfir et al. [38] estimate the human 3D-pose using meshes and identify the mesh regions that can be transferred directly from the input image mesh to the target mesh. However, these methods cannot be directly compared with most of the other works, including ours, which rely only on a sparse keypoint detection. Hard data-collection constraints are used also in [6], where a person and a background specific model are learned for video generation. This approach requires that the target person moves for several minutes covering all the possible poses and that a new model is trained specifically for each target person. Similarly, Liu et al. [17] compute the 3D human model by combining several minutes of video. In contrast with these works, our approach is based on fusing only a few source images in random poses and in variable number, which we believe is important because it makes it possible to exploit existing datasets where multiple images are available for the same person. Moreover, our network does not need to be trained for each specific person.

Sun et al. [29] propose a multi-source image generation approach whose goal is to generate a new image according to a target-camera position. Note that this task is different from what we address in this paper (Sec. 1), since a human pose describes an articulated object by means of a set of joint locations, while a camera position describes a viewpoint change but does not deal with source-to-target object deformations. Specifically, Sun et al. [29] represent the camera pose with either a discrete label (e.g., left, right, etc.) or a 6DoF vector and then they generate a pixel-flow which estimates the “movement” of each source-image pixel. Multiple images are integrated using a Convolutional LSTM [24] and confidence maps. Most of the reported results concern 3D synthetic (rigid) objects, while a few real scenes are also used but only with a limited viewpoint change.

3 Attention-based U-Net

(a) A schematic representation of the proposed attention decoder architecture
(b) Zoom on the attention module
Figure 2: Illustration of the proposed Attention U-Net. For the sake of clarity, in this figure, we consider the case in which we use only two conditioning images ($M = 2$). The colored rectangles represent the feature maps. The attention modules (dashed purple rectangles) in figure (a) are detailed in figure (b). The dashed double arrows denote normalization across attention maps, $\odot$ denotes the element-wise product and $\oplus$ denotes the concatenation along the channel axis.

3.1 Overview

We first introduce some notation and provide a general overview of the proposed method. Referring to the multi-source human image generation task defined in Sec. 1, we assume a training set of $N$ samples is given, each sample being a pair $(\mathcal{X}_n, x_n^\tau)$, where $\mathcal{X}_n = \{x_n^i\}_{i=1}^{M_n}$ is a set of $M_n$ source images of the same person sharing a common appearance and $x_n^\tau$ is the target image. Every sample image has the same size $H \times W$. Note that the source-set size $M_n$ is variable and depends on the person identity $n$. Given an image $x$ depicting a person, we represent the body pose as a set of 2D keypoints $P(x) = (p_1, \dots, p_K)$, where each $p_k$ is the pixel location of a body joint in $x$. The body pose can be estimated from an image using an external HPE. The target pose is denoted by $p_\tau = P(x_n^\tau)$.

Our method is based on a conditional GAN approach, where the generator $G$ follows a general U-Net architecture [13] composed of an encoder and a decoder. A U-Net encoder is a sequence of convolutional and pooling layers, which progressively decrease the spatial resolution of the input representation. As a consequence, a specific activation in a given encoder layer has a receptive field that progressively increases with the layer depth, so gradually encoding “contextual” information. Conversely, the decoder is composed of up-convolution layers and, importantly, each decoder layer is connected to the corresponding layer in the encoder by means of skip connections, which concatenate the encoder-layer feature maps with the decoder-layer feature maps [13]. Finally, Isola et al. [13] use a conditional discriminator $D$ in order to discriminate between real and fake “image transformations”.

We modify the aforementioned framework in three main aspects. First, we use $M_n$ replicas of the same encoder $E$ in order to encode the geometrically normalized source images together with the target pose. Second, we propose an attention-based decoder $A$ that fuses the feature maps provided by the encoders. Finally, we propose a multi-source adversarial loss $\mathcal{L}_{M\text{-}GAN}$.

Fig. 2 shows the architecture of $G$. Given a set $\mathcal{X}_n$ of $M_n$ source images, $E$ encodes each source image $x^i \in \mathcal{X}_n$ together with the target pose. Similarly to the standard U-Net, for a given source image $x^i$, each encoder outputs $R$ feature maps $\xi_r^i$, $r \in \{1, \dots, R\}$, for $R$ different-resolution blocks. Each $\xi_r^i$ is aligned with the target pose (Sec. 3.3). This alignment acts as a geometric “normalization” of each $\xi_r^i$ with respect to $p_\tau$ and makes it possible to compare $\xi_r^i$ with $\xi_r^j$ ($i \neq j$). Finally, each tensor $\xi_r^i$ jointly represents pose and appearance information at resolution $r$.

3.2 The Attention-based Decoder

$A$ is composed of $R$ blocks. Similarly to the standard U-Net, the spatial resolution increases symmetrically with respect to the blocks in $E$. Therefore, to highlight this symmetry, the decoder blocks are indexed from $R$ to 1. In the current $r$-th block, the image $\hat{x}$ which is going to be generated is represented by a tensor $\phi_r$. This representation is progressively refined in the subsequent blocks using an attention-based fusion of the encoder feature maps $\{\xi_r^i\}_{i=1}^{M}$. We call $\phi_r$ the latent representation of $\hat{x}$ at resolution $r$, and $\phi_r$ is recursively defined starting from $r = R$ till $r = 1$ as follows.

The initial latent representation $\phi_R$ is obtained by averaging the output tensors of the last layer of $E$ (Fig. 2):

$$\phi_R = \frac{1}{M} \sum_{i=1}^{M} \xi_R^i \tag{1}$$

Note that each spatial position in $\phi_R$ corresponds to a large receptive field in the original image resolution which, if $R$ is sufficiently large, may include the whole initial image. As a consequence, we can think of $\phi_R$ as encoding general contextual information on the source set $\mathcal{X}$.

For each subsequent block $r \in \{R-1, \dots, 1\}$, $\phi_r$ is computed as follows. Given $\phi_{r+1}$, we first perform an up-sampling on $\phi_{r+1}$ followed by a convolution layer in order to obtain a tensor $\psi_r$ with the same spatial resolution as the tensors $\xi_r^i$. $\psi_r$ is then fed to an attention mechanism in order to estimate how the different tensors $\xi_r^i$ should be fused into a single final tensor $F_r$:

$$F_r = \sum_{i=1}^{M} Att(\psi_r, \xi_r^i) \odot \xi_r^i \tag{2}$$

where $\odot$ denotes the element-wise product and $Att(\cdot, \cdot)$ is the proposed attention module.

In order to reduce the number of weights involved in computing Eq. (2), we factorize $Att(\psi_r, \xi_r^i)$ using a spatial-attention map $g(\psi_r, \xi_r^i)$ (which is channel independent) and a channel-attention vector $f(\psi_r, \xi_r^i)$ (which is spatially independent). Specifically, at each spatial coordinate $(h, w)$, $g$ compares the current latent representation $\psi_r[h, w]$ with $\xi_r^i[h, w]$ and assigns a saliency weight to $\xi_r^i[h, w]$ which represents how significant/trustworthy $\xi_r^i[h, w]$ is with respect to $\psi_r[h, w]$. The function $g$ is implemented by taking the concatenation of $\psi_r$ and $\xi_r^i$ as input and then using a convolution layer. Similarly, $f$ is implemented by means of global average pooling on the concatenation of $\psi_r$ and $\xi_r^i$ followed by two fully-connected layers. We employ sigmoid activations on both $g$ and $f$. Combining together $g$ and $f$, we obtain:

$$A_r^i[h, w, c] = g(\psi_r, \xi_r^i)[h, w] \cdot f(\psi_r, \xi_r^i)[c] \tag{3}$$

Importantly, $A_r^i$ is not spatially or channel normalized. This is because a normalization would enforce that, overall, each source image is used in the same proportion. Conversely, without normalization, given, for instance, a non-informative source $x^i$ (e.g., completely black), the attention module can correspondingly produce a null saliency tensor $A_r^i$. Nevertheless, the final attention tensor $Att(\cdot, \cdot)$ in Eq. (2) is normalized across the sources in order to assign a relative importance to each source:

$$Att(\psi_r, \xi_r^i) = \frac{A_r^i}{\sum_{j=1}^{M} A_r^j} \tag{4}$$

where the division is element-wise. Finally, the new latent representation at resolution $r$ is obtained by concatenating $\psi_r$ with $F_r$:

$$\phi_r = \psi_r \oplus F_r \tag{5}$$

where $\oplus$ is the tensor concatenation along the channel axis.
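The fusion step of one decoder block can be sketched in NumPy as follows. This is only an illustrative sketch, not the authors' implementation: the names `saliency` and `fuse`, the use of a 1×1 convolution for the spatial attention, the ReLU between the two fully-connected layers, the absence of biases and the small `eps` added to the denominator are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def saliency(psi, xi, w_g, w1, w2):
    """Un-normalised saliency of one source: spatial map times channel vector.

    psi, xi : (H, W, C) arrays (decoder tensor and one aligned source feature map).
    w_g     : (2C,) weights of an assumed 1x1 convolution giving the spatial map.
    w1, w2  : (2C, hidden) and (hidden, C) fully-connected weights (hypothetical).
    """
    z = np.concatenate([psi, xi], axis=-1)           # (H, W, 2C)
    g = sigmoid(z @ w_g)                             # spatial attention, (H, W)
    pooled = z.mean(axis=(0, 1))                     # global average pooling, (2C,)
    f = sigmoid(np.maximum(pooled @ w1, 0.0) @ w2)   # channel attention, (C,)
    return g[:, :, None] * f[None, None, :]          # (H, W, C)

def fuse(psi, xis, w_g, w1, w2, eps=1e-8):
    """Normalise the saliencies across sources, fuse the sources, concatenate with psi."""
    A = np.stack([saliency(psi, xi, w_g, w1, w2) for xi in xis])  # (M, H, W, C)
    att = A / (A.sum(axis=0, keepdims=True) + eps)   # relative importance of each source
    fused = (att * np.stack(xis)).sum(axis=0)        # attention-weighted sum, (H, W, C)
    return np.concatenate([psi, fused], axis=-1), att
```

Because the same weights are applied to every source and the normalisation runs over the source axis, the function accepts any number of sources without changing the number of parameters.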

3.3 The Pose-based Encoder

Rather than using a generic convolutional encoder as in [13], we use a task-specific encoder designed to work synergistically with our proposed attention model. Our pose-based encoder $E$ is similar to the encoder proposed in [26] but it also contains a dedicated stream which is used to compare the source and the target pose with each other.

Figure 3: The Pose-based encoder. For simplicity, we show only 4 blocks ($R = 4$). Each parallelepiped represents the feature maps obtained after convolution and max-pooling. The circles denote deformations.

In more detail, $E$ is composed of two streams (see Fig. 3). The first stream, referred to as pose stream, is used to represent pose information and to compare the target pose with the pose of the person in the source image. Specifically, the target pose $p_\tau$ is represented using a tensor $P_\tau$ composed of $K$ heatmaps $P_\tau[k]$. For each joint $p_k \in p_\tau$, a heatmap $P_\tau[k]$ is computed using a Gaussian kernel centered in $p_k$ [26]. Similarly, given a source image $x^i$, we extract the pose $P(x^i)$ using [5] and we describe it using a tensor $P_i$. The tensors $P_\tau$ and $P_i$ are concatenated and input to the pose stream, which is composed of a sequence of convolutional and pooling layers. The purpose of the pose stream is twofold. First, it provides the target pose to the decoder. Second, it encodes the similarity between the $i$-th source pose and the target pose. This similarity is of crucial importance for our attention mechanism to work (Sec. 3.2), since a source image with a pose similar to the target pose is likely to be more trustworthy for transferring appearance information to the final generated image. For instance, a leg in $x^i$ with a pose closer to $p_\tau$ than the corresponding leg in $x^j$ should most likely be preferred for encoding the leg appearance.
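The heatmap representation fed to the pose stream can be sketched as below. This is a minimal illustration under stated assumptions: the function name `pose_heatmaps`, the value of `sigma`, the example keypoints, and the convention of leaving an all-zero map for an undetected joint are not taken from the paper.

```python
import numpy as np

def pose_heatmaps(keypoints, height, width, sigma=6.0):
    """Return an (H, W, K) tensor with one Gaussian heatmap per joint.

    keypoints : sequence of (row, col) pixel locations, or None for an undetected joint.
    sigma     : standard deviation of the Gaussian in pixels (illustrative value).
    """
    rows = np.arange(height)[:, None]
    cols = np.arange(width)[None, :]
    maps = np.zeros((height, width, len(keypoints)), dtype=np.float32)
    for k, kp in enumerate(keypoints):
        if kp is None:
            continue  # assumed convention: missing joints give an empty heatmap
        r, c = kp
        maps[:, :, k] = np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2.0 * sigma ** 2))
    return maps

# Toy example with K = 3 joints on a 128x64 image.
target = pose_heatmaps([(20, 30), (64, 32), None], 128, 64)
source = pose_heatmaps([(22, 28), (60, 35), (100, 40)], 128, 64)
# Target and source heatmaps are concatenated along the channel axis before the pose stream.
pose_stream_input = np.concatenate([target, source], axis=-1)
```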

The second stream, called source stream, takes as input the concatenation of the RGB image $x^i$ and its pose representation $P_i$. $P_i$ is provided as input to the source stream in order to guide the source-stream convolutional layers in extracting relevant information which may depend on the joint locations. The output of each convolutional layer of the source stream is a tensor (green blocks in Fig. 3). This tensor is then deformed according to the difference between $P(x^i)$ and $p_\tau$ (the circles in Fig. 3). Specifically, we use body part-based affine deformations as in [26] to locally deform the source-stream feature maps at each given layer and then concatenate the obtained tensor with the corresponding-layer pose-stream tensor. In this way we get a final tensor $\xi_r^i$ for each of the $R$ different layers in $E$ ($r \in \{1, \dots, R\}$). Each $\xi_r^i$ is a representation of $x^i$ aligned with $p_\tau$ and it is obtained independently of the other source images $x^j$ ($j \neq i$).

Given a set $\mathcal{X}_n$ of $M_n$ source images, we apply $M_n$ replicas of the encoder $E$, one to each $x^i \in \mathcal{X}_n$, producing the set of output tensors $\{\xi_r^i\}$ that are input to the decoder described in Sec. 3.2.

3.4 Training

We train the whole network in an end-to-end fashion combining a reconstruction loss with an adversarial loss. For the reconstruction loss, we use the nearest-neighbour loss introduced in [26], which exploits the convolutional maps of an external network (VGG-19 [27], trained on ImageNet [7]) at the original image resolution in order to compare each location of the generated image $\hat{x}$ with a local neighbourhood of the ground-truth image $x^\tau$. This reconstruction loss is more robust to small spatial misalignments between $\hat{x}$ and $x^\tau$ than other common losses such as the $L_1$ loss.

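On the other hand, in our multi-source problem, the employed adversarial loss has to handle a varying number of sources. We use a single-source discriminator conditioned on only one source image [13]. More precisely, we use $M$ discriminators $D$ that share their parameters and independently process each $x^i$. Each $D$ takes as input the concatenation of four tensors: $(x, P_\tau, x^i, P_i)$, where $x$ is either the ground-truth real image $x^\tau$ or the generated image $\hat{x}$, and $P_\tau$ and $P_i$ are the target and source pose heatmaps (Sec. 3.3). Differently from other multi-source losses [37, 1, 22], we employ a conditional discriminator in order to exploit the information contained in the source image and the pose heatmaps. The GAN loss for the $i$-th source image is defined as:

$$\mathcal{L}_{GAN}^{i}(G, D) = \mathbb{E}\left[\log D(x^\tau, P_\tau, x^i, P_i)\right] + \mathbb{E}\left[\log\left(1 - D(\hat{x}, P_\tau, x^i, P_i)\right)\right] \tag{6}$$

where $\hat{x} = G(p_\tau, \mathcal{X})$ and, with a slight abuse of notation, $\mathbb{E}$ means the expectation computed over pairs of single-source and target image extracted at random from the training set. Using Eq. (6), the multi-source adversarial loss ($\mathcal{L}_{M\text{-}GAN}$) is defined as:

$$\mathcal{L}_{M\text{-}GAN}(G, D) = \sum_{i=1}^{M} \mathcal{L}_{GAN}^{i}(G, D) \tag{7}$$

Putting all together, the final training loss is given by:

$$\mathcal{L}(G, D) = \mathcal{L}_{M\text{-}GAN}(G, D) + \lambda \, \mathcal{L}_{NN}(G) \tag{8}$$

where $\mathcal{L}_{NN}$ is the nearest-neighbour reconstruction loss and the weight $\lambda$ is set to the same value in all our experiments.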
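To illustrate how one parameter-shared conditional discriminator can serve any number of sources, the following sketch evaluates a per-source GAN term and accumulates it over the sources. It is a hedged illustration only: the function name `multi_source_gan_loss`, the `disc` callable, the plain summation over sources, the `eps` term and the toy tensors are assumptions, and a real discriminator would be a trained network rather than the constant stand-in used here.

```python
import numpy as np

def multi_source_gan_loss(disc, target_img, generated_img, target_pose,
                          sources, source_poses, eps=1e-8):
    """Accumulate the conditional GAN objective over the sources.

    disc : callable mapping an (H, W, channels) array to a probability in (0, 1);
           a hypothetical stand-in for the shared discriminator.
    """
    total = 0.0
    for x_i, p_i in zip(sources, source_poses):
        # The discriminator always sees four concatenated tensors, whatever M is.
        real = np.concatenate([target_img, target_pose, x_i, p_i], axis=-1)
        fake = np.concatenate([generated_img, target_pose, x_i, p_i], axis=-1)
        total += np.log(disc(real) + eps) + np.log(1.0 - disc(fake) + eps)
    return total

# Toy example: 3 sources, K = 2 pose heatmaps, and an uninformative discriminator.
K = 2
x_tau, x_hat = np.zeros((8, 4, 3)), np.ones((8, 4, 3))
p_tau = np.zeros((8, 4, K))
srcs = [np.zeros((8, 4, 3)) for _ in range(3)]
poses = [np.zeros((8, 4, K)) for _ in range(3)]
loss = multi_source_gan_loss(lambda t: 0.5, x_tau, x_hat, p_tau, srcs, poses)
```

Since the discriminator input has a fixed number of channels, the same call works unchanged when the lists contain one source or ten.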

4 Experiments

Market-1501 DeepFashion
Model M SSIM IS mask-SSIM mask-IS SSIM IS
Ma et al. [19] 1
Ma et al. [20] 1
Esser et al. [9] 1
Siarohin et al. [26] 1
Ours 1
Ours 2
Ours 3
Ours 5
Ours 7 - -
Ours 10 - -
Table 1: Comparison with the state of the art on the Market-1501 and the DeepFashion datasets.

In this section we evaluate our method both qualitatively and quantitatively, adopting the evaluation protocol proposed by Ma et al. [19]. We train $G$ and $D$ for 60k iterations, using the Adam optimizer (learning rate: , , ). We use instance normalization [30] as recommended in [13]. The networks used for $E$ and $D$ have the same convolutional-layer dimensions and normalization parameters used in [26]. The up-convolutional layers of $A$ also have the same dimensions as the corresponding decoder used in [26]. Finally, the number of the hidden-layer neurons used to implement $f$ (Sec. 3.2) is . For a fair comparison with single-source person generation methods [19, 20, 9, 26], we adopt the HPE proposed in [5].

Even if there is no constraint on the cardinality $M_n$ of the source images, in order to simplify the implementation, we train and test our networks using different steps, each step having $M_n$ fixed for all the samples in the training set. Specifically, we initially train $E$, $A$ and $D$ with a fixed $M_n$. Then, we fine-tune the model with the desired $M_n$ value, except for single-source experiments where $M_n = 1$ (see Sec. 4.4).

4.1 Datasets

The person re-identification Market-1501 dataset [40] is composed of 32,668 images of 1,501 different persons captured from 6 surveillance cameras. This dataset is challenging because of the high diversity in pose, background, viewpoint and illumination, and because of the low-resolution images (128×64). To train our model, we need tuples of images of the same person in different poses. As this dataset is relatively noisy, we follow the preprocessing described in [26]. The images where no human body is detected using the HPE are removed. Other methods [19, 20, 9, 26] generate all the possible pairs for each identity. However, in our approach, since we consider tuples of size $M + 1$ ($M$ sources and 1 target image), considering all the possible tuples is computationally infeasible. In addition, Market-1501 suffers from a high person-identity imbalance, and computing all the possible tuples would exponentially increase this imbalance. Hence, we generate tuples randomly in such a way that we obtain the same identity distribution as the one obtained when sampling all the possible pairs. In addition, this solution also allows for a fair comparison with single-source methods which sample based on pairs. Eventually, we get 263K tuples for training. For testing, following [19], we randomly select 12K tuples such that no person is in common between the training and the test split.
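One way to realise such a sampling scheme is sketched below. It is an assumption-laden illustration rather than the protocol actually used: the function name `sample_tuples`, the choice of counting ordered pairs, i.e. $n(n-1)$ per identity with $n$ images, and the decision to skip identities with fewer than $M + 1$ images are all hypothetical.

```python
import random

def sample_tuples(images_by_id, num_sources, seed=0):
    """Sample (identity, sources, target) tuples with a pair-like identity distribution.

    images_by_id : dict mapping an identity to the list of its image names.
    num_sources  : M, the number of source images in each tuple.
    """
    rng = random.Random(seed)
    tuples = []
    for identity, images in images_by_id.items():
        n = len(images)
        if n < num_sources + 1:
            continue  # not enough distinct images for M sources plus one target
        # Draw as many tuples as there are ordered (source, target) pairs, so that
        # each identity keeps the share it would have under exhaustive pair sampling.
        for _ in range(n * (n - 1)):
            chosen = rng.sample(images, num_sources + 1)
            tuples.append((identity, chosen[:-1], chosen[-1]))
    return tuples

data = {"a": ["a1", "a2", "a3"], "b": ["b1", "b2", "b3", "b4"]}
tuples = sample_tuples(data, num_sources=2)
```

With this scheme the number of tuples grows quadratically with the number of images per identity, as in pair sampling, instead of combinatorially with $M$.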

The DeepFashion dataset (In-shop Clothes Retrieval Benchmark) [18] consists of 52,712 clothes images with a resolution of 256×256 pixels. For each outfit, about 5 images with different viewpoints and poses are available. Thus, we only perform experiments using up to $M = 5$ sources. Following the training/test split adopted in [19], we create tuples of images following the same protocol as for the Market-1501 dataset. After removing the images where the HPE does not detect any human body, we finally collect about 89K tuples for training and 12K tuples for testing.

4.2 Metrics

Choosing evaluation metrics for generation tasks is a problem in itself. In our experiments we adopt the evaluation metrics proposed in [19], which are used by most of the single-source methods. Specifically, we use: Structural Similarity (SSIM) [35], Inception Score (IS) [23] and their corresponding masked versions mask-SSIM and mask-IS [19]. The masked versions of the metrics are obtained by masking out the image background. The motivation behind the use of masked metrics is that no background information is given to the network, and therefore, the network cannot guess the correct background of the target image. For a fair comparison, we adopt the masks as defined in [19].

It is worth noting that the SSIM-based metrics compare the generated image with the ground truth. Thus, they measure how well the model transfers the appearance of the person from the source image. Conversely, IS-based metrics evaluate the distribution of generated images, jointly assessing the degree of realism and diversity of the generated outcomes, but do not take into account any similarity with the conditioning variables. These two metrics are complementary to each other [4] and should be interpreted jointly.

4.3 Comparison with previous work

Figure 4: A qualitative comparison on the Market-1501 dataset (panel labels: [26], [9], [19], Ours, Attention Saliency). The first column shows the source images. Note that [26, 19, 9] use only the leftmost source image. The target poses are given by the ground truth images in column 2. In column 4, we show the results obtained by our model while increasing the number of source images. The source images from the first column are added from left to right as $M$ increases. In the last column we show the saliency maps predicted by our model when using all the five source images. These maps are shown in the same order as the source images.

Quantitative comparison. In Tab. 1 we show a quantitative comparison with state-of-the-art single-source methods. Note that, except for [20], none of the compared methods, including ours, is conditioned on background information. On the other hand, the mask-based metrics focus only on the region of interest (i.e., the foreground person) and they are not biased by the randomly generated background. For these reasons, we believe the mask-based metrics are the most informative ones. However, on the DeepFashion dataset, following [20], we do not report the masked values, since the background is uniform in most of the images. On both datasets, we observe that the SSIM and mask-SSIM increase when we input more images to our model. This confirms the idea that multi-source image generation is an effective direction to improve the generation quality. Furthermore, it illustrates that the proposed model is able to combine the information provided by the different source images. Interestingly, our method reaches high SSIM scores while keeping high IS values, thus showing that it is able to better transfer the appearance without losing image quality and diversity.

Concerning the comparison with the state of the art, our method reports the highest performance according to both the mask-SSIM and the mask-IS metrics on the Market-1501 dataset when we use 10 source images. When we employ fewer images, only Siarohin et al. [26] obtain a better mask-SSIM, but at the cost of a significantly lower IS. Similarly, we observe that [9] achieves a really high SSIM score, but again at the cost of a drastically lower IS, meaning that we can generate more diverse and higher quality images. Moreover, we notice that [9] obtains a lower mask-SSIM. This seems to indicate that their high SSIM score is mostly due to a better background generation. Similar conclusions can be drawn for the DeepFashion dataset. We obtain the best IS and rank second in SSIM. Only [9] outperforms our model in terms of SSIM, at the cost of a much lower IS value. The gain in performance seems smaller than on the Market-1501 dataset. This is probably due to the lower pose diversity of the DeepFashion dataset.

Qualitative comparison. Fig. 4 shows some images obtained using the Market-1501 dataset. We compare our results with the images generated by three methods for which the code is publicly available [9, 19, 26]. The source images are shown in the first column. Note that the single-source methods use only the leftmost image. The target pose is extracted from the ground-truth target image. We display the generated images varying $M$. We also show the corresponding saliency tensors $A_r^i$ (see Sec. 3.2) at the highest resolution $r = 1$. Specifically, we use $M = 5$ and, at each location $(h, w)$ in $A_1^i$, we average the values over the channel axis ($c$) using a color scale from dark blue (0 values) to orange (1 values).

The qualitative results confirm the quantitative evaluation, since we clearly obtain better images when we increase the number of source images. The images become sharper, show more details and contain fewer artifacts. By looking at the saliency maps, we observe that our model mostly uses the source images in which the human pose is similar to the target pose. For instance, in rows 1 and 4, the model has high attention values for the two frontal images but very low values for the back-view images. Interestingly, in row 1, among the two source images with a pose similar to the target pose, the saliency values are lower for the more blurry image. This illustrates that, between two images with similar poses, our attention model favours the image with the highest quality. Concerning the comparison with the state of the art, we observe that our model better preserves the details of the source images. In general, we obtain higher-quality details and fewer artifacts. For instance, in row 3, the three other methods generate neither the white hat nor the small logo of the shirt. In particular, the V-UNet architecture proposed in [9] generates realistic images but with less accurate details. This can be easily observed in the last two rows, where the colors of the clothes are wrongly generated.

4.4 Ablation study and qualitative analysis

In this section we present an ablation study to clarify the impact of each part of our proposal on the final performance. We first describe the compared methods, obtained by “amputating” important parts of the full-pipeline presented in Sec. 3. The discriminator architecture is the same for all the methods.

  • Avg No-d: In this baseline version of our method we use the encoder described in Sec. 3.3 without the deformation-based alignment of the features with the target pose. For the decoder, we use a standard U-Net decoder without the attention module. More precisely, the tensors provided by the skip connections of each encoder are simply averaged and concatenated with the decoder tensors as in the original U-Net. In other words, Eq. (2) is replaced by an average over the sources at each convolution layer of the decoder, similarly to Eq. (1).

  • Avg: We use the encoder described in Sec. 3.3 and the same decoder of Avg No-d.

  • Att. 2D: We use an attention model similar to the full model described in Sec. 3.2. However, in Eq. (3), the channel attention $f$ is not used.

  • Full: This is the full-pipeline as described in Sec. 3.

Market-1501 DeepFashion
Model SSIM IS mask-SSIM mask-IS SSIM IS
Single source 1
Avg No-d 2
Avg 2
Att. 2D 2
Full 2
Avg 5
Att. 2D 5
Full 5
Table 2: Quantitative ablation study on the Market-1501 and the DeepFashion dataset.
Figure 5: A qualitative ablation study on the DeepFashion dataset (panel labels: Avg, Full, Attention Saliency). We compare Avg with Full. The attention saliency maps are displayed in the same order as the source images.

Tab. 2 shows a quantitative evaluation. First, we notice that our method without spatial deformation performs poorly on both datasets. This is particularly evident with the SSIM-based scores. This confirms the importance of source-target alignment before computing a position-dependent attention. Interestingly, when using only two source images, Avg, Att. 2D and Full perform similarly to each other on the Market-1501 dataset. However, when more source images are available, we clearly observe the benefit of using our proposed attention approach. Avg performs consistently worse than our Full pipeline. The 2D attention model outputs images with higher SSIM-based scores but with lower IS values. Concerning the DeepFashion dataset, our attention model performs better than the simpler approach with both 2 and 5 source images.

In Fig. 5 we compare Avg with Full. The advantage of using Full is clearly illustrated by the fact that Avg mostly performs an average of the front and back images. In the second row, Full reduces the amount of artifacts. Interestingly, in the last row, Full fails to correctly generate the new viewpoint, but we see that it chooses to focus on the back view in order to generate the collar.

5 Conclusion

In this work we introduced a generalization of the person-image generation problem. Specifically, a human image is generated conditioned on a target pose and a set of source images. This makes it possible to exploit multiple and possibly complementary images. We introduced an attention-based decoder which extends the U-Net architecture to a multiple-input setting. Our attention mechanism selects relevant information from different sources and image regions. We experimentally validate our approach on two different datasets. We expect that the practical advantages of the multi-source approach, as demonstrated in this work, will attract the interest of the community.


We thank the NVIDIA Corporation for the donation of the GPUs used in this work. This project has received funding from the European Research Council (ERC) (Grant agreement No.788793-BACKUP).