Grid Partitioned Attention: Efficient Transformer Approximation with Inductive Bias for High Resolution Detail Generation

by   Nikolay Jetchev, et al.

Attention is a general reasoning mechanism that can flexibly deal with image information, but its memory requirements have so far made it impractical for high resolution image generation. We present Grid Partitioned Attention (GPA), a new approximate attention algorithm that leverages a sparse inductive bias for higher computational and memory efficiency in image domains: queries attend only to a few keys, and spatially close queries attend to close keys due to correlations. Our paper introduces the new attention layer and analyzes its complexity and how the trade-off between memory usage and model power can be tuned by the hyper-parameters. We show how such attention enables novel deep learning architectures with copying modules that are especially useful for conditional image generation tasks like pose morphing. Our contributions are (i) algorithm and code of the novel GPA layer, (ii) a novel deep attention-copying architecture, and (iii) new state-of-the-art experimental results in human pose morphing generation benchmarks.





1 Introduction

The latest generative models like Generative Adversarial Networks (GANs) have impressive capabilities Karras et al. (2020); Brock et al. (2019): they generate realistic high-resolution human faces; they handle complex class conditioning (e.g. ImageNet Deng et al. (2009)) or semantic-map conditioning Wang et al. (2018). However, we think that these methods need further improvement in the following aspects, especially for the case of generation conditional on other images:

  • accurately represent the high-frequency details (complicated textures) of diverse distributions

  • deal with rare and unique visual artifacts (e.g. text logo), which may come only at test time

The scientific challenge lies in the development of new deep architectures with the properties to (i) learn parametric distributions over data, as traditional generative models, (ii) copy flexibly from (strongly warped) conditioning images with minimal loss of the conditioning information, and (iii) spatially blend the two approaches in a computationally feasible end-to-end differentiable learning framework. Fig. 1 is an example of the visual quality of our novel copying architecture that has these properties. Its approach is very efficient for image-conditional generation (e.g. pose morphing Ma et al. (2017)), since we leverage the conditioning information directly through attention, avoid encoding it in a latent embedding space (which requires abundant training examples) while still benefiting from the generalisation capabilities of a GAN model.

One existing approach to copy-generation relies on appearance flow Zhou et al. (2016): predict flow coordinates and deform the input images before feeding them to a convolutional architecture. An advantage of such warping is that it can copy smooth continuous regions of an image. However, the extreme locality of flow models makes them difficult to train by standard optimization: high-frequency gradients destabilize learning Kanazawa et al. (2018), as each copied pixel blends only a few neighbouring grid pixels.

Classical encoder/decoder models (e.g. compressing autoencoders) require huge model capacity to reconstruct information. Copying image content (in pixel or feature space) helps (e.g. skip connections in UNets Ronneberger et al. (2015)), but spatial alignment is an issue as convolutions allow only local displacements. Deformable skip connections can help Siarohin et al. (2018), but training is tricky due to unstable warping flow. Siarohin et al. (2019) generates by deforming and copying from a source image: this is a way to recreate image details without needing model capacity to learn all such data distribution details from training data. A similar approach was followed in Jetchev et al. (2019): given a set of input images, the model learns to reuse parts of the input, deform them and blend them into a convincing final image. However, this works best only with spatially aligned semantic image regions.

Figure 1: Example of attention copying for pose transfer: the task is to morph the source appearance to the target pose. The novel GPA attention layer can run at high resolutions and allows generative architectures that efficiently leverage conditioning information and copy important details.

Attention Vaswani et al. (2017) can be considered as a more general alternative to appearance flow for content copying. It produces an affinity map over the whole spatial extent of the key/value input that weights the contribution of each input pixel to the total output. Attention mechanisms exhibit highly non-local behaviour and copy information from multiple locations at once. This can lead to better gradient training than flow warping. In contrast, we can see appearance flow as attention in the limit where the whole affinity mass for a given query is concentrated at a single key/value. In addition, attention can flexibly integrate geometric inductive bias for the affinity map, e.g. enforce smoothness or leverage coordinates in keys and queries as inductive bias.
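The relation between attention and appearance flow can be illustrated with a small NumPy sketch. It is only an illustration under stated assumptions: the sharpening factor `beta` and the toy keys and values are hypothetical and do not come from the paper. As `beta` grows, the softmax weights concentrate on a single key and the output approaches a hard copy of one value, which is the flow-like limit described above.

```python
import numpy as np

def attend(query, keys, values, beta=1.0):
    """Attention for one query: keys is (C, N), values is (Cv, N), query is (C,)."""
    scores = beta * (keys.T @ query)          # one affinity per key
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return values @ weights, weights

# Toy data (assumed): three keys, one scalar value per key.
keys = np.eye(3)
values = np.array([[10.0, 20.0, 30.0]])
query = np.array([0.2, 1.0, 0.5])

soft_out, soft_w = attend(query, keys, values, beta=1.0)    # blends several values
hard_out, hard_w = attend(query, keys, values, beta=100.0)  # nearly copies value 1
```

With a small `beta` the output is a blend of all three values and every key receives gradient, while with a large `beta` almost all affinity mass sits on the best-matching key.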

The downside of using attention is the memory cost: for $H \times W$ pixel images, the query/key tensors have $HW$ spatial elements, and attention needs an affinity matrix with $(HW)^2$ elements, which at high resolution is too big for current GPUs. An approximation is required that balances the expressive power of the model against its memory (and computation) costs. In this paper we propose Grid Partitioned Attention (GPA), a novel attention algorithm with such properties. Sec. 2 will describe in detail the GPA algorithm. Sec. 3 will demonstrate experimentally how GPA enables novel deep architectures, which generate high resolution images with improved detail reconstruction. We focus on pose morphing, re-rendering the image of a person in a different pose, as a good example of stringently conditioned image generation.
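To make the memory argument concrete, the following short calculation counts the affinity matrix entries of full attention. The 1024 x 1024 resolution and the 4 bytes per element (float32) are assumptions chosen for illustration, not values quoted from the text above.

```python
def full_attention_cost(height, width, bytes_per_element=4):
    """Return (positions, affinity elements, affinity bytes) for full attention."""
    positions = height * width                 # one query/key position per pixel
    affinity_elements = positions * positions  # every query scored against every key
    return positions, affinity_elements, affinity_elements * bytes_per_element

# Assumed example resolution: 1024 x 1024 pixels, float32 storage.
positions, elements, num_bytes = full_attention_cost(1024, 1024)
tebibytes = num_bytes / 2**40
```

Under these assumptions there are 2^20 positions and 2^40 affinity entries, i.e. 4 TiB for a single affinity matrix, before any batch or multi-head factor.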

2 The GPA attention layer

2.1 Notation

Figure 2: Illustration of spatial partitioning (Definition 3) and composition (Definition 4) operations.

Images are usually represented as tensors $X \in \mathbb{R}^{C \times H \times W}$. The 3 dimensions are the channel and two spatial dimensions (height and width). We address individual pixels of an image with their spatial coordinates, defined with index sets $I \subseteq I_X$, where $I_X = \{0, \dots, H-1\} \times \{0, \dots, W-1\}$ is the full spatial index structure of tensor $X$. However, we can also have index arrays with fewer elements, indexing an image cropped out of tensor $X$. Spatial index structures may exhibit a spatial structure of their own, in which case we would define $I$ as a tensor rather than a set. Which interpretation is used shall be explicitly mentioned or be clear from the context. Image tensors allow for simple indexing where we implicitly slice over the channel dimension. Thus, $X_{(i,j)} \in \mathbb{R}^{C}$ and $X_I = (X_{(i,j)})_{(i,j) \in I}$. If $I$ has a spatial structure, then $X_I$ is a tensor with the same spatial structure.

We start with a formal definition of several operations required to implement GPA.

Definition 1 (Downsampling image tensors).

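Let $X \in \mathbb{R}^{C \times H \times W}$ be an image tensor and (for simplicity and w.l.o.g.) $H, W$ be multiples of $s$. We define downsampling with factor $s$ as $X^{\downarrow s} \in \mathbb{R}^{C \times H/s \times W/s}$ where $X^{\downarrow s}_{(i,j)} = \frac{1}{s^2} \sum_{a=0}^{s-1} \sum_{b=0}^{s-1} X_{(si+a,\, sj+b)}$.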

Definition 2 (Upsampling index sets).

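Let $I$ be a spatial index structure. We define upsampling of $I$ with factor $s$ as $I^{\uparrow s} = \{(si+a,\, sj+b) : (i,j) \in I,\ 0 \le a, b < s\}$.

Per definition of $\downarrow$ and $\uparrow$, pixels of downsampled images are averages over the upsampled indices: $X^{\downarrow s}_{(i,j)}$ is the mean of $X$ over $\{(i,j)\}^{\uparrow s}$.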

Definition 3 (Spatial partitioning).

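Let $X$ be a tensor with spatial index structure $I_X$. An operation $P(X) = (X_{I_1}, \dots, X_{I_n})$ where $I_c \subseteq I_X$ and $\bigcup_{c} I_c = I_X$ and $I_c \cap I_{c'} = \emptyset$ for $c \neq c'$ is called spatial partitioning of $X$ into $n$ partitions (cells).

Usually, spatial partitioning is implemented such that the $I_c$ have a spatial structure and, hence, the $X_{I_c}$ are tensors.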

Definition 4 (Spatial composition).

Let $P$ be a spatial partitioning. If there is exactly one $X$ such that $P(X) = (X_1, \dots, X_n)$, then $X = P^{-1}(X_1, \dots, X_n)$ is the unique spatial composition of $(X_1, \dots, X_n)$.

Definition 5 (Consistent spatial partitioning).

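Let $X$ be an image tensor, and $X^{\downarrow s}$ its downsampled version. Let $P$ be a spatial partitioning with $P(X) = (X_{I_1}, \dots, X_{I_n})$ and $P(X^{\downarrow s}) = (X^{\downarrow s}_{J_1}, \dots, X^{\downarrow s}_{J_n})$. $P$ is said to be consistent iff $I_c = J_c^{\uparrow s}$ for all $c \in \{1, \dots, n\}$.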

For example, clipping an (image) tensor into equally sized pieces (cf. Fig. 2), which are arranged in a specific order (e.g. scan line order) is a consistent spatial partitioning.
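The operations defined in this section can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function names, the 0-based indexing, and the restriction to square grids of `g` x `g` equally sized cells in scan-line order are assumptions made for the example.

```python
import numpy as np

def downsample(x, s):
    """Average-pool a (C, H, W) tensor by factor s (H and W multiples of s)."""
    c, h, w = x.shape
    return x.reshape(c, h // s, s, w // s, s).mean(axis=(2, 4))

def upsample_indices(idx, s):
    """Map each low-res (i, j) to the s * s high-res pixels it averages over."""
    return {(s * i + a, s * j + b) for (i, j) in idx
            for a in range(s) for b in range(s)}

def partition(x, g):
    """Cut a (C, H, W) tensor into g * g equally sized cells, scan-line order."""
    _, h, w = x.shape
    ch, cw = h // g, w // g
    return [x[:, r * ch:(r + 1) * ch, q * cw:(q + 1) * cw]
            for r in range(g) for q in range(g)]

def compose(cells, g):
    """Inverse of partition: reassemble the g * g cells into one tensor."""
    rows = [np.concatenate(cells[r * g:(r + 1) * g], axis=2) for r in range(g)]
    return np.concatenate(rows, axis=1)

x = np.arange(2 * 8 * 8, dtype=float).reshape(2, 8, 8)
x_low = downsample(x, 2)
# Consistency: partitioning then downsampling matches downsampling then partitioning.
consistent = all(np.allclose(downsample(cell, 2), low_cell)
                 for cell, low_cell in zip(partition(x, 2), partition(x_low, 2)))
```

Here `compose(partition(x, g), g)` returns `x` unchanged, and `consistent` checks the property from Definition 5 for this equal-cell partitioning.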

2.2 GPA algorithm

Suppose we have flattened query, key and value tensors $Q \in \mathbb{R}^{C \times N}$, $K, V \in \mathbb{R}^{C \times N}$, where $N = HW$, and want to calculate attention

$A(Q, K, V) = V \, \mathrm{softmax}(K^\top Q)$, with the softmax taken over the key dimension.

This operation is expensive in memory due to the $N \times N$ matrix $K^\top Q$. We will make three useful assumptions for the image domain as inductive bias for the approximation of GPA.

  • A1 local spatial correlation: natural images have a structure where close positions are correlated (a counterexample is a white noise image, where there is no correlation) Cecchi (2010); Maheswaranathan et al. (2018).

  • A2 scale correlation: assume that the attention structure at lower scale roughly corresponds to the attention structure at high scale Ruderman (1994).

  • A3 attention sparsity: for any query, only a fraction of the keys have non-zero affinity values, as in dense correspondence methods Rocco et al. (2018) and recent sequence transformer models Zaheer et al. (2020).

Figure 3: Illustration of GPA: how to calculate very sparse attention from query tensor $Q$ to keys $K$. Phase 1: downsample $Q, K$ into $Q^{\downarrow s}, K^{\downarrow s}$ and find the $k$ keys that have the largest affinity for a query cell (the relevant key set). We highlight in blue a query position, and in green its top $k$ keys. Phase 2: given the keys and queries from Phase 1, we can up-sample their indices and get an approximation for the relevant key set in the original size tensors $Q, K$.

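Let $I_Q$ be the indices of the queries (column indices of $Q$) and $I_K$ the indices of the keys (column indices of $K$). For a single query $q = Q_j$, $j \in I_Q$, we have that $A(q, K, V) = V \, \mathrm{softmax}(K^\top q)$, which takes $O(N)$ memory. If we can identify a subset of key indices $I^{rel} \subset I_K$ that are relevant for the query $q$, we can calculate just the attention with respect to those keys and realize great computational savings if $|I^{rel}| \ll |I_K|$ (see assumption A3). How can we find $I^{rel}$ in a computationally efficient way? A simple greedy relevance algorithm is to take the $k$ key rows of the affinity matrix $S = K^\top Q$ with highest average activation over all query columns. Formally, $I^{rel} = \{i_1, \dots, i_k\} \subset I_K$ where $\bar{S}_i = \frac{1}{|I_Q|} \sum_{j \in I_Q} S_{ij}$ and $\bar{S}_i \ge \bar{S}_{i'}$ for all $i \in I^{rel}$ and for all $i' \in I_K \setminus I^{rel}$.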

Trivially, we could calculate $S = K^\top Q$ where $S \in \mathbb{R}^{N \times N}$ and keep just $k$ elements of $I_K$ for subsequent attention calculation. However, this would be quite expensive and saves us no memory at all, since we would need to calculate the full attention in order to sort all elements by magnitude. Motivated by assumption A2, we downsample the keys and queries into $K^{\downarrow s}$ and $Q^{\downarrow s}$ by factor $s$, and then, at the lower scale, calculate the full affinity matrix $S^{\downarrow s} = (K^{\downarrow s})^\top Q^{\downarrow s}$. The relevant key sets at the lower level can help us approximate the relevant sets at the higher level, see Fig. 3 for illustration. However, simply computing a single relevant key set from $S^{\downarrow s}$ would have no other effect than setting the key-value dictionary size to $k s^2$ elements globally for all queries. This is not exactly what we want. Instead, following assumption A1, we perform this step individually for different local neighborhoods of the image, the spatial partition cells (Definition 3).

We describe the GPA algorithm as Algorithm 1. $P$ is usually the square cell partitioning function, which is fast and available in many deep learning frameworks. In Phase 1 we find the relevant key sets at a low resolution with $N/s^2$ positions, for which full attention is still feasible in memory. This is not done for single query elements, but more generally for consistent spatial partitions of query locations. By assumption A1 keys and queries are spatially correlated: if $q_1$ attends to keys at index set $I_1$, and $q_2$ attends to the keys at index set $I_2$, then for $q_1$ and $q_2$ close to each other, the relevant key index sets will be similar. Here, the partition parameter $n$ is a hyper-parameter that reflects the level of smoothness according to A1. In Phase 2 we leverage the relevant key set at the high resolution. On line 15, we use the index upsampling operator $\uparrow$ to sample around the relevant key sets from the low spatial level: we densely fill the finer grid around the locations we choose. Other sampling strategies are possible, but this one is fast to implement and easy to parallelize.

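1: Input: Query $Q$, Keys $K$ and Values $V$
2: Input: consistent spatial partitioning $P$
3: Parameters: $n$ count of partition cells, $s$ downsampling factor, $k$ size of relevant key set
4: ## Phase 1: finding relevant keys at the low resolution ##
5: $Q', K' \leftarrow$ down-sample $Q, K$ by factor $s$, so that $N' = N / s^2$
6: reshape $Q', K'$ to size $C \times N'$; matrix $S \leftarrow K'^\top Q'$, $S \in \mathbb{R}^{N' \times N'}$
7: reshape $S$ into size $N' \times H/s \times W/s$ (the queries dimension of $S$ is reshaped spatially)
8: $(S_1, \dots, S_n) \leftarrow P(S)$ (partition the query spatial dimensions of $S$ into $n$ cells)
9: for $c = 1, \dots, n$ do
10:     $J_c \leftarrow$ the $k$ key rows of $S_c$ with highest average affinity (find the $k$ most relevant keys for each cell of grouped queries)
11: end for
12: ## Phase 2: using the relevant keys for approximate attention at the original resolution ##
13: $(Q_1, \dots, Q_n) \leftarrow P(Q)$ (partitions are related across scales)
14: for $c = 1, \dots, n$ do
15:     $I_c \leftarrow J_c^{\uparrow s}$ (indices corresponding to the low-res key set)
16:     $K_c, V_c \leftarrow K_{I_c}, V_{I_c}$ (gather subtensors of size $C \times k s^2$)
17:     $O_c \leftarrow A(Q_c, K_c, V_c)$ ($O_c$ is a subset of the GPA output)
18: end for
19: $O \leftarrow P^{-1}(O_1, \dots, O_n)$ (combine full size attention output from the query cells)
Algorithm 1 Grid Partitioned Attention (GPA)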
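The two phases of Algorithm 1 can be sketched in NumPy as below. This is a hedged illustration rather than the released implementation: it assumes square `g` x `g` query cells (so the cell count is `g * g`), keys and queries of equal spatial size, raw (unnormalised) affinities for ranking keys, and a plain Python loop over cells instead of the parallel GPU execution the paper describes. The names `gpa`, `avg_pool` and `full_attention` are hypothetical.

```python
import numpy as np

def avg_pool(x, s):
    """Downsample a (C, H, W) tensor by averaging s x s blocks."""
    c, h, w = x.shape
    return x.reshape(c, h // s, s, w // s, s).mean(axis=(2, 4))

def softmax(a, axis):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(q, k, v):
    """Reference: every query position attends to every key position."""
    C, H, W = q.shape
    weights = softmax(k.reshape(C, -1).T @ q.reshape(C, -1), axis=0)
    return (v.reshape(v.shape[0], -1) @ weights).reshape(-1, H, W)

def gpa(q, k, v, s, g, top_k):
    """Sketch of GPA for (C, H, W) inputs; H/s and W/s must be multiples of g."""
    C, H, W = q.shape
    h, w = H // s, W // s                          # low-resolution grid
    lh, lw, ch, cw = h // g, w // g, H // g, W // g
    # Phase 1: full affinity at low resolution, shaped keys x query rows x query cols.
    S = avg_pool(k, s).reshape(C, -1).T @ avg_pool(q, s).reshape(C, -1)
    S = S.reshape(h * w, h, w)
    k_flat, v_flat = k.reshape(C, -1), v.reshape(v.shape[0], -1)
    offs = np.arange(s)
    out = np.empty((v.shape[0], H, W))
    for r in range(g):
        for c in range(g):
            cell = S[:, r * lh:(r + 1) * lh, c * lw:(c + 1) * lw]
            rel = np.argsort(-cell.mean(axis=(1, 2)))[:top_k]   # top-k low-res keys
            # Phase 2: upsample the key indices and attend at full resolution.
            i, j = np.divmod(rel, w)
            rows = s * i[:, None, None] + offs[None, :, None]
            cols = s * j[:, None, None] + offs[None, None, :]
            idx = (rows * W + cols).reshape(-1)                 # top_k * s * s keys
            qc = q[:, r * ch:(r + 1) * ch, c * cw:(c + 1) * cw].reshape(C, -1)
            weights = softmax(k_flat[:, idx].T @ qc, axis=0)
            out[:, r * ch:(r + 1) * ch, c * cw:(c + 1) * cw] = (
                v_flat[:, idx] @ weights).reshape(-1, ch, cw)
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal((4, 8, 8)), rng.standard_normal((4, 8, 8))
v = rng.standard_normal((3, 8, 8))
sparse = gpa(q, k, v, s=2, g=2, top_k=3)      # each cell attends to 3 * 4 = 12 keys
dense = gpa(q, k, v, s=2, g=2, top_k=16)      # all 16 low-res keys selected
```

When `top_k` equals the number of low-resolution positions, every key is selected for every cell and the sketch reproduces `full_attention` up to floating point error, which is a convenient sanity check.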

Complexity is $O((N/s^2)^2)$ for Phase 1, due to a call to full attention after downsampling. For Phase 2 the memory cost is $O(N k s^2)$, where we call attention $n$ times, once per partition, between $N/n$ queries and the $k s^2$ keys in the relevant key set. [Footnote 2: All attention calls will be executed efficiently in parallel on the GPU.] If we set $k = N/s^2$, then the approach becomes equivalent to full attention, since at the original scale all keys are considered for all queries. If we set $n = 1$, then all queries are in the same partition, and $k$ tunes how many keys to attend to globally.
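The complexity statement can be checked with a small counting sketch. It assumes each of the two phases is dominated by the number of affinity entries it materialises; the concrete values below (256 x 256 positions, downsampling factor 8, 1024 cells, 12 keys per cell) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def gpa_affinity_elements(num_positions, s, n, k):
    """Count affinity entries for Phase 1 and Phase 2 of GPA."""
    low_res_positions = num_positions // s**2
    phase1 = low_res_positions ** 2                 # full attention at low resolution
    phase2 = n * (num_positions // n) * (k * s**2)  # n cells, N/n queries, k*s^2 keys each
    return phase1, phase2

N = 256 * 256                                       # assumed number of positions
phase1, phase2 = gpa_affinity_elements(N, s=8, n=1024, k=12)
full = N * N
savings = full / (phase1 + phase2)
```

In this hypothetical setting Phase 1 needs about one million entries and Phase 2 about fifty million, against roughly 4.3 billion for full attention. Setting `k` to the number of low-resolution positions makes the Phase 2 count equal to the full attention count.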

2.3 Related works: approximate attention

Full attention is used successfully in transformer architectures for vision, language, and reinforcement learning, but its memory footprint often limits the tensor sizes it can process. A lot of work deals with approximations, which can be much lighter but introduce an error, see Tay et al. (2021, 2020) for an overview. GPA sits at the intersection of fixed and learnable patterns for attention. It uses a pattern for the queries that is regular and fixed w.r.t. the image grid, while extracting the most relevant keys (via sorting) is an adaptive yet non-differentiable approach. GPA differs from random sparse attention patterns such as Zaheer et al. (2020), because we use spatial grid partitioning as inductive bias and leverage sparsity in many query neighborhood cells.

Ref. Kitaev et al. (2020) uses Locality-Sensitive Hashing (LSH) to sort the keys/queries, assuming that the key partitions are of similar size and non-overlapping. In contrast, GPA deals with relevant keys without any such restrictive assumptions: it finds the relevant keys for a query partition independently of the keys for all other partitions. In addition, for the applications we have in mind (e.g. conditional GAN models) queries, on one side, and keys and values, on the other side, come from different information pathways, and GPA handles that case naturally. In contrast, ref. Kitaev et al. (2020) is specifically designed for self-attention problems where keys and queries share the same information.

Ref. Vyas et al. (2020) clusters the queries and computes attention for the centroids, while keeping all keys. This is a costly strategy for high-resolution images. Clustering also adds an approximation error since for all cluster members (sharing the same centroid) attention is exactly the same. GPA partitions the queries too, but we handle sparsity differently, and each query in a partition can get a different attention.

3 Experiments

We ran experiments on the popular DeepFashion (DF) dataset Liu et al. (2016), using the standard training and test splits Ma et al. (2017), and 18 pose keypoints as in Zhu et al. (2019). The images are not square, so we pad them to a square when training. DF allows a clean quantitative comparison with many other papers using this dataset setup. For qualitative comparison, in Sec. 3.3 we run additional experiments on two larger, higher resolution datasets. Their rich image distributions give more insight into the benefits of detail copying.

3.1 The copying attention: architecture and loss

Figure 4: (a) Scheme of the pose-morphing generator architecture with two UNets: one for the concatenated source appearance (M1) and pose (B1) and one for the target pose (B2). The latter outputs the generated target appearance. (b) Attention blocks copy information from the source decoder layers (keys, values) to the appropriate spatial positions in the target decoders (query).

A generator for the pose-morphing task can be considered as a function $G$ which takes as input a source image $M_1$, its pose $B_1$, and a target pose $B_2$, and produces an image $G(M_1, B_1, B_2)$. The training loss (defined below) is designed to make the output keep the likeness of $M_1$ but change pose to $B_2$, hence "pose-morphing". Fig. 4a) shows our architecture with attention copying. The key/value tensors (for the attention) come from the decoder blocks of the source appearance and pose UNet Ronneberger et al. (2015) (featuring encoder and decoder with skip connections), and the query tensors come from the decoder of the target branch UNet which generates the output image. All convolutional blocks are designed as in Karras et al. (2020). Typically we used 6 blocks in each encoder and decoder (see Appendix for details). The poses are represented as 18 keypoint channels, each being a heatmap around a pose keypoint. For ease of interpretation, our figures show skeletons connecting these keypoints.

Fig. 4b) shows the details of the main novel component of the architecture: an attention layer that efficiently leverages and “copies” image content. It has 3 trainable convolutions: for key and value (source what to copy), and the query (target where to copy). Given these, the attention module (GPA or other attention type) will propagate information from the value tensor, according to key-query affinity. We can either add as residual to the generated feature map, or concatenate the attention output as additional channels, which we used in our experiments. This design allows the network to learn complex copying behaviour. For DF data, we define the pivot configuration: use full attention layers for tensors with size or less (which fit in memory for that size), and for larger sizes ( and ) use approximate attention layers. All GPA layers downsample to size for Phase 1. At that size, queries are partitioned into cells, each of spatial size . We used keys for each cell. Phase 2 works on the original large sizes, respectively. We also tried multihead versions of GPA (and other attention blocks), but found no performance gain despite more memory usage; thus we used single head for the final experiments.
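The attention copying block from Fig. 4b) can be sketched as follows. This is an illustration under assumptions, not the authors' layer: the three trainable convolutions are modelled as 1x1 convolutions (the kernel size is not stated above), the weights are random placeholders, and full attention stands in for whichever attention module (GPA or other) is plugged in. The function names are hypothetical.

```python
import numpy as np

def softmax(a, axis):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conv1x1(x, weight):
    """1x1 convolution of a (C_in, H, W) map with a (C_out, C_in) weight."""
    return np.einsum("oc,chw->ohw", weight, x)

def attention_copy(source, target, w_key, w_value, w_query):
    """Copy source content to target positions and append it as extra channels."""
    _, H, W = target.shape
    k = conv1x1(source, w_key).reshape(w_key.shape[0], -1)      # what to match on
    v = conv1x1(source, w_value).reshape(w_value.shape[0], -1)  # what to copy
    q = conv1x1(target, w_query).reshape(w_query.shape[0], -1)  # where to copy
    weights = softmax(k.T @ q, axis=0)       # (source positions, target positions)
    copied = (v @ weights).reshape(-1, H, W)
    return np.concatenate([target, copied], axis=0)             # concatenation variant

rng = np.random.default_rng(0)
source = rng.standard_normal((5, 4, 4))      # source decoder features (assumed sizes)
target = rng.standard_normal((3, 4, 4))      # target decoder features
w_key, w_query = rng.standard_normal((2, 5)), rng.standard_normal((2, 3))
w_value = rng.standard_normal((6, 5))
out = attention_copy(source, target, w_key, w_value, w_query)
```

The output keeps the target feature map untouched in its first channels and appends the copied content, matching the concatenation option used in the experiments. The residual variant would instead add `copied` to `target`, which requires equal channel counts.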

We train the pose-morphing network with a loss combining 3 terms: a perceptual loss on paired data, a self-supervised loss, and a GAN loss.

For datasets (e.g. DF) with pairs of source $(M_1, B_1)$ and target $(M_2, B_2)$, we can directly use a perceptual loss Zhang et al. (2018) between the generator output $G(M_1, B_1, B_2)$ and the target image $M_2$. The error metric uses features from the pretrained VGG network Simonyan and Zisserman (2014). For any dataset we can also use a self-supervised loss Jing and Tian (2019) by creating pairs of frames $(M_1, B_1)$ and $(T(M_1), T(B_1))$, where $T$ is a spatial transformation applied on both image and pose. We define this loss as the error between $G(M_1, B_1, T(B_1))$ and $T(M_1)$, which forces the generator to reproduce the target image. Copying details is an essential mechanism for that. In general, $T$ can be designed in many ways (see Jing and Tian (2019)), but we opted for a simple affine deformation. To randomly sample affine transformation matrices we added noise to the identity transform. While most pose changes in data are more complex than affine transforms, the self-supervised loss was helpful and stabilized GAN training behaviour. We also use a standard GAN loss, with a discriminator that sees two types of pairs: true data $(M_2, B_2)$ and generator output $(G(M_1, B_1, B_2), B_2)$, where $B_2$ is the pose the image should have. The discriminator is sensitive both to the quality of the image and to its match with the chosen pose. The discriminator and loss had the StyleGAN design Karras et al. (2020). We optimized the loss with ADAM Kingma and Ba (2017) over minibatches of 3 images, on Nvidia V100 GPUs in our internal cluster, in PyTorch Paszke et al. (2019). All our GAN models were trained for 7 days on one GPU; the supervised distortion task ran for 5 days.
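Sampling affine transformations by adding noise to the identity transform can be sketched as below. The Gaussian noise, the `noise_scale` value, and the choice to apply the transform to keypoint coordinates in normalised [-1, 1] units are assumptions for illustration; the text above does not specify the noise distribution or how the transform is applied to pose heatmaps.

```python
import numpy as np

def sample_affine(rng, noise_scale=0.1):
    """Return a 2x3 affine matrix: identity plus Gaussian noise (scale assumed)."""
    identity = np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0]])
    return identity + noise_scale * rng.standard_normal((2, 3))

def transform_keypoints(theta, points):
    """Apply a 2x3 affine matrix to (N, 2) keypoints in normalised coordinates."""
    return points @ theta[:, :2].T + theta[:, 2]

rng = np.random.default_rng(0)
theta = sample_affine(rng)
keypoints = rng.uniform(-1.0, 1.0, size=(18, 2))   # 18 pose keypoints, as in the text
moved = transform_keypoints(theta, keypoints)
```

The same matrix would also be used to warp the image, so that image and pose stay aligned after the transformation.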

3.2 DF benchmark: affine transformation copying and in-person pose morphing

We want to investigate which attention block works best in the architecture from Fig. 4 for a DF image distortion task. For that we used only the self-supervised loss term and fixed the training time to 5 days, for a fair comparison between methods. The experimental loss is measured on a test set of 500 images and random affine transforms. All configurations used full attention for the 4 decoder blocks up to size 64x64. We tested the pivot configuration GPA1024_12_64 (as in Sec. 3.1), and several ablations around it. Each is named after its GPA hyper-parameters and the largest resolution where we used full attention (approximate attention at sizes above it). We tested two popular recent low rank attention approximation methods, the Nystromformer (NYF) Xiong et al. (2021), and the Performer (PER) Choromanski et al. (2020), set up with 512, 708 or 768 as low rank parameter. NO indicates no attention layer above size 64x64.

The results are summarized in Fig. 5. The statistics (with error bars over 3 randomly initialized runs) confirm that GPA (around the pivot configuration) clearly had the best copying performance. We measure the memory just for the 6 attention blocks, in gigabytes (GB), for input tensors with batch size 1. [Footnote 3: In PyTorch: do forward/backward pass, call torch.cuda.max_memory_allocated().] For the pivot configuration, GPA uses less memory than PER and NYF. Interestingly, NYF and PER do not benefit from extra capacity. NO is better than them, but still worse than GPA. Note that the two ablations changing the scale for GPA Phase 1 are both more memory costly (and slower) than the pivot. This is consistent with the GPA complexity formula from Sec. 2.2. For GPA1024_12_128 we used batch size 2 (instead of 3) due to memory limits.

Figure 5: Image distortion task: conditional on a source image, its pose, and an affinely transformed pose, the network needs to generate the correspondingly transformed image. We test various attention blocks. (a) Attention memory cost vs loss (error bars for 3 runs of 5 days). (b) Example generated images: complex textures (e.g. stripes) are transferred best by GPA. The top row shows the absolute error of the images as RGB pixel intensity.

Method                     FID     SSIM    R2G    G2R
Attention Copying + GPA    8.86    0.767   0.12   0.27
Attention Copying + NO     13.05   0.752   0.08   0.23
BFT                        14.5    0.766   0.16   0.21
DSC                        19.9    0.761   0.15   0.15
XING                       42.8    0.756   0.12   0.03
Table 1: Performance of GPA+attention copying against other baseline models.

We performed the classical DF benchmark, with standard train/test splits Ma et al. (2017). DF has thousands of appearances, several images per appearance. For the in-person pose morphing, images and poses from the same appearance are used. We compared results using statistical scores (FID Heusel et al. (2017) and SSIM Wang et al. (2004)). We also did a human perceptual evaluation using 55 true and 55 generated images. The task was to see a random image for 4 seconds (as in Zhou et al. (2019)), and answer whether it looks real or generated. We used 3 clicks per image, 330 clicks total, for calculating R2G (percentage of real images considered generated) and G2R (generated considered real).
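The two perceptual scores can be computed from raw clicks as in the sketch below. The vote counts are made-up toy numbers used only to exercise the function; they are not the study's data. Only the total of 165 clicks per group (55 images times 3 clicks) follows the setup described above.

```python
def perceptual_scores(votes):
    """votes: list of (is_real, judged_real) pairs, one per click."""
    real = [judged for is_real, judged in votes if is_real]
    generated = [judged for is_real, judged in votes if not is_real]
    r2g = sum(not judged for judged in real) / len(real)      # real judged generated
    g2r = sum(judged for judged in generated) / len(generated)  # generated judged real
    return r2g, g2r

# Toy clicks (assumed): 165 on real images, 165 on generated images.
votes = ([(True, False)] * 33 + [(True, True)] * 132
         + [(False, True)] * 66 + [(False, False)] * 99)
r2g, g2r = perceptual_scores(votes)
```

A higher G2R means generated images fool viewers more often, which is the direction that favours the generator.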

We compare GPA attention copying with the NO copying configuration, and with BFT AlBahar and Huang (2019), DSC Siarohin et al. (2018), and XING Tang et al. (2020). Table 1 summarizes our results: GPA with attention copying is a new state-of-the-art for the DF pose morphing task (FID and crowdsourcing scores are more relevant quality measures than SSIM, as noted by Zhu et al. (2019)). In Fig. 6 we show a visual comparison for several example generated images. The other methods in that figure are PATN Zhu et al. (2019) and RTE Yang et al. (2020). XING was tested using a pretrained model; other reference images were taken from Yang et al. (2020). We do not report results for Siarohin et al. (2019) since we could not train it well on DF data. We suppose that DF is very different from the training data setup of Siarohin et al. (2019).

Figure 6: Visual comparison of several architectures for the task of generating the appearance of the source image in the target pose, on the DF dataset. GPA (right) generates fine details, e.g. the pocket in the top row.

3.3 High resolution results

Figure 7: Pose morphing on ChaL. (a) Increasing the parameter $k$ of GPA leads to more expensive and more powerful attention copying models (error bars over 3 random runs). (b) The generated images sorted from left to right with increasing $k$, which correlates with higher visual quality.

We also worked with another publicly available high resolution dataset, ChaL Bertiche et al. (2020a, b). It has 40,000 videos of 120 frames each, which provide data for pose morphing between frames of the same appearance. We extracted keypoints by Sun et al. (2019), cropped each video to a region around the human figure, and trained our model for 7 days on a V100 GPU with minibatch size 2. We used full attention for the smaller tensors and GPA for the three largest tensor sizes. For Phase 1 the queries and keys were downsampled, and the query was partitioned into equally sized cells. Fig. 7 shows results when training with different numbers of relevant keys $k$ for each downsampled query block. We ablated $k$, and we see how this affects visual fidelity. As discussed already in Sec. 2.2, the larger $k$ gets, the more capacity GPA gets, which also correlates well with FID score and memory usage. The inductive bias of GPA allows it to have good image generation performance while still being much cheaper (in memory and computation) than full attention, which at this resolution cannot fit in GPU memory.

Figure 8: Visual comparison of pose morphing via attention copying with GPA or the NYF modules in the 3 spatially largest decoder layers – GPA enables fine detail generation, such as the ripped jeans.

We ran further experiments on the dataset used in Yildirim et al. (2019), which is high resolution and has more visual details than DF and ChaL, so it is a good test case for attention copying architectures. This dataset has only single unpaired images, so we sampled random poses and trained with the self-supervised and GAN loss terms. The architecture was the same as for ChaL. Fig. 8 shows how well GPA recreates details and avoids the blur of other attention methods (NYF Xiong et al. (2021)).

We examine how GPA picks sparse key sets for attention at a query location, at the largest resolution. Fig. 9 visualizes the key contributions. For a few query locations we draw lines to their keys, with alpha value proportional to the affinity. We see that for some locations (e.g. the neck) just a single key contributes the most, i.e. local copying. The shoe copies information from locations around the shoes in the source appearance image. The blouse is spread and copies from a larger area with the right color. Interestingly, the hand is occluded in the source appearance, so GPA learned to fill it with information from the legs, with a similar skin color. Note: attention is performed in feature space, not RGB channel space, but the visualisation uses RGB images as background, in order to show what information the key/query tensors roughly represent. [Footnote 4: Convolutional architectures have local displacement due to the receptive field.]

Figure 9: A visualisation of 4 query positions. Each is connected with a line to its highest affinity keys.

Figure 10: Example of extreme out-of-sample generation: good transfer capabilities of the architecture using attention copying with GPA.

We also tested the limits of the model trained on Yildirim et al. (2019), by taking at inference time a random target pose and a source taken from a different dataset NA (2021). Fig. 10 shows the results: oil paintings are a very different domain from fashion photography, but the copying module can adapt.

4 Conclusion

We presented the GPA attention layer: an efficient attention approximation, with hyper-parameters to trade-off memory and accuracy. It is especially useful as copying mechanism in a novel hybrid generative architecture. Our experiments showed that attention copying with GPA is a new state-of-the-art method for the pose morphing Deep Fashion benchmark. We defined the GPA algorithm for images with 2-d spatial structure, but it is straightforward to adapt similar attention methods to 1-d sequences (e.g. audio data), 3-d (video stream), or even more complicated spatial manifolds (e.g. graphs as a generalization of grids) because its basic operators (downsampling, upsampling and partitioning) can be defined on such structures as well. We expect that GPA is suitable as general attention mechanism for many diverse architectures and tasks, including such with causal masks, but we leave this to future work. Currently, Phase 1 of the GPA algorithm is designed to select the top key indices for each query in a non-differentiable manner. Recent research on differentiable optimization and sorting Blondel et al. (2020); Cordonnier et al. (2021) makes fully differentiable GPA layers feasible, but this is the topic of future work.


  • B. AlBahar and J. Huang (2019) Guided image-to-image translation with bi-directional feature transformation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §3.2.
  • H. Bertiche, M. Madadi, and S. Escalera (2020a) 3D+texture garment reconstruction, NeurIPS competition. Cited by: §3.3.
  • H. Bertiche, M. Madadi, and S. Escalera (2020b) CLOTH3D: clothed 3d humans. External Links: 1912.02792 Cited by: §3.3.
  • M. Blondel, O. Teboul, Q. Berthet, and J. Djolonga (2020) Fast differentiable sorting and ranking. External Links: 2002.08871 Cited by: §4.
  • A. Brock, J. Donahue, and K. Simonyan (2019) Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • G. A. Cecchi et al. (2010) Statistics of natural scenes and cortical color processing. Journal of Vision 10, pp. 517–548. Cited by: §2.2.
  • K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, D. Belanger, L. Colwell, and A. Weller (2020) Rethinking attention with performers. Cited by: §3.2.
  • J. Cordonnier, A. Mahendran, A. Dosovitskiy, D. Weissenborn, J. Uszkoreit, and T. Unterthiner (2021) Differentiable patch selection for image recognition. External Links: 2104.03059 Cited by: §4.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, Cited by: §3.2.
  • N. Jetchev, U. Bergmann, and G. Yildirim (2019) Transform the set: memory attentive generation of guided and unguided image collages. CoRR abs/1910.07236. Cited by: §1.
  • L. Jing and Y. Tian (2019) Self-supervised visual feature learning with deep neural networks: a survey. External Links: 1902.06162 Cited by: §3.1.
  • A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik (2018) Learning category-specific mesh reconstruction from image collections. In ECCV, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss (Eds.), Cited by: §1.
  • T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila (2020) Analyzing and improving the image quality of StyleGAN. External Links: 1912.04958 Cited by: Appendix B, §1, §3.1, §3.1.
  • D. P. Kingma and J. Ba (2017) Adam: a method for stochastic optimization. External Links: 1412.6980 Cited by: §3.1.
  • N. Kitaev, Ł. Kaiser, and A. Levskaya (2020) Reformer: the efficient transformer. External Links: 2001.04451 Cited by: §2.3.
  • Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang (2016) DeepFashion: powering robust clothes recognition and retrieval with rich annotations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.
  • L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool (2017) Pose guided person image generation. In Advances in Neural Information Processing Systems, Vol. 30. Cited by: §1, §3.2, §3.
  • N. Maheswaranathan, L. T. McIntosh, D. B. Kastner, J. B. Melander, L. Brezovec, A. Nayebi, J. Wang, S. Ganguli, and S. A. Baccus (2018) Deep learning models reveal internal structure and diverse computations in the retina under natural scenes. bioRxiv. Cited by: §2.2.
  • NA (2021) The metropolitan art museum online resources. Cited by: §3.3.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. External Links: 1912.01703 Cited by: §3.1.
  • I. Rocco, M. Cimpoi, R. Arandjelović, A. Torii, T. Pajdla, and J. Sivic (2018) Neighbourhood consensus networks. External Links: 1810.10510 Cited by: §2.2.
  • O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. External Links: 1505.04597 Cited by: §1, §3.1.
  • D. Ruderman (1994) The statistics of natural images. Network: Computation In Neural Systems 5, pp. 517–548. Cited by: §2.2.
  • A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, and N. Sebe (2019) First order motion model for image animation. In Conference on Neural Information Processing Systems (NeurIPS), Cited by: §1, §3.2.
  • A. Siarohin, E. Sangineto, S. Lathuilière, and N. Sebe (2018) Deformable GANs for pose-based human image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §3.2.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.1.
  • K. Sun, B. Xiao, D. Liu, and J. Wang (2019) Deep high-resolution representation learning for human pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Cited by: §3.3.
  • H. Tang, S. Bai, L. Zhang, P. H. Torr, and N. Sebe (2020) XingGAN for person image generation. In ECCV, Cited by: §3.2.
  • Y. Tay, M. Dehghani, S. Abnar, Y. Shen, D. Bahri, P. Pham, J. Rao, L. Yang, S. Ruder, and D. Metzler (2021) Long range arena: a benchmark for efficient transformers. In ICLR 2021, Cited by: §2.3.
  • Y. Tay, M. Dehghani, D. Bahri, and D. Metzler (2020) Efficient transformers: a survey. External Links: 2009.06732 Cited by: §2.3.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. Cited by: §1.
  • A. Vyas, A. Katharopoulos, and F. Fleuret (2020) Fast transformers with clustered attention. External Links: 2007.04825 Cited by: §2.3.
  • T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Processing. Cited by: §3.2.
  • Y. Xiong, Z. Zeng, R. Chakraborty, M. Tan, G. Fung, Y. Li, and V. Singh (2021) Nyströmformer: a Nyström-based algorithm for approximating self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: §3.2, §3.3.
  • L. Yang, P. Wang, X. Zhang, S. Wang, Z. Gao, P. Ren, X. Xie, S. Ma, and W. Gao (2020) Region-adaptive texture enhancement for detailed person image synthesis. In IEEE International Conference on Multimedia and Expo (ICME), Cited by: §3.2.
  • G. Yildirim, N. Jetchev, R. Vollgraf, and U. Bergmann (2019) Generating high-resolution fashion model images wearing custom outfits. External Links: 1908.08847 Cited by: Appendix A, §3.3, §3.3.
  • M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, and A. Ahmed (2020) Big Bird: transformers for longer sequences. In Advances in Neural Information Processing Systems, Cited by: §2.2, §2.3.
  • R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.1.
  • S. Zhou, M. Gordon, R. Krishna, A. Narcomey, L. F. Fei-Fei, and M. Bernstein (2019) HYPE: a benchmark for human eye perceptual evaluation of generative models. In Advances in Neural Information Processing Systems, Cited by: §3.2.
  • T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros (2016) View synthesis by appearance flow. In Computer Vision - ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), Cited by: §1.
  • Z. Zhu, T. Huang, B. Shi, M. Yu, B. Wang, and X. Bai (2019) Progressive pose attention transfer for person image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2347–2356. Cited by: §3.2, §3.

Appendix A Additional high resolution results

Figure 11 shows further ablation results (similar to Fig. 7). We think that the high resolution, the larger size (close to 1 million images) and the many unique details of the fashion dataset of Yildirim et al. [2019] showcase the usefulness of copying from the conditioning image especially well. Details such as the cherries in the image are very difficult to learn from training data alone, but efficient copying from the conditioning image handles such unique details well, even when they are seen only at inference time.

Figure 11: Generating the image of the source appearance in a different pose. Increasing the parameters of the GPA attention trades off cheap, approximate attention against accurate, expensive full-attention behaviour. The generated images are sorted from left to right by increasing GPA parameters, which correlates with higher fidelity of visual detail; e.g. see how the cherry in the logo gradually appears.

Appendix B Architecture details

| layer number | spatial size | channels | attention type | GPA downsampling |
|---|---|---|---|---|
| 1 | | 640 | full attention | - |
| 2 | | 640 | full attention | - |
| 3 | | 640 | full attention | - |
| 4 | | 640 | full attention | - |
| 5 | | 320 | GPA | 2 |
| 6 | | 160 | GPA | 4 |

Table 2: DF pose morphing architecture – sizes of Unet decoder blocks.

| layer number | spatial size | channels | attention type |
|---|---|---|---|
| 1 | | 160 | - |
| 2 | | 320 | - |
| 3 | | 640 | - |
| 4 | | 640 | - |
| 5 | | 640 | - |
| 6 | | 640 | - |

Table 3: DF pose morphing architecture – sizes of Unet encoder blocks, same for Discriminator blocks.

Table 2 details the pixel architecture used for DeepFashion generation, namely the decoder blocks of the UNets in the pose-morphing architecture with attention copying. These blocks also contain upsampling and LeakyReLU nonlinearities, as in Karras et al. [2020]. The table indicates at which levels approximate attention was used and at which levels full attention could be used. Where GPA approximate attention is used, we also give the downsampling factor, chosen to downsample to pixels for Phase 1. As in Karras et al. [2020], the final RGB image comes from additional convolutions with channels that project the feature maps to RGB space; these outputs are upsampled and added residually at every decoder layer (as in StyleGAN2), as shown in Fig. 12. Decoder blocks use the mean of the last encoder layer as the style input for the StyleGAN blocks.
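One way to keep the decoder layout of Table 2 in code is a small list of per-block records. The following is only an illustrative sketch: the names `DecoderBlock`, `DF_DECODER` and `validate` are hypothetical and not taken from the released code, and the spatial sizes are left out of the records.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DecoderBlock:
    """One decoder block; field names are hypothetical, values follow Table 2."""
    layer: int
    channels: int
    attention: str                          # "full" or "gpa"
    gpa_downsampling: Optional[int] = None  # only meaningful for GPA blocks


# Channels, attention type and GPA downsampling factor per decoder layer.
DF_DECODER: List[DecoderBlock] = [
    DecoderBlock(1, 640, "full"),
    DecoderBlock(2, 640, "full"),
    DecoderBlock(3, 640, "full"),
    DecoderBlock(4, 640, "full"),
    DecoderBlock(5, 320, "gpa", 2),
    DecoderBlock(6, 160, "gpa", 4),
]


def validate(blocks: List[DecoderBlock]) -> None:
    """Check that exactly the GPA blocks carry a downsampling factor."""
    for block in blocks:
        if block.attention not in ("full", "gpa"):
            raise ValueError(f"unknown attention type in layer {block.layer}")
        has_factor = block.gpa_downsampling is not None
        if (block.attention == "gpa") != has_factor:
            raise ValueError(f"inconsistent GPA settings in layer {block.layer}")


validate(DF_DECODER)
gpa_layers = [b.layer for b in DF_DECODER if b.attention == "gpa"]
```

A record list like this makes the switch from full attention to GPA at the higher-resolution layers explicit, and the same structure could hold a variant with different channel counts and downsampling factors.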

Encoder blocks are similar, but have no attention, use downsampling instead of upsampling, and take no style input (the style vector can be considered constant). Discriminator blocks are identical to encoder blocks, as described in Table 3.

The architectures for the larger experiments in Section 3.3 are slightly modified: Table 4 lists the decoder channels and attention types, and the encoders and discriminator are similar.

| layer number | spatial size | channels | attention type | GPA downsampling |
|---|---|---|---|---|
| 1 | | 768 | full attention | - |
| 2 | | 768 | full attention | - |
| 3 | | 768 | full attention | - |
| 4 | | 512 | GPA | 2 |
| 5 | | 256 | GPA | 4 |
| 6 | | 128 | GPA | 8 |

Table 4: Pose morphing architecture for 512x256 pixel resolution – sizes of Unet decoder blocks.

Figure 12: Scheme of the pose-morphing generator architecture with two UNets: one for the concatenated source appearance (M1) and pose (B1), and one for the target pose (B2). The latter outputs the generated target appearance (M2), using a residual-skip scheme to add (upsampled) RGB space outputs, as StyleGAN2 does.
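The residual-skip scheme named in the caption of Figure 12 can be sketched as a running sum: each decoder layer projects its features to RGB, and the RGB image accumulated so far is upsampled and added to it. The sketch below is a toy illustration under several assumptions that are not from the paper: a 1x1 projection to 3 channels, nearest-neighbour 2x upsampling, tiny feature maps, and plain nested lists instead of PyTorch tensors so that it runs without dependencies. All function names are hypothetical.

```python
def to_rgb(features, weight):
    """1x1 convolution: mix the C channels of a [C][H][W] map into len(weight) outputs."""
    height, width = len(features[0]), len(features[0][0])
    return [
        [
            [sum(w * features[c][y][x] for c, w in enumerate(out_weights))
             for x in range(width)]
            for y in range(height)
        ]
        for out_weights in weight
    ]


def upsample2x(image):
    """Nearest-neighbour 2x upsampling of a [C][H][W] nested list (assumed filter)."""
    return [
        [[v for v in row for _ in range(2)] for row in channel for _ in range(2)]
        for channel in image
    ]


def add(a, b):
    """Element-wise sum of two [C][H][W] nested lists of equal shape."""
    return [
        [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(a, b)
    ]


def accumulate_rgb(feature_maps, weights):
    """Residual skip: rgb_l = upsample2x(rgb_{l-1}) + to_rgb(features_l)."""
    rgb = None
    for features, weight in zip(feature_maps, weights):
        out = to_rgb(features, weight)
        rgb = out if rgb is None else add(upsample2x(rgb), out)
    return rgb


def constant_map(channels, height, width, value=1.0):
    """Helper for the toy example: a [C][H][W] map filled with one value."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


# Toy example with hypothetical sizes: three layers, resolution doubling each time.
channel_counts = [8, 4, 2]
sizes = [(4, 2), (8, 4), (16, 8)]
feature_maps = [constant_map(c, h, w) for c, (h, w) in zip(channel_counts, sizes)]
weights = [[[1.0] * c for _ in range(3)] for c in channel_counts]
image = accumulate_rgb(feature_maps, weights)
```

With all-ones features and weights, every pixel of `image` is the sum of the per-layer channel counts, which makes it easy to check that each layer contributes exactly once to the final output.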

Appendix C Crowdsourcing

We ran the tests on the Figure Eight (Appen) platform; see Fig. 13 for an example of the setup. An image is shown and the participant has a decision time limited to 4 seconds to decide "is it true or generated". We evaluated 4 methods; for each we had 55 fake and 55 true images, and we gathered 3 clicks per image. The price was 8 cents per click. We also had 45 "golden" questions for performance monitoring, and required participants to answer at least 60 percent of them correctly.

Since we expect participants to need no more than 10 seconds per image (including the time to switch images), top workers can earn up to 30 dollars per hour, which is a fair wage. The overall cost for 1320 clicks was around 120 dollars, including overhead for testing and task setup.
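The figures quoted above can be checked with a few lines of arithmetic. This is only a sanity check of the stated numbers; the variable names are illustrative, and prices are kept in integer cents to avoid floating-point rounding.

```python
METHODS = 4
IMAGES_PER_METHOD = 55 + 55   # fake + true images per method
CLICKS_PER_IMAGE = 3
PRICE_PER_CLICK_CENTS = 8
SECONDS_PER_IMAGE = 10        # assumed upper bound, including switching time

# Total number of paid judgements and their direct cost.
total_clicks = METHODS * IMAGES_PER_METHOD * CLICKS_PER_IMAGE
click_cost_cents = total_clicks * PRICE_PER_CLICK_CENTS

# Hourly earnings of a worker who spends the full 10 seconds on every image.
clicks_per_hour = 3600 // SECONDS_PER_IMAGE
hourly_rate_cents = clicks_per_hour * PRICE_PER_CLICK_CENTS

print(f"total clicks: {total_clicks}")
print(f"direct click cost: {click_cost_cents / 100:.2f} dollars")
print(f"hourly rate at 10 s per image: {hourly_rate_cents / 100:.2f} dollars")
```

The direct click cost comes to 105.60 dollars, so the remainder of the roughly 120 dollars corresponds to the overhead mentioned above; a worker at 10 seconds per image earns 28.80 dollars per hour, consistent with the "up to 30 dollars" estimate.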

Figure 13: Instructions for crowdsourcing perceptual test for DeepFashion pose-morphing.

Appendix D Broader impact

The GPA approximation can allow researchers to train larger attention models than was possible before, or to work with smaller GPU computation budgets. This can be a net positive

  • for the limited budgets of small research organisations

  • for the environment by reducing overall electricity consumption.

In general, our motivation when designing the experiments was to get the maximal model accuracy on a limited computation budget. For example, our experiments were always run on a single V100 GPU for a limited amount of time (5 or 7 days). This is computationally much cheaper than the alternative of running all methods on multiple GPUs for an unlimited amount of time until convergence. We think that such an approach keeps deep learning research feasible with the resources of small research labs.

Application-wise, we have not dealt with any sensitive personal information records, and the images we created were solely for the purpose of benchmarking deep learning architectures. However, as a general word of caution, pose morphing is closely related to DeepFake technologies. While we consider static image editing less problematic, the closely related application of morphing high-resolution video can have a much larger impact on society. Research in this area should be open and ethical, to prevent abuse and disinformation.