
Patch-Based Stochastic Attention for Image Editing

by Nicolas Cherel et al.

Attention mechanisms have become of crucial importance in deep learning in recent years. These non-local operations, which are similar to traditional patch-based methods in image processing, complement local convolutions. However, computing the full attention matrix is an expensive step, with a heavy memory and computational load. These limitations curb network architectures and performance, in particular in the case of high-resolution images. We propose an efficient attention layer based on the stochastic algorithm PatchMatch, which is used for determining approximate nearest neighbors. We refer to our proposed layer as a "Patch-based Stochastic Attention Layer" (PSAL). Furthermore, we propose different approaches, based on patch aggregation, to ensure the differentiability of PSAL, thus allowing end-to-end training of any network containing our layer. PSAL has a small memory footprint and can therefore scale to high-resolution images. It maintains this footprint without sacrificing the spatial precision and globality of the nearest neighbors, which means that it can be easily inserted at any level of a deep architecture, even in shallower levels. We demonstrate the usefulness of PSAL on several image editing tasks, such as image inpainting and image colorization.





1 Introduction

Attention mechanisms (Vaswani et al., 2017) have become of crucial importance in many deep learning architectures. Originating in Natural Language Processing (NLP), where self-attention models have seen great successes (Devlin et al., 2019), these attention mechanisms have spread to other domains, and in particular to images. The use of attention has helped deep learning introduce long-range dependencies. This addresses a drawback of the commonly used convolutions, which are local operations. Even if deeper networks and dilated convolutions can extend the receptive field of the network, they nevertheless fall short when non-local behaviors are important. This happens in text processing, where words referring to the same subject can be far apart, in video classification (Wang et al., 2018), when the action changes position in time and space, or in image editing, where a global coherence must be maintained (Zhang et al., 2019; Yu et al., 2018).

In spite of this recent increase in popularity, the standard method to compute attention suffers from poor algorithmic complexity, scaling quadratically with the number of elements in a tensor. This restricts feasible applications to low-resolution images or features.

Attention models share their underlying principles with non-local methods (Buades et al., 2005), and are mostly used in conjunction with the highly popular transformers, as well as with image restoration and image editing networks. In this context, the desirable properties of attention are efficiency with respect to the input size, pixel level accuracy with overlapping queries / patches, and global processing of the data, i.e. no “tricks” which carry out attention locally.

It turns out that the attention layer is very closely linked to the problem of Nearest Neighbor (NN) search. The softmax layer effectively biases the distribution of weights towards a handful of similar points. In this paper, we show that when dealing with images, attention mechanisms can be efficiently estimated via an Approximate Nearest Neighbor (ANN) search. For this search, we turn to the prominent PatchMatch algorithm (Barnes et al., 2009), a fast algorithm for ANN search that is especially efficient for comparing similar images. In order to overcome the computational limits of the traditional attention layer, we propose an attention layer which employs the PatchMatch method, specifically designed for the case of images, which we name Patch-Based Stochastic Attention Layer (PSAL). PSAL has a small memory impact, scaling linearly with the input image size. As a result, it can be applied to large two-dimensional inputs, and in particular allows us to apply the attention mechanism to high resolution images or to both shallow and deep 2D feature maps. Such situations are out of reach for classical attention mechanisms because of memory limitations, and require the use of a sub-sampling strategy or a restriction to very deep features. These approaches are problematic for image editing in several situations: handling high resolution images, using low level features closer to the original image, or applying attention mechanisms at the pixel level.

We illustrate the usefulness of PSAL on several image editing tasks, such as image colorization and inpainting. In the inpainting case in particular, we show that our approach can handle high resolution images without impairing the quality of results.

The paper is organised as follows. In Section 2, we detail previous work related to attention layers and patch-based image editing. In Section 3, we describe the classical attention layer and then the approach proposed in this paper. In Section 4, we evaluate our approach on four tasks. Firstly, we compare theoretical and true memory usage; then we validate the approximation performance of our approach on an image reconstruction task. Thirdly, we compare PSAL with other attention layers on a task of image colorization. Finally, we show how our PSAL can be used for image inpainting, allowing us to process high-resolution images. To summarize, we propose a new attention layer for images that has a greatly reduced memory complexity compared to traditional attention layers, while maintaining a similar basic functionality. In particular, this means that attention-based architectures can be easily modified to process images with a much higher resolution than is currently attainable. As a concrete example, Full Attention memory requirements scale quadratically with the number of pixels (requiring 16 GB of memory for a 256x256 image and an infeasible 256 GB for 512x512 images). PSAL, on the other hand, scales linearly, and needs only 786 kB and 3.15 MB, respectively, illustrating the great reduction in memory requirements which the proposed method entails.
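The quadratic-versus-linear scaling above can be checked with simple arithmetic. The sketch below assumes float32 storage and that the sparse layer keeps roughly three numbers per pixel (an offset pair plus a similarity score); these constants are illustrative assumptions, not the paper's exact accounting.

```python
# Back-of-the-envelope memory comparison (assumes float32 entries).
# Full attention stores an n x n similarity matrix; a PatchMatch-style
# layer stores only a few values per pixel (here: one offset pair plus
# a similarity score, i.e. 3 numbers).

def full_attention_bytes(h, w, bytes_per_entry=4):
    n = h * w
    return n * n * bytes_per_entry

def psal_bytes(h, w, values_per_pixel=3, bytes_per_entry=4):
    return h * w * values_per_pixel * bytes_per_entry

for side in (256, 512):
    fa = full_attention_bytes(side, side) / 1e9   # GB
    ps = psal_bytes(side, side) / 1e3             # kB
    print(f"{side}x{side}: full attention {fa:.0f} GB, sparse {ps:.0f} kB")
```

Under these assumptions the numbers come out close to those quoted above (786 kB for 256x256, about 3.1 MB for 512x512 on the linear side).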

2 Related work

2.1 Attention models

The origins of attention models can be found in the work of Vaswani et al. (2017) on NLP, where they became a crucial element of subsequent networks (Devlin et al., 2019). In computer vision, self-attention has been applied successfully to generative adversarial networks (Zhang et al., 2019; Parmar et al., 2018), object detection (Carion et al., 2020) and video classification (Wang et al., 2018). Attention models are also closely related to the Non-Local operations presented by Wang et al. (2018). In their framework, self-attention with a softmax activation function becomes a specific non-local layer.

Self-attention has a quadratic memory complexity, which has inspired work on more efficient alternatives. Some works have focused on reducing the number of distances to compute. Child et al. (2019) propose the Sparse Transformer, which factorizes the attention matrix with strided and fixed patterns. Zaheer et al. (2020) combine random attention, local attention, and global attention. Kitaev et al. (2020) introduce a layer based on Locality Sensitive Hashing (LSH), an ANN method. Other methods have looked at approximations of the softmax using linear operations. Katharopoulos et al. (2020) approximate the softmax with a linear operation through a fixed projection, making the attention much faster and more efficient. Performers (Choromanski et al., 2020) estimate the attention matrix with orthogonal random features. Wang et al. (2020) proposed the Linformer, which factorizes the self-attention into low-rank matrices. Tay et al. (2020) have written a comprehensive survey of efficient transformers.

Efficient attention models are needed for vision-related tasks, given the high dimensionality of images. Local Attention (Parmar et al., 2018, 2019) restricts the attention to a smaller neighborhood. To further extend the attention range, Swin (Liu et al., 2021) proposed local windows with cyclic shifts. Calian et al. (2019), whose work is close in spirit to the proposal of this paper, propose an attention layer derived from PatchMatch; however, their preprint does not verify the validity of the layer in a practical deep learning setting, meaning that there is no way of knowing whether the layer functions correctly.

Since we will be evaluating our PSAL on image editing, in particular the inpainting problem, and given the close link of this domain with patch-based methods, we now present previous work concerning image editing.

2.2 Image editing

Image editing has long used patch-based approaches for inpainting (Criminisi et al., 2004; Wexler et al., 2007), retargeting (Barnes et al., 2009), style transfer (Frigo et al., 2016), and other image editing tasks (Darabi et al., 2012), making heavy use of NN patches. The search for NN patches, the bottleneck of such approaches, was greatly accelerated by PatchMatch (Barnes et al., 2009), an ANN algorithm that works by propagating good matches to neighbors.

Recently, deep learning alternatives have used encoder-decoder architectures and the GAN framework for image inpainting (Pathak et al., 2016; Iizuka et al., 2017). Yu et al. (2018) observed that the resulting textures often lack detail, and proposed an attention layer to reuse existing patches, bridging the gap between deep learning and patch-based methods. ShiftNet (Yan et al., 2018) shares a similar idea but replaces the softmax stage by an argmax, effectively reducing the attention layer to a NN layer.

3 Patch-Based Stochastic Attention

3.1 Full Attention

We first give the mathematical definition of what we refer to as the Full Attention layer (FA layer), the classical dot-product attention mechanism introduced in (Vaswani et al., 2017). Let Q denote a set of n queries packed into an n x d matrix, each query q_i being a vector of dimension d. Intuitively, these queries correspond to the different elements for which we want an attention vector. In the context of images, this may be a set of patches. The queries are compared to a set of keys, packed into the matrix K. These keys correspond to elements that we want to use as a reference to give more or less importance to the queries. In the image case, the keys may be a set of patches (not necessarily the same as the queries). Given a matrix of values V, the attention output is the following weighted sum:

    Y = softmax(Q K^T) V,    (1)

where the softmax of the matrix of similarities S = Q K^T is a row-stochastic matrix A defined by A_ij = exp(S_ij) / sum_k exp(S_ik). Thus, the final result of the attention is a vector for each query, containing a weighted average of the values, weighted by the dot product of the query with the rows of K (the keys). In this paper, we concentrate on the case where these matrices contain image patches. For simplicity, in all that follows, we will consider that K = V. However, we note that our approach is equally applicable in the general case where K and V differ.

The dot-product attention in Equation (1) requires the computation of the full matrix S with n^2 entries. This results in a computational complexity of O(n^2 d) and a memory complexity of O(n^2), n being the input size, e.g. the sequence length or the number of pixels. For 1-dimensional vectors, this can be implemented with simple matrix multiplications. For 2-dimensional vectors, i.e. patches, the dot products can be computed as 2D convolutions, as remarked by Li and Wand (2016).
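As a concrete reference point, the FA layer of Equation (1) can be sketched in a few lines of NumPy; the function name and shapes here are ours, and the n x m matrix S built in the first line is exactly the object whose memory cost is at issue.

```python
import numpy as np

# Minimal dense dot-product attention (Eq. 1), for illustration only.
# Q: (n, d) queries, K: (m, d) keys, V: (m, c) values.
def full_attention(Q, K, V):
    S = Q @ K.T                           # (n, m) full similarity matrix
    S = S - S.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    A = np.exp(S)
    A /= A.sum(axis=1, keepdims=True)     # row-stochastic softmax matrix A
    return A @ V                          # (n, c) weighted average of values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(5, 8)), rng.normal(size=(7, 8)), rng.normal(size=(7, 3))
Y = full_attention(Q, K, V)
print(Y.shape)  # (5, 3): one output vector per query
```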

This memory requirement is the most problematic limitation of FA layers. To address this problem, Yu et al. (2018) subsample the set of keys. Another approach, proposed by Liu et al. (2018) for image restoration, is to compute the attention restricted to a local neighborhood. While such approximations are useful, they nevertheless still require many pairwise distances to be computed. In practice, we remark that after the softmax operation in Equation (1), only a few elements actually matter. One way of viewing the attention layer is as a "soft" NN search layer. Consequently, in order to limit the algorithmic complexity, we propose to switch to a sparse layer, keeping only a single non-zero value in each row of the matrix A, corresponding to the NN. The crux of the problem is now to solve the NN search quickly and with little memory overhead. For this purpose, we propose to employ an efficient ANN algorithm, designed specifically for images: the PatchMatch algorithm (Barnes et al., 2009).

3.2 Patch-based Stochastic Attention Layer (PSAL)

Figure 1: Illustration of patch NN search. (a) Full Attention computes a complete attention matrix, but many elements have negligible weight. (b) Patch-Based Stochastic Attention only randomly probes a few elements. (c) Good matches are propagated to neighbors.

Recall that our goal is to replace the traditional attention layer, which is cumbersome in terms of memory, with a more efficient approach designed for images. As we have noted above, attention layers are closely related to the search for NNs. We start by defining the NN mapping phi between the query vectors and the key vectors:

    phi(i) = argmax_j sim(q_i, k_j),    (2)

where q_i is the i-th row of the matrix Q, corresponding to the i-th query vector, and likewise for k_j. In the attention literature, the dot product is commonly used to compare patches, but for more generality, we introduce a patch similarity function sim(., .), which is high when patches are similar.

Finally, we introduce the associated sparse matrix A* defined as:

    A*_ij = 1 if j = phi(i), and 0 otherwise.    (3)

Our definition of attention, which can be seen as a rewriting of Equation (1), is simply

    Y = A* V.    (4)
The next step is to provide a fast and lightweight way to approximate phi. In the general case, this can be implemented using ANN algorithms, like kd-trees or Locality Sensitive Hashing, as done by Kitaev et al. (2020). For images and image-like tensors such as feature maps, where the vectors are patches, PatchMatch is an efficient alternative. It accelerates the search for NNs by drawing on a specific regularity property of images: the shift map (i.e. the values phi(i) - i) between NNs in different images is almost piece-wise constant. Note that this implicitly requires a spatial organisation (1D, 2D, etc.) of the data, which is the case for images.
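For illustration, here is the hard NN attention of Equations (2)-(4) written with an exact argmax; PSAL replaces this exact search with PatchMatch, so treat this as a reference semantics rather than the efficient implementation.

```python
import numpy as np

# Sparse "hard" attention (Eqs. 2-4): each query attends only to its
# nearest key, so the attention matrix A* has a single nonzero entry per row.
def nn_attention(Q, K, V):
    phi = np.argmax(Q @ K.T, axis=1)  # NN mapping phi (Eq. 2), dot-product similarity
    return V[phi], phi                # gathering rows of V is equivalent to Y = A* V
```

Since the output is a pure gather, the memory cost is one index per query rather than a full n x m matrix.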

PatchMatch is an efficient, stochastic algorithm for searching for ANNs of patches in images and videos, between a query image and a key image. PatchMatch starts out by randomly associating ANNs to the query patches. In general, these ANNs will be of poor quality; however, from time to time a good association will be found. The algorithm then attempts to propagate the shift given by this ANN to other query patches in the spatial neighborhood of the matched query patch, under the hypothesis that these shifts are piece-wise constant. This happens, for example, when a coherent object is found in both the query image and the key image.

After the random initialization, the PatchMatch algorithm relies on two alternating steps:

  1. Propagation, in which good shifts are propagated to spatial neighbors

  2. Random search for better ANNs for each query patch

The random search is carried out by randomly looking in a window of decreasing size, around the current ANN. An illustration of the idea of PatchMatch can be seen in Figure 1.

A drawback of PatchMatch is that it is an iterative algorithm that is not naturally parallelizable, with the propagation step being inherently sequential, thus making it problematic for use in deep learning. However, we employ a semi-parallel approximation to this propagation step known as jump-flooding (Barnes et al., 2009; Rong & Tan, 2006) described in Algorithm 1. A significant advantage of PatchMatch is that it keeps only the current ANN, which vastly reduces the memory requirements.

Algorithm 1: Propagation step using jump-flooding
Input: queries Q, keys K, current ANN field phi. Output: updated ANN field phi.
parfor each query position i do
    for each direction delta in {up, down, left, right} do
        j <- phi(i + delta) - delta        (candidate position)
        if sim(q_i, k_j) > sim(q_i, k_phi(i)) then phi(i) <- j
end parfor
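A minimal, deliberately sequential Python sketch of one propagation sweep follows, assuming a negative squared L2 similarity and a 2D ANN field of key coordinates; the names and the border clamping are ours.

```python
import numpy as np

# One PatchMatch-style propagation sweep over a 2D ANN field: every
# position checks whether a neighbor's current match, shifted back by the
# same offset, is a better match than its own.
# queries, keys: (H, W, d) feature maps; nnf: (H, W, 2) integer positions in keys.
def propagate(queries, keys, nnf):
    H, W, _ = queries.shape
    sim = lambda q, k: -np.sum((q - k) ** 2)      # similarity: negative squared L2
    new = nnf.copy()
    for y in range(H):
        for x in range(W):
            best = sim(queries[y, x], keys[tuple(new[y, x])])
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if not (0 <= ny < H and 0 <= nx < W):
                    continue
                # candidate: the neighbor's match shifted back, clamped to the image
                cy = np.clip(nnf[ny, nx, 0] - dy, 0, H - 1)
                cx = np.clip(nnf[ny, nx, 1] - dx, 0, W - 1)
                s = sim(queries[y, x], keys[cy, cx])
                if s > best:
                    best, new[y, x] = s, (cy, cx)
    return new
```

A real jump-flooding implementation vectorizes this over all positions and repeats it with varying jump distances; since candidates are only accepted when they improve the similarity, each sweep can only improve the field.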

The computational complexity of our proposed PSAL is O(Tnd), with T the number of iterations of propagation / random search, and the memory complexity is O(n). This is to be compared with the Full Attention layer, whose complexities are O(n^2 d) and O(n^2), respectively. In particular, the memory complexity of PSAL is linear with respect to the number of queries, while that of Full Attention is quadratic. This has important consequences, in particular on the maximum resolution of images that can be processed by deep learning architectures which employ attention layers.

Figure 2: Full Attention (left) and PSAL (right) reconstruction using another frame of the same video. The memory constraints and the subsequent subsampling step make it impossible for Full Attention to capture all details.

3.3 Differentiability

Unfortunately, PatchMatch using a single NN is not differentiable with respect to Q and K, because of the argmax operator in Equation (2).

Following Plötz & Roth (2018), however, we can approximate the matrix A* in Equation (4) by a continuous relaxation. This relaxation is related to a truncated version of the full attention matrix in Equation (1) and admits A* as a limit case. Moreover, the relaxation is differentiable with respect to Q and K, and we propose two ways to construct it in a computationally and memory-efficient way: one based on using several NNs, and one based on patch aggregation. Our theoretical findings are confirmed by the experiments (Section 4.3.1).

3.3.1 k-Nearest Neighbors differentiability

In order to overcome the aforementioned difficulty with only a single NN, we propose to enrich the matrix using several NNs for each patch, which can be done with a modified version of PatchMatch (Barnes & Shechtman, 2010). In this case, each phi(i) is now a set of NN correspondences. We start by redefining the matrix as:

    A_ij = sim(q_i, k_j) if j is in phi(i), and -infinity otherwise.    (5)

Then, by applying a softmax operation along the rows, we obtain a differentiable matrix with the same behavior as the original implementation. It can be seen that the Full Attention implementation corresponds to the limit case where phi(i) contains all the keys.
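A small sketch of this relaxation, using an exact top-k search as a stand-in for the multi-NN PatchMatch (the softmax is taken over the k retained candidates only; names and shapes are ours):

```python
import numpy as np

# Differentiable k-NN relaxation (Eq. 5): keep k candidate keys per query
# and renormalize with a softmax restricted to those candidates.
def knn_attention(Q, K, V, k=3):
    S = Q @ K.T                              # (n, m) similarities
    idx = np.argsort(-S, axis=1)[:, :k]      # exact k-NN, stand-in for PatchMatch
    Sk = np.take_along_axis(S, idx, axis=1)  # (n, k) retained similarities
    A = np.exp(Sk - Sk.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)        # softmax over the k candidates only
    return np.einsum('nk,nkc->nc', A, V[idx])
```

Setting k equal to the number of keys recovers the dense softmax attention exactly, which matches the limit case noted above.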

Attention Method    Mem. complexity    64x64    128x128    256x256    512x512
Full Attention      O(n^2)             0.30     4.98       15.26      250.04
Local Attention     O(n)               0.19     0.64       3.23       13.12
Performer           O(n)               0.71     2.84       11.55      17.68
PSAL 3              O(n)               0.01     0.01       0.04       0.18
PSAL Aggreg.        O(n)               0.05     0.19       0.74       2.95
Table 1: Memory (mem.) in GB required by the attention layer as the input size increases; n is the number of pixels. For Local Attention, a fixed window size is used. For Performer, we use the recommended parameters. Tested with a fixed patch size and 16 channels. PSAL and PSAL Aggreg. have a low enough memory footprint to make batches larger than 1 possible. Performer uses 1D vectors; gathering patches and unrolling them consumes a significant amount of memory.

3.3.2 Patch aggregation differentiability

The second approach we propose is to perform spatial aggregation. Intuitively, we enrich the list of NNs for a given patch by using the NNs of the spatial neighbors of this patch. To put it more colloquially, the neighbor of my spatial neighbor is my neighbor. In this case, we redefine the matrix in terms of a spatial neighbor p of a patch i and the patch-space NN phi(p) of that neighbor, as:

    A_ij = sim(q_i, k_j) if j = phi(p) + (i - p) for some p in N(i), and -infinity otherwise,    (6)

where N(i) is the spatial patch neighborhood of i. The condition in Equation (6) basically says that, for a patch q_i, we are analysing its spatial neighbor q_p and the NN of q_p, namely k_phi(p). We then check that the spatial shift between the patches q_i and q_p is the same as between the NN patches k_j and k_phi(p). This last condition is necessary to link j to phi(p). In practice, the spatial neighborhood coincides with the patch neighborhood. This aggregation could also be useful for other sparse attention layers.
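The aggregation rule can be illustrated on a 2D ANN field: each position also inherits, as candidates, its spatial neighbors' matches shifted back by the corresponding offset ("the neighbor of my spatial neighbor is my neighbor"). This sketch (our names, clamped at image borders) only builds the enriched candidate sets:

```python
import numpy as np

# Candidate enrichment by spatial aggregation: for each position, add the
# match of every 4-connected neighbor p, shifted back by the offset to p.
# nnf: (H, W, 2) integer field of matched key coordinates.
def aggregate_candidates(nnf):
    H, W, _ = nnf.shape
    cands = {(y, x): {tuple(nnf[y, x])} for y in range(H) for x in range(W)}
    for y in range(H):
        for x in range(W):
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                p = (y + dy, x + dx)
                if 0 <= p[0] < H and 0 <= p[1] < W:
                    # neighbor's match shifted back, clamped to the key image
                    cy = min(max(nnf[p][0] - dy, 0), H - 1)
                    cx = min(max(nnf[p][1] - dx, 0), W - 1)
                    cands[(y, x)].add((cy, cx))
    return cands
```

On a perfectly consistent shift map, every position's enriched set collapses back to its own match, which is the piece-wise constant behavior the aggregation relies on.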

These two differentiability methods can be combined as desired. We present experiments in the next section showing that, without these approaches, networks have great difficulty learning and produce poor results.

4 Results

We now present quantitative and qualitative results showing the advantages of our proposed attention layer. We use four tasks for this:

  1. We analyze the memory consumption of different attention layers. We observe in particular that PSAL requires orders of magnitude less memory than alternatives

  2. Image reconstruction. We show that by replacing the FA layer with the proposed PSAL, the NN patches reconstruct an image well. This evaluation is motivated by the fact that the elements selected in the attention matrix should reconstruct or approximate the initial queries

  3. Image colorization. PSAL performs better than other attention layers in the context of guided image colorization, an image editing task for which attention is crucial

  4. Image inpainting. We show that it is possible to replace a classical FA layer directly with a PSAL, without affecting the inpainting quality, allowing for inpainting of high-resolution images

We compare our work with Full Attention and two other state-of-the-art attention layers: Local Attention (Parmar et al., 2018) and Performer (Choromanski et al., 2020). We compare these with two differentiable PSAL approaches: PSAL 3, which uses k-NNs with k = 3, and PSAL with aggregation (PSAL Aggreg.).

4.1 Memory benchmark

We recall that one of our initial motivations is to develop an efficient attention layer that can be used in any situation. In particular, we designed our layer not to be memory-bound when applied to mid-size to large images. In this section, we measure the memory footprint of various attention layers for different feature map sizes. We also compare the theoretical complexity with the true memory occupation. These results can be seen in Table 1. We see that our PSAL requires vastly less memory than FA and competing methods. For example, in the case of 512x512 images, FA requires 250 GB, whereas PSAL requires 0.18 GB or 2.95 GB (PSAL 3 or PSAL Aggreg., respectively).

Figure 3: Architecture for our colorization network. We have used as simple an architecture as possible to isolate the contribution of the attention layer.

4.2 Image reconstruction task

We now compare the performance of PSAL with that of an FA layer on the task of image reconstruction using patches. The goal of this experiment is to check that using ANNs does not induce any loss of quality with respect to the "standard" attention. This is not often discussed in the attention literature; however, it is important to closely analyze the reconstruction quality to ensure that PSAL can be used in any architecture. In the case of images, we can directly view this quality, contrary to the case of deep features.

We first note that the dot product is not an appropriate measure of similarity for patches in this context. We adapt the FA layer to compute the L2 distance, and change the similarity measure in PSAL to the negative L2 distance. This is straightforward to do, because PSAL can be used with any similarity function.
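One convenient property of this L2 swap is that pairwise negative squared distances expand as 2 q.k - ||q||^2 - ||k||^2, so they can reuse the same matrix product as the dot-product layer; a sketch (names are ours):

```python
import numpy as np

# Pairwise negative squared L2 similarity between queries Q (n, d) and
# keys K (m, d), computed via the expansion
#   -||q - k||^2 = 2 <q, k> - ||q||^2 - ||k||^2,
# i.e. one matrix product plus two norm terms, with no (n, m, d) tensor.
def neg_l2_similarity(Q, K):
    return 2 * Q @ K.T - (Q ** 2).sum(1, keepdims=True) - (K ** 2).sum(1)
```

The resulting (n, m) matrix can be dropped into the softmax or argmax stages exactly where the dot-product similarities were used.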

We reconstruct the image by applying the attention layer using, as queries, the patches of image A; as keys, the patches of image B; and as values, the pixels of image B. Images A and B are taken from the same video sequence.

Attention Method    Loss
PSAL 1              0.0083
PSAL 3              0.0023
PSAL Aggreg.        0.0019
Table 2: Colorization error of PSAL layers with different parameters. PSAL 1 is not fully differentiable, which limits learning and performance. These experiments show that the solutions proposed in Section 3.3 do indeed allow for efficient learning.

We consider a realistic case in terms of image size and patch size. In order to comply with the memory capacity of a classical GPU, the keys are obtained using a stride of 8 for Full Attention.

Figure 2 shows the results of the image reconstruction task. We observe that not only does PSAL maintain good reconstruction, it in fact performs better than FA, which is not able to reconstruct fine details (see the zooms in the red squares on the bottom right of the images). This is a result of the stride imposed by the strong memory requirements. On the other hand, the PSAL reconstruction has crisp details. This ability of PSAL to recover details is interesting when attention layers are used for tasks such as single image super-resolution (Parmar et al., 2018).

Figure 4: Results on the colorization task. First row: Ground truth, Full Attention, Local Attention. Second row: Performer, PSAL 3, PSAL Aggreg. Performer and Full Attention produce bland results. PSAL 3 and PSAL Aggreg. have good results despite some wrong matches (gray skin).
Attention Method    FA        Local     Perf.     PSAL 3    PSAL Aggreg.
Loss                0.0024    0.0031    0.0054    0.0023    0.0019
Table 3: Evaluation of the error of different attention layers on the colorization task: Full Attention (Vaswani et al., 2017), Local Attention (Parmar et al., 2018), Performer (Choromanski et al., 2020). We observe that our proposed PSAL, with aggregation differentiability, produces the best results.

4.3 Colorization

We evaluate PSAL on the task of guided image colorization. Given a grayscale image and a colorful reference image, we train a network to recover the color information. We use as simple a network as possible, to isolate the contribution of the attention layer. This architecture can be seen in Figure 3, and is further explained in the supplementary material. Because we are using similar, but not identical, images, attention is a crucial component for identifying the best regions from which to copy the colors.

We compare our results with the other attention models by keeping the same architecture and changing only the attention layer. We use images of size 256x256, which are large by feature map standards but low resolution in the modern context. The quantitative results of Table 3 show that PSAL performs favorably against all other considered methods. Full Attention comes close but is limited by memory constraints: to fit into memory, a subsampling step is necessary, which limits the set of keys considered. Local Attention (Parmar et al., 2018) performs well for frames with little or no displacement, but degrades abruptly when distant neighbors are required. PSAL 3 reaches good results but may sometimes wrongly match patches. PSAL Aggreg. helps with this aspect: the aggregation step smooths out irregularities. As in the case of image reconstruction (Section 4.2), the choice of either the L2 distance or the dot product when comparing patches has a significant impact on performance. Our method is compatible with any metric. For FA and Local Attention (Parmar et al., 2018), we replaced the dot product with an L2-based operation. For the Performer, which is based on softmax approximations of the dot product, we left the method as described by its authors, since there is no obvious way to adapt their algorithm to the L2 case. It indeed produces generally poorer results.

Figure 4 shows the visual results. We observe that FA and Performer produce faded results, while Local Attention has visible square artifacts because of its local windows. PSAL Aggreg. produces the best results in this case. Further results are available in the supplementary material.

4.3.1 Differentiability

We now show experimental evidence that our proposed strategies for PSAL's differentiability are effective and crucial for end-to-end training. Looking at Table 2, we see a large performance gap between PSAL 1 and PSAL 3 / PSAL Aggreg. PSAL 1 is indeed not fully differentiable with respect to all its parameters, which makes a common representation space impossible to optimize for. The attention does not help in this case and performance is weak.

Attention Method                         L1 loss   L2 loss   PSNR    TV loss   SSIM
ContextualAttention (Yu et al., 2018)    11.8%     3.6%      16.4%   6.6%      53.7%
PSAL (ours)                              11.6%     3.6%      16.6%   6.9%      54.1%
Table 4: Average inpainting metrics on the Places2 validation set. Quantitative measures show similar performance using our layer, with reduced memory requirements. The ContextualAttention model was retrained.

PSAL 3, on the other hand, employs a 3-NN PSAL layer which makes optimization possible. Similarly, PSAL Aggreg. performs extremely well. This experiment confirms that our two approaches for making PSAL differentiable are effective. This also shows that the adaptation of PatchMatch to feature maps is not as straightforward as it looks.

4.4 Image inpainting

A prominent field of application of attention models is image inpainting. This is the process of automatically filling in unknown or damaged regions in an image. However, one of the remaining blind spots of deep learning approaches to image inpainting is the correct reconstruction of textures and fine details. It turns out that this problem is, in turn, well addressed by patch-based approaches (Criminisi et al., 2004; Wexler et al., 2007). Thus, the ideal inpainting method would be able to unite the strengths of both approaches in a single algorithm. This has motivated the introduction of attention layers in deep inpainting networks.

Recently, Yu et al. (2018) introduced an attention layer, which we refer to as ContextualAttention (CA), with great success in their inpainting network. After a first coarse inpainting, the image is refined by two different branches: a fully convolutional network and an attention-based network. The outputs are then merged. At its core, CA is a Full Attention layer based on 3x3 patches in feature maps. Unfortunately, even with mid-level features, the remaining spatial size is too large to avoid a memory overflow, especially during training. Yu et al. limit the number of patches to be computed using a downsampling scheme. Once again, this illustrates the practical need for an attention layer which scales to large images. Given this setting, the proposed PSAL is a good fit for the network.

We directly replace the FA layer with our PSAL 3. We use a patch size of 7 for PSAL, which is equivalent to a patch size of 3 plus a downsampling with a factor of 2, as used by CA. The quantitative results show no significant differences between PSAL and ContextualAttention (Table 4). This confirms that PSAL can indeed replace the CA layer, with no quality loss, but with a great reduction in memory requirements. More inpainting results can be seen in the supplementary material. In particular, we show visual evidence that there is no visual loss of quality in comparison to the original ContextualAttention.

Note that we do not need to compare with other attention approaches, since our goal here is not to improve the quality of inpainting, but rather to show that our PSAL can be easily inserted into any existing architecture, without any loss in quality, while greatly reducing memory requirements (which we showed in Table 1).

Figure 5: A 2700x3300 image inpainted using PSAL. The initial occlusions are outlined in green. We show the full page version in the supplementary material. Original picture by Didier Descouens - Licensed under CC BY-SA 4.0

4.5 High resolution inpainting

Finally, in Figure 5 we show that with PSAL we can inpaint high resolution images, which is the initial goal that motivated this work (extending attention layers to larger resolution images). The low memory requirements make it possible to handle images of resolution up to 3300x3300 on an NVIDIA GTX 1080 Ti with 11 GB of memory. Processing such an image using the FA layer without subsampling would require more than 1000 GB of memory.

5 Conclusion

In this work, we have presented PSAL, an efficient patch-based stochastic attention layer that is not limited by GPU memory. We have shown that our layer incurs a much lighter memory load and scales to very high resolution images. This makes it possible to process high resolution images with deep networks using attention, without any of the customary tricks (subsampling, etc.). Furthermore, new network architectures using attention mechanisms on low-level features are now conceivable. We have demonstrated the use of PSAL in several tasks, showing in particular that high resolution image inpainting is achievable with PSAL.

We plan to continue this work by applying the proposed attention layer to other image editing tasks, and in particular to processing videos, which is not achievable at this point in time by classical attention architectures, due to the memory constraints addressed in the present work.


  • Barnes & Shechtman (2010) Barnes, C. and Shechtman, E. The generalized patchmatch correspondence algorithm. In (ECCV) European Conference on Computer Vision, pp. 29–43, 2010.
  • Barnes et al. (2009) Barnes, C., Shechtman, E., Finkelstein, A., and Goldman, D. B. PatchMatch: a randomized correspondence algorithm for structural image editing. In SIGGRAPH 2009, 2009.
  • Buades et al. (2005) Buades, A., Coll, B., and Morel, J.-M. A non-local algorithm for image denoising. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 2, pp. 60–65. IEEE, 2005.
  • Calian et al. (2019) Calian, D. A., Roelants, P., Cali, J., Carr, B., Dubba, K., Reid, J. E., and Zhang, D. SCRAM: Spatially Coherent Randomized Attention Maps. arXiv:1905.10308 [cs, stat], May 2019. arXiv: 1905.10308.
  • Carion et al. (2020) Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. End-to-End Object Detection with Transformers. In Vedaldi, A., Bischof, H., Brox, T., and Frahm, J.-M. (eds.), Computer Vision – ECCV 2020, Lecture Notes in Computer Science, pp. 213–229, Cham, 2020. Springer International Publishing. ISBN 978-3-030-58452-8.
  • Child et al. (2019) Child, R., Gray, S., Radford, A., and Sutskever, I. Generating Long Sequences with Sparse Transformers. ArXiv, 2019.
  • Choromanski et al. (2020) Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlós, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L., Belanger, D., Colwell, L. J., and Weller, A. Rethinking Attention with Performers. ArXiv, 2020.
  • Criminisi et al. (2004) Criminisi, A., Perez, P., and Toyama, K. Region Filling and Object Removal by Exemplar-Based Image Inpainting. IEEE Transactions on Image Processing, 13(9):1200–1212, September 2004. ISSN 1057-7149.
  • Darabi et al. (2012) Darabi, S., Shechtman, E., Barnes, C., Goldman, D. B., and Sen, P. Image melding. ACM Trans. Graph., 2012.
  • Devlin et al. (2019) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT, 2019.
  • Frigo et al. (2016) Frigo, O., Sabater, N., Delon, J., and Hellier, P. Split and Match: Example-Based Adaptive Patch Sampling for Unsupervised Style Transfer. In (CVPR) Conference on Computer Vision and Pattern Recognition, pp. 553–561. IEEE, jun 2016. ISBN 978-1-4673-8851-1.
  • Iizuka et al. (2017) Iizuka, S., Simo-Serra, E., and Ishikawa, H. Globally and locally consistent image completion. ACM Transactions on Graphics, 36(4):107:1–107:14, July 2017. ISSN 0730-0301.
  • Katharopoulos et al. (2020) Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proceedings of the International Conference on Machine Learning (ICML), 2020.
  • Kitaev et al. (2020) Kitaev, N., Kaiser, L., and Levskaya, A. Reformer: The Efficient Transformer. ICLR, 2020.
  • Li & Wand (2016) Li, C. and Wand, M. Combining Markov Random Fields and Convolutional Neural Networks for Image Synthesis. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2479–2486, June 2016. ISSN: 1063-6919.
  • Liu et al. (2018) Liu, D., Wen, B., Fan, Y., Loy, C. C., and Huang, T. S. Non-Local Recurrent Network for Image Restoration. Advances in Neural Information Processing Systems, 31, 2018.
  • Liu et al. (2021) Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In (ICCV) International Conference on Computer Vision, pp.  11, mar 2021.
  • Parmar et al. (2018) Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N. M., Ku, A., and Tran, D. Image Transformer. ICML, 2018.
  • Parmar et al. (2019) Parmar, N., Ramachandran, P., Vaswani, A., Bello, I., Levskaya, A., and Shlens, J. Stand-alone self-attention in vision models. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 68–80, 2019.
  • Pathak et al. (2016) Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., and Efros, A. A. Context Encoders: Feature Learning by Inpainting. In (CVPR) Conference on Computer Vision and Pattern Recognition, volume 2016-Decem, pp. 2536–2544. IEEE Computer Society, jun 2016. ISBN 978-1-4673-8851-1. doi: 10.1109/CVPR.2016.278.
  • Perazzi et al. (2016) Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., and Sorkine-Hornung, A. A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 724–732, June 2016. ISSN: 1063-6919.
  • Plötz & Roth (2018) Plötz, T. and Roth, S. Neural nearest neighbors networks. (NeurIPS) Advances in Neural Information Processing Systems, pp. 1087–1098, 2018. ISSN 10495258.
  • Rong & Tan (2006) Rong, G. and Tan, T.-S. Jump flooding in GPU with applications to Voronoi diagram and distance transform. In Proceedings of the 2006 symposium on Interactive 3D graphics and games - SI3D ’06, pp. 109, Redwood City, California, 2006. ACM Press. ISBN 978-1-59593-295-2.
  • Tay et al. (2020) Tay, Y., Dehghani, M., Bahri, D., and Metzler, D. Efficient Transformers: A Survey. ArXiv, 2020.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is All you Need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 5998–6008. Curran Associates, Inc., 2017.
  • Wang et al. (2020) Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
  • Wang et al. (2018) Wang, X., Girshick, R. B., Gupta, A., and He, K. Non-local Neural Networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
  • Wexler et al. (2007) Wexler, Y., Shechtman, E., and Irani, M. Space-Time Completion of Video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(3):463–476, March 2007. ISSN 0162-8828, 2160-9292.
  • Yan et al. (2018) Yan, Z., Li, X., Li, M., Zuo, W., and Shan, S. Shift-Net: Image Inpainting via Deep Feature Rearrangement. In Ferrari, V., Hebert, M., Sminchisescu, C., and Weiss, Y. (eds.), Computer Vision – ECCV 2018, volume 11218, pp. 3–19. Springer International Publishing, Cham, 2018. ISBN 978-3-030-01263-2 978-3-030-01264-9. Series Title: Lecture Notes in Computer Science.
  • Yu et al. (2018) Yu, J., Lin, Z. L., Yang, J., Shen, X., Lu, X., and Huang, T. S. Generative Image Inpainting with Contextual Attention. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
  • Zaheer et al. (2020) Zaheer, M., Guruganesh, G., Dubey, K. A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., et al. Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33, 2020.
  • Zhang et al. (2019) Zhang, H., Goodfellow, I., Metaxas, D. N., and Odena, A. Self-Attention Generative Adversarial Networks. In ICML, 2019.

Appendix A Colorization

A.1 Training details

The colorization network is minimal, containing only a single convolutional layer, the attention layer, and another convolution. The goal is to put the emphasis on the attention layer rather than on the problem of colorization itself, which is a complex one with an extensive literature. We add a residual connection so that the attention layer only has to provide color information.

We use 3x3 convolutions with 16 channels. Training is done on the DAVIS dataset (Perazzi et al., 2016), using the train and test-dev splits; testing is done on test-challenge. Images are resized to 256x256. The patch size is set to 7. We use Adam with a learning rate of 0.001 for 200k iterations and a batch size of 1.

Figure 6: More results on the colorization task. First row: Ground truth, Full Attention, Local Attention. Second row: Performer, PSAL 3, PSAL Aggreg. Performer and Full Attention produce bland results. Because of large displacements between frames local attention is not enough to recover the true colors resulting in artifacts. PSAL 3 and PSAL Aggreg. have good results despite some wrong matches.

A.2 ℓ2 vs dot product

For image colorization, we observed a large discrepancy between methods employing a distance and a dot product similarity measure in the attention, i.e. the ℓ2 distance vs the dot product. Table 5 shows this gap for the Full Attention layer with the same hyperparameters.

We hypothesize that in this bare-bones experiment, without additional layers and especially without normalization layers, the dot product is not well adapted.

Attention Method loss
Full Attention (dot product) 0.0064
Full Attention (ℓ2) 0.0024
Table 5: ℓ2 vs dot product colorization results for Full Attention. The choice of similarity metric is very important for this task; ℓ2 performs much better.

A.3 Pretrained networks

In the main article, we mentioned that PSAL 1 is not end-to-end trainable. However, we can still use it when we do not need to train the network: PSAL 1 remains a fast, light, and good approximation of Full Attention or of other PSAL k layers. For the colorization task, we trained and froze several different networks, then switched their attention layers to PSAL 1. We observed very good results with this approach. Table 6 shows, for instance, that Full Attention can be approximated with no drop in performance, but with 40x fewer computations and 225x less memory.

More generally, because PSAL is an approximation of Full Attention, we can replace attention layers, keeping similar results at a fraction of the original computational cost. This is particularly useful when performing inference on large inputs.

Attention Method (Weights) loss GFLOPs Memory (GB)
PSAL 1 (PSAL 1) 0.0083 30 0.05
PSAL Aggreg (PSAL Aggreg.) 0.0019 37 0.74
PSAL 1 (PSAL Aggreg.) 0.0023 30 0.05
PSAL 3 (PSAL 3) 0.0023 36 0.08
PSAL 1 (PSAL 3) 0.0023 30 0.05
Full Attention (Full Attention) 0.0024 1173 9.32
PSAL 1 (Full Attention) 0.0024 30 0.045
Local Attention (Local Attention) 0.0032 385 3.23
PSAL 1 (Local Attention) 0.0035 30 0.045
Table 6: Colorization performance when reusing weights from pretrained models with PSAL 1. PSAL 1 can replace the attention layers of trained networks with similar performance but a significant reduction in FLOPs and memory usage.
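Why does the swap work? A toy sketch (synthetic data, assuming an ℓ2-softmax attention as in A.2; not the paper's implementation) shows that when the softmax weights are sharply peaked, full attention collapses to returning the single nearest neighbour, which is precisely the quantity PSAL 1 approximates stochastically.

```python
import numpy as np

rng = np.random.default_rng(0)
keys = rng.normal(size=(100, 16))               # candidate patches (keys = values here)
query = keys[42] + 0.01 * rng.normal(size=16)   # query very close to one key

# Full Attention: softmax over negative squared l2 distances, weighted average.
d2 = ((keys - query) ** 2).sum(axis=1)
weights = np.exp(-d2 / 0.1)
weights /= weights.sum()
fa_out = weights @ keys

# Hard nearest-neighbour "attention": the quantity PSAL 1 approximates.
nn_out = keys[np.argmin(d2)]

# With a peaked softmax, the two reconstructions coincide up to numerical noise.
print(np.abs(fa_out - nn_out).max())
```

When the attention distribution is flatter, the hard assignment loses the averaging effect, which is where the aggregated variants (PSAL 3, PSAL Aggreg.) come in.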

A.4 Memory and computational efficiency

The results of Section 4.3 show that PSAL performs very well on the colorization task. On top of that, we want to emphasize that PSAL does not require much memory; rather the opposite. If we plot the error as a function of memory usage or of the number of Floating Point OPerations (FLOPs) (Figure 7), it is clear that PSAL does not trade performance for memory or computation.

Figure 7: Performance vs computational constraints (memory and GFLOPs) in the colorization task. Full Attention performs well at the cost of high memory usage and many GFLOPs. Local Attention is an efficient approximation of Full Attention with a limited drop in performance. Performer does not perform well. PSAL 3 and PSAL Aggreg. have better performance, greatly reduced memory usage, and require fewer FLOPs than the alternatives.

A.5 Why does Performer perform so badly on our colorization benchmark?

We split this question into two sub-questions: 1) why does Performer not successfully complete the task? 2) why are Performer's memory usage and FLOPs worse than Full Attention's, despite its linear complexity?

For the first question, we gave a hint in Section A.2. For a simple network without normalization layers, the ℓ2 patch distance is a better metric than the dot product. Indeed, in the case of unnormalized patches, the dot product is maximized by patches of large norm rather than by similar patches. We observed this for Full Attention, and we hypothesize that the same conclusion holds for Performer. We have not found an obvious adaptation of Performer to the ℓ2 metric.

For the second question, we used the recommended parameter, setting the dimension of the random-feature projection to d log d. In our case, the patch dimension d is large (7x7 patches with 16 channels, i.e. d = 784), and so the projection dimension is in the thousands. This explains the large memory usage: contrary to the other methods, each patch must be gathered and projected into this large space. For the number of FLOPs, we are in the regime where the projection dimension is no longer small compared to the number of patches, so the linear-complexity advantage does not pay off.

Appendix B Inpainting results

For completeness of Section 4.4, we provide visual comparisons of inpainting results. Figure 8 shows very similar results, which is reflected in the close quantitative inpainting metrics.

Figure 8: Results of ContextualAttention (Yu et al., 2018) and of our version of the algorithm, where the FA layer is replaced with PSAL. We observe that the results are of very similar quality, validating the direct and straightforward replacement of FA with PSAL. On the left of each set of results: the original image, with the occlusion in white. On the right: above, the result of FA; below, the result of PSAL.

Appendix C Inpainting training

We retrain Contextual Attention (Yu et al., 2018) using the same parameters as the authors and an implementation by Du Ang. Specifically, we train our networks for 800k iterations with a batch size of 16 on Places2.

For PSAL, we train for the same number of iterations. We use a patch size of 7, remove the downsampling step in the attention, reconstruct only from the central pixel, and use 5 iterations of PatchMatch.
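For reference, the PatchMatch search at the heart of PSAL can be sketched as follows. This is a minimal CPU version with single-pixel "patches" for brevity; the actual layer matches larger patches on the GPU, and details such as border handling are simplified.

```python
import numpy as np

def patchmatch(src, ref, iters=5, seed=0):
    """Minimal PatchMatch sketch (Barnes et al., 2009): for every pixel of
    `src`, find an approximate nearest neighbour in `ref` under squared l2
    distance, using random init, propagation, and random search."""
    rng = np.random.default_rng(seed)
    H, W, _ = src.shape
    h, w, _ = ref.shape

    # 1. Random initialization of the nearest-neighbour field (NNF).
    nnf = np.stack([rng.integers(0, h, (H, W)),
                    rng.integers(0, w, (H, W))], axis=-1)

    def cost(y, x, ny, nx):
        return ((src[y, x] - ref[ny, nx]) ** 2).sum()

    for it in range(iters):
        step = 1 if it % 2 == 0 else -1     # alternate scan direction
        ys = range(H) if step == 1 else range(H - 1, -1, -1)
        xs = list(range(W)) if step == 1 else list(range(W - 1, -1, -1))
        for y in ys:
            for x in xs:
                best_y, best_x = nnf[y, x]
                best_c = cost(y, x, best_y, best_x)
                # 2. Propagation: shift the matches of already-visited neighbours.
                for dy, dx in ((-step, 0), (0, -step)):
                    py, px = y + dy, x + dx
                    if 0 <= py < H and 0 <= px < W:
                        cy = min(max(nnf[py, px, 0] - dy, 0), h - 1)
                        cx = min(max(nnf[py, px, 1] - dx, 0), w - 1)
                        c = cost(y, x, cy, cx)
                        if c < best_c:
                            best_y, best_x, best_c = cy, cx, c
                # 3. Random search around the current best, shrinking the radius.
                radius = max(h, w)
                while radius >= 1:
                    cy = int(np.clip(best_y + rng.integers(-radius, radius + 1), 0, h - 1))
                    cx = int(np.clip(best_x + rng.integers(-radius, radius + 1), 0, w - 1))
                    c = cost(y, x, cy, cx)
                    if c < best_c:
                        best_y, best_x, best_c = cy, cx, c
                    radius //= 2
                nnf[y, x] = best_y, best_x
    return nnf
```

With `src = ref`, the field converges towards the identity map within a few iterations, at a cost linear in the number of pixels per iteration, which is what makes the stochastic search so much cheaper than the quadratic Full Attention matrix.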

Appendix D High resolution inpainting

We present further high resolution inpainting results. The images are processed at their native resolutions, without tricks. The network is still limited by its receptive field, which makes it possible to fill only small holes (128x128). Figures 9 and 10 highlight the completion of details and structures for which attention is important. Note that the maximum size of the occlusions/holes (128x128) is imposed by the network architecture of ContextualAttention. We could certainly design another architecture using PSAL for larger occlusion sizes, but to be fair to the original work we kept the same sizes.

Figure 9: A 2700x3300 image inpainted using PSAL. The initial occlusions are outlined in green. Original picture by Didier Descouens - Licensed under CC BY-SA 4.0
Figure 10: A 3600x2700 image inpainted using PSAL. The initial occlusions are outlined in green. Original picture by MOSSOT - Licensed under CC BY-SA 3.0