1 Introduction
Image deblurring aims to restore high-quality images from blurred ones. Significant progress has been made on this problem due to the development of various effective deep models and large-scale training datasets.
Most state-of-the-art methods for image deblurring are based on deep convolutional neural networks (CNNs). The success of these methods is mainly due to the development of various network architectural designs, for example, multi-scale [GoPro, SRN, MIMO] or multi-stage [DMPHN, MPRNet] network architectures, generative adversarial learning [DeblurGAN, DeblurGANv2], physics-model-inspired network structures [svrnn, physicgan], and so on. As the basic operation in these networks, the convolution is a spatially-invariant local operation, which does not model the spatially variant properties of the image contents. Most of these methods use larger and deeper models to remedy this limitation of the convolution. However, simply increasing the capacity of deep models does not always lead to better performance, as shown in [svrnn, physicgan].
Different from the convolution operation that models local connectivity, Transformers are able to model global contexts by computing the correlations of one token to all other tokens. They have been shown to be effective in many high-level vision tasks and have great potential to be alternatives to deep CNN models. In image deblurring, Transformer-based methods [Restormer, Uformer] also achieve better performance than CNN-based methods. However, the computation of the scaled dot-product attention in Transformers leads to quadratic space and time complexity in terms of the number of tokens. Although using smaller and fewer tokens can reduce the space and time complexity, such a strategy cannot model long-range information of features well and usually leads to significant artifacts when handling high-resolution images, which thus limits performance improvement.
To alleviate this problem, most approaches use a downsampling strategy to reduce the spatial resolution of features [PyramidVIT]. However, reducing the spatial resolution of features causes information loss and thus affects image deblurring. Several methods reduce the computational cost by computing the scaled dot-product attention along the channel dimension, so that the complexity depends on the number of features [Restormer, cotransformer]. Although the computational cost is reduced, the spatial information is not well explored, which may affect the deblurring performance.
In this paper, we develop an effective and efficient method that explores the properties of Transformers for high-quality image deblurring. We note that the scaled dot-product attention computation actually estimates the correlation of one token from the query and all the tokens from the key. This process can be achieved by a convolution operation when rearranging the permutations of tokens. Based on this observation and the convolution theorem, which states that a convolution in the spatial domain equals a point-wise multiplication in the frequency domain, we develop an efficient frequency domain-based self-attention solver (FSAS) to estimate the scaled dot-product attention by an element-wise product operation instead of the matrix multiplication. Therefore, the space and time complexity can be reduced to O(N log N) for each feature channel, where N is the number of pixels.
In addition, we note that simply using the feed-forward network (FFN) of [Restormer] does not generate good deblurred results. To generate better features for latent clear image restoration, we develop a simple yet effective discriminative frequency domain-based FFN (DFFN). Our DFFN is motivated by the Joint Photographic Experts Group (JPEG) compression algorithm. It introduces a gated mechanism into the FFN to discriminatively determine which low- and high-frequency information should be preserved for latent clear image restoration.
We formulate the proposed FSAS and DFFN into an end-to-end trainable network based on an encoder-decoder architecture to solve image deblurring. However, we find that as features of shallow layers usually contain blur effects, applying the scaled dot-product attention to shallow features does not effectively explore global clear contents. As the features from deep layers are usually clearer than those from shallow layers, we develop an asymmetric network architecture, where the FSAS is only used in the decoder module for better image deblurring. Our analysis shows that exploring the properties of Transformers in the frequency domain facilitates blur removal. Experimental results demonstrate that the proposed method performs favorably against state-of-the-art methods in terms of accuracy and efficiency (Figure 1).
The main contributions of this work are summarized as follows:

We develop an efficient frequency domain-based self-attention solver to estimate the scaled dot-product attention. Our analysis demonstrates that the frequency domain-based solver reduces the space and time complexity and is more effective and efficient.

We propose a simple yet effective discriminative frequency domain-based FFN based on the JPEG compression algorithm to discriminatively determine which low- and high-frequency information should be preserved for latent clear image restoration.

We develop an asymmetric network architecture based on an encoder-decoder network, where the frequency domain-based self-attention solver is only used in the decoder module for better image deblurring.

We show that exploring the properties of Transformers in the frequency domain facilitates blur removal and that our approach performs favorably against state-of-the-art methods.
2 Related Work
Deep CNN-based image deblurring methods.
In recent years, we have witnessed significant advances in image deblurring due to the development of different deep CNN models [GoPro, SRN, SSN, DMPHN, MPRNet, MIMO, NAFNet]. In [GoPro], Nah et al. propose a deep CNN based on a multi-scale framework to directly estimate clear images from blurred ones. To better utilize the information of each scale in the multi-scale framework, Tao et al. [SRN] develop an effective scale-recurrent network. Gao et al. [SSN] propose a selective network parameter sharing method to improve [GoPro, SRN].
As using more scales does not improve the performance significantly, Zhang et al. [DMPHN] develop an effective network based on a multi-patch strategy, where the deblurring process is achieved stage by stage. To better explore the features from different stages, Zamir et al. [MPRNet] propose a cross-stage feature fusion for better performance. To reduce the computational cost of the methods based on the multi-scale framework, Cho et al. [MIMO] present a multi-input and multi-output network. Chen et al. [NAFNet] analyze the baseline modules and simplify them for better image restoration. As demonstrated in [Restormer], the convolution operation is spatially invariant and does not effectively model the global contexts for image deblurring.
Transformers and their applications to image deblurring.
As the Transformer [Transformer] can model global contexts and achieves significant progress in many high-level vision tasks (e.g., image classification [Swin], object detection [Object_Detection, Object_Detection_2], and semantic segmentation [Segmentation, Segmentation_2]), it has been developed to solve image super-resolution [SwinIR], image deblurring [Restormer, Stripformer], and image denoising [IPT, Uformer]. To reduce the computational cost of Transformers, Zamir et al. [Restormer] propose an efficient Transformer model that computes the scaled dot-product attention in the feature depth domain. This method effectively explores information from different features along the channel dimension. However, the spatial information that is vital for image restoration is not fully explored. Tsai et al. [Stripformer] simplify the calculation of self-attention by constructing intra- and inter-strip tokens to replace the global attention. Wang et al. [Uformer] propose a Transformer based on a U-Net that uses non-overlapping window-based self-attention for single image deblurring. Although the splitting strategy reduces the computational cost, the coarse splitting does not fully explore the information of each patch. Moreover, the scaled dot-product attention in these methods usually requires a matrix multiplication whose space and time complexity is quadratic in the number of tokens.
Different from these methods, we develop an efficient Transformer-based method that explores the properties of the frequency domain to avoid the expensive matrix multiplication in the scaled dot-product attention.
3 Proposed Method
Our goal is to present an effective and efficient method that explores the properties of Transformers for high-quality image deblurring. To this end, we first develop an efficient frequency domain-based self-attention solver to estimate the scaled dot-product attention. To refine the features estimated by the frequency domain-based solver, we further develop a discriminative frequency domain-based feed-forward network. We formulate the above modules into an end-to-end trainable network based on an encoder-decoder architecture to solve image deblurring, where the frequency domain-based self-attention solver is used in the decoder module for better feature representation. Figure 2(a) shows an overview of the proposed method. In the following, we present the details of each component.
3.1 Frequency domain-based self-attention solver
Given the input feature X with a spatial resolution of H × W pixels and C channels, existing vision Transformers usually first compute the features X_q, X_k, and X_v by applying linear transformations W_q, W_k, and W_v to X. Then, they apply an unfolding function to the features X_q, X_k, and X_v to extract image patches {Q_i}, {K_i}, and {V_i} (i = 1, ..., N), where N denotes the number of extracted patches. By applying a reshape operation to the extracted patches, the query Q, key K, and value V can be obtained by:

(1) Q = R({Q_i}), K = R({K_i}), V = R({V_i}),

where R denotes the reshape function, which ensures that each extracted patch becomes a row vector of Q, K, and V; h and w denote the height and width of the extracted patches. Based on the obtained query Q, key K, and value V, the scaled dot-product attention is achieved by:
(2) V_att = softmax(QK^T / √d) V,

where d = hwC denotes the dimension of each vectorized patch.
The attention map computation involves the matrix multiplication QK^T, whose space and time complexity are O(N^2) and O(N^2 d), respectively. This is not affordable if the image resolution and the number of extracted patches N are large. Although using a downsampling operation to reduce the image resolution or a non-overlapping method to extract fewer patches alleviates the problem, these strategies lead to information loss and limit the ability to model details within and across each patch [cotransformer].
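To make the quadratic cost concrete, the following back-of-the-envelope sketch (with illustrative sizes, not the settings of our experiments) estimates the memory needed by the dense attention map alone when every pixel is treated as a token:

```python
# Rough cost of the N x N attention map for a high-resolution input,
# illustrating why full scaled dot-product attention is not affordable.

def attention_map_bytes(height, width, dtype_bytes=4):
    """Memory of the dense N x N attention map when every pixel is a token."""
    n = height * width          # number of tokens
    return n * n * dtype_bytes  # quadratic in the token count

# A 256 x 256 feature map already needs a 65536 x 65536 attention matrix.
mem_gb = attention_map_bytes(256, 256) / 1024**3
print(f"{mem_gb:.0f} GiB")  # 16 GiB at float32, for a single attention map
```

Doubling the image side length multiplies this figure by sixteen, which is why high-resolution inputs quickly become intractable.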
We note that each element of QK^T is obtained by an inner product:

(3) (QK^T)_{i,j} = Q_i K_j^T,

where Q_i and K_j are the vectorized forms of the i-th and j-th patches from Q and K. Based on (3), if we apply reshape functions to K_j and to all the patches {Q_i}, respectively, all the j-th column elements of QK^T can be obtained by a convolution operation, i.e., C_j = R_Q ⊛ R_{K_j}, where R_Q and R_{K_j} denote the reshaped results of {Q_i} and K_j, and ⊛ denotes the convolution operation. According to the convolution theorem, the correlation or convolution of two signals in the spatial domain is equivalent to an element-wise product of them in the frequency domain. Therefore, a natural question is: can we efficiently estimate the attention map by an element-wise product operation in the frequency domain instead of computing the matrix multiplication QK^T in the spatial domain?
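The convolution theorem underlying this question can be checked numerically. The sketch below (with arbitrary random signals q and k) verifies that the circular cross-correlation computed directly in the spatial domain matches the inverse FFT of the element-wise product of F(q) and the conjugate of F(k):

```python
import numpy as np

# Sanity check of the convolution theorem: circular cross-correlation of
# two signals equals ifft(fft(q) * conj(fft(k))), element-wise.
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal(8)
n = len(q)

# Direct circular cross-correlation: c[j] = sum_i q[i] * k[(i - j) % n],
# costing n^2 multiplications.
direct = np.array([sum(q[i] * k[(i - j) % n] for i in range(n))
                   for j in range(n)])

# Frequency-domain version: one element-wise product instead of n^2 multiplies.
freq = np.fft.ifft(np.fft.fft(q) * np.conj(np.fft.fft(k))).real

assert np.allclose(direct, freq)
```

The FFT pair costs O(n log n) instead of O(n^2), which is exactly the saving the FSAS exploits in two dimensions.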
To this end, we develop an effective frequency domain-based self-attention solver. Specifically, we first obtain Q, K, and V by applying a point-wise convolution followed by a depth-wise convolution to X. Then, we apply the fast Fourier transform (FFT) to the estimated features Q and K and estimate the correlation of Q and K in the frequency domain by:

(4) A = F^{-1}(F(Q) ⊙ conj(F(K))),

where F denotes the FFT, F^{-1} denotes the inverse FFT, conj(·) denotes the conjugate transpose operation, and ⊙ denotes the element-wise product. Then, we estimate the aggregated feature by:

(5) V_att = LN(A) ⊙ V,

where the layer norm LN is used to normalize A. Finally, we generate the output feature of FSAS by:

(6) X_out = W_{1×1}(V_att),

where W_{1×1} denotes a convolution with a filter size of 1 × 1 pixel. The detailed network architecture of the proposed FSAS is shown in Figure 2(b).
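A minimal NumPy sketch of the computation in (4)-(6) is given below. The per-channel correlation, the channel-wise layer norm, and the 1 × 1 convolution written as a channel-mixing matrix are our simplifications of the full design in Figure 2(b):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the channel axis (a simplification of the layer norm).
    mu = x.mean(axis=0, keepdims=True)
    sigma = x.std(axis=0, keepdims=True)
    return (x - mu) / (sigma + eps)

def fsas(q, k, v, w_out):
    """Minimal sketch of the frequency domain-based self-attention solver.

    q, k, v: (C, H, W) feature maps; w_out: (C, C) stand-in for the 1x1 conv.
    The correlation of Q and K is computed as an element-wise product in the
    frequency domain, per channel, instead of an N x N matrix multiplication.
    """
    a = np.fft.ifft2(np.fft.fft2(q) * np.conj(np.fft.fft2(k))).real
    v_att = layer_norm(a) * v                     # aggregate with normalized attention
    return np.einsum('oc,chw->ohw', w_out, v_att)  # 1x1 convolution as a channel mix

c, h, w = 4, 8, 8
rng = np.random.default_rng(1)
out = fsas(rng.standard_normal((c, h, w)),
           rng.standard_normal((c, h, w)),
           rng.standard_normal((c, h, w)),
           np.eye(c))
assert out.shape == (c, h, w)
```

The FFT/inverse-FFT pair here replaces the N × N attention matrix entirely, so the memory footprint stays linear in the number of pixels per channel.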
3.2 Discriminative frequency domain-based FFN
The FFN is used to improve the features generated by the scaled dot-product attention. Thus, it is important to develop an effective FFN that generates features facilitating the latent clear image reconstruction. As not all low-frequency and high-frequency information helps latent clear image restoration, we develop a DFFN that can adaptively determine which frequency information should be preserved. However, it is non-trivial to effectively determine which frequency information is important. Motivated by the JPEG compression algorithm, we introduce a learnable quantization matrix W and learn it by an inverse method of JPEG compression to determine which frequency information should be preserved. The proposed DFFN can be formulated as:
(7) X_out = σ(P_f(F^{-1}(W ⊙ F(P_u(X))))),

where P_u and P_f denote the patch unfolding and folding operations in the JPEG compression method, and σ denotes the GEGLU function of [GLU]. The detailed network architecture of the proposed DFFN is shown in Figure 2(c).
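The frequency filtering inside the DFFN can be sketched as follows. The gating (GEGLU) branch is omitted, the 8 × 8 block size mirrors JPEG's quantization tables, and the matrix w stands in for the learnable quantization matrix:

```python
import numpy as np

def dffn_filter(x, w, patch=8):
    """Sketch of the frequency filtering inside the DFFN (gating omitted).

    x: (H, W) feature map with H, W divisible by `patch`; w: (patch, patch)
    quantization-style matrix deciding which frequencies to keep, in the
    spirit of JPEG's 8x8 block-wise quantization tables.
    """
    h, w_dim = x.shape
    out = np.empty_like(x)
    for i in range(0, h, patch):          # unfold into JPEG-style blocks
        for j in range(0, w_dim, patch):
            block = x[i:i+patch, j:j+patch]
            f = np.fft.fft2(block)        # filter in the frequency domain
            out[i:i+patch, j:j+patch] = np.fft.ifft2(w * f).real  # fold back
    return out

x = np.arange(64, dtype=float).reshape(8, 8)
identity = np.ones((8, 8))    # all-pass matrix leaves the block unchanged
assert np.allclose(dffn_filter(x, identity), x)
```

Setting entries of w to zero discards the corresponding frequencies; for instance, a matrix that keeps only the DC entry reduces each block to its mean, which is the degenerate extreme of the low-pass behavior the learned matrix can express.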
3.3 Asymmetric encoder-decoder network
We embed the proposed FSAS and DFFN into a network based on an encoder-decoder architecture. We note that most existing methods usually use symmetric architectures for the encoder and decoder modules; for example, if the FSAS and DFFN are used in the encoder module, they are also used in the decoder module. The features extracted by the encoder module are shallow ones, which usually contain blur effects compared to the deep features from the decoder module. However, blur usually changes the similarity of two patches that would be similar in clear features. Thus, using the FSAS in the encoder module may not estimate the similarity correctly, which accordingly affects image restoration. To overcome this problem, we embed the FSAS into the decoder module, which leads to an asymmetric architecture for better image deblurring. Figure 2(a) shows the network architecture of the proposed asymmetric encoder-decoder network.

Finally, given a blurred image B, the restored image Î is estimated by the asymmetric encoder-decoder network:

(8) Î = N(B),

where N denotes the asymmetric encoder-decoder network.
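The asymmetric design can be summarized structurally as below; block orderings and stage counts are illustrative, not the exact configuration of the proposed network:

```python
# Structural sketch of the asymmetric design: self-attention (FSAS) appears
# only in the decoder. Stage counts and block lists are illustrative.

ENCODER_STAGES = [
    {"blocks": ["DFFN"] * 2},  # shallow features still contain blur,
    {"blocks": ["DFFN"] * 2},  # so no FSAS here: patch correlations
    {"blocks": ["DFFN"] * 2},  # would be unreliable
]
DECODER_STAGES = [
    {"blocks": ["FSAS", "DFFN"] * 2},  # deeper features are clearer,
    {"blocks": ["FSAS", "DFFN"] * 2},  # so self-attention is applied
    {"blocks": ["FSAS", "DFFN"] * 2},  # only in these stages
]

def uses_fsas(stages):
    """Check whether any stage in the list contains an FSAS block."""
    return any("FSAS" in s["blocks"] for s in stages)

assert not uses_fsas(ENCODER_STAGES)
assert uses_fsas(DECODER_STAGES)
```

The design choice encoded here is the one argued for above: attention is only worth its cost where the features it correlates are already reasonably clear.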
Figure 3 panels: (a) Blurred image; (b) GT; (c) MIMO-UNet+ [MIMO]; (d) Restormer [Restormer]; (e) Stripformer [Stripformer]; (f) Restormer-local [TLC]; (g) NAFNet [NAFNet]; (h) Ours.
4 Experimental Results
In this section, we evaluate the proposed method and compare it with stateoftheart methods using public benchmark datasets.
4.1 Datasets and parameter settings
Datasets.
We evaluate our method on commonly used image deblurring datasets including the GoPro dataset [GoPro], the HIDE dataset [HIDE], and the RealBlur dataset [Realblur]. We follow the protocols of existing methods for fair comparisons.
Parameter settings.
We use the same loss function as [MIMO] to constrain the network and train it using the Adam optimizer [Adam] with default parameters. The learning rate is initialized empirically and updated with the cosine annealing strategy over 600,000 iterations until it reaches its minimum value. The training patch size is set empirically and the batch size is set to 16. We adopt the same data augmentation method as [Restormer] during training. The patch size for the weight matrix estimation is set empirically based on the JPEG compression method. Similarly, we also use patches of this size when computing the self-attention in (4).

Due to the page limit, we include more experimental results in the supplemental material. The training code and models will be made available to the public.
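The cosine annealing schedule described above can be sketched as follows; the endpoint learning-rate values used here are placeholders, since the exact values are not reproduced in this text:

```python
import math

def cosine_annealing_lr(step, total_steps, lr_init, lr_min):
    """Cosine decay from lr_init down to lr_min over total_steps iterations."""
    t = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * t))

# Placeholder endpoints; only the 600,000-iteration horizon follows the text.
LR_INIT, LR_MIN, TOTAL = 1e-3, 1e-7, 600_000

assert math.isclose(cosine_annealing_lr(0, TOTAL, LR_INIT, LR_MIN), LR_INIT)
assert math.isclose(cosine_annealing_lr(TOTAL, TOTAL, LR_INIT, LR_MIN), LR_MIN)
```

The schedule decays slowly at first and near the end, spending most of its change in the middle of training, which is the usual motivation for cosine annealing over a step schedule.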
Table 1. Quantitative evaluations on the GoPro dataset [GoPro].
Methods  PSNRs  SSIMs  Parameters (M)
DeblurGAN-v2 [DeblurGANv2]  29.55  0.9340  60.9
SRN [SRN]  30.26  0.9342  6.8
DMPHN [DMPHN]  31.20  0.9453  21.7
SAPHN [SAPHN]  31.85  0.9480  23.0
MIMO-UNet+ [MIMO]  32.45  0.9567  16.1
MPRNet [MPRNet]  32.66  0.9589  20.1
DeepRFT+ [Deeprft]  33.23  0.9632  23.0
Restormer [Restormer]  32.92  0.9611  26.1
Uformer-B [Uformer]  33.06  0.9670  50.9
Stripformer [Stripformer]  33.08  0.9624  19.7
MPRNet-local [TLC]  33.31  0.9637  20.1
Restormer-local [TLC]  33.57  0.9656  26.1
NAFNet [NAFNet]  33.71  0.9668  67.9
Ours  34.21  0.9692  16.6
4.2 Comparisons with the state of the art
We compare our method with state-of-the-art ones and use the PSNR and SSIM to evaluate the quality of the restored images.
Evaluations on the GoPro dataset.
We first evaluate our method on the commonly used GoPro dataset [GoPro]. For fair comparisons, we follow the protocols of this dataset and retrain or fine-tune the deep learning methods that are not trained on this dataset. Table 1 shows the quantitative evaluation results. Our method generates the results with the highest PSNR and SSIM values. Compared to the state-of-the-art CNN-based method NAFNet [NAFNet], our method achieves a PSNR gain of at least 0.5dB while using about a quarter of the model parameters. In addition, compared to the Transformer-based methods [Restormer, Uformer, Stripformer], our method has the fewest model parameters while achieving better performance.

Figure 3 shows visual comparisons of the proposed method and the evaluated ones on the GoPro dataset. As demonstrated by [Restormer], the CNN-based methods [MIMO, NAFNet] do not effectively explore non-local information for latent clear image restoration. Therefore, the deblurred results of [MIMO, NAFNet] still contain significant blur effects, as shown in Figure 3(c) and (g). The Transformer-based methods [Restormer, Stripformer, TLC] are able to model the global contexts for image deblurring. However, some main structures, e.g., characters and chairs, are not recovered well (see Figure 3(d)-(f)). In contrast to existing Transformer-based methods that operate in the spatial domain, we develop an efficient frequency domain-based Transformer, where the proposed DFFN is able to discriminatively estimate useful frequency information for latent clear image restoration. Thus, the deblurred results contain clear structures, and the characters are much clearer, as shown in Figure 3(h).
Table 2. Quantitative evaluations on the RealBlur dataset [Realblur].
Methods  RealBlur-R (PSNRs/SSIMs)  RealBlur-J (PSNRs/SSIMs)
DeblurGAN-v2 [DeblurGANv2]  36.44/0.9347  29.69/0.8703
SRN [SRN]  38.65/0.9652  31.38/0.9091
MIMO-UNet+ [MIMO]  -/-  31.92/0.9190
BANet [BANet]  39.55/0.9710  32.00/0.9230
DeepRFT+ [Deeprft]  39.84/0.9721  32.19/0.9305
Stripformer [Stripformer]  39.84/0.9737  32.48/0.9290
Ours  40.11/0.9753  32.62/0.9326
Figure 4 panels: (a) Blurred image; (b) GT; (c) DeblurGAN [DeblurGAN]; (d) SRN [SRN]; (e) MIMO-UNet+ [MIMO]; (f) DeepRFT+ [Deeprft]; (g) Stripformer [Stripformer]; (h) Ours.
Evaluations on the RealBlur dataset.
We further evaluate our method on the RealBlur dataset [Realblur] and follow the protocols of this dataset for fair comparisons. The test data of [Realblur] include a RealBlur-R test set built from raw images and a RealBlur-J test set built from JPEG images. Table 2 summarizes the quantitative evaluation results on the above-mentioned test sets. The proposed method generates the results with higher PSNR and SSIM values.
Figure 4 shows the visual comparisons on the RealBlur dataset, where our method generates the results with clearer characters and finer structural details (Figure 4(h)).
Table 3. Quantitative evaluations on the HIDE dataset [HIDE].
Methods  PSNRs  SSIMs  Parameters (M)
DeblurGAN-v2 [DeblurGANv2]  26.61  0.8750  60.9
SRN [SRN]  28.36  0.9040  6.8
DMPHN [DMPHN]  29.09  0.9240  21.7
SAPHN [SAPHN]  29.98  0.9300  23.0
MIMO-UNet+ [MIMO]  29.99  0.9304  16.1
MPRNet [MPRNet]  30.96  0.9397  20.1
Stripformer [Stripformer]  31.03  0.9395  19.7
MPRNet-local [TLC]  31.19  0.9418  20.1
Restormer [Restormer]  31.22  0.9423  26.1
NAFNet [NAFNet]  31.31  0.9427  67.9
Restormer-local [TLC]  31.49  0.9447  26.1
Ours  31.62  0.9455  16.6
Figure 5 panels: (a) Blurred image; (b) GT; (c) MPRNet [MPRNet]; (d) Restormer [Restormer]; (e) Stripformer [Stripformer]; (f) NAFNet [NAFNet]; (g) Restormer-local [TLC]; (h) Ours.
Evaluations on the HIDE dataset.
We then evaluate our method on the HIDE dataset [HIDE], which mainly contains images of humans. Similar to state-of-the-art methods [MPRNet, MIMO], we directly test the evaluated methods using the models trained on the GoPro dataset. Table 3 shows that the quality of the deblurred images generated by the proposed method is better than that of the evaluated methods, suggesting that our method has a better generalization ability, as the models are not trained on this dataset.
We show some visual comparisons in Figure 5. We note that the evaluated methods do not recover the humans well. In contrast, our method generates better images; for example, the faces and the zippers of the clothes are much clearer.
Table 4. Running time and GPU memory comparisons between the window-based method [Uformer] and ours.
Window size  Window-based method [Uformer] (#Avg. Time / #GPU Mem.)  Ours (#Avg. Time / #GPU Mem.)
  53ms / 6.3G  44ms / 6.5G
  56ms / 7.1G  44ms / 6.2G
  89ms / 12.0G  43ms / 6.0G
  Out of memory  42ms / 5.9G
  Out of memory  42ms / 5.9G
The test environment is based on a machine with an NVIDIA GeForce RTX 3090 GPU. "#GPU Mem." denotes the maximum GPU memory consumption, computed by the torch.cuda.max_memory_allocated() function. "#Avg. Time" denotes the average running time.
Table 5. Ablation studies of the proposed FSAS and DFFN on the GoPro dataset.
Methods  PSNRs/SSIMs
w/ only FFN  33.19/0.9626
w/ only DFFN  33.55/0.9651
SA w/ SD  33.46/0.9645
FSAS+FFN  33.61/0.9654
FSAS+DFFN  33.73/0.9663
5 Analysis and Discussion
We have shown that exploring the properties of Transformers in the frequency domain generates favorable results compared with state-of-the-art methods. In this section, we provide a deeper analysis of the proposed method and demonstrate the effect of its main components. For the ablation studies in this section, we train our method and all the baselines on the GoPro dataset with the same batch size to illustrate the effect of each component of our method.
Effect of FSAS.
The proposed FSAS is used to reduce the computational cost. According to the properties of the FFT, the space and time complexity of the FSAS are O(CN) and O(CN log N), which are much lower than the O(N^2) and O(CN^2) of the original computation of the scaled dot-product attention, where C is the number of features and N is the number of pixels. We further examine the space and time cost of the FSAS and the window-based strategy [Swin, Uformer] for Transformers. Table 4 shows that the proposed FSAS needs less GPU memory and is much more efficient compared to the window-based strategy [Uformer].
Moreover, as the proposed FSAS is performed in the frequency domain, one may wonder whether the scaled dot-product attention estimated in the spatial domain performs better. To answer this question, we compare the FSAS with a baseline method that computes the attention in the spatial domain (SA w/ SD for short). As the space complexity of the original scaled dot-product attention is O(N^2), it is not affordable to train "SA w/ SD" using the same settings as the proposed FSAS. We therefore use the Swin Transformer [Swin] for comparison, as it is much more efficient. Table 5 shows the quantitative evaluation results on the GoPro dataset. The method that computes the scaled dot-product attention in the spatial domain does not generate good deblurred results, where its PSNR value is 0.27dB lower (see the comparison of "SA w/ SD" and "FSAS+DFFN" in Table 5). The main reason is that although the shifted window partitioning method reduces the computational cost, it does not fully explore the useful information across different windows. In contrast, the space complexity of the proposed FSAS is O(CN), and the FSAS does not need the shifted window partitioning as an approximation, thus leading to better deblurred results.
Figure 6 panels: (a) Blurred image; (b) Spatial domain; (c) Frequency domain.
Figure 6(b) further shows that using the shifted window partitioning method as an approximation of the scaled dot-product attention in the spatial domain does not remove blur effectively. In contrast, the proposed FSAS generates clearer images.
Figure 7 panels: (a) Blurred image; (b) Ours w/o FSAS; (c) Ours.
Moreover, compared to the baseline only using the FFN ("w/ only FFN"), using the proposed FSAS in this baseline generates much better results, where the PSNR value is 0.42dB higher (see the comparison of "w/ only FFN" and "FSAS+FFN" in Table 5). The visual comparisons in Figure 7(b) and (c) further demonstrate that the proposed FSAS facilitates blur removal, where the boundaries are recovered well, as shown in Figure 7(c).
Figure 8 panels: (a) Blurred image; (b) w/ only FFN; (c) w/ only DFFN.
Effect of DFFN.
The proposed DFFN is used to discriminatively estimate useful frequency information for latent clear image restoration. To demonstrate its effectiveness for image deblurring, we compare the proposed method with two baselines. For the first baseline, we compare the method only using the DFFN (w/ only DFFN for short) with the method only using the original FFN (w/ only FFN for short). For the second baseline, we compare the proposed method with the one that replaces the DFFN with the original FFN (FSAS+FFN). The comparison of "w/ only DFFN" and "w/ only FFN" in Table 5 shows that using the proposed DFFN generates better results, where the PSNR value is 0.36dB higher.
In addition, the comparisons of “FSAS+FFN” and “FSAS+DFFN” in Table 5 show that using the proposed DFFN further improves the performance.
Figure 8 shows the visual results of the above-mentioned baseline methods. Using the proposed DFFN generates better deblurred images, where the windows are recovered well, as shown in Figure 8(c).
Effect of the asymmetric encoderdecoder network.
As demonstrated in Section 3.3, the shallow features extracted by the encoder module usually contain blur effects that affect the estimation of the FSAS. We thus embed the FSAS into the decoder module, which leads to an asymmetric encoder-decoder network for better image deblurring. To examine the effect of this network design, we compare with the network that puts the FSAS into both the encoder and decoder modules ("FSAS in enc&dec" in Table 6). Table 6 shows that using the FSAS only in the decoder module generates better results, where the PSNR value is at least 0.17dB higher. The visual comparisons in Figure 9(b) and (c) further demonstrate that using the FSAS in the decoder module generates clearer images.
Table 6. Effect of the asymmetric encoder-decoder design on the GoPro dataset.
Methods  FSAS in enc&dec  FSAS in dec (Ours)
PSNRs  33.56  33.73
SSIMs  0.9653  0.9663
Figure 9 panels: (a) Blurred image; (b) FSAS in enc&dec; (c) FSAS in dec (Ours).
6 Conclusion
Motivated by the convolution theorem, we have presented an effective and efficient method that explores the properties of Transformers for high-quality image deblurring. We have developed an efficient frequency domain-based self-attention solver (FSAS) to estimate the scaled dot-product attention by an element-wise product operation instead of the matrix multiplication in the spatial domain, which significantly reduces the space and time complexity. We have further proposed a DFFN to discriminatively determine which low- and high-frequency information of the features should be preserved for latent clear image restoration. Moreover, we have developed an asymmetric network based on an encoder-decoder architecture, where the FSAS is only used in the decoder module for better image deblurring. By training the proposed method in an end-to-end manner, we show that it performs favorably against state-of-the-art approaches in terms of accuracy and efficiency.