Image deblurring aims to restore high-quality images from blurred ones. Significant progress has been made on this problem thanks to the development of various effective deep models and large-scale training datasets.
Most state-of-the-art methods for image deblurring are based on deep convolutional neural networks (CNNs). Their success is mainly due to various network architecture designs, for example, multi-scale [GoPro, SRN, MIMO] or multi-stage [DMPHN, MPRNet] network architectures, generative adversarial learning [DeblurGAN, DeblurGANv2], and physics model-inspired network structures [svrnn, physicgan]. As the basic operation in these networks, convolution is a spatially invariant local operation, which does not model the spatially variant properties of image contents. Most of these methods use larger and deeper models to remedy this limitation. However, simply increasing the capacity of deep models does not always lead to better performance, as shown in [svrnn, physicgan].
Different from the convolution operation, which models local connectivity, Transformers are able to model global contexts by computing the correlations of one token to all other tokens. They have been shown to be effective in many high-level vision tasks and have great potential as alternatives to deep CNN models. In image deblurring, Transformer-based methods [Restormer, Uformer] also achieve better performance than CNN-based methods. However, the computation of the scaled dot-product attention in Transformers leads to quadratic space and time complexity in terms of the number of tokens. Although using smaller and fewer tokens can reduce the space and time complexity, such a strategy cannot model the long-range information of features well and usually leads to significant artifacts when handling high-resolution images, which limits the performance improvement.
To alleviate this problem, most approaches use a downsampling strategy to reduce the spatial resolution of features [PyramidVIT]. However, reducing the spatial resolution of features causes information loss and thus affects image deblurring. Several methods reduce the computational cost by computing the scaled dot-product attention in terms of the number of features [Restormer, cotransformer]. Although the computational cost is reduced, the spatial information is not well explored, which may affect the deblurring performance.
In this paper, we develop an effective and efficient method that explores the properties of Transformers for high-quality image deblurring. We note that the scaled dot-product attention computation essentially estimates the correlation between one token from the query and all the tokens from the key. This process can be achieved by a convolution operation when rearranging the permutations of tokens. Based on this observation and the convolution theorem, which states that convolution in the spatial domain equals point-wise multiplication in the frequency domain, we develop an efficient frequency domain-based self-attention solver (FSAS) to estimate the scaled dot-product attention by an element-wise product operation instead of the matrix multiplication. Therefore, the space and time complexity can be reduced to $O(N)$ and $O(N \log N)$ for each feature channel, where $N$ is the number of pixels.
In addition, we note that simply using the feed-forward network (FFN) by [Restormer] does not generate good deblurred results. To generate better features for latent clear image restoration, we develop a simple yet effective discriminative frequency domain-based FFN (DFFN). Our DFFN is motivated by the Joint Photographic Experts Group (JPEG) compression algorithm. It introduces a gated mechanism in the FFN to discriminatively determine which low- and high-frequency information should be preserved for latent clear image restoration.
We integrate the proposed FSAS and DFFN into an end-to-end trainable network based on an encoder and decoder architecture to solve image deblurring. However, we find that as features of shallow layers usually contain blur effects, applying the scaled dot-product attention to shallow features does not effectively explore global clear contents. As the features from deep layers are usually clearer than those from shallow layers, we develop an asymmetric network architecture, where the FSAS is only used in the decoder module for better image deblurring. Our analysis shows that exploring the properties of Transformers in the frequency domain facilitates blur removal. Experimental results demonstrate that the proposed method performs favorably against state-of-the-art methods in terms of accuracy and efficiency (Figure 1).
The main contributions of this work are summarized as follows:
We develop an efficient frequency domain-based self-attention solver to estimate the scaled dot-product attention. Our analysis demonstrates that using the frequency domain-based solver reduces the space and time complexity and is much more effective and efficient.
We propose a simple yet effective discriminative frequency domain-based FFN, motivated by the JPEG compression algorithm, to discriminatively determine which low- and high-frequency information should be preserved for latent clear image restoration.
We develop an asymmetric network architecture based on an encoder and decoder network, where the frequency domain-based self-attention solver is only used in the decoder module for better image deblurring.
We show through analysis that exploring the properties of Transformers in the frequency domain facilitates blur removal, and that our approach performs favorably against state-of-the-art methods.
2 Related Work
Deep CNN-based image deblurring methods.
In recent years, we have witnessed significant advances in image deblurring due to the development of different deep CNN models [GoPro, SRN, SSN, DMPHN, MPRNet, MIMO, NAFNet]. In [GoPro], Nah et al. propose a deep CNN based on a multi-scale framework to directly estimate clear images from blurred ones. To better utilize the information of each scale in the multi-scale framework, Tao et al. [SRN] develop an effective scale recurrent network. Gao et al. [SSN] propose a selective network parameter sharing method to improve [GoPro, SRN].
As using more scales does not improve the performance significantly, Zhang et al. [DMPHN] develop an effective network based on a multi-patch strategy, in which the deblurring process is carried out stage by stage. To better explore the features from different stages, Zamir et al. [MPRNet] propose a cross-stage feature fusion for better performance. To reduce the computational cost of methods based on the multi-scale framework, Cho et al. [MIMO] present a multi-input and multi-output network. Chen et al. [NAFNet] analyze the baseline modules and simplify them for better image restoration. As demonstrated in [Restormer], the convolution operation is spatially invariant and does not effectively model the global contexts for image deblurring.
Transformers and their applications to image deblurring.
As the Transformer [Transformer] can model global contexts and has achieved significant progress in many high-level vision tasks (e.g., image classification [Swin], object detection [Object_Detection, Object_Detection_2], and semantic segmentation [Segmentation, Segmentation_2]), it has been developed to solve image super-resolution [SwinIR], image deblurring [Restormer, Stripformer], and image denoising [IPT, Uformer]. To reduce the computational cost of the Transformer, Zamir et al. [Restormer] propose an efficient Transformer model by computing the scaled dot-product attention in the feature depth domain. This method can effectively explore information from different features along the channel dimension. However, the spatial information that is vital for image restoration is not fully explored. Tsai et al. [Stripformer] simplify the calculation of self-attention by constructing intra- and inter-strip tokens to replace the global attention. Wang et al. [Uformer] propose a Transformer based on a UNet which uses non-overlapping window-based self-attention for single image deblurring. Although using the splitting strategy reduces the computational cost, the coarse splitting does not fully explore the information of each patch. Moreover, the scaled dot-product attention in these methods usually needs a complex matrix multiplication whose space and time complexity is quadratic.
Different from these methods, we develop an efficient Transformer-based method that explores the property of the frequency domain to avoid the complex matrix multiplication for the scaled dot-product attention.
3 Proposed Method
Our goal is to present an effective and efficient method to explore the properties of Transformers for high-quality image deblurring. To this end, we first develop an efficient frequency domain-based self-attention solver to estimate the scaled dot-product attention. To refine the features estimated by the frequency domain-based solver, we further develop a discriminative frequency domain-based feed-forward network. We integrate these components into an end-to-end trainable network based on an encoder and decoder architecture to solve image deblurring, where the frequency domain-based self-attention solver is used in the decoder module for better feature representation. Figure 2(a) shows the overview of the proposed method. In the following, we present the details of each component.
3.1 Frequency domain-based self-attention solver
Given the input feature $X$ with a spatial resolution of $H \times W$ pixels and $C$ channels, existing vision Transformers usually first compute the features $F_q$, $F_k$, and $F_v$ by applying linear transformations $W_q$, $W_k$, and $W_v$ to $X$. Then, they apply the unfolding function to the features $F_q$, $F_k$, and $F_v$ to extract image patches $\{q_i\}_{i=1}^{N}$, $\{k_i\}_{i=1}^{N}$, and $\{v_i\}_{i=1}^{N}$, where $N$ denotes the number of extracted patches. By applying a reshape operation to the extracted patches, the query $Q$, key $K$, and value $V$ can be obtained by:
$$Q = \mathcal{R}\left(\{q_i\}_{i=1}^{N}\right), \quad K = \mathcal{R}\left(\{k_i\}_{i=1}^{N}\right), \quad V = \mathcal{R}\left(\{v_i\}_{i=1}^{N}\right), \tag{1}$$
where $\mathcal{R}$ denotes the reshape function which ensures that $Q, K, V \in \mathbb{R}^{N \times C H_p W_p}$; $H_p$ and $W_p$ denote the height and width of the extracted patches. Based on the obtained query $Q$, key $K$, and value $V$, the scaled dot-product attention is achieved by:
$$V_{att} = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{C H_p W_p}}\right) V. \tag{2}$$
The attention map computation involves the matrix multiplication $Q K^{\top}$, whose space complexity and time complexity are $O(N^2)$ and $O(N^2 C)$. It is not affordable if the image resolution and the number of extracted patches are large. Although using a downsampling operation to reduce the image resolution or a non-overlapping method to extract fewer patches alleviates the problem, these strategies lead to information loss and limit the ability to model details within and across patches [cotransformer].
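To make the cost of the spatial-domain formulation concrete, the following is a minimal NumPy sketch of patch-based scaled dot-product attention. It is an illustration under stated assumptions, not the implementation of any cited method: patches are non-overlapping, the linear transformations are omitted (the three features are taken as given), and the names `unfold_patches` and `spatial_attention` are hypothetical.

```python
import numpy as np

def unfold_patches(feature, patch):
    """Split a (C, H, W) feature into non-overlapping patch x patch tiles and
    return one flattened row per tile: shape (N, C * patch * patch)."""
    c, h, w = feature.shape
    tiles = feature.reshape(c, h // patch, patch, w // patch, patch)
    tiles = tiles.transpose(1, 3, 0, 2, 4)           # (H/p, W/p, C, p, p)
    return tiles.reshape(-1, c * patch * patch)

def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def spatial_attention(f_q, f_k, f_v, patch):
    """Scaled dot-product attention over flattened patches (assumed layout)."""
    q, k, v = (unfold_patches(f, patch) for f in (f_q, f_k, f_v))
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]))    # (N, N) attention map
    return attn @ v, attn

rng = np.random.default_rng(0)
f_q, f_k, f_v = rng.standard_normal((3, 4, 32, 32))  # three (C=4, H=32, W=32) features
out, attn = spatial_attention(f_q, f_k, f_v, patch=8)
print(out.shape, attn.shape)  # (16, 256) (16, 16)
```

The attention map has one entry per pair of patches, so its size grows quadratically with the number of patches; this is the term that becomes unaffordable at high resolution.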
We note that each element of $Q K^{\top}$ is obtained by the inner product:
$$\left(Q K^{\top}\right)_{ij} = \langle q_i, k_j \rangle, \tag{3}$$
where $q_i$ and $k_j$ are the vectorized forms of the $i$-th and $j$-th patches from $\{q_i\}_{i=1}^{N}$ and $\{k_i\}_{i=1}^{N}$. Based on (3), if we apply reshape functions to $q_i$ and all the patches $k_j$, respectively, all the elements in the $i$-th row of $Q K^{\top}$ can be obtained by a convolution operation, i.e., $\tilde{q}_i \otimes \tilde{K}$, where $\tilde{q}_i$ and $\tilde{K}$ denote the reshaped results of $q_i$ and $k_j$; $\otimes$ denotes the convolution operation.
According to the convolution theorem, the correlation or convolution of two signals in the spatial domain is equivalent to an element-wise product of them in the frequency domain. Therefore, a natural question is whether we can efficiently estimate the attention map by an element-wise product operation in the frequency domain instead of computing the matrix multiplication $Q K^{\top}$ in the spatial domain.
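The equivalence invoked here can be checked numerically. The sketch below compares a direct circular cross-correlation of two small 2-D signals with the result of an element-wise product in the frequency domain; the function names are hypothetical, and circular (wrap-around) boundary handling is assumed because that is what the discrete Fourier transform implies.

```python
import numpy as np

def circular_correlation_direct(q, k):
    """out[dy, dx] = sum over (y, x) of q[y + dy, x + dx] * k[y, x], indices wrapping around."""
    h, w = q.shape
    out = np.zeros((h, w))
    for dy in range(h):
        for dx in range(w):
            shifted_q = np.roll(q, shift=(-dy, -dx), axis=(0, 1))
            out[dy, dx] = np.sum(shifted_q * k)
    return out

def circular_correlation_fft(q, k):
    """Same quantity via the correlation theorem: IFFT(FFT(q) * conj(FFT(k)))."""
    spectrum = np.fft.fft2(q) * np.conj(np.fft.fft2(k))
    return np.real(np.fft.ifft2(spectrum))

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 8))
k = rng.standard_normal((8, 8))
direct = circular_correlation_direct(q, k)
via_fft = circular_correlation_fft(q, k)
print(np.allclose(direct, via_fft))          # True
print(np.isclose(direct[0, 0], np.sum(q * k)))  # True: zero shift is the plain inner product
```

The zero-shift entry is exactly the inner product of the two patches, while the remaining entries give the inner products at every other relative shift, all obtained from one element-wise product.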
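To this end, we develop an effective frequency domain-based self-attention solver. Specifically, we first obtain $F_q$, $F_k$, and $F_v$ by a point-wise convolution and a depth-wise convolution. Then, we apply the fast Fourier transform (FFT) to the estimated features $F_q$ and $F_k$ and estimate the correlation of $F_q$ and $F_k$ in the frequency domain by:
$$A = \mathcal{F}^{-1}\left(\mathcal{F}(F_q)\, \overline{\mathcal{F}(F_k)}\right), \tag{4}$$
where $\mathcal{F}(\cdot)$ denotes the FFT, $\mathcal{F}^{-1}(\cdot)$ denotes the inverse FFT, and $\overline{(\cdot)}$ denotes the conjugate transpose operation. Next, we estimate the aggregated feature by:
$$V_{att} = \mathcal{L}(A)\, F_v, \tag{5}$$
where $\mathcal{L}(\cdot)$ is a layer norm used to normalize $A$. Finally, we generate the output feature of FSAS by:
$$X_{att} = X + \mathrm{Conv}_{1 \times 1}(V_{att}), \tag{6}$$
where $\mathrm{Conv}_{1 \times 1}(\cdot)$ denotes a convolution with a filter size of $1 \times 1$ pixel. The detailed network architecture of the proposed FSAS is shown in Figure 2(b).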
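The steps described above (FFT of the query and key features, element-wise product with the conjugate, inverse FFT, layer normalization, and weighting of the value feature) can be sketched in NumPy as follows. This is a simplified illustration rather than the authors' implementation: the correlation is assumed to be computed independently on non-overlapping patches with a hypothetical default size of 8, the layer norm is assumed to act over the channel axis without learnable parameters, and the point-wise and depth-wise convolutions as well as the final output convolution are omitted. All function names are hypothetical.

```python
import numpy as np

def to_patches(x, p):
    """(C, H, W) -> (C, H/p, W/p, p, p) non-overlapping tiles."""
    c, h, w = x.shape
    return x.reshape(c, h // p, p, w // p, p).transpose(0, 1, 3, 2, 4)

def from_patches(t):
    """Inverse of to_patches: (C, H/p, W/p, p, p) -> (C, H, W)."""
    c, gh, gw, p, _ = t.shape
    return t.transpose(0, 1, 3, 2, 4).reshape(c, gh * p, gw * p)

def fsas_correlation(f_q, f_k, patch=8):
    """Per-patch, per-channel correlation of f_q and f_k via an element-wise
    product in the frequency domain (FFT over the last two axes of each tile)."""
    q_hat = np.fft.rfft2(to_patches(f_q, patch))
    k_hat = np.fft.rfft2(to_patches(f_k, patch))
    corr = np.fft.irfft2(q_hat * np.conj(k_hat), s=(patch, patch))
    return from_patches(corr)

def layer_norm_over_channels(a, eps=1e-5):
    """Assumed normalization: zero mean and unit variance across channels at each pixel."""
    mean = a.mean(axis=0, keepdims=True)
    var = a.var(axis=0, keepdims=True)
    return (a - mean) / np.sqrt(var + eps)

def fsas_aggregate(f_q, f_k, f_v, patch=8):
    """Normalize the correlation map and use it to weight the value feature element-wise."""
    a = fsas_correlation(f_q, f_k, patch)
    return layer_norm_over_channels(a) * f_v

rng = np.random.default_rng(0)
f_q, f_k, f_v = rng.standard_normal((3, 4, 16, 16))
v_att = fsas_aggregate(f_q, f_k, f_v)
print(v_att.shape)  # (4, 16, 16)
```

No attention matrix over pairs of tokens is ever formed; every intermediate array has the same size as the input feature (or its half-spectrum), which is where the memory saving comes from.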
3.2 Discriminative frequency domain-based FFN
The FFN is used to refine the features produced by the scaled dot-product attention. Thus, it is important to develop an effective FFN to generate features that facilitate the latent clear image reconstruction. As not all the low-frequency and high-frequency information helps latent clear image restoration, we develop a DFFN that can adaptively determine which frequency information should be preserved. The remaining question is how to effectively determine which frequency information is important. Motivated by the JPEG compression algorithm, we introduce a learnable quantization matrix $W$ and learn it by an inverse method of JPEG compression to determine which frequency information should be preserved. The proposed DFFN can be formulated by:
$$\begin{aligned} X_1 &= \mathrm{Conv}_{1 \times 1}(\mathcal{L}(X)), \\ X_1^{f} &= \mathcal{F}(\mathcal{P}(X_1)), \\ X_2^{f} &= W X_1^{f}, \\ X_2 &= \mathcal{P}^{-1}\left(\mathcal{F}^{-1}(X_2^{f})\right), \\ X_{out} &= \mathcal{G}(X_2) + X, \end{aligned} \tag{7}$$
where $\mathcal{P}(\cdot)$ and $\mathcal{P}^{-1}(\cdot)$ denote the patch unfolding and folding operations in the JPEG compression method; $\mathcal{G}(\cdot)$ denotes the GEGLU function by [GLU]. The detailed network architecture of the proposed DFFN is shown in Figure 2(c).
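The frequency-selection idea behind the DFFN can be illustrated with a short NumPy sketch: unfold the feature into patches, transform each patch with the FFT, scale every frequency by a weight, transform back, fold, and apply a gated activation. This is a sketch under assumptions, not the authors' network: the patch size of 8 follows the JPEG block size, the weight array stands in for the learnable quantization matrix, the gate is a GEGLU-style split of the channels in half using the tanh approximation of GELU, and the convolutions, normalization, and residual connection are omitted. The names `frequency_gate` and `dffn_core` are hypothetical.

```python
import numpy as np

def gelu(x):
    """tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def frequency_gate(x, weight, patch=8):
    """Scale the spectrum of every patch x patch tile of a (C, H, W) feature.
    weight has shape (C, 1, 1, patch, patch // 2 + 1) and is shared by all tiles."""
    c, h, w = x.shape
    tiles = x.reshape(c, h // patch, patch, w // patch, patch).transpose(0, 1, 3, 2, 4)
    spectrum = np.fft.rfft2(tiles) * weight           # keep or suppress each frequency
    tiles = np.fft.irfft2(spectrum, s=(patch, patch))
    return tiles.transpose(0, 1, 3, 2, 4).reshape(c, h, w)

def dffn_core(x, weight, patch=8):
    """Frequency gating followed by a GEGLU-style gate over two channel halves."""
    y = frequency_gate(x, weight, patch)
    gate, value = np.split(y, 2, axis=0)
    return gelu(gate) * value

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16, 16))
keep_all = np.ones((4, 1, 1, 8, 5))                   # weight of 1 keeps every frequency
print(np.allclose(frequency_gate(x, keep_all), x))    # True
print(dffn_core(x, keep_all).shape)                   # (2, 16, 16)
```

With all weights equal to one the gating is the identity; during training, weights that shrink toward zero would suppress the corresponding frequency in every patch, which mirrors how a JPEG quantization table discards frequencies.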
3.3 Asymmetric encoder-decoder network
We embed the proposed FSAS and DFFN into a network based on an encoder and decoder architecture. Most existing methods use symmetric architectures in the encoder and decoder modules. For example, if the FSAS and DFFN are used in the encoder module, they are also used in the decoder module. We note that the features extracted by the encoder module are shallow ones, which usually contain blur effects compared to the deep features from the decoder module. Blur usually changes the similarity of two patches that would be similar in clear features. Thus, using the FSAS in the encoder module may not estimate the similarity correctly, which accordingly affects image restoration. To overcome this problem, we embed the FSAS only into the decoder module, which leads to an asymmetric architecture for better image deblurring. Figure 2(a) shows the network architecture of the proposed asymmetric encoder-decoder network.
Finally, given a blurred image $B$, the restored image $I$ is estimated by the asymmetric encoder-decoder network:
$$I = \mathcal{N}(B) + B, \tag{8}$$
where $\mathcal{N}$ denotes the asymmetric encoder-decoder network.
|(a) Blurred image||(b) GT||(c) MIMO-Unet+ [MIMO]||(d) Restormer [Restormer]|
|(e) Stripformer [Stripformer]||(f) Restormer-local [TLC]||(g) NAFNet [NAFNet]||(h) Ours|
4 Experimental Results
In this section, we evaluate the proposed method and compare it with state-of-the-art methods using public benchmark datasets.
4.1 Datasets and parameter settings
We evaluate our method on commonly used image deblurring datasets including the GoPro dataset [GoPro], the HIDE dataset [HIDE], and the RealBlur dataset [Realblur]. We follow the protocols of existing methods for fair comparisons.
We use the same loss function as [MIMO] to constrain the network and train it using the Adam [Adam] optimizer with default parameters. The initial value of the learning rate is and is updated with the cosine annealing strategy after 600,000 iterations. The minimum value of the learning rate is . The patch size is empirically set to be pixels and the batch size is set to be 16. We adopt the same data augmentation method as [Restormer] during training. The patch size for the weight matrix estimation is empirically set to be $8 \times 8$ based on the JPEG compression method. Similarly, we also use patches of size $8 \times 8$ pixels when computing the self-attention (4).
Due to the page limit, we include more experimental results in the supplemental material. The training code and models will be available to the public.
4.2 Comparisons with the state of the art
We compare our method with state-of-the-art ones and use the PSNR and SSIM to evaluate the quality of restored images.
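For reference, PSNR can be computed from the mean squared error between a restored image and its ground truth. The sketch below assumes floating-point images with intensities in [0, 1] (the `max_value` parameter is an assumption about the data range); the function name is hypothetical, and SSIM is omitted because it involves local windowed statistics that do not fit a short example.

```python
import numpy as np

def psnr(restored, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    restored = np.asarray(restored, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    mse = np.mean((restored - reference) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

reference = np.zeros((4, 4))
restored = np.full((4, 4), 0.1)   # constant error of 0.1 gives an MSE of 0.01
print(round(psnr(restored, reference), 6))  # 20.0
```

Because the scale is logarithmic, a gain of a few tenths of a dB corresponds to a relative reduction of the mean squared error of several percent.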
Evaluations on the GoPro dataset.
We first evaluate our method on the commonly used GoPro dataset by [GoPro]. For fair comparisons, we follow the protocols of this dataset and retrain or fine-tune the deep learning methods that are not trained on this dataset. Table 1 shows the quantitative evaluation results. Our method generates the results with the highest PSNR and SSIM values. Compared to the state-of-the-art CNN-based method NAFNet [NAFNet], the PSNR of our method is at least 0.5 dB higher, while our model has a quarter of the parameters of NAFNet. In addition, compared to the Transformer-based methods [Restormer, Uformer, Stripformer], our method has the fewest model parameters while the performance is better.
Figure 3 shows visual comparisons of the proposed method and the evaluated ones on the GoPro dataset. As demonstrated by [Restormer], the CNN-based methods [MIMO, NAFNet] do not effectively explore non-local information for latent clear image restoration. Therefore, the deblurred results by the methods [MIMO, NAFNet] still contain significant blur effect as shown in Figure 3(c) and (g). The Transformer-based methods [Restormer, Stripformer, TLC] are able to model the global contexts for image deblurring. However, some main structures, e.g., characters and chairs, are not recovered well (see Figure 3(d)-(f)).
In contrast to existing Transformer-based methods that operate in the spatial domain, we develop an efficient frequency domain-based Transformer, where the proposed DFFN is able to discriminatively estimate useful frequency information for latent clear image restoration. Thus, the deblurred results contain clear structures, and the characters are much clearer, as shown in Figure 3(h).
|(a) Blurred image||(b) GT||(c) DeblurGAN [DeblurGAN]||(d) SRN [SRN]|
|(e) MIMO-Unet+ [MIMO]||(f) DeepRFT+ [Deeprft]||(g) Stripformer [Stripformer]||(h) Ours|
Evaluations on the RealBlur dataset.
We further evaluate our method on the RealBlur dataset by [Realblur] and follow the protocols of this dataset for fair comparisons. The test dataset of [Realblur] includes a RealBlur-R test set from the raw images and RealBlur-J test set from the JPEG images. Table 2 summarizes the quantitative evaluation results on the above mentioned test sets. The proposed method generates the results with higher PSNR and SSIM values.
|(a) Blurred image||(b) GT||(c) MPRNet [MPRNet]||(d) Restormer [Restormer]|
|(e) Stripformer [Stripformer]||(f) NAFNet [NAFNet]||(g) Restormer-local [TLC]||(h) Ours|
Evaluations on the HIDE dataset.
We then evaluate our method on the HIDE dataset [HIDE], which mainly contains humans. Similar to state-of-the-art methods [MPRNet, MIMO], we directly use the models of the evaluated methods that are trained on the GoPro dataset for testing. Table 3 shows that the quality of the deblurred images generated by the proposed method is better than that of the evaluated methods, suggesting that our method has a better generalization ability, as the models are not trained on this dataset.
We show some visual comparisons in Figure 5. We note that the evaluated methods do not recover the humans well. In contrast, our method generates better images. For example, the faces and zipper of clothes are much clearer.
|Window-based method [Uformer]||Ours|
|Window size||#Avg. Time||#GPU Mem.||#Avg. Time||#GPU Mem.|
|-||Out of memory||42ms||5.9G|
|-||Out of memory||42ms||5.9G|
The test environment is a machine with an NVIDIA GeForce RTX 3090 GPU. “#GPU Mem.” denotes the maximum GPU memory consumption, which is computed by the “torch.cuda.max_memory_allocated()” function. “#Avg. Time” denotes the average running time.
|w/ only FFN||33.19/0.9626|
|w/ only DFFN||33.55/0.9651|
|SA w/ SD||33.46/0.9645|
5 Analysis and Discussion
We have shown that exploring the properties of Transformers in the frequency domain generates favorable results against state-of-the-art methods. In this section, we provide deeper analysis on the proposed method and demonstrate the effect of the main components. For the ablation studies in this section, we train our method and all the baselines on the GoPro dataset using the batch size of to illustrate the effect of each component in our method.
Effect of FSAS.
The proposed FSAS is used to reduce the computational cost. According to the properties of the FFT, the space and time complexity of the FSAS are $O(N)$ and $O(N \log N)$ for each feature channel, which are much lower than $O(N^2)$ and $O(N^2 C)$ in the original computation of the scaled dot-product attention, where $C$ is the number of features. We further examine the space and time complexity of the FSAS and the window-based strategy [Swin, Uformer] for Transformers. Table 4 shows that the proposed FSAS requires less GPU memory and is much more efficient than the window-based strategy [Uformer].
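A small calculation illustrates the gap between quadratic and FFT-style growth. The sketch below counts leading-order terms only, ignoring constants and the channel dimension; the 256 × 256 token grid is a hypothetical example, and the function name is an assumption made for illustration.

```python
import math

def attention_costs(num_tokens):
    """Leading-order counts: N^2 entries for a full attention map versus
    N * log2(N) operations for an FFT over the same number of elements."""
    return {
        "attention_map_entries": num_tokens ** 2,
        "fft_operations": num_tokens * math.log2(num_tokens),
    }

n = 256 * 256                      # hypothetical number of tokens
costs = attention_costs(n)
ratio = costs["attention_map_entries"] / costs["fft_operations"]
print(costs["attention_map_entries"])  # 4294967296
print(ratio)                           # 4096.0
```

The ratio itself grows with the number of tokens (as N / log N), so the advantage of the frequency-domain formulation widens as the resolution increases.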
Moreover, as the proposed FSAS is performed in the frequency domain, one may wonder whether the scaled dot-product attention estimated in the spatial domain performs better. To answer this question, we compare the FSAS with a baseline that operates in the spatial domain (SA w/ SD for short). As the space complexity of the original scaled dot-product attention is $O(N^2)$, it is not affordable to train “SA w/ SD” when using the same settings as the proposed FSAS. We therefore use the Swin Transformer [Swin] for comparison as it is much more efficient. Table 5 shows the quantitative evaluation results on the GoPro dataset. The method that computes the scaled dot-product attention in the spatial domain does not generate good deblurred results; its PSNR value is 0.27 dB lower (see comparisons of “SA w/ SD” and “FSAS+DFFN” in Table 5). The main reason is that although the shifted window partitioning method reduces the computational cost, it does not fully explore the useful information across different windows. In contrast, the space complexity of the proposed FSAS is $O(N)$ and it does not need the shifted window partitioning as an approximation, thus leading to better deblurred results.
|(a) Blurred image||(b) Spatial domain||(c) Frequency domain|
Figure 6(b) further shows that using the shifted window partitioning method as an approximation of scaled dot-product attention in the spatial domain does not remove blur effectively. In contrast, the proposed FSAS generates clearer images.
|(a) Blurred image||(b) Ours w/o FSAS||(c) Ours|
Moreover, compared to the baseline method only using FFN (“w/ only FFN”), using the proposed FSAS in this baseline generates much better results, where the PSNR value is 0.42dB higher (see comparisons of “w/ only FFN” and “FSAS+FFN” in Table 5). The visual comparisons in Figure 7(b) and (c) further demonstrate that using the proposed FSAS facilitates the blur removal well, where the boundaries are recovered well as shown in Figure 7(c).
|(a) Blurred image||(b) w/ only FFN||(c) w/ only DFFN|
Effect of DFFN.
The proposed DFFN is used to discriminatively estimate useful frequency information for latent clear image restoration. To demonstrate its effectiveness on image deblurring, we compare the proposed method with two baselines. For the first baseline, we compare the proposed method only using the DFFN (w/ only DFFN for short) and the proposed method only using the original FFN (w/ only FFN for short). For the second baseline, we compare the proposed method with the one that replaces the DFFN with the original FFN in the proposed method (FSAS+FFN). The comparisons of “w/ only DFFN” and “w/ only FFN” in Table 5 show that using the proposed DFFN generates better results, where the PSNR value is 0.36dB higher.
In addition, the comparisons of “FSAS+FFN” and “FSAS+DFFN” in Table 5 show that using the proposed DFFN further improves the performance.
Figure 8 shows the visual results of the above-mentioned baseline methods. Using the proposed DFFN generates better deblurred images, where the windows are recovered well, as shown in Figure 8(c).
Effect of the asymmetric encoder-decoder network.
As discussed in Section 3.3, the shallow features extracted by the encoder module usually contain blur effects that affect the estimations of the FSAS. We thus embed it only into the decoder module, which leads to an asymmetric encoder-decoder network for better image deblurring. To examine the effect of this network design, we compare it with a network that uses the FSAS in both the encoder and decoder modules (“FSAS in enc&dec” in Table 6). Table 6 shows that using the FSAS only in the decoder module generates better results, where the PSNR value is at least 0.17 dB higher. The visual comparisons in Figure 9(b) and (c) further demonstrate that using the FSAS in the decoder module generates clearer images.
|Methods||FSAS in enc&dec||FSAS in dec (Ours)|
|(a) Blurred image||(b) FSAS in enc&dec||(c) FSAS in dec (Ours)|
6 Conclusion
Motivated by the convolution theorem, we have presented an effective and efficient method that explores the properties of Transformers for high-quality image deblurring. We have developed an efficient frequency domain-based self-attention solver (FSAS) to estimate the scaled dot-product attention by an element-wise product operation instead of the matrix multiplication in the spatial domain, and we have shown that the space complexity and the computational complexity are significantly reduced. We have further proposed a DFFN to discriminatively determine which low- and high-frequency information of the features should be preserved for latent clear image restoration. Moreover, we have developed an asymmetric network based on an encoder and decoder architecture, where the FSAS is only used in the decoder module for better image deblurring. By training the proposed method in an end-to-end manner, we show that it performs favorably against state-of-the-art approaches in terms of accuracy and efficiency.