On Efficient Transformer and Image Pre-training for Low-level Vision

Pre-training has set numerous state-of-the-art results in high-level computer vision, but few attempts have been made to investigate how pre-training acts in image processing systems. In this paper, we present an in-depth study of image pre-training. To conduct this study on solid ground with practical value in mind, we first propose a generic, cost-effective Transformer-based framework for image processing. It yields highly competitive performance across a range of low-level tasks under constrained parameters and computational complexity. Then, based on this framework, we design a set of principled evaluation tools to comprehensively diagnose image pre-training in different tasks and uncover its effects on internal network representations. We find that pre-training plays strikingly different roles in low-level tasks. For example, pre-training introduces more local information to higher layers in super-resolution (SR), yielding significant performance gains, while it hardly affects internal feature representations in denoising, resulting in only a small gain. Further, we explore different methods of pre-training, revealing that multi-task pre-training is more effective and data-efficient. All code and models will be released at https://github.com/fenglinglwb/EDT.




1 Introduction

Image pre-training has received great attention in computer vision and is especially prevalent in object detection and segmentation [20, 19, 13, 33, 7]. Previous studies [48, 26, 37, 50, 25] mainly focus on convolutional neural networks (CNNs) and have shown that ConvNets pre-trained on ImageNet [11] classification yield significant improvements on a wide spectrum of downstream tasks. More recently, thanks to large-scale pre-training, Transformer [55] architectures have achieved competitive performance in both NLP [43, 44, 4, 12, 45] and computer vision [5, 77, 15, 53, 32, 64, 56, 59, 61, 49, 73].

Despite the success of pre-training in high-level vision, few efforts have been made to explore its role in low-level vision. To the best of our knowledge, the only prior work exploring this point is IPT [6]. With the help of large-scale synthesized data (over 10 million images), IPT obtains strong improvements on several tasks. However, the conventional Transformer design faces great challenges in handling high-resolution inputs, due to its massive number of parameters (e.g., 116M for IPT) and huge computational cost. This makes it difficult to apply in practice and prohibitively hard to explore various system design choices. Moreover, a more detailed analysis is needed to understand how pre-training affects the representations of models.

Figure 1: Comparison on PSNR (dB) performance of the proposed EDT and state-of-the-art methods in different low-level tasks.
Figure 2: The proposed encoder-decoder-based Transformer (EDT). EDT processes high-resolution (e.g., in deraining) and low-resolution (e.g., in SR) inputs using different paths, modeling long-range interactions at a low resolution for efficient computation.

To deal with these issues, we first propose a novel encoder-decoder-based Transformer (EDT), which is data- and computation-efficient yet powerful. Combining the strengths of CNNs and Transformers, we exploit multi-scale information and long-range interactions. The encoder-decoder architecture, along with the carefully designed Transformer block, makes EDT highly efficient while achieving state-of-the-art results on multiple low-level tasks, especially those with heavy degradation. For example, as shown in Fig. 1, EDT yields a 0.49dB improvement in SR on the Urban100 [24] benchmark compared to IPT, while our SR model size (11.6M) is only 10.0% of IPT (115.6M) and requires only 200K images (15.6% of ImageNet) for pre-training. Also, our denoising model obtains superior performance in level-50 Gaussian denoising at a cost of 38 GFLOPs, only 8.4% of that of SwinIR [31] (451 GFLOPs). Moreover, we develop four variants of EDT with different model sizes, so that our framework can be easily applied in various scenarios.

Based on the efficient EDT framework introduced above, the next focus of this paper is to systematically explore and evaluate, for the first time, how image pre-training performs in low-level vision tasks (Sec. 3). Using centered kernel alignment [26, 8, 46] as a network “diagnosing” measure, we design a set of pre-training strategies and thoroughly test them on different image processing tasks. As a result, we uncover their respective effects on internal network representations and draw useful guidelines for applying pre-training to low-level vision. For instance, we find that simply increasing the data scale may not be the optimal option; in contrast, multi-related-task pre-training is more effective and data-efficient. Our study sheds light on advancing efficient Transformer-based image pre-training.

The contributions of our work can be summarized as:

  • We present a highly efficient and generic Transformer framework for low-level vision. Our model achieves state-of-the-art performance under constrained parameters and computational complexity.

  • We are the first to deliver an in-depth study on image pre-training for low-level vision, uncovering insights about how pre-training affects internal representations of models and how to conduct an effective pre-training.

2 Efficient EDT Framework

2.1 Overall Architecture

As illustrated in Fig. 2, the proposed EDT consists of a lightweight convolution-based encoder and decoder as well as a Transformer-based body, capable of efficiently modeling long-range interactions.

Despite the success of Transformers, their high computational cost makes it difficult for them to handle high-resolution inputs. To improve the encoding efficiency, images are first downsampled with strided convolutions for tasks with high-resolution inputs (e.g., denoising or deraining), while being processed at the original size for those with low-resolution inputs (e.g., SR). The stack of early convolutions has also been shown to be useful for stabilizing the optimization [65]. Multiple stages of Transformer blocks then follow, achieving a large receptive field at a low computational cost. During the decoding phase, we upsample the feature back to the input size using transposed convolutions for denoising or deraining, while maintaining the size for SR. Besides, skip connections are introduced to enable fast convergence during training. For super-resolution in particular, there is an additional convolutional upsampler before the output.

2.2 Transformer Block

In general, a Vision Transformer block [15] consists of multi-head self-attention and a feed-forward network (FFN), and many works [62, 6, 32, 14] tailor these two parts to different tasks. In this section, we introduce shifted crossed local attention and the anti-blocking FFN.

Shifted Crossed Local Attention. To reduce computational complexity, several works [32, 54] leverage shifted or halo windows to perform local self-attention. However, the slow growth of the effective receptive field hampers the representational capability. Later, [14] proposes to use global horizontal and vertical stripe attention to achieve a global receptive field, but the computation becomes a heavy burden when processing high-resolution images. Drawing insights from these works [32, 14], we design a local self-attention mechanism with shifted crossed windows that strikes a balance between receptive field growth and computational cost, showing competitive performance.

Figure 3: Shifted Crossed Local Attention, with separate horizontal and vertical window sizes.

As shown in Fig. 3, a given feature map is evenly split into two parts along the channel dimension, and each half performs multi-head self-attention (MSA) within either a horizontal or a vertical local window. The window is generally rectangular, with one side shorter than the other (see the window sizes in Table 2). In the (shifted) crossed local attention, (S)CL-MSA, the attention results of the two halves are then fused by a projection layer. In the next Transformer block, we shift the horizontal and vertical windows. The crossed windows with shifts dramatically increase the effective receptive field. We show in Sec. 4.1 that the fast growth of the receptive field is beneficial for achieving higher restoration quality. Because attention is restricted to local windows, the computational complexity of our attention module grows linearly with the number of pixels for a fixed window size.
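The crossed local attention described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it uses a single head, identity query/key/value projections, and a cyclic half-window shift borrowed from common shifted-window implementations, all of which are simplifying assumptions, as are the function names and the `short`/`long` window parameters. The helper `attention_cost` is a rough multiply-accumulate count for windowed attention, included only to show that windows of equal area cost the same.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(x, win_h, win_w):
    """Single-head self-attention inside non-overlapping win_h x win_w windows.

    x has shape (H, W, C); H and W must be divisible by win_h and win_w.
    Queries, keys and values are the features themselves (an assumption).
    """
    H, W, C = x.shape
    n_h, n_w = H // win_h, W // win_w
    # Group pixels by window: (n_h, n_w, tokens per window, C).
    t = x.reshape(n_h, win_h, n_w, win_w, C).transpose(0, 2, 1, 3, 4)
    t = t.reshape(n_h, n_w, win_h * win_w, C)
    attn = softmax(t @ t.transpose(0, 1, 3, 2) / np.sqrt(C))
    out = attn @ t
    # Undo the window grouping.
    out = out.reshape(n_h, n_w, win_h, win_w, C).transpose(0, 2, 1, 3, 4)
    return out.reshape(H, W, C)

def crossed_local_attention(x, short, long, proj, shift=False):
    """Split channels in two; attend in horizontal and vertical windows; fuse."""
    C = x.shape[-1]
    x1, x2 = x[..., : C // 2], x[..., C // 2 :]
    if shift:  # cyclic half-window shift (assumed, see lead-in)
        x1 = np.roll(x1, (-(short // 2), -(long // 2)), axis=(0, 1))
        x2 = np.roll(x2, (-(long // 2), -(short // 2)), axis=(0, 1))
    y1 = window_attention(x1, short, long)  # horizontal window: wide
    y2 = window_attention(x2, long, short)  # vertical window: tall
    if shift:  # move the results back to their original positions
        y1 = np.roll(y1, (short // 2, long // 2), axis=(0, 1))
        y2 = np.roll(y2, (long // 2, short // 2), axis=(0, 1))
    return np.concatenate([y1, y2], axis=-1) @ proj

def attention_cost(H, W, C, win_h, win_w):
    """Rough multiply-accumulate count (an assumption, not the paper's formula):
    4*H*W*C^2 for the projections plus 2*(win_h*win_w)*H*W*C for attention."""
    return 4 * H * W * C ** 2 + 2 * win_h * win_w * H * W * C

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16, 8))
proj = np.eye(8)  # identity fusion, for illustration only
out = crossed_local_attention(x, short=4, long=8, proj=proj)
out_shifted = crossed_local_attention(x, short=4, long=8, proj=proj, shift=True)

# Perturb a pixel that shares no window with pixel (0, 0).
x_far = x.copy()
x_far[12, 12] += 1.0
out_far = crossed_local_attention(x_far, short=4, long=8, proj=proj)
```

Without the shift, the output at pixel (0, 0) is unaffected by the perturbation at (12, 12), which is the locality that the shifted windows of the next block are meant to relax.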


Anti-Blocking FFN. To eliminate the possible blocking effect caused by window partition, we design an anti-blocking feed-forward network (Anti-FFN), in which the anti-blocking operation is implemented with a depth-wise convolution with a large kernel size. Note that [62] adopts a similar strategy to leverage local context. We further explore the role of this operation in low-level tasks, which is detailed in Sec. H of the supplementary file.
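A feed-forward network with a depth-wise convolution between its two linear layers might be sketched as follows. The ordering of the layers, the GELU activation, the kernel size of 5 and all function names are assumptions made for illustration; only the expansion ratio of 2 and the use of a large-kernel depth-wise convolution come from the text.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU (the activation choice is an assumption).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def depthwise_conv(x, kernel):
    """Depth-wise convolution with zero padding that keeps the spatial size.

    x has shape (H, W, C); kernel has shape (k, k, C) with k odd.
    Each channel is filtered independently by its own k x k kernel.
    """
    k = kernel.shape[0]
    p = k // 2
    H, W, _ = x.shape
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            out += xp[i:i + H, j:j + W, :] * kernel[i, j, :]
    return out

def anti_ffn(x, w1, kernel, w2):
    """Linear expansion, depth-wise convolution across window borders, linear projection."""
    hidden = gelu(x @ w1)
    hidden = depthwise_conv(hidden, kernel)
    return hidden @ w2

rng = np.random.default_rng(0)
channels, expansion, k = 4, 2, 5  # k = 5 is a hypothetical kernel size
x = rng.standard_normal((8, 8, channels))
w1 = rng.standard_normal((channels, channels * expansion))
w2 = rng.standard_normal((channels * expansion, channels))
kernel = rng.standard_normal((k, k, channels * expansion))
out = anti_ffn(x, w1, kernel, w2)

# A change at pixel (3, 3) reaches pixel (4, 4), even if a window border lies between them.
x_pert = x.copy()
x_pert[3, 3] += 1.0
out_pert = anti_ffn(x_pert, w1, kernel, w2)

# Sanity check: a kernel with a single 1 at its center is the identity.
delta = np.zeros((k, k, channels))
delta[k // 2, k // 2, :] = 1.0
identity_out = depthwise_conv(x, delta)
```

The point of the sketch is that the convolution mixes neighbouring pixels regardless of how the attention windows were partitioned, which is how such an operation can counteract blocking.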

2.3 Architecture Variants

For a fair comparison with other architectures, we build four versions of EDT with different model sizes and computational complexities. As shown in Table 1, apart from the base model (EDT-B), we also provide EDT-T (Tiny), EDT-S (Small) and EDT-L (Large). The main differences lie in the channel number, stage number and head number in the Transformer body. We uniformly set the block number in each Transformer stage to 6, the expansion ratio of FFN to 2 and the window size to (6, 24).

Models EDT-T EDT-S EDT-B EDT-L
#Channels 60 120 180 240
#Stages 4 5 6 12
#Heads 6 6 6 8
#Param. (M) 0.9 4.2 11.5 40.2
FLOPs (G) 2.8 12.4 37.6 136.4
Table 1: Configurations of four variants of EDT. The parameter numbers and FLOPs are counted in denoising at size.

3 In-depth Study of Image Pre-training

3.1 Pre-training on ImageNet

Following [6], we adopt the ImageNet [11] dataset in the pre-training stage. Unless specified otherwise, we only use 200K images for pre-training. We choose three representative low-level tasks: super-resolution (SR), denoising and deraining. Following [6, 1, 23], we simulate the degradation procedure to synthesize low-quality images. For SR, we use bicubic interpolation to obtain low-resolution images. For denoising and deraining, Gaussian noise (in RGB space) and rain streaks are directly added to the clean images. In this work, we explore ×2/×3/×4 settings in SR, 15/25/50 noise levels in denoising and light/heavy rain streaks in deraining.
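As a small illustration of the noise synthesis described above, the following sketch adds Gaussian noise at the three levels to a clean RGB image. It assumes that images are stored as floats in [0, 1], that the noise level is a standard deviation on the 0-255 scale, and that no clipping is applied; these conventions and the function name are assumptions, not details stated in the paper. Bicubic downsampling and rain-streak synthesis are omitted.

```python
import numpy as np

def add_gaussian_noise(clean, sigma, rng):
    """Add zero-mean Gaussian noise to an RGB image.

    clean: float array in [0, 1] with shape (H, W, 3).
    sigma: noise level on the 0-255 scale (e.g., 15, 25 or 50).
    """
    noise = rng.standard_normal(clean.shape) * (sigma / 255.0)
    return clean + noise

rng = np.random.default_rng(0)
clean = rng.random((64, 64, 3))  # random stand-in for a clean image
noisy = {sigma: add_gaussian_noise(clean, sigma, rng) for sigma in (15, 25, 50)}
```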

Figure 4: Sub-figures (a)-(c) show CKA similarities between all pairs of layers in EDT-S SR model, EDT-B SR model, level-15 EDT-B denoising model with single-task pre-training, and the corresponding similarities between with and without pre-training are shown in (e)-(g). Sub-figure (d) shows the cross-model comparison between EDT-B SR and EDT-B denoising models and (h) shows the ratios of layer similarity larger than 0.6, where “” means the similarity between the current layer in SR and any layer in denoising.

We investigate three pre-training methods: single task, related tasks and unrelated tasks. (1) Single-task pre-training refers to training a single model for a specific task (e.g., SR). (2) The second trains a single model for highly related tasks (e.g., ×2, ×3 and ×4 SR), while (3) the last contains unrelated tasks (e.g., two SR scales and level-15 denoising). As suggested in [6], we adopt a multi-encoder, multi-decoder, shared-body architecture for the latter two setups. Fine-tuning is performed on a single task, where the model is initialized with the pre-trained task-specific encoder and decoder as well as the shared Transformer body.
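The parameter sharing in the multi-task setup can be sketched with toy linear layers standing in for the convolutional encoders and decoders and the Transformer body. Everything here is hypothetical (the task names, layer shapes and function names); the sketch only shows which parameters are task-specific, which are shared, and what a single-task model inherits at the start of fine-tuning.

```python
import numpy as np

def init_multitask(tasks, dim, rng):
    """One encoder and one decoder per task, plus a single shared body."""
    return {
        "encoders": {t: 0.1 * rng.standard_normal((3, dim)) for t in tasks},
        "body": 0.1 * rng.standard_normal((dim, dim)),
        "decoders": {t: 0.1 * rng.standard_normal((dim, 3)) for t in tasks},
    }

def forward(params, x, task):
    """Route the input through the task's encoder, the shared body, and the task's decoder."""
    feat = x @ params["encoders"][task]
    feat = feat + np.tanh(feat @ params["body"])  # toy residual body
    return feat @ params["decoders"][task]

def to_single_task(params, task):
    """Initialize a single-task model from the pre-trained multi-task weights."""
    return {
        "encoders": {task: params["encoders"][task].copy()},
        "body": params["body"].copy(),
        "decoders": {task: params["decoders"][task].copy()},
    }

rng = np.random.default_rng(0)
tasks = ["sr_x2", "sr_x3", "sr_x4"]  # hypothetical names for related SR tasks
pretrained = init_multitask(tasks, dim=16, rng=rng)
finetune = to_single_task(pretrained, "sr_x2")

x = rng.standard_normal((10, 3))  # ten "pixels" with three color channels
y_pretrained = forward(pretrained, x, "sr_x2")
y_finetune = forward(finetune, x, "sr_x2")
```

Before any fine-tuning step, the single-task model reproduces the multi-task model's output for its task, since it starts from the same encoder, body and decoder weights.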

3.2 Centered Kernel Alignment

We introduce centered kernel alignment (CKA) [26, 8, 46] to study the representation similarity of network hidden layers, supporting quantitative comparisons within and across networks. In detail, given $m$ data points, we calculate the activations of two layers $X \in \mathbb{R}^{m \times p_1}$ and $Y \in \mathbb{R}^{m \times p_2}$, having $p_1$ and $p_2$ neurons respectively. We use the Gram matrices $K = XX^\top$ and $L = YY^\top$ to compute CKA:

$$\mathrm{CKA}(K, L) = \frac{\mathrm{HSIC}(K, L)}{\sqrt{\mathrm{HSIC}(K, K)\,\mathrm{HSIC}(L, L)}},$$

where HSIC is the Hilbert-Schmidt independence criterion [21]. Given the centering matrix $H = I_m - \frac{1}{m}\mathbf{1}\mathbf{1}^\top$, $K' = HKH$ and $L' = HLH$ are centered Gram matrices, and we have $\mathrm{HSIC}(K, L) = \mathrm{vec}(K') \cdot \mathrm{vec}(L') / (m-1)^2$. Because CKA is invariant to orthogonal transformation and isotropic scaling, we are able to conduct a meaningful analysis of neural network representations. However, naive computation of CKA requires keeping the activations of the entire dataset in memory, which is costly. To avoid this, we use minibatch estimators of CKA [41], with a minibatch size of 300, iterating over the test dataset 10 times.
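For reference, CKA with linear Gram matrices can be computed on a single batch of activations as in the sketch below. This is the plain full-batch form, not the minibatch estimator of [41] used in the paper, and the function names are illustrative. The example also checks the two invariances mentioned above on random data.

```python
import numpy as np

def centered_gram(x):
    """Return H K H for the linear Gram matrix K = x x^T, with H the centering matrix."""
    m = x.shape[0]
    k = x @ x.T
    h = np.eye(m) - np.ones((m, m)) / m
    return h @ k @ h

def hsic(kc, lc):
    """HSIC from two centered Gram matrices: vec(K') . vec(L') / (m - 1)^2."""
    m = kc.shape[0]
    return float(np.sum(kc * lc)) / (m - 1) ** 2

def linear_cka(x, y):
    """CKA between activations x (m, p1) and y (m, p2) of two layers."""
    kc, lc = centered_gram(x), centered_gram(y)
    return hsic(kc, lc) / np.sqrt(hsic(kc, kc) * hsic(lc, lc))

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 16))

# An orthogonally rotated and isotropically scaled copy of x.
q, _ = np.linalg.qr(rng.standard_normal((16, 16)))
y = 3.0 * x @ q

# Unrelated random activations with a different number of neurons.
z = rng.standard_normal((64, 8))

cka_same = linear_cka(x, y)
cka_rand = linear_cka(x, z)
```

The rotated and scaled copy has a CKA of 1 with the original up to floating-point error, while the unrelated activations score lower.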

3.3 Representation Structure of EDT

Figure 5: Sub-figures (a)-(d) show CKA similarities of SR models without pre-training as well as with pre-training on a single task, highly related tasks (×2, ×3 and ×4 SR) and unrelated tasks (two SR scales and level-15 denoising). Sub-figures (e)-(h) show the corresponding attention head mean distances of Transformer blocks. Note that we do not plot shifted local windows in (e)-(h), so the last blue dotted line has no matching point. The red boxes indicate the same attention modules.

We begin our investigation by studying the internal representation structure of our models. How are representations propagated within models of different sizes in low-level tasks? To answer this intriguing question, we compute CKA similarities between every pair of layers within a model. Apart from the convolutional head and tail, we include outputs of attention and FFN after residual connections in the Transformer body.

For the SR task in Fig. 4 (a)-(b), we find roughly three block structures, within which ranges of Transformer layers are highly similar to each other (also found in ViTs [41]). The first block structure (from left to right) corresponds to the model head, which transforms degraded images to tokens for the Transformer body. The second and third block structures account for the Transformer body, and the proportion of the third part is positively correlated with the model size, which partly reflects the redundancy of the model (we show a larger EDT-L in Fig. F.6 of the supplementary). For the denoising task (Fig. 4 (c)), there are only two obvious block structures, of which the second one (the Transformer body) dominates. Finally, from the cross-model comparison in Fig. 4 (d) and (h), we find higher similarity scores between denoising body layers and the second-block SR layers, and significant differences from the third-block SR layers.

Further, we explore the impact of single-task pre-training on the internal representations. For SR in Fig. 4 (e)-(f), the representation of the model head remains basically unchanged across different model sizes (EDT-S and EDT-B). Meanwhile, with pre-training we observe more changes in the third-block representations than in those of the second block. For denoising, as shown in Fig. 4 (g), the internal representations change little, consistent with the finding in Table 5 that denoising tasks obtain limited improvements.

Key Findings: (1) SR models show clear stages in the internal representations and the proportion of each stage varies with the model size, while the denoising model presents a relatively uniform structure; (2) the denoising model layers show more similarity to the lower layers of SR models, containing more local information, as verified in Sec. 3.4; (3) single-task pre-training mainly affects the higher layers of SR models but has limited impact on the denoising model.

3.4 Single- and Multi-Task Pre-training

In the previous section, we observe that the Transformer body of SR models is clearly composed of two block structures and pre-training mainly changes the representations of higher layers. What is the difference between these two partitions? How does the pre-training, especially multi-task pre-training, affect the behaviors of models?

We conjecture that one possible reason for the partition lies in the differing ability of layers to incorporate local or global information. We start by analyzing self-attention layers, because their mechanism of dynamically aggregating information from other spatial locations is quite different from the fixed receptive field of the FFN layer. To obtain the horizontal or vertical attention mean distance, we average pixel distances between the queries and keys, weighted by the attention weights, for each head over 170,000 data points. Lastly, we average the horizontal and vertical attention mean distances to get the final result, where a larger distance usually indicates the use of more global information. Note that we do not record attention distances of shifted local windows, because the shift operation narrows the boundary windows and hence cannot reflect real attention distances.
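The attention mean distance of a head can be sketched as an attention-weighted average of query-key pixel distances inside one window. The use of Euclidean distance, the window size and the function name are assumptions made for illustration; the paper only states that pixel distances are averaged using the attention weights.

```python
import numpy as np

def attention_mean_distance(attn, win_h, win_w):
    """Attention-weighted mean query-key distance for each head.

    attn: array of shape (heads, N, N) with N = win_h * win_w; each row sums to 1.
    Returns an array of shape (heads,), in pixels.
    """
    ys, xs = np.meshgrid(np.arange(win_h), np.arange(win_w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)  # (N, 2)
    # Pairwise Euclidean distances between token positions, shape (N, N).
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Expected distance per query, then averaged over queries.
    return (attn * dist).sum(axis=-1).mean(axis=-1)

win_h, win_w = 4, 16  # hypothetical horizontal window
n = win_h * win_w
local_head = np.eye(n)                 # every query attends only to itself
global_head = np.full((n, n), 1.0 / n)  # every query attends uniformly
distances = attention_mean_distance(np.stack([local_head, global_head]), win_h, win_w)
```

A head that attends only to nearby positions yields a small distance, while a head that spreads its weights over the whole window yields a large one, which is the distinction between local and global heads used in the analysis.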

As shown in Fig. 5 (e)-(h), for the second block structure (counted from the head, as in Sec. 3.3), the standard deviation of attention distances (shown as the blue area) is large and the mean value is small, indicating that the attention modules in this block structure have a mix of local heads (relatively small distances) and global heads (relatively large distances). In contrast, the third block structure only contains global heads, showing that more global information is aggregated in this stage.

Compared to single-task pre-training (Fig. 5 (b) and (f)), the multi-related-task setup (×2, ×3 and ×4 SR, Fig. 5 (c) and (g)) converts more global representations (in the red box) of the third block structure to local ones, increasing the scope of the second block structure. As a consequence, as shown in Fig. 6, we observe significant PSNR improvements on all benchmarks. When replacing one SR task with the unrelated level-15 denoising (Fig. 5 (d) and (h)), we observe fewer changes in global representations, resulting in a performance drop in Fig. 6. This is mainly due to the representation mismatch of unrelated tasks, as mentioned in Sec. 3.3. More quantitative comparisons are provided in Sec. J.

Figure 6: PSNR(dB) improvements of single-task, multi-related-task and multi-unrelated-task pre-training in SR.

Key Findings: (1) the representations of SR models contain more local information in early layers while more global information in higher layers; (2) all three pre-training methods can greatly improve the performance by introducing different degrees of local information, treated as a kind of inductive bias, to the intermediate layers of model, among which multi-related-task pre-training performs best.

Furthermore, in low-level tasks, we find that simply increasing the data scale may not be the optimal option, as verified in our experiments using different data scales (see Sec. E of the supplementary materials). In contrast, multi-task pre-training is more effective and data-efficient.

4 Experiments

In this section, we conduct an ablation study of window size and provide quantitative results for super-resolution (SR), denoising and deraining. The experimental settings and visual comparisons are given in the supplementary file.

4.1 Ablation Study of Window Size

Size Set5 Set14 BSDS100 Urban100 Manga109
(8, 8) 38.37 34.51 32.48 33.57 39.82
(4, 16) 38.39 34.48 32.49 33.66 39.83
(8, 16) 38.42 34.53 32.51 33.78 39.92
(12, 12) 38.45 34.56 32.50 33.73 39.89
(6, 24) 38.45 34.57 32.52 33.80 39.93
Table 2: Ablation study of window size on PSNR(dB) in SR. Best results are in bold.

To evaluate the effectiveness of the proposed shifted crossed local attention, we conduct experiments to analyze the effects of different window sizes on final performance. As shown in Table 2, when simply increasing the square window size from 8 to 12, the results on all five SR benchmarks improve, verifying the importance of a large receptive field under the current architecture. Further evidence comes from the comparison between (4, 16) and (8, 16), where we find that a larger short side achieves superior performance. Besides, for windows with the same area, (6, 24) is better than (12, 12), indicating that our shifted crossed window attention is an effective way of quickly enlarging the receptive field to improve performance. This claim is also supported by Table 4, where our lightweight SR models (EDT-T) yield significant improvements on the high-resolution benchmark datasets Urban100 [24] and Manga109 [39]. Further, we use LAM [22], which represents the range of information utilization, to visualize receptive fields. Fig. 7 shows that our model can take advantage of a wider range of information than SwinIR [31], restoring more details.

Figure 7: LAM [22] comparison between SwinIR [31] and EDT in SR. The first column shows input along with the super-resolved result, and the second column shows the receptive field.

4.2 Super-Resolution Results

For the super-resolution (SR) task, we test our models on two settings, classical SR and lightweight SR, where the latter generally refers to models with fewer than 1M parameters.

Classical SR. We compare our EDT with state-of-the-art CNN-based methods as well as Transformer-based methods. As shown in Table 3, our method outperforms all competing algorithms by a large margin on the ×2, ×3 and ×4 scales, under comparable model capacity. In particular, the PSNR improvements are up to 0.46dB and 0.45dB on the high-resolution benchmarks Urban100 [24] and Manga109 [39]. Even without pre-training, our EDT-B still stands out as the best, achieving nearly 0.1dB gains on multiple datasets over the second-best SwinIR [31] under a fair training setting (trained on patches without pre-training).

Scale Method #Param. (M) Set5 Set14 BSDS100 Urban100 Manga109 (PSNR and SSIM per dataset)
×2
RCAN [74] 15.4 38.27 0.9614 34.12 0.9216 32.41 0.9027 33.34 0.9384 39.44 0.9786
SAN [10] 15.7 38.31 0.9620 34.07 0.9213 32.42 0.9028 33.10 0.9370 39.32 0.9792
HAN [42] 15.9 38.27 0.9614 34.16 0.9217 32.41 0.9027 33.35 0.9385 39.46 0.9785
NLSA [40] 31.9 38.34 0.9618 34.08 0.9231 32.43 0.9027 33.42 0.9394 39.59 0.9789
IPT [6] 115.5 38.37 - 34.43 - 32.48 - 33.76 - - -
SwinIR [31] 11.8 38.42 0.9622 34.48 0.9252 32.50 0.9038 33.70 0.9418 39.81 0.9796
SwinIR [31] 11.8 38.42 0.9623 34.46 0.9250 32.53 0.9041 33.81 0.9427 39.92 0.9797
EDT-B(Ours) 11.5 38.45 0.9624 34.57 0.9258 32.52 0.9041 33.80 0.9425 39.93 0.9800
EDT-B(Ours) 11.5 38.63 0.9632 34.80 0.9273 32.62 0.9052 34.27 0.9456 40.37 0.9811

×3
RCAN [74] 15.6 34.74 0.9299 30.65 0.8482 29.32 0.8111 29.09 0.8702 34.44 0.9499
SAN [10] 15.9 34.75 0.9300 30.59 0.8476 29.33 0.8112 28.93 0.8671 34.30 0.9494
HAN [42] 16.1 34.75 0.9299 30.67 0.8483 29.32 0.8110 29.10 0.8705 34.48 0.9500
NLSA [40] 44.7 34.85 0.9306 30.70 0.8485 29.34 0.8117 29.25 0.8726 34.57 0.9508
IPT [6] 115.6 34.81 - 30.85 - 29.38 - 29.49 - - -
SwinIR [31] 11.9 34.91 0.9317 30.90 0.8531 29.43 0.8140 29.65 0.8809 35.05 0.9531
SwinIR [31] 11.9 34.97 0.9318 30.93 0.8534 29.46 0.8145 29.75 0.8826 35.12 0.9537
EDT-B(Ours) 11.7 34.97 0.9316 30.89 0.8527 29.44 0.8142 29.72 0.8814 35.13 0.9534
EDT-B(Ours) 11.7 35.13 0.9328 31.09 0.8553 29.53 0.8165 30.07 0.8863 35.47 0.9550

×4
RCAN [74] 15.6 32.63 0.9002 28.87 0.7889 27.77 0.7436 26.82 0.8087 31.22 0.9173
SAN [10] 15.9 32.64 0.9003 28.92 0.7888 27.78 0.7436 26.79 0.8068 31.18 0.9169
HAN [42] 16.1 32.64 0.9002 28.90 0.7890 27.80 0.7442 26.85 0.8094 31.42 0.9177
NLSA [40] 44.2 32.59 0.9000 28.87 0.7891 27.78 0.7444 26.96 0.8109 31.27 0.9184
IPT [6] 115.6 32.64 - 29.01 - 27.82 - 27.26 - - -
SwinIR [31] 11.9 32.74 0.9020 29.06 0.7939 27.89 0.7479 27.37 0.8233 31.93 0.9246
SwinIR [31] 11.9 32.92 0.9044 29.09 0.7950 27.92 0.7489 27.45 0.8254 32.03 0.9260
EDT-B(Ours) 11.6 32.82 0.9031 29.09 0.7939 27.91 0.7483 27.46 0.8246 32.05 0.9254
EDT-B(Ours) 11.6 33.06 0.9055 29.23 0.7971 27.99 0.7510 27.75 0.8317 32.39 0.9283
Table 3: Quantitative comparison for classical SR on PSNR(dB)/SSIM on the Y channel from the YCbCr space. “” means the and models of SwinIR are pre-trained on the setup and training patch size is (ours is ). “” indicates methods with a pre-training. Best and second best results are in red and blue colors.
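PSNR on the Y channel, as reported in the table above, is commonly computed as in the following sketch. It assumes RGB images stored as floats in [0, 1] and the ITU-R BT.601 luma conversion that maps them to the range [16, 235]; the exact conversion and any border cropping used in the paper's evaluation are not stated here, so treat these choices and the function names as assumptions.

```python
import numpy as np

def rgb_to_y(img):
    """Luma (Y of YCbCr, BT.601) of a float RGB image in [0, 1]; output lies in [16, 235]."""
    return 16.0 + img @ np.array([65.481, 128.553, 24.966])

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB between two arrays on the same scale."""
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))

rng = np.random.default_rng(0)
reference = rng.random((32, 32, 3))  # stand-in for a ground-truth image
restored = np.clip(reference + 0.01 * rng.standard_normal(reference.shape), 0.0, 1.0)

psnr_y = psnr(rgb_to_y(reference), rgb_to_y(restored))
white_y = rgb_to_y(np.ones((1, 1, 3)))  # pure white maps to the top of the luma range
```

For RGB-channel PSNR, as used in the denoising comparison, the same `psnr` function would be applied directly to the images after scaling them to the 0-255 range.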

Lightweight SR. Among lightweight SR algorithms, our model achieves the best results on all benchmark datasets. Although SwinIR uses larger training patches than ours and adopts the well-trained ×2 model as initialization for the ×3 and ×4 scales, our EDT-T (without pre-training) still obtains nearly 0.2dB and 0.4dB improvements on Urban100 and Manga109 across all scales, demonstrating the superiority of our architecture.

Scale Method #Param. (K) Set5 Set14 BSDS100 Urban100 Manga109 (PSNR and SSIM per dataset)
×2
LAPAR [28] 548 38.01 0.9605 33.62 0.9183 32.19 0.8999 32.10 0.9283 38.67 0.9772
LatticeNet [34] 756 38.15 0.9610 33.78 0.9193 32.25 0.9005 32.43 0.9302 - -
SwinIR [31] 878 38.14 0.9611 33.86 0.9206 32.31 0.9012 32.76 0.9340 39.12 0.9783
EDT-T(Ours) 917 38.23 0.9615 33.99 0.9209 32.37 0.9021 32.98 0.9362 39.45 0.9789

×3
LAPAR [28] 594 34.36 0.9267 30.34 0.8421 29.11 0.8054 28.15 0.8523 33.51 0.9441
LatticeNet [34] 765 34.53 0.9281 30.39 0.8424 29.15 0.8059 28.33 0.8538 - -
SwinIR [31] 886 34.62 0.9289 30.54 0.8463 29.20 0.8082 28.66 0.8624 33.98 0.9478
EDT-T(Ours) 919 34.73 0.9299 30.66 0.8481 29.29 0.8103 28.89 0.8674 34.44 0.9498

×4
LAPAR [28] 659 32.15 0.8944 28.61 0.7818 27.61 0.7366 26.14 0.7871 30.42 0.9074
LatticeNet [34] 777 32.30 0.8962 28.68 0.7830 27.62 0.7367 26.25 0.7873 - -
SwinIR [31] 897 32.44 0.8976 28.77 0.7858 27.69 0.7406 26.47 0.7980 30.92 0.9151
EDT-T(Ours) 922 32.53 0.8991 28.88 0.7882 27.76 0.7433 26.71 0.8051 31.35 0.9180
Table 4: Quantitative comparison for lightweight SR on PSNR(dB)/SSIM on the Y channel. “” means the and models of SwinIR are pre-trained on the setup and the training patch size is (ours is and without pre-training).
Dataset Level [9] [69] [70] [71] [51] [75] [76] [6] [68] [31] (Ours) (Ours) (Ours)
CBSD68 15 33.52 33.90 33.86 33.87 34.10 - - - 34.30 34.42 34.33 34.38 34.39
25 30.71 31.24 31.16 31.21 31.43 - - - 31.69 31.78 31.73 31.76 31.76
50 27.38 27.95 27.86 27.96 28.16 28.27 28.31 28.39 28.51 28.56 28.55 28.57 28.56
Kodak24 15 34.28 34.60 34.69 34.63 34.88 - - - 35.31 35.34 35.25 35.31 35.37
25 32.15 32.14 32.18 32.13 32.41 - - - 32.89 32.89 32.84 32.89 32.94
50 28.46 28.95 28.93 28.98 29.22 29.58 29.66 29.64 29.86 29.79 29.81 29.83 29.87
McMaster 15 34.06 33.45 34.58 34.66 35.08 - - - 35.40 35.61 35.43 35.51 35.61
25 31.66 31.52 32.18 32.35 32.75 - - - 33.14 33.20 33.20 33.26 33.34
50 28.51 28.62 28.91 29.18 29.52 29.72 - 29.98 30.08 30.22 30.21 30.25 30.25
Urban100 15 33.93 32.98 33.78 33.83 34.42 - - - 34.81 35.13 34.93 35.04 35.22
25 31.36 30.81 31.20 31.40 31.99 - - - 32.60 32.90 32.78 32.86 33.07
50 27.93 27.59 27.70 28.05 28.56 29.08 29.38 29.71 29.61 29.82 29.93 29.98 30.16
Table 5: Quantitative comparison for color image denoising on PSNR(dB) on RGB channels. “” means the models of SwinIR [31] are pre-trained on the level. “” indicates methods with a pre-training. “” means our model without downsampling and without pre-training.

4.3 Denoising Results

In Table 5, apart from a range of denoising methods, we present three of our models: (1) EDT-B without pre-training; (2) EDT-B with pre-training; (3) EDT-B without downsampling and without pre-training.

It is worth noting that, unlike SR models, which benefit a lot from pre-training, denoising models only achieve 0.02-0.11dB gains. One possible reason is that we use a large training dataset in denoising tasks, which already provides sufficient data to bring the capacity of our models into full play. In addition, pre-training hardly affects the internal feature representations of the models, as discussed in Sec. 3.3. Therefore, we suggest that the Gaussian denoising task may not need a large amount of training data.

Besides, we find that our encoder-decoder-based framework performs well on high noise levels (e.g., 50), while yielding slightly inferior performance on low noise levels (e.g., 15). This could be caused by the downsampling operation in EDT. To verify this assumption, we train another EDT-B model without downsampling. As shown in Table 5, it does obtain better performance on low noise levels. Nonetheless, we suggest that the proposed EDT model is still a good choice for denoising tasks, since it strikes a good balance between performance and computational complexity. For example, the FLOPs of EDT-B (38G) are only 8.4% of those of SwinIR (451G).

4.4 Deraining Results

We also evaluate the proposed EDT on two benchmark datasets, Rain100L [66] and Rain100H [66], which contain light and heavy rain streaks, respectively. As shown in Table 6, although our EDT-B for deraining (11.5M) is far smaller than IPT (116M), it still outperforms IPT by nearly 0.9dB on the light rain setting. Meanwhile, our model achieves a 2.97dB gain over the second-best RCDNet [57] on the heavy rain setting, supporting that EDT performs well for restoration tasks with heavy degradation.

Method RAIN100L RAIN100H
DSC [35] 27.34 0.8494 13.77 0.3199
GMM [30] 29.05 0.8717 15.23 0.4498
JCAS [23] 28.54 0.8524 14.62 0.4510
Clear [17] 30.24 0.9344 15.33 0.7421
DDN [18] 32.38 0.9258 22.85 0.7250
RESCAN [29] 38.52 0.9812 29.62 0.8720
PReNet [47] 37.45 0.9790 30.11 0.9053
SPANet [58] 35.33 0.9694 25.11 0.8332
JORDER_E [66] 38.59 0.9834 30.50 0.8967
SSIR [63] 32.37 0.9258 22.47 0.7164
RCDNet [57] 40.00 0.9860 31.28 0.9093
IPT [6] 41.62 0.9880 - -
EDT-B(Ours) 42.50 0.9905 34.25 0.9485
Table 6: Quantitative comparison for image deraining on PSNR(dB)/SSIM on the Y channel. “” indicates methods with a pre-training.

5 Limitations

In this paper, we mainly investigate the internal representations of the proposed EDT and the effect of pre-training on synthesized data. Future research should be undertaken to explore the real-world setting and more tasks, further extended to video processing. Also, despite high efficiency of our encoder-decoder design, a limitation remains that we adopt a fixed downsampling strategy. As shown in Sec. 4.3, it may be a better choice to conduct adaptive downsampling based on degradation degrees of low-quality images.

6 Conclusion

Based on the proposed encoder-decoder-based Transformer, which shows high efficiency and strong performance, we perform an in-depth analysis of image pre-training in low-level vision. We find that pre-training plays a central role in developing stronger intermediate representations by incorporating more local information. We also find that the effect of pre-training is task-specific, leading to significant improvements on SR but minor gains on denoising. Lastly, we suggest that multi-task pre-training exhibits great potential in mining image priors and is far more efficient than using larger pre-training datasets. The study of pre-training on model size, data scale and ConvNets is further provided in the supplementary materials.


  • [1] Eirikur Agustsson and Radu Timofte. Ntire 2017 challenge on single image super-resolution: Dataset and study. In CVPRW, pages 126–135, 2017.
  • [2] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. PAMI, 33(5):898–916, 2010.
  • [3] Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie Line Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. 2012.
  • [4] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
  • [5] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, pages 213–229. Springer, 2020.
  • [6] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. In CVPR, pages 12299–12310, 2021.
  • [7] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. PAMI, 40(4):834–848, 2017.
  • [8] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Algorithms for learning kernels based on centered alignment. The Journal of Machine Learning Research, 13(1):795–828, 2012.
  • [9] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising by sparse 3-d transform-domain collaborative filtering. TIP, 16(8):2080–2095, 2007.
  • [10] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and Lei Zhang. Second-order attention network for single image super-resolution. In CVPR, pages 11065–11074, 2019.
  • [11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
  • [12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [13] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In ICML, pages 647–655. PMLR, 2014.
  • [14] Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. Cswin transformer: A general vision transformer backbone with cross-shaped windows. arXiv preprint arXiv:2107.00652, 2021.
  • [15] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.
  • [16] Rich Franzen. Kodak lossless true color image suite. http://r0k.us/graphics/kodak/.
  • [17] Xueyang Fu, Jiabin Huang, Xinghao Ding, Yinghao Liao, and John Paisley. Clearing the skies: A deep network architecture for single-image rain removal. TIP, 26(6):2944–2956, 2017.
  • [18] Xueyang Fu, Jiabin Huang, Delu Zeng, Yue Huang, Xinghao Ding, and John Paisley. Removing rain from single images via a deep detail network. In CVPR, pages 3855–3863, 2017.
  • [19] Ross Girshick. Fast r-cnn. In ICCV, pages 1440–1448, 2015.
  • [20] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
  • [21] Arthur Gretton, Kenji Fukumizu, Choon Hui Teo, Le Song, Bernhard Schölkopf, Alexander J Smola, et al. A kernel statistical test of independence. In NIPS, volume 20, pages 585–592. Citeseer, 2007.
  • [22] Jinjin Gu and Chao Dong. Interpreting super-resolution networks with local attribution maps. In CVPR, pages 9199–9208, 2021.
  • [23] Shuhang Gu, Deyu Meng, Wangmeng Zuo, and Lei Zhang. Joint convolutional analysis and synthesis sparse representation for single image layer separation. In ICCV, pages 1708–1716, 2017.
  • [24] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Single image super-resolution from transformed self-exemplars. In CVPR, pages 5197–5206, 2015.
  • [25] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (bit): General visual representation learning. In ECCV, pages 491–507. Springer, 2020.
  • [26] Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better? In CVPR, pages 2661–2671, 2019.
  • [27] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
  • [28] Wenbo Li, Kun Zhou, Lu Qi, Nianjuan Jiang, Jiangbo Lu, and Jiaya Jia. Lapar: Linearly-assembled pixel-adaptive regression network for single image super-resolution and beyond. NeurIPS, 33, 2020.
  • [29] Xia Li, Jianlong Wu, Zhouchen Lin, Hong Liu, and Hongbin Zha. Recurrent squeeze-and-excitation context aggregation net for single image deraining. In ECCV, pages 254–269, 2018.
  • [30] Yu Li, Robby T Tan, Xiaojie Guo, Jiangbo Lu, and Michael S Brown. Rain streak removal using layer priors. In CVPR, pages 2736–2744, 2016.
  • [31] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In ICCVW, pages 1833–1844, 2021.
  • [32] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. ICCV, 2021.
  • [33] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
  • [34] Xiaotong Luo, Yuan Xie, Yulun Zhang, Yanyun Qu, Cuihua Li, and Yun Fu. Latticenet: Towards lightweight image super-resolution with lattice block. In ECCV, pages 272–289. Springer, 2020.
  • [35] Yu Luo, Yong Xu, and Hui Ji. Removing rain from a single image via discriminative sparse coding. In ICCV, pages 3397–3405, 2015.
  • [36] Kede Ma, Zhengfang Duanmu, Qingbo Wu, Zhou Wang, Hongwei Yong, Hongliang Li, and Lei Zhang. Waterloo exploration database: New challenges for image quality assessment models. TIP, 26(2):1004–1016, 2016.
  • [37] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens Van Der Maaten. Exploring the limits of weakly supervised pretraining. In ECCV, pages 181–196, 2018.
  • [38] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, volume 2, pages 416–423. IEEE, 2001.
  • [39] Yusuke Matsui, Kota Ito, Yuji Aramaki, Azuma Fujimoto, Toru Ogawa, Toshihiko Yamasaki, and Kiyoharu Aizawa. Sketch-based manga retrieval using manga109 dataset. Multimedia Tools and Applications, 76(20):21811–21838, 2017.
  • [40] Yiqun Mei, Yuchen Fan, and Yuqian Zhou. Image super-resolution with non-local sparse attention. In CVPR, pages 3517–3526, 2021.
  • [41] Thao Nguyen, Maithra Raghu, and Simon Kornblith. Do wide and deep networks learn the same things? uncovering how neural network representations vary with width and depth. arXiv preprint arXiv:2010.15327, 2020.
  • [42] Ben Niu, Weilei Wen, Wenqi Ren, Xiangde Zhang, Lianping Yang, Shuzhen Wang, Kaihao Zhang, Xiaochun Cao, and Haifeng Shen. Single image super-resolution via a holistic attention network. In ECCV, pages 191–207. Springer, 2020.
  • [43] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.
  • [44] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • [45] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67, 2020.
  • [46] Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision transformers see like convolutional neural networks? arXiv preprint arXiv:2108.08810, 2021.
  • [47] Dongwei Ren, Wangmeng Zuo, Qinghua Hu, Pengfei Zhu, and Deyu Meng. Progressive image deraining networks: A better and simpler baseline. In CVPR, pages 3937–3946, 2019.
  • [48] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In CVPRW, pages 806–813, 2014.
  • [49] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. arXiv preprint arXiv:2105.05633, 2021.
  • [50] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, pages 843–852, 2017.
  • [51] Chunwei Tian, Yong Xu, and Wangmeng Zuo. Image denoising using deep cnn with batch renormalization. Neural Networks, 121:461–473, 2020.
  • [52] Radu Timofte, Eirikur Agustsson, Luc Van Gool, Ming-Hsuan Yang, and Lei Zhang. Ntire 2017 challenge on single image super-resolution: Methods and results. In CVPRW, pages 114–125, 2017.
  • [53] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, pages 10347–10357. PMLR, 2021.
  • [54] Ashish Vaswani, Prajit Ramachandran, Aravind Srinivas, Niki Parmar, Blake Hechtman, and Jonathon Shlens. Scaling local self-attention for parameter efficient visual backbones. In CVPR, pages 12894–12904, 2021.
  • [55] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.
  • [56] Ziyu Wan, Jingbo Zhang, Dongdong Chen, and Jing Liao. High-fidelity pluralistic image completion with transformers. arXiv preprint arXiv:2103.14031, 2021.
  • [57] Hong Wang, Qi Xie, Qian Zhao, and Deyu Meng. A model-driven deep neural network for single image rain removal. In CVPR, pages 3103–3112, 2020.
  • [58] Tianyu Wang, Xin Yang, Ke Xu, Shaozhe Chen, Qiang Zhang, and Rynson WH Lau. Spatial attentive single-image deraining with a high quality real rain dataset. In CVPR, pages 12270–12279, 2019.
  • [59] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122, 2021.
  • [60] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: Enhanced super-resolution generative adversarial networks. In ECCVW, pages 0–0, 2018.
  • [61] Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, and Huaxia Xia. End-to-end video instance segmentation with transformers. In CVPR, pages 8741–8750, 2021.
  • [62] Zhendong Wang, Xiaodong Cun, Jianmin Bao, and Jianzhuang Liu. Uformer: A general u-shaped transformer for image restoration. arXiv preprint arXiv:2106.03106, 2021.
  • [63] Wei Wei, Deyu Meng, Qian Zhao, Zongben Xu, and Ying Wu. Semi-supervised transfer learning for image rain removal. In CVPR, pages 3877–3886, 2019.
  • [64] Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. Cvt: Introducing convolutions to vision transformers. arXiv preprint arXiv:2103.15808, 2021.
  • [65] Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, and Ross Girshick. Early convolutions help transformers see better. arXiv preprint arXiv:2106.14881, 2021.
  • [66] Wenhan Yang, Robby T Tan, Jiashi Feng, Zongming Guo, Shuicheng Yan, and Jiaying Liu. Joint rain detection and removal from a single image with contextualized deep networks. PAMI, 42(6):1377–1393, 2019.
  • [67] Roman Zeyde, Michael Elad, and Matan Protter. On single image scale-up using sparse-representations. In International conference on curves and surfaces, pages 711–730. Springer, 2010.
  • [68] Kai Zhang, Yawei Li, Wangmeng Zuo, Lei Zhang, Luc Van Gool, and Radu Timofte. Plug-and-play image restoration with deep denoiser prior. PAMI, 2021.
  • [69] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. TIP, 26(7):3142–3155, 2017.
  • [70] Kai Zhang, Wangmeng Zuo, Shuhang Gu, and Lei Zhang. Learning deep cnn denoiser prior for image restoration. In CVPR, pages 3929–3938, 2017.
  • [71] Kai Zhang, Wangmeng Zuo, and Lei Zhang. Ffdnet: Toward a fast and flexible solution for cnn-based image denoising. TIP, 27(9):4608–4622, 2018.
  • [72] Lei Zhang, Xiaolin Wu, Antoni Buades, and Xin Li. Color demosaicking by local directional interpolation and nonlocal adaptive thresholding. Journal of Electronic imaging, 20(2):023016, 2011.
  • [73] Pengchuan Zhang, Xiyang Dai, Jianwei Yang, Bin Xiao, Lu Yuan, Lei Zhang, and Jianfeng Gao. Multi-scale vision longformer: A new vision transformer for high-resolution image encoding. arXiv preprint arXiv:2103.15358, 2021.
  • [74] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In ECCV, pages 286–301, 2018.
  • [75] Yulun Zhang, Kunpeng Li, Kai Li, Bineng Zhong, and Yun Fu. Residual non-local attention networks for image restoration. In ICLR, 2018.
  • [76] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image restoration. PAMI, 43(7):2480–2495, 2020.
  • [77] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.

A Network Architecture

The proposed encoder-decoder-based transformer (EDT) in Fig. 2 is composed of a convolution encoder and decoder as well as a transformer body. It processes low-resolution (e.g., super-resolution) and high-resolution (e.g., denoising) inputs using different encoders and decoders, where the high-resolution path involves additional downsampling and upsampling operations, as shown in Fig. B.1. This design enables the transformer body to model long-range relations at a low resolution, thus being computationally efficient. The body consists of multiple stages of transformer blocks, using a global connection in each stage. We show the structure of a transformer block in Fig. B.2, where the (shifted) crossed local attention and anti-blocking FFN are detailed in Sec. 2.2. We design four variants of EDT and the corresponding configurations are detailed in Table 1.

B Datasets

As illustrated in Sec. 3.1, unless specified otherwise, we only use 200K images in ImageNet [11] for pre-training. In Sec. E, we further discuss the effect of data scale on pre-training. As for fine-tuning, the datasets for super-resolution (SR), denoising and deraining are described as follows.

Super-Resolution. The models are trained on 800 images of DIV2K [1] and 2650 images of Flickr2K [52]. The evaluation benchmark datasets consist of Set5 [3], Set14 [67], BSDS100 [38], Urban100 [24] and Manga109 [39].

Denoising. Following [76, 68, 31], we use the combination of 800 DIV2K images, 2650 Flickr2K images, 400 BSD500 [2] images and 4744 WED [36] images as training data, and use CBSD68 [38], Kodak24 [16], McMaster [72] and Urban100 for evaluation.

Deraining. Rain100L [66] and Rain100H [66] are used for training and testing. Following [47], we exclude 546 training images in Rain100H that share the same background content as the testing images.

Figure B.1: Structures of the convolution blocks (CB) in the encoder and decoder. The dotted boxes and lines represent additional downsampling and upsampling operations when processing high-resolution inputs.
Figure B.2: Structure of the transformer block, including layer normalizations (LN), a (shifted) crossed local multi-head attention module ((S)CL-MSA) and an anti-blocking feed-forward network (Anti-FFN).
Figure B.3: The first row shows CKA similarities of light streak deraining models without pre-training and with single-task pre-training. The second row shows the corresponding attention head mean distances of Transformer blocks. We do not plot shifted local windows, so the last blue dotted line has no matching point.

C Training Details

All experiments are conducted on eight NVIDIA GeForce RTX 2080Ti GPUs, except that multi-task pre-training models and denoising models without downsampling are trained on eight NVIDIA V100 GPUs.

Pre-training. We train models with a batch size (8 GPUs) of 32 per task for 500K iterations. The initial learning rate is set to and halved at iterations. We adopt the Adam optimizer with and . The input patch size is set to for SR and for denoising/deraining, thus different tasks share the same feature size in the body part.

Fine-tuning. We fine-tune the SR models for 500K iterations, denoising models for 800K iterations and deraining models for 200K iterations. The learning rate is . Other settings remain the same as pre-training.

Training from Scratch. The initial learning rate is uniformly set to . We train SR models for 500K iterations, during which the learning rate is halved at iterations. As for denoising, we train models for 800K iterations and the learning rate is halved at iterations. In terms of deraining, we train models for 200K iterations, halving the learning rate at iterations. Other settings are the same as pre-training.

As mentioned in Sec. 4.3, we also train denoising models without downsampling from scratch. Following [31], the batch size (8 GPUs) is set to 8 and the number of iterations is 1600K. The initial learning rate is and halved at iterations.
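
All training stages above use a step schedule that halves the learning rate at fixed iterations. The following is a minimal sketch of such a schedule. The base learning rate and milestones in the usage lines are hypothetical placeholders chosen for illustration; they are not the paper's settings, and `halved_lr` is an illustrative name.

```python
def halved_lr(iteration, base_lr, milestones):
    """Learning rate after halving once at every milestone already reached."""
    num_halvings = sum(1 for m in milestones if iteration >= m)
    return base_lr * 0.5 ** num_halvings

# Hypothetical values for illustration only, not the paper's configuration.
base_lr = 1e-4
milestones = [200_000, 400_000]
for it in (0, 200_000, 450_000):
    print(it, halved_lr(it, base_lr, milestones))
```

In a PyTorch training loop, the same behaviour is usually obtained with a multi-step scheduler whose decay factor is 0.5.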

Figure C.4: PSNR(dB) improvements of different data scales during single-task pre-training for EDT-L in SR.

D Deraining with Pre-training

As shown in the first row of Fig. B.3, the CKA maps of deraining models mainly consist of three block structures, just like the SR models illustrated in Sec. 3.3. Pre-training makes some attention heads at higher layers attend more locally (see the attention modules in the red box), so some global representations in the third block are converted to the second block. This supports our claim that pre-training introduces more local information as an inductive bias to the model, yielding better performance.
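
To make "attending more locally" concrete, the sketch below computes a mean attention distance for one head: each query's attention weights are used to average the pixel distance to the keys, and the result is averaged over queries. It is a simplified illustration that assumes full attention over a small row-major token grid. The paper's windowed attention and its exact measurement protocol are not reproduced here, and the function name is illustrative.

```python
import numpy as np

def mean_attention_distance(attn, height, width):
    """Attention-weighted average pixel distance for one head.

    attn: (N, N) matrix whose rows sum to 1, with N = height * width
    tokens laid out in row-major order.
    """
    ys, xs = np.divmod(np.arange(height * width), width)
    coords = np.stack([ys, xs], axis=1).astype(np.float64)
    # Pairwise Euclidean distance between token positions.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Expected distance per query, then averaged over all queries.
    return float(np.mean(np.sum(attn * dist, axis=1)))
```

A head that attends only to its own position yields a distance of zero, while a head with uniform attention yields the average pairwise distance of the grid, so smaller values indicate more local attention.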

E Effect of Data Scale on Pre-training

Model Pre-training Data Set5 Set14 Urban100 Manga109
EDT-B none 0 38.45 34.57 33.80 39.93
EDT-B single-task 50K 38.53 34.66 33.86 40.14
EDT-B single-task 100K 38.55 34.68 33.90 40.18
EDT-B single-task 200K 38.56 34.71 33.95 40.25
EDT-B single-task 400K 38.61 34.75 34.05 40.37
EDT-B multi-task 200K 38.63 34.80 34.27 40.37
Table E.1: PSNR(dB) results of different pre-training data scales in SR. Rows marked “single-task” use the base model with single-task pre-training (a data scale of 0 means training from scratch), and the row marked “multi-task” uses the base model with multi-related-task pre-training. The best results are in bold.

In this section, we investigate how the pre-training data scale affects super-resolution performance. As illustrated in Fig. C.4 and Table E.1, for the EDT-B model, PSNR improves steadily on multiple SR benchmarks as the data scale increases from 50K to 400K during single-task pre-training. Note that we double the pre-training iterations (from 500K to 1M) for the data scale of 400K so that the additional data can be fully exploited. However, a longer pre-training period greatly increases the training burden.

In contrast, as shown in Table E.1, multi-task pre-training (with 500K training iterations) successfully breaks through this limit. Our EDT-B model with multi-task pre-training on 200K images achieves new state-of-the-art results on all benchmarks, even though a smaller data scale is adopted. Thus, we suggest that multi-related-task pre-training is usually more effective and data-efficient in low-level vision tasks.
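
The comparison can be made explicit by computing the gains over the model trained from scratch. The short script below uses the Urban100 and Manga109 columns of Table E.1; the dictionary keys are labels chosen here for readability.

```python
# PSNR (dB) values copied from Table E.1 (EDT-B, SR).
baseline = {"Urban100": 33.80, "Manga109": 39.93}  # data scale 0, no pre-training
runs = {
    "single-task, 50K": {"Urban100": 33.86, "Manga109": 40.14},
    "single-task, 100K": {"Urban100": 33.90, "Manga109": 40.18},
    "single-task, 200K": {"Urban100": 33.95, "Manga109": 40.25},
    "single-task, 400K": {"Urban100": 34.05, "Manga109": 40.37},
    "multi-task, 200K": {"Urban100": 34.27, "Manga109": 40.37},
}

# Gain of every run over the baseline, rounded to the table's precision.
gains = {
    name: {dataset: round(score - baseline[dataset], 2)
           for dataset, score in scores.items()}
    for name, scores in runs.items()
}
for name, gain in gains.items():
    print(name, gain)
```

On Urban100, multi-task pre-training with 200K images gains 0.47dB over the baseline, compared with 0.25dB for single-task pre-training with 400K images and twice as many iterations.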

Figure E.5: PSNR(dB) improvements of four variants of EDT models using 200K pre-training (single-task) images in SR. “T”, “S”, “B” and “L” refer to tiny, small, base and large EDT models. The improvement of EDT-T on the Urban100 benchmark is 0.00dB, thus we do not plot the bar.
Model Pre-training Set5 Set14 BSDS100 Urban100 Manga109
EDT-T No 38.23 33.99 32.37 32.98 39.45
EDT-T Yes 38.30 34.09 32.40 32.98 39.57
EDT-S No 38.38 34.36 32.48 33.61 39.78
EDT-S Yes 38.49 34.57 32.52 33.67 40.05
EDT-B No 38.45 34.57 32.52 33.80 39.93
EDT-B Yes 38.56 34.71 32.57 33.95 40.25
EDT-L No 38.47 34.51 32.53 33.91 40.02
EDT-L Yes 38.59 34.71 32.60 34.07 40.33
Table E.2: PSNR(dB) results of four variants of EDT models using 200K pre-training (single-task) images in SR. “Yes” in the Pre-training column indicates models with single-task pre-training.

F Effect of Model Size on Pre-training

We conduct experiments to compare the effect of single-task pre-training on four model variants in the SR task. Fig. E.5 visualizes the PSNR improvements of pre-trained models over their counterparts trained from scratch. Models with larger capacities generally obtain larger improvements. In particular, pre-training still brings clear gains on the already strong EDT-L model, showing the potential of pre-training in low-level tasks. The detailed quantitative results are provided in Table E.2.
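
As a quick check of this trend, the script below averages the per-benchmark gains in Table E.2 for each variant, taking the second row of each model pair as the pre-trained model (which agrees with the "After" rows of Table H.3). The averaging over the five benchmarks is a summary introduced here for illustration and does not appear in the paper.

```python
# PSNR (dB) on Set5, Set14, BSDS100, Urban100, Manga109 from Table E.2.
scratch = {
    "EDT-T": [38.23, 33.99, 32.37, 32.98, 39.45],
    "EDT-S": [38.38, 34.36, 32.48, 33.61, 39.78],
    "EDT-B": [38.45, 34.57, 32.52, 33.80, 39.93],
    "EDT-L": [38.47, 34.51, 32.53, 33.91, 40.02],
}
pretrained = {
    "EDT-T": [38.30, 34.09, 32.40, 32.98, 39.57],
    "EDT-S": [38.49, 34.57, 32.52, 33.67, 40.05],
    "EDT-B": [38.56, 34.71, 32.57, 33.95, 40.25],
    "EDT-L": [38.59, 34.71, 32.60, 34.07, 40.33],
}

# Mean gain from pre-training over the five benchmarks, per model variant.
mean_gain = {
    model: round(
        sum(p - s for p, s in zip(pretrained[model], scratch[model])) / 5, 3
    )
    for model in scratch
}
print(mean_gain)
```

The mean gain grows with model size, from about 0.06dB for EDT-T to about 0.17dB for EDT-L.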

As illustrated in Sec. 3.3, there are roughly three block structures in the CKA maps of SR models. Here we also visualize the CKA map of the EDT-L model in Fig. F.6. Compared with EDT-B, the third-block representations of EDT-L account for the vast majority of layers and show high similarities, which somewhat reflects the redundancy of the EDT-L model.
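
For readers unfamiliar with CKA maps, the sketch below computes a commonly used linear form of centered kernel alignment between two layer representations. It is a single-batch illustration under the assumption that each layer's features are flattened to one row per example. The estimator actually used for the figures may differ (for example, a minibatch variant), and `linear_cka` is an illustrative name.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two representations of the same n examples.

    x: (n, d1) and y: (n, d2) feature matrices, one row per example.
    """
    x = x - x.mean(axis=0, keepdims=True)  # center every feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))
```

Evaluating this score for every pair of layers gives a square similarity map, and contiguous groups of layers with high mutual similarity appear as the block structures discussed above.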

Figure F.6: CKA similarities between all pairs of layers in EDT-B and EDT-L SR models.
Figure F.7: The first row shows CKA similarities between ConvNets (RRDB [60] and RCAN [74]) and our EDT-B with single-task pre-training in SR. The second row shows the ratios of layer similarity larger than 0.6, where “s” is the similarity between the current layer of EDT-B model and any layer of ConvNets.

G EDT vs. ConvNets with Pre-training

We further explore the relationship between the internal representations of EDT and CNN-based models (RRDB [60] and RCAN [74]), as well as the superiority of our transformer architecture over ConvNets. For RRDB and RCAN, apart from the head and tail, we use the outputs of blocks, e.g., residual dense blocks in RRDB and residual channel attention blocks in RCAN, to compute CKA similarities.

As shown in Fig. F.7, the early layers in EDT-B have more representations similar to those learned in RRDB and RCAN, which tend to be local as mentioned in Sec. 3.4, while higher layers in EDT-B incorporate more global information, showing clear differences compared to ConvNets. Fig. G.8 demonstrates that our EDT-B obtains comparable or greater improvements from pre-training, despite higher baselines and fewer parameters.
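
The second row of Fig. F.7 summarizes each EDT-B layer by a ratio of similarities larger than 0.6. One plausible reading, sketched below, is the fraction of ConvNet layers whose similarity to the given EDT-B layer exceeds the threshold. The matrix layout and the function name are assumptions made for this illustration.

```python
import numpy as np

def similar_layer_ratio(cka_map, threshold=0.6):
    """Per-row fraction of entries above the threshold.

    cka_map[i, j] is assumed to hold the similarity between
    EDT layer i and ConvNet layer j.
    """
    return (np.asarray(cka_map) > threshold).mean(axis=1)

# Toy 2 x 3 similarity map: two EDT layers against three ConvNet layers.
toy_map = [[0.9, 0.7, 0.2],
           [0.5, 0.61, 0.1]]
ratios = similar_layer_ratio(toy_map)
print(ratios)
```

Under this reading, a high ratio for early layers and a low ratio for higher layers would correspond to the trend described above.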

Figure G.8: Quantitative comparison of PSNR(dB) between ConvNets (RRDB [60] and RCAN [74]) and our EDT-B without (“W/o”) and with (“W/”) single-task pre-training in SR.

H Analysis of Kernels in Anti-blocking FFN

We visualize the kernels of the depth-wise convolutions in the anti-blocking feed-forward network (Anti-FFN). As shown in Fig. H.9, kernels at lower layers follow a more uniform distribution, while kernels at higher layers are more diverse. Most kernels at lower layers resemble low-pass filters; they act like anti-aliasing filters that suppress the possible blocking effect caused by window splitting in self-attention, which meets our expectations. Kernels at higher layers show high diversity and learn various local contexts. Without the proposed anti-blocking design, there is a drop of nearly 0.1dB on both Urban100 and Manga109 in SR, verifying the necessity of this design.
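
The anti-aliasing intuition can be illustrated with a toy example. The sketch below applies a 3×3 box filter, used here as a stand-in for a learned low-pass depth-wise kernel, to a feature map with a hard seam between two windows and measures the largest jump between horizontally adjacent values. The box kernel, the toy feature map and the helper `filter2d` are assumptions for illustration and are not taken from the trained model.

```python
import numpy as np

def filter2d(image, kernel):
    """Same-size 2D correlation with edge padding (one depth-wise channel)."""
    kh, kw = kernel.shape
    h, w = image.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

# Toy feature map with a hard seam between two attention windows.
feature = np.zeros((8, 8))
feature[:, 4:] = 1.0

box = np.full((3, 3), 1.0 / 9.0)  # hypothetical low-pass kernel
smoothed = filter2d(feature, box)

# Largest jump between horizontally adjacent values, before and after.
seam_before = np.abs(np.diff(feature, axis=1)).max()
seam_after = np.abs(np.diff(smoothed, axis=1)).max()
print(seam_before, seam_after)
```

The largest jump across the seam drops from 1 to about one third, which is the kind of smoothing a low-pass kernel provides at window boundaries.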

Figure H.9: Visualization of kernels of the anti-blocking FFN in the first (“Stage 0”) and last (“Stage 5”) Transformer Stage in EDT-B in SR.
Figure H.10: Sub-figures (a)-(c) show CKA similarities between all pairs of layers in EDT-B SR model, level-15 EDT-B denoising model and light streak EDT-B deraining model with single-task pre-training, and the similarities between before and after fine-tuning are shown in (d)-(f).
Model Stage Set5 Set14 Urban100 Manga109
EDT-T Before 38.27 33.96 32.83 39.35
EDT-T After 38.30 34.09 32.98 39.57
EDT-S Before 38.46 34.30 33.45 39.90
EDT-S After 38.49 34.57 33.67 40.05
EDT-B Before 38.49 34.44 33.73 40.10
EDT-B After 38.56 34.71 33.95 40.25
EDT-L Before 38.54 34.52 33.90 40.23
EDT-L After 38.59 34.71 34.07 40.33
Table H.3: PSNR(dB) results of four variants of EDT models before (single-task pre-training) and after fine-tuning in SR.

I Before and After Fine-tuning

Considering that our models are trained in two stages, pre-training on ImageNet and fine-tuning on the target dataset, we also study how model representations differ before and after fine-tuning. As shown in Fig. H.10 (d)-(e), fine-tuning mainly changes the higher-layer representations of the SR model but has little effect on the denoising model, a phenomenon similar to the comparison between models without and with pre-training in Sec. 3.3. As for the deraining task, model representations do not change much after fine-tuning (almost all similarities are larger than 0.98), and the second and third blocks change to a similar degree.

J Single- and Multi-Task Pre-training

As shown in Table J.4, multi-related-task pre-training leads to clearly larger improvements in PSNR(dB) and SSIM in all tasks, especially in SR and deraining. Since the multi-task setting lets the transformer body see more samples per iteration, we also run single-task pre-training with a larger batch size in Table J.5. We observe that multi-related-task pre-training is more effective and efficient, as it provides initialization for multiple tasks.

Task Dataset Setting Single Multi
SR Set14 34.71/0.9266 34.80/0.9273
30.99/0.8537 31.09/0.8553
29.20/0.7960 29.23/0.7971
Urban100 33.95/0.9435 34.27/0.9456
29.82/0.8825 30.07/0.8863
27.61/0.8275 27.75/0.8317
Manga109 40.25/0.9806 40.37/0.9811
35.30/0.9542 35.47/0.9550
32.22/0.9265 32.39/0.9283
Denoising CBSD68 g15 34.34/0.9348 34.38/0.9352
g25 31.74/0.8932 31.76/0.8937
g50 28.56/0.8118 28.57/0.8120
Urban100 g15 34.94/0.9514 35.04/0.9522
g25 32.79/0.9298 32.86/0.9307
g50 29.93/0.8886 29.98/0.8892
Deraining Rain100L light 41.80/0.9900 42.50/0.9905
Rain100H heavy 33.97/0.9400 34.25/0.9485
Table J.4: Quantitative comparison between single-task (“Single”) and multi-related-task (“Multi”) pre-training for EDT-B model on PSNR(dB)/SSIM in SR, denoising and deraining. The multi-task setting of SR includes , and , denoising includes g15, g25 and g50, and deraining includes light and heavy rain streaks. For clarity, we only provide partial results on several datasets.
Type Batch Set5 Set14 BSDS100 Urban100 Manga109
Single 32 38.56 34.71 32.57 33.95 40.25
Single 96 38.61 34.83 32.61 34.14 40.39
Multi 32 38.63 34.80 32.62 34.27 40.37
Table J.5: PSNR(dB) comparison between single-task large-batch and multi-related-task pre-training for EDT-B model in SR. “Batch” represents the batch size of a single task.

K Visual Comparison with SOTA Methods

We present more visual examples of denoising, super-resolution and deraining in Fig. K.11, Fig. K.12 and Fig. K.13. Compared with other state-of-the-art methods, the proposed EDT successfully recovers more regular structures and richer textures, producing high-quality images.

Figure K.11: Qualitative comparison in denoising with noise level 50 on CBSD68 [38] and Urban100 [24].
Figure K.12: Qualitative comparison in lightweight SR on Urban100 [24] and Set14 [67].
Figure K.13: Qualitative comparison in deraining with heavy rain streaks on Rain100H [66].