Pre-training has driven numerous state-of-the-art results in high-level computer vision, but few attempts have been made to investigate how pre-training acts in image processing systems. In this paper, we present an in-depth study of image pre-training. To conduct this study on solid ground with practical value in mind, we first propose a generic, cost-effective Transformer-based framework for image processing. It yields highly competitive performance across a range of low-level tasks under constrained parameters and computational complexity. Then, based on this framework, we design a whole set of principled evaluation tools to rigorously and comprehensively diagnose image pre-training in different tasks, and uncover its effects on internal network representations. We find pre-training plays strikingly different roles in low-level tasks. For example, pre-training introduces more local information to higher layers in super-resolution (SR), yielding significant performance gains, while it hardly affects internal feature representations in denoising, resulting in only marginal gains. Further, we explore different methods of pre-training, revealing that multi-task pre-training is more effective and data-efficient. All codes and models will be released at https://github.com/fenglinglwb/EDT.
Early studies of pre-training mainly focus on convolutional neural networks (CNNs) and have shown that ConvNets pre-trained on ImageNet classification yield significant improvements on a wide spectrum of downstream tasks. More recently, thanks to large-scale pre-training, Transformer architectures have achieved competitive performance in both NLP [43, 44, 4, 12, 45] and computer vision [5, 77, 15, 53, 32, 64, 56, 59, 61, 49, 73].
Despite the success of pre-training in high-level vision, few efforts have been made to explore its role in low-level vision. To the best of our knowledge, the sole pioneer exploring this point is IPT. With the help of large-scale synthesized data (over 10 million images), IPT obtains strong improvements on several tasks. However, the conventional Transformer design faces great challenges in handling high-resolution inputs, due to its massive number of parameters (e.g., 116M for IPT) and huge computational cost. This makes it difficult to apply in practice and prohibitively hard to explore various system design choices. In fact, a more detailed analysis is sorely needed to understand how pre-training affects the representations of models.
To address these pressing issues, we first propose a novel encoder-decoder-based Transformer (EDT), which is data- and computation-efficient yet powerful. Bringing the strengths of CNNs and Transformers into full play, we exploit multi-scale information and long-range interaction. The encoder-decoder architecture, together with the carefully designed Transformer block, makes our EDT highly efficient while achieving state-of-the-art results on multiple low-level tasks, especially those with heavy degradation. For example, as shown in Fig. 1, EDT yields a 0.49dB improvement in SR on the Urban100 benchmark compared to IPT, while our SR model size (11.6M) is only 10.0% of IPT (115.6M) and requires only 200K images (15.6% of ImageNet) for pre-training. Also, our denoising model obtains superior performance on level-50 Gaussian denoising with only 38 GFLOPs, merely 8.4% of SwinIR's 451 GFLOPs. Moreover, we develop four variants of EDT with different model sizes, rendering our framework easily applicable in various scenarios.
Based on the efficient EDT framework introduced above, the paper then moves on to systematically exploring and evaluating, for the first time, how image pre-training performs in low-level vision tasks (Sec. 3). Using centered kernel alignment [26, 8, 46] as a network "diagnosing" measure, we design a set of pre-training strategies and thoroughly test them on different image processing tasks. As a result, we uncover their respective effects on internal network representations and draw useful guidelines for applying pre-training to low-level vision. For instance, we find that simply increasing the data scale may not be the optimal option; in contrast, multi-related-task pre-training is more effective and data-efficient. Our study sheds light on advancing efficient Transformer-based image pre-training.
The contributions of our work can be summarized as:
We present a highly efficient and generic Transformer framework for low-level vision. Our model achieves state-of-the-art performance under constrained parameters and computational complexity.
We are the first to deliver an in-depth study on image pre-training for low-level vision, uncovering insights about how pre-training affects the internal representations of models and how to conduct effective pre-training.
As illustrated in Fig. 2, the proposed EDT consists of a lightweight convolution-based encoder and decoder as well as a Transformer-based body, capable of efficiently modeling long-range interactions.
Despite the success of Transformers, their high computational cost makes them difficult to apply to high-resolution inputs. To improve encoding efficiency, images are first downsampled with strided convolutions for tasks with high-resolution inputs (e.g., denoising or deraining), while they are processed at the original size for tasks with low-resolution inputs (e.g., SR). The stack of early convolutions has also proven useful for stabilizing optimization. The features then pass through multiple stages of Transformer blocks, achieving a large receptive field at a low computational cost. During the decoding phase, we upsample the features back to the input size using transposed convolutions for denoising or deraining, while maintaining the size for SR. Besides, skip connections are introduced to enable fast convergence during training. In particular, there is an additional convolutional upsampler before the output for super-resolution.
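To make this pipeline concrete, the following PyTorch sketch illustrates the high-resolution path: strided convolutions downsample the input, a (placeholder) Transformer body operates at the reduced resolution, and transposed convolutions restore the original size, with skip connections around the body and the whole network. The 4x downsampling factor, channel width and activation choices are illustrative assumptions rather than the exact EDT configuration.

```python
import torch
import torch.nn as nn

class EncoderDecoderSketch(nn.Module):
    """Minimal sketch of an EDT-style encoder-body-decoder layout.

    The 4x downsampling factor, channel widths and the identity body are
    illustrative assumptions, not the exact EDT configuration.
    """
    def __init__(self, channels=64, body=None):
        super().__init__()
        # Encoder: strided convolutions reduce spatial size for high-res inputs.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1),            # 1/2
            nn.LeakyReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1),     # 1/4
        )
        # Transformer body placeholder (stages of attention blocks in the paper).
        self.body = body if body is not None else nn.Identity()
        # Decoder: transposed convolutions restore the input resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(channels, channels, 2, stride=2),
            nn.LeakyReLU(inplace=True),
            nn.ConvTranspose2d(channels, 3, 2, stride=2),
        )

    def forward(self, x):
        feat = self.encoder(x)
        feat = feat + self.body(feat)   # skip connection around the body
        out = self.decoder(feat)
        return out + x                  # global residual toward the clean image


if __name__ == "__main__":
    noisy = torch.randn(1, 3, 128, 128)
    print(EncoderDecoderSketch()(noisy).shape)  # torch.Size([1, 3, 128, 128])
```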
In general, a Vision Transformer block consists of multi-head self-attention and a feed-forward network (FFN), and many works [62, 6, 32, 14] tailor these two parts to different tasks. In this section, we introduce the shifted crossed local attention and the anti-blocking FFN.
Shifted Crossed Local Attention. To reduce computational complexity, several works [32, 54] leverage shifted or halo windows to perform local self-attention. However, the slow growth of effective receptive fields hampers their representational capability. Later work proposes globally horizontal and vertical stripe attention to achieve a global receptive field, but the computation becomes a heavy burden when processing high-resolution images. Drawing insights from these works [32, 14], we design a local self-attention mechanism with shifted crossed windows that strikes a balance between receptive field growth and computational cost, showing competitive performance.
As shown in Fig. 3, a given feature map is evenly split into two parts along the channel dimension, and each half performs multi-head self-attention (MSA) in either a horizontal or a vertical local window, with window size $h \times w$ (horizontal) or $w \times h$ (vertical), where generally $h < w$. The (shifted) crossed local attention (S)CL-MSA is formally defined as:

$$\text{(S)CL-MSA}(X) = \text{Proj}\big(\big[\text{H-MSA}(X_1),\ \text{V-MSA}(X_2)\big]\big), \qquad [X_1, X_2] = \text{Split}(X),$$
where the projection layer fuses the two attention results. Then, in the next Transformer block, the horizontal and vertical windows are shifted. The crossed windows with shifts dramatically increase the effective receptive field, and we show in Sec. 4.1 that this fast growth of the receptive field is beneficial for achieving higher restoration quality. For a fixed window size, the computational complexity of our attention module grows linearly with the number of pixels in the image, in contrast to the quadratic complexity of global self-attention.
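The following sketch illustrates the crossed local attention idea under simplifying assumptions: the channel split, horizontal/vertical window partition and output projection follow the description above, while window shifting, relative position bias and other EDT details are omitted, and the window size (4, 16) and head count are chosen arbitrarily for illustration.

```python
import torch
import torch.nn as nn

def window_attention(x, attn, win_h, win_w):
    """Apply multi-head self-attention within non-overlapping win_h x win_w windows.

    x: (B, C, H, W); H and W are assumed divisible by the window size.
    """
    B, C, H, W = x.shape
    # Partition into windows and flatten each window into a token sequence.
    x = x.reshape(B, C, H // win_h, win_h, W // win_w, win_w)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, win_h * win_w, C)  # (B*nWin, L, C)
    x, _ = attn(x, x, x, need_weights=False)
    # Reverse the window partition.
    x = x.reshape(B, H // win_h, W // win_w, win_h, win_w, C)
    return x.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)

class CrossedLocalAttention(nn.Module):
    """Sketch of crossed local attention: one channel half attends inside
    horizontal (h x w) windows, the other inside vertical (w x h) windows."""
    def __init__(self, dim=64, heads=4, win=(4, 16)):
        super().__init__()
        self.win = win
        self.attn_h = nn.MultiheadAttention(dim // 2, heads, batch_first=True)
        self.attn_v = nn.MultiheadAttention(dim // 2, heads, batch_first=True)
        self.proj = nn.Conv2d(dim, dim, 1)  # fuses the two attention results

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)          # split evenly along channels
        h, w = self.win
        y1 = window_attention(x1, self.attn_h, h, w)   # horizontal windows
        y2 = window_attention(x2, self.attn_v, w, h)   # vertical windows
        return self.proj(torch.cat([y1, y2], dim=1))

if __name__ == "__main__":
    feat = torch.randn(1, 64, 48, 48)
    print(CrossedLocalAttention()(feat).shape)  # torch.Size([1, 64, 48, 48])
```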
Anti-Blocking FFN. To eliminate the possible blocking effect caused by the window partition, we design an anti-blocking feed-forward network (Anti-FFN), in which the anti-blocking operation is implemented with a depth-wise convolution with a large kernel size. Note that a similar strategy has been adopted in prior work to leverage local context. We further explore the role of this operation in low-level tasks in Sec. H of the supplementary file.
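A minimal sketch of what such an Anti-FFN could look like is given below; the 7x7 kernel size of the depth-wise convolution and its exact position relative to the activation are assumptions, while the expansion ratio of 2 matches the setting used for all EDT variants.

```python
import torch
import torch.nn as nn

class AntiBlockingFFN(nn.Module):
    """Sketch of a feed-forward network with an anti-blocking depth-wise
    convolution. Kernel size and the placement of the depth-wise convolution
    are illustrative assumptions."""
    def __init__(self, dim=64, expansion=2, kernel_size=7):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Conv2d(dim, hidden, 1)
        # Depth-wise convolution with a large kernel mixes information across
        # neighbouring windows, counteracting blocking artifacts.
        self.anti_block = nn.Conv2d(hidden, hidden, kernel_size,
                                    padding=kernel_size // 2, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden, dim, 1)

    def forward(self, x):               # x: (B, C, H, W)
        return self.fc2(self.act(self.anti_block(self.fc1(x))))

if __name__ == "__main__":
    print(AntiBlockingFFN()(torch.randn(1, 64, 48, 48)).shape)
```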
For a fair comparison with other architectures, we build four versions of EDT with different model sizes and computational complexities. As shown in Table 1, apart from the base model (EDT-B), we also provide EDT-T (Tiny), EDT-S (Small) and EDT-L (Large). The main differences lie in the channel number, stage number and head number of the Transformer body. We uniformly set the block number in each Transformer stage to 6 and the expansion ratio of the FFN to 2, and use the same window size across variants.
| Model | EDT-T | EDT-S | EDT-B | EDT-L |
|---|---|---|---|---|
| #Param. (M) | 0.9 | 4.2 | 11.5 | 40.2 |
| FLOPs (G) | 2.8 | 12.4 | 37.6 | 136.4 |
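For quick reference, the figures above can be summarized programmatically; only the parameter counts and FLOPs listed in the table, plus the shared block and expansion settings, are included, since the per-variant channel, stage and head numbers are detailed in Table 1 of the paper.

```python
# Illustrative summary of the four EDT variants (values from the table above).
EDT_VARIANTS = {
    "EDT-T": {"params_M": 0.9,  "flops_G": 2.8},
    "EDT-S": {"params_M": 4.2,  "flops_G": 12.4},
    "EDT-B": {"params_M": 11.5, "flops_G": 37.6},
    "EDT-L": {"params_M": 40.2, "flops_G": 136.4},
}
# Settings shared by all variants (from the text); window size is also shared.
SHARED = {"blocks_per_stage": 6, "ffn_expansion": 2}
```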
Following prior work, we adopt the ImageNet dataset in the pre-training stage. Unless specified otherwise, we only use 200K images for pre-training. We choose three representative low-level tasks: super-resolution (SR), denoising and deraining. Referring to [6, 1, 23], we simulate the degradation procedure to synthesize low-quality images. For SR, we use bicubic interpolation to obtain low-resolution images. For denoising and deraining, Gaussian noise (in RGB space) and rain streaks are directly added to the clean images. In this work, we explore three upscaling settings in SR, noise levels 15/25/50 in denoising, and light/heavy rain streaks in deraining.
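As a rough sketch, low-quality inputs for SR and denoising could be synthesized as follows; the x4 scale, the noise level and the use of PyTorch's bicubic interpolation (rather than the MATLAB-style bicubic typically used in SR benchmarks) are illustrative simplifications, and rain-streak synthesis is omitted.

```python
import torch
import torch.nn.functional as F

def synthesize_lr(img, scale=4):
    """Bicubic downsampling for SR training pairs (scale is illustrative)."""
    return F.interpolate(img, scale_factor=1.0 / scale, mode="bicubic",
                         align_corners=False)

def add_gaussian_noise(img, sigma=25):
    """Additive Gaussian noise on RGB values in [0, 255], noise level sigma."""
    return (img + sigma * torch.randn_like(img)).clamp(0.0, 255.0)

if __name__ == "__main__":
    clean = torch.rand(1, 3, 256, 256) * 255.0
    lr = synthesize_lr(clean, scale=4)            # 256 -> 64
    noisy = add_gaussian_noise(clean, sigma=50)   # level-50 denoising input
    print(lr.shape, noisy.shape)
```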
We investigate three pre-training methods: single task, related tasks and unrelated tasks. (1) Single-task pre-training refers to training a single model for a specific task (e.g., one SR scale). (2) The second trains a single model for highly related tasks (e.g., SR at multiple scales), while (3) the last contains unrelated tasks (e.g., SR and level-15 denoising). Following prior work, we adopt a multi-encoder, multi-decoder, shared-body architecture for the latter two setups. Fine-tuning is then performed on a single task, where the model is initialized with the pre-trained task-specific encoder and decoder as well as the shared Transformer body.
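A minimal sketch of this multi-encoder, multi-decoder, shared-body arrangement is shown below; the task names, the single-convolution encoders/decoders and the identity body are placeholders for the actual EDT components.

```python
import torch
import torch.nn as nn

class MultiTaskEDTSketch(nn.Module):
    """Sketch of the multi-encoder, multi-decoder, shared-body setup used for
    multi-task pre-training. The encoders/decoders here are single convolutions
    purely for illustration."""
    def __init__(self, tasks, dim=64, body=None):
        super().__init__()
        self.encoders = nn.ModuleDict({t: nn.Conv2d(3, dim, 3, padding=1) for t in tasks})
        self.decoders = nn.ModuleDict({t: nn.Conv2d(dim, 3, 3, padding=1) for t in tasks})
        self.body = body if body is not None else nn.Identity()  # shared across tasks

    def forward(self, x, task):
        feat = self.encoders[task](x)
        feat = self.body(feat)
        return self.decoders[task](feat)

if __name__ == "__main__":
    model = MultiTaskEDTSketch(tasks=["sr_x2", "sr_x3", "sr_x4"])  # hypothetical task names
    out = model(torch.randn(2, 3, 48, 48), task="sr_x2")
    # After pre-training, fine-tuning keeps the shared body plus one
    # task-specific encoder/decoder pair and drops the rest.
    print(out.shape)
```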
We introduce centered kernel alignment (CKA) [26, 8, 46] to study the representation similarity of network hidden layers, supporting quantitative comparisons within and across networks. In detail, given $m$ data points, we calculate the activations $X \in \mathbb{R}^{m \times p_1}$ and $Y \in \mathbb{R}^{m \times p_2}$ of two layers with $p_1$ and $p_2$ neurons, respectively. We use the Gram matrices $K = XX^{\top}$ and $L = YY^{\top}$ to compute CKA:

$$\mathrm{CKA}(K, L) = \frac{\mathrm{HSIC}(K, L)}{\sqrt{\mathrm{HSIC}(K, K)\,\mathrm{HSIC}(L, L)}},$$

where HSIC is the Hilbert-Schmidt independence criterion. Given the centering matrix $H = I_m - \frac{1}{m}\mathbf{1}\mathbf{1}^{\top}$, $K' = HKH$ and $L' = HLH$ are the centered Gram matrices, and we have $\mathrm{HSIC}(K, L) = \mathrm{vec}(K') \cdot \mathrm{vec}(L') / (m-1)^2$. Thanks to the properties of CKA, invariance to orthogonal transformation and isotropic scaling, we are able to conduct a meaningful analysis of neural network representations. However, naive computation of CKA requires keeping the activations of the entire dataset in memory, which is memory-intensive. To avoid this, we use minibatch estimators of CKA with a minibatch size of 300, iterating over the test dataset 10 times.
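For reference, a minimal sketch of linear CKA on a single minibatch is given below. It uses the identity that, with centered features, HSIC of linear Gram matrices reduces to squared Frobenius norms of cross- and self-covariances; the averaging over minibatches and any unbiased-HSIC correction are omitted here.

```python
import torch

def linear_cka(x, y):
    """Linear CKA between two activation matrices x: (m, p1), y: (m, p2).

    Plain (biased) estimator on one minibatch; the paper averages minibatch
    estimates over 10 passes of the test set."""
    x = x - x.mean(dim=0, keepdim=True)   # center features
    y = y - y.mean(dim=0, keepdim=True)
    hsic_xy = (x.T @ y).pow(2).sum()      # ||X^T Y||_F^2
    hsic_xx = (x.T @ x).pow(2).sum()      # ||X^T X||_F^2
    hsic_yy = (y.T @ y).pow(2).sum()      # ||Y^T Y||_F^2
    return hsic_xy / (hsic_xx.sqrt() * hsic_yy.sqrt())

if __name__ == "__main__":
    m = 300                        # minibatch size used in the paper
    a = torch.randn(m, 128)        # e.g. flattened activations of one layer
    b = a @ torch.randn(128, 64)   # a linearly related layer -> high CKA
    print(float(linear_cka(a, b)))
```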
We begin our investigation by studying the internal representation structure of our models. How are representations propagated within models of different sizes in low-level tasks? To answer this intriguing question, we compute CKA similarities between every pair of layers within a model. Apart from the convolutional head and tail, we include outputs of attention and FFN after residual connections in the Transformer body.
For the SR task in Fig. 4 (a)-(b), we find roughly three block structures, within each of which a range of Transformer layers are highly similar (also found in ViTs). The first block structure (from left to right) corresponds to the model head, which transforms degraded images into tokens for the Transformer body. The second and third block structures account for the Transformer body, and the proportion of the third part is positively correlated with the model size, which partly reflects the redundancy of the model (we show the larger EDT-L in Fig. F.6 of the supplementary). For the denoising task (Fig. 4 (c)), there are only two obvious block structures, where the second one (the Transformer body) dominates. Finally, from the cross-model comparison in Fig. 4 (d) and (h), we find higher similarity between the denoising body layers and the second-block SR layers, while they differ significantly from the third-block SR layers.
Further, we explore the impact of single-task pre-training on the internal representations. For SR in Fig. 4 (e)-(f), the representation of the model head remains basically unchanged across different model sizes (EDT-S and EDT-B). Meanwhile, we observe more changes in the third-block representations than in those of the second block after pre-training. In terms of denoising, as shown in Fig. 4 (g), the internal representations change little, consistent with the finding in Table 5 that denoising tasks obtain limited improvements.
Key Findings: (1) SR models show clear stages in the internal representations and the proportion of each stage varies with the model size, while the denoising model presents a relatively uniform structure; (2) the denoising model layers show more similarity to the lower layers of SR models, containing more local information, as verified in Sec. 3.4; (3) single-task pre-training mainly affects the higher layers of SR models but has limited impact on the denoising model.
In the previous section, we observe that the Transformer body of SR models is clearly composed of two block structures and pre-training mainly changes the representations of higher layers. What is the difference between these two partitions? How does the pre-training, especially multi-task pre-training, affect the behaviors of models?
We conjecture that one possible cause of the partition is the differing ability of layers to incorporate local or global information. We start by analyzing self-attention layers, since they dynamically aggregate information from other spatial locations, unlike the fixed receptive field of the FFN layer. To compute the horizontal or vertical attention mean distance, we average the pixel distances between queries and keys, weighted by the attention weights, for each head over 170,000 data points. We then average the horizontal and vertical attention mean distances to obtain the final result, where a larger distance usually indicates that more global information is used. Note that we do not record attention distances of shifted local windows, because the shift operation narrows down the boundary windows and hence cannot reflect real attention distances.
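Before turning to the results, the per-window part of this computation can be sketched as follows; the window size and head count are arbitrary, and the averaging over heads, directions and the 170,000 data points described above is left out.

```python
import torch

def attention_mean_distance(attn, win_h, win_w):
    """Mean attention distance (in pixels) inside a win_h x win_w window.

    attn: attention weights of shape (..., L, L) with L = win_h * win_w,
    rows indexed by query and summing to 1. Returns the attention-weighted
    average Euclidean query-key distance, averaged over queries (and any
    leading batch/head dimensions)."""
    ys, xs = torch.meshgrid(torch.arange(win_h), torch.arange(win_w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (L, 2)
    dist = torch.cdist(coords, coords)        # (L, L) pairwise pixel distances
    return (attn * dist).sum(-1).mean()

if __name__ == "__main__":
    L = 4 * 16                                # hypothetical 4x16 window
    attn = torch.randn(8, L, L).softmax(-1)   # e.g. 8 heads of random weights
    print(float(attention_mean_distance(attn, 4, 16)))
```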
In the second block structure (Fig. 5), the standard deviation of attention distances (shown as the blue area) is large and the mean value is small, indicating that the attention modules in this block structure contain a mix of local heads (relatively small distances) and global heads (relatively large distances). On the contrary, the third block structure only contains global heads, showing that more global information is aggregated in this stage.
Compared to single-task SR pre-training (Fig. 5 (b) and (f)), the multi-related-task setup (multi-scale SR, Fig. 5 (c) and (g)) converts more global representations (in the red box) of the third block structure into local ones, enlarging the scope of the second block structure. In consequence, as shown in Fig. 6, we observe significant PSNR improvements on all benchmarks. When replacing one of the SR tasks with unrelated level-15 denoising (Fig. 5 (d) and (h)), we observe fewer changes in the global representations, resulting in a performance drop in Fig. 6. This is mainly due to the representation mismatch between unrelated tasks, as mentioned in Sec. 3.3. More quantitative comparisons are provided in Sec. J.
Key Findings: (1) the representations of SR models contain more local information in early layers and more global information in higher layers; (2) all three pre-training methods can greatly improve the performance by introducing different degrees of local information, which acts as a kind of inductive bias, to the intermediate layers of the model, among which multi-related-task pre-training performs best.
Furthermore, in low-level tasks, we find that simply increasing the data scale may not be the optimal option, as verified in our experiments using different data scales (see Sec. E in the supplementary materials). In contrast, multi-task pre-training is more effective and data-efficient.
In this section, we conduct an ablation study on window size and provide quantitative results for super-resolution (SR), denoising and deraining. The experimental settings and visual comparisons are given in the supplementary file.
To evaluate the effectiveness of the proposed shifted crossed local attention, we conduct experiments to analyze the effect of different window sizes on final performance. As shown in Table 2, simply increasing the square window size from 8 to 12 largely improves the results on five SR benchmarks, verifying the importance of a large receptive field under the current architecture. Further evidence comes from the comparison between (4, 16) and (8, 16), where a larger short side achieves superior performance. Besides, for windows with the same area, (6, 24) is better than (12, 12), indicating that our shifted crossed window attention is an effective way of quickly enlarging the receptive field to improve performance. This claim is also supported by Table 4, where our lightweight SR models (EDT-T) yield significant improvements on the high-resolution benchmark datasets Urban100 and Manga109. Further, we use LAM, which represents the range of information utilization, to visualize receptive fields. Fig. 7 shows that our model takes advantage of a wider range of information than SwinIR, restoring more details.
For the super-resolution (SR) task, we test our models in two settings, classical SR and lightweight SR, where the latter generally refers to models with roughly 1M parameters.
Classical SR. We compare our EDT with state-of-the-art CNN-based as well as Transformer-based methods. As shown in Table 3, our method outperforms all competing algorithms by a large margin on all three scales under comparable model capacity. In particular, the PSNR improvements reach 0.46dB and 0.45dB on the high-resolution benchmarks Urban100 and Manga109. Even without pre-training, our EDT-B still stands out as the best, achieving nearly 0.1dB gains on multiple datasets over the second-best SwinIR under a fair training setting (both trained from scratch).
Lightweight SR. Among lightweight SR algorithms, our model achieves the best results on all benchmark datasets. Although SwinIR uses larger training patches than ours and adopts a well-trained model as initialization for other scales, our EDT-T (without pre-training) still obtains nearly 0.2dB and 0.4dB improvements on Urban100 and Manga109 across all scales, demonstrating the superiority of our architecture.
In Table 5, apart from a range of denoising methods, we present our three models: (1) EDT-B without pre-training; (2) EDT-B with pre-training; (3) EDT-B without downsampling and pre-training.
It is worth noting that, unlike SR models, which benefit greatly from pre-training, denoising models only achieve 0.02-0.11dB gains. One possible reason is that we use a large training dataset for denoising, which already provides sufficient data to bring the capacity of our models into full play. On the other hand, pre-training hardly affects the internal feature representations of the models, as discussed in Sec. 3.3. Therefore, we suggest that the Gaussian denoising task may not need a large amount of training data.
Besides, we find that our encoder-decoder-based framework performs well on high noise levels (e.g., level 50), while yielding slightly inferior performance on low noise levels (e.g., level 15). This could be caused by the downsampling operation in EDT. To verify this assumption, we train another EDT-B model without downsampling. As shown in Table 5, it indeed obtains better performance on low noise levels. Nonetheless, we suggest that the proposed EDT model is still a good choice for denoising tasks, since it strikes a sweet spot between performance and computational complexity. For example, the FLOPs of EDT-B (38G) are only 8.4% of SwinIR (451G).
We also evaluate the performance of the proposed EDT on two benchmark datasets, Rain100L and Rain100H, which correspond to light and heavy rain streaks. As illustrated in Table 6, although the model size of our EDT-B for deraining (11.5M) is far smaller than IPT (116M), it still outperforms IPT by nearly 0.9dB in the light rain setting. Meanwhile, our model achieves significantly superior results with a 2.97dB gain in the heavy rain setting over the second-best RCDNet, supporting that EDT performs well for restoration tasks with heavy degradation.
In this paper, we mainly investigate the internal representations of the proposed EDT and the effect of pre-training on synthesized data. Future research should explore real-world settings and more tasks, and could be further extended to video processing. Also, despite the high efficiency of our encoder-decoder design, a limitation remains in that we adopt a fixed downsampling strategy. As shown in Sec. 4.3, it may be a better choice to conduct adaptive downsampling based on the degradation degree of low-quality images.
Based on the proposed encoder-decoder-based Transformer, which shows high efficiency and strong performance, we perform an in-depth analysis of image pre-training in low-level vision. We find that pre-training plays a central role in developing stronger intermediate representations by incorporating more local information. We also find that the effect of pre-training is task-specific, leading to significant improvements on SR but only minor gains on denoising. Lastly, we suggest that multi-task pre-training exhibits great potential in mining image priors, and is far more efficient than using larger pre-training datasets. A further study of pre-training with respect to model size, data scale and ConvNets is provided in the supplementary materials.
The proposed encoder-decoder-based Transformer (EDT) in Fig. 2 is composed of a convolutional encoder and decoder as well as a Transformer body. It processes low-resolution (e.g., super-resolution) and high-resolution (e.g., denoising) inputs using different encoders and decoders, where the high-resolution path involves additional downsampling and upsampling operations, as shown in Fig. B.1. This design enables the Transformer body to model long-range relations at a low resolution, and is thus computationally efficient. The body consists of multiple stages of Transformer blocks, with a global connection in each stage. We show the structure of a Transformer block in Fig. B.2, where the (shifted) crossed local attention and anti-blocking FFN are detailed in Sec. 2.2. We design four variants of EDT and the corresponding configurations are detailed in Table 1.
As illustrated in Sec. 3.1, unless specified otherwise, we only use 200K images from ImageNet for pre-training. In Sec. E, we further discuss the effect of data scale on pre-training. For fine-tuning, the datasets for super-resolution (SR), denoising and deraining are described as follows.
Super-Resolution. The models are trained on 800 images of DIV2K and 2650 images of Flickr2K. The evaluation benchmarks consist of Set5, Set14, BSDS100, Urban100 and Manga109.
Denoising. Following [76, 68, 31], we use the combination of 800 DIV2K images, 2650 Flickr2K images, 400 BSD500  images and 4744 WED  images as training data, and use CBSD68 , Kodak24 , McMaster  and Urban100 for evaluation.
All experiments are conducted on eight NVIDIA GeForce RTX 2080Ti GPUs, except that multi-task pre-training models and denoising models without downsampling are trained on eight NVIDIA V100 GPUs.
Pre-training. We train models with a batch size of 32 per task (over 8 GPUs) for 500K iterations. The initial learning rate is halved at fixed iteration milestones, and we adopt the Adam optimizer. The input patch sizes for SR and denoising/deraining are chosen such that different tasks share the same feature size in the body part.
Fine-tuning. We fine-tune the SR models for 500K iterations, the denoising models for 800K iterations and the deraining models for 200K iterations. Other settings remain the same as in pre-training.
Training from Scratch. We train SR models for 500K iterations, denoising models for 800K iterations and deraining models for 200K iterations, halving the learning rate at fixed milestones in each case. Other settings are the same as in pre-training.
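For illustration, the common PyTorch pattern corresponding to this recipe (Adam plus a step-wise halving schedule) is sketched below; the learning rate, betas and milestone iterations are placeholders, not the values used in the paper.

```python
import torch.nn as nn
from torch import optim

# Stand-in for an EDT model; all hyper-parameter values below are placeholders.
model = nn.Conv2d(3, 3, 3, padding=1)
optimizer = optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.99))
# Halve the learning rate at fixed iteration milestones (hypothetical values).
scheduler = optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[250_000, 400_000, 450_000, 475_000], gamma=0.5)

total_iters = 500_000                  # e.g. 500K iterations for SR
for step in range(total_iters):
    optimizer.step()                   # placeholder; a real loop computes a loss first
    scheduler.step()                   # advance the step-wise schedule each iteration
```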
As shown in the first row of Fig. B.3, the CKA maps of deraining models mainly consist of three block structures, just like the SR models discussed in Sec. 3.3. Pre-training makes some attention heads at higher layers attend more locally (see the attention modules in the red box), so some global representations in the third block are converted into the second, more local block. This supports our claim that pre-training introduces more local information as an inductive bias to the model, yielding better performance.
In this section, we investigate how the pre-training data scale affects super-resolution performance. As illustrated in Fig. C.4 and Table E.1, for the EDT-B model we observe incremental PSNR improvements on multiple SR benchmarks when increasing the data scale from 50K to 400K images during single-task pre-training. Note that we double the pre-training iterations (from 500K to 1M) for the 400K data scale so that the data can be fully exploited; however, the longer pre-training period largely increases the training burden.
On the contrary, as shown in Table E.1, multi-task pre-training (with 500K training iterations) successfully breaks through this limit. Our EDT-B model with multi-task pre-training on 200K images achieves new state-of-the-art results on all benchmarks, despite the smaller data scale. Thus, we suggest that multi-related-task pre-training is usually more effective and data-efficient in low-level vision tasks.
We conduct experiments to compare single-task pre-training across the four model variants on the SR task. As shown in Fig. E.5, we visualize the PSNR improvements of pre-trained models over their counterparts trained from scratch. We observe that models with larger capacities generally obtain larger improvements. In particular, pre-training still brings considerable gains to the already strong EDT-L model, showing the potential of pre-training in low-level tasks. The detailed quantitative results are provided in Table E.2.
As illustrated in Sec. 3.3, there are roughly three block structures in the CKA maps of SR models. Here we also visualize the CKA map of the EDT-L model in Fig. F.6. Compared with EDT-B, the third-block representations of EDT-L account for the vast majority and show high similarities, which somewhat reflects the redundancy of the EDT-L model.
We further explore the relationship between the internal representations of EDT and CNN-based models (RRDB and RCAN), and the superiority of our Transformer architecture over ConvNets. For RRDB and RCAN, apart from the head and tail, we use the outputs of blocks, e.g., residual dense blocks in RRDB and residual channel attention blocks in RCAN, to compute CKA similarities.
As shown in Fig. F.7, the early layers in EDT-B have more representations similar to those learned in RRDB and RCAN, which tend to be local as mentioned in Sec. 3.4, while higher layers in EDT-B incorporate more global information, showing clear differences compared to ConvNets. Fig. G.8 demonstrates that our EDT-B obtains comparable or greater improvements from pre-training, despite higher baselines and fewer parameters.
We visualize the kernels of the depth-wise convolutions in the anti-blocking feed-forward network (Anti-FFN). As shown in Fig. H.9, we observe a more uniform distribution of kernels at lower layers and diverse representations at higher layers. Most kernels at lower layers resemble low-pass filters, acting as anti-aliasing filters that avoid the possible blocking effect caused by window splitting in self-attention, which meets our expectations. The higher-layer kernels show high diversity, learning various local contexts. Without the proposed anti-blocking design, there is nearly a 0.1dB drop on both Urban100 and Manga109 in SR, verifying the necessity of this design.
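Such a figure can be reproduced in spirit with a short visualization routine like the one below; the layer chosen, the kernel size and the number of kernels shown are arbitrary, and the convolution here is a freshly initialized stand-in rather than a trained Anti-FFN layer.

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

def plot_dwconv_kernels(dwconv, n=8, path="kernels.png"):
    """Visualize the first n depth-wise convolution kernels of a layer.

    dwconv is expected to be an nn.Conv2d with groups == channels, so its
    weight has shape (C, 1, k, k); each slice is one per-channel kernel."""
    w = dwconv.weight.detach().cpu()[:n, 0]          # (n, k, k)
    fig, axes = plt.subplots(1, n, figsize=(2 * n, 2))
    for ax, kernel in zip(axes, w):
        ax.imshow(kernel.numpy(), cmap="viridis")
        ax.axis("off")
    fig.savefig(path, bbox_inches="tight")

if __name__ == "__main__":
    dw = nn.Conv2d(64, 64, 7, padding=3, groups=64)  # hypothetical Anti-FFN conv
    plot_dwconv_kernels(dw, n=8)
```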
Considering that our models are trained in two stages, pre-training on ImageNet and fine-tuning on the target dataset, we also study the differences in model representations before and after fine-tuning. As shown in Fig. H.10 (d)-(e), we find that fine-tuning mainly changes the higher-layer representations of the SR model but has little effect on the denoising model, showing a similar phenomenon to the comparison between models without and with pre-training in Sec. 3.3. As for the deraining task, the model representations do not change much after fine-tuning (the similarities are almost all larger than 0.98), and the degrees of change in the second and third blocks remain at the same level.
As shown in Table J.4, multi-related-task pre-training leads to clearly larger improvements in PSNR (dB) and SSIM on all tasks, especially in SR and deraining. Considering that the multi-task setting enables the Transformer body to see more samples per iteration, we also conduct single-task pre-training with a large batch size in Table J.5. We observe that multi-related-task pre-training remains more effective and efficient, while additionally providing initialization for multiple tasks.