In recent years, self-attention models have attracted wide attention owing to their promising performance in many visual tasks. At NIPS 2017, Vaswani et al. proposed the Transformer structure, which was originally designed for NLP tasks: BERT uses the Transformer as an encoder, GPT uses the Transformer as a decoder, and Transformer-XL addresses the problem of long sequences. It is also widely used in recommender systems, for example BST for behavioral sequence modeling, AutoInt for feature combination in CTR (Click-Through-Rate) prediction models, and the re-ranking model PRM. Recently, in computer vision, many methods have demonstrated that the Transformer can obtain promising performance, such as ViT for image classification, DETR for object detection, SETR for semantic segmentation, Point Transformer for 3D point cloud processing, and TransGAN for image generation.
The seminal work ViT shows that a pure Transformer network can achieve state-of-the-art (SOTA) results in image classification. Specifically, ViT splits the input into 14×14 or 16×16 patches. Each patch is flattened into a vector that acts as a token for the Transformer. ViT is the first fully-transformer model to extract image features, but it still suffers from the following limitations:
1) ViT requires a huge dataset, such as JFT-300M, for pre-training in order to better explore the relationships between pixels. It cannot obtain satisfactory results when trained on a mid-sized dataset such as ImageNet.
2) The Transformer structure in ViT extracts non-local interactions from the entire image, sacrificing the ability to learn local patterns, including texture and edges. We believe that local patterns are essential for visual recognition tasks, as demonstrated by existing CNN techniques. In principle, a CNN extracts various shallow features and then obtains semantic information through multi-layer nonlinear stacking. Consequently, the features obtained by ViT are ill-suited to generating high-resolution images, which require more low-level representations.
Based on this, we design a new fully-transformer model to address the above issues:
1) To extract low-level visual representations, we divide the image into several patches of appropriate size. For each patch, the Transformer is used to calculate the global correlation features of all pixels, from which the source image can be reconstructed. We thus obtain a set of low-level features of the image that share similar characteristics. We term this module the "Patch Transformer".
2) To explicitly extract the global patterns of the image, we design a down-sampling pyramid to achieve local-to-global perception. Specifically, we downsample the input image once, so that its size becomes a quarter of the original, and apply the Patch Transformer to the down-sampled image. We repeat the downsampling and the Patch Transformer until the image is the same size as the patch. The features obtained at the different scales are then up-sampled to the original size. We name this operation the "Pyramid Transformer".
Based on the Patch Transformer and Pyramid Transformer, we develop a Pyramid Patch Transformer (PPT) image feature extraction model. The PPT model can fully extract the features of the input image, including local context and global saliency. We further design a reconstruction auto-encoder network to ensure that the extracted features can reconstruct the input image.
To verify the effectiveness of the proposed PPT model, we apply this feature extraction approach to the image fusion task. The input images are obtained by different kinds of sensors, such as infrared and visible light images or medical CT and X-ray images, or they differ in focus or exposure. These multi-source images reflect different physical attributes of the same scene. An example of the fusion of a visible light image and an infrared image is shown in Fig. 1.
Image fusion approaches based on deep learning can be divided into two categories: 1) Auto-encoder-based methods use an encoder to extract features into a latent space for feature fusion, and then feed the fused features to a decoder to obtain the fused image [35, 19, 17, 7]. 2) End-to-end fusion networks design a suitable structure and loss function to realize end-to-end image generation [28, 27, 6].
The Pyramid Patch Transformer model can extract a variety of features from the input image. We use the PPT model as the feature extraction network and design a feature decoder for feature compression and image reconstruction. In summary, our contributions are three-fold:
An improved Patch Transformer to extract low-level image representations without loss of resolution, integrating the interactions among raw pixels.
A novel Pyramid Transformer to reflect the global relationships of multi-scale patches, achieving local-to-global perception.
A new Pyramid Patch Transformer as a general feature extraction module, which is successfully applied to image fusion tasks with superior performance against the state-of-the-art methods.
II Related Work
A Transformer encoder is composed of a multi-head self-attention (MSA) layer and a Multi-Layer Perceptron (MLP) block. Layer Norm (LN) is applied before each MSA and MLP layer, and each is wrapped in a residual connection. An essential design of the Transformer is that each input vector is combined with a position embedding to preserve its localization clues. To apply the basic Transformer to visual tasks, ViT splits the input image $x \in \mathbb{R}^{H \times W \times C}$ into a sequence of patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ and $C$ are the resolution and number of channels, $P$ is the width / height of a patch, and $N = HW / P^2$. ViT maps these patches to $D$-dimensional features in a forward pass through the network. The output at the class-token position is the classification result.
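For reference, the forward pass of ViT can be written as below, following the notation of the ViT paper rather than anything defined in this text. Here $x_p^i$ is the $i$-th flattened patch, $N$ the number of patches, $P$ the patch side length, $C$ the number of channels, $D$ the feature dimension, $E$ the patch projection, $E_{pos}$ the position embedding, $x_{class}$ the class token, and $L$ the number of encoder layers.

```latex
\begin{aligned}
z_0  &= [x_{class};\, x_p^1 E;\, x_p^2 E;\, \dots;\, x_p^N E] + E_{pos},
      && E \in \mathbb{R}^{(P^2 \cdot C) \times D},\; E_{pos} \in \mathbb{R}^{(N+1) \times D} \\
z'_l &= \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, && l = 1, \dots, L \\
z_l  &= \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l,       && l = 1, \dots, L \\
y    &= \mathrm{LN}(z_L^0)
\end{aligned}
```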
II-B Image Fusion
At ICCV 2017, Prabhakar et al. proposed the DeepFuse approach, introducing the auto-encoder structure for the multi-exposure image fusion task. Specifically, DeepFuse trains an auto-encoder network whose encoder extracts features from the images. After an addition fusion strategy is applied in the middle layer, the fused features are input to the decoder to obtain the fused image. Similar structures are further developed by DenseFuse and IFCNN. These auto-encoder-based image fusion methods share the same general steps: feature extraction by the encoder, feature fusion, and image reconstruction by the decoder.
III Pyramid Patch Transformer
For high-resolution images, most existing Transformer approaches split the image into several patches and map each patch to one vector. The resulting features are therefore multi-channel features of the original image after a special down-sampling and non-linear transformation. However, because the original image information is mapped into a low-dimensional feature space with semantic meaning and discrimination, it is difficult to reconstruct the original image from the obtained Transformer features.
To reflect pixel details, a straightforward solution is to set the patch size to 1. This means that the Transformer is performed on the image at its original resolution, which causes resource problems. For example, if a Transformer is applied to an image with $H \times W$ pixels, at least one attention matrix with $(HW)^2$ entries is generated, which requires a huge amount of memory. To overcome this problem, we propose the Pyramid Patch Transformer, a fully-transformer network framework for image feature extraction.
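The memory argument can be made concrete with a little arithmetic. The sketch below counts the bytes of a single tokens-by-tokens attention matrix; the 256×256 image size, the patch size of 16, and 4-byte floats are illustrative assumptions, not values taken from the paper.

```python
def attention_matrix_bytes(height, width, patch=1, bytes_per_value=4):
    """Bytes needed for one (tokens x tokens) self-attention matrix.

    Each non-overlapping patch of side `patch` is one token, so patch=1
    means one token per pixel.
    """
    tokens = (height // patch) * (width // patch)
    return tokens * tokens * bytes_per_value


# Hypothetical example sizes (assumptions for illustration only).
per_pixel = attention_matrix_bytes(256, 256, patch=1)    # one token per pixel
per_patch = attention_matrix_bytes(256, 256, patch=16)   # one token per 16x16 patch

print(per_pixel / 2**30, "GiB")   # 16.0 GiB
print(per_patch / 2**10, "KiB")   # 256.0 KiB
```

Under these assumptions, a single attention head over all pixels already needs 16 GiB, whereas patch-level tokens need only a few hundred KiB, which is the gap the proposed design tries to bridge.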
III-A Patch Transformer
The Patch Transformer module is designed to alleviate the memory consumption caused by general Transformers that must handle an excessive number of tokens when processing high-resolution images. Each Patch Transformer module contains three steps: (1) Trans to Patches, (2) Transformer, and (3) Reconstruct. The Patch Transformer process is shown in Fig. 2.
As shown in Fig. 2, the input image is first split into patches by a sliding window of the patch size. We name this operation T2P. This yields a sequence of patch features, and each patch feature is reshaped into a 1-D vector. To enrich the information, we use an MLP to increase the dimension of these vectors, which is equivalent to obtaining a sequence of patch features with the same spatial size and more channels.
We set a learnable position embedding vector and extend it to the same dimension as the patch features, which enables the network to learn the location clues among the embedding vectors. Combining the position embedding with the patch features gives the input features of the Transformer.
Then we apply the Transformer to each patch. The Transformer encoder module can be applied several times in the network. Each Transformer module consists of two parts, a multi-head self-attention (MSA) layer and a Multi-Layer Perceptron (MLP). A standard Transformer module with LayerNorm and residual structure is shown in Fig. 4.
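A standard pre-LayerNorm encoder block of this kind can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: biases and the learnable LayerNorm scale and shift are omitted, the weights are random, the GELU activation and the 4x hidden width of the MLP are assumptions, and all names (`encoder_block`, `params`, and so on) are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Normalize each token over its feature dimension (no learnable affine).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def multi_head_self_attention(x, w_qkv, w_out, num_heads):
    tokens, dim = x.shape
    head_dim = dim // num_heads
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)          # each (tokens, dim)

    def to_heads(t):                                    # -> (heads, tokens, head_dim)
        return t.reshape(tokens, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = to_heads(q), to_heads(k), to_heads(v)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim))  # (heads, tokens, tokens)
    out = (attn @ v).transpose(1, 0, 2).reshape(tokens, dim)      # merge heads
    return out @ w_out


def encoder_block(x, params, num_heads):
    # Residual MSA sub-layer, then residual MLP sub-layer, each preceded by LN.
    x = x + multi_head_self_attention(layer_norm(x), params["w_qkv"], params["w_out"], num_heads)
    hidden = gelu(layer_norm(x) @ params["w1"])
    return x + hidden @ params["w2"]


rng = np.random.default_rng(0)
dim, tokens, heads = 32, 64, 4          # hypothetical sizes for one patch
params = {
    "w_qkv": rng.standard_normal((dim, 3 * dim)) * 0.02,
    "w_out": rng.standard_normal((dim, dim)) * 0.02,
    "w1": rng.standard_normal((dim, 4 * dim)) * 0.02,
    "w2": rng.standard_normal((4 * dim, dim)) * 0.02,
}
x = rng.standard_normal((tokens, dim))  # tokens of one patch
y = encoder_block(x, params, heads)
print(y.shape)
```

Because every sub-layer is residual and shape-preserving, the block can be stacked several times without changing the token layout of a patch.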
We restore the patches to their original positions according to the order of the split. The corresponding output is a set of features mapped from the original image to the latent space.
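The split and restore steps can be sketched as a pair of inverse reshapes. The sketch below assumes non-overlapping windows (stride equal to the patch size), a row-major patch order, and image sides divisible by the patch size; the function names are hypothetical.

```python
import numpy as np


def trans_to_patches(image, patch):
    """Split (H, W, C) into (num_patches, patch*patch, C), patches in row-major order."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "sides must be divisible by the patch size"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch, c)


def reconstruct(patches, height, width, patch):
    """Inverse of trans_to_patches: put every patch back at its original position."""
    gh, gw = height // patch, width // patch
    c = patches.shape[-1]
    x = patches.reshape(gh, gw, patch, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(height, width, c)


image = np.arange(8 * 8).reshape(8, 8, 1)
patches = trans_to_patches(image, patch=4)
restored = reconstruct(patches, 8, 8, patch=4)
print(patches.shape, np.array_equal(image, restored))
```

Any per-patch operation that keeps the number of tokens unchanged (for example a Transformer encoder that only changes the channel count) can be applied between the two calls without losing spatial resolution.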
III-B Pyramid Transformer
Using the above Patch Transformer, each input image is split into several patches. The representations of each patch are related only to the pixels within the patch, without considering the long-range dependency between pixels across the entire image. To address this issue, we follow the multi-scale approach and construct a pyramid structure. First, the image is down-sampled once to obtain an image with half the width and height. The Patch Transformer is applied to the down-sampled image to obtain its representations, which are then up-sampled to the same size as the input image. We then proceed as follows:
1) Continue to downsample the image,
2) Use the Patch Transformer to extract representations,
3) Upsample the representations to the size of the input image.
Repeat the above operations recursively until the down-sampled image can be split into exactly one patch. Because each step halves the width and height, the total number of down-sampling steps is the base-2 logarithm of the ratio between the image side length and the patch side length. After concatenating the features at all scales, we obtain a set of multi-scale features, as shown in Fig. 3.
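The recursion above can be sketched as a short loop. This is only an illustration under stated assumptions: square images whose side is the patch side times a power of two, 2×2 average pooling for downsampling, nearest-neighbour upsampling, and a placeholder `patch_transformer` callable standing in for the Patch Transformer. Including the features of the original scale in the concatenation is also an assumption.

```python
import numpy as np


def downsample2(x):
    """Halve height and width of (H, W, C) with 2x2 average pooling."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))


def upsample_to(x, size):
    """Nearest-neighbour upsampling of a square (h, h, C) map to (size, size, C)."""
    factor = size // x.shape[0]
    return x.repeat(factor, axis=0).repeat(factor, axis=1)


def pyramid_features(image, patch, patch_transformer):
    """Return the concatenated multi-scale features and the number of downsampling steps."""
    size = image.shape[0]
    levels = [patch_transformer(image, patch)]          # original scale
    current = image
    while current.shape[0] > patch:                     # stop when the image is one patch
        current = downsample2(current)
        features = patch_transformer(current, patch)
        levels.append(upsample_to(features, size))
    return np.concatenate(levels, axis=-1), len(levels) - 1


def identity_patch_transformer(x, patch):
    # Placeholder: a real Patch Transformer would return learned features.
    return x


image = np.random.default_rng(0).random((64, 64, 1))
features, steps = pyramid_features(image, patch=16, patch_transformer=identity_patch_transformer)
print(features.shape, steps)   # (64, 64, 3) 2
```

With a 64-pixel side and a 16-pixel patch, the loop runs log2(64 / 16) = 2 times, which matches the count described in the text.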
III-C Transformer Receptive Field
In deep learning, CNNs perform well in computer vision. One important reason is the receptive field of the CNN, which effectively captures local features in the image. As the convolution kernel grows or the convolutional layers deepen, each cell of the feature maps reflects a correspondingly larger spatial region in the original image.
As shown in Fig. 5 (a), when a Patch Transformer is performed on the image, each pixel of its feature is associated with all the pixels of the entire patch. The patch size can therefore be regarded as the receptive field of this Patch Transformer. After one downsampling, the receptive field becomes four times larger, because both the length and width of the image are halved.
As the pyramid deepens, the range of associated pixels expands from the local area to the whole image, and the receptive field of the Patch Transformer grows accordingly. In particular, pixels that are close to each other are more strongly correlated, while distant pixels retain a weak long-range dependence. At the bottom Patch Transformer layer of the Pyramid Transformer, the receptive field covers the whole image, as shown in Fig. 5 (b). The continuous down-sampling is designed to obtain this large receptive field. A large receptive field on the original image captures more large-scale or global semantic features with less detail, while the upper layers of the Pyramid Transformer capture low-level details. Therefore, we believe that the Pyramid Transformer can extract shallow and semantic information simultaneously.
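The growth of the receptive field across pyramid levels is simple to tabulate. The sketch below assumes each level halves the image side, so a fixed patch covers twice the side (four times the area) of the original image per level; the patch size of 16 and image side of 256 are hypothetical example values.

```python
def receptive_field_side(patch, level):
    """Side length, in original-image pixels, covered by one patch after `level` 2x downsamplings."""
    return patch * 2 ** level


patch, image_side = 16, 256   # hypothetical example values
sides = []
level = 0
while receptive_field_side(patch, level) <= image_side:
    sides.append(receptive_field_side(patch, level))
    level += 1

print(sides)   # [16, 32, 64, 128, 256]
```

At the last level the covered side equals the image side, so the receptive field spans the whole image.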
III-D Network Architecture
We design an auto-encoder network for image reconstruction, as shown in Fig. 6. The encoder is composed of the Pyramid Transformer and the Patch Transformer. From the set of multi-scale features produced by the encoder, the decoder generates the reconstructed image.
We use the mean square error (MSE) loss function as the reconstruction loss for the network.
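Written out, a standard form of this reconstruction loss is given below. The symbols are assumptions introduced here for illustration: $I$ is the input image, $\hat{I}$ the reconstructed image, and $H$, $W$, $C$ the height, width, and number of channels.

```latex
\mathcal{L}_{\mathrm{MSE}} = \frac{1}{HWC} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{c=1}^{C}
\left( I_{i,j,c} - \hat{I}_{i,j,c} \right)^2
```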
III-E Feature Visualization
As shown in Fig. 7, we visualize the features extracted by the PPT module, selecting features from three different receptive fields in the Pyramid Transformer. In the first row, with the smallest receptive field, the features represent low-level information such as the edge contours and color distribution of the image. In the third row, with the largest receptive field, the features highlight the regions of the related objects, reflecting the semantic relations among pixels.
IV Pyramid Patch Transformer for Image Fusion
The primary purpose of the image fusion task is to generate a fusion image that contains as much useful information as possible from the two source images. We use the designed PPT module to extract the image features for image fusion tasks.
IV-A Fusion Network Architecture
We take the fusion of an infrared image and a visible light image as an example. We input the visible light image and the infrared image into the pre-trained PPT encoder module to obtain their respective features. The PPT encoder can map any image to a high-dimensional feature space. These features represent the input image from different angles, such as edge, texture, color distribution, and semantic information. As we use a Siamese structure with the same PPT encoder module to extract features, the two sets of features are mapped to the same feature space. We can therefore easily perform fusion operations on them across the channel dimension to obtain a new fused feature representation.
IV-B Fusion Strategy
For different image fusion tasks, we choose different fusion strategies. All fusion strategies operate at the pixel level of features, as shown in Fig. 8.
For the fusion of infrared and visible light images, we believe that neither image is obviously favored in feature selection. We therefore use the average strategy, taking the element-wise mean of the two features as the fused feature.
For multi-focus image fusion, as the focus of the images differs, the features of the focused area are more prominent than those of the unfocused area. We believe that the fused pixel should be the more prominent one. Therefore, we adopt the maximum value strategy, taking the element-wise maximum of the two features.
In addition to these two common fusion strategies, we propose a Softmax strategy, which can be used for multiple image fusion tasks at the same time. To adaptively trade off the significance of the two input images, Softmax is employed to fuse the two features.
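The three strategies can be sketched as element-wise operations on two feature arrays of equal shape. The average and maximum rules follow directly from their names. The exact form of the Softmax strategy is not recoverable from this text, so `fuse_softmax` below is an assumed form: a softmax over the two feature values at each position gives the weights of a weighted sum. All function names are hypothetical.

```python
import numpy as np


def fuse_average(f1, f2):
    """Average strategy: element-wise mean of the two features."""
    return 0.5 * (f1 + f2)


def fuse_max(f1, f2):
    """Maximum value strategy: element-wise maximum of the two features."""
    return np.maximum(f1, f2)


def fuse_softmax(f1, f2):
    """Assumed Softmax strategy: per-element softmax weights over the two features."""
    stacked = np.stack([f1, f2])                                   # (2, ...)
    shifted = stacked - stacked.max(axis=0, keepdims=True)         # numerical stability
    weights = np.exp(shifted)
    weights = weights / weights.sum(axis=0, keepdims=True)         # weights sum to 1 per element
    return (weights * stacked).sum(axis=0)


f1 = np.array([[1.0, 4.0]])
f2 = np.array([[3.0, 4.0]])
avg, mx, sm = fuse_average(f1, f2), fuse_max(f1, f2), fuse_softmax(f1, f2)
print(avg, mx, sm)
```

Under this assumed form, the softmax result always lies between the average and the maximum: it approaches the maximum where one feature clearly dominates and equals the average where the two features agree.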
V-A Datasets and Implementation
We perform experiments on two image fusion tasks: multi-focus image fusion, and infrared and visible light image fusion.
For the infrared and visible light image fusion task, we use the TNO dataset and the RoadScene dataset. For the RoadScene dataset, we convert the images to grayscale to keep the channels of the visible light images consistent with those of the infrared images. For the multi-focus image fusion task, we use the Lytro dataset. The Lytro images are split along the RGB channels to obtain three pairs of images, and the fusion results are merged back along the RGB channels to obtain the fused image.
As the network input has a fixed size, we split the input image into several patches with a sliding window of that size, filling the insufficient area with the value 128 (the pixel range is 0 to 255). After fusing each patch pair, the final fused image is obtained by splicing the results according to the order of the patch split.
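The tiling and splicing procedure can be sketched as follows. The sketch assumes single-channel images, non-overlapping windows, and padding on the bottom and right edges; the text does not specify where the fill is placed, so this placement is an assumption, and the function names are hypothetical.

```python
import numpy as np


def split_with_padding(image, window, fill=128):
    """Pad a (H, W) image to a multiple of `window` with `fill`, then cut it into tiles."""
    h, w = image.shape
    pad_h, pad_w = -h % window, -w % window
    padded = np.pad(image, ((0, pad_h), (0, pad_w)), mode="constant", constant_values=fill)
    tiles = [
        padded[r:r + window, c:c + window]
        for r in range(0, padded.shape[0], window)
        for c in range(0, padded.shape[1], window)
    ]
    return tiles, padded.shape


def splice(tiles, padded_shape, original_shape, window):
    """Reassemble tiles in split order and crop the padding away."""
    out = np.empty(padded_shape, dtype=tiles[0].dtype)
    index = 0
    for r in range(0, padded_shape[0], window):
        for c in range(0, padded_shape[1], window):
            out[r:r + window, c:c + window] = tiles[index]
            index += 1
    return out[:original_shape[0], :original_shape[1]]


image = np.arange(35, dtype=np.uint8).reshape(5, 7)
tiles, padded_shape = split_with_padding(image, window=4)
restored = splice(tiles, padded_shape, image.shape, window=4)
print(len(tiles), padded_shape, np.array_equal(restored, image))
```

In a fusion pipeline, each pair of corresponding tiles from the two source images would be fused before `splice` is called on the fused tiles.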
V-B Experimental Settings
The network takes fixed-size images as input and uses a fixed patch size. We use the Adam optimizer with a learning rate of 1e-4 and a batch size of 1. The network is trained for 50 epochs in total. The experiments are performed on an NVIDIA GeForce GTX 1080 GPU and a 3.60 GHz Intel Core i7-6850K CPU with 64 GB of memory.
V-C Quantitative Analysis
V-C1 Visible and Infrared Image Fusion
We compare PPT Fusion with eighteen state-of-the-art methods: Cross Bilateral Filter fusion (CBF), Curvelet Transform (CVT), Dual-Tree Complex Wavelet Transform (DTCWT), Gradient Transfer (GTF), Multi-resolution Singular Value Decomposition (MSVD), Ratio of Low-pass Pyramid (RP), DeepFuse, DenseFuse, FusionGAN, IFCNN, MDLatLRR, DDcGAN, ResNetFusion, NestFuse, FusionDN, HybridMSD, PMGI, and U2Fusion.
As shown in Fig. 9, we report the results of all approaches and highlight some specific local areas. The fusion result of PPT Fusion retains the necessary thermal radiation information of the person. The global semantic features of our result are also more distinct, as seen in the contrast between the sky and the house. In the highlighted branch regions, our result preserves more details from both the visible light image and the infrared image.
We use six related indicators to quantitatively evaluate the fusion quality: the Sum of the Correlations of Differences (SCD), Structural SIMilarity (SSIM), pixel feature mutual information, a second image-similarity measure, a measure of the ratio of noise added to the final image, and the correlation coefficient (CC). SCD and CC calculate the correlation coefficients between images. SSIM and the second similarity measure calculate the similarity between images. The pixel feature mutual information calculates the mutual information between features. Among them, the lower the noise ratio and the higher the other values, the better the fusion quality of the approach.
As shown in Table I, the best value for each indicator is set in bold red italic font, and the second-best value in bold black italic font. PPT Fusion ranks in the top two on multiple indicators and outperforms most methods on the others. This demonstrates that PPT Fusion maintains an effective structural similarity with the source images and preserves a large information correlation with them, without introducing noise or artifacts.
V-C2 Multi-focus Image Fusion
We compare PPT Fusion with five state-of-the-art methods, namely Guided Filtering Fusion (GFF), Laplace Pyramid Sparse Representation (LPSR), MFCNN, DenseFuse, and IFCNN, as shown in Fig. 10.
Building on five of the previous indicators, including SCD, SSIM, and CC, we add four additional indicators, namely Entropy (EN), Visual Information Fidelity (VIFF), Cross Entropy, and Mutual Information (MI). EN measures the amount of information. VIFF measures the loss of image information caused by the distortion process. Cross Entropy and MI measure the degree of information correlation between images. Among them, the lower the Cross Entropy and the higher the other values, the better the fusion quality of the approach.
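Two of the simpler indicators can be sketched directly. The code below computes the Pearson correlation coefficient between two images and the Shannon entropy of an 8-bit grayscale histogram. Averaging the correlations of the fused image with each source image is a common convention and is an assumption here, since the exact evaluation code of the paper is not shown; the function names are hypothetical.

```python
import numpy as np


def correlation(a, b):
    """Pearson correlation coefficient between two images of equal shape."""
    a = a.astype(np.float64).ravel()
    b = b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))


def fusion_cc(source1, source2, fused):
    """Assumed CC convention: mean correlation of the fused image with both sources."""
    return 0.5 * (correlation(source1, fused) + correlation(source2, fused))


def entropy(image, levels=256):
    """Shannon entropy (bits) of the gray-level histogram of an integer image."""
    hist = np.bincount(image.ravel(), minlength=levels).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]                       # skip empty bins so log2 is defined
    return float(-(p * np.log2(p)).sum())


img = np.array([[0, 0], [255, 255]], dtype=np.uint8)
inverse = 255 - img
print(entropy(img))                    # 1.0: two equally likely gray levels
print(correlation(img, img))           # 1.0
print(fusion_cc(img, inverse, img))    # 0.0
```

A fused image that copies one source and ignores the other scores only 0.0 on this toy pair, which illustrates why the indicator rewards results that correlate with both inputs.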
From Table II, we can see that PPT Fusion ranks in the top two on all indicators. This shows that the fused image of PPT Fusion effectively extracts the source details while keeping the generated image sufficiently clear.
In this study, we propose a fully-transformer feature extraction module, termed the Pyramid Patch Transformer (PPT) module. First, the proposed Patch Transformer maps high-resolution images to the feature space without loss of resolution. Second, the proposed Pyramid Transformer, with its Transformer receptive field, extracts both local and global information from images. The PPT module can map images into a set of multi-scale, multi-dimensional, and multi-angle features. We successfully apply the PPT module to different image fusion tasks and achieve state-of-the-art results. This shows that a fully-transformer model with a reasonably designed structure can represent image features without loss of information, demonstrating the effectiveness and universality of the PPT module. We believe that the proposed PPT module has reference significance for low-level vision tasks and image generation tasks.
- (2015) A new image quality metric for image fusion: the sum of the correlations of differences. AEU-International Journal of Electronics and Communications 69 (12), pp. 1890–1896.
- (2019) Behavior sequence transformer for e-commerce recommendation in Alibaba. In Proceedings of the 1st International Workshop on Deep Learning Practice for High-Dimensional Sparse Data, pp. 1–4.
- Transformer-XL: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
- (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- Image fusion based on generative adversarial network consistent with perception. Information Fusion.
- (2021) A dual-branch network for infrared and visible image fusion. arXiv preprint arXiv:2101.09643.
- (2014) Fast-FMI: non-reference image fusion metric. In 2014 IEEE 8th International Conference on Application of Information and Communication Technologies (AICT), pp. 1–3.
- (2008) The study on image fusion for high spatial resolution remote sensing images. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. XXXVII. Part B 7, pp. 1159–1164.
- (2013) A new image fusion performance metric based on visual information fidelity. Information Fusion 14 (2), pp. 127–135.
- (2016) Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
- (2021) TransGAN: two transformers can make one strong GAN. arXiv preprint arXiv:2102.07074.
- (2014) Adam: a method for stochastic optimization. arXiv: Learning.
- (2013) Multifocus and multispectral image fusion based on pixel significance using discrete cosine harmonic wavelet transform. Signal, Image and Video Processing 7 (6), pp. 1125–1143.
- (2015) Image fusion based on pixel significance using cross bilateral filter. Signal, Image and Video Processing 9 (5), pp. 1193–1204.
- (2007) Pixel- and region-based image fusion with complex wavelets. Information Fusion 8 (2), pp. 119–130.
- (2020) NestFuse: an infrared and visible image fusion architecture based on nest connection and spatial/channel attention models. IEEE Transactions on Instrumentation and Measurement.
- (2020) MDLatLRR: a novel decomposition method for infrared and visible image fusion. IEEE Transactions on Image Processing.
- (2018) DenseFuse: a fusion approach to infrared and visible images. IEEE Transactions on Image Processing 28 (5), pp. 2614–2623.
- (2013) Image fusion with guided filtering. IEEE Transactions on Image Processing 22 (7), pp. 2864–2875.
- Multi-focus image fusion with a deep convolutional neural network. Information Fusion 36, pp. 191–207.
- (2015) A general framework for image fusion based on multi-scale transform and sparse representation. Information Fusion 24, pp. 147–164.
- (2011) Objective assessment of multiresolution image fusion algorithms for context enhancement in night vision: a comparative study. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (1), pp. 94–109.
- (2016) Infrared and visible image fusion via gradient transfer and total variation minimization. Information Fusion 31, pp. 100–109.
- (2020) Infrared and visible image fusion via detail preserving adversarial learning. Information Fusion 54, pp. 85–98.
- (2020) DDcGAN: a dual-discriminator conditional generative adversarial network for multi-resolution image fusion. IEEE Transactions on Image Processing 29, pp. 4980–4995.
- Pan-GAN: an unsupervised learning method for pan-sharpening in remote sensing image fusion using a generative adversarial network. Information Fusion.
- (2019) FusionGAN: a generative adversarial network for infrared and visible image fusion. Information Fusion 48, pp. 11–26.
- Image fusion technique using multi-resolution singular value decomposition. Defence Science Journal 61 (5), pp. 479.
- (2015) Multi-focus image fusion using dictionary-based sparse representation. Information Fusion 25, pp. 72–84.
- (2007) Remote sensing image fusion using the curvelet transform. Information Fusion 8 (2), pp. 143–156.
- (2020) 3D object detection with Pointformer. arXiv preprint arXiv:2012.11409.
- (2019) Personalized re-ranking for recommendation. In Proceedings of the 13th ACM Conference on Recommender Systems, pp. 3–11.
- (2005) Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (8), pp. 1226–1238.
- (2017) DeepFuse: a deep unsupervised approach for exposure fusion with extreme exposure image pairs. In ICCV, pp. 4724–4732.
- (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9.
- (2008) Assessment of image fusion procedures using entropy, image quality, and multispectral classification. Journal of Applied Remote Sensing 2 (1), pp. 023522.
- AutoInt: automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1161–1170.
- (2014) TNO image fusion dataset. Figshare. data.
- (1989) Image fusion by a ratio of low-pass pyramid. Pattern Recognition Letters 9 (4), pp. 245–253.
- (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612.
- (2020) U2Fusion: a unified unsupervised image fusion network. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- (2020) FusionDN: a unified densely connected network for image fusion. In AAAI, pp. 12484–12491.
- (2020) Rethinking the image fusion: a fast unified image fusion network based on proportional maintenance of gradient and intensity. In AAAI, pp. 12797–12804.
- (2020) IFCNN: a general image fusion framework based on convolutional neural network. Information Fusion 54, pp. 99–118.
- (2020) Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. arXiv preprint arXiv:2012.15840.
- (2016) Perceptual fusion of infrared and visible images through a hybrid multi-scale decomposition with Gaussian and bilateral filters. Information Fusion 30, pp. 15–26.
- (2020) Deformable DETR: deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159.