[CVPR 2019] Learning Parallax Attention for Stereo Image Super-Resolution
Stereo image pairs can be used to improve the performance of super-resolution (SR) since additional information is provided from a second viewpoint. However, it is challenging to incorporate this information for SR since disparities between stereo images vary significantly. In this paper, we propose a parallax-attention stereo super-resolution network (PASSRnet) to integrate the information from a stereo image pair for SR. Specifically, we introduce a parallax-attention mechanism with a global receptive field along the epipolar line to handle different stereo images with large disparity variations. We also propose Flickr1024, a new dataset that is the largest available for stereo image SR. Extensive experiments demonstrate that the parallax-attention mechanism can capture correspondence between stereo images to improve SR performance with a small computational and memory cost. Comparative results show that our PASSRnet achieves state-of-the-art performance on the Middlebury, KITTI 2012 and KITTI 2015 datasets.
The website of this repository is at https://yingqianwang.github.io/Flickr1024/
Super-resolution (SR) aims to reconstruct high-resolution (HR) images from their low-resolution (LR) counterparts. Recovering an HR image from a single shot is a long-standing problem [1, 2, 3]. Recently, dual cameras have become increasingly popular in mobile phones and autonomous vehicles. It has already been demonstrated that the sub-pixel shifts contained in LR stereo images can be used to improve SR performance. However, since disparities between stereo images can vary significantly for different baselines, focal lengths, depths and resolutions, it is highly challenging to incorporate stereo correspondence for SR.
Traditional multi-image SR methods [7, 8] use patch recurrence across images to obtain correspondence. However, these methods cannot exploit sub-pixel correspondence and their computational cost is high. Recent CNN-based frameworks [9, 10, 11] incorporate optical flow estimation and SR in unified networks to solve the video SR problem. However, these methods cannot be directly applied to stereo image SR since the disparity can be much larger than their receptive field.
Stereo matching has been investigated to obtain correspondence between a stereo image pair [12, 13, 14]. Recent CNN-based methods [15, 16, 17, 18] use 3D or 4D cost volumes in their networks to model long-range dependency between stereo image pairs. Intuitively, these CNN-based stereo matching methods can be integrated with SR to provide accurate correspondence. However, 4D cost volume based methods [15, 16] suffer from a high computational and memory burden, which is unbearable for stereo image SR. Although the efficiency of 3D cost volume based methods [17, 18] is improved, these methods cannot handle stereo images with large disparity variations since a fixed maximum disparity is used to construct a cost volume.
Recently, Jeon et al. proposed a stereo SR network (StereoSR) to provide correspondence cues for SR using an image stack. Specifically, the image stack is obtained by concatenating the left image and the images generated by shifting the right image with different intervals. A direct mapping between parallax shifts and an HR image is then obtained. However, the flexibility of this method for different sensors and scenes is limited since the largest allowed disparity is fixed (i.e., 64) in this algorithm.
In this paper, we propose a parallax-attention stereo SR network (PASSRnet) to incorporate stereo correspondence for the SR task. Given a stereo image pair, a residual atrous spatial pyramid pooling (ASPP) module is first used to generate multi-scale features. Then, these features are fed to a parallax-attention module (PAM) to capture stereo correspondence. For each pixel in the left image, its feature similarities with all possible disparities in the right image are computed to generate an attention map. Consequently, our PAM can capture global correspondence while maintaining high flexibility. Afterwards, attention-driven feature aggregation is performed to update the features of the left image. Finally, these features are used to generate the SR result. Ablation study is performed on the KITTI 2015 dataset to test our PASSRnet. Comparative experiments are further conducted on the Middlebury, KITTI 2012 and KITTI 2015 datasets to demonstrate the superior performance of our network (as shown in Fig. 1).
Our main contributions can be summarized as follows: 1) We propose a PASSRnet for SR by incorporating stereo correspondence; 2) We introduce a generic parallax-attention mechanism with a global receptive field along the epipolar line to handle different stereo images with large disparity variations. It is demonstrated that reliable correspondence can be efficiently generated by the parallax-attention mechanism for the improvement of SR performance; 3) We propose a new dataset, namely Flickr1024, for the training of stereo image SR networks. The Flickr1024 dataset consists of 1024 high-quality stereo image pairs and covers diverse scenes; 4) Our PASSRnet achieves the state-of-the-art performance as compared to recent single image SR and stereo image SR methods.
In this section, we briefly review several major works for SR and long-range dependency learning.
Single Image SR
Since the seminal work of super-resolution convolutional neural network (SRCNN), learning-based methods have dominated the research of single image SR. Kim et al. proposed a very deep super-resolution network (VDSR) with 20 convolutional layers. Tai et al. developed a deep recursive residual network (DRRN) to control model parameters. Recently, Zhang et al. proposed a residual dense network (RDN) to facilitate effective feature learning through a contiguous memory mechanism.
Video SR Liao et al. introduced the first CNN for video SR. They performed motion compensation to generate an ensemble of SR-drafts, and then employed a CNN to reconstruct HR frames from the ensemble. Caballero et al. proposed an end-to-end video SR framework by incorporating a motion compensation module with an SR module. Tao et al. integrated an encoder-decoder network with LSTM to fully use temporal correspondence. This architecture further facilitates the extraction of temporal context. Since correspondence between adjacent frames mainly exists within a local region, video SR methods focus on the exploitation of local dependency. Therefore, they cannot be directly applied to stereo image SR due to the non-local and long-range dependency in stereo images.
Light-field Image SR Light-field imaging can capture additional angular information of light at the cost of spatial resolution. To enhance spatial resolution, Yoon et al. introduced the first light-field convolutional neural network (LFCNN). Yuan et al. proposed a CNN framework with a single image SR module and an epipolar plane image enhancement module. To model the correspondence between images of adjacent sub-apertures, Wang et al. developed a bidirectional recurrent CNN. Their network uses an implicit multi-scale feature fusion scheme to accumulate contextual information for SR. Note that these methods are specifically proposed for light-field imaging with short baselines. Since stereo imaging usually has a much larger baseline than light-field imaging, these methods are unsuitable for stereo image SR.
Stereo Image SR Bhavsar et al. argued that image SR and HR depth estimation are intertwined under a stereo setting. Therefore, they proposed an integrated approach to jointly estimate the SR image and HR disparity from LR stereo images. Recently, Jeon et al. proposed StereoSR to employ a parallax prior. Given a stereo image pair, the right image is shifted with different intervals and concatenated with the left image to generate a stereo tensor. The tensor is then fed to a plain CNN to generate the SR result by detecting similar patches within the disparity channel. However, StereoSR cannot handle different stereo images with large disparity variations since the number of shifted right images is fixed.
To handle different stereo images with varying disparities for SR, long-range dependency in stereo images should be captured. In this section, we review two types of methods for long-range dependency learning.
Cost Volume Cost volume is widely applied in stereo matching [15, 16, 17] and optical flow estimation [27, 28]. For stereo matching, several methods [15, 16] use naive concatenation to construct 4D cost volumes. These methods concatenate left feature maps with their corresponding right feature maps across all disparities to obtain a 4D cost volume (i.e., height × width × disparity × channel). Then, 3D CNNs are usually used for matching cost learning. However, learning matching costs from 4D cost volumes suffers from a high computational and memory burden. To achieve higher efficiency, dot product is used to reduce feature dimension [17, 18], resulting in 3D cost volumes (i.e., height × width × disparity). However, due to the fixed maximum disparity in 3D cost volumes, these methods are unable to handle different stereo image pairs with large disparity variations.
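The limitation of a fixed maximum disparity can be illustrated with a minimal NumPy sketch of a dot-product 3D cost volume. The function name, the array layout (H, W, C) and the convention that left pixel (i, j) is compared with right pixel (i, j - d) are assumptions made for this illustration and are not taken from the cited methods.

```python
import numpy as np

def dot_product_cost_volume(left_feat, right_feat, max_disp):
    """Build a 3D cost volume (height x width x disparity) via dot products.

    left_feat, right_feat: arrays of shape (H, W, C).
    Entry (i, j, d) is the similarity between left pixel (i, j) and
    right pixel (i, j - d); entries with j - d < 0 stay at zero.
    """
    H, W, C = left_feat.shape
    volume = np.zeros((H, W, max_disp), dtype=left_feat.dtype)
    for d in range(min(max_disp, W)):
        # Compare left columns d..W-1 with right columns 0..W-d-1.
        volume[:, d:, d] = np.sum(
            left_feat[:, d:, :] * right_feat[:, : W - d, :], axis=-1
        )
    return volume

# Toy example: one-hot features per column, true disparity of 5 pixels.
W = 12
left = np.tile(np.eye(W)[None], (2, 1, 1))   # shape (2, W, W)
right = np.zeros_like(left)
right[:, : W - 5, :] = left[:, 5:, :]

wide = dot_product_cost_volume(left, right, max_disp=8)    # can represent d = 5
narrow = dot_product_cost_volume(left, right, max_disp=4)  # cannot represent d = 5
```

In this toy case the volume built with `max_disp=8` peaks at disparity 5 for every matchable pixel, while the volume built with `max_disp=4` contains no match at all, which is the failure mode described above.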
Self-attention Mechanisms Attention mechanisms have been widely used to capture long-range dependency [29, 30]. For self-attention mechanisms [31, 32, 33], a weighted sum of all positions in spatial and/or temporal domain is calculated as the response at a position. Through matrix multiplication, self-attention mechanisms can capture the interaction between any two positions. Consequently, long-range dependency can be modeled with a small increase in computational and memory cost. Self-attention mechanisms have been successfully applied in image modeling and semantic segmentation. Recent non-local networks [34, 35] share a similar idea and can be considered as a generalization of self-attention mechanisms. Note that, since self-attention mechanisms model dependency across the whole image, directly applying these mechanisms to stereo image SR involves unnecessary calculations.
Inspired by self-attention mechanisms, we develop a parallax-attention mechanism to model global dependency in stereo images. Compared to cost volumes, our parallax-attention mechanism is more flexible and efficient. Compared to self-attention mechanisms, our parallax-attention mechanism makes full use of epipolar constraints to reduce the search space and improve efficiency. Moreover, the parallax-attention mechanism forces our network to focus on the most similar feature rather than collecting all similar features for correspondence generation. It is demonstrated that the parallax-attention mechanism can generate reliable correspondence to improve SR performance (Section 4.3.1).
Table 1: The detailed architecture of our PASSRnet. LReLU represents leaky ReLU with a leakage factor of 0.1, dila stands for dilation rate, $\otimes$ denotes batch-wise matrix multiplication, and $s$ is the upscaling factor.
Feature representation with rich context information is important for correspondence estimation. Therefore, both a large receptive field and multi-scale feature learning are required to obtain a discriminative representation. To this end, we propose a residual ASPP module to enlarge the receptive field and extract hierarchical features with dense pixel sampling rates and scales.
As shown in Fig. 2 (a), our residual ASPP module is constructed by alternately cascading a residual ASPP block with a residual block. Input features are first fed to a residual ASPP block to generate multi-scale features. These resulting features are then sent to a residual block for feature fusion. This structure is repeated twice to produce final features. Within each residual ASPP block (as shown in Fig. 2 (b)), we first combine three dilated convolutions (with dilation rates of 1, 4, 8) to form an ASPP group, and then cascade three ASPP groups in a residual manner. Our residual ASPP block not only enlarges the receptive field, but also enriches the diversity of convolutions, resulting in an ensemble of convolutions with different receptive regions and dilation rates. The highly discriminative feature learned by our residual ASPP module is beneficial to the overall SR performance, as demonstrated in Sec. 4.3.1.
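How the dilation rates enlarge the receptive field can be worked out with a few lines of arithmetic. The sketch below assumes 3 × 3 kernels with stride 1 and treats the three dilated convolutions of an ASPP group as parallel branches; neither assumption is stated in the text above, and the residual blocks between ASPP blocks are ignored.

```python
def effective_kernel(kernel_size, dilation):
    """Extent covered by one dilated convolution along one axis."""
    return dilation * (kernel_size - 1) + 1

def cascaded_receptive_field(kernel_size, dilations):
    """Receptive field of stride-1 dilated convolutions applied in sequence."""
    rf = 1
    for d in dilations:
        rf += effective_kernel(kernel_size, d) - 1
    return rf

# Assumed 3x3 kernels; the dilation rates 1, 4 and 8 come from the text.
branch_extents = [effective_kernel(3, d) for d in (1, 4, 8)]

# Widest path through three cascaded ASPP groups (dilation 8 in each group).
widest_path = cascaded_receptive_field(3, [8, 8, 8])

# Narrowest path (dilation 1 in each group), showing the spread of scales.
narrowest_path = cascaded_receptive_field(3, [1, 1, 1])
```

Under these assumptions a single residual ASPP block mixes paths whose receptive fields range from 7 to 49 pixels, which is one way to read the "ensemble of convolutions with different receptive regions" mentioned above.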
Parallax-attention Mechanism The architecture of our PAM is illustrated in Fig. 2 (c). Given two feature maps $A, B \in \mathbb{R}^{H \times W \times C}$, they are fed to a transition residual block to generate $A_0$ and $B_0$. Then, $A_0$ is fed to a $1 \times 1$ convolution layer to produce a query feature map $Q \in \mathbb{R}^{H \times W \times C}$. Meanwhile, $B_0$ is fed to another $1 \times 1$ convolution layer to produce $S \in \mathbb{R}^{H \times W \times C}$, which is then reshaped to $\mathbb{R}^{H \times C \times W}$. Batch-wise matrix multiplication is then performed between Q and S and a softmax layer is applied, resulting in a parallax attention map $M_{B \rightarrow A} \in \mathbb{R}^{H \times W \times W}$. For more details, please refer to the supplemental material. Next, B is fed to a $1 \times 1$ convolution to generate $R \in \mathbb{R}^{H \times W \times C}$, which is further multiplied by $M_{B \rightarrow A}$ to produce features $O \in \mathbb{R}^{H \times W \times C}$. As a weighted sum of features at all possible disparities, O is then integrated with its corresponding local features A. Since PAM can gradually focus on the features at accurate disparities using feature similarities, correspondence can then be captured. Note that, once $M_{B \rightarrow A}$ is ready, A and B are exchanged to produce $M_{A \rightarrow B}$ for valid mask generation (as described below). Finally, stacked features and a valid mask are fed to a $1 \times 1$ convolution layer for feature fusion.
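The core of this computation (batch-wise matrix multiplication along each image row, softmax, then attention-weighted aggregation) can be sketched in NumPy as follows. The convolutions and the transition block are omitted: the query, key and value features are assumed to be already computed, and the function names and the (H, W, C) layout are choices made for this illustration rather than the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def parallax_attention(Q, S, R):
    """Row-wise attention between two views.

    Q: query features from the first view, shape (H, W, C).
    S: key features from the second view, shape (H, W, C).
    R: value features from the second view, shape (H, W, C).
    Returns the attention map M (H, W, W) and aggregated features O (H, W, C).
    """
    # Reshape S to (H, C, W); each image row is one item of the batch.
    scores = Q @ S.transpose(0, 2, 1)   # (H, W, W)
    M = softmax(scores, axis=-1)        # weights over all positions in the same row
    O = M @ R                           # weighted sum of second-view features
    return M, O

rng = np.random.default_rng(0)
Q, S, R = (rng.standard_normal((4, 10, 8)) for _ in range(3))
M, O = parallax_attention(Q, S, R)
```

Because the attention map has shape (H, W, W), every pixel compares itself only with the W positions on the same row of the other view, rather than with all H × W positions as in full self-attention.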
Different from self-attention mechanisms [32, 33], our parallax-attention mechanism forces our network to focus on the most similar feature along the epipolar line rather than collecting all similar features, resulting in sparse attention maps. A comparison between parallax-attention maps generated by our PAM and the groundtruth is shown in Fig. 3. Note that $M_{right \rightarrow left}(i, j, k)$ represents the contribution of position $(i, k)$ in the right image to position $(i, j)$ in the left image. Consequently, the patterns in an attention map can reflect the correspondence between stereo pairs and also encode disparity information. For more details, please refer to the supplemental material. It can be observed that our PAM produces patterns similar to the groundtruth, which indicates that reliable stereo correspondence can be captured by our PAM. It should be noted that our PASSRnet can be considered as a multi-task network to learn both stereo correspondence and SR. However, using shared features for different tasks usually suffers from training conflict. Therefore, a transition block is introduced in our PAM to alleviate this problem. The effectiveness of the transition block is demonstrated in Sec. 4.3.1.
Left-right Consistency & Cycle Consistency
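Given deep features extracted from an LR stereo image pair ($I_{left}^{LR}$ and $I_{right}^{LR}$), two parallax-attention maps ($M_{left \rightarrow right}$ and $M_{right \rightarrow left}$) can be generated by PAM. Ideally, the following left-right consistency can be obtained if our PAM captures accurate correspondence:
$$I_{left}^{LR} = M_{right \rightarrow left} \otimes I_{right}^{LR}, \qquad I_{right}^{LR} = M_{left \rightarrow right} \otimes I_{left}^{LR} \tag{1}$$
where $\otimes$ denotes batch-wise matrix multiplication. Based on Eq. (1), we can further derive a cycle consistency:
$$I_{left}^{LR} = M_{left \rightarrow right \rightarrow left} \otimes I_{left}^{LR}, \qquad I_{right}^{LR} = M_{right \rightarrow left \rightarrow right} \otimes I_{right}^{LR} \tag{2}$$
where the cycle-attention maps $M_{left \rightarrow right \rightarrow left}$ and $M_{right \rightarrow left \rightarrow right}$ can be calculated as:
$$M_{left \rightarrow right \rightarrow left} = M_{right \rightarrow left} \otimes M_{left \rightarrow right}, \qquad M_{right \rightarrow left \rightarrow right} = M_{left \rightarrow right} \otimes M_{right \rightarrow left} \tag{3}$$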
Here, we introduce left-right consistency and cycle consistency to regularize the training of our PAM for the generation of reliable and consistent correspondence.
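Both consistencies can be checked numerically on an idealized case. The sketch below assumes a constant disparity with circular wrap-around (so that no pixel is occluded) and hard one-hot attention maps; the helper name `shift_attention` and the toy sizes are hypothetical. Under these assumptions, multiplying the right view by the right-to-left map reproduces the left view, and the product of the two maps is a stack of identity matrices.

```python
import numpy as np

def shift_attention(H, W, d):
    """Ideal hard attention maps for a constant circular disparity d.

    M_r2l[i, j, k] = 1 when right position k corresponds to left position j,
    i.e. k = (j - d) mod W. M_l2r is the reverse mapping.
    """
    P = np.roll(np.eye(W), d, axis=0)
    M_r2l = np.broadcast_to(P, (H, W, W)).copy()
    M_l2r = np.broadcast_to(P.T, (H, W, W)).copy()
    return M_r2l, M_l2r

H, W, C, d = 2, 8, 3, 3
rng = np.random.default_rng(0)
I_right = rng.random((H, W, C))
I_left = np.roll(I_right, d, axis=1)      # left view = right view shifted by d

M_r2l, M_l2r = shift_attention(H, W, d)

# Left-right consistency: each view is reproduced from the other one.
left_ok = np.allclose(M_r2l @ I_right, I_left)
right_ok = np.allclose(M_l2r @ I_left, I_right)

# Cycle consistency: going left -> right -> left is the identity mapping.
cycle_ok = np.allclose(M_r2l @ M_l2r, np.eye(W))
```

With real images the maps are soft and occlusions break these equalities, which is why the consistencies are used only as regularizers on valid regions.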
Valid Masks Since left-right consistency and cycle consistency do not hold for occluded regions, we use an occlusion detection method to generate valid masks. We only enforce consistency on valid regions. In the parallax-attention map generated by our PAM (e.g., $M_{left \rightarrow right}$), it is observed that pixels in occluded regions are usually assigned small weights. Therefore, a valid mask $V_{left \rightarrow right}$ can be obtained by:
$$V_{left \rightarrow right}(i, j) = \begin{cases} 1, & \text{if } \sum_{k \in [1, W]} M_{left \rightarrow right}(i, k, j) > \tau \\ 0, & \text{otherwise} \end{cases} \tag{4}$$
where $\tau$ is a threshold (empirically set to 0.1 in this paper) and $W$ is the width of stereo images. Two examples of valid masks are shown in Fig. 4. According to the parallax-attention mechanism, $M_{left \rightarrow right}(i, k, j)$ represents the contribution of position $(i, j)$ in the left image to position $(i, k)$ in the right image. Since occluded pixels in the left image cannot find their correspondences in the right image, their values of $\sum_{k} M_{left \rightarrow right}(i, k, j)$ are usually low. Thus, we consider these pixels as occluded ones. In practice, we use several morphological operations to handle isolated pixels and holes in valid masks. Note that occluded regions in the left image cannot obtain additional information from the right image. Therefore, the valid mask is further used to guide feature fusion, as shown in Fig. 2 (c).
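The thresholding rule can be sketched as follows: for each left-image pixel, sum the attention it contributes to all positions on the same row of the right image and compare the total with the threshold of 0.1. The function name and the indexing convention (second axis is the right-image position, third axis is the left-image position) are assumptions for this illustration, and the morphological post-processing mentioned above is left out.

```python
import numpy as np

def valid_mask(M_left_to_right, tau=0.1):
    """Occlusion-aware mask for the left image.

    M_left_to_right: attention map of shape (H, W, W), where entry (i, k, j)
    is the contribution of left position (i, j) to right position (i, k).
    Returns a float mask of shape (H, W): 1 = valid, 0 = treated as occluded.
    """
    contributed = M_left_to_right.sum(axis=1)   # total contribution of each left pixel
    return (contributed > tau).astype(np.float32)

# Toy row of width 3: left pixel 0 contributes to no right pixel.
M = np.array([[[0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0],
               [0.0, 0.5, 0.5]]])
V = valid_mask(M)
```

In this toy example the first left pixel receives a mask value of 0 because no right-image position attends to it, while the other two pixels are kept as valid.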
We design four losses for the training of our PASSRnet. Other than an SR loss, we introduce three additional losses, including photometric loss, smoothness loss and cycle loss, to help the network fully exploit the correspondence between stereo images. The overall loss function is formulated as:
$$\mathcal{L} = \mathcal{L}_{SR} + \lambda \left( \mathcal{L}_{photometric} + \mathcal{L}_{smooth} + \mathcal{L}_{cycle} \right) \tag{5}$$
where $\lambda$ is empirically set to 0.005. The performance of our network with different losses will be analyzed in Sec. 4.3.2.
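SR Loss The mean square error (MSE) loss is used as the SR loss:
$$\mathcal{L}_{SR} = \left\| I_{left}^{SR} - I_{left}^{HR} \right\|_2^2 \tag{6}$$
where $I_{left}^{SR}$ and $I_{left}^{HR}$ represent the SR result and HR groundtruth of the left image, respectively.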
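Photometric Loss Since collecting a large stereo dataset with densely labeled groundtruth disparities is highly challenging, we train our PAM in an unsupervised manner. Note that, if the groundtruth disparities are available, we can generate the groundtruth attention maps accordingly (see the supplemental material for more details) and train our PAM in a supervised manner. Following previous work, we introduce a photometric loss using the mean absolute error (MAE) loss. Note that, since the left-right consistency defined in Eq. (1) only holds in non-occluded regions, we introduce a photometric loss as:
$$\mathcal{L}_{photometric} = \sum_{p \in V_{left \rightarrow right}} \left\| I_{left}^{LR}(p) - \left( M_{right \rightarrow left} \otimes I_{right}^{LR} \right)(p) \right\|_1 + \sum_{p \in V_{right \rightarrow left}} \left\| I_{right}^{LR}(p) - \left( M_{left \rightarrow right} \otimes I_{left}^{LR} \right)(p) \right\|_1 \tag{7}$$
where $p$ represents a pixel with a valid mask value in $V_{left \rightarrow right}$ or $V_{right \rightarrow left}$.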
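A masked photometric loss of this kind can be sketched in NumPy: each view is reconstructed from the other one through the corresponding attention map, and the absolute error is accumulated only where the valid mask is 1. The function signature, the use of a sum rather than a mean, and the pairing of each mask with the view it belongs to are assumptions of this illustration.

```python
import numpy as np

def photometric_loss(I_left, I_right, M_r2l, M_l2r, V_left, V_right):
    """Masked L1 reconstruction error between the two views.

    I_left, I_right: LR images of shape (H, W, C).
    M_r2l, M_l2r:    attention maps of shape (H, W, W).
    V_left, V_right: binary valid masks of shape (H, W) for each view.
    """
    left_from_right = M_r2l @ I_right    # reconstruct the left view
    right_from_left = M_l2r @ I_left     # reconstruct the right view
    err_left = np.abs(I_left - left_from_right) * V_left[..., None]
    err_right = np.abs(I_right - right_from_left) * V_right[..., None]
    return err_left.sum() + err_right.sum()

# Idealized check: constant circular disparity, hard attention, nothing occluded.
H, W, C, d = 2, 6, 3, 2
rng = np.random.default_rng(0)
I_right = rng.random((H, W, C))
I_left = np.roll(I_right, d, axis=1)
P = np.roll(np.eye(W), d, axis=0)
M_r2l = np.broadcast_to(P, (H, W, W))
M_l2r = np.broadcast_to(P.T, (H, W, W))
ones = np.ones((H, W))

loss_correct = photometric_loss(I_left, I_right, M_r2l, M_l2r, ones, ones)
loss_swapped = photometric_loss(I_left, I_right, M_l2r, M_r2l, ones, ones)
```

With the correct attention maps the loss vanishes in this toy setting, whereas swapping the two maps (a wrong correspondence) yields a positive loss.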
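Smoothness Loss To generate accurate and consistent attention in textureless regions, a smoothness loss is defined on the attention maps $M_{left \rightarrow right}$ and $M_{right \rightarrow left}$:
$$\mathcal{L}_{smooth} = \sum_{M} \sum_{i, j, k} \left( \left\| M(i, j, k) - M(i+1, j, k) \right\|_1 + \left\| M(i, j, k) - M(i, j+1, k+1) \right\|_1 \right) \tag{8}$$
where $M \in \{ M_{left \rightarrow right}, M_{right \rightarrow left} \}$. The first and second terms in Eq. (8) are used to achieve vertical and horizontal attention consistency, respectively.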
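One way to read the two terms is sketched below for a single attention map: the vertical term compares each entry with the entry one row below, and the horizontal term compares each entry with its diagonal neighbour (both indices advanced by one), so that neighbouring pixels are encouraged to share the same disparity. The function name and the use of plain sums are assumptions of this illustration.

```python
import numpy as np

def smoothness_loss(M):
    """Smoothness penalty on one attention map M of shape (H, W, W)."""
    # Vertical consistency: M(i, j, k) versus M(i + 1, j, k).
    vertical = np.abs(M[:-1, :, :] - M[1:, :, :]).sum()
    # Horizontal consistency: M(i, j, k) versus M(i, j + 1, k + 1).
    horizontal = np.abs(M[:, :-1, :-1] - M[:, 1:, 1:]).sum()
    return vertical + horizontal

H, W = 3, 6
# A constant disparity of 2 pixels gives the same shifted diagonal in every row.
constant_disparity = np.broadcast_to(np.eye(W, k=-2), (H, W, W))
loss_constant = smoothness_loss(constant_disparity)

rng = np.random.default_rng(0)
loss_random = smoothness_loss(rng.random((H, W, W)))
```

An attention map that encodes one constant disparity incurs no penalty, while an unstructured map is penalized; the full loss applies the same function to both attention maps and adds the results.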
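Cycle Loss In addition to photometric loss and smoothness loss, we further introduce a cycle loss to achieve cycle consistency. Since $M_{left \rightarrow right \rightarrow left}$ and $M_{right \rightarrow left \rightarrow right}$ in Eq. (2) can be considered as identity matrices, we design a cycle loss as:
$$\mathcal{L}_{cycle} = \sum_{p \in V_{left \rightarrow right}} \left\| M_{left \rightarrow right \rightarrow left}(p) - I(p) \right\|_1 + \sum_{p \in V_{right \rightarrow left}} \left\| M_{right \rightarrow left \rightarrow right}(p) - I(p) \right\|_1 \tag{9}$$
where $I \in \mathbb{R}^{H \times W \times W}$ is a stack of $H$ identity matrices.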
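The cycle loss can be sketched in NumPy by forming the two cycle-attention maps as products of the left-to-right and right-to-left maps and penalizing their masked L1 distance to the identity. The function signature and the choice to mask whole rows of the cycle maps (one row per pixel) are assumptions of this illustration.

```python
import numpy as np

def cycle_loss(M_r2l, M_l2r, V_left, V_right):
    """Masked L1 distance between cycle-attention maps and the identity.

    M_r2l, M_l2r:    attention maps of shape (H, W, W).
    V_left, V_right: binary valid masks of shape (H, W).
    """
    H, W, _ = M_r2l.shape
    identity = np.broadcast_to(np.eye(W), (H, W, W))
    M_lrl = M_r2l @ M_l2r     # left -> right -> left
    M_rlr = M_l2r @ M_r2l     # right -> left -> right
    loss_left = (np.abs(M_lrl - identity) * V_left[..., None]).sum()
    loss_right = (np.abs(M_rlr - identity) * V_right[..., None]).sum()
    return loss_left + loss_right

H, W, d = 2, 6, 2
ones = np.ones((H, W))

# Ideal hard correspondence (constant circular disparity): the cycle is the identity.
P = np.roll(np.eye(W), d, axis=0)
ideal = cycle_loss(np.broadcast_to(P, (H, W, W)),
                   np.broadcast_to(P.T, (H, W, W)), ones, ones)

# Uniform attention carries no correspondence, so the cycle is far from the identity.
uniform_map = np.full((H, W, W), 1.0 / W)
uniform = cycle_loss(uniform_map, uniform_map, ones, ones)
```

The loss is zero for a one-to-one correspondence and grows when attention is spread over many positions, which matches the stated goal of focusing on the most similar feature.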
Table 2: Comparative results achieved on the KITTI 2015 dataset by PASSRnet with different settings.
|Model||Input||PSNR||SSIM||Params.||Time|
|PASSRnet with single input||Left||25.27||0.770||1.32M||114ms|
|PASSRnet with replicated inputs||Left-Left||25.29||0.771||1.42M||176ms|
|PASSRnet without residual manner||Left-Right||25.40||0.774||1.42M||176ms|
|PASSRnet without atrous convolution||Left-Right||25.38||0.773||1.42M||176ms|
|PASSRnet without PAM||Left-Right||25.28||0.771||1.32M||135ms|
|PASSRnet without transition residual block||Left-Right||25.36||0.773||1.34M||160ms|
In this section, we first introduce the datasets and implementation details, and then conduct ablation experiments to test our network. We further compare our network to recent single image SR and stereo image SR methods.
For training, we followed previous work and downsampled 60 Middlebury images by a factor of 2 to generate HR images. We further collected 1024 stereo images from Flickr to construct a new Flickr1024 dataset. This dataset was used as the augmented training data for our PASSRnet. Please see the supplemental material for more details about the Flickr1024 dataset. For testing, we used 5 images from the Middlebury dataset, 20 images from the KITTI 2012 dataset and 20 images from the KITTI 2015 dataset as benchmark datasets. We further collected 10 close-shot stereo images (with disparities larger than 200) from Flickr to test the flexibility of our network to large disparity variations. For validation, we selected another 20 images from the KITTI 2012 dataset.
During the training phase, we first downsampled HR images using bicubic interpolation to generate LR images, and then cropped patches with a stride of 20 from these LR images. Meanwhile, their corresponding patches in HR images were also cropped. The horizontal patch size was increased to 90 to cover most disparities (96%) in our training dataset. These patches were randomly flipped horizontally and vertically for data augmentation. Note that rotation was not performed, in order to maintain epipolar constraints. We used peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) to test SR performance. As in previous work, we cropped borders to achieve a fair comparison.
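For reference, PSNR with optional border cropping can be computed as in the following sketch. It uses the standard definition, 10 log10(MAX² / MSE); the function name, the 8-bit peak value of 255 and the `border` parameter are assumptions of this illustration, since the exact crop width is not given above.

```python
import numpy as np

def psnr(sr, hr, max_value=255.0, border=0):
    """Peak signal-to-noise ratio in dB between an SR result and its groundtruth.

    sr, hr:    arrays of identical shape (H, W) or (H, W, C).
    max_value: peak intensity (255 for 8-bit images).
    border:    number of pixels cropped from each image edge before comparison.
    """
    sr = np.asarray(sr, dtype=np.float64)
    hr = np.asarray(hr, dtype=np.float64)
    if border > 0:
        sr = sr[border:-border, border:-border]
        hr = hr[border:-border, border:-border]
    mse = np.mean((sr - hr) ** 2)
    if mse == 0:
        return float("inf")      # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

hr = np.zeros((8, 8))
off_by_one = psnr(hr + 1.0, hr)            # every pixel differs by one grey level
cropped = psnr(hr + 1.0, hr, border=2)     # same error, evaluated on the 4x4 centre
```

A uniform error of one grey level on 8-bit data corresponds to roughly 48.13 dB, which gives a sense of scale for the differences of a few tenths of a dB reported in the ablation study.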
Our PASSRnet was implemented in PyTorch on a PC with an Nvidia GTX 1080Ti GPU. All models were optimized using the Adam method with a batch size of 32. The learning rate was halved after every 30 epochs. The training was stopped after 80 epochs since more epochs did not provide further consistent improvement.
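The step schedule described here (halving every 30 epochs over 80 epochs of training) can be written as a small function. The initial learning rate is passed as an argument; the value used in the example call is a placeholder, not the value used in the experiments.

```python
def learning_rate(epoch, initial_lr):
    """Learning rate at a given 0-based epoch when it is halved every 30 epochs."""
    return initial_lr * 0.5 ** (epoch // 30)

# Placeholder initial learning rate of 1.0, used only to show the decay factors.
schedule = [learning_rate(epoch, 1.0) for epoch in range(80)]
```

Over 80 epochs the rate takes three values: the initial rate for epochs 0 to 29, half of it for epochs 30 to 59, and a quarter of it for epochs 60 to 79.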
In this section, we present ablation experiments to justify our design choices, including the network architecture and the losses.
Single Input vs. Stereo Input Compared to single images, stereo image pairs provide additional information observed from a different viewpoint. To demonstrate the effectiveness of stereo information for SR performance improvement, we removed PAM from our PASSRnet and retrained the network with single images (i.e., the left images). For comparison, we also used pairs of replicated left images as the input to the original PASSRnet. Results achieved on the KITTI 2015 dataset are listed in Table 2.
Compared to the original PASSRnet, the network trained with single images suffers a decrease of 0.16 dB (25.43 to 25.27) in PSNR. Further, if pairs of replicated left images are fed to the original PASSRnet, the PSNR value is decreased to 25.29 dB. Without extra information introduced by stereo images, our PASSRnet with replicated images achieves comparable performance to the network trained with single images. This clearly demonstrates that stereo images can be used to improve the performance of PASSRnet.
Residual ASPP Module
Residual ASPP module is used in our network to extract multi-scale features. To demonstrate the effectiveness of residual ASPP, two variants were introduced. First, to test the effectiveness of residual connections, we removed them to obtain a cascading ASPP module. Then, to test the effectiveness of atrous convolutions, we replaced them with ordinary convolutions.
From the comparative results shown in Table 2, we can see that SR performance benefits from both residual connections and atrous convolutions. If residual connections are removed, the PSNR value decreases from 25.43 dB to 25.40 dB. This is because residual connections enable our residual ASPP module to extract features at more scales, resulting in more robust feature representations. Furthermore, if atrous convolutions are replaced by ordinary ones, the PSNR value decreases from 25.43 dB to 25.38 dB. This is because the large receptive field of atrous convolutions enables our PASSRnet to employ context information from a large area. Therefore, more accurate correspondence can be obtained to improve SR performance.
Parallax-attention Module PAM is introduced to integrate the information from stereo images. To demonstrate its effectiveness, we introduced a variant by removing PAM and directly stacking the output features of the residual ASPP module. It can be observed from Table 2 that the PSNR value decreases from 25.43 dB to 25.28 dB if PAM is removed. This is because the long spatial distance between local features in the left image and their dependency in the right image prevents plain CNNs from integrating these features effectively.
Transition Block in PAM The transition block in PAM is introduced to alleviate the training conflict in shared layers. To demonstrate the effectiveness of the transition block, we removed it from our PAM and retrained the network. It can be observed from Table 2 that the PSNR value decreases from 25.43 dB to 25.36 dB if the transition block is removed. This is because the transition block enhances task-specific feature learning in our PAM and alleviates training conflict in shared layers. Therefore, more representative features can be learned in shared layers.
PAM vs. Cost Volume Cost volume and 3D convolutions are commonly used to obtain stereo correspondence [15, 16]. To demonstrate the efficiency of our PAM in stereo correspondence generation, we replaced PAM with a 4D cost volume and two 3D convolutional layers. It can be observed from Table 3 that our PAM has less than half of the parameters of the cost volume formulation. Moreover, our PAM achieves superior computational efficiency, with FLOPs being reduced by over 150 times. With PAM, our PASSRnet achieves better SR performance (i.e., the PSNR value is increased from 25.23 dB to 25.43 dB) and efficiency (i.e., running time is decreased by 1.5 times). This is because two 3D convolutional layers are insufficient to capture long-range correspondence within the cost volume, while adding more layers would lead to a significant increase in computational cost.
To test the effectiveness of our losses, we retrained PASSRnet using different losses.
It can be observed from Table 4 that the PSNR value of our PASSRnet decreases from 25.43 dB to 25.35 dB if PASSRnet is trained with only the SR loss. This is because, with only this loss, our PAM learns to collect all similar features along the epipolar line and cannot focus on the most similar feature to provide accurate correspondence. Further, the performance is gradually improved if the photometric loss, smoothness loss and cycle loss are added. This is because these losses encourage our PAM to generate reliable and consistent correspondence. Overall, our PASSRnet achieves the best performance (i.e., PSNR=25.43 dB and SSIM=0.776) when it is trained with all these losses.
Table 5: PSNR/SSIM values achieved by single image SR methods (SRCNN, VDSR, DRCN, LapSRN, DRRN) and stereo image SR methods (StereoSR, Ours).
|Dataset||Scale||SRCNN||VDSR||DRCN||LapSRN||DRRN||StereoSR||Ours|
|Middlebury (5 images)||2||32.05/0.935||32.66/0.941||32.82/0.941||32.75/0.940||32.91/0.945||33.05/0.955*||34.05/0.960|
|KITTI 2012 (20 images)|
|KITTI 2015 (20 images)||2||28.77/0.901||28.99/0.904||29.04/0.904||28.97/0.903||29.00/0.906||29.09/0.909||29.78/0.919|
We compared our PASSRnet to a number of CNN-based SR methods on three benchmark datasets. Recent single image SR methods under comparison include SRCNN, VDSR, DRCN, LapSRN and DRRN. We also compared our PASSRnet to the latest stereo image SR method, StereoSR. The codes provided by the authors of these methods were used to conduct experiments. Note that, similar to [6, 43], EDSR, RDN and D-DBPN are not included in our comparison since their model sizes are larger than our PASSRnet by at least 8 times.
Quantitative Results The quantitative results are shown in Table 5. It can be observed that our PASSRnet achieves the best performance on the Middlebury, KITTI 2012 and KITTI 2015 datasets. Specifically, compared to single image SR methods, our PASSRnet outperforms the second best approach (i.e., DRRN) by 1.04 dB in terms of PSNR on the Middlebury dataset for SR. Moreover, the PSNR value achieved by our network is higher than that of StereoSR by 1.00 dB. This is because more reliable correspondence can be captured by our parallax-attention mechanism.
Qualitative Results Figure 5 illustrates the qualitative results achieved on two scenarios. It can be observed from zoom-in regions that single image SR methods cannot recover reliable details. In contrast, our PASSRnet uses stereo correspondence to produce finer details with fewer artifacts, such as the railings and stripe in Fig. 5. Compared to StereoSR, our PASSRnet explicitly captures stereo correspondence for SR. Consequently, superior visual performance is achieved.
We further tested the flexibility of our PASSRnet and StereoSR with respect to large disparity variations. Results achieved on images with different resolutions are shown in Table 6. More results under different baselines and depths are available in the supplemental material. It can be observed that our PASSRnet is significantly better than StereoSR in terms of efficiency (i.e., FLOPs) on low resolution images. Meanwhile, our PASSRnet outperforms StereoSR by a large margin in terms of PSNR on high resolution images. This is because StereoSR needs to perform padding for images with a horizontal resolution lower than 64 pixels, which involves unnecessary calculations. For high resolution images, the fixed maximum disparity prevents StereoSR from capturing longer-range correspondence. Therefore, the SR performance of StereoSR is limited.
In this paper, we propose a parallax-attention stereo super-resolution network (PASSRnet) to incorporate stereo correspondence for the SR task. Our PASSRnet introduces a parallax-attention mechanism with global receptive field to handle different stereo images with large disparity variations. We also introduce a new and the largest dataset for stereo image SR. It is demonstrated that our PASSRnet can effectively capture stereo correspondence for the improvement of SR performance. Comparison to recent single image SR and stereo image SR methods has shown that our network achieves the state-of-the-art performance.
International Journal of Computer Vision, 47(1-3), 2002.
Efficient deep learning for stereo matching. In CVPR, pages 5695–5703, 2016.
DRAW: A recurrent neural network for image generation. In ICML, volume 37, pages 1462–1471, 2015.