Parallax Attention for Unsupervised Stereo Correspondence Learning

09/16/2020
by   Longguang Wang, et al.

Stereo image pairs encode 3D scene cues into stereo correspondences between the left and right images. To exploit 3D cues within stereo images, recent CNN-based methods commonly use cost volume techniques to capture stereo correspondence over large disparities. However, since disparities can vary significantly for stereo cameras with different baselines, focal lengths and resolutions, the fixed maximum disparity used in cost volume techniques hinders them from handling different stereo image pairs with large disparity variations. In this paper, we propose a generic parallax-attention mechanism (PAM) to capture stereo correspondence regardless of disparity variations. Our PAM integrates epipolar constraints with the attention mechanism to calculate feature similarities along the epipolar line and thereby capture stereo correspondence. Based on our PAM, we propose a parallax-attention stereo matching network (PASMnet) and a parallax-attention stereo image super-resolution network (PASSRnet) for the stereo matching and stereo image super-resolution tasks. Moreover, we introduce a new and large-scale dataset named Flickr1024 for stereo image super-resolution. Experimental results show that our PAM is generic and can effectively learn stereo correspondence under large disparity variations in an unsupervised manner. Comparative results show that our PASMnet and PASSRnet achieve state-of-the-art performance.


1 Introduction

With the popularity of dual cameras in mobile phones, autonomous vehicles and robots, stereo vision has attracted increasing attention in both academia and industry [1, 2]. Traditional studies [3, 4, 5] mainly focus on finding correspondences between stereo images to provide depth information, namely stereo matching. Recent works further use 3D cues within stereo images for various tasks including stereo image restoration [6, 7], stereo magnification [8], stereo video retargeting [9] and stereo neural style transfer [10]. In real-world applications such as mobile phones, the baselines and resolutions of stereo cameras vary across devices. Moreover, the focal length of stereo cameras may change as the cameras adjust to a new scene. Since disparities between stereo images can vary significantly for stereo cameras with different baselines, focal lengths and resolutions, it is highly challenging to incorporate stereo correspondence into various real-world applications.

Motivated by the powerful feature representation capability of convolutional neural networks (CNNs), deep feature representations are widely used to calculate similarities for stereo correspondence. However, the local receptive field prevents plain CNNs from capturing correspondence over large disparities. To overcome this limitation, cost volumes are widely applied in CNN-based methods [11, 4, 5, 12]. Several methods [4, 5, 13] concatenate unary features from the left and right images to generate a 4D cost volume (i.e., height×width×disparity×channel). Then, 3D CNNs are used to learn matching costs within the cost volume. However, learning matching costs from 4D cost volumes suffers from a high computational and memory burden. To achieve higher efficiency, 1D correlation is used to reduce the feature dimension [11, 12, 14, 15], resulting in 3D cost volumes (i.e., height×width×disparity). Although 4D/3D cost volumes enable CNNs to capture correspondence over large disparities, the fixed maximum disparity hinders them from handling different stereo image pairs with large disparity variations.

In this paper, we propose an unsupervised parallax-attention mechanism (PAM) to learn stereo correspondence under large disparity variations. Our PAM integrates epipolar constraints with the attention mechanism to calculate feature similarities along the epipolar line. Specifically, for each pixel in the left image, its feature similarities with all possible disparities in the right image are computed to generate an attention map. By regularizing the attention map, our PAM can focus on the most similar feature to provide accurate stereo correspondence.

Compared to cost volume techniques, our PAM has three remarkable properties. First, the PAM gets access to different disparities through matrix multiplication instead of shift operations. Therefore, our PAM does not need a manually set, fixed maximum disparity and can handle large disparity variations. Second, cost volume based methods commonly regress disparities from matching costs and then calculate losses on these disparities. However, this may lead to unreasonable cost distributions due to the ambiguity of disparity regression (as shown in Fig. 1). In contrast, performing direct regularization on parallax-attention maps (e.g., through a smoothness loss and a cycle loss, as described in Sec. 4.2) enables our PAM to achieve improved performance. Third, the PAM is compact and can be applied to various tasks such as stereo matching and stereo image super-resolution (SR). By using the PAM, 3D cues from the left and right images can be aggregated without explicit disparity calculation.

This paper is an extension of our previous conference version [6]. The contributions of this work can be summarized as follows:

  • A generic parallax-attention mechanism is proposed to learn stereo correspondence in image pairs with large disparity variations in an unsupervised manner.

  • The PAM is successfully applied to two specific tasks: stereo matching and stereo image SR. Our PAM based networks achieve state-of-the-art performance in both stereo matching and stereo image SR.

  • A new dataset, namely Flickr1024, is proposed for the training of stereo image SR networks. This dataset consists of 1024 high-quality stereo image pairs and covers various scenes.

The rest of this paper is organized as follows. In Section 2, we briefly review the related work. In Section 3, we present our PAM in detail. In Sections 4 and 5, we apply the PAM to the stereo matching and stereo image SR tasks, respectively. Finally, we conclude this paper in Section 6.

Fig. 1: An illustration of the ambiguity for disparity regression in existing cost volume based methods. These three distributions have equal regression results but are quite different. A unimodal distribution peaked at the true disparity with high sharpness (i.e., low uncertainty) is more reasonable, as shown in (a).

2 Related Work

Stereo vision has been investigated for various tasks in recent decades [3, 5, 6, 8, 7, 10]. Here, we focus on two specific tasks: stereo matching and stereo image SR. Moreover, we also review attention mechanisms that are highly related to our work.

2.1 Stereo Matching

2.1.1 Supervised Stereo Matching

Traditional stereo matching methods commonly follow a popular four-step pipeline, including matching cost calculation, cost aggregation, disparity calculation and disparity refinement [14].

Motivated by the success of CNNs in many vision tasks such as object recognition [16, 17] and detection [18, 19], some early learning-based methods [20, 21, 22] use CNNs to replace one or multiple steps in the stereo matching pipeline. Zbontar and LeCun [20] used a CNN to compute the matching cost between two image patches. The resulting matching costs are then processed by several traditional techniques (including cross-based cost aggregation and semi-global matching) to predict the disparity map. Luo et al. [3] proposed a dot product operation to compute the correlation between unary feature representations of the left and right images. Their network has lower computational complexity and achieves higher efficiency than [20]. To improve feature representation capability, pyramid pooling [23] and highway networks [22] are used to enlarge the receptive field and deepen the network, respectively. These two methods produce more accurate disparities than [20].

To achieve better performance, several methods [11, 4, 14] are proposed to integrate all steps into an end-to-end architecture for joint optimization. Mayer et al. [11] proposed an end-to-end CNN called DispNet to regress disparities from stereo images. Specifically, a 1D correlation operation is used to generate 3D cost volumes for matching cost learning. Following DispNet, several works [24, 14] have been proposed to employ correlation operations and 3D cost volumes for disparity estimation. Liang et al. [14] used feature constancy to bridge the gap between disparity calculation and disparity refinement. Therefore, their network can seamlessly integrate the disparity calculation and disparity refinement steps into a compact network to improve efficiency.

Recent methods [4, 5, 13] commonly use concatenation to generate 4D cost volumes and then use 3D CNNs for matching cost aggregation. GC-Net [4] is the first work to use naive concatenation instead of correlation to build 4D cost volumes. Specifically, 3D convolutions are used to aggregate context information within the 4D cost volume for matching cost aggregation. Following GC-Net, Chang et al. [5] proposed a pyramid stereo matching network (PSMNet) with a spatial pyramid pooling module to enlarge receptive fields for more representative features. In PSMNet, stacked 3D hourglass networks are used to aggregate matching costs at multiple scales. Inspired by the semi-global matching (SGM) method [25], Zhang et al. [26] introduced semi-global and local guided aggregation layers to aggregate matching costs in different directions and local regions.

2.1.2 Unsupervised Stereo Matching

Due to the difficulty of collecting a large dataset with densely labeled groundtruth depth, several methods [27, 24, 28] have been developed to learn stereo matching in an unsupervised manner.

Zhou et al. [27] adopted a left-right consistency check to produce a confidence map to guide the training of a network. Yang et al. [24] proposed a SegStereo network to use semantic cues for stereo matching. Specifically, they introduced a segmentation sub-network to provide semantic features and proposed a semantic loss regularization to improve the robustness of disparity estimation. Li et al. [28] proposed an occlusion-aware stereo matching network (OASM) to exploit occlusion cues for stereo matching. Specifically, they introduced an occlusion inference module to provide occlusion cues and proposed a hybrid loss to use the interaction between disparity and occlusion as the supervision for network training.

Existing supervised and unsupervised stereo matching methods commonly use 3D/4D cost volumes to capture stereo correspondence over large disparities. However, the fixed maximum disparity hinders them from handling different stereo images with large disparity variations. Moreover, 4D cost volumes require high computational and memory costs.

2.2 Stereo Image Super-Resolution

Stereo image SR aims at recovering a high-resolution (HR) left image from a pair of low-resolution (LR) stereo images. Given additional information from a second viewpoint, stereo correspondence can be used to improve SR performance. A recent work [29] further extends the stereo image SR task to a stereoscopic image SR task that reconstructs a pair of HR left/right images while preserving stereo consistency. Note that we only focus on the stereo image SR task in this paper. Due to the unique characteristics of stereo images (such as epipolar constraints and non-local correspondence), video SR and light-field image SR methods are unsuitable for stereo image SR. Video SR methods [30, 31, 32] commonly focus on the exploitation of local correspondence since motion between adjacent frames is mainly limited to a local region. Therefore, these video SR methods are unsuitable for stereo image SR due to the non-local correspondence over large disparities in stereo images. Light-field image SR methods [33, 34, 35] are specifically proposed for light-field images with short baselines. Therefore, they cannot be directly used for stereo image SR since stereo images usually have much longer baselines than light-field images.

To enhance the resolution of stereo images, Bhavsar et al. [36] argued that image SR and HR depth estimation are intertwined under the stereo setting. They proposed a joint framework to iteratively estimate the SR image and HR disparity. Recently, Jeon et al. [37] proposed to employ a parallax prior in a CNN for stereo image SR. Given a stereo image pair, the right image is shifted with different intervals and concatenated with the left image to generate a stereo tensor. The stereo tensor is then fed to a CNN called StereoSR to generate SR results. However, StereoSR cannot handle different stereo images with large disparity variations since the number of shifted right images is fixed.

Fig. 2: Comparison between self-attention mechanism and parallax-attention mechanism.
Fig. 3: An illustration of our PAM.

2.3 Attention Mechanism

The attention mechanism was first introduced by Bahdanau et al. [38] and has been widely used in natural language processing tasks to capture long-range (long-term) dependencies [39, 40]. Recently, the attention mechanism has also been applied in many computer vision tasks including semantic segmentation [41, 42], image captioning [43, 44] and image generation [45]. Due to the limited receptive fields of CNNs, long-range dependencies cannot be captured. To overcome this limitation, the self-attention mechanism is introduced to calculate the correlation of any two positions in an image. Wang et al. [46] proposed a non-local network to aggregate information from non-local regions, which can be viewed as a general form of the self-attention mechanism. Zhang et al. [45] introduced the self-attention mechanism for image generation. The self-attention mechanism enables their network to use features in both local regions and distant regions to generate consistent scenarios. Fu et al. [42] proposed to use the self-attention mechanism to aggregate long-range information for semantic segmentation.

Our work is motivated by the self-attention mechanism. We incorporate epipolar constraints into the attention mechanism and develop a parallax-attention mechanism to capture correspondence in stereo images (as shown in Fig. 2). For each pixel in the left image, our parallax-attention mechanism attends to all disparities along the epipolar line and learns to focus on the most similar one.

3 Parallax-Attention Mechanism (PAM)

In this section, we present our PAM in detail. We first introduce the formulation of our PAM, and then describe left-right consistency, cycle consistency and the valid mask based on our PAM.

3.1 Formulation

3.1.1 Overview

In the self-attention mechanism, a feature map of size $H \times W \times C$ is first reshaped to the size of $HW \times C$. Then, matrix multiplication ($HW \times C$ by $C \times HW$) is used to calculate the correlation of any two positions in an image. For stereo images, the corresponding pixel for a pixel in the left image only lies along its epipolar line in the right image. Taking this epipolar constraint into consideration, our PAM uses a specifically designed reshape operation and geometry-aware matrix multiplication to calculate the correlation between a pixel in the left image and all positions along its epipolar line in the right image.

As illustrated in Fig. 3, given two feature maps A and B (of size $H \times W \times C$) from a stereo image pair, they are first fed to $1 \times 1$ convolutions for feature adaption. Specifically, A is fed to a $1 \times 1$ convolution to produce a query feature map Q of size $H \times W \times C$. Meanwhile, B is fed to another $1 \times 1$ convolution to produce a key feature map K, which is then reshaped to $H \times C \times W$. Then, geometry-aware matrix multiplication is performed between Q and K and a softmax layer is applied, resulting in a parallax-attention map $\mathcal{M}_{B \to A}$ of size $H \times W \times W$. Through matrix multiplication, our PAM can efficiently encode the feature correlation between any two positions along the epipolar line into the parallax-attention map. The parallax-attention map has several remarkable properties, as described in Secs. 3.2 and 3.3. Next, B is fed to a $1 \times 1$ convolution to generate a response feature map R, which is further multiplied by $\mathcal{M}_{B \to A}$ to produce the output features O. Meanwhile, a valid mask is also generated (Section 3.3).

It should be noted that all disparities are considered in the PAM. That is, our PAM does not need a manually set, fixed maximum disparity and can handle large disparity variations. Since the PAM learns to focus on the features at the correct disparities using feature similarities, stereo correspondence can then be captured.

Fig. 4: An illustration of our PAM using a toy example. The first row shows the left/right stereo images and the second row shows the parallax-attention maps $\mathcal{M}_{R \to L}(h)$ and $\mathcal{M}_{L \to R}(h)$ corresponding to the rows marked by a yellow stroke.

3.1.2 Toy Example

We further illustrate our PAM using a toy example, as shown in Fig. 4. Given a stereo image pair $I_L$ and $I_R$ of size $H \times W$, parallax-attention maps $\mathcal{M}_{R \to L}$ and $\mathcal{M}_{L \to R}$ of size $H \times W \times W$ can be generated by our PAM. Note that each slice of the parallax-attention maps (e.g., $\mathcal{M}_{R \to L}(h)$) delivers the dependency between corresponding rows (i.e., $I_L(h)$ and $I_R(h)$). If the disparity between a pair of stereo images is zero, all parallax-attention maps are identity matrices, as shown in Fig. 4(a). That is because the pixel $(h, w)$ in $I_L$ corresponds to the pixel $(h, w)$ in $I_R$. Therefore, position $(w, w)$ in $\mathcal{M}_{R \to L}(h)$ is focused on. For regions with non-zero disparities, e.g., the red region in Fig. 4(b) with a disparity of 5, the pixel $(h, w)$ in $I_L$ corresponds to the pixel $(h, w-5)$ in $I_R$. Therefore, position $(w, w-5)$ in $\mathcal{M}_{R \to L}(h)$ is focused on. In summary, stereo correspondence can be depicted by the positions of the focused pixels in the parallax-attention maps.

Furthermore, occlusion can be encoded by the parallax-attention maps. Specifically, it can be observed from $\mathcal{M}_{R \to L}(h)$ in Fig. 4(b) that several vertical regions are “discarded” without any position being focused on. That is because these regions in $I_R$ are occluded in $I_L$, thus no correspondence should be focused on. Similar “discarded” horizontal regions are also caused by occlusion.

It should be noted that only integer disparities are considered in our toy example, which is not the case in practice. In practice, our PAM can focus on adjacent pixels to handle sub-pixel disparities. Due to the softmax layer used in the PAM, several pixels in “discarded” horizontal regions may be incorrectly focused on. However, these occluded regions can be excluded using valid masks, as demonstrated in Sec. 3.3.

3.2 Left-Right Consistency and Cycle Consistency

To capture reliable and consistent correspondence, we introduce left-right consistency and cycle consistency to regularize our PAM.

Given feature representations extracted from a pair of stereo images $I_L$ and $I_R$, two parallax-attention maps ($\mathcal{M}_{R \to L}$ and $\mathcal{M}_{L \to R}$) are generated by our PAM. Ideally, the following left-right consistency can be obtained if our PAM captures accurate correspondence:

$I_L = \mathcal{M}_{R \to L} \otimes I_R, \qquad I_R = \mathcal{M}_{L \to R} \otimes I_L$ (1)

where $\otimes$ denotes geometry-aware matrix multiplication. Based on Eq. (1), we can further derive a cycle consistency:

$I_L = \mathcal{M}_{L \to R \to L} \otimes I_L, \qquad I_R = \mathcal{M}_{R \to L \to R} \otimes I_R$ (2)

where the cycle-attention maps $\mathcal{M}_{L \to R \to L}$ and $\mathcal{M}_{R \to L \to R}$ can be calculated as:

$\mathcal{M}_{L \to R \to L} = \mathcal{M}_{R \to L} \otimes \mathcal{M}_{L \to R}, \qquad \mathcal{M}_{R \to L \to R} = \mathcal{M}_{L \to R} \otimes \mathcal{M}_{R \to L}$ (3)
Fig. 5: An illustration of geometry-aware matrix multiplication $\otimes$.

We further illustrate the geometry-aware matrix multiplication $\otimes$ in Fig. 5. Note that $\otimes$ can be implemented using tf.matmul() or torch.matmul(). Taking Eq. (1) as an example, the product of corresponding slices in $\mathcal{M}_{R \to L}$ and $I_R$ (e.g., $\mathcal{M}_{R \to L}(h)$ and $I_R(h)$) determines the corresponding slice of $I_L$ (i.e., $I_L(h)$). All these slices are concatenated to obtain $I_L$.

3.3 Valid Mask

Since left-right consistency and cycle consistency do not hold for occluded regions, we perform occlusion detection based on the parallax-attention map to generate valid masks and only enforce consistency regularization on those valid regions.

As illustrated in Sec. 3.1.2, occlusion is encoded as the “discarded” regions in the parallax-attention maps. In practice, vertical “discarded” regions that correspond to occluded regions are usually assigned small weights in the parallax-attention maps (e.g., $\mathcal{M}_{L \to R}$). That is because occluded pixels in the left image cannot find their correspondences in the right image, so their feature similarities at all disparities are low. Therefore, a valid mask $V_L$ can be obtained by:

$V_L(h, w) = \begin{cases} 1, & \text{if } \sum_{k} \mathcal{M}_{L \to R}(h, k, w) > \tau \\ 0, & \text{otherwise} \end{cases}$ (4)

where $\tau$ is a threshold (empirically set to 0.1 in this paper). An example of a valid mask is shown in Fig. 6. In practice, we use several morphological operations to handle isolated pixels and holes in the valid masks.

In summary, our PAM provides a flexible and effective approach for unsupervised stereo correspondence learning and can be applied to various stereo vision tasks. In this paper, we demonstrate the effectiveness of PAM on two typical stereo tasks: stereo matching and stereo image SR.

Fig. 6: Visualization of valid masks. A stereo image pair and the occluded regions (i.e., yellow regions) are illustrated.
Fig. 7: An overview of our PASMnet.
Fig. 8: An illustration of matching cost aggregation in our cascaded parallax-attention module.

4 PAM for Unsupervised Stereo Matching

Stereo matching aims at finding corresponding pixels between a stereo image pair. Since disparities can vary significantly in real-world applications, the fixed maximum disparity in cost volume techniques hinders them from handling large disparity variations. In contrast, our PAM avoids setting a fixed maximum disparity by efficiently calculating feature correlation between any two positions along the epipolar line. Therefore, the PAM can be used for unsupervised stereo matching to handle large disparity variations.

4.1 Network Architecture

4.1.1 Overview

Based on our PAM, we propose a parallax-attention stereo matching network (PASMnet). The architecture of our PASMnet is shown in Fig. 7. Given a stereo image pair, the two images are first fed to an hourglass network for feature extraction. Then, the features extracted from the left and right images are fed to a cascaded parallax-attention module to regress matching costs in a coarse-to-fine manner. Next, an initial disparity can be obtained from the matching cost using an output module. Finally, the initial disparity is further refined using an hourglass network to produce the output disparity.

4.1.2 Cascaded Parallax-Attention Module

After feature extraction, features from the left and right images are fed to the cascaded parallax-attention module to regress matching costs in a coarse-to-fine manner. Specifically, our cascaded parallax-attention module consists of 3 stages (with 4 parallax-attention blocks in each stage), as shown in Fig. 7(b). First, the features extracted at the coarsest scale and initial matching costs (with all elements set to 0) are passed to the first parallax-attention block. The left and right features are fed to two convolutions for feature adaption. Note that the convolutions for the left and right images share the same parameters. Then, query features Q and key features K are obtained from the adapted features through $1 \times 1$ convolutions. Next, K is reshaped and multiplied with Q, resulting in a matching cost $C_{R \to L}$. Once $C_{R \to L}$ is ready, Q and K are exchanged to obtain $C_{L \to R}$. After that, $C_{R \to L}$ and $C_{L \to R}$ are added to the input matching costs through residual connections. Meanwhile, the adapted features are added to the input features, also in a residual manner.

As shown in Fig. 7(b), the features and matching costs produced by the preceding parallax-attention block are fed to the succeeding block. The features produced by stage 1 are bilinearly upsampled and concatenated with the higher-resolution features from the feature extraction network (Fig. 7(a)). Then, the concatenated features and the upsampled matching costs ($C_{R \to L}$, $C_{L \to R}$) are passed to stages 2 and 3 for further refinement, resulting in the final matching costs.

Different from the cost volume based methods [4, 5] that explicitly aggregate matching costs using 3D convolutions, our network performs implicit matching cost aggregation by cascading several parallax-attention blocks, as shown in Fig. 8. Here, only one convolutional layer (without batch normalization and ReLU layers) in each parallax-attention block is considered for simplicity. In the $n$-th parallax-attention block, the matching cost between the locations $(h, w)$ of the left features and $(h, k)$ of the right features is computed as:

$C^{n}(h, w, k) = f\left( K_{Q}^{n} F_{L}^{n}(h, w),\, K_{K}^{n} F_{R}^{n}(h, k) \right)$ (5)

where $K_{Q}^{n}$ and $K_{K}^{n}$ are the kernels of the two convolutional layers, $F_{L}^{n}(h, w)$ and $F_{R}^{n}(h, k)$ are the corresponding features, and $f(\cdot, \cdot)$ represents the matching cost calculation between the pair of input features. In the $(n{+}1)$-th block, features from the $n$-th block within a local neighborhood $\mathcal{N}$ are aggregated to produce $F_{L}^{n+1}$ and $F_{R}^{n+1}$:

$F_{L}^{n+1}(h, w) = \sum_{(i, j) \in \mathcal{N}} K_{i,j}^{n+1} F_{L}^{n}(h{+}i, w{+}j), \qquad F_{R}^{n+1}(h, k) = \sum_{(i, j) \in \mathcal{N}} K_{i,j}^{n+1} F_{R}^{n}(h{+}i, k{+}j)$ (6)

where $K_{i,j}^{n+1}$ are kernels at different sub-locations of two convolutional layers with shared parameters, and $F_{L}^{n+1}$ and $F_{R}^{n+1}$ represent the output features. Consequently, the matching cost is computed as:

$C^{n+1}(h, w, k) = f\left( K_{Q}^{n+1} F_{L}^{n+1}(h, w),\, K_{K}^{n+1} F_{R}^{n+1}(h, k) \right)$ (7)

where $K_{Q}^{n+1}$ and $K_{K}^{n+1}$ are the kernels of the two convolutional layers, and $f(\cdot, \cdot)$ represents the matching cost calculation between the pair of input features. Note that $F_{L}^{n+1}$ and $F_{R}^{n+1}$ are aggregated from the $n$-th parallax-attention block. That is, matching costs in a local neighborhood are implicitly aggregated as parallax-attention blocks are cascaded. Moreover, the residual connections further enable the aggregation of matching costs from different network depths.

4.1.3 Output Module

As shown in Fig. 7(a), the matching cost of the last block at each stage in the cascaded parallax-attention module is fed to an output module. Within the output module of stage 3, $C_{R \to L}$ and $C_{L \to R}$ are first fed to a softmax layer to produce parallax-attention maps $\mathcal{M}_{R \to L}$ and $\mathcal{M}_{L \to R}$, as shown in Fig. 7(c). Next, $\mathcal{M}_{R \to L}$ and $\mathcal{M}_{L \to R}$ are used to generate valid masks $V_L$ and $V_R$. Once $\mathcal{M}_{R \to L}$ is obtained, the disparity can be regressed as:

$D(h, w) = \sum_{k} \mathcal{M}_{R \to L}(h, w, k) \times (w - k)$ (8)

Note that the estimated disparity is a sum of all disparity candidates weighted by the parallax-attention map. Consequently, our PASMnet does not need a manually set, fixed maximum disparity and can handle large disparity variations. Since pixels in occluded regions cannot find their correspondences, disparities in occluded regions cannot be well estimated. Therefore, we exclude these invalid disparities and fill the occluded regions using partial convolution [47].

4.1.4 Disparity Refinement

After the cascaded parallax-attention module, a disparity refinement module is introduced that uses features from the left image as guidance to provide structural information such as edges. As shown in Fig. 7(a), the initial disparity $D_{init}$ is concatenated with the left features and fed to an hourglass network to produce a residual disparity map $D_{res}$ and a confidence map $C$. Finally, the refined disparity $D_{ref}$ is calculated as:

$D_{ref} = C \odot (D_{init})^{\uparrow} + (1 - C) \odot D_{res}$ (9)

where $\uparrow$ is a bilinear upsampling operator and $\odot$ denotes element-wise multiplication.

4.2 Losses

4.2.1 Photometric Loss

Following [48, 49, 28], we introduce a photometric loss consisting of a mean absolute error (MAE) loss term and a structural similarity index (SSIM) loss term. Note that, since photometric consistency only holds in non-occluded regions, the photometric loss is formulated as:

$\mathcal{L}_{photo} = \frac{1}{|V_L|} \sum_{p \in V_L} \left( \alpha \frac{1 - \mathrm{SSIM}\left(I_L(p), \tilde{I}_L(p)\right)}{2} + (1 - \alpha) \left\| I_L(p) - \tilde{I}_L(p) \right\|_1 \right)$ (10)

where $\tilde{I}_L = \mathcal{W}(I_R)$ and $\mathcal{W}$ is a warping operator using the refined disparity, $\mathrm{SSIM}(\cdot)$ is an SSIM function, $p$ represents a valid pixel covered by the valid mask $V_L$, $|V_L|$ is the number of valid pixels, and $\alpha$ is empirically set to 0.85 in this paper.

4.2.2 Smoothness Loss

Following [48, 28], we use an edge-aware smoothness loss to encourage local smoothness of the disparity, which is defined as:

$\mathcal{L}_{smooth} = \frac{1}{N} \sum_{p} \left( \left| \nabla_x D(p) \right| e^{-\left\| \nabla_x I_L(p) \right\|} + \left| \nabla_y D(p) \right| e^{-\left\| \nabla_y I_L(p) \right\|} \right)$ (11)

where $\nabla_x$ and $\nabla_y$ are gradients in the $x$ and $y$ axes, respectively, and $N$ is the number of pixels.

4.2.3 PAM Loss

We introduce three additional losses to regularize our PAM at multiple scales to capture accurate and consistent stereo correspondence. The PAM loss term for scale $s$ is defined as:

$\mathcal{L}_{PAM}^{s} = \lambda_{PAM\text{-}p} \mathcal{L}_{PAM\text{-}p}^{s} + \lambda_{PAM\text{-}s} \mathcal{L}_{PAM\text{-}s}^{s} + \lambda_{PAM\text{-}c} \mathcal{L}_{PAM\text{-}c}^{s}$ (12)

Different from the aforementioned photometric loss, here we introduce a photometric loss based on the parallax-attention maps:

$\mathcal{L}_{PAM\text{-}p}^{s} = \frac{1}{|V_{L}^{s}|} \sum_{p \in V_{L}^{s}} \left\| I_{L}^{s}(p) - \left( \mathcal{M}_{R \to L}^{s} \otimes I_{R}^{s} \right)(p) \right\|_1 + \frac{1}{|V_{R}^{s}|} \sum_{p \in V_{R}^{s}} \left\| I_{R}^{s}(p) - \left( \mathcal{M}_{L \to R}^{s} \otimes I_{L}^{s} \right)(p) \right\|_1$ (13)

where $|V_{L}^{s}|$ and $|V_{R}^{s}|$ are the numbers of valid pixels in $V_{L}^{s}$ and $V_{R}^{s}$, respectively, and $I_{L}^{s}$ and $I_{R}^{s}$ are bilinearly downsampled images at the corresponding scale level.

Different from the aforementioned smoothness loss, here another smoothness loss is introduced to directly regularize the parallax-attention maps:

$\mathcal{L}_{PAM\text{-}s}^{s} = \frac{1}{N} \sum_{\mathcal{M}} \sum_{h, w, k} \left( \left\| \mathcal{M}(h, w, k) - \mathcal{M}(h{+}1, w, k) \right\|_1 + \left\| \mathcal{M}(h, w, k) - \mathcal{M}(h, w{+}1, k{+}1) \right\|_1 \right)$ (14)

where $\mathcal{M} \in \{\mathcal{M}_{L \to R}^{s}, \mathcal{M}_{R \to L}^{s}\}$ and $N$ is the number of pixels in $\mathcal{M}$. The first and second terms are used to achieve vertical and horizontal attention consistency, respectively. Taking $\mathcal{M}_{R \to L}$ as an example, $\mathcal{M}_{R \to L}(h, w, k)$ measures the contribution of pixel $(h, k)$ in $I_R$ to pixel $(h, w)$ in $I_L$ using their feature similarity. Our smoothness loss enforces $\mathcal{M}_{R \to L}(h{+}1, w, k)$ and $\mathcal{M}_{R \to L}(h, w{+}1, k{+}1)$ to be close to $\mathcal{M}_{R \to L}(h, w, k)$. Consequently, smoothness in correspondence (disparity) space can be encouraged.

In addition to $\mathcal{L}_{PAM\text{-}p}^{s}$ and $\mathcal{L}_{PAM\text{-}s}^{s}$, we further introduce a cycle loss to achieve cycle consistency. Since $\mathcal{M}_{L \to R \to L}$ and $\mathcal{M}_{R \to L \to R}$ in Eq. (2) can be considered as identity matrices, we design a cycle loss as:

$\mathcal{L}_{PAM\text{-}c}^{s} = \frac{1}{N} \left( \left\| \mathcal{M}_{L \to R \to L}^{s} - \mathcal{I} \right\|_1 + \left\| \mathcal{M}_{R \to L \to R}^{s} - \mathcal{I} \right\|_1 \right)$ (15)

where $\mathcal{I}$ is a stack of identity matrices.

In summary, the overall loss is defined as:

$\mathcal{L} = \mathcal{L}_{photo} + \lambda_{smooth} \mathcal{L}_{smooth} + \sum_{s} \mathcal{L}_{PAM}^{s}$ (16)

Note that, groundtruth disparities are not used in the overall loss. That is, our network is trained in an unsupervised manner.

4.3 Experimental Results

4.3.1 Datasets and Metrics

We used the SceneFlow [11] and KITTI 2015 [50] datasets for training and test.

SceneFlow: The SceneFlow dataset is a synthetic dataset consisting of 35454 training image pairs and 4370 test image pairs of size $960 \times 540$.

KITTI 2015: The KITTI 2015 dataset is a real-world dataset with street views collected from a driving car. This dataset contains 200 training image pairs and 200 test image pairs of size $1242 \times 375$.

For evaluation, we used the end-point error (EPE) and the $t$-pixel error rate (the percentage of pixels with disparity errors larger than $t$ pixels) as metrics. Metrics for both non-occluded regions (Noc) and all pixels (All) are evaluated.

4.3.2 Implementation Details

We first trained our network on the SceneFlow dataset and performed the ablation study on this dataset. During the training phase, patches were randomly cropped from the images as inputs. The loss weights $\lambda_{smooth}$, $\lambda_{PAM\text{-}p}$, $\lambda_{PAM\text{-}s}$ and $\lambda_{PAM\text{-}c}$ are set to 1, 1, 0.1, and 1, respectively. All models were optimized using the Adam method [51] with $\beta_1 = 0.9$, $\beta_2 = 0.999$ and a batch size of 14. The initial learning rate was used for the first 5 epochs and then decreased for another 5 epochs. Note that pixels with disparities over 192 are excluded during training to demonstrate the generalization of our network to large disparity variations, since our test set contains image pairs with disparities over 192. Following [28], pixels with disparities over 192 are also excluded for evaluation unless specified otherwise. Next, we fine-tuned our network on the KITTI 2015 dataset to obtain the final model for submission. For this dataset, $\lambda_{smooth}$, $\lambda_{PAM\text{-}p}$, $\lambda_{PAM\text{-}s}$ and $\lambda_{PAM\text{-}c}$ are set to 5, 5, 0.5, and 1, respectively, and the initial learning rate was used for the first 60 epochs and then decreased for another 20 epochs. All experiments were conducted on a PC with two Nvidia RTX 2080Ti GPUs.

Model | Hourglass | Coarse-to-Fine | # Blocks | EPE | >1px (%) | >3px (%)
PASMnet | ✗ | ✓ | 1 | 4.86 | 19.58 | 16.31
PASMnet | ✓ | ✗ | 1 | 4.78 | 19.50 | 16.21
PASMnet | ✓ | ✓ | 1 | 4.68 | 19.32 | 16.09
PASMnet | ✓ | ✓ | 2 | 4.63 | 19.19 | 16.05
PASMnet | ✓ | ✓ | 3 | 4.58 | 19.05 | 15.97
PASMnet | ✓ | ✓ | 4 | 4.54 | 18.99 | 15.91
TABLE I: Comparative results achieved on SceneFlow by our PASMnet trained with different settings.

4.3.3 Ablation Study

Hourglass Feature Extraction. An hourglass network is used to aggregate features from multiple scales in our PASMnet. To demonstrate its effectiveness, we replaced the hourglass network with a pyramid network for feature extraction. It can be observed from Table I that the 1-pixel/3-pixel error rates increase from 19.32/16.09 to 19.58/16.31 if the hourglass network is replaced with a pyramid network. As compared to a pyramid network, an hourglass network can aggregate features from multiple scales, resulting in features rich in multi-scale geometric information. Therefore, better performance can be achieved.

Coarse-to-fine Manner. Our cascaded parallax-attention module regresses matching costs in a coarse-to-fine manner. To demonstrate its effectiveness, we introduced a variant that regresses matching costs at a single scale. It can be observed from Table I that if matching costs are regressed at a single scale, the 1-pixel/3-pixel error rates increase from 19.32/16.09 to 19.50/16.21. By using a coarse-to-fine manner, disparities can be progressively refined and better performance can be achieved.

Model | EPE | >1px (%) | >3px (%)
PASMnet_wo_Refinement | 4.56 | 19.15 | 16.11
PASMnet | 4.54 | 18.99 | 15.91
TABLE II: Ablation results for the disparity refinement module in our PASMnet on SceneFlow.

Number of Cascaded Blocks. We tested the performance of our PASMnet with different numbers of parallax-attention blocks. From Table I we can see that the performance of our network improves as the number of parallax-attention blocks increases. Specifically, our PASMnet with 4 blocks outperforms the variant with 1 block, with the 1-pixel/3-pixel error rates being improved from 19.32/16.09 to 18.99/15.91. That is because, by cascading more blocks, our network can progressively refine the matching cost for better performance. We further fed the matching costs produced by the last two parallax-attention blocks in stage 3 (i.e., blocks 3 and 4) to the output module for disparity regression. The disparities, matching cost distributions and parallax-attention distributions are visualized in Fig. 9. As more parallax-attention blocks are cascaded, the matching cost is aggregated to produce higher peak values at the groundtruth disparity in the parallax-attention distributions. Therefore, more accurate and smoother disparities are produced.

Disparity Refinement. After the cascaded parallax-attention module, disparity refinement is performed by using features from the left image to provide structural information. To demonstrate the effectiveness of disparity refinement, we introduced a variant by removing the disparity refinement module and directly upsampling the initial disparity to obtain the final disparity. It can be observed from Table II that the 1-pixel/3-pixel error rates increase from 18.99/15.91 to 19.15/16.11 without disparity refinement. That is because structural information helps to effectively refine the disparity. We further visualize the confidence map learned within the disparity refinement module in Fig. 10. We can see that occluded regions and edge regions benefit a lot from disparity refinement, while other regions in the initial disparity are already “good enough”.

Fig. 9: Matching cost distributions, parallax-attention distributions and disparities produced by blocks 3 and 4 in stage 3 of our cascaded parallax-attention module. The red lines in the first and second rows represent the groundtruth disparity. As compared to block 3, block 4 produces more accurate and smoother disparities by cascading more parallax-attention blocks, especially in regions denoted with white boxes.
Fig. 10: An example of the initial disparity $D_{init}$, confidence map $C$ and refined disparity $D_{ref}$ produced by our PASMnet.

PAM vs. Cost Volume. To demonstrate the efficiency of the proposed PAM, we compare our parallax-attention block with 3D convolution in terms of model size, FLOPs and memory consumption in Table III. With the same number of channels, our parallax-attention block has fewer parameters than 3D convolution. When performed at a low resolution level on images of the SceneFlow size, our block achieves a reduction in FLOPs and a comparable memory cost as compared to 3D convolution. As the resolution increases, better efficiency is achieved by our parallax-attention block, with larger reductions in both FLOPs and memory cost.

We further introduced a network variant by replacing the PAM with a 4D cost volume technique and then compared the stereo matching performance of this variant to our PASMnet. Specifically, we replaced the four cascaded parallax-attention blocks at each stage of our cascaded parallax-attention module with four 3D convolutions. Note that the number of channels is adjusted to ensure that this variant has a comparable number of parameters to our PASMnet. The maximum disparity is set to 192 for the cost volume technique. Comparative results achieved on SceneFlow are shown in Table IV. Compared to the 4D cost volume technique, our PAM achieves much better performance with comparable model size and inference time. With the proposed PAM loss, we can directly regularize our PAM to capture accurate and consistent stereo correspondence. Therefore, better performance can be achieved. Moreover, our PAM also achieves higher efficiency, with its memory consumption being half that of the cost volume technique.

TABLE III: Comparison between our parallax-attention block (two convolutions and a geometry-aware matrix multiplication $\otimes$) and 3D convolution in terms of parameters, FLOPs and memory.
Model | Params. | EPE | >1px (%) | >3px (%)
Cost Volume | 7.35M | 6.02 | 19.79 | 17.42
PAM | 7.61M | 4.54 | 18.99 | 15.91
TABLE IV: Comparison between our PAM and the cost volume technique on SceneFlow.
Losses | EPE | >1px (%) | >3px (%)
$\mathcal{L}_{photo}$ only | 6.07 | 20.78 | 17.11
$+\,\mathcal{L}_{smooth}$ | 4.87 | 19.69 | 16.54
$+\,\mathcal{L}_{PAM\text{-}p}$ | 4.63 | 19.46 | 16.12
$+\,\mathcal{L}_{PAM\text{-}s}$ | 4.58 | 19.13 | 15.99
$+\,\mathcal{L}_{PAM\text{-}c}$ (full) | 4.54 | 18.99 | 15.91
TABLE V: Comparative results achieved on SceneFlow by our PASMnet trained with different losses (each row adds one loss term to the row above).
Fig. 11: An example of matching cost and parallax-attention distributions. (a) Results produced by our PASMnet trained without $\mathcal{L}_{PAM}$. (b) Results produced by our PASMnet. The red lines represent the groundtruth disparity. Using $\mathcal{L}_{PAM}$ for regularization, our PASMnet produces more reasonable cost and parallax-attention distributions, which have peak values at the groundtruth disparity with higher sharpness.
Resolution | Cost Volume (EPE / >1px / >3px) | PAM (EPE / >1px / >3px)
Highest | 44.04 / 48.53 / 47.04 | 22.02 / 33.17 / 31.43
Middle | 14.60 / 35.75 / 31.77 | 10.57 / 35.56 / 30.85
Lowest | 7.87 / 39.00 / 30.31 | 6.04 / 39.93 / 29.64
Average | 22.17 / 41.09 / 36.37 | 12.88 / 36.22 / 30.64
TABLE VI: Comparison between PAM and the cost volume technique on SceneFlow at different resolutions (rows ordered from the highest to the lowest resolution).

Losses.

We retrained our PASMnet using different losses to test the effectiveness of our loss function. It can be observed from Table V that the EPE/1-pixel error rate/3-pixel error rate achieved by our network increase from 4.54/18.99/15.91 to 6.07/20.78/17.11 if only the photometric loss is used for training. That is because the photometric loss cannot handle texture-less regions well. If the smoothness loss is included for training, the EPE value decreases to 4.87. Further, the performance is gradually improved as the PAM loss terms are added. That is because the PAM loss regularizes our PAM to capture accurate and consistent stereo correspondence.

An example of the matching cost and parallax-attention distributions produced by our network trained with different losses is shown in Fig. 11. Using $\mathcal{L}_{PAM}$ for regularization, our network produces more reasonable matching cost and parallax-attention distributions. Both of them have peak values at the groundtruth disparity with high sharpness. This clearly demonstrates that performing direct regularization on parallax-attention maps enables our PAM to achieve better performance.

4.3.4 Flexibility to Disparity Variations

We tested the flexibility of our PAM with respect to large disparity variations. We chose 20 image pairs from the test set of SceneFlow in which more than 20% of the disparity values are larger than 200 for evaluation. Note that all pixels were used for evaluation.

Resolutions. We resized the chosen image pairs to different resolutions to test the flexibility of our PAM. Results achieved on images with different resolutions are shown in Table VI. It can be observed that our PAM outperforms the cost volume technique on all metrics except the 1-pixel error rate at the lowest resolution. Moreover, the performance improvement achieved by our PAM is enhanced as the resolution increases. That is because the fixed maximum disparity hinders the cost volume technique from capturing longer-range correspondence. In contrast, our PAM is more flexible and robust to large disparity variations and achieves better performance.

Fig. 12: Results achieved on the KITTI 2015 dataset.
Disparity | Cost Volume (EPE / >1px / >3px) | PAM (EPE / >1px / >3px)
>200 | 150.21 / 99.99 / 99.99 | 57.90 / 41.62 / 41.60
100–200 | 36.24 / 45.80 / 45.79 | 25.16 / 33.15 / 33.15
<100 | 13.71 / 33.92 / 31.92 | 16.31 / 34.57 / 31.81
Average | 44.04 / 48.53 / 47.04 | 22.02 / 33.17 / 31.43
TABLE VII: Comparison between PAM and the cost volume technique on SceneFlow for regions with different depths (disparity ranges).

Depths. We compared the performance on regions with different depths (disparities) to test the flexibility of our PAM. Results achieved on regions with different depths are shown in Table VII. It can be observed that the performance of our PAM is significantly better than that of the cost volume technique on regions with disparities larger than 100. That is because our PAM avoids setting a fixed maximum disparity and can thus handle large disparities. In contrast, the cost volume technique cannot capture correspondences with disparities larger than its fixed maximum disparity.

Method | D1-bg (Noc) | D1-fg (Noc) | D1-all (Noc) | D1-bg (All) | D1-fg (All) | D1-all (All)
Supervised:
DispNet [11] | 4.11 | 3.72 | 4.05 | 4.32 | 4.41 | 4.34
GC-Net [4] | 2.02 | 5.58 | 2.61 | 2.21 | 6.16 | 2.87
CRL [52] | 2.32 | 3.12 | 2.45 | 2.48 | 3.59 | 2.67
iResNet [14] | 2.07 | 2.76 | 2.19 | 2.25 | 3.40 | 2.44
PSMNet [5] | 1.71 | 4.31 | 2.14 | 1.86 | 4.62 | 2.32
PASMnet_192 (ours) | 1.88 | 3.91 | 2.22 | 2.04 | 4.33 | 2.41
Unsupervised:
USCNN [53] | - | - | 11.71 | - | - | 16.55
Yu et al. [54] | - | - | 8.35 | - | - | 19.14
Zhou et al. [27] | - | - | 8.61 | - | - | 9.91
SegStereo [24] | - | - | 7.70 | - | - | 8.79
OASM [28] | 5.44 | 17.30 | 7.39 | 6.89 | 19.42 | 8.98
PASMnet (ours) | 5.35 | 15.24 | 6.99 | 5.89 | 16.74 | 7.70
PASMnet_192 (ours) | 5.02 | 15.16 | 6.69 | 5.41 | 16.36 | 7.23
TABLE VIII: Comparison to existing supervised and unsupervised stereo matching methods on KITTI 2015.

4.3.5 Comparison to State-of-the-arts

We compare our PASMnet with existing unsupervised stereo matching methods [53, 54, 27, 24, 28] on the KITTI 2015 dataset. Note that [24] and [28] use a maximum disparity of 192 as prior knowledge for both training and test. For a fair comparison, we also trained a variant with a maximum disparity of 192. Specifically, we excluded positions with disparities larger than the maximum disparity in the attention maps. Moreover, we include several supervised stereo matching methods [11, 4, 52, 14, 5] and trained our PASMnet_192 in a supervised manner for comparison. Specifically, we first finetuned our PASMnet_192 (which was trained on SceneFlow in an unsupervised manner) on SceneFlow using groundtruth disparities as supervision. Then, this model was further finetuned on the KITTI 2015 dataset. As we can see from Table VIII, our PASMnet outperforms other unsupervised methods by notable margins and significantly reduces the performance gap between supervised and unsupervised methods. With a prior maximum disparity, the performance of our PASMnet is further improved. Specifically, our PASMnet_192 achieves a much lower 3-pixel error rate (6.69/7.23) than OASM (7.39/8.98). If supervision is provided, our PASMnet_192 produces results competitive with iResNet and PSMNet. Figure 12 shows several visual examples. From the white boxes in Fig. 12, it can be observed that our PASMnet produces much more accurate and smoother results.

Fig. 13: An overview of our PASSRnet.

5 PAM for Stereo Image Super-Resolution

Stereo image SR aims at recovering an HR left image from a pair of LR stereo images. The key challenge for stereo image SR lies in exploiting stereo correspondence to aggregate information from the pair of stereo images. Our PAM can effectively aggregate information from a pair of stereo images using the parallax-attention map as its guidance. Therefore, the PAM can be applied to the stereo image SR task.

5.1 Network Architecture

5.1.1 Overview

Based on our PAM, we propose a parallax-attention stereo super-resolution network (PASSRnet) for stereo image SR. Specifically, the stereo images are first fed to a convolutional layer and a residual block for feature extraction. Then, the resulting features are passed to a residual atrous spatial pyramid pooling (ASPP) module to extract deep feature representations. Next, a parallax-attention module is used to aggregate features from the stereo pair. Finally, 4 residual blocks and a sub-pixel convolutional layer are used to generate the SR results. The architecture of our PASSRnet is shown in Fig. 13.

5.1.2 Residual Atrous Spatial Pyramid Pooling (ASPP) Module

Since feature representations with rich context information are beneficial to correspondence estimation [5], we propose a residual ASPP module to enlarge the receptive field and extract hierarchical features with dense pixel sampling rates and scales. As shown in Fig. 13(a), input features are first fed to a residual ASPP block to generate multi-scale features. The resulting features are then sent to a residual block for feature fusion. This structure is repeated twice to produce the final features. Within each residual ASPP block (as shown in Fig. 13(b)), we first combine three dilated convolutions (with dilation rates of 1, 4, 8) to form an ASPP group, and then cascade three ASPP groups in a residual manner. Our residual ASPP block not only enlarges the receptive field, but also enriches the diversity of convolutions, resulting in an ensemble of convolutions with different receptive regions and dilation rates. The highly discriminative features learned by our residual ASPP module are beneficial to the overall SR performance, as demonstrated in Sec. 5.4.

5.1.3 Parallax-Attention Module

Based on our PAM, we introduce a parallax-attention module to exploit stereo correspondence for the aggregation of features from a pair of stereo images. As shown in Fig. 13(c), features from the left/right images are first fed to a transition residual block for feature adaption. Then, the resulting features are used to generate the output feature O and the parallax-attention map $\mathcal{M}_{R \to L}$ based on the PAM. Next, the features from the left/right images are exchanged to produce $\mathcal{M}_{L \to R}$ for valid mask generation. Finally, the concatenation of the output feature, the feature from the left image and the valid mask is fed to a $1 \times 1$ convolution for feature fusion.

It should be noted that our PASSRnet can be considered as a multi-task network to learn both stereo correspondence and SR. However, using shared features for different tasks usually suffers from training conflict [55]. Therefore, a transition block is used in our parallax-attention module to alleviate this problem. The effectiveness of the transition block is demonstrated in Sec. 5.4.

5.2 Losses

We used the mean square error (MSE) loss as the SR loss:

$\mathcal{L}_{SR} = \left\| I_{L}^{SR} - I_{L}^{HR} \right\|_{2}^{2}$ (17)

where $I_{L}^{SR}$ and $I_{L}^{HR}$ represent the SR result and the HR groundtruth of the left image, respectively. Other than the SR loss, we also used the PAM loss (as defined in Sec. 4.2.3) to help our network exploit the correspondence between stereo images. The overall loss is formulated as:

$\mathcal{L} = \mathcal{L}_{SR} + \lambda \mathcal{L}_{PAM}$ (18)

where $\lambda$ is empirically set to 0.005. The performance of our network with different losses will be analyzed in Sec. 5.4. Note that stereo correspondence is learned in an unsupervised manner using $\mathcal{L}_{PAM}$, while SR is learned in a supervised manner using $\mathcal{L}_{SR}$.

5.3 The Flickr1024 Dataset

Although several stereo datasets such as Middlebury [56] and KITTI [57, 50] are already available, these datasets are mainly proposed for stereo matching. Further, the Middlebury dataset only consists of close shots of man-made objects, while the KITTI 2012 and KITTI 2015 datasets only consist of road scenes. For the stereo image SR task, a large dataset which covers diverse scenes and consists of images with high quality and rich details is required. Therefore, we introduce a new Flickr1024 dataset [58] for stereo image SR, which covers a large diversity of scenes, including landscapes, urban scenes, people and man-made objects, as shown in Fig. 14. Our Flickr1024 dataset is available at: https://yingqianwang.github.io/Flickr1024.

5.3.1 Data Collection and Processing

We manually collected 1024 pairs of RGB stereo images from Flickr using tags such as stereophotography, stereoscopic and cross-eye 3D. All images are stereograms (as shown in Fig. 15) taken by amateur photographers using dual lenses or dual cameras.

Fig. 14: Samples of different scenes covered in the Flickr1024 dataset.
Fig. 15: An example stereogram collected from Flickr.

First, we cut each stereogram into a pair of images and crop their black margins. The resulting left and right images from a stereogram are exchanged since the stereograms are provided in cross-eye mode. Note that these stereo image pairs are originally shifted to a common focus plane by the photographers to produce a perception of 3D for viewers. In other words, both positive and negative disparities exist in these image pairs. Therefore, we roughly shift these images back to ensure that zero disparity corresponds to infinite depth. For close shots, since regions with infinite depth are unavailable, we simply shift these images to make the minimum disparity larger than a threshold (empirically set to 40 pixels in our dataset). Finally, we crop each resulting image to a multiple of 12 pixels on both axes following [59].

5.3.2 Comparison to Existing Datasets

The Flickr1024 dataset is compared to three widely used stereo datasets: Middlebury, KITTI 2012 and KITTI 2015. CNNIQA [60] and SR-metric [61] are used to evaluate image quality. It can be observed from Table IX that our Flickr1024 dataset is at least 5 times larger than the other datasets. Besides, the pixels-per-image (ppi) value of our Flickr1024 dataset is nearly 2 times those of the KITTI 2012 and KITTI 2015 datasets. Although the Middlebury dataset has the highest ppi value, the number of image pairs in this dataset is very limited. Further, our Flickr1024 dataset achieves comparable or even better CNNIQA and SR-metric values than the other datasets, which demonstrates the good image quality of our Flickr1024 dataset.

5.4 Experimental Results

5.4.1 Datasets

For training, we followed [37] and downsampled 60 Middlebury [56] images by a factor of 2 to generate HR images. We further used Flickr1024 as additional training data for our PASSRnet. For test, we used 5 images from the Middlebury dataset (cloth2, motorcycle, piano, pipes and sword2), 20 images from the KITTI 2012 dataset [57] (000000_10 to 000019_10) and 20 images from the KITTI 2015 dataset [50] (000000_10 to 000019_10). For validation, we selected another 20 images from the KITTI 2012 dataset.

Dataset | Image pairs | ppi | CNNIQA (lower is better) | SR-metric (higher is better)
Middlebury | 65 | 3511605 | 20.18 | 6.01
KITTI 2012 | 194 | 462564 | 20.32 | 7.15
KITTI 2015 | 200 | 465573 | 22.86 | 7.06
Flickr1024 | 1024 | 800486 | 19.75 | 7.12
TABLE IX: Comparison between the Middlebury, KITTI 2012, KITTI 2015 and Flickr1024 datasets. Only the training sets of the KITTI 2012 and KITTI 2015 datasets are considered.

5.4.2 Implementation Details

Training Details.

We first downsampled HR images using bicubic interpolation to generate LR images, and then cropped patches with a stride of 20 from these LR images. Meanwhile, their corresponding patches in the HR images were cropped. The horizontal patch size was set to 90 to cover most disparities (about 96%) in our training dataset. These patches were randomly flipped horizontally and vertically for data augmentation. Note that rotation was not performed for augmentation, since it would destroy the epipolar constraints.

Our PASSRnet was implemented in PyTorch on a PC with an Nvidia GTX 1080Ti GPU. All models were optimized using the Adam method [51] with $\beta_1 = 0.9$, $\beta_2 = 0.999$ and a batch size of 32. The learning rate was reduced to half after every 30 epochs. The training was stopped after 80 epochs, since more epochs did not provide further improvement.

Metrics.

We used peak signal-to-noise ratio (PSNR) and SSIM to test the SR performance. Following [37], we cropped borders to achieve a fair comparison.

Model Input PSNR