ResFPN: Residual Skip Connections in Multi-Resolution Feature Pyramid Networks for Accurate Dense Pixel Matching

by Rishav, et al.

Dense pixel matching is required for many computer vision algorithms such as disparity, optical flow or scene flow estimation. Feature Pyramid Networks (FPN) have proven to be a suitable feature extractor for CNN-based dense matching tasks. FPN generates well localized and semantically strong features at multiple scales. However, the generic FPN is not utilizing its full potential, due to its reasonable but limited localization accuracy. Thus, we present ResFPN – a multi-resolution feature pyramid network with multiple residual skip connections, where at any scale, we leverage the information from higher resolution maps for stronger and better localized features. In our ablation study, we demonstrate the effectiveness of our novel architecture with clearly higher accuracy than FPN. In addition, we verify the superior accuracy of ResFPN in many different pixel matching applications on established datasets like KITTI, Sintel, and FlyingThings3D.




I Introduction

Dense pixel matching is the task of finding pixel-wise correspondences across different images. It is one of the core challenges in computer vision and is used by many algorithms such as optical flow, scene flow and disparity estimation. Traditionally, heuristic feature descriptors (e.g. SIFT [lowe1999sift] or CENSUS [zabih1994census]) were used to represent every pixel via its surroundings. In recent years, CNN-based approaches trained end-to-end have achieved remarkable results for dense pixel matching [chang2018pyramid, hui2018liteflownet, saxena2019pwoc, sun2018pwc]. Within this category of algorithms, the feature representation turned out to be an essential factor for accurate matching [bailer2017cnn]. The representation must be as characteristic as possible in order to be distinguishable. In addition, it must be as localizable as possible to allow for accurate matching and to avoid small displacement mismatches. Among state-of-the-art approaches, Feature Pyramid Networks (FPN) [lin2017feature] seem to fulfill these properties best. FPN was originally proposed in the field of object detection, for which its localization is completely sufficient. However, the accuracy of the localization of FPN for dense pixel matching can be further improved.

Thus, we present ResFPN which, in contrast to FPN, combines multiple feature representations of higher resolutions via residual skip connections. This is intended to re-introduce details for better localization in the final feature representation. Further, the residual skip connections can reduce the length of gradient paths during back-propagation to improve convergence [zhu2018sparsely]. We evaluate our ResFPN in a comprehensive ablation study by validating each individual design decision in detail. In addition, we apply ResFPN to the dense pixel matching tasks of optical flow, scene flow and disparity estimation. For these experiments, we utilize state-of-the-art algorithms and change nothing but the feature description. We confirm the superior accuracy across different algorithms as well as datasets, such as KITTI [menze2015object], Sintel [butler2012naturalistic] and FlyingThings3D [mayer2016large].

II Related Work

Representations and Image Pyramids

Feature maps (i.e. dense descriptors) are the basic cues for many computer vision tasks. A large number of methods show that a proper design of feature maps improves results, especially for dense pixel-wise matching in geometric reconstruction and motion estimation. Many approaches employ handcrafted designs like SIFT [lowe1999sift], HOG [dalal2005histograms] or DAISY [tola2009daisy] features within an image pyramid structure to find dense motion matches [bailer2015flow, hu2016efficient, xu2011motion] or to estimate scene flow [schuster2018sceneflowfields]. Pyramid feature representations use information from multiple scales to further improve the estimated correspondences. However, CNNs have made feature maps more robust against ill-conditioned environments and against changes in lighting or geometry than conventional solutions. In this context, many approaches aim to learn features [choy2016universal, schuster2019sdc] for dense matching. These methods replace the conventional descriptors, but they have not been validated within end-to-end learned dense matching networks. Our ResFPN is a flexible, modular network that can be plugged in as feature backbone for end-to-end matching networks.

End-to-End Solutions using Feature Pyramids

Early end-to-end learning solutions yielded impressive results based on encoder-decoder architectures, e.g. FlowNet [dosovitskiy2015flownet, ilg2017flownet] for optical flow estimation. DispNet [mayer2016large] extends the idea of FlowNet to disparity and scene flow estimation. The main idea of the encoder-decoder network is to aggregate the information from coarse-to-fine predictions, which is useful for large displacement predictions. However, it is a memory consuming approach and its computation is inefficient. SPyNet [ranjan2017optical] is a lightweight model that aggregates information with a spatial pyramid network. Large motions can be handled with this approach. Compared to FlowNet, it is faster and yields better accuracy. PWC-Net [sun2018pwc] and LiteFlowNet [hui2018liteflownet] add warping and cost volume layers to the pyramid feature extractor which improves dense optical flow accuracy. PSMNet [chang2018pyramid] uses a spatial pyramid pooling module to enlarge the receptive field of feature maps for stereo matching. Instead of using a generic CNN as feature extractor in PWC-Net [sun2018pwc], PWOC-3D [saxena2019pwoc] employs the FPN architecture [lin2017feature] and utilizes those features for scene flow estimation with stereo images. Our ResFPN contributes to feature computation for many kinds of deep networks especially in the context of dense matching in a novel way.

Connecting Layers in Deep Neural Networks

Traditional CNN architectures establish strictly sequential connections between layers [lecun1998lenet, krizhevsky2012imagenet, simonyan2015vgg]. Recently, more involved connections have been proposed. DenseNet [huang2017densely] uses connections in a feed-forward fashion so that for each layer the feature maps of all preceding layers are used as input to strengthen feature propagation. ResNet [he2016deep] and InceptionNet [szegedy2017inception, szegedy2015inception] aim to improve deep networks through parallel shortcut connections.

Among modern architectures, the Feature Pyramid Network (FPN) [lin2017feature] leverages the concept of lateral connections for multi-level predictions based on features of multiple scales. Similar to the U-Net architecture [ronneberger2015u], it fuses feature maps between the same levels of top-down and bottom-up paths using element-wise addition. In contrast, TDM [shrivastava2016beyond] changes the lateral connections to convolutional layers and channel-wise concatenation with the output, which makes it computationally inefficient. Reverse Densely Connected Feature Pyramid Network [xin2018reverse] proposes to add reverse dense connections for the top-down module (decoder). Similarly, (A)RDFPN [zhao2019residual, zhao2019aggregated] add dilated residual connections to the top-down stream of FPNs. The previous feature modules have been presented in the context of object detection. Recently, HRNet [sun2019deep] has used multi-resolution feature maps to improve localization in the estimation of human poses.

Different from the aforementioned applications, our ResFPN uses the advantages of pyramidal networks to extract dense feature maps for dense matching tasks, namely stereo matching, optical flow, and scene flow estimation. We utilize not only connections between similar levels of feature maps across bottom-up and top-down parts like FPN [lin2017feature], but further enhance the spatial accuracy by adding new connections from high resolution feature maps of the bottom-up part to feature maps in the top-down part, as shown in Fig. 2(d).

(a) Image Pyramid.
(b) Feature Pyramid.
(c) Feature Pyramid Network (FPN) [lin2017feature].
(d) Exemplary structure of our ResFPN.
Fig. 2: Feature computation with different types of pyramids. (a) A simple image pyramid is used together with heuristic descriptors for multi-scale predictions. (b) Feature pyramids successively compress and encode the input image for multi-scale predictions. (c) Feature Pyramid Networks traverse the entire encoder and decode the representation until the required scale is reached. (d) Additional feature encodings of higher resolutions are combined during up-sampling in our ResFPN. Here, only a single additional connection per layer is visualized. Details about up-sampling and merging of ResFPN can be found in Fig. 3.

III Method

ResFPN is a generic concept that can be applied in many different applications for different tasks. The general idea is to increase the number of lateral skip connections between encoder and decoder in feature pyramid networks in order to improve the spatial accuracy while maintaining high-level feature representations.

III-A Multiple Residual Skip Connections

Our work continues with the logical extension of regular lateral skip connections to further improve localization and feature abstraction in feature pyramid networks [lin2017feature]. The reasoning is that additional connections from higher resolved levels of the encoder can benefit the final feature description (cf. Fig. 2). Further, more densely connected networks are assumed to have a better flow of gradients during training [he2016deep, huang2017densely] which improves convergence properties. Most recently, pyramidal feature extractors have also been shown to be more robust to adversarial attacks [ranjan2019attacking]. Moreover, the idea of ResFPN is independent of hyper-parameters of the pyramid like the number of levels, or the scale factor. It is applicable together with any building blocks for down-/up-sampling, like Residual [he2016deep], Dense [huang2017densely], or Inception [szegedy2017inception, szegedy2015inception] units. The idea of additional residual skip connections between encoder and decoder can be applied in all cases.

The theoretical idea of ResFPN can include any additional connection of layers in pyramid networks that goes beyond regular lateral skip connections, e.g. dense connections. However, for dense matching we argue that the set of possible connections can be restricted. More precisely, additional connections from lower resolved feature maps towards higher resolutions [xin2018reverse] are assumed to improve feature semantics only and do not contribute to the goal of better localization (they might even accomplish the opposite). As a result, we focus on (multiple) connections from higher resolution feature maps of the encoder to feature maps of the decoder (see Fig. 2(d)).

Fig. 3: A single up-sampling block in the decoder of our ResFPN combines four different resolutions. The previous lower resolution feature representation of the decoder is up-sampled with a transposed convolution, the equally resolved feature map from the encoder is connected through a classical skip connection, and two higher resolution feature encodings are additionally connected after down-sampling. For down-sampling, convolution and max-pooling are applied. Merging is performed by element-wise addition followed by convolution.

Along with these additional connections, new questions arise. Higher resolution feature maps need to be adjusted to fit the spatial dimensions of the connected decoder layer. This can be done with any size-changing layer, e.g. strided convolution or pooling. Joining multiple feature maps into a single one requires a suitable strategy for merging. Commonly, either element-wise addition or concatenation is used. While the latter maintains the separation of features, it can also lead to heavy computational loads for large and deep feature maps. Finally, one may ask which layers should be additionally connected. In theory, the more higher-resolution levels are used, the more the focus shifts towards localization. On the other hand, a dense connection of every higher resolution to every lower one might be impractical. These questions are investigated in our ablation study in Section IV-A.

The final remark of the theoretical discussion of ResFPN is related to the spectrum of applications. We argue that ResFPN is especially powerful when used for deeper pyramids that realize a (coarse-to-fine, incremental) multi-level prediction at multiple scales. However, the application of ResFPN is not limited to this use case. It is also possible to use only a certain level of the decoder for a single final prediction. Our experiments (Section IV-B) cover a broad range of end-to-end differentiable dense matching networks to demonstrate the flexibility of ResFPN.

Name        Input                                   Layer              Output Shape
enc-1-1     input                                   Conv(16,3,2,1)
enc-1-2     enc-1-1                                 Conv(16,3,1,1)
enc-2-1     enc-1-2                                 Conv(32,3,2,1)
enc-2-2     enc-2-1                                 Conv(32,3,1,1)
enc-3-1     enc-2-2                                 Conv(64,3,2,1)
enc-3-2     enc-3-1                                 Conv(64,3,1,1)
enc-4-1     enc-3-2                                 Conv(96,3,2,1)
enc-4-2     enc-4-1                                 Conv(96,3,1,1)
enc-5-1     enc-4-2                                 Conv(128,3,2,1)
enc-5-2     enc-5-1                                 Conv(128,3,1,1)
enc-6-1     enc-5-2                                 Conv(196,3,2,1)
enc-6-2     enc-6-1                                 Conv(196,3,1,1)
bottleneck  enc-6-2                                 Conv(196,1,1,1)
skip-5-6    enc-5-2
skip-4-6    enc-4-2
dec-6-2     bottleneck +skip-5-6 +skip-4-6          Conv(196,3,1,1)
dec-5-1     dec-6-2                                 UpConv(128,4,2,1)
skip-4-5    enc-4-2
skip-3-5    enc-3-2
dec-5-2     dec-5-1 +enc-5-2 +skip-4-5 +skip-3-5    Conv(128,3,1,1)
dec-4-1     dec-5-2                                 UpConv(96,4,2,1)
skip-3-4    enc-3-2
skip-2-4    enc-2-2
dec-4-2     dec-4-1 +enc-4-2 +skip-3-4 +skip-2-4    Conv(96,3,1,1)
dec-3-1     dec-4-2                                 UpConv(64,4,2,1)
skip-2-3    enc-2-2
skip-1-3    enc-1-2
dec-3-2     dec-3-1 +enc-3-2 +skip-2-3 +skip-1-3    Conv(64,3,1,1)
dec-2-1     dec-3-2                                 UpConv(32,4,2,1)
skip-1-2    enc-1-2
skip-0-2    input
dec-2-2     dec-2-1 +enc-2-2 +skip-1-2 +skip-0-2    Conv(32,3,1,1)
TABLE I: The detailed architecture of our ResFPN. (Up)Conv() and MaxPool() describe (transposed) convolution and max-pooling; the arguments give, in order, the number of kernels, the square kernel size, the stride, and the dilation rate.
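The resolutions and skip sources implied by Table I can be traced with a few lines of Python. This is only a sketch: it assumes that every stride-2 convolution exactly halves the spatial size (suitable padding, input sizes divisible by 64), and the function names are illustrative rather than taken from the original implementation.

```python
from fractions import Fraction

# Output channels per encoder level, read from Table I (level 0 is the input image).
ENCODER_CHANNELS = {1: 16, 2: 32, 3: 64, 4: 96, 5: 128, 6: 196}

def level_scale(level):
    """Resolution of a pyramid level relative to the input (assumes exact halving per level)."""
    return Fraction(1, 2 ** level)

def skip_pool_factor(source_level, target_level):
    """Down-sampling factor needed to bring a higher-resolution source to the target level."""
    if source_level >= target_level:
        raise ValueError("additional skips must come from a higher resolution")
    return 2 ** (target_level - source_level)

def additional_skips(target_level, count=2):
    """Source levels of the additional skip connections of one decoder level, as listed in Table I."""
    return [target_level - i for i in range(1, count + 1) if target_level - i >= 0]

for target in range(6, 1, -1):
    sources = additional_skips(target)
    factors = [skip_pool_factor(s, target) for s in sources]
    print(f"level {target}: scale {level_scale(target)}, "
          f"channels {ENCODER_CHANNELS[target]}, skips from levels {sources}, factors {factors}")
```

Under these assumptions, each additional skip has to be reduced by a factor of 2 or 4 before it can be merged, which matches the two higher resolutions shown in Fig. 3.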

III-B Feature Extraction Network

A ResFPN consists of an arbitrary number of down-sampling blocks with a common sub-sampling factor (usually 2), a bottleneck, and up-sampling blocks using the same factor. In regular FPNs [lin2017feature], the up-sampling block merges the corresponding feature encoding of the target resolution with the up-sampled result to produce a refined feature map. In ResFPN, we additionally use feature encodings of the next higher resolutions during merging. The feature encodings of higher resolutions need to be re-shaped to fit the spatial dimensions (and possibly the feature depth) of the target feature map. In theory, any re-sizing operation could be used for this task, e.g. strided convolution. We compare different strategies for re-shaping and merging in Section IV-A. Each (or one) of the feature maps of the decoder can then be used as features for the prediction.

                                                                     FT3D [mayer2016large]   KITTI [menze2015object]
Name                  Additional skips  Re-shaping       Merging        >3px    EPE          >3px    EPE            Parameters  FLOPs
FPN [saxena2019pwoc]  0                 -                addition       21.49   9.15         12.55   3.22           8.05        6.07
                      1                 conv, max-pool   addition       20.95   8.28         11.37   3.09           8.09        6.50
                      2                 max-pool         concatenation  19.90   7.91         11.21   3.04           8.67        8.94
                      2                 conv, max-pool   concatenation  21.16   8.34         11.83   3.02           9.03        12.09
                      2                 conv, stride     addition       21.65   8.42         13.67   3.50           8.74        7.43
                      2                 conv, bi-linear  addition       20.89   8.09         11.55   3.21           8.12        7.26
                      2                 max-pool, conv   addition       20.28   7.67         12.24   3.06           8.12        6.24
ResFPN                2                 conv, max-pool   addition       18.91   7.19         10.63   2.98           8.12        7.30
TABLE II: Ablation study on our validation split of KITTI data for different numbers and kinds of residual connections with different strategies for merging. A simple FPN establishes only a single skip connection between layers of the same resolution. Our ResFPN adds two residual connections of higher resolutions (cf. Fig. 3). Results for scene flow estimation with PWOC-3D [saxena2019pwoc] validate that the setup of ResFPN yields the best results while increasing the computational effort and network size only marginally.

One possible way to implement a ResFPN is described here. We base our architecture on the FPN in [saxena2019pwoc] which is an extension of the feature pyramid of [sun2018pwc]. That is, we use down-sampling blocks with a sub-sampling factor of 2 to compute 6 feature maps, where the first one has 1/2 of the input resolution and the deepest encoding has 1/64 of the input resolution. This is followed by up-sampling blocks that reconstruct a feature map of 1/4 of the original image resolution. Higher resolutions are not required for most of the prediction heads in our experiments [chang2018pyramid, saxena2019pwoc, sun2018pwc], but are possible. The down-sampling is performed by two convolutions, where the first one applies a stride of 2. For up-sampling, we apply a transposed convolution with stride 2, merge the up-sampled features with a regular skip connection and additional lateral connections through element-wise addition, and then refine the fused features with a convolution. To align spatial size and feature depth for the merging of higher resolution feature encodings, we propose a convolution followed by max-pooling whose kernel size and stride equal the down-sampling factor between the two levels. Reshaping, merging, and refinement during up-sampling are illustrated in Fig. 3. In our experience, the combination of convolution with max-pooling hits the sweet spot between preserving spatial accuracy and computational efficiency, especially when feature depth is increased during the convolution (which is usually the case from higher to lower resolutions) (cf. Section IV-A). LeakyReLU activation [maas2013rectifier] is used for all convolutions to introduce non-linearity into the model. The entire architecture of ResFPN with all details is given in Table I.
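The re-shaping and merging step can be illustrated on toy data. The following sketch uses single-channel feature maps stored as nested lists, so the convolutions that align feature depth before pooling and refine the result after merging are deliberately left out; the helper names and the example values are assumptions made for illustration only.

```python
def max_pool(fmap, k):
    """Max-pooling with kernel size and stride k on a 2D list (sizes assumed divisible by k)."""
    rows, cols = len(fmap), len(fmap[0])
    return [
        [max(fmap[r + dr][c + dc] for dr in range(k) for dc in range(k))
         for c in range(0, cols, k)]
        for r in range(0, rows, k)
    ]

def merge_by_addition(*maps):
    """Element-wise sum of equally sized 2D maps."""
    rows, cols = len(maps[0]), len(maps[0][0])
    for m in maps:
        if len(m) != rows or any(len(row) != cols for row in m):
            raise ValueError("all maps must share the same spatial size")
    return [[sum(m[r][c] for m in maps) for c in range(cols)] for r in range(rows)]

upsampled = [[1, 1], [1, 1]]   # decoder output after the transposed convolution
lateral = [[0, 2], [2, 0]]     # encoder map of the same resolution (regular skip)
skip_2x = [[r * 4 + c for c in range(4)] for r in range(4)]                      # one level higher
skip_4x = [[1 if (r, c) == (7, 7) else 0 for c in range(8)] for r in range(8)]   # two levels higher

merged = merge_by_addition(
    upsampled,
    lateral,
    max_pool(skip_2x, 2),   # pooled by the resolution ratio 2
    max_pool(skip_4x, 4),   # pooled by the resolution ratio 4
)
print(merged)
```

Because all four inputs end up with the same shape, addition keeps the size of the merged map constant regardless of how many skips are connected.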

IV Experiments and Results

Our ResFPN is designed to extract features for dense matching such as stereo disparity, optical flow, or scene flow estimation. Our experiments cover end-to-end networks for all these matching tasks (cf. Section II). In particular, PWOC-3D [saxena2019pwoc] is used for scene flow estimation, PWCNet [sun2018pwc] and LiteFlowNet [hui2018liteflownet] represent optical flow estimators, and PSMNet [chang2018pyramid] is the network used for disparity estimation.

The experiments consider three well established data sets. FlyingThings3D (FT3D) [mayer2016large] is used in all cases for pre-training and evaluation. It provides dense scene flow ground truth and is thus also applicable for the training of optical flow or disparity networks. Further, we fine-tune networks on KITTI [geiger2012kitti, menze2015object] and Sintel [butler2012naturalistic]. The KITTI 2015 Scene Flow data set also provides (sparse) labels for scene flow and can therefore be used for all evaluations. Sintel is a data set for optical flow and is thus used for experiments related to optical flow only. For validation and evaluation, the random split of [saxena2019pwoc] is used for KITTI, and we randomly sample 5 out of the 23 sequences for Sintel. These sequences are alley_2, ambush_4, bamboo_2, cave_4, and market_5. For augmentation, we apply photometric transformations as in [dosovitskiy2015flownet, saxena2019pwoc] and temporal flipping for pre-training on FT3D. Unless mentioned otherwise, pre-training and fine-tuning are done with a batch size of 2 and 1, respectively.

The metrics being considered in the comparison are the end-point error (EPE) in pixels of the 1-, 2-, or 4-dimensional prediction and the KITTI outlier rate (>3px) of the respective task in percent [menze2015object]. For both, lower is better.
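Both metrics can be sketched for a list of per-pixel predictions as follows. The sketch assumes the commonly used KITTI rule that a pixel counts as an outlier only if its error exceeds 3 px and also 5 % of the ground-truth magnitude; the relative threshold is not stated in the text above, and the function names are illustrative. For scene flow, the official KITTI evaluation combines several such per-quantity checks, which is not reproduced here.

```python
import math

def end_point_error(pred, gt):
    """Euclidean distance between a predicted and a ground-truth vector of equal length."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)))

def evaluate(preds, gts, abs_thresh=3.0, rel_thresh=0.05):
    """Return (mean EPE in pixels, outlier rate in percent) over all given pixels."""
    errors, outliers = [], 0
    for pred, gt in zip(preds, gts):
        err = end_point_error(pred, gt)
        magnitude = math.sqrt(sum(g ** 2 for g in gt))
        errors.append(err)
        # Assumed KITTI rule: outlier if the error exceeds 3 px AND 5 % of the ground truth.
        if err > abs_thresh and err > rel_thresh * magnitude:
            outliers += 1
    return sum(errors) / len(errors), 100.0 * outliers / len(errors)

# Three optical flow vectors (2-dimensional); disparity would use 1-tuples.
predictions = [(1.0, 0.0), (10.0, 0.0), (100.0, 0.0)]
ground_truth = [(0.0, 0.0), (6.0, 0.0), (96.0, 0.0)]
mean_epe, outlier_rate = evaluate(predictions, ground_truth)
print(f"EPE: {mean_epe:.2f} px, outliers: {outlier_rate:.1f} %")
```

In this toy example the third vector has a 4 px error but is not an outlier, because the error stays below 5 % of the large ground-truth displacement.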

Using these setups, two sets of experiments are performed. First, we evaluate our design choices in Section IV-A and compare different ways to implement ResFPN. Second, we apply the features of ResFPN together with different end-to-end matching networks in Section IV-B.

                                                  FT3D [mayer2016large]   KITTI [menze2015object]   Sintel [butler2012naturalistic]
Prediction Head               Features            >3px    EPE             >3px    EPE               >3px    EPE
PWOC-3D [saxena2019pwoc]      Original (FPN)      21.5    9.2             12.6    3.2               -       -
PWOC-3D [saxena2019pwoc]      ResFPN              18.9    7.2             10.6    3.0               -       -
PWCNet [sun2018pwc]           Original            19.9    8.5             15.6    3.7               20.2    6.0
PWCNet [sun2018pwc]           FPN                 19.4    8.4             14.6    3.3               19.6    5.7
PWCNet [sun2018pwc]           ResFPN              18.7    8.2             13.9    3.2               18.5    5.7
LiteFlowNet [hui2018liteflownet]  Original        23.1    9.8             18.0    3.7               20.7    5.7
LiteFlowNet [hui2018liteflownet]  FPN             22.8    9.9             18.0    3.6               19.6    5.7
LiteFlowNet [hui2018liteflownet]  ResFPN          20.9    9.0             16.4    3.5               18.3    5.6
PSMNet [chang2018pyramid]     Original            16.0    5.3             3.0     1.0               -       -
PSMNet [chang2018pyramid]     FPN                 10.9    5.2             2.6     1.0               -       -
PSMNet [chang2018pyramid]     ResFPN              10.9    4.9             2.2     1.0               -       -
TABLE III: Comparison of feature extractors. For different prediction networks on different data sets, we evaluate the original network and a version where nothing but the feature module is changed to our improved ResFPN. To validate whether the additional lateral connections in ResFPN are the reason for the improvement, we also compare to a simple FPN [lin2017feature].

IV-A Design Decisions

There are multiple ways to implement the idea of ResFPN. In this section, we compare different variants of ResFPN and vary the number of additional skip connections, the merging operation, and the method to adjust size and depth of the skip features. For those experiments, we use the up- and down-sampling blocks presented in Table I and Section III-B with the prediction head of PWOC-3D [saxena2019pwoc] for scene flow. The different variants are compared in Table II.

We vary the number of skip connections per decoder level from 1 (no additional connections, i.e. the original FPN) to 3 (two additional connections). More than three lateral connections were not realizable due to hardware constraints, yet we can clearly see that an increase of connections improves the final results. Furthermore, concatenation versus addition is tested. Since the concatenation is independent of the feature depths of the merging input, it is not necessary to reshape the depth of the additional skip features. However, when this step is omitted, performance decreases. Yet, if this step is included, it is not obvious what the output depth of the convolution should be. For the numbers reported in Table II, we use the output depth of the up-sampled target feature map, i.e. the same number of output channels that is required for merging by element-wise addition. As a consequence, the computational effort increases considerably.
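The extra cost of concatenation can be made concrete with a rough parameter count for the refinement convolution of a single decoder level. This is an illustrative calculation, not a reproduction of the totals in Table II: it assumes a 3x3 refinement convolution with bias (as listed for dec-5-2 in Table I), four merged inputs, and that every input has already been brought to the target depth of 128 channels.

```python
def conv_params(in_channels, out_channels, kernel_size, bias=True):
    """Number of learnable parameters of a 2D convolution with a square kernel."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)

channels = 128     # feature depth of decoder level 5 in Table I
num_inputs = 4     # up-sampled map, lateral skip, and two additional skips

# Addition keeps the input depth; concatenation stacks all inputs along the channel axis.
params_addition = conv_params(channels, channels, 3)
params_concat = conv_params(num_inputs * channels, channels, 3)

print(params_addition, params_concat, round(params_concat / params_addition, 2))
```

The number of multiply-accumulate operations of this layer scales with the same factor, since it is the weight count multiplied by the number of output positions.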

Lastly, we change the re-shaping strategy that aligns the spatial shapes and, in case of addition, the depth of the skip feature maps. Our approach of convolution followed by max-pooling is compared against strided convolution, convolution followed by bi-linear down-sampling, and max-pooling followed by convolution to show the importance of the order. Out of all strategies, our re-shaping approach with element-wise addition and skip connections (visualized in Fig. 3) performs best while remaining computationally affordable. The overhead of the additional residual connections in terms of numbers of parameters and floating point operations is negligibly small, but outlier rate and end-point error drop by 7 to 22 %. Note that the feature computation with either FPN or ResFPN requires less than 10 % of the entire floating point operations for the prediction of the scene flow with PWOC-3D [saxena2019pwoc]. In detail, inference with PWOC-3D for a single pair of stereo images on a GeForce GTX 1080 Ti requires about s, i.e. feature computation with ResFPN for a single image ( MP) takes about ms.

[Figure panels: reference images and ground truth; results of the baseline (PWCNet [sun2018pwc] or LiteFlowNet [hui2018liteflownet]) with error maps; improved results with our ResFPN with error maps.]
PWCNet, KITTI: baseline outliers 13.6 %, EPE 2.4 px; with ResFPN outliers 6.0 %, EPE 1.9 px.
PWCNet, Sintel: baseline outliers 2.8 %, EPE 0.9 px; with ResFPN outliers 6.4 %, EPE 1.1 px.
LiteFlowNet, KITTI: baseline outliers 16.5 %, EPE 2.7 px; with ResFPN outliers 11.2 %, EPE 2.0 px.
LiteFlowNet, Sintel: baseline outliers 49.8 %, EPE 33.1 px; with ResFPN outliers 33.1 %, EPE 9.2 px.

Fig. 4: Some examples of how ResFPN improves optical flow prediction on KITTI and Sintel. Please note the subtle differences around objects, e.g. vehicles. More visual comparisons for all networks and data sets are presented in our supplementary video.

IV-B Dense Matching with ResFPN

Four different end-to-end networks for scene flow, optical flow, and disparity estimation are used for dense matching. For our experiments, we replace nothing but the feature computation module with our ResFPN (and a simple FPN [lin2017feature] for comparison). The predictions for the three different feature extractors are then compared. Evaluation is conducted on all mentioned data sets if ground truth for the respective task is available. Training schedules are as close as possible to the original, including multi-stage training if relevant, learning rate schedules, etc. Deviations are explicitly mentioned. Results for all networks on all data sets are presented in Table III.

Stereo Disparity

PSMNet [chang2018pyramid] is used to compute stereo disparity. This network predicts single scale dense stereo displacements at 1/4 resolution, i.e. only the output of dec-2-2 from our ResFPN is used for the prediction. For the comparison between baseline and ResFPN, we replace the CNN module for feature extraction (see Table 1 in [chang2018pyramid]) with our ResFPN. To smooth the interface between our code and the SPP module of [chang2018pyramid], we pass the used feature representation through a convolution with 128 output channels to match the input shapes between baseline and ResFPN. For any training of PSMNet, we use a batch size of 3 for pre-training. For the training of PSMNet together with ResFPN, we reduce the entire learning rate schedule by a factor of 10, because the additional skip connections affect the flow of gradients and thus can influence stability.

Our results show a significant reduction of outliers (>3px) for both stereo data sets when using ResFPN. End-point errors on FT3D are also reduced. ResFPN also outperforms the simple FPN with a single lateral skip connection only.

Optical Flow

PWCNet [sun2018pwc] and LiteFlowNet [hui2018liteflownet] are used for estimation of optical flow. For our experiments with PWCNet, we use the exact ResFPN as described in Table I and Section III-B. For LiteFlowNet, we demonstrate the flexibility of ResFPN and test a version that is closer to the original feature computation module of LiteFlowNet. We still apply the concept of multiple residual skip connections in a pyramidal encoder-decoder network using the up-sampling concept shown in Fig. 3, but we change the hyper-parameters to fit the settings of the encoder of LiteFlowNet [hui2018liteflownet]. In detail, the feature encoder is formed by the input image, a first feature representation at full resolution, and then 5 additional down-sampled feature maps. This setup reaches a minimal resolution of with feature depths of for the 7 parts of the encoder (including the image itself). For the prediction, multiple scales are used iteratively until of the input resolution is reached. This is different from all other networks, where the final resolution for prediction is 1/4.

For both optical flow networks, the results improve on all data sets when features from our ResFPN are used. This holds for both metrics, outlier rate and average end-point error. ResFPN also outperforms a simple FPN [lin2017feature] in all our optical flow experiments. A visual comparison of the results of the baselines and ResFPN is given in Fig. 4 for exemplary images from KITTI and Sintel. It is evident that the localization of features is improved, which captures more details during matching. Beyond that, ResFPN is generally more robust than its competitors. In the first sample from Sintel for example, the relatively small, badly illuminated character is outlined much better when ResFPN features are used for the matching, even if the overall results for this frame are slightly worse. On a global scale, especially for large displacements or occluded areas, ResFPN outperforms the baseline (e.g. in the last example of Fig. 4).

Scene Flow

For estimation of scene flow with PWOC-3D [saxena2019pwoc], our original design of ResFPN is applied again. The major differences here are that four instead of two images are processed for matching with ResFPN and that the baseline is already using an FPN with lateral skip connections [lin2017feature, saxena2019pwoc]. Therefore, this experiment has the strongest baseline. Still, ResFPN achieves a considerable reduction of outliers of about 15 % and cuts the end-point errors by 6 % and 22 % for KITTI and FT3D, respectively.
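The relative reductions quoted for scene flow follow directly from the PWOC-3D rows of Table III. The short check below recomputes them; the helper name is illustrative, and the inputs are the rounded table values, so the percentages are approximate.

```python
def relative_reduction(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline

# (baseline FPN, ResFPN) values for PWOC-3D, copied from Table III.
results = {
    ("KITTI", ">3px"): (12.6, 10.6),
    ("KITTI", "EPE"): (3.2, 3.0),
    ("FT3D", ">3px"): (21.5, 18.9),
    ("FT3D", "EPE"): (9.2, 7.2),
}

for (dataset, metric), (base, res) in results.items():
    print(f"{dataset} {metric}: {relative_reduction(base, res):.1f} % lower with ResFPN")
```

With these inputs, the outlier rate drops by roughly 16 % on KITTI and 12 % on FT3D, and the end-point error by roughly 6 % and 22 %, respectively.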

In summary, using ResFPN for feature computation in end-to-end matching networks reduces outlier rates and end-point errors (or maintains them) in all our experiments. The better localized features preserve details during matching and produce more consistent and smooth results in comparison to simple Feature Pyramids (FP) and basic Feature Pyramid Networks (FPN). ResFPN could achieve this for prediction networks with very different characteristics, e.g. single and multi-scale estimation, different encoding (down-sampling) blocks, and different final resolutions. More results and visualizations are provided within our supplementary video.

IV-C Improved Localization

Network                     Original                        with ResFPN
PWOC [saxena2019pwoc]       10.75   10.23   8.24    6.54    10.18   9.87    8.30    6.80
PWC [sun2018pwc]            10.34   9.79    8.25    6.94    9.46    9.06    7.91    6.84
LFN [hui2018liteflownet]    13.72   12.57   10.13   8.74    12.36   11.43   9.15    7.72
PSM [chang2018pyramid]      2.07    2.31    2.12    1.70    2.77    2.46    1.79    1.35
TABLE IV: Evaluation in boundary regions of objects on KITTI [menze2015object]. Each entry is the average end-point error for areas around object boundaries of a given width in pixels; the four columns per feature extractor correspond to different boundary widths, ordered from the narrowest to the widest region.

The previous section confirms that matching with ResFPN yields an overall better result on various domains with all kinds of networks. However, one of our major claims is improved localization by utilization of multiple higher resolution feature maps. Therefore, a final experiment to validate this claim is conducted. Towards this end, we make use of the object masks provided by the KITTI data set [menze2015object] to repeat the previous experiment on boundary regions only. The average end-point error for different maximum distances to object boundaries is evaluated and reported in Table IV.

The numbers indicate that results obtained from our feature module are better at discontinuities around objects in most cases. Except for very narrow evaluation regions for predictions with PSMNet [chang2018pyramid] and wide boundaries for scene flow prediction with PWOC-3D [saxena2019pwoc], ResFPN reduces the error in these difficult image regions.
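A boundary-restricted evaluation of this kind can be sketched as follows. The exact definition of the boundary regions is not given here, so the sketch makes its own assumptions: a pixel is a boundary pixel if one of its 4-neighbours carries a different object label, and the evaluation region contains all pixels within a given Chebyshev distance of such a pixel. All names are hypothetical.

```python
def boundary_pixels(labels):
    """Pixels whose 4-neighbourhood contains a different object label."""
    rows, cols = len(labels), len(labels[0])
    boundary = set()
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and labels[nr][nc] != labels[r][c]:
                    boundary.add((r, c))
                    break
    return boundary

def boundary_region_epe(errors, labels, max_dist):
    """Mean per-pixel error over pixels within max_dist (Chebyshev) of a boundary pixel."""
    boundary = boundary_pixels(labels)
    rows, cols = len(labels), len(labels[0])
    values = [
        errors[r][c]
        for r in range(rows)
        for c in range(cols)
        if any(max(abs(r - br), abs(c - bc)) <= max_dist for br, bc in boundary)
    ]
    return sum(values) / len(values) if values else float("nan")

# One image row with two objects; the error peaks at the object boundary.
labels = [[0, 0, 0, 1, 1, 1]]
errors = [[1.0, 1.0, 3.0, 5.0, 1.0, 1.0]]
for dist in (0, 1, 2):
    print(dist, boundary_region_epe(errors, labels, dist))
```

In the toy example the average error shrinks as the region widens, because more pixels far from the discontinuity are included.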

V Conclusion

In this paper, we presented ResFPN – a multi-resolution feature pyramid network with residual skip connections. With this novel design we were able to significantly improve the representativity and localization of feature description for end-to-end learned dense pixel matching tasks. We validated our design in a comprehensive ablation study. In various experiments, we showed that ResFPN achieves significant improvements in application for optical flow, scene flow and disparity estimation. These improvements have been confirmed for a wide range of state-of-the-art methods over a large number of renowned data sets.

As future work, we plan to explicitly consider further input modalities like LiDAR [battrawy2019lidar] or radar [meyer2019deep] in the design of ResFPN. The additional 3D information plays an essential role for various applications. Furthermore, we want to improve ResFPN with respect to its robustness against disturbances in the input data [ranjan2019attacking].