RAFT-3D: Scene Flow using Rigid-Motion Embeddings

12/01/2020
by Zachary Teed, et al.
Princeton University

We address the problem of scene flow: given a pair of stereo or RGB-D video frames, estimate pixelwise 3D motion. We introduce RAFT-3D, a new deep architecture for scene flow. RAFT-3D is based on the RAFT model developed for optical flow but iteratively updates a dense field of pixelwise SE3 motion instead of 2D motion. A key innovation of RAFT-3D is rigid-motion embeddings, which represent a soft grouping of pixels into rigid objects. Integral to rigid-motion embeddings is Dense-SE3, a differentiable layer that enforces geometric consistency of the embeddings. Experiments show that RAFT-3D achieves state-of-the-art performance. On FlyingThings3D, under the two-view evaluation, we improve the best published accuracy (d < 0.05) from 30.33% to 83.71%. On KITTI, we achieve an error of 5.77, outperforming the best published method (6.31), despite using no object instance supervision.


1 Introduction

Figure 1: Overview of our approach. Features extracted from the input images are used to construct a 4D correlation volume. We initialize the SE3 motion field to be the identity at every pixel. During each iteration, the update operator uses the current SE3 motion estimate to index from the correlation volume, using the correlation features and hidden state to produce estimates of pixel correspondence and rigid-motion embeddings. These estimates are passed to Dense-SE3, a least-squares optimization layer that uses geometric constraints to produce an update to the SE3 field. After successive iterations we recover a dense SE3 field, which can be decomposed into rotational and translational components. The SE3 field can be projected onto the image to recover optical flow.

Scene flow is the task of estimating pixelwise 3D motion between a pair of video frames [vedula1999three]. Detailed 3D motion is requisite for many downstream applications including path planning, collision avoidance, virtual reality, and motion modeling. In this paper, we focus on stereo scene flow and RGB-D scene flow, which address stereo video and RGB-D video respectively.

Many scenes can be well approximated as a collection of rigidly moving objects. The motion of driving scenes, for example, can be modeled as a variable number of cars, buses, and trucks. The most successful scene flow approaches have exploited this structure by decomposing a scene into its rigidly moving components [menze2015object, vogel20113d, vogel20153d, ma2019deep, behl2017bounding, jaimez2015motion, jaimez2015primal, kumar2017monocular]. This introduces a powerful prior which can be used to guide inference. While optical flow approaches typically assume piecewise smooth motion, a scene containing rigid objects will exhibit a piecewise constant 3D motion field (Fig. 1).

Recently, many works have proposed integrating deep learning into scene flow estimation pipelines. A common approach has been to use object detection [behl2017bounding, cao2019learning] or segmentation [behl2017bounding, ma2019deep, lv2018learning, ren2017cascaded] networks to decompose the scene into a collection of potentially rigidly moving objects. Once the scene has been segmented into its rigidly moving components, more traditional optimization can be used to fit a motion model to each of the objects. One limitation of this approach is that the networks require instance segmentations for training and cannot recover the motion of new, unknown objects. Moreover, object detection and instance segmentation introduce non-differentiable components into the network, making end-to-end training difficult without bounding box or instance-level supervision.

We introduce RAFT-3D, an end-to-end differentiable architecture which estimates pixelwise 3D motion from stereo or RGB-D video. RAFT-3D is built on top of RAFT [teed2020raft], a state-of-the-art optical flow architecture that builds all-pairs correlation volumes and uses a recurrent unit to iteratively refine a 2D flow field. We retain the basic iterative structure of RAFT but introduce a number of novel designs.

The main innovation we introduce is rigid-motion embeddings, which are per-pixel vectors that represent a soft grouping of pixels into rigid objects. During inference, RAFT-3D iteratively updates the rigid-motion embeddings such that pixels with similar embeddings belong to the same rigid object and follow the same SE3 motion.

Integral to rigid-motion embeddings is Dense-SE3, a differentiable layer that seeks to ensure that the embeddings are geometrically meaningful. Dense-SE3 iteratively updates a dense field of per-pixel SE3 motion by performing unrolled Gauss-Newton iterations such that the per-pixel SE3 motion is geometrically consistent with the current estimates of rigid-motion embeddings and pixel correspondence. Because of Dense-SE3, the rigid-motion embeddings can be indirectly supervised from only ground truth 3D scene flow, and our approach does not need any supervision of object boxes or masks.

Fig. 1 provides an overview of our approach. RAFT-3D takes a pair of RGB-D images as input. It extracts features from the input images and builds a 4D correlation volume by computing the visual similarity between all pairs of pixels. RAFT-3D maintains and updates a dense field of pixelwise SE3 motion. During each iteration, it uses the current estimate of SE3 motion to index from the correlation volume. A recurrent GRU-based update operator takes the correlation features and produces an estimate of pixel correspondence, which is then used by Dense-SE3 to generate updates to the SE3 motion field.

RAFT-3D achieves state-of-the-art accuracy. On FlyingThings3D, under the two-view evaluation [liu2019flownet3d], RAFT-3D improves the best published accuracy (d < 0.05) from 30.33% to 83.71%. On KITTI, RAFT-3D achieves an error of 5.77, outperforming the best published method (6.31), despite using no object instance supervision.

2 Related Work

The task of reconstructing a 3D motion field from video is often referred to as estimating “scene flow”.

Optical Flow: Optical flow is the problem of estimating dense 2D pixel-level motion between a pair of frames. Early work formulated optical flow as an energy minimization problem, where the objective was a combination of a data term, encouraging the matching of visually similar image regions, and a regularization term, favoring piecewise smooth motion fields. Many early scene flow approaches evolved from this formulation, replacing piecewise smooth flow priors with a piecewise constant rotation/translation field prior [vogel2013piecewise, menze2015object]. This greater degree of structure allowed scene flow methods to outperform approaches which treated optical flow or stereo separately [vogel20113d].

Recently, the problem of optical flow has been reformulated in the context of deep learning. Many works have demonstrated that a neural network can be directly trained to estimate optical flow between a pair of frames, and a large variety of network architectures have been proposed for the task [flownet1, flownet2, pwcnet, ranjan2017optical, lu2020devon, yang2019volumetric, teed2020raft]. RAFT [teed2020raft] is a recurrent network architecture for estimating optical flow. RAFT builds a 4D correlation volume by computing the visual similarity between all pairs of pixels; then, during inference, a recurrent update operator indexes from the correlation volume to produce a flow update. A unique feature of RAFT is that a single, high-resolution flow field is maintained and updated.

Our approach is based on the RAFT architecture, but instead of a flow field, we estimate an SE3 motion field, where a rigid-body transformation is estimated for each pixel. When projected onto the image, our SE3 motion vectors give more accurate optical flow than RAFT.

Rectified Stereo: Rectified stereo can be viewed as a 1-dimensional analog of optical flow, where the correspondence of each pixel in the left image is constrained to lie on a horizontal line spanning the right image. Like optical flow, traditional methods treated stereo as an energy minimization problem [hirschmuller2005accurate, ranftl2012pushing], often exploiting planar information [bleyer2011patchmatch].

Recent deep learning approaches have borrowed many core concepts from conventional approaches, such as the use of a 3D cost volume [gcnet], replacing hand-crafted features and similarity metrics with learned features, and replacing cost volume filtering with a learned 3D CNN. Like optical flow, a variety of network architectures have been proposed [gcnet, zhang2019ga, guo2019group, chang2018pyramid]. Here we use GA-Net [zhang2019ga] to estimate depth from each left/right image pair.

Scene Flow: Like optical flow and stereo, scene flow can be approached as an energy minimization problem. The objective is to recover a flow field such that (1) visually similar image regions are aligned and (2) the flow field maximizes some prior such as piecewise rigid motion and piecewise planar depth. Both variational optimization [quiroga2014dense, jaimez2015motion, jaimez2015primal] and discrete optimization [menze2015object, jaimez2015primal] approaches have been explored for inference. Our network is designed to mimic the behavior of an optimization algorithm: we maintain an estimate of the current motion field, which is updated and refined with each iteration.

Jaimez et al. [jaimez2015motion] proposed an alternating optimization approach for scene flow estimation from a pair of RGB-D images, iterating between grouping pixels into rigidly moving clusters and estimating the motion model for each cluster. Our method shares key ideas with this approach, namely the grouping of pixels into rigidly moving objects; however, we avoid a hard clustering by using rigid-motion embeddings, which softly and differentiably group pixels into rigid objects.

Recent works have leveraged the object detection and semantic segmentation abilities of deep networks to improve scene flow accuracy [ma2019deep, cao2019learning, ren2017cascaded, behl2017bounding, gordon2019depth]. In these works, an object detection or instance segmentation network is trained to identify potentially moving objects, such as cars or buses. While these approaches have been very effective for driving datasets such as KITTI, where moving objects can be easily identified using semantics, they do not generalize well to novel objects. An additional limitation is that detection and instance segmentation introduce non-differentiable components into the pipeline, requiring these components to be trained separately on ground truth annotations. Ma et al. [ma2019deep] were able to train an instance segmentation network jointly with optical flow estimation by differentiating through Gauss-Newton updates; however, this required additional instance supervision and pre-training on Cityscapes [cordts2016cityscapes]. In contrast, our network outperforms these approaches without using object instance supervision.

Yang and Ramanan [yang2020upgrading] take a different approach and use a network to predict optical expansion, or the change in perceived object size. Combining optical expansion with optical flow gives normalized 3D scene flow. The scale ambiguity can be resolved using Lidar, stereo, or monocular depth estimation. This approach does not require instance segmentation, but it also cannot directly enforce rigid-motion priors.

Another line of work has focused on estimating 3D motion between a pair [liu2019flownet3d, wang2020flownet3d++, gu2019hplflownet] or a sequence [liu2019meteornet, fan2019pointrnn] of point clouds. These approaches are well suited for Lidar data, where the sensor produces sparse measurements. However, these works do not directly exploit scene rigidity. As we demonstrate in our experiments, reasoning about object-level rigidity is critical for good accuracy.

3 Approach

We propose an iterative architecture for scene flow estimation from a pair of RGB-D images. Our network takes two image/depth pairs, (I1, Z1) and (I2, Z2), as input and outputs a dense transformation field T, which assigns a rigid-body transformation to each pixel. For stereo images, the depth estimates Z1 and Z2 are obtained using an off-the-shelf stereo network.

3.1 Preliminaries

We use the pinhole projection model and assume known camera intrinsics. We use an augmented projection function π which maps a 3D point to its projected pixel coordinates (u, v) in addition to its inverse depth d. Given a homogeneous 3D point X = (X, Y, Z, 1),

π(X) = ( f_x X / Z + c_x,  f_y Y / Z + c_y,  1 / Z )    (1)

where (f_x, f_y, c_x, c_y) are the camera intrinsics.

Given a dense depth map Z, we can use the inverse projection function

π^{-1}(u, v, d) = ( (u - c_x) / (f_x d),  (v - c_y) / (f_y d),  1 / d )    (2)

which maps a pixel (u, v) with inverse depth d = 1/Z back to the 3D point X.
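As a concrete illustration, here is a minimal NumPy sketch of an augmented pinhole projection and its inverse, assuming the intrinsics are given as a tuple (fx, fy, cx, cy); the function names and array layout are ours, not the paper's.

```python
import numpy as np

def project(X, intrinsics):
    """Map 3D points (..., 3) to augmented pixel coordinates (u, v, inverse depth)."""
    fx, fy, cx, cy = intrinsics
    x, y, z = X[..., 0], X[..., 1], X[..., 2]
    u = fx * x / z + cx
    v = fy * y / z + cy
    d = 1.0 / z                       # third coordinate: inverse depth
    return np.stack([u, v, d], axis=-1)

def inv_project(uvd, intrinsics):
    """Map augmented pixel coordinates (u, v, inverse depth) back to 3D points."""
    fx, fy, cx, cy = intrinsics
    u, v, d = uvd[..., 0], uvd[..., 1], uvd[..., 2]
    z = 1.0 / d
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)
```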

Mapping Between Images: We use a dense transformation field T to represent the 3D motion between a pair of frames. Using T, we can construct a function which maps points in frame 1 to frame 2. Letting x = (u, v) be the pixel coordinate at index i with inverse depth d, the mapping

x' = π( T_i · π^{-1}(x, d) )    (3)

can be used to find the correspondence x' of x in frame 2, where T_i is the transformation assigned to pixel i.

A flow vector can be obtained by taking the difference between the mapped coordinates x' and the original augmented coordinates (u, v, d). The first two components of the flow vector give the standard optical flow. The last component provides the change in inverse depth between the pair of frames. The focus of this paper is to recover T given a pair of frames.
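Reusing the projection functions sketched above, the mapping of Eqn. 3 and the induced flow could be computed as follows; here the SE3 field is represented, for simplicity, as per-pixel rotation matrices R and translations t, and all names are ours rather than the paper's.

```python
import numpy as np

def induced_flow(R, t, depth1, coords, intrinsics):
    """Warp frame-1 pixels with a per-pixel rigid motion and return the 3D flow vector.

    R: (H, W, 3, 3) rotations, t: (H, W, 3) translations, depth1: (H, W) depth of frame 1,
    coords: (H, W, 2) pixel grid (u, v). Returns (H, W, 3): (optical flow, inverse-depth change).
    """
    d = 1.0 / depth1
    uvd = np.concatenate([coords, d[..., None]], axis=-1)
    X1 = inv_project(uvd, intrinsics)                    # back-project frame-1 pixels
    X2 = np.einsum('hwij,hwj->hwi', R, X1) + t           # apply the per-pixel rigid transform
    return project(X2, intrinsics) - uvd                 # difference gives flow + inverse-depth change
```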

Jacobians: For optimization purposes, we will need the Jacobian of the mapping in Eqn. 3. Using the chain rule, we can compute it as the product of the projection Jacobian

(4)

and the transformation Jacobian

(5)

using local coordinates defined by the retraction. The Jacobian of Eqn. 3 is then the product of these two terms.

Optimization on Lie Manifolds: The space of rigid-body transformations forms a Lie group, which is a smooth manifold and a group. In this paper, we use the Gauss-Newton algorithm to perform optimization steps over the space of dense SE3 fields.

Given a weighted least squares objective

E(T) = Σ_i w_i ||r_i(T)||^2    (6)

the Gauss-Newton algorithm linearizes the residual terms about the current estimate,

r_i(T ⊕ δ) ≈ r_i(T) + J_i δ    (7)

where ⊕ denotes the retraction, and solves for the update

δ* = argmin_δ Σ_i w_i ||r_i(T) + J_i δ||^2.    (8)

The update is found by solving Eqn. 8 and applying the retraction. Eqn. 8 can be rewritten as the linear system

(J^T W J) δ = −J^T W r    (9)

and J^T W J and J^T W r can be constructed without explicitly forming the Jacobian matrices:

J^T W J = Σ_i w_i J_i^T J_i,    J^T W r = Σ_i w_i J_i^T r_i.    (10)

This fact is especially useful when solving optimization problems with millions of residual terms. In this setting, storing the full Jacobian matrix becomes impractical.
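The sketch below shows this accumulation for a single 6-DOF update; it is generic Gauss-Newton bookkeeping under the shapes stated in the docstring, not the paper's CUDA implementation, and the small damping term is our addition.

```python
import numpy as np

def gauss_newton_step(residuals, jacobians, weights, damping=1e-6):
    """Accumulate the normal equations term by term and solve for the update.

    residuals: (N, 3)    residual vectors r_i
    jacobians: (N, 3, 6) Jacobians J_i with respect to the local se(3) update
    weights:   (N,)      per-term weights w_i
    Returns the 6-vector update (to be applied through the retraction).
    """
    H = np.zeros((6, 6))
    b = np.zeros(6)
    for r, J, w in zip(residuals, jacobians, weights):
        H += w * (J.T @ J)        # sum_i w_i J_i^T J_i, never storing the full Jacobian
        b -= w * (J.T @ r)        # -sum_i w_i J_i^T r_i
    H += damping * np.eye(6)      # small damping for numerical stability
    return np.linalg.solve(H, b)
```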

3.2 Network Architecture

Our network architecture is based on RAFT[teed2020raft]. We construct a full 4D correlation volume by computing the visual similarity between all pairs of pixels between the two input images. During each iteration, the network uses the current estimate of the SE3 field to index from the correlation volume. Correlation features are then fed into an recurrent update operator which estimates a dense flow field. We provide an overview of the RAFT architecture here, but more details can be found in [teed2020raft].

Feature Extraction: We first extract features from the two input images using two separate feature extraction networks. The feature encoder is applied to both images with shared weights and extracts a dense 128-dimensional feature map at 1/8 resolution. It consists of 6 residual blocks: 2 at 1/2 resolution, 2 at 1/4 resolution, and 2 at 1/8 resolution. We provide more details of the network architectures in the appendix.

The context encoder extracts semantic and contextual information from the first image. Different from the original RAFT [teed2020raft], we use a pretrained ResNet50 [resnet] with a skip connection to extract context features at 1/8 resolution. The reason behind this change is that grouping pixels into rigidly moving regions requires a greater degree of semantic information and a larger receptive field. During training, we freeze the batch norm layers in the context encoder.

Computing Visual Similarity: We construct a 4D correlation volume by computing the dot product between all pairs of feature vectors from the two input images,

C(i, j, k, l) = ⟨ f_1(i, j), f_2(k, l) ⟩    (11)

where f_1 and f_2 are the feature maps of the two images. We then pool the last two dimensions of the correlation volume 3 times using average pooling, resulting in a 4-level correlation pyramid

(12)
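A PyTorch sketch of the correlation volume and pooled pyramid, assuming 1/8-resolution feature maps of shape (B, C, H, W); the 2x2 average pooling and the sqrt(C) normalization follow RAFT's public implementation and should be treated as assumptions here.

```python
import torch
import torch.nn.functional as F

def correlation_pyramid(fmap1, fmap2, num_levels=4):
    """Return a list of correlation volumes of shape (B*H*W, 1, H/2^l, W/2^l)."""
    B, C, H, W = fmap1.shape
    f1 = fmap1.view(B, C, H * W)
    f2 = fmap2.view(B, C, H * W)
    corr = torch.einsum('bci,bcj->bij', f1, f2) / C ** 0.5   # all-pairs dot products
    corr = corr.view(B * H * W, 1, H, W)                     # frame-2 dims treated spatially
    pyramid = [corr]
    for _ in range(num_levels - 1):
        corr = F.avg_pool2d(corr, kernel_size=2, stride=2)   # pool only the frame-2 dimensions
        pyramid.append(corr)
    return pyramid
```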

Indexing the Correlation Pyramid: Given a current estimate of the correspondence x', we can index from the correlation volume to produce a set of correlation features. First we construct a neighborhood grid around x',

(13)

and then use the neighborhood to sample from the correlation volume using bilinear sampling. We note that the construction of and indexing from the correlation volume are performed in an identical manner to RAFT [teed2020raft].
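A sketch of the lookup, assuming the pyramid from the previous block and current correspondences coords of shape (B, H, W, 2) in (x, y) pixel units; the neighborhood radius and the normalization details are ours.

```python
import torch
import torch.nn.functional as F

def lookup(pyramid, coords, radius=4):
    """Sample a (2r+1) x (2r+1) correlation neighborhood around coords at every level."""
    B, H, W, _ = coords.shape
    r = radius
    d = torch.linspace(-r, r, 2 * r + 1, device=coords.device)
    delta = torch.stack(torch.meshgrid(d, d, indexing='xy'), dim=-1)     # (2r+1, 2r+1, 2)
    features = []
    for lvl, corr in enumerate(pyramid):                                 # corr: (B*H*W, 1, h, w)
        h, w = corr.shape[-2:]
        centroid = coords.reshape(B * H * W, 1, 1, 2) / 2 ** lvl
        grid = centroid + delta.unsqueeze(0)                             # neighborhood in pixels
        scale = torch.tensor([w - 1, h - 1], dtype=grid.dtype, device=grid.device)
        grid = 2 * grid / scale - 1                                      # normalize for grid_sample
        samples = F.grid_sample(corr, grid, align_corners=True)          # bilinear sampling
        features.append(samples.view(B, H, W, -1))
    return torch.cat(features, dim=-1)                                   # per-pixel correlation features
```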

Update Operator: The update operator is a recurrent GRU unit which retrieves features from the correlation volume using the indexing operator and outputs a set of revisions. RAFT uses a series of 1x5 and 5x1 GRU units; we instead use a single 3x3 convolutional GRU with kernels at dilation rates of 1 and 3. We provide more details on the architecture of the update operator in the appendix.
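For reference, a single 3x3 convolutional GRU cell might look like the sketch below; the channel sizes are placeholders and the exact way the two dilation rates are combined is not shown, so this is a generic ConvGRU rather than the paper's exact unit.

```python
import torch
import torch.nn as nn

class ConvGRU(nn.Module):
    """A 3x3 convolutional GRU cell with a configurable dilation rate."""
    def __init__(self, hidden_dim=128, input_dim=256, dilation=1):
        super().__init__()
        kwargs = dict(kernel_size=3, padding=dilation, dilation=dilation)
        self.convz = nn.Conv2d(hidden_dim + input_dim, hidden_dim, **kwargs)  # update gate
        self.convr = nn.Conv2d(hidden_dim + input_dim, hidden_dim, **kwargs)  # reset gate
        self.convq = nn.Conv2d(hidden_dim + input_dim, hidden_dim, **kwargs)  # candidate state

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))
        r = torch.sigmoid(self.convr(hx))
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * q
```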

Using Eqn. 3, we can use the current estimate of T to compute the 2D correspondences. The following features are used as input to the GRU:

  • Flow field: the 2D flow induced by the current SE3 estimate (Eqn. 3)

  • Twist field: the current SE3 field expressed in local (twist) coordinates

  • Depth residual: the difference between the inverse depth implied by the current SE3 field and the inverse depth of frame 2 sampled at the correspondence

  • Correlation features: features retrieved from the correlation pyramid at the current correspondence

In the depth residual term, one inverse depth is obtained from the depth component of the mapped point, i.e., the backprojected pixel expressed in the coordinate system of frame 2; the other is obtained by indexing the inverse depth map of frame 2 at the correspondence of pixel x. If pixel x is non-occluded, an accurate SE3 field should result in a depth residual of 0. Each of the derived features is processed through 2 convolutional layers and then provided as input to the convolutional GRU.

The hidden state is then used to predict the inputs to the Dense-SE3 layer. We apply two convolutional layers to the hidden state to output a rigid-motion embedding map. We additionally predict a "revision map" and corresponding confidence maps. The flow revisions correspond to corrections that should be made to the optical flow induced by the current SE3 field; in other words, the network produces a new estimate of pixel correspondence, but expresses it on top of the flow induced by the SE3 field. The inverse-depth revision is the correction that should be made to the inverse depth of frame 2 when it is used by Dense-SE3 to enforce geometric consistency; this accounts for noise in the input depth as well as occlusions. The embedding map and revision maps are taken as input by the Dense-SE3 layer to produce an update to the SE3 motion field.

SE3 Upsampling: The SE3 motion field estimated by the network is at 1/8 of the input resolution. We use convex upsampling [teed2020raft] to upsample the transformation field to the full input resolution. In RAFT [teed2020raft], the high-resolution flow field is taken to be a convex combination of grids at the lower resolution, with combination weights predicted by the network. However, the SE3 field lies on a manifold and is not closed under linear combinations. Instead, we perform upsampling by first mapping to the Lie algebra using the logarithm map, performing convex upsampling in the Lie algebra, and then mapping back to the manifold using the exponential map.
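A sketch of the convex upsampling step applied in the Lie algebra, following RAFT's convex-combination scheme; the mask layout and the external log/exp maps are assumptions on our part.

```python
import torch
import torch.nn.functional as F

def upsample_se3_log(twist, mask, factor=8):
    """Convex upsampling of an SE3 field expressed in se(3) coordinates.

    twist: (B, 6, H, W) log map of the low-resolution SE3 field (computed elsewhere,
           e.g. with a Lie-group library).
    mask:  (B, 9 * factor**2, H, W) unnormalized weights predicted by the network.
    Returns (B, 6, factor*H, factor*W); apply the exponential map to return to SE(3).
    """
    B, _, H, W = twist.shape
    mask = mask.view(B, 1, 9, factor, factor, H, W)
    mask = torch.softmax(mask, dim=2)                       # convex weights over 3x3 neighbors
    nbr = F.unfold(twist, kernel_size=3, padding=1)         # (B, 6*9, H*W)
    nbr = nbr.view(B, 6, 9, 1, 1, H, W)
    up = torch.sum(mask * nbr, dim=2)                       # (B, 6, factor, factor, H, W)
    up = up.permute(0, 1, 4, 2, 5, 3)                       # (B, 6, H, factor, W, factor)
    return up.reshape(B, 6, factor * H, factor * W)
```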

3.3 Dense-SE3 Layer

The key ingredient of our approach is the Dense-SE3 layer. Each application of the update module produces a revision map. The Dense-SE3 layer is a differentiable optimization layer which maps the revision map to an update of the SE3 field.

The rigid-motion embedding vectors are used to softly group pixels into rigid objects. Given two embedding vectors e_i and e_j, we compute an affinity by taking the exponential of the negative L2 distance

w_ij = exp( −||e_i − e_j|| )    (14)

Objective Function: Using the affinity terms, we define an objective function based on the reprojection error

(15)
(16)

The objective states that for every pixel i, we want a transformation T_i which describes the motion of the pixels j in its neighborhood N(i). However, not every pixel in the neighborhood belongs to the same rigidly moving object; that is the purpose of the embedding vectors. Only pairs with similar embeddings contribute significantly to the objective function.

Efficient Optimization: We apply a single Gauss-Newton update to Eqn. 16 to generate the next SE3 estimate. Since the Dense-SE3 layer is applied after each application of the update operator, 12 iterations of the update operator yield 12 Gauss-Newton updates.

The objective defined in Eqn. 16 can result in a very large optimization problem. We generally use a large neighborhood in practice; in some experiments we take the neighborhood to be the entire image. For the FlyingThings3D dataset, this results in roughly 200 million equations and 50,000 variables. Trying to store the full system would exceed available memory.

However, each term in Eqn. 16 only includes a single transformation T_i. This means that instead of solving one optimization problem over all variables jointly, we can solve a set of independent problems, each with only the 6 variables of one pixel's transformation. Furthermore, we can leverage Eqn. 10 and build the linear system in place without explicitly constructing the Jacobian. When implemented directly in CUDA, a Gauss-Newton update of Eqn. 16 can be performed very quickly and is not a bottleneck in our approach.
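Schematically, the per-pixel solves can be batched as below; the shapes, the way the weights fold in the affinities and confidences, and the damping term are our assumptions rather than the paper's kernel.

```python
import torch

def dense_se3_update(J, r, w):
    """Solve one 6x6 Gauss-Newton system per pixel, batched over all pixels.

    J: (P, N, 3, 6) Jacobians of the N neighborhood residuals of each of the P pixels
    r: (P, N, 3)    residuals built from the predicted flow / inverse-depth revisions
    w: (P, N)       weights combining embedding affinities and predicted confidences
    Returns (P, 6) se(3) updates, one per pixel, to be applied via the retraction.
    """
    Jw = J * w[..., None, None]
    H = torch.einsum('pnij,pnik->pjk', Jw, J)                 # sum_n w_n J_n^T J_n
    b = -torch.einsum('pnij,pni->pj', Jw, r)                  # -sum_n w_n J_n^T r_n
    H = H + 1e-6 * torch.eye(6, device=J.device, dtype=J.dtype)
    return torch.linalg.solve(H, b)
```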

3.4 Boundary-Aware Global Pooling

The boundary-aware global pooling layer allows rigid-motion embeddings to be aggregated within motion boundaries. Since our architecture operates primarily at high resolution, it can be difficult for the network to group pixels that span large objects.

Method 2D EPE Acc. (px thresholds) 3D EPE
All Zeros 54.82 0.00 0.00 0.010 0.345
SRSF [quiroga2014dense] 41.90 0.028 0.079 0.173 8.367
2D (RAFT) 4.30 0.567 0.818 0.874 -
3D 5.44 0.489 0.774 0.844 0.173
6D 4.14 0.606 0.824 0.877 0.165
Ours 2.53 0.832 0.913 0.931 0.032
Table 1: Results on the Asteroids dataset, comparing against baselines that output 2D, 3D, and 6D motion fields.

We propose a new boundary-aware global pooling layer, which aggregates embedding vectors within motion boundaries. Given an embedding map e, we have the GRU predict additional edge weights w_x and w_y and define the objective

(17)

where D_x and D_y are linear finite-difference operators and e is the flattened feature map.

In other words, we want to solve for a new embedding map which is smooth within motion boundaries and close to the original embedding map e. At boundaries, the network can set the weights to 0 so that edges do not get smoothed over. Eqn. 17 can be solved in closed form using a sparse Cholesky decomposition; we use the CHOLMOD library [chen2008algorithm]. With nested dissection [george1973nested], factorization can be performed in O(n^{3/2}) time and back-substitution in O(n log n) time. In the appendix, we derive the gradients of Eqn. 17 so that w_x and w_y can be trained without direct supervision.
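One way to realize this layer, under our reading of the objective (an L2 fidelity term plus weighted finite-difference smoothness terms), is sketched below; SciPy's sparse LU stands in for the CHOLMOD solver used in the paper, and the discretization details are ours.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def boundary_aware_pool(e, wx, wy):
    """Smooth an embedding map within predicted motion boundaries.

    e: (H, W, C) embedding map; wx, wy: (H, W) non-negative edge weights,
    near zero at motion boundaries. Solves
        (I + Dx^T diag(wx) Dx + Dy^T diag(wy) Dy) e' = e
    independently for every channel, reusing a single factorization.
    """
    H, W, C = e.shape
    n = H * W
    ones = np.ones(n)
    # forward finite differences on the flattened row-major grid (boundary handling simplified)
    Dx = sp.diags([-ones, np.ones(n - 1)], [0, 1], shape=(n, n), format='csr')
    Dy = sp.diags([-ones, np.ones(n - W)], [0, W], shape=(n, n), format='csr')
    A = sp.identity(n) \
        + Dx.T @ sp.diags(wx.ravel()) @ Dx \
        + Dy.T @ sp.diags(wy.ravel()) @ Dy
    lu = spla.splu(A.tocsc().astype(np.float64))        # factor once, reuse for every channel
    out = lu.solve(e.reshape(n, C).astype(np.float64))  # (n, C) right-hand side
    return out.reshape(H, W, C)
```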

3.5 Supervision

We supervise our network on a combination of ground truth optical flow and inverse depth change. Our network outputs a sequence of SE3 field estimates. For each estimate, we compute the induced optical flow and inverse depth change

(18)

where the mapping is applied over a dense coordinate grid in frame 1. We compute the loss as a weighted sum over all estimates in the sequence

(19)

Note that no supervision is applied to the embedding vectors; rigid-motion embeddings are learned implicitly by differentiating through the dense update layer. We also apply an additional loss directly to the revisions predicted by the GRU, with a weight of 0.2.
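A sketch of the sequence loss under the assumption of RAFT-style exponential weighting; the gamma value here is borrowed from RAFT, not stated in this text.

```python
import torch

def sequence_loss(flow_preds, dz_preds, flow_gt, dz_gt, gamma=0.8):
    """L1 loss over the per-iteration predictions, weighting later iterations more.

    flow_preds, dz_preds: lists of induced optical flow / inverse-depth change per iteration.
    """
    n = len(flow_preds)
    loss = 0.0
    for k in range(n):
        weight = gamma ** (n - k - 1)                   # assumption: exponential weighting
        loss = loss + weight * (flow_preds[k] - flow_gt).abs().mean()
        loss = loss + weight * (dz_preds[k] - dz_gt).abs().mean()
    return loss
```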

Figure 2: Visualization of the predicted motion fields on FlyingThings3D (top) and KITTI (bottom). Our network outputs a dense SE3 motion field, which can be used to compute optical flow. Here, we visualize the translational and rotational components of the SE3 field. Note that the rotation and translation fields are piecewise constant—pixels from the same rigid object are assigned the same SE3 motion.
Evaluation Method Input 2D Metrics 3D Metrics
1px EPE d<0.05 d<0.10 EPE
Non-Occluded (< 35m) FlowNetC [liu2019flownet3d] RGB-D - - 0.25% 1.74% 0.7836
ICP [besl1992method] XYZ - - 7.62% 21.98% 0.5019
FlowNet3D [liu2019flownet3d] XYZ - - 25.37% 57.85% 0.1694
FlowNet3D++[wang2020flownet3d++] RGB-D - - 30.33% 63.43% 0.1369
RAFT [teed2020raft] RGB 77.47% 3.63 - - -
RAFT (2D flow backprojected) RGB-D 77.02% 3.19 60.22% 66.73% 1.8076
RAFT (2D flow + depth change) RGB-D 73.51% 3.64 47.01% 61.87% 0.4288
RAFT (3D flow) RGB-D 74.84% 3.75 56.29% 75.50% 0.1172
Ours RGB-D 81.72% 2.63 83.71% 89.23% 0.0573
All RAFT [teed2020raft] RGB 79.37% 3.53 - - -
RAFT (2D flow backprojected) RGB-D 78.80% 3.42 50.58% 55.74% 5.4421
RAFT (2D flow + depth change) RGB-D 75.21% 3.66 33.87% 47.22% 1.2184
RAFT (3D flow) RGB-D 73.56% 4.42 36.19% 55.40% 0.2656
Ours RGB-D 86.35% 2.46 87.81% 91.52% 0.0619
Table 2: Results on the FlyingThings3D dataset. We use the split proposed by Liu et al. [liu2019flownet3d]. We evaluate both 2D flow metrics and 3D scene flow metrics. Our approach outperforms FlowNet3D [liu2019flownet3d] and FlowNet3D++ [wang2020flownet3d++] by a large margin. Our method also achieves better optical flow performance than RAFT, since we are able to reason about rigidly moving objects.

4 Experiments

We evaluate our approach on a variety of real and synthetic datasets. For all experiments we use the AdamW optimizer [loshchilov2017decoupled] with weight decay, and unroll 12 iterations of the update operator. All components of the network are trained from scratch, with the exception of the context encoder, which uses ImageNet [deng2009imagenet] pretrained weights.

4.1 Asteroids

We begin by evaluating 3D motion estimation when depth is given as input. We create a synthetic dataset we call Asteroids, with 10,000 videos of moving asteroids (90-5-5 train-test-val split). Each scene is constructed by placing 3-10 randomly textured asteroids into a Blender scene (https://www.blender.org/). The 3D models are of real asteroids downloaded from the 3D Asteroid Catalogue (https://3d-asteroids.space/) and textures are taken from the Describable Textures Dataset [dtd].

We compare against several baselines. All baselines are given the depth of the first frame as input (Semi-Rigid Scene Flow (SRSF) [quiroga2014dense] is given depth for both frames), and the task is to predict the motion between the first and second frames. The 2D, 3D, and 6D baselines all have the same architecture as RAFT-3D, except that they directly predict motion fields instead of using our Dense-SE3 layer. The 2D baseline simply predicts optical flow, exactly like RAFT [teed2020raft]. The 3D and 6D baselines directly estimate either 3D scene flow or 6D rotation/translation fields. All methods are trained on 2D optical flow and 3D scene flow, except the 2D network, which only uses optical flow.

The results (Table 1) show that our Dense-SE3 layer improves both optical flow and 3D scene flow. Visualizing the embedding and rotation/translation fields shows that the network can accurately segment pixels into rigid bodies.

Disparity 1 Disparity 2 Optical Flow Scene Flow
Methods Runtime bg fg all bg fg all bg fg all bg fg all
OSF [menze2015object] 50 mins 4.54 12.03 5.79 5.45 19.41 7.77 5.62 18.92 7.83 7.01 26.34 10.23
SSF [ren2017cascaded] 5 mins 3.55 8.75 4.42 4.94 17.48 7.02 5.63 14.71 7.14 7.18 24.58 10.07
Sense [jiang2019sense] 0.31s 2.07 3.01 2.22 4.90 10.83 5.89 7.30 9.33 7.64 8.36 15.49 9.55
DTF Sense [schuster2020deep] 0.76 sec 2.08 3.13 2.25 4.82 9.02 5.52 7.31 9.48 7.67 8.21 14.08 9.18
PRSM* [vogel20153d] 5 mins 3.02 10.52 4.27 5.13 15.11 6.79 5.33 13.40 6.68 6.61 20.79 8.97
OpticalExp [yang2020upgrading] 2.0 sec 1.48 3.46 1.81 3.39 8.54 4.25 5.83 8.66 6.30 7.06 13.44 8.12
ISF [behl2017bounding] 10 mins 4.12 6.17 4.46 4.88 11.34 5.95 5.40 10.29 6.22 6.58 15.63 8.08
ACOSF [Cong2020ICPR] 5 mins 2.79 7.56 3.58 3.82 12.74 5.31 4.56 12.00 5.79 5.61 19.38 7.90
DRISF [ma2019deep] 0.75 sec (2 GPUs) 2.16 4.49 2.55 2.90 9.73 4.04 3.59 10.40 4.73 4.39 15.94 6.31
Ours 2.0 sec 1.48 3.46 1.81 2.51 9.46 3.67 3.39 8.79 4.29 4.27 13.27 5.77
Table 3: Results of the top performing methods on the KITTI leaderboard. Ours ranks first on the leaderboard among all published methods.

4.2 FlyingThings3D

The FlyingThings3D dataset was introduced as part of the synthetic Scene Flow datasets by Mayer et al. [mayer2016large]. The dataset consists of ShapeNet [chang2015shapenet] shapes with randomized translations and rotations, placed in a scene populated with background objects. While the dataset is not naturalistic, it offers a challenging combination of camera and object motion, each of which spans all 6 degrees of freedom.

We train our network for 200k iterations with a batch size of 3 and a crop size of [320, 720]. We perform spatial augmentation by random cropping and resizing, and adjust the intrinsics accordingly. We use an initial learning rate of 0.0001 and decay the learning rate linearly during training.

We evaluate our network using 2D and 3D end-point-error (EPE). 2D EPE is defined as the Euclidean distance between the ground truth optical flow and the predicted optical flow, which can be obtained from the 3D transformation field using Eqn. 3. 3D EPE is the Euclidean distance between the ground truth 3D scene flow and the predicted scene flow. We also report threshold metrics, which measure the fraction of pixels whose error lies within a given threshold.
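For concreteness, the metrics can be computed as in the sketch below; the threshold values passed in are examples, not a restatement of the exact thresholds used in each table.

```python
import torch

def endpoint_error_metrics(pred, gt, thresholds=(0.05, 0.10)):
    """EPE and threshold accuracies for 2D flow or 3D scene flow vectors of shape (..., 2 or 3)."""
    err = torch.norm(pred - gt, dim=-1)                  # per-pixel end-point error
    metrics = {'epe': err.mean().item()}
    for t in thresholds:
        metrics[f'acc@{t}'] = (err < t).float().mean().item()
    return metrics
```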

We report results in Table 2 and compare to scene flow [liu2019flownet3d, wang2020flownet3d++] and optical flow [teed2020raft] networks. In the top portion of the table, we use the evaluation setup of FlowNet3D [liu2019flownet3d] and FlowNet3D++ [wang2020flownet3d++], where only non-occluded pixels with depth < 35 meters are evaluated. Our method improves the 3D accuracy from 30.33% to 83.71%.

We compare to RAFT [teed2020raft], which estimates optical flow between a pair of frames, and to several baselines which modify RAFT to predict 3D motion. All RAFT baselines use the same network architecture as our approach, including the pretrained ResNet-50. All baselines are provided with inverse depth as input, which is concatenated with the input images. We also experimented with providing depth directly as input, but found that inverse depth gives the best results.

RAFT (2D flow backprojected) uses the depth maps to backproject 2D motion into a 3D flow vector, but this only works for non-occluded pixels, which is the reason for the very large 3D EPE error. RAFT (2D flow + depth change) predicts 2D flow in addition to inverse depth change, which can be used to recover 3D flow fields. Finally, we also test a version of RAFT which predicts 3D motion fields directly, RAFT (3D flow). We find that our method outperforms all of these baselines by a large margin, particularly on the 3D metrics. This is because our network operates directly on the SE3 motion field, which offers a more structured representation than flow fields, and the Dense-SE3 layer produces analytically constrained updates which the other baselines lack.

In the second portion of the table, we evaluate over all pixels (excluding extremely fast moving objects with flow > 250 pixels). Since we decompose the scene into rigidly moving components, our method can estimate the motion of occluded regions as well. We provide qualitative results in Fig. 2. These examples show that our network can segment the scene into rigidly moving regions, producing piecewise constant SE3 motion fields, even though no supervision is used on the embeddings.

4.3 KITTI

Starting from our model trained on FlyingThings3D, we finetune on KITTI for an additional 50k iterations. We use a crop size of [288, 960] and perform spatial and photometric augmentation. To estimate disparity, we use GA-Net [zhang2019ga], which provides the input depth maps for our method.

Experiment Method 2D Metrics 3D Metrics
1px EPE d<0.05 d<0.10 EPE
Iterations 1 62.1 6.05 56.0 65.9 0.212
3 82.8 2.95 80.5 85.7 0.098
8 85.5 2.47 86.4 90.5 0.062
16 85.8 2.43 87.1 91.0 0.059
32 85.7 2.50 87.0 90.9 0.061
Neighborhood Radius (px) 8 73.2 4.01 38.7 59.0 0.192
64 83.8 2.52 78.1 86.6 0.078
256 85.8 2.43 87.1 91.0 0.059
Full Image 83.3 2.91 83.2 88.1 0.078
Revision Factors Flow 86.1 2.29 84.6 88.7 0.081
Flow + Inv. Depth 85.8 2.43 87.1 91.0 0.059
Boundary-Aware Pooling No 85.8 2.43 87.1 91.0 0.059
Yes 86.3 2.45 87.8 91.5 0.062
Table 4: Ablation experiments; details of the individual experiments are provided in Sec. 4.4.

We submit our method to the KITTI leaderboard and report results from our method and other top-performing methods in Tab. 3. Our approach outperforms all published methods. DRISF [ma2019deep] is the next best performing approach; it combines PSMNet [chang2018pyramid], PWC-Net [pwcnet], and Mask R-CNN [he2017mask]. Mask R-CNN is pretrained on Cityscapes and fine-tuned on KITTI using bounding box and instance mask supervision. Our network outperforms DRISF despite training only on FlyingThings3D and KITTI and using no instance supervision.

4.4 Ablations

We ablate various components of our model and report results in Tab. 4. For all ablations, we use our network without boundary-aware pooling as the baseline architecture.

Iterations: We evaluate the performance of our model as a function of the number of applications of the update operator. We find that more iterations give better performance up to about 16, after which we observe a slight degradation.

Neighborhood Radius: The Dense-SE3 layer defines an objective in which all pairs of pixels within a specified radius contribute to the objective. Here, we train networks with the radius set to 8, 64, and 256 pixels, as well as to the full image; in the last case, all pairs of pixels in the image contribute to the objective. We find that a radius of 256 gives better performance than smaller radii; however, using the full image gives worse performance. This is likely due to the fact that most rigid objects will be less than 512 pixels in diameter, and imposing a restriction on the radius is a useful prior.

Revision Factors: The update operator produces a set of revisions which are used as input to the Dense-SE3 layer. Here we experiment with different revisions. In Flow, we only use the optical flow revisions. In Flow + Inv. Depth, we additionally include the inverse depth revisions. We find that including inverse depth revisions leads to better performance on the 3D metrics because it allows for consistency between the two depth maps.

Figure 3: Impact of boundary-aware pooling on motion output.

Boundary-Aware Global Pooling: Here we test the impact of our proposed boundary-aware global pooling layer. The pooling layer improves the threshold metrics, raising 1px accuracy from 85.8% to 86.3% and 3D threshold accuracy from 87.1% to 87.8%, while giving comparable average EPE. In Fig. 3 we see that the pooling layer produces qualitatively better results, particularly over large objects.

Timing: Applying 16 updates on a 1080Ti GPU takes 620 ms per image pair. When boundary-aware global pooling is used, inference takes 1.2 s per image pair.

Parameters: RAFT-3D has 45M trainable parameters. The ResNet50 backbone has 40M parameters, while the feature extractor and update operator make up the remaining 5M parameters.

5 Conclusion

We have introduced RAFT-3D, an end-to-end network for scene flow. RAFT-3D uses rigid-motion embeddings, which represent a soft grouping of pixels into rigidly moving objects. We demonstrate that these embeddings can be used to solve for dense and accurate 3D motion fields.

Acknowledgements: This research is partially supported by the National Science Foundation under Grant IIS-1942981.

References

Appendix A Network Architecture

Details of the network architecture, including the feature encoders and the GRU-based update operator, are shown in Fig. 4.

Appendix B Boundary Aware Pooling Gradients

The boundary-aware pooling layer minimizes an objective function of the form

(20)

where D_x and D_y are linear finite-difference operators and e is the flattened feature map.

First consider the case of a single channel. Let H denote the sparse, symmetric positive definite system matrix of the resulting normal equations and e' the pooled embedding; we can solve for e' via

(21)

We perform sparse Cholesky factorization and back-substitution to solve for e' using the CHOLMOD library [chen2008algorithm].

Gradients: In the backward pass, given the gradient of the loss with respect to the pooled embedding e', we need to find the gradients with respect to the input embedding e and the boundary weights w_x and w_y.

Given the linear system of Eqn. 21, these gradients can be found by solving the system in the backward direction [amos2017optnet]

(22)
(23)
(24)

Here an intermediate column vector is defined for notational convenience. Since H is symmetric positive definite, we can reuse the factorization from the forward pass.

To compute the gradients with respect to w_x and w_y, we use

(25)
(26)

giving

(27)

where ⊙ denotes elementwise multiplication. Similarly,

(28)

Multiple Channels: We can easily extend Eqn. 21 to multiple channels. Since the matrix H does not depend on the embedding values, it only needs to be factored once. We can solve Eqn. 21 for all channels by reusing the factorization, treating the right-hand side as a matrix with one column per channel. The gradient formulas are extended by summing the gradients over the channel dimension.
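As a worked illustration of this backward pass, the sketch below implements implicit differentiation of the linear solve, reusing the forward factorization; the sign conventions and the use of SciPy's LU in place of CHOLMOD are our choices, following the general recipe of [amos2017optnet] rather than copying the paper's exact formulas.

```python
import numpy as np

def boundary_pool_backward(lu, Dx, Dy, e_smooth, grad_out):
    """Backward pass of the solve H e' = e, with H as in the forward sketch.

    lu:       sparse factorization of H from the forward pass (reused here)
    Dx, Dy:   the finite-difference operators used to build H
    e_smooth: (n, C) solution e' of the forward solve
    grad_out: (n, C) upstream gradient dL/de'
    Returns (grad_e, grad_wx, grad_wy).
    """
    v = lu.solve(grad_out)                               # solve H v = dL/de' (H is symmetric)
    grad_e = v                                           # dL/de
    # dL/dw_k = - sum_c (D e')_{k,c} (D v)_{k,c} for each smoothness weight
    grad_wx = -np.sum((Dx @ e_smooth) * (Dx @ v), axis=1)
    grad_wy = -np.sum((Dy @ e_smooth) * (Dy @ v), axis=1)
    return grad_e, grad_wx, grad_wy
```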