1 Introduction
Scene flow is the task of estimating pixelwise 3D motion between a pair of video frames [vedula1999three]. Detailed 3D motion is required for many downstream applications including path planning, collision avoidance, virtual reality, and motion modeling. In this paper, we focus on stereo scene flow and RGBD scene flow, which address stereo video and RGBD video respectively.
Many scenes can be well approximated as a collection of rigidly moving objects. The motion of driving scenes, for example, can be modeled as a variable number of cars, buses, and trucks. The most successful scene flow approaches have exploited this structure by decomposing a scene into its rigidly moving components [menze2015object, vogel20113d, vogel20153d, ma2019deep, behl2017bounding, jaimez2015motion, jaimez2015primal, kumar2017monocular]. This introduces a powerful prior which can be used to guide inference. While optical flow approaches typically assume piecewise smooth motion, a scene containing rigid objects will exhibit piecewise constant 3D motion fields (Fig. 1).
Recently, many works have proposed integrating deep learning into scene flow estimation pipelines. A common approach has been to use object detection
[behl2017bounding, cao2019learning] or segmentation [behl2017bounding, ma2019deep, lv2018learning, ren2017cascaded] networks to decompose the scene into a collection of potentially rigidly moving objects. Once the scene has been segmented into its rigidly moving components, more traditional optimization can be used to fit a motion model to each of the objects. One limitation of this approach is that the networks require instance segmentation labels for training and cannot recover the motion of novel, unknown objects. Object detection and instance segmentation also introduce non-differentiable components into the network, making end-to-end training difficult without bounding box or instance-level supervision.

We introduce RAFT-3D, an end-to-end differentiable architecture which estimates pixelwise 3D motion from stereo or RGBD video. RAFT-3D is built on top of RAFT [teed2020raft], a state-of-the-art optical flow architecture that builds all-pairs correlation volumes and uses a recurrent unit to iteratively refine a 2D flow field. We retain the basic iterative structure of RAFT but introduce a number of novel designs.
The main innovation we introduce is rigid-motion embeddings, which are per-pixel vectors that represent a soft grouping of pixels into rigid objects. During inference, RAFT-3D iteratively updates the rigid-motion embeddings such that pixels with similar embeddings belong to the same rigid object and follow the same SE3 motion.
Integral to rigid-motion embeddings is Dense-SE3, a differentiable layer that seeks to ensure that the embeddings are geometrically meaningful. Dense-SE3 iteratively updates a dense field of per-pixel SE3 motion by performing unrolled Gauss-Newton iterations, such that the per-pixel SE3 motion is geometrically consistent with the current estimates of the rigid-motion embeddings and pixel correspondence. Because of Dense-SE3, the rigid-motion embeddings can be indirectly supervised from only ground truth 3D scene flow, and our approach does not need any supervision of object boxes or masks.
Fig. 1 provides an overview of our approach. RAFT-3D takes a pair of RGBD images as input. It extracts features from the input images and builds a 4D correlation volume by computing the visual similarity between all pairs of pixels. RAFT-3D maintains and updates a dense field of pixelwise SE3 motion. During each iteration, it uses the current estimate of SE3 motion to index from the correlation volume. A recurrent GRU-based update operator takes the correlation features and produces an estimate of pixel correspondence, which is then used by Dense-SE3 to generate updates to the SE3 motion field.
RAFT-3D achieves state-of-the-art accuracy. On FlyingThings3D, under the two-view evaluation [liu2019flownet3d], RAFT-3D improves the best published 3D accuracy from 30.33% to 83.71%. On KITTI, RAFT-3D achieves a scene flow error of 5.77, outperforming the best published method (6.31), despite using no object instance supervision.
2 Related Work
The task of reconstructing a 3D motion field from video is often referred to as estimating “scene flow”.
Optical Flow: Optical flow is the problem of estimating dense 2D pixel-level motion between a pair of frames. Early work formulated optical flow as an energy minimization problem, where the objective combined a data term, which encourages the matching of visually similar image regions, with a regularization term, which favors piecewise smooth motion fields. Many early scene flow approaches evolved from this formulation, replacing piecewise smooth flow priors with a piecewise constant rotation/translation field prior [vogel2013piecewise, menze2015object]. This greater degree of structure allowed scene flow methods to outperform approaches which treated optical flow or stereo separately [vogel20113d].
Recently, the problem of optical flow has been reformulated in the context of deep learning. Many works have demonstrated that a neural network can be directly trained to estimate optical flow between a pair of frames, and a large variety of network architectures have been proposed for the task
[flownet1, flownet2, pwcnet, ranjan2017optical, lu2020devon, yang2019volumetric, teed2020raft]. RAFT [teed2020raft] is a recurrent network architecture for estimating optical flow. RAFT builds a 4D correlation volume by computing the visual similarity between all pairs of pixels; then, during inference, a recurrent update operator indexes from the correlation volume to produce a flow update. A unique feature of RAFT is that a single, high-resolution flow field is maintained and updated.

Our approach is based on the RAFT architecture, but instead of a flow field, we estimate an SE3 motion field, where a rigid body transformation is estimated for each pixel. When projected onto the image, our SE3 motion vectors give more accurate optical flow than RAFT.
Rectified Stereo: Rectified stereo can be viewed as a 1-dimensional analog of optical flow, where the correspondence of each pixel in the left image is constrained to lie on a horizontal line spanning the right image. Like optical flow, traditional methods treated stereo as an energy minimization problem [hirschmuller2005accurate, ranftl2012pushing], often exploiting planar information [bleyer2011patchmatch].
Recent deep learning approaches have borrowed many core concepts from conventional approaches, such as the use of a 3D cost volume [gcnet], while replacing hand-crafted features and similarity metrics with learned features, and cost volume filtering with a learned 3D CNN. As with optical flow, a variety of network architectures have been proposed [gcnet, zhang2019ga, guo2019group, chang2018pyramid]. Here we use GANet [zhang2019ga] to estimate depth for each left/right image pair.
Scene Flow: Like optical flow and stereo, scene flow can be approached as an energy minimization problem. The objective is to recover a flow field such that (1) visually similar image regions are aligned and (2) the flow field maximizes some prior, such as piecewise rigid motion and piecewise planar depth. Both variational optimization [quiroga2014dense, jaimez2015motion, jaimez2015primal] and discrete optimization [menze2015object, jaimez2015primal] approaches have been explored for inference. Our network is designed to mimic the behavior of an optimization algorithm: we maintain an estimate of the current motion field, which is updated and refined with each iteration.
Jaimez et al. [jaimez2015motion] proposed an alternating optimization approach for scene flow estimation from a pair of RGBD images, iterating between grouping pixels into rigidly moving clusters and estimating the motion model for each cluster. Our method shares key ideas with this approach, namely the grouping of pixels into rigidly moving objects; however, we avoid hard clustering by using rigid-motion embeddings, which softly and differentiably group pixels into rigid objects.
Recent works have leveraged the object detection and semantic segmentation abilities of deep networks to improve scene flow accuracy [ma2019deep, cao2019learning, ren2017cascaded, behl2017bounding, gordon2019depth]. In these works, an object detection or instance segmentation network is trained to identify potentially moving objects, such as cars or buses. While these approaches have been very effective on driving datasets such as KITTI, where moving objects can be easily identified using semantics, they do not generalize well to novel objects. An additional limitation is that detection and instance segmentation introduce non-differentiable components into the pipeline, requiring these components to be trained separately on ground truth annotations. Ma et al. [ma2019deep] were able to train an instance segmentation network jointly with optical flow estimation by differentiating through Gauss-Newton updates; however, this required additional instance supervision and pretraining on Cityscapes [cordts2016cityscapes]. In contrast, our network outperforms these approaches without using object instance supervision.

Yang and Ramanan [yang2020upgrading] take a unique approach and use a network to predict optical expansion, i.e. the change in perceived object size. Combining optical expansion with optical flow gives normalized 3D scene flow; the scale ambiguity can be resolved using lidar, stereo, or monocular depth estimation. This approach does not require instance segmentation, but it also cannot directly enforce rigid motion priors.
Another line of work has focused on estimating 3D motion between a pair [liu2019flownet3d, wang2020flownet3d++, gu2019hplflownet] or sequence [liu2019meteornet, fan2019pointrnn] of point clouds. These approaches are well suited to lidar data, where the sensor produces sparse measurements. However, they do not directly exploit scene rigidity. As we demonstrate in our experiments, reasoning about object-level rigidity is critical for good accuracy.
3 Approach
We propose an iterative architecture for scene flow estimation from a pair of RGBD images. Our network takes two image/depth pairs, $(I_1, D_1)$ and $(I_2, D_2)$, and outputs a dense transformation field $\mathbf{T} \in SE(3)^{H \times W}$ which assigns a rigid body transformation to each pixel. For stereo images, the depth estimates $D_1$ and $D_2$ are obtained using an off-the-shelf stereo network.
3.1 Preliminaries
We use the pinhole projection model and assume known camera intrinsics. We use an augmented projection function $\pi$ which maps a 3D point to its projected pixel coordinates $(u, v)$ together with its inverse depth $d = 1/Z$. Given a homogeneous 3D point $\mathbf{X} = (X, Y, Z, 1)$,

$$\pi(\mathbf{X}) = \left( f_x \frac{X}{Z} + c_x, \; f_y \frac{Y}{Z} + c_y, \; \frac{1}{Z} \right) \qquad (1)$$

where $(f_x, f_y, c_x, c_y)$ are the camera intrinsics.

Given a dense depth map, we can use the inverse projection function

$$\pi^{-1}(u, v, d) = \left( \frac{u - c_x}{f_x \, d}, \; \frac{v - c_y}{f_y \, d}, \; \frac{1}{d}, \; 1 \right) \qquad (2)$$

which maps from pixel $(u, v)$ to the homogeneous point $\mathbf{X}$, again with inverse depth $d$.
Mapping Between Images: We use a dense transformation field $\mathbf{T} \in SE(3)^{H \times W}$ to represent the 3D motion between a pair of frames. Using $\mathbf{T}$, we can construct a function which maps points in frame 1 to frame 2. Letting $\mathbf{x}_{ij}$ be the pixel coordinate at index $(i, j)$ with inverse depth $d_{ij}$, the mapping

$$\mathbf{x}'_{ij} = \pi\big(\mathbf{T}_{ij} \cdot \pi^{-1}(\mathbf{x}_{ij}, d_{ij})\big) \qquad (3)$$

can be used to find the correspondence of $\mathbf{x}_{ij}$ in frame 2.

A 3D flow vector can be obtained by taking the difference $\mathbf{x}'_{ij} - \mathbf{x}_{ij}$. The first two components of the flow vector give the standard optical flow; the last component gives the change in inverse depth between the pair of frames. The focus of this paper is to recover $\mathbf{T}$ given a pair of frames.
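To make the mapping concrete, here is a minimal NumPy sketch of the projection, inverse projection, and induced flow (Eqns. 1-3). The function names and the 4x4 matrix representation of each transformation are our own illustrative conventions, not the paper's implementation:

```python
import numpy as np

def project(X, fx, fy, cx, cy):
    """Augmented projection (Eqn. 1): homogeneous 3D point -> (u, v, inverse depth)."""
    x, y, z = X[..., 0], X[..., 1], X[..., 2]
    return np.stack([fx * x / z + cx, fy * y / z + cy, 1.0 / z], axis=-1)

def inv_project(u, v, d, fx, fy, cx, cy):
    """Inverse projection (Eqn. 2): pixel + inverse depth -> homogeneous 3D point."""
    z = 1.0 / d
    return np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z, np.ones_like(z)], axis=-1)

def induced_flow(T, u, v, d, intrinsics):
    """Map pixels through a rigid motion T (Eqn. 3) and return the 3D flow
    vector (du, dv, change in inverse depth)."""
    fx, fy, cx, cy = intrinsics
    X = inv_project(u, v, d, fx, fy, cx, cy)      # points in frame 1
    X2 = np.einsum('...ij,...j->...i', T, X)      # apply the transformation
    x2 = project(X2, fx, fy, cx, cy)              # (u', v', d') in frame 2
    x1 = np.stack([u, v, d], axis=-1)
    return x2 - x1                                # first two components: optical flow
```

An identity transformation induces zero flow, while a pure translation of $t_x$ along the camera x-axis induces a horizontal flow of $f_x t_x / Z$.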
Jacobians: For optimization purposes, we will need the Jacobian of Eqn. 3. Using the chain rule, we can compute the Jacobian of Eqn. 3 as the product of the projection Jacobian

$$\frac{\partial \pi}{\partial \mathbf{X}} = \begin{bmatrix} f_x / Z & 0 & -f_x X / Z^2 \\ 0 & f_y / Z & -f_y Y / Z^2 \\ 0 & 0 & -1 / Z^2 \end{bmatrix} \qquad (4)$$

and the transformation Jacobian

$$\frac{\partial (\mathbf{T}\mathbf{X})}{\partial \delta\boldsymbol{\xi}} = \begin{bmatrix} \mathbf{I} & -[\mathbf{T}\mathbf{X}]_\times \end{bmatrix} \qquad (5)$$

using local coordinates defined by the retraction $\mathbf{T}' = \exp(\delta\boldsymbol{\xi}^\wedge)\,\mathbf{T}$, $\delta\boldsymbol{\xi} \in \mathbb{R}^6$. This gives the Jacobian of Eqn. 3 as the product $\mathbf{J} = \frac{\partial \pi}{\partial \mathbf{X}} \cdot \frac{\partial (\mathbf{T}\mathbf{X})}{\partial \delta\boldsymbol{\xi}}$.
Optimization on Lie Manifolds: The space of rigid-body transformations $SE(3)$ forms a Lie group, which is both a smooth manifold and a group. In this paper, we use the Gauss-Newton algorithm to perform optimization steps over the space of dense SE3 fields.
Given a weighted least squares objective

$$E(\mathbf{T}) = \sum_k w_k \, \| r_k(\mathbf{T}) \|^2 \qquad (6)$$

the Gauss-Newton algorithm linearizes the residual terms around the current estimate and solves for the update $\delta\boldsymbol{\xi}$:

$$E(\delta\boldsymbol{\xi}) = \sum_k w_k \, \| r_k + \mathbf{J}_k \, \delta\boldsymbol{\xi} \|^2 \qquad (7)$$

$$\delta\boldsymbol{\xi}^* = \arg\min_{\delta\boldsymbol{\xi}} \; E(\delta\boldsymbol{\xi}) \qquad (8)$$

The update is found by solving Eqn. 8 and applying the retraction $\mathbf{T} \leftarrow \exp(\delta\boldsymbol{\xi}^\wedge)\,\mathbf{T}$. Eqn. 8 can be rewritten as the linear system

$$\mathbf{H} \, \delta\boldsymbol{\xi} = \mathbf{b} \qquad (9)$$

and $\mathbf{H}$ and $\mathbf{b}$ can be constructed without explicitly forming the Jacobian matrices:

$$\mathbf{H} = \sum_k w_k \, \mathbf{J}_k^T \mathbf{J}_k, \qquad \mathbf{b} = -\sum_k w_k \, \mathbf{J}_k^T r_k \qquad (10)$$

This fact is especially useful when solving optimization problems with millions of residual terms; in this setting, storing the full Jacobian matrix becomes impractical.
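As an illustration of Eqns. 9-10, the following sketch fits a small nonlinear model with Gauss-Newton, accumulating the normal equations residual-by-residual instead of storing the Jacobian. The exponential model and all names are illustrative choices of ours, not part of the paper:

```python
import numpy as np

def gauss_newton(x, y, w, iters=20):
    """Fit y ~ exp(a*x + b) with Gauss-Newton. H and the right-hand side are
    accumulated term-by-term (Eqn. 10), so the full Jacobian is never stored."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        H = np.zeros((2, 2))
        g = np.zeros(2)
        for xk, yk, wk in zip(x, y, w):
            f = np.exp(a * xk + b)
            r = f - yk                   # residual r_k
            J = np.array([xk * f, f])    # Jacobian row dr_k / d(a, b)
            H += wk * np.outer(J, J)     # H  = sum_k w_k J_k^T J_k
            g -= wk * J * r              # b  = -sum_k w_k J_k^T r_k
        delta = np.linalg.solve(H, g)    # solve H * delta = b (Eqn. 9)
        a, b = a + delta[0], b + delta[1]
    return a, b
```

For a zero-residual problem the iteration converges to the true parameters; the same accumulate-in-place pattern is what makes the dense per-pixel systems in Sec. 3.3 tractable.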
3.2 Network Architecture
Our network architecture is based on RAFT [teed2020raft]. We construct a full 4D correlation volume by computing the visual similarity between all pairs of pixels in the two input images. During each iteration, the network uses the current estimate of the SE3 field to index from the correlation volume. The correlation features are then fed into a recurrent update operator which estimates a dense flow field. We provide an overview of the RAFT architecture here; more details can be found in [teed2020raft].
Feature Extraction: We first extract features from the two input images. We use two separate feature extraction networks. The feature encoder is applied to both images with shared weights and extracts a dense 128-dimensional feature vector at 1/8 resolution. It consists of 6 residual blocks: 2 at 1/2 resolution, 2 at 1/4 resolution, and 2 at 1/8 resolution. We provide more details of the network architectures in the appendix.
The context encoder extracts semantic and contextual information from the first image. Different from the original RAFT [teed2020raft], we use a pretrained ResNet-50 [resnet] with a skip connection to extract context features at 1/8 resolution. The reason for this change is that grouping pixels into rigidly moving regions requires a greater degree of semantic information and a larger receptive field. During training, we freeze the batch norm layers in the context encoder.
Computing Visual Similarity: We construct a 4D correlation volume by computing the dot product between all pairs of feature vectors from the two input images:

$$\mathbf{C}_{ijkl} = \langle f_\theta(I_1)_{ij}, \; f_\theta(I_2)_{kl} \rangle \qquad (11)$$

We then pool the last two dimensions of the correlation volume 3 times using average pooling with a 2x2 kernel, resulting in a correlation pyramid

$$\{\mathbf{C}^{(1)}, \mathbf{C}^{(2)}, \mathbf{C}^{(3)}, \mathbf{C}^{(4)}\} \qquad (12)$$
Indexing the Correlation Pyramid: Given a current estimate of correspondence $\mathbf{x}'$, we can index the correlation volume to produce a set of correlation features. First we construct a neighborhood grid around $\mathbf{x}'$,

$$\mathcal{N}(\mathbf{x}') = \left\{ \mathbf{x}' + \mathbf{dx} \;\middle|\; \mathbf{dx} \in \mathbb{Z}^2, \; \|\mathbf{dx}\|_\infty \leq r \right\} \qquad (13)$$

and then use the neighborhood to sample from the correlation volume using bilinear interpolation. We note that constructing and indexing the correlation volume is performed in an identical manner to RAFT [teed2020raft].
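A minimal NumPy sketch of the correlation pyramid (Eqns. 11-12) and the lookup (Eqn. 13) may make this concrete; it mirrors the RAFT-style construction but is our own simplified, unoptimized version:

```python
import numpy as np

def correlation_pyramid(f1, f2, levels=4):
    """All-pairs correlation volume (Eqn. 11) with repeated 2x2 average pooling
    over the last two dimensions (Eqn. 12). f1, f2: (H, W, C) feature maps."""
    H, W, _ = f1.shape
    corr = np.einsum('ijc,klc->ijkl', f1, f2)                # (H, W, H, W)
    pyramid = [corr]
    for _ in range(levels - 1):
        H2, W2 = pyramid[-1].shape[2] // 2, pyramid[-1].shape[3] // 2
        c = pyramid[-1][:, :, :2 * H2, :2 * W2]
        c = c.reshape(H, W, H2, 2, W2, 2).mean(axis=(3, 5))  # 2x2 average pool
        pyramid.append(c)
    return pyramid

def lookup(corr, i, j, u, v, radius=3):
    """Sample correlation features for pixel (i, j) on a (2r+1)^2 grid around
    its estimated correspondence (u, v), with bilinear interpolation (Eqn. 13)."""
    H2, W2 = corr.shape[2], corr.shape[3]
    feats = []
    for du in range(-radius, radius + 1):
        for dv in range(-radius, radius + 1):
            x, y = u + du, v + dv
            x0 = np.clip(int(np.floor(x)), 0, H2 - 2)
            y0 = np.clip(int(np.floor(y)), 0, W2 - 2)
            a, b = x - x0, y - y0
            patch = corr[i, j]
            feats.append((1 - a) * (1 - b) * patch[x0, y0] + a * (1 - b) * patch[x0 + 1, y0]
                         + (1 - a) * b * patch[x0, y0 + 1] + a * b * patch[x0 + 1, y0 + 1])
    return np.array(feats)
```

The pooled levels let the same small grid cover progressively larger displacements, which is what makes the lookup efficient for large motions.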
Update Operator: The update operator is a recurrent GRU unit which retrieves features from the correlation volume using the indexing operator and outputs a set of revisions. RAFT uses a series of 1x5 and 5x1 GRU units; we use a single 3x3 unit, but with kernels at dilation rates 1 and 3. We provide more details on the architecture of the update operator in the appendix.
Using Eqn. 3, we can use the current estimate of $\mathbf{T}$ to estimate 2D correspondences $\mathbf{x}'$. The following features are used as input to the GRU:

- Flow field: the 2D flow induced by the current SE3 field, $\mathbf{x}' - \mathbf{x}$.

- Twist field: the current SE3 field expressed in local coordinates, $\log(\mathbf{T}) \in \mathbb{R}^6$.

- Depth residual: the difference between the inverse depth of the transformed point and the inverse depth observed in frame 2.

- Correlation features: features retrieved from the correlation pyramid at $\mathbf{x}'$.
In the depth residual term, the first inverse depth is obtained from the depth component of $\mathbf{x}'$, i.e. the backprojected pixel expressed in the coordinate system of frame 2; the second is obtained by indexing the inverse depth map of frame 2 at the correspondence of pixel $\mathbf{x}$. If pixel $\mathbf{x}$ is non-occluded, an accurate SE3 field should result in a depth residual of 0. Each of the derived features is processed through 2 convolutional layers and then provided as input to the convolutional GRU.
The hidden state is then used to predict the inputs to the Dense-SE3 layer. We apply two convolutional layers to the hidden state to output a rigid-motion embedding map. We additionally predict a "revision map" and corresponding confidence maps. The flow revisions are corrections that should be made to the optical flow induced by the current SE3 field; in other words, the network is trying to produce a new estimate of pixel correspondence, but expresses it on top of the flow induced by the SE3 field. The inverse depth revision is the correction that should be made to the inverse depth in frame 2 when that depth is used by Dense-SE3 to enforce geometric consistency; this accounts for noise in the input depth as well as occlusions. The embedding map and revision maps are taken as input by the Dense-SE3 layer to produce an update to the SE3 motion field.
SE3 Upsampling: The SE3 motion field estimated by the network is at 1/8 of the input resolution. We use convex upsampling [teed2020raft] to upsample the transformation field to the full input resolution. In RAFT [teed2020raft], each high-resolution flow value is taken to be a convex combination of a grid of values at the lower resolution, with combination weights predicted by the network. However, the SE3 field lies on a manifold and is not closed under linear combination. Instead, we perform upsampling by first mapping to the Lie algebra using the logarithm map, performing convex upsampling in the Lie algebra, and then mapping back to the manifold using the exponential map.
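The idea of blending transformations in the Lie algebra can be sketched as follows. For brevity we use SO(3) rotations rather than full SE(3) transformations; the exp/log maps and the blending are the standard ones, while the paper applies the analogous operation to dense SE3 fields with network-predicted convex weights:

```python
import numpy as np

def so3_exp(w):
    """Rodrigues' formula: axis-angle vector -> rotation matrix (exponential map)."""
    th = np.linalg.norm(w)
    if th < 1e-8:
        return np.eye(3)
    k = w / th
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(th) * K + (1 - np.cos(th)) * (K @ K)

def so3_log(R):
    """Rotation matrix -> axis-angle vector (logarithm map)."""
    th = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    if th < 1e-8:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return th / (2 * np.sin(th)) * w

def convex_upsample(rotations, weights):
    """Blend neighboring rotations with convex weights by averaging in the
    Lie algebra: R = exp(sum_k w_k * log(R_k)); `weights` sums to 1."""
    twist = sum(w * so3_log(R) for w, R in zip(weights, rotations))
    return so3_exp(twist)
```

Averaging the matrices directly would leave the manifold (the result would not be a rotation); averaging the twists and exponentiating keeps the upsampled field valid.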
3.3 Dense-SE3 Layer
The key ingredient of our approach is the Dense-SE3 layer. Each application of the update module produces a revision map. The Dense-SE3 layer is a differentiable optimization layer which maps the revision map to an update of the SE3 field.
The rigid-motion embedding vectors are used to softly group pixels into rigid objects. Given two embedding vectors $\mathbf{g}_i$ and $\mathbf{g}_j$, we compute an affinity by taking the exponential of the negative L2 distance:

$$a_{ij} = \exp\big( -\| \mathbf{g}_i - \mathbf{g}_j \|_2 \big) \qquad (14)$$
Objective Function: Using the affinity terms, we define an objective function based on the reprojection error between transformed points and revised correspondences. Let $\tilde{\mathbf{x}}'_j = \mathbf{x}'_j + \mathbf{r}_j$ denote the correspondence of pixel $j$ after applying the predicted revisions, and define the residual

$$r_{ij}(\mathbf{T}_i) = \pi\big(\mathbf{T}_i \, \pi^{-1}(\mathbf{x}_j, d_j)\big) - \tilde{\mathbf{x}}'_j \qquad (15)$$

$$E(\mathbf{T}_i) = \sum_{j \in \mathcal{N}_i} a_{ij} \, \big\| \mathbf{w}_j \cdot r_{ij}(\mathbf{T}_i) \big\|^2 \qquad (16)$$

with $\mathbf{w}_j$ the predicted confidence weights. The objective states that for every pixel $i$, we want a transformation $\mathbf{T}_i$ which describes the motion of pixels in a neighborhood $\mathcal{N}_i$. However, not every pixel in the neighborhood belongs to the same rigidly moving object; that is the purpose of the embedding vectors. Only pairs with similar embeddings contribute significantly to the objective function.
Efficient Optimization: We apply a single Gauss-Newton update to Eqn. 16 to generate the next SE3 estimate. Since the Dense-SE3 layer is applied after each application of the update operator, 12 iterations of the update operator yield 12 Gauss-Newton updates.
The objective defined in Eqn. 16 can result in a very large optimization problem. We generally use a large neighborhood $\mathcal{N}_i$ in practice; in some experiments we take $\mathcal{N}_i$ to be the entire image. For the FlyingThings3D dataset, this results in roughly 200 million equations and 50,000 variables. Trying to store the full system would exceed available memory.
However, each term in Eqn. 16 involves only a single $\mathbf{T}_i$. This means that instead of solving one large optimization problem over all variables jointly, we can solve a set of independent problems, each with only 6 variables. Furthermore, we can leverage Eqn. 10 and build each linear system in place without explicitly constructing the Jacobian. When implemented directly in CUDA, a Gauss-Newton update of Eqn. 16 can be performed very quickly and is not a bottleneck in our approach.
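The decomposition into independent 6x6 systems can be sketched in NumPy; the shapes and the small damping term are our own illustrative choices, not the paper's CUDA implementation:

```python
import numpy as np

def batched_gn_step(J, r, w):
    """One Gauss-Newton step for N independent 6-DOF problems at once.
    J: (N, K, 6) stacked Jacobian rows over each pixel's neighborhood,
    r: (N, K) residuals, w: (N, K) affinity/confidence weights.
    Returns (N, 6) updates, one per pixel."""
    H = np.einsum('nk,nki,nkj->nij', w, J, J)   # H_i = sum_k w_k J_k^T J_k
    b = -np.einsum('nk,nki,nk->ni', w, J, r)    # b_i = -sum_k w_k J_k^T r_k
    H = H + 1e-6 * np.eye(6)                    # small damping for stability
    return np.linalg.solve(H, b)                # batched solve of N 6x6 systems
```

Because each pixel's system is tiny, the whole update is embarrassingly parallel, which is why the CUDA implementation described above is fast.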
3.4 Boundary-Aware Global Pooling
The boundary-aware global pooling layer allows rigid-motion embeddings to be aggregated within motion boundaries. Since our architecture operates primarily at high resolution, it can be difficult for the network to group pixels which span large objects.

To address this, we propose a new boundary-aware global pooling layer which aggregates embedding vectors within motion boundaries. Given an embedding map $\mathbf{g}$, we have the GRU predict additional edge weights $\mathbf{b}_x, \mathbf{b}_y$ and define the objective

$$E(\tilde{\mathbf{g}}) = \| \tilde{\mathbf{g}} - \mathbf{g} \|^2 + \| \mathbf{b}_x \odot (\mathbf{D}_x \tilde{\mathbf{g}}) \|^2 + \| \mathbf{b}_y \odot (\mathbf{D}_y \tilde{\mathbf{g}}) \|^2 \qquad (17)$$

where $\mathbf{D}_x$ and $\mathbf{D}_y$ are linear finite difference operators, and $\mathbf{g}$ is the flattened feature map.
In other words, we solve for a new embedding map $\tilde{\mathbf{g}}$ which is smooth within motion boundaries and close to the original embedding map $\mathbf{g}$. At motion boundaries, the network can set the edge weights to 0 so that edges do not get smoothed over. Eqn. 17 can be solved in closed form using sparse Cholesky decomposition; we use the Cholmod library [chen2008algorithm]. Using nested dissection [george1973nested], factorization can be performed in $O(n^{3/2})$ time and back-substitution in $O(n \log n)$ time. In the appendix, we derive the gradients of Eqn. 17 so that $\mathbf{b}_x$ and $\mathbf{b}_y$ can be trained without direct supervision.
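A small SciPy sketch of the forward solve may help; we use a general sparse solver in place of Cholmod, and the forward-difference construction of the operators is an assumption of ours:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def boundary_aware_pool(g, bx, by):
    """Solve (I + Dx^T Wx Dx + Dy^T Wy Dy) g_tilde = g for one embedding
    channel g of shape (H, W). bx (H, W-1) and by (H-1, W) are edge weights;
    near a motion boundary they go to 0 so smoothing does not cross it."""
    H, W = g.shape
    n = H * W
    d1 = sp.eye(W - 1, W, k=1) - sp.eye(W - 1, W)         # 1D forward difference
    Dx = sp.kron(sp.identity(H), d1)                       # horizontal differences
    d2 = sp.eye(H - 1, H, k=1) - sp.eye(H - 1, H)
    Dy = sp.kron(d2, sp.identity(W))                       # vertical differences
    Wx = sp.diags(bx.ravel() ** 2)                         # Wx = diag(bx^2)
    Wy = sp.diags(by.ravel() ** 2)
    A = sp.identity(n) + Dx.T @ Wx @ Dx + Dy.T @ Wy @ Dy   # SPD system matrix
    return spsolve(A.tocsc(), g.ravel()).reshape(H, W)
```

With zero edge weights the layer is the identity; with large uniform weights it averages the whole map, and mixed weights interpolate between the two within each region.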
3.5 Supervision
We supervise our network on a combination of ground truth optical flow and inverse depth change. Our network outputs a sequence of transformations $\{\mathbf{T}^{(k)}\}_{k=1}^N$. For each transformation, we compute the induced optical flow and inverse depth change

$$\mathbf{f}^{(k)} = \pi\big(\mathbf{T}^{(k)} \, \pi^{-1}(\mathbf{X})\big) - \mathbf{X} \qquad (18)$$

where $\mathbf{X}$ is a dense coordinate grid in frame 1. We compute the loss as a weighted sum over the full sequence of estimates:

$$\mathcal{L} = \sum_{k=1}^{N} \gamma^{N-k} \, \big\| \mathbf{f}^{(k)} - \mathbf{f}_{gt} \big\|_1 \qquad (19)$$

with $\gamma < 1$ giving exponentially higher weight to later iterations.
Note that no supervision is applied to the embedding vectors; rigid-motion embeddings are learned implicitly by differentiating through the dense update layer. We also apply an additional loss directly to the revisions predicted by the GRU, with weight 0.2.
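The exponentially weighted sequence loss of Eqn. 19 can be sketched as follows; the default value of gamma is an assumption on our part, as the paper's exact weight is not reproduced here:

```python
import numpy as np

def sequence_loss(flow_preds, flow_gt, gamma=0.8):
    """Weighted sum of L1 errors over the iteration sequence (Eqn. 19):
    later iterations receive exponentially higher weight. gamma < 1 is an
    assumed value, typical for RAFT-style recurrent refinement."""
    N = len(flow_preds)
    loss = 0.0
    for k, f in enumerate(flow_preds):
        weight = gamma ** (N - k - 1)          # gamma^(N-k) with 1-based k
        loss += weight * np.abs(f - flow_gt).mean()
    return loss
```

Supervising every intermediate estimate, rather than only the last, encourages each unrolled Gauss-Newton step to make progress.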
Evaluation  Method  Input  2D Metrics  3D Metrics  
1px  EPE  <0.05  <0.10  EPE  
NonOccluded (35m)  FlowNetC [liu2019flownet3d]  RGBD      0.25%  1.74%  0.7836 
ICP [besl1992method]  XYZ      7.62%  21.98%  0.5019  
FlowNet3D [liu2019flownet3d]  XYZ      25.37%  57.85%  0.1694  
FlowNet3D++[wang2020flownet3d++]  RGBD      30.33%  63.43%  0.1369  
RAFT [teed2020raft]  RGB  77.47%  3.63        
RAFT (2D flow backprojected)  RGBD  77.02%  3.19  60.22%  66.73%  1.8076  
RAFT (2D flow + depth change)  RGBD  73.51%  3.64  47.01%  61.87%  0.4288  
RAFT (3D flow)  RGBD  74.84%  3.75  56.29%  75.50%  0.1172  
Ours  RGBD  81.72%  2.63  83.71%  89.23%  0.0573  
All  RAFT [teed2020raft]  RGB  79.37%  3.53       
RAFT (2D flow backprojected)  RGBD  78.80%  3.42  50.58%  55.74%  5.4421  
RAFT (2D flow + depth change)  RGBD  75.21%  3.66  33.87%  47.22%  1.2184  
RAFT (3D flow)  RGBD  73.56%  4.42  36.19%  55.40%  0.2656  
Ours  RGBD  86.35%  2.46  87.81%  91.52%  0.0619 
4 Experiments
We evaluate our approach on a variety of real and synthetic datasets. For all experiments we use the AdamW optimizer [loshchilov2017decoupled] with weight decay and unroll 12 iterations of the update operator. All components of the network are trained from scratch, with the exception of the context encoder, which uses ImageNet [deng2009imagenet] pretrained weights.

4.1 Asteroids
We begin by evaluating 3D motion estimation when depth is given as input. We create a synthetic dataset we call Asteroids, with 10,000 videos of moving asteroids and a 90/5/5 train/test/val split. Each scene is constructed by placing 3-10 randomly textured asteroids into a scene rendered with Blender (https://www.blender.org/). The 3D models are of real asteroids downloaded from the 3D Asteroid Catalogue (https://3dasteroids.space/) and textures are taken from the Describable Textures dataset [dtd].
We compare several baselines. All baselines are given the depth of the first frame as input (Semi-Rigid Scene Flow (SRSF) [quiroga2014dense] is given depth for both frames), and the task is to predict the motion between the first and second frames. The 2D, 3D, and 6D baselines all have the same architecture as RAFT-3D, except that they directly predict motion fields instead of using our Dense-SE3 layer. The 2D baseline simply predicts optical flow, exactly like RAFT [teed2020raft]. The 3D and 6D baselines directly estimate either 3D scene flow or 6D rotation/translation fields. All methods are trained on 2D optical flow and 3D scene flow, except the 2D network, which only uses optical flow.
The results (Table 1) show that our Dense-SE3 layer improves both optical flow and 3D scene flow. Visualization of the embedding and rotation/translation fields shows that the network can accurately segment pixels into rigid bodies.
Disparity 1  Disparity 2  Optical Flow  Scene Flow  

Methods  Runtime  bg  fg  all  bg  fg  all  bg  fg  all  bg  fg  all 
OSF [menze2015object]  50 mins  4.54  12.03  5.79  5.45  19.41  7.77  5.62  18.92  7.83  7.01  26.34  10.23 
SSF [ren2017cascaded]  5 mins  3.55  8.75  4.42  4.94  17.48  7.02  5.63  14.71  7.14  7.18  24.58  10.07 
Sense [jiang2019sense]  0.31s  2.07  3.01  2.22  4.90  10.83  5.89  7.30  9.33  7.64  8.36  15.49  9.55 
DTF Sense [schuster2020deep]  0.76 sec  2.08  3.13  2.25  4.82  9.02  5.52  7.31  9.48  7.67  8.21  14.08  9.18 
PRSM* [vogel20153d]  5 mins  3.02  10.52  4.27  5.13  15.11  6.79  5.33  13.40  6.68  6.61  20.79  8.97 
OpticalExp [yang2020upgrading]  2.0 sec  1.48  3.46  1.81  3.39  8.54  4.25  5.83  8.66  6.30  7.06  13.44  8.12 
ISF [behl2017bounding]  10 mins  4.12  6.17  4.46  4.88  11.34  5.95  5.40  10.29  6.22  6.58  15.63  8.08 
ACOSF [Cong2020ICPR]  5mins  2.79  7.56  3.58  3.82  12.74  5.31  4.56  12.00  5.79  5.61  19.38  7.90 
DRISF[ma2019deep]  0.75 sec (2 GPUs)  2.16  4.49  2.55  2.90  9.73  4.04  3.59  10.40  4.73  4.39  15.94  6.31 
Ours  2.0 sec  1.48  3.46  1.81  2.51  9.46  3.67  3.39  8.79  4.29  4.27  13.27  5.77 
4.2 FlyingThings3D
The FlyingThings3D dataset was introduced as part of the synthetic Scene Flow datasets by Mayer et al. [mayer2016large]. The dataset consists of ShapeNet [chang2015shapenet] shapes with randomized translations and rotations, placed in a scene populated with background objects. While the dataset is not naturalistic, it offers a challenging combination of camera and object motion, each spanning all 6 degrees of freedom.
We train our network for 200k iterations with a batch size of 3 and a crop size of [320, 720]. We perform spatial augmentation by random cropping and resizing, adjusting the intrinsics accordingly. We use an initial learning rate of 0.0001 and decay the learning rate linearly during training.
We evaluate our network using 2D and 3D end-point-error (EPE). 2D EPE is defined as the Euclidean distance between the ground truth optical flow and the predicted optical flow, which can be obtained from the 3D transformation field using Eqn. 3. 3D EPE is the Euclidean distance between the ground truth 3D scene flow and the predicted scene flow. We also report threshold metrics, which measure the fraction of pixels with error below a given threshold.
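These metrics can be sketched as follows; the 0.05/0.10 thresholds shown here match the columns reported for the FlowNet3D-style evaluation [liu2019flownet3d], though the exact protocol is defined by that benchmark:

```python
import numpy as np

def epe_metrics(pred, gt, thresholds=(0.05, 0.10)):
    """End-point error and threshold accuracies for 3D scene flow.
    pred, gt: (N, 3) flow vectors (the same code works for 2D flow)."""
    err = np.linalg.norm(pred - gt, axis=-1)        # per-pixel Euclidean error
    out = {'EPE': err.mean()}
    for t in thresholds:
        out[f'<{t}'] = 100.0 * (err < t).mean()     # percent of pixels within t
    return out
```

Threshold metrics complement EPE because a few grossly wrong pixels (e.g. in occluded regions) can dominate the mean error while barely affecting the accuracy percentages.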
We report results in Table 2 and compare to scene flow [liu2019flownet3d, wang2020flownet3d++] and optical flow [teed2020raft] networks. In the top portion of the table, we use the evaluation setup of FlowNet3D [liu2019flownet3d] and FlowNet3D++ [wang2020flownet3d++], where only non-occluded pixels with depth < 35 meters are evaluated. Our method improves the 3D accuracy from 30.33% to 83.71%.
We compare to RAFT [teed2020raft], which estimates optical flow between a pair of frames, and to several baselines which modify RAFT to predict 3D motion. All RAFT baselines use the same network architecture as our approach, including the pretrained ResNet-50. All baselines are provided with inverse depth as input, which is concatenated with the input images. We also experimented with directly providing depth as input, but found that inverse depth gives the best results.
RAFT (2D flow backprojected) uses the depth maps to backproject 2D motion into 3D flow vectors, but this only works for non-occluded pixels, which is the reason for the very large 3D EPE error. RAFT (2D flow + depth change) predicts 2D flow in addition to inverse depth change, which together can be used to recover 3D flow fields. Finally, we also test a version of RAFT which predicts 3D motion fields directly: RAFT (3D flow). Our method outperforms all of these baselines by a large margin, particularly on the 3D metrics. This is because our network operates directly on the SE3 motion field, which is a more structured representation than a flow field, and the Dense-SE3 layer produces analytically constrained updates which the other baselines lack.
In the second portion of the table, we evaluate over all pixels (excluding extremely fast moving objects with flow > 250 pixels). Since we decompose the scene into rigidly moving components, our method can estimate the motion of occluded regions as well. We provide qualitative results in Fig. 2. These examples show that our network can segment the scene into rigidly moving regions, producing piecewise constant SE3 motion fields, even though no supervision is used on the embeddings.
4.3 KITTI
Using our model trained on FlyingThings3D, we finetune on KITTI for an additional 50k iterations. We use a crop size of [288, 960] and perform spatial and photometric augmentation. To estimate disparity, we use GANet [zhang2019ga], which provides the input depth maps for our method.
Experiment  Method  2D Metrics  3D Metrics  

1px  EPE  <0.05  <0.10  EPE  
Iterations  1  62.1  6.05  56.0  65.9  0.212 
3  82.8  2.95  80.5  85.7  0.098  
8  85.5  2.47  86.4  90.5  0.062  
16  85.8  2.43  87.1  91.0  0.059  
32  85.7  2.50  87.0  90.9  0.061  
Neighborhood Radius (px)  8  73.2  4.01  38.7  59.0  0.192 
64  83.8  2.52  78.1  86.6  0.078  
256  85.8  2.43  87.1  91.0  0.059  
Full Image  83.3  2.91  83.2  88.1  0.078  
Revision Factors  Flow  86.1  2.29  84.6  88.7  0.081 
Flow + Inv. Depth  85.8  2.43  87.1  91.0  0.059  
BoundaryAware Pooling  No  85.8  2.43  87.1  91.0  0.059 
Yes  86.3  2.45  87.8  91.5  0.062 
We submitted our method to the KITTI leaderboard and report results from our method and other top-performing methods in Tab. 3. Our approach outperforms all published methods. DRISF [ma2019deep] is the next best performing approach; it combines PSMNet [chang2018pyramid], PWC-Net [pwcnet], and Mask R-CNN [he2017mask]. Mask R-CNN is pretrained on Cityscapes and finetuned on KITTI using bounding box and instance mask supervision. Our network outperforms DRISF despite training only on FlyingThings3D and KITTI and using no instance supervision.
4.4 Ablations
We ablate various components of our model and report results in Tab. 4. For all ablations, we use our network without boundary-aware pooling as the baseline architecture.
Iterations: We evaluate the performance of our model as a function of the number of applications of the update operator. We find that more iterations give better performance up to about 16, after which we observe a slight degradation.
Neighborhood Radius: The Dense-SE3 layer defines an objective in which all pairs of pixels within a specific radius contribute. Here, we train networks with the radius set to 8, 64, 256, or the full image; in the last case, all pairs of pixels in the image contribute to the objective. We find that a radius of 256 gives better performance than smaller radii; however, using the full image gives worse performance. This is likely because most rigid objects are less than 512 pixels in diameter, so restricting the radius acts as a useful prior.
Revision Factors: The update operator produces a set of revisions which are used as input to the Dense-SE3 layer. Here we experiment with different revisions. In Flow we only use the optical flow revisions. In Flow + Inv. Depth we also include inverse depth revisions. We find that including inverse depth revisions leads to better performance on the 3D metrics because it allows for consistency between the two depth maps.
Boundary-Aware Global Pooling: Here we test the impact of our proposed boundary-aware global pooling layer. The pooling layer improves the threshold metrics, raising 1px accuracy from 85.8 to 86.3 and 3D accuracy from 87.1 to 87.8, with comparable average EPE. In Fig. 3 we see that the pooling layer produces qualitatively better results, particularly over large objects.
Timing: Applying 16 updates on a 1080Ti GPU takes 620 ms per image pair; when boundary-aware global pooling is used, inference takes 1.2 s.
Parameters: RAFT-3D has 45M trainable parameters. The ResNet-50 backbone accounts for 40M parameters, while the feature extractor and update operator make up the remaining 5M.
5 Conclusion
We have introduced RAFT-3D, an end-to-end network for scene flow. RAFT-3D uses rigid-motion embeddings, which represent a soft grouping of pixels into rigidly moving objects. We demonstrate that these embeddings can be used to solve for dense and accurate 3D motion fields.
Acknowledgements: This research is partially supported by the National Science Foundation under Grant IIS1942981.
References
Appendix A Network Architecture
Details of the network architecture, including the feature encoders and the GRU-based update operator, are shown on the next page in Fig. 4.
Appendix B Boundary-Aware Pooling Gradients
The boundary-aware pooling layer minimizes an objective function of the form

$$E(\tilde{\mathbf{g}}) = \| \tilde{\mathbf{g}} - \mathbf{g} \|^2 + \| \mathbf{b}_x \odot (\mathbf{D}_x \tilde{\mathbf{g}}) \|^2 + \| \mathbf{b}_y \odot (\mathbf{D}_y \tilde{\mathbf{g}}) \|^2 \qquad (20)$$

where $\mathbf{D}_x$ and $\mathbf{D}_y$ are linear finite difference operators, and $\mathbf{g}$ is the flattened feature map.

First consider the case of a single channel, $\mathbf{g} \in \mathbb{R}^n$. Let $\mathbf{W}_x = \mathrm{diag}(\mathbf{b}_x)^2$ and $\mathbf{W}_y = \mathrm{diag}(\mathbf{b}_y)^2$. Setting the gradient of Eqn. 20 to zero, we can solve for $\tilde{\mathbf{g}}$:

$$\mathbf{H} \tilde{\mathbf{g}} = \mathbf{g}, \qquad \mathbf{H} = \mathbf{I} + \mathbf{D}_x^T \mathbf{W}_x \mathbf{D}_x + \mathbf{D}_y^T \mathbf{W}_y \mathbf{D}_y \qquad (21)$$

We perform sparse Cholesky factorization and back-substitution to solve for $\tilde{\mathbf{g}}$ using the Cholmod library [chen2008algorithm].
Gradients: In the backward pass, given the gradient $\partial \mathcal{L} / \partial \tilde{\mathbf{g}}$, we need to find the gradients with respect to the boundary weights $\mathbf{b}_x$ and $\mathbf{b}_y$.

Given the linear system $\mathbf{H} \tilde{\mathbf{g}} = \mathbf{g}$, the gradients with respect to $\mathbf{H}$ and $\mathbf{g}$ can be found by solving the system in the backward direction [amos2017optnet]:

$$\mathbf{H} \mathbf{z} = \frac{\partial \mathcal{L}}{\partial \tilde{\mathbf{g}}} \qquad (22)$$

$$\frac{\partial \mathcal{L}}{\partial \mathbf{g}} = \mathbf{z} \qquad (23)$$

$$\frac{\partial \mathcal{L}}{\partial \mathbf{H}} = -\mathbf{z} \, \tilde{\mathbf{g}}^T \qquad (24)$$

Here the column vector $\mathbf{z}$ is defined for notational convenience. Since $\mathbf{H}$ is symmetric positive definite, we can reuse the factorization from the forward pass.

To compute the gradients with respect to $\mathbf{b}_x$ and $\mathbf{b}_y$, note that

$$\frac{\partial \mathbf{H}}{\partial (\mathbf{b}_x)_k} = 2 \, (\mathbf{b}_x)_k \, \mathbf{D}_x^T \mathbf{e}_k \mathbf{e}_k^T \mathbf{D}_x \qquad (25)$$

$$\frac{\partial \mathcal{L}}{\partial (\mathbf{b}_x)_k} = \mathrm{tr}\!\left( \left( \frac{\partial \mathcal{L}}{\partial \mathbf{H}} \right)^{\!T} \frac{\partial \mathbf{H}}{\partial (\mathbf{b}_x)_k} \right) \qquad (26)$$

giving

$$\frac{\partial \mathcal{L}}{\partial \mathbf{b}_x} = -2 \, \mathbf{b}_x \odot (\mathbf{D}_x \mathbf{z}) \odot (\mathbf{D}_x \tilde{\mathbf{g}}) \qquad (27)$$

where $\odot$ is elementwise multiplication. Similarly,

$$\frac{\partial \mathcal{L}}{\partial \mathbf{b}_y} = -2 \, \mathbf{b}_y \odot (\mathbf{D}_y \mathbf{z}) \odot (\mathbf{D}_y \tilde{\mathbf{g}}) \qquad (28)$$
Multiple Channels: We can easily extend Eqn. 21 to multiple channels. Since the matrix $\mathbf{H}$ does not depend on $\mathbf{g}$, it only needs to be factored once; we can solve Eqn. 21 for all channels by reusing the factorization, treating $\mathbf{g}$ as an $n \times C$ matrix. The gradient formulas are updated by summing the gradients over the channel dimension.
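The backward-pass formulas can be checked numerically on a tiny 1D problem. This standalone script is our own verification sketch (single channel, forward differences, loss $\mathcal{L} = \sum \tilde{g}$), not the paper's implementation:

```python
import numpy as np

def solve(g, b, D):
    """Forward pass of Eqn. 21 in 1D: (I + D^T diag(b^2) D) g_tilde = g."""
    H = np.eye(len(g)) + D.T @ np.diag(b ** 2) @ D
    return np.linalg.solve(H, g)

n = 5
rng = np.random.RandomState(0)
g = rng.randn(n)
b = rng.rand(n - 1) + 0.5
D = np.eye(n - 1, n, 1) - np.eye(n - 1, n)       # forward differences

g_tilde = solve(g, b, D)
H = np.eye(n) + D.T @ np.diag(b ** 2) @ D
z = np.linalg.solve(H, np.ones(n))               # backward solve (Eqn. 22), dL/dg_tilde = 1
grad_b = -2 * b * (D @ z) * (D @ g_tilde)        # analytic gradient (Eqn. 27)

# Central finite-difference check of dL/db.
eps = 1e-6
fd = np.zeros_like(b)
for i in range(len(b)):
    bp = b.copy(); bp[i] += eps
    bm = b.copy(); bm[i] -= eps
    fd[i] = (solve(g, bp, D).sum() - solve(g, bm, D).sum()) / (2 * eps)
```

The analytic gradient and the finite-difference estimate agree to numerical precision, confirming Eqns. 22-27 on this toy instance.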