RAFT: Recurrent All-Pairs Field Transforms for Optical Flow

03/26/2020
by Zachary Teed, et al., Princeton University

We introduce Recurrent All-Pairs Field Transforms (RAFT), a new deep network architecture for optical flow. RAFT extracts per-pixel features, builds multi-scale 4D correlation volumes for all pairs of pixels, and iteratively updates a flow field through a recurrent unit that performs lookups on the correlation volumes. RAFT achieves state-of-the-art performance, with strong cross-dataset generalization and high efficiency in inference time, training speed, and parameter count. Code is available at https://github.com/princeton-vl/RAFT.


1 Introduction

Optical flow is the task of estimating per-pixel motion between video frames. It is a long-standing vision problem that remains unsolved. The best systems are limited by difficulties including fast-moving objects, occlusions, motion blur, and textureless surfaces.

Optical flow has traditionally been approached as a hand-crafted optimization problem over the space of dense displacement fields between a pair of images [19, 46, 11]. Generally, the optimization objective defines a trade-off between a data term, which encourages the alignment of visually similar image regions, and a regularization term, which imposes priors on the plausibility of motion. Such an approach has achieved considerable success, but further progress has proven challenging, due to the difficulty of hand-designing an optimization objective that is robust to a variety of corner cases.

Recently, deep learning has been shown to be a promising alternative to traditional methods. Deep learning can side-step formulating an optimization problem and instead train a network to directly predict flow. Current deep learning methods [23, 37, 20, 44, 18] have achieved performance comparable to the best traditional methods while being significantly faster at inference time. A key question for further research is designing effective architectures that perform better, train more easily, and generalize well to novel scenes.

Figure 1: RAFT consists of 3 main components: (1) A feature encoder that extracts per-pixel features from both input images, along with a context encoder that extracts features from only I_1. (2) A correlation layer which constructs a 4D correlation volume by taking the inner product of all pairs of feature vectors. The last 2 dimensions of the 4D volume are pooled at multiple scales to construct a set of multi-scale volumes. (3) An update operator which recurrently updates optical flow by using the current estimate to look up values from the set of correlation volumes.

We introduce Recurrent All-Pairs Field Transforms (RAFT), a new deep network architecture for optical flow. RAFT enjoys the following strengths:

  • State-of-the-art accuracy: RAFT achieves large improvements over existing methods. On Sintel [9] (final pass), a challenging benchmark covering diverse motions, RAFT obtains an end-point-error of 3.39 pixels, a 20% error reduction from the best published result (4.26 pixels).

  • Strong generalization: When trained only on synthetic data, RAFT achieves an end-point-error of 5.54 pixels on KITTI [16], a 34% error reduction from the best prior deep network trained on the same data (8.36 pixels).

  • High efficiency: RAFT processes videos at 9 frames per second on a 1080Ti GPU. It trains with 10X fewer iterations than other architectures. A smaller version of RAFT with 1/5 of the parameters runs at 20 frames per second while still outperforming all prior methods on Sintel.

RAFT consists of three main components: (1) a feature encoder that extracts a feature vector for each pixel; (2) a correlation layer that produces a 4D correlation volume for all pairs of pixels, with subsequent pooling to produce lower resolution volumes; (3) a recurrent GRU-based update operator that retrieves values from the correlation volumes and iteratively updates a flow field initialized at zero. Fig. 1 illustrates the design of RAFT.

The RAFT architecture is motivated by traditional optimization-based approaches. The feature encoder extracts per-pixel features. The correlation layer computes visual similarity between pixels. The update operator mimics the steps of an iterative optimization algorithm. But unlike traditional approaches, features and motion priors are not handcrafted but learned—learned by the feature encoder and the update operator respectively.

The design of RAFT draws inspiration from many existing works but is substantially novel. First, RAFT operates entirely at high resolution. It maintains and updates a single high-resolution flow field, with zero upsampling operations during inference. This is different from the prevailing coarse-to-fine design in prior work [37, 44, 20, 21, 45], where flow is first estimated at low resolution and upsampled and refined at high resolution. By operating entirely at high resolution, RAFT overcomes several limitations of a coarse-to-fine cascade: the difficulty of recovering from errors at coarse resolutions, the tendency to miss small fast-moving objects, and the many training iterations (often over 1M) typically required for training a multi-stage cascade.

Second, the update operator of RAFT is recurrent and lightweight. Many recent works [22, 37, 44, 20, 23] have included some form of iterative refinement, but do not tie the weights across iterations [37, 44, 20] and are therefore limited to a fixed number of iterations. To our knowledge, IRR [22] is the only prior deep learning approach that is recurrent. It uses FlowNetS [13] or PWC-Net [37] as its recurrent unit. When using FlowNetS, it is limited by the size of the network (38M parameters) and is only applied up to 5 iterations. When using PWC-Net, iterations are limited by the number of pyramid levels. In contrast, our update operator has only 2.7M parameters and can be applied 100+ times during inference without divergence.

Third, the update operator has a novel design, which consists of a convolutional GRU that performs lookups on 4D multi-scale correlation volumes; in contrast, refinement modules in prior work typically use only plain convolution or correlation layers.

We conduct experiments on Sintel[9] and KITTI[16]. Results show that RAFT achieves state-of-the-art performance on both datasets. In addition, we validate various design choices of RAFT through extensive ablation studies.

2 Related Work

Optical Flow as Energy Minimization Optical flow has traditionally been treated as an energy minimization problem which imposes a tradeoff between a data term and a regularization term. Horn and Schunck [19] formulated optical flow as a continuous optimization problem using a variational framework, and were able to estimate a dense flow field by performing gradient steps. TV-L1 [46] replaced the quadratic penalties with an L1 data term and total variation regularization, which allowed for motion discontinuities and was better equipped to handle outliers. Improvements have been made by defining better matching costs [40, 8] and regularization terms [33].

Such continuous formulations maintain a single estimate of optical flow which is refined at each iteration. To ensure a smooth objective function, a first order Taylor approximation is used to model the data term. As a result, they only work well for small displacements. To handle large displacements, the coarse-to-fine strategy is used, where an image pyramid is used to estimate large displacements at low resolution, and small displacements are then refined at high resolution. But this coarse-to-fine strategy may miss small fast-moving objects and have difficulty recovering from early mistakes. Like continuous methods, we maintain a single estimate of optical flow which is refined with each iteration. However, since we build correlation volumes for all pairs at both high resolution and low resolution, each local update uses information about both small and large displacements. Instead of using a subpixel Taylor approximation of the data term, our update operator learns to propose the descent direction.

More recently, optical flow has also been approached as a discrete optimization problem [30, 11, 42] using a global objective. One challenge of this approach is the massive size of the search space, as each pixel can be reasonably paired with thousands of points in the other frame. Menze et al. [30] pruned the search space using feature descriptors and approximated the global MAP estimate using message passing. Chen and Koltun [11] showed that by using the distance transform, solving the global optimization problem over the full space of flow fields is tractable. DCFlow [42] showed further improvements in results by using a neural network as a feature descriptor, and constructed a 4D cost volume over all pairs of features. The 4D cost volume was then processed using the Semi-Global Matching (SGM) algorithm [17]. Like DCFlow, we also construct 4D cost volumes over learned features. However, instead of processing the cost volumes using SGM, we use a neural network to estimate flow. Our approach is end-to-end differentiable, meaning the feature encoder can be trained with the rest of the network to directly minimize the error of the final flow estimate. In contrast, DCFlow requires its network to be trained using an embedding loss between pixels; it cannot be trained directly on optical flow because its cost volume processing is not differentiable.

Direct Flow Prediction Neural networks have been trained to directly predict optical flow between a pair of frames, side-stepping the optimization problem completely. Coarse-to-fine processing has emerged as a popular ingredient in many recent works [37, 45, 20, 21, 22, 44, 18]. A defining feature of coarse-to-fine processing is the use of upsampling operations, which allow coarse estimates to be refined at higher resolutions. In contrast, our method maintains and updates a single high-resolution flow field and does not use any upsampling operations during inference.

Iterative Refinement for Optical Flow Many recent works have used iterative refinement to improve results on optical flow [23, 34, 37, 20, 44] and related tasks [25, 47, 39]. Ilg et al. [23] applied iterative refinement to optical flow by stacking multiple FlowNetS and FlowNetC modules in series. SpyNet[34], PWC-Net[37], LiteFlowNet[20], and VCN [44] apply iterative refinement using coarse-to-fine pyramids. The main difference of these approaches from ours is that they do not share weights between iterations.

More closely related to our approach is IRR [22], which builds on the FlowNetS and PWC-Net architectures but shares weights between refinement networks. When using FlowNetS, it is limited by the size of the network (38M parameters) and is only applied up to 5 iterations. When using PWC-Net, iterations are limited by the number of pyramid levels. In contrast, we use a much simpler refinement module (2.7M parameters) which can be applied for 100+ iterations during inference without divergence.

Our method also has ties to TrellisNet [5] and Deep Equilibrium Models (DEQ) [6], which both use weights tied across depth over a large number of layers. TrellisNet and DEQ were designed for sequence modeling tasks, but we adopt the core idea of using a large number of weight-tied units. Our update operator uses a modified GRU block [12], which is similar to the LSTM block used in TrellisNet. We found that this structure allows our update operator to more easily converge to a fixed point.

Learning to Optimize Many problems in vision can be formulated as an optimization problem. This has motivated several works to embed optimization problems into network architectures [4, 3, 38, 28, 39]. These works typically use a network to predict the inputs or parameters of the optimization problem, and then train the network weights by backpropagating the gradient through the solver, either implicitly [4, 3] or by unrolling each step [28, 38]. However, this technique is limited to problems with an objective that can be easily defined.

Another approach is to learn iterative updates directly from data [1, 2]. These approaches are motivated by the fact that first order optimizers such as Primal Dual Hybrid Gradient (PDHG) [10] can be expressed as a sequence of iterative update steps. Instead of using an optimizer directly, Adler et al. [1] proposed building a network which mimics the updates of a first order algorithm. This approach has been applied to inverse problems such as image denoising [24], tomographic reconstruction [2], and novel view synthesis [15]. TVNet [14] implemented the TV-L1 algorithm as a computation graph, which enabled training of the TV-L1 parameters. However, TVNet operates directly on intensity gradients instead of learned features, which limits the achievable accuracy on challenging datasets such as Sintel.

Our approach can be viewed as learning to optimize: our network uses a large number of update blocks to emulate the steps of a first-order optimization algorithm. However, unlike prior work, we never explicitly define a gradient with respect to some optimization objective. Instead, our network retrieves features from correlation volumes to propose the descent direction.

3 Approach

Given a pair of consecutive RGB images, I_1 and I_2, we estimate a dense displacement field (f^1, f^2) which maps each pixel (u, v) in I_1 to its corresponding coordinates (u + f^1(u, v), v + f^2(u, v)) in I_2. An overview of our approach is given in Figure 1. Our method can be distilled down to three stages: (1) feature extraction, (2) computing visual similarity, and (3) iterative updates, where all stages are differentiable and composed into an end-to-end trainable architecture.

3.1 Feature Extraction

Features are extracted from the input images using a convolutional network. The feature encoder network g_θ is applied to both I_1 and I_2 and maps the input images to dense feature maps at a lower resolution. Our encoder outputs features at 1/8 resolution, g_θ : R^(H×W×3) → R^(H/8×W/8×D), where we set D = 256. The feature encoder consists of 6 residual blocks, 2 at 1/2 resolution, 2 at 1/4 resolution, and 2 at 1/8 resolution (more details in the supplemental material).

We additionally use a context network. The context network h_θ extracts features only from the first input image I_1. Its architecture is identical to the feature extraction network. Together, the feature network and the context network form the first stage of our approach, which only needs to be performed once.
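As a concrete illustration, a minimal PyTorch sketch of such an encoder is given below. The residual-block layout (2 blocks each at 1/2, 1/4, and 1/8 resolution), the output dimension D = 256, and the use of instance normalization follow the text and appendix; the channel widths (64, 96, 128) and the 7x7 stem are illustrative assumptions, not the exact RAFT layers.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block; stride 2 halves the spatial resolution."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.norm1 = nn.InstanceNorm2d(out_ch)
        self.norm2 = nn.InstanceNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.down = None
        if stride != 1 or in_ch != out_ch:
            self.down = nn.Conv2d(in_ch, out_ch, 1, stride=stride)

    def forward(self, x):
        y = self.relu(self.norm1(self.conv1(x)))
        y = self.norm2(self.conv2(y))
        if self.down is not None:
            x = self.down(x)
        return self.relu(x + y)

class Encoder(nn.Module):
    """Maps an H x W x 3 image to D-dimensional features at 1/8 resolution."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.stem = nn.Conv2d(3, 64, 7, stride=2, padding=3)                             # 1/2
        self.layer1 = nn.Sequential(ResidualBlock(64, 64), ResidualBlock(64, 64))        # 1/2
        self.layer2 = nn.Sequential(ResidualBlock(64, 96, 2), ResidualBlock(96, 96))     # 1/4
        self.layer3 = nn.Sequential(ResidualBlock(96, 128, 2), ResidualBlock(128, 128))  # 1/8
        self.head = nn.Conv2d(128, out_dim, 1)

    def forward(self, x):
        x = self.stem(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        return self.head(x)
```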

3.2 Computing Visual Similarity

We compute visual similarity by constructing a full correlation volume between all pairs. Given image features g_θ(I_1) and g_θ(I_2), the correlation volume C is formed by taking the dot product between all pairs of feature vectors. The correlation volume can be efficiently computed as a single matrix multiplication.

C(g_θ(I_1), g_θ(I_2)) ∈ R^(H×W×H×W),   C_ijkl = Σ_h g_θ(I_1)_ijh · g_θ(I_2)_klh    (1)
Figure 2: Building correlation volumes. Here we depict 2D slices of a full 4D volume. For a feature vector in I_1, we take the inner product with all pairs in I_2, generating a 4D volume (each pixel in I_1 produces a 2D response map). The volume is pooled using average pooling with kernel sizes 1, 2, 4, and 8.
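The matrix-multiplication form of Eq. (1) can be sketched in a few lines of PyTorch; the (B, D, H, W) tensor layout is an assumption, and any feature scaling is omitted.

```python
import torch

def all_pairs_correlation(fmap1, fmap2):
    """All-pairs correlation volume of Eq. (1) as one matrix multiplication.

    fmap1, fmap2: encoder features of shape (B, D, H, W).
    Returns a volume of shape (B, H, W, H, W); entry [b, i, j, k, l] is the
    inner product of the feature at (i, j) in I_1 with the feature at (k, l) in I_2.
    """
    B, D, H, W = fmap1.shape
    f1 = fmap1.view(B, D, H * W)                  # (B, D, N)
    f2 = fmap2.view(B, D, H * W)                  # (B, D, N)
    corr = torch.matmul(f1.transpose(1, 2), f2)   # (B, N, N), a single matmul
    return corr.view(B, H, W, H, W)
```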

Correlation Pyramid: We construct a 4-layer pyramid {C^1, C^2, C^3, C^4} by pooling the last two dimensions of the correlation volume with kernel sizes 1, 2, 4, and 8 and equivalent stride (Figure 2). Thus, volume C^k has dimensions H×W×H/2^(k−1)×W/2^(k−1). The set of volumes gives information about both large and small displacements; however, by maintaining the first 2 dimensions (the I_1 dimensions) we maintain high resolution information, allowing our method to recover the motions of small fast-moving objects.
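A sketch of the pyramid construction, assuming the all_pairs_correlation output above; repeated 2x average pooling realizes the kernel sizes 1, 2, 4, and 8 while keeping the first two (I_1) dimensions at full resolution.

```python
import torch.nn.functional as F

def correlation_pyramid(corr, num_levels=4):
    """Build the correlation pyramid by average pooling the last two dimensions.

    corr: (B, H, W, H, W) all-pairs volume.
    Returns a list of tensors of shape (B*H*W, 1, H/2^(k-1), W/2^(k-1)) for k = 1..num_levels.
    """
    B, H, W = corr.shape[:3]
    corr = corr.reshape(B * H * W, 1, H, W)   # each I_1 pixel's 2D response map
    pyramid = [corr]
    for _ in range(num_levels - 1):
        corr = F.avg_pool2d(corr, kernel_size=2, stride=2)  # cumulative kernels 2, 4, 8
        pyramid.append(corr)
    return pyramid
```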

Correlation Lookup: We define a lookup operator which generates a feature map by indexing from the correlation pyramid. Given a current estimate of optical flow (f^1, f^2), we map each pixel x = (u, v) in I_1 to its estimated correspondence in I_2: x' = (u + f^1(u, v), v + f^2(u, v)). We then define a local grid around x'

N(x')_r = { x' + dx | dx ∈ Z², ||dx||_1 ≤ r }    (2)

as the set of integer offsets which are within a radius of r units of x' using the L1 distance. We use the local neighborhood N(x')_r to index from the correlation volume. Since N(x')_r is a grid of real numbers, we use bilinear sampling.

We perform lookups on all levels of the pyramid, such that the correlation volume at level k, C^k, is indexed using the grid N(x'/2^(k−1))_r. A constant radius across levels means larger context at lower levels: for the lowest level, using a radius of 4 corresponds to a range of 256 pixels at the original resolution. The values from each level are then concatenated into a single feature map.

An important point here is that we build the grid directly in the coordinate system defined by the current flow estimate. Previous work has used warping operations followed by local correlation [37, 22, 44], so the local search is actually being performed on a warped coordinate system; in contrast, our approach avoids warping. While this difference is subtle, it is important for subpixel accuracy, particularly near motion boundaries where warping will change the local geometry.
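A sketch of the lookup operator using bilinear sampling (torch.nn.functional.grid_sample). For simplicity it samples a square (2r+1) x (2r+1) window rather than the L1 ball described above, and it assumes flow is stored as (x, y) displacements at feature resolution; both are illustrative choices, not necessarily those of the released code.

```python
import torch
import torch.nn.functional as F

def lookup(pyramid, flow, radius=4):
    """Index the correlation pyramid around the flow-displaced coordinates.

    pyramid: list of (B*H*W, 1, H_k, W_k) volumes from correlation_pyramid.
    flow:    current flow estimate, shape (B, 2, H, W), in pixels at feature resolution.
    Returns concatenated correlation features of shape (B, L*(2r+1)^2, H, W).
    """
    B, _, H, W = flow.shape
    # target coordinates x' = x + flow for every pixel of I_1
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    coords = torch.stack([xs, ys], dim=0).float().to(flow.device) + flow      # (B, 2, H, W)
    coords = coords.permute(0, 2, 3, 1).reshape(B * H * W, 1, 1, 2)           # (B*H*W, 1, 1, 2)

    # square window of offsets with |dx|, |dy| <= r
    dx = torch.arange(-radius, radius + 1, device=flow.device).float()
    delta = torch.stack(torch.meshgrid(dx, dx, indexing="ij")[::-1], dim=-1)
    delta = delta.view(1, 2 * radius + 1, 2 * radius + 1, 2)

    out = []
    for k, corr in enumerate(pyramid):
        centers = coords / 2 ** k + delta                                     # scale to level k+1
        Hk, Wk = corr.shape[-2:]
        grid = torch.empty_like(centers)                                      # normalize to [-1, 1]
        grid[..., 0] = 2 * centers[..., 0] / max(Wk - 1, 1) - 1
        grid[..., 1] = 2 * centers[..., 1] / max(Hk - 1, 1) - 1
        sampled = F.grid_sample(corr, grid, align_corners=True)               # bilinear lookup
        out.append(sampled.view(B, H, W, -1))
    return torch.cat(out, dim=-1).permute(0, 3, 1, 2)                         # (B, C, H, W)
```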

Efficient Computation for High Resolution Images: The all-pairs correlation scales O(N²), where N is the number of pixels, but it only needs to be computed once and is constant in the number of iterations M. However, there exists an equivalent implementation of our approach which scales O(NM), exploiting the linearity of the inner product and average pooling. Consider the cost volume at level k, C^k, pooled with kernel size s = 2^(k−1), and feature maps g^(1) = g_θ(I_1), g^(2) = g_θ(I_2):

C^k_ijkl = (1/s²) Σ_p Σ_q ⟨ g^(1)_ij, g^(2)_(sk+p, sl+q) ⟩ = ⟨ g^(1)_ij, (1/s²) Σ_p Σ_q g^(2)_(sk+p, sl+q) ⟩

which is the average over the correlation response in the s × s grid. This means that the value C^k_ijkl can be computed as the inner product between the feature vector g^(1)_ij and g^(2) pooled with kernel size s.

In this alternative implementation, we do not precompute the correlations, but instead precompute the pooled image feature maps. In each iteration, we compute each correlation value on demand, only when it is looked up. This gives a complexity of O(NM).

We found empirically that precomputing all pairs is easy to implement and not a bottleneck, due to highly optimized matrix routines on GPUs—even for 1088x1920 videos it takes only 17% of total inference time. Note that we can always switch to the alternative implementation should it become a bottleneck.
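A sketch of this alternative: precompute only the pooled feature maps of I_2, then evaluate correlations on demand at the looked-up coordinates. The tensor shapes and the names used here are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def pooled_feature_pyramid(fmap2, num_levels=4):
    """Precompute g_theta(I_2) average-pooled with kernel sizes 1, 2, 4, 8 (O(N) memory)."""
    pyramid = [fmap2]
    for _ in range(num_levels - 1):
        fmap2 = F.avg_pool2d(fmap2, kernel_size=2, stride=2)
        pyramid.append(fmap2)
    return pyramid

def correlation_on_demand(fmap1, pooled_fmap2_level, coords):
    """Correlation values only at the looked-up coordinates.

    fmap1:              (B, D, H, W) features of I_1.
    pooled_fmap2_level: (B, D, Hk, Wk) pooled features of I_2 at one pyramid level.
    coords:             (B, H, W, P, 2) lookup locations (x, y) in that level's pixel units.
    Returns (B, H, W, P) correlations, computed only where they are needed.
    """
    B, D, H, W = fmap1.shape
    Hk, Wk = pooled_fmap2_level.shape[-2:]
    P = coords.shape[3]
    grid = coords.clone()                                    # normalize to [-1, 1] for grid_sample
    grid[..., 0] = 2 * grid[..., 0] / max(Wk - 1, 1) - 1
    grid[..., 1] = 2 * grid[..., 1] / max(Hk - 1, 1) - 1
    grid = grid.view(B, H, W * P, 2)
    f2 = F.grid_sample(pooled_fmap2_level, grid, align_corners=True)   # (B, D, H, W*P)
    f2 = f2.view(B, D, H, W, P)
    return torch.einsum("bdhw,bdhwp->bhwp", fmap1, f2)                 # inner products on demand
```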

3.3 Iterative Updates

Our update operator estimates a sequence of flow estimates {f_1, ..., f_N} from an initial starting point f_0 = 0. With each iteration, it produces an update direction Δf which is applied to the current estimate: f_(k+1) = f_k + Δf.

The update operator takes flow, correlation, and a latent hidden state as input, and outputs the update Δf and an updated hidden state. The architecture of our update operator is designed to mimic the steps of an optimization algorithm. As such, we use tied weights across depth and use bounded activations to encourage convergence to a fixed point. The update operator is trained to perform updates such that the sequence converges to a fixed point f_k → f*.
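Putting the pieces together, the inference loop might look like the sketch below. The split of the context features into an initial hidden state (tanh) and an injected input (relu), the 128-channel split point, and the update_block interface are assumptions; lookup refers to the earlier sketch.

```python
import torch

def estimate_flow(pyramid, context, update_block, iters=12):
    """Iterative refinement f_(k+1) = f_k + delta_f (a sketch, not the released RAFT API).

    pyramid:      multi-scale correlation volumes, built once before the loop.
    context:      context-encoder features, shape (B, C, H, W) at 1/8 resolution.
    update_block: ConvGRU-based module returning (new_hidden, delta_flow).
    """
    B, _, H, W = context.shape
    flow = torch.zeros(B, 2, H, W, device=context.device)          # f_0 = 0
    # assumed split of context features into initial hidden state and injected input
    hidden, inp = torch.tanh(context[:, :128]), torch.relu(context[:, 128:])

    predictions = []
    for _ in range(iters):
        corr_features = lookup(pyramid, flow)                       # index the pyramid at f_k
        hidden, delta_flow = update_block(hidden, inp, corr_features, flow)
        flow = flow + delta_flow                                    # f_(k+1) = f_k + delta_f
        predictions.append(flow)
    return predictions
```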

Initialization: By default, we initialize the flow field to 0 everywhere, but our iterative approach gives us the flexibility to experiment with alternatives. When applied to video, we test warm-start initialization, where optical flow from the previous pair of frames is forward projected to the next pair of frames, with occlusion gaps filled in using nearest-neighbor interpolation.

Inputs: Given the current flow estimate f_k, we use it to retrieve correlation features from the correlation pyramid as described in Sec. 3.2. The correlation features are then processed by 2 convolutional layers. Additionally, we apply 2 convolutional layers to the flow estimate itself to generate flow features. Finally, we directly inject the input from the context network. The input feature map x_t is then taken as the concatenation of the correlation, flow, and context features.
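A sketch of how x_t could be assembled, following the description above (two convolutional layers on the correlation features, two on the flow, then concatenation with the context features); all channel widths and kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Builds the update-operator input x_t from correlation, flow, and context features."""
    def __init__(self, corr_channels, out_channels=128):
        super().__init__()
        # two convolutional layers on the correlation features
        self.corr_conv = nn.Sequential(
            nn.Conv2d(corr_channels, 256, 1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 192, 3, padding=1), nn.ReLU(inplace=True))
        # two convolutional layers on the current flow estimate
        self.flow_conv = nn.Sequential(
            nn.Conv2d(2, 128, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True))
        self.out = nn.Conv2d(192 + 64, out_channels, 3, padding=1)

    def forward(self, corr, flow, context_inp):
        motion = torch.cat([self.corr_conv(corr), self.flow_conv(flow)], dim=1)
        motion = torch.relu(self.out(motion))
        # directly inject the context features alongside the motion features
        return torch.cat([motion, context_inp], dim=1)
```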

Update: A core component of the update operator is a gated activation unit based on the GRU cell, with fully connected layers replaced with convolutions:

z_t = σ(Conv3x3([h_(t−1), x_t], W_z))    (3)
r_t = σ(Conv3x3([h_(t−1), x_t], W_r))    (4)
h̃_t = tanh(Conv3x3([r_t ⊙ h_(t−1), x_t], W_h))    (5)
h_t = (1 − z_t) ⊙ h_(t−1) + z_t ⊙ h̃_t    (6)

where x_t is the concatenation of flow, correlation, and context features previously defined. We also experiment with a separable ConvGRU unit, where we replace the 3x3 convolution with two GRUs: one with a 1x5 convolution and one with a 5x1 convolution, to increase the receptive field without significantly increasing the size of the model.
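A direct PyTorch transcription of Eqs. (3)-(6):

```python
import torch
import torch.nn as nn

class ConvGRU(nn.Module):
    """Convolutional GRU cell implementing Eqs. (3)-(6)."""
    def __init__(self, hidden_dim, input_dim, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        ch = hidden_dim + input_dim
        self.convz = nn.Conv2d(ch, hidden_dim, kernel_size, padding=pad)  # update gate z_t
        self.convr = nn.Conv2d(ch, hidden_dim, kernel_size, padding=pad)  # reset gate r_t
        self.convh = nn.Conv2d(ch, hidden_dim, kernel_size, padding=pad)  # candidate state

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))                                 # Eq. (3)
        r = torch.sigmoid(self.convr(hx))                                 # Eq. (4)
        h_tilde = torch.tanh(self.convh(torch.cat([r * h, x], dim=1)))    # Eq. (5)
        return (1 - z) * h + z * h_tilde                                  # Eq. (6)
```

The separable variant described above can be obtained by chaining two such cells, one with a (1, 5) kernel and one with a (5, 1) kernel, using the corresponding asymmetric padding.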

Flow Prediction: The hidden state output by the GRU is passed through two convolutional layers to predict the flow update Δf. The output flow is at 1/8 resolution of the input image. During training and evaluation, we upsample the predicted flow fields to match the resolution of the ground truth.
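A sketch of the flow head and of one straightforward way to upsample the 1/8-resolution flow for comparison against ground truth (bilinear interpolation with the displacements rescaled by the same factor); the hidden and intermediate channel widths are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class FlowHead(nn.Module):
    """Two convolutional layers mapping the GRU hidden state to a 2-channel flow update."""
    def __init__(self, hidden_dim=128, mid_dim=256):
        super().__init__()
        self.conv1 = nn.Conv2d(hidden_dim, mid_dim, 3, padding=1)
        self.conv2 = nn.Conv2d(mid_dim, 2, 3, padding=1)

    def forward(self, h):
        return self.conv2(F.relu(self.conv1(h)))

def upsample_flow(flow_lowres, scale=8):
    """Upsample 1/8-resolution flow to image resolution; the flow values are pixel
    displacements, so they must be rescaled along with the spatial grid."""
    return scale * F.interpolate(flow_lowres, scale_factor=scale,
                                 mode="bilinear", align_corners=False)
```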

3.4 Supervision

We supervise our network on the l1 distance between the predicted and ground truth optical flow over the full sequence of flow estimates, {f_1, ..., f_N}, with exponentially increasing weights. Given ground truth flow f_gt, the loss function is defined as

L = Σ_(i=1..N) γ^(N−i) ||f_gt − f_i||_1    (7)

where we set γ = 0.8 in our experiments.
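Eq. (7) translates directly into a short training loss; averaging the per-pixel L1 error within each term is an assumed normalization.

```python
def sequence_loss(flow_predictions, flow_gt, gamma=0.8):
    """L1 loss over the full sequence of predictions with exponentially
    increasing weights, as in Eq. (7)."""
    n = len(flow_predictions)
    loss = 0.0
    for i, flow in enumerate(flow_predictions):
        weight = gamma ** (n - i - 1)                 # later iterations weighted more heavily
        loss = loss + weight * (flow_gt - flow).abs().mean()
    return loss
```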

4 Experiments

We evaluate RAFT on Sintel[9] and KITTI[16]. Following previous works, we pretrain our network on FlyingChairs[13] and FlyingThings[29], followed by dataset specific finetuning. Our method achieves state-of-the-art performance on both Sintel (both clean and final passes) and KITTI. Additionally, we test our method on 1080p video from the DAVIS dataset[32] to demonstrate that our method scales to videos of very high resolutions.

Implementation Details:

RAFT is implemented in PyTorch [31]. All modules are initialized from scratch with random weights. During training, we use the AdamW [27] optimizer with weight decay 0.00005 and clip gradients to the range [−1, 1]. Unless otherwise noted, we evaluate after 50 iterations on Sintel and 25 on KITTI.

Gradient Stopping: For every update, f_(k+1) = f_k + Δf, we only backpropagate the gradient through the Δf branch, and zero the gradient through the f_k branch.
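In an autograd framework this can be realized by detaching f_k before it enters the next update, as in the sketch below (reusing the hypothetical lookup and update_block from the earlier sketches).

```python
def refine_with_gradient_stopping(flow, hidden, inp, pyramid, update_block, iters=12):
    """Training-time loop: gradients flow only through the delta_f branch of each update."""
    predictions = []
    for _ in range(iters):
        flow = flow.detach()                           # zero the gradient through the f_k branch
        corr_features = lookup(pyramid, flow)          # correlation lookup at the detached f_k
        hidden, delta_flow = update_block(hidden, inp, corr_features, flow)
        flow = flow + delta_flow                       # gradient reaches the weights via delta_f only
        predictions.append(flow)
    return predictions
```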

Training Schedule: We pretrain on FlyingChairs for 100k iterations with a batch size of 6 and 2 GPUs. We then finetune on the FlyingThings3D dataset for an additional 60k iterations with a batch size of 3 and 2 GPUs. We linearly increase the learning rate for the first 20% of training, then linearly decay it to 0. In total, this gives 160k training steps. This is significantly fewer than the 7M steps used to train FlowNet2 and the 1.7M steps used to train PWC-Net+. VCN [44] is trained for 220k steps, but with 4 GPUs.

We perform dataset-specific finetuning on Sintel[9] and KITTI[16]. We use 60k iterations on Sintel and 40k on KITTI following the same schedule.

Augmentation: We apply color augmentation by adjusting contrast, saturation, brightness, and hue. We apply spatial augmentation by random resizing and flipping. Following HSM-Net [43], we also randomly erase rectangular regions in I_2 with probability 0.5 to simulate occlusions.

Figure 3: Visualization of predicted flow on Sintel.

4.1 Sintel

We train our model using the FlyingChairs → FlyingThings schedule and then evaluate on the Sintel dataset, using the train split for validation. Results are shown in Table 1 and Figure 3, and we split results based on the data used for training. C+T means that the models are trained on FlyingChairs (C) and FlyingThings (T), while C+T+S indicates the model is finetuned on Sintel (S).


| Training Data | Method | Sintel (train) Clean | Sintel (train) Final | Sintel (test) Clean | Sintel (test) Final |
|---|---|---|---|---|---|
| - | FlowFields [7] | - | - | 3.75 | 5.81 |
| - | FlowFields++ [35] | - | - | 2.94 | 5.49 |
| - | DCFlow [42] | - | - | 3.54 | 5.12 |
| - | TVNet [14] | 7.45 | 8.59 | - | - |
| - | MRFlow [41] | 1.83 | 3.59 | 2.53 | 5.38 |
| C+T | HD3 [45] | 3.84 | 8.77 | - | - |
| C+T | LiteFlowNet [20] | 2.48 | 4.04 | - | - |
| C+T | PWC-Net [37] | 2.55 | 3.93 | - | - |
| C+T | LiteFlowNet2 [21] | 2.24 | 3.78 | - | - |
| C+T | VCN [44] | 2.21 | 3.68 | - | - |
| C+T | FlowNet2 [23] | 2.02 | 3.54 | 3.96 | 6.02 |
| C+T | Ours (small) | 2.21 | 3.35 | - | - |
| C+T | Ours | 1.63 | 2.83 | - | - |
| C+T+S | FlowNet2 [23] | - | - | 4.16 | 5.74 |
| C+T+S | LiteFlowNet2 [21] | - | - | 3.45 | 4.90 |
| C+T+S | HD3 [45] | - | - | 4.79 | 4.67 |
| C+T+S | PWC-Net+ [36] | - | - | 3.45 | 4.60 |
| C+T+S | IRR-PWC [22] | - | - | 3.84 | 4.58 |
| C+T+S | VCN [44] | - | - | 2.81 | 4.40 |
| C+T+S | SelFlow [26] | - | - | 3.74 | 4.26 |
| C+T+S | Ours | - | - | 2.77 | 3.61 |
| C+T+S | Ours (warm-start) | - | - | 2.42 | 3.39 |
Table 1: Results on Sintel. We test the generalization performance on Sintel (train) after training on FlyingChairs (C) and FlyingThings (T), and outperform all existing methods on both the clean and final passes. After finetuning on Sintel (train), RAFT ranks 1st on the Sintel final pass, and 1st on the Sintel clean pass when warm-start initialization is used. (FlowNet2 originally reported results on the disparity split of Sintel; 3.54 is the EPE when their model is evaluated on the standard data [20]. One footnoted method uses Sintel data for training.)

When using C+T for training, our method outperforms all existing approaches, despite using a significantly shorter training schedule. Our method achieves an average EPE (end-point-error) of 1.63 on the Sintel (train) clean pass, which is a 20% lower error than FlowNet2 and 44% lower than PWC-Net. These results demonstrate good cross-dataset generalization. One of the reasons for better generalization is the structure of our network. By constraining optical flow to be the product of a series of identical update steps, we force the network to learn an update operator which mimics the updates of a first-order descent algorithm. This constrains the search space, reduces the risk of over-fitting, and leads to faster training and better generalization.

When evaluating on the Sintel (test) set, we finetune on the combined clean and final passes of the training set. Our method ranks 1st on both the Sintel clean and final passes, and outperforms SelFlow [26], the best performing prior work, by 0.87 pixels (3.39 versus 4.26). We evaluate two versions of our model: Ours uses zero initialization, while Ours (warm-start) initializes flow by forward projecting the flow estimate from the previous frame. Since our method operates at a single resolution, we can initialize the flow estimate to utilize motion smoothness from past frames, which cannot be easily done using coarse-to-fine models.


Figure 4: Flow predictions on the KITTI dataset (input image, VCN, ours) compared with VCN [44]; both VCN and ours are trained on FlyingChairs and FlyingThings.

4.2 KITTI

We also evaluate RAFT on KITTI and provide results in Table 2 and Figure 4. We first evaluate cross-dataset generalization by evaluating on the KITTI-15 (train) split after training on FlyingChairs (C) and FlyingThings (T). Our method outperforms prior works by a large margin, improving EPE (end-point-error) from 8.36 to 5.54, which shows that the underlying structure of our network facilitates generalization. This property is important for applying our method in circumstances where it is difficult to collect training data.

| Training Data | Method | KITTI-15 (train) F1-epe | KITTI-15 (train) F1-all | KITTI-15 (test) F1-all |
|---|---|---|---|---|
| - | FlowFields [7] | 8.33 | 24.4 | 15.31 |
| - | DCFlow [42] | - | - | 14.86 |
| - | MRFlow [41] | - | - | 12.19 |
| C+T | HD3 [45] | 13.17 | 24.0 | - |
| C+T | LiteFlowNet [20] | 10.39 | 28.5 | - |
| C+T | PWC-Net [37] | 10.35 | 33.7 | - |
| C+T | FlowNet2 [23] | 10.08 | 30.0 | - |
| C+T | LiteFlowNet2 [21] | 8.97 | 25.9 | - |
| C+T | VCN [44] | 8.36 | 25.1 | - |
| C+T | Ours (small) | 7.51 | 26.9 | - |
| C+T | Ours | 5.54 | 19.8 | - |
| C+T+K | FlowNet2 [23] | - | - | 11.48 |
| C+T+K | LiteFlowNet2 [21] | - | - | 7.74 |
| C+T+K | PWC-Net+ [36] | - | - | 7.72 |
| C+T+K | IRR-PWC [22] | - | - | 7.65 |
| C+T+K | HD3 [45] | - | - | 6.55 |
| C+T+K | VCN [44] | - | - | 6.30 |
| C+T+K | Ours | - | - | 6.30 |
Table 2: Results on KITTI. When trained only on synthetic data (C+T), RAFT generalizes well to KITTI, especially when compared to other deep networks on the EPE metric. After finetuning on KITTI, we match the performance of VCN with fewer parameters and faster inference.
Figure 5: (Left) EPE on the Sintel set as a function of the number of iterations at inference time. (Right) Magnitude of each update, ||Δf||, indicating convergence to a fixed point.

4.3 Ablations

We perform a set of ablation experiments to show the relative importance of each component. All ablated versions are trained on FlyingChairs (C) + FlyingThings (T). Results of the ablations are shown in Table 3. In each section of the table, we test a specific component of our approach in isolation; the settings used in our final model are underlined. Below we describe each of the experiments in more detail.

Number of Iterations: Although we unroll 12 iterations during training, we can apply an arbitrary number of iterations during inference. In Figure 5 (left), we plot EPE as a function of the number of iterations. Our method quickly converges, surpassing PWC-Net after 3 iterations and FlowNet2 after 6 iterations, but continues to improve with more iterations. Figure 5 (right) shows the magnitude of each subsequent update, ||Δf||. In Table 3 we provide numerical results for selected numbers of iterations, and test an extreme case of 1000 iterations to show that our method does not diverge.


| Experiment | Setting | Sintel (train) Clean | Sintel (train) Final | KITTI-15 (train) F1-epe | KITTI-15 (train) F1-all | Parameters |
|---|---|---|---|---|---|---|
| Inference Iter. | 1 | 4.57 | 5.97 | 18.10 | 46.3 | 4.8M |
| Inference Iter. | 3 | 2.47 | 3.86 | 10.00 | 29.8 | 4.8M |
| Inference Iter. | 8 | 1.87 | 3.00 | 6.32 | 21.8 | 4.8M |
| Inference Iter. | 32 | 1.71 | 2.83 | 5.54 | 19.8 | 4.8M |
| Inference Iter. | 100 | 1.64 | 2.86 | 5.73 | 20.1 | 4.8M |
| Inference Iter. | 1000 | 1.66 | 2.87 | 5.80 | 20.2 | 4.8M |
| Update Op. | ConvGRU | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Update Op. | Conv | 2.04 | 3.21 | 7.66 | 26.1 | 4.1M |
| Tying | Tied Weights | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Tying | Untied Weights | 1.96 | 3.20 | 7.64 | 24.1 | 32.5M |
| Context | Context | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Context | No Context | 1.93 | 3.06 | 6.25 | 23.1 | 3.3M |
| Feature Scale | Single-Scale | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Feature Scale | Multi-Scale | 2.08 | 3.12 | 6.91 | 23.2 | 6.6M |
| Lookup Radius | 0 | 3.41 | 4.53 | 23.6 | 44.8 | 4.7M |
| Lookup Radius | 1 | 1.80 | 2.99 | 6.27 | 21.5 | 4.7M |
| Lookup Radius | 2 | 1.78 | 2.82 | 5.84 | 21.1 | 4.8M |
| Lookup Radius | 4 | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Correlation Pooling | No | 1.95 | 3.02 | 6.07 | 23.2 | 4.7M |
| Correlation Pooling | Yes | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Correlation Range | 32px | 2.91 | 4.48 | 10.4 | 28.8 | 4.8M |
| Correlation Range | 64px | 2.06 | 3.16 | 6.24 | 20.9 | 4.8M |
| Correlation Range | 128px | 1.64 | 2.81 | 6.00 | 19.9 | 4.8M |
| Correlation Range | All-Pairs | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Features for Refinement | Correlation | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Features for Refinement | Warping | 2.27 | 3.73 | 11.83 | 32.1 | 2.8M |
Table 3: Ablation experiments. Settings used in our final model are underlined. See Sec. 4.3 for details.

Architecture of Update Operator: We use a gated activation unit based on the GRU cell. We experiment with replacing the convolutional GRU with a set of 3 convolutional layers with ReLU activation. We achieve better performance by using the GRU block, likely because the gated activation makes it easier for the sequence of flow estimates to converge.

Weight Tying: By default, we tie the weights across all instances of the update operator. Here, we test a version of our approach where each update operator learns a separate set of weights. Accuracy is better when weights are tied, and the parameter count is significantly lower. This suggests that weight tying acts as a useful constraint on the network.

Context: We test the importance of context by training a model with the context network removed. Without context, we still achieve good results, outperforming all existing works on both Sintel and KITTI. But context is helpful. Directly injecting image features into the update operator likely allows spatial information to be better aggregated within motion boundaries.

Feature Scale: By default, we extract features at a single resolution. We also try extracting features at multiple resolutions by building a correlation volume at each scale separately. Single-resolution features simplify the network architecture and allow fine-grained matching even at large displacements.

Lookup Radius: The lookup radius specifies the dimensions of the grid used in the lookup operation. When a radius of 0 is used, the correlation volume is retrieved at a single point. Surprisingly, we can still get a rough estimate of flow when the radius is 0, which means the network is learning to use zeroth-order information. However, we see better results as the radius is increased.

Correlation Pooling: We output features at a single resolution and then perform pooling to generate multiscale volumes. Here we test the impact when this pooling is removed. Results are better with pooling, because large and small displacements are both captured.

Correlation Range: Instead of all-pairs correlation, we also try constructing the correlation volume only for a local neighborhood around each pixel. We try a range of 32 pixels, 64 pixels, and 128 pixels. Overall we get the best results when the all-pairs are used, although a 128px range is sufficient to perform well on Sintel because most displacements fall within this range. That said, all-pairs is still preferable because it eliminates the need to specify a range. It is also more convenient to implement: it can be computed using matrix multiplication allowing our approach to be implemented entirely in PyTorch.

Features for Refinement: We compute visual similarity by building a correlation volume between all pairs of pixels. In this experiment, we try replacing the correlation volume with a warping layer, which uses the current estimate of optical flow to warp features from I_2 onto I_1 and then estimates the residual displacement. While warping is still competitive with prior work on Sintel, correlation performs significantly better, especially on KITTI.

4.4 Timing and Parameter Counts

Inference time and parameter counts are shown in Figure 6. Accuracy is determined by performance on the Sintel (train) final pass after training on FlyingChairs and FlyingThings (C+T). In these plots, we report accuracy and timing after 12 iterations, and we time our method using a GTX 1080Ti GPU. Parameter counts for other methods are taken as reported in their papers, and we report times when run on our hardware. RAFT is more efficient in terms of parameter count, inference time, and training iterations. The context and feature encoders use 1.05M parameters each. The update operator uses 1.5M parameters. The remaining 1.2M parameters are used to process correlation features and predict flow. Ours-S uses only 1M parameters, but outperforms PWC-Net and VCN, which are more than 6x larger. We provide an additional table with numerical values for parameters, timing, and training iterations in the supplemental material.

Figure 6: Plots comparing parameter counts, inference time, and training iterations vs. accuracy. Accuracy is measured by the EPE on the Sintel (train) final pass after training on C+T. Left: Parameter count vs. accuracy compared to other methods. RAFT is more parameter efficient while achieving lower EPE. Middle: Inference time vs. accuracy, timed using our hardware. Right: Training iterations vs. accuracy (taken as the product of iterations and GPUs used).

4.5 Video of Very High Resolution

To demonstrate that our method scales well to videos of very high resolution, we apply our network to HD video from the DAVIS [32] dataset. We use 1080p (1088x1920) resolution video and apply 12 iterations of our approach. Inference takes 550ms for 12 iterations on 1080p video, with all-pairs correlation taking 95ms. Fig. 7 visualizes example results on DAVIS.

Figure 7: Generalization to HD video from the DAVIS dataset. Inference time is 550ms per frame for 1080p (1088x1920) video.

5 Conclusions

We have proposed RAFT—Recurrent All-Pairs Field Transforms—a new end-to-end trainable model for optical flow. RAFT is unique in that it operates at a single resolution using a large number of lightweight, recurrent update operators. Our method achieves state-of-the-art accuracy across a diverse range of datasets, shows strong cross-dataset generalization, and is efficient in terms of inference time, parameter count, and training iterations.

Acknowledgments: This work was partially funded by the National Science Foundation under Grant No. 1617767.

References

  • [1] J. Adler and O. Öktem (2017) Solving ill-posed inverse problems using iterative deep neural networks. Inverse Problems 33 (12), pp. 124007. Cited by: §2.
  • [2] J. Adler and O. Öktem (2018) Learned primal-dual reconstruction. IEEE transactions on medical imaging 37 (6), pp. 1322–1332. Cited by: §2.
  • [3] A. Agrawal, B. Amos, S. Barratt, S. Boyd, S. Diamond, and J. Z. Kolter (2019) Differentiable convex optimization layers. In Advances in Neural Information Processing Systems, pp. 9558–9570. Cited by: §2.
  • [4] B. Amos and J. Z. Kolter (2017) Optnet: differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 136–145. Cited by: §2.
  • [5] S. Bai, J. Z. Kolter, and V. Koltun (2018) Trellis networks for sequence modeling. arXiv preprint arXiv:1810.06682. Cited by: §2.
  • [6] S. Bai, J. Z. Kolter, and V. Koltun (2019) Deep equilibrium models. In Advances in Neural Information Processing Systems, pp. 688–699. Cited by: §2.
  • [7] C. Bailer, B. Taetz, and D. Stricker (2015) Flow fields: dense correspondence fields for highly accurate large displacement optical flow estimation. In Proceedings of the IEEE international conference on computer vision, pp. 4015–4023. Cited by: Table 1, Table 2.
  • [8] T. Brox, C. Bregler, and J. Malik (2009) Large displacement optical flow. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 41–48. Cited by: §2.
  • [9] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black (2012) A naturalistic open source movie for optical flow evaluation. In European conference on computer vision, pp. 611–625. Cited by: 1st item, §1, §4, §4.
  • [10] A. Chambolle and T. Pock (2011) A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of mathematical imaging and vision 40 (1), pp. 120–145. Cited by: §2.
  • [11] Q. Chen and V. Koltun (2016) Full flow: optical flow estimation by global optimization over regular grids. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4706–4714. Cited by: §1, §2.
  • [12] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio (2014) On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259. Cited by: §2.
  • [13] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox (2015) Flownet: learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 2758–2766. Cited by: §1, §4.
  • [14] L. Fan, W. Huang, C. Gan, S. Ermon, B. Gong, and J. Huang (2018) End-to-end learning of motion representation for video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6016–6025. Cited by: §2, Table 1.
  • [15] J. Flynn, M. Broxton, P. Debevec, M. DuVall, G. Fyffe, R. S. Overbeck, N. Snavely, and R. Tucker (2019) DeepView: high-quality view synthesis by learned gradient descent. Cited by: §2.
  • [16] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237. Cited by: 2nd item, §1, §4, §4.
  • [17] H. Hirschmuller (2007) Stereo processing by semiglobal matching and mutual information. IEEE Transactions on pattern analysis and machine intelligence 30 (2), pp. 328–341. Cited by: §2.
  • [18] M. Hofinger, S. R. Bulò, L. Porzi, A. Knapitsch, T. Pock, and P. Kontschieder (2019) The five elements of flow. arXiv preprint arXiv:1912.10739. Cited by: §1, §2.
  • [19] B. K. Horn and B. G. Schunck (1981) Determining optical flow. In Techniques and Applications of Image Understanding, Vol. 281, pp. 319–331. Cited by: §1, §2.
  • [20] T. Hui, X. Tang, and C. Change Loy (2018) Liteflownet: a lightweight convolutional neural network for optical flow estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8981–8989. Cited by: Table 4, §1, §1, §1, §2, §2, Table 1, Table 2.
  • [21] T. Hui, X. Tang, and C. C. Loy (2019) A lightweight optical flow cnn–revisiting data fidelity and regularization. arXiv preprint arXiv:1903.07414. Cited by: §1, §2, Table 1, Table 2.
  • [22] J. Hur and S. Roth (2019) Iterative residual refinement for joint optical flow and occlusion estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5754–5763. Cited by: Table 4, §1, §2, §2, §3.2, Table 1, Table 2.
  • [23] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox (2017) Flownet 2.0: evolution of optical flow estimation with deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2462–2470. Cited by: Table 4, §1, §1, §2, Table 1, Table 2.
  • [24] E. Kobler, T. Klatzer, K. Hammernik, and T. Pock (2017) Variational networks: connecting variational methods and deep learning. In German conference on pattern recognition, pp. 281–293. Cited by: §2.
  • [25] Z. Liang, Y. Feng, Y. Guo, H. Liu, W. Chen, L. Qiao, L. Zhou, and J. Zhang (2018) Learning for disparity estimation through feature constancy. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2811–2820. Cited by: §2.
  • [26] P. Liu, M. Lyu, I. King, and J. Xu (2019) Selflow: self-supervised learning of optical flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4571–4580. Cited by: §4.1, Table 1.
  • [27] I. Loshchilov and F. Hutter (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. Cited by: §4.
  • [28] Z. Lv, F. Dellaert, J. M. Rehg, and A. Geiger (2019) Taking a deeper look at the inverse compositional algorithm. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4581–4590. Cited by: §2.
  • [29] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox (2016) A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4040–4048. Cited by: §4.
  • [30] M. Menze, C. Heipke, and A. Geiger (2015) Discrete optimization for optical flow. In German Conference on Pattern Recognition, pp. 16–28. Cited by: §2.
  • [31] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.
  • [32] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool (2017) The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675. Cited by: §4.5, §4.
  • [33] R. Ranftl, K. Bredies, and T. Pock (2014) Non-local total generalized variation for optical flow estimation. In European Conference on Computer Vision, pp. 439–454. Cited by: §2.
  • [34] A. Ranjan and M. J. Black (2017) Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4161–4170. Cited by: §2.
  • [35] R. Schuster, C. Bailer, O. Wasenmüller, and D. Stricker (2018) FlowFields++: accurate optical flow correspondences meet robust interpolation. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 1463–1467. Cited by: Table 1.
  • [36] D. Sun, X. Yang, M. Liu, and J. Kautz (2018) Models matter, so does training: an empirical study of cnns for optical flow estimation. arXiv preprint arXiv:1809.05571. Cited by: Table 4, Table 1, Table 2.
  • [37] D. Sun, X. Yang, M. Liu, and J. Kautz (2018) Pwc-net: cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8934–8943. Cited by: §1, §1, §1, §2, §2, §3.2, Table 1, Table 2.
  • [38] C. Tang and P. Tan (2018) Ba-net: dense bundle adjustment network. arXiv preprint arXiv:1806.04807. Cited by: §2.
  • [39] Z. Teed and J. Deng (2018) Deepv2d: video to depth with differentiable structure from motion. arXiv preprint arXiv:1812.04605. Cited by: §2, §2.
  • [40] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid (2013) DeepFlow: large displacement optical flow with deep matching. In Proceedings of the IEEE international conference on computer vision, pp. 1385–1392. Cited by: §2.
  • [41] J. Wulff, L. Sevilla-Lara, and M. J. Black (2017) Optical flow in mostly rigid scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4671–4680. Cited by: Table 1, Table 2.
  • [42] J. Xu, R. Ranftl, and V. Koltun (2017) Accurate optical flow via direct cost volume processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1289–1297. Cited by: §2, Table 1, Table 2.
  • [43] G. Yang, J. Manela, M. Happold, and D. Ramanan (2019) Hierarchical deep stereo matching on high-resolution images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5515–5524. Cited by: §4.
  • [44] G. Yang and D. Ramanan (2019) Volumetric correspondence networks for optical flow. In Advances in Neural Information Processing Systems, pp. 793–803. Cited by: Table 4, §1, §1, §1, §2, §2, §3.2, Figure 4, Table 1, Table 2, §4.
  • [45] Z. Yin, T. Darrell, and F. Yu (2019) Hierarchical discrete distribution decomposition for match density estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6044–6053. Cited by: §1, §2, Table 1, Table 2.
  • [46] C. Zach, T. Pock, and H. Bischof (2007) A duality based approach for realtime tv-l 1 optical flow. In Joint pattern recognition symposium, pp. 214–223. Cited by: §1, §2.
  • [47] H. Zhou, B. Ummenhofer, and T. Brox (2018) Deeptam: deep tracking and mapping. In Proceedings of the European conference on computer vision (ECCV), pp. 822–838. Cited by: §2.

Appendix 0.A Appendix

0.A.1 Network Architecture

Figure 8: Network architecture details for the full 4.8M parameter model and the small 1.0M parameter model. The context and feature encoders have the same architecture; the only difference is that the feature encoder uses instance normalization while the context encoder uses batch normalization. In RAFT-S, we replace the residual units with bottleneck residual units. The update block takes in context features, correlation features, and flow features to update the latent hidden state. The updated hidden state is used to predict the flow update. The full model uses two convolutional GRU update blocks with 1x5 filters and 5x1 filters respectively, while the small model uses a single GRU with 3x3 filters.

0.A.2 Timing, Parameters, and Training Iterations


| Method | Parameters | Time (Reported) | Time (1080Ti) | Training Iter. (#GPUs) | Accuracy |
|---|---|---|---|---|---|
| LiteFlowNetX [20] | 0.9M | 0.03s | - | 2000k | 4.79 |
| LiteFlowNet [20] | 5.4M | 0.09s | 0.09s | 2000k | 4.04 |
| IRR-PWC [22] | 6.4M | - | 0.20s | 850k | 3.95 |
| PWC-Net+ [36] | 9.4M | 0.03s | 0.04s | 1700k | 3.93 |
| VCN [44] | 6.2M | 0.18s | 0.26s | 220k (4) | 3.63 |
| FlowNet2 [23] | 162M | 0.12s | 0.11s | 7000k | 3.54 |
| Ours (small) | 1.0M | - | 0.05s | 160k (2) | 3.37 |
| Ours | 4.8M | - | 0.11s | 160k (2) | 2.87 |
Table 4: Parameter counts, inference time, training iterations, and accuracy on the Sintel (train) final pass. We report the timing and accuracy of our method after 12 iterations using a GTX 1080Ti GPU. Where possible, we download the code of the other methods and re-time it on our machine. If a model is trained using more than one GPU, we report the number of GPUs used in parentheses. Overall, RAFT requires far fewer training iterations and parameters than prior work.