1 Introduction
Optical flow is the task of estimating per-pixel motion between video frames. It is a long-standing vision problem that remains unsolved. The best systems are limited by difficulties including fast-moving objects, occlusions, motion blur, and textureless surfaces.
Optical flow has traditionally been approached as a hand-crafted optimization problem over the space of dense displacement fields between a pair of images [19, 46, 11]. Generally, the optimization objective defines a trade-off between a data term which encourages the alignment of visually similar image regions and a regularization term which imposes priors on the plausibility of motion. Such an approach has achieved considerable success, but further progress has appeared challenging, due to the difficulties in hand-designing an optimization objective that is robust to a variety of corner cases.
Recently, deep learning has been shown to be a promising alternative to traditional methods. Deep learning can side-step formulating an optimization problem and instead train a network to directly predict flow. Current deep learning methods [23, 37, 20, 44, 18] have achieved performance comparable to the best traditional methods while being significantly faster at inference time. A key question for further research is designing effective architectures that perform better, train more easily, and generalize well to novel scenes.
Figure 1: Overview of RAFT. A correlation layer constructs a 4D correlation volume by taking the inner product of all pairs of feature vectors; the last two dimensions of the 4D volume are pooled at multiple scales to construct a set of multi-scale volumes. An update operator recurrently updates optical flow by using the current estimate to look up values from the set of correlation volumes.

We introduce Recurrent All-Pairs Field Transforms (RAFT), a new deep network architecture for optical flow. RAFT enjoys the following strengths:
- State-of-the-art accuracy: RAFT achieves large improvements over existing methods. On Sintel [9] (final pass), a challenging benchmark covering diverse motions, RAFT obtains an end-point-error of 3.39 pixels, a 20% error reduction from the best published result (4.26 pixels).
- Strong generalization: When trained only on synthetic data, RAFT achieves an end-point-error of 5.54 pixels on KITTI [16], a 34% error reduction from the best prior deep network trained on the same data (8.36 pixels).
- High efficiency: RAFT processes videos at 9 frames per second on a 1080Ti GPU. It trains with 10X fewer iterations than other architectures. A smaller version of RAFT with 1/5 of the parameters runs at 20 frames per second while still outperforming all prior methods on Sintel.
RAFT consists of three main components: (1) a feature encoder that extracts a feature vector for each pixel; (2) a correlation layer that produces a 4D correlation volume for all pairs of pixels, with subsequent pooling to produce lower resolution volumes; (3) a recurrent GRU-based update operator that retrieves values from the correlation volumes and iteratively updates a flow field initialized at zero. Fig. 1 illustrates the design of RAFT.
The RAFT architecture is motivated by traditional optimization-based approaches. The feature encoder extracts per-pixel features. The correlation layer computes visual similarity between pixels. The update operator mimics the steps of an iterative optimization algorithm. But unlike traditional approaches, features and motion priors are not handcrafted but learned—learned by the feature encoder and the update operator respectively.
The design of RAFT draws inspiration from many existing works but is substantially novel. First, RAFT operates entirely at high resolution. It maintains and updates a single high-resolution flow field, with zero upsampling operations during inference. This is different from the prevailing coarse-to-fine design in prior work [37, 44, 20, 21, 45], where flow is first estimated at low resolution and upsampled and refined at high resolution. By operating entirely at high resolution, RAFT overcomes several limitations of a coarse-to-fine cascade: the difficulty of recovering from errors at coarse resolutions, the tendency to miss small fast-moving objects, and the many training iterations (often over 1M) typically required for training a multi-stage cascade.
Second, the update operator of RAFT is recurrent and lightweight. Many recent works [22, 37, 44, 20, 23] have included some form of iterative refinement, but do not tie the weights across iterations [37, 44, 20] and are therefore limited to a fixed number of iterations. To our knowledge, IRR [22] is the only deep learning approach that is recurrent. It uses FlowNetS [13] or PWC-Net [37] as its recurrent unit. When using FlowNetS, it is limited by the size of the network (38M parameters) and is only applied up to 5 iterations. When using PWC-Net, iterations are limited by the number of pyramid levels. In contrast, our update operator has only 2.7M parameters and can be applied 100+ times during inference without divergence.
Third, the update operator has a novel design, which consists of a convolutional GRU that performs lookups on 4D multi-scale correlation volumes; in contrast, refinement modules in prior work typically use only plain convolution or correlation layers.
2 Related Work
Optical Flow as Energy Minimization Optical flow has traditionally been treated as an energy minimization problem which imposes a tradeoff between a data term and a regularization term. Horn and Schunck [19] formulated optical flow as a continuous optimization problem using a variational framework, and were able to estimate a dense flow field by performing gradient steps. TV-L1 [46]
replaced the quadratic penalties with an L1 data term and total variation regularization, which allowed for motion discontinuities and was better equipped to handle outliers. Improvements have been made by defining better matching costs
[40, 8] and regularization terms [33].

Such continuous formulations maintain a single estimate of optical flow which is refined at each iteration. To ensure a smooth objective function, a first-order Taylor approximation is used to model the data term. As a result, they only work well for small displacements. To handle large displacements, the coarse-to-fine strategy is used, where an image pyramid is used to estimate large displacements at low resolution and then small displacements are refined at high resolution. But this coarse-to-fine strategy may miss small fast-moving objects and have difficulty recovering from early mistakes. Like continuous methods, we maintain a single estimate of optical flow which is refined with each iteration. However, since we build correlation volumes over all pairs at both high resolution and low resolution, each local update uses information about both small and large displacements. Instead of using a subpixel Taylor approximation of the data term, our update operator learns to propose the descent direction.
More recently, optical flow has also been approached as a discrete optimization problem [30, 11, 42] using a global objective. One challenge of this approach is the massive size of the search space, as each pixel can be reasonably paired with thousands of points in the other frame. Menze et al. [30] pruned the search space using feature descriptors and approximated the global MAP estimate using message passing. Chen et al. [11] showed that by using the distance transform, solving the global optimization problem over the full space of flow fields is tractable. DCFlow [42]
showed further improvements in results by using a neural network as a feature descriptor, and constructed a 4D cost volume over all pairs of features. The 4D cost volume was then processed using the Semi-Global Matching (SGM) algorithm
[17]. Like DCFlow, we also construct 4D cost volumes over learned features. However, instead of processing the cost volumes with SGM, we use a neural network to estimate flow. Our approach is end-to-end differentiable, meaning the feature encoder can be trained with the rest of the network to directly minimize the error of the final flow estimate. In contrast, DCFlow requires its network to be trained using an embedding loss between pixels; it cannot be trained directly on optical flow because its cost volume processing is not differentiable.

Direct Flow Prediction Neural networks have been trained to directly predict optical flow between a pair of frames, side-stepping the optimization problem completely. Coarse-to-fine processing has emerged as a popular ingredient in many recent works [37, 45, 20, 21, 22, 44, 18]. A defining feature of coarse-to-fine processing is the use of upsampling operations, which allow coarse estimates to be refined at higher resolutions. In contrast, our method maintains and updates a single high-resolution flow field and does not use any upsampling operations during inference.
Iterative Refinement for Optical Flow Many recent works have used iterative refinement to improve results on optical flow [23, 34, 37, 20, 44] and related tasks [25, 47, 39]. Ilg et al. [23] applied iterative refinement to optical flow by stacking multiple FlowNetS and FlowNetC modules in series. SpyNet[34], PWC-Net[37], LiteFlowNet[20], and VCN [44] apply iterative refinement using coarse-to-fine pyramids. The main difference of these approaches from ours is that they do not share weights between iterations.
More closely related to our approach is IRR [22], which builds off of the FlowNetS and PWC-Net architectures but shares weights between refinement networks. When using FlowNetS, it is limited by the size of the network (38M parameters) and is only applied up to 5 iterations. When using PWC-Net, iterations are limited by the number of pyramid levels. In contrast, we use a much simpler refinement module (2.7M parameters) which can be applied for 100+ iterations during inference without divergence.
Our method also has ties to TrellisNet [5] and Deep Equilibrium Models (DEQ) [6], which both use depth-tied weights over a large number of layers. TrellisNet and DEQ were designed for sequence modeling tasks, but we adopt the core idea of using a large number of weight-tied units. Our update operator uses a modified GRU block [12], which is similar to the LSTM block used in TrellisNet. We found that this structure allows our update operator to more easily converge to a fixed point.
Learning to Optimize Many problems in vision can be formulated as an optimization problem. This has motivated several works to embed optimization problems into network architectures [4, 3, 38, 28, 39]. These works typically use a network to predict the inputs or parameters of the optimization problem, and then train the network weights by backpropagating the gradient through the solver, either implicitly [4, 3] or by unrolling each step [28, 38]. However, this technique is limited to problems with an objective that can be easily defined.
Another approach is to learn iterative updates directly from data [1, 2]. These approaches are motivated by the fact that first-order optimizers such as Primal Dual Hybrid Gradient (PDHG) [10] can be expressed as a sequence of iterative update steps. Instead of using an optimizer directly, Adler et al. [1] proposed building a network which mimics the updates of a first-order algorithm. This approach has been applied to inverse problems such as image denoising [24], tomographic reconstruction [2], and novel view synthesis [15]. TVNet [14] implemented the TV-L1 algorithm as a computation graph, which enabled training of the TV-L1 parameters. However, TVNet operates directly on intensity gradients instead of learned features, which limits the achievable accuracy on challenging datasets such as Sintel.
Our approach can be viewed as learning to optimize: our network uses a large number of update blocks to emulate the steps of a first-order optimization algorithm. However, unlike prior work, we never explicitly define a gradient with respect to some optimization objective. Instead, our network retrieves features from correlation volumes to propose the descent direction.
3 Approach
Given a pair of consecutive RGB images, $I_1$ and $I_2$, we estimate a dense displacement field $\mathbf{f}$ which maps each pixel $\mathbf{x} = (u, v)$ in $I_1$ to its corresponding coordinates $\mathbf{x}' = \mathbf{x} + \mathbf{f}(\mathbf{x})$ in $I_2$. An overview of our approach is given in Figure 1. Our method can be distilled down to three stages: (1) feature extraction, (2) computing visual similarity, and (3) iterative updates, where all stages are differentiable and composed into an end-to-end trainable architecture.
3.1 Feature Extraction
Features are extracted from the input images using a convolutional network. The feature encoder network $g_\theta$ is applied to both $I_1$ and $I_2$ and maps the input images to dense feature maps at a lower resolution. Our encoder outputs features at 1/8 resolution, $g_\theta : \mathbb{R}^{H\times W\times 3} \to \mathbb{R}^{H/8 \times W/8 \times D}$, where we set $D = 256$. The feature encoder consists of 6 residual blocks, 2 at 1/2 resolution, 2 at 1/4 resolution, and 2 at 1/8 resolution (more details in the supplemental material).
We additionally use a context network. The context network $h_\theta$ extracts features only from the first input image $I_1$. The architecture of the context network is identical to the feature extraction network. Together, the feature network and the context network form the first stage of our approach, which only needs to be performed once.
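For concreteness, a minimal PyTorch sketch of an encoder with this layout is given below (strided residual stages down to 1/8 resolution, 256 output channels). The stem and channel widths are illustrative assumptions, not the exact configuration used for the reported results; per the supplemental material, the feature encoder would use instance normalization and the context encoder batch normalization.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Simplified residual block; channel widths in Encoder below are illustrative.
    def __init__(self, in_ch, out_ch, stride=1, norm=nn.InstanceNorm2d):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1)
        self.norm1, self.norm2 = norm(out_ch), norm(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.down = (nn.Conv2d(in_ch, out_ch, 1, stride)
                     if (stride != 1 or in_ch != out_ch) else nn.Identity())

    def forward(self, x):
        y = self.relu(self.norm1(self.conv1(x)))
        y = self.norm2(self.conv2(y))
        return self.relu(y + self.down(x))

class Encoder(nn.Module):
    # Maps an H x W x 3 image to H/8 x W/8 x D features (D = 256 for g_theta).
    # Six residual blocks: 2 at 1/2, 2 at 1/4, 2 at 1/8 resolution.
    def __init__(self, out_dim=256, norm=nn.InstanceNorm2d):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, 64, 7, 2, 3), norm(64), nn.ReLU(True))   # 1/2
        self.layer1 = nn.Sequential(ResidualBlock(64, 64, 1, norm), ResidualBlock(64, 64, 1, norm))      # 1/2
        self.layer2 = nn.Sequential(ResidualBlock(64, 96, 2, norm), ResidualBlock(96, 96, 1, norm))      # 1/4
        self.layer3 = nn.Sequential(ResidualBlock(96, 128, 2, norm), ResidualBlock(128, 128, 1, norm))   # 1/8
        self.head = nn.Conv2d(128, out_dim, 1)

    def forward(self, x):
        return self.head(self.layer3(self.layer2(self.layer1(self.stem(x)))))
```

The context network would reuse the same class with `norm=nn.BatchNorm2d`.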
3.2 Computing Visual Similarity
We compute visual similarity by constructing a full correlation volume between all pairs. Given image features $g_\theta(I_1) \in \mathbb{R}^{H\times W\times D}$ and $g_\theta(I_2) \in \mathbb{R}^{H\times W\times D}$, the correlation volume is formed by taking the dot product between all pairs of feature vectors. The correlation volume, $\mathbf{C}(g_\theta(I_1), g_\theta(I_2)) \in \mathbb{R}^{H\times W\times H\times W}$, can be efficiently computed as a single matrix multiplication:

$$\mathbf{C}_{ijkl} = \sum_{h} g_\theta(I_1)_{ijh} \cdot g_\theta(I_2)_{klh}. \tag{1}$$
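As an illustration, the correlation volume of Eq. (1) can be computed with one matrix multiplication over flattened feature maps; this is a minimal sketch, not the released implementation.

```python
import torch

def all_pairs_correlation(fmap1, fmap2):
    """fmap1, fmap2: (B, D, H, W) feature maps from the encoder.
    Returns the 4D correlation volume of shape (B, H, W, H, W)."""
    B, D, H, W = fmap1.shape
    f1 = fmap1.view(B, D, H * W)                 # (B, D, N)
    f2 = fmap2.view(B, D, H * W)                 # (B, D, N)
    corr = torch.matmul(f1.transpose(1, 2), f2)  # (B, N, N) of inner products
    return corr.view(B, H, W, H, W)
```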
Correlation Pyramid: We construct a 4-layer pyramid $\{\mathbf{C}^1, \mathbf{C}^2, \mathbf{C}^3, \mathbf{C}^4\}$ by pooling the last two dimensions of the correlation volume with kernel sizes 1, 2, 4, and 8 and equivalent stride (Figure 2). Thus, volume $\mathbf{C}^k$ has dimensions $H \times W \times H/2^{k-1} \times W/2^{k-1}$. The set of volumes gives information about both large and small displacements; however, by maintaining the first two dimensions (the $I_1$ dimensions) we maintain high resolution information, allowing our method to recover the motions of small fast-moving objects.

Correlation Lookup: We define a lookup operator $L_{\mathbf{C}}$ which generates a feature map by indexing from the correlation pyramid. Given a current estimate of optical flow $\mathbf{f}$, we map each pixel $\mathbf{x} = (u, v)$ in $I_1$ to its estimated correspondence in $I_2$: $\mathbf{x}' = \mathbf{x} + \mathbf{f}(\mathbf{x})$. We then define a local grid around $\mathbf{x}'$,

$$\mathcal{N}(\mathbf{x}')_r = \left\{ \mathbf{x}' + \mathbf{dx} \;\middle|\; \mathbf{dx} \in \mathbb{Z}^2,\ \|\mathbf{dx}\|_1 \le r \right\}, \tag{2}$$

as the set of integer offsets which are within a radius of $r$ units of $\mathbf{x}'$ using the L1 distance. We use the local neighborhood $\mathcal{N}(\mathbf{x}')_r$ to index from the correlation volume. Since $\mathcal{N}(\mathbf{x}')_r$ is a grid of real numbers, we use bilinear sampling.
We perform lookups on all levels of the pyramid, such that the correlation volume at level $k$, $\mathbf{C}^k$, is indexed using the grid $\mathcal{N}(\mathbf{x}'/2^{k-1})_r$. A constant radius across levels means larger context at the coarser levels: for the coarsest level, using a radius of 4 corresponds to a range of 256 pixels at the original resolution. The values from each level are then concatenated into a single feature map.
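The pooling and lookup steps might be sketched as follows, reusing the volume from the previous snippet. Building the pyramid by repeated 2x average pooling and sampling with `grid_sample` are implementation assumptions; the snippet also samples a square (2r+1) x (2r+1) window, which is a superset of the L1 ball of Eq. (2).

```python
import torch
import torch.nn.functional as F

def build_pyramid(corr, num_levels=4):
    # corr: (B, H, W, H, W). Pool the last two dims; level k is downsampled by 2**k
    # (0-indexed here, corresponding to the paper's C^1..C^4).
    B, H, W, _, _ = corr.shape
    c = corr.reshape(B * H * W, 1, H, W)
    pyramid = [c]
    for _ in range(num_levels - 1):
        c = F.avg_pool2d(c, kernel_size=2, stride=2)
        pyramid.append(c)
    return pyramid

def lookup(pyramid, coords, radius=4):
    # coords: (B, 2, H, W) holding x' = (x, y) in feature-map pixels.
    B, _, H, W = coords.shape
    dx = torch.linspace(-radius, radius, 2 * radius + 1, device=coords.device)
    # square window of offsets with |dx|, |dy| <= r (superset of the L1 ball)
    delta = torch.stack(torch.meshgrid(dx, dx, indexing="ij"), dim=-1)
    out = []
    for k, corr_k in enumerate(pyramid):
        centers = coords.permute(0, 2, 3, 1).reshape(B * H * W, 1, 1, 2) / 2 ** k
        grid = centers + delta.view(1, 2 * radius + 1, 2 * radius + 1, 2)
        _, _, Hk, Wk = corr_k.shape
        # normalize to [-1, 1] for grid_sample (x along width, y along height)
        grid = 2 * grid / torch.tensor([Wk - 1, Hk - 1], device=grid.device) - 1
        sampled = F.grid_sample(corr_k, grid, align_corners=True)  # (B*H*W, 1, 2r+1, 2r+1)
        out.append(sampled.view(B, H, W, -1))
    return torch.cat(out, dim=-1).permute(0, 3, 1, 2)  # (B, levels*(2r+1)^2, H, W)
```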
An important point here is that we build the grid $\mathcal{N}(\mathbf{x}')_r$ directly around the current correspondence estimate $\mathbf{x}'$. Previous work has used warping operations followed by local correlation [37, 22, 44], so the local search is actually performed in a warped coordinate system; in contrast, our approach avoids warping. While this difference is subtle, it is important for subpixel accuracy, particularly near motion boundaries where warping will change the local geometry.
Efficient Computation for High Resolution Images: The all-pairs correlation scales $O(N^2)$, where $N$ is the number of pixels, but it only needs to be computed once and its cost is constant in the number of iterations $M$. However, there exists an equivalent implementation of our approach which scales $O(NM)$, exploiting the linearity of the inner product and average pooling. Consider the cost volume at level $m$, whose last two dimensions are pooled with kernel size $s$, and the feature maps $g^{(1)} = g_\theta(I_1)$ and $g^{(2)} = g_\theta(I_2)$:

$$\mathbf{C}^m_{ijkl} = \frac{1}{s^2}\sum_{p=1}^{s}\sum_{q=1}^{s} \left\langle g^{(1)}_{i,j},\; g^{(2)}_{sk+p,\,sl+q} \right\rangle = \left\langle g^{(1)}_{i,j},\; \frac{1}{s^2}\sum_{p=1}^{s}\sum_{q=1}^{s} g^{(2)}_{sk+p,\,sl+q} \right\rangle,$$

which is the average over the correlation response in the $s \times s$ grid. This means that the value at $\mathbf{C}^m_{ijkl}$ can be computed as the inner product between the feature vector $g^{(1)}_{ij}$ and $g^{(2)}$ pooled with kernel size $s$.

In this alternative implementation, we do not precompute the correlations, but instead precompute the pooled image feature maps. In each iteration, we compute each correlation value on demand, only when it is looked up. This gives a complexity of $O(NM)$.
We found empirically that precomputing all pairs is easy to implement and not a bottleneck, due to highly optimized matrix routines on GPUs—even for 1088x1920 videos it takes only 17% of total inference time. Note that we can always switch to the alternative implementation should it become a bottleneck.
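A sketch of the memory-efficient alternative follows: pool the $I_2$ features once per level and evaluate individual correlation values only when they are looked up. Function names and shapes are illustrative, not the authors' interfaces.

```python
import torch
import torch.nn.functional as F

def pooled_features(fmap2, num_levels=4):
    # Precompute g^(2) average-pooled at each pyramid scale (done once, cost O(N)).
    feats = [fmap2]
    for _ in range(num_levels - 1):
        feats.append(F.avg_pool2d(feats[-1], kernel_size=2, stride=2))
    return feats

def correlation_on_demand(fmap1, fmap2_level, x1, x2):
    """Correlation between pixel x1 = (i, j) of I1 and location x2 = (k, l) of the
    pooled I2 features: a single inner product instead of a stored 4D volume,
    so M lookups cost O(N M) overall."""
    f1 = fmap1[:, :, x1[0], x1[1]]        # (B, D)
    f2 = fmap2_level[:, :, x2[0], x2[1]]  # (B, D)
    return (f1 * f2).sum(dim=1)           # (B,)
```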
3.3 Iterative Updates
Our update operator estimates a sequence of flow estimates $\{\mathbf{f}_1, \dots, \mathbf{f}_N\}$ from an initial starting point $\mathbf{f}_0 = 0$. With each iteration, it produces an update direction $\Delta\mathbf{f}$ which is applied to the current estimate: $\mathbf{f}_{k+1} = \Delta\mathbf{f} + \mathbf{f}_k$.
The update operator takes flow, correlation, and a latent hidden state as input, and outputs the update $\Delta\mathbf{f}$ and an updated hidden state. The architecture of our update operator is designed to mimic the steps of an optimization algorithm. As such, we use tied weights across depth and use bounded activations to encourage convergence to a fixed point. The update operator is trained to perform updates such that the sequence converges to a fixed point $\mathbf{f}_k \to \mathbf{f}^*$.
Initialization: By default, we initialize the flow field to 0 everywhere, but our iterative approach gives us the flexibility to experiment with alternatives. When applied to video, we test warm-start initialization, where optical flow from the previous pair of frames is forward projected to the next pair of frames, with occlusion gaps filled in using nearest-neighbor interpolation.
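One way to realize warm-start initialization is sketched below in NumPy/SciPy: splat the previous flow to its rounded target locations and fill the remaining gaps from the nearest filled pixel. The rounding-based splatting and the distance-transform fill are our assumptions, not necessarily the authors' exact procedure.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def warm_start(flow_prev):
    """flow_prev: (H, W, 2) flow from the previous frame pair, channels (dx, dy).
    Forward-project it to the next pair and fill gaps by nearest-neighbor."""
    H, W, _ = flow_prev.shape
    ys, xs = np.mgrid[0:H, 0:W]
    xt = np.round(xs + flow_prev[..., 0]).astype(int)
    yt = np.round(ys + flow_prev[..., 1]).astype(int)
    valid = (xt >= 0) & (xt < W) & (yt >= 0) & (yt < H)
    init = np.zeros_like(flow_prev)
    hit = np.zeros((H, W), dtype=bool)
    init[yt[valid], xt[valid]] = flow_prev[valid]   # scatter flow to target pixels
    hit[yt[valid], xt[valid]] = True
    # fill occlusion gaps (pixels nothing projected onto) with the nearest filled value
    _, idx = distance_transform_edt(~hit, return_indices=True)
    return init[idx[0], idx[1]]
```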
Inputs: Given the current flow estimate $\mathbf{f}_k$, we use it to retrieve correlation features from the correlation pyramid as described in Sec. 3.2. The correlation features are then processed by 2 convolutional layers. Additionally, we apply 2 convolutional layers to the flow estimate itself to generate flow features. Finally, we directly inject the input from the context network. The input feature map is then taken as the concatenation of the correlation, flow, and context features.
Update: A core component of the update operator is a gated activation unit based on the GRU cell, with fully connected layers replaced with convolutions:
$$z_t = \sigma(\mathrm{Conv}_{3\times3}([h_{t-1}, x_t], W_z)) \tag{3}$$
$$r_t = \sigma(\mathrm{Conv}_{3\times3}([h_{t-1}, x_t], W_r)) \tag{4}$$
$$\tilde{h}_t = \tanh(\mathrm{Conv}_{3\times3}([r_t \odot h_{t-1}, x_t], W_h)) \tag{5}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{6}$$
where $x_t$ is the concatenation of flow, correlation, and context features previously defined. We also experiment with a separable ConvGRU unit, where we replace the $3\times3$ convolution with two GRUs: one with a $1\times5$ convolution and one with a $5\times1$ convolution, to increase the receptive field without significantly increasing the size of the model.
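Equations (3)-(6) correspond to a GRU cell whose linear maps are replaced by 3x3 convolutions; a minimal sketch follows, with the hidden and input dimensions as placeholder values.

```python
import torch
import torch.nn as nn

class ConvGRU(nn.Module):
    # GRU cell with fully connected layers replaced by 3x3 convolutions (Eqs. 3-6).
    def __init__(self, hidden_dim=128, input_dim=192):
        super().__init__()
        self.convz = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convr = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convq = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))                          # Eq. (3)
        r = torch.sigmoid(self.convr(hx))                          # Eq. (4)
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))   # Eq. (5)
        return (1 - z) * h + z * q                                 # Eq. (6)
```

The separable variant would stack two such cells, one with 1x5 and one with 5x1 convolutions in place of the 3x3 kernels.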
Flow Prediction: The hidden state output by the GRU is passed through two convolutional layers to predict the flow update $\Delta\mathbf{f}$. The output flow is at 1/8 resolution of the input image. During training and evaluation, we upsample the predicted flow fields to match the resolution of the ground truth.
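Because the predicted flow lives at 1/8 of the input resolution, its displacement values must be rescaled when the field is upsampled. A simple bilinear version is shown below; the interpolation scheme here is an assumption, not necessarily the one used for the reported results.

```python
import torch.nn.functional as F

def upsample_flow(flow, scale=8):
    # flow: (B, 2, H/8, W/8) -> (B, 2, H, W); displacement values scale with resolution.
    up = F.interpolate(flow, scale_factor=scale, mode="bilinear", align_corners=False)
    return up * scale
```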
3.4 Supervision
We supervise our network on the $l_1$ distance between the predicted and ground truth optical flow over the full sequence of flow estimates, $\{\mathbf{f}_1, \dots, \mathbf{f}_N\}$, with exponentially increasing weights. Given ground truth flow $\mathbf{f}_{gt}$, the loss function is defined as

$$\mathcal{L} = \sum_{i=1}^{N} \gamma^{N-i}\, \lVert \mathbf{f}_{gt} - \mathbf{f}_i \rVert_1, \tag{7}$$

where we set $\gamma = 0.8$ in our experiments.
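Eq. (7) translates directly into a short training loss; the per-pixel mean used below is one common normalization of the $l_1$ term.

```python
def sequence_loss(flow_preds, flow_gt, gamma=0.8):
    """L1 loss over the full sequence of predictions with exponentially
    increasing weights gamma^(N-i), as in Eq. (7)."""
    n = len(flow_preds)
    loss = 0.0
    for i, f in enumerate(flow_preds):          # i = 0 .. n-1
        weight = gamma ** (n - i - 1)           # later iterations weighted more
        loss = loss + weight * (flow_gt - f).abs().mean()
    return loss
```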
4 Experiments
We evaluate RAFT on Sintel[9] and KITTI[16]. Following previous works, we pretrain our network on FlyingChairs[13] and FlyingThings[29], followed by dataset specific finetuning. Our method achieves state-of-the-art performance on both Sintel (both clean and final passes) and KITTI. Additionally, we test our method on 1080p video from the DAVIS dataset[32] to demonstrate that our method scales to videos of very high resolutions.
Implementation Details: RAFT is implemented in PyTorch [31]. All modules are initialized from scratch with random weights. During training, we use the AdamW [27] optimizer with weight decay 0.00005 and clip gradients to the range [-1, 1]. Unless otherwise noted, we evaluate after 50 iterations on Sintel and 25 on KITTI.

Gradient Stopping: For every update, $\mathbf{f}_{k+1} = \mathbf{f}_k + \Delta\mathbf{f}$, we only backpropagate the gradient through the $\Delta\mathbf{f}$ branch, and zero the gradient through the $\mathbf{f}_k$ branch.
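Gradient stopping amounts to detaching the current estimate before the residual is added, so gradients reach the network only through the $\Delta\mathbf{f}$ branch. The sketch below shows one possible unrolled loop; `update_block` and `lookup` are stand-ins for the modules of Secs. 3.2-3.3, not the authors' exact interfaces.

```python
def unroll_updates(update_block, lookup, pyramid, context, coords0,
                   hidden, flow, num_iters=12):
    """Unrolled refinement with gradient stopping: f_{k+1} = f_k + delta_f,
    where f_k is detached so gradients pass only through the delta_f branch."""
    predictions = []
    for _ in range(num_iters):
        flow = flow.detach()                      # zero the gradient through f_k
        corr = lookup(pyramid, coords0 + flow)    # correlation features at x' = x + f
        hidden, delta_flow = update_block(hidden, context, corr, flow)
        flow = flow + delta_flow                  # update applied to current estimate
        predictions.append(flow)
    return predictions
```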
Training Schedule: We pretrain on FlyingChairs for 100k iterations with a batch size of 6 on 2 GPUs. We then finetune on the FlyingThings3D dataset for an additional 60k iterations with a batch size of 3 on 2 GPUs. We linearly increase the learning rate for the first 20% of training, then linearly decay it to 0. In total, this gives 160k training steps. This is significantly fewer than the 7M steps used to train FlowNet2 and the 1.7M steps used to train PWC-Net. VCN [44] is trained for 220k steps, but with 4 GPUs.
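The linear warm-up and decay can be expressed, for example, with PyTorch's OneCycleLR using its linear annealing strategy; the learning rate below is an illustrative placeholder, while the weight decay matches the value stated above.

```python
import torch

model = torch.nn.Conv2d(3, 3, 3)  # placeholder module standing in for RAFT
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=5e-5)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=4e-4,                # illustrative peak learning rate
    total_steps=100_000,        # e.g. the FlyingChairs pretraining stage
    pct_start=0.2,              # linear increase over the first 20% of training
    anneal_strategy="linear",   # then linear decay toward (near) zero
)
```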
We perform dataset-specific finetuning on Sintel[9] and KITTI[16]. We use 60k iterations on Sintel and 40k on KITTI following the same schedule.
Augmentation: We apply color augmentation by adjusting contrast, saturation, brightness, and hue. We apply spatial augmentation by random resizing and flipping. Following HSM-Net [43], we also randomly erase rectangular regions in $I_2$ with probability 0.5 to simulate occlusions.
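A possible sketch of the occlusion-simulating eraser: with probability 0.5, overwrite one or two random rectangles in $I_2$ with the image mean. The box count and size range are illustrative assumptions.

```python
import torch

def erase_rectangles(img2, prob=0.5, max_boxes=2, min_size=20, max_size=100):
    """Occlude random rectangles in I_2 with the per-channel mean color.
    Assumes img2 is (C, H, W) and larger than max_size in both spatial dims."""
    if torch.rand(1).item() < prob:
        C, H, W = img2.shape
        mean = img2.mean(dim=(1, 2), keepdim=True)
        num_boxes = torch.randint(1, max_boxes + 1, (1,)).item()
        for _ in range(num_boxes):
            h = torch.randint(min_size, max_size, (1,)).item()
            w = torch.randint(min_size, max_size, (1,)).item()
            y = torch.randint(0, H - h, (1,)).item()
            x = torch.randint(0, W - w, (1,)).item()
            img2[:, y:y + h, x:x + w] = mean
    return img2
```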
4.1 Sintel
We train our model using the FlyingChairs → FlyingThings schedule and then evaluate on the Sintel dataset, using the train split for validation. Results are shown in Table 1 and Figure 3, and we split results based on the data used for training. C+T means that the models are trained on FlyingChairs (C) and FlyingThings (T), while C+T+S indicates that the model is finetuned on Sintel (S).
Table 1: Results on Sintel (average end-point-error).

| Training Data | Method | Sintel (train) Clean | Sintel (train) Final | Sintel (test) Clean | Sintel (test) Final |
|---|---|---|---|---|---|
| - | FlowFields [7] | - | - | 3.75 | 5.81 |
| - | FlowFields++ [35] | - | - | 2.94 | 5.49 |
| - | DCFlow [42] | - | - | 3.54 | 5.12 |
| - | TVNet [14] | 7.45 | 8.59 | - | - |
| - | MRFlow [41] | 1.83 | 3.59 | 2.53 | 5.38 |
| C+T | HD3 [45] | 3.84 | 8.77 | - | - |
| C+T | LiteFlowNet [20] | 2.48 | 4.04 | - | - |
| C+T | PWC-Net [37] | 2.55 | 3.93 | - | - |
| C+T | LiteFlowNet2 [21] | 2.24 | 3.78 | - | - |
| C+T | VCN [44] | 2.21 | 3.68 | - | - |
| C+T | FlowNet2 [23] | 2.02 | 3.54 | 3.96 | 6.02 |
| C+T | Ours (small) | 2.21 | 3.35 | - | - |
| C+T | Ours | 1.63 | 2.83 | - | - |
| C+T+S | FlowNet2 [23] | - | - | 4.16 | 5.74 |
| C+T+S | LiteFlowNet2 [21] | - | - | 3.45 | 4.90 |
| C+T+S | HD3 [45] | - | - | 4.79 | 4.67 |
| C+T+S | PWC-Net+ [36] | - | - | 3.45 | 4.60 |
| C+T+S | IRR-PWC [22] | - | - | 3.84 | 4.58 |
| C+T+S | VCN [44] | - | - | 2.81 | 4.40 |
| C+T+S | SelFlow [26] | - | - | 3.74 | 4.26 |
| C+T+S | Ours | - | - | 2.77 | 3.61 |
| C+T+S | Ours (warm-start) | - | - | 2.42 | 3.39 |
When using C+T for training, our method outperforms all existing approaches, despite using a significantly shorter training schedule. Our method achieves an average EPE (end-point-error) of 1.63 on the Sintel (train) clean pass, which is a 20% lower error than FlowNet2 and 36% lower than PWC-Net. These results demonstrate good cross-dataset generalization. One reason for the better generalization is the structure of our network. By constraining optical flow to be the product of a series of identical update steps, we force the network to learn an update operator which mimics the updates of a first-order descent algorithm. This constrains the search space, reduces the risk of over-fitting, and leads to faster training and better generalization.
When evaluating on the Sintel (test) set, we finetune on the combined clean and final passes of the training set. Our method ranks 1st on both the Sintel clean and final passes, and outperforms SelFlow [26], the best performing prior work, by 0.87 pixels (3.39 versus 4.26). We evaluate two versions of our model: Ours uses zero initialization, while Ours (warm-start) initializes flow by forward projecting the flow estimate from the previous frame. Since our method operates at a single resolution, we can initialize the flow estimate to exploit motion smoothness from past frames, which cannot easily be done with coarse-to-fine models.
(Figure: qualitative comparison; columns show Image 1, VCN, Ours.)
4.2 KITTI
We also evaluate RAFT on KITTI and provide results in Table 2 and Figure 4. We first evaluate cross-dataset generalization by evaluating on the KITTI-15 (train) split after training on Chairs(C) and FlyingThings(T). Our method outperforms prior works by a large margin, improving EPE (end-point-error) from 8.36 to 5.54, which shows that the underlying structure of our network facilitates generalization. This property is important for applying our method in circumstances where it is difficult to collect training data.
Table 2: Results on KITTI-15 (F1-epe: end-point-error in pixels; F1-all: percentage of outliers).

| Training Data | Method | KITTI-15 (train) F1-epe | KITTI-15 (train) F1-all | KITTI-15 (test) F1-all |
|---|---|---|---|---|
| - | FlowFields [7] | 8.33 | 24.4 | 15.31 |
| - | DCFlow [42] | - | - | 14.86 |
| - | MRFlow [41] | - | - | 12.19 |
| C+T | HD3 [45] | 13.17 | 24.0 | - |
| C+T | LiteFlowNet [20] | 10.39 | 28.5 | - |
| C+T | PWC-Net [37] | 10.35 | 33.7 | - |
| C+T | FlowNet2 [23] | 10.08 | 30.0 | - |
| C+T | LiteFlowNet2 [21] | 8.97 | 25.9 | - |
| C+T | VCN [44] | 8.36 | 25.1 | - |
| C+T | Ours (small) | 7.51 | 26.9 | - |
| C+T | Ours | 5.54 | 19.8 | - |
| C+T+K | FlowNet2 [23] | - | - | 11.48 |
| C+T+K | LiteFlowNet2 [21] | - | - | 7.74 |
| C+T+K | PWC-Net+ [36] | - | - | 7.72 |
| C+T+K | IRR-PWC [22] | - | - | 7.65 |
| C+T+K | HD3 [45] | - | - | 6.55 |
| C+T+K | VCN [44] | - | - | 6.30 |
| C+T+K | Ours | - | - | 6.30 |
4.3 Ablations
We perform a set of ablation experiments to show the relative importance of each component. All ablated versions are trained on FlyingChairs (C) + FlyingThings (T). Results of the ablations are shown in Table 3. Each section of the table tests a specific component of our approach in isolation; the settings used in our final model are marked in bold. Below we describe each of the experiments in more detail.
Number of Iterations: Although we unroll 12 iterations during training, we can apply an arbitrary number of iterations during inference. In Figure 5 (left), we plot EPE as a function of the number of iterations. Our method quickly converges, surpassing PWC-Net after 3 iterations and FlowNet2 after 6 iterations, but continues to improve with more iterations. Figure 5 (right) shows the magnitude of each subsequent update, $\|\Delta\mathbf{f}\|$. In Table 3 we provide numerical results for selected numbers of iterations, and test an extreme case of 1000 iterations to show that our method does not diverge.
Table 3: Ablation experiments. Settings used in our final model are marked in bold.

| Experiment | Method | Sintel (train) Clean | Sintel (train) Final | KITTI-15 (train) F1-epe | KITTI-15 (train) F1-all | Parameters |
|---|---|---|---|---|---|---|
| Inference Iter. | 1 | 4.57 | 5.97 | 18.10 | 46.3 | 4.8M |
| Inference Iter. | 3 | 2.47 | 3.86 | 10.00 | 29.8 | 4.8M |
| Inference Iter. | 8 | 1.87 | 3.00 | 6.32 | 21.8 | 4.8M |
| Inference Iter. | 32 | 1.71 | 2.83 | 5.54 | 19.8 | 4.8M |
| Inference Iter. | 100 | 1.64 | 2.86 | 5.73 | 20.1 | 4.8M |
| Inference Iter. | 1000 | 1.66 | 2.87 | 5.80 | 20.2 | 4.8M |
| Update Op. | **ConvGRU** | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Update Op. | Conv | 2.04 | 3.21 | 7.66 | 26.1 | 4.1M |
| Tying | **Tied Weights** | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Tying | Untied Weights | 1.96 | 3.20 | 7.64 | 24.1 | 32.5M |
| Context | **Context** | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Context | No Context | 1.93 | 3.06 | 6.25 | 23.1 | 3.3M |
| Feature Scale | **Single-Scale** | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Feature Scale | Multi-Scale | 2.08 | 3.12 | 6.91 | 23.2 | 6.6M |
| Lookup Radius | 0 | 3.41 | 4.53 | 23.6 | 44.8 | 4.7M |
| Lookup Radius | 1 | 1.80 | 2.99 | 6.27 | 21.5 | 4.7M |
| Lookup Radius | 2 | 1.78 | 2.82 | 5.84 | 21.1 | 4.8M |
| Lookup Radius | **4** | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Correlation Pooling | No | 1.95 | 3.02 | 6.07 | 23.2 | 4.7M |
| Correlation Pooling | **Yes** | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Correlation Range | 32px | 2.91 | 4.48 | 10.4 | 28.8 | 4.8M |
| Correlation Range | 64px | 2.06 | 3.16 | 6.24 | 20.9 | 4.8M |
| Correlation Range | 128px | 1.64 | 2.81 | 6.00 | 19.9 | 4.8M |
| Correlation Range | **All-Pairs** | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Features for Refinement | **Correlation** | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Features for Refinement | Warping | 2.27 | 3.73 | 11.83 | 32.1 | 2.8M |
Architecture of Update Operator: We use a gated activation unit based on the GRU cell. We experiment with replacing the convolutional GRU with a set of 3 convolutional layers with ReLU activation. We achieve better performance by using the GRU block, likely because the gated activation makes it easier for the sequence of flow estimates to converge.
Weight Tying: By default, we tie the weights across all instances of the update operator. Here, we test a version of our approach where each update operator learns a separate set of weights. Accuracy is better when the weights are tied, and the parameter count is significantly lower. This suggests that weight tying acts as a useful constraint on the network.
Context: We test the importance of context by training a model with the context network removed. Without context, we still achieve good results, outperforming all existing works on both Sintel and KITTI. But context is helpful. Directly injecting image features into the update operator likely allows spatial information to be better aggregated within motion boundaries.
Feature Scale: By default, we extract features at a single resolution. We also try extracting features at multiple resolutions by building a correlation volume at each scale separately. Using single-resolution features simplifies the network architecture and allows fine-grained matching even at large displacements.
Lookup Radius: The lookup radius specifies the dimensions of the grid used in the lookup operation. When a radius of 0 is used, the correlation volume is retrieved at a single point. Surprisingly, we can still get a rough estimate of flow when the radius is 0, which means the network is learning to use zeroth-order information. However, we see better results as the radius is increased.
Correlation Pooling: We output features at a single resolution and then perform pooling to generate multiscale volumes. Here we test the impact when this pooling is removed. Results are better with pooling, because large and small displacements are both captured.
Correlation Range: Instead of all-pairs correlation, we also try constructing the correlation volume only for a local neighborhood around each pixel. We try a range of 32 pixels, 64 pixels, and 128 pixels. Overall we get the best results when the all-pairs are used, although a 128px range is sufficient to perform well on Sintel because most displacements fall within this range. That said, all-pairs is still preferable because it eliminates the need to specify a range. It is also more convenient to implement: it can be computed using matrix multiplication allowing our approach to be implemented entirely in PyTorch.
Features for Refinement: We compute visual similarity by building a correlation volume between all pairs of pixels. In this experiment, we try replacing the correlation volume with a warping layer, which uses the current estimate of optical flow to warp features from $I_2$ onto $I_1$ and then estimates the residual displacement. While warping is still competitive with prior work on Sintel, correlation performs significantly better, especially on KITTI.
4.4 Timing and Parameter Counts
Inference time and parameter counts are shown in Figure 6. Accuracy is measured on the Sintel (train) final pass after training on FlyingChairs and FlyingThings (C+T). In these plots, we report accuracy and timing after 12 iterations, and we time our method on a GTX 1080Ti GPU. Parameter counts for other methods are taken as reported in their papers, and we report times when run on our hardware. RAFT is more efficient in terms of parameter count, inference time, and training iterations. The context and feature encoders use 1.05M parameters each. The update operator uses 1.5M parameters, and the remaining 1.2M parameters are used to process correlation features and predict flow. Ours-S uses only 1M parameters but outperforms PWC-Net and VCN, which are more than 6x larger. We provide an additional table with numerical values for parameters, timing, and training iterations in the supplemental material.
4.5 Video of Very High Resolution
To demonstrate that our method scales well to videos of very high resolution, we apply our network to HD video from the DAVIS [32] dataset. We use 1080p (1088x1920) resolution video and apply 12 iterations of our approach. Inference takes 550ms for 12 iterations on 1080p video, with all-pairs correlation taking 95ms. Fig. 7 visualizes example results on DAVIS.
5 Conclusions
We have proposed RAFT (Recurrent All-Pairs Field Transforms), a new end-to-end trainable model for optical flow. RAFT is unique in that it operates at a single resolution using a large number of lightweight, recurrent update operators. Our method achieves state-of-the-art accuracy across a diverse range of datasets, exhibits strong cross-dataset generalization, and is efficient in terms of inference time, parameter count, and training iterations.
Acknowledgments: This work was partially funded by the National Science Foundation under Grant No. 1617767.
References
- [1] (2017) Solving ill-posed inverse problems using iterative deep neural networks. Inverse Problems 33 (12), pp. 124007.
- [2] (2018) Learned primal-dual reconstruction. IEEE Transactions on Medical Imaging 37 (6), pp. 1322–1332.
- [3] (2019) Differentiable convex optimization layers. In Advances in Neural Information Processing Systems, pp. 9558–9570.
- [4] (2017) OptNet: differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 136–145.
- [5] (2018) Trellis networks for sequence modeling. arXiv preprint arXiv:1810.06682.
- [6] (2019) Deep equilibrium models. In Advances in Neural Information Processing Systems, pp. 688–699.
- [7] (2015) Flow fields: dense correspondence fields for highly accurate large displacement optical flow estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4015–4023.
- [8] (2009) Large displacement optical flow. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 41–48.
- [9] (2012) A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision, pp. 611–625.
- [10] (2011) A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision 40 (1), pp. 120–145.
- [11] (2016) Full flow: optical flow estimation by global optimization over regular grids. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4706–4714.
- [12] (2014) On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
- [13] (2015) FlowNet: learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2758–2766.
- [14] (2018) End-to-end learning of motion representation for video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6016–6025.
- [15] (2019) DeepView: high-quality view synthesis by learned gradient descent.
- [16] (2013) Vision meets robotics: the KITTI dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237.
- [17] (2007) Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (2), pp. 328–341.
- [18] (2019) The five elements of flow. arXiv preprint arXiv:1912.10739.
- [19] (1981) Determining optical flow. In Techniques and Applications of Image Understanding, Vol. 281, pp. 319–331.
- [20] (2018) LiteFlowNet: a lightweight convolutional neural network for optical flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8981–8989.
- [21] (2019) A lightweight optical flow CNN–revisiting data fidelity and regularization. arXiv preprint arXiv:1903.07414.
- [22] (2019) Iterative residual refinement for joint optical flow and occlusion estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5754–5763.
- [23] (2017) FlowNet 2.0: evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2462–2470.
- [24] (2017) Variational networks: connecting variational methods and deep learning. In German Conference on Pattern Recognition, pp. 281–293.
- [25] (2018) Learning for disparity estimation through feature constancy. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2811–2820.
- [26] (2019) SelFlow: self-supervised learning of optical flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4571–4580.
- [27] (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- [28] (2019) Taking a deeper look at the inverse compositional algorithm. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4581–4590.
- [29] (2016) A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4040–4048.
- [30] (2015) Discrete optimization for optical flow. In German Conference on Pattern Recognition, pp. 16–28.
- [31] (2017) Automatic differentiation in PyTorch.
- [32] (2017) The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675.
- [33] (2014) Non-local total generalized variation for optical flow estimation. In European Conference on Computer Vision, pp. 439–454.
- [34] (2017) Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4161–4170.
- [35] (2018) FlowFields++: accurate optical flow correspondences meet robust interpolation. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 1463–1467.
- [36] (2018) Models matter, so does training: an empirical study of CNNs for optical flow estimation. arXiv preprint arXiv:1809.05571.
- [37] (2018) PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8934–8943.
- [38] (2018) BA-Net: dense bundle adjustment network. arXiv preprint arXiv:1806.04807.
- [39] (2018) DeepV2D: video to depth with differentiable structure from motion. arXiv preprint arXiv:1812.04605.
- [40] (2013) DeepFlow: large displacement optical flow with deep matching. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1385–1392.
- [41] (2017) Optical flow in mostly rigid scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4671–4680.
- [42] (2017) Accurate optical flow via direct cost volume processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1289–1297.
- [43] (2019) Hierarchical deep stereo matching on high-resolution images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5515–5524.
- [44] (2019) Volumetric correspondence networks for optical flow. In Advances in Neural Information Processing Systems, pp. 793–803.
- [45] (2019) Hierarchical discrete distribution decomposition for match density estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6044–6053.
- [46] (2007) A duality based approach for realtime TV-L1 optical flow. In Joint Pattern Recognition Symposium, pp. 214–223.
- [47] (2018) DeepTAM: deep tracking and mapping. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 822–838.
Appendix 0.A Appendix
0.A.1 Network Architecture
Network architecture details for the full 4.8M parameter model and the small 1.0M parameter model. The context and feature encoders have the same architecture; the only difference is that the feature encoder uses instance normalization while the context encoder uses batch normalization. In RAFT-S, we replace the residual units with bottleneck residual units. The update block takes in context features, correlation features, and flow features to update the latent hidden state. The updated hidden state is used to predict the flow update. The full model uses two convolutional GRU update blocks with 1x5 filters and 5x1 filters respectively, while the small model uses a single GRU with 3x3 filters.
0.A.2 Timing, Parameters, and Training Iterations
Table 4: Parameter counts, inference time, and training iterations. Accuracy is EPE on the Sintel (train) final pass after C+T training (cf. Sec. 4.4); times marked 1080Ti were measured on our hardware.

Method | Parameters (M) | Time (Reported) | Time (1080Ti) | Training Iter. (#GPUs) | Accuracy
---|---|---|---|---|---|
LiteFlowNetX[20] | 0.9M | 0.03s | - | 2000k | 4.79 |
LiteFlowNet[20] | 5.4M | 0.09s | 0.09s | 2000k | 4.04 |
IRR-PWC[22] | 6.4M | - | 0.20s | 850k | 3.95 |
PWCNet+[36] | 9.4M | 0.03s | 0.04s | 1700k | 3.93 |
VCN[44] | 6.2M | 0.18s | 0.26s | 220k(4) | 3.63 |
FlowNet2[23] | 162M | 0.12s | 0.11s | 7000k | 3.54 |
Ours (small) | 1.0M | - | 0.05s | 160k(2) | 3.37 |
Ours | 4.8M | - | 0.11s | 160k(2) | 2.87 |