1 Introduction
Optical flow is the task of estimating per-pixel motion between video frames. It is a long-standing vision problem that remains unsolved. The best systems are limited by difficulties including fast-moving objects, occlusions, motion blur, and textureless surfaces.
Optical flow has traditionally been approached as a hand-crafted optimization problem over the space of dense displacement fields between a pair of images [19, 46, 11]. Generally, the optimization objective defines a trade-off between a data term, which encourages the alignment of visually similar image regions, and a regularization term, which imposes priors on the plausibility of motion. Such an approach has achieved considerable success, but further progress has appeared challenging because it is difficult to hand-design an optimization objective that is robust to a variety of corner cases.
Recently, deep learning has emerged as a promising alternative to traditional methods. Deep learning can sidestep formulating an optimization problem and train a network to directly predict flow. Current deep learning methods [23, 37, 20, 44, 18] have achieved performance comparable to the best traditional methods while being significantly faster at inference time. A key question for further research is how to design effective architectures that perform better, train more easily, and generalize well to novel scenes.
We introduce Recurrent All-Pairs Field Transforms (RAFT), a new deep network architecture for optical flow. RAFT enjoys the following strengths:

State-of-the-art accuracy: RAFT achieves large improvements over existing methods. On Sintel [9] (final pass), a challenging benchmark covering diverse motions, RAFT obtains an end-point-error of 3.39 pixels, a 20% error reduction from the best published result (4.26 pixels).

Strong generalization: When trained only on synthetic data, RAFT achieves an end-point-error of 5.54 pixels on KITTI [16], a 34% error reduction from the best prior deep network trained on the same data (8.36 pixels).

High efficiency: RAFT processes videos at 9 frames per second on a 1080Ti GPU. It trains with 10X fewer iterations than other architectures. A smaller version of RAFT with 1/5 of the parameters runs at 20 frames per second while still outperforming all prior methods on Sintel.
RAFT consists of three main components: (1) a feature encoder that extracts a feature vector for each pixel; (2) a correlation layer that produces a 4D correlation volume for all pairs of pixels, with subsequent pooling to produce lower-resolution volumes; (3) a recurrent GRU-based update operator that retrieves values from the correlation volumes and iteratively updates a flow field initialized at zero. Fig. 1 illustrates the design of RAFT.
The RAFT architecture is motivated by traditional optimization-based approaches. The feature encoder extracts per-pixel features. The correlation layer computes visual similarity between pixels. The update operator mimics the steps of an iterative optimization algorithm. But unlike traditional approaches, features and motion priors are not hand-crafted: they are learned by the feature encoder and the update operator, respectively.
The design of RAFT draws inspiration from many existing works but is substantially novel. First, RAFT operates entirely at high resolution. It maintains and updates a single high-resolution flow field, with zero upsampling operations during inference. This is different from the prevailing coarse-to-fine design in prior work [37, 44, 20, 21, 45], where flow is first estimated at low resolution and then upsampled and refined at high resolution. By operating entirely at high resolution, RAFT overcomes several limitations of a coarse-to-fine cascade: the difficulty of recovering from errors at coarse resolutions, the tendency to miss small fast-moving objects, and the many training iterations (often over 1M) typically required for training a multi-stage cascade.
Second, the update operator of RAFT is recurrent and lightweight. Many recent works [22, 37, 44, 20, 23] have included some form of iterative refinement, but most do not tie the weights across iterations [37, 44, 20] and are therefore limited to a fixed number of iterations. To our knowledge, IRR [22] is the only deep learning approach that is recurrent. It uses FlowNetS [13] or PWC-Net [37] as its recurrent unit. When using FlowNetS, it is limited by the size of the network (38M parameters) and is only applied up to 5 iterations. When using PWC-Net, iterations are limited by the number of pyramid levels. In contrast, our update operator has only 2.7M parameters and can be applied 100+ times during inference without divergence.
Third, the update operator has a novel design, which consists of a convolutional GRU that performs lookups on 4D multi-scale correlation volumes; in contrast, refinement modules in prior work typically use only plain convolution or correlation layers.
2 Related Work
Optical Flow as Energy Minimization: Optical flow has traditionally been treated as an energy minimization problem which imposes a trade-off between a data term and a regularization term. Horn and Schunck [19] formulated optical flow as a continuous optimization problem using a variational framework, and were able to estimate a dense flow field by performing gradient steps. TV-L1 [46] replaced the quadratic penalties with an L1 data term and total variation regularization, which allowed for motion discontinuities and was better equipped to handle outliers. Improvements have been made by defining better matching costs [40, 8] and regularization terms [33].
Such continuous formulations maintain a single estimate of optical flow which is refined at each iteration. To ensure a smooth objective function, a first-order Taylor approximation is used to model the data term. As a result, they only work well for small displacements. To handle large displacements, the coarse-to-fine strategy is used, where an image pyramid is used to estimate large displacements at low resolution, and small displacements are then refined at high resolution. But this coarse-to-fine strategy may miss small fast-moving objects and have difficulty recovering from early mistakes. Like continuous methods, we maintain a single estimate of optical flow which is refined with each iteration. However, since we build correlation volumes for all pairs at both high resolution and low resolution, each local update uses information about both small and large displacements. Instead of using a subpixel Taylor approximation of the data term, our update operator learns to propose the descent direction.
More recently, optical flow has also been approached as a discrete optimization problem [30, 11, 42] using a global objective. One challenge of this approach is the massive size of the search space, as each pixel can be reasonably paired with thousands of points in the other frame. Menze et al. [30] pruned the search space using feature descriptors and approximated the global MAP estimate using message passing. Chen et al. [11] showed that by using the distance transform, solving the global optimization problem over the full space of flow fields is tractable. DCFlow [42] showed further improvements in results by using a neural network as a feature descriptor, and constructed a 4D cost volume over all pairs of features. The 4D cost volume was then processed using the Semi-Global Matching (SGM) algorithm [17]. Like DCFlow, we also construct 4D cost volumes over learned features. However, instead of processing the cost volumes using SGM, we use a neural network to estimate flow. Our approach is end-to-end differentiable, meaning the feature encoder can be trained with the rest of the network to directly minimize the error of the final flow estimate. In contrast, DCFlow requires its network to be trained using an embedding loss between pixels; it cannot be trained directly on optical flow because its cost volume processing is not differentiable.
Direct Flow Prediction: Neural networks have been trained to directly predict optical flow between a pair of frames, sidestepping the optimization problem completely. Coarse-to-fine processing has emerged as a popular ingredient in many recent works [37, 45, 20, 21, 22, 44, 18]. A defining feature of coarse-to-fine processing is the use of upsampling operations, which allow coarse estimates to be refined at higher resolutions. In contrast, our method maintains and updates a single high-resolution flow field and does not use any upsampling operations during inference.
Iterative Refinement for Optical Flow: Many recent works have used iterative refinement to improve results on optical flow [23, 34, 37, 20, 44] and related tasks [25, 47, 39]. Ilg et al. [23] applied iterative refinement to optical flow by stacking multiple FlowNetS and FlowNetC modules in series. SpyNet [34], PWC-Net [37], LiteFlowNet [20], and VCN [44] apply iterative refinement using coarse-to-fine pyramids. The main difference between these approaches and ours is that they do not share weights between iterations.
More closely related to our approach is IRR [22], which builds on the FlowNetS and PWC-Net architectures but shares weights between refinement networks. When using FlowNetS, it is limited by the size of the network (38M parameters) and is only applied up to 5 iterations. When using PWC-Net, iterations are limited by the number of pyramid levels. In contrast, we use a much simpler refinement module (2.7M parameters) which can be applied for 100+ iterations during inference without divergence.
Our method also has ties to TrellisNet [5] and Deep Equilibrium Models (DEQ) [6], which both use depth-tied weights over a large number of layers. TrellisNet and DEQ were designed for sequence modeling tasks, but we adopt the core idea of using a large number of weight-tied units. Our update operator uses a modified GRU block [12], which is similar to the LSTM block used in TrellisNet. We found that this structure allows our update operator to more easily converge to a fixed point.
Learning to Optimize: Many problems in vision can be formulated as an optimization problem. This has motivated several works to embed optimization problems into network architectures [4, 3, 38, 28, 39]. These works typically use a network to predict the inputs or parameters of the optimization problem, and then train the network weights by backpropagating the gradient through the solver, either implicitly [4, 3] or by unrolling each step [28, 38]. However, this technique is limited to problems with an objective that can be easily defined.
Another approach is to learn iterative updates directly from data [1, 2]. These approaches are motivated by the fact that first-order optimizers such as Primal Dual Hybrid Gradient (PDHG) [10] can be expressed as a sequence of iterative update steps. Instead of using an optimizer directly, Adler et al. [1] proposed building a network which mimics the updates of a first-order algorithm. This approach has been applied to inverse problems such as image denoising [24], tomographic reconstruction [2], and novel view synthesis [15]. TVNet [14] implemented the TV-L1 algorithm as a computation graph, which enabled training of the TV-L1 parameters. However, TVNet operates directly on intensity gradients instead of learned features, which limits the achievable accuracy on challenging datasets such as Sintel.
Our approach can be viewed as learning to optimize: our network uses a large number of update blocks to emulate the steps of a first-order optimization algorithm. However, unlike prior work, we never explicitly define a gradient with respect to some optimization objective. Instead, our network retrieves features from correlation volumes to propose the descent direction.
3 Approach
Given a pair of consecutive RGB images, $I_1$, $I_2$, we estimate a dense displacement field $(f^1, f^2)$ which maps each pixel $(u, v)$ in $I_1$ to its corresponding coordinates $(u', v') = (u + f^1(u, v), v + f^2(u, v))$ in $I_2$. An overview of our approach is given in Figure 1. Our method can be distilled down to three stages: (1) feature extraction, (2) computing visual similarity, and (3) iterative updates, where all stages are differentiable and composed into an end-to-end trainable architecture.
3.1 Feature Extraction
Features are extracted from the input images using a convolutional network. The feature encoder network $g_\theta$ is applied to both $I_1$ and $I_2$ and maps the input images to dense feature maps at a lower resolution. Our encoder outputs features at 1/8 resolution, $g_\theta : \mathbb{R}^{H \times W \times 3} \to \mathbb{R}^{H/8 \times W/8 \times D}$, where we set $D = 256$. The feature encoder consists of 6 residual blocks, 2 at 1/2 resolution, 2 at 1/4 resolution, and 2 at 1/8 resolution (more details in the supplemental material).
We additionally use a context network. The context network extracts features only from the first input image $I_1$. The architecture of the context network, $h_\theta$, is identical to the feature extraction network. Together, the feature network and the context network form the first stage of our approach, which only needs to be performed once.
3.2 Computing Visual Similarity
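We compute visual similarity by constructing a full correlation volume between all pairs. Given image features $g_\theta(I_1) \in \mathbb{R}^{H \times W \times D}$ and $g_\theta(I_2) \in \mathbb{R}^{H \times W \times D}$ (with $H \times W$ here denoting the feature-map resolution), the correlation volume is formed by taking the dot product between all pairs of feature vectors. The correlation volume, $\mathbf{C}$, can be efficiently computed as a single matrix multiplication:
$$\mathbf{C}(g_\theta(I_1), g_\theta(I_2)) \in \mathbb{R}^{H \times W \times H \times W}, \qquad C_{ijkl} = \sum_h g_\theta(I_1)_{ijh} \cdot g_\theta(I_2)_{klh} \tag{1}$$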
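The matrix-multiplication form of the all-pairs correlation can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `all_pairs_correlation`, the channel-last layout, and the toy sizes are assumptions made for the example.

```python
import numpy as np

def all_pairs_correlation(fmap1, fmap2):
    """All-pairs dot products between two feature maps of shape (H, W, D).

    Returns a 4D volume of shape (H, W, H, W) whose entry [i, j, k, l] is the
    dot product of fmap1[i, j] and fmap2[k, l].
    """
    H, W, D = fmap1.shape
    a = fmap1.reshape(H * W, D)   # one row per pixel of image 1
    b = fmap2.reshape(H * W, D)   # one row per pixel of image 2
    corr = a @ b.T                # single (HW x D) @ (D x HW) matrix product
    return corr.reshape(H, W, H, W)

# Toy example with hypothetical sizes (real feature maps are much larger).
rng = np.random.default_rng(0)
g1 = rng.standard_normal((4, 5, 8))
g2 = rng.standard_normal((4, 5, 8))
C = all_pairs_correlation(g1, g2)
print(C.shape)
```

The memory cost grows with the square of the number of feature-map pixels, which is why the volume is built on the reduced-resolution features rather than on the input images.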
Correlation Pyramid: We construct a 4-layer pyramid $\{C^0, C^1, C^2, C^3\}$ by pooling the last two dimensions of the correlation volume with kernel sizes 1, 2, 4, and 8 and equivalent stride (Figure 2). Thus, volume $C^k$, pooled with kernel size $2^k$, has dimensions $H \times W \times H/2^k \times W/2^k$. The set of volumes gives information about both large and small displacements; however, by maintaining the first 2 dimensions (the $I_1$ dimensions) we maintain high-resolution information, allowing our method to recover the motions of small fast-moving objects.
Correlation Lookup: We define a lookup operator $L_C$ which generates a feature map by indexing from the correlation pyramid. Given a current estimate of optical flow $(f^1, f^2)$, we map each pixel $x = (u, v)$ in $I_1$ to its estimated correspondence in $I_2$: $x' = (u + f^1(u, v), v + f^2(u, v))$. We then define a local grid around $x'$
$$\mathcal{N}(x')_r = \{ x' + dx \mid dx \in \mathbb{Z}^2, \ \lVert dx \rVert_1 \le r \} \tag{2}$$
as the set of integer offsets which are within a radius of $r$ units of $x'$ using the L1 distance. We use the local neighborhood $\mathcal{N}(x')_r$ to index from the correlation volume. Since $\mathcal{N}(x')_r$ is a grid of real numbers, we use bilinear sampling.
We perform lookups on all levels of the pyramid, such that the correlation volume at level $k$, $C^k$, is indexed using the grid $\mathcal{N}(x'/2^k)_r$. A constant radius across levels means larger context at lower levels: for the lowest level, which is pooled with kernel size 8 on top of the 1/8-resolution features, using a radius of 4 corresponds to a range of 256 pixels at the original resolution. The values from each level are then concatenated into a single feature map.
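The pooling and lookup steps described above can be sketched in NumPy. This is a slow reference illustration under stated assumptions, not the authors' code: the helper names (`build_pyramid`, `l1_offsets`, `bilinear_sample`, `lookup`), the (row, column) flow convention, and zero padding outside the volume are all choices made for the example.

```python
import numpy as np

def build_pyramid(corr, num_levels=4):
    """Average-pool the last two axes of corr (H, W, H2, W2) with kernel = stride = 1, 2, 4, 8."""
    H, W, H2, W2 = corr.shape
    pyramid = []
    for level in range(num_levels):
        k = 2 ** level
        h, w = H2 // k, W2 // k
        blocks = corr[:, :, :h * k, :w * k].reshape(H, W, h, k, w, k)
        pyramid.append(blocks.mean(axis=(3, 5)))
    return pyramid

def l1_offsets(radius):
    """Integer offsets (dy, dx) whose L1 norm is at most radius."""
    return [(dy, dx)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if abs(dy) + abs(dx) <= radius]

def bilinear_sample(img, y, x):
    """Bilinearly sample a 2D array at real-valued (y, x); out-of-range taps count as zero."""
    h, w = img.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    value = 0.0
    for yy, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                value += wy * wx * img[yy, xx]
    return value

def lookup(pyramid, flow, radius=4):
    """Gather correlation features around each pixel's current correspondence.

    flow has shape (H, W, 2) and stores (row, column) displacements (assumed convention).
    """
    H, W = flow.shape[:2]
    offsets = l1_offsets(radius)
    out = np.zeros((H, W, len(pyramid) * len(offsets)))
    for i in range(H):
        for j in range(W):
            yc, xc = i + flow[i, j, 0], j + flow[i, j, 1]  # estimated correspondence
            feats = []
            for level, corr in enumerate(pyramid):
                s = 2 ** level                             # coordinates shrink with pooling
                for dy, dx in offsets:
                    feats.append(bilinear_sample(corr[i, j], yc / s + dy, xc / s + dx))
            out[i, j] = feats
    return out

rng = np.random.default_rng(0)
corr = rng.standard_normal((8, 8, 8, 8))          # toy correlation volume
pyramid = build_pyramid(corr)
features = lookup(pyramid, np.zeros((8, 8, 2)))   # zero-initialized flow
print(features.shape)
```

A practical implementation would vectorize the sampling on the GPU; the explicit loops here are only meant to make the indexing visible.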
An important point here is that we build the grid $\mathcal{N}(x')_r$ directly in the coordinate system defined by $I_2$. Previous work has used warping operations followed by local correlation [37, 22, 44], so the local search is actually being performed on a warped coordinate system; in contrast, our approach avoids warping. While this difference is subtle, it is important for subpixel accuracy, particularly near motion boundaries where warping will change the local geometry.
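Efficient Computation for High Resolution Images: The all-pairs correlation scales as $O(N^2)$, where $N$ is the number of pixels, but only needs to be computed once and is constant in the number of iterations $M$. However, there exists an equivalent implementation of our approach which scales as $O(NM)$, exploiting the linearity of the inner product and average pooling. Consider the cost volume at level $m$ (pooled with kernel size $2^m \times 2^m$), $C^m_{ijkl}$, and feature maps $g^{(1)} = g_\theta(I_1)$, $g^{(2)} = g_\theta(I_2)$:
$$C^m_{ijkl} = \frac{1}{2^{2m}} \sum_{p=0}^{2^m - 1} \sum_{q=0}^{2^m - 1} \left\langle g^{(1)}_{i,j},\, g^{(2)}_{2^m k + p,\, 2^m l + q} \right\rangle = \left\langle g^{(1)}_{i,j},\, \frac{1}{2^{2m}} \sum_{p=0}^{2^m - 1} \sum_{q=0}^{2^m - 1} g^{(2)}_{2^m k + p,\, 2^m l + q} \right\rangle$$
which is the average over the correlation response in the $2^m \times 2^m$ grid. This means that the value at $C^m_{ijkl}$ can be computed as the inner product between the feature vector $g_\theta(I_1)_{ij}$ and $g_\theta(I_2)$ pooled with kernel size $2^m \times 2^m$.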
In this alternative implementation, we do not precompute the correlations, but instead precompute the pooled image feature maps. In each iteration, we compute each correlation value on demand, only when it is looked up. This gives a complexity of $O(NM)$.
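The equivalence this alternative relies on (pooling the correlation volume gives the same values as correlating against pooled features) can be checked numerically. The sketch below uses NumPy with hypothetical toy sizes; the helper `avg_pool_hw` is introduced only for this illustration.

```python
import numpy as np

def avg_pool_hw(x, k):
    """Average-pool the first two axes of x (H, W, ...) with kernel = stride = k."""
    H, W = x.shape[:2]
    h, w = H // k, W // k
    return x[:h * k, :w * k].reshape(h, k, w, k, *x.shape[2:]).mean(axis=(1, 3))

rng = np.random.default_rng(0)
H, W, D, m = 8, 8, 16, 2          # toy sizes; pooling kernel is 2**m
k = 2 ** m
g1 = rng.standard_normal((H, W, D))
g2 = rng.standard_normal((H, W, D))

# (a) Precompute all pairs, then average-pool the last two dimensions.
corr = np.einsum('ijh,klh->ijkl', g1, g2)
pooled_corr = corr.reshape(H, W, H // k, k, W // k, k).mean(axis=(3, 5))

# (b) Pool the second feature map once, then take inner products as needed.
g2_pooled = avg_pool_hw(g2, k)
single_value = g1[3, 4] @ g2_pooled[1, 0]     # one entry computed on demand

print(np.isclose(single_value, pooled_corr[3, 4, 1, 0]))
```

Route (b) never materializes the 4D volume, so its cost depends on how many entries are looked up per iteration rather than on the square of the pixel count.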
We found empirically that precomputing all pairs is easy to implement and not a bottleneck, due to highly optimized matrix routines on GPUs: even for 1088x1920 videos it takes only 17% of total inference time. Note that we can always switch to the alternative implementation should it become a bottleneck.
3.3 Iterative Updates
Our update operator estimates a sequence of flow estimates $\{f_1, \ldots, f_N\}$ from an initial starting point $f_0 = 0$. With each iteration, it produces an update direction $\Delta f$ which is applied to the current estimate: $f_{k+1} = \Delta f + f_k$.
The update operator takes flow, correlation, and a latent hidden state as input, and outputs the update $\Delta f$ and an updated hidden state. The architecture of our update operator is designed to mimic the steps of an optimization algorithm. As such, we tie weights across depth and use bounded activations to encourage convergence to a fixed point. The update operator is trained to perform updates such that the sequence converges to a fixed point $f_k \to f^*$.
Initialization: By default, we initialize the flow field to 0 everywhere, but our iterative approach gives us the flexibility to experiment with alternatives. When applied to video, we test warm-start initialization, where optical flow from the previous pair of frames is forward projected to the next pair of frames, with occlusion gaps filled in using nearest neighbor interpolation.
Inputs: Given the current flow estimate $f_k$, we use it to retrieve correlation features from the correlation pyramid as described in Sec. 3.2. The correlation features are then processed by 2 convolutional layers. Additionally, we apply 2 convolutional layers to the flow estimate itself to generate flow features. Finally, we directly inject the input from the context network. The input feature map is then taken as the concatenation of the correlation, flow, and context features.
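Update: A core component of the update operator is a gated activation unit based on the GRU cell, with fully connected layers replaced with convolutions:
$$z_t = \sigma(\mathrm{Conv}_{3\times3}([h_{t-1}, x_t], W_z)) \tag{3}$$
$$r_t = \sigma(\mathrm{Conv}_{3\times3}([h_{t-1}, x_t], W_r)) \tag{4}$$
$$\tilde{h}_t = \tanh(\mathrm{Conv}_{3\times3}([r_t \odot h_{t-1}, x_t], W_h)) \tag{5}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{6}$$
where $x_t$ is the concatenation of flow, correlation, and context features previously defined. We also experiment with a separable ConvGRU unit, where we replace the $3 \times 3$ convolution with two GRUs: one with a $1 \times 5$ convolution and one with a $5 \times 1$ convolution to increase the receptive field without significantly increasing the size of the model.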
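A convolutional GRU step of the kind described above can be sketched in NumPy. This is an illustrative sketch only: the 3x3 kernel size, the absence of bias terms, the channel counts, and the random weights are assumptions, and the input is held fixed across iterations for simplicity (in the full model it is recomputed from the current flow at every step).

```python
import numpy as np

def conv3x3(x, w):
    """Zero-padded 'same' 3x3 convolution. x: (H, W, Cin), w: (3, 3, Cin, Cout)."""
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, w.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            out += xp[dy:dy + H, dx:dx + W] @ w[dy, dx]
    return out

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conv_gru_step(h, x, Wz, Wr, Wh):
    """One gated update: returns the new hidden state, same shape as h."""
    hx = np.concatenate([h, x], axis=-1)
    z = sigmoid(conv3x3(hx, Wz))                                        # update gate
    r = sigmoid(conv3x3(hx, Wr))                                        # reset gate
    h_tilde = np.tanh(conv3x3(np.concatenate([r * h, x], axis=-1), Wh)) # candidate state
    return (1.0 - z) * h + z * h_tilde

# Hypothetical sizes: 4 hidden channels, 3 input channels, 6x6 feature map.
rng = np.random.default_rng(0)
Ch, Cx = 4, 3
Wz, Wr, Wh = (0.1 * rng.standard_normal((3, 3, Ch + Cx, Ch)) for _ in range(3))
x = rng.standard_normal((6, 6, Cx))
h = np.zeros((6, 6, Ch))
for _ in range(100):          # the same weights are reused at every iteration
    h = conv_gru_step(h, x, Wz, Wr, Wh)
print(h.shape, float(np.abs(h).max()))
```

Because each new state is a gated blend of the previous state and a tanh output, a hidden state that starts inside [-1, 1] stays there however many times the step is applied, which is one way to see why bounded activations help repeated application remain stable.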
Flow Prediction: The hidden state output by the GRU is passed through two convolutional layers to predict the flow update $\Delta f$. The output flow is at 1/8 resolution of the input image. During training and evaluation, we upsample the predicted flow fields to match the resolution of the ground truth.
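As a small illustration of the upsampling step, the sketch below enlarges a 1/8-resolution flow field by a factor of 8. The interpolation method is not stated here, so nearest-neighbor replication is used purely as an assumption for simplicity, and the function name is hypothetical. The one point that holds for any interpolation is that the flow values must also be multiplied by the scale factor, since displacements are measured in pixels of the grid they live on.

```python
import numpy as np

def upsample_flow_nearest(flow, factor=8):
    """Upsample a flow field (H, W, 2) to (factor*H, factor*W, 2).

    Nearest-neighbor replication is an assumption of this sketch.
    """
    up = np.repeat(np.repeat(flow, factor, axis=0), factor, axis=1)
    return up * factor   # a 1-pixel shift at 1/8 resolution is 8 pixels at full resolution

low_res = np.ones((2, 3, 2))
full_res = upsample_flow_nearest(low_res)
print(full_res.shape)
```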
3.4 Supervision
We supervise our network on the $\ell_1$ distance between the predicted and ground truth optical flow over the full sequence of flow estimates, $\{f_1, \ldots, f_N\}$, with exponentially increasing weights. Given ground truth flow $f_{gt}$, the loss function is defined as
$$\mathcal{L} = \sum_{i=1}^{N} \gamma^{N-i} \lVert f_{gt} - f_i \rVert_1 \tag{7}$$
where we set $\gamma = 0.8$ in our experiments.
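The exponentially weighted sequence loss can be sketched as below. This is a minimal NumPy illustration, not the training code: the function name is hypothetical, the default `gamma=0.8` is the assumed decay factor, and averaging the per-pixel L1 distance over the image is a normalization choice made for the example.

```python
import numpy as np

def sequence_loss(flow_preds, flow_gt, gamma=0.8):
    """Weighted sum of L1 errors over a list of flow estimates.

    flow_preds: list of arrays (H, W, 2), ordered from first to last iteration.
    Later estimates receive larger weights: gamma ** (N - i) for i = 1..N.
    """
    n = len(flow_preds)
    total = 0.0
    for i, flow in enumerate(flow_preds, start=1):
        weight = gamma ** (n - i)
        l1 = np.abs(flow_gt - flow).sum(axis=-1).mean()   # mean over pixels (assumption)
        total += weight * l1
    return total

gt = np.zeros((2, 2, 2))
preds = [np.ones((2, 2, 2)) for _ in range(3)]   # each estimate is off by 1 in both channels
loss = sequence_loss(preds, gt)
print(loss)
```

With three estimates the weights are 0.64, 0.8, and 1, so the final estimate dominates the loss while earlier iterations still receive a training signal.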
4 Experiments
We evaluate RAFT on Sintel [9] and KITTI [16]. Following previous works, we pretrain our network on FlyingChairs [13] and FlyingThings [29], followed by dataset-specific finetuning. Our method achieves state-of-the-art performance on both Sintel (both clean and final passes) and KITTI. Additionally, we test our method on 1080p video from the DAVIS dataset [32] to demonstrate that our method scales to videos of very high resolutions.
Implementation Details: RAFT is implemented in PyTorch [31]. All modules are initialized from scratch with random weights. During training, we use the AdamW [27] optimizer with weight decay 0.00005 and clip gradients to the range $[-1, 1]$. Unless otherwise noted, we evaluate after 50 iterations on Sintel and 25 on KITTI.
Gradient Stopping: For every update, $f_{k+1} = f_k + \Delta f$, we only backpropagate the gradient through the $\Delta f$ branch, and zero the gradient through the $f_k$ branch.
Training Schedule: We pretrain on FlyingChairs for 100k iterations with a batch size of 6 and 2 GPUs. We then finetune on the FlyingThings3D dataset for an additional 60k iterations with a batch size of 3 and 2 GPUs. We linearly increase the learning rate for the first 20% of training, then linearly decay to 0. In total, this gives 160k training steps. This is significantly fewer than the 7M steps used to train FlowNet2 and the 1.7M steps used to train PWC-Net. VCN [44] is trained for 220k steps, but with 4 GPUs.
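The warm-up and decay schedule described above can be written as a small function. This is a sketch under assumptions: the peak learning rate is not given in this section, so `peak_lr` below is a placeholder value, and the function name is hypothetical.

```python
def learning_rate(step, total_steps, peak_lr, warmup_frac=0.2):
    """Linear warm-up over the first warmup_frac of training, then linear decay to 0."""
    warmup_steps = warmup_frac * total_steps
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)

# Example: a 100k-iteration stage with a placeholder peak rate of 1e-4.
for s in (0, 10_000, 20_000, 60_000, 100_000):
    print(s, learning_rate(s, 100_000, 1e-4))
```

The rate is zero at the first step, reaches `peak_lr` at the 20% mark, and returns to zero at the final step of the stage.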
We perform dataset-specific finetuning on Sintel [9] and KITTI [16]. We use 60k iterations on Sintel and 40k on KITTI, following the same schedule.
Augmentation: We apply color augmentation by adjusting contrast, saturation, brightness, and hue. We apply spatial augmentation by random resizing and flipping. Following HSM-Net [43], we also randomly erase rectangular regions in $I_2$ with probability 0.5 to simulate occlusions.
4.1 Sintel
We train our model using the FlyingChairs→FlyingThings schedule and then evaluate on the Sintel dataset using the train split for validation. Results are shown in Table 1 and Figure 3, and we split results based on the data used for training. C+T means that the models are trained on FlyingChairs (C) and FlyingThings (T), while C+T+S indicates the model is finetuned on Sintel (S).
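Table 1: Results on Sintel (end-point-error in pixels). A dash marks an entry with no reported value.

| Training Data | Method | Sintel (train) Clean | Sintel (train) Final | Sintel (test) Clean | Sintel (test) Final |
|---|---|---|---|---|---|
| - | FlowFields [7] | - | - | 3.75 | 5.81 |
| - | FlowFields++ [35] | - | - | 2.94 | 5.49 |
| - | DCFlow [42] | - | - | 3.54 | 5.12 |
| - | TVNet [14] | 7.45 | 8.59 | - | - |
| - | MRFlow [41] | 1.83 | 3.59 | 2.53 | 5.38 |
| C+T | HD3 [45] | 3.84 | 8.77 | - | - |
| C+T | LiteFlowNet [20] | 2.48 | 4.04 | - | - |
| C+T | PWC-Net [37] | 2.55 | 3.93 | - | - |
| C+T | LiteFlowNet2 [21] | 2.24 | 3.78 | - | - |
| C+T | VCN [44] | 2.21 | 3.68 | - | - |
| C+T | FlowNet2 [23] | 2.02 | 3.54 | 3.96 | 6.02 |
| C+T | Ours (small) | 2.21 | 3.35 | - | - |
| C+T | Ours | 1.63 | 2.83 | - | - |
| C+T+S | FlowNet2 [23] | - | - | 4.16 | 5.74 |
| C+T+S | LiteFlowNet2 [21] | - | - | 3.45 | 4.90 |
| C+T+S | HD3 [45] | - | - | 4.79 | 4.67 |
| C+T+S | PWC-Net+ [36] | - | - | 3.45 | 4.60 |
| C+T+S | IRR-PWC [22] | - | - | 3.84 | 4.58 |
| C+T+S | VCN [44] | - | - | 2.81 | 4.40 |
| C+T+S | SelFlow [26] | - | - | 3.74 | 4.26 |
| C+T+S | Ours | - | - | 2.77 | 3.61 |
| C+T+S | Ours (warm-start) | - | - | 2.42 | 3.39 |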
When using C+T for training, our method outperforms all existing approaches, despite using a significantly shorter training schedule. Our method achieves an average EPE (end-point-error) of 1.63 on the Sintel (train) clean pass, which is about 20% lower than FlowNet2 (2.02) and 36% lower than PWC-Net (2.55). These results demonstrate good cross-dataset generalization. One of the reasons for better generalization is the structure of our network. By constraining optical flow to be the product of a series of identical update steps, we force the network to learn an update operator which mimics the updates of a first-order descent algorithm. This constrains the search space, reduces the risk of overfitting, and leads to faster training and better generalization.
When evaluating on the Sintel (test) set, we finetune on the combined clean and final passes of the training set. Our method ranks 1st on both the Sintel clean and final passes, and on the final pass outperforms SelFlow [26], the best-performing prior work, by 0.87 pixels (3.39 versus 4.26). We evaluate two versions of our model: Ours uses zero initialization, while Ours (warm-start) initializes flow by forward projecting the flow estimate from the previous frame. Since our method operates at a single resolution, we can initialize the flow estimate to utilize motion smoothness from past frames, which cannot be easily done using the coarse-to-fine model.
4.2 KITTI
We also evaluate RAFT on KITTI and provide results in Table 2 and Figure 4. We first evaluate cross-dataset generalization by evaluating on the KITTI-15 (train) split after training on Chairs (C) and FlyingThings (T). Our method outperforms prior works by a large margin, improving EPE (end-point-error) from 8.36 to 5.54, which shows that the underlying structure of our network facilitates generalization. This property is important for applying our method in circumstances where it is difficult to collect training data.
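Table 2: Results on KITTI-15. A dash marks an entry with no reported value.

| Training Data | Method | KITTI-15 (train) F1-epe | KITTI-15 (train) F1-all | KITTI-15 (test) F1-all |
|---|---|---|---|---|
| - | FlowFields [7] | 8.33 | 24.4 | 15.31 |
| - | DCFlow [42] | - | - | 14.86 |
| - | MRFlow [41] | - | - | 12.19 |
| C+T | HD3 [45] | 13.17 | 24.0 | - |
| C+T | LiteFlowNet [20] | 10.39 | 28.5 | - |
| C+T | PWC-Net [37] | 10.35 | 33.7 | - |
| C+T | FlowNet2 [23] | 10.08 | 30.0 | - |
| C+T | LiteFlowNet2 [21] | 8.97 | 25.9 | - |
| C+T | VCN [44] | 8.36 | 25.1 | - |
| C+T | Ours (small) | 7.51 | 26.9 | - |
| C+T | Ours | 5.54 | 19.8 | - |
| C+T+K | FlowNet2 [23] | - | - | 11.48 |
| C+T+K | LiteFlowNet2 [21] | - | - | 7.74 |
| C+T+K | PWC-Net+ [36] | - | - | 7.72 |
| C+T+K | IRR-PWC [22] | - | - | 7.65 |
| C+T+K | HD3 [45] | - | - | 6.55 |
| C+T+K | VCN [44] | - | - | 6.30 |
| C+T+K | Ours | - | - | 6.30 |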
4.3 Ablations
We perform a set of ablation experiments to show the relative importance of each component. All ablated versions are trained on FlyingChairs (C) + FlyingThings (T). Results of the ablations are shown in Table 3. In each section of the table, we test a specific component of our approach in isolation; the settings used in our final model are underlined. Below we describe each of the experiments in more detail.
Number of Iterations: Although we unroll 12 iterations during training, we can apply an arbitrary number of iterations during inference. In Figure 5 (left), we plot EPE as a function of the number of iterations. Our method quickly converges, surpassing PWC-Net after 3 iterations and FlowNet2 after 6 iterations, but continues to improve with more iterations. Figure 5 (right) shows the magnitude of each subsequent update $\lVert f_{k+1} - f_k \rVert_2$. In Table 3 we provide numerical results for selected numbers of iterations, and test an extreme case of 1000 iterations to show that our method does not diverge.
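Table 3: Ablation experiments (Sintel columns report end-point-error in pixels).

| Experiment | Method | Sintel (train) Clean | Sintel (train) Final | KITTI-15 (train) F1-epe | KITTI-15 (train) F1-all | Parameters |
|---|---|---|---|---|---|---|
| Inference Iter. | 1 | 4.57 | 5.97 | 18.10 | 46.3 | 4.8M |
| Inference Iter. | 3 | 2.47 | 3.86 | 10.00 | 29.8 | 4.8M |
| Inference Iter. | 8 | 1.87 | 3.00 | 6.32 | 21.8 | 4.8M |
| Inference Iter. | 32 | 1.71 | 2.83 | 5.54 | 19.8 | 4.8M |
| Inference Iter. | 100 | 1.64 | 2.86 | 5.73 | 20.1 | 4.8M |
| Inference Iter. | 1000 | 1.66 | 2.87 | 5.80 | 20.2 | 4.8M |
| Update Op. | ConvGRU | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Update Op. | Conv | 2.04 | 3.21 | 7.66 | 26.1 | 4.1M |
| Tying | Tied Weights | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Tying | Untied Weights | 1.96 | 3.20 | 7.64 | 24.1 | 32.5M |
| Context | Context | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Context | No Context | 1.93 | 3.06 | 6.25 | 23.1 | 3.3M |
| Feature Scale | Single-Scale | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Feature Scale | Multi-Scale | 2.08 | 3.12 | 6.91 | 23.2 | 6.6M |
| Lookup Radius | 0 | 3.41 | 4.53 | 23.6 | 44.8 | 4.7M |
| Lookup Radius | 1 | 1.80 | 2.99 | 6.27 | 21.5 | 4.7M |
| Lookup Radius | 2 | 1.78 | 2.82 | 5.84 | 21.1 | 4.8M |
| Lookup Radius | 4 | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Correlation Pooling | No | 1.95 | 3.02 | 6.07 | 23.2 | 4.7M |
| Correlation Pooling | Yes | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Correlation Range | 32px | 2.91 | 4.48 | 10.4 | 28.8 | 4.8M |
| Correlation Range | 64px | 2.06 | 3.16 | 6.24 | 20.9 | 4.8M |
| Correlation Range | 128px | 1.64 | 2.81 | 6.00 | 19.9 | 4.8M |
| Correlation Range | All-Pairs | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Features for Refinement | Correlation | 1.63 | 2.83 | 5.54 | 19.8 | 4.8M |
| Features for Refinement | Warping | 2.27 | 3.73 | 11.83 | 32.1 | 2.8M |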
Architecture of Update Operator: We use a gated activation unit based on the GRU cell. We experiment with replacing the convolutional GRU with a set of 3 convolutional layers with ReLU activation. We achieve better performance by using the GRU block, likely because the gated activation makes it easier for the sequence of flow estimates to converge.
Weight Tying: By default, we tie the weights across all instances of the update operator. Here, we test a version of our approach where each update operator learns a separate set of weights. Accuracy is better when weights are tied, and the parameter count is significantly lower. This suggests that the tied weights impose a useful constraint on the network architecture.
Context: We test the importance of context by training a model with the context network removed. Without context, we still achieve good results, outperforming all existing works on both Sintel and KITTI. But context is helpful. Directly injecting image features into the update operator likely allows spatial information to be better aggregated within motion boundaries.
Feature Scale: By default, we extract features at a single resolution. We also try extracting features at multiple resolutions by building a correlation volume at each scale separately. Single-resolution features simplify the network architecture and allow fine-grained matching even at large displacements.
Lookup Radius: The lookup radius specifies the dimensions of the grid used in the lookup operation. When a radius of 0 is used, the correlation volume is retrieved at a single point. Surprisingly, we can still get a rough estimate of flow when the radius is 0, which means the network is learning to use zeroth-order information. However, we see better results as the radius is increased.
Correlation Pooling: We output features at a single resolution and then perform pooling to generate multi-scale volumes. Here we test the impact when this pooling is removed. Results are better with pooling, because large and small displacements are both captured.
Correlation Range: Instead of all-pairs correlation, we also try constructing the correlation volume only for a local neighborhood around each pixel. We try a range of 32 pixels, 64 pixels, and 128 pixels. Overall we get the best results when all pairs are used, although a 128px range is sufficient to perform well on Sintel because most displacements fall within this range. That said, all-pairs is still preferable because it eliminates the need to specify a range. It is also more convenient to implement: it can be computed using matrix multiplication, allowing our approach to be implemented entirely in PyTorch.
Features for Refinement: We compute visual similarity by building a correlation volume between all pairs of pixels. In this experiment, we try replacing the correlation volume with a warping layer, which uses the current estimate of optical flow to warp features from $I_2$ onto $I_1$ and then estimates the residual displacement. While warping is still competitive with prior work on Sintel, correlation performs significantly better, especially on KITTI.
4.4 Timing and Parameter Counts
Inference time and parameter counts are shown in Figure 6. Accuracy is determined by performance on the Sintel (train) final pass after training on FlyingChairs and FlyingThings (C+T). In these plots, we report accuracy and timing after 12 iterations, and we time our method using a GTX 1080Ti GPU. Parameter counts for other methods are taken as reported in their papers, and we report times when run on our hardware. RAFT is more efficient in terms of parameter count, inference time, and training iterations. The context and feature encoders use 1.05M parameters each. The update operator uses 1.5M parameters. The remaining 1.2M parameters are used to process correlation features and predict flow. Ours-S, the small model, uses only 1M parameters, but outperforms PWC-Net and VCN, which are more than 6x larger. We provide an additional table with numerical values for parameters, timing, and training iterations in the supplemental material.
4.5 Video of Very High Resolution
To demonstrate that our method scales well to videos of very high resolution, we apply our network to HD video from the DAVIS [32] dataset. We use 1080p (1088x1920) resolution video and apply 12 iterations of our approach. Inference takes 550ms for 12 iterations on 1080p video, with all-pairs correlation taking 95ms. Fig. 7 visualizes example results on DAVIS.
5 Conclusions
We have proposed RAFT (Recurrent All-Pairs Field Transforms), a new end-to-end trainable model for optical flow. RAFT is unique in that it operates at a single resolution using a large number of lightweight, recurrent update operators. Our method achieves state-of-the-art accuracy across a diverse range of datasets and strong cross-dataset generalization, and is efficient in terms of inference time, parameter count, and training iterations.
Acknowledgments: This work was partially funded by the National Science Foundation under Grant No. 1617767.
References
 [1] (2017) Solving illposed inverse problems using iterative deep neural networks. Inverse Problems 33 (12), pp. 124007. Cited by: §2.
 [2] (2018) Learned primaldual reconstruction. IEEE transactions on medical imaging 37 (6), pp. 1322–1332. Cited by: §2.
 [3] (2019) Differentiable convex optimization layers. In Advances in Neural Information Processing Systems, pp. 9558–9570. Cited by: §2.

 [4] (2017) Optnet: differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 136–145. Cited by: §2.
 [5] (2018) Trellis networks for sequence modeling. arXiv preprint arXiv:1810.06682. Cited by: §2.
 [6] (2019) Deep equilibrium models. In Advances in Neural Information Processing Systems, pp. 688–699. Cited by: §2.

 [7] (2015) Flow fields: dense correspondence fields for highly accurate large displacement optical flow estimation. In Proceedings of the IEEE international conference on computer vision, pp. 4015–4023. Cited by: Table 1, Table 2.
 [8] (2009) Large displacement optical flow. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 41–48. Cited by: §2.
 [9] (2012) A naturalistic open source movie for optical flow evaluation. In European conference on computer vision, pp. 611–625. Cited by: 1st item, §1, §4, §4.
 [10] (2011) A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of mathematical imaging and vision 40 (1), pp. 120–145. Cited by: §2.
 [11] (2016) Full flow: optical flow estimation by global optimization over regular grids. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4706–4714. Cited by: §1, §2.

 [12] (2014) On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259. Cited by: §2.
 [13] (2015) Flownet: learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 2758–2766. Cited by: §1, §4.
 [14] (2018) End-to-end learning of motion representation for video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6016–6025. Cited by: §2, Table 1.
 [15] (2019) DeepView: high-quality view synthesis by learned gradient descent. Cited by: §2.
 [16] (2013) Vision meets robotics: the kitti dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237. Cited by: 2nd item, §1, §4, §4.
 [17] (2007) Stereo processing by semi-global matching and mutual information. IEEE Transactions on pattern analysis and machine intelligence 30 (2), pp. 328–341. Cited by: §2.
 [18] (2019) The five elements of flow. arXiv preprint arXiv:1912.10739. Cited by: §1, §2.
 [19] (1981) Determining optical flow. In Techniques and Applications of Image Understanding, Vol. 281, pp. 319–331. Cited by: §1, §2.

 [20] (2018) Liteflownet: a lightweight convolutional neural network for optical flow estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8981–8989. Cited by: Table 4, §1, §1, §1, §2, §2, Table 1, Table 2.
 [21] (2019) A lightweight optical flow cnn–revisiting data fidelity and regularization. arXiv preprint arXiv:1903.07414. Cited by: §1, §2, Table 1, Table 2.
 [22] (2019) Iterative residual refinement for joint optical flow and occlusion estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5754–5763. Cited by: Table 4, §1, §2, §2, §3.2, Table 1, Table 2.
 [23] (2017) Flownet 2.0: evolution of optical flow estimation with deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2462–2470. Cited by: Table 4, §1, §1, §2, Table 1, Table 2.
 [24] (2017) Variational networks: connecting variational methods and deep learning. In German conference on pattern recognition, pp. 281–293. Cited by: §2.
 [25] (2018) Learning for disparity estimation through feature constancy. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2811–2820. Cited by: §2.

 [26] (2019) Selflow: self-supervised learning of optical flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4571–4580. Cited by: §4.1, Table 1.
 [27] (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. Cited by: §4.
 [28] (2019) Taking a deeper look at the inverse compositional algorithm. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4581–4590. Cited by: §2.
 [29] (2016) A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4040–4048. Cited by: §4.
 [30] (2015) Discrete optimization for optical flow. In German Conference on Pattern Recognition, pp. 16–28. Cited by: §2.
 [31] (2017) Automatic differentiation in pytorch. Cited by: §4.
 [32] (2017) The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675. Cited by: §4.5, §4.
 [33] (2014) Non-local total generalized variation for optical flow estimation. In European Conference on Computer Vision, pp. 439–454. Cited by: §2.
 [34] (2017) Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4161–4170. Cited by: §2.
 [35] (2018) FlowFields++: accurate optical flow correspondences meet robust interpolation. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 1463–1467. Cited by: Table 1.
 [36] (2018) Models matter, so does training: an empirical study of cnns for optical flow estimation. arXiv preprint arXiv:1809.05571. Cited by: Table 4, Table 1, Table 2.
 [37] (2018) Pwcnet: cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8934–8943. Cited by: §1, §1, §1, §2, §2, §3.2, Table 1, Table 2.
 [38] (2018) Banet: dense bundle adjustment network. arXiv preprint arXiv:1806.04807. Cited by: §2.
 [39] (2018) Deepv2d: video to depth with differentiable structure from motion. arXiv preprint arXiv:1812.04605. Cited by: §2, §2.
 [40] (2013) DeepFlow: large displacement optical flow with deep matching. In Proceedings of the IEEE international conference on computer vision, pp. 1385–1392. Cited by: §2.
 [41] (2017) Optical flow in mostly rigid scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4671–4680. Cited by: Table 1, Table 2.
 [42] (2017) Accurate optical flow via direct cost volume processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1289–1297. Cited by: §2, Table 1, Table 2.
 [43] (2019) Hierarchical deep stereo matching on high-resolution images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5515–5524. Cited by: §4.
 [44] (2019) Volumetric correspondence networks for optical flow. In Advances in Neural Information Processing Systems, pp. 793–803. Cited by: Table 4, §1, §1, §1, §2, §2, §3.2, Figure 4, Table 1, Table 2, §4.
 [45] (2019) Hierarchical discrete distribution decomposition for match density estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6044–6053. Cited by: §1, §2, Table 1, Table 2.
 [46] (2007) A duality based approach for realtime TV-L1 optical flow. In Joint pattern recognition symposium, pp. 214–223. Cited by: §1, §2.
 [47] (2018) Deeptam: deep tracking and mapping. In Proceedings of the European conference on computer vision (ECCV), pp. 822–838. Cited by: §2.
Appendix A
A.1 Network Architecture
A.2 Timing, Parameters, and Training Iterations
Method             Parameters  Time (Reported)  Time (1080Ti)  Training Iter. (#GPUs)  Accuracy (px)
LiteFlowNetX [20]  0.9M        0.03s            -              2000k                   4.79
LiteFlowNet [20]   5.4M        0.09s            0.09s          2000k                   4.04
IRR-PWC [22]       6.4M        -                0.20s          850k                    3.95
PWC-Net+ [36]      9.4M        0.03s            0.04s          1700k                   3.93
VCN [44]           6.2M        0.18s            0.26s          220k(4)                 3.63
FlowNet2 [23]      162M        0.12s            0.11s          7000k                   3.54
Ours (small)       1.0M        -                0.05s          160k(2)                 3.37
Ours               4.8M        -                0.11s          160k(2)                 2.87
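One way to read the table is as a trade-off between model size and accuracy. The sketch below finds the methods that are not dominated on both axes at once. It assumes the accuracy column is an error measured in pixels, so lower is better; `pareto_front` is a hypothetical helper written for this illustration, and the rows are copied from the table.

```python
# (method, parameters in millions, error in px), copied from the table.
# Assumption: the accuracy column is an error in pixels, so lower is better.
rows = [
    ("LiteFlowNetX", 0.9, 4.79),
    ("LiteFlowNet", 5.4, 4.04),
    ("IRR-PWC", 6.4, 3.95),
    ("PWC-Net+", 9.4, 3.93),
    ("VCN", 6.2, 3.63),
    ("FlowNet2", 162.0, 3.54),
    ("Ours (small)", 1.0, 3.37),
    ("Ours", 4.8, 2.87),
]


def pareto_front(points):
    """Return the names of methods not dominated in (parameters, error)."""
    front = []
    for name, params, err in points:
        # A method is dominated if another one is no worse on both axes
        # and strictly better on at least one of them.
        dominated = any(
            (p <= params and e <= err) and (p < params or e < err)
            for other, p, e in points
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(rows)
print(front)
```

With these numbers the front consists of LiteFlowNetX, Ours (small), and Ours; every other method is matched or beaten on both parameter count and error by one of our two models.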