Spatial Pyramid Network for Optical Flow
We learn to compute optical flow by combining a classical spatial-pyramid formulation with deep learning. This estimates large motions in a coarse-to-fine approach by warping one image of a pair at each pyramid level by the current flow estimate and computing an update to the flow. Instead of the standard minimization of an objective function at each pyramid level, we train one deep network per level to compute the flow update. Unlike the recent FlowNet approach, the networks do not need to deal with large motions; these are dealt with by the pyramid. This has several advantages. First, our Spatial Pyramid Network (SPyNet) is much simpler and 96% smaller than FlowNet in terms of model parameters. This makes it more efficient and appropriate for embedded applications. Second, since the flow at each pyramid level is small (< 1 pixel), a convolutional approach applied to pairs of warped images is appropriate. Third, unlike FlowNet, the learned convolution filters appear similar to classical spatiotemporal filters, giving insight into the method and how to improve it. Our results are more accurate than FlowNet on most standard benchmarks, suggesting a new direction of combining classical flow methods with deep learning.
Recent years have seen significant progress on the problem of accurately estimating optical flow, as evidenced by improving performance on increasingly challenging benchmarks. Despite this, most flow methods are derived from a “classical formulation” that makes a variety of assumptions about the image, from brightness constancy to spatial smoothness. These assumptions are only coarse approximations to reality and this likely limits performance. The recent history of the field has focused on improving these assumptions or making them more robust to violations [7]. This has led to steady but incremental progress.
An alternative approach abandons the classical formulation altogether and starts over using recent neural network architectures. Such an approach takes a pair (or sequence) of images and learns to directly compute flow from them. Ideally such a network would learn to solve the correspondence problem (short and long range), learn filters relevant to the problem, learn what is constant in the sequence, and learn about the spatial structure of the flow and how it relates to the image structure. The first attempts are promising but are not yet as accurate as the classical methods.
Goal. We argue that there is an alternative approach that combines the best of both approaches. Decades of research on flow has produced well-engineered systems and principles that are effective. But there are places where these methods make assumptions that limit their performance. Consequently, here we apply machine learning to address the weak points, while keeping the engineered architecture, with the goal of 1) improving performance over existing neural networks and the classical methods upon which our work is based; 2) achieving real-time flow estimates with accuracy better than the much slower classical methods; and 3) reducing memory requirements to make flow more practical for embedded, robotic, and mobile applications.
Problem. The key problem with recent methods for learning flow [16] is that they typically take two frames, stack them together, and apply a convolutional network architecture. When the motions between frames are larger than one (or a few) pixels, spatiotemporal convolutional filters will not obtain meaningful responses. Said another way, if a convolutional window in one image does not overlap with related image pixels at the next time instant, no meaningful temporal filter can be learned.
There are two problems that need to be solved. One is to solve for long-range correlations while the other is to solve for detailed, sub-pixel optical flow and precise motion boundaries. FlowNet [16] attempts to learn both of these at once. In contrast, we tackle the latter using deep learning and rely on existing methods to solve the former.
Approach. To deal with large motions we adopt a traditional coarse-to-fine approach using a spatial pyramid. (This, of course, has well-known limitations, which we discuss later.) At the top level of the pyramid, the hope is that the motions between frames are smaller than a few pixels and that, consequently, the convolutional filters can learn meaningful temporal structure. At each level of the pyramid we solve for the flow using a convolutional network and upsample the flow to the next pyramid level. As is standard with classical formulations [36], we warp one image towards the other using the current flow, and repeat this process at each pyramid level. Instead of minimizing a classical objective function at each level, we learn a convolutional network to predict the flow increment at that level. We train the network from coarse to fine to learn the flow correction at each level and add this to the flow output of the network above. The idea is that the displacements are then always less than a few pixels at each pyramid level.
We call the method SPyNet, for Spatial Pyramid Network, and train it using the same Flying Chairs data as FlowNet [16]. We report similar performance to FlowNet on Flying Chairs and Sintel [11] but are significantly more accurate than FlowNet on Middlebury [4] and KITTI [18] after fine tuning. The total size of SPyNet is 96% smaller than FlowNet, meaning that it runs faster and uses much less memory. The expensive iterative propagation of classical methods is replaced by the non-iterative computation of the neural network.
We do not claim to solve the full optical flow problem with SPyNet – we address the same problem as traditional approaches and inherit some of their limitations. For example, it is well known that large motions of small or thin objects are difficult to capture with a pyramid representation. We see the large motion problem as separate, requiring different solutions. Rather, what we show is that the traditional problem can be reformulated, portions of it can be learned, and performance improves in many scenarios.
Additionally, because our approach connects past methods with new tools, it provides insights into how to move forward. In particular, we find that SPyNet learns spatiotemporal convolutional filters that resemble traditional spatiotemporal derivative or Gabor filters [2, 23]. The learned filters resemble biological models of motion processing filters in MT and V1 [35]. This is in contrast to the highly random-looking filters learned by FlowNet. This suggests that it is timely to re-examine older spatiotemporal filtering approaches with new tools.
In summary, our contributions are: 1) the combination of traditional coarse-to-fine pyramid methods with deep learning for optical flow estimation; 2) a new SPyNet model that is 96% smaller and faster than FlowNet; 3) SPyNet achieves comparable or lower error than FlowNet on standard benchmarks (Sintel, KITTI and Middlebury); 4) the learned spatiotemporal filters provide insight about what filters are needed for flow estimation; 5) the trained network and related code are publicly available for research (https://github.com/anuragranj/spynet).
Our formulation effectively combines ideas from “classical” optical flow and recent deep learning methods. Our review focuses on the work most relevant to this.
Spatial pyramids and optical flow. The classical formulation of the optical flow problem dates to Horn and Schunck [24] and involves optimizing the sum of a data term based on brightness constancy and a spatial smoothness term. The classical methods typically suffer from the fact that they make very approximate assumptions about the image brightness change and the spatial structure of the flow. Many methods focus on improving robustness by changing the assumptions. A full review would effectively cover the history of the field; for this we refer the reader to [36]. The key advantage of learning to compute flow, as we do here, is that we do not hand craft changes in these assumptions. Rather, the variations in image brightness and spatial smoothness are embodied in the learned network.
The idea of using a spatial pyramid has a similarly long history dating to [10] with its first use in the classical flow formulation appearing in [19]. Typically Gaussian or Laplacian pyramids are used for flow estimation with the primary motivation to deal with large motions. These methods are well known to have problems when small objects move quickly. Brox et al. [8] incorporate long range matching into the traditional optical flow objective function. This approach of combining image matching to capture large motions, with a variational [31] or discrete optimization [20] for fine motions, can produce accurate results.
Of course spatial pyramids are widely used in other areas of computer vision and have recently been used in deep neural networks [15] to learn generative image models.
Spatiotemporal filters. Burt and Adelson [2] lay out the theory of spatiotemporal models for motion estimation and Heeger [23] provides a computational embodiment. While inspired by human perception, such methods did not perform well at the time [6].
Various methods have shown that spatiotemporal filters emerge from learning, for example using independent component analysis [41], sparseness [30], and multilayer models [12]. Memisevic and Hinton learn simple spatial transformations with a restricted Boltzmann machine [28], finding a variety of filters. Taylor et al. [39] use synthetic data to learn “flow like” features using a restricted Boltzmann machine but do not evaluate flow accuracy.
Dosovitskiy et al. [16] learn spatiotemporal filters for flow estimation using a deep network, yet these filters do not resemble classical filters inspired by neuroscience. By using a pyramid approach, here we learn filters that are visually similar to classical spatiotemporal filters, yet because they are learned from data, produce good flow estimates.
Learning to model and compute flow. Possibly the first attempt to learn a model to estimate optical flow is the work of Freeman et al. [17] using an MRF. They consider a simple synthetic world of uniform moving blobs with ground truth flow. The training data was not realistic and they did not apply the method to real image sequences.
Roth and Black [32] learn a field-of-experts (FoE) model to capture the spatial statistics of optical flow. The FoE can be viewed as a (shallow) convolutional neural network. The model is trained using flow fields generated from laser scans of real scenes and natural camera motions. They have no images of the scenes (only their flow) and consequently the method only learns the spatial component.
Sun et al. [14] describe the first fully learned model that can be considered a (shallow) convolutional neural network. They formulate a classical flow problem with a data term and a spatial term. The spatial term uses the FoE model from [32], while the data term replaces traditional derivative filters with a set of learned convolutional image filters. With limited training data and a small set of filters, the method did not show the full promise of learning flow.
Wulff and Black [44] learn the spatial statistics of optical flow by applying robust PCA [21] to real (noisy) optical flow computed from natural movies. While this produces a global flow basis and overly smooth flow, they use the model to compute reasonable flow relatively quickly.
Deep Learning. The above learning methods suffer from limited training data and the use of shallow models. In contrast, deep convolutional neural networks have emerged as a powerful class of models for solving recognition [22, 38] and dense estimation [13, 27] problems.
FlowNet [16] represents the first deep convolutional architecture for flow estimation that is trained end-to-end. The network shows promising results, despite being trained on an artificial dataset of chairs flying over randomly selected images. Even so, the method lags behind the state of the art in terms of accuracy [16]. Deep matching methods [20, 31, 42, thewlis2016fully] do not fully solve the problem, since they resort to classical methods to compute the final flow field. It remains an open question as to which architectures are most appropriate for the problem and how best to train these.
Tran et al. [40] use a traditional flow method to create “semi-truth” training data for a 3D convolutional network. The performance is below the state of the art and the method is not tested on the standard benchmarks. There have also been several attempts at estimating optical flow using unsupervised learning [3, 45]. However, these methods have lower accuracy on standard benchmarks.
Fast flow. Several recent methods attempt to balance speed and accuracy, with the goal of real-time processing and reasonable (though not top) accuracy. GPU-flow [43] began this trend but several methods now outperform it. PCA-Flow [44] runs on a CPU, is slower than frame rate, and produces overly smooth flow fields. EPPM [5] achieves similar, middle-of-the-pack performance on Sintel (test), with similar speed on a GPU. Most recently DIS-Fast [26] is a GPU method that is significantly faster than previous methods but is also significantly less accurate.
Our method is also significantly faster than the best previous CNN flow method (FlowNet), which reports a runtime of 80ms/frame for FlowNetS. The key to our speed is to create a small neural network that fits entirely on the GPU. Additionally all our pyramid operations are implemented on the GPU.
Size is an important issue that has not attracted as much attention as speed. For optical flow to exist on embedded processors, aerial vehicles, phones, etc., the algorithm needs a small memory footprint. Our network is 96% smaller than FlowNetS and uses only 9.7 MB for the model parameters, making it easily small enough to fit on a mobile phone GPU.
Our approach uses the coarse-to-fine spatial pyramid structure of [15] to learn residual flow at each pyramid level. Here we describe the network and training procedure.
Let $d(\cdot)$ be the downsampling function that decimates an $m \times n$ image $I$ to the corresponding image $d(I)$ of size $m/2 \times n/2$. Let $u(\cdot)$ be the reverse operation that upsamples images. These operators are also used for downsampling and upsampling the horizontal and vertical components of the optical flow field, $V$. We also define a warping operator $w(I, V)$ that warps the image $I$ according to the flow field $V$, using bilinear interpolation.
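To make the warping operator concrete, here is a minimal pure-Python sketch of backward warping with bilinear interpolation. It is an illustration under stated assumptions, not the authors' GPU implementation: the function name `warp_bilinear`, the single-channel list-of-lists image, the sampling convention (output pixel (x, y) reads the input at (x + dx, y + dy)) and the clamping at the image border are all choices made here.

```python
import math


def warp_bilinear(image, flow):
    """Backward-warp a one-channel image by a flow field.

    image: list of rows of floats.
    flow:  list of rows of (dx, dy) displacements in pixels.
    Sample positions are clamped to the image border (assumed rule).
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            sx = min(max(x + dx, 0.0), w - 1.0)   # clamped sample column
            sy = min(max(y + dy, 0.0), h - 1.0)   # clamped sample row
            x0, y0 = int(math.floor(sx)), int(math.floor(sy))
            x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
            ax, ay = sx - x0, sy - y0             # fractional offsets
            top = (1 - ax) * image[y0][x0] + ax * image[y0][x1]
            bottom = (1 - ax) * image[y1][x0] + ax * image[y1][x1]
            out[y][x] = (1 - ay) * top + ay * bottom
    return out


# A horizontal ramp shifted by half a pixel to the right.
ramp = [[0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]]
half_pixel = [[(0.5, 0.0)] * 4 for _ in range(2)]
warped = warp_bilinear(ramp, half_pixel)
```

On the ramp, each interior pixel picks up the average of its own value and its right neighbour's, while the last column is held at the border value because of the clamping.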
Let $\{G_0, \ldots, G_K\}$ denote a set of trained convolutional neural network (convnet) models, each of which computes residual flow, $v_k$,
$$v_k = G_k\big(I_k^1,\; w(I_k^2, u(V_{k-1})),\; u(V_{k-1})\big) \tag{1}$$
at the $k$-th pyramid level. The convnet $G_k$ computes the residual flow $v_k$ using the upsampled flow from the previous pyramid level, $u(V_{k-1})$, and the frames $\{I_k^1, I_k^2\}$ at level $k$. The second frame $I_k^2$ is warped using the flow as $w(I_k^2, u(V_{k-1}))$ before feeding it to the convnet $G_k$. The flow, $V_k$, at the $k$-th pyramid level is then
$$V_k = u(V_{k-1}) + v_k. \tag{2}$$
As shown in Fig. 1, we start with downsampled images $\{I_0^1, I_0^2\}$ and an initial flow estimate that is zero everywhere to compute the residual flow $v_0 = V_0$ at the top of the pyramid. We upsample the resulting flow, $u(V_0)$, and pass it to the network $G_1$ along with $\{I_1^1, w(I_1^2, u(V_0))\}$ to compute the residual flow $v_1$. At each pyramid level, we compute the flow $V_k$ using Equation (2). The flow $V_k$ is similarly propagated to higher resolution layers of the pyramid until we obtain the flow $V_K$ at full resolution. Figure 1 shows the working of our approach using a 3-level pyramid. In experiments, we use a 5-level pyramid ($K = 4$).
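The coarse-to-fine procedure above can be sketched in a few lines of pure Python. This is a toy illustration, not the released Torch7 code. It makes several assumptions: images are one-channel lists of lists, downsampling is 2x2 average pooling, flow upsampling is nearest-neighbour with the vectors doubled so they remain in pixel units, and warping is a nearest-pixel stand-in for the bilinear operator. `constant_net` is a hypothetical placeholder for a trained per-level network.

```python
def downsample(img):
    """Halve each dimension by 2x2 average pooling (assumed filter)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
              + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)] for y in range(h)]


def upsample_flow(flow):
    """Nearest-neighbour 2x upsampling; vectors doubled (assumption)."""
    out = []
    for row in flow:
        wide = [(2.0 * dx, 2.0 * dy) for dx, dy in row for _ in range(2)]
        out.extend([wide, list(wide)])
    return out


def warp_nearest(img, flow):
    """Crude stand-in for bilinear warping: nearest pixel, clamped."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            dx, dy = flow[y][x]
            sx = min(max(int(round(x + dx)), 0), w - 1)
            sy = min(max(int(round(y + dy)), 0), h - 1)
            row.append(img[sy][sx])
        out.append(row)
    return out


def pyramid_flow(img1, img2, nets):
    """Run one network per level, coarsest first; return the final flow."""
    pyr1, pyr2 = [img1], [img2]
    for _ in range(len(nets) - 1):            # build pyramids
        pyr1.insert(0, downsample(pyr1[0]))
        pyr2.insert(0, downsample(pyr2[0]))
    # Zero initial flow at the coarsest resolution.
    up = [[(0.0, 0.0)] * len(pyr1[0][0]) for _ in pyr1[0]]
    flow = None
    for k, net in enumerate(nets):
        if k > 0:
            up = upsample_flow(flow)          # upsampled previous flow
        warped = warp_nearest(pyr2[k], up)    # warp second frame
        residual = net(pyr1[k], warped, up)   # residual flow, Equation (1)
        flow = [[(u[0] + r[0], u[1] + r[1]) for u, r in zip(urow, rrow)]
                for urow, rrow in zip(up, residual)]   # Equation (2)
    return flow


def constant_net(img1, warped, up_flow):
    """Hypothetical placeholder network: always predicts (0.5, 0)."""
    return [[(0.5, 0.0) for _ in row] for row in img1]


image = [[float(x) for x in range(8)] for _ in range(8)]
final_flow = pyramid_flow(image, image, [constant_net] * 3)
```

With three levels and a constant residual of 0.5 pixels per level, the accumulated horizontal flow is 0.5, then 2(0.5) + 0.5 = 1.5, then 2(1.5) + 0.5 = 3.5 pixels. This shows how small per-level updates add up to a large displacement at full resolution.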
We train each of the convnets $\{G_0, \ldots, G_K\}$ independently and sequentially to compute the residual flow $v_k$ given the inputs $\{I_k^1, w(I_k^2, u(V_{k-1}))\}$. We compute target residual flows $\hat{v}_k$ as the difference between the target flow $\hat{V}_k$ at the $k$-th pyramid level and the upsampled flow, $u(V_{k-1})$, obtained from the trained convnet of the previous level:
$$\hat{v}_k = \hat{V}_k - u(V_{k-1}). \tag{3}$$
As shown in Fig. 2, we train each of the networks, $G_k$, to minimize the average End Point Error (EPE) loss on the residual flow $v_k$.
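As a small worked example of the training target and loss, the sketch below computes a residual target as "target flow minus upsampled flow" and the average EPE as the mean Euclidean distance between predicted and target vectors. The function names and the list-of-tuples flow layout are assumptions made for illustration only.

```python
import math


def residual_target(target_flow, upsampled_flow):
    """Target residual: ground-truth flow minus the upsampled estimate."""
    return [[(t[0] - u[0], t[1] - u[1]) for t, u in zip(trow, urow)]
            for trow, urow in zip(target_flow, upsampled_flow)]


def average_epe(pred, target):
    """Mean Euclidean distance between predicted and target flow vectors."""
    total, count = 0.0, 0
    for prow, trow in zip(pred, target):
        for (px, py), (tx, ty) in zip(prow, trow):
            total += math.hypot(px - tx, py - ty)
            count += 1
    return total / count


target = [[(3.0, 4.0), (1.0, 0.0)]]      # ground-truth flow at this level
upsampled = [[(0.0, 0.0), (1.0, 0.0)]]   # estimate from the coarser level
v_hat = residual_target(target, upsampled)
loss_if_zero = average_epe([[(0.0, 0.0), (0.0, 0.0)]], v_hat)
```

Here the second pixel is already explained by the coarser level, so its residual target is zero. A network that predicts zero everywhere pays only for the first pixel, giving an average EPE of (5 + 0) / 2 = 2.5.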
Each level in the pyramid has a simplified task relative to the full optical flow estimation problem; it only has to estimate a small-motion update to an existing flow field. Consequently each network can be simple. Here, each $G_k$ has 5 convolutional layers, which we found gave the best combination of accuracy, size, and speed. We train five convnets $\{G_0, \ldots, G_4\}$ at different resolutions of the Flying Chairs dataset. The network $G_0$ is trained with 24x32 images. We double the resolution at each lower level and finally train the convnet $G_4$ with a resolution of 384x512.
Each convolutional layer is followed by a Rectified Linear Unit (ReLU), except the last one. We use a 7x7 convolutional kernel for each of the layers and found these work better than smaller filters. The numbers of feature maps in each convnet, $G_k$, are {32, 64, 32, 16, 2}. The image $I_k^1$ and the warped image $w(I_k^2, u(V_{k-1}))$ have 3 channels each (RGB). The upsampled flow $u(V_{k-1})$ has 2 channels (horizontal and vertical). We stack the image frames together with the upsampled flow to form an 8-channel input to each $G_k$. The output is a 2-channel flow corresponding to velocity in the $x$ and $y$ directions.
We train five networks $\{G_0, \ldots, G_4\}$ such that each network $G_k$ uses the previous network $G_{k-1}$ as initialization. The networks are trained using Adam [25] optimization with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. We use a batch size of 32 across all networks with 4000 iterations per epoch. We use a learning rate of 1e-4 for the first 60 epochs and decrease it to 1e-5 until the networks converge. We use Torch7 (http://torch.ch/) as our deep learning framework. We use the Flying Chairs [16] dataset and MPI Sintel [11] for training our network. All our networks are trained on a single Nvidia K80 GPU.
We include various types of data augmentation during training. We randomly scale images by a factor of $[1, 2]$ and apply rotations at random within $[-17^\circ, 17^\circ]$. We then apply a random crop to match the resolution of the convnet, $G_k$, being trained. We include additive white Gaussian noise sampled uniformly from $\mathcal{N}(0, 0.1)$. We apply color jitter with additive brightness, contrast and saturation sampled from a Gaussian, $\mathcal{N}(0, 0.4)$. We finally normalize the images using a mean and standard deviation computed from a large corpus of ImageNet [33] data in [22].


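Table 1. Average end point errors (EPE) and runtime. A dash marks an entry with no reported value.
| Method | Sintel Clean train | Sintel Clean test | Sintel Final train | Sintel Final test | KITTI train | KITTI test | Middlebury train | Middlebury test | Flying Chairs test | Time (s) |
| Classic+NLP | 4.13 | 6.73 | 5.90 | 8.29 | - | - | 0.22 | 0.32 | 3.93 | 102 |
| FlowNetS | 4.50 | 7.42 | 5.45 | 8.43 | 8.26 | - | 1.09 | - | 2.71 | 0.080 |
| FlowNetC | 4.31 | 7.28 | 5.87 | 8.81 | 9.35 | - | 1.15 | - | 2.19 | 0.150 |
| SPyNet | 4.12 | 6.69 | 5.57 | 8.43 | 9.12 | - | 0.33 | 0.58 | 2.63 | 0.069 |
| FlowNetS+ft | 3.66 | 6.96 | 4.44 | 7.76 | 7.52 | 9.1 | 0.98 | - | 3.04 | 0.080 |
| FlowNetC+ft | 3.78 | 6.85 | 5.28 | 8.51 | 8.79 | - | 0.93 | - | 2.27 | 0.150 |
| SPyNet+ft | 3.17 | 6.64 | 4.32 | 8.36 | 4.13 | 4.7 | 0.33 | 0.58 | 3.07 | 0.069 |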



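Table 2. EPE of the fine-tuned models on Sintel, by distance from motion boundaries (d, in pixels) and by velocity (s, in pixels per frame).
| Method | Final d0-10 | Final d10-60 | Final d60-140 | Final s0-10 | Final s10-40 | Final s40+ | Clean d0-10 | Clean d10-60 | Clean d60-140 | Clean s0-10 | Clean s10-40 | Clean s40+ |
| FlowNetS+ft | 7.25 | 4.61 | 2.99 | 1.87 | 5.83 | 43.24 | 5.99 | 3.56 | 2.19 | 1.42 | 3.81 | 40.10 |
| FlowNetC+ft | 7.19 | 4.62 | 3.30 | 2.30 | 6.17 | 40.78 | 5.57 | 3.18 | 1.99 | 1.62 | 3.97 | 33.37 |
| SPyNet+ft | 6.69 | 4.37 | 3.29 | 1.39 | 5.53 | 49.71 | 5.50 | 3.12 | 1.71 | 0.83 | 3.34 | 43.44 |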

We evaluate our performance on standard optical flow benchmarks and compare with FlowNet [16] and Classic+NLP [36], a traditional pyramid-based method. We compare performance using average end point errors in Table 1. We evaluate on all the standard benchmarks and find that SPyNet is the most accurate overall, with and without fine tuning (details below). Additionally SPyNet is faster than all other methods.
Note that the FlowNet results reported on the MPI-Sintel website are for a version that applies variational refinement (“+v”) to the convnet results. Here we are not interested in the variational component and only compare the results of the convnet output.
Once the convnets are trained on Flying Chairs, we fine tune the network on the same dataset but without any data augmentation at a learning rate of 1e-6. We see an improvement of EPE by 0.14 on the test set. Our model achieves better performance than FlowNetS [16] on the Flying Chairs dataset; however, FlowNetC [16] performs better than ours. We show the qualitative results on the Flying Chairs dataset in Fig. 3 and compare the performance in Table 1.
The resolution of Sintel images is 436x1024. To use SPyNet, we scale the images to 448x1024 and use 6 pyramid levels to compute the optical flow. The networks used on each pyramid level are $\{G_0, G_1, G_2, G_3, G_4, G_4\}$; that is, we repeat the network $G_4$ at the sixth level of the pyramid for experiments on Sintel. Because Sintel has extremely large motions, we found that this gives better performance than using just five levels.
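The text does not say why 448 is chosen. One plausible reading, stated here as an assumption, is that six pyramid levels halve the image five times, so each side should be a multiple of 2^5 = 32. The hypothetical helper below rounds a size up to the nearest such multiple.

```python
def pyramid_compatible_size(height, width, levels):
    """Round each side up to a multiple of 2**(levels - 1).

    Assumes every pyramid level halves both sides exactly.
    """
    factor = 2 ** (levels - 1)

    def round_up(n):
        return -(-n // factor) * factor   # ceiling division, then scale

    return round_up(height), round_up(width)


sintel_size = pyramid_compatible_size(436, 1024, levels=6)
coarsest = (sintel_size[0] // 32, sintel_size[1] // 32)
```

For Sintel this reproduces the 448x1024 working resolution quoted above and gives a 14x32 image at the coarsest of the six levels.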
We evaluate the performance of our model on MPI-Sintel [11] in two ways. First, we directly use the model trained on the Flying Chairs dataset and evaluate our performance on both the training and the test sets. Second, we extract a validation set from the Sintel training set, using the same partition as [16]. We fine tune our model independently on the Sintel Clean and Sintel Final splits, and evaluate the EPE. The fine-tuned models are listed as “+ft” in Table 1. We show the qualitative results on MPI-Sintel in Fig. 4.
Table 2 compares our fine-tuned model with FlowNet [16] for different velocities and distances from motion boundaries. We observe that SPyNet is more accurate than FlowNet for all velocity ranges except the largest displacements (over 40 pixels/frame). SPyNet is also more accurate than FlowNet close to motion boundaries, which is important for many problems.
We evaluate KITTI [18] scenes using the base model SPyNet trained on Flying Chairs. We then fine-tune the model on Driving and Monkaa scenes from [29] and evaluate the fine-tuned model SPyNet+ft. Fine tuning results in a significant improvement in accuracy by about 5 pixels. The large improvement in accuracy suggests that better training datasets are needed and that these could improve the accuracy of SPyNet further on general scenes. While SPyNet+ft is much more accurate than FlowNet+ft, the latter is fine-tuned on different data.
For the Middlebury [4] dataset, we evaluate the sequences using the base model SPyNet as well as SPyNet+ft, which is fine-tuned on the Sintel Final dataset; the Middlebury dataset itself is too small for fine-tuning. SPyNet is significantly more accurate than FlowNet on Middlebury, where FlowNet has trouble with the small motions. Both learned methods are less accurate than Classic+NLP on Middlebury but both are also significantly faster.
Combining spatial pyramids with convnets results in a huge reduction in model complexity. At each pyramid level, a network, $G_k$, has 240,050 learned parameters. The total number of parameters learned by the entire network is 1,200,250, with 5 spatial pyramid levels. In comparison, FlowNetS and FlowNetC [16] have 32,070,472 and 32,561,032 parameters respectively. SPyNet is about 96% smaller than FlowNet (Fig. 5).
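The per-network figure can be checked from the architecture described earlier: five 7x7 convolutional layers mapping an 8-channel input through {32, 64, 32, 16, 2} feature maps. The short sketch below assumes one bias per output channel, which is an assumption made here but one that reproduces the stated count exactly. The function and variable names are illustrative only.

```python
def conv_params(in_channels, out_channels, kernel=7):
    """Weights plus one bias per output channel for a square conv layer."""
    return in_channels * out_channels * kernel * kernel + out_channels


channels = [8, 32, 64, 32, 16, 2]   # input channels, then feature maps
per_network = sum(conv_params(a, b) for a, b in zip(channels, channels[1:]))
total = 5 * per_network             # five pyramid levels
flownet_s = 32_070_472              # parameter count quoted in the text
reduction = 1.0 - total / flownet_s
```

This gives 240,050 parameters per network and 1,200,250 in total, which is a reduction of roughly 96% relative to FlowNetS.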
The spatial pyramid approach enables a significant reduction in model parameters without sacrificing accuracy. There are two reasons – the warping function and learning of residual flow. By using the warping function directly, the convnet does not need to learn it. More importantly, the residual learning restricts the range of flow fields in the output space. Each network only has to model a smaller range of velocities at each level of the spatial pyramid.
SPyNet also has a small memory footprint. The disk space required to store all the model parameters is 9.7 MB. This could simplify deployment on mobile or embedded devices with GPU support.
Figure 6 shows examples of filters learned by the first layer of the network. In each row, the first two columns show the spatial filters that operate on the RGB channels of the two input images respectively. The third column is the difference between the two spatial filters, hence representing the temporal features learned by our model. We observe that most of the spatiotemporal filters in Fig. 6 are equally sensitive to all color channels, and hence appear mostly grayscale. Note that the actual filters are 7x7 pixels and are upsampled for visualization.
We observe that many of the spatial filters appear to be similar to traditional Gaussian derivative filters used by classical methods. These classical filters are hand crafted and typically are applied in the horizontal and vertical direction. Here, we observe a greater variety of derivative-like filters of varied scales and orientations. We also observe filters that spatially resemble second derivative or Gabor filters [2]. The temporal filters show a clear derivative-like structure in time. Note that these filters are very different from those reported in [16] (Sup. Mat.), which have a high-frequency structure, unlike classical filters.
Figure 6 illustrates how filters learned by the network at each level of the pyramid differ from each other. Recall that, during training, each network is initialized with the network before it in the pyramid. The filters, however, do not stay exactly the same with training. Most of the filters in our network look like rows 1 and 2, where the filters become sharper as we progress towards the finer-resolution levels of the pyramid. However, there are some filters that are similar to rows 3 and 4, where these filters become more defined at higher resolution levels of the pyramid.
Optical flow estimation is traditionally viewed as an optimization problem involving some form of variational inference. Such algorithms are computationally expensive, often taking several seconds or minutes per frame. This has limited the application of optical flow in robotics, embedded systems, and video analysis.
Using a GPU can speed up traditional methods [37, 43] but with reduced accuracy. Feed forward deep networks [16] leverage fast GPU convolutions and avoid iterative optimization. Of course for embedded applications, network size is critical (see Fig. 5). Figure 7 shows the speed-accuracy comparisons of several well-known methods. All times shown are measured with the images already loaded in memory. The errors are computed as the average EPE of both the clean and final MPI-Sintel sequences. SPyNet offers a good balance between speed and accuracy; no faster method is as accurate.
Traditional flow methods linearize the brightness constancy equation resulting in an optical flow constraint equation implemented with spatial and temporal derivative filters. Sometimes methods adopt a more generic filter constancy assumption [1, 9]. Our filters are somewhat different. The filters learned by SPyNet are used in the direct computation of the flow by the feedforward network.
SPyNet is small compared with other recent optical flow networks. Examination of the filters, however, suggests that it might be possible to make it significantly smaller still. Many of the filters resemble derivative of Gaussian filters or Gabor filters at various scales, orientations, spatial frequencies, and spatial shifts. Given this, it may be possible to significantly compress the filter bank by using dimensionality reduction or by using a set of analytic spatiotemporal features. Some of the filters may also be separable.
Early methods for optical flow used analytic spatiotemporal features but, at the time, did not produce good results and the general line of spatiotemporal filtering decayed. The difference from early work is that our approach suggests the need for a large filter bank of varied filters. Note also that these approaches considered only the first convolutional layer of filters and did not seek a “deep” solution. This all suggests the possibility that a deep network of analytic filters could perform well. This could vastly reduce the size of the network and the number of parameters that need to be learned.
Note that pyramids have well-known limitations for dealing with large motions [8, 34]. In particular, small or thin objects that move quickly effectively disappear at coarse pyramid levels, making it impossible to capture their motion. Recent approaches for dealing with such large motions use sparse matching to augment standard pyramids [8, 42]. Future work should explore adding long-range matches to SPyNet. Alternatively Sevilla et al. [34] define a channel constancy representation that preserves fine structures in a pyramid. The channels effectively correspond to filters that could be learned.
A spatial pyramid can be thought of as the simple application of a set of linear filters. Here we take a standard spatial pyramid but one could learn the filters for the pyramid itself. SPyNet also uses a standard warping function to align images using the flow computed from the previous pyramid level. This too could be learned.
An appealing feature of SPyNet is that it is small enough to fit on a mobile device. Future work will explore a mobile implementation and its applications. Additionally, we will explore extending the method to use more frames (e.g. 3 or 4). Multiple frames could enable the network to reason more effectively about occlusion.
Finally, Flying Chairs is not representative of natural scene motions, containing many huge displacements. We are exploring new training datasets to improve performance on common sequences where the motion is less dramatic.
In summary, we have described a new optical flow method that combines features of classical optical flow algorithms with deep learning. In a sense, there are two notions of “deepness” here. First we use a “deep” spatial pyramid to deal with large motions. Second we use deep neural networks at each level of the spatial pyramid and train them to estimate a flow update at each level. This approach means that each network has less work to do than a fully generic flow method that has to estimate arbitrarily large motions. At each pyramid level we assume that the motion is small (on the order of a pixel). This is borne out by the fact that the network learns spatial and temporal filters that resemble classical derivatives of Gaussians and Gabors. Because each subtask is so much simpler, our network needs many fewer parameters than previous methods like FlowNet. This results in a method with a small memory footprint that is faster than existing methods. At the same time, SPyNet achieves an accuracy comparable to FlowNet, surpassing it in several benchmarks. This opens up the promise of optical flow that is accurate, practical, and widely deployable.
We thank Jonas Wulff for his insightful discussions about optical flow.
Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 41–48. IEEE, 2009.
Scalable robust principal component analysis using Grassmann averages. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), Dec. 2015.
A model of neuronal responses in visual area MT. Vision Res., 38(5):743–761, 1998.