1 Introduction
Optical flow estimation seeks to perceive the motion information across consecutive video frames, and has a wide range of vision applications such as human action recognition and abnormal event detection. Despite significant progress in the literature, optical flow estimation is still confronted with a number of difficulties in discriminative feature representation, correspondence structure modeling, computational flexibility, etc. In this paper, we focus on how to set up an effective learning pipeline that is capable of performing multi-scale correspondence structure modeling with discriminative feature representation in a flexible end-to-end deep learning framework.
Due to their effectiveness in statistical modeling, learning-based approaches have emerged as an effective tool for optical flow estimation [8, 11, 1, 18]. Usually, these approaches either take image matching at only a single scale into account, or adopt a divide-and-conquer strategy that copes with image matching at multiple scales layer by layer. In complicated situations (e.g., large inter-image displacement or complex motion), they are often incapable of effectively capturing the interaction or dependency relationships among the multi-scale inter-image correspondence structures, which play an important role in robust optical flow estimation. Furthermore, their matching strategies usually fall into one of two categories: 1) set a fixed range of correspondence at a single scale in the learning process [8, 11, 18]; or 2) update the matching range dynamically with a coarse-to-fine scheme [1, 13]. In practice, since videos have time-varying dynamic properties, selecting an appropriate fixed matching range that adapts to various complicated situations is difficult. Besides, the coarse-to-fine scheme may cause matching errors to propagate or accumulate from coarse scales to fine scales. Therefore, for the sake of robust optical flow estimation, correspondence structure modeling ought to be performed in an adaptive multi-scale collaborative way. Moreover, it is crucial to effectively capture the cross-scale dependency information while preserving spatial self-correlations for each individual scale in a totally data-driven fashion.
Motivated by the above observations, we propose a novel unified end-to-end optical flow estimation approach called Multi-Scale Correspondence Structure Learning (MSCSL) (as shown in Fig. 1), which jointly models the dependency of multi-scale correspondence structures with a Spatial ConvGRU neural network model built on multi-level deep learning features. To summarize, the contributions of this work are twofold:

We propose a multi-scale correspondence structure learning approach, which captures the inter-image correspondence structures at multiple scales based on multi-level deep learning features. As a result, the task of optical flow estimation is accomplished by jointly learning the inter-image correspondence structures at multiple scales within an end-to-end deep learning framework. To the best of our knowledge, such a multi-scale correspondence structure learning approach is novel in optical flow estimation.

We design a Spatial ConvGRU neural network model to capture the cross-scale dependency relationships among the multi-scale correspondence structures while preserving spatial self-correlations for each individual scale in a totally data-driven manner. As a result, adaptive multi-scale matching information fusion is enabled, allowing optical flow estimation to adapt to various complicated situations and yielding robust estimation results.
2 Our Approach
2.1 Problem Formulation
Let $\{(I_1^i, I_2^i, Y^i)\}_{i=1}^N$ be a set of training samples, where $(I_1^i, I_2^i)$ and $Y^i$ represent an RGB image pair and the corresponding optical flow respectively. In this paper, our objective is to learn a model parameterized by $\theta$ to predict the dense motion $Y$ of the first image $I_1$. For ease of exposition, we omit the sample index $i$ in the remaining parts.
In this paper, we focus on two factors: (1) computing the correlation maps between image representations at different scales and adaptively setting up the correspondence structure in a data-driven way; (2) encoding the correspondence maps into a high-level feature representation for regressing the optical flow.
2.2 Multi-Scale Correspondence Structure Modelling
Multi-Scale Image Representations.
To represent the input image at multiple scales, we first employ convolutional neural networks (CNNs), parameterized by $\theta_c$, to extract deep features at a single scale for the image $I$, as illustrated in Fig. 1:

(1) $F = f(I; \theta_c)$

and then model the multi-level feature representations $\{F^l\}_{l=1}^{L}$, parameterized by $\theta_m$, with $F$ as the input, as depicted in Fig. 1:

(2) $\{F^l\}_{l=1}^{L} = g(F; \theta_m)$

where $l$ indexes the $l$-th level, and the size of $F^{l+1}$ is larger than that of $F^l$. From top to bottom (or coarse to fine), the feature representations at small scales^{1} (^{1}In this paper, a small scale means a small size and a large scale means a large size.) tend to learn semantic components, which help to find the correspondence of semantic parts with large displacements; furthermore, the feature maps at large scales tend to learn local representations, which can distinguish patches with small displacements. In this paper, we use $\{F_1^l\}$ and $\{F_2^l\}$ to denote the multi-scale representations of $I_1$ and $I_2$ respectively.
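To make the multi-scale representation concrete, the following NumPy sketch builds a coarse-to-fine feature pyramid by repeated 2x average pooling. This is only a structural stand-in: the actual model obtains its multi-level representations with learned pooling and convolution layers, and the shapes below are illustrative.

```python
import numpy as np

def avg_pool2(x):
    """Downsample a feature map of shape (H, W, C) by 2x average pooling."""
    h, w, c = x.shape
    x = x[: h - h % 2, : w - w % 2]  # crop to even spatial size
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2, c).mean(axis=(1, 3))

def build_pyramid(feat, levels=3):
    """Return feature maps ordered from fine (large) to coarse (small) scale."""
    pyramid = [feat]
    for _ in range(levels - 1):
        pyramid.append(avg_pool2(pyramid[-1]))
    return pyramid

f = np.random.rand(32, 48, 8)
pyr = build_pyramid(f, levels=3)
print([p.shape for p in pyr])  # [(32, 48, 8), (16, 24, 8), (8, 12, 8)]
```

Reversing this list gives the coarse-to-fine ordering used when processing the scales from top to bottom.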
Correspondence Structure Modelling. Given an image pair $(I_1, I_2)$ from a video sequence, we first extract their multi-level feature representations $\{F_1^l\}$ and $\{F_2^l\}$ using Eq. 1 and Eq. 2. To learn the correspondence structures between the image pair, we calculate the similarity between the corresponding feature representations. First, we recall the correlation computation proposed in [8]:

(3) $c(\mathbf{x}_1) = \mathrm{concat}\Big\{ \sum_{\mathbf{o} \in [-k,k] \times [-k,k]} \big\langle F_1(\mathbf{x}_1 + \mathbf{o}),\, F_2(\mathbf{x}_2 + \mathbf{o}) \big\rangle \;:\; \mathbf{x}_2 \in \mathcal{N}_d(\mathbf{x}_1) \Big\}$

where $F_1(\mathbf{x}_1)$ and $F_2(\mathbf{x}_2)$ denote the feature vectors at locations $\mathbf{x}_1$ and $\mathbf{x}_2$ of $F_1$ and $F_2$ respectively, $\mathrm{concat}\{\cdot\}$ denotes concatenating the elements of the set into a vector, and $\mathcal{N}_d(\mathbf{x}_1)$ denotes the neighborhood of location $\mathbf{x}_1$ with radius $d$. The meaning of Eq. 3 is that, given a maximum displacement $d$, the correlations between location $\mathbf{x}_1$ in $F_1$ and locations $\mathbf{x}_2$ in $F_2$ can be obtained by computing the similarities between the square patch of size $(2k+1) \times (2k+1)$ centered at $\mathbf{x}_1$ in $F_1$ and square patches of the same size centered at all locations $\mathbf{x}_2 \in \mathcal{N}_d(\mathbf{x}_1)$ in $F_2$. To model the correspondence between location $\mathbf{x}_1$ in $F_1$ and its corresponding location in $F_2$, we can (1) compute Eq. 3 in a small neighbourhood of $\mathbf{x}_1$ in $F_2$, or (2) compute it in a large enough neighbourhood, or even over the whole feature map $F_2$. However, the former cannot guarantee that the similarity between $\mathbf{x}_1$ and its true corresponding location is ever computed, while the latter leads to low computational efficiency, because the complexity of Eq. 3 grows quadratically as $d$ increases. To address this problem, we perform the correlation computation at each scale of the multi-scale feature representations $\{F_1^l\}$ and $\{F_2^l\}$:

(4) $M^l = \mathrm{corr}(F_1^l, F_2^l; d^l)$

where the maximum displacement $d^l$ varies from bottom to top.
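A minimal NumPy sketch of the correlation computation in Eq. 3 for the common special case of a single-pixel patch (k = 0), which is how the correlation layer of [8] is typically instantiated. The function name and shapes are illustrative, not the paper's implementation; note how the output channel count, and hence the cost, grows quadratically with the maximum displacement d, which is the efficiency issue discussed above.

```python
import numpy as np

def correlation(f1, f2, d):
    """Pointwise correlation volume (patch size 1, max displacement d).

    f1, f2: feature maps of shape (H, W, C).
    Returns (H, W, (2d+1)**2): dot products between f1 at each location and
    f2 at every displaced location within its (2d+1) x (2d+1) neighborhood.
    """
    H, W, C = f1.shape
    pad = np.zeros((H + 2 * d, W + 2 * d, C), dtype=f2.dtype)
    pad[d:d + H, d:d + W] = f2
    out = np.empty((H, W, (2 * d + 1) ** 2), dtype=f1.dtype)
    k = 0
    for dy in range(2 * d + 1):          # vertical displacement offset
        for dx in range(2 * d + 1):      # horizontal displacement offset
            shifted = pad[dy:dy + H, dx:dx + W]
            out[:, :, k] = (f1 * shifted).sum(axis=-1)
            k += 1
    return out

f1 = np.random.rand(8, 10, 4)
corr = correlation(f1, f1, d=2)
print(corr.shape)  # (8, 10, 25)
```

Correlating a feature map with itself, the channel corresponding to zero displacement equals the squared feature norm at each location, which is a convenient sanity check.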
2.3 Correspondence Maps Encoding Using Spatial ConvGRU
Cross-Scale Dependency Relationship Modelling. To combine the correlation representations $\{M^l\}$ while preserving the spatial structure for dense optical flow estimation, we regard the representations as a feature map sequence, and then apply Convolutional Gated-Recurrent-Unit Recurrent Networks (ConvGRUs) to model the cross-scale dependency relationships among the multi-scale correspondence structures. ConvGRUs have been used to model the temporal dependencies between frames of a video sequence [4, 15]. A key advantage of ConvGRUs is that they not only model the dependencies within a sequence, but also preserve the spatial location of each feature vector. One significant difference between a ConvGRU and a traditional GRU is that inner-product operations are replaced by convolution operations. However, because of our coarse-to-fine-like scheme, the size of the $(l+1)$-th input in the sequence is larger than that of the $l$-th input, so the standard ConvGRU cannot be applied to our problem. Instead, we propose a Spatial ConvGRU in which each level's output is upsampled before serving as the hidden state for the next level. For the input sequence $\{M^l\}$, the formulation of the Spatial ConvGRU is:
(6) $z^l = \sigma(W_z * M^l + U_z * \hat{h}^{l-1})$
(7) $r^l = \sigma(W_r * M^l + U_r * \hat{h}^{l-1})$
(8) $\tilde{h}^l = \tanh(W * M^l + U * (r^l \odot \hat{h}^{l-1}))$
(9) $h^l = (1 - z^l) \odot \hat{h}^{l-1} + z^l \odot \tilde{h}^l$
(10) $\hat{h}^l = \mathrm{deconv}(h^l; \theta_u)$
where $*$ and $\odot$ denote a convolution operation and an element-wise multiplication respectively, $\sigma$ is an activation function (e.g., the sigmoid function), and $\mathrm{deconv}(\cdot)$ denotes the transposed convolution. The Spatial ConvGRU can model the transition from coarse to fine and recover the spatial topology, outputting intra-level dependency maps $\{h^l\}$.

Table 1: Average endpoint errors (EPE) on the Sintel Clean, Sintel Final, KITTI 2012, Middlebury and Flying Chairs benchmarks (train/test splits where available), together with per-frame runtimes in seconds. Compared methods: EpicFlow, DeepFlow, FlowFields, EPPM, DenseFlow, LDOF, FlowNetS, FlowNetC, SPyNet, MSCSL/wosr, MSCSL/wor and MSCSL, as well as the fine-tuned variants FlowNetS+ft, FlowNetC+ft, SPyNet+ft, MSCSL/wosr+ft, MSCSL/wor+ft and MSCSL+ft.
Intra-Level Dependency Maps Combination. After obtaining the hidden outputs $\{h^l\}_{l=1}^{L}$, we upsample them to the same size, written as $\{H^l\}_{l=1}^{L}$:

(11) $H^l = \mathrm{up}(h^l; \theta_s^l)$

where $\theta_s^l$ are the parameters to be optimized. Furthermore, we concatenate the hidden outputs with the 2nd convolutional output $F_1^{(2)}$ of $I_1$ to get the final encoded feature representation $J$ for optical flow estimation, as depicted in Fig. 1:

(12) $J = \mathrm{concat}(H^1, \ldots, H^L, F_1^{(2)})$

where $\mathrm{concat}(\cdot)$ represents the concatenation operation.
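The Spatial ConvGRU recurrence of Eqs. (6)-(10) can be sketched as follows in NumPy. This is a structural sketch only, with several stated simplifications: weights are random (untrained), nearest-neighbour 2x upsampling replaces the learned transposed convolution, 1x1 convolutions (per-pixel channel matmuls) replace the spatial convolutions, and each finer level is assumed to be exactly twice the resolution of the previous one.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample2(x):
    """Nearest-neighbour 2x upsampling (stand-in for the transposed conv)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def conv_gru_step(x, h, W, U):
    """One GRU update; 1x1 convolutions modelled as channel matmuls."""
    z = sigmoid(x @ W["z"] + h @ U["z"])                # update gate
    r = sigmoid(x @ W["r"] + h @ U["r"])                # reset gate
    h_tilde = np.tanh(x @ W["h"] + (r * h) @ U["h"])    # candidate state
    return (1 - z) * h + z * h_tilde

def spatial_conv_gru(inputs, C):
    """Run the GRU coarse-to-fine, upsampling h before each finer level."""
    rng = np.random.default_rng(0)
    outputs, h = [], None
    for x in inputs:  # inputs ordered from coarse (small) to fine (large)
        Cin = x.shape[-1]
        W = {k: rng.normal(0, 0.1, (Cin, C)) for k in "zrh"}
        U = {k: rng.normal(0, 0.1, (C, C)) for k in "zrh"}
        if h is None:
            h = np.zeros(x.shape[:2] + (C,))  # initial hidden state
        else:
            h = upsample2(h)                  # match the finer resolution
        h = conv_gru_step(x, h, W, U)
        outputs.append(h)
    return outputs

levels = [np.random.rand(8, 12, 6), np.random.rand(16, 24, 6), np.random.rand(32, 48, 6)]
outs = spatial_conv_gru(levels, C=4)
print([o.shape for o in outs])  # [(8, 12, 4), (16, 24, 4), (32, 48, 4)]
```

The key design point is visible in the loop: the hidden state is upsampled between levels, which is what lets the recurrence run over inputs of growing spatial size while still carrying coarse-scale matching information down to fine scales.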
Finally, the proposed framework learns a function $o(\cdot)$, parameterized by $\theta_o$, to predict the optical flow:

(13) $Y = o(J; \theta_o)$
2.4 Unified End-to-End Optimization
As the image representation, correspondence structure learning and correspondence map encoding are highly related, we construct a unified end-to-end framework to optimize the three parts jointly. The loss function used in the optimization framework consists of two parts, namely, a supervised loss and an unsupervised loss (or reconstruction loss). The former is the endpoint error (EPE), which measures the Euclidean distance between the predicted flow $Y$ and the ground truth $Y^{gt}$, while the latter is based on the brightness constancy assumption and measures the Euclidean distance between the first image $I_1$ and the warped second image $\tilde{I}_1$:

(14) $L_{epe}(Y, Y^{gt}) = \frac{1}{HW} \sum_{x,y} \sqrt{(u_{x,y} - u^{gt}_{x,y})^2 + (v_{x,y} - v^{gt}_{x,y})^2}$
(15) $L_{rec}(I_1, \tilde{I}_1) = \frac{1}{HW} \sum_{x,y} \big\| I_1(x,y) - \tilde{I}_1(x,y) \big\|_2$
(16) $L = L_{epe} + \lambda L_{rec}$

where $u$ and $v$ denote the displacements in the horizontal and vertical directions respectively, and $\lambda$ is the balance parameter. The warped image $\tilde{I}_1$ can be computed via bilinear sampling of $I_2$ according to the predicted flow $Y$, as proposed in Spatial Transformer Networks [10]:

(17) $\tilde{I}_1(x, y) = \sum_{(x', y')} I_2(x', y') \max(0, 1 - |x + u_{x,y} - x'|) \max(0, 1 - |y + v_{x,y} - y'|)$
Table 2: EPE on the MPI Sintel test set (Final version, top; Clean version, bottom) for different velocities and distances from motion boundaries, comparing FlowNetS+ft, FlowNetC+ft, SPyNet+ft, MSCSL/wosr+ft, MSCSL/wor+ft and MSCSL+ft.
Because the raw images $I_1$ and $I_2$ contain noise and illumination changes and are less discriminative, in some cases the brightness constancy assumption is not satisfied; furthermore, in highly saturated or very dark regions the assumption also breaks down [11]. Therefore, applying Eq. 16 on the raw data directly makes the network harder to train. To address this issue, we apply the brightness constancy assumption on the 2nd convolutional outputs of $I_1$ and $I_2$ instead of the raw images. The training and test stages are shown in Alg. 1.
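The two loss terms and the bilinear warping of Eq. 17 can be sketched in NumPy as follows. This is a simplified, non-differentiable reference version with illustrative names; the flow is stored as (u, v) in the last axis, and out-of-range sample coordinates are simply clamped to the image border.

```python
import numpy as np

def epe(flow_pred, flow_gt):
    """Average endpoint error: mean per-pixel Euclidean distance between flows."""
    return np.sqrt(((flow_pred - flow_gt) ** 2).sum(axis=-1)).mean()

def warp(img, flow):
    """Backward-warp img by flow using bilinear sampling (Spatial Transformer style)."""
    H, W = img.shape[:2]
    gy, gx = np.mgrid[0:H, 0:W].astype(float)
    sx = np.clip(gx + flow[..., 0], 0, W - 1)  # horizontal displacement u
    sy = np.clip(gy + flow[..., 1], 0, H - 1)  # vertical displacement v
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = sx - x0, sy - y0
    # interpolate along x on the top and bottom rows, then along y
    top = (1 - wx)[..., None] * img[y0, x0] + wx[..., None] * img[y0, x1]
    bot = (1 - wx)[..., None] * img[y1, x0] + wx[..., None] * img[y1, x1]
    return (1 - wy)[..., None] * top + wy[..., None] * bot

def reconstruction_loss(img1, img2, flow):
    """Brightness-constancy loss: distance between img1 and warped img2."""
    return np.sqrt(((img1 - warp(img2, flow)) ** 2).sum(axis=-1)).mean()
```

With a zero flow field, `warp` returns the input image unchanged and both losses vanish, which is a quick way to validate the sampling indices.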
3 Experiments
3.1 Datasets
Flying Chairs [8] is a synthetic dataset created by applying affine transformations to a real image dataset and a rendered set of 3D chair models. This dataset contains 22,872 image pairs, and is split into 22,232 training and 640 test pairs.
MPI Sintel [7] is created from an animated movie, contains many large displacements, and provides dense ground truth. It consists of two versions: the Final version and the Clean version. The former contains motion blur and atmospheric effects, while the latter does not. There are 1,041 training image pairs for each version.
KITTI 2012 [9] is created from real-world scenes using a camera and a 3D laser scanner. It consists of 194 training image pairs with sparse optical flow ground truth.
Middlebury [3] is a very small dataset, containing only 8 image pairs for training, with displacements typically limited to about 10 pixels.
3.2 Implementation Details
3.2.1 Network Architecture
In this part, we introduce the network architecture briefly. We use a larger convolutional kernel for the first convolutional layer than for the second and third convolutional layers. Then we use max-pooling and convolutional operations to obtain multi-scale representations, as illustrated in Fig. 1. The correlation layer is the same as that proposed in [8], and the maximum displacements $d^l$ are set from top to bottom (or from coarse to fine). We then employ one kernel size for the remaining convolutional layers and another for the deconvolutional layers.
3.2.2 Data Augmentation
To avoid overfitting and improve the generalization ability of the network, we employ data augmentation during training by performing random online transformations, including scaling, rotation, translation, additive Gaussian noise, contrast changes, multiplicative color changes to the RGB channels per image, gamma, and additive brightness changes.
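A sketch of the photometric part of this augmentation in NumPy; the jitter magnitudes below are invented for illustration, not the paper's settings. Note that the geometric transformations (scaling, rotation, translation) additionally require applying the same transform to both images and consistently transforming the ground-truth flow field, which is omitted here.

```python
import numpy as np

def photometric_augment(img, rng):
    """Random photometric jitter applied to an image with values in [0, 1]."""
    img = img + rng.normal(0, 0.02, img.shape)            # additive Gaussian noise
    img = (img - 0.5) * rng.uniform(0.8, 1.2) + 0.5       # contrast change
    img = img * rng.uniform(0.9, 1.1, size=(1, 1, 3))     # per-channel color scale
    img = img + rng.uniform(-0.1, 0.1)                    # additive brightness
    img = np.clip(img, 0.0, 1.0) ** rng.uniform(0.7, 1.5) # gamma (after clipping)
    return img

rng = np.random.default_rng(42)
img = np.random.rand(16, 16, 3)
aug = photometric_augment(img, rng)
print(aug.shape)
```

Because the parameters are drawn online per sample, the network sees a different perturbation of each pair at every epoch.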
3.2.3 Training Details
We implement our architecture using Caffe [12] and train the network on an NVIDIA TITAN X GPU. To verify our proposed framework, we conduct three comparison experiments: (1) MSCSL/wosr, which contains neither the proposed Spatial ConvGRU nor the reconstruction loss, and uses the refinement network proposed in [8] to predict dense optical flow; (2) MSCSL/wor, which employs the Spatial ConvGRU (implemented by unfolding the recurrent model in the prototxt file) to encode the correspondence maps for dense optical flow estimation, demonstrating the effectiveness of the Spatial ConvGRU in comparison to MSCSL/wosr; (3) MSCSL, which contains all the aforementioned parts (Spatial ConvGRU and reconstruction loss).

For MSCSL/wosr and MSCSL/wor, we train the networks on the Flying Chairs training dataset using Adam optimization. To avoid gradient explosion, we adopt the same strategy as proposed in [8]: we first use a small learning rate for the initial iterations, then increase the learning rate for a number of subsequent iterations, and afterwards divide it at fixed intervals until training terminates.
For MSCSL, we first train the MSCSL/wor variant using the training strategy above. After that, we add the reconstruction loss with the balance parameter $\lambda$, and then fine-tune the network with a fixed learning rate.
After training the three networks on the Flying Chairs training dataset, we fine-tune them on the MPI Sintel training dataset for tens of thousands of iterations with a fixed learning rate until the networks converge. Specifically, we fine-tune the networks on the Clean and Final versions together, using part of the data for training and the rest for validation. Since the KITTI 2012 and Middlebury datasets are small and only contain sparse ground truth, we do not fine-tune on these two datasets.
3.3 Comparison to StateoftheArt
In this section, we compare our proposed methods with recent state-of-the-art approaches, including traditional methods such as EpicFlow [14], DeepFlow [16], FlowFields [2], EPPM [5], LDOF [6], and DenseFlow [17], and deep learning based methods such as FlowNetS [8], FlowNetC [8], and SPyNet [13]. Table 1 shows the performance comparison between our proposed methods and the state-of-the-art in terms of average endpoint error (EPE). We mainly focus on deep learning based methods, so we primarily compare our proposed methods with learning-based frameworks such as FlowNet and SPyNet.
Flying Chairs. For all three comparison experiments, we first train our networks on this dataset, and then employ the MPI Sintel dataset to fine-tune them further. Table 1 shows that MSCSL outperforms the other two variants, MSCSL/wosr and MSCSL/wor. Furthermore, our proposed methods achieve performance comparable with the state-of-the-art methods. After fine-tuning, most learning based methods suffer from performance decay in most cases, mostly because of the disparity between the Flying Chairs and MPI Sintel datasets. Some visual estimation results on this dataset are shown in Fig. 3.
MPI Sintel. After training on Flying Chairs, we fine-tune the trained models on this dataset. The models trained only on Flying Chairs are evaluated on the training dataset. The results in Table 1 demonstrate that MSCSL and MSCSL/wor generalize better than MSCSL/wosr and other learning based approaches. To further verify our proposed methods, we compare them with FlowNetS, FlowNetC and SPyNet on the MPI Sintel test dataset for different velocities and distances from motion boundaries, as described in Table 2. As shown in Tables 1 and 2, our proposed methods perform better than other deep learning based methods. However, in regions with the largest velocities (respectively, the smallest velocities), the proposed methods are less accurate than FlowNetC (respectively, SPyNet). Some visual results are shown in Fig. 2.
KITTI 2012 and Middlebury. These two datasets are too small, so we do not fine-tune the models on them. We evaluate the trained models on the KITTI 2012 training dataset, the KITTI 2012 test dataset and the Middlebury training dataset respectively. Table 1 shows that our proposed methods outperform other deep learning based approaches remarkably on the KITTI 2012 dataset (both training and test sets). However, on the Middlebury training dataset, which mainly contains small displacements, our proposed methods in most cases do not perform well compared to SPyNet.
Analysis. The results of our framework are smoother and more fine-grained. Specifically, our framework is capable of capturing the motion information of fine-grained object parts, as well as preserving edge information. Meanwhile, the Spatial ConvGRU suppresses the noise present in the results of the model without it. All these observations can be made in Fig. 3 and Fig. 2. However, our proposed frameworks are incapable of effectively capturing the correspondence structure, and are unstable, in regions where the texture is uniform (e.g., on the Middlebury dataset).
Timings. In Table 1, we report the per-frame runtimes of different approaches. Traditional methods are often implemented on a single CPU, while deep learning based methods tend to run on a GPU. Therefore, we only compare runtimes with FlowNetS, FlowNetC and SPyNet. The results in Table 1 demonstrate that our proposed methods (run on an NVIDIA TITAN X GPU) improve accuracy at a speed comparable to the state-of-the-art.
4 Conclusion
In this paper, we propose a novel end-to-end multi-scale correspondence structure learning approach based on deep learning for optical flow estimation. The proposed MSCSL learns the correspondence structure and models the multi-scale dependency in a unified end-to-end deep learning framework. Our model outperforms state-of-the-art deep learning based approaches while retaining comparable computational efficiency. Experimental results on several datasets demonstrate the effectiveness of the proposed framework.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant U1509206 and Grant 61472353, and in part by the Alibaba-Zhejiang University Joint Institute of Frontier Technologies.
References
 [1] Ahmadi, A., and Patras, I. Unsupervised convolutional neural networks for motion estimation. In Image Processing (ICIP), 2016 IEEE International Conference on (2016), IEEE, pp. 1629–1633.
 [2] Bailer, C., Taetz, B., and Stricker, D. Flow fields: Dense correspondence fields for highly accurate large displacement optical flow estimation. In Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 4015–4023.
 [3] Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M. J., and Szeliski, R. A database and evaluation methodology for optical flow. International Journal of Computer Vision 92, 1 (2011), 1–31.
 [4] Ballas, N., Yao, L., Pal, C., and Courville, A. Delving deeper into convolutional networks for learning video representations. arXiv preprint arXiv:1511.06432 (2015).

 [5] Bao, L., Yang, Q., and Jin, H. Fast edge-preserving patchmatch for large displacement optical flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014), pp. 3534–3541.
 [6] Brox, T., and Malik, J. Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 3 (2011), 500–513.
 [7] Butler, D. J., Wulff, J., Stanley, G. B., and Black, M. J. A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision (2012), Springer, pp. 611–625.
 [8] Dosovitskiy, A., Fischery, P., Ilg, E., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D., Brox, T., et al. Flownet: Learning optical flow with convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV) (2015), IEEE, pp. 2758–2766.
 [9] Geiger, A., Lenz, P., and Urtasun, R. Are we ready for autonomous driving? the kitti vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on (2012), IEEE, pp. 3354–3361.
 [10] Jaderberg, M., Simonyan, K., Zisserman, A., et al. Spatial transformer networks. In Advances in Neural Information Processing Systems (2015), pp. 2017–2025.

 [11] Jason, J. Y., Harley, A. W., and Derpanis, K. G. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In Computer Vision–ECCV 2016 Workshops (2016), Springer, pp. 3–10.
 [12] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia (2014), ACM, pp. 675–678.
 [13] Ranjan, A., and Black, M. J. Optical flow estimation using a spatial pyramid network. arXiv preprint arXiv:1611.00850 (2016).

 [14] Revaud, J., Weinzaepfel, P., Harchaoui, Z., and Schmid, C. EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 1164–1172.
 [15] Siam, M., Valipour, S., Jagersand, M., and Ray, N. Convolutional gated recurrent networks for video segmentation. arXiv preprint arXiv:1611.05435 (2016).
 [16] Weinzaepfel, P., Revaud, J., Harchaoui, Z., and Schmid, C. Deepflow: Large displacement optical flow with deep matching. In Proceedings of the IEEE International Conference on Computer Vision (2013), pp. 1385–1392.

 [17] Yang, J., and Li, H. Dense, accurate optical flow estimation with piecewise parametric model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 1019–1027.
 [18] Zhu, Y., Lan, Z., Newsam, S., and Hauptmann, A. G. Guided optical flow learning. arXiv preprint arXiv:1702.02295 (2017).