Exploiting Geometric Constraints on Dense Trajectories for Motion Saliency

09/29/2019 ∙ by Muhammad Faisal, et al. ∙ Information Technology University, Australian National University

Existing approaches for salient motion segmentation are unable to explicitly learn geometric cues and often give false detections on prominent static objects. We exploit multiview geometric constraints to avoid such mistakes. To handle nonrigid backgrounds such as the sea, we also propose a robust fusion mechanism between motion and appearance-based features. We find dense trajectories, covering every pixel in the video, and propose trajectory-based epipolar distances to distinguish between background and foreground regions. Trajectory epipolar distances are data-independent and can be readily computed given a few feature correspondences in the images. We show that by combining epipolar distances with optical flow, a powerful motion network can be learned. To enable the network to leverage both sources of information, we propose a simple mechanism that we call input-dropout. We outperform the previous motion network on the DAVIS-2016 dataset by 5.2% in mean IoU. By fusing our motion network with an appearance network using the proposed input-dropout, we also outperform the previous methods on the DAVIS-2016, DAVIS-2017 and SegTrack-v2 datasets.




1 Introduction

Figure 1: Existing methods fail to automatically learn geometric cues between the foreground objects and the rigid background. As a result, they often give false detections on prominent static objects, as shown in this example from DAVIS [31]. By exploiting these constraints over the whole video, we avoid such mistakes.

Segmenting objects with significant motion in a video is called Salient Motion Segmentation. In contrast, segmenting the most prominent objects in an image (or a video) is Salient Appearance Segmentation. While data-driven approaches have been quite successful for the latter, we argue that the former suffers from the sparsity of video-based training data and remains ill-posed. Specifically, for a moving camera, it remains hard to learn whether the 2D projected motion field corresponds to a static object in the video or to one with independent motion. To segment out the rigid background from the independently moving foreground objects, we exploit extensively studied geometric constraints [14], over the complete video, in a learning paradigm. Unlike data-dependent learning, these constraints have closed-form solutions and can be computed very efficiently. Our method can still handle nonrigid backgrounds through the fusion of motion and appearance-based features. In Fig. 1 we give an example from DAVIS [31], showing that the previous approaches give false detections on prominent static objects, whereas the proposed approach is able to disambiguate static and nonstatic objects. This shows that the existing deep networks are unable to automatically learn the geometric cues even when optical flow is provided as an input.

To exploit multiview geometric constraints, we convert optical flow between consecutive frames into dense trajectories, covering every pixel in the video, and then use trifocal tensors to find epipolar distances [14] for them. The trajectory epipolar distance serves as a measure of (non)rigidity: a small distance corresponds to the rigid background, and a large distance implies a foreground object.

Trajectory epipolar distances capture a temporally global constraint on foreground and background regions, whereas optical flow only captures temporally local information. However, the former is quite sensitive to optical flow errors. In essence, the two are complementary, and by combining them, powerful features for motion saliency can be learned. Given trajectory epipolar distances and optical flow as input, we build an encoder-decoder based network [35], called EpO-Net. We devise a strategy called input-dropout, enabling the network to learn robust motion features and handle failure cases of one of the two inputs.

EpO-Net brings two key advantages over the existing motion network, Mp-Net [41]. 1) EpO-Net exploits geometric constraints over a large temporal window, whereas Mp-Net makes suboptimal decisions based on temporally local optical flow information. Consequently, as we show, EpO-Net can be trained on less training data while generalizing better than Mp-Net. 2) In contrast to Mp-Net, EpO-Net does not require any objectness score on top of the estimated motion saliency map. The main reason is that we train our network on a more realistic synthetic dataset, which we prepare by inserting synthetic foreground objects into real background videos, whereas Mp-Net was trained on unrealistic synthetic flying objects.


Being a motion-only network, EpO-Net suffers from optical flow errors. It also cannot handle a nonrigid background. To handle these cases, we exploit appearance [3] along with motion-based features in the form of a joint network, EpO-Net+. Using the proposed input-dropout strategy, we show that the EpO-Net+ is robust against the failure cases of individual motion and appearance-based features.

To the best of our knowledge, ours is the first method to combine geometric constraints in a learning paradigm for motion segmentation. Our paper has three main contributions. 1) A motion network based on trajectory epipolar distances and optical flow. 2) The RBSF dataset, which can be used to train salient motion segmentation. Applications like video annotation [10], object tracking [53], and video anomaly detection [51] can use our network and the dataset to exploit geometric constraints on the rigid world. The source code of our method as well as the dataset is available at https://github.com/mfaisal59/EpONet. 3) The input-dropout technique, which can be used to robustify early or late fusion of features in deep architectures. Our motion network outperforms Mp-Net on DAVIS-2016 [31] by a significant margin of 5.2% in mean IoU score and is quite close to other recent methods exploiting additional appearance features. The proposed joint network also demonstrates significant improvement over the previous methods on DAVIS (2016 [31] & 2017 [34]) and SegTrack-v2 [22].

2 Related Work

Recently, video object segmentation (VOS) has been gaining interest [17, 41, 42, 40, 15, 6, 20], largely thanks to new challenging benchmark datasets. Existing approaches can be categorized by their degree of supervision. Supervised [28, 5] or interactive approaches assume that user input in the form of scribbles is available at multiple instances, helping the algorithm refine the results. Semi-supervised methods [15, 52, 16, 23, 1, 26, 24] assume that the segmentation is given for at least the first frame, reducing the problem to label propagation. For brevity, we discuss below only a few prominent unsupervised methods.

In unsupervised settings, the motion-saliency constraint is enforced to make the problem tractable. Many methods try to capture motion information across multiple frames, mostly by constructing long sparse point trajectories [2, 11, 30, 38]. Salient object segmentation is then reduced to clustering these trajectories [19] and converting them into dense points [29]. Among the other early approaches, a few [21, 22, 25, 54, 32] extract object proposals [8] and try to connect the proposals temporally. These trajectory-based methods are not robust because they rely heavily on feature matching, which may fail due to occlusion, fast motion, dynamic background, and appearance change.

Recently, deep learning based methods have been used to solve the video segmentation problem. Broadly, these techniques have three components: a network to capture the motion information, a network to extract appearance and enforce object boundaries, and a temporal memory so that a decision made at one frame is propagated to others [41, 42, 17, 6, 40]. Among all these approaches, Mp-Net [41] and LVO [42] are very close to our method. Mp-Net constructs an encoder/decoder based network to segment the optical flow into salient and non-salient regions. The encoder/decoder network is trained on a large synthetic dataset [27] and then fine-tuned on DAVIS [31]. Since the motion information they learn is not sufficient, they rely on the objectness score [33] to clean their results. LVO [42] builds on Mp-Net, using a bi-directional ConvGRU to propagate the information across the other frames. Their results improve drastically (LSMO [43]) by just using a better optical flow estimation and appearance model (DeepLab v2 instead of DeepLab v1). MotAdapt [39] uses a teacher-student learning paradigm, where the teacher provides pseudo labels using the optical flow and image as input.

AGS [48] explores the concepts of video saliency or dynamic fixation prediction, arguing that UVOS (unsupervised video object segmentation) is closely related to video saliency [46]. The authors trained a visual attention module on dynamic fixation data, collected by tracking the eyes of viewers watching videos. Unlike AGS, which requires data gathered by tracking the gaze of viewers, we model the concept of motion saliency by exploiting the information (geometric constraints) inside the video itself and do not require extra data.

Early methods by Torr [44], Sheikh et al. [37], and Tron and Vidal [45] try to exploit motion models. [37] and [45] exploited trajectory information to separate the foreground and background objects. Many recent methods [20, 18, 15] have relied on the previous trajectory-based segmentation work, using deep features for image saliency and optical flow for motion saliency to construct a neighbourhood graph. [36] used optical flow-based point trajectories to propagate user input scribbles. [47] clusters neighbouring trajectories to create super-trajectories, which along with the mask of the first frame are used for video segmentation. However, these methods have not exploited geometry-based constraints, relying instead on heuristics and complex pipelines.

Our work relies on all three techniques. We use optical flow to build trajectories and a geometry-based technique to penalize the trajectories that do not follow the geometric constraints. To make our deep learning models robust, we design the input-dropout technique for training. To the best of our knowledge, our approach is the first that tries to combine CNNs and geometric constraints for video object segmentation.

3 Epipolar Constraints on Dense Trajectories

Figure 2: An illustration of multiview geometric constraints on rigid points. A 3D rigid line (red) is viewed by a moving camera at different times. The back-projections of its 2D projections should meet at the actual line. In contrast, the 2D projections of a 3D nonrigid point (orange) are not constrained to lie on any 2D lines. This relationship can be captured by finding trifocal tensors and the corresponding fundamental matrices. In contrast to rigid points, a nonrigid point may not lie on the corresponding epipolar lines, and its epipolar distances can be used as a measure of nonrigidity.

Existing methods for salient motion segmentation use appearance and optical flow based features to distinguish foreground from background. These features are not geometry inspired; they are learned from the data and alone do not provide enough constraints for the rigid background. We propose geometry-inspired features and leverage them in a learning pipeline. We use trifocal tensors to constrain the rigid background in the video and propose epipolar distances for the dense trajectories as a measure of nonrigidity (see Fig. 2).

We first find the forward and backward optical flow for all frames of the video using [4], and then convert it into dense trajectories covering every pixel in the video. Each trajectory is a vector of 2D image coordinates, one per frame, and may contain missing values due to occlusion of pixels. The number of trajectories can exceed the number of pixels in a frame, because for every occlusion new pixels appear. We use forward and backward optical flow consistency to find occluding regions. We stack all the trajectories into a sparse matrix.
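The forward-backward consistency check mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the nested-list flow layout, the nearest-neighbour lookup, the function name and the 1-pixel threshold are all assumptions made for illustration.

```python
def occlusion_mask(fwd, bwd, thresh=1.0):
    """Flag pixels whose forward and backward flows disagree.

    fwd[y][x] = (dx, dy) maps frame t to t+1; bwd maps t+1 back to t.
    The threshold (in pixels) is an assumed value, not from the paper.
    """
    h, w = len(fwd), len(fwd[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = fwd[y][x]
            # Nearest-neighbour landing position in frame t+1.
            x2, y2 = int(round(x + dx)), int(round(y + dy))
            if not (0 <= x2 < w and 0 <= y2 < h):
                mask[y][x] = True  # the trajectory leaves the frame
                continue
            bx, by = bwd[y2][x2]
            # A consistent pixel returns to its start: fwd + bwd is near 0.
            err = ((dx + bx) ** 2 + (dy + by) ** 2) ** 0.5
            mask[y][x] = err > thresh
    return mask
```

A trajectory would be terminated at a flagged pixel, and a new one started for each pixel in the next frame that no existing trajectory reaches.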

Figure 3: An illustration of exploiting the complete trajectories to find epipolar distances. Part of the bear remains static in this frame and the previous frame, giving small epipolar distance, shown in the middle. Since trajectories aggregate these distances over their full time-span, the trajectory-based epipolar distances are still high for almost the complete bear.

Once trajectories are found, we estimate the dominant rigid background by finding the trifocal tensors in every three consecutive frames, using the six-point algorithm [14] (Algorithm 20.1, page 511, Hartley & Zisserman, 2nd Ed.) and RANSAC. We convert each trifocal tensor to the corresponding six pairwise fundamental matrices [14] (Algorithm 15.1, page 375, Hartley & Zisserman, 2nd Ed.). When the camera is static and the optical flow is zero for the background, the estimation of the trifocal tensor can become degenerate. Any skew-symmetric matrix, in this case, would be a valid fundamental matrix. To avoid degeneracy, we first detect whether the camera remains static by checking if at least 50% of the pixels have zero optical flow in the current triplet of frames. If so, we initialise the fundamental matrices to arbitrary skew-symmetric matrices.
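To see why any skew-symmetric matrix works in the static case: background points do not move, so each correspondence satisfies x2 = x1, and x^T S x = 0 holds for every skew-symmetric S. The sketch below checks this numerically and adds a simple static-camera test. The function names, the zero tolerance and the list-based layout are illustrative assumptions; only the 50% criterion comes from the text.

```python
def skew(a, b, c):
    """3x3 skew-symmetric matrix built from the vector (a, b, c)."""
    return [[0.0, -c, b], [c, 0.0, -a], [-b, a, 0.0]]


def epipolar_residual(F, x1, x2):
    """Algebraic epipolar residual x2^T F x1 for homogeneous points."""
    Fx1 = [sum(F[i][j] * x1[j] for j in range(3)) for i in range(3)]
    return sum(x2[i] * Fx1[i] for i in range(3))


def camera_is_static(flow, zero_tol=0.0, min_fraction=0.5):
    """True if at least min_fraction of the pixels have (near-)zero flow.

    flow[y][x] = (dx, dy). zero_tol is an assumed tolerance; the text
    only speaks of 'zero optical flow'.
    """
    pixels = [v for row in flow for v in row]
    n_zero = sum(1 for dx, dy in pixels
                 if abs(dx) <= zero_tol and abs(dy) <= zero_tol)
    return n_zero >= min_fraction * len(pixels)
```

For a static background point x, `epipolar_residual(skew(a, b, c), x, x)` vanishes for any choice of (a, b, c), so every such matrix satisfies the epipolar constraint for the background.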

We find the epipolar distances for the triplet as follows. Let x_1, x_2 and x_3 denote the homogeneous 2D coordinates of a trajectory in the three selected frames. The distance between x_1 and x_2 is measured from x_2 to the epipolar line in frame 2 that corresponds to x_1 in frame 1. Normalizing the line with respect to its magnitude gives the normalized epipolar distance. The triplet epipolar distance is then obtained from the pairwise distances within the triplet.

The epipolar distance for the trajectory is computed as the mean of all triplet epipolar distances along this trajectory. Concatenating all the trajectory epipolar distances gives a single matrix of distances.
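The point-to-epipolar-line distance and the trajectory-level mean can be sketched as follows. This is a minimal illustration, not the authors' code: the fundamental matrix `F`, the function names and the use of `None` for missing values are assumptions. How the pairwise distances are combined within a triplet is not reproduced here, so only a single frame pair and the trajectory mean are shown.

```python
import math


def epipolar_distance(F, x1, x2):
    """Distance from x2 to the epipolar line l = F x1.

    x1 and x2 are homogeneous 2D points with last coordinate 1. The line
    is normalized by the magnitude of its first two components, which
    turns the algebraic residual into a distance in pixels.
    """
    line = [sum(F[i][j] * x1[j] for j in range(3)) for i in range(3)]
    norm = math.hypot(line[0], line[1])
    if norm == 0.0:
        return 0.0  # degenerate line; treated as zero distance here
    return abs(sum(line[i] * x2[i] for i in range(3))) / norm


def trajectory_distance(triplet_distances):
    """Mean of the available triplet distances; None marks a missing one."""
    valid = [d for d in triplet_distances if d is not None]
    return sum(valid) / len(valid) if valid else None
```

As a sanity check, for a camera translating purely along the x-axis the epipolar lines are horizontal, so a background point may shift horizontally at zero cost while any vertical displacement shows up directly as epipolar distance.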

We assign the epipolar distance of a trajectory to all the constituent pixels. Hence, the proposed approach can deal with parts of the foreground object that remain static for a few frames but were in motion otherwise. As we show in Fig. 3 the epipolar distance estimated based on the current and the previous frame is quite small for the static part of the bear, whereas the trajectory-based epipolar distance is able to detect a significant part of the bear. Trajectory epipolar distances help us find powerful motion features for video segmentation, as we show in the next section.
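Assigning one distance per trajectory to all of its pixels can be sketched as below. The data layout is a hypothetical one chosen for brevity: each trajectory is a dict from frame index to an integer pixel position, with missing frames standing for occlusion.

```python
def ed_map(trajectories, distances, frame, height, width):
    """Per-frame epipolar-distance map built from trajectory distances.

    trajectories[i] maps frame index -> (x, y) pixel position;
    distances[i] is the (mean) epipolar distance of trajectory i.
    Pixels not covered by any trajectory keep the value 0.0.
    """
    ed = [[0.0] * width for _ in range(height)]
    for traj, dist in zip(trajectories, distances):
        pos = traj.get(frame)
        if pos is None or dist is None:
            continue  # trajectory absent in this frame or without a distance
        x, y = pos
        ed[y][x] = dist
    return ed
```

Because the value written at a pixel comes from the whole trajectory, a pixel that is momentarily static still receives a large distance if its trajectory moved independently in other frames.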

4 Approach

Figure 4: Flow diagram depicting different parts & information transition in the algorithm. Top Row: steps to compute the motion trajectories & Epipolar Distance. Bottom row: (Left) Deep-Lab based Appearance Network trained to compute the Appearance Features. (Right) Motion-Images (Optical Flow & Epipolar Distance) fed to EpO, which outputs motion saliency map. (Middle) Motion-saliency map concatenated with appearance features are fed into the bidirectional convGRU.

The proposed pipeline consists of three distinct stages. 1) Our motion network, EpO-Net, takes motion images (a concatenation of optical flow and epipolar distances) as input and outputs a motion-saliency map. 2) In parallel, an appearance network computes appearance features to extract scene context and object information [3]. 3) Our joint network, EpO-Net+, fuses the appearance features and the motion-saliency map with a bidirectional ConvGRU and outputs the saliency mask. We introduce input-dropout, a mechanism for robustly fusing noisy input feature maps. We discuss these stages in detail below.

4.1 Motion Images

Given an input video, we compute optical flow, convert it into dense trajectories, find trajectory epipolar distances and convert them into per-frame Epipolar Distances (ED). ED has a temporally larger receptive field, assigning a large weight to the foreground and a lower weight to the background, but is sensitive to optical flow errors. In contrast, optical flow captures temporally local but relatively robust information, containing motion patterns that distinguish foreground from background.

These two sources of information are complementary. To exploit both and learn motion features from them, we merge the 2-channel optical flow vectors with ED to get a 3-channel image, which we call a motion image. The main challenge in fusion is to identify when both sources are reliable and when only optical flow should be used.
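Building the 3-channel input amounts to stacking the two flow channels with the ED channel. The sketch below assumes nested-list maps and a hypothetical function name; any normalization of the channels is not described in the text and is therefore left out.

```python
def make_motion_image(flow, ed):
    """Stack 2-channel optical flow and 1-channel ED into a motion image.

    flow[y][x] = (dx, dy); ed[y][x] = per-pixel epipolar distance.
    Returns image[y][x] = (dx, dy, ed).
    """
    if len(flow) != len(ed) or any(len(f) != len(e) for f, e in zip(flow, ed)):
        raise ValueError("flow and ED maps must have the same size")
    return [[(dx, dy, d) for (dx, dy), d in zip(frow, erow)]
            for frow, erow in zip(flow, ed)]
```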

4.2 Epipolar Optical flow Network (EpO-Net)

Given a motion image as input, we design an encoder-decoder deep network, in the fashion of U-Net [35], that outputs a motion-saliency map. The latent space after the encoder captures the context of the whole motion image, the different motion patterns and their relationship with ED. The decoder, on the other hand, unravels this context to decide about each pixel. Skip layers give the decoder access to local information [50] collected from the lower layers of the encoder, which it uses together with the context to make a decision at pixel level.

In our network, we use four encoders followed by four decoders, where each block consists of a convolution layer followed by batch normalization, ReLU activation and max-pooling layers. Unlike Mp-Net, our more informative input allows us to use fewer channels before the final classification layer (128 instead of 512). The motion-saliency map is produced by a final sigmoid layer. A CRF is used to clean the output.

4.3 Joint Network (EpO-Net+)

Any algorithm based solely on motion information will struggle to define object boundaries and will be confused by a nonrigid background. We use the pre-trained Deep-Lab [3] features and fuse them with our motion network, similar to LVO [42]. Although the FC6 layer of Deep-Lab has only a fraction of the spatial size of the original image, it still captures important information about the objects, their boundaries and the nonrigid background. Please note that our appearance features are quite generic; customized appearance networks for video segmentation can produce better results. We use such generic features to demonstrate the significance of the proposed motion network.

We train the bottleneck layer to reduce the appearance features from 1024 to 128 channels and concatenate them with the down-sampled output of EpO-Net. To exploit temporal continuity in the joint features and build global context, we append a bi-directional Convolutional Gated Recurrent Unit (ConvGRU) at the end of our network. To robustly handle motion network failures in the case of a nonrigid background, we introduce input-dropout, discussed next.

5 Challenges in Training

The proposed architecture fuses features that encapsulate information at different spatial and temporal receptive fields, at different stages of the network. Enabling the network to properly learn the concept of motion saliency and to fuse motion and appearance features required contributions in both the training methodology and the dataset.

5.1 RBSF Dataset

The training sequences in DAVIS 2016 are too few to train a robust motion network. We find that the F3DT [27] and PHAV [7] datasets are not very useful for us. F3DT has holes and the objects' motion is quite fast. PHAV has a lower resolution than DAVIS, and its ground-truth optical flow is noisy because of JPEG compression. We create our own synthetic dataset, called RBSF (Real Background, Synthetic Foreground), by mixing 20 different foreground objects performing various movements with 5 different real background videos. The fairly large size (30% to 50% of the frame) and reasonably fast motion of the objects allow us to compute accurate optical flow and long trajectories. We observe that generating more data does not improve results, thanks to the well-constrained epipolar distances. After training on RBSF, we fine-tune EpO-Net on DAVIS-2016 [31]. A few example frames from the RBSF dataset are shown in the supplementary material.

5.2 Feature Fusion & Input-Dropout

The main challenge in devising a robust fusion mechanism is to identify when to rely on one of the two input feature volumes. Intuitively, epipolar distance and optical flow should be fused early, so that they can help each other. However, determining their usefulness requires contextual information and can only be done in deeper layers of the network. By that time, the learned features have already mixed the input channels. Therefore, training with more data or for more iterations might not improve the results.

Such problems are usually solved by introducing early and late fusion of the features, and their combination, which requires complex network designs with skip layers going from one part to another. Instead, we choose a much simpler method, called Input-Dropout Training. While training EpO-Net, we randomly set the complete ED channel to zero for some of the sequences that have erroneous ED maps (sequences with dynamic background and occlusion). For the rest, the motion images are unaltered. This is done for the initial 10 epochs, allowing the filters to give more importance to the optical flow. After that, we repeat the same procedure but assign random values instead of zero, forcing the network to learn filters diverse enough to capture the motion information from the optical flow, the ED and their combination separately. With input-dropout, EpO's mean IoU increases from 72.7% to 75.2% (Table 6).
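The two-phase schedule described above can be sketched as a data-augmentation step applied to one motion image. This is a hedged illustration, not the authors' code: the function name, the per-sample drop probability, the Gaussian form and standard deviation of the random values, and the tuple layout (dx, dy, ed) are assumptions. Restricting the dropout to sequences with erroneous ED maps is left to the caller.

```python
import random


def input_dropout(motion_image, epoch, rng, drop_prob=0.2,
                  zero_epochs=10, noise_std=0.1):
    """Return a copy of the motion image with the ED channel possibly replaced.

    motion_image[y][x] = (dx, dy, ed). With probability drop_prob the
    whole ED channel is replaced: by zeros during the first zero_epochs
    epochs, by random values afterwards. The flow channels are untouched.
    """
    if rng.random() >= drop_prob:
        return [list(row) for row in motion_image]  # unaltered copy
    out = []
    for row in motion_image:
        new_row = []
        for dx, dy, _ed in row:
            ed = 0.0 if epoch < zero_epochs else rng.gauss(0.0, noise_std)
            new_row.append((dx, dy, ed))
        out.append(new_row)
    return out
```

Dropping the whole channel, rather than individual pixels as in standard dropout, forces the network to produce a usable saliency map from optical flow alone whenever ED is unreliable.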

The late fusion of appearance and motion features in the joint network can exploit the same input-dropout strategy. We randomly set the motion-saliency map to zero for a few frames of the sequences where the motion network fails (sequences with dynamic background and occlusion). Input-dropout also improves the mean IoU of the joint network. The complete network, containing all the above stages and layers, is called EpO-Net+.

6 Experiments

Figure 5: Qualitative comparison of our EpO-Net with Mp-Net [41]. Columns, left to right: ground truth, X-displacement, Y-displacement, ED, motion images, EpO-Net, Mp-Net [41].

We train and evaluate on RBSF (Sec. 5.1), DAVIS-2016 [31], DAVIS-2017 [34] and SegTrack-v2 [22]. Below we detail our training parameters and evaluation results.

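Method   AC            DB            FM            MB            OCC           Mean
Mp-Net   0.71 (-0.02)  0.58 (0.14)   0.68 (0.04)   0.65 (0.10)   0.69 (0.01)   0.700
EpO-Net  0.77 (-0.03)  0.63 (0.14)   0.72 (0.06)   0.67 (0.14)   0.67 (0.11)   0.752
Table 1: EpO-Net vs. Mp-Net [41] on the DAVIS-2016 dataset. Values in parentheses give the change in performance on the remaining sequences without the respective attribute.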

6.1 Implementation Details

EpO is trained using mini-batch SGD with a batch size of 12, an initial learning rate of 0.001, a momentum of 0.9, and a weight decay of 0.005. The network is trained from scratch for 50 epochs, with the learning rate and weight decay reduced by a factor of 0.1 after every 5 epochs. We down-sample the images by a factor of 0.5 to fit a batch of 12 images in the GPU memory.
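Read literally, the schedule above is a step decay. The sketch below shows that reading; the function name and the zero-based epoch indexing are assumptions.

```python
def step_decay(initial, epoch, factor=0.1, step=5):
    """Value of a hyper-parameter after step decay.

    The value is multiplied by `factor` once every `step` epochs,
    with epochs counted from zero.
    """
    return initial * factor ** (epoch // step)
```

With the stated settings, `step_decay(0.001, epoch)` gives the learning rate and `step_decay(0.005, epoch)` the weight decay at a given epoch.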

We train EpO in two stages: first on the synthetic dataset, RBSF (Sec. 5.1), and then fine-tuning on DAVIS-2016. In both stages, we perform input-dropout on the epipolar channel for only 20% of the training data, i.e. randomly assigning zero and adding small random Gaussian noise in the epipolar channel. We call the final trained model EpO, and the one trained only on RBSF EpO-RBSF.

The fusion network is trained only on the DAVIS-2016 training set, resulting in EpO+. We use a batch size of 12 and an initial learning rate of 0.001, which is decreased by a constant factor after every epoch. The model is trained using back-propagation through time [49] with binary cross-entropy loss and the RMSProp optimizer. The weights of all the layers in the fusion network are initialized using Xavier initialization [12], except for those in the ConvGRU. We clip the gradients to [-50, 50] before each update step [13] to avoid numerical issues. For robust fusion, we again use the input-dropout mechanism, setting the motion-saliency map to zero for 20% of the frames of sequences with fast motion and dynamic background. We also perform random cropping and flipping of sequences during training. We train for 50 epochs, including the bottleneck layer in the fusion network. During inference, the final output is refined using a CRF. To test on DAVIS-2017, we fine-tune EpO-RBSF and EpO on the DAVIS-2017 training set.

Measure   EpO+        EpO         AGS[48]     MOA[39]     LSMO[43]    STP[15]     PDB[40]     ARP[20]     LVO[42]     Mp-Net[41]  FSeg[17]    SFL[6]
J Mean    0.806       0.752       0.797       0.772       0.782       0.776       0.772       0.762       0.759       0.700       0.707       0.674
J Recall  0.952       0.888       0.911       0.878       0.891       0.886       0.901       0.911       0.891       0.850       0.835       0.814
J Decay   0.022       0.053       0.019       0.050       0.041       0.044       0.009       0.070       0.000       0.013       0.015       0.062
F Mean    0.755       0.711       0.774       0.774       0.759       0.750       0.745       0.706       0.721       0.659       0.653       0.667
F Recall  0.879       0.830       0.858       0.844       0.847       0.869       0.844       0.835       0.834       0.792       0.738       0.771
F Decay   0.024       0.043       0.016       0.033       0.035       0.042       -0.002      0.079       0.013       0.025       0.018       0.051
T Mean    0.185       0.388       0.267       0.279       0.212       0.243       0.277       0.384       0.255       0.563       0.316       0.282
Table 2: Comparison of our motion (EpO) and fusion network (EpO+) with the state of the art on DAVIS-2016, using intersection over union (J), F-measure (F), and temporal stability (T). Best and second-best scores are bold and underlined, respectively. AGS uses eye-gaze data to train its network, whereas we only exploit information present in the videos themselves by enforcing the geometric constraints.
Attribute  EpO+           AGS[48]        MOA[39]        LSMO[43]       STP[15]
AC         0.83 (-0.04)   0.80 (-0.01)   0.78 (-0.01)   0.78 (+0.00)   0.72 (+0.07)
DB         0.72 (+0.10)   0.66 (+0.16)   0.61 (+0.20)   0.55 (+0.27)   0.66 (+0.15)
FM         0.78 (+0.04)   0.77 (+0.04)   0.74 (+0.05)   0.73 (+0.08)   0.75 (+0.04)
MB         0.78 (+0.06)   0.74 (+0.10)   0.71 (+0.10)   0.73 (+0.10)   0.74 (+0.06)
OCC        0.75 (+0.08)   0.76 (+0.05)   0.78 (-0.02)   0.74 (+0.06)   0.81 (-0.05)
Table 3: Attribute-based analysis of the top performing methods on the DAVIS-2016 dataset. The mean IoU is computed over all sequences with a specific attribute: appearance change (AC), dynamic background (DB), fast motion (FM), motion blur (MB), and occlusion (OCC). The values in parentheses (small font in the original) indicate the change in performance (gain or loss) for the method on the remaining sequences without the respective attribute.

6.2 Evaluation

We follow the standard training and validation split, and train and evaluate using the protocol proposed in [31]. We compute intersection-over-union (J), F-measure (F), and temporal stability (T), the latter two measuring contour accuracy and smoothness of the segmentation over time, respectively. The evaluation results are summarized in Table 2.
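For reference, the intersection-over-union measure and its mean and recall statistics can be sketched as follows. The function names and the nested-list mask layout are assumptions, as are the convention of scoring two empty masks as 1.0 and the 0.5 recall threshold, which follow common practice for this benchmark rather than anything stated here.

```python
def iou(pred, gt):
    """Intersection over union of two binary masks of equal size."""
    inter = union = 0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_and_recall(ious, thresh=0.5):
    """Mean IoU and the fraction of scores above the threshold."""
    mean = sum(ious) / len(ious)
    recall = sum(1 for v in ious if v > thresh) / len(ious)
    return mean, recall
```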

6.2.1 Motion Network

By exploiting geometric constraints in salient motion segmentation, our motion-only network EpO scores a mean IoU of 0.752 on the DAVIS-2016 validation set (Table 2). This is much higher than the 0.700 of Mp-Net [41], which also relies on non-motion features (objectness score), and is competitive with LVO, which uses a bi-directional ConvGRU and appearance information in addition to optical flow. EpO, in contrast, only uses motion images (optical flow and ED).

A qualitative comparison of EpO-Net with Mp-Net is given in Fig. 5. It is evident from the columns of Fig. 5 that ED and optical flow complement each other, and that the results are robust against the failure of one of these inputs. When the optical flow is too small (row 1), or is in the same direction as the camera motion (row 3), ED helps distinguish the object. Similarly, when the ED score is sporadically bad (rows 2 and 4), the optical flow information helps distinguish the object, largely thanks to the robust motion features learned with input-dropout training. Mp-Net, in contrast, makes local decisions and is unable to recover from optical flow errors (rows 4 and 6). It is also unable to distinguish the salient object when the camera and the object have similar motion (row 3).

6.2.2 EpO+

By combining the motion-saliency map obtained from EpO with the appearance features and adding temporal memory, EpO+ outperforms its direct competitors LVO and LSMO by significant margins of 4.7% and 2.4% in mean IoU. EpO+ even outperforms recently published works like AGS [48], which requires training on a dynamic fixation dataset collected by tracking the gaze of viewers, both in mean IoU and in its recall. It is also important to note that our mean temporal stability is substantially better than the rest, indicating the effectiveness of our formulation. Our attribute analysis is given in Table 3. Our method outperforms the baselines in all categories except occlusion.
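The attribute analysis of Table 3 can be reproduced from per-sequence scores with a few lines. The sketch below is illustrative only: the function name and the data layout are assumptions, and the sign convention of the reported change (remaining sequences minus sequences with the attribute) is inferred from the table caption.

```python
def attribute_breakdown(scores, attributes, attr):
    """Mean IoU on sequences with an attribute, and the change without it.

    scores: sequence name -> mean IoU of that sequence.
    attributes: sequence name -> set of attribute tags (e.g. "DB").
    Returns (mean_with, mean_without - mean_with).
    """
    with_attr = [s for name, s in scores.items()
                 if attr in attributes.get(name, ())]
    without = [s for name, s in scores.items()
               if attr not in attributes.get(name, ())]
    if not with_attr or not without:
        raise ValueError("attribute must split the sequences into two groups")
    mean_with = sum(with_attr) / len(with_attr)
    mean_without = sum(without) / len(without)
    return mean_with, mean_without - mean_with
```

A large positive change means the method does much better once sequences with that attribute are excluded, i.e. the attribute is hard for the method.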

A qualitative comparison of EpO+ with the state-of-the-art algorithms is presented in Fig. 6. AGS fails to properly segment the moving objects in some of the rows. Most of the errors in the previous methods are over-segmentations caused by over-exploitation of appearance information. We attribute this to their inability to exploit or learn enough constraints for motion saliency. The proposed method, thanks to its more informative motion features (based on geometric constraints) and the input-dropout training procedure, is able to learn how to balance appearance and motion cues. For details, see the supplementary material.

6.2.3 Evaluation on other datasets

DAVIS-2017: We fine-tune EpO-RBSF and EpO+ on the DAVIS-2017 training sequences. We could not find comparable published results, so we report ours in Table 5 for future comparison.

SegTrack-v2: Evaluation results of EpO+ and EpO on the SegTrack-v2 [22] dataset are presented in Table 4. Our results are better than the existing state of the art, including STP [15], although by a small margin; this could be attributed to the difference in resolution between the SegTrack-v2 videos and those of DAVIS-2016. Removing birdfall, the only sequence on which we perform poorly, improves the result to 72.8 (Table 4). AGS [48] uses both SegTrack-v2 and DAVIS for training and therefore does not evaluate on this dataset. Note that, since NLC [9] reports results only on a subset of sequences, the results in Table 4 were taken from [42, 15].

Figure 6: Qualitative comparison with state-of-the-art methods on DAVIS-2016. Columns, left to right: ground truth, LVO [42], STP [15], MotAdapt [39], AGS [48], ours.

Mean IoU 57.3 67.2 61.4 57.3 59.1 70.1 68.3 70.9
Table 4: EpO+ results on the SegTrack-v2 dataset [22]. We perform poorly on only one sequence (birdfall). Removing it increases our mean IoU to 72.8.

Method   AC            DB            FM            MB            OCC           Mean
EpO      0.67 (-0.02)  0.56 (0.10)   0.62 (0.04)   0.57 (0.11)   0.59 (0.08)   0.652
EpO+     0.79 (-0.04)  0.72 (0.05)   0.74 (0.03)   0.72 (0.06)   0.66 (0.13)   0.763
Table 5: Results on the DAVIS-2017 dataset.

           Input Modality
#enc/dec   ED      OF      ED+OF
2          57.2    54.7    62.7
3          58.9    59.7    64.4
4          49.2    63.3    67.5

EpO Variant    Mean IoU
EpO(R)         48.5
EpO(D)         72.7
EpO(R)+Drop    50.6
EpO(D)+Drop    75.2
Table 6: Left (top): Effect of different input modalities (epipolar distance ED, optical flow OF, and their combination as motion images) against network depth. Right (bottom): Effect of dropout in the epipolar channel of motion images; R and D denote the RBSF and DAVIS datasets, respectively.

6.3 Ablation Study

In this section, we present the study on the impact and effectiveness of different design choices. We first analyze the influence of different input modalities and depth of the network architecture by training and validating on DAVIS-2016 dataset. Specifically, we use the single-channel epipolar distance, 2 channel optical flow i.e. X-Y displacement, and the combination of the both as 3 channel motion images. For each input modality, we train and validate EpO network with two, three and four-layer encoders/decoders to study which modality needs the deeper network.

In Table 6, we observe that ED being a very simple yet informative feature, the epipolar alone network requires less number of parameters to learn, implying that they should not require (i) deep network, ii) large datasets. In contrast, optical flow, being a complex information for motion saliency, requires more number of encoders and decoders. Since small errors in optical flow, get accumulated in trajectories estimation and result in quite noisy epipolar distances, optical flow with 4 encoders/decoders architecture beats the epipolar network, with 63.3% mean IoU using 4 encoders/decoders architecture. However, when we combine both, in the form of motion images, the accuracy further improves by 4.2% as compared to optical flow based 4 encoders/decoders network. This shows that the combination is able to exploit both the global temporal geometric information and local temporal motion information distinguishing foreground and background. Note all the experiments are performed using the same hyper-parameters stated in Sec. 6.1, epipolar dropout strategy is not used, and all models are trained for 30 Epochs only.

Next, we demonstrate the effectiveness of our dataset RBSF and of the input-dropout in Table 6. The mean IoU on DAVIS-2016 when training on the proposed dataset is 48.5%, which increases to 72.7% with fine-tuning on the DAVIS training set. Compared with our best Ep+OF combination with 4 encoders/decoders, the increase is 5.3%, showing the significance of the proposed dataset. With the proposed dropout, the results further improve by 2.5% (from 72.7% to 75.2%), showing the effectiveness of the input-dropout.
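One plausible reading of dropout in the epipolar channel is sketched below: during training, the epipolar channel of a motion image is zeroed with some probability, so that the network cannot rely on that channel alone. This is only an illustrative sketch. The function name `input_dropout`, the per-sample (rather than per-pixel) masking, the channel index and the probability values are all hypothetical and are not taken from our implementation.

```python
import numpy as np

def input_dropout(motion_image, drop_prob, rng, epipolar_channel=2):
    """Return a copy of an (H, W, 3) motion image whose epipolar channel
    is zeroed with probability drop_prob (hypothetical per-sample variant)."""
    out = np.array(motion_image, dtype=np.float32, copy=True)
    if rng.random() < drop_prob:
        # Drop the whole epipolar channel; the flow channels are untouched.
        out[..., epipolar_channel] = 0.0
    return out

rng = np.random.default_rng(0)
img = np.ones((4, 6, 3), dtype=np.float32)

# drop_prob=1.0 always drops the channel, drop_prob=0.0 never does.
dropped = input_dropout(img, drop_prob=1.0, rng=rng)
kept = input_dropout(img, drop_prob=0.0, rng=rng)
```

At test time such a scheme would pass the motion image through unchanged, which corresponds to `drop_prob=0.0` here.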

We also study the effect of the GRU sequence length. As expected, the mean IoU improves when we increase the sequence length from 6 to 12. A considerable improvement is seen in videos with occlusion. Finally, we observe that the velocity representation of optical flow gives better results than the angle-magnitude representation. A qualitative review of the dataset made us realize that the channel representing angle information is not robust to optical flow errors. Even for humans, inferring motion patterns just by looking at this channel is quite difficult.
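The two flow representations are related by a polar transform, and a small example illustrates one way the angle channel can be sensitive to flow errors. The sketch below converts between the velocity (dx, dy) and angle-magnitude forms, then compares a nearly static pixel under two slightly different flow estimates. The function names and the toy values are assumptions for illustration; this is not the analysis behind the observation above.

```python
import math

def to_angle_magnitude(dx, dy):
    """Convert a velocity vector to (angle in radians, magnitude)."""
    return math.atan2(dy, dx), math.hypot(dx, dy)

def to_velocity(angle, magnitude):
    """Convert (angle, magnitude) back to a velocity vector (dx, dy)."""
    return magnitude * math.cos(angle), magnitude * math.sin(angle)

# A nearly static pixel whose estimated flow is +0.01 or -0.01 pixels in x.
angle_a, mag_a = to_angle_magnitude(0.01, 0.0)
angle_b, mag_b = to_angle_magnitude(-0.01, 0.0)

# The velocity representation changes by only 0.02 pixels ...
velocity_jump = math.hypot(0.01 - (-0.01), 0.0)
# ... while the angle channel jumps by pi radians.
angle_jump = abs(angle_a - angle_b)

# Round trip check for a generic vector.
dx, dy = to_velocity(*to_angle_magnitude(3.0, 4.0))
```

For small displacements, a sub-pixel flow error can therefore move the angle across its full range while the X-Y displacement barely changes.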

7 Conclusion

We exploit multiview geometric constraints to define motion saliency and compute trajectory epipolar distances as a measure of non-rigidity. By combining epipolar distances with optical flow, we train a powerful motion network and demonstrate significant improvement over the previous motion network. Unlike previous methods, the learned motion features avoid over-reliance on appearance-based features. Even without RNNs or appearance features, our motion network is competitive with the existing state of the art. With them, our method gives state-of-the-art results. The proposed learning paradigm, involving strong geometric constraints, should be useful for a number of related applications. The proposed input-dropout idea may also be useful for learning robust joint features in network fusion.


References

  • [1] L. Bao, B. Wu, and W. Liu (2018) CNN in mrf: video object segmentation via inference in a cnn-based higher-order spatio-temporal mrf. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5977–5986. Cited by: §2.
  • [2] T. Brox and J. Malik (2010) Object segmentation by long term analysis of point trajectories. In European conference on computer vision, pp. 282–295. Cited by: §2.
  • [3] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2015) Semantic image segmentation with deep convolutional nets and fully connected crfs. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. Cited by: §1, §4.3, §4.
  • [4] Q. Chen and V. Koltun (2016) Full flow: optical flow estimation by global optimization over regular grids. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4706–4714. Cited by: §3.
  • [5] Y. Chen, J. Pont-Tuset, A. Montes, and L. V. Gool (2018) Blazingly fast video object segmentation with pixel-wise metric learning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 1189–1198. Cited by: §2.
  • [6] J. Cheng, Y. Tsai, S. Wang, and M. Yang (2017) SegFlow: joint learning for video object segmentation and optical flow. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 686–695. Cited by: §2, §2, Table 2.
  • [7] C. De Souza, A. Gaidon, Y. Cabon, and A. Lopez Pena (2017) Procedural generation of videos to train deep action recognition networks. In CVPR. Cited by: §5.1.
  • [8] I. Endres and D. Hoiem (2010) Category independent object proposals. In European Conference on Computer Vision, pp. 575–588. Cited by: §2.
  • [9] A. Faktor and M. Irani (2014) Video segmentation by non-local consensus voting. In BMVC, Vol. 2, pp. 8. Cited by: §6.2.3.
  • [10] S. Feng, R. Manmatha, and V. Lavrenko (2004) Multiple bernoulli relevance models for image and video annotation. pp. 1002–1009. Cited by: §1.
  • [11] K. Fragkiadaki, G. Zhang, and J. Shi (2012) Video segmentation by tracing discontinuities in a trajectory embedding. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 1846–1853. Cited by: §2.
  • [12] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256. Cited by: §6.1.
  • [13] A. Graves (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850. Cited by: §6.1.
  • [14] R. Hartley and A. Zisserman (2003) Multiple view geometry in computer vision. Cambridge university press. Cited by: §1, §1, §3.
  • [15] Y. Hu, J. Huang, and A. Schwing (2018) Unsupervised video object segmentation using motion saliency-guided spatio-temporal propagation. In Proc. ECCV. Cited by: Figure 1, §2, §2, Figure 6, §6.2.3, Table 2, Table 3.
  • [16] Y. Hu, J. Huang, and A. G. Schwing (2018) VideoMatch: matching based video object segmentation. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VIII, pp. 56–73. Cited by: §2.
  • [17] S. D. Jain, B. Xiong, and K. Grauman (2017) Fusionseg: learning to combine motion and appearance for fully automatic segmention of generic objects in videos. arXiv preprint arXiv:1701.05384 2 (3), pp. 6. Cited by: §2, §2, Table 2.
  • [18] Y. Jun Koh, Y. Lee, and C. Kim (2018) Sequential clique optimization for video object segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 517–533. Cited by: §2.
  • [19] M. Keuper, B. Andres, and T. Brox (2015) Motion trajectory segmentation via minimum cost multicuts. In Computer Vision (ICCV), 2015 IEEE International Conference on, pp. 3271–3279. Cited by: §2.
  • [20] Y. J. Koh and C. Kim (2017) Primary object segmentation in videos based on region augmentation and reduction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 6. Cited by: §2, §2, Table 2.
  • [21] Y. J. Lee, J. Kim, and K. Grauman (2011) Key-segments for video object segmentation. In IEEE International Conference on Computer Vision, pp. 1995–2002. Cited by: §2.
  • [22] F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg (2013) Video segmentation by tracking many figure-ground segments. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2192–2199. Cited by: §1, §2, §6.2.3, Table 4, §6.
  • [23] S. Li, B. Seybold, A. Vorobyov, A. Fathi, Q. Huang, and C. J. Kuo (2018) Instance embedding transfer to unsupervised video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6526–6535. Cited by: §2.
  • [24] J. Luiten, P. Voigtlaender, and B. Leibe (2018) PReMVOS: proposal-generation, refinement and merging for video object segmentation. In 14th Asian Conference on Computer Vision, Perth, Australia, December 2-6, 2018, pp. 565–580. Cited by: §2.
  • [25] T. Ma and L. J. Latecki (2012) Maximum weight cliques with mutex constraints for video object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 670–677. Cited by: §2.
  • [26] K. Maninis, S. Caelles, Y. Chen, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. V. Gool (2019) Video object segmentation without temporal information. IEEE Trans. Pattern Anal. Mach. Intell. 41 (6), pp. 1515–1530. Cited by: §2.
  • [27] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox (2016) A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4040–4048. Cited by: §1, §2, §5.1.
  • [28] N. S. Nagaraja, F. R. Schmidt, and T. Brox (2015) Video segmentation with just a few strokes. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 3235–3243. Cited by: §2.
  • [29] P. Ochs and T. Brox (2011) Object segmentation in video: a hierarchical variational approach for turning point trajectories into dense regions. In IEEE International Conference on Computer Vision, pp. 1583–1590. Cited by: §2.
  • [30] P. Ochs and T. Brox (2012) Higher order motion models and spectral clustering. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 614–621. Cited by: §2.
  • [31] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung (2016) A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 724–732. Cited by: Figure 1, §1, §1, §2, §5.1, §6.2, §6.
  • [32] F. Perazzi, O. Wang, M. Gross, and A. Sorkine-Hornung (2015) Fully connected object proposals for video segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3227–3234. Cited by: §2.
  • [33] P. O. Pinheiro, T. Lin, R. Collobert, and P. Dollár (2016) Learning to refine object segments. In European Conference on Computer Vision, pp. 75–91. Cited by: §2.
  • [34] J. Pont-Tuset, S. Caelles, F. Perazzi, A. Montes, K. Maninis, Y. Chen, and L. Van Gool (2018) The 2018 davis challenge on video object segmentation. arXiv preprint arXiv:1803.00557. Cited by: §1, §6.
  • [35] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §1, §4.2.
  • [36] N. Shankar Nagaraja, F. R. Schmidt, and T. Brox (2015) Video segmentation with just a few strokes. In Proceedings of the IEEE ICCV, pp. 3235–3243. Cited by: §2.
  • [37] Y. Sheikh, O. Javed, and T. Kanade (2009) Background subtraction for freely moving cameras. In 2009 IEEE 12th International Conference on Computer Vision, pp. 1219–1225. Cited by: §2.
  • [38] J. Shi and J. Malik (1998) Motion segmentation and tracking using normalized cuts. In Computer Vision, 1998. Sixth International Conference on, pp. 1154–1160. Cited by: §2.
  • [39] M. Siam, C. Jiang, S. W. Lu, L. Petrich, M. Gamal, M. Elhoseiny, and M. Jägersand (2018) Video segmentation using teacher-student adaptation in a human robot interaction (HRI) setting. CoRR abs/1810.07733. Cited by: Figure 1, §2, Figure 6, Table 2, Table 3.
  • [40] H. Song, W. Wang, S. Zhao, J. Shen, and K. Lam (2018) Pyramid dilated deeper convlstm for video salient object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 715–731. Cited by: §2, §2, Table 2.
  • [41] P. Tokmakov, K. Alahari, and C. Schmid (2017) Learning motion patterns in videos. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 531–539. Cited by: §1, §2, §2, Figure 5, §6.2.1, Table 1, Table 2.
  • [42] P. Tokmakov, K. Alahari, and C. Schmid (2017) Learning video object segmentation with visual memory. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 4491–4500. Cited by: Figure 1, §2, §2, §4.3, Figure 6, §6.2.3, Table 2.
  • [43] P. Tokmakov, C. Schmid, and K. Alahari (2019) Learning to segment moving objects. International Journal of Computer Vision 127 (3), pp. 282–301. Cited by: §2, Table 2, Table 3.
  • [44] P. H. Torr (1998) Geometric motion segmentation and model selection. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 356 (1740), pp. 1321–1340. Cited by: §2.
  • [45] R. Tron and R. Vidal (2007) A benchmark for the comparison of 3-d motion segmentation algorithms. In 2007 IEEE conference on computer vision and pattern recognition, pp. 1–8. Cited by: §2.
  • [46] W. Wang, J. Shen, F. Guo, M. Cheng, and A. Borji (2018) Revisiting video saliency: a large-scale benchmark and a new model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4894–4903. Cited by: §2.
  • [47] W. Wang, J. Shen, J. Xie, and F. Porikli (2017) Super-trajectory for video segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1671–1679. Cited by: §2.
  • [48] W. Wang, H. Song, S. Zhao, J. Shen, S. Zhao, S. C. Hoi, and H. Ling (2019) Learning unsupervised video object segmentation through visual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3064–3074. Cited by: Figure 1, §2, Figure 6, §6.2.2, §6.2.3, Table 2, Table 3.
  • [49] P. J. Werbos (1990) Backpropagation through time: what it does and how to do it. Proceedings of the IEEE 78 (10), pp. 1550–1560. Cited by: §6.1.
  • [50] Z. Wojna, J. R. R. Uijlings, S. Guadarrama, N. Silberman, L. Chen, A. Fathi, and V. Ferrari (2017) The devil is in the decoder. In British Machine Vision Conference 2017, BMVC 2017, London, UK, September 4-7, 2017. Cited by: §4.2.
  • [51] T. Xiang and S. Gong (2008) Video behavior profiling for anomaly detection. IEEE transactions on pattern analysis and machine intelligence 30 (5), pp. 893–908. Cited by: §1.
  • [52] H. Xiao, J. Feng, G. Lin, Y. Liu, and M. Zhang (2018) MoNet: deep motion exploitation for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1140–1148. Cited by: §2.
  • [53] A. Yilmaz, O. Javed, and M. Shah (2006) Object tracking: a survey. Acm computing surveys (CSUR) 38 (4), pp. 13. Cited by: §1.
  • [54] D. Zhang, O. Javed, and M. Shah (2013) Video object segmentation through spatially accurate and temporally dense extraction of primary object regions. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 628–635. Cited by: §2.