Object Discovery in Videos as Foreground Motion Clustering

12/06/2018, by Christopher Xie et al., University of Washington and NVIDIA

We consider the problem of providing dense segmentation masks for object discovery in videos. We formulate the object discovery problem as foreground motion clustering, where the goal is to cluster foreground pixels in videos into different objects. We introduce a novel pixel-trajectory recurrent neural network that learns feature embeddings of foreground pixel trajectories linked in time. By clustering the pixel trajectories using the learned feature embeddings, our method establishes correspondences between foreground object masks across video frames. To demonstrate the effectiveness of our framework for object discovery, we conduct experiments on commonly used datasets for motion segmentation, where we achieve state-of-the-art performance.







1 Introduction

Discovering objects from videos is an important capability for an intelligent system. Imagine deploying a robot to a new environment. If the robot can discover and recognize unknown objects in the environment by observing them, it can better understand its workspace. In the interactive perception setting [8], the robot can even interact with the environment to discover objects by touching or pushing them. To tackle the object discovery problem, we need to answer the question: what defines an object? In this work, we consider any entity that can move or be moved to be an object, which includes various rigid, deformable, and articulated objects. We utilize motion and appearance cues to discover objects in videos.

Motion-based video understanding has been studied in computer vision for decades. In low-level vision, different methods have been proposed to find correspondences between pixels across video frames, a problem known as optical flow estimation [19, 3]. Both camera motion and object motion can result in optical flow. Since the correspondences are estimated at the pixel level, these methods are not aware of the objects in the scene, in the sense that they do not know which pixels belong to which objects. In high-level vision, object detection and object tracking in videos have been well-studied [1, 23, 17, 48, 4, 46]. These methods train models for specific object categories using annotated data. As a result, they are not able to detect or track unknown objects that have not been seen in the training data. In other words, these methods cannot discover new objects from videos. In contrast, motion segmentation methods [9, 24, 5, 33] aim at segmenting moving objects in videos, which can be utilized to discover new objects based on their motion.

In this work, we formulate the object discovery problem as foreground motion clustering, where the goal is to cluster pixels in a video into different objects based on their motion. There are two main challenges in tackling this problem. First, how can foreground objects be differentiated from the background? Based on the assumption that moving foreground objects exhibit different motion from the background, we design a novel encoder-decoder network that takes video frames and optical flow as inputs and learns a feature embedding for each pixel, where these feature embeddings are used in the network to classify pixels into foreground or background. Compared to traditional foreground/background segmentation methods [11, 20], our network automatically learns a powerful feature representation that combines appearance and motion cues from images.

Figure 1: Overview of our framework. RGB images and optical flow are fed into a recurrent neural network, which computes embeddings of pixel trajectories. These embeddings are clustered into different foreground objects.

Second, how can we consistently segment foreground objects across video frames? We would like to segment individual objects in each video frame and establish correspondences of the same object across video frames. Inspired by [9], which clusters pixel trajectories across video frames for object segmentation, we propose to learn feature embeddings of pixel trajectories with a novel Recurrent Neural Network (RNN), and then cluster these pixel trajectories with the learned feature embeddings. Since the pixel trajectories are linked in time, our method automatically establishes object correspondences across video frames by clustering the trajectories. Unlike [9], which employs hand-crafted features to cluster pixel trajectories, our method automatically learns a feature representation of the trajectories, where the RNN controls how to combine pixel features along a trajectory to obtain the trajectory features. Figure 1 illustrates our framework for object motion clustering.

Since our problem formulation aims to discover objects based on motion, we conduct experiments on five motion segmentation datasets to evaluate our method: Flying Things 3D [28], DAVIS [34, 36], Freiburg-Berkeley motion segmentation [31], ComplexBackground [29], and CamouflagedAnimal [6]. We show that our method is able to segment potentially unseen foreground objects in the test videos, and to do so consistently across video frames. Comparison with state-of-the-art motion segmentation methods demonstrates the effectiveness of our learned trajectory embeddings for object discovery. In summary, our work has the following key contributions:

  • We introduce a novel encoder-decoder network to learn feature embeddings of pixels in videos that combines appearance and motion cues.

  • We introduce a novel recurrent neural network to learn feature embeddings of pixel trajectories in videos.

  • We use foreground masks as an attention mechanism to focus clustering on relevant pixel trajectories for object discovery.

  • We achieve state-of-the-art performance on commonly used motion segmentation datasets.

This paper is organized as follows. After discussing related work, we introduce our foreground motion clustering method designed for object discovery, followed by experimental results and a conclusion.

2 Related Work

Video Foreground Segmentation. Video foreground segmentation is the task of classifying every pixel in a video as foreground or background. This has been well-studied in the context of video object segmentation [6, 32, 43, 44, 22], especially with the introduction of the unsupervised challenge of the DAVIS dataset [34]. [6] uses a probabilistic model that acts upon optical flow to estimate moving objects. [32] predicts video foreground by iteratively refining motion boundaries while encouraging spatio-temporal smoothness. [43, 44, 22] adopt a learning-based approach and train Convolutional Neural Networks (CNNs) that utilize RGB and optical flow as inputs to produce foreground segmentations. Our approach builds on these ideas and uses the foreground segmentation as an attention mechanism for pixel trajectory clustering.

Instance Segmentation. Instance segmentation algorithms segment individual object instances in images. Many instance segmentation approaches have adopted the general idea of combining segmentation with object proposals [18, 35]. While these approaches only work for objects that have been seen in a training set, we make no such assumption as our intent is to discover objects. Recently, a few works have investigated the instance segmentation problem as a pixel-wise labeling problem by learning pixel embeddings [12, 30, 26, 14]. [30] predicts pixel-wise features using translation-variant semi-convolutional operators. [14] learns pixel embeddings with seediness scores that are used to compose instance masks. [12] designs a contrastive loss and [26] unrolls mean shift clustering as a neural network to learn pixel embeddings. We leverage these ideas to design our approach of learning embeddings of pixel trajectories.

Motion Segmentation. Pixel trajectories for motion analysis were first introduced by [41]. [9] used them in a spectral clustering method to produce motion segments, inspiring extensions such as [31], which provided a variational minimization to produce pixel-wise motion segmentations from trajectories. Other works that build on the idea of pixel trajectories include formulating trajectory clustering as a multi-cut problem [24] and detecting discontinuities in the trajectory spectral embedding [16]. More recent approaches include using occlusion relations to produce layered segmentations [42], combining piecewise rigid motions with pre-trained CNNs to merge the rigid motions into objects [7], and jointly estimating scene flow and motion segmentations [38]. We use pixel trajectories in a recurrent neural network to learn trajectory embeddings for motion clustering.

3 Method

Figure 2: Overall architecture. First, feature maps of each frame are extracted by the Y-Net. Next, foreground masks are computed, shown in orange. The PT-RNN uses these foreground masks to compute trajectory embeddings (an example foreground trajectory linked across frames is shown in purple), which are normalized to produce unit vectors. Backpropagation passes through the blue solid arrows, but not through the red dashed arrows.

Our approach takes video frames and optical flow between pairs of frames as inputs, which are fed through an encoder-decoder network, resulting in pixel-wise features. These features are used to predict foreground masks of moving objects. In addition, a recurrent neural network is designed to learn feature embeddings of pixel trajectories inside the foreground masks. Lastly, the trajectory embeddings are clustered into different objects, giving a consistent segmentation mask for each discovered object. The network architecture is visualized in Figure 2.

3.1 Encoder-Decoder: Y-Net

Our network receives an RGB image and an optical flow image from the video at each time step and feeds them into an encoder-decoder network separately, where the encoder-decoder network extracts dense features for each video frame. Our encoder-decoder network is an extension of the U-Net architecture [37] (Figure 3(a)) to two different input types, i.e., RGB images and optical flow images, obtained by adding an extra input branch. We denote this mid-level fusion of low-resolution features as Y-Net and illustrate its architecture in Figure 3(b).

In detail, our network has two parallel encoder branches for the RGB and optical flow inputs. Each encoder branch consists of four blocks of two convolutions (each of which is succeeded by a GroupNorm layer [45] and a ReLU activation) followed by a max pooling layer. The encodings of the RGB and optical flow branches are then concatenated and input to a decoder network, whose architecture is very similar to that of [37], with skip connections from both encoder branches to the decoder.

We argue that this mid-level fusion performs better than early fusion and late fusion (using completely separate branches for RGB and optical flow, similar to two-stream networks [39, 15, 44]) of encoder-decoder networks while using fewer parameters, and show this empirically in Section 4.1. The output of Y-Net is a pixel-dense feature representation of the scene, which we refer to as the pixel embeddings of the video.
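The two-branch structure described above can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's exact configuration: the channel widths, feature dimension, and depth (two encoder blocks per branch instead of four) are assumptions chosen to keep the example short.

```python
import torch
import torch.nn as nn

def enc_block(c_in, c_out, groups=4):
    # Two 3x3 convs, each followed by GroupNorm and ReLU (as in Sec. 3.1).
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.GroupNorm(groups, c_out), nn.ReLU(),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.GroupNorm(groups, c_out), nn.ReLU(),
    )

class TinyYNet(nn.Module):
    """Minimal two-branch (RGB + flow) encoder with mid-level fusion and a
    decoder with skip connections from both branches."""
    def __init__(self, feat_dim=16):
        super().__init__()
        self.rgb_enc1, self.rgb_enc2 = enc_block(3, 16), enc_block(16, 32)
        self.flow_enc1, self.flow_enc2 = enc_block(2, 16), enc_block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.fuse = enc_block(64, 32)             # concat of both branch encodings
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.dec1 = enc_block(32 + 32 + 32, 32)   # skips from both encoders
        self.dec2 = enc_block(32 + 16 + 16, feat_dim)

    def forward(self, rgb, flow):
        r1 = self.rgb_enc1(rgb);   r2 = self.rgb_enc2(self.pool(r1))
        f1 = self.flow_enc1(flow); f2 = self.flow_enc2(self.pool(f1))
        x = self.fuse(self.pool(torch.cat([r2, f2], dim=1)))   # mid-level fusion
        x = self.dec1(torch.cat([self.up(x), r2, f2], dim=1))
        x = self.dec2(torch.cat([self.up(x), r1, f1], dim=1))
        return x  # per-pixel embeddings at input resolution

rgb = torch.randn(1, 3, 64, 64)
flow = torch.randn(1, 2, 64, 64)
emb = TinyYNet()(rgb, flow)
print(emb.shape)  # torch.Size([1, 16, 64, 64])
```

The key design point is that fusion happens on low-resolution encodings (the `torch.cat` before `fuse`), rather than on raw inputs (early fusion) or on decoder outputs (late fusion).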

3.2 Foreground Prediction

The Y-Net extracts a dense feature map for each video frame that combines appearance and motion information of the objects. Using these features, our network predicts a foreground mask for each video frame by simply applying another convolution on top of the Y-Net outputs to compute foreground logits. These logits are passed through a sigmoid layer and thresholded at 0.5 to produce the binary foreground mask at each time step.

The foreground masks are used as an attention mechanism to focus the clustering on the foreground trajectory embeddings. This results in more stable performance, as seen in Section 4.1. Note that while we focus on moving objects in our work, the foreground can be specified depending on the problem. For example, if we specify that certain objects such as cars should be foreground, then the network would learn to discover and segment car instances in videos.

3.3 Trajectory Embeddings

In order to consistently discover and segment objects across video frames, we propose to learn deep representations of foreground pixel trajectories of the video. Specifically, we consider dense pixel trajectories throughout the videos, where trajectories are defined as in [41, 9]. Given the outputs of Y-Net, we compute the trajectory embedding as a weighted sum of the pixel embeddings along the trajectory.

(a) U-Net architecture
(b) Y-Net architecture
Figure 3: We show U-Net [37] and our proposed Y-Net to visually demonstrate the difference. Y-Net has two encoding branches (shown in green), one for each input modality, which are fused (shown in purple) and passed to the decoder (shown in yellow). Skip connections are visualized as blue arrows.

3.3.1 Linking Foreground Trajectories

We first describe the method to calculate pixel trajectories according to [41]. Denote the forward optical flow field at time and the backward optical flow field at time . As defined in [41], we say the optical flow for two pixels at time and at time is consistent if the backward flow points in the inverse direction of the forward flow, up to a tolerance interval that is linear in the magnitude of the flow (Eq. (1)). Pixels satisfying this condition are linked in a pixel trajectory.
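The consistency check might look like the following sketch. The tolerance constants (0.01 and 0.5) are the values commonly used with the criterion of [41]; the exact constants in this paper were not preserved in the text, so treat them as assumptions.

```python
import numpy as np

def flow_consistent(fwd, bwd, x, y):
    """Forward/backward flow consistency check in the spirit of [41] for pixel (x, y).

    fwd: forward flow at time t,    shape (H, W, 2), (dx, dy) per pixel
    bwd: backward flow at time t+1, shape (H, W, 2)
    Returns True if the backward flow at the forward-warped location
    roughly inverts the forward flow.
    """
    w = fwd[y, x]                              # forward displacement
    x2 = int(round(x + w[0])); y2 = int(round(y + w[1]))
    H, W = bwd.shape[:2]
    if not (0 <= x2 < W and 0 <= y2 < H):      # warped out of frame: trajectory ends
        return False
    w_hat = bwd[y2, x2]                        # backward displacement at warped pixel
    # The tolerance grows linearly with the squared flow magnitudes.
    return np.sum((w + w_hat) ** 2) <= 0.01 * (np.sum(w ** 2) + np.sum(w_hat ** 2)) + 0.5

# A constant rightward flow of 3 px with an exact inverse backward flow is consistent.
fwd = np.zeros((8, 8, 2)); fwd[..., 0] = 3.0
bwd = np.zeros((8, 8, 2)); bwd[..., 0] = -3.0
print(bool(flow_consistent(fwd, bwd, 2, 4)))              # True
print(bool(flow_consistent(fwd, np.zeros_like(bwd), 2, 4)))  # False: backward flow is zero
```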

To define foreground pixel trajectories, we augment the above definition and say pixels and are linked if Eq. (1) holds and both pixels are classified as foreground. Using this, we define a foreground-consistent warping function that warps a set of pixels forward in time along their foreground trajectories:

This can be achieved by warping with and multiplying by a binary consistency mask. By warping the foreground mask with using Eq. (1) and intersecting it with , we obtain this binary mask that is 1 if is linked to a foreground pixel at time . Figure 4 demonstrates the linking of pixels in a foreground pixel trajectory.

Figure 4: We illustrate pixel linking in foreground pixel trajectories. The foreground mask is shown in orange, forward flow is denoted by the blue dashed arrow, and backward flow is denoted by the red dashed arrow. The figure shows a trajectory that links pixels in frames . Two failure cases that can cause a trajectory to end are shown between frames and : 1) Eq. (1) is not satisfied, and 2) one of the pixels is not classified as foreground.

3.3.2 Pixel Trajectory RNN

After linking foreground pixels into trajectories, we now describe our proposed Recurrent Neural Network (RNN) for learning feature embeddings of these trajectories. Denote the pixel locations of a foreground trajectory, the pixel embeddings of the foreground trajectory (i.e., the Y-Net outputs), and the length of the trajectory. We define the foreground trajectory embedding to be a weighted sum of the pixel embeddings along the foreground trajectory. Specifically, we have


where the multiplication and division are element-wise.
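The weighted sum in Eq. (2) can be sketched as follows. The per-element weights and the final unit normalization follow Sections 3.3.2 and 3.3.3; the variable names are ours, since the paper's symbols were lost in extraction.

```python
import numpy as np

def trajectory_embedding(pixel_embs, weights):
    """Weighted sum of pixel embeddings along one foreground trajectory (Eq. (2)).

    pixel_embs: (L, C) Y-Net embeddings along the trajectory
    weights:    (L, C) per-element weights predicted by the PT-RNN
    Returns the C-dimensional trajectory embedding, unit-normalized.
    """
    num = np.sum(weights * pixel_embs, axis=0)   # element-wise product, summed over time
    den = np.sum(weights, axis=0)                # element-wise normalizer
    psi = num / np.maximum(den, 1e-8)            # guard against empty weights
    return psi / np.linalg.norm(psi)             # project onto the unit sphere

L, C = 5, 4
embs = np.ones((L, C))
w = np.full((L, C), 0.5)
psi = trajectory_embedding(embs, w)
print(np.linalg.norm(psi))  # 1.0
```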

To compute the trajectory embeddings, we encode this weighted sum as a novel RNN architecture which we denote Pixel Trajectory RNN (PT-RNN). In its hidden state, PT-RNN stores the running weighted sum and the total weight, which allows it to keep track of these quantities throughout the foreground trajectory. While Eq. (3) describes the hidden state at each pixel location and time step, we can efficiently implement the PT-RNN for all pixels simultaneously as follows: at each time step, PT-RNN first applies the foreground-consistent warping function to propagate the hidden state, then computes the weights for the current frame. We design three variants of PT-RNN for computing the weights, named standard (based on simple RNNs), conv (based on convRNNs), and convGRU (based on [2]). For example, our conv architecture is described by


where the operator denotes convolution and the learned parameters are convolution kernels. After the weights are computed by the PT-RNN, we update the hidden state accordingly. All model variants are described in detail in the supplement. Essentially, standard treats each set of linked pixels as a simple RNN, conv includes information from neighboring pixels, and convGRU allows the network to capture longer-term dependencies by utilizing an explicit memory state.
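One step of the conv variant might look like the sketch below. The exact gating equations live in the paper's supplement and were not preserved here, so the specific form of the weight prediction is an assumption; the foreground-consistent warping of the hidden state is also omitted for brevity.

```python
import torch
import torch.nn as nn

class ConvPTRNNCell(nn.Module):
    """One step of a conv-style PT-RNN (a sketch). The hidden state carries the
    running weighted sum of embeddings and the running total weight for every
    pixel trajectory."""
    def __init__(self, dim):
        super().__init__()
        # Weights predicted from the current running average and the new embedding.
        self.conv = nn.Conv2d(2 * dim, dim, kernel_size=3, padding=1)

    def forward(self, x_t, h_sum, h_wt):
        # h_sum, h_wt would be warped forward along the trajectories before this call.
        avg = h_sum / h_wt.clamp(min=1e-8)
        w_t = torch.sigmoid(self.conv(torch.cat([avg, x_t], dim=1)))  # weights in (0, 1)
        return h_sum + w_t * x_t, h_wt + w_t      # running-sum / total-weight update

cell = ConvPTRNNCell(dim=8)
x = torch.randn(1, 8, 16, 16)
h_sum = torch.zeros(1, 8, 16, 16); h_wt = torch.zeros(1, 8, 16, 16)
for _ in range(5):                                # unroll over 5 frames
    h_sum, h_wt = cell(x, h_sum, h_wt)
emb = h_sum / h_wt.clamp(min=1e-8)                # trajectory embedding when it ends
print(emb.shape)  # torch.Size([1, 8, 16, 16])
```

When a trajectory ends, the stored sum divided by the stored total weight recovers exactly the weighted average of Eq. (2).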

When a trajectory is finished, i.e., pixel does not link to any pixel in the next frame, PT-RNN outputs , which is equivalent to Eq. (2). This results in a -dimensional embedding for every foreground pixel trajectory, regardless of its length, when it starts, or when it ends. Note that these trajectory embeddings are pixel-dense, removing the need for a variational minimization step [31]. The embeddings are normalized so that they lie on the unit sphere.

A benefit of labeling the trajectories is that we enforce consistency in time, since consistent forward and backward optical flow usually means that the pixels are truly linked [41]. However, issues can arise around motion and object boundaries, which can lead to trajectories erroneously drifting and representing the motion of two different objects, or of an object and the background [41]. In this case, the foreground masks are beneficial and can sever the trajectory before it drifts. We also note the similarity to the DA-RNN architecture [47], which uses data association in an RNN for semantic labeling.

3.3.3 Spatial Coordinate Module

The foreground trajectory embeddings incorporate information from the RGB and optical flow images. However, they do not encode information about the location of the trajectory in the image. Thus, we introduce a spatial coordinate module that computes location information for each foreground trajectory. Specifically, we compute a 4-dimensional vector consisting of the average pixel location and average displacement of each trajectory and pass it through two fully connected (FC) layers to inflate it to the embedding dimension; the result is added to the trajectory embedding before the normalization of the foreground trajectory embeddings.
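The spatial coordinate module can be sketched as below. The paper only states "two FC layers", so the hidden width is an assumption.

```python
import torch
import torch.nn as nn

class SpatialCoordinateModule(nn.Module):
    """Lifts a 4-d location descriptor (average pixel location and average
    displacement of a trajectory) to the embedding dimension via two FC layers.
    The hidden width of 64 is an assumption."""
    def __init__(self, emb_dim, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(), nn.Linear(hidden, emb_dim))

    def forward(self, locs, displacements, traj_emb):
        # locs, displacements: (N, 2) per-trajectory averages; traj_emb: (N, C)
        coord = torch.cat([locs, displacements], dim=1)    # (N, 4) location descriptor
        out = traj_emb + self.mlp(coord)                   # add before normalization
        return out / out.norm(dim=1, keepdim=True)         # unit-normalize

torch.manual_seed(0)
scm = SpatialCoordinateModule(emb_dim=8)
emb = scm(torch.rand(10, 2), torch.rand(10, 2), torch.randn(10, 8))
print(emb.shape)  # torch.Size([10, 8])
```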

3.4 Loss Function

To train our proposed network, we use a loss function composed of three terms, with relative weights fixed in our experiments. The first is a pixel-wise binary cross-entropy loss commonly used in foreground prediction, which we apply to the predicted foreground logits. The intra-object and inter-object losses operate on the foreground trajectory embeddings; inspired by [12], their goal is to encourage trajectory embeddings of the same object to be close while pushing trajectory embeddings of different objects apart. For simplicity, we overload notation and define a list of trajectory embeddings indexed by object and by embedding. Since all feature embeddings are normalized to unit length, we use the cosine distance function to measure the distance between two feature embeddings:

Proposition 1

Let a set of unit vectors be given. Define the spherical mean of this set to be the unit vector that minimizes the sum of cosine distances to the set. Then the spherical mean is the arithmetic mean of the set, normalized to unit length. For the proof, see the supplement.

The goal of the intra-object loss is to encourage the learned trajectory embeddings of an object to be close to their spherical mean. This results in


where the spherical mean of the trajectories of each object is computed as in Proposition 1, and the indicator function acts as hard negative mining, focusing the loss on embeddings that are farther than the margin from the spherical mean. Note that the spherical mean is itself a function of the embeddings. In practice, we do not let the denominator become too small, as that could result in very unstable gradients; we clamp it to a minimum of 50.

Lastly, the inter-object loss is designed to push trajectories of different objects apart. We desire the clusters to be pushed apart by some margin , giving


where . This loss function encourages the spherical means of different objects to be at least away from each other. Since our embeddings lie on the unit sphere and our distance function measures cosine distance, does not need to depend on the feature dimension . In our experiments, we set which encourages the clusters to be at least 90 degrees apart.
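The intra- and inter-object losses might be implemented as below. This is a sketch under stated assumptions: the exact normalizations and exponents were lost in extraction, the intra-object margin of 0.02 is a placeholder, and the denominator clamp is simplified to 1 instead of the paper's 50. The inter-object margin of 0.5 follows the text (90 degrees in cosine distance), and the spherical mean is the normalized arithmetic mean per Proposition 1.

```python
import torch

def cos_dist(a, b):
    # Cosine distance between unit vectors: 0 when aligned, 1 when opposite.
    return 0.5 * (1.0 - (a * b).sum(-1))

def intra_loss(embs, labels, margin=0.02):
    """Pull each trajectory embedding toward its object's spherical mean.
    Only embeddings farther than `margin` contribute (hard negative mining)."""
    loss, n_objs = 0.0, 0
    for k in labels.unique():
        e = embs[labels == k]
        mu = e.mean(0); mu = mu / mu.norm()          # spherical mean (Prop. 1)
        d = cos_dist(e, mu)
        mask = (d > margin).float()                  # hard negatives only
        loss = loss + (mask * d ** 2).sum() / mask.sum().clamp(min=1.0)
        n_objs += 1
    return loss / n_objs

def inter_loss(means, delta=0.5):
    """Push spherical means of different objects at least `delta` apart
    (delta = 0.5 corresponds to 90 degrees in cosine distance)."""
    K = means.shape[0]; loss, n = 0.0, 0
    for i in range(K):
        for j in range(i + 1, K):
            loss = loss + torch.clamp(delta - cos_dist(means[i], means[j]), min=0.0) ** 2
            n += 1
    return loss / max(n, 1)

torch.manual_seed(0)
embs = torch.nn.functional.normalize(torch.randn(20, 8), dim=1)
labels = torch.randint(0, 3, (20,))
means = torch.nn.functional.normalize(torch.randn(3, 8), dim=1)
print(float(intra_loss(embs, labels)), float(inter_loss(means)))
```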

3.5 Trajectory Clustering

At inference time, we cluster the foreground trajectory embeddings with the von Mises-Fisher mean shift (vMF-MS) algorithm [25]. This gives us the clusters as well as the number of clusters, which is the estimated number of objects in a video. vMF-MS finds the modes of a kernel density estimate based on the von Mises-Fisher distribution. The density can be described as

for a unit vector, where the concentration is a scalar parameter and the normalization constant is determined by it. The concentration should be set to reflect the choice of margin. If the training loss is perfect, then all embeddings of an object lie within a ball of fixed angular radius around the spherical mean. Thus, we set the concentration so that almost 50% of the density is concentrated in a ball with radius 16 degrees around the mode (judging from Figure 2.12 of [40]).

Running the full vMF-MS clustering is inefficient due to our trajectory representation being pixel-dense. Instead, we run the algorithm on a few randomly chosen seeds that are far apart in cosine distance. If the network learns to correctly predict clustered trajectory embeddings, then this random initialization should provide little variance in the results. Furthermore, we use a PyTorch-GPU implementation of the vMF-MS clustering for efficiency.
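A minimal vMF mean-shift sketch on the unit sphere is shown below; the concentration, iteration count, and mode-merging threshold are illustrative values, not the paper's settings, and the seeds here are simply a handful of data points rather than the far-apart random seeds the paper describes.

```python
import numpy as np

def vmf_mean_shift(X, seeds, kappa=10.0, iters=50, merge_thresh=0.05):
    """von Mises-Fisher mean shift on the unit sphere (a minimal sketch).
    X: (N, C) unit-norm trajectory embeddings; seeds: (S, C) unit seed vectors.
    Returns the merged modes; their count estimates the number of objects."""
    Y = seeds.copy()
    for _ in range(iters):
        W = np.exp(kappa * Y @ X.T)              # vMF kernel weights, (S, N)
        Y = W @ X                                # shift toward weighted mean
        Y /= np.linalg.norm(Y, axis=1, keepdims=True)  # re-project to the sphere
    modes = []                                   # merge modes closer than merge_thresh
    for y in Y:
        if all(0.5 * (1 - y @ m) > merge_thresh for m in modes):
            modes.append(y)
    return np.array(modes)

# Two tight clusters of unit vectors near orthogonal directions.
rng = np.random.default_rng(0)
a = rng.normal([5, 0, 0], 0.1, (50, 3)); b = rng.normal([0, 5, 0], 0.1, (50, 3))
X = np.vstack([a, b]); X /= np.linalg.norm(X, axis=1, keepdims=True)
modes = vmf_mean_shift(X, X[[0, 25, 60, 99]])
print(len(modes))  # 2 — one mode per cluster, i.e. two discovered objects
```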

4 Experiments

Y-Net 0.905 0.701 0.631
Early Fusion 0.883 0.636 0.568
Late Fusion 0.897 0.631 0.570
Table 1: Fusion ablation. Performance is measured in IoU.

Datasets. We evaluate our method on video foreground segmentation and multi-object motion segmentation on five datasets: Flying Things 3D (FT3D) [28], DAVIS2016 [34], Freiburg-Berkeley motion segmentation (FBMS) [31], ComplexBackground [29], and CamouflagedAnimal [6]. FT3D is a large-scale synthetic dataset. Foreground labels are provided by [43], which we combine with the object segmentation masks (foreground and background) provided by [28] to produce motion segmentation masks. DAVIS2016 has 50 videos with pixel-dense foreground masks at each frame; we use the 𝒥-measure and ℱ-measure defined in [34] as evaluation metrics. DAVIS2017 expands on DAVIS2016, providing more videos and multiple object segmentations per video, which we leverage for training only. FBMS, ComplexBackground, and CamouflagedAnimal are motion segmentation datasets containing 59, 5, and 9 sequences, respectively. For these datasets, we use precision, recall, F-score, and the Obj metric defined in [31, 7] for evaluation. Corrected labels for these datasets are provided by [5]. Full details of each dataset can be found in the supplement.

                        Multi-object                  Foreground
                        P     R     F     Obj         P     R     F
conv PT-RNN             75.9  66.6  67.3  4.9         90.3  87.6  87.7
standard PT-RNN         72.2  66.6  66.0  4.27        88.1  89.3  87.5
convGRU PT-RNN          73.6  63.8  64.8  4.07        89.6  85.8  86.3
per-frame embedding     79.9  56.7  59.7  11.2        92.1  85.4  87.4
no FG mask              63.5  60.3  59.6  1.97        82.5  85.7  82.1
no SCM                  70.4  65.5  63.2  3.70        89.3  89.1  88.1
no pre-FT3D             70.2  63.6  63.1  3.66        87.6  88.2  86.3
no DAVIS-m              66.9  63.6  62.1  2.07        87.1  86.9  85.2
Table 2: Architecture and dataset ablation on the FBMS test set.

Implementation Details.

We train our networks using stochastic gradient descent with a fixed learning rate of 1e-2. We use backpropagation through time with sequences of length 5 to train the PT-RNN. Each image is resized to a fixed resolution before processing. During training (except for FT3D), we perform data augmentation, which includes translation, rotation, cropping, horizontal flipping, and color warping. We extract optical flow via [21].

Labels for each foreground trajectory are given by the frame-level label of the last pixel in the trajectory. Due to sparse labeling in the FBMS training dataset, we warp the labels using Eq. (1) so that each frame has labels. Lastly, due to the small size of FBMS (29 videos for training), we hand-select 42 videos from the 90 DAVIS2017 videos that roughly satisfy the rubric of [5] to augment the FBMS training set. We denote this set as DAVIS-m; the exact videos in DAVIS-m can be found in the supplement.

When evaluating the full model on long videos, we run into GPU memory constraints, so we devise a sliding window scheme. First, we cluster all foreground trajectories within a window. We then match the clusters of this window to the clusters of the previous window using the Hungarian algorithm, with the distance between cluster centers as the matching cost, and further require that matched clusters be within a distance threshold. When a cluster is not matched to any of the previous clusters, we declare it a new object. We use a 5-frame window and adopt this scheme for the FBMS and CamouflagedAnimal datasets.
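The window-to-window matching step can be sketched as follows. The `max_dist` acceptance threshold is a hypothetical value: the paper requires matched clusters to be sufficiently close, but the exact cutoff was lost in extraction.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_clusters(prev_centers, curr_centers, prev_ids, next_id, max_dist=0.2):
    """Match current-window cluster centers to previous-window centers with the
    Hungarian algorithm on cosine distance. Centers are unit vectors.
    Returns an object id per current cluster; unmatched clusters get new ids."""
    cost = 0.5 * (1.0 - curr_centers @ prev_centers.T)   # cosine distance matrix
    rows, cols = linear_sum_assignment(cost)             # optimal assignment
    ids = [None] * len(curr_centers)
    for r, c in zip(rows, cols):
        if cost[r, c] <= max_dist:                       # accept close matches only
            ids[r] = prev_ids[c]
    for i in range(len(ids)):
        if ids[i] is None:                               # unmatched: declare new object
            ids[i] = next_id; next_id += 1
    return ids, next_id

prev = np.eye(3)                       # three previous cluster centers (unit vectors)
curr = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, -1.0]])
ids, nxt = match_clusters(prev, curr, prev_ids=[0, 1, 2], next_id=3)
print(ids)  # [1, 0, 3] — the third center opposes cluster 2, so it becomes a new object
```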

Our implementation is in PyTorch, and all experiments run on a single NVIDIA TitanXP GPU. Given optical flow, our algorithm runs at approximately 15 FPS. Note that we do not use a CRF post-processing step for motion segmentation.

                Video Foreground Segmentation                                    Multi-object Motion Segmentation
       PCM [6]  FST [32]  NLC [13]  MPNet [43]  LVO [44]  CCG [7]  Ours          CVOS [42]  CUT [24]  CCG [7]  Ours

FBMS
  P    79.9     83.9      86.2      87.3        92.4      85.5     90.3          72.7       74.6      74.2     75.9
  R    80.8     80.0      76.3      72.2        85.1      83.1     87.6          54.4       62.0      63.1     66.6
  F    77.3     79.6      77.3      74.8        87.0      81.9     87.7          56.3       63.6      65.0     67.3
  Obj  -        -         -         -           -         -        -             11.7       7.7       4.0      4.9

CB
  P    84.3     87.6      79.9      86.8        74.6      87.7     83.1          60.8       67.6      64.9     57.7
  R    91.7     85.0      69.3      77.5        77.0      93.1     89.7          44.7       58.3      67.3     61.9
  F    86.6     80.6      73.7      78.2        70.5      90.1     83.5          45.8       60.3      65.6     58.3
  Obj  -        -         -         -           -         -        -             3.4        3.4       3.4      3.2

CA
  P    81.9     73.3      82.3      77.8        77.6      80.4     78.5          84.7       77.8      83.8     77.2
  R    74.6     56.7      68.5      62.0        51.1      75.2     79.7          59.4       68.1      70.0     77.2
  F    76.3     60.4      72.5      64.8        50.8      76.0     77.1          61.5       70.0      72.2     75.3
  Obj  -        -         -         -           -         -        -             22.2       5.7       5.0      5.4

ALL
  P    80.8     82.1      84.7      85.3        87.4      84.7     87.1          73.8       74.5      75.1     74.1
  R    80.7     75.8      73.9      70.7        77.2      82.7     86.2          54.3       62.8      65.0     68.2
  F    78.2     75.8      75.9      73.1        77.7      81.5     85.1          56.2       64.5      66.5     67.9
  Obj  -        -         -         -           -         -        -             12.9       6.8       4.1      4.8

Table 3: Results for FBMS, ComplexBackground (CB), CamouflagedAnimal (CA), and averaged over all videos in these datasets (ALL). Best results are highlighted in red with second best in blue.

4.1 Ablation Studies

                 FST [32]  FSEG [22]  MPNet [43]  LVO [44]  Ours
DAVIS  𝒥         55.8      70.7       70.0        75.9      74.2
DAVIS  ℱ         51.1      65.3       65.9        72.1      73.9
FT3D   IoU       -         -          85.9        -         90.7
Table 4: Results on video foreground segmentation for DAVIS2016 and FT3D. Best results are highlighted in red.

Fusion ablation. We show that the choice of mid-level fusion with Y-Net is empirically better than early fusion and late fusion of encoder-decoder networks. For early fusion, we concatenate RGB and optical flow and pass them through a single U-Net. For late fusion, there are two U-Nets, one for RGB and one for optical flow, with a conv layer at the end to fuse the outputs. Note that Y-Net has more parameters than early fusion but fewer parameters than late fusion. Table 1 shows that Y-Net outperforms the alternatives in terms of foreground IoU. Note that the performance gap is more prominent on the real-world datasets.

Architecture ablation. We evaluate the contribution of each part of the model and show results in both the multi-object setting and the binary setting (foreground segmentation) on the FBMS test set. All models are pre-trained on FT3D for 150k iterations and trained on FBMS+DAVIS-m for 100k iterations. Experiments with the different PT-RNN variants show that conv PT-RNN performs best empirically in terms of F-score, so we use it in our comparison with state-of-the-art methods. The standard variant performs similarly, while convGRU performs worse, perhaps due to overfitting to the small dataset. Next, we remove the PT-RNN architecture (per-frame embedding) and cluster the foreground pixels at each frame. The F-score drops significantly and Obj is much worse, likely because this version does not label clusters consistently in time. Because the foreground prediction is not affected, these numbers are still reasonable. Next, we remove the foreground masks (no FG mask) and cluster all foreground and background trajectories. The clustering becomes more sensitive; if the background trajectories are not clustered adequately in the embedding space, performance suffers. Lastly, we remove the spatial coordinate module (no SCM) and observe lower performance. Similar to the per-frame embedding experiment, foreground prediction is not affected.

Dataset ablation. We also study the effects of the training schedule and training dataset choices. In particular, we first explore the effect of not pre-training on FT3D, shown in the bottom portion of Table 2. Secondly, we explore the effect of training the model only on FBMS (without DAVIS-m). Both experiments show a noticeable drop in performance in both the multi-object and foreground/background settings, showing that these ideas are crucial to our performance.

Figure 5: Qualitative results for motion segmentation. The videos are: goats01, horses02, and cars10 from FBMS, and forest from ComplexBackground.

4.2 Comparison to State-of-the-Art Methods

We show results of our method against state-of-the-art methods in video foreground segmentation and multi-object motion segmentation. As discussed in the previous section, we select our full model to be the conv PT-RNN variant of Figure 2. We train this model for 150k iterations on FT3D, then fine-tune on FBMS+DAVIS-m for 100k iterations.

Video Foreground Segmentation. For FBMS, ComplexBackground, and CamouflagedAnimal, we follow the protocol in [7], which converts the motion segmentation labels into a single foreground mask, use the metrics defined in [31], and report results averaged over those three datasets. We compare our method to state-of-the-art methods including PCM [6], FST [32], NLC [13], MPNet [43], LVO [44], and CCG [7]. We report results in Table 3. In terms of F-score, our model outperforms all other models on FBMS and CamouflagedAnimal, but falls just short on ComplexBackground behind PCM and CCG. Looking at all videos, we show a relative gain of 4.4% on F-score compared to the second best method, CCG, due to our high recall.

Additionally, we report results of our model on FT3D and the validation set of DAVIS2016. We compare our model to the state-of-the-art methods LVO [44], FSEG [22], MPNet [43], and FST [32] in Table 4. For this experiment only, we train a Y-Net on FT3D for 100k iterations, outperforming MPNet by a relative gain of 5.6%. We then fine-tune for 50k iterations on the training set of DAVIS2016 and use a CRF [27] post-processing step. We outperform all methods in terms of ℱ-measure, and all methods but LVO on 𝒥-measure. Note that unlike LVO, we do not utilize an RNN for video foreground segmentation, yet we still achieve performance comparable to the state-of-the-art. Also, LVO [44] reports 70.1 without using a CRF, while our method attains 71.4 without using a CRF. This demonstrates the efficacy of the Y-Net architecture.

Multi-object Motion Segmentation. We compare our method with the state-of-the-art methods CCG [7], CUT [24], and CVOS [42], and report results in Table 3. We outperform all models on F-score on the FBMS and CamouflagedAnimal datasets. On FBMS, we lead on precision, recall, and F-score, with a relative gain of 3.5% on F-score compared to the second best method, CCG; our performance on Obj is comparable to the other methods. On CamouflagedAnimal, we show very high recall with slightly lower precision, leading to a 4.4% relative gain in F-score; again, our result on Obj is comparable. However, our method places third on the ComplexBackground dataset. This small 5-sequence dataset exhibits backgrounds with varying depths, which is hard for our network to segment correctly. Nevertheless, we still outperform all other methods on F-score when looking at all videos; as in the binary case, this is due to our high recall. Because we are the first work to use FT3D for motion segmentation, we report results on FT3D in the supplement for interested readers.

To illustrate our method, we show qualitative results in Figure 5. We plot RGB, optical flow [21], ground truth, results from the state-of-the-art CCG [7], and our results on 4 sequences (goats01, horses02, and cars10 from FBMS, and forest from ComplexBackground). On goats01, our results illustrate that, thanks to our predicted foreground mask, our method is able to correctly segment objects that do not have instantaneous flow; CCG struggles in this setting. On horses02, we show a similar story, while CCG struggles to estimate rigid motions for the objects. Note that our method provides accurate segmentations without the use of a CRF post-processing step. We show two failure modes for our algorithm: 1) if the foreground mask is poor, performance suffers, as shown on cars10 and forest; and 2) cluster collapse can cause multiple objects to be segmented as a single object, as shown on cars10.

5 Conclusion

We proposed a novel deep network architecture for solving the problem of object discovery using object motion cues. We introduced an encoder-decoder network that learns representations of video frames and optical flow, and a novel recurrent neural network that learns feature embeddings of pixel trajectories inside foreground masks. By clustering these embeddings, we are able to discover and segment potentially unseen objects in videos. We demonstrated the efficacy of our approach on several motion segmentation datasets for object discovery.


  • [1] B. Babenko, M.-H. Yang, and S. Belongie. Robust object tracking with online multiple instance learning. IEEE transactions on pattern analysis and machine intelligence, 2011.
  • [2] N. Ballas, L. Yao, C. Pal, and A. Courville. Delving deeper into convolutional networks for learning video representations. In International Conference on Learning Representations (ICLR), 2016.
  • [3] J. L. Barron, D. J. Fleet, and S. S. Beauchemin. Performance of optical flow techniques. International journal of computer vision, 12(1):43–77, 1994.
  • [4] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua. Multiple object tracking using k-shortest paths optimization. IEEE transactions on pattern analysis and machine intelligence, 2011.
  • [5] P. Bideau and E. Learned-Miller. A detailed rubric for motion segmentation. arXiv preprint arXiv:1610.10033, 2016.
  • [6] P. Bideau and E. Learned-Miller. It’s moving! a probabilistic model for causal motion segmentation in moving camera videos. In European Conference on Computer Vision (ECCV), pages 433–449. Springer, 2016.
  • [7] P. Bideau, A. RoyChowdhury, R. R. Menon, and E. Learned-Miller. The best of both worlds: Combining cnns and geometric constraints for hierarchical motion segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [8] J. Bohg, K. Hausman, B. Sankaran, O. Brock, D. Kragic, S. Schaal, and G. S. Sukhatme. Interactive perception: Leveraging action in perception and perception in action. IEEE Transactions on Robotics, 33(6):1273–1291, 2017.
  • [9] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In European Conference on Computer Vision (ECCV), 2010.
  • [10] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago, 2015.
  • [11] J. Cheng, J. Yang, Y. Zhou, and Y. Cui. Flexible background mixture models for foreground segmentation. Image and Vision Computing, 2006.
  • [12] B. De Brabandere, D. Neven, and L. Van Gool. Semantic instance segmentation with a discriminative loss function. arXiv preprint arXiv:1708.02551, 2017.
  • [13] A. Faktor and M. Irani. Video segmentation by non-local consensus voting. In British Machine Vision Conference (BMVC), 2014.
  • [14] A. Fathi, Z. Wojna, V. Rathod, P. Wang, H. O. Song, S. Guadarrama, and K. P. Murphy. Semantic instance segmentation via deep metric learning. arXiv preprint arXiv:1703.10277, 2017.
  • [15] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [16] K. Fragkiadaki, G. Zhang, and J. Shi. Video segmentation by tracing discontinuities in a trajectory embedding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • [17] S. Hare, S. Golodetz, A. Saffari, V. Vineet, M.-M. Cheng, S. L. Hicks, and P. H. Torr. Struck: Structured output tracking with kernels. IEEE transactions on pattern analysis and machine intelligence, 2016.
  • [18] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In IEEE International Conference on Computer Vision (ICCV), 2017.
  • [19] B. K. Horn and B. G. Schunck. Determining optical flow. Artificial intelligence, 17(1-3):185–203, 1981.
  • [20] W. Hu, X. Li, X. Zhang, X. Shi, S. Maybank, and Z. Zhang. Incremental tensor subspace learning and its applications to foreground segmentation and tracking. International Journal of Computer Vision, 2011.
  • [21] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [22] S. Jain, B. Xiong, and K. Grauman. Fusionseg: Learning to combine motion and appearance for fully automatic segmention of generic objects in videos. arXiv preprint arXiv:1701.05384, 2017.
  • [23] Z. Kalal, K. Mikolajczyk, J. Matas, et al. Tracking-learning-detection. IEEE transactions on pattern analysis and machine intelligence, 2012.
  • [24] M. Keuper, B. Andres, and T. Brox. Motion trajectory segmentation via minimum cost multicuts. In IEEE International Conference on Computer Vision (ICCV), 2015.
  • [25] T. Kobayashi and N. Otsu. Von mises-fisher mean shift for clustering on a hypersphere. In International Conference on Pattern Recognition (ICPR), 2010.
  • [26] S. Kong and C. Fowlkes. Recurrent pixel embedding for instance grouping. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [27] P. Krähenbühl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in Neural Information Processing Systems (NIPS), 2011.
  • [28] N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [29] M. Narayana, A. Hanson, and E. Learned-Miller. Coherent motion segmentation in moving camera videos using optical flow orientations. In IEEE International Conference on Computer Vision (ICCV), 2013.
  • [30] D. Novotny, S. Albanie, D. Larlus, and A. Vedaldi. Semi-convolutional operators for instance segmentation. In European Conference on Computer Vision (ECCV), 2018.
  • [31] P. Ochs, J. Malik, and T. Brox. Segmentation of moving objects by long term video analysis. IEEE transactions on pattern analysis and machine intelligence, 2014.
  • [32] A. Papazoglou and V. Ferrari. Fast object segmentation in unconstrained video. In IEEE International Conference on Computer Vision (ICCV), 2013.
  • [33] D. Pathak, R. B. Girshick, P. Dollár, T. Darrell, and B. Hariharan. Learning features by watching objects move. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [34] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [35] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In European Conference on Computer Vision (ECCV), 2016.
  • [36] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool. The 2017 davis challenge on video object segmentation. arXiv:1704.00675, 2017.
  • [37] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.
  • [38] L. Shao, P. Shah, V. Dwaracherla, and J. Bohg. Motion-based object segmentation based on dense rgb-d scene flow. In IEEE Robotics and Automation Letters, volume 3, pages 3797–3804. IEEE, Oct. 2018.
  • [39] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems (NIPS), 2014.
  • [40] J. Straub. Nonparametric Directional Perception. PhD thesis, Massachusetts Institute of Technology, 2017.
  • [41] N. Sundaram, T. Brox, and K. Keutzer. Dense point trajectories by gpu-accelerated large displacement optical flow. In European conference on computer vision (ECCV), 2010.
  • [42] B. Taylor, V. Karasev, and S. Soatto. Causal video object segmentation from persistence of occlusions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [43] P. Tokmakov, K. Alahari, and C. Schmid. Learning motion patterns in videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [44] P. Tokmakov, K. Alahari, and C. Schmid. Learning video object segmentation with visual memory. In IEEE International Conference on Computer Vision (ICCV), 2017.
  • [45] Y. Wu and K. He. Group normalization. arXiv preprint arXiv:1803.08494, 2018.
  • [46] Y. Xiang, A. Alahi, and S. Savarese. Learning to track: Online multi-object tracking by decision making. In IEEE International Conference on Computer Vision (ICCV), 2015.
  • [47] Y. Xiang and D. Fox. Da-rnn: Semantic mapping with data associated recurrent neural networks. Robotics: Science and Systems, 2017.
  • [48] L. Zhang, Y. Li, and R. Nevatia. Global data association for multi-object tracking using network flows. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.

Appendix A PT-RNN Variants

The three PT-RNN variants used to compute the trajectory weights, standard, conv, and convGRU, are shown in detail in Table 5. The standard variant, for which we show the equations for a single pixel trajectory, computes weights based on the pixel embeddings along that trajectory without knowledge of any other trajectories. The conv variant uses a convolution kernel instead of the standard matrix multiply to include information from neighboring trajectories. Lastly, the convGRU variant is designed after the convGRU architecture [2], which has an explicit memory state to capture longer-term dependencies. For all three variants, instead of a conventional hidden state, the RNN propagates the intermediate weighted sum at each time step, which allows the network to use knowledge of the previous weights and pixel embeddings when calculating the current weight.

[Table 5 layout: one column per variant (standard, conv, convGRU); the update equations could not be recovered in this version.]
Table 5: PT-RNN variants. For standard, we show the equations for a single pixel, while for the others we show equations in terms of the entire feature map. The parameters of standard are matrices, while those of conv and convGRU are convolution kernels; ∗ denotes convolution and σ(·) is the sigmoid nonlinearity.
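Since the equations in Table 5 did not survive extraction, the following is only an illustrative sketch of the standard variant's control flow as described above: a per-step gating weight is computed from the current embedding and the propagated weighted sum, and the final trajectory embedding is the normalized weighted sum. All parameter names (Wx, Wh, b) and the exact gating form are assumptions, not the paper's equations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pt_rnn_standard(embeddings, Wx, Wh, b):
    """Sketch of the 'standard' PT-RNN variant for ONE pixel trajectory.

    embeddings: (T, C) array of per-frame pixel embeddings along the trajectory.
    Wx, Wh: (C, C) matrices; b: (C,) bias. All hypothetical parameters.
    """
    T, C = embeddings.shape
    s = np.zeros(C)        # propagated intermediate weighted sum
    total_w = np.zeros(C)  # running sum of weights, for normalization
    for t in range(T):
        x_t = embeddings[t]
        # Gating weight computed from the current embedding and the past sum,
        # so earlier weights/embeddings influence the current weight.
        w_t = sigmoid(Wx @ x_t + Wh @ s + b)
        s = s + w_t * x_t
        total_w = total_w + w_t
    # Normalized weighted average of the embeddings along the trajectory.
    return s / np.maximum(total_w, 1e-8)
```

Because every weight lies in (0, 1), the output is a per-channel convex combination of the trajectory's embeddings.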

Appendix B Proof of Proposition 1

We note the following:

Note that the unit vector that maximizes the inner product with a given vector v is simply the normalized version v/‖v‖ (provided v ≠ 0). Thus, the solution to the above problem is this normalized vector.
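The key fact in this argument can be checked numerically: among all unit vectors, the inner product with a fixed vector v is maximized by v/‖v‖, and the maximum value is ‖v‖. A quick sanity check (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=8)
best = v / np.linalg.norm(v)  # claimed maximizer

# No random unit vector achieves a larger inner product with v.
for _ in range(1000):
    u = rng.normal(size=8)
    u /= np.linalg.norm(u)
    assert u @ v <= best @ v + 1e-12

# The maximum inner product equals the norm of v (Cauchy-Schwarz with equality).
print(best @ v, np.linalg.norm(v))
```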

Appendix C Dataset Details


The Flying Things 3D dataset (FT3D) [28] is a synthetic dataset comprised of approximately 2250 training and 450 test videos of 10 images each. Each video is created by instantiating a background with static objects and populating the scene with foreground objects sampled from ShapeNet [10] that fly along randomized 3D trajectories. Segmentation masks of all objects (foreground and background) are provided. While [28] does not specify which objects are foreground, [43] provided foreground labels by identifying the objects that undergo changes in their 3D coordinates. We combined these labels with the object segmentation masks to produce foreground motion clustering masks. We use this dataset for both evaluation and pre-training. Performance on this dataset is measured by intersection over union (IoU) of the foreground masks.
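The combination step described above amounts to restricting the instance masks to the foreground objects and relabeling them with consecutive cluster ids. A minimal sketch (function and variable names are hypothetical):

```python
import numpy as np

def motion_clustering_labels(instance_mask, foreground_ids):
    """Combine an FT3D per-pixel object-instance mask with the set of
    foreground object ids (e.g. objects whose 3D coordinates change, as
    identified by [43]) into motion clustering labels: background and static
    objects get 0, and each foreground object gets a consecutive cluster id."""
    labels = np.zeros_like(instance_mask)
    for cluster_id, obj_id in enumerate(sorted(foreground_ids), start=1):
        labels[instance_mask == obj_id] = cluster_id
    return labels

inst = np.array([[7, 7, 3],
                 [5, 5, 3]])   # instance ids, including a static object (7)
fg = {3, 5}                    # ids of objects that moved
print(motion_clustering_labels(inst, fg))
# [[0 0 1]
#  [2 2 1]]
```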


The DAVIS2016 dataset [34] is a collection of 50 videos of approximately 3500 images, split into 30 training videos and 20 test videos. Each video is accompanied by pixel-dense foreground labels at each frame. We evaluate on the test set for video foreground segmentation only. The DAVIS2017 dataset [36] expands on DAVIS2016 and provides 90 publicly available video sequences with full pixel-dense annotation. DAVIS2017 focuses on semi-supervised video segmentation (as opposed to unsupervised, i.e., foreground segmentation) and provides multiple labels per video. However, not every labeled object is foreground, and not every foreground object is labeled, so this dataset is not suitable for the task of object discovery. Despite this, we leverage its sequences for training. We use the J-measure (IoU) and the F-measure as defined by [34] as evaluation metrics for DAVIS2016.
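While the J-measure compares mask regions, the F-measure of [34] compares mask contours: precision and recall of boundary pixels matched within a small distance tolerance. The official metric uses morphological matching; the brute-force sketch below is illustrative only, and the function names are ours.

```python
import numpy as np

def boundary_pixels(mask):
    """Coordinates of mask pixels that touch the background (4-connectivity)."""
    m = mask.astype(bool)
    padded = np.pad(m, 1, constant_values=False)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    return np.argwhere(m & ~interior)

def f_measure(pred, gt, tol=1.0):
    """Contour-accuracy F-measure sketch: harmonic mean of boundary precision
    and recall, where a boundary pixel counts as matched if it lies within
    `tol` pixels of the other mask's boundary."""
    bp, bg = boundary_pixels(pred), boundary_pixels(gt)
    if len(bp) == 0 or len(bg) == 0:
        return float(len(bp) == len(bg))
    # Pairwise distances between the two boundary point sets.
    d = np.linalg.norm(bp[:, None, :] - bg[None, :, :], axis=-1)
    precision = (d.min(axis=1) <= tol).mean()
    recall = (d.min(axis=0) <= tol).mean()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For identical masks every boundary pixel matches at distance 0, giving an F-measure of 1.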


The Freiburg-Berkeley motion segmentation dataset [31] consists of 59 videos split into 29 training videos and 30 test videos. The videos can be up to 800 images long, and approximately every 20th frame has ground truth motion segmentation labels. The inconsistency and ambiguity in motion segmentation dataset labels inspired [5] to rigorously define the problem of motion segmentation and provide corrected labels which we use in this work. Performance on this dataset is measured by precision, recall, F-score, and Obj as described in [31, 7].
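The multi-object precision/recall/F-score of [31] requires assigning predicted clusters to ground-truth regions before averaging per-region scores. The official protocol uses a Hungarian-style optimal matching; the exhaustive sketch below (feasible only for small label counts, names ours) conveys the idea.

```python
import numpy as np
from itertools import permutations

def multi_object_prf(pred, gt):
    """FBMS-style evaluation sketch: try every one-to-one assignment of
    predicted cluster ids to ground-truth region ids (label 0 = background)
    and return the precision, recall, and F-score of the best assignment.
    Illustrative only; the official metric uses optimal bipartite matching."""
    pred_ids = [i for i in np.unique(pred) if i != 0]
    gt_ids = [i for i in np.unique(gt) if i != 0]
    best = (0.0, 0.0, 0.0)
    k = min(len(pred_ids), len(gt_ids))
    for perm in permutations(pred_ids, k):
        ps, rs = [], []
        for p_id, g_id in zip(perm, gt_ids):
            pm, gm = pred == p_id, gt == g_id
            inter = (pm & gm).sum()
            ps.append(inter / max(pm.sum(), 1))  # per-region precision
            rs.append(inter / max(gm.sum(), 1))  # per-region recall
        p, r = float(np.mean(ps)), float(np.mean(rs))
        f = 2 * p * r / (p + r) if p + r > 0 else 0.0
        if f > best[2]:
            best = (p, r, f)
    return best
```

A perfect prediction yields precision, recall, and F-score of 1 under the identity assignment.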


We also show results on the Complex Background [29] and Camouflaged Animal [6] datasets. These datasets are small and contain 5 and 9 sequences, respectively. Labels are corrected and provided by [5]. We use the same metrics for evaluation as the FBMS dataset.

Appendix D DAVIS-m

We hand-select 42 videos from the DAVIS2017 [36] train and val datasets (90 videos total) that roughly satisfy the rubric of [5]. We denote this dataset as DAVIS-m, and use it to supplement the small training dataset of FBMS (29 videos). In hand-selecting these videos, we make sure that only (and all of) the foreground objects are labeled, and that the foreground objects are correctly separated into different objects. For example, the video classic-car shows two people in a car, with a segmentation mask for the car and separate segmentation masks for the people. This is incredibly difficult for an algorithm to properly segment using motion cues (and does not fit the rubric of [5]), so it is not included in DAVIS-m. The 42 selected videos are listed in Table 6. There are 27 videos that have a single object (i.e., video foreground segmentation) and 15 videos with multiple objects.

Multi-object Foreground
boxing-fisheye bear
cat-girl bike-packing
disc-jockey blackswan
dog-gooses breakdance-flare
dogs-jump bus
gold-fish car-shadow
judo car-turn
kid-football cows
loading dance-twirl
night-race dog
pigs drift-chicane
planes-water drift-straight
sheep drift-turn
tuk-tuk elephant
walking flamingo
Table 6: DAVIS-m videos. The left column shows the 15 multi-object videos (2 or more objects), and the right column shows the 27 single-object videos (i.e. video foreground segmentation).

Appendix E Object Discovery results on FT3D

To facilitate motion segmentation and object discovery research, we provide our motion segmentation results on the FT3D [28] test set. We report the metrics described in [31, 7], namely precision, recall, F-score, and Obj, for both the multi-object and foreground settings. We trained our full model for 150k iterations using the motion segmentation labels we extracted from the foreground labels [43] and object segmentation masks [28]. The results are provided in Table 7.

Multi-object: P 74.3, R 75.1, F 72.9, Obj 2.46
Foreground: P 96.4, R 97.7, F 96.9
Table 7: Results on the FT3D test set.