Video Object Segmentation and Tracking: A Survey

by   Rui Yao, et al.
Nanyang Technological University

Object segmentation and object tracking are fundamental research areas in the computer vision community. Both topics must handle some common challenges, such as occlusion, deformation, motion blur, and scale variation. In addition, the former must cope with heterogeneous objects, interacting objects, edge ambiguity, and shape complexity, while the latter suffers from difficulties in handling fast motion, out-of-view targets, and real-time processing. Combining the two problems as video object segmentation and tracking (VOST) can overcome their respective difficulties and improve their performance. VOST can be applied to many practical applications such as video summarization, high definition video compression, human computer interaction, and autonomous vehicles. This article aims to provide a comprehensive review of the state-of-the-art tracking methods, classify these methods into different categories, and identify new trends. First, we provide a hierarchical categorization of existing approaches, including unsupervised VOS, semi-supervised VOS, interactive VOS, weakly supervised VOS, and segmentation-based tracking methods. Second, we provide a detailed discussion and overview of the technical characteristics of the different methods. Third, we summarize the characteristics of the related video datasets, and provide a variety of evaluation metrics. Finally, we point out a set of interesting directions for future work and draw our own conclusions.



1. Introduction

The rapid development of intelligent mobile terminals and the Internet has led to an exponential increase in video data. To analyze and use video big data effectively, there is an urgent need to automatically segment and track the objects of interest in a video. Video object segmentation and tracking are two basic tasks in the field of computer vision. Object segmentation divides the pixels of a video frame into two subsets, the foreground target and the background region, and generates the object segmentation mask, which is a core problem of behavior recognition and video retrieval. Object tracking determines the exact location of the target in the video image and generates the object bounding box, which is a necessary step for intelligent monitoring, big data video analysis, and so on.

The segmentation and tracking problems of video objects seem to be independent, but they are actually inseparable. That is to say, the solution to one of the problems usually involves solving the other problem implicitly or explicitly. Obviously, by solving the object segmentation problem, it is easy to get a solution to the object tracking problem. On the one hand, accurate segmentation results provide reliable object observations for tracking, which can mitigate problems such as occlusion, deformation, and scaling, and help avoid tracking failures. Although not so obvious, the same is true for object tracking, which must provide at least a coarse solution to the problem of object segmentation. On the other hand, accurate object tracking results can also guide the segmentation algorithm to determine the object position, which reduces the impact of fast object movement, complex backgrounds, similar objects, etc., and improves object segmentation performance. Much research has observed that processing object segmentation and tracking simultaneously can overcome their respective difficulties and improve their performance. The related problems can be divided into two major tasks: video object segmentation (VOS) and video object tracking (VOT).

The goal of video object segmentation is to segment a particular object instance throughout the entire video sequence, given an object mask on the first frame that is obtained manually or automatically; this task has attracted great interest in the computer vision community. Recent VOS algorithms can be organized by their annotations. The unsupervised and interactive VOS methods denote the two extremes of the degree of user interaction with the method. At one extreme, the former can produce a coherent space-time region through a bottom-up process without any user input, that is, without any video-specific tags (Irani and Anandan, 1998; Grundmann et al., 2010; Brox and Malik, 2010; Lee et al., 2011; Faktor and Irani, 2014; Li et al., 2018). In contrast, the latter uses a strongly supervised interaction method that requires not only a pixel-level precise segmentation of the first frame (which is very time consuming for a human to provide), but also a human in the loop to correct errors (Li et al., 2005; Wang et al., 2014; Benard and Gygli, 2017; Caelles et al., 2018; Maninis et al., 2018). Between the two extremes are semi-supervised VOS approaches, which require manual annotation to define the foreground object and then automatically segment the remaining frames of the sequence (Ren and Malik, 2007; Tsai et al., 2012; Jain and Grauman, 2014; Caelles et al., 2017; Perazzi et al., 2017). In addition, because of the convenience of collecting video-level labels, another way to supervise VOS is to produce masks of objects given such labels (Zhang et al., 2015; Tang et al., 2013) or natural language expressions (Khoreva et al., 2018). However, as mentioned above, the VOS algorithm implicitly handles the process of tracking. That is, the bottom-up approach uses spatio-temporal motion and appearance similarity to segment the video in a fully automated manner. These methods read multiple or all image frames at once to take full advantage of the context of multiple frames, and segment the precise object mask. The datasets on which these methods are evaluated are dominated by short-term videos. Moreover, because these methods iteratively optimize energy functions or fine-tune a deep network, they can be slow.

In contrast to VOS, given a sequence of input images, the video object tracking method utilizes a class-specific detector to robustly predict the motion state (location, size, or orientation, etc.) of the object in each frame. In general, most VOT methods are especially suitable for processing long-term sequences. Since these methods only need to output the location, orientation, or size of the object, they operate online for fast processing. For example, tracking-by-detection methods utilize generative (Ross et al., 2008) and/or discriminative (Hare et al., 2016; Yao et al., 2012) appearance models to accurately estimate the object state. The impressive results of these methods demonstrate accurate and fast tracking. However, most algorithms are limited to generating bounding boxes or ellipses as their output, so when the object undergoes non-rigid or articulated motion, they are often subject to visual drift. To address this problem, part-based tracking methods (Yao et al., 2017a, b) have been presented, but they still use bounding boxes of parts for object localization. In order to leverage both precise object masks and fast object localization, segmentation-based tracking methods have been developed which combine video object segmentation and tracking (Bibby and Reid, 2008; Aeschliman et al., 2010; Wen et al., 2015; Yeo et al., 2017; Wang et al., 2018). Most of these methods estimate the object results (i.e., bounding boxes of the object and/or object masks) by a combination of bottom-up and top-down algorithms. The contours of deformable objects or articulated motions can be propagated efficiently using these methods.

In the past decade, a large number of video object segmentation and tracking (VOST) studies have been published in the literature. The field of VOST has a wide range of practical applications, including video summarization, high definition (HD) video compression, gesture control, and human interaction. For instance, VOST methods are widely applied to video summarization that exploits visual objects across multiple videos (Chu et al., 2015), and provide a useful tool that assists video retrieval or web browsing (Rochan et al., 2018). In the field of video compression, VOST is used in the MPEG-4 video-coding standard to implement content-based features and high coding efficiency (Kim and Hwang, 2002). In particular, VOST can encode a video shot as a still background mosaic obtained after compensating for the moving object, by utilizing the content-based representation provided by MPEG-4 (Colombari et al., 2007). Moreover, VOST can estimate a non-rigid target to achieve accurate localization and mask description, from which its motion instructions can be identified (Wang et al., 2017c). Such methods can replace simple human body language, especially various gesture controls.

1.1. Challenges and issues

Many problems in video object segmentation and tracking are very challenging. In general, VOS and VOT have some common challenges, such as background clutter, low resolution, occlusion, deformation, motion blur, and scale variation. But there are also specific challenges determined by the objectives and tasks. For example, VOT is complicated by fast motion, out-of-view targets, and real-time processing requirements. In addition, VOS is affected by heterogeneous objects, interacting objects, edge ambiguity, shape complexity, etc. A more detailed description is given in (Wu et al., 2013; Perazzi et al., 2016).

Figure 1. Taxonomy of video object segmentation and tracking.

To address these problems, tremendous progress has been made in the development of video object segmentation and tracking algorithms. These algorithms differ from each other mainly in how they handle the following issues in visual segmentation and tracking: (i) Which application scenario is suitable for VOST? (ii) Which object representation (i.e., point, superpixel, patch, and object) is adapted to VOS? (iii) Which image features are appropriate for VOST? (iv) How to model the motion of an object in VOST? (v) How to pre-process and post-process CNN-based VOS methods? (vi) Which datasets are suitable for the evaluation of VOST, and what are their characteristics? A number of VOST methods have been proposed that attempt to answer these questions for various scenarios. Motivated by this objective, this survey divides the video object segmentation and tracking methods into broad categories and provides a comprehensive review of some representative approaches. We hope to help readers gain valuable VOST knowledge and choose the most appropriate method for their specific VOST tasks. In addition, we discuss new trends in video object segmentation and tracking in the community, and hope to provide several interesting ideas for new methods.

1.2. Organization and contributions of this survey

As shown in Fig. 1, we summarize our organization in this survey. To investigate a suitable application scenario for VOST, we group these methods into five main categories: unsupervised VOS, semi-supervised VOS, interactive VOS, weakly supervised VOS, and segmentation-based tracking methods.

The unsupervised VOS algorithm typically relies on certain restrictive assumptions about the application scenario, so it does not require manual annotation of the first frame. According to how they discover primary objects using appearance and motion cues, in Sec. 2.1 we categorize them as background subtraction, point trajectory, over-segmentation, “object-like” segments, and convolutional neural network based methods. In Tab. 1, we also summarize some object representations (for example, pixel, superpixel, supervoxel, and patch) and image features. In Sec. 2.2, we describe the semi-supervised VOS methods for modeling the appearance representations and temporal connections, and performing segmentation and tracking jointly. In Tab. 3, we discuss various pre-processing and post-processing steps of CNN-based VOS methods. In Sec. 2.3, interactive VOS methods are summarized by the way of user interaction and motion cues. In Sec. 2.4, we discuss various kinds of weakly supervised information for video object segmentation. In Sec. 2.5, we group and describe the segmentation-based tracking methods, and explain the advantages and disadvantages of different bottom-up and joint-based frameworks, as shown in Tab. 5 and Tab. 6. In addition, we investigate a number of video datasets for video object segmentation and tracking, and explain the metrics of pixel-wise mask and bounding box based techniques. Finally, we present several interesting issues for future research in Sec. 4, to help researchers in other related fields explore the possible benefits of VOST techniques.

Although there are surveys on VOS (Perazzi et al., 2016; Erdem et al., 2004) and VOT (Yilmaz et al., 2006; Li et al., 2013; Wu et al., 2015), unlike our survey they are not directly applicable to joint video object segmentation and tracking. First, Perazzi et al. (Perazzi et al., 2016) present a dataset and evaluation metrics for VOS methods, and Erdem et al. (Erdem et al., 2004) propose measures to quantitatively evaluate the performance of VOST methods in 2004. In comparison, we focus on summarizing methods for not only video object segmentation but also object tracking. Second, Yilmaz et al. (Yilmaz et al., 2006) and Li et al. (Li et al., 2013) discuss generic object tracking algorithms, and Wu et al. (Wu et al., 2015) evaluate the performance of single object tracking; therefore, they differ from our discussion of segmentation-based tracking.

In this survey, we provide a comprehensive review of video object segmentation and tracking, and summarize our contributions as follows: (i) As shown in Fig. 1, a hierarchical categorization of existing approaches to video object segmentation and tracking is provided. We roughly classify methods into five categories. Then, for each category, different methods are further categorized. (ii) We provide a detailed discussion and overview of the technical characteristics of the different methods in unsupervised VOS, semi-supervised VOS, interactive VOS, and segmentation-based tracking. (iii) We summarize the characteristics of the related video datasets, and provide a variety of evaluation metrics.

2. Major methods

In this section, video object segmentation and tracking methods are grouped into five categories: unsupervised video object segmentation methods, semi-supervised video object segmentation methods, interactive video object segmentation methods, weakly supervised video object segmentation methods, and segmentation-based tracking methods.

2.1. Unsupervised video object segmentation

The unsupervised VOS algorithm does not require any user input; it finds objects automatically. In general, these methods assume that the objects to be segmented and tracked have distinct motions or appear frequently in the sequence of images. Below we review and discuss five groups of unsupervised methods.

2.1.1. Background subtraction

Early video segmentation methods were primarily geometry-based and limited to specific motion backgrounds. The classic background subtraction method models the background appearance of each pixel and treats rapidly changing pixels as foreground. Any significant difference between the image and the background model represents a moving object. The pixels that make up the changed region are marked for further processing, and a connected component algorithm is used to estimate the connected region corresponding to the object. This process is called background subtraction. Video object segmentation is thus achieved by constructing a representation of the scene called the background model and then finding deviations from the model for each input frame.
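The pipeline described above (build a background model, mark pixels that deviate from it, then group the marked pixels into connected regions) can be illustrated with a minimal sketch. It is not the method of any cited paper: it assumes grayscale frames stored as nested lists, a per-pixel mean as the background model, a fixed intensity threshold, and 4-connectivity. The function names and the threshold value are hypothetical choices made for illustration.

```python
from collections import deque


def build_background(frames):
    """Background model (assumed here): per-pixel mean over the given frames."""
    height, width = len(frames[0]), len(frames[0][0])
    count = len(frames)
    return [[sum(f[y][x] for f in frames) / count for x in range(width)]
            for y in range(height)]


def foreground_mask(frame, background, threshold=30):
    """Mark pixels whose absolute deviation from the model exceeds the threshold."""
    return [[abs(pixel - model) > threshold for pixel, model in zip(row, model_row)]
            for row, model_row in zip(frame, background)]


def connected_components(mask):
    """Group 4-connected foreground pixels with breadth-first search."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    regions = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            region = []
            while queue:
                cy, cx = queue.popleft()
                region.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions


# Toy data: three empty frames define the background; the new frame has two blobs.
empty = [[10] * 5 for _ in range(4)]
frame = [row[:] for row in empty]
frame[1][1] = frame[1][2] = 200  # a two-pixel object
frame[3][4] = 200                # a separate one-pixel object

background = build_background([empty, empty, empty])
regions = connected_components(foreground_mask(frame, background))
```

In this toy example `regions` contains two pixel groups, one per object. A per-pixel mean is only adequate for a static camera and a static scene, which is why the statistical background models surveyed in this section were developed.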

According to the dimension of the motion used, background subtraction methods can be divided into stationary backgrounds (Stauffer and Grimson, 2000; Elgammal et al., 2002; Han and Davis, 2012), backgrounds undergoing 2D parametric motion (Irani et al., 1994; Ren et al., 2003; Criminisi et al., 2006; Barnich and Van Droogenbroeck, 2011), and backgrounds undergoing 3D motions (Torr and Zisserman, 1998; Irani and Anandan, 1998; Brutzer et al., 2011).

Stationary backgrounds

Background subtraction became popular following the work of Wren et al. (Wren et al., 1997). They use a multiclass statistical model of the color of each pixel, $I(x, y)$, of a stationary background with a single 3D ($Y$, $U$, and $V$ color space) Gaussian, $I(x, y) \sim N(\mu(x, y), \Sigma(x, y))$. The model parameters (the mean $\mu(x, y)$ and the covariance $\Sigma(x, y)$) are learned from the color observations in several consecutive frames. After the background model is derived, for each pixel $(x, y)$ in the input video frame they calculate the likelihood that its color comes from $N(\mu(x, y), \Sigma(x, y))$, and pixels that deviate from the background model are marked as foreground pixels. However, Gao et al. (Gao et al., 2000) show that a single Gaussian would be insufficient to model the pixel value while accounting for acquisition noise. Therefore, some works improve the performance of background modeling by using a multimodal statistical model to describe the background color per pixel. For example, Stauffer and Grimson (Stauffer and Grimson, 2000) model each pixel as a mixture of Gaussians (MoG) and use an on-line approximation to update the model. Rather than explicitly modeling the values of all the pixels as one particular type of distribution, they model the values of a particular pixel as a mixture of Gaussians. In (Elgammal et al., 2002), Elgammal and Davis use nonparametric kernel density estimation (KDE) to model the per-pixel background. They construct a statistical representation of the scene background that supports sensitive detection of moving objects in the scene. In (Han and Davis, 2012), the authors propose a pixelwise background modeling and subtraction technique using multiple features, where generative (kernel density approximation (KDA)) and discriminative (support vector machine (SVM)) techniques are combined for classification.

Backgrounds undergoing 2D parametric motion

Instead of modeling stationary backgrounds, other methods model backgrounds undergoing 2D parametric motion. For instance, Irani et al. (Irani et al., 1994) detect and track occluding and transparent moving objects, and use temporal integration without assuming motion constancy. The temporal integration maintains the sharpness of the tracked object, while blurring objects that have other motions. Ren et al. (Ren et al., 2003) propose a background subtraction method based on a spatial distribution of Gaussians model for foreground detection from a non-stationary background. Criminisi et al. (Criminisi et al., 2006) present segmentation of videos by probabilistic fusion of motion, color, and contrast cues together with spatial and temporal priors. They build an automatic layer separation and background substitution method. Barnich and Van Droogenbroeck (Barnich and Van Droogenbroeck, 2011) introduce a universal sample-based background subtraction algorithm, which includes a pixel model and classification process, background model initialization, and updating of the background model over time.

Backgrounds undergoing 3D motions

Irani and Anandan (Irani and Anandan, 1998) describe a unified approach to handling moving-object detection in both 2D and 3D scenes. A 2D algorithm applies when the scene can be approximated by a plane and when the camera only rotates and zooms. A 3D algorithm works only when there is a significant depth change in the scene and the camera is translating. Their method bridges the gap between these two extremes. Torr and Zisserman (Torr and Zisserman, 1998) present a Bayesian method of motion segmentation using the constraints enforced by rigid motion. The motion models cover both 3D relations and 2D relations. In (Brutzer et al., 2011), Brutzer et al. evaluate background subtraction methods. They identify the main challenges of background subtraction, and then compare the performance of several background subtraction methods with post-processing.

Discussion. Because they target different settings, from stationary backgrounds to 3D motions, these methods have different properties. Overall, the aforementioned methods rely on the restrictive assumption that the camera is stable or slowly moving. They are also sensitive to model selection (2D or 3D) and cannot handle special cases such as non-rigid objects.

2.1.2. Point trajectory

Figure 2. Illustration of video object segmentation based on point trajectory (Peter et al., 2014). From left to right: frame 1 and 4: a shot of video, image 2 and 5: clustering of point, image 3 and 6: the results of segmentation.

The problem of video object segmentation can be solved by analyzing motion information over a longer period. Motion is a strong perceptual cue for segmenting a video into separate objects (Shi and Malik, 1998). In recent years, the approaches of (Brox and Malik, 2010; Lezama et al., 2011; Ochs and Brox, 2012; Fragkiadaki et al., 2012; Guizilini and Ramos, 2013; Chen et al., 2015; Held et al., 2016) use long-term motion information in the form of point trajectories to take advantage of the motion information available in multiple frames. Typically, these methods first generate point trajectories and then cluster the trajectories by using their affinity matrix. Finally, the clustered trajectories are used as prior information to obtain the video object segmentation result. We divide point trajectory based VOS methods into two subcategories based on the motion estimation used, namely, optical flow and feature tracking methods. Optical flow estimates a dense motion field from one frame to the next, while feature tracking follows a sparse set of salient image points over many frames.

Optical flow based methods

Many video object segmentation methods rely heavily on dense optical flow estimation and motion tracking. Optical flow is a dense field of displacement vectors that describes the motion of the pixels of each region. It is usually used to capture the spatio-temporal motion information of the objects in a video (Horn and Schunck, 1981). There are two basic assumptions:

  • The brightness is constant. That is, when the same object moves between different frames, its brightness does not change. This assumption is used to derive the basic optical flow equation.

  • Small movement. That is, the passage of time does not cause a drastic change in the object position, and the displacement between adjacent frames is relatively small.
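As a worked illustration of how the two assumptions above lead to the basic optical flow equation, the standard derivation can be written as follows. The notation is an assumption introduced here and not taken from the surveyed papers: $I(x, y, t)$ is the image intensity, $(u, v)$ is the displacement of a pixel between adjacent frames, and $I_x$, $I_y$, $I_t$ are the partial derivatives of $I$.

```latex
% Brightness constancy: a point keeps its intensity as it moves by (u, v)
% from one frame to the next (time step normalized to 1).
I(x + u,\; y + v,\; t + 1) = I(x, y, t)

% Small movement: a first-order Taylor expansion of the left-hand side is accurate.
I(x + u,\; y + v,\; t + 1) \approx I(x, y, t) + I_x\,u + I_y\,v + I_t

% Combining the two gives the optical flow constraint equation.
I_x\,u + I_y\,v + I_t = 0
```

This is a single equation in the two unknowns $u$ and $v$, so an additional constraint is needed to solve for the flow, for example a global smoothness term or the assumption that the flow is constant within a small window.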

Classic point tracking methods use the Kanade-Lucas-Tomasi (KLT) tracker (LUCAS, 1981) to generate sparse point trajectories. Brox and Malik (Brox and Malik, 2010) first perform point tracking to build sparse long-term trajectories and divide them into multiple clusters. They argue that, compared to a two-frame motion field, analyzing long-term point trajectories yields clusters that are more temporally consistent over many frames. To compute such point trajectories, this work runs the tracker developed in (Sundaram et al., 2010), which is based on large displacement optical flow (Brox and Malik, 2011). As shown in Fig. 2, Ochs et al. (Peter et al., 2014) propose to use a semi-dense point tracker based on optical flow (Brox and Malik, 2010), which can generate reliable trajectories over hundreds of frames with only a small drift while maintaining wide coverage of the video shot. Chen et al. (Chen et al., 2015) employ both global and local information of point trajectories to cluster trajectories into groups.

Feature tracking based methods

The above-mentioned methods first estimate the optical flow and then process the discontinuities. In general, optical flow measurement is difficult in areas with very little texture or movement. To address this problem, such methods impose smoothness constraints on the flow field for interpolation. However, one must first know the segmentation to avoid smoothing across motion discontinuities. Shi and Malik (Shi and Malik, 1998) define a motion feature vector at each pixel called the motion profile, and adopt normalized cuts (Shi and Malik, 2000) to divide a frame into motion segments. In (Lezama et al., 2011), a spatio-temporal video segmentation algorithm is proposed to incorporate long-range motion cues from past and future frames in the form of clusters of point tracks with coherent motion. Later, based on (Brox and Malik, 2010), Ochs and Brox (Ochs and Brox, 2012) introduce a variational method to obtain dense segmentations from such sparse trajectory clusters. Their method employs spectral clustering to group trajectories based on a higher-order motion model. Fragkiadaki et al. (Fragkiadaki et al., 2012) propose to detect discontinuities of embedding density between spatially neighboring trajectories. In order to segment non-rigid objects, they combine motion grouping cues to produce context-aware saliency maps. Moreover, a probabilistic 3D segmentation method (Held et al., 2016) is proposed to combine spatial, temporal, and semantic information to make better-informed decisions.
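The general recipe shared by the point trajectory methods in this subsection (generate trajectories, build a pairwise affinity matrix from their motion, then cluster) can be sketched as follows. This is a simplified illustration under stated assumptions rather than a reimplementation of any cited method. Trajectories are assumed to be given as equal-length lists of (x, y) positions; the affinity is assumed to be an exponential of the largest squared velocity difference over time; and connected components of the thresholded affinity graph are used as a plain stand-in for the spectral clustering used in the literature. All function names and parameter values are hypothetical.

```python
import math


def velocities(track):
    """Frame-to-frame displacement vectors of one trajectory [(x, y), ...]."""
    return [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(track, track[1:])]


def affinity(track_a, track_b, scale=1.0):
    """exp(-scale * d), where d is the largest squared velocity difference over time."""
    distance = max((ua - ub) ** 2 + (va - vb) ** 2
                   for (ua, va), (ub, vb) in zip(velocities(track_a), velocities(track_b)))
    return math.exp(-scale * distance)


def affinity_matrix(tracks, scale=1.0):
    """Symmetric matrix of pairwise trajectory affinities."""
    count = len(tracks)
    return [[affinity(tracks[i], tracks[j], scale) for j in range(count)]
            for i in range(count)]


def cluster(matrix, threshold=0.5):
    """Label trajectories by connected components of the graph affinity > threshold."""
    count = len(matrix)
    labels = [-1] * count
    current = 0
    for start in range(count):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(count):
                if labels[j] == -1 and matrix[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Toy data: two points moving right together and two static points.
tracks = [
    [(0, 0), (1, 0), (2, 0)],
    [(0, 5), (1, 5), (2, 5)],
    [(9, 9), (9, 9), (9, 9)],
    [(4, 4), (4, 4), (4, 4)],
]
labels = cluster(affinity_matrix(tracks))
```

Here the two moving trajectories receive one label and the two static ones another. Using the maximum difference over time means that two trajectories are separated as soon as they move differently in any frame, even if they move together in the rest of the sequence.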

Discussion. Background subtraction based methods explore motion information in the short term and do not perform well when objects remain static in some frames. In contrast, point trajectory methods usually use long-range trajectory motion similarity for video object segmentation. Objects captured by trajectory clusters have been shown to persist over long time spans. However, non-rigid objects or large motions lead to frequent pixel occlusions or dis-occlusions, in which case point trajectories may be too short to be useful. Additionally, these methods lack a model of object appearance, i.e., they rely only on low-level bottom-up information.

2.1.3. Over-segmentation

Figure 3. Illustration of hierarchy of over-segmentations (Chang et al., 2013). From over-segmentation of a human contour (left) to superpixel (right).

Several over-segmentation approaches group pixels based on color, brightness, optical flow, or texture similarity and produce spatio-temporal segmentation maps (Stein et al., 2007; Yang et al., 2017). These methods generate an over-segmentation of the video into space-time regions. Fig. 3 shows the valid over-segmentations for a given human image. This example illustrates the difference between oversegmentations and superpixels. Although the terms “over-segmentation” and “super-pixel segmentation” are often used interchangeably, there are some differences between them. A superpixel segmentation is an over-segmentation that preserves most of the structure necessary for segmentation.

However, the large number of over-segmented regions makes optimization over sophisticated models intractable. Most current methods for unsupervised video object segmentation are graph-based (Grundmann et al., 2010; Levinshtein et al., 2012; Jang et al., 2016). Graph-based approaches to pixel, superpixel, or supervoxel generation treat each pixel as a node in a graph, where the vertices $V$ of the graph are partitioned into disjoint subgraphs, $A_1, \dots, A_N$, by pruning the weighted edges of the graph. The task is to assign a label $l_i \in \mathcal{L}$ to each vertex $v_i \in V$ by minimizing a cost function defined over the graph. Given a set of vertices $V$ and a finite set of labels $\mathcal{L}$, an energy function is:

$$E(l) = \sum_{v_i \in V} \phi_i(l_i) + \lambda \sum_{(v_i, v_j) \in \mathcal{N}} w_{ij}\, \psi_{ij}(l_i, l_j)$$

where the unary term $\phi_i$ expresses how likely a label $l_i$ is for pixel $v_i$, the pairwise term $\psi_{ij}$ represents how likely labels $l_i$ and $l_j$ are for neighboring pixels $v_i$ and $v_j$, and $\mathcal{N}$ is the collection of neighboring pixel pairs. The coefficients $w_{ij}$ are the weights, and $\lambda$ is a balancing parameter.
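An energy of this kind, with a unary term per pixel and a pairwise term over neighboring pixel pairs, can be illustrated with a minimal sketch. It assumes a 4-connected pixel grid, a Potts pairwise term (cost 1 when two neighbors take different labels, 0 otherwise) with uniform weights, and iterated conditional modes (ICM) as a simple greedy minimizer chosen for brevity. The surveyed methods use their own potentials and solvers; all names and cost values below are hypothetical.

```python
def neighbor_pairs(height, width):
    """All 4-connected neighboring pixel pairs on a height x width grid."""
    pairs = []
    for y in range(height):
        for x in range(width):
            if x + 1 < width:
                pairs.append(((y, x), (y, x + 1)))
            if y + 1 < height:
                pairs.append(((y, x), (y + 1, x)))
    return pairs


def energy(labels, unary, pairs, weight=1.0):
    """Unary costs plus weight times the number of disagreeing neighbor pairs."""
    data = sum(unary[y][x][labels[y][x]]
               for y in range(len(labels)) for x in range(len(labels[0])))
    smooth = sum(1 for (ay, ax), (by, bx) in pairs if labels[ay][ax] != labels[by][bx])
    return data + weight * smooth


def icm(unary, weight=1.0, sweeps=5):
    """Greedy minimization: set each pixel to its locally cheapest label in turn."""
    height, width = len(unary), len(unary[0])
    label_ids = range(len(unary[0][0]))
    # Start from the label with the lowest unary cost at every pixel.
    labels = [[min(label_ids, key=lambda l: unary[y][x][l]) for x in range(width)]
              for y in range(height)]
    for _ in range(sweeps):
        changed = False
        for y in range(height):
            for x in range(width):
                around = [labels[ny][nx]
                          for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                          if 0 <= ny < height and 0 <= nx < width]

                def local_cost(l):
                    return unary[y][x][l] + weight * sum(1 for n in around if n != l)

                best = min(label_ids, key=local_cost)
                if best != labels[y][x]:
                    labels[y][x] = best
                    changed = True
        if not changed:
            break
    return labels


# Toy data: label 0 = background, label 1 = foreground. Every pixel prefers
# foreground except the center, which weakly prefers background.
unary = [[[2.0, 0.0] for _ in range(3)] for _ in range(3)]
unary[1][1] = [0.0, 0.5]
pairs = neighbor_pairs(3, 3)

initial = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]  # per-pixel best labels, no smoothing
result = icm(unary)
energy_before = energy(initial, unary, pairs)
energy_after = energy(result, unary, pairs)
```

In this toy example the pairwise term outweighs the weak unary preference of the center pixel, so the minimizer relabels it as foreground and the energy drops. ICM only finds a local minimum; it is used here because it fits in a few lines.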

References # Features Optical flow Methods
Background subtraction methods
Han (Han and Davis, 2012) M RGB color, gradient, Haar-like Stationary backgrounds: generative (KDA) and discriminative(SVM) techniques
Stauffer (Stauffer and Grimson, 2000) M RGB color Stationary backgrounds: model each pixel as MoG and use on-line approximation to update
Criminisi (Criminisi et al., 2006) M YUV color 2D motion: probabilistic fusion of motion, color and contrast cues using CRF
Ren (Ren et al., 2003) M RGB color 2D motion: estimate motion compensation using Gaussians model
Torr (Torr and Zisserman, 1998) M Corner features 3D motion: a maximum likelihood and EM approach to clustering
Irani (Irani and Anandan, 1998) M RGB color 3D motion: unified approach to handling moving-object detection in both 2D and 3D scenes
Point trajectory methods
Ochs (Peter et al., 2014) M RGB color Optical flow: build spare long-term trajectories using optical flow
Fragkiadaki (Fragkiadaki et al., 2012) M RGB color Optical flow: combine motion grouping cues with context-aware saliency maps
Held (Held et al., 2016) M Centroid, shape Feature tracking: probabilistic framework to combine spatial, temporal, and semantic information
Lezama (Lezama et al., 2011) M RGB color Feature tracking: spatio-temporal graph-based video segmentation with long-rang motion cues
Over-segmentation methods
Giordano (Giordano et al., 2015) M RGB color Superpixel: generate coarse segmentation with superpixel and refine mask with energy functions
Chang (Chang et al., 2013) M LAB color, image axes Supervoxel: object parts in different frames, model flow between frames with a bilateral Gaussian process
Grundmann (Grundmann et al., 2010) M distance of the normalized color histograms Supervoxel: graph-based method, iteratively process over multiple levels
Stein (Stein et al., 2007) M Brightness and color gradient Boundaries: graphical model, learn local classifier and global inference
Convolutional neural network methods
Tokmakov (Tokmakov et al., 2017a) S Deep features Build instance embedding network and link objects in video
Jain (Dutt Jain et al., 2017) S Deep features Design a two-stream fully CNN to combine appearance and motion information
Tokmakov (Tokmakov et al., 2017b) S Deep features Trains a two-stream network RGB and optical flow, then feed into ConvGRU
Vijayanarasimha (Vijayanarasimhan et al., 2017) M Deep features Geometry-aware CNN to predict depth, segmentation, camera and rigid object motions
Tokmakov (Tokmakov et al., 2017a) S Deep features Encoder-decoder architecture, coarse representation of the optical flow, then refines it iteratively to produce motion labels
Song (Song et al., 2018) S Deep features Pyramid dilated bidirectional ConvLSTM architecture, and CRF-based post-process
Continued on next page
Table 1. Summary of some major unsupervised VOS methods. #: number of objects, S: single, M: multiple.
References # Features Optical flow Methods
“Object-like” segments methods
Li (Li et al., 2014) M Bag-of-words on color SIFT Segment: generate a pool of segment proposals, and online update the model
Lee (Lee et al., 2011) S distance between unnormalized color histograms discover key-segments, space-time MRF for foreground object segmentation
Koh (Jun Koh et al., 2018) M Deep features Saliency: extract object instances using sequential clique optimization
Wang (Wang et al., 2015) S Edges, motion Saliency: generate frame-wise spatio-temporal saliency maps using geodesic distance
Faktor (Faktor and Irani, 2014) S RGB color, local structure Saliency: combine motion saliency and visual similarity across large time lapses
Wang (Chaohui Wang et al., 2009) M RGB color Object proposals: MRF for segmentation, depth ordering and tracking with occlusion handling
Perazzi (Perazzi et al., 2015) S ACC, HOOF, NG, HOG Object proposals: geodesic superpixel edge Fully connected CRF using multiple object proposals
Tsai (Tsai et al., 2016a) S Color GMM, CNN feature Object proposals: CRF for joint optimization of segmentation and optical flow
Koh (Koh and Kim, 2017b) S Bag-of-visual-words with LAB color Object proposals: ultrametric contour maps (UCMs) in each frame, refine primary object regions
Table 1. ACC: Area, centroid, average color. HOOF: Histogram of Oriented Optical Flow. NG: Objectness via normalized gradients. HOG: Histogram of oriented gradients.

Superpixel representation. Shi and Malik (Shi and Malik, 2000) propose the graph-based normalized cut to overcome the oversegmentation problem. The term superpixel was coined by Ren and Malik (Ren and Malik, 2003) in their work on learning a binary classifier that can segment natural images. They use the normalized cut algorithm (Shi and Malik, 2000) for extracting the superpixels, with contour and texture cues incorporated. Levinshtein et al. (Levinshtein et al., 2012) introduce the concept of spatio-temporal closure, and automatically recover coherent components in images and videos, corresponding to objects and object parts. Chang et al. (Chang et al., 2013) develop a graphical model for temporally consistent superpixels in video sequences, and propose a set of novel metrics to quantify the performance of a temporal superpixel representation: object segmentation consistency, 2D boundary accuracy, intra-frame spatial locality, inter-frame temporal extent, and inter-frame label consistency. Giordano et al. (Giordano et al., 2015) generate a coarse foreground segmentation to provide predictions about motion regions by analyzing the superpixel segmentation changes in consecutive frames, and refine the initial segmentation by optimizing an energy function. Yang et al. (Yang et al., 2016) propose a superpixel-based appearance modeling technique for automatic primary video object segmentation in the Markov random field (MRF) framework. Jang et al. (Jang et al., 2016) introduce three foreground and background probability distributions (Markov, spatio-temporal, and antagonistic) and minimize a hybrid of the corresponding energies to separate a primary object from its background. Furthermore, they refine the superpixel-level segmentation results. Yang et al. (Yang et al., 2017) introduce a multiple granularity analysis framework to handle a spatio-temporal superpixel labeling problem.

Supervoxel representation. Several supervoxel methods (a supervoxel being the video analog of a superpixel) over-segment a video into spatio-temporal regions of uniform motion and appearance (Grundmann et al., 2010; Oneata et al., 2014). Grundmann et al. (Grundmann et al., 2010) over-segment a volumetric video graph into space-time regions grouped by appearance, and propose a hierarchical graph-based algorithm for spatio-temporal segmentation of long video sequences. Oneata et al. (Oneata et al., 2014) build a 3D space-time voxel graph to produce spatial, temporal, and spatio-temporal proposals by a randomized supervoxel merging process. Their algorithm is based on a fast superpixel algorithm, simple linear iterative clustering (SLIC) (Achanta et al., 2012).
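For orientation, SLIC clusters pixels with a distance that combines color proximity in CIELAB space and spatial proximity, where the spatial term is normalized by the sampling interval S and weighted by a compactness parameter m. The sketch below follows the distance measure published with SLIC (Achanta et al., 2012); the function name, argument layout, and default compactness are illustrative assumptions, and a complete implementation would wrap this in the grid-initialized k-means loop.

```python
import math


def slic_distance(pixel, center, grid_interval, compactness=10.0):
    """Distance between a pixel and a cluster center in the style of SLIC.

    pixel, center: (l, a, b, x, y) tuples, i.e. CIELAB color plus position.
    grid_interval: sampling interval S, roughly sqrt(num_pixels / num_superpixels).
    compactness:   weight m; larger values favor compact, regular superpixels.
    """
    l1, a1, b1, x1, y1 = pixel
    l2, a2, b2, x2, y2 = center
    # Euclidean distance in color space.
    d_color = math.sqrt((l1 - l2) ** 2 + (a1 - a2) ** 2 + (b1 - b2) ** 2)
    # Euclidean distance in the image plane.
    d_space = math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)
    # Spatial term is normalized by S and scaled by m before combining.
    return math.sqrt(d_color ** 2 + (d_space / grid_interval) ** 2 * compactness ** 2)
```

With a small compactness value the color term dominates and superpixels follow image boundaries more closely; with a large value they become more grid-like.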

In addition to superpixel representation, some other video object segmentation methods are based on over-segmentation into boundaries (Wang, 1998; Stein et al., 2007) or patches (Huang et al., 2009; Schiegg et al., 2014). Wang (Wang, 1998) proposes an unsupervised video segmentation method with spatial segmentation, marker extraction, and a modified watershed transformation. The algorithm partitions the first frame into homogeneous regions based on intensity, and uses motion estimation to estimate motion parameters for each region. Stein et al. (Stein et al., 2007) present a framework for introducing motion as a cue in the detection and grouping of object or occlusion boundaries. A hypergraph cut method (Huang et al., 2009) over-segments each frame in the sequence and takes the over-segmented image patches as the vertices of the graph. Schiegg et al. (Schiegg et al., 2014) build an undirected graphical model that couples decisions over all of space and all of time, and jointly segment and track a time series of over-segmented images/volumes for multiple dividing cells.

Discussion. In general, the over-segmentation approaches occupy the space between single pixel matching and standard segmentation approaches. These algorithms reduce the computational complexity, since disparities only need to be estimated per segment rather than per pixel. However, in more complex videos, over-segmentation requires additional knowledge and is sensitive to boundary strength fluctuations from frame to frame.
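The saving from working per segment rather than per pixel can be illustrated with a small sketch. It assumes a label map is already available from some over-segmentation, and it pools a generic per-pixel quantity into one value per segment; the function name and data layout are hypothetical.

```python
from collections import defaultdict


def per_segment_mean(labels, values):
    """Average a per-pixel quantity over each segment.

    labels: 2D list of integer segment ids (one per pixel).
    values: 2D list of floats with the same shape as labels.
    Returns a dict mapping segment id to the mean value inside that segment.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for label_row, value_row in zip(labels, values):
        for segment_id, value in zip(label_row, value_row):
            sums[segment_id] += value
            counts[segment_id] += 1
    return {segment_id: sums[segment_id] / counts[segment_id] for segment_id in sums}
```

Later stages then operate on one unknown per segment, so a frame with hundreds of thousands of pixels may reduce to a few hundred variables, at the price of inheriting any errors in the segment boundaries.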

2.1.4. “Object-like” segments

Several recent approaches aim to upgrade the low-level grouping of pixels (such as pixel, superpixel, and supervoxel) to object-like segments (Fukuchi et al., 2009; Sundberg et al., 2011; Papazoglou and Ferrari, 2014). Although the details differ, the main idea is to generate a foreground object hypothesis for each frame using a learned model of "object-like" regions (such as salient objects, and object proposals from background).

Salient objects. Several works (Rahtu et al., 2010; Wang et al., 2015; Hu et al., 2018a) introduce saliency information as prior knowledge to discover visually important objects in a video. A spatio-temporal video modeling and segmentation method is proposed to partition the video sequence into homogeneous segments by selecting salient frames (Song and Fan, 2007). Fukuchi et al. (Fukuchi et al., 2009) propose an automatic video object segmentation method based on visual saliency with maximum a posteriori (MAP) estimation of the MRF using graph cuts. Rahtu et al. (Rahtu et al., 2010) present a salient object segmentation method that combines a saliency measure with a Conditional Random Field (CRF) model using local feature contrast in illumination, color, and motion information. Papazoglou and Ferrari (Papazoglou and Ferrari, 2014) compute a motion saliency map using optical flow boundaries, and handle fast moving backgrounds and objects exhibiting a wide range of appearances, motions, and deformations. Faktor and Irani (Faktor and Irani, 2014) perform saliency votes at each pixel, and iteratively correct those votes by consensus voting of re-occurring regions across the video sequence to separate a segment track from the background. Wang et al. (Wang et al., 2015) produce saliency results via the geodesic distances to background regions in the subsequent frames, and build global appearance models for foreground and background based on the saliency maps. Hu et al. (Hu et al., 2018a) propose a saliency estimation method and a neighborhood graph based on optical flow and edge cues for unsupervised video object segmentation. Koh et al. (Jun Koh et al., 2018) generate object instances in each frame and develop the sequential clique optimization algorithm to consider both the saliency and similarity energies, then convert the tracks into video object segmentation results.

Object proposals. Recently, video object segmentation methods generate object proposals in each frame, and then rank several object candidates to build object and background models (Lee et al., 2011; Li et al., 2014; Xiao and Jae Lee, 2016; Koh and Kim, 2017a; Tsai et al., 2016a). Typically, these methods fall into three main categories: (i) object regions based on figure-ground segmentations; (ii) object proposals based on optical flow; (iii) object proposals based on bounding boxes.

Figure 4. Illustration of a set of key-segments to generate a foreground object segmentation of the video (Lee et al., 2011). Left: video frames. Right: Foreground object segments.

For (i), the proposals are obtained by multiple static figure-ground segmentations similar to (Carreira and Sminchisescu, 2012; Endres and Hoiem, 2010). These works generate generic foreground objects using several image cues such as color, texture, and boundary. Brendel and Todorovic (Brendel and Todorovic, 2009) segment a set of regions by using any available low-level segmenter in each frame, and cluster similar regions across the video. In (Lee et al., 2011), Lee et al. discover key-segments and group them to predict the foreground objects in a video. As shown in Fig. 4, their work generates a diverse set of object proposals or key-segments in each frame using the static region-ranking method of (Endres and Hoiem, 2010). Later, Ma and Latecki produce a bag of object proposals in each frame using (Endres and Hoiem, 2010), and build a video object segmentation algorithm using maximum weight cliques with mutex constraints. Banica et al. (Banica et al., 2013) generate multiple figure-ground segmentations based on boundary and optical flow cues, and construct multiple plausible partitions corresponding to the static and the moving objects. Moreover, Li et al. (Li et al., 2014) produce a pool of segment proposals using the figure-ground segmentation algorithm, and present a new composite statistical inference approach for refining the obtained segment tracks. To handle occlusion, which limits the method of (Li et al., 2014), Wu et al. (Wu et al., 2015) propose a video segment proposal approach that starts segments from any frame and tracks them through complete occlusions. Koh and Kim (Koh and Kim, 2017b) generate candidate regions using both color and motion edges, estimate initial primary object regions, and augment the initial regions with missing parts or reduce them. In addition, visual semantics (Tsai et al., 2016b) or context (Wang and Wang, 2016) cues are used to generate object proposals and infer the candidates for subsequent frames.

For (ii), instead of modeling the foreground regions of static object proposals, some video object segmentation methods rely on optical flow based object proposals (Fragkiadaki et al., 2015; Zhang et al., 2013). Lalos et al. (Lalos et al., 2010) propose the object flow to estimate both the displacement and the direction of an object-of-interest. However, their work does not solve the problem of flow estimation and segmentation. Therefore, Fragkiadaki et al. (Fragkiadaki et al., 2015) present a method to generate moving object proposals from multiple segmentations on optical flow boundaries, and extend the top ranked segments into spatio-temporal tubes using random walkers. Zhang et al. (Zhang et al., 2013) present an optical flow gradient motion scoring function for the selection of object proposals, which discriminates between moving objects and the background.
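To make the idea of scoring proposals by optical flow gradients concrete, the following is a minimal sketch under stated assumptions. It is not the scoring function of Zhang et al.; it simply averages the flow-gradient magnitude along a proposal's boundary, so that proposals whose outline coincides with a motion discontinuity score higher. The dense flow field is assumed to be given as two 2D lists, and both function names are hypothetical.

```python
import math


def flow_gradient_magnitude(flow_u, flow_v):
    """Central-difference gradient magnitude of a dense flow field.

    flow_u, flow_v: 2D lists holding horizontal and vertical flow per pixel.
    Neighbor indices are clamped at the image border.
    """
    h, w = len(flow_u), len(flow_u[0])
    magnitude = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            xl, xr = max(x - 1, 0), min(x + 1, w - 1)
            yu, yd = max(y - 1, 0), min(y + 1, h - 1)
            du_dx = (flow_u[y][xr] - flow_u[y][xl]) / 2.0
            du_dy = (flow_u[yd][x] - flow_u[yu][x]) / 2.0
            dv_dx = (flow_v[y][xr] - flow_v[y][xl]) / 2.0
            dv_dy = (flow_v[yd][x] - flow_v[yu][x]) / 2.0
            magnitude[y][x] = math.sqrt(du_dx ** 2 + du_dy ** 2 + dv_dx ** 2 + dv_dy ** 2)
    return magnitude


def motion_score(mask, gradient_magnitude):
    """Mean flow-gradient magnitude over the proposal's boundary pixels.

    A boundary pixel is a mask pixel with at least one 4-connected
    neighbor inside the image that lies outside the mask.
    """
    h, w = len(mask), len(mask[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            neighbors = ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
            on_boundary = any(
                0 <= ny < h and 0 <= nx < w and not mask[ny][nx]
                for ny, nx in neighbors
            )
            if on_boundary:
                total += gradient_magnitude[y][x]
                count += 1
    return total / count if count else 0.0
```

Ranking proposals by such a score favors regions that move differently from their surroundings, while proposals on static background score close to zero.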

For (iii), several recent methods (Koh and Kim, 2017a; Xiao and Jae Lee, 2016) employ bounding box based spatio-temporal object proposals to segment video objects. Xiao and Lee (Xiao and Jae Lee, 2016) present an unsupervised algorithm to generate a set of spatio-temporal video object proposal boxes and pixel-wise segmentations. Koh and Kim (Koh and Kim, 2017a) use an object detector and tracker to generate multiple bounding box tracks for objects, transform each bounding box into a pixel-wise segment, and refine the segment tracks.

In addition, many researchers have exploited the Gestalt principle of “common fate” (Koffka, 2013), where similarly moving points are perceived as coherent entities, and grouping based on motion points out occlusion/disocclusion phenomena. In (Sundberg et al., 2011), Sundberg et al. exploit motion cues and distinguish occlusion boundaries from internal boundaries based on optical flow to detect and segment foreground objects. Taylor et al. (Taylor et al., 2015) infer the long-term occlusion relations in video, and use them within a convex optimization framework to segment the image domain into regions. Furthermore, a video object segmentation method detects disocclusion in video of 3D scenes and partitions the disoccluded regions into objects.

Discussion. Object-like regions (such as salient objects and object proposals) have been very popular as a preprocessing step for video object segmentation problems. Holistic object proposals can often cover entire objects using optical flow, boundary, semantics, shape, and other global appearance features, leading to better video object segmentation accuracy. However, these methods often generate many false positives, such as background proposals, which reduce segmentation performance. Moreover, these methods usually require heavy computational loads to generate object proposals and associate thousands of segments.

2.1.5. Convolutional neural networks (CNN)

Prior to the impressive success of deep CNNs, video object segmentation methods relied on hand-crafted features and did not leverage a learned video representation to build the appearance and motion models.

Recently, there have been attempts to build CNNs for video object segmentation. An early primary video object segmentation method first generates salient objects using complementary convolutional neural networks (Li et al., 2017), then propagates the video objects using superpixel-based neighborhood reversible flow in the video. Later, several video object segmentation methods employ deep convolutional neural networks in an end-to-end manner. The methods in (Tokmakov et al., 2017a, b; Dutt Jain et al., 2017; Vijayanarasimhan et al., 2017) build a dual-branch CNN to segment video objects. MP-Net (Tokmakov et al., 2017a) takes the optical flow field of two consecutive frames of a video sequence as input and produces per-pixel motion labels. To address the MP-Net framework's lack of object appearance features, Tokmakov et al. (Tokmakov et al., 2017b) integrate a stream with appearance information and a visual memory module based on convolutional Gated Recurrent Units (GRU) (Xingjian et al., 2015). FSEG (Dutt Jain et al., 2017) also proposes a two-stream network with appearance and optical flow motion streams, trained with mined supplemental data. SfM-Net (Vijayanarasimhan et al., 2017) combines two streams, motion and structure, to learn object masks and motion models without mask annotations by differentiable rendering. Li et al. (Li et al., 2018) transfer the knowledge encapsulated in image-based instance embedding networks, and adapt the instance networks to video object segmentation. In addition, they propose a motion-based bilateral network, and then build a graph cut model to propagate the pixel-wise labels. In (Goel et al., 2018), a deep reinforcement learning method is proposed to automatically detect moving objects with the relevant information for action selection. Recently, Song et al. (Song et al., 2018) present a video salient object detection method using a pyramid dilated bidirectional ConvLSTM architecture, and apply it to unsupervised VOS. Then, based on the CNN-ConvLSTM architecture, Wang et al. (Wang et al., 2019) propose a visual attention-driven unsupervised VOS model. Additionally, they collect unsupervised VOS human attention data from the DAVIS (Perazzi et al., 2016), Youtube-Objects (Prest et al., 2012), and SegTrack v2 (Li et al., 2014) datasets.

2.1.6. Discussion

Without any human annotation, unsupervised methods segment the foreground object on an initial frame automatically. They do not require user interaction to specify an object to segment. In other words, these methods exploit saliency, semantics, optical flow, or motion information to generate primary objects, and then propagate them to the remainder of the frames. However, these unsupervised methods are not able to segment a specific object, due to motion confusion between different instances and dynamic backgrounds. Furthermore, they are computationally expensive because of the many unrelated, interfering object-like proposals. A qualitative comparison of some major unsupervised VOS methods is listed in Tab. 1.

2.2. Semi-supervised video object segmentation

Semi-supervised video object segmentation methods are given an initial object mask in the first frame or in key frames, and then segment the object in the remaining frames. Typically, they fall into two main categories: spatio-temporal graph based and CNN based semi-supervised VOS.

2.2.1. Spatio-temporal graphs

Early methods often solve a spatio-temporal graph with hand-crafted feature representations including appearance, boundary, and optical flow, and propagate the foreground region through the entire video. These methods typically rely on two important cues: object representation of graph structure and spatio-temporal connections.

Object representation of graph structure. Typically, the task is formulated as a spatio-temporal label propagation problem. These methods tackle the problem by building graph structures over an object representation of (i) pixels, (ii) superpixels, or (iii) object patches to infer the labels for subsequent frames.

For (i), the pixels can maintain very fine boundaries, and are incorporated into the graph structure for video object segmentation. Tsai et al. (Tsai et al., 2012) perform MRF optimization using pixel appearance similarity and motion coherence to separate a foreground object from the background. In (Märki et al., 2016), given some user input as a set of known foreground and background pixels, Märki et al. design a regularly sampled spatio-temporal bilateral grid that implicitly approximates long-range, spatio-temporal connections between pixels. Several methods build dense (Wang et al., 2017b) or sparse trajectories (Ellis and Zografos, 2013) to segment the moving objects in video by using a probabilistic model.

References Object rep. Connections Appearance features Optical flow
Wang (Wang et al., 2017b) Pixel Dense point clustering Spatial location, color, velocity
Märki (Märki et al., 2016) Pixel Spatio-temporal lattices Image axes, YUV color
Tsai (Tsai et al., 2016a) Superpixel CRF RGB color, CNN
Jang (Jang and Kim, 2016) Superpixel MRF LAB color
Perazzi (Perazzi et al., 2015) Patch CRF HOG
Fan (Fan et al., 2015) Patch Nearest neighbor fields SURF, RGB color
Jain (Jain and Grauman, 2014) Superpixel MRF Color histogram
Ramakanth (Avinash Ramakanth and Venkatesh Babu, 2014) Patch Nearest neighbor fields Color histogram
Ellis (Ellis and Zografos, 2013) Pixel Sparse point tracking RGB color
Badrinarayanan (Badrinarayanan et al., 2013) Patch Mixture of tree Semantic texton forests feature
Budvytis (Budvytis et al., 2012) Superpixel Mixture of tree Semantic texton forests feature
Tsai (Tsai et al., 2012) Pixel MRF RGB color
Ren (Ren and Malik, 2007) Superpixel CRF Local brightness, color and texture
Wang (Wang et al., 2004) Superpixel Mean-shift Image axes, RGB color
Patras (Patras et al., 2003) Patch Maximization of joint probability Image axes, RGB color
Table 2. Summary of spatio-temporal graphs of semi-supervised VOS methods. Object rep. denotes object representation.

For (ii), to avoid the high computational cost and noisy temporal links of pixel-based graphs, many methods extract superpixels from the input frames and construct a superpixel graph. Each node in the graph represents a label. An edge is added between any two adjacent neighbors. Graph structures such as CRF (Ren and Malik, 2007; Tsai et al., 2016a; Wang, Jin, Liu, Liu, Zhang, Chen, Zhang, Guo, and Shao, Wang et al.), MRF (Jain and Grauman, 2014), or mixture of trees (Budvytis et al., 2012) can be integrated into the framework to further improve the accuracy. For instance, Ren and Malik (Ren and Malik, 2007) generate a set of superpixels in the images, and build a CRF to segment figure from background in each frame. Figure/ground segmentation then operates sequentially in each frame by utilizing both static image cues and temporal coherence cues. Rather than using a probabilistic graphical model to segment images independently, other graphical models decompose each frame into spatial nodes, and seek the foreground-background label assignment that maximizes both appearance consistency and label smoothness in space and time. Jain and Grauman (Jain and Grauman, 2014) present a higher order spatio-temporal superpixel label consistency potential for video object segmentation. In (Budvytis et al., 2012), a mixture of trees model is presented to link superpixels from the first to the last frame, and obtain superpixel labels and their confidence. In addition, some methods use the mean shift (Wang et al., 2004) and random walker (Jang and Kim, 2016) algorithms to separate the foreground from the background.
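The graph construction described here, with one node per superpixel and an edge between adjacent neighbors, can be sketched as follows. This is a minimal single-frame illustration that assumes a superpixel label map is already available; the function name is hypothetical, and temporal edges between frames as well as the CRF or MRF potentials would be added on top.

```python
def superpixel_adjacency(labels):
    """Collect undirected edges between superpixels that share a pixel border.

    labels: 2D list of integer superpixel ids for one frame.
    Returns a set of (smaller_id, larger_id) pairs using 4-connectivity.
    """
    h, w = len(labels), len(labels[0])
    edges = set()
    for y in range(h):
        for x in range(w):
            a = labels[y][x]
            # Checking the right and bottom neighbors visits each pixel pair once.
            for ny, nx in ((y, x + 1), (y + 1, x)):
                if ny < h and nx < w:
                    b = labels[ny][nx]
                    if a != b:
                        edges.add((min(a, b), max(a, b)))
    return edges
```

The resulting edge set is typically far smaller than a pixel-level grid graph, which is the main motivation for the superpixel representation in this setting.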

For (iii), to make video object segmentation more efficient, researchers embed per-frame object patches and employ different techniques to select a set of temporally coherent segments by minimizing an energy function of the spatio-temporal graph. For example, Ramakanth et al. (Avinash Ramakanth and Venkatesh Babu, 2014) and Fan et al. (Fan et al., 2015) employ an approximate nearest neighbor algorithm to compute a mapping between two frames or fields, and then predict the labels. Tsai et al. (Tsai et al., 2016a) utilize a CRF model to assign each pixel a foreground or background label. Perazzi et al. (Perazzi et al., 2015) formulate the problem as the minimization of a novel energy function defined over a fully connected CRF of object proposals, and use maximum a posteriori inference to obtain the foreground-background segmentation. Following the work in (Budvytis et al., 2012), Badrinarayanan et al. (Badrinarayanan et al., 2013) propose a patch-based temporal tree model to link patches between frames. Patras et al. (Patras et al., 2003) use the watershed algorithm to obtain a color-based segmentation in each frame, and employ an iterative local joint probability search algorithm to generate a sequence of labels.

Spatio-temporal connections. Another important cue is how to estimate temporal connections between nodes, for example by using spatio-temporal lattices (Märki et al., 2016), nearest neighbor fields (Avinash Ramakanth and Venkatesh Babu, 2014; Fan et al., 2015), or mixtures of trees (Budvytis et al., 2012; Badrinarayanan et al., 2013). Some methods even build long-range connections using appearance-based methods (Perazzi et al., 2015; Patras et al., 2003). Besides the temporal connections, another important issue is the choice of optimization algorithm. Some algorithms use a local greedy strategy to infer labels by considering only two or more adjacent frames at a time (Fan et al., 2015; Avinash Ramakanth and Venkatesh Babu, 2014; Märki et al., 2016; Ellis and Zografos, 2013), while other algorithms try to find globally optimal solutions considering all frames (Tsai et al., 2012; Perazzi et al., 2015). The locally optimal strategies perform segmentation on the fly, allowing for applications where data arrives sequentially, while globally optimal solutions overcome the limitation of short-range interactions. A brief summary of spatio-temporal graph methods is shown in Tab. 2.
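A local greedy strategy of the kind described above can be sketched with exact nearest-neighbor matching between consecutive frames. This is an illustration under simplifying assumptions, not any cited method: each node (pixel, superpixel, or patch) is represented by a plain feature tuple, the search is brute force rather than an approximate nearest neighbor field, and both function names are hypothetical.

```python
def propagate_labels(prev_features, prev_labels, curr_features):
    """Give each current-frame node the label of its nearest previous-frame node.

    Features are tuples of floats; nearness is squared Euclidean distance.
    """
    curr_labels = []
    for feature in curr_features:
        best_index = min(
            range(len(prev_features)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(feature, prev_features[i])),
        )
        curr_labels.append(prev_labels[best_index])
    return curr_labels


def segment_greedily(frames, first_labels):
    """Label a sequence frame by frame, looking only at the preceding frame.

    frames: list of per-frame feature lists; first_labels: labels for frames[0].
    """
    all_labels = [list(first_labels)]
    for t in range(1, len(frames)):
        all_labels.append(propagate_labels(frames[t - 1], all_labels[-1], frames[t]))
    return all_labels
```

Because each step only sees the previous frame, the procedure can run on streaming data, but an early mistake is carried forward, which is the short-range limitation that globally optimal formulations try to address.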

2.2.2. Convolutional neural networks

With the success of convolutional neural networks on static image segmentation (Long et al., 2015; Lin et al., 2018), CNN based methods show strong performance when introduced to video object segmentation. According to the techniques used for temporal motion information, they can be grouped into two types: motion-based and detection-based.

Motion-based methods. In general, the motion-based methods utilize the temporal coherence of the object motion, and formulate the problem as mask propagation from the first frame or a given annotated frame to the subsequent frames.

For (i), one class of methods trains networks to incorporate optical flow (Cheng et al., 2017; Jampani et al., 2017; Li and Change Loy, 2018; Hu et al., 2018c). Optical flow describes how and where each pixel in the image moves between frames, and it is commonly applied to VOS to maintain motion consistency. These VOS methods typically use optical flow as a cue to track pixels over time to establish temporal coherence. For instance, SegFlow (Cheng et al., 2017), MoNet (Xiao et al., 2018), PReMVOS (Luiten et al., 2018), LucidTrack (Khoreva et al., 2017), and VS-ReID (Li and Change Loy, 2018) consist of two branches: the color segmentation branch and the optical flow branch using FlowNet (Dosovitskiy et al., 2015; Ilg et al., 2017). To learn to exploit motion cues, these methods take two or three frames as input, including the target frame and its adjacent frames. Jampani et al. (Jampani et al., 2017) present a temporal bilateral network to propagate video frames in an adaptive manner by using optical flow as an additional feature. With temporal dependencies established by optical flow, Bao et al. (Bao et al., 2018) propose a VOS method via inference in a CNN-based spatio-temporal MRF. Hu et al. (Hu et al., 2018c) employ active contours on optical flow to segment moving objects.

To capture the temporal coherence, some methods employ a Recurrent Neural Network (RNN) for modeling mask propagation with optical flow (Hu et al., 2017; Li and Change Loy, 2018). RNNs have been adopted for many sequence-to-sequence learning problems because they can learn long-term dependencies from sequential data. MaskRNN (Hu et al., 2017) builds an RNN approach that fuses, in each frame, the output of a binary segmentation net and a localization net with optical flow. Li and Loy (Li and Change Loy, 2018) combine temporal propagation and re-identification functionalities into a single framework.
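The basic operation behind flow-guided mask propagation is warping the previous mask into the current frame. The sketch below shows nearest-neighbor backward warping under an assumed convention: the flow is defined on the current frame and points back to the previous one, so the pixel at (x, y) is taken to come from (x + u, y + v). The function name and this convention are assumptions for illustration; the cited networks learn to refine such a warped estimate rather than use it directly.

```python
def warp_mask(prev_mask, back_u, back_v):
    """Backward-warp a binary mask from the previous frame into the current one.

    prev_mask:      2D list of 0/1 values for the previous frame.
    back_u, back_v: 2D lists of backward flow (current frame to previous frame),
                    in pixels, horizontal and vertical respectively.
    Pixels whose source falls outside the image are set to background (0).
    """
    h, w = len(prev_mask), len(prev_mask[0])
    warped = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Nearest-neighbor lookup of the source position in the previous frame.
            src_x = int(round(x + back_u[y][x]))
            src_y = int(round(y + back_v[y][x]))
            if 0 <= src_x < w and 0 <= src_y < h:
                warped[y][x] = prev_mask[src_y][src_x]
    return warped
```

Errors in the flow translate directly into errors in the warped mask, which is one reason why propagation-based methods can drift under occlusion or fast motion.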

Figure 5. Illustration of mask propagation, which estimates the segmentation mask of the current frame using the previous frame's mask as guidance (Perazzi et al., 2017).

For (ii), another direction is to use CNNs to learn mask refinement of an object from the current frame to the next one. For instance, as shown in Fig. 5, the MaskTrack method (Perazzi et al., 2017) trains a network to refine the previous frame's mask into the current frame's mask, and directly infers the results from optical flow. Compared to this approach (Perazzi et al., 2017), which uses the exact foreground mask of the previous frame, Yang et al. (Yang et al., 2018) use a very coarse location prior with visual and spatial modulation. Oh et al. (Wug Oh et al., 2018) feed both the reference frame with annotation and the current frame with the previous mask estimation into a deep network. A reinforcement cutting-agent learning framework obtains the object box from the segmentation mask and propagates it to the next frame (Han et al., 2018). Some methods leverage temporal information on the bounding boxes by tracking objects across frames (Cheng et al., 2018; Lee et al., 2018; Newswanger and Xu, 2017; Sharir et al., 2017). Sharir et al. (Sharir et al., 2017) present a temporal tracking method to enforce coherent segmentation throughout the video. Cheng et al. (Cheng et al., 2018) utilize a part-based tracking method on the bounding boxes, and construct a region-of-interest segmentation network to generate part masks. Recently, some methods (Xu et al., 2018; Valipour et al., 2017) introduce a combination of CNN and RNN for video object segmentation. Xu et al. (Xu et al., 2018) use a feed-forward neural network to encode both the first image frame and the segmentation mask, and thereby generate the initial states for their convolutional Long Short-Term Memory (LSTM).

Detection-based methods. Without using temporal information, some methods learn an appearance model to perform a pixel-level detection and segmentation of the object at each frame. They rely on fine-tuning a deep network using the first frame annotation of a given test sequence (Caelles et al., 2017; Voigtlaender and Leibe, 2017). Caelles et al. (Caelles et al., 2017) introduce an offline and online training process for a fully convolutional neural network (FCN) on static images for one-shot video object segmentation (OSVOS), which fine-tunes a pretrained convolutional neural network on the first frame of the target video. Furthermore, they extend the model of the object with explicit semantic information, and dramatically improve the results (Maninis et al., 2018). Later, an online adaptive video object segmentation method is proposed (Voigtlaender and Leibe, 2017), in which the network is fine-tuned online to adapt to changes in appearance. Cheng et al. (Cheng et al., 2017) propose a method to propagate a coarse segmentation mask spatially based on the pairwise similarities in each frame.

Other approaches formulate video object segmentation as a pixel-wise matching problem to estimate an object of interest in subsequent images until the end of a sequence. Yoon et al. (Shin Yoon et al., 2017) propose a pixel-level matching network to distinguish the object area from the background on the basis of the pixel-level similarity between two object units. To reduce the computational cost, Chen et al. (Chen et al., 2018) formulate a pixel-wise retrieval problem in an embedding space for video object segmentation, and the VideoMatch approach (Hu et al., 2018b) learns to match extracted features to a provided template without memorizing the appearance of the objects.

Discussion. As indicated, CNN-based semi-supervised VOS methods can be roughly classified into motion-based and detection-based ones. The distinction is whether or not temporal motion is used. Temporal motion is an important feature cue in video object segmentation. As long as the appearance and position changes are smooth, motion-based methods can handle complex deformation and movement of the target. However, these methods are susceptible to temporal discontinuities such as occlusion and fast motion, and can suffer from drift once the propagation becomes unreliable. Detection-based methods, on the other hand, rarely rely on temporal consistency, so they are robust to changes such as occlusion and rapid motion. However, since they need to estimate the appearance of the target, they generally cannot adapt to changes in appearance, and they have difficulty separating similar object instances.

A qualitative comparison of CNN-based semi-supervised VOS methods can be obtained based on motion-based or detection-based methods, requirement of optical flow, requirement of fine-tuning and computational speed, ability to handle post-processing, and requirement of data augmentation. In Tab. 3, we provide the qualitative comparison of the methods discussed in this section.

Figure 6. Illustration of the pipeline of one-shot video object segmentation (Caelles et al., 2017). The first step is to pre-train on large datasets (e.g., ImageNet (Deng et al., 2009) for image classification). The second step is to train the parent network on the training set of DAVIS (Perazzi et al., 2016). Then, at test time, the network is fine-tuned on the first frame.
  • Fine-tuning. Most CNN-based semi-supervised VOS methods share a similar two-stage paradigm (as shown in Fig. 6): first, a general-purpose CNN is trained to segment the foreground object; second, this network is fine-tuned online on the first frame of the test video to memorize the target object's appearance, leading to a boost in performance (Caelles et al., 2017; Voigtlaender and Leibe, 2017; Perazzi et al., 2017). It has been shown that fine-tuning on the first frame significantly improves accuracy. However, since some methods only use the fine-tuned network at test time, they are not able to adapt to large changes in appearance, which might for example be caused by drastic changes in viewpoint (Caelles et al., 2017; Maninis et al., 2018; Hu et al., 2018c; Bao et al., 2018; Xiao et al., 2018; Khoreva et al., 2017), and it becomes harder for the fine-tuned model to generalize to new object appearances. To overcome this limitation, some methods update the network online using additional training examples to follow changes in appearance (Voigtlaender and Leibe, 2017; Newswanger and Xu, 2017).

  • Computational speed. Despite the high accuracy achieved by these approaches, the fine-tuning process requires many optimization iterations. This step is computationally expensive: it usually takes more than ten minutes to update a model for a video, which makes it unsuitable for online vision applications. Recently, several methods (Yang et al., 2018; Cheng et al., 2018; Chen et al., 2018; Hu et al., 2018b) operate without computationally expensive fine-tuning at test time, which makes them much faster than comparable methods. For instance, Chen et al. (Chen et al., 2018) only perform a single forward pass through the embedding network and a nearest-neighbor search to process each frame at test time. Yang et al. (Yang et al., 2018) use a single forward pass to adapt the segmentation model to the appearance of a specific object. The VideoMatch approach (Hu et al., 2018b) builds a soft matching layer and does not require online fine-tuning.

  • Post-processing. Besides the training of CNN-based segmentation, several methods leverage post-processing steps to achieve additional gains. Post-processing is often employed to improve the contours, for example through boundary snapping (Caelles et al., 2017; Maninis et al., 2018), refine-aware filter (Shin Yoon et al., 2017; Cheng et al., 2017), and dense MRF or CRF (Krähenbühl and Koltun, 2011) in (Bao et al., 2018; Perazzi et al., 2017). For instance, OSVOS (Caelles et al., 2017) performs boundary snapping to align the foreground masks with accurate contours. Yoon et al. (Shin Yoon et al., 2017) apply a weighted median filter to the resulting segmentation mask. Li et al. (Li and Change Loy, 2018) additionally consider post-processing steps to link the tracklets. In addition, some VOS frameworks (Bao et al., 2018; Khoreva et al., 2017; Xiao et al., 2018) utilize an MRF or CRF as a post-processing step to improve the labeling results produced by a CNN. They attach the MRF or CRF inference to the CNN as a separate step, and combine the representation capability of the CNN with the fine-grained probabilistic modeling capability of the MRF or CRF to improve performance. The PReMVOS method (Luiten et al., 2018) presents a refinement network that produces accurate pixel masks for each object mask proposal.

  • Data augmentation. In general, data augmentation is a widely used strategy to improve the generalization of neural networks. Khoreva et al. (Khoreva et al., 2017) present a heavy data augmentation strategy for online learning. Other methods (Luiten et al., 2018; Jampani et al., 2017; Cheng et al., 2017; Sharir et al., 2017) fine-tune the training network on a large set of augmented images generated from the first-frame ground truth.
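The fine-tuning-free idea mentioned under computational speed (one forward pass to obtain embeddings, followed by a nearest-neighbor search) can be illustrated with a minimal sketch. It is not the implementation of any cited method: the 2-D embeddings below are hand-written stand-ins for the output of an embedding network, and the function name `nn_label_transfer` is hypothetical.

```python
from math import dist

def nn_label_transfer(ref_embeddings, ref_labels, query_embeddings):
    """Give each query pixel the label of its nearest first-frame pixel."""
    labels = []
    for q in query_embeddings:
        # Brute-force nearest-neighbor search in embedding space.
        nearest = min(range(len(ref_embeddings)),
                      key=lambda i: dist(q, ref_embeddings[i]))
        labels.append(ref_labels[nearest])
    return labels

# Toy embeddings of four annotated first-frame pixels (1 = object, 0 = background).
ref_embeddings = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
ref_labels = [1, 1, 0, 0]

# Toy embeddings of three pixels in a later frame.
query_embeddings = [(0.85, 0.15), (0.15, 0.95), (0.6, 0.3)]
predicted = nn_label_transfer(ref_embeddings, ref_labels, query_embeddings)
print(predicted)
```

Because no network weights are updated, the per-frame cost is dominated by the search itself, which is why such methods avoid the long fine-tuning step.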

References M / D Optical flow Fine-tuning Post-pro. Speed Data aug.
Hu (Hu et al., 2018c) M
Bao (Bao et al., 2018) M
Xiao (Xiao et al., 2018) M
Khoreva (Khoreva et al., 2017) M
Luiten (Luiten et al., 2018) M fast
Li (Li and Change Loy, 2018) M
Yang (Yang et al., 2018) M fast
Wug (Wug Oh et al., 2018) M fast
Cheng (Cheng et al., 2018) M fast
Han (Han et al., 2018) M
Lee (Lee et al., 2018) M
Xu (Xu et al., 2018) M
Newswanger (Newswanger and Xu, 2017) M
Sharir (Sharir et al., 2017) M
Perazzi (Perazzi et al., 2017) M
Valipour (Valipour et al., 2017) M
Jampani (Jampani et al., 2017) M fast
Cheng (Cheng et al., 2017) M
Hu (Hu et al., 2017) M
Maninis (Maninis et al., 2018) D
Chen (Chen et al., 2018) D fast
Hu (Hu et al., 2018b) D fast
Caelles (Caelles et al., 2017) D
Voigtlaender (Voigtlaender and Leibe, 2017) D
Cheng (Cheng et al., 2017) D
Shin (Shin Yoon et al., 2017) D
Table 3. Summary of convolutional neural network based semi-supervised video object segmentation methods. M/D: motion-based and detection-based methods. Post-pro.: post-processing. Data aug.: data augmentation.

2.3. Interactive video object segmentation

Interactive video object segmentation is a special form of supervised segmentation that relies on iterative user interaction to segment objects of interest. The user repeatedly corrects the segmentation results of the system by adding strokes on the foreground or background. These methods require the user to input either scribbles or clicks. In general, every segmentation algorithm needs to solve two problems, namely the criterion for a good partition and the method for computing such a partition efficiently (Shi and Malik, 2000). Typically, interactive video object segmentation techniques can be divided into three main branches: graph partitioning models, active contours models, and convolutional neural network models.

2.3.1. Graph partitioning models

Most interactive video object segmentation methods formulate segmentation as a graph partitioning problem, where the vertices of a graph are partitioned into disjoint subgraphs. Existing interactive segmentation methods include graph-cuts based, random walker based, and geodesic based approaches.

Graph-cuts based

Several works are based on the GrabCut algorithm (Rother et al., 2004), which iteratively alternates between estimating appearance models (typically Gaussian Mixture Models) and refining the segmentation using graph cuts (Boykov and Jolly, 2001). Wang et al. (Wang et al., 2005) were among the first authors to address interactive video segmentation tasks. To improve the performance, they used two-stage hierarchical mean-shift clustering as a preprocessing step to reduce the computation of the min-cut problem. In (Li et al., 2005), Li et al. segment every tenth frame as a keyframe, and a graph cut uses the global color model from the keyframes, gradients, and coherence as its primary cues to compute the segmentation of the frames in between. The user can also manually indicate the area in which the local color model is applied. Price et al. (Price et al., 2009) propose additional types of local classifiers in a method named LIVEcut. The user iteratively corrects the propagated mask from frame to frame and the algorithm learns from these corrections. In (Bai et al., 2009), Bai et al. build a set of local classifiers that each adaptively integrate multiple local image features. This method transforms the neighborhood regions according to the optical flow, re-trains the classifiers from the new mask, and incorporates user corrections through the classifiers. Later, the same group (Bai et al., 2010) construct the foreground and background appearance models adaptively, and use probabilistic optical flow to update the color-space Gaussians of individual pixels. Instead of operating on pixels, Reso et al. (Reso et al., 2014) and Dondera et al. (Dondera et al., 2014) apply the graph-cut framework to superpixels in every video frame. In addition, Chien et al. (Chien et al., 2013) and Pont-Tuset et al. (Pont-Tuset et al., 2015) use the normalized cut (Shi and Malik, 2000) based multiscale combinatorial grouping (MCG) algorithm to segment and generate accurate region proposals, and use point clicks on the boundary of the objects to fit object proposals to them.

Random walker based

In (Shankar Nagaraja et al., 2015), Nagaraja et al. use a few strokes to segment videos by using optical flow and point trajectories. Their method is integrated into a user interface where the user can draw scribbles in the first frame. When satisfied, the user presses a button to run the random walker.

Geodesic based

Bai and Sapiro (Bai and Sapiro, 2009) present a geodesics-based algorithm for interactive natural image and video segmentation that uses region color to compute a geodesic distance to each pixel to form a selection. This method exploits weights in the geodesic computation that depend on the pixel value distributions. In (Wang et al., 2017a), Wang et al. combine geodesic distance-based dynamic models with a pyramid histogram-based confidence map to segment the image regions. Additionally, their method determines the frame of the operator's mark to improve segmentation performance.
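The geodesic selection idea can be sketched on a tiny grayscale grid: each pixel is assigned to whichever user seed is closer in geodesic distance. This is an illustrative sketch only. It assumes 4-connectivity and uses the absolute intensity difference as the edge weight, which is a simplification of the distribution-dependent weights described above; the function name `geodesic_distance` is hypothetical.

```python
import heapq

def geodesic_distance(image, seeds):
    """Dijkstra on a 4-connected grid; edge cost = |intensity difference|."""
    rows, cols = len(image), len(image[0])
    dist = [[float("inf")] * cols for _ in range(rows)]
    heap = []
    for r, c in seeds:
        dist[r][c] = 0.0
        heapq.heappush(heap, (0.0, r, c))
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist[r][c]:
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + abs(image[nr][nc] - image[r][c])
                if nd < dist[nr][nc]:
                    dist[nr][nc] = nd
                    heapq.heappush(heap, (nd, nr, nc))
    return dist

image = [
    [10, 10, 10, 200, 200],
    [10, 12, 11, 205, 200],
    [10, 10, 10, 200, 198],
]
fg_dist = geodesic_distance(image, seeds=[(1, 4)])  # scribble on the bright object
bg_dist = geodesic_distance(image, seeds=[(1, 0)])  # scribble on the dark background
mask = [[1 if fg_dist[r][c] < bg_dist[r][c] else 0 for c in range(5)]
        for r in range(3)]
for row in mask:
    print(row)
```

Paths that stay inside a homogeneous region are cheap, while crossing a strong edge is expensive, so the selection follows object boundaries even with a single seed per region.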

2.3.2. Active contours models

In the active contour framework, object segmentation uses an edge detector to halt the evolution of the curve on the boundary of the desired object. Based on this framework, the TouchCut approach (Wang et al., 2014) uses a single touch to segment the object using level-set techniques. The authors simplify the interaction to a single point in the first frame, and then propagate the results using optical flow.

2.3.3. CNN models

Many recent works employ convolutional neural network models to interactively and accurately segment the object in successive frames (Maninis et al., 2018; Chen et al., 2018; Benard and Gygli, 2017; Caelles et al., 2018). Benard et al. (Benard and Gygli, 2017) and Caelles et al. (Caelles et al., 2018) propose deep interactive image and video object segmentation methods that use the OSVOS technique (Caelles et al., 2017). To improve localization, Benard et al. propose to refine the initial predictions with a fully connected CRF. Caelles et al. (Caelles et al., 2018) define a baseline method (i.e. Scribble-OSVOS) to show the usefulness of the 2018 DAVIS challenge benchmark. Chen et al. (Chen et al., 2018) formulate video object segmentation as a pixel-wise retrieval problem, and their method allows for fast user interaction. iFCN (Xu et al., 2016) guides a CNN with positive and negative points acquired from the ground-truth masks. In (Maninis et al., 2018), Maninis et al. build on iFCN to improve the results by using four points of an object as input to obtain precise object segmentation for images and videos.

References # Methods Way of labeling Optical flow Over-segmentation
Maninis (Maninis et al., 2018) M CNN models Clicks Pixel
Caelles (Caelles et al., 2018) M CNN models Scribbles Pixel
Chen (Chen et al., 2018) M CNN models Clicks Pixel
Benard (Benard and Gygli, 2017) S CNN models Clicks Pixel
Nagaraja (Shankar Nagaraja et al., 2015) S Random walker Scribbles Pixel
Pont-Tuset(Pont-Tuset et al., 2015) S Graph-cut Clicks Superpixel
Chien (Chien et al., 2013) S Graph-cut Clicks Pixel
Wang (Wang et al., 2014) S Active contours Clicks Pixel
Dondera (Dondera et al., 2014) S Graph-cut Clicks Superpixel
Reso (Reso et al., 2014) S Graph-cut Scribbles Superpixel
Bai (Bai et al., 2010) S Graph-cut Scribbles Pixel
Bai (Bai and Sapiro, 2009) S Geodesic Scribbles Pixel
Bai (Bai et al., 2009) S Graph-cut Scribbles Pixel
Price (Price et al., 2009) S Graph-cut Scribbles Pixel
Li (Li et al., 2005) S Graph-cut Scribbles Pixel
Wang (Wang et al., 2005) S Graph-cut Scribbles Pixel
Table 4. Summary of interactive video object segmentation methods. #: number of objects, S: single, M: multiple.

2.3.4. Discussion.

Given scribbles or a few clicks by the user, interactive video object segmentation helps the system produce a full spatio-temporal segmentation of the object of interest. Interactive segmentation methods have been proposed in order to reduce annotation time. However, on small touch screen devices, using a finger to provide precise clicks or draw scribbles can be cumbersome and inconvenient for the user. A qualitative comparison of interactive VOS methods can be made based on their ability to segment single or multiple objects, whether an object is labeled with clicks or scribbles, and the type of over-segmentation (i.e. pixel or superpixel). A brief summary of the qualitative comparison is shown in Tab. 4. Most conventional interactive VOS methods based on graph partitioning models rely on the graph-cut algorithm for seeded segmentation. Recent methods build on deep architectures and utilize CNN models to improve the interactive segmentation performance. In addition, several CNN-based methods can handle interactive video object segmentation of multiple objects.

2.4. Weakly supervised video object segmentation

In weakly supervised VOS, a large number of videos are provided, where all videos are known to contain the same foreground object or object class. Several weakly supervised learning-based approaches generate semantic object proposals for training segment classifiers (Hartmann et al., 2012; Tang et al., 2013) or performing label transfer (Liu et al., 2014), and then produce the target object in videos. For instance, Hartmann et al. (Hartmann et al., 2012) formulate pixel-level segmentation as multiple instance learning of weakly supervised classifiers for a set of independent spatio-temporal segments. Tang et al. (Tang et al., 2013) compare the segments of positively labeled videos against a large number of negative samples, and regard those segments with a distinct appearance as the foreground. Liu et al. (Liu et al., 2014) further advance the study by addressing this problem with a multi-class criterion rather than traditional binary classification. These methods rely on training examples and may produce inaccurate segmentation results. To overcome this limitation, Zhang et al. (Zhang et al., 2015) propose to segment semantic objects in weakly labeled video by using object detection without the need for a training process. In contrast, Tsai et al. (Tsai et al., 2016b) do not require object proposals or video-level annotations. Their method links objects across different videos and constructs a graph for optimization.

Recently, Wang et al. (Wang et al., 2016) combine the recognition and representation power of CNNs with the intrinsic structure of unlabelled data in the target domain of weakly supervised semantic video object segmentation to improve inference performance. Unlike semantics-aware weakly-supervised methods, Khoreva et al. (Khoreva et al., 2018) employ natural language expressions to identify the target object in video. Their method integrates textual descriptions of the object of interest as foreground into convnet-based techniques.

2.5. Segmentation-based Tracking

The video object segmentation methods discussed above usually use cues such as motion and appearance similarity to segment videos; that is, these methods estimate the position of a target in a manual or automatic manner. The object representation consists of a binary segmentation mask which indicates whether each pixel belongs to the target or not. For applications that require pixel-level information, such as video editing and video compression, this detailed representation is more desirable. However, estimating the label of every pixel incurs a large computational cost, and video object segmentation methods have traditionally been slow. In contrast, visual object tracking estimates the position of an object in the image plane as it moves around a scene. In general, the classical object shape is represented by a rectangle, ellipse, etc. This simple object representation helps reduce the cost of data annotation. Moreover, such methods can quickly detect and track targets, and the initialization of the object is relatively simple. However, these methods still operate more or less on the image regions described by the bounding box, and inherently have difficulty tracking objects that undergo large deformations. To overcome this problem, some approaches integrate some form of segmentation into the tracking process. Segmentation-based tracking methods provide an accurate shape description for these objects. The strategies of these methods can be grouped into two main categories: bottom-up methods and joint-based methods. Figure 7 presents the flowchart of two segmentation-based tracking frameworks.
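The difference between the two object representations can be made concrete with a small sketch that derives a bounding box from a binary mask and measures how much of the box the object actually fills. The helper names `mask_to_bbox` and `fill_ratio` are hypothetical and used only for illustration.

```python
def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) of the nonzero pixels, or None if empty."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs, ys = zip(*coords)
    return min(xs), min(ys), max(xs), max(ys)

def fill_ratio(mask):
    """Fraction of bounding-box pixels that belong to the object."""
    box = mask_to_bbox(mask)
    if box is None:
        return 0.0
    x0, y0, x1, y1 = box
    box_area = (x1 - x0 + 1) * (y1 - y0 + 1)
    return sum(map(sum, mask)) / box_area

# A thin, diagonal (non-rigid looking) object.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
]
print(mask_to_bbox(mask), round(fill_ratio(mask), 3))
```

A low fill ratio means that a box-based appearance model is dominated by background pixels, which is one way to see why deformable objects are hard for box-based trackers.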

(a) Bottom-up based framework.
(b) Joint-based framework.
Figure 7. Segmentation-based tracking frameworks: (a) Bottom-up based, (b) Joint-based.

2.5.1. Bottom-up based methods

In the domain of bottom-up segmentation-based tracking, the object is represented by a segmented area instead of a bounding box. Segmentation-based tracking is a natural solution for handling non-rigid and deformable objects effectively. These methods use a low-level segmentation to extract regions in all frames, and then transitively match or propagate the similar regions across the video. We divide bottom-up based methods into two categories, namely contour matching and contour propagation. Contour matching approaches search for the object region in the current frame. On the other hand, by using a state space model, contour propagation methods evolve the initial contour to a new position in the current frame. A qualitative comparison of bottom-up segmentation-based tracking approaches is given in Tab. 5.

Contour matching

Contour matching searches for the object silhouette and its associated model in the current frame. One solution is to build an appearance model of the object shape and find the candidate image region that best matches the model, i.e., generative methods. Examples include integral histogram based models (Adam et al., 2006; Chockalingam et al., 2009a), independent component analysis based models (Kong, 2010), subspace learning based models (Zhou and Tao, 2013), distance measures-based models (Li and Ngan, 2007; Colombari et al., 2007; Hsiao et al., 2006), spatio-temporal filters (Kompatsiaris and Strintz, 2000), and spectral matching (Cai et al., 2014). Some approaches measure similarity between patches by comparing their gray-level or color histograms. Adam et al. (Adam et al., 2006) segment the target into a number of fragments to preserve the spatial relationships of the pixels, and use the integral histogram. Later, Chockalingam et al. (Chockalingam et al., 2009a) choose fragments adaptively according to the video frame, by clustering pixels with similar appearance, rather than using a fixed arrangement of rectangles. Yang et al. (Kong, 2010) propose a boosted color soft segmentation algorithm and incorporate independent component analysis with reference into the tracking framework. Zhou et al. (Zhou and Tao, 2013) present a shifted subspaces tracking method to segment the motions and recover their trajectories. Other authors use the Hausdorff distance (Li and Ngan, 2007) and the Mahalanobis distance (Colombari et al., 2007) to construct a correlation surface from which the minimum is selected as the new object position. Hsiao et al. (Hsiao et al., 2006) utilize a trajectory estimation scheme to automatically deploy the growing seeds for tracking the object in subsequent frames. Kompatsiaris et al. (Kompatsiaris and Strintz, 2000) take into account the intensity differences between consecutive frames, and present a spatio-temporal filter to separate the moving person from the static background. Cai et al. (Cai et al., 2014) build a dynamic graph to exploit the inner geometric structure information of the object by oversegmenting the target into several superpixels, and use spectral clustering to solve the graph matching.
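As an illustration of distance-based contour matching, the sketch below evaluates the Hausdorff distance between a shifted contour template and the edge points of the current frame over a grid of candidate offsets, and selects the minimum as the new object position. The point sets, the candidate grid, and the function names are assumptions made for this example and are not taken from the cited methods.

```python
from math import dist

def directed_hausdorff(a, b):
    """Largest distance from a point in a to its nearest point in b."""
    return max(min(dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

def best_offset(template, edge_points, candidate_offsets):
    """Return the offset whose shifted template is closest to the edge points."""
    def shifted(offset):
        return [(x + offset[0], y + offset[1]) for x, y in template]
    return min(candidate_offsets,
               key=lambda offset: hausdorff(shifted(offset), edge_points))

# Contour template from the previous frame (corners of a small square).
template = [(0, 0), (2, 0), (2, 2), (0, 2)]
# Edge points observed in the current frame: the same square, translated.
edge_points = [(5, 3), (7, 3), (7, 5), (5, 5)]
candidates = [(dx, dy) for dx in range(8) for dy in range(8)]

offset = best_offset(template, edge_points, candidates)
print(offset)
```

The exhaustive search over offsets is only practical for tiny examples; it stands in for the correlation surface described in the text.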

References # Methods Segment Techniques Track Techniques Results
Son (Son et al., 2015) S Contour matching Graph-cut Boosting decision tree Box, Mask
Cai (Cai et al., 2014) S Contour matching Graph-cut SVM Box
Duffner (Duffner and Garcia, 2014) S Contour matching Probabilistic soft segmentation Hough voting Box
Wang (Wang and Nevatia, 2013) S Contour propagation Mean shift clustering Particle filter Box
Zhou (Zhou and Tao, 2013) M Contour matching Subspace learning Subspace learning Mask
Heber (Heber et al., 2013) S Contour matching Graph-cut Blending-based template, hough voting, mean-shift Mask
Chien (Chien et al., 2013) M Contour propagation Threshold decision Particle filter Box
Belagiannis (Belagiannis et al., 2012) S Contour propagation Graph-cut Particle filter Mask
Godec (Godec et al., 2013) S Contour matching Graph-cut Hough voting Mask
Wang (Wang et al., 2011) S Contour matching SLIC Particle filter Box
Chockalingam (Chockalingam et al., 2009b) S Contour matching Spatially variant finite mixture models Particle filter Mask
Colombari (Colombari et al., 2007) M Contour matching Region matching Blob matching and connection Mask
Hsiao (Hsiao et al., 2006) S Contour matching Region growing and merging Interframe difference Mask
Kompatsiaris (Kompatsiaris and Strintz, 2000) S Contour matching K-Means clustering Spatiotemporal filter Box, Mask
Gu (Gu and Lee, 1998) S Contour propagation Morphological watershed Temporal gradient Mask
Table 5. Summary of bottom-up segmentation-based tracking methods. #: number of objects, S: single, M: multiple. Box and Mask: the bounding box and mask of the object.

Another approach is to model both the object and the background and then to distinguish the object from the background by using a discriminative classifier, such as boosting-based models (Son et al., 2015), Hough-based models (Duffner and Garcia, 2014; Godec et al., 2013), and so on. These methods maintain object appearances based on small local patches or object regions, and perform tracking by classifying the silhouette into the foreground or the background. The final tracking result is given by the mask of the best sample. Son et al. (Son et al., 2015) employ an online gradient boosting decision tree to classify each patch, and construct segmentation masks. Godec et al. (Godec et al., 2013) propose a patch-based voting algorithm with Hough forests (Gall et al., 2011). By back-projecting the patches that voted for the object center, the authors initialize a graph-cut algorithm to segment foreground from background. However, the graph-cut segmentation is relatively slow, and the binary segmentation increases the risk of drift due to wrongly segmented image regions. To address this problem, Duffner and Garcia (Duffner and Garcia, 2014) present a fast tracking algorithm using a detector based on the generalized Hough transform and pixel-wise descriptors, and then update the global segmentation model.
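The voting and back-projection steps described for Hough-based trackers can be sketched as follows. Each patch casts one vote for the object center through an offset; the patches that voted for the winning center are then returned so that they could seed a segmentation. The offsets here are hand-written, whereas in the cited methods they would be predicted by a learned model; the function name `vote_for_center` is hypothetical.

```python
from collections import Counter

def vote_for_center(patches):
    """patches: list of ((x, y), (dx, dy)) = patch position and its offset to the center."""
    votes = Counter((x + dx, y + dy) for (x, y), (dx, dy) in patches)
    center, _ = votes.most_common(1)[0]
    # Back-projection: keep the patches whose vote landed on the winning center.
    supporters = [pos for pos, off in patches
                  if (pos[0] + off[0], pos[1] + off[1]) == center]
    return center, supporters

patches = [
    ((2, 2), (3, 3)),    # object patch, votes for (5, 5)
    ((8, 2), (-3, 3)),   # object patch, votes for (5, 5)
    ((5, 8), (0, -3)),   # object patch, votes for (5, 5)
    ((1, 9), (1, 1)),    # background clutter, votes for (2, 10)
    ((5, 5), (0, 0)),    # object patch, votes for (5, 5)
]
center, supporters = vote_for_center(patches)
print(center, supporters)
```

In a full tracker the supporting patches would serve as foreground seeds, for example for a graph-cut segmentation.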

In addition, researchers propose hybrid generative-discriminative segmentation-based methods to fuse the useful information from the generative and the discriminative models. For instance, Heber et al. (Heber et al., 2013) present a segmentation-based tracking method that fuses three target trackers, i.e. a blending-based template tracker, a Hough voting-based discriminative tracker, and a feature histogram-based mean shift tracker. The fusion process additionally provides a segmentation.

Contour propagation

Contour propagation of bottom-up based methods can be done using two different approaches: sequential Monte Carlo (or particle filter) based methods and direct minimization based methods. Some approaches employ sequential Monte Carlo-based methods to generate the state of the candidate of object contour (Wang et al., 2011; Wang and Nevatia, 2013; Chien et al., 2013; Belagiannis et al., 2012). The state is defined in terms of the shape and the motion parameters of the contour. Given all available observations $Z_{1:t} = \{Z_1, \ldots, Z_t\}$ of the object up to the $t$-th frame, the state variable $X_t$ is updated by the maximum a posteriori (MAP) estimation, i.e. $\hat{X}_t = \arg\max_{X_t} p(X_t \mid Z_{1:t})$. The posterior probability $p(X_t \mid Z_{1:t})$ can be computed recursively as

$$p(X_t \mid Z_{1:t}) \propto p(Z_t \mid X_t) \int p(X_t \mid X_{t-1})\, p(X_{t-1} \mid Z_{1:t-1})\, dX_{t-1}.$$

Here, $p(Z_t \mid X_t)$ is the observation model, which is usually defined in terms of the distance of the contour from observed edges. And $p(X_t \mid X_{t-1})$ represents the dynamic motion model. The dynamic motion model depicts the temporal correlation of state transition between two consecutive frames. The observation model describes the similarity between a candidate offset and the best offset of the tracked object. For instance, Wang et al. (Wang et al., 2011) use superpixel for appearance modeling and incorporate particle filtering to find the optimal target state, and their observation model is built as $p(Z_t \mid X_t) \propto \hat{C}(X_t)$, where $\hat{C}(X_t)$ represents the confidence of an observation at state $X_t$. Belagiannis et al. (Belagiannis et al., 2012) propose two particle sampling strategies based on segmentation to handle the object’s deformations, occlusions, orientation, scale and appearance changes. Some methods use particle-based approximate inference algorithm over the Dynamic Bayesian Network (DBN) (Wang and Nevatia, 2013) and the Hidden Markov Model (HMM) (Chien et al., 2013) to estimate the contour.
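The predict, weight, and resample cycle of a sequential Monte Carlo tracker can be sketched in one dimension. This is a generic bootstrap particle filter under stated assumptions, not the formulation of any cited method: the state is a single position, the motion model is a Gaussian random walk, the observation model is a Gaussian likelihood around a measured position, and the MAP estimate is approximated by the highest-weight particle.

```python
import math
import random

def particle_filter_step(particles, observation, motion_std=1.0, obs_std=0.5):
    # Predict: draw each particle from the motion model (Gaussian random walk).
    predicted = [x + random.gauss(0.0, motion_std) for x in particles]
    # Update: weight each particle with the observation model (Gaussian likelihood).
    weights = [math.exp(-((observation - x) ** 2) / (2 * obs_std ** 2))
               for x in predicted]
    # Approximate MAP estimate: the particle with the largest weight.
    best = max(range(len(predicted)), key=weights.__getitem__)
    estimate = predicted[best]
    # Resample in proportion to the weights to avoid particle degeneracy.
    resampled = random.choices(predicted, weights=weights, k=len(predicted))
    return resampled, estimate

random.seed(0)
particles = [random.uniform(-5.0, 5.0) for _ in range(500)]
true_positions = [float(t) for t in range(10)]  # target moves one unit per frame

estimates = []
for z in true_positions:  # noise-free observations, for simplicity
    particles, estimate = particle_filter_step(particles, z)
    estimates.append(estimate)

print([round(e, 2) for e in estimates])
```

In contour tracking the state would hold shape and motion parameters instead of a scalar, and the likelihood would measure agreement between the candidate contour and image evidence.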

In direct minimization based methods, both segmentation and tracking can be formulated as functions that are minimized through gradient descent. In addition, Gu et al. (Gu and Lee, 1998) combine supervised segmentation with unsupervised tracking. Specifically, the supervised segmentation method uses mathematical morphology, and the unsupervised tracking method uses the computation of partial derivatives.

2.5.2. Joint-based methods

In the above bottom-up based methods, the foreground region is first segmented from the input image, then some features are extracted from the foreground region, and finally the object is tracked according to these features. The foreground segmentation and object tracking are performed as two separate tasks, as shown in Fig. 7 (a). The biggest limitation of these methods is that the errors in the foreground segmentation inevitably propagate forward, causing errors in object tracking. Therefore, many researchers integrate foreground segmentation and object tracking into a joint framework. The result of foreground segmentation determines the accuracy of feature extraction, which further affects the performance of silhouette tracking. On the other hand, the tracking results can provide top-down cues for foreground segmentation. These methods make full use of the correlation between foreground segmentation and object tracking, which greatly improves the performance of video segmentation and tracking, as shown in Fig. 7 (b). According to the energy minimization techniques used in the joint video object segmentation and tracking framework, we divide these methods into three categories, namely, graph-based framework, probabilistic framework, and CNN framework. In Tab. 6, we provide the qualitative comparison of the methods in this section.

Graph-based framework.

The basic technique of joint-based methods is to construct a graph for the energy function to be minimized. The variations on the graph-based framework are primarily built using a small set of core algorithms: graph cuts (Bugeau and Pérez, 2008; Wen et al., 2015; Yao et al., 2018; Keuper et al., 2018), random walker (Papoutsakis and Argyros, 2013), and shortest geodesics (Paragios and Deriche, 2000).
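How an energy function is mapped onto a graph and minimized by a cut can be shown on a one-dimensional row of pixels. The sketch below is a toy example under stated assumptions: the unary cost is the squared difference to an assumed foreground or background mean intensity, the pairwise cost is a constant penalty between neighbors with different labels, and the min-cut is found with a plain Edmonds-Karp max-flow. The function names and parameter values are hypothetical.

```python
from collections import deque

def min_cut_source_side(capacity, source, sink):
    """Edmonds-Karp max-flow; returns the nodes on the source side of a minimum cut."""
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)  # add reverse edges

    def reachable():
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = reachable()
        if sink not in parent:
            return set(parent)  # no augmenting path left
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= flow
            residual[v][u] += flow

def segment_1d(intensities, fg_mean=8, bg_mean=2, smoothness=20):
    capacity = {"s": {}}
    for i, value in enumerate(intensities):
        capacity["s"][i] = (value - bg_mean) ** 2    # cut if pixel i becomes background
        capacity[i] = {"t": (value - fg_mean) ** 2}  # cut if pixel i becomes foreground
    for i in range(len(intensities) - 1):
        capacity[i][i + 1] = smoothness              # cut if neighbors disagree
        capacity[i + 1][i] = smoothness
    source_side = min_cut_source_side(capacity, "s", "t")
    return [1 if i in source_side else 0 for i in range(len(intensities))]

labels = segment_1d([1, 2, 6, 1, 9, 8, 9, 4, 9])
print(labels)
```

The pairwise term overrides the two outlier pixels (intensities 6 and 4), so the result is a single clean boundary rather than the fragmented labeling a per-pixel threshold would give.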

References # Methods Segment Technique Track Technique Results
Keuper (Keuper et al., 2018) M Graph-based framework Graph-cut Deep matching Box, Mask
Wang (Wang et al., 2018) S CNN CNN CNN Box, Mask
Yao (Yao et al., 2018) S Graph-based framework CNN Correlation filter Box
Zhang (Zhang et al., 2018) S CNN CNN CNN Box, Mask
Yeo (Yeo et al., 2017) S Probabilistic framework Markov Chain Markov Chain Box, Mask
Liu (Liu et al., 2016) M Probabilistic framework CRF CRF Mask
Tjaden (Tjaden et al., 2016) M Probabilistic framework Pixel-wise posterior Pixel-wise posterior Mask
Schubert (Schubert et al., 2015) S Probabilistic framework Pixel-wise posterior Pixel-wise posterior Box, Mask
Milan (Milan et al., 2015) M Probabilistic framework SVM CRF Box, Mask
Wen (Wen et al., 2015) S Graph-based framework Graph-cuts Energy minimization Mask
Papoutsakis (Papoutsakis and Argyros, 2013) S Graph-based framework Random walker Mean-shift Box, Mask
Lim (Lim et al., 2013) S Probabilistic framework Graph-cuts CRF Mask
Aeschliman (Aeschliman et al., 2010) M Probabilistic framework Probabilistic soft segmentation Probabilistic principal component analysis Box, Mask
Wu (Wu and Nevatia, 2009) M Probabilistic framework Boosting Boosting Box
Tao (Tao et al., 2008) M Probabilistic framework MCMC MCMC Box
Bugeau (Bugeau and Pérez, 2008) M Graph-based framework Graph-cuts Mean-shift Box, Mask
Bibby (Bibby and Reid, 2008) S Probabilistic framework Pixel-wise posterior Pixel-wise posterior Box, Mask
Paragios (Paragios and Deriche, 2000) M Graph-based framework Active contours Interframe difference Mask
Table 6. Summary of joint segmentation-based tracking methods. #: number of objects, S: single, M: multiple. Box and Mask: the bounding box and mask of the object.

For instance, Bugeau and Pérez (Bugeau and Pérez, 2008) formulate objective functions that combine low-level pixel-wise measures and high-level observations. The minimization of these cost functions allows simultaneous tracking and segmentation of the tracked objects. In (Wen et al., 2015), Wen et al. integrate multi-part tracking and segmentation into a unified energy minimization framework, which is optimized iteratively by a RANSAC-style approach. Yao et al. (Yao et al., 2018) present a joint framework that introduces semantics (Lin et al., 2017) into the tracking procedure. They then exploit semantics to localize the object accurately via an energy-minimization-based segmentation. In (Keuper et al., 2018), Keuper et al. present a graph-based segmentation and multiple object tracking framework. Specifically, they combine bottom-up motion segmentation by grouping of point trajectories with top-down multiple object tracking by clustering of bounding boxes. The random walker algorithm (Papoutsakis and Argyros, 2013) is also formulated on a weighted graph. The joint framework integrates EM-based object tracking and random walker-based image segmentation in a closed-loop scheme. In addition, Paragios and Deriche (Paragios and Deriche, 2000) present a graph-based framework that links the minimization of a geodesic active contour objective function to the detection and tracking of moving objects.

Probabilistic framework.

There are many probabilistic frameworks for jointly solving video object segmentation and tracking, such as Bayesian methods (Aeschliman et al., 2010; Tao et al., 2008; Wu and Nevatia, 2009; Yeo et al., 2017), pixel-wise posterior based methods (Bibby and Reid, 2008; Schubert et al., 2015; Tjaden et al., 2016), and CRF based methods (Liu et al., 2016; Lim et al., 2013; Milan et al., 2015). In (Aeschliman et al., 2010), Aeschliman et al. present a probabilistic framework for jointly solving tracking and fine, pixel-level segmentation. The candidate target locations are evaluated by first computing a pixel-level segmentation, and explicitly including this segmentation in the probability model. Then the segmentation is used to incrementally update the probability model. In addition, Zhao et al. (Tao et al., 2008) propose a Bayesian framework that integrates segmentation and tracking based on a joint likelihood for the appearance of multiple objects, and perform the inference with a Markov chain Monte Carlo-based approach. Later, Wu and Nevatia (Wu and Nevatia, 2009) present a joint framework that takes the detection results as input and searches for the multiple object configuration with the best image likelihood. Yeo et al. (Yeo et al., 2017) employ an absorbing Markov chain algorithm over a superpixel segmentation to estimate the object state, and the target segmentation is propagated to subsequent frames in an online manner.

In (Bibby and Reid, 2008; Schubert et al., 2015; Tjaden et al., 2016), a probabilistic generative model is built for segmentation-based tracking using pixel-wise posteriors. These methods construct the appearance model using a probabilistic formulation, carry out the level-set segmentation using this model, and then perform the contour propagation. The minimization in these algorithms is implemented by gradient descent. Among them, Tjaden et al. (Tjaden et al., 2016) segment multiple 3D objects and track their pose using a pixel-wise second-order optimization approach.
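The core quantity of these methods, the posterior probability that a pixel belongs to the foreground given its color, can be sketched with Bayes' rule over foreground and background histograms. This is a simplified illustration: colors are pre-quantized into four bins, the histograms use Laplace smoothing (an assumption of this sketch), and the level-set and contour propagation steps are omitted. The function names are hypothetical.

```python
from collections import Counter

def build_histogram(values, bins):
    """Normalized histogram with Laplace smoothing so that no bin has zero mass."""
    counts = Counter(values)
    total = len(values)
    return [(counts[b] + 1) / (total + bins) for b in range(bins)]

def foreground_posterior(pixel_bin, fg_hist, bg_hist, fg_prior):
    """P(foreground | color) by Bayes' rule over the two appearance models."""
    fg = fg_hist[pixel_bin] * fg_prior
    bg = bg_hist[pixel_bin] * (1.0 - fg_prior)
    return fg / (fg + bg)

bins = 4
fg_hist = build_histogram([3, 3, 2, 3], bins)        # colors sampled inside the object
bg_hist = build_histogram([0, 0, 1, 0, 1, 0], bins)  # colors sampled outside the object
fg_prior = 4 / 10                                    # fraction of pixels inside the object

frame = [0, 1, 2, 3, 3, 0]                           # quantized colors of a new frame
posteriors = [foreground_posterior(b, fg_hist, bg_hist, fg_prior) for b in frame]
mask = [1 if p > 0.5 else 0 for p in posteriors]
print(mask)
```

Normalizing per pixel makes the decision depend on the relative evidence of the two models, so a color that is rare in both models can still be assigned confidently if it is much rarer in one of them.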

Some methods utilize the energy minimization techniques of CRFs to perform fine segmentation and tracking of the target object. For instance, Milan et al. (Milan et al., 2015) propose a CRF model that exploits high-level detector responses and low-level superpixel information to jointly track and segment multiple objects. Lim et al. (Lim et al., 2013) handle the joint estimation of foreground object segmentation and human pose tracking using a MAP solution. Liu et al. (Liu et al., 2016) present a unified dynamic coupled CRF model to jointly track and segment moving objects at the region level.

CNN framework.

Recently, some researchers have begun to perform visual object tracking and semi-supervised video object segmentation within a convolutional neural network framework (Wang et al., 2018; Zhang et al., 2018). Wang et al. (Wang et al., 2018) present a Siamese network to simultaneously estimate a binary segmentation mask, a bounding box, and the corresponding object/background scores. Taking only a bounding box in the first frame as input, Zhang et al. (Zhang et al., 2018) build a two-branch network, i.e., an appearance network and a contour network. The tracking output and the segmentation results refine each other mutually.

2.5.3. Discussion.

In general, given a bounding box of a target in the first frame, the bottom-up based methods estimate the location of the object in subsequent images, which is similar to tracking-by-detection methods. Unlike traditional visual object tracking methods, the bottom-up based methods use the foreground object contour as a special feature to address object drift in non-rigid object tracking and segmentation. The purpose of these methods is to find the location of the target, so only the bounding box or a coarse mask of the object is estimated. Some methods simply use the segmentation result to address the scale estimation problem in visual object tracking. Compared to joint-based methods, these methods run faster.

On the other hand, joint-based methods unify the two tasks of segmentation and tracking in a graph-based or probabilistic framework, and use energy minimization methods to estimate the exact object mask. Specifically, these energy minimization methods are iterated many times to estimate accurate object poses, motions, occlusions, and so on. Many methods do not output the bounding box of the object, but only the object mask. In general, iterative optimization inherently limits runtime speed. Recently, some researchers have used offline and online CNN-based methods to process segmentation and tracking simultaneously, and the reported results demonstrate accurate and very fast tracking and segmentation.
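
To make the cost of iterative energy minimization concrete, the following minimal sketch labels a one-dimensional chain of nodes (for example, superpixels) as background (0) or foreground (1) by minimizing a unary cost plus a Potts smoothness penalty. It uses iterated conditional modes (ICM), a simple stand-in chosen for brevity and not the optimizer of any method discussed above; the integer costs, the chain structure, and all names are illustrative assumptions.

```python
def energy(labels, unary, smoothness):
    """Unary cost of each label plus a Potts penalty for differing neighbors."""
    total = sum(unary[i][labels[i]] for i in range(len(labels)))
    total += smoothness * sum(
        1 for i in range(len(labels) - 1) if labels[i] != labels[i + 1]
    )
    return total


def icm(unary, smoothness, max_sweeps=10):
    """Iterated conditional modes: greedily relabel one node at a time."""
    # Start from the label with the lowest unary cost at every node.
    labels = [0 if u[0] <= u[1] else 1 for u in unary]
    sweeps = 0
    for _ in range(max_sweeps):
        sweeps += 1
        changed = False
        for i in range(len(labels)):
            costs = []
            for candidate in (0, 1):
                cost = unary[i][candidate]
                if i > 0 and labels[i - 1] != candidate:
                    cost += smoothness
                if i < len(labels) - 1 and labels[i + 1] != candidate:
                    cost += smoothness
                costs.append(cost)
            best = 0 if costs[0] <= costs[1] else 1
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break  # no label changed in a full sweep: local minimum reached
    return labels, sweeps


# Hypothetical (background, foreground) costs for six nodes; node 2 is a
# noisy observation inside a foreground run.
unary = [(1, 9), (8, 2), (4, 6), (9, 1), (8, 2), (2, 8)]
initial = [0 if u[0] <= u[1] else 1 for u in unary]
labels, sweeps = icm(unary, smoothness=2)
```

Even on this tiny chain the optimizer needs a second full sweep to confirm convergence, which hints at why such iterative schemes become expensive on full-resolution video.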

3. Datasets and metrics

To evaluate the performance of various video object segmentation and tracking methods, one needs test video datasets, the ground truth, and metrics for comparing the competing approaches. In this section, we give a brief introduction to the datasets and evaluation protocols.

3.1. Video object segmentation and tracking datasets

A brief summary of video object segmentation and tracking datasets is shown in Table 7. These datasets are described and discussed in more detail next.

| Dataset | V # | C # | O # | A # | Object pro. | T. of methods | Labels | Publish year |
|---|---|---|---|---|---|---|---|---|
| SegTrack | 6 | 6 | 6 | 244 | Single | U, S, I, W, T | Mask | 2012 (Tsai et al., 2012) |
| SegTrack v2 | 14 | 11 | 24 | 1,475 | Multiple | U, S, I, W, T | Mask | 2014 (Li et al., 2014) |
| BMS-26 | 26 | 2 | 38 | 189 | Multiple | U | Mask | 2010 (Brox and Malik, 2010) |
| FBMS-59 | 59 | 16 | 139 | 1,465 | Multiple | U, S | Mask | 2014 (Peter et al., 2014) |
| YouTube-objects | 126 | 10 | 96 | 2,153 | Single | U, S, I, W | Mask | 2014 (Prest et al., 2012; Jain and Grauman, 2014) |
| YouTube-VOS | 3,252 | 78 | 6,048 | 133,886 | Multiple | S | Mask | 2018 (Xu et al., 2018) |
| JumpCut | 22 | 14 | 22 | 6,331 | Single | U, S | Mask | 2015 (Fan et al., 2015) |
| DAVIS 2016 | 50 | - | 50 | 3,440 | Single | U, S, I, W | Mask | 2016 (Perazzi et al., 2016) |
| DAVIS 2017 | 150 | - | 384 | 10,474 | Multiple | U, S, I, W | Mask | 2017 (Pont-Tuset et al., 2017) |
| NR | 11 | - | 11 | 1,200 | Single | S, T | Box, Mask | 2015 (Son et al., 2015) |
| MOT 2016 | 14 | - | - | 11,000 | Multiple | U, T | Box | 2016 (Milan et al., 2016) |
| VOT 2016 | 60 | - | 60 | 21,511 | Single | S, T | Box, Mask | 2016 (Kristan et al., 2016; Vojir and Matas, 2017) |
| OTB 2013 | 50 | - | 50 | 29,000 | Single | T | Box | 2013 (Wu et al., 2013) |
| OTB 2015 | 100 | - | 100 | 58,000 | Single | T | Box | 2015 (Wu et al., 2015) |

Table 7. Brief summary of the datasets used in the evaluation of video object segmentation and tracking methods. V #: number of videos. C #: number of categories. O #: number of objects. A #: number of annotated frames. U, S, I, W, T: unsupervised VOS, semi-supervised VOS, interactive VOS, weakly supervised VOS, and segmentation-based tracking methods. Object pro.: object property. T. of methods: type of methods. A hyphen marks a cell for which the table gives no value.

SegTrack (Tsai et al., 2012) and SegTrack v2 (Li et al., 2014) are introduced to evaluate tracking and video object segmentation algorithms. SegTrack contains 6 videos (monkeydog, girl, birdfall, parachute, cheetah, penguin) and pixel-level ground-truth for the single moving foreground object in every frame. These videos provide a variety of challenges, including non-rigid deformation, similar objects and fast motion of the camera and target. SegTrack v2 contains 14 videos with instance-level moving object annotations in all the frames. Other videos from SegTrack v2 also include cluttered backgrounds and dynamic scenes caused by camera movement or moving background objects. In some video sequences, the objects are visually very similar to the image background, that is, low contrast along object boundaries, such as the birdfall, frog and worm sequences in the SegTrack v2 dataset. In contrast to SegTrack, many videos have more than one object of interest in SegTrack v2.

BMS-26 (Berkeley motion segmentation) (Brox and Malik, 2010) and FBMS-59 (Freiburg-Berkeley motion segmentation) (Peter et al., 2014) are widely used for unsupervised and semi-supervised VOS methods. The BMS-26 dataset consists of 26 videos with a total of 189 annotated frames, comprising shots from movie stories together with 10 vehicle and 2 human sequences. The FBMS-59 dataset makes two major improvements over BMS-26. First, it adds 33 new sequences, so that FBMS-59 consists of 59 sequences. Second, these 33 new sequences incorporate the challenges of unconstrained videos, such as fast motion, motion blur, occlusions, and object appearance changes. The sequences are divided into 29 training and 30 test video sequences.

YouTube-objects (Prest et al., 2012) and YouTube-VOS (YouTube video object segmentation) (Xu et al., 2018) contain a large number of Internet videos. Jain et al. (Jain and Grauman, 2014) adopt a subset of YouTube-objects that contains 126 videos with 10 object categories and 20,977 frames. In these videos, 2,153 key-frames are sparsely sampled and manually annotated with pixel-wise masks according to the video labels. The YouTube-objects dataset is used for unsupervised, semi-supervised, interactive, and weakly supervised VOS approaches. In 2018, Xu et al. (Xu et al., 2018) released a large-scale video object segmentation dataset called YouTube-VOS. The dataset contains 3,252 YouTube video clips and 133,886 object annotations, spanning 78 categories that cover common animals, cars, accessories, and human activities. The authors also build a sequence-to-sequence semi-supervised video object segmentation algorithm to validate the dataset and evaluate performance on it.

The JumpCut dataset (Fan et al., 2015) consists of 22 video sequences with medium image resolution. It contains 14 categories (6,331 annotated images in total) along with pixel-level ground-truth annotations. Most image frames in the JumpCut dataset contain very fast object motion and significant foreground deformations. The JumpCut dataset is therefore considered a more challenging set of video sequences, and it is widely used to evaluate modern unsupervised and semi-supervised video segmentation techniques.

The DAVIS 2016, 2017, and 2018 datasets are among the most popular datasets for training and evaluating video object segmentation algorithms. The DAVIS 2016 (Perazzi et al., 2016) dataset contains 50 full high quality video sequences with 3,455 annotated frames in total, and focuses on single-object video object segmentation, that is, there is only one foreground object per video. The dataset is divided into a training set of 30 videos and a validation set of 20 videos. Later, DAVIS 2017 (Pont-Tuset et al., 2017) complements the DAVIS 2016 training and validation sets with 30 and 10 high quality videos, respectively. It also provides an additional 30 development test sequences and 30 challenge test sequences. In addition, the DAVIS 2017 dataset relabels multiple objects in all video sequences. These improvements make it more challenging than the original DAVIS 2016 dataset. Each video is also labeled with multiple attributes, such as occlusion, object deformation, fast motion, and scale change, to support a comprehensive analysis of model performance. Moreover, the DAVIS 2018 dataset (Caelles et al., 2018) adds 100 videos with multiple objects per video to the original DAVIS 2016 dataset, and introduces an interactive segmentation teaser track.

The NR (non-rigid object tracking) dataset (Son et al., 2015) consists of 11 video sequences with 1,200 frames that contain deformable and articulated objects. First, the pixel-level annotations are performed manually. The bounding box annotation is then generated by calculating the tightest rectangular bounding box that contains all of the object pixels. Each video is labeled with only one object. The dataset has been used to evaluate segmentation-based tracking and semi-supervised VOS algorithms.
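
The box-from-mask step described above can be sketched as follows. This is a minimal illustration, assuming the mask is a nested list of 0/1 values and that the box is reported as inclusive pixel coordinates; the function name and the coordinate convention are assumptions, not details taken from the dataset paper.

```python
def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) of the nonzero pixels, or None."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, v in enumerate(row) if v]
    if not rows:
        return None  # empty mask: no object pixels to enclose
    return (min(cols), rows[0], max(cols), rows[-1])


# Hypothetical 4x5 mask with a small non-rectangular object.
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
bbox = mask_to_bbox(mask)
```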

The MOT 2016 (multiple object tracking) dataset (Milan et al., 2016) consists of 14 sequences with 11,000 frames that cover crowded scenarios, different viewpoints, camera and object motions, and varying weather conditions. The targets are annotated with axis-aligned minimum bounding boxes in each video sequence. Datasets for MOT are relatively smaller in scale than those for single object tracking, and current datasets focus on pedestrians. This dataset is used to evaluate multi-object unsupervised VOS and segmentation-based tracking algorithms.

The VOT 2016 (video object tracking) dataset (Kristan et al., 2016) contains 60 high-quality video sequences targeted at single-object tracking and segmentation tasks. It consists of 21,511 frames in total. Vojir and Matas (Vojir and Matas, 2017) provide pixel-level segmentation annotations for the VOT 2016 dataset, and thereby construct a challenging test dataset for segmentation-based tracking.

OTB (object tracking benchmark) is widely used to evaluate single-object segmentation-based video object tracking algorithms. The OTB 2013 dataset (Wu et al., 2013) has 50 video sequences that are fully annotated with bounding boxes. The OTB 2015 dataset (Wu et al., 2015) consists of 100 video sequences and 58,000 annotated frames of real-world moving objects.

3.2. Metrics

In order to evaluate performance, this section focuses on the specific case of video object segmentation and tracking, where both the predicted results and the ground truth are foreground-background partitions. Measures can focus on evaluating which pixels of the ground truth are detected, or on indicating the precision of the bounding box.

3.2.1. Evaluation of pixel-wise object segmentation techniques

For video object segmentation, the standard evaluation protocol comprises three measures (Perazzi et al., 2016), namely the spatial precision of the segmentation, the contour similarity, and the temporal stability.

  • Region similarity $\mathcal{J}$. The region similarity is the intersection over union (IoU) between the predicted object segmentation mask $M$ and the ground truth $G$. This metric quantifies how many pixels the segmentation algorithm labels correctly and penalizes misclassified pixels. It is defined as $\mathcal{J} = \frac{|M \cap G|}{|M \cup G|}$.

  • Contour precision $\mathcal{F}$. The segmented mask is treated as a set of closed contour regions, and the contour-based $\mathcal{F}$-measure is computed from the precision and recall of the contour. This indicator measures the precision of the segmentation boundary. Let the segmented mask $M$ be interpreted as a set of closed contours $c(M)$. The contour-based precision $P_c$ and recall $R_c$ are computed between $c(M)$ and $c(G)$, and the $\mathcal{F}$-measure is defined as $\mathcal{F} = \frac{2 P_c R_c}{P_c + R_c}$.

  • Temporal stability $\mathcal{T}$. Most VOS methods also use temporal stability to measure the turbulence and inaccuracy of the contours. The temporal stability of a video segmentation is measured by the dissimilarity of the shape context descriptors that describe the pixels on the contour of the segmentation between two adjacent frames in the video sequence.
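
The region similarity and the contour F-measure described above can be sketched as follows. This is a minimal illustration under stated assumptions: masks are nested lists of 0/1 values, two empty masks are treated as a perfect match, and the contour precision and recall are taken as inputs because the contour matching step is not implemented here. The function names are hypothetical and this is not the official benchmark code.

```python
def region_similarity(pred, gt):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Convention assumed here: two empty masks agree perfectly.
    return inter / union if union else 1.0


def f_measure(precision, recall):
    """Harmonic mean of contour precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical 3x3 masks: the prediction has one extra foreground pixel.
pred = [[0, 1, 1], [0, 1, 1], [0, 0, 0]]
gt = [[0, 1, 1], [0, 1, 0], [0, 0, 0]]
j = region_similarity(pred, gt)   # 3 shared pixels over 4 in the union
f = f_measure(0.5, 1.0)           # assumed contour precision and recall
```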

3.2.2. Evaluation of bounding box based object tracking techniques

For segmentation-based tracking approaches, both the mask and the bounding box of the object may be output. To evaluate object tracking algorithms, we therefore distinguish two categories: single object and multiple objects.

The evaluation protocols of the OTB 2013 (Wu et al., 2013) and VOT 2016 (Kristan et al., 2016) datasets are widely used for single object tracking algorithms. For the OTB 2013 benchmark, four metrics with one-pass evaluation (OPE) are used to evaluate all the compared trackers: (i) bounding box overlap, which is measured by the VOC overlap ratio (VOR); (ii) center location error (CLE); (iii) distance precision (DP); and (iv) overlap precision (OP). For the VOT 2016 benchmark, there are three main measures for analyzing the performance of short-term tracking: accuracy, robustness, and expected average overlap (EAO). The accuracy is the average overlap between the predicted and the ground-truth bounding boxes during periods of successful tracking. The robustness measures the number of times a tracker loses the target (i.e., fails) during tracking. EAO estimates the accuracy of the estimated bounding box after processing a certain number of frames since initialization.
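
The four OTB-style measures listed above can be sketched as follows. This is an illustrative simplification, not the benchmark toolkit: boxes are assumed to be `(x, y, width, height)` tuples, and the 20-pixel and 0.5 thresholds are commonly used defaults that are treated here as assumptions.

```python
import math


def box_overlap(a, b):
    """VOC overlap ratio (IoU) of two boxes given as (x, y, width, height)."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def center_error(a, b):
    """Center location error: Euclidean distance between box centers."""
    ax, ay = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx, by = b[0] + b[2] / 2, b[1] + b[3] / 2
    return math.hypot(ax - bx, ay - by)


def distance_precision(preds, gts, threshold=20.0):
    """Fraction of frames whose center error is within the threshold."""
    hits = sum(1 for p, g in zip(preds, gts) if center_error(p, g) <= threshold)
    return hits / len(preds)


def overlap_precision(preds, gts, threshold=0.5):
    """Fraction of frames whose overlap ratio exceeds the threshold."""
    hits = sum(1 for p, g in zip(preds, gts) if box_overlap(p, g) > threshold)
    return hits / len(preds)


# Hypothetical three-frame sequence: perfect, shifted, and lost predictions.
gts = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
preds = [(0, 0, 10, 10), (5, 0, 10, 10), (30, 0, 10, 10)]
dp = distance_precision(preds, gts)
op = overlap_precision(preds, gts)
```

In this toy sequence the shifted prediction still counts as a hit for distance precision but not for overlap precision, which illustrates why the two measures are reported separately.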

Metrics for multiple target tracking are divided into three classes by different attributes: accuracy, precision, and completeness. Combining false negatives, false positives, and mismatches into a single value yields the multiple object tracking accuracy (MOTA) metric. The multiple object tracking precision (MOTP) metric describes how precisely objects are localized, measured by bounding box overlap and/or center location distance. The completeness metrics indicate how completely the ground-truth trajectories are tracked.
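
As a worked example of the accuracy measure, the sketch below assumes the standard CLEAR MOT definition, in which MOTA equals one minus the ratio of all errors (misses or false negatives, false positives, and identity mismatches) to the total number of ground-truth objects over all frames. The per-frame counts are made up for illustration, and the function name is hypothetical.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with per-frame counts summed."""
    total_gt = sum(num_gt)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    errors = sum(false_negatives) + sum(false_positives) + sum(id_switches)
    return 1.0 - errors / total_gt


# Hypothetical counts for a three-frame sequence with 10 objects per frame.
score = mota(
    false_negatives=[1, 0, 2],
    false_positives=[0, 1, 0],
    id_switches=[0, 0, 1],
    num_gt=[10, 10, 10],
)
```

Note that the score can become negative when the number of errors exceeds the number of ground-truth objects.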

4. Future directions

Based on significant advances in video object segmentation and tracking, we suggest some future research directions that would be interesting to pursue.

Simultaneous prediction of VOS and VOT. Among the traditional hand-crafted video object segmentation and tracking methods, many algorithms output the mask and the bounding box of the object simultaneously. Recently, researchers have proposed end-to-end VOS and VOT methods that address both problems in a single deep framework and simultaneously predict pixel-level object masks and object-level bounding boxes with impressive performance. This raises an important trade-off between speed and accuracy. On the one hand, accuracy is critical in some applications, and it is often improved through iterative fine-tuning, which is computationally expensive and therefore slow. On the other hand, increasing the processing speed without sacrificing segmentation and tracking performance would be a very interesting direction.

Fine-grained video object segmentation and tracking. Segmentation and tracking of fine-grained objects in full HD video is challenging. Since such videos generally have a large background with varied appearance and motion, small parts of fine-grained objects cannot be segmented and tracked with sufficient accuracy. In fine-grained segmentation and recognition tasks, however, these small parts usually contain semantic information that is extremely important for fine-grained classification. Moreover, object tracking is essentially a process of continuously predicting the motion of a very small object between frames, so a method that cannot distinguish small differences well may not be an optimal design choice. Therefore, how to accurately segment and track fine-grained objects, and thereby improve the performance of video recognition tasks, plays an important role in many real-world applications.

Generalization performance of VOST. Generalization has always been a difficulty for video segmentation and tracking algorithms. Although VOST tasks can be solved after training, it is difficult to transfer the acquired experience to new categories or to unconstrained videos, for example videos that are noisy, compressed, unstructured, or captured by moving cameras from multiple views. End-to-end training in deep learning is currently used to improve generalization. Although there are many datasets, such as DAVIS, YouTube-VOS, OTB, and VOT, these datasets have some limitations and differ somewhat from real environments. Both the diversity of the appearance of the foreground object and the complexity of the object motion trajectory directly affect the generalization ability of object segmentation and tracking methods. Therefore, how to quickly and accurately segment and track objects in these new categories or environments will be a focus of research.

Multi-camera video object segmentation and tracking. Performing video analysis and monitoring in complex environments requires the use of multiple cameras. This has led to increasing interest in research on multi-camera collaborative video analysis. By fusing visual information from different viewpoints, a multi-camera method can jointly handle video object segmentation and tracking for the same scene monitored by different cameras, which may improve performance. However, it should be noted that the images used in multi-camera surveillance are usually captured by cameras at different locations. There is therefore great diversity in visual perspective, which should be taken into account in video object segmentation and tracking techniques.

3D video object segmentation and tracking. The analysis and processing of 3D objects is a core problem in the computer vision community. There are two directions of interest here. First, VOST is an important prerequisite for avoiding obstacles and pedestrians. Segmentation combined with 3D images produces detailed object boundaries in 3D, from which a subsequent path planning algorithm can generate motion trajectories that avoid collisions. Autonomous robots likewise use video object segmentation and tracking to locate, find, and grasp objects of interest. Second, in building infrastructure modeling, one can create virtual 3D models of buildings that contain their semantic regions. Such a model can then be used to quickly calculate statistics in the video. 3D reconstruction systems provide very detailed geometry, but after scanning, cumbersome post-processing steps are required to cut out the object of interest. Video object segmentation and tracking can help automate this task.

5. Conclusion

In this article, we provided a comprehensive survey of the video object segmentation and tracking literature. We described the challenges and potential applications in the field, classified and analyzed recent methods, and discussed different algorithms. The survey is organized by application scenario and reviews five important categories of the VOST literature: unsupervised VOS, semi-supervised VOS, interactive VOS, weakly supervised VOS, and segmentation-based tracking methods. We provided a hierarchical categorization of the different groups of existing work, and summarized object representations, image features, motion cues, and so on. We also described various pre-processing and post-processing CNN-based VOS methods, and discussed the advantages and disadvantages of these methods. Moreover, we described the related video datasets for video object segmentation and tracking, and the evaluation metrics for pixel-wise mask and bounding box based techniques. We believe this review will benefit researchers in this field and provide useful insights into this important research topic. We hope it will encourage more future work in this direction.


  • Achanta et al. (2012) Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. 2012. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE transactions on pattern analysis and machine intelligence 34, 11 (2012), 2274–2282.
  • Adam et al. (2006) A. Adam, E. Rivlin, and I. Shimshoni. 2006. Robust Fragments-based Tracking using the Integral Histogram. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 1. 798–805.
  • Aeschliman et al. (2010) Chad Aeschliman, Johnny Park, and Avinash C. Kak. 2010. A probabilistic framework for joint segmentation and tracking. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 1371–1378.
  • Avinash Ramakanth and Venkatesh Babu (2014) S Avinash Ramakanth and R Venkatesh Babu. 2014. Seamseg: Video object segmentation using patch seams. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 376–383.
  • Badrinarayanan et al. (2013) Vijay Badrinarayanan, Ignas Budvytis, and Roberto Cipolla. 2013. Semi-supervised video segmentation using tree structured graphical models. IEEE transactions on pattern analysis and machine intelligence 35, 11 (2013), 2751–2764.
  • Bai and Sapiro (2009) Xue Bai and Guillermo Sapiro. 2009. Geodesic matting: A framework for fast interactive image and video segmentation and matting. International journal of computer vision 82, 2 (2009), 113–132.
  • Bai et al. (2010) Xue Bai, Jue Wang, and Guillermo Sapiro. 2010. Dynamic color flow: a motion-adaptive color model for object segmentation in video. In European Conference on Computer Vision. Springer, 617–630.
  • Bai et al. (2009) Xue Bai, Jue Wang, David Simons, and Guillermo Sapiro. 2009. Video snapcut: robust video object cutout using localized classifiers. In ACM Transactions on Graphics (ToG), Vol. 28. ACM, 70.
  • Banica et al. (2013) Dan Banica, Alexandru Agape, Adrian Ion, and Cristian Sminchisescu. 2013. Video object segmentation by salient segment chain composition. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 283–290.
  • Bao et al. (2018) L. Bao, B. Wu, and W. Liu. 2018. CNN in MRF: Video Object Segmentation via Inference in a CNN-Based Higher-Order Spatio-Temporal MRF. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5977–5986.
  • Barnich and Van Droogenbroeck (2011) Olivier Barnich and Marc Van Droogenbroeck. 2011. ViBe: A universal background subtraction algorithm for video sequences. IEEE Transactions on Image processing 20, 6 (2011), 1709–1724.
  • Belagiannis et al. (2012) Vasileios Belagiannis, Falk Schubert, Nassir Navab, and Slobodan Ilic. 2012. Segmentation based particle filtering for real-time 2d object tracking. In European Conference on Computer Vision. Springer, 842–855.
  • Benard and Gygli (2017) Arnaud Benard and Michael Gygli. 2017. Interactive video object segmentation in the wild. arXiv preprint arXiv:1801.00269 (2017).
  • Bibby and Reid (2008) Charles Bibby and Ian Reid. 2008. Robust Real-Time Visual Tracking Using Pixel-Wise Posteriors. In Computer Vision – ECCV 2008, David Forsyth, Philip Torr, and Andrew Zisserman (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 831–844.
  • Boykov and Jolly (2001) Yuri Y Boykov and M-P Jolly. 2001. Interactive graph cuts for optimal boundary & region segmentation of objects in ND images. In Proceedings eighth IEEE international conference on computer vision. ICCV 2001, Vol. 1. IEEE, 105–112.
  • Brendel and Todorovic (2009) William Brendel and Sinisa Todorovic. 2009. Video object segmentation by tracking regions. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, 833–840.
  • Brox and Malik (2010) Thomas Brox and Jitendra Malik. 2010. Object Segmentation by Long Term Analysis of Point Trajectories. In Computer Vision – ECCV 2010, Kostas Daniilidis, Petros Maragos, and Nikos Paragios (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 282–295.
  • Brox and Malik (2011) Thomas Brox and Jitendra Malik. 2011. Large displacement optical flow: descriptor matching in variational motion estimation. IEEE transactions on pattern analysis and machine intelligence 33, 3 (2011), 500–513.
  • Brutzer et al. (2011) Sebastian Brutzer, Benjamin Höferlin, and Gunther Heidemann. 2011. Evaluation of background subtraction techniques for video surveillance. In CVPR 2011. IEEE, 1937–1944.
  • Budvytis et al. (2012) Ignas Budvytis, Vijay Badrinarayanan, and Roberto Cipolla. 2012. MoT-Mixture of Trees Probabilistic Graphical Model for Video Segmentation.. In BMVC, Vol. 1. Citeseer, 7.
  • Bugeau and Pérez (2008) Aurélie Bugeau and Patrick Pérez. 2008. Track and cut: Simultaneous tracking and segmentation of multiple objects with graph cuts. Eurasip Journal on Image and Video Processing 2008, October (2008).
  • Caelles et al. (2017) S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. V. Gool. 2017. One-Shot Video Object Segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5320–5329.
  • Caelles et al. (2018) Sergi Caelles, Alberto Montes, Kevis-Kokitsi Maninis, Yuhua Chen, Luc Van Gool, Federico Perazzi, and Jordi Pont-Tuset. 2018. The 2018 davis challenge on video object segmentation. arXiv preprint arXiv:1803.00557 1, 2 (2018).
  • Cai et al. (2014) Zhaowei Cai, Longyin Wen, Zhen Lei, Nuno Vasconcelos, and Stan Z Li. 2014. Robust deformable and occluded object tracking with dynamic graph. IEEE Transactions on Image Processing 23, 12 (2014), 5497–5509.
  • Carreira and Sminchisescu (2012) Joao Carreira and Cristian Sminchisescu. 2012. CPMC: Automatic object segmentation using constrained parametric min-cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 7 (2012), 1312–1328.
  • Chang et al. (2013) Jason Chang, Donglai Wei, and John W Fisher. 2013. A video representation using temporal superpixels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2051–2058.
  • Chaohui Wang et al. (2009) Chaohui Wang, Martin de La Gorce, and Nikos Paragios. 2009. Segmentation, ordering and multi-object tracking using graphical models. In 2009 IEEE 12th International Conference on Computer Vision. 747–754.
  • Chen et al. (2015) Lin Chen, Jianbing Shen, Wenguan Wang, and Bingbing Ni. 2015. Video object segmentation via dense trajectories. IEEE Transactions on Multimedia 17, 12 (2015), 2225–2234.
  • Chen et al. (2018) Yuhua Chen, Jordi Pont-Tuset, Alberto Montes, and Luc Van Gool. 2018. Blazingly fast video object segmentation with pixel-wise metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1189–1198.
  • Cheng et al. (2017) Jingchun Cheng, Sifei Liu, Yi-Hsuan Tsai, Wei-Chih Hung, Shalini De Mello, Jinwei Gu, Jan Kautz, Shengjin Wang, and Ming-Hsuan Yang. 2017. Learning to segment instances in videos with spatial propagation network. arXiv preprint arXiv:1709.04609 (2017).
  • Cheng et al. (2018) Jingchun Cheng, Yi-Hsuan Tsai, Wei-Chih Hung, Shengjin Wang, and Ming-Hsuan Yang. 2018. Fast and Accurate Online Video Object Segmentation via Tracking Parts. arXiv preprint arXiv:1806.02323 (2018).
  • Cheng et al. (2017) Jingchun Cheng, Yi-Hsuan Tsai, Shengjin Wang, and Ming-Hsuan Yang. 2017. Segflow: Joint learning for video object segmentation and optical flow. In Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 686–695.
  • Chien et al. (2013) S. Y. Chien, W. K. Chan, Y. H. Tseng, and H. Y. Chen. 2013. Video Object Segmentation and Tracking Framework With Improved Threshold Decision and Diffusion Distance. IEEE Transactions on Circuits and Systems for Video Technology 23, 6 (2013), 921–934.
  • Chockalingam et al. (2009a) P. Chockalingam, N. Pradeep, and S. Birchfield. 2009a. Adaptive fragments-based tracking of non-rigid objects using level sets. In 2009 IEEE 12th International Conference on Computer Vision. 1530–1537.
  • Chockalingam et al. (2009b) Prakash Chockalingam, Nalin Pradeep, and Stan Birchfield. 2009b. Adaptive fragments-based tracking of non-rigid objects using level sets. In 2009 IEEE 12th international conference on computer vision. IEEE, 1530–1537.
  • Chu et al. (2015) Wen-Sheng Chu, Yale Song, and Alejandro Jaimes. 2015. Video co-summarization: Video summarization by visual co-occurrence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3584–3592.
  • Colombari et al. (2007) Andrea Colombari, Andrea Fusiello, and Vittorio Murino. 2007. Segmentation and tracking of multiple video objects. Pattern Recognition 40, 4 (2007), 1307–1317.
  • Criminisi et al. (2006) Antonio Criminisi, Geoffrey Cross, Andrew Blake, and Vladimir Kolmogorov. 2006. Bilayer segmentation of live video. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 1. IEEE, 53–60.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.
  • Dondera et al. (2014) Radu Dondera, Vlad Morariu, Yulu Wang, and Larry Davis. 2014. Interactive video segmentation using occlusion boundaries and temporally coherent superpixels. In IEEE Winter Conference on Applications of Computer Vision. IEEE, 784–791.
  • Dosovitskiy et al. (2015) Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. 2015. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision. 2758–2766.
  • Duffner and Garcia (2014) Stefan Duffner and Christophe Garcia. 2014. PixelTrack: A Fast Adaptive Algorithm for Tracking Non-rigid Objects. In IEEE International Conference on Computer Vision.
  • Dutt Jain et al. (2017) Suyog Dutt Jain, Bo Xiong, and Kristen Grauman. 2017. FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3664–3673.
  • Elgammal et al. (2002) Ahmed Elgammal, Ramani Duraiswami, David Harwood, and Larry S Davis. 2002. Background and foreground modeling using nonparametric kernel density estimation for visual surveillance. Proc. IEEE 90, 7 (2002), 1151–1163.
  • Ellis and Zografos (2013) Liam Ellis and Vasileios Zografos. 2013. Online Learning for Fast Segmentation of Moving Objects. In Computer Vision – ACCV 2012, Kyoung Mu Lee, Yasuyuki Matsushita, James M. Rehg, and Zhanyi Hu (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 52–65.
  • Endres and Hoiem (2010) Ian Endres and Derek Hoiem. 2010. Category independent object proposals. In European Conference on Computer Vision. Springer, 575–588.
  • Erdem et al. (2004) Çiǧdem Eroǧlu Erdem, Bülent Sankur, and A. Murat Tekalp. 2004. Performance measures for video object segmentation and tracking. IEEE Transactions on Image Processing 13, 7 (2004), 937–951.
  • Faktor and Irani (2014) Alon Faktor and Michal Irani. 2014. Video Segmentation by Non-Local Consensus voting.. In BMVC, Vol. 2. 8.
  • Fan et al. (2015) Qingnan Fan, Fan Zhong, Dani Lischinski, Daniel Cohen-Or, and Baoquan Chen. 2015. JumpCut: non-successive mask transfer and interpolation for video cutout. ACM Trans. Graph. 34, 6 (2015), 195–1.
  • Fragkiadaki et al. (2015) Katerina Fragkiadaki, Pablo Arbelaez, Panna Felsen, and Jitendra Malik. 2015. Learning to segment moving objects in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4083–4090.
  • Fragkiadaki et al. (2012) Katerina Fragkiadaki, Geng Zhang, and Jianbo Shi. 2012. Video segmentation by tracing discontinuities in a trajectory embedding. In 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1846–1853.
  • Fukuchi et al. (2009) Ken Fukuchi, Kouji Miyazato, Akisato Kimura, Shigeru Takagi, and Junji Yamato. 2009. Saliency-based video segmentation with graph cuts and sequentially updated priors. In 2009 IEEE International Conference on Multimedia and Expo. IEEE, 638–641.
  • Gall et al. (2011) Juergen Gall, Angela Yao, Nima Razavi, Luc Van Gool, and Victor Lempitsky. 2011. Hough forests for object detection, tracking, and action recognition. IEEE transactions on pattern analysis and machine intelligence 33, 11 (2011), 2188–2202.
  • Gao et al. (2000) Xiang Gao, Terrance E Boult, Frans Coetzee, and Visvanathan Ramesh. 2000. Error analysis of background adaption. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No. PR00662), Vol. 1. IEEE, 503–510.
  • Giordano et al. (2015) Daniela Giordano, Francesca Murabito, Simone Palazzo, and Concetto Spampinato. 2015. Superpixel-based video object segmentation using perceptual organization and location prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4814–4822.
  • Godec et al. (2013) Martin Godec, Peter M Roth, and Horst Bischof. 2013. Hough-based tracking of non-rigid objects. Computer Vision and Image Understanding 117, 10 (2013), 1245–1256.
  • Goel et al. (2018) Vikash Goel, Jameson Weng, and Pascal Poupart. 2018. Unsupervised video object segmentation for deep reinforcement learning. In Advances in Neural Information Processing Systems. 5688–5699.
  • Grundmann et al. (2010) Matthias Grundmann, Vivek Kwatra, Mei Han, and Irfan Essa. 2010. Efficient hierarchical graph-based video segmentation. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 2141–2148.
  • Gu and Lee (1998) Chuang Gu and Ming-Chieh Lee. 1998. Semiautomatic segmentation and tracking of semantic video objects. IEEE Transactions on Circuits and Systems for Video Technology 8, 5 (Sep. 1998), 572–584.
  • Guizilini and Ramos (2013) Vitor Guizilini and Fabio Ramos. 2013. Online self-supervised segmentation of dynamic objects. In 2013 IEEE International Conference on Robotics and Automation. IEEE, 4720–4727.
  • Han and Davis (2012) Bohyung Han and Larry S Davis. 2012. Density-based multifeature background subtraction with support vector machine. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 5 (2012), 1017–1023.
  • Han et al. (2018) Junwei Han, Le Yang, Dingwen Zhang, Xiaojun Chang, and Xiaodan Liang. 2018. Reinforcement Cutting-Agent Learning for Video Object Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9080–9089.
  • Hare et al. (2016) Sam Hare, Stuart Golodetz, Amir Saffari, Vibhav Vineet, Ming-Ming Cheng, Stephen L Hicks, and Philip HS Torr. 2016. Struck: Structured output tracking with kernels. IEEE transactions on pattern analysis and machine intelligence 38, 10 (2016), 2096–2109.
  • Hartmann et al. (2012) Glenn Hartmann, Matthias Grundmann, Judy Hoffman, David Tsai, Vivek Kwatra, Omid Madani, Sudheendra Vijayanarasimhan, Irfan Essa, James Rehg, and Rahul Sukthankar. 2012. Weakly supervised learning of object segmentations from web-scale video. In European Conference on Computer Vision. Springer, 198–208.
  • Heber et al. (2013) Markus Heber, Martin Godec, Matthias Rüther, Peter M. Roth, and Horst Bischof. 2013. Segmentation-based tracking by support fusion. Computer Vision and Image Understanding 117, 6 (2013), 573–586.
  • Held et al. (2016) David Held, Devin Guillory, Brice Rebsamen, Sebastian Thrun, and Silvio Savarese. 2016. A Probabilistic Framework for Real-time 3D Segmentation using Spatial, Temporal, and Semantic Cues. In Robotics: Science and Systems.
  • Horn and Schunck (1981) Berthold KP Horn and Brian G Schunck. 1981. Determining optical flow. Artificial intelligence 17, 1-3 (1981), 185–203.
  • Hsiao et al. (2006) Ying-Tung Hsiao, Cheng-Long Chuang, Yen-Ling Lu, and Joe-Air Jiang. 2006. Robust multiple objects tracking using image segmentation and trajectory estimation scheme in video frames. Image and Vision Computing 24, 10 (2006), 1123–1136.
  • Hu et al. (2018c) Ping Hu, Gang Wang, Xiangfei Kong, Jason Kuen, and Yap-Peng Tan. 2018c. Motion-Guided Cascaded Refinement Network for Video Object Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1400–1409.
  • Hu et al. (2017) Yuan-Ting Hu, Jia-Bin Huang, and Alexander Schwing. 2017. MaskRNN: Instance level video object segmentation. In Advances in Neural Information Processing Systems. 325–334.
  • Hu et al. (2018a) Yuan-Ting Hu, Jia-Bin Huang, and Alexander G Schwing. 2018a. Unsupervised video object segmentation using motion saliency-guided spatio-temporal propagation. In Proceedings of the European Conference on Computer Vision (ECCV). 786–802.
  • Hu et al. (2018b) Yuan-Ting Hu, Jia-Bin Huang, and Alexander G. Schwing. 2018b. VideoMatch: Matching based Video Object Segmentation. In The European Conference on Computer Vision (ECCV).
  • Huang et al. (2009) Yuchi Huang, Qingshan Liu, and Dimitris Metaxas. 2009. Video object segmentation by hypergraph cut. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1738–1745.
  • Ilg et al. (2017) Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. 2017. FlowNet 2.0: Evolution of Optical Flow Estimation With Deep Networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Irani and Anandan (1998) Michal Irani and P Anandan. 1998. A unified approach to moving object detection in 2D and 3D scenes. IEEE transactions on pattern analysis and machine intelligence 20, 6 (1998), 577–589.
  • Irani et al. (1994) Michal Irani, Benny Rousso, and Shmuel Peleg. 1994. Computing occluding and transparent motions. International Journal of Computer Vision 12, 1 (1994), 5–16.
  • Jain and Grauman (2014) Suyog Dutt Jain and Kristen Grauman. 2014. Supervoxel-Consistent Foreground Propagation in Video. In Computer Vision – ECCV 2014, David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.). Springer International Publishing, Cham, 656–671.
  • Jampani et al. (2017) Varun Jampani, Raghudeep Gadde, and Peter V Gehler. 2017. Video propagation networks. In Proc. CVPR, Vol. 6. 7.
  • Jang and Kim (2016) Won-Dong Jang and Chang-Su Kim. 2016. Semi-supervised Video Object Segmentation Using Multiple Random Walkers. In Proc. BMVC.
  • Jang et al. (2016) Won-Dong Jang, Chulwoo Lee, and Chang-Su Kim. 2016. Primary object segmentation in videos via alternate convex optimization of foreground and background distributions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 696–704.
  • Jun Koh et al. (2018) Yeong Jun Koh, Young-Yoon Lee, and Chang-Su Kim. 2018. Sequential Clique Optimization for Video Object Segmentation. In The European Conference on Computer Vision (ECCV).
  • Keuper et al. (2018) M. Keuper, S. Tang, B. Andres, T. Brox, and B. Schiele. 2018. Motion Segmentation and Multiple Object Tracking by Correlation Co-Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018), 1–1.
  • Khoreva et al. (2017) A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele. 2017. Lucid Data Dreaming for Object Tracking. In The 2017 DAVIS Challenge on Video Object Segmentation - CVPR Workshops.
  • Khoreva et al. (2018) Anna Khoreva, Anna Rohrbach, and Bernt Schiele. 2018. Video object segmentation with language referring expressions. arXiv preprint arXiv:1803.08006 (2018).
  • Kim and Hwang (2002) Changick Kim and Jenq-Neng Hwang. 2002. Fast and automatic video object segmentation and tracking for content-based applications. IEEE Transactions on Circuits and Systems for Video Technology 12, 2 (Feb 2002), 122–129.
  • Koffka (2013) Kurt Koffka. 2013. Principles of Gestalt psychology. Routledge.
  • Koh and Kim (2017a) Yeong Jun Koh and Chang-Su Kim. 2017a. CDTS: Collaborative Detection, Tracking, and Segmentation for Online Multiple Object Segmentation in Videos. In 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 3621–3629.
  • Koh and Kim (2017b) Yeong Jun Koh and Chang-Su Kim. 2017b. Primary object segmentation in videos based on region augmentation and reduction. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 7417–7425.
  • Kompatsiaris and Strintz (2000) I. Kompatsiaris and M. Gerassimos Strintzis. 2000. Spatiotemporal segmentation and tracking of objects for visualization of videoconference image sequences. IEEE Transactions on Circuits and Systems for Video Technology 10, 8 (2000), 1388–1402.
  • Kong (2010) Hong Kong. 2010. Robust tracking based on boosted color soft segmentation and ICA-R. In ICIP. 3917–3920.
  • Krähenbühl and Koltun (2011) Philipp Krähenbühl and Vladlen Koltun. 2011. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems. 109–117.
  • Kristan et al. (2016) Matej Kristan, Aleš Leonardis, Jiři Matas, Michael Felsberg, Roman Pflugfelder, and Luka Čehovin. 2016. The Visual Object Tracking VOT2016 Challenge Results. In Computer Vision – ECCV 2016 Workshops, Gang Hua and Hervé Jégou (Eds.). Springer International Publishing, Cham, 777–823.
  • Lalos et al. (2010) Constantinos Lalos, Helmut Grabner, Luc Van Gool, and Theodora Varvarigou. 2010. Object flow: Learning object displacement. In Asian Conference on Computer Vision. Springer, 133–142.
  • Lee et al. (2018) Hakjin Lee, Jongbin Ryu, and Jongwoo Lim. 2018. Joint Object Tracking and Segmentation with Independent Convolutional Neural Networks. In Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild. ACM, 7–13.
  • Lee et al. (2011) Yong Jae Lee, Jaechul Kim, and Kristen Grauman. 2011. Key-segments for video object segmentation. In 2011 International conference on computer vision. IEEE, 1995–2002.
  • Levinshtein et al. (2012) Alex Levinshtein, Cristian Sminchisescu, and Sven Dickinson. 2012. Optimal image and video closure by superpixel grouping. International journal of computer vision 100, 1 (2012), 99–119.
  • Lezama et al. (2011) J. Lezama, K. Alahari, J. Sivic, and I. Laptev. 2011. Track to the future: Spatio-temporal video segmentation with long-range motion cues. In CVPR 2011. 3369–3376.
  • Li et al. (2014) Fuxin Li, Taeyoung Kim, Ahmad Humayun, David Tsai, and James M. Rehg. 2014. Video Segmentation by Tracking Many Figure-Ground Segments. In IEEE International Conference on Computer Vision.
  • Li and Ngan (2007) Hongliang Li and King N Ngan. 2007. Automatic video segmentation and tracking for content-based applications. IEEE Communications Magazine 45, 1 (2007), 27–33.
  • Li et al. (2017) Jia Li, Anlin Zheng, Xiaowu Chen, and Bin Zhou. 2017. Primary video object segmentation via complementary cnns and neighborhood reversible flow. In Proceedings of the IEEE International Conference on Computer Vision. 1417–1425.
  • Li et al. (2018) Siyang Li, Bryan Seybold, Alexey Vorobyov, Alireza Fathi, Qin Huang, and C-C Jay Kuo. 2018. Instance embedding transfer to unsupervised video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6526–6535.
  • Li and Change Loy (2018) Xiaoxiao Li and Chen Change Loy. 2018. Video object segmentation with joint re-identification and attention-aware mask propagation. In Proceedings of the European Conference on Computer Vision (ECCV). 90–105.
  • Li et al. (2013) Xi Li, Weiming Hu, Chunhua Shen, Zhongfei Zhang, Anthony Dick, and Anton Van Den Hengel. 2013. A survey of appearance models in visual object tracking. ACM transactions on Intelligent Systems and Technology (TIST) 4, 4 (2013), 58.
  • Li et al. (2005) Yin Li, Jian Sun, and Heung-Yeung Shum. 2005. Video object cut and paste. In ACM Transactions on Graphics (ToG), Vol. 24. ACM, 595–600.
  • Lim et al. (2013) T. Lim, S. Hong, B. Han, and J. H. Han. 2013. Joint Segmentation and Pose Tracking of Human in Natural Videos. In 2013 IEEE International Conference on Computer Vision. 833–840.
  • Lin et al. (2017) Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. 2017. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1925–1934.
  • Lin et al. (2018) Guosheng Lin, Chunhua Shen, Anton Van Den Hengel, and Ian Reid. 2018. Exploring context with deep structured models for semantic segmentation. IEEE transactions on pattern analysis and machine intelligence 40, 6 (2018), 1352–1366.
  • Liu et al. (2014) Xiao Liu, Dacheng Tao, Mingli Song, Ying Ruan, Chun Chen, and Jiajun Bu. 2014. Weakly supervised multiclass video segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 57–64.
  • Liu et al. (2016) Yuqiang Liu, Kunfeng Wang, and Dayong Shen. 2016. Visual tracking based on dynamic coupled conditional random field model. IEEE Transactions on Intelligent Transportation Systems 17, 3 (2016), 822–833.
  • Long et al. (2015) Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3431–3440.
  • LUCAS (1981) Bruce D. Lucas and Takeo Kanade. 1981. An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI). 674–679.
  • Luiten et al. (2018) Jonathon Luiten, Paul Voigtlaender, and Bastian Leibe. 2018. PReMVOS: Proposal-generation, refinement and merging for the DAVIS challenge on video object segmentation 2018. In The 2018 DAVIS Challenge on Video Object Segmentation - CVPR Workshops.
  • Maninis et al. (2018) K. Maninis, S. Caelles, Y. Chen, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. 2018. Video Object Segmentation Without Temporal Information. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018), 1–1.
  • Maninis et al. (2018) Kevis-Kokitsi Maninis, Sergi Caelles, Jordi Pont-Tuset, and Luc Van Gool. 2018. Deep extreme cut: From extreme points to object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 616–625.
  • Märki et al. (2016) Nicolas Märki, Federico Perazzi, Oliver Wang, and Alexander Sorkine-Hornung. 2016. Bilateral space video segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 743–751.
  • Milan et al. (2016) Anton Milan, Laura Leal-Taixé, Ian Reid, Stefan Roth, and Konrad Schindler. 2016. MOT16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831 (2016).
  • Milan et al. (2015) Anton Milan, Laura Leal-Taixé, Konrad Schindler, and Ian Reid. 2015. Joint tracking and segmentation of multiple targets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5397–5406.
  • Newswanger and Xu (2017) Amos Newswanger and Chenliang Xu. 2017. One-shot video object segmentation with iterative online fine-tuning. In CVPR Workshop, Vol. 1.
  • Ochs and Brox (2012) Peter Ochs and Thomas Brox. 2012. Higher order motion models and spectral clustering. In 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 614–621.
  • Oneata et al. (2014) Dan Oneata, Jérôme Revaud, Jakob Verbeek, and Cordelia Schmid. 2014. Spatio-temporal object detection proposals. In European conference on computer vision. Springer, 737–752.
  • Papazoglou and Ferrari (2014) Anestis Papazoglou and Vittorio Ferrari. 2014. Fast Object Segmentation in Unconstrained Video. In IEEE International Conference on Computer Vision.
  • Papoutsakis and Argyros (2013) Konstantinos Papoutsakis and Antonis Argyros. 2013. Integrating tracking with fine object segmentation. Image and Vision Computing 31, 10 (2013), 771–785.
  • Paragios and Deriche (2000) Nikos Paragios and Rachid Deriche. 2000. Geodesic active contours and level sets for the detection and tracking of moving objects. IEEE Transactions on pattern analysis and machine intelligence 22, 3 (2000), 266–280.
  • Patras et al. (2003) Ioannis Patras, Emile A Hendriks, and Reginald L Lagendijk. 2003. Semi-automatic object-based video segmentation with labeling of color segments. Signal Processing: Image Communication 18, 1 (2003), 51–65.
  • Perazzi et al. (2017) Federico Perazzi, Anna Khoreva, Rodrigo Benenson, Bernt Schiele, and Alexander Sorkine-Hornung. 2017. Learning video object segmentation from static images. In Computer Vision and Pattern Recognition, Vol. 2.
  • Perazzi et al. (2016) Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. 2016. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 724–732.
  • Perazzi et al. (2015) Federico Perazzi, Oliver Wang, Markus Gross, and Alexander Sorkine-Hornung. 2015. Fully connected object proposals for video segmentation. In Proceedings of the IEEE International Conference on Computer Vision. 3227–3234.
  • Peter et al. (2014) Peter Ochs, Jitendra Malik, and Thomas Brox. 2014. Segmentation of Moving Objects by Long Term Video Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 6 (2014), 1187–1200.
  • Pont-Tuset et al. (2015) Jordi Pont-Tuset, Miquel A Farré, and Aljoscha Smolic. 2015. Semi-automatic video object segmentation by advanced manipulation of segmentation hierarchies. In 2015 13th International Workshop on Content-Based Multimedia Indexing (CBMI). IEEE, 1–6.
  • Pont-Tuset et al. (2017) Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. 2017. The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675 (2017).
  • Prest et al. (2012) Alessandro Prest, Christian Leistner, Javier Civera, Cordelia Schmid, and Vittorio Ferrari. 2012. Learning object class detectors from weakly annotated video. In 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 3282–3289.
  • Price et al. (2009) Brian L Price, Bryan S Morse, and Scott Cohen. 2009. Livecut: Learning-based interactive video segmentation by evaluation of multiple propagated cues. In 2009 IEEE 12th International Conference on Computer Vision. IEEE, 779–786.
  • Rahtu et al. (2010) Esa Rahtu, Juho Kannala, Mikko Salo, and Janne Heikkilä. 2010. Segmenting salient objects from images and videos. In European conference on computer vision. Springer, 366–379.
  • Ren and Malik (2003) Xiaofeng Ren and Jitendra Malik. 2003. Learning a classification model for segmentation. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, 10.
  • Ren and Malik (2007) X. Ren and J. Malik. 2007. Tracking as Repeated Figure/Ground Segmentation. In 2007 IEEE Conference on Computer Vision and Pattern Recognition. 1–8.
  • Ren et al. (2003) Ying Ren, Chin-Seng Chua, and Yeong-Khing Ho. 2003. Statistical background modeling for non-stationary camera. Pattern Recognition Letters 24, 1-3 (2003), 183–196.
  • Reso et al. (2014) Matthias Reso, Björn Scheuermann, Jörn Jachalsky, Bodo Rosenhahn, and Jörn Ostermann. 2014. Interactive segmentation of high-resolution video content using temporally coherent superpixels and graph cut. In International Symposium on Visual Computing. Springer, 281–292.
  • Rochan et al. (2018) Mrigank Rochan, Linwei Ye, and Yang Wang. 2018. Video summarization using fully convolutional sequence networks. In Proceedings of the European Conference on Computer Vision (ECCV). 347–363.
  • Ross et al. (2008) David A Ross, Jongwoo Lim, Ruei-Sung Lin, and Ming-Hsuan Yang. 2008. Incremental learning for robust visual tracking. International journal of computer vision 77, 1-3 (2008), 125–141.
  • Rother et al. (2004) Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. 2004. GrabCut: Interactive foreground extraction using iterated graph cuts. In ACM Transactions on Graphics (TOG), Vol. 23. ACM, 309–314.
  • Schiegg et al. (2014) Martin Schiegg, Philipp Hanslovsky, Carsten Haubold, Ullrich Koethe, Lars Hufnagel, and Fred A Hamprecht. 2014. Graphical model for joint segmentation and tracking of multiple dividing cells. Bioinformatics 31, 6 (2014), 948–956.
  • Schubert et al. (2015) Falk Schubert, Daniele Casaburo, Dirk Dickmanns, and Vasileios Belagiannis. 2015. Revisiting robust visual tracking using pixel-wise posteriors. In International Conference on Computer Vision Systems. Springer, 275–288.
  • Shankar Nagaraja et al. (2015) Naveen Shankar Nagaraja, Frank R Schmidt, and Thomas Brox. 2015. Video segmentation with just a few strokes. In Proceedings of the IEEE International Conference on Computer Vision. 3235–3243.
  • Sharir et al. (2017) Gilad Sharir, Eddie Smolyansky, and Itamar Friedman. 2017. Video object segmentation using tracked object proposals. arXiv preprint arXiv:1707.06545 (2017).
  • Shi and Malik (1998) Jianbo Shi and J. Malik. 1998. Motion segmentation and tracking using normalized cuts. In Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271). 1154–1160.
  • Shi and Malik (2000) Jianbo Shi and Jitendra Malik. 2000. Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 8 (2000), 888–905.
  • Shin Yoon et al. (2017) Jae Shin Yoon, Francois Rameau, Junsik Kim, Seokju Lee, Seunghak Shin, and In So Kweon. 2017. Pixel-level matching for video object segmentation using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2167–2176.
  • Son et al. (2015) Jeany Son, Ilchae Jung, Kayoung Park, and Bohyung Han. 2015. Tracking-by-segmentation with online gradient boosting decision tree. In Proceedings of the IEEE International Conference on Computer Vision. 3056–3064.
  • Song et al. (2018) Hongmei Song, Wenguan Wang, Sanyuan Zhao, Jianbing Shen, and Kin-Man Lam. 2018. Pyramid dilated deeper ConvLSTM for video salient object detection. In Proceedings of the European Conference on Computer Vision (ECCV). 715–731.
  • Song and Fan (2007) Xiaomu Song and Guoliang Fan. 2007. Selecting salient frames for spatiotemporal video modeling and segmentation. IEEE Transactions on Image Processing 16, 12 (2007), 3035–3046.
  • Stauffer and Grimson (2000) Chris Stauffer and W. Eric L. Grimson. 2000. Learning patterns of activity using real-time tracking. IEEE Transactions on pattern analysis and machine intelligence 22, 8 (2000), 747–757.
  • Stein et al. (2007) Andrew Stein, Derek Hoiem, and Martial Hebert. 2007. Learning to Find Object Boundaries Using Motion Cues. In 2007 IEEE 11th International Conference on Computer Vision. IEEE, 1–8.
  • Sundaram et al. (2010) Narayanan Sundaram, Thomas Brox, and Kurt Keutzer. 2010. Dense point trajectories by GPU-accelerated large displacement optical flow. In European conference on computer vision. Springer, 438–451.
  • Sundberg et al. (2011) Patrik Sundberg, Thomas Brox, Michael Maire, Pablo Arbeláez, and Jitendra Malik. 2011. Occlusion boundary detection and figure/ground assignment from optical flow. In CVPR 2011. IEEE, 2233–2240.
  • Tang et al. (2013) Kevin Tang, Rahul Sukthankar, Jay Yagnik, and Li Fei-Fei. 2013. Discriminative segment annotation in weakly labeled video. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2483–2490.
  • Tao et al. (2008) Tao Zhao, Ram Nevatia, and Bo Wu. 2008. Segmentation and tracking of multiple humans in crowded environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 7 (2008), 1198–1211.
  • Taylor et al. (2015) Brian Taylor, Vasiliy Karasev, and Stefano Soatto. 2015. Causal video object segmentation from persistence of occlusions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4268–4276.
  • Tjaden et al. (2016) Henning Tjaden, Ulrich Schwanecke, and Elmar Schömer. 2016. Real-time monocular segmentation and pose tracking of multiple objects. In European conference on computer vision. Springer, 423–438.
  • Tokmakov et al. (2017a) Pavel Tokmakov, Karteek Alahari, and Cordelia Schmid. 2017a. Learning motion patterns in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3386–3394.
  • Tokmakov et al. (2017b) Pavel Tokmakov, Karteek Alahari, and Cordelia Schmid. 2017b. Learning video object segmentation with visual memory. arXiv preprint arXiv:1704.05737 (2017).
  • Torr and Zisserman (1998) Philip HS Torr and Andrew Zisserman. 1998. Concerning Bayesian motion segmentation, model averaging, matching and the trifocal tensor. In European Conference on Computer Vision. Springer, 511–527.
  • Tsai et al. (2012) David Tsai, Matthew Flagg, Atsushi Nakazawa, and James M. Rehg. 2012. Motion Coherent Tracking Using Multi-label MRF Optimization. International Journal of Computer Vision 100, 2 (2012), 190–202.
  • Tsai et al. (2016a) Yi-Hsuan Tsai, Ming-Hsuan Yang, and Michael J Black. 2016a. Video segmentation via object flow. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3899–3908.
  • Tsai et al. (2016b) Yi-Hsuan Tsai, Guangyu Zhong, and Ming-Hsuan Yang. 2016b. Semantic co-segmentation in videos. In European Conference on Computer Vision. Springer, 760–775.
  • Valipour et al. (2017) Sepehr Valipour, Mennatullah Siam, Martin Jagersand, and Nilanjan Ray. 2017. Recurrent fully convolutional networks for video segmentation. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on. IEEE, 29–36.
  • Vijayanarasimhan et al. (2017) Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, and Katerina Fragkiadaki. 2017. SfM-Net: Learning of structure and motion from video. arXiv preprint arXiv:1704.07804 (2017).
  • Voigtlaender and Leibe (2017) Paul Voigtlaender and Bastian Leibe. 2017. Online adaptation of convolutional neural networks for video object segmentation. In Proc. BMVC.
  • Vojir and Matas (2017) Tomas Vojir and Jiri Matas. 2017. Pixel-wise object segmentations for the VOT 2016 dataset. Research Report CTU-CMP-2017–01, Center for Machine Perception, Czech Technical University, Prague, Czech Republic (2017).
  • Wang (1998) Demin Wang. 1998. Unsupervised video segmentation based on watersheds and temporal tracking. IEEE Transactions on Circuits and Systems for video Technology 8, 5 (1998), 539–546.
  • Wang et al. (2016) Huiling Wang, Tapani Raiko, Lasse Lensu, Tinghuai Wang, and Juha Karhunen. 2016. Semi-supervised domain adaptation for weakly labeled semantic video object segmentation. In Asian conference on computer vision. Springer, 163–179.
  • Wang and Wang (2016) Huiling Wang and Tinghuai Wang. 2016. Primary object discovery and segmentation in videos via graph-based transductive inference. Computer Vision and Image Understanding 143 (2016), 159–172.
  • Wang et al. (2005) Jue Wang, Pravin Bhat, R Alex Colburn, Maneesh Agrawala, and Michael F Cohen. 2005. Interactive video cutout. In ACM Transactions on Graphics (ToG), Vol. 24. ACM, 585–594.
  • Wang et al. (2004) Jue Wang, Yingqing Xu, Heung-Yeung Shum, and Michael F Cohen. 2004. Video tooning. In ACM Transactions on Graphics (ToG), Vol. 23. ACM, 574–583.
  • Wang et al. (2018) Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, and Philip HS Torr. 2018. Fast Online Object Tracking and Segmentation: A Unifying Approach. arXiv preprint arXiv:1812.05050 (2018).
  • Wang et al. (2011) Shu Wang, Huchuan Lu, Fan Yang, and Ming-Hsuan Yang. 2011. Superpixel tracking. In Proceedings of the 2011 International Conference on Computer Vision. IEEE Computer Society, 1323–1330.
  • Wang et al. (2014) Tinghuai Wang, Bo Han, and John Collomosse. 2014. TouchCut: Fast image and video segmentation using single-touch interaction. Computer Vision and Image Understanding 120 (2014), 14–30.
  • Wang and Nevatia (2013) Weijun Wang and Ramakant Nevatia. 2013. Robust Object Tracking Using Constellation Model with Superpixel. In Proceedings of the 11th Asian Conference on Computer Vision - Volume Part III (ACCV’12). Springer-Verlag, Berlin, Heidelberg, 191–204.
  • Wang et al. (2015) Wenguan Wang, Jianbing Shen, and Fatih Porikli. 2015. Saliency-aware geodesic video object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3395–3402.
  • Wang et al. (2017a) Wenguan Wang, Jianbing Shen, and Fatih Porikli. 2017a. Selective video object cutout. IEEE Transactions on Image Processing 26, 12 (2017), 5645–5655.
  • Wang et al. (2017b) Wenguan Wang, Jianbing Shen, Jianwen Xie, and Fatih Porikli. 2017b. Super-Trajectory for Video Segmentation. In Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 1680–1688.
  • Wang et al. (2019) Wenguan Wang, Hongmei Song, Shuyang Zhao, Jianbing Shen, Sanyuan Zhao, Steven Chu Hong Hoi, and Haibin Ling. 2019. Learning Unsupervised Video Object Segmentation through Visual Attention. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Wang, Jin, Liu, Liu, Zhang, Chen, Zhang, Guo, and Shao (Wang et al.) Zhenhua Wang, Jiali Jin, Tong Liu, Sheng Liu, Jianhua Zhang, Shengyong Chen, Zhen Zhang, Dongyan Guo, and Zhanpeng Shao. 2019. Understanding human activities in videos: A joint action and interaction learning approach. Neurocomputing 321 (2019), 216–226.
  • Wang et al. (2017c) Zhenhua Wang, Liu Sheng, Jianhua Zhang, Shengyong Chen, and Guan Qiu. 2017c. A Spatio-temporal CRF for Human Interaction Understanding. IEEE Transactions on Circuits and Systems for Video Technology 27, 8 (2017), 1647–1660.
  • Wen et al. (2015) Longyin Wen, Dawei Du, Zhen Lei, Stan Z Li, and Ming-Hsuan Yang. 2015. JOTS: Joint online tracking and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2226–2234.
  • Wren et al. (1997) Christopher Richard Wren, Ali Azarbayejani, Trevor Darrell, and Alex Paul Pentland. 1997. Pfinder: Real-time tracking of the human body. IEEE Transactions on pattern analysis and machine intelligence 19, 7 (1997), 780–785.
  • Wu and Nevatia (2009) Bo Wu and Ram Nevatia. 2009. Detection and segmentation of multiple, partially occluded objects by grouping, merging, assigning part detection responses. International journal of computer vision 82, 2 (2009), 185–204.
  • Wu et al. (2013) Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. 2013. Online object tracking: A benchmark. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2411–2418.
  • Wu et al. (2015) Yi Wu, Jongwoo Lim, and Ming Hsuan Yang. 2015. Object Tracking Benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 9 (2015), 1834–1848.
  • Wu et al. (2015) Zhengyang Wu, Fuxin Li, Rahul Sukthankar, and James M Rehg. 2015. Robust video segment proposals with painless occlusion handling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4194–4203.
  • Wug Oh et al. (2018) Seoung Wug Oh, Joon-Young Lee, Kalyan Sunkavalli, and Seon Joo Kim. 2018. Fast video object segmentation by reference-guided mask propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7376–7385.
  • Xiao and Jae Lee (2016) Fanyi Xiao and Yong Jae Lee. 2016. Track and segment: An iterative unsupervised approach for video object proposals. In Proceedings of the IEEE conference on computer vision and pattern recognition. 933–942.
  • Xiao et al. (2018) Huaxin Xiao, Jiashi Feng, Guosheng Lin, Yu Liu, and Maojun Zhang. 2018. MoNet: Deep Motion Exploitation for Video Object Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1140–1148.
  • Xingjian et al. (2015) Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems. 802–810.
  • Xu et al. (2016) Ning Xu, Brian Price, Scott Cohen, Jimei Yang, and Thomas S Huang. 2016. Deep interactive object selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 373–381.
  • Xu et al. (2018) Ning Xu, Linjie Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian Price, Scott Cohen, and Thomas Huang. 2018. YouTube-VOS: Sequence-to-Sequence Video Object Segmentation. In The European Conference on Computer Vision (ECCV).
  • Yang et al. (2016) Jiong Yang, Brian Price, Xiaohui Shen, Zhe Lin, and Junsong Yuan. 2016. Fast appearance modeling for automatic primary video object segmentation. IEEE Transactions on Image Processing 25, 2 (2016), 503–515.
  • Yang et al. (2018) Linjie Yang, Yanran Wang, Xuehan Xiong, Jianchao Yang, and Aggelos K Katsaggelos. 2018. Efficient video object segmentation via network modulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6499–6507.
  • Yang et al. (2017) Rui Yang, Bingbing Ni, Chao Ma, Yi Xu, and Xiaokang Yang. 2017. Video segmentation via multiple granularity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3010–3019.
  • Yao et al. (2018) Rui Yao, Guosheng Lin, Chunhua Shen, Yanning Zhang, and Qinfeng Shi. 2018. Semantics-Aware Visual Object Tracking. IEEE Transactions on Circuits and Systems for Video Technology (2018), 1–1.
  • Yao et al. (2012) Rui Yao, Qinfeng Shi, Chunhua Shen, Yanning Zhang, and Anton van den Hengel. 2012. Robust tracking with weighted online structured learning. In European Conference on Computer Vision. Springer, 158–172.
  • Yao et al. (2017a) Rui Yao, Qinfeng Shi, Chunhua Shen, Yanning Zhang, and Anton van den Hengel. 2017a. Part-based robust tracking using online latent structured learning. IEEE Transactions on Circuits and Systems for Video Technology 27, 6 (2017), 1235–1248.
  • Yao et al. (2017b) Rui Yao, Shixiong Xia, Zhen Zhang, and Yanning Zhang. 2017b. Real-time correlation filter tracking by efficient dense belief propagation with structure preserving. IEEE Transactions on Multimedia 19, 4 (2017), 772–784.
  • Yeo et al. (2017) Donghun Yeo, Jeany Son, Bohyung Han, and Joon Hee Han. 2017. Superpixel-based tracking-by-segmentation using Markov chains. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 511–520.
  • Yilmaz et al. (2006) Alper Yilmaz, Omar Javed, and Mubarak Shah. 2006. Object tracking: A survey. ACM Computing Surveys (CSUR) 38, 4 (2006), 13.
  • Zhang et al. (2013) Dong Zhang, Omar Javed, and Mubarak Shah. 2013. Video object segmentation through spatially accurate and temporally dense extraction of primary object regions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 628–635.
  • Zhang et al. (2015) Yu Zhang, Xiaowu Chen, Jia Li, Chen Wang, and Changqun Xia. 2015. Semantic object segmentation via detection in weakly labeled video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3641–3649.
  • Zhang et al. (2018) Zongpu Zhang, Yang Hua, Tao Song, Zhengui Xue, Ruhui Ma, Neil Robertson, and Haibing Guan. 2018. Tracking-assisted Weakly Supervised Online Visual Object Segmentation in Unconstrained Videos. In 2018 ACM Multimedia Conference. ACM, 941–949.
  • Zhou and Tao (2013) Tianyi Zhou and Dacheng Tao. 2013. Shifted subspaces tracking on sparse outlier for motion segmentation. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence. AAAI Press, 1946–1952.