Unsupervised learning-based long-term superpixel tracking

02/25/2019 ∙ by Pierre-Henri Conze, et al. ∙ 0

Finding correspondences between structural entities decomposing images is of high interest for computer vision applications. In particular, we analyze how to accurately track superpixels - visual primitives generated by aggregating adjacent pixels sharing similar characteristics - over extended time periods relying on unsupervised learning and temporal integration. A two-step video processing pipeline dedicated to long-term superpixel tracking is proposed. First, unsupervised learning-based superpixel matching provides correspondences between consecutive and distant frames using new context-rich features extended from greyscale to multi-channel and forward-backward consistency constraints. Resulting elementary matches are then combined along multi-step paths running through the whole sequence with various inter-frame distances. This produces a large set of candidate long-term superpixel pairings upon which majority voting is performed. Video object tracking experiments demonstrate the accuracy of our elementary estimator against state-of-the-art methods and prove the ability of multi-step integration to provide accurate long-term superpixel matches compared to usual direct and sequential integration.

1 Introduction

Finding correspondences between multiple images is a fundamental problem in computer vision tasks including scene segmentation lezama2011track , 3D reconstruction seitz2006comparison , visual tracking yang2014robust , trajectory analysis wang2013dense or video editing like 2D-to-3D video conversion cao2011semi or graphic elements propagation conze2016multi . Established via local or global search, correspondences are usually either dense, covering the whole pixel grid as with optical flow farneback2003two , or sparse, through key feature points shi1994good . Alternatively, finding associations between structural entities decomposing images by grouping pixels enables semi-dense coverage of the whole image while drastically reducing the cost of correspondence. Matching a limited number of structural elements can also help with challenging issues such as large displacements, occlusions, and appearance or illumination changes. In this context, patch-based approximate nearest neighbor (ANN) search methods such as PatchMatch (PM) barnes2009patchmatch and its extension to multi-resolution barnes2010generalized are mainly used to find correspondences between patches. However, a regular decomposition of the image grid respects neither object nor motion contours and does not offer sufficiently consistent support regions for image processing methods.

Contrary to image patches, superpixels - visual primitives generated by aggregating adjacent pixels sharing similar characteristics into semantic areas achanta2012slic - offer more reliable support regions while preserving image geometry and object contours. Moreover, the hypothesis that motion discontinuities are a subset of photometric contours is usually used to preserve boundaries between objects exhibiting different motion. These findings have motivated recent optical flow algorithms using image data and smoothness terms adapted to the superpixel level chang2013superpixel ; donne2015fast . Conversely, we claim that image matching relying on superpixels could benefit from these advantages to offer more consistent associations than pixel or patch matches while providing a better management of motion discontinuities. In particular, this paper focuses on how to accurately find correspondences between superpixels over extended time periods.

In addition to dense pixel-wise matching starting from superpixel-level pairings dong2016hsp2p , superpixel correspondences have already been employed for visual tracking through superpixel-based discriminative appearance models yang2014robust or object-background confidence maps fan2016visual . However, these works perform superpixel matching based on comparisons of intrinsic superpixel features only, without taking full advantage of neighborhood information. Conversely, giraud2017superpatchmatch exploits a structure of superpixel neighborhood called SuperPatch within a superpixel PM framework. Superpixel neighborhood information greatly improves correspondences since it alleviates some matching failures due to irregular decompositions of the same image content, which are not directly comparable between images. However, even with incorporated neighborhood information, directly computing a matching distance between irregular structures can be tedious, especially when images are divided into a large collection of superpixels.

A prior pixel-to-superpixel mapping can drive the matching at the superpixel level to provide more precise correspondences. In this direction, kanavati2017supervoxel uses random forests (RF) breiman2001random to establish supervoxel correspondences between two D images in an unsupervised fashion. RF is trained on one image by using supervoxel indexes as voxel-wise class labels and robust context-rich features to describe the extended neighborhood. Applying RF on the other image yields a voxel labelling which is then regularised using majority voting within supervoxel boundaries. This approach was first validated on medical image registration; we explore its use for accurate superpixel matching across long video sequences.

Despite recent advances related to optical flow integration crivelli2015robust ; conze2016multi , the temporal tracking of superpixels over long-term video sequences has received little attention in the literature. wang2017constrained uses a constrained graph where nodes denote superpixels and edges encode spatial, temporal, and appearance constraints. However, temporal constraints only model short-term smoothness between consecutive frames. The same limitation arises in yang2014robust , whose tracker operates sequentially and is therefore prone to motion drift. Establishing long-term superpixel correspondences requires performing superpixel matching between consecutive and distant frames, and therefore handling small and large displacements simultaneously. To address this challenge, we exploit the concept of multi-step integration introduced for long-term motion estimation using optical flow conze2016multi . The idea is to generate a large set of elementary displacement estimations performed between consecutive frames or with larger inter-frame distances. Once combined, elementary multi-step estimations result in a set of long-term correspondences large enough to be fused through statistical processing. We are not aware of any study that has applied this concept to long-term superpixel tracking, although it could bring many benefits in our context. Indeed, it can alleviate matching errors during superpixel trajectory estimation, since each new step gives a chance to match with a correct location again, unlike sequential processing whose tracks may be lost. Moreover, statistical processing over a large, representative set of long-term superpixel candidates can handle the uncertainty inherent in matching tasks.

It should be noted that deep learning has become popular for object tracking, relying on convolutional networks to learn discriminative features that encode the target appearance ma2015hierarchical ; li2016deeptrack or on recurrent networks trained with reinforcement learning to predict object locations across videos zhang2017deep . Despite their high performance, these methods only provide very sparse bounding box tracking and do not describe how the boundaries of an irregularly shaped object evolve in time, as expected from long-term superpixel tracking.

In summary, two main contributions are proposed towards accurate long-term superpixel tracking. First, unsupervised learning-based superpixel matching is generalized and adapted from medical image processing kanavati2017supervoxel to computer vision in order to find associations along sequences between consecutive and distant images decomposed into SLIC superpixels achanta2012slic (Sect.2). The approach is carried out using classifiers such as k-nearest neighbors (NN) or RF breiman2001random , incorporates new forward-backward consistency constraints and fully exploits dedicated context-rich features that we extend from greyscale glocker2014robust ; kanavati2017supervoxel ; conze2017semi to multi-channel in order to incorporate neighborhood information on RGB frames. Second, based on this learning-based matching approach used as an elementary displacement estimator, we propose a multi-step integration strategy for long-term superpixel tracking. It combines multiple elementary superpixel matches obtained for some intermediate images following randomly selected multi-step paths (Sect.3). This produces a large set of candidate long-term superpixel pairings upon which a majority voting selection is performed. Based on object tracking experiments, Sect.4 assesses the accuracy of the proposed elementary estimator against state-of-the-art methods and demonstrates the ability of multi-step integration to provide accurate long-term superpixel tracking compared to standard direct and sequential integration. We end with conclusions and perspectives in Sect.5.

2 Superpixel matching with unsupervised learning

Let be a video sequence of RGB images. Unsupervised learning-based superpixel matching is addressed between two consecutive or distant images and of . Each image associates an RGB color vector to each pixel located at with (footnote 1: stands for first and second).

2.1 Problem formulation

Let and be respectively the set of and connected superpixels partitioning and . The superpixel decomposition is performed using the Simple Linear Iterative Clustering (SLIC) algorithm achanta2012slic which aggregates neighboring pixels based on spatial and intensity proximity criteria. Forward superpixel matching from to () consists in automatically learning a matching function that maps each superpixel of to a given superpixel of kanavati2017supervoxel such that:

(1)

Backward matching from to can be similarly considered by estimating mapping each superpixel to a given superpixel . In what follows, learning-based superpixel matching is described in the forward direction, from to .

[Figure 1 panel labels, first row: source, SLIC achanta2012slic , pixel-wise prediction, majority voting. Second row: source, SLIC achanta2012slic , pixel-wise training, NN or RF breiman2001random .]
Figure 1: Superpixel matching between and using unsupervised learning applied with SLIC achanta2012slic superpixel indexes as label entities followed by majority voting following kanavati2017supervoxel . This example is produced for image pair of the lapa sequence sznitman2012data using RF breiman2001random .

2.2 Overall strategy

Instead of relying on nearest neighbor search at the superpixel level through superpixel feature comparisons dong2016hsp2p ; fan2016visual , which is prone to ambiguity due to possible severe overlaps in feature space, we explore the use of pixel-wise k-nearest neighbors (NN) or random forests (RF) breiman2001random to establish correspondences between superpixels over-segmenting and , as formulated in Eq.1. Such classifiers are usually employed with success for multi-class classification or regression; we show that they can also be used to obtain accurate superpixel correspondences. The overall learning-based superpixel matching strategy, illustrated in Fig.1, is carried out in an unsupervised manner. To give a powerful representation of global context, RF or NN is used with new pixel-wise context-rich features extended from greyscale glocker2014robust ; kanavati2017supervoxel ; conze2017semi to multi-channel RGB and described in Sect.2.3.

The key idea is to perform training on the target image () by using superpixel indexes as pixel-wise class labels and testing on the source image () to get a pixel-to-superpixel mapping, as introduced in kanavati2017supervoxel . In particular, the classifier aims at assigning a superpixel to each pixel . A training set is thus built by considering all pixels of with their associated superpixel index, i.e. the index of the superpixel they belong to. Once trained, the classifier is applied to to predict for each the index of a superpixel of . This pixel-to-superpixel mapping is addressed in detail in Sect.2.4.

Mapping results are further regularized following superpixel boundaries to reach robust superpixel matches. Within each superpixel of , the most represented superpixel index among all pixel-wise predictions indicates the best superpixel match . This final superpixel matching step is detailed in Sect.2.5 with new forward-backward (FW-BW) consistency constraints.

Figure 2: Pixel-wise context-rich multi-channel features provide a description of the extended spatial context (see Sect.2.3 for further details).

2.3 Context-rich multi-channel features

Let be the average intensity on a local box of size centered on located at for channel . Pixel-wise context-rich features assigned to pixels are extended from greyscale glocker2014robust ; kanavati2017supervoxel ; conze2017semi to multi-channel as follows:

(2)

where displacements are randomly defined starting from in a disc of maximal radius (Fig.2). is a binary parameter which selects either the intensity difference between two boxes randomly located in the extended neighborhood () or the value obtained from one single box only (). Color intensities around are included in the feature vector by forcing for all possible pre-defined box sizes .

By randomly generating many different box sizes and offsets , we obtain a large set of features describing the extended spatial context for all color channels . Parameters are randomly generated once and remain the same whatever the image or under consideration.
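The feature construction can be illustrated with a minimal sketch of one plausible reading of Eq.2. The helper names (`box_mean`, `random_offset`, `make_feature_params`, `pixel_features`), the clamping of box centers at the image border, the odd box sizes and the equal chance of drawing a difference or a single-box feature are all assumptions made for illustration. An image is represented as a list of channels, each a 2D list of intensities.

```python
import random

def box_mean(channel, cy, cx, size):
    """Mean intensity of a size x size box centered on (cy, cx), clipped to the image."""
    h, w = len(channel), len(channel[0])
    cy = min(max(cy, 0), h - 1)  # clamp the box center to the image (assumption)
    cx = min(max(cx, 0), w - 1)
    half = size // 2             # odd box sizes are assumed
    vals = [channel[y][x]
            for y in range(max(0, cy - half), min(h, cy + half + 1))
            for x in range(max(0, cx - half), min(w, cx + half + 1))]
    return sum(vals) / len(vals)

def random_offset(rng, max_radius):
    """Draw an integer offset inside a disc of radius max_radius (rejection sampling)."""
    while True:
        dy = rng.randint(-max_radius, max_radius)
        dx = rng.randint(-max_radius, max_radius)
        if dy * dy + dx * dx <= max_radius * max_radius:
            return dy, dx

def make_feature_params(n_features, max_radius, box_sizes, n_channels, seed=0):
    """Generate feature parameters once; they are reused for every image."""
    rng = random.Random(seed)
    # Plain color around the pixel: zero offset, single box, every channel and box size.
    params = [(c, s, (0, 0), s, (0, 0), False)
              for c in range(n_channels) for s in box_sizes]
    while len(params) < n_features:
        params.append((rng.randrange(n_channels),
                       rng.choice(box_sizes), random_offset(rng, max_radius),
                       rng.choice(box_sizes), random_offset(rng, max_radius),
                       rng.random() < 0.5))  # True: difference of two boxes
    return params

def pixel_features(image, y, x, params):
    """Context-rich feature vector of pixel (y, x)."""
    feats = []
    for c, s1, (dy1, dx1), s2, (dy2, dx2), use_diff in params:
        value = box_mean(image[c], y + dy1, x + dx1, s1)
        if use_diff:
            value -= box_mean(image[c], y + dy2, x + dx2, s2)
        feats.append(value)
    return feats
```

Because the parameters are drawn once and shared, the feature vectors of both images of a pair live in the same feature space, which is what allows a classifier trained on one image to be applied to the other.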

2.4 Pixel-to-superpixel mapping

Pixel-to-superpixel mapping relies on machine learning to compute pixel-to-superpixel mapping probabilities denoted as for each pixel with respect to all superpixels with .

The procedure with random forests (RF) breiman2001random is conducted as follows. The forest is formed by uncorrelated trees made of both internal nodes, which split data according to binary tests, and leaf nodes, which together form a final data partition. At each internal node, the split sends pixels to left and right child nodes during training () and prediction (). The associated binary test focuses on a random subset of context-rich multi-channel visual features assigned to (Sect.2.3) and divides the input pixel set based on the following split rule:

(3)

where is compared to a threshold and .

Internal node parameters () are optimized via information gain maximization with respect to the training dataset combining pixels belonging to with their associated superpixel index with . After optimization, each leaf node of the tree receives a partition of and produces the class probability distribution for all superpixels .

To predict the corresponding superpixel index of a given pixel with associated visual features during testing, is injected into each optimized tree and finally reaches a leaf node per tree following the successive split rules (Eq.3). The pixel-to-superpixel mapping probability denoting the probability that is assigned to is obtained for each by:

(4)
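The prediction step can be sketched as follows, assuming that Eq.4 averages the leaf distributions over all trees, which is the usual random forest posterior. The tree representation (nested dictionaries with hypothetical keys `feature`, `threshold`, `left`, `right` and `leaf`) and the convention of going left when the feature value is below the threshold are assumptions made for illustration.

```python
def descend(tree, feats):
    """Follow the successive split rules until a leaf is reached."""
    node = tree
    while "leaf" not in node:
        go_left = feats[node["feature"]] < node["threshold"]  # assumed convention
        node = node["left"] if go_left else node["right"]
    return node["leaf"]  # dict: superpixel index -> probability

def forest_probabilities(trees, feats):
    """Average the leaf distributions reached in each tree."""
    leaves = [descend(tree, feats) for tree in trees]
    labels = set().union(*leaves)
    return {label: sum(leaf.get(label, 0.0) for leaf in leaves) / len(trees)
            for label in labels}
```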

Contrary to RF, the NN classifier simply stores instances of training data instead of building a general internal model. Pixel-to-superpixel mapping probabilities are computed by looking at the class (superpixel index) distribution among the nearest neighbors of each pixel in feature space. Nearest neighbors are estimated using Euclidean distance on context-rich features (Sect.2.3) assigned to pixels with .
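The nearest neighbor variant can be sketched in a few lines. The function name `knn_mapping_probabilities` and the brute-force search are illustrative choices; a practical implementation would rely on an indexed search structure.

```python
import math
from collections import Counter

def knn_mapping_probabilities(train_feats, train_labels, query_feat, k):
    """Class distribution among the k nearest training pixels (Euclidean distance)."""
    ranked = sorted((math.dist(feat, query_feat), label)
                    for feat, label in zip(train_feats, train_labels))
    votes = Counter(label for _, label in ranked[:k])
    total = sum(votes.values())
    return {label: count / total for label, count in votes.items()}
```

Here `train_feats` and `train_labels` stand for the feature vectors and superpixel indexes of the pixels of the training image, and `query_feat` for the feature vector of one pixel of the other image.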

2.5 Superpixel-to-superpixel matching

Once pixel-to-superpixel mapping probabilities are computed for each superpixel using context-rich features (Sect.2.3) involved in RF or NN (Sect.2.4), two steps are required to get final superpixel pairings. First, the final pixel-to-superpixel mapping for each of can be found using:

(5)

Second, majority voting among all pixels of a given superpixel can be performed by selecting the most represented superpixel index. The final matching is defined such that satisfies:

(6)

An alternative consists in averaging the pixel-to-superpixel mapping probabilities at the superpixel level instead of making a hard decision for each as performed in Eq.5:

(7)

We keep at this point all possible outcomes between candidate matches. Decisions are postponed to the superpixel level by finding the superpixel which maximizes :

(8)
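The difference between the two aggregation schemes can be sketched as follows. Function and argument names are hypothetical: `pixel_probs` holds one probability dictionary per pixel (candidate superpixel index to probability) and `pixel_to_spx` gives the index of the superpixel each pixel belongs to.

```python
from collections import Counter, defaultdict

def match_by_hard_vote(pixel_probs, pixel_to_spx):
    """Per-pixel argmax (Eq.5) followed by majority voting per superpixel (Eq.6)."""
    votes = defaultdict(Counter)
    for probs, spx in zip(pixel_probs, pixel_to_spx):
        votes[spx][max(probs, key=probs.get)] += 1
    return {spx: counter.most_common(1)[0][0] for spx, counter in votes.items()}

def match_by_soft_average(pixel_probs, pixel_to_spx):
    """Average probabilities per superpixel (Eq.7), then take the argmax (Eq.8)."""
    sums = defaultdict(lambda: defaultdict(float))
    for probs, spx in zip(pixel_probs, pixel_to_spx):
        for candidate, p in probs.items():
            sums[spx][candidate] += p
    # Dividing by the pixel count does not change the argmax, so sums suffice.
    return {spx: max(acc, key=acc.get) for spx, acc in sums.items()}
```

The two schemes can disagree: with two weakly confident pixels favoring one candidate and one highly confident pixel favoring another, hard voting follows the majority whereas soft averaging follows the confident pixel.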

Forward-backward consistency can be enforced in the context where two mapping functions are learned: (resp. ) that maps each superpixel () to a given () belonging to () in forward (backward). Thus, we extend Eq.8 with a new consistency constraint that guides the mutual matching between and :

(9)

The whole unsupervised learning-based strategy described above can be performed all along the video to match superpixels decomposing consecutive or distant images, both in forward and backward directions.
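As a simple illustration of forward-backward consistency, the sketch below keeps only the forward matches that are confirmed by the backward mapping. This binary check is an assumption made for illustration and is not necessarily the constraint of Eq.9, which acts during the selection itself; the function name `consistent_matches` is hypothetical.

```python
def consistent_matches(forward, backward):
    """Keep forward pairs (s -> t) for which the backward mapping sends t back to s."""
    return {s: t for s, t in forward.items() if backward.get(t) == s}
```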

3 Long-term superpixel tracking using multi-step integration

We now address long-term superpixel tracking for sequence composed of RGB frames using the learning-based superpixel matching strategy, described in Sect.2 for a given pair of consecutive or distant frames, as elementary estimator. Each frame is decomposed into a set of superpixels obtained using SLIC achanta2012slic with the same compactness parameter, i.e. same weighting between spatial and intensity proximity. One particular frame (usually the first one) of is defined as the reference frame and denoted . In this context, we aim at finding correspondences between superpixels over-segmenting and superpixels defined in frames with . Let and be respectively the set of and connected superpixels partitioning and .

Both superpixel trajectory estimation between the reference frame and all the images of the sequence and superpixel matching of each image to the reference frame can be considered, as in crivelli2015robust ; conze2016multi . From-the-reference estimation is useful for pushing information from superpixels of , whereas to-the-reference estimation allows information propagation over superpixels of each frame by pulling it from . The description below focuses on a given pair where is located far away from . Correspondences for the whole sequence are obtained by processing each pair independently.

Starting from learning-based superpixel matching (Sect.2) as elementary motion estimator, two temporal integration schemes can be considered at first glance to find mapping each superpixel to a given superpixel of such that . First, sequential integration can be employed passing through all intermediate frames, similarly to dense point tracking algorithms brox2011large . This step-by-step strategy can gradually handle appearance changes and large displacements but it may lead to large error accumulation resulting in a substantial drift over extended time periods. This drawback is exacerbated when using superpixels since superpixel decompositions across the sequence may result in an irregular partitioning of the image content. Second, to avoid error accumulation, direct matching roth2009discrete can be applied between superpixels of and , exactly as in Sect.2. However, this ignores that consists of inter-related images with redundant and smoothly evolving content, which makes large displacements and appearance changes challenging to handle.

Issues related to both sequential and direct superpixel tracking could be partially compensated by making the superpixel matching models and criteria more complex, but an uncertainty component remains. This argues in favor of a statistical processing (Sect.3.2) which takes into account a large set of candidate long-term superpixel matches (Sect.3.1) obtained using multi-step combination of elementary superpixel pairings previously established through unsupervised learning following Sect.2.

3.1 Multi-step integration of elementary matches

Multi-step integration aims at producing a large set of candidate long-term superpixel pairings between and using intermediate superpixel correspondences, so as to form a set of samples large enough for statistical selection (Sect.3.2) to be relevant. This heuristic was formerly introduced for optical flow integration conze2016multi ; we show that it can be extended towards accurate long-term superpixel tracking. As inputs, we take a set of superpixel match fields pre-estimated from each frame of including . These matches are computed between consecutive frames or with larger inter-frame distances conze2016multi using learning-based superpixel matching (Sect.2). Let be the set of possible steps at instant which means that has been previously learned using NN or RF. Thus, for each step , we have a superpixel match in for each superpixel of through the mapping function , and this for each frame.

Figure 3: Generation of step sequences from to with steps , , and by creating a tree structure: .

The starting point of multi-step integration consists in initially generating all possible step sequences (Fig.3), i.e. combinations of steps, to join from . Then, each generated step sequence defines a multi-step path (Fig.4) linking each superpixel of to a superpixel in passing through superpixels of some intermediate frames.

Let be the set of the possible step sequences between and . A step sequence is defined by a set of steps which once cascaded join from . is computed by building a tree structure (Fig.3) where each node corresponds to a field of superpixel matches assigned to a given frame for a given step value (node value). Going from the root node to leaf nodes of this tree structure gives the possible step sequences which are stacked into . For instance, the tree displayed in Fig.3 indicates the possible step sequences from to with steps , , and : .
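The tree traversal described above amounts to enumerating all ordered combinations of steps whose sum equals the inter-frame distance. A minimal recursive sketch is given below, assuming positive steps only and the same set of available steps at every frame; the function name `step_sequences` is hypothetical.

```python
def step_sequences(distance, steps):
    """All ordered step combinations that sum to the given inter-frame distance."""
    if distance == 0:
        return [[]]  # one way to cover a null distance: take no step
    sequences = []
    for step in steps:
        if step <= distance:
            for rest in step_sequences(distance - step, steps):
                sequences.append([step] + rest)
    return sequences
```

For a distance of 3 frames with steps 1, 2 and 3, this yields the four sequences [1, 1, 1], [1, 2], [2, 1] and [3]. The number of sequences grows quickly with the distance, which motivates the selection rules discussed later in this section.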

Once all the possible step sequences between and are generated, the corresponding multi-step paths are constructed (Fig.4). For step sequence composed of steps, superpixel matching between and is performed via:

(10)

with . Once all the steps have been run through, one gets , the superpixel in corresponding to obtained with step sequence . For for instance (Fig.4), we have:

(11)

A large set of candidate superpixels in is finally reached by considering all the step sequences of and this for each superpixel defined in . Thus, to each is associated a large set of candidate superpixels in defined as .
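The cascade of Eq.10 can be sketched as follows. The data layout is an assumption made for illustration: `matches[(frame, step)]` is a dictionary mapping each superpixel index of `frame` to its match in `frame + step`, as produced by the elementary estimator of Sect.2.

```python
def follow_path(matches, start_frame, start_spx, step_sequence):
    """Chain elementary matches along one step sequence."""
    frame, spx = start_frame, start_spx
    for step in step_sequence:
        spx = matches[(frame, step)][spx]
        frame += step
    return spx

def long_term_candidates(matches, start_frame, start_spx, sequences):
    """One candidate superpixel in the distant frame per step sequence."""
    return [follow_path(matches, start_frame, start_spx, seq) for seq in sequences]
```

Each step sequence contributes one candidate, so candidates reached through different intermediate frames may or may not agree; this disagreement is what the selection stage of Sect.3.2 exploits.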

Figure 4: Generation of multi-step paths corresponding to step sequences from to .

Multi-step integration has been presented above as an exhaustive candidate generation process. In practice, selecting only a subset of all possible step sequences, and therefore of the associated multi-step paths, is required to be able to build and keep in memory the outputs of the multi-step integration stage, whose number grows exponentially conze2016multi . Overall, multi-step paths can be generated for a distance of frames with steps , , and for instance. Only up to a few thousand can actually be considered to avoid computational and memory issues. The selection of step sequences among the possible step sequences is therefore necessary, with .

Two complexity reduction rules are taken from conze2016multi . We start by removing the longest step sequences in terms of number of constituting steps. A threshold on the number of steps is thus set and only step sequences for which are kept. Indeed, too many steps may induce a significant drift due to the multiple intermediate matches. Then, random selection of step sequences among the remaining ones is performed.
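Both reduction rules can be sketched together. The function name `select_step_sequences` and its arguments are hypothetical, and uniform sampling without replacement is assumed for the random selection.

```python
import random

def select_step_sequences(sequences, max_steps, n_selected, seed=0):
    """Drop sequences with too many steps, then randomly keep at most n_selected."""
    kept = [seq for seq in sequences if len(seq) <= max_steps]
    if len(kept) <= n_selected:
        return kept
    return random.Random(seed).sample(kept, n_selected)
```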

3.2 Long-term match selection

Once step selection is performed, we obtain for each superpixel of a set of candidate superpixels defined in with . The final candidate selection is performed via majority voting among , i.e. the final matching is defined such that satisfies:

(12)

Thanks to the random step sequence selection (Sect.3.1), the set of generated superpixel candidates is both large and uncorrelated enough to assume that the most represented superpixel provides an accurate superpixel match.

Forward-backward consistency can also be considered in this context by providing to-the-reference multi-step paths in addition to from-the-reference ones. We thus incorporate in superpixels such that where is the set of selected step sequences in the to-the-reference direction. The resulting additional superpixel candidates are referred to as reverse candidates, as opposed to direct ones, i.e. those which were formerly stacked into . To further guide mutual matching between and , one can apply majority voting (Eq.12) only on superpixels generated in both from/to-the-reference directions.
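The selection stage can be sketched as a majority vote over the pooled candidates, with an optional restriction to candidates found in both directions. The function name `select_long_term_match` and the `mutual_only` flag are hypothetical; returning `None` when no candidate survives is an assumption.

```python
from collections import Counter

def select_long_term_match(direct, reverse=(), mutual_only=False):
    """Majority vote over direct and reverse candidate superpixels (Eq.12)."""
    pool = list(direct) + list(reverse)
    if mutual_only:
        both = set(direct) & set(reverse)
        pool = [candidate for candidate in pool if candidate in both]
    if not pool:
        return None  # no candidate left to vote on
    return Counter(pool).most_common(1)[0][0]
```

With direct candidates [5, 5, 5, 2] and a single reverse candidate 2, plain voting selects 5, whereas the mutual restriction selects 2 since it is the only superpixel reached in both directions.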

Superpixel correspondences with respect to are provided for the whole sequence relying on multi-step integration applied independently for each pair and based on unsupervised learning-based superpixel matching as an elementary estimator.

4 Application to video object tracking

Different aspects of the proposed methodology are evaluated through video object tracking experiments. First, the ability of unsupervised learning-based superpixel matching to provide a reliable and accurate elementary estimator between consecutive and distant frames is assessed through comparisons with state-of-the-art methods (Sect.4.1). Second, the capacity of the proposed multi-step integration stage to perform robust long-term superpixel tracking is shown using both NN and RF-based multi-step elementary superpixel matches (Sect.4.2). Moreover, multi-step integration results are assessed with respect to straightforward direct and sequential integration outputs. Third, multi-step integration is further analyzed by studying the impact of different candidate generation strategies in terms of tracking accuracy (Sect.4.3).

To provide a generic evaluation while ensuring content diversity and representativity, video object tracking is performed over sequences (Tab.1) extracted from databases: bag, fish3 (denoted fsh3) and octopus (octo) from the Visual Object Tracking (VOT) database kristan2016vot , sleep1 (sle1) with albedo from MPI Sintel butler12eccv , lapa from the laparoscopy dataset sznitman2012data as well as swan, bear, camel (caml), cows and flamingo (flam) from the Densely Annotated VIdeo Segmentation (DAVIS) database perazzi2016benchmark . As detailed in Tab.1, these sequences cover altogether many challenging situations such as complex non-rigid motion (NR), large displacement (LD), background clutter (BC), i.e. color similarities with background or between objects, dynamic background (DB) including moving background objects and camera viewpoint changes, scale variations (SV), partial occlusions (PO), thin structures (TS), illumination changes and shadows (IC). Except for lapa, whose ground-truth (GT) masks we created ourselves, all sequences were provided with associated GT masks indicating exact object delineations.

img obj NR LD BC DB SV PO TS IC
bag kristan2016vot 101 1 x x x x
fsh3 kristan2016vot 101 1 x x x
octo kristan2016vot 51 1 x x x x
lapa sznitman2012data 81 1 x x x
sle1 butler12eccv 50 3 x x x
swan perazzi2016benchmark 50 1 x x x
bear perazzi2016benchmark 82 1 x x x x x x
caml perazzi2016benchmark 90 1 x x x x x x x
cows perazzi2016benchmark 104 1 x x x x x x
flam perazzi2016benchmark 80 1 x x x x x x
Table 1: Overview of sequences extracted from kristan2016vot ; sznitman2012data ; butler12eccv ; perazzi2016benchmark and used for object tracking experiments with associated sequence length, tracked object number and video attributes including complex non-rigid motion (NR), large displacement (LD), background clutter (BC), dynamic background (DB), scale variations (SV), partial occlusions (PO), thin structures (TS), illumination changes and shadows (IC).

Video object tracking, also called semi-supervised video object segmentation, consists in estimating for the whole sequence the exact location of a semantically meaningful free-shape region of interest (ROI) manually defined in one single image referred to as the reference frame. Once produced, tracking results are assessed for each pair with based on three complementary measures. First, DICE scores measure the region-based segmentation similarity between estimated and GT masks by computing . Then, contour-based precision and recall between estimated and GT masks can be estimated relying on bipartite graph matching to be robust to small inaccuracies martin2004learning . In practice, we focus on the F-measure combining precision and recall using . Bipartite matching is approximated using morphology operators, as in perazzi2016benchmark . Finally, consistency-based assessment is performed relying on the percentage of pixels of located inside the tracked ROI and whose enclosing superpixel is consistent in terms of forward-backward binary consistency:

(13)
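The first two measures can be sketched with their standard definitions, which are assumed here: the DICE score as twice the overlap divided by the sum of both mask sizes, and the F-measure as the harmonic mean of precision and recall. Masks are represented as sets of pixel coordinates; function names are hypothetical, and the contour matching that yields precision and recall is not shown.

```python
def dice_score(estimated, ground_truth):
    """2 |A n B| / (|A| + |B|) for two masks given as sets of pixel coordinates."""
    estimated, ground_truth = set(estimated), set(ground_truth)
    if not estimated and not ground_truth:
        return 1.0  # two empty masks are considered identical (assumption)
    overlap = len(estimated & ground_truth)
    return 2 * overlap / (len(estimated) + len(ground_truth))

def f_measure(precision, recall):
    """Harmonic mean of contour-based precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```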

In terms of computation time, performing RF-based matching followed by multi-step integration using steps on a sequence of 640×360 frames such as octo with superpixels takes approximately 6 minutes per frame using a GHz Intel Xeon CPU processor and a Python implementation, without extensive code optimization. Processing times are reduced by about when relying on NN for unsupervised learning.

4.1 Elementary superpixel matching

Our first experiments consist in evaluating the proposed unsupervised learning-based superpixel matching (Sect.2) between consecutive and distant frames against state-of-the-art methods. In this direction, ROI tracking is performed through direct integration (DIR) in the to-the-reference direction, i.e. relying on direct processing of image pairs without any sequential or multi-step combinations of pre-estimated superpixel matches. Unsupervised learning-based matching using both NN and RF classifiers is compared to three other methodologies: superpixel-to-superpixel matching using superpixel-wise average color (RGBm) and color histogram (RGBh) features, PatchMatch (PM) barnes2009patchmatch , as well as optical flow through Farnebäck farneback2003two and SIFT Flow liu2011sift . Unsupervised learning-based matching works with superpixels per frame and employs context-rich multi-channel features computed with as maximal radius and as possible box sizes (Sect.2.3). RF is made of trees whereas NN relies on neighbors for queries. RGBm and RGBh use respectively average RGB colors and RGB histograms (using bins) as superpixel-wise features to give correspondences in for each superpixel of in a nearest neighbor manner. As for unsupervised learning-based matching, RGBm and RGBh exploit images decomposed into superpixels. PM barnes2009patchmatch searches for the best patch matches using windows with iterations including both propagation and random refinement steps. Farnebäck farneback2003two and SIFT Flow liu2011sift estimators are run with default parameters.

Learning-based and superpixel-to-superpixel matching are performed once a ground-truth label is assigned to each superpixel of to indicate whether it belongs to the ROI to be tracked. 50% of the constituting pixels must be included in the ROI to label a superpixel as part of the object in . Label propagation can then easily be done at the superpixel level once to-the-reference superpixel pairings are obtained. Conversely, PM and optical flow estimators use dense to-the-reference fields to propagate labels at the pixel level from to the whole sequence.
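Superpixel labelling and label propagation can be sketched as follows. The function names and the flat per-pixel lists are hypothetical, and treating the 50% rule as an inclusive threshold is an assumption.

```python
from collections import defaultdict

def label_reference_superpixels(spx_of_pixel, roi_mask, min_fraction=0.5):
    """Label a reference superpixel as object if enough of its pixels lie in the ROI."""
    inside = defaultdict(int)
    total = defaultdict(int)
    for spx, in_roi in zip(spx_of_pixel, roi_mask):
        total[spx] += 1
        inside[spx] += 1 if in_roi else 0
    return {spx: inside[spx] / total[spx] >= min_fraction for spx in total}

def propagate_labels(to_reference_match, reference_labels):
    """Give each superpixel the label of its matched reference superpixel."""
    return {spx: reference_labels[ref] for spx, ref in to_reference_match.items()}
```

Here `to_reference_match` maps each superpixel of the current frame to its match in the reference frame, so the tracked ROI in the current frame is the union of the superpixels labelled as object.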

        DICE                                              | F-measure
        spx matching   PM     optical flow   proposed DIR | spx matching   PM     optical flow   proposed DIR
        RGBm   RGBh           Far    SF      NN     RF    | RGBm   RGBh           Far    SF      NN     RF
bag     67.1   87.5    11.6   11.8   11.0    96.9   92.8  | 59.3   82.6    10.7   10.1   10.6    97.8   89.0
fsh3    89.8   89.7    23.8   16.7   15.2    64.8   89.4  | 91.9   91.4    27.4   21.4   18.7    67.3   81.5
octo    25.6   47.6    98.6   96.5   96.4    84.8   85.5  | 17.2   32.1    95.2   92.9   92.9    75.6   74.9
lapa    65.8   67.7    70.4   61.1   48.5    89.2   87.9  | 57.1   57.7    53.0   51.2   51.5    86.1   85.3
sl1.1   84.5   84.2    41.7   35.2   32.5    82.9   94.9  | 81.2   83.3    53.1   42.2   43.1    81.8   94.4
sl1.2   24.5   51.2    24.3   13.4   11.8    82.0   89.4  | 21.6   44.6    50.4   34.1   24.4    82.4   90.3
sl1.3   38.5   51.2    13.5   7.95   7.75    66.7   88.3  | 39.5   49.1    24.7   12.1   11.4    72.9   77.5
swan    91.0   89.2    78.5   84.9   83.9    90.5   91.3  | 87.4   83.9    62.6   70.7   69.2    85.7   88.0
bear    81.5   73.8    79.7   85.6   84.0    84.3   89.1  | 57.7   46.5    56.3   64.5   65.9    65.2   76.3
camel   58.5   49.3    85.2   85.2   80.5    67.5   70.3  | 45.6   38.2    72.4   64.3   67.2    56.4   59.3
cows    64.2   67.5    89.6   89.6   84.4    16.8   87.1  | 32.2   36.9    77.3   68.3   60.4    15.5   69.8
flam    48.2   52.4    77.5   77.2   66.3    70.0   76.2  | 45.6   46.7    55.6   48.8   45.9    56.3   67.2
avg     61.6   67.6    57.9   54.2   51.9    74.7   86.9  | 53.0   57.8    53.2   48.4   46.8    70.3   80.3
Table 2: DICE and F-measure scores for ROI tracking across sequences using direct integration (DIR), i.e. direct processing of image pairs . Four methodologies are compared: superpixel-to-superpixel matching using superpixel-wise average color (RGBm) and color histogram (RGBh) features, PatchMatch (PM) barnes2009patchmatch , optical flow through Farnebäck (Far) farneback2003two and SIFT Flow (SF) liu2011sift as well as the proposed unsupervised learning-based superpixel matching using NN/RF classifiers. Bold and underline results indicate first and second best scores.

DICE and F-measure scores temporally averaged across each of the previously described sequences are given in Tab.2 for each method. Bold and underline results indicate first and second best scores. Results indicate that the proposed unsupervised learning-based strategy reaches a good matching accuracy for both consecutive and distant frames. On average, RF-based superpixel pairings provide the best direct tracking results with DICE and F-measure of 86.9 and 80.3, followed by NN-based results which reach 74.7 and 70.3. Both methods clearly outperform the other state-of-the-art methods on average. Averaged DICE (F-measure) goes down to 67.6 (57.8) and 61.6 (53.0) for RGBh and RGBm respectively. Despite fairly good scores for octo, caml, cows and flam, PM and optical flow methods do not globally outperform unsupervised learning-based and superpixel-to-superpixel matching, with averaged DICE (F-measure) of 57.9 (53.2), 54.2 (48.4) and 51.9 (46.8) for PM, Farnebäck and SIFT Flow.
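
For reference, the DICE score between a propagated ROI mask and its groundtruth, and its temporal average over a sequence, can be computed as in the sketch below. This uses the standard DICE definition; the function names and the convention for two empty masks are assumptions. The F-measure is not sketched here since its exact definition is given outside this section.

```python
import numpy as np

def dice_score(pred_mask, gt_mask):
    """DICE = 2 |P ∩ G| / (|P| + |G|) between two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumption)
    return float(2.0 * np.logical_and(pred, gt).sum() / denom)

def temporal_average(per_frame_scores):
    """Mean of per-frame scores, as reported per sequence in the tables."""
    return float(np.mean(per_frame_scores))

# Toy example: one shared pixel, |P| = 2, |G| = 1.
score = dice_score([[1, 1], [0, 0]], [[1, 0], [0, 0]])
```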

[Figure 5 panels: SLIC decompositions; GT assignment; PM barnes2009patchmatch; Far farneback2003two; SF liu2011sift; RGBm; RGBh; NN - DIR; RF - DIR; NN - SEQ; NN - MSI; RF - SEQ; RF - MSI; training; RF - DIR pred.; RF - SEQ pred.; RF - MSI pred.]
Figure 5: ROI propagation for swan () perazzi2016benchmark with DIR, SEQ and MSI (steps ) integrations using NN and RF. Results are compared with: superpixel-to-superpixel matching with average color (RGBm) and color histogram (RGBh) features, PatchMatch (PM) barnes2009patchmatch , optical flow through Farnebäck (Far) farneback2003two and SIFT Flow (SF) liu2011sift . Blue boundaries in indicate superpixel labelling resulting from GT assignment. Green and red boundaries correspond to groundtruth (GT) and estimated tracking results, respectively. The last row displays training () and prediction () masks resulting from DIR, SEQ and MSI integrations of RF-based elementary pairings.

These findings are illustrated in Fig.5 for the pair of swan. Red and green boundaries denote the propagated and GT ROI locations. PM, Farnebäck and SIFT Flow under-estimate the area covered by the swan, especially at the neck and near the water. RGBm, RGBh as well as RF and NN-based direct (DIR) superpixel pairings provide clearly better contours despite a tendency to propagate the ROI outside the swan area due to shadows and color similarities with the background. One drawback of superpixel-based methods is that superpixel boundaries may not adhere perfectly to the object to be tracked. This aspect is revealed for by the blue boundaries, which indicate the superpixel labelling resulting from GT assignment. However, Tab.2 demonstrates that this limitation is compensated by robust superpixel matching heuristics compared to straightforward pixel-wise matching and mask propagation, which rely on a regular decomposition of the image grid without enough context considerations. Finally, Fig.5 shows a more accurate propagation achieved with multi-step integration of learning-based superpixel pairings, especially with RF. This suggests that unsupervised learning-based superpixel matching provides an elementary estimator that is reliable and accurate enough for efficient long-term multi-step matching and tracking. The performance achieved with multi-step integration is examined in more depth in the next section.

4.2 Long-term superpixel tracking

Long-term ROI tracking results obtained with direct (DIR), sequential (SEQ) and multi-step (MSI) integrations are compared based on unsupervised learning-based superpixel matching, whose accuracy against state-of-the-art methods has been demonstrated in Sect.4.1 with NN and RF classifiers. MSI is applied with maximal step sequences per image pair. Only step sequences whose length is less than or equal to are kept to prevent motion drift (Sect.3.1). NN and RF-based elementary multi-step superpixel matches are obtained with steps for octo, swan and sle1 and for longer sequences (fsh3, bag, lapa, bear, caml, cows and flam). Context-rich features are estimated using the same parameters as in Sect.4.1. Majority voting (Eq.12) focuses only on superpixel candidates generated in both to/from-the-reference directions to improve forward-backward consistency (see Sect.4.3 for further details).
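
The multi-step integration principle can be illustrated with the sketch below: step sequences joining two frames are enumerated under a length limit, elementary matches are chained along each sequence, and the most frequent candidate is kept. This is a simplified illustration and not the exact procedure of Sect.3: the step set, the exhaustive enumeration (no cap on the number of sequences), the forward frame indexing and all names are hypothetical.

```python
from collections import Counter

def step_sequences(distance, steps, max_len):
    """All ordered step sequences summing to `distance`, at most `max_len` long."""
    if distance == 0:
        return [[]]
    if max_len == 0:
        return []
    out = []
    for s in steps:
        if s <= distance:
            for tail in step_sequences(distance - s, steps, max_len - 1):
                out.append([s] + tail)
    return out

def follow_path(sp, start, seq, elementary):
    """Chain elementary matches along one step sequence.

    elementary[(a, b)] is a dict mapping a superpixel id of frame a to its
    paired superpixel id in frame b. Returns None if a match is missing.
    """
    frame = start
    for s in seq:
        pairing = elementary.get((frame, frame + s))
        if pairing is None or sp not in pairing:
            return None
        sp = pairing[sp]
        frame += s
    return sp

def multi_step_match(sp, start, target, steps, max_len, elementary):
    """Majority vote over the candidates produced by all step sequences."""
    votes = Counter()
    for seq in step_sequences(target - start, steps, max_len):
        cand = follow_path(sp, start, seq, elementary)
        if cand is not None:
            votes[cand] += 1
    return votes.most_common(1)[0][0] if votes else None

# Toy example: three paths join frames 0 and 3; two of them agree on id 1.
elementary = {
    (0, 1): {0: 0}, (1, 2): {0: 0}, (2, 3): {0: 1, 5: 7},
    (0, 2): {0: 5}, (1, 3): {0: 1},
}
best = multi_step_match(0, 0, 3, steps=(1, 2), max_len=3, elementary=elementary)
```

In this toy case the path through the erroneous two-frame match (0, 2) is outvoted, which is the behavior that makes multi-step integration robust to isolated elementary errors.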

DICE
        NN-DIR  NN-SEQ  NN-MSI  RF-DIR  RF-SEQ  RF-MSI
bag     96.9    97.7    97.7    92.8    74.9    92.9
fsh3    64.8    90.5    85.7    89.4    91.1    92.0
octo    84.8    91.3    93.0    85.5    91.3    92.8
lapa    89.2    86.5    92.8    87.9    88.3    92.8
sl1.1   82.9    95.3    95.6    94.9    95.5    95.9
sl1.2   82.0    80.9    90.3    89.4    80.5    90.9
sl1.3   66.7    92.4    92.2    88.3    92.4    94.2
swan    90.5    85.0    93.5    91.3    86.9    93.4
bear    84.3    87.8    87.7    89.1    87.0    92.5
caml    67.5    76.6    76.9    70.3    77.7    79.6
cows    16.8    80.1    87.3    87.1    79.1    89.1
flam    70.0    63.5    76.7    76.2    66.5    80.8
avg     74.7    85.6    89.1    86.9    84.3    90.6

F-measure
        NN-DIR  NN-SEQ  NN-MSI  RF-DIR  RF-SEQ  RF-MSI
bag     97.8    99.7    99.4    89.0    65.9    88.3
fsh3    67.3    92.6    81.5    91.8    93.9    95.4
octo    75.6    83.6    86.3    74.9    83.6    86.0
lapa    86.1    86.2    95.0    85.3    86.8    94.8
sl1.1   81.8    93.2    94.8    94.4    94.0    95.5
sl1.2   82.4    77.9    93.0    90.3    73.1    93.9
sl1.3   72.9    76.7    78.1    77.5    76.7    81.6
swan    85.7    73.9    93.6    88.0    78.4    93.8
bear    65.2    68.2    72.2    76.3    68.9    82.1
caml    56.4    64.7    67.2    59.3    63.3    69.1
cows    15.5    59.6    66.8    69.8    61.4    74.1
flam    56.3    62.5    67.1    67.2    62.8    71.9
avg     70.3    78.2    82.9    80.3    75.7    85.5

Consistency
        NN-DIR  NN-SEQ  NN-MSI  RF-DIR  RF-SEQ  RF-MSI
bag     39.2    12.1    32.1    30.2    12.2    27.8
fsh3    56.5    37.6    74.3    69.5    38.8    58.7
octo    84.6    67.5    76.6    82.9    67.1    75.7
lapa    96.1    68.9    96.8    90.2    64.6    91.5
sl1.1   89.2    63.4    96.8    94.5    76.1    93.4
sl1.2   76.0    62.4    91.6    84.9    61.6    91.2
sl1.3   86.7    78.1    95.0    87.7    75.6    100
swan    86.1    57.7    83.0    82.3    59.3    77.7
bear    65.6    61.2    68.5    62.7    62.1    67.2
caml    58.7    41.5    51.8    63.8    42.3    54.7
cows    15.5    39.6    68.5    75.8    39.3    67.6
flam    55.4    34.9    63.4    43.0    39.7    55.0
avg     67.5    52.1    74.9    72.3    53.2    71.7
Table 3: DICE, F-measure and consistency scores for ROI tracking across sequences. We compare direct (DIR), sequential (SEQ) and multi-step (MSI) integration based on unsupervised learning-based superpixel matching using NN and RF. Bold results indicate the best performance between DIR, SEQ and MSI. Underline scores highlight best results between NN and RF-based methods.

Tab.3 presents temporally averaged metrics (DICE, F-measure and consistency scores) obtained by DIR, SEQ and MSI across all sequences using NN and RF. Except for consistency scores when relying on RF, Tab.3 confirms that MSI is the best integration strategy towards long-term superpixel tracking compared to DIR and SEQ. For instance, RF and NN-based MSI reach the highest DICE scores with 90.6 and 89.1 in comparison to 84.3 (86.9) and 85.6 (74.7) obtained with RF and NN-based SEQ (DIR). Second and third positions in terms of DICE and F-measure vary depending on the classifier: SEQ outperforms DIR for NN whereas RF exhibits the opposite behavior. Another main finding is that, except for MSI in terms of consistency and SEQ in terms of DICE and F-measure, RF-based elementary matches usually yield better long-term tracking than NN-based pairings, as one could expect.

[Figure 6 panels: columns NN, RF, NN, RF; rows lapa sznitman2012data, sle1.2 butler12eccv, octo kristan2016vot, swan perazzi2016benchmark]
Figure 6: Temporal evolution of DICE and F-measure scores during ROI tracking across lapa sznitman2012data , sle1.2 butler12eccv , octo kristan2016vot and swan perazzi2016benchmark sequences. We compare direct (DIR), sequential (SEQ) and multi-step (MSI) integration based on NN and RF-based elementary pairings.

Temporal evolutions of DICE and F-measure scores are displayed in Fig.6 along lapa, sle1.2 (sle1 for object ), octo and swan sequences with both classifiers. As already confirmed, the best tracking results are reached with MSI compared to DIR and SEQ, especially for distant pairs. Contrary to SEQ, whose performance decreases across sequences due to error accumulation (lapa and swan especially), the multi-step estimations involved in MSI make it possible to correct erroneous superpixel tracks, as can be seen for sle1.2 from frame . Moreover, unlike SEQ, DIR is not prone to motion drift, but direct matching becomes less reliable when inter-frame distances increase, as shown for octo starting from frame . Finally, the temporal behavior remains the same regardless of the classifier.

[Figure 7 panels: NN - DIR, NN - SEQ, NN - MSI, RF - DIR, RF - SEQ, RF - MSI]
Figure 7: ROI selection and tracking across lapa sequence sznitman2012data from to . We compare direct (DIR), sequential (SEQ) and multi-step (MSI, steps ) integration (Sect.3) of superpixel matches obtained through unsupervised learning-based superpixel matching (Sect.2) with NN and RF breiman2001random . Superpixel decompositions are obtained via SLIC achanta2012slic . Blue boundaries () indicate superpixel labelling resulting from GT assignment. Green and red boundaries correspond to groundtruth (GT) and estimated tracking results, respectively.
[Figure 8 panels: objects sle1.1, sle1.2, sle1.3; NN - DIR, NN - SEQ, NN - MSI, RF - DIR, RF - SEQ, RF - MSI]
Figure 8: ROI selections and tracking across sle1 sequence butler12eccv from to . We compare direct (DIR), sequential (SEQ) and multi-step (MSI, steps ) integration based on unsupervised learning-based superpixel matching with NN and RF.


[Figure 9 panels: NN - MSI, RF - DIR, RF - SEQ, RF - MSI]
Figure 9: ROI selection and tracking across bag (resp. fsh3, octo) sequences kristan2016vot for segments (, ). We compare DIR, SEQ and MSI (steps for bag and fsh3, for octo) integration based on unsupervised learning-based superpixel matching with NN and RF breiman2001random .

Finally, quantitative results are illustrated by series of ROI selection and visual tracking examples for several pairs of lapa (Fig.7), sle1 (Fig.8), bag, fsh3, and octo (Fig.9) sequences. Fig.7 shows that NN-based MSI provides a very good delineation of the surgical tool for all image pairs, which suggests that series of medical images can also be processed with the proposed methodology. Reliable ROI tracking through MSI is also shown on synthetic images (Fig.8) despite strong scale variations. Propagation of matching errors with SEQ is clearly illustrated in Fig.7 for lapa (NN) and in Fig.9 for bag. Tracking failures with DIR are temporally uncorrelated but strong enough to damage the propagation task for the fish (Fig.9) due to color variations of its right part. ROI tracking for octo (Fig.9) gives correct results with both NN and RF despite significant color similarities with the dynamic background. Results in Fig.5 and Fig.8 demonstrate that RF-based MSI outperforms NN-based MSI as well as RF-based DIR and SEQ for swan and sle1 videos. Another visualization through the prediction masks given in Fig.5 for the swan pair confirms the ability of RF-based MSI to reach accurate long-term correspondences (see for instance the swan beak). Indeed, when matching is correct, a given object part keeps the same color between training and prediction masks.

4.3 Long-term candidate generation

We now study multi-step integration in more depth by comparing different long-term candidate generation strategies in terms of tracking accuracy. As described in Sect.3.2, long-term superpixel candidates can be generated using direct candidates only (MSId), both direct and reverse candidates (MSIr) or only candidates generated in both direct and reverse directions (MSIm). Note that the previously given MSI results correspond to MSIm, where only superpixel duplicates are taken into account for majority voting (Eq.12). MSId, MSIr and MSIm are comparatively evaluated in terms of tracking accuracy using RF-based elementary superpixel matches. Results are provided in Tab.4 through DICE, F-measure and consistency scores across the sequences used for ROI tracking.
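
The three candidate generation strategies can be contrasted with a small sketch operating on the candidate lists of a single superpixel. It reflects one possible reading of the description above, not the exact formulation of Sect.3.2 and Eq.12: in particular, pooling direct and reverse candidates before voting and keeping every occurrence of a duplicated candidate are assumptions, and the function name is hypothetical.

```python
from collections import Counter

def select_match(direct, reverse, strategy):
    """Majority vote over long-term candidates for one superpixel.

    direct / reverse: lists of candidate superpixel ids (one per multi-step
    path) obtained in the direct and reverse directions respectively.
    strategy: "MSId" (direct only), "MSIr" (direct and reverse pooled) or
    "MSIm" (only candidates found in both directions).
    """
    if strategy == "MSId":
        pool = list(direct)
    elif strategy == "MSIr":
        pool = list(direct) + list(reverse)
    elif strategy == "MSIm":
        common = set(direct) & set(reverse)
        pool = [c for c in list(direct) + list(reverse) if c in common]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    if not pool:
        return None  # no usable candidate for this superpixel
    return Counter(pool).most_common(1)[0][0]

# Toy example where the three strategies disagree.
direct = [3, 3, 3, 4, 5, 5]
reverse = [4, 6, 6, 6, 6, 5]
picks = {s: select_match(direct, reverse, s) for s in ("MSId", "MSIr", "MSIm")}
```

Restricting the vote to candidates supported in both directions discards ids 3 and 6 here, each of which is backed by a single direction only.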

DICE
        MSId  MSIr  MSIm
bag     84.5  91.9  92.9
fsh3    92.0  92.0  92.0
octo    91.8  92.2  92.8
lapa    92.6  92.9  92.8
sl1.1   95.7  95.8  95.9
sl1.2   90.3  91.0  90.9
sl1.3   94.2  94.3  94.2
swan    93.2  93.1  93.4
bear    92.2  92.5  92.5
caml    79.6  79.3  79.6
cows    90.0  89.3  89.1
flam    79.8  80.7  80.8
avg     89.7  90.4  90.6

F-measure
        MSId  MSIr  MSIm
bag     80.3  88.3  88.3
fsh3    94.9  95.4  95.4
octo    83.8  84.4  86.0
lapa    94.2  95.2  94.8
sl1.1   94.5  94.8  95.5
sl1.2   91.8  94.5  93.9
sl1.3   81.3  81.5  81.6
swan    93.4  93.0  93.8
bear    83.3  82.9  82.1
caml    69.6  69.3  69.1
cows    75.5  74.4  74.1
flam    64.0  69.0  71.9
avg     83.9  85.2  85.5

Consistency
        MSId  MSIr  MSIm
bag     19.2  27.8  27.8
fsh3    52.8  59.6  58.7
octo    71.6  75.8  75.7
lapa    85.4  91.5  91.5
sl1.1   91.2  93.7  93.4
sl1.2   78.6  90.4  91.2
sl1.3   90.0  99.5  100
swan    70.6  77.6  77.7
bear    54.7  67.0  67.2
caml    44.2  54.1  54.7
cows    59.8  67.2  67.6
flam    40.9  54.3  55.0
avg     63.3  71.5  71.7
Table 4: DICE, F-measure and consistency scores for ROI tracking across sequences. Based on RF-based superpixel elementary matches, we compare three different superpixel candidate generation strategies for multi-step (MSI) integration using: only direct candidates (MSId), direct and reverse candidates (MSIr), only candidates generated in both direct and reverse directions (MSIm). Bold results indicate the best performance.

Results from Tab.4 bring two main findings. First, we observe that tracking accuracy is improved when reverse candidates are used in addition to direct ones. Consistency ratios are clearly improved (from 63.3 to 71.5 when comparing MSId/MSIr) as expected, but DICE and F-measure improvements can also be observed, with average gains of 0.7 and 1.3 between MSId/MSIr. Second, relying on superpixel duplicates only (MSIm) brings a slightly better ROI tracking compared to MSIr. Average results slightly increase from 90.4 to 90.6, 85.2 to 85.5 and 71.5 to 71.7 for DICE, F-measure and consistency, which indicates that extensive mutual matching guidance as performed with MSIm is the best strategy to perform long-term superpixel tracking from multi-step elementary correspondences.

5 Conclusion

In this work, we proposed a two-step pipeline dedicated to long-term superpixel tracking. Unsupervised learning-based superpixel matching is first used as an elementary displacement estimator to provide correspondences between consecutive and distant images using either nearest neighbors or random forests with robust context-rich features, which we extended from greyscale to multi-channel, and forward-backward consistency constraints. Resulting elementary matches are then combined along multi-step paths running through the sequence with various inter-frame distances to produce a large set of candidate long-term superpixel pairings upon which majority voting selection is performed. Compared to state-of-the-art methods including pixel or patch-based strategies, which may suffer from regular support regions, video object tracking experiments demonstrate that unsupervised learning can produce reliable correspondences between semantically meaningful areas. Moreover, the ability of multi-step integration to combine these pairings towards accurate long-term superpixel tracking has been shown against the usual direct and sequential integrations. Extending this work from single to hierarchical multi-scale superpixel decomposition deserves further investigation, since dealing with multiple spatial extents can drive the matching process in a coarse-to-fine fashion. Other features such as spectral features could be employed to further improve unsupervised learning-based matching while reducing processing time. In addition, very long-term superpixel tracking could be reached by combining superpixel pairings estimated with respect to multiple reference frames. Our contributions also give new insights for optical flow and registration initialization, in particular to better handle large displacements as well as appearance and illumination changes.
More generally, the proposed framework could be easily extended to other imaging modalities including series of medical images for anatomical structure tracking.

References

  • (1) J. Lezama, K. Alahari, J. Sivic, I. Laptev, Track to the future: Spatio-temporal video segmentation with long-range motion cues, in: IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3369–3376.

  • (2) S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, R. Szeliski, A comparison and evaluation of multi-view stereo reconstruction algorithms, in: IEEE Conference on Computer Vision and Pattern Recognition, 2006, pp. 519–528.
  • (3) F. Yang, H. Lu, M.-H. Yang, Robust superpixel tracking, IEEE Transactions on Image Processing 23 (4) (2014) 1639–1651.
  • (4) H. Wang, A. Kläser, C. Schmid, C.-L. Liu, Dense trajectories and motion boundary descriptors for action recognition, International Journal of Computer Vision 103 (1) (2013) 60–79.
  • (5) X. Cao, Z. Li, Q. Dai, Semi-automatic 2D-to-3D conversion using disparity propagation, IEEE Transactions on Broadcasting 57 (2) (2011) 491–499.
  • (6) P.-H. Conze, P. Robert, T. Crivelli, L. Morin, Multi-reference combinatorial strategy towards longer long-term dense motion estimation, Computer Vision and Image Understanding 150 (2016) 66–80.
  • (7) G. Farnebäck, Two-frame motion estimation based on polynomial expansion, Image Analysis (2003) 363–370.
  • (8) J. Shi, C. Tomasi, Good features to track, in: IEEE International Conference on Computer Vision and Pattern Recognition, 1994, pp. 593–600.
  • (9) C. Barnes, E. Shechtman, A. Finkelstein, D. B. Goldman, Patchmatch: A randomized correspondence algorithm for structural image editing, ACM Transaction on Graphics 28 (3) (2009) 24–1.
  • (10) C. Barnes, E. Shechtman, D. Goldman, A. Finkelstein, The generalized patchmatch correspondence algorithm, European Conference on Computer Vision (2010) 29–43.
  • (11) R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, S. Susstrunk, SLIC superpixels compared to state-of-the-art superpixel methods, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (2012) 2274–2282.
  • (12) H.-S. Chang, Y.-C. F. Wang, Superpixel-based large displacement optical flow, in: IEEE International Conference on Image Processing, 2013, pp. 3835–3839.
  • (13) S. Donné, J. Aelterman, B. Goossens, W. Philips, Fast and robust variational optical flow for high-resolution images using SLIC superpixels, in: International Conference on Advanced Concepts for Intelligent Vision Systems, 2015, pp. 205–216.
  • (14) X. Dong, J. Shen, L. Shao, HSP2P: Hierarchical superpixel-to-pixel dense image matching, IEEE Transactions on Circuits and Systems for Video Technology – (–) (2017) –, not fully edited.
  • (15) H. Fan, J. Xiang, Z. Chen, Visual tracking by local superpixel matching with markov random field, in: Pacific Rim Conference on Multimedia, 2016.
  • (16) R. Giraud, V. T. Ta, A. Bugeau, P. Coupé, N. Papadakis, Superpatchmatch: An algorithm for robust correspondences using superpixel patches, IEEE Transactions on Image Processing 26 (8) (2017) 4068–4078.
  • (17) F. Kanavati, T. Tong, K. Misawa, M. Fujiwara, K. Mori, D. Rueckert, B. Glocker, Supervoxel classification forests for estimating pairwise image correspondences, Pattern Recognition 63 (2017) 561–569.
  • (18) L. Breiman, Random Forests, Machine learning 45 (1) (2001) 5–32.
  • (19) T. Crivelli, M. Fradet, P.-H. Conze, P. Robert, P. Pérez, Robust optical flow integration, IEEE Transactions on Image Processing 24 (1) (2015) 484–498.
  • (20) L. Wang, H. Lu, M. H. Yang, Constrained superpixel tracking, IEEE Transactions on Cybernetics (99) (2017) 1–12.
  • (21) C. Ma, J.-B. Huang, X. Yang, M.-H. Yang, Hierarchical convolutional features for visual tracking, in: IEEE International Conference on Computer Vision, 2015, pp. 3074–3082.
  • (22) H. Li, Y. Li, F. Porikli, Deeptrack: Learning discriminative feature representations online for robust visual tracking, IEEE Transactions on Image Processing 25 (4) (2016) 1834–1848.
  • (23) D. Zhang, H. Maei, X. Wang, Y.-F. Wang, Deep reinforcement learning for visual object tracking in videos, arXiv preprint 1701.08936.
  • (24) B. Glocker, D. Zikic, D. R. Haynor, Robust registration of longitudinal spine CT, in: Medical Image Computing and Computer-Assisted Intervention, 2014, pp. 251–258.
  • (25) P.-H. Conze, F. Tilquin, V. Noblet, F. Rousseau, F. Heitz, P. Pessaux, Hierarchical multi-scale supervoxel matching using random forests for automatic semi-dense abdominal image registration, in: IEEE International Symposium on Biomedical Imaging, 2017, pp. 490–493.
  • (26) R. Sznitman, K. Ali, R. Richa, R. H. Taylor, G. D. Hager, P. Fua, Data-driven visual tracking in retinal microsurgery, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2012, pp. 568–575.
  • (27) T. Brox, J. Malik, Large displacement optical flow: descriptor matching in variational motion estimation, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (3) (2011) 500–513.
  • (28) S. Roth, V. Lempitsky, C. Rother, Discrete-continuous optimization for optical flow estimation, in: Statistical and Geometrical Approaches to Visual Motion Analysis, 2009, pp. 1–22.
  • (29) M. Kristan, et al., The visual object tracking VOT2016 challenge results, in: Workshop of European Conference on Computer Vision, 2016, pp. 777–823.
  • (30) D. J. Butler, J. Wulff, G. B. Stanley, M. J. Black, A naturalistic open source movie for optical flow evaluation, in: European Conference on Computer Vision, 2012, pp. 611–625.
  • (31) F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, A. Sorkine-Hornung, A benchmark dataset and evaluation methodology for video object segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 724–732.
  • (32) D. R. Martin, C. C. Fowlkes, J. Malik, Learning to detect natural image boundaries using local brightness, color, and texture cues, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (5) (2004) 530–549.
  • (33) C. Liu, J. Yuen, A. Torralba, Sift flow: Dense correspondence across scenes and its applications, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (5) (2011) 978–994.