Learning Multi-Object Tracking and Segmentation from Automatic Annotations

12/04/2019 ∙ by Lorenzo Porzi, et al. ∙ Mapillary ∙ Universitat Autònoma de Barcelona ∙ TU Graz

In this work we contribute a novel pipeline to automatically generate training data, and to improve over state-of-the-art multi-object tracking and segmentation (MOTS) methods. Our proposed tracklet mining algorithm turns raw street-level videos into high-fidelity MOTS training data, is scalable and removes the need for expensive and time-consuming manual annotation approaches. We leverage state-of-the-art instance segmentation results in combination with optical flow obtained from models also trained on automatically harvested training data. Our second major contribution is MOTSNet, a deep learning, tracking-by-detection architecture for MOTS, deploying a novel mask-pooling layer for improved object association over time. Training MOTSNet with our automatically extracted data leads to significantly improved sMOTSA scores on the novel KITTI MOTS dataset (+1.9%/+7.5% on cars/pedestrians). Even without learning from a single, manually annotated MOTS training example we still improve over prior state-of-the-art, confirming the compelling properties of our pipeline. On the MOTSChallenge dataset we improve by +4.1




1 Introduction and Motivation

We focus on the challenging task of multi-object tracking and segmentation (MOTS) [47], which was recently introduced as an extension of bounding-box based multi-object tracking. Joining instance segmentation with tracking was shown to noticeably improve tracking performance, as unlike bounding boxes, instance segmentation masks do not suffer from overlapping issues but provide fine-grained, pixel-level information about the objects to be tracked.

While this finding is encouraging, it comes with the downside of requiring pixel-level (typically polygon-based) annotations, which are time-consuming to produce and thus expensive to obtain. The works in [9, 33] report annotation times of 90 minutes per image to manually produce high-quality panoptic segmentation masks. Analogously, it is highly demanding to produce datasets for the MOTS task, where instance segmentation masks also need to contain tracking information across frames.

To avoid generating MOTS labels completely from scratch and in a purely manual way, [47] has followed the semi-automatic annotation procedure from [4], and extended the existing multi-object tracking datasets KITTI tracking [11] (i.e. 21 sequences from the raw KITTI dataset) and MOTChallenge [32] (4/7 sequences). These datasets already provide bounding box-based tracklets for cars and pedestrians. Instance segmentation masks were generated as follows. First, two segmentation masks per object were generated by human annotators (who also had to choose the objects based on diversity in the first place). Then, DeepLabV3+ [6], i.e. a state-of-the-art semantic segmentation network, was trained on the initially generated masks so as to overfit these objects in a sequence-specific way, thus yielding reasonable segmentation masks for the remaining objects in each track. The resulting segmentation masks then underwent another manual correction step to fix remaining errors. Finally, these automatic and manual steps were iterated until convergence.

Including an automated segmentation algorithm clearly speeds up the MOTS dataset generation process, but still has significant shortcomings. Most importantly, the approach in [47] still depends on the availability of bounding-box based, multi-object tracking information as provided by datasets like [11, 32]. Also, the remaining human annotation effort for generating initial instance masks was considerable, i.e. [47] reports that 8k masks (or of all masks in KITTI MOTS) had been manually labeled before fine-tuning a (modified) DeepLabV3+ segmentation network, pre-trained on COCO [25] and Mapillary Vistas [33]. Finally, such an approach cannot be expected to generalize well for generating orders of magnitude more training data.

In this paper we introduce a novel approach for automatically generating high-quality training data (see Fig. 1 for an example) from generic street-level videos for the task of joint multi-object tracking and segmentation. Our methods are conceptually simple and leverage state-of-the-art instance [12] (or panoptic [39, 8]) segmentation models trained on existing image datasets like Mapillary Vistas [33] for the task of object instance segmentation. We further describe how to automatically mine tracking labels for all detected object instances, building upon state-of-the-art optical flow models [53] that have in turn been trained on automatically harvested flow supervision obtained from image-based 3D modeling (structure-from-motion).

Another important contribution of our paper is MOTSNet, a deep learning, tracking-by-detection approach for MOTS. MOTSNet takes advantage of a novel mask-pooling layer, which allows it to exploit instance segmentation masks to compute a representative embedding vector for each detected object. The tracking process can then be formulated as a series of Linear Assignment Problems, optimizing a payoff function which compares detections in the learned embedding space.

We demonstrate the superior quality for each step of our proposed MOTS training data extraction pipeline, as well as the efficacy of our novel mask-pooling layer. The key contributions and findings of our approach are:

  • Automated MOTS training data generation – instance detection, segmentation and actual tracking – is possible at scale and with very high fidelity, approximating the quality of previous state-of-the-art results [47] without even directly training for it.

  • Deep-learning based, supervised optical flow [53] can be effectively trained on pixel-based correspondences obtained from Structure-from-Motion and furthermore used for tracklet extraction.

  • Direct exploitation of instance segmentation masks through our proposed mask-pooling layer yields significant improvements in sMOTSA scores.

  • The combination of our general-purpose MOTS data generation pipeline with our proposed MOTSNet improves by up to 1.9% for cars and 7.5% for pedestrians over the previously best method on KITTI MOTS in absolute terms, while using fewer parameters. Similarly, we improve over the previously best-performing work on the MOTSChallenge dataset.

We provide extensive ablation studies on the KITTI MOTS, MOTSChallenge, and Berkeley Deep Drive BDD100k [54] datasets and obtain results that consistently improve by a large margin over the prior state-of-the-art [47].

2 Related Works

Since the MOTS task was only very recently introduced in [47], directly related works are scarce. Instead, the bulk of related works in a broader sense comes from the MOT (multi-object tracking) and VOS (video object segmentation) literature. This is true in terms of both datasets (see e.g. [11, 21, 32] for MOT and [37, 40, 23] for VOS/VOT, respectively) and methods tackling the problems ([44] for MOT and [52, 29, 35, 48] for VOS).

Combining motion and semantics.

The work in [17] uses semantic segmentation results from [27] for video frames together with optical flow predictions from EpicFlow [41] to serve as ground truth for training a joint model on future scene parsing, i.e. the task of anticipating future motion of scene semantics. In [28] this task is extended to deal with future instance segmentation prediction, based on convolutional features from a Mask R-CNN [12] instance segmentation branch. In [16] a method for jointly estimating optical flow and temporally consistent semantic segmentation from monocular video is introduced. Their approach is based on a piece-wise flow model, enriched with semantic information through label consistency of superpixels.

Semi-automated dataset generation.

Current MOT [49, 32, 11] and VOS [51, 38] benchmarks are annotated based on human efforts. Some of them, however, use some kind of automation. In [51], they exploit the temporal correlation between consecutive frames in a skip-frame strategy, although the segmentation masks are manually annotated. A semi-automatic approach is proposed in [1] for annotating multi-modal benchmarks. It uses a pre-trained VOS model [18] but still requires additional manual annotations and supervision on the target dataset. The MOTS training data from [47] is generated by augmenting an existing MOT dataset [32, 11] with segmentation masks, using the iterative, human-in-the-loop procedure we briefly described in Sec. 1. The tracking labels on [32, 11] were manually annotated and, in the case of [32], refined by adding high-confidence detections from a pedestrian detector.

The procedures used in these semi-automatic annotation approaches often resemble those of VOS systems which require user input to select objects of interest in a video. Some proposed techniques in this field [7, 5, 31] are extensible to multiple object scenarios. A pixel-level embedding is built in [7] to solve segmentation and achieve state-of-the-art accuracy with low annotation and computational cost.

To the best of our knowledge, the only work that generates tracking data from unlabeled large-scale videos is found in [36]. By using an adaptation of the approach in [35], Osep et al. perform object discovery on unlabeled videos from the KITTI Raw [10] and Oxford RobotCar [30] datasets. Their approach however cannot provide joint annotations for instance segmentation and tracking.

Deep Learning methods for MOT and MOTS.

Many deep learning methods for MOTS/VOS are based on “tracking by detection”, i.e. candidate objects are first detected in each frame, and then joined into tracklets as a post-processing step (e.g. [47]). A similar architecture, i.e. a Mask R-CNN augmented with a tracking head, is used in [52] to tackle Video Instance Segmentation. A later work [29] improved over [52] by adapting the UnOVOST [55] model which, first in an unsupervised setting, uses optical flow to build short tracklets that are later merged by a learned embedding. For the plain MOT task, Sharma et al. [44] incorporated shape and geometry priors into the tracklet generation process.

Other works instead perform detection-free tracking. CAMOT [35] tracks category-agnostic mask proposals across frames, exploiting both appearance and geometry cues from scene flow and visual odometry. Finally, the VOS work in [48] follows a correlation-based approach, extending popular tracking models [3, 22]. Their SiamMask model only requires bounding box initialization but is pre-trained on multiple, human-annotated datasets [25, 51, 43].

3 Dataset Generation Pipeline

Our proposed data generation approach is rather generic w.r.t. its data source, and here we focus on the KITTI Raw [10] dataset. KITTI Raw contains 142 sequences (we excluded the 9 sequences overlapping with the validation set of KITTI MOTS), for a total of k images, captured with a professional rig including stereo cameras, LIDAR, GPS and IMU. Next we describe our pipeline which, relying only on monocular images and GPS data, is able to generate accurate MOTS annotations for the dataset.

3.1 Generation of Instance Segmentation Results

We begin by segmenting object instances in each frame of each video, considering a predefined set $\mathcal{Y}$ of object classes that belong to the Mapillary Vistas dataset [33]. To extract the instance segments, we run the Seamless Scene Segmentation method [39], augmented with a ResNeXt-101-32×8d [50] backbone and trained on Mapillary Vistas. By doing so, we obtain a set of object segments $\mathcal{S}$ per video sequence. For each segment $s\in\mathcal{S}$, we denote by $t_s$ the frame where the segment was extracted, by $y_s\in\mathcal{Y}$ the class it belongs to and by $\phi_s:\mathbb{R}^2\to\{0,1\}$ a pixel indicator function representing the segmentation mask, i.e. $\phi_s(i,j)=1$ if and only if pixel $(i,j)$ belongs to the segment. For convenience, we also introduce the notation $\mathcal{S}_t$ to denote the set of all segments extracted from frame $t$ of a given video. For KITTI Raw we extracted roughly 1.25M segments.

3.2 Generation of Tracklets using Optical Flow

After having automatically extracted the instance segments from a given video in the dataset, our goal is to leverage optical flow to extract tracklets, i.e. consecutive sequences of frames where a given object instance appears.

Flow training data generation.

We automatically generate ground-truth data for training an optical flow network by running a Structure-from-Motion (SfM) pipeline, namely OpenSfM (https://github.com/mapillary/OpenSfM), on the KITTI Raw video sequences and densify the result using PatchMatch [45]. To further improve the quality of the 3D reconstruction, we exploit the semantic information that has already been extracted per frame to remove spurious correspondences generated by moving objects. Consistency checks are also performed in order to retain correspondences that are supported by at least images. Finally, we derive optical flow vectors between pairs of consecutive video frames in terms of the relative position of correspondences that are visible in both frames. This process produces a sparse optical flow, which will be densified in the next step.
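The last step, turning correspondences into sparse flow, can be sketched as follows. This is a minimal illustration and not the OpenSfM API: the function name `sparse_flow` and the dictionary layout (3D point id mapped to its image position in each frame) are assumptions made for this example.

```python
def sparse_flow(points_prev, points_next):
    """Sparse flow vectors for reconstructed points observed in both frames.

    points_prev / points_next: dict mapping a point id to its (x, y) image
    position in the first / second frame (hypothetical layout for this sketch).
    Returns a dict: (x, y) in the first frame -> (dx, dy) displacement.
    """
    flow = {}
    for point_id, (x0, y0) in points_prev.items():
        if point_id in points_next:  # keep only points visible in both frames
            x1, y1 = points_next[point_id]
            flow[(x0, y0)] = (x1 - x0, y1 - y0)
    return flow


# Point 1 is seen in both frames, points 2 and 3 in only one of them.
flow = sparse_flow({1: (10, 20), 2: (30, 40)}, {1: (12, 21), 3: (5, 5)})
```

Only pixels backed by a verified correspondence receive a flow vector, which is why the resulting supervision is sparse.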

Flow network.

We train a modified version of the HD3 flow network [53] on the dataset that we generated from KITTI Raw, without any form of pre-training. The main differences with respect to the original implementation are the use of In-Place Activated Batch Norm (iABN) [42], which provides memory savings that enable the second difference, namely the joint training of forward and backward flow. We then run our trained flow network on pairs of consecutive frames in KITTI Raw in order to determine a dense pixel-to-pixel mapping between them. In more detail, for each frame $t$ we compute the backward mapping $\overleftarrow{f}_t:\mathbb{R}^2\to\mathbb{R}^2$, which provides for each pixel of frame $t$ the corresponding pixel in frame $t-1$.
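A backward mapping of this kind lets us warp a mask from the previous frame into the current one: every pixel of the current frame looks up its source pixel in the previous frame. The sketch below assumes nearest-neighbour lookup and a mapping stored as absolute source coordinates; both are illustrative choices, and `warp_mask_backward` is a hypothetical helper name.

```python
def warp_mask_backward(prev_mask, backward_map):
    """Warp a binary mask from the previous frame into the current frame.

    prev_mask: 2D list of 0/1 values (mask in the previous frame).
    backward_map: 2D list shaped like the current frame; entry [i][j] is the
        (row, col) position in the previous frame that pixel (i, j) maps to.
        Storing absolute coordinates is an assumption made for this sketch.
    """
    height, width = len(prev_mask), len(prev_mask[0])
    warped = []
    for row in backward_map:
        warped_row = []
        for src_i, src_j in row:
            # Nearest-neighbour lookup; sources outside the image are background.
            i, j = int(round(src_i)), int(round(src_j))
            inside = 0 <= i < height and 0 <= j < width
            warped_row.append(prev_mask[i][j] if inside else 0)
        warped.append(warped_row)
    return warped


# Toy example: the scene shifts one pixel to the right between the two frames,
# so pixel (i, j) of the current frame comes from (i, j - 1) of the previous one.
prev_mask = [[1, 0, 0],
             [1, 0, 0]]
backward_map = [[(i, j - 1) for j in range(3)] for i in range(2)]
warped = warp_mask_backward(prev_mask, backward_map)
```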

Figure 2: Overview of the MOTSNet architecture. Blue: network backbone; yellow: Region Proposal Head; green: Region Segmentation Head; red: Tracking Head. For an in-depth description of the various components refer to Sec. 4.1 in the text.

Tracklets generation.

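In order to represent the tracklets that we have collected up to frame $t$, we use a graph $\mathcal{G}_t=(\mathcal{V}_t,\mathcal{E}_t)$, where the vertices are all segments that have been extracted until the $t$-th frame, i.e. $\mathcal{V}_t=\bigcup_{j=1}^{t}\mathcal{S}_j$, and the edges $\mathcal{E}_t\subseteq\mathcal{V}_t^2$ provide the matching segments across consecutive frames. We start by initializing the graph in the first frame as $\mathcal{G}_1=(\mathcal{S}_1,\emptyset)$. We construct $\mathcal{G}_t$ for $t>1$ inductively from $\mathcal{G}_{t-1}$ using the segments $\mathcal{S}_t$ extracted at time $t$ and the mapping $\overleftarrow{f}_t$ according to the following procedure. The vertex set $\mathcal{V}_t$ of graph $\mathcal{G}_t$ is given by the union of the vertex set $\mathcal{V}_{t-1}$ of $\mathcal{G}_{t-1}$ and $\mathcal{S}_t$, i.e. $\mathcal{V}_t=\mathcal{V}_{t-1}\cup\mathcal{S}_t$. The edge set $\mathcal{E}_t$ is computed by solving a linear assignment problem between $\mathcal{S}_{t-1}$ and $\mathcal{S}_t$, where the payoff function is constructed by comparing segments in $\mathcal{S}_t$ against segments in $\mathcal{S}_{t-1}$ warped to frame $t$ via the mapping $\overleftarrow{f}_t$. We design the payoff function $\pi(\hat{s},s)$ between a segment $\hat{s}\in\mathcal{S}_{t-1}$ in frame $t-1$ and a segment $s\in\mathcal{S}_t$ in frame $t$ as follows:

$$\pi(\hat{s},s)=\mathrm{IoU}(\phi_s,\phi_{\hat{s}}\circ\overleftarrow{f}_t)+\eta(\hat{s},s),$$

where $\phi_s$ denotes the mask of segment $s$, $\mathrm{IoU}(\cdot,\cdot)$ computes the Intersection-over-Union of the two masks given as input, and $\eta(\hat{s},s)\in\{-\infty,0\}$ is a characteristic function imposing constraints regarding valid mappings. Specifically, $\eta(\hat{s},s)$ returns $0$ if and only if the two segments belong to the same category, i.e. $y_{\hat{s}}=y_s$, and none of the following conditions on $s$ hold:

  • $b_1(s)-b_2(s)<\tau_0$, and

  • $b_1(s)/r(s)<\tau_1$,

where $b_1(s)$ and $b_2(s)$ denote the largest and second-largest areas of intersection with $s$ obtained by segments warped from frame $t-1$ and having class $y_s$, $r(s)$ denotes the area of $s$ that is not covered by segments of class $y_s$ warped from frame $t-1$, and $\tau_0$, $\tau_1$ are thresholds. We solve the linear assignment problem by maximizing the total payoff under a relaxed assignment constraint, which allows segments to remain unassigned. In other terms, we solve the following optimization problem:

\begin{align}
\text{maximize}\quad & \sum_{\hat{s},s}\pi(\hat{s},s)\,\alpha(\hat{s},s) \tag{1}\\
\text{s.t.}\quad & \sum_{s}\alpha(\hat{s},s)\le 1 \quad \forall\hat{s}\in\mathcal{S}_{t-1} \tag{2}\\
& \sum_{\hat{s}}\alpha(\hat{s},s)\le 1 \quad \forall s\in\mathcal{S}_t \tag{3}\\
& \alpha(\hat{s},s)\in\{0,1\} \quad \forall(\hat{s},s)\in\mathcal{S}_{t-1}\times\mathcal{S}_t \tag{4}
\end{align}

Finally, the solution $\alpha^*$ of the assignment problem can be used to update the set of edges $\mathcal{E}_t$, i.e. $\mathcal{E}_t=\mathcal{E}_{t-1}\cup\alpha^{*-1}(1)$, where $\alpha^{*-1}(1)$ is the set of matching segments, i.e. the pre-image of $1$ under $\alpha^*$.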
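The association step can be illustrated with a small, self-contained sketch. It is not the implementation used for the dataset: masks are plain sets of pixel coordinates that are assumed to be already warped into the current frame, the payoff keeps only the IoU term and the same-class constraint (the two ambiguity conditions are omitted), and an exhaustive search stands in for a proper linear assignment solver. The dictionary keys `cls`, `mask` and `warped_mask` are hypothetical names.

```python
import math


def iou(mask_a, mask_b):
    """Intersection-over-Union of two masks given as sets of (row, col) pixels."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0


def payoff(prev_seg, cur_seg):
    """IoU payoff; -inf forbids matches between segments of different classes."""
    if prev_seg["cls"] != cur_seg["cls"]:
        return -math.inf
    return iou(prev_seg["warped_mask"], cur_seg["mask"])


def solve_relaxed_lap(prev_segs, cur_segs):
    """Exhaustively maximise the total payoff; segments may stay unassigned.

    Returns (best_total, matches), matches being (prev_index, cur_index) pairs.
    Exponential in the number of segments, so only suitable for toy inputs.
    """
    def search(k, used):
        if k == len(cur_segs):
            return 0.0, []
        # Option 1: leave current segment k unassigned.
        best_total, best_matches = search(k + 1, used)
        # Option 2: match it with any still-free previous segment.
        for p, prev_seg in enumerate(prev_segs):
            gain = payoff(prev_seg, cur_segs[k])
            if p in used or gain <= 0.0:
                continue  # taken, forbidden (-inf) or useless (no overlap)
            total, matches = search(k + 1, used | {p})
            if gain + total > best_total:
                best_total, best_matches = gain + total, [(p, k)] + matches
        return best_total, best_matches

    return search(0, frozenset())


prev_segs = [
    {"cls": "car", "warped_mask": {(0, 0), (0, 1)}},
    {"cls": "pedestrian", "warped_mask": {(2, 2)}},
]
cur_segs = [
    {"cls": "car", "mask": {(0, 1), (0, 2)}},
    {"cls": "car", "mask": {(2, 2)}},  # overlaps only a pedestrian: stays unassigned
]
total, matches = solve_relaxed_lap(prev_segs, cur_segs)
```

Each returned pair corresponds to one new edge of the tracklet graph; current segments without a match start a new tracklet.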

Note that this algorithm can’t track objects across full occlusions, since optical flow can’t be used to infer the trajectories of invisible objects. When training MOTSNet however, we easily overcome this limitation with a simple heuristic, described in Sec. 4.2. As will be shown in Sec. 5.3, MOTSNet predictions actually outperform the ground truth data it was trained on.

4 MOTSNet


Here we describe our multi-object tracking and segmentation approach. Its main component is MOTSNet (see Sec. 4.1), a deep net based on the Mask R-CNN [12] framework. Given an RGB image, MOTSNet outputs a set of instance segments, each augmented with a learned embedding vector which represents the object’s identity in the sequence (see Sec. 4.2). The object tracks are then reconstructed by applying a LAP-based algorithm, described in Sec. 4.3.

4.1 Architecture

MOTSNet’s architecture follows closely the implementation of Mask R-CNN described in [39], and depicted in Figure 2, with the addition of a Tracking Head (TH) which runs parallel to the Region Segmentation Head (RSH). The “backbone” of the network is composed of an FPN [24] component on top of a ResNet-50 body [13], producing multi-scale features at five different resolutions. These features are fed to a Region Proposal Head (RPH), which predicts candidate bounding boxes where object instances could be located in the image. Instance specific features are then extracted from the FPN levels using ROI Align [12] in the areas defined by the candidate boxes, and fed to the RSH and the TH. Finally, for each region, the RSH predicts a vector of class probabilities, a refined bounding box and an instance segmentation mask. Synchronized InPlace-ABN [42] is employed throughout the network after each layer with learnable parameters, except for those producing output predictions. For an in-depth description of these components we refer the reader to [39].

Tracking Head (TH).

Instance segments, together with the corresponding ROI Align features from the FPN, are fed to the TH to produce a set of $N_e$-dimensional embedding vectors. As a first step, the TH applies “mask-pooling” (described in the section below) to the input spatial features, obtaining 256-dimensional vector features. This operation is followed by a fully connected layer with 128 channels and synchronized InPlace-ABN. Similarly to masks and bounding boxes in the RSH, embedding vectors are predicted in a class-specific manner by a final fully connected layer with $N_e\times C$ outputs, where $C$ is the number of classes, followed by L2-normalization to constrain the vectors on the unit hyper-sphere. The output vectors are learned such that instances of the same object in a sequence are mapped close to each other in the embedding space, while instances of other objects are mapped far away. This is achieved by minimizing a batch hard triplet loss, described in Sec. 4.2.

Mask-pooling.


ROI Align features include both foreground and background information, but, in general, only the foreground is actually useful for recognizing an object across frames. In our setting, by doing instance segmentation together with tracking, we have a source of information readily available to discriminate between the object and its background. This can be exploited in a straightforward attention mechanism: pooling under the segmentation mask. Formally, given the pixel indicator function $\phi_s(i,j)$ for a segment $s$ (see Sec. 3.1), and the corresponding input $N$-dimensional ROI-pooled feature map $X:\mathbb{R}^2\to\mathbb{R}^N$, mask-pooling computes a feature vector $x_s\in\mathbb{R}^N$ as:

$$x_s=\frac{\sum_{(i,j)}\phi_s(i,j)\,X(i,j)}{\sum_{(i,j)}\phi_s(i,j)}.$$

During training we pool under the ground truth segmentation masks, while at inference time we switch to the masks predicted by the RSH.
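Mask-pooling amounts to averaging the feature vectors that fall under the mask. The following sketch uses nested Python lists instead of tensors to stay dependency-free; the function name `mask_pool` and the handling of empty masks are choices made for this illustration.

```python
def mask_pool(features, mask):
    """Average the feature vectors located under a binary mask.

    features: H x W nested list, each entry a list of N floats.
    mask: H x W nested list of 0/1 values.
    """
    n_channels = len(features[0][0])
    pooled = [0.0] * n_channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for feat, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for c in range(n_channels):
                    pooled[c] += feat[c]
    if count == 0:
        # Empty mask: return zeros instead of dividing by zero
        # (a choice made for this sketch).
        return pooled
    return [value / count for value in pooled]


features = [[[1.0, 10.0], [3.0, 30.0]],
            [[5.0, 50.0], [7.0, 70.0]]]
mask = [[1, 0],
        [0, 1]]
pooled = mask_pool(features, mask)  # averages the features at (0, 0) and (1, 1)
```

Replacing `mask` with an all-ones array turns the same function into plain average pooling over the whole box.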

4.2 Training Losses

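MOTSNet is trained by minimizing the following loss function:

$$L=L_{\mathrm{TH}}+\lambda\,(L_{\mathrm{RPH}}+L_{\mathrm{RSH}}),$$

where $L_{\mathrm{RPH}}$ and $L_{\mathrm{RSH}}$ represent the Region Proposal Head and Region Segmentation Head losses as defined in [39], and $\lambda$ is a weighting parameter.

$L_{\mathrm{TH}}$ is the loss component associated with the Tracking Head, i.e. the following batch hard triplet loss [15]:

$$L_{\mathrm{TH}}=\frac{1}{|\mathcal{S}_B|}\sum_{s\in\mathcal{S}_B}\max\Big(\max_{\hat{s}\in\mu_B(s)}\big\|a_s^{y_s}-a_{\hat{s}}^{y_{\hat{s}}}\big\|-\min_{\hat{s}\in\bar{\mu}_B(s)}\big\|a_s^{y_s}-a_{\hat{s}}^{y_{\hat{s}}}\big\|+\beta,\;0\Big),$$

where $\mathcal{S}_B$ is the set of predicted, positive segments in the current batch, $a_s^{y_s}$ is the class-specific embedding vector predicted by the TH for a certain segment $s$ and ground truth class $y_s$, and $\beta$ is a margin parameter. The definition of a “positive” segment follows the same logic as in the RSH (see [39]), i.e. a segment is positive if its bounding box has high IoU with a ground truth segment’s bounding box. The functions $\mu_B(s)$ and $\bar{\mu}_B(s)$ map $s$ to the sets of its “matching” and “non-matching” segments in $\mathcal{S}_B$, respectively. These are defined as:

$$\mu_B(s)=\{\hat{s}\in\mathcal{S}_B : y_{\hat{s}}=y_s\wedge M(s,\hat{s})\},\qquad \bar{\mu}_B(s)=\{\hat{s}\in\mathcal{S}_B : y_{\hat{s}}=y_s\wedge\neg M(s,\hat{s})\},$$

where $M(s,\hat{s})$ is true iff $s$ and $\hat{s}$ belong to the same tracklet in the sequence. Note that we are restricting the loss to only compare embedding vectors of segments belonging to the same class, effectively letting the network learn a set of class-specific embedding spaces.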
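As an illustration, a batch hard triplet loss restricted to same-class comparisons can be written in a few lines of plain Python. This is a sketch under stated assumptions rather than the training code: embeddings are lists of floats, distances are Euclidean, and anchors that have no non-matching segment of their class are skipped (the text does not say how such anchors are handled).

```python
import math


def batch_hard_triplet_loss(embeddings, classes, track_ids, margin):
    """Batch hard triplet loss restricted to same-class comparisons.

    embeddings: list of embedding vectors, one per positive segment.
    classes, track_ids: per-segment class label and tracklet identifier.
    """
    losses = []
    for a in range(len(embeddings)):
        same_track, other_track = [], []
        for b in range(len(embeddings)):
            if classes[b] != classes[a]:
                continue  # only compare segments of the same class
            dist = math.dist(embeddings[a], embeddings[b])
            if track_ids[b] == track_ids[a]:
                same_track.append(dist)
            else:
                other_track.append(dist)
        if not other_track:
            continue  # assumption: anchors without any negative are skipped
        hardest_pos = max(same_track)  # includes the anchor itself (distance 0)
        hardest_neg = min(other_track)
        losses.append(max(hardest_pos - hardest_neg + margin, 0.0))
    return sum(losses) / len(losses) if losses else 0.0


embeddings = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0], [10.0, 10.0]]
classes = ["car", "car", "car", "pedestrian"]
track_ids = [1, 1, 2, 3]
loss = batch_hard_triplet_loss(embeddings, classes, track_ids, margin=0.5)
```

For every anchor, only the farthest matching segment and the closest non-matching segment contribute, which is what makes the loss "batch hard".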

In order to overcome the issue due to occlusions mentioned in Sec. 3.2, when training on KITTI Synth we also apply the following heuristic. Ground truth segments are only considered in the tracking loss if they are associated with a tracklet that appears in more than half of the frames in the current batch. This ensures that, when an occlusion in the scene causes a tracklet to split, the two pieces are never mistakenly treated by the network as two different objects.
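The filtering rule can be sketched as below, assuming each frame of the batch is described by the collection of tracklet ids visible in it; `tracklets_for_loss` is a hypothetical helper name.

```python
from collections import Counter


def tracklets_for_loss(frame_track_ids):
    """Return the tracklet ids that appear in more than half of the batch frames.

    frame_track_ids: one collection of tracklet ids per frame in the batch.
    """
    n_frames = len(frame_track_ids)
    # Count in how many frames each tracklet id occurs (at most once per frame).
    counts = Counter(tid for ids in frame_track_ids for tid in set(ids))
    return {tid for tid, n in counts.items() if n > n_frames / 2}


batch = [{"a", "b"}, {"a"}, {"a", "c"}, {"a", "c"}]
kept = tracklets_for_loss(batch)
```

Two pieces of a tracklet split by an occlusion cannot both cover more than half of the frames in a batch, so at most one of them enters the loss.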

4.3 Inference

During inference, we feed frames $t$ through the network, obtaining a set of predicted segments $\mathcal{S}_t$, each with its predicted class $y_s$ and embedding vector $a_s^{y_s}$. Our objective is now to match the segments across frames in order to reconstruct a tracklet for each object in the sequence. To do this, we follow the same algorithmic framework described in Sec. 3.2, with a couple of modifications. First, we allow matching segments between the current frame and a sliding window of $N_w$ frames in the past. Second, we do not rely on optical flow anymore, instead measuring the similarity between segments in terms of their distance in the embedding space and temporal offset.

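More formally, we redefine the payoff function in Eq. (1) as:

$$\pi(\hat{s},s)=-\pi^*(\hat{s},s)+\eta(\hat{s},s),\qquad \pi^*(\hat{s},s)=\big\|a_s^{y_s}-a_{\hat{s}}^{y_{\hat{s}}}\big\|+\frac{|t_s-t_{\hat{s}}|}{N_w},$$

where $a_s^{y_s}$ is the embedding vector of segment $s$, $t_s$ its frame, $y_s$ its class, $N_w$ the size of the sliding window, and the characteristic function $\eta$ is redefined as:

$$\eta(\hat{s},s)=\begin{cases}0 & \text{if } y_s=y_{\hat{s}} \text{ and } \pi^*(\hat{s},s)\le\tau,\\ -\infty & \text{otherwise,}\end{cases}$$

and $\tau$ is a configurable threshold value. Furthermore, we replace $\mathcal{S}_{t-1}$ in Eq. (2) and Eq. (4) with the set:

$$\mathcal{U}_t=\{\hat{s}\in\mathcal{V}_{t-1} : t_{\hat{s}}\ge t-N_w\wedge(\hat{s},s)\notin\mathcal{E}_{t-1}\ \forall s\in\mathcal{V}_{t-1}\},$$

which contains all terminal (i.e. most recent) segments of the tracklets seen in the past $N_w$ frames. As a final step after processing all frames in the sequence, we discard the tracklets with fewer than $N_t$ segments, as, empirically, very short tracks usually arise from spurious detections.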
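The inference-time payoff and the final filtering step can be sketched as follows. The way embedding distance and temporal offset are combined here (a sum, with the offset normalised by the window size) is an assumption of this example, as are the dictionary keys `cls`, `emb` and `frame` and both function names.

```python
import math


def inference_payoff(prev_seg, cur_seg, window, threshold):
    """Payoff between the terminal segment of a past tracklet and a new segment.

    Combines embedding distance and temporal offset; returns -inf when the
    two segments have different classes or their cost exceeds the threshold.
    """
    if prev_seg["cls"] != cur_seg["cls"]:
        return -math.inf
    cost = (math.dist(prev_seg["emb"], cur_seg["emb"])
            + abs(cur_seg["frame"] - prev_seg["frame"]) / window)
    return -cost if cost <= threshold else -math.inf


def drop_short_tracklets(tracklets, min_length):
    """Discard tracklets with fewer than min_length segments."""
    return {tid: segs for tid, segs in tracklets.items() if len(segs) >= min_length}


prev_seg = {"cls": "car", "emb": [0.0, 0.0], "frame": 3}
cur_seg = {"cls": "car", "emb": [0.3, 0.4], "frame": 5}
accepted = inference_payoff(prev_seg, cur_seg, window=4, threshold=1.5)
rejected = inference_payoff(prev_seg, cur_seg, window=4, threshold=0.8)
long_tracks = drop_short_tracklets({1: ["s1", "s2", "s3"], 2: ["s4"]}, min_length=2)
```

Because valid payoffs are negative costs, a solver for the assignment problem has to treat `-inf` entries as forbidden pairs while still allowing segments to remain unmatched.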

5 Experiments

We provide a broad experimental evaluation in which we i) assess the quality of our automatically harvested KITTI Synth dataset on the MOTS task by evaluating it against the KITTI MOTS dataset, ii) demonstrate the effectiveness of MOTSNet and our proposed mask-pooling layer against strong baselines on the KITTI MOTS and MOTSChallenge datasets, iii) demonstrate the generality of our MOTS label generation process by extending the BDD100k tracking dataset with segmentation masks to obtain a MOTS variant thereof, and iv) provide an ablation on the contribution of the association terms to the final tracklet extraction. While we can benchmark the full MOTS performance on KITTI MOTS and MOTSChallenge, on BDD100k we can only assess tracking performance, based on the available ground truth box-based tracking annotations.

Method                                      Pre-training     sMOTSA        MOTSA         MOTSP         mAP
                                                             Car    Ped    Car    Ped    Car    Ped    Box    Mask
KITTI Synth (val) + HD3 [53] model zoo      inference only   61.4   45.0   73.3   65.6   87.6   76.6   –      –
KITTI Synth (val) + HD3, KITTI-SfM – tuned  inference only   65.6   45.6   77.4   66.1   87.6   76.6   –      –
MOTSNet with:
   AveBox+TH                                I                73.7   46.4   85.8   62.8   86.7   76.7   57.4   50.9
   AveMsk-TH                                I                76.4   44.0   88.5   60.3   86.8   76.6   57.8   51.3
   AveBox-TH                                I                75.4   44.5   87.3   60.8   86.9   76.7   57.5   51.0
   KITTI MOTS train sequences only          I                72.6   45.1   84.9   62.9   86.1   75.6   52.5   47.6
MOTSNet                                     I                77.6   49.1   89.4   65.6   87.1   76.4   58.1   51.8
MOTSNet                                     I, M             77.8   54.5   89.7   70.9   87.1   78.2   60.8   54.1
Table 1: Results on the KITTI MOTS validation set when training on KITTI Synth. First section: ablation results. Second section: main results and comparison with state of the art. Note that all MOTSNet variants are trained exclusively on machine generated annotations, while TrackR-CNN is trained on human annotated ground truth.

5.1 Evaluation Measures

The CLEAR MOT [2] metrics, including Multi-Object Tracking Accuracy (MOTA) and Precision (MOTP), are well established as a standard set of measures for evaluating Multi-Object Tracking systems. These, however, only account for the bounding box of the tracked objects. Voigtlaender et al. [47] describe an extension of MOTA and MOTP that measures segmentation as well as tracking accuracy, proposing the sMOTSA, MOTSA and MOTSP metrics. In particular, MOTSA and MOTSP are direct equivalents of MOTA and MOTP, respectively, where the prediction-to-ground-truth matching process is formulated in terms of mask IoU instead of bounding box IoU. Finally, sMOTSA can be regarded as a “soft” version of MOTSA where the contribution of each true positive segment is weighted by its mask IoU with the corresponding ground truth segment. Please refer to Appendix A for a formal description of these metrics.
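To make the relationship between the three measures concrete, the sketch below computes them from already-matched results, following the definitions of [47] as we understand them (Appendix A remains the authoritative description). The function name and its arguments are hypothetical; the matching of predictions to ground truth masks is assumed to have been done beforehand.

```python
def mots_metrics(tp_ious, num_fp, num_ids, num_gt):
    """Compute (sMOTSA, MOTSA, MOTSP) from matched results.

    tp_ious: mask IoU of every true positive with its matched ground truth mask.
    num_fp: number of false positives; num_ids: number of identity switches.
    num_gt: number of ground truth masks.
    """
    num_tp = len(tp_ious)
    soft_tp = sum(tp_ious)  # true positives weighted by their mask IoU
    motsa = (num_tp - num_fp - num_ids) / num_gt
    smotsa = (soft_tp - num_fp - num_ids) / num_gt
    motsp = soft_tp / num_tp if num_tp else 0.0
    return smotsa, motsa, motsp


smotsa, motsa, motsp = mots_metrics([0.9, 0.8, 0.7, 0.6], num_fp=1, num_ids=0, num_gt=5)
```

Since every IoU is at most 1, sMOTSA can never exceed MOTSA, and the gap between them reflects mask quality.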

5.2 Experimental Setup

All the results in the following sections are based on the same MOTSNet configuration, with a ResNet-50 backbone and embedding dimensionality fixed to . Batches are formed by sampling a subset of full-resolution, contiguous frames at a random offset in one of the training sequences, for each GPU. During training we apply scale augmentation in the range and random flipping. Training by SGD uses the following linear schedule: $\mathrm{lr}_i=\mathrm{lr}_0\,(1-i/\#\mathrm{steps})$, where $\mathrm{lr}_0$ is the starting learning rate and $i$ is the training step (i.e. batch) index. Please refer to Appendix B for a full specification of the training parameters. The network weights are initialized from an ImageNet-pretrained model (“I” in the tables), or from a Panoptic Segmentation model trained on the Mapillary Vistas dataset (“M” in the tables), depending on the experiment. Differently from [47], we do not pre-train on MS COCO [25] (“C” in the tables).
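A linear learning rate schedule of this kind can be sketched in one function. The sketch assumes that the rate decays to zero at the final training step; the values used in the example are placeholders, not the settings of Appendix B.

```python
def linear_lr(step, total_steps, base_lr):
    """Learning rate decaying linearly from base_lr at step 0 to zero at total_steps."""
    return base_lr * (1.0 - step / total_steps)


# Placeholder values for illustration only.
schedule = [linear_lr(step, total_steps=4, base_lr=0.02) for step in range(5)]
```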

The hyper-parameters of the inference algorithm are set as follows: threshold ; window size equal to the per-GPU batch size used during training; minimum tracklet length . Note that, in contrast with [47], we do not fine-tune these parameters, instead keeping them fixed in all experiments, and independent of class. All experiments are run on four V100 GPUs with 32GB of memory.

5.3 KITTI MOTS

The KITTI MOTS dataset was introduced in [47], adding instance segmentation masks for cars and pedestrians to a subset of 21 sequences from KITTI Raw [11]. It has a total of 8,008 images (5,027 for training and 2,981 for validation), containing approximately 18.8k/8.1k annotated masks for 431/99 tracklets of cars/pedestrians in training, and roughly 8.1k/3.3k masks for 151/68 tracklets on the validation split. Approximately of all masks were manually annotated and the rest are human-verified and -corrected instance masks. As this dataset is considerably larger and presumably more informative than MOTSChallenge, we focused our ablations on it.

Table 1 summarizes the analyses and experiments we ran on our newly generated KITTI Synth dataset, by evaluating on the official KITTI MOTS validation set. We discuss three types of ablations in what follows.

KITTI Synth data quality analysis.

The two topmost rows of Tab. 1 show the results obtained when generating the validation set results solely with our proposed dataset generation pipeline, described in Section 3, i.e. no learned tracking component is involved. Using the optical flow predictions of HD3 [53] with their best-performing model for KITTI (pre-trained models available at https://github.com/ucbdrive/hd3), we obtain reasonable baseline results. When fine-tuning HD3 with the optical flow training data generated from SfM as described in Section 3, our synthetically extracted validation data considerably outperforms CAMOT [35] (as reported in [47]) in terms of sMOTSA on pedestrians, and almost matches its performance on cars. While this validates our tracklet generation process and the way we harvest optical flow training data, we further investigate whether MOTSNet can learn from imperfect data like KITTI Synth, where tracklets are prone to break when associated via optical flow.

MOTSNet tracking head ablations.

The center block in Tab. 1 compares different configurations for the Tracking Head described in Sec. 4.1. The first variant (AveBox+TH) replaces the mask-pooling operation in the Tracking Head with average pooling on the whole box. The second and third variants (AveMsk-TH, AveBox-TH) completely remove the Tracking Head and its associated loss, instead directly computing the embeddings by pooling features under the detected mask or box, respectively. All variants perform reasonably well and improve over [35, 34] on the primary sMOTSA metric. Notably, AveMsk-TH, i.e. the variant using our mask-pooling layer and no tracking head, is about on par with TrackR-CNN on cars despite being pre-trained only on ImageNet, using a smaller ResNet-50 backbone and not exploiting any tracking supervision. The last variant in this block shows the performance obtained when our full model is trained only on the KITTI Synth images that also belong to the KITTI MOTS training set. Interestingly, the scores only drop to an extent where the gap to [47] might be attributed to their additional pre-training on MS COCO and Mapillary Vistas and a larger backbone (cf. Tab. 2). Finally, comparing the latter directly to our MOTSNet results from Tab. 2 (ImageNet pre-trained only), where we directly trained on KITTI MOTS training data, our KITTI Synth-based results improve substantially on cars, again confirming the quality of our automatically extracted dataset.

The bottommost part of this table shows the performance when training on the full KITTI Synth dataset and evaluating on KITTI MOTS validation, using different pre-training settings. The first is pre-trained on ImageNet and the second on both ImageNet and Mapillary Vistas. While there is only a small gap between them on the car class, the performance on pedestrians rises by over 5% when pre-training on Vistas. The most encouraging finding is that, using solely our automatically extracted KITTI Synth dataset, we obtain significantly improved scores (+1.6% on cars, +7.4% on pedestrians) in comparison to the previous state-of-the-art method from [47], trained on manually curated data.


In Tab. 2 we directly compare against previously top-performing works including TrackR-CNN [47] and references therein, e.g. CAMOT [35], CIWT [34], and BeyondPixels [44]. It is worth noting that the results of the last three approaches are those reported in [47], which use masks and detections from [47]. We provide all relevant MOTS measures together with separate average AP numbers to estimate the quality of bounding box detections or instance segmentation approaches, respectively. For our MOTSNet we again show results under different pre-training settings (ImageNet, Mapillary Vistas and KITTI Synth). Different pre-trainings affect the classes cars and pedestrians in different ways, i.e. the ImageNet and Vistas pre-trained model performs better on pedestrians while the KITTI Synth pre-trained body does better on cars. This is most likely due to the imbalanced distribution of samples in KITTI Synth, while our variant with pre-training on both Vistas and KITTI Synth yields the overall best results. We obtain absolute improvements of 1.9%/7.5% for cars/pedestrians over the previously best work in [47] despite using a smaller backbone and no pre-training on MS COCO. Finally, the recognition metrics for both detection (box mAP) and instance segmentation (mask mAP) also benefit significantly from pre-training on our KITTI Synth dataset, and in conjunction with Vistas rise by 7.2% and 6.4% over the ImageNet pre-trained variant, respectively.

Method              Pre-training   sMOTSA        MOTSA         MOTSP         mAP
                                   Car    Ped    Car    Ped    Car    Ped    Box    Mask
TrackR-CNN [47]     I, C, M        76.2   47.1   87.8   65.5   87.2   75.7   –      –
CAMOT [35]          I, C, M        67.4   39.5   78.6   57.6   86.5   73.1   –      –
CIWT [34]           I, C, M        68.1   42.9   79.4   61.0   86.7   75.7   –      –
BeyondPixels [44]   I, C, M        76.9   –      89.7   –      86.5   –      –      –
MOTSNet             I              69.0   45.4   78.7   61.8   88.0   76.5   55.2   49.3
                    I, M           74.9   53.1   83.9   67.8   89.4   79.4   60.8   54.9
                    I, KS          76.4   48.1   86.2   64.3   88.7   77.2   59.7   53.3
                    I, M, KS       78.1   54.6   87.2   69.3   89.6   79.7   62.4   55.7
Table 2: Results on the KITTI MOTS validation set when training on the KITTI MOTS training set. First section: state of the art results using masks and detections from [47]. Second section: our results under different pre-training settings.

5.4 MOTSNet on BDD100k

Since our MOTS training data generation pipeline can be directly ported to other datasets, we also conducted experiments on the recently released BDD100k tracking dataset [54]. It comprises 61 sequences, for a total of k frames, and comes with bounding-box based tracking annotations for a total of 8 classes (in our evaluation we focus only on cars and pedestrians, for compatibility with KITTI Synth and KITTI MOTS). We split the available data into training and validation sets of 50 and 11 sequences, respectively, and generate segmentation masks for each annotated bounding box, again using [39] augmented with a ResNeXt-101-32×8d backbone. During data generation, the instance segmentation pipeline detected many object instances that are missing in the annotations, which is why we decided to provide ignore box annotations for detections with very high confidence that are missing in the ground truth. Since only bounding-box based tracking annotations are available, we present MOT rather than MOTS tracking scores. The results are listed in Tab. 3, comparing the pooling strategies AveBox+Loss and AveMsk+Loss, again as a function of different pre-training settings. We again find that our proposed mask-pooling compares favorably against the vanilla box-based pooling. While the improvement is small for the ImageNet-only pre-trained backbone (+0.1 MOTA), the Vistas pre-trained model benefits noticeably (+1.3 MOTA).
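The ignore-box step described above can be sketched as follows. This is only an illustration under stated assumptions: the text says "very high confidence" without giving a value, so the score threshold (0.9) and the IoU threshold used to decide that a detection is "missing in the ground truth" (0.5) are placeholders, and the function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Standard intersection over union between two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def find_ignore_boxes(
    detections: List[Tuple[Box, float]],
    gt_boxes: List[Box],
    score_thr: float = 0.9,  # assumed value, not from the paper
    iou_thr: float = 0.5,    # assumed value, not from the paper
) -> List[Box]:
    """Return confident detections that overlap no annotated box."""
    ignore = []
    for box, score in detections:
        if score < score_thr:
            continue  # not confident enough to be treated as a missed object
        if all(box_iou(box, gt) < iou_thr for gt in gt_boxes):
            ignore.append(box)
    return ignore
```

Boxes returned by such a filter would be excluded from the evaluation rather than counted as false positives.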

| MOTSNet variant | Pre-training | MOTA | MOTP |
|---|---|---|---|
| AveBox+Loss | I | 53.8 | 83.1 |
| AveBox+Loss | I, M | 56.9 | 83.9 |
| AveMsk+Loss | I | 53.9 | 83.1 |
| AveMsk+Loss | I, M | 58.2 | 84.0 |
Table 3: MOT ablation results on the BDD100K dataset.

5.5 MOTSNet on MOTSChallenge

In Tab. 4 we present our results on MOTSChallenge, the second dataset contributed in [47], and again compare against all related works reported therein. This dataset comprises 4 sequences, a total of 2,862 frames and 228 tracks with roughly 27k pedestrians, and is thus significantly smaller than KITTI MOTS. Due to the smaller size, the evaluation in [47] runs leave-one-out cross validation on a per-sequence basis. We again report numbers for differently pre-trained versions of MOTSNet. The importance of segmentation pre-training on such small datasets is quite evident here: while MOTSNet (I) shows the overall worst performance, its COCO pre-trained version jumps ahead of all baselines (+4.1 sMOTSA over TrackR-CNN [47]).
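A minimal sketch of per-sequence leave-one-out splitting, as used in the protocol above, is given below. The sequence identifiers are placeholders, and how the per-fold scores are aggregated is not specified here, so the sketch only produces the splits.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_out_splits(
    sequences: Sequence[str],
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train, validation) splits, holding out one sequence at a time."""
    for held_out in sequences:
        train = [s for s in sequences if s != held_out]
        yield train, [held_out]


# Placeholder sequence names; with 4 sequences this gives 4 folds,
# each training on 3 sequences and evaluating on the remaining one.
for train, val in leave_one_out_splits(["seq_a", "seq_b", "seq_c", "seq_d"]):
    print("train:", train, "val:", val)
```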

| Method | Pre-training | sMOTSA | MOTSA | MOTSP |
|---|---|---|---|---|
| TrackR-CNN [47] | I, C, M | 52.7 | 66.9 | 80.2 |
| MHT-DAM [20] | I, C, M | 48.0 | 62.7 | 79.8 |
| FWT [14] | I, C, M | 49.3 | 64.0 | 79.7 |
| MOTDT [26] | I, C, M | 47.8 | 61.1 | 80.0 |
| jCC [19] | I, C, M | 48.3 | 63.0 | 79.9 |
| MOTSNet | I | 41.8 | 55.2 | 78.4 |
| MOTSNet | I, C | 56.8 | 69.4 | 82.7 |
Table 4: Results on the MOTSChallenge dataset. First section: state of the art results using masks from [47]. Second section: our results under different pre-training settings.

5.6 Ablations on Tracklet Extraction

Here we analyze the importance of different cues in the payoff function used during inference (see Section 4.3): distance in the embedding space (Embedding), distance in time (Time) and signed intersection over union [46] (sIoU) between bounding boxes. Please refer to Appendix C for more details on the use of sIoU. As a basis, we take the best performing MOTSNet model from our experiments on KITTI MOTS, listed at the bottom of Tab. 2. This result was obtained by combining Embedding and Time, as in Eq. (C). As can be seen from the differently configured results in Tab. 5, the embedding alone already serves as a good cue. Combining it with information about proximity in time slightly improves the results on pedestrians, at the cost of a small drop on cars, while outperforming sIoU. Figure 3 shows a visualization of the embedding vectors learned by this model.

sIoU Embedding Time sMOTSA Car sMOTSA Ped
76.7 50.8
78.2 54.4
77.0 51.8
78.1 54.6
Table 5: Ablation results on the KITTI MOTS dataset, using our best performing model from Tab. 2 when switching between different cues in the payoff function for inference.
Figure 3: t-SNE visualization of the embedding vectors computed by the Tracking Head for sequence “0014” of KITTI MOTS. Points corresponding to detections of the same object have the same color.

6 Conclusions

In this work we provided two major contributions for the novel task of joint multi-object tracking and segmentation (MOTS). First, we introduced an automated pipeline for extracting high-quality training data from generic street-level videos, in order to overcome the lack of MOTS training data without time- and cost-intensive manual annotation efforts. Data is generated by solving a linear assignment problem on a causal tracklet graph, where per-frame instance segmentations define the nodes and optical-flow based compatibilities represent the edges, i.e. connections over time. Our second major contribution is MOTSNet, a deep-learning based architecture to be trained on MOTS data, which exploits a novel mask-pooling layer that guides the association process for detections based on instance segmentation masks. We provide exhaustive ablations for both our novel training data generation process and our proposed MOTSNet, yielding cumulative, absolute improvements of 1.9%/7.5% for cars/pedestrians over the previously best work in [47] on the KITTI MOTS dataset.

Appendix A CLEAR MOT and MOTS metrics

The CLEAR MOT metrics (including MOTA and MOTP) were first defined in [2] to evaluate multi-object tracking systems. To compute MOTA and MOTP, a matching between the ground truth and predicted tracklets needs to be computed at each frame by solving a linear assignment problem (LAP). We will not repeat the details of this process, and instead focus on the way its outputs are used to compute the metrics. In particular, for each frame $t$, the matching process gives:

  • the number of correctly matched boxes $TP_t$;

  • the number of mismatched boxes $IDS_t$, i.e. the boxes belonging to a predicted tracklet that was matched to a different ground truth tracklet in the previous frame;

  • the number of false positive boxes $FP_t$, i.e. the predicted boxes that are not matched to any ground truth;

  • the number of ground truth boxes $GT_t$;

  • the intersection over union $\mathrm{IoU}_{t,i}$ between each correctly predicted box $i$ and its matching ground truth.

Given these, the metrics are defined as:

$$\mathrm{MOTA} = \frac{\sum_t \left(TP_t - FP_t - IDS_t\right)}{\sum_t GT_t}, \qquad \mathrm{MOTP} = \frac{\sum_{t,i} \mathrm{IoU}_{t,i}}{\sum_t TP_t}.$$

The MOTS metrics [47] extend the CLEAR MOT metrics to the segmentation case. Their computation follows the overall procedure described above, with two exceptions. First, box IoU is replaced with mask IoU. Second, the matching process is simplified by defining a ground truth and a predicted segment to be matching if and only if their IoU is greater than 0.5. Unlike in the bounding box case, the segmentation masks are assumed to be non-overlapping, meaning that this criterion results in a unique matching without the need to solve a LAP. With these changes, and given the definitions above (counting segments instead of boxes), the MOTS metrics are:

$$\mathrm{MOTSA} = \frac{\sum_t \left(TP_t - FP_t - IDS_t\right)}{\sum_t GT_t}, \qquad \mathrm{sMOTSA} = \frac{\sum_t \left(\sum_i \mathrm{IoU}_{t,i} - FP_t - IDS_t\right)}{\sum_t GT_t}, \qquad \mathrm{MOTSP} = \frac{\sum_{t,i} \mathrm{IoU}_{t,i}}{\sum_t TP_t}.$$
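As an illustration, the MOTS metrics can be accumulated from per-frame matching outputs as in the sketch below. It assumes the definitions of [47]: MOTSA counts matched segments minus false positives and ID switches, normalized by the number of ground truth segments; sMOTSA replaces the count of matched segments with the sum of their mask IoUs; MOTSP is the mean mask IoU over matched segments. The `FrameStats` container and its field names are hypothetical, and the matching itself (mask IoU greater than 0.5) is assumed to have been done already.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class FrameStats:
    """Per-frame matching outputs (illustrative names, not from the paper)."""
    matched_ious: Sequence[float]  # mask IoU of each correctly matched segment
    num_fp: int                    # predictions not matched to any ground truth
    num_ids: int                   # matched segments whose track identity switched
    num_gt: int                    # ground truth segments in the frame


def mots_metrics(frames: List[FrameStats]) -> Dict[str, float]:
    """Accumulate MOTSA, sMOTSA and MOTSP over all frames of a sequence."""
    tp = sum(len(f.matched_ious) for f in frames)
    soft_tp = sum(sum(f.matched_ious) for f in frames)
    fp = sum(f.num_fp for f in frames)
    ids = sum(f.num_ids for f in frames)
    gt = sum(f.num_gt for f in frames)
    if gt == 0:
        raise ValueError("metrics are undefined without ground truth segments")
    return {
        "MOTSA": (tp - fp - ids) / gt,
        "sMOTSA": (soft_tp - fp - ids) / gt,
        "MOTSP": soft_tp / tp if tp > 0 else float("nan"),
    }
```

Because every matched segment has IoU at most 1, sMOTSA can never exceed MOTSA, which is consistent with the values reported in Tab. 2 and Tab. 4.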


Appendix B Hyper-parameters

B.1 Data Generation

In the data generation process, when constructing tracklets (see Sec. 3.2) we use the following parameters: , , .

B.2 Network Training

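As mentioned in Sec. 5.2, all our trainings follow a linear learning rate schedule, with $\mathrm{lr}_i$ the learning rate at training step $i$, $\mathrm{lr}_0$ the initial learning rate and $N$ the total number of steps:

$$\mathrm{lr}_i = \mathrm{lr}_0 \left(1 - \frac{i}{N}\right),$$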
where the initial learning rate and total number of steps depend on the dataset and pre-training setting. The actual values, together with the per-GPU batch sizes are reported in Tab. 6. The loss weight parameter in the first equation of Sec. 4.2 is fixed to in all experiments, except for the COCO pre-trained experiment on MOTSChallenge, where .
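A linear schedule of this kind can be sketched in a few lines. The sketch assumes that the learning rate decays linearly from its initial value to zero at the final step, which is the common reading of "linear schedule" but is an assumption here; the function and argument names are illustrative.

```python
def linear_lr(step: int, total_steps: int, initial_lr: float) -> float:
    """Learning rate at a given step, decaying linearly from initial_lr to 0."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that out-of-range steps do not produce negative or inflated rates.
    step = min(max(step, 0), total_steps)
    return initial_lr * (1.0 - step / total_steps)


# Example with an initial learning rate of 0.02 (a value appearing in Tab. 6)
# and an arbitrary, illustrative number of steps.
for step in (0, 250, 500, 750, 1000):
    print(step, linear_lr(step, total_steps=1000, initial_lr=0.02))
```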

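| Dataset | Pre-training | Initial learning rate | # epochs | Per-GPU batch size |
|---|---|---|---|---|
| KITTI Synth | I | 0.02 | 20 | 12 |
| KITTI Synth | M | 0.01 | 10 | 12 |
| KITTI Synth, KITTI MOTS sequences | I | 0.02 | 180 | 12 |
| KITTI MOTS | I | 0.02 | 180 | 12 |
| KITTI MOTS | M, KS | 0.01 | 90 | 12 |
| BDD100k | all | 0.02 | 100 | 10 |
| MOTSChallenge | I | 0.01 | 90 | 10 |
| MOTSChallenge | C | 0.02 | 180 | 10 |

Table 6: MOTSNet training hyperparameters for different datasets and pre-training settings.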

Appendix C Signed Intersection over Union

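Signed intersection over union, as defined in [46], extends standard intersection over union between bounding boxes by providing meaningful values when the input boxes do not intersect. Given two bounding boxes $b = (u_1, v_1, u_2, v_2)$ and $\hat{b} = (\hat{u}_1, \hat{v}_1, \hat{u}_2, \hat{v}_2)$, where $(u_1, v_1)$ and $(u_2, v_2)$ are the coordinates of a box's top-left and bottom-right corners, respectively, the signed intersection over union $\mathrm{sIoU}(b, \hat{b})$ is:

  • greater than 0 and equal to standard intersection over union when the boxes overlap;

  • less than 0 when the boxes don't overlap, and monotonically decreasing as their distance increases.

This is obtained by defining:

$$\mathrm{sIoU}(b, \hat{b}) = \frac{|b \sqcap \hat{b}|_\pm}{|b| + |\hat{b}| - |b \sqcap \hat{b}|_\pm},$$

where

$$b \sqcap \hat{b} = \left(\max(u_1, \hat{u}_1),\ \max(v_1, \hat{v}_1),\ \min(u_2, \hat{u}_2),\ \min(v_2, \hat{v}_2)\right)$$

is an extended intersection operator, $|b|$ denotes the area of $b$, and

$$|b|_\pm = \begin{cases} +\,|(u_2 - u_1)(v_2 - v_1)| & \text{if } u_2 > u_1 \text{ and } v_2 > v_1, \\ -\,|(u_2 - u_1)(v_2 - v_1)| & \text{otherwise} \end{cases}$$

is the “signed area” of $b$.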
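The behaviour described above can be checked with a small sketch. It assumes the formulation of [46] as we understand it: the extended intersection takes the element-wise maximum of the top-left corners and minimum of the bottom-right corners (possibly yielding an "inverted" box), the signed area is negative for inverted boxes, and sIoU divides the signed intersection area by the sum of the box areas minus that signed area. Function names are illustrative.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (u1, v1, u2, v2): top-left, bottom-right


def area(b: Box) -> float:
    """Area of a valid box (u2 > u1 and v2 > v1)."""
    return (b[2] - b[0]) * (b[3] - b[1])


def signed_area(b: Box) -> float:
    """Area with a negative sign when the box is inverted along any axis."""
    w, h = b[2] - b[0], b[3] - b[1]
    magnitude = abs(w * h)
    return magnitude if (w > 0 and h > 0) else -magnitude


def extended_intersection(a: Box, b: Box) -> Box:
    """Intersection box; inverted (u2 < u1 or v2 < v1) when a and b are disjoint."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def signed_iou(a: Box, b: Box) -> float:
    """Signed IoU: equals standard IoU for overlapping boxes, negative otherwise."""
    if area(a) <= 0 or area(b) <= 0:
        raise ValueError("input boxes must have positive area")
    inter = signed_area(extended_intersection(a, b))
    return inter / (area(a) + area(b) - inter)
```

For two unit boxes separated horizontally, the value becomes more negative as the gap grows, which is what makes it usable as a distance-aware cue for non-overlapping detections.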

Signed intersection over union is used in the ablation experiments of Sec. 5.6 as an additional term in the payoff function as follows:

where denotes the bounding box of segment .


  • [1] A. Berg, J. Johnander, F. Durand de Gevigney, J. Ahlberg, and M. Felsberg (2019) Semi-automatic annotation of objects in visual-thermal video. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Cited by: §2.
  • [2] K. Bernardin and R. Stiefelhagen (2008) Evaluating multiple object tracking performance: the clear mot metrics. Journal on Image and Video Processing, pp. 1. Cited by: Appendix A, §5.1.
  • [3] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr (2016) Fully-convolutional siamese networks for object tracking. In Proceedings of the European Conference on Computer Vision, Cited by: §2.
  • [4] S. Caelles, K. Maninis, J. Pont-Tuset, L. Leal-Taixe, D. Cremers, and L. Van Gool (2017) One-shot video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • [5] S. Caelles, A. Montes, K. Maninis, Y. Chen, L. Van Gool, F. Perazzi, and J. Pont-Tuset (2018) The 2018 davis challenge on video object segmentation. arXiv:1803.00557. Cited by: §2.
  • [6] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018-09) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision, Cited by: §1.
  • [7] Y. Chen, J. Pont-Tuset, A. Montes, and L. Van Gool (2018) Blazingly fast video object segmentation with pixel-wise metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [8] B. Cheng, M. D. Collins, Y. Zhu, T. Liu, T. S. Huang, H. Adam, and L. Chen (2019) Panoptic-deeplab. arXiv:1910.04751. Cited by: §1.
  • [9] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • [10] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research. Cited by: §2, §3.
  • [11] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? The kitti vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §1, §2, §2, §5.3.
  • [12] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §1, §2, §4.1, §4.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. arXiv:1512.03385. Cited by: §4.1.
  • [14] R. Henschel, L. Leal-Taixe, D. Cremers, and B. Rosenhahn (2018) Fusion of head and full-body detectors for multi-object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Cited by: Table 4.
  • [15] A. Hermans, L. Beyer, and B. Leibe (2017) In defense of the triplet loss for person re-identification. arXiv:1703.07737. Cited by: §4.2.
  • [16] J. Hur and S. Roth (2016) Joint optical flow and temporally consistent semantic segmentation. arXiv:1607.07716. Cited by: §2.
  • [17] X. Jin, H. Xiao, X. Shen, J. Yang, Z. Lin, Y. Chen, Z. Jie, J. Feng, and S. Yan (2017) Predicting scene parsing and motion dynamics in the future. In Neural Information Processing Systems, Cited by: §2.
  • [18] J. Johnander, M. Danelljan, E. Brissman, F. S. Khan, and M. Felsberg (2019) A generative appearance model for end-to-end video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [19] M. Keuper, S. Tang, B. Andres, T. Brox, and B. Schiele (2018) Motion segmentation & multiple object tracking by correlation co-clustering. IEEE transactions on pattern analysis and machine intelligence. Cited by: Table 4.
  • [20] C. Kim, F. Li, A. Ciptadi, and J. M. Rehg (2015) Multiple hypothesis tracking revisited. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: Table 4.
  • [21] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler (2015) MOTChallenge 2015: Towards a benchmark for multi-target tracking. arXiv:1504.01942. Cited by: §2.
  • [22] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu (2018) High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [23] F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg (2013) Video segmentation by tracking many figure-ground segments. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §2.
  • [24] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §4.1.
  • [25] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Cited by: §1, §2, §5.2.
  • [26] C. Long, A. Haizhou, Z. Zijie, and S. Chong (2018) Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In ICME, Cited by: Table 4.
  • [27] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [28] P. Luc, C. Couprie, Y. LeCun, and J. Verbeek (2018) Predicting future instance segmentations by forecasting convolutional features. In Proceedings of the European Conference on Computer Vision, Cited by: §2.
  • [29] J. Luiten, P. Torr, and B. Leibe (2019) Video instance segmentation 2019: a winning approach for combined detection, segmentation, classification and tracking. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Cited by: §2, §2.
  • [30] W. Maddern, G. Pascoe, C. Linegar, and P. Newman (2017) 1 year, 1000 km: the oxford robotcar dataset. International Journal of Robotics Research 36 (1), pp. 3–15. Cited by: §2.
  • [31] K. Maninis, S. Caelles, J. Pont-Tuset, and L. Van Gool (2018) Deep extreme cut: from extreme points to object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [32] A. Milan, L. Leal-Taixé, I. D. Reid, S. Roth, and K. Schindler (2016) MOT16: A benchmark for multi-object tracking. arXiv:1603.00831. Cited by: §1, §1, §2, §2.
  • [33] G. Neuhold, T. Ollmann, S. Rota Bulò, and P. Kontschieder (2017) The mapillary vistas dataset for semantic understanding of street scenes. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §1, §1, §1, §3.1.
  • [34] A. Ošep, W. Mehner, M. Mathias, and B. Leibe (2017) Combined image-and world-space tracking in traffic scenes. In Proceedings of the IEEE International Conference on Robotics and Automation, Cited by: §5.3, §5.3, Table 2.
  • [35] A. Ošep, W. Mehner, P. Voigtlaender, and B. Leibe (2018) Track, then decide: category-agnostic vision-based multi-object tracking. In Proceedings of the IEEE International Conference on Robotics and Automation, Cited by: §2, §2, §2, §5.3, §5.3, §5.3, Table 2.
  • [36] A. Ošep, P. Voigtlaender, J. Luiten, S. Breuers, and B. Leibe (2018) Towards large-scale video object mining. Proceedings of the ECCV 2018 Workshop on Interactive and Adaptive Learning in an Open World. Cited by: §2.
  • [37] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung (2016) A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [38] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool (2017) The 2017 davis challenge on video object segmentation. arXiv:1704.00675. Cited by: §2.
  • [39] L. Porzi, S. R. Bulò, A. Colovic, and P. Kontschieder (2019) Seamless scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §3.1, §4.1, §4.2, §4.2, §5.4.
  • [40] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari (2012) Learning object class detectors from weakly annotated video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [41] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid (2015) EpicFlow: edge-preserving interpolation of correspondences for optical flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [42] S. Rota Bulò, L. Porzi, and P. Kontschieder (2018) In-place activated batchnorm for memory-optimized training of DNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.2, §4.1.
  • [43] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision. Cited by: §2.
  • [44] S. Sharma, J. A. Ansari, J. K. Murthy, and K. M. Krishna (2018) Beyond pixels: leveraging geometry and shape cues for online multi-object tracking. In Proceedings of the IEEE International Conference on Robotics and Automation, Cited by: §2, §2, §5.3, Table 2.
  • [45] S. Shen (2013) Accurate multiple view 3d reconstruction using patch-based stereo for large-scale scenes. IEEE Transactions on Image Processing 22 (5), pp. 1901–1914. Cited by: §3.2.
  • [46] A. Simonelli, S. R. R. Bulò, L. Porzi, M. López-Antequera, and P. Kontschieder (2019) Disentangling monocular 3d object detection. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: Appendix C, §5.6.
  • [47] P. Voigtlaender, M. Krause, A. Osep, J. Luiten, B. B. G. Sekar, A. Geiger, and B. Leibe (2019) MOTS: multi-object tracking and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Appendix A, 1st item, §1, §1, §1, §1, §2, §2, §2, §5.1, §5.2, §5.2, §5.3, §5.3, §5.3, §5.3, §5.3, §5.5, Table 2, Table 4, §6.
  • [48] Q. Wang, L. Zhang, L. Bertinetto, W. Hu, and P. H. Torr (2019) Fast online object tracking and segmentation: a unifying approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2, §2.
  • [49] L. Wen, D. Du, Z. Cai, Z. Lei, M. Chang, H. Qi, J. Lim, M. Yang, and S. Lyu (2015) UA-detrac: a new benchmark and protocol for multi-object detection and tracking. arXiv:1511.04136. Cited by: §2.
  • [50] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.1.
  • [51] N. Xu, L. Yang, Y. Fan, D. Yue, Y. Liang, J. Yang, and T. Huang (2018) Youtube-vos: a large-scale video object segmentation benchmark. arXiv:1809.03327. Cited by: §2, §2.
  • [52] L. Yang, Y. Fan, and N. Xu (2019) Video instance segmentation. arXiv:1905.04804. Cited by: §2, §2.
  • [53] Z. Yin, T. Darrell, and F. Yu (2019) Hierarchical discrete distribution decomposition for match density estimation. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: 2nd item, §1, §3.2, §5.3, Table 1.
  • [54] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell (2018) BDD100K: a diverse driving video database with scalable annotation tooling. arXiv:1805.04687. Cited by: §1, §5.4.
  • [55] I. E. Zulfikar, J. Luiten, and B. Leibe (2019) UnOVOST: unsupervised offline video object segmentation and tracking for the 2019 unsupervised davis challenge. In Proceedings of the 2019 DAVIS Challenge on Video Object Segmentation - CVPR Workshops, Cited by: §2.