ReMOTS: Self-Supervised Refining Multi-Object Tracking and Segmentation

by Fan Yang, et al.

We aim to improve the performance of Multiple Object Tracking and Segmentation (MOTS) by refinement. However, refining MOTS results remains challenging, which can be attributed to two factors: appearance features are not adapted to the target videos, and it is difficult to find proper thresholds to discriminate between them. To tackle this issue, we propose a self-supervised refining MOTS (i.e., ReMOTS) framework. ReMOTS takes four main steps to refine MOTS results from the data association perspective: (1) training the appearance encoder using predicted masks; (2) associating observations across adjacent frames to form short-term tracklets; (3) training the appearance encoder using short-term tracklets as reliable pseudo labels; and (4) merging short-term tracklets into long-term tracklets using the adapted appearance features and thresholds that are automatically obtained from statistical information. Using ReMOTS, we reached 1st place on CVPR 2020 MOTS Challenge 1, with an sMOTSA score of 69.9.



1 Introduction

Multiple Object Tracking (MOT), which depends on information from bounding boxes, faces a great challenge: different objects may share the same bounding box, which makes them harder to distinguish. Recently, some researchers in this field have turned their attention to Multiple Object Tracking and Segmentation (MOTS), hoping to take advantage of object-instance masks. Against this background, the first MOTS challenge was organized to explore solutions for MOTS. We participated in this challenge (May 30, 2020) and won 1st place on Challenge 1. In this paper, we present our solution.

2 Method Details

Overall, we apply the tracking-by-detection strategy to generate MOTS results. Since ReMOTS is an offline approach, we refine the data association by retraining the appearance feature encoder. In each step of ReMOTS, we give practical guidance for quantitatively selecting hyperparameters. Our approach is illustrated in Figure 1. After obtaining object-instance masks, we perform: (1) encoder training with intra-frame data, (2) association of masks into short-term tracklets by a short-term tracker, (3) inter-short-tracklet encoder retraining, and (4) merging of short-term tracklets.

Figure 1: The illustration of ReMOTS Framework.
Figure 2: Constructing training samples for intra-frame training. P and N represent positive and negative samples, respectively.

2.1 Generate Object-Instance Masks

Following the way the public detections are generated, we obtain object-instance masks using the Mask R-CNN X152 of Detectron2 [5] and X-101-64x4d-FPN of MMDetection [2]. We fuse their segmentation results with a modified Non-maximum Suppression (NMS). Unlike traditional NMS, where IoU (Intersection over Union) is applied, we propose a new metric named IoM (Intersection over Minimum), since heavily overlapped masks may still have low IoU values. The Python code of IoM is as follows.

import numpy as np

def pixel_iom(target, prediction):
    """
    Inputs:
        target: binary mask, array([H,W])
        prediction: binary mask, array([H,W])
    Outputs:
        iom_score: float
    """
    # Pixels covered by both masks.
    intersection = np.logical_and(target, prediction)
    # Area of the smaller mask; IoM is 1.0 when it lies inside the larger one.
    min_area = min(np.sum(target), np.sum(prediction))
    if min_area == 0:
        return 0.0  # an empty mask overlaps nothing
    iom_score = np.sum(intersection) / min_area
    return iom_score

After performing our modified NMS, the remaining masks may still overlap. Therefore, for each overlapping area we keep only the mask with the highest confidence score.
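The fusion described above can be sketched as a greedy NMS that uses IoM as its overlap measure, followed by a pass that gives every contested pixel to the highest-scoring surviving mask. This is a minimal sketch under stated assumptions: the function names `iom_nms` and `resolve_overlaps` and the default `iom_threshold=0.5` are hypothetical and are not taken from the paper.

```python
import numpy as np

def pixel_iom(target, prediction):
    """Intersection over Minimum of two binary masks (0.0 if either is empty)."""
    min_area = min(np.sum(target), np.sum(prediction))
    if min_area == 0:
        return 0.0
    return float(np.sum(np.logical_and(target, prediction)) / min_area)

def iom_nms(masks, scores, iom_threshold=0.5):
    """Greedy NMS: visit masks in descending score order and drop any mask
    whose IoM with an already kept mask exceeds iom_threshold (assumed value)."""
    order = np.argsort(scores)[::-1]
    kept = []
    for idx in order:
        if all(pixel_iom(masks[idx], masks[k]) <= iom_threshold for k in kept):
            kept.append(int(idx))
    return kept  # indices sorted by descending score

def resolve_overlaps(masks, kept):
    """Give every contested pixel to the kept mask with the highest score."""
    claimed = np.zeros(np.shape(masks[0]), dtype=bool)
    exclusive = {}
    for idx in kept:  # kept is already in descending score order
        mask = np.asarray(masks[idx], dtype=bool)
        exclusive[idx] = np.logical_and(mask, ~claimed)
        claimed = np.logical_or(claimed, mask)
    return exclusive
```

A small mask lying entirely inside a larger one has an IoM of 1.0 even when its IoU is low, so it is suppressed here although an IoU-based NMS with the same threshold would keep it.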

Figure 3: Constructing training samples for inter-short-tracklet training. P and N represent positive and negative samples, respectively.

2.2 Encoder Training with Intra-frame Data

We take an off-the-shelf appearance encoder and its training scheme from an object re-identification work [3]. SeResNeXt50 is used as the backbone, and its global-average-pooling output vector is used as the appearance feature. The triplet loss [1] is applied to train the appearance encoder. To adapt the appearance feature learning to the target videos, we incorporate intra-frame observations of the target videos into a novel offline training process.

As Figure 2 shows, we can sample triplets from the training set simply by referring to the ground-truth tracklets. In the test set, since Non-maximum Suppression (NMS) is performed, we assume that predicted object masks are exclusive within the same frame, and therefore it is easy to form negative pairs from intra-frame observations. Since no tracking has been performed yet, we create a positive sample by augmenting an anchor sample. The augmentation process can dramatically change the pixel content of the anchor sample without altering its identity. Finally, we combine triplets from the training set and the target set at a fixed ratio to form each mini-batch. Using these new training samples, we retrain the appearance encoder to obtain more discriminative appearance features.
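The intra-frame triplet construction for the target videos can be illustrated as follows. This is only a sketch: the paper does not specify the augmentation here, so the `augment` function (horizontal flip plus brightness jitter) is a hypothetical stand-in, and `sample_intra_frame_triplet` is an illustrative name.

```python
import numpy as np

def augment(crop, rng):
    """Hypothetical augmentation: horizontal flip plus brightness jitter.
    It changes pixel values but not the identity shown in the crop."""
    flipped = crop[:, ::-1]
    gain = rng.uniform(0.7, 1.3)
    return np.clip(flipped.astype(np.float32) * gain, 0, 255)

def sample_intra_frame_triplet(frame_crops, rng):
    """frame_crops: list of object crops detected in ONE frame.
    Returns (anchor, positive, negative)."""
    if len(frame_crops) < 2:
        raise ValueError("need at least two objects in the frame")
    # Two different detections in the same frame are assumed to be
    # different identities, because NMS makes the masks exclusive.
    a_idx, n_idx = rng.choice(len(frame_crops), size=2, replace=False)
    anchor = frame_crops[a_idx].astype(np.float32)
    positive = augment(frame_crops[a_idx], rng)   # same identity, new pixels
    negative = frame_crops[n_idx].astype(np.float32)
    return anchor, positive, negative
```

The resulting triplets would then be mixed with ground-truth triplets from the training set when forming each mini-batch.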

2.3 Short-term Tracker

After intra-frame training, we apply the appearance encoder to generate appearance features for data association. Since the tracker is not our main focus, we build a simple tracker that associates observations from only two frames at a time. Using the dense optical flow function of OpenCV, we generate optical flow between two adjacent frames and then warp each mask from the previous frame to the current frame to calculate the IoU of cross-frame masks. The distance matrix is built from these quantities: each entry is the edge weight (i.e., distance) between a mask of the previous frame and a mask of the current frame, computed from their IoU and their appearance features.

Besides rejecting associations with low IoU values, we also want to reject associations with low appearance similarity. However, it is tricky to determine a threshold for this constraint heuristically. We tackle this issue by analyzing the intra-frame distribution. Specifically, the histogram of appearance cosine similarity between intra-frame masks can be approximated by a normal distribution, for which approximately 99.7% of the observation pairs lie within three standard deviations (see Figure 4). We therefore set the appearance affinity threshold at three standard deviations.

Figure 4: The appearance threshold value for short-term tracking.
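The statistical threshold selection can be sketched as below. The text states only that the threshold sits at three standard deviations; placing it at the mean plus three standard deviations of the intra-frame (different-identity) similarities is an assumption of this sketch, and both function names are hypothetical.

```python
import numpy as np

def intra_frame_similarities(features_per_frame):
    """Cosine similarity of every pair of distinct objects within each frame."""
    sims = []
    for feats in features_per_frame:  # feats: array of shape (num_objects, dim)
        feats = np.asarray(feats, dtype=np.float64)
        if len(feats) < 2:
            continue  # a single object yields no intra-frame pair
        unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        cos = unit @ unit.T
        upper = np.triu_indices(len(feats), k=1)  # each unordered pair once
        sims.append(cos[upper])
    return np.concatenate(sims)

def three_sigma_threshold(similarities):
    """Assumed form: mean + 3 * std of the intra-frame similarities."""
    return float(np.mean(similarities) + 3.0 * np.std(similarities))
```

Because intra-frame pairs are assumed to show different identities, a cross-frame pair whose similarity exceeds this value is unlikely to be a different-identity pair under the normal approximation.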

Consequently, using the automatically obtained appearance threshold, we further process the distance matrix so that pairs violating the IoU or appearance constraint cannot be associated.

We apply linear assignment on the processed distance matrix to determine the association of masks between the previous frame and the current frame. Due to misdetection and occlusion, such a process can only generate short-term tracklets. However, short-term tracklets reduce the risk of mixing different identities, which is an important condition for the next step.
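The gating and assignment step can be sketched with SciPy's `linear_sum_assignment`. The exact distance formula is not available in this text, so the equal-weight mix of `1 - IoU` and `1 - cosine similarity` used below is purely an assumption, as are the function name `associate` and the parameters `iou_min` and `sim_min`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(iou, cos_sim, iou_min, sim_min):
    """iou, cos_sim: (num_prev, num_curr) arrays for two adjacent frames.
    Returns a list of (prev_index, curr_index) matches."""
    iou = np.asarray(iou, dtype=np.float64)
    cos_sim = np.asarray(cos_sim, dtype=np.float64)
    # Assumed distance: equal-weight mix of spatial and appearance dissimilarity.
    dist = 0.5 * (1.0 - iou) + 0.5 * (1.0 - cos_sim)
    # Pairs failing either constraint get a large finite cost, which keeps
    # the assignment problem feasible; they are filtered out afterwards.
    forbidden = (iou < iou_min) | (cos_sim < sim_min)
    cost = np.where(forbidden, 1e6, dist)
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols) if not forbidden[r, c]]
```

Masks left unmatched in the current frame would start new short-term tracklets, which is why this step yields short but identity-pure tracklets.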

2.4 Inter-short-tracklet Encoder Retraining

Since we assume that each short-term tracklet contains a single identity, short-term tracklets can be used as dependable pseudo labels to train the feature encoder. However, different short-term tracklets that have no overlap in the temporal domain may still hold the same identity. Therefore, we perform inter-short-tracklet retraining under the constraint that sampled short-term tracklets must temporally overlap within the same video.

Figure 3 illustrates how training data are sampled for inter-short-tracklet training. Within a video, we first sample two identities that appear in a randomly chosen frame, and then randomly choose another frame for one of the selected identities, thus constructing a triplet. Other settings of inter-short-tracklet training are the same as in intra-frame training. We update the appearance features after inter-short-tracklet retraining and use them in the next step.
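The sampling procedure can be sketched with the standard library alone. The data layout (a dictionary mapping each short-term tracklet ID to its per-frame observations) and the function name are assumptions made for illustration.

```python
import random

def sample_inter_tracklet_triplet(tracklets, rng):
    """tracklets: {tracklet_id: {frame_index: observation}} for ONE video.
    Returns (anchor, positive, negative) observations."""
    # Index which short-term tracklets are present in each frame.
    by_frame = {}
    for tid, obs in tracklets.items():
        for frame in obs:
            by_frame.setdefault(frame, []).append(tid)
    # Valid anchors co-occur with another tracklet in the chosen frame and
    # have at least one other frame that can supply the positive sample.
    candidates = [
        (frame, tid)
        for frame, tids in by_frame.items() if len(tids) >= 2
        for tid in tids if len(tracklets[tid]) >= 2
    ]
    if not candidates:
        raise ValueError("no frame with two co-occurring tracklets")
    frame, anchor_id = rng.choice(candidates)
    negative_id = rng.choice([t for t in by_frame[frame] if t != anchor_id])
    other_frame = rng.choice([f for f in tracklets[anchor_id] if f != frame])
    return (tracklets[anchor_id][frame],        # anchor
            tracklets[anchor_id][other_frame],  # positive: same tracklet
            tracklets[negative_id][frame])      # negative: co-occurring tracklet
```

Taking the negative from the same frame as the anchor guarantees temporal overlap, so two tracklets that might share an identity are never used as a negative pair.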

Table 1: The performance on the CVPR 2020 MOTS Challenge test set (up to the submission deadline on May 30, 2020). Methods listed: ReMOTS (ours), PTPM, PT.
Table 2: The performance of ReMOTS on each sequence of the CVPR 2020 MOTS Challenge test set (up to the submission deadline on May 30, 2020).

2.5 Merging Short-term Tracklets

With better appearance features and more robust spatio-temporal information from short-term tracklets, we are able to merge short-term tracklets into long-term ones. The merging process is summarized in Figure 1. Short-term tracklet association is formulated as a hierarchical clustering problem on a weighted graph, in which each node represents a tracklet and the graph edges are represented by a distance matrix. For each pair of tracklets, the edge weight (i.e., distance) is defined from their temporal ranges, their per-frame appearance features, and the number of observations within each tracklet.

We set the distance between two short-term tracklets to infinity whenever the pair violates any of the following three conditions: (1) they have different short-term track IDs, (2) the temporal gap between them is within a fixed number of frames, and (3) they have no temporal overlap. To hold these constraints throughout hierarchical clustering, we apply the centroid linkage criterion to determine the distance between clusters.
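The three merging conditions can be written as a small gating function. This is a sketch only: the tracklet representation, the function name `merge_distance`, and the `max_gap` parameter are hypothetical, and the gap value used in the paper is not available in this text.

```python
import math

def merge_distance(a, b, appearance_distance, max_gap):
    """a, b: dicts with 'id', 'start', 'end' (inclusive frame indices).
    Returns appearance_distance if the pair may be merged, else infinity.
    max_gap is a hypothetical parameter; the source value is not given here."""
    if a["id"] == b["id"]:
        return math.inf                  # (1) must be different tracklets
    first, second = (a, b) if a["start"] <= b["start"] else (b, a)
    gap = second["start"] - first["end"]
    if gap <= 0:
        return math.inf                  # (3) must not overlap in time
    if gap > max_gap:
        return math.inf                  # (2) temporal gap must be small enough
    return appearance_distance
```

For example, two tracklets covering frames 0 to 10 and 15 to 30 have a gap of 5 frames, so they remain merge candidates for any `max_gap` of at least 5.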

Figure 5: The appearance threshold value for merging short-term tracklets.

The main challenge in applying hierarchical clustering is how to set a proper cutting threshold. Rather than giving a heuristic value, we let the data speak for themselves. We suppose that the intra-frame and inter-short-tracklet cosine similarity histograms can be separated at a single value (see Figure 5) after inter-short-tracklet retraining, though a small overlap might exist. Without access to the ground truth, this is a reasonable boundary for distinguishing objects based on appearance features. Therefore, we set this value as the cutting threshold in hierarchical clustering.

2.6 Experimental Setup

The only neural network model used in this work, the appearance encoder [3], is not our contribution, and ReMOTS can perform the same refinement when other appearance models are used. Therefore, we make few changes to the default settings of [3], except for forming novel training samples in our intra-frame training and inter-short-tracklet training. Here, we omit the other details described in [3].

3 Results

We report the performance of ReMOTS on the MOTChallenge evaluation system, with the metrics introduced in [4]. In Table 1, we list the performance of the top-3 methods up to the submission deadline. Our method outperforms the other two methods mainly in terms of IDF1 score, which leads to state-of-the-art performance in this challenge. The detailed performance on each test sequence is listed in Table 2. Although the same method is applied, the performance varies considerably across sequences. This may be attributed to the diversity between videos, which calls for more exploration of automatically adapting MOTS models to target videos. ReMOTS analyzes statistical information at the level of the entire video, but temporally local statistical information, which might be useful for fine-grained adaptation, has not been considered yet.

4 Conclusion

We present our solution, which won CVPR 2020 MOTS Challenge 1. In the proposed ReMOTS framework, our main contributions are intra-frame training and inter-short-tracklet training, which are introduced to learn better appearance features for more effective data association. In addition, we quantitatively demonstrate how to select proper thresholds by analyzing the statistical information of tracklets, which could be useful for other multiple object tracking works. The main limitation of ReMOTS is that it cannot be used in real-time scenarios, but it may offer insights for designing better online MOTS methods with feature adaptation.


This work was supported by JSPS KAKENHI Grant Number JP17H06101.


  • [1] G. Chechik, V. Sharma, U. Shalit, and S. Bengio (2010) Large scale online learning of image similarity through ranking. Journal of Machine Learning Research 11 (Mar), pp. 1109–1135.
  • [2] K. Chen et al. (2019) MMDetection: open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155.
  • [3] H. Luo, Y. Gu, X. Liao, S. Lai, and W. Jiang (2019) Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.
  • [4] P. Voigtlaender, M. Krause, A. Ošep, J. Luiten, B. B. G. Sekar, A. Geiger, and B. Leibe (2019) MOTS: multi-object tracking and segmentation. In CVPR.
  • [5] Y. Wu, A. Kirillov, F. Massa, W. Lo, and R. Girshick. Detectron2.