Multiple Object Tracking (MOT), which relies on bounding-box information, faces a major challenge: different objects may share the same bounding box, which makes them harder to distinguish. Recently, some researchers in this field have turned their attention to Multiple Object Tracking and Segmentation (MOTS), hoping to take advantage of object-instance masks. Against this background, the first MOTS challenge was organized to explore solutions for MOTS. We participated in this challenge and won first place in Challenge 1. In this paper, we present our solution.
2 Method Details
Overall, we apply the tracking-by-detection strategy to generate MOTS results. Since our ReMOTS is an offline approach, we refine the data association by retraining the appearance feature encoder. In each step of ReMOTS, we give practical guidance for selecting hyperparameters quantitatively. Our approach is illustrated in Figure 1. After obtaining object-instance masks, we perform: (1) encoder training with intra-frame data, (2) association of masks into short-term tracklets by a short-term tracker, (3) inter-short-tracklet encoder retraining, and (4) merging of short-term tracklets.
2.1 Generate Object-Instance Masks
Following how the public detections are generated, we obtain object-instance masks using the Mask R-CNN X152 of Detectron2 and X-101-64x4d-FPN of MMDetection. We fuse their segmentation results with a modified Non-Maximum Suppression (NMS). Unlike traditional NMS, which uses IoU (Intersection over Union), we propose a new metric named IoM (Intersection over Minimum), since heavily overlapped masks may still have low IoU values. The Python code of IoM is as follows.
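A minimal sketch of IoM, assuming each mask is a binary NumPy array of the same shape. The function names `mask_iom` and `mask_iou` are illustrative and not taken from the original implementation. IoU is included only to show how the two metrics differ when a small mask lies inside a large one.

```python
import numpy as np


def mask_iom(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection over Minimum: intersection area / area of the smaller mask."""
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    intersection = np.logical_and(a, b).sum()
    smaller_area = min(a.sum(), b.sum())
    if smaller_area == 0:
        return 0.0  # an empty mask cannot overlap anything
    return float(intersection) / float(smaller_area)


def mask_iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Standard Intersection over Union, for comparison."""
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum()) / float(union)


# A 2x2 mask fully contained in a 10x10 mask.
big = np.ones((10, 10), dtype=bool)
small = np.zeros((10, 10), dtype=bool)
small[2:4, 2:4] = True
print(mask_iom(big, small))  # 1.0
print(mask_iou(big, small))  # 0.04
```

In this case the small mask is completely covered, yet its IoU is only 0.04, so an IoU-based NMS would keep both masks while an IoM-based one would not.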
After performing our modified NMS, the remaining masks may still overlap. Therefore, in each overlapping area we keep only the mask with the highest confidence score.
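The fusion and overlap-removal steps could be sketched as follows. This is a greedy NMS under stated assumptions, not the authors' code: the IoM threshold of 0.5 is a placeholder (the text does not give one), and the helper names `iom_nms` and `resolve_overlaps` are hypothetical.

```python
import numpy as np


def iom(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over Minimum of two boolean masks."""
    smaller = min(a.sum(), b.sum())
    return float(np.logical_and(a, b).sum()) / float(smaller) if smaller > 0 else 0.0


def iom_nms(masks, scores, iom_threshold=0.5):
    """Greedy NMS: visit masks by descending score and drop any mask whose
    IoM with an already kept mask exceeds the (assumed) threshold."""
    order = np.argsort(scores)[::-1]
    keep = []
    for idx in order:
        if all(iom(masks[idx], masks[k]) <= iom_threshold for k in keep):
            keep.append(int(idx))
    return keep


def resolve_overlaps(masks, scores):
    """Give every contested pixel to the mask with the highest score."""
    order = np.argsort(scores)[::-1]
    occupied = np.zeros_like(masks[0], dtype=bool)
    resolved = [None] * len(masks)
    for idx in order:
        exclusive = np.logical_and(masks[idx], ~occupied)
        resolved[idx] = exclusive
        occupied |= exclusive
    return resolved


def square(top, left, size, shape=(8, 8)):
    m = np.zeros(shape, dtype=bool)
    m[top:top + size, left:left + size] = True
    return m


masks = [square(0, 0, 4), square(1, 1, 2), square(3, 3, 3)]
scores = [0.9, 0.8, 0.7]
kept = iom_nms(masks, scores)  # mask 1 lies inside mask 0 and is suppressed
final = resolve_overlaps([masks[i] for i in kept], [scores[i] for i in kept])
```

After both steps the surviving masks are pixel-wise exclusive within the frame, which is the property the later intra-frame training relies on.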
2.2 Encoder Training with Intra-frame Data
We take an off-the-shelf appearance encoder and its training scheme from an object re-identification work. SeResNeXt50 is used as the backbone, and its global-average-pooling output vector is used as the appearance feature. The triplet loss is applied to train the appearance encoder. To adapt appearance feature learning to the target videos, we incorporate intra-frame observations of the target videos into a novel offline training process.
As Figure 2 shows, in the training set we can sample triplets simply by referring to the ground-truth tracklets. In the test set, since Non-Maximum Suppression (NMS) has been performed, we assume that predicted object masks are mutually exclusive within the same frame, so negative pairs are easy to form from intra-frame observations. Before tracking, we create a positive sample by augmenting an anchor sample. The augmentation can dramatically change the pixel content of the anchor sample without altering its identity. Finally, we mix triplets from the training set and the target set at a fixed ratio to form each mini-batch. Using these new training samples, we retrain the appearance encoder to obtain more discriminative appearance features.
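The intra-frame triplet construction for the target videos might look like the sketch below. The augmentation shown (random horizontal flip plus brightness scaling) is only a stand-in, since the text does not specify which augmentations are used, and the function names are hypothetical.

```python
import numpy as np


def augment(crop: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Hypothetical augmentation: random horizontal flip and brightness scaling.
    It changes pixel content but not the identity shown in the crop."""
    out = crop.astype(np.float32)
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    out = out * rng.uniform(0.6, 1.4)
    return np.clip(out, 0, 255).astype(np.uint8)


def sample_intra_frame_triplet(frame_crops, rng):
    """Build (anchor, positive, negative) from the object crops of ONE frame.
    Masks in a frame are assumed exclusive, so any other crop is a negative."""
    if len(frame_crops) < 2:
        raise ValueError("need at least two objects in the frame")
    anchor_idx, negative_idx = rng.choice(len(frame_crops), size=2, replace=False)
    anchor = frame_crops[anchor_idx]
    positive = augment(anchor, rng)      # same identity, altered pixels
    negative = frame_crops[negative_idx]  # different identity, same frame
    return anchor, positive, negative, (int(anchor_idx), int(negative_idx))


rng = np.random.default_rng(0)
crops = [np.full((4, 4, 3), v, dtype=np.uint8) for v in (10, 100, 200)]
anchor, positive, negative, (a_idx, n_idx) = sample_intra_frame_triplet(crops, rng)
```

No identity labels are needed here: the positive comes from augmentation and the negative from the same-frame exclusivity assumption.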
2.3 Short-term Tracker
After intra-frame training, we apply the appearance encoder to generate appearance features for data association. Since the tracker is not our main focus, we build a simple tracker that only associates observations from two adjacent frames at a time. Using the dense optical flow function of OpenCV, we generate optical flow between two adjacent frames, and then warp each mask from the previous frame to the current frame to calculate the IoU of cross-frame masks. Each entry of the distance matrix is the edge weight (i.e., distance) between a mask in the previous frame and a mask in the current frame, formulated in terms of the two masks and their appearance features.
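The warping and cross-frame IoU computation can be sketched with NumPy alone, assuming a dense flow field of shape (H, W, 2) whose channels hold the x and y displacement from the previous frame to the current one. This is the layout returned by OpenCV's `cv2.calcOpticalFlowFarneback`, which is one possible source of the flow; the text does not name the specific OpenCV function. The nearest-pixel forward warping below is a simplification.

```python
import numpy as np


def warp_mask(prev_mask: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Forward-warp a boolean mask by moving each foreground pixel along the flow."""
    h, w = prev_mask.shape
    ys, xs = np.nonzero(prev_mask)
    new_xs = np.rint(xs + flow[ys, xs, 0]).astype(int)  # channel 0: x displacement
    new_ys = np.rint(ys + flow[ys, xs, 1]).astype(int)  # channel 1: y displacement
    inside = (new_xs >= 0) & (new_xs < w) & (new_ys >= 0) & (new_ys < h)
    warped = np.zeros((h, w), dtype=bool)
    warped[new_ys[inside], new_xs[inside]] = True
    return warped


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum()) / float(union) if union > 0 else 0.0


# An object that moves 3 pixels right and 1 pixel down between frames.
prev_mask = np.zeros((8, 8), dtype=bool)
prev_mask[2:4, 2:4] = True
cur_mask = np.zeros((8, 8), dtype=bool)
cur_mask[3:5, 5:7] = True
flow = np.zeros((8, 8, 2), dtype=np.float32)
flow[..., 0] = 3.0
flow[..., 1] = 1.0

iou_raw = mask_iou(prev_mask, cur_mask)                      # 0.0
iou_warped = mask_iou(warp_mask(prev_mask, flow), cur_mask)  # 1.0
```

Warping compensates for motion, so fast-moving objects can still obtain a high cross-frame IoU.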
Besides constraining data association with low IoU values, we also want to constrain data association with low appearance similarity. However, it is tricky to determine a threshold for this constraint heuristically. We tackle this issue by analyzing the intra-frame distribution. Specifically, the histogram of appearance cosine similarity between intra-frame masks can be approximated by a normal distribution, so about 99.7% of the observation pairs lie within three standard deviations of the mean (see Figure 4). We therefore set the appearance affinity threshold at three standard deviations.
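A sketch of how such a threshold could be derived from the data. It assumes the threshold is placed at the mean plus three standard deviations of the intra-frame similarities (the upper side, because intra-frame pairs are different identities); the text only says "at three standard deviations", so the side is an assumption, as are the function names.

```python
import numpy as np


def cosine_similarity_matrix(features: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of an (N, D) feature array."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    return normed @ normed.T


def intra_frame_similarities(features_per_frame) -> np.ndarray:
    """Collect similarities of all distinct mask pairs within each frame."""
    sims = []
    for feats in features_per_frame:
        feats = np.asarray(feats, dtype=np.float64)
        if len(feats) < 2:
            continue  # a frame with one object yields no pair
        sim = cosine_similarity_matrix(feats)
        sims.append(sim[np.triu_indices(len(feats), k=1)])
    return np.concatenate(sims)


def three_sigma_threshold(similarities: np.ndarray) -> float:
    """Assumed rule: mean + 3 * std of the intra-frame similarity distribution."""
    return float(similarities.mean() + 3.0 * similarities.std())


rng = np.random.default_rng(0)
frames = [rng.normal(size=(3, 16)), rng.normal(size=(2, 16)), rng.normal(size=(1, 16))]
sims = intra_frame_similarities(frames)
threshold = three_sigma_threshold(sims)
```

Because the statistic is computed per video from unlabeled detections, no ground truth or manual tuning is involved.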
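Consequently, using this automatically obtained threshold, we further process the distance matrix so that associations with low appearance similarity are also constrained.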
We apply linear assignment on the processed distance matrix to determine the association of masks between the previous frame and the current frame. Due to missed detections and occlusion, such a process can only generate short-term tracklets. However, short-term tracklets reduce the risk of mixing different identities, which is an important condition for the next step.
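The assignment step can be sketched with SciPy's `linear_sum_assignment`. Forbidden pairs are assumed to be marked with infinite distance; they are replaced by a large finite cost before solving and filtered out afterwards. The `forbidden_cost` value and the function name `associate` are illustrative choices.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def associate(distance: np.ndarray, forbidden_cost: float = 1e6):
    """Match previous-frame masks (rows) to current-frame masks (columns).
    Pairs with infinite distance are never returned as matches."""
    cost = np.where(np.isfinite(distance), distance, forbidden_cost)
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols)
            if np.isfinite(distance[r, c])]


inf = np.inf
distance = np.array([
    [0.1, inf],   # previous mask 0 can only match current mask 0
    [inf, inf],   # previous mask 1 has no admissible match (e.g., occluded)
    [0.9, 0.2],   # previous mask 2 prefers current mask 1
])
matches = associate(distance)
print(matches)  # [(0, 0), (2, 1)]
```

Previous-frame masks left unmatched end their tracklet, and unmatched current-frame masks start a new one, which is why the output consists of short-term tracklets.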
2.4 Inter-short-tracklet Encoder Retraining
As we assume each short-term tracklet contains only a single identity, the tracklets can be used as dependable pseudo labels to train the feature encoder. However, different short-term tracklets that have no overlap in the temporal domain may still hold the same identity. Therefore, we perform inter-short-tracklet retraining under the constraint that the sampled short-term tracklets must overlap temporally within the same video.
Figure 3 illustrates how training data are sampled for inter-short-tracklet training. Within a video, we first sample two identities that appear in a randomly chosen frame, and then randomly choose another frame for one of the selected identities, thus constructing a triplet. The other settings of inter-short-tracklet training are the same as in intra-frame retraining. We update the appearance features after inter-short-tracklet retraining and use them in the next step.
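The sampling procedure above can be sketched as follows, assuming each short-term tracklet is described by the list of frame indices in which it appears. The data layout and the function name are hypothetical; each returned element is a (tracklet ID, frame) pair that identifies one observation.

```python
import random
from collections import defaultdict


def sample_inter_tracklet_triplet(tracklet_frames, rng):
    """tracklet_frames: dict mapping tracklet ID -> list of frame indices.
    Returns (anchor, positive, negative), each a (tracklet_id, frame) pair."""
    frame_to_ids = defaultdict(list)
    for tid, frames in tracklet_frames.items():
        for f in frames:
            frame_to_ids[f].append(tid)

    # Usable frames hold at least two tracklets, one of which spans another frame.
    candidates = sorted(
        f for f, ids in frame_to_ids.items()
        if len(ids) >= 2 and any(len(tracklet_frames[t]) >= 2 for t in ids)
    )
    if not candidates:
        raise ValueError("no temporally overlapping tracklets to sample from")

    frame = rng.choice(candidates)
    ids = frame_to_ids[frame]
    anchor_id = rng.choice(sorted(t for t in ids if len(tracklet_frames[t]) >= 2))
    negative_id = rng.choice(sorted(t for t in ids if t != anchor_id))
    positive_frame = rng.choice(sorted(f for f in tracklet_frames[anchor_id] if f != frame))
    return (anchor_id, frame), (anchor_id, positive_frame), (negative_id, frame)


rng = random.Random(0)
tracklets = {1: [0, 1, 2], 2: [1, 2, 3], 3: [10, 11]}  # tracklet 3 overlaps nobody
triplets = [sample_inter_tracklet_triplet(tracklets, rng) for _ in range(50)]
```

Tracklet 3 is never sampled as a negative because it shares no frame with another tracklet, so it could still be the same identity as tracklet 1 or 2.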
2.5 Merging Short-term Tracklets
With better appearance features and more robust spatio-temporal information of short-term tracklets, we are able to merge short-term tracklets into long-term ones. The merging process is summarized in Figure 1. Short-term tracklet association is formulated as a hierarchical clustering problem on a weighted graph, in which each node represents a tracklet and the graph edges are represented in a distance matrix. For a pair of tracklets, the matrix entry is their edge weight (i.e., distance), defined in terms of their temporal ranges, their appearance features at individual frames, and the number of observations within each tracklet.
We set the distance between two short-term tracklets to infinity whenever their pairing violates any of the following three conditions: (1) they have different short-term track IDs, (2) the temporal gap between them is within a preset number of frames, and (3) they do not overlap in time. To hold these constraints throughout hierarchical clustering, we apply the centroid linkage criterion to determine the distance between clusters.
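The three conditions can be expressed as a small gating function. This sketch takes the appearance distance as an input rather than computing it, and `max_gap` is a hypothetical parameter whose value is not taken from the text. Temporal ranges are assumed to be inclusive (start, end) frame pairs.

```python
import math


def constrained_distance(id_a, range_a, id_b, range_b, appearance_distance, max_gap):
    """Return the appearance distance if the two tracklets may be merged,
    otherwise infinity. Ranges are inclusive (start_frame, end_frame) pairs."""
    if id_a == id_b:
        return math.inf  # condition (1): must be different tracklets
    overlap = range_a[0] <= range_b[1] and range_b[0] <= range_a[1]
    if overlap:
        return math.inf  # condition (3): co-existing tracklets are different objects
    gap = max(range_a[0], range_b[0]) - min(range_a[1], range_b[1])
    if gap > max_gap:
        return math.inf  # condition (2): too far apart in time
    return appearance_distance


# max_gap=10 is an arbitrary illustrative value.
ok = constrained_distance(1, (0, 10), 2, (15, 30), 0.3, max_gap=10)
overlapping = constrained_distance(1, (0, 10), 2, (5, 20), 0.3, max_gap=10)
too_far = constrained_distance(1, (0, 10), 2, (50, 60), 0.3, max_gap=10)
same_id = constrained_distance(1, (0, 10), 1, (15, 30), 0.3, max_gap=10)
```

Only the first pair keeps its finite distance; the other three are excluded from merging regardless of how similar they look.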
The main challenge in applying hierarchical clustering is how to set a proper cutting threshold. Rather than giving a heuristic value, we let the data speak for themselves. We suppose that the intra-frame and inter-short-tracklet cosine similarity histograms can be separated at a boundary value (see Figure 5) after inter-short-tracklet retraining, though a small overlap might exist. Without access to the ground truth, this is a reasonable boundary for distinguishing objects based on appearance features. Therefore, we set this boundary value as the cutting threshold in hierarchical clustering.
2.6 Experimental Setup
The only neural network model used in this work, the appearance encoder, is not our contribution, and ReMOTS can perform the same refinement when other appearance models are used. Therefore, we change little in the encoder's default settings, except for forming novel training samples in our intra-frame training and inter-short-tracklet training. We omit the other details, which are described in the original re-identification work.
We report the performance of our ReMOTS on the MOTChallenge evaluation system, with the metrics introduced in the MOTS paper. In Table 1, we list the performance of the top-3 methods up to the submission deadline. Our method outperforms the other two methods mainly in terms of IDF1 score, which leads to state-of-the-art performance in this challenge. The detailed performance on each test sequence is listed in Table 2. Although the same method is applied, the performance varies considerably across sequences. This may be attributed to the diversity between videos, which calls for more exploration in automatically adapting MOTS models to target videos. Our ReMOTS analyzes statistical information at the level of the entire video, but temporally local statistical information, which might be useful for fine-grained adaptation, has not been considered yet.
We present our solution, which won the CVPR 2020 MOTS Challenge 1. In our proposed ReMOTS framework, intra-frame training and inter-short-tracklet training are introduced to learn better appearance features for more effective data association; these are our main contributions. In addition, we quantitatively demonstrate how to select proper thresholds by analyzing the statistical information of tracklets, which could be useful for other multiple object tracking works. The main limitation of ReMOTS is that it cannot be used in real-time scenarios, but it may offer insights for designing better online MOTS methods with feature adaptation.
ACKNOWLEDGEMENTS This work was supported by JSPS KAKENHI Grant Number JP17H06101.
- Large scale online learning of image similarity through ranking. Journal of Machine Learning Research 11 (Mar), pp. 1109–1135.
- (2019) MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155.
- (2019) Bag of tricks and a strong baseline for deep person re-identification.
- (2019) MOTS: multi-object tracking and segmentation. In CVPR.
- Detectron2.