Multi-object tracking and segmentation (MOTS) is a fairly novel task that combines instance segmentation and multi-object tracking. It consists of associating and segmenting instances of objects corresponding to predetermined classes across multiple consecutive frames. This problem is important for many applications, in particular for intelligent transportation systems (ITS) and driver assistance and automation technologies.
The MOTS task is quite complex as it requires multiple outputs across multiple frames. Additionally, producing segmentation masks can be slow and very demanding in memory, which is not suitable for many ITS applications that are either running on on-board computers or in the cloud. For this reason, our method was designed using fast and lightweight elements. PolyTrack is based on the real-time instance segmentation method CenterPoly (perreault2021centerpoly) and on the tracking module of the CenterTrack (zhou2020tracking) multi-object tracker. It is crucial to reduce the representation size of the output as well as the computational cost of producing it. CenterPoly uses polygons for the masks which are well suited for this context, taking up less space than full segmentation masks and being fast to produce.
For the tracking part of our method, we follow a “detect then track” approach, and for the segmentation part, a “detect then segment” approach. We detect objects by finding their center keypoint; we therefore train a network to produce center keypoint heatmaps. Simultaneously, another network head produces a dense polygon map, which contains a bounding polygon at each location on the map. The polygon head is trained such that the locations of object centers on the polygon map contain the polygons of the corresponding objects. Another network head is trained to produce a tracking offset between corresponding objects of two consecutive frames. Using the proximity of the object centers, we associate them into tracklets over whole sequences. A Kalman filter is also applied to reduce the number of ID switches. A typical output of our method is shown in Figure 1. Our method, PolyTrack, a fast multi-object tracking and segmentation method, was evaluated on the KITTIMOTS (Voigtlaender19CVPR_MOTS) and MOTS (Voigtlaender19CVPR_MOTS) datasets. Results show that our proposed method is promising.
The contribution of our paper is the presentation of a novel MOTS method called PolyTrack, that is based on polygonal masks. PolyTrack can be thought of as an improved traditional multi-object tracker providing something better than bounding boxes, that is bounding polygons, at almost no additional cost. It was not designed to provide overly fine segmentation masks, but rather coarse segmentations that are sufficient in most cases (see Figure 0(b)), for instance to remove the background in an image, or provide better re-identification features. PolyTrack obtains good results on the tested benchmarks, particularly for more rigid objects, but not SOTA results. This is expected since PolyTrack does not produce segmentation masks, nor bounding boxes. Therefore, it is not possible to get impressive results with the established MOTS metrics that require 50% overlap with the ground-truth.
2 Related Work
2.1 Multiple Object Tracking
The majority of high-performing multiple object trackers follow a tracking-by-detection paradigm. First, using a pre-trained object detector (felzenszwalb2009object; ren2017accurate; ren2015faster; yang2016exploit), they produce bounding boxes for the current video frame. In a second stage, they perform data association to get the tracks by matching the bounding boxes of the current frame with those of the previous frames.
For example, SORT (bewley2016simple) uses common tracking techniques like the Kalman Filter and the Hungarian algorithm combined with a high quality object detector, Faster R-CNN (ren2015faster), to produce good online results. Only bounding box overlap is used. In a follow-up work, Deep SORT (wojke2017simple), the appearance is integrated, making the method less susceptible to identity switches. Learning by tracking (leal2016learning) implements a new two-stage approach to data association while tracking pedestrians. In the first stage, a Siamese convolutional neural network (CNN) is trained to produce descriptor encodings of the input images. Second, contextual information from the position and size of the input image patches is integrated with the CNN by gradient boosting to produce the matching probability. The work of schulter2017deep shows that it is possible to learn features for data association in a network-flow via backpropagation. The authors express the optimum of a smoothed network flow problem as a differentiable function. They outperform the methods based on handcrafted features. The authors of (tang2017multiple) proposed a method that links person detections over time by solving a minimum cost lifted multicut problem. To group tracks over longer periods of time, they implemented a new deep architecture for re-identification. The work of sharma2018beyond leverages traditional MOT techniques by incorporating 3D information into the cost for data association. ooi2018multiple improve data association by using label information provided by the detector. The work of (xu2019spatial) proposes a unified framework to obtain similarity measures between image patches that can encode different cues and work across both spatial and temporal domains. Some methods, like CenterTrack (zhou2020tracking) and Tracktor (bergmann2019tracking), frame multi-object tracking as a regression problem (tracking by regression), using simple strategies for instance association but relying on strong object detectors to get good performance. The detectors are designed to also regress the object displacement.
2.2 Multiple Object Tracking and Segmentation
Multiple object Tracking and Segmentation (MOTS) is a fairly novel compound task that combines instance segmentation and multiple object tracking. Track R-CNN (Voigtlaender19CVPR_MOTS) is a baseline method proposed for the MOTS dataset. It extends Mask R-CNN (he2017mask) with 3D convolutions to incorporate the temporal aspect and a network head that performs data association. ReMOTS (yang2020remots) proposes a self-supervised refinement method by using short-term tracklets as pseudo labels to train for longer ones. GMPHD_MAF (song2020online) takes instance segmentation results as input and performs data association based on the Gaussian mixture probability hypothesis density filter for position and motion, and kernelized correlation filter for appearance. UniTrack (wang2021different) proposes a unified framework for solving five different tasks. It consists of an appearance model that is task agnostic, and several heads which are used for solving each task and do not require training. The appearance model can be either supervised or self-supervised. SORTS (ahrnbom2021real) presents the first real-time MOTS method. It is based on SORT and uses instance masks produced by Mask R-CNN. Additionally, they present an alternative method called SORTS-RReID which uses a re-identification method to better handle occlusions.
3 Proposed Method
Our method is based on the CenterTrack and CenterPoly models (these two being based on the CenterNet model). Our aim is to design a tracking by regression method to track objects delimited by polygons. The polygons being approximations of masks, we effectively and efficiently transform a bounding box tracking method into a mask tracking method, that is a MOTS method.
CenterNet is a multi-class object detector that predicts the position of the center of objects in an image using a heatmap and then obtains the coordinates of the bounding boxes of each object by regression. It is an object representation model that offers a lot of potential. Therefore, it is possible to adapt this base model for several purposes. It can be adapted for tracking (CenterTrack) and to regress polygons to get mask approximations (CenterPoly). Our proposed method, PolyTrack, draws ideas from both of these derived methods to design a complete MOTS solution.
3.1 General architecture of PolyTrack
The network architecture of PolyTrack is presented in Figure 2. Similarly to CenterPoly, it has a polygon regression head and a depth head to order the polygons by depth. Also similarly to CenterTrack, it has a tracking head to obtain the displacement of objects between two frames.
There are three inputs to the network: the current image, the previous image, and a heatmap representing the positions of the tracklets detected in the previous image. These three inputs each pass through a small convolutional network consisting of a convolution layer and a residual layer. These small networks downsample the input resolution by 4 and set the number of channels of the backbone input tensor to 256. The outputs of these three small networks are then summed before being passed through the network backbone. The backbone we used to produce our results is a 2-stacks hourglass. The features generated by the backbone are then used by the regression heads to produce outputs. The regression heads, except the one for polygons, are built on the same model: a convolution followed by another convolution that generates one or more output values at each pixel of the downsampled feature map. The polygon regression head greatly benefits from multiple layers and a greater depth. It is therefore composed of three successive blocks, each made of two convolutions and a ReLU activation, before generating the output feature map (see code for details). Our architecture is detailed in the following.
3.2 Detecting objects as polygons
To obtain mask approximation of the objects to track, we propose an approach similar to CenterPoly. First, the CenterNet bounding box generation head is replaced by a polygon generation head. Instead of simply generating a bounding box around an object, we generate a bounding polygon that allows for a much finer segmentation. To learn to generate the polygons, the heatmaps provided as ground truth for training are generated by producing elliptical Gaussians whose length and width depend on the object dimensions. This improves detection accuracy compared to circular Gaussians, as shown in (perreault2021centerpoly). An object center is no longer defined as the center of its bounding box, but as its center of gravity computed as the weighted mean of the points forming the bounding polygon.
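The ground-truth heatmap and object center described above can be illustrated with a minimal NumPy sketch. It assumes an unweighted mean of the polygon vertices as the center and an axis-aligned Gaussian whose standard deviations are proportional to the object width and height; the function names and the `scale` divisor are hypothetical and are not taken from the CenterPoly code.

```python
import numpy as np

def polygon_center(vertices):
    """Object center taken as the mean of the polygon vertices (unweighted here)."""
    return np.asarray(vertices, dtype=float).mean(axis=0)

def draw_elliptical_gaussian(heatmap, center, box_w, box_h, scale=6.0):
    """Splat an axis-aligned elliptical Gaussian onto `heatmap` in place.

    sigma_x and sigma_y follow the object width and height; the divisor
    `scale` is a hypothetical choice, not the authors' value.
    """
    h, w = heatmap.shape
    cx, cy = center
    sigma_x = max(box_w / scale, 1e-3)
    sigma_y = max(box_h / scale, 1e-3)
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((xs - cx) ** 2 / (2 * sigma_x ** 2)
                 + (ys - cy) ** 2 / (2 * sigma_y ** 2)))
    # Keep the element-wise maximum so that nearby objects do not erase each other.
    np.maximum(heatmap, g.astype(heatmap.dtype), out=heatmap)
    return heatmap

# A 20 x 10 rectangle described by four vertices (x, y).
poly = [(10, 10), (30, 10), (30, 20), (10, 20)]
center = polygon_center(poly)
hm = draw_elliptical_gaussian(np.zeros((32, 48)), center, box_w=20, box_h=10)
```

For a wide object the response decays more slowly along the horizontal axis than along the vertical one, which is the property that distinguishes this target from a circular Gaussian.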
Finally, similarly to CenterPoly, we use a pseudo-depth head that learns the relative depth of objects in the frame, in order to determine which of two overlapping masks is in front of the other.
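As an illustration of how a pseudo-depth value can resolve overlaps, the sketch below paints instance masks into a single label map from farthest to nearest. It assumes that a larger pseudo-depth means farther from the camera; this convention and the function name `compose_instances` are assumptions made for the example.

```python
import numpy as np

def compose_instances(masks, depths):
    """Merge boolean instance masks into one label map (0 = background).

    Assumes a larger pseudo-depth means farther away, so instances are
    painted far-to-near and nearer ones overwrite the overlapping pixels.
    """
    label_map = np.zeros(masks[0].shape, dtype=np.int32)
    order = np.argsort(depths)[::-1]      # farthest instance first
    for idx in order:
        label_map[masks[idx]] = idx + 1   # instance ids start at 1
    return label_map

a = np.zeros((4, 6), dtype=bool)
a[:, :4] = True                           # instance 1 covers columns 0..3
b = np.zeros((4, 6), dtype=bool)
b[:, 2:] = True                           # instance 2 covers columns 2..5
labels = compose_instances([a, b], depths=[0.2, 0.7])
```

In this example the overlapping columns 2 and 3 are attributed to instance 1, the nearer of the two.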
3.3 Tracking the object polygons
To track the object polygons, we use a similar approach to CenterTrack. Instead of having a single image as the input of our network, we use the current image, the previous image, and a heatmap generated from the positions of the tracks present in the previous image. We also added a tracking regression head. This tracking head, a bit like an optical flow, predicts for each object detected in the current image its position in the previous image. This information is then used to establish associations between instances of the two frames using a greedy algorithm: an object will be assigned an existing ID if its previous position predicted by the tracking head is close to the position of the center of an object detected in the previous frame.
In the CenterTrack method, the association of instances between the frames of a video stream is based solely on the object centers detected by the neural network and the displacement predicted by the tracking head. For each detected object, the displacement vector produced by the tracking head is added to the position of the object center. These new positions are then compared with the positions of the centers of the previously detected objects, and a greedy algorithm is used to associate the centers closest to each other between two frames to propagate the identity of already existing tracks. If a detection in the current frame is not close to any existing track, a new track is created with a new ID. If an existing track is not close to any detection of the current frame, it is frozen and a counter, reflecting for how long a track was not matched, is incremented until either a detection of a next frame is associated with it or its value exceeds a limit (established at 32 frames in our case). If the counter value exceeds the set limit, the track is terminated, as we consider that the instance left the video stream.
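The association and track life cycle described above can be sketched as follows. This is a simplified illustration rather than the CenterTrack implementation: candidate pairs are matched in order of increasing distance, and the gating radius `max_dist` is a hypothetical parameter. Only the 32-frame limit comes from the text.

```python
import numpy as np

MAX_AGE = 32  # frames a track may stay frozen before it is terminated

def associate(tracks, centers, offsets, next_id, max_dist=50.0):
    """One greedy association step between the current detections and the tracks.

    tracks:  dict id -> {"pos": (x, y), "age": int}, modified in place
    centers: detected object centers in the current frame, shape (N, 2)
    offsets: tracking-head displacements toward the previous frame, shape (N, 2)
    max_dist is a hypothetical gating radius in pixels.
    Returns the id given to each detection and the updated next_id.
    """
    centers = np.asarray(centers, dtype=float).reshape(-1, 2)
    prev_pos = centers + np.asarray(offsets, dtype=float).reshape(-1, 2)
    track_ids = list(tracks)
    pairs = []
    for d, p in enumerate(prev_pos):
        for tid in track_ids:
            dist = float(np.linalg.norm(p - np.asarray(tracks[tid]["pos"])))
            if dist <= max_dist:
                pairs.append((dist, d, tid))
    pairs.sort()                          # closest pairs are matched first
    det_ids = [None] * len(centers)
    used = set()
    for _, d, tid in pairs:
        if det_ids[d] is None and tid not in used:
            det_ids[d] = tid
            used.add(tid)
    for d in range(len(centers)):
        if det_ids[d] is None:            # no track nearby: start a new one
            det_ids[d] = next_id
            next_id += 1
        tracks[det_ids[d]] = {"pos": tuple(centers[d]), "age": 0}
    for tid in track_ids:
        if tid not in used:               # unmatched track: freeze and age it
            tracks[tid]["age"] += 1
            if tracks[tid]["age"] > MAX_AGE:
                del tracks[tid]           # the instance is considered gone
    return det_ids, next_id

tracks = {1: {"pos": (10.0, 10.0), "age": 0}, 2: {"pos": (100.0, 100.0), "age": 0}}
ids, next_id = associate(tracks, [(14, 10), (300, 300)], [(-3, 0), (0, 0)], next_id=3)
```

In the example, the first detection inherits identity 1, the second one starts a new track, and track 2 is frozen with its counter set to 1.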
This solution has the advantage of being quite simple and efficient. However, when a track is frozen, its position remains the last detected position. In practice, the tracklet of an instance that has been frozen for a few frames because of an occlusion or an overlap of two instances has a high chance of being “captured” by another instance passing nearby, causing an ID switch. If no instance passes nearby but the frozen instance reappears too far from its last known position, another ID switch may happen, the new position being too far from the old one to be associated with it. Finally, if an instance exits the video frame, its last known position will be on the boundary of the frame and its track will remain frozen there until the maximum counter limit is reached. If another instance then enters the frame from the same area on the boundary, it will most likely be associated with the last known position of the departed instance, causing an ID switch once again.
To address these three problems, and inspired by (chen2018tracking), we bring a partial solution: the use of Kalman Filters, more precisely Unscented Kalman Filters (UKF). A Kalman filter works by estimating, through observations, the parameters of a system in order to remove noise from observations and to predict a future state of the system. In our case, the system under study corresponds to objects moving through a video stream. Newton’s kinematics equations are therefore used to get an approximation of an object state (position, velocity and acceleration) at a given time. We use a UKF because it is much more efficient than a classical Kalman filter when applied to nonlinear systems and because it has already been proven efficient in the field of object tracking (chen2018tracking). A UKF is attached to each track at its birth. As long as a track is active and detected in the video stream, each new position of the instance is used to update the UKF, which allows it to progressively build the kinematic state of the object (position, velocity and acceleration). When a track is frozen, the UKF is used to predict the position of the object in each frame until the track is abandoned or a new association is made with one of the instance detections in a subsequent frame. When an object disappears from a frame in the video stream, the tracker thus assumes that it will continue to move along the trajectory it was following before disappearing.
This use of the UKF thus allows us to partially counter the problems presented above and to significantly reduce the number of identity switches happening at inference. When an object leaves the frame of the video stream, the UKF makes it continue on its way, so its position for the tracker falls outside the frame and a new instance entering the frame afterwards is less likely to be associated with it. In the same way, a track that has been frozen because of an occlusion has a better chance of being correctly associated with its original instance if, when the instance reappears, the predicted position approximately follows its true position.
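The coasting behaviour of a frozen track can be illustrated with a small filter over position, velocity and acceleration. The sketch below deliberately substitutes a standard linear Kalman filter with a constant-acceleration model for the UKF used in the paper, which keeps the example short; the class name and the noise parameters `q` and `r` are hypothetical.

```python
import numpy as np

class ConstantAccelerationKF:
    """Linear Kalman filter over [x, y, vx, vy, ax, ay] with a time step of one frame.

    This is a simplified stand-in for the UKF described in the text.
    """

    def __init__(self, xy, q=1e-2, r=1.0):
        self.x = np.array([xy[0], xy[1], 0.0, 0.0, 0.0, 0.0])
        self.P = np.eye(6) * 10.0            # initial state uncertainty
        self.F = np.eye(6)
        for i in range(2):
            self.F[i, i + 2] = 1.0           # position += velocity
            self.F[i, i + 4] = 0.5           # position += 0.5 * acceleration
            self.F[i + 2, i + 4] = 1.0       # velocity += acceleration
        self.H = np.zeros((2, 6))
        self.H[0, 0] = self.H[1, 1] = 1.0    # only the position is observed
        self.Q = np.eye(6) * q               # process noise (hypothetical value)
        self.R = np.eye(2) * r               # measurement noise (hypothetical value)

    def predict(self):
        """Advance the state by one frame and return the predicted position."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2].copy()

    def update(self, xy):
        """Correct the state with a newly associated detection center."""
        innovation = np.asarray(xy, dtype=float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ innovation
        self.P = (np.eye(6) - K @ self.H) @ self.P

# An object moving right at 5 pixels per frame is observed for 10 frames.
kf = ConstantAccelerationKF((0.0, 50.0))
for t in range(1, 11):
    kf.predict()
    kf.update((5.0 * t, 50.0))
# The track is then frozen: only predictions are made for 5 frames.
coasted = [kf.predict() for _ in range(5)]
```

While the track is frozen, the predicted position keeps advancing along the previous direction of motion instead of staying at the last detection, which is the property exploited to reduce identity switches.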
3.4 Training loss
The loss functions used for the outputs of the regression heads are a focal loss (lin2017focal) for the center point heatmap generation head and a simple regression loss for each of the other heads. The global loss is a linear combination of the losses of each head. It is given by

$$\mathcal{L} = \mathcal{L}_{hm} + \lambda_{poly}\,\mathcal{L}_{poly} + \lambda_{depth}\,\mathcal{L}_{depth} + \lambda_{track}\,\mathcal{L}_{track},$$

where $\mathcal{L}_{hm}$ is the focal loss as defined by CenterNet, and $\mathcal{L}_{poly}$, $\mathcal{L}_{depth}$, and $\mathcal{L}_{track}$ are the losses as defined by CenterPoly and CenterTrack. At training, two of the weights were set to 1 and the third to 0.1.
We use the same vertex selection policy for ground truth polygon generation as described in detail in (perreault2021centerpoly). In summary, it consists in drawing lines at regular intervals from the bounding box of the instance towards its center and keeping the first point of each line that intersects the instance segmentation mask. The polygons learned by the model correspond to the coordinates of these vertices expressed as offsets from the object center. Moreover, the object center is not defined as the center of the bounding box, but as the average of the coordinates of the vertices forming the bounding polygon, which makes it easier to learn the vertices as offsets. The ground truth for the tracking head is quite simple: for each object center $p^{(t)}$ present in the image at time $t$, if the same object is in the image at time $t-1$ with center $p^{(t-1)}$, the ground truth for the object displacement corresponds to $p^{(t-1)} - p^{(t)}$.
The loss of all outputted downsampled feature maps (except for the center point heatmap) is only evaluated at the centers of object instances present in the current image.
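The vertex selection policy used to build the ground truth polygons can be sketched as follows. This is an illustrative reading of the summary given in this section, not the CenterPoly code: rays start at evenly spaced points on the bounding box perimeter, are sampled in `steps` increments toward the box center, and fall back to the box center when they never touch the mask. The function names, the sampling step and the fallback are assumptions.

```python
import numpy as np

def box_perimeter_points(x0, y0, x1, y1, n):
    """Return n points evenly spaced along the box perimeter, clockwise from top-left."""
    w, h = x1 - x0, y1 - y0
    pts = []
    for d in np.linspace(0.0, 2.0 * (w + h), n, endpoint=False):
        if d < w:
            pts.append((x0 + d, y0))                    # top edge
        elif d < w + h:
            pts.append((x1, y0 + d - w))                # right edge
        elif d < 2 * w + h:
            pts.append((x1 - (d - w - h), y1))          # bottom edge
        else:
            pts.append((x0, y1 - (d - 2 * w - h)))      # left edge
    return np.array(pts, dtype=float)

def select_vertices(mask, n_vertices=32, steps=200):
    """March from each perimeter point toward the box center and keep the
    first sample that falls on the mask. Returns (center, vertex offsets)."""
    ys, xs = np.nonzero(mask)
    x0, x1, y0, y1 = xs.min(), xs.max(), ys.min(), ys.max()
    box_center = np.array([(x0 + x1) / 2.0, (y0 + y1) / 2.0])
    vertices = []
    for start in box_perimeter_points(x0, y0, x1, y1, n_vertices):
        for s in np.linspace(0.0, 1.0, steps):
            x, y = start + s * (box_center - start)
            if mask[int(round(y)), int(round(x))]:
                vertices.append((x, y))
                break
        else:
            vertices.append(tuple(box_center))          # ray never met the mask
    vertices = np.array(vertices)
    center = vertices.mean(axis=0)       # object center = mean of the vertices
    return center, vertices - center     # regression targets are offsets

# A filled disk of radius 10 centered at (20, 20).
yy, xx = np.mgrid[0:41, 0:41]
disk = (xx - 20) ** 2 + (yy - 20) ** 2 <= 100
center, offsets = select_vertices(disk)
```

On the disk example, the 32 selected vertices lie close to the circle boundary and the resulting center is close to the disk center.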
We tested PolyTrack on two datasets, MOTS and KITTIMOTS. We also performed an ablation study to assess the contribution of the various parts of our method.
4.1 Datasets used in the experiments
MOTS (Voigtlaender19CVPR_MOTS) is a small dataset whose training set consists of only four videos and one class, pedestrians. The challenge of using polygons is that a lot of information is already lost when generating the ground truth, so the resulting masks are not perfect. For a dataset like MOTS, whose results are essentially evaluated by IoU-based metrics (MOTSA and sMOTSA), it is simply impossible to obtain a perfect score with these metrics. In Table 1, the first row shows the maximum scores obtained by evaluating the ground truth approximated by polygons against the ground truth provided with the dataset, i.e. our upper bound. We used MOTS as the basis for running our experiments and testing the incremental improvements made to the model. We separated the dataset into a training set consisting of the videos MOTS20-02, MOTS20-05, and MOTS20-11 and a validation set consisting of video MOTS20-09. The results in Table 1 are from experiments conducted on this validation set. Results on the test set are in Table 3.
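To make the notion of an upper bound concrete, the sketch below computes the mask IoU between a ground-truth mask and a coarser approximation of it. The L-shaped mask and its square approximation are toy data chosen for the example, and `mask_iou` is a hypothetical helper, not part of the official evaluation code.

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection over union of two boolean masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum() / union)

# Ground truth: a 10 x 10 square with one 5 x 5 corner removed (concave shape).
gt = np.zeros((12, 12), dtype=bool)
gt[1:11, 1:11] = True
gt[1:6, 6:11] = False
# Coarse approximation that ignores the concavity.
approx = np.zeros((12, 12), dtype=bool)
approx[1:11, 1:11] = True

iou = mask_iou(approx, gt)   # 75 / 100 = 0.75
```

Even an approximation built directly from the ground truth does not reach an IoU of 1 here, which is why the polygon-approximated ground truth gives a score ceiling below the maximum of the metric.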
KITTIMOTS (Voigtlaender19CVPR_MOTS) is a bigger dataset evaluated on cars and pedestrians that contains 21 training and 29 test sequences in urban environments. Methods are evaluated and ranked by HOTA (Luiten2020IJCV). Our results compared with published and peer-reviewed state-of-the-art methods are shown in Table 2 for cars and pedestrians. We also show a study of the usage of the UKF on a custom train/evaluation split of KITTIMOTS in Table 4.
4.2 Details about the training
We realized during our experiments that it was almost impossible to obtain an efficient model when starting from randomly initialized weights. Indeed, a model trained from scratch never really seemed to converge, for a simple reason: the network gives too much importance to the input heatmap and too little to the two other inputs, the current and previous images. This means that the network is then unable to detect objects without already existing tracklets. This is illustrated by the second row of Table 1. Track births are therefore nearly impossible: at the beginning of a video sequence the heatmap passed as input is empty, so the network must rely only on the two images to detect the instances. The solution to the problem is to use a pre-trained network. Therefore, during training, we fine-tuned our models from the weights of a CenterNet backbone (which relies only on images to perform detection) trained on the COCO (DBLP:journals/corr/LinMBHPRDZ14) dataset.
Regarding the model training parameters, we have kept the optimal parameters determined by the CenterTrack authors. Polygons have 32 vertices. We applied standard data augmentation: image flipping, translations, rotations and color shifts. We also introduced perturbations to the input heatmap at training to make it more resilient towards errors at inference: the generated instance centers are slightly shifted from their position in the ground truth, there is a 40% chance that an object center is omitted to simulate a detection error, and there is a 10% chance that a false positive is added. The previous input image is also randomly chosen from the three frames preceding the current frame to make sure the model does not overfit on a single framerate.
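The heatmap perturbations and the random choice of the previous frame can be sketched as follows. The 40% and 10% rates and the three-frame window come from the text above; the jitter magnitude, the per-object application of the false-positive rate, the uniform placement of false positives, and the function names are assumptions made for this example.

```python
import numpy as np

def perturb_centers(centers, img_w, img_h, rng,
                    jitter=2.0, p_miss=0.4, p_false=0.1):
    """Return noisy copies of the ground-truth centers used to render the input heatmap.

    jitter is a hypothetical standard deviation in pixels.
    """
    noisy = []
    for cx, cy in centers:
        if rng.random() < p_miss:
            continue                       # simulate a missed detection
        noisy.append((cx + rng.normal(0.0, jitter),
                      cy + rng.normal(0.0, jitter)))
    for _ in centers:
        if rng.random() < p_false:         # simulate a false positive
            noisy.append((rng.uniform(0, img_w), rng.uniform(0, img_h)))
    return noisy

def sample_previous_index(t, rng, max_gap=3):
    """Pick the index of the previous frame among the max_gap frames before t."""
    gap = int(rng.integers(1, max_gap + 1))   # 1, 2 or 3 frames back
    return max(t - gap, 0)

rng = np.random.default_rng(0)
gt_centers = [(120.0, 80.0), (300.0, 95.0), (410.0, 60.0)]
noisy_centers = perturb_centers(gt_centers, img_w=640, img_h=192, rng=rng)
prev_index = sample_previous_index(t=10, rng=rng)
```

The noisy centers would then be rendered into the input heatmap in place of the exact ground-truth positions.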
4.3 Results and discussion
|DLA34 + W||21.11||34.27||68.8||2014||341||2760||42.19||85.52||87.7|
|HG + W||27.74||47.55||68.85||3036||720||1738||63.59||80.83||72.3|
|HG + W + Deep||32.88||51.59||69.01||2882||383||1892||60.37||88.27||59.6|
|Method||HOTA||DetA||AssA||DetRe||DetPr||AssRe||AssPr||LocA||sMOTSA|
|ViP-DeepLab (vip_deeplab)||76.38%||82.70%||70.93%||88.70%||88.77%||75.86%||86.00%||90.75%||81.03%|
|EagerMOT (Kim21ICRA)||74.66%||76.11%||73.75%||79.59%||90.24%||76.27%||92.70%||90.46%||74.53%|
|MOTSFusion (luiten2019MOTSFusion)||73.63%||75.44%||72.39%||78.32%||90.78%||75.53%||89.97%||90.29%||74.98%|
|ReMOTS (yang2020remots)||71.61%||78.32%||65.98%||83.51%||87.42%||68.03%||92.61%||89.33%||75.92%|
|PointTrack (xu2020Segment)||61.95%||79.38%||48.83%||85.77%||85.66%||79.07%||56.35%||88.52%||78.50%|
|PolyTrack (ours)||57.61%||62.47%||53.97%||67.83%||79.11%||57.13%||81.92%||83.70%||57.49%|
|TrackR-CNN (Voigtlaender19CVPR_MOTS)||56.63%||69.90%||46.53%||74.63%||84.18%||63.13%||62.33%||86.60%||66.97%|
|GMPHD_SAF (gmphdsaf)||55.14%||77.01%||39.76%||81.57%||87.29%||69.22%||49.42%||88.72%||75.39%|
Pedestrians
|Cars w/ UKF||64.25%||62.40%||66.84%||68.22%||77.82%||70.68%||82.15%||83.13%||57.42%|
|Pedestrians w/ UKF||36.33%||37.41%||35.49%||42.26%||55.24%||41.10%||53.30%||69.89%||15.13%|
Table 1 shows the contribution of various elements of our method. First, it shows how important it is to use pre-trained weights for the backbone network. For DLA-34, we used weights from a CenterPoly model trained on the Cityscapes dataset and for the 2-stacks hourglass we used weights from a CenterNet model trained on the COCO dataset. We can also see that using a 2-stacks hourglass backbone greatly improves the polygon accuracy for segmentation. Finally, we found that using more layers in the polygon regression head also improves the segmentation accuracy. This is why we used the 2-stacks hourglass backbone with pre-trained weights and a deeper polygon regression head, as described in Section 3.1, for generating our final results on MOTS and KITTIMOTS.
Qualitative results on the KITTIMOTS dataset notwithstanding, PolyTrack suffers from the evaluation metrics when compared to state-of-the-art methods. These metrics are designed for finer segmentation methods and are quite strict in terms of IoU. This is why our method has the lowest detection recall (DetRe). PolyTrack resides in between bounding box tracking and fine segmentation tracking, providing a bounding polygon instead of a bounding box at almost no additional cost. When compared to other methods, PolyTrack performs better with cars than with pedestrians (Table 2). This can be explained by the fact that cars are a lot less deformable than pedestrians, and therefore their representation is easier to learn with polygons. Also, pedestrians are more concave than cars, making them harder to segment with a fixed number of points. For instance, when looking at the detection precision (DetPr), PolyTrack is approximately 6.5 points below PointTrack (xu2020Segment) for the Car category (79.11% versus 85.66%), and the gap on the same metric is larger for the Pedestrian category. The results in Table 3 confirm this trend with passable results on MOTS, where only pedestrians are to be tracked. We also believe that MOTS is too small to learn to appropriately extract polygons and that KITTIMOTS better reflects the potential of our method.
Table 4 shows that the UKF improves the tracking significantly for both cars and pedestrians. However, our approach still has its limitations, as the association is still only based on the position of the object center. If two instances cross each other, an identity switch is always possible if the positions of the two instances from one frame to the other correspond exactly at the time of the crossing. The greedy algorithm can then very well swap the two identities at the time of the association, even more so if more than two objects cross at the same time. An improvement of the algorithm based on the UKF could be to include the information of the velocity and acceleration vectors produced by the UKF in the greedy association algorithm. This would allow avoiding potential identity switches between two objects crossing each other (which would therefore have opposite velocity vectors).
In this work, we presented PolyTrack, a novel multi-object tracking and segmentation method that tracks objects using a detect then track paradigm. PolyTrack detects objects by their center keypoint, segments them using bounding polygons and tracks them with offset regression and an Unscented Kalman filter. PolyTrack shows promising results on MOTS and KITTIMOTS, especially considering the speed / accuracy trade-off.