Multiple object tracking (MOT) aims to detect objects in video sequences and assign a unique identity to each of them. It is in many cases solved by the tracking-by-detection paradigm, which consists in separating the problem into two distinct tasks, detection and association (which we do in this work), but can also be solved by tracking-by-regression, a method that performs these two actions in parallel for short-term associations. While recent advances in machine learning (ML) have led to a huge performance gain for the detection phase in MOT, the association phase remains a challenge, especially because of its combinatorial complexity.
In this paper, to better deal with the combinatorial complexity in the association phase, we propose a module based on Constraint Programming (CP) that can be grafted onto any existing tracker in order to improve its object association results. It can be applied to methods from either the tracking-by-detection or the tracking-by-regression framework, since our method addresses long-term data association at the tracklet level. Our proposed model is divided into three independent phases. The first phase, called TrackletCutter, consists in recovering the tracklets provided by a base tracker and cutting them in places where uncertain associations are spotted. In the second phase, called CP Associator, we associate the previously constructed tracklets using a Belief Propagation Constraint Programming model, for which we propose various novel constraints that assign scores to tracklet associations based on multiple characteristics, such as their dynamics or the distance between them in time and space. Finally, the last phase is a rudimentary interpolation model that fills in the remaining holes in the trajectories we built.
In the experiments, we show the benefit of our method with improvements in the results for all three of the state-of-the-art trackers on which we have tested it (3 to 4 points gained on the HOTA and IDF1 metrics).
II Background and related work
MOT is a fast-growing field with many existing approaches whose performance has been greatly improved by the recent advances in machine learning. It can mostly be divided into two aspects (which are most often phases of resolution): object detection and object association.
II-A Object Detection
ML detection techniques have recently allowed great advances in the field of object detection. This is the case, for example, of the detectors of the R-CNN family [RCNN], whose principle consists in extracting regions of interest (ROI) and then, via convolutional neural networks (CNNs), inferring the main features of these ROI in order to find the class of an object and its exact position in the image. Improvements have been made to this method, in particular with the Faster R-CNN detector (FRCNN) [FRCNN], for which the region proposal method is itself a neural network called Region Proposal Network. The typical output of a detector is usually provided in a format $(x, y, w, h)$ (where $(x, y)$ represent the spatial coordinates of the upper-left corner of the bounding box and $(w, h)$ its width and height). Such detections are typically the inputs of trackers. Since detection is not the subject of this article, we simply note here that better detections lead to better tracking.
II-B Object Association
Given a set of detections at each frame, a MOT method typically builds tracks by associating detections across frames. The tracker aims at assigning a single identifier to each object of interest. To efficiently perform the association phase, two questions arise:
How to represent the detections in such a way as to be able to recognize and differentiate the different objects of interest?
How to efficiently explore the set of solutions and reach a satisfactory solution in a reasonable time?
II-B1 Models to describe the detections
Regarding the representation of detections, for an efficient association, we generally seek to maximize one (or more) similarity metric between associated detections, and these metrics are computed from descriptors that are mainly divided into motion and appearance models.
Motion models are generally based on the fact that the objects of interest being tracked meet a certain number of physical constraints concerning their speed of movement and deformation in the image. Because videos are often captured at more than 30 frames per second, the displacement in pixels of the classes of objects of interest is normally small between frames. We can consider what we will call positional models which make the hypothesis that at the scale of successive frames, the object of interest followed is immobile, and of fixed size and shape. This allows the design of very simple and therefore very fast models such as the one on which much of our study is based, IOU-Tracker [IOU2017], which consists in looking for the best associations of detections frame by frame by maximizing their IOU (quotient of their intersection and union surfaces). This metric measures the superposition of two detections, and penalizes the differences in size and position. To increase the accuracy of this kind of positional model, one can consider that the derivatives of these position and size values are fixed and use these to predict the position of the object in subsequent frames. This is the principle of the Kalman filter [Kalman] used for example in SORT [SORT], one of the trackers on which we test our model. The tracking task can be simplified by keeping a model of the camera movement in parallel with the movement models of the objects of interest [CMC].
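As a minimal sketch of the overlap measure used by such positional models, the IOU of two boxes can be computed as follows. We assume here that each box is a tuple (x, y, w, h) with (x, y) the upper-left corner; the function name `iou` is an illustrative choice and is not taken from IOU-Tracker.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h).

    Assumption: (x, y) is the upper-left corner, w and h are positive.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (zero if disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

Two identical boxes give a value of 1, disjoint boxes give 0, and any difference in size or position lowers the value, which is why this metric penalizes both.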
Appearance models consist in the description of some visual characteristics of the objects of interest (or rather of their bounding boxes) and are based on the fact that an object of interest generally keeps a similar appearance through time (or changes only slightly and progressively), which allows detections with similar descriptors to be associated. These descriptors are generally based on the distribution of colors [HOC], gradients of intensities [HOG] or more advanced methods such as covariance matrices and multiple kinds of filters [KCF]. CNNs can also be used to model appearance [DAN], as well as transformers [MOTer_TransCtr] and Mixture Density Neural Networks [trajeocc]. The method proposed by Tracktor [Tracktor], on which we test our model, consists precisely in a regression of bounding boxes from one frame to another to extend the trajectories. CenterTrack [CenterTrack] works in a similar way but operates on the centers of the detections, which allows it to be more robust to occlusions. For all these methods the regression is used for short-term association.
II-B2 Resolving the association problem
Once the representation models have been chosen, we try to associate detections with one another to build our trajectories. The goal is then to assign a trajectory identifier to each detection to ensure that we can track every object of interest from its entry into the camera field of view (FOV) to its exit from it. Trackers can be divided into two categories regarding the way they process data.
Online trackers [IOU2017, SORT] aim to build their trajectories in real-time and therefore work frame by frame in chronological order. The method they use must thus be incremental and build up tracks by adding detections to existing tracks at each new frame. The tracker must have a criterion to open a track (i.e. state that a new object entered the FOV), to close one (i.e. state that it has left the FOV) and to add a detection to a track. These kinds of trackers can only be based on information from past trajectories and current detections (that provide only static information) and cannot return to modify previously constructed trajectories.
Offline trackers remove the real-time constraint and can thus be applied to the whole set of detections at once. Our work fits into this paradigm, as it allows us to consider the association problem as a global optimization problem. Even if it allows access to many new characteristics of the objects of interest and modes of resolution, this shift to offline greatly increases the complexity of the problem. Different avenues are taken to resolve that difficulty. Some methods resolve it by still working frame by frame, whether it is with dynamic programming [GOG] or by going through time in both directions [TMOH] to make their results more robust than they would be online. Some choose to go from a local to a global scale, as in the H2T tracker [H2T], which divides frames into small subsets, minimizes a sum of affinity functions to associate detections within them, and then recursively solves the same problem on bigger and bigger associated sets until whole tracks are finalized.
III Approach and methodology
We build on the work of [pineault_article], who was the first to apply CP to multi-object tracking. The benefit of choosing CP is that it is a very efficient method of formalizing and solving optimization problems [pesant_cp]. This is the reason why we have decided to pursue that direction. The main limitation of this previous work was its choice of working on the association of pairs of detections. Indeed, this made the search space extremely vast and therefore the computation time quite high. This forced the decision to work in batches of frames and to strongly restrict the spatial distances between the bounding boxes considered, which increases the sensitivity of the model to occlusions.
To remedy this difficulty, our main idea is to work at the level of tracklets (sequences of detections) instead of individual detections. This greatly reduces the size of the problem. Indeed there are simple trackers that are extremely efficient in terms of execution time (notably IOU-Tracker [IOU2017] that manages to process up to 100,000 frames per second, making it hundreds of times faster than most of the other state-of-the-art trackers) and they are good at making associations in simple cases. We decided to start from them to capitalize on their speed and try to correct their errors a posteriori, especially those which intervene in the case of occlusions. This choice also allows us to increase the refinement of the association model, which can now be based on the characteristics of the tracklets, which are dynamic, and not only on those of the detections, which are static.
Our model is also applicable to all kinds of trackers since we work from their outputs. Here we will speak of tracklet for any association of detections provided by an initial tracker, whether we have cut it or not, and of trajectory to speak either of the associations of tracklets that we constructed, or of the associations of detections provided by the ground truth representing a single object.
III-B A three-phase model
Our model is divided into three modules:
the TrackletCutter cuts the tracklets provided by the initial tracker where they intersect each other with a high degree of overlap;
the CP Associator is the original association model we propose here, based on a Belief Propagation Constraint Programming algorithm [miniCPBP];
an interpolation model fills the gaps in trajectories by linearly interpolating the missing detections from the ones at the edges of the gaps.
While the CP Associator cannot be disabled (as it would prevent association), the other two modules are optional.
III-C TrackletCutter - Cutting tracklets on overlapping sections
Most trackers are sensitive to occlusions, which often leads to fragmentation (i.e. tracks being cut into multiple tracklets). Our TrackletCutter addresses a complementary problem: it is designed to separate tracklets that are at risk of containing multiple different objects of interest. As we know that this risk is at its highest when multiple objects of interest cross paths, we proceed as follows: as shown in Figure 2, whenever in one frame two detections have an overlap (IOU) that reaches a fixed threshold, the tracklets to which they belong are cut at that frame.
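The cutting rule can be illustrated with the following sketch. It is not the released implementation: the data layout (each tracklet as a dictionary mapping a frame index to an (x, y, w, h) box), the function names and the choice of starting the new piece at the overlapping frame are all assumptions made for the example.

```python
from itertools import combinations


def iou(a, b):
    """IOU of two (x, y, w, h) boxes, (x, y) being the upper-left corner."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    union = a[2] * a[3] + b[2] * b[3] - iw * ih
    return iw * ih / union if union > 0 else 0.0


def cut_tracklets(tracklets, threshold):
    """Cut tracklets at every frame where two of them overlap too much.

    tracklets: list of dicts mapping frame -> (x, y, w, h) (assumed layout).
    Returns the list of resulting pieces, in the same layout.
    """
    # Collect, for each tracklet, the frames at which it must be cut.
    cut_frames = [set() for _ in tracklets]
    for i, j in combinations(range(len(tracklets)), 2):
        for frame in tracklets[i].keys() & tracklets[j].keys():
            if iou(tracklets[i][frame], tracklets[j][frame]) >= threshold:
                cut_frames[i].add(frame)
                cut_frames[j].add(frame)

    pieces = []
    for track, cuts in zip(tracklets, cut_frames):
        current = {}
        for frame in sorted(track):
            # Assumption: the overlapping frame opens the new piece.
            if frame in cuts and current:
                pieces.append(current)
                current = {}
            current[frame] = track[frame]
        if current:
            pieces.append(current)
    return pieces
```

With this literal reading, an overlap that lasts several consecutive frames produces single-detection pieces; cutting only at the first frame of an overlapping run would be an equally plausible variant.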
III-D Tracklet modeling
We define our set of tracklets as $T = \{t_1, \dots, t_n\}$ and each tracklet $t_i$ is, as shown in Figure 3, described by:
a frame: $f_i^{s}$ (resp. $f_i^{e}$) the frame in which the first (resp. last) detection of the tracklet appears;
a bounding box: $(x_i^{s}, y_i^{s}, w_i^{s}, h_i^{s})$ (resp. $(x_i^{e}, y_i^{e}, w_i^{e}, h_i^{e})$) the mean of the spatial coordinates of the six bounding boxes following the first one (resp. preceding the last one) of the tracklet;
a speed: $(v_{x,i}^{s}, v_{y,i}^{s})$ (resp. $(v_{x,i}^{e}, v_{y,i}^{e})$) the mean speed of the centers of the six bounding boxes following the first one (resp. preceding the last one) of the tracklet.
If $t_i$ is formed by fewer than ten detections, the averaging over the six first and last bounding boxes is not performed. The speed is then computed between the first (resp. last) two bounding boxes and the spatial coordinates are those of the first (resp. last) bounding box. However, doing this for every tracklet would have led us to put too much weight on the quality of these first and last bounding boxes, which are by essence the least representative of the object of interest (as we can infer that the track has been separated at them because of some defect or occlusion), hence the choice of averaging the few following (resp. preceding) detections. The choice of working on six bounding boxes is however arbitrary and could be refined in future work.
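The descriptors above can be sketched as follows. The function names, the list-of-boxes layout and the assumption that a tracklet holds one (x, y, w, h) box per consecutive frame are ours; the window of six boxes and the limit of ten detections come from the text.

```python
def center(box):
    """Center of an (x, y, w, h) box whose (x, y) is the upper-left corner."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def endpoint_descriptor(boxes, which, n_avg=6, min_len=10):
    """Return (box, speed) describing the start or the end of a tracklet.

    boxes: chronological list of (x, y, w, h), one per consecutive frame
           (assumption: no missing frame inside a tracklet).
    which: "start" or "end".
    Speed is expressed in pixels per frame.
    """
    if which not in ("start", "end"):
        raise ValueError("which must be 'start' or 'end'")
    at_start = which == "start"

    if len(boxes) < min_len:
        # Short tracklet: no averaging, use the extreme box and the
        # displacement between the two extreme boxes.
        ref = boxes[0] if at_start else boxes[-1]
        pair = boxes[:2] if at_start else boxes[-2:]
        if len(pair) < 2:
            return ref, (0.0, 0.0)
        (ax, ay), (bx, by) = center(pair[0]), center(pair[1])
        return ref, (bx - ax, by - ay)

    # Six boxes following the first one (resp. preceding the last one).
    window = boxes[1:1 + n_avg] if at_start else boxes[-1 - n_avg:-1]
    mean_box = tuple(sum(b[k] for b in window) / len(window) for k in range(4))
    # Mean per-frame displacement of the centers across the window.
    (ax, ay), (bx, by) = center(window[0]), center(window[-1])
    steps = max(len(window) - 1, 1)
    return mean_box, ((bx - ax) / steps, (by - ay) / steps)
```

For a tracklet of twelve boxes sliding one pixel to the right per frame, both endpoints get a speed of one pixel per frame along x, and the averaged boxes sit a few frames inside the tracklet rather than on its noisy extremities.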
III-E Associating tracklets with Constraint Programming
Once the tracklets have been modeled, the goal is to associate them using our CP Associator. To do so, we need to model the problem following the CP paradigm, to define our use of constraints (both filtering and assigning marginals) and to define the methods we use to explore the solution space.
III-E1 Modeling the association problem in a CP paradigm
A CP model is given by a finite set of variables, each taking its value from a finite set called its domain. Constraints are then specified on the combinations of values that these variables can take. This model defines a Constraint Satisfaction Problem (CSP). In our case, the tracklet association phase is modeled as:
$S = \{s_1, \dots, s_n\}$, our set of successor variables such that, for every tracklet $t_i$, $s_i$ is its successor, meaning that $s_i$ immediately follows $t_i$ in the same trajectory.
For every tracklet $t_i$, the domain $D(s_i)$ contains every tracklet that starts temporally after $t_i$ ends, and a stopping value (meaning that $t_i$ is the last tracklet of its trajectory).
$C$ is our set of constraints, which we use to filter the domains of each successor and to assign a score to each tracklet-successor pair.
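As an illustration of the domains described above, the following sketch builds, for each tracklet, the list of tracklets that start after it ends, plus a stopping value. The `STOP` sentinel, the function name and the dictionary layout are hypothetical; this is plain Python and not the MiniCPBP model itself.

```python
STOP = "stop"  # hypothetical sentinel for "last tracklet of its trajectory"


def successor_domains(spans):
    """Build the initial successor domain of every tracklet.

    spans: dict mapping a tracklet id to (first_frame, last_frame).
    A tracklet can only be followed by one that starts strictly after
    it ends, or by the stopping value.
    """
    domains = {}
    for tid, (_, last) in spans.items():
        later = [other for other, (first, _) in spans.items()
                 if other != tid and first > last]
        domains[tid] = later + [STOP]
    return domains
```

For instance, a tracklet covering frames 1 to 10 can be followed by one covering frames 12 to 20, but not by one covering frames 5 to 15, since the two overlap in time.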
III-E2 Using constraints to filter
Constraints are most often used to restrict the domains of variables. We use them in that fashion to ensure the following characteristics for our trajectories:
No detection should be found in multiple trajectories and therefore no tracklet should be assigned to multiple trajectories. To accomplish that goal, we used the allDifferent constraint. Applied to the whole set of variables, it ensures that no two variables are assigned the same value. It is given by:
As we suppose in this part of our model that our tracklets are perfect (namely that each detection in a specific tracklet belongs to a single object), there should be no overlap in time between tracklets assigned to a single trajectory. Therefore, we define the temporal consistency constraint as follows:
III-E3 Score-based constraints
We also propose to use constraints in a different manner, that is to assign a score $\sigma_c(t, t')$ to each pair $(t, t')$ (where $t$ is a tracklet and $t'$ is a successor) based on a given characteristic $c$. Each constraint is based on a distance $d_c(t, t')$ between $t$ and $t'$. These different kinds of distances (that can be, as we will see below, temporal or spatial for example) are then transformed into scores ranging from 0 to 1 where:
$\sigma_c(t, t') = 0$ leads to the immediate removal of $t'$ from the domain of the successors of $t$.
$\sigma_c(t, t') = 1$ leads to the immediate assignment of $t'$ to $t$, i.e. $t'$ becomes the successor of $t$.
$\sigma_c(t, t') > \sigma_c(t, t'')$ means that according to characteristic $c$, $t'$ is a more likely candidate than $t''$ to be the successor of $t$.
Each score-based constraint is assigned three thresholds that will help define their behavior:
: value of the distance for which we set the score at (i.e. ).
: value of the distance between the tracklet and the fictional tracklet which represents the stopping of the trajectory at . It is indeed essential to compare the set of possible successors to the possibility of associating with none of them (which would mean that is the last tracklet of its trajectory).
: value of the distance beyond which the examined successor is removed from . This threshold may or may not be activated, but if it is and it is reached for a given pair , is then equal to and is removed from .
As shown in Figure 4, as long as (where it automatically falls down to 0), the score of the association of and according to the characteristic , , is calculated as follows. We compute
so that for a distance of . Finally the actual score is bounded as follows
where is the lower bound of the score and the upper bound (so that it does not reach 0 or 1).
Constraint on time spacing
The first kind of score-based constraint we developed favors a small temporal distance between a tracklet and its successor. For each tracklet-successor pair, we get a time distance metric such that:
Constraints on dynamics
We also decided to build constraints that aim to keep the trajectories as smooth as possible by minimizing the discontinuities in acceleration. For each tracklet-successor pair, we then obtain two metrics, an angle difference and a speed norm difference, such that:
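One plausible reading of these two dynamics metrics is sketched below. We assume, without confirmation from the text, that they compare the speed vector at the end of a tracklet with the speed vector at the start of the candidate successor; the function names are illustrative.

```python
import math


def angle_difference(v_end, v_start):
    """Unsigned angle in radians between two 2D speed vectors.

    Assumption: v_end is the speed at the end of the tracklet and
    v_start the speed at the start of the candidate successor.
    """
    dot = v_end[0] * v_start[0] + v_end[1] * v_start[1]
    cross = v_end[0] * v_start[1] - v_end[1] * v_start[0]
    # atan2 of (cross, dot) is well defined even for tiny vectors.
    return abs(math.atan2(cross, dot))


def speed_norm_difference(v_end, v_start):
    """Absolute difference between the norms of two 2D speed vectors."""
    return abs(math.hypot(*v_end) - math.hypot(*v_start))
```

Both quantities are zero when the successor starts with exactly the speed the tracklet ended with, and grow with the change of direction or of pace that the association would imply.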
Constraints on a predicted position
We can suppose that, with the help of the aforementioned time distance constraint, the tracklets most likely to be associated are those that are not too temporally distant. As the shorter the time interval, the less the object of interest can change its speed, direction and even position in the image, we decided to add a constraint on a predicted position. Following the example of a Kalman filter [Kalman], we consider that projecting a bounding box using its speed onto subsequent frames not too far apart is a relatively efficient prediction mechanism. Therefore we propose two constraints, which are described in Figure 5: they compare the predicted position of the considered object and the evaluated successor by IOU or by the distance between their centers (which is largely considered to be more robust to occlusions [CenterTrack]).
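The prediction step can be sketched as a constant-speed projection of the end box of a tracklet onto the first frame of the candidate successor, followed by one of the two comparisons. The function names and the (x, y, w, h) layout are assumptions; the size of the box is kept fixed during the projection, which is a simplification of ours.

```python
import math


def predict_box(box, speed, gap):
    """Shift an (x, y, w, h) box by `gap` frames at constant speed.

    speed is in pixels per frame; width and height are left unchanged
    (assumption made for this sketch).
    """
    x, y, w, h = box
    vx, vy = speed
    return (x + vx * gap, y + vy * gap, w, h)


def center_distance(box_a, box_b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ax, ay = box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0
    bx, by = box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0
    return math.hypot(bx - ax, by - ay)


def iou(a, b):
    """IOU of two (x, y, w, h) boxes, (x, y) being the upper-left corner."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    union = a[2] * a[3] + b[2] * b[3] - iw * ih
    return iw * ih / union if union > 0 else 0.0
```

A candidate successor whose first box lies close to `predict_box(end_box, end_speed, gap)` would then receive a small center distance, or a high IOU, depending on which of the two constraints is activated.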
III-E4 Adaptation to video sequences properties
As the video sequences we work with have different characteristics regarding dimensions and capture speed (measured in frames per second or FPS), we chose to adapt the parameters of the score-based constraints. Indeed, assigning a given time distance score to a pair of tracklets separated by six frames has a very different meaning for a sequence captured at a given FPS than for one four times faster. Therefore we decided to adapt the time distance constraint to the FPS of the sequence, the predicted center distance constraint to the size of the image (represented by proxy by the length of the diagonal of the image), and the speed norm difference constraint to the FPS and the diagonal.
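Once each of the successor domains has been restricted by the activated relevant constraints, each remaining pair $(t, t')$ (where $t$ is a tracklet and $t'$ a value left in its successor domain $D(t)$) gets a score from each activated score-based constraint (as explained before). That leaves us with up to 5 scores per pair that we would like to use to guide our solution exploration. To do so, we combine these into marginals, following in a way the example set by [CEM] where the authors try to minimize a sum of energies. We compute these marginals as the product of scores normalized over $D(t)$:
$$ m(t, t') = \frac{\prod_{c} \sigma_c(t, t')}{\sum_{u \in D(t)} \prod_{c} \sigma_c(t, u)} $$
where $\sigma_c(t, t')$ denotes the score given to the pair by the activated score-based constraint $c$.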
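The normalization described above can be sketched in a few lines. The dictionary layout and the function name are assumptions; the only operation taken from the text is the product of the scores of a candidate, normalized over the successor domain.

```python
from math import prod


def marginals(scores_by_candidate):
    """Turn per-constraint scores into marginals over one successor domain.

    scores_by_candidate: dict mapping each candidate successor of a given
    tracklet to the list of scores it received (one per activated
    score-based constraint). Scores are assumed to lie strictly
    between 0 and 1.
    """
    products = {cand: prod(scores) for cand, scores in scores_by_candidate.items()}
    total = sum(products.values())
    if total == 0:
        raise ValueError("all candidates have a zero score")
    # Normalize so that the marginals of one tracklet sum to 1.
    return {cand: p / total for cand, p in products.items()}
```

Because the scores are multiplied, a single very low score on one characteristic is enough to make a candidate unlikely, whatever its scores on the others.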
Usually when Belief Propagation (BP) is used in CP, the marginals built represent the density of solutions resulting from the branching of a constraint for the considered variable-value pair. It is used here to convey our marginals.
III-E6 Exploration method
A branching heuristic called MaxMarginal has been developed in MiniCPBP [miniCPBP]. It consists in guiding the construction of the search tree by exploring the tracklet-successor pairs in descending order of marginal. Regarding the exploration strategy, to stay close to the model by promoting high-marginal associations first, we use Depth First Search (DFS), which consists in exploring the search tree by taking deviations as low as possible in the search tree if a valid solution is not found at first.
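To make the exploration concrete, here is a simplified plain-Python sketch of a depth-first search that branches on the highest marginal first. It does not use the MiniCPBP API and, unlike a belief propagation solver, it never recomputes the marginals during the search. The `STOP` sentinel, the function name and the choice of exempting the stopping value from the allDifferent check are assumptions.

```python
STOP = "stop"  # hypothetical stopping value


def first_solution(marginals):
    """Depth-first search guided by the highest marginals.

    marginals: dict tracklet -> dict candidate successor -> marginal.
    Returns a dict tracklet -> successor for the first valid solution
    found, or None if there is none.
    """
    used = set()       # tracklets already chosen as a successor
    assignment = {}

    def allowed(t):
        # allDifferent: a tracklet can be the successor of one tracklet only.
        # Assumption: the stopping value is exempt, since several
        # trajectories may end.
        return [c for c in marginals[t] if c == STOP or c not in used]

    def best_marginal(t):
        return max((marginals[t][c] for c in allowed(t)), default=-1.0)

    def dfs(unassigned):
        if not unassigned:
            return True
        # Branch on the tracklet holding the highest available marginal,
        # then try its candidates in descending order of marginal.
        t = max(unassigned, key=best_marginal)
        for cand in sorted(allowed(t), key=marginals[t].get, reverse=True):
            assignment[t] = cand
            if cand != STOP:
                used.add(cand)
            if dfs(unassigned - {t}):
                return True
            # Backtrack: the deviation happens as deep as possible first.
            used.discard(cand)
            del assignment[t]
        return False

    return assignment if dfs(frozenset(marginals)) else None
```

When two tracklets both prefer the same successor, the one with the higher marginal obtains it and the other falls back on its next best candidate, possibly the stopping value.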
III-F Interpolation model
Once the tracklets are associated with each other, it is likely that the resulting trajectories will have gaps, i.e. sequences of frames in which the object disappears before reappearing. This is why we decided to integrate a simple interpolation model in our method, which works as follows:
As shown in Figure 6, we identify the holes in each trajectory and, if they are smaller than a fixed threshold, we fill them by making a linear interpolation from the detection preceding the hole to the one following it. Simply put, we consider that the object has moved (and changed shape or size) at a constant speed from the detection that precedes the hole to the one that follows it, and we add all the missing detections to the trajectory.
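This interpolation step can be sketched as follows, assuming a trajectory is stored as a dictionary mapping a frame index to an (x, y, w, h) box. The function name and the `max_gap` parameter name are illustrative.

```python
def fill_gaps(trajectory, max_gap):
    """Linearly interpolate the missing detections of a trajectory.

    trajectory: dict mapping frame -> (x, y, w, h) (assumed layout).
    max_gap: holes of at least this many missing frames are left untouched.
    Returns a new dict; the input is not modified.
    """
    filled = dict(trajectory)
    frames = sorted(trajectory)
    for prev, nxt in zip(frames, frames[1:]):
        missing = nxt - prev - 1
        if 0 < missing < max_gap:
            before, after = trajectory[prev], trajectory[nxt]
            for frame in range(prev + 1, nxt):
                # Fraction of the way from the box before to the box after.
                alpha = (frame - prev) / (nxt - prev)
                filled[frame] = tuple(
                    b + alpha * (a - b) for b, a in zip(before, after)
                )
    return filled
```

Position and size are interpolated together, which matches the idea that the object both moves and changes shape at a constant rate across the hole.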
IV Experiments and results
IV-A Evaluation Dataset
We have chosen to evaluate our model on the training set of the MOT17 [MOT16] challenge based on the detections of the best proposed detector, FRCNN [FRCNN]. This dataset represents a reference in the field and presents many difficulties that are particularly interesting to confront. Whether it is the often high occlusion rates, turbulence, moving cameras, strong variations in light exposure, sometimes subjective POV and other times very elevated POV, or the numerous reflections present in these videos, we are dealing with extremely varied situations that should allow us to evaluate our model in the majority of situations that can happen in urban settings.
IV-B Evaluation metrics
We evaluate our method using three of the main MOT metrics: MOTA (and MOTP) [CLEAR-MOT] that mainly measure the quality of detections, and are used very broadly in the literature, IDF1 [IDF1] that refers more to the quality of the association between detections, also widely used in the literature, and HOTA [HOTA] a more recent metric that accounts for both the performance in terms of association of detections and the quality of these detections.
We tested multiple combinations of parameters for each of our modules (activating or deactivating each of the constraints) to find interactions and select the best configuration. Table I presents the best configuration we found by applying our model to the MOT17 training data. We found during these calibration sessions that the vast majority of high-scoring configurations were those that did not give score-based constraints the ability to filter successors (i.e. reduce their score to 0). In the configuration presented below, we therefore disabled this ability, so that score-based constraints only guide the search.
Concerning the exploration of the solutions, it turned out that pushing the search beyond the first valid solution, for multiple configurations of parameters and constraints, did not yield better results than stopping at the first one found by branching on the maximal marginals (except for the very bad models, which in all cases obtained worse solutions than the initial tracker), so we decided to stop at the first valid solution in our exploration. The code for our method can be found at github.com/reminahon/tracklet_associator.
IV-D Results and Discussion
| Method | HOTA | MOTA | IDF1 |
| IOU-Tracker + CP | 45.34% | 49.91% | 53.78% |
| IOU-Tracker + TC + CP | 45.47% | 49.90% | 54.11% |
| IOU-Tracker + CP + Int | 46.04% | 50.62% | 54.38% |
| IOU-Tracker + TC + CP + Int | 46.18% | 50.49% | 54.74% |
| SORT + CP | 45.40% | 48.71% | 54.34% |
| SORT + TC + CP | 45.14% | 48.70% | 53.96% |
| SORT + CP + Int | 46.35% | 49.46% | 54.92% |
| SORT + TC + CP + Int | 46.06% | 49.39% | 54.58% |
| Tracktor + CP | 56.39% | 61.87% | 67.40% |
| Tracktor + TC + CP | 55.89% | 61.85% | 66.77% |
| Tracktor + CP + Int | 56.98% | 62.99% | 67.88% |
| Tracktor + TC + CP + Int | 56.49% | 62.96% | 67.29% |
Results are given in Table II. Whatever the tracker it is applied to, our model yields improvements of several percentage points on the three scores that interest us: HOTA, MOTA and IDF1. Concerning HOTA, the main metric of our evaluation, we obtain an improvement for IOU-Tracker, for SORT and for Tracktor, which already had a rather high score (higher than the two others originally). This suggests that our module is likely effective on any type of tracker, independently of its initial level of performance or its tracking paradigm.
Our goal was mainly to improve data association with our CP Associator, but we also addressed the detection phase with the interpolation module. HOTA and IDF1 are the two metrics that are the most sensitive to the quality of the associations, as opposed to MOTA, which is not very sensitive to data association quality but more sensitive to detection quality. We observe that the CP association model is the module that brings the largest share of the improvements in terms of HOTA and IDF1 to the results of the three trackers. The rest of the improvements are mainly brought by the interpolation model, which improves the MOTA by more than one point for Tracktor in particular, by adding missing detections.
However, the IOU-Tracker is the only tracker for which the TrackletCutter really improves the results. This may be because this tracker makes more of the occlusion-related errors that our method detects. Still, it seems that our model performs adequately without the TrackletCutter. It turns out that this is the part of the model that requires the most computation time, for little to no improvement. We would therefore advise against using it for any tracker other than IOU-Tracker. Moreover, on the whole MOT17 training set, applying our model to get the improved trajectories takes between 30 and 60 seconds without the TrackletCutter and up to 5 minutes with it. Even without the TrackletCutter, our model remains of interest insofar as trackers tend to suffer from fragmentation, which we correct with our association. One could even postulate that the more a tracker suffers from fragmentation, the more our post-processing can help its tracking performance.
We presented a method that can be used as a post-processing step for any state-of-the-art multi-object tracker to improve its association performance, as we have been able to show by testing it on the trackers IOU-Tracker, SORT and Tracktor on the MOT17 dataset. This demonstrates its competitiveness in the field of pedestrian tracking. In addition, we propose here the first association model based on Constraint Programming with Belief Propagation. Furthermore, a strength of our method for future improvements lies in its modularity: each module we propose (whether it is the TrackletCutter, the association model or the interpolation one) can be substituted with another one that accomplishes the same function. New constraints based on other characteristics (such as appearance for example) can also be added without any major changes in the architecture of the model.