This is the code for the paper "Learning a Proposal Classifier for Multiple Target Tracking"
The recent trend in multiple object tracking (MOT) is heading towards leveraging deep learning to boost the tracking performance. However, it is not trivial to solve the data-association problem in an end-to-end fashion. In this paper, we propose a novel proposal-based learnable framework, which models MOT as a proposal generation, proposal scoring and trajectory inference paradigm on an affinity graph. This framework is similar to the two-stage object detector Faster RCNN, and can solve the MOT problem in a data-driven way. For proposal generation, we propose an iterative graph clustering method to reduce the computational cost while maintaining the quality of the generated proposals. For proposal scoring, we deploy a trainable graph-convolutional-network (GCN) to learn the structural patterns of the generated proposals and rank them according to the estimated quality scores. For trajectory inference, a simple deoverlapping strategy is adopted to generate tracking output while complying with the constraints that no detection can be assigned to more than one track. We experimentally demonstrate that the proposed method achieves a clear performance improvement in both MOTA and IDF1 with respect to previous state-of-the-art on two public benchmarks. Our code is available at <https://github.com/daip13/LPC_MOT.git>.
Tracking multiple objects in videos is an important problem in many application domains. In particular, estimating humans' locations and their motion is of great interest in surveillance, business analytics, robotics and autonomous driving. Accurate and automated perception of their whereabouts and their interactions with others or with the environment can help identify potential illegal activities, understand customer interactions with retail spaces, and plan the pathways of robots or autonomous vehicles.
The ultimate goal of multiple object tracking (MOT) is to estimate the trajectory of each individual person as one complete track over their whole presence in the scene, without any contamination by the others. Much research has been done in this domain to design and implement robust and accurate MOT algorithms [9, 31, 51]. However, the problem still remains unsolved, as reported in the latest results on various public benchmarks [17, 19, 21, 40]. The key challenges in MOT are mostly due to occlusion and scene clutter, as in many computer vision problems. Consider the case when two people (yellow and purple boxes in Fig. 1) are walking together in a spatial neighborhood. At one point, both people are visible to the camera, and recent object detection algorithms like [36, 46, 47] can easily detect them. When the two people become aligned along the camera axis, however, one is fully occluded by the other, and later both become visible again when one passes the other. Since the visual appearance may differ only subtly between the two targets due to various reasons like illumination, shading, or similar clothing, estimating the trajectory accurately without contamination (often called identity transfer) remains the key challenge. In more crowded scenes, such occlusions can happen across multiple people, which poses significant trouble to any MOT algorithm. Moreover, the MOT problem naturally has an exponentially large search space for the solution (the tracking-by-detection approach, which is the de facto framework in the MOT domain, needs to solve the data-association problem given the detections at each timestamp; the size of the hypothesis space is exponential in the number of detections), which prohibits us from using overly complicated mechanisms.
Traditional approaches focus on solving the problem by employing various heuristic, hand-defined mechanisms to handle occlusions [10, 31]. Multiple Hypotheses Tracking (MHT) is one of the earliest successful algorithms for MOT. A key strategy in MHT for handling occlusions is to delay data-association decisions by keeping multiple hypotheses active until data-association ambiguities are resolved. Network flow-based methods [10, 12] have recently become a standard approach for MOT due to their computational efficiency and optimality. In this framework, the data-association problem is modeled as a graph, where each node represents a detection and each edge indicates a possible link between nodes. Occlusions can then be handled by connecting non-consecutive node pairs. Both MHT and network flow-based methods need manually designed, gap-spanning affinities for different scenarios. However, it is infeasible to enumerate all possible challenging cases and to implement deterministic logic for each case.
In this paper, we propose a simple but surprisingly effective method to solve the MOT problem in a data-driven way. Inspired by the latest advancements in object detection and face clustering, we design the MOT algorithm around two key modules: 1) proposal generation and 2) proposal scoring with a graph convolutional network (GCN). Given a set of short tracklets (locally grouped sets of detections obtained with simple mechanisms), our proposal generation module (see Fig. 1(b)) generates a set of proposals that contains the complete set of tracklets fully covering each individual person, yet may also contain proposals with contaminated sets of tracklets (i.e., multiple different people merged into one proposal). The next step is to identify which proposals are better than others by using a trainable GCN and to rank them with the learned scoring function (see Fig. 1(c)). Finally, we adopt an inference algorithm to generate the tracking output given the rank of each proposal (see Fig. 1(d)), while complying with the typical tracking constraints, such as no detection being assigned to more than one track.
The main contribution of this paper is fourfold: 1) We propose a novel learnable framework which formulates MOT as a proposal generation, proposal scoring and trajectory inference pipeline. In this pipeline, we can utilize off-the-shelf algorithms for each module. 2) We propose an iterative graph clustering strategy for proposal generation, which significantly reduces the computational cost while guaranteeing the quality of the generated proposals. 3) We employ a trainable GCN for proposal scoring. By directly optimizing the whole proposal score rather than the pairwise matching cost, the GCN can incorporate higher-order information within the proposal to make more accurate predictions. 4) We show significantly improved state-of-the-art results of our method on two MOTChallenge benchmarks.
Most state-of-the-art MOT works follow the tracking-by-detection paradigm, which divides the MOT task into two sub-tasks: first, obtaining frame-by-frame object detections; second, linking the detections into trajectories. The first sub-task is usually addressed with object detectors [36, 46, 47, 59], while the latter can be done on a frame-by-frame basis for online applications [25, 55, 56, 63, 64] or in a batch for offline scenarios [3, 9, 42]. For video analysis tasks that can be done offline, batch methods are preferred, since they can incorporate both past and future frames to perform more accurate association and are more robust to occlusions. A common approach to model data-association in a batch manner is using a graph, where each node represents a detection and each edge indicates a possible link between nodes. Data-association can then be converted into a graph partitioning task, i.e., finding the best set of active edges that partitions the graph into trajectories. Specifically, batch methods differ in the optimization method used, including network flow, generalized maximum multi-clique [27], maximum-weight independent set, conditional random fields, k-shortest paths, hyper-graph based optimization, etc. However, it has been shown that the significantly higher computational cost of these complicated optimization methods does not translate into significantly higher accuracy.
Recently, the research trend in MOT has been shifting from trying to find better optimization algorithms for the association problem to focusing on the use of deep learning in affinity computation. Most existing deep learning MOT methods focus on improving the affinity models, since deep neural networks are able to learn powerful visual and kinematic features for distinguishing the tracked objects from the background and from other similar objects. Leal-Taixé et al. adopted a Siamese convolutional neural network (CNN) to learn appearance features from both RGB images and optical flow maps. Amir et al. employed long short-term memory (LSTM) to encode long-term dependencies in the sequence of observations. Zhu et al. proposed dual matching attention networks with both spatial and temporal attention mechanisms to improve tracking performance, especially in terms of identity-preserving metrics. Xu et al. applied spatial-temporal relation networks to combine various cues such as appearance, location, and topology. Recently, the authors of [5, 49] confirmed the importance of learned re-identification (ReID) features for MOT. All the aforementioned methods learn the pair-wise affinities independently from the association process, so a classical optimization solver is still needed to obtain the final trajectories.
Recently, some works [9, 14, 51, 57] have incorporated the optimization solvers into learning. Chu et al. proposed an end-to-end model, named FAMNet, to refine the feature representation, affinity model and multi-dimensional assignment in a single deep network. Xu et al. presented a differentiable Deep Hungarian Net (DHN) to approximate the Hungarian matching algorithm and provide a soft approximation of the optimal prediction-to-ground-truth assignment. Schulter et al. designed a bi-level optimization framework which frames the optimization of a smoothed network flow problem as a differentiable function of the pairwise association costs. Brasó et al. modeled the otherwise non-learnable data-association problem as a differentiable edge classification task. In this framework, an undirected graph is adopted to model the data-association problem. Feature learning is then performed in the graph domain with a message passing network, an edge classifier is learned to classify edges in the graph as active or non-active, and the tracking output is finally obtained by grouping connected components in the graph. However, this pipeline does not generally guarantee the flow conservation constraints, and the final tracking performance might be sensitive to the percentage of flow conservation constraints that are satisfied.
Similar to Brasó et al., our method also models the data-association problem with an undirected graph. However, our approach follows a novel proposal-based learnable MOT framework, which is analogous to the two-stage object detector Faster RCNN, i.e., proposal generation, proposal scoring and proposal pruning.
We are given a batch of video frames and the corresponding detections $\mathcal{D} = \{d_1, d_2, \dots, d_N\}$, where $N$ is the total number of detections over all frames. Each detection is represented by $d_i = (a_i, p_i, t_i)$, where $a_i$ denotes the raw pixels of the bounding box, $p_i$ contains its 2D image coordinates and $t_i$ indicates its timestamp. A trajectory is defined as a set of time-ordered detections $T_k = \{d_{k_1}, d_{k_2}, \dots, d_{k_{n_k}}\}$, where $n_k$ is the number of detections that form trajectory $T_k$. The goal of MOT is to assign a track ID to each detection and form the set of trajectories that best maintains the objects' identities.
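This formulation maps naturally onto a small data structure. The sketch below is illustrative only; the class and field names are ours, not taken from the released code:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass(frozen=True)
class Detection:
    box: Tuple[float, float, float, float]  # (x, y, w, h) in image coordinates
    t: int                                  # frame timestamp

@dataclass
class Trajectory:
    track_id: int
    detections: List[Detection] = field(default_factory=list)

    def add(self, det: Detection) -> None:
        # keep detections time-ordered, as the trajectory definition requires
        self.detections.append(det)
        self.detections.sort(key=lambda d: d.t)
```

In the full system each detection would also carry its appearance (ReID) feature; it is omitted here for brevity.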
As shown in Figure 1, our framework consists of four main stages.
Data Pre-Processing. To reduce the ambiguity and computational complexity in proposal generation, a set of tracklets is generated by linking detections in consecutive frames. These tracklets are then used as the basic units in the downstream modules.
Proposal Generation. As shown in Figure 1(b), we adopt a graph $G = (V, E)$, where each vertex represents a tracklet and each edge a possible link, to represent the structured tracking data. A proposal $C = (V_C, E_C)$ is a subgraph of $G$. The objective of proposal generation is to obtain an over-complete set of proposals which contains at least one perfect proposal for each target. However, it is computationally prohibitive to explore all possible proposals in the affinity graph $G$. Inspired by bottom-up clustering approaches, we propose an iterative graph clustering strategy in this paper. By simulating the bottom-up clustering process, it provides a good trade-off between proposal quality and computational cost.
Proposal Scoring. With the over-complete set of proposals at hand, we need to calculate their quality scores and rank them, in order to select the subset of proposals that best represents real tracks. Ideally, the quality score of a proposal $C$ can be defined as a combination of its precision and recall rates:

$s(C) = \epsilon \cdot \mathrm{Prec}(C) + (1 - \epsilon) \cdot \mathrm{Rec}(C)$, with $\mathrm{Prec}(C) = \frac{|C \cap G_{l(C)}|}{|C|}$ and $\mathrm{Rec}(C) = \frac{|C \cap G_{l(C)}|}{|G_{l(C)}|}$, (1)

where $\epsilon$ is a weighting parameter controlling the contribution of the precision score, $G_l$ is the ground-truth set of all detections with label $l$, $l(C)$ is the majority label of the proposal $C$, and $|\cdot|$ measures the number of detections. Intuitively, $\mathrm{Prec}(C)$ measures the purity of $C$ (it decreases as the number of distinct labels included in $C$ grows), and $\mathrm{Rec}(C)$ reflects how close $C$ is to the matched ground-truth track $G_{l(C)}$. Inspired by work on face clustering, we adopt a GCN-based network to learn to estimate the proposal score under the above definition. The precision of a proposal can be learned with a binary cross-entropy loss during training. However, it is much harder for a GCN to learn the recall of a proposal without exploring the entire graph structure, including vertices that are very far from a given proposal. We find that the normalized track length $|C| / \tau$, where $\tau$ is a constant for normalization, is positively correlated with the recall of a proposal when precision is high. Thus, we approximate the recall rate of a proposal with the normalized track length and let the network focus on accurately learning the precision of a proposal.
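During training, this quality score can be computed directly from ground-truth labels. The sketch below is a hedged illustration (the function and parameter names, such as `quality_score` and `eps`, are ours): precision is the majority-label fraction of the proposal, and recall is the fraction of the matched ground-truth track that the proposal covers.

```python
from collections import Counter

def quality_score(proposal_labels, gt_label_counts, eps=0.5):
    """Quality of a proposal as a weighted sum of precision and recall.

    proposal_labels : ground-truth identity label of each detection in the proposal
    gt_label_counts : {label: total number of ground-truth detections with that label}
    eps             : weight of the precision term
    """
    majority_label, majority_count = Counter(proposal_labels).most_common(1)[0]
    precision = majority_count / len(proposal_labels)          # purity of the proposal
    recall = majority_count / gt_label_counts[majority_label]  # coverage of the matched track
    return eps * precision + (1.0 - eps) * recall
```

For example, a proposal containing two detections of person `a` and one of person `b`, where person `a` has four ground-truth detections in total, has precision 2/3 and recall 1/2.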
Trajectory Inference. Similar to non-maximum suppression in object detection, a trajectory inference step is needed to generate the final tracking output from the ranked proposals. This step enforces the tracking constraints, such as no tracklet being assigned to more than one track. To reduce the computational cost, we adopt a simple de-overlapping algorithm that needs to estimate the proposal quality scores only once.
Tracklets are widely used as an intermediate input in many previous works [16, 61]. In our framework, we also use tracklets as the basic units for graph construction; since the number of tracklets is far smaller than the number of detections, this significantly reduces the overall computation. First, the ReID features of each detection are extracted with a CNN. Then, the overall affinity of two detections, or of a detection and a tracklet, is computed by combining three elementary affinities based on their appearance, timestamps and positions. Finally, low-level tracklets are generated by linking detections based on their affinities with the Hungarian algorithm. It is worth noting that the purity of the generated tracklets is crucial, because the downstream modules use them as basic units and there is no strategy to recover from impure tracklets. Similarly to prior work, we use a dual-threshold strategy in which a higher threshold is used to accept only associations with high affinities, and a lower threshold is used to reject associations that have rivals with comparable affinities.
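The dual-threshold acceptance rule can be sketched as follows. This is a simplified, greedy stand-in for the Hungarian-based linking described above, with illustrative threshold values: a pair is accepted only when its own affinity is high and every rival association is clearly weaker.

```python
def dual_threshold_match(affinity, th_high=0.7, th_low=0.5):
    """Greedy dual-threshold association (illustrative sketch).

    affinity[i][j] is the affinity between detection i and tracklet j.
    A pair (i, j) is accepted only if affinity[i][j] >= th_high AND every
    competing affinity in the same row or column stays below th_low.
    """
    matches = []
    n_rows = len(affinity)
    n_cols = len(affinity[0]) if n_rows else 0
    for i in range(n_rows):
        for j in range(n_cols):
            if affinity[i][j] < th_high:
                continue
            # rivals: every other candidate for detection i or tracklet j
            rivals = [affinity[i][k] for k in range(n_cols) if k != j]
            rivals += [affinity[r][j] for r in range(n_rows) if r != i]
            if all(a < th_low for a in rivals):
                matches.append((i, j))
    return matches
```

An ambiguous pair with a strong rival (e.g. affinities 0.9 and 0.6 in the same row) is rejected, which is exactly what keeps the resulting tracklets pure.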
We propose an iterative clustering strategy to grow the proposals gradually, as shown in Figure 2. It mainly consists of two modules.
Affinity Graph Construction. At each iteration $i$, we build an affinity graph $G_i$ to model the similarity between vertices. Each vertex $v$ is described by the averaged ReID feature of its proposal, the sorted timestamps of the detections in the proposal, and the corresponding 2D image coordinates. The affinity score of an edge $(v_j, v_k)$ is defined as the average of temporal, spatial and appearance similarities, where the temporal term is based on the minimum time gap between the two vertices and is set to -1 if vertex $v_j$ temporally overlaps vertex $v_k$, the spatial term is based on the Euclidean distance between the predicted box center of $v_j$ (we apply a global constant-velocity model to predict the 2D image coordinates of the bounding box) and the starting box center of $v_k$, and two controlling parameters scale the temporal and spatial terms. To reduce the complexity of the graph, a simple gating strategy is adopted (see Appendix A.1 for details) and the maximum number of edges linked to one vertex is capped.
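A hedged sketch of such an edge affinity is given below. The exact functional forms of the temporal and spatial terms are defined by equations that are stripped from this copy, so the exponential decays here are our assumptions; the averaging of three terms, the -1 value for temporal overlap, and the two controlling parameters (set to 40 and 100 in the experiments) follow the text.

```python
import math

def edge_affinity(v1, v2, sigma_t=40.0, sigma_s=100.0):
    """Edge affinity as the average of temporal, spatial and appearance
    similarities (hedged reconstruction; exact forms are assumptions).

    Each vertex is a dict with 't_start', 't_end', 'center' (box center in
    image coordinates) and 'feat' (L2-normalized averaged ReID feature).
    """
    gap = v2['t_start'] - v1['t_end']
    if gap <= 0:
        # temporal overlap: the two vertices cannot be the same target
        return -1.0
    a_time = math.exp(-gap / sigma_t)                           # temporal similarity
    a_space = math.exp(-math.dist(v1['center'], v2['center']) / sigma_s)
    a_app = sum(x * y for x, y in zip(v1['feat'], v2['feat']))  # cosine similarity
    return (a_time + a_space + a_app) / 3.0
```

In the full method, `v1['center']` would be the constant-velocity-predicted box center rather than the last observed one.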
Cluster Proposals. The basic idea of proposal generation is to use connected components to find clusters. In order to keep the purity of the generated clusters high in the early iterations, we constrain the maximum size of each cluster to be below a threshold. In this phase, the vertices of a target object may be over-fragmented into several clusters. The clusters generated in iteration $i$ are used as the input vertices of iteration $i+1$, and a new graph is built on top of these clusters, thereby producing clusters of larger sizes. The final proposal set includes all the clusters from every iteration, thus providing an over-complete and diverse set of proposals. The exact procedures are detailed in Algorithms 1 and 2 in Appendix A.2.
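The size-constrained clustering step can be sketched with a union-find structure. This is an illustrative implementation of the constraint described above, not the released code; edges are processed in descending affinity order so that the weakest links are the first to be dropped when a merge would exceed the size cap.

```python
def constrained_components(num_vertices, edges, max_size):
    """Connected components with each cluster capped at max_size.

    edges: iterable of (u, v, affinity) tuples.
    Returns a list of clusters, each a list of vertex indices.
    """
    parent = list(range(num_vertices))
    size = [1] * num_vertices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # strongest edges first, so weak links are sacrificed to the size cap
    for u, v, _aff in sorted(edges, key=lambda e: -e[2]):
        ru, rv = find(u), find(v)
        if ru != rv and size[ru] + size[rv] <= max_size:
            parent[rv] = ru
            size[ru] += size[rv]

    clusters = {}
    for x in range(num_vertices):
        clusters.setdefault(find(x), []).append(x)
    return list(clusters.values())
```

Running this repeatedly, with each iteration's clusters as the next iteration's vertices, yields the over-complete multi-scale proposal set described above.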
In this subsection, we devise the purity classification network to estimate the precision scores of the generated proposals. Specifically, given a proposal $C = (V_C, E_C)$, the network predicts the probability of $C$ being pure. As shown in Figure 3, this module consists of the following two main parts.
Design of Feature Encoding. Both appearance and spatial-temporal features are crucial cues for MOT. For appearance features, a CNN is applied to extract feature embeddings directly from the RGB data of each detection; we then obtain each vertex's appearance feature by averaging the appearance features of all its detections. For spatial-temporal features, we seek a representation that encodes, for each pair of temporally adjacent tracklets, their relative position, relative box size, and distance in time. For a proposal $C$, its vertices are first sorted in ascending order of their start timestamps. Then, for every pair of temporally adjacent tracklets $v_j$ and $v_{j+1}$, the ending timestamp of $v_j$ and the starting timestamp of $v_{j+1}$ are denoted as $t_j^e$ and $t_{j+1}^s$, respectively, and their bounding boxes at these timestamps are parameterized by the top-left corner image coordinates, width and height, i.e., $(x_j, y_j, w_j, h_j)$ and $(x_{j+1}, y_{j+1}, w_{j+1}, h_{j+1})$. We compute the spatial-temporal feature for the pair as:

$f^{st}_j = \left( \frac{x_{j+1} - x_j}{w_j},\ \frac{y_{j+1} - y_j}{h_j},\ \log\frac{w_{j+1}}{w_j},\ \log\frac{h_{j+1}}{h_j},\ t_{j+1}^s - t_j^e \right).$ (8)
With the appearance feature and the spatial-temporal feature at hand, we concatenate them to form the feature encoding of each vertex.
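Assuming the standard relative-geometry encoding implied by the description above (the equation itself is stripped from this copy, so the exact form is our reconstruction), the per-pair spatial-temporal feature can be sketched as:

```python
import math

def st_feature(box_prev, t_prev, box_next, t_next):
    """Spatial-temporal encoding for a pair of temporally adjacent tracklets.

    box_prev, box_next : (x, y, w, h) with top-left corner coordinates,
                         at the ending / starting timestamps of the pair.
    t_prev, t_next     : those timestamps.
    """
    x1, y1, w1, h1 = box_prev
    x2, y2, w2, h2 = box_next
    return (
        (x2 - x1) / w1,     # relative horizontal displacement
        (y2 - y1) / h1,     # relative vertical displacement
        math.log(w2 / w1),  # relative width change
        math.log(h2 / h1),  # relative height change
        t_next - t_prev,    # time gap between the adjacent tracklets
    )
```

Normalizing displacements by box size and using log-ratios for sizes makes the feature invariant to image resolution and box scale.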
Design of GCN. As described above, we have obtained the features associated with the vertices of $C$ (denoted as $F^{(0)}$). As for the affinity matrix of $C$ (denoted as $A$), a fully-connected graph is adopted, in which we compute the affinity between each pair of vertices, as shown in Figure 3(a). The GCN consists of $L$ layers, and the computation of each layer can be formulated as:

$F^{(l+1)} = \sigma\left( \tilde{D}^{-1} (A + I)\, F^{(l)} W^{(l)} \right),$ (9)

where $\tilde{D}$ is the diagonal degree matrix of $A + I$, $F^{(l)}$ indicates the feature embeddings of the $l$-th layer, $W^{(l)}$ represents the transform matrix, and $\sigma$ is a non-linear activation function (ReLU in our implementation). On the top-level feature embedding $F^{(L)}$, max pooling is applied over all vertices to provide an overall summary, and a fully-connected layer is finally employed to classify $C$ as a pure or impure proposal. As shown in Equation 9, each GCN layer actually does three things: 1) computes the weighted average of the features of each vertex and its neighbors; 2) transforms the features with $W^{(l)}$; 3) feeds the transformed features to a non-linear activation function. Through this formulation, the purity network can learn the inner consistency of proposal $C$.
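A minimal NumPy sketch of this propagation rule and the pooling/classification head follows. The weight matrices are placeholders and ReLU is assumed for the activation; in practice the layers would be trained with a deep-learning framework.

```python
import numpy as np

def gcn_layer(F, A, W, activation=lambda x: np.maximum(x, 0.0)):
    """One GCN layer: F' = sigma(D^-1 (A + I) F W).

    This matches the three steps described in the text: neighborhood
    averaging, linear transform, and non-linearity (ReLU assumed).
    """
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    D_inv = np.diag(1.0 / A_hat.sum(axis=1))   # inverse diagonal degree matrix
    return activation(D_inv @ A_hat @ F @ W)

def score_proposal(F, A, weights):
    """Stack GCN layers, max-pool over vertices, then a linear classifier.

    weights: list of per-layer matrices; the last entry is the classifier.
    """
    for W in weights[:-1]:
        F = gcn_layer(F, A, W)
    pooled = F.max(axis=0)       # max pooling over all vertices of the proposal
    return pooled @ weights[-1]  # logit for pure vs. impure
```

With a symmetric two-vertex graph and an identity transform, each output row is simply the average of the two input features, illustrating the neighborhood-averaging step.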
With the purity inference results, we can obtain the quality scores of all proposals via Equation 1. A simple de-overlapping algorithm is adopted to guarantee that each tracklet is assigned one unique track ID. First, we rank the proposals in descending order of their quality scores. Then, we sequentially assign track IDs to the vertices in the proposals from the ranked list, modifying each proposal by removing the vertices already seen in preceding ones. The detailed algorithm is described in Algorithm 3 in Appendix A.2.
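The de-overlapping pass itself is a single scan over the ranked list; the sketch below is an illustrative rendering of the steps above (names are ours).

```python
def de_overlap(ranked_proposals):
    """Assign each tracklet to exactly one track.

    ranked_proposals: lists of tracklet IDs, highest quality first.
    Returns the final tracks; each tracklet appears in at most one track.
    """
    assigned = set()
    tracks = []
    for proposal in ranked_proposals:
        # drop tracklets already claimed by a higher-ranked proposal
        remaining = [t for t in proposal if t not in assigned]
        if remaining:
            tracks.append(remaining)
            assigned.update(remaining)
    return tracks
```

Because each proposal is visited once and quality scores are never re-estimated, this is the cheap alternative to the iterative greedy strategy compared in the experiments.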
In this section, we first present an ablation study to better understand the behavior of each module in our pipeline. Then, we compare our method to published methods on the MOTChallenge benchmarks.
All experiments are done on the multiple object tracking benchmark MOTChallenge, which consists of several challenging pedestrian tracking sequences with frequent occlusions and crowded scenes. We choose two separate tracking benchmarks, namely MOT17 and MOT20. These benchmarks consist of challenging video sequences with varying viewing angle, size and number of objects, camera motion, illumination and frame rate in unconstrained environments. To ensure a fair comparison with other methods, we use the public detections provided by MOTChallenge and preprocess them with the same detection refinement step that is widely used in published methods [9, 37].
For the performance evaluation, we use the widely accepted MOT metrics [6, 8, 48], including Multiple Object Tracking Accuracy (MOTA), ID F1 score (IDF1), Mostly Tracked targets (MT), Mostly Lost targets (ML), False Positives (FP), False Negatives (FN), ID switches (IDs), etc. Among these metrics, MOTA and IDF1 are the most important ones, as they quantify two of the main aspects of MOT, namely object coverage and identity preservation.
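For reference, MOTA combines false positives, false negatives and identity switches into a single score relative to the total number of ground-truth objects, following the standard CLEAR MOT definition:

```python
def mota(fp, fn, ids, num_gt):
    """MOTA = 1 - (FP + FN + IDS) / (total ground-truth objects).

    Standard CLEAR MOT definition; can be negative when errors
    outnumber the ground-truth objects.
    """
    return 1.0 - (fp + fn + ids) / num_gt
```

Because FP and FN dominate the numerator in practice, MOTA is largely a detection-coverage metric, which is why IDF1 is reported alongside it to capture identity preservation.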
ReID Model. For the CNN used to extract ReID features, we employ a variant of ResNet50, named ResNet50-IBN, which replaces the batch-norm layers with instance-batch-norm (IBN) layers. After the global average pooling layer, a batch-norm layer and a classifier layer are added. We use a triplet loss and an ID loss to optimize the model weights. For the ablation study, we use the ResNet50-IBN model trained on two publicly available datasets: ImageNet and Market1501. For the final benchmark evaluation, we additionally use the training sequences of MOT17 and MOT20 to finetune the ResNet50-IBN model. Note that using the training sequences of a benchmark to finetune the ReID model for its test sequences is common practice among MOT methods [24, 32, 53].
Parameter Setting. In affinity graph construction, the two controlling parameters are empirically set to 40 and 100, respectively. In proposal generation, the maximum number of iterations is set to 10, the maximum number of neighbors per node to 3, the maximum cluster size to 2, and the cluster threshold step to 0.05. In trajectory inference, the weighting parameter of the quality score is set to 1 and the track-length normalization constant to 200.
GCN Training. We use a GCN with 4 hidden layers in our experiments. The GCN model is trained end-to-end with the Adam optimizer, using weight decay and setting $\beta_1$ and $\beta_2$ to 0.9 and 0.999, respectively. The batch size is set to 2048, and we train for 100 iterations in total. For data augmentation, we randomly remove detections to simulate missed detections. For the ablation study, a leave-one-out cross-validation strategy is adopted to evaluate the GCN model.
We perform simple bilinear interpolation along missing frames to fill gaps in our trajectories.
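This gap filling reduces to linear interpolation of the box parameters between the two detections that bound a gap. A minimal sketch (the box format and function name are ours):

```python
def interpolate_gap(box_a, t_a, box_b, t_b):
    """Linearly interpolate boxes for the missing frames strictly
    between t_a and t_b.

    box_a, box_b : (x, y, w, h) at frames t_a and t_b.
    Returns {frame: box} for the gap frames.
    """
    filled = {}
    for t in range(t_a + 1, t_b):
        alpha = (t - t_a) / (t_b - t_a)  # interpolation weight for box_b
        filled[t] = tuple((1 - alpha) * a + alpha * b
                          for a, b in zip(box_a, box_b))
    return filled
```

Interpolating each of the four box parameters independently along time is what the "bilinear" gap filling amounts to in practice.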
In this subsection, we evaluate the performance of each module in our framework. We conduct all of our experiments on the training sequences of the MOT17 dataset.
To evaluate the performance of proposal generation, we use an oracle purity network for proposal purity classification, i.e., we determine whether a proposal is pure by comparing it with the ground-truth data. As a baseline, we adopt the MHT algorithm with the N-scan pruning step removed. To reduce the search space, a simple gating strategy is adopted which limits the maximum number of links per vertex to 20. The comparison results are summarized in Table 1. As expected, the time cost of our iterative proposal generation method is far less than that of the MHT-based method. Meanwhile, our method achieves comparable MOTA and IDF1 scores. This demonstrates its ability to reduce the computational cost while guaranteeing the quality of the generated proposals.
Effect of Maximum Iteration Number. There are four parameters in proposal generation: the maximum iteration number, the maximum number of neighbors per node, the maximum cluster size, and the cluster threshold step. Experimental results show that the tracking performance is insensitive to the latter three; detailed results are given in Appendix B. Intuitively, increasing the maximum iteration number generates a larger number of proposals and improves the chance that the generated proposals contain good tracklets under long-term occlusions. Hence, one would expect higher values to yield better performance. We test this hypothesis in Figure 4 by running proposal generation with the maximum iteration number increasing from 1 to 10. As expected, we see a clear upward tendency in both the MOTA and IDF1 metrics. Moreover, the performance boost in both metrics mainly occurs when increasing the iteration number from 1 to 2, which indicates that most occlusions are short-term. We also observe that the upward tendency of both metrics stagnates around seven iterations. There is a trade-off between performance and computational cost in choosing the number of iterations; hence, we use seven iterations in our final configuration.
Effects of the features. Our GCN-based purity classification network receives two main streams of features for each vertex: (i) appearance features from the ReID model, and (ii) spatial-temporal features from Equation 8. We test their effectiveness by experimenting with combinations of the two groups of features. Results are summarized in Table 2. It can be concluded that: (i) the appearance features seem to play the more important role in identity preservation, yielding higher IDF1 and MT measures; (ii) the spatial-temporal features reduce the number of FP and IDs; and (iii) combining the two streams of features improves the overall performance.
Effects of different loss functions.
We perform an experiment to study the impact of different loss functions on model training. Table 3 lists the detailed quantitative comparison between the binary cross-entropy loss (BCELoss) and the mean squared error loss (MSELoss). Using BCELoss yields a gain of 0.6 in the IDF1 measure and a small decrease in IDs. Hence, we use BCELoss in our final configuration.
Effects of different networks. Numerous previous works use deep neural networks, such as the Temporal Convolutional Network (TCN), Attention Long Short-Term Memory (ALSTM), and the ALSTM Fully Convolutional Network (ALSTM-FCN), to conduct temporal reasoning on a sequence of observations. Table 4 presents the results obtained with these networks. Note that the oracle performance in Table 4 is obtained by using ground-truth data for purity classification. Comparing the GCN with the oracle, we see that the GCN obtains better MT and ML measures, but worse MOTA and IDF1 measures. The reason might be false positives in the GCN-based proposal purity classification, which generate a few impure trajectories and hence reduce the IDF1 measure. Moreover, the impure trajectories cause quite a few FPs in the post-processing (as shown in Table 4), reducing the MOTA measure. Comparing the GCN with the other neural networks, it is clear that the GCN achieves better performance on most metrics, improving especially the IDF1 measure by 1.2 percentage points. The performance gain is attributed to its capability of learning higher-order information in a message-passing way to measure the purity of each proposal. This verifies that the GCN is well suited to the proposal classification problem.
The iterative greedy strategy is a widely used inference technique in MOT and is an alternative choice here. Specifically, it iteratively performs the following steps: first, estimate the quality scores of all existing proposals; second, select the proposal with the highest quality score and assign a unique track ID to the vertices within this proposal; third, modify the remaining proposals by removing the vertices seen in preceding ones. The iterative greedy strategy therefore re-estimates the quality scores of the remaining proposals at every iteration, whereas the simple de-overlapping algorithm estimates the quality scores only once and thus greatly reduces the computational cost. The comparison results are summarized in Table 5. The simple de-overlapping algorithm achieves slightly better performance in both the MOTA and IDF1 metrics than the iterative greedy strategy. The reason might be that, as the number of iterations increases, the number of nodes in each proposal decreases; hence, the classification accuracy of the purity network might decrease.
We report the quantitative results obtained by our method on MOT17 and MOT20 in Table 6 and Table 7, respectively, and compare them to methods officially published on the MOTChallenge benchmark. As shown in Table 6 and Table 7, our method obtains state-of-the-art results, improving especially the IDF1 measure, by 5.1 percentage points on MOT17 and 3.4 percentage points on MOT20. This demonstrates that our method achieves strong performance in identity preservation. We attribute this improvement to our proposal-based learnable framework. First, our proposal generation module generates an over-complete set of proposals, which improves its robustness in challenging scenarios such as occlusions. Second, our GCN-based purity network directly optimizes the whole proposal score rather than the pairwise matching cost, taking higher-order information into consideration to make globally informed predictions. We also provide more comparison results with other methods on the MOT16 benchmark in Appendix C.
Our method outperforms MPNTrack only by a small margin in terms of the MOTA score. It should be noted that MOTA measures object coverage and overemphasizes detection over association. Since we use the same set of detections and the same post-processing strategy (simple bilinear interpolation) as MPNTrack, achieving similar MOTA results is in line with expectations. IDF1 is preferred over MOTA for evaluation due to its focus on measuring association accuracy over detection accuracy. We also provide more qualitative results in Appendix D.
In this paper, we propose a novel proposal-based learnable MOT framework. For proposal generation, we propose an iterative graph clustering strategy which strikes a good trade-off between proposal quality and computational cost. For proposal scoring, a GCN-based purity network is deployed to capture higher-order information within each proposal, improving robustness in challenging scenarios such as occlusions. We experimentally demonstrate that our method achieves a clear performance improvement with respect to the previous state of the art. For future work, we plan to make our framework trainable end-to-end, especially for the task of proposal generation.
Acknowledgements. This research is funded by the National Key Research and Development Program of China (No. 2018AAA0100701).