Graph Neural Based End-to-end Data Association Framework for Online Multiple-Object Tracking

by Xiaolong Jiang, et al.

In this work, we present an end-to-end framework to settle data association in online Multiple-Object Tracking (MOT). Given detection responses, we formulate the frame-by-frame data association as a Maximum Weighted Bipartite Matching problem, whose solution is learned using a neural network. The network incorporates an affinity learning module, wherein both appearance and motion cues are investigated to encode object feature representations and compute pairwise affinities. Employing the computed affinities as edge weights, the resulting matching problem on a bipartite graph is resolved by the optimization module, which leverages a graph neural network to adapt to the varying cardinalities of the association problem and to handle its combinatorial hardness with favorable scalability and compatibility. To facilitate effective training of the proposed tracking network, we design a multi-level matrix loss in conjunction with the assembled supervision methodology. Being trained end-to-end, all modules in the tracker can co-adapt and co-operate, resulting in improved model adaptiveness and less parameter-tuning effort. Experimental results on the MOT benchmarks demonstrate the efficacy of the proposed approach.




1 Introduction

Given a video sequence, Multi-Object Tracking (MOT) algorithms generate consistent trajectories by localizing and identifying multiple targets in consecutive frames. Given its spatio-temporal nature, the MOT task is intrinsically complicated because its solution search space is formidable. Moreover, the complications of MOT are further aggravated by an increasing number of targets, complex object behaviors, and intricate real-life tracking environments.

Aiming at decoupling the combinatorial complications, most trackers solve object localization and identification separately, which leads to two categories of MOT algorithms. On one hand, the Tracking-by-Prediction methods [93, 34, 12] prioritize object identification by deploying multiple Single Object Trackers (SOTs) on the basis of motion prediction. However, due to the absence of detections, these methods struggle to adapt to the varying number of objects caused by object birth & death (i.e. objects entering or leaving the scene). On the other hand, the Tracking-by-Detection methods first localize objects anonymously with detectors, then resolve object identification via data association [59, 77, 32].

In this work, we follow the Tracking-by-Detection strategy. Taking detections as given, the core of our proposed tracker is its data association module. To achieve online tracking capability, the tracker performs frame-by-frame data associations, which can be graphically formulated as Maximum Weighted Bipartite Matching problems. For each pair of consecutive frames, a weighted bipartite graph is constructed involving trajectories in the previous frame and detection responses in the current frame. The matching problem established on this graph is resolved by first generating pairwise affinities as edge weights, then solving the resulting optimization problem to generate the association output.

Accordingly, the data association module starts with generating pairwise affinities. In tracking scenes troubled by target appearance variations and similar distractors, the expressivity and discriminability of the computed affinities are determined by the adopted feature representation method, as well as the distance metric deployed to quantify the affinities. Earlier approaches leverage advanced hand-crafted features [86, 53, 15] to achieve robust representation. More recently, CNN-based deep features are widely exploited [77, 38, 72, 3] instead. Furthermore, a multi-cue strategy has also been practiced to supplement the appearance cue with others [42, 53, 57], among which motion is the most widely adopted [70, 63, 1, 64, 37]. On the basis of the encoded feature vectors, hand-engineered distance measures [86, 73, 60] are generally utilized to compute the affinity scores. In addition, attempts have also been made to learn metrics that can co-adapt with the feature learning [8, 64, 79].

Given the computed affinities, the resulting optimization problem defined on the weighted bipartite graph is normally configured into a linear assignment formalism and solved with well-designed optimizers or heuristics [11, 71]. However, these approaches suffer from tedious design efforts, prohibitive computational expense, and poor scalability. Particularly, in the presence of frequent object birth & death, the constraints of the linear assignment formalism are violated, inducing erroneous optimization results which in turn lead to false associations. As a solution, we strive to resolve the optimization problem in a data-driven way, relying on the function approximation capacity of deep networks. Nevertheless, this approach is non-trivial to realize. Firstly, although deep neural networks such as CNNs and RNNs have exceptional feature learning capability, their capacity to conduct relational reasoning for data association is limited. Secondly, the varying number of targets gives rise to a changing dimensionality of the association problem, demanding that the otherwise fixed model be adaptive. Moreover, the data available for the tracking problem is too limited to support the training of heavy models.

Inspired by the graphical formulation of the optimization problem, we observe that the Graph Neural Network (GNN) [67] is well-suited to solve it. By reasoning over non-Euclidean graph data in a message-passing way, the proposed GNN optimization module is endowed with improved relational reasoning capacity and can cope with the varying cardinality problem via the deployment of localized operations. Furthermore, the module is light-weight and converges well. By integrating the aforementioned affinity learning module end-to-end, all parameters in the data association pipeline can co-adapt and co-operate compactly, resulting in better model adaptiveness, scalability, and efficiency with acceptable model complexity. To better optimize the complicated network with diverse modules, we design the multi-level matrix loss, which is assembled to enhance the training performance. The main contributions of this work include:

  • We propose an end-to-end framework incorporating affinity learning and optimization modules to solve the data association problem in online multiple-object tracking.

  • We design the optimization module with Graph Neural Network (GNN), which learns to solve the constructed maximum weighted bipartite matching problem in a data-driven way, avoiding excessive algorithm design and parameter tuning efforts.

  • We employ assembled supervision in conjunction with the proposed multi-level matrix loss to ensure the training performance of the end-to-end network composed of diverse modules.

  • We demonstrate experimentally that the GNN optimization module improves data association performance, and overall our method yields competitive results with other state-of-the-art trackers on the MOT benchmark.

Figure 1: The pipeline of the proposed framework. It consists of the Siamese Network for affinity computation and the Graph Neural Network for optimization on the incomplete bipartite graph. The figure is best viewed in color.

2 Related Work

Affinity Computation in Data Association. Data association is the process of dividing a set of instances into different groups so as to maximize the global cross-group similarities while maintaining the one-to-one association constraint. This fundamental technique exists in various domains that involve correspondence matching [96], such as person re-identification [62, 97, 6], keypoint matching [91], 3D reconstruction [90], action recognition [9], and Tracking-by-Detection based MOT [44].

Affinity computation lays the foundation for data association. It provides the similarity measures upon which the maximization is established. In the context of multi-object tracking, pairwise or tracklet-based appearance affinity scores are computed by first extracting reliable feature representations with hand-crafted features [86, 53, 15, 95, 94, 36, 14], learnable features [84, 83, 2, 42, 85, 48], or deep features [3, 80, 47]. To supplement the appearance cues under severe variations, attempts have been made to jointly investigate multiple cues [42, 53, 57]. Among these, motion information is the most widely applied [83, 18, 70, 63, 1, 64]. The similarity between paired feature vectors is quantified by a distance metric such as the Euclidean, Mahalanobis, or Bhattacharyya distance [46, 52, 59, 86, 73, 60]. Moreover, metric learning has also been deployed in [8, 64, 79] to learn an adaptive metric from data. Notably, end-to-end affinity computation has been proposed with deep Siamese architectures [16, 44, 25, 32, 81, 62, 79].

Optimization in Data Association. Data association can be interpreted as the Set Partition Problem (SPP) [4]. In multi-object tracking, this SPP formalism is specialized into the Multi-Dimensional Assignment (MDA) problem, which describes the optimization procedure defined on a k-partite graph to partition object observations into trajectories across frames. Offline methods execute the data association in a batch mode. Such a global association strategy is robust and accurate, yet it inevitably introduces NP-hardness and forfeits the real-time tracking capability. Tracklet-based [35, 33, 98, 17] and detection-based [52, 92] methods have been practiced to realize offline multi-object tracking. In contrast, online multi-object trackers [11, 71, 59] perform frame-by-frame associations and thereby recover the real-time tracking capability. Online methods are usually graphically formulated as the Maximum Weighted Bipartite Matching [91] problem and solved with the Hungarian algorithm [41]. A variety of other graphical formalisms have also been devised to settle the data association, including Network Flow [23, 82, 58, 69], Minimum Cost Multi-cut [75, 76, 77], Maximum-weight Multi-Clique [22], etc. On the basis of these optimization formulations, attempts have been made to solve the problem in a data-driven manner [69, 91]. In particular, in [56, 55], Ondruska et al. initially propose to deploy recurrent neural networks (RNNs) in solving MOT on a primitive level without formal data associations. Following this line of research, in [50] Milan et al. established the first end-to-end online multi-object tracking method with explicit data association realized using RNNs.

Graph Neural Network. Common neural network models are designed to work with Euclidean data as inputs. Graph neural networks, as neural network models operating on non-Euclidean graph data, refer to a structured architecture that conducts graph-to-graph relational reasoning computations [10, 5]. In general, a GNN manages to learn complicated semantic information by building it from the lower level in a hierarchical way [30]. In practice, this GNN hierarchy is established by aggregating local information from edges to nodes, then to the global level in a message-passing manner [28]. Over its development in the past decade [31, 66, 68], GNN models have found far-reaching applications across domains including supervised [61], semi-supervised [40], few-shot [27], and reinforcement learning [89], resulting in well-established networks such as graph convolutional networks [40, 29, 78], graph-based generative models [21, 88], graph-based adversarial learning [24, 74], etc. Closely related to the proposed work, a few attempts have been made to settle the Quadratic Assignment Programming (QAP) problem with GNNs. In [54], the general GNN architecture proposed in [68] is adopted to solve the QAP in the context of graph matching. In [20], a unique combination of reinforcement learning and graph embedding is realized to resolve the NP-hard combinatorial optimization.

3 End-to-end Data Association

Using affinity measures as edge weights, the data association of online multi-object tracking can take the graphical form of a Maximum Weighted Bipartite Matching problem. This matching problem can be formulated into a linear assignment and solved with well-defined algorithms [19, 51]. In this section, we present our end-to-end data association model, wherein the affinity computation and optimization module are jointly trained to achieve co-adaptation and co-operation. More importantly, the varying cardinality of the optimization problem caused by object birth & death is settled in a data-driven way.

3.1 Problem Formulation

For online data association, one bipartite graph is constructed between every pair of consecutive frames, where the two disjoint sets each contain the existing trajectories in the previous frame , and the newly detected object observations in the current frame . and , where and define the cardinality of the association problem. Particularly, trajectory is represented by its bounding box observation (i.e. the (x, y, w, h) annotation) at frame and its short tracklet of coordinates. The graph is weighted by the affinity matrix , where affinity score is associated as the weight on the edge between trajectory and observation . Each edge is also associated with a binary indicator in the association matrix . The association result is the solution of a Maximum Weighted Bipartite Matching problem defined on , i.e. the optimal association matrix of the corresponding linear assignment defined as follows:


The computation in (1) denotes the dot product of two vectorized matrices. The first two constraints in (2) ensure assignment feasibility, i.e. that no two trajectories can claim the same observation in the same frame. The last constraint computes the matrix norm of , which indicates that there are exactly one-to-one associations satisfying the equality constraint. In other words, one-to-one associations are established across two frames whereas the rest are birth & death associations. In cases where , zero nodes and edges need to be augmented into the graph to maintain symmetry in matrices and so that the Hungarian algorithm can be applied to solve the assignment.
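As a reading aid, the linear assignment described in this paragraph can be sketched as below. The symbols are placeholders chosen here, not necessarily the original notation: A is the m × n affinity matrix with entries a_ij, X is the binary association matrix with entries x_ij, and k is the number of one-to-one associations. The use of the squared Frobenius norm is also an assumption; for a binary matrix it simply counts the nonzero entries.

```latex
\begin{align}
\max_{X}\quad & \operatorname{vec}(A)^{\top}\operatorname{vec}(X)
  = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}\,x_{ij} \tag{1}\\
\text{s.t.}\quad & \sum_{j=1}^{n} x_{ij} \le 1 \;\; \forall i, \qquad
  \sum_{i=1}^{m} x_{ij} \le 1 \;\; \forall j, \qquad
  \lVert X \rVert_F^{2} = k, \qquad x_{ij} \in \{0, 1\} \tag{2}
\end{align}
```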

Nonetheless, the application of the Hungarian algorithm or similar alternatives is not ideal for all tracking scenarios. For one, such algorithms scale poorly with increasing problem size and easily become intractable in real tracking scenes. More seriously, the number of one-to-one associations is not known a priori during tracking due to irregular object birth & death, so the last constraint in (2) does not always hold, i.e. the optimization formulated as above cannot cover all association scenarios. In these cases, combinatorial optimization algorithms are no longer applicable.
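To make the limitation concrete, the following minimal sketch pads a rectangular affinity matrix with zeros and then solves the assignment exhaustively. It is an illustration only: brute force over permutations stands in for the Hungarian algorithm to keep the example dependency-free, and the function names and toy affinities are hypothetical.

```python
from itertools import permutations


def pad_to_square(affinity, fill=0.0):
    """Pad a non-empty rectangular affinity matrix with `fill` until it is square."""
    n_rows, n_cols = len(affinity), len(affinity[0])
    size = max(n_rows, n_cols)
    padded = [list(row) + [fill] * (size - n_cols) for row in affinity]
    padded += [[fill] * size for _ in range(size - n_rows)]
    return padded


def best_assignment(affinity):
    """Return the one-to-one assignment with maximum total weight.

    Exhaustive search: only suitable for toy sizes. It is NOT the Hungarian
    algorithm, but it returns the same optimum on small inputs.
    """
    n_rows, n_cols = len(affinity), len(affinity[0])
    square = pad_to_square(affinity)
    size = len(square)
    best_score, best_perm = float("-inf"), None
    for perm in permutations(range(size)):
        score = sum(square[i][perm[i]] for i in range(size))
        if score > best_score:
            best_score, best_perm = score, perm
    # Discard pairs that involve a padded (dummy) row or column.
    pairs = [(i, j) for i, j in enumerate(best_perm) if i < n_rows and j < n_cols]
    return pairs, best_score


# Two trajectories (rows) and three detections (columns).
affinity = [
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.1],
]
pairs, score = best_assignment(affinity)
```

Note that with more detections than trajectories, this padded formulation always matches every trajectory to some detection, however low the affinity. A trajectory that actually left the scene cannot be expressed, which is the kind of birth & death case the paragraph above refers to.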

Aiming at an efficient optimization that is well-compatible with real tracking scenes, we establish a module leveraging the GNN to approximate an optimization solution via supervised learning. In addition, an affinity learning module is proposed to compute the matrices. Jointly, an end-to-end data association module is realized, enabling the co-adaptation and co-operation of both modules.

3.2 Affinity Learning Module

Affinity is the quantification of similarity between observations, and affinity computation is the starting point of a variety of matching-based tasks [97, 91, 44, 90, 9]. In the context of online multi-object tracking, an affinity matrix is computed to weight the bipartite graph. Each element in indicates the similarity between trajectory and the newly detected observation . In the proposed tracker, is computed end-to-end from input bounding box annotations to output scalar affinity scores through the proposed two-stream affinity learning module, where appearance affinity is quantified with a Siamese Convolutional Neural Network, while motion affinity is investigated based on proximity reasoning with an LSTM motion prediction component. A set of fully connected layers is deployed as the learnable metric to integrate the two.

3.2.1 Appearance Affinity

The appearance cue provides the most defining image evidence to recognize and discriminate an object. In the presence of appearance variations and similar distractors, robust feature representation is vital to achieve reliable affinity. In our method, appearance affinity is computed with a Siamese CNN feature encoding architecture. As shown in Figure 1, and are encoded into feature vectors and with dimension of . The appearance affinity score is computed using and via the learnable metric described later in this section. Different from the early fusion strategy adopted in [44], where pairs of instances are stacked before being input to the Siamese network, we opt to fuse and later, as both of them are needed in the GNN optimization module.

3.2.2 Motion Affinity

The motion cue offers generic and appearance-invariant information to characterize an object according to its dynamic behavior. It has been proven to be a beneficial complement that reinforces the appearance cue in cases of severe appearance variations and cluttered background [70, 63, 64]. As shown in Figure 1, motion affinity is computed on the basis of motion prediction with an LSTM motion model. For a trajectory in the previous frame, we maintain a short tracklet of as a sequence of 4D bounding box annotations , where denotes the length of the tracklet. Accordingly, the LSTM network is unrolled into time steps with hidden unit size as . By feeding in a tracklet , the LSTM generates the motion feature vector . On the other hand, the bounding box annotation of an observation from the current frame is encoded by a set of fully-connected layers into feature vector . The motion affinity is then computed based on and utilizing the learnable distance metric.

3.2.3 Metric Learning

Aiming at end-to-end module adaptation and cooperation, we avoid hand-engineered distance metrics and instead learn them from data. As shown in Figure 1, three metric learning components , , and are implemented in a distributed manner in the affinity learning module. In particular, () is formed with a sequence of fully connected layers interleaved with non-linearities, which first concatenates a pair of appearance (motion) feature vectors into a () dimension vector, then transforms it into a scalar value indicating the pairwise affinity score. The two-stream appearance and motion cues are integrated by , which mimics a weighted summation of and , then outputs as the final affinity score. This distributed computation of affinities renders , , and as intermediate outputs, upon which assembled supervision can be realized. Details can be found in Section 4.1. Affinity matrices , , and are generated by packing , and into matrix form, and is fed into the following GNN optimization module. Additionally, the encoded appearance and motion feature vectors for each trajectory and observation are concatenated into one unified feature vector . Vectors for trajectories and observations are packed together in and , which are also input into the GNN optimization module.

3.3 GNN Optimization Module

The main obstacle to applying deep neural networks to data association is the varying cardinality. To cope with such variations, tedious heuristics must be designed when adopting deep networks that involve fixed-size matrix multiplications, such as MLPs or LSTMs.

Motivated by the graphical nature of the optimization problem, we overcome the varying cardinality problem by utilizing a Graph Neural Network (GNN). In comparison with the LSTM [50], we believe the GNN is more suitable for solving the maximum weighted bipartite matching problem for three reasons. Firstly, the GNN operates on graph data and therefore matches the graphical formulation of the optimization problem. Secondly, the GNN deploys local operations in a message-passing way and thus can cope with the varying cardinality. Thirdly, the GNN supports light-weight implementations, such that the requirement for training data is less demanding. Following these motivations, a particular GNN architecture is established as shown in Figure 2. Given the computed affinity matrix as edge weights and feature vectors , as node features on a weighted incomplete bipartite , the proposed GNN module is composed of the feature update layer and the relation update layer, and is expected to solve the optimization problem by outputting an optimal association matrix , which denotes an optimal set of one-to-one associations together with accurate birth & death indications.

3.3.1 Feature update layer

Taking the input edge weights and node features, the feature update layer instantiates the message-passing functionality via matrix multiplications in the context of bipartite graphs, i.e. it updates the feature vector for each node in one set of the graph according to all nodes in the other set, weighted by the affinities in between. Since each node in a bipartite graph only has one-hop neighbors, the pair of feature update layers defined as follows only needs to be deployed once in the GNN to realize feature updates globally. After the message-passing step, the resulting features are further embedded into a higher dimension for enlarged model capacity.


On the left side of the above equations, and denote the resulting features for each trajectory in the previous frame and each observation in the current frame. On the right side of the equations, represents the affinity matrix computed as discussed in Section 3.2. indicates applying softmax normalization row-wise in the affinity matrix ((4) in the transpose of ) computed by the affinity learning module. indicates a set of learnable weights and denote the parameterizations. is the element-wise non-linearity, for which we adopt ReLU in this paper.
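The feature update described above can be sketched in plain Python as follows. This is a reading of the prose under stated assumptions, not the original equations: each node aggregates the features of the opposite set through a row-wise softmax of the affinity matrix (or of its transpose), followed by a learnable linear embedding and ReLU. The use of two separate weight matrices (`w_traj`, `w_obs`), the absence of bias terms, and all toy values are hypothetical.

```python
import math


def softmax_rows(matrix):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in matrix:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def relu(matrix):
    return [[max(0.0, v) for v in row] for row in matrix]


def feature_update(affinity, traj_feats, obs_feats, w_traj, w_obs):
    """One message-passing step on a bipartite graph.

    affinity:   m x n edge weights (trajectories x observations)
    traj_feats: m x d node features, obs_feats: n x d node features
    w_traj, w_obs: d x d_out embedding weights (hypothetical, no bias)
    """
    # Each trajectory gathers observation features weighted by softmax(A).
    new_traj = relu(matmul(matmul(softmax_rows(affinity), obs_feats), w_traj))
    # Each observation gathers trajectory features weighted by softmax(A^T).
    new_obs = relu(matmul(matmul(softmax_rows(transpose(affinity)), traj_feats), w_obs))
    return new_traj, new_obs


affinity = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]            # 2 trajectories, 3 observations
traj_feats = [[1.0, 0.0], [0.0, 1.0]]
obs_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]              # embeds 2-D features into 3-D
new_traj, new_obs = feature_update(affinity, traj_feats, obs_feats, weights, weights)
```

Because the weights only act on the feature dimension, the same parameters apply unchanged when the number of trajectories or observations differs from frame to frame, which is the property the text relies on to handle varying cardinality.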

3.3.2 Relation update layer

The updated feature vectors are fed into the relation update layer, wherein elements in the association matrix are iteratively estimated by first aggregating features from a pair of nodes into the feature on the edge connecting the two, then applying a learnable transformation to compute the scalar value output. This layer can be formalized as follows:


Here represents the feature aggregation functionality that aggregates node features into the edge features in between. can take many forms; in the scope of this work we realize it with non-parameterized element-wise subtraction. Based on the aggregated edge feature, a Multilayer Perceptron parameterized by is employed to instantiate the transformation to get the scalar value .
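A minimal sketch of the relation update, assuming a one-hidden-layer MLP with ReLU: the edge feature for each trajectory/observation pair is the element-wise difference of the two node features, and the MLP maps it to one scalar. The layer sizes and the hand-set weights below are purely illustrative (they make the score equal to 0.5 minus the L1 distance); in the tracker these parameters would be learned.

```python
def mlp_scalar(vec, w1, b1, w2, b2):
    """One hidden ReLU layer followed by a linear map to a single scalar."""
    hidden = [
        max(0.0, sum(w * v for w, v in zip(row, vec)) + bias)
        for row, bias in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def relation_update(traj_feats, obs_feats, w1, b1, w2, b2):
    """Score every trajectory/observation edge from its aggregated edge feature."""
    scores = []
    for t in traj_feats:
        row = []
        for d in obs_feats:
            edge = [a - b for a, b in zip(t, d)]  # element-wise subtraction
            row.append(mlp_scalar(edge, w1, b1, w2, b2))
        scores.append(row)
    return scores


traj_feats = [[1.0, 0.0], [0.0, 1.0]]
obs_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# Hand-set weights: hidden units hold the positive and negative parts of each
# difference, so the output is 0.5 - |edge|_1 (illustrative, not learned).
w1 = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [-1.0, -1.0, -1.0, -1.0]
b2 = 0.5
scores = relation_update(traj_feats, obs_feats, w1, b1, w2, b2)
```

With these toy weights, identical node features yield a positive score and dissimilar ones a negative score, matching the sign convention used later when the association matrix is interpreted.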

Figure 2: The pipeline of the proposed optimization module based on GNN. Given the affinity matrix and the features of the objects in the two frames, the module first updates their features in a message-passing way, then updates their relations to output the optimal association matrix.

3.4 End-to-end Supervision

It is non-trivial to supervise the training of the proposed end-to-end data association module for two reasons. Firstly, the data association result is given in , which is a matrix with varying dimensionality from frame to frame. This matrix contains both one-to-one as well as birth & death associations, such that different supervision needs to be imposed on both rows and columns. Secondly, the end-to-end framework establishes a network composed of components for various tasks, so training needs to be carefully designed to facilitate back-propagation and to avoid the vanishing gradient problem. To overcome these difficulties, we first clarify the ground truth matrix generation, then propose the multi-level matrix loss in conjunction with assembled supervision.

3.4.1 Ground Truth Generation

To realize matrix-wise supervision, we generate ground truth association matrix with elements . As defined in (2), there exists a sub-matrix in that conforms to one-to-one associations. Corresponding rows and columns in are generated as one-hot vectors with placed at the , indicating row (trajectory) occupies detection (column) . () indicates birth & death associations, corresponding row and column vectors in are generated with all elements.
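The ground truth construction can be sketched as below, assuming each trajectory and detection carries an identity label. The function name and the toy IDs are hypothetical. The constant used for birth & death rows and columns is not recoverable from the text above, so this sketch assumes 0 for those entries and additionally returns the unmatched row and column indices.

```python
def build_ground_truth(traj_ids, det_ids):
    """Build a ground-truth association matrix from identity labels.

    Rows are trajectories from the previous frame, columns are detections in
    the current frame. Matched pairs get 1.0. Entries of unmatched (death)
    rows and unmatched (birth) columns are set to 0.0, which is an assumption.
    """
    gt = [[1.0 if t == d else 0.0 for d in det_ids] for t in traj_ids]
    deaths = [i for i, t in enumerate(traj_ids) if t not in det_ids]
    births = [j for j, d in enumerate(det_ids) if d not in traj_ids]
    return gt, deaths, births


# Trajectory 5 disappears, detection 9 is a new object.
gt, deaths, births = build_ground_truth(traj_ids=[7, 3, 5], det_ids=[3, 9, 7])
```

In this toy case the sub-matrix formed by rows 0 and 1 and columns 0 and 2 holds the one-to-one associations, while row 2 and column 1 correspond to a death and a birth.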

3.4.2 Multi-level Matrix Loss

The loss computed on an estimated association matrix and its corresponding ground truth is defined as a combination of element-level and vector-level losses. On the element level, each element in the matrix is the estimate of a binary classification, specifying whether this element denotes a match or a mismatch. Accordingly, a binary cross-entropy loss formulated as follows is applied to each element to supervise the classification:


where . is the weighting factor assigned to positive examples to alleviate the sample imbalance. In our experiments, this weighting factor is set to 25.
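A hedged sketch of an element-level loss of this kind: a binary cross-entropy in which positive (match) elements are weighted by 25. Applying a sigmoid to the raw matrix elements and averaging over all elements are assumptions of this sketch, as are the function names.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_bce(logits, targets, pos_weight=25.0):
    """Mean weighted binary cross-entropy over all matrix elements.

    logits:  raw association scores (sigmoid is applied implicitly)
    targets: ground-truth matrix with entries in {0.0, 1.0}
    """
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, targets):
        for x, y in zip(logit_row, target_row):
            # -log(sigmoid(x)) == softplus(-x); -log(1 - sigmoid(x)) == softplus(x)
            total += pos_weight * y * softplus(-x) + (1.0 - y) * softplus(x)
            count += 1
    return total / count


missed_match = weighted_bce([[0.0]], [[1.0]])   # uncertain score on a true match
false_alarm = weighted_bce([[0.0]], [[0.0]])    # uncertain score on a non-match
```

The weighting reflects that an association matrix holds at most one positive per row and column, so negatives vastly outnumber positives; without it, predicting "no match" everywhere would already give a low loss.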

On the vector level, we separate the one-to-one associations from the birth & death associations, and denote the sub-matrices as and respectively. For vectors within , a cross-entropy loss is adopted to supervise a multi-class classification between the estimated vectors and the one-hot ground truths:


where denotes corresponding one-hot ground truth vector, denotes the number of associations.
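The vector-level term for one-to-one associations can be sketched as a softmax cross-entropy in which each row (or column) of the one-to-one sub-matrix is treated as a multi-class prediction over its candidates. Averaging over the number of associations and the function name are assumptions of this sketch.

```python
import math


def vector_cross_entropy(vectors, target_indices):
    """Mean softmax cross-entropy between score vectors and one-hot targets.

    vectors:        list of score vectors (rows or columns of the sub-matrix)
    target_indices: position of the 1 in each one-hot ground-truth vector
    """
    total = 0.0
    for vec, target in zip(vectors, target_indices):
        peak = max(vec)
        # log-sum-exp with max-subtraction for numerical stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in vec))
        total += log_norm - vec[target]
    return total / len(vectors)


confident = vector_cross_entropy([[4.0, -2.0, -3.0]], [0])
wrong = vector_cross_entropy([[4.0, -2.0, -3.0]], [1])
```

Unlike the element-level term, this loss couples the elements of a vector, so raising the score of the correct candidate is rewarded only relative to its competitors in the same row or column.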

For vectors in , we apply a mean square error (MSE) loss to enforce the vector to approach negative infinity, which can be easily recognized in the tracking process:


where .

The multi-level matrix loss is then computed as a summation of losses on different levels over the matrix:


3.4.3 Assembled Supervision

Instead of only computing on the final output matrix , we also assemble it computed on affinity scores , , and . During back-propagation, the gradient flows start at each matrix and flow backwards distributedly. As such, the gradients on earlier network layers are a summation of each flow, therefore the gradient is enhanced and the vanishing problem is alleviated.

3.5 Association Result Interpretation

The optimal association matrix produced by the optimization module contains indications for both one-to-one and object death & birth associations. The elements in are not binary indicators directly but can be easily interpreted. Conforming to our training setup as detailed in Section 4.1, each element is the result of a binary classification, where the one-to-one association is marked as positive. As a result, a one-to-one association is indicated by the largest positive value in a row, and the trajectory corresponding to this row occupies the detection denoted by the column in which that largest value resides. Each one-to-one association is denoted with an indicator in . Birth or death happens when a row or column contains only negative values, and each of them is marked with an indicator in and . To efficiently integrate , , and from , we follow a straightforward procedure by iteratively finding the largest element in . If , then add into , and row as well as column is marked unavailable. Once the largest remaining element is negative, all the available but un-associated rows and columns are added into and . Accordingly, the final association result is given by the indicators in , , and .
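The interpretation procedure in this section can be sketched as a greedy decoder. Function and variable names are hypothetical, rows are taken to be trajectories and columns detections, and treating a score of exactly zero as "not positive" is an assumption.

```python
def decode_associations(scores):
    """Greedily read matches, deaths, and births off an association matrix.

    Repeatedly pick the largest remaining element; while it is positive,
    record the (row, column) pair as a one-to-one association and mark both
    as unavailable. Leftover rows are deaths, leftover columns are births.
    """
    n_rows = len(scores)
    n_cols = len(scores[0]) if scores else 0
    free_rows, free_cols = set(range(n_rows)), set(range(n_cols))
    matches = []
    while free_rows and free_cols:
        value, i, j = max(
            (scores[i][j], i, j) for i in free_rows for j in free_cols
        )
        if value <= 0:
            break  # every remaining candidate is a non-match
        matches.append((i, j))
        free_rows.remove(i)
        free_cols.remove(j)
    deaths = sorted(free_rows)   # trajectories with no detection
    births = sorted(free_cols)   # detections with no trajectory
    return matches, deaths, births


scores = [
    [2.1, -0.3, -1.0],
    [-0.5, -0.2, 1.4],
    [-2.0, -0.7, -0.9],
]
matches, deaths, births = decode_associations(scores)
```

In this toy matrix, trajectories 0 and 1 are matched to detections 0 and 2, trajectory 2 is reported as a death, and detection 1 as a birth. No assumption about the number of one-to-one associations is needed, in contrast to the equality constraint in (2).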

4 Implementation

The implementation of our method is separated into two parts: the training methodology and the tracking methodology.

4.1 Training Methodology

The proposed model is implemented as follows. For the network design, we construct the CNN in the Siamese network with only four convolutional layers interleaved with ReLU nonlinearities. For the LSTM module, we set the length of tracklet . We use the ground-truth bounding boxes and target IDs in the MOT17 [49] training set to generate pairs of images and motion information. All cropped images are resized to to maintain the aspect ratio of targets. All modules are trained from scratch. The proposed model is trained end-to-end with the Adam [39] optimizer, where weight decay is set to and the initial learning rate is set to , divided by for every iterations, and iterations in total.

Figure 3: Visualization of tracking results on MOT17 benchmark dataset. Displayed frames are from the testing sequence 01 and 08.

4.2 Tracking Methodology

To limit false births & deaths induced by noisy detections, we set up a time window with length to verify an object birth. Specifically, once a birth is indicated in , the tracker postpones the trajectory initialization process until this object consecutively appears in for the next frames. Similarly, is set up for confirming the death. Once reports a death, we deploy a dummy observation with the object’s last seen bounding box dimension but propagated into future frames with a linear motion model calculated using the trajectory’s short tracklet of coordinates. If the dummy fails to be associated in frames, it is terminated. We set and ( denotes the frame rate), according to the model’s performance on the training set.
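The dummy-observation propagation can be illustrated as follows. This is a sketch under assumptions: the velocity is estimated as the average displacement between the first and last boxes of the short tracklet (one of several possible linear motion models), and the function name and toy coordinates are hypothetical.

```python
def propagate_box(tracklet, steps=1):
    """Predict a dummy box `steps` frames after the last tracklet entry.

    tracklet: list of (x, y, w, h) boxes, oldest first. The box size is kept
    from the last seen observation; only the position is extrapolated with a
    constant-velocity model.
    """
    x_last, y_last, w_last, h_last = tracklet[-1]
    if len(tracklet) < 2:
        return (x_last, y_last, w_last, h_last)  # no motion estimate available
    x_first, y_first, _, _ = tracklet[0]
    span = len(tracklet) - 1
    vx = (x_last - x_first) / span
    vy = (y_last - y_first) / span
    return (x_last + vx * steps, y_last + vy * steps, w_last, h_last)


tracklet = [(10.0, 20.0, 5.0, 8.0), (12.0, 21.0, 5.0, 8.0), (14.0, 22.0, 6.0, 9.0)]
dummy = propagate_box(tracklet, steps=3)
```

The dummy box would then take part in the association for the following frames, and the trajectory is terminated if it stays unmatched for the whole confirmation window described above.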

5 Experimental Results

In this section, we present our experimental results on 2D MOT 2015 [45] and MOT17 [49] benchmark datasets with ablation studies and comparisons with selected baselines. More details for the MOT benchmark are available at

Mode     Tracker           MOTA  MOTP  IDF1  ID Sw.  MT    ML    Frag    FP      FN
Online   GM PHD [26]       36.4  76.2  33.9  4,607   4.1   57.3  11,317  23,723  330,767
Online   GMPHD KCF [43]    39.6  74.5  36.6  5,811   8.8   43.3  7,414   50,903  284,228
Online   EAMTT [65]        42.6  76.0  41.8  4,488   12.7  42.7  5,720   30,711  288,474
Online   Ours              45.5  76.3  40.5  4,091   15.6  40.6  5,579   25,685  277,663
Offline  MHT bLSTM [38]    47.5  77.5  51.9  2,069   18.2  41.7  3,124   25,981  268,042
Offline  IOU17 [7]         45.5  76.9  39.4  5,988   15.7  40.5  7,404   19,993  281,643
Offline  EDMT17 [13]       50.0  77.3  51.3  2,264   21.6  36.3  3,260   32,279  247,297

Table 1: Tracking Performance on the MOT17 Benchmark

Mode     Tracker           MOTA  MOTP  IDF1  ID Sw.  MT    ML    Frag    FP      FN
Online   RMOT [87]         18.6  69.6  32.6  684     5.3   53.3  1,282   12,473  36,835
Online   RNN LSTM [50]     19.0  71.0  17.1  1,490   5.5   45.6  2,081   11,578  36,706
Online   Ours              21.8  70.5  27.8  1,488   9     40.2  1,851   11,970  34,587

Table 2: Tracking Performance on the MOT15 Benchmark

Configurations             MOTA  ID Sw.  MT
w/o assembled supervision  43.2  4,275   14.4%
w/o optimization module    38.6  9,197   11.8%
Full                       45.4  4,126   15.6%

Table 3: Ablation Study on the MOT17 Benchmark

5.1 Evaluation Metrics

The evaluation metrics used on MOT benchmarks include Multiple Object Tracking Accuracy (MOTA), Multiple Object Tracking Precision (MOTP), ID F1 score (IDF1), ID Precision (IDP), ID Recall (IDR), Mostly Tracked trajectories (MT, the ratio of ground-truth trajectories that are at least 80% covered by the tracking output), Mostly Lost trajectories (ML, the ratio of ground-truth trajectories that are at most 20% covered by the tracking output), number of False Negatives (FN), number of False Positives (FP), number of ID Switches (ID Sw.), number of Track Fragmentations (Frag).
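For reference, two of these metrics can be computed as in the sketch below. MOTA follows the standard CLEAR MOT definition, 1 − (FN + FP + ID Sw.) / (number of ground-truth boxes). The function names, the "PT" (partially tracked) label for the remaining trajectories, and the toy numbers are illustrative and are not values from the benchmark.

```python
def mota(false_negatives, false_positives, id_switches, num_gt_boxes):
    """Multiple Object Tracking Accuracy as a fraction (can be negative)."""
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt_boxes


def coverage_class(covered_frames, total_frames):
    """Classify one ground-truth trajectory by how much of it is tracked."""
    ratio = covered_frames / total_frames
    if ratio >= 0.8:
        return "MT"   # mostly tracked: at least 80% covered
    if ratio <= 0.2:
        return "ML"   # mostly lost: at most 20% covered
    return "PT"       # partially tracked: everything in between


toy_mota = mota(false_negatives=300, false_positives=100, id_switches=20, num_gt_boxes=1000)
labels = [coverage_class(c, 10) for c in (9, 5, 1)]
```

Since all three error types are summed in the numerator, MOTA is dominated by whichever count is largest, which on these benchmarks is typically the number of false negatives.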

5.2 Evaluation on MOT Benchmark

The evaluation results on the MOT17 and MOT15 datasets are shown in Tables 1 and 2. The arrows in each column denote the favorable changing direction of the corresponding metric. As ours is the only fully end-to-end trained online tracker, we focus on highlighting the benefits of the end-to-end training methodology and of the GNN optimization module. To this end, we avoid excessive parameter tuning during testing, and we refrain from training data augmentation and post-tracking performance-boosting heuristics in the experiments. Nevertheless, we still obtain results competitive with other published online trackers on both datasets. In particular, on MOT15 we compare our results with RNN LSTM [50], which is claimed to be the first fully end-to-end multi-object tracking method and which inspired our work. As shown, we achieve favorable results on several important evaluation metrics, including 39.5%, 12.8%, and 3.5% improvements on IDF1, MOTA, and MT, respectively. These improvements result from the fact that we integrate the appearance affinity into the end-to-end framework and that the GNN optimization module copes with the varying cardinality problem better than RNN and LSTM.

5.3 Ablation Study

An ablation study is also conducted on the MOT17 benchmark. As shown in Table 3, we demonstrate the contributions of the assembled supervision and the GNN optimization module. Specifically, the first row of the table shows the performance of the whole network trained with a single supervision applied to the final association output, without assembled supervision. The second row shows the result of directly reasoning the association result using the affinity matrix , without the subsequent GNN optimization module. Training for this configuration is assembled on affinity matrices , , and . The last row reports the full network. As illustrated, the full network outperforms the first-row configuration in all three metrics with 4.9%, 3.5%, and 7.7% improvements, demonstrating the merits of the assembled supervision. Compared to the second-row configuration, the full network enlarges the improvements to 15%, 122.9%, and 24.4%. Although the second-row configuration cannot be trained with the full assembled supervision due to the absence of , the extra performance improvements still support the contribution of the GNN optimization module.

6 Conclusion and Future Work

We propose an end-to-end data association model for online multi-object tracking. Because the affinity learning module and the GNN optimization module are trained jointly, they co-adapt, improving the adaptivity, scalability, and accuracy of the data association model. In particular, we are the first to introduce a GNN for solving online data association with frequent births and deaths, which settles the irregular linear assignment formulation in a data-driven way. In this paper, we focus on demonstrating the efficacy of end-to-end data association and the GNN optimization module; the affinity computation module is therefore kept lightweight and no performance-enhancing heuristics are employed. The performance of the method can be further enhanced in future work.


  • [1] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese. Social lstm: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–971, 2016.
  • [2] S. Avidan. Ensemble tracking. IEEE transactions on pattern analysis and machine intelligence, 29(2), 2007.
  • [3] S.-H. Bae and K.-J. Yoon. Confidence-based data association and discriminative deep appearance learning for robust online multi-object tracking. IEEE transactions on pattern analysis and machine intelligence, 40(3):595–610, 2018.
  • [4] E. Balas and M. W. Padberg. Set partitioning: A survey. SIAM review, 18(4):710–760, 1976.
  • [5] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
  • [6] L. Beyer, S. Breuers, V. Kurin, and B. Leibe. Towards a principled integration of multi-camera re-identification and tracking through optimal bayes filters. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1444–1453. IEEE, 2017.
  • [7] E. Bochinski, V. Eiselein, and T. Sikora. High-speed tracking-by-detection without using image information. In Advanced Video and Signal Based Surveillance (AVSS), 2017 14th IEEE International Conference on, pages 1–6. IEEE, 2017.
  • [8] W. Brendel, M. Amer, and S. Todorovic. Multiobject tracking as maximum weight independent set. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1273–1280. IEEE, 2011.
  • [9] W. Brendel and S. Todorovic. Learning spatiotemporal graphs of human activities. In Computer vision (ICCV), 2011 IEEE international conference on, pages 778–785. IEEE, 2011.
  • [10] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
  • [11] Y. Cai and G. Medioni. Exploring context information for inter-camera multiple target tracking. In Applications of Computer Vision (WACV), 2014 IEEE Winter Conference on, pages 761–768. IEEE, 2014.
  • [12] X. Cao, X. Jiang, X. Li, and P. Yan. Correlation-based tracking of multiple targets with hierarchical layered structure. IEEE transactions on cybernetics, 48(1):90–102, 2018.
  • [13] J. Chen, H. Sheng, Y. Zhang, and Z. Xiong. Enhancing detection model for multiple hypothesis tracking. In Conf. on Computer Vision and Pattern Recognition Workshops, pages 2143–2152, 2017.
  • [14] W. Choi. Near-online multi-target tracking with aggregated local flow descriptor. In Proceedings of the IEEE international conference on computer vision, pages 3029–3037, 2015.
  • [15] W. Choi and S. Savarese. A unified framework for multi-target tracking and collective activity recognition. In European Conference on Computer Vision, pages 215–230. Springer, 2012.
  • [16] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 539–546. IEEE, 2005.
  • [17] Q. Chu, W. Ouyang, H. Li, X. Wang, B. Liu, and N. Yu. Online multi-object tracking using cnn-based single object tracker with spatial-temporal attention mechanism. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 4846–4855, 2017.
  • [18] R. T. Collins. Multitarget data association with higher-order motion models. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1744–1751. IEEE, 2012.
  • [19] R. T. Collins. Multitarget data association with higher-order motion models. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1744–1751. IEEE, 2012.
  • [20] H. Dai, E. B. Khalil, Y. Zhang, B. Dilkina, and L. Song. Learning combinatorial optimization algorithms over graphs. arXiv preprint arXiv:1704.01665, 2017.
  • [21] N. De Cao and T. Kipf. Molgan: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973, 2018.
  • [22] A. Dehghan, S. Modiri Assari, and M. Shah. Gmmcp tracker: Globally optimal generalized maximum multi clique problem for multiple object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4091–4099, 2015.
  • [23] A. Dehghan, Y. Tian, P. H. Torr, and M. Shah. Target identity-aware network flow for online multiple target tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1146–1154, 2015.
  • [24] M. Ding, J. Tang, and J. Zhang. Semi-supervised learning on graphs with generative adversarial nets. arXiv preprint arXiv:1809.00130, 2018.
  • [25] X. Dong and J. Shen. Triplet loss in siamese network for object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pages 459–474, 2018.
  • [26] V. Eiselein, D. Arp, M. Pätzold, and T. Sikora. Real-time multi-human tracking using a probability hypothesis density filter and multiple detectors. In Advanced Video and Signal-Based Surveillance (AVSS), 2012 IEEE Ninth International Conference on, pages 325–330. IEEE, 2012.
  • [27] V. Garcia and J. Bruna. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043, 2017.
  • [28] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.
  • [29] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.
  • [30] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning, volume 1. MIT press Cambridge, 2016.
  • [31] M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In Neural Networks, 2005. IJCNN’05. Proceedings. 2005 IEEE International Joint Conference on, volume 2, pages 729–734. IEEE, 2005.
  • [32] A. He, C. Luo, X. Tian, and W. Zeng. A twofold siamese network for real-time object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4834–4843, 2018.
  • [33] Q. He, J. Wu, G. Yu, and C. Zhang. Sot for mot. arXiv preprint arXiv:1712.01059, 2017.
  • [34] W. Hu, X. Li, W. Luo, X. Zhang, S. Maybank, and Z. Zhang. Single and multiple object tracking using log-euclidean riemannian subspace and block-division appearance model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12):2420–2440, 2012.
  • [35] C. Huang, B. Wu, and R. Nevatia. Robust object tracking by hierarchical association of detection responses. In European Conference on Computer Vision, pages 788–801. Springer, 2008.
  • [36] O. Javed, K. Shafique, Z. Rasheed, and M. Shah. Modeling inter-camera space–time and appearance relationships for tracking across non-overlapping views. Computer Vision and Image Understanding, 109(2):146–162, 2008.
  • [37] X. Jiang, P. Li, X. Zhen, and X. Cao. Model-free tracking with deep appearance and motion features integration. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 101–110. IEEE, 2019.
  • [38] C. Kim, F. Li, and J. M. Rehg. Multi-object tracking with neural gating using bilinear lstm. In Proceedings of the European Conference on Computer Vision (ECCV), pages 200–215, 2018.
  • [39] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [40] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  • [41] H. W. Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955.
  • [42] C.-H. Kuo, C. Huang, and R. Nevatia. Multi-target tracking by on-line learned discriminative appearance models. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 685–692. IEEE, 2010.
  • [43] T. Kutschbach, E. Bochinski, V. Eiselein, and T. Sikora. Sequential sensor fusion combining probability hypothesis density and kernelized correlation filters for multi-object tracking in video data. In 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 1–5. IEEE, 2017.
  • [44] L. Leal-Taixé, C. Canton-Ferrer, and K. Schindler. Learning by tracking: Siamese cnn for robust target association. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 33–40, 2016.
  • [45] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler. Motchallenge 2015: Towards a benchmark for multi-target tracking. arXiv preprint arXiv:1504.01942, 2015.
  • [46] L. Leal-Taixé, G. Pons-Moll, and B. Rosenhahn. Everybody needs somebody: Modeling social and grouping behavior on a linear programming multiple people tracker. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 120–127. IEEE, 2011.
  • [47] H. Li, Y. Li, and F. Porikli. Deeptrack: Learning discriminative feature representations online for robust visual tracking. IEEE Transactions on Image Processing, 25(4):1834–1848, 2016.
  • [48] Y. Li, C. Huang, and R. Nevatia. Learning to associate: Hybridboosted multi-target tracker for crowded scene. 2009.
  • [49] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler. Mot16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016.
  • [50] A. Milan, S. H. Rezatofighi, A. R. Dick, I. D. Reid, and K. Schindler. Online multi-target tracking using recurrent neural networks. In AAAI, volume 2, page 4, 2017.
  • [51] A. Milan, S. H. Rezatofighi, R. Garg, A. R. Dick, and I. D. Reid. Data-driven approximations to np-hard problems. In AAAI, pages 1453–1459, 2017.
  • [52] A. Milan, S. Roth, and K. Schindler. Continuous energy minimization for multitarget tracking. IEEE Trans. Pattern Anal. Mach. Intell., 36(1):58–72, 2014.
  • [53] D. Mitzel, E. Horbert, A. Ess, and B. Leibe. Multi-person tracking with sparse detection and continuous segmentation. In European Conference on Computer Vision, pages 397–410. Springer, 2010.
  • [54] A. Nowak, S. Villar, A. S. Bandeira, and J. Bruna. Revised note on learning quadratic assignment with graph neural networks. In 2018 IEEE Data Science Workshop (DSW), pages 1–5. IEEE, 2018.
  • [55] P. Ondruska, J. Dequaire, D. Z. Wang, and I. Posner. End-to-end tracking and semantic segmentation using recurrent neural networks. arXiv preprint arXiv:1604.05091, 2016.
  • [56] P. Ondruska and I. Posner. Deep tracking: Seeing beyond seeing using recurrent neural networks. arXiv preprint arXiv:1602.00991, 2016.
  • [57] S. Pellegrini, A. Ess, and L. Van Gool. Improving data association by joint modeling of pedestrian trajectories and groupings. In European conference on computer vision, pages 452–465. Springer, 2010.
  • [58] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1201–1208. IEEE, 2011.
  • [59] H. Possegger, T. Mauthner, P. M. Roth, and H. Bischof. Occlusion geodesics for online multi-object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1306–1313, 2014.
  • [60] Z. Qin and C. R. Shelton. Improving multi-target tracking via social grouping. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1972–1978. IEEE, 2012.
  • [61] D. Raposo, A. Santoro, D. Barrett, R. Pascanu, T. Lillicrap, and P. Battaglia. Discovering objects and their relations from entangled scene representations. arXiv preprint arXiv:1702.05068, 2017.
  • [62] E. Ristani and C. Tomasi. Features for multi-target multi-camera tracking and re-identification. arXiv preprint arXiv:1803.10859, 2018.
  • [63] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese. Learning social etiquette: Human trajectory understanding in crowded scenes. In European conference on computer vision, pages 549–565. Springer, 2016.
  • [64] A. Sadeghian, A. Alahi, and S. Savarese. Tracking the untrackable: Learning to track multiple cues with long-term dependencies. arXiv preprint arXiv:1701.01909, 2017.
  • [65] R. Sanchez-Matilla, F. Poiesi, and A. Cavallaro. Online multi-target tracking with strong and weak detections. In European Conference on Computer Vision, pages 84–99. Springer, 2016.
  • [66] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. Computational capabilities of graph neural networks. IEEE Transactions on Neural Networks, 20(1):81–102, 2009.
  • [67] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
  • [68] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
  • [69] S. Schulter, P. Vernaza, W. Choi, and M. Chandraker. Deep network flow for multi-object tracking. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 2730–2739. IEEE, 2017.
  • [70] P. Scovanner and M. F. Tappen. Learning pedestrian dynamics from the real world. In Computer Vision, 2009 IEEE 12th International Conference on, pages 381–388. IEEE, 2009.
  • [71] G. Shu, A. Dehghan, O. Oreifej, E. Hand, and M. Shah. Part-based multiple-person tracking with partial occlusion handling. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1815–1821. IEEE, 2012.
  • [72] J. Son, M. Baek, M. Cho, and B. Han. Multi-object tracking with quadruplet convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5620–5629, 2017.
  • [73] B. Song, T.-Y. Jeng, E. Staudt, and A. K. Roy-Chowdhury. A stochastic graph evolution framework for robust multi-target tracking. In European Conference on Computer Vision, pages 605–619. Springer, 2010.
  • [74] J. Svoboda, J. Masci, F. Monti, M. M. Bronstein, and L. Guibas. Peernets: Exploiting peer wisdom against adversarial attacks. arXiv preprint arXiv:1806.00088, 2018.
  • [75] S. Tang, B. Andres, M. Andriluka, and B. Schiele. Subgraph decomposition for multi-target tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5033–5041, 2015.
  • [76] S. Tang, B. Andres, M. Andriluka, and B. Schiele. Multi-person tracking by multicut and deep matching. In European Conference on Computer Vision, pages 100–111. Springer, 2016.
  • [77] S. Tang, M. Andriluka, B. Andres, and B. Schiele. Multiple people tracking by lifted multicut and person reidentification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3539–3548, 2017.
  • [78] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
  • [79] X. Wan, J. Wang, Z. Kong, Q. Zhao, and S. Deng. Multi-object tracking using online metric learning with long short-term memory. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 788–792. IEEE, 2018.
  • [80] N. Wang and D.-Y. Yeung. Learning a deep compact image representation for visual tracking. In Advances in neural information processing systems, pages 809–817, 2013.
  • [81] Q. Wang, Z. Teng, J. Xing, J. Gao, W. Hu, and S. Maybank. Learning attentions: residual attentional siamese network for high performance online visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4854–4863, 2018.
  • [82] X. Wang, E. Türetken, F. Fleuret, and P. Fua. Tracking interacting objects using intertwined flows. IEEE transactions on pattern analysis and machine intelligence, 38(11):2312–2326, 2016.
  • [83] B. Yang and R. Nevatia. Multi-target tracking by online learning of non-linear motion patterns and robust appearance models. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1918–1925. IEEE, 2012.
  • [84] B. Yang and R. Nevatia. An online learned crf model for multi-target tracking. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2034–2041. IEEE, 2012.
  • [85] B. Yang and R. Nevatia. Online learned discriminative part-based appearance models for multi-human tracking. In European Conference on Computer Vision, pages 484–498. Springer, 2012.
  • [86] M. Yang, F. Lv, W. Xu, and Y. Gong. Detection driven adaptive multi-cue integration for multiple human tracking. In Computer Vision, 2009 IEEE 12th International Conference on, pages 1554–1561. IEEE, 2009.
  • [87] J. H. Yoon, M.-H. Yang, J. Lim, and K.-J. Yoon. Bayesian multi-object tracking using motion context from multiple objects. In Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on, pages 33–40. IEEE, 2015.
  • [88] J. You, R. Ying, X. Ren, W. L. Hamilton, and J. Leskovec. Graphrnn: A deep generative model for graphs. arXiv preprint arXiv:1802.08773, 2018.
  • [89] V. Zambaldi, D. Raposo, A. Santoro, V. Bapst, Y. Li, I. Babuschkin, K. Tuyls, D. Reichert, T. Lillicrap, E. Lockhart, et al. Relational deep reinforcement learning. arXiv preprint arXiv:1806.01830, 2018.
  • [90] A. R. Zamir and M. Shah. Image geo-localization based on multiple nearest neighbor feature matching using generalized graphs. IEEE transactions on pattern analysis and machine intelligence, 36(8):1546–1558, 2014.
  • [91] A. Zanfir and C. Sminchisescu. Deep learning of graph matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2684–2693, 2018.
  • [92] L. Zhang, Y. Li, and R. Nevatia. Global data association for multi-object tracking using network flows. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
  • [93] L. Zhang and L. Van Der Maaten. Preserving structure in model-free tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4):756–769, 2014.
  • [94] S. Zhang, E. Staudt, T. Faltemier, and A. K. Roy-Chowdhury. A camera network tracking (camnet) dataset and performance baseline. In Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on, pages 365–372. IEEE, 2015.
  • [95] S. Zhang, Y. Zhu, and A. Roy-Chowdhury. Tracking multiple interacting targets in a camera network. Computer Vision and Image Understanding, 134:64–73, 2015.
  • [96] F. Zheng. Visual Data Association: Tracking, Re-identification and Retrieval. PhD thesis, University of Sheffield, 2016.
  • [97] L. Zheng, Y. Yang, and A. G. Hauptmann. Person re-identification: Past, present and future. arXiv preprint arXiv:1610.02984, 2016.
  • [98] J. Zhu, H. Yang, N. Liu, M. Kim, W. Zhang, and M.-H. Yang. Online multi-object tracking with dual matching attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 366–382, 2018.