Learning a Neural Solver for Multiple Object Tracking

12/16/2019 ∙ by Guillem Brasó, et al. ∙ Technische Universität München

Graphs offer a natural way to formulate Multiple Object Tracking (MOT) within the tracking-by-detection paradigm. However, they also introduce a major challenge for learning methods, as defining a model that can operate on such a structured domain is not trivial. As a consequence, most learning-based work has been devoted to learning better features for MOT, and then using these with well-established optimization frameworks. In this work, we exploit the classical network flow formulation of MOT to define a fully differentiable framework based on Message Passing Networks (MPNs). By operating directly on the graph domain, our method can reason globally over an entire set of detections and predict final solutions. Hence, we show that learning in MOT does not need to be restricted to feature extraction, but it can also be applied to the data association step. We show a significant improvement in both MOTA and IDF1 on three publicly available benchmarks.






Code Repositories


Official PyTorch implementation of "Learning a Neural Solver for Multiple Object Tracking" (CVPR 2020 Oral).



Unofficial PyTorch implementation of "Learning a Neural Solver for Multiple Object Tracking"


1 Introduction

Multiple object tracking (MOT) is the task of determining the trajectories of all object instances in a video. It is a fundamental problem in computer vision, with applications in fields such as autonomous driving, robotics, biology, and surveillance. Despite its relevance, it remains a challenging task and a relatively unexplored territory in the context of deep learning.

In recent years, tracking-by-detection has been the dominant paradigm among state-of-the-art methods in MOT. This two-step approach consists of first obtaining frame-by-frame object detections, and then linking them to form trajectories. While the first task can be addressed with learning-based detectors [faster_rcnn, yolov2], the latter, data association, is generally formulated as a graph partitioning problem [tangcvpr2017, yucvpr2007, zhangcvpr2008, lealiccv2011, berclaztpami2011]. In this graph view of MOT, a node represents an object detection, and an edge represents the connection between two nodes. An active edge indicates that the two detections belong to the same trajectory. Solving the graph partitioning task, i.e., finding the set of active edges or trajectories, can also be decomposed into two stages. First, a cost is assigned to each edge in the graph, encoding the likelihood of two detections belonging to the same trajectory. After that, these costs are used within a graph optimization framework to obtain the optimal graph partition.

Previous works on graph-based MOT broadly fall into two categories: those that focus on the graph formulation, and those that focus on learning better costs. In the first group, much research has been devoted to establishing complex graph optimization frameworks that combine several sources of information, with the goal of encoding high-order dependencies between detections [People_Tracking, jCCpami2018, henscheltpami2016, JBNOT]. Such approaches often use costs that are handcrafted to some extent. In the second group, several works adopt a simpler and easier-to-optimize graph structure, and focus instead on improving edge cost definition by leveraging deep learning techniques [LealTaixeCVPR2014baseline, Son_2017_CVPR, Schulter_2017_CVPR, Zhu_2018_ECCV, sptn_iccv19]. By exploiting siamese convolutional neural networks (CNN), these approaches can encode reliable pairwise interactions among objects, but fail to account for high-order information in the scene. Overall, these two lines of work present a dilemma: should MOT methods focus on improving the graph optimization framework or the feature extraction?

We propose to combine both tasks into a unified learning-based solver that can: (i) learn features for MOT, and (ii) learn to provide a solution by reasoning over the entire graph. To do so, we exploit the classical network flow formulation of MOT [Zhang2008] to define our model. Instead of learning pairwise costs and then using these within an available solver, our method learns to directly predict final partitions of the graph into trajectories. Towards this end, we perform learning directly in the natural MOT domain, i.e., in the graph domain, with a message passing network (MPN). Our MPN learns to combine deep features into high-order information across the graph. Hence, our method is able to account for global interactions among detections despite relying on a simple graph formulation. We show that our framework yields substantial improvements with respect to the state of the art, without requiring heavily engineered features and while being over one order of magnitude faster than some traditional graph partitioning methods.

To summarize, we make the following contributions:

  • We propose a MOT solver based on message passing networks, which can exploit the natural graph structure of the problem to perform both feature learning as well as final solution prediction.

  • We propose a novel time-aware neural message passing update step inspired by classic graph formulations of MOT.

  • We show significantly improved state-of-the-art results of our method in three public benchmarks.

Our code will be released upon acceptance of the paper.

2 Related work

Most state-of-the-art MOT works follow the tracking-by-detection paradigm which divides the problem into two steps: (i) detecting pedestrian locations independently in each frame, for which neural networks are currently the state-of-the-art [rennips2015, yolov2, sdpdetector], and (ii) linking corresponding detections across time to form trajectories.

Tracking as a graph problem. Data association can be done on a frame-by-frame basis for online applications [breitensteiniccv2009, esscvpr2008, pellegriniiccv2009] or track-by-track [berclazcvpr2006]. For video analysis tasks that can be done offline, batch methods are preferred since they are more robust to occlusions. The standard way to model data association is by using a graph, where each detection is a node, and edges indicate possible links among them. The data association can then be formulated as a maximum flow problem [berclaztpami2011] or, equivalently, a minimum cost problem with either fixed costs based on distance [jiangcvpr2007, pirsiavashcvpr2011, zhangcvpr2008], including motion models [lealiccv2011], or learned costs [lealcvpr2014]. Both formulations can be solved optimally and efficiently. Alternative formulations typically lead to more involved optimization problems, including minimum cliques [zamireccv2012] or general-purpose solvers, e.g., multi-cuts [tangcvpr2017]. A recent trend is to design ever more complex models which include other vision input such as reconstruction for multi-camera sequences [lealcvpr2012, wucvpr2011], activity recognition [choieccv2012], segmentation [milancvpr2015], keypoint trajectories [choiiccv2015] or joint detection [tangcvpr2017].

Learning in tracking. It is no secret that neural networks are now dominating the state-of-the-art in many vision tasks since [krizhevskyImageNet] showed their potential for image classification. The trend has also arrived in the tracking community, where learning has been used primarily to learn a mapping from image to optimal costs for the aforementioned graph algorithms. The authors of [lealcvprw2016] use a siamese network to directly learn the costs between a pair of detections, while a mixture of CNNs and recurrent neural networks (RNN) is used for the same purpose in [Sadeghian_2017_ICCV]. More evolved quadruplet networks [Son_2017_CVPR] or attention networks [Zhu_2018_ECCV] have led to improved results. In [ristanicvpr2018], the authors showed the importance of learned reID features for multi-object tracking. All aforementioned methods learn the costs independently from the optimization method that actually computes the final trajectories. In contrast, [kimaccv12, Wang2015b, Schulter_2017_CVPR] incorporate the optimization solvers into learning. The main idea behind these methods is that costs also need to be optimized for the solver in which they will be used. [kimaccv12, Wang2015b, end_to_end_urtasun] rely on structured learning losses while [Schulter_2017_CVPR] proposes a more general bi-level optimization framework. These works can be seen as similar to ours in spirit, given our common goal of incorporating the full inference model into learning for MOT. However, we follow a different approach towards this end: we propose to directly learn a solver and treat data association as a classification task, while their goal is to adapt their methods to perform well with closed form solvers. Moreover, all these works are limited to learning either pairwise costs [end_to_end_urtasun, Schulter_2017_CVPR] or additional quadratic terms [Wang2015b, kimaccv12] but cannot incorporate higher-order information as our method does. Instead, we propose to leverage the common graph formulation of MOT as a domain in which to perform learning.

Deep Learning on graphs. Graph Neural Networks (GNNs) were first introduced in [scarselli_graph_nn] as a generalization of neural networks that can operate on graph-structured domains. Since then, several works have focused on further developing and extending them through convolutional variants [Bruna2013SpectralNA, deferrand_cnns, kipf2016semi]. More recently, most methods were encompassed within a more general framework termed neural message passing [Gilmer2017NeuralMP] and further extended in [battaglia_graph_networks] as graph networks. Given a graph with some initial features for nodes and optionally edges, the main idea behind these models is to embed nodes (and edges) into representations that take into account not only the node's own features but also those of its neighbors in the graph, as well as the graph's overall topology. These methods have shown remarkable performance in a wide variety of areas, ranging from chemistry [Gilmer2017NeuralMP] to combinatorial optimization [combinatorial_cnns]. Within vision, they have been successfully applied to problems such as human action recognition [graph_nets_action_recognition], visual question answering [graph_nets_vqa] or single object tracking [Gao_2019_CVPR].

Figure 1: Overview of our method, with panels (a) Input, (b) Graph Construction + Feature Encoding, (c) Neural Message Passing, (d) Edge Classification, and (e) Output. (a) We receive as input a set of frames and detections. (b) We construct a graph in which nodes represent detections, and all nodes at different frames are connected by an edge. We initialize node embeddings in the graph with a CNN, and edge embeddings with an MLP encoding geometry information (not shown in figure). (c) The information contained in these embeddings is propagated across the graph for a fixed number of iterations through neural message passing. (d) Once this process terminates, the embeddings resulting from neural message passing are used to classify edges into active (colored green) and non-active (colored red). During training, we compute the cross-entropy loss of our predictions w.r.t. ground truth labels and backpropagate gradients through our entire pipeline. (e) At inference, we follow a simple rounding scheme to binarize our classification scores and obtain final trajectories.

3 Tracking as a Graph Problem

Our method’s formulation is based on the classical min-cost flow view of MOT [Zhang2008]. In order to provide some background and formally introduce our approach, we start by providing an overview of the network flow MOT formulation. We then explain how to leverage this framework to reformulate the data association task as a learning problem.

3.1 Problem statement

In tracking-by-detection, we are given as input a set of object detections $\mathcal{O} = \{o_1, \dots, o_n\}$, where $n$ is the total number of objects for all frames of a video. Each detection is represented by $o_i = (a_i, p_i, t_i)$, where $a_i$ denotes the raw pixels of the bounding box, $p_i$ contains its 2D image coordinates and $t_i$ its timestamp. A trajectory is defined as a set of time-ordered object detections $T_i = \{o_{i_1}, \dots, o_{i_{n_i}}\}$, where $n_i$ is the number of detections that form trajectory $i$. The goal of MOT is to find the set of trajectories $\mathcal{T}_* = \{T_1, \dots, T_m\}$ that best explains the observations $\mathcal{O}$.

The problem can be modelled with an undirected graph $G = (V, E)$, where $V := \{1, \dots, n\}$, $E \subset V \times V$, and each node $i \in V$ represents a unique detection $o_i \in \mathcal{O}$. The set of edges $E$ is constructed so that every pair of detections, i.e., nodes, in different frames is connected, which makes it possible to recover trajectories with missed detections. Now, the task of dividing the set of original detections into trajectories can be viewed as grouping nodes in this graph into disconnected components. Thus, each trajectory $T_i \in \mathcal{T}_*$ in the scene can be mapped into a group of nodes $\{i_1, \dots, i_{n_i}\}$ in the graph and vice-versa.
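The edge construction described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: detections are identified by their index, `timestamps` is a hypothetical list holding the frame index of each detection, and `build_edges` is a name introduced here, not part of any released code.

```python
from itertools import combinations


def build_edges(timestamps):
    """Return undirected edges (i, j), i < j, linking detections in different frames."""
    return [
        (i, j)
        for i, j in combinations(range(len(timestamps)), 2)
        if timestamps[i] != timestamps[j]  # same-frame detections are never linked
    ]


# Hypothetical toy input: five detections spread over three frames.
timestamps = [0, 0, 1, 1, 2]
edges = build_edges(timestamps)
```

Because detections are linked across all frame pairs and not only adjacent ones, a trajectory can skip frames in which the detector missed the object.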

3.2 Network Flow Formulation

In order to represent graph partitions, we introduce a binary variable for each edge in the graph. In the classical minimum cost flow formulation [Zhang2008] (see Footnote 1), this label is defined to be 1 for edges connecting nodes that (i) belong to the same trajectory, and (ii) are temporally consecutive inside a trajectory; and 0 for all remaining edges.

Footnote 1: We present a simplified version of the minimum cost flow-based MOT formulation [Zhang2008]. Specifically, we omit both sink and source nodes (and hence their corresponding edges) and we assume detection edges to be constant and 1-valued. We provide further details on our simplification and its relationship to the original problem in the supplementary material.

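A trajectory $T_i = \{o_{i_1}, \dots, o_{i_{n_i}}\}$ is equivalently denoted by the set of edges $\{(i_1, i_2), \dots, (i_{n_i - 1}, i_{n_i})\} \subset E$, corresponding to its time-ordered path in the graph. We will use this observation to formally define the edge labels. For every pair of nodes in different timestamps, $(i, j) \in E$, we define a binary variable $y_{(i,j)}$ as:

$$y_{(i,j)} := \begin{cases} 1 & \exists\, T_k \in \mathcal{T}_* \text{ s.t. } (i,j) \in T_k \\ 0 & \text{otherwise.} \end{cases}$$

An edge $(i, j)$ is said to be active whenever $y_{(i,j)} = 1$. We assume trajectories in $\mathcal{T}_*$ to be node-disjoint, i.e., a node cannot belong to more than one trajectory. Therefore, $y$ must satisfy a set of linear constraints. For each node $i \in V$:

$$\sum_{(j,i) \in E \ \text{s.t.}\ t_i > t_j} y_{(j,i)} \leq 1 \tag{1}$$

$$\sum_{(i,k) \in E \ \text{s.t.}\ t_i < t_k} y_{(i,k)} \leq 1 \tag{2}$$

These inequalities are a simplified version of the flow conservation constraints [Ahuja1993]. In our setting, they enforce that every node gets linked via an active edge to, at most, one node in past frames and one node in upcoming frames.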
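To make the edge labels and the two constraints concrete, the following sketch derives labels from a set of ground-truth trajectories and checks that each node has at most one active edge towards the past and one towards the future. The function names and the toy data are assumptions made for illustration only.

```python
def edge_labels(trajectories, edges):
    """Label an edge 1 iff it links consecutive detections of one trajectory."""
    active = set()
    for traj in trajectories:  # each trajectory is a time-ordered list of node ids
        active.update(zip(traj, traj[1:]))
    return {(i, j): int((i, j) in active or (j, i) in active) for (i, j) in edges}


def satisfies_flow_constraints(y, timestamps):
    """Check that every node has at most one active past and one active future edge."""
    past_count, future_count = {}, {}
    for (i, j), value in y.items():
        if value != 1:
            continue
        earlier, later = (i, j) if timestamps[i] < timestamps[j] else (j, i)
        future_count[earlier] = future_count.get(earlier, 0) + 1
        past_count[later] = past_count.get(later, 0) + 1
    counts = list(past_count.values()) + list(future_count.values())
    return all(c <= 1 for c in counts)


# Hypothetical toy graph: five detections over three frames, two trajectories.
timestamps = [0, 0, 1, 1, 2]
edges = [(0, 2), (0, 3), (0, 4), (1, 2), (1, 3), (1, 4), (2, 4), (3, 4)]
y = edge_labels([[0, 2, 4], [1, 3]], edges)
```

Note that the edge (0, 4) stays inactive even though both nodes belong to the same trajectory, since they are not temporally consecutive within it.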

3.3 From Learning Costs to Predicting Solutions

In order to obtain a graph partition with the framework we have described, the standard approach is to first associate a cost $c_{(i,j)}$ to each binary variable $y_{(i,j)}$. This cost encodes the likelihood of the edge being active [Leal-Taixe:2014:CVPR, lealcvprw2016, Schulter_2017_CVPR]. The final partition is found by optimizing:

$$\min_{y} \sum_{(i,j) \in E} c_{(i,j)}\, y_{(i,j)} \quad \text{subject to Equations (1) and (2), and } y_{(i,j)} \in \{0, 1\} \ \ \forall (i,j) \in E,$$

which can be solved with available solvers in polynomial time [Berclaz:2006:CVPR, networkflows].

We propose to, instead, directly learn to predict which edges in the graph will be active, i.e., predict the final value of the binary variable $y$. To do so, we treat the task as a classification problem over edges, where our labels are the binary variables $y$. Overall, we exploit the classical network flow formulation we have just presented to treat the MOT problem as a fully learnable task.

4 Learning to Track with Message Passing Networks

Our main contribution is a differentiable framework to train multi-object trackers as edge classifiers, based on the graph formulation we described in the previous section. Given a set of input detections, our model is trained to predict the values of the binary flow variables $y$ for every edge in the graph. Our method is based on a novel message passing network (MPN) able to capture the graph structure of the MOT problem. Within our proposed MPN framework, appearance and geometry cues are propagated across the entire set of detections, allowing our model to reason globally about the entire graph.

Our pipeline is composed of four main stages:

1. Graph construction: Given a set of object detections in a video, we construct a graph where nodes correspond to detections and edges correspond to connections between nodes (Section 3.2).

2. Feature encoding: We initialize the node appearance feature embeddings from a convolutional neural network (CNN) applied on the bounding box image. For each edge, i.e., for every pair of detections in different frames, we compute a vector with features encoding their bounding box relative size, position and time distance. We then feed it to a multi-layer perceptron (MLP) that returns a geometry embedding (Section 4.3).

3. Neural message passing: We perform a series of message passing steps over the graph. Intuitively, for each round of message passing, nodes share appearance information with their connecting edges, and edges share geometric information with their incident nodes. This yields updated embeddings for nodes and edges containing higher-order information that depends on the overall graph structure (Sections 4.1 and 4.2).

4. Training: We use the final edge embeddings to perform binary classification into active/non-active edges, and train our entire model using the cross-entropy loss (Section 4.4).

At test time, we use our model’s prediction per edge as a continuous approximation (between 0 and 1) of the target flow variables. We then follow a simple scheme to round them, and obtain the final trajectories.

For a visual overview of our pipeline, see Figure 1.

Figure 2: Visualization of node updates during message passing, with panels (a) Initial Setting, (b) Vanilla node update, and (c) Time-aware node update. Arrow directions in edges show time direction. Note the time division in $t-1$, $t$, and $t+1$. In this case, we have $N_3^{past} = \{1, 2\}$ and $N_3^{fut} = \{4, 5\}$. (a) shows the starting point after an edge update has been performed (Equation 3), and the intermediate node update embeddings (Equation 4) have been computed. (b) shows the standard node update in vanilla MPNs, in which all neighbors' embeddings are aggregated jointly. (c) shows our proposed update, in which embeddings from past and future frames are aggregated separately, then concatenated and fed into an MLP to obtain the new node embedding.

4.1 Message Passing Networks

In this section, we provide a brief introduction to MPNs based on the work presented in [Gilmer2017NeuralMP, kipf_icml2018, interaction_nets_battaglia, battaglia_graph_networks]. Let $G = (V, E)$ be a graph. Let $h_i^{(0)}$ be a node embedding for every $i \in V$, and $h_{(i,j)}^{(0)}$ an edge embedding for every $(i,j) \in E$. The goal of MPNs is to learn a function to propagate the information contained in node and edge feature vectors across $G$.

The propagation procedure is organized in embedding updates for edges and nodes, which are known as message passing steps [Gilmer2017NeuralMP]. In [battaglia_graph_networks, kipf_icml2018, interaction_nets_battaglia], each message passing step is divided, in turn, into two updates: one from nodes to edges $(v \rightarrow e)$, and one from edges to nodes $(e \rightarrow v)$. The updates are performed sequentially for a fixed number of iterations $L$. For each $l \in \{1, \dots, L\}$, the general form of the updates is the following [battaglia_graph_networks]:

$$(v \rightarrow e) \qquad h_{(i,j)}^{(l)} = \mathcal{N}_e\left(\left[h_i^{(l-1)}, h_j^{(l-1)}, h_{(i,j)}^{(l-1)}\right]\right) \tag{3}$$

$$(e \rightarrow v) \qquad m_{(i,j)}^{(l)} = \mathcal{N}_v\left(\left[h_i^{(l-1)}, h_{(i,j)}^{(l)}\right]\right) \tag{4}$$

$$h_i^{(l)} = \Phi\left(\left\{m_{(i,j)}^{(l)}\right\}_{j \in N_i}\right) \tag{5}$$

Here, $\mathcal{N}_e$ and $\mathcal{N}_v$ represent learnable functions, e.g., MLPs, that are shared across the entire graph. $[\cdot]$ denotes concatenation, $N_i \subset V$ is the set of adjacent nodes to $i$, and $\Phi$ denotes an order-invariant operation, e.g., a summation, maximum or an average. Note that after $L$ iterations, each node contains information from all other nodes at distance $L$ in the graph. Hence, $L$ plays an analogous role to the receptive field of CNNs, allowing embeddings to capture context information.
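A single message passing step of this general form can be sketched as follows. The sketch assumes embeddings are plain Python lists, uses summation as the order-invariant aggregation, and replaces the learnable functions with toy stand-ins; `message_passing_step` and `toy` are hypothetical names, and leaving isolated nodes unchanged is a choice made here for convenience.

```python
def message_passing_step(h_node, h_edge, edge_fn, node_fn):
    """One generic message passing step on an undirected graph.

    h_node: dict node -> embedding (list of floats)
    h_edge: dict (i, j) -> embedding (list of floats)
    edge_fn, node_fn: stand-ins for the learnable edge and node update functions
    """
    # Edge update: concatenate both endpoint embeddings with the edge embedding.
    new_edge = {
        (i, j): edge_fn(h_node[i] + h_node[j] + h_edge[(i, j)]) for (i, j) in h_edge
    }
    # Node update: every incident edge sends one message to each of its endpoints.
    messages = {i: [] for i in h_node}
    for (i, j), e in new_edge.items():
        for node in (i, j):
            messages[node].append(node_fn(h_node[node] + e))
    # Order-invariant aggregation (summation); isolated nodes keep their embedding.
    new_node = {
        i: [sum(c) for c in zip(*msgs)] if msgs else list(h_node[i])
        for i, msgs in messages.items()
    }
    return new_node, new_edge


def toy(x):
    """Toy stand-in for an MLP: maps a vector to the 1-d vector holding its sum."""
    return [sum(x)]


h_node = {0: [1.0], 1: [2.0], 2: [3.0]}
h_edge = {(0, 1): [0.5], (1, 2): [0.25]}
new_node, new_edge = message_passing_step(h_node, h_edge, toy, toy)
```

Calling the function repeatedly, feeding its outputs back in, corresponds to running several message passing iterations, so that information travels one additional hop per call.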

4.2 Time-Aware Message Passing

The previous message passing framework was designed to work on arbitrary graphs. However, MOT graphs have a very specific structure that we propose to exploit. Our goal is to encode a MOT-specific inductive bias in our network, specifically, in the node update step.

Recall the node update depicted in Equations 4 and 5, which allows each node to be compared with its neighbors and aggregate information from all of them to update its embedding with further context. Recall also the structure of our flow conservation constraints (Equations 1 and 2), which imply that each node can be connected to, at most, one node in future frames and another one in past frames. Arguably, aggregating all neighboring embeddings at once makes it difficult for the updated node embedding to capture whether these constraints are being violated or not (see Section 5.2 for constraint satisfaction analysis).

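More generally, explicitly encoding the temporal structure of MOT graphs into our MPN formulation can be a useful prior for our learning task. Towards this goal, we modify Equations 4 and 5 into time-aware update rules by dissecting the aggregation into two parts: one over nodes in the past, and another over nodes in the future. Formally, let us denote the neighboring nodes of $i$ in future and past frames by $N_i^{fut}$ and $N_i^{past}$, respectively. Let us also define two different MLPs, namely, $\mathcal{N}_v^{fut}$ and $\mathcal{N}_v^{past}$. At each message passing step $l$ and for every node $i \in V$, we start by computing past and future edge-to-node embeddings for all of its neighbors $j \in N_i$ as:

$$m_{(i,j)}^{(l)} = \begin{cases} \mathcal{N}_v^{past}\left(\left[h_i^{(l-1)}, h_{(i,j)}^{(l)}, h_i^{(0)}\right]\right) & \text{if } j \in N_i^{past} \\ \mathcal{N}_v^{fut}\left(\left[h_i^{(l-1)}, h_{(i,j)}^{(l)}, h_i^{(0)}\right]\right) & \text{if } j \in N_i^{fut} \end{cases} \tag{6}$$

Note that the initial embeddings $h_i^{(0)}$ have been added to the computation (see Footnote 2). After that, we aggregate these embeddings separately, depending on whether they were in future or past positions with respect to $i$:

$$h_{i,past}^{(l)} = \sum_{j \in N_i^{past}} m_{(i,j)}^{(l)} \tag{7}$$

$$h_{i,fut}^{(l)} = \sum_{j \in N_i^{fut}} m_{(i,j)}^{(l)} \tag{8}$$

Now, these operations yield past and future embeddings $h_{i,past}^{(l)}$ and $h_{i,fut}^{(l)}$, respectively. We compute the final updated node embedding by concatenating them and feeding the result to one last MLP, denoted as $\mathcal{N}_v$:

$$h_i^{(l)} = \mathcal{N}_v\left(\left[h_{i,past}^{(l)}, h_{i,fut}^{(l)}\right]\right) \tag{9}$$

We summarize our time-aware update in Figure 2(c). As we demonstrate experimentally (see Section 5.2), this simple architectural design results in a significant performance improvement with respect to the vanilla node update of MPNs, shown in Figure 2(b).

Footnote 2: This skip connection ensures that our model does not forget its initial features during message passing, and we apply it analogously with initial edge features in Equation 3.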
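The time-aware node update can be sketched in the same style. As before, this is an illustration under stated assumptions: embeddings are Python lists, `past_fn`, `fut_fn` and `out_fn` are toy stand-ins for the three MLPs, `msg_dim` is the (assumed known) message dimension, and all names are hypothetical.

```python
def time_aware_node_update(h_node, h_node0, h_edge, t, past_fn, fut_fn, out_fn, msg_dim):
    """Aggregate messages from past and future neighbors separately, then combine.

    h_node:  dict node -> current embedding
    h_node0: dict node -> initial embedding (skip connection)
    h_edge:  dict (i, j) -> already updated edge embedding
    t:       dict node -> timestamp; edges only link nodes with different timestamps
    """
    past = {i: [0.0] * msg_dim for i in h_node}
    fut = {i: [0.0] * msg_dim for i in h_node}
    for (i, j), e in h_edge.items():
        for node, other in ((i, j), (j, i)):
            x = h_node[node] + e + h_node0[node]  # concatenation incl. initial features
            if t[other] < t[node]:
                msg, acc = past_fn(x), past   # neighbor lies in a past frame
            else:
                msg, acc = fut_fn(x), fut     # neighbor lies in a future frame
            acc[node] = [a + b for a, b in zip(acc[node], msg)]
    # Concatenate both aggregates and apply the final node function.
    return {i: out_fn(past[i] + fut[i]) for i in h_node}


# Toy stand-ins so the two branches are easy to tell apart.
past_fn = lambda x: [sum(x)]
fut_fn = lambda x: [2 * sum(x)]
out_fn = lambda x: list(x)  # identity, to expose the concatenated aggregates

h = {0: [1.0], 1: [2.0], 2: [3.0]}
t = {0: 0, 1: 1, 2: 2}
h_edge = {(0, 1): [0.5], (1, 2): [0.25]}
updated = time_aware_node_update(h, h, h_edge, t, past_fn, fut_fn, out_fn, msg_dim=1)
```

Keeping the two aggregates in separate slots of the concatenated vector lets the final function see how much evidence comes from each temporal direction, which a single joint sum would blend together.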

4.3 Feature encoding

The initial embeddings that our MPN receives as input are produced by other backpropagatable networks.

Appearance embedding. We rely on a convolutional neural network (CNN), denoted as $\mathcal{N}_v^{enc}$, to learn to extract feature embeddings directly from RGB data. For every detection $o_i \in \mathcal{O}$, and its corresponding image patch $a_i$, we obtain $o_i$'s corresponding node embedding by computing $h_i^{(0)} := \mathcal{N}_v^{enc}(a_i)$.

Geometry embedding. We seek to obtain a representation that encodes, for each pair of detections in different frames, their relative position and size, as well as their distance in time. For every pair of detections $o_i$ and $o_j$ with timestamps $t_i \neq t_j$, we consider their bounding box coordinates parameterized by top left corner image coordinates, height and width, i.e., $(x_i, y_i, h_i, w_i)$ and $(x_j, y_j, h_j, w_j)$. We compute their relative distance and size as:

$$\left( \frac{2(x_j - x_i)}{h_i + h_j},\ \frac{2(y_j - y_i)}{h_i + h_j},\ \log\frac{h_i}{h_j},\ \log\frac{w_i}{w_j} \right)$$

We then concatenate this coordinate-based feature vector with the time difference $t_j - t_i$ and relative appearance $\lVert \mathcal{N}_v^{enc}(a_j) - \mathcal{N}_v^{enc}(a_i) \rVert_2$ and feed it to a neural network $\mathcal{N}_e^{enc}$ in order to obtain the initial edge embedding $h_{(i,j)}^{(0)}$.
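As an illustration of such coordinate-based edge features, the sketch below computes a relative offset normalized by the mean box height together with log-ratios of heights and widths, and appends the time difference. This particular parameterization is an assumption of the sketch, the appearance distance is omitted because it requires the CNN, and the function names are hypothetical.

```python
import math


def relative_geometry(box_i, box_j):
    """Relative position and size of two boxes given as (x, y, height, width)."""
    x_i, y_i, h_i, w_i = box_i
    x_j, y_j, h_j, w_j = box_j
    mean_height = (h_i + h_j) / 2  # normalizes offsets, making them scale-independent
    return (
        (x_j - x_i) / mean_height,
        (y_j - y_i) / mean_height,
        math.log(h_i / h_j),
        math.log(w_i / w_j),
    )


def initial_edge_features(box_i, box_j, t_i, t_j):
    """Geometry features followed by the time difference between the detections."""
    return relative_geometry(box_i, box_j) + (t_j - t_i,)


# Hypothetical boxes: same size, second one shifted right and down.
features = initial_edge_features((0.0, 0.0, 10.0, 5.0), (10.0, 5.0, 10.0, 5.0), 3, 5)
```

Dividing by the box height and using log-ratios keeps the features comparable between pedestrians close to the camera and those far away.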

4.4 Training and inference

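Training loss. To classify edges, we use an MLP with a sigmoid-valued single output unit, which we denote as $\mathcal{N}_e^{class}$. For every edge $(i,j) \in E$, we compute our prediction $\hat{y}_{(i,j)}^{(l)}$ by feeding the output embeddings of our MPN at a given message passing step $l$, namely $h_{(i,j)}^{(l)}$, to $\mathcal{N}_e^{class}$. For training, we use the binary cross-entropy of our predictions over the embeddings produced in the last message passing steps, with respect to the target flow variables $y$:

$$\mathcal{L} = \frac{-1}{|E|} \sum_{l = l_0}^{L} \sum_{(i,j) \in E} \left[ w \cdot y_{(i,j)} \log\left(\hat{y}_{(i,j)}^{(l)}\right) + \left(1 - y_{(i,j)}\right) \log\left(1 - \hat{y}_{(i,j)}^{(l)}\right) \right] \tag{10}$$

where $l_0 \in \{1, \dots, L\}$ is the first message passing step at which predictions are computed, and $w$ denotes a positive scalar used to weight 1-valued labels to account for the high imbalance between active and inactive edges.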
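A weighted binary cross-entropy of this kind, summed over several message passing steps and normalized by the number of edges, can be sketched in plain Python as follows. The data layout (one dictionary of predictions per step), the clamping constant `eps`, and the function name are assumptions of the sketch.

```python
import math


def weighted_bce(preds_per_step, labels, w, eps=1e-12):
    """Weighted binary cross-entropy summed over steps, divided by the edge count.

    preds_per_step: list of dicts edge -> predicted probability, one per step
    labels:         dict edge -> 0 or 1
    w:              positive weight applied to the 1-valued (active) labels
    """
    total = 0.0
    for preds in preds_per_step:
        for edge, y in labels.items():
            p = min(max(preds[edge], eps), 1.0 - eps)  # clamp to avoid log(0)
            total += w * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


labels = {(0, 1): 1, (0, 2): 0}
uninformed = [{(0, 1): 0.5, (0, 2): 0.5}]
loss_unweighted = weighted_bce(uninformed, labels, w=1.0)
loss_weighted = weighted_bce(uninformed, labels, w=2.0)
```

With `w` larger than 1, mistakes on the rare active edges cost more than mistakes on the many inactive ones, which counteracts the label imbalance.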

Inference. During inference, we interpret the set of output values obtained from our model at the last message passing step as the solution to our MOT problem, i.e., the final value for the indicator variables $y$. Since these predictions are the output of a sigmoid unit, their values are between 0 and 1. An easy way to obtain hard 0 or 1 decisions is to binarize the output by thresholding. However, this procedure does not generally guarantee that the flow conservation constraints in Equations 1 and 2 are preserved. In practice, thanks to the proposed time-aware update step, our method will satisfy over 98% of the constraints on average when thresholding at 0.5. After that, a simple greedy rounding scheme suffices to obtain a feasible binary output. The exact optimal rounding solution can also be obtained efficiently with a simple linear program (see supplementary material).
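The text does not spell out the greedy rounding scheme, so the following is only one plausible variant, written under that assumption: edges are visited in order of decreasing score, and an edge is activated if it passes the threshold and neither endpoint already has an active edge in the corresponding temporal direction. The function name and toy scores are hypothetical.

```python
def greedy_round(scores, timestamps, threshold=0.5):
    """Greedily select active edges so that the flow constraints hold.

    scores:     dict (i, j) -> predicted probability that the edge is active
    timestamps: sequence giving the frame index of each node
    Returns the list of active edges as (earlier_node, later_node) pairs.
    """
    has_future, has_past = set(), set()
    active = []
    for (i, j), score in sorted(scores.items(), key=lambda kv: -kv[1]):
        if score < threshold:
            break  # all remaining edges score lower and stay inactive
        earlier, later = (i, j) if timestamps[i] < timestamps[j] else (j, i)
        if earlier in has_future or later in has_past:
            continue  # activating this edge would violate a constraint
        has_future.add(earlier)
        has_past.add(later)
        active.append((earlier, later))
    return active


timestamps = [0, 1, 1, 2]
scores = {(0, 1): 0.9, (0, 2): 0.6, (1, 3): 0.8, (2, 3): 0.7}
active = greedy_round(scores, timestamps)
```

In this toy case, plain thresholding at 0.5 would activate all four edges and give nodes 0 and 3 two links each, while the greedy pass keeps only the two highest-scoring compatible edges.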

5 Experiments

In this section, we first present an ablation study to better understand the behavior of our model. We then compare to published methods on three datasets, and show state-of-the-art results. All experiments are done on the MOTChallenge pedestrian benchmark.

Datasets and evaluation metrics. The multiple object tracking benchmark MOTChallenge (the official MOTChallenge web page is available at https://motchallenge.net) consists of several challenging pedestrian tracking sequences, with frequent occlusions and crowded scenes. The challenge contains three separate tracking benchmarks, namely 2D MOT 2015 [lealarxiv2015], MOT16 [milanarxiv2016] and MOT17 [milanarxiv2016]. They contain sequences with varying viewing angle, size and number of objects, camera motion and frame rate. For all challenges, we use the detections provided by MOTChallenge to ensure a fair comparison with other methods. The benchmark provides several evaluation metrics. The Multiple Object Tracking Accuracy (MOTA) [clear] and ID F1 Score (IDF1) [ristanieccvw2016] are the most important ones, as they quantify two of the main aspects of multiple object tracking, namely, object coverage and identity preservation.

5.1 Implementation details

Network models. For the network used to encode detection appearances (see Section 4.3), we employ the first 4 blocks of the ResNet50 [He2016DeepRL] architecture pretrained on ImageNet [imagenet_cvpr09], followed by two fully-connected layers to obtain embeddings of dimension 256.

We train the network for the task of re-identification (ReID) on the Market1501 dataset [market_dataset], as done in [People_Tracking, Kim_2018_ECCV, maACCV2019]. Once trained, two additional fully connected layers are added to reduce the embedding size to 32. The rest of the encoder and classifier networks are MLPs and their exact architectures are detailed in the supplementary material.

Data Augmentation. To train our network, we sample batches of 8 graphs, corresponding to 15 frames from a given sequence, sampled at 5 frames per second. We do data augmentation by randomly removing nodes from the graph as well as adding nodes, simulating missing detections and false alarms, respectively. The ground truth edge labels of the resulting graph are recomputed accordingly.

We use learning rate for convolutional layers and , weight decay term and an Adam Optimizer with and set to and , respectively. We train for 30 epochs, which has been shown to be sufficient for convergence in our experiments.

Batch Processing. We process videos offline in batches of frames, with overlapping frames between batches to ensure that the maximum time distance between two connected nodes in the graph remains stable along the whole graph. We restrict the connectivity of graphs by connecting two nodes only if both are among each other's top-$K$ mutual nearest neighbors according to the pretrained CNN features. Each batch is solved independently by our network, and for overlapping edges between batches, we average the predictions coming from all graph solutions before the rounding step. To fill gaps in our trajectories, we perform simple bilinear interpolation along missing frames.
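The mutual nearest-neighbor pruning can be sketched as follows. The sketch assumes a precomputed symmetric matrix `dist` of feature distances between detections and ignores the additional requirement that linked detections lie in different frames; the function name and toy distances are hypothetical.

```python
def mutual_topk_edges(dist, k):
    """Keep edge (i, j), i < j, only if each node is among the other's k nearest."""
    n = len(dist)
    topk = []
    for i in range(n):
        # Rank all other nodes by their feature distance to node i.
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        topk.append(set(others[:k]))
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if j in topk[i] and i in topk[j]
    ]


# Hypothetical distances for three detections with 1-d features 0, 1 and 5.
dist = [
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 4.0],
    [5.0, 4.0, 0.0],
]
sparse_edges = mutual_topk_edges(dist, k=1)
```

Requiring the relation to hold in both directions removes one-sided links: in the toy case node 2 ranks node 1 first, but node 1 prefers node 0, so the edge (1, 2) is dropped.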

Baseline. Recently, [tracktor] has shown the potential of detectors for simple data association, establishing a new baseline for MOT, which we also follow. Note that the method still uses public detections and is therefore fully comparable to all methods on MOTChallenge. One key drawback of [tracktor] is its inability to fill in gaps or to properly recover identities through occlusions. As we will show, this is exactly where our method excels.

5.2 Ablation study

In this section, we aim to answer three main questions towards understanding our model. Firstly, we compare the performance of our time-aware neural message passing updates with respect to the time-agnostic vanilla node update described in Section 4.1. Secondly, we assess the impact of the number of message passing steps used in network training on the overall tracking performance. Thirdly, we investigate how different information sources, namely, appearance embeddings from our CNN and relative position information, affect different evaluation metrics.

Experimental Setup. We conduct all of our experiments with the training sequences of MOT15 and MOT17 datasets. To evaluate our models, we split MOT17 sequences into three sets, and use these to test our models with 3-fold cross-validation. We then report the best overall MOT17 metrics obtained during validation (see supplementary material for details). In order to provide a fair comparison with our baselines that show poor constraint satisfaction, we use exact rounding via a linear program in all experiments (see section 4.4).

Time-Aware Message Passing. We investigate how our proposed time-aware node update affects performance. For a fair comparison, we perform hyperparameter search for our baseline. Still, we observe a significant improvement in almost all metrics, including over 6 points in IDF1. As we expected, our model is particularly powerful at linking detections, since it exploits neighboring information and graph structure, making the decisions more robust, and hence producing far fewer identity switches. We also report the percentage of constraints that are satisfied when directly binarizing our model's output values by thresholding. Remarkably, our method with time-aware node updates is able to produce almost completely feasible results automatically, while the baseline has a much lower constraint satisfaction. This demonstrates its ability to capture the MOT problem structure.

Arch.    | MOTA | IDF1 | MT  | ML  | FP   | FN     | ID Sw. | Constr.
Vanilla  | 62.0 | 63.1 | 557 | 386 | 4401 | 122584 | 1111   | 77.8
T. aware | 63.7 | 69.2 | 638 | 378 | 6676 | 115078 | 587    | 98.8
Table 1: We investigate how our proposed update improves tracking performance with respect to a vanilla MPN. Vanilla stands for a basic MPN, T. aware denotes our proposed time-aware update. The metric Constr. refers to the percentage of flow conservation constraints satisfied on average over entire validation sequences.

Number of Message Passing Steps. Intuitively, increasing the number of message passing steps $L$ allows each node and edge embedding to encode further context, and gives edge predictions the ability to be iteratively refined. Hence, one would expect higher values to yield better performing networks. We test this hypothesis in Figure 3 by training networks with a fixed number of message passing steps, from 0 to 15. We use the case $L = 0$ as a baseline in which we train a binary classifier on top of our initial edge embeddings, and hence, no contextual information is used. As expected, we see a clear upward tendency for both IDF1 and MOTA. Moreover, we observe a steep increase in both metrics from 0 to 3 message passing steps, which demonstrates that the biggest improvement is obtained when switching from pairwise to high-order features in the graph. We also note that the upward tendency stagnates around six message passing steps, and shows no improvement after twelve message passing steps. Hence, we use $L = 12$ in our final configuration.

Effect of the features. Our model receives two main streams of information: (i) appearance information from a CNN operating on RGB images, and (ii) geometry features from an MLP encoding relative position between detections. These are incorporated into the model by being used as initial node and edge embeddings, respectively, and later refined during neural message passing. We show several configurations in Table 2. For nodes, we explore the difference between initializing their feature vectors with zero-valued vectors vs. CNN features. For edge embeddings, we experiment with combinations from three sources of features: time difference, relative position and the Euclidean distance in CNN embeddings between the two bounding boxes. We highlight the fact that relative position seems to be a key component of overall performance since, when no other information is available, the network can still achieve a MOTA value of 62.9. Nevertheless, node CNN features are effective at reducing the number of false positives and identity switches. Note that only by having CNN embeddings in both node and edge features do we achieve a significantly higher accuracy and identity preservation.

Node Feats.  Edge Feats.    MOTA  IDF1  MT   ML   FP     FN      ID Sw.
CNN          Time           59.0  51.2  571  381  12623  122730  2609
–            Time+Pos       62.9  67.3  636  372  8435   115534  1004
CNN          Time+Pos       63.2  67.8  643  370  8063   115142  906
–            Time+Pos+CNN   63.2  67.8  648  369  8054   114960  924
CNN          Time+Pos+CNN   63.7  69.2  638  378  6676   115078  587
Table 2: We investigate the influence of incorporating several sources of information in our model. For nodes, we consider either CNN embeddings (CNN) or a zero-valued vector (–). For edges, we explore combinations of three sources of information: time difference in seconds (Time), relative position features (Pos), and the Euclidean distance between the CNN embeddings of the two detections (CNN).
Figure 3: We report the evolution of IDF1 and MOTA when training networks with an increasing number of message passing steps.

5.3 Benchmark evaluation

We report the metrics obtained by our model on the MOT15, MOT16 and MOT17 datasets in Table 3. It is worth noting the large performance gap between our method and graph partitioning methods (shown as (G) in Table 3). Due to space constraints, we provide a detailed comparison of our method with graph methods in the supplementary material. Our method obtains state-of-the-art results on all challenges, improving especially the IDF1 measure by 9.4, 4.6, and 4.4 percentage points, respectively, which demonstrates its strong performance in identity preservation. We attribute this performance increase to the ability of our message passing architecture to collect higher-order information. Taking neighbors' information into consideration when linking trajectories allows our method to make globally informed predictions, which leads to fewer identity switches. Moreover, we also achieve more trajectory coverage, represented by a significant increase in Mostly Tracked (MT) trajectories of up to 9 percentage points. It is worth noting that, while surpassing previous approaches, we are significantly faster (well over one order of magnitude) than other state-of-the-art methods, especially when compared to expensive graph partitioning methods, e.g., [jCCpami2018].

Method                                   MOTA  IDF1  MT (%)  ML (%)  FP     FN      ID Sw.  Hz

2D MOT 2015 [lealarxiv2015]
Ours                                     48.3  56.5  32.2    24.3    9640   21629   504     11.8
Tracktor [tracktor]                      44.1  46.7  18.0    26.2    6477   26577   1318    16.7
KCF [wacv_soa] (G)                       38.9  44.5  16.6    31.5    7321   29501   720     0.3
AP_HWDPL_p [ChenASZB17] (G)              38.5  47.1  8.7     37.4    4005   33203   586     6.7
STRN [spatio_temporalrelation_networks]  38.1  46.6  11.5    33.4    5451   31571   1033    13.8
AMIR15 [SadeghianAS17]                   37.6  46.0  15.8    26.8    7933   29397   1026    1.9
JointMC [jCCpami2018] (G)                35.6  45.1  23.2    39.3    10580  28508   457     0.6
DeepFlow [Schulter_2017_CVPR] (G)        26.8

MOT16 [milanarxiv2016]
Ours                                     55.9  59.9  26.0    35.6    7086   72902   431     11.8
Tracktor [tracktor]                      54.4  52.5  19.0    36.9    3280   79149   682     16.7
NOTA [aggregate_Track_app] (G)           49.8  55.3  17.9    37.7    7428   83614   614
HCC [maACCV2019] (G)                     49.3  50.7  17.8    39.9    5333   86795   391     0.8
LMP [TangAAS17] (G)                      48.8  51.3  18.2    40.1    6654   86245   481     0.5
KCF [wacv_soa] (G)                       48.8  47.2  15.8    38.1    5875   86567   906     0.1
GCRA [MaYYZZJX18]                        48.2  48.6  12.9    41.1    5104   88586   821     2.8
FWT [HenschelLCR17] (G)                  47.8  44.3  19.1    38.2    8886   85487   852     0.6

MOT17 [milanarxiv2016]
Ours                                     55.7  59.1  27.2    34.4    25013  223531  1433    11.8
Tracktor [tracktor]                      53.5  52.3  19.5    36.6    12201  248047  2072    16.7
JBNOT [tracktor] (G)                     52.6  50.8  19.7    35.8    31572  232659  3050    5.4
FAMNet [famnet]                          52.0  48.7  19.1    33.4    14138  253616  3072
eHAF [sheng2018] (G)                     51.8  54.7  23.4    37.9    33212  236772  1834    0.7
NOTA [aggregate_Track_app] (G)           51.3  54.7  17.1    35.4    20148  252531  2285
FWT [HenschelLCR17] (G)                  51.3  47.6  21.4    35.2    24101  247921  2648    0.2
jCC [jCCpami2018] (G)                    51.2  54.5  20.9    37.0    25937  247822  1802    1.8
Table 3: Comparison of our method with the state of the art. We set new state-of-the-art results by a significant margin in terms of MOTA and especially IDF1. Our learned solver is more accurate while being orders of magnitude faster than graph partitioning methods, indicated with (G).
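The IDF1 and MT margins quoted in the text can be re-derived from the rows of Table 3. The short script below does so using only values copied from the table; the dictionary layout and the helper name `margin_over_best` are illustrative choices, not part of our method.

```python
# IDF1 scores copied from Table 3: ours vs. every other listed method.
idf1 = {
    "MOT15": {"ours": 56.5, "others": [46.7, 44.5, 47.1, 46.6, 46.0, 45.1]},
    "MOT16": {"ours": 59.9, "others": [52.5, 55.3, 50.7, 51.3, 47.2, 48.6, 44.3]},
    "MOT17": {"ours": 59.1, "others": [52.3, 50.8, 48.7, 54.7, 54.7, 47.6, 54.5]},
}
# Mostly Tracked (MT, in percent) on 2D MOT 2015, also from Table 3.
mt_mot15 = {"ours": 32.2, "others": [18.0, 16.6, 8.7, 11.5, 15.8, 23.2]}

def margin_over_best(entry):
    """Gap in percentage points between our score and the best competitor."""
    return round(entry["ours"] - max(entry["others"]), 1)

margins = {name: margin_over_best(entry) for name, entry in idf1.items()}
print(margins)                     # {'MOT15': 9.4, 'MOT16': 4.6, 'MOT17': 4.4}
print(margin_over_best(mt_mot15))  # 9.0
```

The margins are measured against the best competing entry on each benchmark, which is not always the method with the second-highest MOTA.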

6 Conclusions

We have demonstrated how to exploit the network flow formulation of MOT to treat the entire tracking problem as a learning task. We have proposed a fully differentiable pipeline in which both feature extraction and data association can be jointly learned. At the core of our algorithm lies a message passing network with a novel time-aware update step that can capture the problem's graph structure. In our experiments, we have shown a clear performance improvement of our method over the previous state of the art. We expect our approach to open the door for future work to go beyond feature extraction for MOT and to focus, instead, on integrating learning into the overall data association task.