Occlusion-robust Online Multi-object Visual Tracking using a GM-PHD Filter with a CNN-based Re-identification

12/10/2019 · Nathanael L. Baisa, et al. · University of Lincoln

We propose a novel online multi-object visual tracking algorithm via a tracking-by-detection paradigm using a Gaussian mixture Probability Hypothesis Density (GM-PHD) filter and deep Convolutional Neural Network (CNN) appearance representation learning. The GM-PHD filter has a linear complexity with the number of objects and observations while estimating the states and cardinality of an unknown and time-varying number of objects in the scene. Though it handles object birth, death and clutter in a unified framework, it is susceptible to miss-detections and does not include the identity of objects. We use visual-spatio-temporal information obtained from object bounding boxes and deeply learned appearance representations to perform estimates-to-tracks data association for labelling each target. We learn the deep CNN appearance representations by training an identification network (IdNet) on large-scale person re-identification data sets. We also employ an additional unassigned tracks prediction step after the update step to overcome the susceptibility of the GM-PHD filter to miss-detections caused by occlusion. Our tracker, which runs in real-time, is applied to track multiple objects in video sequences acquired under varying environmental conditions and object density. Lastly, we make extensive evaluations on the Multiple Object Tracking 2016 (MOT16) and 2017 (MOT17) benchmark data sets and find that our online tracker significantly outperforms several state-of-the-art trackers in terms of tracking accuracy and identification.


I Introduction

Multi-target tracking is an active research field in computer vision with a wide variety of applications such as intelligent surveillance, autonomous driving, robot navigation and augmented reality. Its main purpose is to estimate the states (locations) of objects from noisy detections, recognize their identities in each video frame and produce their trajectories. The most commonly adopted paradigm for multi-target tracking in computer vision is tracking-by-detection, due to the remarkable advances made in object detection algorithms driven by deep learning. In this tracking-by-detection paradigm, multi-target filters and/or data association are applied to the object detections obtained from the object detector(s) on each video frame to generate trajectories of tracked targets over time. Online [46][48] and offline (batch) [26][37][39] tracking approaches are the two commonly used families in the literature. Online tracking methods estimate the target state at each time instant with Bayesian filtering using only the current detections, and rely on prediction with motion models to continue tracking through miss-detections. In offline tracking methods, by contrast, both past and future detections are fed into mainly global optimization-based data association approaches to handle miss-detections. Generally, the offline tracking approaches outperform the online tracking methods, though they are of limited use for time-critical real-time applications where it is crucial to provide state estimates as the detections arrive, such as autonomous driving and robot navigation.

Multi-target tracking algorithms generally receive a random number of detections when an object detector is applied to a video frame. These detections carry information uncertainty, usually considered as measurement origin uncertainty [54], which includes miss-detections, clutter and very closely spaced unresolved objects. Thus, in addition to this measurement origin uncertainty, the multi-object tracking method needs to handle the targets' births, deaths, and the process and observation noises. As surveyed in [32][54], the three commonly known traditional data association methods used for numerous applications are Global Nearest Neighbor (GNN) [8], the Joint Probabilistic Data Association Filter (JPDAF) [8] and Multiple Hypothesis Tracking (MHT) [21, 8]. While the GNN (computed using the Hungarian algorithm [13]) is sensitive to noise, the JPDAF and the MHT are computationally very expensive. Since these methods are computationally expensive and heavily rely on heuristics to track a time-varying number of objects, another multi-target tracking approach has been proposed based on random finite set (RFS) theory [34]. This approach includes all sources of uncertainty in a unified probabilistic framework. The probability hypothesis density (PHD) filter [33] is the most commonly adopted RFS-based filter in computer vision for tracking targets in video sequences since it has a linear complexity with the number of objects and observations.

The PHD filter allows target birth, death, clutter (false alarms) and missing detections; however, it does not naturally incorporate the identity of objects in the framework since it is based on the indistinguishability assumption of the point process. In order to include the identity of objects, an additional technique is needed. This filter is also very susceptible to miss-detections. In fact, the PHD filter was designed originally for radar tracking applications where the collected observations can contain numerous false alarms with very few miss-detections. However, in visual tracking applications, observations obtained from the recent deep learning-driven object detectors contain very few false alarms (false positives) but a high level of miss-detections (false negatives) due to occlusion. The parameter which controls the detection and miss-detection part of the PHD filter is the probability of detection $p_D$ (see section III-A on the Gaussian mixture implementation of the PHD (GM-PHD) filter [55]). In our experiments, the GM-PHD filter only works if $p_D$ is set to roughly 0.8 or above; otherwise the covariance matrix fails to remain a square, symmetric, positive definite matrix, which in turn forces the GM-PHD filter to crash. This means that even if we set $p_D$ to 0.8, a miss-detected target cannot be maintained since its weight drops too quickly (it is scaled by the probability of miss-detection, $1 - p_D$, at each update). This is referred to as the target death problem, where targets die faster than they should when a miss-detection happens. Thus, the GM-PHD filter is naturally robust to false positives but very susceptible to miss-detections.

More recently, outstanding results have been obtained on a wide range of tasks using deep Convolutional Neural Network (CNN) features, such as object recognition [24][19], object detection [40] and person re-identification [64]. Better performance has also been obtained on multi-target tracking using deep learning [26, 45], since deeply learned appearance representations of objects can discriminate the object of interest not only from the background but also from other objects of similar appearance. However, the advantages of deep appearance representations have not been explored in random finite set-based filters, such as the GM-PHD filter, in a way that works online and runs fast enough to be suitable for real-time applications.

In this work, we propose an online multi-object visual tracker based on the GM-PHD filter using a tracking-by-detection approach for real-time applications, which not only runs in real-time but also addresses track management (target birth, death and labelling), false positives and miss-detections jointly. We also learn discriminative deep appearance representations of targets using an identification network (IdNet) trained on large-scale person re-identification data sets. We formulate how to combine (fuse) spatio-temporal and visual similarities, obtained from the bounding boxes of objects and their CNN appearance features, respectively, to construct a cost to be minimized (similarity maximized) by the Hungarian algorithm to label each target. After this association step, an additional unassigned tracks prediction step is used to overcome the miss-detection susceptibility of the GM-PHD filter caused by occlusion. Furthermore, we use the deeply learned CNN appearance representations as a person re-identification method to re-identify lost objects and label them consistently. To the best of our knowledge, this approach has not been adopted before.

The main contributions of this paper are as follows:

  1. We apply the GM-PHD filter with the deeply learned CNN features for tracking multiple targets in video sequences acquired under varying environmental conditions and target density.

  2. We formulate how to integrate spatio-temporal and visual similarities obtained from bounding boxes of objects and their CNN appearance features.

  3. We use additional unassigned tracks predictions after the association step to overcome the miss-detection susceptibility of the GM-PHD filter.

  4. We use the deeply learned CNN appearance representations as a person re-identification method to re-identify lost objects for consistently labelling them.

  5. We make extensive evaluations on Multiple Object Tracking 2016 (MOT16) and 2017 (MOT17) benchmark data sets using the public detections provided in the benchmark’s test sets.

We presented a preliminary idea of this work in [5]. In this work, we describe our algorithm in more detail. In addition, we change from a joint-input Siamese network (StackNet) to an identification network (IdNet) to learn the deep appearance representations of targets on large-scale person re-identification data sets, as this IdNet allows us to extract features from an object only once in each frame of the tracking process, which speeds up the tracker significantly. We also include an additional add-on prediction step for predicting unassigned tracks after the association step to handle miss-detections caused by occlusion.

The paper is organized as follows. We discuss the related work in section II. In section III, our proposed algorithm is explained in detail, including all of its components, and section IV provides some important parameter values in the GM-PHD filter implementation. The experimental results are analyzed and compared in section V. The main conclusions and suggestions for future work are summarized in section VI.

II Related Work

Numerous multi-target tracking algorithms have been introduced in the literature [32][54][16]. Traditionally, multi-target trackers have been developed by finding associations between targets and observations mainly using JPDAF [8] and MHT [8, 21]. However, these approaches have faced challenges not only in the uncertainty caused by data association but also in algorithmic complexity that increases exponentially with the number of targets and measurements.

Recently, a unified framework which directly extends single- to multiple-target tracking by representing multi-target states and observations as RFSs was developed by Mahler [33]. It not only addresses the problem of increasing complexity, but also estimates the states and cardinality of an unknown and time-varying number of targets in the scene by allowing for target birth, death, clutter (false alarms) and missing detections. It propagates the first-order moment of the multi-target posterior, called the intensity or the PHD [55], rather than the full multi-target posterior. This approach is flexible; for instance, it has been used to find the detection proposal with the maximum weight as the target position estimate for tracking a target of interest in dense environments, removing the other detection proposals as clutter [1][4]. Furthermore, the standard PHD filter was extended to develop a novel N-type PHD filter (N ≥ 2) for tracking multiple targets of different types in the same scene [3][2]. However, this approach does not naturally include target identity in the framework because of the indistinguishability assumption of the point process; an additional mechanism is necessary for labelling each target. Recently, a labeled RFS for multi-target tracking was introduced in [53][52][23]; however, its computational complexity is high. In general, the RFS-based filters are susceptible to miss-detections even though they are robust to clutter.

The two common implementation schemes of the PHD filter are the Gaussian mixture (GM-PHD filter [55]) and Sequential Monte Carlo (SMC-PHD or particle PHD filter [56]). Though the PHD filter is the most widely adopted RFS-based filter in computer vision due to its computational efficiency (it has a linear complexity with the number of targets and observations), it is weak at handling miss-detections. This is because the PHD filter was designed originally for radar tracking applications where the number of miss-detections is very low, as opposed to visual tracking applications where a significant number of miss-detections occurs due to occlusion. In this work, we address not only the miss-detection problem but also the labelling of targets in each frame for real-time visual tracking applications.

Incorporating deep appearance information into multi-target tracking algorithms improves the tracking performance, as demonstrated in works such as [21, 22, 23, 18]. A multi-output regularized least squares (MORLS) framework has been used to learn appearance models online, which are integrated into a tree-based track-oriented MHT (TO-MHT) in [21]. The same authors trained a bilinear long short-term memory (LSTM) on both motion and appearance and incorporated it into the MHT for gating in [22]. These trackers are, however, computationally demanding and operate offline. Appearance models of objects are also learned in the same fashion as in [21] and integrated into a generalized labeled multi-Bernoulli (GLMB) filter in [23]. Deep discriminative correlation filters have also been learned and integrated into the PHD filter in [18]. Though the latter two trackers work online, they are too computationally demanding for time-critical real-time applications.

The two well-known CNN structures are verification and identification models [63]. In general, the Siamese network, a kind of verification network (similarity metric), is the most widely used network for developing multi-target tracking methods [26][49][5][60]. As discussed in [26], the Siamese topology has three types: combined at the cost function, in-network, and StackNet. The StackNet, which has been used in offline tracking [26][49] and online tracking [5][60] methods, outperforms the other types of Siamese topologies. The StackNet can also be referred to as a joint-input network [60]. This network takes two image patches concatenated along the channel dimension and infers their similarity. The last fully-connected layer of the StackNet models a 2-way classification problem (same and different identities), i.e. given a pair of images, the StackNet produces the probability of the pair being the same or a different identity by a forward pass. This means that in multi-target tracking applications, all pairs of tracks and detections (estimates in our case) need to be formed and given as input to the StackNet to get their probability of similarity in each video frame. This leads to high complexity, as demonstrated in [26][49][5][60], which limits the trackers' applications for real-time scenarios. We observed this in our preliminary work [5]; thus, we replace the StackNet with an identification network (IdNet), compensating for the performance difference by training the IdNet on large-scale person re-identification data sets, since the StackNet generally outperforms the IdNet [49]. Using this IdNet, appearance features are extracted once in each video frame from the detections (output estimates from the GM-PHD filter in our case) and are copied to the assigned tracks after the data association step. This speeds up the online tracker very significantly compared to using the StackNet. In addition to learning the discriminative deep appearance representations to solve both tracks-to-estimates associations and lost-track re-identifications, we also include an additional add-on unassigned tracks prediction after the association step to overcome the miss-detection problem of the PHD filter due to occlusion. To date, no work has incorporated these two components not only to improve the multi-target tracking performance but also to speed it up to the level of real-time, as is the case in our work.

III The Proposed Algorithm

The block diagram of our proposed multi-target tracking algorithm is given in Fig. 1. Our proposed online tracker consists of four components: 1) target state estimation using the GM-PHD filter, 2) tracks-to-estimates association using the Hungarian algorithm, 3) add-on unassigned tracks prediction to alleviate miss-detections, and 4) lost tracks re-identification for track re-initialization. All four components are explained in detail below.

Fig. 1: Block diagram of the proposed multi-target visual tracking pipeline using the GM-PHD filter, visual-spatio-temporal information for tracks-to-estimates association, lost tracks re-identification using deep visual similarity and an additional add-on unassigned tracks prediction.

III-A The GM-PHD Filter

The Gaussian mixture implementation of the standard PHD (GM-PHD) filter [55] is a closed-form solution of the PHD filter that assumes a linear Gaussian system. It has two steps: prediction and update. Before stating these two steps, certain assumptions are needed: 1) each target follows a linear Gaussian model:

$f_{k|k-1}(\mathbf{x} \mid \boldsymbol{\zeta}) = \mathcal{N}(\mathbf{x}; F_{k-1}\boldsymbol{\zeta}, Q_{k-1})$   (1)
$g_k(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\mathbf{z}; H_k \mathbf{x}, R_k)$   (2)

where $f_{k|k-1}(\mathbf{x} \mid \boldsymbol{\zeta})$ is the single-target state transition probability density at time $k$ given the previous state $\boldsymbol{\zeta}$, and $g_k(\mathbf{z} \mid \mathbf{x})$ is the single-target likelihood function which defines the probability that $\mathbf{z}$ is generated (observed) conditioned on state $\mathbf{x}$. $\mathcal{N}(\cdot; \mathbf{m}, P)$ denotes a Gaussian density with mean $\mathbf{m}$ and covariance $P$; $F_{k-1}$ and $H_k$ are the state transition and measurement matrices, respectively. $Q_{k-1}$ and $R_k$ are the covariance matrices of the process and the measurement noises, respectively. The measurement noise covariance $R_k$ can be measured off-line from sample measurements, i.e. from the ground truth and detections of training data [58], as it indicates detection performance. 2) A current measurement-driven birth intensity, inspired by but not identical to [43], is introduced at each time step, removing the need for prior knowledge (specification of birth intensities) or a random model, with a non-informative zero initial velocity. The intensity of the spontaneous birth RFS is a Gaussian mixture of the form

$\gamma_k(\mathbf{x}) = \sum_{i=1}^{J_{\gamma,k}} w_{\gamma,k}^{(i)} \mathcal{N}(\mathbf{x}; \mathbf{m}_{\gamma,k}^{(i)}, P_{\gamma,k}^{(i)})$   (3)

where $J_{\gamma,k}$ is the number of birth Gaussian components, $w_{\gamma,k}^{(i)}$ is the weight accompanying Gaussian component $i$, $\mathbf{m}_{\gamma,k}^{(i)}$ is the current measurement (with zero initial velocity) used as the mean, and $P_{\gamma,k}^{(i)}$ is the birth covariance for Gaussian component $i$.

3) The survival and detection probabilities are independent of the target state: $p_{S,k}(\mathbf{x}) = p_S$ and $p_{D,k}(\mathbf{x}) = p_D$.

Adaptive birth: We use an adaptive, measurement-driven approach for the birth of targets. Each detection $\mathbf{z}_k^{(i)}$ is associated with a detection confidence score $s_k^{(i)}$. We use the more confident (strong) detections, based on their score, for the birth of targets as they are more likely to represent a potential target. Confident detections used for the birth of targets are those with $s_k^{(i)} > s_{th}$, where $s_{th}$ is a detection score threshold. In fact, $s_{th}$ governs the relationship between the number of false positives (clutter) and miss-detections (false negatives): increasing the value of $s_{th}$ gives more miss-detections and fewer false positives, and vice versa. The initial birth weight $w_{\gamma,k}^{(i)}$ in Eq. (3) is also weighted by $s_k^{(i)}$ to give a higher probability to more confident detections for the birth of targets, i.e. $w_{\gamma,k}^{(i)} s_k^{(i)}$. However, all measurements are used for the update step.
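A minimal sketch of this measurement-driven birth step is given below, assuming detections arrive as (box, score) pairs. The state layout, score threshold, base weight and covariance values here are illustrative placeholders, not the tuned settings of Table I and Eq. (15).

```python
import numpy as np

def birth_components(detections, scores, s_th=0.4, w_birth=1e-4, P_birth=None):
    """Measurement-driven birth: spawn one Gaussian component per confident detection.

    detections: (N, 4) array of [cx, cy, w, h] boxes; scores: (N,) detection confidences.
    s_th, w_birth and P_birth are placeholders standing in for the paper's tuned values.
    """
    if P_birth is None:
        # Assumed diagonal birth covariance over the state [cx, cy, vx, vy, w, h].
        P_birth = np.diag([25.0, 25.0, 100.0, 100.0, 25.0, 25.0])
    births = []
    for z, s in zip(detections, scores):
        if s < s_th:                                      # only strong detections spawn targets
            continue
        m = np.array([z[0], z[1], 0.0, 0.0, z[2], z[3]])  # zero initial velocity
        births.append({'w': w_birth * s,                  # birth weight scaled by the score
                       'm': m, 'P': P_birth.copy()})
    return births
```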

Prediction: It is assumed that the posterior intensity at time $k-1$ is a Gaussian mixture of the form

$v_{k-1}(\mathbf{x}) = \sum_{i=1}^{J_{k-1}} w_{k-1}^{(i)} \mathcal{N}(\mathbf{x}; \mathbf{m}_{k-1}^{(i)}, P_{k-1}^{(i)})$   (4)

where $J_{k-1}$ is the number of Gaussian components of $v_{k-1}(\mathbf{x})$, which equals the number of Gaussian components after pruning and merging at the previous iteration. Under these assumptions, the predicted intensity at time $k$ is given by

$v_{k|k-1}(\mathbf{x}) = v_{S,k|k-1}(\mathbf{x}) + \gamma_k(\mathbf{x})$   (5)

where

$v_{S,k|k-1}(\mathbf{x}) = p_S \sum_{j=1}^{J_{k-1}} w_{k-1}^{(j)} \mathcal{N}(\mathbf{x}; \mathbf{m}_{S,k|k-1}^{(j)}, P_{S,k|k-1}^{(j)})$,
$\mathbf{m}_{S,k|k-1}^{(j)} = F_{k-1} \mathbf{m}_{k-1}^{(j)}$,
$P_{S,k|k-1}^{(j)} = Q_{k-1} + F_{k-1} P_{k-1}^{(j)} F_{k-1}^{T}$,

and $\gamma_k(\mathbf{x})$ is given by Eq. (3).

Since $v_{S,k|k-1}(\mathbf{x})$ and $\gamma_k(\mathbf{x})$ are Gaussian mixtures, $v_{k|k-1}(\mathbf{x})$ can be expressed as a Gaussian mixture of the form

$v_{k|k-1}(\mathbf{x}) = \sum_{i=1}^{J_{k|k-1}} w_{k|k-1}^{(i)} \mathcal{N}(\mathbf{x}; \mathbf{m}_{k|k-1}^{(i)}, P_{k|k-1}^{(i)})$   (6)

where $w_{k|k-1}^{(i)}$ is the weight accompanying the predicted Gaussian component $i$, and $J_{k|k-1}$ is the number of predicted Gaussian components, which equals the number of born targets plus the number of persistent (surviving) components. The number of persistent components is the number of Gaussian components after pruning and merging at the previous iteration.
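For illustration, a sketch of this Gaussian mixture prediction is given below, reusing the component dictionaries of the birth sketch above; F and Q are the model matrices of section IV, and the survival probability default here is purely illustrative.

```python
def gm_phd_predict(components, births, F, Q, p_S=0.99):
    """GM-PHD prediction (Eqs. (5)-(6)): propagate surviving components, append births."""
    predicted = []
    for c in components:                       # persistent (surviving) targets
        predicted.append({'w': p_S * c['w'],
                          'm': F @ c['m'],
                          'P': F @ c['P'] @ F.T + Q})
    predicted.extend(births)                   # measurement-driven birth components, Eq. (3)
    return predicted
```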

Update: The posterior intensity (updated PHD) at time $k$ is also a Gaussian mixture and is given by

$v_k(\mathbf{x}) = (1 - p_D)\, v_{k|k-1}(\mathbf{x}) + \sum_{\mathbf{z} \in Z_k} v_{D,k}(\mathbf{x}; \mathbf{z})$   (7)

where

$v_{D,k}(\mathbf{x}; \mathbf{z}) = \sum_{j=1}^{J_{k|k-1}} w_{k}^{(j)}(\mathbf{z})\, \mathcal{N}(\mathbf{x}; \mathbf{m}_{k|k}^{(j)}(\mathbf{z}), P_{k|k}^{(j)})$,
$w_{k}^{(j)}(\mathbf{z}) = \dfrac{p_D\, w_{k|k-1}^{(j)}\, q_{k}^{(j)}(\mathbf{z})}{c_k(\mathbf{z}) + p_D \sum_{l=1}^{J_{k|k-1}} w_{k|k-1}^{(l)} q_{k}^{(l)}(\mathbf{z})}$,
$q_{k}^{(j)}(\mathbf{z}) = \mathcal{N}(\mathbf{z}; H_k \mathbf{m}_{k|k-1}^{(j)}, R_k + H_k P_{k|k-1}^{(j)} H_k^{T})$,
$\mathbf{m}_{k|k}^{(j)}(\mathbf{z}) = \mathbf{m}_{k|k-1}^{(j)} + K_k^{(j)} (\mathbf{z} - H_k \mathbf{m}_{k|k-1}^{(j)})$,
$P_{k|k}^{(j)} = [I - K_k^{(j)} H_k] P_{k|k-1}^{(j)}$,
$K_k^{(j)} = P_{k|k-1}^{(j)} H_k^{T} (H_k P_{k|k-1}^{(j)} H_k^{T} + R_k)^{-1}$.

The clutter intensity due to the scene, $c_k(\mathbf{z})$, in Eq. (7) is given by

$c_k(\mathbf{z}) = \lambda_c A\, u(\mathbf{z})$   (8)

where $u(\mathbf{z})$ is the uniform density over the surveillance region $A$, and $\lambda_c$ is the average number of clutter returns per unit volume, i.e. the average number of clutter returns per frame is $\lambda_c A$.
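A sketch of the update step of Eq. (7) under these assumptions is given below, using the same component format as the earlier sketches; the detection probability and clutter intensity defaults are illustrative only, and the clutter argument plays the role of $c_k(\mathbf{z})$ in Eq. (8).

```python
import numpy as np
from scipy.stats import multivariate_normal

def gm_phd_update(predicted, measurements, H, R, p_D=0.95, clutter=1e-6):
    """GM-PHD update (Eq. (7)): miss-detection terms plus one term per measurement."""
    # Kalman quantities per predicted component.
    etas, Ss, Ks, Pus = [], [], [], []
    for c in predicted:
        eta = H @ c['m']                                    # predicted measurement
        S = R + H @ c['P'] @ H.T                            # innovation covariance
        K = c['P'] @ H.T @ np.linalg.inv(S)                 # Kalman gain
        etas.append(eta); Ss.append(S); Ks.append(K)
        Pus.append((np.eye(len(c['m'])) - K @ H) @ c['P'])
    # Miss-detection hypothesis for every component, weighted by (1 - p_D).
    updated = [{'w': (1.0 - p_D) * c['w'], 'm': c['m'], 'P': c['P']} for c in predicted]
    # One weighted, Kalman-corrected copy of every component per measurement.
    for z in measurements:
        terms = []
        for c, eta, S, K, P in zip(predicted, etas, Ss, Ks, Pus):
            q = multivariate_normal.pdf(z, mean=eta, cov=S)          # likelihood q_k(z)
            terms.append({'w': p_D * c['w'] * q,
                          'm': c['m'] + K @ (z - eta), 'P': P})
        norm = clutter + sum(t['w'] for t in terms)                  # denominator in Eq. (7)
        for t in terms:
            t['w'] /= norm
        updated.extend(terms)
    return updated
```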

After the update, weak Gaussian components whose weight falls below a pruning threshold are pruned, and Gaussian components that are closer to each other than a merging threshold in Mahalanobis distance are merged. These pruned and merged Gaussian components are predicted as existing (persistent) targets in the next iteration. Finally, the means of the Gaussian components of the pruned and merged intensity whose weights are greater than a threshold of 0.5 are selected as multi-target state estimates (we use the pruned and merged intensity rather than the posterior intensity as it gives better results).
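A sketch of this pruning, merging and state-extraction step is shown below; the pruning weight, merging distance and component cap are hypothetical defaults standing in for the tuned values of Table I.

```python
import numpy as np

def prune_merge_extract(components, w_prune=1e-5, merge_th=4.0, max_comp=100):
    """Prune weak components, merge nearby ones (Mahalanobis), extract state estimates."""
    comps = [c for c in components if c['w'] > w_prune]
    merged = []
    while comps:
        pivot = max(comps, key=lambda c: c['w'])             # strongest remaining component
        group, rest = [], []
        for c in comps:
            d = c['m'] - pivot['m']
            d2 = d @ np.linalg.inv(pivot['P']) @ d           # squared Mahalanobis distance
            (group if np.sqrt(d2) <= merge_th else rest).append(c)
        w = sum(c['w'] for c in group)
        m = sum(c['w'] * c['m'] for c in group) / w          # weight-averaged mean
        P = sum(c['w'] * (c['P'] + np.outer(m - c['m'], m - c['m'])) for c in group) / w
        merged.append({'w': w, 'm': m, 'P': P})
        comps = rest
    merged = sorted(merged, key=lambda c: c['w'], reverse=True)[:max_comp]
    estimates = [c['m'] for c in merged if c['w'] > 0.5]     # weight > 0.5 taken as targets
    return merged, estimates
```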

III-B Data Association

The GM-PHD filter distinguishes between true and false targets; however, it does not distinguish between two different targets, so an additional step is necessary to identify different targets between consecutive frames. We use both the spatio-temporal and visual similarities between the track boxes and the estimated object states (filtered output boxes) in frames $k-1$ and $k$, respectively, to label each object across frames.

III-B1 Spatio-temporal Information

The spatio-temporal information is computed using track boxes and filtered output boxes in consecutive frames. Consider the $i$-th track's box at frame $k-1$ and the $j$-th estimate's (GM-PHD filter's filtered output) box at frame $k$. Their spatio-temporal similarity is calculated using the Euclidean distance between their centers. We use the Euclidean distance rather than the Jaccard distance (1 - Intersection-over-Union) as it gives a slightly better result. The spatio-temporal (motion or distance) relation has been commonly used, in different forms, in many multi-object tracking works [46][50][61]. The normalized Euclidean distance between the centers of the two bounding boxes is given by

$d_{i,j} = \sqrt{\left(\dfrac{x_{k-1}^{i} - x_{k}^{j}}{W}\right)^{2} + \left(\dfrac{y_{k-1}^{i} - y_{k}^{j}}{H}\right)^{2}}$   (9)

where $(x_{k-1}^{i}, y_{k-1}^{i})$ and $(x_{k}^{j}, y_{k}^{j})$ are the center locations of the corresponding bounding boxes at frames $k-1$ and $k$, respectively. $W$ and $H$ are the width and height of a video frame, which are used for the Euclidean distance normalization.
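A small sketch of this normalized center distance, with boxes given as [cx, cy, w, h]:

```python
import numpy as np

def center_distance(track_box, est_box, frame_w, frame_h):
    """Euclidean distance between box centers, normalized by the frame size (Eq. (9))."""
    dx = (track_box[0] - est_box[0]) / frame_w
    dy = (track_box[1] - est_box[1]) / frame_h
    return np.sqrt(dx ** 2 + dy ** 2)
```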

III-B2 Deep Appearance Representation Learning

Visual cues are very crucial for associating tracks with detections (in our case, current filtered outputs or estimated states) for robust online multi-object tracking. In this work, we propose an identification CNN network (IdNet) for computing visual affinities between image patches cropped at bounding box locations. We treat this task as a multi-class recognition problem to learn a discriminative CNN embedding. We adopt ResNet50 [19] as the network structure, replacing the topmost (fully connected) layer to output a confidence for each of the person identities in the training data set (changing from 1000 classes to 6654 classes in our case). The rest of the ResNet50 architecture remains the same except for adding a dropout layer with a rate of 0.75 after the last pooling layer to reduce possible over-fitting. We use a transfer learning approach, i.e. the network is pre-trained on the ImageNet data set [44] consisting of 1000 classes rather than trained from scratch.

Data preparation: To learn discriminative deep appearance representations, we collected our training data set from numerous sources. First, we utilize publicly available person re-identification data sets including the Market1501 data set [62] (736 identities out of 751, as we restrict the number of images per identity to at least 4), the CUHK03 data set [29] (1367 identities), the LPW data set [47] (1974 identities), and the MSMT data set [57] (1141 identities). In addition to these person re-identification data sets, we also collected training data from publicly available tracking data sets such as the MOT15 [27] and MOT16/17 [36] training sets (MOT16 and MOT17 have the same training data set, though MOT17 is claimed to have more accurate ground truth and is the one used in our experiment). From the tracking training sequences of MOT15 (TUD-Stadtmitte, TUD-Campus, PETS09-S2L1, ETH-Bahnhof and ETH-Sunnyday) and MOT16/17 (5 sequences), we produce about 521 person identities. We also produce about 213 identities from the TownCentre data set [9]. This helps the network adapt better to the MOT benchmark test sequences and lets it learn the inter-frame variations. In total, we collected about 6,654 person identities from all these data sets to train our IdNet. 10% of this training set is used for validation (per person identity, provided the number of images for that identity is greater than 9; otherwise no validation samples are held out for that class). We resize all the training images to a fixed size and then subtract the mean image, which is computed from all the training images. During training, we randomly crop all the images to the network input size and then mirror them horizontally. We use a random order of images by reshuffling the data set.

Training: We train the IdNet using a cross-entropy loss over the softmax outputs and mini-batch Stochastic Gradient Descent (SGD) with momentum. The mini-batch size is set to 20. We trained our model on an NVIDIA GeForce GTX 1050 GPU for 200 epochs, after which it generally converges, using MatConvNet [51]. We decrease the learning rate in three stages: one value for the first 75 epochs, a lower value for the next 75 epochs and the lowest value for the last 50 epochs (1 epoch is one sweep over all the training data). In addition, we augment the training samples by random horizontal flipping as well as by randomly shifting the cropping positions by a small fraction of the detection box width and height in the x and y dimensions, respectively, to increase variation and thus reduce possible over-fitting.
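A rough PyTorch equivalent of this setup is sketched below; the paper itself uses MatConvNet, and the learning-rate and momentum values here are placeholders since the exact schedule values are not reproduced in this text. The class count and dropout rate follow the description above.

```python
import torch
import torch.nn as nn
import torchvision

# ResNet50 pre-trained on ImageNet, topmost layer replaced by a 6654-way identity
# classifier with dropout 0.75 after the last pooling layer, as described above.
model = torchvision.models.resnet50(pretrained=True)
model.fc = nn.Sequential(nn.Dropout(p=0.75),
                         nn.Linear(model.fc.in_features, 6654))

criterion = nn.CrossEntropyLoss()   # softmax cross-entropy over person identities
# lr and momentum below are placeholders, not the paper's reported values.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

def set_lr(epoch):
    """Three-stage schedule over 75 + 75 + 50 epochs; the stage values are placeholders."""
    lr = 1e-2 if epoch < 75 else (1e-3 if epoch < 150 else 1e-4)
    for group in optimizer.param_groups:
        group['lr'] = lr
```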

Testing: We use 2 video sequences of the MOT16/17 training data set (02 and 09) for testing our trained IdNet. For this testing set, we produce about 66 person identities. We randomly sample about 800 positive pairs (same identities) and 3200 negative pairs (different identities) from the ground truth of the MOT16/17-02 and MOT16/17-09 training sequences. We use this larger ratio of negative pairs to mimic the positive/negative distribution during tracking. We use verification accuracy as an evaluation metric. Given a pair of images, we compute the cosine distance (using Eq. (10)) between their extracted deep appearance feature vectors utilizing our learned model. If the computed cosine distance of a positive pair is greater than or equal to 0.75, it is counted as a correctly classified pair. Similarly, if the computed cosine distance of a negative pair is less than 0.75, it is counted as a correctly classified pair. Accordingly, the IdNet trained on large-scale data sets (6,654 identities) gives about 97.5% accuracy.

III-B3 Tracks-to-Estimates Association

Here we use visual-spatio-temporal information, a fusion of both visual and spatio-temporal information, to associate tracks to the estimated (filtered output) boxes.

The visual similarity between the $i$-th track's box and the $j$-th estimate's (filtered output) box at frame $k$ is computed using the cosine distance between the appearance feature vectors $\mathbf{f}_{k-1}^{i}$ and $\mathbf{f}_{k}^{j}$, which are extracted from the track's box and the filtered output box, respectively. Thus, this visual similarity (cosine distance) is given, using the dot product and magnitude (norm) of the appearance feature vectors, as

$s_{i,j} = \dfrac{\mathbf{f}_{k-1}^{i} \cdot \mathbf{f}_{k}^{j}}{\|\mathbf{f}_{k-1}^{i}\|\,\|\mathbf{f}_{k}^{j}\|}$   (10)

When computing the cosine distance between track features and estimated state features, we use the mean of the track's features weighted by the confidences of the detections corresponding to this track up to frame $k-1$.

We use Munkres's variant of the Hungarian algorithm [13] to determine the optimal associations in case an estimate (filtered output) box could be associated with multiple tracks, using the following overall association cost

$C = \lambda\, C_v + (1 - \lambda)\, C_d$   (11)

where $C_v = \mathbf{1} - S_v$ is the visual difference used as a cost, each of its elements being $1 - s_{i,j}$ with $s_{i,j}$ the cosine similarity of Eq. (10); $C_d$ is the matrix of normalized Euclidean distances, each element being $d_{i,j}$ from Eq. (9); and $\lambda$ is the weight balancing the two costs. $C$, $C_v$ and $C_d$ are $n_T \times n_E$ matrices, where $n_T$ and $n_E$ are the number of tracks and estimates (filtered outputs) at time $k$; $\mathbf{1}$ is a matrix of ones of the same dimension as $S_v$. The spatio-temporal relation gives useful information for tracks-to-estimates association of targets that are in close proximity; however, its importance starts to decrease as targets become (temporally) far apart. In contrast, the visual similarity obtained from the CNN allows long-range association as it is robust to large temporal and spatial distances. This combination of spatio-temporal and visual information helps to resolve target ambiguities which may occur due to either target motion or visual content, and also allows long-range association of targets.

The outputs of the Hungarian algorithm are assigned tracks-to-estimates, unassigned tracks and unassigned estimates, as shown in Fig. 1. A tracks-to-estimates association is confirmed if its cost is lower than a cost threshold. The associated estimate (filtered output) boxes are appended to their corresponding tracks to generate longer ones up to time $k$. The unassigned tracks are predicted using the add-on prediction step or killed accordingly, as discussed in section III-C. The unassigned estimates either create new tracks or are re-identified from the lost (dead) tracks to re-initialize those tracks, as discussed in section III-D.
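A sketch of this association step, combining the cosine and distance costs of Eqs. (9)-(11) and gating the Hungarian assignments with a cost threshold, is given below. The balancing weight default follows the value reported in section IV, while the cost threshold is illustrative only.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_feats, est_feats, track_boxes, est_boxes,
              frame_w, frame_h, lam=0.85, cost_th=0.7):
    """Tracks-to-estimates association using the combined cost of Eq. (11)."""
    n_t, n_e = len(track_feats), len(est_feats)
    C = np.zeros((n_t, n_e))
    for i in range(n_t):
        for j in range(n_e):
            f_t, f_e = track_feats[i], est_feats[j]
            cos_sim = f_t @ f_e / (np.linalg.norm(f_t) * np.linalg.norm(f_e))  # Eq. (10)
            dx = (track_boxes[i][0] - est_boxes[j][0]) / frame_w
            dy = (track_boxes[i][1] - est_boxes[j][1]) / frame_h
            dist = np.sqrt(dx ** 2 + dy ** 2)                                  # Eq. (9)
            C[i, j] = lam * (1.0 - cos_sim) + (1.0 - lam) * dist               # Eq. (11)
    rows, cols = linear_sum_assignment(C)          # Munkres / Hungarian algorithm
    matches = [(i, j) for i, j in zip(rows, cols) if C[i, j] < cost_th]
    un_tracks = [i for i in range(n_t) if i not in {m[0] for m in matches}]
    un_ests = [j for j in range(n_e) if j not in {m[1] for m in matches}]
    return matches, un_tracks, un_ests
```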

III-C Unassigned Tracks Prediction

We keep the state transition matrix, process noise covariance and updated covariance matrices from the update step of the GM-PHD filter for the unassigned tracks obtained after the Hungarian algorithm-based tracks-to-estimates association step. We therefore predict each unassigned track using its state transition matrix, while also propagating its covariance matrix, for a limited number of predictions (frames) as follows (Eq. (12)).

$\mathbf{m}_{k}^{u} = F \mathbf{m}_{k-1}^{u}, \qquad P_{k}^{u} = Q + F P_{k-1}^{u} F^{T}$   (12)

where $\mathbf{m}_{k-1}^{u}$ and $P_{k-1}^{u}$ are the updated unassigned track $u$'s state (location) and covariance matrix at frame $k-1$, respectively.

The effect of this additional add-on prediction step versus the number of predictions is analyzed in Table II. We kill a track if the number of performed predictions exceeds the number-of-predictions threshold. A killed track can still be considered for re-identification in the coming frames. In our experiment, we allow 3 predictions as this gives the best Multiple Object Tracking Accuracy (MOTA) value, as shown in Table II and Fig. 2. A detailed investigation of this add-on prediction on the performance of our online tracker is given in the experimental results, section V-B.
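A sketch of this add-on prediction for a single unassigned track is shown below, assuming each track keeps its own state, covariance and a counter of consecutive missed frames; the maximum of 3 predictions follows the ablation in Table II.

```python
def predict_unassigned(track, F, Q, max_predictions=3):
    """Propagate an unassigned track with its kept motion model (Eq. (12)), or kill it."""
    if track['misses'] >= max_predictions:
        track['dead'] = True          # killed; still eligible for later re-identification
        return track
    track['m'] = F @ track['m']                 # state prediction
    track['P'] = F @ track['P'] @ F.T + Q       # covariance prediction
    track['misses'] += 1
    return track
```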

III-D Re-identification for Tracking

Person re-identification in the context of multi-target tracking is very challenging due to occlusions (inter-object and background), cluttered background and inaccurate bounding box localization. Inter-object occlusions are very challenging in video sequences containing dense targets, hence, object detectors may miss some targets in some consecutive frames. Re-identification of lost targets due to miss-detections is crucial to keep track of identity of each target.

The tracks-to-estimates association using the Hungarian algorithm given in section III-B can also yield unassigned estimates. If a past track is not associated to any estimated box at frame $k$, the tracked target might be occluded or temporarily missed by the object detector. If an estimated object box is not associated to any track, it is used to initialize a new track, unless it can be matched, using visual similarity for re-identification, to a track that was lost (killed) in recent frames. We use a visual similarity threshold for the re-identification of targets, i.e. re-identification occurs if the visual similarity (cosine distance) is greater than this threshold. If multiple dead tracks are matched to the unassigned estimate, the one with the maximum similarity score is confirmed. Re-identification using the visual similarity, along with combining the visual similarity with the spatio-temporal information to construct the cost for labelling targets, has increased the performance of our online tracker, as shown in Table II and Fig. 3. An independent analysis of each component is also given in the experimental results, section V-B.
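A sketch of this re-identification check for one unassigned estimate against the pool of dead tracks is given below; the similarity threshold is a placeholder for the tuned value in Table I.

```python
import numpy as np

def reidentify(est_feat, dead_tracks, sim_th=0.6):
    """Return the dead track best matching the unassigned estimate, or None to start a new track."""
    best_track, best_sim = None, sim_th
    for track in dead_tracks:
        f = track['feat']                        # confidence-weighted mean track feature
        sim = est_feat @ f / (np.linalg.norm(est_feat) * np.linalg.norm(f))   # Eq. (10)
        if sim > best_sim:                       # keep the maximum-similarity match
            best_track, best_sim = track, sim
    return best_track
```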

IV Parameter Values in the GM-PHD Filter Implementation

Our state vector includes the centroid positions, velocities, and the width and height of the bounding box, i.e. $\mathbf{x}_k = [x_k, y_k, \dot{x}_k, \dot{y}_k, w_k, h_k]^T$. Similarly, the measurement is the noisy version of the target area in the image plane, approximated by a $w_k \times h_k$ rectangle centered at $(x_k, y_k)$, i.e. $\mathbf{z}_k = [x_k, y_k, w_k, h_k]^T$.

We set the survival probability $p_S$ to the constant value given in Table I, and we assume the linear Gaussian dynamic model of Eq. (1) with matrices taking into account the box width and height at the given scale,

$F_{k-1} = \begin{bmatrix} I_2 & \Delta I_2 & 0_2 \\ 0_2 & I_2 & 0_2 \\ 0_2 & 0_2 & I_2 \end{bmatrix}, \qquad Q_{k-1} = \sigma_{\nu}^{2} \begin{bmatrix} \frac{\Delta^4}{4} I_2 & \frac{\Delta^3}{2} I_2 & 0_2 \\ \frac{\Delta^3}{2} I_2 & \Delta^2 I_2 & 0_2 \\ 0_2 & 0_2 & \Delta^2 I_2 \end{bmatrix}$   (13)

where $F_{k-1}$ and $Q_{k-1}$ denote the state transition matrix and process noise covariance, respectively; $I_n$ and $0_n$ denote the $n \times n$ identity and zero matrices, respectively; $\Delta = 1$ second is the sampling period defined by the time between frames; and $\sigma_{\nu}$ (in pixels) is the standard deviation of the process noise.

Similarly, the measurement follows the observation model of Eq. (2) with matrices taking into account the box width and height,

$H_k = \begin{bmatrix} I_2 & 0_2 & 0_2 \\ 0_2 & 0_2 & I_2 \end{bmatrix}, \qquad R_k = \sigma_{r}^{2} I_4$   (14)

where $H_k$ and $R_k$ denote the observation matrix and the observation noise covariance, respectively, and $\sigma_r$ (in pixels) is the measurement standard deviation. The probability of detection $p_D$ is assumed to be constant across the state space and through time and is set to the value given in Table I. The false positives are independently and identically distributed (i.i.d.), and the number of false positives per frame is Poisson-distributed with the mean given in Table I (the false alarm rate $\lambda_c$ is obtained by dividing this mean by the frame resolution $A$; refer to Eq. (8)).
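A sketch constructing these model matrices for the six-dimensional state [cx, cy, vx, vy, w, h] is given below; the block structure follows Eqs. (13)-(14), while the noise standard deviations are illustrative placeholders rather than the tuned values in Table I.

```python
import numpy as np

def motion_and_observation_models(dt=1.0, sigma_v=5.0, sigma_r=6.0):
    """Constant-velocity transition (F, Q) and position/size observation (H, R) models."""
    I2, Z2 = np.eye(2), np.zeros((2, 2))
    F = np.block([[I2, dt * I2, Z2],
                  [Z2, I2,      Z2],
                  [Z2, Z2,      I2]])
    Q = sigma_v ** 2 * np.block([[dt ** 4 / 4 * I2, dt ** 3 / 2 * I2, Z2],
                                 [dt ** 3 / 2 * I2, dt ** 2 * I2,     Z2],
                                 [Z2,               Z2,               dt ** 2 * I2]])
    H = np.block([[I2, Z2, Z2],
                  [Z2, Z2, I2]])                 # observe [cx, cy, w, h]
    R = sigma_r ** 2 * np.eye(4)
    return F, Q, H, R
```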

Nothing is known about the appearing targets before the first observation. The distribution after the observation is determined by the current measurement (with zero initial velocity) used as the mean of the Gaussian distribution, together with a predetermined initial covariance, given in Eq. (15), for the birth of targets.

(15)

The birth weight $w_{\gamma,k}^{(i)}$ (the weight that any potential observation represents an appearing target) in Eq. (3), the detection score threshold $s_{th}$, and whether to use the detection score $s_k^{(i)}$ along with (multiplied by) the birth weight depend on the application, as they govern the relationship between false positives and miss-detections, i.e. they are hyper-parameters that require tuning. For instance, we evaluated several settings of the birth weight and of $s_{th}$, with and without using $s_k^{(i)}$ along with the birth weight. For some of these settings, using $s_k^{(i)}$ along with the birth weight gives a better MOTA value than not using it. Reducing the birth weight to 0.0001 reduces the MOTA value slightly; however, it greatly decreases false positives at the expense of increased miss-detections. In our experiment, we find that the values given in Table I, without using $s_k^{(i)}$ along with the birth weight, give a better MOTA value at the expense of increased false positives. The influence of these parameters also partly depends on the values of the other hyper-parameters, and thus all of them need to be tuned well for the application at hand. Furthermore, after evaluating different values of $\lambda$ in Eq. (11), we set it to 0.85 as this gives a better result. The implementation parameters and their values are summarized in Table I.

Parameters
Values 0.85 0.0 0.1 6 pixels 5 pixels 0.95 0.99 10 4 pixels 1:k-1 frames 3 0.4 0.6
TABLE I: Implementation values of the parameters used in our evaluations on both MOT16 and MOT17 Benchmark data sets [36]; both for ablation study (Table II) and comparison with other trackers (Table III and Table IV).

V Experimental Results

In this section, we discuss the experimental settings, the ablation study on the MOT benchmark training set and the evaluations on the MOT benchmark test sets in detail.

V-A Experimental Settings

The experimental settings for the proposed online tracker, such as the tracking data sets, evaluation metrics and implementation details, are presented as follows.

Tracking Datasets: We make extensive evaluations of our proposed online tracker using both the MOT16 and MOT17 benchmark data sets [36], which are captured in unconstrained environments using both static and moving cameras. These data sets consist of 7 training sequences, on which we perform the ablation study given in section V-B (Table II), and 7 testing sequences, on which we evaluate and compare our proposed online tracker with other trackers as shown in Table III and Table IV. We use the public detections provided by the MOT benchmark with a non-maximum suppression (NMS) threshold of 0.3 for the DPM detector [17] (for both MOT16 and MOT17) and 0.5 for the FRCNN [40] and SDP [59] detectors (for MOT17).

Evaluation Metrics: We use numerous evaluation metrics which are presented as follows:

  • Multiple Object Tracking Accuracy (MOTA): A summary of overall tracking accuracy in terms of false positives, false negatives and identity switches, which gives a measure of the tracker’s performance at detecting objects as well as keeping track of their trajectories.

  • Multiple Object Tracking Precision [20] (MOTP): A summary of overall tracking precision in terms of bounding box overlap between ground-truth and tracked location, which shows the ability of the tracker to estimate precise object positions.

  • Identification F1 (IDF1) score [42]: The quantitative measure obtained by dividing the number of correctly identified detections by the mean of the number of ground truth and detections.

  • Mostly Tracked targets (MT): Percentage of mostly tracked targets (a target is tracked for at least 80% of its life span regardless of maintaining its identity) to the total number of ground truth trajectories.

  • Mostly Lost targets (ML) [30]: Percentage of mostly lost targets (a target is tracked for less than 20% of its life span) to the total number of ground truth trajectories.

  • False Positives (FP): Number of false detections.

  • False Alarms per Frame (FAF): This can also be referred to as false positives per image (FPPI) which measures false positive ratio.

  • False Negatives (FN): Number of miss-detections.

  • Identity Switches (IDSw): Number of times the given identity of a ground-truth track changes.

  • Fragmented trajectories (Frag): Number of times a track is interrupted (compared to ground truth trajectory) due to miss-detection.

True positives are detections which have at least 50% overlap with their corresponding ground truth bounding boxes. For more detailed description of each metric, please refer to [36].
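For reference, a sketch of the 50% overlap rule used to declare a detection a true positive, with boxes given as [x, y, w, h] (top-left corner, width, height):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two [x, y, w, h] boxes."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    yb = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(det_box, gt_box):
    """A detection counts as a true positive if it overlaps its ground truth by >= 50%."""
    return iou(det_box, gt_box) >= 0.5
```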

Implementation Details: Our proposed tracking algorithm is implemented in Matlab on an i7 2.80 GHz core processor with 8 GB RAM. We use MatConvNet [51] for CNN feature extraction, with the forward propagation computation transferred to an NVIDIA GeForce GTX 1050 GPU, and our tracker runs at about 20.4 frames per second (fps). The forward propagation for the feature extraction step is the main computational load of our tracking algorithm, especially for constructing the visual part of the cost in Eq. (11). However, it is significantly faster than our preliminary work in [5] (3.5 fps), since appearance features are extracted once from the estimates in each frame and then copied to the associated tracks, rather than concatenating each track and estimate patch along the channel dimension and extracting features from all tracks and estimates in every frame.

V-B Ablation Study on MOT16 Benchmark Training Set

We investigate the contributions of the different components of our proposed online tracker, GMPHD-ReId, on the MOT16 [36] benchmark training set using public detections. These components are motion information (Mot), appearance information (App), re-identification (ReId) and add-on unassigned tracks predictions (AddOnPr). First, we evaluate using only the motion information (Mot), as shown in Table II. Second, we include appearance information (App) and re-identification (ReId) in addition to the motion information to see the effect of the learned discriminative deep appearance representations on the tracking performance. Third, we include the additional add-on prediction (AddOnPr) on top of the motion information, appearance information and re-identification, varying the number of predictions as shown in Table II using numerous tracking evaluation metrics. The graphical plot of the MOTA values in Table II versus the number of predictions is shown in Fig. 2.

Accordingly, using only the motion information provides a MOTA value of 31.1 and an IDF1 of 20.6, as shown in Table II. Including the deeply learned appearance information for data association and re-identification increases the MOTA and IDF1 to 33.8 and 44.2, respectively. We also investigate the influence of the additional add-on prediction (AddOnPr) step by varying the number of predictions from 0 (no AddOnPr) to 12. The maximum MOTA value is obtained at 3 predictions, as shown in Table II and Fig. 2. Thus, including the additional add-on prediction with 3 predictions in our proposed online tracker increases the MOTA and IDF1 from 33.8 to 35.8 and from 44.2 to 46.5, respectively. This means an increase of 5.92% and 5.20% for MOTA and IDF1, respectively, is obtained using a very simple additional add-on unassigned tracks prediction. Thus, each component of our proposed online tracker contributes to increasing the tracking performance.

Type MOTA IDF1 MOTP FAF MT (%) ML (%) FP FN IDS Frag
Mot Only 31.1 20.6 77.7 0.30 4.00 26.70 1609 71396 3016 2794
Mot + App + ReId + 0 AddOnPr 33.8 44.2 77.6 0.34 4.10 26.20 1820 70756 509 2553
Mot + App + ReId + 2 AddOnPr 35.7 46.0 77.0 0.70 5.80 24.20 3704 66839 442 1396
Mot + App + ReId + 3 AddOnPr 35.8 46.5 76.8 0.87 6.00 23.60 4604 65785 447 1274
Mot + App + ReId + 4 AddOnPr 35.6 46.9 76.6 1.06 6.70 23.30 5614 64975 461 1207
Mot + App + ReId + 5 AddOnPr 35.4 47.1 76.5 1.23 6.90 22.70 6550 64322 457 1167
Mot + App + ReId + 7 AddOnPr 34.3 47.3 76.3 1.64 7.30 22.40 8731 63346 471 1128
Mot + App + ReId + 10 AddOnPr 32.5 47.1 76.0 2.21 7.60 22.20 11767 62248 494 1078
Mot + App + ReId + 12 AddOnpr 31.2 46.5 75.9 2.59 7.50 22.10 13757 61675 508 1061
TABLE II: Tracking performance evaluation results on the MOT16 [36] benchmark training set using public detections in terms of the different GMPHD-ReId components: motion information (Mot), appearance information (App), re-identification (ReId) and add-on unassigned tracks predictions (AddOnPr). Evaluation measures marked (↑) show that higher is better, and those marked (↓) denote that lower is better. In this experiment, using motion, appearance, ReId and add-on estimate predictions for 3 consecutive frames gives the best MOTA value.
Fig. 2: MOTA values of the proposed GMPHD-ReId tracker when the number of predictions is varied on the MOT16 [36] benchmark training set. The maximum MOTA value is obtained at 3 predictions.
Fig. 3: Sample results on 3 frames of MOT16-01 data set for our proposed online tracker with motion information only (top row for frames 11, 75 and 125 from left to right) and with appearance, re-identification and add-on prediction (bottom row for frames 11, 75 and 125 from left to right). Bounding boxes represent the tracking results with their color-coded identities; small numbers are also shown on top of each bounding box for better clarity.

V-C Evaluations on MOT Benchmark Test Sets

After validating our proposed tracker, GMPHD-ReId, on the MOT16 benchmark training set with the add-on prediction of 3 frames in section V-B, we compare it against state-of-the-art online and offline tracking methods, as shown in Table III and Table IV. Quantitative evaluations of our proposed method against other trackers on the MOT16 benchmark data set are given in Table III. The table shows that our algorithm outperforms both the online and offline trackers listed in many of the evaluation metrics. When compared to the online trackers, our proposed online tracker outperforms all the others in MOTA, IDF1, MT, ML and FN. The number of identity switches (IDSw) is also significantly lower than for many of the online trackers. Our proposed online tracker outperforms not only many of the online trackers but also several offline trackers in terms of several evaluation metrics. In terms of IDF1 and ML, our proposed online tracker performs better than all of the trackers, both online and offline, listed in the table. Our online tracker also runs faster than many of both the online and offline trackers, at about 20.4 fps.

Our online tracker also gives promising results on MOT17 benchmark data set as is quantitatively shown in Table IV. It outperforms all other online trackers in the table in all MOTA, IDF1, MT, ML and FN measures. The number of IDSw and Frag is also significantly lower than many of the online trackers in the table. In addition to the online trackers, our proposed online tracker outperforms many of the offline trackers listed in the table. Our proposed online tracker outperforms all the trackers in the table, both online and offline, in terms of ML.

Most important to notice here is the comparison of our algorithm to GM-PHD-HDA [48] (GMPHD-SHA for MOT17). Both trackers use the GM-PHD filter but with different approaches for labelling targets from frame to frame. While our tracker uses the Hungarian algorithm for labelling targets by postprocessing the output of the filter using a combination of spatio-temporal and visual similarities, along with visual similarity for re-identification, GM-PHD-HDA uses the approach in [38] at the prediction stage and also includes appearance features for re-identification to label targets. In addition to the GM-PHD-HDA tracker, our proposed tracker outperforms the other GM-PHD filter-based trackers such as GMPHD-KCF [25], GM-PHD [15], GM-PHD-N1T (GMPHD-N1Tr) [2] and GM-PHD-DAL (GMPHD-DAL) [5] (our preliminary work), as shown in Tables III and IV, in almost all of the evaluation metrics.

The qualitative comparison of our proposed tracker (GMPHD-ReId) and our tracker without appearance information and additional unassigned tracks prediction is given in Fig. 3 for frames 11, 75 and 125. Due to detection failures, some target labels are not consistent for our tracker without appearance information and additional unassigned tracks prediction (top row); for instance, labels 2 and 3 in frame 11 are changed to labels 9 and 36, respectively, in frame 75. Similarly, labels 31, 35 and 35 in frame 75 are changed to labels 46, 44 and 42, respectively, in frame 125. However, the labels of the targets are consistent when using the GMPHD-ReId tracker (bottom row). The effect of the additional unassigned tracks prediction is also clearly visible. For instance, a person with label 5 in frame 11 is missed in frames 75 and 125 when using our tracker without appearance information and additional unassigned tracks prediction (top row); however, this same person, with label 1 in frame 11, is tracked in both frames 75 and 125 when using our proposed online tracker which combines all the components together (bottom row): motion information, deep appearance information for both data association and re-identification, and the additional add-on unassigned tracks prediction.

In our evaluations, the association cost constructed using only the visual similarity from the CNN gives a better result than using only the spatio-temporal relation; however, their combination using Eq. (11) gives a better result than either of them alone. Furthermore, the weighted summation of the costs according to Eq. (11) gives a slightly better result than the Hadamard product (element-wise multiplication) of the two costs.

Sample qualitative tracking results are shown as examples in Fig. 4 using SDP detector and MOT17 test sequences. The tracking results are represented by bounding boxes with their color-coded identities. On the top row, MOT17-01-SDP and MOT17-03-SDP are shown from left to right. In the first and second middle rows are MOT17-06-SDP and MOT17-07-SDP, and MOT17-08-SDP and MOT17-12-SDP, respectively. Finally, MOT17-14-SDP is shown on the bottom row.

Tracker Tracking Mode MOTA MOTP IDF1 MT (%) ML (%) FP FN IDSw Frag Hz
MHT-DAM [21] offline 45.8 76.3 46.1 16.2 43.2 6,412 91,758 590 781 0.8
MHT-bLSTM6 [22] offline 42.1 75.9 47.8 14.9 44.4 11,637 93,172 753 1,156 1.8
CEM [37] offline 33.2 75.8 N/A 7.8 54.4 6,837 114,322 642 731 0.3
DP-NMS [39] offline 32.2 76.4 31.2 5.4 62.1 1,123 121,579 972 944 5.9
SMOT [14] offline 29.7 75.2 N/A 5.3 47.7 17,426 107,552 3,108 4,483 0.2
JPDF-m [41] offline 26.2 76.3 N/A 4.1 67.5 3,689 130,549 365 638 22.2
GM-PHD-HDA [48] online 30.5 75.4 33.4 4.6 59.7 5,169 120,970 539 731 13.6
GM-PHD-N1T [2] online 33.3 76.8 25.5 5.5 56.0 1,750 116,452 3,499 3,594 9.9
HISP-T [4] online 35.9 76.1 28.9 7.8 50.1 6,406 107,905 2,592 2,299 4.8
OVBT [7] online 38.4 75.4 37.8 7.5 47.3 11,517 99,463 1,321 2,140 0.3
EAMTT-pub [46] online 38.8 75.1 42.4 7.9 49.1 8,114 102,452 965 1,657 11.8
JCmin-MOT [12] online 36.7 75.9 36.2 7.5 54.4 2,936 111,890 667 831 14.8
HISP-DAL [6] online 37.4 76.3 30.5 7.6 50.9 3,222 108,865 2,101 2,151 3.3
GM-PHD-DAL [5] online 35.1 76.6 26.6 7.0 51.4 2,350 111,886 4,047 5,338 3.5
GMPHD-ReId (ours) online 40.3 75.2 48.3 11.6 43.1 7,147 100,895 815 2,446 20.4
TABLE III: Tracking performance of representative trackers developed using both online and offline methods. All trackers are evaluated on the test data set of the MOT16 [36] benchmark using public detections. The first and second highest values are highlighted in red and blue, respectively (for both online and offline trackers). Evaluation measures marked (↑) show that higher is better, and those marked (↓) denote that lower is better. N/A denotes not available.
Tracker Tracking Mode MOTA MOTP IDF1 MT (%) ML (%) FP FN IDSw Frag Hz
MHT-DAM [21] offline 50.7 77.5 47.2 20.8 36.9 22,875 252,889 2,314 2,865 0.9
MHT-bLSTM [22] offline 47.5 77.5 51.9 18.2 41.7 25,981 268,042 2,069 3,124 1.9
IOU17 [11] offline 45.5 76.9 39.4 15.7 40.5 19,993 281,643 5,988 7,404 1,522.9
SAS-MOT17 [35] offline 44.2 76.4 57.2 16.1 44.3 29,473 283,611 1,529 2,644 4.8
DP-NMS [39] offline 43.7 76.9 N/A 12.6 46.5 10,048 302,728 4,942 5,342 137.7
EAMTT [46] online 42.6 76.0 41.8 12.7 42.7 30,711 288,474 4,488 5,720 12.0
FPSN [28] online 44.9 76.6 48.4 16.5 35.8 33,757 269,952 7,136 14,491 10.1
GMPHD-KCF [25] online 40.3 75.4 36.6 8.6 43.1 47,056 283,923 5,734 7,576 3.3
GM-PHD [15] online 36.2 76.1 33.9 4.2 56.6 23,682 328,526 8,025 11,972 38.4
OTCD-1 [31] online 44.9 77.4 42.3 14.0 44.2 16,280 291,136 3,573 5,444 5.5
GMPHD-N1Tr [2] online 42.1 77.7 33.9 11.9 42.7 18,214 297,646 10,698 10,864 9.9
SORT17 [10] online 43.1 77.8 39.8 12.5 42.3 28,398 287,582 4,852 7,127 143.3
GMPHD-SHA [48] online 43.7 76.5 39.2 11.7 43.0 25,935 287,758 3,838 5,056 9.2
HISP-DAL17 [6] online 45.4 77.3 39.9 14.8 39.2 21,820 277,473 8,727 7,147 3.2
GMPHD-DAL [5] online 44.4 77.4 36.2 14.9 39.4 19,170 283,380 11,137 13,900 3.4
GMPHD-ReId (ours) online 46.3 76.5 50.8 19.0 33.8 37,249 260,871 4,742 8,636 20.1
TABLE IV: Tracking performance of representative trackers developed using both online and offline methods. All trackers are evaluated on the test data set of the MOT17 benchmark using public detections. The first and second highest values are highlighted in red and blue, respectively (for both online and offline trackers). Evaluation measures marked (↑) show that higher is better, and those marked (↓) denote that lower is better. N/A denotes not available.
Fig. 4: Sample results on several sequences of MOT17 data sets using SDP detector; bounding boxes represent the tracking results with their color-coded identities. From left to right: MOT17-01-SDP and MOT17-03-SDP (top row), MOT17-06-SDP and MOT17-07-SDP (the 1st middle row), MOT17-08-SDP and MOT17-12-SDP (the 2nd middle row), and MOT17-14-SDP (bottom row). The videos of tracking results are available on the MOT Challenge website https://motchallenge.net/.

VI Conclusions

We have developed a novel multi-target visual tracker based on the GM-PHD filter and deep CNN appearance representation learning. We apply this method for tracking multiple targets in video sequences acquired under varying environmental conditions and target density. We followed a tracking-by-detection approach using the public detections provided in the MOT16 and MOT17 benchmark data sets. We integrate the spatio-temporal similarity from the object bounding boxes and the appearance information from the learned deep CNN (i.e. both motion and appearance cues) to label each target in consecutive frames. We learn the deep CNN appearance representations by training an identification network (IdNet) on large-scale person re-identification data sets. We also employ an additional unassigned tracks prediction step after the GM-PHD filter update step to overcome the susceptibility of the GM-PHD filter to miss-detections caused by occlusion. Results show that our method outperforms many state-of-the-art trackers, developed using both online and offline approaches, on the MOT16 and MOT17 benchmark data sets in terms of tracking accuracy and identification. In future work, we will include an inter-object relation model for tackling the interactions between different objects.

References

  • [1] N. L. Baisa, D. Bhowmik, and A. Wallace (2018) Long-term correlation tracking using multi-layer hybrid features in sparse and dense environments. Journal of Visual Communication and Image Representation 55, pp. 464 – 476. External Links: ISSN 1047-3203, Document, Link Cited by: §II.
  • [2] N. L. Baisa and A. Wallace (2019) Development of a N-type GM-PHD filter for multiple target, multiple type visual tracking. Journal of Visual Communication and Image Representation 59, pp. 257 – 271. External Links: ISSN 1047-3203, Document, Link Cited by: §II, §V-C, TABLE III, TABLE IV.
  • [3] N. L. Baisa and A. Wallace (2019) Multiple target, multiple type filtering in the RFS framework. Digital Signal Processing 89, pp. 49 – 59. External Links: ISSN 1051-2004, Document, Link Cited by: §II.
  • [4] N. L. Baisa (2018-06) Single to multiple target, multiple type visual tracking. Ph.D. Thesis, Heriot-Watt University. Cited by: §II, TABLE III.
  • [5] N. L. Baisa (2019-07) Online multi-object visual tracking using a GM-PHD filter with deep appearance learning. In 2019 22nd International Conference on Information Fusion (FUSION), Vol. . Cited by: §I, §II, §V-A, §V-C, TABLE III, TABLE IV.
  • [6] N. L. Baisa (2019) Robust online multi-target visual tracking using a HISP filter with discriminative deep appearance learning. External Links: 1908.03945 Cited by: TABLE III, TABLE IV.
  • [7] Y. Ban, S. Ba, X. Alameda-Pineda, and R. Horaud (2016) Tracking multiple persons based on a variational bayesian model. In Computer Vision – ECCV 2016 Workshops, G. Hua and H. Jégou (Eds.), Cham, pp. 52–67. External Links: ISBN 978-3-319-48881-3 Cited by: TABLE III.
  • [8] Y. Bar-Shalom, P.K. Willett, and X. Tian (2011) Tracking and data fusion: a handbook of algorithms. YBS Publishing. External Links: ISBN 9780964831278, Link Cited by: §I, §II.
  • [9] B. Benfold and I. Reid (2011-06) Stable multi-target tracking in real-time surveillance video. In CVPR, pp. 3457–3464. Cited by: §III-B2.
  • [10] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft (2016-Sep.) Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP), Vol. , pp. 3464–3468. External Links: Document, ISSN 2381-8549 Cited by: TABLE IV.
  • [11] E. Bochinski, V. Eiselein, and T. Sikora (2017-08) High-speed tracking-by-detection without using image information. In 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Vol. , pp. 1–6. External Links: Document, ISSN Cited by: TABLE IV.
  • [12] A. Boragule and M. Jeon (2017-08) Joint cost minimization for multi-object tracking. In 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Vol. , pp. 1–6. External Links: Document, ISSN Cited by: TABLE III.
  • [13] F. Bourgeois and J. Lassalle (1971-12) An extension of the munkres algorithm for the assignment problem to rectangular matrices. Commun. ACM 14 (12), pp. 802–804. Cited by: §I, §III-B3.
• [14] C. Dicle, O. I. Camps, and M. Sznaier (2013) The way they move: tracking multiple targets with similar appearance. In 2013 IEEE International Conference on Computer Vision, pp. 2304–2311.
• [15] V. Eiselein, D. Arp, M. Pätzold, and T. Sikora (2012) Real-time multi-human tracking using a probability hypothesis density filter and multiple detectors. In 2012 IEEE Ninth International Conference on Advanced Video and Signal-Based Surveillance, pp. 325–330.
• [16] P. Emami, P. M. Pardalos, L. Elefteriadou, and S. Ranka (2018) Machine learning methods for solving assignment problems in multi-target tracking. arXiv:1802.06897.
• [17] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan (2010) Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (9), pp. 1627–1645.
• [18] Z. Fu, F. Angelini, S. M. Naqvi, and J. A. Chambers (2018) GM-PHD filter based online multiple human tracking using deep discriminative correlation matching. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4299–4303.
• [19] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. CoRR abs/1512.03385.
• [20] R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, R. Bowers, M. Boonstra, V. Korzhova, and J. Zhang (2009) Framework for performance evaluation of face, text, and vehicle detection and tracking in video: data, metrics, and protocol. IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2), pp. 319–336.
• [21] C. Kim, F. Li, A. Ciptadi, and J. M. Rehg (2015) Multiple hypothesis tracking revisited. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 4696–4704.
• [22] C. Kim, F. Li, and J. M. Rehg (2018) Multi-object tracking with neural gating using bilinear LSTM. In The European Conference on Computer Vision (ECCV).
• [23] D. Y. Kim (2017) Online multi-object tracking via labeled random finite set with appearance learning. In 2017 International Conference on Control, Automation and Information Sciences (ICCAIS), pp. 181–186.
• [24] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1097–1105.
• [25] T. Kutschbach, E. Bochinski, V. Eiselein, and T. Sikora (2017) Sequential sensor fusion combining probability hypothesis density and kernelized correlation filters for multi-object tracking in video data. In 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–5.
• [26] L. Leal-Taixé, C. Canton-Ferrer, and K. Schindler (2016) Learning by tracking: Siamese CNN for robust target association. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), DeepVision: Deep Learning for Computer Vision.
• [27] L. Leal-Taixé, A. Milan, I. D. Reid, S. Roth, and K. Schindler (2015) MOTChallenge 2015: towards a benchmark for multi-target tracking. CoRR abs/1504.01942.
• [28] S. Lee and E. Kim (2019) Multiple object tracking via feature pyramid Siamese networks. IEEE Access 7, pp. 8181–8194.
• [29] W. Li, R. Zhao, T. Xiao, and X. Wang (2014) DeepReID: deep filter pairing neural network for person re-identification. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 152–159.
• [30] Y. Li, C. Huang, and R. Nevatia (2009) Learning to associate: HybridBoosted multi-target tracker for crowded scene. In CVPR.
• [31] Q. Liu, B. Liu, Y. Wu, W. Li, and N. Yu (2019) Real-time online multi-object tracking in compressed domain. IEEE Access 7, pp. 76489–76499.
• [32] W. Luo, X. Zhao, and T. Kim (2014) Multiple object tracking: a literature review. CoRR abs/1409.7618.
• [33] R. P. Mahler (2003) Multitarget Bayes filtering via first-order multitarget moments. IEEE Transactions on Aerospace and Electronic Systems 39 (4), pp. 1152–1178.
• [34] R. P. Mahler (2014) Advances in statistical multisource-multitarget information fusion. Artech House, Norwood.
• [35] A. Maksai and P. Fua (2019) Eliminating exposure bias and metric mismatch in multiple object tracking. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
• [36] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler (2016) MOT16: a benchmark for multi-object tracking. arXiv:1603.00831 [cs].
• [37] A. Milan, S. Roth, and K. Schindler (2014) Continuous energy minimization for multitarget tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (1), pp. 58–72.
• [38] K. Panta, D. E. Clark, and B. Vo (2009) Data association and track management for the Gaussian mixture probability hypothesis density filter. IEEE Transactions on Aerospace and Electronic Systems 45 (3), pp. 1003–1016.
• [39] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes (2011) Globally-optimal greedy algorithms for tracking a variable number of objects. In CVPR 2011, pp. 1201–1208.
• [40] S. Ren, K. He, R. B. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. CoRR abs/1506.01497.
• [41] S. H. Rezatofighi, A. Milan, Z. Zhang, Q. Shi, A. Dick, and I. Reid (2015) Joint probabilistic data association revisited. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 3047–3055.
• [42] E. Ristani, F. Solera, R. S. Zou, R. Cucchiara, and C. Tomasi (2016) Performance measures and a data set for multi-target, multi-camera tracking. In Computer Vision – ECCV 2016 Workshops, pp. 17–35.
• [43] B. Ristic, D. E. Clark, B. Vo, and B. Vo (2012) Adaptive target birth intensity for PHD and CPHD filters. IEEE Transactions on Aerospace and Electronic Systems 48 (2), pp. 1656–1668.
• [44] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
• [45] A. Sadeghian, A. Alahi, and S. Savarese (2017) Tracking the untrackable: learning to track multiple cues with long-term dependencies. CoRR abs/1701.01909.
• [46] R. Sanchez-Matilla, F. Poiesi, and A. Cavallaro (2016) Online multi-target tracking with strong and weak detections. In Computer Vision – ECCV 2016 Workshops, Part II, pp. 84–99.
• [47] G. Song, B. Leng, Y. Liu, C. Hetang, and S. Cai (2017) Region-based quality estimation network for large-scale person re-identification. CoRR abs/1711.08766.
• [48] Y. Song and M. Jeon (2016) Online multiple object tracking with the hierarchically adopted GM-PHD filter using motion and appearance. In IEEE/IEIE International Conference on Consumer Electronics (ICCE) Asia.
• [49] S. Tang, M. Andriluka, B. Andres, and B. Schiele (2017) Multiple people tracking by lifted multicut and person re-identification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3701–3710.
• [50] S. Tang, M. Andriluka, B. Andres, and B. Schiele (2017) Multiple people tracking by lifted multicut and person re-identification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3701–3710.
• [51] A. Vedaldi and K. Lenc (2015) MatConvNet – convolutional neural networks for MATLAB. In Proceedings of the 23rd ACM International Conference on Multimedia.
• [52] B. Vo, B. Vo, and H. G. Hoang (2017) An efficient implementation of the generalized labeled multi-Bernoulli filter. IEEE Transactions on Signal Processing 65 (8), pp. 1975–1987.
• [53] B. Vo, B. Vo, and D. Phung (2014) Labeled random finite sets and the Bayes multi-target tracking filter. IEEE Transactions on Signal Processing 62 (24), pp. 6554–6567.
• [54] B.-N. Vo, M. Mallick, Y. Bar-Shalom, S. Coraluppi, R. Osborne III, R. Mahler, and B.-T. Vo (2015) Multitarget tracking.
• [55] B. Vo and W. Ma (2006) The Gaussian mixture probability hypothesis density filter. IEEE Transactions on Signal Processing 54 (11), pp. 4091–4104.
• [56] B. Vo, S. Singh, and A. Doucet (2005) Sequential Monte Carlo methods for multitarget filtering with random finite sets. IEEE Transactions on Aerospace and Electronic Systems 41 (4), pp. 1224–1245.
• [57] L. Wei, S. Zhang, W. Gao, and Q. Tian (2017) Person transfer GAN to bridge domain gap for person re-identification. CoRR abs/1711.08565.
• [58] G. Welch and G. Bishop (2006) An introduction to the Kalman filter. Technical report, University of North Carolina at Chapel Hill.
• [59] F. Yang, W. Choi, and Y. Lin (2016) Exploit all the layers: fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2129–2137.
• [60] Y. Yoon, D. Y. Kim, K. Yoon, Y. Song, and M. Jeon (2019) Online multiple pedestrian tracking using deep temporal appearance matching association. arXiv:1907.00831.
• [61] F. Yu, W. Li, Q. Li, Y. Liu, X. Shi, and J. Yan (2016) POI: multiple object tracking with high performance detection and appearance feature. In ECCV Workshops.
• [62] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: a benchmark. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1116–1124.
• [63] Z. Zheng, L. Zheng, and Y. Yang (2016) A discriminatively learned CNN embedding for person re-identification. CoRR abs/1611.05666.
• [64] K. Zhou and T. Xiang (2019) Torchreid: a library for deep learning person re-identification in PyTorch. arXiv:1910.10093.