Online Multi-Object Tracking Framework with the GMPHD Filter and Occlusion Group Management

07/31/2019 ∙ by Young-min Song, et al. ∙ Gwangju Institute of Science and Technology LG Electronics Inc University of Regina 1

In this paper, we propose an efficient online multi-object tracking framework based on the GMPHD filter and an occlusion group management scheme, where the GMPHD filter utilizes hierarchical data association to reduce the false negatives caused by missed detections. The hierarchical data association consists of two steps, detection-to-track and track-to-track association, which can recover lost tracks and their switched IDs. In addition, the proposed framework is equipped with an object grouping management scheme that handles occlusion problems in two main parts. The first part is "track merging", which merges the false positive tracks caused by false positive detections from occlusions, where the false positive tracks are usually occluded as judged by an occlusion measure. The measure is the occlusion ratio between visual objects, the sum-of-intersection-over-area (SIOA), which we define and use instead of the IOU metric. The second part is "occlusion group energy minimization (OGEM)", which prevents occluded true positive tracks from false "track merging". We define each group of occluded objects as an energy function and find the optimal hypothesis that minimizes the energy. We evaluate the proposed tracker on benchmark datasets such as MOT15 and MOT17, which are built for multi-person tracking. An ablation study on the training dataset shows not only that "track merging" and "OGEM" complement each other but also that the proposed tracking method is more robust and less sensitive to parameters than the baseline methods. Also, SIOA works better than IOU for various sizes of false positives. Experimental results show that the proposed tracker efficiently handles occlusion situations and achieves competitive performance compared to state-of-the-art methods. In particular, our method shows the best multi-object tracking accuracy among the online and real-time executable methods.




I Introduction

Multi-object tracking (MOT) has become one of the key techniques for intelligent video surveillance [5, 6] and autonomous vehicle systems [7] in the last decade.

In view of the processing pipelines, many state-of-the-art MOT methods [16, 17, 13, 14, 15, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 29, 28, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50] have exploited the tracking-by-detection paradigm. This trend has become prominent as deep neural network based detectors such as FRCNN [10], SDP [11], and EB [12] have achieved breakthroughs in object classification and detection.

Besides, MOT algorithms are categorized into two approaches: offline and online processing. The main difference between the two is that the offline process can see the whole time sequence at once, whereas the online process can see only the frames from the initial time to the current processing time. In other words, from the system user's perspective, the offline method is suitable for post-processing, whereas the online method is suitable for real-time applications.

Thus, many offline methods [46, 28, 31, 29, 36, 34] take advantage of global optimization models. The methods in [36, 29, 46] exploit graphical models to solve the MOT task. Pirsiavash et al.[36] designed a min-cost flow network in which the nodes and the directed edges, indicating observations and tracklet hypotheses, respectively, form a directed acyclic graph (DAG). The DAG's shortest (min-cost) path can be found with Dijkstra's algorithm. Choi et al.[29] divided the tracking problem into subgraphs and solved each subgraph as conditional random field inference in parallel. Keuper et al.[46] applied a vision-based perspective to their graph optimization model: feature point trajectories and bounding boxes build low-level and high-level graph models, respectively, and the optimal association between the two levels is then found. Rezatofighi et al.[31] and Kim et al.[28] considered all possible hypotheses for data association. Because this involves exponentially increasing complexity with a tree structure, [31] assumed $m$-best solutions and [28] pruned invalid hypotheses using their own rule. Besides, Milan et al.[34] proposed a sophisticated energy minimization technique considering detection, appearance, dynamic model, mutual exclusion, and target persistence for the MOT task in video. These offline methods generate accurate and refined tracking results but are not suitable for practical real-time applications.

On the other hand, since the online approach cannot apply global optimization models, intensive motion analysis and appearance feature learning have been widely used together with hierarchical data association frameworks and online Bayesian models [23, 21, 25, 14, 15, 37, 42, 18]. Yoon et al.[23] proposed a relative motion analysis between all objects in a frame, and then improved that work in [21] by adding a cost optimization function using context constraints. Bae et al.[25] exploited incremental linear discriminant analysis (LDA) for appearance learning and presented a tracklet confidence based data association framework. In [14], they improved their previous work [25] by using deep neural network (DNN) based appearance learning instead of the incremental LDA. As noted above, DNNs have brought breakthroughs in appearance learning, i.e., object classification and detection, so some online MOT algorithms have focused on how to adopt deep appearance learning into their tracking frameworks. Yoon et al.[15] exploited siamese convolutional neural networks (CNN) to train an appearance model. They train the deep appearance networks selectively, using only the detection responses matched with high confidence against the historical object queues of the most recent frames. Then, they combine the trained networks with a simple Bayesian tracking model based on the Kalman filter. Chen et al.[37] employed a re-identification (Re-ID) model [52] in their tracking framework. They measure the similarity between a detection and a track by calculating the distance between their Re-ID feature vectors, and then associate the detection-track pairs that minimize the sum of the distances. Both approaches [15, 37] proposed online Bayesian tracking models with conventional DNN models to measure the similarity between visual objects. These online MOT methods achieve excellent tracking accuracy, but their intensive analysis and learning processes take heavy computing resources and time. Even if they merely employ conventional DNN models with state-of-the-art GPU processing, the requirement for substantial computing resources is inevitable and makes it difficult for the trackers to achieve real-time speed.

Recently, the closed-form implementations [2, 3] of probability hypothesis density (PHD) filtering have been employed as an emerging theory in many online MOT methods [41, 42, 43, 18, 22, 38, 16, 17]. That is because Vo et al.[2, 3] provided not only a theoretically optimal approach to online multi-target Bayes filtering but also approximations of the original PHD recursions involving multiple integrals, which alleviate the computational intractability. Moreover, the PHD filter was originally designed for multi-target tracking in radar/sonar systems, which receive numerous false positive observations, i.e., clutter. It is therefore robust to false positive errors but weak at handling false negatives. V. Eiselein et al.[43] combined a feature-based label tree with the Gaussian mixture PHD (GMPHD) filter, using visual features to help the GMPHD filter work sensibly in video data systems. Song et al.[16] extended GMPHD filter based tracking with a two-stage hierarchical data association strategy and used simple motion estimation and appearance matching to recover lost tracks. T. Kutschbach et al.[42] combined the GMPHD filter with kernelized correlation filters (KCF) [53] for online appearance update to overcome occlusion. Z. Fu et al.[18] incorporated an adaptive gating technique and an online group-structured dictionary (appearance) learning strategy into the GMPHD filter, making it more sophisticated and better suited to video-based MOT. Besides, various other tracking methods [41, 22, 38] utilizing PHD filters have been proposed.

These latest MOT research trends motivate the three main contributions of our work. They also remind us of the requirements for practical MOT applications. Thus, in this paper, we propose an online multi-object tracking framework to resolve the practical tracking problems that arise from occlusion and from the characteristics of video data systems. First, we exploit the GMPHD filter for online MOT. To move the GMPHD filter efficiently from its original domain, we define the tracking problems caused by missed detections in video data systems. To deal with track loss caused by missed detections, we design a hierarchical data association (HDA) strategy based on GMPHD filtering theory. Second, we assume that most tracking problems in video data systems are caused by occlusion. Occlusion between false positive tracks can cause ID switches and false positives, and real occlusion between objects can produce fragmented and missed tracks through missed detections. To handle these tracking problems, we propose a novel occlusion handling technique combined with HDA, built on the GMPHD filter tracking framework. Third, we require that the proposed tracking framework run at real-time speed. That is because visual surveillance systems with higher intelligence must respond to their users more immediately. Immediate responses also help the systems' users and the machines react rapidly to abnormal situations. Finally, we evaluate the proposed method on popular benchmark datasets. Our method shows competitive performance against state-of-the-art methods in terms of “tracking accuracy versus speed”.

Our main contributions are described as follows:

1) To apply the GMPHD filter to video data systems, we extend the conventional GMPHD filter based tracking process with a hierarchical data association (HDA) strategy. We also revise the equations of the GMPHD filter into a new cost function for HDA. HDA consists of detection-to-track association (D2TA) and track-to-track association (T2TA). The cost matrix of each association stage is solved as a linear assignment problem by the Hungarian method, which runs in polynomial time. D2TA and T2TA recover lost tracks while preserving real-time speed.

2) To handle occlusion in video-based tracking systems, we devise “track merging” and “occlusion group energy minimization (OGEM)”, which complement each other. “Track merging” reduces false positive tracks, and “OGEM” recovers from false “track merging” by minimizing the group energy of the occluded objects. “Track merging” runs at the tracking level and therefore differs from detection-level merging such as non-maximum suppression. To measure the overlapping ratio between occluded objects, we devise a new metric named sum-of-intersection-over-area (SIOA). We use the SIOA metric instead of intersection-over-union (IOU), which is an extensively used metric. For “OGEM”, we devise a new energy function to find the optimal state having the minimum energy in a group of occluded objects. “Track merging” and “OGEM” follow D2TA and T2TA, respectively. We refer to the two techniques together as occlusion group management (OGM).

3) Consequently, we propose an online multi-object tracking framework with the GMPHD filter and occlusion group management (GMPHD-OGM). In view of optimization techniques, the first and second contributions locally optimize the tracking process by minimizing the association cost matrix and the occlusion group energy, respectively. We evaluate the proposed tracking framework on the MOT15 [5] and MOT17 [6] benchmarks. The ablation study on the training set shows that our method is more robust than the given baselines. The qualitative and quantitative evaluation results show that GMPHD-OGM efficiently handles the defined tracking problems caused by occlusion. Moreover, the proposed method achieves competitive tracking performance against state-of-the-art online MOT algorithms in terms of the CLEAR-MOT metrics [54].

The related works are described in Section II. In Sections III and IV, we introduce the GMPHD filter based tracking framework with HDA and OGM in detail, respectively. In Section V, our method is evaluated against baseline methods and state-of-the-art methods on the popular benchmarks MOT15 [5] and MOT17 [6]. We conclude this paper with future work in Section VI. Some preliminary results of this work were presented in Song et al.[16, 17].

Fig. 1: Comparison between (a) radar/sonar and (b) video systems in terms of input and output, i.e., observations (detection) and states (tracking). The radar/sonar sensors receive a lot of clutter (false positive errors) but rarely miss objects (false negative errors), whereas the detector in video data tends to receive a little clutter around the objects and misses more objects than the radar/sonar sensors do.

II Related Works

Our proposed tracking framework is influenced by PHD filter based online multi-object tracking and by grouping approaches (topology and relative motion analysis).

The PHD filter [1, 3, 2] was originally designed to deal with radar/sonar data based multi-object tracking (MOT) systems. Mahler et al.[1] proposed recursive Bayes filter equations for the PHD filter, which optimizes the MOT process in radar/sonar systems with random finite sets (RFS) of states and observations. Following this PHD filtering theory, Vo et al.[3] implemented the governing equations as closed-form recursions by using the Gaussian mixture model, naming the result the Gaussian mixture probability hypothesis density (GMPHD) filter. In the original domains, the tracking algorithm should estimate true tracks (states) from a lot of observations, as shown in Figure 1-(a). Whereas the radar/sonar sensors receive massive false positive but rarely missed observations, visual object detectors generate far fewer false positives and more missed observations than the radar/sonar sensors do, as shown in Figure 1-(b). Thus, the GMPHD filter is efficient at dealing with false positive observations, but needs to be extended and improved by additional techniques for MOT in video data systems.

As demand increases for online and real-time trackers in video-based tracking systems, the PHD filter has recently become an emerging tracking model. Song et al.[16] extended GMPHD filter based tracking with a two-stage hierarchical data association strategy to recover fragmented and lost tracks. They defined the affinity in the track-to-track association step by using the tracks' linear motion and color histogram appearance. This approach is an intuitive implementation of the GMPHD filter for handling tracking problems, but it cannot correct false associations already made in the detection-to-track association. T. Kutschbach et al.[42] added kernelized correlation filters (KCF) [53] to the naive GMPHD filtering process for online appearance update to overcome occlusion. They showed robust online appearance learning that re-finds the IDs of lost tracks. However, updating the appearance information of all objects at every frame requires heavy computing resources. R. Sanchez-Matilla et al.[22] proposed a detection confidence based MOT model with the PHD filter. Strong (high confidence) detections initiate and propagate tracks, but weak (low confidence) detections only propagate existing tracks. This strategy works well when the detection results are reliable. However, the tracking performance depends on the detection performance and is especially weak against long-term missed detections. Z. Fu et al.[18] incorporated an adaptive gating technique and an online group-structured dictionary (appearance) learning strategy into the GMPHD filter, giving it a more sophisticated tracking process that is better suited to video-based MOT.

Grouping approaches, e.g., relative motion and topological models, have already been exploited in [23, 21]. The key difference between those methods and ours is that [23, 21] consider the relations between all objects in a scene, whereas we consider topological information only within each group of occluded objects. Grouping only the occluded objects excludes trivial solutions (associations), which focuses the method on solving sub-problems and reduces computing time.

Fig. 2: Flow chart of the proposed online multi-object tracking framework. The red dotted line divides the proposed hierarchical data association into two stages. Each stage and its states and observations are marked in blue and red, i.e., D2TA and T2TA, respectively. The key components of this chart, such as init, D2TA, Merge, T2TA, OGEM, and live and lost tracklets, are also used in Figure 3 and Figure 9.

III Proposed Online Multi-Object Tracking Framework

In this section, we briefly introduce the general tracking process of the Gaussian mixture probability hypothesis density (GMPHD) filter in Subsection III-A. In III-B, we address how to extend the GMPHD filter with the hierarchical data association strategy in video-based online MOT systems.

III-A The GMPHD Filter

The Gaussian mixture model (GMM) of the GMPHD filter includes means, covariances, and weights, which are propagated at every time step through the following steps: Initialization, Prediction, Update, and Pruning. We employ this basic process of the GMPHD filter but revise it to fit the video-based MOT system.


Here, $X_k$ and $N_k$ denote the set of object states and the number of them at time $k$, respectively. A state vector $x_k$ is composed of $(c_x, c_y, v_x, v_y)^T$, where $c_x$, $c_y$, $v_x$, and $v_y$ indicate the x-axis center point of the bounding box, the y-axis center point of the bounding box, the x-axis velocity, and the y-axis velocity, respectively. Likewise, $Z_k$ and $M_k$ denote the set of observations (detection responses) and the number of them at time $k$, respectively. An observation $z_k$ is composed of $(c_x, c_y)^T$, where $c_x$ and $c_y$ indicate the x-axis and the y-axis center of the detection bounding box, respectively. Equations (3) and (4) describe the basic notations of state and observation.


The tracking process of the GMPHD filter is composed of four steps, Initialization, Prediction, Update, and Pruning, as follows.



Initialization:

$$v_0(x) = \sum_{i=1}^{N_0} w_0^i \, \mathcal{N}(x; m_0^i, P_0^i) \tag{5}$$

where the GMM is initialized by the initial observations from the detection responses. Besides, when an observation fails to find an association pair, i.e., a state to update, the observation initializes a new Gaussian model (a new state). The Gaussian probability function $\mathcal{N}$ represents the tracked objects with weight $w$, mean vector $m$, object state vector $x$, and covariance matrix $P$. At this step, we set the initial velocities of the mean vector to zero. Each weight is set to the normalized confidence value of the corresponding detection response.



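Prediction:

$$v_{k-1}(x) = \sum_{i=1}^{N_{k-1}} w_{k-1}^i \, \mathcal{N}(x; m_{k-1}^i, P_{k-1}^i) \tag{6}$$

$$m_{k|k-1}^i = F m_{k-1}^i \tag{7}$$

$$P_{k|k-1}^i = Q + F P_{k-1}^i F^T \tag{8}$$

where we assume in (6) that the GMM representing the objects' states was initialized or active at the previous frame. In (7) and (8), $F$ is the state transition matrix and $Q$ is the process noise covariance matrix. $F$ and $Q$ are constants in our tracker. Then, we can predict the state $m_{k|k-1}^i$ at time $k$ using Kalman filtering. In (7), $m_{k|k-1}^i$ is derived by using the velocity of $m_{k-1}^i$. The covariance $P_{k|k-1}^i$ is also predicted by the Kalman filtering method in (8).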



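Update:

$$v_{k|k}(x) = \sum_{i=1}^{N_{k|k-1}} w_k^i(z) \, \mathcal{N}(x; m_{k|k}^i, P_{k|k}^i) \tag{9}$$

$$q_k^i(z) = \mathcal{N}(z; H m_{k|k-1}^i, R + H P_{k|k-1}^i H^T) \tag{10}$$

$$w_k^i(z) = \frac{w_{k|k-1}^i \, q_k^i(z)}{\sum_{l=1}^{N_{k|k-1}} w_{k|k-1}^l \, q_k^l(z)} \tag{11}$$

$$m_{k|k}^i(z) = m_{k|k-1}^i + K_k^i (z - H m_{k|k-1}^i) \tag{12}$$

$$P_{k|k}^i = [I - K_k^i H] P_{k|k-1}^i \tag{13}$$

$$K_k^i = P_{k|k-1}^i H^T (H P_{k|k-1}^i H^T + R)^{-1} \tag{14}$$

where the goal of the update step is to derive (9). First, we should find an optimal observation $z$ at time $k$ to update a Gaussian model. The optimal $z$ maximizes $q_k^i(z)$ in (10). $R$ denotes the observation noise covariance. $H$ denotes the observation matrix that maps a state vector to an observation vector. Both $R$ and $H$ are constants in our application. From the application perspective, the update step involves data association: the Gaussian state models are updated after the optimal observations for the states have been found through data association. After finding the optimal $z$, the GMM is updated to (9) through (10), (11), and (14), (13), (12).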
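The prediction and update steps for a single Gaussian component can be sketched with a standard constant-velocity Kalman filter. This is a minimal illustration rather than the authors' implementation: the noise values in `Q` and `R`, the one-frame time step in `F`, and the function names are assumptions chosen for the example.

```python
import numpy as np

# State is (cx, cy, vx, vy); observation is (cx, cy).
F = np.array([[1., 0., 1., 0.],   # constant-velocity transition over one frame
              [0., 1., 0., 1.],
              [0., 0., 1., 0.],
              [0., 0., 0., 1.]])
H = np.array([[1., 0., 0., 0.],   # observation matrix keeps the center point
              [0., 1., 0., 0.]])
Q = np.eye(4)                     # assumed process noise covariance
R = np.eye(2)                     # assumed observation noise covariance

def predict(m, P):
    """Kalman prediction of mean and covariance."""
    return F @ m, Q + F @ P @ F.T

def likelihood(z, m_pred, P_pred):
    """Gaussian likelihood of observation z under the predicted component."""
    S = R + H @ P_pred @ H.T
    r = z - H @ m_pred
    norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(S)))
    return float(norm * np.exp(-0.5 * r @ np.linalg.inv(S) @ r))

def update(z, m_pred, P_pred):
    """Kalman update: gain first, then mean and covariance."""
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    m_new = m_pred + K @ (z - H @ m_pred)
    P_new = (np.eye(4) - K @ H) @ P_pred
    return m_new, P_new

m = np.array([10., 20., 2., -1.])
P = np.eye(4)
m_pred, P_pred = predict(m, P)
z = np.array([13., 19.])
m_upd, P_upd = update(z, m_pred, P_pred)
print(m_pred, m_upd)
```

In a full tracker the likelihood would be evaluated for every state-observation pair before the update, since the choice of observation is exactly the data association problem discussed next.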



where the states with weights under a threshold are pruned as in (15). We experimentally set the threshold to 0.1, the value used in Algorithm 1. Then the weights of the surviving states are normalized as shown in (18). The pruning step handles the false positive tracks caused by false positive detections.
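The pruning step can be sketched as follows. The threshold of 0.1 is taken from the comment in Algorithm 1; normalizing the surviving weights so that they sum to one is an assumption of this sketch, as is the `(weight, state)` pair representation.

```python
def prune_and_normalize(components, threshold=0.1):
    """Drop components whose weight is under the threshold, then renormalize.

    components: list of (weight, state) pairs (hypothetical representation).
    """
    survivors = [(w, s) for w, s in components if w >= threshold]
    total = sum(w for w, _ in survivors)
    if total == 0:
        return []
    # Assumed normalization: surviving weights sum to one.
    return [(w / total, s) for w, s in survivors]

pruned = prune_and_normalize([(0.5, "a"), (0.05, "b"), (0.3, "c")])
print(pruned)
```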

The GMPHD filter [3] is specialized in handling false positives, e.g., clutter and noise. However, tracking systems have different problems depending on their domains, as shown in Figure 1, where input and output indicate detection results (observations) and tracking results (states), respectively. As presented in [4] and Figure 1-(a), in radar/sonar systems the sensors receive numerous detection responses with a lot of clutter, but objects are rarely missed. On the other hand, as shown in Figure 1-(b), video data based detectors observe less clutter and miss more objects than the radar/sonar sensors do. The conventional GMPHD filter handles clutter (false positives) effectively, but missed detections (false negatives) cause new tracking problems in video data systems. Thus, we propose a GMPHD filtering based tracker with a hierarchical data association strategy.

III-B Hierarchical Data Association

Video-based tracking systems have inherent problems, as shown in Figure 1-(b). Generally, when objects are not detected, their IDs change frequently and their tracks are fragmented if only detection-to-track association is employed. To prevent these problems caused by missed objects, we take advantage of a hierarchical data association (HDA) strategy, which has been widely used in many online multi-object tracking methods [25, 14, 16, 17, 22]. In this paper, we propose a simple HDA scheme with just two stages: detection-to-track (D2T) and track-to-track (T2T) association. We implement both association methods with the GMPHD filtering process given in III-A. Also, we derive a cost function from (11) of the GMPHD filtering process as follows:


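$$\mathrm{Cost}(x_{k|k-1}^i, z_k^j) = -\ln w_k^i(z_k^j)$$

where $w_k^i(z_k^j)$ indicates the weight value, assuming that observation $z_k^j$ updates state $x_{k|k-1}^i$. We use $-\ln w_k^i(z_k^j)$ as the cost between $x_{k|k-1}^i$ and $z_k^j$. Then, the cost matrix $C$ can be built from every pair between the state set $X_{k|k-1}$ and the observation set $Z_k$ as follows:

$$C = \left[ \mathrm{Cost}(x_{k|k-1}^i, z_k^j) \right], \quad i = 1, \dots, N, \; j = 1, \dots, M$$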


When the cost matrix $C$ is built, the Hungarian algorithm is used to solve it. Then, the optimal pairs between observations and states are found, and consequently each state $x_{k|k-1}$ is updated to $x_k$ in the D2T and T2T associations. In III-B1 and III-B2, we introduce the definitions of observations and states in each association stage, with more detailed usage of the cost function.
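The association step can be illustrated with a small sketch. It assumes that the cost is the negative log of the updated weight, and it uses a brute-force search over permutations as a stand-in for the Hungarian method so that the example needs only the standard library; a real tracker would use a proper assignment solver. All names here are hypothetical.

```python
import math
from itertools import permutations

def cost_matrix(weights):
    """weights[i][j]: weight obtained if observation j updates state i, in (0, 1]."""
    return [[-math.log(w) for w in row] for row in weights]

def min_cost_assignment(C):
    """Brute-force stand-in for the Hungarian method (small matrices only).

    Requires at least as many observations (columns) as states (rows).
    Returns a list of (state_index, observation_index) pairs and the total cost.
    """
    n, m = len(C), len(C[0])
    if n > m:
        raise ValueError("need at least as many observations as states")
    best_cost, best_cols = math.inf, None
    for cols in permutations(range(m), n):
        cost = sum(C[i][j] for i, j in enumerate(cols))
        if cost < best_cost:
            best_cost, best_cols = cost, cols
    return list(enumerate(best_cols)), best_cost

weights = [[0.9, 0.1, 0.2],
           [0.2, 0.7, 0.3]]
pairs, total = min_cost_assignment(cost_matrix(weights))
print(pairs, total)
```

In this example the third observation is left unassigned; per the initialization step, such an observation would give birth to a new state.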

III-B1 Detection-to-Track Association (D2TA, Stage 1)

In D2TA, the observation set $Z_k$ is filled with the detection responses at time $k$. We assume that the state set $X_{k-1}$ already exists from time $k-1$, and then $X_{k|k-1}$ is predicted by Kalman filtering as shown in (6)-(8). Thus, the cost matrix is easily calculated from the sets $X_{k|k-1}$ and $Z_k$.

III-B2 Track-to-Track Association (T2TA, Stage 2)

In T2TA, a simple temporal analysis of tracklets is conducted. A tracklet is a fragment of a track and serves as the unit of calculation. Before T2TA, all tracklets are categorized into two types, according to the success or failure of tracking at the present time, as follows:


where “live” indicates that tracking succeeds at time $k$, and “lost” indicates that tracking fails at time $k$. Then, for T2TA, the observation set is filled with the first (oldest) elements of the “live” tracklets. However, the state set is not filled directly with the last (most recent) elements of the “lost” tracklets; one prediction step is needed, as follows:


In (30), the two velocity terms are the averaged velocities along the x-axis and y-axis, respectively. The velocities are calculated by subtracting the center position of the first object state of the “lost” tracklet from that of its last state, and dividing the result by the frame difference, which is equivalent to the length of the “lost” tracklet. D2TA has an identical time interval of “1” between states and observations in the transition matrix $F$, whereas in T2TA each cost in the matrix has a different time interval (frame difference) between states and observations. This frame difference depends on which state of a “lost” tracklet and which observation of a “live” tracklet are paired. (31) describes the prediction of the state with linear motion analysis. Finally, the cost matrix is filled by using (31) and the oldest element of each “live” tracklet.
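The tracklet categorization and the linear-motion prediction used in T2TA can be sketched as follows. The tracklet representation (a list of `(frame, cx, cy)` tuples) and the function names are assumptions made for illustration, not the authors' data structures.

```python
def is_live(tracklet, k):
    """A tracklet is 'live' if tracking succeeded at the current frame k."""
    return tracklet[-1][0] == k

def average_velocity(lost):
    """Displacement from the first to the last state, divided by the frame difference."""
    f0, x0, y0 = lost[0]
    f1, x1, y1 = lost[-1]
    span = f1 - f0
    if span == 0:
        return 0.0, 0.0
    return (x1 - x0) / span, (y1 - y0) / span

def predict_lost_to(lost, live):
    """Extrapolate the last state of a lost tracklet to the first frame of a live one."""
    vx, vy = average_velocity(lost)
    f_last, x, y = lost[-1]
    gap = live[0][0] - f_last   # frame difference varies per (lost, live) pair
    return x + vx * gap, y + vy * gap

lost = [(1, 0.0, 0.0), (2, 2.0, 1.0), (3, 4.0, 2.0)]
live = [(6, 10.5, 5.0), (7, 12.0, 6.0)]
predicted = predict_lost_to(lost, live)
print(is_live(live, 7), is_live(lost, 7), predicted)
```

The predicted position is then compared with the oldest element of the live tracklet, here `(10.5, 5.0)`, when the T2TA cost matrix is filled.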

The pseudo-code in Algorithm 1 includes the procedures presented in this section. Initialization, Prediction, Cost-minimization, Update, and Pruning in D2TA correspond to each of line 5-10, 13-15, 16-21, 22-24, and 25-27 in Algorithm 1. Tracklet-categorization, Cost-minimization, Update in T2TA correspond to line 36-44, 49-54, and 55-66 in Algorithm 1, respectively.

: the current frame number

: a set of states at time

: a set of observations at time

: threshold for track merging

: the minimum track length for T2TA

: the maximum frame interval for T2TA

: a {key:id,value:tracklet} set of live tracklets

: a {key:id,value:tracklet} set of lost tracklets

1:procedure GMPHD_OGM(,,,,,,,)
2:      ;// the number of states
3:      ;// the number of observations
4:      ; // a set of occlusion groups at time k-1 and k.   
5:      if  or  then
6:            Initialize states with ;
7:            ;
8:            ;
9:            return ;
10:      end if  
11:/* 1. Detection-to-Track Association (D2TA) */   
12:      ; // for cost matrix
13:      ; // for pairing observations’ indices   
14:      /*predict states to be */
15:      for  to  do
16:            ;
17:      end for
18:      /*calculate the GMPHD filter cost matrix */
19:      for  to  do
20:            for  to  do
21:                 ;
22:            end for
23:      end for  
24:      /*find min-cost pairs by the Hungarian method*/
25:      ;   
26:      /*update and birth states*/
27:      /*update with the min-cost observations*/
28:      for  to  do
29:            ;
30:      end for
31:      /*prune with the weight under 0.1*/
32:      for  to  do
33:            ;
34:      end for
35:      for  to  do
36:            if  is not assigned to update any state then
37:                 Initialize newly birth state with ;
38:                  = ;
39:            end if
40:      end for  
41:/* 2. Merge States and Find Occlusion Groups */
42:      ;
43:      ;   
44:      /*manage tracklet pool after D2TA and MERGE*/
45:      for  to  do
46:            if  is active then
47:                 update with ;
48:                 delete ;
49:            else
50:                 update with ;
51:                 delete ;
52:            end if
53:      end for
Algorithm 1 Proposed Online MOT Algorithm
54:/* 3. Track-to-Track Association (T2TA) */   
55:      ;// the number of lost tracklets
56:      ;// the number of live tracklets
57:      ; // for cost matrix
58:      ; // for pairing observations’ indices   
59:      /*calculate the GMPHD filter cost matrix */
60:      for  to  do
61:            for  to  do
62:                 ;
63:            end for
64:      end for  
65:      /*find min-cost pairs by the Hungarian method*/
66:      ;
67:      /*update tracklets and manage tracklet pool after T2TA*/
68:      for  to  do
69:            ;
70:      end for
71:      for  to  do
72:            if  is active then
73:                 update with ;
74:                 delete ;
75:            else
76:                 update with ;
77:                 delete ;
78:            end if
79:      end for  
80:/* 4. Occlusion Group Energy Minimization (OGEM) */
81:      if  and  then
82:            ;   
83:      /*manage tracklet pool after OGEM*/
84:            for  to  do
85:                 if  is active then
86:                       update with ;
87:                       delete ;
88:                 else
89:                       update with ;
90:                       delete ;
91:                 end if
92:            end for
93:      end if
94:      return ;// return final states
95:end procedure

IV Occlusion Group Management Scheme

Fig. 3: In the proposed tracking framework, an object with state is transited within the defined states by the state-transition functions .

In Section III, we described how the proposed online multi-object tracking framework is based on GMPHD filtering theory with a two-stage hierarchical data association. However, the tracking results from that framework still contain uncertainty, even though we effectively extend the conventional GMPHD filter to suit video-based tracking systems. To handle this, we define two types of tracking problems and provide a solution. One is intrinsic occlusion and the other is extrinsic occlusion. Intrinsic occlusion occurs when more than one detection response falls on a single object. Generally, it causes ID switches and false positive tracks. Extrinsic occlusion occurs when the number of detection responses on objects that occlude each other is less or more than the number of occluded objects. That can cause false negative and false positive tracks, respectively. Figures 8 and 9 illustrate these tracking issues. The false positive detections made by intrinsic and extrinsic occlusions inevitably generate false tracks, as shown in the second row of Figure 9 (D2TA), if no appropriate technique handles them. To resolve the two types of problems, we design a new occlusion group management (OGM) scheme. OGM consists of “Track Merging” and “Occlusion Group Energy Minimization (OGEM)” routines, which execute just after D2TA and T2TA, respectively. Figure 2 briefly shows the tracking pipeline with those two components of OGM. Consequently, our occlusion group management technique not only decreases false positive tracking results but also prevents occluded tracks from false “track merging”. The effectiveness of the proposed OGM method is discussed in more detail in Section V.

: a set of states at time

: threshold for merging

: a set {key:id,value:states} of occlusion groups at time k

1:function Merge(,,)
2:      ; // : the number of states
3:      Let be the array set to all false;   
4:/* measure occlusion ratio between all states
5:by using the SIOA metric */
6:      for  to  do
7:            for  to  do
8:                 ;// SIOA occlusion ratio.
9:                 if  then
10:                       ;// check to be merged
11:                       ;// double check
12:                 else if  and  then
13:                       ;
14:                       if  then
15:                             ;
16:                       else
17:                             ;
18:                       end if
19:                 end if
20:            end for
21:      end for  
22:/* merge the states where */
23:      for  to  do
24:            for  to  do
25:                 if  then
26:                       if  then
27:                             ;
28:                             Deactivate state ;
29:                       else
30:                             ;
31:                             Deactivate state ;
32:                       end if
33:                 end if
34:            end for
35:      end for
36:      return ;
37:end function
Algorithm 2 Track Merging using the SIOA Metric

IV-A Track Merging

Merging neighboring object states whose distances fall under a threshold was already proposed in [3]. However, that approach reflects only point-to-point distance, without considering regional information such as the overlapping ratio between visual objects (bounding boxes). To measure the overlapping ratio, the intersection-over-union (IOU) metric has been widely used; it was originally designed to measure mAP in object detection research [55, 56]. The IOU metric is well suited to refining detection bounding boxes but is not well suited to measuring the overlapping ratio for merging objects. Figure 8 explains the reason with a case study. The case study mainly assumes that the number of detection responses (observations) is larger than the number of real objects. When the observations most likely include false positive detections, the object states derived from those observations most likely become false positive states as well. So, to account for the characteristics of observations with false positive errors, we propose a new metric named sum-of-intersection-over-area (SIOA). The IOU and SIOA metrics are formulated as follows:


where A and B indicate two different objects. represent a bounding box . Algorithm 2 describes the proposed track merging method. Track merging with the SIOA metric follows after the D2T association as presented in Subsection III-B1 and Figure 2.
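A minimal Python sketch of the two overlap measures may help make the contrast concrete. IOU below follows its standard definition. The SIOA formula is not recoverable from the text here, so the sketch assumes the reading its name suggests: the sum of the intersection area divided by each box's own area, which ranges over [0, 2]. The `(x1, y1, x2, y2)` box layout and the function names are illustrative assumptions, not taken from the authors' implementation.

```python
def intersection_area(a, b):
    """Overlap area of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)


def area(box):
    """Area of a box given as (x1, y1, x2, y2)."""
    return (box[2] - box[0]) * (box[3] - box[1])


def iou(a, b):
    """Standard intersection-over-union."""
    inter = intersection_area(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def sioa(a, b):
    """ASSUMED form of SIOA: intersection over area(A) plus intersection over area(B)."""
    if area(a) <= 0 or area(b) <= 0:
        return 0.0
    inter = intersection_area(a, b)
    return inter / area(a) + inter / area(b)


# A small box fully contained in a large one (a size-variant overlap).
large = (0, 0, 10, 10)
small = (0, 0, 5, 5)
print(iou(large, small))   # 0.25
print(sioa(large, small))  # 1.25
```

For a small box fully contained in a large one, IOU stays low while the assumed SIOA is high, which is the kind of size-variant situation the case study in Figure 8 is concerned with.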

: the current frame number

: a set of occlusion groups at time

: a set of states at time

1:function OGEM(,,)
2:      ; // : the number of the groups .
3:      ; // : the number of the states .   
4:      /*build the GMMs for all occlusion groups at time k-1*/
5:      /*a GMM is used for the defined energy function in (36)*/
6:      for  to  do
7:             //the number of topological vectors.
8:             //the Gaussian mixture for a group.
9:            for  to  do //iterate topologies in a group.
10:                 Initialize a Gaussian mixture with
11:                 the mean vectors having topological info and
12:                 the covariance matrix as defined in (36)
13:            end for
14:            ; //variable for the min-energy.
15:            ; //index to the optimal hypothesis.
16:            for  to  do //iterate topological hypotheses.
17:                 if  then //find the optimal hypothesis.
20:                 end if
21:            end for
22:            Update with ;
23:            for  in  do //iterate group .
24:                 Find the state with in ;
25:                 if  is in  then
26:                       ;
27:                 else
28:                       ;
29:                 end if
30:            end for
31:      end for
32:      return ;
33:end function
Algorithm 3 Occlusion Group Energy Minimization

IV-B Occlusion Group Energy Minimization

Fig. 4: Illustration of the proposed occlusion group energy minimization represented by the Gaussian mixture model. Six hypotheses exist in the case of three occluded objects.

The occlusion group energy minimization method is devised to prevent true objects that are occluded by others from being falsely merged. In other words, track merging after detection-to-track association (D2TA) may merge occluded objects that have the correct number of observations into fewer states than there are real objects. That can cause tracking errors such as false negatives and fragmented tracks.

Thus, we propose a new energy minimization model to prevent false merging, named “Occlusion Group Energy Minimization (OGEM)”. Each group of occluded objects has an energy function represented by a Gaussian mixture model (GMM) as follows:


where , , , and indicate hypothesis, topological vector, mean vector, and Gaussian covariance matrix (noise), respectively. A Gaussian probability function , i.e., a component of the GMM, indicates a topological position vector between two objects in an occlusion group, which is given at time . The Gaussian function has a mean vector which denotes the topological position, i.e., the relative position between the predicted center positions at time of the objects in the group. Those objects are denoted by and the notation indicates the prediction at time from position and velocity at time . If there are three occluded objects in a group, six hypotheses exist as shown in Figure 4. One hypothesis is a set of six topological vectors . For example, is calculated by and is calculated by . In the case that an object state becomes inactive by false merging in occlusion, we build a new hypothesis using a as a dummy. Then the dummy-added hypothesis recovers the falsely merged object. If there are occluded objects in a group, hypotheses exist with the condition . Then, with these topological models, we can find the optimal one among all hypotheses, i.e., the one that makes the group cost minimal.
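The hypothesis search described in this paragraph can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' energy function: the energy in (36) is not reproduced here, so the sketch assumes an isotropic 2-D Gaussian per topological vector and takes the negative sum of the component densities as the group energy. Each hypothesis is modeled as one permutation that assigns observed centers to the objects in the group, which gives six hypotheses with six ordered pairwise vectors each for three objects, matching the counts in the text. The names `topological_vectors`, `group_energy`, `best_hypothesis`, and the variance value are hypothetical, and the dummy-state case is not covered.

```python
import itertools
import math


def topological_vectors(centers):
    """Relative position (dx, dy) for every ordered pair of object centers."""
    n = len(centers)
    return {
        (i, j): (centers[j][0] - centers[i][0], centers[j][1] - centers[i][1])
        for i in range(n) for j in range(n) if i != j
    }


def gaussian_density(vec, mean, var):
    """Isotropic 2-D Gaussian density (ASSUMED form of one GMM component)."""
    dx, dy = vec[0] - mean[0], vec[1] - mean[1]
    return math.exp(-0.5 * (dx * dx + dy * dy) / var) / (2.0 * math.pi * var)


def group_energy(hypothesis_centers, predicted_centers, var):
    """ASSUMED energy: negative sum of component densities over all pairs."""
    means = topological_vectors(predicted_centers)
    vecs = topological_vectors(hypothesis_centers)
    return -sum(gaussian_density(vecs[k], means[k], var) for k in means)


def best_hypothesis(predicted_centers, observed_centers, var=25.0):
    """Enumerate all assignments; return (permutation, energy) with minimal energy."""
    best = None
    for perm in itertools.permutations(range(len(observed_centers))):
        centers = [observed_centers[p] for p in perm]
        energy = group_energy(centers, predicted_centers, var)
        if best is None or energy < best[1]:
            best = (perm, energy)
    return best


predicted = [(0.0, 0.0), (10.0, 0.0), (20.0, 5.0)]  # predicted centers of the group
observed = [(21.0, 5.0), (0.0, 1.0), (10.0, 0.0)]   # observations in arbitrary order
perm, energy = best_hypothesis(predicted, observed)
print(perm)  # index of the observation assigned to each predicted object
```

Exhaustive enumeration is only practical because occlusion groups are small; the number of permutations grows factorially with the group size.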

Whereas track merging runs after the D2TA, the proposed occlusion group energy minimization follows track-to-track association (T2TA) as described in Figure 2. Figure 9 includes some examples showing that the proposed group energy minimization complements the track merging step; the tracking stage at frame 42 illustrates this.

In summary, both the “Track Merging” and “Occlusion Group Energy Minimization” procedures assume occlusion situations, and GMPHD filtering is adopted as the main framework. The pseudo-code of the proposed occlusion group management scheme is given in Algorithms 2 and 3. Both methods correspond to lines 8, 35, and 67-78 in Algorithm 1, which represents the whole tracking framework. From now on, we use GMPHD-OGM as the abbreviation for the proposed algorithm, online multi-object tracking with the GMPHD filter and occlusion group management.

V Experiments

In this section, we present the development environment, including parameter settings, and discuss the evaluation results of the GMPHD-OGM tracker, which include an ablation study with baselines and comparisons to state-of-the-art methods. The GMPHD-OGM tracker is implemented in Visual C++ with the OpenCV 3.4.1 and Boost 1.61.0 libraries, and without any GPU-accelerated libraries such as CUDA. All experiments are conducted on Windows 10 with an Intel i7-7700K CPU @ 4.20GHz and 32.0GB of DDR4 RAM.

V-A Parameter Setting

Our proposed tracking framework involves several parameter settings. Parameter indicates the threshold for “Track Merging” which is set to in terms of the SIOA metric. is set not only empirically but also by considering the occlusion cases between size-variant bounding boxes as shown in Cases 4 and 5 of Figure 8. and are related to track-to-track association (T2TA) of the hierarchical data association, and these parameters are selected adaptively, scene by scene. The optimal values of and are obtained from the ablation study presented in Figure 7. We apply the optimal parameter settings found on the training sequences to the test sequences. These three parameters are summarized in Table I.

The GMPHD filtering process has a set of static parameters. The matrices , , , , and are used in Prediction Step and Update Step. Also, is used in Pruning Step. Experimentally, we set the parameters for the GMPHD filter’s tracking process as follows:

Symbol Description Value
threshold for track merging.
the minimum track length for T2TA
the maximum frame interval for T2TA to
TABLE I: Parameter Settings for “Track Merging” and track-to-track association (T2TA).

V-B Evaluation Results

In this section, we compare the proposed method with state-of-the-art online [14, 15, 18, 19, 20, 21, 22, 23, 16, 24, 25, 37, 39, 38, 40, 41, 42, 43] and offline [28, 26, 27, 29, 30, 31, 32, 33, 34, 35, 36, 50, 44, 45, 46, 47, 48, 49] MOT methods in terms of the CLEAR-MOT metrics [54]. The CLEAR-MOT metrics measure multi-object tracking performance from detailed perspectives such as multi-object tracking accuracy (MOTA), multi-object tracking precision (MOTP), mostly tracked objects (MT), mostly lost objects (ML), the total number of false positives (FP), the total number of false negatives (missed tracks, FN), the total number of identity switches (IDS), the total number of times that a trajectory is fragmented (Frag), and processing speed (frames per second, FPS). Among these metrics, MOTA is normally regarded as the key metric because it comprehensively considers three error sources: FP, FN, and IDS. The evaluation results contain not only the tracking results on the MOT15 and MOT17 test datasets but also an ablation study on the MOT15 training dataset.
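As a concrete check of how MOTA combines the three error sources, the following Python sketch applies the standard CLEAR-MOT definition, MOTA = 1 - (FP + FN + IDS) / GT, to the “Proposed” rows of Tables III and IV. It assumes that the bounding-box totals quoted for the MOT15 and MOT17 test sets in this section (61,440 and 564,228) are the ground-truth object counts; the function name `mota` is illustrative.

```python
def mota(fp, fn, ids, num_gt):
    """Multi-object tracking accuracy in percent (CLEAR-MOT definition)."""
    return 100.0 * (1.0 - (fp + fn + ids) / num_gt)


# Error counts from the "Proposed" rows; ground-truth totals are ASSUMED to be
# the bounding-box counts reported for each test set.
mot15 = mota(fp=6502, fn=35030, ids=1034, num_gt=61440)
mot17 = mota(fp=24024, fn=255277, ids=3125, num_gt=564228)
print(round(mot15, 1), round(mot17, 1))  # 30.7 49.9
```

Both values agree with the MOTA column of the tables, and the counts show that FN accounts for most of the error on both benchmarks.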

First, in the ablation study, we employ two baseline methods to find the optimal parameter settings and to demonstrate the effectiveness of the proposed method. One is the GMPHD filter based tracker with the hierarchical data association (HDA) and without the occlusion group management (OGM). The other is the GMPHD filter based tracker with HDA and OGM that uses the IOU metric for measuring the occlusion ratio. We name these three methods GMPHD-HDA, GMPHD-OGM (w/ IOU), and GMPHD-OGM (w/ SIOA) as shown in Table II and Figure 7. The GMPHD-OGM (w/ SIOA) method is our final tracking model. The scene-by-scene optimal parameter settings of those three methods are obtained by another ablation study as shown in Figure 7. The same and settings are applied to the whole training sequences with the range and , respectively. GMPHD-OGM (w/ IOU) improves over GMPHD-HDA in terms of the upper bound of tracking accuracy (the maximum MOTA). GMPHD-OGM (w/ SIOA) increases both the upper bound and the lower bound of tracking accuracy. Besides, with over , the maximum and minimum values of MOTA increase on average. Figure 8 compares “Track Merging” using the IOU metric against the SIOA metric when detection results with many false positives are given. Also, through the case study on occlusion, we observe that the IOU metric cannot account for size-variant detections with false positives and is too sensitive to be used for merging, as shown in Figure 8. On the other hand, the SIOA metric can account for size-variant detections, and the optimal value of the merging threshold is empirically decided to be by the occlusion cases 4 and 5. Table II provides the quantitative results on the MOT15 training dataset with the best performance results on each sequence and . GMPHD-OGM (w/ IOU) does not show outstanding improvement compared to GMPHD-HDA, even though GMPHD-OGM (w/ IOU) takes more processing time since the OGM scheme runs, whereas it does not in GMPHD-HDA. GMPHD-OGM (w/ SIOA) shows meaningful improvements in terms of MOTA. The ablation study indicates that GMPHD-OGM is not only improved overall but also more robust and less sensitive to parameters than the baseline methods.

Figure 9 demonstrates some qualitative results of our tracking framework over the whole process. Detection results (observations) initialize tracking objects (states). In the sequential tracking process, the states are associated with the proper observations by the detection-to-track association (D2TA) using the GMPHD filtering process. False positive detections can generate false positive tracks, which “Track Merging” then handles. If objects are occluded and their IDs are switched, track-to-track association (T2TA) can recover their IDs. When “Track Merging” merges true tracks (false merging), the occlusion group energy minimization (OGEM) process can recover them; it optimizes the energy of a group of occluded objects at the current time by calculating the probability of the Gaussian mixture model as described in Subsection IV-B.

Tables III and IV show the quantitative evaluation results on the MOT15 and MOT17 test datasets, respectively. Those two benchmark datasets have crucially different characteristics. First, the provided public detection results are different: MOT15 provides detections from the ACF [8] detector, and MOT17 provides three types of detections, from DPM [9], FRCNN [10], and SDP [11]. Compared to the DNN based detectors FRCNN and SDP, ACF and DPM rely on hand-crafted features and models, and thus show relatively poor performance. DPM generates more false positives than FRCNN and SDP do, and ACF in particular misses many more objects than the others do. Thus, in MOT15, state-of-the-art trackers show a wider range of MOTA distribution than in MOT17. Among online methods, the trackers with DNN [14, 37] show the top MOTA scores in MOT15 and MOT17, respectively. In MOT15, our method achieves the second best MOTA (30.7) together with the best speed (169.5 fps) among online methods, and we think that this performance is competitive and sufficient for real-time applications.

Fig. 5: Comparisons of tracking accuracy against speed with the state-of-the-art methods on the (a) MOT15 and (b) MOT17 test sequences.

In Figure 5-(a), the proposed method is located in a distinguished spot in terms of tracking accuracy (MOTA) vs. speed (fps). That also supports the effectiveness of our occlusion group based object analysis (OGM), compared to other relation analyses between all objects in the scene [21, 23]. However, in MOT17, the speed of the proposed method decreases to 30.7 fps. That speed still qualifies as real-time processing but is not outstanding compared to other online methods. That is caused by the second difference between the two datasets. MOT15 includes 5783 frames with 721 tracks, 61440 bounding boxes, and a density of 10.6, i.e., the average number of objects per frame, whereas MOT17 includes 17757 frames with 2355 tracks, 564228 bounding boxes, and a density of 31.8. Because MOT17 has scenes with not only much higher density but also more accurate detection results, both tracking accuracy and processing time increase. Figure 5-(b) illustrates this: the performances of state-of-the-art methods are saturated around a MOTA of 50 and a speed under 5 fps. Even though our method achieves the second best MOTA and speed among online approaches in MOT17, the speed is drastically decreased compared to MOT15.
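The density figures quoted in this paragraph follow directly from the frame and bounding-box counts. A short Python check, using only the numbers stated here (the dictionary layout is illustrative):

```python
datasets = {
    "MOT15": {"frames": 5783, "boxes": 61440},
    "MOT17": {"frames": 17757, "boxes": 564228},
}

# Density is the average number of objects (bounding boxes) per frame.
density = {name: d["boxes"] / d["frames"] for name, d in datasets.items()}
for name, value in density.items():
    print(name, round(value, 1))  # MOT15 10.6, then MOT17 31.8

# How much denser MOT17 is than MOT15.
ratio = density["MOT17"] / density["MOT15"]
```

MOT17 is thus roughly three times as dense as MOT15, which is consistent with the larger number of associations the tracker has to evaluate per frame.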

Fig. 6: Speed comparison of the proposed tracking method on MOT17 test dataset which provides three types of detection results for each scene, including DPM [9], FRCNN [10], and SDP [11].

Figure 6 explains the reason. In MOT17-03, the speed is around 10 fps since many objects appear in the scene, with a density (the average number of objects per frame) of 69.8. That greatly increases the number of track-to-track associations. The proposed method is still competitive and positioned at a meaningful spot for real-time application as shown in Figure 5-(b). In addition to our tracking algorithm (GMPHD-OGM), many PHD filter based online approaches [18, 38, 16, 22, 41, 42, 43] have been proposed in the past decade. Against them, GMPHD-OGM achieves not only the best MOTA, MOTP, MT, ML, FN, and speed scores on MOT15 but also the second best MOTA and speed and the best MT, FN, and Frag scores in MOT17. In particular, against state-of-the-art online approaches, the proposed method is distinguished in terms of tracking accuracy (MOTA) vs. speed (fps), even though we did not utilize any complex visual features except bounding boxes. Also, against state-of-the-art algorithms, including online methods with DNN and even offline methods, GMPHD-OGM shows competitive MOTA versus speed as described in Figure 5, Table III, and Table IV.

Tracker Threshold MOTA MOTP MT ML FP FN IDS Frag Speed
GMPHD-HDA n/a 34.8 % 72.3 % 14.4 % 47.8 % 4,042 21,646 338 572 212.4 fps
GMPHD-OGM (w/ IOU) 0.2 34.5 % 72.4 % 13.4 % 49.8 % 3,594 22,226 285 550 201.1 fps
GMPHD-OGM (w/ IOU) 0.3 35.6 % 72.3 % 14.0 % 48.6 % 3,537 21,901 278 548 228.4 fps
GMPHD-OGM (w/ IOU) 0.4 35.3 % 72.4 % 14.0 % 48.2 % 3,667 21,865 291 562 201.9 fps
GMPHD-OGM (w/ IOU) 0.5 34.7 % 72.3 % 14.4 % 47.8 % 4,044 21,661 340 577 205.3 fps
GMPHD-OGM (w/ SIOA) 0.3 34.5 % 72.3 % 14.2 % 49.2 % 3,496 22,336 297 559 216.3 fps
GMPHD-OGM (w/ SIOA) 0.4 35.4 % 72.4 % 14.6 % 48.4 % 3,556 21,930 284 540 216.3 fps
GMPHD-OGM (w/ SIOA) 0.5 35.8 % 72.2 % 15.0 % 47.6 % 3,569 21,758 295 545 221.0 fps
GMPHD-OGM (w/ SIOA) 0.6 35.5 % 72.3 % 15.6 % 47.2 % 3,702 21,724 312 556 221.9 fps
GMPHD-OGM (w/ SIOA) 0.7 34.7 % 72.2 % 15.6 % 47.2 % 4,159 21,519 368 567 202.9 fps
TABLE II: Quantitative evaluation results on MOT15 training dataset. The proposed method namely GMPHD-OGM (w/ SIOA) is compared to two baseline methods GMPHD-HDA and GMPHD-OGM (w/ IOU). GMPHD-HDA employs the GMPHD filtering with hierarchical data association (HDA). GMPHD-OGM is equal to GMPHD-HDA with the proposed occlusion group management (OGM). The IOU and SIOA metrics are used for “Track Merging” in GMPHD-OGM (w/ IOU) and (w/ SIOA), respectively. The optimal values of the merging threshold are underlined and the best scores are in bold in terms of the CLEAR-MOT metrics.
Mode Tracker DNN MOTA MOTP MT ML FP FN IDS Frag Speed

Online
CDA_DDAL [14] O 32.8 % 70.7 % 9.7 % 42.2 % 4,983 35,690 614 1,583 2.3 fps
HAM_SADF [15] O 28.6 % 71.1 % 10.0 % 44.0 % 7,485 35,910 460 1,038 18.7 fps
Proposed* X 30.7 % 71.6 % 11.5 % 38.1 % 6,502 35,030 1,034 1,351 169.5 fps
PHD_GSDL [18] X 30.5 % 71.2 % 7.6 % 41.2 % 6,534 35,284 879 2,208 8.2 fps
MDP [19] X 30.3 % 71.3 % 14.0 % 38.4 % 9,717 32,422 680 1,500 1.1 fps
TBSS [20] X 29.2 % 71.3 % 6.8 % 43.8 % 6,068 36,779 649 1,508 11.5 fps
SCEA [21] X 29.1 % 71.1 % 8.9 % 47.3 % 6,060 36,912 604 1,182 6.8 fps
EAMTT [22] X 22.3 % 69.6 % 5.4 % 52.7 % 7,924 38,982 833 1,485 12.2 fps
RMOT [23] X 18.6 % 69.6 % 5.3 % 53.3 % 12,473 36,835 684 1,282 7.9 fps
GMPHD_HDA [16] X 18.5 % 70.9 % 3.9 % 55.3 % 7,864 41,766 459 1,266 19.8 fps
GSCR [24] X 15.8 % 69.4 % 1.8 % 61.0 % 7,597 43,633 514 1,010 28.1 fps
TC_ODAL [25] X 15.1 % 70.5 % 3.2 % 55.8 % 12,970 38,538 637 1,716 1.7 fps

Offline
MHT_DAM [28] O 32.4 % 71.8 % 16.0 % 43.8 % 9,064 32,060 435 826 0.7 fps
CNNTCM [26] O 29.6 % 71.8 % 11.2 % 44.0 % 7,786 34,733 712 943 1.7 fps
SiameseCNN [27] O 29.0 % 71.2 % 8.5 % 48.4 % 5,160 37,798 639 1,316 52.8 fps
NOMT [29] X 33.7 % 71.9 % 12.2 % 44.6 % 7,762 32,547 442 823 11.5 fps
ELP [30] X 25.0 % 71.2 % 7.5 % 43.8 % 7,345 37,344 1,396 1,804 5.7 fps
JPDA_m [31] X 23.8 % 68.2 % 5.0 % 58.1 % 6,373 70,084 365 869 32.6 fps
MotiCon [32] X 23.1 % 70.9 % 4.7 % 52.0 % 10,404 35,844 1,018 1,061 1.4 fps
SegTrack [33] X 22.5 % 71.7 % 5.8 % 63.9 % 7,890 39,020 697 737 0.2 fps
CEM [34] X 19.3 % 70.7 % 8.5 % 46.5 % 14,180 34,591 813 1,023 1.1 fps
SMOT [35] X 18.2 % 71.2 % 2.8 % 54.8 % 8,780 40,310 1,148 2,132 2.7 fps
DP_NMS [36] X 14.5 % 70.8 % 5.0 % 40.8 % 13,171 34,814 4,537 3,090 444.8 fps
* The final proposed model is GMPHD-OGM (w/ SIOA).
TABLE III: Quantitative evaluation results on MOT15 test dataset. The proposed method is compared to state-of-the-art methods in terms of the CLEAR-MOT metrics. For each mode, i.e., online and offline, the first and the second best scores are highlighted in red and blue in terms of each metric.
Mode Tracker DNN MOTA MOTP MT ML FP FN IDS Frag Speed

Online
MOTDT [37] O 50.9 % 76.6 % 17.5 % 35.7 % 24,069 250,768 2,474 5,317 18.3 fps
HAM_SADF [15] O 48.3 % 77.2 % 17.1 % 41.7 % 20,967 269,038 1,871 3,020 5.0 fps
DMAN [39] O 48.2 % 75.7 % 19.3 % 38.3 % 26,218 263,608 2,194 5,378 0.3 fps
Proposed* X 49.9 % 77.0 % 19.7 % 38.0 % 24,024 255,277 3,125 3,540 30.7 fps
MTDF [38] X 49.6 % 75.5 % 18.9 % 33.1 % 37,124 241,768 5,567 9,260 1.2 fps
AM_ADM [40] X 48.1 % 76.7 % 13.4 % 37.7 % 25,061 265,495 2,214 5,027 5.7 fps
PHD_GSDL [18] X 48.0 % 77.2 % 17.1 % 35.6 % 23,199 265,954 3,998 8,886 6.7 fps
GMPHD_HDA [16] X 43.7 % 76.5 % 11.7 % 43.0 % 25,935 287,758 3,838 5,056 9.2 fps
EAMTT [22] X 42.6 % 76.0 % 12.7 % 42.7 % 20,711 288,474 4,488 5,720 12.0 fps
GMPHD_N1Tr [41] X 42.1 % 77.7 % 11.9 % 42.7 % 18,214 297,646 10,698 10,864 9.9 fps
GMPHD_KCF [42] X 39.6 % 74.5 % 8.8 % 43.3 % 50,903 284,228 5,811 7,414 3.3 fps
GM_PHD [43] X 36.4 % 76.2 % 4.1 % 57.3 % 23,723 330,767 4,607 11,317 38.4 fps

Offline
MHT_DAM [28] O 50.7 % 77.5 % 20.8 % 36.9 % 22,875 252,889 2,314 2,865 0.9 fps
MHT_bLSTM [50] O 47.5 % 77.5 % 18.2 % 41.7 % 25,981 268,042 2,069 3,124 1.9 fps
eHAF [44] X 51.8 % 77.0 % 23.4 % 37.9 % 33,212 236,772 1,834 2,739 0.7 fps
FWT [45] X 51.3 % 77.0 % 21.4 % 35.2 % 24,101 247,921 2,648 4,279 0.2 fps
JCC [46] X 51.2 % 75.9 % 20.9 % 37.0 % 25,937 247,822 1,802 2,984 1.8 fps
TLMHT [47] X 50.6 % 77.6 % 17.6 % 43.4 % 22,213 255,030 1,407 2,079 2.6 fps
EDMT [48] X 50.0 % 77.3 % 21.6 % 36.3 % 32,279 247,297 2,264 3,260 0.6 fps
IOU [49] X 45.5 % 76.9 % 15.7 % 40.5 % 19,993 281,643 5,988 7,404 1,522.9 fps
* The final proposed model is GMPHD-OGM (w/ SIOA).
TABLE IV: Quantitative evaluation results on MOT17 test dataset. The proposed method is compared to state-of-the-art methods in terms of the CLEAR-MOT metrics. For each mode, i.e., online and offline, the first and the second best scores are highlighted in red and blue in terms of each metric.
Fig. 7: Ablation study with the two baseline methods, i.e., GMPHD-HDA and GMPHD-OGM (with IOU). The final proposed method is GMPHD-OGM (with SIOA). The three graphs indicate the MOTA score distributions against (a) the minimum track length for T2TA () and the maximum frame interval for T2TA (), (b) , and (c) . GMPHD-OGM (with SIOA) shows overall improvement in the upper and lower bounds of the MOTA score.
Fig. 8: Case study about “Track Merging” with the qualitative results on MOT17-05-DPM training sequence at frame 254. For the same detection results, the overlapping ratios between the occluded objects are measured with and . Under the different merging threshold values and , the IOU metric is more sensitive than the SIOA metric. The SIOA metric is more robust to merge size variant false positive bounding boxes than IOU metric.
Fig. 9: Illustration of the proposed multi-object tracking process with the qualitative results on MOT17-08-DPM test sequence. The whole process consists of four components: D2TA, Merge, T2TA, and OGEM. Qualitative tracking results at frames 42, 66, 184, and 190 show that all components complement each other in handling tracking problems.

VI Conclusion and Future Work

In this paper, we proposed an efficient online multi-object tracking framework with the GMPHD filter and occlusion group management (OGM), named GMPHD-OGM. Our first contribution is that the Gaussian mixture probability hypothesis density (GMPHD) filter [3] is exploited to solve the MOT task. Since the GMPHD filter was originally designed to handle MOT in radar/sonar systems, the filter has to be revised to fit video data. To resolve the missed-track problem in this different domain, we extended the conventional GMPHD filtering process with the hierarchical data association (HDA) strategy as explained in Figure 1. The second contribution is an occlusion group management (OGM) scheme that addresses occlusion problems. OGM is composed of “Track Merging” and “Occlusion Group Energy Minimization (OGEM)”. “Track Merging” reduces the number of false positives by merging them, and OGEM prevents false merging between true tracks. Both modules complement each other, and instead of the IOU metric, we designed a new metric named sum-of-intersection-over-area (SIOA) to measure the occlusion ratio between visual objects. The third contribution is that the effectiveness of our tracker (GMPHD-OGM) was demonstrated by the ablation study with the baselines and by the evaluation results on the MOT15 [5] and MOT17 [6] benchmarks against state-of-the-art MOT methods. The ablation study indicates that GMPHD-OGM (w/ SIOA) solves the defined problems more effectively than the given baseline methods, GMPHD-HDA and GMPHD-OGM (w/ IOU). GMPHD-OGM achieves the best MOTA scores on both the MOT15 and MOT17 datasets in comparison with the PHD filter based online trackers [18, 38, 16, 22, 41, 42, 43]. Finally, from the comprehensive evaluation, we conclude that GMPHD-OGM shows competitive “MOTA versus speed” against state-of-the-art algorithms, including online methods with DNN and even offline methods.
As future work, we will develop an efficient real-time tracker that maintains state-of-the-art tracking accuracy even when the number of objects exceeds one hundred.


  • [1] R. P. S. Mahler, “Multitarget Bayes Filtering via First-Order Multitarget Moments,” IEEE Trans. Aerosp. Electron. Syst., vol. 39, no. 4, pp. 1152–1178, Oct. 2003.
  • [2] B.-N. Vo, S. Singh, and A. Doucet, “Sequential Monte Carlo implementation of the PHD filter for multi-target tracking,” in Proc. Int. Conf. Information Fusion (ICIF), pp. 792–799, Jul. 2003.
  • [3] B.-N. Vo and W.-K. Ma, “The Gaussian mixture probability hypothesis density filter,” IEEE Trans. Signal Processing, vol. 54, no. 11, pp. 4091–4104, Oct. 2006.
  • [4] B.-T. Vo, “Random finite sets in multi-object filtering,” PhD Thesis, School of Electrical, Electronic and Computer Engineering, The Univ. of Western Australia, Perth, Australia, 2008.
  • [5] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler, “MOTChallenge 2015: Towards a benchmark for multi-target tracking,” [Online]. Available:, Apr. 2015.
  • [6] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler, “MOT16: A Benchmark for Multi-Object Tracking,” [Online]. Available:, May 2016.
  • [7] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 3354–3361, Jun. 2012.
  • [8] P. Dollár, R. Appel, S. Belongie, and P. Perona, “Fast feature pyramids for object detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 8, pp. 1532–1545, Aug. 2014.
  • [9] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627–1645, Sep. 2010.
  • [10] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Proc. 28th Int. Conf. Neural Inf. Process. Syst. (NIPS), Montréal, QC, Canada, vol. 1, pp. 91–99, Dec. 2015.
  • [11] F. Yang, W. Choi, and Y. Lin, “Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 2129–2137, Jun. 2016.
  • [12] L. Wang, Y. Lu, H. Wang, Y. Zheng, H. Ye, and X. Xue, “Evolving boxes for fast vehicle detection,” in Proc. IEEE Conf. Multi. Expo (ICME), pp. 1135–1140, Jul. 2017.
  • [13] S. Murray, “Real-Time Multiple Object Tracking - A Study on the Importance of Speed,” [Online]. Available:, Oct. 2017.
  • [14] S.-H. Bae and K.-J. Yoon, “Confidence-Based Data Association and Discriminative Deep Appearance Learning for Robust Online Multi-Object Tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 3, pp. 595–610, Mar. 2018.
  • [15] Y.-C. Yoon, A. Boragule, Y. Song, K. Yoon, and M. Jeon, “Online Multi-Object Tracking with Historical Appearance Matching and Scene Adaptive Detection Filtering,” in Proc. IEEE Int. Workshop Traffic Street Surveill. Safety Secur. (AVSS), pp. 1–6, Nov. 2018.
  • [16] Y. Song and M. Jeon, “Online Multiple Object Tracking with the Hierarchically Adopted GM-PHD Filter using Motion and Appearance,” in Proc. IEEE Int. Conf. Consumer Electronics-Asia (ICCE-Asia), pp. 1–4, Oct. 2016.
  • [17] Y. Song, Y.-C. Yoon, K. Yoon, and M. Jeon, “Online and Real-Time Tracking with the GMPHD Filter using Group Management and Relative Motion Analysis,” in Proc. IEEE Int. Workshop Traffic Street Surveill. Safety Secur. (AVSS), pp. 1–6, Nov. 2018.
  • [18] Z. Fu, P. Feng, F. Angelini, J. Chambers, and S. M. Naqvi, “Particle phd filter based multiple human tracking using online group-structured dictionary learning,” IEEE Access, vol. 6, pp. 14764–14778, Mar. 2018.
  • [19] Y. Xiang, A. Alahi, and S. Savarese, “Learning to Track: Online Multi-Object Tracking by Decision Making,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pp. 4705–4713, Dec. 2015.
  • [20] X. Zhou, P. Jiang, Z. Wei, H. Dong, and F. Wang, “Online Multi-Object Tracking with Structural Invariance Constraint,” in Proc. Brit. Mach. Vis. Conf. (BMVC), pp. 1–13, Sep. 2018.
  • [21] J. Yoon, C.-R. Lee, M.-H. Yang, and K.-J. Yoon, “Online Multi-object Tracking via Structural Constraint Event Aggregation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 1392–1400, Jun. 2016.
  • [22] R. Sanchez-Matilla, F. Poiesi, and A. Cavallaro, “Multi-target tracking with strong and weak detections,” in Proc. Eur. Conf. Comput. Vis. Workshops (ECCVW), pp. 84–99, Oct. 2016.
  • [23] J. Yoon, M.-H. Yang, J. Lim, and K.-J. Yoon, “Bayesian Multi-Object Tracking Using Motion Context from Multiple Objects,” in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), pp. 1–13, Jan. 2015.
  • [24] L. Fagot-Bouquet, R. Audigier, Y. Dhome, and F. Lerasle, “Online multi-person tracking based on global sparse collaborative representations,” in Proc. IEEE Conf. Image Processing (ICIP), pp. 2414–2418, Sep. 2015.
  • [25] S.-H. Bae and K.-J. Yoon, “Robust Online Multi-Object Tracking based on Tracklet Confidence and Online Discriminative Appearance Learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 1218–1225, Jun. 2015.
  • [26] B. Wang, L. Wang, B. Shuai, Z. Zuo, T. Liu, K. L. Chan, and G. Wang, “Joint Learning of Convolutional Neural Networks and Temporally Constrained Metrics for Tracklet Association,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), pp. 386–393, Jun. 2016.
  • [27] L. Leal-Taixé, C. Canton-Ferrer, and K. Schindler, “Learning by Tracking: Siamese CNN for Robust Target Association,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), pp. 33–40, Jun. 2016.
  • [28] C. Kim, F. Li, A. Ciptadi, and J. M. Rehg, “Multiple Hypothesis Tracking Revisited,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pp. 4696–4704, Dec. 2015.
  • [29] W. Choi, “Near-Online Multi-target Tracking with Aggregated Local Flow Descriptor,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pp. 3029–3037, Dec. 2015.
  • [30] N. McLaughlin, J. M. D. Rincon, and P. Miller, “Enhancing Linear Programming with Motion Modeling for Multi-target Tracking,” in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), pp. 71–77, Jan. 2015.
  • [31] S. H. Rezatofighi, A. Milan, Z. Zhang, Q. Shi, A. Dick, and I. Reid, “Joint Probabilistic Data Association Revisited,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pp. 3047–3055, Dec. 2015.
  • [32] L. Leal-Taixé, M. Fenzi, A. Kuznetsova, B. Rosenhahn, and S. Savarese, “Learning an image-based motion context for multiple people tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 3542–3549, Jun. 2014.
  • [33] A. Milan, L. Leal-Taixé, K. Schindler, and I. Reid, “Joint Tracking and Segmentation of Multiple Targets,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 5397–5406, Jun. 2015.
  • [34] A. Milan, S. Roth, and K. Schindler, “Continuous Energy Minimization for Multitarget Tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 1, pp. 58–72, Aug. 2014.
  • [35] C. Dicle, O. I. Camps, and M. Sznaier, “The Way They Move: Tracking Targets with Similar Appearance,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pp. 2304–2311, Dec. 2013.
  • [36] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes, “Globally-Optimal Greedy Algorithms for Tracking a Variable Number of Objects,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 1201–1208, Jun. 2011.
  • [37] C. Long, A. Haizhou, Z. Zijie and S. Chong, “Real-time Multiple People Tracking with Deeply Learned Candidate Selection and Person Re-identification,” in Proc. IEEE Conf. Multi. Expo (ICME), pp. 1–6, Oct. 2018.
  • [38] Z. Fu, F. Angelini, J. Chambers, and S. M. Naqvi, “Multi-Level Cooperative Fusion of GM-PHD Filters for Online Multiple Human Tracking,” IEEE Trans. Multimedia, (Early Access), Mar. 2019.
  • [39] J. Zhu, H. Yang, N. Liu, M. Kim, W. Zhang, and M.-H. Yang, “Online Multi-Object Tracking with Dual Matching Attention Networks,” in Proc. Eur. Conf. Comput. Vis. (ECCV), pp. 366–382, Feb. 2019.
  • [40] S.-H. Lee, M.-Y. Kim, and S.-H. Bae, “Learning Discriminative Appearance Models for Online Multi-Object Tracking with Appearance Discriminability Measures,” IEEE Access, vol. 6, pp. 67316–67328, Nov. 2018.
  • [41] N. L. Baisa and A. Wallace, “Development of a N-type GM-PHD filter for multiple target, multiple type visual tracking,” Journal of Visual Communication and Image Representation, vol. 59, pp. 257–271, 2019.
  • [42] T. Kutschbach, E. Bochinski, V. Eiselein, and T. Sikora, “Sequential Sensor Fusion Combining Probability Hypothesis Density and Kernelized Correlation Filters for Multi-Object Tracking in Video Data,” in Proc. IEEE Int. Workshop Traffic Street Surveill. Safety Secur. (AVSS), pp. 1–6, Sep. 2017.
  • [43] V. Eiselein, D. Arp, M. Pätzold, and T. Sikora, “Real-time Multi-Human Tracking using a Probability Hypothesis Density Filter and multiple detectors,” in Proc. IEEE Int. Conf. Adv. Video Signal Based Surveill. (AVSS), pp. 325–330, Sep. 2012.
  • [44] H. Sheng, Y. Zhang, J. Chen, Z. Xiong, and J. Zhang, “Heterogeneous Association Graph Fusion for Target Association in Multiple Object Tracking,” IEEE Trans. Circuits Syst. Video Technol., (Early Access), Nov. 2018.
  • [45] R. Henschel, L. Leal-Taixé, D. Cremers, and B. Rosenhahn, “Fusion of Head and Full-Body Detectors for Multi-Object Tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), pp. 1428–1437, Jun. 2018.
  • [46] M. Keuper, S. Tang, B. Andres, T. Brox, and B. Schiele, “Motion Segmentation & Multiple Object Tracking by Correlation Co-Clustering,” IEEE Trans. Pattern Anal. Mach. Intell., (Early Access), Oct. 2018.
  • [47] H. Sheng, J. Chen, Y. Zhang, W. Ke, Z. Xiong, and J. Yu, “Iterative Multiple Hypothesis Tracking with Tracklet-level Association,” IEEE Trans. Circuits Syst. Video Technol., (Early Access), Nov. 2018.
  • [48] J. Chen, H. Sheng, Y. Zhang, and Z. Xiong, “Enhancing Detection Model for Multiple Hypothesis Tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), pp. 18–27, Jul. 2017.
  • [49] E. Bochinski, V. Eiselein, and T. Sikora, “High-Speed Tracking-by-Detection Without Using Image Information,” in Proc. IEEE Int. Workshop Traffic Street Surveill. Safety Secur. (AVSS), pp. 1–6, Sep. 2017.
  • [50] C. Kim, F. Li, and J. Rehg, “Multi-object Tracking with Neural Gating Using Bilinear LSTM,” in Proc. Eur. Conf. Comput. Vis. (ECCV), pp. 200–215, Sep. 2018.
  • [51] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a Similarity Metric Discriminatively,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 539–546, Jun. 2005.
  • [52] L. Zhao, X. Li, J. Wang, and Y. Zhuang, “Deeply-learned part-aligned representations for person re-identification,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pp. 3239–3248, Oct. 2017.
  • [53] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-Speed Tracking with Kernelized Correlation Filters,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 583–596, Mar. 2015.
  • [54] K. Bernardin and R. Stiefelhagen, “Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics,” EURASIP Journal on Image and Video Processing, vol. 2008, no. 1, pp. 1–10, May 2008.
  • [55] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The Pascal Visual Object Classes Challenge: A Retrospective,” International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, Jan. 2015.
  • [56] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, Dec. 2015.