Current research on MOT algorithms mainly focuses on the problem of data association with given detections, and tracking-by-detection methods can be broadly classified into online and offline methods. Offline MOT methods concentrate on the association assignment between tracklets, which inevitably uses detections from future frames. Offline methods are therefore regarded as global optimization schemes and can only be applied to non-real-time settings such as offline video analysis. Online MOT methods, on the other hand, focus on the data association between detections in the current frame and historical tracklets. They do not need detections from future frames as input and can thus be applied to real-time settings such as video surveillance and robot navigation.
The design of the pairwise association costs between detections and historical tracklets directly affects the performance of the data association assignment. In MOT methods, appearance features and motion cues of the targets are usually extracted to obtain pairwise association costs. Because appearance models distinguish different objects well in most circumstances, many studies are committed to finding more discriminative appearance features, such as , , . However, the pairwise cost obtained from appearance information easily becomes unreliable when tracking targets with similar appearance. Motion information then becomes necessary to discriminate and identify multiple targets. In a fixed scene, a linear motion model can characterize the trajectory of a simply moving target, and an autoregressive motion model  can be established for a complex one. With a moving camera, however, the observed movement of an individual target is the superposition of the object’s own movement and the image translation caused by the camera motion. Under such circumstances, observable motion cues become unstable, unpredictable and unreliable. Conventional motion models can hardly describe the movement of these targets, let alone predict their locations accurately in the next frame. Although the uncertainty of the target motion grows with a moving camera, the internal structure among multiple targets in adjacent frames remains approximately consistent, which can be used to compensate for the weakness of ambiguous motion cues, as shown in Figure 1.
In this paper, we propose a heuristic approach that searches for internal structural constraints between multiple targets to modify the pairwise cost matrix and thus lessen the ambiguities in the data association assignment. Furthermore, we propose a new method based on minimizing a cost function constructed from both motion and structure cues. It predicts the next location of a missing target so that the target can be associated again when it reappears. The online MOT algorithm built on the proposed method consists of two main steps. First, we construct the association cost matrix between the detections in the current frame and the historical target tracklets, which includes the design of the raw pairwise costs and their refinement by the structural constraints. Specifically, we exploit motion and appearance cues to construct the raw pairwise costs and then use the proposed heuristic approach to search for the optimal structural constraints that refine the association cost matrix. The second step is the association assignment, in which we use generalized linear assignment  to match the available detections with the historical tracklets and then recover the missing targets within a fixed-size window by minimizing the cost function constructed from both motion and structure cues.
II Related Work
We review related MOT methods that pay close attention to motion cues and the structure of multiple targets. Andriyenko et al. , ,  focus on designing an energy function and constructing an optimization scheme to find its local minima. All detection responses are the input of the energy function, whose solution space contains all possible associations. Velocity models are used to describe the motion of the targets, which only covers simple movements.
Dicle et al.  use only motion cues to track multiple objects with similar appearance; an autoregressive model represents the motion of each target and is used to construct the association cost. This model handles complex target movements and similar appearances, but does not work well with non-stationary cameras. Possegger et al. exploit geometric information, including occlusion information, detector reliability, and motion prediction, to recover missing objects. Collins et al.  develop a higher-order cost function for data association, which uses an active contour spline energy to measure the quality of a proposed trajectory. Liu et al.  extract game context features from noisy detections to build context-conditioned motion models for tracking sports players. Yang and Nevatia  use an online conditional random field (CRF) to produce unary and pairwise energy functions based on linear and smooth motion to solve the multi-object tracking problem. Other trackers, such as , , , jointly address both detection and tracking.
Yoon et al.  utilize the motion context to construct a relative motion network that deals with unexpected camera motion. However, this method cannot handle abrupt camera motions and fluctuations. In , Yoon et al. exploit structural motion constraints and propose an event aggregation approach to solve the MOT problem with moving cameras. This method performs well on public datasets. However, the motion cues of the targets depend only on their relative structure, and the update of the structure relies on a linear motion model, which can hardly handle objects with complex or high-speed motion.
The MOT problem can be regarded as an association assignment between object detections and historical tracklets for online trackers, or between tracklets for offline trackers. There are roughly two directions for developing a more sophisticated tracker. On the one hand, much research focuses on the solver for the data association problem. Many different approaches have been adopted to solve the association assignment, including minimization of network flow based cost  , , 5], , Hungarian matching , ,  and subgraph decomposition ,. On the other hand, designing more elaborate features ,  and making the association cost more discriminative have received broad attention. For example, Rezatofighi et al.  modify the association costs with joint probabilities and decompose the original problem into a series of integer programs, which is more efficient. However, in that method each pairwise cost aggregates the costs generated by all possible match events containing the fixed pair. The most likely pair is therefore susceptible to noise from nearly impossible pairs, especially when detections and targets cannot be matched one-to-one. The same holds for the tracker in .
III Proposed Method
The discriminability of the pairwise costs is of great significance for solving the association assignment. The related work section discusses a variety of cost designs, most of which concern the appearance and motion cues of the objects. For example,  proposes an event aggregation method to integrate structural constraints over all possible assignment events. In this paper, we propose a heuristic method that searches for the optimal association event corresponding to the minimum structural cost for each possible pair.
III-B Heuristic search for the optimal structural constraint
Consecutive video frames guarantee the continuity of the objects’ motion, which makes it possible to exploit structure cues between multiple targets in a frame. We denote the center location of a detection at frame as and an object at frame as (In the subsequent formulas, we will omit ). We use to denote the relative displacement between the th object location and the spatial distribution center of all detection positions, which is given as follows.
If the detection in the frame is associated with the object , the assignment is denoted by or . Otherwise, it is denoted by . Assuming there are detections in the current frame and historical objects in the past, and they could also match each other one-to-one, then the optimal assignment solution based on minimum structural cost of each possible pair can be constructed as:
The above formula can be solved easily by the Hungarian algorithm. However, the actual number of detections in the current frame usually differs from the number of objects in the past because of false positives, false negatives and other noise. In other words, we can neither find global structural constraints between all detections and objects nor guarantee a one-to-one match. Thus, Eq. (2) does not fit practical assignments. We therefore present a heuristic method that searches for the optimal match set by minimizing the structural cost in each search loop, which is described below.
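For the idealized one-to-one case, the assignment can be illustrated on a tiny example. The sketch below enumerates all permutations of a small square cost matrix and keeps the cheapest one. This brute-force search is only a stand-in for the Hungarian algorithm (it reaches the same optimum but scales factorially), and the function name `best_one_to_one` and the cost values are hypothetical, not taken from the paper.

```python
from itertools import permutations


def best_one_to_one(cost):
    """Return (total_cost, assignment) minimising sum of cost[i][assignment[i]].

    cost is a square list of lists; cost[i][j] stands for an assumed
    structural cost of pairing detection i with object j.
    """
    n = len(cost)
    best_total, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_total, best_perm = total, perm
    return best_total, list(best_perm)


# Hypothetical 3x3 cost matrix (rows: detections, columns: objects).
cost = [
    [1.0, 4.0, 5.0],
    [3.0, 0.5, 6.0],
    [7.0, 2.0, 1.5],
]
total, assignment = best_one_to_one(cost)
print(total, assignment)
```

The example only works because the matrix is square and every detection has a true counterpart, which is exactly the assumption that breaks down with false positives and missed detections.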
With a certain association pair and the corresponding match set , which is initialized as , we are able to search alternative match pairs for the optimal pair by minimizing the structural cost function generated among elements in the match set and the alternative pair heuristically and continually. The structural cost function is described by:
If the structural cost computed from the above formula is smaller than a certain threshold , we add the pair to the match set and repeat the entire procedure to find the next optimal pair given the updated match set. Otherwise, we end the search, and the maximum match set under the structural constraints has been found.
When solving Eq. (3), we fix the target and assume that the solution space of detections is continuous. The extreme point under this condition is given by the following formula:
Given a known match set, there is thus a unique predicted location in the current frame for each historical object. Because the detections are discrete, we find the detection closest to this solution for each target and select the pair that yields the minimum structural cost with respect to the detections and targets in the current match set. If the minimum structural cost of this optimal pair is less than the threshold, we expand the match set. Otherwise, we leave the loop and end the search, and the current match set is the result. The specific procedure is shown in Algorithm 1 and Fig. 2.
|Algorithm 1 Heuristic search for structural constraints|
|Input: a certain fixed association pair|
|Input: the detection in frame|
|Input: the target in frame|
|Output: the optimal match set for each fixed pair|
|While (the number of targets)|
|Candidate pair set|
|Find the closest detection with the location|
|computed by Eq. (4)|
|Compute minimization of structural cost from the alternative|
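The search loop can be sketched in a few lines. Since Eq. (3) and Eq. (4) are not reproduced here, the sketch assumes a particular structural cost: the mean squared mismatch between detection-to-detection offsets in the current frame and object-to-object offsets in the previous frame. Under that assumption, the continuous minimizer is the average of the locations implied by each matched pair. All function names, the seed pair and the coordinates are hypothetical.

```python
def predict_location(target, match_set, dets, prev):
    """Continuous minimiser of the ASSUMED structural cost for one target.

    Each matched pair (i, j) implies the location dets[i] + (prev[target] - prev[j]);
    the minimiser of the summed squared mismatch is their mean.
    """
    xs = [dets[i][0] + prev[target][0] - prev[j][0] for i, j in match_set]
    ys = [dets[i][1] + prev[target][1] - prev[j][1] for i, j in match_set]
    return sum(xs) / len(xs), sum(ys) / len(ys)


def structural_cost(det, target, match_set, dets, prev):
    """ASSUMED cost: mean squared offset mismatch against the match set."""
    total = 0.0
    for i, j in match_set:
        dx = (dets[det][0] - dets[i][0]) - (prev[target][0] - prev[j][0])
        dy = (dets[det][1] - dets[i][1]) - (prev[target][1] - prev[j][1])
        total += dx * dx + dy * dy
    return total / len(match_set)


def heuristic_search(seed, dets, prev, threshold):
    """Grow a match set from a fixed seed pair (detection index, target index)."""
    match_set = [seed]
    while True:
        used_d = {i for i, _ in match_set}
        used_t = {j for _, j in match_set}
        free_d = [i for i in range(len(dets)) if i not in used_d]
        free_t = [j for j in range(len(prev)) if j not in used_t]
        if not free_d or not free_t:
            break
        best = None
        for t in free_t:
            px, py = predict_location(t, match_set, dets, prev)
            # Snap the continuous prediction to the closest free detection.
            d = min(free_d, key=lambda i: (dets[i][0] - px) ** 2 + (dets[i][1] - py) ** 2)
            c = structural_cost(d, t, match_set, dets, prev)
            if best is None or c < best[0]:
                best = (c, d, t)
        if best[0] >= threshold:
            break  # no remaining pair is structurally consistent
        match_set.append((best[1], best[2]))
    return match_set


# Hypothetical scene: three targets shifted by a camera motion of (5, 5),
# detections listed in shuffled order, plus one false positive at (40, 40).
prev = [(0, 0), (10, 0), (0, 10)]
dets = [(15, 5), (5, 5), (40, 40), (5, 15)]
good = heuristic_search((1, 0), dets, prev, threshold=1.0)
bad = heuristic_search((2, 0), dets, prev, threshold=1.0)
print(len(good), len(bad))
```

In this toy scene a correct seed pair grows into a match set that covers every target, while a seed built on the false positive stays at size one. This difference in match set size is what the later cost modification relies on.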
III-C Pairwise cost
In this section, we introduce the construction and the improvement of the raw pairwise costs. We exploit the movements and appearances of the targets to construct their raw association costs with the detections in the current frame. Specifically, we establish a velocity autoregressive model  for each individual target and predict the location of each object in the current frame. The Euclidean distance between each predicted position and every detection is taken as the raw motion cost for each possible pair. At the same time, we use the color histogram features of each target to design its raw appearance cost. The raw association cost for each pair is then generated by coupling the raw motion cost with the raw appearance cost. In Section III-B, we obtained the maximum match set for each possible pair, which helps us modify the raw association cost matrix according to Eq. (5).
The appearance model becomes ambiguous and ineffective when the targets look almost the same. Similarly, the motion model becomes unstable and ineffective when the video frames are captured by a moving camera. In these cases, the raw pairwise costs built from appearance and motion costs are all at almost the same level, which makes the data association assignment more ambiguous and challenging. After each raw pairwise cost is modified using Eq. (5), however, the more match pairs a match set contains, the lower the modified cost, and vice versa. This is because the size of the match set for each available pair is positively related to the structural similarity between the detections in the current frame and the historical tracklets, given the fixed pair association. The larger the match set, the more likely the detection and the target in the fixed pair are associated with each other, and thus the lower their association cost, and vice versa. This structural penalty therefore makes the raw association cost, which may be generated from ambiguous appearance features and unreliable motion cues, more discriminative.
Here, denotes the raw association cost of the possible pair . is the association cost of the possible pair under structural constraints; and denotes the number of elements in the maximum match set of the possible pair .
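The effect of the modification can be illustrated numerically. Eq. (5) is not reproduced here, so the sketch below assumes one simple monotone form, dividing each raw cost by the size of its maximum match set. Only the qualitative behavior (larger match set, lower cost) comes from the text, and the function name `modify_costs` and all values are hypothetical.

```python
def modify_costs(raw_cost, match_set_size):
    """Scale each raw pairwise cost by the size of its maximum match set.

    ASSUMPTION: modified = raw / size. The paper's Eq. (5) may use a
    different monotone function; only the direction of the effect is sourced.
    """
    modified = []
    for raw_row, size_row in zip(raw_cost, match_set_size):
        # max(n, 1) guards against an empty match set.
        modified.append([c / max(n, 1) for c, n in zip(raw_row, size_row)])
    return modified


# Hypothetical ambiguous raw costs: all pairs look almost equally plausible.
raw = [[2.0, 2.1], [2.1, 2.0]]
# Hypothetical match set sizes: the diagonal pairs are structurally consistent.
sizes = [[3, 1], [1, 3]]
mod = modify_costs(raw, sizes)
print(mod)
```

With these numbers the gap between the correct and incorrect pairs widens from 0.1 to more than 1.4, which is the kind of separation the structural penalty is meant to provide.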
III-D Association assignment
Data association aims to find the optimal assignment event, which corresponds to the minimum cost function, between detections and objects. In this paper, we apply the generalized linear assignment algorithm  to solve the data association assignment.
III-E Prediction of missing targets
Generally, false positive and false negative detections, as well as targets entering or leaving the field of view, may leave a target without any associated detection in the current frame in the final optimal assignment event. We call this the occurrence of a missing target. Here, we propose a target prediction method based on structural constraints as well as motion inertia. Most conventional prediction methods can only guarantee one-frame prediction accuracy because the measurement information is not updated continuously. We instead exploit both the target motion information, which runs along the time axis, and the structural information between the multiple targets, which is perpendicular to the time axis, to predict the location of missing targets. Because the structural information between the multiple objects is continually updated, the predicted location of a target is not extrapolated rigidly along its motion inertia, which also alleviates the effect of unpredictable camera motion on the motion cues.
We denote as the match set in the frame and as the set of missing targets. and are defined by the location of the object in the match set and the predicted location of the object , obtained by motion cues only, in the set of missing targets respectively. represents the predicted location of the object under the constraints of both structure and motion cues. The minimization is formulated by
The extreme point of the cost function in Eq. (7) is the optimal predicted location of a missing target. We set a time window in which we continuously predict the location of a missing target until the target appears again. If a target does not show up within the whole time window, its trajectory is considered to have ended.
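One way to picture this minimization is with a quadratic cost that has a closed-form extreme point. Eq. (7) is not reproduced here, so the sketch assumes the cost is the squared distance to the motion-only prediction plus a weighted sum of squared structural mismatches against the matched objects. The weight, the function name `predict_missing` and the coordinates are all hypothetical.

```python
def predict_missing(motion_pred, prev_target, matched, weight=1.0):
    """Closed-form minimiser of an ASSUMED motion-plus-structure cost.

    Assumed cost for location p:
        |p - motion_pred|^2
        + weight * sum_j |(p - cur_j) - (prev_target - prev_j)|^2
    matched is a list of (cur_j, prev_j) locations of objects in the match set.
    Setting the gradient to zero gives a weighted average of the motion-only
    prediction and the locations implied by each matched object.
    """
    sx, sy = float(motion_pred[0]), float(motion_pred[1])
    for (cx, cy), (px, py) in matched:
        sx += weight * (cx + prev_target[0] - px)
        sy += weight * (cy + prev_target[1] - py)
    denom = 1.0 + weight * len(matched)
    return sx / denom, sy / denom


# Hypothetical case: the motion model alone predicts (1, 0), but two matched
# neighbours both moved by (4, 3) because of camera motion.
matched = [((14, 3), (10, 0)), ((4, 13), (0, 10))]
p = predict_missing(motion_pred=(1, 0), prev_target=(0, 0), matched=matched)
print(p)
```

With an empty match set the function falls back to the motion-only prediction. As more matched neighbours are available, the estimate is pulled toward the location implied by the shared structure.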
In this section, we show the performance of the proposed method on the public data set 2DMOT2015 , together with the performance of other multi-object tracking approaches. The proposed heuristic search for structural constraints can largely diminish the association noise caused by moving cameras and similar appearances, and thus make the association cost more discriminative. It is a general framework in which the initial cost can be designed arbitrarily. In this paper, we establish a velocity autoregressive model for each target to obtain its motion cost, and extract the color histogram feature of each target to construct its appearance cost. In addition, we use the generalized linear assignment algorithm to solve the data association. For the termination of historical targets and the initialization of new targets, we use a fixed frame gap (10 in this paper). We simply terminate a target if it is not associated with any detection for 10 consecutive frames. If a detection in the current frame does not match any of the historical trajectories, we initialize it as a new target.
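The termination and initialization rule can be sketched as simple bookkeeping. The gap of 10 frames comes from the text; the `Track` class, the `update_tracks` function and the shape of the `matches` argument are hypothetical choices for illustration only.

```python
MAX_MISSED = 10  # frame gap stated in the text


class Track:
    """Hypothetical minimal track record."""

    def __init__(self, track_id, location):
        self.track_id = track_id
        self.location = location
        self.missed = 0  # consecutive frames without an associated detection


def update_tracks(tracks, detections, matches, next_id):
    """Apply one frame of associations.

    matches maps a track index to the index of its associated detection.
    Returns the surviving tracks and the next unused track id.
    """
    matched_dets = set(matches.values())
    survivors = []
    for idx, track in enumerate(tracks):
        if idx in matches:
            track.location = detections[matches[idx]]
            track.missed = 0
        else:
            track.missed += 1
        # Terminate after MAX_MISSED consecutive unmatched frames.
        if track.missed < MAX_MISSED:
            survivors.append(track)
    # Every unmatched detection starts a new target.
    for d, det in enumerate(detections):
        if d not in matched_dets:
            survivors.append(Track(next_id, det))
            next_id += 1
    return survivors, next_id


tracks, next_id = update_tracks([], [(1, 2)], {}, next_id=0)
print(len(tracks), next_id)
```

Starting a track from every unmatched detection is simple, but it also turns each false positive into a short-lived track, so the gap length trades missed recoveries against false alarms.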
The 2DMOT2015 data set provides both training and test sets, each of which contains 11 video sequences of different scenes together with pedestrian detections obtained by the Aggregate Channel Features (ACF) pedestrian detector . To evaluate the performance of the trackers, we adopt the widely used MOT evaluation metrics. Multiple Object Tracking Accuracy (MOTA) is regarded as a comprehensive measure that accounts for false positives, missed detections and identity switches. Multiple Object Tracking Precision (MOTP) measures the misalignment between the annotated and the predicted bounding boxes. For both MOTA and MOTP, a higher value represents better performance. In addition, we use other evaluation metrics such as Mostly Tracked targets (MT), Mostly Lost targets (ML), the average number of false alarms per frame (FAF), False Positives (FP), False Negatives (FN) and identity switches (ID Sw.).
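As a small worked illustration of how the error counts combine, the sketch below computes MOTA with the standard CLEAR MOT formula, MOTA = 1 - (FN + FP + IDSW) / GT, where GT is the total number of ground-truth objects over all frames. The function name `mota` and the counts in the example are hypothetical and are not results from this paper.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple Object Tracking Accuracy from accumulated error counts.

    num_gt is the total number of ground-truth object instances summed over
    all frames. The value can be negative when errors outnumber ground truth.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


# Hypothetical counts, for illustration only.
score = mota(false_negatives=300, false_positives=100, id_switches=20, num_gt=1000)
print(round(score, 3))
```

Because the three error types are simply summed, a tracker with conservative settings can trade fewer false positives for more identity switches without changing MOTA much.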
In the test set, the camera keeps moving in five scenes: ADL-Rundle-1, ETH-Crossing, ETH-Jelmoli, ETH-Linthescher, and KITTI-19. The performance of the proposed method on these sequences is shown in Table I. We also compare our method with other online trackers, including RMOT , TC_ODAL  and RNN_LSTM . RMOT constructs a relative motion network to deal with unexpected camera motion. TC_ODAL pays close attention to learning discriminative appearance features of the targets. RNN_LSTM proposes an end-to-end learning approach for online multi-object tracking. The proposed method achieves the best or second best performance in MOTA as well as FP on the sequences captured by moving cameras. Moving cameras inevitably make the objects’ motion cues unstable and unpredictable, so the association cost computed from the motion model becomes much more ambiguous and less discriminative. However, the internal structure among objects in adjacent frames remains largely constant and stable, which can be used to compensate for the weakened motion cues. RMOT also uses structural constraints among objects, but its performance is worse than that of our method except on ETH-Jelmoli.
Table II shows the overall performance of our method and other online trackers, including RMOT , TC_ODAL  and RNN_LSTM , as well as offline trackers, including ALExTRAC , SegTrack  and DCO_X . ALExTRAC uses appearance cues to learn an affinity model that estimates the data association cost. SegTrack proposes a unified CRF model for joint tracking and segmentation of multiple targets. DCO_X models the data association problem as the minimization of a unified discrete-continuous energy function. From Table II, we can conclude that our method achieves better performance on MOTA, FAF, ML and FP. DCO_X obtains the best performance on identity switches. As an offline tracker, DCO_X handles both detection-to-tracklet and tracklet-to-tracklet association, so each tracklet tends to be longer and the number of tracklets is smaller, which to some extent lessens the number of identity switches but aggravates false positive detections. The same is true of the other trackers.
The quality of the recovery of missing targets directly affects the overall performance of a tracker. If the recovery method works well, it reduces both the false negatives and the identity switches. Otherwise, it barely decreases the false negatives and may even increase the false positives. Therefore, when setting the parameters of our tracker, we adopt relatively conservative strategies, which explains the low FP and high identity switch counts of our tracking results in Table II. Nevertheless, the FN of our tracker is not as high as we expected and its ML is better than that of the other trackers, which indicates that our recovery method for missing targets performs well compared with the other approaches.
Figure 3 shows examples of the recovery of missing targets by several online MOT trackers, obtained from the MOT Benchmark . The first column shows the pedestrian detections on the TUD-Crossing and ETH-Linthescher sequences. The following columns show the tracking results of RNN_LSTM, RMOT, TC_ODAL, and our tracker, respectively. The bounding boxes with solid lines in the first column represent the detections generated by the ACF pedestrian detector , and the blue dashed boxes indicate targets that appeared in the past several frames and are now abruptly missing. The tracking results illustrate that our tracker manages to recover the missing targets in both frames. RNN_LSTM and TC_ODAL each recover the missing target once, and RMOT fails to recover the missing targets.
When video sequences are captured by a moving camera, the motion of the targets becomes unsteady, unpredictable and ambiguous. In this paper, we propose a new heuristic approach that searches for the optimal internal structural constraints between multiple targets, which can be used to alleviate the ambiguities of the objects’ motion costs and thereby make each pairwise association cost much more discriminative. Furthermore, we model the recovery of missing targets as the minimization of a cost function constructed from both motion and structure cues. Experimental results show that the proposed method achieves encouraging performance on the MOT Challenge dataset. In future work, we intend to establish a dynamic model for the structural constraint to make it more stable and predictable.
This work was supported by the National Natural Science Foundation of China under Grant Nos: 61273366 and 61231018 and the program of introducing talents of discipline to university under grant no: B13043.
-  L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler, “Motchallenge 2015: Towards a benchmark for multi-target tracking,” arXiv preprint arXiv:1504.01942, 2015.
-  F. Solera, S. Calderara, and R. Cucchiara, “Learning to divide and conquer for online multi-target tracking,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4373–4381.
-  A. Bewley, L. Ott, F. Ramos, and B. Upcroft, “Alextrac: Affinity learning by exploring temporal reinforcement within association chains,” in Robotics and Automation (ICRA), 2016 IEEE International Conference on. IEEE, 2016, pp. 2212–2218.
-  S.-H. Bae and K.-J. Yoon, “Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 1218–1225.
-  C. Dicle, O. I. Camps, and M. Sznaier, “The way they move: Tracking multiple targets with similar appearance,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2304–2311.
-  A. Andriyenko and K. Schindler, “Multi-target tracking by continuous energy minimization,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 1265–1272.
-  A. Milan, S. Roth, and K. Schindler, “Continuous energy minimization for multitarget tracking,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 1, pp. 58–72, 2014.
-  A. Milan, K. Schindler, and S. Roth, “Multi-target tracking by discrete-continuous energy minimization,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 10, pp. 2054–2068, 2016.
-  H. Possegger, T. Mauthner, P. M. Roth, and H. Bischof, “Occlusion geodesics for online multi-object tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1306–1313.
-  R. T. Collins, “Multitarget data association with higher-order motion models,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 1744–1751.
-  J. Liu, P. Carr, R. T. Collins, and Y. Liu, “Tracking sports players with context-conditioned motion models,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1830–1837.
-  B. Yang and R. Nevatia, “An online learned crf model for multi-target tracking,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2034–2041.
-  A. Milan, L. Leal-Taixé, K. Schindler, and I. Reid, “Joint tracking and segmentation of multiple targets,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5397–5406.
-  H.-U. Kim and C.-S. Kim, “Cdt: Cooperative detection and tracking for tracing multiple objects in video sequences,” in European Conference on Computer Vision. Springer, 2016, pp. 851–867.
-  S. Tang, M. Andriluka, and B. Schiele, “Detection and tracking of occluded people,” International Journal of Computer Vision, vol. 110, no. 1, pp. 58–69, 2014.
-  J. H. Yoon, M.-H. Yang, J. Lim, and K.-J. Yoon, “Bayesian multi-object tracking using motion context from multiple objects,” in Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on. IEEE, 2015, pp. 33–40.
-  J. Hong Yoon, C.-R. Lee, M.-H. Yang, and K.-J. Yoon, “Online multi-object tracking via structural constraint event aggregation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1392–1400.
-  A. Dehghan, Y. Tian, P. H. Torr, and M. Shah, “Target identity-aware network flow for online multiple target tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1146–1154.
-  V. Chari, S. Lacoste-Julien, I. Laptev, and J. Sivic, “On pairwise costs for network flow multi-object tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5537–5545.
-  A. A. Butt and R. T. Collins, “Multi-target tracking by lagrangian relaxation to min-cost network flow,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1846–1853.
-  C. Haubold, J. Aleš, S. Wolf, and F. A. Hamprecht, “A generalized successive shortest paths solver for tracking dividing targets,” in European Conference on Computer Vision. Springer, 2016, pp. 566–582.
-  H. Jiang, S. Fels, and J. J. Little, “A linear programming approach for multiple object tracking,” in Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on. IEEE, 2007, pp. 1–8.
-  C. Huang, B. Wu, and R. Nevatia, “Robust object tracking by hierarchical association of detection responses,” in European Conference on Computer Vision. Springer, 2008, pp. 788–801.
-  S. Tang, B. Andres, M. Andriluka, and B. Schiele, “Subgraph decomposition for multi-target tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5033–5041.
-  A. Dehghan, S. Modiri Assari, and M. Shah, “Gmmcp tracker: Globally optimal generalized maximum multi clique problem for multiple object tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4091–4099.
-  S. Zhang, Y. Gong, J.-B. Huang, J. Lim, J. Wang, N. Ahuja, and M.-H. Yang, “Tracking persons-of-interest via adaptive discriminative features,” in European Conference on Computer Vision. Springer, 2016, pp. 415–433.
-  S. Hamid Rezatofighi, A. Milan, Z. Zhang, Q. Shi, A. Dick, and I. Reid, “Joint probabilistic data association revisited,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3047–3055.
-  P. Dollár, R. Appel, S. Belongie, and P. Perona, “Fast feature pyramids for object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 8, pp. 1532–1545, 2014.
-  K. Bernardin and R. Stiefelhagen, “Evaluating multiple object tracking performance: the clear mot metrics,” EURASIP Journal on Image and Video Processing, vol. 2008, no. 1, p. 246309, 2008.
-  A. Milan, S. H. Rezatofighi, A. R. Dick, I. D. Reid, and K. Schindler, “Online multi-target tracking using recurrent neural networks,” in AAAI, 2017, pp. 4225–4232.