With the development of computer vision techniques, more and more researchers have focused on understanding the behavior and context of objects from visual information. Tracking targets in video sequences, a core topic with wide applications in video surveillance, has advanced rapidly with the rise of tracking-by-detection (TBD) methods. TBD reconstructs the states of targets from detection responses by assigning an identity to each detection and optimizing the trajectories [2, 21]. The success of TBD in recent years has raised interest in a more challenging topic: multi-object tracking (MOT) with an unknown number of targets. MOT remains difficult because of complex sequence settings, e.g., intricate target trajectories, varying illumination, and camera movement.
The MOT problem can be handled in an online fashion, which suits time-critical applications. However, traditional online methods are susceptible to outliers caused by occlusions and noise, e.g., false positives, false negatives, and duplicate detections of a single target. These outliers can cause ambiguities in data association. Some methods tackle the problem with sparse appearance models [19, 28], and others by predicting states in future frames. However, the dynamics and appearances of targets are unpredictable in some cases. Batch tracking methods handle outliers more easily than online methods through global optimization of associations and trajectories. Terms that penalize mutual exclusions and the number of tracklets [21, 9] have been added to the energy function to regularize trajectories.
Despite the advantages of batch methods, one major problem is that the global optimization involves frames from the whole sequence, which does not suit real-time applications. Some batch methods also require initial solutions. We therefore propose a method that combines the advantages of online and batch methods while avoiding their disadvantages. We derive an iterative Approximate-Shrink Scheme (A-S Scheme) from the maximum a posteriori (MAP) formulation using sequential approximation. We show that the state space can be effectively shrunk, but conflicts may arise in the sequential optimization and the results may vary with the order of optimization. To avoid these problems, an Ambiguity-Clearness Graph (A-C Graph) is formulated to efficiently represent the tracklet fragments and ambiguities in the association. A set of rules and procedures is defined for changes of nodes and edges in the graph, e.g., connections, disconnections, transforms, and merges. A sliding Window-of-Ambiguity (WOA) is defined in the A-C Graph for sequential optimization of layers in the graph. Based on the A-C Graph and the sliding WOA optimization, MOT is conducted in a window-wise manner, which disambiguates the association and accelerates the optimization. We also show that traditional online and batch approaches fit into this framework with different window sizes.
Our main contributions can be summarized as follows: (1) an approximate-shrink scheme that iteratively approximates the global optimization; (2) a window-wise optimization framework, based on the novel A-C Graph, that encompasses traditional online and batch methods; and (3) a unified analysis of window-wise approaches with different window sizes using a search tree.
2 Related Works
Unlike earlier tracking methods [24, 12], TBD reconstructs the trajectories of targets by associating detections provided by object detectors. Most researchers build their MOT algorithms on the TBD framework, and these algorithms can be categorized into online and batch approaches.
In batch tracking approaches [21, 22, 2, 7, 25, 10, 8], a conditional random field (CRF) is often used to learn and model affinities such as appearance and motion in order to discriminate among different trajectories [29, 30]. A global and pairwise model is learned online to form an energy function, which is minimized offline via heuristic search. Despite the popularity of the CRF model, it requires extensive training. Continuous energy models were introduced in a series of works [21, 22, 2]. Milan et al. [21] built a comprehensive continuous energy function by linearly combining terms for appearance, motion, mutual exclusion, trajectory persistence, etc. Continuous energy functions are easier to optimize than discrete ones, but they have many parameters and are hard to tune. Network flow was first applied to tracking by Zhang et al. [32]. A graph is formed with the states of targets as nodes and the associations as edges. The likelihoods of the states are represented as the capacities of edges. Butt and Collins [7] improved the network structure by defining each node as a candidate pair of matching observations between consecutive frames. To better model occlusions, Segal and Reid [25] designed a latent data association framework. Instead of assigning each detection to a corresponding track, they assume each detection is its own track and assign latent data to each node to represent the association. Besides general modeling of targets, some works track targets with specific characteristics; e.g., Dicle et al. [10] focus on tracking targets with similar appearance but different motion patterns.
Online tracking [3, 4, 5, 6, 11, 31, 19] has become increasingly popular. Network flow has also been adopted in online tracking. Ben Shitrit et al. [5] formulate multi-object tracking as a multi-commodity network flow problem. They use sparse appearance to reduce computational complexity. Lu et al. [19] constructed a dictionary from already tracked objects and assigned new detections by minimizing an L1-regularized function. Wang et al. found that the representation residuals follow a Laplacian distribution, and used this to improve the sparse representation method for tracking. The Hungarian algorithm was first introduced into tracking problems by Joo and Chellappa [14] to solve the bipartite graph model they proposed. The frame-by-frame scheme of online tracking benefits greatly from the Hungarian algorithm. Bae and Yoon [3] designed a tracklet confidence measure that considers length, occlusion, and affinity. Different strategies are applied to tracklets with high and low confidence, and the Hungarian algorithm is employed for local and global association respectively. However, applying the Hungarian algorithm only between consecutive frames is greedy over time, so it may miss the global optimum and cause identity switches. Besides the Hungarian algorithm for association, the Bayesian framework is one of the most popular models for target modeling. Bae and Yoon [4] improved their previous work [3] by performing data association with a track existence probability: the provided detections are associated with the existing tracks, and the corresponding track existence probabilities are updated. Yoon et al. [31] constructed a Relative Motion Network (RMN) to factor out camera motion by considering the motion context from multiple objects, and incorporated the RMN into a Bayesian framework.
3 Approximate-Shrink Scheme
Given observations of a real-time video sequence, where denotes the number of observations in frame , we assume that (1) each observation corresponds to a state , (2) states in the same frame are independent, and (3) some of the states are already clear given the observations. The maximum a posteriori (MAP) formulation of MOT is
Based on Assumption (2), we resolve as
Assumption (3) offers us an intuition that there exist some states . Denote . We name Clear states (C states) and Ambiguous states (A states). The global optimization in Equation 2 can be relaxed to
Performing these two optimizations separately is an approximation to Equation 2. First, we sequentially optimize every state in (approximation step) via Equation 4. Then we set fixed as the evidence for , and derive Equation 3 to
(shrink step). We iteratively find the , , let and repeat the above steps to shrink the search space.
This Approximate-Shrink Scheme (A-S Scheme) iteratively searches and narrows down the state space. serve as nuclei of trajectories in the space, which attract states to associate with them. Some nuclei merge during the iteration to form longer tracklets. However, the space is still too large, and convergence is not guaranteed. More approximations are needed to speed up the scheme and ensure its convergence. Moreover, a data structure is needed to avoid conflicts among the associations of states in and to remove the effect of the optimization order on the results. Therefore, we propose a self-organizing A-C Graph and a window-wise optimization framework to meet these demands.
4 Window-wise Optimization for Tracking
4.1 Ambiguous-Clearness Graph
Given states and observations (the detections serve as observations in TBD multi-object tracking) in a real-time video sequence and predefined thresholds and (the values of and are given in Section 5), we define state to be the parent of state if and there exists an association between and ; correspondingly, is the child of . ( and are only used as examples for clarity of illustration; they do not indicate particular states.) A determined parent of a state is its only parent, with an association whose affinity score is greater than . We now formally define the C states and A states. If a state has one determined parent or has no parent, is a Clear state (C state), denoted as . Conversely, if has parent states but no determined parent, it is an Ambiguous state (A state), denoted as . Note that a C state can only have zero or one parent. All the parents of a state form its active set. We require that a state have at most one C state as its child, and that the frame number of any of its A state children be smaller than that of its C state child. The observations corresponding to and are denoted as and . A clear association is the association between a clear state and its parent, and a tracklet is defined as a group of states connected by clear associations. The tracklet including is denoted as . The C states in after are defined as the descendants of . By taking states and associations as vertices and edges, we form the A-C Graph of the MOT problem. In this paper, we use the terms states and associations instead of vertices and edges when discussing the A-C Graph. The A-C Graph of the TUD-Stadtmitte dataset is visualized in Figure 1, where clear associations are shown as solid lines and states belonging to the same tracklet are shown in the same color.
Because associations are directed from parent to child, the A-C Graph is a directed acyclic graph. In an A-C Graph, we define a time period from to () such that there are only clear associations in to and to as the Window-of-Ambiguity (WOA). Tracklets outside the WOA are determined and fixed, and changes to states and associations can only take place inside the WOA. One can restrict the size of the state space by setting the length of the WOA.
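The C state and A state definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `State` class, its `parents` dictionary, the helper names, and the threshold value used in the usage lines are hypothetical, since the original symbols and threshold values are not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """One vertex of the graph; `name` and `frame` are illustrative fields."""
    name: str
    frame: int
    # Maps parent name -> affinity score of the association with that parent.
    parents: dict = field(default_factory=dict)


def determined_parent(state, clear_threshold):
    """Return the name of the determined parent, or None.

    A parent counts as determined when it is the only parent and the
    affinity of the association exceeds the (assumed) threshold.
    """
    if len(state.parents) != 1:
        return None
    (parent, affinity), = state.parents.items()
    return parent if affinity > clear_threshold else None


def is_clear(state, clear_threshold):
    """C state: no parent at all, or exactly one determined parent."""
    if not state.parents:
        return True
    return determined_parent(state, clear_threshold) is not None


# Usage with a hypothetical threshold of 0.8.
root = State("a", frame=1)                                   # no parent -> C state
confident = State("b", frame=2, parents={"a": 0.9})          # determined parent -> C state
contested = State("c", frame=2, parents={"a": 0.6, "x": 0.5})  # two parents -> A state
weak = State("d", frame=2, parents={"a": 0.3})               # one weak parent -> A state
```

Storing the affinity on the child side keeps the check local: whether a state is clear depends only on its own parent set, which matches the definition of the active set given above.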
As mentioned in Section 3, actions on the A-C Graph should help avoid conflicts, e.g., multiple parents for a C state, multiple C state children, or clear associations forming a cycle. Meanwhile, the actions should be symmetrical to avoid effects of chronological order. The basic actions on the A-C Graph are initializations, disconnections, connections, and merges between two states. Table 1 shows the functions and symbols used in defining these actions.
|Functions and Symbols|Description|
| |Check whether the is empty.|
| |Find the C States in the .|
| |Find all the parents of .|
| |Find all the children of .|
| |Find the frame number of .|
| |Judge whether is a clear state.|
| |The affinity scores between all the parents of and .|
For a newly entered state , we first initialize its active set by enumerating all the potential parents. As specified in Section 4.1, is able to connect with states in previous frames that either have no C state child or whose C state child is after . Procedure 1 shows the pseudocode for initializing the active set.
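The eligibility rule for potential parents can be sketched as follows. This is a minimal illustration, not the original Procedure 1: the dictionary layout (`name`, `frame`, `c_child_frame`) and the function name are assumptions made for the example.

```python
def init_active_set(new_frame, candidates):
    """Collect the potential parents of a state that enters in `new_frame`.

    Each candidate is a dict with hypothetical keys:
      'name'          - identifier of the candidate state
      'frame'         - frame number of the candidate state
      'c_child_frame' - frame number of its C state child, or None if it has none
    """
    active = []
    for cand in candidates:
        # Only states in previous frames can be parents.
        if cand["frame"] >= new_frame:
            continue
        c_child = cand["c_child_frame"]
        # Eligible if there is no C state child, or the C state child lies
        # after the new state's frame.
        if c_child is None or c_child > new_frame:
            active.append(cand["name"])
    return active


candidates = [
    {"name": "p1", "frame": 3, "c_child_frame": None},  # no C child: eligible
    {"name": "p2", "frame": 3, "c_child_frame": 4},     # C child before frame 5: not eligible
    {"name": "p3", "frame": 2, "c_child_frame": 7},     # C child after frame 5: eligible
    {"name": "p4", "frame": 5, "c_child_frame": None},  # same frame: not eligible
]
active_set = init_active_set(5, candidates)  # ['p1', 'p3']
```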
We disconnect two states and by removing the association between them, and update these two states.
As shown in Procedure 2, we assign to as an A state child. The procedure terminates if is already a C state. If not, we check the descendants of . If has no descendants, we directly add an association between and ; otherwise, we find the nearest C state descendant in the tracklet of not after . If is in frame , the procedure terminates. If is before , we add the association between and .
Procedure 5 illustrates the action in which is connected to as a C state child. If is currently not a C state, the existing parents of are removed. If does not have C state children, we directly add a connection between and ; otherwise, we find ’s latest C state descendant not after . If is in frame , and are merged via Procedure 6. If is before , an association is added between and . All the A and C state children of after are removed from and reconnected to following Procedures 2 and 5 respectively. If is currently a C state and is not, is inserted into ’s tracklet using Procedure 5 if there is no state in frame in ’s tracklet, and using Procedure 6 if there is one. If and are both C states, the two tracklets and are grouped into one by recursively calling Procedures 5 and 6, as shown in Procedure 5. If one of the two states is in a tracklet, the other state is inserted into that tracklet.
Procedure 6 describes the process of merging to in the same frame. As we cannot change states and tracklets outside the WOA, we ensure that and are not both C states, to avoid merging states outside the WOA. For the descendants of and , we recursively merge them into one tracklet by Procedure 5. For an A state child of , we simply remove the association between and and connect it to via Procedure 2.
Although the actions are recursive, it can be easily proved that the recursion in Procedures 2, 5 and 6 terminates, and that the order in which actions are carried out on a set of states does not affect the structure of the A-C Graph. Visualizations of these actions on the TUD-Stadtmitte dataset can be found in Figure 1. In Figure 1(b), newly entered states to are connected to their initial active sets via Procedures 1, 2 and 5. From Figure 1(a) to 1(b), was connected to as a C state child by Procedure 5, and merged with using Procedure 6.
4.3 Sliding Window Optimization
For a real-time sequence, new states from the latest frame are continuously added to the A-C Graph. The WOA should slide to keep its size from growing too large and to remove ambiguities in order to generate tracks. We therefore set the upper bound of the size of the WOA to .
The sliding window optimization consists of three steps. First, for all the newly entered states in frame , , we find the active sets via Procedure 1 and compute the affinity score between and each state in the corresponding active set. If , do Procedure 5 with as input. If , do Procedure 2 with as input. Second, from frame to , we sequentially recompute the affinity scores between the states in each frame and their parents, and reconnect them according to the new affinities. Third, the Hungarian algorithm is carried out on the states in frame and their parent states to get the best assignment of associations and clear all the ambiguity in frame . All states in frame are transformed into C states and the WOA shifts forward. If has not reached the end, and return to the first step; otherwise, and redo the third step. The outline of the optimization process is shown in Procedure 7.
The sliding window optimization carries out the A-S Scheme in a window-wise manner. Procedures 5 and 6 in steps one and two serve as the approximation step, and updating the affinity scores in step two corresponds to the shrink step. Step three forces the states in frame to determine their connections, which guarantees convergence.
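The third step of the sliding window optimization resolves the remaining ambiguity in one frame with a one-to-one assignment between states and their parents. The sketch below illustrates that step only. It substitutes an exhaustive search over permutations for the Hungarian algorithm (the optimum is the same, but the cost grows factorially, so it is only suitable for tiny examples). The function name `best_assignment` and the affinity matrix are hypothetical, and the sketch assumes there are at least as many parents as children and does not model states that start a new track.

```python
from itertools import permutations


def best_assignment(affinity):
    """Return (assignment, total) maximizing the summed affinity.

    affinity[i][j] is the score between child i and parent j.
    assignment[i] is the parent index chosen for child i; each parent is
    used at most once. Assumes len(affinity) <= len(affinity[0]).
    """
    n_children = len(affinity)
    n_parents = len(affinity[0])
    best_score, best_perm = float("-inf"), None
    # Brute-force stand-in for the Hungarian algorithm.
    for perm in permutations(range(n_parents), n_children):
        score = sum(affinity[i][perm[i]] for i in range(n_children))
        if score > best_score:
            best_score, best_perm = score, perm
    return list(best_perm), best_score


# Child 0 slightly prefers parent 0, but child 1 can only use parent 0.
affinity = [
    [1.0, 0.75],
    [0.75, 0.0],
]
assignment, total = best_assignment(affinity)  # ([1, 0], 1.5)
```

Picking the best parent for each child in turn would give child 0 to parent 0 and leave child 1 with a total of 1.0; the joint assignment reaches 1.5, which is the reason for solving the frame as a whole.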
4.4 Online, Delayed and Batch Methods
Based on the definition of the A-C Graph and the sliding window optimization, we form a window-wise framework that includes online (), delayed (), and batch () methods. Figure 2 demonstrates the formation of a trajectory starting from in the A-C Graph via these three methods. The window-wise optimization finds a relatively small search tree according to at each iteration. For an online method (Figure 2(b)), and the search is greedy. For a delayed method (Figure 2(c)), heuristic search is conducted in . The search space remains unchanged for a batch method (Figure 2(d)), so local search methods, e.g., hill climbing and simulated annealing, are often used to move iteratively toward a local optimum. The experimental analysis of the relation between and the optimization results is provided in Section 5.2.
5 Experimental Evaluation
Affinity model: We implemented a basic affinity model, following , which includes the appearance model , motion model and shape model . The appearance model measures the Bhattacharyya distance between the histograms of and . If is in a tracklet , instead of using the Incremental Linear Discriminant Analysis (ILDA) used in , we simply average the appearance histograms of all states in using an exponential discount factor. A first-order Kalman filter is applied to smooth and predict the positions of the targets and the shapes of the bounding boxes. We compute the normalized distances of target positions and bounding box shapes and map them to a Gaussian distribution to get the affinity scores. The overall affinity combines the appearance, motion, and shape models.
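The ingredients of this affinity model can be sketched as below. The code is an illustration under assumptions rather than the authors' implementation: the text does not state which form of the Bhattacharyya distance is used (the sketch takes sqrt(1 - BC), one common form), the discount factor and sigma values are placeholders, all function names are hypothetical, and how the three terms are combined into the overall affinity is not shown.

```python
import math


def bhattacharyya_distance(h1, h2):
    """Distance between two normalized histograms (each sums to 1).

    Uses sqrt(1 - BC), where BC is the Bhattacharyya coefficient.
    """
    bc = sum(math.sqrt(p * q) for p, q in zip(h1, h2))
    bc = min(max(bc, 0.0), 1.0)  # guard against rounding outside [0, 1]
    return math.sqrt(1.0 - bc)


def discounted_average(histograms, discount):
    """Exponentially discounted mean of a tracklet's histograms.

    The last histogram is the most recent and gets weight 1; a histogram
    that is `age` steps older gets weight discount ** age.
    """
    n_bins = len(histograms[0])
    total = [0.0] * n_bins
    weight_sum = 0.0
    for age, hist in enumerate(reversed(histograms)):
        weight = discount ** age
        weight_sum += weight
        for k in range(n_bins):
            total[k] += weight * hist[k]
    return [value / weight_sum for value in total]


def gaussian_affinity(distance, sigma):
    """Map a normalized distance to (0, 1] with an unnormalized Gaussian."""
    return math.exp(-distance ** 2 / (2.0 * sigma ** 2))


# Tracklet appearance: the newer histogram counts twice as much as the older one.
tracklet_hist = discounted_average([[1.0, 0.0], [0.0, 1.0]], discount=0.5)
same = bhattacharyya_distance([0.5, 0.5], [0.5, 0.5])      # identical histograms
disjoint = bhattacharyya_distance([1.0, 0.0], [0.0, 1.0])  # no overlap
```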
Dataset description: We use the MOT Benchmark [18] for training and evaluation, which contains sequences for both training and testing. In total, there are frames, for the training set and for the testing set. The sequences have different frame rates and resolutions, and only pedestrians are tracked.
Parameter Settings: In our experiment, the and . We estimate the length of every occlusion (number of frames with overlap0.4) in the training set of the MOT Benchmark and study the distribution of occlusion lengths. As shown in Figure 3, about of the overlaps are within , and of which are within . Therefore, the delayed time is set to and the length of the WOA is set to frame rate × delayed time. The variance of the Gaussian distribution in the motion model and shape model is . Other parameters of the affinity model are the same as in .
5.2 Analysis of Window-of-Ambiguity
To analyze the connection of WOA size and the quality of the window-wise optimization, we define the energy of an A-C Graph as
Figure 4 presents the final energy for varying WOA sizes on TUD-Stadtmitte (number of frames ), TUD-Campus (number of frames ) and PETS-S2L2 (number of frames ) in the MOT Benchmark. The X-axis is in logarithmic scale. Interestingly, the final energy of these sequences decreases significantly when the window size grows from to , while the decrease becomes much slower when . The settings of these sequences, e.g., target density and viewpoint, are different, but the patterns of energy change remain almost identical. It is likely that the trend of the final energy depends only on the WOA size . Moreover, the tracking results can be much improved with a small WOA compared to the online method, which experimentally illustrates that delayed methods perform better than online ones in the window-wise optimization framework. The final energy does not decrease much when grows larger than . This indicates that the sliding window approximation has only a minor effect on the final performance, and that growing the WOA further becomes a trade-off between speed and better results.
5.3 Performance Evaluation
Evaluation Metrics: We apply the CLEAR MOT metrics [15] and the metrics of [29, 16] when evaluating our results. The multiple object tracking accuracy (MOTA) is a combined accuracy based on the numbers of false positives (FP), identity switches (IDS) and missed targets (FN). The multiple object tracking precision (MOTP) measures the overlap of bounding boxes between the ground truth and the tracker output. MT and ML indicate the numbers of mostly tracked and mostly lost targets. FG represents the number of fragmented tracks.
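For reference, MOTA in the CLEAR MOT metrics combines the three error counts into one score, MOTA = 1 - (FN + FP + IDS) / GT, where GT is the total number of ground-truth objects summed over all frames. A minimal sketch follows; the function and argument names are chosen for this example, and the counts in the usage line are made up for illustration.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple object tracking accuracy from accumulated error counts.

    num_gt is the number of ground-truth objects summed over all frames.
    The score can be negative when the errors outnumber the ground truth.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt


score = mota(false_negatives=25, false_positives=15, id_switches=10, num_gt=200)  # 0.75
```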
Evaluation: As shown in Table 2, our method clearly outperforms the TC_ODAL method using the same affinity model, and not only in MOTA. On some datasets, as shown in Table 3, our method with the basic affinity model even reaches the performance of methods that use state-of-the-art affinity models.
|AC-MOT(Proposed affinity of )||Delayed|
This paper proposed an A-S Scheme for sequential approximation and a window-wise optimization framework based on the A-C Graph. The core idea of the method is to cluster the states subject to several constraints, e.g., states in the same frame cannot be clustered into one group. The A-C Graph together with the sliding window optimization transforms the global clustering into a sequential local clustering that self-organizes the structure in a relatively small state space; this can be done efficiently with little harm to the handling of occlusions. We showed experimentally that the characteristics of the window-wise optimization framework rarely change with the varying settings of the sequences. As the affinity model serves as the distance metric in clustering, it can influence the clustering results. Therefore, a comparison of optimization models is fair when similar affinity models are adopted. The experimental results show that, even with the basic affinity model, our method achieved competitive performance in an unfair test. Our future work is to incorporate more state-of-the-art affinity models into the window-wise optimization model. We also plan to design a unified interface that makes it much easier to embed affinity models into different optimization models.
- [1] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin. Network flows. Prentice Hall, 1993.
- [2] A. Andriyenko, K. Schindler, and S. Roth. Discrete-continuous optimization for multi-target tracking. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1926–1933. IEEE.
- [3] S.-H. Bae and K.-J. Yoon. Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1218–1225. IEEE.
- [4] S.-H. Bae and K.-J. Yoon. Robust online multi-object tracking with data association and track management. 2014.
- [5] H. Ben Shitrit, J. Berclaz, F. Fleuret, and P. Fua. Multi-commodity network flow for tracking multiple people. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(8):1614–1627, 2014.
- [6] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool. Online multiperson tracking-by-detection from a single, uncalibrated camera. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(9):1820–1833, 2011.
- [7] A. A. Butt and R. T. Collins. Multi-target tracking by Lagrangian relaxation to min-cost network flow. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1846–1853. IEEE.
- [8] C. Canton-Ferrer, J. R. Casas, M. Pardàs, and E. Monte. Multi-camera multi-object voxel-based Monte Carlo 3D tracking strategies. EURASIP Journal on Advances in Signal Processing, 2011(1):1–15, 2011.
- [9] W. Choi. Near-online multi-target tracking with aggregated local flow descriptor. arXiv preprint arXiv:1504.02340, 2015.
- [10] C. Dicle, O. I. Camps, and M. Sznaier. The way they move: Tracking multiple targets with similar appearance. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 2304–2311. IEEE.
- [11] C. Fantacci, B.-N. Vo, B.-T. Vo, G. Battistelli, and L. Chisci. Consensus labeled random finite set filtering for distributed multi-object tracking. arXiv preprint arXiv:1501.01579, 2015.
- [12] T. E. Fortmann, Y. Bar-Shalom, and M. Scheffe. Sonar tracking of multiple targets using joint probabilistic data association. Oceanic Engineering, IEEE Journal of, 8(3):173–184, 1983.
- [13] A. Geiger, M. Lauer, C. Wojek, C. Stiller, and R. Urtasun. 3D traffic scene understanding from movable platforms. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(5):1012–1025, 2014.
- [14] S.-W. Joo and R. Chellappa. A multiple-hypothesis approach for multiobject visual tracking. Image Processing, IEEE Transactions on, 16(11):2849–2854, 2007.
- [15] K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP Journal on Image and Video Processing, 2008, 2008.
- [16] C.-H. Kuo and R. Nevatia. How does person identity recognition help multi-person tracking? In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1217–1224. IEEE.
- [17] L. Leal-Taixé, M. Fenzi, A. Kuznetsova, B. Rosenhahn, and S. Savarese. Learning an image-based motion context for multiple people tracking. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 3542–3549. IEEE, 2014.
- [18] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler. MOTChallenge 2015: Towards a benchmark for multi-target tracking. arXiv preprint arXiv:1504.01942, 2015.
- [19] W. Lu, C. Bai, K. Kpalma, and J. Ronsin. Multi-object tracking using sparse representation. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 2312–2316. IEEE.
- [20] A. Milan, L. Leal-Taixé, K. Schindler, and I. Reid. Joint tracking and segmentation of multiple targets.
- [21] A. Milan, S. Roth, and K. Schindler. Continuous energy minimization for multitarget tracking. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(1):58–72, 2014.
- [22] A. Milan, K. Schindler, and S. Roth. Detection- and trajectory-level exclusion in multiple object tracking. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3682–3689. IEEE.
- [23] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1201–1208. IEEE, 2011.
- [24] D. B. Reid. An algorithm for tracking multiple targets. Automatic Control, IEEE Transactions on, 24(6):843–854, 1979.
- [25] A. V. Segal and I. Reid. Latent data association: Bayesian model selection for multi-target tracking. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 2904–2911. IEEE.
- [26] A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: An experimental survey. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(7):1442–1468, 2014.
- [27] B. Wang and F. Wang. Multi-object tracking using least absolute deviation. In Image and Signal Processing (CISP), 2014 7th International Congress on, pages 60–65. IEEE.
- [28] D. Wang, H. Lu, and M.-H. Yang. Least soft-threshold squares tracking. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 2371–2378. IEEE.
- [29] B. Yang and R. Nevatia. An online learned CRF model for multi-target tracking. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2034–2041. IEEE.
- [30] B. Yang and R. Nevatia. Multi-target tracking by online learning a CRF model of appearance and motion patterns. International Journal of Computer Vision, 107(2):203–217, 2014.
- [31] J. H. Yoon, M.-H. Yang, J. Lim, and K.-J. Yoon. Bayesian multi-object tracking using motion context from multiple objects. In Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on, pages 33–40. IEEE.
- [32] L. Zhang, Y. Li, and R. Nevatia. Global data association for multi-object tracking using network flows. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE.