1 Introduction
With the development of computer vision techniques, more and more researchers have begun to focus on understanding the behavior and other context of objects from visual information. Tracking targets in video sequences, one of the core topics with wide applications in video surveillance, has advanced rapidly with the rise of tracking-by-detection (TBD) methods [26]. TBD reconstructs the states of targets from detection responses by assigning an identity to each detection and optimizing the trajectories [2, 21]. The success of TBD in recent years has raised interest in a more challenging topic: multi-object tracking (MOT) with an unknown number of targets. MOT remains difficult due to the complex settings of sequences, e.g., intricate trajectories of targets, varying illumination, and camera movement.
The MOT problem can be handled in an online fashion, which suits time-critical applications. However, traditional online methods are susceptible to outliers brought by occlusions and noise, e.g., false positives, false negatives, and duplicate detections of a single target. These outliers can cause ambiguities in data association. Some methods tackle the problem using sparse appearance models [19, 28], and others by predicting states in future frames [3]. However, the dynamics and appearances of targets are unpredictable in some cases. Batch tracking methods handle outliers more easily than online methods through global optimization of associations and trajectories. Terms that penalize mutual exclusion and the number of tracklets [21, 9] have been added to the energy function to regularize trajectories.
Despite the advantages of batch methods, one major problem is that the global optimization involves all frames of the sequence [4], which is unsuitable for real-time applications. Some batch methods also require initial solutions, e.g., [21]. Therefore, we propose a method that aims to combine the advantages of online and batch methods while avoiding their disadvantages. We derive an iterative Approximation-Shrink Scheme (AS Scheme) from the Maximum-A-Posteriori (MAP) formulation using sequential approximation. We show that the state space can be effectively shrunk, but conflicts may arise in the sequential optimization and the results may vary with the order of optimization. To avoid these problems, an Ambiguity-Clearness Graph (AC Graph) is formulated to efficiently represent the tracklet fragments and the ambiguities in the association. A set of rules and procedures is defined for changes of nodes and edges in the graph, e.g., connections, disconnections, transforms, and merges. A sliding Window-of-Ambiguity (WOA) is defined in the AC Graph for sequential optimization of layers in the graph. Based on the AC Graph and the sliding WOA optimization, MOT is conducted in a window-wise manner, which disambiguates the association and accelerates the optimization. We also show that traditional online and batch approaches are covered by this framework with different window sizes.
Our main contributions can be summarized as follows: (1) an approximation-shrink scheme that iteratively approximates the global optimization, (2) a window-wise optimization framework based on the novel AC Graph, which encompasses traditional online and batch methods, and (3) a unified analysis of window-wise approaches with different window sizes using a search tree.
2 Related Works
Different from earlier tracking methods [24, 12], TBD reconstructs trajectories of targets by associating detections provided by object detectors. Most researchers exploit the TBD framework to design their MOT algorithms, which can be categorized as online and batch approaches.
In batch tracking approaches [21, 22, 2, 7, 25, 10, 8], a conditional random field (CRF) is often used to learn and model affinities such as appearance and motion in order to discriminate among different trajectories [29, 30]. A global and pairwise model is learned online in [30] to form an energy function, which is minimized offline via heuristic search. Despite the popularity of the CRF model, it needs extensive training. Continuous energy models were introduced by a series of works [21, 22, 2]. Milan et al. [21] built a comprehensive continuous energy function by linearly combining terms regarding appearance, motion, mutual exclusion, trajectory persistence, etc. Continuous energy functions are easier to optimize than discrete ones, but they have too many parameters and are hard to tune. Network flow was first applied to tracking by Zhang et al. [32]. A graph is formed with the states of targets as nodes and the associations as edges. The likelihood of the states is represented as the capacity of the edges. Butt and Collins [7] improved the network structure by defining each node as a candidate pair of matching observations between consecutive frames. To model occlusions better, Segal and Reid [25] designed a latent data association framework. Instead of assigning each detection to a corresponding track, they assume each detection is its own track and assign a latent variable to each node to represent the association. In addition to the general modeling of targets, some researchers have worked on tracking targets with specific characteristics, e.g., Dicle et al. [10] focus on tracking targets with similar appearance but different motion patterns.
Online tracking [3, 4, 5, 6, 11, 31, 19] has become increasingly popular. Network flow has also been adopted in online tracking: Ben Shitrit et al. [5] formulate multi-object tracking as a multi-commodity network flow problem and use sparse appearance information to reduce computational complexity. Lu et al. [19] constructed a dictionary from already tracked objects and assigned new detections by minimizing an L1-regularized function. Wang and Wang [27] found that the representation residuals follow a Laplacian distribution, and used this to improve the sparse representation method for tracking. The Hungarian algorithm was first introduced into tracking problems by Joo and Chellappa [14] to solve the bipartite graph model they proposed. The frame-by-frame scheme of online tracking benefits greatly from the Hungarian algorithm.
Bae and Yoon [3] designed a tracklet confidence that considers length, occlusion and affinity. Different strategies are applied to tracklets with high and low confidence, and the Hungarian algorithm is employed for the local and the global association respectively. The Hungarian algorithm associates detections in consecutive frames greedily, which may miss the global optimum and cause identity switches. Besides the Hungarian algorithm for association, the Bayesian framework is one of the most popular models for target modeling. Bae and Yoon [4] improved their previous work [3] by performing data association with a track existence probability: the provided detections are associated with the existing tracks, and the corresponding track existence probabilities are updated. Yoon et al. [31] constructed a Relative Motion Network (RMN) to factor out camera motion by considering motion context from multiple objects, and incorporated the RMN into a Bayesian framework.
3 Approximate-Shrink Scheme
Given observations of a real-time video sequence, where denotes the number of observations in frame , we assume that (1) each observation corresponds to a state [25], (2) states in the same frame are independent, and (3) some of the states are already clear given the observations. The Maximum-A-Posteriori (MAP) formulation of MOT is
(1) 
Based on Assumption (2), we resolve as
(2) 
Assumption (3) offers us an intuition that there exist some states . Denote . We name Clear states (C states) and Ambiguous states (A states). The global optimization in Equation 2 can be relaxed to
(3) 
and
(4) 
Performing these two optimizations separately approximates Equation 2. First, we sequentially optimize every state in (approximation step) via Equation 4. Then we set fixed as the evidence for , and rewrite Equation 3 as
(5) 
(shrink step). We iteratively find the , , let and repeat the above steps to shrink the search space.
This Approximate-Shrink Scheme (AS Scheme) iteratively searches and narrows down the state space. serve as nuclei of trajectories in the space, which attract states to associate with them. Some nuclei merge during the iteration to form longer tracklets. However, the space is still too large, and convergence is not guaranteed. More approximations are needed to accelerate the scheme and ensure its convergence. Moreover, a data structure is needed to avoid conflicts between associations of states in and to remove the effect of the optimization order on the results. Therefore, we propose a self-organizing AC Graph and a window-wise optimization framework to meet these demands.
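The alternation between the approximation step and the shrink step can be illustrated with a toy example. The sketch below is not the formulation of Equations 3 to 5: it assumes one-dimensional states, a Gaussian affinity to the mean of each nucleus, and a hypothetical `threshold` that decides when a state becomes clear. All names are illustrative, and at least one nucleus must be given.

```python
import math

def affinity(x, nucleus):
    """Toy affinity: Gaussian of the distance between x and the nucleus mean."""
    mean = sum(nucleus) / len(nucleus)
    return math.exp(-0.5 * (x - mean) ** 2)

def approximate_shrink(clear, ambiguous, threshold=0.6, max_iters=10):
    """clear: non-empty list of nuclei (lists of 1-D states);
    ambiguous: list of 1-D states that are not yet assigned."""
    clear = [list(n) for n in clear]
    ambiguous = list(ambiguous)
    for _ in range(max_iters):
        still_ambiguous = []
        moved = False
        for x in ambiguous:
            # Approximation step: optimise each ambiguous state on its own,
            # treating the current clear states as fixed evidence.
            scores = [affinity(x, n) for n in clear]
            best = max(range(len(clear)), key=scores.__getitem__)
            if scores[best] > threshold:
                clear[best].append(x)  # the state becomes clear
                moved = True
            else:
                still_ambiguous.append(x)
        # Shrink step: only unresolved states remain in the search space.
        ambiguous = still_ambiguous
        if not moved or not ambiguous:
            break
    return clear, ambiguous

nuclei, unresolved = approximate_shrink([[0.0], [10.0]], [1.2, 0.5, 9.5, 5.0])
```

In this run the state 1.2 is too far from the first nucleus in the first pass and only joins it in the second pass, after 0.5 has shifted the nucleus mean. The state 5.0 never becomes clear, which mirrors the remark above that convergence is not guaranteed without further approximations.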
4 Window-wise Optimization for Tracking
4.1 Ambiguity-Clearness Graph
Given states and observations (the detections serve as observations in TBD multi-object tracking) in a real-time video sequence, and predefined thresholds and (the values of and are given in Section 5), we define state to be the parent of state if and there exists an association between and , and is the child of . ( and are only used as examples for clarity of illustration; they do not indicate particular states.) The determined parent of a state is its only parent, and the affinity score of the association is greater than . We now formally define the C states and A states. If a state has one determined parent or has no parent, is a Clear State (C state), denoted as . In contrast, if has parent states but no determined parent, it is an Ambiguous State (A state), denoted as . Note that a C state can only have zero or one parent. All the parents of a state form its active set. We require that a state have at most one C state as its child, and that the frame number of any of its A state children be smaller than that of its C state child. The observations corresponding to and are denoted as and . A clear association is the association between a clear state and its parent, and a tracklet is defined as a group of states connected by clear associations. The tracklet including is denoted as . The C states in after are defined as the descendants of . By taking states and associations as the vertices and edges, we form the AC Graph of the MOT problem. In this paper, we use the terms states and associations instead of vertices and edges when discussing the AC Graph. The AC Graph of the TUD-Stadtmitte dataset is visualized in Figure 1, where clear associations are shown as solid lines and states belonging to the same tracklet are shown in the same color.
As each association is directed from parent to child, the AC Graph is a directed acyclic graph. In an AC Graph, we define a time period to () where there are only clear associations in to and to as the Window-of-Ambiguity (WOA). The tracklets outside the WOA are determined and fixed, and changes of states and associations can only take place inside the WOA. One can restrict the size of the state space by setting the length of the WOA.
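The definitions of C states and A states can be made concrete with a small data structure. The following is a minimal sketch, not the authors' implementation: the class `State`, the helper `connect` and the parameter `clear_threshold` are hypothetical names, and the only parent condition assumed here is that a parent lies in an earlier frame than its child.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)  # eq=False keeps identity hashing, so states can be dict keys
class State:
    frame: int
    parents: dict = field(default_factory=dict)   # parent State -> affinity score
    children: list = field(default_factory=list)

    def is_clear(self, clear_threshold):
        """C state: no parent, or exactly one parent whose affinity exceeds the threshold."""
        if not self.parents:
            return True
        if len(self.parents) == 1:
            (score,) = self.parents.values()
            return score > clear_threshold
        return False  # several candidate parents: an A state

def connect(parent, child, score):
    """Add a directed association parent -> child (assumed to point forward in time)."""
    if parent.frame >= child.frame:
        raise ValueError("parent must be in an earlier frame than child")
    parent.children.append(child)
    child.parents[parent] = score

a, b, c = State(frame=1), State(frame=1), State(frame=2)
connect(a, c, 0.9)
clear_with_one_parent = c.is_clear(clear_threshold=0.8)
connect(b, c, 0.5)
clear_with_two_parents = c.is_clear(clear_threshold=0.8)
```

Because every association points from an earlier frame to a later one, the graph built this way cannot contain a cycle, which matches the directed-acyclic property stated above.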
4.2 Actions
As mentioned in Section 3, actions in the AC Graph should help avoid conflicts, e.g., multiple parents for a C state, multiple C state children, or clear associations forming a cycle. Meanwhile, the actions should be symmetrical to avoid the effect of chronological order. The basic actions of the AC Graph are initializations, disconnections, connections and merges between two states. Table 1 shows the functions and symbols used in defining these actions.
Functions and Symbols  Description
Check whether the is empty.
Find the C states in the .
Find all the parents of .
Find all the children of .
Find the frame number of .
Judge whether is a clear state.
The affinity scores between all the parents of and .
For a newly entered state , we first initialize the active set by enumerating all the potential parents. As required in Section 4.1, can connect with states in previous frames that have no C state child or whose C state child is after . Procedure 1 shows the pseudocode for initializing the active set.
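The selection rule for potential parents can be sketched as follows. This is only an illustration of the rule stated above, not Procedure 1 itself: the dictionary layout, the function name `init_active_set` and the look-back limit `max_gap` are assumptions made for the example.

```python
def init_active_set(new_frame, states, max_gap):
    """Return the ids of states that may become parents of a state entering at new_frame.

    states: list of dicts with keys 'id', 'frame' and 'c_child_frame'
            ('c_child_frame' is None when the state has no C state child).
    max_gap: hypothetical limit on how many frames back a parent may lie.
    """
    active = []
    for s in states:
        # Only states in earlier frames, within the look-back limit, are candidates.
        if not (new_frame - max_gap <= s["frame"] < new_frame):
            continue
        c_child = s["c_child_frame"]
        # Keep states with no C state child, or whose C state child comes later.
        if c_child is None or c_child > new_frame:
            active.append(s["id"])
    return active

candidates = [
    {"id": "a", "frame": 1, "c_child_frame": None},
    {"id": "b", "frame": 2, "c_child_frame": 3},
    {"id": "c", "frame": 2, "c_child_frame": 5},
    {"id": "d", "frame": 4, "c_child_frame": None},
]
active_set = init_active_set(new_frame=4, states=candidates, max_gap=3)
```

Here `b` is rejected because it already has a C state child before the new frame, and `d` is rejected because it lies in the same frame as the new state.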
We disconnect two states and by removing the association between them, and update these two states.
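A disconnection and the status update that follows it might look like the sketch below. It assumes the graph is stored as two plain dictionaries (`parents` and `children`) and that a hypothetical `clear_threshold` plays the role of the threshold for a determined parent; none of these names come from the paper.

```python
def disconnect(parents, children, parent, child):
    """Remove the association parent -> child from both adjacency maps."""
    parents.get(child, {}).pop(parent, None)
    children.get(parent, set()).discard(child)

def status(parents, state, clear_threshold):
    """Return 'C' for a clear state and 'A' for an ambiguous one."""
    candidates = parents.get(state, {})
    if not candidates:
        return "C"
    if len(candidates) == 1 and next(iter(candidates.values())) > clear_threshold:
        return "C"
    return "A"

# State 'c' starts with two candidate parents and is therefore ambiguous.
parents = {"c": {"a": 0.9, "b": 0.5}}
children = {"a": {"c"}, "b": {"c"}}
before = status(parents, "c", clear_threshold=0.8)
disconnect(parents, children, "b", "c")
after = status(parents, "c", clear_threshold=0.8)
```

Removing the weaker association leaves `c` with a single parent above the threshold, so re-evaluating its status turns it from an A state into a C state.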
As shown in Procedure 2, we assign to as an A state child. The procedure terminates if is already a C state. If not, we check the descendants of . If has no descendants, we directly add an association between and ; otherwise, we find the nearest C state descendant in the tracklet of not after . If is in frame , the procedure terminates. If is before , we add the association between and .
Procedure 5 illustrates the action in which is connected to as a C state child. If is currently not a C state, the existing parents of are removed. If does not have C state children, we directly add a connection between and ; otherwise, we find ’s latest C state descendant not after . If is in frame , and are merged via Procedure 6. If is before , an association is added between and . All the A and C state children of after are removed from and reconnected to following Procedures 2 and 5, respectively. If is currently a C state and is not, is inserted into ’s tracklet using Procedure 5 if there is no state in frame in ’s tracklet, and using Procedure 6 if such a state exists. If and are both C states, the two tracklets and are grouped into one by recursively calling Procedures 5 and 6, as shown in Procedure 5. If only one of the two states is in a tracklet, the other state is inserted into that tracklet.
Procedure 6 describes the process of merging to in the same frame. As we cannot change states and tracklets outside the WOA, we ensure that and are not both C states, which avoids merging states outside the WOA. For the descendants of and , we recursively merge them into one tracklet by Procedure 5. For the A state child of , we simply remove the association between and and connect it to via Procedure 2.
Although the actions involve recursion, it can be easily proved that the recursion in Procedures 2, 5 and 6 cannot form an endless loop, and that the order in which actions are carried out on a set of states does not affect the structure of the AC Graph. Visualization of these actions on the TUD-Stadtmitte dataset can be found in Figure 1. In Figure 1(b), newly entered states to are connected to their initial active sets via Procedures 1, 2 and 5. From Figure 1(a) to 1(b), was connected to as a C state child by Procedure 5, and merged with using Procedure 6.
4.3 Sliding Window Optimization
For a real-time sequence, new states from the latest frame are continuously added to the AC Graph. The WOA should slide to keep its size from growing too large and to remove the ambiguities so that tracks are generated. We therefore set the upper bound of the size of the WOA as .
The sliding window optimization consists of three steps. First, for all the newly entered states in frame , , we find the active sets via Procedure 1 and compute the affinity score between and each state in the corresponding active set. If , do Procedure 5 with as input. If , do Procedure 2 with as input. Second, from frame to , we sequentially recompute the affinity scores of states in the same frame with their parents and reconnect them according to the new affinities. Third, the Hungarian algorithm [1] is carried out on states in frame with their parent states to obtain the best assignment of associations and clear all the ambiguity in frame . All states in frame are transformed to C states and the WOA shifts forward. If has not reached the end, and return to the first step; otherwise, and redo the third step. The outline of the optimization process is shown in Procedure 7.
The sliding window optimization conducts the AS Scheme in a window-wise manner. Procedures 5 and 6 in steps one and two serve as the approximation step, and updating the affinity scores in step two corresponds to the shrink step. Step three forces the states in frame to determine their connections, which guarantees convergence.
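The third step resolves the remaining ambiguity in a frame by solving an assignment problem between states and their candidate parents. The sketch below shows the objective on a tiny example. It uses exhaustive search over permutations as a stand-in for the Hungarian algorithm, which is only practical for a handful of states; the function name and the matrix layout are assumptions for illustration.

```python
from itertools import permutations

def best_assignment(affinity):
    """Maximise the total affinity of a one-to-one child -> parent assignment.

    affinity[i][j] is the score between child state i and parent state j;
    the number of children must not exceed the number of parents.
    Brute force for illustration only, not the Hungarian algorithm itself.
    """
    n_children, n_parents = len(affinity), len(affinity[0])
    best_score, best_perm = float("-inf"), None
    for perm in permutations(range(n_parents), n_children):
        score = sum(affinity[i][j] for i, j in enumerate(perm))
        if score > best_score:
            best_score, best_perm = score, perm
    return list(best_perm), best_score

# Picking the single best edge first (0.9) would force child 1 onto a poor parent (0.1).
scores = [[0.9, 0.8],
          [0.85, 0.1]]
assignment, total = best_assignment(scores)
```

The optimal assignment gives child 0 its second-best parent so that child 1 can keep a good one, which is the kind of conflict the per-frame assignment is meant to settle. For realistic problem sizes a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice.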
4.4 Online, Delayed and Batch Methods
Based on the definition of the AC Graph and the sliding window optimization, we form a window-wise framework which includes online (), delayed () and batch () methods. Figure 2 demonstrates the formation of a trajectory starting from in the AC Graph via these three methods. The window-wise optimization finds a relatively small search tree according to at each iteration. For an online method (Figure 2(b)), and the search is greedy. For a delayed method (Figure 2(c)), heuristic search is conducted in . The search space remains unchanged for a batch method (Figure 2(d)), so local search methods, e.g., hill climbing and simulated annealing, are often exploited to move iteratively toward a local optimum. The experimental analysis of the relation between and the optimization results is provided in Section 5.2.
5 Experimental Evaluation
5.1 Implementation
Affinity model: We implemented a basic affinity model, following [3], which includes the appearance model , motion model and shape model . The appearance model measures the Bhattacharyya distance between the histograms of and . If is in a tracklet , instead of using the Incremental Linear Discriminant Analysis (ILDA) of [3], we simply average the appearance histograms of all states in using an exponential discount factor. A first-order Kalman filter is applied to smooth and predict the positions of the targets and the shapes of the bounding boxes. We compute the normalized distances of target positions and bounding box shapes and map them to a Gaussian distribution to get the affinity scores. The overall affinity is
(6)
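The components of the affinity model can be sketched as below. Several points are assumptions rather than details taken from the text: the appearance term uses the Bhattacharyya coefficient (a similarity) instead of a distance, the three terms are combined by a product, and the values of `discount` and `sigma` are placeholders. The Kalman filtering step is omitted, so the distances are passed in directly.

```python
import math

def bhattacharyya_coefficient(h1, h2):
    """Similarity of two histograms after normalisation: 1 if identical, 0 if disjoint."""
    s1, s2 = sum(h1), sum(h2)
    return sum(math.sqrt((a / s1) * (b / s2)) for a, b in zip(h1, h2))

def gaussian_score(distance, sigma):
    """Map a normalised distance to (0, 1] with a Gaussian kernel."""
    return math.exp(-0.5 * (distance / sigma) ** 2)

def discounted_histogram(histograms, discount=0.9):
    """Exponentially discounted average of a tracklet's histograms (newest last)."""
    n = len(histograms)
    weights = [discount ** (n - 1 - k) for k in range(n)]
    total = sum(weights)
    bins = len(histograms[0])
    return [sum(w * h[b] for w, h in zip(weights, histograms)) / total
            for b in range(bins)]

def overall_affinity(hist_a, hist_b, pos_dist, shape_dist, sigma=1.0):
    """Assumed combination: product of appearance, motion and shape terms."""
    return (bhattacharyya_coefficient(hist_a, hist_b)
            * gaussian_score(pos_dist, sigma)
            * gaussian_score(shape_dist, sigma))

tracklet_hist = discounted_histogram([[1.0, 0.0], [0.0, 1.0]], discount=0.5)
same = overall_affinity([2, 1, 1], [2, 1, 1], pos_dist=0.0, shape_dist=0.0)
disjoint = overall_affinity([1, 0], [0, 1], pos_dist=0.0, shape_dist=0.0)
```

With a discount of 0.5 the newest histogram receives twice the weight of the previous one, so recent appearance dominates the tracklet model while older observations still contribute.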
Dataset description: We use the MOT Benchmark [18] for training and evaluation, which contains sequences for both training and testing. In total, there are frames, for the training set and for the testing set. The sequences have different frame rates and resolutions, and only pedestrians are tracked.
Parameter Settings: In our experiment, the and . We estimate the length of every occlusion (number of frames with overlap 0.4) in the training set of the MOT Benchmark and study the distribution of occlusion lengths. As shown in Figure 3, about of the overlaps are within , and of which are within . Therefore, the delayed time is set to and the length of the WOA is frame rate × delayed time. The variance of the Gaussian distribution in the motion model and shape model is . Other parameters of the affinity model are the same as [3].
5.2 Analysis of Window-of-Ambiguity
To analyze the connection between the WOA size and the quality of the window-wise optimization, we define the energy of an AC Graph as
(7) 
Figure 4 presents the final energy with varying WOA size on TUD-Stadtmitte (number of frames ), TUD-Campus (number of frames ) and PETS-S2L2 (number of frames ) in the MOT Benchmark. The X-axis is in logarithmic scale. Interestingly, the final energy of these sequences decreases significantly when the window size grows from to , while the decrease becomes much slower when . The settings of these sequences, e.g., target density and viewpoint, are different, but the patterns of energy change remain almost identical. It is likely that the trend of the final energy depends only on the WOA size . Moreover, the tracking results improve considerably with a small WOA compared to the online method, which experimentally illustrates that delayed methods perform better than online ones in the window-wise optimization framework. The final energy does not decrease much when grows larger than . This indicates that the sliding window approximation has only a minor effect on the final performance, and that growing the WOA further becomes a trade-off between speed and better results.
5.3 Performance Evaluation
Evaluation Metrics: We apply the CLEAR MOT metrics [15] and the metrics of [29, 16] when evaluating our results. The multiple object tracking accuracy (MOTA) shows the combined accuracy based on the number of false positives (FP), identity switches (IDS) and missed targets (FN). The multiple object tracking precision (MOTP) measures the overlap of bounding boxes between the ground truth and the tracker results. MT and ML indicate the numbers of mostly tracked and mostly lost targets. FG represents the number of fragmented tracks.
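For reference, MOTA as defined in the CLEAR MOT metrics combines the three error counts into a single score normalised by the total number of ground-truth objects over all frames. The helper below is a small sketch of that formula; the function and argument names are chosen for this example.

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """MOTA = 1 - (FN + FP + IDS) / GT, with all counts summed over every frame."""
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth

example_mota = mota(false_negatives=10, false_positives=5,
                    id_switches=5, num_ground_truth=100)
```

Note that the score can become negative when the tracker makes more errors than there are ground-truth objects.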
Evaluation: As shown in Table 2, our method clearly outperforms the TC_ODAL method using the same affinity model, and not only in MOTA. On some datasets, as shown in Table 3, our method with the basic affinity model even reaches the performance of methods using state-of-the-art affinity models.
Method  Type  MOTA  MOTP  MT  ML  FP  FN  IDS  FG 

ACMOT(Proposed affinity of [3])  Delayed  
TBD[13]  Batch  
TC_ODAL [3]  Online  
DP_NMS [23]  Batch 
6 Conclusion
This paper proposed an AS Scheme for sequential approximation and a window-wise optimization framework based on the AC Graph. The core idea of this method is to cluster the states subject to several constraints, e.g., states in the same frame cannot be clustered into one group. The AC Graph together with the sliding window optimization transforms the global clustering into a sequential local clustering that self-organizes the structure in a relatively small state space, which can be done efficiently with little harm to the handling of occlusions. We showed experimentally that the characteristics of the window-wise optimization framework rarely change with the varying settings of the sequence. As the affinity model serves as the distance metric in clustering, it can influence the clustering results. Therefore, a comparison of optimization models is fair only if similar affinity models are adopted. The experimental results show that, using only the basic affinity model, our method achieved competitive performance even in an unfair test. Our future work is to incorporate more state-of-the-art affinity models into the window-wise optimization model. We also plan to design a unified interface, which would make it much easier to embed affinity models into different optimization models.
References
[1] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin. Network flows. Prentice Hall, 1993.
[2] A. Andriyenko, K. Schindler, and S. Roth. Discrete-continuous optimization for multi-target tracking. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1926–1933. IEEE.
[3] S.-H. Bae and K.-J. Yoon. Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1218–1225. IEEE.
[4] S.-H. Bae and K.-J. Yoon. Robust online multi-object tracking with data association and track management. 2014.
[5] H. Ben Shitrit, J. Berclaz, F. Fleuret, and P. Fua. Multi-commodity network flow for tracking multiple people. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(8):1614–1627, 2014.
[6] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool. Online multi-person tracking-by-detection from a single, uncalibrated camera. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(9):1820–1833, 2011.
[7] A. A. Butt and R. T. Collins. Multi-target tracking by Lagrangian relaxation to min-cost network flow. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1846–1853. IEEE.
[8] C. Canton-Ferrer, J. R. Casas, M. Pardàs, and E. Monte. Multi-camera multi-object voxel-based Monte Carlo 3D tracking strategies. EURASIP Journal on Advances in Signal Processing, 2011(1):1–15, 2011.
[9] W. Choi. Near-online multi-target tracking with aggregated local flow descriptor. arXiv preprint arXiv:1504.02340, 2015.
[10] C. Dicle, O. I. Camps, and M. Sznaier. The way they move: Tracking multiple targets with similar appearance. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 2304–2311. IEEE.
[11] C. Fantacci, B.-N. Vo, B.-T. Vo, G. Battistelli, and L. Chisci. Consensus labeled random finite set filtering for distributed multi-object tracking. arXiv preprint arXiv:1501.01579, 2015.
[12] T. E. Fortmann, Y. Bar-Shalom, and M. Scheffe. Sonar tracking of multiple targets using joint probabilistic data association. Oceanic Engineering, IEEE Journal of, 8(3):173–184, 1983.
[13] A. Geiger, M. Lauer, C. Wojek, C. Stiller, and R. Urtasun. 3D traffic scene understanding from movable platforms. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(5):1012–1025, 2014.
[14] S.-W. Joo and R. Chellappa. A multiple-hypothesis approach for multi-object visual tracking. Image Processing, IEEE Transactions on, 16(11):2849–2854, 2007.
[15] K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP Journal on Image and Video Processing, 2008, 2008.
[16] C.-H. Kuo and R. Nevatia. How does person identity recognition help multi-person tracking? In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1217–1224. IEEE.
[17] L. Leal-Taixé, M. Fenzi, A. Kuznetsova, B. Rosenhahn, and S. Savarese. Learning an image-based motion context for multiple people tracking. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 3542–3549. IEEE, 2014.
[18] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler. MOTChallenge 2015: Towards a benchmark for multi-target tracking. arXiv preprint arXiv:1504.01942, 2015.
[19] W. Lu, C. Bai, K. Kpalma, and J. Ronsin. Multi-object tracking using sparse representation. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 2312–2316. IEEE.
[20] A. Milan, L. Leal-Taixé, K. Schindler, and I. Reid. Joint tracking and segmentation of multiple targets.
[21] A. Milan, S. Roth, and K. Schindler. Continuous energy minimization for multi-target tracking. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(1):58–72, 2014.
[22] A. Milan, K. Schindler, and S. Roth. Detection- and trajectory-level exclusion in multiple object tracking. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3682–3689. IEEE.
[23] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1201–1208. IEEE, 2011.
[24] D. B. Reid. An algorithm for tracking multiple targets. Automatic Control, IEEE Transactions on, 24(6):843–854, 1979.
[25] A. V. Segal and I. Reid. Latent data association: Bayesian model selection for multi-target tracking. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 2904–2911. IEEE.
[26] A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: An experimental survey. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(7):1442–1468, 2014.
[27] B. Wang and F. Wang. Multi-object tracking using least absolute deviation. In Image and Signal Processing (CISP), 2014 7th International Congress on, pages 60–65. IEEE.
[28] D. Wang, H. Lu, and M.-H. Yang. Least soft-threshold squares tracking. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 2371–2378. IEEE.
[29] B. Yang and R. Nevatia. An online learned CRF model for multi-target tracking. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2034–2041. IEEE.
[30] B. Yang and R. Nevatia. Multi-target tracking by online learning a CRF model of appearance and motion patterns. International Journal of Computer Vision, 107(2):203–217, 2014.
[31] J. H. Yoon, M.-H. Yang, J. Lim, and K.-J. Yoon. Bayesian multi-object tracking using motion context from multiple objects. In Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on, pages 33–40. IEEE.
[32] L. Zhang, Y. Li, and R. Nevatia. Global data association for multi-object tracking using network flows. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE.