We consider a system consisting of a single PTZ camera (which can pan, tilt and zoom) to solve the problem of tracking multiple pedestrians while also capturing their faces. A necessary component of such a system is a scheduling algorithm that determines at any time step whether to remain in zoom-out mode or to zoom in on a face. This paper presents an efficient new data structure, the multi-strand graph (MSG), for multiple target tracking using a single PTZ system, and a novel scheduling algorithm based on it.
Our method aims to overcome one of the main challenges of a multiple target tracker: trajectory fragmentation. Such fragmentation is caused by occlusions, by the joining of two or more targets who then split (e.g., Figures 1(a), 2(a)), or when the PTZ camera zooms in on another target, creating what we call a blind gap (Figure 1(b)). Matching trajectory fragments (tracklets) is complicated due to ambiguities caused by similarity in appearance and location of different targets. We propose using the faces captured in zoom-in mode together with the available information of the system state, to resolve such ambiguities.
The objective of the proposed system is to maximize the total length of the labeled tracklets, that is, trajectory fragments with associated high-resolution face images captured in zoom-in mode. Formally, given the set of targets and the set of detected tracklets, we define for each target the union of its labeled tracklets. The objective (Eq. 1) is the sum, over all targets, of the length of this union. Our scheduler selects, as the next target to zoom in on, the one with the highest probability of maximizing this objective. Note that resolving ambiguities (e.g., matching tracklets across the join/split event in Figure 2(a)) can greatly increase the objective.
We introduce the Multi-Strand Tracking Graph (MSG), which represents the tracklets computed by a tracker and their possible associations (its basic structure is similar to ). We show that a straightforward use of the graph for the abovementioned task requires a graph traversal. The main contribution of this paper is the proposed auxiliary data stored in each vertex. We use this data to efficiently compute the system state information without traversing the graph. The graph is constructed online and the auxiliary data is recursively computed based only on the vertex itself and on its direct parents. Hence, all the required information is available when scheduling decisions are made. Other contributions of this paper are the use of high-resolution images to resolve matching ambiguities of tracklets and the design of an efficient scheduling algorithm that uses the MSG.
System overview: The tracking system considered in this paper consists of a single PTZ camera and several components, described below, that are assumed to be available. These include a tracker that detects and tracks pedestrians in zoom-out mode. It also detects joining and splitting events of two or more targets moving together (as in ). According to the proposed scheduler, the system selects a person to zoom in on using a camera control algorithm. The control algorithm chooses the FOV that makes it possible to zoom in on the selected target (e.g., ). In zoom-in mode, a face image is acquired, and face-to-face and face-to-person matchings are computed. The system then zooms out to the same wide view to continue tracking. A person-to-person matching module associates tracklets when returning from zoom-in mode or after targets split from a group. Figure 3 summarizes the system components. Our contribution to the system is the graph representation (MSG) and the efficient scheduling algorithm.
2 Previous Work
Scheduling of a single PTZ camera was considered in [13, 2, 11, 12]. Scenarios of joining/splitting targets were considered in [13, 2]. The goal of  was to minimize the slew time of an aerial camera tracking cars. High-resolution images were used to remove incorrect prediction hypotheses (stored as a tree). The greedy policy in  aimed to maximize the number of captured faces, considering the predicted time of each target to exit the scene and its movement angle w.r.t. the camera. An information-theoretic approach [11, 12] aims to decrease location uncertainty while capturing high-resolution images. A distributed game-theoretic approach for scheduling multiple PTZ cameras  aims to maximize the targets’ image quality and to capture their faces. None of the above scheduling algorithms considered the goal of resolving tracklet-matching ambiguities.
Other systems considered setups with both fixed and PTZ cameras, in a master-slave configuration. Such setups are less challenging since a fixed camera continuously views the entire region. They vary from a single master and a single slave [1, 3] to multiple masters and multiple slaves [4, 6, 10, 16]. The objectives in these studies are to acquire once , or as many times as possible , the face of each target, or to minimize camera motion . The scheduling methods consider the expected distance from the camera , the viewing angle [1, 6, 13, 16], and expected occlusions [6, 13]. In addition to these objectives, our algorithm also considers how zooming in contributes to the resolution of past and future ambiguities of tracklet matching.
Graphs were previously used to represent relations between tracklets [5, 9, 15, 17, 18], where the weighted edges reflect the appearance similarity and the consistency of location with respect to the computed motion direction and sometimes speed. A graph with a similar structure to the MSG [5, 8, 14] was used to associate isolated tracklets of targets with indistinct appearance as well as tracklets of a set of targets that cannot be separated. The joins/splits of targets were computed by a tracker. The association of single target tracklets is solved by finding the most probable set of paths. All these papers use the target’s location and only one appearance descriptor level for matching while we use both low- and high-resolution images. Moreover, they do not use auxiliary data, which allows efficient scheduling and online graph updating in our method.
We first describe the basic structure of the MSG. Next, we extend the MSG with auxiliary data for efficient matching by elimination. Finally, we present our scheduling algorithm.
3.1 Graph Definition
The basic structure of the MSG is a dynamic augmented graph, , where represents the set of tracklets computed by the tracker, and the candidate associations of different tracklets computed by some available matching algorithm (similar to ). Each vertex is associated with the information regarding its tracklet. We consider two types of vertices that represent two types of tracklets. A solo vertex represents the tracklet of a single target and a compound vertex represents the shared tracklet of joined targets, that is, a set of targets that walk together (Figure 2(b)).
A directed edge, , represents the case in which at least one of the targets associated with may also be associated with , and and correspond to consecutive time intervals (ignoring zoom-in time). A compound vertex is generated as the child of other vertices when the tracker detects that the targets’ trajectories are joined into indistinguishable tracklets (see Figures 2(a), 2(b)). A new solo vertex is generated when a new target enters the scene, a target trajectory splits from those of others (as a child of the compound vertex), or a target reappears when the camera returns to zoom-out (after a blind gap).
Edges between solo vertices are generated at consecutive layers (e.g., when returning from zoom-in mode), according to a matching algorithm that is based on the targets’ low-resolution images captured in zoom-out mode and on their locations. When the matching of a new solo vertex is ambiguous, edges are set between the vertex and all the matching candidates, forming an X-type ambiguity (see Figure 1(b)). Additional edges are generated between compound and solo vertices, based on the tracker’s detection of splitting and joining targets.
When no ambiguities are present, the trajectory of each target can be fully recovered, and the graph contains only unconnected solo vertices. We wish to reduce the number of vertices as much as possible by concatenating consecutive tracklets into a single tracklet whenever possible. When a univocal matching exists for a consecutive set of tracklets (all of the same target), their corresponding vertices form a solo chain in the graph, in which the out-degree of each vertex’s parent and the in-degree of each vertex’s child are one (Figure 2(c)). A solo chain of any length can be merged into a single solo vertex (Figure 2(d)). A compound chain can be defined and merged similarly.
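The chain-merging rule above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the authors' implementation: the graph is given as a set of `(parent, child)` edges, vertex types are stored in a hypothetical `kinds` dictionary, and only solo chains are handled (compound chains would follow the same pattern).

```python
def merge_solo_chains(edges, kinds):
    """Merge every solo chain into its first vertex.

    edges: set of (parent, child) pairs of a DAG.
    kinds: dict mapping each vertex to "solo" or "compound" (hypothetical names).
    Returns (merged_edges, chains), where chains maps each surviving vertex
    to the ordered list of original vertices it now represents.
    """
    out_deg = {v: 0 for v in kinds}
    in_deg = {v: 0 for v in kinds}
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1

    # A link u -> v belongs to a solo chain when both ends are solo,
    # u has exactly one child and v has exactly one parent.
    nxt, has_prev = {}, set()
    for u, v in edges:
        if (kinds[u] == "solo" and kinds[v] == "solo"
                and out_deg[u] == 1 and in_deg[v] == 1):
            nxt[u] = v
            has_prev.add(v)

    # Follow each chain from its head and record a representative vertex.
    rep, chains = {}, {}
    for head in kinds:
        if head in has_prev:
            continue
        chain = [head]
        while chain[-1] in nxt:
            chain.append(nxt[chain[-1]])
        chains[head] = chain
        for v in chain:
            rep[v] = head

    # Keep only the edges that are not internal to a merged chain.
    merged_edges = {(rep[u], rep[v]) for u, v in edges if nxt.get(u) != v}
    return merged_edges, chains
```

For example, with solo vertices `a -> b -> c` followed by a compound vertex `g` that also has a solo parent `d`, the three vertices `a`, `b`, `c` collapse into `a`, and the edge `c -> g` becomes `a -> g`.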
A univocal matching of a pair of non-consecutive solo vertices, , can be computed using an available face-to-face matching algorithm. In addition, an indirect match can also be obtained by elimination (see Section 3.3).
Such a univocal matching of can be used for untangling the graph as long as the connected component of and is a DAG (a graph that does not contain cycles). In this case, a breadth-first-search (BFS) algorithm is used to recover the graph path, (e.g., in Figure 2(b)). All vertices of are guaranteed to represent consecutive tracklets of the same target. Hence, edges ‘to’ and ‘from’ solo vertices of (representing X-type ambiguities) are removed except those that are part of the path. Each compound vertex is split into two vertices. One solo vertex represents only the labeled target and is linked only to the solo chain. The second vertex, , represents the remaining targets of the compound vertex and is disconnected from the chain (Figure 2(c)). As a result, a solo chain of the labeled target, and possibly additional solo chains of other targets, are obtained. Each chain can be merged into a single solo vertex (Figure 2(d)). Note that no information is lost in the untangling process.
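The path-recovery step of the untangling process can be illustrated with a standard breadth-first search. The sketch below assumes a hypothetical `children` adjacency dictionary and covers only the recovery of a path between two matched vertices; the subsequent edge removal and the splitting of compound vertices are not shown.

```python
from collections import deque


def recover_path(children, start, goal):
    """Return a path from start to goal in a DAG, or None if none exists.

    children: dict mapping a vertex to an iterable of its child vertices
    (a hypothetical representation chosen for this sketch).
    """
    prev = {start: None}          # predecessor on the BFS tree
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if u == goal:
            # Walk the predecessor links back to the start vertex.
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for w in children.get(u, ()):
            if w not in prev:
                prev[w] = u
                queue.append(w)
    return None
```

Once the path is known, every solo vertex on it can be detached from edges that are not part of the path, as described above.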
3.3 Matching by Elimination
When there is sufficient confidence that a labeled vertex cannot be matched to any of the previously labeled vertices, the vertex can sometimes be indirectly matched to an unlabeled vertex by elimination. For example, assume that and in Figure 2(b) were labeled and no match was found for the faces of either or with . It is possible to deduce that is the correct match to . Similarly, if only and were labeled, the non-source is deduced to be this match. We next define when an indirect match can be found in the general case, and how to compute it efficiently. Let be the set of labeled solo vertices. We define an unlabeled path between and , , to be a path that does not contain any labeled vertex except possibly and , that is, .
Claim 1: Sufficient and necessary conditions for a solo vertex to be an indirect match to are (i) cannot be matched to a previously labeled vertex; (ii) there exists an unlabeled path, ; (iii) if an unlabeled solo vertex satisfies (ii) then .
Proof: We begin with proving that (i)-(iii) are necessary conditions. Assume is an indirect match of . Then (i) must hold since otherwise can be directly matched; (ii) must hold since otherwise either does not exist and hence no match between and is possible, or , where is a labeled solo vertex. However, an indirect match of and implies a match between all solo vertices and . Hence, could be directly matched to , which contradicts condition (i). Finally, (iii) must hold since otherwise there exists that satisfies (ii). It follows that more than one feasible indirect match to exists. Hence, there is insufficient information to determine which of them is the correct one, and an indirect match of and cannot be determined.
We next prove that if conditions (i)-(iii) hold, then is an indirect match of . From condition (i) it follows directly that cannot be directly matched to . From condition (ii) it follows that exists; hence is a possible match. It is left to show that is the only feasible match. From condition (iii) it follows that is the only feasible match to since any other match, , satisfies .
When (i) holds, an indirect match to a labeled solo vertex can be computed in a straightforward manner by traversing the graph backwards from and checking whether a vertex that satisfies (ii) and (iii) exists. This is clearly time consuming. Instead, we propose to store auxiliary data in each vertex; this data, which can be efficiently computed online from the vertex itself and its parents, makes it possible to directly compute an indirect match, if one exists. We will also use this data later for scheduling.
Auxiliary data for matching by elimination: We define to be an origin of if (i) is a solo vertex; (ii) there exists an unlabeled path and (iii) is either a source of the graph (unlabeled origin) or a labeled vertex (labeled origin). A labeled vertex is the origin of itself and has no unlabeled origins. The set of origins of consists of the set of vertices – each associated with a distinct target ID – that may represent the same target as . Note that only a labeled origin of may be directly matched to .
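To make the definition of origins concrete, the following sketch computes them by the straightforward backward traversal that the auxiliary data is meant to avoid. The representation (a `parents` dictionary and the sets `labeled` and `solo`) is an assumption made for illustration, as is the convention that an unlabeled source vertex counts as its own unlabeled origin.

```python
def find_origins(parents, labeled, solo, v):
    """Return (labeled_origins, unlabeled_origins) of vertex v.

    parents: dict mapping a vertex to a list of its parents.
    labeled: set of labeled vertices; solo: set of solo vertices.
    All three containers are hypothetical names for this sketch.
    """
    if v in labeled:
        # A labeled vertex is the origin of itself and has no unlabeled origins.
        return {v}, set()

    labeled_origins, unlabeled_origins = set(), set()
    stack, seen = [v], set()
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        if u in labeled:
            # An unlabeled path may end at a labeled vertex but not pass through it.
            if u in solo:
                labeled_origins.add(u)
        elif not parents.get(u):
            # Unlabeled source of the graph.
            if u in solo:
                unlabeled_origins.add(u)
        else:
            stack.extend(parents[u])
    return labeled_origins, unlabeled_origins
```

The traversal visits every ancestor reachable through unlabeled vertices, which is why its cost grows with the size of the graph.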
We observe that a solo vertex may have an indirect match only if it has at least one unlabeled origin (otherwise it can only be directly matched). Furthermore, may have an indirect match only if just one of its parents has unlabeled origins (otherwise, the unlabeled origins, one from each parent, do not satisfy (iii) of Claim 1). Hence, to compute whether an indirect match exists, it is sufficient to store in each vertex the number of its unlabeled origins, denoted by , and the single parent that has unlabeled sources, if one exists, (set to zero if one does not exist). Let be the set of parents of . Both (given by summing the number of unlabeled origins of ) and can be recursively defined as follows:
Note that if a vertex is the indirect match of a solo vertex , it is also the indirect match of . Therefore, we can efficiently and recursively compute the single candidate of an indirect match of , :
where holds if is a solo vertex and 0 otherwise.
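The online flavor of this auxiliary data can be sketched as follows. This is a hedged illustration that covers only the count of unlabeled origins and the single parent carrying them; the field names `n_unlabeled` and `single_parent` are hypothetical, `None` is used where the text stores zero, and an unlabeled source is assumed to count as one unlabeled origin. The recursion for the candidate indirect match itself is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Vertex:
    name: str
    labeled: bool = False
    parents: List["Vertex"] = field(default_factory=list)
    n_unlabeled: int = 0                      # number of unlabeled origins
    single_parent: Optional["Vertex"] = None  # only parent with unlabeled origins


def add_vertex(name, parents=(), labeled=False):
    """Create a vertex and fill its auxiliary data from its direct parents only."""
    v = Vertex(name, labeled, list(parents))
    if labeled:
        v.n_unlabeled = 0          # a labeled vertex has no unlabeled origins
    elif not v.parents:
        v.n_unlabeled = 1          # assumption: an unlabeled source is its own origin
    else:
        v.n_unlabeled = sum(p.n_unlabeled for p in v.parents)
        carriers = [p for p in v.parents if p.n_unlabeled > 0]
        v.single_parent = carriers[0] if len(carriers) == 1 else None
    return v
```

Because each new vertex reads only the fields of its parents, no graph traversal is needed when the graph grows.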
Note that if is a compound vertex, it cannot have an indirect match; however, the value contains the candidate indirect match for its descendants. After labeling and untangling the MSG (if such untangling is possible), the auxiliary data is recalculated to be . This reflects that ambiguities of this target prior to the labeling are no longer relevant for future ambiguities. After each labeling and once the untangling is complete, must also be recalculated for all the vertices that were disconnected from the solo chain during the untangling process. Each of these vertices then propagates the updated value to all its descendants, who recalculate their own values accordingly.
The scheduler selects the tracklet whose target’s face will be acquired in the next zoom-in mode or selects to stay in zoom-out mode. The score it provides reflects the expected contribution of a tracklet labeling to maximize the system’s objective, (Eq. 1). A prerequisite for choosing to zoom in on an unlabeled target is that the acquisition of its face is expected to be successful. A Boolean value that indicates the expected success, , can be computed by the tracker in a similar manner to previous studies (e.g., ). For example, the motion direction can be used for predicting occlusions and time to exit, and whether the face will be visible to the camera.
Ideally, the minimal number of required labelings for a full trajectory retrieval of targets is , one per target. The upper bound of the required number of ideal labelings is , where and is a connected component. This sum includes the labeling of the first and last solo-walking tracklet of each target . Thus, the full trajectory of is recovered and labeled (under the no-cycle assumption). Note that each untangling may further reduce the required number of labelings.
In practice, an ideal labeling set is often impossible to obtain: the online algorithm leaves limited time for zooming in, and each labeling may cause additional ambiguities due to a blind gap. Moreover, the target identity and hence its contribution to is unknown before zooming in.
Therefore, we propose a scheduling algorithm that approximates the estimated contribution of labeling each of the targets or staying in zoom-out mode to maximize. A zoom-out score, , can reflect global properties of the scene, such as the number of new targets expected to enter it, and the prevention of X-type ambiguities caused by a blind gap in zoom-in mode. Here we set it to be a constant. A labeling score, , is set for each vertex and reflects the expected contribution to if is chosen to be labeled. The to be selected for labeling is the one with the highest as long as . Otherwise, the system remains in zoom-out mode.
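The selection rule can be sketched as a small function. The threshold test is written here as a strict comparison against the zoom-out score, which is an assumption, and the tuple layout of `candidates` is hypothetical. The Boolean flag stands for the expected success of the face acquisition described earlier in this section.

```python
def choose_target(candidates, zoom_out_score):
    """Pick the vertex to label next, or None to stay in zoom-out mode.

    candidates: iterable of (vertex_id, labeling_score, face_expected) tuples,
    a hypothetical layout chosen for this sketch.
    zoom_out_score: score of remaining in zoom-out mode (a constant in the text).
    """
    # Only targets whose face is expected to be captured are eligible.
    feasible = [(vid, score) for vid, score, face_ok in candidates if face_ok]
    if not feasible:
        return None
    best_vid, best_score = max(feasible, key=lambda item: item[1])
    # Assumption: zoom in only when the best labeling score beats zoom-out.
    return best_vid if best_score > zoom_out_score else None
```

Keeping the zoom-out option as an explicit score lets the same comparison cover both decisions, which matches the description above.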
3.5 Labeling Score & Auxiliary Data
The score is a weighted sum of two terms that estimate the expected resolution of future () and past () ambiguities:
where the weights and are higher for source and expected sink vertices, respectively.
Future ambiguities: The probability that was not labeled before is given by , where and are the number of origins and unlabeled origins of , respectively. The score is defined to be , where is a Boolean value that reflects that the target of is expected to join another target with a similar appearance (computed by the tracker). The value of can be recursively computed (see Eq. 2). In a similar manner, can also be recursively computed (as specified in Appendix A).
Past ambiguities: The score reflects the expected increase in the length of the labeled tracklets, , if is chosen for labeling. Labeling a vertex increases the length of by the length of the tracklet . In addition, if is matched to , directly or indirectly, then is extended by the sum of over all such that . The identity of ’s target, and hence the origin to which will be matched, is unknown prior to its labeling. Hence, we average over all possible increases of with respect to the possible origins that may be the match of :
where and sum the increase of over the sets of unlabeled origins and labeled origins, respectively.
A straightforward computation of is by graph traversal. To avoid such a computationally expensive operation for each candidate vertex, we store in each vertex the auxiliary data fields, , , and . These values are recursively computed. We next describe the computation of . (For , see Appendix A.)
When no match is found (either direct or indirect), the contribution of labeling to is for each unlabeled origin. If an indirect candidate match is given by , that is, , then the additional contribution of labeling is given by . To compute it, is computed recursively as follows:
Complexity: The computation of the score is linear with – which is expected to be small for each visible target – instead of for the necessary graph traversal without the auxiliary data. Note that without untangling, the graph is expected to grow very fast when more targets enter the scene and many tracklets are detected, hence making the alternative even worse. Due to the overhead incurred by untangling, the auxiliary data of all the descendant vertices must be updated. In the worst case, it will require updating vertices. However, this operation is rarely performed. Moreover, each time it takes place, the size of the graph is greatly reduced. Hence, the amortized complexity of updating the graph is expected to be for each new vertex. A formal proof of this conjecture is left for future study.
We used simulated data as input to our method in order to evaluate the scheduler’s performance independently of the other modules. Simulated data also makes it possible to bypass the limitations of comparing online algorithms on the same real data: each algorithm dictates different zoom-in operations, thus changing the data. We implemented both our method and the simulation in MATLAB.
The simulated scene consists of a set of targets walking on a grid of intersecting diagonal roads. The targets’ velocities (speed and direction), entrance time and location, and the probability that meeting targets start walking together, are determined randomly. All targets have the same low-resolution appearance to increase ambiguity, and low-resolution images are not used for matching.
We present the objective score of our algorithm, (Eq. 1), as a function of the expected ambiguities in the scene. It is computed when the simulation ends and is based on the MSG’s tracklets and on the ground truth. We estimate the ambiguities of the scene by , where is the number of times a solo-walking target, , joins a group and then splits to walk alone again. Note that in practice blind gaps may cause additional ambiguities. For comparison we consider a naïve scheduler  that selects the unlabeled target predicted to leave the scene first. In both cases, we assume that the tracker provides the necessary available information (e.g., whether the face is expected to be captured successfully).
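For reference, the naïve baseline described above amounts to an earliest-exit rule. The sketch below assumes that the tracker supplies a predicted exit time per target in a hypothetical `exit_times` dictionary; it is an illustration of the rule, not the code used in the experiments.

```python
def naive_schedule(exit_times, labeled):
    """Select the unlabeled target predicted to leave the scene first.

    exit_times: dict mapping target id to predicted exit time (hypothetical input).
    labeled: set of target ids that already have a captured face.
    Returns the chosen target id, or None when every target is labeled.
    """
    unlabeled = {t: e for t, e in exit_times.items() if t not in labeled}
    if not unlabeled:
        return None
    return min(unlabeled, key=unlabeled.get)
```

Unlike the proposed scheduler, this rule ignores the graph entirely, so it cannot anticipate join events or prefer labelings that resolve ambiguities.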
A simple simulation of 3 targets and one join/split event (Figure 4(a)) demonstrates a scenario where our scheduler selects the labeled target to be one of the two targets that are predicted by the tracker to join, before this event occurs. Consequently, one additional labeling after the split event untangles the MSG into an ideal graph, and all the trajectories are fully recovered (Figure 4(b)). When our scheduler is used without the untangling process, its final MSG is not ideal and a full recovery is not achieved (Figure 4(c)). The naïve scheduler selects the joining targets for labeling only after they split, thus preventing a full recovery and achieving the lowest (Figure 4(d)).
Figure 5(a) presents an example with a large number of targets and ambiguities. The MSG grows rapidly, but our scheduler achieves untangling at key points (see Figure 5(b-f)), so that the final MSG contains only one remaining ambiguity (Figure 5(g)). The naïve scheduler achieves a significantly lower objective score because its final MSG retains many unresolved ambiguities (Figure 5(h)). Another complex example is presented in Figure 6.
The results on 414 simulations are presented in Figure 7. When is small, the performance of our and the naïve algorithms is close to perfect. When increases, the performance of our method decreases, mainly due to the limited time available to label all the desired targets. However, for moderate ambiguity of , our method still performs well: . The superiority of our algorithm over the naïve one is apparent both for moderate and high . For example, the score of the naïve algorithm obtained for is , which is lower than the worst score of our method, for .
Two components of our algorithm contribute to its superiority over the naïve one. The global view we keep of the system state allows us to associate one or more labeled tracklets of the same target with additional tracklets of that target. Using graph terminology, this corresponds to the untangling and merging of vertices, either by direct labeling or as a byproduct of labeling other targets. In addition, our scheduling method explicitly considers the task of disambiguating tracklet associations, and uses global information of the current state of the system efficiently.
5 Discussion & Future Work
We proposed a method for tracking multiple pedestrians and capturing their faces using a single PTZ camera. The goal of the system is to maximize the length of the labeled trajectories recovered by the tracker. Our main contribution is a novel data structure, MSG, that efficiently utilizes all the available global information of a tracking system. The auxiliary data of the MSG is used for an efficient scheduling algorithm that resolves or prevents tracklet ambiguities and matches tracklets directly or indirectly via target labeling. The MSG may be modified for various applications that use several cameras, with or without overlapping fields of view, when two distinct resolution levels can be used for resolving ambiguities. This is left for future research.
Our method aims to represent and efficiently use the data available from basic components of trackers and recognition systems, most of which are assumed to be deterministic for ease of exposition. It is clearly prone to the expected errors of each of these components.
Our method can be extended to handle a probabilistic setting where each component provides a degree of confidence for its output. This can be integrated into the graph by, for example, associating a weight with each edge. In the current system, X-type edges can represent the output of a probabilistic person-to-person matching algorithm. A threshold on the face-to-face matching confidence may be used for deciding whether to untangle the graph or wait for additional information.
Acknowledgment: This research was supported by the Israeli Ministry of Science, grant no. 3-8700, and by Award No. 2011-IJ-CX-K054, awarded by the National Institute of Justice, Office of Justice Programs, U.S. Department of Justice.
References
- [1] A. Bagdanov, A. del Bimbo, and F. Pernici. Acquisition of high-resolution images through on-line saccade sequence planning. In VSSN, 2005.
- [2] Y. Cai, G. Medioni, and T. Dinh. Towards a practical PTZ face detection and tracking system. In WACV, 2013.
- [3] C. Costello, C. Diehl, A. Banerjee, and H. Fisher. Scheduling an active camera to observe people. In VSSN, 2004.
- [4] C. Costello and I. Wang. Surveillance camera coordination through distributed scheduling. In CDCECC, 2005.
- [5] J. Henriques, R. Caseiro, and J. Batista. Globally optimal solution to multi-object tracking with merged measurements. In ICCV, 2011.
- [6] S. Lim, L. Davis, and A. Mittal. Constructing task visibility intervals for video surveillance. MS, 12(3), 2006.
- [7] A. Morye, C. Ding, A. Roy-Chowdhury, and J. Farrell. Distributed constrained optimization for Bayesian opportunistic visual sensing. TCST, 2014.
- [8] P. Nillius, J. Sullivan, and S. Carlsson. Multi-target tracking-linking identities using Bayesian network inference. In CVPR, 2006.
- [9] J. Prokaj, M. Duchaineau, and G. Medioni. Inferring tracklets for multi-object tracking. In CVPRW, 2011.
- [10] F. Qureshi and D. Terzopoulos. Surveillance in virtual reality: System design and multi-camera control. In CVPR, 2007.
- [11] P. Salvagnini, F. Pernici, M. Cristani, G. Lisanti, I. Masi, A. Del Bimbo, and V. Murino. Information theoretic sensor management for multi-target tracking with a single pan-tilt-zoom camera. In WACV, 2014.
- [12] E. Sommerlade and I. Reid. Information-theoretic active scene exploration. In CVPR, 2008.
- [13] T. Strat, P. Arambel, M. Antone, C. Rago, and H. Landan. A multiple-hypothesis tracking of multiple ground targets from aerial video with dynamic sensor control. In SPIE, 2004.
- [14] J. Sullivan and S. Carlsson. Tracking and labelling of interacting multiple targets. In ECCV, 2006.
- [15] X. Wang, E. Türetken, F. Fleuret, and P. Fua. Tracking interacting objects optimally using integer programming. In ECCV, 2014.
- [16] C. Ward and M. Naish. Scheduling active camera resources for multiple moving targets. In CCECE, 2009.
- [17] Z. Wu, T. Kunz, and M. Betke. Efficient track linking methods for track graphs using network-flow and set-cover techniques. In CVPR, 2011.
- [18] B. Yang and R. Nevatia. An online learned CRF model for multi-target tracking. In CVPR, 2012.
Appendix A
This appendix provides the recursive computation of the two auxiliary quantities of the labeling score of Section 3.5 that were deferred from the main text.
Recursive computation of :
The number of origins of each vertex, , is recursively defined by:
Note that using (given in Eq. 2 of Section 3.3) and , we can also recursively compute the number of labeled origins of :
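The counting recursions of this appendix can be sketched together. The sketch assumes that the base cases are one origin for a labeled vertex (itself) and one for an unlabeled source, that other vertices sum the counts of their parents, and that the number of labeled origins is obtained as the difference between the two counts. The names `parents`, `labeled` and `order` are hypothetical.

```python
def origin_counts(parents, labeled, order):
    """Compute origin counts for every vertex in one pass.

    parents: dict mapping a vertex to a list of its parents.
    labeled: set of labeled vertices.
    order: vertices listed so that parents appear before their children,
           which is the order in which an online graph is built.
    Returns three dicts: all origins, unlabeled origins, labeled origins.
    """
    n_all, n_unl = {}, {}
    for v in order:
        if v in labeled:
            n_all[v], n_unl[v] = 1, 0      # origin of itself, no unlabeled origins
        elif not parents.get(v):
            n_all[v], n_unl[v] = 1, 1      # assumption: unlabeled source
        else:
            n_all[v] = sum(n_all[p] for p in parents[v])
            n_unl[v] = sum(n_unl[p] for p in parents[v])
    # Assumption: labeled origins are the remainder of the two counts.
    n_lab = {v: n_all[v] - n_unl[v] for v in order}
    return n_all, n_unl, n_lab
```

Each value depends only on the vertex and its direct parents, so the counts can be maintained as vertices are appended to the graph.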
Recursive computation of :
Let us first consider the path , where is a labeled origin of . Its contribution to consists of , since is a labeled vertex prior to the labeling of . Consider , that is, the parent of on the path . It is possible to decompose into the sum: . It follows that contributes for each of its possible direct matchings, that is, . In addition, the value consists of the sum of for each of the parents of on possible direct match paths. Hence, can be recursively computed:
Refinement of the computation:
The labeled tracklet of each labeled origin of , , clearly consists of the tracklet represented by itself, . In addition, a labeling of a vertex can sometimes be extended also to label its parents and children. For example, assume that of a target is labeled in Figure 2(b). The tracklet clearly follows for this target, and is therefore an extension of . Formally, let be a labeled vertex with a single compound child, . The tracklets and represent the same target. Hence, can be extended to . As a result, the length of the labeled tracklets is given by . Such a forward labeling extension can be applied recursively to any forward labeling chain, , which is a path from in which each vertex is a single child of its parent. can be estimated more accurately by considering forward labeling extensions of the labeled origins of , as described next.
Let us consider again the path , where is a labeled origin of . We wish to find the contribution of to when considering not only itself but also its forward labeling extension. This contribution excludes the entire extension of , which is labeled prior to the labeling of . The length of the forward labeling extension of , , is therefore subtracted from . That is, the contribution of the possible matching of to is given by . We next describe the auxiliary data needed to compute the refined efficiently.
For each vertex , we define the number of forward labeling chains in which is included, . This value can be recursively computed based only on the vertex itself and its direct parents, as follows:
where the Boolean function determines whether has only one child. The refined recursive computation of (that replaces Eq. 10 above) is given by:
Note that the labeling of a vertex can also be extended backwards, in a manner similar to the forward labeling extension. Both extensions are considered in the experiments for the evaluation of our scheduler, but only the forward labeling extension is useful for the refined computation.