SpaRTA - Tracking across occlusions via global partitioning of 3D clouds of points

by   Andrea Cavagna, et al.

Any 3D tracking algorithm has to deal with occlusions: multiple targets get so close to each other that the loss of their identities becomes likely. In the best case scenario, trajectories are interrupted, thus curbing the completeness of the data-set; in the worse case scenario, identity switches arise, potentially affecting in severe ways the very quality of the data. Here, we present a novel tracking method that addresses the problem of occlusions within large groups of featureless objects by means of three steps: i) it represents each target as a cloud of points in 3D; ii) once a 3D cluster corresponding to an occlusion occurs, it defines a partitioning problem by introducing a cost function that uses both attractive and repulsive spatio-temporal proximity links; iii) it minimizes the cost function through a semi-definite optimization technique specifically designed to cope with link frustration. The algorithm is independent of the specific experimental method used to collect the data. By performing tests on public data-sets, we show that the new algorithm produces a significant improvement over the state-of-the-art tracking methods, both by reducing the number of identity switches and by increasing the accuracy of the actual positions of the targets in real space.



page 2

page 4

page 6

page 8

page 9

page 10

page 11


GReTA - a novel Global and Recursive Tracking Algorithm in three dimensions

Tracking multiple moving targets allows quantitative measure of the dyna...

Clinical Patient Tracking in the Presence of Transient and Permanent Occlusions via Geodesic Feature

This paper develops a method to use RGB-D cameras to track the motions o...

Deep 6-DOF Tracking

We present a temporal 6-DOF tracking method which leverages deep learnin...

A single target tracking algorithm based on Generative Adversarial Networks

In the single target tracking field, occlusion leads to the loss of trac...

Learning to Track with Object Permanence

Tracking by detection, the dominant approach for online multi-object tra...

Learning to Divide and Conquer for Online Multi-Target Tracking

Online Multiple Target Tracking (MTT) is often addressed within the trac...

Performance Measures and a Data Set for Multi-Target, Multi-Camera Tracking

To help accelerate progress in multi-target, multi-camera tracking syste...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Tracking large groups of targets in space is a challenging topic, which is particularly relevant in the field of turbulence [ouellette2006quantitative], collective animal behavior [wu2016global], [cheng2015novel] and social sciences [moussaid2012traffic], [wen2016multi] as well as in robotics [michel2007gpu] and autonomous mobility [ess2010object]. The technological progress of the last decades gave a boost to the development of new experimental strategies to collect

data, such as RGB–D, multicamera, lidar and radar systems. Nowadays the effort of a part of the computer vision community is directed towards finding general high-performance tracking methods.

The crucial point of all tracking algorithms is how to handle occlusions that arise every time that two or more objects get too close in space to be detected as multiple targets. This kind of ambiguities are particularly severe when dealing with featureless objects (objects that cannot be identified by any feature such as shape or color) and with large and dense groups of targets, where the chance to get in proximity is high. Occlusions hinder in a twofold way the quality of the retrieved trajectories: loss of one or more of the targets involved into the occlusions and a potential switch of identities. While the first drawback generally affects at the quantitative level the completeness of our tracking effort, the second one may severely change at the qualitative level the results of any data analysis.

In this paper we propose a novel tracking method called SpaRTA (Spatiotemporal Reconstruction Tracking Algorithm), which is able to solve occlusions identifying each target during the occlusions and producing negligible switches of identities. SpaRTA is meant to work on objects detected as clouds of points, regardless of the system used to collect the data, such as RGB–D, multicamera, lidar or radar. The core ideas of the methods are the following: i) SpaRTA reconstructs the spatio-temporal volume (where is the spatial dimension and represents the time dimension) occupied by each target during the entire acquisition as a cloud of points; ii) when an occlusion arises, SpaRTA tackles the problem of splitting it into different objects by defining a partitioning problem that uses both attractive and repulsive links depending on the distance in space and time among the points belonging to the occlusion; iii) as the superposition of attractive and repulsive links gives rise to frustration, namely to the emergence of many local minima of the partitioning cost function, SpaRTA uses an optimization method inspired on Semi-Definite Programming (SDP) techniques developed in the context of statistical physics of disordered systems [ricci2016performance], to find the optimal partition (i.e. ground state of the cost function), thus finally splitting the occlusion into the actual different targets composing it.

SpaRTA was tested on data of large groups of animals collected in the field with a multicamera system. This kind of data are a good benchmark for tracking methods because they are particularly hard to track: they are characterized by frequent occlusions, lasting several frames, and by a low spatial resolution such that targets appear as objects without any recognizable feature. The only limitation of these data is that the production of ground truth trajectories to evaluate the tracking result is quite difficult and time–consuming, and it is then hard to give a quantitative evaluation of the quality of the resulting set of trajectories. This is the reason why there are very few public data-sets to be used as benchmark. To the best of our knowledge, the only available two public data-sets of featureless objects collected via a multicamera system are published in [wu2014thermal]. We tested SpaRTA on these data-sets showing the high performance of the proposed algorithm in terms of the quality of the retrieved trajectories.

2 Related works

Since the seminal work of Reid [reid1979algorithm], several tracking strategies have been proposed in the past forty years. However, despite this strong effort, only few methods are designed to track large and dense systems of featureless objects and the research on this topic is still very much ongoing, especially for what concerns the solution of occlusions. There are two fundamentally different ways to represent the targets an algorithm wants to track: on one hand, we can associate to each target at a certain instant of time one single identity and spatial position (typically the baricenter) – we will call this case Single Point (SIP) representation; on the other hand, we can associate to a target a dense Cloud of Points (COP), representing the the full spatial volume of the target at that instant of time, see Appendix A.

The SIP representation is typically adopted within the context of multicamera data-taking systems, in which sets of single objects positions have to be turned into positions and trajectories. One way to achieve this is to first track the objects in each camera and then match the trajectories across cameras to retrieve the corresponding trajectories (the so-called Tracking-Reconstruction (TR) route). Working in the space of the cameras, TR algorithms have to deal not only with occlusions but also with occlusions, namely ambiguities due to targets getting in spatial proximity only on the image plane but not in the space. occlusions produce bifurcations of the trajectories, and hence a high intrinsic complexity due to the proliferation of the trajectories to be matched across the cameras. Several different strategies to prune the set of trajectories and reduce the complexity of this approach have been devised [attanasi2015GReTA], [cox1996efficient], [wu2011efficient], [wu2011automated], [wu2016global] and [liu2012automatic]. Conversely, yet still within the SIP representation, one can first reconstruct single objects turning them into objects (by matching their identities across the cameras), and then track them in space (the so-called Reconstruction-Tracking (RT) route). Working directly in the space, RT methods are not affected by occlusions that are naturally solved when matching objects across the cameras (see Appendix B), hence their complexity is naturally quite lower than the TR methods; however, RT strategies are typically more prone to creating false objects. The SIP-RT approach has been explored in small groups of objects [tyagi2007fusion] [li2002relaxation] [dockstader2001multiple] [cheng2015novel], and it is not clear how these approaches perform in case of dense groups. The most advanced SIP-RT method has been proposed in [wu2009tracking], which can successfully track large groups.

The SIP representation has some drawbacks, especially in dense systems, where occlusions are frequent. More specifically, whenever two or more objects are part of a single occlusion, the SIP method associates one single position to all of them, causing: i) loss of the actual targets individual identities; ii) potential identity switches after the occlusion; iii) an inaccurate positioning of the targets in space (see Appendix A).

The COP representation, on the other hand, allows to associate to each object (at each instant of time) a dense cloud of points and not only its baricenter. This dense representation reconstructs the actual volumes occupied by the detected objects. Because COP discards no information about the actual targets volumes, and therefore creates no simplified identities, it is more suited to prevents identity switches (problems i) and ii) above) and definitely more accurate to locate the positions of the targets. Besides, COP-based tracking (unlike SIP) is not forcibly embedded within a multicamera framework and it is therefore a significantly more general approach than SIP: indeed, COP tracking has been used in RGB–D systems [RGBD-Overview2013], lidar [Lidar-Asvadi2016] and radar [RADAR-Mobus2003]. In the COP context, occlusions have been tackled by different techniques, [RGBD-2012], [ZHANG2013126], [Lidar-Choi2013], [RADAR-2004], which, however are designed for the specific nature of the data to be tracked. Here, we will introduce a novel COP-based tracking method designed to be as general as possible.

3 Overview of the method

SpaRTA works with (space + time) clouds of points representing the volume occupied in time by a group of moving objects, without any limitation on the system used to collect the data111Note that when using a multicamera acquisition method, data are not directly obtained as clouds of points (unlike data acquired with RGB-D, lidar or radar systems); hence, in that specific case, targets images are converted into clouds of points through a pre–processing procedure as the one described in [cavagna2016towards] and in Appendix C.. The goal of the algorithm is to partition the cloud of points in sub–clouds, each corresponding to the trajectory of a single target.

Fig. 1: Clusters graph. Grey circles represent the clouds of points that are then dynamically linked. The connected components are highlighted in different colors. The first three components from the left represent single trajectories, since they are made of only one–to–one linked clusters. The last three components are, instead, ambiguous because they have at least one node with more than one link from the past or to the future.

SpaRTA can be broken into the following steps:

1 – Building the graph. The cloud of points is first clustered in space at a static level (fixed instant of time): a clustering algorithm based on the nearest neighbor distance [Rokach2005] is used to detect the well–separated dense cloud of points (clusters), which may represent the detected objects at each instant of time. These reconstructed clusters are then connected in time through a dynamical linking procedure, see Section 4.1 and Appendix D for further details. In this way, we create a set of clouds of points representing the volumes occupied by the objects during the event, actually building the graph shown in Fig. 1 with clusters as nodes and links as edges.

2 – Tackling the occlusions. A breadth–first search routine [hopcroft1973algorithm] is used to identify the connected components of the clusters graph, which should represent the trajectories of the detected targets. In the ideal situation where occlusions do not occur, each connected component is made of one single target at each instant of time, therefore it is made only of one–to–one linked nodes, see Fig. 1. However, in the more realistic situation where occlusions do occur, two or more objects may belong to the same connected component, sharing one or more nodes, as in the last three cases in Fig. 1. These connected components, with at least one multi-link, are due to occlusions, and they must be solved. The philosophy of SpaRTA is to break up the ambiguous connected components into different partitions, each corresponding to the trajectory of a single actual target, by defining and solving an optimization problem. First, a graph is built, whose nodes are the points belonging to the occlusion cluster and whose links depend on the distance/proximity in space and time of these points. The crucial idea, here, is to use both attractive

(positive) links connecting points that are close enough to suggest high probability of belonging to the same actual target, but at the same time to penalize with

repulsive (negative) links pairs of points that are too far from each other, compared with intrinsic space-time scales of the data. Once the graph is built, SpaRTA defines a cost function given by the negative sum of all links in each candidate partition, in such a way that the global minimum of this function corresponds to the optimal partitioning of the occlusion into bona fide targets. The presence of both attractive and repulsive links is crucially functional in associating the correct partition to the actual targets; on the other hand, using competing links is known to increase steeply the complexity of an optimization method, by creating a proliferation of sub-optimal solutions (local minima) [mezard2009information]. To deal with this problem, SpaRTA uses an optimization routine inspired on Semi-Definite Programming (SDP) techniques, that are known to work efficiently in disordered systems whose complexity is severe [ricci2016performance], [javanmard2016phase]. See Section 4.2 for details.

3 – Identifying 3D trajectories. Each non–ambiguous connected component identified in Step 2 above, represents the volume occupied by a single target during the dynamics of the system. However, for many practical purposes it is not convenient to work with volumes, which may be hard to be handled, and it is more desirable to associate a single position to each object at each instant of time. Thus, we associate to each cluster its baricenter position, i.e. average coordinates of the cluster points, and we define the trajectories as the time sequence of the baricenter coordinates. Notice that this is not the same as defining objects positions through their baricenter in the SIP framework, because here all occlusions have been already solved, hence the baricenter is indeed a fair tool to locate the target position; on the contrary, baricenters fail in SIP whenever one object corresponds to several targets into an occlusion.

4 Method details

In this section we describe in detail how we handle the cloud of points to build the graphs and how we solve occlusions.

4.1 Building the graph

The cloud of points is first analyzed at a static level to identify well–separated clusters, which may represent single objects or multi-objects during an occlusion. To this aim, we use a standard clustering algorithm based on the nearest neighbor distance [Rokach2005]: two reconstructed points, and belong to the same cluster, , if their distance is smaller than , with equal to the median of the targets nearest neighbor distance. The complexity of brute force implementation of this procedure is , with being the average number of points at each frame, but we lower it to by using the space–partitioning technique of [turk1989interactive].

Fig. 2: Cluster graph construction: dynamic linking. At a generic frame , a point–to–point multi–linking procedure is performed: at time is connected to and at frame , while at frame is connected with and . These point–to–point links are then used to define cluster–to–cluster links: two clusters are connected if there exists at least one point–to–point link between points belonging to the two clusters. Therefore, will be linked to both and (the two points and belong to and they are both linked to points belonging to and ). On the opposite, does not receive any link from the past, because none of its points receive a point–to–point link.

Once all the clusters of points are created at each instant of time, we need to dynamically link points at subsequent instants of time, actually building the graph. We have to be careful doing this, because missing dynamical links may result in fragmented trajectories, while extra–links increase the connectivity of the graph, creating false occlusions and making the solution of the problem hard. We define point–to–point dynamical links using a dynamical proximity method whose only assumption is that each point moves with a constant velocity between two consecutive instants of time, see Fig. 2. Note that the constant velocity assumption is reasonable when working on data collected at a high frame–rate, as the ones used to test SpaRTA, but it may need some refinements in a more general cases. Once we have built point–to–point dynamical links, we use them to define cluster–to–cluster dynamical links: two clusters and are connected in time if there exists at least one point–to–point link between a point and a point , see Fig. 2. For the sake of clarity, we omit here to describe all the details of the temporal linking procedure and we refer the interested reader to Appendix D.

4.2 Tackling the occlusions

In this section we analyze in detail how we solve the ambiguous connected components, i.e. the occlusions. Here is the core of the method, which overcomes the occlusions yet keeping the identities of the objects involved.

Fig. 3: 3D occlusion. A –shape connected component representing a occlusion. The two objects are well–separated for the most part of the event, but they share the same cluster during the occlusion. The branches of the represent the two different trajectories of the two objects before and after the occlusion, which is represented as a double grey circle at the center of the . During the SDP procedure the analysis of the ambiguous component is restricted to a quite short interval, highlighted in pink.

For the sake of simplicity we restrict our attention only on connected components made of two objects in a occlusion for one or more frames. The more general case of more than two trajectories belonging to the same connected components can be reduced to the simplest one solving each occlusion in a restricted frame range such that only two objects at the time are involved (under the mild assumption that no more than two objects can be involved in the same occlusion at the same time).

In an ambiguous connected component, the trajectories of the objects (involved in a occlusion) are well–separated for the most part of the event, sharing only few clusters, just during the occlusion. Therefore an ambiguous connected component due to a occlusion has the –shape shown in Fig. 3, with the branches of the representing the trajectories of the two objects before and after the occlusion, which is instead the centre of the . The two occluded objects are not distinguishable and they are detected as one cluster only. The goal of this step is then to identify and separate the volumes occupied by the two objects during the occlusion, i.e. to split the clusters in the two subsets representing the volumes of the two distinct objects.

To handle this situation we switch back from clusters to cloud of points, representing the ambiguous component as a graph with its points as nodes connected by links carrying static (equal time) and dynamic information (consecutive frames). Following the literature about graph bi–partitioning techniques [boykov2004experimental], we address the partitioning as an energy minimization problem. Therefore, we associate an energy, , defined as follows:


where and are two different points of the graph, identifies in which partition the point belongs and is the a coefficient associated to the pair of points and : because we want to minimize , and it has a minus in front of the sum, will be larger the higher the probability that and belong to the same partition. Clearly, a sensible definition of

is of paramount importance and some heuristics is inevitable in the choice. Despite this, there is a general principle that has been key to solve the problem: not only we must use a

positive coefficient when it is highly likely that and belong to the same partition, but it is also essential to assign a negative coefficient when it is likely that and belong to different partitions. Finally, it is reasonable to have a zero value of whenever it is unclear what is the likely fate of and . To implement this scenario, we use the simplest rule, which amounts to link the coefficients to the spatio-temporal proximity of the points and ,

4.2.1 Static coefficients

We first define the coefficients between points at the same instant of time. We statically connect two points, and , at a mutual distance , with the following weight:


is the median nearest neighbor distance, which is also the only natural unit of length of the system of points, hence all other lengths will be measured in units of to make all coefficients dimensionless; , on the other hand, is the

Fig. 4: 3D occlusion: static linking. a. The static coefficient, between two points and as a function of their mutual distance , with and of the specific occlusion shown in Fig. 5. b-d. Points belonging to the same frame are strongly connected if they are at a very short mutual distance (static positive links), while they are strongly disjointed, through a high but negative weight, when at a large mutual distance, both within the same cluster (static negative links) or in two different clusters (static negative links between clusters).

median size of the reconstructed clusters at that specific instant of time, ; is the Heaviside function. The actual shape of in Eq. (2) as a function of the space distance is depicted in Fig.4: the idea is to strongly and positively connect those points which are likely to belong to the same cluster, i.e. with a mutual distance smaller than , and to negatively connect those points which are likely to belong to different clusters, i.e. at a mutual distance bigger than the usual size of the objects; finally, all points with an uncertain distance, namely , have a non-committal . Hence, the exponent simply rules how sharp is the elbow around , and its value does not impact significantly on the results as long as (we use ; note, though, that any other sharp decay would do the job.

4.2.2 Dynamic coefficients

Using an identical philosophy, we dynamically connect two points, at time and at time , with the following weight:




is the distance between the extrapolated position of at time , namely , and the position of point at time . The displacement is linearly extrapolated from past frames under the assumption that each point moves at a constant velocity between two consecutive frames. Notice that once again we have used the median nearest neighbour distance as the natural length scale of the system to make the coefficient dimensionless. The meaning of these links is to strongly connect, in time, those pairs of points belonging to the same dynamic cluster, namely the points belonging to the same branch of the ambiguous shape (see Figs. 3 and 5), which are likely to belong to the same partition.

4.2.3 Graph partitioning

In order to find the partitioning which minimizes the energy in Eq. 1

one could use standard approaches, such as integer linear programming or Montecarlo techniques; these, however, are known to fail to find the correct solution (ground state) when there are many local minima of

, which is the case of the present problem. Therefore, we choose to approach the problem by using a more robust algorithm based on Semi-Definite Programming (SDP), whose details can be found in [ricci2016performance], and which successfully finds the absolute minimum of complex energy functions even in presence of multi-minima landscapes.

Once the occlusion is solved through the SDP procedure, the ambiguous –shaped component is divided into two partitions, connected both in space and in time, as shown in Fig. 5. The occluded potential clusters are actually split into two sub–clusters representing the volumes occupied by the two objects during the occlusion. In this way, we successfully solve the occlusion retrieving the two trajectories and maintaining the identities during the entire event. Ideally the optimization should be applied to the entire X–shape component, but this would require high computational resources and high computational time, since the minimization of the energy in Eq.(1) is an NP–hard problem. Hence, in order to reduce the number of variables of the problem, we restrict the optimization to a shorter interval of time, namely from few frames ( in the specific case of the tests presented in Section 5) before the occlusion starts, to few frames after the occlusion ends, as shown in Fig. 3 and Fig. 5 (left panel) where the interval to be analyzed is highlighted in red.

Fig. 5: 3D occlusion solution. An example of a occlusion from the Davis-08 dense dataset [wu2014thermal] used to test the method. On the left: the -shape component, with the occlusion and the sub–cloud analyzed with the SDP technique highlighted in red. On the right: the same component after the solution of the occlusion, with the two separated trajectories highlighted in blue and green. In the inset: the clusters of the two different targets once the occlusion is solved.

5 Experiments and discussion

We tested the method on two public datasets [wu2014thermal] of Brazilian bats colonies emerging from a natural cave in Texas (see Fig.6), acquired with a system of three synchronized high–speed cameras. To the best of our knowledge these are the only public datasets of featureless objects, but they are not in the form of cloud of points. Therefore SpaRTA cannot be directly applied to the data, but a pre–processing procedure on the images is needed. We used standard computer vision techniques to detect the targets of interest in the images and to reconstruct the corresponding clouds of points matching the information across the three cameras, see Appendix C and Appendix E for a detailed description of the pre–processing procedure and refinements of the method needed when working on multicamera data.

5.1 Evaluation metrics

The set of trajectories retrieved by SpaRTA, see Fig.6, is compared with the set of groudtruth trajectories (published together with the two datasets in [wu2014thermal]) and its quality is assessed using the standard CLEAR MOT evaluation method proposed by K. Bernardin and R. Stiefelhagen in [bernardin2008evaluating] and the metrics described in [wu2014thermal]. The most important metrics in this context is the so-called Multiple Object Tracking Accuracy parameter (MOTA); this quantity combines three observables, namely: i) the number of false–positive objects (reconstructed objects not belonging to the groundtruth set); ii) the number of missing objects (objects belonging to the groundtruth but not reconstructed by the algorithm); iii) the number of identity–switches (reconstructed trajectories switching between two different groundtruth identities).

Dataset Algorithm Class MOTA IDS MT ML FM
() () () () ()
MHT SIP-RT       
Davis–08 sparse SDD-MHT SIP-RT     
Davis–08 dense SDD-MHT SIP-RT  
GReTA* SIP-TR     
TABLE I: Comparison of the quality of the trajectories retrieved by SpaRTA and the algorithms: MHT, SDD-MHT [wu2014thermal], CP(LDQD) [wu2016global] and GReTA [attanasi2015GReTA] on the public datasets labeled Davis-08 sparse and Davis-08 dense published in [wu2014thermal]. In the Table we report the MOTA (Multiple Object Tracking Accuracy) and also: the number of switches of identities (IDS), the percentage of mostly tracked (MT) and mostly lost (ML) trajectories corresponding to groundtruth trajectories which are correctly reconstructed respectively for more than the and for less than the of their time length, the number of tracks fragments (FM) corresponding to the number of times that a groundtruth trajectory, correctly reconstructed, is interrupted. A perfect tracking algorithm produces MOTA , IDS , MT , ML , FM . In order to compute a match between groundtruth and reconstructed trajectories we chose a miss/hit threshold equal to 0.3m as suggested in [wu2014thermal].
GReTA*: The results obtained by GReTA presented in this table are not the ones published in [attanasi2015GReTA]. This is because, performing the quality evaluation of our new algorithm SpaRTA, we found a shift of one frame in the annotated file of the dataset published in [wu2014thermal]. We evaluated SpaRTA using the annotated file, but taking care of the time shift and for coherency we also updated the results obtained by GReTA.

MOTA can be interpreted as the fraction of groundtruth objects correctly reconstructed by the tracking method; its ideal value is equal to (note, however, that – weirdly as it may seem – MOTA can have negative values when the number of false–positive plus miss–reconstructed objects plus identity switches exceeds the number of groundtruth objects, quite clearly a rather bad scenario). The second most important metrics is IDS (Identity Switches), which identifies how many times the identities of different actual targets are switched, this is also a rather crucial parameter, as (the consequences of IDS in data analysis are often the most severe).

The other evaluation parameters are the following: MT (Mostly tracked) – fraction of groundtruth trajectories correctly reconstructed for more than the of their time length; ML (Mostly Lost) – fraction of groundtruth trajectories that are correctly reconstructed, but for less than the of their time length; FM (Fragmentation) – corresponding to the number of times that a groundtruth trajectory, correctly reconstructed, is interrupted. It was not possible to evaluate the last parameter MOTP (Multiple Object Tracking Precision), which measures the average distance in the space between the groundtruth and the reconstructed objects, because the dataset do not give the actual

positions of the targets but their estimates based on the SIP approach used by the author in

[wu2014thermal]. Hence, with the MOTP we would have not evaluated the precision of our method, but only the difference between the positions obtained by SpaRTA with the ones obtained in [wu2014thermal].

5.2 Results

The performance of SpaRTA is reported in Table I and compared with four other methods: MHT [wu2014thermal], SDD–MHT [wu2014thermal], CP(LDQD) [wu2016global] and GReTA [attanasi2015GReTA]. The difference between the two datasets is that one of them is sparse, and the data sequence is rather long ( frames), while the second dataset is dense, with a far shorter sequence ( frames), see Fig.6. The tests made on SpaRTA were run on a laptop equipped with a 3.3 GHz i7 Intel processor and with 16 GB of RAM. Moreover, SpaRTA is implemented in C++ using the OpenCV library [opencv_library].

The comparison with the state–of–the–art methods shows the high performance of SpaRTA, especially in terms of tracking accuracy (MOTA) and number of identity switches (IDS), with satisfactory values of mostly tracked (MT) and mostly lost (ML) groudtruth trajectories, but with a high value of fragmentation (FM) 222The high number of FM in SpaRTA is mainly due to miss–detection of the targets on the images during the pre–processing procedure, which was not optimized for this dataset. When a miss-detection occurs the corresponding object cannot be reconstructed, causing the interruption of the correspondent trajectory. This issue can be improved by developing a detection algorithm designed for this dataset; this, however, goes beyond the scope of the present work..

Fig. 6: Dataset and results. First row: on the left an image from the Davis-08 sparse dataset and on the right a sample of the trajectories, each with a different color, of the same dataset reconstructed by SpaRTA. Second row: on the left an image from the Davis-08 dense dataset and on the right a sample of the trajectories reconstructed by SpaRTA

On the sparse dataset, SpaRTA outperforms the two SIP–RT methods (MHT and SDD-MHT) in terms of MOTA and IDS, but with a lower number of MT and higher values of ML and FM. Hence, the set of trajectories retrieved by SpaRTA offers a lower coverage of the groundtruth set, but it is better from a qualitatively point of view providing a more accurate set of trajectories for the data analysis, thanks to the high value of MOTA and low number of IDS and consequently a negligible number of wrong trajectories. Moreover, the comparison of SpaRTA with the two SIP-TR methods, CP(LDQD) and GReTA, shows that the performance of the three methods are comparable, with SpaRTA offering the best balance between the value of MOTA and the number of IDS: CP(LDQD) produces the highest value of MOTA (), comparable with the one obtained by SpaRTA (), but with a higher number of IDS, while GReTA produces the lowest number of IDS (), comparable with the ones produced by SpaRTA (), but with a lower value of the MOTA parameter.

On the dense dataset, SpaRTA exhibits the best performance in terms of MOTA (), even though with the highest number of ML 333The high number ML is a consequence of the formation of ”ghost” trajectories (trajectories not representing existing targets), intrinsic to all RT methods as explained in detail in Appendix E. In SpaRTA these trajectories are identified from their time length and removed from the retrieved set of trajectories, so that short groundtruth trajectories may be mislead for ghost and for this reason thrown away.. Despite this high value of ML, SpaRTA shows an excellent result in terms of MOTA, particularly outstanding when compared with the two SIP–RT methods that produce low value of MOTA, even negative in the case of the MHT algorithm, thus implying that the number of false positive and missing reconstructing objects exceed the number of correctly reconstructed targets. Moreover the low number of IDS (), second only to GReTA, together with the high fraction of mostly tracked trajectories () comparable with the highest ratio of MT achieved by CP(LDQD), confirms the high–performance of our method that offers the best balance between the evaluation parameters, as it happens for the sparse dataset.

6 Conclusions

We proposed a tracking method, SpaRTA (Spatiotemporal Reconstruction Tracking Algorithm), designed to track large and dense group of featureless objects, without any specific prerequisites on the system used to acquired the data. SpaRTA works with dense clouds of points representing the volumes occupied in time by the targets of interest. Each cloud of points is partitioned in spatio-temporal connected components, corresponding to the trajectories of single individuals in the group. The method is designed to handle efficiently the ambiguities stemming from occlusions, i.e. objects getting too close in the space to be detected as separated entities; to this aim SpaRTA employs an optimization routine inspired on Semi–Definite Programming (SDP) techniques introduced in the field of statistical mechanics [ricci2016performance]. Apart from using in a proficous way for the first time SDP in the computer vision context, the true core of SpaRTA is the novel way in which the spatio-temporal graph is built: the key idea is to define an energy (or cost) function based on the use of both attractive and repulsive links between points within the cloud, in such a way to separate with relatively little numerical effort the ambiguous cases by minimizing such energy.

SpaRTA was tested on two public datasets, [wu2014thermal], producing high quality results in terms of correctness of the trajectories, evaluated through the standard quality parameter MOTA, and producing a low rate of identity switches. The retrieved trajectories were compared with the four state–of–the–art tracking methods MHT [wu2014thermal], SDD-MHT [wu2014thermal], CP(LDQD) [wu2016global] and GReTA [attanasi2015GReTA]. The greatest advantage of SpaRTA over the other methods is an outstanding MOTA, combined with an excellent IDS (second only to GReTA) and a very good MT. This means that, not only SpaRTA is able to achieve a quantitatively satisfying coverage of the set of trajectories, but it does that with the lowest number of false positives and a remarkably low number of identity switches compared to most other methods. At the level of data analysis, this kind of performance guarantees that the completeness of the data coverage is not jeopardized by a qualitative disruption of the results, due to the severe consequences of having wrong trajectories.


This work was partly supported by European Research Council, Proof of Concept Grant n. 713651. We thank I. Giardina and M. Viale for the advice and the fruitful discussion on the new tracking strategy.

Appendix A Multicamera data: SIP vs COP approach

When working with multicamera data, there are two fundamentally different ways to represent the targets an algorithm wants to track: on one hand Single Point (SIP) representation, where each target at a certain instant of time is associated to one single identity and spatial position (typically the baricenter); on the other hand dense Cloud of Points (COP) representation, where each target at a certain instant of time is associated to a spatial volume.

Fig. 7: Standard 3D reconstruction of occluded targets. On the left a occlusion: the blue and the green targets are occluded and detected as only one object (highlighted with a red contour in the scheme), in Camera 2 where they are both associated with the same point (black dot at the centre of the detected object), while they are associated to different points in the other two cameras. At the time of the occlusion, the two targets are reconstructed in the space as two different points (red circles), but the two positions do not correspond to the real baricenters of the two targets. As a consequence, the retrieved trajectories (dashed black lines) do not produce an accurate approximation of the real ones (solid black lines). On the right a occlusion: the blue and the green targets are occluded in all the cameras (highlighted with a red contour in the schemes of the three cameras). In each camera, the two targets are detected as only one object (black dots in the scheme) and they are reconstructed with the same position (red circle), which lies between the two real baricenters. The two targets are then not separated in the space and the two reconstructed trajectories (dashed lines) are not an accurate approximation of the two real trajectories (solid lines).

In the SIP state–of–the–art multicamera tracking methods, targets of interest are detected in the images through a segmentation routine and they associate to each detected target a single coordinate, which is generally the target baricenter on the images. Regardless the approach used for tracking, target baricenters need to be matched across cameras to retrieve the correspondent point, which in principle should be the baricenter of the target.

The assumption behind this procedure is that the baricenter of an object corresponds to the baricenters of its images, which is reasonable when dealing with not–occluded targets but not in a general case where and occlusions occur. Indeed, as shown in the left panel of Fig. 7, when targets (the green and the blue) are occluded in one camera only, the occluded objects are associated to the same position in the camera where the occlusion occurs, while they correspond to different coordinates in the other cameras. The information from the two cameras where the occlusion do not occur make the occluded targets to be reconstructed in two different positions (highlighted with red dots in the figure), but with a poor accuracy: both the two positions do not correspond to the target baricenters. The COP representation, on the other hand, allows to associate to each object (at each instant of time) a dense cloud of points and not only its baricenter. This dense representation actually reconstruct the volumes occupied by the detected objects, even during a occlusion; their positions are computed using the entire cloud and they actually correspond to the baricenters of the two separated objects, thus giving a better approximation of the real position of the target.

A more severe limitation of SIP representation concerns the solution of occlusions. In this particular situation two objects are occluded in all the cameras, as shown in the right panel of Fig. 7. Indeed the association of each detected target with only one position (the red dot in Fig. 7) implies that the same coordinate is associated to multiple targets during the occlusion, with the consequent loss of identity for all the targets involved. Instead in the COP representation, when two or more targets get in a occlusion, their total volume may be split in sub–parts each corresponding to the volume of one single object, as SpaRTA does with the use of the SDP procedure described in Section 4.2 of the manuscript. Therefore targets identities are not lost anymore and the baricenters of each target may then be computed on the correspondent sub–cloud with high–accuracy.

Trajectories retrieved with SIP representation are then less accurate in the computation of the positions of the targets (in the case of both and occlusions) than trajectories retrieved with COP representation and they are also more prone to identity switches after a occlusion occurs because occluded targets are reconstructed as only one single object.

Appendix B Multicamera data: Occlusions

One of the most spread experimental method for tracking is represented by acquire data with a system of multicamera, namely two or more cameras shooting synchronously images of the same group of moving targets.

Fig. 8: Optical occlusions. a) occlusions: two objects are well–separated in the space (the blue and the green clouds of points) but occluded in one of the three cameras and they are reconstructed as two well–separated clouds of points. b) occlusions: two objects are in proximity and therefore occluded in all the three cameras, so that they are reconstructed as only one cloud of points.

The main issue of all tracking methods is how to handle occlusions, which arise everytime that two or more objects get too close to be detected as multiple identities. In the specific case of data collected with multicamera systems two different kind of optical occlusions may occur: 2d–occlusion which arise when two or more objects, separated in the space, are too close to be individually detected in the space of one camera only, and 3d–occlusions which arise when multiple objects are occluded in all the cameras simultaneously, namely when multiple objects get in proximity, as shown in Fig. 8.

As outlined in Section 2 of the manuscript, state–of–the–art tracking approaches on data collected via multicamera systems may be divided into two main classes: tracking–reconstruction (TR) and reconstruction–tracking (RT) methods. TR algorithms address the problem in the space of the cameras: objects are first tracked in each camera, and trajectories are then matched across the cameras to retrieve the corresponding trajectories, hence they need to handle both and occlusions. occlusions make the tracking in the space of the cameras ambiguous and all the TR algorithms have then to be focused on how to cope with this kind of occlusions, solving the ambiguities and at the same time keeping the computational resources at an acceptable level. On the other hand, RT methods, to which SpaRTA belongs when applied to multicamera data, address tracking directly in the space. Therefore occlusions are naturally avoided, because they use the information of the cameras where the occluded targets are well–separated, and they have to deal only with occlusions.

A better explanation of the conceptual difference between and occlusions is shown in Fig. 8: in panel a) two targets (the green and the blue) well–separated in space are detected as only one object in the plane of the first camera, while they are well–separated in the other two cameras. The use of the information of the three cameras guarantees that they are reconstructed as two different and separated objects. In panel b) the two targets are, instead, seen as only one object in all the three cameras, their identities are not distinguishable anymore (because none of the cameras has the information on their multiple identities) and they are reconstructed as only one object, highlighted in red.

When applied to data collected with a multicamera system, as in the tests presented in Section 5 of the manuscript, SpaRTA follows an RT approach since it addresses the tracking problem directly in the space and thus, as all the RT methods, it is only affected by occlusions.

Fig. 9: Cluster graph construction: dynamic linking. Blue, green and orange circles represent reconstructed clusters at different instant of time. At a generic frame , a point–to–point linking procedure is performed. Point–to–point links are then used to define cluster–to–cluster links. Point–to–point linking: given the position of a point at frame , its position at the next frame,

, is computed using the velocity vector associated to the link received from the past by the cluster to which

belongs. is then connected to all the points at a distance from smaller than a certain threshold. In the example on the left, is connected to all the points in the region highlighted in red, which belong to the same cluster. In the example on the right, is connected with all the points in the two red regions, belonging to two different clusters. Cluster–to–cluster linking: two clusters are connected if there exists at least one point–to–point link between points belonging to the two clusters. In the example on the left, the blue cluster at frame is connected only with the blue cluster at frame . The velocity vector associated to this link is then computed as the spatial displacement between the baricenters of the two clusters. In the example on the right, instead, the green cluster at time is connected with both the green and the orange cluster at frame . The velocity vectors associated to these links are computed as the spatial displacement between the baricenter of the sub–clusters highlighted in red. This refinement guarantees high quality of the dynamic links even when an occlusion occurs.

Appendix C From stereo–images to (3d+1) cloud of points

In this Appendix we will give the details on the procedure we use to reconstruct the cloud of points from a set of images collected with a system of synchronized high–speed cameras. We applied this procedure to test our new algorithm, SpaRTA, on the public benchmarks in [wu2014thermal], which are collected with three synchronized cameras. For this reason in this Appendix we will refer to systems of three synchronized cameras, but the entire procedure may be easily generalized to any multicamera system.

In order to obtain the cloud of points from the images, we perform the following steps:

1 – Segmentation. The goal of the segmentation routine is to identify the active pixels in the images, i.e. pixels representing the objects to be tracked. Therefore this step strictly depends on the objects appearance and on the camera system used. The results of the entire tracking algorithm may be heavily affected by segmentation errors, since miss–detected as well as false detected objects may lead to fragmented or, even worst, to ghost trajectories, i.e. not corresponding to any real object. Therefore, it is necessary to design the segmentation routine for each specific dataset. In this particular case, we designed the segmentation routine as a standard background subtraction over a sliding window as suggested in [sobral2014comprehensive], since SpaRTA was developed to track featureless objects using monochromatic images, characterized by high color contrast between the moving objects and a still background, i.e. dark objects over a light background or vice versa.

2 – Pixels matching. At each instant of time, the sets of active pixels, , and (one for each camera) identified at the previous step, are analyzed to find the triplets of points (pixels) , projections of the same point in the three cameras. A triplet is considered a good match if its reprojection error, see [hartley2003multiple], is smaller than a threshold, that we choose equal to 1.5px. In principle we should reconstruct and check all the possible triplets, with being the average number of active pixels in each camera. However, using the trifocal constraint [hartley2003multiple] the number of triplets to be checked may be reduced to .

3 – (3D+1) cloud of points creation. All the triplets created at the previous step are reconstructed in the space, using a standard DLT reconstruction procedure [hartley2003multiple], thus forming a cloud of points which represents the volume occupied by all the targets during the entire event of interest.

Appendix D Dynamic linking procedure

This procedure is meant to identify and link in time those clusters corresponding to the same object at subsequent instants of time, actually building the clusters graph. This is the most delicate step of the method, because it can affect the quality of the result of the entire procedure: missing links may indeed result in fragmented trajectories, while extra–links increase the connectivity of the graph, creating fake occlusions and making the solution of the problem hard.

The idea is to use both spatial and temporal information using the duality between an object as a dense cloud of points and an object as a cluster. Indeed, we first define point–to–point links, using the only assumption that each point moves with a constant velocity between two consecutive instants of time. Point–to–point links are then used to define cluster–to–cluster links: two clusters and are connected in time if there exists at least a point–to–point link between a point and a point , see Fig. 9.

The first instant of time, , is quite special because we cannot use any dynamic information from the past. Therefore for this special case, we use a procedure based only on the distance between clusters.

In particular, denoting by with the clusters identified at time and by with the clusters identified at time :

  1. we compute the baricenters, and , of each cluster at time and at time , as the average position of the points belonging to those clusters;

  2. we perform a one–to–one matching, via a standard Munkres algorithm [munkres1957algorithms], between clusters at time and based on the distance between the cluster baricenters. For each matched pair of cluster we add a link labeled with the vector velocity defined as:


    where with we denote the velocity of the object at frame .

Instead, for a generic instant of time, , we use spatial information considering the clusters as dense clouds of points and performing at first a point–to–point linking procedure, based on the assumption of constant velocity between two consecutive instants of time. These point–to–point links are then used to define cluster–to–cluster links and therefore to build the graph in Fig. 1 of the main text.

In particular, denoting by and the clusters identified at time and respectively, by the velocity of cluster , see Fig. 9:

  1. we compute the predicted position, , of each point as:

  2. we connect to all those points , such that:


    with chosen as the median nearest neighbor distance, .

  3. we define cluster-to cluster links in such a way that two clusters and are temporally linked if at least one point is linked with a point .

Once all the clusters with links from the past are analyzed, we consider the set of those clusters without any link from the past at frame and and we perform a one–to–one match between the two subsets, following the same procedure described above for the first instant of time, .

Appendix E Ghost points formation and trajectories removal

Fig. 10: Ghost objects formation. The blue and the green objects, namely and are reprojected in the two cameras as and . The pair is a good match because the two pixels are the image of the same point , and the pair is a good match as well. The two pairs and do not represent any object of interest but they are good matches from a epipolar perspective, see [hartley2003multiple]. The pixel matching procedure will then reconstruct the two correct objects and and the two ghost objects and . The introduction of a third camera drastically reduces the creation of ghost points working on triplets of pixels, but even in this case ambiguities are not completely solved.

The pixel matching procedure described in Appendix C may lead to the formation of ghost points, representing points not belonging to any real object. Ghost points are an intrinsic artifact of the method which, at this level, cannot discriminate between ghost and correct points, as shown in Fig. 10. Therefore ghost points are included in the overall cloud of points, creating ghost clusters and hence ghost trajectories, which has to be identified and removed to not affect the quality of the retrieved solution.

Fig. 11: Ghost trajectories removal. A –shape connected component representing a real (blue) and a ghost (red) trajectories occluded at one frame. This ambiguous component is due to a wrong dynamic link (the red arrow), which can be easily identified from the peculiar –shape of the component together with the short length of the ghost branch. The wrong dynamic link is detected and removed as well as the ghost points.

For their nature, ghost trajectories last for very few frames, so that not ambiguous ghost connected components can be easily identified and eliminated removing all the trajectories shorter than a certain time length, which has to be empirically chosen depending on the dataset. In the tests presented in Section 5 of the manuscript it is set to .

A more complicated situation occurs when a ghost and a real object get in proximity creating a ambiguous connected component, with the typical –shape shown in Fig. 11 with one of the two branches (the one corresponding to the ghost trajectories) quite short. This kind of ambiguity is essentially due to an error in the dynamical linking procedure, which wrongly connects the ghost to a real cluster.

We identify the connected components with a –shape graph and we solve the ambiguity removing the wrong link (highlighted in red in Fig. 11) between the ghost and the real trajectory, actually breaking the graph into two partitions, representing a long and correct trajectory and a short ghost trajectory which is removed.