I Introduction and related work
Semi-supervised learning (SSL) has recently gained considerable interest in the machine learning and computer vision communities. The spread of SSL methods is easily explained: in many real-world applications unlabelled data are easy to find, while labelled data are expensive and difficult to obtain. Typical examples arise in digital forensics and video surveillance, where large amounts of images and video are readily available but, without effective automatic tools, their annotation and analysis usually require human intervention. Data coming from surveillance cameras need to be examined in order to identify targets that may be captured by different cameras or in different time intervals. In these contexts many problems can therefore be interpreted as SSL problems, where a few initial labelled samples (e.g. patches of the selected target, some initial frames of the video) are available within vast footage of unlabelled data.

A comprehensive survey on SSL can be found in [1]; among the surveyed methods, graph-based algorithms play a relevant role. In such methods labelled and unlabelled input data are modelled as nodes of an undirected graph whose edges represent the similarity among data points. Labels for unlabelled instances are estimated by propagating the information available at labelled nodes to unlabelled nodes. The underlying idea is that points close to each other (i.e. points with high similarity) have the same label. Labels are usually estimated in regularization frameworks where the aim is to minimize a function on the undirected graph [2, 3]. Graph-based approaches have been used effectively to solve several problems in computer vision and pattern recognition [4], [5], [6], [7], [8], [9], [17], and also in the medical imaging domain [10]. The main advantage of graph methods is that they are nonparametric and transductive in nature (the solution is a cut of the graph, thus a label for each unlabelled input point and not a general classification function as in the case of inductive methods [1]).

Turning to the video forensics and surveillance fields, and considering the tracking problem in its common definition of "problem to estimate the location of target objects in a sequence of images starting from an initial detection", many different approaches have been proposed. In [11] a large experimental survey of various tracking approaches is presented, evaluating the suitability of each approach in different situations and under different constraints (e.g. assumptions on the background, on the motion model, on the occlusions, etc.). The survey also suggests that the problem of people tracking can be viewed as a data association problem, and in particular as a semi-supervised classification, where some examples of the people to track are initially detected.
The transductive learning (TL) paradigm has already been exploited to solve tracking and re-identification problems. The seminal work of [12] introduced TL as a solution to the severe variation of the models in color tracking: the TL problem was fitted into an EM framework to estimate the pixel labels in hand and face color tracking. [13] proposed an online single-target tracker using graph transduction, applied to faces and cars. [14] proposed an online single-target tracking and re-identification method based on a graph-based formulation of the TL problem; the knowledge encoded in the labelled instances is reinforced by an update strategy that avoids drift in the tracking and allows re-identification in case of occlusions. The works in [15], [16], [17], [18, 19] consider all the pairwise relationships between detection responses in a temporal sliding window, which is used as input to an optimization based on a fully-connected edge-weighted graph.
In this paper we propose to exploit graph transduction to track multiple people in videos. To our knowledge this is the first application of transductive learning to multiple-target tracking. Given a few labelled patches of the people to follow, we formulate multi-target tracking as a graph transduction problem. We exploit the formulation of [20], based on a game-theoretic framework, and we show its reliability in solving a real-world problem. We propose an application able to work offline on pre-recorded video streams. Similarly to [14], we describe people patches with covariance matrices and we build the similarity graph using the distance among covariance matrices on Riemannian manifolds.
The rest of the paper is organized as follows: Section II summarizes the algorithmic framework proposed in [20], and Section III details how we propose to apply it in the context of multiple people tracking. In Section IV the performance of the application is evaluated on video surveillance datasets. Finally, Section V gives a glimpse of future work and possible extensions.
II Graph Transduction Game
The theoretical formulation of the Graph Transduction Game (GTG) was recently introduced in [20]. Starting from the basis of transductive learning on undirected graphs, the authors build a solution in which label estimation is based on game-theoretic notions, in contrast to common solutions based on the spectrum of the graph. Precisely, graph transduction is formulated as a non-cooperative multi-player game and the labellings correspond to the Nash equilibria.
For notions about multi-player games we refer to [21]; for completeness we only underline that a game is modelled as a strategic interaction among players, where the goal of each player is to maximize its own payoff by choosing the best action (pure strategy) to play. The graph transduction game in [20] is formulated assuming that each player $i \in \mathcal{I} = \{1, \dots, n\}$ participating in the game corresponds to a particular point in a data set and that the set of strategies among which each player can choose is $S_i = \{1, \dots, c\}$. Each strategy expresses a certain hypothesis about the player's membership to a class, and $c$ is the total number of classes (i.e. the mixed strategy $x_i$ of each player lies in the $c$-dimensional simplex $\Delta_i$).
Since the problem is one of SSL, the players belong to two disjoint groups: those which already know their membership, referred to as labelled players and denoted by $\mathcal{I}_l$, and those which have no idea about their membership at the beginning of the game, hence called unlabelled players and correspondingly denoted by $\mathcal{I}_u$. The labelled players do not need to maximise their payoff, since each of them always plays its already chosen pure strategy $e_i^k$, where $k$ is its known class. The transduction game can be easily reduced to a game with only unlabelled players, who need to find their mixed strategies, while the fixed strategies of the labelled players act as a bias over the choices of the unlabelled players. By the Nash equilibrium theorem [22], the GTG always has an equilibrium in mixed strategies. It corresponds to a steady state where each player plays a strategy that yields the highest payoff when the strategies of the remaining players are kept fixed, and it provides a globally consistent labelling of the data set. Once an equilibrium is reached, the label $y_i$ of a data point (player) $i$ is simply given by the strategy with the highest probability in the equilibrium mixed strategy $x_i$ of player $i$, as

$$y_i = \arg\max_{h \in S_i} x_{ih}, \qquad (1)$$

where $x_{ih}$ is the probability that player $i$ assigns to strategy $h$, thereby yielding a crisp classification.
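Similarly to other graph transduction methods, the data are represented by an undirected graph $G = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of nodes representing both labelled and unlabelled points, and $\mathcal{E}$ are the edges, weighted with an adjacency matrix $W = (w_{ij})$. Since the solution is sought as the equilibrium of a non-cooperative game, the adjacency matrix is used to compute the payoffs between players. The game is an instance of the class of polymatrix games [23, 24], where players are nodes of a graph and every edge denotes a two-player game between the corresponding pair of players. The partial payoff matrix between two players $i$ and $j$ is thus computed as $A_{ij} = w_{ij} \times I_c$, where $I_c$ is the identity matrix of size $c \times c$ and $c$ is the number of classes. The payoffs are then computed as

$$u_i(e_i^h, x_{-i}) = \sum_{j \in \mathcal{I}_u} (A_{ij} x_j)_h + \sum_{k=1}^{c} \sum_{j \in \mathcal{I}_{l|k}} A_{ij}(h, k)$$

and

$$u_i(x) = \sum_{j \in \mathcal{I}_u} x_i^T A_{ij} x_j + \sum_{k=1}^{c} \sum_{j \in \mathcal{I}_{l|k}} x_i^T (A_{ij})_k,$$

where $x_{-i}$ collects the strategies of all players other than $i$, $\mathcal{I}_u$ is the set of unlabelled players, $\mathcal{I}_{l|k}$ is the set of labelled players that always play pure strategy $k$, and $(A_{ij})_k$ is the $k$-th column of $A_{ij}$.

The Nash equilibria, and thus the labelling of the unlabelled points, are computed using the evolutionary approach [25, 26]. The dynamic interpretation of Nash equilibria through the evolutionary approach imagines that the game is played repeatedly, generation after generation, during which a selection process acts on the multi-population of strategies, thereby resulting in the evolution of the fittest strategies. The particular class of dynamics used in this work are the so-called imitation dynamics, given by

$$\dot{x}_{ih} = x_{ih} \Big[ \sum_{l \in S_i} x_{il} \big( \phi_i[u_i(e_i^h - e_i^l, x_{-i})] - \phi_i[u_i(e_i^l - e_i^h, x_{-i})] \big) \Big],$$

where the dot signifies the derivative w.r.t. time and $\phi_i(u_i)$ is a strictly increasing function of $u_i$. The multi-population version of the replicator dynamics is obtained, up to a constant factor that only rescales time, when $\phi_i$ is taken as the identity function, i.e. $\phi_i(u_i) = u_i$, as:

$$\dot{x}_{ih} = x_{ih} \big( u_i(e_i^h, x_{-i}) - u_i(x) \big). \qquad (2)$$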
The authors of [20] demonstrate that, in both the discrete-time and continuous-time versions of the imitation dynamics, the fixed points of Eq. 2 are Nash equilibria.
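The label estimation described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: it assumes the standard discrete-time replicator update $x_{ih}(t+1) = x_{ih}(t)\, u_i(e_i^h) / u_i(x(t))$, a similarity matrix with zero diagonal, and a uniform initialization of the unlabelled players. The function name `graph_transduction_game` and its parameters are hypothetical.

```python
import numpy as np

def graph_transduction_game(W, labels, n_classes, n_iter=100, tol=1e-6):
    """Label unlabelled nodes with discrete-time replicator dynamics (sketch).

    W         : (n, n) symmetric, non-negative similarity matrix, zero diagonal
    labels    : length-n integers, class index for labelled nodes, -1 otherwise
    n_classes : number of classes c
    """
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    n = W.shape[0]
    labelled = labels >= 0

    # Mixed strategies: one-hot for labelled players, uniform for the others
    # (the uniform start is an assumption of this sketch).
    X = np.full((n, n_classes), 1.0 / n_classes)
    X[labelled] = np.eye(n_classes)[labels[labelled]]

    for _ in range(n_iter):
        # With partial payoffs w_ij * I_c, the payoff of pure strategy h for
        # player i is sum_j w_ij * x_jh, i.e. entry (i, h) of W @ X.
        payoff = W @ X
        avg = np.sum(X * payoff, axis=1, keepdims=True)  # expected payoff u_i(x)
        # Players with zero expected payoff (isolated nodes) keep their strategy.
        X_new = np.where(avg > 0, X * payoff / np.maximum(avg, 1e-12), X)
        X_new[labelled] = X[labelled]  # labelled players never change strategy
        converged = np.abs(X_new - X).max() < tol
        X = X_new
        if converged:
            break

    # Eq. 1: each player takes the strategy with the highest probability.
    return X.argmax(axis=1), X
```

Because labelled players stay fixed at their one-hot strategies, their contribution to `W @ X` plays the role of the bias term over the choices of the unlabelled players.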
III Multi-target Tracking
This section describes how the theoretical GTG framework presented in the previous section is extended to a solution for multiple people tracking, and details the graph construction and the extraction and description of people. An overview of the main steps of the proposal is depicted in Fig. 1.
We assume that the number of classes $c$ is known and that all classes are present in the labelled data, i.e. we assume that the initial target estimates are given. The labels of the labelled points are in the set $\{1, \dots, c\}$ and represent the IDs of the different targets.
Since graph-based learning methods enforce label smoothness over the graph, it is important that the graph is built on a meaningful and robust similarity measure, capable of capturing the similarity among patches of the same target and of differentiating between different targets. We decided to model people's appearance using covariance matrix feature descriptors [27]. This representation has been adopted in multiple approaches [28, 29, 14] because of its robustness in capturing shape, location and color information.
III-A Covariance representation
Considering $d$ different pixel features extracted from the image patches, the resulting covariance matrix $C$ is a $d \times d$ square symmetric matrix, where the diagonal entries represent the variance of each feature and the off-diagonal entries represent the correlations. We decided to model each pixel within a people patch with its position $(x, y)$, its HSV intensity values and its derivative information. The resulting 9-dimensional feature vector $f$ is thus:

$$f = \big[\, x, \; y, \; H, \; S, \; V, \; I_x, \; I_y, \; \sqrt{I_x^2 + I_y^2}, \; \arctan(I_y / I_x) \,\big]^T \qquad (3)$$

where $I_x$ and $I_y$ are the first-order derivatives of the intensities, calculated through the Sobel operator w.r.t. $x$ and $y$, and $\sqrt{I_x^2 + I_y^2}$ and $\arctan(I_y / I_x)$ are respectively the magnitude and the angle of the first-order derivatives. The covariance of a patch of size $M \times N$ is then computed as

$$C = \frac{1}{MN - 1} \sum_{k=1}^{MN} (f_k - \mu)(f_k - \mu)^T,$$

with $\mu$ the mean of the vectors $f_k$. A schematic representation of the feature extraction is shown in Fig. 2. The HSV color space is preferred to the basic RGB because of its invariance w.r.t. scale and shift changes of the light intensity.
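The descriptor of Eq. 3 can be illustrated with the following NumPy sketch. It is not the authors' code: it assumes that the patch has already been converted to HSV, that the derivatives are taken on the V channel, and that image borders are handled by edge padding. The names `sobel_gradients` and `covariance_descriptor` are hypothetical.

```python
import numpy as np

def sobel_gradients(gray):
    """First-order derivatives of a 2-D array with 3x3 Sobel kernels."""
    p = np.pad(np.asarray(gray, dtype=float), 1, mode="edge")
    # Horizontal derivative: right column minus left column, weighted 1-2-1.
    ix = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[1:-1, :-2] - p[2:, :-2])
    # Vertical derivative: bottom row minus top row, weighted 1-2-1.
    iy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[:-2, 1:-1] - p[:-2, 2:])
    return ix, iy

def covariance_descriptor(hsv_patch):
    """9x9 covariance of [x, y, H, S, V, Ix, Iy, magnitude, angle] per pixel.

    hsv_patch : (height, width, 3) array already converted to HSV.
    """
    h, w, _ = hsv_patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Assumption of this sketch: derivatives are computed on the V channel.
    ix, iy = sobel_gradients(hsv_patch[..., 2])
    mag = np.hypot(ix, iy)
    ang = np.arctan2(iy, ix)
    feats = np.stack([xs, ys, hsv_patch[..., 0], hsv_patch[..., 1],
                      hsv_patch[..., 2], ix, iy, mag, ang], axis=-1)
    F = feats.reshape(-1, 9).astype(float)
    # np.cov with rowvar=False divides by (number of pixels - 1).
    return np.cov(F, rowvar=False)
```

In practice a small multiple of the identity is often added to such a matrix so that it stays positive definite on flat patches; whether the paper does so is not stated.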
III-B Graph construction
In order to build the payoff matrices, the distances between covariance matrices are necessary and, since covariance matrices do not lie in a Euclidean space, we need to use an ad-hoc metric. The distance between covariance matrices proposed in [30] is equal to the square root of the sum of the squared logarithms of the generalized eigenvalues. Formally, the distance between two matrices $C_i$ and $C_j$ is expressed as:

$$\rho(C_i, C_j) = \sqrt{\sum_{k} \ln^2 \lambda_k(C_i, C_j)} \qquad (4)$$

where $\lambda_k(C_i, C_j)$ are the generalized eigenvalues of $C_i$ and $C_j$. It is proven that $\rho$ satisfies the metric axioms (positivity, symmetry, triangle inequality) for positive definite symmetric matrices.
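As an illustration of Eq. 4, the distance can be computed with NumPy alone by reducing the generalized eigenvalue problem to a standard symmetric one through a Cholesky factorization. This is a sketch that assumes both inputs are symmetric positive definite; the function name `covariance_distance` is hypothetical.

```python
import numpy as np

def covariance_distance(c1, c2):
    """Square root of the sum of squared log generalized eigenvalues."""
    # Generalized eigenvalues solve c1 v = lambda c2 v. With the Cholesky
    # factor c2 = L L^T they equal the eigenvalues of L^-1 c1 L^-T.
    L = np.linalg.cholesky(c2)
    L_inv = np.linalg.inv(L)
    M = L_inv @ c1 @ L_inv.T
    lam = np.linalg.eigvalsh((M + M.T) / 2.0)  # symmetrize against round-off
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))
```

Swapping the two arguments inverts every eigenvalue, which only flips the sign of each logarithm, so the value is symmetric in its inputs.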
Since our system works offline on pre-recorded videos, the adjacency matrix $W$ is built over the complete set of people patches extracted from all the images composing the video. The payoffs are then specified in terms of the normalized similarity matrix $\hat{W} = D^{-1/2} W D^{-1/2}$, with $D$ being the diagonal degree matrix of $W$, whose elements are given by $d_{ii} = \sum_j w_{ij}$, since it has been demonstrated that this normalization yields better performance than the original similarity.
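A possible way to obtain the normalized similarity matrix from pairwise distances is sketched below. The text does not state how distances are turned into similarities, so the Gaussian kernel, the median bandwidth heuristic and the zeroed diagonal are all assumptions of this example, and the function name `normalized_similarity` is hypothetical.

```python
import numpy as np

def normalized_similarity(dist, sigma=None):
    """Return D^-1/2 W D^-1/2 for a Gaussian similarity W built from distances.

    dist  : (n, n) symmetric matrix of pairwise distances
    sigma : kernel bandwidth; defaults to the median non-zero distance
            (a hypothetical heuristic, not taken from the paper)
    """
    dist = np.asarray(dist, dtype=float)
    if sigma is None:
        sigma = np.median(dist[dist > 0])
    W = np.exp(-(dist ** 2) / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # assumption: no self-similarity
    degree = W.sum(axis=1)    # diagonal of the degree matrix D
    inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    return W * inv_sqrt[:, None] * inv_sqrt[None, :]
```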
In our particular configuration the players of the game are the different targets to track, and finding the Nash equilibria is equivalent to finding the association of the targets across the set of images of the video stream.
IV Experiments
In this section we report the results obtained with the two proposed configurations, online and offline, for people tracking and association. We evaluate our proposal on a set of videos sampled from three datasets, namely:

THIS (http://www.openvisor.org): the videos we used in the evaluation were recorded along the platforms and underpasses of a train station and usually show people walking in small groups or alone.

CAVIAR (http://homepages.inf.ed.ac.uk/rbf/CAVIAR/caviar.htm): the clips of this dataset were collected in the hallway of a shopping center in Portugal.

3DPes (http://www.openvisor.org/3dpes.asp): this dataset [31] consists of videos collected at different times of the day on a university campus from different surveillance cameras. The main challenge of this dataset, originally proposed for re-identification tasks, is the large number of occlusions of the targets; we thus use it to evaluate the behaviour of our method in critical situations.
Fig. 3 and Fig. 4 show some frames taken from THIS, CAVIAR and 3DPes. As explained in Sec. III, the GTG framework assumes that for each frame the patches containing the people in the scene are given; we thus initially extracted those patches using a conventional people detector based on Histograms of Oriented Gradients (HOG). Working offline on the people patches, the problem can also be considered as one of multiple-object data association; we therefore measured the performance in terms of mean object Precision and mean object Recall, where we considered a True Positive to be a patch classified correctly with its label, a False Positive a misclassified patch (e.g. the target label is assigned to a different person) and a False Negative a missing estimation (e.g. the target label is not assigned in the frame even though that target is present). In order to summarize the overall performance we also evaluate the F-measure.

Since the approach we propose is offline and relies on pre-recorded sequences, we evaluate the results while varying the number of initially labelled frames. These frames were randomly chosen and the results were averaged over 20 different executions. Table I reports the F-measure obtained using an initial labelling of five frames on the three datasets, and summarizes the average length of the evaluated videos in number of frames and the number of targets to track for each dataset. Fig. 5(a) and 5(b) show the values of mean precision and recall on the three datasets.
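The evaluation protocol above can be made concrete with a short sketch. It is an interpretation, not the evaluation script used for the reported numbers: the input layout (one dictionary per frame, mapping a patch identifier to a target label) is hypothetical, and computing the F-measure from the mean precision and mean recall is an assumption.

```python
def tracking_scores(gt_frames, pred_frames, targets):
    """Mean per-target precision and recall, plus the resulting F-measure.

    gt_frames, pred_frames : lists with one dict per frame, mapping a patch
                             identifier to a target label (hypothetical layout)
    targets                : iterable of target labels to evaluate
    """
    scores = []
    for t in targets:
        tp = fp = fn = 0
        for gt, pred in zip(gt_frames, pred_frames):
            assigned = [p for p, label in pred.items() if label == t]
            # True Positive: patch labelled t that really shows target t.
            tp += sum(1 for p in assigned if gt.get(p) == t)
            # False Positive: label t assigned to a different person.
            fp += sum(1 for p in assigned if gt.get(p) != t)
            # False Negative: target present but its label unused in the frame.
            if t in gt.values() and not assigned:
                fn += 1
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append((precision, recall))
    mean_p = sum(p for p, _ in scores) / len(scores)
    mean_r = sum(r for _, r in scores) / len(scores)
    f = 2 * mean_p * mean_r / (mean_p + mean_r) if mean_p + mean_r else 0.0
    return mean_p, mean_r, f
```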
Clearly, the results improve as we increase the number of labelled frames, as illustrated in Fig. 5, but they almost saturate at satisfactory values when the number of frames is fixed to 5, demonstrating good reliability without requiring a large amount of labelled data. The results also show that increasing the number of labelled frames increases the stability of the solution: when labelling only one or three frames, although precision and recall reach adequate levels, the standard deviation is large. The best-performing dataset is THIS, reaching 100% precision and accuracy, but results are good even on a challenging dataset such as 3DPes, where the number of targets is higher and the targets are not always correctly identified by the detector due to occlusions, though on the 3DPes dataset the standard deviation of the results is higher. We would like to highlight that the video length reported in Table I is an average over different sequences; especially with the 3DPes dataset, we stressed the algorithm by increasing the number of frames from 200 to 430, and even with the highest number of frames the results were satisfactory, with an F-measure of roughly 90%. A few of the obtained tracking results are shown in Fig. 3 and Fig. 4.

Table I: F-measure obtained with five initially labelled frames, with the average video length and the number of targets for each dataset.

Dataset  # Frames  # Targets  F-measure
THIS     109       3          0.99
CAVIAR   140       4          0.96
3DPes    280       5          0.92
To further the evaluation we also compare our approach with other tracking methods based on graph transduction. To this aim we recast our solution as one of single-person tracking, labelling the input samples as positive when they correspond to the target and as negative otherwise. Results are reported in Tab. II; in particular, we evaluate the GTG framework against the seminal Transductive Learning Tracker (TLT) of [13] and the work presented in [14], which proposed a single-target Transduction Tracking (TT NoUpd) and also an improved version where the labelled data points are iteratively updated with an evolutionary spectral clustering (TT SpUpd). Although all the other methods work iteratively online on each frame of the video, our solution proves to be reliable even in a single-target configuration and outperforms the other methods. The improvement over the state of the art is explained both by the robustness of the GTG framework, as already demonstrated in [20] for other classification tasks, and by the fact that working with graphs with more nodes increases the number of possible paths among the nodes representing the target patches in different frames. In other words, this overcomes the inevitable errors of the online methods when the people detections are imprecise in the evaluated frames.

Table II: Comparison with graph-transduction trackers in the single-target configuration.

Dataset  Method    Precision  Recall  F-measure
THIS     GTG       1.00       0.85    0.92
THIS     TLT       0.76       0.92    0.83
THIS     TT NoUpd  0.80       0.93    0.86
THIS     TT SpUpd  0.97       0.95    0.96
CAVIAR   GTG       0.90       0.92    0.91
CAVIAR   TT NoUpd  0.68       0.91    0.78
CAVIAR   TT SpUpd  0.78       0.90    0.84
CAVIAR   TLT       0.96       0.94    0.95
3DPes    GTG       0.87       0.99    0.93
3DPes    TT NoUpd  0.45       0.44    0.44
3DPes    TT SpUpd  0.63       0.66    0.64
3DPes    TLT       0.86       0.83    0.84
V Conclusion
In this paper we proposed a method for multi-target people tracking based on a Graph Transduction Game framework. The GTG considers the different targets to track as players of a game, and estimates the missing labels by finding the Nash equilibria over the set of possible strategies, where the payoffs of the different strategies are proportional to the similarity among people patches. We extracted people patches by using a HOG-based detector and we represented their appearance with a covariance matrix.
The method has been evaluated on three different video surveillance datasets and obtained good results even with very few labelled initial frames. Our proposal also outperforms similar methods when considered in a single-target configuration.
Given the promising results we obtained, future work includes the extension to longer video sequences, in order to test the robustness of the system for long-term tracking as well. We would also like to evaluate an online approach and to add a mechanism to handle the insertion of new targets or, in other words, to handle a variable number of classes. Still within the framework of online label estimation, we would like to test the possibility of avoiding the initial step of people detection and instead using a sliding-window approach to extract the patches, so that the method is independent of the performance of the detector.
References
 [1] X. Zhu, “Semi-supervised learning literature survey,” 2006.
 [2] T. Joachims, “Transductive learning via spectral graph partitioning,” in Proceedings of the International Conference on Machine Learning (ICML), 2003, pp. 290–297.
 [3] D. Zhou, J. Huang, and B. Schölkopf, “Learning from labeled and unlabeled data on a directed graph,” in Proceedings of the International Conference on Machine Learning (ICML), ser. ICML ’05. New York, NY, USA: ACM, 2005, pp. 1036–1043.
 [4] Y. T. Tesfaye, “Applications of a graph theoretic based clustering framework in computer vision and pattern recognition,” 2018.
 [5] E. Zemene, Y. T. Tesfaye, A. Prati, and M. Pelillo, “Simultaneous clustering and outlier detection using dominant sets,” in Pattern Recognition (ICPR), 2016 23rd International Conference on. IEEE, 2016, pp. 2325–2330.
 [6] E. Zemene, L. T. Alemu, and M. Pelillo, “Constrained dominant sets for retrieval,” in 23rd International Conference on Pattern Recognition, ICPR 2016, Cancún, Mexico, December 4-8, 2016, 2016, pp. 2568–2573.
 [7] E. Zemene, L. T. Alemu, and M. Pelillo, “Dominant sets for ‘constrained’ image segmentation,” CoRR, vol. abs/1707.05309, 2017. [Online]. Available: http://arxiv.org/abs/1707.05309
 [8] E. Zemene, Y. T. Tesfaye, H. Idrees, A. Prati, M. Pelillo, and M. Shah, “Large-scale image geo-localization using dominant sets,” arXiv preprint arXiv:1702.01238, 2017.
 [9] E. Zemene, S. R. Bulò, and M. Pelillo, “Dominant-set clustering using multiple affinity matrices,” in International Workshop on Similarity-Based Pattern Recognition. Springer, 2015, pp. 186–198.
 [10] T. M. Dagnew, L. Squarcina, M. W. Rivolta, P. Brambilla, and R. Sassi, “Learning from enhanced contextual similarity in brain imaging data for classification of schizophrenia,” in International Conference on Image Analysis and Processing. Springer, 2017, pp. 265–275.
 [11] A. W. M. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah, “Visual tracking: An experimental survey,” IEEE Trans. on Pattern Anal. Mach. Intell., vol. 99, 2013.
 [12] Y. Wu and T. Huang, “Color tracking by transductive learning,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1. IEEE Computer Society, 2000, pp. 133–138.
 [13] Y. Zha, Y. Yang, and D. Bi, “Graph-based transductive learning for robust visual tracking,” Pattern Recogn., vol. 43, no. 1, pp. 187–196, 2010.
 [14] D. Coppi, S. Calderara, and R. Cucchiara, “People appearance tracing in video by spectral graph transduction,” in Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCV), Barcelona, Spain, Nov. 2011, pp. 920–927.
 [15] Y. T. Tesfaye, E. Zemene, M. Pelillo, and A. Prati, “Multi-object tracking using dominant sets,” IET Computer Vision, vol. 10, pp. 289–298, 2016.
 [16] A. R. Zamir, A. Dehghan, and M. Shah, “GMCP-Tracker: Global multi-object tracking using generalized minimum clique graphs,” in European Conference on Computer Vision (ECCV), 2012, pp. 343–356.
 [17] Y. T. Tesfaye, E. Zemene, A. Prati, M. Pelillo, and M. Shah, “Multi-target tracking in multiple non-overlapping cameras using constrained dominant sets,” arXiv preprint arXiv:1706.06196, 2017.
 [18] A. Dehghan, S. M. Assari, and M. Shah, “GMMCP tracker: Globally optimal generalized maximum multi clique problem for multiple object tracking,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4091–4099.
 [19] Y. T. Tesfaye, “Multi-target tracking using dominant sets,” M.Sc. thesis, Università Ca’ Foscari Venezia, 2014.
 [20] A. Erdem and M. Pelillo, “Graph transduction as a noncooperative game,” Neural Computation, vol. 24, no. 3, pp. 700–723, 2012.
 [21] J. W. Weibull, “Evolutionary game theory,” 1995.
 [22] J. Nash, “Non-cooperative games,” The Annals of Mathematics, vol. 54, no. 2, pp. 286–295, Sep. 1951.
 [23] L. Quintas, “A note on polymatrix games,” International Journal of Game Theory, vol. 18, no. 3, pp. 261–272, 1989.
 [24] J. T. Howson, “Equilibria of polymatrix games,” Management Science, vol. 18, pp. 312–318, 1972.
 [25] C. Daskalakis, P. W. Goldberg, and C. H. Papadimitriou, “The complexity of computing a Nash equilibrium,” in Proceedings of the Thirty-eighth Annual ACM Symposium on Theory of Computing, ser. STOC ’06. ACM, 2006, pp. 71–78.
 [26] C. Daskalakis, “On the complexity of approximating a Nash equilibrium,” 2011.
 [27] F. Porikli, O. Tuzel, and P. Meer, “Covariance tracking using model update based on lie algebra,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, 2005, pp. 728–735.
 [28] M. J. Metternich, M. Worring, and A. W. M. Smeulders, “Color based tracing in real-life surveillance data,” in Transactions on Data Hiding and Multimedia Security V, Y. Q. Shi, Ed. Springer-Verlag, 2010, pp. 18–33.
 [29] Y. Liu, G. Li, and Z. Shi, “Covariance tracking via geometric particle filtering,” EURASIP J. Adv. Signal Process, pp. 1–22, 2010.
 [30] W. Förstner and B. Moonen, “A metric for covariance matrices,” 1999.
 [31] D. Baltieri, R. Vezzani, and R. Cucchiara, “3DPes: 3D people dataset for surveillance and forensics,” 2011.