Deep Multi-Shot Network for modelling Appearance Similarity in Multi-Person Tracking applications

04/07/2020 ∙ by María J. Gómez-Silva, et al. ∙ Universidad Carlos III de Madrid

The automatization of Multi-Object Tracking becomes a demanding task in real unconstrained scenarios, where the algorithms have to deal with crowds, crossing people, occlusions, disappearances and the presence of visually similar individuals. In those circumstances, the data association between the incoming detections and their corresponding identities could miss some tracks or produce identity switches. In order to reduce these tracking errors, and even their propagation in further frames, this article presents a Deep Multi-Shot neural model for measuring the Degree of Appearance Similarity (MS-DoAS) between person observations. This model provides temporal consistency to the individuals' appearance representation and supplies an affinity metric to perform frame-by-frame data association, allowing online tracking. The model has been deliberately trained to manage the presence of previous identity switches and missed observations in the handled tracks. For that purpose, a novel data generation tool has been designed to create training tracklets that simulate such situations. The model has demonstrated a high capacity to discern when a new observation corresponds to a certain track, achieving a classification accuracy of 97% in a hard test that simulates tracks with previous mistakes. Moreover, the tracking efficiency of the model in a Surveillance application has been demonstrated by integrating it into the frame-by-frame association of a Tracking-by-Detection algorithm.




1 Introduction

The Multi-Object Tracking (MOT) task consists of locating multiple individuals from their visual measurements and preserving their identities throughout an image sequence.

In the context of Intelligent Surveillance Systems, the automatization of the MOT task is essential to manage the huge amount of data captured from a large-scale distributed network of cooperative sensors and consequently, to automatically monitor multiple individuals in wide areas. This automatization relies on the proper association between the consecutive observations of each individual along a surveillance sequence.

In real unconstrained and crowded scenarios, the tracking of multiple individuals is hampered by a wide variety of challenging situations: fast-moving people or moving camera platforms, the presence of crowds, crossing people, people with changing trajectories, partial or total occlusions over the short or long term, people disappearing from the monitored area, or new individuals entering the field of view of the surveillance camera.

The performance of the data association process substantially depends on the design of the person representation and on the formulation of the cost function. This function is a metric to measure the cost of assigning a certain identity to a certain detection. Consequently, data association methods based on only motion cues or targets’ dynamics, e.g. [mclaughlin2015enhancing], are not able to handle agents with varying trajectories. This circumstance boosts the research on modelling individuals’ appearance to improve the performance of online methods.

In addition, no information about the agents appearing in the scene is known in advance. Given the unpredictable nature of the surveillance task, an essential capacity for MOT algorithms is the versatility to be applied to any unknown individual, who must be recognised among a high number of observations.

To achieve that, instead of learning a number of specified patterns for each one of the tracked agents, e.g. [yang2016temporal], this article proposes the design of a unique deep neural model. The proposed network jointly models the appearance features of multiple person detections and an affinity metric to compare them, which results in the measurement of the Degree of Appearance Similarity (DoAS) between the person images. This model identifies the affinity between different images of the same person, allowing the tracking of multiple people using the same model for all of them. In that way, unlike online-learning approaches, the developed method requires neither previous knowledge about the scene nor a large number of frames to learn a robust model of an agent’s appearance.

The recognition of a person by means of an appearance neural model presents an intrinsically unbalanced nature, given the lack of data about the people to identify and the huge number of possible false assignments with surrounding agents. This results in the collapse and over-fitting of the neural model. For that reason, a novel formulation to generate the proper training data to feed the model is proposed in this work.

Once the model is trained and integrated into a data association process, its performance in complex scenes could still produce some identity switches, i.e. the association of incorrect identities to some detections. After an identity switch, a different person’s track is associated with an agent in further iterations, which makes such an error very difficult to correct.

With the aim of avoiding identity switches or dealing with the consequences of a previous mismatching, and in order to avoid the error propagation in further frames, temporal consistency has been implicitly added to the model through a novel contrastive network architecture design. This follows a Multi-Shot recognition approach, whose core is a Long Short Term Memory (LSTM) cell.

A Deep Convolutional Neural Network has been modelled to render the appearance of the individuals through a feature array. The obtained features for a certain individual at different frames are related by the LSTM cell, providing the global appearance feature for a certain track. This is compared with the new observations by the proposed model. The result is a contrastive metric, hereinafter called Multi-Shot DoAS. In that way, every detection is compared by a model that considers not only the last saved observations but also those from previous frames.

Therefore, this article presents a novel neural model to measure the Multi-Shot Degree of Appearance Similarity (MS-DoAS) between person images to perform the association of individuals’ observations through different frames in a Multi-Object Tracking algorithm. The main contributions of the proposed method are:

  1. Design of a novel Deep Neural Network architecture for performing Multi-Shot recognition of any unknown individual. This model relies on the temporal consistency of the agent’s appearance by analysing the visual features measured in previous frames.

  2. Formulation of a training process that makes the model able to face complex real surveillance situations, including short-term disappearances of the agents, missed detections, and occlusions. Moreover, the resultant model is also able to deal with previously failed associations, preventing further propagation of identity mismatches. These capabilities have been acquired by training it on a varied set of tracklets (fragments of tracks), generated specifically for that purpose by deliberately introducing temporal steps between some captures, as well as intruder detections. The tracklet generation tool has been implemented as a set of C++ functions, which are publicly available.

  3. Integration of the proposed model in the data association process of an online Multi-Object Tracking algorithm. The affinity measure given by the proposed model, MS-DoAS, has been used, together with other motion cues, as part of a multi-modal cost function.

The proposed model provides temporal consistency by modelling the agents’ appearance with an LSTM network which is fed by features from previous frames. In that way, the propagation of isolated association mistakes is avoided, without having to extend the association process through multiple future frames, as batch methods do. Hence, the proposed model allows an online tracking algorithm, with a frame-to-frame assignment.

The effectiveness of the proposed model has been proved by measuring its recognition capacity over multiple varied test sets, and also by evaluating the final performance of an MOT algorithm where the model is integrated.

The rest of the article is structured as follows: Section 2 presents a review of the existing related works. Section 3 describes the proposed Re-Id neural model, and Section 4 the learning algorithm developed to train it. Finally, Sections 5 and 6 present the obtained experimental results and some concluding remarks, respectively.

2 Related Work

Although Multi-Object Tracking (MOT) methods have been reviewed intensively [bernardin2008evaluating], MOT remains a challenging problem under development. It has become a branch of research deeply studied by the scientific community due to its prominent application to Intelligent Surveillance Systems (ISS), since many other applications, such as behaviour analysis, rely on the tracking performance.

Furthermore, Multi-Object Tracking in video sequences is also widely used in other military and civil applications, such as sports players tracking and analysis [liu2013tracking], biology [meijering2009tracking], robot navigation [ess2008mobile], and autonomous driving vehicles [ess2009improved].

In the literature, the tracking problem is commonly solved by selecting a detector and feeding a tracker with its output, resulting in a wide range of approaches, which have long been encompassed under the paradigm called “tracking-by-detection”.

Once a set of reliable detections is collected, the task of the tracker translates into a data association problem for determining the correspondence of detections across frames. Therefore, the data association consists of finding the correct assignment between the detections at every new frame and their corresponding identities. Identity is given to every trajectory that describes the path of an individual instance over time, hereinafter called agent.

Data association methods are mainly composed of a cost function, which measures the cost of assigning a certain identity to a detected person, and an optimization algorithm, which is in charge of seeking the assignment that minimizes the cost function. Therefore, independently of the association mechanism, a significant part of the final multi-person tracking performance relies on the proper formulation of the cost metric, which is limited, in turn, by the person representation design.
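The split between a cost function and an optimiser can be illustrated with a minimal sketch. It assumes a small square cost matrix and uses an exhaustive search over permutations as a stand-in optimiser; the function name `best_assignment` and the cost values are hypothetical, and a real tracker would use a more efficient assignment method.

```python
from itertools import permutations


def best_assignment(cost):
    """Return (total_cost, assignment) minimising the summed cost.

    cost[i][j] is the cost of giving identity i to detection j.
    Exhaustive search: only sensible for tiny, square cost matrices.
    """
    n = len(cost)
    best_total, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_total, best_perm = total, perm
    return best_total, list(best_perm)


# Illustrative costs only: low values mean a likely identity/detection match.
cost = [
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
    [0.8, 0.6, 0.3],
]
total, assignment = best_assignment(cost)
# assignment[i] is the index of the detection given to identity i.
```

The quality of the result depends entirely on the entries of `cost`, which is why the formulation of the cost metric matters more than the choice of optimiser.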

Some of the most commonly used features are related to individuals’ motion, such as location or velocity, and even the interactions between agents. Trajectories have been typically treated as state-space models, like in Kalman [kalman1960new] or particle filters [gordon1993novel]. Moreover, in [bae2014robusttracklet, bae2014robust], trajectories are clustered as a means to learn motion patterns. Furthermore, another approach is to develop more complex motion models to better predict future trajectories. For instance, Fan et al. [fan2010human] used Deep Convolutional Neural Networks (DCNN) to predict the location and scale of an individual for tracking.

However, in crowded scenes, a location or motion-based online association method can struggle to deal with changing-trajectory and crossing agents. There is a vast number of works that exploit appearance information to solve data association and to overcome the dependency on the motion cues. In those cases, a primary task in people tracking is converting raw pixels into higher-level representations.

Some simple appearance models are based on extracting appearance information from the object pixels using hand-crafted features, including colour histogram [le2016long, tang2016multi] and texture descriptors [chen2015multitarget, zhang2015tracking].

Other approaches use covariance matrix representation, pixel comparison representation and SIFT-like features, or pose features [nam2016learning]. For instance, in [shu2012part] an edgelet-based part model for describing the appearance of objects is presented.

Recently, Deep Convolutional Neural Networks have been used for modelling appearance by learning high-level features, e.g. [held2016learning, leal2016learning, zhai2018deep]. For example, in [kim2015multiple] the feature extraction is directly learnt by using a convolutional pipeline that can be completely trained on a vast number of samples.

Other tracking algorithms obtain improvements by modelling every tracked agent independently, e.g. [bae2014robust, yang2016temporal]. Since there is no previous knowledge about the people to track, the dedicated models are trained online. The drawback of these approaches is that a certain time is needed until the online learning gathers enough samples of a person to learn a reliable pattern.

On the other hand, many works explicitly learn affinity or similarity metrics from data, in order to compare two observations, e.g. [leal2016learning]. These works are characterised by the use of a cost metric in their tracking formulation once the metric has been learnt, but they do not consider the actual inference model during the learning phase.

The recent trend in Multi-Target tracking is to integrate the learning of people features into the association scheme. This approach is applicable to batch association methods, a.k.a. offline methods, such as multi-dimensional alignment algorithms [reid1979algorithm] and network flow-based methods [schulter2017deep]. Batch association methods provide temporal context through sets of future observations, allowing for robust predictions.

For example, the Multi-Hypothesis Tracking method can be extended to include online learned discriminative appearance models for each track hypothesis [kim2015multiple]. On the other hand, in [schulter2017deep], features for network flow-based data association are learnt via back-propagation, by expressing the optimum of a smoothed network flow problem as a differentiable function of the pairwise association costs.

Furthermore, many of the research efforts focused on reducing the tracking errors exploit temporal consistency by extracting people tracklets, i.e. short object tracks. Unfortunately, the availability of reliable tracklets cannot be guaranteed due to the propagation of mistakes. This effect is pronounced in network flow-based association methods due to their limited capacity to model complex appearance changes. An alternative is to define pairwise costs between tracklets that can be reliably computed [shitrit2014multi].

In tracklet association, discriminative appearance models are trained with the aim of learning an improved affinity score function, e.g. [bae2014robusttracklet, yang2012online]. However, these are batch methods, which perform multi-frame generalization using tracklets or even the whole sequence at once and on a hierarchical global data association [zhang2008global], where all the detections are gradually connected after these have been collected for a huge set of frames. Therefore, batch methods rely on future observations and for that reason, they are not applicable in real-time vision systems, where a frame-by-frame association, called online association, is needed.

In contrast, other methods add temporal consistency to the data association process by using Long Short-Term Memory (LSTM) models. Subsequently, the pairwise terms, which relate two observations, can be weighted by offline trained appearance templates [shitrit2011tracking] or a simple distance metric between appearance features [zhang2008global]. For instance, in [sadeghian2017tracking] an LSTM model which learns to predict similar motion and appearance patterns is presented.

Modelling appearance with LSTM cells in an offline learning process and using the obtained models in an online data association method combines real-time operation with temporal consistency, as the work presented in this article demonstrates. This approach requires neither prior knowledge about individuals nor time to adapt the model to them, so any unknown agent can be tracked.

3 Degree of Appearance Similarity Model

With the aim of exploiting the visual appearance of a target individual to track him/her among multiple people, an appearance affinity model has been developed.

The differences in visual appearance between a certain agent (tracked identity) and a detection are taken as an affinity cue in their matching cost formulation. Instead of modelling a specific individual’s appearance pattern, a universal model has been designed to predict whether the images correspond to the same person or not.

Therefore, a comparative metric has been trained to measure the Degree of Appearance Similarity (DoAS) between captures. This has been formulated as a pair-wise binary classification problem to discriminate between groups of images belonging to the same person, or corresponding to different people, which are called positive and negative tracklets, respectively.

3.1 Multi-Shot DoAS Architecture

The Multi-Shot DoAS model, MS-DoAS, measures the appearance affinity between a certain detection and a certain agent, and its computation is rendered by the scheme in Fig. 1.

Figure 1: Computation of Multi-Shot Degree of Appearance Similarity, MS-DoAS, between a detection and an agent.

This model follows a multi-shot recognition scheme since a certain detection, captured in the current frame, is compared with the visual appearance of a certain agent, not only in the previous iteration, but also in previously captured images. This Multi-Shot recognition approach provides temporal consistency to obtain an accurate prediction.

Due to occlusions or temporary disappearances, a certain agent may not be detected in consecutive frames. The available captures feed the model in inverse order of acquisition. So, the number of the frame where one representation of the agent was acquired is always higher than the number of the frame where the next representation in the sequence was captured. Moreover, all these frames are previous to that where the compared detection was found.
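The ordering constraint above can be sketched with a small bounded buffer. This is only an illustration under assumptions: the class name `TrackMemory`, the parameter `memory_size` and the string placeholders used as features are hypothetical, and the buffer length stands in for the memory size of the LSTM cell.

```python
from collections import deque


class TrackMemory:
    """Keeps the most recent appearance features of one agent."""

    def __init__(self, memory_size):
        # A bounded deque silently drops the oldest entry when full.
        self.items = deque(maxlen=memory_size)

    def add(self, frame_number, feature):
        # Observations arrive in increasing frame order; frames where the
        # agent was not detected are simply never added.
        self.items.append((frame_number, feature))

    def model_input(self):
        # Newest first, so frame numbers strictly decrease along the sequence.
        return list(reversed(self.items))


memory = TrackMemory(memory_size=3)
for frame, feat in [(10, "f10"), (11, "f11"), (14, "f14"), (15, "f15")]:
    memory.add(frame, feat)

sequence = memory.model_input()
frames = [frame for frame, _ in sequence]
```

In this example the agent was missed in frames 12 and 13, so the sequence fed to the model skips them while still respecting the inverse order of acquisition.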

The MS-DoAS metric has been computed by modelling the appearance of each compared individual in a feature array. The appearance of the query detection is rendered by a feature array computed by a pre-trained DCNN.

Analogously, the appearance of each one of the previously acquired representations of the query agent is rendered by a feature array, previously computed by the same pre-trained DCNN in the frame where it was captured. The saved features of the agent are used as the inputs of a pre-trained Long Short-Term Memory (LSTM) cell, which provides a general feature for the agent.

Subsequently, a second group of neural layers is used to model the affinity cue that contrasts the individuals’ appearance features. Firstly, the detection feature and the general agent feature are concatenated to feed a fully connected (FC) layer, which has also been pre-trained. This FC layer is used as a binary classification function, which performs the optimal weighting of the elements of both features and returns a pair of outputs whose values are representative of the dissimilarity and similarity classes. Finally, a Softmax function, defined by Eq. 1, normalises these values to the range $[0, 1]$.


Due to the contrastive essence of this pair-wise approach, the first normalised output returns the probability that the agent and the measurement do not form a correct match, and the second, the probability that they form a correct match. Therefore, the second normalised output is taken as the MS-DoAS between the agent and the detection.
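The final normalisation step can be sketched as follows. The sketch assumes the FC layer has already produced its two raw outputs, ordered as (dissimilar, similar); the function names `softmax` and `ms_doas` and the example values are illustrative, not taken from the released code.

```python
import math


def softmax(values):
    """Numerically stable Softmax over a list of raw outputs."""
    shift = max(values)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def ms_doas(raw_outputs):
    """Similarity score from the two raw FC outputs (dissimilar, similar)."""
    p_dissimilar, p_similar = softmax(raw_outputs)
    return p_similar


score = ms_doas([0.0, 0.0])        # equal outputs: no preference
confident = ms_doas([-2.0, 3.0])   # the "similar" output dominates
```

Because the two normalised outputs sum to one, keeping only the second one loses no information.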

3.2 Features computation

Instead of directly comparing the raw images, the comparison is performed on their representative feature arrays. Therefore, it is necessary to model an embedding that maps an input image to a feature space, such that the distance between samples rendering the same person is smaller than that between different people in that feature space.

In order to deal with partial-term occlusions, after which the representation of a person changes, deep learning has been used to automatically find the most salient features of the individuals’ appearance. Hence, the feature embedding has been modelled by a Deep Convolutional Neural Network (DCNN). Therefore, the feature representation for an image is given by the output of the DCNN, which depends on the values of its weights.

Concretely, the DCNN model used follows an adapted version of the VGG architecture, presented as the A version of a set of Very Deep CNNs in [simonyan2014very]. The layer specifications of the proposed VGG-based embedding are listed in Table 1.

Layers, in order: Conv-1-1, Pool-1, Conv-2-1, Pool-2, Conv-3-1, Conv-3-2, Pool-3, Conv-4-1, Conv-4-2, Pool-4, Conv-5-1, Conv-5-2, Pool-5, FC-6, FC-7, FC-8.
Table 1: Structure of the used VGG-based model, giving the input size, output size and kernel of each layer.


This architecture presents eight convolutional layers, three fully connected layers and a SoftMax final layer. The SoftMax layer has been removed to get a feature array as output instead of a classification probability value. Hence, its output is a point in the feature space represented by a feature array of fixed dimension. Moreover, the input size used in [simonyan2014very] has been modified to adapt its value to the proportions of person detections. Therefore, the input of the proposed DCNN is an RGB image of a fixed size, which has been set to 64x128 pixels. All hidden layers are provided with a Rectified Linear Unit (ReLU), $\max(0, x)$, as activation function.

This neural network has been trained following the Siamese model, which can discriminate the pairs of samples into two well-differentiated groups, positive and negative pairs. This discrimination has been accentuated by the use of the Normalised Double-Margin-based Contrastive Loss function, formulated in [gomez2017deep]. This loss function has been implemented in a Caffe Python layer, which is publicly available.

The Pair-based Mini-Batch Gradient Descent Learning Algorithm, presented in [gomez2019balancing], has been used to learn the network weights. The training data has been generated from the MOT17 dataset of surveillance sequences. A data generation tool was used to extract people detections from the sequences; subsequently, the detections were combined to create a huge number of training pairs by means of the data balancing method presented in [gomez2019balancing]. The data generation tool, which includes the data balancing method, is composed of a set of C++ functions that are publicly available.

Once the neural network was trained, it was used to compute the appearance feature of a person’s image. The C++ class needed to interpret the network architecture and its pre-trained weights is publicly available.

4 Learning Algorithm

The training architecture used to learn the parameters of the LSTM cell and the FC layer that allow measuring the MS-DoAS is rendered in Fig. 2.

Figure 2: MS-DoAS learning architecture.

Each training sample is formed by an array (with zero-based indexing) of features, called a feature tracklet, whose size depends on the size of the memory of the LSTM cell. Each tracklet element corresponds to a feature computed from an image that was taken from a frame identified by its frame number.

By computing the features in an offline process, previous to the learning, the training time is highly reduced. The number of possible training tracklets created from a given set of images is much larger than the size of that set of images. For that reason, computing the features of all the images of the set before forming tracklet combinations is much more efficient as far as computational time is concerned.

Because the neural model is learnt by a supervised training process, every feature of a tracklet is accompanied by the identification number of the person that it renders. A tracklet is considered positive if its first feature corresponds to the same individual as the rest of the features of the tracklet, and negative otherwise, as Eq. 2 defines. The identity rendered by the last features of the tracklet is given by the identity represented by most of its components (the mode), since some intruders (with a different identification number) can be added to the tracklet to simulate failed associations.
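The labelling rule can be sketched in a few lines. The sketch assumes a tracklet is represented only by the list of identification numbers of its components, first component first; the function name `tracklet_label` is hypothetical.

```python
from collections import Counter


def tracklet_label(ids):
    """Return 1 (positive) or 0 (negative) for a tracklet of identity numbers.

    ids[0] is the identity of the first feature; ids[1:] are the rest.
    The agent identity is the mode of ids[1:], so a few intruders do not
    change it. On a tie, the identity seen first among the tied ones wins.
    """
    agent_id = Counter(ids[1:]).most_common(1)[0][0]
    return 1 if ids[0] == agent_id else 0


positive = tracklet_label([7, 7, 7, 3, 7])   # one intruder (id 3) in the track
negative = tracklet_label([3, 7, 7, 3, 7])   # first feature matches an intruder
```

The second call shows the hard case: the first feature shares its identity with an intruder, yet the tracklet is still negative because the mode of the remaining components is a different person.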


A feature tracklet-based version of the Mini-Batch Gradient Descent algorithm has been implemented to train the presented neural model, and its main procedures, repeated over the learning iterations, are described by Algorithm 1. This algorithm uses the cross-entropy loss to compute the loss function, as Eqs. 3 and 4 define, in its forward propagation, and its derivatives, Eq. 5, in the backward propagation. These partial derivatives are finally used to update the weights using the Adagrad optimisation method [duchi2011adaptive].
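As a sketch of the quantities involved, the standard cross-entropy formulation and Adagrad update are written below in assumed notation: $B$ is the number of tracklets in the batch, $y_i \in \{0, 1\}$ the label of tracklet $i$, $p_{i,c}$ its normalised output for class $c$, $\theta$ a network parameter, $\eta$ the learning rate and $\epsilon$ a small constant. None of these symbols are taken from the article, and averaging over the batch (rather than summing) is an assumption.

```latex
% Per-tracklet cross-entropy loss
\ell_i = -\log p_{i,\,y_i}

% Batch loss, assumed here to be the mean over the batch
L = \frac{1}{B} \sum_{i=1}^{B} \ell_i

% Batch derivative with respect to a parameter
g = \frac{\partial L}{\partial \theta}
  = \frac{1}{B} \sum_{i=1}^{B} \frac{\partial \ell_i}{\partial \theta}

% Adagrad update: G accumulates the squared derivatives over iterations
G \leftarrow G + g^{2}, \qquad
\theta \leftarrow \theta - \frac{\eta}{\sqrt{G} + \epsilon}\, g
```

With two classes, $\ell_i$ reduces to the negative logarithm of the normalised output that corresponds to the true class of the tracklet.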

Input: batch of feature tracklets.
Output: the network parameters.
while it < IT do
    for all training tracklets of the batch set do
        Calculate the network outputs by forward propagation;
        Calculate the normalised outputs by Eq. 1;
        Calculate the tracklet loss by Eq. 4;
    end for
    Calculate the batch loss by Eq. 3;
    for all training tracklets of the batch set do
        Calculate the partial derivatives of the loss by back propagation;
    end for
    Calculate the batch derivatives according to Eq. 5;
    Update the parameters according to the Adagrad method;
end while
Algorithm 1 Feature Tracklet-based Mini-Batch Gradient Descent Learning Algorithm.
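The loop structure of Algorithm 1 can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: the LSTM cell is omitted, each tracklet is assumed to be already flattened into one feature vector, the model is a single two-output linear layer, and the data are synthetic. The names `train`, `lr` and `eps` are hypothetical.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def train(batch, labels, n_iters=200, lr=0.5, eps=1e-8):
    """Fit a 2-class linear layer with cross-entropy loss and Adagrad.

    Returns the weights and the mean batch loss recorded at every iteration.
    """
    dim = len(batch[0])
    w = [[0.0] * dim for _ in range(2)]    # one weight row per class
    g2 = [[0.0] * dim for _ in range(2)]   # Adagrad sums of squared gradients
    history = []
    for _ in range(n_iters):
        grad = [[0.0] * dim for _ in range(2)]
        loss = 0.0
        for x, y in zip(batch, labels):
            # Forward pass: raw outputs, Softmax, per-tracklet loss.
            z = [sum(wc[k] * x[k] for k in range(dim)) for wc in w]
            p = softmax(z)
            loss += -math.log(p[y])
            # Backward pass: d(loss)/d(z_c) = p_c - 1 if c is the true class.
            for c in range(2):
                delta = p[c] - (1.0 if c == y else 0.0)
                for k in range(dim):
                    grad[c][k] += delta * x[k]
        n = len(batch)
        history.append(loss / n)
        # Adagrad: per-parameter step scaled by the accumulated gradient norm.
        for c in range(2):
            for k in range(dim):
                g = grad[c][k] / n
                g2[c][k] += g * g
                w[c][k] -= lr * g / (math.sqrt(g2[c][k]) + eps)
    return w, history


rng = random.Random(0)
# Synthetic data: positives cluster around +1, negatives around -1.
batch, labels = [], []
for _ in range(20):
    y = rng.randint(0, 1)
    centre = 1.0 if y == 1 else -1.0
    batch.append([centre + rng.gauss(0, 0.3), centre + rng.gauss(0, 0.3), 1.0])
    labels.append(y)

weights, history = train(batch, labels)
```

With zero initial weights both classes start at probability 0.5, so the first recorded loss equals ln 2, and it decreases as the Adagrad updates separate the two clusters.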

5 Tracklets Generation

To create a set of training tracklets, a data creation module has been employed. The data generation tool has been implemented as a set of C++ functions, which are publicly available. This tool extracts person images from the MOT17 dataset. However, instead of forming tracklets directly from these images, features are first computed from them, and subsequently, the features are used to create the training, validation and test sets of tracklets. Each feature is computed from an image captured in a given frame and corresponds to a given identity.

Five different formulations have been designed to combine the features and, consequently, five different types of tracklet sets (I to V) have been created and used to train the MS-DoAS network, and the resulting models have been evaluated.

These formulations are defined by Eqs. 6 to 15, in which the set size is its number of tracklets. Every set is formed by the random mixture of two subsets, one formed by positive samples and the other by negative samples.

The first set is the simplest one. Every tracklet is formed by features of a certain person in consecutive frames, as Fig. 3 shows. In the case of the negative tracklets, a different person’s representation is taken as the first component to simulate the comparison of an agent with a non-corresponding measurement. The subsets of positive and negative samples for the first set are defined by Eqs. 6 and 7, respectively.

Figure 3: Examples of the images from which tracklets of the first set are generated. Positive tracklets are underlined in green, and negatives, in red. The label of each tracklet renders the number of the frame from which the first component was extracted, in the sequences of the MOT17 dataset.

The second set is similar to the previous one. However, in positive tracklets, the frame from which the first component is extracted does not have to be consecutive to that of the next component; a time step (difference in frame numbers) of up to a maximum number of frames is allowed between them. In that way, the identification of a person after a short-term occlusion can be simulated to train the model to re-identify agents. The subsets of positive and negative samples for the second set are defined by Eqs. 8 and 9, respectively.


The third set allows up to a maximum number of time steps, each with a bounded maximum size. Therefore, not only can the first component be extracted from a frame that is not consecutive to that of the adjacent component, but other randomly located time steps can also be generated in both positive and negative tracklets. In that way, agents’ re-identification in previous frames is simulated. The subsets of positive and negative samples for the third set are defined by Eqs. 10 and 11, respectively.
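The placement of random time steps can be sketched as follows. This is an illustration under assumptions, not the released C++ tool: the function `sample_frame_numbers` and its parameters `max_gaps` and `max_step` are hypothetical names for the two limits mentioned above, and the sketch does not check that the person was actually detected in the chosen frames.

```python
import random


def sample_frame_numbers(start_frame, length, max_gaps, max_step, rng):
    """Frame numbers for one tracklet, newest first.

    Adjacent components are one frame apart, except at up to `max_gaps`
    randomly chosen positions, where the step is drawn from 2..max_step.
    """
    n_links = length - 1
    n_gaps = rng.randint(0, min(max_gaps, n_links))
    gap_positions = set(rng.sample(range(n_links), n_gaps))
    frames = [start_frame]
    for link in range(n_links):
        step = rng.randint(2, max_step) if link in gap_positions else 1
        frames.append(frames[-1] - step)
    return frames


rng = random.Random(0)
frames = sample_frame_numbers(
    start_frame=100, length=6, max_gaps=2, max_step=5, rng=rng
)
steps = [a - b for a, b in zip(frames, frames[1:])]
```

Setting `max_gaps=0` reproduces the consecutive-frame tracklets of the first set, which makes the simpler formulations special cases of this one.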


The fourth set is similar to the first one, but up to a maximum number of intruders are added in random locations of the positive and negative tracklets. That means that some components are substituted by features of different people, to simulate incorrect associations in previous frames. The subsets of positive and negative samples for the fourth set are defined by Eqs. 12 and 13, respectively, which use the component-wise product operation. These equations involve a binary mask vector, of the same length as the tracklets, randomly generated for every tracklet. Positions where the mask is active correspond to the introduction of an intruder. The intruders come from a second tracklet formed by randomly picked intruders. Both the mask and the intruder tracklet are different for every created tracklet.
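The mask-based substitution can be sketched as below. The sketch makes two assumptions that the text does not confirm: the mask value 1 marks an intruder, and the first component is never replaced. The function name `insert_intruders`, the parameter `max_intruders` and the string placeholders used as features are hypothetical.

```python
import random


def insert_intruders(tracklet, intruders, max_intruders, rng):
    """Replace up to `max_intruders` components by intruder features.

    Component-wise this is T' = (1 - M) * T + M * I, with a random binary
    mask M. Position 0 is kept untouched (an assumption of this sketch).
    """
    n = len(tracklet)
    k = rng.randint(0, min(max_intruders, n - 1))
    positions = set(rng.sample(range(1, n), k))
    mask = [1 if i in positions else 0 for i in range(n)]
    mixed = [intruders[i] if m else tracklet[i] for i, m in enumerate(mask)]
    return mixed, mask


rng = random.Random(1)
tracklet = ["a0", "a1", "a2", "a3", "a4", "a5"]
intruders = ["x0", "x1", "x2", "x3", "x4", "x5"]
mixed, mask = insert_intruders(tracklet, intruders, max_intruders=2, rng=rng)
```

Keeping the mask alongside the mixed tracklet makes it easy to check afterwards which components were substituted.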


The fifth set is a combination of the third and the fourth sets. It includes time steps and intruders, to train the model with a challenging dataset, making it robust to deal with real inputs when it is applied in a tracking algorithm. The subsets of positive and negative samples for the fifth set are defined by Eqs. 14 and 15, respectively.


It should be noted that in the negative tracklets of training sets IV and V, the first component could present the same identity as some of the intruders, making the discrimination harder.

In order to generate a wide variety of tracklets, the intruder components and the first component in the negative tracklets have been obtained not only by taking different person detections from the same sequence but also from different ones, resulting in larger, cross-sequence training sets of tracklets.

6 Experimental Results

To evaluate the proposed model, both its discriminative capacity and the performance of the MOT algorithm into which the model has been integrated have been tested. The dataset used and the protocol followed to train and test the model are described below.

6.1 Datasets

The MOT17 dataset, which is publicly available, has been selected to train and test the model. This dataset belongs to the MOTChallenge, a Multiple Object Tracking benchmark which provides a unified framework to standardise the evaluation of MOT methods. It contains fourteen varied real-world surveillance video sequences in unconstrained environments (twelve outdoor sequences and two indoor sequences), filmed with both static and moving cameras. It contains the same sequences as MOT16 [milan2016mot16], but with an extended, more accurate ground truth. The resolution is 1920x1080 in twelve of the sequences and 640x480 in the rest of them. There is a total of 11235 frames and 546 different identities.

The sequences of the MOT17 dataset are split into two groups. The sequences of the first group are labelled, i.e. they are accompanied by their ground-truth files with annotations about individuals’ location and identity. Person images have been extracted from this group, and they have been divided, in turn, into two groups to train and test the discrimination capacity of the proposed neural model.

Secondly, the unlabelled sequences of the second group have been used to evaluate an MOT algorithm in which the proposed MS-DoAS model has been integrated. Since the ground truth of these sequences is not publicly available, the algorithm’s output has been submitted to the public evaluation platform of the MOTChallenge, which computes and reports standard performance metrics.

Furthermore, for every sequence, the MOT17 dataset provides the detections given by three different people detectors (DPM [felzenszwalb2010object], Faster R-CNN [ren2015faster], and SDP [yang2016exploit]). Therefore, three different versions of every sequence are available, resulting in 14 × 3 = 42 sequences.

6.2 Evaluation of the discriminative capacity of the model

This article proposes measuring the Degree of Appearance Similarity through a pair-wise binary classification model. The model has therefore been evaluated as a binary classifier, testing its ability to discriminate between positive and negative tracklets, which is summarised by a ROC curve.


This curve illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. The threshold defines the value below which the classifier output, the MS-DoAS metric, is taken as the prediction of a negative tracklet, and from which it is taken as the prediction of a positive tracklet. In that way, the chosen threshold divides the distance space into two ranges of values, one for each class.

The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR), defined by Eqs. 16 and 17, respectively, where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively.

$$\mathrm{TPR} = \frac{TP}{TP + FN} \tag{16}$$

$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{17}$$
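The threshold sweep behind a ROC curve can be sketched in a few lines. The example below is only illustrative: the function name `roc_points`, the toy scores and the convention that a score equal to the threshold counts as positive are assumptions, and the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN) are used.

```python
def roc_points(scores, labels, thresholds):
    """Return one (FPR, TPR) pair per threshold.

    scores: classifier outputs, where a higher value means more similar.
    labels: 1 for positive tracklets, 0 for negative ones.
    Assumption: a sample is predicted positive when score >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for th in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= th and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= th and y == 0)
        tpr = tp / n_pos if n_pos else 0.0
        fpr = fp / n_neg if n_neg else 0.0
        points.append((fpr, tpr))
    return points


# Toy data, invented for the example.
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]
thresholds = [0.0, 0.25, 0.5, 0.75, 1.01]
for th, (fpr, tpr) in zip(thresholds, roc_points(scores, labels, thresholds)):
    print(f"threshold={th:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold moves the operating point towards (1, 1), while raising it moves the point towards (0, 0). The curve is the set of points obtained in between.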


Moreover, the F1 score, defined by Eq. 19, provides a trade-off between the Positive Predictive Value (PPV), Eq. 18, and the True Positive Rate (TPR), Eq. 16. For that reason, F1 has been used to compare methods in the conducted evaluation, together with the Accuracy metric (Acc), defined by Eq. 20, which is the proportion of well-classified samples.

$$\mathrm{PPV} = \frac{TP}{TP + FP} \tag{18}$$

$$F_1 = 2 \cdot \frac{\mathrm{PPV} \cdot \mathrm{TPR}}{\mathrm{PPV} + \mathrm{TPR}} \tag{19}$$

$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN} \tag{20}$$

The numbers of positive and negative samples in the test set have been balanced to provide a fair evaluation through the accuracy metric, which is not appropriate when the classes are skewed.
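For a single threshold, the metrics referred to by Eqs. 16 and 18 to 20 can be computed from the four confusion counts. The sketch below assumes the score in question is the usual F1 (the harmonic mean of PPV and TPR) and that a sample is predicted positive when its score reaches the threshold. The function name and the toy data are hypothetical.

```python
def classification_metrics(scores, labels, threshold):
    """Compute PPV, TPR, F1 and accuracy at one threshold.

    Assumption: prediction is positive when score >= threshold.
    """
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 0:
            tn += 1
        else:
            fn += 1

    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * ppv * tpr / (ppv + tpr) if (ppv + tpr) else 0.0
    acc = (tp + tn) / len(labels) if labels else 0.0
    return {"PPV": ppv, "TPR": tpr, "F1": f1, "Acc": acc}


# Balanced toy test set (three positives, three negatives), invented here.
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]
print(classification_metrics(scores, labels, threshold=0.5))
```

With balanced classes, as in the toy set above, accuracy is not dominated by either class, which is the reason given in the text for balancing the test set.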


Five different formulations have been designed to generate training tracklets that simulate the tracking of the following types of agents:

  1. People who have been previously well identified.

  2. Just re-identified people after a temporary disappearance.

  3. People who have suffered from several disappearances, i.e. their tracks have been interrupted several times.

  4. People who have been wrongly identified (mismatched) in some of the previous frames.

  5. People who have suffered from several disappearances and mismatches.

Five different experiments have been conducted. They are called Exp.MS-DoAS.i, where i takes the value 1, 2, 3, 4, or 5 to denote that the model has been trained on the set TR1, TR2, TR3, TR4, or TR5, respectively. Every experiment provides a model that has been tested over five different test sets, TS1, TS2, TS3, TS4 and TS5, which were also generated according to the five presented tracklet formulations. Therefore, 5 × 5 = 25 different tests have been performed.
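The cross-evaluation protocol can be pictured as a small grid in which each of the five models is scored on each of the five test sets. In the sketch below, `train_fn` and `eval_fn` are hypothetical placeholders for the actual training and evaluation routines, which are not specified here. Only the set and experiment names follow the text.

```python
def run_grid(train_fn, eval_fn, n_sets=5):
    """Train one model per training set and evaluate it on every test set.

    train_fn(train_name) -> model            (hypothetical callable)
    eval_fn(model, test_name) -> any result  (hypothetical callable)
    Returns a dict keyed by (experiment name, test set name).
    """
    results = {}
    for i in range(1, n_sets + 1):
        model = train_fn(f"TR{i}")
        for j in range(1, n_sets + 1):
            results[(f"Exp.MS-DoAS.{i}", f"TS{j}")] = eval_fn(model, f"TS{j}")
    return results


# Dummy stand-ins that only record which sets were paired.
grid = run_grid(train_fn=lambda tr: tr, eval_fn=lambda model, ts: (model, ts))
print(len(grid))
```

Each entry of the returned dictionary corresponds to one cell of a model-by-test-set results table.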

Fig. 4 presents a comparative graphic for every obtained MS-DoAS model. Each graphic shows the ROC curves resulting from testing the model in question over the five test sets. Fig. 5, in turn, presents a comparative graphic for every test set. Each graphic shows the ROC curves resulting from testing every obtained model over the test set in question. Moreover, Tables 2 and 3 show the highest values of the F1 score and the accuracy metric for each of the conducted tests. These tables also provide the classification threshold at which the maximum values were achieved.
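Reporting a maximum score together with the threshold at which it is reached amounts to a simple search over candidate thresholds. The sketch below does this for F1, using the equivalent count form F1 = 2TP / (2TP + FP + FN). The candidate thresholds, the toy data, the function names and the tie-breaking rule (the first maximising threshold wins) are assumptions made for the example.

```python
def f1_at(scores, labels, threshold):
    """F1 at one threshold; a sample is positive when score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(scores, labels, thresholds):
    """Return (threshold, F1) for the candidate with the highest F1."""
    best = max(thresholds, key=lambda th: f1_at(scores, labels, th))
    return best, f1_at(scores, labels, best)


# Toy data, invented for the example.
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]
th, f1 = best_threshold(scores, labels, [0.2, 0.35, 0.5, 0.7, 0.9])
print(f"best threshold = {th}, max F1 = {100 * f1:.1f}%")
```

The same search applies to the accuracy metric by swapping the scoring function.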

Figure 4: ROC curves from the evaluation of every MS-DoAS model on five different test sets (a), and zoomed region (b).
Figure 5: ROC curves from five different evaluations conducted for each one of the MS-DoAS models (a), and zoomed region (b).
(=) (=) (=) (=) (=)
(=) (=) (=) (=) (=)
(=) (=) (=) (=) (=)
(=0.5) (=) (=) (=) (=)
(=) (=) (=) (=) (=)
Table 2: Maximum F1 score value (in [%]) for every MS-DoAS model, evaluated over five different test sets, and the threshold value at which it is achieved.
(=) (=) (=) (=) (=)
(=) (=) (=) (=) (=)