Unsupervised Noisy Tracklet Person Re-identification

01/16/2021 ∙ by Minxian Li, et al. ∙ Nanjing University ∙ Queen Mary University of London

Existing person re-identification (re-id) methods mostly rely on supervised model learning from a large set of person identity labelled training data per domain. This limits their scalability and usability in large scale deployments. In this work, we present a novel selective tracklet learning (STL) approach that can train discriminative person re-id models from unlabelled tracklet data in an unsupervised manner. This avoids the tedious and costly process of exhaustively labelling person image/tracklet true matching pairs across camera views. Importantly, our method is more robust against arbitrarily noisy raw tracklet data, and is therefore scalable to learning discriminative models from unconstrained tracking data. This differs from a handful of existing alternative methods that often assume the existence of true matches and balanced tracklet samples per identity class. This is achieved by formulating a data adaptive image-to-tracklet selective matching loss function explored in a multi-camera multi-task deep learning model structure. Extensive comparative experiments demonstrate that the proposed STL model significantly surpasses the state-of-the-art unsupervised learning and one-shot learning re-id methods on three large tracklet person re-id benchmarks.


1 Introduction

Figure 1: (Top) Existing tracklet person re-id benchmarks exhibit less realistic evaluation scenarios for unsupervised model learning. This is due to data selection during manual annotation, which gives rise to an easier learning task with significantly less noisy data. In current benchmarks, tracklets without cross-camera matches are excluded, and poor tracklets are often discarded, including (Bottom): (a) Partial detection, (b) Multiple persons, (c) Non-person, (d) Severe occlusion, (e) Identity switch.

Person re-identification (re-id) is the task of matching the identity information of person bounding box images extracted from disjoint surveillance camera views [9]. Existing state-of-the-art re-id methods rely heavily on supervised deep learning [42, 19, 20, 40, 34, 2, 32, 46, 33, 35]. They assume a large set of cross-camera pairwise training data exhaustively labelled per surveillance camera network (i.e. domain), and often degrade significantly when deployed to new domains. Such poor cross-domain scalability has led recent research to focus on developing unsupervised domain adaptation [50, 5, 51, 30, 39] and unsupervised deep learning [16, 17, 22, 3] methods. In general, model learning with domain adaptation is less scalable since it assumes some common characteristics between the source and target domains, which is not always true.

Unsupervised deep learning re-id models [16, 17, 22, 3] have increasingly started to explore unlabelled tracking data (tracklets). This is reasonable and intuitive because most image frames of a tracklet are likely to share the same person identity (ID), providing rich spatio-temporal appearance variation information. To enable discriminative model optimisation, the key is to self-discover and learn reliable within-camera and cross-camera tracklet true matches among inevitably noisy tracklet training data. This is non-trivial because person ID labels are unavailable for discriminative learning, tracklet frame detections are noisy, and the majority of tracklet pairs are false matches.

Existing tracklet person re-id benchmarks (e.g. MARS [48] and DukeMTMC-SI-Tracklet [31, 17]) present artificially simplified evaluation scenarios for unsupervised learning. This is because after manual selection and annotation in dataset construction, their tracklet data are no longer realistic (Fig. 1). For example, all the tracklets without cross-camera true matches and those in poor conditions are often removed. Manually removing such tracklets during annotation significantly simplifies model learning, e.g. a 10% rank-1 rate difference in model performance [17]. In real-world applications, manual tracklet filtering is not available. Scalable unsupervised learning algorithms are required to automatically handle unconstrained raw tracklet data without manual data selection.

In this work, we consider the problem of unsupervised deep learning on unconstrained raw tracklet data, which is a more realistic and scalable setting than the existing tests [48, 31, 17]. Given unfiltered and unlabelled noisy training data, a more robust tracklet learning algorithm is required. To this end, we present a selective tracklet learning (STL) method. STL is characterised by a robust image-to-tracklet selective matching loss function that selectively associates true matching tracklets and adaptively suppresses potentially noisy frame images and tracklets in model learning. It does not assume the existence of true matches for individual tracklets within and across camera views.

The contributions of this work are as follows: (1) We analyse the limitations of the existing tracklet person re-id benchmarks and methods. In particular, the current benchmarks fail to reflect the true challenges of unsupervised model learning, due to the effect of data selection during manual annotation. This makes the developed methods less scalable and less robust in more realistic scenarios. (2) We formulate a selective tracklet learning (STL) method for unsupervised deep learning with superior robustness against unconstrained tracklet data. This is achieved by designing a data adaptive image-to-tracklet selective matching loss function. (3) To enable a more realistic unsupervised tracklet learning test, we introduce an unconstrained raw tracklet person re-id dataset, DukeMTMC-Raw-Tracklet, constructed from the DukeMTMC tracking benchmark [31]. Extensive comparative experiments show the performance advantages and superior robustness of STL over the state-of-the-art unsupervised and one-shot learning models on three tracklet person re-id benchmarks: MARS [48], DukeMTMC-SI-Tracklet [17, 31], and the newly introduced DukeMTMC-Raw-Tracklet.

2 Related Work

Supervised person re-id. Most existing person re-id models are supervised learning methods trained on a large set of cross-camera ID labelled data [18, 4, 19, 20, 40, 34, 2, 32, 33, 35]. Moreover, the training and test data are typically assumed to be sampled from the same surveillance camera network, i.e. the same domain. As a result, their scalability and usability are significantly reduced in large real-world applications, because no such large training sets are available in typical test domains due to high labelling costs.

Supervision reduction. To address the scalability and generalisation limitations of supervised learning re-id models, unsupervised model learning is desired. A trade-off can be achieved by semi-supervised learning [25, 38], although labelled data are still required. Alternatively, human-in-the-loop models can reduce the labelling effort by leveraging human-computer interaction [37, 24], although the process can be overly elaborate and involved. Unsupervised model learning is attractive since it requires no ID labelled training data. However, earlier attempts [7, 28, 14, 13, 12, 44, 26, 23, 36, 47] achieve rather poor re-id performance due to weak hand-crafted features.

Unsupervised domain adaptation person re-id. Recently, unsupervised domain adaptation methods have gained noticeable success [39, 6, 29, 45, 5, 51]. The idea is to transfer the available person identity information from a labelled source domain to an unlabelled target domain. Existing methods can be generally divided into three groups. The first group is image synthesis, which aims to render the source person identities into the target domain environment [50, 5, 51, 1], so that conventional supervised learning algorithms can be used to train re-id models. The second group follows the conventional feature alignment scheme [30, 29, 39]. The third group uses unsupervised clustering to generate pseudo labels for supervised learning [6, 45]. These methods are usually stronger than unsupervised learning methods. However, they assume a similar imagery data distribution between the source and target domains, which restricts their generalisation to arbitrary and unconstrained application scenarios.

Figure 2: An overview of the proposed selective tracklet learning (STL) method for unsupervised tracklet re-id model learning. (a) Frames from the same tracklet, including noisy frames. (b) Both the frame features and the tracklet features are used for model learning. (c) An adaptive sampler is introduced to generate per-camera and cross-camera neighbours of tracklets. (d) A per-camera image-to-tracklet selective matching loss is proposed to learn the feature representation against noisy tracklet data within the same camera. (e) A cross-camera image-to-tracklet selective matching loss is proposed to learn the feature representation against noisy tracklet data across cameras.

Unsupervised video tracklet person re-id. Unsupervised video tracklet re-id methods have advanced notably [16, 3, 22, 17]. They excel at leveraging the spatio-temporal continuity information of tracklet data freely available from person tracking. Whilst solving this tracklet learning problem is inherently challenging, it promises enormous potential due to the massive surveillance videos available for model learning. To encourage novel model development, two large scale tracklet person re-id benchmarks [48, 17, 31] were constructed. They are based on exhaustive data selection and identity labelling during benchmarking. When testing unsupervised learning algorithms, one assumes no identity labels whilst still using the same training tracklet data. This means the training data are selected manually, rather than being the unconstrained raw tracklet data typically encountered in real-world unsupervised model learning. This discrepancy in model training data means that the existing benchmarks fail to test the realistic performance of unsupervised model learning algorithms. Moreover, this data bias also has an undesired influence on algorithm design. For example, current methods [16, 3, 17, 22] often assume the existence of tracklet matches and class balanced training data. Such assumptions, however, are often not valid on truly unconstrained tracklet data. Consequently, these methods are less robust and scalable to more realistic unfiltered tracklets (Table 2).

In this study, we aim to remove the aforementioned artificial and unrealistic assumptions in order to scale up unsupervised learning algorithms to real-world application scenarios. To this end, we propose a selective tracklet learning approach. It is capable of learning a discriminative re-id model directly from unlabelled raw tracklet data without manual selection and noise removal. To enable testing the true performance of such algorithms, we further introduce an unconstrained raw tracklet person re-id benchmark built from the DukeMTMC videos.

3 Methodology

Problem setting. We start with automated person tracklet extraction on a large set of multi-camera surveillance videos by off-the-shelf detection and tracking models [15, 8]. Let us denote the $i$-th tracklet from the $t$-th camera as $S^t_i$, which contains a varying number of image frames. There are a total of $T$ camera views, and the per-camera tracklet number $N_t$ varies. In unsupervised learning, no person ID labels are available on the person tracklet data. The objective is to learn a discriminative person re-id model from these unconstrained raw tracklets without any manual processing.
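
For concreteness, the following is a minimal sketch of how such unconstrained raw tracklet data could be organised per camera; the class and field names are illustrative assumptions rather than anything specified in the paper.

```python
from dataclasses import dataclass, field
from typing import List

import numpy as np


@dataclass
class Tracklet:
    """One automatically generated tracklet: a variable-length sequence of
    person bounding-box frame images from a single camera view."""
    camera_id: int                 # camera index t
    tracklet_id: int               # per-camera pseudo label i (no person ID)
    frames: List[np.ndarray] = field(default_factory=list)  # HxWx3 image crops


# Unconstrained raw training data: no person ID labels, no manual filtering,
# and possibly noisy frames (partial detections, ID switches, non-persons).
num_cameras = 8  # e.g. the DukeMTMC camera count
raw_tracklets: List[List[Tracklet]] = [[] for _ in range(num_cameras)]
```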

Approach overview. We adopt multi-task unsupervised tracklet learning, with each task dedicated to an individual camera view as in [17, 16]. In particular, we first automatically annotate each tracklet with a unique class label per camera. Each camera is therefore associated with an independent class label set. In multi-task learning, we ground all the branches on a common re-id feature representation. The model takes individual frame images as input (Fig. 2(a)) rather than whole tracklets, which is favourable because it allows modelling of noisy image frames within tracklets. After extracting the frame features from the backbone CNN, we aggregate the frame features of the same tracklet into a tracklet feature. In addition, we propose an adaptive sampler to generate two neighbour sets: per-camera neighbours and cross-camera neighbours. Based on these two neighbour sets, we propose the per-camera and cross-camera image-to-tracklet selective matching loss functions to learn the feature representation against noisy tracklet data. An overview of STL is depicted in Fig. 2.

3.1 Per-Camera View Learning

For enforcing model learning constraints for each camera view, the softmax Cross Entropy (CE) classification loss function is utilised [17, 16] as:

\mathcal{L}_{\mathrm{ce}} = -\sum_{j=1}^{N_t} \mathbb{1}(y = j)\, \log p(j \,|\, \boldsymbol{x}), \qquad p(j \,|\, \boldsymbol{x}) = \frac{\exp(\boldsymbol{w}_j^{\top} \boldsymbol{x})}{\sum_{k=1}^{N_t} \exp(\boldsymbol{w}_k^{\top} \boldsymbol{x})}   (1)

where $\boldsymbol{x}$ is the task-shared feature vector of an input frame image with per-camera tracklet label $y$, and $\boldsymbol{w}_j$ denotes the classifier parameters for the $j$-th tracklet label. The indicator function $\mathbb{1}(\cdot)$ returns $1$ for true arguments and $0$ otherwise. The term $p(j \,|\, \boldsymbol{x})$ defines the posterior probability on the tracklet label $j$.

In this context, the CE loss assumes that each tracklet is to be matched with only a single person candidate. This is not valid for many raw tracklet data with noise introduced into model learning. To address this limitation, we propose a novel image-to-tracklet selective matching loss formulation. It is a weighted non-parametric formulation of the CE loss. Formally, the proposed per-camera image-to-tracklet selective matching loss function for a specific training image frame $I_i$ is designed as (Fig. 2(d)):

\mathcal{L}_{\mathrm{pcm}} = -\sum_{j=1}^{N_t} \omega_{ij}\, \log p_{ij}   (2)

where $\boldsymbol{x}_i$ and $\boldsymbol{s}_j$ specify the feature vectors of image $I_i$ and the $j$-th tracklet $S_j$ respectively, $p_{ij}$ is the matching probability of image $I_i$ and tracklet $S_j$, $N_t$ is the tracklet number of the $t$-th camera view, and $\omega_{ij}$ denotes the similarity weight between $I_i$'s corresponding tracklet and the $j$-th tracklet $S_j$. This weight aims to minimise the negative effect of trajectory fragmentation by taking into account the tracklet pairwise information in classification [17]. The specific computation of $\omega_{ij}$, as defined in Eq. (6), will be discussed below.

To suppress the contamination of tracklet representations by noisy and distracting image frames, we introduce the posterior probability based on image-to-tracklet selective matching as:

p_{ij} = \frac{\exp(\boldsymbol{x}_i^{\top} \boldsymbol{s}_j / \tau)}{\sum_{k=1}^{N_t} \exp(\boldsymbol{x}_i^{\top} \boldsymbol{s}_k / \tau)}   (3)

where $\boldsymbol{x}_i^{\top} \boldsymbol{s}_j$ expresses the matching degree between image $I_i$ and tracklet $S_j$, and $\tau$ is a temperature parameter that controls the concentration of the distribution [11]. The matching degree is normalised over all the tracklets by the softmax function.

In contrast to the point-to-point probability in the non-parametric classification loss [41], Eq. (3) is a point-to-set matching probability, which is more robust to contaminated and distracting tracklets. This can be understood from two aspects: (1) Suppose a tracklet contains noisy frames due to multiple persons, non-persons or ID switches, etc. The noisy frames tend to have much smaller matching scores against a true-matching tracklet than other clean frames. (2) The image-to-tracklet pairs with large matching scores become significantly more salient after applying the exponential operation $\exp(\cdot/\tau)$. This is effectively a process of selecting good-quality matching tracklets (e.g. less noisy true matches) and simultaneously down-weighting the remaining ones (e.g. noisier true matches and false matches). If there is no true match, all $p_{ij}$ values tend to be small. This data adaptive and selective matching capability is highly desirable for dealing with noisy raw tracklets in unsupervised tracklet re-id learning.
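
To make the formulation concrete, below is a minimal PyTorch-style sketch of the per-camera image-to-tracklet selective matching loss (Eqs. (2)-(3)), assuming L2-normalised frame features, a per-camera tracklet feature memory, and precomputed similarity weights; the tensor shapes and function name are illustrative, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


def per_camera_selective_matching_loss(frame_feats, tracklet_memory, weights, tau=0.1):
    """Per-camera image-to-tracklet selective matching loss (sketch of Eqs. (2)-(3)).

    frame_feats:     (B, D)   L2-normalised features of frame images from one camera
    tracklet_memory: (N_t, D) L2-normalised tracklet features of the same camera
    weights:         (B, N_t) similarity weights w_ij of Eq. (6), rows summing to 1
    tau:             temperature controlling the concentration of matching scores
    """
    # Image-to-tracklet matching degrees x_i^T s_j scaled by the temperature (Eq. (3))
    logits = frame_feats @ tracklet_memory.t() / tau        # (B, N_t)
    log_p = F.log_softmax(logits, dim=1)                    # log p_ij
    # Weighted non-parametric cross-entropy over tracklet classes (Eq. (2))
    return -(weights * log_p).sum(dim=1).mean()
```

Each image's weight row is that of its own tracklet, so frames of the same tracklet share the same selective targets.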

In unsupervised tracklet training data, the majority of tracklet pairs per camera are false matches. Therefore, considering all the pairs in Eq. (2) is likely to introduce a large amount of negative matching. As in [17], we consider only a fraction of tracklets that are more likely true matches (i.e. tracklet association). To this end, $k$-nearest neighbour ($k$-NN) search is often adopted:

\mathcal{N}_k(S_i) = \{\, S_j \;|\; S_j \text{ is among the } k \text{ tracklets most similar to } S_i \text{ in the same camera},\ j \neq i \,\}   (4)

For each tracklet, this implicitly assumes $k$ true matches in each camera view. Given unconstrained raw tracklets without manual selection, this condition is often harder to meet. Data adaptive tracklet association is hence needed.

To that end, we further exploit the concept of the $\epsilon$-neighbourhood ($\epsilon$-NN) (Fig. 2(c)):

\mathcal{N}_{k,\epsilon}(S_i) = \{\, S_j \in \mathcal{N}_k(S_i) \;|\; \boldsymbol{s}_i^{\top} \boldsymbol{s}_j > \epsilon \,\}   (5)

where $\epsilon$ is the neighbourhood boundary threshold. By adding such a similarity score constraint, we aim to filter out the noisy tracklet pairs associated by $k$-NN with low pairwise proximity. The resulting neighbourhood sizes vary from 0 to $k$ in accordance with how many similar tracklets exist, i.e. they are tracklet data adaptive. This property is critical for model learning on unconstrained tracklets without any guarantee of a fixed number of reliable good-quality true matches.

After obtaining the possibly matching tracklets $\mathcal{N}_{k,\epsilon}(S_i)$ for a specific tracklet $S_i$, we can compute the tracklet similarity weight as their normalised quantity:

\omega_{ij} = \begin{cases} \dfrac{1}{|\mathcal{N}_{k,\epsilon}(S_i)| + 1}, & \text{if } S_j \in \mathcal{N}_{k,\epsilon}(S_i) \cup \{S_i\} \\ 0, & \text{otherwise} \end{cases}   (6)

As such, only visually similar tracklets, potentially with minimal noisy image frames, are encouraged (Eq. (2)) to be positive matches in model learning.
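
The data adaptive association and the resulting weights could be computed as in the sketch below, which reads Eqs. (4)-(6) as k-NN search followed by ε filtering and uniform normalisation over the surviving neighbourhood; the function name, the uniform weighting, and the inclusion of the tracklet itself are interpretive assumptions rather than the paper's exact recipe.

```python
import torch


def adaptive_tracklet_weights(tracklet_memory: torch.Tensor, k: int, eps: float = 0.7):
    """Data adaptive k-NN + eps-NN tracklet association (sketch of Eqs. (4)-(6)).

    tracklet_memory: (N, D) L2-normalised per-camera tracklet features s_j
    k:               maximum number of neighbours considered per tracklet
    eps:             similarity threshold filtering out low-proximity pairs
    Returns an (N, N) matrix whose row i holds the weights w_ij for tracklet S_i.
    """
    sim = tracklet_memory @ tracklet_memory.t()            # pairwise similarities s_i^T s_j
    sim.fill_diagonal_(float("-inf"))                       # exclude self from the k-NN search
    topk_sim, topk_idx = sim.topk(k, dim=1)                 # Eq. (4): k-NN candidates

    n = tracklet_memory.size(0)
    neighbours = torch.zeros(n, n, dtype=torch.bool)
    neighbours.scatter_(1, topk_idx, topk_sim > eps)        # Eq. (5): eps-NN filtering
    neighbours |= torch.eye(n, dtype=torch.bool)            # a tracklet always matches itself

    # Eq. (6): normalise uniformly over the surviving neighbourhood
    weights = neighbours.float()
    return weights / weights.sum(dim=1, keepdim=True)
```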

Discussion. In formulation, our image-to-tracklet selective matching loss is similar to the instance loss [41]. Both are non-parametric variants of CE. However, there are a few fundamental differences: (1) The instance loss treats each individual image as a class, whilst our loss considers tracklet-wise classes. Conceptually, this introduces a two-tier hierarchical structure into the instance loss: local image and global tracklet. (2) The instance loss does not consider the camera view structure. In contrast, we uniquely combine the multi-task inference idea with tracklet classes to additionally exploit the underlying correlation between per-camera tracklet groups. Moreover, our loss design shares some spirit with the focal loss [21], both using a modulating parameter to control a target degree (a noise measure in ours and an imbalance measure in the focal loss). But they have more fundamental differences beyond their different formulations: (1) The focal loss is parametric and supervised, versus our non-parametric and unsupervised loss. (2) The focal loss aims to solve the class imbalance between positive and negative samples in supervised learning, whilst ours is for selective and robust image-to-tracklet matching in unsupervised learning.

3.2 Cross-Camera View Learning

Besides per-camera view learning, it is crucial to simultaneously consider cross-camera tracklet learning [17, 16]. To this end, we need to similarly perform tracklet association across camera views. We consistently utilise $k$-NN + $\epsilon$-NN for tracklet association across different camera views. Specifically, for a tracklet $S_i$ we search its nearest tracklets from different cameras (Fig. 2(c)):

\mathcal{N}^{c}_{k,\epsilon}(S_i) = \{\, S_j \in \mathcal{N}^{c}_k(S_i) \;|\; \boldsymbol{s}_i^{\top} \boldsymbol{s}_j > \epsilon \,\}   (7)

where $\mathcal{N}^{c}_k(S_i)$ denotes the $k$ tracklets from the other camera views that are most similar to $S_i$.

With the self-discovered cross-camera tracklet association of a specific tracklet $S_i$ which contains a training image frame $I_i$, we then enforce a cross-camera image-to-tracklet matching loss function (Fig. 2(e)):

\mathcal{L}_{\mathrm{ccm}} = -\frac{1}{|\mathcal{N}^{c}_{k,\epsilon}(S_i)|} \sum_{S_j \in \mathcal{N}^{c}_{k,\epsilon}(S_i)} \log p_{ij}   (8)

This loss encourages the image to have a feature representation as similar as possible to visually alike tracklets from other camera views. In doing so, person appearance variation across camera views is minimised when the image-to-tracklet association is correct.
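
A brief sketch of how the cross-camera term could mirror the per-camera loss, averaging the negative log matching probability over the discovered cross-camera neighbours; this reading of Eqs. (7)-(8) and the tensor layout are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F


def cross_camera_matching_loss(frame_feats, cross_memory, neighbour_mask, tau=0.1):
    """Cross-camera image-to-tracklet selective matching loss (sketch of Eq. (8)).

    frame_feats:    (B, D) L2-normalised frame features
    cross_memory:   (M, D) L2-normalised features of tracklets from other cameras
    neighbour_mask: (B, M) boolean k-NN + eps-NN cross-camera neighbours (Eq. (7))
    """
    log_p = F.log_softmax(frame_feats @ cross_memory.t() / tau, dim=1)
    counts = neighbour_mask.sum(dim=1).clamp(min=1)
    # Images whose cross-camera neighbourhood is empty contribute zero loss (data adaptive)
    per_image = -(log_p * neighbour_mask.float()).sum(dim=1) / counts
    return per_image.mean()
```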

Input: Automatically generated raw tracklet data.
Output: An optimised person re-id model.
for epoch = 1 to max_epoch do
     if epoch is within the first training stage then
         Update the per-camera tracklet neighbourhoods (Eq. (5))
         for iter = 1 to per-epoch iteration number do
             Feedforward a mini-batch of tracklet frame images
             Update the tracklet representations (Eq. (10))
             Compute the per-camera matching loss (Eq. (2))
             Update the model by back-propagation
         end for
     else /* the second training stage */
         Update the per-camera tracklet neighbourhoods (Eq. (5))
         Update the cross-camera tracklet neighbourhoods (Eq. (7))
         for iter = 1 to per-epoch iteration number do
             Feedforward a mini-batch of tracklet frame images
             Update the tracklet representations (Eq. (10))
             Compute the full STL training loss (Eq. (9))
             Update the model by back-propagation
         end for
     end if
end for

Algorithm 1: The STL model training procedure.
Dataset | Training: #Identity / #Tracklet | Test: #Identity / #Tracklet
MARS* [48] | 625 / 8,298 | 636 / 11,310
DukeMTMC-SI-Tracklet* [31, 17] | 702 / 5,803 | 1,086 / 6,844
DukeMTMC-Raw-Tracklet* (New) | 702 / 7,427 | 1,105 / 8,950
DukeMTMC-Raw-Tracklet (New) | 702 + unknown / 12,452 | 1,105 / 8,950
Table 1: Dataset statistics and benchmarking setting. *: With tracklet selection.

3.3 Model Training

Overall objective loss. By combining the per-camera and cross-camera learning constraints, we obtain the final model objective loss function as:

\mathcal{L}_{\mathrm{stl}} = \mathcal{L}_{\mathrm{pcm}} + \lambda\, \mathcal{L}_{\mathrm{ccm}}   (9)

where $\lambda$ is a balancing weight. Trained jointly by $\mathcal{L}_{\mathrm{pcm}}$ and $\mathcal{L}_{\mathrm{ccm}}$, the STL model is able to mine the discriminative re-id information both within and across camera views concurrently. This overall loss function is differentiable, therefore allowing for end-to-end model optimisation. For more accurate tracklet association between camera views, we start to apply the cross-camera matching loss (Eq. (8)) in the middle of training, as in [17]. The STL model training process is summarised in Algorithm 1.
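
A compact sketch of the overall objective with the two-stage schedule follows; the second-stage start epoch matches the implementation details in Section 4.1, while the default λ value here is only a placeholder since the exact balancing weight is not restated in this text.

```python
def stl_training_loss(loss_pcm, loss_ccm, epoch, stage2_start_epoch=10, lam=1.0):
    """Overall STL objective (sketch of Eq. (9)) under the two-stage schedule:
    only the per-camera matching loss is used in the first training stage,
    and the cross-camera matching loss is added from the second stage onwards."""
    if epoch < stage2_start_epoch:
        return loss_pcm
    return loss_pcm + lam * loss_ccm
```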

Tracklet representation. In our model formulation, we need to represent each tracklet as a whole. To obtain this feature representation, we adopt a moving average strategy [27] for computational scalability. Specifically, we maintain a representation memory $\boldsymbol{s}_j$ for each tracklet during training. In each training iteration, given an input image frame $I_i$ with feature $\boldsymbol{x}_i$, we update its corresponding tracklet's feature vector as:

\boldsymbol{s}_j \leftarrow (1 - \eta)\, \boldsymbol{s}_j + \eta\, \boldsymbol{x}_i   (10)

where $\eta \in (0, 1)$ is the moving average update rate. This scheme updates only the tracklets whose images are sampled in the current mini-batch at each iteration. Although not all the tracklets are updated and synchronised along with the model training, the discrepancy from their accurate representations is expected to be marginal due to the small model learning rate, and therefore matters little.
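
The tracklet memory of Eq. (10) can be maintained as in the sketch below, assuming cosine-normalised features; the update rate eta and the re-normalisation step are illustrative choices rather than stated settings.

```python
import torch
import torch.nn.functional as F


class TrackletMemory:
    """Per-tracklet feature memory updated by a moving average (sketch of Eq. (10))."""

    def __init__(self, num_tracklets: int, feat_dim: int, eta: float = 0.5):
        self.features = F.normalize(torch.randn(num_tracklets, feat_dim), dim=1)
        self.eta = eta  # moving average update rate (illustrative value)

    @torch.no_grad()
    def update(self, frame_feats: torch.Tensor, tracklet_ids: torch.Tensor):
        """Update only the tracklets whose frames appear in the current mini-batch."""
        for feat, tid in zip(frame_feats, tracklet_ids):
            s = (1.0 - self.eta) * self.features[tid] + self.eta * feat
            self.features[tid] = F.normalize(s, dim=0)
```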

Model parameter setting. In unsupervised re-id learning, we have no access to labelled training data for model parameter selection by cross-validation. It is also improper to use any test data for model parameter tuning, since such data are not available in real-world applications. All the parameters of a model are therefore estimated empirically. Moreover, the identical parameter setting should be used for all the different datasets, for domain scalability and generic large scale application considerations. With this principle, we set the parameters of our STL model identically for all tests and experiments: the temperature τ = 0.1 for Eq. (3) (cf. Table 4), the neighbourhood threshold ε = 0.7 for Eqs. (5) and (7) (cf. Table 7), and a fixed balancing weight λ for Eq. (9). Other parameter settings may give better model performance on specific tests, but they are not exhaustively explored in our study, because doing so often assumes extra domain knowledge which is not generally available, therefore making the performance evaluation less realistic and less generic.

To minimise the negative effect of inaccurate cross-camera tracklet association, STL begins model training with only the per-camera image-to-tracklet selective matching loss (Eq. (2)) during the first training stage, and then adds the cross-camera image-to-tracklet selective matching loss (Eq. (8)) in the second training stage. We do not incorporate the cross-camera matching loss until the second training stage because the feature representations are insufficiently reliable for cross-camera tracklet matching at the beginning of training.

4 Experiments

4.1 Experimental Setup

Datasets. To evaluate the proposed STL model, we tested on two publicly available person tracklet datasets: MARS [48] and DukeMTMC-SI-Tracklet [31, 17]. The dataset statistics and test settings are given in Table 1. These two datasets contain only manually selected person tracklet data, therefore presenting less realistic unsupervised learning scenarios. To enable a more realistic algorithm test, we introduced a new raw tracklet person re-id dataset.

As with DukeMTMC-SI-Tracklet, we used the DukeMTMC tracking videos [31]. To extract person tracklets, we leveraged an efficient detector-and-tracker model [8] and a graph association method. From all the DukeMTMC videos captured by 8 distributed surveillance cameras, we obtained 21,402 person tracklets, comprising a total of 1,341,096 bounding box images. Detection and tracking errors are inevitable, as shown in Fig. 1(b). In reality, we usually assume no manual effort for cleaning tracklet data, and instead expect the unsupervised learning algorithm to be sufficiently robust to any errors and noise. We therefore keep all tracklets. In the spirit of DukeMTMC-SI-Tracklet, we call the newly introduced dataset DukeMTMC-Raw-Tracklet.

For the DukeMTMC-Raw-Tracklet dataset benchmarking, we utilised a similar method to [17] to automatically label the identity classes of the test person tracklets. This enables model performance testing. We used the same 1,105 test person identity classes as DukeMTMC-SI-Tracklet to allow an apples-to-apples comparison between datasets. The number of training person identity classes is unknown since no manual annotation is performed, a natural property of unconstrained raw tracklets in real-world application scenarios.

Figure 3: Cross-camera tracklet matching pairs from (a) MARS, (b) DukeMTMC-SI-Tracklet, and (c) DukeMTMC-Raw-Tracklet.

To assess the exact effect of manual selection on unsupervised model learning, we further selected the tracklets of the same 702 training person identities as DukeMTMC-SI-Tracklet. Along with the same test data, we built another version of the dataset with selected training tracklets, in the manner of the conventional datasets [48, 17]. We name this dataset DukeMTMC-Raw-Tracklet*. The statistics of both dataset versions, with and without selection, are described in Table 1.

Performance metrics. For model performance measurement, we used the Cumulative Matching Characteristic (CMC) and mean Average Precision (mAP) metrics.
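
For reference, a compact generic sketch of how rank-k (CMC) and mAP can be computed from a query-gallery distance matrix; this is the standard single-shot evaluation logic (without camera filtering), not code from the paper.

```python
import numpy as np


def cmc_map(dist: np.ndarray, query_ids: np.ndarray, gallery_ids: np.ndarray):
    """CMC rank-k accuracies and mAP from a (num_query, num_gallery) distance matrix."""
    cmc_hits = np.zeros(dist.shape[1])
    average_precisions = []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                        # gallery sorted by distance
        matches = (gallery_ids[order] == query_ids[i]).astype(np.float64)
        if matches.sum() == 0:
            continue                                       # no true match for this query
        cmc_hits[np.argmax(matches):] += 1                 # hit at the first matching rank
        precision = np.cumsum(matches) / (np.arange(len(matches)) + 1)
        average_precisions.append((precision * matches).sum() / matches.sum())
    cmc = cmc_hits / len(average_precisions)
    return cmc, float(np.mean(average_precisions))


# Usage: cmc, m_ap = cmc_map(dist, query_ids, gallery_ids); rank1, rank5 = cmc[0], cmc[4]
```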

Implementation details. We used an ImageNet pre-trained ResNet-50 [10] as the backbone of our STL model. The re-id feature representation is a normalised 128-dimensional vector. Person bounding box images were resized to a fixed size. We adopted the Stochastic Gradient Descent (SGD) optimiser with a fixed learning rate, a mini-batch size of 384, and 20 training epochs. The first training stage begins from the first epoch, and the second training stage from the tenth epoch.
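
A minimal sketch of the feature extractor implied by these details: an ImageNet pre-trained ResNet-50 backbone followed by a 128-dimensional normalised embedding head; the use of torchvision and the single linear projection are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models


class STLFeatureNet(nn.Module):
    """ResNet-50 backbone with a 128-d normalised re-id embedding (illustrative)."""

    def __init__(self, feat_dim: int = 128):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier
        self.embed = nn.Linear(2048, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.encoder(x).flatten(1)              # (B, 2048) pooled backbone features
        return F.normalize(self.embed(f), dim=1)    # (B, 128) normalised re-id embedding
```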

Methods MARS* Duke-SI-TKL* Duke-Raw-TKL* Duke-Raw-TKL
R1 R5 R20 mAP R1 R5 R20 mAP R1 R5 R20 mAP R1 R5 R20 mAP
GRDL [13] 19.3 33.2 46.5 9.6 - - - - - - - - - - - -
UnKISS [12] 22.3 37.4 53.6 10.6 - - - - - - - - - - - -
SMP [26] 23.9 35.8 44.9 10.5 - - - - - - - - - - - -
DGM+IDE [44] 36.8 54.0 68.5 21.3 - - - - - - - - - - - -
RACE [43] 43.2 57.1 67.6 24.5 - - - - - - - - - - - -
TAUDL [16] 43.8 59.9 66.0 29.1 26.1 42.0 57.2 20.8 - - - - - - - -
DAL [3] 46.8 63.9 71.6 21.4 - - - - - - - - - - - -
BUC [22] 51.1 64.2 72.9 26.4 30.6 43.9 51.7 16.7 38.6 50.1 61.9 20.1 31.1 41.3 52.0 15.1
UTAL [17] 49.9 66.4 77.8 35.2 43.8 62.8 76.5 36.6 48.7 62.9 76.6 38.4 41.3 55.7 71.3 31.8
STL (Ours) 54.5 71.5 82.0 37.2 46.7 65.4 78.1 38.9 55.4 71.7 79.0 41.5 56.1 69.6 80.9 41.5
Table 2: Unsupervised person re-id performance on the tracklet benchmarking datasets. SMP, DGM+IDE and RACE use one-shot labels. *: With manual tracklet selection. TKL = Tracklet.

4.2 Comparisons to the State-Of-The-Art Methods

Competitors. We compared the proposed STL method with three different modelling strategies: (a) Hand-crafted feature based methods (GRDL [13], UnKISS [12]), (b) One-shot learning methods (SMP [26], DGM+IDE [44], RACE [43]), (c) Unsupervised deep learning models (TAUDL [16], DAL [3], BUC [22], UTAL [17]).

Results. Table 2 compares the re-id performance. We have the following main observations:
(1) Hand-crafted feature based methods (GRDL and UnKISS) produce the weakest performance. This is due to the poor discriminative ability of manually designed features and the lack of end-to-end model optimisation.
(2) One-shot learning methods (SMP, DGM, RACE) improve the re-id model generalisation capability. However, their assumption of one-shot training data limits their application scalability, due to the need for some amount of person identity labelling per domain.
(3) The more recent unsupervised deep learning methods (TAUDL, DAL, BUC¹, UTAL) further push the boundary of model performance. However, all these existing methods are clearly outperformed by the proposed STL model, particularly on the unconstrained raw tracklet training data. This suggests the overall performance advantage of our model over the strong alternative methods. (¹We utilised the officially released code of BUC [22] with the default parameter settings, and used this single parameter setting for BUC in all tests. In contrast, the authors of BUC seemingly used the labelled test data to tune the model parameters, which is improper for unsupervised learning; as a result, the MARS results were reported differently.)
(4) Existing methods BUC and UTAL both suffer from the noisy data in unconstrained raw tracklets, as indicated by their significant performance drop from DukeMTMC-Raw-Tracklet* (with tracklet selection) to DukeMTMC-Raw-Tracklet. This suggests that the data selection in dataset benchmarking simplifies the model learning task, i.e. less challenging than the realistic setting. The proposed DukeMTMC-Raw-Tracklet dataset is designed particularly for addressing this problem.
(5) Our STL model is shown to be more robust against noisier tracklet data, with little change in re-id performance. This suggests the superiority of our image-to-tracklet selective matching in dealing with noisier unconstrained tracklets.
(6) While both UTAL and STL adopt a multi-task learning model design, STL is also superior on all three datasets with manual selection. This suggests that assuming a fixed number of true matches (due to using k-NN alone) is suboptimal even for carefully constructed training data, and further confirms the modelling superiority of our image-to-tracklet selective matching in handling unconstrained raw tracklet data with more noise.

4.3 Further Analysis and Discussions

To provide more insight and interpretation into the performance advantages of our STL method, we analysed key model designs on two large tracklet re-id datasets: MARS and DukeMTMC-Raw-Tracklet.

Loss design for per-camera view learning. The loss function for per-camera view learning is a key component in STL. We compared our image-to-tracklet selective matching (ITSM) loss (Eq. (2)) with the conventional cross-entropy (CE) loss (Eq. (1)). Table 3 shows that the ITSM loss is significantly more effective, especially on the DukeMTMC-Raw-Tracklet dataset. This suggests the superiority of our loss design in handling noisy tracklets, thanks to its data adaptive and selective learning capability.

Dataset MARS* Duke-Raw-TKL
Loss Rank-1 mAP Rank-1 mAP
CE 43.8 31.4 22.4 15.3
ITSM 48.2 32.2 49.1 34.2
Table 3: Evaluate loss design for per-camera view learning. ITSM: per-camera Image-to-Tracklet Selective Matching.

Image-to-tracklet selective effect. We tested the data selective effect by controlling the temperature parameter τ in Eq. (3). It is one of the key factors enabling our method to select possibly well matching tracklets and deselect potentially noisy tracklets. Table 4 shows several key observations. (1) With τ = 1, which imposes no selection effect in matching, the model generalisation performance degrades significantly. (2) When setting τ to small values (e.g. 0.2), the model performance is dramatically boosted as expected. This is due to the modulating effect of our loss function on the selective matching between images and tracklets. It is also observed that more gain is obtained in the case of unconstrained raw tracklets, due to more noise and distraction. (3) The optimal value is consistently around τ = 0.1, suggesting the generic benefit of a single setting.

Dataset MARS* Duke-Raw-TKL
τ Rank-1 mAP Rank-1 mAP
1 15.7 9.8 10.4 6.3
0.5 25.3 15.1 14.6 11.4
0.2 44.7 29.9 41.7 30.5
0.1 54.5 37.2 56.1 41.5
0.05 47.6 32.5 47.7 35.7
Table 4: Evaluate the temperature parameter τ.

Benefit of cross-camera view learning. We evaluated the efficacy of cross-camera view learning. Table 5 shows that significant re-id accuracy gains can be obtained. This verifies the benefit of the cross-camera image-to-tracklet matching loss (Eq. (8)) on top of per-camera view learning (Eq. (2)).

Dataset MARS* Duke-Raw-TKL
CCM Rank-1 mAP Rank-1 mAP
w/o 48.2 32.2 49.1 34.2
w/ 54.5 37.2 56.1 41.5
Table 5: Effect of cross-camera matching (CCM) learning.

To examine cross-camera matching further, we tracked the number and accuracy of tracklet associations during training. Figure 4 shows that the number of cross-camera tracklet pairs grows dramatically whilst the matching accuracy drops only slightly or moderately along the training. This justifies the positive effect of cross-camera tracklet association: most pairs are correct true matches, providing model training with discriminative information on person appearance variation across camera views. We also observe that there remains further room for more accurate tracklet association.

Figure 4: The (a) number and (b) precision of cross-camera tracklet pairs discovered during training.

Tracklet association strategy. We tested the effect of ε-NN on top of k-NN (Eqs. (5) and (7)) as the tracklet association strategy in STL, using the default k and ε = 0.7 setting of our model. Table 6 shows that using k-NN only is inferior to k-NN + ε-NN. This suggests the data adaptive benefit of ε-NN, particularly in handling unconstrained raw tracklet data, verifying our design of additionally leveraging ε-NN alongside k-NN for tracklet association.

Dataset MARS* Duke-Raw-TKL
Strategy Rank-1 mAP Rank-1 mAP
k-NN 51.0 32.7 51.0 37.4
k-NN + ε-NN 54.5 37.2 56.1 41.5
Table 6: k-NN versus k-NN + ε-NN in tracklet association.

Sensitivity of tracklet association threshold. We tested the model performance sensitivity to the tracklet matching similarity threshold ε (Eqs. (5) and (7)). Table 7 shows that re-id accuracies vary with the change of ε as expected. This is because ε controls which image-to-tracklet matching pairs are used in the objective loss functions during training. Importantly, the model is not very sensitive to ε, with a good value range (around 0.7-0.8) giving strong model performance. This robustness is a critical property of our method, since when applied to diverse tracklet data under unconstrained conditions, label supervision is not available for hyper-parameter cross-validation.

Dataset MARS* Duke-Raw-TKL
ε Rank-1 mAP Rank-1 mAP
0.9 48.7 30.4 48.3 33.3
0.8 54.2 36.2 55.3 41.3
0.7 54.5 37.2 56.1 41.5
0.6 50.7 34.1 47.7 35.2
Table 7: Evaluate the tracklet association threshold ε.

5 Conclusion

We presented a selective tracklet learning (STL) approach, which aims to address the limitations of both existing supervised person re-id methods and existing unsupervised tracklet learning methods concurrently. Specifically, STL is able to learn a discriminative and generalisable re-id model from unlabelled raw tracklet data. This eliminates the need for exhaustive person ID labelling as required by supervised re-id methods, and for manual tracklet filtering as assumed by existing unsupervised tracklet learning models. We also introduced an unconstrained raw tracklet person re-id benchmark, DukeMTMC-Raw-Tracklet. Extensive experiments show the superiority and robustness advantages of STL over the state-of-the-art unsupervised learning re-id methods on three tracklet person re-id benchmarks.

References

  • [1] S. Bak, P. Carr, and J. Lalonde (2018) Domain adaptation through synthesis for unsupervised person re-identification. In Proc. Eur. Conf. Comput. Vis., pp. 189–205. Cited by: §2.
  • [2] X. Chang, T. M. Hospedales, and T. Xiang (2018) Multi-level factorisation net for person re-identification. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2109–2118. Cited by: §1, §2.
  • [3] Y. Chen, X. Zhu, and S. Gong (2018) Deep association learning for unsupervised video person re-identification. Proc. Bri. Mach. Vis. Conf. Cited by: §1, §1, §2, §4.2, Table 2.
  • [4] Y. Chen, X. Zhu, W. Zheng, and J. Lai (2018) Person re-identification by camera correlation aware feature augmentation. IEEE Trans. Pattern Anal. Mach. Intell. 40 (2), pp. 392–408. Cited by: §2.
  • [5] W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao (2018) Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 994–1003. Cited by: §1, §2.
  • [6] H. Fan, L. Zheng, and Y. Yang (2017) Unsupervised person re-identification: clustering and fine-tuning. arXiv:1705.10444. Cited by: §2.
  • [7] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani (2010) Person re-identification by symmetry-driven accumulation of local features. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2360–2367. Cited by: §2.
  • [8] R. Girdhar, G. Gkioxari, L. Torresani, M. Paluri, and D. Tran (2018) Detect-and-track: efficient pose estimation in videos. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 350–359. Cited by: §3, §4.1.
  • [9] S. Gong, M. Cristani, S. Yan, and C. C. Loy (2014) Person re-identification. Springer. Cited by: §1.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 770–778. Cited by: §4.1.
  • [11] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §3.1.
  • [12] F. M. Khan and F. Bremond (2016) Unsupervised data association for metric learning in the context of multi-shot person re-identification. In Proc. IEEE Conf. Adv. Vid. Sig. Surv., pp. 256–262. Cited by: §2, §4.2, Table 2.
  • [13] E. Kodirov, T. Xiang, Z. Fu, and S. Gong (2016) Person re-identification by unsupervised graph learning. In Proc. Eur. Conf. Comput. Vis., pp. 178–195. Cited by: §2, §4.2, Table 2.
  • [14] E. Kodirov, T. Xiang, and S. Gong (2015) Dictionary learning with iterative laplacian regularisation for unsupervised person re-identification. In Proc. Bri. Mach. Vis. Conf., Cited by: §2.
  • [15] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler (2015) Motchallenge 2015: towards a benchmark for multi-target tracking. arXiv:1504.01942. Cited by: §3.
  • [16] M. Li, X. Zhu, and S. Gong (2018) Unsupervised person re-identification by deep learning tracklet association. In Proc. Eur. Conf. Comput. Vis., pp. 737–753. Cited by: §1, §1, §2, §3.1, §3.2, §3, §4.2, Table 2.
  • [17] M. Li, X. Zhu, and S. Gong (2019) Unsupervised tracklet person re-identification. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: §1, §1, §1, §1, §1, §2, §3.1, §3.1, §3.1, §3.2, §3.3, Table 1, §3, §4.1, §4.1, §4.1, §4.2, Table 2.
  • [18] W. Li, R. Zhao, T. Xiao, and X. Wang (2014) Deepreid: deep filter pairing neural network for person re-identification. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 152–159. Cited by: §2.
  • [19] W. Li, X. Zhu, and S. Gong (2017) Person re-identification by deep joint learning of multi-loss classification. In Proc. Int. Jo. Conf. of Artif. Intell., Cited by: §1, §2.
  • [20] W. Li, X. Zhu, and S. Gong (2018) Harmonious attention network for person re-identification. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2285–2294. Cited by: §1, §2.
  • [21] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proc. IEEE Int. Conf. Comput. Vis., pp. 2980–2988. Cited by: §3.1.
  • [22] Y. Lin, X. Dong, L. Zheng, Y. Yan, and Y. Yang (2019) A bottom-up clustering approach to unsupervised person re-identification. In AAAI Conf. on Art. Intel., Cited by: §1, §1, §2, §4.2, Table 2, footnote 1.
  • [23] G. Lisanti, I. Masi, A. D. Bagdanov, and A. Del Bimbo (2015) Person re-identification by iterative re-weighted sparse ranking. IEEE Trans. Pattern Anal. Mach. Intell. 37 (8), pp. 1629–1642. Cited by: §2.
  • [24] C. Liu, C. Change Loy, S. Gong, and G. Wang (2013) Pop: person re-identification post-rank optimisation. In Proc. IEEE Int. Conf. Comput. Vis., pp. 441–448. Cited by: §2.
  • [25] X. Liu, M. Song, D. Tao, X. Zhou, C. Chen, and J. Bu (2014) Semi-supervised coupled dictionary learning for person re-identification. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3550–3557. Cited by: §2.
  • [26] Z. Liu, D. Wang, and H. Lu (2017) Stepwise metric promotion for unsupervised video person re-identification. In Proc. IEEE Int. Conf. Comput. Vis., pp. 2429–2438. Cited by: §2, §4.2, Table 2.
  • [27] J. M. Lucas and M. S. Saccucci (1990) Exponentially weighted moving average control schemes: properties and enhancements. Technometrics 32 (1), pp. 1–12. Cited by: §3.3.
  • [28] X. Ma, X. Zhu, S. Gong, X. Xie, J. Hu, K. Lam, and Y. Zhong (2017) Person re-identification by unsupervised video matching. Pattern Recognition 65, pp. 197–210. Cited by: §2.
  • [29] P. Peng, Y. Tian, T. Xiang, Y. Wang, M. Pontil, and T. Huang (2018) Joint semantic and latent attribute modelling for cross-class transfer learning. IEEE Trans. Pattern Anal. Mach. Intell. 40 (7), pp. 1625–1638. Cited by: §2.
  • [30] P. Peng, T. Xiang, Y. Wang, M. Pontil, S. Gong, T. Huang, and Y. Tian (2016) Unsupervised cross-dataset transfer learning for person re-identification. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1306–1315. Cited by: §1, §2.
  • [31] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi (2016) Performance measures and a data set for multi-target, multi-camera tracking. In Workshop of Eur. Conf. Comput. Vis., pp. 17–35. Cited by: §1, §1, §1, §2, Table 1, §4.1.
  • [32] Y. Shen, H. Li, T. Xiao, S. Yi, D. Chen, and X. Wang (2018) Deep group-shuffling random walk for person re-identification. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2265–2274. Cited by: §1, §2.
  • [33] Y. Shen, H. Li, S. Yi, D. Chen, and X. Wang (2018-09) Person re-identification with deep similarity-guided graph neural network. In Proc. Eur. Conf. Comput. Vis., Cited by: §1, §2.
  • [34] C. Song, Y. Huang, W. Ouyang, and L. Wang (2018) Mask-guided contrastive attention model for person re-identification. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1179–1188. Cited by: §1, §2.
  • [35] Y. Suh, J. Wang, S. Tang, T. Mei, and K. Mu Lee (2018-09) Part-aligned bilinear representations for person re-identification. In Proc. Eur. Conf. Comput. Vis., Cited by: §1, §2.
  • [36] H. Wang, S. Gong, and T. Xiang (2014) Unsupervised learning of generative topic saliency for person re-identification. In Proc. Bri. Mach. Vis. Conf., Cited by: §2.
  • [37] H. Wang, S. Gong, X. Zhu, and T. Xiang (2016) Human-in-the-loop person re-identification. In Proc. Eur. Conf. Comput. Vis., pp. 405–422. Cited by: §2.
  • [38] H. Wang, X. Zhu, T. Xiang, and S. Gong (2016) Towards unsupervised open-set person re-identification. In IEEE Int. Conf. on Img. Proc., pp. 769–773. Cited by: §2.
  • [39] J. Wang, X. Zhu, S. Gong, and W. Li (2018) Transferable joint attribute-identity deep learning for unsupervised person re-identification. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2275–2284. Cited by: §1, §2.
  • [40] L. Wei, S. Zhang, W. Gao, and Q. Tian (2018) Person transfer gan to bridge domain gap for person re-identification. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 79–88. Cited by: §1, §2.
  • [41] Z. Wu, Y. Xiong, X. Y. Stella, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3733–3742. Cited by: §3.1, §3.1.
  • [42] T. Xiao, H. Li, W. Ouyang, and X. Wang (2016) Learning deep feature representations with domain guided dropout for person re-identification. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1249–1258. Cited by: §1.
  • [43] M. Ye, X. Lan, and P. C. Yuen (2018) Robust anchor embedding for unsupervised video person re-identification in the wild. In Proc. Eur. Conf. Comput. Vis., pp. 170–186. Cited by: §4.2, Table 2.
  • [44] M. Ye, A. J. Ma, L. Zheng, J. Li, and P. C. Yuen (2017) Dynamic label graph matching for unsupervised video re-identification. In Proc. IEEE Int. Conf. Comput. Vis., pp. 5142–5150. Cited by: §2, §4.2, Table 2.
  • [45] H. Yu, A. Wu, and W. Zheng (2017) Cross-view asymmetric metric learning for unsupervised person re-identification. In Proc. IEEE Int. Conf. Comput. Vis., pp. 994–1002. Cited by: §2.
  • [46] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu (2018) Deep mutual learning. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Cited by: §1.
  • [47] R. Zhao, W. Ouyang, and X. Wang (2013) Unsupervised salience learning for person re-identification. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3586–3593. Cited by: §2.
  • [48] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian (2016) Mars: a video benchmark for large-scale person re-identification. In Proc. Eur. Conf. Comput. Vis., pp. 868–884. Cited by: §1, §2, Table 1, §4.1, §4.1.
  • [49] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: a benchmark. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1116–1124. Cited by: §1, §1.
  • [50] Z. Zheng, L. Zheng, and Y. Yang (2017) Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In Proc. IEEE Int. Conf. Comput. Vis., pp. 3754–3762. Cited by: §1, §2.
  • [51] Z. Zhong, L. Zheng, S. Li, and Y. Yang (2018) Generalizing a person retrieval model hetero-and homogeneously. In Proc. Eur. Conf. Comput. Vis., pp. 172–188. Cited by: §1, §2.