Towards better Validity: Dispersion based Clustering for Unsupervised Person Re-identification

06/04/2019 ∙ by Guodong Ding, et al. ∙ Nanjing University ∙ Australian National University

Person re-identification aims to establish the correct identity correspondences of a person moving through a non-overlapping multi-camera installation. Recent advances based on deep learning models for this task mainly focus on supervised learning scenarios where accurate annotations are assumed to be available for each setup. Annotating large scale datasets for person re-identification is demanding and burdensome, which renders the deployment of such supervised approaches to real-world applications infeasible. Therefore, it is necessary to train models without explicit supervision in an autonomous manner. In this paper, we propose an elegant and practical clustering approach for unsupervised person re-identification based on the cluster validity consideration. Concretely, we explore a fundamental concept in statistics, namely dispersion, to achieve a robust clustering criterion. Dispersion reflects the compactness of a cluster when employed at the intra-cluster level and reveals the separation when measured at the inter-cluster level. With this insight, we design a novel Dispersion-based Clustering (DBC) approach which can discover the underlying patterns in data. This approach considers a wider context of sample-level pairwise relationships to achieve a robust cluster affinity assessment, which handles the complications that may arise due to prevalent imbalanced data distributions. Additionally, our solution can automatically prioritize standalone data points and prevent inferior clustering. Our extensive experimental analysis on image and video re-identification benchmarks demonstrates that our method outperforms the state-of-the-art unsupervised methods by a significant margin. Code is available at https://github.com/gddingcs/Dispersion-based-Clustering.git.


I Introduction

The goal of person re-identification (re-ID) is to find the correspondences of the same person across a system of non-overlapping cameras. It has critical applications such as people tracking and mobility analysis in multi-camera streams. To this end, a large body of research in recent years has investigated supervised learning schemes, mainly leveraging neural network models, which led to remarkable performance gains [1, 2, 3, 4, 5, 6, 7]. However, supervised learning requires large annotated datasets with manually labeled identities, which is a laborious undertaking for complex scenes. This motivated unsupervised approaches that are often based on handcrafted features [8, 9, 10], saliency indicators [11, 12] and sparsity constraints [13]. These attempts, however, attain inferior performance compared to fully supervised models.

As an alternative, a variety of methods treat person re-ID as an unsupervised domain adaptation task [14, 15, 16]. The main idea is first to learn an identity embedding using an auxiliary dataset where the ground truth labels are available, and later transfer these learned feature representations onto an unlabeled target dataset. However, this strategy relies on the premise that these two domains share the same identity label space. For person re-ID, this assumption does not always hold since the identities from two different datasets are usually disjoint.

Clustering based techniques have also been studied within the context of person re-ID. For instance, Fan et al. [14] proposed pre-training a convolutional neural network (CNN) model on an external re-ID dataset and then applying $k$-means clustering on the features extracted from a target dataset to progressively select highly reliable data points for updating the model. One shortcoming of this approach is that the number of clusters, which corresponds to the number of people, is preset, while its right choice is unknown at runtime.

Person re-ID methods based on clustering schemes often have a hierarchical nature, and some can also avoid the use of external datasets. As an example, the inspiring work of Lin et al. [17] presented a bottom-up clustering (BUC) approach that alternately trains a CNN model and performs merging without any dependence on auxiliary data samples. It uses the minimum distance between images in two clusters as the similarity measure for the merging operation. However, such a naive heuristic is suboptimal since it only considers one pair of images from two clusters, discarding other potentially useful cues. This naive scheme may merge distinct identity groups by forming elongated clusters, which results in poor performance. Besides, it uses a diversity regularization term, which is defined as the number of samples in a cluster, to impose isometric clusters. In reality, the data distribution over classes (identities) is usually imbalanced with a long tail; thus such a regularization term can be error-prone and invalid. To illustrate this issue, we show the number of samples per class on two re-ID datasets, Market-1501 and DukeMTMC-reID, in Fig. 1. As is visible, the isometric cluster presumption does not hold for person re-ID scenarios.

Fig. 1: The number of classes vs. the number of images per class for Market-1501 and DukeMTMC-reID datasets. The distributions are imbalanced, which is common for person re-ID scenarios.

We tackle this challenging problem by incorporating a cluster validity criterion. Cluster validity is a measure to assess the quality of clustering results. We consider a clustering to be valid if it satisfies two dispersion based properties. In statistics, dispersion is the extent to which a distribution is stretched or compressed; thus, it characterizes the clustering quality. Low intra-cluster dispersion together with high inter-cluster dispersion indicates valid clusters, whereas the opposite indicates questionable ones.

Motivated by this, we employ this elegant and straightforward criterion as our merging rule. It consists of two balancing objectives: an inter-cluster dispersion term and an intra-cluster dispersion term. The inter-cluster term quantifies the average linkage between the clusters while measuring their separation in a designated feature space; hence, cluster pairs with low inter-cluster dispersion are placed higher in the merging list. We empirically show that the average linkage models a broader context, which makes it superior to the single linkage strategy used in [17]. The intra-cluster term evaluates the compactness of candidate clusters and serves as a regularizer complementing the former term. Different from [17], which considers a cluster size constraint, we note that the number of samples should not be the primary concern as long as they are close enough to be considered as the same identity. As such, the intra-dispersion also helps to bypass the potential class imbalance issue.

In addition to introducing the above dispersion based criterion, we employ an agglomerative and alternating learning procedure tailored for the person re-ID task. The images of people moving in a non-overlapping multi-camera system typically constitute compact clusters according to their identities and acquisition conditions for each camera. However, viewpoint variations also cause significant intra-class variations. To address this challenge, we exploit a feature learning framework that trains CNN models and performs clustering in an alternating fashion. For CNN model training, we utilize the repelled loss that can incorporate both inter-class and intra-class variances to obtain invariant feature representations. In the same spirit, our clustering reduces intra-cluster dispersion while maximizing inter-cluster dispersion. The resulting clusters attain more refined structures, which further enhances the representation capability of the CNN model. Therefore, CNN model training and clustering benefit each other in a reciprocal exchange.

The cluster dispersion term brings two significant advantages to our approach. It allows automatically prioritizing singletons and preventing inadequate clustering. A singleton refers to a cluster with only one sample. Ideally, in a multi-camera system designed to re-identify people, a singleton cluster should not exist. Our approach selectively assigns a higher priority to a singleton cluster when the candidate clusters have equal inter-dispersion, since a singleton has no intra-dispersion. Our criterion also minimizes potentially incorrect decisions of the agglomerative merging steps by deferring high intra-dispersion clusters from merging.

Our contributions are three-fold:

  • We propose to use cluster validity as the guidance and derive a dispersion based criterion which promotes compact and well-separated clusters.

  • The proposed criterion automatically handles the singleton cluster problem and can prevent poor clusters. Furthermore, our approach learns faster and is more stable than its counterpart BUC [17].

  • The experimental results demonstrate that our approach outperforms the state-of-the-art unsupervised methods on both image-based and video-based re-ID datasets by a significant margin.

The remainder of this paper is organized as follows. In Section II, we review related work. In Section III, we provide details of our proposed unsupervised approach with cluster dispersion, followed by a discussion on why DBC works better. Section IV demonstrates the effectiveness of the proposed approach on both image-based and video-based person re-ID datasets along with an extensive ablation study. We conclude our work in Section V.

II Related Work

Top-performing deep architectures are trained on massive amounts of labeled data. Most existing re-ID models are trained with human annotated ID labels in a supervised mode. Therefore, their deployment in real-world applications is usually hindered by the lack of large-scale annotated training sets. Learning models without explicit supervision has therefore been extensively studied in the literature.

II-A Traditional Unsupervised Solutions

Some unsupervised methods with hand-crafted features have been proposed in recent years [8, 18, 19, 13, 10, 20, 21, 12, 22, 23, 24, 25]. However, they achieve inadequate re-ID performance when compared to the supervised learning methods. Specifically, Farenzena et al. [8] exploited the property of symmetry in person images to deal with view variances. To handle illumination changes and cluttered backgrounds, Ma et al. [26] proposed to combine Gabor filters and the covariance descriptor. The Fisher Vector is explored in [27] to encode higher order statistics of local features. Kodirov et al. [13] proposed to combine a Laplacian regularization term with the conventional dictionary learning formulation to encode cross-view correspondences.

Fig. 2: The overall framework for the unsupervised learning. The left part shows the iterative process of our approach. The framework first extracts image features with a CNN model, then clustering is performed according to the feature similarities, and lastly the image labels are updated with the new cluster set to retrain the feature extractor. After the training is finished, the CNN extracts features on both query and gallery samples and then performs a distance based retrieval. The right part exhibits two properties of our dispersion criterion. (a) Singleton cluster priority. In this case, the green sample is merged with the black sample as it has less (zero) dispersion compared with that of the orange cluster. (b) Poor clustering prevention. We consider a cluster poorly merged when it has high dispersion. The blue cluster is prevented from merging with the brown one because the resulting cluster would have a high dispersion. Instead, the brown cluster is merged with the green one, which together have less intra-dispersion. (Best viewed in color)

II-B Unsupervised Domain Adaptation

In the absence of labeled data for a certain task, domain adaptation often provides an attractive option given that labeled data of similar nature but from a different domain is available. A main practice for adaptation is to align the feature distribution between the source and target domain [28, 29, 30, 31, 32]. Likewise, some unsupervised cross-domain transfer learning methods [15, 33] have been studied in the field of person re-ID to deal with the misalignment between identities among different datasets. To better bridge this gap, Wang et al. [34] first train with attributes on the source domain and then learn a joint feature representation of both identity and attribute. A hetero-homogeneous learning approach is introduced in [35] to align domain distributions. Another stream of work uses generative adversarial networks (GAN) to generate augmented images to reduce the dataset differences [33, 16]. Deng et al. [16] explored image self-similarity and cross-domain dissimilarity for a target domain image translation. Differently, Zhong et al. [33] exploited camera-to-camera alignment to perform image translation. These domain adaptation methods are all focused on the label estimation of the target domain. One can see that the success of this category of approaches is based on an auxiliary labeled dataset. Compared to them, the unsupervised learning approach in this paper does not use any external data or annotations.

II-C Clustering Analysis

Clustering analysis is a long-standing approach to unsupervised machine learning. With the surge of deep learning techniques, recent studies have attempted to optimize clustering analysis and representation learning jointly to maximize their complementary benefits [36, 37, 38, 39]. Fan et al. [14] combine domain transfer and clustering for the unsupervised re-ID task. They first train the model on an external labeled dataset, which serves as a good model initialization. After that, unlabeled data samples are progressively selected for training according to their credibility, defined as their distance to cluster centroids. However, this work relies on a strong assumption about the total number of identities. Aside from these methods that require auxiliary datasets or assumptions, Lin et al. [17] proposed to apply a bottom-up framework for clustering, which hierarchically combines clusters according to a predefined criterion. The merging in [17] is based on a very simple minimum distance criterion with a cluster size regularization term. Different from their work, our dispersion criterion exploits feature affinities within and between clusters, and it also interacts with the CNN model training process so that the two reinforce each other.

III Cluster Dispersion Criterion

In this section, we start with some preliminaries followed by our proposed criterion described in detail, and end with discussions and comparisons with close work.

III-A Preliminaries

Given an unlabeled training set $X = \{x_1, x_2, \dots, x_N\}$ containing $N$ cropped person images, we aim to learn a feature embedding function $\phi(\theta; x_i)$ from $X$ without any available annotations. The parameters $\theta$ are optimized iteratively using an objective function. This feature extractor can be later applied to the gallery set and the query set to obtain their feature representations for a distance based retrieval. The distance between each pair of images is defined as $\mathrm{dist}(x_a, x_b) = \|\phi(\theta; x_a) - \phi(\theta; x_b)\|$. The smaller this distance, and hence the higher the pair ranks in the retrieval list, the more likely it is that the pair belongs to the same identity.

Supervised learning provides person identity labels $Y = \{y_1, y_2, \dots, y_N\}$, one for each input image $x_i$, and the feature embedding function $\phi(\theta; x_i)$ is appended by a classifier $f(w; \phi(\theta; x_i))$ parameterized by $w$. Thus, $\phi$ can be learnt by optimizing the following objective function:

$$\min_{\theta, w} \sum_{i=1}^{N} \ell\big(f(w; \phi(\theta; x_i)),\, y_i\big) \qquad (1)$$

where $\ell$ is the cross-entropy (CE) loss for classification. One shortcoming of the CE loss is that it does not explicitly minimize the intra-class distances. To this end, the center loss was proposed, which seeks to achieve within-class compactness.

Similar to center loss [40, 41], repelled loss [42, 17] can act as a classifier $f$ which has the ability to jointly consider inter-class and intra-class variances by computing probability based on the feature similarity as follows:

$$p(c \mid x, V) = \frac{\exp(V_c^{\top} v / \tau)}{\sum_{j=1}^{C} \exp(V_j^{\top} v / \tau)} \qquad (2)$$

where $\tau$ is a temperature parameter that controls the softness of the probability distribution over the $C$ classes, $v$ is the normalized image feature obtained from $\phi(\theta; x)$, while $V$ is a lookup table (LUT) containing the centroid feature of each class. This LUT is updated on the fly by exponential moving average [43] over training iterations, thus avoiding exhaustive feature extraction, which is more computationally intensive.
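The similarity-based probability of Eq. (2) and the moving-average LUT update can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the temperature `tau=0.1` and the `momentum=0.5` value are assumptions chosen only for the example.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def class_probabilities(feature, lut, tau=0.1):
    """Softmax over the similarities between a normalized feature and each LUT row."""
    v = l2_normalize(feature)
    logits = [sum(a * b for a, b in zip(row, v)) / tau for row in lut]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def update_lut_row(row, feature, momentum=0.5):
    """Moving-average update of one centroid, re-normalized to unit length."""
    v = l2_normalize(feature)
    mixed = [momentum * r + (1.0 - momentum) * x for r, x in zip(row, v)]
    return l2_normalize(mixed)

# Two hypothetical class centroids and one feature aligned with the first class.
lut = [[1.0, 0.0], [0.0, 1.0]]
probs = class_probabilities([3.0, 0.0], lut)
new_row = update_lut_row(lut[0], [0.0, 2.0])
```

A smaller `tau` sharpens the distribution, so the feature above is assigned almost entirely to the first class.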

III-B Validity Guided Dispersion Criterion

The main challenge towards using the above framework for an unsupervised setting lies in automatic label assignment for unlabeled data. Here, clustering comes as a natural choice as it aims to group similar entities together. In this paper, we propose a novel dispersion based agglomerative clustering approach based on a cluster validity consideration. The choice of affinity/dissimilarity measure between two clusters is the key to our proposed algorithm. In the task of person re-ID, which focuses on identifying images of the same identity, the inter and intra-cluster similarity should be considered for a reasonable merging. This requisite is fulfilled by a novel merging criterion used in our agglomerative clustering approach.

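Given a cluster $\mathcal{C}$ scattered in feature space, we define its dispersion as the average pairwise distance within:

$$d(\mathcal{C}) = \frac{1}{n^2} \sum_{x_i \in \mathcal{C}} \sum_{x_j \in \mathcal{C}} \mathrm{dist}(x_i, x_j) \qquad (3)$$

where $n$ is the cardinality of cluster $\mathcal{C}$ and $\mathrm{dist}(\cdot, \cdot)$ is the distance between two samples in feature space. As such, the dispersion between clusters can be written as follows:

$$d_{ab} = \frac{1}{n_a n_b} \sum_{x_i \in \mathcal{C}_a} \sum_{x_j \in \mathcal{C}_b} \mathrm{dist}(x_i, x_j) \qquad (4)$$

To jointly consider both intra- and inter-cluster dispersion, we have the dissimilarity between clusters $\mathcal{C}_a$ and $\mathcal{C}_b$ formulated as:

$$D_{ab} = d_{ab} + \lambda (d_a + d_b) \qquad (5)$$

where $d_a$ and $d_b$ are used in place of $d(\mathcal{C}_a)$ and $d(\mathcal{C}_b)$ for notation simplicity, and $\lambda$ is the trade-off parameter between the two components.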

The former component in Eq. (5), the dispersion $d_{ab}$ between clusters, is a measure for cluster dissimilarity. A cluster pair with low inter-cluster dispersion should be considered for merging, as features from the same identity should be close in the feature space. The latter component $d_a + d_b$, which is the sum of the dispersions of both candidate clusters, serves as a regularizer to the former component. On one hand, it can help prioritize standalone data points for merging at the starting stages. On the other hand, this term can prevent escalating "poor" clustering, as a high within-cluster dispersion can outweigh the inter-cluster term. In fact, $\lambda$ controls the trade-off between the tendency to select spatially closer clusters (small $\lambda$) or more compact clusters (large $\lambda$).
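The criterion can be illustrated with a small plain-Python sketch. It assumes that the intra-cluster dispersion is the mean distance over all ordered sample pairs of a cluster (so a singleton has zero dispersion) and that samples are compared with the Euclidean distance; the helper names and the toy points are hypothetical.

```python
import math
from itertools import product

def euclidean(p, q):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def intra_dispersion(cluster):
    """Mean distance over all ordered pairs in a cluster; zero for a singleton."""
    n = len(cluster)
    return sum(euclidean(p, q) for p, q in product(cluster, repeat=2)) / (n * n)

def inter_dispersion(cluster_a, cluster_b):
    """Mean distance over all cross-cluster pairs (average linkage)."""
    total = sum(euclidean(p, q) for p, q in product(cluster_a, cluster_b))
    return total / (len(cluster_a) * len(cluster_b))

def dissimilarity(cluster_a, cluster_b, lam=0.5):
    """Inter-cluster dispersion plus lam times the summed intra-cluster dispersions."""
    reg = intra_dispersion(cluster_a) + intra_dispersion(cluster_b)
    return inter_dispersion(cluster_a, cluster_b) + lam * reg

# Both candidates are equally far (on average) from the target cluster.
target = [(0.0, 0.0)]
single = [(2.0, 0.0)]
pair = [(0.0, 1.5), (0.0, 2.5)]
d_single = dissimilarity(target, single)
d_pair = dissimilarity(target, pair)
```

With equal inter-cluster dispersion, the singleton candidate obtains the lower dissimilarity because its intra-cluster dispersion is zero, which is the priority effect described above.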

| Dataset | Category | Training # ID | Training # Samples | Query # ID | Query # Samples | Gallery # ID | Gallery # Samples |
|---|---|---|---|---|---|---|---|
| Market-1501 [44] | Image-based | 751 | 12,936 | 750 | 3,368 | 750 | 19,732 |
| DukeMTMC-reID [45] | Image-based | 702 | 16,522 | 702 | 2,228 | 1,110 | 17,661 |
| MARS [46] | Video-based | 625 | 8,298 | 626 | 1,980 | 636 | 12,180 |
| DukeMTMC-VideoReID [47] | Video-based | 702 | 2,196 | 702 | 702 | 801 | 2,636 |

TABLE I: Statistics of the four datasets used in our experiment settings. "# Samples" stands for the number of images for image-based datasets and the number of tracklets for video-based datasets. Query and gallery together form the testing split.

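  Input: Training data $X = \{x_i\}_{i=1}^{N}$, merging percentage $mp \in (0, 1)$, trade-off parameter $\lambda$, CNN model $\phi(\cdot; \theta_0)$
  Output: Optimized model $\phi(\cdot; \hat{\theta})$
  Initialize label $Y = \{y_i = i\}_{i=1}^{N}$, cluster number $C = N$, merge batch $k = N \cdot mp$
  while $C > k$ do
     Train model with $X$ and $Y$ with Eq. (1)
     Calculate cluster dissimilarity matrix $D$
     for 1:k do
        Select candidate clusters according to Eq. (5) and merge them
        Update matrix $D$ with Eq. (6) and Eq. (7)
        $C \leftarrow C - 1$
     end for
     Update $Y$ with new cluster using Eq. (8)
     Evaluate performance $P$ on validation set.
     if $P > P^{*}$ then
        $P^{*} \leftarrow P$
        Best model = $\phi(\cdot; \theta)$
     end if
  end while
Algorithm 1 Dispersion based Clustering Algorithm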

III-C Matrix Update

The proposed dispersion criterion can be calculated with a matrix updating algorithm. The input to this clustering process is the dissimilarity matrix $D$, also referred to as the proximity matrix. It is a $C \times C$ matrix, where $C$ is the current number of clusters, whose element $(a, b)$ equals the inter-cluster dispersion between $\mathcal{C}_a$ and $\mathcal{C}_b$. $D$ can be efficiently computed by first calculating an image pairwise distance matrix, which is the outer product of stacked feature vectors obtained from deep networks, and then aggregating the entries by cluster IDs.

At each clustering step, when two candidate clusters selected by Eq. (5) are merged, the size of the dissimilarity matrix becomes $(C-1) \times (C-1)$. In one operation, the two rows and columns corresponding to the merged clusters $\mathcal{C}_a$ and $\mathcal{C}_b$ are deleted, and a new row and a new column are added that contain the updated dissimilarity between the newly formed cluster $\mathcal{C}_{ab}$ and an old cluster $\mathcal{C}_k$. The dissimilarity between $\mathcal{C}_{ab}$ and $\mathcal{C}_k$ can be found using our dispersion definition, as follows:

$$d_{(ab)k} = \frac{n_a d_{ak} + n_b d_{bk}}{n_a + n_b} \qquad (6)$$

where $n_a$ and $n_b$ are the cardinalities of $\mathcal{C}_a$ and $\mathcal{C}_b$. Correspondingly, the intra-cluster dispersion of the newly formed cluster is written as:

$$d_{(ab)} = \frac{n_a^2 d_a + n_b^2 d_b + 2 n_a n_b d_{ab}}{(n_a + n_b)^2} \qquad (7)$$
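The two update rules can be checked numerically against a brute-force recomputation. The sketch below assumes that the intra-cluster dispersion is the mean distance over all ordered sample pairs of a cluster and that the inter-cluster dispersion is the mean over all cross-cluster pairs; under these assumptions the merged quantities follow from size-weighted sums. The function names and the toy clusters are illustrative only.

```python
import math
from itertools import product

def mean_dist(xs, ys):
    """Mean Euclidean distance over all pairs (x, y) with x in xs and y in ys."""
    return sum(math.dist(x, y) for x, y in product(xs, ys)) / (len(xs) * len(ys))

def merged_inter(n_a, n_b, d_ak, d_bk):
    """Size-weighted average of the two old inter-cluster dispersions."""
    return (n_a * d_ak + n_b * d_bk) / (n_a + n_b)

def merged_intra(n_a, n_b, d_a, d_b, d_ab):
    """Intra-cluster dispersion of the union, from within- and cross-cluster sums."""
    total = n_a ** 2 * d_a + n_b ** 2 * d_b + 2 * n_a * n_b * d_ab
    return total / (n_a + n_b) ** 2

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(0.0, 2.0)]
k = [(5.0, 5.0), (6.0, 4.0), (7.0, 7.0)]

# Update from cached statistics only, without revisiting individual samples.
fast_inter = merged_inter(len(a), len(b), mean_dist(a, k), mean_dist(b, k))
fast_intra = merged_intra(len(a), len(b), mean_dist(a, a), mean_dist(b, b), mean_dist(a, b))

# Brute-force reference computed on the merged cluster itself.
slow_inter = mean_dist(a + b, k)
slow_intra = mean_dist(a + b, a + b)
```

Both routes agree up to floating-point error, which is why only cluster sizes and cached dispersions need to be stored between merging steps.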

III-D Learning Paradigm

The overall learning paradigm with the proposed criterion is presented in the left part of Fig. 2, where two stages, i.e., embedding learning and clustering, take place on an alternating basis. This paradigm starts with the embedding learning stage, where each data point is assigned a unique label and the CNN model is trained for a few epochs to learn the mapping. This choice of considering every single sample as an independent class is often referred to as ‘sample specificity learning’. The key idea is that even with this naive supervision, neural networks can still automatically reveal the visual similarity correlation between different classes and yield a decent CNN initialization.

In between the embedding learning stages is the clustering stage. We consider a clustering stage to have $k$ steps, where the top-$k$ cluster pairs with the least dissimilarity defined by Eq. (5) are to be merged. Here $k$ is defined as the product $N \cdot mp$ of the number of training samples $N$ and a merging percentage parameter $mp$ (set to 0.05, i.e., 5% of the total number of samples, in our experiments). After each step, the proximity matrix is updated with Eq. (6) and Eq. (7) introduced above. Before entering the next stage of CNN model training, samples in the entire training set are designated new labels according to the resulting clusters they belong to, as follows:

$$y_i = c, \quad \forall\, x_i \in \mathcal{C}_c \qquad (8)$$

The whole training procedure of our unsupervised person re-ID learning is summarized in Algorithm 1.
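A single clustering stage can be sketched as follows. This is a deliberately simple, brute-force illustration rather than the matrix-update implementation: it recomputes the dissimilarity of every cluster pair at each step, assumes Euclidean distances and the mean over all ordered pairs as intra-cluster dispersion, and omits CNN training entirely. The name `clustering_stage` and its arguments are hypothetical.

```python
import math
from itertools import combinations, product

def mean_dist(xs, ys):
    """Mean Euclidean distance over all pairs drawn from xs and ys."""
    return sum(math.dist(x, y) for x, y in product(xs, ys)) / (len(xs) * len(ys))

def dissimilarity(a, b, lam):
    """Inter-cluster dispersion regularized by the two intra-cluster dispersions."""
    return mean_dist(a, b) + lam * (mean_dist(a, a) + mean_dist(b, b))

def clustering_stage(features, labels, mp=0.05, lam=0.5):
    """Merge k = N * mp cluster pairs, then return one new label per sample."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    groups = list(clusters.values())
    k = max(1, int(len(features) * mp))
    for _ in range(k):
        if len(groups) < 2:
            break
        # Pick the pair of clusters with the smallest dissimilarity.
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda ij: dissimilarity(
                [features[s] for s in groups[ij[0]]],
                [features[s] for s in groups[ij[1]]],
                lam,
            ),
        )
        groups[i] = groups[i] + groups[j]
        del groups[j]  # safe because i < j
    # Relabel every sample with the index of the cluster it now belongs to.
    new_labels = [0] * len(features)
    for cluster_id, members in enumerate(groups):
        for s in members:
            new_labels[s] = cluster_id
    return new_labels

feats = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0), (10.1, 0.0)]
new_labels = clustering_stage(feats, labels=list(range(6)), mp=0.5)
```

In the toy run, six singletons and `mp=0.5` give three merges, so the three nearby pairs end up sharing labels. In the full procedure this stage would alternate with retraining the CNN on the new labels.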

III-E Complexity Analysis

We analyze the per-stage complexity cost induced by the new criterion. For a clustering stage which performs $k$ merging operations, the image pairwise distance computation has a complexity of $O(N^2)$, which is followed by the cluster pairwise dissimilarity calculation. Afterwards, a further cost is incurred for ranking and candidate selection and, lastly, for the $k$ merging steps and proximity matrix updates. Notice that since the cluster number decreases as the learning paradigm progresses, the cost of the cluster-level operations also decreases. Accordingly, the overall complexity is $O(N^2)$, which means the main complexity comes from the inevitable sample-wise similarity calculation. All operations mentioned above can be computed by matrix manipulation on GPUs, and one can use multiple GPUs for acceleration since our algorithm is suitable for parallelism.

III-F Discussion

The regularization term. The inclusion of the second component in Eq. (5) brings two advantages to the clustering process, stated as follows:

1) Singleton cluster priority. Recall that each sample is assigned a unique label in the first round of CNN training in Sec. III-D, yielding all singleton clusters, i.e., clusters with only a single sample. However, the intrinsic matching and association property of the re-ID task requires that there exist at least two samples for a given identity. Therefore, by the problem definition, singleton clusters should not remain after clustering and must be explicitly taken care of. Importantly, they should be dealt with in the first few clustering stages, as they may otherwise be pushed further away from points of their own identity while the CNN is trained to separate them. The priority shift happens when two merging options have identical inter-cluster dispersion $d_{ab}$; in that case, the standalone data point with less (zero) intra-cluster dispersion ($d_a$ or $d_b$) gets promoted. An illustration can be found in Fig. 2(a).

2) Poor clustering prevention. One disadvantage of the nesting property of agglomerative clustering is that there is no way to recover from a ‘poor’ clustering that may have occurred in previous levels of the hierarchy [48]. The addition of the regularization term helps to avoid this. Consider the case where a poor cluster was formed in a previous merging step: its high intra-cluster dispersion would prevent it from being selected for merging in the following turns, even though it may rank high in the merging list based on inter-cluster dispersion alone. An illustration can be found in Fig. 2(b), where the brown cluster is merged with the more distant green cluster because of their lower intra-dispersion.

Comparison with close work. Our work shares a similar spirit with Bottom-up Clustering (BUC) [17] and adopts an agglomerative clustering framework for the task of unsupervised person re-ID. We differ substantially in terms of the cluster merging criterion. On one hand, Lin et al. [17] adopted the minimum distance between cross-cluster samples to measure their dissimilarity. It is known that the single linkage algorithm has a chaining effect, i.e., the dissimilarity is obtained from $d_{ak}$ and $d_{bk}$, whichever is smaller ($d_{(ab)k} = \min(d_{ak}, d_{bk})$). This implies it has a tendency to favor elongated clusters. Stretched clusters may hinder the next iteration of model training with the repelled loss, which favours compact groups. On the other hand, based on the presumption that training samples are evenly distributed among identities, Lin et al. [17] proposed to use cluster cardinality as a diversity regularization term; however, this is also error-prone. As shown in Fig. 1, an equal distribution of identity samples can hardly be assured in person re-ID. In contrast, our criterion works on the pairwise distances between individual data points, which exploits the intra-cluster relations and bypasses the imbalanced data situation. Also, our criterion formulation helps in forming compact and well-separated clusters, which serves the same purpose as the repelled loss used in CNN model training. Two alternating stages pursuing the same goal of lower intra-cluster variation and higher inter-cluster separation reinforce each other and speed up the training process. The superiority of our proposed cluster dispersion criterion is evidenced through the numeric results provided in Sec. IV-D.
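The chaining tendency of single linkage can be seen on a toy configuration. In the sketch below (plain Python, Euclidean distances, hypothetical points), a target sample sits at the end of an elongated chain and next to a compact pair; the minimum-distance rule prefers the chain, whereas the average pairwise distance prefers the compact pair.

```python
import math
from itertools import product

def single_linkage(xs, ys):
    """Minimum distance between any cross-cluster pair."""
    return min(math.dist(x, y) for x, y in product(xs, ys))

def average_linkage(xs, ys):
    """Mean distance over all cross-cluster pairs."""
    return sum(math.dist(x, y) for x, y in product(xs, ys)) / (len(xs) * len(ys))

target = [(4.0, 0.0)]
chain = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]  # elongated cluster
compact = [(4.0, 1.5), (4.0, 1.7)]                        # tight cluster

single_prefers_chain = single_linkage(target, chain) < single_linkage(target, compact)
average_prefers_compact = average_linkage(target, compact) < average_linkage(target, chain)
```

Only the closest chain member matters for single linkage, so the chain keeps growing; the average over all pairs accounts for the distant chain members as well.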

Fig. 3: Examples of person images from (a) Market-1501, (b) DukeMTMC-reID, (c) MARS and (d) DukeMTMC-VideoReID.

IV Experiments

In this section, we perform experiments on four large-scale person re-ID datasets to evaluate the effectiveness of our proposed approach.

IV-A Datasets

Market-1501 [44] consists of 1,501 identities and 32,668 labeled images, among which 12,936 images of 751 identities are used for training and 19,732 images of 750 identities are used for testing.

DukeMTMC-reID [45] contains 36,411 labeled images of 1,404 identities. Similar to Market-1501, 702 identities are used for training and the remaining identities for testing. Specifically, there are 16,522 training images, 2,228 query images and 17,661 gallery images.

MARS [46] is a large-scale video-based dataset for person re-ID, which contains 17,503 video clips of 1,261 identities. The training set comprises 625 identities while the testing set comprises 636 identities.

DukeMTMC-VideoReID [47] is a large-scale video-based dataset for person re-ID derived from the DukeMTMC dataset. It has 2,196 tracklets of 702 identities for training and 2,636 tracklets of 702 identities for testing.

We list the statistics and some samples of these four datasets in Table I and Fig. 3.

IV-B Protocols

Training. To perform unsupervised learning on the above-mentioned person re-ID datasets, the training protocols are changed as described next. For image-based datasets, i.e., Market-1501 and DukeMTMC-reID, the training split remains the same except for the removal of identity labels. Similarly, for video-based datasets, i.e., MARS and DukeMTMC-VideoReID, the training samples are the tracklets of identities and each tracklet is treated as an individual sample. Note that no extra annotation information is used for model initialization or during our unsupervised feature learning.

| Methods | Market-1501 rank-1 | Market-1501 mAP | DukeMTMC-reID rank-1 | DukeMTMC-reID mAP | MARS rank-1 | MARS mAP | DukeMTMC-VideoReID rank-1 | DukeMTMC-VideoReID mAP |
|---|---|---|---|---|---|---|---|---|
| BUC w/o regularization [17] | 62.9 | 33.8 | 41.3 | 22.5 | 55.5 | 31.9 | 60.7 | 50.8 |
| BUC with regularization [17] | 66.2 | 38.3 | 47.4 | 27.5 | 61.1 | 38.0 | 69.2 | 61.9 |
| Ours w/o regularization | 66.2 | 38.7 | 48.2 | 27.5 | 59.8 | 37.2 | 71.8 | 63.2 |
| Ours with regularization | 69.2 | 41.3 | 51.5 | 30.0 | 64.3 | 43.8 | 75.2 | 66.1 |

TABLE II: The effectiveness of our dispersion based criterion and comparison with the minimum distance criterion. The regularization terms are cluster size in BUC and intra-cluster dispersion in Ours.

Testing. When training is done, the CNN model with trained parameters is used as the feature extractor. The output activations from the penultimate layer of ResNet-50 are used as the person descriptor for image inputs, while the person descriptor for a tracklet input is the average of its frame features. These person descriptors are then used for a Euclidean distance based retrieval.
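The test-time procedure can be sketched in plain Python under simple assumptions: features are given as lists of floats, a tracklet is represented by the mean of its frame features, and the gallery is ranked by Euclidean distance to the query. The function names are hypothetical and the feature extractor itself is not shown.

```python
import math

def tracklet_descriptor(frame_features):
    """Average the per-frame feature vectors of one tracklet."""
    n = len(frame_features)
    return [sum(col) / n for col in zip(*frame_features)]

def rank_gallery(query, gallery):
    """Return gallery indices sorted by increasing Euclidean distance to the query."""
    return sorted(range(len(gallery)), key=lambda i: math.dist(query, gallery[i]))

descriptor = tracklet_descriptor([[1.0, 2.0], [3.0, 4.0]])
ranking = rank_gallery([0.0, 0.0], [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
```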

Evaluation Metrics. We evaluate our methods with rank-$k$ accuracy and mean average precision (mAP) on all four datasets. The rank-$k$ accuracy counts a query as ‘correct’ if a correct match appears in the top-$k$ ranked list. It represents the retrieval precision. The mAP value reflects the overall precision and recall rates. Together, these two metrics provide a more comprehensive evaluation of the approach.
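For a single query, the two metrics can be sketched as below. The sketch assumes the gallery has already been ranked and is given as a list of identity labels; it omits the camera-based filtering that re-ID benchmarks typically apply, and the function names are illustrative. Averaging the two quantities over all queries gives rank-$k$ accuracy and mAP.

```python
def rank_k_hit(ranked_ids, query_id, k):
    """True if a correct identity appears among the first k ranked gallery items."""
    return query_id in ranked_ids[:k]

def average_precision(ranked_ids, query_id):
    """Mean of the precision values measured at each correct match."""
    hits, precisions = 0, []
    for position, gid in enumerate(ranked_ids, start=1):
        if gid == query_id:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0

ranked = [7, 3, 3, 5, 3]  # gallery identities in ranked order for query identity 3
hit_at_1 = rank_k_hit(ranked, 3, 1)
hit_at_5 = rank_k_hit(ranked, 3, 5)
ap = average_precision(ranked, 3)
```

Here the first correct match is at position 2, so the query counts as a miss at rank-1 and a hit at rank-5, while the average precision also rewards retrieving the remaining correct matches early.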

| Methods | Venue | Labels | Market-1501 rank-1 | Market-1501 rank-5 | Market-1501 rank-10 | Market-1501 mAP | DukeMTMC-reID rank-1 | DukeMTMC-reID rank-5 | DukeMTMC-reID rank-10 | DukeMTMC-reID mAP |
|---|---|---|---|---|---|---|---|---|---|---|
| BOW [44] | ICCV'15 | None | 35.8 | 52.4 | 60.3 | 14.8 | 17.1 | 28.8 | 34.9 | 8.3 |
| OIM [42] | CVPR'17 | None | 38.0 | 58.0 | 66.3 | 14.0 | 24.5 | 38.8 | 46.0 | 11.3 |
| UMDL [15] | CVPR'16 | Transfer | 34.5 | 52.6 | 59.6 | 12.4 | 18.5 | 31.4 | 37.6 | 7.3 |
| PUL [14] | TOMM'18 | Transfer | 44.7 | 59.1 | 65.6 | 20.1 | 30.4 | 46.4 | 50.7 | 16.4 |
| EUG [47] | TIP'19 | OneEx | 55.8 | 72.3 | 83.5 | 26.2 | 48.8 | 63.4 | 68.4 | 28.5 |
| SPGAN [16] | CVPR'18 | Transfer | 58.1 | 76.0 | 82.7 | 26.7 | 46.9 | 62.6 | 68.5 | 26.4 |
| TJ-AIDL [34] | CVPR'18 | Transfer | 58.2 | - | - | 26.5 | 44.3 | - | - | 23.0 |
| BUC [17] | AAAI'19 | None | 66.2 | 79.6 | 84.5 | 38.3 | 47.4 | 62.6 | 68.4 | 27.5 |
| Ours | - | None | 69.2 | 83.0 | 87.8 | 41.3 | 51.5 | 64.6 | 70.1 | 30.0 |

TABLE III: Comparison with the state-of-the-art methods on two image-based large-scale re-ID datasets. The column "Labels" lists the supervision used by the corresponding method. "Transfer" means it uses an external dataset with annotations. "OneEx" denotes that only one image of each identity is labeled. "None" denotes no extra information is used.

| Methods | Venue | Labels | MARS rank-1 | MARS rank-5 | MARS rank-10 | MARS mAP | DukeMTMC-VideoReID rank-1 | DukeMTMC-VideoReID rank-5 | DukeMTMC-VideoReID rank-10 | DukeMTMC-VideoReID mAP |
|---|---|---|---|---|---|---|---|---|---|---|
| OIM [42] | CVPR'17 | None | 33.7 | 48.1 | 54.8 | 13.5 | 51.1 | 70.5 | 76.2 | 43.8 |
| DGM+IDE [23] | ICCV'17 | OneEx | 36.8 | 54.0 | - | 16.8 | 42.3 | 57.9 | 69.3 | 33.6 |
| Stepwise [20] | ICCV'17 | OneEx | 41.2 | 55.5 | - | 19.6 | 56.2 | 70.3 | 79.2 | 46.7 |
| RACE [49] | ECCV'18 | OneEx | 43.2 | 57.1 | 62.1 | 24.5 | - | - | - | - |
| DAL [50] | BMVC'18 | OneEx | 49.3 | 65.9 | 72.2 | 23.0 | - | - | - | - |
| BUC [17] | AAAI'19 | None | 61.1 | 75.1 | 80.0 | 38.0 | 69.2 | 81.1 | 85.8 | 61.9 |
| EUG [47] | TIP'19 | OneEx | 62.8 | 75.2 | 80.4 | 42.6 | 72.9 | 84.3 | 88.3 | 63.3 |
| Ours | - | None | 64.3 | 79.2 | 85.1 | 43.8 | 75.2 | 87.0 | 90.2 | 66.1 |

TABLE IV: Results on two video-based person re-identification datasets, MARS and DukeMTMC-VideoReID. The column "Labels" lists the supervision used by the corresponding method. "OneEx" denotes that each person in the dataset is annotated with one labeled example. "None" denotes no extra information is used.

IV-C Implementation details

In our experiments, we adopt ResNet-50 [51], pre-trained on ImageNet [52], as the backbone architecture. A two-layer fully connected network is added on top of the penultimate layer (2048-d) of ResNet-50 to obtain a smaller feature embedding (1024-d). The last classification layer is replaced by the implementation of Eq. (2) with a varying number of clusters. All datasets share exactly the same set of hyper-parameters unless otherwise specified. For CNN model training, we set the total number of training epochs to 20, the batch size to 16, and the dropout rate to 0.5. The CNN model is optimized by Stochastic Gradient Descent (SGD) with momentum set to 0.9. The learning rate is initialized to 0.1 and decreased to 0.01 after 15 epochs. For the clustering stages, the batch merging parameter is set to 0.05 and the trade-off parameter in Eq. (5) is set to 0.5. On the image-based re-ID datasets, i.e., Market-1501 and DukeMTMC-reID, the whole training process takes less than 3 hours on a Tesla V100 GPU. For the video-based re-ID datasets, i.e., MARS and DukeMTMC-VideoReID, complete training takes about 4 hours.
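The hyper-parameters listed above can be collected in one place, as in the following minimal sketch. The dictionary keys and the function name `learning_rate` are illustrative and not taken from the released code, and the 0-indexed epoch convention is an assumption, since the text does not state how epochs are counted.

```python
# Hyper-parameters quoted in the text; the key names are illustrative only.
TRAIN_CONFIG = {
    "epochs": 20,
    "batch_size": 16,
    "dropout": 0.5,
    "momentum": 0.9,
    "base_lr": 0.1,
    "lr_after_decay": 0.01,
    "decay_epoch": 15,
    "merge_fraction": 0.05,  # batch merging parameter
    "trade_off": 0.5,        # trade-off parameter of Eq. (5)
}


def learning_rate(epoch, cfg=TRAIN_CONFIG):
    """Step schedule: base rate for the first `decay_epoch` epochs, then the decayed rate.

    `epoch` is 0-indexed (an assumption; the text does not state the indexing).
    """
    if epoch < cfg["decay_epoch"]:
        return cfg["base_lr"]
    return cfg["lr_after_decay"]


# One learning rate per training epoch.
schedule = [learning_rate(e) for e in range(TRAIN_CONFIG["epochs"])]
print(schedule)
```

Under these assumptions, the first 15 epochs run at 0.1 and the remaining 5 at 0.01.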

IV-D Ablation Study

We evaluate the effectiveness of our proposed clustering criterion on all datasets and Table II summarizes the numerical results.

The effectiveness of the inter-cluster dispersion term. We evaluate the effectiveness of our inter-cluster dispersion term by comparing with the closely related BUC [17]. For a fair comparison, we report the results of BUC without its size regularization term in the first row of Table II, and our proposed criterion without intra-cluster regularization in the third row. It can be observed that, across all four datasets, our method outperforms BUC by a large margin of around 6% in rank-1 accuracy and 7% in mAP. This performance gain comes solely from the difference in the cluster linkage criterion. The single-linkage minimum-distance criterion tends to form stretched, elongated clusters. In contrast, our average pairwise distance takes a wider context into account. This type of full-linkage algorithm better discovers the patterns underlying the dataset. Notably, our model without the second term achieves results comparable to those of the full BUC model (second row).
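The difference between the two linkage criteria can be illustrated on toy data. The sketch below is not the authors' implementation: it uses plain Euclidean distance on made-up 2-D points, and the names `min_distance` and `average_pairwise_distance` are hypothetical.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def min_distance(cluster_a, cluster_b):
    """Single-linkage style criterion: distance of the closest cross-cluster pair."""
    return min(euclidean(x, y) for x in cluster_a for y in cluster_b)


def average_pairwise_distance(cluster_a, cluster_b):
    """Average distance over all cross-cluster pairs."""
    total = sum(euclidean(x, y) for x in cluster_a for y in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


# Toy 2-D features, made up for illustration.
anchor = [(0.0, 0.0), (1.0, 0.0)]
elongated = [(2.0, 0.0), (6.0, 0.0), (10.0, 0.0)]  # one point near `anchor`, the rest far away
compact = [(1.0, 1.5), (1.5, 1.5), (1.0, 2.0)]     # every point moderately close to `anchor`

for name, candidate in [("elongated", elongated), ("compact", compact)]:
    print(
        name,
        round(min_distance(anchor, candidate), 3),
        round(average_pairwise_distance(anchor, candidate), 3),
    )
```

In this example the minimum-distance criterion ranks the elongated candidate as the closer one because a single point lies near the anchor, whereas the average pairwise distance ranks the compact candidate first.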

The effectiveness of the intra-cluster dispersion term. We further study the effect of the intra-cluster dispersion regularization term. The results of our full model can be found in the fourth row. The numerical increase indicates that the regularization term further boosts performance. On Market-1501, the rank-1 accuracy increases from 66.2% to 69.2% and the mAP from 38.7% to 41.3%. A similar trend is observed across all datasets, which supports its effectiveness. Combining intra- and inter-cluster dispersion in the full model leads to better feature representation learning.

(a) Rank-1 accuracy
(b) mAP scores
Fig. 4: Parameter study on Market-1501. We vary the value of the trade-off parameter and report the resulting changes in rank-1 accuracy and mAP in (a) and (b), respectively.

IV-E Algorithm Analysis

Analysis on the trade-off parameter. The regularization parameter in Eq. (5) balances the importance of the intra-cluster and inter-cluster dispersion. We report results on the Market-1501 dataset with varying values in Fig. 4. It can be seen that the rank-1 accuracy first increases to its peak and then declines, as shown in Fig. 4(a). A similar trend can also be found for the mAP scores given in Fig. 4(b). This is plausible, since the parameter can be interpreted as a preference in candidate cluster selection: when it is relatively small, the criterion puts more emphasis on selecting clusters that are spatially close in feature space, and as it increases, more emphasis on selecting compact candidates. The best-performing setting agrees with the empirical evidence that the inter-cluster dispersion should be the main clustering indicator, coupled with reasonable regularization.
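The effect of the trade-off parameter on candidate selection can be sketched as follows. Eq. (5) is not reproduced in this section, so the form used in `merge_cost` (inter-cluster term plus the trade-off parameter times the sum of the two intra-cluster terms) is an assumption, as are all function names, the parameter name `lam`, and the toy data.

```python
import math
from itertools import combinations


def euclidean(x, y):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def inter_dispersion(a, b):
    """Average pairwise distance between members of two clusters."""
    return sum(euclidean(x, y) for x in a for y in b) / (len(a) * len(b))


def intra_dispersion(c):
    """Average pairwise distance inside one cluster; a singleton has dispersion 0."""
    pairs = list(combinations(c, 2))
    if not pairs:
        return 0.0
    return sum(euclidean(x, y) for x, y in pairs) / len(pairs)


def merge_cost(a, b, lam):
    # Assumed form: inter term plus `lam` times the intra terms of both clusters.
    return inter_dispersion(a, b) + lam * (intra_dispersion(a) + intra_dispersion(b))


def best_candidate(query, candidates, lam):
    """Name of the candidate cluster with the lowest merge cost."""
    return min(candidates, key=lambda name: merge_cost(query, candidates[name], lam))


# Toy data, made up for illustration.
query = [(0.0, 0.0)]
candidates = {
    "near_but_spread": [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)],
    "far_but_compact": [(2.0, 0.0), (2.0, 0.1)],
}

for lam in (0.0, 0.5, 1.0):
    print(lam, best_candidate(query, candidates, lam))
```

With a small `lam` the spatially closer candidate is selected, while a larger `lam` shifts the choice to the compact candidate. Because a singleton has zero intra-cluster dispersion under this form, it incurs no penalty from the regularization term.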

Fig. 5: The rank-1 accuracy performance with respect to clustering stages on DukeMTMC-VideoReID. Our proposed criterion demonstrates a faster learning process and better robustness to cluster numbers, compared to BUC[17].

Robustness. We further analyze how performance changes over clustering iterations throughout the learning process to study learning speed and robustness. The performance change with respect to clustering stages is shown in Fig. 5. One can see that our method achieves its best rank-1 accuracy one clustering stage ahead of BUC. On DukeMTMC-VideoReID, one stage comprises 100 pairwise merges, while on DukeMTMC-reID, converging one stage later equates to 800 more merging operations. This indicates that our algorithm learns faster.
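The number of merges per stage follows from the batch merging parameter of 0.05. The sketch below assumes that each merge joins exactly two clusters, that the count is rounded down, and that the training sets contain 2,196 tracklets (DukeMTMC-VideoReID) and 16,522 images (DukeMTMC-reID); these sizes come from the public dataset descriptions rather than from this section, and the function names are hypothetical.

```python
def merges_per_stage(num_training_samples, merge_fraction=0.05):
    """Number of pairwise merges in one clustering stage (rounded down)."""
    return int(num_training_samples * merge_fraction)


def clusters_after(num_training_samples, stages, merge_fraction=0.05):
    """Cluster count after `stages` stages, starting from one cluster per sample.

    Assumes every merge joins exactly two clusters, so each merge removes one cluster.
    """
    per_stage = merges_per_stage(num_training_samples, merge_fraction)
    return max(num_training_samples - stages * per_stage, 1)


# Training-set sizes are assumptions taken from the public dataset descriptions.
DATASET_SIZES = {"DukeMTMC-VideoReID": 2196, "DukeMTMC-reID": 16522}

for name, n in DATASET_SIZES.items():
    print(name, merges_per_stage(n), clusters_after(n, stages=10))
```

Under these assumptions, one stage amounts to 109 merges on DukeMTMC-VideoReID and 826 on DukeMTMC-reID, which is roughly in line with the figures of 100 and 800 quoted above.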

It can also be observed that the performance of the two approaches is intertwined in the early clustering stages but diverges afterwards, with ours outperforming BUC by a relatively large margin. The intertwined stages are the beginning stages, where smaller clusters are formed, progressively building up trustworthy identities and driving a stable performance increase. After that, the algorithms start merging larger clusters, during which an inaccurate merge can have a large influence on performance. The performance differences in the figure demonstrate the advantage of our approach. Another noticeable fact is that the performance of BUC drops immediately after its peak, while ours remains at the same level for a few subsequent stages before declining. This shows that our algorithm is more robust to the resulting number of clusters. Finally, the performance decline in the last few stages for both approaches is caused by over-clustering of images from distinct identities, since the resulting number of clusters is far smaller than the ground-truth number of identities.

Fig. 6: Qualitative illustration of clustering on a reduced training set sampled from Market-1501 (100 identities, 1,657 images). t-SNE is adopted to visualize the feature embeddings. The same color denotes the same identity. The circled regions show two clustering cases, namely a negative (left) and a positive (right) merge. We also show identity samples indicated by arrows. For the negative sample, it can be seen that two visually similar persons are merged together.

Qualitative Analysis. We further provide a qualitative analysis to gain a better understanding of the clustering results of our proposed approach. In Fig. 6, t-SNE [53] is used to visualize the clustering results on a reduced dataset, which contains 1,657 images of 100 identities randomly sampled from Market-1501. It can be seen that, in most cases, samples from the same identity are grouped together (see the collections of same-colored points). At the same time, there are some incorrect merges of identities. For example, see the bottom-left box in Fig. 6, where two women with similar appearance (white t-shirts and dark sportswear) are clustered together. This highlights the fact that visually similar identities are very difficult to disambiguate without any supervision.

IV-F Comparison with state-of-the-art

We evaluate our approach on both image-based and video-based person re-ID datasets and numerical results are reported in Table III and Table IV, respectively.

Image-based Person re-ID. Table III summarizes the state-of-the-art unsupervised person re-ID results on the Market-1501 and DukeMTMC-reID datasets. On Market-1501, we achieve the best performance among all listed approaches with rank-1 = 69.2% and mAP = 41.3%. Among the compared methods, OIM [42] and BUC [17] are evaluated under the fully unsupervised setting. It can be seen that our approach outperforms the state-of-the-art BUC by a margin of 3% in both rank-1 accuracy and mAP. Similar performance improvements can be observed on the DukeMTMC-reID dataset.

The performance of some domain adaptation and one-shot learning approaches is also reported, e.g., TJ-AIDL [34] and EUG [47]. TJ-AIDL [34] trains with attribute labels to learn a robust, transferable embedding that encodes extra attribute information, while EUG [47] initializes the model with one labeled example per identity and then progressively selects samples for training. In our experiments, we still surpass them by a relatively large margin on Market-1501 (11% and 13.4% in rank-1 accuracy, respectively), even though external supervision is provided in their settings.

Video-based Person re-ID. The comparisons with state-of-the-art algorithms on the video-based person re-ID datasets, MARS and DukeMTMC-VideoReID, are reported in Table IV. On DukeMTMC-VideoReID, we achieve rank-1 = 75.2% and mAP = 66.1%, exceeding our competitor BUC [17] by 6% and 4.2% in rank-1 accuracy and mAP, respectively. This demonstrates that our proposed clustering algorithm generalizes more stably to different data distributions. We also outperform all other competitive methods on the MARS dataset with rank-1 = 64.3% and mAP = 43.8%. These results illustrate the effectiveness of our proposed approach.
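The margins quoted in this subsection can be re-derived from the table entries with a few lines of arithmetic. The snippet below copies the rank-1 and mAP values from Tables III and IV; the dictionary layout and the function name `margin` are illustrative.

```python
# (rank-1, mAP) values copied from Tables III and IV.
RESULTS = {
    "Market-1501": {
        "Ours": (69.2, 41.3),
        "BUC": (66.2, 38.3),
        "TJ-AIDL": (58.2, 26.5),
        "EUG": (55.8, 26.2),
    },
    "DukeMTMC-VideoReID": {
        "Ours": (75.2, 66.1),
        "BUC": (69.2, 61.9),
    },
}


def margin(dataset, baseline):
    """Difference (rank-1, mAP) between our result and a baseline, in percentage points."""
    ours = RESULTS[dataset]["Ours"]
    other = RESULTS[dataset][baseline]
    return tuple(round(o - b, 1) for o, b in zip(ours, other))


for dataset, methods in RESULTS.items():
    for baseline in methods:
        if baseline != "Ours":
            print(dataset, baseline, margin(dataset, baseline))
```

The differences are expressed in percentage points, matching how the margins are stated in the text.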

V Conclusion

In this paper, we highlight the importance of quantifying cluster validity using robust statistical measures. We consider the problem of unsupervised person re-ID and propose a dispersion-based criterion to assess the quality of automatically generated clusters. On the one hand, the proposed criterion considers both intra- and inter-cluster dispersion and can achieve better clustering. The former dispersion term enforces compact clusters, while the latter ensures the separation between them. On the other hand, the proposed criterion can handle singleton clusters and prevent poor clustering. The extensive performance evaluations and ablation study illustrate the effectiveness of our proposed method.

References

  • [1] X. Yang, M. Wang, and D. Tao, “Person re-identification with metric learning using privileged information,” IEEE Trans. Image Process., vol. 27, no. 2, pp. 791–805, 2018.
  • [2] Y.-J. Cho and K.-J. Yoon, “Pamm: Pose-aware multi-shot matching for improving person re-identification,” IEEE Trans. Image Process., vol. 27, no. 8, pp. 3739–3752, 2018.
  • [3] Z. Feng, J. Lai, and X. Xie, “Learning view-specific deep networks for person re-identification,” IEEE Trans. Image Process., vol. 27, no. 7, pp. 3472–3483, 2018.
  • [4] Y. Suh, J. Wang, S. Tang, T. Mei, and K. Mu Lee, “Part-aligned bilinear representations for person re-identification,” in ECCV, 2018.
  • [5] Y. Sun, L. Zheng, W. Deng, and S. Wang, “Svdnet for pedestrian retrieval,” in ICCV, 2017.
  • [6] H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang, “Spindle net: Person re-identification with human body region guided feature decomposition and fusion,” in CVPR, 2017.
  • [7] S. Zhou, J. Wang, J. Wang, Y. Gong, and N. Zheng, “Point to set similarity based deep feature learning for person re-identification,” in CVPR, 2017.
  • [8] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani, “Person re-identification by symmetry-driven accumulation of local features,” in CVPR, 2010.
  • [9] S. Liao, Y. Hu, X. Zhu, and S. Z. Li, “Person re-identification by local maximal occurrence representation and metric learning,” in CVPR, 2015.
  • [10] G. Lisanti, I. Masi, A. D. Bagdanov, and A. Del Bimbo, “Person re-identification by iterative re-weighted sparse ranking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 8, pp. 1629–1642, 2015.
  • [11] R. Zhao, W. Ouyang, and X. Wang, “Unsupervised salience learning for person re-identification,” in CVPR, 2013.
  • [12] H. Wang, S. Gong, and T. Xiang, “Unsupervised learning of generative topic saliency for person re-identification,” in BMVC, 2014.
  • [13] E. Kodirov, T. Xiang, and S. Gong, “Dictionary learning with iterative laplacian regularisation for unsupervised person re-identification,” in BMVC, 2015.
  • [14] H. Fan, L. Zheng, C. Yan, and Y. Yang, “Unsupervised person re-identification: Clustering and fine-tuning,” ACM Trans. Multimedia Comput., Commun. Appl., vol. 14, no. 4, p. 83, 2018.
  • [15] P. Peng, T. Xiang, Y. Wang, M. Pontil, S. Gong, T. Huang, and Y. Tian, “Unsupervised cross-dataset transfer learning for person re-identification,” in CVPR, 2016.
  • [16] W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao, “Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification,” in CVPR, 2018.
  • [17] Y. Lin, X. Dong, L. Zheng, Y. Yan, and Y. Yang, “A bottom-up clustering approach to unsupervised person re-identification,” in AAAI, 2019.
  • [18] F. M. Khan and F. Bremond, “Unsupervised data association for metric learning in the context of multi-shot person re-identification,” in AVSS, 2016.
  • [19] E. Kodirov, T. Xiang, Z. Fu, and S. Gong, “Person re-identification by unsupervised graph learning,” in ECCV, 2016.
  • [20] Z. Liu, D. Wang, and H. Lu, “Stepwise metric promotion for unsupervised video person re-identification,” in ICCV, 2017.
  • [21] X. Ma, X. Zhu, S. Gong, X. Xie, J. Hu, K.-M. Lam, and Y. Zhong, “Person re-identification by unsupervised video matching,” Pattern Recognit., vol. 65, pp. 197–210, 2017.
  • [22] H. Wang, X. Zhu, T. Xiang, and S. Gong, “Towards unsupervised open-set person re-identification,” in ICIP, 2016.
  • [23] M. Ye, J. Li, A. J. Ma, L. Zheng, and P. C. Yuen, “Dynamic graph co-matching for unsupervised video-based person re-identification,” IEEE Trans. Image Process., 2019.
  • [24] R. Zhao, W. Oyang, and X. Wang, “Person re-identification by saliency learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 2, pp. 356–370, 2017.
  • [25] H. Yao, S. Zhang, R. Hong, Y. Zhang, C. Xu, and Q. Tian, “Deep representation learning with part loss for person re-identification,” IEEE Trans. Image Process., 2019.
  • [26] B. Ma, Y. Su, and F. Jurie, “Bicov: a novel image representation for person re-identification and face verification,” in BMVC, 2012.
  • [27] ——, “Local descriptors encoded by fisher vectors for person re-identification,” in ECCV, 2012.
  • [28] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko, “Simultaneous deep transfer across domains and tasks,” in ICCV, 2015.
  • [29] M. Long, Y. Cao, J. Wang, and M. I. Jordan, “Learning transferable features with deep adaptation networks,” in ICML, 2015.
  • [30] B. Sun and K. Saenko, “Deep coral: Correlation alignment for deep domain adaptation,” in ECCV, 2016.
  • [31] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in ICML, 2015.
  • [32] B. Sun, J. Feng, and K. Saenko, “Return of frustratingly easy domain adaptation,” in AAAI, 2016.
  • [33] Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang, “Camstyle: a novel data augmentation method for person re-identification,” IEEE Trans. Image Process., vol. 28, no. 3, pp. 1176–1190, 2019.
  • [34] J. Wang, X. Zhu, S. Gong, and W. Li, “Transferable joint attribute-identity deep learning for unsupervised person re-identification,” in CVPR, 2018.
  • [35] Z. Zhong, L. Zheng, S. Li, and Y. Yang, “Generalizing a person retrieval model hetero-and homogeneously,” in ECCV, 2018.
  • [36] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in ECCV, 2018.
  • [37] K. Ghasedi Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang, “Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization,” in ICCV, 2017.
  • [38] J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embedding for clustering analysis,” in ICML, 2016.
  • [39] B. Yang, X. Fu, N. D. Sidiropoulos, and M. Hong, “Towards k-means-friendly spaces: Simultaneous deep learning and clustering,” in ICML, 2017.
  • [40] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in ECCV, 2016.
  • [41] G. Ding, S. Zhang, S. Khan, Z. Tang, J. Zhang, and F. Porikli, “Feature affinity based pseudo labeling for semi-supervised person re-identification,” IEEE Trans. Multimedia, 2019.
  • [42] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang, “Joint detection and identification feature learning for person search,” in CVPR, 2017.
  • [43] J. M. Lucas and M. S. Saccucci, “Exponentially weighted moving average control schemes: properties and enhancements,” Technometrics, vol. 32, no. 1, pp. 1–12, 1990.
  • [44] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable person re-identification: A benchmark,” in ICCV, 2015.
  • [45] Z. Zheng, L. Zheng, and Y. Yang, “Unlabeled samples generated by gan improve the person re-identification baseline in vitro,” in ICCV, 2017.
  • [46] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian, “Mars: A video benchmark for large-scale person re-identification,” in ECCV, 2016.
  • [47] Y. Wu, Y. Lin, X. Dong, Y. Yan, W. Bian, and Y. Yang, “Progressive learning for person re-identification with one example,” IEEE Trans. Image Process., 2019.
  • [48] J. C. Gower, “A comparison of some methods of cluster analysis,” Biometrics, pp. 623–637, 1967.
  • [49] M. Ye, X. Lan, and P. C. Yuen, “Robust anchor embedding for unsupervised video person re-identification in the wild,” in ECCV, 2018.
  • [50] Y. Chen, X. Zhu, and S. Gong, “Deep association learning for unsupervised video person re-identification,” in BMVC, 2018.
  • [51] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
  • [52] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009.
  • [53] L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” J. Mach. Learn. Res., vol. 9, no. Nov, pp. 2579–2605, 2008.