Deep Active Learning for Video-based Person Re-identification

12/14/2018 ∙ by Menglin Wang, et al. ∙ Zhejiang University 0

It is prohibitively expensive to annotate a large-scale video-based person re-identification (re-ID) dataset, which makes fully supervised methods inapplicable to real-world deployment. How to maximally reduce the annotation cost while retaining the re-ID performance becomes an interesting problem. In this paper, we address this problem by integrating an active learning scheme into a deep learning framework. Noticing that the truly matched tracklet-pairs, also denoted as true positives (TP), are the most informative samples for our re-ID model, we propose a sampling criterion to choose the most TP-likely tracklet-pairs for annotation. A view-aware sampling strategy considering view-specific biases is designed to facilitate candidate selection, followed by an adaptive resampling step to leave out the selected candidates that are unnecessary to annotate. Our method learns the re-ID model and updates the annotation set iteratively. The re-ID model is supervised by the tracklets' pseudo labels, which are initialized by treating each tracklet as a distinct class. With the annotations gained for the actively selected candidates, the tracklets' pseudo labels are updated by label merging and further used to re-train our re-ID model. While simple, the proposed method demonstrates its effectiveness on three video-based person re-ID datasets. Experimental results show that less than 3% of pairwise annotations are needed for our method to reach performance comparable to the fully supervised setting.







1 Introduction

Person re-identification (re-ID) in video surveillance is important for public security. Extensive studies have therefore addressed the re-ID problem, most of them focusing on single image frames [17, 40, 38, 12, 33]. In recent years, video-based re-ID [1, 9, 11, 19, 37, 31] has attracted growing research attention. It utilizes both spatial and temporal information and can therefore better overcome the challenges resulting from occlusion, lighting variation, and changes in pose and camera view.

Most existing works perform video-based person re-ID under full supervision. While state-of-the-art re-ID performance is reported on large-scale labeled datasets [39, 23], the fully supervised methods [1, 9, 11, 19, 37] scale poorly to real-world deployment, for two reasons. On the one hand, the amount of video data collected by a wide-area camera network is large, so full annotation is prohibitively expensive. On the other hand, the abundant unlabeled data are informative, but fully supervised methods rarely exploit the information they contain. Researchers therefore resort to unsupervised [38, 36, 16] or semi-supervised [32, 15] techniques. Unfortunately, there is still a significant performance gap between these methods and their fully supervised counterparts.

In order to reduce the annotation cost while keeping the re-ID performance, this paper proposes an approach that integrates an active learning (AL) [26] scheme into a deep learning framework. Active learning aims to achieve high performance with as little labeled data as possible. The sampling strategy, that is, how to pick the most informative instances for annotation, plays a key role. Different query strategies have recently been developed for various AL-based vision tasks such as classification [10], recognition [6, 13], and object detection [25, 27]. However, they cannot be straightforwardly applied to the person re-ID problem because they consider individual instances only and do not exploit the inter-relations between samples. In recent years, there have also been several attempts at utilizing active learning for person re-ID [2, 29, 14, 24]. Some [2, 29, 14] focused on post-ranking and exploited the annotations to refine the initial ranking results. The work in [24] explicitly considered AL in person re-ID as an optimal subset selection task and implemented it by solving a triangle-free subgraph maximization problem on a k-partite graph. Few of these methods exploit the inter-relations between samples to facilitate sample selection and model learning.

In video-based person re-ID, the annotation task is either directly assigning an ID label to each tracklet or telling whether two tracklets are matched or not. In this work, we take the second annotation manner. By checking a common video-based person re-ID dataset, we notice that only a rather small portion of tracklet-pairs are true matches, also referred to as true positives (TP), and most pairs are negative. It indicates that the truly matched tracklet-pairs are the most informative candidates for learning. Motivated by this observation, active learning is exploited to find and annotate the most TP-likely tracklet-pairs that the re-ID model is certain of. This sampling criterion is distinct from typical active learning methods [10, 14, 6, 29, 2] in which the most uncertain samples are queried.


Figure 1: Example of identities in the same and different camera views. Each row shows three tracklets of a person. In the first row, tracklets (a) and (b) are from the same camera, (c) is from a different camera. In the second row, tracklets (d) and (e) are from the same camera, (f) is from a different camera.

In addition, we also observe the following view-specific biases: 1) Truly matched tracklets from the same camera view are more similar to each other than those from different views, as shown in Fig. 1; 2) False positives are more likely to be selected from the same views. This observation inspires us to design a view-aware sampling strategy that takes the view information into account. An adaptive resampling step is further adopted to filter out the selected negative pairs that are unnecessary to annotate.

The main contributions of our work are listed as follows:

  • We propose a framework that integrates an active learning scheme with a deep learning model for video-based person re-ID. It performs re-ID model update and active annotation in an iterative and progressive way. In contrast to other semi-supervised or AL-based methods, our model requires no labeled re-ID data for initialization.

  • We design a sampling criterion to choose the most TP-likely candidates for annotation. A view-aware strategy and an adaptive resampling step are also designed to facilitate candidate selection. Our sampling strategies can significantly reduce annotation effort.

  • Extensive experiments on three benchmark multi-camera person re-ID datasets validate the effectiveness of the proposed method. The results show that less than 3% pairwise annotations are needed for our method to reach comparable performance with the fully-supervised setting.

2 Related Work

2.1 Fully Supervised Video-based Person Re-ID

The majority of existing video-based person re-ID methods are fully supervised. Similar to image-based counterparts, metric learning and representation learning are two major research directions. For instance, You et al. [37] and Zhu et al. [46] introduced set-based constraints into distance metric learning to better tackle intra-person variations in videos. McLaughlin et al. [19], Zhou et al. [43], and Li et al. [11] designed recurrent neural networks (RNN) or pooling schemes to aggregate temporal features. Besides, attention schemes [34, 1] have also been introduced to the person re-ID problem in recent years. Fully supervised methods have achieved promising performance on large-scale video datasets [39, 23]. However, their performance may degrade dramatically when applied to real-world scenarios beyond the labeled training data domains.

2.2 Semi-supervised Video-based Person Re-ID

Semi-supervised learning trains a model initially on a small amount of labeled data and then updates the model by exploiting unlabeled data. In this way, it can alleviate the annotation burden without compromising too much performance. In semi-supervised video-based person re-ID, the one-shot setting [16], in which one tracklet of each identity is labeled, has been considered in recent years. For instance, Wu et al. [32] initialized a CNN model under the one-shot setting and gradually chose the most confident unlabeled tracklets for model update. DGM [36] and the stepwise method [16] also require at least one labeled tracklet for each identity to initialize their models. Unlike active learning, these methods do not actively choose unlabeled data for humans to annotate.

2.3 Active Learning for Person Re-ID

Active learning aims to reduce annotation cost by intelligently choosing some of the unlabeled data to annotate, and is thus related to human-in-the-loop approaches [29]. Most existing works [2, 28, 24] applying active learning to person re-ID are based on still images. Different sampling strategies, such as the entropy-based criterion [2] and the joint exploration-exploitation criterion [28] that measures both diversity and uncertainty, were proposed. In [24], image-pair selection is formulated as a combinatorial optimization problem based on transitivity. All these image-based methods either work under the one-shot setting or require a small pre-labeled training set for model initialization. In contrast, our work is video-based and needs no labeled person re-ID data. Moreover, our proposed sampling strategy is quite different from theirs.

3 The Proposed Method

3.1 Overall Framework

Unlike many other active learning approaches [24, 13], the proposed method does not require any labeled data for initialization. Thus, we consider a fully unlabeled video dataset. By pedestrian detection and tracking, we get tracklets containing pedestrian images. Let us represent the dataset by $\{x_i\}_{i=1}^{N}$, where $x_i$ denotes the $i$-th image. $M$ is a matrix mapping the image index to the tracklet index: if the $i$-th image belongs to the $j$-th tracklet, then the entry $M_{ij} = 1$, and 0 otherwise.

Following [42, 32], we formulate the re-ID task as a classification problem that minimizes the following objective function:

$$\min_{\theta, w, Y} \sum_{i} \ell\big(f(w; \phi(\theta; x_i)), y_i\big),$$

where $\phi$ is a CNN model, parameterized by $\theta$, that extracts the feature of the image $x_i$; $f$ is a classifier, parameterized by $w$, that predicts a $C$-dimensional classification confidence; $C$ is the number of classes, which is set dynamically; and $\ell$ is the classification loss, the cross entropy between the prediction and the pseudo target label $y_i$ (collected in $Y$), which is assigned automatically as described below.

Figure 2: An overview of our proposed framework. The circles with different colors represent tracklets of different persons. Our model learns by iterating two stages: the model learning stage and the active annotation stage. The pseudo image label is first initialized by considering each individual tracklet as belonging to a unique class. Using the updated pseudo image label as the target, the model learning stage learns under the supervision of the classification loss. Afterwards, the learned features are utilized by the active annotation stage for computing tracklet similarity. In the active annotation stage, the view-aware sampling strategy progressively selects the "TP-likely" tracklet pairs as candidates, then a re-sampling step performs label propagation so as to filter false positives. The chosen pairs are then annotated and merged into the updated target label for iterative model learning. The figure is best viewed in color.

Note that, for training efficiency, the above classification model takes each image as the input. In the test stage, we use the learned feature extractor $\phi$ to extract the feature of each image of a query tracklet and of the gallery tracklets. A set-to-set distance defined in Sec. 3.2 is then applied to compute the distance between the query and gallery tracklets for result ranking.

The above-defined problem is optimized in an alternating manner. In the beginning, each tracklet is treated as a distinct class. That is, the pseudo target label of the $j$-th tracklet is initialized as $j$. According to the image-tracklet relation denoted in $M$, the image's pseudo target label can be transitively obtained. Once initialized, we optimize $\theta$ and $w$ by fixing $Y$, and then update $Y$ by fixing the other parameters.
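The initialization described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `init_pseudo_labels` is hypothetical, and the image-to-tracklet mapping matrix is replaced by a flat list that stores, for each image, the index of its tracklet.

```python
def init_pseudo_labels(image_to_tracklet):
    """Give every tracklet its own class and pass that class to its images.

    image_to_tracklet[i] is the tracklet index of image i (an assumed compact
    form of the image-to-tracklet mapping matrix described in the text).
    """
    num_tracklets = max(image_to_tracklet) + 1
    # Tracklet t starts in class t, so there are as many classes as tracklets.
    tracklet_labels = list(range(num_tracklets))
    # Each image inherits the label of the tracklet that contains it.
    image_labels = [tracklet_labels[t] for t in image_to_tracklet]
    return tracklet_labels, image_labels


tracklet_labels, image_labels = init_pseudo_labels([0, 0, 1, 2, 2, 2])
print(tracklet_labels)  # [0, 1, 2]
print(image_labels)     # [0, 0, 1, 2, 2, 2]
```

Once annotations merge tracklets, only `tracklet_labels` needs to change; the image labels follow through the same lookup.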

Fig. 2 presents an overview of our framework. Corresponding to the optimization procedure introduced above, we split the entire procedure into a model learning stage and an active annotation stage that are performed alternately. Our re-ID model adopts ResNet-50 [7] as the feature extractor, followed by several fully connected layers as the classifier. The feature extractor is not pretrained on any labeled re-ID data but only on ImageNet [3]. The model learning stage jointly trains the feature extractor and the classifier under the supervision of the pseudo target labels. The active annotation stage first extracts image features using the learned feature extractor, and then updates the pseudo labels by annotating tracklet-pairs actively selected according to a view-aware sampling and adaptive resampling strategy. After gaining incremental annotations, the tracklets' pseudo labels are further updated by a merging algorithm. At each new iteration, the number of classes in the re-ID model is reset to the number of merged clusters.

3.2 View-aware Sampling Strategy

At each iteration, the active annotation stage selects the most informative tracklet-pairs for annotation. Manual annotation tells whether a selected pair is a true match or not. We observe that the true matches only take a small portion of whole pairs in a dataset, and the entire relationship between tracklets can be known if all true matches are annotated. Therefore, we prefer to choose the tracklet pairs that are the most likely to be true matches. To this end, we define a set-to-set distance as the criterion and design a view-aware strategy for sampling. Our view-aware sampling strategy is designed based on the view-specific biases introduced in Sec. 1.

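The set-to-set distance. The dissimilarity between tracklets is defined by a set-to-set distance introduced here. Let us consider two tracklets $\mathcal{X}_a$ and $\mathcal{X}_b$, each of which contains a set of pedestrian images. The distance between $\mathcal{X}_a$ and $\mathcal{X}_b$ is defined by

$$D(\mathcal{X}_a, \mathcal{X}_b) = \frac{1}{k} \sum_{i=1}^{|\mathcal{X}_a|} \sum_{j=1}^{|\mathcal{X}_b|} \delta_{ij}\, d(x_i^a, x_j^b).$$

Here, $x_i^a$ ($i = 1, \dots, |\mathcal{X}_a|$) is an image belonging to tracklet $\mathcal{X}_a$, and image $x_j^b$ ($j = 1, \dots, |\mathcal{X}_b|$) is from $\mathcal{X}_b$; $|\cdot|$ denotes the cardinality of a set, and $d(\cdot, \cdot)$ is the distance between the features of two images. $\delta_{ij}$ is an indicator determining whether the distance between two images is counted in $D$ or not; $\delta_{ij} \in \{0, 1\}$, and $\sum_{i,j} \delta_{ij} = k$, with the indicator selecting the $k$ smallest image-pair distances.

This distance takes the average of the $k$ smallest image-pair distances as the distance between tracklets. In contrast to performing temporal pooling to obtain tracklet features and then computing the Euclidean distance between the tracklet features [32, 36, 39], or taking the average distance of all image-pairs [16], our approach is more robust to outliers. In our experiments, $k$ is set to 3.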
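A minimal sketch of such a set-to-set distance follows: it averages the smallest image-pair distances between two tracklets. The function name and the use of Euclidean distance between image features are assumptions made for illustration; the text only specifies that the smallest image-pair distances are averaged and that their number is 3 in the experiments.

```python
import math


def set_to_set_distance(feats_a, feats_b, k=3):
    """Average of the k smallest image-pair distances between two tracklets.

    feats_a and feats_b are lists of per-image feature vectors.
    Euclidean distance is an assumption of this sketch.
    """
    if not feats_a or not feats_b:
        raise ValueError("both tracklets need at least one image")
    pair_dists = sorted(math.dist(fa, fb) for fa in feats_a for fb in feats_b)
    k = min(k, len(pair_dists))  # guard for very short tracklets
    return sum(pair_dists[:k]) / k


tracklet_a = [(0.0, 0.0), (10.0, 10.0)]
tracklet_b = [(0.0, 1.0), (0.0, 2.0), (50.0, 50.0)]
# The two closest image pairs are at distance 1 and 2; the outlier frame
# (50, 50) never enters the average.
print(set_to_set_distance(tracklet_a, tracklet_b, k=2))  # 1.5
```

Because only the closest pairs are kept, a few badly cropped or occluded frames do not move the tracklet distance, which is the robustness argument made above.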

View-aware sampling strategy. The view-specific biases inspire us to design a sampling strategy that is aware of camera views. Specifically, at the $t$-th iteration, the active annotation stage selects $n_s^t$ candidates that have the smallest dissimilarity values from the same-view tracklet pairs, together with $n_c^t$ candidates from the cross-view pairs. For simplicity and efficiency, both numbers are set as piecewise linear functions of $t$, so that the number of annotations increases as the iterations go on.

Considering that tracklet pairs from the same view are more similar to each other, we set $n_s^t$ larger than $n_c^t$ at the initial iterations so that more same-view pairs are selected for annotation. Later on, more cross-view pairs are selected by setting $n_c^t$ greater than $n_s^t$. This sampling strategy follows the self-paced learning principle by not only sampling in a progressive manner [32, 30] but also shifting from same-view (easier) to cross-view (harder) pairs. Such a self-paced manner brings more reliability to a learner that has a weak initialization.
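The selection step of one iteration can be sketched as follows. The tuple layout and the names `view_aware_select`, `n_same` and `n_cross` are hypothetical; the two counts are passed in by the caller because the exact schedule is not reproduced here.

```python
def view_aware_select(pairs, n_same, n_cross):
    """Pick the most similar same-view and cross-view tracklet pairs.

    pairs: list of (distance, tracklet_i, tracklet_j, camera_i, camera_j).
    Returns the selected (tracklet_i, tracklet_j) pairs, same-view first.
    """
    same_view = sorted((p for p in pairs if p[3] == p[4]), key=lambda p: p[0])
    cross_view = sorted((p for p in pairs if p[3] != p[4]), key=lambda p: p[0])
    chosen = same_view[:n_same] + cross_view[:n_cross]
    return [(i, j) for _, i, j, _, _ in chosen]


pairs = [
    (0.9, 0, 1, 0, 0),
    (0.2, 0, 2, 0, 1),
    (0.1, 1, 3, 0, 0),
    (0.5, 2, 3, 1, 0),
    (0.3, 1, 2, 0, 1),
]
print(view_aware_select(pairs, n_same=1, n_cross=2))  # [(1, 3), (0, 2), (1, 2)]
```

Keeping the two pools separate is what lets the same-view and cross-view budgets be tuned independently across iterations.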

3.3 Adaptive Resampling

The proposed sampling strategy tends to choose tracklet-pairs that are more likely to be true matches. As the iterations go on, a growing number of the true matches are annotated, leaving only a small number unfound. As a result, the percentage of true matches among the selected pairs decreases at later iterations, and much annotation effort is wasted on the selected false positive pairs. In order to reduce unnecessary annotations, we further propose an adaptive resampling scheme to leave out negative pairs selected at each iteration.

Our adaptive resampling scheme is designed by first using an efficient label propagation technique [45] to propagate clumped clusters to isolated ones, and then using a reciprocal ranking rule to filter out negatives. We briefly introduce these two steps as follows.

We consider all the tracklets in a dataset and the pseudo target labels $Y$ defined previously. The labels are soft labels that can be interpreted as distributions over clusters. We let the labels of a tracklet propagate to all other tracklets through fully connected edges. A probabilistic transition matrix $T$ is defined by [45]:

$$T_{ij} = P(j \rightarrow i) = \frac{w_{ij}}{\sum_{k} w_{kj}},$$

where $w_{ij}$ is a weight computed according to the tracklet distance between the $i$-th tracklet and the $j$-th tracklet.

The label propagation technique [45] propagates the distributions between all tracklets by iteratively performing the following steps:

  1. $Y \leftarrow TY$;

  2. Row-normalize $Y$;

  3. Clamp the results.
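The three steps above can be sketched in plain Python. This is a minimal illustration, assuming a column-normalized transition matrix built from pairwise weights, a fixed number of iterations instead of a convergence test, and a caller-supplied list `clamped` of tracklets whose labels are held fixed; these names and choices are assumptions, not details taken from the paper.

```python
def propagate_labels(weights, labels, clamped, num_iters=20):
    """Iterative label propagation over a fully connected tracklet graph.

    weights[i][j]: non-negative affinity between tracklets i and j.
    labels[i]: initial soft label (distribution over clusters) of tracklet i.
    clamped: indices whose labels are reset to their initial value each round.
    """
    n, c = len(weights), len(labels[0])
    # Transition matrix, column-normalized: T[i][j] = w[i][j] / sum_k w[k][j].
    col_sums = [sum(weights[k][j] for k in range(n)) for j in range(n)]
    T = [[weights[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    Y = [row[:] for row in labels]
    for _ in range(num_iters):
        # Step 1: propagate, Y <- T Y.
        Y = [[sum(T[i][j] * Y[j][m] for j in range(n)) for m in range(c)]
             for i in range(n)]
        # Step 2: row-normalize so each row is a distribution again.
        for i in range(n):
            total = sum(Y[i])
            if total > 0:
                Y[i] = [v / total for v in Y[i]]
        # Step 3: clamp the tracklets whose labels are trusted.
        for i in clamped:
            Y[i] = labels[i][:]
    return Y


weights = [[1.0, 0.1, 0.9], [0.1, 1.0, 0.1], [0.9, 0.1, 1.0]]
labels = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
Y = propagate_labels(weights, labels, clamped=[0, 1])
print(Y[2][0] > Y[2][1])  # True: tracklet 2 leans towards the cluster of tracklet 0
```

In this toy graph the undecided tracklet 2 has a strong edge to tracklet 0, so its distribution shifts towards that cluster while the clamped rows stay unchanged.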

After label propagation, we can derive the probability distribution of each tracklet over the clusters, which is further used to screen candidates. We adopt reciprocal ranking as the screening rule. Assume $\mathcal{C}$ is the set of candidate tracklet-pairs obtained from the sampling stage, and $\mathcal{N}_K(i)$ is the set of $K$-nearest cluster neighbors of tracklet $i$, i.e. the top-$K$ entries of its ranked probability distribution after label propagation. Then

$$\mathcal{C}' = \{(i, j) \in \mathcal{C} \mid j \in \mathcal{N}_K(i) \ \text{and} \ i \in \mathcal{N}_K(j)\} \qquad (7)$$

denotes the candidate pairs that remain after screening. The rule in Eq. (7) indicates that a tracklet-pair is kept when both tracklets in the pair are among the nearest cluster neighbors of each other. Otherwise the pair is removed from the candidate set.
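The reciprocal screening rule can be illustrated with the sketch below. It assumes one particular reading of the rule: a pair is kept when the current cluster of each tracklet is among the top-K clusters of the other tracklet's propagated distribution. The names `top_k_clusters`, `reciprocal_filter` and `cluster_of` are hypothetical.

```python
def top_k_clusters(distribution, k):
    """Indices of the k most probable clusters for one tracklet."""
    order = sorted(range(len(distribution)),
                   key=lambda c: distribution[c], reverse=True)
    return set(order[:k])


def reciprocal_filter(candidates, probs, cluster_of, k):
    """Keep a candidate pair only if each tracklet ranks the other's cluster in its top k.

    candidates: list of (i, j) tracklet pairs from the sampling stage.
    probs[i]: cluster distribution of tracklet i after label propagation.
    cluster_of[i]: current cluster (pseudo label) of tracklet i.
    """
    kept = []
    for i, j in candidates:
        i_accepts_j = cluster_of[j] in top_k_clusters(probs[i], k)
        j_accepts_i = cluster_of[i] in top_k_clusters(probs[j], k)
        if i_accepts_j and j_accepts_i:
            kept.append((i, j))
    return kept


probs = [[0.6, 0.3, 0.1], [0.35, 0.6, 0.05], [0.1, 0.2, 0.7]]
cluster_of = [0, 1, 2]
print(reciprocal_filter([(0, 1), (0, 2), (1, 2)], probs, cluster_of, k=2))  # [(0, 1)]
```

Requiring agreement in both directions is what removes one-sided matches, which are the likely false positives.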

3.4 Label Merging

Our re-ID model is iteratively trained with the supervision of all tracklets’ pseudo target labels. These pseudo labels are initialized by taking each tracklet as a distinct class. After receiving annotations for the progressively sampled tracklet-pairs, we take a label merging process at each iteration to reduce the class number for the re-ID model. The merging result is required to satisfy 1) each tracklet in a cluster should be matched with one or more other tracklets in the same cluster and 2) a tracklet outside a cluster is not matched with any tracklets in the cluster.

We adopt the density-based clustering algorithm DBSCAN [4] for merging. DBSCAN groups together points in high-density regions and marks points that lie alone in low-density areas as outliers. Therefore, it can discover clusters of arbitrary shape in spatial databases with noise. There are two key parameters in DBSCAN: $\epsilon$ and $MinPts$, which respectively denote the radius of the neighborhood of a point and the minimum number of points in that neighborhood. In our implementation, we set $\epsilon$ to 0.01 and $MinPts$ to 2, so that our requirements can be satisfied to a large extent.

The merged labels provide a pair-consistent [22, 44, 13] picture for all tracklet-pairs. For instance, a tracklet in a cluster matches all the other tracklets in the same cluster. Moreover, if two tracklets from separate clusters are identified as a match, then these two clusters can be merged into one cluster. These consistencies yield many auto-annotated pairs and boost the annotation efficiency.
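The transitive merging behaviour described above can be illustrated with a union-find sketch. Note that this is a swapped-in technique used only for illustration, not the DBSCAN procedure adopted in the paper: it simply groups tracklets connected by annotated positive pairs and relabels the groups with consecutive class indices. The function name `merge_labels` is hypothetical.

```python
def merge_labels(num_tracklets, positive_pairs):
    """Merge tracklets linked by annotated true matches into shared classes."""
    parent = list(range(num_tracklets))

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in positive_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # the two clusters become one

    # Relabel the roots as consecutive class indices 0, 1, 2, ...
    class_of_root = {}
    labels = []
    for t in range(num_tracklets):
        root = find(t)
        labels.append(class_of_root.setdefault(root, len(class_of_root)))
    return labels


labels = merge_labels(5, [(0, 1), (1, 2)])
print(labels)            # [0, 0, 0, 1, 2]
print(len(set(labels)))  # 3 classes remain for the next training round
```

Here two annotations, (0, 1) and (1, 2), also imply the match (0, 2), which is the kind of auto-annotated pair the paragraph above refers to.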

4 Experiments

4.1 Datasets

The PRID dataset [8] consists of images captured by two cameras, with 385 identities recorded by one camera and 749 identities by the other. 200 identities appear in both camera views. In order to guarantee the effective length of videos, the 178 identities that each have more than 27 frames are selected out of the 200 shared identities. During experiments, the dataset is randomly divided in half into training and test sets. The train/test partition is repeated 10 times and the average results are reported.

The MARS dataset [39] is the largest video dataset for person re-ID. It contains 20,478 tracklets for 1,261 identities, captured by six cameras on a university campus. The tracklets are automatically generated by the DPM [5] detector and the GMMCP [23] tracker. The dataset is evenly split into training and test sets, respectively, containing 631 and 630 identities. We fix this partition in our experiments. During test, each identity has one randomly-selected tracklet probe under each camera.

The DukeMTMC-VideoReID (Duke-video) dataset is a recent video re-ID dataset, created by Wu et al. [32] in their experiments on one-shot person re-ID. It is a subset of the DukeMTMC dataset [23], a large-scale dataset for multi-camera tracking. The tracklets are generated by cropping pedestrian images from the videos for 12 frames every second. Since the DukeMTMC dataset is manually annotated, each identity has at most one tracklet under each camera. Following the protocol in [41], the generated DukeMTMC-VideoReID dataset is split into 702 identities for training, 702 identities for testing and 408 identities as distractors.

4.2 Experimental Settings

Evaluation metrics. For both MARS and DukeMTMC-VideoReID, the Rank-1 score of the cumulative matching characteristic (CMC) curve and the mean average precision (mAP) are adopted to measure the re-ID performance. For the PRID dataset, since each query has only one ground truth, we report the Rank-1, Rank-5, Rank-10 and Rank-20 scores of the CMC curve.
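For readers unfamiliar with these metrics, the sketch below computes Rank-k (CMC) scores and mAP from ranked match flags. It is a simplified illustration: the function name is hypothetical, every query is assumed to have at least one true match in the gallery, and benchmark-specific rules (such as which gallery items are excluded for a given query) are not modelled.

```python
def rank_k_and_map(ranked_matches, ks=(1, 5)):
    """Compute CMC Rank-k scores and mean average precision.

    ranked_matches[q] is a list of booleans over the gallery, sorted by
    increasing distance to query q; True marks a correct identity.
    """
    hits_at_k = {k: 0 for k in ks}
    ap_sum = 0.0
    for flags in ranked_matches:
        first_hit = flags.index(True)  # 0-based position of the first true match
        for k in ks:
            if first_hit < k:
                hits_at_k[k] += 1
        # Average precision: mean of the precision at each true match.
        hits, precisions = 0, []
        for rank, is_match in enumerate(flags, start=1):
            if is_match:
                hits += 1
                precisions.append(hits / rank)
        ap_sum += sum(precisions) / len(precisions)
    num_queries = len(ranked_matches)
    cmc = {k: hits_at_k[k] / num_queries for k in ks}
    return cmc, ap_sum / num_queries


cmc, mean_ap = rank_k_and_map([[True, False, True], [False, True, False]])
print(cmc[1], cmc[5])     # 0.5 1.0
print(round(mean_ap, 4))  # 0.6667
```

When each query has a single ground-truth match, as on PRID, the average precision reduces to the reciprocal of the match rank, which is why only CMC scores are reported there.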

Annotation ratio. The annotation ratio (AR) is defined as the number of labeled data divided by the total number of data. Because annotation is done in different ways across re-ID works, we provide the exact definition here. Let us denote the number of identities in a dataset by $N_{id}$ and the number of tracklets by $N_t$. For methods that annotate tracklet-pairs, like ours, the annotation ratio is computed as $AR = n_a / n_{total}$, where $n_a$ is the number of manually annotated tracklet-pairs and $n_{total}$ is the total number of pairs needed to label the whole dataset. $n_{total}$ is obtained as follows: under common annotation settings, pairs are randomly selected for labeling, and for a newly annotated pair, the historical annotation information of the two tracklets involved is synchronized between them. Since there is no closed-form formula for $n_{total}$, we perform intensive simulations and use the average total number of annotations as $n_{total}$. For methods that directly assign an ID to each tracklet, if $m$ tracklets are labeled, then $AR = m / N_t$.
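One possible reading of the simulation described above is sketched here. It is an assumption-laden illustration, not the authors' procedure: pairs are visited in random order, a pair is sent to the annotator only if its relation is not already implied by earlier answers, a positive answer merges the two clusters, and a negative answer is shared by all members of both clusters. All function names are hypothetical.

```python
import random


def simulate_full_annotation(identities, seed=0):
    """Count the manual pair queries needed until every pair relation is known."""
    rng = random.Random(seed)
    n = len(identities)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    rng.shuffle(pairs)
    cluster = list(range(n))  # current cluster id of each tracklet
    known_diff = set()        # cluster-id pairs already known to differ
    queries = 0
    for i, j in pairs:
        ci, cj = cluster[i], cluster[j]
        if ci == cj or frozenset((ci, cj)) in known_diff:
            continue  # relation already implied by earlier annotations
        queries += 1
        if identities[i] == identities[j]:
            # Positive answer: merge cluster cj into ci and carry over
            # everything that was known about cj.
            cluster = [ci if c == cj else c for c in cluster]
            known_diff = {frozenset(ci if c == cj else c for c in pair)
                          for pair in known_diff}
        else:
            known_diff.add(frozenset((ci, cj)))
    return queries


def annotation_ratio(num_annotated, num_total):
    """Annotation ratio in percent."""
    return 100.0 * num_annotated / num_total


identities = [0, 0, 1, 1, 2]
runs = [simulate_full_annotation(identities, seed=s) for s in range(100)]
average_total = sum(runs) / len(runs)
print(annotation_ratio(3, average_total))
```

Averaging over many random orders mirrors the "intensive simulations" mentioned in the text; the toy identity list is only there to make the sketch runnable.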

Implementation details. The proposed method is implemented using the PyTorch framework. During training, the batch size is set to 32 for the MARS and DukeMTMC-VideoReID datasets, and 8 for PRID since it is relatively small. We use stochastic gradient descent (SGD) as the optimizer with momentum 0.9 and weight decay 5e-4. The learning rate is fixed to 0.001 in our experiments.

4.3 Algorithm analysis


Figure 3: Ablation comparison on the view-aware sampling strategy and the adaptive resampling strategy. (a),(b) and (c) are for comparing the view-aware sampling strategy: (a) and (b) show the rank-1 accuracy and the mAP curve over the annotation ratio, respectively. (c) plots the gained TP ratio as the annotation ratio increases. (d),(e) and (f) are for comparing the adaptive resampling strategy: (d) and (e) show the rank-1 accuracy and the mAP curve over the annotation ratio, respectively. (f) plots the gained TP ratio as the annotation ratio increases.

The proposed framework consists of several key components that altogether contribute to the final performance. In order to investigate how these individual components influence the model performance, we conduct the following ablation experiments on MARS dataset.

4.3.1 Analysis on view-aware sampling strategy.

The view-aware sampling strategy splits all the pairs into two subsets, according to whether the tracklets come from the same camera view or not. Candidate pairs are then progressively selected from each of the two subsets. The main advantage of this design is that the difference in pair hardness can be taken into account. To investigate how this strategy contributes to the performance, we compare 1) the model with the view-aware strategy and 2) the model that treats all views equally, as well as other active learning methods for re-ID.

The tradeoff between re-ID accuracy and manual annotation ratio is illustrated in Fig. 3. Specifically, Fig. 3(a) and Fig. 3(b) show the rank-1 accuracy and mAP over the manual annotation ratio, respectively. Fig. 3(c) plots the relationship between the manual annotation ratio and the TP ratio actually gained. From Fig. 3(a), we can see that both candidate selection strategies reach the fully-annotated rank-1 accuracy with a tiny amount (less than 1.2%) of annotations, greatly outperforming the random sampling and k-means clustering methods. In addition, the view-aware strategy consistently outperforms the mixed-view strategy. For example, when the annotation ratio is 0.23% (i.e. around 12000 tracklet pairs), the view-aware strategy achieves 50.2% rank-1 accuracy, surpassing the mixed-view strategy by 7.9% (absolute). The mAP curves in Fig. 3(b) are in accordance with the rank-1 curves in Fig. 3(a). When the annotation ratio is greater than 0.75%, the superiority of the view-aware strategy over the mixed-view strategy declines, since by then most of the TP pairs have been annotated, making the difference less significant. Nevertheless, the results support the advantage of view-aware over view-ignored sampling in general.

The curves shown in Fig. 3(c) can further explain the accuracy curve behaviors. The gained TP ratio means the percentage of the gained TP pair number to the total TP number. It can be observed in Fig. 3(c) that as the annotation ratio increases, the gained TP ratio first rises rapidly, and then slowly reaches near-100%. It indicates that the gained TP number is the key factor to the improvement of recognition accuracy.

4.3.2 Analysis on adaptive resampling.

To make better use of the annotation resource, we propose an adaptive resampling scheme to further filter out the selected negative candidates. To analyze the effect of this scheme, we conduct experiments with and without resampling for explicit comparison. The experimental results are presented in Fig. 3.

The rank-1 and mAP curves versus annotation ratio are shown in Fig. 3(d) and (e), and the performance of the baseline active learning methods is compared as well. Several conclusions can be drawn from the two sub-figures. First, we observe consistency between the rank-1 and mAP curve trends, as well as a large performance gap between our proposed model and the two compared active learning methods. In addition, our model with adaptive resampling almost always performs better than the model without it, reaching higher rank-1 accuracy and mAP under the same annotation ratio. Last but not least, the model with adaptive resampling reaches fully supervised performance using far fewer annotations, which supports the effectiveness of the adaptive resampling step at removing false positive pairs. For better analysis, we also present the relationship between the gained TP ratio and the manual annotation ratio in Fig. 3(f). As shown in Fig. 3(f), the two settings (with and without resampling) have quite similar TP gains in the beginning. As the iterations go on, the percentage of TP pairs among the candidates starts to fall while the percentage of FP pairs increases. At this point, the effect of resampling becomes more significant. Finally, the setting with resampling is able to discover almost all the TPs at a lower manual annotation ratio.

Type | Method | PRID: A.R., R1, R5, R10, R20 | MARS: A.R., R1, mAP | Duke-video: A.R., R1, mAP
Supervised caffeNet[39] 100 77.3 93.5 99.3 100 65.3 47.6
Fusion[9] 100 83.03 66.43
GOG+XQDA[18] 100 69.4 89.6 92.4 95.7 100 41.97 24.89 100 58.83 52.42
Ours(supervised) 100 73.93 88.31 92.36 96.63 100 75.35 64.98 100 87.04 83.46
One-shot SMP[16] 50 38.7 68.1 79.6 90.0 7.53 41.2 19.7 31.97 56.26 46.76
DGM[36] 50 48.2 78.3 83.9 92.4 7.53 36.8 21.3 31.97 42.36 33.62
EUG[32] 7.53 62.67 42.45 31.97 72.79 63.23
RACE[35] 50 50.6 79.4 84.8 91.8 7.53 43.2 24.5
Active learn Random sample 50.56 44.49 71.01 80.11 90.56 1.98 28.89 13.78 34.15 54.56 49.81
K-means[20] 50.56 52.7 75.73 82.81 89.89 1.98 25.81 12.15 34.15 61.54 55.85
Ours 2.05 71.68 89.44 93.03 95.84 1.62 75.15 63.62 0.26 85.19 80.11
Table 1: Performance comparison with other methods on PRID, MARS and Duke-video dataset. A.R. means the manual annotation ratio in percentage.

4.4 Comparison with the State-of-the-Art Methods

To validate the effectiveness of the proposed approach, we compare it to other deep learning based methods on all three datasets. These compared methods are grouped into three categories: 1) Fully supervised methods, including caffeNet [39], Fusion [9], Snippet [1], and the supervised version of our approach. 2) Semi-supervised methods, including EUG [32], DGM [36], SMP [16], and RACE [35]. These methods are learned under the one-shot setting. 3) Baseline active learning methods, including a random sampling method and a K-means clustering approach [20]. The former is the version using our framework but replacing the sampling strategy by random sampling. K-means clustering ranks the samples by their distances to the K cluster centers in ascending order, and selects the top-ranked samples for ID annotation.

Table 1 reports the comparison results on three datasets. From Table 1, we can make the following observations:

  • With less than 3% annotations, our approach reaches comparable performance to our fully-supervised counterpart on all three datasets. Specifically on MARS, our approach achieves 98.67% of the fully-annotated counterpart with only 1.13% annotations. In comparison, the active learning methods using random sampling or k-means clustering give significantly worse results while using more annotations. The comparisons demonstrate the effectiveness of our method at wisely querying samples to reduce annotation amount.

  • Since the one-shot methods require at least one annotated tracklet of each identity, their annotation ratio is computed as the percentage of labeled tracklets among all the tracklets. Hence the annotation ratio is 7.53%, 31.97% and 50% for MARS, Duke-video and PRID, respectively. Compared with the one-shot methods, our method performs considerably better while requiring far fewer annotations. The one-shot methods exploit the annotations better than random annotation; however, their performance gain may be limited by the requirement of one annotated tracklet per identity. The results also show that a well-designed annotation strategy can make better use of the annotation budget to improve re-ID performance.

  • When comparing with the fully supervised methods, our fully-supervised counterpart gives better results than GOG+XQDA [18], and performs on par with caffeNet[39]. The comparisons prove that our fully-annotated counterpart is an effective upper bound to verify our proposed active learning re-id method.

5 Conclusion

Reducing annotation cost is an important goal pursued by various computer vision applications. In this paper, we have presented a video-based person re-ID framework that integrates an active learning strategy to progressively select the most TP-likely tracklet-pairs for annotation. In our incremental selection process, a view-aware sampling strategy is adopted that takes view-specific biases into account to facilitate candidate selection. To further tackle the increasing number of selected negative pairs that are not necessary to annotate, we proposed an adaptive resampling step which effectively filters them out. The proposed approach has been validated on three public datasets. It reaches comparable re-ID performance to the fully-supervised setting while using an extremely low annotation amount. The experimental results demonstrate the effectiveness of our method. Being simple and flexible, our active learning strategy can be combined with other state-of-the-art deep re-ID networks to bring further improvement in re-ID performance and annotation efficiency.


  • [1] D. Chen, H. Li, T. Xiao, S. Yi, and X. Wang. Video person re-identification with competitive snippet-similarity aggregation and co-attentive snippet embedding. In CVPR, 2018.
  • [2] A. Das, R. Panda, and A. Roy-Chowdhury. Active image pair selection for continuous person re-identification. In ICIP, 2015.
  • [3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
  • [4] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In SIGKDD, 1996.
  • [5] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE TPAMI, 32(9):1627–1645, 2010.
  • [6] M. Hasan and A. K. Roy-Chowdhury. Context aware active learning of activity recognition models. In ICCV, 2015.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [8] M. Hirzer, C. Beleznai, P. M. Roth, and H. Bischof. Person re-identification by descriptive and discriminative classification. In Scandinavian Conference on Image Analysis, pages 91–102. Springer, 2011.
  • [9] D. Li, X. Chen, Z. Zhang, and K. Huang. Learning deep context-aware features over body and latent parts for person re-identification. In CVPR, 2017.
  • [10] X. Li and Y. Guo. Multi-level adaptive active learning for scene classification. In ECCV, 2014.
  • [11] Y. Li, L. Zhuo, J. Li, J. Zhang, X. Liang, and Q. Tian. Video-based person re-identification by deep feature guided pooling. In CVPR, 2017.
  • [12] S. Liao, Y. Hu, X. Zhu, and S. Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In CVPR, 2015.
  • [13] L. Lin, K. Wang, D. Meng, W. Zuo, and L. Zhang. Active self-paced learning for cost-effective and progressive face identification. IEEE TPAMI, 40(1):7–19, 2018.
  • [14] C. Liu, C. Change Loy, S. Gong, and G. Wang. Pop: Person re-identification post-rank optimisation. In ICCV, 2013.
  • [15] W. Liu, X. Chang, L. Chen, and Y. Yang. Semi-supervised Bayesian attribute learning for person re-identification. In AAAI, 2018.
  • [16] Z. Liu, D. Wang, and H. Lu. Stepwise metric promotion for unsupervised video person re-identification. In ICCV, 2017.
  • [17] A. J. Ma, J. Li, P. C. Yuen, and P. Li. Cross-domain person reidentification using domain adaptation ranking SVMs. IEEE TIP, 24(5):1599–1613, 2015.
  • [18] T. Matsukawa, T. Okabe, E. Suzuki, and Y. Sato. Hierarchical Gaussian descriptor for person re-identification. In CVPR, 2016.
  • [19] N. McLaughlin, J. M. del Rincon, and P. Miller. Recurrent convolutional network for video-based person re-identification. In CVPR, 2016.
  • [20] F. Nie, H. Wang, H. Huang, and C. H. Ding. Early active learning via robust representation and structured sparsity. In IJCAI, 2013.
  • [21] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.
  • [22] S. Paul, J. H. Bappy, and A. K. Roy-Chowdhury. Non-uniform subset selection for active learning in structured data. In CVPR, 2017.
  • [23] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In ECCV, 2016.
  • [24] S. Roy, S. Paul, N. E. Young, and A. K. Roy-Chowdhury. Exploiting transitivity for learning person re-identification models on a budget. In CVPR, 2018.
  • [25] S. Roy, A. Unmesh, and V. P. Namboodiri. Deep active learning for object detection. In BMVC, 2018.
  • [26] B. Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison, 2009.
  • [27] K. Wan, X. Yan, D. Zhang, and L. Zhang. Towards human-machine cooperation: Self-supervised sample mining for object detection. In CVPR, 2018.
  • [28] H. Wang, S. Gong, and T. Xiang. Highly efficient regression for scalable person re-identification. In BMVC, 2016.
  • [29] H. Wang, S. Gong, X. Zhu, and T. Xiang. Human-in-the-loop person re-identification. In ECCV, 2016.
  • [30] H. Wang, S. Gong, X. Zhu, and T. Xiang. Human-in-the-loop person re-identification. In ECCV, 2016.
  • [31] T. Wang, S. Gong, X. Zhu, and S. Wang. Person re-identification by video ranking. In ECCV, 2014.
  • [32] Y. Wu, Y. Lin, X. Dong, Y. Yan, W. Ouyang, and Y. Yang. Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning. In CVPR, 2018.
  • [33] T. Xiao, H. Li, W. Ouyang, and X. Wang. Learning deep feature representations with domain guided dropout for person re-identification. In CVPR, 2016.
  • [34] S. Xu, Y. Cheng, G. Kang, and Y. Yang. Jointly attentive spatial-temporal pooling networks for video-based person re-identification. In ICCV, 2017.
  • [35] M. Ye, X. Lan, and P. C. Yuen. Robust anchor embedding for unsupervised video person re-identification in the wild. In ECCV, 2018.
  • [36] M. Ye, A. J. Ma, L. Zheng, J. Li, and P. C. Yuen. Dynamic label graph matching for unsupervised video re-identification. In ICCV, 2017.
  • [37] J. You, A. Wu, X. Li, and W.-S. Zheng. Top-push video-based person re-identification. In CVPR, 2016.
  • [38] H.-X. Yu, A. Wu, and W.-S. Zheng. Cross-view asymmetric metric learning for unsupervised person re-identification. In ICCV, 2017.
  • [39] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian. MARS: A video benchmark for large-scale person re-identification. In ECCV, 2016.
  • [40] L. Zheng, Y. Yang, and A. Hauptmann. Person re-identification: Past, present and future. arXiv preprint arXiv:1610.02984, 2016.
  • [41] Z. Zheng, L. Zheng, and Y. Yang. Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. arXiv preprint arXiv:1701.07717, 2017.
  • [42] Z. Zhong, L. Zheng, and S. Li. Re-ranking person re-identification with k-reciprocal encoding. In CVPR, 2017.
  • [43] Z. Zhou, Y. Huang, W. Wang, L. Wang, and T. Tan. See the forest for the trees: Joint spatial and temporal recurrent neural networks for video-based person re-identification. In ICCV, 2017.
  • [44] Z. Zhou, J. Y. Shin, L. Zhang, S. R. Gurudu, M. B. Gotway, and J. Liang. Fine-tuning convolutional neural networks for biomedical image analysis: Actively and incrementally. In CVPR, 2017.
  • [45] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, CMU CALD tech report CMU-CALD-02-107, 2002.
  • [46] X. Zhu, X.-Y. Jing, F. Wu, and H. Feng. Video-based person re-identification by simultaneously learning intra-video and inter-video distance metrics. In IJCAI, 2016.