Single Camera Training for Person Re-identification

09/24/2019 · Tianyu Zhang, et al. · Beihang University, Peking University

Person re-identification (ReID) aims at finding the same person in different cameras. Training such systems usually requires a large number of cross-camera pedestrians to be annotated from surveillance videos, which is labor-consuming, especially when the number of cameras is large. Differently, this paper investigates ReID in an unexplored single-camera-training (SCT) setting, where each person in the training set appears in only one camera. To the best of our knowledge, this setting has never been studied before. SCT enjoys the advantage of low-cost data collection and annotation, and thus makes it easier to train ReID systems in a brand-new environment. However, it raises major challenges due to the lack of cross-camera person occurrences, which conventional approaches heavily rely on to extract discriminative features. The key to dealing with the challenges of the SCT setting lies in designing an effective mechanism to complement cross-camera annotation. We start with a regular deep network for feature extraction, upon which we propose a novel loss function named multi-camera negative loss (MCNL). This is a metric learning loss motivated by probability, suggesting that in a multi-camera system, one image is more likely to be closer to the most similar negative sample in other cameras than to the most similar negative sample in the same camera. In experiments, MCNL significantly boosts ReID accuracy in the SCT setting, which paves the way for fast deployment of ReID systems with good performance on new target scenes.


1 Introduction

Person re-identification (ReID) aims to retrieve a certain person appearing in a camera network. With increasing concerns on public security, ReID has attracted more and more research attention from both academia and industry. In the past years, many algorithms [23, 25, 21, 39] and datasets [34, 38, 28, 36] have been proposed, which significantly boosted the progress of this research field. Despite the higher and higher accuracy obtained by specifically designed approaches on standard ReID benchmarks, many issues of this task remain unsolved. As discussed in [28, 5], a ReID model trained on one dataset performs poorly on other datasets due to dataset bias. Thus, to deploy a ReID system to a new environment, labelers have to annotate a training dataset from the target scene, which is often time-consuming and even impractical in large-scale application scenarios. To tackle this issue, researchers commonly assume that a number of unlabeled images are available in the target scene, based on which they design unsupervised learning [14] or domain adaptation approaches [27] to improve ReID performance. However, these methods depend on extra modules to predict the pseudo label of each image or to generate fake images. They are therefore not always reliable and report less satisfying performance compared to supervised learning methods.

Figure 1: The comparison between SCT and previous settings in person ReID. Fully-supervised-training (FST) data are composed of annotated pedestrians appearing in multiple cameras. Unsupervised-training (UT) data have no identity annotations. Under our single-camera-training (SCT) setting, each pedestrian appears in only one camera and identity labels are easy to obtain.

Different from previous work, this paper investigates ReID under a novel single-camera-training (SCT) setting, where each pedestrian appears in only one camera. We compare our setting to previous ones in Fig. 1. Without the heavy burden of annotating cross-camera pedestrians, labeled training data are easy to obtain under SCT. For example, using off-the-shelf tracking techniques [9, 17], researchers can quickly collect a large number of tracklets under each camera at different time periods, so that each of them very likely corresponds to a unique ID. Therefore, compared to the fully-supervised-training (FST) setting, i.e., learning knowledge from cross-camera annotations, SCT requires much less effort in preparing training data. Compared to the unsupervised-training (UT) setting, which requires frequent cross-camera person occurrences, SCT makes a mild assumption of camera independence, so as to provide weak but reliable supervision signals for learning. Therefore, SCT has the potential of being deployed to a wider range of application scenarios.

It remains an open issue how to exploit the camera-independence assumption to learn discriminative features for ReID. The central difficulty lies in camera isolation, which means that there are no cross-camera pedestrians in the entire training set. Conventional methods heavily rely on cross-camera annotations because these are the key supervision signal for metric learning. That is to say, by pulling images of the same person appearing in different cameras close together, conventional methods learn camera-unrelated features and thus perform well on the testing set. With camera isolation in SCT, we must turn to other types of supervision to achieve the goal of metric learning.

To this end, we propose a novel loss term named Multi-Camera Negative Loss (MCNL). The design of MCNL is inspired by a simple hypothesis: given an arbitrary person in a multi-camera network, the most similar other person is more likely to be found in another camera than in the same camera, simply because there are more candidates in other cameras. To verify this, we perform statistical analysis on several public datasets, and the results indeed support our assumption (please see Fig. 2). Based on the above observation, MCNL adjusts feature distributions and alleviates the camera isolation problem by ranking the distances of cross-camera negative pairs and within-camera negative pairs. Extensive experiments show that MCNL forces the backbone network to learn more person-related features while ignoring camera-related cues, and thus achieves good performance under the SCT setting.

Our major contributions can be summarized as follows:

  • To the best of our knowledge, this paper is the first to present the SCT setting. Moreover, this paper analyzes the advantages and challenges under the SCT setting compared to existing settings in person ReID.

  • To solve the issue of camera isolation under the SCT setting, this paper proposes a simple yet effective loss term named MCNL. Extensive experiments show that MCNL significantly boosts the ReID performance under SCT, and it is not sensitive to wrong annotations.

  • Last but not least, by solving SCT, this paper sheds light on fast deployment of ReID systems in new environments, implying a wide range of real-world applications.

2 Related Work

Our work is proposed under the new single-camera-training setting, which is closely related to the previous FST and UT settings. In this section, we summarize the existing methods under these settings and then elaborate on the differences between them and our SCT.

2.1 Fully-Supervised-Training Setting

The FST setting implies that a large number of annotated cross-camera pedestrian images are available for training. Most previous works under the FST setting formulated person ReID as a classification task and trained a classification model with the labeled training data [35, 37, 22]. With the advantages of large-scale training data and deep neural networks, these methods achieve good results. In addition, some researchers designed complex network architectures to extract more robust and discriminative features [29, 33, 15]. In contrast, other researchers argue that the surrogate loss for classification may not be suitable when the number of identities increases [8]. Therefore, end-to-end deep metric learning methods were proposed and widely used under the FST setting [30, 8, 2]. For example, Hermans et al. [8] demonstrated that the triplet loss is highly effective for the person ReID task. Chen et al. [2] proposed a deep quadruplet network to further improve ReID performance. Although accuracy has been boosted significantly, the demand for annotating large-scale training data hinders real-world applications: the fast deployment of ReID systems in new target scenes is almost impossible, because it is rather expensive to collect this kind of training data for the FST setting. Different from the FST setting, the SCT setting requires much less time for training data collection, since there is no need to collect and annotate cross-camera pedestrian images. Therefore, our SCT setting is more suitable for fast deployment of ReID systems in new target scenes.

2.2 Unsupervised-Training Setting

Different from FST, the UT setting means that no labeled training data are available. Although hand-crafted features like LOMO [13], BOW [34] and ELF [6] can be used directly, the resulting ReID performance is relatively low. Therefore, some researchers designed novel unsupervised learning methods to improve ReID performance under the UT setting. Liang et al. [12] proposed a salience weighted model. Lin et al. [14] adopted a bottom-up clustering approach for purely unsupervised ReID. Without the supervision of identity labels, the performance of these methods is still not satisfactory. To further boost ReID accuracy, many unsupervised domain adaptation methods have been proposed. They conduct supervised learning on a source domain and transfer knowledge to the target domain, thus benefiting from FST and producing better results. The ways of transferring domain knowledge include image-image translation [28, 5], attribute consistency schemes [27], and so on [40, 18, 41]. These methods perform well when the target domain and the source domain are very similar, but may not be suitable when the domain gap is large [11]. This problem does not exist in our proposed SCT setting, because under SCT, ReID models are trained only with data from the target scene. More recently, Li et al. [11] built cross-camera tracklet associations to learn a robust ReID model from automatically generated person tracklets. This method [11] assumes that cross-camera pedestrians are common, so that camera relations can be learned by matching person tracklets. However, in a large-scale camera network, the average number of cameras a pedestrian passes through is quite small, e.g., one person may appear in only five cameras out of thousands. Moreover, the tracklet association method [11] is not reliable enough to ensure that each pair of matched tracklets belongs to the same person, and wrongly matched tracklets may cause the learned ReID model to perform poorly. Inspired by the above discussion, we propose a more reliable setting, i.e., single-camera-training, and further design the multi-camera negative loss to improve ReID performance under this setting.

3 Problem: Single-Camera-Training

Researchers report major difficulty in collecting and annotating data for person ReID, and this difficulty grows with the number of cameras in the network. Take MSMT17 [28], a large-scale ReID dataset, as an example. To construct it, researchers collected high-resolution videos covering 180 hours from 15 cameras, after which three labelers worked on the data for two months for cross-camera annotation. In another dataset named RPIfield [36], there are two types of pedestrians, known as actors and distractors. A small number of actors followed pre-defined paths to walk through the camera network, so it is easy to associate the images captured by different cameras. However, a large number of distractors walked randomly without being controlled, so it is rather expensive to annotate these pedestrians across cameras. This annotation process indirectly verifies the difficulty of collecting and annotating cross-camera pedestrians.

On the other hand, cross-camera information plays the central role in person ReID, because for conventional approaches it is the main source of supervision on how the same person appears differently in the camera network – which is exactly what we hope to learn. We quantify how existing datasets provide cross-camera information by computing the average number of cameras each person occurs in, i.e., if a person appears in three cameras, his/her number of occurrences is 3. We name it the camera-per-person (CP) value and list a few examples in Tab. 1. We desire a perfect dataset in which all persons are annotated in all cameras, i.e., CP equals the number of cameras, but for a large camera network this is often impossible, e.g., in MSMT17, the CP value is 3.81, far smaller than 15, the number of cameras.

Dataset | #Cameras | CP | CP / #Cameras
MSMT17 | 15 | 3.81 | 0.254
DukeMTMC-reID | 8 | 3.13 | 0.391
Market-1501 | 6 | 4.34 | 0.724
RPIfield (distractors) | 12 | 1.25 | 0.104
RPIfield (actors) | 12 | 6.99 | 0.583
RPIfield (total) | 12 | 1.40 | 0.117
Table 1: The camera-per-person (CP) value of a few ReID datasets. #Cameras denotes the number of cameras; the last column is CP normalized by the number of cameras.
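
As an illustration of how the CP value is measured, the following sketch (our illustration, not part of any benchmark tooling; the data format is an assumption) computes it from a list of (person ID, camera ID) annotations; the last column of Tab. 1 is simply CP divided by the number of cameras.

```python
# A minimal sketch of the camera-per-person (CP) statistic reported in Tab. 1,
# assuming annotations are given as (person_id, camera_id) pairs.
from collections import defaultdict

def camera_per_person(annotations):
    """Average number of distinct cameras in which each identity appears."""
    cams_per_id = defaultdict(set)
    for person_id, camera_id in annotations:
        cams_per_id[person_id].add(camera_id)
    return sum(len(c) for c in cams_per_id.values()) / len(cams_per_id)

# Toy example: one person seen by 3 cameras, another by 1 camera -> CP = 2.0
print(camera_per_person([(0, 'A'), (0, 'B'), (0, 'C'), (1, 'A')]))
```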

To alleviate the burden of data annotation, we propose to consider the scenario in which no cross-camera annotations are available, i.e., CP equals 1 regardless of the number of cameras in the network. We name this setting single-camera-training (SCT)¹. This requirement can be achieved by collecting data from different cameras in different time periods (e.g., recording camera A from 8 am to 9 am and camera B from 10 am to 11 am). Although this cannot guarantee our assumption, as we shall see in experiments, our approach is robust to a small fraction of 'outliers', i.e., cases in which two or more occurrences of the same person in different cameras are assumed to be different identities.

¹ A disclaimer: SCT does not mean that there is only one camera in the training data; we simply assume that each person appears in only one camera, or, in other words, that two persons appearing in different cameras must have different identities.

This setting greatly eases the fast deployment of a ReID system. With off-the-shelf person tracking algorithms [9, 17], we can easily extract a large number of tracklets in videos, each of which forms an identity in the training set. However, such a training dataset is less powerful than those specifically designed for the ReID task, as it lacks supervision of how a person can appear differently in different cameras. We call this challenge camera isolation, and will elaborate on this point carefully in the next section.

4 Our Approach

4.1 Baseline and Motivation

Existing ReID approaches often start with a backbone which extracts a feature vector $f(x)$ from an input image $x$. On top of these features, there are mainly two types of loss functions, and sometimes they are used together towards higher accuracy. The first type is the cross entropy (CE) loss, which requires the model to perform a classification task in which the same person in different cameras is categorized into one class. The second type is the triplet margin (TM) loss, which assumes that the largest distance between two appearances of the same person should be smaller than the smallest distance between this person and any other. When built upon a ResNet-50 [7] backbone, CE and TM achieve 78.9% and 79.0% rank-1 accuracy on the DukeMTMC-reID dataset [38], respectively, without any bells and whistles.

However, in the SCT setting, both of them fail dramatically due to camera isolation. From the DukeMTMC-reID dataset, we sample 5,993 training images which satisfy the SCT setting, and the corresponding models, with CE and TM losses, report 40.2% and 21.2% rank-1 accuracy, respectively. In comparison, we sample another training subset with the same number of images but equipped with cross-camera annotation, and these numbers become 69.3% and 75.8%, which verifies our hypothesis.

Figure 2: Curves of the probability, produced during training with the triplet margin loss, of finding the most similar person (of a different ID) in another camera, with respect to the number of elapsed training epochs. The left and right figures show results on Market-1501 and DukeMTMC-reID, respectively.

To explain this dramatic accuracy drop, we first point out that a ReID system needs to learn a feature embedding which is independent of cameras (i.e., camera-unrelated features), which is to say, the learned feature distribution is approximately the same under different cameras. However, neither the CE loss nor the TM loss can achieve this goal by itself – both heavily rely on cross-camera annotations. Without such annotations, existing ReID systems often learn camera-related features.

Here, we provide another metric to quantify the impact of camera-related/unrelated features, which is the core observation that motivates our algorithm design. Intuitively, for a set of camera-unrelated features, the feature distribution over the entire camera network should be approximately the same as the distribution over any single camera. In other words, the expected similarity between two different persons in the same camera should not be higher than that between two different persons from two cameras. Therefore, given an anchor image, the probability that the most similar person appears in the same camera is only roughly the fraction of candidates that come from that camera, i.e., in a multi-camera system, the most similar person mostly appears in another camera. We therefore perform statistics during the training process with the TM loss, under both SCT and FST, and show the results in Fig. 2. Under the FST setting, this probability mostly increases during training and eventually reaches a plateau at around 0.8; under the SCT setting, the curve is less stable and the stabilized probability is much lower.
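
A minimal sketch of how this statistic (the curves in Fig. 2) can be computed from a set of extracted features; the function name and tensor layout are our assumptions, not the authors' code.

```python
import torch

def prob_nearest_negative_in_other_camera(features, ids, cams):
    """For each anchor, find the most similar sample with a different identity
    and report how often it comes from a different camera than the anchor.
    features: (N, D) tensor, ids / cams: (N,) integer label tensors."""
    dist = torch.cdist(features, features)           # pairwise Euclidean distances
    same_id = ids.unsqueeze(0) == ids.unsqueeze(1)   # mask positives and the anchor itself
    dist = dist.masked_fill(same_id, float('inf'))
    nearest = dist.argmin(dim=1)                     # most similar negative per anchor
    return (cams[nearest] != cams).float().mean().item()
```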

Thus, our motivation is to encourage the learned features to satisfy the property that the most similar person appears in another camera. This serves as the extra, weakly supervised cue to be exploited in the SCT setting. It leads to a novel loss function, the Multi-Camera Negative Loss (MCNL), which is detailed in the next subsection.

4.2 Multi-Camera Negative Loss

Inspired by the analyses above, we design the Multi-Camera Negative Loss (MCNL) to ensure that, given any anchor image in one camera, the most similar negative image is more likely to be found in other cameras, and that this negative image is less similar to the anchor than the most dissimilar positive image.

In a mini-batch with $C$ cameras, $P$ identities from each camera and $K$ images of each identity (i.e., the batch size is $C \times P \times K$), given an anchor image $x_a$ with identity label $y_a$ and camera label $c_a$, let $f(\cdot)$ denote the feature mapping function learned by our network, and $D(\cdot,\cdot)$ the Euclidean distance between two feature vectors. The hardest positive distance of $x_a$ is defined as:

$d^{\max}_{ap} = \max_{x_p:\, y_p = y_a} D\big(f(x_a), f(x_p)\big). \quad (1)$

Then, we have the hardest negative distance in the same camera:

$d^{\mathrm{same}}_{an} = \min_{x_n:\, y_n \neq y_a,\, c_n = c_a} D\big(f(x_a), f(x_n)\big), \quad (2)$

and the hardest negative distance in other cameras:

$d^{\mathrm{other}}_{an} = \min_{x_n:\, y_n \neq y_a,\, c_n \neq c_a} D\big(f(x_a), f(x_n)\big). \quad (3)$

With these terms, MCNL is formulated as follows:

$\mathcal{L}_{\mathrm{MCNL}} = \big[d^{\max}_{ap} - d^{\mathrm{other}}_{an} + m_1\big]_+ + \big[d^{\mathrm{other}}_{an} - d^{\mathrm{same}}_{an} + m_2\big]_+, \quad (4)$

where $[\cdot]_+ = \max(\cdot, 0)$ and $m_1$, $m_2$ denote the values of the margins; the loss is averaged over all anchor images in the mini-batch.

As shown in Eq. (4), the second loss term ensures that the most similar negative image is found in other cameras, and the first loss term forces this negative image to be less similar to the anchor than the most dissimilar positive image. Together, these two parts restrict $d^{\mathrm{other}}_{an}$ to lie between $d^{\max}_{ap}$ and $d^{\mathrm{same}}_{an}$, which meets the motivation described in the previous section.
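
For concreteness, the following PyTorch-style sketch implements Eqs. (1)-(4) with batch-hard mining; the margin values and function signature are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

def mcnl_loss(emb, ids, cams, m1=0.1, m2=0.1):
    """Multi-Camera Negative Loss, a sketch of Eqs. (1)-(4).
    emb: (B, D) features, ids / cams: (B,) identity / camera labels.
    Assumes the batch is sampled as C cameras x P identities x K images,
    so every anchor has positives, within-camera and cross-camera negatives."""
    dist = torch.cdist(emb, emb)                        # D(f(x_a), f(x_b)) for all pairs
    pos = ids.unsqueeze(0) == ids.unsqueeze(1)          # same identity
    same_cam = cams.unsqueeze(0) == cams.unsqueeze(1)   # same camera

    d_ap = dist.masked_fill(~pos, float('-inf')).max(dim=1).values                 # Eq. (1)
    d_an_same = dist.masked_fill(pos | ~same_cam, float('inf')).min(dim=1).values  # Eq. (2)
    d_an_other = dist.masked_fill(pos | same_cam, float('inf')).min(dim=1).values  # Eq. (3)

    # Eq. (4): keep d_an_other above d_ap (margin m1) and below d_an_same (margin m2)
    loss = F.relu(d_ap - d_an_other + m1) + F.relu(d_an_other - d_an_same + m2)
    return loss.mean()
```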

Moreover, the proposed MCNL also ensures that the learned features are discriminative and camera-unrelated. Since the most similar cross-camera negative image differs from the anchor in camera factors, its similarity is more likely to lie in person appearance. By pulling the most similar cross-camera negative pairs a little closer, MCNL encourages the model to focus more on person appearance. For the most similar within-camera negative pair, since camera factors are shared with the anchor, pushing the pair further apart reduces the impact of cameras. In addition, MCNL keeps positive pairs closer than cross-camera negative pairs, which preserves the basic correctness of metric learning.

Differences from prior work. Previously, researchers proposed many triplet-based or quadruplet-based loss functions to improve ReID performance [8, 19, 20]. The largest difference between our approach and theirs is that they push the hardest negative images from other cameras away without constraints, while we do not. On a dataset constructed under the SCT setting, these methods tend to learn camera-related cues to separate negative images from other cameras, which further aggravates the camera isolation problem. Moreover, we evaluate several state-of-the-art metric learning and ReID methods under SCT; the experimental results demonstrate that existing methods are not suitable for this new setting (please see Sec. 5.4).

Advantages. Based on the above discussions, the advantages of the proposed MCNL are twofold. (i) MCNL alleviates the camera isolation problem: by pulling cross-camera negative pairs closer and pushing within-camera negative pairs away, it forces the feature extraction model to ignore camera cues. (ii) As with previous metric learning approaches, MCNL forces the feature extraction model to learn a more discriminative representation by constraining the hardest positive image to be closer to the anchor image than the negative images (both cross-camera and within-camera).

5 Experiments

5.1 Datasets

Dataset | #Train IDs | #Train Images | #Test IDs | #Test Images | Cross-camera persons?
Market | 751 | 12,936 | 750 | 15,913 | Yes
Market-SCT | 751 | 3,561 | 750 | 15,913 | No
Duke | 702 | 16,522 | 1,110 | 17,661 | Yes
Duke-SCT | 702 | 5,993 | 1,110 | 17,661 | No
Table 2: Detailed statistics of the datasets used in our experiments.

To evaluate the effectiveness of our proposed method, we mainly conduct experiments on two large-scale person ReID datasets, i.e., Market-1501 [34] and DukeMTMC-reID [38]. For short, we refer to Market-1501 and DukeMTMC-reID as Market and Duke, respectively.

Both Market and Duke are widely used person ReID datasets. For each person in the training sets, there are multiple images from different cameras. To evaluate our method, we reconstruct these training sets for the SCT setting. More specifically, we randomly choose one camera for each person and take the images of that person under the selected camera as training images. Finally, we obtain 5,993 images from the training set of Duke and 3,561 images from the training set of Market. In this paper, we denote these sampled datasets as Duke-SCT and Market-SCT, respectively. Note that we keep the original testing data and strictly follow the standard testing protocols. The detailed statistics of the datasets are shown in Tab. 2.
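
The construction of Market-SCT and Duke-SCT can be summarized by the following sketch (our illustration; the random seed and data format are assumptions): for every identity, one camera is drawn at random, and only the images of that identity under the chosen camera are kept.

```python
import random
from collections import defaultdict

def build_sct_subset(samples, seed=0):
    """Keep, for each identity, only the images from one randomly chosen camera.
    samples: list of (image_path, person_id, camera_id) tuples from the
    original training set; the test set is left untouched."""
    rng = random.Random(seed)
    cams_by_id = defaultdict(set)
    for _, pid, cam in samples:
        cams_by_id[pid].add(cam)
    chosen = {pid: rng.choice(sorted(cams)) for pid, cams in cams_by_id.items()}
    return [s for s in samples if s[2] == chosen[s[1]]]
```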

5.2 Implementation Details

We adopt ResNet-50 [7], pre-trained on ImageNet [3], as our network backbone. The final fully connected layers are removed, and we apply global average pooling (GAP) to the output of the fourth block of ResNet-50. The 2048-dim GAP feature is used for metric learning. In each batch, we randomly select $C$ cameras, sample $P$ identities for each selected camera, and then randomly sample $K$ images for each identity, leading to a batch size of $C \times P \times K$ for Duke-SCT. For Market-SCT, there are only 6 cameras in the training set, so we sample fewer cameras per batch and adjust the numbers of identities per camera and images per identity accordingly. The margins $m_1$ and $m_2$ are set empirically. For the baseline, we implement the batch hard triplet loss [8], which is one of the most effective implementations of the TM loss; for short, we use Triplet to denote the batch hard triplet loss in the following sections. The margin of Triplet follows the setting that achieves excellent performance under the FST setting. The input images are resized to a fixed resolution, and the Adam [10] optimizer is adopted with weight decay. The learning rate is initialized to a base value $\epsilon_0$ and exponentially decays following Eq. (5) proposed in [8]:

$\epsilon(t) = \begin{cases} \epsilon_0, & t \le t_0, \\ \epsilon_0 \cdot 0.001^{\frac{t - t_0}{t_1 - t_0}}, & t_0 < t \le t_1, \end{cases} \quad (5)$

For all datasets, we update the learning rate every epoch after 100 epochs and stop training when reaching 200 epochs, i.e., $t_0 = 100$ and $t_1 = 200$, respectively. All experiments are conducted on two NVIDIA GTX 1080Ti GPUs.
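
A small sketch of the schedule in Eq. (5), under the assumption that it follows the exponential decay of Hermans et al. [8], i.e., the rate decays to a factor of 0.001 of the base rate between $t_0$ and $t_1$.

```python
def lr_schedule(epoch, base_lr, t0=100, t1=200, final_factor=1e-3):
    """Eq. (5): constant learning rate until epoch t0, then exponential decay
    so that the rate equals base_lr * final_factor at epoch t1.
    The value of final_factor follows [8] and is an assumption here."""
    if epoch <= t0:
        return base_lr
    return base_lr * final_factor ** ((epoch - t0) / (t1 - t0))
```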

5.3 Diagnostic Studies

Methods | Duke-SCT Rank-1 | Duke-SCT mAP | Market-SCT Rank-1 | Market-SCT mAP
Triplet | 21.2 | 11.3 | 39.7 | 18.2
Triplet-other | 9.9 | 3.6 | 25.2 | 8.8
Triplet-same | 54.6 | 35.9 | 51.3 | 28.0
MCNL | 66.4 | 45.3 | 66.2 | 40.6
Table 3: ReID accuracy (%) produced by different loss terms, among which MCNL reports the best results. The training sets are Duke-SCT and Market-SCT, respectively. Triplet, Triplet-other and Triplet-same denote the batch hard triplet loss [8] and its two variations.

The effectiveness of MCNL. MCNL is designed based on Triplet [8]. To better demonstrate its effectiveness, we evaluate the performance of Triplet and its two variations, Triplet-same and Triplet-other. In Triplet-same, the hardest negative image is selected from the same camera as the anchor image; in Triplet-other, it is selected from other cameras. The performance of these methods is summarized in Tab. 3, and a sketch of the variants is given below.
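
The sketch shows how the two variants differ from the plain batch-hard triplet loss: only the pool from which the hardest negative is mined changes. The function name, margin and mask logic are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def triplet_variant_loss(emb, ids, cams, margin=0.3, negatives='all'):
    """Batch-hard triplet loss with a restricted negative pool:
    'all'   -> Triplet        (negatives from any camera)
    'same'  -> Triplet-same   (negatives from the anchor's camera only)
    'other' -> Triplet-other  (negatives from other cameras only)"""
    dist = torch.cdist(emb, emb)
    pos = ids.unsqueeze(0) == ids.unsqueeze(1)
    same_cam = cams.unsqueeze(0) == cams.unsqueeze(1)
    neg = ~pos
    if negatives == 'same':
        neg = neg & same_cam
    elif negatives == 'other':
        neg = neg & ~same_cam
    d_ap = dist.masked_fill(~pos, float('-inf')).max(dim=1).values   # hardest positive
    d_an = dist.masked_fill(~neg, float('inf')).min(dim=1).values    # hardest (restricted) negative
    return F.relu(d_ap - d_an + margin).mean()
```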

Figure 3: Visualization of feature distributions of (a) Triplet, (b) Triplet-other, (c) Triplet-same, and (d) MCNL. Pseudo F statistics are shown in brackets: 8.433, 16.683, 0.404 and 0.255, respectively. Each color indicates features from one camera. This figure is best viewed in color.

As shown in Tab. 3, MCNL achieves huge improvements over Triplet and its two variations. For example, MCNL outperforms Triplet by 45.2% in Rank-1 accuracy on Duke-SCT and surpasses Triplet-same by 11.8%. It is worth noticing that, compared with Triplet, Triplet-same also improves ReID performance under the SCT setting. This is because Triplet-same maximizes the distance of negative pairs whose two images come from the same camera; since the camera-related clues of such a pair are very similar, the feature extraction model is forced to focus on the foreground area and extract more camera-unrelated features. In contrast, Triplet-other pushes cross-camera negative pairs apart, so the model focuses more on the background and obtains worse performance. Similar to Triplet-same, MCNL also pushes within-camera negative pairs as far apart as possible. Moreover, by restricting the distance of the hardest cross-camera negative pair to be smaller than that of the hardest within-camera negative pair, the model further alleviates the camera isolation problem and ignores camera-related features.

To support the above discussion, we utilize t-SNE [24] to visualize the feature distributions extracted by different methods. To this end, we randomly sample 500 images from the testing set of Duke and extract their features with four models trained with Triplet, Triplet-same, Triplet-other, and MCNL, respectively. Moreover, we use the pseudo F statistic [1] to quantitatively evaluate how the feature distributions of different cameras relate to each other. A larger pseudo F value indicates more distinct clusters, which means the extracted features are more related to cameras; in other words, a smaller pseudo F value implies that the features are better learned.

As shown in Fig. 3, features extracted by Triplet and Triplet-other are separable according to cameras, which is bad for ReID systems. Differently, Triplet-same and MCNL both map images to a camera-unrelated feature space.
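
The pseudo F statistic of [1] is the Calinski-Harabasz index, so the quantity reported in Fig. 3 can be reproduced with standard tooling; the snippet below is an illustrative sketch, not the authors' evaluation script.

```python
from sklearn.manifold import TSNE
from sklearn.metrics import calinski_harabasz_score

def camera_pseudo_f(features, cam_labels):
    """Pseudo F (Calinski-Harabasz) statistic with cameras as cluster labels:
    smaller values mean the cameras are less separable, i.e. the features
    are more camera-unrelated."""
    return calinski_harabasz_score(features, cam_labels)

def tsne_2d(features, seed=0):
    """2-D t-SNE embedding used only for visualization, as in Fig. 3."""
    return TSNE(n_components=2, random_state=seed).fit_transform(features)
```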

Stability analysis. Even with a restricted data collection process, some persons inevitably appear in more than one camera. To evaluate the robustness of MCNL in this situation, we conduct experiments on Duke to show how the accuracy changes with respect to the percentage of people showing up in multiple cameras. As shown in Fig. 4, when these people are annotated truly according to their identities, the Triplet loss benefits largely from ground-truth cross-camera annotations. Nevertheless, even with a considerable portion of outliers, MCNL still holds an advantage. On the other hand, when they are annotated under the SCT setting, i.e., images of the same person from different cameras are assigned different labels, MCNL is quite robust, with only a small accuracy drop.

This result further demonstrates that MCNL improves ReID accuracy under the SCT setting with great robustness against outliers. In real-world applications, it is easy to keep the portion of outliers at a low ratio.

Figure 4: Rank-1 accuracy (%) on Duke-SCT with randomly selected cross-camera persons. MCNL shows great robustness against outliers. Solid or dashed line: whether the model receives accurate annotations.
Methods | Ref. | Duke-SCT Rank-1 | Duke-SCT mAP | Market-SCT Rank-1 | Market-SCT mAP
Center Loss [30] | ECCV'16 | 38.7 | 23.2 | 40.3 | 18.5
A-Softmax [16] | CVPR'17 | 34.8 | 22.9 | 41.9 | 23.2
ArcFace [4] | CVPR'19 | 35.8 | 22.8 | 39.4 | 19.8
PCB [23] | ECCV'18 | 32.7 | 22.2 | 43.5 | 23.5
Suh's method [21] | ECCV'18 | 38.5 | 25.4 | 48.0 | 27.3
MGN [26] | ACMMM'18 | 27.1 | 18.7 | 38.1 | 24.7
MCNL | This paper | 66.4 | 45.3 | 66.2 | 40.6
Table 4: Comparisons of ReID accuracy (%) when training with SCT datasets. MCNL reports the best performance on SCT datasets while other methods undergo a dramatic accuracy drop.
Methods | Ref. | Labels | Duke Rank-1 | Duke mAP | Market Rank-1 | Market mAP
BOW [34] | ICCV'15 | None | 17.1 | 8.3 | 35.8 | 14.8
DECAMEL [31] | TPAMI'18 | None | - | - | 60.2 | 32.4
BUC [14] | AAAI'19 | None | 47.4 | 27.5 | 66.2 | 38.3
TAUDL [11] | ECCV'18 | Tracklet | 61.7 | 43.5 | 63.7 | 41.2
TJ-AIDL [27] | CVPR'18 | Transfer | 44.3 | 23.0 | 58.2 | 26.5
SPGAN [5] | CVPR'18 | Transfer | 46.9 | 26.4 | 58.1 | 26.9
HHL [40] | ECCV'18 | Transfer | 46.9 | 27.2 | 62.2 | 31.4
MAR [32] | CVPR'19 | Transfer | 67.1 | 48.0 | 67.7 | 40.0
ECN [41] | CVPR'19 | Transfer | 63.3 | 40.4 | 75.1 | 43.0
MCNL | This paper | SCT | 66.4 | 45.3 | 66.2 | 40.6
MCNL+MAR [32] | This paper | Transfer+SCT | 71.4 | 53.3 | 72.3 | 48.0
MCNL+ECN [41] | This paper | Transfer+SCT | 67.3 | 45.5 | 76.3 | 51.2
Table 5: ReID accuracy (%) comparisons to UT methods. None denotes purely unsupervised training without any labels. Transfer denotes utilizing other labeled source datasets and unlabeled target datasets. Tracklet denotes using tracklet labels.

5.4 Comparison to Previous Work

Comparisons to FST methods. We evaluate a few popular FST methods under the SCT setting and compare our method with other advanced metric learning algorithms. As shown in Tab. 4, previous state-of-the-art methods for the FST setting fail dramatically under the SCT setting while MCNL shows great advantages. This is because, without cross-camera annotations, these methods are unable to extract camera-unrelated features.

Comparisons to UT methods. Our motivation for the SCT setting is the fast deployment of ReID systems on new target scenes, which is also the motivation of the UT setting. Thus, as shown in Tab. 5, we also compare MCNL with previous unsupervised training methods, including purely unsupervised methods, a tracklet association learning method, and domain adaptation methods.

Compared to the state-of-the-art purely unsupervised methods (labels denoted as None), our proposed MCNL significantly outperforms BUC [14], with a gain of 19.0% in Rank-1 accuracy on Duke. As for TAUDL [11], which uses tracklet labels, the entire training sets are used to train the models. Our method constructs SCT datasets for training and thus uses only a small portion of the training data, but still surpasses TAUDL by 4.7% and 2.5% in Rank-1 accuracy on Duke and Market, respectively.

Recently, many domain adaptation methods that use other labeled datasets for extra supervision have obtained good ReID accuracy. Our MCNL alone achieves competitive results compared to them. Moreover, our method is complementary to current domain adaptation methods and can be easily combined with them by replacing the target datasets with SCT datasets. Such a combination instantly brings significant improvement. Take MAR [32] and ECN [41] as examples: after using SCT data and MCNL in MAR, Rank-1 accuracy improves by 4.3% and mAP by 5.3% on Duke; the combination of MCNL and ECN improves mAP on Market by 8.2% compared to ECN alone. Taking advantage of reliable target-domain annotations and extra transferred information, we achieve the best ReID performance on Duke and Market, respectively. Implementation details about the combination of MCNL and MAR/ECN are included in the supplemental material.

Note that our method achieves good performance with much less training data. Since the SCT setting gives up collecting cross-camera pedestrian images, such training data can be easily collected and annotated. Therefore, compared with prior work, our method and the proposed SCT setting are more suitable for fast deployment of ReID systems with good performance on new target scenes.

6 Conclusions

In this paper, we explore a new setting named single-camera-training (SCT) for person ReID. With the advantage of low costs in data collection and annotation, SCT lays the foundation of fast deployment of ReID systems in new environments. To work under SCT, we propose a novel loss term named multi-camera negative loss (MCNL). Experiments demonstrate that under SCT, the proposed approach boosts ReID performance of existing approaches by a large margin.

Our approach reveals the possibility of learning cross-camera correspondence without cross-camera annotations. In the future, we will explore more cues to leverage under the SCT setting and consider the mixture of single-camera and cross-camera annotations to further improve ReID accuracy.

References

  • [1] T. Caliński and J. Harabasz (1974) A dendrite method for cluster analysis. Communications in Statistics-theory and Methods 3 (1), pp. 1–27. Cited by: §5.3.
  • [2] W. Chen, X. Chen, J. Zhang, and K. Huang (2017) Beyond triplet loss: a deep quadruplet network for person re-identification. In CVPR, Cited by: §2.1.
  • [3] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR, Cited by: §5.2.
  • [4] J. Deng, J. Guo, X. Niannan, and S. Zafeiriou (2019) ArcFace: additive angular margin loss for deep face recognition. In CVPR, Cited by: Table 4.
  • [5] W. Deng, L. Zheng, G. Kang, Y. Yang, Q. Ye, and J. Jiao (2018) Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person reidentification. In CVPR, Cited by: §1, §2.2, Table 5.
  • [6] D. Gray and H. Tao (2008) Viewpoint invariant pedestrian recognition with an ensemble of localized features. In ECCV, Cited by: §2.2.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §4.1, §5.2.
  • [8] A. Hermans, L. Beyer, and B. Leibe (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737. Cited by: §2.1, §4.2, §5.2, §5.3, Table 3.
  • [9] M. Keuper, S. Tang, B. Andres, T. Brox, and B. Schiele (2018) Motion segmentation & multiple object tracking by correlation co-clustering. IEEE transactions on pattern analysis and machine intelligence. Cited by: §1, §3.
  • [10] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.2.
  • [11] M. Li, X. Zhu, and S. Gong (2018) Unsupervised person re-identification by deep learning tracklet association. In ECCV, Cited by: §2.2, §5.4, Table 5.
  • [12] C. Liang, B. Huang, R. Hu, C. Zhang, X. Jing, and J. Xiao (2015) A unsupervised person re-identification method using model based representation and ranking. In ACM MM, Cited by: §2.2.
  • [13] S. Liao, Y. Hu, X. Zhu, and S. Z. Li (2015) Person re-identification by local maximal occurrence representation and metric learning. In CVPR, Cited by: §2.2.
  • [14] Y. Lin, X. Dong, L. Zheng, Y. Yan, and Y. Yang (2019) A bottom-up clustering approach to unsupervised person re-identification. In AAAI, Cited by: §1, §2.2, §5.4, Table 5.
  • [15] J. Liu, B. Ni, Y. Yan, P. Zhou, S. Cheng, and J. Hu (2018) Pose transferrable person re-identification. In CVPR, Cited by: §2.1.
  • [16] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song (2017) SphereFace: deep hypersphere embedding for face recognition. In CVPR, Cited by: Table 4.
  • [17] W. Luo, B. Stenger, X. Zhao, and T. Kim (2019) Trajectories as topics: multi-object tracking by topic discovery. IEEE Transactions on Image Processing 28 (1), pp. 240–252. Cited by: §1, §3.
  • [18] P. Peng, T. Xiang, Y. Wang, M. Pontil, S. Gong, T. Huang, and Y. Tian (2016) Unsupervised cross-dataset transfer learning for person re-identification. In CVPR, Cited by: §2.2.
  • [19] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In CVPR, Cited by: §4.2.
  • [20] H. Shi, Y. Yang, X. Zhu, S. Liao, Z. Lei, W. Zheng, and S. Z. Li (2016) Embedding deep metric for person re-identification: a study against large variations. In ECCV, Cited by: §4.2.
  • [21] Y. Suh, J. Wang, S. Tang, T. Mei, and K. M. Lee (2018) Part-aligned bilinear representations for person re-identification. In ECCV, Cited by: §1, Table 4.
  • [22] Y. Sun, L. Zheng, W. Deng, and S. Wang (2017) SVDNet for pedestrian retrieval. In ICCV, Cited by: §2.1.
  • [23] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang (2018) Beyond part models: person retrieval with refined part pooling (and a strong convolutional baseline). In ECCV, Cited by: §1, Table 4.
  • [24] L. Van Der Maaten (2014) Accelerating t-SNE using tree-based algorithms. The Journal of Machine Learning Research 15 (1), pp. 3221–3245. Cited by: §5.3.
  • [25] G. Wang, J. Lai, P. Huang, and X. Xie (2019) Spatial-temporal person re-identification. In AAAI, Cited by: §1.
  • [26] G. Wang, Y. Yuan, X. Chen, J. Li, and X. Zhou (2018) Learning discriminative features with multiple granularities for person re-identification. In ACM MM, Cited by: Table 4.
  • [27] J. Wang, X. Zhu, S. Gong, and W. Li (2018) Transferable joint attribute-identity deep learning for unsupervised person re-identification. In CVPR, Cited by: §1, §2.2, Table 5.
  • [28] L. Wei, S. Zhang, W. Gao, and Q. Tian (2018) Person transfer gan to bridge domain gap for person re-identification. In CVPR, Cited by: §1, §2.2, §3, A. Implementation Details of MCNL+MAR/ECN.
  • [29] L. Wei, S. Zhang, H. Yao, W. Gao, and Q. Tian (2017) GLAD: global-local-alignment descriptor for pedestrian retrieval. In ACM MM, Cited by: §2.1.
  • [30] Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In ECCV, Cited by: §2.1, Table 4.
  • [31] H. Yu, A. Wu, and W. Zheng (2018) Unsupervised person re-identification by deep asymmetric metric embedding. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: Table 5.
  • [32] H. Yu, W. Zheng, A. Wu, X. Guo, S. Gong, and J. Lai (2019) Unsupervised person re-identification by soft multilabel learning. In CVPR, Cited by: §5.4, Table 5, A. Implementation Details of MCNL+MAR/ECN.
  • [33] X. Zhang, H. Luo, X. Fan, W. Xiang, Y. Sun, Q. Xiao, W. Jiang, C. Zhang, and J. Sun (2017) Alignedreid: surpassing human-level performance in person re-identification. arXiv preprint arXiv:1711.08184. Cited by: §2.1.
  • [34] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: a benchmark. In ICCV, Cited by: §1, §2.2, §5.1, Table 5.
  • [35] L. Zheng, Y. Yang, and A. G. Hauptmann (2016) Person re-identification: past, present and future. arXiv preprint arXiv:1610.02984. Cited by: §2.1.
  • [36] M. Zheng, S. Karanam, and R. J. Radke (2018) RPIfield: a new dataset for temporally evaluating person re-identification. In CVPR Workshops, Cited by: §1, §3.
  • [37] Z. Zheng, L. Zheng, and Y. Yang (2017) A discriminatively learned cnn embedding for person reidentification. ACM Transactions on Multimedia Computing, Communications, and Applications 14 (1), pp. 13. Cited by: §2.1.
  • [38] Z. Zheng, L. Zheng, and Y. Yang (2017) Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In ICCV, Cited by: §1, §4.1, §5.1.
  • [39] Z. Zheng, L. Zheng, and Y. Yang (2018) Pedestrian alignment network for large-scale person re-identification. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §1.
  • [40] Z. Zhong, L. Zheng, S. Li, and Y. Yang (2018) Generalizing a person retrieval model hetero-and homogeneously. In ECCV, Cited by: §2.2, Table 5.
  • [41] Z. Zhong, L. Zheng, Z. Luo, S. Li, and Y. Yang (2019) Invariance matters: exemplar memory for domain adaptive person re-identification. In CVPR, Cited by: §2.2, §5.4, Table 5, A. Implementation Details of MCNL+MAR/ECN.

Supplemental Material

A. Implementation Details of MCNL+MAR/ECN

MCNL+MAR. MAR [32] utilizes a large-scale reference dataset (MSMT17 [28]) to learn soft multilabels for unlabeled target datasets. We first replace the target datasets with their SCT versions, i.e., Duke-SCT and Market-SCT. Then, we add MCNL to the loss function with a hyperparameter $\lambda$ that balances the values of the losses. Denoting the original loss function of MAR as

$\mathcal{L}_{\mathrm{MAR}}, \quad (6)$

the loss function of MCNL+MAR is:

$\mathcal{L}_{\mathrm{MCNL+MAR}} = \mathcal{L}_{\mathrm{MAR}} + \lambda\, \mathcal{L}_{\mathrm{MCNL}}. \quad (7)$

We keep all hyperparameters the same as in MAR and set $\lambda$ separately for Market-SCT and Duke-SCT. Apart from the target training data and the loss function, nothing has been changed.

MCNL+ECN. ECN [41] exploits three invariance factors for unsupervised person ReID. We first replace the target training data with the SCT versions. During training, ECN uses a GAN model to generate fake images with different camera styles; we only use those fake images whose real counterparts are in our sampled SCT datasets. Then, we add MCNL to the loss function. Denoting the original loss function of ECN as

$\mathcal{L}_{\mathrm{ECN}}, \quad (8)$

the loss function of MCNL+ECN is:

$\mathcal{L}_{\mathrm{MCNL+ECN}} = \mathcal{L}_{\mathrm{ECN}} + \lambda\, \mathcal{L}_{\mathrm{MCNL}}. \quad (9)$

We set $\lambda$ accordingly and keep the other hyperparameters the same as in ECN. Moreover, after introducing the SCT data with reliable labels, the exemplar memory of ECN is organized as a key-value memory according to the identity labels. The original purpose of exemplar-invariance is to classify each image into its own class; after the combination, the exemplar-invariance loss classifies each identity. Since the exemplar-invariance loss now pulls positive pairs close, the neighborhood-invariance loss is no longer needed. We change the exemplar-invariance loss accordingly and remove the neighborhood-invariance loss.
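
A hypothetical sketch of the combination in Eqs. (7) and (9): the original objective is kept, and MCNL on the SCT-labeled target batch is added with weight $\lambda$. The base_criterion call is a simplification (MAR and ECN take additional inputs in practice), and mcnl_loss refers to the sketch given in Sec. 4.2.

```python
def combined_step(model, base_criterion, batch, lam, optimizer):
    """One illustrative training step for MCNL+MAR or MCNL+ECN (Eqs. (7) and (9))."""
    images, ids, cams = batch                     # SCT-labeled target batch
    emb = model(images)
    loss = base_criterion(emb) + lam * mcnl_loss(emb, ids, cams)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```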

B. Visualization of Feature Distributions

We also compare MCNL and Triplet to see how they differ during the training process. From Fig. 5, we can conclude that at the very beginning of training, Triplet learns more camera-related features, while MCNL quickly discards camera information and learns more person-related features. This observation further demonstrates that MCNL can alleviate the camera isolation problem and produce a unified feature space across all cameras.