
Weakly Supervised Person Search with Region Siamese Networks

Supervised learning is dominant in person search, but it requires elaborate labeling of bounding boxes and identities. Large-scale labeled training data are often difficult to collect, especially for person identities. A natural question is whether a good person search model can be trained without identity supervision. In this paper, we present a weakly supervised setting where only bounding box annotations are available. Based on this new setting, we provide an effective baseline model termed Region Siamese Networks (R-SiamNets). Towards learning useful representations for recognition in the absence of identity labels, we supervise the R-SiamNet with an instance-level consistency loss and a cluster-level contrastive loss. For instance-level consistency learning, the R-SiamNet is constrained to extract consistent features from each person region with or without out-of-region context. For cluster-level contrastive learning, we enforce the aggregation of the closest instances and the separation of dissimilar ones in feature space. Extensive experiments validate the utility of our weakly supervised method. Our model achieves a rank-1 accuracy of 87.1%, surpassing several fully supervised methods, such as OIM and MGTS, by a clear margin. More promising performance can be reached by incorporating extra training data. We hope this work will encourage future research in this field.



1 Introduction

Figure 1: Comparison between the two settings. (a) Fully supervised setting: the images are annotated with both bounding boxes and person identities. Note that some identity annotations are missing in the original person search datasets. (b) The proposed weakly supervised setting: the images only have bounding box annotations.

Person search [36] aims to localize and recognize a query person from a gallery of unconstrained scene images. Despite tremendous progress achieved by recent works [36, 35, 26, 37, 11, 10, 5, 44], the training process requires strong supervision in terms of bounding boxes and identity labels, as shown in Fig. 1(a). However, obtaining such annotations at a large scale can be time-consuming and economically expensive. Even in the most widely used dataset, CUHK-SYSU [36], almost 72.7% of pedestrian bounding boxes have no identity annotations. This indicates that labeling identities is more difficult than labeling bounding boxes. The infeasibility of identity annotation largely restricts the scalability of supervised methods.

Instead of relying on expensive labeling, many researchers have been dedicated to training models with no labels [17, 7] or incomplete labels [27, 45] in areas such as image recognition and object detection. Nevertheless, relevant explorations are missing in the field of person search. To fill the gap, we investigate person search modeling in a weakly supervised setting, where only bounding box annotations are required. As shown in Fig. 1(b), the proposed setting alleviates the burden of obtaining manually labeled identities. However, it is more challenging to pursue accurate person search using bounding box annotations alone.

In this paper, we set up a strong weakly supervised baseline called Region Siamese Networks (R-SiamNets). Towards learning meaningful feature representations of each instance, our model minimizes the discrepancy between two encoded features transformed from the same pedestrian region. Specifically, one branch is fed with the whole scene image and extracts the RoI features of the person instance. The other branch extracts features from the cropped person image. The obtained features from the two weight-sharing branches are constrained to be consistent. The motivation of this design is that the context-free instance features extracted from the cropped image regions can help the network distinguish persons from irrelevant background content. We formulate a self-instance consistency loss and an inter-instance similarity consistency loss to supervise the learning of context-invariant feature representations.

Further, a cluster-level contrastive learning method is introduced to strike a balance between separation and clustering. The clustering method aggregates the closest instances together and pushes instances from different clusters apart. It is assumed that the closest features have a high probability of being from the same individual. The pseudo labels generated by clustering are used for contrastive learning. We iteratively apply this non-parametric clustering for refinement along with the training process. The cluster-level contrastive learning yields a significant performance gain, with an absolute mAP improvement of 6.1% on the CUHK-SYSU dataset (cf. Tab. 2).

Our contributions are three-fold:

  • We introduce a weakly-supervised setting for person search. The new setting only requires bounding box annotations, relieving the burden of obtaining manually labeled identities. With this setting, the developed algorithms could be readily utilized for large-scale person search in real-world scenarios.

  • We propose the R-SiamNet as a baseline under the weakly supervised setting. With the Siamese network, instance-level consistency learning is applied to encourage context-invariant representations. Further, cluster-level contrastive learning is introduced for striking a balance between separation and clustering.

  • Our R-SiamNet achieves rank-1 accuracies of 87.1% and 75.2% on the CUHK-SYSU and PRW datasets, respectively. The results outperform several supervised methods by a clear margin, e.g., OIM [36] and MGTS [4]. More promisingly, the performance is further improved when extra training data are incorporated.

2 Related Work

Person Search.

Recently, the person search task has raised a lot of interest in the computer vision community [36, 43, 4, 3, 21, 5, 42]. In the literature, there are two manners of dealing with this problem, i.e., two-step and one-step.

In two-step methods, pedestrian detection and person re-identification models are trained separately [43, 4, 3, 21, 15]. Zheng et al. [43] evaluate various combinations of different detectors and re-ID networks, and develop a Confidence Weighted Similarity (CWS) to assist pedestrian matching. Chen et al. [4] enhance the feature representations by introducing a mask-guided two-stream model. Han et al. [15] develop an RoI transform layer to optimize the two networks end-to-end. With more parameters, these methods achieve high accuracy at the cost of low efficiency during evaluation.

One-step methods [36, 35, 16, 5] jointly train the detection and re-ID in a unified model, exhibiting high efficiency. Among these methods, [36, 35] take the Faster R-CNN [28] as a backbone network, and most layers are shared by two tasks. Munjal et al. [26] first introduce a query-guided end-to-end person search network. With the global context from both query and gallery images, the well-designed framework generates query-relevant proposals and learns query-guided re-ID scores. Yan et al. [37] explore the contextual information and build a graph learning framework to employ context pairs to update target similarity. Dong et al. [10] develop a bi-directional interaction network and employ the cropped person patches as the guidance to reduce redundant context influences.

These studies are fully supervised and require precise annotations for each person, including the bounding boxes and the identities. It is impractical to extend these methods in large-scale scenarios due to the expensive labeling process. Thus, we introduce a weakly supervised setting and develop a framework trained solely with bounding boxes.

Siamese Networks. Siamese networks [2] consist of twin networks that accept distinct inputs, and the comparability is determined by supervision. This architecture is widely used in many fields, including object tracking [1], one-shot learning [20], signature verification [2], and face verification [31]. In this paper, we explore context-invariant embeddings based on a region Siamese network with two forms of inputs, i.e., whole scene images and cropped images.

Contrastive Learning. Contrastive learning [14] aims at attracting positive sample pairs and repulsing negative ones, and has been popularized in recent unsupervised learning [19, 34, 39, 17, 7, 8]. Wu et al. [34] consider each instance as a class, and employ a memory bank to store the instance embeddings. Similar to [34], Ye et al. [39] learn data-augmentation-invariant and instance spread-out features. MoCo [17] maintains the dictionary with a queue and a moving-averaged encoder, so contrastive learning is viewed as dictionary look-up. SimCLR [7] deprecates the usage of the memory bank, and directly uses negative samples in the current batch. In this paper, rather than independently penalizing the incompatibility of each single positive pair at a time, we construct more informative positive pairs by non-parametric clustering.

Unsupervised Person Re-identification. Traditional unsupervised methods can be summarized into three categories: designing hand-crafted features, utilizing localized salience statistics, and dictionary learning-based works. However, the performance of these methods is inferior to supervised ones. On the basis of clustering algorithms, recent works exhibit higher performance. Lin et al. [24] develop a bottom-up clustering framework to iteratively train the network with pseudo labels. Zeng et al. [40] combine the hierarchical clustering method with a hard-batch triplet loss. Ge et al. [13] design a self-paced contrastive learning strategy with a novel clustering reliability criterion to filter unstable clusters with DBSCAN [12]. Although the cluster-based methods achieve high performance, they require carefully adjusted hyper-parameters, e.g., the number of clusters [24, 40] or the distance threshold [12].

In this paper, we employ a non-parametric clustering method [29] with a filtering mechanism based on image information. This manner relies only on the first neighbor of each data point, and requires no hyper-parameters.

3 Proposed Method

In this section, we first introduce the overall region Siamese network in Sec. 3.1. Then we describe the instance-level consistency learning in Sec. 3.2. Finally, the cluster-level contrastive learning is developed in Sec. 3.3.

3.1 Region Siamese Network

Our main target is to locate person positions and learn representative features for identification. Under the weakly supervised setting, only the bounding box annotations are employed in the training process. Without manually labeled identities, it is essential to design surrogate supervision signals for training the network. To achieve this goal, we develop the framework from two aspects: 1) Based on the Siamese architecture, consistency supervision is imposed across differently augmented inputs; in this paper, both whole scene images and cropped images are taken as inputs, and instance-level consistency learning encourages context-invariant embeddings. 2) Pseudo labels generated by clustering permit modeling cluster-level supervision; thus, we apply cluster-level contrastive learning, reaching a balance between separation and aggregation.

Based on these considerations, we propose a region Siamese network, as shown in Fig. 2. There are two branches, termed the search path and the instance path. With an extra detection head, the search path takes the whole scene images as inputs, training detection and identification jointly. The feature embedding of each pedestrian is generated by the Region-of-Interest (RoI) align layer. The instance path takes the cropped images as inputs; with less context, the corresponding outputs can focus on the regions of pedestrians. To ensure context-invariant embeddings, we apply instance-level consistency learning on the outputs of the two paths, consisting of a self-instance consistency loss and an inter-instance similarity consistency loss. Besides, to model cluster-level supervision, we employ a non-parametric clustering method based on the nearest neighbor of each sample. A cluster-level contrastive loss is computed between the samples in the current batch and the memory bank. During inference, we only utilize the search path.

Figure 2: Illustration of our R-SiamNet for weakly supervised person search. Given whole scene images, detection and identification are trained jointly with the backbone in the search path. The features of pedestrians are produced by the RoI align layer. Meanwhile, we introduce an instance path that takes the cropped persons as inputs and extracts features through the same layers. To encourage context-invariant features, a self-instance consistency loss and an inter-instance similarity consistency loss are applied between the features of the two paths. Besides, pseudo labels are produced with the non-parametric clustering. We calculate the cluster-level contrastive loss between the averaged features and the embeddings in the memory bank. Note that only the search path is utilized in testing.

3.2 Instance-Level Consistency Learning

With no identity annotations, we observe that the learned instance features tend to absorb excessive context. As a fine-grained task, the retrieval process is easily affected by interference from surrounding persons and noise. To alleviate this issue, we input the cropped pedestrians, which contain less context, to build the supervision. The instance-level consistency learning is developed to encourage context-invariant embeddings from two aspects.

Self-instance consistency loss. Given a mini-batch of scene images, we obtain the cropped pedestrian images with the bounding box labels. The scene images and the cropped ones are processed by the region Siamese network. The output embeddings of the $k$-th instance are denoted as $f^s_k$ and $f^c_k$ for the search path and the instance path, respectively. To encourage context-invariant embeddings for a specific instance, we maximize the cosine similarity between $f^s_k$ and $f^c_k$. The self-instance consistency loss is defined as:

$$\mathcal{L}_{s} = -\frac{1}{N} \sum_{k=1}^{N} \frac{f^s_k \cdot f^c_k}{\|f^s_k\|_2 \, \|f^c_k\|_2}, \qquad (1)$$

where $\|\cdot\|_2$ is the L2 norm, and the loss is averaged over all $N$ instances in a mini-batch.
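As an illustration, the self-instance consistency term can be sketched in a few lines of NumPy; the function name and the exact negative-cosine form are our assumptions, not the authors' released code:

```python
import numpy as np

def self_instance_consistency_loss(f_search, f_crop, eps=1e-12):
    """Negative mean cosine similarity between the paired embeddings of the
    search path and the instance path (one row per person instance)."""
    a = f_search / (np.linalg.norm(f_search, axis=1, keepdims=True) + eps)
    b = f_crop / (np.linalg.norm(f_crop, axis=1, keepdims=True) + eps)
    return -np.mean(np.sum(a * b, axis=1))

# Identical embeddings from both paths give the minimum loss of -1.
x = np.random.randn(4, 8)
assert np.isclose(self_instance_consistency_loss(x, x), -1.0)
```

Minimizing this quantity maximizes the cosine similarity between the two views of every instance.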

Inter-instance similarity consistency loss. The feature embeddings in each path can be seen as an aggregated feature space. The above self-instance consistency loss only constrains the embedding pairs from the same person to be closer individually. We further apply a constraint to align the similarity distributions of the entire feature spaces. For the search path, the similarity matrix is obtained as $S^s = \hat{F}^s (\hat{F}^s)^\top$, where $\hat{F}^s$ is produced by a row-wise L2 normalization on the stacked embeddings. In the same way, $S^c$ is calculated for the instance path. Our goal is to keep the consistency between the two similarity distributions. Based on the KL divergence, we develop an inter-instance similarity consistency loss as follows:

$$\mathcal{L}_{sim} = \frac{1}{2} \left[ D_{KL}\big(\sigma(S^s) \,\|\, \sigma(S^c)\big) + D_{KL}\big(\sigma(S^c) \,\|\, \sigma(S^s)\big) \right], \qquad (2)$$

where $\sigma$ denotes a row-wise softmax that turns similarities into distributions, and $D_{KL}(P \,\|\, Q) = \sum_j P_j \log (P_j / Q_j)$ denotes the KL divergence, computed row by row and averaged. Since each distribution can be considered as the target, we adopt the mutual manner.
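A minimal NumPy sketch of this mutual-KL constraint; the row-wise softmax and the helper names are our assumptions about the exact formulation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def similarity_consistency_loss(f_search, f_crop, eps=1e-12):
    """Symmetric KL divergence between the pairwise cosine-similarity
    distributions of the search path and the instance path."""
    a = f_search / (np.linalg.norm(f_search, axis=1, keepdims=True) + eps)
    b = f_crop / (np.linalg.norm(f_crop, axis=1, keepdims=True) + eps)
    p, q = softmax(a @ a.T), softmax(b @ b.T)   # row-wise distributions
    kl = lambda u, v: np.sum(u * (np.log(u + eps) - np.log(v + eps)), axis=1)
    return 0.5 * np.mean(kl(p, q) + kl(q, p))

# Identical feature spaces yield a zero loss.
x = np.random.randn(4, 8)
assert np.isclose(similarity_consistency_loss(x, x), 0.0)
```

Unlike the self-instance term, this loss compares whole mini-batch similarity structures rather than individual pairs.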

3.3 Cluster-Level Contrastive Learning

Instance recognition [34] treats each sample as a category to produce well-separated samples. In the person search task, this approach may be inferior. We aim at exploring the similarities among persons and striking a balance between separation and aggregation. A non-parametric clustering method is employed to produce pseudo labels, and a cluster-level contrastive loss is applied between samples in the current batch and the memory bank.

Non-parametric clustering. We build the cluster-level supervision based on the assumption that the most similar feature embeddings have high probabilities of belonging to the same class. This motivates our clustering manner, in which only the nearest neighbor of each sample is aggregated. Meanwhile, there is a prior that the pedestrians in the same scene image belong to different identities. Thus, we can filter some false aggregations when clustering. After clustering, all persons are assigned pseudo labels.

Specifically, at the beginning of each epoch, the clustering process is conducted after extracting the embeddings of all instances. Supposing there are $N$ samples, an adjacency matrix $A \in \{0, 1\}^{N \times N}$ can be constructed among all samples, which is initialized with all zeros. To set $A(i, j) = 1$, two conditions should be satisfied simultaneously:

1) $j = \kappa_i$, or $i = \kappa_j$, or $\kappa_i = \kappa_j$, where $\kappa_i$ denotes the first neighbor of sample $i$. Besides the nearest neighbors, the adjacency matrix also links points that share the same first neighbor.

2) The whole scene images of the $i$-th and $j$-th pedestrians should be different. If two persons come from the same scene image, they cannot be clustered.
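The two linking rules plus the same-image filter can be prototyped as follows; this is a NumPy sketch of one round of first-neighbor linking in the spirit of [29], and the function name and union-find bookkeeping are ours:

```python
import numpy as np

def first_neighbor_clusters(features, image_ids, eps=1e-12):
    """Assign pseudo labels by linking each sample to its first neighbor,
    skipping links between pedestrians from the same scene image."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)                       # a sample is not its own neighbor
    sim[np.equal.outer(image_ids, image_ids)] = -np.inf  # same-image filter
    nn = sim.argmax(axis=1)                              # first neighbor of every sample

    n = len(features)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        parent[find(i)] = find(nn[i])                    # rule 1: j = kappa_i (covers i = kappa_j)
        for j in range(i + 1, n):
            if nn[i] == nn[j] and image_ids[i] != image_ids[j]:
                parent[find(i)] = find(j)                # rule 1: shared first neighbor
    roots = [find(i) for i in range(n)]
    remap = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return np.array([remap[r] for r in roots])
```

Because only the first neighbor is used, no cluster count or distance threshold has to be tuned.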

Different from other clustering algorithms that require carefully adjusted hyperparameters [24, 13, 40], e.g., cluster numbers or distance threshold, we employ the non-parametric clustering approach. It is easily scalable to large-scale data with minimal computational expenses.

Cluster-level contrastive loss. Similar to previous works [36, 5], we maintain a memory bank $M \in \mathbb{R}^{N \times d}$ to store the embeddings of all $N$ instances, where $d$ denotes the feature dimension. After clustering, the memory bank is updated with the newly embedded features along with their pseudo labels. Then, a cluster-level contrastive loss can be computed between the features in the memory bank and the current batch.

In a mini-batch, a specific instance feature is denoted as $f_k$. Assume that there are $N_p$ positive samples in the memory bank sharing the same pseudo label with $f_k$; the remaining $N_n$ samples in $M$ are considered as negative samples. The cosine similarities to the positives and negatives are denoted as $S_p$ and $S_n$, respectively. Inspired by [30], we apply the cluster-level contrastive loss to make each $S_p$ greater than each $S_n$:

$$\mathcal{L}_{c} = \log \left[ 1 + \sum_{i=1}^{N_p} \sum_{j=1}^{N_n} \exp\big(s \, (S_n^j - S_p^i)\big) \right], \qquad (3)$$

where $s$ is the scale factor. During the backward pass, the memory bank is updated with the samples in the current mini-batch: $m_k \leftarrow \mu \, m_k + (1 - \mu) f_k$, where $\mu$ is the momentum factor and $k$ denotes the instance position in the memory bank.
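A sketch of the loss and the momentum update for a single query instance; the pairwise log-sum-exp form (pushing each $S_p$ above each $S_n$) and the function names are our assumptions, not the authors' released code:

```python
import numpy as np

def cluster_contrastive_loss(f, label, memory, memory_labels, scale=10.0, eps=1e-12):
    """Push every positive similarity S_p above every negative S_n
    via a pairwise log-sum-exp loss with scale factor s."""
    q = f / (np.linalg.norm(f) + eps)
    m = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + eps)
    sims = m @ q                              # cosine similarity to every memory slot
    s_p = sims[memory_labels == label]        # same pseudo label
    s_n = sims[memory_labels != label]        # all remaining slots
    return np.log1p(np.sum(np.exp(scale * (s_n[None, :] - s_p[:, None]))))

def update_memory(memory, k, f, momentum=0.5):
    """Momentum update of the k-th memory slot during the backward pass."""
    memory[k] = momentum * memory[k] + (1.0 - momentum) * f
    return memory
```

The loss is near zero when positives already dominate negatives, and grows roughly linearly in the worst similarity gap otherwise.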

Input: Unlabeled data with bounding box annotations; scale factor $s$; momentum $\mu$.

Initialize the backbone with ImageNet-pretrained ResNet-50.

for each epoch do
        1: Extract all the instance features.
        2: Conduct the non-parametric clustering.
        3: Update the features in the memory bank.
        for each mini-batch do
               1: Encode the instance features of the two paths through the R-SiamNet.
               2: Compute the self-instance consistency loss with Eq. 1.
               3: Compute the inter-instance similarity consistency loss with Eq. 2.
               4: Compute the cluster-level contrastive loss with Eq. 3.
               5: During the backward pass, update the features in the memory bank.
        end for
end for
Algorithm 1 Training procedure of R-SiamNet

3.4 Training Procedure

Given the input images, we aim to learn a deep convolutional neural network (CNN) model to realize precise localization and identification. The details of our training procedure are provided in Algorithm 1. In summary, our total training objective is formulated as:

$$\mathcal{L} = \mathcal{L}_{det} + \mathcal{L}_{s} + \mathcal{L}_{sim} + \mathcal{L}_{c},$$

where $\mathcal{L}_{det}$ denotes the detection losses, including the regression loss and the foreground-background classification loss.

4 Experiments

In this section, we first introduce the two benchmark datasets, followed by the settings under the weakly supervised manner and the evaluation protocols in Sec. 4.1. Then Sec. 4.2 describes the implementation details and reproducibility. We conduct a series of ablation studies to analyze the effectiveness of the proposed method in Sec. 4.3. Finally, we discuss the comparison with the state of the art in Sec. 4.4.

4.1 Datasets and Settings

CUHK-SYSU dataset. CUHK-SYSU [36] is a large-scale person search dataset composed of urban scene pictures and movie snapshots. There are 18,184 images with 96,143 annotated bounding boxes, including 8,432 labeled identities. The training set consists of 11,206 images with 5,532 identities and several unlabeled ones. There are 6,978 gallery images and 2,900 probe images in the testing set.

PRW dataset. PRW [43] is captured by six spatially disjoint cameras on a university campus. It consists of 11,816 frames with 43,110 annotated bounding boxes, among which 34,304 are assigned identity labels and the rest are unlabeled. The training set contains 5,704 frames with 482 identities, and the testing set includes 6,112 gallery images and 2,057 queries with 450 identities.

Settings. Under the fully supervised setting, the statistics of the training data are shown in Tab. 1. In this paper, we propose a weakly supervised setting for person search, reducing the need for strong supervision during training. Under the weakly supervised setting, our model is trained using only the 55,260 and 18,048 bounding box annotations for the CUHK-SYSU and PRW datasets, respectively.

Dataset     Images  IDs   Bboxes (Labeled)  Bboxes (Unlabeled)
CUHK-SYSU   11206   5532  15080 (27.3%)     40180 (72.7%)
PRW         5704    482   14906 (82.6%)     3142 (17.4%)
Table 1: Training data statistics on the CUHK-SYSU and PRW datasets within the fully supervised settings. Bbox: Bounding box.

Evaluation protocols. Our experiments employ the standard evaluation metrics in person search [36]. One is the cumulative matching characteristic (CMC) curve, which is inherited from person re-ID. A candidate is counted as a match if its intersection-over-union (IoU) with the ground truth is greater than 0.5. The other is the mean Average Precision (mAP), inspired by the object detection task. For each query, we compute an average precision (AP) based on the precision-recall curve. Then, the mAP is calculated by averaging the APs across all queries.
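For concreteness, the IoU test and the per-query AP described above can be sketched in plain Python; the helper names are ours:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def average_precision(ranked_matches):
    """AP for one query, given match flags (IoU > 0.5 with the correct
    identity) over the ranked gallery detections."""
    hits, precisions = 0, []
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

assert iou((0, 0, 2, 2), (0, 0, 2, 2)) == 1.0
assert abs(average_precision([True, False, True]) - 5 / 6) < 1e-9
```

The reported mAP is the mean of `average_precision` over all queries, while the rank-k CMC score is the fraction of queries with at least one match among the top-k candidates.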

4.2 Implementation Details

Model. We employ RepPoints [38] released by OpenMMLab [6] as our backbone network, containing an ImageNet-pretrained ResNet-50 [18], a feature pyramid network (FPN) [22], and a detection head. The search path takes the scene images as inputs, jointly training detection and recognition. To obtain the pedestrian features, the RoI align is applied on the FPN with the ground-truth RoIs, followed by a fully connected (fc) layer after flattening. For the instance path, the cropped and resized images are taken as inputs, and the features are produced with the same network except for the RoI align. Both paths output features of the same dimension.

Training. The scene images are resized to 1333×800, and the cropped images are rescaled to a fixed resolution. The batched Stochastic Gradient Descent (SGD) optimizer is used with momentum and L2 weight decay. The model is trained for a fixed number of epochs, with the initial learning rate decayed in steps during training. The scale factor $s$ and the momentum $\mu$ are kept the same for both datasets. All experiments are implemented in the PyTorch framework, and the network is trained on an NVIDIA Tesla V100.

4.3 Ablation Study

To evaluate the effectiveness of the proposed framework, we conduct detailed ablation studies on the CUHK-SYSU and PRW datasets. Note that all the settings in each experiment are the same as the implementation in Sec. 4.2.

Effectiveness of different components. To verify the effectiveness of each component, we compare the performance under different settings in the training process. The results are shown in Tab. 2.

Instance Recognition (IR) [34] denotes that each instance is treated as a category in training. A memory bank is maintained to store all the instance features, providing abundant negative samples for computing the contrastive loss. Search path w/ IR means the supervision of IR is only applied to the instance features produced by the search path. This approach can be viewed as the baseline of our method. With the scene images as input, instance features may contain excessive context that disturbs the matching, so the mAP only reaches 51.85%. Instance path w/ IR indicates the IR is applied to the features generated by the instance path, which takes the cropped persons as inputs. In the search path, only the detection head is trained within the same backbone. Under this setting, the result reaches 63.79% mAP. Compared to the search path, the instance path contains less context, exhibiting higher performance.

R-SiamNet w/ $\mathcal{L}_s$ takes both scene images and cropped pedestrian images as inputs, and the fused outputs are supervised by IR. To encourage context-invariant feature embeddings, we apply the self-instance consistency loss $\mathcal{L}_s$ between the two paths. It maximizes the similarity of pair-wise features from the two paths. The mAP is promoted to 76.06%, surpassing both single-path baselines by a large margin. This verifies the importance of keeping consistency. For further restraint, we develop the inter-instance similarity consistency loss $\mathcal{L}_{sim}$, which is applied to the similarity distributions within the mini-batch of the two paths. This further encourages context-invariant embeddings, and achieves a gain of 3.96% on rank-1. Moreover, to explore cluster-level supervision, we implement the non-parametric clustering to produce pseudo labels. Thus, the cluster-level contrastive loss $\mathcal{L}_c$ is employed for supervision instead of IR. From Tab. 2, we can see that the performance reaches 85.72% mAP and 86.86% rank-1. The results show the effectiveness of our framework.

Method                                                       mAP    R1     R5     R10
Search path w/ IR                                            51.85  59.69  67.31  69.03
Instance path w/ IR                                          63.79  65.55  82.21  86.83
R-SiamNet w/ $\mathcal{L}_s$ & IR                            76.06  78.21  90.28  92.90
R-SiamNet w/ $\mathcal{L}_s$ & $\mathcal{L}_{sim}$ & IR      79.62  82.17  91.69  94.03
R-SiamNet w/ $\mathcal{L}_s$ & $\mathcal{L}_{sim}$ & $\mathcal{L}_c$  85.72  86.86  95.24  96.86
Table 2: Component analysis of our method. IR: Instance recognition. The rank-1/5/10 accuracy (%) and mAP (%) are shown.
Figure 3: Visualization of different methods on the CUHK-SYSU dataset. Given the query images, we show the rank-1 search results of three training approaches. The first column shows the query persons with green boxes. The instance path/search path denote that the model is trained with a single path. The last column shows the result of our region Siamese network. Red/blue boxes represent the wrong/correct results, respectively.

Scalability with different scales of training data. Our framework is designed to learn discriminative identity embeddings under the weakly supervised setting. Scalability is essential when more training data without identity labels are available. To discuss the scalability, we design the experiment from two aspects.

First, we assess the performance with different percentages of training data individually, as shown above the dotted line in Tab. 3. Specifically, we divide the CUHK-SYSU/PRW dataset by 20%, 40%, 60%, 80%, 100% for training. It can be seen that the performance is gradually improved as the training scale increases. Moreover, the growing tendency has not reached saturation on both datasets, indicating that the proposed framework can achieve further improvement with more training data.

Second, to further evaluate the scalability of our method, we expand the training set by combining different datasets. The results are shown below the dotted line in Tab. 3. When training with the CUHK-SYSU and PRW datasets together, the performance on both datasets is considerably improved. Especially for the PRW dataset, the mAP is promoted by a large margin since the added CUHK-SYSU has a larger data scale. Besides, we also employ the INRIA dataset [9] from the pedestrian detection task, whose images contain pedestrians labeled with bounding boxes. When training with the three datasets together, our performance is further promoted on both the CUHK-SYSU and PRW datasets.

All the experiments prove that our framework has the potential to reach promising performance by incorporating more training data.

Training set CUHK-SYSU PRW
R1 mAP R1 mAP
20% data 78.34 76.71 66.94 15.61
40% data 82.41 80.50 69.32 17.59
60% data 83.45 82.52 71.80 19.10
80% data 85.41 84.15 72.73 19.64
100% data 86.86 85.72 73.36 21.16
CUHK-SYSU & PRW 87.00 85.92 75.06 23.50
CUHK-SYSU & PRW & INRIA 87.59 86.19 76.03 25.53
Table 3: Performance with different scales of training set. Above the dotted line, the results with different percentages of training data are shown. Below the dotted line, it exhibits the performance of combining with more training datasets.

Visualization and Analysis. To evaluate the effectiveness of the proposed method, we illustrate some qualitative search results on the CUHK-SYSU dataset. As Fig. 3 shows, we present the rank-1 comparisons of our R-SiamNet and the other two manners. Specifically, the first column shows the query persons with green bounding boxes. Instance path/search path represents that the model employs the cropped images/whole scene images as input within a single path. The last column is the result of our R-SiamNet. The search results are exhibited in different colors, i.e., red boxes denote wrong matches while blue boxes show correct ones.

There are several observations from the visualization. First, the wrong matches in the search path are clearly different from the query but share similar contexts. This verifies the existence of excessive contexts, which disturb the matching process by involving more surrounding persons/backgrounds. Second, we find that most false examples in the instance path have a similar appearance to the query. This indicates the features are hardly disturbed by the contexts. Third, our R-SiamNet exploits the complementarity of both paths, maintaining useful context to aid the person search. The results also show the effectiveness of our method, which can successfully localize and match the query person in most cases.

4.4 Comparisons with the State-of-the-arts

In this section, we compare our proposed framework with current state-of-the-art methods on person search in Tab. 4. The results of two-step methods [3, 4, 21, 15, 33] are shown in the upper block while the one-step methods [36, 35, 25, 37, 41, 26, 5] in the lower block.

Method                    CUHK-SYSU      PRW
                          R1     mAP     R1     mAP
Fully supervised setting:
two-step RCAA [3] 81.3 79.3 - -
MGTS [4] 83.7 83.0 72.1 32.6
CLSA [21] 88.5 87.2 65.0 38.7
RDLR [15] 94.2 93.0 70.2 42.9
TCTS [33] 95.1 93.9 87.5 46.8
one-step OIM [36] 78.7 75.5 49.9 21.3
IAN [35] 80.1 76.3 61.9 23.0
NPSM [25] 81.2 77.9 53.1 24.2
CTXGraph [37] 86.5 84.1 73.6 33.4
DC-I-Net [41] 86.5 86.2 55.1 31.8
QEEPS [26] 89.1 88.9 76.7 37.1
NAE [5] 92.4 91.5 80.9 43.3
NAE+ [5] 92.9 92.1 81.1 44.0
DMRNet [16] 94.2 93.2 83.3 46.9
Weakly supervised setting:
Ours (1333*800) 86.9 85.7 73.4 21.2
Ours (1500*900) 87.1 86.0 75.2 21.4
Table 4: Experimental comparisons with state-of-the-art methods on the CUHK-SYSU and PRW datasets.

Evaluation on CUHK-SYSU. The comparisons between our network and existing supervised methods on the CUHK-SYSU dataset are shown in Tab. 4. When the gallery size is set to 100, our proposed method reaches 85.7% mAP and 86.9% rank-1. This performance outperforms several supervised methods. Moreover, when using a larger input resolution of 1500×900, our results are further improved, reaching 86.0% mAP and 87.1% rank-1.

To evaluate the performance consistency, we also compare with other competitive methods under varying gallery sizes. Fig. 4 shows the comparisons with both one-step and two-step methods. It can be seen that the performance of all methods decreases as the gallery size increases. This indicates that identity matching becomes more challenging when more distracting people are involved, which is closer to real-world applications. Our method outperforms some supervised methods under different gallery sizes.

Evaluation on PRW. We also evaluate our method on the PRW dataset, as shown in Tab. 4. Following the setting of the benchmark [43], the gallery contains all the testing images, which is challenging since a tremendous number of detected bounding boxes are involved. Our method achieves 73.4 on rank-1, which is further promoted to 75.2 with the larger resolution, surpassing most one-step and two-step works. However, the results exhibit a low mAP due to the minor inter-class variations in this dataset.

Figure 4: Comparisons with different gallery sizes on the CUHK-SYSU dataset. Our method is represented by dotted lines.

5 Conclusion

In this paper, we introduce a weakly supervised setting for the person search task to alleviate the burden of costly labeling. Under this new setting, no identity annotations of pedestrians are required, and we only utilize the accessible bounding boxes for training. Meanwhile, we propose a baseline called R-SiamNet for localizing persons and learning discriminative feature representations. To encourage context-invariant features, a self-instance consistency loss and an inter-instance similarity consistency loss are developed. We also explore the balance between separation and aggregation through a cluster-level contrastive loss. Extensive experimental results on the widely used benchmarks demonstrate the effectiveness of our framework. The results also show that the gap to the supervised state of the art narrows further with more training data.


Acknowledgments

This work was partially supported by the Project of the National Natural Science Foundation of China No. 61876210, the Fundamental Research Funds for the Central Universities No. 2019kfyXKJC024, and the 111 Project on Computational Intelligence and Intelligent Control under Grant B18024.

References

  • [1] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr (2016) Fully-convolutional siamese networks for object tracking. In European Conference on Computer Vision, Cited by: §2.
  • [2] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah (1994) Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems. Cited by: §2.
  • [3] X. Chang, P. Huang, Y. Shen, X. Liang, Y. Yang, and A. G. Hauptmann (2018) RCAA: relational context-aware agents for person search. In European Conference on Computer Vision, Cited by: §2, §2, §4.4, Table 4.
  • [4] D. Chen, S. Zhang, W. Ouyang, J. Yang, and Y. Tai (2018) Person search via a mask-guided two-stream cnn model. In European Conference on Computer Vision, Cited by: §A.6, Table 9, Weakly Supervised Person Search with Region Siamese Networks, 3rd item, §2, §2, §4.4, Table 4.
  • [5] D. Chen, S. Zhang, J. Yang, and B. Schiele (2020) Norm-aware embedding for efficient person search. In IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §A.6, Table 9, §1, §2, §2, §3.3, §4.4, Table 4.
  • [6] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, et al. (2019) MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155. Cited by: §4.2.
  • [7] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning. Cited by: §1, §2.
  • [8] X. Chen and K. He (2020) Exploring simple siamese representation learning. arXiv preprint arXiv:2011.10566. Cited by: §2.
  • [9] N. Dalal and B. Triggs (2005) Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.3.
  • [10] W. Dong, Z. Zhang, C. Song, and T. Tan (2020) Bi-directional interaction network for person search. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.
  • [11] W. Dong, Z. Zhang, C. Song, and T. Tan (2020) Instance guided proposal network for person search. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • [12] M. Ester, H. Kriegel, J. Sander, X. Xu, et al. (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, Cited by: §2.
  • [13] Y. Ge, D. Chen, F. Zhu, R. Zhao, and H. Li (2020) Self-paced contrastive learning with hybrid memory for domain adaptive object re-id. arXiv preprint arXiv:2006.02713. Cited by: §A.3, §2, §3.3.
  • [14] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [15] C. Han, J. Ye, Y. Zhong, X. Tan, C. Zhang, C. Gao, and N. Sang (2019) Re-id driven localization refinement for person search. In IEEE International Conference on Computer Vision, Cited by: §2, §4.4, Table 4.
  • [16] C. Han, Z. Zheng, C. Gao, N. Sang, and Y. Yang (2021) Decoupled and memory-reinforced networks: towards effective feature learning for one-step person search. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: §A.2, §2, Table 4.
  • [17] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.2.
  • [19] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2018) Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, Cited by: §2.
  • [20] G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop. Cited by: §2.
  • [21] X. Lan, X. Zhu, and S. Gong (2018) Person search by multi-scale matching. In European Conference on Computer Vision, Cited by: §2, §2, §4.4, Table 4.
  • [22] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.2.
  • [23] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In IEEE International Conference on Computer Vision, Cited by: §A.2.
  • [24] Y. Lin, X. Dong, L. Zheng, Y. Yan, and Y. Yang (2019) A bottom-up clustering approach to unsupervised person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: §2, §3.3.
  • [25] H. Liu, J. Feng, Z. Jie, K. Jayashree, B. Zhao, M. Qi, J. Jiang, and S. Yan (2017) Neural person search machines. In IEEE International Conference on Computer Vision, Cited by: §4.4, Table 4.
  • [26] B. Munjal, S. Amin, F. Tombari, and F. Galasso (2019) Query-guided end-to-end person search. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §A.6, Table 9, §1, §2, §4.4, Table 4.
  • [27] M. Oquab, L. Bottou, I. Laptev, and J. Sivic (2015) Is object localization for free?-weakly-supervised learning with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • [28] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, Cited by: §A.2, §2.
  • [29] S. Sarfraz, V. Sharma, and R. Stiefelhagen (2019) Efficient parameter-free clustering using first neighbor relations. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [30] Y. Sun, C. Cheng, Y. Zhang, C. Zhang, L. Zheng, Z. Wang, and Y. Wei (2020) Circle loss: a unified perspective of pair similarity optimization. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.3.
  • [31] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf (2014) Deepface: closing the gap to human-level performance in face verification. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [32] L. Van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research. Cited by: §B.1.
  • [33] C. Wang, B. Ma, H. Chang, S. Shan, and X. Chen (2020) TCTS: a task-consistent two-stage framework for person search. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.4, Table 4.
  • [34] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2, §3.3, §4.3.
  • [35] J. Xiao, Y. Xie, T. Tillo, K. Huang, Y. Wei, and J. Feng (2019) IAN: the individual aggregation network for person search. Pattern Recognition. Cited by: §1, §2, §4.4, Table 4.
  • [36] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang (2017) Joint detection and identification feature learning for person search. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Weakly Supervised Person Search with Region Siamese Networks, 3rd item, §1, §2, §2, §3.3, §4.1, §4.1, §4.4, Table 4.
  • [37] Y. Yan, Q. Zhang, B. Ni, W. Zhang, M. Xu, and X. Yang (2019) Learning context graph for person search. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2, §4.4, Table 4.
  • [38] Z. Yang, S. Liu, H. Hu, L. Wang, and S. Lin (2019) Reppoints: point set representation for object detection. In IEEE International Conference on Computer Vision, Cited by: §A.2, §A.3, §4.2.
  • [39] M. Ye, X. Zhang, P. C. Yuen, and S. Chang (2019) Unsupervised embedding learning via invariant and spreading instance feature. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [40] K. Zeng, M. Ning, Y. Wang, and Y. Guo (2020) Hierarchical clustering with hard-batch triplet loss for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2, §3.3.
  • [41] L. Zhang, Z. He, Y. Yang, L. Wang, and X. Gao (2020) Tasks integrated networks: joint detection and retrieval for image search. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §4.4, Table 4.
  • [42] X. Zhang, X. Wang, J. Bian, C. Shen, and M. You (2021) Diverse knowledge distillation for end-to-end person search. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: §2.
  • [43] L. Zheng, H. Zhang, S. Sun, M. Chandraker, Y. Yang, and Q. Tian (2017) Person re-identification in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2, §2, §4.1, §4.4.
  • [44] Y. Zhong, X. Wang, and S. Zhang (2020) Robust partial matching for person search in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • [45] Z. Zhou (2018) A brief introduction to weakly supervised learning. National science review. Cited by: §1.

Appendix A Additional Experimental Results

A.1 Paths for Evaluation

In our paper, the instance path is only applied during training to facilitate the identity feature learning of the main search path. During inference, we drop the instance path and pass images only through the search path. We compare the results of using different paths for testing in Tab. 5. Using two paths for evaluation brings no extra performance gain, indicating that our framework produces context-invariant embeddings.

Inference path mAP Rank-1
Two paths 85.65 86.73
Search path 85.72 86.86
Table 5: Comparisons of using different paths for evaluation on the CUHK-SYSU dataset.
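The two evaluation variants in Tab. 5 can be sketched as follows; the function and path names here are illustrative stand-ins, not the authors' API. "Two paths" averages the two embeddings before normalization, while the default uses the search path alone:

```python
import numpy as np

def extract_feature(img, search_path, instance_path=None):
    """Inference-time feature extraction. By default only the search
    path is used (as in the paper); passing `instance_path` combines
    both paths' embeddings, matching the 'Two paths' row of Tab. 5."""
    f = np.asarray(search_path(img), dtype=float)
    if instance_path is not None:
        f = f + np.asarray(instance_path(img), dtype=float)
    return f / np.linalg.norm(f)  # L2-normalize for cosine matching
```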

A.2 Different Detection Networks

Following [16], we choose RepPoints as the detection network. To show the generality of our R-SiamNet, we incorporate different detectors into our framework, including Faster R-CNN [28], RetinaNet [23], and RepPoints [38]. As reported in Tab. 6, the performance gaps among different detectors are small, exhibiting the effectiveness and robustness of our framework.

Detector mAP Rank-1
Faster R-CNN 84.84 85.72
RetinaNet 85.39 86.59
RepPoints 85.72 86.86
Table 6: Comparisons when incorporated with different detectors on the CUHK-SYSU dataset.

A.3 Comparisons with Two-Step Manner

We combine a well-trained RepPoints detector [38] with an unsupervised re-ID model, SPCL [13], as our two-step competitor. As shown in Tab. 7, our method outperforms it by a large margin while being more efficient. This shows that training detection and identification end-to-end is beneficial for obtaining better representations, and may also imply the importance of instance-level consistency learning.

Methods mAP Rank-1
RepPoints+SPCL 73.43 74.79
Ours 85.72 86.86
Table 7: Comparisons with two-step manner on the CUHK-SYSU dataset.

A.4 Evaluation of the Filter Strategy in Clustering

To analyze the effectiveness of the filter strategy in clustering, we conduct experiments with and without filtering by image information. As shown in Tab. 8, removing the filter strategy causes rank-1 drops of 1.22/1.13 points on the CUHK-SYSU/PRW datasets. This shows that it is beneficial to prevent persons from the same scene image from being aggregated into one cluster.

Methods                  CUHK-SYSU         PRW
                         mAP    Rank-1     mAP    Rank-1
R-SiamNet w/o filter     84.74  85.64      20.31  72.23
R-SiamNet                85.72  86.86      21.16  73.36
Table 8: Performance of our method with/without the filter strategy in clustering on the CUHK-SYSU and PRW datasets. "R-SiamNet w/o filter" means the clustering is applied without filtering by image information.
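The filter strategy can be sketched as a constraint during pairwise merging: two instances detected in the same scene image are assumed to be different people, so their clusters are never merged. This is a minimal re-implementation sketch, not the authors' code, and the greedy merging scheme is our simplification:

```python
import numpy as np

def cluster_with_image_filter(feats, image_ids, thr=0.6):
    """Greedy similarity-based clustering with a same-image filter:
    clusters that share a scene image are never merged."""
    n = len(feats)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T
    parent = list(range(n))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def cluster_images(root):
        # scene images already present in the cluster rooted at `root`
        return {image_ids[i] for i in range(n) if find(i) == root}

    # visit candidate pairs from most to least similar
    pairs = [(sim[i, j], i, j) for i in range(n) for j in range(i + 1, n)]
    for s, i, j in sorted(pairs, reverse=True):
        if s < thr:
            break
        ri, rj = find(i), find(j)
        if ri == rj:
            continue
        # filter: reject the merge if the two clusters share a scene image
        if cluster_images(ri) & cluster_images(rj):
            continue
        parent[rj] = ri
    return [find(i) for i in range(n)]
```

Without the same-image check, the two near-duplicate features from image 0 below would collapse into one cluster even though they depict different people standing in the same scene.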

A.5 Different Numbers of Training Epochs

We illustrate the mAP scores under different numbers of training epochs. As Fig. 5 shows, results for three data scales are exhibited on the CUHK-SYSU dataset. The performance improves steadily and then saturates as training proceeds; with smaller data scales, the mAP saturates earlier.

Figure 5: Performance with different numbers of training epochs. The results of three data scales are exhibited on the CUHK-SYSU dataset.

A.6 Runtime Comparisons

To compare the evaluation efficiency of our framework with other methods, we report the average inference runtime for a whole scene image in Tab. 9. Since these methods are evaluated on different GPUs, we also list the tera floating-point operations per second (TFLOPs) of each GPU for a fairer comparison. Similar to other methods [5, 26, 4], we evaluate the models with the same input image size. As shown in Tab. 9, our R-SiamNet takes 72 milliseconds to process one image, which is faster than the two-step method MGTS [4] by a large margin. The query-guided method QEEPS [26] requires re-computing all gallery embeddings for each query image, a time-consuming operation that reduces evaluation efficiency. Moreover, our method is faster than NAE [5]. These results clearly demonstrate the evaluation efficiency of our R-SiamNet.

Methods GPU (TFLOPs) Runtime (ms)
MGTS [4] K80 (8.7) 1269
QEEPS [26] P6000 (12.0) 300
NAE+ [5] V100 (14.1) 98
NAE [5] V100 (14.1) 83
Ours V100 (14.1) 72
Table 9: Runtime comparisons of different methods during evaluation. The average per-image runtime on the CUHK-SYSU dataset is shown.
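Per-image runtimes like those in Tab. 9 are typically averaged over many forward passes after a few warm-up iterations. A generic measurement sketch (the `model_fn` callable stands in for a full person-search forward pass; this is not the authors' benchmarking code):

```python
import time
import statistics

def avg_runtime_ms(model_fn, images, warmup=3):
    """Average per-image inference time in milliseconds,
    excluding warm-up iterations."""
    for img in images[:warmup]:
        model_fn(img)  # warm up caches / lazy initialization
    times = []
    for img in images:
        t0 = time.perf_counter()
        model_fn(img)
        times.append((time.perf_counter() - t0) * 1000.0)
    return statistics.mean(times)
```

On a GPU, one would additionally synchronize the device before reading each timestamp so that asynchronous kernels are included in the measurement.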
(a) R-SiamNet w/o both losses
(b) R-SiamNet w/o cluster-level contrastive loss
(c) R-SiamNet
Figure 6: t-SNE feature visualization on a subset of the PRW training set. (a) R-SiamNet without both instance-level consistency learning and cluster-level contrastive learning. (b) R-SiamNet without cluster-level contrastive learning. (c) Our proposed R-SiamNet with both instance-level consistency learning and cluster-level contrastive learning. Colors denote person identities.

Appendix B More Qualitative Analysis

B.1 Feature Visualization

To analyze the discriminative ability of our learned features, we employ t-SNE [32] to visualize the feature representations during training. As illustrated in Fig. 6, we visualize a subset of the PRW training set, where different colors represent different classes.

Fig. 6(a) shows the result when training with a single search path. Without the instance-level consistency learning and cluster-level contrastive learning, the learned features show large intra-class distances and small inter-class distances. When adding the instance-level consistency learning, the result is shown in Fig. 6(b): feature embeddings of the same category are better aggregated than in Fig. 6(a), which shows the effectiveness of our instance-level consistency learning. Nevertheless, the features within each class are still loosely gathered, and the margins among different classes are not clear. When we further apply the cluster-level contrastive learning, as shown in Fig. 6(c), both intra-class compactness and inter-class separability are encouraged, with obvious margins among most categories. This verifies that our R-SiamNet can generate discriminative embeddings under the weakly supervised setting.
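A Fig. 6-style visualization can be reproduced with scikit-learn's t-SNE. The sketch below uses toy synthetic features as a stand-in for the learned embeddings; the paper's actual plotting details may differ:

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_embed(features, perplexity=10, seed=0):
    """Project (N, D) re-ID features to a 2-D embedding for plotting."""
    return TSNE(n_components=2, perplexity=perplexity,
                init="pca", random_state=seed).fit_transform(features)

# toy stand-in for learned embeddings: 3 identities, 20 instances each
rng = np.random.default_rng(0)
ids = np.repeat(np.arange(3), 20)
centers = rng.normal(size=(3, 128))
feats = centers[ids] * 5.0 + rng.normal(size=(60, 128))

emb = tsne_embed(feats)  # (60, 2); scatter-plot emb colored by `ids`
```

Note that t-SNE's perplexity must be smaller than the number of samples, and the 2-D distances it produces are only qualitatively meaningful.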