Global-Local Context Network for Person Search

12/05/2021
by   Peng Zheng, et al.

Person search aims to jointly localize and identify a query person from natural, uncropped images, which has been actively studied in the computer vision community over the past few years. In this paper, we delve into the rich context information globally and locally surrounding the target person, which we refer to as scene and group context, respectively. Unlike previous works that treat the two types of context individually, we exploit them in a unified global-local context network (GLCNet) with the intuitive aim of feature enhancement. Specifically, re-ID embeddings and context features are enhanced simultaneously in a multi-stage fashion, ultimately leading to enhanced, discriminative features for person search. We conduct experiments on two person search benchmarks (i.e., CUHK-SYSU and PRW) and further extend our approach to a more challenging setting (i.e., character search on MovieNet). Extensive experimental results demonstrate the consistent improvement of the proposed GLCNet over state-of-the-art methods on the three datasets. Our source code, pre-trained models, and the new setting for character search are available at: https://github.com/ZhengPeng7/GLCNet.

1 Introduction

Person search aims at identifying a query person from natural, uncropped images across different views [34, 42]. To tackle this task, we need to not only address the challenges in person re-identification (re-ID) (e.g., pose variations, occlusions, and varied camera views), but also unify pedestrian detection and person re-ID into a joint framework. By simultaneously solving the two sub-tasks, person search exhibits unique advantages in terms of efficiency, accuracy, and practicability. In the past few years, rapid progress in person search has been made with the renaissance of deep learning techniques. Existing person search approaches can be generally divided into two categories, i.e., two-step approaches [31, 19, 10, 28] and one-step ones [34, 7, 36]. In two-step approaches, person search is decomposed into two sub-tasks, i.e., pedestrian detection and person re-ID. Detection and re-ID models typically work in a sequential manner, i.e., the detection model is used to crop the person regions, which are subsequently fed to the re-ID model to extract discriminative embeddings for identification. However, sequentially dealing with the two tasks using two separate networks is both time- and resource-consuming. In contrast, one-step approaches [34, 5, 20, 36] integrate detection and re-ID into a unified end-to-end network. Instead of explicitly employing the cropped person regions from the detection results, one-step models apply an RoI-Align layer to aggregate the features within the detected bounding boxes.

Nevertheless, both two-step and one-step methods concentrate on person-centric regions without perceiving the surrounding environment. In other words, the holistic and rich context is not fully exploited. However, mining such context has proven very effective in related tasks, such as object detection [21, 8, 7] and group re-ID [22, 45]. In small object detection [21, 8], scene context helps find tiny objects that are usually missed by common object detection methods. Group re-ID methods [37, 32, 43] usually enhance the re-ID embeddings of the target person by incorporating the features of neighboring persons. Motivated by the above facts and owing to the setting of person search (i.e., a holistic image containing multiple persons is provided), several attempts [23, 1, 24, 12] have recently been made to learn the underlying context information beyond the detected person regions. They typically exploit two types of context, i.e., scene and group. For instance, scene context is used in [23, 1] to help identify the target by measuring the correspondence between the detected box of the query and the scene in the gallery. Group context is often mined in a pair-wise fashion, e.g., siamese networks are developed in [24, 12] to find potential co-travelers; [38, 20, 24] attempt to achieve globally optimized matching based on the input pairs.

However, most existing methods resort to either scene or group context, while both types of context contribute to the final performance, as a person is not only related to the global scene but also associated with his/her local neighbors. Therefore, an open question is naturally posed: Is it possible to leverage both types of context to further improve person search performance? In this paper, we attempt to answer this question by exploiting both scene and group context in a unified framework. To this end, we develop a simple yet effective framework for hybrid context learning, i.e., the global-local context network (GLCNet), which effectively and efficiently exploits global scene and local group context. On the one hand, scene context facilitates learning globally discriminative features, which enhance the re-ID features in the current scene. On the other hand, we employ group context to establish relationships between the target person and his/her neighboring co-travelers.

To be specific, we inject the global and local context into an existing one-step person search model, i.e., SeqNet [20], where the two levels of context information serve as additional features to enhance the re-ID embeddings of the target person. The feature enhancement is performed in three hierarchical stages. 1) In the first stage, we design a context enhancer to learn stronger context features by operating on the context features themselves. 2) The second stage employs the norm-aware embedding (NAE) layer [5] to improve the diversity of the re-ID and context features. 3) Finally, in the third stage, we simply fuse all the features and feed them to the online instance matching (OIM) loss [34] for the ultimate feature interaction and for supervising the training process. More concretely, the fused features are stored as the memory of the target identity, so that the scenes where he/she often appears and the surrounding concurrent people are recorded and updated. In this way, the re-ID embeddings can perceive the rich context features and vice versa, leading to the final enhanced features and significantly improving the person search performance.
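As a concrete illustration of the third stage, the sketch below shows an OIM-style lookup table that memorizes the fused (re-ID plus context) embedding of each labeled identity. The class name, the momentum and temperature values, and the omission of the circular queue for unlabeled persons are simplifying assumptions on our part, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

class OIMLookupTable:
    """Minimal OIM-style lookup table: one memory slot per labeled identity."""

    def __init__(self, num_ids, feat_dim, momentum=0.5, temperature=1.0 / 30):
        self.lut = torch.zeros(num_ids, feat_dim)   # memorized embedding per identity
        self.momentum = momentum
        self.temperature = temperature

    def loss_and_update(self, feats, labels):
        # feats:  (B, feat_dim) L2-normalized fused embeddings (re-ID + context)
        # labels: (B,) identity indices of labeled persons
        logits = feats @ self.lut.t() / self.temperature   # similarity to every identity
        loss = F.cross_entropy(logits, labels)

        # Momentum update: the memory of each identity gradually absorbs the
        # scenes and co-travelers it is observed with.
        with torch.no_grad():
            for f, y in zip(feats, labels):
                slot = self.momentum * self.lut[y] + (1.0 - self.momentum) * f
                self.lut[y] = F.normalize(slot, dim=0)
        return loss
```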

The proposed GLCNet exhibits unique advantages over existing context-based person search approaches from three perspectives. First, we investigate both scene and group context, implicitly leveraging their complementary characteristics. Second, our network architecture is more efficient than siamese frameworks that require time-consuming pair generation. In the meantime, our network is more flexible and concise, bringing little computational overhead to the base model. Third, our context features are simultaneously updated and interact with the re-ID features in a multi-stage fashion.

In addition to person search, we evaluate the proposed method in a more challenging setting, i.e., character search (joint character detection and identification) in movies [16], where context also plays a critical role. We study character search based on the newly-proposed large-scale movie dataset MovieNet [16]. This dataset is more diverse and realistic than existing person search datasets, which mostly focus on surveillance scenarios. In character search, the same character may wear different clothes in distinct scenes, posing significant difficulty to existing person search approaches. In particular, since [16] only introduces separate settings for character detection and identification, we re-organize the data to evaluate this new setting, which we refer to as MovieNet-CS. We follow CUHK-SYSU [34] to provide multiple difficulty levels for character search by varying the number of images in the training/test sets of MovieNet-CS. The subsequent experimental results show that our approach notably outperforms the state-of-the-art methods by large margins.

In summary, our main contributions include:

  • We propose a novel global-local context network (GLCNet) for person search, by enhancing the re-ID features with global scene context and local group context, resulting in more discriminative and semantically meaningful representations.

  • We equip an existing framework with the additional ability to perceive global-local context by updating and interacting re-ID and context features across multiple stages, while bringing little computational overhead to its concise architecture.

  • Extensive experiments on two widely-adopted person search benchmarks clearly show the superiority of our method over the state-of-the-art approaches. The additional evaluation in character search, where our method outperforms other competitors by large margins, further confirms the effectiveness of the proposed context learning scheme.

2 Related Work

Person Search.

Although tremendous achievements have been made in person re-ID [26, 41], there still remains a gap to applying re-ID in practical applications, because gallery images are cropped person images rather than natural images containing persons and their surroundings. To address this, person search has been introduced to jointly localize and identify query persons. Existing person search frameworks can be generally divided into two categories: two-step and one-step models. In two-step frameworks, the person detection results are fed into a re-ID model, i.e., the detection and re-ID models are independent of each other and organized in a sequential way. Inspired by the achievements in object detection [30], the first one-step person search approach, based on Faster R-CNN [30], was proposed in [34]. Since then, many successful attempts [38, 36, 20, 5, 40, 18] have been made to render the joint framework more effective and efficient. For instance, to improve efficiency, Yan et al. [36] proposed the first anchor-free person search framework, which equips an existing detector with three levels of alignment. In [20], detection and re-ID were considered as a progressive process and tackled with two sub-networks. However, existing person search models suffer from the conflicting objectives of person detection and re-ID during training [3]. To tackle this problem, Chen et al. [5] proposed a norm-aware embedding (NAE) method that decomposes the feature embedding into norm and angle for detection and re-ID, respectively.

Context Learning in Person Search.

Over the past few years, context learning has played an important role in the sub-tasks of person search, i.e., object detection [27, 25, 7] and person re-ID [37, 17]. In object detection, context information often serves as a reasoning cue that helps establish the relationship between objects and their plausible surroundings. Among most existing approaches, the context used can be divided into two categories: global scene context and local group context. With global scene context, the relationship between the scene and the target object can be established; for instance, a beach often involves a boat, and an umbrella often appears on a rainy day. In short, more information from the background can be absorbed to assist the localization of target objects. For instance, in [25], scene context is applied to enhance parts of the object state, which may help find neglected small objects. In [7], Chen et al. used a correlation estimation procedure to select useful contextual information from the surrounding regions, which improves the quality of their detection results.

Recently, many works have been proposed for mining the relationships between objects within a certain area. In small object detection, tiny objects are difficult to detect and recognize based only on their own features. Context information from nearby instances is used to form a group and measure the relationships among them [21, 8]. In [21], Lim et al. extracted features from higher layers to obtain contextual information from surrounding pixels. Among re-ID methods, context information is often represented by the concept of a group, which helps model the relationships between the target person and his/her neighboring persons. In [6], unlabelled target instances were leveraged as contextual guidance for image generation. Besides, group re-ID methods [22, 37, 45] further explore how to incorporate additional visual context from neighboring group members to enhance re-ID performance. In terms of person search, several methods [38, 9, 24, 20] explore context information. In [38, 24, 20], group context is exploited via graph matching. In [35] and [12], context is employed in an implicit way by using memory units in the weakly-supervised setting.

Character Search in Movies.

Xiao et al. [34] introduced the first person search dataset, CUHK-SYSU, collected from street snapshots and movie/TV frames. PRW [42] is a collection of images from surveillance videos. These two datasets have been used to evaluate methods for searching for persons in different places. In the era of social media and big data, there is an increasing need for searching for persons on the Internet, especially movie stars and political celebrities. However, a big gap between searching for a person on the street and retrieving a person from the Internet is that the clothing of the target person usually changes across different scenes. More specifically, a movie character tends to wear different clothes in different movies, or even within a single movie. In such cases, clothing is no longer a reliable basis for distinguishing identities. Furthermore, criminals may change their clothes in different zones, which invalidates the clothing-consistency assumption underlying standard person re-ID. Given the above observations, it is important to extend standard person re-ID and person search to the more general tasks of character re-identification and character search, respectively. Effective character search could benefit both entertainment and security in daily life. In [16], Huang et al. proposed a holistic dataset for movie understanding, with rich annotations for character detection and identification. This dataset and its baseline for character identification provide a great opportunity to extend character identification to character search, which is much more in line with practical applications.

3 Global-Local Context Network

In this section, we first revisit existing person search frameworks and briefly discuss their limitations, especially w.r.t. the insufficient use of person search data. We then introduce the proposed unified global-local context network (GLCNet), which makes full use of context information at two levels: the scene level and the group level. Finally, these two levels of context are fed into the OIM loss [34] to help improve the identity features.

Figure 1: Overall framework of the proposed global-local context network (GLCNet). The structure of the context enhancer is shown in Fig. 2.

3.1 Insufficient Use of Person Search Data

In the past few years, many successful person search frameworks have been proposed. Most existing networks are designed to learn more accurate bounding boxes [20], learn more discriminative re-ID features [9], or reconcile the relationship between detection and re-ID [31, 5]. Prior works transform the person search problem into person detection and re-ID, and differ in whether the two are tackled separately or jointly. From two-step methods to one-step methods, the re-ID branch generally relies on the detection results and only employs the information within the bounding box to learn re-ID features. However, person search images naturally contain much richer information, including labeled persons, unlabeled persons, and the complete background, which can provide complementary cues for re-ID learning. Inspired by this, some prior works explore the role of context information in person search [12, 35] in an implicit, weakly-supervised way. Group context has also been explored in several works via graph matching [37, 20, 24], where the context information is applied in a pair-wise way. These pair-based works first extract the features of all persons in an image pair and then optimize the matching between the two feature sets at the graph level, instead of matching them one by one in a greedy manner. Therefore, we propose a unified global-local context network (GLCNet) to explicitly address these issues.

3.2 Framework Overview

Our framework is developed upon SeqNet [20], one of the state-of-the-art models in person search. We follow SeqNet to feed the output features of the RoI-Align layer to the norm-aware embedding (NAE) layer [5]. As shown in Fig. 1, the proposed GLCNet has two individual branches, which are used for extracting global scene context (GSC) and local group context (LGC), respectively. Based on the assumption that different people have their own frequently visited places, there may be a probabilistic connection between persons and places/scenes. The more complex and discriminative the scene is, the more unique and robust the connection between it and certain persons. We find that this phenomenon exists in the existing datasets, especially CUHK-SYSU and MovieNet. With these two branches, the context information can be captured by global and local features. Specifically, to learn the global features, we design the GSC branch to extract features from the backbone. Meanwhile, to learn the local features, we employ the LGC branch, which takes the positive RoI features to learn the features of all persons in a single image. These two features are then fed into the NAE layer to obtain more discriminative features with better diversity.
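To make the data flow in Fig. 1 concrete, below is a toy sketch of how the two context branches and the per-person re-ID embeddings could be combined. The pooling operations, tensor shapes, and function name are our simplifications (the actual branches use the context enhancer of Sec. 3.3 and the NAE layer), so treat this as illustrative rather than the official implementation.

```python
import torch
import torch.nn.functional as F

def glcnet_identity_features(backbone_feat, roi_feats, reid_embeds, pos_mask):
    """Toy illustration of the GLCNet data flow (our simplification of Fig. 1).

    backbone_feat : (C, H, W)    feature map of one image from the ResNet-50 stem
    roi_feats     : (R, C, h, w) RoI-Aligned features of all proposals
    reid_embeds   : (R, D)       per-person re-ID embeddings from the SeqNet/NAE head
    pos_mask      : (R,) bool    which proposals are positive (persons)
    """
    # Global scene context (GSC): one descriptor summarizing the whole image.
    gsc = backbone_feat.mean(dim=(1, 2))                         # (C,)

    # Local group context (LGC): fuse the features of all detected persons.
    lgc = roi_feats[pos_mask].mean(dim=(0, 2, 3))                # (C,)

    # Each person's identity feature couples its own re-ID embedding with the
    # shared scene and group context of the image (see Sec. 3.5 for the actual
    # embedding decomposition; plain L2 normalization stands in for NAE here).
    ctx = torch.cat([gsc, lgc]).expand(reid_embeds.size(0), -1)  # (R, 2C)
    fused = torch.cat([reid_embeds, ctx], dim=1)                 # (R, D + 2C)
    return F.normalize(fused, dim=1)
```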

Figure 2: Structure of the context enhancer.

3.3 Global Scene Context

In object detection and segmentation tasks, scene context has been proven to play an important role [27, 7, 25, 21]. In [25], Liu et al. adopted a graphical model to establish the relationship between scene/instance context and the target object, so as to infer the object state. This motivates us to examine the rationale for exploiting global scene context from two perspectives, i.e., the dataset and the network architecture.

To determine whether there is a connection between a certain person and the corresponding scene, we take a close look at the existing datasets. On CUHK-SYSU, many images are key frames from movie shots, where scenes differ significantly between shots while key frames within a single shot are highly similar. In such cases, we expect a strong connection between an identity and the corresponding scenes. On PRW, in contrast, there are few differences between images, since all of them are collected from six surveillance cameras. As a result, the connections between identities and scenes are much weaker, so capturing the global scene context should be less effective on this dataset, which is verified in the following experiments.

From the perspective of network architecture, our fusion block enables interactions between the backbone features and the RoI features, making the re-ID branch learn features at multiple levels, i.e., global and local, shallow and deep.

In addition, we propose the context enhancer (CE) to strengthen the global scene context features. The CE block enhances the output of the ResNet-50 backbone, which is a global response to the scene. To ensure that the improvement does not merely come from increased network complexity, we keep the CE as simple as possible: it consists of several 1×1 convolutional layers and an adaptive pooling layer, as shown in Fig. 2.
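Following this description, a minimal PyTorch version of the context enhancer could look as follows; the channel sizes, layer count, and the use of average pooling are illustrative assumptions, since the paper only specifies 1×1 convolutions followed by adaptive pooling.

```python
import torch.nn as nn

class ContextEnhancer(nn.Module):
    """Lightweight context enhancer: a few 1x1 convolutions + adaptive pooling.
    Channel sizes and layer count are illustrative, not the paper's exact values."""

    def __init__(self, in_channels=1024, out_channels=256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, in_channels // 2, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels // 2, out_channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)   # collapse the spatial extent

    def forward(self, x):
        # x: (B, in_channels, H, W) backbone feature map (or stacked RoI features)
        x = self.convs(x)                      # (B, out_channels, H, W)
        x = self.pool(x)                       # (B, out_channels, 1, 1)
        return x.flatten(1)                    # (B, out_channels)
```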

3.4 Local Group Context

Group context information has the potential to bring extra useful information and reduce ambiguity in single-person re-ID [37, 32, 43]. For group context matching, a straightforward solution is to use a graph neural network (GNN), which views all the identities as a vertex set and computes the edges among them. However, directly feeding the persons into a GNN lacks interpretability and considerably increases the computational complexity. Based on the above considerations, we propose the local group context (LGC) module to extract features from the neighboring persons, which are then normalized in the NAE block and fed into the OIM loss as part of the features of the target person.

The LGC module consists of a sampling layer and a context enhancer. The sampling layer selects the positive RoI features, which are more likely to correspond to persons. After selection, the person features are fed into the CE for feature enhancement. Finally, the group context features are sent to the NAE layer for normalization.
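A possible realization of this module is sketched below. The score-threshold sampling rule and mean fusion are assumptions on our part; the paper only states that positive RoI features are selected, enhanced, and normalized.

```python
import torch

def local_group_context(roi_feats, person_scores, enhancer, score_thresh=0.5):
    """Sketch of the LGC module: select positive RoIs, enhance them, and fuse
    them into one group descriptor per image.

    roi_feats     : (R, C, h, w) RoI-Aligned features of all proposals in one image
    person_scores : (R,)         foreground/person scores of the proposals
    enhancer      : a ContextEnhancer module (see the sketch in Sec. 3.3)
    """
    keep = person_scores > score_thresh            # sampling layer: positive RoIs only
    if keep.sum() == 0:                            # fall back to the highest-scoring RoI
        keep = person_scores == person_scores.max()

    person_feats = enhancer(roi_feats[keep])       # (R_pos, D) enhanced person features
    return person_feats.mean(dim=0)                # (D,) local group context
```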

Figure 3: Structure of the embedding decomposition module in the implicit style.
Figure 4: Structure of the embedding decomposition module in the explicit style.

 

                                 CUHK-SYSU          PRW
Embedding decomposition module   mAP     top-1     mAP     top-1
Implicit style                   95.2    95.8      46.2    84.4
Explicit style                   95.7    96.3      46.9    85.1
Table 1: Comparative results when employing different embedding decomposition modules.

3.5 Embedding Decomposition

Given the contextual information from two perspectives, i.e., global scene context and local group context, we need to find the right way to exploit these two contexts together with the original re-ID features. To make full use of the context features, we design an embedding decomposition module in an implicit and an explicit way, respectively. We first introduce the implicit design, shown in Fig. 3: we concatenate the two re-ID features and the two context features, employ a channel attention block [15] to assign different weights to each feature, and then feed the single feature vector to the NAE layer. The second design is more concise and explicit. As shown in Fig. 4, the explicit design involves an NAE layer and a concatenation operation. As a result, the identity feature of a target person consists of four parts: its original shallow and deep re-ID features, the global context features representing the scene, and the local context features representing its surrounding persons. Through the OIM loss [34], these two types of context features also become part of the representation of the target person.
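The explicit design can be summarized by the following sketch, where plain L2 normalization stands in for the full NAE layer and the four-way concatenation follows the description above; the function and argument names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def explicit_decomposition(reid_shallow, reid_deep, scene_ctx, group_ctx):
    """Sketch of the explicit embedding decomposition (Fig. 4): each feature is
    normalized separately, then the four parts are concatenated into the final
    identity embedding fed to the OIM loss.

    reid_shallow, reid_deep : (R, D)  per-person re-ID embeddings
    scene_ctx               : (D_s,)  global scene context of the image
    group_ctx               : (D_g,)  local group context of the image
    """
    num_persons = reid_shallow.size(0)
    parts = [
        F.normalize(reid_shallow, dim=1),
        F.normalize(reid_deep, dim=1),
        F.normalize(scene_ctx, dim=0).expand(num_persons, -1),
        F.normalize(group_ctx, dim=0).expand(num_persons, -1),
    ]
    return torch.cat(parts, dim=1)   # (R, 2D + D_s + D_g)
```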

The performance of the two designs is reported in Table 1. Considering both conciseness and performance, we adopt the explicit decomposition module in our experiments, unless otherwise specified.

 

                             CUHK-SYSU          PRW
Baseline   Scene   Group     mAP     top-1     mAP     top-1
   ✓                         93.9    94.3      46.5    82.6
           ✓                 50.3    40.6       4.3    27.9
                   ✓         45.0    36.4       6.0    25.0
   ✓       ✓                 95.3    96.0      46.3    84.9
   ✓               ✓         93.3    94.1      46.3    82.9
   ✓       ✓       ✓         95.7    96.3      46.9    85.1
Table 2: Comparative results when employing different context features to assist re-identification.

 

                               CUHK-SYSU          PRW
            Methods            mAP     top-1     mAP     top-1
one-step    OIM [34]           75.5    78.7      21.3    49.4
            IAN [33]           76.3    80.1      23.0    61.9
            NPSM [23]          77.9    81.2      24.2    53.1
            RCAA [1]           79.3    81.3      -       -
            CTXG [38]          84.1    86.5      33.4    73.6
            QEEPS [28]         88.9    89.1      37.1    76.7
            HOIM [2]           89.7    90.8      39.8    80.4
            BINet [9]          90.0    90.7      45.3    81.7
            NAE [5]            91.5    92.4      43.3    80.9
            NAE+ [5]           92.1    92.9      44.0    81.1
            PGA [18]           92.3    94.7      44.2    85.2
            DKD [40]           93.1    94.2      50.5    87.1
            AlignPS [36]       93.1    93.4      45.9    81.9
            AlignPS+ [36]      94.0    94.5      46.1    82.1
            SeqNet [20]        93.8    94.6      46.7    83.4
            SeqNet+CBGM [20]   94.8    95.7      47.6    87.6
            AGWF [11]          93.3    94.2      53.3    87.7
            GLCNet             95.7    96.3      46.9    85.1
            GLCNet+CBGM        96.0    96.3      47.6    88.0
two-step    DPM+IDE [42]       -       -         20.5    48.3
            CNN+MGTS [4]       83.3    83.9      32.8    72.1
            CNN+CLSA [19]      87.2    88.5      38.7    65.0
            FPN+RDLR [13]      93.0    94.2      42.9    70.2
            IGPN [10]          90.3    91.4      47.2    87.0
            OR [39]            92.3    93.8      52.3    71.5
            TCTS [31]          93.9    95.1      46.8    87.5
Table 3: Comparison with state-of-the-art methods. The upper block lists the results of one-step models, while the lower block shows the results of two-step methods.

4 Experiments

In this section, we conduct experiments on two widely-used benchmark datasets, i.e., CUHK-SYSU [34] and PRW [42]. In the following, we first introduce the datasets, experimental settings, and implementation details. Subsequently, we compare with state-of-the-art methods on the two datasets. Finally, we perform ablation studies to validate the effectiveness of context learning. Moreover, since our method is expected to capture meaningful context information, successful identification relies less on the target person itself. To validate the advantages of our context-aware approach, we also conduct experiments on the MovieNet dataset [16] for the task of character search, where conventional person search approaches may fail in some cases because the same identity changes clothing.

4.1 Datasets and Settings

CUHK-SYSU [34].

CUHK-SYSU consists of 18,184 images, with 96,143 annotated bounding boxes and 8,432 identities. Images are collected from two types of data sources, exhibiting a high degree of diversity and variation, which makes the dataset quite challenging. We follow [34] to adopt the standard training/test split, with a training set of 5,532 identities and 11,206 images, and a test set of 2,900 query persons and 6,978 images. Unless otherwise stated, the results are reported using the default gallery size of 100.

PRW [42].

PRW contains 11,816 images from surveillance videos captured by six static cameras on a university campus. After manual annotation, there are 932 labeled persons with 43,110 bounding boxes. Following the standard data split, the training set contains 5,704 images with 482 different identities, while the test set includes 2,057 query persons and 6,112 images. For each query, we adopt the whole test set as the gallery.

MovieNet [16].

MovieNet (http://movienet.site/) is a holistic dataset for movie understanding, which contains 1,100 movies with a large amount of data, e.g., movie frames, trailers, and plot descriptions. It also contains rich annotations, including more than 1.1M characters with millions of bounding boxes and 3,087 cast identities. On MovieNet [16], Huang et al. established benchmarks for many tasks based on movie data, including character identification. They crop the characters with ground-truth boxes and combine face and body features for identification. However, simultaneously detecting the characters and identifying them is much closer to realistic application scenarios. Therefore, we extend the character identification task to character search. To make the experiments easy to run and reproduce, we split out the data and annotations useful for the character search task and reorganize the dataset. The character search dataset is split into training and test sets based on the 3,087 identities: 1,000 identities are included in the test set, while the remaining 2,087 form the training set. Considering the extremely large scale of the full 100K training images, we define three different settings for the training sets. Specifically, we adopt at most 10, 30, and 70 instances per identity, resulting in training sets of 20K, 54K, and 100K images, respectively. We follow CUHK-SYSU [34] to introduce different gallery sizes to provide various levels of difficulty for comprehensive evaluation. We denote the dataset with our proposed data splits and settings as MovieNet-CS.
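For illustration, the sketch below shows one way such a split could be constructed. The annotation format, function name, and the use of the first N instances per identity are assumptions for readability, not the exact MovieNet-CS protocol.

```python
import random
from collections import defaultdict

def build_movienet_cs_split(annotations, num_test_ids=1000, max_per_id=10, seed=0):
    """Illustrative re-organization of MovieNet for character search.
    `annotations` is assumed to be a list of (image_id, character_id) pairs;
    the real MovieNet annotation format differs, so treat this as a sketch."""
    random.seed(seed)
    ids = sorted({cid for _, cid in annotations})
    test_ids = set(random.sample(ids, num_test_ids))     # 1,000 test identities
    train_ids = set(ids) - test_ids                       # remaining 2,087 for training

    per_id = defaultdict(list)
    for img, cid in annotations:
        per_id[cid].append(img)

    # Cap the number of training instances per identity (N = 10 / 30 / 70).
    train_imgs = set()
    for cid in train_ids:
        train_imgs.update(per_id[cid][:max_per_id])

    test_imgs = {img for img, cid in annotations if cid in test_ids}
    return sorted(train_imgs), sorted(test_imgs)
```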

Evaluation Metrics.

Following conventional settings, mean average precision (mAP) and top-1 accuracy are adopted to evaluate the person search performance of different methods. We also employ recall and average precision (AP) to evaluate the performance in terms of person detection.
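For clarity, the sketch below shows how AP and top-1 accuracy can be computed for a single query under the commonly used person search protocol, where a gallery detection counts as a match if it overlaps a ground-truth box of the query identity with IoU ≥ 0.5. This is our summary of the common protocol rather than the exact evaluation script.

```python
import numpy as np

def search_ap(sim_scores, is_match, num_gt):
    """AP and top-1 for one query (sketch of the standard search protocol).

    sim_scores : (G,) similarity of each gallery detection to the query
    is_match   : (G,) 1 for true matches (IoU >= 0.5 with a GT box of the
                 query identity, matched beforehand), 0 otherwise
    num_gt     : total number of ground-truth instances of the query in the gallery
    """
    order = np.argsort(-sim_scores)                 # rank detections by similarity
    matches = is_match[order]
    cum_tp = np.cumsum(matches)
    precision = cum_tp / np.arange(1, len(matches) + 1)
    ap = float((precision * matches).sum() / max(num_gt, 1))   # average precision
    top1 = float(matches[0] == 1)                              # top-1 accuracy
    return ap, top1

# mAP and top-1 accuracy are obtained by averaging ap / top1 over all queries.
```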

 

                 CUHK-SYSU         PRW
Methods          Recall    AP      Recall    AP
OIM [34]         89.3      79.7    -         -
NAE [5]          92.6      86.8    -         -
SeqNet [20]      92.1      89.2    -         -
AGWF [11]        -         -       97.5      94.5
GLCNet           92.4      89.4    96.7      94.2
Table 4: Comparison of the detection performance of different methods.

4.2 Implementation Details

We implement our model based on the PyTorch

[29] library and all the experiments are conducted on a single NVIDIA Tesla V100 GPU. We adopt SeqNet [20] as our baseline network. During training, the batch size is set to 5 and each image is resized to 9001500 on CUHK-SYSU [34] and PRW [42]. We follow [34] to set the circular queue size of OIM to 5,000/500/2,000. The only data augmentation used here is random horizontal flip. We follow our baseline model [20]

in the settings of some hyper-parameters: Our model is optimized by Stochastic Gradient Descent (SGD) for 18 epochs on CUHK-SYSU, PRW, and MovieNet-CS with the initial learning rate of 0.003 which is warmed up during the first epoch and decreased by 10 at the 16th epoch. The momentum and weight decay of SGD are set to 0.9 and 5

individually. At test time, the thresholds of NMS are set to 0.4 and 0.5 w.r.t. the first and second heads, respectively. The parameters of the CBGM module used in SeqNet also follow the same settings as in [20], i.e., and on CUHK-SYSU and PRW, respectively.
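The optimization schedule can be sketched as follows; `model` and `iters_per_epoch` are placeholders, and the per-iteration warm-up and the 5×10⁻⁴ weight decay reflect our reading of the description above rather than the released training script.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(256, 256)                  # placeholder standing in for GLCNet
optimizer = SGD(model.parameters(), lr=0.003, momentum=0.9, weight_decay=5e-4)

iters_per_epoch = 1000                             # depends on the dataset / batch size 5
total_epochs, decay_epoch = 18, 16

def lr_lambda(it):
    epoch = it // iters_per_epoch
    if epoch < 1:                                  # linear warm-up during the first epoch
        return (it + 1) / iters_per_epoch
    if epoch >= decay_epoch:                       # decay by a factor of 10 at epoch 16
        return 0.1
    return 1.0

scheduler = LambdaLR(optimizer, lr_lambda=lr_lambda)

for it in range(total_epochs * iters_per_epoch):
    # ... forward / backward on one batch of 5 images here ...
    optimizer.step()
    scheduler.step()
```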

4.3 Ablation Study

In this section, we conduct ablation experiments on CUHK-SYSU and PRW for an in-depth analysis of the proposed method.

Baseline. We directly adopt SeqNet [20] as our baseline model due to its promising performance on both CUHK-SYSU and PRW. As shown in Table 2, our context modules bring notable improvements to the baseline.

Global Scene Context. To validate the effect of scene context, we discard the original re-ID features and use only the scene context feature for the re-ID task. Referring to Fig. 1, only the scene context feature (in yellow) is fed into the OIM loss for learning re-ID features. Although the scene context feature is a global one obtained before RoI selection, it still achieves 50.3% mAP on CUHK-SYSU and 27.9% top-1 accuracy on PRW. These results indicate that our context features can facilitate re-ID feature learning to a large extent.

Local Group Context. To evaluate the effectiveness of the local group context, we follow the same procedure as for the scene context. As shown in Fig. 1, only the group context feature (in green) is fed into the OIM loss to compute the re-ID feature. As shown in Table 2, using only the fused features of all detected persons in the image still achieves 45.0% mAP on CUHK-SYSU and 25.0% top-1 accuracy on PRW, indicating that the group context is beneficial for the re-ID task.

To measure the contribution of each part, we train our complete model (i.e., with the two original re-ID features, the two context features, and the explicit embedding decomposition module), while only part of the features are used during testing, as illustrated in Fig. 4. As shown in Table 5, although the model is trained with all features, each single identification feature still works well independently, at the same level of performance. This indicates the interpretability and flexibility of our framework; in other words, if more components are added to the model in the future, the effectiveness of the existing features should not be compromised.

 

CUHK-SYSU    Base    Scene    Group    Base+Scene
mAP          92.7    46.2     35.6     95.2
top-1        93.2    40.8     31.2     95.8
Table 5: Performance using sub-networks from the well-trained complete model. Compared with Table 2, the performance here is still at the same level.

4.4 Comparison with State-of-the-Art Methods

We compare our model with the state-of-the-art methods, including both one-step [34, 5, 38, 20, 36] and two-step methods [31, 19, 3, 13].

Results on CUHK-SYSU. As shown in Table 3, GLCNet and GLCNet+CBGM both outperform all one-step and two-step person search models without any tricks, such as additional auxiliary models or data (e.g., knowledge distillation or domain adaptation). Compared with the state-of-the-art two-step model TCTS [31], our GLCNet+CBGM outperforms it by 2.2% and 1.3% w.r.t. mAP and top-1 accuracy, respectively. In contrast to the sophisticated tricks adopted by TCTS, e.g., random erasing [44], label smoothing, and the triplet loss [14], our approach only adopts random horizontal flipping for data augmentation. Our GLCNet model also outperforms the state-of-the-art one-step model SeqNet+CBGM [20] by 1.3% and 0.6% w.r.t. mAP and top-1 accuracy, respectively. Besides, we simply keep the hyper-parameters of CBGM without further tuning. Moreover, we report the results of GLCNet with different gallery sizes and compare our model with both one-step and two-step models. The detailed comparison results are illustrated in Fig. 5, from which we can observe that our GLCNet and GLCNet+CBGM outperform all existing models by notable margins. In addition, our approach achieves the best detection performance on CUHK-SYSU, as shown in Table 4, indicating that our model better handles the conflicting objectives of detection and re-ID during training.

Figure 5: Performance comparison in terms of different gallery sizes on CUHK-SYSU. The dashed and solid lines represent two-step and one-step methods, respectively.

 

N=10 N=30 N=70
Gallery mAP top-1 mAP top-1 mAP top-1
2K 34.2 80.9 41.6 85.5 44.0 85.6
4K 30.0 79.9 37.6 84.4 39.0 85.7
10K 24.1 77.3 30.9 81.4 32.6 82.4
Table 6: Performance of SeqNet on MovieNet-CS under different data splits. ‘N’ indicates the maximum number of frames of the same identity.

 

N=10 N=30 N=70
Gallery mAP top-1 mAP top-1 mAP top-1
2K 45.9 85.6 53.6 89.4 55.5 88.6
4K 41.5 82.4 50.0 87.6 51.1 87.8
10K 34.4 78.5 42.7 84.7 44.4 85.7
Table 7: Performance of our GLCNet on MovieNet-CS under different data splits. ‘N’ indicates the maximum number of frames of the same identity.

Results on PRW. As discussed in Sec. 3.3, PRW contains less diverse context, so context information might be less effective on it. Nevertheless, as shown in Table 3, our GLCNet+CBGM still achieves the best top-1 accuracy among all one-step and two-step models. Although the mAP of our model on PRW only ranks third among all one-step and two-step methods, the methods ranking higher rely on knowledge distillation or external models, which are quite time- and resource-consuming.

Results on MovieNet-CS. Since we have three levels of training sets and three gallery sizes on MovieNet-CS, we conduct detailed comparisons between the baseline network (SeqNet) and our GLCNet under nine settings. As shown in Tables 6 and 7, GLCNet consistently outperforms SeqNet by a large margin in all settings on MovieNet-CS. In particular, a 34% relative improvement in mAP is achieved under the 'N=10'-'2K' setting. These results show that, on such a highly diverse dataset, our context-aware approach outperforms other methods by significant margins, even though CBGM is not used in the evaluation on MovieNet-CS.

 

Methods GPU (TFLOPs) Time (ms)
QEEPS[28] P6000 (12.6) 300
NAE[5] V100 (14.1) 83
NAE+[5] V100 (14.1) 98
SeqNet[20] V100 (14.1) 86
AlignPS[36] V100 (14.1) 61
AlignPS+[36] V100 (14.1) 67
GLCNet V100 (14.1) 97
Table 8: Runtime comparisons of different one-step models.

Runtime Comparison. Here, we compare the speed of different person search models. To be consistent with previous approaches, images are resized to 900×1500 pixels for processing. We run inference on a single Tesla V100 GPU with a batch size of 3. Table 8 reports the results: our GLCNet takes 97 milliseconds (ms) to process an image, which is still acceptable and even slightly faster than NAE+. Moreover, despite adding more modules on top of SeqNet, we increase the per-image time by only 11 ms.

5 Conclusion

In this paper, we propose a context-aware person search framework, i.e., the global-local context network (GLCNet), to exploit both global scene context and local group context in a unified manner. The proposed GLCNet injects the context information through feature enhancement at multiple stages, leading to discriminative final representations of the target person. Our approach outperforms existing methods on two widely-used person search datasets, i.e., CUHK-SYSU and PRW. Additional experiments on the large-scale MovieNet dataset demonstrate that our method has great potential for the more challenging task of character search.

6 Limitations and Future Work

Although our model achieves promising performance on existing person search datasets and MovieNet, it still has some limitations. First, the effectiveness of our context-aware method relies on the diversity of the context: on datasets with a low degree of diversity (e.g., PRW), the improvement brought by context is limited. In addition, the inference speed of our model could be further improved. In the future, more effort will be devoted to making the context learning more adaptive and the architecture simpler. Furthermore, character search is a more challenging task that deserves more attention; unlike in person search, facial information may play an important role in improving character search performance.

References

  • [1] Xiaojun Chang, Po-Yao Huang, Yi-Dong Shen, Xiaodan Liang, Yi Yang, and Alexander Hauptmann. Rcaa: Relational context-aware agents for person search. In ECCV, 2018.
  • [2] Di Chen, Shanshan Zhang, Wanli Ouyang, Jian Yang, and Bernt Schiele. Hierarchical online instance matching for person search. In AAAI, pages 10518–10525, 2020.
  • [3] Di Chen, Shanshan Zhang, Wanli Ouyang, Jian Yang, and Ying Tai. Person search via a mask-guided two-stream cnn model. In ECCV, 2018.
  • [4] Di Chen, Shanshan Zhang, Wanli Ouyang, Jian Yang, and Ying Tai. Person search by separated modeling and A mask-guided two-stream CNN model. IEEE Trans. Image Process., 29:4669–4682, 2020.
  • [5] Di Chen, Shanshan Zhang, Jian Yang, and Bernt Schiele. Norm-aware embedding for efficient person search. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12612–12621, 2020.
  • [6] Yanbei Chen, Xiatian Zhu, and Shaogang Gong. Instance-guided context rendering for cross-domain person re-identification. 2019 IEEE/CVF International Conference on Computer Vision, pages 232–242, 2019.
  • [7] Zhe Chen, Shaoli Huang, and Dacheng Tao. Context refinement for object detection. In ECCV, 2018.
  • [8] Santosh Kumar Divvala, Derek Hoiem, James Hays, Alexei A. Efros, and Martial Hebert. An empirical study of context in object detection. 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1271–1278, 2009.
  • [9] Wenkai Dong, Zhaoxiang Zhang, Chunfeng Song, and Tieniu Tan. Bi-directional interaction network for person search. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2836–2845, 2020.
  • [10] Wenkai Dong, Zhaoxiang Zhang, Chunfeng Song, and Tieniu Tan. Instance guided proposal network for person search. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2582–2591, 2020.
  • [11] Byeong-Ju Han, Kuhyeun Ko, and Jae-Young Sim. End-to-end trainable trident person search network using adaptive gradient propagation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 925–933, 2021.
  • [12] Chuchu Han, Kai Su, Dongdong Yu, Zehuan Yuan, Changxin Gao, Nong Sang, Yi Yang, and Changhu Wang. Weakly supervised person search with region siamese networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12006–12015, October 2021.
  • [13] Chuchu Han, Jiacheng Ye, Yunshan Zhong, Xin Tan, Chi Zhang, Changxin Gao, and Nong Sang. Re-id driven localization refinement for person search. 2019 IEEE/CVF International Conference on Computer Vision, pages 9813–9822, 2019.
  • [14] Alexander Hermans, Lucas Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. ArXiv, abs/1703.07737, 2017.
  • [15] Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Enhua Wu. Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42:2011–2023, 2020.
  • [16] Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin. Movienet: A holistic dataset for movie understanding. ArXiv, abs/2007.10937, 2020.
  • [17] Mohamed Ibn Khedher, Mounîm A. El-Yacoubi, and Bernadette Dorizzi. Probabilistic matching pair selection for surf-based person re-identification. 2012 BIOSIG - Proceedings of the International Conference of Biometrics Special Interest Group (BIOSIG), pages 1–6, 2012.
  • [18] Hanjae Kim, Sunghun Joung, Ig-Jae Kim, and Kwanghoon Sohn. Prototype-guided saliency feature learning for person search. In CVPR, pages 4865–4874, 2021.
  • [19] Xu Lan, Xiatian Zhu, and Shaogang Gong. Person search by multi-scale matching. In ECCV, 2018.
  • [20] Zhengjia Li and Duoqian Miao. Sequential end-to-end network for efficient person search. In AAAI, 2021.
  • [21] Jeong-Seon Lim, M. Astrid, Hyungjin Yoon, and Seung-Ik Lee. Small object detection using context and attention. 2021 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), pages 181–186, 2021.
  • [22] Giuseppe Lisanti, Niki Martinel, A. Bimbo, and Gian Luca Foresti. Group re-identification via unsupervised transfer of sparse features encoding. 2017 IEEE International Conference on Computer Vision, pages 2468–2477, 2017.
  • [23] Hao Liu, Jiashi Feng, Zequn Jie, Jayashree Karlekar, Bo Zhao, Meibin Qi, Jianguo Jiang, and Shuicheng Yan. Neural person search machines. 2017 IEEE International Conference on Computer Vision, pages 493–501, 2017.
  • [24] Jiawei Liu, Zhengjun Zha, Richang Hong, Meng Wang, and Yongdong Zhang. Dual context-aware refinement network for person search. Proceedings of the 28th ACM International Conference on Multimedia, 2020.
  • [25] Yong Liu, Ruiping Wang, S. Shan, and Xilin Chen. Structure inference net: Object detection using scene-level context and instance-level relationships. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6985–6994, 2018.
  • [26] Hao Luo, Youzhi Gu, Xingyu Liao, Shenqi Lai, and Wei Jiang. Bag of tricks and a strong baseline for deep person re-identification. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1487–1495, 2019.
  • [27] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Loddon Yuille. The role of context for object detection and semantic segmentation in the wild. 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 891–898, 2014.
  • [28] Bharti Munjal, Sikandar Amin, Federico Tombari, and Fabio Galasso. Query-guided end-to-end person search. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 811–820, 2019.
  • [29] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. ArXiv, abs/1912.01703, 2019.
  • [30] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39:1137–1149, 2015.
  • [31] Cheng Wang, Bingpeng Ma, Hong Chang, S. Shan, and Xilin Chen. Tcts: A task-consistent two-stage framework for person search. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11949–11958, 2020.
  • [32] Hao Xiao, Weiyao Lin, Bin Sheng, Ke Lu, Junchi Yan, Jingdong Wang, Errui Ding, Yihao Zhang, and Hongkai Xiong. Group re-identification: Leveraging and integrating multi-grain information. Proceedings of the 26th ACM international conference on Multimedia, 2018.
  • [33] Jimin Xiao, Yanchun Xie, Tammam Tillo, Kaizhu Huang, Yunchao Wei, and Jiashi Feng. IAN: the individual aggregation network for person search. Pattern Recognit., 87:332–340, 2019.
  • [34] Tong Xiao, Shuang Li, Bochao Wang, Liang Lin, and Xiaogang Wang. Joint detection and identification feature learning for person search. 2017 IEEE Conference on Computer Vision and Pattern Recognition, pages 3376–3385, 2017.
  • [35] Yichao Yan, Jinpeng Li, Shengcai Liao, Jie Qin, Bingbing Ni, Xiaokang Yang, and Ling Shao. Exploring visual context for weakly supervised person search. ArXiv, abs/2106.10506, 2021.
  • [36] Yichao Yan, Jingpeng Li, Jie Qin, Song Bai, Shengcai Liao, Li Liu, Fan Zhu, and Ling Shao. Anchor-free person search. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7686–7695, 2021.
  • [37] Y. Yan, J. Qin, B. Ni, J. Chen, L. Liu, F. Zhu, W. S. Zheng, X. Yang, and L. Shao. Learning multi-attention context graph for group-based re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [38] Yichao Yan, Qiang Zhang, Bingbing Ni, Wendong Zhang, Minghao Xu, and Xiaokang Yang. Learning context graph for person search. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2153–2162, 2019.
  • [39] Hantao Yao and Changsheng Xu. Joint person objectness and repulsion for person search. IEEE Trans. Image Process., 30:685–696, 2021.
  • [40] Xinyu Zhang, Xinlong Wang, Jia-Wang Bian, Chunhua Shen, and Mingyu You. Diverse knowledge distillation for end-to-end person search. In AAAI, pages 3412–3420, 2021.
  • [41] Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Xin Jin, and Zhibo Chen. Relation-aware global attention for person re-identification. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3183–3192, 2020.
  • [42] Liang Zheng, Hengheng Zhang, Shaoyan Sun, Manmohan Chandraker, Yang Yang, and Qi Tian. Person re-identification in the wild. 2017 IEEE Conference on Computer Vision and Pattern Recognition, pages 3346–3355, 2017.
  • [43] Wei-Shi Zheng, Shaogang Gong, and Tao Xiang. Associating groups of people. In BMVC, 2009.
  • [44] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. ArXiv, abs/1708.04896, 2020.
  • [45] Ji Zhu, Hua Yang, Weiyao Lin, Nian Liu, Jia Wang, and Wenjun Zhang. Group re-identification with group context graph neural networks. IEEE Transactions on Multimedia, 23:2614–2626, 2021.