
Grouped Adaptive Loss Weighting for Person Search

Person search is an integrated task of multiple sub-tasks such as foreground/background classification, bounding box regression and person re-identification. Therefore, person search is a typical multi-task learning problem, especially when solved in an end-to-end manner. Recently, some works enhance person search features by exploiting various auxiliary information, e.g. person joint keypoints, body part position, attributes, etc., which brings in more tasks and further complexifies a person search model. The inconsistent convergence rate of each task could potentially harm the model optimization. A straightforward solution is to manually assign different weights to different tasks, compensating for the diverse convergence rates. However, given the special case of person search, i.e. with a large number of tasks, it is impractical to weight the tasks manually. To this end, we propose a Grouped Adaptive Loss Weighting (GALW) method which adjusts the weight of each task automatically and dynamically. Specifically, we group tasks according to their convergence rates. Tasks within the same group share the same learnable weight, which is dynamically assigned by considering the loss uncertainty. Experimental results on two typical benchmarks, CUHK-SYSU and PRW, demonstrate the effectiveness of our method.


1. Introduction

Figure 1. Illustration of our proposed GALW. Different sub-tasks are presented as bubbles in different colors, whose areas are proportional to the convergence rates of the sub-tasks in an end-to-end model. We partition tasks with consistent convergence rates into a group that shares the same learnable loss weight, which is dynamically assigned by considering the loss uncertainty.

Person search has attracted the attention of many researchers in recent years, as it can be widely applied in video surveillance, social entertainment, etc. Person search is an integrated task of pedestrian detection and person re-identification (re-id). The goal of this task is to find the given person from a set of uncropped scene images.

Recent works on person search can be roughly divided into one-stage methods (Xiao et al., 2017; Liu et al., 2017; Chang et al., 2018; Yan et al., 2019; Munjal et al., 2019; Chen et al., 2020a; Dong et al., 2020a; Chen et al., 2020c; Kim et al., 2021; Li and Miao, 2021; Han et al., 2021; Yan et al., 2021a, b; Yang et al., 2020; Zhang et al., 2021; Liu et al., 2020) and two-stage methods (Zheng et al., 2017; Chen et al., 2020b; Lan et al., 2018; Han et al., 2019; Dong et al., 2020b; Yao and Xu, 2020; Wang et al., 2020). A typical two-stage method first trains a pedestrian detection model and then applies a re-id model on the detected pedestrian images. In contrast, one-stage methods optimize pedestrian detection and re-id simultaneously in an end-to-end manner. To increase the discrimination ability of models, some works introduce auxiliary tasks (Chen et al., 2022, 2020b; Han et al., 2021; Zhong et al., 2020) (e.g. pose estimation, attribute recognition, and human parsing) to provide guidance information for person search. Han et al. (Han et al., 2021) adopt part classification to obtain spatially fine-grained features. Zhong et al. (Zhong et al., 2020) extract part features from visible body parts and treat re-id as a partial feature matching procedure. Chen et al. (Chen et al., 2020b) explore the impact of background information on person search via semantic segmentation.

Although auxiliary information improves performance, jointly optimizing such a model containing multiple tasks becomes complicated, mainly because of the inconsistent convergence rates of different tasks. Therefore, designing an effective multi-task learning (MTL) strategy for model optimization is a great challenge.

To synchronize the convergence of different tasks, a straightforward solution is to manually assign different loss weights to these tasks. However, it is difficult to assign a suitable weight for each task manually. Automatic loss weighting strategies (Kendall et al., 2018; Liu et al., 2019; Chen et al., 2018; Guo et al., 2018) provide an alternative way to solve this problem. Kendall et al. (Kendall et al., 2018) propose a novel multi-task loss that uses homoscedastic uncertainty to weight tasks dynamically. Liu et al. (Liu et al., 2019) use dynamic weight averaging to balance the learning speed of each task. All these methods have achieved good results in MTL. However, by analyzing the impact of loss weighting on person search, we find that the performance degrades when too many tasks are optimized in an end-to-end manner.

To solve this problem, in this work, we adopt a task grouping strategy, which assembles many tasks into a small number of optimization groups. Existing methods (Vandenhende et al., 2021; Crawshaw, 2020; Standley et al., 2020; Fifty et al., 2021) have shown the efficacy of grouping different tasks, e.g. Fifty et al. (Fifty et al., 2021) determine task groups by employing a measure of inter-task affinity. However, these methods are inconvenient and consume time and resources, since task groups are usually associated with different networks that require separate training. Different from these methods, as shown in Fig. 1, we propose a grouped adaptive loss weighting (GALW) method that groups tasks according to their convergence rates within the same network. Specifically, we put tasks with similar gradient magnitude slopes into a group and exploit homoscedastic uncertainty learning to optimize the weights of the different task groups for person search. By doing this, we can dynamically learn the optimal loss weights, which makes optimization more effective and stable without extra computational costs. To verify the extendability of our method to more tasks, we additionally employ an attribute recognition network, whose rich features help alleviate mismatches between persons with similar appearances.

Figure 2. The overview of our network and loss functions. This framework uses ResNet-50 (He et al., 2016) as the backbone and contains three branches: the NAE branch (Chen et al., 2020c) (green part), the AlignPS branch (Yan et al., 2021a) (yellow part), and the auxiliary task branch (orange part), of which the first two come from ROI-AlignPS (Yan et al., 2021b). The loss function on the right corresponds to the task in each box. The model can be extended by adding auxiliary tasks. AFA denotes the aligned feature aggregation module used in (Yan et al., 2021a) and RPN (Ren et al., 2015) refers to the region proposal network.

In summary, our contributions are as follows:

  1. We provide an analysis of the effect of the uncertainty loss weighting strategy on person search. We find that the performance degrades when dealing with a large number of different tasks in person search, and that the issue of inconsistent convergence rates becomes more severe as the number of tasks grows.

  2. We propose a grouped adaptive loss weighting (GALW) method, which adjusts the weight of each task automatically and dynamically. We put tasks with similar convergence rates into a group that shares the same learnable loss weight, which makes model optimization more effective for person search.

  3. We achieve state-of-the-art results on the CUHK-SYSU dataset and competitive performance with less running time on the PRW dataset. Furthermore, we verify the generalization ability and extendability of GALW by applying GALW on different baselines and adding auxiliary tasks respectively.

2. Related Work

In this section, we first review existing works on person search. Since person search is a typical MTL problem, we also review some related works about MTL.

2.1. Person Search

The task of person search traditionally consists of two sub-tasks: pedestrian detection and re-id. Existing methods can be categorized into one-stage or two-stage models according to their training strategy (separately or end-to-end). DPM+IDE (Zheng et al., 2017) is the first two-stage framework that combines different detectors and re-id models to detect pedestrians first and then perform re-id using the cropped images. Based on that, several methods make further improvements and achieve better performance (Wang et al., 2020; Yao and Xu, 2020; Dong et al., 2020b; Han et al., 2019; Chen et al., 2020b; Lan et al., 2018). These two-stage models lack efficiency although they have better performance.

For one-stage methods, Xiao et al. (Xiao et al., 2017) first propose an end-to-end framework for person search by adding re-id layers after Faster-RCNN (Ren et al., 2015). Chen et al. (Chen et al., 2020c) introduce a norm-aware embedding method (NAE) which relieves the contradictory goals of pedestrian detection and person re-id by decomposing the feature embedding into norm and angle. Based on that, SeqNet (Li and Miao, 2021) achieves better performance by stacking NAE models. Kim et al. (Kim et al., 2021) present a prototype-guided attention module to obtain discriminative re-id features. Yan et al. (Yan et al., 2021a) first propose an anchor-free person search model (AlignPS) which addresses the problems of scale, region and task misalignment. They further introduce an advanced version (ROI-AlignPS (Yan et al., 2021b)) that takes advantage of both anchor-based and anchor-free models to enhance the final performance.

Recently, some one-stage works enhance person search features by exploiting various auxiliary information. Chen et al. (Chen et al., 2022) explore skeleton key points to update spatial-temporal features. Han et al. (Han et al., 2021) incorporate a part classification branch to generalize the features shared between pedestrian detection and re-id, further enhance the quality of spatial features according to the detection confidence, and prevent detection over-fitting in the latter part of the training.

Although these one-stage methods are simple and efficient, they assign different weights to different tasks manually and do not take model optimization into account. Especially with more tasks, how to balance the contributions of the losses when optimizing an end-to-end model becomes a non-negligible problem. Two-stage models do not face this problem, as they train the detector and the re-id model separately and never need to optimize all losses together. Since person search is a typical MTL problem and loss weighting is a straightforward way to optimize models by re-weighting losses during training, in this work, we analyze loss weighting with different numbers of tasks in the field of person search, which further helps us optimize this task.

2.2. Multi-Task Learning

MTL is a machine learning paradigm that is widely used in many fields such as computer vision (Liu et al., 2019; Zhao et al., 2018; Zhang et al., 2014), reinforcement learning (Akkaya et al., 2019), and natural language processing (Collobert and Weston, 2008; Collobert et al., 2011). In MTL, multiple tasks are trained simultaneously and their loss functions are optimized jointly within a single model. Existing MTL methods can be divided into three categories: architecture design (Zhao et al., 2018; Zhang et al., 2014), optimization strategies (Kendall et al., 2018; Liu et al., 2019; Chen et al., 2018; Guo et al., 2018) and task relationship learning (Vandenhende et al., 2021; Crawshaw, 2020).

Architecture-design based methods focus on which components can be shared and which are task-specific in order to obtain generalized features for each task (Liu et al., 2019; Zhao et al., 2018; Zhang et al., 2014). Optimization-strategy based methods aim to solve the task balancing problem during training so that model parameters are optimized at a faster learning speed. One of the most common techniques is loss weighting (Kendall et al., 2018; Liu et al., 2019; Chen et al., 2018; Guo et al., 2018). Kendall et al. (Kendall et al., 2018) use homoscedastic uncertainty in Bayesian modeling to weight losses, which makes the method more suitable for noisy data. Chen et al. (Chen et al., 2018) propose gradient normalization to balance the learning speed and magnitude of different losses, but it needs to compute gradients, which requires more GPU resources and increases training time. Guo et al. (Guo et al., 2018) present a dynamic task prioritization method that favors hard-to-learn tasks. Task-relationship learning based methods aim to learn the relationships between tasks and use the learned relationships to improve learning on these tasks. Task grouping is the most typical approach and has been explored in different ways. Standley et al. (Standley et al., 2020) analyze the factors that influence MTL and obtain better performance under a limited inference-time budget by using a network selection strategy. Fifty et al. (Fifty et al., 2021) propose a new measure of inter-task affinity to group tasks by quantifying the effect between tasks, and train the grouped tasks separately.

These loss weighting methods address training problems in terms of learning speed, performance, uncertainty and order of loss magnitude. Meanwhile, task grouping provides a way to learn an explicit representation of tasks or of the relationships between tasks. Different from the above techniques, we propose a grouped adaptive loss weighting method that exploits the advantages of both. We regard tasks with similar gradient magnitude trends as having consistent convergence rates and partition such tasks into the same group. We further exploit homoscedastic uncertainty learning to assign loss weights to the different groups and improve model performance without extra computational costs.

3. Method

Figure 3. Gradient magnitudes of different sub-tasks in ROI-AlignPS. We zoom in on the lower part of the left figure (orange solid rectangle) for a better view.

In this section, we first make an analysis on loss weighting to demonstrate our motivation, followed by the proposed method for person search.

3.1. Analysis on Loss Weighting

Person search is a typical MTL problem; however, some existing MTL strategies are not suitable for optimizing a person search model that contains many sub-tasks. In this subsection, we conduct a comprehensive analysis of the loss weighting strategy on person search. Specifically, we use two recent works, i.e. AlignPS (Yan et al., 2021a) and ROI-AlignPS (Yan et al., 2021b), as our baseline methods. As shown in Fig. 2, ROI-AlignPS contains 10 sub-tasks, while AlignPS is an anchor-free method which only contains 5 sub-tasks. Since the mutual learning of ROI-AlignPS is not performed in every iteration of the training phase, we redesign ROI-AlignPS by removing the mutual learning loss. We use the uncertainty loss (Kendall et al., 2018) as our loss weighting function (denoted as the uncertainty loss weighting function, ULWF, in the following), which uses homoscedastic uncertainty to automatically adjust weights. Compared to other loss weighting methods, the uncertainty loss is easy to implement, requires no gradient calculations, and does not introduce many extra parameters.
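For concreteness, the following is a minimal PyTorch sketch of such an uncertainty loss weighting function in the spirit of Kendall et al. (2018); the module name and the choice of parameterizing log(σ²) are our own illustrative conventions, not the exact implementation used here.

```python
import torch
import torch.nn as nn

class UncertaintyLossWeighting(nn.Module):
    """Homoscedastic-uncertainty loss weighting in the spirit of
    Kendall et al. (2018). Illustrative sketch, not the authors' code."""

    def __init__(self, num_tasks):
        super().__init__()
        # One learnable log-variance log(sigma_i^2) per task; 0 => sigma_i = 1.
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        # task_losses: iterable of scalar losses, one per sub-task.
        total = 0.0
        for i, loss in enumerate(task_losses):
            precision = torch.exp(-self.log_vars[i])                   # 1 / sigma_i^2
            total = total + precision * loss + 0.5 * self.log_vars[i]  # + log sigma_i
        return total
```

The log-variances are optimized jointly with the network parameters, so tasks whose losses remain large or noisy automatically receive smaller effective weights.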

We conduct experiments on PRW (Zheng et al., 2017), one of the typical person search datasets. The mean average precision (mAP) and Top-1 accuracy are used as the evaluation metrics; higher values indicate better person search performance. From Fig. 3, Tab. 1 and Tab. 2, we make the following observations and discussions:

Method / per-sub-task loss weights      mAP     Top-1

AlignPS
- - - - - 1 1 1 1 1                     45.90   81.90
- - - - - 10 1 1 1 10                   22.51   62.22
- - - - - 1 1 10 1 10                   39.03   79.21
- - - - - 1 1 10 10 1                   47.83   81.88

ROI-AlignPS
1 1 1 1 1 1 1 1 1 1                     48.68   84.15
10 10 1 10 1 1 10 1 1 10                47.45   82.47
10 1 10 1 10 1 10 10 10 1               46.41   81.88
1 1 1 1 10 10 10 10 10 1                49.79   83.65
1 1 10 1 10 10 10 10 10 1               50.30   84.30

Table 1. Effect of loss weighting on the performance of ROI-AlignPS (Yan et al., 2021b) and AlignPS (Yan et al., 2021a) when the loss of each sub-task is manually assigned a different weight (1 or 10). Dashes mark sub-tasks that are not present in AlignPS.
Method Task number mAP Top-1
AlignPS (Yan et al., 2021a) 5 45.58 81.90
AlignPS w/ ULWF 5 48.88 82.91
ROI-AlignPS (Yan et al., 2021b) 10 50.30 84.30
ROI-AlignPS w/ ULWF 10 45.76 82.77

Table 2. Effectiveness of ULWF on different numbers of tasks.
  1. Fig. 3 depicts the gradient magnitudes of different tasks in ROI-AlignPS during training. We can see that the convergence rates of different tasks are diverse, and some tasks show trends opposite to the others over training.

  2. Tab. 1 shows the performance of AlignPS and ROI-AlignPS when their sub-tasks are manually assigned different loss weights. We find that the performance gaps are large for both models: for example, the best configuration of AlignPS outperforms its worst configuration by 25.32 pp w.r.t. mAP and 19.66 pp w.r.t. Top-1. For MTL in person search, assigning a suitable weight to each task is important but hard to achieve manually.

  3. Tab. 2 explores the effectiveness of ULWF. Compared to the baseline AlignPS, using ULWF brings a significant improvement of 3.30 pp w.r.t. mAP. However, with ULWF, the mAP of ROI-AlignPS decreases by 4.54 pp. We find that ULWF is useful for learning a small number of tasks but causes performance degradation when optimizing too many tasks in an end-to-end manner. We attribute this under-performance of the loss weighting strategy to the inconsistent convergence rates of the many tasks.

3.2. Grouped Adaptive Loss Weighting

Based on the analysis in Sec. 3.1, we propose a grouped adaptive loss weighting method. In the following, we first introduce our regularized uncertainty loss weighting function (RULWF) in Sec. 3.2.1. Then the details of task grouping are described in Sec. 3.2.2.

3.2.1. Regularized Uncertainty Loss Weighting

We use homoscedastic uncertainty learning (Gal and Ghahramani, 2016; Kendall and Gal, 2017; Kendall et al., 2018) to conduct loss weighting. Take a classification task as an example. The Bayesian probabilistic likelihood of the model output is defined as:

p\left(y \mid f^{W}(x), \sigma_i\right) = \mathrm{Softmax}\!\left(\frac{1}{\sigma_i^{2}} f^{W}(x)\right) \qquad (1)

where f^W(x) is the output of the network and σ_i is the observation noise of task i, which is a learnable parameter. The log-likelihood of this output is:

\log p\left(y = c \mid f^{W}(x), \sigma_i\right) = \frac{1}{\sigma_i^{2}} f_c^{W}(x) - \log \sum_{c'=1}^{C} \exp\!\left(\frac{1}{\sigma_i^{2}} f_{c'}^{W}(x)\right) \qquad (2)

where C is the number of classes. The loss function can then be written as:

\mathcal{L}_i(W, \sigma_i) \approx \frac{1}{\sigma_i^{2}} \mathcal{L}_i(W) + \log \sigma_i \qquad (3)

where \mathcal{L}_i(W) denotes the original unweighted loss function of task i. In order to simplify the optimization objective, an approximation is used in the last transition, and it becomes an equality when σ_i → 1 (Kendall et al., 2018).

Furthermore, some changes need to be made to meet that assumption and to stabilize the training. We add a regularizer loss for each task:

\mathcal{L}_{\mathrm{reg}}^{i} = \left\lVert \sigma_i - 1 \right\rVert_{1} \qquad (4)

where ‖·‖_1 is the 1-norm. Then, the total loss is written as:

\mathcal{L}_{\mathrm{total}} = \sum_{i=1}^{N} \left( \frac{1}{\alpha_i \sigma_i^{2}} \mathcal{L}_i(W) + \log \sigma_i + \eta\, \mathcal{L}_{\mathrm{reg}}^{i} \right) \qquad (5)

in which α_i = 1 and α_i = 2 when task i is a classification task and a regression task, respectively, and η is the weight for the regularizer term. In practice, η is set to a fixed value. We will validate the effectiveness of this design in Sec. 4.3.
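A sketch of how Eq. (5) could be implemented is given below. The 1/(α_i σ_i²) scaling follows the reconstruction above, while the L1 term pulling σ_i toward 1 and the value of η are assumptions rather than details taken from the released code.

```python
import torch
import torch.nn as nn

class RegularizedUncertaintyWeighting(nn.Module):
    """Sketch of the regularized uncertainty loss weighting function (RULWF).
    The L1 term pulling sigma toward 1 and the value of eta are assumptions."""

    def __init__(self, is_regression, eta=0.01):
        super().__init__()
        self.is_regression = list(is_regression)  # one bool per task
        self.log_vars = nn.Parameter(torch.zeros(len(self.is_regression)))  # log(sigma_i^2)
        self.eta = eta                             # regularizer weight eta (assumed value)

    def forward(self, task_losses):
        total = 0.0
        for i, loss in enumerate(task_losses):
            sigma_sq = torch.exp(self.log_vars[i])
            alpha = 2.0 if self.is_regression[i] else 1.0               # alpha_i in Eq. (5)
            total = total + loss / (alpha * sigma_sq) + 0.5 * self.log_vars[i]
            # Regularizer keeping sigma close to 1 (assumed form of Eq. (4)).
            total = total + self.eta * torch.abs(torch.sqrt(sigma_sq) - 1.0)
        return total
```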

3.2.2. Task Grouping

As discussed in Sec. 3.1, when the number of tasks increases, the relationships between tasks become particularly intricate. Task grouping is one way of learning relationships between tasks and leveraging them to improve model training. We measure the gradient slope of each task and group tasks according to the similarity of their gradient slopes, which we deem an effective strategy for multi-task learning.

Specifically, we calculate the gradient magnitude of the shared parameters at each epoch:

G_i(t) = \frac{1}{N_s} \sum_{j=1}^{N_s} \left| \frac{\partial \mathcal{L}_i}{\partial \theta_j} \right| \qquad (6)

where N_s is the number of shared parameters θ_j. We use the average slope of each polyline in Fig. 3 as a measure of the task trend:

k_i = \frac{1}{T-1} \sum_{t=1}^{T-1} \big( G_i(t+1) - G_i(t) \big) \qquad (7)

where T is the number of epochs and k_i refers to the average slope of task i. In general, the convergence rate decreases and gradually plateaus over time. The average slopes of different tasks are too small to distinguish, so we process them further as follows:

\hat{k}_i = \mathrm{sgn}(k_i) \cdot \mathrm{Sigmoid}\big( \log_{10} \lvert k_i \rvert \big) \qquad (8)

where sgn(·) and log_10|·| are used to indicate the sign and the order of magnitude of k_i, respectively, and Sigmoid(·) is the sigmoid activation function, which makes the result insensitive to values whose orders of magnitude are large. Then, we partition {\hat{k}_i} into different groups by a hierarchical clustering algorithm (Nielsen, 2016), and the final loss function Eq. (5) turns into:

\mathcal{L}_{\mathrm{total}} = \sum_{g=1}^{M} \left( \frac{1}{\sigma_g^{2}} \mathcal{L}_g(W) + \log \sigma_g + \eta\, \mathcal{L}_{\mathrm{reg}}^{g} \right) \qquad (9)

where M refers to the number of groups and the loss function \mathcal{L}_g of group g equals the sum of the losses of the tasks within this group. We describe the detailed procedure in Algorithm 1.

0:  Input: image data from the dataset. Initialise: model shared parameters θ in the backbone, epoch number T, task number N, group number M;
0:  Output: model parameters θ;
1:  // For the first training;
2:  for epoch t = 1 to T do
3:     for task i = 1 to N do
4:        Calculate gradient magnitude G_i(t) of the shared parameters using Eq. (6);
5:     end for
6:  end for
7:  for task i = 1 to N do
8:     Gradient magnitude trend of task i: {G_i(1), ..., G_i(T)};
9:     Calculate average slope k_i using Eq. (7) and get \hat{k}_i by processing it with Eq. (8);
10:  end for
11:  Task grouping with hierarchical clustering using {\hat{k}_1, ..., \hat{k}_N};
12:  // For the second training;
13:  Group the N tasks into M groups;
14:  Train the model with the grouped tasks using Eq. (9);
15:  return Final model parameters θ;
Algorithm 1 Grouped Adaptive Loss Weighting Method
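To make the grouping step of Algorithm 1 concrete, the sketch below implements the first-training statistics and the clustering with NumPy and SciPy. The symbols G_i(t), k_i and \hat{k}_i follow Eqs. (6)-(8) as reconstructed above, and the sign/log/sigmoid processing is our assumption about the exact form of Eq. (8).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def group_tasks(grad_magnitudes, num_groups):
    """grad_magnitudes: array of shape (T, N) holding the per-epoch gradient
    magnitude G_i(t) of each of the N tasks (Eq. 6), recorded in the first
    training pass. Returns one group id (1..num_groups) per task."""
    # Average slope of each task's gradient-magnitude polyline (Eq. 7).
    slopes = np.diff(grad_magnitudes, axis=0).mean(axis=0)           # shape (N,)
    # Keep the sign, compress the order of magnitude with a sigmoid (Eq. 8, assumed form).
    eps = 1e-12
    compressed = np.sign(slopes) / (1.0 + np.exp(-np.log10(np.abs(slopes) + eps)))
    # Hierarchical clustering on the processed slopes (Nielsen, 2016).
    Z = linkage(compressed.reshape(-1, 1), method="average")
    return fcluster(Z, t=num_groups, criterion="maxclust")
```

In the second training pass, the losses of the tasks inside each group can be summed and fed to an uncertainty weighting module as in Sec. 3.2.1 with one learnable σ per group, which realizes Eq. (9).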

4. Experiments

In this section, we first introduce two typical datasets, CUHK-SYSU (Xiao et al., 2017) and PRW (Zheng et al., 2017), evaluation metrics of person search, followed by implementation details. Then we perform exhaustive ablation studies on PRW to examine the contributions of the proposed components. Finally, we show the experimental results in comparison to state-of-the-art methods.

4.1. Datasets and Settings

There are two typical datasets in person search, which we use in our experiments.

(1) CUHK-SYSU. CUHK-SYSU (Xiao et al., 2017) is a large-scale person search database providing 18,184 images, 96,143 pedestrian bounding boxes and 8,432 different identities. There are two kinds of images: video frames selected from movie snapshots and street/city scenes captured by a moving camera. The dataset is divided into standard train/test split, where the training set includes 11,206 images and 5,532 identities, while the testing set contains 6,978 gallery images with 2,900 query persons. Instead of using all the test images as a gallery, it defines a set of protocols with gallery sizes ranging from 50 to 4,000. We use the default gallery size of 100 in our experiments unless otherwise specified.

(2) PRW. The PRW dataset (Zheng et al., 2017) contains 11,816 frames sampled from videos captured with six cameras on a university campus. There are 43,110 bounding boxes, 34,304 of which are annotated with 932 identities, while the rest are marked as unknown identities. The training set provides 5,134 images with 482 different persons, while the testing set includes 2,057 query persons and 6,112 gallery images. Different from CUHK-SYSU, we use the whole gallery set as the search space for each query person.

Evaluation Metrics. Similar to re-id, we employ mAP and the cumulative matching characteristics Top-k score as performance metrics for person search. The mAP metric reflects the accuracy and matching rate of searching for a probe person among the gallery images. The Top-k score represents the percentage of queries for which at least one of the k proposals most similar to the query succeeds in the re-id matching.
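As a rough illustration of the Top-k metric (ignoring the detection/IoU matching that the full protocol also requires), a query counts as a hit if any of its k most similar gallery boxes carries the query identity; the function below is a simplified sketch, not the official evaluation script.

```python
import numpy as np

def top_k_rate(similarity, gallery_ids, query_ids, k=1):
    """similarity: (num_queries, num_gallery_boxes) matrix of re-id similarities.
    Returns the fraction of queries whose k most similar gallery boxes
    contain at least one box with the query identity (simplified sketch)."""
    hits = 0
    for q in range(similarity.shape[0]):
        top_k = np.argsort(-similarity[q])[:k]   # indices of the k best matches
        hits += int(np.any(np.asarray(gallery_ids)[top_k] == query_ids[q]))
    return hits / similarity.shape[0]
```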

4.2. Implementation Details

Our model consists of three branches as shown in Fig. 2: the NAE branch, the AlignPS branch and the auxiliary task branch. The first two make up our baseline ROI-AlignPS (Yan et al., 2021b). In addition, we add an attribute recognition branch as an auxiliary task to verify the extendability of our proposed method on more tasks, denoted as ROI-AlignPS-Attr. We extract different levels of proposal features, which are cropped and reshaped in a pyramid structure, and feed them into attribute localization modules (ALM) (Tang et al., 2019) to perform attribute prediction. The binary cross-entropy loss is adopted as the attribute loss. Since neither CUHK-SYSU nor PRW provides attribute labels, we use an off-the-shelf framework (Tang et al., 2019) to generate attribute pseudo-labels and directly conduct supervised learning on them. All frameworks employ ResNet-50 (He et al., 2016) pretrained on ImageNet (Deng et al., 2009) as the backbone, where the blocks from "conv2" to "conv4" are the shared layers.

During training, all experiments are conducted on a single NVIDIA TITAN RTX GPU for 24 epochs with an initial learning rate of 0.001, which is reduced by a factor of 10 at epochs 16 and 22. Following (Yan et al., 2021b), the momentum and weight decay of the stochastic gradient descent (SGD) optimizer are set to 0.9 and 0.0005, respectively. A warmup strategy is used for the first 300 steps. We adopt a multi-scale training strategy where the long sides of images are randomly resized between 667 and 2000 pixels during training. At test time, the scale of the test image is fixed. Our code will be made publicly available, and more implementation details can be found there.
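For reference, the training schedule above could be configured roughly as follows in PyTorch; the placeholder model and the specific warmup scheduler are illustrative assumptions, not the released training code.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # placeholder for the person search network of Fig. 2

optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0005)
# Decay the learning rate by a factor of 10 at epochs 16 and 22 (24 epochs in total).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[16, 22], gamma=0.1)
# Linear warmup over the first 300 iterations (stepped per iteration; one possible choice).
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.001, total_iters=300)
```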

4.3. Ablation Study

In this section, we conduct several analytical experiments on PRW dataset (Zheng et al., 2017) to better understand our proposed method.

4.3.1. Necessity of loss weighting and task grouping.

In order to verify the effect of loss weighting and task grouping in person search, we perform several experiments. As introduced in Sec. 3.2, our GALW consists of a loss weighting function (RULWF) and a task grouping strategy. We remove one of them at a time and report the performance in Tab. 3. The table shows that (1) our RULWF outperforms ULWF both without and with task grouping (rows 2 & 3 and rows 4 & 5); (2) adding our task grouping method to either ULWF or RULWF significantly improves performance, which means our task grouping method is beneficial for training; (3) the gap between the first and last rows demonstrates that our proposed GALW is an effective way to dynamically weight tasks.

ULWF   RULWF   Task Grouping   mAP     Top-1
-      -       -               50.30   84.30
✓      -       -               45.76   82.77
-      ✓       -               50.92   85.63
✓      -       ✓               52.42   85.43
-      ✓       ✓               52.89   86.07

Table 3. The performance of our method GALW with different components. ULWF means multi-task loss in (Kendall et al., 2018) while RULWF refers to regularized ULWF in our method.
GN Group Details mAP Top-1
2 {}, {} 51.99 85.53
3 {}, {}, {} 52.52 85.78
4 {},{},{}, {} 52.89 86.07

5 {},{},{}, {}, {} 52.55 85.43
6 {},{}, {}, {},{},{} 52.17 84.48
- - 50.92 85.63

Table 4. Comparative results on different numbers of groups. GN refers to the number of groups.

GS         Group Details       mAP     Top-1
branch     {}, {}              51.37   85.53
semantic   {}, {}              51.80   85.58
           {}, {}, {}          52.16   85.73
random     {}, {}, {}, {}      52.14   85.78
           {}, {}, {}, {}      49.58   84.54
ours       {}, {}, {}, {}      52.89   86.07
-          -                   50.92   85.63


Table 5. Comparative results on different grouping strategies. GS refers to group strategies.

4.3.2. Group number in hierarchical clustering algorithm.

We discuss the impact of different numbers of groups in hierarchical clustering algorithm (Nielsen, 2016). As shown in Tab. 4, we can see that different numbers of groups have different effects on performance. The best performance is achieved when the tasks are divided into four groups. In addition, no matter what the number of groups is, experiments with our task grouping method perform better than those without.

4.3.3. Grouping strategies in task grouping.

In Tab. 5, we compare the performance of different grouping strategies: by branch, semantic (one variant groups tasks into re-id and detection; another into re-id, bbox regression and the remaining tasks), random and ours. We find that (1) our grouping method outperforms the other three grouping strategies; (2) one random grouping divides the tasks into four groups, the same number as ours, yet its performance drops by 1.34 pp in mAP compared with the last row without any grouping strategy, which further demonstrates the effectiveness of our grouping method.

Method mAP Top-1
ROI-AlignPS (Yan et al., 2021b) 50.30 84.30
ROI-AlignPS-Attr 51.42 85.23
ROI-AlignPS-Attr w/ GALW 53.25 86.22
Table 6. Extendability of our method GALW to more tasks. ROI-AlignPS-Attr denotes our baseline with an added attribute recognition task.
Methods Group Number mAP Top-1
NAE (Chen et al., 2020c) - 43.3 80.9
NAE w/ GALW 3 43.6 81.0

AlignPS (Yan et al., 2021a) - 45.9 81.9
AlignPS w/ GALW 3 49.4 83.5
ROI-AlignPS (Yan et al., 2021b) - 50.3 84.3
ROI-AlignPS w/ GALW 4 52.9 86.1
Table 7. Generalization ability of our method GALW by applying GALW to different baselines.

4.3.4. Extendability of our method

In order to verify the extendability of our method to more tasks, we directly add an attribute recognition task as an auxiliary task, as shown in Fig. 2. We analyze two aspects: (1) whether the performance of the network improves after adding the auxiliary task, and (2) whether the performance improvement comes from our method. We group these tasks according to their convergence rates. As Tab. 6 shows, the improvement comes not only from the addition of the auxiliary task but also from our approach. This further verifies that our method can be applied to more tasks.

4.3.5. Generalization ability of our method

Our method can be generalized to other end-to-end methods. Tab. 7 shows that our GALW achieves excellent performance across various baselines both on anchor-based method NAE (Chen et al., 2020c) and anchor-free methods AlignPS (Yan et al., 2021a) and ROI-AlignPS (Yan et al., 2021b), which demonstrates the generalization ability of our method.

Methods CUHK-SYSU PRW
mAP Top-1 mAP Top-1

two-stage

DPM+IDE (Zheng et al., 2017) - - 20.5 48.3
CNN+MGTS (Chen et al., 2020b) 83.3 83.9 32.8 72.1
CNN+CLSA (Lan et al., 2018) 87.2 88.5 38.7 65.0
FPN+RDLR (Han et al., 2019) 93.0 94.2 42.9 70.2
IGPN (Dong et al., 2020b) 90.3 91.4 47.2 87.0
OR (Yao and Xu, 2020) 92.3 93.8 52.3 71.5
TCTS (Wang et al., 2020) 93.9 95.1 46.8 87.5

one-stage

OIM (Xiao et al., 2017) 75.5 78.7 21.3 49.4
NPSM (Liu et al., 2017) 77.9 81.2 24.2 53.1
RCAA (Chang et al., 2018) 79.3 81.3 - -
CTXG (Yan et al., 2019) 84.1 86.5 33.4 73.6
QEEPS (Munjal et al., 2019) 88.9 89.1 37.1 76.7
HOIM (Chen et al., 2020a) 89.7 90.8 39.8 80.4
BINet (Dong et al., 2020a) 90.0 90.7 45.3 81.7
NAE (Chen et al., 2020c) 91.5 92.4 43.3 80.9
PGA (Kim et al., 2021) 92.3 94.7 44.2 85.2
SeqNet (Li and Miao, 2021) 93.8 94.6 46.7 83.4
AGWF (Han et al., 2021) 93.3 94.2 53.3 87.7
AlignPS (Yan et al., 2021a) 93.1 93.4 45.9 81.9
ROI-AlignPS (Yan et al., 2021b) 95.4 96.0 51.6 84.4
ROI-AlignPS (Yan et al., 2021b), our baseline w/o mutual learning (Sec. 3.1) 95.0 95.3 50.3 84.3

ROI-AlignPS w/ GALW 95.6 96.3 52.9 86.1
Table 8. Comparison with state-of-the-art methods on CUHK-SYSU and PRW datasets. Best results are bold in red and the second results are bold in blue.
Figure 4. The mAP under different gallery sizes on CUHK-SYSU dataset. The dashed lines represent two-stage methods and the solid lines represent one-stage ones.

Methods                            P40 (11.8 TFLOPs)   RTX (16.3 TFLOPs)
NAE (Chen et al., 2020c)           158                  85
AlignPS (Yan et al., 2021a)        122                  65
SeqNet (Li and Miao, 2021)         178                  97
AGWF (Han et al., 2021)            145                  80
ROI-AlignPS (Yan et al., 2021b)    127                  69
ROI-AlignPS w/ GALW                127                  69


Table 9. Speed comparison on different GPUs. Runtimes are measured in milliseconds.
Figure 5. Top-1 search results for several samples. The orange, green and red bounding boxes denote the queries, correct and incorrect matches, respectively. A failure case is in the last row and we zoom the detected object for better view.

4.4. Comparison to the State-of-the-Art Methods

In Tab. 8, we compare our method with the state-of-the-art methods, including two-stage methods (Zheng et al., 2017; Chen et al., 2020b; Lan et al., 2018; Han et al., 2019; Dong et al., 2020b; Yao and Xu, 2020; Wang et al., 2020) and one-stage methods (Xiao et al., 2017; Liu et al., 2017; Chang et al., 2018; Yan et al., 2019; Munjal et al., 2019; Chen et al., 2020a; Dong et al., 2020a; Chen et al., 2020c; Kim et al., 2021; Li and Miao, 2021; Han et al., 2021; Yan et al., 2021a, b).

4.4.1. Results on CUHK-SYSU dataset.

The performance with our proposed method on CUHK-SYSU dataset (Xiao et al., 2017) is 95.6% and 96.3% in terms of mAP and Top-1 scores, respectively. Notably, whether compared with the one-stage methods or the two-stage methods, our method achieves the best performance, which is 0.6 and 1.0 pp. higher w.r.t. mAP and Top-1 than the baseline ROI-AlignPS.

In addition, we further evaluate the performance under larger search scopes, i.e., each query person is matched against galleries of different sizes. From Fig. 4, we can see that the mAPs of all methods decrease monotonically with increasing gallery size, which means it is more difficult to match a person within a larger scope. We also observe that the framework with our method GALW outperforms both one-stage and two-stage methods at all gallery sizes.

4.4.2. Results on PRW dataset.

On the PRW dataset (Zheng et al., 2017), all methods suffer from degraded performance due to the smaller number of training images and the larger gallery. The proposed method achieves performance competitive with AGWF (Han et al., 2021), the current state-of-the-art one-stage method with 53.3% mAP, while running faster (Sec. 4.4.3). Our method improves over the baseline by 2.6 and 1.8 pp w.r.t. mAP and Top-1 scores, respectively. This margin shows that our method is also robust on small datasets.

4.4.3. Runtime Comparison.

We compare the speed of different models on a P40 GPU and an RTX GPU, respectively. All methods are implemented in PyTorch (Paszke et al., 2019) without bells and whistles. We test inference time with input images of size 1500×900. As shown in Tab. 9, ROI-AlignPS with our method costs 127 and 69 milliseconds on the P40 and the RTX GPU, respectively. Our method is faster than AGWF (Han et al., 2021), the current state-of-the-art one-stage method, while achieving competitive performance on the PRW dataset.

4.4.4. Qualitative Results.

We show some qualitative search results of NAE (Chen et al., 2020c), AlignPS (Yan et al., 2021a), ROI-AlignPS (Yan et al., 2021b) and ROI-AlignPS w/ GALW in Fig. 5. We can see that when our proposed GALW is applied to ROI-AlignPS, it successfully handles cases including occlusion (row 1), viewpoint variation (rows 1 and 2) and scale variation (row 2). A failure case is illustrated in the last row, where the model still fails to distinguish persons with similar appearances.

5. Conclusions

In this paper, we propose GALW for person search. It is a great challenge to weight tasks automatically and dynamically, especially with the large number of different tasks in one-stage person search methods. Since person search is a typical multi-task problem and loss weighting is a straightforward way to address it, we first analyze different baselines with an existing loss weighting method. From the analysis, we find that it under-performs with a large number of tasks and that the issue of inconsistent convergence rates becomes more severe as the number of tasks increases. Motivated by these findings, we present the GALW method, which groups tasks according to their convergence rates and assigns each task group a learnable loss weight, making the training of person search models more effective. In addition, to further verify the generalization ability and extendability of GALW, we apply GALW to different baselines and introduce an attribute recognition task to our baseline network in Fig. 2 as an auxiliary task, respectively. Experimental results demonstrate that our method weights tasks more effectively and remains valid as more tasks are added.

Acknowledgements.
This work was supported in part by the National Natural Science Foundation of China (Grant No. 62172225) and the Fundamental Research Funds for the Central Universities (No. 30920032201).

References

  • I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, et al. (2019) Solving rubik’s cube with a robot hand. In arXiv, External Links: 1910.07113 Cited by: §2.2.
  • X. Chang, P. Huang, Y. Shen, X. Liang, Y. Yang, and A. G. Hauptmann (2018) RCAA: relational context-aware agents for person search. In European conference on computer vision, pp. 84–100. Cited by: §1, §4.4, Table 8.
  • D. Chen, A. Doering, S. Zhang, J. Yang, J. Gall, and B. Schiele (2022) Keypoint message passing for video-based person re-identification. In AAAI Conference on Artificial Intelligence, Cited by: §1, §2.1.
  • D. Chen, S. Zhang, W. Ouyang, J. Yang, and B. Schiele (2020a) Hierarchical online instance matching for person search. In AAAI Conference on Artificial Intelligence, pp. 10518–10525. Cited by: §1, §4.4, Table 8.
  • D. Chen, S. Zhang, W. Ouyang, J. Yang, and Y. Tai (2020b) Person search by separated modeling and a mask-guided two-stream CNN model. IEEE Transactions on Image Processing 29, pp. 4669–4682. Cited by: §1, §2.1, §4.4, Table 8.
  • D. Chen, S. Zhang, J. Yang, and B. Schiele (2020c) Norm-aware embedding for efficient person search. In Computer Vision and Pattern Recognition, pp. 12615–12624. Cited by: Figure 2, §1, §2.1, §4.3.5, §4.4.4, §4.4, Table 7, Table 8, Table 9.
  • Z. Chen, V. Badrinarayanan, C. Lee, and A. Rabinovich (2018) GradNorm: gradient normalization for adaptive loss balancing in deep multitask networks. In International Conference on Machine Learning, pp. 794–803. Cited by: §1, §2.2, §2.2.
  • R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa (2011) Natural language processing (almost) from scratch. Journal of machine learning research 12 (ARTICLE), pp. 2493–2537. Cited by: §2.2.
  • R. Collobert and J. Weston (2008) A unified architecture for natural language processing: deep neural networks with multitask learning. In International Conference on Machine Learning, pp. 160–167. Cited by: §2.2.
  • M. Crawshaw (2020) Multi-task learning with deep neural networks: a survey. In arXiv, External Links: 2009.09796 Cited by: §1, §2.2.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §4.2.
  • W. Dong, Z. Zhang, C. Song, and T. Tan (2020a) Bi-directional interaction network for person search. In Computer Vision and Pattern Recognition, pp. 2839–2848. Cited by: §1, §4.4, Table 8.
  • W. Dong, Z. Zhang, C. Song, and T. Tan (2020b) Instance guided proposal network for person search. In Computer Vision and Pattern Recognition, pp. 2585–2594. Cited by: §1, §2.1, §4.4, Table 8.
  • C. Fifty, E. Amid, Z. Zhao, T. Yu, R. Anil, and C. Finn (2021) Efficiently identifying task groupings for multi-task learning. In Advances in Neural Information Processing Systems, Vol. 34. Cited by: §1, §2.2.
  • Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059. Cited by: §3.2.1.
  • M. Guo, A. Haque, D. Huang, S. Yeung, and L. Fei-Fei (2018) Dynamic task prioritization for multitask learning. In European conference on computer vision, pp. 270–287. Cited by: §1, §2.2, §2.2.
  • B. Han, K. Ko, and J. Sim (2021) End-to-end trainable trident person search network using adaptive gradient propagation. In International Conference on Computer Vision, pp. 925–933. Cited by: §1, §2.1, §4.4.2, §4.4.3, §4.4, Table 8, Table 9.
  • C. Han, J. Ye, Y. Zhong, X. Tan, C. Zhang, C. Gao, and N. Sang (2019) Re-id driven localization refinement for person search. In International Conference on Computer Vision, pp. 9814–9823. Cited by: §1, §2.1, §4.4, Table 8.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, pp. 770–778. Cited by: Figure 2, §4.2.
  • A. Kendall, Y. Gal, and R. Cipolla (2018) Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Computer Vision and Pattern Recognition, pp. 7482–7491. Cited by: §1, §2.2, §2.2, §3.1, §3.2.1, Table 3.
  • A. Kendall and Y. Gal (2017) What uncertainties do we need in bayesian deep learning for computer vision?. In Advances in Neural Information Processing Systems, Vol. 30. Cited by: §3.2.1.
  • H. Kim, S. Joung, I. Kim, and K. Sohn (2021) Prototype-guided saliency feature learning for person search. In Computer Vision and Pattern Recognition, pp. 4865–4874. Cited by: §1, §2.1, §4.4, Table 8.
  • X. Lan, X. Zhu, and S. Gong (2018) Person search by multi-scale matching. In European conference on computer vision, pp. 536–552. Cited by: §1, §2.1, §4.4, Table 8.
  • Z. Li and D. Miao (2021) Sequential end-to-end network for efficient person search. In AAAI Conference on Artificial Intelligence, pp. 2011–2019. Cited by: §1, §2.1, §4.4, Table 8, Table 9.
  • H. Liu, J. Feng, Z. Jie, K. Jayashree, B. Zhao, M. Qi, J. Jiang, and S. Yan (2017) Neural person search machines. In International Conference on Computer Vision, pp. 493–501. Cited by: §1, §4.4, Table 8.
  • J. Liu, Z. Zha, R. Hong, M. Wang, and Y. Zhang (2020) Dual context-aware refinement network for person search. In ACM International Conference on Multimedia, pp. 3450–3459. Cited by: §1.
  • S. Liu, E. Johns, and A. J. Davison (2019) End-to-end multi-task learning with attention. In Computer Vision and Pattern Recognition, pp. 1871–1880. Cited by: §1, §2.2, §2.2.
  • B. Munjal, S. Amin, F. Tombari, and F. Galasso (2019) Query-guided end-to-end person search. In Computer Vision and Pattern Recognition, pp. 811–820. Cited by: §1, §4.4, Table 8.
  • F. Nielsen (2016) Introduction to HPC with MPI for data science. Vol. 1, pp. 195–211. Cited by: §3.2.2, §4.3.2.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, Vol. 32, pp. 8024–8035. Cited by: §4.4.3.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, Vol. 28. Cited by: Figure 2, §2.1.
  • T. Standley, A. Zamir, D. Chen, L. Guibas, J. Malik, and S. Savarese (2020) Which tasks should be learned together in multi-task learning?. In International Conference on Machine Learning, pp. 9120–9132. Cited by: §1, §2.2.
  • C. Tang, L. Sheng, Z. Zhang, and X. Hu (2019) Improving pedestrian attribute recognition with weakly-supervised multi-scale attribute-specific localization. In International Conference on Computer Vision, pp. 4997–5006. Cited by: §4.2.
  • S. Vandenhende, S. Georgoulis, W. Van Gansbeke, M. Proesmans, D. Dai, and L. Van Gool (2021) Multi-task learning for dense prediction tasks: a survey. IEEE Transactions on Pattern Analysis and Machine intelligence. Cited by: §1, §2.2.
  • C. Wang, B. Ma, H. Chang, S. Shan, and X. Chen (2020) TCTS: a task-consistent two-stage framework for person search. In Computer Vision and Pattern Recognition, pp. 11952–11961. Cited by: §1, §2.1, §4.4, Table 8.
  • T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang (2017) Joint detection and identification feature learning for person search. In Computer Vision and Pattern Recognition, pp. 3415–3424. Cited by: §1, §2.1, §4.1, §4.4.1, §4.4, Table 8, §4.
  • Y. Yan, J. Li, J. Qin, S. Bai, S. Liao, L. Liu, F. Zhu, and L. Shao (2021a) Anchor-free person search. In Computer Vision and Pattern Recognition, pp. 7690–7699. Cited by: Figure 2, §1, §2.1, §3.1, Table 1, Table 2, §4.3.5, §4.4.4, §4.4, Table 7, Table 8, Table 9.
  • Y. Yan, J. Li, J. Qin, S. Liao, and X. Yang (2021b) Efficient person search: an anchor-free approach. In arXiv, External Links: 2109.00211 Cited by: Figure 2, §1, §2.1, §3.1, Table 1, Table 2, §4.2, §4.2, §4.3.5, §4.4.4, §4.4, Table 6, Table 7, Table 8, Table 9.
  • Y. Yan, Q. Zhang, B. Ni, W. Zhang, M. Xu, and X. Yang (2019) Learning context graph for person search. In Computer Vision and Pattern Recognition, pp. 2158–2167. Cited by: §1, §4.4, Table 8.
  • W. Yang, D. Li, X. Chen, and K. Huang (2020) Bottom-up foreground-aware feature fusion for person search. In Proceedings of the 28th ACM International Conference on Multimedia. Cited by: §1.
  • H. Yao and C. Xu (2020) Joint person objectness and repulsion for person search. IEEE Transactions on Image Processing 30, pp. 685–696. Cited by: §1, §2.1, §4.4, Table 8.
  • W. Zhang, L. He, P. Chen, X. Liao, W. Liu, Q. Li, and Z. Sun (2021) Boosting end-to-end multi-object tracking and person search via knowledge distillation. In ACM International Conference on Multimedia, pp. 1192–1201. Cited by: §1.
  • Z. Zhang, P. Luo, C. C. Loy, and X. Tang (2014) Facial landmark detection by deep multi-task learning. In European conference on computer vision, pp. 94–108. Cited by: §2.2, §2.2.
  • X. Zhao, H. Li, X. Shen, X. Liang, and Y. Wu (2018) A modulation module for multi-task learning with applications in image retrieval. In European conference on computer vision, pp. 401–416. Cited by: §2.2, §2.2.
  • L. Zheng, H. Zhang, S. Sun, M. Chandraker, Y. Yang, and Q. Tian (2017) Person re-identification in the wild. In Computer Vision and Pattern Recognition, pp. 1367–1376. Cited by: §1, §2.1, §3.1, §4.1, §4.3, §4.4.2, §4.4, Table 8, §4.
  • Y. Zhong, X. Wang, and S. Zhang (2020) Robust partial matching for person search in the wild. In Computer Vision and Pattern Recognition, pp. 6827–6835. Cited by: §1.