Learning Multi-Attention Context Graph for Group-Based Re-Identification

by Yichao Yan, et al.

Learning to re-identify or retrieve a group of people across non-overlapped camera systems has important applications in video surveillance. However, most existing methods focus on (single) person re-identification (re-id), ignoring the fact that people often walk in groups in real scenarios. In this work, we take a step further and consider employing context information for identifying groups of people, i.e., group re-id. We propose a novel unified framework based on graph neural networks to simultaneously address the group-based re-id tasks, i.e., group re-id and group-aware person re-id. Specifically, we construct a context graph with group members as its nodes to exploit dependencies among different people. A multi-level attention mechanism is developed to formulate both intra-group and inter-group context, with an additional self-attention module for robust graph-level representations by attentively aggregating node-level features. The proposed model can be directly generalized to tackle group-aware person re-id using node-level representations. Meanwhile, to facilitate the deployment of deep learning models on these tasks, we build a new group re-id dataset that contains more than 3.8K images with 1.5K annotated groups, an order of magnitude larger than existing group re-id datasets. Extensive experiments on the novel dataset as well as three existing datasets clearly demonstrate the effectiveness of the proposed framework for both group-based re-id tasks. The code is available at https://github.com/daodaofr/group_reid.



1 Introduction

Person re-identification (re-id) aims to re-identify individuals across multi-camera surveillance systems. Over the past few years, person re-id has received increasing attention due to its great potential in many real-world applications, such as searching for suspects or lost people. In addition, it is also a fundamental research topic in computer vision and pattern recognition. In a typical person re-id pipeline, the system is provided with a target person as probe and aims to search through a gallery of known ID records to find a match. Usually, the probe and the gallery consist of human detection results or manually annotated bounding boxes, as shown in Fig. 1(a). The major challenges in person re-id include distinguishing different people sharing a similar appearance (e.g., wearing similar clothes), or, conversely, retrieving the same person that undergoes significant appearance changes (due to pose/viewpoint variations, different illumination conditions, and occlusions). Although sophisticated deep learning architectures and metric learning schemes have greatly improved the individual representation capability, the conventional setting for person re-id neglects the fact that people are likely to walk in groups in real scenarios.

Based on the above fact, in this paper, we aim at solving two group-based re-identification problems, i.e., group re-id and group-aware (single) person re-id. As shown in Fig. 1(b) and Fig. 1(c), respectively, group re-id aims to find target groups given a group of interest, whilst group-aware person re-id is essentially a person re-id task, differing only in the fact that it considers context information from neighboring group members. We believe that, on the one hand, group re-id can greatly facilitate the understanding of group behaviors; while, on the other hand, rich context within a group can also be beneficial for single person re-id. However, there are very limited existing works [99, 11, 64, 49, 47] that address group-based re-id tasks, which is mainly due to the following challenges. First of all, in addition to the difficulties already faced in person re-id, multiple subjects in a group bring more challenges, which conventional person re-id models cannot directly deal with. A straightforward solution may be to pool or average individual features, but this is not a robust solution since it does not allow groups with similar centroids to be distinguished. Second, group members are not always spatially close to each other (see Fig. 1(b)), and the group layout can vary greatly depending on the viewpoint. Therefore, directly utilizing the contents in group bounding boxes for re-id is not feasible. Third, the number of group members varies for different groups, and even for the same group may change across frames. Learning a robust group-level representation with such data discrepancies is a non-trivial problem. Last but not least, most previous methods employ hand-crafted features for group representation, which goes against current trends of employing deep learning models. This is probably due to the lack of large-scale group re-id datasets. Therefore, it is highly desirable to develop a deep learning based model that not only addresses the above challenges for group re-id, but also makes good use of the context information in groups to boost the performance of person re-id.

(a) Person re-id
(b) Group re-id
(c) Group-aware person re-id
Fig. 1: Illustration of traditional person re-id, group re-id and group-aware (single) person re-id. (a) Person re-id only measures the similarity between individual pairs, and its performance is easily influenced by occlusions or people wearing similar clothes. (b) Group re-id aims to associate different groups of people, which is more challenging compared with single person re-id. (c) In group-aware (single) person re-id, we explore the visual context information in the group as additional guidance to learn more robust representations. For instance, the man in the black suit can be better re-identified with the context information from his neighboring group members.

To address the aforementioned challenges, we propose a novel framework named Multi-Attention Context Graph (MACG), which is a unified model for both group re-id and group-aware person re-id. Specifically, as shown in Fig. 2, we model each group as a context graph, where the nodes refer to group members. The advantages of our context graph representation are two-fold. First, a graph-level representation can be obtained through feature aggregation from nodes, which inherently addresses the challenge of group layout and/or group member changes. Second, node features in the group can benefit from the context information through the use of Graph Neural Networks (GNNs), which greatly facilitate information propagation through graph edges. In this way, group re-id can be formulated as a graph-level feature learning task. Furthermore, we propose a multi-level attention mechanism to address the challenges of learning group context information. For the node-level representation, intra- and inter-graph attention modules are proposed for encoding the context information within the same graph and across different graphs, respectively. These modules are partially inspired by the recent attention mechanisms for GNNs (e.g., GAT [76] and GAM [42]). We further build a higher-level attention mechanism for aggregating node-level features to obtain the final graph-level representation. This module adaptively finds the most representative individuals within the group and thus increases the discrimination capability of group-level features. Note that all three attention modules work collaboratively and are optimized jointly. In practice, graph-level representations are directly employed for group re-id, while node-level representations can be leveraged to address group-aware person re-id.

In addition, as previously mentioned, the lack of large-scale datasets for group re-id significantly hinders algorithmic development (and especially the development of deep learning approaches) in this important research field. As a matter of fact, the current largest group re-id dataset [47] only contains 177 groups from 354 images, causing the learned models to easily over-fit to the training data. To facilitate the deployment of deep learning models on this task and to improve the generalizability of the learned models, we build a new group re-id dataset containing 3,839 group images of 1,558 annotated groups, which is an order of magnitude larger than existing datasets. Since this dataset is built upon CUHK-SYSU [86], which was originally built for the person search task, we denote the novel dataset as CUHK-SYSU-Group (CSG).

The main contributions of this work are as follows:

  1. We build a novel dataset, i.e., CUHK-SYSU-Group (CSG), which is currently the largest group re-id dataset. We believe the dataset will significantly boost the research in this important field.

  2. We propose a novel unified framework, namely MACG, for both group re-id and group-aware person re-id tasks. Context information exploited by GNNs largely enhances person(node)-level and group(graph)-level representation capabilities.

  3. A multi-level attention mechanism is developed to capture both intra- and inter-group context. Moreover, the final graph-level representation is learned from node-level features in a self-attentive way.

  4. The proposed framework is systematically evaluated on the novel CSG dataset as well as three existing group re-id datasets, where the experimental results clearly show that our MACG outperforms the state-of-the-art methods by a large margin in terms of both group-based re-id tasks.

The remainder of this paper is organized as follows. Section 2 introduces related work. The proposed MACG framework is presented in Section 3. Section 4 elaborates the details of the newly collected CSG dataset, including the annotation rule and the statistics. In Section 5, the performance of MACG on CSG, as well as on three other existing group re-id datasets, is evaluated by comparison with the state-of-the-art methods. Finally, we draw some conclusions and summarize future research directions in Section 6.

2 Related Work

Person Re-Identification. Person re-id aims to associate pedestrians across non-overlapping cameras. Most previous methods try to address this task from two perspectives, i.e., feature representation and distance metric learning. Before the emergence of deep learning methods, previous approaches resorted to designing various types of hand-crafted features, such as color [53], texture [27], and gradient [32, 10]. These methods achieved certain success on small datasets. However, the representation capability of hand-crafted features is limited for large-scale searches. Similar limitations also apply to traditional distance metric learning methods [98, 93, 40, 22], which aim to optimize a distance function based on certain features. However, the learned metric usually over-fits to the training data, and thus has limited generalization ability.

With its recent renaissance [55, 81], deep learning was first introduced to address person re-id in [95, 43], and has largely dominated this field ever since. A large number of works have proposed different model structures for this task [95, 46, 89, 14, 71, 106, 100, 6, 7]. Some methods employ end-to-end feature and metric learning [2, 66, 19, 94, 102, 20, 80, 18, 17], while others exploit part information to build more robust representations [67, 48, 72, 84, 96, 57, 16]. Recently, generative adversarial networks (GANs) [31] have begun to be employed in re-id models [101, 51, 83, 61, 30] to address the lack of training data and domain adaptation for this task. Based on the learned features, various re-ranking strategies [104, 8] have further been proposed to refine the retrieval results. Recently, person search methods [86, 50, 15, 13, 58] have been proposed to jointly address the task of person detection and re-identification, facilitating real-world applications. These approaches have achieved promising results on recent person re-id benchmarks. However, they all focus only on learning individual appearance features based on given human bounding boxes. When the person appears in a group, visual context information provided by this group is ignored. In this work, we address two group-based re-id tasks by exploiting the group context information.

Group Re-Identification. Existing group re-id datasets [99, 47, 23] are essentially small-scale, which greatly hinders the research on this task. As a result, there are currently only a few works [99, 11, 64, 49, 47] addressing the group re-id task, most of which are based on hand-crafted features. In this work, we build a large-scale group re-id dataset to facilitate group feature learning. Employing group information could be a promising direction to further improve single person re-id. Although some methods [3, 12] have made efforts towards this for group-aware person re-id, we build a novel graph-based context learning framework upon GNNs, which effectively makes use of group context cues to improve the model.

Graph Neural Networks. Graph Neural Networks (GNNs) [65] allow graph representations to be learnt with neural networks. GNNs were originally developed for structured data [26, 59, 24, 41, 60], but have recently been generalized to various computer vision tasks. For instance, Wang et al. [82] built a similarity graph and a spatial-temporal graph to model intra-object and inter-object relations, respectively, in a video, which achieves great performance in action recognition. Chen et al. [21] applied GNNs to multi-label image recognition. They built a graph over object labels and utilized GNNs to map the label graph to a set of object classifiers, obtaining image-level labels. Zheng et al. [103] formulated the visual dialog task as the inference of the unknown nodes in a graphical model. They modeled the given question as known nodes in a graph, while the answer was formulated as a node with a missing value, which can be approximated with differentiable GNNs. Some recent works [85, 69, 68] explored GNNs for the task of person re-id, without considering group information. Although graph-based models typically employ the information from neighbor points, the critical idea is to select suitable neighbors and to pass specific messages in a proper way according to the characteristics of the specific task. In this sense, our framework is clearly different from existing graph methods. For example, shape index [62] addresses the task of facial landmark detection, and the locations of individual landmarks depend on the local pixel context. In this case, the neighbors are defined as the points within a local region around the predicted landmark, and the messages are passed by calculating pixel difference features. In contrast, our task is to measure the similarity between groups, which depends on the similarity between individual persons across groups. Therefore, in this work, we define the neighbors according to inter-group similarity, and we build multi-attention GNNs to model the rich context and to better propagate the node features.

Attention Mechanisms. The attention mechanism [5] was originally introduced in machine translation to automatically search for the most relevant parts in the source sentence and predict a target word. In computer vision tasks, attention mechanisms were designed to discover the important spatial regions in an image or the critical frames in a video. They have been broadly applied in various tasks, such as image recognition [4, 90, 78], captioning [88, 54, 92], and person re-id [44, 87, 29]. They have also been widely employed in graph models to adaptively learn the node weights for feature propagation [76, 1, 75, 34, 9]. In this work, we build both intra-graph and inter-graph attention modules to learn the contribution of context nodes, as well as a readout attention module for node feature aggregation. These three attention modules are trained in a collaborative way for robust feature learning.

Fig. 2: Illustration of the proposed Multi-Attention Context Graph for group-based re-identification. First of all, the individual features are extracted by a CNN model for group member representation. Second, we construct a context graph for each group by connecting all the group members with graph nodes. We then design an intra-graph and an inter-graph attention module to learn the within-group and between-group dependencies, respectively. Afterwards, node-level features are aggregated for global group-level and local person-level correspondence learning, respectively.

Note that the preliminary version of this work [91] was published at CVPR 2019. Compared with [91], in this paper we provide the following additional contributions. 1) The preliminary work only considers group-aware person re-id, while this work further addresses group re-id in a unified framework. 2) The preliminary work only employs a vanilla GNN model for group context mining, while this work incorporates a multi-level attention mechanism for more discriminative feature learning. 3) Experiments in [91] are based on CUHK-SYSU, which provides no explicit group information. In this paper, we have further manually annotated and purified the group information in this dataset. Consequently, a new large dataset, containing much more labeled group information than existing benchmarks, is constructed, which can benefit deep models in learning group context information more accurately and robustly.

3 Multi-Attention Context Graph

3.1 Overview of the Method

Group re-identification is a very challenging and complicated task, since it not only suffers from difficulties occurring in single person re-id (e.g., occlusions, human pose variations, and background clutter), but is also confronted with new problems such as group layout and group member variations. Therefore, the keys to re-identifying a group are two-fold. First, the model should be able to understand individual group members, i.e., to address the single person re-id task. Second, the model should also be able to build a layout-invariant and noise-insensitive group-level representation for robust group re-id.

To address these challenges, we propose a novel framework, namely Multi-Attention Context Graph (MACG), for addressing group-based re-id tasks. Overall, MACG can be regarded as a unified framework that is capable of 1) extracting discriminative features of individual group members, 2) updating individual features using context information within and between groups, and 3) achieving group-level representations for group re-identification. The pipeline of our framework is illustrated in Fig. 2. Specifically, the model receives pairs of groups as input, and then extracts individual features from the members of each group. Subsequently, we construct a fully connected context graph to model the characteristics of each group. Node features of the context graph are updated with intra-graph and inter-graph information through two attention mechanisms, and group-level information is obtained by aggregating node features with another self-attention mechanism. The detailed architecture is elaborated in the following subsections.

3.2 Individual Representation

Individual person representation has been extensively studied in the context of single person re-id. Among the state-of-the-art methods, part-based models have been proven to be effective for robust person representation, especially in the case of partial occlusion. In our problem, people within a group are more likely to be occluded by other group members, which motivates us to also consider modeling human parts for our group re-id task. To this end, we design a part-based learning framework to effectively model part features. Specifically, we evenly divide the human body into $P$ parts ($P=4$ in our experiments). We construct several part-based pooling layers on top of the feature map from the last convolutional layer (conv5_3) of ResNet-50 [36]. Each part-based pooling layer concentrates on a specific human part and pools the features (i.e., a specific region in the CNN feature maps) into 2048-dimensional vectors. These features are further utilized to construct our context graph for the subsequent graph-based feature learning procedure. The pipeline for individual feature extraction is illustrated in Fig. 3.
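The part-based pooling step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the text does not say which pooling operator is used, so average pooling over evenly sized horizontal stripes is assumed here, and the function name `part_pool` and the toy feature map are hypothetical.

```python
def part_pool(feature_map, num_parts=4):
    """Average-pool a C x H x W feature map into `num_parts` horizontal stripes.

    feature_map: list of C channels, each a list of H rows of W floats.
    Returns `num_parts` vectors, each of length C (one per body part).
    Assumption: average pooling over equal stripes; the paper's exact
    pooling operator is not specified in the text.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    if height % num_parts != 0:
        raise ValueError("height must be divisible by num_parts in this sketch")
    stripe = height // num_parts
    parts = []
    for p in range(num_parts):
        rows = range(p * stripe, (p + 1) * stripe)
        vec = []
        for c in range(channels):
            values = [v for r in rows for v in feature_map[c][r]]
            vec.append(sum(values) / len(values))
        parts.append(vec)
    return parts


# Toy map with 2 channels, 8 rows and 3 columns.
# Channel 0 stores the row index, channel 1 is constant.
toy_map = [
    [[float(r)] * 3 for r in range(8)],
    [[1.0] * 3 for _ in range(8)],
]
part_features = part_pool(toy_map, num_parts=4)
```

With a real backbone the channel count would be 2048, so each person would be described by four 2048-dimensional part vectors.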


3.3 Context Graph Construction

Person-level and part-level features of each group member only provide local individual representations within a group. For group re-id involving multiple people, it is more important to capture global context by aggregating local information into a global group-level representation. A trivial solution to this problem could be to use an ad-hoc feature aggregation method (such as average/max pooling). However, this may lead to significant loss of discriminative information. As an alternative, one could also employ learning-based aggregation methods, such as Recurrent Neural Networks (RNNs), but these models are unable to capture the structural dependencies in the scene. Therefore, the discrimination capability of the group representations generated by such aggregation methods is limited.

Fig. 3: Illustration of individual feature extraction and context graph construction. We employ a part-based representation for each person, and part-level features are stacked as input node features in the graph. All the graph nodes are connected through edges, which makes the graph insensitive to the group layout and spatial dependency within the group.

In this work, inspired by the recent success of GNNs, we propose a novel context graph model for group re-id. In GNNs, a graph is constructed to connect individual features and model the dependencies within the graph. These models can be viewed as flexible tools for both global representation learning (e.g., graph classification) and local representation learning (e.g., node classification). Specifically, in our context graph model, we first associate each group member with a graph node; thus the node-level features are the individual feature representations of the corresponding members. Second, we build a fully connected graph based on all the members in a group, in which all the nodes are connected to each other with equally weighted edges. Note that the graph could also be built to reflect the group layout, i.e., by only connecting the members that are spatially close to each other (or connecting them with edges of larger weights). However, our aim is to learn the identity information of the group, which should be invariant to the group layout. Therefore, edges with identical weights are more suitable for our task. Based on the context graph, we can perform reasoning in a straightforward manner, with a designated graph architecture, for which this work takes advantage of graph neural networks with attention mechanisms. The learned node-level representation can be naturally aggregated into a group-level representation for group re-id. Meanwhile, the learned node-level features are also beneficial for single person re-id.

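Notation: Description
$I_i$: $i$-th input group image
$I_i^j$: Cropped image of the $j$-th person in $I_i$
$N_i$: Number of people in $I_i$
$P$: Number of person parts
$G_i$: Graph of group context in $I_i$
$A_i$: Adjacency matrix of $G_i$
$V_i$: Vertex set of $G_i$
$E_i$: Edge set of $G_i$
$h_{ij}^{(t)}$: Feature vector of person $j$ in the $t$-th layer of $G_i$
$h_{ijp}^{(t)}$: $p$-th part feature vector of person $j$ in the $t$-th layer of $G_i$
$m_{ijp}^{(t)}$: $p$-th intra-part message of person $j$ in the $t$-th layer of $G_i$
$n_{ijp}^{(t)}$: $p$-th inter-part message of person $j$ in the $t$-th layer of $G_i$
$u_{ijp}^{(t)}$: $p$-th inter-graph message of person $j$ in the $t$-th layer of $G_i$
$S$: Affinity matrix between graphs
$X$: Permutation matrix denoting node correspondence
TABLE I: Descriptions of key notations used in this paper.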

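Formally, given a group image $I_i$ containing $N_i$ members $\{I_i^1, \dots, I_i^{N_i}\}$, we extract individual features of each group member with the CNN model proposed in Sec. 3.2:

$$h_{ij}^{(0)} = \mathrm{CNN}(I_i^j), \quad j = 1, \dots, N_i, \tag{1}$$

where $h_{ij}^{(0)} = [h_{ij1}^{(0)}, \dots, h_{ijP}^{(0)}]$, and $h_{ijp}^{(0)}$ ($p = 1, \dots, P$) denotes the feature of the $p$-th part of the $j$-th individual in the image $I_i$. The individual features serve as the input features for the graph nodes. We then build a graph $G_i = (V_i, E_i)$ consisting of $N_i$ vertices $V_i$ and a set of edges $E_i$. We use $A_i \in \mathbb{R}^{N_i \times N_i}$ to denote the adjacency matrix associated with graph $G_i$, where we set

$$A_i(j, k) = 1, \quad \forall j, k \in \{1, \dots, N_i\}. \tag{2}$$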

Please refer to TABLE I for the notations used throughout the paper.
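As a minimal sketch of the graph construction described in this subsection, the following plain-Python snippet builds a fully connected context graph with equally weighted edges. The function name `build_context_graph` and the `self_loops` flag are hypothetical; whether self-connections are included is an assumption, since the text only states that all members are connected with identical weights.

```python
def build_context_graph(num_members, self_loops=True):
    """Return (adjacency, edges) of a fully connected, equally weighted graph.

    adjacency[j][k] = 1.0 for every connected pair of group members.
    `self_loops` is an assumption of this sketch, not taken from the paper.
    """
    adjacency = [
        [1.0 if (self_loops or j != k) else 0.0 for k in range(num_members)]
        for j in range(num_members)
    ]
    # Directed edge list between distinct members (message senders/receivers).
    edges = [
        (j, k)
        for j in range(num_members)
        for k in range(num_members)
        if j != k
    ]
    return adjacency, edges


adjacency, edges = build_context_graph(3)
```

Because every pair of members is connected with the same weight, the adjacency matrix depends only on the group size and not on the order or spatial positions of the members, which is the layout invariance argued for above.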

3.4 Context Graph Learning

Once the graph is built and the features of input nodes are obtained, the next step is to promote individual features into a graph-level representation for robust group matching. To generate a robust group representation, it is necessary to take the following considerations into account: (1) Different group members have varying importance. For instance, occluded and outlier members are less important than members who are highly discriminative. (2) The group similarity is computed based on group pairs rather than individual groups. It is thus necessary to measure the mutual influence of individual group members of one group on the other. Therefore, it is highly important that the model be capable of understanding both intra- and inter-group context. To this end, we propose a novel GNN-based framework called Multi-Attention Context Graph (MACG), where multiple attention mechanisms are proposed to discover and match discriminative features within and between groups. Specifically, MACG contains three attention modules for graph learning, which are detailed as follows.

3.4.1 Intra-Group Attention

To explore the context dependency within a group, graph neural networks allow nodes to encode and send messages to their neighbors through edge connections. The node representation is then updated accordingly by aggregating the received messages. Existing models have designed different strategies to aggregate messages, which can be generally summarized into two categories, i.e., pooling-based and attention-based aggregators. For a graph node, pooling-based aggregators simply perform average/max/sum pooling on all the messages sent to the node. This operation is straightforward and has been widely adopted in various GNN models [39, 26, 33]. However, pooling-based aggregators consider each message as being of equal importance, ignoring the correlations between nodes within the group. In our task, the correlations could be important cues for identifying groups. For example, occluded group members usually contain less discriminative information for identifying a group, and thus the messages coming from those nodes are less important than those from other nodes. In contrast to pooling-based strategies, attention-based aggregators [76, 1, 75] calculate the importance weights of messages, and aggregate the messages by taking the weighted sums. We therefore develop an attention-based mechanism for intra-group message passing.

Fig. 4: Illustration of intra- and inter-part attentions for a single node. Intra-part attention only receives messages from node features corresponding to the same body part, while inter-part attention receives messages from different part features.

Different from previous GNN models, where each node usually contains a single feature vector, every node in our context graph contains part-level features. Here, we introduce a part-level attention for each node, which not only considers the correlations between part-level features from the same part across different nodes, but also explores the dependencies of different parts from different nodes in case of misalignment or pose variations. In this way, the intra-graph attention is composed of two kinds of attention mechanisms, i.e., intra-part and inter-part attention.

For intra-part correlation computation, we consider a feature pair $(h_{ijp}^{(t)}, h_{ikp}^{(t)})$ from the $i$-th group image $I_i$, where the features come from different nodes $j \neq k$ but correspond to the same person part $p$. The importance of the intra-part message sent from $h_{ikp}^{(t)}$ to $h_{ijp}^{(t)}$ can be calculated as follows:

$$e_{jk}^{p} = f\big(W h_{ijp}^{(t)}, W h_{ikp}^{(t)}\big), \tag{3}$$

where $f(\cdot, \cdot)$ is a function that measures the correlation between the inputs, and $W$ is a weight matrix that transforms the input features into higher-level representations. Following GAT [76], we formulate $f$ as a fully connected layer based on the concatenation of the inputs. After obtaining all the importance weights $e_{jk}^{p}$, we calculate the attention weights by normalizing the importance weights with the softmax function:

$$\alpha_{jk}^{p} = \frac{\exp(e_{jk}^{p})}{\sum_{l \neq j} \exp(e_{jl}^{p})}. \tag{4}$$

Then, the intra-part messages passed to node $j$ can be aggregated by combining the neighbors’ features with the corresponding attention weights:

$$m_{ijp}^{(t)} = \sum_{k \neq j} \alpha_{jk}^{p} \, W h_{ikp}^{(t)}. \tag{5}$$


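Inter-part message passing can also be formulated with the attention mechanism similar to the intra-part attention, except that a node only receives messages from inter-part features. In this case, we consider a feature pair $(h_{ijp}^{(t)}, h_{ikq}^{(t)})$, where the features come from different nodes $j \neq k$ and correspond to different person parts $p \neq q$. The importance of messages passed from $h_{ikq}^{(t)}$ to $h_{ijp}^{(t)}$ is

$$e_{jk}^{pq} = f\big(W h_{ijp}^{(t)}, W h_{ikq}^{(t)}\big), \tag{6}$$

with $f$ and $W$ playing the same roles as in the intra-part attention. Then the attention weights and inter-part message can be calculated respectively as follows:

$$\beta_{jk}^{pq} = \frac{\exp(e_{jk}^{pq})}{\sum_{l \neq j} \sum_{r \neq p} \exp(e_{jl}^{pr})}, \tag{7}$$

$$n_{ijp}^{(t)} = \sum_{k \neq j} \sum_{q \neq p} \beta_{jk}^{pq} \, W h_{ikq}^{(t)}. \tag{8}$$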

Fig. 5: Illustration of inter-graph attention for a single node. We compute the node-level similarity between inter-graph nodes, and part-level features share the same set of attention weights.

After obtaining the intra-part message $m_{ijp}^{(t)}$ and the inter-part message $n_{ijp}^{(t)}$ for the $p$-th part of person $j$ in graph $G_i$, they are further concatenated as the intra-graph message. Also note that we can employ multiple attention heads to calculate both the intra- and inter-part attentions, as in GAT [76]. The results obtained could then be concatenated or averaged for better representation. The intra-graph attention including both intra- and inter-part attention mechanisms is illustrated in Fig. 4.
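The intra-graph attention described in this subsection can be sketched in plain Python as follows. This is a single-head illustration under several assumptions: the scoring function is a bias-free linear layer on the concatenated inputs (GAT additionally applies a LeakyReLU, which is omitted here), the same hypothetical parameters `W` and `a` are reused for intra-part and inter-part attention, and all function names are invented for the example.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a non-empty list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(target, sources, W, a):
    """Softmax-weighted sum of projected sources, scored against the target.

    The score is a linear layer `a` applied to [W*target || W*source].
    """
    proj_t = matvec(W, target)
    proj_s = [matvec(W, s) for s in sources]
    scores = [sum(w * v for w, v in zip(a, proj_t + s)) for s in proj_s]
    weights = softmax(scores)
    return [sum(w * s[d] for w, s in zip(weights, proj_s))
            for d in range(len(proj_t))]


def intra_graph_message(nodes, j, p, W, a):
    """Concatenate intra-part and inter-part messages for part p of node j.

    nodes[k][q] is the feature vector of part q of group member k.
    Requires at least two members and two parts per member.
    """
    others = [k for k in range(len(nodes)) if k != j]
    same_part = [nodes[k][p] for k in others]
    other_parts = [nodes[k][q] for k in others
                   for q in range(len(nodes[k])) if q != p]
    intra = attend(nodes[j][p], same_part, W, a)
    inter = attend(nodes[j][p], other_parts, W, a)
    return intra + inter  # list concatenation


# Three members, two parts each, 2-dimensional part features.
nodes = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
    [[4.0, 0.0], [0.0, 4.0]],
]
identity = [[1.0, 0.0], [0.0, 1.0]]
# With a zero scoring vector all attention weights are uniform,
# so each message reduces to a plain average of the sources.
uniform_message = intra_graph_message(nodes, 0, 0, identity, [0.0] * 4)
```

With a zero scoring vector the attention degenerates to average pooling, which makes the contrast with pooling-based aggregators explicit: learned scores are what allow occluded or outlier members to be down-weighted.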

3.4.2 Inter-Group Attention

Within-group context can be effectively modeled with the above-mentioned intra-group message passing. However, for group re-id, the objective is to calculate the similarity between group pairs. It is therefore necessary to explore the inter-group correlation. Intuitively, if a group pair shares the same ID, it is highly possible that there exists some correspondence between the individuals in the two groups. Meanwhile, a higher similarity for the individual pairs indicates a higher group-level similarity. Consider a node-level feature pair $(h_{aj}^{(t)}, h_{bk}^{(t)})$ from two graphs $(G_a, G_b)$, where $j \in V_a$ and $k \in V_b$. We calculate the importance weights of inter-graph features as

$$e_{jk} = f_s\big(W_s h_{aj}^{(t)}, W_s h_{bk}^{(t)}\big), \tag{9}$$

where $f_s$ is simply an inner product layer since we only need to measure the similarity between graph nodes, and $W_s$ is a projection matrix. Similarly, the inter-graph attention weights $\gamma_{jk}$ and the message $u_{ajp}^{(t)}$ passed from graph $G_b$ to the $p$-th part of person $j$ in graph $G_a$ can be calculated respectively as follows:

$$\gamma_{jk} = \frac{\exp(e_{jk})}{\sum_{l \in V_b} \exp(e_{jl})}, \tag{10}$$

$$u_{ajp}^{(t)} = \sum_{k \in V_b} \gamma_{jk} \, h_{bkp}^{(t)}. \tag{11}$$

Note that here we consider person-level similarity for inter-graph attention calculations. Therefore, the part-level features in a node share the same set of inter-graph attention weights (i.e., $\gamma_{jk}^{1} = \dots = \gamma_{jk}^{P} = \gamma_{jk}$). Fig. 5 illustrates the inter-graph attention mechanism.
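A minimal plain-Python sketch of the inter-graph attention follows. It assumes that the person-level feature is the stack of part features, that similarity is the inner product of projected person-level features, and that the resulting weights are shared by all parts of a node, as described above. The names `inter_graph_messages` and `W_s` are hypothetical, and the unprojected part features of the other graph are aggregated here as a simplification.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def flatten(node):
    """Person-level feature: all part features of a node stacked together."""
    return [v for part in node for v in part]


def inter_graph_messages(graph_a, graph_b, j, W_s):
    """Messages from all nodes of graph_b to every part of node j in graph_a.

    Attention weights are computed once per node pair (person level) and
    reused for every part, so parts share the same weights.
    """
    query = matvec(W_s, flatten(graph_a[j]))
    keys = [matvec(W_s, flatten(node)) for node in graph_b]
    weights = softmax([dot(query, key) for key in keys])
    messages = []
    for p in range(len(graph_a[j])):
        dim = len(graph_b[0][p])
        messages.append([
            sum(w * node[p][d] for w, node in zip(weights, graph_b))
            for d in range(dim)
        ])
    return messages, weights


# One member in graph A; graph B contains a look-alike and a different person.
graph_a = [[[1.0, 0.0], [0.0, 1.0]]]
graph_b = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
identity4 = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
messages, weights = inter_graph_messages(graph_a, graph_b, 0, identity4)
```

In this toy pair the look-alike node in graph B receives the larger attention weight, so the message sent to the node in graph A is dominated by its likely counterpart.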

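After obtaining the intra-graph (including intra- and inter-part) and inter-graph messages, the node features are updated with fully connected layers by concatenating previous features and all three types of messages:

$$h_{ijp}^{(t+1)} = \mathrm{FC}\big(\big[\, h_{ijp}^{(t)} \,\|\, m_{ijp}^{(t)} \,\|\, n_{ijp}^{(t)} \,\|\, u_{ijp}^{(t)} \,\big]\big), \tag{12}$$

where $\|$ denotes concatenation, and $m_{ijp}^{(t)}$, $n_{ijp}^{(t)}$, and $u_{ijp}^{(t)}$ are the intra-part, inter-part, and inter-graph messages for the $p$-th part of person $j$ in graph $G_i$.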

Algorithm 1 Multi-Attention Context Graph Learning
Input: Cropped images from a pair of group images $(I_a, I_b)$, group label $y_g$, person labels $y_{jk}$, the ground-truth permutation matrix $X$
Output: Multi-Attention Context Graph (MACG) Network
while not converged do
   Extract features by Eq. (1)
   Build the context graphs $G_a$, $G_b$ by Eq. (2)
   for $t = 1, \dots, T$ do
      Calculate the intra-part messages for both graphs by Eqs. (3)-(5)
      Calculate the inter-part messages for both graphs by Eqs. (6)-(8)
      Calculate the inter-group messages for both graphs by Eqs. (9)-(11)
      Update the node representations by Eq. (12)
   end for
   Extract the graph-level features by Eqs. (13)-(15)
   Calculate the losses by Eqs. (16)-(21)
   Back-propagate and update the MACG Network
end while

3.4.3 Correspondence Learning

The above feature updating steps using intra- and inter-graph attention mechanisms are repeated for a fixed number of rounds, and the model is then trained to learn group and person correspondence, respectively. We first construct a graph-level representation via the readout operation. Here, we simply apply self-attention on graph nodes and the final graph representation is a weighted sum of the node-level features:


The representation of the other graph in the pair is calculated in the same manner.
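As a rough illustration of the readout step, the sketch below scores each node, normalizes the scores with a softmax, and returns the weighted sum of node features. The scoring vector `w` is a hypothetical stand-in for the learned self-attention parameters, whose exact form is not recoverable from the text here.

```python
import math

def readout(node_feats, w):
    """Attention-weighted sum of node features.

    node_feats: list of feature vectors, one per graph node.
    w: hypothetical scoring vector (an assumption of this sketch).
    Returns (graph_feature, attention_weights).
    """
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in node_feats]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    beta = [e / z for e in exps]
    dim = len(node_feats[0])
    graph_feat = [sum(b * x[d] for b, x in zip(beta, node_feats))
                  for d in range(dim)]
    return graph_feat, beta
```

With equal scores the readout reduces to mean pooling, which is the baseline the attention-based readout is compared against in the ablation study.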

Fig. 6: Illustration of the inference stages for group re-id and group-aware person re-id.

To learn group correspondence, the pairwise loss function is adopted to pull features of the same group close together and push different groups far apart:


where the pair label indicates whether the two graphs share the same group ID or come from different groups, and the margin is a hyper-parameter. Note that, similar to Graph Matching Net [45], a triplet loss can also be applied for training the context graph.
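A margin-based pairwise loss of this kind can be sketched as below. The exact formula used in the paper is not reproduced here; this example assumes the common contrastive form with Euclidean distance, and the function name and default margin are illustrative only.

```python
import math

def pairwise_loss(feat_a, feat_b, same, margin=1.0):
    """Contrastive-style pairwise loss (assumed form, for illustration).

    Pairs with the same ID are penalized by their squared distance;
    pairs with different IDs are penalized only if they are closer
    than the margin.
    """
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(feat_a, feat_b)))
    if same:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

The same function applies to both levels: graph-level features for group correspondence and node-level features for person correspondence.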

In addition to considering global group-level representations, local person-level information is also important for constructing global correspondence. For person-level correspondence learning, we also employ a pair-wise loss function:


where the person-level label indicates whether a given node in the first graph and a given node in the second graph belong to the same identity. In addition, we employ a permutation cross-entropy loss [79] to learn better person-level correspondence. Specifically, an affinity matrix is first computed to measure the node-level affinity between two graphs:


where a learnable weight matrix parameterizes the affinity function, together with a hyper-parameter. A Sinkhorn operator [70] is then applied to the affinity matrix to generate the permutation matrix:


where the Sinkhorn operator iteratively performs row-wise and column-wise normalization on the input matrix until convergence. Finally, the cross-entropy loss is applied between the predicted permutation matrix and the ground-truth matrix:


where the ground-truth matrix is binary, with an entry equal to 1 if the corresponding person in the first graph and the corresponding person in the second graph belong to the same identity.
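The Sinkhorn normalization and the permutation loss can be illustrated with the sketch below. It assumes a square, strictly positive affinity matrix, a fixed number of iterations in place of a convergence check, and a binary cross-entropy form for the loss; these choices and the function names are assumptions of this example.

```python
import math

def sinkhorn(M, n_iters=50):
    """Alternate row-wise and column-wise normalization of a positive
    square matrix, yielding an approximately doubly stochastic matrix."""
    S = [row[:] for row in M]
    n = len(S)
    for _ in range(n_iters):
        for i in range(n):
            r = sum(S[i])
            S[i] = [v / r for v in S[i]]
        for j in range(n):
            c = sum(S[i][j] for i in range(n))
            for i in range(n):
                S[i][j] /= c
    return S

def permutation_cross_entropy(S, P_gt, eps=1e-12):
    """Binary cross-entropy between a soft permutation S and a binary
    ground-truth matrix P_gt (assumed form of the permutation loss)."""
    loss = 0.0
    for s_row, p_row in zip(S, P_gt):
        for s, p in zip(s_row, p_row):
            loss -= p * math.log(s + eps) + (1 - p) * math.log(1 - s + eps)
    return loss
```

For an affinity matrix whose largest entries lie on the diagonal, the loss against the identity permutation is lower than against the swapped one, which is the behaviour the training signal relies on.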

During training, the overall loss is a linear combination of all the above loss functions:


We summarize the training process of our framework in Algorithm 1.

3.5 Model Inference

After the model is trained, we can perform two types of group-based re-id tasks, as illustrated in Fig. 6. Specifically, the graph-level representations from the readout attention module are directly employed for group re-id. For group-aware person re-id, the node-level features already contain discriminative context information as they receive messages from both intra-group and inter-group members. In addition, we can also utilize the person correspondence learning module to further reduce the ambiguity between people with similar appearances. The inference of group re-id and group-aware person re-id can be jointly computed in our framework. The results of group re-id and group-aware person re-id are discussed in Sec. 5.2 and Sec. 5.3, respectively.

4 CUHK-SYSU-Group Dataset

In addition to the proposed MACG model, another important contribution of this paper is the newly collected CUHK-SYSU-Group (CSG) dataset. To the best of our knowledge, CSG is the largest dataset for group re-identification, consisting of more than 3.8K images from 1.5K labeled groups. In the following, we introduce the detailed collection process and the statistical distribution of this dataset. It is worth noting that our CSG dataset is suitable for evaluating both group re-id and group-aware single person re-id methods, owing to the instance-level annotation.

4.1 Dataset Collection

Creating a large-scale group re-id dataset is very challenging. First of all, there is no strict guideline on how to define a group. People that are geometrically close to each other can often be considered as a group, and existing datasets typically follow this criterion. However, we argue that group members are not necessarily spatially close. Please refer to the example in Fig. (b)b, where we can observe interactions between group members, even if they are far away from each other. In this situation, we still consider them to be a group. Therefore, during annotation we manually determine a group by jointly considering the spatial and interactive cues. Secondly, data collection and group annotation are both labor-intensive. This is probably why the size of existing datasets is rather limited. In this work, we construct the group re-id dataset based on the CUHK-SYSU dataset, which was originally designed for the person search task. We take advantage of the person-level IDs and bounding box annotations in CUHK-SYSU for group annotation. Specifically, we first identify the people that simultaneously appear in different images as group candidates. Then we manually select the groups according to the spatial layout and interactions between people. Following the previous protocol [47], we annotate groups with the same ID when they have more than 60% of members in common. We further consolidate the dataset by refining group annotations with the help of ten volunteers. Specifically, after the groups are annotated, we only keep a group when at least seven volunteers agree on the correctness of its annotation.
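The two annotation rules above (the 60% member overlap and the seven-of-ten volunteer agreement) can be expressed as a short sketch. The text does not say which group size the 60% is measured against, so this example assumes the larger of the two groups; the function names are hypothetical.

```python
def same_group(members_a, members_b, threshold=0.6):
    """Return True if two sets of person IDs share more than
    `threshold` of their members.

    Assumption: the overlap ratio is taken relative to the larger group.
    """
    a, b = set(members_a), set(members_b)
    if not a or not b:
        return False
    shared = len(a & b)
    return shared / max(len(a), len(b)) > threshold

def keep_annotation(votes, min_agree=7):
    """Keep a group annotation if at least `min_agree` volunteers
    (out of the ten mentioned in the text) judged it correct."""
    return sum(1 for v in votes if v) >= min_agree
```

For instance, a three-person group that reappears with one additional member shares 3 of 4 members under this assumption and would keep the same group ID.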

| Datasets | MCTS [99] | RG [47] | DG [47] | CSG |
|---|---|---|---|---|
| # Image | 274 | 324 | 354 | 3839 |
| # Group | 64 | 162 | 177 | 1558 |
| # Viewpoint | 8 | 2 | 8 | diversified |
TABLE II: Statistical comparisons between CSG and existing group re-id datasets. MCTS denotes i-LIDS MCTS dataset, RG denotes Road Group dataset and DG denotes DukeMTMC Group dataset.
Fig. 7: Visualization of some examples from the CUHK-SYSU-Group dataset. This dataset contains various challenging situations, such as occlusions, group layout and member changes, lighting condition changes, etc.
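| Method | CSG R-1 | CSG R-5 | CSG R-10 | CSG R-20 | i-LIDS MCTS R-1 | i-LIDS MCTS R-5 | i-LIDS MCTS R-10 | i-LIDS MCTS R-20 | DukeMTMC Group R-1 | DukeMTMC Group R-5 | DukeMTMC Group R-10 | DukeMTMC Group R-20 | Road Group R-1 | Road Group R-5 | Road Group R-10 | Road Group R-20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CRRRO-BRO [99] | 10.4 | 25.8 | 37.5 | 51.2 | 23.3 | 54.0 | 69.8 | 82.7 | 9.9 | 26.1 | 40.2 | 64.9 | 17.8 | 34.6 | 48.1 | 62.2 |
| Covariance [11] | 16.5 | 34.1 | 47.9 | 67.0 | 26.5 | 52.5 | 66.0 | 90.9 | 21.3 | 43.6 | 60.4 | 78.2 | 38.0 | 61.0 | 73.1 | 82.5 |
| PREF [49] | 19.2 | 36.4 | 51.8 | 70.7 | 30.6 | 55.3 | 67.0 | 92.6 | 22.3 | 44.3 | 58.5 | 74.4 | 43.0 | 68.7 | 77.9 | 85.2 |
| BSC+CM [107] | 24.6 | 38.5 | 55.1 | 73.8 | 32.0 | 59.1 | 72.3 | 93.1 | 23.1 | 44.3 | 56.4 | 70.4 | 58.6 | 80.6 | 87.4 | 92.1 |
| MGR [47] | 57.8 | 71.6 | 76.5 | 82.3 | 38.8 | 65.7 | 82.5 | 98.8 | 48.4 | 75.2 | 89.9 | 94.4 | 80.2 | 93.8 | 96.3 | 97.5 |
| MACG | 63.2 | 75.4 | 79.7 | 84.4 | 45.1 | 70.4 | 84.9 | 99.1 | 57.4 | 79.0 | 90.3 | 94.3 | 84.5 | 95.0 | 96.9 | 98.1 |
TABLE III: Comparison with the state-of-the-art group re-id methods. R-k (k=1, 5, 10, 20) denotes the Rank-k accuracy (%).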

4.2 Dataset Statistics

Our CSG dataset contains 3,839 images of 1,558 groups, with about 3.5K annotated people and 9K bounding boxes. Each group contains at least two people and the largest group contains eight people. The average number of group members on CSG is 2.5. We statistically compare CSG with three existing group re-id datasets, i.e., i-LIDS MCTS [99], Road Group [47] and DukeMTMC Group [64, 47], as listed in TABLE II. We can see that CSG is an order of magnitude larger than existing datasets with respect to the number of images and groups. Additionally, the viewpoints and data sources of CSG are more diverse than in existing datasets. For example, existing datasets only contain images from surveillance cameras with fixed camera viewpoints. In contrast, in CSG (as with CUHK-SYSU), images are captured from hand-held cameras as well as movie snapshots, which have much more diversified viewpoints, backgrounds and lighting conditions. Some examples from CSG are illustrated in Fig. 7. We can observe from the examples that this dataset contains various challenging scenarios, including occlusions, group layout and member changes, and lighting condition changes, and these challenges may even co-exist in a single group. On CSG, about 26% of the groups contain persons that are occluded, while only 4% of the images are captured from exactly the same viewpoint, so that most group images exhibit layout/member or lighting condition changes. All the above factors make the CSG dataset more challenging and suitable for the task of group re-id.

5 Experimental Results

In this section, we evaluate our model on its ability to associate groups and group members. We conduct extensive experiments on our CSG dataset and three existing group re-id datasets. We first compare our model with the state-of-the-art methods, and then evaluate our method on group-aware person re-id, by directly employing node features. We also conduct a comprehensive ablation study to verify the contribution of each model component, as well as the sensitivity of the proposed framework under different settings and parameters. We further visualize the attention weights to better understand the attention mechanisms. Finally, we analyze the impacts of pedestrian and group detectors.

5.1 Implementation Details and Experimental Setup

We employ ResNet50 [36] pretrained on ImageNet [25] as the backbone. The person images are resized to 256×128 as inputs. The initial learning rate is set to 0.0003 and is reduced by a factor of 10 at the 80-th and 160-th epochs, with the training stage terminating at the 200-th epoch. As groups are of different sizes, we construct graphs with equal numbers of nodes to facilitate implementation, and add dummy nodes to the groups with fewer members. We only perform person correspondence learning on positive group pairs, as there exists no correspondence for negative pairs. We use two-layer GNNs in our framework. We train our model on one Tesla V100 GPU and it takes about 28 hours for the model to converge on the CSG dataset.
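The learning-rate schedule and the dummy-node padding described above can be sketched as follows. The sketch assumes zero-based epoch indices, zero vectors as dummy node features, and a binary mask to mark real nodes; none of these details is stated in the text, and the function names are illustrative.

```python
def learning_rate(epoch, base_lr=3e-4, milestones=(80, 160), gamma=0.1):
    """Step schedule: multiply the rate by `gamma` at each milestone.

    Assumption: `epoch` is zero-based and the drop takes effect from
    the milestone epoch onward.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops

def pad_group(node_feats, num_nodes, dim):
    """Pad a group to `num_nodes` nodes with zero-valued dummy nodes.

    Returns (padded_features, mask), where mask[i] is 1 for a real
    person and 0 for a dummy node.
    """
    n_dummy = max(0, num_nodes - len(node_feats))
    padded = [list(f) for f in node_feats] + [[0.0] * dim for _ in range(n_dummy)]
    mask = [1] * len(node_feats) + [0] * n_dummy
    return padded, mask
```

A mask of this kind would let attention and readout steps ignore dummy nodes, although how the released code handles them is not specified here.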

We manually split CSG into fixed training and test sets, where 859 groups are utilized for training and the remaining 699 groups for testing. We also ensure that the groups in the test set are captured from different viewpoints. During testing, the images in the test set are sequentially selected as the probe, while all the remaining images are regarded as the gallery set. In this way, there is no overlap of viewpoints between the probe and gallery images. Additionally, we add 5K group images containing 20K person bounding boxes as distractors in the gallery set, making it comparable in scale to traditional person re-id datasets. As for i-LIDS MCTS, DukeMTMC Group, and Road Group, we randomly partition the datasets into training sets and test sets with equal sizes. The final result is obtained by averaging the results of 10 random splits. We use the Cumulative Matching Characteristics (CMC) as the evaluation metric.
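For reference, the Rank-k accuracy reported by the CMC metric can be computed as in the sketch below. It assumes that each probe already has a gallery ranking sorted by decreasing similarity; the function name and input layout are choices made for this example.

```python
def cmc(ranked_labels_per_query, query_labels, ks=(1, 5, 10, 20)):
    """Rank-k accuracy (%) for each k in `ks`.

    ranked_labels_per_query: for each probe, the gallery labels sorted
    from most to least similar.
    query_labels: the true label of each probe.
    A probe counts as a hit at rank k if its label appears among the
    top-k gallery entries.
    """
    hits = {k: 0 for k in ks}
    for ranked, q in zip(ranked_labels_per_query, query_labels):
        for k in ks:
            if q in ranked[:k]:
                hits[k] += 1
    n = len(query_labels)
    return {k: 100.0 * hits[k] / n for k in ks}
```

With two probes, one matched at rank 1 and the other at rank 2, this returns 50.0 for k=1 and 100.0 for k=2.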

5.2 Group Re-Identification

To evaluate the group re-id performance, we compare our full model with several state-of-the-art methods specifically designed for group re-id: CRRRO-BRO [99], Covariance [11], PREF [49], BSC+CM [107] and MGR [47]. The quantitative results are reported in TABLE III.

We find that the performance of most previous methods is limited, mainly due to the following two reasons. Firstly, most previous methods employ hand-crafted features for group representation. Secondly, these methods merely consider the global representation of the entire group, ignoring rich context information from group members. For example, CRRRO-BRO [99] designs a center rectangular ring ratio-occurrence descriptor (CRRRO), which tries to find a stable representation against a relative position change between two people, as well as a block-based ratio-occurrence descriptor (BRO) for non-center-rotation changes. However, associating groups of people is much more difficult than modeling the relative positions of person pairs. This is why CRRRO-BRO [99] achieves relatively better performance on datasets with smaller group sizes (e.g., i-LIDS MCTS, in which most groups only contain two people). The covariance [11] descriptor measures the distance between two groups as the dissimilarity between their covariance matrices. However, the covariance matrices are calculated based on local pixel values, which can be easily influenced by the background noise that commonly appears in group scenes. Therefore, its performance is limited. PREF [49] learns a feature dictionary for single person re-id and then transfers it for group appearance encoding. In real scenarios, group appearance is more variable than single person appearance and thus the performance of such an unsupervised method is also limited. BSC+CM [107] explores patch correspondence between group images, but its performance is still limited due to the complex dynamics within a group.

Fig. 8: Visualization of group re-id results. The first image is the query, whilst the rest are the Rank-1 to Rank-5 (from left to right) retrieved results. The green and red bounding boxes denote correct and incorrect matches, respectively.

Among existing methods, MGR [47] is the only one that employs deep models for feature representation. More importantly, MGR explores the multi-granular correspondence between groups using a graph model and it achieves clearly better performance than previous models. The proposed multi-attention context graph differs from MGR in the following aspects. First, MGR [47] focuses on learning redundant multi-granular sub-group correspondence to improve the robustness of re-identification. The improvement from the multi-granular representation is more significant for groups with large numbers of members, and it is less obvious for datasets containing fewer people (e.g., i-LIDS MCTS). However, small groups are often more likely to be observed than large groups in real scenarios. In contrast, for the proposed multi-attention framework, there is no explicit constraint on multi-granular information and we can clearly observe consistent improvement on all the datasets. Second, MGR [47] contains three stages that are optimized separately, whilst the proposed framework is built upon graph neural networks and is end-to-end learnable. In addition, the proposed framework outperforms all existing state-of-the-art models, demonstrating the effectiveness of the proposed MACG model.

We visualize some group re-id results in Fig. 8, where the first two examples illustrate cases in which group members change between the query and gallery images. It can be observed that the correct matches are successfully retrieved in these cases, even though there are occlusions in the first example. The last two examples show failure cases, where the gallery groups contain persons whose appearance is similar to that of both pedestrians in the query. In such cases, it is more difficult for the model to retrieve the correct match.

Fig. 9: Visualization of re-id results using single person re-id models and the proposed MACG model. The first image is the query, whilst the rest are the Rank-1 to Rank-10 (from left to right) retrieved results. The green and red bounding boxes denote correct and incorrect matches, respectively.

5.3 Group-Aware Person Re-Identification

Group context information has proven to be beneficial for reducing ambiguity in single person re-id [99, 47, 91]. In this work, we explore both group correspondence learning and person correspondence learning, which could be naturally utilized for single person re-id under the guidance of group context. Specifically, we utilize the node-level features to measure the similarity of person pairs between groups.

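| Model | Group | mAP | R-1 | R-5 | R-10 | R-20 |
|---|---|---|---|---|---|---|
| Our CNN | - | 61.3 | 61.3 | 75.2 | 81.0 | 86.5 |
| Strong-Baseline [56] | - | 63.4 | 63.1 | 80.3 | 84.3 | 87.1 |
| PCB [73] | - | 60.5 | 63.7 | 80.2 | 83.9 | 85.8 |
| OSNet [105] | - | 61.3 | 62.4 | 77.0 | 81.0 | 84.6 |
| CG [91] | ✓ | 62.1 | 62.7 | 78.4 | 82.6 | 87.2 |
| MGR [47] | ✓ | 63.3 | 63.8 | 79.9 | 83.8 | 87.4 |
| MACG w/o CL | ✓ | 64.2 | 63.9 | 79.7 | 83.4 | 87.5 |
| MACG | ✓ | 66.5 | 65.6 | 80.5 | 84.6 | 88.1 |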
TABLE IV: Comparative results for single person re-id with group context information.

We compare our method with the following two types of models for person re-id: 1) traditional person re-id models without group information, i.e., our baseline CNN model, Strong-Baseline [56], PCB [73], and OSNet [105]; and 2) models that exploit group information, i.e., MGR [47], which explores group correspondence learning to help single person re-id, and Context Graph (CG) [91], which also explores group context for person re-id. The results are reported in TABLE IV. We find that the models using group information generally achieve better performance than the baseline person re-id models, which further validates the benefits of exploiting group context for person re-id. On the one hand, compared with our baseline CNN, the proposed MACG improves the Rank-1 matching accuracy and mAP by 4.3% and 5.2%, respectively. On the other hand, MACG also achieves notable improvement compared with recent state-of-the-art person re-id models. We also observe that, compared with MGR [47] and CG [91], the proposed model achieves further improvement. Compared with MGR [47], our framework applies a multi-level attention mechanism to better facilitate context message passing among groups. Although CG [91] employs group context, it only considers instance-level similarity learning in the loss function. In contrast, the proposed framework jointly optimizes group and instance correspondence learning, thus yielding better performance. If our model is trained without the person correspondence loss (i.e., MACG w/o CL), the improvement is less significant. The above results demonstrate that group context information is not only useful for group re-id, but also provides incremental improvement for single person re-id. As a result, the proposed MACG framework is capable of handling both group and person re-id tasks.

We visualize some person re-id results in Fig. 9. It can be observed that for the single person re-id models (i.e., Strong-Baseline, PCB, and OSNet), it is difficult to distinguish people sharing similar appearances, especially in the cases of occlusions and illumination changes. Therefore, the Rank-1 matching accuracy is relatively low. In these cases, the proposed MACG model reduces the visual ambiguity by making use of group context information, and thus achieves better performance compared to single person re-id models.

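| Model | CSG → Market-1501 mAP | CSG → Market-1501 R-1 | CSG → MSMT17 mAP | CSG → MSMT17 R-1 |
|---|---|---|---|---|
| Strong-Baseline [56] | 15.9 | 35.0 | 2.9 | 8.1 |
| PCB [73] | 16.8 | 36.7 | 2.6 | 7.5 |
| OSNet [105] | 18.8 | 38.4 | 2.7 | 7.8 |
| MACG | 14.9 | 39.2 | 2.0 | 8.3 |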
TABLE V: Dataset transfer results for single person re-id.

To further evaluate the generalization capability of our framework for single person re-id, we directly apply the models trained on CSG to two single person re-id datasets, i.e., Market-1501 [97] and MSMT17 [83]. Since there is no group context in these two datasets, a single person is simply replicated to construct the group when building the context graph in our model. The comparison results are reported in TABLE V. MACG achieves slightly higher Rank-1 accuracy than the models specifically designed for single person re-id [56, 73, 105] on both datasets (39.2% on Market-1501 and 8.3% on MSMT17), although its mAP is lower. This indicates that although our model is designed for group context learning, in the case of single person re-id where group context is missing, our model can still adaptively make use of the “self-context” from the replicated persons. As a result, the proposed MACG generalizes well to single person re-id without group context.

(a) Backbone networks
(b) Body height-parts
(c) Node connections
(d) GNN layers
Fig. 10: Model sensitivity analysis under different settings on the CSG dataset. Zoom in for better visualization.

5.4 Model Component Analysis

The proposed MACG framework is built upon a GNN model with several attention modules, i.e., an intra-graph attention module, an inter-graph attention module, and a readout self-attention module. To validate the effectiveness of the GNN framework, we first compare our model with several baseline CNN models, which are illustrated in Fig. 11. The results are reported at the top of TABLE VI. 1) We first feed the entire group image into the CNN model to directly learn the global group representation, which is denoted as ‘Global CNN’ in the 1-st row of TABLE VI. In this case, we observe very low performance on the CSG dataset. This is due to the fact that group images contain large portions of noisy background, which make it difficult for a single CNN to capture the group ID information. 2) We then crop the individual group members and train a single person re-id model to extract individual features. Mean-pooling is performed to yield the final group representation. This is denoted as ‘Local CNN’ in the 2-nd row of TABLE VI. The results are better than Global CNN since the influence of background is reduced, but they are still low due to the information loss during feature pooling. 3) As part models are effective at modeling individual people, we further train a part-based CNN model and concatenate the part features for the person-level representation. This is denoted as ‘Local Part CNN’ in the 3-rd row of TABLE VI, and the results are relatively better than the ‘Local CNN’. Note that both global and local CNN models cannot achieve satisfactory results on the group re-id task, because rich context information in groups has not been explored.

Fig. 11: Illustration of three baseline CNN models.
| Model | Attention | R-1 | R-5 | R-10 | R-20 |
|---|---|---|---|---|---|
| Global CNN | - | 46.5 | 64.3 | 71.5 | 77.1 |
| Local CNN | - | 52.7 | 68.9 | 74.5 | 80.3 |
| Local Part CNN | - | 54.6 | 70.1 | 75.6 | 80.9 |
| GNN | - | 58.7 | 72.4 | 76.3 | 81.7 |
| Part GNN | - | 60.1 | 73.2 | 76.9 | 82.5 |
| Part GNN | Intra-part | 61.3 | 74.1 | 77.5 | 83.5 |
| Part GNN | Inter-part | 61.0 | 74.3 | 77.8 | 83.3 |
| Part GNN | Intra-graph | 62.2 | 74.5 | 78.4 | 83.8 |
| Part GNN | Inter-graph | 62.8 | 74.8 | 78.7 | 84.0 |
| Part GNN | Readout | 61.2 | 73.9 | 78.1 | 82.9 |
| MACG (Proposed) | Multi-level | 63.2 | 75.4 | 79.7 | 84.4 |
TABLE VI: Model component analysis on the CSG dataset, where R-k denotes the Rank-k accuracy (%).
Fig. 12: Visualization of the message passing path with multi-level attention mechanisms. Different colors denote different part features, and different numbers denote different people.

To model context information, we first consider utilizing a vanilla GNN model for group representation learning, in which node features are extracted from the Local CNN. From the 4-th row of TABLE VI, we observe that the GNN model significantly improves the performance, raising the Rank-1 accuracy to 58.7%, which is 6.0% higher than the Local CNN that provides its node features and about 4% higher than the best CNN baseline (Local Part CNN). This demonstrates that modeling a group with a graph neural network can effectively capture the dependency within a group. We then replace the node features of the GNN with part-level features extracted by the Local Part CNN, reported as ‘Part GNN’ in the 5-th row of TABLE VI. The results are better than the GNN with person features, which demonstrates the importance of the node-level features for graph representation. For part representation, we further validate the effectiveness of the proposed intra-part and inter-part attention, as well as the intra-graph attention that combines these two attention modules. It can be observed from the 6-th to 8-th rows of TABLE VI that part-level attention can further improve the performance by exploring intra-group feature dependency. Moreover, the 9-th row of TABLE VI shows that inter-graph attention is effective, as it explicitly models the correlation between group pairs. In addition, the readout attention is also effective for aggregating node features to yield the group representation, and we observe a 1.1% improvement compared to the use of mean-pooling in Part GNN. Finally, by combining all the modules, our full model achieves 63.2% Rank-1 accuracy on the CSG dataset. The above results demonstrate the effectiveness of the proposed part-based graph model, as well as the attention modules.

5.5 Model Sensitivity Analysis

In this section, we analyze the sensitivity of the proposed framework under different settings and hyperparameters.

1) We first analyze how different backbone networks influence model performance. Specifically, we replace the ResNet50 backbone with a shallower network (ResNet18) and a deeper network (ResNet101). The results are illustrated in Fig. 10(a). We observe a considerable performance decrease when using the shallower network, especially for the Rank-1 matching rate. Furthermore, ResNet101 only achieves negligible improvement compared with ResNet50. In addition to the ResNet family, we also evaluate the performance with DenseNet161 [37] and EfficientNet-b7 [74]. It can be observed that DenseNet161 achieves comparable performance with ResNet101, while EfficientNet-b7 performs worst among all these backbones, indicating that EfficientNet-b7 does not generalize well to our group-based re-id task. These results also indicate that the choice of backbone network directly influences the proposed framework. We find that ResNet50 is suitable for the backbone as it offers a good trade-off between performance and model complexity. We also compare different part feature pooling methods, i.e., average pooling and max pooling, based on the same backbone (i.e., ResNet50). From the performance of ResNet50-avg and ResNet50-max in Fig. 10(a), we can see that the pooling method has limited influence on the final performance.

2) We then evaluate the influences of different image resolutions (i.e., human body heights) and human body partitions. To be specific, we resize the input person images to 256×128 or 384×128, and divide the human body into 2, 4, 6, or 8 parts. The performance of different height-partition combinations is visualized in Fig. 10(b). We can see that the performance with 2 parts is notably inferior to that of the other partitions, because this is a rather coarse partition. The best performance is achieved under the 384×128 resolution with 6 partitions. It is also noteworthy that the performance of finer partitions (e.g., 4, 6, or 8 parts) is close to each other, indicating that the model is not sensitive to the number of body partitions.

3) In our framework, we construct the context graph by connecting all the members in a group. We compare it with partial connections according to the geometric distance between people. The results are reported in Fig. 10(c), where ‘Neighbor=k’ denotes that a person is only connected with his/her top-k nearest neighbors and ‘Neighbor=All’ denotes full connections. We observe that the performance improves consistently as the number of neighbors becomes larger, i.e., with denser connections. This indicates that a fully connected graph can facilitate message passing within the group, which helps the model to better learn the context information. We have also carried out more studies with respect to graph building and neighbor selection. In terms of graph building, we apply an offline attribute detector [38] on our proposed dataset, and we build the graph adjacency matrix according to the attribute similarities between persons. For neighbor selection, our current policy is to directly measure the correlations between nodes (Eq. (3) and Eq. (9)), and we replace these scores with re-ranked similarity scores to evaluate the re-ranking policy [104]. We observe that neither the attributes nor the re-ranking policy improves the overall performance. On the one hand, as the attribute detector is trained on another source dataset, the domain gap will cause inaccurate predictions on the target dataset. On the other hand, since the re-ranking is only performed on limited samples between two groups, its impact is somewhat limited.
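The partial-connection variant can be sketched as below. The example assumes that the geometric distance is the Euclidean distance between bounding-box centres and that edges are directed from each person to its selected neighbours; both are assumptions of this sketch, as is the function name.

```python
import math

def knn_adjacency(centers, k=None):
    """Build an adjacency matrix over group members.

    centers: list of (x, y) bounding-box centres, one per person.
    k: number of nearest neighbours to connect; None means a fully
       connected graph (without self-loops).
    """
    n = len(centers)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Sort the other members by distance to person i.
        others.sort(key=lambda j: math.dist(centers[i], centers[j]))
        keep = others if k is None else others[:k]
        for j in keep:
            adj[i][j] = 1
    return adj
```

Calling `knn_adjacency(centers)` with the default `k=None` corresponds to the fully connected setting used in the framework, while small values of `k` give the sparser graphs evaluated in the comparison.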

4) Last but not least, we evaluate the influence of employing different numbers of GNN layers. As shown in Fig. 10(d), utilizing a two-layer GNN achieves better performance than employing a GNN with other numbers of layers. This may be because a single-layer GNN does not have sufficient representation capability, while a GNN with more layers contains too many parameters, making it more likely to overfit to the training data. This may be related to the size of our training data, and further expanding the training data may require a deeper GNN to achieve better performance.

5.6 Visualization of Attention Mechanisms

To better understand how attention mechanisms work in our framework, we visualize the message passing path of each attention module, as illustrated in Fig. 12. In particular, each point in Fig. 12 denotes a part-level feature, where different colors denote different parts and different numbers denote different people in the group. The arrow links the target node with its most attended neighbor. For intra-part attention, the model only considers features from the same body parts and thus there only exist connections between nodes of the same color. We observe that the occluded person is less involved in message passing. In the first example, the 3-rd person is partially occluded in Image A and only two parts of this person are attended for intra-part message passing. Similarly, the bottom part of the first person in Image B is occluded and thus is not attended.

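| Group | Person | Recall | mAP | R-1 |
|---|---|---|---|---|
| GT | GT | - | 62.3 | 63.2 |
| GT | DPM [28] | 76.8 | 42.3 | 41.5 |
| GT | SSD [52] | 85.7 | 54.6 | 54.2 |
| GT | FasterRCNN [63] | 94.8 | 59.1 | 59.9 |
| GT | MaskRCNN [35] | 95.2 | 59.5 | 60.1 |
| FasterRCNN [63] | FasterRCNN [63] | 94.5 | 55.8 | 54.6 |
| Joint FasterRCNN [86] | Joint FasterRCNN [86] | 91.7 | 51.3 | 50.8 |
TABLE VII: Impact of detectors on the CSG dataset, where R-k denotes the Rank-k accuracy (%).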

For inter-part attention, the model tries to find dependencies between different body parts and thus only nodes in different colors are connected. We notice that the attended nodes come from adjacent parts of the target node. For example, the attended features of the 1-st part come from the 2-nd part, and the attended features of the 2-nd part come from the 1-st and 3-rd parts, etc. This may be due to the fact that adjacent parts have stronger semantic connections with each other. For inter-graph attention, person-level attention weights are calculated, and we notice that the corresponding people successfully attend to each other. For readout attention, the attention weights are marked beside the features. We can see that the occluded persons are assigned lower weights. In the second example, there is an outlier member in Image B, and this sample is assigned a relatively low attention weight. This demonstrates that readout attention is able to filter out noisy samples in groups and thus yields more robust matching results.

5.7 Discussion of Detectors

In the above experiments, we utilize the ground-truth group and person bounding boxes for feature learning. To evaluate the impacts of pedestrian detectors, we employ different methods (i.e., DPM [28], SSD [52], FasterRCNN [63], and MaskRCNN [35]) to generate pedestrian bounding boxes. It can be observed from TABLE VII that the choice of the detector has a significant influence on recognition performance. We also show some detection examples in Fig. 13(a), where we set the positive detection threshold to -0.5 for DPM, and 0.5 for the other detectors. We can see that DPM is more sensitive to occlusions and scale/illumination variation as it is based on hand-crafted features, while deep learning based models (i.e., SSD, FasterRCNN, and MaskRCNN) are more robust in these situations and thus achieve better performance.

(a) Pedestrian detection
(b) Group and pedestrian detection
Fig. 13: Visualization of detection results in the case of occlusion, low illumination and resolution. Red bounding boxes denote the detected pedestrians, and green bounding boxes denote the detected groups. Zoom in for better visualization.

To explore the possibility of building an automatic group detector, we train a FasterRCNN [63] model which simultaneously detects individual pedestrians as well as groups based on our annotations. The detection results are illustrated in Fig. 13(b), where candidate groups and pedestrians can be successfully localized. To further enable end-to-end optimization, we integrate the Joint FasterRCNN [86] into our framework, i.e., the detection and recognition modules share the same backbone network, and the person-level features come from the RoI Pooling layer in FasterRCNN. We observe that this joint learning model performs worse than the two-stage FasterRCNN model. This is consistent with recent two-stage person search frameworks [15, 77], which indicate that the different targets of detection and re-identification influence the optimization of the joint learning framework. Nevertheless, these results validate the possibility of building an end-to-end learning framework for group re-id.

6 Conclusion and Future Work

In this work, we proposed a multi-attention context graph (MACG) model to jointly address the tasks of group re-id and group-aware person re-id. All the proposed attention mechanisms worked collaboratively for robust group representation. In addition, we built the CUHK-SYSU-Group (CSG) dataset for group re-identification, which is an order of magnitude larger than existing group re-id datasets. The superior performance in terms of the above two tasks on CSG and other group re-id benchmarks validated the effectiveness of the proposed MACG framework.

Existing group re-id datasets do not contain any spatio-temporal information (e.g., time stamps or camera information). It would be interesting to incorporate such information into group re-id. In addition, the case where more than one group appears in an image is also worth future consideration, as the co-existing groups may provide additional context that is beneficial for group re-id.


This work was supported in part by the National Key Research and Development Program of China (2016YFB1001003), NSFC (U19B2035, 61527804, U1811461, 61976137, U1611461), STCSM (18DZ1112300), MoE-China Mobile Research Fund Project (MCM20180702), the 111 Project (B07022 and Sheitc No.150633), the Shanghai Key Laboratory of Digital Media Processing and Transmissions, and the SJTU-BIGO Joint Research Fund. Yichao Yan and Jie Qin contributed equally to this work.


  • [1] S. Abu-El-Haija, B. Perozzi, R. Al-Rfou, and A. Alemi (2017) Watch your step: learning graph embeddings through attention. CoRR abs/1710.09599. Cited by: §2, §3.4.1.
  • [2] E. Ahmed, M. J. Jones, and T. K. Marks (2015) An improved deep learning architecture for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3908–3916. Cited by: §2.
  • [3] S. M. Assari, H. Idrees, and M. Shah (2016) Human re-identification in crowd videos using personal, social and environmental constraints. In European Conference on Computer Vision, pp. 119–136. Cited by: §2.
  • [4] J. Ba, V. Mnih, and K. Kavukcuoglu (2015) Multiple object recognition with visual attention. In International Conference on Learning Representations, ICLR, Cited by: §2.
  • [5] D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, ICLR, Cited by: §2.
  • [6] S. Bai, X. Bai, and Q. Tian (2017) Scalable person re-identification on supervised smoothed manifold. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 3356–3365. Cited by: §2.
  • [7] S. Bai, Y. Li, Y. Zhou, Q. Li, and P. H. S. Torr (2019) Adversarial metric attack for person re-identification. CoRR abs/1901.10650. Cited by: §2.
  • [8] S. Bai, P. Tang, P. H. S. Torr, and L. J. Latecki (2019-06) Re-ranking via metric fusion for object retrieval and person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [9] S. Bai, F. Zhang, and P. H. S. Torr (2021) Hypergraph convolution and hypergraph attention. Pattern Recognition 110, pp. 107637. Cited by: §2.
  • [10] A. Bedagkar-Gala and S. K. Shah (2012) Part-based spatio-temporal model for multi-person re-identification. Pattern Recognition Letters 33 (14), pp. 1908 – 1915. Cited by: §2.
  • [11] Y. Cai, V. Takala, and M. Pietikäinen (2010) Matching groups of people by covariance descriptor. In International Conference on Pattern Recognition, ICPR, pp. 2744–2747. Cited by: §1, §2, TABLE III, §5.2, §5.2.
  • [12] M. Cao, C. Chen, X. Hu, and S. Peng (2017) From groups to co-traveler sets: pair matching based person re-identification framework. In IEEE International Conference on Computer Vision, ICCV Workshops, pp. 2573–2582. Cited by: §2.
  • [13] X. Chang, P. Huang, Y. Shen, X. Liang, Y. Yang, and A. G. Hauptmann (2018) RCAA: relational context-aware agents for person search. In European Conference on Computer Vision, pp. 86–102. Cited by: §2.
  • [14] D. Chen, Z. Yuan, B. Chen, and N. Zheng (2016) Similarity learning with spatial constraints for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1268–1277. Cited by: §2.
  • [15] D. Chen, S. Zhang, W. Ouyang, J. Yang, and Y. Tai (2018) Person search via a mask-guided two-stream CNN model. In European Conference on Computer Vision, pp. 764–781. Cited by: §2, §5.7.
  • [16] J. Chen, J. Qin, Y. Yan, L. Huang, L. Liu, F. Zhu, and L. Shao (2020) Deep local binary coding for person re-identification by delving into the details. In ACM Multimedia, Cited by: §2.
  • [17] J. Chen, Y. Wang, J. Qin, L. Liu, and L. Shao (2017) Fast person re-identification via cross-camera semantic binary transformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3873–3882. Cited by: §2.
  • [18] J. Chen, Z. Zhang, and Y. Wang (2015) Relevance metric learning for person re-identification by exploiting listwise similarities. IEEE Transactions on Image Processing 24 (12), pp. 4741–4755. Cited by: §2.
  • [19] W. Chen, X. Chen, J. Zhang, and K. Huang (2017) Beyond triplet loss: A deep quadruplet network for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1320–1329. Cited by: §2.
  • [20] Y. Chen, X. Zhu, W. Zheng, and J. Lai (2018) Person re-identification by camera correlation aware feature augmentation. IEEE Trans. Pattern Anal. Mach. Intell. 40 (2), pp. 392–408. Cited by: §2.
  • [21] Z. Chen, X. Wei, P. Wang, and Y. Guo (2019-06) Multi-label image recognition with graph convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [22] D. Cheng, Y. Gong, Z. Li, D. Zhang, W. Shi, and X. Zhang (2018) Cross-scenario transfer metric learning for person re-identification. Pattern Recognition Letters. Cited by: §2.
  • [23] W. Choi, Y. Chao, C. Pantofaru, and S. Savarese (2014) Discovering groups of people in images. In European Conference on Computer Vision, pp. 417–433. Cited by: §2.
  • [24] M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3837–3845. Cited by: §2.
  • [25] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li (2009) ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255. Cited by: §5.1.
  • [26] D. K. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, R. Gómez-Bombarelli, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pp. 2224–2232. Cited by: §2, §3.4.1.
  • [27] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani (2010) Person re-identification by symmetry-driven accumulation of local features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2360–2367. Cited by: §2.
  • [28] P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, and D. Ramanan (2010) Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 32 (9), pp. 1627–1645. Cited by: §5.7, TABLE VII.
  • [29] Y. Fu, X. Wang, Y. Wei, and T. Huang (2019) STA: spatial-temporal attention for large-scale video-based person re-identification. In Proceedings of the Association for the Advancement of Artificial Intelligence, Cited by: §2.
  • [30] Y. Ge, Z. Li, H. Zhao, G. Yin, S. Yi, X. Wang, and H. Li (2018) FD-GAN: pose-guided feature distilling GAN for robust person re-identification. In Advances in Neural Information Processing Systems, pp. 1230–1241. Cited by: §2.
  • [31] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, pp. 2672–2680. Cited by: §2.
  • [32] D. Gray and H. Tao (2008) Viewpoint invariant pedestrian recognition with an ensemble of localized features. In European Conference on Computer Vision, pp. 262–275. Cited by: §2.
  • [33] W. L. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034. Cited by: §3.4.1.
  • [34] X. Han, Z. Liu, and M. Sun (2018) Neural knowledge acquisition via mutual attention between knowledge graph and text. In AAAI Conference on Artificial Intelligence, pp. 4832–4839. Cited by: §2.
  • [35] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick (2020) Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 42 (2), pp. 386–397. Cited by: §5.7, TABLE VII.
  • [36] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §3.2, §5.1.
  • [37] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269. Cited by: §5.5.
  • [38] J. Jia, H. Huang, W. Yang, X. Chen, and K. Huang (2020) Rethinking of pedestrian attribute recognition: realistic datasets with efficient method. CoRR abs/2005.11909. Cited by: §5.5.
  • [39] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, ICLR, Cited by: §3.4.1.
  • [40] M. Köstinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof (2012) Large scale metric learning from equivalence constraints. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2288–2295. Cited by: §2.
  • [41] S. I. Ktena, S. Parisot, E. Ferrante, M. Rajchl, M. C. H. Lee, B. Glocker, and D. Rueckert (2017) Distance metric learning using graph convolutional networks: application to functional brain networks. In Medical Image Computing and Computer Assisted Intervention - MICCAI, pp. 469–477. Cited by: §2.
  • [42] J. B. Lee, R. A. Rossi, and X. Kong (2018) Graph classification using structural attention. In ACM Knowledge Discovery & Data Mining, KDD, pp. 1666–1674. Cited by: §1.
  • [43] W. Li, R. Zhao, T. Xiao, and X. Wang (2014) DeepReID: deep filter pairing neural network for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 152–159. Cited by: §2.
  • [44] W. Li, X. Zhu, and S. Gong (2018) Harmonious attention network for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 2285–2294. Cited by: §2.
  • [45] Y. Li, C. Gu, T. Dullien, O. Vinyals, and P. Kohli (2019) Graph matching networks for learning the similarity of graph structured objects. In International Conference on Machine Learning, ICML, pp. 3835–3845. Cited by: §3.4.3.
  • [46] S. Liao, Y. Hu, X. Zhu, and S. Z. Li (2015) Person re-identification by local maximal occurrence representation and metric learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2197–2206. Cited by: §2.
  • [47] W. Lin, Y. Li, H. Xiao, S. John, J. Zou, H. Xiong, J. Wang, and M. Tao (2019) Group re-identification with multi-grained matching and integration. IEEE Transactions on Cybernetics. Cited by: §1, §1, §2, §4.1, §4.2, TABLE II, TABLE III, §5.2, §5.2, §5.3, §5.3, TABLE IV.
  • [48] W. Lin, Y. Shen, J. Yan, M. Xu, J. Wu, J. Wang, and K. Lu (2017) Learning correspondence structures for person re-identification. IEEE Trans. Image Processing 26 (5), pp. 2438–2453. Cited by: §2.
  • [49] G. Lisanti, N. Martinel, A. D. Bimbo, and G. L. Foresti (2017) Group re-identification via unsupervised transfer of sparse features encoding. In IEEE International Conference on Computer Vision, ICCV, pp. 2468–2477. Cited by: §1, §2, TABLE III, §5.2, §5.2.
  • [50] H. Liu, J. Feng, Z. Jie, J. Karlekar, B. Zhao, M. Qi, J. Jiang, and S. Yan (2017) Neural person search machines. In IEEE International Conference on Computer Vision, ICCV, pp. 493–501. Cited by: §2.
  • [51] J. Liu, B. Ni, Y. Yan, P. Zhou, S. Cheng, and J. Hu (2018) Pose transferrable person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4099–4108. Cited by: §2.
  • [52] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, ECCV, pp. 21–37. Cited by: §5.7, TABLE VII.
  • [53] D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. International Journal on Computer Vision 60 (2), pp. 91–110. Cited by: §2.
  • [54] J. Lu, C. Xiong, D. Parikh, and R. Socher (2017) Knowing when to look: adaptive attention via a visual sentinel for image captioning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 3242–3250. Cited by: §2.
  • [55] X. Lu, C. Ma, B. Ni, X. Yang, I. D. Reid, and M. Yang (2018) Deep regression tracking with shrinkage loss. In European Conference on Computer Vision, pp. 369–386. Cited by: §2.
  • [56] H. Luo, W. Jiang, Y. Gu, F. Liu, X. Liao, S. Lai, and J. Gu (2019) A strong baseline and batch normalization neck for deep person re-identification. IEEE Transactions on Multimedia. Cited by: §5.3, §5.3, TABLE IV, TABLE V.
  • [57] J. Miao, Y. Wu, P. Liu, Y. Ding, and Y. Yang (2019) Pose-guided feature alignment for occluded person re-identification. In IEEE/CVF International Conference on Computer Vision, ICCV, pp. 542–551. Cited by: §2.
  • [58] B. Munjal, S. Amin, F. Tombari, and F. Galasso (2019) Query-guided end-to-end person search. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 811–820. Cited by: §2.
  • [59] M. Niepert, M. Ahmed, and K. Kutzkov (2016) Learning convolutional neural networks for graphs. In International Conference on Machine Learning, ICML, pp. 2014–2023. Cited by: §2.
  • [60] S. Qi, W. Wang, B. Jia, J. Shen, and S. Zhu (2018) Learning human-object interactions by graph parsing neural networks. In European Conference on Computer Vision, pp. 407–423. Cited by: §2.
  • [61] X. Qian, Y. Fu, T. Xiang, W. Wang, J. Qiu, Y. Wu, Y. Jiang, and X. Xue (2018) Pose-normalized image generation for person re-identification. In European Conference on Computer Vision, pp. 661–678. Cited by: §2.
  • [62] S. Ren, X. Cao, Y. Wei, and J. Sun (2014) Face alignment at 3000 FPS via regressing local binary features. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 1685–1692. Cited by: §2.
  • [63] S. Ren, K. He, R. B. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, pp. 91–99. Cited by: §5.7, §5.7, TABLE VII.
  • [64] E. Ristani, F. Solera, R. S. Zou, R. Cucchiara, and C. Tomasi (2016) Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision workshop, pp. 17–35. Cited by: §1, §2, §4.2.
  • [65] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2009) The graph neural network model. IEEE Trans. Neural Networks 20 (1), pp. 61–80. Cited by: §2.
  • [66] F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823. Cited by: §2.
  • [67] Y. Shen, W. Lin, J. Yan, M. Xu, J. Wu, and J. Wang (2015) Person re-identification with correspondence structure learning. In IEEE International Conference on Computer Vision, ICCV, pp. 3200–3208. Cited by: §2.
  • [68] Y. Shen, H. Li, T. Xiao, S. Yi, D. Chen, and X. Wang (2018) Deep group-shuffling random walk for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 2265–2274. Cited by: §2.
  • [69] Y. Shen, H. Li, S. Yi, D. Chen, and X. Wang (2018) Person re-identification with deep similarity-guided graph neural network. In European Conference on Computer Vision, ECCV, Cited by: §2.
  • [70] R. Sinkhorn (1964-06) A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann. Math. Statist. 35 (2), pp. 876–879. Cited by: §3.4.3.
  • [71] Z. Song, B. Ni, Y. Yan, Z. Ren, Y. Xu, and X. Yang (2017) Deep cross-modality alignment for multi-shot person re-identification. In ACM MM, pp. 645–653. Cited by: §2.
  • [72] C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian (2017) Pose-driven deep convolutional model for person re-identification. In IEEE International Conference on Computer Vision, ICCV, pp. 3980–3989. Cited by: §2.
  • [73] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang (2018) Beyond part models: person retrieval with refined part pooling (and A strong convolutional baseline). In European Conference on Computer Vision, ECCV, Cited by: §5.3, §5.3, TABLE IV, TABLE V.
  • [74] M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, ICML, K. Chaudhuri and R. Salakhutdinov (Eds.), pp. 6105–6114. Cited by: §5.5.
  • [75] K. K. Thekumparampil, C. Wang, S. Oh, and L. Li (2018) Attention-based graph neural network for semi-supervised learning. CoRR abs/1803.03735. Cited by: §2, §3.4.1.
  • [76] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2017) Graph attention networks. CoRR abs/1710.10903. Cited by: §1, §2, §3.4.1, §3.4.1, §3.4.1.
  • [77] C. Wang, B. Ma, H. Chang, S. Shan, and X. Chen (2020) TCTS: a task-consistent two-stage framework for person search. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Cited by: §5.7.
  • [78] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang (2017) Residual attention network for image classification. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 6450–6458. Cited by: §2.
  • [79] R. Wang, J. Yan, and X. Yang (2019) Learning combinatorial embedding networks for deep graph matching. In IEEE International Conference on Computer Vision, ICCV, Cited by: §3.4.3.
  • [80] T. Wang, S. Gong, X. Zhu, and S. Wang (2016) Person re-identification by discriminative selection in video ranking. IEEE Trans. Pattern Anal. Mach. Intell. 38 (12), pp. 2501–2514. Cited by: §2.
  • [81] W. Wang, J. Shen, and L. Shao (2018) Video salient object detection via fully convolutional networks. IEEE Trans. Image Processing 27 (1), pp. 38–49. Cited by: §2.
  • [82] X. Wang and A. Gupta (2018) Videos as space-time region graphs. In European Conference on Computer Vision, pp. 413–431. Cited by: §2.
  • [83] L. Wei, S. Zhang, W. Gao, and Q. Tian (2018) Person transfer GAN to bridge domain gap for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 79–88. Cited by: §2, §5.3.
  • [84] L. Wei, S. Zhang, H. Yao, W. Gao, and Q. Tian (2017) GLAD: global-local-alignment descriptor for pedestrian retrieval. In ACM Multimedia, pp. 420–428. Cited by: §2.
  • [85] J. Wu, H. Liu, Y. Yang, Z. Lei, S. Liao, and S. Z. Li (2019) Unsupervised graph association for person re-identification. In IEEE/CVF International Conference on Computer Vision, ICCV, pp. 8320–8329. Cited by: §2.
  • [86] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang (2017) Joint detection and identification feature learning for person search. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3376–3385. Cited by: §1, §2, §5.7, TABLE VII.
  • [87] J. Xu, R. Zhao, F. Zhu, H. Wang, and W. Ouyang (2018) Attention-aware compositional network for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 2119–2128. Cited by: §2.
  • [88] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International Conference on Machine Learning, ICML, pp. 2048–2057. Cited by: §2.
  • [89] Y. Yan, B. Ni, Z. Song, C. Ma, Y. Yan, and X. Yang (2016) Person re-identification via recurrent feature aggregation. In European Conference on Computer Vision, pp. 701–716. Cited by: §2.
  • [90] Y. Yan, B. Ni, and X. Yang (2017) Fine-grained recognition via attribute-guided attentive feature aggregation. In ACM on Multimedia Conference, MM, pp. 1032–1040. Cited by: §2.
  • [91] Y. Yan, Q. Zhang, B. Ni, W. Zhang, M. Xu, and X. Yang (2019-06) Learning context graph for person search. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §5.3, §5.3, TABLE IV.
  • [92] L. Yao, A. Torabi, K. Cho, N. Ballas, C. J. Pal, H. Larochelle, and A. C. Courville (2015) Describing videos by exploiting temporal structure. In IEEE International Conference on Computer Vision, ICCV, pp. 4507–4515. Cited by: §2.
  • [93] L. Zhang, T. Xiang, and S. Gong (2016) Learning a discriminative null space for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1239–1248. Cited by: §2.
  • [94] C. Zhao, K. Chen, Z. Wei, Y. Chen, D. Miao, and W. Wang (2018) Multilevel triplet deep learning model for person re-identification. Pattern Recognition Letters. Cited by: §2.
  • [95] R. Zhao, W. Ouyang, and X. Wang (2014) Learning mid-level filters for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 144–151. Cited by: §2.
  • [96] L. Zheng, Y. Huang, H. Lu, and Y. Yang (2017) Pose invariant embedding for deep person re-identification. CoRR abs/1701.07732. Cited by: §2.
  • [97] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: A benchmark. In IEEE International Conference on Computer Vision, ICCV, pp. 1116–1124. Cited by: §5.3.
  • [98] W. Zheng, S. Gong, and T. Xiang (2013) Reidentification by relative distance comparison. IEEE Trans. Pattern Anal. Mach. Intell. 35 (3), pp. 653–668. Cited by: §2.
  • [99] W. Zheng, S. Gong, and T. Xiang (2009) Associating groups of people. In British Machine Vision Conference, BMVC, pp. 1–11. Cited by: §1, §2, §4.2, TABLE II, TABLE III, §5.2, §5.2, §5.3.
  • [100] W. Zheng, S. Gong, and T. Xiang (2016) Towards open-world person re-identification by one-shot group-based verification. IEEE Trans. Pattern Anal. Mach. Intell. 38 (3), pp. 591–606. Cited by: §2.
  • [101] Z. Zheng, L. Zheng, and Y. Yang (2017) Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In IEEE International Conference on Computer Vision, ICCV, pp. 3774–3782. Cited by: §2.
  • [102] Z. Zheng, L. Zheng, and Y. Yang (2018) A discriminatively learned CNN embedding for person reidentification. TOMCCAP 14 (1), pp. 13:1–13:20. Cited by: §2.
  • [103] Z. Zheng, W. Wang, S. Qi, and S. Zhu (2019-06) Reasoning visual dialogs with structural and partial observations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [104] Z. Zhong, L. Zheng, D. Cao, and S. Li (2017) Re-ranking person re-identification with k-reciprocal encoding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3652–3661. Cited by: §2, §5.5.
  • [105] K. Zhou, Y. Yang, A. Cavallaro, and T. Xiang (2019) Omni-scale feature learning for person re-identification. In IEEE/CVF International Conference on Computer Vision, ICCV, pp. 3701–3711. Cited by: §5.3, §5.3, TABLE IV, TABLE V.
  • [106] Z. Zhou, Y. Huang, W. Wang, L. Wang, and T. Tan (2017) See the forest for the trees: joint spatial and temporal recurrent neural networks for video-based person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6776–6785. Cited by: §2.
  • [107] F. Zhu, Q. Chu, and N. Yu (2016) Consistent matching based on boosted salience channels for group re-identification. In IEEE International Conference on Image Processing, ICIP, pp. 4279–4283. Cited by: TABLE III, §5.2, §5.2.