ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language

05/15/2020 ∙ by Zhe Wang, et al. ∙ Arizona State University, Beihang University

Person search by natural language aims at retrieving a specific person from a large-scale image pool that matches a given textual description. While most current methods treat the task as holistic visual and textual feature matching, we approach it from an attribute-aligning perspective that allows grounding specific attribute phrases to the corresponding visual regions. This yields both better retrieval performance and more robust feature learning, since the referred identity can be accurately pinned down by multiple attribute-level visual cues. Concretely, our Visual-Textual Attribute Alignment model (dubbed ViTAA) learns to disentangle the feature space of a person into subspaces corresponding to attributes, using a light auxiliary attribute segmentation branch. It then aligns these visual features with the textual attributes parsed from the sentences using a novel contrastive learning loss. We validate our ViTAA framework through extensive experiments on person search by natural language and by attribute-phrase queries, on which our system achieves state-of-the-art performance. Code will be publicly available upon publication.




1 Introduction

In recent years, we have witnessed numerous practical breakthroughs in person modeling related challenges, e.g., pedestrian detection [2, 5, 53], person re-identification [11, 25, 59] and pedestrian attribute recognition [29, 40]. Person search [24, 13], as an aggregation of the aforementioned tasks, has thus gained increasing research attention. Compared with searching by image queries or by pre-defined attributes, person search by natural language [24, 23, 6, 51] makes the retrieval procedure much more user-friendly and flexible, because it supports open-form natural language queries. Meanwhile, learning the visual-textual associations well is becoming increasingly critical, which calls for a representation learning schema that is able to fully exploit both modalities.

Relevant studies in person modeling point out the critical role of discriminative representations, especially of the local details in both visual and textual modalities. For visual processing, [38, 58] propose to learn pose-related features from the keypoint map of the human body, while [19, 26] leverage body-part features through auxiliary segmentation-based supervision. On the textual side, [24, 23, 55] decompose complex sentences into noun phrases, and [51, 22] directly use attribute-specific annotations to learn fine-grained attribute-related features. Proceeding from this, attribute-specific features from both text and image are requisite for the person search by natural language task, and how to effectively couple them remains an open question. We draw insight from a fatally flawed case that lingers in most current visual-language systems, shown in Figure 1 and termed “malpositioned matching”. For example, tasks like textual grounding [35, 33], VQA [1], and image retrieval using natural language [36, 16] measure the similarities or mutual information across modalities in a holistic fashion by answering: are the feature vectors of text and image aligned with each other? As a result, when users input “a girl in white shirt and black skirt” as the retrieval query, a model trained with holistic representations cannot distinguish the nuances between the two images shown in Figure 1, where the false positive actually shows “black shirt and white skirt”. As both distinct color cues (“white” and “black”) exist in the images, overall matching without the ability to refer them to specific body parts prevents the model from discriminating the images as needed. Such cases exist extensively in almost all cross-modal tasks, especially when both inputs have similar attributes. All of the above poses an unavoidable challenge that requires the model to be capable of visual and textual co-referencing.

We reckon that a critical yet under-investigated cause of the above-mentioned drawback is the misalignment of attributes in most current models for person search by natural language, due to their inadequate ability of cross-modal association at the fine-grained level. To address this challenge, we put forward a novel Visual-Textual Attributes Alignment model (dubbed ViTAA). For discriminative feature extraction, we fully exploit both visual and textual attribute representations. Specifically, we leverage semantic segmentation labels to drive the attribute-aware feature learning from the input image. As shown in Figure 3, we design multiple local branches, each of which is responsible for predicting one particular attribute under the supervision of pixel-wise classification. In this way, the visual features are intrinsically aligned through the label information and also avoid interference from background clutter. We then use a generic natural language parser to extract attribute-related phrases, which at the same time removes the complex syntax of natural language and redundant non-informative descriptions. Based upon this, we adopt a contrastive learning schema to learn a joint embedding space for visual and textual attributes. Meanwhile, we notice that common attributes may exist across different person identities (e.g., two different persons may wear a similar “black shirt”). To thoroughly exploit these cases during training, we propose a novel sampling method to mine surrogate positive examples, which largely enriches our sampling space and also provides valid informative samples that help overcome the convergence problem in metric learning.

To this end, we argue and show that the benefits of the attribute-aligned person search model go well beyond the obvious. As the images used for person search tasks often contain a large variance in appearance (e.g., varying poses or viewpoints, with/without occlusion, and with cluttered background), the abstracted attribute-specific features naturally help to resolve the ambiguities, resulting in higher robustness. Also, searching person images by attributes innately brings interpretability to the retrieval task and enables attribute-specific retrieval. It is also worth mentioning that a few very recent efforts attempt to utilize the local details in both visual and textual modalities [7, 48] and hierarchically align them [6, 3]. The pairing schemas of visual features and textual phrases in these methods are all based on the same identity, and thus neglect the cues that exist across different identities. Compared with them, we propose a more comprehensive modeling approach that fully exploits identical attributes from different persons, which greatly helps the alignment learning.

To validate these speculations, we conduct attribute-specific person retrieval in our experiments, showing that our attribute-aligned model is capable of linking specific visual cues with specific words/phrases. More specifically, we validate the effectiveness of the ViTAA model on the tasks of 1) person search by natural language and 2) person search by attribute. In the experimental results, ViTAA shows promising performance across these tasks. Further qualitative analysis verifies that our alignment learning successfully learns the fine-grained correspondence between visual and textual attributes. To summarize our contributions:

  • We design an attribute-aware representation learning framework for extracting and aligning both visual and textual features for the task of person search. To the best of our knowledge, we are the first to adopt both semantic segmentation as well as natural language parsing to facilitate a semantically aligned feature learning.

  • We design a novel cross-modal alignment learning schema based on contrastive learning which can adaptively highlight the informative samples during the alignment learning.

  • We propose an unsupervised data sampling method, which facilitates the construction of contrastive learning pairs by exploiting more surrogate positive samples across person identities.

  • We conduct experiments to validate the superiority of ViTAA over other state-of-the-art methods for the person search by natural language task. We also conduct qualitative analysis to demonstrate the interpretability of ViTAA.

2 Related Work

Person Search. Given the form of the querying data, current person search tasks can be categorized into two major thrusts: searching by images (termed Person Re-Identification), and person search by textual descriptions. Typical person re-identification (re-ID) methods [11, 25, 59] are formulated as retrieving the candidate that shows the highest correlation with the query in the image galleries. However, a clear and valid image query is not always available in real scenarios, which largely impedes the application of re-ID. Recently, researchers have shifted their attention to re-ID by textual descriptions: identifying the target person using free-form natural language [24, 23, 3]. This also comes with great challenges, as it requires the model to deal with the complex syntax of long, free-form descriptive sentences and the inconsistent interpretations of low-quality surveillance images. To tackle these, methods like [24, 23, 4] employ attention mechanisms to build the relation module between visual and textual representations, while [55, 60] propose cross-modal objective functions for joint embedding learning. Dense visual features are extracted in [32] by cropping the input image to learn a region-level matching schema. Beyond this, [18] introduces pose estimation information for delicate human body-part parsing.

Attribute Representations. Adopting appropriate feature representations is of crucial importance for learning and retrieving from both image and text. Previous efforts in person search by natural language unanimously use holistic features of the person, which omit the partial visual cues from attributes at the fine-grained level. Multiple re-ID systems have focused on processing body-part regions for visual feature learning, which can be summarized as: hand-crafted horizontal stripes or grids [25, 43, 46], attention mechanisms [37, 45], and auxiliary information including keypoints [49, 41], human parsing masks [19, 26] and dense semantic estimation [56]. Among these methods, the auxiliary information usually provides more accurate partition results for localizing human parts and facilitating body-part attribute representations, thanks to multi-task training or the auxiliary networks. However, only a few works [12] pay attention to accessories (such as backpacks), which can be potential contextual cues for accurate person retrieval. As the counterparts of specific visual cues, textual attribute phrases are usually provided as ground-truth labels or can be extracted from sentences by identifying the noun phrases with sentence parsing. Many methods use textual attributes as auxiliary label information to complement the content of image features [22, 39, 28]. Recently, a few attempts leverage textual attributes as queries for person retrieval [6, 51]. [51] imposes an attribute-guided attention mechanism to capture the holistic appearance of the person. [6] proposes a hierarchical matching model that jointly learns global category-level and local attribute-level embeddings.

Visual-Semantic Embedding. Recent works in vision and language promote the notion of visual-semantic embedding, with the goal of learning a joint feature space for both visual inputs and their corresponding textual annotations [8, 52]. Such a mechanism plays a core role in a series of cross-modal tasks, e.g., image captioning [20, 50], image retrieval through natural language [55, 48], and visual question answering [1]. The conventional joint embedding learning framework adopts a two-branch architecture [55, 8, 52], where one branch extracts image features and the other encodes textual descriptions, and the cross-modal embedding features are then learned through carefully designed objective functions.

3 Our Approach

Our network is composed of an image stream and a language stream (see Figure 3), with the intention of encoding inputs from both modalities for visual-textual embedding learning. To be specific, given a person image $I$ and its textual description $T$, we first use the image stream to extract a global visual representation $v^0$ and a stack of local visual representations of attributes $\{v^1, \dots, v^N\}$, where $N$ is the number of attribute categories. Similarly, we follow the language stream to extract the overall textual embedding $t^0$, then decompose the whole sentence using a standard natural language parser [21] into a list of attribute phrases, and encode them as $\{t^1, \dots, t^N\}$. Our core contribution is the cross-modal alignment learning that matches each visual component $v^a$ with its corresponding textual phrase $t^a$, along with the global representation matching for the person search by natural language task.

3.1 The Image Stream

We adopt the sub-network of ResNet-50 (conv1, conv2_x, conv3_x, and conv4_x) [15] as the backbone to extract feature maps from the input image. Then, we introduce a global branch and multiple local branches to generate the global visual feature $v^0$ and the attribute visual features $\{v^1, \dots, v^N\}$ respectively, one per attribute category. The network architectures are shown in Table 1. On top of all the local branches is an auxiliary segmentation layer that supervises each local branch to generate the segmentation map of one specific attribute category (shown in Figure 3). Intuitively, we argue that the additional auxiliary task acts as a knowledge regulator that diversifies the local branches so that each presents attribute-specific features.

Our person segmentation network uses the architecture of a lightweight MaskHead [14] and can be removed during the inference phase to reduce the computational cost. The remaining problem is that parsing annotations are not available in all person search datasets. To address that, we first train a human parsing network with HRNet [42] as an off-the-shelf tool. We then use its pixel-wise attribute category predictions as our segmentation annotations (illustrated in Figure 2). The HRNet is jointly trained on multiple human parsing datasets: MHPv2 [57], ATR [27], and VIPeR [44]. With these annotations, the local branches receive the supervision needed from the segmentation task to learn attribute-specific features. Essentially, we are distilling the human body information from a well-trained human parsing network into our lightweight local branches through joint training. (More details of our HRNet training and segmentation results can be found in the experimental part and the supplementary materials.)

Figure 2: Attribute annotation of CUHK-PEDES dataset generated by our pre-trained HRNet.

Discussion. Using attribute features has the following advantages over using global features alone. 1) The textual annotations in the person search by natural language task describe the person mostly by dressing/body appearance, which attribute features fit well. 2) Attribute alignment avoids the “malpositioned matching” cases shown in Fig. 1; using segmentation to regularize feature learning also makes the model resilient to diverse human poses and viewpoints.

3.2 The Language Stream

Given the raw textual description, our language stream first parses and extracts noun phrases w.r.t. each attribute, and then feeds them into a language network to obtain the sentence-level as well as the phrase-level embeddings. The noun phrases are parsed using the Stanford POS tagger [30], and a bi-directional LSTM encodes the full sentence and the parsed noun phrases to generate the global textual embedding and the local textual embeddings. Meanwhile, we also need to assign each novel noun phrase in the sentence to a specific attribute category for visual-textual alignment learning. To address that, we adopt a dictionary clustering approach. Concretely, we first manually collect a list of words per attribute category, e.g., “shirt”, “jersey”, “polo” to represent the upper-body category, and use the average-pooled word vectors [10] of them as the anchor embedding $d^a$ of that category, forming the dictionary $D = \{d^1, \dots, d^N\}$, where $N$ is the total number of attributes. Building upon that, we assign each noun phrase to the category with the highest cosine similarity, and form the local textual embeddings $\{t^1, \dots, t^N\}$. Different from previous works like [56, 31, 17], we also include accessories as one type of attribute, which serves as a crucial matching cue in many cases.
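The dictionary clustering step described above can be illustrated with a small sketch. This is not the ViTAA implementation: the three-dimensional word vectors, the seed word lists, and the function names (`build_anchors`, `assign_category`) are hypothetical, and a phrase is embedded here simply as the mean of its known word vectors, which is an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_vector(vectors):
    """Average-pool a list of vectors dimension by dimension."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def build_anchors(seed_words, word_vectors):
    """One anchor per attribute category: the mean of its seed word vectors."""
    return {cat: mean_vector([word_vectors[w] for w in words])
            for cat, words in seed_words.items()}

def assign_category(phrase, anchors, word_vectors):
    """Assign a noun phrase to the category whose anchor is most similar."""
    known = [word_vectors[w] for w in phrase.lower().split() if w in word_vectors]
    if not known:
        return None  # nothing to match on; the phrase is skipped
    emb = mean_vector(known)
    return max(anchors, key=lambda cat: cosine(emb, anchors[cat]))

# Toy vectors standing in for pre-trained word embeddings (hypothetical values).
TOY_VECTORS = {
    "shirt": [0.9, 0.1, 0.0], "jersey": [0.8, 0.2, 0.0], "polo": [0.85, 0.1, 0.05],
    "pants": [0.1, 0.9, 0.0], "jeans": [0.2, 0.8, 0.0], "skirt": [0.15, 0.85, 0.1],
    "backpack": [0.0, 0.1, 0.9], "handbag": [0.1, 0.0, 0.9],
    "hoodie": [0.7, 0.2, 0.1], "shorts": [0.2, 0.7, 0.1], "purse": [0.05, 0.1, 0.8],
    "white": [0.3, 0.3, 0.3], "black": [0.3, 0.3, 0.3],
}
SEED_WORDS = {
    "upper_body": ["shirt", "jersey", "polo"],
    "lower_body": ["pants", "jeans", "skirt"],
    "bag": ["backpack", "handbag"],
}
ANCHORS = build_anchors(SEED_WORDS, TOY_VECTORS)

for phrase in ("white hoodie", "black shorts", "purse"):
    print(phrase, "->", assign_category(phrase, ANCHORS, TOY_VECTORS))
```

Phrases never seen in the seed lists ("hoodie", "shorts", "purse") are still routed to a category, because only similarity to the anchors matters.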

Figure 3: Illustrative diagram of our ViTAA network, which includes an image stream (left) and a language stream (right). Our image stream first encodes the person image and extracts both global and attribute representations. The local branches are additionally supervised by an auxiliary segmentation network, where the category labels are acquired from an off-the-shelf human parsing network. Meanwhile, the textual description is parsed and decomposed into attribute atoms, which are encoded by a weight-shared Bi-LSTM. We train ViTAA jointly under the global/attribute alignment losses in an end-to-end manner.

3.3 Visual-Textual Alignment Learning

Once we have extracted the global and attribute features, the key objective of the next stage is to learn a joint embedding space across the visual and textual modalities, where the visual features are tightly matched with the given textual description. Mathematically, we formulate our learning objective as a contrastive learning task that takes triplets as input, i.e., $\{v^i, t^+, t^-\}$ and $\{t^i, v^+, v^-\}$, where $i$ denotes the index of the person to identify, and $+$/$-$ refer to the corresponding feature representations of person $i$ and of a randomly sampled irrelevant person, respectively. We note that the features in a triplet can be at either the global level or the attribute level. Here, we discuss the learning schema on $\{v^i, t^+, t^-\}$, which can be extended to $\{t^i, v^+, v^-\}$.

We adopt the cosine similarity $S = \frac{v^\top t}{\|v\| \, \|t\|}$ as the scoring function between visual and textual features. For a positive pair $\{v^i, t^+\}$, the cosine similarity $S^+$ is encouraged to be as large as possible, which we define as the absolute similarity criterion. For a negative pair $\{v^i, t^-\}$, however, enforcing the cosine similarity $S^-$ to be minimal may impose an arbitrary constraint on the negative samples $t^-$. Instead, we propose to optimize the deviation between $S^+$ and $S^-$ to be larger than a preset margin, called the relative similarity criterion. These two criteria can be formalized as:

$$S^+ \rightarrow 1 \quad \text{and} \quad S^+ - S^- > m, \tag{1}$$

where $m$ is the least margin by which the positive and negative similarities should differ and is set to a fixed value in practice.

In contrastive learning, the general forms of the basic objective function are the hinge loss $L(x) = \max\{0, 1 - x\}$ and the logistic loss $L(x) = \log(1 + e^{-x})$. One crucial drawback of the hinge loss is that its derivative w.r.t. $x$ is a constant value wherever the loss is active: $\frac{\partial L}{\partial x} = -1$. Since the pair-based construction of training data leads to a polynomial growth of training pairs, a certain portion of the randomly sampled negative texts will inevitably be less informative during training. Treating all these redundant samples equally raises the risk of slow convergence or even model degeneration in metric learning tasks. In contrast, the derivative of the logistic loss w.r.t. $x$ is $\frac{\partial L}{\partial x} = -\frac{1}{e^{x} + 1}$, which depends on the input value. With this insight, we settle on the logistic loss as our basic objective function.
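The contrast between the two losses can be checked numerically. The sketch below assumes the standard forms $\max\{0, 1 - x\}$ for the hinge loss and $\log(1 + e^{-x})$ for the logistic loss; the function names are illustrative only.

```python
import math

def hinge_loss(x):
    """Hinge loss max(0, 1 - x)."""
    return max(0.0, 1.0 - x)

def hinge_grad(x):
    """Derivative of the hinge loss: constant -1 while active, 0 otherwise."""
    return -1.0 if x < 1.0 else 0.0

def logistic_loss(x):
    """Logistic loss log(1 + exp(-x))."""
    return math.log1p(math.exp(-x))

def logistic_grad(x):
    """Derivative of the logistic loss: -1 / (exp(x) + 1), varies with x."""
    return -1.0 / (1.0 + math.exp(x))

for x in (-2.0, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  hinge grad={hinge_grad(x):+.3f}  "
          f"logistic grad={logistic_grad(x):+.3f}")
```

Every active sample receives the same hinge gradient regardless of how badly it is scored, whereas the logistic gradient shrinks smoothly as the score improves.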

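With the logistic loss, the aforementioned two criteria can be further derived and rewritten as:

$$S^+ - \alpha > 0, \qquad -(S^- - \beta) > 0, \tag{2}$$

where $S^+$ and $S^-$ are the cosine similarities of the positive and negative pairs, $\alpha$ denotes the lower bound for the positive similarity, and $\beta$ denotes the upper bound for the negative similarity. Together with the logistic loss function, our final Alignment loss for one triplet can be unrolled as:

$$L_{align} = \log\left[1 + e^{-\tau_p (S^+ - \alpha)}\right] + \log\left[1 + e^{\tau_n (S^- - \beta)}\right], \tag{3}$$

where $\tau_p$ and $\tau_n$ denote the temperature parameters that control the slope of the gradient over positive and negative samples. The partial derivatives are calculated as:

$$\frac{\partial L_{align}}{\partial S^+} = \frac{-\tau_p}{1 + e^{\tau_p (S^+ - \alpha)}}, \qquad \frac{\partial L_{align}}{\partial S^-} = \frac{\tau_n}{1 + e^{\tau_n (\beta - S^-)}}. \tag{4}$$

In this way, Eq. 3 outputs continuous gradients and assigns higher weights to the more informative samples accordingly.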
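To make the weighting behaviour concrete, the following sketch evaluates a logistic alignment loss with a lower bound on the positive similarity, an upper bound on the negative similarity, and separate temperatures for the two terms. The exact functional form is an assumption reconstructed from the description above, and the parameter values in the usage lines are illustrative choices, not values taken from the text.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def alignment_loss(s_pos, s_neg, alpha, beta, tau_p, tau_n):
    """Assumed form: log(1+exp(-tau_p(s_pos-alpha))) + log(1+exp(tau_n(s_neg-beta)))."""
    return softplus(-tau_p * (s_pos - alpha)) + softplus(tau_n * (s_neg - beta))

def alignment_grads(s_pos, s_neg, alpha, beta, tau_p, tau_n):
    """Analytic partial derivatives w.r.t. s_pos and s_neg."""
    g_pos = -tau_p * sigmoid(-tau_p * (s_pos - alpha))
    g_neg = tau_n * sigmoid(tau_n * (s_neg - beta))
    return g_pos, g_neg

# Illustrative hyperparameters (hypothetical, chosen only for the demo).
PARAMS = dict(alpha=0.6, beta=0.4, tau_p=10.0, tau_n=40.0)

# A hard negative (similarity above beta) versus an easy one (well below beta).
_, g_hard = alignment_grads(0.8, 0.7, **PARAMS)
_, g_easy = alignment_grads(0.8, 0.1, **PARAMS)
print("gradient on hard negative:", g_hard)
print("gradient on easy negative:", g_easy)
```

Under these assumptions, negatives that violate the upper bound receive a gradient close to the full temperature, while negatives already far below it contribute almost nothing, which is the adaptive highlighting of informative samples described above.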

K-reciprocal Sampling. One premise of visual-textual alignment is to fully exploit the informative positive and negative samples to provide valid supervision. However, most current contrastive learning methods [47, 54] construct positive pairs by selecting samples belonging to the same class and simply treat random samples from other classes as negatives. This is viable when using only global information at the coarse level during training, but it may not handle the case illustrated in Figure 1, where a fine-grained comparison is needed. This practice also depends heavily on the average number of samples per attribute category to provide comprehensive positive samples. With this insight, we propose to further enlarge the search space of positive samples using cross-ID instances.

For instance, as in Figure 1, though the two ladies have different identities, they wear extremely similar shoes, which can be treated as positive samples for learning. We term such samples, which have identical attributes but belong to different person identities, “surrogate positive samples”. Including the common attribute features of the surrogate positive samples in positive pairs makes much more sense than treating them as negatives. It is worth noting that this is unique to our attribute alignment learning phase, because attributes can only be compared at the fine-grained level. Now the key question is: how can we dig out the surrogate positive samples, given that we do not have direct cross-ID attribute annotations? Inspired by the re-ranking techniques in the re-ID community [9, 61], we propose k-reciprocal sampling as an unsupervised method to generate surrogate labels at the attribute level. How does the proposed method sample from a batch of visual and textual features? Straightforwardly, for each attribute $a$, we extract a batch of visual and textual features from the feature learning network and mine their corresponding surrogate positive samples using our algorithm for alignment learning. Since we are only discussing the input form $\{v^i, t^+, t^-\}$, our sampling algorithm is actually mining the surrogate positive textual features for each visual feature $v^i$. Note that if the attribute information in either modality is missing after parsing, we simply ignore it during sampling.

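Input: $V^a$ is a set of visual features for attribute $a$
       $T^a$ is a set of textual features for attribute $a$
Output: $P^a$ is a set of surrogate positive samples for attribute $a$
for each $v^a \in V^a$ do
    find the top-K nearest neighbours of $v^a$ w.r.t. $T^a$: $K_{T^a}$;
    $S \leftarrow \emptyset$;
    for each $t^a \in K_{T^a}$ do
        find the top-K nearest neighbours of $t^a$ w.r.t. $V^a$: $K_{V^a}$;
        if $v^a \in K_{V^a}$ then
            $S \leftarrow S \cup \{t^a\}$;
        end if
    end for
    $P^a \leftarrow P^a \cup S$;
end for
Algorithm 1: K-reciprocal Sampling Algorithm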
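The mutual-neighbour test at the heart of Algorithm 1 can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than the authors' code: cosine similarity is assumed as the neighbour metric, features are plain lists, and the result is returned as a mapping from each visual feature index to the indices of its surrogate positive textual features (the helper names are hypothetical).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k(query, candidates, k):
    """Indices of the k candidates most similar to the query."""
    order = sorted(range(len(candidates)),
                   key=lambda j: cosine(query, candidates[j]),
                   reverse=True)
    return order[:k]

def k_reciprocal_sampling(visual, textual, k):
    """For one attribute, keep a textual feature as a surrogate positive of a
    visual feature only if each is among the other's top-k neighbours."""
    positives = {}
    for i, v in enumerate(visual):
        surrogate = []
        for j in top_k(v, textual, k):            # text neighbours of v
            if i in top_k(textual[j], visual, k):  # is v a neighbour of that text?
                surrogate.append(j)
        positives[i] = surrogate
    return positives

# Toy batch: samples 0 and 1 share an attribute, sample 2 differs.
visual = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
textual = [[1.0, 0.05], [0.95, 0.1], [0.05, 1.0]]
print(k_reciprocal_sampling(visual, textual, k=2))
```

In this toy batch the first two samples mine each other's text as cross-ID positives, while the third keeps only its own text because the reciprocity check rejects its second-nearest neighbour.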

3.4 Joint Training

The entire network is trained in an end-to-end manner. We adopt the widely used cross-entropy loss (ID Loss) to assist the learning of discriminative features for each instance, as well as a pixel-level cross-entropy loss (Seg Loss) to classify the attribute categories in the auxiliary segmentation task. For the cross-modal alignment learning, we apply the Alignment Loss to both the global-level and the attribute-level representations. The overall loss function thus combines the four terms:

$$L = L_{id} + L_{seg} + L_{align}^{glo} + L_{align}^{attr}.$$

Layer name Parameters Output size #Branch
Average Pooling
Max Pooling
Table 1: Detailed architecture of our global and local branches in image stream. #Branch. denotes the number of sub-branches.

4 Experiment

4.1 Experimental Setting.

Datasets. We conduct experiments on the CUHK-PEDES [24] dataset, which is currently the only benchmark for person search by natural language. It contains 40,206 images of 13,003 different persons, where each image comes with two human-annotated sentences. The average length of each sentence is around 23 words. The dataset is split into 11,003 identities with 34,054 images in the training set, 1,000 identities with 3,078 images in the validation set, and 1,000 identities with 3,074 images in the testing set.

Evaluation protocols. Following the standard evaluation setting, we adopt Recall@K (K=1, 5, 10) as the retrieval criterion. More specifically, given a natural language description as the query, Recall@K (R@K) reports the percentage of queries for which at least one image of the corresponding person is retrieved among the top-K results.
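For reference, Recall@K as described above can be computed as in the following sketch. The input layout is an assumption made for illustration: each query comes with the list of gallery identity labels sorted from most to least similar, and the function name is hypothetical.

```python
def recall_at_k(ranked_ids, query_ids, ks=(1, 5, 10)):
    """Percentage of queries with at least one correct identity in the top K.

    ranked_ids[q] holds the gallery identity labels for query q, sorted by
    decreasing similarity; query_ids[q] is the identity the query describes.
    """
    hits = {k: 0 for k in ks}
    for ranking, target in zip(ranked_ids, query_ids):
        for k in ks:
            if target in ranking[:k]:
                hits[k] += 1
    n = len(query_ids)
    return {k: 100.0 * hits[k] / n for k in ks}

# Three toy queries against a three-image gallery.
rankings = [[3, 1, 2], [2, 2, 1], [1, 3, 3]]
targets = [1, 2, 2]
print(recall_at_k(rankings, targets, ks=(1, 2, 3)))
```

Because a query counts as a hit once any matching image appears in the top K, R@K is non-decreasing in K.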

Implementation details. For the global and local branches in the image stream, we use the BasicBlock described in [15], where each branch is randomly initialized (the detailed architecture is shown in Table 1). We use horizontal flipping as data augmentation and resize all images to a fixed input size. We use the Adam solver as the training optimizer with weight decay, and each mini-batch consists of a fixed number of image-language pairs. The learning rate is kept at its initial value for the first epochs of training and then decayed by a constant factor for the remaining epochs. The whole experiment is implemented on a single Tesla V100 GPU machine. The remaining hyperparameters are set empirically.

Pedestrian attributes parsing. Based on the analysis of image and natural language annotations in the dataset, we group both visual and textual attributes into five categories: head (including descriptions related to hat, glasses and face), clothes on the upper body, clothes on the lower body, shoes, and bags (including backpack and handbag). We reckon that these attributes are visually distinguishable in images and are also the most descriptive parts of the sentences. In Figure 2, we visualize the segmentation maps generated by our pre-trained HRNet, where attribute regions are properly segmented and associated with correct labels.

4.2 Comparisons with the State of the Art

Result on CUHK-PEDES dataset. We summarize the performance of ViTAA and compare it with state-of-the-art methods on the CUHK-PEDES test set in Table 2. Methods like GNA-RNN [24], CMCE [23], and PWM-ATH [4] employ attention mechanisms to learn the relation between visual and textual representations, while Dual Path [60] and CMPM+CMPC [55] design objective functions for better visual-textual embedding learning. These methods only learn and utilize the “global” feature representations of image and text. Moreover, MIA [32] exploits “region” information by dividing the input image into several horizontal stripes and extracting noun phrases from the natural language description. Similarly, GALM [18] leverages “keypoint” information from human pose estimation as an attention mechanism to assist feature learning, together with a noun phrase extractor applied to the input text. Though the above two utilize local-level representations, neither of them learns the associations between visual features and textual phrases. From Table 2, we observe that ViTAA shows a consistent lead on all metrics (R@1, R@5, R@10), outperforming GALM [18] by margins of 1.85%, 0.39%, and 0.55% respectively, and claims the new state-of-the-art results. We note that although the gains may seem incremental, improving R@1 is challenging, which suggests that the alignment learning of ViTAA contributes to the retrieval task directly. In the following, we further report ablation studies on the effect of different components, and present the attribute retrieval results quantitatively and qualitatively.

Method            Feature            R@1     R@5     R@10
GNA-RNN [24]      global             19.05   -       53.64
CMCE [23]         global             25.94   -       60.48
PWM-ATH [4]       global             27.14   49.45   61.02
Dual Path [60]    global             44.40   66.26   75.07
CMPM+CMPC [55]    global             49.37   -       79.27
MIA [32]          global+region      53.10   75.00   82.90
GALM [18]         global+keypoint    54.12   75.45   82.97
ViTAA             global+attribute   55.97   75.84   83.52
Table 2: Person search results on the CUHK-PEDES test set. Best results are in bold.

4.3 Ablation Study

We carry out comprehensive ablations to evaluate the contribution of different components and the training configurations.

Comparisons over different component combinations. To compare the individual contribution of each component, we set the baseline model as the one trained with only the ID loss. In Table 3, we report the improvement brought by the proposed components (segmentation, global-alignment, and attribute-alignment) on top of the baseline model. From this table, we have the following observations and analyses. First, using the segmentation loss alone brings only marginal improvement, because the visual features are not aligned with their corresponding textual features. Similarly, we observe the same trend when training with only the attribute-alignment loss, where the visual features are not properly segmented and thus cannot be associated for retrieval. A further gain is obtained by combining these two components. Second, compared with attribute-level alignment, global-level alignment greatly improves the performance under all criteria, which demonstrates the effectiveness of the visual-textual alignment schema. The reason for the performance gap is that the former learns the attribute similarity across different person identities, while the latter concentrates on the uniqueness of each person. Finally, combining all the loss terms yields the best performance, validating that our attribute-alignment and global-alignment learning are complementary to each other.

Model Component                          R@1     R@5     R@10
Segmentation   Attr-Align   Glb-Align
-              -            -            29.68   51.84   61.57
✓              -            -            30.93   52.71   63.11
-              ✓            -            31.40   54.09   63.66
✓              ✓            -            39.26   61.22   68.14
-              -            ✓            52.27   73.33   81.61
✓              ✓            ✓            55.97   75.84   83.52
Table 3: The improvement of components added on baseline model. Glb-Align and Attr-Align represent global-level and attribute-level alignment respectively.
Figure 4: From left to right, we exhibit the raw input person images, the attribute labels generated by the pre-trained HRNet, the attribute segmentation results from our segmentation branches, and their corresponding feature maps from the local branches.

Visual attribute segmentation and representations. In Figure 4, we visualize the segmentation maps from the MaskHead network and the feature representations of the local branches. It shows that, even when transferred using only a lightweight MaskHead branch, the auxiliary person segmentation network can produce accurate pixel-wise labels under different human poses. This also suggests that person parsing knowledge has been successfully distilled into our local branches, which is crucial for precise cross-modal alignment learning. On the right side of Figure 4, we showcase the feature maps of the local branch for each attribute.

Figure 5: (a) R@1 and R@10 results across different values of K in the proposed surrogate positive data sampling method. (b) Some examples of surrogate positive data with different person identities.
Figure 6: Examples of person search results on CUHK-PEDES. We indicate the true/false matching results in green/red boxes.

K-reciprocal sampling. We investigate how the value of K impacts the pair-based sampling and learning process. We evaluate the R@1 and R@10 performance under different K settings in Figure 5(a). Ideally, the larger K is, the more potential surrogate positive samples will be mined, but this also comes with the possibility that more non-relevant examples (false positives) are incorrectly sampled. The result in Figure 5(a) agrees with our analysis: the best R@1 and R@10 are achieved when K is set to 8, and the performance drops continuously as K grows larger. In Figure 5(b), we also provide visual examples of the surrogate positive pairs mined by our sampling method. The visual attributes from different persons serve as valuable positive samples in our alignment learning schema.

Qualitative analysis. We present qualitative examples of person retrieval results to provide a more in-depth examination. In Figure 6, we illustrate the top-10 matching results for natural language queries. In the successful case (top), ViTAA precisely captures all attributes of the target person. It is worth noting that the wrong answers still capture the relevant attributes: “sweater with black, gray and white stripes”, “tan pants”, and “carrying a bag”. For the failure case (bottom), though the retrieved images are incorrect, we observe that all the attributes described in the query are present in almost all retrieved results.

4.4 Extension: Attribute Retrieval

To validate the ability to associate visual attributes with text phrases, we further conduct an attribute retrieval experiment on the Market-1501 [59] and DukeMTMC [34] datasets, where human-related attributes are annotated per image by [28]. We use our ViTAA model pre-trained on CUHK-PEDES without any further finetuning, and conduct the retrieval task using the attribute phrase as the query under the R@1 and mAP metrics. We test only on the upper-body clothing attribute category and report the retrieval results in Table 4; the details of the experiment are given in the supplementary materials. Table 4 shows that ViTAA achieves strong R@1 on most sub-attributes (above 70 on six of the eight in each dataset), although a few, such as uppurple and upbrown on DukeMTMC, remain difficult. This supports our argument that ViTAA is able to associate the visual attribute features with textual attribute descriptions.

Market-1501                         | DukeMTMC
Attr.      #Target   R@1     mAP    | Attr.      #Target   R@1     mAP
upblack    1759      99.2    44.0   | upblack    11047     100.0   82.2
upwhite    3491      44.7    64.8   | upwhite    1095      91.3    35.4
upred      1354      92.3    54.8   | upred      821       100.0   44.6
uppurple   363       100.0   61.2   | uppurple   65        0.0     9.0
upyellow   1195      72.7    75.6   | upgray     2012      81.7    29.8
upgray     1755      49.7    55.2   | upblue     1577      77.9    31.3
upblue     1100      70.1    32.4   | upgreen    417       89.2    24.5
upgreen    949       87.9    50.4   | upbrown    345       18.3    15.2
Table 4: Upper-body clothing attribute retrieval results. Score is reported in the form of baseline/ours. Attr. is the short form of attribute, and “upblack” denotes upper-body clothing in black.
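For readers unfamiliar with the two metrics in Table 4, the following sketch shows one common way to compute R@1 and mAP from ranked retrieval lists. The exact protocol is deferred to the supplementary materials, so the definitions here are the standard ones and an assumption on our part; the function names and the toy `rankings` data are hypothetical.

```python
from typing import List, Sequence

def recall_at_k(ranked_relevance: Sequence[bool], k: int) -> float:
    """1.0 if at least one of the top-k ranked items is relevant, else 0.0."""
    return 1.0 if any(ranked_relevance[:k]) else 0.0

def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """Mean of the precision values taken at the rank of each relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(all_rankings: List[Sequence[bool]]) -> float:
    """Average of the per-query average precision values."""
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)

# Hypothetical ranked results for two attribute queries
# (True marks a gallery image that carries the queried attribute).
rankings = [
    [True, False, True, False],   # AP = (1/1 + 2/3) / 2
    [False, True, False, False],  # AP = 1/2
]

r_at_1 = sum(recall_at_k(r, 1) for r in rankings) / len(rankings)
m_ap = mean_average_precision(rankings)
```

Because R@1 only checks the top-ranked image while mAP averages precision over every relevant image in the gallery, the two scores can diverge considerably, which is consistent with rows in Table 4 where a high R@1 is paired with a much lower mAP.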

5 Conclusion

In this work, we present a novel ViTAA model that addresses the person search by natural language task from the perspective of attribute-specific alignment learning. In contrast to existing methods, ViTAA fully exploits the common attribute information in both visual and textual modalities across different person identities, and further builds a strong association between the visual attribute features and their corresponding textual phrases through our alignment learning schema. We show that ViTAA achieves state-of-the-art results on the challenging CUHK-PEDES benchmark and demonstrate its potential to further advance the person search by natural language domain.

References


  • [1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015) Vqa: visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 2425–2433. Cited by: §1, §2.
  • [2] R. Benenson, M. Omran, J. Hosang, and B. Schiele (2014) Ten years of pedestrian detection, what have we learned?. In European Conference on Computer Vision, pp. 613–627. Cited by: §1.
  • [3] D. Chen, H. Li, X. Liu, Y. Shen, J. Shao, Z. Yuan, and X. Wang (2018) Improving deep visual representation for person re-identification by global and local image-language association. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 54–70. Cited by: §1, §2.
  • [4] T. Chen, C. Xu, and J. Luo (2018-03) Improving text-based person search by spatial matching and adaptive threshold. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1879–1887. External Links: Document Cited by: §2, §4.2, Table 2.
  • [5] P. Dollár, C. Wojek, B. Schiele, and P. Perona (2009) Pedestrian detection: a benchmark. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 304–311. Cited by: §1.
  • [6] Q. Dong, S. Gong, and X. Zhu (2019-10) Person search by text attribute query as zero-shot learning. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §1, §2.
  • [7] Z. Fang, S. Kong, C. Fowlkes, and Y. Yang (2019-06) Modularized textual grounding for counterfactual resilience. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [8] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov (2013) Devise: a deep visual-semantic embedding model. In Advances in neural information processing systems, pp. 2121–2129. Cited by: §2.
  • [9] J. Garcia, N. Martinel, C. Micheloni, and A. Gardel (2015) Person re-identification ranking optimisation by discriminant context information analysis. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1305–1313. Cited by: §3.3.
  • [10] Y. Goldberg and O. Levy (2014) Word2vec explained: deriving mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722. Cited by: §3.2.
  • [11] S. Gong, M. Cristani, S. Yan, and C. C. Loy (2014) Person re-identification. Springer Publishing Company, Incorporated. External Links: ISBN 1447162951, 9781447162957 Cited by: §1, §2.
  • [12] J. Guo, Y. Yuan, L. Huang, C. Zhang, J. Yao, and K. Han (2019-10) Beyond human parts: dual part-aligned representations for person re-identification. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
  • [13] C. Han, J. Ye, Y. Zhong, X. Tan, C. Zhang, C. Gao, and N. Sang (2019) Re-id driven localization refinement for person search. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9814–9823. Cited by: §1.
  • [14] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §3.1.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.1, §4.1.
  • [16] J. Jeon, V. Lavrenko, and R. Manmatha (2003) Automatic image annotation and retrieval using cross-media relevance models. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 119–126. Cited by: §1.
  • [17] Y. Jing, C. Si, J. Wang, W. Wang, L. Wang, and T. Tan (2018) Cascade attention network for person search: both image and text-image similarity selection. CoRR. External Links: Link, 1809.08440 Cited by: §3.2.
  • [18] Y. Jing, C. Si, J. Wang, W. Wang, L. Wang, and T. Tan (2018) Pose-guided joint global and attentive local matching network for text-based person search. arXiv preprint arXiv:1809.08440. Cited by: §2, §4.2, Table 2.
  • [19] M. M. Kalayeh, E. Basaran, M. Gökmen, M. E. Kamasak, and M. Shah (2018) Human semantic parsing for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1062–1071. Cited by: §1, §2.
  • [20] A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3128–3137. Cited by: §2.
  • [21] D. Klein and C. D. Manning (2003) Fast exact inference with a factored model for natural language parsing. In Advances in neural information processing systems, pp. 3–10. Cited by: §3.
  • [22] R. Layne, T. M. Hospedales, and S. Gong (2014) Attributes-based re-identification. In Person Re-Identification, pp. 93–117. Cited by: §1, §2.
  • [23] S. Li, T. Xiao, H. Li, W. Yang, and X. Wang (2017) Identity-aware textual-visual matching with latent co-attention. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1890–1899. Cited by: §1, §1, §2, §4.2, Table 2.
  • [24] S. Li, T. Xiao, H. Li, B. Zhou, D. Yue, and X. Wang (2017) Person search with natural language description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1970–1979. Cited by: §1, §1, §2, §4.1, §4.2, Table 2.
  • [25] W. Li, R. Zhao, T. Xiao, and X. Wang (2014) Deepreid: deep filter pairing neural network for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 152–159. Cited by: §1, §2, §2.
  • [26] X. Liang, K. Gong, X. Shen, and L. Lin (2018) Look into person: joint body parsing & pose estimation network and a new benchmark. IEEE transactions on pattern analysis and machine intelligence 41 (4), pp. 871–885. Cited by: §1, §2.
  • [27] X. Liang, S. Liu, X. Shen, J. Yang, L. Liu, J. Dong, L. Lin, and S. Yan (2015-12) Deep human parsing with active template regression. Pattern Analysis and Machine Intelligence, IEEE Transactions on (12), pp. 2402–2414. External Links: Document, ISSN 0162-8828 Cited by: §3.1.
  • [28] Y. Lin, L. Zheng, Z. Zheng, Y. Wu, Z. Hu, C. Yan, and Y. Yang (2019) Improving person re-identification by attribute and identity learning. Pattern Recognition. External Links: Document Cited by: §2, §4.4.
  • [29] X. Liu, H. Zhao, M. Tian, L. Sheng, J. Shao, S. Yi, J. Yan, and X. Wang (2017) Hydraplus-net: attentive deep features for pedestrian analysis. In Proceedings of the IEEE international conference on computer vision, pp. 350–359. Cited by: §1.
  • [30] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky (2014) The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pp. 55–60. External Links: Link Cited by: §3.2.
  • [31] K. Niu, Y. Huang, W. Ouyang, and L. Wang (2019) Improving description-based person re-identification by multi-granularity image-text alignments. CoRR. External Links: Link, 1906.09610 Cited by: §3.2.
  • [32] K. Niu, Y. Huang, W. Ouyang, and L. Wang (2019) Improving description-based person re-identification by multi-granularity image-text alignments. arXiv preprint arXiv:1906.09610. Cited by: §2, §4.2, Table 2.
  • [33] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik (2015) Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pp. 2641–2649. Cited by: §1.
  • [34] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi (2016) Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision, pp. 17–35. Cited by: §4.4.
  • [35] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele (2016) Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision, pp. 817–834. Cited by: §1.
  • [36] R. Shekhar and C. Jawahar (2012) Word image retrieval using bag of visual words. In 2012 10th IAPR International Workshop on Document Analysis Systems, pp. 297–301. Cited by: §1.
  • [37] J. Si, H. Zhang, C. Li, J. Kuen, X. Kong, A. C. Kot, and G. Wang (2018) Dual attention matching network for context-aware feature sequence based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5363–5372. Cited by: §2.
  • [38] C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian (2017) Pose-driven deep convolutional model for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3960–3969. Cited by: §1.
  • [39] C. Su, S. Zhang, J. Xing, W. Gao, and Q. Tian (2018) Multi-type attributes driven multi-camera person re-identification. Pattern Recognition 75, pp. 77–89. Cited by: §2.
  • [40] P. Sudowe, H. Spitzer, and B. Leibe (2015) Person attribute recognition with a jointly-trained holistic cnn model. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 87–95. Cited by: §1.
  • [41] Y. Suh, J. Wang, S. Tang, T. Mei, and K. Mu Lee (2018) Part-aligned bilinear representations for person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 402–419. Cited by: §2.
  • [42] K. Sun, B. Xiao, D. Liu, and J. Wang (2019-06) Deep high-resolution representation learning for human pose estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.1.
  • [43] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang (2018) Beyond part models: person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), pp. 480–496. Cited by: §2.
  • [44] Z. Tan, Y. Yang, J. Wan, H. Hang, G. Guo, and S. Z. Li (2019) Attention-based pedestrian attribute analysis. IEEE transactions on image processing (12), pp. 6126–6140. Cited by: §3.1.
  • [45] C. Wang, Q. Zhang, C. Huang, W. Liu, and X. Wang (2018) Mancs: a multi-task attentional network with curriculum sampling for person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 365–381. Cited by: §2.
  • [46] G. Wang, Y. Yuan, X. Chen, J. Li, and X. Zhou (2018) Learning discriminative features with multiple granularities for person re-identification. In 2018 ACM Multimedia Conference on Multimedia Conference, pp. 274–282. Cited by: §2.
  • [47] Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In European conference on computer vision, pp. 499–515. Cited by: §3.3.
  • [48] H. Wu, J. Mao, Y. Zhang, Y. Jiang, L. Li, W. Sun, and W. Ma (2019-06) Unified visual-semantic embeddings: bridging vision and language with structured meaning representations. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
  • [49] J. Xu, R. Zhao, F. Zhu, H. Wang, and W. Ouyang (2018) Attention-aware compositional network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2119–2128. Cited by: §2.
  • [50] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: §2.
  • [51] Z. Yin, W. Zheng, A. Wu, H. Yu, H. Wan, X. Guo, F. Huang, and J. Lai (2018-07) Adversarial attribute-image person re-identification. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pp. 1100–1106. External Links: Document, Link Cited by: §1, §1, §2.
  • [52] Q. You, Z. Zhang, and J. Luo (2018) End-to-end convolutional semantic embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5735–5744. Cited by: §2.
  • [53] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele (2016) How far are we from solving pedestrian detection?. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1259–1267. Cited by: §1.
  • [54] X. Zhang, Z. Fang, Y. Wen, Z. Li, and Y. Qiao (2017) Range loss for deep face recognition with long-tailed training data. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5409–5418. Cited by: §3.3.
  • [55] Y. Zhang and H. Lu (2018) Deep cross-modal projection learning for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 686–701. Cited by: §1, §2, §2, §4.2, Table 2.
  • [56] Z. Zhang, C. Lan, W. Zeng, and Z. Chen (2019) Densely semantically aligned person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 667–676. Cited by: §2, §3.2.
  • [57] J. Zhao, J. Li, Y. Cheng, T. Sim, S. Yan, and J. Feng (2018) Understanding humans in crowded scenes: deep nested adversarial learning and a new benchmark for multi-human parsing. In 2018 ACM Multimedia Conference on Multimedia Conference, pp. 792–800. Cited by: §3.1.
  • [58] L. Zheng, Y. Huang, H. Lu, and Y. Yang (2019) Pose invariant embedding for deep person re-identification. IEEE Transactions on Image Processing. Cited by: §1.
  • [59] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: a benchmark. In Proceedings of the IEEE international conference on computer vision, pp. 1116–1124. Cited by: §1, §2, §4.4.
  • [60] Z. Zheng, L. Zheng, M. Garrett, Y. Yang, and Y. Shen (2017) Dual-path convolutional image-text embedding with instance loss. arXiv preprint arXiv:1711.05535. Cited by: §2, §4.2, Table 2.
  • [61] Z. Zhong, L. Zheng, D. Cao, and S. Li (2017) Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1318–1327. Cited by: §3.3.