Person search has gained great attention in recent years due to its wide applications in video surveillance, e.g., searching for missing persons and tracking suspects. With the dramatic increase in the number of videos, manual person search in large-scale video collections is unrealistic, so methods are needed to perform the task more efficiently. Existing person search methods fall into three categories according to the query type: image-based query [Zhou et al.2018, Zheng, Gong, and Xiang2011, Sarfraz et al.2018], attribute-based query [Su et al.2016, Vaquero et al.2009] and text-based query [Li et al.2017b, Li et al.2017a, Zheng et al.2017b, Chen, Xu, and Luo2018]. Image-based person search needs at least one image of the queried person, which in many cases is very difficult to obtain. Text-based person search avoids the need for a query image because textual descriptions are more accessible. Moreover, compared to attribute-based person search, text-based methods can describe persons in more detail and in a more natural way. Considering these advantages, we study the task of person search with natural language in this paper.
Person search with natural language aims to retrieve the person images corresponding to a textual description from a large-scale image database, as illustrated in Fig. 1. The main challenge of this task is to extract the visual contents corresponding to the human description and build their mappings in the feature space. Previous methods generally utilize Convolutional Neural Networks (CNNs) [Krizhevsky, Sutskever, and Hinton2012] to obtain a global representation of the input image, which often cannot effectively extract the visual contents corresponding to the person in the image. Considering that human pose is closely related to human body parts, we exploit pose information to guide attention for person visual feature selection. To our knowledge, we are the first to employ human pose to handle the task of text-based person search.
In terms of matching strategies between image and text, some approaches have been proposed to utilize local features for accurate matching. For example, Chen et al. [Chen, Xu, and Luo2018] fuse all the word-image patch similarities, while Chen et al. [Chen et al.2018] compute phrase-image similarities to weight image feature maps. Unlike them, we compute the local similarities between person image regions and the text. To further select the human parts most related to the textual description, we perform a hard attention over the local similarities.
In this paper, we propose a cascade attention network (CAN) for person search with natural language, which progressively selects key matching cues from the person image and the text-image similarity. Fig. 2 shows the architecture of CAN. First, we estimate human pose from the input image and extract 14 confidence maps corresponding to 14 body keypoints. On the one hand, the confidence maps are concatenated with the input image to augment the visual representation. On the other hand, the confidence maps are used to guide attention for person feature selection. Specifically, we employ a visual CNN, e.g., VGG-16 [Simonyan and Zisserman2014] or ResNet-50 [He et al.2016], and a pose CNN to extract the visual representation and the pose representation, respectively. Combining these two representations, we obtain a pose attention map which imposes weights on the visual representation. The weighted visual representation emphasizes the person in the image. Then, we use a bi-directional long short-term memory (LSTM) network [Hochreiter and Schmidhuber1997] to learn the representation of the textual description. Considering that the textual description is usually very long and contains several short sentences, we integrate the representations of all the short sentences to represent the whole description. Finally, we compute the local similarities between person image regions and the textual description rather than a global similarity between the image and the textual description. To further select the human parts most related to the textual description, we define a threshold to perform hard attention over the local similarities. Both a ranking loss and an identification loss are employed for better training. Our proposed method is evaluated on the challenging CUHK-PEDES dataset [Li et al.2017b], which is currently the only dataset for person search with natural language. Experimental results show that our CAN model outperforms the state-of-the-art methods on this dataset.
The main contributions of our work are summarized as follows:
1. We propose to utilize pose information to guide attention for person visual feature selection, which, to our knowledge, is its first use in text-based person search. The experimental results verify its effectiveness.
2. We propose a similarity-based hard attention to further select the human parts most related to the textual description.
3. The proposed cascade attention network (CAN) achieves the best results on the challenging dataset CUHK-PEDES. The extensive ablation studies verify the effectiveness of each component in the CAN.
In this section, we briefly introduce the related work, including some prior studies on person search with natural language, human pose for person search as well as attention for person search.
Person Search with Natural Language. Li et al. [Li et al.2017b] propose the task of person search with natural language and a recurrent neural network with gated neural attention (GNA-RNN) for this task. To utilize identity-level annotations, Li et al. [Li et al.2017a] propose an identity-aware two-stage framework. The first stage implements a CNN-LSTM network to embed cross-modal features, while the second stage refines the CNN-LSTM network with a latent co-attention mechanism. Chen et al. [Chen, Xu, and Luo2018] propose a word-image patch matching model to capture the local similarity. Unlike the above three methods, all of which use a CNN-RNN architecture, Zheng et al. [Zheng et al.2017b] propose to employ a CNN for textual feature learning. In this paper, we propose a cascade attention network to progressively select matching cues from the person image and the text-image similarity.
Human Pose for Person Search. With the development of pose estimation [Bak and Carr2017, Insafutdinov et al.2016, Cao et al.2017a, Carreira et al.2015, Chu et al.2016], many approaches in image-based person search extract human poses to improve the visual representation. To handle human pose variations, Liu et al. [Liu et al.2018] propose a pose transferrable person search framework that utilizes pose-transferred sample augmentations. Zheng et al. [Zheng et al.2017a] utilize pose to normalize the person image, and both the normalized image and the original image are used to match the person. Su et al. [Su et al.2016] also utilize pose to normalize the person image, but they leverage human body parts and the global image to learn a robust feature representation. Sarfraz et al. [Sarfraz et al.2018] directly concatenate the pose information with the input image to learn the visual representation. In this paper, which addresses text-based person search, we do not aim to handle misalignment or pose variations; instead, we select person image features guided by human poses.
Attention for Person Search. The attention mechanism aims to select the key parts of an input and is generally divided into soft attention and hard attention. Soft attention computes a weight map and weights the input accordingly, while hard attention preserves only one or a few parts of the input and ignores the others. Recently, attention has been widely used in person search to select either visual contents or textual information. Li et al. [Li et al.2017b] compute an attention map based on the text representation to focus on visual units. Li et al. [Li et al.2017a] propose a co-attention method, which extracts word-image features via spatial attention and aligns sentence structures via a latent semantic attention. Zhao et al. [Zhao et al.2017] employ human body parts to weight the image feature map. To our knowledge, hard attention is rarely explored in person search. In this paper, we propose a cascade attention network including a pose-guided soft attention and a similarity-based hard attention.
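The distinction between the two attention types can be illustrated with a toy sketch in plain Python. The function names and the rule of keeping a fixed number of parts are illustrative assumptions and do not correspond to any of the cited methods.

```python
def soft_attention(features, weights):
    """Soft attention: scale every feature by its weight; nothing is discarded."""
    return [w * f for f, w in zip(features, weights)]


def hard_attention(features, weights, keep):
    """Hard attention: keep the `keep` highest-weighted features unchanged, drop the rest."""
    ranked = sorted(range(len(features)), key=lambda i: weights[i], reverse=True)
    kept = sorted(ranked[:keep])  # restore the original order of the survivors
    return [features[i] for i in kept]


features = [1.0, 2.0, 3.0, 4.0]
weights = [0.1, 0.6, 0.1, 0.2]
soft = soft_attention(features, weights)        # four re-weighted values
hard = hard_attention(features, weights, keep=2)  # two unmodified values
print(soft, hard)
```

Soft attention returns as many elements as it receives, whereas hard attention returns a subset, which is the property the cascade described in this paper relies on.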
Cascade Attention Network
In this section, we explain the cascade attention network in detail. First, we introduce the procedure of visual representation extraction which contains a pose-guided attention network. Then, we describe the textual representation learning which is based on a bi-directional long short-term memory network (bi-LSTM). Next, we illustrate the hard attention for text-image similarity selection. Finally, we give the details of learning the proposed model.
Visual Representation Extraction
Considering that human pose is closely related to human body parts, we exploit pose information to guide attention for person visual feature selection. In this work, we estimate human pose from the input image using the PAF approach proposed in [Cao et al.2017b] due to its high accuracy and real-time performance. To obtain more accurate human poses, we retrain PAF on the larger AI Challenger dataset [Wu et al.2017], which annotates 14 keypoints for each person.
However, in our experiments, the retrained PAF still cannot obtain accurate human poses on challenging person search datasets such as CUHK-PEDES, due to occlusion and lighting changes. The upper right three images in Fig. 3 show cases of partially and completely missing keypoints. Looking in detail at the pose estimation procedure, we find that the 14 confidence maps produced prior to keypoint generation still convey information about the person in the image even when the estimated joint keypoints are incorrect or missing. The lower right images in Fig. 3 show the superimposed 14 confidence maps, which still provide cues about the human body and its parts.
The pose confidence maps play a two-fold role in our model. On the one hand, the 14 confidence maps are concatenated with the 3-channel input image to augment the visual representation. We extract the visual representation using both VGG-16 and ResNet-50 on the augmented 17-channel input and compare their performance in the experiments. Take VGG-16 as an example: we take the feature maps before its last pooling layer, which contain 49 regions, each represented by a 512-dimensional vector.
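The input augmentation amounts to a channel-wise concatenation. The sketch below uses nested Python lists in a channel-first layout; the function name, the layout and the toy resolution are assumptions made only for illustration.

```python
def concat_channels(image, confidence_maps):
    """Append pose confidence maps to the RGB channels (channel-first layout)."""
    height, width = len(image[0]), len(image[0][0])
    for channel in confidence_maps:
        if len(channel) != height or len(channel[0]) != width:
            raise ValueError("confidence maps must match the image resolution")
    return list(image) + list(confidence_maps)


H, W = 4, 2  # toy resolution, far smaller than a real input
image = [[[0.5] * W for _ in range(H)] for _ in range(3)]   # 3 colour channels
maps = [[[0.0] * W for _ in range(H)] for _ in range(14)]   # 14 keypoint maps
augmented = concat_channels(image, maps)
print(len(augmented))  # number of channels fed to the visual CNN
```

The confidence maps have to be resized to the image resolution before concatenation, which is why the sketch rejects mismatched shapes.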
On the other hand, the 14 confidence maps are used to guide attention for person feature selection. First, the confidence maps are embedded into a 512-dimensional vector $e_p$ by a pose CNN, which has 4 convolutional layers and a fully connected layer. Then, we propose a pose-guided soft attention model to concentrate on the person in the image. We denote the original feature maps as $V = \{v_i\}_{i=1}^{N}$ and the weighted feature maps as $\hat{V} = \{\hat{v}_i\}_{i=1}^{N}$, where $v_i$ and $\hat{v}_i$ are their region representations, respectively. The pose-guided soft attention model combines each region representation $v_i$ with the pose vector $e_p$ in a hidden layer, maps the resulting hidden state to an attention weight $\alpha_i$, and outputs $\hat{v}_i = \alpha_i v_i$. The model uses three transformation matrices, and $d_h$ denotes the dimension of the hidden states. $\alpha_i$ is the attention weight for the $i$-th region and $N$ is the number of image regions, which is 49 here. It should be noted that, unlike previous methods, which produce a single feature vector by summing the weighted region representations, our attention model generates feature maps by multiplying each region representation by its weight.
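A minimal sketch of a soft attention of this kind is given below, assuming a tanh hidden layer followed by a softmax over regions. The names `W_v`, `W_p` and `w_a`, the choice of nonlinearities and the toy sizes are illustrative assumptions rather than the exact formulation of the model.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def pose_guided_attention(regions, pose_vec, W_v, W_p, w_a):
    """Return (attention weights, re-weighted regions) for N region vectors."""
    pose_hidden = matvec(W_p, pose_vec)  # pose contribution, shared by all regions
    scores = []
    for region in regions:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W_v, region), pose_hidden)]
        scores.append(sum(w * h for w, h in zip(w_a, hidden)))
    peak = max(scores)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Each region is scaled by its own weight; the regions are NOT summed.
    weighted = [[w * x for x in region] for w, region in zip(weights, regions)]
    return weights, weighted


identity = [[1.0, 0.0], [0.0, 1.0]]
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 toy regions instead of 49
weights, weighted = pose_guided_attention(regions, [1.0, 0.0], identity, identity, [1.0, 1.0])
print(weights)
```

The output keeps one vector per region, so the spatial layout is preserved for the later region-level matching with the text.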
Textual Representation Learning
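Given a textual description $T$, we represent its $j$-th word as a one-hot vector $w_j \in \mathbb{R}^{K}$ according to the index of this word in the vocabulary, where $K$ is the vocabulary size. Then we embed the one-hot vector into a 300-dimensional embedding vector:

$$x_j = W_e w_j,$$

where $W_e \in \mathbb{R}^{300 \times K}$. To model the dependencies between adjacent words, we adopt a bi-directional long short-term memory network (bi-LSTM) to process the embedding vectors of the words in the description:

$$\overrightarrow{h}_j = \overrightarrow{\mathrm{LSTM}}(x_j, \overrightarrow{h}_{j-1}), \qquad \overleftarrow{h}_j = \overleftarrow{\mathrm{LSTM}}(x_j, \overleftarrow{h}_{j+1}),$$

where $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$ represent the forward and backward LSTMs, respectively. The forward LSTM takes the current word embedding $x_j$ and the previous hidden state $\overrightarrow{h}_{j-1}$ as inputs and outputs the current hidden state $\overrightarrow{h}_j$; the backward LSTM works in the same way in the reverse direction.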
Generally, the textual representation is defined as the concatenation of the last hidden states of the forward and backward LSTMs. Considering that the description sentence in text-based person search is very long, this concatenation alone cannot represent the entire sentence well. Consequently, we separate the long sentence into several short ones according to their semantics. As shown in Fig. 2, the description sentence can be divided into three short ones ending with “pants”, “soles” and “shoulder”. The representation of each short sentence is the concatenation of its start and end word representations, and the representation of the whole textual description integrates the representations of all the short sentences.
The experimental results in Table 3 verify the effectiveness of the proposed textual representation.
Similarity-based Hard Attention
After obtaining the visual representation $\hat{V} = \{\hat{v}_i\}_{i=1}^{N}$ and the textual representation $e_t$, we propose a hard attention for the local similarity selection between image regions and text, which aims to select the human parts most related to the textual description.

First, we transform the visual and textual representations into the same feature space as follows:

$$\tilde{v}_i = W_I \hat{v}_i, \qquad \tilde{t} = W_T e_t,$$

where $W_I$ and $W_T$ are two transformation matrices and $d$ is the dimension of the feature space. The input dimension of $W_T$ is determined by the hidden dimension of the bi-LSTM in textual representation learning.

Then, instead of directly computing the global similarity between the visual and textual representations, we define the local cosine similarity between each image region representation $\tilde{v}_i$ and the textual representation $\tilde{t}$:

$$s_i = \frac{\tilde{v}_i^{\top} \tilde{t}}{\lVert \tilde{v}_i \rVert \, \lVert \tilde{t} \rVert}, \qquad i = 1, \dots, N,$$

so that a set of local similarities $\{s_i\}_{i=1}^{N}$ is obtained.
Rather than summing all the local similarities as the final score, we propose a hard attention to select the description-related image regions and ignore the irrelevant ones. Specifically, a threshold $\tau$ is set, and a local similarity is selected only if its weight is higher than this threshold. The final similarity score is defined as follows:

$$S = \sum_{i:\, \omega_i > \tau} s_i,$$

where $s_i$ is the $i$-th local similarity and $\omega_i$ is its weight. The threshold $\tau$ is set in a way similar to [Malinowski et al.2018]. Alternatively, one could select a fixed number of similarity scores according to their values, but this strategy easily fails on complex person images. In the ablation experiments, we compare the proposed hard attention with several other selection strategies. The experimental results demonstrate the effectiveness of our model.
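The selection step can be sketched as follows. Two details are assumptions made for this illustration only, since they are not recoverable from the text above: the weight of each local similarity is taken to be its softmax value, and the default threshold is the uniform value 1/N. The function names are hypothetical as well.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hard_attention_score(region_vecs, text_vec, threshold=None):
    """Sum the local similarities whose weight exceeds the threshold."""
    sims = [cosine(r, text_vec) for r in region_vecs]
    peak = max(sims)
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]   # ASSUMED: softmax weights
    if threshold is None:
        threshold = 1.0 / len(sims)       # ASSUMED: uniform threshold 1/N
    selected = [i for i, w in enumerate(weights) if w > threshold]
    return sum(sims[i] for i in selected), selected


regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
score, selected = hard_attention_score(regions, [1.0, 0.0])
print(score, selected)
```

Under these assumptions the number of selected regions varies from pair to pair, in contrast to keeping a fixed number of top-ranked similarities.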
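The ranking loss is a common objective function for retrieval tasks. In this paper, we employ the triplet ranking loss proposed in [Faghri et al.2017] to train our CAN, which is defined as follows:

$$L_{rank} = \max\big(0,\ \alpha - S(I, T) + S(I, \hat{T})\big) + \max\big(0,\ \alpha - S(I, T) + S(\hat{I}, T)\big),$$

where $S(\cdot, \cdot)$ is the similarity score, $(I, T)$ is a positive pair, $\hat{T}$ is the hardest negative text in a mini-batch given an image $I$, $\hat{I}$ is the hardest negative image in a mini-batch given a text $T$, and $\alpha$ is a margin. $\hat{T}$ and $\hat{I}$ are defined as follows:

$$\hat{T} = \arg\max_{T' \neq T} S(I, T'), \qquad \hat{I} = \arg\max_{I' \neq I} S(I', T).$$

This ranking loss encourages the similarity of the positive pair to exceed that of the hardest negative pair by a margin.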
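A small sketch of a triplet ranking loss with hardest negatives in the style of [Faghri et al.2017] is shown below. It assumes a square similarity matrix in which entry (k, k) is the only positive pair for image k and text k, which simplifies the case where several images or texts in a mini-batch share an identity; the function name is illustrative.

```python
def hardest_negative_triplet_loss(sim, margin=0.2):
    """sim[i][j]: similarity of image i and text j; diagonal entries are positives."""
    n = len(sim)
    total = 0.0
    for k in range(n):
        positive = sim[k][k]
        # Hardest negative text for image k and hardest negative image for text k.
        hardest_text = max(sim[k][j] for j in range(n) if j != k)
        hardest_image = max(sim[i][k] for i in range(n) if i != k)
        total += max(0.0, margin - positive + hardest_text)
        total += max(0.0, margin - positive + hardest_image)
    return total


sim = [
    [0.9, 0.8, 0.1],
    [0.2, 0.7, 0.3],
    [0.4, 0.6, 0.5],
]
loss = hardest_negative_triplet_loss(sim, margin=0.2)
print(loss)
```

Only the single most confusing negative in each direction contributes, so easy negatives that already satisfy the margin add nothing to the loss.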
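In addition, we adopt the identification loss, which classifies persons into different groups by their identities and thus ensures identity-level matching. The image and text identification losses $L_{id}^{I}$ and $L_{id}^{T}$ are defined as follows:

$$L_{id}^{I} = -y^{\top} \log\big(\mathrm{softmax}(W_{id}\, \bar{v})\big), \qquad L_{id}^{T} = -y^{\top} \log\big(\mathrm{softmax}(W_{id}\, \tilde{t})\big),$$

where $W_{id} \in \mathbb{R}^{11003 \times d}$ is a shared transformation matrix, as there are 11003 different persons in the training set, $d$ is the dimension of the feature space, and $y$ is the one-hot identity label. $\bar{v}$ denotes the average pooling over the image region representations, and $\tilde{t}$ is the textual representation. We share $W_{id}$ between image and text to constrain their representations to the same feature space.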
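The identification loss can be sketched as a softmax cross-entropy over identities with a projection shared by both modalities. The cross-entropy form, the helper names and the toy sizes (3 identities, 2-dimensional features) are assumptions for illustration.

```python
import math


def average_pool(regions):
    """Average the region vectors into a single image-level vector."""
    n = len(regions)
    return [sum(r[d] for r in regions) / n for d in range(len(regions[0]))]


def identification_loss(feature, W_id, identity):
    """Softmax cross-entropy of the identity scores W_id @ feature."""
    logits = [sum(w * x for w, x in zip(row, feature)) for row in W_id]
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[identity]


W_id = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # shared by image and text
image_regions = [[2.0, 0.0], [0.0, 0.0]]
text_feature = [0.0, 1.0]
image_loss = identification_loss(average_pool(image_regions), W_id, identity=0)
text_loss = identification_loss(text_feature, W_id, identity=0)
print(image_loss, text_loss)
```

Because the same matrix scores both modalities, an image feature and a text feature of the same person are pushed toward the same row of `W_id`.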
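Finally, the total loss is defined as:

$$L = \lambda_1 L_{rank} + \lambda_2 L_{id}^{I} + \lambda_3 L_{id}^{T},$$

where $L_{rank}$ is the ranking loss, $L_{id}^{I}$ and $L_{id}^{T}$ are the image and text identification losses, and $\lambda_1$, $\lambda_2$ and $\lambda_3$ are all set to 1 in our experiments.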
During testing, we rank the person images by their similarity scores with the text query to retrieve the queried person.
In this section, we first introduce the experimental setup including dataset, evaluation metrics, implementation details, and baseline setup. Then, we analyze the quantitative results of our method and a set of baseline variants. Finally, we visualize some retrieval results given sentence queries and attention maps.
Dataset and Metrics. CUHK-PEDES is currently the only dataset for text-based person search. We follow the same data split as [Li et al.2017b]. The training set has 34054 images, 11003 persons and 68126 textual descriptions. The validation set has 3078 images, 1000 persons and 6158 textual descriptions. The test set has 3074 images, 1000 persons and 6156 textual descriptions. On average, each image has 2 different textual descriptions and each description contains more than 23 words. The dataset contains 9408 different words.
We adopt top-1, top-5 and top-10 accuracies to evaluate the performance. Given a textual description, we rank all test images by their similarities with the query text. If the top-k images contain any image of the corresponding person, the search is considered successful.
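The top-k accuracy described above can be computed as in the following sketch. The function name and the toy identities and similarity values are made up for illustration.

```python
def top_k_accuracy(similarities, image_ids, query_ids, k):
    """Fraction of text queries whose top-k ranked images include the queried person.

    similarities[q][i]: similarity between text query q and test image i.
    image_ids[i]: person identity of image i; query_ids[q]: identity described by query q.
    """
    hits = 0
    for scores, query_id in zip(similarities, query_ids):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if any(image_ids[i] == query_id for i in ranked[:k]):
            hits += 1
    return hits / len(query_ids)


image_ids = [7, 7, 8, 9]
query_ids = [7, 9]
similarities = [
    [0.2, 0.9, 0.5, 0.1],  # query about person 7
    [0.8, 0.1, 0.6, 0.7],  # query about person 9
]
top1 = top_k_accuracy(similarities, image_ids, query_ids, k=1)
top2 = top_k_accuracy(similarities, image_ids, query_ids, k=2)
print(top1, top2)
```

A query counts as a success as soon as one image of the right person appears among the first k results, regardless of how many other images of that person exist.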
Implementation Details. In our experiments, we set the dimension of the hidden states in the pose-guided attention and the hidden dimension of the bi-LSTM to 1024 and 512, respectively. In the similarity-based hard attention, we set the dimension of the feature space to 1024. After dropping the words that occur fewer than twice, we obtain a vocabulary of size 4984.
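The vocabulary filtering step can be sketched as below. Lower-casing and whitespace tokenization are assumptions of this sketch, since the tokenization is not specified here, and the function name is illustrative.

```python
from collections import Counter


def build_vocabulary(descriptions, min_count=2):
    """Map each word occurring at least `min_count` times to an integer index."""
    counts = Counter(word for text in descriptions for word in text.lower().split())
    kept = sorted(word for word, count in counts.items() if count >= min_count)
    return {word: index for index, word in enumerate(kept)}


descriptions = [
    "a man in a red shirt",
    "a woman with a red bag",
]
vocab = build_vocabulary(descriptions)
print(vocab)
```

Words below the frequency cut-off receive no index, so they have to be skipped or mapped to a shared unknown token when a description is encoded.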
We initialize the weights of visual CNN with VGG-16 or ResNet-50 pre-trained on ImageNet classification task. In order to match the dimension of the augmented first layer, we directly copy the averaged weight along the channel dimension to initialize the first layer. To better train our model, we divide the model training into two steps. First, we fix the parameters of pre-trained visual CNN and only train the other model parameters with learning rate. Second, we release the parts of the visual CNN and train the entire model with learning rate . We stop training when the loss converges. The model is optimized with Adam [Kingma and Ba2014] optimizer. The batch size and margin are 128 and 0.2, respectively.
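One way to read the first-layer initialization is sketched below: the pre-trained kernels for the 3 colour channels are kept, and their average over the channel dimension is copied to each of the 14 added channels. This reading, the function name and the 1x1 toy kernel are assumptions; the text above does not spell out the exact procedure.

```python
def expand_first_layer(kernel_rgb, extra_channels=14):
    """Extend one filter's per-channel kernels from 3 input channels to 3 + extra.

    kernel_rgb: list over the 3 input channels of k x k kernels (lists of rows).
    """
    n_in = len(kernel_rgb)
    rows, cols = len(kernel_rgb[0]), len(kernel_rgb[0][0])
    # Average the pre-trained kernels along the input-channel dimension.
    mean_kernel = [
        [sum(kernel_rgb[c][r][s] for c in range(n_in)) / n_in for s in range(cols)]
        for r in range(rows)
    ]
    original = [[row[:] for row in channel] for channel in kernel_rgb]
    extras = [[row[:] for row in mean_kernel] for _ in range(extra_channels)]
    return original + extras


kernel_rgb = [[[1.0]], [[2.0]], [[3.0]]]  # toy 1x1 kernels for R, G, B
expanded = expand_first_layer(kernel_rgb)
print(len(expanded), expanded[3])
```

The same operation would be applied to every output filter of the first convolutional layer.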
Baseline. We employ a visual CNN (VGG-16 or ResNet-50) and take the activations of its penultimate layer as the visual representation. For the textual representation, we utilize a bi-LSTM to encode the textual description and concatenate the last hidden states of the forward and backward LSTMs as the textual feature. After extracting the visual and textual features, we transform them into the same space. Only the ranking loss is used to train the model.
Comparison with the State-of-the-art Methods. Table 1 compares our model with the state-of-the-art methods. We report the results of our model with two different visual features (VGG-16 and ResNet-50). Overall, our CAN achieves the best performance under the top-1, top-5 and top-10 metrics. Compared with the best competitor, Dual Path [Zheng et al.2017b], our CAN improves the top-1 accuracy by about 8% with the VGG-16 feature and by 1.1% with the ResNet-50 feature. These improvements indicate that our CAN is effective for this task. Compared with the methods that employ a soft attention mechanism to extract visual or textual representations (GNA-RNN [Li et al.2017b], IATV [Li et al.2017a] and GLA [Chen et al.2018]), our CAN also achieves better performance under all three evaluation metrics, which supports the advantage of our pose-guided attention mechanism in selecting more discriminative person features from the input image. Although PWM-ATH [Chen, Xu, and Luo2018] also computes the local similarity between the input image and the textual description, the improvement of 13% over this method suggests that our similarity-based hard attention selects more appropriate description-related regions. Note also that our CAN greatly improves over the baseline models under all three evaluation metrics. These results further demonstrate the effectiveness of our CAN.
Ablation Experiments. To investigate the individual components of CAN, we perform a set of ablation studies. We employ VGG-16 as the visual CNN in these experiments.
Base-9408. Compared with baseline model, this model utilizes all the words in the dataset which contains 9408 words to encode the textual description.
Base-LSTM. Compared with baseline model, a unidirectional LSTM replaces the bi-LSTM to encode the textual description in this model.
Base+SA. This model adds a similarity-based hard attention network to the baseline model.
Base+PA. It adds a pose-guided attention network to the baseline model.
Base+PA+SSC. This model utilizes the short sentences combination based on Base+PA.
Base+PA+SSC+SA. Compared to Base+PA+SSC, this model adds a similarity-based hard attention network.
Base+PA+SSC+SA+$L_{id}$. This is our proposed model. $L_{id}$ denotes the identification loss, i.e., the sum of the image and text identification losses.
As Table 2 shows, utilizing all the words in the dataset lowers the accuracy, which suggests that low-frequency words introduce noise into the model. The improved performance of the bi-LSTM over the unidirectional LSTM shows that the bi-LSTM encodes the textual description more effectively.
Table 3 shows the experimental results of several baseline variants. The improved performance demonstrates that PA, SA, SSC and the identification loss $L_{id}$ are all effective for text-based person search. For example, Base+SA outperforms Base by 3.8%, 3.9% and 3.0% in terms of the top-1, top-5 and top-10 metrics, which supports the effectiveness of SA in selecting description-related regions. Base+PA improves over Base by 2.8%, 2.9% and 1.7% in terms of the top-1, top-5 and top-10 metrics, which indicates that PA helps our model select more robust person features to match the corresponding description. Similarly, these comparisons show that SSC and $L_{id}$ are also effective for this task.
Analysis of Similarity-based Hard Attention. In our experiments, our hard attention selects a variable number of regions. As illustrated in Fig. 4, our model mainly selects 13 regions for positive image-text pairs and 41 regions for negative image-text pairs, which suggests that our model matches the positive image-text pairs better. For positive pairs, our CAN selects the significant description-related regions, whereas for negative pairs the similarity-based hard attention spreads over almost every region because the image and the text do not match.
We also explore how the number of selected regions affects model performance. In these experiments, we select a fixed number of top-ranked regions according to their local similarities. Table 4 shows the results of our similarity-based hard attention when selecting different numbers of regions. The performance increases with the number of regions and soon saturates, which indicates that only some description-related regions are useful for matching. We also compare our hard attention with a soft attention model, which re-weights the local similarities according to their values. The results in Table 4 show that our similarity-based hard attention is more effective than soft attention because it filters out unrelated visual features. In addition, our CAN outperforms the model without the similarity-based hard attention.
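The two alternative strategies compared in this ablation can be sketched as follows. Using softmax values as the re-weighting in the soft variant is an assumption of this sketch, and the function names and similarity values are made up.

```python
import math


def top_k_score(sims, k):
    """Fixed-size selection: sum the k largest local similarities."""
    return sum(sorted(sims, reverse=True)[:k])


def soft_attention_score(sims):
    """Soft variant: weight every local similarity by its softmax value (ASSUMED form)."""
    peak = max(sims)
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return sum((e / total) * s for e, s in zip(exps, sims))


sims = [0.9, 0.1, 0.5, -0.2]  # toy local similarities for four regions
scores_by_k = {k: top_k_score(sims, k) for k in range(1, len(sims) + 1)}
soft_score = soft_attention_score(sims)
print(scores_by_k, soft_score)
```

The fixed-size variant always keeps the same number of regions, while the soft variant keeps all of them with reduced influence, so neither discards exactly the regions that fall below a relevance threshold.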
Fig. 5 shows the qualitative results of the person search based on text query by our proposed CAN. For each text, we show the top-8 retrieved images ranked by the similarity scores. The first two rows show four successful cases and the last row shows two unsuccessful cases. For successful cases, the corresponding images are within the top-8 images. Although some images are non-corresponding to the textual descriptions, they also fit parts of the descriptions. For example, in the first case, almost all persons wear “green clothes”, which demonstrates the effectiveness of our CAN in matching text and image. However, the left one in the third row only captures the keywords “red” and “pack” but fails to understand the whole sentence.
To verify whether the proposed model can selectively attend to the corresponding regions, we visualize the focus areas of the pose-guided attention model and the similarity-based hard attention model, respectively. Fig. 6 shows the results, where lighter regions indicate the attended areas. Both attention models attend to the person. The pose-guided attention first produces a coarse attention map of the person image; the similarity-based hard attention then selects the description-related regions and yields a refined attention map, which indicates that our cascade attention network is effective in selecting related regions.
In this paper, we propose a cascade attention network for person search with natural language description. We first utilize the pose information to guide the attention for person visual feature selection. With the extracted person image representation, we calculate the local similarities between person parts and text. Then, we propose a similarity-based hard attention to further select corresponding regions. Our ablation studies show the effectiveness of each component in the CAN. The experimental results on a challenging dataset show that our approach outperforms the state-of-the-art methods by a large margin and demonstrates better matching between image and text.
- [Bak and Carr2017] Bak, S., and Carr, P. 2017. One-shot metric learning for person re-identification. In CVPR.
- [Cao et al.2017a] Cao, Z.; Simon, T.; Wei, S.-E.; and Sheikh, Y. 2017a. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR.
- [Cao et al.2017b] Cao, Z.; Simon, T.; Wei, S.-E.; and Sheikh, Y. 2017b. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR.
- [Carreira et al.2015] Carreira, J.; Agrawal, P.; Fragkiadaki, K.; and Malik, J. 2015. Human pose estimation with iterative error feedback. In CVPR.
- [Chen et al.2018] Chen, D.; Li, H.; Liu, X.; Shen, Y.; Yuan, Z.; and Wang, X. 2018. Improving deep visual representation for person re-identification by global and local image-language association. In ECCV.
- [Chen, Xu, and Luo2018] Chen, T.; Xu, C.; and Luo, J. 2018. Improving text-based person search by spatial matching and adaptive threshold. In WACV.
- [Chu et al.2016] Chu, X.; Ouyang, W.; Li, H.; and Wang, X. 2016. Structured feature learning for pose estimation. In CVPR.
- [Faghri et al.2017] Faghri, F.; Fleet, D. J.; Kiros, J. R.; and Fidler, S. 2017. VSE++: Improved visual-semantic embeddings. In BMVC.
- [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
- [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation.
- [Insafutdinov et al.2016] Insafutdinov, E.; Pishchulin, L.; Andres, B.; Andriluka, M.; and Schiele, B. 2016. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV.
- [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. Computer Science.
- [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In NIPS.
- [Li et al.2017a] Li, S.; Xiao, T.; Li, H.; Yang, W.; and Wang, X. 2017a. Identity-aware textual-visual matching with latent co-attention. In ICCV.
- [Li et al.2017b] Li, S.; Xiao, T.; Li, H.; Zhou, B.; Yue, D.; and Wang, X. 2017b. Person search with natural language description. In CVPR.
- [Liu et al.2018] Liu, J.; Ni, B.; Yan, Y.; Zhou, P.; Cheng, S.; and Hu, J. 2018. Pose transferrable person re-identification. In CVPR.
- [Malinowski et al.2018] Malinowski, M.; Doersch, C.; Santoro, A.; and Battaglia, P. 2018. Learning visual question answering by bootstrapping hard attention. In ECCV.
- [Reed et al.2016] Reed, S.; Akata, Z.; Lee, H.; and Schiele, B. 2016. Learning deep representations of fine-grained visual descriptions. In CVPR.
- [Sarfraz et al.2018] Sarfraz, M. S.; Schumann, A.; Eberle, A.; and Stiefelhagen, R. 2018. A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking. In CVPR.
- [Simonyan and Zisserman2014] Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. Computer Science.
- [Su et al.2016] Su, C.; Zhang, S.; Xing, J.; Gao, W.; and Tian, Q. 2016. Deep attributes driven multi-camera person re-identification. In ECCV.
- [Vaquero et al.2009] Vaquero, D. A.; Feris, R. S.; Duan, T.; and Brown, L. 2009. Attribute-based people search in surveillance environments. In WACV.
- [Vinyals et al.2015] Vinyals, O.; Toshev, A.; Bengio, S.; and Erhan, D. 2015. Show and tell: A neural image caption generator. In CVPR.
- [Wu et al.2017] Wu, J.; Zheng, H.; Zhao, B.; Li, Y.; Yan, B.; Liang, R.; Wang, W.; Zhou, S.; Lin, G.; and Fu, Y. 2017. AI Challenger: A large-scale dataset for going deeper in image understanding. In arXiv preprint arXiv:1711.06475.
- [Zhao et al.2017] Zhao, L.; Li, X.; Zhuang, Y.; and Wang, J. 2017. Deeply-learned part-aligned representations for person re-identification. In ICCV.
- [Zheng et al.2017a] Zheng, L.; Huang, Y.; Lu, H.; and Yang, Y. 2017a. Pose invariant embedding for deep person re-identification. In arXiv preprint arXiv:1701.07732.
- [Zheng et al.2017b] Zheng, Z.; Zheng, L.; Garrett, M.; Yang, Y.; and Shen, Y.-D. 2017b. Dual-path convolutional image-text embedding. In arXiv preprint arXiv:1711.05535.
- [Zheng, Gong, and Xiang2011] Zheng, W. S.; Gong, S.; and Xiang, T. 2011. Person re-identification by probabilistic relative distance comparison. In CVPR.
- [Zhou et al.2018] Zhou, Q.; Fan, H.; Zheng, S.; Su, H.; Li, X.; Wu, S.; and Ling, H. 2018. Graph correspondence transfer for person re-identification. In AAAI.