Person re-identification (re-ID) has become increasingly popular in the computer vision community because of its significance for research and applications in visual surveillance. It aims to recognize a person of interest (the query) across different cameras. The most challenging problem in re-ID is accurately matching persons under large appearance variations, such as changes in human pose, camera viewpoint, and illumination. Encouraged by the remarkable success of deep learning algorithms and the emergence of large-scale datasets, many advanced methods have been developed to alleviate these vision-based difficulties and have made significant improvements [13, 23, 7].
Recent years have shown that various kinds of auxiliary information, such as human poses [23], person attributes [22], and language descriptions [7], can significantly boost the performance of person re-ID. This information serves as an augmented feature representation. Notably, image captions can provide a comprehensive and detailed description of a specific person and are semantically richer than visual attributes. More importantly, language descriptions of a particular person are often more consistent across different cameras (or views), which can alleviate the difficulty of appearance variance in the person re-ID task.
Two significant barriers exist in applying image captions to person re-ID. The first is the added complexity of handling image captions. Language descriptions contain much redundant and fuzzy information, which can be a great challenge if not handled properly. Thus, an effective learning approach for constructing a compact representation of language descriptions is of vital importance. The second is the lack of description annotations for the person re-ID task. Recently, the CUHK-PEDES dataset [14] was proposed, which provides person images with annotated captions. Its images are collected from various person re-ID benchmark datasets, such as CUHK01 [15], CUHK03 [16], and Market-1501 [29], among others. However, the annotations are restricted to these datasets. In real-world applications, person images normally do not have paired language descriptions. Thus, a method for automatically generating high-quality semantic image captions for various real-world datasets is also urgently needed.
In this paper, we propose a novel hierarchical offshoot recurrent network (HorNet) for improving person re-ID via image captioning. Figure 1 illustrates our framework for the person re-ID task. We first use the similarity preserving generative adversarial network (SPGAN) [8] to transfer real-world images into a unified domain, which can significantly enhance the quality of the descriptions generated by the subsequent image captioner [2]. Then both the images and the generated captions are used as input to HorNet. HorNet has two sub-networks that handle the input images and captions, respectively. For images, we use a mainstream CNN (i.e., ResNet-50) to extract visual features. For captions, we develop a two-layer LSTM module with a discrete binary gate at each time step. The gradient of the discrete gates is estimated using a Gumbel-Softmax-based reparameterization [11]. This module dynamically controls the information flow from the lower layer to the upper layer via these gates. It selects the most relevant words (i.e., the correct or meaningful ones), which are consistent with the input visual features. Consequently, HorNet can jointly learn visual representations from the given images and language representations from the generated captions, and thus significantly enhance the performance of person re-ID. Finally, we verify the performance of the proposed method in two scenarios, i.e., person re-ID datasets with and without image captions. Experimental results on several widely used benchmark datasets, i.e., CUHK03, Market-1501, and Duke-MTMC, demonstrate the superiority of the proposed method, which learns visual and language representations from both images and captions while achieving state-of-the-art recognition performance.
In a nutshell, our main contributions are threefold:
(1) We develop a new captioning module for person re-ID systems based on image domain transfer and an image captioner. It can generate high-quality language captions for given images.
(2) We propose a novel hierarchical offshoot recurrent network (HorNet) based on the generated image captions, which jointly learns visual and language representations.
(3) We verify the superiority of the proposed method on the person re-ID task. State-of-the-art empirical results are achieved on three commonly used benchmark datasets.
2 Related Work
Early research on person re-ID mainly focused on visual feature extraction. For instance, in [27] a pedestrian image is split into three horizontal parts and three part-specific CNNs are trained to extract features; the similarity between two images is then calculated from the cosine distance between their features. Triplet samples have also been used to train the network, considering not only samples of the same person but also samples of different people. A multi-scale triplet CNN for person re-ID is proposed in [19]. With the recent release of large-scale benchmark datasets, e.g., CUHK03 [16] and Market-1501 [29], many researchers have tried to learn deep models based on the identity loss. A conventional fine-tuning approach is used directly in [31] and outperforms many previous results. Also, recent research [32] shows that a discriminative loss combined with a verification loss objective is superior.
Several recent works have used auxiliary information to aid feature representation for person re-ID. Some [23, 28] rely on extra information about the person's pose. They leverage human part cues to alleviate pose variations and learn robust feature representations from both global and local image regions. Another type of auxiliary information, person attributes, has also been used in person re-ID [18]. However, these methods all rely on attribute annotations, which are normally hard to collect in real-world applications. In [22], automatically detected attributes are used together with visual features for person re-ID; the attribute detector is trained on another dataset that contains attribute annotations.
The relationship between visual representations and language descriptions has long been investigated and has attracted much attention in tasks such as image captioning [26] and visual question answering. Associating person images with their corresponding language descriptions for person search was proposed in [14]. Several works employ language descriptions as complementary information, together with visual representations, for person re-ID. In [7], natural language descriptions are exploited as additional training supervision for learning effective visual features. In [25], language descriptions and image features are fused for the person re-ID task.
3 Our Method
3.1 Improving Person Re-ID via Image Captioning
The image caption of a specific person is semantically rich and can provide complementary information to the visual representation. However, handcrafted descriptions of person images are hard to collect because of annotation difficulties in real-world person re-ID applications. We propose a method to generalize language descriptions from a dataset with image captions to datasets without such captions. The overall scheme of our approach is illustrated in Figure 2. Given images with captions, i.e., the CUHK-PEDES dataset, we use SPGAN to transfer arbitrary images to the CUHK-PEDES style. SPGAN was proposed to improve image-to-image domain adaptation by preserving both self-similarity and domain-dissimilarity for person re-ID. We use it as in [8] to transfer the image domain (or style) of the un-annotated datasets. We then train an image captioner [2] and use it to generate image descriptions automatically on the transferred datasets. The visualization of the domain transfer process and the corresponding generated captions is shown in Figure 4. The transferred images clearly yield more accurate language descriptions. However, the generated sentences, which are based on the domain-translated images, still contain some incorrect keywords and redundant information. The proposed HorNet, which contains discrete binary gates, can select the language tokens most relevant to the visual features, and thus provides a good solution to this issue.
3.2 The Proposed HorNet Model
To enrich the visual representation of a person in the person re-ID task, we propose HorNet to jointly learn the visual features and the corresponding language description. HorNet adds a branch to the CNN (i.e., ResNet-50) consisting of a two-layer LSTM with a discrete or continuous gate between the layers at every time step. The lower-layer LSTM processes the input language, while the upper-layer LSTM selects the relevant language features via the gates. Finally, the last hidden state of the upper-layer LSTM is concatenated with the visual features extracted by ResNet-50 to form a compact representation. The objective function of HorNet consists of two parts, the identification loss and the triplet loss, which are trained jointly to optimize the person re-ID model.
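We present the pipeline of the proposed method in Figure 3. The input to HorNet consists of two parts, i.e., the image and the corresponding language description. The language description is processed by a two-layer LSTM model. The bottom layer is a standard LSTM network, which reasons over the sequential input of the language description. More formally, let $x_t$ be the input of the description, $h_t^{l}$ be the hidden state, and $c_t^{l}$ be the LSTM cell state at time $t$. The term $x_t = E w_t$, where $w_t$ is the one-hot vector of the $t$-th word and $E$ is the embedding matrix, denotes the word embedding of the input description. In our work, we use a linear embedding for the input language tokens. Hence, the bottom layer can be expressed as in Equation (1).

$$h_t^{l},\; c_t^{l} = \mathrm{LSTM}\left(x_t,\; h_{t-1}^{l},\; c_{t-1}^{l}\right) \tag{1}$$

where the function $\mathrm{LSTM}(\cdot)$ denotes the compact form of the forward pass of an LSTM unit with a forget gate:

$$\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
h_t &= o_t \circ \tanh\left(c_t\right)
\end{aligned}$$

Here $x_t$ denotes the input vector, $f_t$ is the forget gate's activation, $i_t$ is the input gate's activation, $o_t$ is the output gate's activation, $c_t$ is the cell state vector, and $h_t$ is the hidden state of the LSTM unit. $W$, $U$ and $b$ are the weight matrices and bias vectors that need to be learned during training. The activation $\sigma$ is the sigmoid function and the operator $\circ$ denotes the Hadamard product (i.e., element-wise product).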
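As an illustration of the LSTM unit with a forget gate described above, the following is a minimal pure-Python sketch of one forward step. The function names (`lstm_step`, `affine`, `init_params`), the toy dimensions, and the random initialization are assumptions made for this example and are not taken from the paper.

```python
import math
import random


def sigmoid(a):
    # Numerically stable logistic function.
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def affine(p, x, h):
    # Computes W x + U h + b for one gate; p = (W, U, b), matrices as lists of rows.
    W, U, b = p
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bi
        for w_row, u_row, bi in zip(W, U, b)
    ]


def lstm_step(x, h_prev, c_prev, params):
    # One forward step of an LSTM unit with a forget gate.
    f = [sigmoid(a) for a in affine(params["f"], x, h_prev)]    # forget gate
    i = [sigmoid(a) for a in affine(params["i"], x, h_prev)]    # input gate
    o = [sigmoid(a) for a in affine(params["o"], x, h_prev)]    # output gate
    g = [math.tanh(a) for a in affine(params["c"], x, h_prev)]  # candidate cell
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


def init_params(input_dim, hidden_dim, rng):
    # Small random weights and zero biases (illustrative initialization only).
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {k: (mat(hidden_dim, input_dim), mat(hidden_dim, hidden_dim), [0.0] * hidden_dim)
            for k in "fioc"}


rng = random.Random(0)
params = init_params(input_dim=4, hidden_dim=3, rng=rng)
h, c = [0.0] * 3, [0.0] * 3
for _ in range(5):  # five toy word embeddings x_t
    x_t = [rng.uniform(-1.0, 1.0) for _ in range(4)]
    h, c = lstm_step(x_t, h, c, params)
print(h)
```

Running the loop over a caption yields the sequence of lower-layer hidden states that the boundary gates subsequently filter.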
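The boundary gate controls the information flow from the lower layer to the upper layer. It is estimated with the Gumbel-Sigmoid, which is derived directly from the Gumbel-Softmax proposed in [11].
The Gumbel-Softmax replaces the $\arg\max$ in the Gumbel-Max Trick with the following Softmax function:

$$y_i = \frac{\exp\left((\pi_i + g_i)/\tau\right)}{\sum_{j=1}^{k} \exp\left((\pi_j + g_j)/\tau\right)}, \quad i = 1, \dots, k \tag{2}$$

where $\pi_i$ is the unnormalized log-probability (logit) of category $i$, $g_1, \dots, g_k$ are sampled from the distribution $\mathrm{Gumbel}(0, 1)$, and $\tau$ is the temperature parameter. $k$ indicates the dimension of the generated Softmax vector (i.e., the number of categories).
To derive the Gumbel-Sigmoid, we first rewrite the Sigmoid function as a Softmax of two values, $\pi_i$ and 0, as in Equation (3).

$$\mathrm{sigm}(\pi_i) = \frac{1}{1 + \exp(-\pi_i)} = \frac{\exp(\pi_i)}{\exp(\pi_i) + \exp(0)} \tag{3}$$

Hence, the Gumbel-Sigmoid can be written as in Equation (4).

$$y_i = \frac{\exp\left((\pi_i + g_i)/\tau\right)}{\exp\left((\pi_i + g_i)/\tau\right) + \exp\left(g'/\tau\right)} \tag{4}$$

where $g_i$ and $g'$ are independently sampled from the distribution $\mathrm{Gumbel}(0, 1)$.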
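The derivation above can be checked numerically. The sketch below, in pure Python, samples Gumbel noise by inverse transform sampling and verifies that the Gumbel-Sigmoid of a logit coincides with the first component of a two-category Gumbel-Softmax over that logit and 0. The function names, the toy logit, and the temperature value are assumptions for illustration only; the 0.5 threshold used for the hard gate corresponds to taking the arg max over the two categories.

```python
import math
import random


def sample_gumbel(rng):
    # Inverse transform sampling: g = -log(-log(u)), u ~ Uniform(0, 1).
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep u strictly inside (0, 1)
    return -math.log(-math.log(u))


def gumbel_softmax(logits, noise, tau):
    # Softmax over noise-perturbed logits, scaled by the temperature tau.
    scores = [(l + g) / tau for l, g in zip(logits, noise)]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_sigmoid(logit, g, g_prime, tau):
    # Two-category special case: equals sigmoid((logit + g - g_prime) / tau).
    a = (logit + g - g_prime) / tau
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


rng = random.Random(0)
pi, tau = 0.7, 0.5  # toy logit and temperature (assumed values)
g, g_prime = sample_gumbel(rng), sample_gumbel(rng)

soft_gate = gumbel_sigmoid(pi, g, g_prime, tau)
two_class = gumbel_softmax([pi, 0.0], [g, g_prime], tau)[0]
hard_gate = 1.0 if soft_gate > 0.5 else 0.0  # arg max over the two categories

print(soft_gate, two_class, hard_gate)
```

Lower temperatures push the soft gate towards 0 or 1, while higher temperatures keep it closer to 0.5.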
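Thus, the inputs to the upper-layer LSTM are the gated hidden units of the lower layer, which can be expressed as Equation (5) and Equation (6). In our experiments, all the soft gates are estimated using the Gumbel-Sigmoid with a constant temperature $\tau$.

$$z_t = \text{Gumbel-Sigmoid}\left(W_z\,[v;\, h_t^{l}] + b_z\right) \tag{5}$$

$$h_t^{u},\; c_t^{u} = \mathrm{LSTM}\left(z_t\, h_t^{l},\; h_{t-1}^{u},\; c_{t-1}^{u}\right) \tag{6}$$

where $z_t$ is the boundary gate at time $t$, $h_t^{l}$ is the hidden state of the lower layer, $h_t^{u}$ and $c_t^{u}$ are the hidden state and cell state of the upper layer, $W_z$ and $b_z$ are learnable parameters, $v$ denotes the deep visual features of the images extracted via CNNs (i.e., ResNet-50), and $[\cdot\,;\cdot]$ indicates the feature concatenation operation.
To obtain a discrete value (i.e., language token selection), we also set hard gates in Equation (5):

$$\hat{z}_t = \begin{cases} 1, & z_t > 0.5 \\ 0, & \text{otherwise} \end{cases} \tag{7}$$

Finally, we forward the last hidden unit of the language branch to form a compact representation by concatenating it with the corresponding visual features:

$$r = [v;\, h_T^{u}] \tag{8}$$

where $T$ is the length of the language description.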
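To make the data flow of the language branch concrete, here is a forward-only sketch in pure Python: a lower LSTM reads the word embeddings, a hard binary gate computed from the visual features and the lower hidden state decides whether that state is passed on, an upper LSTM consumes the gated states, and its last hidden state is concatenated with the visual features. The linear gate parameterization (`w_z`, `b_z`), the 0.5 threshold, the temperature value, and all dimensions are assumptions for illustration. The fully connected layer applied to the last hidden state and the gradient estimation used during training are omitted.

```python
import math
import random


def sigmoid(a):
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def lstm_step(x, h_prev, c_prev, params):
    # Standard LSTM step; params[k] = (W, U, b) for k in "fioc".
    def pre(k):
        W, U, b = params[k]
        return [sum(w * a for w, a in zip(wr, x)) + sum(u * a for u, a in zip(ur, h_prev)) + bi
                for wr, ur, bi in zip(W, U, b)]
    f, i, o = ([sigmoid(a) for a in pre(k)] for k in "fio")
    g = [math.tanh(a) for a in pre("c")]
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    return [ot * math.tanh(ct) for ot, ct in zip(o, c)], c


def init_params(in_dim, hid, rng):
    def mat(r, c):
        return [[rng.uniform(-0.1, 0.1) for _ in range(c)] for _ in range(r)]
    return {k: (mat(hid, in_dim), mat(hid, hid), [0.0] * hid) for k in "fioc"}


def gumbel(rng):
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def boundary_gate(v, h_low, w_z, b_z, tau, rng):
    # Hard Gumbel-Sigmoid gate computed from [v; h_low]; returns 0.0 or 1.0.
    logit = sum(w * a for w, a in zip(w_z, v + h_low)) + b_z
    soft = sigmoid((logit + gumbel(rng) - gumbel(rng)) / tau)
    return 1.0 if soft > 0.5 else 0.0


def language_branch(words, v, low, up, w_z, b_z, tau, rng):
    hid = len(w_z) - len(v)
    h_l, c_l, h_u, c_u = ([0.0] * hid for _ in range(4))
    gates = []
    for x in words:
        h_l, c_l = lstm_step(x, h_l, c_l, low)               # lower layer reads the word
        z = boundary_gate(v, h_l, w_z, b_z, tau, rng)        # binary token selection
        gates.append(z)
        h_u, c_u = lstm_step([z * a for a in h_l], h_u, c_u, up)  # upper layer sees gated state
    return v + h_u, gates  # concatenation [v; last upper hidden state]


rng = random.Random(0)
emb, hid, vis = 4, 3, 5  # toy dimensions
low, up = init_params(emb, hid, rng), init_params(hid, hid, rng)
w_z = [rng.uniform(-0.5, 0.5) for _ in range(vis + hid)]
v = [rng.uniform(0.0, 1.0) for _ in range(vis)]
words = [[rng.uniform(-1.0, 1.0) for _ in range(emb)] for _ in range(6)]
feature, gates = language_branch(words, v, low, up, w_z, 0.0, tau=1.0, rng=rng)
print(len(feature), gates)
```

The returned `gates` list shows which time steps were selected, which is one way to inspect which caption words the model keeps.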
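Our loss function combines the identification loss and the triplet loss, as expressed in Equation (9):

$$\mathcal{L} = \mathcal{L}_{id} + \mathcal{L}_{tri} \tag{9}$$

where $\mathcal{L}_{id}$ is a $K$-class cross-entropy loss parameterized by $\theta$. It treats each person ID as an individual class:

$$\mathcal{L}_{id} = -\frac{1}{N} \sum_{i=1}^{N} \log p_{\theta}\left(y_i \mid r_i\right) \tag{10}$$

where $N$ is the number of training samples, $r_i$ is the representation of the $i$-th sample, $y_i \in \{1, \dots, K\}$ is its ID label, and $p_{\theta}$ is the Softmax output of the classifier. $\mathcal{L}_{tri}$ denotes the triplet loss:

$$\mathcal{L}_{tri} = \max\left(0,\; d(a, p) - d(a, n) + m\right) \tag{11}$$

where $a$ is the anchor, $p$ indicates the positive example, $n$ is the negative sample, and $d(\cdot, \cdot)$ is the distance between two representations. The term $m$ is the margin.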
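The combined objective can be sketched for a single training triplet as follows. This is a minimal pure-Python illustration, assuming a Euclidean distance, an unweighted sum of the two terms, and a toy margin of 0.3; none of these specifics are confirmed by the text above, and the function names are hypothetical.

```python
import math


def cross_entropy(logits, label):
    # K-class identification loss for one sample: -log softmax(logits)[label].
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin):
    # Hinge on the gap between anchor-positive and anchor-negative distances.
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def total_loss(logits, label, anchor, positive, negative, margin=0.3):
    # Unweighted sum of the two terms (the weighting is an assumption).
    return cross_entropy(logits, label) + triplet_loss(anchor, positive, negative, margin)


# Toy example: 3 identities, 2-D embeddings.
loss = total_loss(
    logits=[2.0, 0.5, -1.0], label=0,
    anchor=[0.0, 0.0], positive=[0.0, 1.0], negative=[3.0, 4.0],
)
print(loss)
```

In this toy case the negative is already farther from the anchor than the positive by more than the margin, so only the identification term contributes.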
Table 1: Ablation results on Market-1501 and CUHK03 (detected and labeled images).

| Methods | Market-1501 | | | | CUHK03 Detected | | | | CUHK03 Labeled | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Identification + Triplet Loss | 71.4 | 86.3 | 95.1 | 96.9 | 88.3 | 98.0 | 98.9 | 99.3 | 92.2 | 99.2 | 99.6 | 99.8 |
| Identification Loss + HorNet | 65.3 | 82.6 | 93.0 | 95.7 | 80.0 | 91.0 | 92.4 | 93.1 | 81.9 | 92.2 | 93.1 | 93.6 |
| Identification + Triplet Loss + HorNet | 73.3 | 88.6 | 95.2 | 96.7 | 91.5 | 98.5 | 99.2 | 99.4 | 92.4 | 98.9 | 99.5 | 99.7 |
| HorNet + Rerank | 85.6 | 91.0 | 94.7 | 95.9 | 95.0 | 98.8 | 99.0 | 99.4 | 97.1 | 99.4 | 99.7 | 99.8 |

Table 2: Comparison with state-of-the-art methods on CUHK03.

| Methods | CUHK03 (Detected) | | CUHK03 (Labeled) | |
|---|---|---|---|---|
| Deep Person [5] | 89.4 | 98.2 | 91.5 | 99.0 |
| HorNet + Rerank | 95.0 | 98.8 | 97.1 | 99.4 |

Table 3: Results on Market-1501.

| Methods | | | | |
|---|---|---|---|---|
| HorNet + Rerank | 85.8 | 91.0 | 94.2 | 97.4 |

Table 4: Results on Duke-MTMC.

| Methods | | | | |
|---|---|---|---|---|
| BoW + Kissme [30] | 12.2 | 25.1 | - | - |
| LOMO + XQDA [17] | 17.0 | 30.8 | - | - |
| Verification + Identification [32] | 49.3 | 68.9 | - | - |
| PAN + Rerank [34] | 66.7 | 75.9 | - | - |
| FMN + Rerank [10] | 72.8 | 79.5 | - | - |
| Resnet50 + BERT + Rerank [9] | 78.8 | 84.1 | 90.0 | 92.2 |
| Identification Loss + HorNet (Without Domain Transfer) | 52.5 | 71.1 | 82.6 | 87.7 |
| Identification Loss + HorNet (With Domain Transfer) | 58.4 | 74.3 | 87.3 | 90.8 |
| HorNet (With Domain Transfer) | 60.4 | 76.4 | 88.1 | 90.5 |
| HorNet + Rerank | 79.2 | 84.4 | 90.2 | 92.5 |
4 Experiments
We evaluated the proposed method on three person re-ID datasets: CUHK03 [16], Market-1501 [29], and Duke-MTMC [21]. There are two types of experiments: with and without description annotations. For CUHK03 and Market-1501, the description annotations can be retrieved directly from the CUHK-PEDES dataset [14]. Hence, we evaluated the proposed method for person re-ID using these annotations. Duke-MTMC, however, lacks language annotations. For this dataset, we used an image captioner to generate language descriptions, which are then used to jointly optimize the proposed HorNet.
4.1 Implementation Details
For the HorNet, the embedding dimension of word vectors is . The dimension of the hidden unit of the LSTM is also . Since we used a fully connected layer to process the last hidden unit of the upper-layer LSTM and consequently reduced its dimension to 256, the dimension of the final representation for vision and language is (i.e., ), in which the dimension of the visual features is .
4.2 Experiments with CUHK-PEDES Annotations
To first verify the effect of language descriptions on person re-ID, we augmented two standard person re-ID datasets, CUHK03 and Market-1501, with annotations from the CUHK-PEDES dataset. The CUHK-PEDES annotations were initially developed for cross-modal text-based person search. Since the persons in Market-1501 and CUHK03 have many similar samples and only four images of each person in these two datasets have language descriptions, we annotated the unannotated images in those datasets with the language information from the same ID, following the protocol in [7].
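The annotation step described above, in which images without captions borrow the language information of the same person ID, can be sketched as follows. The data layout (dictionaries keyed by image name) and the choice to assign every caption of the same ID are assumptions for illustration; the exact assignment rule of the protocol is not specified here.

```python
from collections import defaultdict


def propagate_captions(image_ids, captions):
    """Give every image the captions of its person ID.

    image_ids: dict mapping image name -> person ID (all images).
    captions:  dict mapping image name -> list of captions (annotated images only).
    """
    by_id = defaultdict(list)
    for name, caps in captions.items():
        by_id[image_ids[name]].extend(caps)
    # Annotated images keep their own captions; the others borrow from the same ID.
    return {
        name: list(captions[name]) if name in captions else list(by_id.get(pid, []))
        for name, pid in image_ids.items()
    }


# Toy example with hypothetical file names and captions.
image_ids = {"p1_cam1.jpg": 1, "p1_cam2.jpg": 1, "p2_cam1.jpg": 2}
captions = {"p1_cam1.jpg": ["a man in a red shirt carrying a black backpack"]}
augmented = propagate_captions(image_ids, captions)
print(augmented["p1_cam2.jpg"])
```

Images whose ID has no annotated sample receive an empty list and would need generated captions instead.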
We first evaluated the proposed method on the CUHK03 dataset, using the classical evaluation protocol [16]. We tested the baseline method, which uses the identification loss with the ResNet-50 model, in terms of CMC top-1 accuracy on the CUHK03 detected images. Augmenting the baseline with language descriptions raises the CMC top-1 accuracy. A similar phenomenon can be seen on the CUHK03 labeled images. We performed an ablation study to verify the effect of the different parts of HorNet. The experimental results are presented in Table 1. The proposed HorNet with the reranking technique can achieve the best performance on Market-1501, CUHK03 detected images, and CUHK03 labeled images.
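For readers unfamiliar with the metrics, the sketch below computes CMC top-k accuracy and mean average precision (mAP) from a query-by-gallery distance matrix. It is a simplified illustration: the function name `evaluate` and the toy data are assumptions, and benchmark-specific steps such as excluding gallery images taken by the same camera as the query are omitted.

```python
def evaluate(dist, query_ids, gallery_ids, ks=(1, 5, 10)):
    """Return (CMC top-k accuracies, mAP) for a distance matrix dist[query][gallery]."""
    cmc = {k: 0 for k in ks}
    aps = []
    for row, qid in zip(dist, query_ids):
        order = sorted(range(len(row)), key=row.__getitem__)  # nearest gallery first
        matches = [gallery_ids[j] == qid for j in order]
        if not any(matches):
            continue  # queries without a true match are skipped
        first_hit = matches.index(True)
        for k in ks:
            cmc[k] += first_hit < k  # hit if the first true match is within the top k
        hits, precisions = 0, []
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                hits += 1
                precisions.append(hits / rank)
        aps.append(sum(precisions) / len(precisions))
    if not aps:
        raise ValueError("no query has a matching gallery identity")
    n = len(aps)
    return {k: count / n for k, count in cmc.items()}, sum(aps) / n


# Toy example: 3 queries, 3 gallery images.
dist = [
    [0.1, 0.5, 0.9],
    [0.8, 0.2, 0.3],
    [0.9, 0.1, 0.5],
]
cmc, mean_ap = evaluate(dist, query_ids=[1, 2, 1], gallery_ids=[1, 2, 1], ks=(1, 2))
print(cmc, mean_ap)
```

In the toy example, the third query ranks a wrong identity first, so it counts as a miss at top-1 but as a hit at top-2.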
The comparison with other state-of-the-art methods is listed in Table 2. Specifically, we compared the proposed HorNet with other methods that employ auxiliary information, including Deep Semantic Embedding (DSE) [6], ACRN [22], Vision and Language (VL) [25], and Image-language Association (ILA) [7]. ACRN applies auxiliary attribute information to aid person re-ID. DSE, VL and ILA all employ external language descriptions for person re-ID. Among them, VL uses a vanilla LSTM or CNN language encoding model, which is less discriminative than our HorNet, since HorNet uses discrete gates to select useful information for person re-ID. ILA uses the same training and testing protocol but with a more complex model. Our model can also be combined with various metric learning techniques, including the Rerank proposed in [35]. We employed Rerank to post-process our features, which improved the results. Overall, HorNet performs much better than ILA under the CUHK03 classic evaluation protocol in terms of CMC top-1 accuracy. We also conducted experiments on the Market-1501 dataset; the results are presented in Table 1 and Table 3. A phenomenon similar to that on CUHK03 can be seen in the mAP results in Table 3.
4.3 Experiments on the Duke-MTMC Dataset (Without Captions)
In a realistic person re-ID system, language annotations are rare and hard to obtain. Hence, we want to see whether automatically generated language descriptions can boost the performance of a person re-ID system. We chose a more challenging and realistic dataset, Duke-MTMC [21], to verify this assumption. First, we trained an image captioning model on the CUHK-PEDES dataset using the convolutional image captioning model [2], which has released code and good performance. We split the CUHK-PEDES images into two parts: 95% for training and 5% for validation. We trained the image captioning model with early stopping and achieved 35.4 BLEU-1, 22.4 BLEU-2, 15.0 BLEU-3, 9.9 BLEU-4, 22.3 METEOR, 34.2 ROUGE_L and 22.1 CIDEr on the validation set. Subsequently, we used the trained model to generate language descriptions for the Duke-MTMC dataset. However, we found that the generated descriptions are not discriminative enough, as shown in Figure 4: they contain many incorrect or imprecise keywords. We also tested the performance obtained by augmenting Duke-MTMC with the generated descriptions, and the results turned out to be poor, even worse than the baselines in terms of mAP, as shown in Table 4. The cause of this phenomenon is the poor generalization capability of the image captioning model, especially when there is a domain gap between two diverse datasets. To alleviate this problem, we used SPGAN [8] to transfer the image style of Duke-MTMC to that of CUHK-PEDES. The generated language descriptions are of much better quality, as presented in Figure 4. Augmenting with the language descriptions generated on the transferred Duke-MTMC images yields a much better mAP on Duke-MTMC than the visual features alone. To demonstrate the superiority of HorNet, we also replaced HorNet with BERT [9], which led to poorer performance. Furthermore, we applied Rerank [35] to boost the final recognition performance in terms of mAP (Table 4).
5 Conclusion
In this paper, we developed a language captioning module for person re-ID systems based on image domain transfer and image captioning techniques. It can generate high-quality language descriptions for images, which can significantly compensate for the visual variance in person re-ID. We then proposed a novel hierarchical offshoot recurrent network (HorNet) for improving person re-ID via this automatic image captioning module. It learns visual and language representations from both the images and the generated captions, and thus enhances performance. The experiments demonstrate promising results for our model on the CUHK03, Market-1501 and Duke-MTMC datasets. Future research includes a more robust language captioning module and advanced metric learning methods.
References
[1] (2018) Re-id done right: towards good practices for person re-identification. arXiv:1801.05339.
[2] (2018) Convolutional image captioning. In CVPR, pp. 5561–5570.
[3] (2014) Neural machine translation by jointly learning to align and translate. arXiv:1409.0473.
[4] (2017) Scalable person re-identification on supervised smoothed manifold. In CVPR, pp. 2530–2539.
[5] Deep-person: learning discriminative deep features for person re-identification. arXiv:1711.10658.
[6] (2018) Joint deep semantic embedding and metric learning for person re-identification. PRL.
[7] (2018) Improving deep visual representation for person re-identification by global and local image-language association. In ECCV, pp. 54–70.
[8] (2018) Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person reidentification. In CVPR, pp. 994–1003.
[9] (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
[10] (2017) Let features decide for themselves: feature mask network for person re-identification. arXiv:1711.07155.
[11] (2016) Categorical reparameterization with Gumbel-Softmax. arXiv:1611.01144.
[12] (2018) Focused hierarchical RNNs for conditional sequence processing. arXiv:1806.04342.
[13] (2017) Learning deep context-aware features over body and latent parts for person re-identification. In CVPR, pp. 384–393.
[14] (2017) Person search with natural language description. In CVPR, pp. 1970–1979.
[15] (2012) Human reidentification with transferred metric learning. In ACCV, pp. 31–44.
[16] (2014) DeepReID: deep filter pairing neural network for person re-identification. In CVPR, pp. 152–159.
[17] (2015) Person re-id by local maximal occurrence representation and metric learning. In CVPR, pp. 2197–2206.
[18] (2017) Improving person re-identification by attribute and identity learning. arXiv:1703.07220.
[19] (2016) Multi-scale triplet CNN for person re-identification. In ACMMM, pp. 192–196.
[20] (2017) Multi-scale deep learning architectures for person re-identification. pp. 5399–5408.
[21] (2016) Performance measures and a data set for multi-target, multi-camera tracking. In ECCV, pp. 17–35.
[22] (2017) Person re-id by deep learning attribute-complementary information. In CVPRW, pp. 20–28.
[23] (2017) Pose-driven deep convolutional model for person re-identification. In ICCV, pp. 3960–3969.
[24] (2017) SVDNet for pedestrian retrieval. In ICCV, pp. 3800–3808.
[25] (2018) Person re-identification with vision and language. In ICPR, pp. 2136–2141.
[26] Image captioning using adversarial networks and reinforcement learning. In ICPR, pp. 248–253.
[27] (2014) Deep metric learning for person re-identification. In ICPR, pp. 34–39.
[28] (2017) Deeply-learned part-aligned representations for person re-identification. In ICCV, pp. 3219–3228.
[29] (2015) Person re-identification meets image search. arXiv:1502.02171.
[30] (2015) Scalable person re-identification: a benchmark. In ICCV, pp. 1116–1124.
[31] (2016) Person re-identification: past, present and future. arXiv:1610.02984.
[32] (2017) A discriminatively learned CNN embedding for person reidentification. TOMM 14 (1), pp. 13.
[33] (2017) Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In CVPR, pp. 3754–3762.
[34] (2018) Pedestrian alignment network for large-scale person re-identification. TCSVT.
[35] (2017) Re-ranking person re-identification with k-reciprocal encoding. In CVPR, pp. 1318–1327.