Dual-path CNN with Max Gated block for Text-Based Person Re-identification

09/20/2020
by Tinghuai Ma, et al.

Text-based person re-identification (Re-id) is an important task in video surveillance: given a textual description, the goal is to retrieve the corresponding person's image from a large gallery. Directly matching visual content with textual descriptions is difficult due to modality heterogeneity. On the one hand, textual embeddings are not discriminative enough, owing to the high abstraction of textual descriptions. On the other hand, global average pooling (GAP) is commonly used to extract general, smoothed features, but it ignores salient local features, which matter more for the cross-modal matching problem. With this in mind, a novel Dual-path CNN with Max Gated block (DCMG) is proposed to extract discriminative word embeddings and to focus the visual-textual association on the remarkable features of both modalities. The framework is based on two deep residual CNNs jointly optimized with the cross-modal projection matching (CMPM) loss and the cross-modal projection classification (CMPC) loss to embed the two modalities into a joint feature space. First, the pre-trained language model BERT is combined with a convolutional neural network (CNN) to learn better word embeddings for the text-to-image matching domain. Second, a global max pooling (GMP) layer makes the visual-textual features focus on the salient parts. To further alleviate the noise in the max-pooled features, a gated block (GB) is proposed that produces an attention map concentrating on meaningful features of both modalities. Finally, extensive experiments on the benchmark dataset CUHK-PEDES show that the approach achieves a rank-1 score of 55.81 and outperforms the state-of-the-art method by 1.3
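The pooling-and-gating step and the CMPM objective described above can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation: the single-linear-layer gate (`W_g`, `b_g`) is an assumption, since the abstract does not specify the gated block's internals, and the CMPM term follows that loss's standard published formulation (a KL divergence between projected-similarity softmax and the normalized identity-match distribution).

```python
import numpy as np

def gated_max_pool(feature_map, W_g, b_g):
    """Global max pooling (GMP) followed by a gated block (GB).

    feature_map: (C, H, W) conv features. W_g, b_g parametrize a
    hypothetical single linear gating layer; the paper's exact GB
    architecture may differ.
    """
    C = feature_map.shape[0]
    # GMP keeps the most salient activation per channel, unlike GAP,
    # which averages (and thereby smooths) over all spatial positions.
    pooled = feature_map.reshape(C, -1).max(axis=1)            # (C,)
    # A sigmoid attention map re-weights channels, damping the noise
    # that max pooling can amplify.
    gate = 1.0 / (1.0 + np.exp(-(W_g @ pooled + b_g)))         # (C,)
    return gate * pooled

def cmpm_loss(img_feats, txt_feats, labels, eps=1e-8):
    """Cross-modal projection matching (CMPM) loss, standard form:
    KL divergence between the softmax over image-to-text scalar
    projections and the normalized identity-match distribution.
    img_feats, txt_feats: (B, D) paired features; labels: (B,) ids.
    """
    t_norm = txt_feats / (np.linalg.norm(txt_feats, axis=1, keepdims=True) + eps)
    proj = img_feats @ t_norm.T                                # (B, B) projections
    p = np.exp(proj - proj.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)                          # predicted match prob.
    match = (labels[:, None] == labels[None, :]).astype(float)
    q = match / match.sum(axis=1, keepdims=True)               # true match prob.
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))
```

In this sketch, pairs whose image and text features point the same way receive a lower CMPM loss than shuffled pairs, which is the behavior the joint optimization relies on.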


Related research:

- AXM-Net: Cross-Modal Context Sharing Attention Network for Person Re-ID (01/19/2021)
- Adversarial Representation Learning for Text-to-Image Matching (08/28/2019)
- Improving Description-based Person Re-identification by Multi-granularity Image-text Alignments (06/23/2019)
- Identity-Aware Textual-Visual Matching with Latent Co-attention (08/07/2017)
- Image-to-Video Person Re-Identification by Reusing Cross-modal Embeddings (10/04/2018)
- Improving Face Recognition from Caption Supervision with Multi-Granular Contextual Feature Aggregation (08/13/2023)
- Cross-modal Local Shortest Path and Global Enhancement for Visible-Thermal Person Re-Identification (06/09/2022)
