
Dual-path CNN with Max Gated block for Text-Based Person Re-identification

by   Tinghuai Ma, et al.

Text-based person re-identification (Re-id) is an important task in video surveillance: given a textual description, the corresponding person's image must be retrieved from a large gallery. Directly matching visual content with textual descriptions is difficult because of the modality heterogeneity. On the one hand, textual embeddings are not discriminative enough, owing to the high abstraction of textual descriptions. On the other hand, global average pooling (GAP) is commonly used to extract general, smoothed features, but it ignores salient local features, which matter more for the cross-modal matching problem. With that in mind, a novel Dual-path CNN with Max Gated block (DCMG) is proposed to extract discriminative word embeddings and make the visual-textual association focus on the remarkable features of both modalities. The framework is based on two deep residual CNNs jointly optimized with a cross-modal projection matching (CMPM) loss and a cross-modal projection classification (CMPC) loss to embed the two modalities into a joint feature space. First, the pre-trained language model BERT is combined with a convolutional neural network (CNN) to learn better word embeddings for the text-to-image matching domain. Second, a global max pooling (GMP) layer is applied so that the visual-textual features focus on the salient parts. To further alleviate the noise in the max-pooled features, a gated block (GB) is proposed that produces an attention map focusing on meaningful features of both modalities. Finally, extensive experiments are conducted on the benchmark dataset CUHK-PEDES, on which our approach achieves a rank-1 score of 55.81 and outperforms the state-of-the-art method by 1.3
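The pooling and gating ideas in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the real DCMG operates on deep residual CNN feature maps with learned gate parameters, whereas here the gate weights `W` and the feature map are random placeholders. It only shows how global max pooling keeps the strongest response per channel (versus GAP's smoothing) and how a sigmoid gate can down-weight noisy max-pooled channels.

```python
import numpy as np

def global_avg_pool(fmap):
    # GAP: average over all spatial locations -> smoothed per-channel feature
    return fmap.mean(axis=(1, 2))

def global_max_pool(fmap):
    # GMP: keep only the strongest response per channel (salient local parts)
    return fmap.max(axis=(1, 2))

def gated_block(feat, W):
    # Hypothetical gated block: sigmoid(W @ feat) yields a per-channel
    # attention value in (0, 1) that suppresses noisy max-pooled responses.
    gate = 1.0 / (1.0 + np.exp(-(W @ feat)))
    return gate * feat

rng = np.random.default_rng(0)
fmap = rng.standard_normal((4, 8, 8))   # toy (channels, height, width) feature map
gap = global_avg_pool(fmap)
gmp = global_max_pool(fmap)
W = 0.1 * rng.standard_normal((4, 4))   # placeholder gate weights
attended = gated_block(gmp, W)
```

By construction each max-pooled value is at least the corresponding average, and the gated output never exceeds the ungated feature in magnitude, mirroring the paper's claim that the gate attenuates rather than amplifies noisy responses.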




AXM-Net: Cross-Modal Context Sharing Attention Network for Person Re-ID

Cross-modal person re-identification (Re-ID) is critical for modern vide...

Adversarial Representation Learning for Text-to-Image Matching

For many computer vision applications such as image captioning, visual q...

Improving Description-based Person Re-identification by Multi-granularity Image-text Alignments

Description-based person re-identification (Re-id) is an important task ...

Identity-Aware Textual-Visual Matching with Latent Co-attention

Textual-visual matching aims at measuring similarities between sentence ...

Image-to-Video Person Re-Identification by Reusing Cross-modal Embeddings

Image-to-video person re-identification identifies a target person by a ...

Cross-modal Local Shortest Path and Global Enhancement for Visible-Thermal Person Re-Identification

In addition to considering the recognition difficulty caused by human po...

Dual-Path Convolutional Image-Text Embedding with Instance Loss

Matching images and sentences demands a fine understanding of both modal...