OMG: Observe Multiple Granularities for Natural Language-Based Vehicle Retrieval

by Yunhao Du, et al.

Retrieving tracked vehicles by natural language descriptions plays a critical role in smart city construction. The task is to find the best match for a given text description from a set of tracked vehicles in surveillance videos. Existing works generally solve it with a dual-stream framework consisting of a text encoder, a visual encoder and a cross-modal loss function. Although some progress has been made, these methods fail to fully exploit the information available at different levels of granularity. To tackle this issue, we propose OMG, a novel framework for natural language-based vehicle retrieval that Observes Multiple Granularities with respect to visual representation, textual representation and objective functions. For the visual representation, target features, context features and motion features are encoded separately. For the textual representation, one global embedding, three local embeddings and a color-type prompt embedding are extracted to represent various granularities of semantic features. Finally, the overall framework is optimized by a cross-modal multi-granularity contrastive loss function. Experiments demonstrate the effectiveness of our method. OMG significantly outperforms all previous methods and ranks 9th on the 6th AI City Challenge Track 2. The codes are available at
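
To make the idea of a cross-modal contrastive objective summed over several granularities concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the exact loss, so a generic symmetric InfoNCE loss is assumed, and the function names, the temperature of 0.07 and the granularity labels ("global", "color_type") are hypothetical choices made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys.

    Assumption: a standard InfoNCE formulation on cosine similarities;
    the paper's own loss may differ.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / len(q)


def symmetric_contrastive(text_emb, visual_emb, temperature=0.07):
    """Average of the text-to-visual and visual-to-text InfoNCE terms."""
    return 0.5 * (
        info_nce(text_emb, visual_emb, temperature)
        + info_nce(visual_emb, text_emb, temperature)
    )


def multi_granularity_loss(pairs, temperature=0.07):
    """Sum the symmetric loss over every (text, visual) granularity pair.

    `pairs` maps a hypothetical granularity name to a tuple of
    (text_embeddings, visual_embeddings) for the same batch of samples.
    """
    return sum(
        symmetric_contrastive(t, v, temperature) for t, v in pairs.values()
    )


if __name__ == "__main__":
    # Toy batch of two samples with 2-D embeddings at two granularities.
    batch = {
        "global": ([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.2, 0.8]]),
        "color_type": ([[0.5, 0.5], [1.0, -1.0]], [[0.6, 0.4], [0.9, -0.7]]),
    }
    print(multi_granularity_loss(batch))
```

In this sketch, each granularity contributes one symmetric term, so adding a further text or visual embedding only requires another entry in `pairs`. Weighting the terms differently would be a natural variation, but the abstract gives no detail on that.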




All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers

Combining Natural Language with Vision represents a unique and interesti...

Symmetric Network with Spatial Relationship Modeling for Natural Language-based Vehicle Retrieval

Natural language (NL) based vehicle retrieval aims to search specific ve...

Hierarchical Similarity Learning for Language-based Product Image Retrieval

This paper aims for the language-based product image retrieval task. The...

Intra-Modal Constraint Loss For Image-Text Retrieval

Cross-modal retrieval has drawn much attention in both computer vision a...

Unleashing the Imagination of Text: A Novel Framework for Text-to-image Person Retrieval via Exploring the Power of Words

The goal of Text-to-image person retrieval is to retrieve person images ...

SBNet: Segmentation-based Network for Natural Language-based Vehicle Search

Natural language-based vehicle retrieval is a task to find a target vehi...
