The style transformer with common knowledge optimization for image-text retrieval

03/01/2023
by Wenrui Li, et al.

Image-text retrieval, which associates different modalities, has drawn broad attention due to its research value and wide range of real-world applications. However, most existing methods do not fully consider the high-level semantic relationships ("style embedding") and the common knowledge shared across modalities. To this end, we introduce a novel style transformer network with common knowledge optimization (CKSTN) for image-text retrieval. Its main module is the common knowledge adaptor (CKA), which comprises the style embedding extractor (SEE) and the common knowledge optimization (CKO) modules. Specifically, the SEE uses a sequential update strategy to effectively connect the features of its different stages. The CKO module is introduced to dynamically capture the latent concepts of common knowledge from different modalities. In addition, to obtain generalized temporal common knowledge, we propose a sequential update strategy that integrates the features of different layers in SEE with previous common feature units. CKSTN demonstrates its superiority over state-of-the-art methods in image-text retrieval on the MSCOCO and Flickr30K datasets. Moreover, CKSTN is built on a lightweight transformer, which makes it more convenient and practical for real-world applications thanks to its better performance and lower parameter count.
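To make the retrieval task itself concrete, here is a minimal sketch of ranking captions against an image by cosine similarity in a shared embedding space. It is a generic illustration of image-text retrieval, not the CKSTN architecture: the function names `cosine` and `rank_texts` and the toy vectors are assumptions made up for this example, and a real system would obtain the embeddings from trained image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_texts(image_vec, text_vecs):
    """Return caption indices sorted from most to least similar to the image."""
    scores = [cosine(image_vec, t) for t in text_vecs]
    return sorted(range(len(text_vecs)), key=lambda i: scores[i], reverse=True)

# Toy embeddings (hypothetical values, not taken from the paper).
image = [0.9, 0.1, 0.0]
captions = [
    [0.0, 1.0, 0.0],  # weakly related caption
    [1.0, 0.2, 0.0],  # closely matching caption
    [0.0, 0.0, 1.0],  # unrelated caption
]

ranking = rank_texts(image, captions)
print(ranking)  # caption indices, best match first
```

Text-to-image retrieval works the same way with the roles swapped: one caption embedding is compared against a list of image embeddings.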


research
11/16/2021

TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance

Scene text recognition (STR) is an important bridge between images and t...
research
05/21/2022

HLATR: Enhance Multi-stage Text Retrieval with Hybrid List Aware Transformer Reranking

Deep pre-trained language models (e.g. BERT) are effective at large-scal...
research
05/30/2019

Multitask Text-to-Visual Embedding with Titles and Clickthrough Data

Text-visual (or called semantic-visual) embedding is a central problem i...
research
11/09/2020

Learning the Best Pooling Strategy for Visual Semantic Embedding

Visual Semantic Embedding (VSE) is a dominant approach for vision-langua...
research
09/18/2023

CLIP-based Synergistic Knowledge Transfer for Text-based Person Retrieval

Text-based Person Retrieval aims to retrieve the target person images gi...
research
07/17/2020

Consensus-Aware Visual-Semantic Embedding for Image-Text Matching

Image-text matching plays a central role in bridging vision and language...
research
03/14/2023

Efficient Image-Text Retrieval via Keyword-Guided Pre-Screening

Under the flourishing development in performance, current image-text ret...
