TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval

07/16/2022
by   Yuqi Liu, et al.

Text-video retrieval is a task of great practical value and has received increasing attention; learning spatial-temporal video representation is one of its research hotspots. The video encoders in state-of-the-art video retrieval models usually adopt pre-trained vision backbones directly, with the network structure fixed, so they cannot be further improved to produce fine-grained spatial-temporal video representations. In this paper, we propose the Token Shift and Selection Network (TS2-Net), a novel token shift and selection transformer architecture that dynamically adjusts the token sequence and selects informative tokens in both the temporal and spatial dimensions of input video samples. The token shift module temporally shifts whole token features back and forth across adjacent frames, preserving complete token representations while capturing subtle movements. The token selection module then selects the tokens that contribute most to local spatial semantics. In thorough experiments, the proposed TS2-Net achieves state-of-the-art performance on major text-video retrieval benchmarks, setting new records on MSRVTT, VATEX, LSMDC, ActivityNet, and DiDeMo.
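The token shift operation described above can be sketched in a few lines. The sketch below is a minimal, hedged illustration only: it assumes a (batch, frames, tokens, dim) token layout and an arbitrary choice of which token positions shift forward versus backward in time; the paper's actual module decides these details differently. The key property shown is that entire token vectors are moved across adjacent frames, rather than a fraction of channels as in channel-wise shift methods.

```python
import numpy as np

def token_shift(x: np.ndarray) -> np.ndarray:
    """Temporally shift whole token features across adjacent frames.

    x: array of shape (batch, frames, tokens, dim).
    Entire token vectors are moved between neighboring frames, so each
    shifted token keeps its complete representation (no channel splitting).
    Token positions 1 and 2 are shifted here purely for illustration.
    """
    out = x.copy()
    # Forward shift: frame t receives token 1 from frame t-1.
    out[:, 1:, 1, :] = x[:, :-1, 1, :]
    # Backward shift: frame t receives token 2 from frame t+1.
    out[:, :-1, 2, :] = x[:, 1:, 2, :]
    return out
```

After the shift, each frame's token sequence mixes information from its temporal neighbors at zero extra parameter cost, which is what lets subsequent spatial attention layers pick up subtle frame-to-frame movements.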


Related research

Efficient Video Transformers with Spatial-Temporal Token Selection (11/23/2021)
Video transformers have achieved impressive results on major video recog...

CenterCLIP: Token Clustering for Efficient Text-Video Retrieval (05/02/2022)
Recently, large-scale pre-training methods like CLIP have made great pro...

CLIP2Video: Mastering Video-Text Retrieval via Image CLIP (06/21/2021)
We present CLIP2Video network to transfer the image-language pre-trainin...

Video-based Human-Object Interaction Detection from Tubelet Tokens (06/04/2022)
We present a novel vision Transformer, named TUTOR, which is able to lea...

Decoupled Spatial-Temporal Transformer for Video Inpainting (04/14/2021)
Video inpainting aims to fill the given spatiotemporal holes with realis...

Feature Fusion Vision Transformer for Fine-Grained Visual Categorization (07/06/2021)
The core for tackling the fine-grained visual categorization (FGVC) is t...

Action Keypoint Network for Efficient Video Recognition (01/17/2022)
Reducing redundancy is crucial for improving the efficiency of video rec...
