Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment

07/24/2023
by Sarah Ibrahimi, et al.

Text-to-video retrieval systems have recently made significant progress by utilizing pre-trained models trained on large-scale image-text pairs. However, most recent methods focus on the visual modality and disregard the audio signal. ECLIPSE is a recent exception: it improves long-range text-to-video retrieval by building an audiovisual video representation. The objective of text-to-video retrieval, however, is to capture the complementary audio and video information that is pertinent to the text query, not simply to achieve better audio and video alignment. To address this issue, we introduce TEFAL, a TExt-conditioned Feature ALignment method that produces both audio and video representations conditioned on the text query. Rather than relying only on an audiovisual attention block, which could suppress the audio information relevant to the text query, our approach employs two independent cross-modal attention blocks that enable the text to attend to the audio and video representations separately. We demonstrate the method's efficacy on four benchmark datasets that include audio (MSR-VTT, LSMDC, VATEX, and Charades), where it consistently achieves better than state-of-the-art performance. We attribute this to the additional text-query-conditioned audio representation and the complementary information it adds to the text-query-conditioned video representation.
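The idea of two independent text-conditioned attention blocks can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes single-head attention without learned projections, pre-extracted features in a shared embedding space, and fusion by simple summation, none of which the abstract specifies. The names `text_conditioned_pool` and `score` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def text_conditioned_pool(text, feats):
    """Single-head cross-attention with the text as query.

    text:  (d,)   text query embedding
    feats: (n, d) per-frame video or per-segment audio features
    Returns a (d,) representation weighted toward the features
    most similar to the text query.
    """
    d = text.shape[-1]
    scores = feats @ text / np.sqrt(d)  # (n,) scaled dot-product scores
    weights = softmax(scores)           # (n,) attention weights, sum to 1
    return weights @ feats              # (d,) weighted sum of the values


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def score(text, video_feats, audio_feats):
    """Text-to-video score from two separate attention blocks."""
    v = text_conditioned_pool(text, video_feats)  # text attends to video
    a = text_conditioned_pool(text, audio_feats)  # text attends to audio
    fused = v + a  # assumption: fusion by summation, chosen for illustration
    return cosine(text, fused)


# Toy example with random features (illustrative data only).
rng = np.random.default_rng(0)
d = 16
text = rng.normal(size=d)
video_feats = rng.normal(size=(12, d))  # 12 frames
audio_feats = rng.normal(size=(20, d))  # 20 audio segments
s = score(text, video_feats, audio_feats)
```

Because each modality is pooled with its own attention weights, a strong visual match does not down-weight audio segments that are relevant to the query, which is the motivation the abstract gives for keeping the two blocks independent.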

