ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound

04/06/2022
by Yan-Bo Lin, et al.

We introduce an audiovisual method for long-range text-to-video retrieval. Unlike previous approaches designed for short video retrieval (e.g., 5-15 seconds in duration), our approach aims to retrieve minute-long videos that capture complex human actions. One challenge of standard video-only approaches is the large computational cost of processing hundreds of densely extracted frames from such long videos. To address this issue, we propose to replace parts of the video with compact audio cues that succinctly summarize dynamic audio events and are cheap to process. Our method, named ECLIPSE (Efficient CLIP with Sound Encoding), adapts the popular CLIP model to an audiovisual video setting by adding a unified audiovisual transformer block that captures complementary cues from the video and audio streams. In addition to being 2.92x faster and 2.34x more memory-efficient than long-range video-only approaches, our method also achieves better text-to-video retrieval accuracy on several diverse long-range video datasets such as ActivityNet, QVHighlights, YouCook2, DiDeMo and Charades.
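The general idea of fusing audio cues into a small set of video frame features and then ranking videos against a text query can be illustrated with a short sketch. This is not the ECLIPSE architecture: the abstract does not specify the block's internals, so the single-head, projection-free cross-attention, the residual fusion, the mean pooling, and all function names below (`cross_attention`, `fuse_audio_into_video`, `rank_videos`) are assumptions chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Single-head scaled dot-product attention with no learned projections.

    queries: (n_q, d) array, context: (n_c, d) array.
    Returns the attended features (n_q, d) and the weights (n_q, n_c).
    """
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d), axis=-1)
    return weights @ context, weights


def fuse_audio_into_video(video_tokens, audio_tokens):
    """Hypothetical fusion: each frame token attends to the audio tokens,
    the result is added back residually, and frames are mean-pooled."""
    attended, _ = cross_attention(video_tokens, audio_tokens)
    fused = video_tokens + attended
    return fused.mean(axis=0)


def rank_videos(text_emb, video_embs):
    """Rank candidate videos by cosine similarity to a text embedding."""
    t = text_emb / np.linalg.norm(text_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    scores = v @ t
    return np.argsort(-scores), scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Assumed toy sizes: 4 sparsely sampled frames, 6 audio segments, dim 16.
    video = rng.normal(size=(4, 16))
    audio = rng.normal(size=(6, 16))
    video_emb = fuse_audio_into_video(video, audio)
    candidates = np.stack([video_emb, rng.normal(size=16)])
    order, scores = rank_videos(rng.normal(size=16), candidates)
    print(order, scores)
```

In this toy setup the frame count stays small because the audio tokens, rather than additional frames, supply the extra temporal context, which mirrors the efficiency argument made in the abstract.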

research
04/10/2020

Stacked Convolutional Deep Encoding Network for Video-Text Retrieval

Existing dominant approaches for cross-modal video-text retrieval task a...
research
07/24/2023

Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment

Text-to-video retrieval systems have recently made significant progress ...
research
04/03/2020

TimeGate: Conditional Gating of Segments in Long-range Activities

When recognizing a long-range activity, exploring the entire video is ex...
research
08/04/2020

Temporal Context Aggregation for Video Retrieval with Contrastive Learning

The current research focus on Content-Based Video Retrieval requires hig...
research
12/04/2020

A high performance approach to detecting small targets in long range low quality infrared videos

Since targets are small in long range infrared (IR) videos, it is challe...
research
04/29/2015

Visual Information Retrieval in Endoscopic Video Archives

In endoscopic procedures, surgeons work with live video streams from the...
research
05/20/2009

Learning Nonlinear Dynamic Models

We present a novel approach for learning nonlinear dynamic models, which...