Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision

11/19/2020
by Yujie Zhong, et al.

In this paper, we teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations. First, we define a self-supervised learning framework that captures cross-modal information. We then introduce a novel adversarial learning module that explicitly handles the noise in natural videos, where subtitle sentences are not guaranteed to correspond closely to the video snippets. For training and evaluation, we contribute a new dataset, 'ApartmenTour', that contains a large number of online videos and subtitles. We carry out experiments on bidirectional retrieval between sentences and videos, and the results demonstrate that our proposed model achieves state-of-the-art performance on both retrieval tasks and exceeds several strong baselines. The dataset will be released soon.
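
To make the bidirectional retrieval setting concrete, the following minimal sketch ranks videos for each sentence and sentences for each video by cosine similarity in a shared embedding space, then reports Recall@K. It is an illustration under stated assumptions, not the paper's model: the embeddings are made-up toy vectors, the function names `cosine` and `recall_at_k` are hypothetical, and Recall@K is assumed here as a common retrieval metric rather than taken from the abstract. The third sentence-video pair is deliberately mismatched to mimic a subtitle that does not describe what is on screen.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recall_at_k(queries, gallery, k):
    """Fraction of queries whose paired item (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(queries):
        scores = [cosine(query, item) for item in gallery]
        ranked = sorted(range(len(gallery)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Toy embeddings (assumed values): index i pairs sentence i with video i.
sentence_emb = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
# The last video is a noisy pair: its content does not match its subtitle.
video_emb = [[0.9, 0.2, 0.1], [0.1, 0.9, 0.1], [0.8, 0.3, 0.0]]

for k in (1, 3):
    s2v = recall_at_k(sentence_emb, video_emb, k)  # sentence-to-video
    v2s = recall_at_k(video_emb, sentence_emb, k)  # video-to-sentence
    print(f"R@{k}: sentence->video={s2v:.2f}, video->sentence={v2s:.2f}")
```

In this toy setup the mismatched pair is the only one missed at K=1 in both directions, which is the kind of weakly corresponding sample the abstract says the adversarial module is meant to handle.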

research
10/20/2021

Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos

We introduce the task of spatially localizing narrated interactions in v...
research
12/07/2021

STC-mix: Space, Time, Channel mixing for Self-supervised Video Representation

Contrastive representation learning of videos highly relies on the avail...
research
01/18/2022

Leveraging Real Talking Faces via Self-Supervision for Robust Forgery Detection

One of the most pressing challenges for the detection of face-manipulate...
research
10/11/2022

Cross-modal Search Method of Technology Video based on Adversarial Learning and Feature Fusion

Technology videos contain rich multi-modal information. In cross-modal i...
research
02/18/2023

SSVMR: Saliency-based Self-training for Video-Music Retrieval

With the rise of short videos, the demand for selecting appropriate back...
research
07/16/2022

SVGraph: Learning Semantic Graphs from Instructional Videos

In this work, we focus on generating graphical representations of noisy,...
research
03/22/2022

Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos

Human actions often induce changes of object states such as "cutting an ...
