Video-Guided Curriculum Learning for Spoken Video Grounding

09/01/2022
by   Yan Xia, et al.

In this paper, we introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments directly from spoken language descriptions. Compared with text, audio requires the model to exploit the phonemes and syllables in raw speech that relate to the video. Moreover, we randomly add environmental noise to the speech, further increasing the difficulty of the task and better simulating real applications. To identify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) scheme for the audio pre-training process, which exploits vital visual perception to help the model understand the spoken language and suppress external noise. Since the model cannot access ground-truth video segments during inference, we design a curriculum strategy that gradually shifts the input video from the ground-truth segment to the entire video content during pre-training. The model thus learns to extract critical visual information from the entire video clip to help understand the spoken language. In addition, we collect the first large-scale spoken video grounding dataset based on ActivityNet, named the ActivityNet Speech dataset. Extensive experiments demonstrate that our video-guided curriculum learning facilitates pre-training of a mutual audio encoder, significantly improving performance on spoken video grounding. Moreover, we show that under noisy audio, our model outperforms a method that grounds video with ASR transcripts, further demonstrating the effectiveness of our curriculum strategy.
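The curriculum described above can be illustrated with a minimal sketch: as pre-training progresses, the video window visible to the model widens linearly from the annotated ground-truth segment to the full clip. All names and the linear schedule here are hypothetical assumptions, not the paper's actual implementation.

```python
def curriculum_window(gt_start, gt_end, video_len, step, total_steps):
    """Sketch of the curriculum schedule: at step 0 the model sees only the
    ground-truth segment [gt_start, gt_end]; by the final step the window
    has expanded to the entire video [0, video_len]. Linear interpolation
    is an assumption; the paper may use a different schedule."""
    progress = min(step / total_steps, 1.0)  # 0.0 -> 1.0 over training
    start = gt_start * (1.0 - progress)              # drifts toward 0
    end = gt_end + (video_len - gt_end) * progress   # drifts toward video_len
    return start, end
```

For example, with a ground-truth segment of [10 s, 20 s] in a 100 s video, the window is (10, 20) at the first step, (5, 60) halfway through, and (0, 100) at the end of pre-training.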


