Generating Visually Aligned Sound from Videos

07/14/2020
by Peihao Chen, et al.

We focus on the task of generating sound from natural videos, where the sound should be both temporally and content-wise aligned with the visual signals. This task is extremely challenging because some sounds are produced outside the camera's view and cannot be inferred from the video content; the model may therefore be forced to learn an incorrect mapping between visual content and these irrelevant sounds. To address this challenge, we propose a framework named REGNET. In this framework, we first extract appearance and motion features from video frames to better distinguish the sound-emitting object from complex background information. We then introduce an innovative audio forwarding regularizer that takes the real sound as input and outputs bottlenecked sound features. Using both visual and bottlenecked sound features during training provides stronger supervision for sound prediction. The audio forwarding regularizer controls the irrelevant sound component and thus prevents the model from learning an incorrect mapping between video frames and sound emitted by objects outside the screen. During testing, the audio forwarding regularizer is removed so that REGNET produces purely aligned sound from visual features alone. Extensive evaluations based on Amazon Mechanical Turk demonstrate that our method significantly improves both temporal and content-wise alignment. Remarkably, our generated sound can fool humans with a 68.12% success rate. Code and pre-trained models are available at https://github.com/PeihaoChen/regnet
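The training/testing asymmetry above is the core of REGNET, so a minimal PyTorch sketch may help make it concrete. Everything here, the module names, the linear bottleneck, the GRU decoder, the feature dimensions, and the zero-filled audio code at test time, is an illustrative assumption rather than the authors' implementation (the official code is in the linked repository):

    import torch
    import torch.nn as nn

    class AudioForwardingRegularizer(nn.Module):
        """Bottlenecks the ground-truth sound so that only a narrow slice
        of audio information reaches the decoder during training."""
        def __init__(self, audio_dim=80, bottleneck_dim=16):
            super().__init__()
            self.bottleneck = nn.Linear(audio_dim, bottleneck_dim)

        def forward(self, real_audio):              # (B, T, audio_dim)
            return self.bottleneck(real_audio)      # (B, T, bottleneck_dim)

    class RegnetSketch(nn.Module):
        def __init__(self, visual_dim=512, audio_dim=80, bottleneck_dim=16):
            super().__init__()
            self.bottleneck_dim = bottleneck_dim
            self.regularizer = AudioForwardingRegularizer(audio_dim, bottleneck_dim)
            # Decoder predicts spectrogram frames from the visual features
            # plus the (optional) bottlenecked audio code.
            self.decoder = nn.GRU(visual_dim + bottleneck_dim, audio_dim,
                                  batch_first=True)

        def forward(self, visual_feats, real_audio=None):
            B, T, _ = visual_feats.shape
            if real_audio is not None:
                # Training: bottlenecked real sound gives extra supervision.
                code = self.regularizer(real_audio)
            else:
                # Testing: regularizer removed; a zero code stands in here
                # (an assumption), so prediction relies on visuals alone.
                code = visual_feats.new_zeros(B, T, self.bottleneck_dim)
            pred, _ = self.decoder(torch.cat([visual_feats, code], dim=-1))
            return pred                              # (B, T, audio_dim)

    model = RegnetSketch()
    v = torch.randn(2, 100, 512)   # per-frame appearance + motion features
    a = torch.randn(2, 100, 80)    # ground-truth mel-spectrogram frames
    train_pred = model(v, a)       # training: regularizer active
    test_pred = model(v)           # testing: regularizer removed

The width of the bottleneck is the control knob: a narrow code cannot carry the full ground-truth sound, so the decoder can only use it for the irrelevant, visually unpredictable component, while the visually predictable component must come from the video features.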

Related research:

04/17/2023 · Conditional Generation of Audio from Video via Foley Analogies
The sound effects that designers add to videos are designed to convey a ...

07/20/2021 · FoleyGAN: Visually Guided Generative Adversarial Network-Based Synchronous Sound Generation in Silent Videos
Deep learning based visual to sound generation systems essentially need ...

04/10/2018 · Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
The thud of a bouncing ball, the onset of speech as lips open -- when vi...

11/19/2022 · VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement
Video to sound generation aims to generate realistic and natural sound g...

06/19/2023 · Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding
The framework of visually-guided sound source separation generally consi...

08/19/2018 · Dynamic Temporal Alignment of Speech to Lips
Many speech segments in movies are re-recorded in a studio during postpr...

09/18/2021 · V-SlowFast Network for Efficient Visual Sound Separation
The objective of this paper is to perform visual sound separation: i) we...
