VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement

11/19/2022
by Chenye Cui, et al.

Video-to-sound generation aims to generate realistic and natural sound for a given video input. However, previous video-to-sound methods produce audio with a random or average timbre and offer no control over or specialization of the generated timbre, so users cannot always obtain the sound they want. In this paper, we pose the task of generating sound with a specific timbre given a video input and a reference audio sample. To solve this task, we disentangle each target audio into three components: temporal information, acoustic information, and background information. We use three encoders to encode these components respectively: 1) a temporal encoder that encodes temporal information and is fed with video frames, since the input video shares the same temporal information as the original audio; 2) an acoustic encoder that encodes timbre information, taking the original audio as input and discarding its temporal information through a temporal-corrupting operation; and 3) a background encoder that encodes the residual or background sound, using the background part of the original audio as input. To improve the quality and temporal alignment of the generated results, we also adopt a mel discriminator and a temporal discriminator for adversarial training. Experimental results on the VAS dataset demonstrate that our method generates high-quality audio samples with good synchronization with the events in the video and high timbre similarity to the reference audio.
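To make the three-branch disentanglement described above more concrete, here is a minimal PyTorch sketch of a generator with a temporal, an acoustic, and a background encoder plus a fusion decoder. All module names, layer sizes, the frame-shuffling used as a stand-in for the temporal-corrupting operation, and the mean-pooling fusion are illustrative assumptions, not the authors' exact architecture; the mel and temporal discriminators and the vocoder are omitted.

```python
# Illustrative sketch only: shapes, layers, and the frame-shuffling
# "temporal corruption" are assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class TemporalEncoder(nn.Module):
    """Encodes per-frame video features into a time-aligned condition track."""
    def __init__(self, video_dim=512, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(video_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, hidden)

    def forward(self, video_feats):          # (B, T_video, video_dim)
        h, _ = self.rnn(video_feats)
        return self.proj(h)                  # (B, T_video, hidden)


class AcousticEncoder(nn.Module):
    """Encodes timbre from the reference mel after destroying its temporal order."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
        )

    def forward(self, mel):                  # (B, T_audio, n_mels)
        # Temporal corruption (assumed here to be a random frame permutation):
        # only time-invariant, timbre-like statistics survive the pooling below.
        idx = torch.randperm(mel.size(1), device=mel.device)
        h = self.net(mel[:, idx, :])
        return h.mean(dim=1)                 # (B, hidden) global timbre embedding


class BackgroundEncoder(nn.Module):
    """Encodes the residual/background part of the reference audio."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_mels, hidden), nn.ReLU())

    def forward(self, bg_mel):               # (B, T_audio, n_mels)
        return self.net(bg_mel).mean(dim=1)  # (B, hidden)


class Decoder(nn.Module):
    """Fuses the three codes and predicts a mel-spectrogram."""
    def __init__(self, hidden=256, n_mels=80):
        super().__init__()
        self.fuse = nn.GRU(3 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, temporal, timbre, background):
        T = temporal.size(1)
        cond = torch.cat(
            [temporal,
             timbre.unsqueeze(1).expand(-1, T, -1),
             background.unsqueeze(1).expand(-1, T, -1)], dim=-1)
        h, _ = self.fuse(cond)
        return self.out(h)                   # (B, T_video, n_mels)


if __name__ == "__main__":
    B, T_v, T_a = 2, 100, 400
    video = torch.randn(B, T_v, 512)
    ref_mel, bg_mel = torch.randn(B, T_a, 80), torch.randn(B, T_a, 80)
    mel_hat = Decoder()(TemporalEncoder()(video),
                        AcousticEncoder()(ref_mel),
                        BackgroundEncoder()(bg_mel))
    print(mel_hat.shape)                     # torch.Size([2, 100, 80])
```

In the full system described in the abstract, this generator would additionally be trained adversarially against a mel discriminator (for spectrogram quality) and a temporal discriminator (for alignment with the video events).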


Related research

- 05/22/2023 · DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment
  Text-to-audio (TTA) generation is a recent popular problem that aims to ...
- 09/08/2023 · The Power of Sound (TPoS): Audio Reactive Video Generation with Stable Diffusion
  In recent years, video generation has become a prominent generative tool...
- 06/02/2023 · Enhance Temporal Relations in Audio Captioning with Sound Event Detection
  Automated audio captioning aims at generating natural language descripti...
- 05/29/2023 · Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation
  Large diffusion models have been successful in text-to-audio (T2A) synth...
- 07/14/2020 · Generating Visually Aligned Sound from Videos
  We focus on the task of generating sound from natural videos, and the so...
- 05/21/2020 · Pitchtron: Towards audiobook generation from ordinary people's voices
  In this paper, we explore prosody transfer for audiobook generation unde...
- 04/05/2022 · RaDur: A Reference-aware and Duration-robust Network for Target Sound Detection
  Target sound detection (TSD) aims to detect the target sound from a mixt...
