Vision-Infused Deep Audio Inpainting

10/24/2019
by   Hang Zhou, et al.

Multi-modality perception is essential to developing interactive intelligence. In this work, we consider a new task of visual information-infused audio inpainting, i.e., synthesizing missing audio segments that correspond to their accompanying videos. We identify two key aspects of a successful inpainter: (1) It is desirable to operate on spectrograms instead of raw audio, so that recent advances in deep semantic image inpainting can be leveraged to go beyond the limitations of traditional audio inpainting. (2) To synthesize visually indicated audio, a visual-audio joint feature space needs to be learned with synchronization of audio and video. To facilitate a large-scale study, we collect a new multi-modality instrument-playing dataset called MUSIC-Extra-Solo (MUSICES) by enriching the MUSIC dataset. Extensive experiments demonstrate that our framework is capable of inpainting realistic and varying audio segments with or without visual contexts. More importantly, our synthesized audio segments are coherent with their video counterparts, showing the effectiveness of our proposed Vision-Infused Audio Inpainter (VIAI). Code, models, dataset and video results are available at https://hangz-nju-cuhk.github.io/projects/AudioInpainting
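
To make aspect (1) concrete, the sketch below shows how an audio clip can be turned into a magnitude spectrogram and how a missing time segment then becomes a rectangular hole, much like a masked region in image inpainting. This is only an illustration under stated assumptions, not the authors' pipeline: the function names `stft_magnitude` and `mask_time_segment`, the FFT size, the hop length, and the synthetic 440 Hz test tone are all hypothetical choices made for this example.

```python
import numpy as np

def stft_magnitude(signal, n_fft=512, hop=128):
    """Return a magnitude spectrogram of shape (n_fft // 2 + 1, n_frames).

    n_fft and hop are illustrative defaults, not values from the paper.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # Real FFT per frame, then transpose to (frequency, time).
    return np.abs(np.fft.rfft(frames, axis=1)).T

def mask_time_segment(spec, start_frame, end_frame):
    """Zero out frames [start_frame, end_frame) and return (masked_spec, mask)."""
    mask = np.ones_like(spec)
    mask[:, start_frame:end_frame] = 0.0
    return spec * mask, mask

# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440.0 * t)

spec = stft_magnitude(audio)
masked, mask = mask_time_segment(spec, 40, 60)
```

An inpainting model would receive `masked` together with `mask` and be asked to fill the zeroed columns; in the vision-infused setting, features from the synchronized video frames would be supplied as additional conditioning.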


