Audiovisual Singing Voice Separation

07/01/2021
by Bochen Li, et al.

Separating a song into vocal and accompaniment components is an active research topic, and recent years have witnessed improved performance from supervised training with deep learning techniques. We propose to use visual information corresponding to the singers' vocal activities to further improve the quality of the separated vocal signals. The video frontend model takes mouth movement as input and fuses it into the feature embeddings of an audio-based separation framework. To help the network learn the audiovisual correlation of singing activities, we add extra vocal signals unrelated to the mouth movement to the audio mixture during training. We create two audiovisual singing performance datasets, one for training and one for evaluation: the first is curated from audition recordings on the Internet, and the second is recorded in house. The proposed method outperforms audio-based methods in terms of separation quality on most test recordings. This advantage is especially pronounced when there are backing vocals in the accompaniment, which pose a great challenge for audio-only methods.
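The training-time augmentation mentioned in the abstract, adding a vocal signal that does not match the visible mouth movement, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `make_training_mixture`, the `extra_gain` parameter, and the choice to truncate all signals to the shortest length are assumptions made for illustration only.

```python
import numpy as np


def make_training_mixture(target_vocal, accompaniment, extra_vocal, extra_gain=0.5):
    """Build one (mixture, target) training pair.

    Hypothetical helper: `extra_vocal` stands for singing by someone other
    than the on-screen singer, so it is unrelated to the mouth movement.
    `extra_gain` is an assumed scaling factor, not a value from the paper.
    """
    # Truncate all signals to a common length (an assumption of this sketch).
    n = min(len(target_vocal), len(accompaniment), len(extra_vocal))
    target = np.asarray(target_vocal[:n], dtype=np.float32)
    backing = np.asarray(accompaniment[:n], dtype=np.float32)
    extra = np.asarray(extra_vocal[:n], dtype=np.float32)

    # The mixture contains the extra vocal, but the training target does not,
    # so audio alone cannot tell which voice should be kept.
    mixture = target + backing + extra_gain * extra
    return mixture, target
```

Because the extra vocal appears only in the mixture and never in the target, a model trained on such pairs has a reason to rely on the visual stream to decide which voice to keep.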


research
11/29/2022

Neural Vocoder Feature Estimation for Dry Singing Voice Separation

Singing voice separation (SVS) is a task that separates singing voice au...
research
05/08/2021

Domestic activities clustering from audio recordings using convolutional capsule autoencoder network

Recent efforts have been made on domestic activities classification from...
research
03/02/2021

Audio-Visual Speech Separation Using Cross-Modal Correspondence Loss

We present an audio-visual speech separation learning method that consid...
research
06/06/2019

Singing voice separation: a study on training data

In the recent years, singing voice separation systems showed increased p...
research
05/09/2023

Learn to Sing by Listening: Building Controllable Virtual Singer by Unsupervised Learning from Voice Recordings

The virtual world is being established in which digital humans are creat...
research
10/25/2022

Enhanced Fuzzy Decomposition of Sound Into Sines, Transients, and Noise

The decomposition of sounds into sines, transients, and noise is a long-...
research
11/25/2021

Neuronal Learning Analysis using Cycle-Consistent Adversarial Networks

Understanding how activity in neural circuits reshapes following task le...
