VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer

03/08/2022
by Juan F. Montesinos, et al.

This paper presents an audio-visual approach for voice separation that outperforms state-of-the-art methods at low latency in two scenarios: speech and singing voice. The model is a two-stage network. In the first stage, motion cues are extracted by a lightweight graph convolutional network that processes face landmarks; the audio and motion features are then fed to an audio-visual transformer, which produces a fairly good estimate of the isolated target source. In the second stage, the predominant voice is enhanced with an audio-only network. We present several ablation studies and a comparison with state-of-the-art methods. Finally, we explore how well models trained for speech separation transfer to the task of singing voice separation. The demos, code, and weights will be made publicly available at https://ipcv.github.io/VoViT/
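
The abstract does not say which graph layer the motion network uses. As a rough illustration only, the sketch below assumes a standard symmetric-normalized graph convolution, ReLU(Â X W), applied to landmark coordinates. The four-landmark chain graph, the weight matrix, and all function names are hypothetical and are not taken from the paper or its released code.

```python
import math

def normalized_adjacency(num_nodes, edges):
    """Return D^-1/2 (A + I) D^-1/2 as a nested list (assumed normalization)."""
    a = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        a[i][i] = 1.0  # self-loop so each landmark keeps its own features
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0  # undirected edge between two landmarks
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
            for i in range(num_nodes)]

def matmul(x, y):
    """Plain nested-list matrix product."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]

def graph_conv(a_hat, feats, weight):
    """One graph-convolution layer: ReLU(a_hat @ feats @ weight)."""
    out = matmul(matmul(a_hat, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]

# Hypothetical toy "face": 4 landmarks in a chain, one (x, y) pair per landmark.
EDGES = [(0, 1), (1, 2), (2, 3)]
landmarks = [[0.0, 0.0], [1.0, 0.0], [2.0, 1.0], [3.0, 1.0]]
# Hypothetical weights mapping 2 input channels to 3 output channels.
weight = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]

a_hat = normalized_adjacency(len(landmarks), EDGES)
motion_features = graph_conv(a_hat, landmarks, weight)  # shape: 4 landmarks x 3 channels
```

In a pipeline like the one described, per-frame features of this kind would be stacked over time and passed, together with audio features, to the audio-visual transformer. That wiring is not shown here because the abstract gives no details about it.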

