Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement through Knowledge Distillation

05/24/2023
by Rui-Chen Zheng, et al.

Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech using additional visual information such as lip videos, and has been shown to be more effective than audio-only speech enhancement. This paper proposes further incorporating ultrasound tongue images to improve the performance of lip-based AV-SE systems. Because ultrasound tongue images are difficult to acquire at inference time, knowledge distillation is employed at the training stage, enabling an audio-lip speech enhancement student model to learn from a pre-trained audio-lip-tongue speech enhancement teacher model. Experimental results demonstrate significant improvements in the quality and intelligibility of the speech enhanced by the proposed method compared to traditional audio-lip speech enhancement baselines. Further analysis using phone error rates (PER) from automatic speech recognition (ASR) shows that palatal and velar consonants benefit most from the introduction of ultrasound tongue images.
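
The abstract does not state the exact distillation objective, so the following is only a minimal sketch of how a teacher-student setup of this kind is commonly trained. It assumes that the student (audio and lip inputs) is trained with a weighted sum of an enhancement loss against the clean speech and an imitation loss against intermediate features of the frozen teacher (audio, lip, and tongue inputs). The function names, the use of mean squared error for both terms, and the weight `alpha` are all hypothetical choices for illustration, not details taken from the paper.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    student_output: Sequence[float],    # speech enhanced by the audio-lip student
    clean_target: Sequence[float],      # clean reference speech
    student_features: Sequence[float],  # intermediate features of the student
    teacher_features: Sequence[float],  # features of the frozen audio-lip-tongue teacher
    alpha: float = 0.5,                 # assumed trade-off weight (hypothetical)
) -> float:
    """Weighted sum of an enhancement term and a teacher-imitation term."""
    enhancement = mse(student_output, clean_target)
    imitation = mse(student_features, teacher_features)
    return alpha * enhancement + (1.0 - alpha) * imitation
```

In this sketch the teacher appears only in the training loss, so at inference the student runs on audio and lip inputs alone, which matches the motivation given in the abstract: no ultrasound tongue images are required once training is finished.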

research
09/19/2023

Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement

Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech ...
research
04/02/2022

Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation

This paper investigates how to improve the runtime speed of personalized...
research
12/02/2022

Injecting Spatial Information for Monaural Speech Enhancement via Knowledge Distillation

Monaural speech enhancement (SE) provides a versatile and cost-effective...
research
09/21/2020

Correlating Subword Articulation with Lip Shapes for Embedding Aware Audio-Visual Speech Enhancement

In this paper, we propose a visual embedding approach to improving embed...
research
09/15/2023

Two-Step Knowledge Distillation for Tiny Speech Enhancement

Tiny, causal models are crucial for embedded audio machine learning appl...
research
04/16/2019

Joined Audio-Visual Speech Enhancement and Recognition in the Cocktail Party: The Tug Of War Between Enhancement and Recognition Losses

In this paper we propose an end-to-end LSTM-based model that performs si...
research
11/15/2018

On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement

Audio-visual speech enhancement (AV-SE) is the task of improving speech ...
