De-STT: De-entaglement of unwanted Nuisances and Biases in Speech to Text System using Adversarial Forgetting

11/25/2020
by Hemant Yadav, et al.

Training a robust Speech to Text (STT) system requires tens of thousands of hours of data. Variability in the dataset, in the form of unwanted nuisances (environmental noise, etc.) and biases (accent, gender, age, etc.), is one reason such large datasets are needed to learn general representations, and collecting data at that scale is often not feasible for low-resource languages. In many computer vision tasks, a recently proposed adversarial forgetting approach for removing unwanted features has produced good results. This motivates us to study the effect of de-entangling accent information from the input speech signal while training STT systems. To this end, we use an information bottleneck architecture based on adversarial forgetting. This training scheme aims to force the model to learn general, accent-invariant speech representations. Two STT models trained on just 20 hours of audio, one with and one without adversarial forgetting, are tested on two unseen accents not present in the training set. The results favour the adversarial forgetting scheme, with an average absolute improvement of 6% over the standard training scheme. Furthermore, we observe an absolute improvement of 5.5% when testing on the seen accents present in the training set.
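The forgetting idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no architectural details, so every function name, the soft sigmoid mask, the `adv_weight` parameter, and the use of plain cross-entropy (a real STT model would more likely use a sequence loss) are assumptions made for illustration. The sketch shows only two pieces: a mask that scales each dimension of an encoding before it reaches the transcription head, and a simplified objective that rewards the main model when an accent discriminator performs poorly on the masked encoding.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def apply_forget_mask(z, gate_logits):
    """Scale each dimension of encoding z by a soft mask in (0, 1).

    Hypothetical helper: dimensions whose mask is near 0 are "forgotten".
    """
    mask = [sigmoid(g) for g in gate_logits]
    z_masked = [zi * mi for zi, mi in zip(z, mask)]
    return z_masked, mask

def cross_entropy(probs, target_index):
    # Stand-in loss; assumed here only to keep the example small.
    return -math.log(probs[target_index])

def main_model_loss(task_probs, task_target, accent_probs, accent_target, adv_weight):
    """Simplified objective for the encoder, mask, and transcription head.

    Low task loss is rewarded, and so is a HIGH accent-discriminator loss,
    which pushes accent information out of the masked encoding.
    """
    task_loss = cross_entropy(task_probs, task_target)
    accent_loss = cross_entropy(accent_probs, accent_target)
    return task_loss - adv_weight * accent_loss

def discriminator_loss(accent_probs, accent_target):
    """The discriminator itself is trained to predict the accent well."""
    return cross_entropy(accent_probs, accent_target)

# Toy values: the second dimension is almost fully masked out.
z_masked, mask = apply_forget_mask([2.0, -1.0, 0.5], [10.0, -10.0, 0.0])

# The main model prefers the case where the discriminator is confused.
loss_confused = main_model_loss([0.9, 0.1], 0, [0.5, 0.5], 0, adv_weight=1.0)
loss_confident = main_model_loss([0.9, 0.1], 0, [0.99, 0.01], 0, adv_weight=1.0)
```

In a full training loop the two objectives would be optimised in alternation, with the discriminator minimising `discriminator_loss` and the rest of the model minimising `main_model_loss`; the subtraction used here is a common simplification of that min-max game.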

