Speech Activity Detection Based on Multilingual Speech Recognition System

10/23/2020


To better model contextual information and improve the generalization ability of a voice detection system, this paper leverages a multilingual Automatic Speech Recognition (ASR) system to perform Speech Activity Detection (SAD). Sequence-discriminative training of the multilingual Acoustic Model (AM) with the Lattice-Free Maximum Mutual Information (LF-MMI) loss function effectively extracts the contextual information of each input acoustic frame. The index of the maximum output posterior serves as a frame-level speech/non-speech decision function. Majority voting and logistic regression are applied to fuse the language-dependent decisions. The multilingual ASR system is trained on 18 languages from the BABEL datasets, and the resulting SAD is evaluated on 3 different languages. On out-of-domain datasets, the proposed SAD model performs significantly better than the baseline models. On the Ester2 dataset, without using any in-domain data, this model outperforms the WebRTC, phoneme-recognizer-based VAD (Phn_Rec), and Pyannote baselines by 7.1, 1.7, and 2.7 absolute, respectively, in the DetER metric. Similarly, on the LiveATC dataset, this model outperforms the WebRTC, Phn_Rec, and Pyannote baselines by 6.4, 10.0, and 3.7 absolute, respectively, in the DetER metric.
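
The frame-level decision and the majority-vote fusion described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the layout of the posteriors (one list of per-class posteriors per frame), the set of output indices treated as non-speech, and the tie-breaking rule are all assumptions made for the example. The logistic-regression fusion is not shown.

```python
from typing import List, Sequence, Set


def frame_decisions(posteriors: Sequence[Sequence[float]],
                    nonspeech_ids: Set[int]) -> List[int]:
    """Return 1 (speech) or 0 (non-speech) for each frame.

    posteriors: one row of output posteriors per frame (assumed layout).
    nonspeech_ids: output indices assumed to correspond to non-speech units.
    """
    decisions = []
    for frame in posteriors:
        # Index of the maximum output posterior for this frame.
        best = max(range(len(frame)), key=lambda i: frame[i])
        decisions.append(0 if best in nonspeech_ids else 1)
    return decisions


def majority_vote(per_language: Sequence[Sequence[int]]) -> List[int]:
    """Fuse per-language 0/1 decisions frame by frame.

    A frame is labeled speech only if strictly more than half of the
    language-dependent decisions say speech (ties go to non-speech,
    which is an assumption of this sketch).
    """
    n_langs = len(per_language)
    return [1 if 2 * sum(votes) > n_langs else 0
            for votes in zip(*per_language)]


# Toy posteriors from three hypothetical language-dependent outputs,
# three frames each; index 0 is assumed to be the non-speech unit.
lang_a = frame_decisions([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]], {0})
lang_b = frame_decisions([[0.6, 0.4], [0.3, 0.7], [0.9, 0.1]], {0})
lang_c = frame_decisions([[0.1, 0.9], [0.2, 0.8], [0.4, 0.6]], {0})

fused = majority_vote([lang_a, lang_b, lang_c])
print(fused)
```

With an odd number of language-dependent outputs, as in this toy example, no ties occur; with an even number, the tie-breaking rule would need to be chosen deliberately.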
