Semantic VAD: Low-Latency Voice Activity Detection for Speech Interaction

05/21/2023
by Mohan Shi, et al.

In speech interaction systems, voice activity detection (VAD) is often used as a front-end. However, traditional VAD algorithms usually must wait until continuous tail silence reaches a preset maximum duration before segmenting, which introduces a large latency that degrades the user experience. In this paper, we propose a novel semantic VAD for low-latency segmentation. Unlike existing methods, the semantic VAD adds a frame-level punctuation prediction task and includes an artificial endpoint among the classification categories, alongside the commonly used speech presence and absence. To enhance the semantic information of the model, we also incorporate an automatic speech recognition (ASR) related semantic loss. Evaluations on an internal dataset show that the proposed method reduces the average latency by 53.3% without significant deterioration of character error rate in the back-end ASR, compared to the traditional VAD approach.
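
The latency argument in the abstract can be illustrated with a toy sketch. This is not the authors' model or code: it assumes a classifier has already produced one label per frame, and the label names, the function names `tail_silence_cut` and `semantic_cut`, and the threshold `max_tail` are all hypothetical. It only contrasts a rule that waits for a full run of tail silence with a rule that may also segment as soon as an endpoint label appears.

```python
from typing import Optional, Sequence

# Hypothetical frame labels; the abstract names these three categories,
# but the string values here are illustrative only.
SPEECH, SILENCE, ENDPOINT = "speech", "silence", "endpoint"


def tail_silence_cut(labels: Sequence[str], max_tail: int) -> Optional[int]:
    """Frame index where a tail-silence rule segments, or None.

    Any non-speech frame counts as silence. The cut happens once
    `max_tail` consecutive non-speech frames follow some speech.
    """
    seen_speech = False
    run = 0
    for i, label in enumerate(labels):
        if label == SPEECH:
            seen_speech = True
            run = 0
        elif seen_speech:
            run += 1
            if run >= max_tail:
                return i
    return None


def semantic_cut(labels: Sequence[str], max_tail: int) -> Optional[int]:
    """Frame index where an endpoint-aware rule segments, or None.

    Segments immediately on an ENDPOINT frame that follows speech,
    and otherwise falls back to the tail-silence rule.
    """
    seen_speech = False
    run = 0
    for i, label in enumerate(labels):
        if label == SPEECH:
            seen_speech = True
            run = 0
        elif seen_speech:
            if label == ENDPOINT:
                return i
            run += 1
            if run >= max_tail:
                return i
    return None


# Five speech frames, then silence with an endpoint label on frame 7.
labels = [SPEECH] * 5 + [SILENCE] * 2 + [ENDPOINT] + [SILENCE] * 10
baseline = tail_silence_cut(labels, max_tail=8)
early = semantic_cut(labels, max_tail=8)
print(baseline, early)
```

In this made-up sequence the tail-silence rule has to wait for eight non-speech frames after the last speech frame, while the endpoint-aware rule can cut on the third. The numbers are purely illustrative and say nothing about the 53.3% figure reported in the abstract.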

research
06/02/2023

Streaming Speech-to-Confusion Network Speech Recognition

In interactive automatic speech recognition (ASR) systems, low-latency r...
research
10/25/2022

Dynamic Speech Endpoint Detection with Regression Targets

Interactive voice assistants have been widely used as input interfaces i...
research
05/29/2023

Building Accurate Low Latency ASR for Streaming Voice Search

Automatic Speech Recognition (ASR) plays a crucial role in voice-based a...
research
01/10/2023

Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional Context for Continuous Speech Recognition

While speech recognition Word Error Rate (WER) has reached human parity ...
research
03/02/2021

Long-Running Speech Recognizer: An End-to-End Multi-Task Learning Framework for Online ASR and VAD

When we use End-to-end automatic speech recognition (E2E-ASR) system for...
research
03/21/2023

End-to-End Integration of Speech Separation and Voice Activity Detection for Low-Latency Diarization of Telephone Conversations

Recent works show that speech separation guided diarization (SSGD) is an...
research
11/10/2020

A low latency ASR-free end to end spoken language understanding system

In recent years, developing a speech understanding system that classifie...
