VADOI:Voice-Activity-Detection Overlapping Inference For End-to-end Long-form Speech Recognition

02/22/2022
by   Jinhan Wang, et al.
0

While end-to-end models have shown great success on the Automatic Speech Recognition task, performance degrades severely when target sentences are long-form. The previous proposed methods, (partial) overlapping inference are shown to be effective on long-form decoding. For both methods, word error rate (WER) decreases monotonically when overlapping percentage decreases. Setting aside computational cost, the setup with 50 achieve the best performance. However, a lower overlapping percentage has an advantage of fast inference speed. In this paper, we first conduct comprehensive experiments comparing overlapping inference and partial overlapping inference with various configurations. We then propose Voice-Activity-Detection Overlapping Inference to provide a trade-off between WER and computation cost. Results show that the proposed method can achieve a 20 Language Translation long-form corpus while maintaining the WER performance when comparing to the best performing overlapping inference algorithm. We also propose Soft-Match to compensate for similar words mis-aligned problem.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
10/30/2018

Towards End-to-end Automatic Code-Switching Speech Recognition

Speech recognition in mixed language has difficulties to adapt end-to-en...
research
02/03/2020

End-to-End Automatic Speech Recognition Integrated With CTC-Based Voice Activity Detection

This paper integrates a voice activity detection (VAD) function with end...
research
11/06/2019

A comparison of end-to-end models for long-form speech recognition

End-to-end automatic speech recognition (ASR) models, including both att...
research
03/02/2021

Long-Running Speech Recognizer:An End-to-End Multi-Task Learning Framework for Online ASR and VAD

When we use End-to-end automatic speech recognition (E2E-ASR) system for...
research
10/11/2021

A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation

Non-autoregressive (NAR) models simultaneously generate multiple outputs...
research
03/21/2023

End-to-End Integration of Speech Separation and Voice Activity Detection for Low-Latency Diarization of Telephone Conversations

Recent works show that speech separation guided diarization (SSGD) is an...
research
05/11/2021

Research on Mosaic Image Data Enhancement for Overlapping Ship Targets

The problem of overlapping occlusion in target recognition has been a di...

Please sign up or login with your details

Forgot password? Click here to reset