Serialized Output Training for End-to-End Overlapped Speech Recognition

03/28/2020
by Naoyuki Kanda, et al.

This paper proposes serialized output training (SOT), a novel framework for multi-speaker overlapped speech recognition based on an attention-based encoder-decoder approach. Instead of using multiple output layers, as in permutation invariant training (PIT), SOT uses a model with a single output layer that generates the transcriptions of multiple speakers one after another. The attention and decoder modules are responsible for producing multiple transcriptions from overlapped speech. SOT has two advantages over PIT: (1) no limit on the maximum number of speakers, and (2) the ability to model dependencies among the outputs for different speakers. We also propose a simple trick that reduces the complexity of processing each training sample from O(S!) to O(1), where S is the number of speakers in the training sample, by using the start times of the constituent source utterances. Experimental results on the LibriSpeech corpus show that SOT models transcribe overlapped speech with variable numbers of speakers significantly better than PIT-based models. We also show that SOT models can accurately count the number of speakers in the input audio.
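The start-time trick in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each source utterance carries a start time and a token list, and that the serialized reference joins the transcriptions with a separator token and ends with an end-of-sequence token. The names `Utterance`, `SC`, `EOS`, `serialize_by_start_time`, and `all_serializations` are hypothetical and chosen for illustration. Ordering by start time yields one reference per sample, whereas considering every speaker order yields S! candidates.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import List

SC = "<sc>"    # hypothetical speaker-change separator token (assumption)
EOS = "<eos>"  # hypothetical end-of-sequence token (assumption)


@dataclass
class Utterance:
    start_time: float   # start of the source utterance in the mixture, in seconds
    tokens: List[str]   # reference transcription of this utterance


def join_with_separator(utterances: List[Utterance]) -> List[str]:
    """Concatenate transcriptions in the given order, separated by SC, ending with EOS."""
    out: List[str] = []
    for i, utt in enumerate(utterances):
        if i > 0:
            out.append(SC)
        out.extend(utt.tokens)
    out.append(EOS)
    return out


def serialize_by_start_time(utterances: List[Utterance]) -> List[str]:
    """Build a single reference by ordering utterances by start time."""
    ordered = sorted(utterances, key=lambda u: u.start_time)
    return join_with_separator(ordered)


def all_serializations(utterances: List[Utterance]) -> List[List[str]]:
    """Enumerate one reference per speaker order: S! candidates for S utterances."""
    return [join_with_separator(list(p)) for p in permutations(utterances)]


mixture = [
    Utterance(1.2, ["how", "are", "you"]),
    Utterance(0.0, ["hello", "world"]),
]
reference = serialize_by_start_time(mixture)
# reference == ["hello", "world", "<sc>", "how", "are", "you", "<eos>"]
```

With the start-time ordering, a training loss is computed against one reference sequence per sample, so the number of candidate references no longer grows with S. The sort itself is a small bookkeeping step on S items.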

research
06/19/2020

Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers

We propose an end-to-end speaker-attributed automatic speech recognition...
research
04/22/2018

Multi-Head Decoder for End-to-End Speech Recognition

This paper presents a new network architecture called multi-head decoder...
research
10/08/2018

Recognizing Overlapped Speech in Meetings: A Multichannel Separation Approach Using Neural Networks

The goal of this work is to develop a meeting transcription system that ...
research
11/07/2018

CNN-based MultiChannel End-to-End Speech Recognition for everyday home environments

Casual conversations involving multiple speakers and noises from surroun...
research
08/30/2018

End-to-end Speech Recognition with Adaptive Computation Steps

In this paper, we present Adaptive Computation Steps (ACS) algorithm, wh...
research
05/19/2020

Investigations on Phoneme-Based End-To-End Speech Recognition

Common end-to-end models like CTC or encoder-decoder-attention models us...
research
06/20/2021

Encoder-Decoder Based Attractor Calculation for End-to-End Neural Diarization

This paper investigates an end-to-end neural diarization (EEND) method f...
