
-
SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition
In the English speech-to-text (STT) machine learning task, acoustic mode...
-
INTERSPEECH 2021 ConferencingSpeech Challenge: Towards Far-field Multi-Channel Speech Enhancement for Video Conferencing
The ConferencingSpeech 2021 challenge is proposed to stimulate research ...
-
Dual-Path Modeling for Long Recording Speech Separation in Meetings
The continuous speech separation (CSS) is a task to separate the speech ...
-
End-to-End Dereverberation, Beamforming, and Speech Recognition with Improved Numerical Stability and Advanced Frontend
Recently, the end-to-end approach has been successfully applied to multi...
-
Gaussian Kernelized Self-Attention for Long Sequence Data and Its Application to CTC-based Speech Recognition
Self-attention (SA) based models have recently achieved significant perf...
-
Deep Learning based Multi-Source Localization with Source Splitting and its Effectiveness in Multi-Talker Speech Recognition
Multi-source localization is an important and challenging technique for ...
-
Intermediate Loss Regularization for CTC-based Speech Recognition
We present a simple and efficient auxiliary loss function for automatic ...
-
The Hitachi-JHU DIHARD III System: Competitive End-to-End Neural Diarization and X-Vector Clustering Systems Combined by DOVER-Lap
This paper provides a detailed description of the Hitachi-JHU system tha...
-
Leveraging End-to-End ASR for Endangered Language Documentation: An Empirical Study on Yoloxóchitl Mixtec
"Transcription bottlenecks", created by a shortage of effective human tr...
-
A Review of Speaker Diarization: Recent Advances with Deep Learning
Speaker diarization is a task to label audio or video recordings with cl...
-
Online End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers
This paper proposes an online end-to-end diarization that can handle ove...
-
Arabic Speech Recognition by End-to-End, Modular Systems and Human
Recent advances in automatic speech recognition (ASR) have achieved accu...
-
The 2020 ESPnet update: new features, broadened applications, performance improvements, and future plans
This paper describes the recent development of ESPnet (https://github.co...
-
End-to-End Speaker Diarization as Post-Processing
This paper investigates the utilization of an end-to-end diarization mod...
-
Continuous Speech Separation Using Speaker Inventory for Long Multi-talker Recording
Leveraging additional speaker information to facilitate speech separatio...
-
Improving RNN Transducer With Target Speaker Extraction and Neural Uncertainty Estimation
Target-speaker speech recognition aims to recognize target-speaker speec...
-
ESPnet-SE: end-to-end speech enhancement and separation toolkit designed for ASR integration
We present ESPnet-SE, which is designed for the quick development of spe...
-
Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis
Multi-speaker speech recognition of unsegmented recordings has diverse a...
-
DOVER-Lap: A Method for Combining Overlap-aware Diarization Outputs
Several advances have been made recently towards handling overlapping sp...
-
Directional ASR: A New Paradigm for E2E Multi-Speaker Speech Recognition with Source Localization
This paper proposes a new paradigm for handling far-field multi-speaker ...
-
Recent Developments on ESPnet Toolkit Boosted by Conformer
In this study, we present recent developments on ESPnet: End-to-End Spee...
-
Improved Mask-CTC for Non-Autoregressive End-to-End ASR
For real-world deployment of automatic speech recognition (ASR), the sys...
-
Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder
Fast inference speed is an important goal towards real-world deployment ...
-
Training Noisy Single-Channel Speech Separation With Noisy Oracle Sources: A Large Gap and A Small Step
As the performance of single-channel speech separation systems has impro...
-
Learning Speaker Embedding from Text-to-Speech
Zero-shot multi-speaker Text-to-Speech (TTS) generates target speaker vo...
-
The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS
This paper presents the sequence-to-sequence (seq2seq) baseline system f...
-
Augmentation adversarial training for unsupervised speaker recognition
The goal of this work is to train robust speaker recognition models with...
-
Streaming Transformer ASR with Blockwise Synchronous Inference
The Transformer self-attention network has recently shown promising perf...
-
Sequence to Multi-Sequence Learning via Conditional Chain Mapping for Mixture Signals
Neural sequence-to-sequence models are well established for applications...
-
Speaker-Conditional Chain Model for Speech Separation and Extraction
Speech separation has been extensively explored to tackle the cocktail p...
-
The JHU Multi-Microphone Multi-Speaker ASR System for the CHiME-6 Challenge
This paper summarizes the JHU team's efforts in tracks 1 and 2 of the CH...
-
Online End-to-End Neural Diarization with Speaker-Tracing Buffer
End-to-end speaker diarization using a fully supervised self-attention m...
-
Neural Speaker Diarization with Speaker-Wise Chain Rule
Speaker diarization is an essential step for processing multi-speaker au...
-
Insertion-Based Modeling for End-to-End Automatic Speech Recognition
End-to-end (E2E) models have gained attention in the research field of a...
-
End-to-End Far-Field Speech Recognition with Unified Dereverberation and Beamforming
Despite successful applications of end-to-end approaches in multi-channe...
-
End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors
End-to-end speaker diarization for an unknown number of speakers is addr...
-
Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict
We present Mask CTC, a novel non-autoregressive end-to-end automatic spe...
-
DiscreTalk: Text-to-Speech as a Machine Translation Problem
This paper proposes a new end-to-end text-to-speech (E2E-TTS) model base...
-
ESPnet-ST: All-in-One Speech Translation Toolkit
We present ESPnet-ST, which is designed for the quick development of spe...
-
CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings
Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges...
-
End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-label Classification
The most common approach to speaker diarization is clustering of speaker...
-
Speaker Diarization with Region Proposal Network
Speaker diarization is an important pre-processing step for many speech ...
-
End-to-End Multi-speaker Speech Recognition with Transformer
Recently, fully recurrent neural network (RNN) based end-to-end models h...
-
End-to-End Automatic Speech Recognition Integrated With CTC-Based Voice Activity Detection
This paper integrates a voice activity detection (VAD) function with end...
-
Non-Autoregressive Transformer Automatic Speech Recognition
Recently very deep transformers start showing outperformed performance t...
-
Towards Online End-to-end Transformer Automatic Speech Recognition
The Transformer self-attention network has recently shown promising perf...
-
ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit
This paper introduces a new end-to-end text-to-speech (E2E-TTS) toolkit ...
-
A practical two-stage training strategy for multi-stream end-to-end speech recognition
The multi-stream paradigm of audio processing, in which several sources ...
-
Transformer ASR with Contextual Block Processing
The Transformer self-attention network has recently shown promising perf...
-
MIMO-SPEECH: End-to-End Multi-Channel Multi-Speaker Speech Recognition
Recently, the end-to-end approach has proven its efficacy in monaural mu...