Collecting audio-text pairs is expensive; however, it is much easier to ...
Although frame-based models, such as CTC and transducers, have an affini...
There has been an increased interest in the integration of pretrained sp...
Recently there have been efforts to introduce new benchmark tasks for sp...
This paper describes our system for the low-resource domain adaptation t...
Most human interactions occur in the form of spoken conversations where ...
Disfluency detection has mainly been solved in a pipeline approach, as p...
End-to-end automatic speech recognition suffers from adaptation to unkno...
Speech samples recorded in both indoor and outdoor environments are ofte...
A streaming style inference of encoder-decoder automatic speech recognit...
Although end-to-end text-to-speech (TTS) models can generate natural spe...
Sound event localization and detection (SELD) involves identifying the d...
Recording and annotating real sound events for a sound event localizatio...
This report describes our systems submitted to the DCASE2021 challenge t...
Although end-to-end automatic speech recognition (E2E ASR) has achieved ...
Self-attention (SA) based models have recently achieved significant perf...
The Transformer self-attention network has recently shown promising perf...
The Transformer self-attention network has recently shown promising perf...
The Transformer self-attention network has recently shown promising perf...
An on-device DNN-HMM speech recognition system efficiently works with a ...