Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speake...
Mapping two modalities, speech and text, into a shared representation sp...
Recently, excellent progress has been made in speech recognition. Howeve...
The single-speaker singing voice synthesis (SVS) usually underperforms a...
This paper presents an end-to-end high-quality singing voice synthesis (...
The spontaneous behavior that often occurs in conversations makes speech...
For text-to-speech (TTS) synthesis, prosodic structure prediction (PSP) ...
Making moral judgments is an essential step toward developing ethical AI...
Expressive speech synthesis is crucial for many human-computer interacti...
Accurate recognition of cocktail party speech containing overlapping spe...
Automatic recognition of disordered and elderly speech remains highly ch...
Visual information can serve as an effective cue for target speaker extr...
Multi-talker overlapped speech poses a significant challenge for speech ...
Automatic speaker verification (ASV) plays a critical role in security-s...
Nowadays, recognition-synthesis-based methods have been quite popular wi...
Building end-to-end task bots and maintaining their integration with new...
Subband-based approaches process subbands in parallel through the model ...
Automatic dubbing, which generates a corresponding version of the input ...
Music-driven 3D dance generation has become an intensive research topic ...
Due to the mismatch between the source and target domains, how to better...
Recent advances in text-to-speech have significantly improved the expres...
Despite recent concerns about undesirable behaviors generated by large l...
As a common way of emotion signaling via non-linguistic vocalizations, v...
With the global population aging rapidly, Alzheimer's disease (AD) is pa...
Many patients with chronic diseases resort to multiple medications to re...
Automatic recognition of disordered and elderly speech remains a highly ...
Although automatic speech recognition (ASR) can perform well in common n...
Homophone characters are common in tonal syllable-based languages, such ...
Expressive text-to-speech (TTS) aims to synthesize different speaking st...
The flipped classroom is a new pedagogical strategy that has been gainin...
FullSubNet has shown its promising performance on speech enhancement by ...
Early diagnosis of Alzheimer's disease (AD) is crucial in facilitating p...
We propose an unsupervised learning method to disentangle speech into co...
This paper investigates an unsupervised approach towards deriving a univ...
Audio-visual active speaker detection (AVASD) is well-developed, and now...
State-of-the-art neural network language models (NNLMs) represented by l...
One-shot voice conversion (VC) with only a single target speaker's speec...
Early diagnosis of Alzheimer's disease (AD) is crucial in facilitating p...
A key challenge for automatic speech recognition (ASR) systems is to mod...
Early diagnosis of Alzheimer's disease (AD) is crucial in facilitating p...
State-of-the-art automatic speech recognition (ASR) systems are bec...
Fundamental modelling differences between hybrid and end-to-end (E2E) au...
Articulatory features are inherently invariant to acoustic signal distor...
Cross-lingual word embeddings can be applied to several natural language...
Previous works on expressive speech synthesis focus on modelling the mon...
Despite the rapid advance of automatic speech recognition (ASR) technolo...
Deep neural networks have brought significant advancements to speech emo...
The accuracy of prosodic structure prediction is crucial to the naturaln...
Although deep learning and end-to-end models have been widely used and s...
Recently, many novel techniques have been introduced to deal with spoofi...