Co-speech gesture generation is crucial for automatic digital avatar ani...
Argumentative explainable AI has been advocated by several in recent yea...
Zero-shot text-to-speech aims at synthesizing voices with unseen speech ...
Cross-lingual timbre and style generalizable text-to-speech (TTS) aims t...
Conversational recommender systems (CRSs) have become crucial emerging r...
Scaling text-to-speech to a large and wild dataset has been proven to be...
We are interested in a novel task, namely low-resource text-to-talking a...
Diffusion models have demonstrated impressive performance in text-to-ima...
Large diffusion models have been successful in text-to-audio (T2A) synth...
Direct speech-to-speech translation (S2ST) has gradually become popular ...
Direct speech-to-speech translation (S2ST) aims to convert speech from o...
Improving text representation has attracted much attention to achieve ex...
Generating talking person portraits with arbitrary speech audio is a cru...
As a key component of automated speech recognition (ASR) and the front-e...
Large-scale multimodal generative modeling has created milestones in tex...
Speech-to-speech translation directly translates a speech utterance to a...
Random forests are decision tree ensembles that can be used to solve a v...
In this paper, we consider the problem of verifying pre-opacity for disc...
Unsupervised video domain adaptation is a practical yet challenging task...
Chinese dialect text-to-speech (TTS) systems usually can only be utilized ...
There is broad agreement in the literature that explanation methods shou...
In this paper, we investigate the optimal robot path planning problem fo...
Correct-by-construction synthesis is a cornerstone of the confluence of ...
Recent progress in material data mining has been driven by high-capacity...
Clothes style transfer for person video generation is a challenging task...
Recently, phonetic posteriorgrams (PPGs) based methods have been quite p...
In expressive speech synthesis, there are high requirements for emotion ...
Singing voice conversion (SVC) aims to convert the voice of one singer t...
The general aim of multi-focus image fusion is to gather focused regions...
This paper proposes the building of Xiaomingbot, an intelligent, multili...
Accent conversion (AC) transforms a non-native speaker's accent into a n...
This paper presents ByteSing, a Chinese singing voice synthesis (SVS) sy...
In this paper, we propose a hybrid text normalization system using multi...
In Mandarin text-to-speech (TTS) systems, the front-end text processing m...
In this paper, we propose several opacity-preserving (bi)simulation rela...
Detectability of discrete event systems (DESs) is a property to determin...