Pre-trained language models have achieved impressive results in various ...
Current talking face generation methods mainly focus on speech-lip synch...
Talking-head video editing aims to efficiently insert, delete, and subst...
Learning multiscale Transformer models has been evidenced as a viable ap...
In this paper, we propose DiffusionNER, which formulates the named entit...
Developing digital sound synthesizers is crucial to the music industry a...
Denoising diffusion probabilistic models (DDPMs) have shown promising pe...
We introduce CLaMP: Contrastive Language-Music Pre-training, which learn...
Autoencoders were widely used in many machine learning tasks thanks to t...
Solving complicated AI tasks with different domains and modalities is a ...
Audio-driven talking face has attracted broad interest from academia and...
Neural text-to-speech (TTS) generally consists of cascaded architecture ...
Though denoising diffusion probabilistic models (DDPMs) have achieved re...
Although deep learning has revolutionized music generation, existing met...
Using a text description as a prompt to guide the generation of text or im...
Dialogue summarization aims to condense the lengthy dialogue into a conc...
Humans usually compose music by organizing elements according to the mus...
While previous speech-driven talking face generation methods have made s...
Lyric-to-melody generation, which generates melody according to given ly...
Current text to speech (TTS) systems usually leverage a cascaded acousti...
Binaural audio plays a significant role in constructing immersive augmen...
Sentence scoring aims at measuring the likelihood score of a sentence an...
Text to speech (TTS) has made rapid progress in both academia and indust...
Adaptive text to speech (TTS) can synthesize new voices in zero-shot sce...
Non-autoregressive text to speech (NAR-TTS) models have attracted much a...
Denoising diffusion probabilistic models (diffusion models for short) re...
We propose a novel robust and efficient Speech-to-Animation (S2A) approa...
In the development of neural text-to-speech systems, model pre-training ...
Automatic lyrics transcription (ALT), which can be regarded as automatic...
Weight sharing has become the de facto approach to reduce the training c...
While recent text to speech (TTS) models perform very well in synthesizi...
Rap generation, which aims to produce lyrics and corresponding singing b...
Text to speech (TTS), or speech synthesis, which aims to synthesize inte...
Denoising diffusion probabilistic models have been recently proposed to ...
While pre-trained language models (e.g., BERT) have achieved impressive ...
Error correction techniques have been used to refine the output sentence...
Text to speech (TTS) is widely used to synthesize personal voice for a t...
End-to-end automatic speech recognition (ASR) can achieve promising perf...
Data in the real world tends to exhibit a long-tailed label distribution...
Custom voice, a specific text to speech (TTS) service in commercial spee...
Mean opinion score (MOS) is a popular subjective metric to assess the qu...
In this paper, we propose MixSpeech, a simple yet effective data augment...
We study the challenging task of neural network quantization without end...
Text to speech (TTS) has been broadly used to synthesize natural and int...
While neural-based text to speech (TTS) models can synthesize natural an...
Automatic song writing aims to compose a song (lyric and/or melody) by m...
Robust voice activity detection (VAD) is a challenging task in low signa...
Lip reading aims to recognize text from talking lips, while lip generatio...
High-fidelity singing voices usually require a higher sampling rate (e.g.,...
In pop music, accompaniments are usually played by multiple instruments ...