AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment

05/08/2023
by Ruiqi Li, et al.

The speech-to-singing (STS) voice conversion task aims to generate singing samples that correspond to speech recordings. Its major challenge is that the alignment between the target (singing) pitch contour and the source (speech) content is difficult to learn in a text-free setting. This paper proposes AlignSTS, an STS model based on explicit cross-modal alignment, which treats speech variance information such as pitch and content as different modalities. Inspired by how humans sing lyrics to a melody, AlignSTS: 1) adopts a novel rhythm adaptor to predict the target rhythm representation and bridge the modality gap between content and pitch, where the rhythm representation is computed in a simple yet effective way and is quantized into a discrete space; and 2) uses the predicted rhythm representation to re-align the content via cross-attention and performs a cross-modal fusion for re-synthesis. Extensive experiments show that AlignSTS achieves superior performance in terms of both objective and subjective metrics. Audio samples are available at https://alignsts.github.io.
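
To make the re-alignment step concrete, here is a minimal sketch of generic scaled dot-product cross-attention in plain Python. It is not the AlignSTS implementation: the abstract only states that content is re-aligned "based on cross-attention", so the choice of rhythm embeddings as queries and speech content features as keys and values, the function names, and the toy vectors are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention.

    queries: list of target-length vectors (assumed here: rhythm embeddings)
    keys, values: lists of source-length vectors (assumed here: content features)
    Returns (outputs, weights); outputs has one vector per query.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this target frame to every source frame.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        w = softmax(scores)
        # Weighted sum of source values gives the re-aligned frame.
        out = [
            sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical toy data: 3 speech content frames, 5 target (singing) frames.
content = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rhythm = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]

aligned, attn = cross_attention(rhythm, content, content)
```

In this sketch the output has the length of the rhythm sequence rather than the speech sequence, which is the sense in which cross-attention can stretch source content onto a target timeline.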

research
05/24/2023

CMOT: Cross-modal Mixup via Optimal Transport for Speech Translation

End-to-end speech translation (ST) is the task of translating speech sig...
research
03/05/2022

Audio-visual speech separation based on joint feature representation with cross-modal attention

Multi-modal based speech separation has exhibited a specific advantage o...
research
05/05/2022

Cross-modal Contrastive Learning for Speech Translation

How can we learn unified representations for spoken utterances and their...
research
05/05/2022

Unsupervised Mismatch Localization in Cross-Modal Sequential Data

Content mismatch usually occurs when data from one modality is translate...
research
12/04/2020

Cross-Modal Generalization: Learning in Low Resource Modalities via Meta-Alignment

The natural world is abundant with concepts expressed via visual, acoust...
research
02/27/2023

Cross-modal Face- and Voice-style Transfer

Image-to-image translation and voice conversion enable the generation of...
research
11/11/2021

Learning Signal-Agnostic Manifolds of Neural Fields

Deep neural networks have been used widely to learn the latent structure...
