Unsupervised Mismatch Localization in Cross-Modal Sequential Data

05/05/2022
by Wei Wei, et al.

Content mismatch usually occurs when data from one modality is translated to another, e.g., language learners producing mispronunciations (errors in speech) when reading a sentence (the target text) aloud. However, most existing alignment algorithms assume that the content in the two modalities is perfectly matched, which makes it difficult to locate such mismatches between speech and text. In this work, we develop an unsupervised learning algorithm that can infer the relationship between content-mismatched cross-modal sequential data, especially for speech-text sequences. More specifically, we propose a hierarchical Bayesian deep learning model, named mismatch localization variational autoencoder (ML-VAE), that decomposes the generative process of the speech into hierarchically structured latent variables, indicating the relationship between the two modalities. Training such a model is challenging because the discrete latent variables involved have complex dependencies. We propose a novel and effective training procedure that alternates between estimating hard assignments of the discrete latent variables over a specifically designed lattice and updating the parameters of the neural networks. Our experimental results show that ML-VAE successfully locates the mismatch between text and speech without requiring human annotations for model training.
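
The abstract outlines two components: a VAE whose generative path for speech is governed by hierarchically structured discrete latent variables tying speech frames to text units (or to no unit at all), and a training procedure that alternates between hard-assigning those discrete latents by dynamic programming over an alignment lattice and taking gradient steps on the network parameters. The paper's actual architecture and lattice are not specified here, so the PyTorch sketch below is only an illustration of that alternating scheme: the module names, the per-frame/text-unit affinity scores, the explicit MISMATCH state, the monotonic stay/advance/skip transition rule, and the reconstruction loss are all assumptions of this sketch, not the authors' design.

```python
# Hedged sketch of the alternating "hard assignment over a lattice + gradient
# update" idea described in the abstract. Module names, the lattice transition
# rule, and the explicit MISMATCH state are illustrative assumptions only.
import torch
import torch.nn as nn

MISMATCH = -1  # hypothetical label for speech frames that match no text unit


class MismatchAlignerSketch(nn.Module):
    def __init__(self, n_mels=80, vocab=50, hid=128):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hid, batch_first=True)  # speech frames -> embeddings
        self.text_emb = nn.Embedding(vocab, hid)               # text units (e.g. phonemes)
        self.decoder = nn.Linear(hid, n_mels)                  # reconstruct frames from latents
        self.mismatch_emb = nn.Parameter(torch.zeros(hid))     # latent for "no matching text"

    def frame_scores(self, speech, text_ids):
        """Log-affinity between every speech frame and every text unit."""
        h, _ = self.encoder(speech)                # (1, T, hid)
        h = h.squeeze(0)                           # (T, hid)
        e = self.text_emb(text_ids)                # (N, hid)
        return h @ e.t(), h                        # scores (T, N), frame embeddings

    @torch.no_grad()
    def hard_align(self, scores, skip_penalty=2.0):
        """Monotonic dynamic program over a (frame x text-unit) lattice.

        At each frame we may stay on the current text unit, advance to the
        next one, or take a penalized MISMATCH transition; a stand-in for the
        specially designed lattice mentioned in the abstract."""
        T, N = scores.shape
        NEG = float("-inf")
        dp = torch.full((T, N), NEG)
        back = torch.zeros(T, N, dtype=torch.long)
        dp[0, 0] = scores[0, 0]
        for t in range(1, T):
            for j in range(N):
                stay = dp[t - 1, j] + scores[t, j]
                move = dp[t - 1, j - 1] + scores[t, j] if j > 0 else NEG
                skip = dp[t - 1, j] - skip_penalty  # frame treated as mismatched
                best_val, best_op = max((stay, 0), (move, 1), (skip, 2),
                                        key=lambda x: x[0])
                dp[t, j], back[t, j] = best_val, best_op
        # Backtrace from the final text unit to recover the hard assignment.
        align, j = [0] * T, N - 1
        for t in range(T - 1, -1, -1):
            align[t] = MISMATCH if back[t, j] == 2 else j
            if back[t, j] == 1:
                j -= 1
        return align

    def reconstruction_loss(self, speech, text_ids, align):
        """Loss for the gradient step, given the frozen hard assignment."""
        _, h = self.frame_scores(speech, text_ids)
        latents = torch.stack([
            self.mismatch_emb if a == MISMATCH else self.text_emb(text_ids[a])
            for a in align
        ])
        recon = self.decoder(latents + h)          # crude conditioning, illustration only
        return nn.functional.mse_loss(recon, speech.squeeze(0))


# Alternating training loop: hard-assign the discrete latents, then one
# gradient step on the network parameters.
if __name__ == "__main__":
    model = MismatchAlignerSketch()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    speech = torch.randn(1, 120, 80)               # 120 frames of 80-dim features
    text_ids = torch.arange(10)                    # 10 text units
    for _ in range(3):
        scores, _ = model.frame_scores(speech, text_ids)
        align = model.hard_align(scores)           # discrete step (no gradients)
        loss = model.reconstruction_loss(speech, text_ids, align)
        opt.zero_grad(); loss.backward(); opt.step()  # continuous step
```

The design point mirrored from the abstract is the alternation itself: the discrete assignment step runs without gradients (a hard step over the lattice), while the continuous network parameters are updated by ordinary backpropagation in the alternate step.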
