STEPs-RL: Speech-Text Entanglement for Phonetically Sound Representation Learning

11/23/2020
by Prakamya Mishra, et al.

In this paper, we present STEPs-RL, a novel multi-modal deep neural network architecture that uses speech and text entanglement for learning phonetically sound spoken-word representations. STEPs-RL is trained in a supervised manner to predict the phonetic sequence of a target spoken word from the speech and text of its contextual spoken words, so that the model encodes meaningful latent representations. Unlike existing work, we use text alongside speech for auditory representation learning, capturing semantic and syntactic information in addition to acoustic and temporal information. The latent representations produced by our model were not only able to predict the target phonetic sequences with an accuracy of 89.47%, but also achieved results competitive with textual word representation models, Word2Vec and FastText (trained on textual transcripts), when evaluated on four widely used word similarity benchmark datasets. In addition, investigation of the generated vector space demonstrated the capability of the proposed model to capture the phonetic structure of spoken words. To the best of our knowledge, no existing work uses speech and text entanglement for learning spoken-word representations, which makes this work the first of its kind.
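The training setup described above — fusing a spoken word's acoustic features with contextual text into a single latent vector, from which the target phonetic sequence is predicted — can be sketched roughly as follows. All dimensions, function names, and the simple mean-pool and linear layers here are illustrative assumptions for a forward pass, not the paper's actual architecture (which the abstract does not specify):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): acoustic feature size,
# contextual text embedding size, joint latent size, phoneme inventory,
# and a fixed number of output phoneme steps.
D_SPEECH, D_TEXT, D_LATENT, N_PHONES, T_OUT = 40, 50, 64, 44, 6

def encode_speech(frames):
    """Mean-pool acoustic frames into a fixed-size vector (stand-in encoder)."""
    return frames.mean(axis=0)

def entangle(speech_vec, text_vec, W):
    """Fuse the two modalities into one latent spoken-word representation."""
    joint = np.concatenate([speech_vec, text_vec])
    return np.tanh(W @ joint)

def predict_phonemes(latent, W_out):
    """Emit per-step phoneme logits for the target word, then greedy-decode."""
    logits = (W_out @ latent).reshape(T_OUT, N_PHONES)
    return logits.argmax(axis=1)  # one phoneme id per output step

# Randomly initialized weights stand in for trained parameters.
W_ent = rng.standard_normal((D_LATENT, D_SPEECH + D_TEXT)) * 0.1
W_out = rng.standard_normal((T_OUT * N_PHONES, D_LATENT)) * 0.1

speech = rng.standard_normal((120, D_SPEECH))  # 120 acoustic frames
text = rng.standard_normal(D_TEXT)             # contextual text embedding

latent = entangle(encode_speech(speech), text, W_ent)
phones = predict_phonemes(latent, W_out)
print(latent.shape, phones.shape)  # (64,) (6,)
```

In the paper, supervision would come from the ground-truth phonetic sequence of the target word; `latent` is the spoken-word representation evaluated on the word similarity benchmarks.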


research
07/06/2020

Contextualized Spoken Word Representations from Convolutional Autoencoders

A lot of work has been done recently to build sound language models for ...
research
02/03/2021

Confusion2vec 2.0: Enriching Ambiguous Spoken Language Representations with Subwords

Word vector representations enable machines to encode human language for...
research
06/01/2020

Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and Videos

In this work, we propose an effective approach for training unique embed...
research
08/05/2021

Applying the Information Bottleneck Principle to Prosodic Representation Learning

This paper describes a novel design of a neural network-based speech gen...
research
05/19/2023

Speech-Text Dialog Pre-training for Spoken Dialog Understanding with Explicit Cross-Modal Alignment

Recently, speech-text pre-training methods have shown remarkable success...
research
11/05/2017

Learning Word Embeddings from Speech

In this paper, we propose a novel deep neural network architecture, Sequ...
research
09/04/2023

Minimal Effective Theory for Phonotactic Memory: Capturing Local Correlations due to Errors in Speech

Spoken language evolves constrained by the economy of speech, which depe...
