Audio-Linguistic Embeddings for Spoken Sentences

02/20/2019
by   Albert Haque, et al.
0

We propose spoken sentence embeddings which capture both acoustic and linguistic content. While existing works operate at the character, phoneme, or word level, our method learns long-term dependencies by modeling speech at the sentence level. Formulated as an audio-linguistic multitask learning problem, our encoder-decoder model simultaneously reconstructs acoustic and natural language features from audio. Our results show that spoken sentence embeddings outperform phoneme and word-level baselines on speech recognition and emotion recognition tasks. Ablation studies show that our embeddings can better model high-level acoustic concepts while retaining linguistic content. Overall, our work illustrates the viability of generic, multi-modal sentence embeddings for spoken language understanding.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
02/18/2019

Learned In Speech Recognition: Contextual Acoustic Word Embeddings

End-to-end acoustic-to-word speech recognition models have recently gain...
research
02/15/2021

Leveraging Acoustic and Linguistic Embeddings from Pretrained speech and language Models for Intent Classification

Intent classification is a task in spoken language understanding. An int...
research
04/08/2022

Transducer-based language embedding for spoken language identification

The acoustic and linguistic features are important cues for the spoken l...
research
09/10/2020

Multi-modal embeddings using multi-task learning for emotion recognition

General embeddings like word2vec, GloVe and ELMo have shown a lot of suc...
research
12/13/2021

Detecting Emotion Carriers by Combining Acoustic and Lexical Representations

Personal narratives (PN) - spoken or written - are recollections of fact...
research
05/09/2017

A Systematic Review of Hindi Prosody

Prosody describes both form and function of a sentence using the suprase...
research
10/23/2022

Bootstrapping meaning through listening: Unsupervised learning of spoken sentence embeddings

Inducing semantic representations directly from speech signals is a high...

Please sign up or login with your details

Forgot password? Click here to reset