Disambiguating Speech Intention via Audio-Text Co-attention Framework: A Case of Prosody-semantics Interface

by Won Ik Cho, et al.

Understanding the intention of an utterance is challenging in some prosody-sensitive cases, especially when the utterance is in written form. The main concern is to detect the directivity or rhetoricalness of an utterance and to distinguish the type of question. Because the task inevitably involves both prosody and semantics, identification is expected to benefit from observations of the human language processing mechanism. In this paper, we tackle the task with attentive recurrent neural networks that exploit acoustic and textual features, using a manually created speech corpus that contains only syntactically ambiguous utterances requiring prosody for disambiguation. We found that co-attention frameworks on audio-text data, namely multi-hop attention and cross-attention, can perform better than previously suggested speech-based/text-aided networks. From this, we infer that understanding the genuine intention of ambiguous utterances involves recognizing the interaction between auditory and linguistic processes.
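To make the idea of audio-text cross-attention more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that each modality has already been encoded into a sequence of hidden-state vectors (for example by a recurrent network), and it uses a mean-pooled summary of one modality as the query over the other, which is a simplifying assumption made here for illustration. The names `attend`, `cross_attention`, `audio_states`, and `text_states` are hypothetical, and the random vectors stand in for real encoder outputs.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vector(states):
    """Average a sequence of equal-length vectors into one summary vector."""
    n = len(states)
    return [sum(col) / n for col in zip(*states)]


def attend(query, states):
    """Dot-product attention: weight each state by its similarity to the query."""
    scores = [sum(q * h for q, h in zip(query, row)) for row in states]
    weights = softmax(scores)
    dim = len(query)
    context = [sum(w * row[j] for w, row in zip(weights, states)) for j in range(dim)]
    return context, weights


def cross_attention(audio_states, text_states):
    """Let each modality's summary attend over the other modality's states.

    Assumption: mean pooling provides the query vectors; a trained model
    would typically use learned projections or final hidden states instead.
    """
    audio_summary = mean_vector(audio_states)
    text_summary = mean_vector(text_states)
    audio_ctx, audio_w = attend(text_summary, audio_states)  # text queries audio
    text_ctx, text_w = attend(audio_summary, text_states)    # audio queries text
    fused = audio_ctx + text_ctx  # list concatenation: one joint feature vector
    return fused, audio_w, text_w


# Placeholder inputs: 50 audio frames and 12 text tokens, both 8-dimensional.
random.seed(0)
audio_states = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(50)]
text_states = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(12)]

fused, audio_w, text_w = cross_attention(audio_states, text_states)
print(len(fused), len(audio_w), len(text_w))
```

In a full model, the fused vector would feed a classifier over intention labels. A multi-hop variant would repeat the attention step, using the context obtained from one modality as the query for the next hop over the other.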


Code Repositories


A short tutorial for the co-utilization of audio and text data (multi-modal analysis)
