Hidden bawls, whispers, and yelps: can text be made to sound more than just its words?

02/22/2022
by Caluã de Lacerda Pataca, et al.

Whether a word was bawled, whispered, or yelped, captions will typically represent it in the same way. If captions are your only way to access what is being said, the subjective nuances expressed in the voice will be lost. Since so much of communication is carried by these nuances, we posit that if captions are to serve as an accurate representation of speech, embedding visual representations of paralinguistic qualities into captions could help readers understand speech beyond its mere textual content. This paper presents a model for processing vocal prosody (loudness, pitch, and duration) and mapping it onto visual dimensions of typography (respectively, font-weight, baseline shift, and letter-spacing), creating a visual representation of these otherwise lost vocal subtleties that can be embedded directly into the typographical form of text. In an evaluation, participants were shown this speech-modulated typography and asked to match it to its originating audio from among similar alternatives. Participants (n=117) correctly identified the original audio clips with an average accuracy of 65%, with no significant difference between modulations shown as animated and as static text. Additionally, participants' comments showed that their mental models of speech-modulated typography varied widely.
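
To make the mapping concrete, here is a minimal sketch of how per-word prosody measurements could be turned into typographic styling in the way the abstract describes. This is our illustration, not the authors' implementation: the WordProsody structure, the word_to_css helper, the input ranges (dB, semitones, duration ratio), and the choice of inline CSS as the rendering target are all assumptions.

```python
# A minimal sketch (not the paper's implementation): map per-word prosody to
# typography as the abstract describes -- loudness -> font-weight,
# pitch -> baseline shift (CSS vertical-align), duration -> letter-spacing.
# All value ranges below are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class WordProsody:
    text: str
    loudness_db: float      # e.g. RMS level: -40 (whisper) .. 0 (shout)
    pitch_st: float         # semitones above/below the speaker's median pitch
    duration_ratio: float   # 1.0 = typical duration for this word


def _scale(value, in_lo, in_hi, out_lo, out_hi):
    """Clamp value to [in_lo, in_hi], then map it linearly to [out_lo, out_hi]."""
    t = (min(max(value, in_lo), in_hi) - in_lo) / (in_hi - in_lo)
    return out_lo + t * (out_hi - out_lo)


def word_to_css(w: WordProsody) -> str:
    """Render one word as an HTML span whose style encodes its prosody."""
    weight = round(_scale(w.loudness_db, -40, 0, 200, 900))    # font-weight
    shift = _scale(w.pitch_st, -6, 6, -0.3, 0.3)               # em, baseline shift
    spacing = _scale(w.duration_ratio, 0.5, 2.0, -0.05, 0.3)   # em, letter-spacing
    style = (f"font-weight:{weight};"
             f"vertical-align:{shift:.2f}em;"
             f"letter-spacing:{spacing:.2f}em")
    return f'<span style="{style}">{w.text}</span>'


# Example: a yelped word rendered heavier, raised above the baseline,
# and slightly stretched.
print(word_to_css(WordProsody("hey", loudness_db=-5, pitch_st=4, duration_ratio=1.4)))
```

A static rendering like this corresponds to the paper's static-text condition; the animated condition could be approximated by interpolating these same CSS properties over the word's duration.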

