Synthesizing audio from tongue motion during speech using tagged MRI via transformer

02/14/2023
by   Xiaofeng Liu, et al.
0

Investigating the relationship between internal tissue point motion of the tongue and oropharyngeal muscle deformation measured from tagged MRI and intelligible speech can aid in advancing speech motor control theories and developing novel treatment methods for speech related-disorders. However, elucidating the relationship between these two sources of information is challenging, due in part to the disparity in data structure between spatiotemporal motion fields (i.e., 4D motion fields) and one-dimensional audio waveforms. In this work, we present an efficient encoder-decoder translation network for exploring the predictive information inherent in 4D motion fields via 2D spectrograms as a surrogate of the audio data. Specifically, our encoder is based on 3D convolutional spatial modeling and transformer-based temporal modeling. The extracted features are processed by an asymmetric 2D convolution decoder to generate spectrograms that correspond to 4D motion fields. Furthermore, we incorporate a generative adversarial training approach into our framework to further improve synthesis quality on our generated spectrograms. We experiment on 63 paired motion field sequences and speech waveforms, demonstrating that our framework enables the generation of clear audio waveforms from a sequence of motion fields. Thus, our framework has the potential to improve our understanding of the relationship between these two modalities and inform the development of treatments for speech disorders.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
06/05/2022

Tagged-MRI Sequence to Audio Synthesis via Self Residual Attention Guided Heterogeneous Translator

Understanding the underlying relationship between tongue and oropharynge...
research
07/07/2021

Efficient Transformer for Direct Speech Translation

The advent of Transformer-based models has surpassed the barriers of tex...
research
01/24/2017

Speech Map: A Statistical Multimodal Atlas of 4D Tongue Motion During Speech from Tagged and Cine MR Images

Quantitative measurement of functional and anatomical traits of 4D tongu...
research
06/29/2021

FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis

Methods for modeling and controlling prosody with acoustic features have...
research
10/22/2019

Complex Transformer: A Framework for Modeling Complex-Valued Sequence

While deep learning has received a surge of interest in a variety of fie...
research
02/27/2023

SpeechFormer++: A Hierarchical Efficient Framework for Paralinguistic Speech Processing

Paralinguistic speech processing is important in addressing many issues,...
research
01/17/2023

Audio2Gestures: Generating Diverse Gestures from Audio

People may perform diverse gestures affected by various mental and physi...

Please sign up or login with your details

Forgot password? Click here to reset