SpeechFormer++: A Hierarchical Efficient Framework for Paralinguistic Speech Processing

02/27/2023
by Weidong Chen, et al.

Paralinguistic speech processing is important for many problems, such as sentiment analysis and neurocognitive disorder analysis. Recently, the Transformer has achieved remarkable success in natural language processing and has been shown to adapt well to speech. However, previous work applying the Transformer to speech has not incorporated the properties of speech, leaving the Transformer's full potential unexplored. In this paper, we consider the characteristics of speech and propose a general structure-based framework, called SpeechFormer++, for paralinguistic speech processing. More concretely, following the component relationship in the speech signal, we design a unit encoder to model the intra- and inter-unit information (i.e., frames, phones, and words) efficiently. According to the hierarchical relationship, we utilize merging blocks to generate features at different granularities, which is consistent with the structural pattern in the speech signal. Moreover, a word encoder is introduced to integrate word-grained features into each unit encoder, which effectively balances fine-grained and coarse-grained information. SpeechFormer++ is evaluated on speech emotion recognition (IEMOCAP and MELD), depression classification (DAIC-WOZ), and Alzheimer's disease detection (Pitt) tasks. The results show that SpeechFormer++ outperforms the standard Transformer while greatly reducing the computational cost. Furthermore, it delivers superior results compared to state-of-the-art approaches.
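
The hierarchy described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes, purely for illustration, that a unit encoder restricts attention to a local window of neighbouring positions and that a merging block averages non-overlapping groups of features. The function names `local_attention` and `merge`, the window and group sizes, and the toy input are all hypothetical. Learned projections and the word encoder are left out.

```python
import math

def local_attention(x, window):
    """Toy self-attention in which each position attends only to
    positions within `window // 2` steps of itself (assumed stand-in
    for a unit encoder; no learned weights)."""
    n, d = len(x), len(x[0])
    half = window // 2
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        # Scaled dot-product scores against the local neighbours only.
        scores = [sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
                  for j in range(lo, hi)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # stable softmax
        total = sum(weights)
        out.append([sum(w / total * x[j][k]
                        for w, j in zip(weights, range(lo, hi)))
                    for k in range(d)])
    return out

def merge(x, size):
    """Average non-overlapping groups of `size` vectors into one coarser
    vector (assumed stand-in for a merging block)."""
    merged = []
    for start in range(0, len(x), size):
        group = x[start:start + size]
        merged.append([sum(col) / len(group) for col in zip(*group)])
    return merged

# Hypothetical toy input: 8 frames with 2 features each.
frames = [[float(i), 1.0] for i in range(8)]

frame_feats = local_attention(frames, window=3)   # frame-level stage
phone_feats = merge(frame_feats, size=2)          # 8 frames -> 4 coarser units
word_feats = merge(local_attention(phone_feats, window=3), size=2)  # 4 -> 2
```

In this sketch each position scores at most `window` neighbours rather than all `n` positions, so the number of attention scores grows as roughly `n * window` instead of `n * n`. That is one plausible reading of how a windowed design could reduce computational cost relative to a standard Transformer.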

research
03/08/2022

SpeechFormer: A Hierarchical Efficient Framework Incorporating the Characteristics of Speech

Transformer has obtained promising results on cognitive speech signal pr...
research
02/27/2023

DST: Deformable Speech Transformer for Emotion Recognition

Enabled by multi-head self-attention, Transformer has exhibited remarkab...
research
03/03/2023

DWFormer: Dynamic Window transFormer for Speech Emotion Recognition

Speech emotion recognition is crucial to human-computer interaction. The...
research
03/02/2023

Fine-grained Emotional Control of Text-To-Speech: Learning To Rank Inter- And Intra-Class Emotion Intensities

State-of-the-art Text-To-Speech (TTS) models are capable of producing hi...
research
11/29/2021

On the rate of convergence of a classifier based on a Transformer encoder

Pattern recognition based on a high-dimensional predictor is considered....
research
06/19/2022

Learning Multiscale Transformer Models for Sequence Generation

Multiscale feature hierarchies have been witnessed the success in the co...
research
02/14/2023

Synthesizing audio from tongue motion during speech using tagged MRI via transformer

Investigating the relationship between internal tissue point motion of t...
