Aalto's End-to-End DNN systems for the INTERSPEECH 2020 Computational Paralinguistics Challenge

08/06/2020
by Tamás Grósz, et al.

End-to-end (E2E) neural network models have shown significant performance benefits on different INTERSPEECH ComParE tasks. Prior work has applied either a single instance of an E2E model to a task or the same E2E architecture to different tasks. However, a single model can be unstable, and reusing the same architecture across tasks under-utilizes task-specific information. On the ComParE 2020 tasks, we investigate applying an ensemble of E2E models for robust performance and developing task-specific modifications for each task. ComParE 2020 introduces three sub-challenges: the breathing sub-challenge, to predict the output of a respiratory belt worn by a patient while speaking; the elderly sub-challenge, to estimate an elderly speaker's arousal and valence levels; and the mask sub-challenge, to classify whether the speaker is wearing a mask. On each of these tasks, an ensemble outperforms the single E2E model. On the breathing sub-challenge, we study the impact of multi-loss strategies on task performance. On the elderly sub-challenge, predicting the valence and arousal levels prompts us to investigate multi-task training and to implement data sampling strategies to handle class imbalance. On the mask sub-challenge, an E2E system without feature engineering is competitive with feature-engineered baselines and provides substantial gains when combined with them.
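The abstract does not say how the ensemble members are combined. As a minimal sketch, assuming the simplest option of averaging per-class posteriors across independently trained models (as could apply to the two-class mask task), the idea might look like the following. The function names and the example probabilities are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def average_posteriors(
    model_outputs: Sequence[Sequence[Sequence[float]]],
) -> List[List[float]]:
    """Average class posteriors over ensemble members.

    model_outputs[m][i][c] is the probability that model m assigns
    to class c for utterance i. Posterior averaging is an assumed
    combination rule here, not necessarily the one used by the authors.
    """
    n_models = len(model_outputs)
    if n_models == 0:
        raise ValueError("at least one model output is required")
    n_items = len(model_outputs[0])
    n_classes = len(model_outputs[0][0])
    averaged = []
    for i in range(n_items):
        averaged.append([
            sum(model_outputs[m][i][c] for m in range(n_models)) / n_models
            for c in range(n_classes)
        ])
    return averaged


def predict_labels(posteriors: Sequence[Sequence[float]]) -> List[int]:
    """Pick the index of the highest averaged posterior for each utterance."""
    return [max(range(len(p)), key=p.__getitem__) for p in posteriors]


# Hypothetical posteriors from three models for two utterances
# (class 0 = no mask, class 1 = mask; the labels are illustrative only).
model_a = [[0.9, 0.1], [0.4, 0.6]]
model_b = [[0.6, 0.4], [0.7, 0.3]]
model_c = [[0.8, 0.2], [0.3, 0.7]]

ensemble = average_posteriors([model_a, model_b, model_c])
labels = predict_labels(ensemble)
```

In this toy example the second model disagrees with the other two on the second utterance, and averaging smooths out that disagreement, which is the kind of single-model instability an ensemble is meant to reduce.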

