An Exploration into the Performance of Unsupervised Cross-Task Speech Representations for "In the Wild" Edge Applications

05/09/2023
by Heitor Guimarães, et al.

Unsupervised speech models are becoming ubiquitous in the speech and machine learning communities. Upstream models are responsible for learning meaningful representations from raw audio. Later, these representations serve as input to downstream models to solve a number of tasks, such as keyword spotting or emotion recognition. As edge speech applications start to emerge, it is important to gauge how robust these cross-task representations are on edge devices with limited resources and different noise levels. To this end, in this study we evaluate the robustness of four different versions of HuBERT, namely: base, large, and extra-large versions, as well as a recent version termed Robust-HuBERT. Tests are conducted under different additive and convolutive noise conditions for three downstream tasks: keyword spotting, intent classification, and emotion recognition. Our results show that while larger models can provide some important robustness to environmental factors, they may not be applicable to edge applications. Smaller models, on the other hand, showed substantial accuracy drops in noisy conditions, especially in the presence of room reverberation. These findings suggest that cross-task speech representations are not yet ready for edge applications and innovations are still needed.
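
The two kinds of degradation mentioned above can be illustrated with a minimal sketch: additive noise is mixed in at a chosen signal-to-noise ratio, and convolutive noise is simulated by convolving the signal with a room impulse response. The function names, the synthetic signals, the toy impulse response, and the 5 dB level below are assumptions made for illustration only; the noise corpora, impulse responses, and SNR levels used in the study are not given in this abstract.

```python
import numpy as np


def add_noise_at_snr(speech, noise, snr_db):
    """Additive noise: scale `noise` so the mixture has the requested SNR in dB."""
    # Repeat or truncate the noise so it matches the speech length.
    noise = np.resize(noise, speech.shape)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Solve speech_power / (scale^2 * noise_power) = 10^(snr_db / 10) for scale.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


def apply_reverb(speech, rir):
    """Convolutive noise: convolve with a room impulse response, keep input length."""
    return np.convolve(speech, rir)[: len(speech)]


# Synthetic stand-ins (hypothetical): a 220 Hz tone as "speech", white noise,
# and an exponentially decaying random sequence as a toy impulse response.
rng = np.random.default_rng(0)
sample_rate = 16000
speech = np.sin(2 * np.pi * 220 * np.arange(sample_rate) / sample_rate)
noise = rng.standard_normal(sample_rate // 2)
rir = np.exp(-np.arange(4000) / 800.0) * rng.standard_normal(4000)

noisy = add_noise_at_snr(speech, noise, snr_db=5.0)
reverberant = apply_reverb(speech, rir)

# Measure the SNR actually achieved by the additive mixture.
measured_snr_db = 10 * np.log10(np.mean(speech ** 2) / np.mean((noisy - speech) ** 2))
```

In a robustness evaluation of this kind, the degraded waveforms would replace the clean ones at the input of the upstream model, while the downstream task and its labels stay unchanged.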


