Analysis of XLS-R for Speech Quality Assessment

08/23/2023
by   Bastiaan Tamm, et al.

In online conferencing applications, estimating the perceived quality of an audio signal is crucial to ensure high quality of experience for the end user. The most reliable way to assess the quality of a speech signal is through human judgments in the form of the mean opinion score (MOS) metric. However, such an approach is labor intensive and not feasible for large-scale applications. The focus has therefore shifted towards automated speech quality assessment through end-to-end training of deep neural networks. Recently, it was shown that leveraging pre-trained wav2vec-based XLS-R embeddings leads to state-of-the-art performance for the task of speech quality prediction. In this paper, we perform an in-depth analysis of the pre-trained model. First, we analyze the performance of embeddings extracted from each layer of XLS-R and also for each size of the model (300M, 1B, 2B parameters). Surprisingly, we find two optimal regions for feature extraction: one in the lower-level features and one in the high-level features. Next, we investigate the reason for the two distinct optima. We hypothesize that the lower-level features capture characteristics of noise and room acoustics, whereas the high-level features focus on speech content and intelligibility. To investigate this, we analyze the sensitivity of the MOS predictions with respect to different levels of corruption in each category. Afterwards, we try fusing the two optimal feature depths to determine if they contain complementary information for MOS prediction. Finally, we compare the performance of the proposed models and assess the generalizability of the models on unseen datasets.
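The layer-wise extraction and fusion described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: random arrays stand in for the per-layer XLS-R outputs, and the layer indices (5 and 20), mean pooling over time, and fusion by concatenation are all assumptions chosen for the example.

```python
import numpy as np

def pool_layer(hidden_states, layer):
    """Mean-pool one layer's frame-level embeddings over time -> (dim,)."""
    return hidden_states[layer].mean(axis=0)

def fuse_layers(hidden_states, low_layer, high_layer):
    """Concatenate pooled embeddings from a shallow and a deep layer.

    Concatenation is an assumed fusion strategy for illustration only.
    """
    low = pool_layer(hidden_states, low_layer)
    high = pool_layer(hidden_states, high_layer)
    return np.concatenate([low, high])

# Stand-in for the per-layer outputs of a pre-trained model:
# one (frames, dim) array per layer. Sizes here are hypothetical.
rng = np.random.default_rng(0)
num_layers, frames, dim = 24, 50, 16
hidden_states = [rng.standard_normal((frames, dim)) for _ in range(num_layers)]

# Hypothetical "low" and "high" layer indices, not values from the paper.
fused = fuse_layers(hidden_states, low_layer=5, high_layer=20)
print(fused.shape)  # (32,)
```

In practice, the fused vector would be passed to a regression head trained to predict MOS; comparing heads trained on single-layer versus fused features is one way to check whether the two depths carry complementary information.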


research
10/01/2022

Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications

Speech quality in online conferencing applications is typically assessed...
research
03/16/2019

Non-intrusive speech quality assessment using neural networks

Estimating the perceived quality of an audio signal is critical for many...
research
04/02/2023

Re-IQA: Unsupervised Learning for Image Quality Assessment in the Wild

Automatic Perceptual Image Quality Assessment is a challenging problem t...
research
10/18/2018

Exploiting High-Level Semantics for No-Reference Image Quality Assessment of Realistic Blur Images

To guarantee a satisfying Quality of Experience (QoE) for consumers, it ...
research
04/04/2022

MOSRA: Joint Mean Opinion Score and Room Acoustics Speech Quality Assessment

The acoustic environment can degrade speech quality during communication...
research
11/12/2022

Efficient Speech Quality Assessment using Self-supervised Framewise Embeddings

Automatic speech quality assessment is essential for audio researchers, ...
research
02/04/2020

Aesthetic Quality Assessment for Group photograph

Image aesthetic quality assessment has got much attention in recent year...
