Perceptual audio loss function for deep learning

08/20/2017
by Dan Elbaz et al.
Technion

PESQ and POLQA are standards for automated assessment of the voice quality of speech as experienced by human beings. The predictions of these objective measures should come as close as possible to the subjective quality scores obtained in subjective listening tests. WaveNet is a deep neural network originally developed as a deep generative model of raw audio waveforms. The WaveNet architecture is based on dilated causal convolutions, which exhibit very large receptive fields. In this short paper we suggest using the WaveNet architecture, in particular its large receptive field, to learn the PESQ algorithm. By doing so we can use it as a differentiable loss function for speech enhancement.


1 Problem formulation and related work

In statistics, the Mean Squared Error (MSE) and the Peak Signal to Noise Ratio (PSNR) of an estimator are widely used objective measures and good distortion indicators (loss functions) between the estimator's output and the signal that we want to estimate. These loss functions are used for many reconstruction tasks. However, PSNR and MSE do not correlate well with reliable subjective methods such as the Mean Opinion Score (MOS) obtained from expert listeners. A more suitable speech quality assessment can be achieved by using tests that aim for high correlation with MOS tests, such as PESQ or POLQA. However, those algorithms are hard to represent as a differentiable function such as MSE. Moreover, as opposed to MSE, which measures the average of the squares of the errors or deviations and accounts for each sample separately, those algorithms have memory and take into account long time dependencies between samples of the speech signal.
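The per-sample nature of MSE and PSNR described above can be made concrete with a short NumPy sketch (the function names are illustrative):

```python
import numpy as np

def mse(clean, estimate):
    """Mean squared error: averages squared deviations sample by sample."""
    return np.mean((clean - estimate) ** 2)

def psnr(clean, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB, derived directly from the MSE."""
    return 10.0 * np.log10(peak ** 2 / mse(clean, estimate))

# Because MSE treats every sample independently, permuting both signals
# with the same permutation leaves the score unchanged -- unlike PESQ,
# which depends on the temporal structure of the signal.
x = np.array([0.0, 0.5, 1.0, 0.5])
y = np.array([0.1, 0.4, 0.9, 0.6])
perm = np.array([2, 0, 3, 1])
assert np.isclose(mse(x, y), mse(x[perm], y[perm]))
```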

Audio waveforms are signals with very high temporal resolution, at least 16,000 samples per second. In order to capture the long time dependencies of the PESQ score, we need an architecture with a very large receptive field. Such an architecture can be achieved by using dilated convolutions, which exhibit very large receptive fields, as presented in WaveNet. In this work we present an approach that utilizes this large receptive field: we train a WaveNet model that takes the clean and the degraded audio as input and is trained in a supervised way to predict the PESQ score of those two signals, using the results obtained from the full-reference PESQ algorithm as targets. I.e., we train the model to predict the full-reference PESQ score.
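To see how dilated causal convolutions reach such large receptive fields, the usual receptive-field arithmetic can be sketched as follows (the layer counts here are illustrative, not necessarily the paper's exact configuration):

```python
def receptive_field(dilations, kernel_size=2):
    """Receptive field (in samples) of stacked dilated causal convolutions.

    Each layer with dilation d and kernel size k widens the receptive
    field by (k - 1) * d samples; the input sample itself contributes 1.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# A WaveNet-style stack doubles the dilation at each layer, so twelve
# layers with kernel size 2 already span 2**12 = 4096 samples:
print(receptive_field([2 ** i for i in range(12)]))  # 4096
```

Growing the receptive field exponentially with depth is what makes it feasible to cover thousands of samples with only a handful of layers.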

A different approach, which aims to denoise audio with fidelity to both objective and subjective quality measures of the enhanced speech, was made in [4]; however, in that work the loss function is learned via a minimax game and does not try to learn the PESQ score directly.

2 Method description

In order to alleviate computational demands we use 0.25 seconds of clean audio and 0.25 seconds of degraded audio, both sampled at 16 kHz, as the input of the WaveNet network. We also condition on the speaker identity, as suggested in the original WaveNet paper; this way we are able to learn a more accurate PESQ score per speaker and can use a single model for PESQ evaluation of different speakers.

After learning PESQ we can use it as a differentiable loss function in order to denoise speech; this way we can minimize

L(x, x̂) = (1 − λ) ‖x − x̂‖² + λ f(x, x̂),

where x is the clean audio, x̂ is the degraded audio, λ is a number in [0, 1], and f is the differentiable loss function, which was trained to learn the PESQ mapping.
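A minimal sketch of one plausible form of such a combined objective follows. The convex-combination form and every name here are assumptions: the text only specifies a clean signal, a degraded signal, a weight in [0, 1], and the learned PESQ network.

```python
import numpy as np

def combined_loss(clean, estimate, pesq_net, lam=0.5):
    """Convex combination of a sample-wise term and a learned perceptual term.

    `pesq_net` stands in for the trained (differentiable) PESQ predictor;
    here it can be any callable scoring two waveforms. `lam` in [0, 1]
    trades off the two terms. (Illustrative form only.)
    """
    sample_term = np.mean((clean - estimate) ** 2)
    perceptual_term = pesq_net(clean, estimate)
    return (1.0 - lam) * sample_term + lam * perceptual_term
```

With `lam = 0` the loss reduces to plain MSE; with `lam = 1` only the learned perceptual term remains.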

3 Experimental results

The dataset used for training the network is the TIMIT Acoustic-Phonetic Continuous Speech Corpus [2]. Both the degraded and the clean audio are fed into the network, 4095 samples each. The network's receptive field corresponds to twice the length of the audio signal, 8190 samples. The degraded audio was generated with speech-shaped noise, produced in MATLAB [3].

In this process the program derives the Fourier transform of each speech file; the Fourier transform is then manipulated such that the phases of the spectral components are randomized. The resulting modified Fourier output is then converted back into the time domain using an inverse Fourier transform. The result is speech-shaped noise with a spectrum almost identical to that of the original speech corpus.
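The phase-randomization procedure above can be sketched in NumPy as a simplified stand-in for the MATLAB script of [3] (the function name and seeding interface are illustrative):

```python
import numpy as np

def speech_shaped_noise(speech, rng=None):
    """Phase-randomized noise with the same magnitude spectrum as `speech`.

    Derives the Fourier transform of the signal, keeps the magnitudes,
    replaces the phases of the spectral components with uniformly random
    ones, and converts back to the time domain with an inverse transform.
    """
    rng = np.random.default_rng() if rng is None else rng
    spectrum = np.fft.rfft(speech)
    random_phase = np.ones(len(spectrum), dtype=complex)
    # Leave the DC and Nyquist bins untouched so the inverse transform
    # of the half-spectrum remains a real-valued signal.
    random_phase[1:-1] = np.exp(
        1j * rng.uniform(0.0, 2.0 * np.pi, len(spectrum) - 2)
    )
    return np.fft.irfft(np.abs(spectrum) * random_phase, n=len(speech))
```

The output has the spectral envelope of the input speech but none of its temporal structure, which is exactly the property that makes it useful as a degradation for PESQ training data.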

By running the PESQ algorithm on 0.25-second segments we found that the results yield a correlation of 81 percent with running the PESQ algorithm on the full audio section.

References

  • [1] John G. Beerends, Christian Schmidmer, Jens Berger, Matthias Obermann, Raphael Ullmann, Joachim Pomy, and Michael Keyhl. Perceptual objective listening quality assessment (POLQA), the third generation ITU-t standard for End-to-End speech quality measurement part I—Temporal alignment. JAES, 61(6):366–384, June 2013.
  • [2] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM, 1993.
  • [3] Nike. Speech spectrum shaped noise. https://www.mathworks.com/matlabcentral/fileexchange/55701-speech-spectrum-shaped-noise, 2012.
  • [4] Santiago Pascual, Antonio Bonafonte, and Joan Serrà. SEGAN: speech enhancement generative adversarial network. CoRR, abs/1703.09452, 2017.
  • [5] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra. Perceptual evaluation of speech quality (PESQ): a new method for speech quality assessment of telephone networks and codecs. In Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01), volume 2, pages 749–752, Washington, DC, USA, 2001. IEEE Computer Society.
  • [6] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alexander Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.