1 Problem formulation and related work
In statistics, the Mean Squared Error (MSE) and the Peak Signal-to-Noise Ratio (PSNR) of an estimator are widely used objective measures, serving as distortion indicators (loss functions) between the estimator's output and the quantity we want to estimate; these loss functions are used for many reconstruction tasks. However, PSNR and MSE do not correlate well with reliable subjective methods such as the Mean Opinion Score (MOS) obtained from expert listeners. A more suitable speech quality assessment can be achieved with tests designed for high correlation with MOS, such as PESQ or POLQA. However, these algorithms are hard to represent as a differentiable function such as MSE. Moreover, whereas MSE measures the average of the squared errors or deviations and accounts for each sample separately, these algorithms have memory and take into account long-range dependencies between samples of the speech signal.
Audio waveforms are signals with very high temporal resolution, at least 16,000 samples per second. To capture the long-range dependencies reflected in the PESQ score, we need an architecture with a very large receptive field. Such an architecture can be built from dilated convolutions, which exhibit very large receptive fields, as presented in WaveNet. In this work we exploit this large receptive field and train a WaveNet model that takes as input the clean and the degraded audio and is trained, in a supervised way, to predict the PESQ score of the two signals, using targets obtained from the full-reference PESQ algorithm; i.e., we train the model to predict the full-reference PESQ score.
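The receptive field of a stack of dilated causal convolutions grows exponentially with depth, which is what makes the long-range coverage above feasible. The sketch below computes the receptive field for a WaveNet-style dilation schedule; the specific depth shown is illustrative, not necessarily the configuration used in this work.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of stacked dilated causal convolutions.

    Each layer with dilation d and kernel size k extends the
    receptive field by (k - 1) * d samples.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# WaveNet-style schedule: dilations double each layer, repeated in blocks.
dilations = [2 ** i for i in range(12)] * 2  # two blocks of 1, 2, 4, ..., 2048
print(receptive_field(2, dilations))  # 8191 samples, about 0.5 s at 16 kHz
```

With kernel size 2 and two blocks of 12 doubling layers, the receptive field covers roughly half a second of 16 kHz audio, on the order of the two concatenated 0.25-second inputs described below.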
A different approach, which aims to denoise audio with fidelity to both objective and subjective measures of the enhanced speech quality, was taken in SEGAN; in that work, however, the loss function is learned via a minimax game and does not try to learn the PESQ score directly.
2 Method description
To alleviate computational demands we feed 0.25 seconds of clean audio and 0.25 seconds of degraded audio, both sampled at 16 kHz, into the WaveNet network. We also condition on the speaker identity, as suggested in the original WaveNet paper; this way we learn a more accurate PESQ score per speaker and can use a single model for PESQ evaluation of different speakers.
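A minimal sketch of assembling one training input under these choices: a two-channel waveform segment (clean and degraded) plus a one-hot speaker-conditioning vector. The shapes and the helper name are illustrative assumptions, not the exact data layout of this work.

```python
import numpy as np

SR = 16_000
SEG = 4095  # about 0.25 s at 16 kHz

def make_example(clean, degraded, speaker_id, num_speakers):
    """Stack clean and degraded segments and build a one-hot speaker vector.

    Illustrative sketch: the real pipeline may order channels or encode
    the speaker differently.
    """
    assert len(clean) >= SEG and len(degraded) >= SEG
    waveforms = np.stack([clean[:SEG], degraded[:SEG]])  # shape (2, 4095)
    speaker = np.eye(num_speakers)[speaker_id]           # shape (num_speakers,)
    return waveforms, speaker

x, s = make_example(np.zeros(SR), np.zeros(SR), speaker_id=3, num_speakers=10)
print(x.shape, s.shape)  # (2, 4095) (10,)
```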
After learning the PESQ mapping, we can use it as a differentiable loss function for speech denoising; this way we can minimize

L(x, x̂) = λ · MSE(x, x̂) + (1 − λ) · f(x, x̂),

where x is the clean audio, x̂ is the degraded audio, λ is a number in [0, 1], and f is the differentiable loss function that was trained to learn the PESQ mapping.
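One plausible form of this combined objective, assuming λ weights an MSE term against the learned PESQ term, is sketched below. `pesq_net` stands in for the trained predictor; both the callable and the weighting are illustrative assumptions, not the authors' exact formula.

```python
import numpy as np

def combined_loss(clean, degraded, pesq_net, lam=0.5):
    """lam * MSE + (1 - lam) * learned-PESQ term, with lam in [0, 1].

    `pesq_net` is assumed to return a differentiable predicted distortion
    for the signal pair; here it is a stand-in, not the trained WaveNet.
    """
    mse = np.mean((clean - degraded) ** 2)
    return lam * mse + (1.0 - lam) * pesq_net(clean, degraded)

# Toy usage with a placeholder in place of the trained network:
x = np.zeros(4095)
y = np.full(4095, 0.1)
fake_pesq = lambda a, b: np.mean(np.abs(a - b))
print(round(combined_loss(x, y, fake_pesq, lam=0.5), 3))  # 0.055
```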
3 Experimental results
The dataset used for training the network is the TIMIT Acoustic-Phonetic Continuous Speech Corpus. Both the degraded and the clean audio are fed into the network, 4095 samples each; the network's receptive field, 8190 samples, corresponds to twice the length of the audio signal. The degraded audio was generated with speech-shaped noise produced in MATLAB. In this process, the program derives the Fourier transform of each speech file and manipulates it so that the phases of the spectral components are randomized. The resulting modified Fourier output is then converted back into the time domain using an inverse Fourier transform. The result is speech-shaped noise with a spectrum almost identical to that of the original speech corpus.
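The phase-randomization procedure described above can be sketched in a few lines of NumPy (a minimal re-implementation of the idea, not the MATLAB program actually used): keep the magnitude spectrum of the speech and replace the phases with uniform random values.

```python
import numpy as np

def speech_shaped_noise(speech, rng=None):
    """Noise with (almost) the same magnitude spectrum as `speech`,
    obtained by randomizing the phases of its Fourier components."""
    rng = np.random.default_rng() if rng is None else rng
    spectrum = np.fft.rfft(speech)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=spectrum.shape)
    # DC and Nyquist bins must stay real for a real-valued output signal.
    phases[0] = 0.0
    if len(speech) % 2 == 0:
        phases[-1] = 0.0
    return np.fft.irfft(np.abs(spectrum) * np.exp(1j * phases), n=len(speech))

# The magnitude spectra of the signal and the noise match closely:
x = np.random.default_rng(0).standard_normal(4096)
n = speech_shaped_noise(x, rng=np.random.default_rng(1))
print(np.allclose(np.abs(np.fft.rfft(x)), np.abs(np.fft.rfft(n))))  # True
```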
Running the PESQ algorithm on 0.25-second segments, we found that the resulting scores have a correlation of 81 percent with the scores obtained by running PESQ on the full audio signals.
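Such a figure can be computed as the Pearson correlation between per-utterance segment scores and full-signal scores. The sketch below uses made-up placeholder scores, not the actual experimental data.

```python
import numpy as np

# Hypothetical PESQ scores, one pair per utterance (placeholder values):
segment_scores = np.array([2.1, 3.4, 1.8, 2.9, 3.7])  # 0.25 s segment
full_scores = np.array([2.3, 3.2, 1.6, 3.0, 3.9])     # full-length signal

r = np.corrcoef(segment_scores, full_scores)[0, 1]
print(f"Pearson correlation: {r:.2f}")
```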
-  John G. Beerends, Christian Schmidmer, Jens Berger, Matthias Obermann, Raphael Ullmann, Joachim Pomy, and Michael Keyhl. Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement part I—Temporal alignment. JAES, 61(6):366–384, June 2013.
-  J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM, 1993.
-  Nike. Speech spectrum shaped noise. https://www.mathworks.com/matlabcentral/fileexchange/55701-speech-spectrum-shaped-noise, 2012.
-  Santiago Pascual, Antonio Bonafonte, and Joan Serrà. SEGAN: speech enhancement generative adversarial network. CoRR, abs/1703.09452, 2017.
-  A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra. Perceptual evaluation of speech quality (PESQ)—a new method for speech quality assessment of telephone networks and codecs. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01), volume 2, pages 749–752, Washington, DC, USA, 2001. IEEE Computer Society.
-  Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alexander Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. CoRR, abs/1609.03499, 2016.