Monaural speech separation is the problem of separating a target speech signal from an acoustic mixture consisting of other highly non-stationary signals, e.g., competing speech signals. Traditionally, model-based approaches such as hidden Markov models (e.g., in [1]) and non-negative matrix factorization (e.g., in [2, 3]) have been used to address it. In recent years, however, purely data-driven discriminative approaches such as deep neural networks (DNNs) (e.g., in [4, 5]) have achieved great success.
In this paper, we focus on the objective intelligibility performance of DNN-based speech separation. Moreover, our approach is concerned with maintaining a low algorithmic processing latency (e.g., in [6, 7, 8]), which is particularly critical for applications such as hearing aids [9] and cochlear implants [10]. Notably for hearing aids, according to Agnew and Thornton [11], delays as low as 3 to 5 ms were found to be noticeable, and anything longer than 10 ms was deemed objectionable to hearing impaired listeners, due to potential comb filter coloration or echo from the combination of direct and delayed sound in open hearing aid fittings.
DNN-based speech separation approaches have generally used the traditional mean square error (MSE) loss function (e.g., in [4, 12]) between the predicted and target spectra or time-frequency masks. The MSE loss is intuitively suboptimal in the sense that it treats all frequency components of the signal equally, which deviates from what studies of the human auditory system suggest [13]. Hence it makes sense to use perceptually motivated cost functions instead. There have been some attempts towards this goal in the context of speech separation/enhancement. For example, an altered version of the traditional MSE was used in [14], a weighted MSE approach based on the absolute threshold of hearing and the masking properties of the human auditory system was employed in [15], and a cost function inspired by the short-time objective intelligibility (STOI) [16] measure was used in [17]. Another notable work comparing different cost functions, e.g., Kullback-Leibler divergence, Itakura-Saito divergence, and MSE, was reported in [18].
We propose a cost function based on the extended short-time objective intelligibility (ESTOI) metric [19]. ESTOI extends the widely used short-time objective intelligibility (STOI) metric [21] and is postulated to be a better predictor of subjective intelligibility when the interfering signal is modulated [19], e.g., competing speech, and hence is better suited as an optimization objective for our purpose. We optimize for ESTOI with a sequence-based loss using a long short-term memory network, unlike [17] where feedforward DNNs were used for STOI optimization. Moreover, we use a single network which jointly optimizes for all one-third octave bands, unlike [17] where multiple networks were used.
With direct optimization of the ESTOI loss, we report an improvement of 0.03 (statistically significant under a Wilcoxon signed-rank test), averaged over four speaker pairs, in terms of the ESTOI metric as compared to a baseline DNN trained on the MSE cost. We also observe that this optimization degrades the separation performance in terms of source to distortion ratio (SDR) [20] by 0.6 dB on average. We then propose a pretraining strategy where the DNN is first trained with MSE as the objective, and training is continued after model convergence (in terms of validation MSE) with the ESTOI loss. The proposed approach mitigates the degradation in SDR to an average of 0.2 dB while offering better or on par objective intelligibility performance compared to the baseline.
2 Proposed cost function
In this paper we propose a sequence-based loss that approximates the calculation of the ESTOI estimator [19]. The magnitude spectrogram of the mixture, $|Y(k, m)|$, where $k$ and $m$ denote frequency and time indices, respectively, is computed and fed to an LSTM. Each sequence consists of $T$ successive STFT frames. The proposed loss is computed between the estimated and target spectra of the two sources. Both the STOI and ESTOI measures utilize one-third octave band processing to mimic the frequency selectivity of cochlear processing in the human ear. Moreover, the ESTOI measure is computed on short-time analysis segments of 384 ms in order to include the temporal modulation frequencies which are critical for speech intelligibility [19]. We keep these design choices in our cost calculation, the only difference from the original ESTOI computation being the inclusion of the frequency range up to 8 kHz. For simplicity we explain the loss calculation between the estimated spectrum $\hat{S}(k, m)$ and target spectrum $S(k, m)$ corresponding to one source only; the process is identical for the second source. It entails the following steps:
Band decomposition: $\hat{S}(k, m)$ is processed to give its one-third octave band decomposed version as,

$$\hat{X}_j(m) = \sqrt{\sum_{k = k_1(j)}^{k_2(j) - 1} |\hat{S}(k, m)|^2},$$

where $k_1(j)$ and $k_2(j)$ denote the frequency boundaries of the $j$-th one-third octave band, $j = 1, \dots, J$, and $J$ is the number of one-third octave bands. Similarly, $X_j(m)$ is the one-third octave band decomposed version obtained from $S(k, m)$.
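As an illustration, the band decomposition can be sketched in NumPy. The square-root energy summation and the 150 Hz lowest centre frequency follow the text, while the helper `third_octave_bands` and its bin-edge computation are our own assumptions, not the paper's exact implementation:

```python
import numpy as np

def third_octave_bands(n_bins, sr=16000, n_bands=18, f_low=150.0):
    # Hypothetical helper: lower/upper FFT-bin edges k1(j), k2(j) of
    # one-third octave bands with the lowest centre frequency at f_low.
    cf = f_low * 2.0 ** (np.arange(n_bands) / 3.0)   # centre frequencies (Hz)
    lo = cf / 2.0 ** (1.0 / 6.0)                     # lower band edges (Hz)
    hi = cf * 2.0 ** (1.0 / 6.0)                     # upper band edges (Hz)
    hz_per_bin = sr / (2 * (n_bins - 1))             # FFT frequency resolution
    k1 = np.floor(lo / hz_per_bin).astype(int)
    k2 = np.minimum(np.ceil(hi / hz_per_bin).astype(int), n_bins)
    return k1, k2

def band_decompose(S, k1, k2):
    # S: (n_bins, n_frames) magnitude spectrogram -> X_j(m), the square
    # root of the spectral energy inside each one-third octave band.
    return np.sqrt(np.stack([(S[a:b] ** 2).sum(axis=0) for a, b in zip(k1, k2)]))
```

With the paper's 128-sample window at 16 kHz the spectrogram has 65 bins, so low bands cover only one or two FFT bins; a finer analysis window would resolve them better.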
Time segmentation: Now $\hat{X}_j(m)$ (and $X_j(m)$) is segmented into $T - N + 1$ time-segments, where $T$ is the number of STFT frames in the sequence and $N$ is the ESTOI context window for the calculation of the intermediate intelligibility measures. Hence the time-segment ending at frame $m$ is a $J \times N$ matrix given by,

$$\hat{M}_m = \begin{bmatrix} \hat{X}_1(m - N + 1) & \cdots & \hat{X}_1(m) \\ \vdots & \ddots & \vdots \\ \hat{X}_J(m - N + 1) & \cdots & \hat{X}_J(m) \end{bmatrix}.$$
Similarly, $M_m$ is the time-segment corresponding to the band-decomposed target spectrogram for the intermediate ESTOI calculation.
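The segmentation step amounts to a sliding window over the band-decomposed spectrogram; a minimal sketch (the array-shape convention is our assumption):

```python
import numpy as np

def time_segments(X, N):
    # X: (J, T) band-decomposed spectrogram; returns a (T - N + 1, J, N)
    # stack of overlapping segments, one per intermediate ESTOI window.
    J, T = X.shape
    return np.stack([X[:, m:m + N] for m in range(T - N + 1)])
```

With the paper's 8 ms window and 50 % overlap (4 ms hop), the 384 ms context corresponds to N = 96 frames.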
Normalization: Each of the above segments is first mean and variance normalized along the rows (temporal normalization), such that each row of the resulting matrix is zero mean and has unit norm. This is followed by normalization along the columns (spectral normalization), yielding matrices $\hat{Z}_m$ and $Z_m$, each column of which is a zero mean, unit norm vector.
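The two-stage normalization can be sketched as follows; the small epsilon guarding against division by zero is our addition:

```python
import numpy as np

def normalize_segment(M, eps=1e-12):
    # Temporal normalization: make every row zero mean and unit norm.
    M = M - M.mean(axis=1, keepdims=True)
    M = M / (np.linalg.norm(M, axis=1, keepdims=True) + eps)
    # Spectral normalization: make every column zero mean and unit norm.
    M = M - M.mean(axis=0, keepdims=True)
    M = M / (np.linalg.norm(M, axis=0, keepdims=True) + eps)
    return M
```

As in ESTOI, the column normalization is applied last, so only the columns are guaranteed to be exactly zero mean and unit norm in the result.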
Dot product and averaging: The intermediate intelligibility index $d_m$ corresponding to time-segment $m$ is simply the average dot product between the columns of $\hat{Z}_m$ and $Z_m$, given by,

$$d_m = \frac{1}{N} \sum_{n = 1}^{N} \hat{z}_{m,n}^{\top} z_{m,n},$$

where $\hat{z}_{m,n}$ and $z_{m,n}$ denote the $n$-th columns of $\hat{Z}_m$ and $Z_m$, and $(\cdot)^{\top}$ denotes the transpose operation. The ESTOI metric corresponding to the sequence is then calculated by averaging the intermediate measures, i.e., $\mathrm{ESTOI} = \frac{1}{T - N + 1} \sum_{m} d_m$. The cost aims to maximize this metric, which can be achieved by minimizing the negative of the ESTOI metric for the sequence, i.e., minimizing $-\mathrm{ESTOI}$.
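Putting the last two steps together, the intermediate indices and the final loss can be sketched as follows (stacking segments along a leading axis is our convention):

```python
import numpy as np

def neg_estoi(Z_hat, Z):
    # Z_hat, Z: (n_seg, J, N) stacks of normalized estimated/target
    # segments. d_m averages the column-wise dot products within each
    # segment; the loss is the negative mean of d_m over the sequence.
    d = np.einsum('mjn,mjn->mn', Z_hat, Z).mean(axis=1)  # one d_m per segment
    return -d.mean()
```

For a perfect estimate the columns coincide, every dot product equals one, and the loss attains its minimum of -1.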
For simplicity, Figure 1 depicts the computation of the loss function for source 1 only. A similar loss calculation is done for source 2, and the final loss is the mean of the two losses. All the operations described above are differentiable. The Keras library [22] with a Theano backend [23] is used for training, which performs automatic differentiation and gradient backpropagation.
We consider a long short-term memory (LSTM) network [24] as the baseline DNN topology. Three cases are under investigation here: a) the MSE objective, which we will denote as MSE-DNN; b) the proposed ESTOI objective, which we will denote as ESTOI-DNN; and c) training with the proposed objective but, instead of training from scratch, using the weights from the first case as initial weights, which we will denote as MSE-ESTOI-DNN. Please see Section 4.4 for a discussion of the motivation for Case c.
3 Implicit time-frequency masking
In this work, we use the masking-based source separation paradigm (e.g., used in [4, 25, 26]), where a DNN is used to predict a time-frequency mask corresponding to a target speaker. The ESTOI computation, however, is done on the estimated and reference source spectra. We adopt an implicit mask prediction scheme in the sense that the DNN is optimized to output a mask such that, when the mask is applied element-wise to the mixture, the resulting source spectrum minimizes the loss calculated in the spectrum domain. A similar scheme was used in [25], and in [27] in the form of skip-filtering connections. The predicted spectrum for target source 1, $\hat{S}_1$, is computed from the predicted mask $G_1$ as,

$$\hat{S}_1 = G_1 \odot |Y|,$$
where $\odot$ denotes the Hadamard product. The mask corresponding to the other speaker is defined to be $G_2 = 1 - G_1$, and hence the corresponding predicted spectrum is,

$$\hat{S}_2 = (1 - G_1) \odot |Y|.$$
We incorporate the above two masking operations as a deterministic layer at the network output and jointly estimate the output spectra corresponding to the two sources, similar to [4]. This also enforces the condition that the two masks sum to one in every time-frequency bin.
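A minimal sketch of this deterministic output layer in NumPy (the function name and shapes are our own; in the actual system this operation is part of the network graph so gradients flow through it):

```python
import numpy as np

def implicit_masking(mask1, mix_mag):
    # mask1: predicted mask for source 1, same shape as the mixture
    # magnitude spectrogram mix_mag. Source 2 uses the complementary
    # mask, so the two estimates sum to the mixture in every bin.
    s1_hat = mask1 * mix_mag          # Hadamard product
    s2_hat = (1.0 - mask1) * mix_mag
    return s1_hat, s2_hat
```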
4 Evaluation

This section describes the acoustic material used in the experiments, the metrics used to evaluate the separation and intelligibility performance of the proposed system, and finally the results obtained.
4.1 Acoustic material and data generation
An extended version of the Danish hearing in noise test (HINT) dataset [28] is used for the experiments reported in this paper. It consists of three male and three female speakers. Each speaker has 13 lists, each consisting of 20 five-word sentences of natural speech. The native sampling rate of 44.1 kHz is downsampled to 16 kHz before processing. Four speaker pairs, F1 and F2, F2 and F3, M1 and F1, and M1 and M2, are used for the evaluation, and a separate network is trained for each of them. Eight lists (L6 to L13) are used for training, two lists (L4, L5) for validation, and two lists (L1, L2) for testing. The total duration of audio for training and validation is approximately 7 minutes.
STFT spectra are used as DNN input features, with an analysis window of 128 samples (8 ms) and 50 % frame overlap, resulting in an 8 ms algorithmic latency. For generating the training data, all available audio signals corresponding to each speaker are concatenated in the time domain, and STFT features are then extracted. As the available training material is quite limited, a data augmentation scheme is used to increase the amount of training data. It involves circularly offsetting one speaker's spectrogram with respect to the other and adding them to generate a mixture spectrogram. Note that the summation here is in the complex domain. In this work, we use 30 circular shifts spread evenly over the length of the longer of the two training spectrograms. This effectively increases the amount of training data by a factor of 30, to around 2.6 hours.
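The augmentation scheme can be sketched as follows; trimming both spectrograms to a common length before the complex-domain summation is our simplifying assumption:

```python
import numpy as np

def augment_by_circular_shifts(stft1, stft2, n_shifts=30):
    # stft1, stft2: complex STFTs (n_bins, n_frames) of the concatenated
    # training audio of the two speakers. Each circular shift of speaker 2
    # against speaker 1 yields a new mixture; summation is complex-valued.
    L = max(stft1.shape[1], stft2.shape[1])   # frames in the longer signal
    T = min(stft1.shape[1], stft2.shape[1])
    step = L // n_shifts
    return [stft1[:, :T] + np.roll(stft2, n * step, axis=1)[:, :T]
            for n in range(n_shifts)]
```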
4.2 Evaluation metrics

The BSS-EVAL toolbox [20] is used for the objective evaluation of separation performance, and the ESTOI metric is used for the evaluation of speech intelligibility. For the former, we report SDRs; the source to interference ratio (SIR) and source to artifact ratio (SAR) are also reported for completeness. In addition, we report STOI values as well, since STOI is an older and more widely reported measure.
4.3 Experimental design
The design choices for the ESTOI computation in the proposed loss function are kept in line with the standard ESTOI computation, i.e., the ESTOI context for the intermediate correlation measures is 384 ms and the centre frequency of the lowest one-third octave band is set at 150 Hz. The frequency range used, however, is up to 8 kHz. The LSTM network uses three hidden layers, each having 512 hidden neurons, and a time-distributed feedforward dense layer as the output. The sequence length used here is 256 STFT frames, to provide enough time context for several intermediate ESTOI calculation segments. The LSTM cells used in the recurrent layers are standard, as described in [29]. The Adam optimizer is used with the default parameters recommended in [30]. A patience value of 30 epochs is used, which means that training is stopped when the error on the validation data does not go down for 30 consecutive epochs. For audio processing and feature extraction, the Librosa library [31] is used. The experiments are conducted for five initialization seeds and averaged to get the final results reported here.
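The patience-based early stopping rule corresponds to a check like the following (a plain-Python sketch of the criterion, not Keras's actual callback implementation):

```python
def should_stop(val_losses, patience=30):
    # Stop when the validation loss has not improved upon its previous
    # best for `patience` consecutive epochs.
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before
```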
4.4 Results

Table 1: Mean objective evaluation metrics for the three DNN configurations, MSE-DNN, ESTOI-DNN, and MSE-ESTOI-DNN, for the four speaker pairs.
For the evaluation, list L1 for the first speaker of the pair and list L2 for the second speaker are used. Each list consists of 20 sentences, and hence we have 400 test mixtures for each speaker pair. Table 1 shows the mean objective evaluation metrics for the four speaker pairs. Moreover, Figure 2 depicts violin plots of the ESTOI values for the four speaker pairs, which show the distribution of metric values in addition to an embedded boxplot with median and interquartile range. With ESTOI-DNN, an average improvement of 0.03 in terms of the ESTOI metric is observed. An important consequence of ESTOI optimization is poorer separation performance in terms of SDR, in all speaker pairs except F2F3, as compared to the MSE-DNN baseline; on average, a degradation of 0.6 dB is observed. Loss functions aiming to improve objective intelligibility may result in a decrease in other signal energy based separation metrics, such as SDR. However, the aim is also to maintain on par objective separation criteria to ensure good subjective quality of the separated signal. Hence, instead of training models from scratch, we use the weights of MSE-DNN as initial weights and train for the ESTOI objective. DNNs trained in this manner, denoted MSE-ESTOI-DNN, offer similar improvements in the ESTOI measure as observed with ESTOI-DNN, while mitigating the losses in SDR performance to 0.2 dB on average. Moreover, the authors in [17] noted that MSE-based systems performed on par with their proposed STOI optimization approach. We thus acknowledge the utility of MSE optimization towards the final goal of optimizing for improvements in intelligibility. It therefore makes sense to use the MSE objective along with the proposed ESTOI objective, an observation which serves as the motivation for Case c.
5 Conclusion and future work
In this work, we proposed a novel objective function for optimizing the objective intelligibility performance of DNN-based speech separation systems, here in terms of ESTOI, and compared it with the commonly used MSE objective. We showed that the proposed approach offers improvements over, or performs on par with, the baseline. We also showed that a pretraining strategy utilizing the weights of an MSE-optimized DNN as the initial point of optimization for our approach can mitigate the losses in terms of SDR resulting from ESTOI optimization, while preserving superior or on par intelligibility performance in terms of ESTOI. This observation, along with results previously reported in [17], indicates the usefulness of MSE optimization for the goal of improving intelligibility performance. Future work includes combining the MSE and ESTOI objectives into a joint objective of the form $\alpha \mathcal{L}_{\mathrm{MSE}} + (1 - \alpha) \mathcal{L}_{\mathrm{ESTOI}}$, where $\mathcal{L}_{\mathrm{MSE}}$ and $\mathcal{L}_{\mathrm{ESTOI}}$ are the MSE and ESTOI losses, respectively, and $\alpha$ is a weighting parameter.
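The joint objective mentioned as future work could be implemented as a simple convex combination; the exact weighting form below is our assumption, since it is left open in the text:

```python
def joint_loss(l_mse, l_estoi, alpha=0.5):
    # Convex combination of the two losses: alpha = 1 recovers pure MSE
    # training, alpha = 0 pure ESTOI training.
    return alpha * l_mse + (1.0 - alpha) * l_estoi
```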
[1] S. T. Roweis, "One microphone source separation," in Advances in Neural Information Processing Systems, 2001, pp. 793–799.
[2] T. Virtanen, "Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 1066–1074, 2007.
[3] M. N. Schmidt and R. K. Olsson, "Single-channel speech separation using sparse non-negative matrix factorization," in Ninth International Conference on Spoken Language Processing, 2006.
[4] P. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Deep learning for monaural speech separation," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2014, pp. 1562–1566.
[5] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2015, pp. 708–712.
[6] Y. Luo and N. Mesgarani, "TasNet: time-domain audio separation network for real-time single-channel speech separation," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2018.
[7] G. Naithani, T. Barker, G. Parascandolo, L. Bramsløw, N. H. Pontoppidan, and T. Virtanen, "Low latency sound source separation using convolutional recurrent neural networks," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017, pp. 71–75.
[8] G. Naithani, G. Parascandolo, T. Barker, N. H. Pontoppidan, and T. Virtanen, "Low-latency sound source separation using deep neural networks," in IEEE Global Conference on Signal and Information Processing, 2016, pp. 272–276.
[9] L. Bramsløw, "Preferred signal path delay and high-pass cut-off in open fittings," International Journal of Audiology, vol. 49, no. 9, pp. 634–644, 2010.
[10] J. Hidalgo, "Low latency audio source separation for speech enhancement in cochlear implants," Master's thesis, Universitat Pompeu Fabra, 2012.
[11] J. Agnew and J. M. Thornton, "Just noticeable and objectionable group delays in digital hearing aids," Journal of the American Academy of Audiology, vol. 11, no. 6, pp. 330–336, 2000.
[12] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–19, 2015.
[13] P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press, 2007.
[14] P. G. Shivakumar and P. G. Georgiou, "Perception optimized deep denoising autoencoders for speech enhancement," in INTERSPEECH, 2016, pp. 3743–3747.
[15] A. Kumar and D. Florencio, "Speech enhancement in multiple-noise conditions using deep neural networks," in INTERSPEECH, 2016, pp. 3738–3742.
[16] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "A short-time objective intelligibility measure for time-frequency weighted noisy speech," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 4214–4217.
[17] M. Kolbæk, Z.-H. Tan, and J. Jensen, "Monaural speech enhancement using deep neural networks by maximizing a short-time objective intelligibility measure," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2018.
[18] A. A. Nugraha, A. Liutkus, and E. Vincent, "Multichannel audio source separation with deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 9, pp. 1652–1664, 2016.
[19] J. Jensen and C. H. Taal, "An algorithm for predicting the intelligibility of speech masked by modulated noise maskers," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2009–2022, 2016.
[20] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[21] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
[22] F. Chollet et al., "Keras," 2015. [Online]. Available: https://github.com/fchollet/keras
[23] Theano Development Team, "Theano: A Python framework for fast computation of mathematical expressions," arXiv e-prints, vol. abs/1605.02688, May 2016. [Online]. Available: http://arxiv.org/abs/1605.02688
[24] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[25] F. Weninger, J. R. Hershey, J. Le Roux, and B. Schuller, "Discriminatively trained recurrent neural networks for single-channel speech separation," in IEEE Global Conference on Signal and Information Processing, 2014, pp. 577–581.
[26] Y. Zhao, D. Wang, I. Merks, and T. Zhang, "DNN-based enhancement of noisy and reverberant speech," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 6525–6529.
[27] S. I. Mimilakis, K. Drossos, T. Virtanen, and G. Schuller, "A recurrent encoder-decoder approach with skip-filtering connections for monaural singing voice separation," in IEEE International Workshop on Machine Learning for Signal Processing, 2017.
[28] J. B. Nielsen and T. Dau, "The Danish hearing in noise test," International Journal of Audiology, vol. 50, no. 3, pp. 202–208, 2011.
[29] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM," Neural Computation, pp. 2451–2471, 2000.
[30] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations, 2014.
[31] B. McFee et al., "librosa 0.5.0," Feb. 2017. [Online]. Available: https://doi.org/10.5281/zenodo.293021