1 Introduction
Listening enhancement applications, such as hearing aid processing [1] and audio augmented reality [2], differ from other audio enhancement applications, like teleconferencing and speech recognition, in part because of their strict delay constraints. Since users hear both live and processed signals simultaneously, these systems must process sound with no more than a few milliseconds of delay. Discerning listeners can notice delays as low as 3 ms and are disturbed by delays greater than 10 ms [3]. Listeners with hearing loss can tolerate greater delay, around 20 ms for closed-fitting hearing aids [4] and 6 ms for open-fitting hearing aids [5]. Delays longer than about 30 ms can impair the user’s ability to speak [6].
This delay requirement limits the performance of audio enhancement systems. In single-channel systems, the frequency resolution of a frequency-selective filter generally improves with longer delay. Modern single-microphone audio enhancement algorithms [7], such as those employing time-frequency masks [8] and nonnegative matrix factorization [9], often process speech using short-time Fourier transform (STFT) frames of 60 ms or longer to maximize time-frequency sparsity [8]. These algorithms are effective in many applications, but their delay is too large for listening enhancement.

Multichannel audio enhancement systems use microphone arrays to spatially separate signals [10, 11, 12]. Many multichannel methods are also applied in the STFT domain to more easily model reverberation [12, 13]. In principle, however, spatial processing should require minimal delay: for example, a linear array can enhance a source at broadside with zero delay by simply summing its inputs. Whereas the frequency resolution of a temporal filter depends on its duration, the spatial resolution of an array is determined by its spatial extent. Multichannel listening systems can use both spatial and spectral diversity to separate signals. It is natural to ask, therefore, whether devices with large arrays can enhance audio with lower delay than those with small arrays. That is, can we use array processing to trade space for time?
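The broadside example can be checked numerically: summing the inputs of an array steered at broadside requires no filtering delay, yet reduces uncorrelated sensor noise in proportion to the number of microphones. The following is a minimal sketch with illustrative signal and noise levels:

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_mics = 20000, 8
source = rng.standard_normal(n)               # broadside source arrives at all mics simultaneously
noise = rng.standard_normal((num_mics, n))    # independent unit-power sensor noise

x = source[None, :] + noise                   # observed signals, 0 dB SNR per microphone
y = x.mean(axis=0)                            # zero-delay sum (broadside needs no steering delays)

mse_single = np.mean((x[0] - source) ** 2)    # residual noise power at one mic, about 1
mse_array = np.mean((y - source) ** 2)        # about 1 / num_mics
print(mse_single, mse_array)
```

The averaged output attains roughly an eightfold reduction in noise power with zero processing delay, which is the kind of space-for-time trade discussed above.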
There is a large body of literature on array processing for listening devices, e.g. [14, 15], and causal multichannel filters have been studied in the contexts of dereverberation [16, 17, 18, 19] and noise and echo control [20]. In [21], the authors considered the minimum filter delay required to cover the full aperture of an array. There have also been several proposed low-delay single-microphone filtering and source separation techniques [22, 23, 24]. However, to the best of our knowledge, there has been no prior study of delay-performance tradeoffs in array processing.
Here we approach audio enhancement as a stationary linear estimation problem: given an observed signal from the infinite past up to the present, what is the linear minimum mean square error (MSE) estimate of a desired signal at some fixed time offset? Positive offsets correspond to delay and negative offsets to prediction. Such problems are well understood in the scalar case: for certain signals, we can use spectral factorization to compute exact expressions for the MSE as a function of the delay [25, 26, 27]. For example, Figure 1 shows delay-error curves for separating several spectrally distinct speech-like sounds, which will be described in Section 3. As the delay increases, the MSE decreases from the variance of the target signal to the MSE of a noncausal Wiener filter. We can apply similar theoretical tools in the multivariate case [28, 29] to analyze delay-performance tradeoffs for causal multichannel Wiener filters (CMWF) in terms of the spatial and temporal correlation structures of the source signals. In this work, we will derive a general expression for the MSE performance of a CMWF as a function of the delay, find exact expressions for idealized mixing models, and present experimental results from wearable and distributed microphone arrays in a real room.

2 Delay-Constrained Multichannel Filtering
Consider a mixture of N sources captured by M microphones. Let the sources s_n(t) and the additive noise signals z_m(t) be wide-sense stationary continuous-time random processes that are uncorrelated with each other. Let a_{mn}(t), for m = 1, \dots, M and n = 1, \dots, N, be known causal impulse responses and let w_m(t) be filter impulse responses. Denote the observed signals by x_m(t) and the system output by y(t), where

(1)  x_m(t) = \sum_{n=1}^{N} (a_{mn} * s_n)(t) + z_m(t)

(2)  y(t) = \sum_{m=1}^{M} (w_m * x_m)(t)

and * denotes linear convolution. We define the desired output signal to be the first source as captured by the first microphone—for example, a target talker reproduced at the microphone nearest the listener’s ear—and delayed by time \tau:

(3)  d(t) = (a_{11} * s_1)(t - \tau).
To understand fundamental tradeoffs in performance, we restrict our attention to the best-case scenario in which all signals are stationary in both space and time and have known statistics. Let \mathbf{A}(\omega) be the frequency response matrix corresponding to the impulse responses a_{mn}(t). Let the autocorrelation functions of the sources, noise, observations, and desired signal be given, and let \boldsymbol{\Phi}_{ss}(\omega), \boldsymbol{\Phi}_{zz}(\omega), \boldsymbol{\Phi}_{xx}(\omega), and \Phi_{dd}(\omega) be their respective Fourier transforms. To ensure that the CMWF is well defined, we assume that \boldsymbol{\Phi}_{xx}(\omega) is positive definite at all frequencies of interest. Let \boldsymbol{\phi}_{dx}(\omega) be the Fourier transform of the cross-correlation of d(t) with the observations, in which the target contribution enters through \mathbf{a}_1(\omega), the column of \mathbf{A}(\omega) corresponding to the target source. Let \mathbf{w}(\omega) be the Fourier transform of the filter impulse responses.

2.1 Causal filter performance
The CMWF must satisfy the Wiener-Hopf equation [25],

(4)  \int_{0}^{\infty} \mathbf{w}(\sigma)\, \mathbf{R}_{xx}(t - \sigma)\, d\sigma = \mathbf{r}_{dx}(t), \quad t \ge 0,

where \mathbf{R}_{xx} is the autocorrelation function of the observation vector, \mathbf{r}_{dx} is the cross-correlation of the desired signal with the observations, and \mathbf{w} is constrained to be causal.
The MSE between the output y(t) and the desired signal d(t) is

(5)  \epsilon = \mathbb{E}\left[\, |d(t) - y(t)|^{2} \,\right].
The noncausal (unconstrained) solution to (4) and its error power are readily expressed in the frequency domain:

(6)  \mathbf{w}_{\mathrm{nc}}(\omega) = \boldsymbol{\phi}_{dx}(\omega)\, \boldsymbol{\Phi}_{xx}^{-1}(\omega)

(7)  \epsilon_{\mathrm{nc}} = \frac{1}{2\pi} \int_{-\infty}^{\infty} \left( \Phi_{dd}(\omega) - \boldsymbol{\phi}_{dx}(\omega)\, \boldsymbol{\Phi}_{xx}^{-1}(\omega)\, \boldsymbol{\phi}_{dx}^{H}(\omega) \right) d\omega.
For finite delay, we can solve (4) by first decomposing \boldsymbol{\Phi}_{xx}(\omega) into its spectral factors [28],

(8)  \boldsymbol{\Phi}_{xx}(\omega) = \boldsymbol{\Phi}^{+}(\omega)\, \boldsymbol{\Phi}^{+H}(\omega),

where \boldsymbol{\Phi}^{+}(\omega) and its inverse are both causal. We proceed by decorrelating the observations using \boldsymbol{\Phi}^{+\,-1}(\omega) and then solving (4) for the decorrelated signals [29] to find the causal filter

(9)  \mathbf{w}(\omega) = \left[\, \boldsymbol{\phi}_{dx}(\omega)\, \boldsymbol{\Phi}^{+\,-H}(\omega) \,\right]_{+} \boldsymbol{\Phi}^{+\,-1}(\omega),
where [\cdot]_{+} denotes the causal part of the argument, that is, time-domain truncation to t \ge 0. Let \mathbf{g}(\omega) = \boldsymbol{\phi}_{dx}(\omega)\, \boldsymbol{\Phi}^{+\,-H}(\omega). For the listening enhancement application, this vector can be written

(10)  \mathbf{g}(\omega) = e^{-j\omega\tau}\, A_{11}(\omega)\, \Phi_{s_1 s_1}(\omega)\, \mathbf{a}_{1}^{H}(\omega)\, \boldsymbol{\Phi}^{+\,-H}(\omega).
Let \tilde{\mathbf{g}}(t) be the inverse Fourier transform of e^{j\omega\tau}\,\mathbf{g}(\omega), the delay-free part of (10). Substituting from (9) into (5), using the spectral factorization (8) and Parseval’s identity, and rearranging terms [27], we can show that

(11)  \epsilon(\tau) = \epsilon_{\mathrm{nc}} + \int_{-\infty}^{-\tau} \left\| \tilde{\mathbf{g}}(t) \right\|^{2} dt.

Thus, the error penalty due to causality is the energy in \tilde{\mathbf{g}}(t) for t < -\tau. Our goal is to understand how this penalty depends on the spatial and spectral characteristics of the source signals. While multivariate spectral factorizations are often difficult to compute in practice [30], we can find exact expressions for certain special cases that provide insight about the delay-constrained array processing problem.
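Although closed-form factorizations are rarely available, the scalar delay-error curve can be approximated numerically by solving finite-length Wiener normal equations for each candidate delay. The sketch below assumes an illustrative unit-variance AR(1) target observed in white noise; the pole location, noise power, and filter length are arbitrary choices, not values from the paper:

```python
import numpy as np

a, noise_var, L = 0.9, 0.5, 64        # AR(1) pole, observation noise power, FIR length

def r_ss(k):                          # autocorrelation of a unit-variance AR(1) process
    return a ** abs(k)

# Observation x[t] = s[t] + v[t]; correlation matrix of the L most recent samples.
R = np.array([[r_ss(i - j) for j in range(L)] for i in range(L)]) + noise_var * np.eye(L)

mse = []
for d in range(16):                   # estimate s[t-d] from x[t], ..., x[t-L+1]
    p = np.array([r_ss(d - k) for k in range(L)])
    w = np.linalg.solve(R, p)         # finite-length Wiener-Hopf normal equations
    mse.append(r_ss(0) - w @ p)       # resulting minimum MSE at delay d
print(np.round(mse, 3))
```

As the delay grows, the MSE falls monotonically from near the signal variance toward the noncausal limit, mirroring the delay-error curves discussed above.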
2.2 Uniform linear array
First, consider a plane wave incident upon a uniform linear array of sensors with the reference at one end. Let the time difference of arrival (TDOA) between adjacent microphones be fixed, and let the source and noise spectra be given, so that
(12)  
(13) 
A convenient spectral factor is the lower triangular matrix
(14) 
where the remaining scalar factor is determined by the factorization. Applying (10) and taking the inverse Fourier transform, we have
(15) 
Finally, from (11), the MSE is
(16)  
(17) 
where the step index counts the microphones that the plane wave has reached within the allowed delay. Thus, the error is reduced for each microphone that the source reaches within the allowed delay of reaching the reference. The delay-error curve is a piecewise constant function with steps whose width equals the adjacent-microphone TDOA and whose decreasing heights depend on the signal and noise statistics.
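This stepwise behaviour can be reproduced in a discrete-time simulation: with an integer-sample TDOA between adjacent microphones, a least-squares FIR multichannel Wiener filter shows an error drop each time the allowed delay crosses another microphone's arrival time. The array geometry, noise level, and filter length below are illustrative assumptions (endfire incidence, with the reference microphone nearest the source):

```python
import numpy as np

rng = np.random.default_rng(1)
n, M, tau, L = 40000, 4, 5, 32        # samples, mics, adjacent-mic TDOA (samples), FIR length
s = rng.standard_normal(n)            # white plane-wave source signal

# Endfire plane wave: microphone m hears the source tau*m samples after the reference.
X = np.stack([np.roll(s, m * tau) for m in range(M)])
X = X + 0.3 * rng.standard_normal(X.shape)       # independent sensor noise

def causal_mse(d):
    """MSE of a causal FIR multichannel Wiener filter estimating s[t-d]."""
    T = np.arange(L + M * tau + d, n)            # indices clear of wrap-around edges
    A = np.stack([X[m, T - k] for m in range(M) for k in range(L)], axis=1)
    y = s[T - d]
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.mean((A @ w - y) ** 2)

errs = [causal_mse(d) for d in (0, tau, 2 * tau, 3 * tau)]
print(np.round(errs, 3))
```

Each additional TDOA of delay lets the filter combine one more microphone's copy of the source, so the error steps down roughly as the reciprocal of the number of microphones reached.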
2.3 Two-source, two-microphone separation
We can follow a similar procedure with multiple sources. Consider a scenario with two plane wave sources and two microphones. Let each source have its own TDOA between the microphones and let the source and noise spectra be given as before, so that
(18)  
(19) 
The determinant of the observed power spectral density matrix can be written
(20) 
where the leading factor is a scalar that depends only on frequency. The spectral factorization takes different forms depending on the signs of the two TDOAs, but always includes a term whose causal inverse has infinite duration, which results in an infinite-duration filter response. Applying (11), we find that
(21) 
where , ,
(22) 
(23) 
This delay-error curve is also piecewise constant, but has a geometric “tail” whose decay rate is set by the difference in TDOAs. The height of the steps is determined by the signal statistics and the width is determined by the difference in TDOAs, which depends on the distance between the sources. For large positive delays, the error approaches the noncausal MSE.
Figure 2(a) shows delay-error curves for four combinations of source placement. The causality penalty takes a different form depending on the relative placement of sources and microphones. For example, if both the target and the interference source are closer to microphone 1 than to microphone 2 (near/near), then the second microphone does not contribute any information at zero delay. If the sources are on opposite sides, then the difference in TDOAs is larger, and the tail therefore decays more slowly.
2.4 Temporally correlated signals
The expressions above were derived for temporally white source and noise processes. In many applications, however, the signals of interest are temporally correlated and can therefore be separated spectrally as well as spatially. It is difficult in general to predict the effects of signal correlation on the delay-error curve. However, if the entries of the observed power spectral density matrix share a common scalar spectral factor—for example, if the sources are identically distributed and are recorded by identical microphones—then the matrix spectral factor can be written as the product of the scalar spectral factor of the common term and the factor of the remaining spectrally flat problem. Then we have
(24)  
(25) 
Since the scalar spectral factor is causal, it spreads the energy of the corresponding time-domain response forward in time. Figure 2(b) shows the same scenario as in the previous section, but with identically distributed speech-shaped sources. The error is lower overall, the steps are smoother, and the filter can begin to separate the signals even before they reach either microphone.
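The last observation, that temporally correlated signals can be estimated even before they arrive, can be illustrated with a scalar prediction experiment. The sketch below compares a white source with an illustrative unit-variance AR(1) ("colored") source at the same SNR; the pole, noise level, and lead time are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
n, L, a, lead = 50000, 64, 0.9, 2     # samples, FIR length, AR(1) pole, prediction lead

def prediction_mse(s):
    """Relative MSE of an FIR Wiener predictor of s[t+lead] from noisy past observations."""
    x = s + 0.3 * rng.standard_normal(n)
    T = np.arange(L, n - lead)
    A = np.stack([x[T - k] for k in range(L)], axis=1)
    y = s[T + lead]
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.mean((A @ w - y) ** 2) / np.var(s)

white = rng.standard_normal(n)
drive = np.sqrt(1 - a ** 2) * rng.standard_normal(n)
colored = np.zeros(n)                 # unit-variance AR(1) process
for t in range(1, n):
    colored[t] = a * colored[t - 1] + drive[t]

mse_white = prediction_mse(white)
mse_colored = prediction_mse(colored)
print(round(mse_white, 3), round(mse_colored, 3))
```

The white source cannot be predicted at all (relative MSE near 1), while the colored source can be estimated ahead of time with substantially lower error, consistent with the smoothed, lower curves in Figure 2(b).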
3 Experiments
To evaluate delay-performance tradeoffs in realistic conditions, we recorded audio mixtures using a wearable microphone array in a cocktail party scenario at the Augmented Listening Laboratory at the University of Illinois at Urbana-Champaign, which has a reverberation time of around ms. The recording setup, shown in Figure 3, consisted of twenty omnidirectional lavalier microphones: two at the left and right ears of a mannequin “listener,” six along the perimeter of a hat with radius 30 cm, and twelve mounted on stands at 60 cm and 120 cm distances from the listener. The reference microphone is the one in the left ear. Source signals were produced by loudspeakers two meters away from the listener. The acoustic impulse responses between the loudspeakers and microphones were measured using linear sweeps. All data was sampled at 16 kHz.
The signals were separated using the discrete-time, finite-length version of the CMWF. Let \mathbf{x}[t] and \mathbf{h} be stacked vectors of the sampled multichannel signals and the finite impulse response filter coefficients, respectively. Let y[t] = \mathbf{h}^{T}\mathbf{x}[t] be the filter output sequence and let d[t] be the desired output sequence. Let \mathbf{R} = \mathrm{E}\left[\mathbf{x}[t]\,\mathbf{x}^{T}[t]\right] and \mathbf{r} = \mathrm{E}\left[\mathbf{x}[t]\, d[t]\right], where \mathrm{E} is expectation. The linear minimum MSE filter coefficients are [10]

(26)  \mathbf{h} = \mathbf{R}^{-1}\mathbf{r}.
In our experiments, the correlation matrix was computed using truncated impulse response measurements. We applied diagonal loading comparable to the source power to account for modeling errors and ambient noise. We used discrete-time filters with length 2048 samples (128 ms). For each experiment we report the sample MSE relative to the source power, that is, the residual error power divided by the power of the desired signal.
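The finite-length filter of (26) can be computed directly from data. The sketch below builds the stacked correlation matrix and cross-correlation vector from a simulated two-source, three-microphone convolutive mixture, applies diagonal loading, and solves for the filter. The random mixing filters, noise level, loading factor, and lengths are illustrative stand-ins for the measured quantities used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(2)
n, M, L, d = 30000, 3, 16, 4           # samples, mics, taps per mic, delay (samples)
s = rng.standard_normal(n)             # target source
u = rng.standard_normal(n)             # interfering source

# Illustrative random 8-tap causal mixing filters (stand-ins for measured responses).
amix = rng.standard_normal((M, 8))
bmix = rng.standard_normal((M, 8))
X = np.stack([np.convolve(s, amix[m])[:n] + np.convolve(u, bmix[m])[:n] for m in range(M)])
X = X + 0.1 * rng.standard_normal(X.shape)       # ambient sensor noise
target = np.convolve(s, amix[0])[:n]             # target source as captured at the reference mic

T = np.arange(L + d, n)
A = np.stack([X[m, T - k] for m in range(M) for k in range(L)], axis=1)  # stacked past samples
y = target[T - d]                      # desired output: reference capture, delayed by d

R = A.T @ A / len(T)                   # sample spatiotemporal correlation matrix
p = A.T @ y / len(T)                   # sample cross-correlation vector
R = R + 1e-3 * np.trace(R) / R.shape[0] * np.eye(R.shape[0])   # diagonal loading
h = np.linalg.solve(R, p)              # finite-length CMWF coefficients
rel_mse = np.mean((A @ h - y) ** 2) / np.mean(y ** 2)
print(round(rel_mse, 3))
```

With more microphones than sources, the filter can cancel the interference almost entirely, and the relative MSE is far below unity even at a short delay.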
Figure 4(a) shows delay-error curves for four simultaneous talkers using arrays of up to eight microphones at varying distances. The speech signals were twenty-second clips taken from the VCTK dataset [31] and the filters were designed using a single approximate long-term average speech autocorrelation. Because we model the signals as identically distributed, the filters must rely on spatial rather than spectral diversity to separate the sources. As the radius of the array increases, the curves move downward and to the left, indicating that the larger-aperture arrays can achieve similar performance with lower delay compared to the smaller-aperture arrays. In fact, since the source signals reach the microphone stands several milliseconds before they reach the listener, the system could operate with negative delay.
The two-channel filter performs poorly in this experiment because it does not take advantage of the time-frequency sparsity of speech signals, which many speech enhancement algorithms exploit. To account for the benefits of sparsity within the stationary estimation framework of this paper, we repeated the experiment with four stationary speech-like sounds generated using the Vocaloid music synthesis software. Each ten-second source signal represents a different vowel sung in a different key. Although the signals are deterministic and periodic, the filters were designed based on 50 ms von Hann-windowed autocorrelation sequences. Figure 1 shows the delay-error curves for single-channel mixtures of these sources and Figure 4(b) compares the separation performance of multichannel filters with different array sizes. Because the sources are approximately disjoint in the frequency domain, a one- or two-channel filter can separate them effectively, but requires a delay to do so. The larger microphone arrays also benefit from longer delay, but perform better for small delays. For example, the performance of the hat-mounted array with zero delay matches that of the binaural microphones with about 10 ms delay, which would be perceptible to many listeners.
4 Conclusions
The theoretical and experimental results presented here suggest that larger arrays can separate sound sources with lower delay and that the delay-performance tradeoff depends on both the spatial and temporal correlation structure of the observed signals. When microphones are located between the listener and sound sources, those sensors receive the signals before the listening device, shifting the delay-performance curve to the left. Arrays also provide spatial gain, which improves overall performance regardless of delay. When signals are spectrally distinct, a single-channel filter could separate them effectively given a long enough delay, but an array can achieve the same performance with little or no delay.
Much remains to be understood about delay-constrained array processing. For example, equations (10) and (11) tell us little in general about the effects of reverberation and signal spectra on delay. Furthermore, because many signals of interest are nonstationary, we must also consider time-varying causal array processing. Finally, to realize the benefits of spatial diversity in delay-constrained listening enhancement, listening devices must use larger microphone arrays than they do today. Large wearable and distributed arrays could allow us to apply stronger noise reduction while meeting the strict delay constraints of real-time listening applications.
References
 [1] J. M. Kates, Digital Hearing Aids. Plural Publishing, 2008.
 [2] V. Välimäki, A. Franck, J. Rämö, H. Gamper, and L. Savioja, “Assisted listening using a headset: Enhancing audio perception in real, augmented, and virtual environments,” IEEE Signal Processing Magazine, vol. 32, no. 2, pp. 92–99, 2015.
 [3] J. Agnew and J. M. Thornton, “Just noticeable and objectionable group delays in digital hearing aids,” Journal of the American Academy of Audiology, vol. 11, no. 6, pp. 330–336, 2000.
 [4] M. A. Stone and B. C. Moore, “Tolerable hearing aid delays. I. Estimation of limits imposed by the auditory path alone using simulated hearing losses,” Ear and Hearing, vol. 20, no. 3, pp. 182–192, 1999.
 [5] M. A. Stone, B. C. Moore, K. Meisenbacher, and R. P. Derleth, “Tolerable hearing aid delays. V. Estimation of limits for open canal fittings,” Ear and Hearing, vol. 29, no. 4, pp. 601–617, 2008.
 [6] M. A. Stone and B. C. Moore, “Tolerable hearing aid delays. II. Estimation of limits imposed during speech production,” Ear and Hearing, vol. 23, no. 4, pp. 325–338, 2002.
 [7] S. Makino, ed., Audio Source Separation. Springer, 2018.
 [8] O. Yilmaz and S. Rickard, “Blind separation of speech mixtures via time-frequency masking,” IEEE Transactions on Signal Processing, vol. 52, no. 7, pp. 1830–1847, 2004.
 [9] A. Ozerov and C. Févotte, “Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 550–563, 2010.
 [10] J. Benesty, J. Chen, and Y. Huang, Microphone Array Signal Processing, vol. 1. Springer, 2008.
 [11] M. Brandstein and D. Ward, Microphone Arrays: Signal Processing Techniques and Applications. Springer, 2013.
 [12] S. Gannot, E. Vincent, S. MarkovichGolan, and A. Ozerov, “A consolidated perspective on multimicrophone speech enhancement and source separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 692–730, 2017.
 [13] M. S. Pedersen, J. Larsen, U. Kjems, and L. C. Parra, “Convolutive blind source separation methods,” in Springer Handbook of Speech Processing, pp. 1065–1094, Springer, 2008.
 [14] S. Doclo, W. Kellermann, S. Makino, and S. E. Nordholm, “Multichannel signal enhancement algorithms for assisted listening devices,” IEEE Signal Processing Magazine, vol. 32, no. 2, pp. 18–30, 2015.
 [15] S. Doclo, S. Gannot, M. Moonen, and A. Spriet, “Acoustic beamforming for hearing aid applications,” in Handbook on Array Processing and Sensor Networks (S. Haykin and K. R. Liu, eds.), pp. 269–302, Wiley, 2008.
 [16] P. Naylor and N. D. Gaubitch, Speech Dereverberation. Springer, 2010.
 [17] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.H. Juang, “Blind speech dereverberation with multichannel linear prediction based on short time Fourier transform representation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 85–88, 2008.
 [18] B. Schwartz, S. Gannot, and E. Habets, “Online speech dereverberation using Kalman filter and EM algorithm,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 23, no. 2, pp. 394–406, 2015.
 [19] J. Benesty, J. Chen, Y. Huang, and J. Dmochowski, “On microphone-array beamforming from a MIMO acoustic signal processing perspective,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 1053–1065, 2007.
 [20] E. Hänsler and G. Schmidt, Acoustic Echo and Noise Control: A Practical Approach. Wiley, 2005.
 [21] J. Chen, J. Benesty, and Y. Huang, “A minimum distortion noise reduction algorithm with multiple microphones,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 3, pp. 481–493, 2008.
 [22] J. M. Kates and K. H. Arehart, “Multichannel dynamic-range compression using digital frequency warping,” EURASIP Journal on Applied Signal Processing, vol. 2005, pp. 3003–3014, 2005.
 [23] H. W. Löllmann and P. Vary, “Low delay noise reduction and dereverberation for hearing aids,” EURASIP Journal on Advances in Signal Processing, vol. 2009, no. 1, p. 437807, 2009.
 [24] T. Barker, T. Virtanen, and N. H. Pontoppidan, “Low-latency sound-source-separation using non-negative matrix factorisation with coupled analysis and synthesis dictionaries,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 241–245, 2015.
 [25] N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series. Wiley, 1949.
 [26] H. W. Bode and C. E. Shannon, “A simplified derivation of linear least square smoothing and prediction theory,” Proceedings of the IRE, vol. 38, no. 4, pp. 417–425, 1950.
 [27] H. L. Van Trees, Detection, Estimation, and Modulation Theory, Part I. Wiley, 2004.
 [28] N. Wiener and P. Masani, “The prediction theory of multivariate stochastic processes, II,” Acta Mathematica, vol. 99, no. 1, pp. 93–137, 1958.
 [29] E. Wong and J. Thomas, “On the multidimensional prediction and filtering problem and the factorization of spectral matrices,” Journal of the Franklin Institute, vol. 272, no. 2, pp. 87–99, 1961.
 [30] V. Kucera, “Factorization of rational spectral matrices: a survey of methods,” in International Conference on Control, pp. 1074–1078, 1991.
 [31] C. Veaux, J. Yamagishi, and K. MacDonald, “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” 2017.