Listening enhancement applications, such as hearing aid processing [1] and audio augmented reality [2], differ from other audio enhancement applications, like teleconferencing and speech recognition, in part because of their strict delay constraints. Since users hear both live and processed signals simultaneously, these systems must process sound with no more than a few milliseconds of delay. Discerning listeners can notice delays as low as 3 ms and are disturbed by delays greater than 10 ms [3]. Listeners with hearing loss can tolerate greater delay, around 20 ms for closed-fitting hearing aids [4] and 6 ms for open-fitting hearing aids [5]. Delays longer than about 30 ms can impair the user’s ability to speak [6].
This delay requirement limits the performance of audio enhancement systems. In single-channel systems, the frequency resolution of a frequency-selective filter generally improves with longer delay. Modern single-microphone audio enhancement algorithms [7], such as those employing time-frequency masks [8] and non-negative matrix factorization [9], often process speech using short-time Fourier transform (STFT) frames of 60 ms or longer to maximize time-frequency sparsity. These algorithms are effective in many applications, but their delay is too large for listening enhancement.
Multichannel audio enhancement systems use microphone arrays to spatially separate signals [10, 11, 12]. Many multichannel methods are also applied in the STFT domain to more easily model reverberation [12, 13]. In principle, however, spatial processing should require minimal delay: for example, a linear array can enhance a source at broadside with zero delay by simply summing its inputs. Whereas the frequency resolution of a temporal filter depends on its duration, the spatial resolution of an array is determined by its spatial extent. Multichannel listening systems can use both spatial and spectral diversity to separate signals. It is natural to ask, therefore, whether devices with large arrays can enhance audio with lower delay than those with small arrays. That is, can we use array processing to trade space for time?
There is a large body of literature on array processing for listening devices, e.g. [14, 15], and causal multichannel filters have been studied in the contexts of dereverberation [16, 17, 18, 19] and noise and echo control [20]. In [21], the authors considered the minimum filter delay required to cover the full aperture of an array. There have also been several proposed low-delay single-microphone filtering and source separation techniques [22, 23, 24]. However, to the best of our knowledge, there has been no prior study of delay-performance tradeoffs in array processing.
Here we approach audio enhancement as a stationary linear estimation problem: given an observed signal from the infinite past to time $t$, what is the linear minimum mean square error (MSE) estimate of a desired signal at time $t - \alpha$? Positive values of $\alpha$ correspond to delay and negative values to prediction. Such problems are well understood in the scalar case: for certain signals, we can use spectral factorization to compute exact expressions for the MSE as a function of $\alpha$ [25, 26, 27]. For example, Figure 1 shows delay-error curves for separating several spectrally distinct speechlike sounds, which will be described in Section 3. As $\alpha$ increases, the MSE decreases from the variance of the target signal to the MSE of a noncausal Wiener filter. We can apply similar theoretical tools in the multivariate case [28, 29] to analyze delay-performance tradeoffs for causal multichannel Wiener filters (CMWF) in terms of the spatial and temporal correlation structures of the source signals. In this work, we will derive a general expression for the MSE performance of a CMWF as a function of $\alpha$, find exact expressions for idealized mixing models, and present experimental results from wearable and distributed microphone arrays in a real room.
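The scalar delay-error tradeoff can be illustrated numerically. The sketch below is a discrete-time toy model and is not taken from the paper: it assumes a unit-variance AR(1) target (autocorrelation `a**|k|`) observed in white noise, and it replaces the infinite-past Wiener filter with a finite-length FIR filter of `L` taps. The function name `scalar_delay_error` and all parameter values are hypothetical.

```python
import numpy as np

def scalar_delay_error(alpha, a=0.9, var_z=1.0, L=64):
    """MSE of a length-L FIR Wiener estimate of s[t - alpha] from x[t], ..., x[t-L+1].

    Assumed toy model: s is a unit-variance AR(1) process with autocorrelation
    a**|k|, and x = s + z where z is white noise with variance var_z.
    """
    k = np.arange(L)
    # Covariance of the observation window: E[x[t-i] x[t-j]]
    R = a ** np.abs(k[:, None] - k[None, :]) + var_z * np.eye(L)
    # Cross-correlation between each observation and the target: E[x[t-k] s[t-alpha]]
    r = a ** np.abs(k - alpha)
    # Minimum MSE = target variance minus the power explained by the filter
    return 1.0 - r @ np.linalg.solve(R, r)

# Negative alpha is prediction, positive alpha is delay (in samples).
alphas = np.arange(-20, 21)
curve = np.array([scalar_delay_error(al) for al in alphas])
```

Under these assumptions the curve starts near the target variance for strongly negative `alpha` and flattens toward the smoothing error as `alpha` grows, which is the qualitative shape described for Figure 1.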
2 Delay-Constrained Multichannel Filtering
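Consider a mixture of $N$ sources captured by $M$ microphones. Let the sources $s_1(t), \dots, s_N(t)$ and additive noise $z_1(t), \dots, z_M(t)$ be wide-sense stationary continuous-time random processes that are uncorrelated with each other. Let $a_{m,n}(t)$, $m = 1, \dots, M$, $n = 1, \dots, N$, be known causal impulse responses and let $w_m(t)$ be filter impulse responses. Denote the observed signals by $x_m(t)$ and the system output by $y(t)$, where
$$x_m(t) = \sum_{n=1}^{N} (a_{m,n} * s_n)(t) + z_m(t), \tag{1}$$
$$y(t) = \sum_{m=1}^{M} (w_m * x_m)(t), \tag{2}$$
and $*$ denotes linear convolution. We define the desired output signal $d(t)$ to be the first source as captured by the first microphone (for example, a target talker reproduced at the microphone nearest the listener’s ear) and delayed by time $\alpha$:
$$d(t) = (a_{1,1} * s_1)(t - \alpha). \tag{3}$$
To understand fundamental tradeoffs in performance, we restrict our attention to the best-case scenario in which all signals are stationary in both space and time and have known statistics. Let $x(t)$ and $w(t)$ denote the length-$M$ vectors of observed signals and filter impulse responses, and let $A(\omega)$ be the $M \times N$ frequency response matrix corresponding to the $a_{m,n}$’s. Let $r_s(\tau)$, $r_z(\tau)$, $r_x(\tau)$, and $r_d(\tau)$ be the autocorrelation functions of the corresponding random processes and let $R_s(\omega)$, $R_z(\omega)$, $R_x(\omega)$, and $R_d(\omega)$ be their respective Fourier transforms. To ensure that the CMWF is well defined, we assume that $R_x(\omega)$ is positive definite for all $\omega$ of interest. Let $r_{dx}(\tau) = E[d(t+\tau)\, x^T(t)]$ be the cross-correlation of $d$ with $x$ and let $R_{dx}(\omega) = e^{-j\omega\alpha} A_{1,1}(\omega) R_{s_1}(\omega)\, a_1^H(\omega)$ be its Fourier transform, where $a_1(\omega)$ is the column of $A(\omega)$ corresponding to the target source and $R_{s_1}(\omega)$ is the power spectral density of $s_1$. Let $W(\omega)$ be the Fourier transform of $w(t)$.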
2.1 Causal filter performance
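The CMWF must satisfy the Wiener-Hopf equation,
$$r_{dx}(\tau) = \int_0^\infty w^T(u)\, r_x(\tau - u)\, du, \quad \tau \ge 0, \tag{4}$$
where $r_x(\tau)$ is the autocorrelation of the observed signals and $r_{dx}(\tau)$ is the cross-correlation of the desired signal $d$ with the observations $x$. The MSE between $y(t)$ and $d(t)$ is
$$\mathrm{MSE}(\alpha) = r_d(0) - \int_0^\infty r_{dx}(u)\, w(u)\, du. \tag{5}$$
The noncausal ($\alpha \to \infty$) solution to (4) and its error power are readily expressed in the frequency domain:
$$W_{\mathrm{nc}}^T(\omega) = R_{dx}(\omega) R_x^{-1}(\omega), \tag{6}$$
$$\mathrm{MSE}_{\mathrm{nc}} = \frac{1}{2\pi} \int_{-\infty}^{\infty} \left[ R_d(\omega) - R_{dx}(\omega) R_x^{-1}(\omega) R_{dx}^H(\omega) \right] d\omega. \tag{7}$$
For finite $\alpha$, the causal solution can be expressed in terms of a spectral factorization
$$R_x(\omega) = G(\omega) G^H(\omega), \tag{8}$$
in which $G(\omega)$ and $G^{-1}(\omega)$ are both causal:
$$W^T(\omega) = \left[ R_{dx}(\omega) G^{-H}(\omega) \right]_+ G^{-1}(\omega), \tag{9}$$
where $[\cdot]_+$ denotes the causal part of the argument, that is, time-domain truncation to $t \ge 0$. Let $B(\omega) = R_{dx}(\omega) G^{-H}(\omega)$. For the listening enhancement application, this vector can be written
$$B(\omega) = e^{-j\omega\alpha} A_{1,1}(\omega) R_{s_1}(\omega)\, a_1^H(\omega)\, G^{-H}(\omega). \tag{10}$$
Let $b(t)$ denote the inverse Fourier transform of $B(\omega)$ evaluated at $\alpha = 0$, so that a delay of $\alpha$ shifts it to $b(t - \alpha)$. The MSE of the CMWF is then
$$\mathrm{MSE}(\alpha) = \mathrm{MSE}_{\mathrm{nc}} + \int_{-\infty}^{-\alpha} \| b(t) \|^2\, dt. \tag{11}$$
Thus, the error penalty due to causality is the energy in $b(t)$ for $t < -\alpha$. Our goal is to understand how $b(t)$ depends on the spatial and spectral characteristics of the source signals. While multivariate spectral factorizations are often difficult to compute in practice [30], we can find exact expressions for certain special cases that provide insight about the delay-constrained array processing problem.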
2.2 Uniform linear array
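First, consider a plane wave incident upon a uniform linear array of $M$ sensors with the reference at one end. Let $\tau$ be the time difference of arrival (TDOA) between adjacent microphones, let $R_s(\omega) = \sigma_s^2$ and let $R_z(\omega) = \sigma_z^2 I$, so that
$$R_x(\omega) = \sigma_s^2\, a(\omega) a^H(\omega) + \sigma_z^2 I, \qquad a(\omega) = \left[ 1, e^{-j\omega\tau}, \dots, e^{-j\omega(M-1)\tau} \right]^T.$$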
A convenient spectral factor is the lower triangular matrix
where . Applying (10) and taking the inverse Fourier transform, we have
Finally, from (11), the MSE is
where $u(t) = 1$ if $t \ge 0$ and $u(t) = 0$ if $t < 0$. Thus, the error is reduced for each microphone that the source reaches within time $\alpha$ of reaching the reference. The delay-error curve is a piecewise constant function with steps of width $\tau$ and decreasing heights that depend on the signal-to-noise ratio $\sigma_s^2 / \sigma_z^2$.
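The staircase behavior can be checked on a discrete-time analogue of this scenario. The sketch below rests on several assumptions that are not from the paper: the TDOA is an integer number of samples `D`, the wavefront reaches the reference microphone first, the source and noise are white, and the causal filter is a finite FIR filter with `L` taps per channel. Under those assumptions, if $K$ microphones have received the target sample within the allowed delay, the filter sees $K$ independent noisy copies of it, which suggests the closed form $\sigma_s^2 / (1 + K \sigma_s^2 / \sigma_z^2)$. The function names are hypothetical.

```python
import numpy as np

def staircase_mse(alpha, M, D, var_s=1.0, var_z=1.0):
    """Closed form for the toy model: K noisy copies of the target sample are available."""
    # Count microphones the wavefront reaches within alpha samples of the reference.
    K = sum(1 for m in range(M) if m * D <= alpha)
    return var_s / (1.0 + K * var_s / var_z)

def fir_wiener_mse(alpha, M, D, L=16, var_s=1.0, var_z=1.0):
    """MSE of the FIR multichannel Wiener filter for the same toy model."""
    # Tap k of microphone m observes the source sample s[t - (m*D + k)].
    lags = np.array([m * D + k for m in range(M) for k in range(L)])
    # Observations sharing the same source sample are correlated; noise is white.
    R = var_s * (lags[:, None] == lags[None, :]) + var_z * np.eye(M * L)
    # Only taps that contain the target sample s[t - alpha] correlate with it.
    r = var_s * (lags == alpha)
    return var_s - r @ np.linalg.solve(R, r)

# Compare the two over a range of delays (valid while alpha <= L - 1).
for alpha in range(-2, 12):
    assert abs(fir_wiener_mse(alpha, M=4, D=3) - staircase_mse(alpha, M=4, D=3)) < 1e-9
```

In this toy model the step added by the $m$-th microphone has height $\sigma_s^2 \rho / ((1 + (m-1)\rho)(1 + m\rho))$ with $\rho = \sigma_s^2 / \sigma_z^2$, so each additional microphone helps less than the previous one.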
2.3 Two-source, two-microphone separation
We can follow a similar procedure with multiple sources. Consider a scenario with two plane wave sources and two microphones. Let and be the TDOAs of the sources, let and let with , so that
The determinant of can be written
where is a scalar that depends only on . The spectral factorization of takes different forms depending on the signs of and , but always includes a term of the form , which results in an infinite-duration . Applying (11), we find that
where , ,
This delay-error curve is also piecewise constant, but has a geometric “tail” that decays with a rate of roughly . The height of the steps is determined by and the width is determined by , which depends on the distance between the sources. For large positive , approaches .
Figure 2(a) shows the delay-error curve for four combinations of source placement. The causality penalty takes a different form depending on the relative placement of sources and microphones. For example, if both the target and interference source are closer to microphone 1 than microphone 2 (near/near), then the second microphone does not contribute any information at $\alpha = 0$. If the sources are on opposite sides, then the difference in TDOAs is larger, and therefore the error decays more slowly.
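A curve of this kind can be approximated numerically for a discrete-time analogue. The sketch below is not the paper's closed-form result: it assumes integer propagation delays in samples, white unit-variance sources, white sensor noise, and a finite FIR filter in place of the infinite-past CMWF. The function `delay_error_curve` and the example delay table are hypothetical.

```python
import numpy as np

def delay_error_curve(delays, alphas, L=64, var_z=0.1):
    """MSE of a length-L FIR Wiener filter versus target delay alpha (in samples).

    delays[m][n] is the assumed integer, nonnegative propagation delay from
    source n to microphone m. The target is source 0 as heard at microphone 0.
    """
    delays = np.asarray(delays)
    M, N = delays.shape
    taps = np.arange(L)
    # lag[n, m*L + k]: age of the source-n sample contained in x_m[t - k]
    lag = np.stack([(delays[:, n][:, None] + taps[None, :]).ravel() for n in range(N)])
    R = var_z * np.eye(M * L)
    for n in range(N):
        # Two observations are correlated when they contain the same source sample.
        R += (lag[n][:, None] == lag[n][None, :])
    mse = []
    for alpha in alphas:
        # Desired signal d[t] = s_0[t - alpha - delays[0, 0]]
        r = (lag[0] == alpha + delays[0, 0]).astype(float)
        mse.append(1.0 - r @ np.linalg.solve(R, r))
    return np.array(mse)

# Target nearer microphone 0, interferer nearer microphone 1 (opposite sides).
alphas = np.arange(-5, 31)
curve = delay_error_curve([[0, 2], [3, 0]], alphas)
```

With these assumed delays the target cannot be estimated at all before it reaches the reference microphone, and the error drops again once the delay is long enough for the second microphone to have received the same target sample.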
2.4 Temporally correlated signals
The expressions above were derived for uncorrelated source and noise processes. In many applications, however, the signals of interest are correlated and can therefore be separated spectrally as well as spatially. It is difficult in general to predict the effects of signal correlation on the delay-error curve. However, if the entries of share a common spectral factor—for example, if the sources are identically distributed and are recorded by identical microphones—then we can write and , where is the scalar spectral factorization of the common factor. Then we have
Since is causal, it spreads the energy of forward in time. Figure 2(b) shows the same scenario as in the previous section, but with identically distributed speech-shaped sources. The error is lower overall, the steps are smoother, and the filter can begin to separate the signals even before they reach either microphone.
3 Experiments
To evaluate delay-performance tradeoffs in realistic conditions, we recorded audio mixtures using a wearable microphone array in a cocktail party scenario at the Augmented Listening Laboratory at the University of Illinois at Urbana-Champaign, which has a reverberation time of around ms. The recording setup, shown in Figure 3, consisted of twenty omnidirectional lavalier microphones: two at the left and right ears of a mannequin “listener,” six along the perimeter of a hat with radius 30 cm, and twelve mounted on stands at 60 cm and 120 cm distances from the listener. The reference microphone is the one in the left ear. Source signals were produced by loudspeakers two meters away from the listener. The acoustic impulse responses between the loudspeakers and microphones were measured using linear sweeps. All data was sampled at 16 kHz.
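The signals were separated using the discrete-time, finite-length version of the CMWF. Let $\bar{x}[t]$ and $\bar{w}$ be stacked vectors of the sampled multichannel signals and the finite impulse response filter coefficients, respectively. Let $y[t] = \bar{w}^T \bar{x}[t]$ be the filter output sequence and let $d[t]$ be the desired output sequence. Let $R_{\bar{x}} = E[\bar{x}[t]\, \bar{x}^T[t]]$ and $r_{\bar{x}d} = E[\bar{x}[t]\, d[t]]$, where $E$ is expectation. The linear minimum MSE filter coefficients are
$$\bar{w} = R_{\bar{x}}^{-1}\, r_{\bar{x}d}.$$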
In our experiments, $R_{\bar{x}}$ was computed using truncated impulse response measurements. We applied diagonal loading comparable to the source power to account for modeling errors and ambient noise. We used discrete-time filters with length 2048 samples (128 ms). For each experiment we report the sample MSE relative to the source power, computed as $\sum_t (y[t] - d[t])^2 / \sum_t d^2[t]$.
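The filter design described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: sources are modeled as white with unit variance (the experiments instead used a long-term speech autocorrelation), the delay `alpha` is a nonnegative integer number of samples, and the reported error is the model MSE rather than a sample MSE measured on recordings. The names `conv_rows` and `fir_mmse_filter` and the array layout `h[m, n, :]` are hypothetical.

```python
import numpy as np

def conv_rows(h, L, S):
    """L x S matrix whose row k maps [s[t], s[t-1], ..., s[t-S+1]] to (h * s)[t - k]."""
    H = np.zeros((L, S))
    for k in range(L):
        H[k, k:k + len(h)] = h
    return H

def fir_mmse_filter(h, L, alpha, loading):
    """Design a length-L multichannel FIR MMSE filter with diagonal loading.

    h[m, n] is the truncated impulse response from source n to microphone m.
    Assumes white unit-variance sources and an integer delay alpha >= 0.
    Returns the stacked coefficients and the model MSE relative to var(d).
    """
    M, N, P = h.shape
    S = max(L, alpha + 1) + P - 1                  # number of source samples involved
    H = [np.vstack([conv_rows(h[m, n], L, S) for m in range(M)]) for n in range(N)]
    R = sum(Hn @ Hn.T for Hn in H)                 # model covariance of the stacked signals
    g = np.zeros(S)
    g[alpha:alpha + P] = h[0, 0]                   # d[t] = (h[0, 0] * s_0)[t - alpha]
    r = H[0] @ g                                   # cross-correlation with the target
    w = np.linalg.solve(R + loading * np.eye(M * L), r)
    var_d = g @ g
    mse = var_d - 2.0 * w @ r + w @ R @ w          # model MSE, excluding ambient noise
    return w, mse / var_d

# Example with random stand-in impulse responses (2 microphones, 2 sources, 8 taps).
rng = np.random.default_rng(0)
h_demo = rng.standard_normal((2, 2, 8))
w_demo, rel_mse_demo = fir_mmse_filter(h_demo, L=16, alpha=4, loading=0.1)
```

The loading term plays the role of a regularizer: it keeps the solve well conditioned when the modeled covariance is nearly singular, at the cost of a slightly higher model MSE than the unregularized solution.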
Figure 4(a) shows delay-error curves for four simultaneous talkers using arrays of up to eight microphones at varying distances. The speech signals were twenty-second clips taken from the VCTK dataset [31] and the filters were designed using a single approximate long-term average speech autocorrelation. Because we model the signals as identically distributed, the filters must rely on spatial rather than spectral diversity to separate the sources. As the radius of the array increases, the curves move downward and to the left, indicating that the larger-aperture arrays can achieve similar performance with lower delay compared to the smaller-aperture arrays. In fact, since the source signals reach the microphone stands several milliseconds before they reach the listener, the system could operate with negative delay.
The two-channel filter performs poorly in this experiment because it does not take advantage of the time-frequency sparsity of speech signals, which many speech enhancement algorithms exploit. To account for the benefits of sparsity within the stationary estimation framework of this paper, we repeated the experiment with four stationary speechlike sounds generated using the Vocaloid music synthesis software. Each ten-second source signal represents a different vowel sung in a different key. Although the signals are deterministic and periodic, the filters were designed based on 50 ms von Hann-windowed autocorrelation sequences. Figure 1 shows the delay-error curves for single-channel mixtures of these sources and Figure 4(b) compares the separation performance of multichannel filters with different array sizes. Because the sources are approximately disjoint in the frequency domain, a one- or two-channel filter can separate them effectively, but requires a delay to do so. The larger microphone arrays also benefit from longer delay, but perform better for small $\alpha$. For example, the performance of the hat-mounted array with zero delay matches that of the binaural microphones with about 10 ms delay, which would be perceptible to many listeners.
4 Conclusions
The theoretical and experimental results presented here suggest that larger arrays can separate sound sources with lower delay and that the delay-performance tradeoff depends on both the spatial and temporal correlation structure of the observed signals. When microphones are located between the listener and sound sources, those sensors receive the signals before the listening device, shifting the delay-performance curve to the left. Arrays also provide spatial gain, which improves overall performance regardless of delay. When signals are spectrally distinct, a single-channel filter could separate them effectively given a long enough delay, but an array can achieve the same performance with little or no delay.
Much remains to be understood about delay-constrained array processing. For example, equations (10) and (11) tell us little in general about the effects of reverberation and signal spectra on delay. Furthermore, because many signals of interest are nonstationary, we must also consider time-varying causal array processing. Finally, to realize the benefits of spatial diversity in delay-constrained listening enhancement, listening devices must use larger microphone arrays than they do today. Large wearable and distributed arrays could allow us to apply stronger noise reduction while meeting the strict delay constraints of real-time listening applications.
References
[1] J. M. Kates, Digital Hearing Aids. Plural Publishing, 2008.
[2] V. Valimaki, A. Franck, J. Ramo, H. Gamper, and L. Savioja, “Assisted listening using a headset: Enhancing audio perception in real, augmented, and virtual environments,” IEEE Signal Processing Magazine, vol. 32, no. 2, pp. 92–99, 2015.
[3] J. Agnew and J. M. Thornton, “Just noticeable and objectionable group delays in digital hearing aids,” Journal of the American Academy of Audiology, vol. 11, no. 6, pp. 330–336, 2000.
[4] M. A. Stone and B. C. Moore, “Tolerable hearing aid delays. I. Estimation of limits imposed by the auditory path alone using simulated hearing losses,” Ear and Hearing, vol. 20, no. 3, pp. 182–192, 1999.
[5] M. A. Stone, B. C. Moore, K. Meisenbacher, and R. P. Derleth, “Tolerable hearing aid delays. V. Estimation of limits for open canal fittings,” Ear and Hearing, vol. 29, no. 4, pp. 601–617, 2008.
[6] M. A. Stone and B. C. Moore, “Tolerable hearing aid delays. II. Estimation of limits imposed during speech production,” Ear and Hearing, vol. 23, no. 4, pp. 325–338, 2002.
[7] S. Makino, ed., Audio Source Separation. Springer, 2018.
[8] O. Yilmaz and S. Rickard, “Blind separation of speech mixtures via time-frequency masking,” IEEE Transactions on Signal Processing, vol. 52, no. 7, pp. 1830–1847, 2004.
[9] A. Ozerov and C. Févotte, “Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 550–563, 2010.
[10] J. Benesty, J. Chen, and Y. Huang, Microphone Array Signal Processing, vol. 1. Springer, 2008.
[11] M. Brandstein and D. Ward, Microphone Arrays: Signal Processing Techniques and Applications. Springer, 2013.
[12] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, “A consolidated perspective on multimicrophone speech enhancement and source separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 692–730, 2017.
[13] M. S. Pedersen, J. Larsen, U. Kjems, and L. C. Parra, “Convolutive blind source separation methods,” in Springer Handbook of Speech Processing, pp. 1065–1094, Springer, 2008.
[14] S. Doclo, W. Kellermann, S. Makino, and S. E. Nordholm, “Multichannel signal enhancement algorithms for assisted listening devices,” IEEE Signal Processing Magazine, vol. 32, no. 2, pp. 18–30, 2015.
[15] S. Doclo, S. Gannot, M. Moonen, and A. Spriet, “Acoustic beamforming for hearing aid applications,” in Handbook on Array Processing and Sensor Networks (S. Haykin and K. R. Liu, eds.), pp. 269–302, Wiley, 2008.
[16] P. Naylor and N. D. Gaubitch, Speech Dereverberation. Springer, 2010.
[17] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, “Blind speech dereverberation with multi-channel linear prediction based on short time Fourier transform representation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 85–88, 2008.
[18] B. Schwartz, S. Gannot, and E. Habets, “Online speech dereverberation using Kalman filter and EM algorithm,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 23, no. 2, pp. 394–406, 2015.
[19] J. Benesty, J. Chen, Y. Huang, and J. Dmochowski, “On microphone-array beamforming from a MIMO acoustic signal processing perspective,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 1053–1065, 2007.
[20] E. Hänsler and G. Schmidt, Acoustic Echo and Noise Control: A Practical Approach. Wiley, 2005.
[21] J. Chen, J. Benesty, and Y. Huang, “A minimum distortion noise reduction algorithm with multiple microphones,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 3, pp. 481–493, 2008.
[22] J. M. Kates and K. H. Arehart, “Multichannel dynamic-range compression using digital frequency warping,” EURASIP Journal on Applied Signal Processing, vol. 2005, pp. 3003–3014, 2005.
[23] H. W. Löllmann and P. Vary, “Low delay noise reduction and dereverberation for hearing aids,” EURASIP Journal on Advances in Signal Processing, vol. 2009, no. 1, p. 437807, 2009.
[24] T. Barker, T. Virtanen, and N. H. Pontoppidan, “Low-latency sound-source-separation using non-negative matrix factorisation with coupled analysis and synthesis dictionaries,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 241–245, 2015.
[25] N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series. Wiley, 1949.
[26] H. W. Bode and C. E. Shannon, “A simplified derivation of linear least square smoothing and prediction theory,” Proceedings of the IRE, vol. 38, no. 4, pp. 417–425, 1950.
[27] H. L. Van Trees, Detection, Estimation, and Modulation Theory, Part I. Wiley, 2004.
[28] N. Wiener and P. Masani, “The prediction theory of multivariate stochastic processes, II,” Acta Mathematica, vol. 99, no. 1, pp. 93–137, 1958.
[29] E. Wong and J. Thomas, “On the multidimensional prediction and filtering problem and the factorization of spectral matrices,” Journal of the Franklin Institute, vol. 272, no. 2, pp. 87–99, 1961.
[30] V. Kucera, “Factorization of rational spectral matrices: a survey of methods,” in International Conference on Control, pp. 1074–1078, 1991.
[31] C. Veaux, J. Yamagishi, and K. MacDonald, “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” 2017.