Blind source separation (BSS) is a core problem in signal processing with numerous applications in various fields, such as biomedical data processing, audio processing, digital communication, and image processing. In BSS problems, only the output observations are given, whereas neither the original sources nor the mixing systems are known. Separation methods usually rely on some a priori hypothesis regarding the characteristics of the original sources or the obtained mixtures. Assuming that the sources are independent and have non-Gaussian distributions leads to independent component analysis (ICA) methods based on probabilistic or information theoretic criteria [2, 3, 4]. Non-negative matrix factorization (NMF) methods can be employed for signals which admit factorization to non-negative components. Sparsity of the signals is also often assumed, allowing a representation as a linear combination of few elementary signals.
In audio applications, the measured signals in an array of microphones represent convolutive mixtures of the source signals [7, 8, 9]. The measured signals are obtained by filtering the clean source signals with the corresponding acoustic channels relating the sources and the microphones. The acoustic channels, in a typical reverberant environment, consist of various reflections from the objects and surfaces defining the acoustic enclosure. The measured signals are commonly analysed in the short time Fourier transform (STFT) domain, where the convolutive mixtures are transformed into multiplicative mixtures at each frequency bin.
ICA-based methods can be applied, subject to scale-ambiguity and source permutation problems [10, 11]. Alternatively, numerous separation methods rely on the sparsity of speech sources in the STFT domain, assuming that each time-frequency (TF) bin is occupied by a single source. In algorithms based on NMF, the speech spectrum is decomposed into a product of non-negative basis and activation functions [13, 14]. Due to joint estimation of source parameters and mixing coefficients, these methods are free from permutation alignment problems. Other full-band approaches cluster the measurements according to time difference of arrival (TDOA) estimates or phase difference levels with respect to several microphones [15, 16, 17]. However, these models cannot be successfully applied in the presence of high reverberation, when the TDOA estimates are of poor quality. Robustness to room reverberation can be attained by performing bin-wise clustering, at the cost of adding a second stage of permutation alignment [18, 19]. The TIFROM algorithm avoids the TF sparsity assumption. It inspects the variations of computed instantaneous ratios, and detects small regions in the TF plane with a single active speaker.
In this paper, we present a novel source separation algorithm, which is specifically applicable to speech mixtures. The key point lies in the spectral decomposition of the correlation matrix between different observations. The justification of the method is based on a probabilistic model, in which each observation consists of different portions of the hidden sources. The relative portion of each source is randomly generated according to the sources’ probabilities, which vary from one observation to another. Based on this model, we show that the column space of the correlation matrix is spanned by the probabilities of the different sources. Accordingly, the rank of the correlation matrix equals the number of sources, and its eigenvectors form a simplex of the sources’ activity probabilities. The vertices of the simplex correspond to observations dominated by a single source with high probability, facilitating the estimation of the hidden sources.
The applicability of the presented model for blind separation of speech mixtures relies on two main attributes of multichannel audio mixtures. The first is the sparsity of the speech in the STFT domain, implying that different time-frames contain different portions of speech components of the different speakers. The second is the fact that in a multichannel framework each speaker is associated with a unique spatial signature, manifested in the associated acoustic channel. Applying the above procedure and exploiting convex geometry tools, we can identify frames dominated by a single speaker, enabling estimation of the corresponding acoustic channels. Given the estimated acoustic channels, the individual speakers are extracted using the pseudo-inverse of the acoustic mixing system.
Our method recovers a simplex of the probability of activity of the different sources. Convex geometry tools are more commonly utilized for hyperspectral unmixing (HU) in the emerging field of hyperspectral remote sensing [21, 22]. In those studies, the goal is to identify materials in a scene, using hyperspectral images with high spectral resolution. The work relies on a linear mixing model, where each pixel is modelled as a linear sum of the radiated energy curves of the materials contained in this pixel. The nature of the problem entails a positivity constraint on the weights of the different materials. In addition, the weights must sum to one due to energy conservation. The latter constraint violates the statistical independence assumption, making the application of many standard BSS algorithms inappropriate. Instead, the above constraints lay the ground for the application of convex geometry tools for HU. There was also an attempt to borrow these principles for quasi-stationary sources such as speech sources. In general, it is clear that speech mixtures are not formed as convex mixtures. In that work, a certain normalization followed by a pre-processing procedure for cross-correlation mitigation was proposed in order to enforce bin-wise convexity.
It is important to emphasize that the mixture model presented in this paper is fundamentally different from the one used for HU. In our model, we recover a simplex of the probability of activity of the different sources, while in HU the simplex is formed in the original (often high-dimensional) domain of the mixing systems. In addition, our method also inherently identifies the number of sources in the mixture, whereas HU methods generally assume that the number of sources is known. Moreover, in contrast to the bin-wise convex-geometry approach for quasi-stationary sources, we present a full-band approach based on averaging over a large number of frequency bins, which enhances robustness and avoids permutation problems.
The paper is organized as follows. The probabilistic model and its analysis by convex geometry principles are presented in Section II. The model is applied to speech mixtures and an algorithm for speaker counting and separation is derived in Section III. Section IV contains an extensive experimental study demonstrating the performance of the proposed method in comparison to several competing methods. Section V concludes this paper.
II Statistical Mixture Model and Analysis
We present a general statistical model describing the generation of a collection of observations as mixtures of a set of hidden sources. The observations consist of different portions of each of the sources, where each source occurs with a certain probability. The separation is based on the computation of the correlation matrix defined over the given observations. Based on the spectral decomposition of the correlation matrix, we can identify the number of hidden sources and derive a simplex representation, which relates each observation with its corresponding probabilities. In Section III, we discuss the relation between this general model and the problem of blind separation of speech mixtures. We use the analogy between the two to derive an algorithm for estimating the number of active speakers and separating them.
II-A Mixture Generation
Consider unknown hidden sources
. The hidden sources are i.i.d. random vectors consisting of coordinates, i.e. , where the th coordinate of the th source is denoted by . The hidden sources follow a multivariate distribution with zero-mean and identity covariance matrix, i.e.:
is the identity matrix of size
. The diagonal covariance matrix implies that the coordinates of the hidden sources are assumed to be uncorrelated. It should be noted that the unit variance assumption is used here for the sake of simplicity, and that the following derivation also holds for non-unit and non-constant variance by applying a proper normalization.
Suppose we are given a set of observations , also in , which are formed as a combination of the hidden sources. Each observation is assigned with a set of probabilities summing to one. The vector is constructed by statistically independent lotteries, which are defined by the associated set of probabilities. In each lottery, the value of the th coordinate of is chosen as the value of the th coordinate of the th source with probability . Accordingly, the th coordinate of the th observation can be written as:
where is an indicator function, which equals if the th source is chosen and otherwise, and satisfies:
where for and otherwise. We further assume that the indicator functions of different coordinates and of different frames are mutually independent.
According to this statistical model, for each , the probability corresponds to the relative portion of the th source in the construction of the observation . An illustration of the presented mixture model is depicted in Fig. 1 for sources, coordinates and observations. Consider for example the first observation , with associated probabilities: , and . In the vector , coordinates are taken from , coordinates are taken from , and coordinates are taken from . In practice, the relative portion of each source only approximately matches the corresponding probability for large enough.
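The lottery-based construction described above can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions: the names `generate_observations`, `sources`, `probs` and `choice` are hypothetical, the sources are drawn as Gaussian (the model only requires zero mean and identity covariance), and the probabilities `0.6, 0.3, 0.1` are illustrative values, not those of Fig. 1.

```python
import numpy as np

def generate_observations(sources, probs, rng):
    """Draw observations by independent per-coordinate lotteries.

    sources: (J, D) array, one hidden source per row (hypothetical layout).
    probs:   (L, J) array, row l holds the source probabilities of observation l.
    Returns (observations, choice), both of shape (L, D).
    """
    n_sources, n_coords = sources.shape
    # Inverse-CDF sampling: one lottery per (observation, coordinate) pair.
    cdf = np.cumsum(probs, axis=1)                       # (L, J)
    u = rng.random((probs.shape[0], n_coords))           # (L, D)
    choice = (u[:, :, None] >= cdf[:, None, :]).sum(axis=2)
    choice = np.minimum(choice, n_sources - 1)           # guard against rounding in cdf
    # Coordinate k of observation l is copied from coordinate k of the chosen source.
    observations = sources[choice, np.arange(n_coords)]
    return observations, choice

rng = np.random.default_rng(0)
sources = rng.standard_normal((3, 2000))   # Gaussian sources are an assumption
probs = np.array([[0.6, 0.3, 0.1],         # illustrative probabilities
                  [0.0, 1.0, 0.0]])        # observation made of source 2 only
observations, choice = generate_observations(sources, probs, rng)
# Empirical portion of each source in the first observation.
portions = np.array([(choice[0] == j).mean() for j in range(3)])
```

For a large number of coordinates, `portions` should be close to the first row of `probs`, which is the sense in which each probability describes the relative portion of a source in an observation.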
The motivation for this model comes from separation of speech mixtures. According to the sparsity assumption of speech sources in the STFT domain, each TF bin is dominated by a single speaker. Given the spectrogram of the mixed signal, we can define a column vector for each frame index, consisting of the STFT values in a certain frequency band. Relying on the sparsity assumption, each frequency bin in this vector contains a signal from a single speaker. The challenge in speech mixtures is that they are time-varying. In Section III-A we mitigate this problem by proposing features based on the acoustic channels, which are approximately fixed as long as the environment and the source positions do not change dramatically.
II-B Analysis of the Correlation Matrix
Our goal is to recover the number of hidden sources and to estimate them based on the given set of observations . The key to our separation scheme lies in the spectral decomposition of the correlation matrix defined over the different observations, which is analysed in this section.
Let be the correlation matrix, with . According to (4) the correlation matrix can be recast as:
where is a matrix with , and is a diagonal matrix with . We show in Appendix B, that has a negligible effect on the spectral decomposition of . Therefore, henceforth we omit from our derivations and consider the correlation matrix as .
Following the mutual independence assumption of the sources, the columns of are linearly independent, i.e. the rank of equals the number of sources . Hence, the rank of also equals , i.e. it has nonzero eigenvalues. We apply an eigenvalue decomposition (EVD) , with an orthonormal matrix consisting of the eigenvectors , and a diagonal matrix with the eigenvalues on its diagonal. The eigenvalues are sorted by their values in a descending order. According to (5), the first eigenvectors , associated with the nonzero eigenvalues , form a basis for the column space of the matrix . Accordingly, the following identity holds:
where , and is a invertible matrix.
Each observation can be represented as a point in , defined by the corresponding set of probabilities: . Note that each point is a convex combination of the standard unit vectors:
where with one in the th coordinate and zeros elsewhere. Accordingly, the collection of the probability sets lies in a -simplex in . This is a standard simplex, whose vertices are the standard unit vectors . Note that in this representation, points for which the probability of the th source is dominant over the probabilities of the other sources, i.e. , satisfy: , namely these points are concentrated nearby the th vertex.
We can use the eigenvectors of to form an equivalent representation in , defined by: . According to (6), this representation is related to the former representation by the following transformation:
Hence, the set occupies a simplex, which is a rotated and scaled version of the standard simplex defined by the standard unit vectors. The new simplex is the convex hull of the following vertices:
where is the th column of the matrix .
Regarding the computation of the matrix , we do not have access to the expected values , hence we use instead the typical values . In Appendix A, we show that the variance of is proportional to , hence approaches zero for large enough, implying that the typical value is close to the expected value.
We demonstrate the above derivation using three examples with , and sources. We generate independent sources of dimension with Next, we generate observations, according to (2). To generate the probabilities for each , we draw uniform variables between and sort them in an ascending order: . Accordingly, for each , we define the probability of each source by: , and . Next, we construct the matrix with , and apply an EVD.
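The probability generation used in these examples can be sketched as follows. This is one reading of the procedure, stated as an assumption: for each observation, the sorted uniform variables split the unit interval, and the lengths of the resulting sub-intervals are taken as the source probabilities. The function name `spacing_probabilities` is hypothetical.

```python
import numpy as np

def spacing_probabilities(n_obs, n_sources, rng):
    """Return an (n_obs, n_sources) array of probabilities, each row summing to one.

    Each row is built from n_sources - 1 sorted uniform draws in [0, 1):
    the probabilities are the gaps between consecutive cut points.
    """
    cuts = np.sort(rng.random((n_obs, n_sources - 1)), axis=1)
    edges = np.hstack([np.zeros((n_obs, 1)), cuts, np.ones((n_obs, 1))])
    return np.diff(edges, axis=1)

rng = np.random.default_rng(0)
probs = spacing_probabilities(n_obs=500, n_sources=4, rng=rng)
```

Under this reading, the rows are spread over the whole standard simplex, so some observations are dominated by a single source while others are genuine mixtures.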
Figure 2 (a)-(c) depicts , for (a), (b), and (c). To enable visualization also for we omit one coordinate of , and represent the simplexes in . The colouring of the points is as follows: blue, green, red and cyan for observations dominated by the first, the second, the third, and the fourth source, respectively (for only blue, green and red, and for only blue and green). Yellow points depict frames with mixture of sources. We observe that in each plot the points form a -simplex, i.e. a line segment (a), a triangle (b) and a tetrahedron (c).
We observe that the scattering of the points in (d)-(f) represents a linear transformation of the scattering in (a)-(c), as implied by (8).
Figure 3 depicts the computed eigenvalues of , sorted in a descending order. We observe that the number of eigenvalues with significant value above zero exactly matches the number of sources .
We conclude with the practical aspects of the new representation derived by the EVD of the matrix . By examining the rank of the obtained decomposition, we can estimate the number of sources involved in the construction of the set . Furthermore, the eigenvectors form a simplex that corresponds to the probability of activity of each source along the observation index . We can use this representation to identify observations, which are highly dominated by a certain source, i.e. with , implying . The identified observations can be used for estimating the original hidden sources .
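The following sketch illustrates these practical aspects on synthetic data. It is a minimal example under assumptions that are not taken from the paper: the correlation matrix is computed as the sample inner product between observations normalized by the number of coordinates, the sources are Gaussian, and the probabilities are drawn uniformly on the simplex with a Dirichlet distribution. All names (`correlation_embedding`, `rel_residual`, and so on) are hypothetical.

```python
import numpy as np

def correlation_embedding(observations, n_sources):
    """Return (eigenvalues in descending order, embedding of shape (L, n_sources))."""
    n_coords = observations.shape[1]
    corr = observations @ observations.T / n_coords   # (L, L) sample correlation
    eigvals, eigvecs = np.linalg.eigh(corr)            # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order[:n_sources]]

# Synthetic data following the lottery model (sizes are illustrative).
rng = np.random.default_rng(1)
n_sources, n_coords, n_obs = 3, 4000, 300
sources = rng.standard_normal((n_sources, n_coords))
probs = rng.dirichlet(np.ones(n_sources), size=n_obs)           # (L, J)
u = rng.random((n_obs, n_coords))
choice = (u[:, :, None] >= np.cumsum(probs, axis=1)[:, None, :]).sum(axis=2)
choice = np.minimum(choice, n_sources - 1)
observations = sources[choice, np.arange(n_coords)]

eigvals, embedding = correlation_embedding(observations, n_sources)

# If the embedding is a linear transformation of the probabilities,
# a least-squares fit of the embedding on the probabilities leaves a small residual.
q_fit, *_ = np.linalg.lstsq(probs, embedding, rcond=None)
rel_residual = np.linalg.norm(embedding - probs @ q_fit) / np.linalg.norm(embedding)
```

In this setting the first `n_sources` eigenvalues should stand clearly above the remaining ones, and `rel_residual` should be small, which is consistent with the embedded points occupying a linearly transformed simplex.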
III Source Counting and Separation
In this section, we devise a statistical model for speech mixtures, which resembles the model presented in Section II-A. Next, we use the analysis of Section II-B to derive an algorithm for source counting and separation.
III-A Speech Mixtures
Consider concurrent speakers, located in a reverberant enclosure. The signals are measured by an array of microphones. The measured signals are analysed in the STFT domain with a window of length samples and overlap of samples:
where is the acoustic transfer function (ATF) relating the th source and the th microphone, and is the signal of the th speaker. Here, is the frequency bin, and is the frame index.
The first microphone () is considered as the reference microphone. We define the relative transfer function (RTF) [24, 25] as the ratio between the ATF of the th microphone and the ATF of the reference microphone, both of which are associated with the th speaker:
In order to transform the measurements (10) into features that correspond to the model presented in Section II-A, we rely on two main assumptions. The first assumption regards the fact that each speaker has a unique spatial signature, which is manifested in the associated RTF (11). The second assumption regards the sparsity of speech signals in the STFT domain.
For speech mixtures, the hidden sources are defined by the RTFs of each of the speakers. Each hidden source consists of coordinates for the real and the imaginary parts of the RTF values, in frequency bins and in microphones:
Note that is an all-ones vector for all , hence is excluded from in (12). We assume that the RTF vectors have a diagonal covariance matrix (1). The attributes of the Fourier transform prescribe that the real and the imaginary parts of the RTF values, as well as the different frequency bins, are uncorrelated. For large enough, the model can tolerate slight correlations between adjacent frequency bins, or between neighbouring microphones. In addition, we assume that the RTFs of the different speakers are mutually independent. This was empirically verified in the experimental study of Section IV, assuming a minimal angle of between adjacent speakers.
After defining the hidden vectors associated with each of the speakers, we have to extract related observations from the measured signals (10). We assume that low-energy frames do not contain speech components, and hence these frames are excluded from our analysis. We use the assumption of the speech sparsity in the TF domain, which is widely employed in the STFT analysis of speech mixtures, and is often applied for localization [26, 17, 27] and separation tasks [14, 19, 28]. According to this assumption, each TF bin is exclusively dominated by a single speaker. Let denote an indicator function with expected value , which equals if the th speaker is active in the th bin, and equals , otherwise. The assumption that the probability is dependent on but independent of , reflects that the frequency components of a speech signal tend to be activated synchronously [19, 29]. According to the TF sparsity assumption, the following holds for each TF bin (recall (II-A)):
Hence, (10) can be recast as:
We compute the following instantaneous ratio between the th microphone and the reference microphone:
Let denote the observed RTF of frame , which consists of the real and the imaginary parts of the RTF values, in frequency bins and in microphones (recall (12)):
Note that for a certain frequency bin, the same speaker (both the real and the imaginary parts) is captured by all the microphones. However, this does not affect the relative portions of the different speakers in , and has a negligible effect on the variance of the correlation (34) provided . There is a trade-off in choosing the frequency band . On the one hand, we should focus on the frequency band in which most of the speech components are concentrated, in order to avoid TF bins with low-energy speech components. On the other hand, a sufficiently broad frequency band should be used in order to reduce the effect of TF bins occupied by several speakers, and to obtain a better averaging with smaller variance (34).
We compute (17) and (18) for each , and form the set . We conclude that the obtained set is constructed from the RTF vectors of the different sources (12), and has similar properties to the set of observations defined in Section II-A. A nomenclature listing the different symbols and their meanings is given in Table I.
|No. of sources/speakers,|
|No. of microphones,|
|No. of observations/frames in the STFT,|
|No. of frequency bins in the chosen band,|
|No. of coordinates ,|
|Hidden sources defined by RTF values of each speaker|
|Observations defined by instantaneous RTFs of each frame|
|Probability of activity of the speakers in each frame|
|Correlation matrix with|
|Eigenvalues of the correlation matrix|
|Eigenvectors of the correlation matrix|
|A transformation of , obtained by the eigenvectors of|
|Vertices of the standard simplex occupied by|
|Vertices of the transformed simplex occupied by|
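The construction of the per-frame observation vectors can be sketched as below. This is a hedged illustration, not the paper's implementation: the array layout (microphone, frequency, frame), the function name `instantaneous_rtf_features`, and the plain division by the reference channel are assumptions. In particular, the experimental section states that the ratio is estimated by averaging over adjacent frames, which this sketch omits for brevity.

```python
import numpy as np

def instantaneous_rtf_features(stft, band):
    """Build one real-valued feature vector per frame from instantaneous ratios.

    stft: complex array of shape (M, F, L), microphone x frequency x frame.
    band: slice selecting the frequency bins to use.
    Returns an array of shape (L, 2 * (M - 1) * F_band): real parts followed
    by imaginary parts. The reference microphone (index 0) is excluded since
    its ratio is identically one.
    """
    ref = stft[0, band, :]                       # (F_band, L)
    ratios = stft[1:, band, :] / ref             # (M - 1, F_band, L); assumes ref != 0
    n_frames = stft.shape[2]
    flat = ratios.reshape(-1, n_frames).T        # (L, (M - 1) * F_band)
    return np.hstack([flat.real, flat.imag])

# Single-speaker sanity check: every frame should reproduce the speaker's RTF.
rng = np.random.default_rng(0)
rtf = rng.standard_normal((3, 8)) + 1j * rng.standard_normal((3, 8))
rtf[0] = 1.0                                     # RTF of the reference microphone
speech = rng.standard_normal((8, 5)) + 1j * rng.standard_normal((8, 5))
stft = rtf[:, :, None] * speech[None, :, :]
features = instantaneous_rtf_features(stft, slice(2, 6))
```

In a frame where a single speaker is active in all selected bins, the feature vector coincides with that speaker's RTF vector, which is why single-speaker frames map to the vertices of the simplex.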
III-B Speaker Counting and Separation
After we have shown that the speech separation problem can be formulated using the model in Section II-A, we would like to use the analysis of Section II-B to derive an algorithm for speaker counting and separation.
Following the derivation of Section II-B, we construct an matrix with , and apply EVD. Based on the computed eigenvectors, we form a representation in , defined by: .
We provide a similar demonstration for speech mixtures as we have presented in the synthetic case in Section II-B. We present three examples with , and speakers. The generation of the mixtures and the associated parameters are described in detail in the experimental part, in Section IV. Figure 4 depicts the points , for (a), (b) and (c). The plots in Fig. 4 are generated in a similar way to the plots in Fig. 2. Here, too, we omit one coordinate of to enable visualization also for . We observe a good correspondence between Fig. 4 and Fig. 2, which supports the applicability of the general model of Section II to the case of speech mixtures.
Figure 5 depicts the computed eigenvalues sorted in a descending order, and normalized by the value of the maximum eigenvalue. As in Fig. 3, the number of eigenvalues with significant value above zero matches the number of sources . Hence, we can estimate the number of sources in the mixture by:
where is a threshold parameter.
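A counting rule of this kind can be sketched as follows. The exact form of the estimator is an assumption here: the sketch simply counts the eigenvalues whose value, normalized by the largest eigenvalue, exceeds the threshold. The names `count_sources` and `alpha` are hypothetical.

```python
import numpy as np

def count_sources(eigvals, alpha):
    """Count eigenvalues whose ratio to the largest eigenvalue exceeds alpha."""
    vals = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # descending order
    return int(np.sum(vals / vals[0] > alpha))

# Three dominant eigenvalues followed by two near-zero ones (illustrative values).
n_estimated = count_sources([10.0, 4.0, 3.0, 0.1, 0.05], alpha=0.05)
```

The threshold trades off missed weak sources against spurious detections caused by the residual eigenvalues.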
We use the obtained representation to recover the probabilities of the speakers. Next, we detect frames, which are dominated by one of the speakers, and utilize them for estimating the corresponding RTFs. As discussed in Section II-B, the vertices of the simplex defined by correspond to single-speaker points. We recover the simplex vertices, and then utilize them to transform the obtained representation to the original probabilities .
We assume that for each speaker there is at least one frame, with index , which contains only this speaker, i.e. . The single-speaker frames are the simplex vertices, i.e. . Note that single-speaker frames are tantamount to pure pixels in HU. Several algorithms for identifying the vertices of a simplex were developed in the context of HU [30, 31, 32]. We use a simple approach based on the family of successive projection algorithms. We first identify two vertices of the simplex, and then successively identify the remaining vertices by maximizing the projection onto the orthogonal complement of the space spanned by the previously identified vertices. We start with the first vertex, which is chosen as the point with the maximum norm:
Then, the second vertex is chosen as the point with maximum distance with respect to the first identified vertex:
Next, we identify the remaining vertices of the simplex. Let and . Suppose we have already identified vertices with . We define the matrix , from which we construct its orthogonal complement projector , where denotes the matrix pseudo-inverse. The th vertex is chosen as the point with maximum projection to the column space of :
We successively repeat (22) for , and recover all the simplex vertices . For simplicity of notation, we ignore possible permutation of the indices of the vertices with respect to the actual identity of the speakers.
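The vertex search can be sketched as follows. This is one possible realization under assumptions: the points are shifted so that the first identified vertex is the origin, and the projector is built from the edge vectors joining the first vertex to the other identified vertices. The function name `find_simplex_vertices` and the array layout are hypothetical.

```python
import numpy as np

def find_simplex_vertices(points, n_vertices):
    """Return indices of the estimated simplex vertices.

    points: (L, J) array of embedded observations, one per row.
    """
    # First vertex: the point with the largest norm.
    idx = [int(np.argmax(np.linalg.norm(points, axis=1)))]
    if n_vertices == 1:
        return idx
    # Second vertex: the point farthest from the first one.
    shifted = points - points[idx[0]]
    idx.append(int(np.argmax(np.linalg.norm(shifted, axis=1))))
    while len(idx) < n_vertices:
        # Edge vectors from the first vertex to the other identified vertices.
        basis = shifted[idx[1:]].T                               # (J, r - 1)
        # Projector onto the orthogonal complement of their span.
        proj = np.eye(points.shape[1]) - basis @ np.linalg.pinv(basis)
        residual = shifted @ proj                                # proj is symmetric
        idx.append(int(np.argmax(np.linalg.norm(residual, axis=1))))
    return idx

# Toy check: three known vertices followed by strict convex combinations of them.
rng = np.random.default_rng(0)
vertices = rng.standard_normal((3, 3))
weights = rng.dirichlet(np.ones(3), size=200)
points = np.vstack([vertices, weights @ vertices])
found = find_simplex_vertices(points, 3)
```

Each step maximizes a convex function over the simplex, so in the noiseless case the maximum is attained at a vertex that has not been selected before; with noisy data the selected frames are only approximately single-speaker.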
Let denote the set of frames dominated by the th speaker. Based on the recovered probabilities, we define the set by:
where is a probability threshold.
Given the set , an RTF estimator of the th speaker, is given by:
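The steps from the recovered vertices to the RTF estimates can be sketched as below. Several choices are assumptions made for illustration: the probabilities are recovered by inverting the matrix whose rows are the identified vertices, a frame is assigned to a speaker when its recovered probability exceeds the threshold, and the RTF is estimated as a cross-spectrum to reference-spectrum ratio accumulated over the selected frames. The names `recover_probabilities`, `dominated_frames`, `estimate_rtf` and `beta` are hypothetical.

```python
import numpy as np

def recover_probabilities(embedding, vertex_idx):
    """Map embedded points (L, J) back to probabilities using the simplex vertices."""
    vertices = embedding[vertex_idx]             # (J, J), row j is the j-th vertex
    # Each point is modelled as a convex combination of the vertices:
    # embedding = probs @ vertices, hence probs = embedding @ inv(vertices).
    return embedding @ np.linalg.inv(vertices)

def dominated_frames(probs, beta):
    """For each speaker, return the frame indices with probability above beta."""
    return [np.flatnonzero(probs[:, j] > beta) for j in range(probs.shape[1])]

def estimate_rtf(stft, frames):
    """Estimate an RTF of shape (M, F) from the selected frames of stft (M, F, L)."""
    sel = stft[:, :, frames]                                  # (M, F, n_frames)
    cross = np.sum(sel * np.conj(sel[0]), axis=2)             # cross-spectrum with reference
    power = np.sum(np.abs(sel[0]) ** 2, axis=1)               # reference power, shape (F,)
    return cross / power                                      # row 0 is all ones

# Toy usage with an exact simplex (illustrative values).
rng = np.random.default_rng(0)
q = rng.standard_normal((3, 3))
true_probs = np.vstack([np.eye(3), rng.dirichlet(np.ones(3), size=20)])
probs = recover_probabilities(true_probs @ q, [0, 1, 2])
frame_sets = dominated_frames(probs, beta=0.9)
```

With noisy embeddings the recovered rows need not be exactly non-negative or sum to one, so the threshold acts on approximate probabilities.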
Based on the estimated RTFs of each of the speakers , the mixture can be unmixed applying the pseudo-inverse of the matrix containing the estimated RTFs:
and . The time-domain separated signals are obtained by applying the inverse-STFT. The proposed method is summarized in Algorithm 1.
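The unmixing step can be sketched with a batched pseudo-inverse, one mixing matrix per frequency bin. The array layouts and the name `unmix` are assumptions made for illustration; the sketch assumes at least as many microphones as speakers.

```python
import numpy as np

def unmix(stft, rtfs):
    """Separate the speakers by applying the pseudo-inverse of the RTF matrix.

    stft: complex mixture of shape (M, F, L).
    rtfs: complex estimated RTFs of shape (J, M, F).
    Returns the separated STFT signals of shape (J, F, L).
    """
    mixing = np.transpose(rtfs, (2, 1, 0))        # (F, M, J), one matrix per bin
    unmixing = np.linalg.pinv(mixing)             # (F, J, M), batched pseudo-inverse
    return np.einsum('fjm,mfl->jfl', unmixing, stft)

# Noiseless toy mixture: two speakers, four microphones.
rng = np.random.default_rng(0)
rtfs = rng.standard_normal((2, 4, 5)) + 1j * rng.standard_normal((2, 4, 5))
rtfs[:, 0, :] = 1.0                               # reference microphone
speech = rng.standard_normal((2, 5, 6)) + 1j * rng.standard_normal((2, 5, 6))
mixture = np.einsum('jmf,jfl->mfl', rtfs, speech)
separated = unmix(mixture, rtfs)
```

Because the RTFs are normalized by the reference channel, each separated signal corresponds to a speaker as received at the reference microphone rather than to the dry source.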
IV Experimental Study
In this section, we evaluate the performance of the proposed method in various test scenarios. The measured signals are generated using concatenated TIMIT sentences. The clean signals are convolved with acoustic impulse responses (AIRs), which are drawn from an open database. The AIRs in the database were measured in a reverberant room of size mmm with reverberation times of ms, ms and ms. We use a uniform linear array of microphones with cm inter-microphone spacing. The different speaker positions are located on a spatial grid of angles ranging from to in steps with m and m distance from the microphone array.
The signal duration is s, with sampling rate of kHz. The window length of the STFT is set to with overlap between adjacent frames, which corresponds to a total amount of frames. For each frame, the instantaneous RTF of each frequency bin in (17), is estimated by averaging the signals in adjacent frames (). The instantaneous RTF vectors in (18) consist of frequency bins, corresponding to kHz, in which most of the speech components are concentrated. The obtained concatenated vectors of length are normalized to have a unit-norm. The results are demonstrated for mixtures of , and speakers in different locations (with a minimum angle of between adjacent speakers).
We first examine the ability of the proposed method to estimate the number of speakers in the mixture. Here, we use a smaller frequency range between kHz, which yields better results for the task of counting the number of speakers. We conduct Monte-Carlo trials for each , in which the angles and the distances of the speakers, as well as their input sentences, are randomly selected. Figure 6 depicts the average counting accuracy as a function of the threshold parameter (19) in the range between and . We observe that the counting accuracy is robust to the choice of the threshold value, with above accuracy in the defined range. Perfect recovery is obtained for threshold values between and .
Next, we examine the ability of the proposed method to identify the set of frames dominated by each speaker. Figure 7 illustrates the time-domain signals of each of the speakers for a mixture of speakers (a), and for a mixture of speakers (b). The shaded areas stand for time instances which were found to be dominated by each of the speakers, using (24). It can be seen that the proposed algorithm successfully identifies time periods for which one speaker is dominant over the other speakers. Comparing Fig. 7(a) and (b), we observe that as more speakers are involved in the mixture, fewer time periods are dominated by a single speaker.
The separation performance is evaluated using the signal to interference ratio (SIR) and signal to distortion ratio (SDR) measures, computed with the BSS-Eval toolbox. The measures are averaged over Monte-Carlo trials, in which the angles and the distances of the sources, as well as their input sentences, are randomly selected.
We compare the proposed method to two oracle methods, which are also based on the unmixing scheme of (26). In addition, we compare to a multichannel NMF algorithm representing state-of-the-art algorithms of the BSS family. The methods based on (26) use either of the following procedures for estimating the RTFs, used to compute the unmixing matrix:
Ideal: The RTFs are estimated using the individually measured signals, i.e.:
The parameters of the NMF algorithm are initialized using the separated speakers, which are artificially mixed with SIR that is improved with respect to the input SIR of the given mixture by dB.
We evaluate the performance of all the algorithms depending on the number of speakers and on the reverberation time. The results depending on the number of speakers are depicted in Table II for , with a fixed reverberation time of ms. The results depending on the reverberation time are depicted in Table III for ms, for mixtures of speakers.
We observe that the ideal unmixing yields the best results. In fact, it represents an upper bound for the separation capabilities, since it is derived using the separated speakers. The semi-ideal unmixing is inferior with respect to the upper bound, since the ideal unmixing uses the original signals for estimating the RTFs, whereas the semi-ideal unmixing uses non-pure frames from the mixed signals, which may contain also low energy components of other speakers. The proposed estimator determines the frames dominated by a certain speaker based on the mixed signals. Its performance is comparable to the semi-ideal unmixing with a small gap of dB. The NMF method is inferior with respect to the proposed method in almost all cases. It should be emphasized that the NMF algorithm uses an initialization with improved SIR, whereas the proposed method is completely blind. For all algorithms, a performance degradation is observed as the number of speakers increases or as the reverberation time increases. It should be noted that for both the semi-ideal unmixing and the proposed method, an increase in the number of speakers means a decrease in the number of frames dominated by a single speaker, hence, the performance gap between both algorithms and the ideal unmixing increases.
Figure 8 presents an example of the spectrograms and the waveforms of a mixture of speakers, where the first speaker (a), the mixture (b), and the output signal of the proposed method (c), are depicted. It is evident that the spectral components of the second speaker are significantly attenuated, while preserving most of the spectral components of the first speaker. There is also a good match between the original and the output waveforms.
V Conclusions
We present a novel framework for speech source counting and separation in a completely blind manner. The separation is based on the sparsity of speech in the STFT domain, as well as the fact that each speaker is associated with a unique spatial signature, manifested by the RTF between the speaker and the microphones. A spectral decomposition of the correlation matrix of different time frames reveals the number of speakers, and forms a simplex of the speakers’ probabilities across time. Utilizing convex geometry tools, the frames dominated by each speaker are identified. The RTFs of the different speakers are estimated using these identified frames, and an unmixing scheme is implemented to separate the individual speakers. The performance is demonstrated in an experimental study for various reverberation levels.
Appendix A
In this section, we compute the expected correlation between observations and evaluate its variance. The computation is based on the statistical model of Section II. Recall the following assumption regarding the hidden sources:
We compute the correlation for :
The second equality follows from the independence of the indicator functions and the sources. The third equality follows from the independence of the indicator functions for . The fourth equality is due to (30).
For the autocorrelation is given by:
where the third equality follows from (31).
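The two expectations discussed above can be written out as a worked sketch. The notation is assumed here for illustration and is not taken from the paper: $J$ sources $\mathbf{h}_j$ with $D$ coordinates, observations $\mathbf{a}_l$ with $a_l(k)=\sum_j I_j(l,k)\,h_j(k)$, indicators with $E\{I_j(l,k)\}=p_j(l)$, and uncorrelated unit-variance sources, $E\{h_i(k)\,h_j(k)\}=\delta_{ij}$. The correlation is assumed to be normalized by $D$.

```latex
% Cross-correlation between two different observations, l \neq n
\begin{align}
\frac{1}{D} E\{\mathbf{a}_l^T \mathbf{a}_n\}
 &= \frac{1}{D}\sum_{k=1}^{D}\sum_{i=1}^{J}\sum_{j=1}^{J}
    E\{I_i(l,k)\, I_j(n,k)\, h_i(k)\, h_j(k)\} \nonumber\\
 &= \frac{1}{D}\sum_{k=1}^{D}\sum_{i=1}^{J}\sum_{j=1}^{J}
    E\{I_i(l,k)\, I_j(n,k)\}\, E\{h_i(k)\, h_j(k)\} \nonumber\\
 &= \frac{1}{D}\sum_{k=1}^{D}\sum_{i=1}^{J}\sum_{j=1}^{J}
    p_i(l)\, p_j(n)\, E\{h_i(k)\, h_j(k)\} \nonumber\\
 &= \sum_{j=1}^{J} p_j(l)\, p_j(n) .
\end{align}

% Autocorrelation, l = n: exactly one indicator is active per coordinate,
% so I_i(l,k) I_j(l,k) = \delta_{ij} I_j(l,k)
\begin{align}
\frac{1}{D} E\{\mathbf{a}_l^T \mathbf{a}_l\}
 &= \frac{1}{D}\sum_{k=1}^{D}\sum_{j=1}^{J}
    E\{I_j(l,k)\}\, E\{h_j^2(k)\}
  = \sum_{j=1}^{J} p_j(l) = 1 .
\end{align}
```

Under these assumptions the off-diagonal entries of the expected correlation matrix are inner products of probability vectors, while the diagonal entries equal one, which is consistent with the decomposition into a low-rank term plus a diagonal correction used in Section II-B.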
We compute the variance of (32):
We show that the variance (34) approaches zero for large enough, implying that the typical value approaches the expected value .