I Introduction
As speech recognition and conversational AI mature, voice interactions with robots will become even more popular. Robots in homes, hospitals, restaurants, and airports will all be able to converse with humans. In this context, recognizing the human's speech is critical, especially since many of these interactions will happen in noisy environments. In signal processing, this problem is called "source separation", referring to the ability to separate a voice signal from a mixture of multiple signals. Source separation has been studied extensively and today's results are impressive, to the extent that $K$ source signals can be separated using $M$ microphones, even when $K$ is slightly larger than $M$. Observe that the problem is particularly challenging not only because the sources are unknown, but also because the channels (over which the signals arrive at the microphones) are unknown. Hence, this problem is specifically known as underdetermined blind source separation (UBSS).
A rich body of work has concentrated on the UBSS problem, and today's techniques range from unsupervised methods (e.g., ICA, IVA, Adaptive Beamforming (ABF)), to speech-specific techniques (e.g., DUET, Bayesian DUET), to compressed sensing and supervised deep learning approaches [Wiley_ICA, TASLP07_IVA, Frost_LCMV, DUET04, CompressedSens, ILRMA_sawada_ono_kameoka_kitamura_saruwatari_2019, GatedNN]. However, for UBSS problems, the majority of past works rely on interpolations and regressions, since source signal information is lost during the mixing process due to its underdetermined nature. Therefore, their performance degrades, understandably, as $K$ increases for a given $M$. Said differently, any reduction in the gap between $K$ and $M$ can immediately improve the quality of signal separation.

This paper proposes to leverage robotic mobility to reduce the gap between $K$ and $M$. At a high level, we intend to rotate a microphone array to an orientation such that two interfering signals appear as one interference in this orientation. The intuition is rooted in angular aliasing, where two signals arriving from completely different directions will produce the same relative delays at the microphone array, if the array is oriented at the correct angle. This correct orientation occurs when the line joining the microphones bisects the two interferers, as shown in Figure 1. Thus, deliberate angular aliasing transforms a $K{=}3$, $M{=}2$ system into a $K{=}2$, $M{=}2$ system, making it solvable. With many more sources in the real world, reducing $K$ to $K{-}1$ also offers a clear improvement.
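To make the aliasing geometry concrete, here is a minimal sketch (with an assumed 5 cm microphone spacing and 343 m/s sound speed, both hypothetical for illustration) showing that rotating the array so its axis bisects two interferers collapses their TDOAs into one:

```python
import math

def tdoa(theta_deg, d=0.05, c=343.0):
    """Far-field TDOA between two mics spaced d meters apart, for a
    source at AoA theta (degrees, measured from the array axis)."""
    return d * math.cos(math.radians(theta_deg)) / c

# Two interferers at hypothetical AoAs of 40 and 120 degrees
theta_a, theta_b = 40.0, 120.0
assert not math.isclose(tdoa(theta_a), tdoa(theta_b))  # two distinct delays

# Rotate the array by the bisecting angle: the interferers' new AoAs
# become symmetric about the array axis, so their TDOAs alias.
alpha = (theta_a + theta_b) / 2.0
assert math.isclose(tdoa(theta_a - alpha), tdoa(theta_b - alpha))
```

After the rotation, the array can no longer distinguish the two interferers by delay, which is exactly the desired merging of two sources into one.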
Separating sources via angular aliasing presents two challenges:

Since the angles of arrival (AoAs) of the signals are not known, the correct microphone orientation is unknown as well. Estimating all $K$ AoAs is difficult with only two microphones, and worse, AoA estimates are plagued by front-back ambiguities (i.e., it is difficult to tell whether a signal is arriving from a direction in front or from the back).
Even if the AoAs are estimated, it is not clear which interferers should be aligned to maximize source separation performance. In fact, should the optimal orientation always bisect two interferers? Or could an orientation that partially aligns multiple interferers maximize separation?
This paper addresses these two problems in Section III through a mobility-guided time-frequency method, and then combines it with established source separation (SS) techniques to achieve improved performance. Our RoSS algorithm is complementary to, hence compatible with, all SS techniques. We implement RoSS on a rotating microphone prototype (emulating a robot head), and perform experiments in simulated and uncontrolled environments (Section IV). Results show that RoSS consistently outperforms existing UBSS/BSS methods in scale-invariant signal-to-distortion ratio (SI-SDR) and its improvement, SI-SDRi [Def_SISDR, TrainigNoisy]. We believe RoSS could also be effective with smartphones, earbuds, moving videoconference systems, and surveillance cameras, all of which have a limited number of microphones but contain actuators or inertial measurement units (IMUs) for angular rotation and sensing.
II Formulation and Opportunity
II-A Signal Model
Let $s_T(t)$, $s_A(t)$, and $s_B(t)$ be 3 source signals, of which $s_T$ is the target and the others are interference (Fig. 2(a)). A linear microphone array receives the mixture of these signals as $x_1(t)$ and $x_2(t)$, and we designate $x_1$ as the reference for relative delay calculations. The signals travel from the far field over AoAs $\theta_k$ (k=T,A,B). We explain our proposed method with $K{=}3$ signals and consider $K{>}3$ later.
We make the following Assumptions:
(A1) The sound sources are human speech, widely assumed to be mutually independent, non-Gaussian signals.
(A2) Once a speech has been separated, it is possible to tell if it is from the target user (i.e., a voice fingerprint is available).
(A3) Sources are not moving in the time scale of seconds.
Thus, the received (convolutive) signal mixture is:

$x_1(t) = s_T(t) + s_A(t) + s_B(t)$, $\quad x_2(t) = s_T(t-\tau_T) + s_A(t-\tau_A) + s_B(t-\tau_B)$ (1)

where $\tau_k = \frac{d\cos\theta_k}{c}$ (k=T,A,B) are the time-differences-of-arrival (TDOAs) between the microphones, while $c$ and $d$ denote the sound propagation speed and the spacing between the microphones, respectively.
Thus, in the time-frequency domain:

$X_1(f,t) = S_T(f,t) + S_A(f,t) + S_B(f,t)$, $\quad X_2(f,t) = S_T(f,t)e^{-j2\pi f\tau_T} + S_A(f,t)e^{-j2\pi f\tau_A} + S_B(f,t)e^{-j2\pi f\tau_B}$ (2)
In matrix form, this equation becomes:

$\begin{bmatrix} X_1(f,t) \\ X_2(f,t) \end{bmatrix} = \begin{bmatrix} \mathbf{a}_T & \mathbf{a}_A & \mathbf{a}_B \end{bmatrix} \begin{bmatrix} S_T(f,t) \\ S_A(f,t) \\ S_B(f,t) \end{bmatrix}$ (3)

where $\mathbf{a}_k = [1,\ e^{-j2\pi f\tau_k}]^\top$ (k=A,B,T) is the steering vector. Note that even if all $\theta_k$'s are known, the system is still underdetermined (two equations per TF bin, three unknowns).
II-B Interference Alignment
What if we rotate the array such that the line joining the microphones bisects the two interferers? While the correct rotation angle needs to be inferred blindly, for now let us assume we know it. Fig. 2(b) shows the outcome. Since the new AoAs of the two interferers are now $\theta'$ and $-\theta'$, their corresponding TDOAs become equal, or aliased, as follows:

$\tau_A' = \tau_B' = \frac{d\cos\theta'}{c} \triangleq \tau_I$
Thus, in the frequency domain, interferers A and B have identical array vectors $\mathbf{a}_A = \mathbf{a}_B = \mathbf{a}_I$, where $\mathbf{a}_I = [1,\ e^{-j2\pi f\tau_I}]^\top$. Hence, the new measurement vector is:

$\begin{bmatrix} X_1(f,t) \\ X_2(f,t) \end{bmatrix} = \mathbf{a}_T S_T(f,t) + \mathbf{a}_I \big(S_A(f,t) + S_B(f,t)\big)$ (4)
This expression means that the array senses two groups of signals, not three (i.e., the target and the sum of the two interferers). Fig. 2(c) shows the two signals arriving from distinct angles. This produces a determined system of equations, except that one of the mixed signals, arriving from AoA $\theta'$, is actually a sum of independent sources. If this sum is independent of the target signal (as shown next), we can apply classical source separation.
II-C Sum of Mutually Independent Sources
We briefly show that a mixture of two independent sources remains independent from the third source when all three are mutually independent. Define $X$, $Y$, and $Z$ as mutually independent continuous random variables, and let $W = X + Y$ be a fourth random variable. Let $F_i$ and $f_i$ be the CDF and PDF of variable $i$, respectively. Then, the joint distribution of $W$ and $Z$ can be written as:

$F_{W,Z}(w,z) = P(X+Y \le w,\ Z \le z) = \iint_{x+y \le w} f_X(x) f_Y(y)\,dx\,dy \cdot F_Z(z) = F_W(w)\,F_Z(z)$ (5)

Therefore, $W$ and $Z$ are also mutually independent [book_stat].
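A quick Monte Carlo sanity check of this claim, with hypothetical Laplace and uniform distributions standing in for speech sources: the empirical joint CDF of $W = X + Y$ and $Z$ factorizes into the product of the marginals.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
# Three mutually independent, non-Gaussian sources (stand-ins for speech)
x = rng.laplace(size=n)
y = rng.uniform(-1.0, 1.0, size=n)
z = rng.laplace(size=n)
w = x + y                     # two aligned interferers sum into one source

# Empirically check F_{W,Z}(w0, z0) = F_W(w0) * F_Z(z0) on a small grid
for w0 in (-1.0, 0.0, 1.0):
    for z0 in (-0.5, 0.5):
        joint = np.mean((w <= w0) & (z <= z0))
        product = np.mean(w <= w0) * np.mean(z <= z0)
        assert abs(joint - product) < 0.01, (w0, z0)
```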
III Alignment by Rotational Motion
Our goal now is to correctly rotate the microphone so that interferers get aligned. For this, we first need to estimate the source AoAs, and then determine the correct microphone rotation as a function of these estimated AoAs.
III-A Estimating AoAs in Underdetermined Scenarios
Estimating AoAs with fewer microphones than sources is known to be a hard problem for general signals. However, the literature has shown promise with speech signals due to what is known as the W-Disjoint Orthogonality (WDO) property [DUET04]. Briefly, extensive experiments have shown that speech signals from two humans have a low probability of collision in a given time-frequency (TF) bin. Thus, if one calculates the TDOA for each TF bin — called the inter-microphone time difference (ITD) — one can extract information about the AoAs. Fig. 3 illustrates this with a toy example of red and blue signals; the calculated ITDs from the red and blue TF bins form clusters. The means of these clusters partly reveal the red/blue signals' AoAs.

Unfortunately, the mapping between ITD and AoA is not one-to-one, because AoAs of both $\theta$ and $-\theta$ produce identical ITDs at the microphone array. In Fig. 4, see how each ITD cluster gets mapped to two candidate AoAs (of which half are spurious). This is classically known as the front-back ambiguity. Rotating the microphone to the correct orientation obviously requires resolving this ambiguity first.
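The per-bin ITD extraction can be sketched as follows, using a synthetic delayed wideband signal in place of real speech (the 5 cm spacing, sampling rate, and AoA are assumptions for illustration); each TF bin's ITD comes from the cross-phase of the two STFT channels, and the bins cluster at the true delay:

```python
import numpy as np

fs, d, c = 16_000, 0.05, 343.0
theta = 60.0                                  # hypothetical source AoA (deg)
tau = d * np.cos(np.radians(theta)) / c       # true ITD in seconds

# Synthetic 1-second wideband signal; mic 2 hears a delayed copy of mic 1
rng = np.random.default_rng(1)
s = rng.standard_normal(fs)
f = np.fft.rfftfreq(fs, 1 / fs)
x1 = s
x2 = np.fft.irfft(np.fft.rfft(s) * np.exp(-2j * np.pi * f * tau), fs)

# Per-TF-bin ITD from the cross-phase of the two STFT channels
frame, hop = 512, 128
win = np.hanning(frame)
fb = np.fft.rfftfreq(frame, 1 / fs)
band = (fb > 100) & (fb < c / (2 * d))        # LPF band avoids spatial aliasing
itds = []
for start in range(0, fs - frame, hop):
    X1 = np.fft.rfft(win * x1[start:start + frame])
    X2 = np.fft.rfft(win * x2[start:start + frame])
    phase = np.angle(X2[band] * np.conj(X1[band]))
    itds.extend(phase / (-2 * np.pi * fb[band]))

est = np.median(itds)                         # center of the ITD cluster
assert abs(est - tau) < 1e-5
```

With multiple WDO speech sources, the histogram of `itds` would instead show one cluster per source.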
III-B Rotation Resolves AoA Ambiguity
To resolve the AoA ambiguity, we propose to rotate the microphone and observe the change in the ITDs. Depending on whether the ITD change is positive or negative, it is possible to resolve whether the source's AoA is in front ($0° < \theta < 180°$) or in back ($-180° < \theta < 0°$). Fig. 5(a) illustrates an example counterclockwise rotation by an angle $\alpha$. From the microphone's reference frame, the source AoAs rotate in the clockwise direction. If the original AoA was in front, then Fig. 5(c) shows how the new ITD increases (i.e., a right shift on the ITD axis), and vice versa when the AoA is in back.
Equations 6 and 7 show this analytically, where $\tau'$ is the ITD after rotation and $\Delta\tau$ is the change in ITD after the rotation.

$\tau' = \frac{d\cos(\theta - \alpha)}{c}$ (6)

$\Delta\tau = \tau' - \tau = \frac{d}{c}\big(\cos(\theta - \alpha) - \cos\theta\big)$ (7)
As discussed earlier, the sign of $\Delta\tau$ differs based on whether the source is in front or in back. This sign offers a reliable feature to resolve the front-back ambiguity. Of course, challenges emerge in real scenarios, discussed next.
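The sign test can be sketched as follows (5 cm spacing, 343 m/s sound speed, and a 15-degree step are assumed values); note the feature is reliable only when the AoA is not within roughly $\alpha/2$ of the array axis, consistent with the error behavior discussed below:

```python
import math

def itd(theta_deg, d=0.05, c=343.0):
    return d * math.cos(math.radians(theta_deg)) / c

def is_front(theta_deg, alpha_deg=15.0):
    """Classify front/back from the sign of the ITD change after a
    counterclockwise rotation by alpha (sources appear to rotate
    clockwise in the array frame: theta -> theta - alpha)."""
    delta = itd(theta_deg - alpha_deg) - itd(theta_deg)   # Eq. (7)
    return delta > 0       # ITD increases only for front sources

# Reliable away from the array axis (theta not within ~alpha/2 of 0 or 180)
assert is_front(70)        # front source
assert not is_front(-70)   # its back-side image
```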
What happens when real situations have more than one source? Consider $K$ sources in Fig. 6(a). For easier notation, let us index the signals as $k = 1, \dots, K$. Thus, if the ITD of source $k$ was $\tau_k$ before rotation, then after rotation the ITD becomes:

$\tau_k' = \frac{d\cos(\theta_k - \alpha)}{c}$ (8)

where $\theta_k$ was the true global AoA value before rotation.
As shown in Fig. 6(b), the sign-based feature becomes challenging as the peaks from front and back pairs begin to merge in a crowded histogram. The situation is worse when K becomes larger or when the measurement is noisy. This motivates our following approach to identify the K AoAs using two ITD measurements, taken before and after a rotation.
Hypothesis Testing:
From a current ITD estimate $\tau$, and after a rotation of $\alpha$, we can derive two expected ITDs as:

$\hat\tau'_\pm = \frac{d\cos(\pm\theta - \alpha)}{c}$ (9)

where $\theta = \cos^{-1}(c\tau/d)$ still has the front-back ambiguity. From now on, let us call $\hat\tau'_+$ and $\hat\tau'_-$ the front and back hypotheses, $\hat\tau_F$ and $\hat\tau_B$.
We now apply binary hypothesis testing by comparing the expected ITDs with the measured ITD histogram obtained after rotation. This reveals the more likely value, essentially a maximum likelihood estimate (MLE). Said differently, the probability density function (PDF) of the measured ITD near one of the two expected values is higher than near its counterpart, as shown in Fig. 7. This gives us the correct AoA as:

$\hat\theta = \begin{cases} \theta, & \text{if } L(F) \ge L(B) \\ -\theta, & \text{otherwise} \end{cases}$ (10)

where

$L(h) = p(\hat\tau_h)\,p(h), \quad h \in \{F, B\}$ (11)

where $p(\cdot)$ is the PDF of the post-rotation ITD, acquired by fitting the measured new ITD histogram, and $p(h)$ is a prior distribution on the hypotheses. We use equal priors, 1/2, for each hypothesis in Equation 11. Kernel density estimation (KDE) can be used to fit the distribution based on the histogram of the measured ITDs. By repeating the estimation K times, once per ITD cluster, we estimate the AoAs for all K sources.

In practice, however, several more issues affect the above-mentioned solution:
There is no guarantee that all ITD values will be prominent in the first measurement, because some source pairs can be image pairs, i.e., with AoAs of $\theta$ deg and $-\theta$ deg, respectively.
$p(\hat\tau_h)$ may not be higher in the expected region, because other signals can appear in the opposite region after rotation, resulting in an erroneous decision in Equation 11.
Since the ITD-to-AoA mapping is nonlinear, the variance of a peak on the ITD axis is not preserved when mapped to its AoA. Thus, the AoA estimation error varies depending on the true AoA values at each measurement. For instance, AoA values near 0 or 180 degrees result in larger estimation errors than those near 90 or $-90$ degrees for the same variance in the ITD domain.
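As an illustration, a minimal sketch of the KDE-based hypothesis test of Equations (9)–(11), with an assumed geometry and synthetic post-rotation ITD samples standing in for the measured histogram:

```python
import numpy as np
from scipy.stats import gaussian_kde

d, c, alpha = 0.05, 343.0, np.radians(15.0)

def itd(theta_rad):
    return d * np.cos(theta_rad) / c

# Suppose the pre-rotation ITD is consistent with theta = 60 deg
# (front hypothesis) or its image at -60 deg (back hypothesis).
theta = np.radians(60.0)
expected = {"front": itd(theta - alpha),        # Eq. (9), front branch
            "back": itd(-theta - alpha)}        # Eq. (9), back branch

# Synthetic post-rotation ITD samples (the source is truly in front),
# standing in for the measured ITD histogram.
rng = np.random.default_rng(2)
measured = itd(theta - alpha) + 2e-6 * rng.standard_normal(500)

pdf = gaussian_kde(measured)                    # fit p(tau') via KDE
prior = {"front": 0.5, "back": 0.5}             # equal priors (Eq. 11)
score = {h: pdf(v)[0] * prior[h] for h, v in expected.items()}
best = max(score, key=score.get)
assert best == "front"                          # ambiguity resolved
```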
To overcome these problems, let us consider taking more rotational measurements: rotating R times with step $\alpha$ to obtain the ITD distributions $H_r$, where $r = 0, \dots, R$ and $r{=}0$ is the no-rotation case.
Markov Model Testing:
After the rth rotation, if significant peaks are found in $H_r$ at ITDs $\tau_{k,r}$, where $k = 1, \dots, K$, then the local AoA values at the rth rotation for the K sources are estimated as:

$\hat\theta_{k,r} = \begin{cases} \theta_{k,r}, & \text{if } L_r(F) \ge L_r(B) \\ -\theta_{k,r}, & \text{otherwise} \end{cases}$ (12)

where

$L_r(h) = p_r(\hat\tau_{k,h})\,p(h), \quad h \in \{F, B\}$ (13)

Since the total rotation after r steps is known and equals $r\alpha$, the global AoA values from the rth measurement, $\hat\theta^{g}_{k,r}$, can be estimated as:

$\hat\theta^{g}_{k,r} = \hat\theta_{k,r} + r\alpha$ (14)

Noting that a rotational movement only changes a state to the next state, the given process can be viewed as a Markov process where each state is only affected by the previous state. Therefore, the estimation of Equation 13 can be repeated iteratively from $r{=}1$ to $R$ to generate a total of $RK$ values. By capturing the major clusters within the data $\{\hat\theta^{g}_{k,r}\}$, where $k = 1,\dots,K$ and $r = 1,\dots,R$, the mean values of the clusters yield the K global AoA estimates.
Compared to the 1-rotation estimator, the R-rotation estimator statistically mitigates the issues mentioned above by relying on multiple measurements taken at different orientations. Of course, robustness improves for greater R if the time duration of each measurement remains the same.
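The pooling step of Equation (14) can be sketched as follows, with hypothetical AoAs and Gaussian estimation noise; a nearest-truth assignment stands in for the generic clustering step:

```python
import numpy as np

rng = np.random.default_rng(3)
true_aoas = np.array([170.0, 70.0, 50.0])  # hypothetical global AoAs (deg)
alpha, R = 15.0, 6                         # rotation step and count

pooled = []
for r in range(R + 1):
    # Local AoAs in the array frame after r rotations (theta - r*alpha),
    # with per-measurement estimation noise
    local = true_aoas - r * alpha + rng.normal(0.0, 2.0, size=3)
    pooled.append(local + r * alpha)       # back to the global frame (Eq. 14)
pooled = np.concatenate(pooled)

# Cluster the pooled global estimates; sources here are well separated,
# so nearest-true-AoA assignment stands in for a generic clustering step.
labels = np.argmin(np.abs(pooled[:, None] - true_aoas[None, :]), axis=1)
est = np.array([pooled[labels == k].mean() for k in range(3)])
assert np.all(np.abs(est - true_aoas) < 3.0)  # averaging tightens estimates
```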
III-C Optimal Angle for Source Separation
Once the K AoA values are estimated, the optimal rotation angle for source separation can be found.

Case $K \le 3$: For $K \le 2$, no rotation is needed for source separation, since the system is already (over)determined and all K sources can be recovered at the same time via techniques such as ICA. For $K = 3$, a pair of sources can be aligned as explained in Section II-B to recover the remaining source. To enhance source $i$, the rotation angle for alignment is the bisecting angle between the two interfering sources:

$\alpha_i^* = \frac{\theta_j + \theta_k}{2}, \quad j, k \ne i$ (15)

By setting $\alpha = \alpha_i^*$, a determined source separation technique such as ICA can be applied to recover $s_i$.
Case $K > 3$: Even if a pair of interferers is aligned, the system is still underdetermined, because an alignment of a pair only reduces the effective K by 1, i.e., $K - 1$ can still be greater than $M = 2$. Therefore, UBSS techniques that utilize TF masks (e.g., DUET [DUET04]) can be used to separate groups of sources. TF masking methods estimate masks for each TF bin based on ITD measurements. Therefore, source separation performance depends on both clear identification of the ITD clusters and a small number of overlapping TF bins. Importantly, our proposed interference alignment approach provides benefits on both counts.
First, aligning the two nearest angular neighbors of the target source yields maximum isolation in the ITD domain from the rest of the sources, as shown in Fig. 8(a), with few exceptions (b). One way to check whether the configuration is (a) or (b) is to draw all triangles containing the target and its nearest angular neighbor sources, and count the number of acute triangles. If there is at least one acute triangle, as in (a), then aligning the nearest neighbors gives the best isolation angle. Otherwise, the best angle is AoA-dependent, as in (b)¹. Thus, there is an optimal rotation angle at which the target's ITD can be clearly identified in the histogram without interference from adjacent interferers.

¹This exceptional case requires a strict condition on the minimum angle difference between the target, its adjacent source, and all other interferers, and is not considered in this paper.
Second, the proportion of underdetermined TF bins reduces after interference alignment, which allows sparsity-based UBSS techniques [TCASI19_UBSS_Sparse_TF] to recover such TF bins. Therefore, the rotational alignment angle for $K > 3$ can be generalized as:

$\alpha^* = \frac{\theta_j + \theta_k}{2}$, where $\theta_j$ and $\theta_k$ are the target's two nearest angular neighbors (16)
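A small sketch of selecting this alignment angle for $K > 3$; the simple absolute angular distance used here is an assumption that ignores wrap-around at $\pm 180$ degrees:

```python
def alignment_angle(target, aoas):
    """Bisect the target's two nearest angular neighbors (Eq. 16 sketch).
    Angles in degrees; wrap-around at +/-180 is ignored for brevity."""
    others = sorted((a for a in aoas if a != target),
                    key=lambda a: abs(a - target))
    j, k = others[:2]              # two nearest angular neighbors
    return (j + k) / 2.0           # rotating here aliases them

# K = 4 example using the evaluation's AoA order
print(alignment_angle(70, [170, 70, 50, 120]))   # -> 85.0
```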
Target Signal Identification: At each alignment angle, a target signal feature can be used to perform voice recognition and assign scores, telling us whether the target signal is in the recovered group or the mixture group; $K{-}1$ iterations are needed in the worst case.
III-D RoSS Algorithm
The Rotational Source Separation (RoSS) algorithm contains two modules: i) Explore and ii) Exploit, as shown below:
At every rotation angle, an STFT is performed on the microphone signals after applying a low-pass filter (LPF) to prevent spatial aliasing.
RoSS runs the Explore module first to identify the AoAs. Then, it runs the Exploit module to rotate to the right alignment angle for source enhancement (back-to-back).
If any interference signal moves, the voice-recognition score would drop due to misalignment of the interferers. RoSS is then re-initiated from the beginning. Note that movement of the target source does not affect interference alignment much.
RoSS can run faster by pipelining the two modules: once a coarse AoA estimate is acquired, before exploration ends, move to one of the estimated alignment angles prematurely, then use this movement to update the AoA estimate. Repeat until the target source is found. This way, improving AoA accuracy and searching for the target source are done at once. See Fig. 9 for the detailed operation.
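A skeleton of the Exploit search, with a hypothetical `voice_score` recognizer (assumption A2) standing in for the actual voice-fingerprint module:

```python
def ross_exploit(aoas, voice_score, threshold=0.8):
    """Exploit-phase sketch: visit candidate alignment angles until the
    separated signal matches the target's voice fingerprint (A2).
    `voice_score` is a hypothetical recognizer returning a match score
    for the source enhanced when that candidate is treated as target."""
    for candidate in aoas:            # worst case: K - 1 iterations
        if voice_score(candidate) >= threshold:
            return candidate          # target found; stay at this alignment
    return None                       # no match: re-initiate Explore

# Hypothetical example: the target speaker sits at 70 degrees
found = ross_exploit([170, 70, 50],
                     voice_score=lambda a: 0.9 if a == 70 else 0.2)
assert found == 70
```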
IV Evaluation
IV-A Settings
Data Model: We test RoSS on 1-minute-long speech recordings from the LibriTTS dataset [LibriTTS]. We made each signal's power identical across the K sources. Also, the K sources are selected in a predefined AoA order of [170, 70, 50, 120, 10]. For instance, [170, 70, 50] is used for the $K{=}3$ case.
Setting | 1, 2, 3 | 4 | 5
Type | Simulation | Measurement | Measurement
Location | Computer | Outdoor | Indoor
SNR [dB] | 15 | 15.4 | 23
Reverb Time [ms] | 0, 450, 700 | 0 | N/A
Room Size [m] | 10 x 10 | | 
Distance to speakers [m] | 2.5 | 2 | 2 to 2.5
Rotation step [deg] | 15 | 15 | 15
Experiment Settings: We implemented RoSS on a custom-built rotary platform actuated by a NEMA17 stepper motor (Fig. 10(a)). The motor was controlled in open loop using a TB6600 motor driver at fixed peak rotation speed and acceleration. A ReSpeaker microphone array for Raspberry Pi is mounted on the rotary platform, and 2 adjacent microphones, with 5 cm spacing, are used to record the audio signals. Experiments were conducted in three environments, one outdoor and two indoor, as depicted in Fig. 10(b,c,d) and Table I. In each environment, the platform was placed at the center, and the K sources were played from individual speakers placed radially at a distance of 2 to 2.5 m from the platform.
Algorithms are run in Python with a sampling frequency of kHz, and STFT frame lengths of 512 or 1024 samples with 25% overlap between adjacent frames. For the source separation module, we use natural-gradient-based IVA [TASLP07_IVA], DUET [DUET04], and MVDR [Frost_LCMV] to demonstrate our concept. Other state-of-the-art UBSS/BSS techniques [GatedNN, ConvTasNet] could also be adopted for better separation performance. For the rotation step $\alpha$, we used a constant rotation of 15 degrees. Rotation can also be planned more judiciously, for instance by rotating back and forth.
In performance evaluation, however, since it is difficult to compare the enhanced target source against the true target source alone in a measurement, we measured twice, with and without interference, at different time points. Note that this naturally degrades the measured source separation performance, because the two acoustic environments, including external noise sources, can differ between the two time points.
Simulation Settings: In settings 1 to 3 in Table I, two convolutive mixtures, $x_1(t)$ and $x_2(t)$, are generated with a room impulse response (RIR) generator [rir_generator] under the following conditions.
Room size: 10 m x 10 m (2-dimensional space assumed).
Two identical omnidirectional microphones with a spacing of 5 cm are located at the center, rotating about the center.
Gaussian noise is added so that the microphone SNR is 15 dB, while maintaining a signal-to-interference ratio (SIR) of $-10\log_{10}(K-1)$ dB.
Separated sources are evaluated by comparing with each source alone, as measured at the reference microphone $x_1$.
Algorithm settings are the same as in the measurement settings, except for the kHz sampling frequency.
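The noise condition above can be reproduced with a small helper that scales white Gaussian noise to a requested microphone SNR (a generic sketch, not the authors' exact generator):

```python
import numpy as np

def add_noise_at_snr(x, snr_db, rng):
    """Add white Gaussian noise scaled so the result has the given SNR."""
    p_sig = np.mean(x ** 2)
    p_noise = p_sig / 10 ** (snr_db / 10)
    return x + rng.standard_normal(len(x)) * np.sqrt(p_noise)

rng = np.random.default_rng(4)
x = rng.standard_normal(16_000)              # stand-in for a mic mixture
y = add_noise_at_snr(x, 15.0, rng)
snr = 10 * np.log10(np.mean(x ** 2) / np.mean((y - x) ** 2))
assert abs(snr - 15.0) < 0.5                 # realized SNR near the target
```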
IV-B Performance Evaluation
AoA estimation: Fig. 11 shows AoA estimation results as a function of the number of rotational steps, for 4 different combinations of K and the per-rotation measurement time T, where K is the number of sources and T comprises the listening time and moving time per rotation; the majority is listening time, since it takes less than 0.1 seconds to rotate 15 deg. As claimed in Section III and Equation (14), the real-time estimated AoA values approach the true AoAs as the number of rotations goes up. In the ideal setting with no reverberation (Setting 1), only two rotations, i.e., two seconds, were needed to get all AoA values correct in both cases. As the channel becomes reverberant or noisy, more steps provide a benefit in estimating the true AoA values. Also, a longer T seems to help with identifying the major peaks at the initial step, but it does not remove the intrinsic convergence offset caused by reverberation.
Performance Metric: For the source separation metric, we utilize the scale-invariant signal-to-distortion ratio (SI-SDR) and SI-SDRi [Def_SISDR, TrainigNoisy], where SI-SDRi captures the improvement in SI-SDR after source separation: SI-SDR(estimate, target) − SI-SDR(mixture, target). In this work, SI-SDRi is used to deal with noisy outdoor data, where both the mixture and the target recordings are noisy due to wind.
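For reference, these metrics can be computed as follows (a standard implementation of the SI-SDR definition cited above; the test signals are synthetic stand-ins):

```python
import numpy as np

def si_sdr(estimate, target):
    """Scale-invariant SDR in dB: project the estimate onto the target,
    then compare the scaled-target energy against the residual energy."""
    a = np.dot(estimate, target) / np.dot(target, target)   # optimal scale
    s = a * target                                          # target component
    e = estimate - s                                        # distortion
    return 10 * np.log10(np.dot(s, s) / np.dot(e, e))

def si_sdri(estimate, mixture, target):
    """SI-SDR improvement over the unprocessed mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)

rng = np.random.default_rng(5)
t = rng.standard_normal(16_000)
noisy = t + 0.1 * rng.standard_normal(16_000)
assert 15 < si_sdr(noisy, t) < 25                        # about 20 dB
assert np.isclose(si_sdr(3.0 * noisy, t), si_sdr(noisy, t))  # scale-invariant
```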
Benefit of RoSS: Let us validate the effectiveness of interference alignment by considering the impact of the initial orientation on source separation. While the AoAs of the K sources are typically randomly distributed, conventional schemes such as IVA and DUET show huge variations in performance depending on the microphone array orientation in underdetermined scenarios (Fig. 12). As expected, source separation is optimal at the proposed alignment angles of Equations (15) and (16), each resulting in local AoAs of [110, 10, 110] and [110, 10, 110, 180] when the target was at 70 deg with zero offset. That is, in RoSS, mobility provides the opportunity to rotate to that optimal angle regardless of the initial orientation or the source AoA distribution. Therefore, the area filled in yellow between RoSS (dotted line) and the other methods represents the statistical gain of RoSS in source separation.
Source Separation: We show evaluation results on the average SI-SDRi for each source under different settings in Fig. 13. In (a), about 1 degree of freedom of gain is clear for AoA-informed RoSS and RoSS over the other algorithms. In (b), for settings 1 to 4, as the environment becomes more reverberant, SI-SDRi values drop because both AoA estimation and interference alignment start to fail. However, the setting-5 result shows poor performance even though it has almost no reverberation, since it is an outdoor measurement in the field. The major cause is thought to be external noise, such as a constant wind effect.
V Conclusion and Future Work
In this paper, we presented RoSS, which blindly searches for AoAs via rotational motion to align a pair of interference signals and thereby separate the target sound source in underdetermined scenarios. While we demonstrated the concept with unsupervised blind schemes assuming mutually independent sources, the proposed idea of reducing the gap between K and M is applicable to many other UBSS/BSS methods, supervised or unsupervised, even when the sources are partially correlated. We also believe that applications where AoAs or source locations are readily available, e.g., audio-visual applications, can benefit directly from the proposed interference alignment, as demonstrated in Fig. 13 with informed RoSS, which bypasses the AoA estimation step to gain robustness against high K and reverberation.
As a next step, further investigation can exploit continuous motion, since we currently consider only discrete rotations. Also, the real-time score of the recovered target source can be utilized to find the alignment angle.