
RoSS: Utilizing Robotic Rotation for Audio Source Separation

by   Hyungjoo Seo, et al.

This paper considers the problem of audio source separation where the goal is to isolate a target audio signal (say Alice's speech) from a mixture of multiple interfering signals (e.g., when many people are talking). This problem has gained renewed interest mainly due to the significant growth in voice controlled devices, including robots in homes, offices, and other public facilities. Although a rich body of work exists on the core topic of source separation, we find that robotic motion of the microphone – say the robot's head – is a complementary opportunity to past approaches. Briefly, we show that rotating the microphone array to the correct orientation can produce desired aliasing between two interferers, causing the two interferers to pose as one. In other words, a mixture of K signals becomes a mixture of (K-1), a mathematically concrete gain. We show that the gain translates well to practice provided two mobility-related challenges can be mitigated. This paper is focused on mitigating these challenges and demonstrating the end-to-end performance on a fully functional prototype. We believe that our Rotational Source Separation module RoSS could be plugged into actual robot heads, or into other devices (like Amazon Show) that are also capable of rotation.



I Introduction

As speech recognition and conversational AI mature, voice interactions with robots will become even more popular. Robots in homes, hospitals, restaurants, and airports will all be able to converse with humans. In this context, recognizing the human's speech is critical, especially since many of these interactions will happen in noisy environments. In signal processing, this problem is called "source separation", referring to the ability to separate a voice signal from a mixture of multiple signals. Source separation has been studied extensively and today's results are impressive, to the extent that K source signals can be separated using M microphones, even when K is slightly larger than M. Observe that the problem is particularly challenging not only because the sources are unknown, but also because the channels (over which the signals arrive at the microphones) are unknown. Hence, this problem is specifically known as under-determined blind source separation (UBSS).

A rich body of work has concentrated on the UBSS problem, and today's techniques range from unsupervised methods (e.g., ICA, IVA, Adaptive Beamforming (ABF)), to speech-specific techniques (e.g., DUET, Bayesian-DUET), to compressed sensing and supervised deep learning approaches [Wiley_ICA, TASLP07_IVA, Frost_LCMV, DUET04, CompressedSens, ILRMA_sawada_ono_kameoka_kitamura_saruwatari_2019, GatedNN]. However, for UBSS problems, the majority of past works rely on interpolations and regressions, since source signal information is lost during the mixing process due to its under-determined nature. Therefore, their performance degrades, understandably, as the number of sources K increases for a given number of microphones M. Said differently, any reduction in the K-M gap can immediately improve the quality of signal separation.

This paper proposes to leverage robotic mobility to reduce the gap between K and M. At a high level, we intend to rotate a microphone array to an orientation such that two interfering signals appear as one interference in this orientation. The intuition is rooted in angular aliasing, where two signals arriving from completely different directions will produce the same relative delays at the microphone array, if the array is oriented at the correct angle. This correct orientation occurs when the line joining the microphones bisects the two interferers, as shown in Figure 1. Thus, deliberate angular aliasing transforms a K=3, M=2 system into a K=2, M=2 system, making it solvable. With many more sources in the real world, reducing K to K-1 also offers a clear improvement.
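The aliasing condition can be checked numerically. The sketch below is our own illustration (not the paper's code); it assumes far-field plane waves, the prototype's 5 cm microphone spacing, and example interferer angles of 30 and 90 degrees:

```python
import numpy as np

C = 343.0   # speed of sound [m/s]
D = 0.05    # microphone spacing [m], 5 cm as in the paper's prototype

def tdoa(aoa_deg, rotation_deg=0.0):
    """TDOA between two microphones for a far-field source.

    A linear 2-mic array only measures cos(.) of the source angle
    *relative to the array orientation* -- this is what creates the
    aliasing opportunity.
    """
    rel = np.deg2rad(aoa_deg - rotation_deg)
    return D / C * np.cos(rel)

# Two interferers at 30 and 90 degrees; the line joining the mics
# bisects them when the array is rotated to (30 + 90) / 2 = 60 degrees.
theta_a, theta_b = 30.0, 90.0
bisect = (theta_a + theta_b) / 2.0

print(tdoa(theta_a), tdoa(theta_b))                  # distinct before rotation
print(tdoa(theta_a, bisect), tdoa(theta_b, bisect))  # equal after rotation
```

After the rotation the relative angles become +30 and -30 degrees, whose cosines coincide, so the two interferers produce identical relative delays.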

Fig. 1: Rotation of the microphone array to the correct orientation (i.e., bisecting the source signals) produces the desired spatial aliasing.
Fig. 2: (a) 2-microphone array faced with 3 sources, resulting in a UBSS problem. (b) Rotation causes the interferers to arrive over the same absolute AoA (θ and -θ). (c) The steering vectors for the interferers get aliased, resulting in a determined system.

Separating sources via angular aliasing presents two challenges:

  1. Since the angles of arrival (AoA) of the K signals are not known, the correct microphone orientation is unknown as well. Estimating all K AoAs is difficult with only M < K microphones, and worse, AoA estimates are plagued by front-back ambiguities (i.e., it is difficult to tell whether a signal is arriving from a direction in front, or from the back).

  2. Even if the AoAs are estimated, it is not clear which interferers should be aligned to maximize source separation performance. In fact, should the optimal orientation always bisect two interferers? Or could partially aligning multiple interferers maximize separation?

This paper addresses these two problems in Section III through a mobility-guided time-frequency method, and then combines it with established source separation (SS) techniques to achieve improved performance. Our RoSS algorithm is complementary to, hence compatible with, all SS techniques. We implement RoSS on a rotating microphone prototype (standing in for a robot head), and perform experiments in simulated and uncontrolled environments (Section IV). Results show that RoSS achieves high scale-invariant signal-to-distortion ratio (SI-SDR) and SI-SDRi [Def_SISDR, TrainigNoisy], consistently outperforming existing UBSS/BSS methods. We believe RoSS could also be effective with smartphones, earbuds, moving video-conference systems, and surveillance cameras, all of which have a limited number of microphones but contain actuators or inertial measurement units (IMUs) for angular rotation and sensing.

II Formulation and Opportunity

II-A Signal Model

Let s_T, s_A, s_B be 3 source signals, of which s_T is the target and the others are interference (Fig. 2(a)). A linear M=2 microphone array receives the mixture of these signals as x_1 and x_2, and we designate x_1 as the reference for relative delay calculations. The signals travel from the far-field over AoAs θ_k (k=T,A,B). We explain our proposed method with K=3 signals and consider K>3 later.

We make the following Assumptions:
(A1) The sound sources are human speech, widely assumed to be mutually independent, non-Gaussian signals.
(A2) Once a speech has been separated, it is possible to tell if it is from the target user (i.e., a voice fingerprint is available).
(A3) Sources are not moving in the time scale of seconds.

Thus, the received (convolutive) signal mixture is:

x_1(t) = s_T(t) + s_A(t) + s_B(t)
x_2(t) = s_T(t - τ_T) + s_A(t - τ_A) + s_B(t - τ_B)

where τ_k = (d/c)·cos(θ_k), (k=T,A,B), are the time-differences-of-arrival (TDOA) between the microphones, while c and d denote the sound propagation speed and the spacing between microphones, respectively.

Thus, in the time-frequency domain:

X_1(t,f) = S_T(t,f) + S_A(t,f) + S_B(t,f)
X_2(t,f) = e^{-j2πf·τ_T} S_T(t,f) + e^{-j2πf·τ_A} S_A(t,f) + e^{-j2πf·τ_B} S_B(t,f)

In matrix form, this equation becomes:

X(t,f) = [a_T(f)  a_A(f)  a_B(f)] · [S_T(t,f), S_A(t,f), S_B(t,f)]^T

where a_k(f) = [1, e^{-j2πf·τ_k}]^T (k=T,A,B) is the steering vector. Note that even if all θ_k's are known, the 2×3 system is still under-determined.

II-B Interference Alignment

What if we rotate the array such that the line joining the microphones bisects the two interferers? While the correct rotation angle needs to be inferred blindly, for now let us assume we know it. Fig. 2(b) shows the outcome. Since the new AoAs of the two interferers are now θ_I and -θ_I, their corresponding TDOAs become equal, or aliased:

τ_A = (d/c)·cos(θ_I) = (d/c)·cos(-θ_I) = τ_B ≜ τ_I

Thus, in the frequency domain, interferers A and B have identical steering vectors a_A(f) = a_B(f) ≜ a_I(f), where a_I(f) = [1, e^{-j2πf·τ_I}]^T. Hence, the new measurement vector is:

X(t,f) = a_T(f)·S_T(t,f) + a_I(f)·(S_A(t,f) + S_B(t,f))

This expression means that the array senses two groups of signals, not three (i.e., the target and the sum of the two interferers). Fig. 2(c) shows the two signals arriving from distinct angles. This produces a determined system of equations, except that one of the mixed signals, arriving from AoA θ_I, is actually a sum of independent sources. If the sum is independent of the target signal (as shown next), we can apply classical source separation.
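To see why the aliased system becomes solvable, one can build the steering vectors at a single frequency bin and check that the effective mixing matrix is 2×2 and full rank. A minimal sketch; the angles, the frequency bin, and the function names are our assumptions:

```python
import numpy as np

C, D, F = 343.0, 0.05, 1000.0  # sound speed [m/s], mic spacing [m], one frequency bin [Hz]

def steering(aoa_deg, rotation_deg=0.0):
    """2-element far-field steering vector [1, e^{-j 2 pi f tau}]^T."""
    tau = D / C * np.cos(np.deg2rad(aoa_deg - rotation_deg))
    return np.array([1.0, np.exp(-2j * np.pi * F * tau)])

target, intf_a, intf_b = 150.0, 30.0, 90.0
rot = (intf_a + intf_b) / 2.0  # bisecting orientation

a_t = steering(target, rot)
a_a = steering(intf_a, rot)
a_b = steering(intf_b, rot)

# After rotation the interferers share one steering vector ...
print(np.allclose(a_a, a_b))
# ... so the effective mixing matrix is 2x2 and invertible:
A = np.column_stack([a_t, a_a])
print(np.linalg.matrix_rank(A))
```

The 2×3 mixing matrix collapses to a 2×2 matrix whose columns (target vs. aliased interferers) remain distinct, i.e., a determined system.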

II-C Sum of Mutually Independent Sources

We briefly show that a mixture of two independent sources remains independent from the third source when all three are mutually independent. Define X_1, X_2, and X_3 as mutually independent continuous random variables, and let Y = X_1 + X_2 be a fourth random variable. Let F_i and f_i be the CDF and PDF of variable i, respectively. Then, the joint distribution of Y and X_3 can be written as:

F_{Y,X_3}(y, x_3) = P(X_1 + X_2 ≤ y, X_3 ≤ x_3)
                  = (∬_{x_1+x_2 ≤ y} f_1(x_1) f_2(x_2) dx_1 dx_2) · F_3(x_3)
                  = F_Y(y) · F_3(x_3)

Therefore, Y and X_3 are also mutually independent [book_stat].
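A quick numerical sanity check of this property (our own illustration, not from the paper): drawing mutually independent non-Gaussian samples, the sum X_1 + X_2 stays empirically uncorrelated with X_3. Note that near-zero correlation is only a necessary symptom of independence, not a proof:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
# Three mutually independent, non-Gaussian (uniform) sources:
x1, x2, x3 = rng.uniform(-1, 1, (3, n))

y = x1 + x2  # the "aliased" sum of the two interferers
corr = np.corrcoef(y, x3)[0, 1]
print(abs(corr))  # close to 0: the sum shows no linear dependence on x3
```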

III Alignment by Rotational Motion

Our goal now is to correctly rotate the microphone so that interferers get aligned. For this, we first need to estimate the source AoAs, and then determine the correct microphone rotation as a function of these estimated AoAs.

III-A Estimating AoAs in Under-determined Scenarios

Estimating K AoAs with M < K microphones is known to be a hard problem for general signals. However, the literature has shown promise with speech signals due to what is known as the W-Disjoint Orthogonality (WDO) property [DUET04]. Briefly, extensive experiments have shown that speech from two humans has a low probability of collision in a given time-frequency (TF) bin. Thus, if one calculates the TDOA for each TF bin — called the inter-microphone time difference (ITD) — one can extract information about the AoAs. Fig. 3 illustrates this with a toy example of red and blue signals; the ITDs calculated from the red and blue TF bins form clusters. The means of these clusters partly reveal the red/blue signals' AoAs.

Fig. 3: ITD computed from TF bins produce 2 clusters around two mean ITDs. These mean ITDs are estimates of AoA.

Unfortunately, the mapping between ITD and AoA is not 1:1, because the AoAs θ and -θ both produce identical ITDs at the microphone array. In Fig. 4, see how 2 ITD clusters get mapped to 4 AoAs (of which 2 AoAs are spurious). This is classically known as the front-back ambiguity. Rotating the microphone to the correct orientation would obviously need to resolve this ambiguity first.

Fig. 4: 2 ITD clusters get mapped to 4 clusters in AoA space.

III-B Rotation Resolves AoA Ambiguity

To resolve the AoA ambiguity, we propose to rotate the microphone and observe the change in the ITDs. Depending on whether the ITD change is positive or negative, it is possible to resolve whether the source's AoA is in front (0 < θ < 180 deg) or in back (-180 < θ < 0 deg). Fig. 5(a) illustrates an example counter-clockwise rotation of α. From the microphone's reference frame, the source AoAs rotate in the clockwise direction. If the original AoA was in front, then Fig. 5(c) shows how the new ITD increases (i.e., a right shift on the ITD axis), and vice versa when the AoA is in back.

Fig. 5: (a) 2-microphone array rotating by α when faced with a source. (b) Front-back ambiguity in AoA. (c) Rotational motion is equivalent to all sources rotating in the opposite direction, and the AoA ambiguity disappears.

Equations 6 and 7 show this analytically, where τ' is the ITD after rotation, and Δτ is the change in ITD after the rotation:

τ' = (d/c)·cos(θ - α)
Δτ = τ' - τ = (d/c)·[cos(θ - α) - cos(θ)]

As discussed earlier, the sign of Δτ differs based on whether the source is in front or in back. This sign offers a reliable feature to resolve the front-back ambiguity. Of course, challenges emerge in real scenarios, discussed next.
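The sign test can be sketched in a few lines. This is our own illustration; the rotation step and the angle convention (relative AoA = θ - α under counter-clockwise array rotation) are assumptions consistent with the text:

```python
import numpy as np

C, D = 343.0, 0.05  # sound speed [m/s], mic spacing [m]

def itd(aoa_deg, rotation_deg=0.0):
    """ITD of a far-field source at the given array orientation."""
    return D / C * np.cos(np.deg2rad(aoa_deg - rotation_deg))

def front_or_back(itd_before, itd_after):
    """A counter-clockwise rotation by a small alpha > 0 raises the ITD of
    front sources (0 < theta < 180 deg) and lowers it for back sources."""
    return "front" if itd_after > itd_before else "back"

alpha = 15.0  # rotation step [deg]
for theta in (60.0, -60.0):      # an ambiguous front/back pair: same ITD at rest
    before = itd(theta, 0.0)
    after = itd(theta, alpha)    # array rotated counter-clockwise by alpha
    print(theta, front_or_back(before, after))
```

At rest, +60 and -60 degrees are indistinguishable (equal ITDs); the sign of the ITD change after rotation separates them.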

What happens when real situations have more than one source? Consider K=4 sources in Fig. 6(a). For easier notation, let us index the signals as k = 1, …, K. Thus, if the ITD of signal k was τ_k before rotation, then after rotation the ITD becomes:

τ'_k = (d/c)·cos(θ_k - α)

where θ_k was the true global AoA value before rotation.

As shown in Fig. 6(b), the sign-based feature becomes challenging as the peaks from the θ and -θ pairs begin to merge in a crowded histogram. The situation is worse when K becomes larger or when the measurement is noisy. This motivates our following approach to identify the K AoAs using α and the 2 ITD measurements (before and after rotation).

Fig. 6: (a) 2-microphone array rotating by α when faced with K=4 sources. (b) Rotation shifts the ITD values of the K sources accordingly.

Hypothesis Testing:

From a current ITD estimate τ_k, and after a rotation of α, we can derive two expected ITDs as:

τ'_{k,F} = (d/c)·cos(θ_k - α),   τ'_{k,B} = (d/c)·cos(-θ_k - α)

where θ_k = cos⁻¹(c·τ_k/d) still has the front-back ambiguity. The subscripts F and B denote the front and back hypotheses, respectively.

We now apply binary hypothesis testing by comparing the expected ITDs with the measured ITD histogram obtained after rotation. This reveals the more likely of the two hypotheses, essentially a maximum likelihood estimate (MLE). Said differently, the probability density of the measured ITDs near one of the two expected values should be higher than near its counterpart, as shown in Fig. 7. This gives us the correct AoA as:

θ̂_k = argmax_{θ ∈ {+θ_k, -θ_k}} p(τ'(θ)) · p(θ)

where p(τ') is the PDF of the new ITDs, acquired by fitting the measured new ITD histogram, and p(θ) is a prior distribution on θ. We use equal priors, 1/2, for each θ in Equation 11. Kernel density estimation (KDE) can be used to fit the distribution based on the histogram of measured ITDs. By repeating the estimation for k = 1, …, K, we estimate the AoAs of all K sources.
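The hypothesis test of Equation 11 can be sketched with an off-the-shelf KDE, here scipy's `gaussian_kde`. All names, the example angle, and the synthetic per-bin ITD noise model are our assumptions:

```python
import numpy as np
from scipy.stats import gaussian_kde

C, D = 343.0, 0.05
alpha = 15.0  # rotation step [deg]

def itd(theta_deg):
    return D / C * np.cos(np.deg2rad(theta_deg))

# True source in front at 60 deg; the ITD alone cannot distinguish 60 from -60.
theta_hat = 60.0
rng = np.random.default_rng(1)
# Noisy per-TF-bin ITDs measured *after* rotating the array by alpha:
measured = itd(theta_hat - alpha) + rng.normal(0, 2e-6, 4000)

pdf = gaussian_kde(measured)          # fit the new ITD histogram
tau_front = itd(+theta_hat - alpha)   # expected ITD under the front hypothesis
tau_back = itd(-theta_hat - alpha)    # expected ITD under the back hypothesis

# Equal priors (1/2 each), so the MLE just compares the fitted density:
decision = "front" if pdf(tau_front)[0] > pdf(tau_back)[0] else "back"
print(decision)
```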

Fig. 7: Binary hypothesis testing for the estimation of the K AoA values using the rotation angle α.

In practice, however, several more issues affect the above-mentioned solution:

  1. There is no guarantee that all K ITD values will be prominent in the first measurement, because some source pairs can be image pairs, i.e., with AoAs of θ deg and -θ deg, respectively, producing overlapping ITD peaks.

  2. p(τ') may not be higher in the expected region, because other signals can appear in the opposite region after rotation, resulting in an erroneous decision in Equation 11.

  3. Since the τ-to-θ mapping is non-linear, the variance of a peak on the ITD axis is not preserved when mapped to its AoA. Thus, the AoA estimation error varies depending on the true AoA values at each measurement. For instance, AoA values near 0 or 180 degrees result in larger estimation errors than those near 90 or -90 degrees, for the same variance in the ITD domain.

To overcome these problems, let us consider taking more rotational measurements, rotating R times by α each to get R+1 ITD distributions p_r(τ), where r = 0, 1, …, R and r = 0 is the no-rotation case.

Markov Model Testing:
After the r-th rotation, if significant peaks are found in p_r(τ) at τ_{k,r}, where k = 1, …, K, then the local AoA values at the r-th rotation for the K sources are estimated as:

θ_{k,r} = ± cos⁻¹(c·τ_{k,r}/d)

with the sign resolved by the hypothesis test of Equation 11.

Since the total rotation after r steps is known and equals r·α, the global AoA values from the r-th measurement, θ̂_{k,r}, can be estimated as:

θ̂_{k,r} = θ_{k,r} + r·α

Noting that each rotational movement only changes a state to the next state, the given process can be viewed as a Markov process where the (r+1)-th state is only affected by the r-th state. Therefore, the estimation of Equation 13 can be repeated iteratively from r = 1 to r = R to generate a total of R·K values of θ̂_{k,r}. By capturing the major clusters within the data θ̂_{k,r}, where k = 1, …, K and r = 1, …, R, the mean values of the clusters yield the K global AoA estimates θ̂_k.
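The R-rotation estimator can be sketched as follows. For brevity, a per-source mean replaces the paper's clustering step, and the front-back sign is assumed already resolved; the names, noise model, and source angles are our assumptions:

```python
import numpy as np

C, D = 343.0, 0.05
ALPHA = 15.0  # per-step rotation [deg]

def itd(theta_deg):
    return D / C * np.cos(np.deg2rad(theta_deg))

def local_aoa_mag(tau):
    """|theta| recovered from an ITD; the sign is resolved separately (Sec. III-B)."""
    return np.degrees(np.arccos(np.clip(tau * C / D, -1.0, 1.0)))

true_global = np.array([170.0, 70.0, -50.0])  # K = 3 sources
rng = np.random.default_rng(2)

estimates = []
for r in range(1, 9):                                 # R = 8 rotations
    rel = true_global - r * ALPHA                     # angles in the array frame
    taus = itd(rel) + rng.normal(0, 5e-7, rel.size)   # noisy ITD peaks
    theta_local = np.sign(rel) * local_aoa_mag(taus)  # sign assumed resolved
    estimates.append(theta_local + r * ALPHA)         # back to the global frame

est = np.mean(estimates, axis=0)  # simple stand-in for per-source clustering
print(np.round(est, 1))
```

Averaging over the R orientations washes out the per-orientation ITD-to-AoA sensitivity (issue 3 above), which is the statistical benefit the text describes.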

Compared to the 1-rotation estimator, the R-rotation estimator statistically mitigates the issues mentioned above by relying on multiple measurements taken at different orientations. Of course, robustness improves for greater R if the time duration of each measurement remains the same.

III-C Optimal Angle for Source Separation

Once K AoA values are estimated, the optimal rotation angle for source separation can be found.

Case K ≤ M+1: For K ≤ M, no rotation is needed for source separation since the system is already (over-)determined and all K sources can be recovered at once via techniques such as ICA. For K = M+1, a pair of sources can be aligned as explained in Section II-B to recover the remaining sources. To enhance source i, the rotation angle for alignment is the bisecting angle between the two interfering sources j and k:

φ_i = (θ̂_j + θ̂_k)/2

By rotating to φ_i, a determined source separation technique such as ICA can be applied to recover s_i.
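Under the convention that alignment requires the microphone axis to bisect the two interferers, the alignment angle above can be sketched as follows (our own illustration; the paper's exact sign/offset convention may differ):

```python
import numpy as np

C, D = 343.0, 0.05

def alignment_angle(theta_j, theta_k):
    """Orientation that aliases two interferers: the mic axis bisects them,
    so their relative AoAs become +/- (theta_j - theta_k)/2."""
    return (theta_j + theta_k) / 2.0

def itd(theta_deg, rotation_deg):
    return D / C * np.cos(np.deg2rad(theta_deg - rotation_deg))

# K = 3 example: target at 170 deg, interferers at 70 and -50 deg.
phi = alignment_angle(70.0, -50.0)
print(phi)                                    # the exploit-module rotation target
print(itd(70.0, phi), itd(-50.0, phi))        # equal: interferers aliased
print(itd(170.0, phi))                        # distinct: target stays separable
```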

Case K > M+1: Even if a pair of interferers is aligned, the system is still under-determined, because aligning one pair only reduces the effective K by 1, i.e., K-1 can still be greater than M. Therefore, UBSS techniques that utilize TF masks (e.g., DUET [DUET04]) can be used to separate the groups of sources. TF-masking methods estimate masks for each TF bin based on ITD measurements. Therefore, source separation performance depends on both a clear identification of the target's ITD and a small number of overlapping TF bins. Importantly, our proposed interference alignment approach provides benefits on both counts.

First, aligning the two nearest angular neighbors of the target source yields maximum isolation of the target in the ITD domain from the rest of the sources, as shown in Fig. 8(a), with few exceptions (b). One way to check whether a configuration is of type (a) or (b) is to draw all triangles containing the target and its nearest angular neighbor sources, and count the number of acute triangles. If there is at least 1 acute triangle, as in (a), then aligning the nearest neighbors gives the best isolation angle. Otherwise, the best angle is AoA dependent, as in (b); this case is not considered in this paper since its condition is strict. Thus, there is an optimal rotation angle where the target's ITD can be clearly identified in the histogram p(τ) without interference from adjacent interferers.

Second, the proportion of under-determined TF bins reduces after interference alignment, which allows sparsity-based UBSS techniques [TCASI19_UBSS_Sparse_TF] to recover such TF bins. Therefore, the rotational alignment angle generalizes to the K > M+1 case as well.

Fig. 8: In the ITD-AoA plot, there are two types of maximum target isolation in the ITD domain. (a) Nearest neighbors are aligned. (b) No sources are aligned.

Target Signal Identification: At each alignment angle, a target-signal feature can be used to perform voice recognition and assign scores, to determine whether the target signal is in the recovered group or in the mixture group; K-1 iterations are needed in the worst-case scenario.

III-D RoSS Algorithm

The Rotational Source Separation (RoSS) algorithm contains two modules: i) Explore and ii) Exploit, as shown below:

At every rotation angle, an STFT is performed on the microphone signals after applying a low-pass filter (LPF) to prevent spatial aliasing.

RoSS first runs the Explore module to identify the AoAs. Then it runs the Exploit module, back-to-back, to rotate to the right alignment angle for source enhancement.

If any interference signal moves, the voice-recognition Score drops due to misalignment of the interferers, and RoSS is re-initiated from the beginning. Note that movement of the target signal does not affect interference alignment much.

RoSS can run faster by pipelining the two modules: once a rough AoA estimate is acquired, before exploration ends, the platform moves to one of the estimated alignment angles prematurely and then uses this movement to update the AoA estimates. This is repeated until the target source is found. This way, improving AoA accuracy and searching for the target source are done at once. See Fig. 9 for detailed operation.

Input: Starting position
Initialize r = 0, Δ_total = 0;
while Δ_total ≤ 180 [deg] do
        Fit the ITD histogram with KDE to get p_r(τ);
        Find prominent peaks τ_{k,r} from p_r(τ);
        if r = 0 then
               r ← r+1, Δ_total ← Δ_total + α, Rotate by α;
        end if
        From τ_{k,r}, get the expected ITDs (9);
        From p_r(τ), get the local AoAs (12)(13);
        Update the global AoA estimates (14);
        K-Means() w/ inertia → cluster means;
        (break if the cluster means do not change 2 times in a row) r ← r+1, Δ_total ← Δ_total + α, Rotate by α;
end while
Save the cluster means to θ̂_k, save the cluster count to K;
Algorithm 1 Explore Module
Output: ŝ_i (source i of interest)
From K, determine the candidate alignment angles using θ̂_k;
J ← candidate alignment angles;
for j in J do
        Rotate to φ_j;
        Record for a small time duration [s];
        if K-1 > M then
               Perform DUET to get K-1 signal groups;
        else if K-1 ≤ M then
               Perform IVA to get K-1 signal groups;
        end if
        VoiceRecognition(K-1 groups) → Scores;
        if max(Scores) ≥ Decision Criteria then
              break;
        end if
end for
Perform IVA or DUET for [s] to get ŝ_i;
Algorithm 2 Exploit Module
Fig. 9: System-level demonstration example of pipelined RoSS

IV Evaluation

IV-A Settings

Data Model: We test RoSS on 1-minute-long speech recordings from the LibriTTS dataset [LibriTTS]. We made each signal's power identical across the K sources. The K sources are selected in a pre-defined AoA order of [170, 70, -50, -120, 10]; for instance, [170, 70, -50] is used for the K=3 case.

Setting                    | 1, 2, 3      | 4            | 5
Type                       | Simulation   | Measurement  | Measurement
Location                   | Computer     | Outdoor      | Indoor
SNR [dB]                   | 15           | 15.4         | 23
Reverb time [ms]           | 0, 450, 700  | 0            | N/A
Room size                  |              |              |
Distance to speakers [m]   | 2.5          | 2            | 2 to 2.5
α [deg]                    | 15           | 15           | 15
TABLE I: Details of the 5 evaluation settings
Fig. 10: (a) Custom-built rotary platform with microphones. (b) Outdoor field (Setting 4). (c) Conference room (Setting 5). (d) Testing scenes.

Experiment Settings: We implemented RoSS on a custom-built rotary platform actuated by a NEMA-17 stepper motor (Fig. 10(a)). The motor was controlled in open loop using a TB-6600 motor driver at fixed peak rotation speed and acceleration. A ReSpeaker microphone array for Raspberry Pi is mounted on the rotary platform, and 2 adjacent microphones, with 5 cm spacing, are used to record the audio signals. Experiments were conducted in three environments, one outdoor and two indoor locations, as depicted in Fig. 10(b,c,d) and Table I. In each environment, the platform was placed at the center and the K sources were played from individual loudspeakers placed radially at a distance of 2 to 2.5 m from the platform.

The algorithms run in Python with STFT frame lengths of 512 or 1024 samples and 25% overlap between adjacent frames. For the source separation module, we use natural gradient-based IVA [TASLP07_IVA], DUET [DUET04], and MVDR [Frost_LCMV] to demonstrate our concept. Other state-of-the-art UBSS/BSS techniques [GatedNN, ConvTasNet] could also be adopted for better separation performance. For the rotation step α, we used a constant rotation of 15 degrees. The rotation can also be planned judiciously, for instance by rotating back and forth.

However, for the performance evaluation of the measured data, it is difficult to compare the enhanced target source against the true target source alone, so we recorded twice, with and without interference, at different time points. Note that this naturally degrades the measured source separation performance, because external noise sources can make the two acoustic environments differ between the two time points.

Simulation Settings: In settings 1-to-3 in Table I, two convolutive mixtures, x_1 and x_2, are generated with a room impulse response (RIR) generator [rir_generator] under the following conditions.

Room size: 10 m x 10 m (a 2-dimensional space is assumed).

Two identical omni-directional microphones with a spacing of 5 cm are located at the center, rotating around the center.

Gaussian noise is added so that the microphone SNR is 15 dB, while maintaining an SIR of -10·log10(K-1) dB.

Separated sources are evaluated by comparison with each source alone, measured at the reference microphone x_1.

Algorithm settings are similar to the measurement settings except for the sampling frequency.

IV-B Performance Evaluation

Fig. 11: AoA estimation results over time in different settings and with different numbers of sources and unit times. Dotted lines represent the true AoAs.

AoA Estimation: Fig. 11 shows AoA estimation results versus the number of rotational steps in 4 different combinations of K and the unit time per rotation, where K is the number of sources and the unit time comprises the listening time and moving time per rotation (mostly listening time, since it takes less than 0.1 seconds to rotate 15 deg). As claimed in Section III and (14), the real-time estimated AoA values approach the true AoAs as the number of rotations goes up. In the ideal setting with no reverberation (Setting 1), only two rotations, i.e., two seconds, were needed to get all AoA values correct in both cases. As the channel becomes reverberant or noisy, more steps provide benefit in estimating the true AoA values. Also, a longer unit time seems to help with identifying the major peaks at the initial step, but it does not remove the intrinsic convergence offset caused by reverberation.

Fig. 12: Source separation performance with initial orientation offset. Zero offset at (a) [170, 70, -50], (b) [170, 70, -50, -120].

Performance Metric: For the source separation metric, we utilize the scale-invariant signal-to-distortion ratio (SI-SDR) and SI-SDRi [Def_SISDR, TrainigNoisy], where SI-SDRi captures the improvement in SI-SDR after source separation: SI-SDR(estimate, target) - SI-SDR(mixture, target). In this work, SI-SDRi is used to deal with noisy outdoor data, where both the mixture data and the target data are noisy due to wind.
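For reference, the metric can be computed in a few lines. This is a sketch following the standard SI-SDR definition of [Def_SISDR]; the synthetic signals below are our own toy example:

```python
import numpy as np

def si_sdr(estimate, target):
    """Scale-invariant SDR in dB: project the estimate onto the target and
    compare the retained target energy against the residual."""
    target = np.asarray(target, float)
    estimate = np.asarray(estimate, float)
    s = np.dot(estimate, target) / np.dot(target, target) * target  # scaled target
    e = estimate - s                                                # residual
    return 10 * np.log10(np.dot(s, s) / np.dot(e, e))

def si_sdr_improvement(estimate, mixture, target):
    """SI-SDRi = SI-SDR(estimate, target) - SI-SDR(mixture, target)."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)

rng = np.random.default_rng(3)
target = rng.normal(size=16000)
noise = rng.normal(size=16000)
mixture = target + noise          # roughly 0 dB SI-SDR
enhanced = target + 0.1 * noise   # separation suppressed the noise by 20 dB
print(round(si_sdr_improvement(enhanced, mixture, target), 1))
```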

Benefit of RoSS: Let us validate the effectiveness of interference alignment by considering the impact of the initial orientation on source separation. While the AoAs of the K sources are typically randomly distributed, conventional schemes such as IVA and DUET show huge variations in performance depending on the microphone array orientation in under-determined scenarios (Fig. 12). As expected, source separation is optimal at the proposed alignment angles (15)(16), which result in local AoAs of [110, 10, -110] and [110, 10, -110, 180] when the target was at 70 deg with zero offset. That is, in RoSS, mobility provides the opportunity to rotate to that optimal angle regardless of the initial orientation or the source AoA distribution. Therefore, the area filled in yellow between RoSS (dotted line) and the other methods represents the statistical gain of RoSS in source separation.

Fig. 13: Average SI-SDRi compared across (a) various algorithms and (b) different setups, where X markers are AoA-informed RoSS.

Source Separation: Fig. 13 shows the average SI-SDRi over each source for the different settings. In (a), about 1 degree-of-freedom of gain is clear for AoA-informed RoSS and RoSS over the other algorithms. In (b), for settings 1-to-4, SI-SDRi drops as the environment becomes more reverberant, because both AoA estimation and interference alignment start to fail. However, setting 5 shows poor performance even though it has almost no reverberation, since it is an outdoor field measurement; the major cause is thought to be external noise such as the constant wind.

V Conclusion and Future Work

In this paper, we presented RoSS, which blindly searches for AoAs via rotational motion in order to align a pair of interference signals and thereby separate the target sound source in under-determined scenarios. While we demonstrated the concept with unsupervised blind schemes, assuming the sources are mutually independent, the proposed idea of reducing the gap between K and M is applicable to many other UBSS/BSS methods, supervised or unsupervised, even when the sources are partially correlated. We also believe that applications where AoAs or source locations are readily available (e.g., audio-visual applications) can benefit directly from the proposed interference alignment, as demonstrated in Fig. 13 with AoA-informed RoSS, which bypasses the AoA estimation step to gain robustness against high K and reverberation.

As a next step, further investigation can be made into exploiting continuous motion, since we currently consider only discrete rotations. Also, the real-time score of the recovered target source can be utilized to find the alignment angle.