
ClearBuds: Wireless Binaural Earbuds for Learning-Based Speech Enhancement

We present ClearBuds, the first hardware and software system that utilizes a neural network to enhance speech streamed from two wireless earbuds. Real-time speech enhancement for wireless earbuds requires high-quality sound separation and background cancellation, operating in real-time and on a mobile phone. ClearBuds bridges state-of-the-art deep learning for blind audio source separation and in-ear mobile systems by making two key technical contributions: 1) a new wireless earbud design capable of operating as a synchronized, binaural microphone array, and 2) a lightweight dual-channel speech enhancement neural network that runs on a mobile device. Our neural network has a novel cascaded architecture that combines a time-domain convolutional neural network with a spectrogram-based frequency masking neural network to reduce the artifacts in the audio output. Results show that our wireless earbuds achieve a synchronization error of less than 64 microseconds and our network has a runtime of 21.4 milliseconds on an accompanying mobile phone. In-the-wild evaluation with eight users in previously unseen indoor and outdoor multipath scenarios demonstrates that our neural network generalizes to learn both spatial and acoustic cues to perform noise suppression and background speech removal. In a user study with 37 participants who spent over 15.4 hours rating 1041 audio samples collected in-the-wild, our system achieves an improved mean opinion score and background noise suppression. Project page with demos: https://clearbuds.cs.washington.edu


1. Introduction

Figure 1. ClearBuds Application. Our goal is to isolate a user’s voice from background noise (e.g., street sounds or other people talking) by performing source separation using a pair of custom designed, synchronized, wireless earbuds.

With the rapid proliferation of wireless earbuds (100 million AirPods sold in 2020 (26)), more people than ever are taking calls on-the-go. While these systems offer unprecedented convenience, their mobility raises an important technical challenge: environmental noise (e.g., street sounds, people talking) can interfere and make it harder to understand the speaker. We therefore seek to enhance the speaker’s voice and suppress background sounds using speech captured across the two earbuds.

Source separation of acoustic signals is a long-standing problem where the conventional approach for decades has been to perform beamforming using multiple microphones. Signal processing-based beamformers that are computationally lightweight can encode the spatial information but do not effectively capture acoustic cues (Van Veen and Buckley, 1988; Krim and Viberg, 1996; Chhetri et al., 2018). Recent work has shown that deep neural networks can encode both spatial and acoustic information and hence can achieve superior source separation with gains of up to  dB over signal processing baselines (Subakan et al., 2021; Luo and Mesgarani, 2019). However, these neural networks are computationally expensive. None of the existing binaural (i.e., using two microphones) neural networks can meet the end-to-end latency required for telephony applications or have been evaluated with real earbud data. Commercial end-to-end systems, like Krisp (78), use neural networks on a cloud server for single-channel speech enhancement, with implications to cost and privacy.

We present the first mobile system that uses neural networks to achieve real-time speech enhancement from binaural wireless earbuds. Our key insight is to treat wireless earbuds as a binaural microphone array, and exploit the specific geometry – two well-separated microphones behind a proximal source – to devise a specialized neural network for high quality speaker separation. In contrast to using multiple microphones on the same earbud to perform beamforming, as is common in Apple AirPods (2) and other hearing aids, we use microphones across the left and right earbuds, increasing the distance between the two microphones and thus the spatial resolution.

To realize this vision, we need to address three key technical challenges to deliver a functioning, practical system:

  1. Today’s wireless earbuds only support one channel of microphone up-link to the phone. AirPods and similar devices upload microphone output from only a single earbud at a time. To achieve binaural speaker separation, we need to design and build novel earbud hardware that can synchronously transmit audio data from both the earbuds, and maintain tight synchronization over long periods of time.

  2. Binaural speech enhancement networks have high computational requirements, and have not been demonstrated on mobile devices with data from wireless earbuds. Reducing the network size naively often leads to unpleasant artifacts. Thus, we also need to optimize the neural networks to run in real-time on smart devices that have a limited computational capability compared to cloud GPUs. Further, we need to meet the end-to-end latency requirements for telephony applications and ensure that the resulting audio output has a high quality from a user experience perspective.

  3. Prior binaural speech enhancement networks are trained and tested on synthetic data and have not been shown to generalize to real data. Building an end-to-end system however requires a network that generalizes to in-the-wild use.

To achieve this system, we make three technical contributions spanning earable hardware and neural networks.

Figure 2. ClearBuds hardware inside 3D-printed enclosure and when placed beside a quarter.

  • Synchronized binaural earables. We designed a binaural wireless earbud system (Fig. 2) capable of streaming two time-synchronized microphone audio streams to a mobile device. This is one of the first systems of its kind, and we expect our open-source earbud hardware and firmware to be of wider interest as a research and development platform. Existing earable platforms such as eSense (Kawsar et al., 2018) do not support time-synchronized audio transmission from two earbuds to a mobile device. We designed our DIY hardware using open source eCAD software, outsourced fabrication and assembly (K for 50 units), and 3D printed the enclosures.

    Figure 3. Background voice performance. We use spatial cues to separate background voices from the target speaker, even when the background voice is louder than the target voice. This is evident when the target speaker is silent but the background voice continues to talk (highlighted in orange). Apple AirPods Pro uses an endfire beamformer to partially suppress the background voice. The mono-channel Facebook Denoiser (Demucs) is unable to suppress the background voice. ClearBuds' network removes the background voice, approaching ground truth.
  • Lightweight cascaded neural network. We introduce a lightweight neural network that utilizes binaural input from wearable earbuds to isolate the target speaker. To achieve real-time operation, we start with the Conv-TasNet source separation network (Luo and Mesgarani, 2019) and redesign the network to achieve a 90% re-use of the computed network activations from the previous time step for each new audio segment (see §3.2). While these optimizations make this network real-time, they also introduce artifacts in the audio output (i.e., crackling, static). Interestingly, these artifacts have little effect on traditional metrics, like Signal-to-Distortion Ratio (SDR), but have a noticeable effect on subjective listening scores (see §5.2). These artifacts however are often visible in a frequency representation of the audio. To address this, we combine our mobile temporal model with a real-time spectrogram-based frequency masking neural network. We show that by combining the two networks and creating a lightweight cascaded network, we can reduce artifacts and improve the audio quality further.

  • Network training for in-the-wild generalization. Training the network in a supervised way requires clean ground truth speech samples as training targets. This is difficult to obtain in fully natural settings since the ground truth speech is corrupted with background noise and voices. Training a network that generalizes to in-the-wild scenarios also requires the training data to mimic the dynamics of real speech as closely as possible. This includes reverb, voice resonance, and microphone response. Synthetically rendered spatial data is the easiest type of data to obtain, but most different from real recordings, while real speakers wearing the headset in an anechoic chamber provide the best ground-truth training targets, but are the most costly to obtain. Synthetic data can simulate various reverb and multi-path that are not captured in an anechoic chamber. Our training methodology uses large amounts of synthetic data simulated in software, small amounts of hardware data with speakers embedded into a foam mannequin head and small amounts of data from human speakers wearing the earbuds in an anechoic chamber (see §4) to create a neural network that generalizes to users and multi-path environments not in the training data.

We combine our wireless earbuds and neural network to create ClearBuds, an end-to-end system capable of (1) source separation for the intended speaker in noisy environments, (2) attenuation and/or elimination of both background noises and external human voices, and (3) real-time, on-device processing on a commodity mobile phone paired to the two earbuds. Our results show that:


  • Our binaural wireless earbuds can stream audio to a phone with a synchronization error of less than 64 µs and operate continuously on a coin cell battery for 40 hours.

  • Our system outperforms Apple AirPods Pro by 5.23, 8.61, and 6.94 dB for the tasks of separating the target voice from background noise, background voices, and a combination of background noise and voices respectively.

  • Our network has a runtime of 21.4 ms on an iPhone 12 Pro, and the entire ClearBuds system operates in real-time with an end-to-end latency of 109 ms. For telephony applications, a mouth-to-ear latency of less than 200 ms is required for a good user experience (58).

  • In-the-wild evaluation with eight users in various indoor and outdoor scenarios shows that our system generalizes to previously unseen participants and multipath environments that are not in the training data.

  • In a user study with 37 participants who spent over 15.4 hours and rated a total of 1041 in-the-wild audio samples, our cascaded network achieved a higher mean opinion score and noise suppression than both the input speech as well as a lightweight Conv-TasNet.

We believe that this paper bridges state-of-the-art deep learning for blind audio source separation and in-ear mobile systems. The ability to perform background noise suppression and speech separation could positively impact millions of people who use earbuds to take calls on-the-go. By open-sourcing the hardware and collected datasets, our work may help kickstart future research among mobile system and machine learning researchers to design algorithms around wireless earbud data.

2. Related Work

Endfire beamforming configurations remain popular on consumer mobile phones and earbuds (20; 2; 15; 43). While recent advances in neural networks have shown promising results, none of them are demonstrated with wireless earbuds. By creating a wireless network between two earbuds, we demonstrate that our real-time, two-channel neural network can outperform current real-time speech enhancement approaches for wireless earbuds.

Beamforming techniques. Since signal-processing based beamforming is computationally lightweight, these techniques are deployed on commercial devices such as smart speakers (16), mobile phones (20), and earbud devices like Apple AirPods (2). However, the performance of beamforming is limited by the geometry of the microphones and the distance between them (Van Veen and Buckley, 1988; InvenSense, 2013). The form factor of devices like AirPods restricts both the number of microphones on a single earbud and the available distance between them, limiting the gain of the beamformer. While beamforming across two earbuds could provide better performance in principle, current wireless architectures are limited to streaming from a single earbud at a time (Telephony and Group, 2020). Furthermore, adaptive beamformers such as MVDR (Frost, 1972), while showing promise with relatively few interfering sources, are sensitive to sensor placement tolerance and steering (Zhang and Wang, 2017; Brandstein, 2001). Finally, beamforming leverages spatial or spectral cues only and does not use acoustic cues (e.g., structure in human speech) and perceptual differences to discriminate sources, information that machine learning methods leverage successfully.

Single-channel deep speech enhancement. Many deep learning techniques operate on spectrograms to separate the human voice from background noise (Xu et al., 2015; Mohammadiha et al., 2013; Duan et al.; Nikzad et al., 2020; Choi et al., 2019; Weninger et al., 2015; Fu et al., 2019; Soni et al.). However, recent works instead operate directly on the time domain signals (Luo and Mesgarani, 2019; Germain et al., 2018; Pascual et al., 2017; Defossez et al., 2020; Macartney and Weyde, 2018), yielding performance improvements over spectrogram approaches. Commercial noise suppression software like Krisp (78) and Google Meet (42) have successfully deployed single-channel models in real-time and are available for use on mobile phones and desktop computers, but processing is performed on the cloud. (Fedorov et al., 2020) achieves low-power speech enhancement using long short-term memory (LSTM) networks, but it targets single-channel enhancement rather than multichannel source separation. Further, single-channel models cannot effectively capture spatial information and fail to isolate the intended speaker when there are multiple speakers (see Fig. 3).

Figure 4. Network diagram of CB-Net. Our network contains a time-domain component, shown on top as CB-Conv-TasNet, and a frequency-domain component, shown on the bottom as CB-UNet.

Multi-channel source separation and speech enhancement. Multi-channel methods have been shown to perform better than their single-channel source separation counterparts (Yoshioka et al., 2018; Chen et al., 2018; Zhang and Wang, 2017; Gu et al., 2020; Tzirakis et al., 2021; Jenrungrot et al., 2020). Binaural methods have also been used for source separation (Sun et al.; Han et al., 2020; Li et al., 2011; Reindl et al., 2010) and localization (van Hoesel et al., 2008; Lyon, 1983; Kock, 1950); (Han et al., 2020) reduces the look-ahead time in the network to make it causal in behavior but has not been demonstrated to run on a mobile device. Our method improves on existing binaural methods by combining a time-domain neural network with a spectrogram-based frequency masking network, as well as optimizing them to enable real-time processing on a phone. Recent works such as (Tan et al., 2019, 2021; Shankar et al., 2020) use multiple microphones on a smartphone for speech enhancement. However, none of them demonstrates evaluation with real data, where artifacts caused by network optimizations can affect perceived quality. In contrast, we demonstrate the first system that achieves real-time speech enhancement using microphones on the two wireless earbuds. Further, as the distance between the earbuds is larger than the distance between microphones on a typical mobile phone, we can attain a better baseline than a mobile phone implementation, while also retaining the ability to speak hands-free. More recent works tackle the problem of real-time directional hearing using eye trackers and wearable headsets. For example, (Wang et al., AAAI 2022) uses a hybrid network that combines signal processing with neural networks, but shows that their technique performs poorly in binaural scenarios (i.e., two microphones) and requires four or more microphones. In contrast, we focus on the problem of speech enhancement and create the first real-time end-to-end hardware-software neural-network based system using wireless synchronized earbuds.

Earbud computing and platforms. There has been recent interest in earbud computing (Ma et al., 2021b; Kawsar et al., 2018; Min et al., 2018; Powar and Beresford, 2019; Yang and Choudhury, 2021) to address applications in health monitoring (Chan et al., 2019; Bui et al., 2021; Chan et al., 2022), activity tracking (Ma et al., 2021a), and sensor fusion with EEG signals (Ceolini et al., 2020). The eSense platform (Kawsar et al., 2018; Min et al., 2018) has enabled research in sensing applications with earables. OpenMHA (Pavlovic, Caslav et al., 2018; Herzke et al., 2017) is an open signal processing software platform for hearing aid research. Neither of these platforms supports time-synchronized audio transmission from two earbuds, which is a critical requirement for achieving speech enhancement in binaural settings. In contrast, we created open-source wireless earbud hardware that can support synchronized wireless transmission from the two earbuds.

3. ClearBuds Design

We first introduce our lightweight neural network architecture. We then describe system design of our hardware platform and our synchronization algorithm. We open-source our mechanical, firmware, application, and network designs at our project website: https://clearbuds.cs.washington.edu.

3.1. Problem Formulation

Suppose we have a 2-channel microphone array with one microphone at each ear of the wearer. The target voice produces a signal s in the presence of some background noise bg and other non-target speakers v. There may also be multi-path reflections and reverberations r which we would also like to reduce, i.e., the mixture at each microphone is x = s + bg + v + r. Our goal is then to recover the target speaker's signal s while ignoring the background, reverberations, and other speakers. We must also do so in real-time, meaning that a mixture sample received at time t must be processed and output by the network before t + L for some defined latency L. We refer to the non-target speakers as "background voices". These background voices may be at any location in the scene, including very close to the target speaker, and their angle can change with time and motion.

3.2. Neural Network Architecture Motivation

Our network needs to perform in real-time on a mobile device with minimal latency. This is challenging for several reasons. First, the processing device has a much lower compute capacity, especially compared to cloud GPUs. Additionally, the network should separate non-speech noises as well as unwanted speech. To do this, it must learn spatial cues and human voice characteristics. Finally, the resulting output should maximize the quality from a human experience perspective while minimizing any artifacts the network might introduce.

Our network, which we call ClearBuds-Net or CB-Net, is a cascaded model that operates in both time and frequency domains. The full network architecture is illustrated in Fig. 4 and contains two main sub-components: A dual-channel time domain network called CB-Conv-TasNet, and a frequency based network called CB-UNet.

3.2.1. CB-Conv-TasNet

The first component of our separation method is a time domain network that is based on a multi-channel extension of Conv-TasNet (Luo and Mesgarani, 2019). This is a network in the waveform domain that has a Temporal Convolution Network (TCN) structure, lending itself to a causal implementation with intermediate layer caching (Paine et al., 2016). We use depthwise separable convolutions (Howard et al., 2017) to further reduce the number of parameters and make the design real-time. We call this network CB-Conv-TasNet since it is an optimized version of the original Conv-TasNet.

A key feature of the time domain approach is that it can easily capture spatial cues in the network. In our application, the desired source is always physically between two microphones, thus the voice signal will reach the microphones roughly at the same time. In contrast, background or other speakers are typically not temporally aligned and will reach one microphone earlier or later. By feeding two time synchronized channels into the neural network, this spatial alignment of the sources can be learned from time differences in the signal. This is similar to a delay-and-sum beamforming effect, except the sum is replaced with a deep network.
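For intuition, the following minimal Python sketch (illustrative only, not part of CB-Net) shows the classic two-microphone delay-and-sum idea that our network generalizes: the inter-channel lag that best aligns the two channels encodes the source direction, and a source centered between the ears aligns at zero lag.

```python
import numpy as np

def delay_and_sum(left, right, max_lag=8):
    """Toy two-microphone delay-and-sum beamformer (illustrative baseline only).
    A talker centered between the ears aligns at lag 0; off-axis sources align
    at a nonzero lag. CB-Conv-TasNet learns to exploit the same inter-channel
    delays, but replaces the fixed delay-and-sum with a deep network."""
    best_lag, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        score = np.dot(left, np.roll(right, lag))   # cross-correlation at this lag
        if score > best_score:
            best_lag, best_score = lag, score
    aligned = 0.5 * (left + np.roll(right, best_lag))
    return aligned, best_lag
```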

Figure 5. The spectrograms above show the motivation behind a combined time and frequency domain method. The output of the time-domain component, CB-Conv-TasNet, contains artifacts, particularly at high frequencies. Although subtle, these artifacts are perceptible by human listeners. CB-Net is able to reduce these artifacts by using a frequency-domain network (CB-UNet) that masks unwanted frequencies.

3.2.2. CB-UNet

The output of our lightweight CB-Conv-TasNet often contains audible artifacts (i.e., crackling, static) that degrade the listening experience. Interestingly, these artifacts have little effect on traditional metrics, like Signal-to-Distortion Ratio (SDR), but have a noticeable effect on subjective listening scores (see §5.2). These artifacts are often visible in a frequency representation of the audio. Fig. 5 shows how CB-Conv-TasNet alone contains noticeable artifacts when compared to the ground truth. To address this, we cascade a lightweight causal UNet (Ronneberger et al., 2015) which operates on the mel-scale spectrogram of the input audio. This network, which we call CB-UNet, produces a binary mask which is applied to the output of CB-Conv-TasNet. The combined output, shown in Fig. 5 as CB-Net, reduces these artifacts. The mean opinion scores in our evaluation show the strength of the cascaded CB-Net when compared to the time-domain component only.

3.3. Neural Network Detailed Description

3.3.1. CB-Conv-TasNet

The input to the network is a binaural mixture x consisting of two channels of audio samples. The first step is an encoder that transforms the mixture x into a latent representation with a 1D convolution of a fixed kernel size and stride, followed by a ReLU layer. The encoder's outputs are next fed into a temporal convolution network that consists of stacks of 1-D convolutions with increasing dilation factors. We use 14 convolution layers with dilation factors of 1, 2, 4, ..., 64 repeated twice, with a ReLU nonlinearity and skip connection after each convolution. The encoder output is multiplied with the output of the temporal conv-net before being fed through a fully connected decoder layer which transforms the output back into the waveform domain.

In a real-world implementation, we do not have access to the full waveform, but only packets of data at a time. Furthermore, we must process these packets with limited access to future input samples. Given the 15.625 kHz sampling rate, we choose to process packets of N = 350 samples at a time (22.4 ms), which is our window size. We also use 2N, or 700 samples, of lookahead time (44.8 ms) and 1.5 s of past samples. Since we have no padding in the temporal convolution net, the network starts with this large temporal context and outputs exactly N samples, corresponding to the desired output for our input packet of N samples. When we receive the next packet of size N, all intermediate activations from the encoder and temporal conv-net can be shifted over by N samples and re-used. We chose to shift by a full packet of N samples, but any divisor of N would also work. Re-using intermediate outputs from previous packets saves over 90% of the compute time for a new packet in our network.
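To make the caching scheme concrete, here is a simplified NumPy sketch of a single causal dilated layer with activation re-use; the layer sizes, the ReLU placement, and the single-layer framing are illustrative assumptions, and a full implementation would chain such layers and cache each one.

```python
import numpy as np

class CachedDilatedConvLayer:
    """One causal dilated 1-D conv layer that re-uses activations across packets:
    when a new packet of `new` samples arrives, cached outputs are shifted left
    and only the rightmost `new` output columns are recomputed."""

    def __init__(self, channels, kernel_size, dilation, window_len):
        self.w = np.random.randn(channels, channels, kernel_size) * 0.01
        self.kernel_size, self.dilation = kernel_size, dilation
        self.cache = np.zeros((channels, window_len))   # outputs over the full context window

    def process_packet(self, x, new):
        """x: (channels, window_len) input window whose right edge already contains
        the new packet. Only the last `new` outputs are computed from scratch."""
        channels, window_len = x.shape
        self.cache = np.roll(self.cache, -new, axis=1)   # slide cached outputs left
        for t in range(window_len - new, window_len):
            acc = np.zeros(channels)
            for k in range(self.kernel_size):
                idx = t - k * self.dilation
                if idx >= 0:                             # causal: only past/present samples
                    acc += self.w[:, :, k] @ x[:, idx]
            self.cache[:, t] = np.maximum(acc, 0.0)      # ReLU nonlinearity
        return self.cache
```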

Figure 6. CB-Conv-TasNet, the time-domain component of CB-Net. Given a packet of 350 samples (22.4 ms) highlighted in blue, we use 1.5 s of past input and 44.8 ms of future input to output the separation results. Our caching scheme works as follows: when we receive a new packet of 350 samples, all intermediate activations (circles in the diagram) slide to the left, and we compute only the rightmost column of outputs.

3.3.2. CB-UNet

The frequency domain network is a mono-channel network that outputs a binary mask for each time-frequency bin. The input is a summation of the binaural left and right channels, which is the equivalent of a broadside beamformer. We first run an STFT with a fixed hop size and window size (with zero padding at the edges) and map the result to mel-scale frequency bins. The network input is therefore a spectrogram of time bins and mel-frequency bins whose receptive field covers the same temporal context as the time-domain network. In order to maintain the causality requirement, we use the same lookahead strategy as the time-domain network, allowing the same amount of lookahead for each target packet of samples. The UNet architecture contains 4 downsampling and upsampling layers, starting with 64 channels and doubling the number of channels at each subsequent layer. The downsampling layers contain a depthwise separable convolution followed by a max pooling, and the upsampling layers contain a depthwise separable convolution followed by a transposed convolution for upsampling. The output passes through a sigmoid, which is then thresholded to return a binary mask over all time-frequency bins. When outputting a spectrogram mask on an input, we predict a mask over the entire input even though we only need the output for the time bins covering the target packet of samples. Further optimizations could be made by caching intermediate outputs or only computing the mask for the target samples. However, CB-UNet's run-time was so small compared to the rest of the network that these optimizations were not necessary.
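As a concrete illustration of the depthwise separable convolution block used in CB-UNet's downsampling and upsampling paths, here is a minimal PyTorch sketch; the kernel size and activation choice are assumptions, not values taken from the paper.

```python
import torch.nn as nn

class DepthwiseSeparableConv2d(nn.Module):
    """Depthwise separable 2-D convolution: a per-channel spatial convolution
    followed by a 1x1 pointwise convolution, which greatly reduces parameters
    and FLOPs compared to a standard convolution."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))
```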

3.3.3. Combining the Outputs

At each time step, the output of CB-Conv-TasNet is an audio waveform, and the output of CB-UNet is a binary spectrogram mask M. We run the same Fourier transform on the buffered CB-Conv-TasNet outputs to produce a spectrogram S. Our output can then be computed by applying the mask element-wise to this spectrogram and inverting the transform back to a waveform, i.e., y = iSTFT(M ⊙ S). Our empirical results show that this gives the best results compared to other methods such as ratio masking.
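A simplified sketch of this masking step is shown below, using a linear-frequency STFT rather than the mel-scale transform used in the paper; the transform parameters here are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def apply_mask(tasnet_out, mask, fs=15625, nperseg=512, hop=128):
    """Mask the spectrogram of the buffered CB-Conv-TasNet output and invert it
    back to a waveform (linear-frequency simplification of Section 3.3.3)."""
    _, _, S = stft(tasnet_out, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
    S_masked = S * mask            # element-wise binary mask, same shape as S
    _, y = istft(S_masked, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
    return y
```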

3.3.4. Training

CB-Conv-TasNet is trained with an L1-based loss over the waveform along with a multi-resolution spectrogram loss. Formally, provided s is our target speaker's signal and ŝ is the output from the network, our loss is:

L(s, ŝ) = ||s − ŝ||_1 + Σ_m [ L_sc^(m)(s, ŝ) + L_mag^(m)(s, ŝ) ]

where, at each STFT resolution m, the spectral convergence and magnitude terms follow the standard multi-resolution STFT formulation: L_sc = || |STFT(s)| − |STFT(ŝ)| ||_F / || |STFT(s)| ||_F and L_mag = || log|STFT(s)| − log|STFT(ŝ)| ||_1. Here |STFT(·)| denotes the magnitude of the short-time Fourier transform and ||·||_F denotes the Frobenius norm. L_sc and L_mag represent the spectral convergence and magnitude losses, which gave better results than the L1 loss alone.
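The loss above can be written compactly in PyTorch; the STFT resolutions below are illustrative assumptions rather than the paper's exact settings.

```python
import torch

def stft_mag(x, n_fft, hop):
    window = torch.hann_window(n_fft, device=x.device)
    return torch.stft(x, n_fft=n_fft, hop_length=hop, window=window,
                      return_complex=True).abs()

def cb_tasnet_loss(est, target, resolutions=((512, 128), (1024, 256), (2048, 512))):
    """L1 waveform loss plus spectral-convergence and log-magnitude losses at
    several STFT resolutions (standard multi-resolution STFT formulation)."""
    loss = torch.mean(torch.abs(est - target))                     # L1 over the waveform
    for n_fft, hop in resolutions:
        S_est, S_tgt = stft_mag(est, n_fft, hop), stft_mag(target, n_fft, hop)
        sc = torch.norm(S_tgt - S_est, p="fro") / (torch.norm(S_tgt, p="fro") + 1e-8)
        mag = torch.mean(torch.abs(torch.log(S_tgt + 1e-8) - torch.log(S_est + 1e-8)))
        loss = loss + sc + mag
    return loss
```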

For training CB-UNet, for each time-frequency bin, the training target M is 1 if the target voice is the dominant component, and 0 otherwise. Formally, M(t, f) = 1 if |S_target(t, f)| ≥ |S_residual(t, f)|, where S_target and S_residual are the mel spectrograms of the target voice and of the remaining background noise and voices, and M(t, f) = 0 otherwise. The network is then trained with the binary cross entropy of the output compared to the target mask.

3.3.5. Hyperparameters and Training Details

We use a fixed learning rate along with the Adam optimizer (Kingma and Ba, 2014) for training the network. The network was trained on a single Nvidia TITAN Xp GPU. Because of the small size of the network, training could be completed within a single day and generally required about 50 epochs to reach convergence. As an additional data augmentation step, we perturb the data by randomly adding high-shelf and low-shelf gain using the sox library.
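A sketch of this augmentation using torchaudio's sox bindings is shown below; the ±5 dB bound is an assumed placeholder, since the exact gain range is not reproduced here.

```python
import random
import torchaudio

def random_shelf_gain(waveform, sample_rate, max_gain_db=5.0):
    """Randomly apply low-shelf ('bass') and high-shelf ('treble') gain via sox.
    `max_gain_db` is an assumed bound, not the paper's exact value."""
    effects = [
        ["bass", f"{random.uniform(-max_gain_db, max_gain_db):.1f}"],
        ["treble", f"{random.uniform(-max_gain_db, max_gain_db):.1f}"],
    ]
    out, _ = torchaudio.sox_effects.apply_effects_tensor(waveform, sample_rate, effects)
    return out
```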

3.4. Synchronized wireless earbuds

Figure 7. Hardware Block Diagram. Each ClearBud integrates a PDM microphone, accelerometer, flash, and coin cell battery. Buttons and LEDs are used for interfacing with the device, and a USB port is used for programming and debug.

We seek to capture speech from the target speaker's mouth, which sits on the sagittal plane roughly equidistant from the ears. Given an ear-to-ear spacing of 17.5 cm, to effectively isolate this central plane we require a distance precision on the order of a few centimeters. An interaural time difference of 100 µs would correspond to a source at most 3.43 cm off this central plane; therefore we target a synchronization accuracy under 100 µs.
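The 3.43 cm figure follows directly from the speed of sound (about 343 m/s):

\Delta d = c \cdot \Delta t \approx 343\ \mathrm{m/s} \times 100\ \mathrm{\mu s} \approx 3.43\ \mathrm{cm}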

Figure 8. Time Sync Design. The primary ClearBud broadcasts a sync clock over the air. The secondary ClearBud then uses the sync clock to rate encode, by increasing or decreasing the size of its local PCM buffer.

3.4.1. Hardware

Our custom hardware design contains a pulse-density modulated (PDM) microphone (Invensense ICS-41350) and a Bluetooth Low Energy (BLE) microcontroller (Nordic nRF52840).111For future research applications, an ultra low-power accelerometer (Bosch BMA400), a 1Gbit NAND flash for local data collection (Winbond W25N01GVZEIG), and support for speaker and an additional microphone are included. The system is powered off of a CR2032 coin cell battery and programmed via SWD over a Micro-USB connector. Each ClearBud has an integrated PDM microphone set to a clock frequency of 2MHz. With an internal PDM decimation ratio of 64, this provides us a sampling frequency of 31.25kHz. As most HD voice applications and wideband codecs are limited to 16kHz (C. and M.H., 2009), we decimate further in firmware by a factor of 2, giving us a final sampling frequency of 15.625kHz.

Two 180-sample, 16-bit Pulse-Code Modulation (PCM) buffers are round-robined: one is filled with incoming PCM data while the other is processed. The DMA is responsible for both clocking in the PDM data and converting it into PCM. One buffer is always connected to the DMA, while the other is freed for processing by the rest of the data pipeline. When the buffer connected to the DMA fills, the buffers switch roles: we begin processing data on the newly freed buffer and connect the other buffer back to the DMA. With this design we always have a continuous PCM stream to operate on. Both ClearBuds transmit the PCM microphone data to a mobile phone for input into our neural network. To maximize throughput, we use the highest Bluetooth rate and packet size supported by iOS, which are 2 Mbps and 182 bytes, respectively. We design a lightweight wireless protocol where the first 2 bytes represent a monotonically-increasing sequence number, while the other 180 bytes are reserved for the 16-bit PCM audio samples. The sequence number is used on the phone so that we can zero-pad PCM data in the occasional event that a packet is dropped either over-the-air or by the radio hardware. This zero-padding keeps the left and right microphone data aligned on the host side in areas of poor radio performance or interference in the environment.
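For illustration, a host-side sketch of this protocol in Python is shown below; the little-endian layout and helper names are assumptions, not the actual app code.

```python
import struct

SAMPLES_PER_PACKET = 90   # 180 payload bytes / 2 bytes per 16-bit PCM sample

def parse_packet(packet: bytes, expected_seq: int, pcm_out: list) -> int:
    """Parse one 182-byte packet: a 2-byte sequence number followed by 180 bytes
    of 16-bit PCM. Any packets dropped since `expected_seq` are zero-padded so
    the left and right streams stay sample-aligned. Returns the next expected
    sequence number."""
    seq = struct.unpack_from("<H", packet, 0)[0]
    dropped = (seq - expected_seq) & 0xFFFF
    pcm_out.extend([0] * (dropped * SAMPLES_PER_PACKET))          # zero-pad gaps
    pcm_out.extend(struct.unpack_from(f"<{SAMPLES_PER_PACKET}h", packet, 2))
    return (seq + 1) & 0xFFFF
```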

The hardware schematic and layout for ClearBuds was designed using the open source eCAD tool KiCad. A 2-layer flexible printed circuit was fabricated and assembled by PCBWay. The 3D printed enclosures were designed using AutoDesk Fusion 360 and printed with a Phrozen Sonic Mini using a liquid resin fabrication process. The MEMS microphone sits behind the lid on the earbud’s outer surface. A single button on the enclosure provides access to turn on and off the earbuds.

3.4.2. Microphone synchronization

Three components are necessary for maintaining microphone synchronization: (1) As each of our earbuds has its own local clock source, we need to establish a common clock between them so that they have the same reference of time, (2) a synchronized startup so each earbud starts recording from their respective microphone at the exact same time, and (3) a rate encoding scheme to control the earbud’s sampling rate to match each other.

In our system, each earbud has its own respective 32MHz clock source with a total +/- 20ppm frequency tolerance budget. So, in the worst case scenario, the earbuds will have 2.4 milliseconds of drift each minute. We use the Nordic’s TimeSlot API (59), which grants us access to the underlying radio hardware in between Bluetooth transmissions. This provides us a transport to transmit and receive accurate time sync beacons (77). Each ClearBud keeps a free-running 16MHz hardware timer with a max value of 800,000, overflowing and wrapping around at a rate of about 20 Hz. One ClearBud is assigned as the timing master while the other ClearBud will synchronize its free-running timer to the master’s. The primary ClearBud (timing master) transmits time sync packets at a rate of 200 Hz. These packets contain the value of the free-running timer at the time of the radio packet transmission. When the secondary ClearBud receives this packet, it can then add or subtract an offset to its own free-running timer for a common clock.

Once each ClearBud is connected to the mobile phone, the phone sends a START command to both ClearBuds over BLE. Each ClearBud contains firmware which arms a programmable peripheral interconnect (PPI) to launch the PDM bus once the 16MHz free-running timer wraps around at 800,000. By using this method, we bypass the CPU and trigger a synchronized startup entirely at the hardware layer. One caveat is that the mobile phone could write to one ClearBud right before its clock wraps around at 800,000, and the other ClearBud right after it wraps around at 800,000. With a clock that wraps around at 20Hz, this would trigger a mismatched startup and cause an alignment error of 50ms. To correct for this, each ClearBud reports its common clock timer value to the phone once it has received the START command. The phone can then remove the first 781 audio samples (781 samples / 15.625kHz = 50ms) if one ClearBud started streaming 50ms before the other.

The final component to keeping the audio streams aligned is to create a rate encoding scheme between the ClearBuds. With the time sync beacons from the primary ClearBud, the other ClearBud now has both its local clock and the common clock (primary ClearBud’s local clock). With these two clocks, the secondary ClearBud can identify how much faster or slower its PDM clock is running in relation to the primary ClearBud. We note that with a 2MHz PDM clock and a PDM decimation ratio of 64, each audio sample occupies 32 us. The non-primary ClearBud can then add or remove a sample to its PDM buffer every time the difference between the clocks exceeds a multiple of 32 us. By doing this, the secondary ClearBud ensures that its PDM buffer starts filling up at the exact same time as the primary ClearBud’s PDM buffer, with a tolerance of 32 us.
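The rate-encoding rule can be summarized with the following Python sketch of the firmware logic; the function name and the direction of each correction are illustrative assumptions.

```python
SAMPLE_PERIOD_US = 32   # 2 MHz PDM clock / decimation of 64 -> one PCM sample per 32 us

def rate_correction(local_clock_us, common_clock_us, corrections_applied):
    """Decide whether the secondary ClearBud should insert or drop one PCM sample.
    A correction is issued each time the clock difference crosses another
    multiple of 32 us, keeping the two PDM buffers aligned to within one sample."""
    drift_us = local_clock_us - common_clock_us
    target = int(drift_us / SAMPLE_PERIOD_US)      # corrections that should have been applied so far
    if target > corrections_applied:
        return "drop_sample", corrections_applied + 1    # local clock running fast
    if target < corrections_applied:
        return "insert_sample", corrections_applied - 1  # local clock running slow
    return "no_change", corrections_applied
```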

4. Training methodology

Training the network in a supervised way requires clean ground truth speech samples as training targets. This is difficult to obtain in fully natural settings since the ground truth speech is corrupted with background noise and voices. Training a network that generalizes to in-the-wild scenarios also requires the training data to mimic the dynamics of real speech as closely as possible. This includes reverb, voice resonance, and microphone response. Synthetically rendered spatial data is the easiest type of data to obtain, but most different from real recordings, while real speakers wearing the headset in an anechoic chamber provide the best ground-truth training targets, but are the most costly to obtain. Synthetic data can simulate various reverb and multipath that are not captured in an anechoic chamber. We adopt a hybrid training methodology where we first train on a large amount of synthetic data and fine-tune on real data recorded with our hardware. Our training method is based on the commonly used mix-and-separate framework (Zhao et al., 2018), where clean speech and noise samples are recorded separately and combined randomly to form noisy mixtures. Our results show that our network trained this way generalizes to naturally recorded noisy data in real-world environments.

Synthetic data. This type of data is the easiest to obtain, since a wide variety of voice types and physical setups can be generated instantly. Many machine learning baselines, e.g., (Luo et al., 2020; Jenrungrot et al., 2020; Tzirakis et al., 2021), only train and evaluate on synthetic data generated in this manner. To generate the synthetic dataset, we create multi-speaker recordings in simulated environments with reverb and background noises. All voices come from the VCTK dataset (Veaux et al., 2016) (110 unique speakers with over 44 hours), and background sounds come from the WHAM! dataset (Wichern et al., 2019), with 58 hours of recordings from a variety of noise environments such as a restaurant, crowd, and music.

To synthesize a single example, we create a 3 second mixture as follows: two virtual microphones are placed 17.5 cm apart, which is the average distance between human ears (Risoud et al., 2018). The target speaker's voice is placed at the center between the two virtual microphones, and a second voice is placed at a random distance and angle from the head. A randomly chosen background noise is also placed in the scene. We then simulate room impulse responses (RIRs) for a randomly sized room using the image source method implemented in the pyroomacoustics library (Allen and Berkley, 1979; Scheibler et al.). The room is rectangular with sides randomly chosen between 5 and 20 meters, and the RT60 values are randomly chosen between 0 and 1 second. All signals are convolved with the RIR and rendered to the two-channel microphone array. The volumes of the background are randomly chosen so that the input signal-to-distortion ratio is roughly between -5 and 5 dB. For training, we use 10,000 mixtures generated in this manner.
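A condensed sketch of this rendering pipeline with pyroomacoustics is shown below; the source placement, gains, and the fixed nonzero RT60 default are simplifications of the randomization described above.

```python
import numpy as np
import pyroomacoustics as pra

def render_binaural_mixture(target, interferer, noise, fs=15625, rt60=0.5):
    """Render a two-channel mixture: two virtual mics 17.5 cm apart, the target
    voice centered between them, and an interfering voice plus background noise
    placed elsewhere in a randomly sized reverberant room."""
    room_dim = np.random.uniform(5, 20, size=3)
    e_absorption, max_order = pra.inverse_sabine(rt60, room_dim)
    room = pra.ShoeBox(room_dim, fs=fs,
                       materials=pra.Material(e_absorption), max_order=max_order)

    center = room_dim / 2
    mics = np.c_[center + [0.0875, 0, 0], center - [0.0875, 0, 0]]  # (3, 2), 17.5 cm apart
    room.add_microphone_array(mics)

    room.add_source(center + [0, 0.1, 0], signal=target)            # "mouth" on the central plane
    room.add_source(center + np.random.uniform(-2, 2, 3), signal=interferer)
    room.add_source(center + np.random.uniform(-2, 2, 3), signal=noise)

    room.simulate()
    return room.mic_array.signals                                   # shape (2, T)
```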

Hardware data. While a large amount of synthetic data can be easily rendered to train the network, it does not contain characteristics such as the microphone response of physical hardware and imperfections in the time-of-arrival. To address this, we also train on a set of recorded voice samples from our earbuds. We set up a foam mannequin head with an artificial mouth speaker (Sony SBS-XB12) that plays VCTK samples as the spoken ground truth. For background voice recordings, the speaker is placed in varying locations within a one meter radius of the foam head. Physically recorded background noise is provided by the binaural version of the WHAM! dataset (Wichern et al., 2019), which was recorded in real environments using a binaural mannequin like ours. We record 2 hours each of clean speech and background voices. 2000 random mixtures are then created for training.

Human data. The spoken hardware data above still does not contain natural voice resonance since it is played out of an electronic speaker. Furthermore, the background sounds recorded by a mannequin wearing earbuds still misses some of the physical filtering of the human body. To better capture desired output of real scenarios, we collect a ground-truth speech dataset in an anechoic chamber with human speakers (5 male, 4 female) and a noise dataset in real environments with human listeners. For the voice data, each human speaker wore our ClearBuds prototypes, and uttered 15 minutes of text from Project Gutenberg in the anechoic chamber. The purpose of this anechoic data is to provide clean training targets for the network, modelling the resonance of human speakers wearing our hardware. For the real world noise dataset, individuals wore ClearBuds and recorded various noisy scenarios such as washing dishes, loud indoor/outdoor restaurants, and busy traffic intersections. 2000 random mixtures of clean voice and recorded noise were generated for this dataset.

Our network is jointly trained using all these datasets. Note that testing and evaluation is done outside the anechoic chamber.

Figure 9. Comparison with AirPods Pro. Reporting the output SI-SDR (note: not SI-SDR increase). ClearBuds outperforms AirPods Pro in all three conditions: target voice plus background noise (BG), target voice plus background voice (BV), and target voice plus both background voice and noise.

5. Experiments and Results

We first compare our end-to-end system performance against a commercial wireless earbud system. We then present in-the-wild evaluation of our system. Next, we compare numerical results against various speech enhancement baselines. Finally, we present system-level evaluations. Our work is approved by the IRB.

5.1. Comparison with Beamforming Earbuds

We evaluate our end-to-end system against the Apple AirPods Pro headset connected to an iPhone 12 Pro in a repeatable physical setup. In our evaluation, as is typical, there is no overlap between training and test datasets.

Figure 10. In-the-wild experiments in various scenarios (crowded cafe, busy intersection, outdoor plaza, classroom) were conducted across 8 users and indoor and outdoor environments, all unseen in our training dataset.

Procedure. We use the popular metric scale-invariant signal-to-distortion ratio (SI-SDR) (Roux et al., 2018). While SI-SDR provides a repeatable metric used in the acoustic community, it requires a clean, sample-aligned ground truth (target voice) as the basis for evaluation. Therefore, we create a repeatable soundscape for our test setup where a sample-aligned ground truth can be obtained. A foam mannequin head with a speaker (Sony SBS-XB12) inserted into its artificial mouth uttered one hundred VCTK samples with identities and samples unseen in the training set. The mannequin wore ClearBuds and AirPods Pro in subsequent experiments, and the outputs of the two systems could be directly compared. Ambient environmental sound (from the WHAM! dataset) was played via four monitors (PreSonus Eris E3.5) positioned to fill a 3 meter by 4 meter room, and background voice (also VCTK) was played from a monitor positioned 0.4 meters to the right of the head. All speakers were driven through a common USB interface (PreSonus 1810c), ensuring the same time-alignment and loudness between the two test conditions. Since Apple AirPods Pro beamforming cannot be toggled on and off, we cannot calculate an SI-SDR increase (SI-SDRi), and therefore report output SI-SDR. To establish the ground truth voice against which to calculate SI-SDR, we record clean target voice through each headset. Ambient noise SNR ranged between 0 dB and 16 dB with respect to the target voice. Qualitatively, this sounded like a second person speaking loudly in a noisy bar or cafe. Finally, background voice SNR ranged between 6 dB and 12 dB, qualitatively sounding like a person speaking from a meter or two away.
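For reference, a standard implementation of the SI-SDR metric used here (assuming sample-aligned signals) is:

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio (Roux et al., 2018) in dB."""
    estimate = estimate - np.mean(estimate)
    reference = reference - np.mean(reference)
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference                 # scaled projection onto the reference
    noise = estimate - target
    return 10 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps))
```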

Results. We report output SI-SDR from the two systems in Fig. 9. To calculate output SI-SDR, we align individual one second chunks and take the logarithmic mean across 250 chunks. We find that ClearBuds achieves higher output SI-SDR across all test conditions when compared to the beamforming utilized by the Apple AirPods Pro. For a qualitative comparison of AirPods Pro versus ClearBuds performance with human speakers, see video: https://clearbuds.cs.washington.edu/videos/airpods_comparison.mp4.

5.2. In-the-Wild Evaluation

We perform in-the-wild evaluation in indoor and outdoor scenarios as well as users not in the training data. The procedure and results are described in the following sections.

Figure 11. In-the-wild study results. Noise suppression indicates perceived quality of background noise reduction (higher is less intrusive). Overall MOS indicates overall perceived quality. Error bars are 95% CI.

In-the-wild experiments. Eight individuals (four male, four female, mean age 25) with a variety of accents wore a pair of ClearBuds and read excerpts from Project Gutenberg (51) while in four noisy environments: a coffee shop, a noisy intersection, an outdoor plaza, and a classroom (see Fig. 10). The environments featured ringing phones, cross-talk from other people, ambient music, a crying baby, opening/closing doors, driving vehicles, and street noise, amongst other common sounds. These experiments were uncontrolled in that the background voices and noise were naturally occurring sounds that are typical to these real-world scenarios and were mobile.

Evaluation procedure. In-the-wild evaluation precludes access to clean, sample-aligned ground truth to compute SI-SDR. Instead, the common (and expensive) procedure is to perform a user study and compute the mean opinion score. Since this is a time-consuming process, prior works on binaural networks, e.g., (Luo et al., 2020; Tan et al., 2019; Jenrungrot et al., 2020), avoid in-the-wild evaluation. Since our goal is to design and evaluate an in-ear system in real scenarios, we recruit thirty-seven participants (11 female, 26 male, mean age 29) for a user study. Each participant listened to between 6 and 11 in-the-wild audio samples (avg. 9.38 samples, each between 10–60 seconds). Each speech sample was processed and presented three ways: (1) the original input, (2) CB-Conv-TasNet, and (3) CB-Net, yielding a total of 37 × 9.38 × 3 ≈ 1,041 rating samples.

Participants were encouraged to use audio equipment they would typically use for a call. Fourteen used earbuds, thirteen used computer speakers, seven used headphones, and three used phone speakers. The study took about 25 minutes per participant. As is typical with noise suppression systems, participants were asked to give ratings in two categories: the intrusiveness of the noise and overall quality (mean opinion score - MOS):

  1. Noise suppression: How INTRUSIVE/NOTICEABLE were the BACKGROUND sounds? 1 - Very intrusive, 2 - Somewhat intrusive, 3 - Noticeable, but not intrusive, 4 - Slightly noticeable, 5 - Not noticeable

  2. Overall MOS: If this were a phone call with another person, How was your OVERALL experience? 1 - Bad, 2 - Poor, 3 - Fair, 4 - Good, 5 - Excellent

Figure 12. Mobility of speaker and noise sources in-the-wild. Red box highlights a moving truck on the road while ClearBuds user is walking. Video: https://youtu.be/HYu0ybjcQPA?t=127

Results. Fig. 11 shows the noise intrusiveness and MOS values for the original microphone, CB-Conv-TasNet, and CB-Net. As expected, applying CB-Conv-TasNet to the original audio helped suppress noise dramatically, increasing opinion score from 2.02 (slightly better than 2 - Somewhat intrusive) to 3.28 (between 3 - Noticeable, but not intrusive and 4 - Slightly noticeable) (p<0.01). The light-touch, spectrogram-masking clean-up method featured in CB-Net increased the noise suppression opinion score significantly (p<0.001) to 3.77, indicating the method did indeed further suppress perceptually annoying noise artifacts. Importantly, this step also increased overall MOS. While users only slightly preferred (p<0.05) CB-Conv-TasNet (2.67) to the original input (2.49) due to artifacts introduced, they more significantly (p<0.001) preferred our CB-Net (3.10), an increase of 0.61 opinion score points from the input. For context, in the flagship ICASSP 2021 Deep Noise Suppression Challenge (Reddy et al., 2021), with state-of-the-art, real-time algorithms run on a quad-core desktop CPU, the winning submission increased MOS by 0.57 (28) from input.

SI-SDR increase (SI-SDRi) Output PESQ
Method Target with BG Target with BV Target with BV + BG Target with BG Target with BV Target with BV + BG
CB-Net 10.41 10.56 9.35 2.08 2.68 1.81
CB-Conv-TasNet 11.19 11.01 9.68 2.24 2.58 1.91
CB-Conv-TasNet Single Mic 6.15 0.13 2.34 1.82 1.84 1.53
CB-UNet 3.21 0.78 1.82 1.60 2.10 1.50
DTLN  (Westhausen and Meyer, 2020) 7.02 0.06 2.13 2.08 1.95 1.67
Causal Demucs  (Defossez et al., 2020) 6.62 -0.03 2.11 1.80 1.88 1.43
Ideal Ratio Mask (IRM, oracle) 11.41 11.53 12.04 2.53 3.00 2.44
Ideal Binary Mask (IBM, oracle) 9.97 11.05 10.85 2.30 2.90 2.21
Table 1. Benchmarking our neural network. We show results for a target voice speaking in three noise scenarios: (1) Background noise (BG), (2) Background voice (BV), and (3) Background noise and background voice (BG and BV). CB-Conv-TasNet performs slightly better on synthetic data, but as shown in Fig. 11, does not generalize as well to in-the-wild scenarios. This demonstrates the importance of evaluating networks on real in-the-wild hardware data.
Figure 13. (a) Performance against angle of background voice in presence of significant multipath. (b) Performance against amount of reverberation in an indoor room. RT60 (in seconds) measures how long sound takes to decay by 60 dB in a space with a diffuse soundfield. (c) Performance as distance between ears increases.

Note that in our in-the-wild experiments, the background noise and voices were not static. The speakers themselves can also be mobile (see Fig. 12). Our network was able to adaptively remove the background noise and achieve speech enhancement with mobility.

5.3. Benchmarking our Neural Network

The conventional evaluation in the machine learning and acoustic community is to evaluate models and techniques on synthetic data against baselines. For completeness, we compare our method against a variety of speech enhancement baselines using the synthetic dataset. For evaluation, an additional 1000 mixtures of 3 seconds each were generated such that there was no overlapping identities or samples between the train and test splits.

Evaluation Procedure. For comparisons to other baseline methods, we use the popular SI-SDR and PESQ metrics. Unlike the AirPods experiment, where the original noisy mixture could not be recorded since AirPods beamforming cannot be toggled off, here we compute SI-SDR with respect to the ground truth for both the input noisy mixture and the network output. When reporting the increase from the input SI-SDR to the output SI-SDR, we use the SI-SDR improvement (SI-SDRi).

For a deep learning baseline in the waveform domain, we choose the causal Demucs model (Defossez et al., 2020). This is a single channel method which was recently shown to outperform many other deep learning baselines and runs real-time on a laptop CPU. We also compare with Dual-signal Transformation LSTM Network (DTLN)  (Westhausen and Meyer, 2020). This method also runs on a laptop or mobile phone in real-time. To compare with spectrogram based methods, we use the oracle baselines, ideal ratio mask (IRM) and ideal binary mask (IBM) (Stöter et al., 2018; Wang, 2005), that use the ground truth voice to calculate the best possible result that can be obtained by masking a noisy spectrogram.

As an ablation study, we report results with each individual component of the network, CB-Conv-TasNet and CB-UNet. We also show results when the multi-channel part of our network, CB-Conv-TasNet, only has access to one microphone, labeled as CB-Conv-TasNet Single Mic. This explicitly shows the advantage of using two microphones. There are only a few deep learning methods that tackle binaural speech separation for mobile processing, and the most relevant ones, such as (Tan et al., 2019) and (Han et al., 2020), do not have publicly available code to test against.

Results. As shown in Table 1, our binaural method is comparable to the best possible results that can be obtained by a spectrogram masking method (IBM, IRM). We also show an improvement over waveform based deep learning methods that only use a single microphone input. In particular, the improvement is greatest when there are two speakers present (Target Voice + Background Voice). This is because single channel methods can only rely on voice characteristics, whereas our network also uses spatial cues to separate the speaker of interest. Although CB-Net shows similar or slightly worse performance than CB-Conv-TasNet on these metrics, subjective evaluation on in-the-wild hardware data shows that CB-Net is rated far higher by human listeners (see §5.2).

Examples of the synthetic dataset, outputs from all the methods and qualitative comparisons against Krisp (78), a commercial noise suppression system, can be found linked from our project website: https://clearbuds.cs.washington.edu.

5.3.1. Additional neural network evaluations

We numerically evaluate various aspects of the design by changing the angle of background voice, reverberance in the environment, and microphone separation.

Angle of background voice. The ability of our network to separate the target voice from a background voice is based on utilizing the time difference of arrival to the binaural microphones. Because we only have two microphones, this ability is limited when the background voice is in the front-back plane of the speaker. In this case, the background voice will arrive at each microphone simultaneously, and there will be no spatial cues to separate the two voices. To illustrate this effect, we graph the separation performance as a function of the angle of the background voice in Fig. 13.
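For intuition, with an ear-to-ear spacing of d = 17.5 cm and speed of sound c ≈ 343 m/s, the interaural time difference of a far-field source at angle θ from the sagittal plane is approximately

\Delta t \approx \frac{d}{c}\,\sin\theta, \qquad \Delta t_{\max} = \frac{0.175\ \mathrm{m}}{343\ \mathrm{m/s}} \approx 510\ \mathrm{\mu s},

which shrinks toward zero as θ approaches 0, explaining the reduced separation for background voices near the front-back plane.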

Multipath and reverberant environments. While our in-the-wild experiments show the performance in various indoor and outdoor environments, we also benchmark our system in different reverberant conditions, including those more reverberant than seen during training. Synthetic mixtures are generated using the pyroomacoustics library with the RT60 value randomly chosen between 0 and 4 s. We generate 200 examples and plot the SI-SDRi against the RT60 in Fig. 13. Our method shows only a slight decrease in performance as the reverberation of the environment increases. Because the target speaker is physically close to the microphone array, our setup is generally less affected by reverberations than other kinds of source separation problems where the target speaker may be further away.

Figure 14. Time Synchronization Validation. Without time synchronization (red), microphone samples drift apart and lose alignment at a rate of about 128 µs/min.

Separation between microphones. Our in-the-wild evaluation across 8 participants showed generalization across facial features. Here, we benchmark our method on different head sizes, where the distance between the microphones may be different. We generate 200 synthetic samples, where the distance between the microphones is randomly chosen between 10 and 25 cm. Because the target speaker is in the middle of the microphone array, the target signal will arrive at both mics simultaneously regardless of the microphone distance. Fig. 13 shows little change in performance even with microphone distances greatly different from those used during training.

5.4. System Evaluation

Synchronization. In order to evaluate synchronization, we place both ClearBuds roughly equidistant from a speaker. A click tone is played every 15 seconds for 5 minutes and recorded on both ClearBuds with time sync disabled and enabled. We calculate the sample error on each recorded click offline and convert it into time error using the 15.625 kHz sampling rate. Fig. 14(a) shows the synchronization results across a five minute interval. With time sync enabled, the sample error never exceeds 1 sample at 15.625 kHz, or 64 µs. Fig. 14(b) also shows the CDF of the timing error across experiments of 5 minutes each conducted with other Bluetooth devices in the environment, with and without time synchronization.

Run-time and end-to-end latency. Mouth-to-ear delay is defined as the time it takes from speech to exit the speaker’s mouth and reach the listener’s ear on the other end of the call. The International Telecommunication Union Telecommunication Standardization Sector (ITU-T) G.114 recommendation regarding mouth-to-ear delay indicates that most users are “very satisfied” as long as the latency does not exceed 200 ms (58). In our end-to-end system, we targeted a one-way latency of 100 ms prior to uplink, leaving up to 100 ms of network delay to move an IP packet from the source to the destination.

With a 180-sample PCM buffer being filled at 31.25 kHz, there is a 5.76 ms delay prior to the samples reaching the BLE stack. Once these samples reach the radio hardware, there is a worst-case additional latency of 7.5 ms as defined by the minimum BLE connection interval supported by Bluetooth 5.0 (3). At the time of writing, the latest iOS supports a minimum BLE connection interval of 15 ms. After the samples reach the mobile phone, we wait for 67.2 ms to receive enough samples to run a forward pass of our network. Our network has a run-time of 21.4 ms on an iPhone 12 Pro (see Table 2). The number of FLOPs is computed over each packet of 350 samples. Together, we have a latency of 109 ms, leaving 91 ms for one-way network delay (RTT = 182 ms).
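Using the 15 ms iOS connection interval (rather than the 7.5 ms Bluetooth 5.0 minimum), the latency budget above adds up as follows:

5.76\ \mathrm{ms} + 15\ \mathrm{ms} + 67.2\ \mathrm{ms} + 21.4\ \mathrm{ms} \approx 109\ \mathrm{ms}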

Power analysis. CB-Net uses an order of magnitude fewer FLOPs per second than Conv-TasNet on the smartphone, significantly reducing the computational load and the corresponding power consumption. We also measure the power consumption of the ClearBuds hardware: we measure current consumption by powering the system through its Micro-USB port with a DC power supply set to 3 V, which goes through the same power path as the coin cell battery. While continuously streaming microphone data wirelessly, the average current consumed is 5 mA. With the CR2032's nominal capacity of 210 mAh, this translates to approximately 42 hours of operation. Table 3 breaks down the system's power consumption by component; the accelerometer (BMA400) and flash (W25N01GVZEIG) are omitted as they are power gated during streaming.
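For reference, the battery-life estimate is a straightforward division of capacity by measured current; a short sketch using only the figures quoted in this paragraph:

```python
# Back-of-the-envelope streaming time from the measured current draw.
avg_current_mA = 5.0     # measured while wirelessly streaming microphone data
battery_mAh    = 210.0   # CR2032 nominal capacity
supply_V       = 3.0     # bench supply voltage used for the measurement

print(f"Average power: {avg_current_mA * supply_V:.1f} mW")          # ~15 mW
print(f"Estimated operation: {battery_mAh / avg_current_mA:.0f} h")  # ~42 h
```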

Device          Conv-TasNet   CB-Conv-TasNet   CB-Net
iPhone 12 Pro   155.5 ms      17.5 ms          21.4 ms
iPhone 11       165.4 ms      18.6 ms          22.7 ms
iPhone XS       241.5 ms      27.2 ms          33.0 ms
FLOPs/packet    1078 M        97 M             131 M
Table 2. Neural network run time on smartphones

6. Limitations & Future work

The first limitation is that the user must be wearing both wireless earbuds to benefit from our binaural noise suppression network. Second, with only two microphones, background voices may remain in the uplink channel if they originate within a few degrees of the target speaker's sagittal plane (see Fig. 13). The underlying assumption of our network is that the mouth is centered between the user's ears, though as seen in Fig. 13 and our in-the-wild evaluation, some variance is permissible.

While we minimize the power consumption of the ClearBuds hardware, we shift the processing, and therefore power consumption, to the more powerful mobile phone. Performing neural network computation on the mobile phone rather than on a cloud GPU improves user privacy and security, since sensitive voice data is never transmitted to the cloud. While mobile chips are becoming more power efficient, an alternative design to explore is running our neural network on a plugged-in edge device (e.g., a router), minimizing on-phone computation while achieving similar latency.

Future work could integrate two microphones in each earbud, so that each earbud could beamform toward the user's mouth prior to processing in the neural network. We also had to develop a custom wireless audio protocol to stream audio from two microphones to a single phone. While this prevents our architecture from being deployed on today's commodity wireless earbuds, adoption may be imminent, as Bluetooth 5.2 shows promise with the introduction of Multi-Stream Audio and Audio Broadcast (27).

Our network could also be deployed on other multi-microphone mobile or resource-constrained edge systems, such as smart watches, augmented reality glasses, or smart speakers, to enable enhanced voice control or telephony in noisy environments. The ClearBuds hardware and firmware could be leveraged to produce wireless, synchronized microphone arrays for telephony, acoustic activity recognition, or swarm robot localization and control.

7. Conclusion

Real-time speech enhancement has been an open research challenge for decades. The recent proliferation of wireless earbuds and advances in neural network architectures provide an opportunity to bridge the two and create new capabilities. Here, we present ClearBuds, the first deep-learning-based system to achieve real-time speech enhancement with binaural wireless earbuds. At its core are a new open-source wireless earbud design capable of operating as a synchronized binaural microphone array, and a lightweight cascaded neural network. In-the-wild experiments show that ClearBuds can achieve background noise suppression, background speech removal, and speaker separation using wireless earbuds.

Acknowledgments. This research is funded by the UW Reality Lab, Moore Inventor Fellow award #10617 and the researchers are also funded by the National Science Foundation. We thank our shepherd, Youngki Lee, and the anonymous reviewers for their feedback on our submission.

Component                          Power Consumption
BLE SoC (nRF52840)                 12.02 mW
Microphone (ICS-41350)             0.77 mW
Ideal Diode (LM66100DCKT)          0.27 µW
Buck Efficiency Loss (MAX38640)    1.75 mW
Total                              14.54 mW
Table 3. ClearBuds hardware power consumption

References

  • [1] J. B. Allen and D. A. Berkley (1979) Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America 65 (4), pp. 943–950. Cited by: §4.
  • [2] Apple AirPods. https://www.apple.com/airpods/. Cited by: §1, §2, §2.
  • [3] (2016) Bluetooth core specification v5.0. Cited by: §5.4.
  • [4] M. Brandstein (2001) Microphone arrays: signal processing techniques and applications. Springer Science & Business Media. Cited by: §2.
  • [5] N. Bui, N. Pham, J. J. Barnitz, Z. Zou, P. Nguyen, H. Truong, T. Kim, N. Farrow, A. Nguyen, J. Xiao, R. Deterding, T. Dinh, and T. Vu (2021-07) EBP: an ear-worn device for frequent and comfortable blood pressure monitoring. Commun. ACM 64 (8), pp. 118–125. External Links: ISSN 0001-0782, Link, Document Cited by: §2.
  • [6] C. R. N. S. C. L. C. and S. M.H. (2009) ITU-T coders for wideband, superwideband, and fullband speech communication. Cited by: §3.4.1.
  • [7] E. Ceolini, J. Hjortkjær, D. Wong, J. O’Sullivan, V. Raghavan, J. Herrero, A. Mehta, S. Liu, and N. Mesgarani (2020-08) Brain-informed speech separation (biss) for enhancement of target speaker in multitalker speech perception. NeuroImage 223, pp. 117282. External Links: Document Cited by: §2.
  • [8] J. Chan, A. Najafi, M. Baker, J. Kinsman, L. Mancl, S. Norton, R. Bly, and S. Gollakota (2022-06) Performing tympanometry using smartphones. Communications Medicine. Cited by: §2.
  • [9] J. Chan, S. Raju, R. Nandakumar, R. Bly, and S. Gollakota (2019-05) Detecting middle ear fluid using smartphones. Science Translational Medicine 11, pp. eaav1102. External Links: Document Cited by: §2.
  • [10] Z. Chen, X. Xiao, T. Yoshioka, H. Erdogan, J. Li, and Y. Gong (2018) Multi-channel overlapped speech recognition with location guided speech extraction network. pp. 558–565. Cited by: §2.
  • [11] A. Chhetri, P. Hilmes, T. Kristjansson, W. Chu, M. Mansour, X. Li, and X. Zhang (2018) Multichannel audio front-end for far-field automatic speech recognition. In 2018 EUSIPCO, pp. 1527–1531. Cited by: §1.
  • [12] H. Choi, J. Kim, J. Huh, A. Kim, J. Ha, and K. Lee (2019) Phase-aware speech enhancement with deep complex u-net. External Links: 1903.03107 Cited by: §2.
  • [13] A. Defossez, G. Synnaeve, and Y. Adi (2020) Real time speech enhancement in the waveform domain. External Links: 2006.12847 Cited by: §2, §5.3, Table 1.
  • [14] Z. Duan, G. J. Mysore, and P. Smaragdis Speech enhancement by online non-negative spectrogram decomposition in non-stationary noise environments. INTERSPEECH 2012, pp. 594–597. External Links: ISBN 9781622767595 Cited by: §2.
  • [15] (2020-03) Earbuds that put sound first. https://en-de.sennheiser.com/newsroom/earbuds-that-put-sound-first. Cited by: §2.
  • [16] Echo (3rd gen). https://www.amazon.com/all-new-echo/dp/b07nftvp7p. Amazon. Cited by: §2.
  • [17] I. Fedorov, M. Stamenovic, C. Jensen, L. Yang, A. Mandell, Y. Gan, M. Mattina, and P. N. Whatmough (2020-10) TinyLSTMs: efficient neural speech enhancement for hearing aids. Interspeech 2020. External Links: Link, Document Cited by: §2.
  • [18] O. L. Frost (1972) An algorithm for linearly constrained adaptive array processing. Proceedings of the IEEE 60 (8), pp. 926–935. Cited by: §2.
  • [19] S. Fu, C. Liao, Y. Tsao, and S. Lin (2019) MetricGAN: generative adversarial networks based black-box metric scores optimization for speech enhancement. External Links: 1905.04874 Cited by: §2.
  • [20] (2014-06) Galaxy s5 explained: audio. https://news.samsung.com/global/galaxy-s5-explained-audio. Cited by: §2, §2.
  • [21] F. G. Germain, Q. Chen, and V. Koltun (2018) Speech denoising with deep feature losses. External Links: 1806.10522 Cited by: §2.
  • [22] R. Gu, S. Zhang, L. Chen, Y. Xu, M. Yu, D. Su, Y. Zou, and D. Yu (2020) Enhancing end-to-end multi-channel speech separation via spatial feature learning. arXiv preprint arXiv:2003.03927. Cited by: §2.
  • [23] C. Han, Y. Luo, and N. Mesgarani (2020) Real-time binaural speech separation with preserved spatial cues. External Links: 2002.06637 Cited by: §2, §5.3.
  • [24] T. Herzke, H. Kayser, F. Loshaj, G. Grimm, and V. Hohmann (2017) Open signal processing software platform for hearing aid research ( openmha ). Cited by: §2.
  • [25] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. External Links: 1704.04861 Cited by: §3.2.1.
  • [26] https://appleinsider.com/articles/21/03/30/apple-airpods-beats-dominated-audio-wearable-market-in-2020. Cited by: §1.
  • [27] https://www.bluetooth.com/media/le-audio/le-audio-faqs. Cited by: §6.
  • [28] https://www.microsoft.com/en-us/research/academic-program/deep-noise-suppression-challenge-interspeech-2021/. Cited by: §5.2.
  • [29] InvenSense (2013-12) Microphone array beamforming. Technical report Technical Report AN-1140-00, InvenSense Inc., 1745 Technology Drive, San Jose, CA 95110 U.S.A. Cited by: §2.
  • [30] T. Jenrungrot, V. Jayaram, S. Seitz, and I. Kemelmacher-Shlizerman (2020) The cone of silence: speech separation by localization. External Links: 2010.06007 Cited by: §2, §4, §5.2.
  • [31] F. Kawsar, C. Min, A. Mathur, and A. Montanari (2018) Earables for personal-scale behavior analytics. IEEE Pervasive Computing 17 (3), pp. 83–89. External Links: Document Cited by: 1st item, §2.
  • [32] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.3.5.
  • [33] W. Kock (1950) Binaural localization and masking. The Journal of the Acoustical Society of America 22 (6), pp. 801–804. Cited by: §2.
  • [34] H. Krim and M. Viberg (1996) Two decades of array signal processing research: the parametric approach. IEEE signal processing magazine 13 (4), pp. 67–94. Cited by: §1.
  • [35] J. Li, S. Sakamoto, S. Hongo, M. Akagi, and Y. Suzuki (2011) Two-stage binaural speech enhancement with wiener filter for high-quality speech communication. Speech Communication 53 (5), pp. 677–689. Cited by: §2.
  • [36] Y. Luo, Z. Chen, N. Mesgarani, and T. Yoshioka (2020) End-to-end microphone permutation and number invariant multi-channel speech separation. External Links: 1910.14104 Cited by: §4, §5.2.
  • [37] Y. Luo and N. Mesgarani (2019) Conv-tasnet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing. Cited by: 2nd item, §1, §2, §3.2.1.
  • [38] R. Lyon (1983) A computational model of binaural localization and separation. pp. 1148–1151. Cited by: §2.
  • [39] D. Ma, A. Ferlini, and C. Mascolo (2021) OESense: employing occlusion effect for in-ear human sensing. MobiSys ’21, pp. 175–187. External Links: ISBN 9781450384438, Link, Document Cited by: §2.
  • [40] D. Ma, A. Ferlini, and C. Mascolo (2021) OESense: employing occlusion effect for in-ear human sensing. In MobiSys, pp. 175–187. External Links: ISBN 9781450384438 Cited by: §2.
  • [41] C. Macartney and T. Weyde (2018) Improved speech enhancement with the wave-u-net. External Links: 1811.11307 Cited by: §2.
  • [42] meet.google.com. Cited by: §2.
  • [43] Microphone array beamforming. https://invensense.tdk.com/wp-content/uploads/2015/02/microphone-array-beamforming.pdf. Cited by: §2.
  • [44] C. Min, A. Mathur, and F. Kawsar (2018) Exploring audio and kinetic sensing on earable devices. WearSys ’18, pp. 5–10. External Links: ISBN 9781450358422, Link, Document Cited by: §2.
  • [45] N. Mohammadiha, P. Smaragdis, and A. Leijon (2013) Supervised and unsupervised speech enhancement using nonnegative matrix factorization. IEEE Transactions on Audio, Speech, and Language Processing 21 (10), pp. 2140–2151. External Links: ISSN 1558-7924, Link, Document Cited by: §2.
  • [46] M. Nikzad, A. Nicolson, Y. Gao, J. Zhou, K. K. Paliwal, and F. Shang (2020) Deep residual-dense lattice network for speech enhancement. External Links: 2002.12794 Cited by: §2.
  • [47] T. L. Paine, P. Khorrami, S. Chang, Y. Zhang, P. Ramachandran, M. A. Hasegawa-Johnson, and T. S. Huang (2016) Fast wavenet generation algorithm. External Links: 1611.09482 Cited by: §3.2.1.
  • [48] S. Pascual, A. Bonafonte, and J. Serrà (2017) SEGAN: speech enhancement generative adversarial network. External Links: 1703.09452 Cited by: §2.
  • [49] Pavlovic,Caslav, Hohmann,Volker, Kayser,Hendrik, Wong,Louis, Herzke,Tobias, Prakash,S. R., Hou,zezhang, and Maanen,Paul (2018) Open portable platform for hearing aid research. The Journal of the Acoustical Society of America 143 (3), pp. 1738–1738. External Links: Document, Link, https://doi.org/10.1121/1.5035670 Cited by: §2.
  • [50] J. Powar and A. R. Beresford (2019) A data sharing platform for earables research. In Proceedings of the 1st International Workshop on Earable Computing, EarComp’19, New York, NY, USA, pp. 30–35. External Links: ISBN 9781450369022, Link, Document Cited by: §2.
  • [51] Project gutenberg. Note: https://www.gutenberg.org/Accessed: 2021-12-20 Cited by: §5.2.
  • [52] C. K. A. Reddy, H. Dubey, V. Gopal, R. Cutler, S. Braun, H. Gamper, R. Aichner, and S. Srinivasan (2021) ICASSP 2021 deep noise suppression challenge. pp. 6623–6627. External Links: Document Cited by: §5.2.
  • [53] K. Reindl, Y. Zheng, and W. Kellermann (2010) Speech enhancement for binaural hearing aids based on blind source separation. pp. 1–6. Cited by: §2.
  • [54] M. Risoud, J.-N. Hanson, F. Gauvrit, C. Renard, P.-E. Lemesre, N.-X. Bonne, and C. Vincent (2018) Sound source localization. European Annals of Otorhinolaryngology, Head and Neck Diseases 135 (4), pp. 259–264. External Links: ISSN 1879-7296, Document, Link Cited by: §4.
  • [55] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. External Links: 1505.04597 Cited by: §3.2.2.
  • [56] J. L. Roux, S. Wisdom, H. Erdogan, and J. R. Hershey (2018) SDR - half-baked or well done?. CoRR abs/1811.02508. External Links: Link, 1811.02508 Cited by: §5.1.
  • [57] R. Scheibler, E. Bezzam, and I. Dokmanić Pyroomacoustics: a python package for audio room simulation and array processing algorithms. pp. 351–355. Cited by: §4.
  • [58] (2003) Series g: transmission systems and media, digital systems and networks. Note: ITU-T Rec. G.114 Cited by: 3rd item, §5.4.
  • [59] (2015-07) Setting up the timeslot api. https://devzone.nordicsemi.com/nordic/short-range-guides/b/software-development-kit/posts/setting-up-the-timeslot-api. Cited by: §3.4.2.
  • [60] N. Shankar, G. Shreedhar Bhat, and I. Panahi (2020-07) Efficient two-microphone speech enhancement using basic recurrent neural network cell for hearing and hearing aids. The Journal of the Acoustical Society of America 148, pp. 389–400. External Links: Document Cited by: §2.
  • [61] M. H. Soni, N. Shah, and H. A. Patil Time-frequency masking-based speech enhancement using generative adversarial network. Cited by: §2.
  • [62] F. Stöter, A. Liutkus, and N. Ito (2018) The 2018 signal separation evaluation campaign. External Links: 1804.06267 Cited by: §5.3.
  • [63] C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong (2021) Attention is all you need in speech separation. External Links: 2010.13154 Cited by: §1.
  • [64] X. Sun, R. Xia, J. Li, and Y. Yan A deep learning based binaural speech enhancement approach with spatial cues preservation. pp. 5766–5770. External Links: Document Cited by: §2.
  • [65] K. Tan, X. Zhang, and D. Wang (2019) Real-time speech enhancement using an efficient convolutional recurrent network for dual-microphone mobile phones in close-talk scenarios. In ICASSP 2019, Vol. , pp. 5751–5755. External Links: Document Cited by: §2, §5.2, §5.3.
  • [66] K. Tan, X. Zhang, and D. Wang (2021) Deep learning based real-time speech enhancement for dual-microphone mobile phones. IEEE/ACM Transactions on Audio, Speech, and Language Processing (), pp. 1–1. External Links: Document Cited by: §2.
  • [67] B. A. Telephony and A. W. Group (2020-04) Hands-free profile: bluetooth® profile specification. Technical report Technical Report v1.8, Bluetooth SIG. Cited by: §2.
  • [68] P. Tzirakis, A. Kumar, and J. Donley (2021) Multi-channel speech enhancement using graph neural networks. External Links: 2102.06934 Cited by: §2, §4.
  • [69] R. van Hoesel, M. Böhm, J. Pesch, A. Vandali, R. D. Battmer, and T. Lenarz (2008) Binaural speech unmasking and localization in noise with bilateral cochlear implants using envelope and fine-timing based strategies. The Journal of the Acoustical Society of America 123 (4), pp. 2249–2263. Cited by: §2.
  • [70] B. D. Van Veen and K. M. Buckley (1988) Beamforming: a versatile approach to spatial filtering. IEEE assp magazine 5 (2), pp. 4–24. Cited by: §1, §2.
  • [71] C. Veaux, J. Yamagishi, K. MacDonald, et al. (2016) Superseded-cstr vctk corpus: english multi-speaker corpus for cstr voice cloning toolkit. Cited by: §4.
  • [72] A. Wang, M. Kim, H. Zhang, and S. Gollakota (AAAI 2022) Hybrid neural networks for on-device directional hearing. External Links: 2112.05893 Cited by: §2.
  • [73] D. Wang (2005) On ideal binary mask as the computational goal of auditory scene analysis. In Speech separation by humans and machines, pp. 181–197. Cited by: §5.3.
  • [74] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Roux, J. R. Hershey, and B. Schuller (2015) Speech enhancement with lstm recurrent neural networks and its application to noise-robust asr. In LVA/ICA 2015, Vol. 8. External Links: ISBN 9783319224817, Link, Document Cited by: §2.
  • [75] N. L. Westhausen and B. T. Meyer (2020) Dual-signal transformation lstm network for real-time noise suppression, arxiv. arXiv. External Links: Document, Link Cited by: §5.3, Table 1.
  • [76] G. Wichern, J. Antognini, M. Flynn, L. R. Zhu, E. McQuinn, D. Crow, E. Manilow, and J. L. Roux (2019) WHAM!: extending speech separation to noisy environments. arXiv preprint arXiv:1907.01160. Cited by: §4, §4.
  • [77] (2016-07) Wireless timer synchronization among nrf5 devices. https://devzone.nordicsemi.com/nordic/short-range-guides/b/bluetooth-low-energy/posts/wireless-timer-synchronization-among-nrf5-devices. Cited by: §3.4.2.
  • [78] www.krisp.ai. Cited by: §1, §2, §5.3.
  • [79] Y. Xu, J. Du, L. Dai, and C. Lee (2015) A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (1), pp. 7–19. External Links: Document Cited by: §2.
  • [80] Z. Yang and R. R. Choudhury (2021) Personalizing head related transfer functions for earables. SIGCOMM ’21, New York, NY, USA, pp. 137–150. External Links: ISBN 9781450383837, Link, Document Cited by: §2.
  • [81] T. Yoshioka, H. Erdogan, Z. Chen, and F. Alleva (2018) Multi-microphone neural speech separation for far-field multi-talker speech recognition. pp. 5739–5743. Cited by: §2.
  • [82] X. Zhang and D. Wang (2017) Deep learning based binaural speech separation in reverberant environments. IEEE/ACM transactions on audio, speech, and language processing 25 (5). Cited by: §2, §2.
  • [83] H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba (2018) The sound of pixels. External Links: 1804.03160 Cited by: §4.