RadioMic: Sound Sensing via mmWave Signals

08/06/2021
by Muhammed Zahid Ozturk, et al.
University of Maryland

Voice interfaces have become an integral part of our lives with the proliferation of smart devices. Today, IoT devices mainly rely on microphones to sense sound. Microphones, however, have fundamental limitations, such as weak source separation, limited range in the presence of acoustic insulation, and susceptibility to multiple side-channel attacks. In this paper, we propose RadioMic, a radio-based sound sensing system that mitigates these issues and enriches sound applications. RadioMic reconstructs sound from the tiny vibrations on active sources (e.g., a speaker or human throat) or object surfaces (e.g., a paper bag), and can work through walls, even soundproof ones. To convert the extremely weak sound vibration in the radio signals into sound signals, RadioMic introduces radio acoustics, and presents training-free approaches for robust sound detection and high-fidelity sound recovery. It then exploits a neural network to further enhance the recovered sound by expanding the recoverable frequencies and reducing noise. RadioMic translates massive online audio into synthesized training data for the network, and thus minimizes the need for RF data. We thoroughly evaluate RadioMic under different scenarios using a commodity mmWave radar. The results show that RadioMic outperforms state-of-the-art systems significantly. We believe RadioMic opens new horizons for sound sensing and inspires attractive sensing capabilities for mmWave sensing devices.


1. Introduction

Figure 1. An illustrative scenario of RadioMic

Sound, as the most natural medium of human communication, has also become a ubiquitous modality for human-machine-environment interactions. Many applications have emerged in the Internet of Things (IoT), including voice interfaces, sound event monitoring in smart homes and buildings, acoustic sensing for gestures and health, etc. For example, smart speakers like Amazon Alexa or Google Home can now understand user voices, control IoT devices, monitor the environment, and sense sounds of interest such as glass breaking or smoke alarms, all currently using microphones as the primary interface. With the proliferation of such devices in smart environments, new methods to sense acoustic contexts have become even more important, while microphones only mimic human perception, with limited capabilities.

Various sound-enabled applications acutely demand a next-generation sound sensing modality with more advanced features: robust sound separation, noise resistance, through-the-wall recovery, sound liveness detection against side attacks, etc. For example, as illustrated in Fig. 1, robust sound separation can enable a smart voice assistant to sustain its performance in noisy environments (Wang et al., 2020; Xu et al., 2019). Being able to identify animate subjects accurately and quickly would improve the security of voice control systems against demonstrated audio attacks (Roy et al., 2017; Zhang et al., 2017; Alegre et al., 2014; Wu et al., 2014). Sensing behind insulation can extend the operational range of a smart device to multiple rooms, and allows retaining sound awareness of the outside environment even in a soundproof space.

Although microphones have been the most commonly used sensors to sense acoustic events, they have certain limitations. As they can only sense the sound at the destination (i.e., the microphone's location; a notable exception is the contact or piezo microphone, which senses sound at the source through contact with solid objects), a single microphone cannot separate and identify multiple sound sources, whereas a microphone array can only separate sources in the azimuth direction and requires a large aperture. And by sensing any sound that arrives at the destination, microphones raise potential privacy concerns when deployed as a ubiquitous and continuous sound sensing interface in homes. In addition, they are prone to inaudible voice attacks and replay attacks, as they only sense the received sound but capture nothing about the source. To overcome some of these limitations, various modalities have been exploited to sense sound signals, such as accelerometers (Zhang et al., 2015), vibration motors (Roy and Roy Choudhury, 2016), cameras and light (Davis et al., 2014; Nassi et al., 2020), lasers (Muscatell, 1984), lidar (Sami et al., 2020), etc. These systems either still sense the sound at the destination, thus having similar drawbacks as microphones, or require line-of-sight (LOS) and favorable lighting conditions to operate. Consequently, these modalities fail to enable the aforementioned goals in a holistic way.

Recently, as mmWave radios have been deployed as standalone low-cost sensing devices (ti1, 2020; dec, 2020), or on commodity routers (net, 2021), smartphones (sol, 2020) and smart hubs (sol, 2021), researchers have attempted to sense sound directly from the source using mmWave (Wang et al., 2020; Xu et al., 2019; Wei et al., 2015). For example, WaveEar (Xu et al., 2019) employs deep learning with extensive training to reconstruct sound from the human throat, using a customized radar. UWHear (Wang et al., 2020) uses an ultra-wideband radar to extract and separate sound vibration on speaker surfaces. mmVib (Jiang et al., 2020a) monitors single-tone machinery vibration using a mmWave radar, but not the much weaker sound vibration. However, these systems cannot robustly reject non-acoustic motion interference and mainly reconstruct low-frequency sounds (around 1 kHz and below), and none of them recovers sound from daily objects like a paper bag. Nevertheless, they show the feasibility of sound sensing using mmWave signals and inspire a more advanced design.

In order to enable the aforementioned features holistically, we propose RadioMic, a mmWave-based sensing system that can capture sound and beyond, as illustrated in Figure 1. RadioMic can detect, recover, and classify sound from sources in multiple environments. It can recover various types of sounds, such as music, speech, and environmental sound, from both active sources (e.g., speakers or human throats) and passive sources (e.g., daily objects like a paper bag). When multiple sources are present, it can reconstruct the sounds separately with respect to distance, which cannot be achieved by classical beamforming with microphone arrays, while remaining immune to motion interference. RadioMic can also sense sound through walls and even soundproof materials, as RF signals have different propagation characteristics than sound. Potentially, RadioMic, located in an insulated room (or in a room with active noise cancellation (Roy et al., 2017)), can be used to monitor and detect acoustic events outside the room, offering both soundproofness and sound awareness at the same time. Besides, RadioMic can even detect the liveness of a recorded speech and tell whether it comes from a human subject or an inanimate source, providing an extra layer of security for IoT devices.

RadioMic's design involves multiple challenges. 1) Sound-induced vibration is extremely weak, on the order of micrometers (e.g., micrometer-scale displacement on aluminum foil for a source sound at 85 dB (Davis et al., 2014)). Human ears or microphone diaphragms have sophisticated structures to maximize this micro vibration. Speaker diaphragms or daily objects, however, alter the sound vibration differently and create severe noise, combined with noise from the radio devices. 2) An arbitrary motion in the environment interferes with the sound signal and makes robust detection of sound non-trivial, especially when the sound-induced vibration is weak. 3) Wireless signals are prone to multipath, and the returned signals comprise static reflections, sound vibration, and/or its multiple copies. 4) As we will show later, due to material properties, sounds captured from daily objects are fully attenuated at high frequencies beyond 2 kHz, which significantly impairs the intelligibility of the sensed sounds.

RadioMic overcomes these challenges in multiple distinct ways. It first introduces a novel radio acoustics model that relates radio signals and acoustic signals. On this basis, it detects sound with a training-free module that utilizes fundamental differences between a sound and any other motion. To reduce the effect of background, RadioMic filters and projects the signal in the complex plane, which reduces noise while preserving the signal content, and further benefits from multipath and receiver diversities to boost the recovered sound quality. RadioMic also employs a radio acoustics neural network to solve the extremely ill-posed high-frequency reconstruction problem, which leverages massive online audio datasets and requires minimal RF data for training.

We implement RadioMic using a commercial off-the-shelf (COTS) mmWave radar, evaluate its performance, and compare it with the state of the art in different environments, using diverse sound files at varying sound levels. The results show that RadioMic can recover sounds from active sources such as speakers and human throats, and from passive objects like aluminum foil, a paper bag, or a bag of chips. Particularly, RadioMic outperforms the latest approaches (Wang et al., 2020; Jiang et al., 2020a) in sound detection and reconstruction under various criteria. Furthermore, we present case studies of multiple source separation and sound liveness detection to show interesting applications enabled by RadioMic.

In summary, our contributions are:


  • We design RadioMic, an RF-based sound sensing system that separates multiple sounds and operates through walls. To the best of our knowledge, RadioMic is the first RF system that can recover sound from passive objects and also detect the liveness of the source.

  • We build the first radio acoustics model from the perspective of Channel Impulse Response, which is generic to the underlying RF signals and underpins training-free robust sound detection and high-fidelity sound reconstruction.

  • We develop a radio acoustics neural network, requiring minimal RF training data, to enhance the sensed sound by expanding the recoverable frequencies and denoising.

  • We implement RadioMic on low-cost COTS hardware and demonstrate multiple attractive applications. The results show that it outperforms the state-of-the-art approaches.

2. Radio Acoustics

In this section, we explain the mechanics of sound sensing using radio signals, which we name radio acoustics.

Figure 2. Examples of radio-sensed sound. Bottom are microphone references. Left: guitar sound (Note G3, 196 Hz); Right: frequency sweep sound sensed from aluminum foil.

2.1. Mechanics

Sound is basically the modulation of medium pressure through various mechanisms (without loss of generality, we assume the medium to be air in this paper). It is generated by a vibrating surface, and the modulation signal travels through the in-place motion of air molecules.

A vibrating surface could be a speaker diaphragm, a human throat, the strings of a musical instrument such as a guitar, or many daily objects like a paper bag. In the case of speakers, motion of the speaker diaphragm modulates the signal, whereas in the human throat, the vocal cords create the vibration, with the mouth and lips operating as additional filters, based on the source-filter model (Fant, 1970). To sense the sound, the same mechanism is employed at microphones to convert the changes in air pressure into an electrical signal, via suitable diaphragms and electrical circuitry. Microphone diaphragms are designed to be sensitive to air vibration and optimized to capture the range of audible frequencies (20 Hz-20 kHz), and even beyond (Roy et al., 2017).

The mechanics of extracting sound from radio signals rely on the Doppler phenomenon and the relationship between the object vibration and the reflected signals. The vibration alters how the signals reflected off the object surface propagate. Therefore, the reflected signals can measure the tiny vibrations of an object surface in terms of Doppler shifts, from which we can recover the sound.
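To get a sense of scale, the round-trip phase of a reflection changes by 4πΔd/λ when the surface moves by Δd. The short sketch below, assuming a 77 GHz carrier and an illustrative 1 µm displacement (values chosen only for illustration), shows how tiny this Doppler-induced phase deviation is:

```python
import numpy as np

# Back-of-the-envelope phase deviation for a sound-induced displacement.
# Assumed values: 77 GHz carrier (as used later in the paper) and a 1 um
# surface displacement, which is on the scale of sound vibrations.
c = 3e8                                   # speed of light (m/s)
fc = 77e9                                 # carrier frequency (Hz)
lam = c / fc                              # wavelength, ~3.9 mm
delta_d = 1e-6                            # hypothetical surface displacement (m)
delta_phi = 4 * np.pi * delta_d / lam     # round-trip phase change (rad)
print(f"{delta_phi:.4f} rad = {np.degrees(delta_phi):.2f} degrees")
```

Such sub-degree phase deviations hint at why the rest of the design spends so much effort on noise rejection and careful projection.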

The vibration occurs not only at the source where sound is generated, but also on intermediary objects that are excited by the air. Most commonly, sound-modulated air molecules can cause micrometer-level or smaller vibrations on any object surface. The vibration amplitude depends on material properties and various other factors. Generally, sound vibrations are stronger at the source where they are generated (referred to as active vibration sources), and much weaker at the intermediary objects (referred to as passive vibration sources). Microphones typically only sense sound from active sources, as the air-modulated passive sound is too weak to propagate further to the microphone diaphragm. In contrast, as radio acoustics senses sound directly at the source, it can potentially reconstruct sound from both active and passive sources. To illustrate the concept, we place a radar in front of a guitar, play the string G3 repeatedly, and provide the radio and microphone spectrograms in Fig. 2. As shown, when the string is struck, it continues to vibrate (or move back and forth) in place to create the changes in air pressure, and therefore the sound. The radar senses the sound by capturing the motion of the strings, whereas the microphone captures the modulated air pressure at its diaphragm. The right part of Fig. 2 shows an example of sound sensed from an aluminum foil. Although the sensing mechanisms of microphones and radio are completely different in nature, they capture the same phenomenon, and the resulting signals are similar.

2.2. Theoretical Background

Starting with the mechanics of sound vibration, we build the radio acoustics model. Later on, we explain how this model is used to reconstruct sound from radio signals in §3.

Sound Vibration Properties: As explained previously, sound creates vibration (motion) on objects, which is proportional to the energy of sound transmitted from the air to the object and depends on multiple factors, such as inertia and signal frequency (Fahy, 2000). Denoting the acoustic signal with s(t), we can model the displacement due to sound as:

d(t) = h(t) * s(t),    (1)

where h(t) denotes the vibration generation mechanism for an active source or the impulse response of the air-to-object interface for a passive object, and * represents convolution.

Radio Acoustic Model: From the point of view of physics, sound-induced vibration is identical to machinery vibration, except that sound vibration is generally orders of magnitude weaker. To model the sound vibration from RF signals, one could follow the model used in (Jiang et al., 2020a), which however assumes knowledge of the signal model and thus depends on the specific radio device being used. In RadioMic, we establish a model based on the Channel Impulse Response (CIR) of RF signals, which is generic and independent of the underlying signals. By doing so, in principle, the model applies to any radio device that outputs high-resolution CIR, be it an FMCW radar (Jiang et al., 2020a), an impulse radar (Wang et al., 2020), or others (Palacios et al., 2018).

Figure 3. Different use cases of RadioMic. a) Sensing sound from active/passive sources, b) Sensing through soundproof materials, c) Separating sound of multiple sources, d) Sound liveness detection

The CIR of an RF signal can be given as

h(t, τ) = Σ_{l=1}^{L} a_l(t) δ(τ − τ_l(t)),    (2)

where t and τ are referred to as long (slow) and short (fast) time, L denotes the number of range bins (sampling w.r.t. distance), a_l denotes a complex scaling factor, τ_l is the roundtrip delay from range bin l, and δ(·) represents the Dirac delta function, indicating the presence of an object. Assuming no multipath, and an object of interest at range bin l, corresponding to time delay τ_l, the CIR of that range bin can be given as:

h_l(t) = a_l(t) e^{−j 2π f_c τ_l(t)},    (3)

where f_c denotes the carrier frequency. If we assume the object to remain within range bin l, we can convert the time delay into range and rewrite the CIR as:

h_l(t) = a_l e^{−j 4π d(t) / λ},    (4)

where d denotes the actual distance of the object, and λ denotes the wavelength.

Now, considering a vibrating object (i.e., a sound source), we can decompose the range value into a static and a vibrating part as d(t) = d_0 + d_v(t). As can be seen, there is a direct relationship between the vibration and the phase of the returned signal. By extracting the phase of h_l(t), one can derive d(t), and therefore the vibration signal d_v(t). We further omit the temporal dependency of a_l, as we assume the object to be stationary and the effect of the vibration-induced displacement on the path loss to be negligible.

So far, we have assumed that only the vibrating object lies in the field of view of the radar, and did not account for other reflections from the environment. As suggested by (4), h_l(t) lies on a circle in the complex (IQ) plane centered at the origin. However, due to various background reflections, h_l(t) is actually superimposed with a background vector, and the circle center is shifted away from the origin. Thus, h_l(t) can be written as:

h_l(t) = a_l e^{−j 4π (d_0 + d_v(t)) / λ} + B e^{jθ_B} + n(t),    (5)

where B and θ_B are the amplitude and phase shift caused by the sum of all background reflections and vibrations, and n(t) is the additive white noise term. Equation (5) explains the received signal model, and will be used to build the sound reconstruction block of RadioMic in §3.
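As a concrete illustration of this model, the short Python sketch below simulates Eq. (5) with assumed toy values (a 1 µm, 440 Hz vibration, a 77 GHz carrier, and an arbitrary background vector) and shows that the vibration d_v(t) can be read back from the residual phase once the background is removed:

```python
import numpy as np

# Toy simulation of the CIR model in Eq. (5) with assumed values, showing that the
# vibration d_v(t) is encoded in the phase of h_l(t) once the background is removed.
fs = 6250                                     # CIR (slow-time) sampling rate (Hz)
t = np.arange(fs) / fs
lam = 3e8 / 77e9                              # wavelength at 77 GHz (~3.9 mm)
d0 = 1.0                                      # static distance of the reflector (m)
d_v = 1e-6 * np.sin(2 * np.pi * 440 * t)      # 1 um, 440 Hz vibration (illustrative)

a_l = 1.0                                     # reflection coefficient (assumed)
bg = 5.0 * np.exp(1j * 0.7)                   # aggregate static background reflections
noise = 0.02 * (np.random.randn(fs) + 1j * np.random.randn(fs))
h_l = a_l * np.exp(-1j * 4 * np.pi * (d0 + d_v) / lam) + bg + noise   # Eq. (5)

# Remove the (known-in-simulation) background and read the residual phase.
vib = h_l - bg
phase = np.angle(vib * np.exp(1j * 4 * np.pi * d0 / lam))   # strip the static phase
d_v_hat = -lam * phase / (4 * np.pi)                        # recovered vibration (m)
```

In practice the background is of course unknown; RadioMic removes it with the high-pass filtering and line projection described in §3.3.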

2.3. Potential Applications

As shown in Fig. 3, RadioMic could benefit various applications, including many that have not been easily achievable before. By overcoming the limitations of today's microphones, RadioMic can enhance the performance of popular smart speakers in noisy environments. Collecting spatially separated audio helps to better understand acoustic events of human activities, appliance functions, machine states, etc. Sound sensing through soundproof materials provides awareness of outside contexts while preserving a quiet space, which would be useful, for example, to monitor kids' activities while working from home in a closed room. Detecting the liveness of a sound source can protect voice control systems from being attacked by inaudible voice (Zhang et al., 2017) or replayed audio (Wu et al., 2014). With mmWave entering more smart devices, RadioMic could also be combined with a microphone to leverage their mutual advantages.

On the other hand, RadioMic can be integrated with existing wireless sensing applications. For example, Soli-based sleep monitoring (sol, 2021) currently employs a microphone to detect coughs and snoring, which may pose privacy concerns yet is no longer needed with RadioMic. While remarkable progress has been achieved in RF-based imaging (Zhao et al., 2018; Jiang et al., 2020b; Zhang et al., 2021), RadioMic could offer a channel of the accompanying audio.

We will show proof of concepts for some of the applications (§6) and leave many more to be explored in the future.

3. RadioMic Design

RadioMic consists of four main modules. It first extracts the CIR from raw RF signals (§3.1). From there, it detects sound vibration (§3.2) and recovers the sound (§3.3), while rejecting motion that is not of interest. Lastly, it feeds the recovered sound into a neural network for enhancement (§3.4).

3.1. Raw Signal Conversion

Our implementation mainly uses a COTS FMCW mmWave radar, although RadioMic can work with other mmWave radios that report high-resolution CIR, such as an impulse radar; CIR on impulse radars has also been exploited in prior work (Wang et al., 2020; Zhang et al., 2021; Wu et al., 2020). Prior to explaining how RadioMic recovers sound, we provide preliminaries on extracting the CIR from a linear FMCW radar for a comprehensive explanation.

An FMCW radar transmits a single-tone signal with linearly increasing frequency, called a chirp, and captures the echoes from the environment. The time delay of the echoes can be extracted by calculating the frequency shift between the transmitted and received signals, which can be converted into range information. This range information is used to differentiate an object from the other reflectors in the environment. In order to obtain the range information, the frequency shift between the transmitted and received signals is calculated by applying an FFT, usually known as the Range-FFT (Stove, 1992). The output of the Range-FFT can be considered as the CIR h(t, τ), and our modeling in §2.2 applies.
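As a rough sketch of this step, the snippet below simulates one de-chirped (beat) signal for a single reflector and applies the Range-FFT; the chirp duration, ADC rate, and sample count are illustrative assumptions, not the radar configuration used in the paper:

```python
import numpy as np

# Range-FFT sketch on a simulated FMCW beat signal (assumed radar parameters).
c = 3e8
B, Tc = 3.52e9, 50e-6          # sweep bandwidth (as in Sec. 4) and assumed chirp duration
S = B / Tc                     # chirp slope (Hz/s)
fs_adc, n = 5e6, 256           # assumed ADC rate and samples per chirp
t = np.arange(n) / fs_adc

R = 1.2                                       # target range (m)
f_beat = 2 * S * R / c                        # beat frequency for that range
beat = np.exp(1j * 2 * np.pi * f_beat * t)    # ideal de-chirped signal

cir = np.fft.fft(beat, n)                     # Range-FFT -> range bins of the CIR
rng_axis = np.arange(n) * fs_adc / n * c / (2 * S)
print("estimated range (m):", rng_axis[np.argmax(np.abs(cir))])
```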

RadioMic further obtains the so-called range-Doppler spectrogram from the CIR, which is extracted by a short-time Fourier transform (STFT) operation. The STFT is basically a set of FFT operations applied along the long-time dimension of h(t, τ) for subsets of long-time indices, called frames. We denote the output range-Doppler spectrogram as S(f, τ, k), where f denotes the frequency (Doppler) shift, τ corresponds to the range bins (equivalent to the short-time delay), and k is the frame index. We note that S is defined for both positive and negative frequencies, corresponding to different motion directions of the objects, which will be used in the following section.
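A minimal sketch of this step with SciPy, using a placeholder CIR matrix and assumed frame parameters (the actual window and overlap used by RadioMic are given in §4):

```python
import numpy as np
from scipy.signal import stft

fs = 6250                                   # CIR (slow-time) sampling rate, Hz
n_range = 64                                # number of range bins (placeholder)
cir = (np.random.randn(2 * fs, n_range)     # placeholder complex CIR: (slow time, range)
       + 1j * np.random.randn(2 * fs, n_range))

# STFT along slow time for every range bin; keep both Doppler signs.
f, k, S = stft(cir, fs=fs, window="hann", nperseg=256, noverlap=192,
               return_onesided=False, axis=0)
# S has shape (n_doppler_freq, n_range, n_frames), i.e. S(f, tau, k) in the text.
print(S.shape)
```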

3.2. Sound Detection & Localization

As any range bin can contain sound vibration, it is critical to have a robust detection module that can label both range bins and time indices effectively. Standard methods in the literature, such as the constant false alarm rate (CFAR) rule or the Herfindahl-Hirschman Index (HHI) (Wang et al., 2020), are not robust (see §5.1), as we envision a system that is triggered only by sound vibration, not by arbitrary motion.

In RadioMic, we leverage the physical properties of sound vibration to design a new approach. Mainly, RadioMic relies on the fact that a vibration signal creates both positive and negative Doppler shifts, as it entails displacement in both directions. This forward and backward motion is expected to have the same amplitudes at the same frequency bins, but with opposite signs, as the net displacement is zero. This results in symmetric spectrograms, as also noted by other work (Rong et al., 2019; Jiang et al., 2020a). RadioMic exploits this observation with a novel metric for robust sound detection.

To define a sound metric, let S+ denote the magnitude of the positive frequencies of the range-Doppler spectrogram S, i.e., S+(f, τ, k) = |S(f, τ, k)| for f > 0, and similarly define S− for the negative frequencies. Note that the values of S+ and S− are always positive, as they are defined as magnitudes, and have a non-zero mean even when there is no signal, due to the additive noise. Calculating the cosine similarity or correlation coefficient would therefore yield high values even when there are only background reflections. In order to provide a more robust metric, we subtract the noise floor from both S+ and S− and denote the resulting matrices with P and N. Then, instead of the standard cosine similarity, we change the definition to enforce similarity of the amplitudes in P and N:

r(τ, k) = 2 ⟨P, N⟩ / (‖P‖² + ‖N‖²),    (6)

where the inner product and norms are taken over the Doppler frequency dimension for each range bin τ and frame k. Replacing the product of norms in the cosine-similarity denominator with the sum of squared norms makes the metric peak only when the positive and negative halves match in both shape and amplitude.
Figure 4. Sound metric. An aluminum foil is placed at 0.5m. Music starts playing around 1.5s, while random motion occurs at distances 13m for 10s. Spectrograms at distance (a) 0.5m and (b) 1.5m, and (c) resulting sound metric map

RadioMic calculates the sound metric in (6) for each range bin τ and each time frame k, resulting in a sound metric map as illustrated in Fig. 4c. Music sound (Fig. 4a) results in high values of the sound metric, whereas arbitrary motion (Fig. 4b) is suppressed significantly, due to the asymmetry in its Doppler signature and power mismatches. This illustrates the responsiveness of the sound metric to vibration, while keeping comparatively lower values for random motion.

To detect vibration, RadioMic uses a median absolute deviation (MAD) based outlier detection algorithm, and only extracts outliers with positive deviation. Our evaluation in §5 shows that an outlier-based scheme outperforms a fixed threshold, as non-sound motion can occasionally produce arbitrarily high values, which would create false alarms for a hard threshold. Additionally, this approach adapts well to various motions and sounds of diverse amplitudes, including those from active and passive sources.
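A compact sketch of the detector described above, assuming the reconstruction of Eq. (6) given earlier and an illustrative MAD threshold (the actual threshold used by RadioMic is not specified here):

```python
import numpy as np

def sound_metric(S, noise_floor):
    """Sound metric sketch for Eq. (6) as reconstructed above: a cosine-style
    similarity between positive- and negative-Doppler magnitudes, with the product
    of norms replaced by the sum of squared norms so that amplitude mismatches
    (typical of non-vibration motion) are penalized. S: (freq, range, frame)."""
    n_f = S.shape[0]
    pos = np.abs(S[1:n_f // 2])                 # positive Doppler bins
    neg = np.abs(S[-1:-(n_f // 2):-1])          # mirrored negative Doppler bins
    pos = np.clip(pos - noise_floor, 0.0, None)
    neg = np.clip(neg - noise_floor, 0.0, None)
    num = 2.0 * np.sum(pos * neg, axis=0)
    den = np.sum(pos ** 2 + neg ** 2, axis=0) + 1e-12
    return num / den                            # shape: (range, frame)

def detect_sound(metric, k=3.0):
    """Median-absolute-deviation outlier rule, keeping positive deviations only
    (the threshold k is an assumed illustrative value)."""
    med = np.median(metric)
    mad = np.median(np.abs(metric - med)) + 1e-12
    return (metric - med) / (1.4826 * mad) > k
```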

As the sound detection algorithm runs over the range bins, it can detect multiple range bins containing sound. The radio signals in these range bins are then processed separately by RadioMic. This enables the detection of multiple sources and the reconstruction of each sound signal separately. As a byproduct, it also localizes the range bin where the sound occurs, and reduces interference.

3.3. Sound Recovery

Having extracted the time indices and the range information about active or passive vibration sources in the environment, RadioMic extracts raw sound signals. Using the signal model in (5), RadioMic recovers the acoustic signal by first filtering out the interference and background and approximating the remaining signal with a line fit to further reduce noise.

We first apply an FIR high-pass filter, as the changes in the background usually have much lower frequencies than sound. The resulting signal, x_l(t), can be given as:

x_l(t) = a_l e^{−j 4π (d_0 + d_v(t)) / λ} − c + ñ(t),    (7)

where ñ(t) is the filtered noise term, and c is the center of the circle traced by the vibration. The signal component remains mostly unchanged, since the frequencies of interest for sound are well above the filter cutoff, and this operation moves the arc of the vibration circle to the origin of the IQ plane. Furthermore, this operation reduces any drifting in the IQ plane caused by the hardware.

As explained in the previous section, the curvature of the arc is negligible, since the displacement is on the order of micrometers while the wavelength of a mmWave device is on the order of millimeters. By projecting the arc x_l(t) onto the tangent line at its center, we can approximate x_l(t) as

x_l(t) ≈ a_l e^{jφ_0} (1 − j 4π d_v(t)/λ) − c + ñ(t),    (8)

where φ_0 = −4π d_0/λ is the static phase term, and c ≈ a_l e^{jφ_0} is the arc center removed by the high-pass filter.

Using the fact that c ≈ a_l e^{jφ_0}, x_l(t) can be further simplified as

x_l(t) ≈ α d_v(t) e^{jθ} + ñ(t),    (9)

where α = 4π a_l/λ and θ = φ_0 − π/2,

which suggests that x_l(t) already contains the real-valued sound signal d_v(t), scaled and projected onto a line in the complex plane, with additive noise. Using the fact that projection onto an arbitrary line does not change the noise variance, we can estimate d_v(t) under the minimum mean squared error (MMSE) criterion. The estimate is given as:

d̂_v(t) = (1/α) ℜ{ e^{−jθ} x_l(t) },    (10)

where ℜ{·} is the real-part operator. The angle θ can be found as the orientation of the line that best fits the IQ samples, i.e., the projection direction that maximizes the signal variance (and hence the SNR):

θ = arg max_φ Var( ℜ{ e^{−jφ} x_l(t) } ).    (11)

In order to mitigate any arbitrary motion and reject noise, we first project the samples onto the optimum line, and then calculate the spectrogram. Afterwards, we take the element-wise maximum of the positive- and negative-frequency spectrograms and apply the inverse STFT to construct the real-valued sound signal. This is different from naïve approaches such as taking the maximum of the two spectrograms immediately (Rong et al., 2019), or extracting one side only (Xu et al., 2019), as we first reduce the noise variance by the projection. This also ensures a valid sound spectrogram, as the inverse spectrogram needs to result in a real-valued sound signal.
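The core of this recovery step can be sketched in a few lines; the filter order and cutoff below are assumptions for illustration, and the principal-axis formula is one standard way to realize the SNR-maximizing line fit of Eq. (11):

```python
import numpy as np
from scipy.signal import firwin, filtfilt

def recover_raw_sound(x_l, fs=6250, cutoff=80.0, numtaps=129):
    """Raw sound recovery sketch for one range bin (Section 3.3). The FIR order and
    cutoff are illustrative assumptions; the principal-axis angle below is one
    standard way to realize the SNR-maximizing line fit of Eq. (11)."""
    hp = firwin(numtaps, cutoff, pass_zero=False, fs=fs)          # FIR high-pass
    x = filtfilt(hp, [1.0], x_l.real) + 1j * filtfilt(hp, [1.0], x_l.imag)

    # Orientation of the least-squares line through the IQ samples.
    i, q = x.real, x.imag
    theta = 0.5 * np.arctan2(2 * np.mean(i * q), np.mean(i ** 2) - np.mean(q ** 2))

    # Project onto that line: the de-rotated real part is the raw sound estimate.
    return np.real(x * np.exp(-1j * theta))

# Example with a synthetic 300 Hz vibration arc plus an IQ offset (assumed values).
fs = 6250
t = np.arange(fs) / fs
lam = 3e8 / 77e9
x_l = np.exp(-1j * 4 * np.pi * (1e-6 * np.sin(2 * np.pi * 300 * t)) / lam) + 0.5 + 0.3j
sound = recover_raw_sound(x_l + 0.001 * np.random.randn(fs))
```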

Diversity Combining: The raw sound output does not directly yield very high quality sound, due to various noise sources and the extremely small displacement on the object surface. To mitigate these issues, we utilize two physical redundancies in the system: 1) receiver diversity offered by the multiple antennas found in many radar arrays, and 2) multipath diversity. Multiple receivers sense the same vibration signal from slightly different locations, and combining these signals can yield a higher signal-to-noise ratio (SNR). In addition, a multipath environment makes a similar sound signal observable in multiple range bins, which can be used to further reduce the noise level. To exploit these diversities, we employ a selection combining scheme. Mainly, silent moments are used to obtain a noise estimate for each multipath component and receiver. When sound is detected, the signal with the highest SNR is selected. We employ this scheme because we sometimes observe significant differences between the received signal quality on different antennas, due to hardware noise (e.g., some antennas occasionally report noisier outputs) and the specularity effect. We only consider nearby bins when combining multipath components, in order to still recover sound from multiple separate sources.
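A toy sketch of this selection-combining step (the SNR definition from silent frames mirrors the one used in the evaluation; the function layout and names are ours):

```python
import numpy as np

def selection_combine(candidates, silence_mask):
    """Selection-combining sketch for Section 3.3: among raw sound signals recovered
    from different receive antennas and nearby range bins, keep the one with the
    highest SNR, with the noise power estimated from silent frames."""
    best, best_snr = None, -np.inf
    for x in candidates:                                    # each x: 1-D recovered sound
        noise_pw = np.mean(x[silence_mask] ** 2) + 1e-12
        snr_db = 10 * np.log10(np.mean(x[~silence_mask] ** 2) / noise_pw + 1e-12)
        if snr_db > best_snr:
            best, best_snr = x, snr_db
    return best, best_snr
```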

3.4. Sound Enhancement via Deep Learning

Even though the aforementioned processes reduce multiple noise sources, and optimally create a sound signal from radio signals, the recovered sound signals face two issues:


  • Narrowband: RadioMic so far cannot recover frequencies beyond 2 kHz, noted as high-frequency deficiency. This creates a significant problem, as the articulation index, a measure of the amount of intelligible sound, is less than 50% for 2kHz band-limited speech, as reported by (French and Steinberg, 1947).

  • Noisy: The recovered sound is a noisy copy of the original sound. As the vibration on object surfaces is on the order of micrometers, phase noise and other additive noises still exist.

High-Frequency Deficiency: As our results show, frequencies beyond 2 kHz are fully attenuated in the recovered sound, as the channel h(t) in (1) removes useful information in those bands. To explain why this happens, we return to our signal model. Namely, the output signal is a noisy copy of d(t), which can be written as:

d̂(t) = h(t) * s(t) + n(t),    (12)

from (1). As can be seen, what we observe is the output of the air-pressure-to-object-vibration channel (or the mechanical response of a speaker), as also observed by (Davis et al., 2014; Roy and Roy Choudhury, 2016; Sami et al., 2020). In order to recover s(t) fully, one needs to invert the effect of h(t). However, classical signal processing techniques like spectral subtraction or equalization cannot recover the entire band, as the information at the high frequencies has been lost, and these classical methods cannot exploit the temporal redundancies in a speech or sound signal.

Neural Bandwidth Expansion and Denoising: To overcome these issues, namely, to reduce the remaining noise and reconstruct the high-frequency part, we resort to deep learning. We build an autoencoder-based neural network model, named the radio acoustics network (RANet), which is modified from the classical UNet (Ronneberger et al., 2015). Similar models have been proposed for bandwidth expansion of telephone-quality sound (8 kHz) to high-fidelity sound (16 kHz) (Lagrange and Gontier, 2020; Li and Lee, 2015; Abel and Fingscheidt, 2018) and for audio denoising (Park and Lee, 2016; Xu et al., 2014). Although the formulation of the problem is similar, the theoretical limitations are stricter in RadioMic, as there is more severe noise and a stronger band-limit constraint on the recovered speech (expanding 2 kHz to 4 kHz), in addition to the need to solve both problems jointly.

Fig. 5 portrays the structure of RANet, with the entire processing flow of data augmentation, training, and evaluation illustrated in Fig. 6. We highlight the major design ideas below and leave more implementation details to §4.

1) RANet Structure: RANet consists of downsampling, residual, and upsampling blocks, which are connected sequentially, along with residual and skip connections. On a high level, the encoding layers (downsampling blocks) estimate a latent representation of the input spectrograms (treated similarly to images), and the decoding layers (upsampling blocks) are expected to reconstruct high-fidelity sound. The residual layers in the middle are added to capture more temporal and spatial dependencies by increasing the receptive field of the convolutional layers, and to improve model capacity. RANet takes input spectrograms as grayscale images and uses strided convolutions, with the number of kernels doubling in each downsampling block. In the upsampling blocks, the number of kernels is progressively halved, to ensure symmetry between the encoder and decoder. In the residual blocks, the number of kernels does not change, and the output of each double convolutional layer is combined with its input. We use residual and skip connections to build RANet, as they are known to make the training procedure easier (He et al., 2016), especially for deep neural networks.
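A compact PyTorch sketch of such an encoder-residual-decoder network with skip connections is given below; the depth, channel widths, and kernel sizes are illustrative assumptions, not RANet's exact configuration:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block: two 3x3 convolutions whose output is added to the input."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
    def forward(self, x):
        return torch.relu(x + self.body(x))

class RANetSketch(nn.Module):
    """Encoder-residual-decoder with skip connections, in the spirit of RANet.
    Depth, channel widths, and kernel sizes are illustrative assumptions."""
    def __init__(self, base=16, depth=3, n_res=2):
        super().__init__()
        enc_ch = [base * 2 ** d for d in range(depth)]          # e.g. [16, 32, 64]
        self.downs = nn.ModuleList()
        in_ch = 1
        for out_ch in enc_ch:                                   # strided-conv encoder
            self.downs.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1), nn.ReLU()))
            in_ch = out_ch
        self.res = nn.Sequential(*[ResBlock(in_ch) for _ in range(n_res)])
        self.ups = nn.ModuleList()
        dec_ch = enc_ch[::-1][1:] + [1]                         # e.g. [32, 16, 1]
        for skip_ch, out_ch in zip(enc_ch[::-1], dec_ch):       # transposed-conv decoder
            self.ups.append(nn.Sequential(
                nn.ConvTranspose2d(in_ch + skip_ch, out_ch, 4, stride=2, padding=1),
                nn.ReLU()))
            in_ch = out_ch

    def forward(self, x):                        # x: (batch, 1, H, W) magnitude spectrogram
        skips = []
        for down in self.downs:
            x = down(x)
            skips.append(x)
        x = self.res(x)
        for up, skip in zip(self.ups, reversed(skips)):
            x = up(torch.cat([x, skip], dim=1))  # skip connection from the encoder
        return x                                 # denoised, bandwidth-expanded magnitude

# Example: a batch of one 128x128 spectrogram in, same size out.
out = RANetSketch()(torch.randn(1, 1, 128, 128))
```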

Figure 5. RANet Structure
Figure 6. Working process of RANet in RadioMic

2) Training RANet without Massive RF Data: As we have proposed a relatively deep neural network for an extremely challenging inverse problem, a successful training process requires extensive data. However, collecting massive RF data is costly, which is a practical limitation of many learning-based sensing systems. On the other hand, massive audio datasets have become available online. In RadioMic, instead of going through an extensive data collection procedure as in (Xu et al., 2019), we exploit the proposed radio acoustics model and translate massive open-source audio datasets into synthetically simulated radio sound for training. Two parameters are needed to imitate radio sound from an audio dataset, i.e., the channel and the noise in (12). We use multiple estimates of these parameters to cover different scenarios, and artificially create radar sound at different noise levels and for various frequency responses, allowing us to train RANet efficiently with very little data collection overhead. Furthermore, this procedure ensures a rigorous system evaluation, as only the evaluation dataset consists of real radio speech.
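The data translation step can be sketched as follows, assuming a measured channel impulse-response estimate `h_obj` and a bank of radar noise recordings `noise_bank` (both names and the randomization ranges are ours, for illustration):

```python
import numpy as np
from scipy.signal import lfilter

def synthesize_radio_sound(audio, h_obj, noise_bank, rng=None):
    """Translate a clean audio waveform into an imitation of radio-captured sound
    (a sketch of the data-synthesis idea in Section 3.4). h_obj is assumed to be a
    measured air-to-object impulse-response estimate and noise_bank a list of noise
    traces recorded from empty-room range bins."""
    rng = rng or np.random.default_rng()
    x = lfilter(h_obj, [1.0], audio)                    # apply the channel h of Eq. (12)
    x *= rng.uniform(0.5, 2.0)                          # randomize the channel gain
    noise = noise_bank[rng.integers(len(noise_bank))]   # pick one recorded noise trace
    noise = np.resize(noise, x.shape)                   # loop/crop to the audio length
    return x + rng.uniform(0.2, 1.0) * noise            # randomize the noise level
```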

3) Generating Sound from RANet: Using the trained model, RANet takes the raw radar sound as input and extracts magnitude spectrograms for denoising and bandwidth expansion. The output magnitude spectrograms of RANet are combined with the phase of the input spectrograms, as is commonly done in similar work (Xu et al., 2014; Boll, 1979), and the time-domain waveform of the speech is constructed. In order to reduce the effect of padding at the edges and to capture long-time dependencies, only the center parts of the estimated spectrograms are used, and the inputs are taken as overlapping frames with appropriate padding on both sides, similar to (Park and Lee, 2016).
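A minimal sketch of the waveform reconstruction, reusing the input phase with the network's magnitude output (the STFT parameters are assumed to match the configuration in §4):

```python
import numpy as np
from scipy.signal import istft

def spectrogram_to_waveform(mag_out, phase_in, fs=6250, nperseg=256, noverlap=192):
    """Combine the magnitude estimated by the network with the phase of the raw
    radar spectrogram and invert the STFT (a common practice in speech enhancement)."""
    Zxx = mag_out * np.exp(1j * phase_in)        # network magnitude + input phase
    _, waveform = istft(Zxx, fs=fs, window="hann", nperseg=nperseg, noverlap=noverlap)
    return waveform
```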

4. System Implementation

Hardware Setup: While RadioMic is not limited to a specific type of radar, we mainly use an FMCW mmWave radar for our implementation. We use a COTS mmWave radar, TI IWR1443, with a real-time data capture board, DCA1000EVM, both produced by Texas Instruments. It works at 77 GHz with a bandwidth of 3.52 GHz. The evaluation kit costs around $500, while the chip alone can be purchased for $15.

Our radar device has 3 transmitter (Tx) and 4 receiver (Rx) antennas. We set the sampling rate to 6.25 kHz, which enables sensing up to 3.125 kHz, close to the human sound range, while avoiding too high a duty cycle on the radar. We use two Tx antennas, as the device does not allow utilizing all three in this configuration. This enables 8 virtual receivers, due to the placement of the antennas. We do not use customized multi-chirp signals as in (Jiang et al., 2020a), since we found they do not provide much benefit while possibly prohibiting simultaneous sensing for other applications.

Data Processing: Using the software provided by Texas Instruments and applying the Range-FFT, we extract the CIR samples at the chirp rate. We extract range-Doppler spectrograms using frame lengths of 256 samples (about 41 ms) with overlap, and a periodic Hanning window. This choice of window function ensures perfect reconstruction and has good sidelobe suppression properties, which reduces the effect of DC sidelobes on the symmetry of the range-Doppler spectrograms.

Figure 7. ROC curve of sound detection
Figure 8. Detection coverage of RadioMic
Figure 9. Detection with different daily materials
Figure 10. Detection at different sound levels

RANet Implementation: To create synthetic data for training RANet, we play a frequency sweep at various locations and extract the radio channel response. To account for fluctuations, we apply a piecewise linear fit to the measured response and add randomness. To capture the noise characteristics, we collect data in an empty room without any movement. The noise from each range bin is extracted and then added to the synthetic data, with varying scaling levels to account for different locations. As the input dataset, we use the LibriSpeech (Panayotov et al., 2015) corpus.

Inputs to the neural network are taken as frames of 128 spectrogram samples, where only the middle 32 samples are used for reconstruction and error calculation. We implement RANet in PyTorch, and use an NVIDIA 2080S GPU with the CUDA toolkit for training. We use the L2 loss between the real and estimated spectrograms. Prior to calculating the error, we apply a mapping to each spectrogram. We randomly select 2.56 million spectrograms for creating synthetic input/output data, and train the network for 10 epochs with the RMSprop algorithm.

5. Experiments and Evaluation

We benchmark different modules of RadioMic in multiple places, including an office space, a home, and an acoustically insulated chamber, and compare RadioMic with state-of-the-art approaches, followed by two case studies in §6.

Evaluation Metrics: It is non-trivial to evaluate the quality of sound. There are many different metrics in the speech processing literature, and we borrow some of them. Specifically, we adopt the perceptual evaluation of speech quality (PESQ) (Rix et al., 2002), the log-likelihood ratio (LLR) (Quackenbush et al., 1988), and the short-time objective intelligibility (STOI) measure (Taal et al., 2010). PESQ extracts a quantitative metric that can be mapped to a mean opinion score without the need for a user study, and reports values from 1 (worst) to 5 (best). LLR measures the distance between two signals, and reports values from 0 (best) to 2 (worst). STOI is another measure of the intelligibility of the sound, and reports values in [0, 1], with 1 being the best. We avoid reporting the SNR between the reference and the reconstructed sound, as it does not correlate well with human perception (Quackenbush et al., 1988). Rather, we report the SNR using the noise energy estimated from silent moments, as it is used by RadioMic during raw sound extraction and gives a relative idea of the amount of noise suppression. Furthermore, we also visualize spectrograms and provide sample files to better illustrate the outputs of RadioMic (sample sound files recovered by RadioMic can be found at https://bit.ly/radiomicsamples).

5.1. Sound Detection Performance

Overall Detection: Our data collection for detection analysis includes random motions, such as standing up and sitting repeatedly, walking, running, and rotating in place, as well as static reflectors in the environment, and human bodies in front of the radar. On the other hand, we also collect data using multiple sound and music files with active and passive sources. More importantly, we have also collected motion and sound data at the same time to see if RadioMic can reject these interferences successfully.

To illustrate the gains of the proposed sound metric, we implement and compare with existing methods: 1) HHI (UWHear (Wang et al., 2020)): UWHear uses HHI, which requires some training to select an appropriate threshold. 2) CFAR (mmVib (Jiang et al., 2020a)): The method used in (Jiang et al., 2020a) requires knowing the number of vibration sources a priori, and extracts the highest peaks. To imitate this approach and provide a reasonable comparison, we apply the CFAR detection rule at various threshold levels, and remove the detections around DC for a fairer comparison. Additionally, we also compare hard thresholding (RadioMic-T) with the outlier-based detector (RadioMic-O). Note that the detector in (Xu et al., 2019) requires extensive training, and is not applied here. We provide the receiver operating characteristic (ROC) curves for all methods in Fig. 7. As can be seen, while RadioMic-T is slightly worse than RadioMic-O, the other methods fail to robustly distinguish random motion from vibration, which prevents their practical application, as there would be arbitrary motion in the environment.

Figure 11. Verification of line projection: (a) LLR; (b) STOI.

Operational Range: We investigate the detection performance of RadioMic at different distances and azimuth angles (with respect to the radar) using an active source (a pair of speakers) and a passive source (a small sheet of aluminum foil). We use 5 different sound files for each location, three of which are music files and two human speech. As shown in Fig. 8, RadioMic can robustly detect sound up to 4 m in the active-source case with 91% mean accuracy, and at shorter ranges in the passive-source case with 70% accuracy, both within a field of view of 90°. The passive-source performance is expected to be lower, as the vibration is much weaker.

Figure 12. Comparison with UWHear (Wang et al., 2020) and mmVib (Jiang et al., 2020a) in terms of (a) SNR, (b) PESQ, (c) LLR, and (d) STOI. UW(I) and UW(Q) denote the in-phase and quadrature signals extracted by UWHear (Wang et al., 2020), respectively. RANet and diversity combining in RadioMic are not applied for this comparison. Horizontal lines on the violin plots represent the 25th, 50th, and 75th percentiles. A box plot is used for (b) due to outliers.
Figure 13. Overall performance of RadioMic with gains from multiple components, in terms of (a) SNR, (b) PESQ, (c) LLR, and (d) STOI. Rx: receiver combining; Rx+M: receiver and multipath combining; Rx+M+DL: the end results.

Materials of Passive Sources: To further evaluate the detection performance of RadioMic with passive materials, we conduct experiments with additional daily materials, such as picture frames, paper bags, and bags of chips. As shown in Fig. 9, many different materials enable sound detection using RadioMic. Even when the detection rate is lower, some sound is still detected in particular instances, as the evaluation is done on a per-frame basis. The performance could be improved by temporal smoothing if the application scenario requires a lower miss-detection ratio.

Sound Levels: We also investigate the effect of sound amplitude on the detection ratio from a passive object. We reduce the amplitude of the test files from 85 dB to 60 dB, with a 5 dB step, and measure the detection rates in Fig. 10. As seen, RadioMic outperforms existing approaches, especially when the amplitude is lower. Detecting sound at even lower levels is a common challenge for non-microphone sensing due to the extremely weak vibrations (Davis et al., 2014; Wang et al., 2020; Xu et al., 2019).

5.2. Sound Reconstruction Performance

Raw Reconstruction: We first experimentally validate the mathematical model provided in §2.2. Our main assertion is that the sound signal can be approximated by a linear projection in the IQ plane, where the optimum angle is obtained by an SNR-maximizing projection. We test multiple projection angles deviated from RadioMic's optimum angle, and report LLR and STOI for these angles. Our results in Fig. 11 indicate the best results when projecting at the optimum angle (0° deviation) and the worst at 90° deviation, with a monotonic decrease in between, which is consistent with the line model. This further indicates that an optimal scheme can achieve higher performance than an arbitrary axis, as used in (Wang et al., 2020).

Comparative Study: To compare our method with existing works, we employ only the raw reconstruction without RANet. The results are portrayed with various metrics in Fig. 12. We provide results for the in-phase (I) and quadrature (Q) parts of UWHear (Wang et al., 2020) separately, as it does not provide a method to combine or select between the two signals. For this comparison, we use an active speaker 1 m away, at various azimuth angles. Overall, RadioMic outperforms both methods with just the raw sound reconstruction, and it further improves the sound quality with the additional processing blocks (Fig. 13). The average SNR of mmVib (Jiang et al., 2020a) is slightly higher. This is because the SNR is calculated without a reference signal, and drifts in the IQ plane boost the SNR value of mmVib, as the later samples appear to have a nonzero mean. However, the more important metrics, the intelligibility score and LLR, are much worse for mmVib, as it is designed to monitor vibration over short frames rather than over a long time.

Overall Performance: With the gains from diversity combining and deep learning, we provide the overall performance of RadioMic in Fig. 13. We investigate the effect of each component on a dataset using a passive source. Overall, each of the additional diversity combining schemes improves the performance with respect to all metrics. At the same time, RANet reduces the total noise level significantly (Fig. 13a) and increases PESQ (Fig. 13b). However, as shown in Fig. 13c, RANet yields a worse LLR value, which is due to the channel inversion operation applied on the radar signal. While an optimal channel recovery operation would be desirable, RANet is trained on multiple channel responses and only approximates the inverse of the actual channel. Consequently, the channel inversion applied by RANet is expected to be sub-optimal. Lastly, the STOI metric (Fig. 13d) shows a higher variation, which is due to the high noise levels in some of the input audio samples. In the case of large noise, we have observed that RANet learns to combat the effect of the noise instead of inverting the channel, and outputs mostly empty signals, which can also be observed from the distribution around 0 dB in Fig. 13a. When there is enough signal content, RANet improves the intelligibility further.

Figure 14. Recovered sound SNR (a) at different locations and (b) with different sound amplitudes.

Distances and Source Amplitudes: To investigate sound recovery at varying locations and angles, we provide two heatmaps in Fig. 14a showing the raw SNR output for active and passive sources. Similar to the detection coverage in Fig. 8, nearby locations have higher SNR, allowing better sound recovery, and the dependency on the angle is rather weak. Increasing distance strongly reduces the vibration SNR (e.g., from 20 dB at 1 m to 14 dB at 2 m for an active source), possibly due to the large beamwidth of our radar device and high propagation loss.

We then test both active and passive sources at various sound levels. In Fig. 14b, we depict the SNR with respect to the sound amplitude, where the calibration is done by measuring the sound amplitude 0.5 m away from the speakers. Generally, the SNR decreases with decreasing sound levels. At similar sound levels, a passive source (aluminum foil) can lose up to 10 dB compared to an active source. In addition, RadioMic retains a better SNR with decreasing sound levels than with increasing distance (Fig. 14a), which indicates that the limiting factor at large distances is not the propagation loss but the reflection loss, due to the relatively small surface areas. Hence, with more directional beams (e.g., transmit beamforming or directional antennas), the effective range of RadioMic could be improved, as even low sound amplitudes look promising for some recovery.

Active vs. Passive Comparison: In order to show potential differences between the nature of active and passive sources, and to present the results in a more perceivable way, we provide six spectrograms in Fig. 15, extracted using two different synthesized audio files. In this setting, the passive source (aluminum foil) is placed 0.5 m away and the active source is located at 1 m. As shown, the active source (speaker diaphragm) has more content in the lower frequency bands, whereas the passive sound results in more high-frequency content, due to the aggressive channel compensation operation. More detailed comparisons are provided in Table 1.

Figure 15. Spectrogram comparison of RadioMic outputs and a microphone. Two rows correspond to the synthesized speech of two different sentences. Passive source is a small aluminum foil, whereas active is a loudspeaker.

LOS vs. NLOS Comparison: We further validate RadioMic in NLOS operations. To that end, in addition to our office area, we conduct experiments in an insulated chamber (Fig. 16c), which has a double glass layer on its side. This scenario is representative of expanding the range of an IoT system from a quiet environment to outside rooms. In this scenario, we test both the passive source (aluminum foil) and the active source (speaker). As additional layers attenuate the reflected RF signals further, we test the NLOS setup at slightly closer distances, with the active speaker at 80 cm and the passive source at 35 cm. Detailed results are in Table 1, with visual results in Fig. 17. As seen, the insulation layers do not affect RadioMic much, and the LOS and NLOS settings perform quite similarly. Some metrics even improve in the NLOS case due to the shorter distances.

Figure 16. Example setups. (a) Passive source; (b) Multiple speakers; (c) Insulated chamber; (d) Sensing from throat.


Setup            SNR (dB)   PESQ   LLR    STOI
LOS, Active      24.7       0.84   1.61   0.55
LOS, Passive     10.4       1.20   1.57   0.61
NLOS, Active     29.4       1.12   1.52   0.58
NLOS, Passive     8.8       1.36   1.57   0.64

Table 1. Active vs. Passive Source Comparison
Figure 17. Through-wall spectrograms. Left: microphone reference; Right: reconstructed results. Top row also includes a music file.


Figure 18. Recovery from throat. RadioMic spectrograms of (a) humming a song at around 60 dB and (b) speaking; (c) microphone spectrogram for case (b).


Figure 19. Multiple source separation. Spectrograms of RadioMic for (a) source #1 and (b) source #2; (c) microphone spectrogram with the mixed sound.

Sound Recovery from Human Throat: Lastly, we show how RadioMic can also capture vocal fold vibration from the human throat, as another active source. We start with humming in front of the radar at a quiet 60 dB level, and show the results in Fig. 18a. After this observation, we collect multiple recordings from a user, with the setup shown in Fig. 16d. In Figs. 18b and 18c, we provide the RadioMic and microphone spectrograms. Although RadioMic is not trained with a frequency response from the human throat, it can still capture some useful signal content. On the other hand, we noticed that the intelligibility of such speech is rather low compared to other sources, and RANet sometimes does not estimate the actual speech. Prior work (Xu et al., 2019) focuses on extracting sound from the human throat and, with extensive RF data collection, has shown the feasibility of better sound recovery from the throat. We believe the performance of RadioMic on the human throat could be improved as well with massive RF data from human throats. We leave these improvements for future work and switch our focus to another application, sound liveness detection of human subjects, in the next section.

6. Case Studies

In this section, we first show RadioMic for multiple source separation, and then extend it to classify sound sources.

6.1. Multiple Source Separation

Separation of multiple sound sources would enable multi-person sensing, or improved robustness against interfering noise sources. In order to illustrate feasibility, we play two different speech files simultaneously from the left and right channels of a stereo speaker pair. As in Fig. 16b, we place the right speaker at 0.75 m and the left speaker at 1.25 m. We provide the results in Fig. 19, which include two spectrograms extracted by RadioMic, along with a microphone spectrogram. As seen, the microphone spectrogram contains a mixture of the multiple sources and is prone to significant interference. In contrast, the RadioMic signals show much higher fidelity, and the two persons' speech can be separated from each other well. Previous work UWHear (Wang et al., 2020) specifically focuses on the problem of sound separation and demonstrates good performance using a UWB radar. RadioMic excels in achieving more features in one system, in addition to the source separation capability. And we believe there is great potential to pursue higher fidelity by using RadioMic in tandem with a single microphone, which we leave for future investigation.

Figure 20. (a) Speaker spectrogram; (b) throat spectrogram; (c) power delay profile extracted from (a, b); (d) confusion matrix for classification.

6.2. Sound Liveliness Detection

As another application, we investigate the feasibility of sound source classification. As RadioMic senses at the source of the sound, it simultaneously captures additional physical characteristics of the sound generation mechanism. Starting with this observation, we investigate the question: Is it possible to differentiate the source of a sound between a human and an inanimate source like a speaker? This is a critical application, as it is well known that today's microphones all suffer from inaudible attacks (Zhang et al., 2017) and replay attacks (Roy et al., 2017) due to hardware defects.

Our results show that RadioMic can enable sound liveness detection with unprecedented response times. In our experiment, we ask a user to recite five different sentences in two different languages, and we record the speech using a condenser microphone. Afterwards, we play the same sound through speakers at the same distance, at a similar sound level, and capture the RadioMic output.

We first provide the comparison of the two spectrograms in Fig. 20(a, b), and the average over time in Fig. 20(c). From the figures, we make three observations: 1) As illustrated by the reflection band in Fig. 20(c), the human throat shows a weaker amplitude around the DC component. This is because the reflection coefficients of speakers and of the human throat differ significantly, a phenomenon utilized for material sensing (Wu et al., 2020). 2) Due to minute body motions and the movement of the vocal tract, the reflection energy of the human throat varies more over time and has stronger sidelobes, which can be seen in the frequency band labeled as motion in Fig. 20(c). 3) Due to the skin layer between the vocal cords and the radar, the human throat applies stronger low-pass filtering on the vibration compared to speakers, as labeled as sound in Fig. 20(c), which relates to the frequencies of interest for sound.

To enable RadioMic for liveness detection, we implement a basic classifier based on these observations. We propose to use the ratio of the energy in the motion-affected band (35-60 Hz) over the entire radar spectrogram as an indicator of liveness. As shown in Fig. 20(d), RadioMic can classify the sources with high accuracy using only 40 ms of data, and the accuracy increases further with longer observation windows. We believe RadioMic promises a valuable application here, as it can sense the sound and classify the source at the same time, and we plan to investigate it thoroughly as a next step.
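A sketch of this band-energy-ratio feature (the 35-60 Hz band comes from the text above; the function layout and names are ours):

```python
import numpy as np

def liveness_score(S, doppler_freqs, motion_band=(35.0, 60.0)):
    """Liveness indicator sketched from Section 6.2: the ratio of spectrogram energy
    in the motion-affected Doppler band (35-60 Hz) to the total energy. A human
    throat, with its minute body motions, is expected to yield a larger ratio than
    a loudspeaker. S: complex spectrogram (freq x frames); doppler_freqs: bin freqs."""
    power = np.abs(S) ** 2
    band = (np.abs(doppler_freqs) >= motion_band[0]) & (np.abs(doppler_freqs) <= motion_band[1])
    return power[band].sum() / (power.sum() + 1e-12)
```

Thresholding this ratio on short frames, with a threshold chosen from a few labeled examples, gives a simple classifier in the spirit of the confusion matrix in Fig. 20(d).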

7. Related Work

Wireless sensing has been an emerging field (Kotaru et al., 2015; Wang et al., 2015; Jiang et al., 2020b; Xie and Xiong, 2020; Chen et al., 2019; Zheng et al., 2019; Wang et al., 2018; Liu and Wang, 2019; Ma et al., 2019), with many applications including lip motion recognition and sound sensing. For example, voice recognition (Khanna et al., 2019), pitch extraction (Chen et al., 2017), and vocal cords vibration detection (Hong et al., 2016) have been investigated using the Doppler radar concept. Contact based RF sensors (Eid and Wallace, 2009; Birkholz et al., 2018) have also been investigated, which require user cooperation. Likewise, WiHear (Wang et al., 2014) captures signatures of lip motions with WiFi, and matches different sounds, with a limited dictionary.

A pioneering work (Wei et al., 2015) uses a 2.4 GHz SDR to capture sound, while mmWave has recently been more widely employed. Some fundamentals relating sound and mmWave were established in (Li, 1996). Recently, WaveEar (Xu et al., 2019) uses a custom-built radar with 16 Tx and 16 Rx antennas to capture vocal folds vibration and reconstructs sound with a deep learning based approach. UWHear (Wang et al., 2020) focuses on separating multiple active sound sources with a UWB radar. mmVib (Jiang et al., 2020a) does not directly recover sound but measures machinery vibration with a mmWave radar. All of these approaches focus only on an active vibration source. Some introductory works study the feasibility of sound sensing with passive sources (Rong et al., 2019; Guerrero et al., 2020), but do not build a comprehensive system. Moreover, existing works mainly focus on lower-frequency signals and do not address the fundamental limitations at high frequencies.

Ultrasound-based: Ultrasound-based methods have been used to capture lip motion (Jennings and Ruck, 1995), recognize the speaker (Kalgaonkar and Raj, 2008), synthesize speech (Toth et al., 2010), or enhance sound quality (Lee, 2019). These approaches usually have very limited range and require a prior dictionary, as the motion cannot be related to sound directly. We note that RadioMic differs fundamentally from acoustic sensing (Zhang et al., 2017; Wang et al., 2016; Mao et al., 2018; Yun et al., 2015), which leverages sound for sensing rather than recovering the sound itself.

Light-based: VisualMic (Davis et al., 2014) recovers sound from objects in the environment (e.g., a bag of chips) using high-speed cameras. A similar phenomenon has been exploited using light bulbs (Nassi et al., 2020), lasers (Muscatell, 1984), lidars (Sami et al., 2020), and depth cameras (Galatas et al., 2012). Compared with RadioMic, these works usually need expensive specialized hardware.

IMU-based: Inertial sensors have also been used for sound reconstruction (Zhang et al., 2015; Roy and Roy Choudhury, 2016; Michalevsky et al., 2014). All these methods sense the sound at the destination, like contact microphones, and thus share the drawbacks of microphones, in addition to their limited bandwidth.

8. Conclusion

In this paper, we propose RadioMic, a mmWave radar based sound sensing system, which can reconstruct sound from sound sources and passive objects in the environment. Using the tiny vibrations that occur on the object surfaces due to ambient sound, RadioMic can detect and recover sound as well as identify sound sources using a novel radio acoustics model and neural network. Extensive experiments in various settings show that RadioMic outperforms existing approaches significantly and benefits many applications.

There is room for improvement and various future directions to explore. First, our system can capture sound up to 4 meters away, and higher frequencies start to disappear beyond 2 meters, mainly due to the relatively wide beamwidth of our device. Other works focusing on extracting throat vibration either use highly directional antennas with very narrow beamwidths (Chen et al., 2017), operate at very close distances (less than 40 cm) (Khanna et al., 2019), or employ many more antennas (e.g., 16x16) (Xu et al., 2019). We believe that more advanced hardware and sophisticated beamforming could underpin better performance of RadioMic.

Second, RANet mitigates the fundamental limitation of high-frequency deficiency, currently using a 1-second window and limited RF training data. Better quality could be achieved by exploiting long-term temporal dependencies with a more complex model, given more available RF data. With more RF data from the human throat, we also expect better performance for human speech sensing.

Third, we believe RadioMic and microphones are complementary. With suitable deep learning techniques, the side information from RadioMic could be used in tandem with a microphone to achieve better sound separation and noise mitigation than a microphone alone.

Lastly, given our initial results, it is promising to build a security system for sound liveness detection against side-channel attacks. Exploring RadioMic together with mmWave imaging and other sensing modalities (sol, 2021) is also an exciting direction.

References

  • dec (2020) 2020. Decawave DW1000. https://www.decawave.com/product/dw1000-radio-ic/
  • sol (2020) 2020. Soli Radar-Based Perception and Interaction in Pixel 4. https://ai.googleblog.com/2020/03/soli-radar-based-perception-and.html
  • ti1 (2020) 2020. Texas Instruments, IWR1443. https://www.ti.com/product/IWR1443
  • sol (2021) 2021. Contactless Sleep Sensing in Nest Hub with Soli. https://ai.googleblog.com/2021/03/contactless-sleep-sensing-in-nest-hub.html
  • net (2021) 2021. Netgear, Nighthawk X10 Smart WiFi Router. https://www.netgear.com/home/wifi/routers/ad7200-fastest-router/
  • Abel and Fingscheidt (2018) J. Abel and T. Fingscheidt. 2018. Artificial Speech Bandwidth Extension Using Deep Neural Networks for Wideband Spectral Envelope Estimation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26, 1 (2018), 71–83. https://doi.org/10.1109/TASLP.2017.2761236
  • Alegre et al. (2014) Federico Alegre, Artur Janicki, and Nicholas Evans. 2014. Re-assessing the threat of replay spoofing attacks against automatic speaker verification. In IEEE BIOSIG. IEEE, 1–6.
  • Birkholz et al. (2018) Peter Birkholz, Simon Stone, Klaus Wolf, and Dirk Plettemeier. 2018. Non-invasive silent phoneme recognition using microwave signals. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26, 12 (2018), 2404–2411.
  • Boll (1979) S. Boll. 1979. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing 27, 2 (1979), 113–120.
  • Chen et al. (2017) Fuming Chen, Sheng Li, Yang Zhang, and Jianqi Wang. 2017. Detection of the Vibration Signal from Human Vocal Folds Using a 94-GHz Millimeter-Wave Radar. Sensors 17, 3 (Mar 2017), 543. https://doi.org/10.3390/s17030543
  • Chen et al. (2019) Lili Chen, Jie Xiong, Xiaojiang Chen, Sunghoon Ivan Lee, Kai Chen, Dianhe Han, Dingyi Fang, Zhanyong Tang, and Zheng Wang. 2019. WideSee: Towards wide-area contactless wireless sensing. In ACM SenSys. 258–270.
  • Davis et al. (2014) Abe Davis, Michael Rubinstein, Neal Wadhwa, Gautham Mysore, Fredo Durand, and William T. Freeman. 2014. The Visual Microphone: Passive Recovery of Sound from Video. ACM Transactions on Graphics (Proc. SIGGRAPH) 33, 4 (2014), 79:1–79:10.
  • Eid and Wallace (2009) A. M. Eid and J. W. Wallace. 2009. Ultrawideband Speech Sensing. IEEE Antennas and Wireless Propagation Letters 8 (2009), 1414–1417.
  • Fahy (2000) Frank J. Fahy. 2000. Foundations of engineering acoustics. Elsevier.
  • Fant (1970) Gunnar Fant. 1970. Acoustic theory of speech production. Number 2. Walter de Gruyter.
  • French and Steinberg (1947) Norman R. French and John C Steinberg. 1947. Factors governing the intelligibility of speech sounds. The Journal of the Acoustical Society of America 19, 1 (1947), 90–119.
  • Galatas et al. (2012) G. Galatas, G. Potamianos, and F. Makedon. 2012. Audio-visual speech recognition incorporating facial depth information captured by the Kinect. In EURASIP EUSIPCO. 2714–2717.
  • Guerrero et al. (2020) E. Guerrero, J. Brugués, J. Verdú, and P. d. Paco. 2020. Microwave Microphone Using a General Purpose 24-GHz FMCW Radar. IEEE Sensors Letters 4, 6 (2020), 1–4.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In IEEE CVPR. 770–778. https://doi.org/10.1109/CVPR.2016.90
  • Hong et al. (2016) Hong Hong, Heng Zhao, Zhengyu Peng, Hui Li, Chen Gu, Changzhi Li, and Xiaohua Zhu. 2016. Time-Varying Vocal Folds Vibration Detection Using a 24 GHz Portable Auditory Radar. Sensors 16, 8 (Aug 2016), 1181. https://doi.org/10.3390/s16081181
  • Jennings and Ruck (1995) D. L. Jennings and D. W. Ruck. 1995. Enhancing automatic speech recognition with an ultrasonic lip motion detector. In IEEE ICASSP, Vol. 1. 868–871.
  • Jiang et al. (2020a) Chengkun Jiang, Junchen Guo, Yuan He, Meng Jin, Shuai Li, and Yunhao Liu. 2020a. mmVib: Micrometer-Level Vibration Measurement with Mmwave Radar. In ACM MobiCom. Article 45, 13 pages. https://doi.org/10.1145/3372224.3419202
  • Jiang et al. (2020b) Wenjun Jiang, Hongfei Xue, Chenglin Miao, Shiyang Wang, Sen Lin, Chong Tian, Srinivasan Murali, Haochen Hu, Zhi Sun, and Lu Su. 2020b. Towards 3D human pose construction using wifi. In ACM MobiCom. 1–14.
  • Kalgaonkar and Raj (2008) K. Kalgaonkar and B. Raj. 2008. Ultrasonic Doppler sensor for speaker recognition. In IEEE ICASSP. 4865–4868.
  • Khanna et al. (2019) Rohan Khanna, Daegun Oh, and Youngwook Kim. 2019. Through-wall remote human voice recognition using Doppler radar with transfer learning. IEEE Sensors Journal 19, 12 (2019), 4571–4576.
  • Kotaru et al. (2015) Manikanta Kotaru, Kiran Joshi, Dinesh Bharadia, and Sachin Katti. 2015. Spotfi: Decimeter level localization using wifi. In ACM SigComm. 269–282.
  • Lagrange and Gontier (2020) Mathieu Lagrange and Félix Gontier. 2020. Bandwidth extension of musical audio signals with no side information using dilated convolutional neural networks. In IEEE ICASSP. Barcelona, Spain.
  • Lee (2019) Ki-Seung Lee. 2019. Speech enhancement using ultrasonic doppler sonar. Speech Communication 110 (2019), 21 – 32. https://doi.org/10.1016/j.specom.2019.03.008
  • Li and Lee (2015) K. Li and C. Lee. 2015. A deep neural network approach to speech bandwidth expansion. In IEEE ICASSP. 4395–4399. https://doi.org/10.1109/ICASSP.2015.7178801
  • Li (1996) Zong-Wen Li. 1996. Millimeter wave radar for detecting the speech signal applications. International Journal of Infrared and Millimeter Waves 17, 12 (1996), 2175–2183.
  • Liu and Wang (2019) K. J. Ray Liu and Beibei Wang. 2019. Wireless AI: Wireless Sensing, Positioning, IoT, and Communications. Cambridge University Press.
  • Ma et al. (2019) Yongsen Ma, Gang Zhou, and Shuangquan Wang. 2019. WiFi sensing with channel state information: A survey. ACM Computing Surveys (CSUR) 52, 3 (2019), 1–36.
  • Mao et al. (2018) Wenguang Mao, Mei Wang, and Lili Qiu. 2018. Aim: Acoustic imaging on a mobile. In ACM MobiSys. 468–481.
  • Michalevsky et al. (2014) Yan Michalevsky, Dan Boneh, and Gabi Nakibly. 2014. Gyrophone: Recognizing Speech from Gyroscope Signals. In USENIX Security. San Diego, CA, 1053–1067.
  • Muscatell (1984) Ralph P. Muscatell. 1984. Laser microphone. US Patent 4,479,265.
  • Nassi et al. (2020) Ben Nassi, Yaron Pirutin, Adi Shamir, Yuval Elovici, and Boris Zadov. 2020. Lamphone: Real-Time Passive Sound Recovery from Light Bulb Vibrations. Cryptology ePrint Archive, Report 2020/708.
  • Palacios et al. (2018) Joan Palacios, Daniel Steinmetzer, Adrian Loch, Matthias Hollick, and Joerg Widmer. 2018. Adaptive codebook optimization for beam training on off-the-shelf IEEE 802.11ad devices. In ACM MobiCom. 241–255.
  • Panayotov et al. (2015) V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. In IEEE ICASSP. 5206–5210. https://doi.org/10.1109/ICASSP.2015.7178964
  • Park and Lee (2016) Se Rim Park and Jinwon Lee. 2016. A Fully Convolutional Neural Network for Speech Enhancement. CoRR abs/1609.07132 (2016). arXiv:1609.07132 http://arxiv.org/abs/1609.07132
  • Quackenbush et al. (1988) Schuyler R. Quackenbush, T.P. Barnwell, and M.A. Clements. 1988. Objective Measures of Speech Quality. Prentice Hall.
  • Rix et al. (2002) Antony W Rix, Michael P Hollier, Andries P Hekstra, and John G Beerends. 2002. Perceptual Evaluation of Speech Quality (PESQ) The New ITU Standard for End-to-End Speech Quality Assessment Part I–Time-Delay Compensation. Journal of the Audio Engineering Society 50, 10 (2002), 755–764.
  • Rong et al. (2019) Y. Rong, S. Srinivas, A. Venkataramani, and D. W. Bliss. 2019. UWB Radar Vibrometry: An RF Microphone. In IEEE ACSSC. 1066–1070.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. In MICCAI. Springer International Publishing, 234–241.
  • Roy et al. (2017) Nirupam Roy, Haitham Hassanieh, and Romit Roy Choudhury. 2017. Backdoor: Making microphones hear inaudible sounds. In ACM MobiSys. 2–14. https://doi.org/10.1145/3081333.3081366
  • Roy and Roy Choudhury (2016) Nirupam Roy and Romit Roy Choudhury. 2016. Listening through a Vibration Motor. In ACM MobiSys. 57–69. https://doi.org/10.1145/2906388.2906415
  • Sami et al. (2020) Sriram Sami, Yimin Dai, Sean Rui Xiang Tan, Nirupam Roy, and Jun Han. 2020. Spying with Your Robot Vacuum Cleaner: Eavesdropping via Lidar Sensors. In ACM SenSys. 354–367.
  • Stove (1992) Andrew G Stove. 1992. Linear FMCW radar techniques. In IEE Proceedings F (Radar and Signal Processing), Vol. 139. IET, 343–350.
  • Taal et al. (2010) C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen. 2010. A short-time objective intelligibility measure for time-frequency weighted noisy speech. In IEEE ICASSP. 4214–4217. https://doi.org/10.1109/ICASSP.2010.5495701
  • Toth et al. (2010) Arthur R Toth, Kaustubh Kalgaonkar, Bhiksha Raj, and Tony Ezzat. 2010. Synthesizing speech from Doppler signals. In IEEE ICASSP. 4638–4641.
  • Wang et al. (2018) Beibei Wang, Qinyi Xu, Chen Chen, Feng Zhang, and K. J. Ray Liu. 2018. The promise of radio analytics: A future paradigm of wireless positioning, tracking, and sensing. IEEE Signal Processing Magazine 35, 3 (2018), 59–80.
  • Wang et al. (2014) Guanhua Wang, Yongpan Zou, Zimu Zhou, Kaishun Wu, and Lionel Ni. 2014. We can hear you with Wi-Fi! ACM MobiCom. https://doi.org/10.1145/2639108.2639112
  • Wang et al. (2015) Wei Wang, Alex X Liu, Muhammad Shahzad, Kang Ling, and Sanglu Lu. 2015. Understanding and modeling of wifi signal based human activity recognition. In ACM MobiCom. 65–76.
  • Wang et al. (2016) Wei Wang, Alex X Liu, and Ke Sun. 2016. Device-free gesture tracking using acoustic signals. In ACM MobiCom. 82–94.
  • Wang et al. (2020) Ziqi Wang, Zhe Chen, Akash Deep Singh, Luis Garcia, Jun Luo, and Mani B. Srivastava. 2020. UWHear: Through-Wall Extraction and Separation of Audio Vibrations Using Wireless Signals. In ACM SenSys. 1–14. https://doi.org/10.1145/3384419.3430772
  • Wei et al. (2015) Teng Wei, Shu Wang, Anfu Zhou, and Xinyu Zhang. 2015. Acoustic Eavesdropping through Wireless Vibrometry. In ACM MobiCom. 130–141. https://doi.org/10.1145/2789168.2790119
  • Wu et al. (2020) Chenshu Wu, Feng Zhang, Beibei Wang, and K. J. Ray Liu. 2020. mSense: Towards Mobile Material Sensing with a Single Millimeter-Wave Radio. In PACM on Interactive, Mobile, Wearable and Ubiquitous Technologies.
  • Wu et al. (2014) Z. Wu, S. Gao, E. S. Cling, and H. Li. 2014. A study on replay attack and anti-spoofing for text-dependent speaker verification. In APSIPA ASC. 1–5. https://doi.org/10.1109/APSIPA.2014.7041636
  • Xie and Xiong (2020) Binbin Xie and Jie Xiong. 2020. Combating interference for long range LoRa sensing. In ACM SenSys. 69–81.
  • Xu et al. (2019) Chenhan Xu, Zhengxiong Li, Hanbin Zhang, Aditya Singh Rathore, Huining Li, Chen Song, Kun Wang, and Wenyao Xu. 2019. Waveear: Exploring a mmwave-based noise-resistant speech sensing for voice-user interface. In ACM MobiSys. 14–26.
  • Xu et al. (2014) Y. Xu, J. Du, L. Dai, and C. Lee. 2014. An Experimental Study on Speech Enhancement Based on Deep Neural Networks. IEEE Signal Processing Letters 21, 1 (2014), 65–68. https://doi.org/10.1109/LSP.2013.2291240
  • Yun et al. (2015) Sangki Yun, Yi-Chao Chen, and Lili Qiu. 2015. Turning a mobile device into a mouse in the air. In ACM MobiSys. 15–29.
  • Zhang et al. (2021) Feng Zhang, Chenshu Wu, Beibei Wang, and K. J. Ray Liu. 2021. mmEye: Super-Resolution Millimeter Wave Imaging. IEEE Internet of Things Journal 8, 8 (2021), 6995–7008. https://doi.org/10.1109/JIOT.2020.3037836
  • Zhang et al. (2017) Guoming Zhang, Chen Yan, Xiaoyu Ji, Tianchen Zhang, Taimin Zhang, and Wenyuan Xu. 2017. Dolphinattack: Inaudible voice commands. In ACM CCS. 103–117.
  • Zhang et al. (2015) Li Zhang, Parth H. Pathak, Muchen Wu, Yixin Zhao, and Prasant Mohapatra. 2015. Accelword: Energy efficient hotword detection through accelerometer. In ACM MobiSys. 301–315.
  • Zhao et al. (2018) Mingmin Zhao, Yonglong Tian, Hang Zhao, Mohammad Abu Alsheikh, Tianhong Li, Rumen Hristov, Zachary Kabelac, Dina Katabi, and Antonio Torralba. 2018. RF-based 3D skeletons. In ACM SIGCOMM. 267–281.
  • Zheng et al. (2019) Yue Zheng, Yi Zhang, Kun Qian, Guidong Zhang, Yunhao Liu, Chenshu Wu, and Zheng Yang. 2019. Zero-effort cross-domain gesture recognition with Wi-Fi. In ACM MobiSys. 313–325.