Voice-driven services (VDS) are widely deployed in commercial products to enable personalized and convenient experiences for consumers. Examples include digital assistants from Amazon, Apple, and Google for performing tasks like smart home control and online commerce hands-free, and automated and human-assisted voice response systems for customer support. While this increasing ubiquity can primarily be attributed to the improved real-world performance of deep-learning-driven speech and speaker recognition, security is an equally important consideration in operational settings. One of the common mechanisms used to secure VDS in practice is the "voice password". For example, applications such as Samsung Bixby, WeChat, and others [bixbypassword, wechatpassword, mobileapp_password] prompt users to speak a password and perform two-factor authentication by checking (i) whether the spoken passphrase is correct, and (ii) whether it was spoken by the authorized user.
Voice passwords, however, do not provide security against voice spoofing techniques such as speech replay [gong2018protecting, pradhan2019combating, leng2020replay] and synthesis [synthesisattack]. While there are known practical limitations in conducting speech synthesis attacks (e.g., the attack proposed in [synthesisattack] requires 24 hours of high-quality training data), replay attacks are easier to conduct in practice. Broadly, there are two categories of defense mechanisms that aim to prevent speech replays. The first class of methods uses machine learning to determine whether a speech sample is produced by a human or replayed using a playback device [chen2017you, wang2020differences]. The second category of techniques uses external devices (e.g., a vibration sensor) to check whether the speech is produced by a human in real time [feng2017continuous]. For machine learning techniques, real-world performance usually depends on the training data and its relevance to the test environment. Void [ahmed2020void], a state-of-the-art method, for example, has significantly different equal error rates (EER), 0.3% and 11.6%, on two speech datasets collected in two different environments. Another practical limitation is the unreliability of the playback device signatures used by machine learning techniques [wang2020differences]. In addition, methods that use extra hardware have usability and cost implications for practical deployment.
In this work, we aim to address a subset of replay attacks called speech re-use attacks. We define speech re-use attacks to be specific to a certain type of VDS. As shown in Figure 1, let A and A′ be the VDS types for the capture and attack steps, respectively. Replay attacks typically consider a variety of VDS types for the capture and attack steps (both A = A′ and A ≠ A′), while re-use attacks are the subset of replay attacks where the VDS types for the capture and attack steps are the same (A = A′). The proposed security overlay prevents re-use attacks on protected VDS. For example, if the security overlay is integrated with every Alexa VDS, it can prevent speech captured from user interaction with one Alexa VDS from being re-used on another Alexa VDS. This is similar to the device-specific random pattern used in Face ID to counter digital and physical spoofs [faceID_security] (see Figure 2).
We present AEOLUS (named after Aeolus, the "Keeper of the Winds" in Greek mythology), a security overlay that proactively embeds a dynamic acoustic nonce in the voice channel via the loudspeaker at the time of user interaction, and detects the presence of the embedded nonce in the speech recorded by the microphone to ensure speech freshness. AEOLUS is designed as a software solution and can be integrated by a vendor with any closed-loop VDS that involves real-time user interaction. It is useful for providing additional security in critical applications such as user authentication and payments. AEOLUS does not require any extra hardware and works with the in-built speaker and microphone in consumer products. AEOLUS is secure by design against attacks that remove the embedded nonce for speech re-use. It uses the frequency-hopping spread spectrum (FHSS) technique [cvejic2004algorithms, cvejic2004spread] for dynamic acoustic nonce embedding. Similar techniques have previously been used to watermark audio recordings for copyright purposes [swanson1998robust, bassia2001robust]. However, their application in offline audio watermarking is fundamentally different from the VDS operational setting, where the length and content of the input speech are unknown a priori. AEOLUS is designed for real-time over-the-air use without prior knowledge of the input speech.
AEOLUS addresses two key practical challenges. The first challenge is to reliably embed an acoustic nonce over the air and retrieve it successfully from the recorded speech. Over-the-air acoustic nonce propagation is affected by factors such as background noise and the distance between the microphone and the loudspeaker [chenmetamorph]. The latter, for instance, results in a distorted and attenuated speech signal, which increases the likelihood of errors while extracting the embedded acoustic nonce. The second challenge is to embed the acoustic nonce such that it is imperceptible to a VDS user. Achieving imperceptibility is non-trivial in light of the first challenge: reliability requires embedding an acoustic nonce of sufficient strength, which, in turn, makes the nonce perceptible. Therefore, we model acoustic nonce generation as an optimization problem and compute the set of optimal parameters (e.g., the nonce's operating frequency, amplitude, and bitrate) in different environments using differential evolution.
We evaluate AEOLUS in three different environments and show that it works reliably up to a range of 4 m. Additionally, we conduct a user study involving 120 subjects (approved by the institutional human study review board) to evaluate the imperceptibility of the embedded nonce. The results of the study show that the majority of users find the embedded acoustic nonce to be either imperceptible or non-disruptive.
In summary, the contributions of this work are as follows:
Design of a VDS security overlay called AEOLUS to ensure freshness of input speech without any extra hardware dependency.
Modeling acoustic nonce propagation as an optimization problem to address (i) reliability, i.e., ensuring successful embedding and retrieval of the dynamic acoustic nonce without impacting VDS functionality, and (ii) imperceptibility, i.e., having minimal impact on VDS users' experience.
Comprehensive real-world evaluation to show that AEOLUS can work effectively (0.5% false reject rate (FRR) and 0% false accept rate (FAR) in detecting speech re-use) up to a range of 4 meters in three different environments.
User study with 120 subjects to demonstrate that the embedded acoustic nonce does not degrade user experience.
2 Background and Related Work
2.1 Example Use Cases
Voice-driven Payments. Voice-driven services are being increasingly used for payments, especially in retail environments [vpsurvey]. Amazon and Exxon recently announced voice-driven payments at over 11,500 gas stations in the US [gasstationpay]. A recent user study [paysafe] indicates that while most users are comfortable conducting low-value purchases such as ordering a meal and shopping for groceries using voice-driven payment services, the majority of them do not have sufficient confidence in conducting high value purchases due to security concerns.
Most voice-driven payment services authenticate users through speaker verification. However, they often lack adequate protection mechanisms against speech re-use attacks where an attacker records a user interaction and re-uses it to attack the service either on-site or remotely. This is analogous to card skimming [scaife2018fear] where an attacker re-uses a victim’s credit card information. AEOLUS can be deployed to prevent speech re-use attacks in voice-driven payments.
Workplace Automation. A variety of digital assistants are being used for unsupervised or semi-supervised interactions in workplace environments [alexabusiness]. In addition to accurate speaker recognition (identification and/or verification), it is important to have adequate protection mechanisms against speech re-use to protect sensitive resources in such environments [workplaceva]. AEOLUS can be incorporated in workplace digital assistants for this purpose.
2.2 Related Work
Speech Replay Detection
Existing research on speech replay detection can be classified into two categories: software-based and hardware-based methods. Software-based methods use machine learning to determine whether the input speech is produced by a human or replayed using a recording device [chen2017you, ahmed2020void, kinnunen2017asvspoof]. Chen et al. [chen2017you] present a method that obtains 0% EER only at distances (between the user and the VDS input device) of a few centimeters, which limits its applicability in practice. Ahmed et al.'s method [ahmed2020void], on the other hand, is reported to work well up to a distance of 2.6 meters. However, its EER varies significantly across environments, reaching as high as 11.6%. One of the best performing methods [kinnunen2017asvspoof] reports an EER of 6.7% on the ASVspoof 2017 database. A caveat, however, is that these methods aim to address general replay attacks. In contrast, AEOLUS aims to prevent speech re-use and obtains 0% EER up to a distance of 4 meters in three different environments.
The second category of approaches uses additional hardware (e.g., a vibration sensor on the user's neck) to check that speech is produced in real time by a user [feng2017continuous, pradhan2019combating]. Such approaches, in general, outperform machine learning-based methods. For example, Feng et al. [feng2017continuous] report a 97% successful detection rate, on average, with no strict assumption on distance. However, there are two major practical limitations in using the aforementioned approaches for detecting speech re-use: (i) the requirement of extra hardware, which has cost implications, and (ii) inconvenience to users. AEOLUS uses only the built-in speaker and microphone in smart devices, yet achieves a high speech re-use detection rate.
Audio Watermarking. Audio watermarking is widely used for multimedia copyright protection [swanson1998robust, seok2001audio]. Both time-domain [bassia2001robust] and frequency-domain [cox1996secure, neubauer1998digital] watermarking techniques have been proposed in the literature. Typically, audio watermark embedding and detection is self-synchronized [wu2005efficiently]. Advanced watermarking methods use spread-spectrum modulation [swanson1998robust] to prevent watermark removal and provide imperceptibility [cox1996secure, kirovski2003spread]. While useful, these techniques cannot be directly used in the VDS operational setting. They are designed for fixed-length offline audio files (e.g., music recordings on CDs [swanson1998robust]) and do not consider the impact of environmental factors, e.g., bitrate variability and background noise, in over-the-air transmission. Such environmental factors result in a dynamic and lossy acoustic environment, in turn making watermark embedding and retrieval significantly challenging. Further, these techniques are not designed for real-time speech and speaker recognition systems, where the input speech is unknown a priori and can be of variable length. AEOLUS takes these pragmatic considerations into account, and is able to successfully embed and retrieve an acoustic nonce in the voice channel without degrading user experience in practice.
3 Proposed Security Overlay
3.1 Threat Model
The following assumptions are made while designing AEOLUS to address the speech re-use attack scenario. The hardware device (loudspeaker and microphone) that a user uses to interact with the VDS is trusted and secure, and an adversary cannot disable the speaker or the microphone. The proposed overlay can be used by a VDS to generate, embed, and detect acoustic nonces. At the time of user interaction, the method used to trigger a VDS, e.g., words or phrases like "Alexa" and "Hey, Siri", is not protected but can initiate the proposed overlay to prevent re-use of voice commands and confidential passphrases, for example, to prevent re-use of a user's voice password for accessing recent account transactions when the user is interacting with a voice banking VDS [paypalpin].
An adversary can use a commodity recording device to record prior user interactions with “any” AEOLUS-enabled VDS, and re-use the recorded interaction to attack any AEOLUS-enabled VDS. Attack scenarios where an adversary can record or access clean user speech samples (i.e., recordings without embedded acoustic nonce) to launch replay or speech synthesis attacks are considered out of scope and are not addressed. The key underlying assumptions are that (i) users typically interact with VDS using specific commands (e.g., a user-defined passphrase [bixbypassword]) and these commands can be protected with the proposed overlay, and (ii) it is difficult to record clean samples of these commands from interaction with unprotected devices or in human conversations. However, this does not limit the use of AEOLUS in conjunction with other defense mechanisms for replay or speech synthesis attacks.
3.2 Core Components
To prevent speech re-use in VDS, AEOLUS uses three core components: nonce generation, nonce embedding, and nonce detection.
Nonce Generation. The nonce generation module is invoked upon initiation of a new user interaction session (see Figure 2(b) ①). Akin to nonce-based authentication [tsai2009efficient, bellare1995provably], the module generates a digital nonce (a random number of length ℓ) that is valid for the current session. The nonce length ℓ should be sufficiently large to prevent the use of duplicate nonces in different sessions. Given the limited bandwidth of the over-the-air channel and the estimated user base of popular VDS (e.g., a single Alexa skill has 300,000 daily users [alexadailyuser]), the minimum permitted value of ℓ is set to 32 bits. A 32-bit nonce is sufficient to protect short speech samples (e.g., a voice passphrase) in a particular session. Nonce repetition is permitted to protect longer speech samples during the session. The nonce for each session is stored in a set, which is subsequently used by the nonce detection module to determine whether the embedded nonce is current.
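As a concrete illustration, the session-nonce bookkeeping described above can be sketched as follows (a minimal sketch; the function and variable names are illustrative and not taken from AEOLUS's implementation):

```python
import secrets

NONCE_BITS = 32  # minimum permitted nonce length per the design above

def generate_nonce(session_id, nonce_store):
    """Generate a fresh random nonce for a session and record it so that
    the detection module can later check whether a decoded nonce is current."""
    nonce = secrets.randbits(NONCE_BITS)
    nonce_store[session_id] = nonce
    return nonce
```

Using a cryptographically strong source (`secrets`) rather than a plain PRNG keeps nonces unpredictable across sessions.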
Nonce Embedding. After nonce generation, an encoding algorithm, binary frequency-shift keying (BFSK) [park2010underwater, truax2001acoustic], is used to encode the generated digital nonce as an acoustic nonce. Each bit is independently encoded as a cosine wave: bit "0" is encoded as a 1 kHz cosine wave, and bit "1" as a 5 kHz cosine wave. The cosine waves are concatenated to yield the acoustic nonce, which is embedded in the over-the-air channel using the VDS playback device (Figure 2(b) ②). An important parameter for nonce embedding is the duration of the embedded acoustic nonce relative to the duration of the user interaction. Using a real-world dataset with 27,000 samples [gong2019remasc], we estimate that the average user interaction lasts 3 sec. As an example, let the minimum duration of the embedded nonce be half of the average user interaction. In this case, the acoustic nonce embedding would require a bitrate of 14 bits/sec, and the maximum duration of each component cosine wave can be 42 ms. Note that this estimate does not account for idle time in the user interaction. Additionally, to prevent an adversary from obtaining a clean speech sample by removing the embedded nonce, the nonce embedding module leverages frequency-hopping spread spectrum (FHSS) [cvejic2004spread, cvejic2004algorithms], a well-known technique that provides high robustness against standard embedded-signal removal attacks. For high robustness, the module uses FHSS with a set of different carrier frequencies and periodically selects the operating frequency at random.
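The bit-level BFSK encoding can be sketched as below. The 1 kHz/5 kHz carriers and 42 ms bit duration come from the text; the 16 kHz sampling rate is an assumption made for illustration:

```python
import numpy as np

FS = 16_000            # sampling rate in Hz (assumed for illustration)
BIT_DUR = 0.042        # 42 ms per bit, per the estimate above
F0, F1 = 1_000, 5_000  # carrier frequencies for bits "0" and "1"

def bfsk_encode(bits, amplitude=1.0):
    """Encode each bit independently as a cosine wave at the bit's carrier
    frequency, then concatenate the waves to form the acoustic nonce."""
    t = np.arange(int(FS * BIT_DUR)) / FS
    waves = [amplitude * np.cos(2 * np.pi * (F1 if b else F0) * t)
             for b in bits]
    return np.concatenate(waves)
```

In a full FHSS realization, the carrier pair would additionally hop among several frequency slots selected at random per hop interval.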
Nonce Detection. Once the user interaction ends, the recorded audio is processed by the nonce detection module (Figure 2(b) ③). The module decodes the acoustic nonce using BFSK and checks that (i) the nonce is completely decoded, and (ii) the decoded nonce is current. The recorded speech is subsequently processed using standard VDS modules (Figure 2(b) ④). If nonce detection fails or the decoded nonce has a high bit error rate (BER), a recourse action (e.g., a retry or reject) can be taken. Furthermore, if speech is re-used by an adversary and both the current nonce and an obsolete nonce from the re-used speech are detected, information about the re-used speech, such as when and where it was recorded, can be determined from the stored nonce set. Signal interference between the current and obsolete nonces can also yield an invalid decoding result; in such cases, a pertinent recourse action can be taken.
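A matching decoder and currency check can be sketched as follows (illustrative names; the ±15 Hz tolerance anticipates the frequency shifting measured in Section 4.2.2):

```python
import numpy as np

def bfsk_decode(signal, fs=16_000, bit_dur=0.042, f0=1_000, f1=5_000):
    """Decode an acoustic nonce: for each bit-length window, compare the
    spectral energy near the two carrier frequencies and pick the stronger."""
    n = int(fs * bit_dur)
    freqs = np.fft.rfftfreq(n, 1 / fs)
    bits = []
    for i in range(0, len(signal) - n + 1, n):
        spec = np.abs(np.fft.rfft(signal[i:i + n]))
        e0 = spec[np.abs(freqs - f0) <= 15].max()  # tolerate +/- 15 Hz shift
        e1 = spec[np.abs(freqs - f1) <= 15].max()
        bits.append(1 if e1 > e0 else 0)
    return bits

def is_current(decoded_bits, session_nonce, nonce_len=32):
    """Check (i) the nonce is completely decoded and (ii) it is current."""
    if len(decoded_bits) < nonce_len:
        return False
    return int("".join(map(str, decoded_bits[:nonce_len])), 2) == session_nonce
```

A production decoder would additionally scan the recording for traces of obsolete nonces from the stored set to flag re-use.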
4 Practical Realization
In this section, we discuss the two key challenges while implementing AEOLUS in practice: reliability and imperceptibility.
Reliability. The first challenge is to reliably embed an acoustic nonce in the over-the-air channel and to accurately detect the nonce in the recorded speech. Because over-the-air acoustic nonce transmission is impacted by factors such as background noise and the distance between the microphone and the playback device, this is difficult to achieve in operational settings. It is also important to determine whether any previously embedded acoustic nonce is present in the recorded speech, to ascertain speech re-use. The metric used to measure reliability is the bit error rate (BER) between the embedded and detected nonces. Recall that BFSK is used for acoustic nonce embedding and detection. Under the assumption that only additive white Gaussian noise (AWGN) is present in the over-the-air channel, the BER can be computed using the complementary error function [kumar2010comparison] as follows:

$\mathrm{BER} = \frac{1}{2}\,\mathrm{erfc}\!\left(\sqrt{\frac{E_b}{2N_0}}\right) \quad (1)$
Note that the BER computation in Eqn. 1 involves the normalized per-bit signal-to-noise ratio $E_b/N_0$ [johnson2006signal], which depends on the frequency and amplitude of the carrier wave. However, noise other than AWGN is typically present in the over-the-air channel. Hence, we conduct room impulse response (RIR) simulation experiments to study the impact of carrier wave amplitude and frequency on BER.
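Under the AWGN assumption, Eqn. 1 can be evaluated directly. The sketch below assumes coherent BFSK detection, for which the textbook expression uses the complementary error function:

```python
import math

def bfsk_ber(eb_n0):
    """Theoretical BER of coherent BFSK over an AWGN channel,
    BER = 0.5 * erfc(sqrt(Eb / (2 * N0))), with eb_n0 given as a
    linear (not dB) ratio."""
    return 0.5 * math.erfc(math.sqrt(eb_n0 / 2.0))
```

As expected, the BER approaches 0.5 as the per-bit SNR goes to zero and falls off rapidly as the SNR grows.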
Imperceptibility. The second challenge is to ensure that the embedded acoustic nonce does not degrade the VDS user experience. For this, the acoustic nonce should be as imperceptible as possible to a VDS user. The nonce embedding and generation modules presented earlier do not adequately address this challenge because (i) they do not optimize any objective metric for imperceptibility, and (ii) they do not account for dynamic and noisy environments where it is difficult to achieve both reliability and imperceptibility simultaneously. To measure imperceptibility, the sound pressure level (SPL) is computed using the following equation:

$\mathrm{SPL} = 20\log_{10}\!\left(\frac{\rho\, c\,(2\pi f)\, A}{p_0}\right)$

Here, $f$ represents the frequency and $A$ the amplitude of the carrier wave, $\rho$ represents the density of the transmission medium (e.g., air for the over-the-air channel), $c$ denotes the speed of acoustic transmission, and $p_0$ is the reference sound pressure. Given that the average SPL for human speech is 60 dB [human_normal_hearing], $f$ and $A$ should ideally ensure that the SPL stays below this threshold for imperceptibility. As with Equation 1, while this equation provides a theoretical basis for understanding SPL, we study how the aforementioned parameters impact SPL using RIR simulation.
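The SPL relation can be sketched numerically as follows. The plane-wave pressure relation p = ρ·c·(2πf)·A (with A interpreted as displacement amplitude) and the air constants are assumptions made for illustration:

```python
import math

RHO_AIR = 1.225   # density of air, kg/m^3 (assumed)
C_AIR = 343.0     # speed of sound in air, m/s (assumed)
P_REF = 20e-6     # reference sound pressure, 20 micropascals

def spl_db(freq_hz, amplitude_m, rho=RHO_AIR, c=C_AIR):
    """SPL of a pure tone: pressure amplitude p = rho * c * (2*pi*f) * A,
    expressed in decibels relative to 20 uPa."""
    p = rho * c * 2 * math.pi * freq_hz * amplitude_m
    return 20 * math.log10(p / P_REF)
```

Under this model, doubling either the frequency or the amplitude raises the SPL by about 6 dB, which is why the optimizer must trade nonce strength against audibility.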
4.2 Key Parameters
There are two key sets of parameters that impact reliability and imperceptibility: (i) acoustic nonce generation and embedding parameters, which include the frequency and amplitude of the carrier wave used for acoustic nonce generation and the bitrate used for acoustic nonce encoding, and (ii) environmental parameters, which include the distance between the microphone and the user, the room size, and background noise, among others. To understand the impact of these parameters on reliability and imperceptibility, we implement AEOLUS as a prototype and set up an RIR simulation environment, MCRoomSim, in MATLAB [wabnitz2010room]. RIR is a commonly used tool for estimating the performance of signal processing applications in the over-the-air channel [chenmetamorph, abdullah2019practical].
4.2.1 Nonce Generation and Embedding Parameters
Amplitude. According to the RIR simulation, a lower amplitude yields a lower $E_b/N_0$ ratio and, consequently, a higher BER (see Figure 3(a)). This is consistent with Equation 1. Furthermore, as shown in Figure 5, SPL is proportional to the amplitude of the carrier wave.
Frequency. BER and SPL are evaluated in the frequency range at which typical speaker recognition systems operate (500 Hz - 8 kHz, sampled every 500 Hz). The other parameters are fixed as follows: frequency hopping = 2, amplitude (normalized) = 1, bitrate = 30 bits/sec, distance = 1 meter. It is observed that the carrier wave frequency impacts neither BER nor SPL. The latter is in contrast to existing research [suzuki2004equal, wiki_elc], which shows that the frequency of an acoustic wave affects the loudness perceived by humans in a non-linear manner. Hence, the frequency-to-SPL mapping from the IoSR toolkit [githubmapping] is used as a reference.
Bitrate. Figure 3(b) shows the impact of the acoustic nonce bitrate on BER. BER increases significantly as the bitrate increases beyond 20 bits/sec. One cause is asynchronous signal overlapping in the time domain when two bits are transmitted too close together.
4.2.2 Environmental Parameters
Frequency Shifting. Over-the-air acoustic nonce transmission induces differences between the received and transmitted frequencies, called frequency shifting. Frequency shifting can also occur because of hardware limitations, e.g., a microphone with a 1.8 kHz frequency limit being used to capture a 2 kHz wave. Since the received frequency is used to decode the acoustic nonce, AEOLUS needs to account for frequency shifting. We investigate frequency shifting in the simulation setting (operating frequency range: 500 Hz - 8 kHz, cardioid loudspeaker, dipole microphone) and determine the frequency shifting range to be ±15 Hz.
Distance. Distance between a VDS user and VDS input impacts imperceptibility of the acoustic nonce. Figure 4(b) indicates that this distance is inversely proportional to SPL.
4.3 Mathematical Formulation
To address reliability and imperceptibility simultaneously, we model over-the-air acoustic nonce transmission as an optimization problem. Let a user's recited speech be denoted by $s$, and the acoustic nonce embedded by AEOLUS by $w$. Also, let $h$ denote the channel response that accounts for variations induced by the over-the-air channel, e.g., due to acoustic air absorption and distance. The speech recorded by the microphone is thus represented as $h * s$, where $*$ denotes signal convolution. When $w$ is embedded at the time of user interaction with the VDS, $w$ mixes with $s$. As a result, the microphone records $x = h * (s + w)$.
Assume that AEOLUS detects the nonce in $x$. Let us define an objective function $B(\theta)$ denoting the BER, which takes the parameter set $\theta$ (carrier frequency, amplitude, and bitrate) as input, and a function $P(\theta)$ denoting the SPL. The goal is to find the optimal parameter set $\theta^{*}$ that minimizes $B(\theta)$ (for reliability) subject to the constraint that the SPL difference between speech with and without the embedded nonce is less than or equal to a threshold $\tau$ (for imperceptibility):

$\theta^{*} = \arg\min_{\theta} B(\theta) \quad \text{s.t.} \quad P(\theta) - P_{0} \leq \tau,$

where $P_{0}$ denotes the SPL of the speech without the embedded nonce.
4.4 Computing Optimal Parameters
Initialization. The environmental parameters are first initialized (Figure 2(b) ⑤) using the VDS input. The background noise energy is calculated by inspecting the ambient sound. Frequency shifting is calculated by transmitting and decoding a nonce that covers the working frequency range (500 Hz - 8 kHz, sampled every 500 Hz). It is assumed that the distance between the user and the VDS input device is set by the VDS device vendor.
Optimization. Computing the gradients of the optimization parameters directly using gradient-based methods is inefficient [chen2019real] because the VDS's acoustic environment is dynamic (e.g., due to occlusions and varying background noise levels). In addition, the formulated optimization is a multi-dimensional problem with considerations of both reliability and imperceptibility. Hence, we use differential evolution (DE) [das2010differential, storn1997differential], a gradient-free method. DE generates diverse candidate solutions in each iteration, which avoids local minima and aids convergence to a global minimum. It does not require a priori knowledge of the underlying acoustic encoding method or acoustic environment. Algorithm 1 summarizes the use of DE for obtaining the optimal parameters. The initial set of candidate parameters includes the feasible frequency, amplitude, and bitrate ranges estimated using RIR simulation. Once AEOLUS is deployed, the initialization module first estimates the environmental parameters (e.g., background noise energy and frequency shift). Subsequently, the optimization module uses the pre-calculated RIR parameters for efficient convergence to the optimal parameters (see Section 5.1).
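The gradient-free search that Algorithm 1 relies on can be sketched with a minimal hand-rolled DE loop. The population size, mutation, and crossover settings are illustrative defaults, not AEOLUS's actual configuration:

```python
import numpy as np

def diff_evolution(objective, bounds, pop_size=20, iters=50,
                   mutation=0.8, crossover=0.7, seed=0):
    """Minimal differential evolution: for each candidate, build a mutant
    from three distinct peers, cross it over with the candidate, and keep
    the trial only if it lowers the objective."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, dtype=float).T
    dim = len(bounds)
    pop = rng.uniform(lo, hi, size=(pop_size, dim))
    costs = np.array([objective(p) for p in pop])
    for _ in range(iters):
        for i in range(pop_size):
            peers = [j for j in range(pop_size) if j != i]
            a, b, c = pop[rng.choice(peers, 3, replace=False)]
            mutant = np.clip(a + mutation * (b - c), lo, hi)
            mask = rng.random(dim) < crossover
            mask[rng.integers(dim)] = True  # ensure at least one mutated gene
            trial = np.where(mask, mutant, pop[i])
            cost = objective(trial)
            if cost < costs[i]:
                pop[i], costs[i] = trial, cost
    return pop[costs.argmin()]

# Example bounds for (frequency in Hz, normalized amplitude, bitrate):
BOUNDS = [(500, 8_000), (0.0, 1.0), (5, 30)]
```

In AEOLUS's setting, the objective would be a measured BER with a penalty whenever the SPL increase exceeds the threshold, evaluated by actually playing and recording candidate nonces.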
5 Experimental Evaluation
AEOLUS is implemented as a prototype in MATLAB and evaluated in three different environments: (i) a conference room (approx. 5.5 m by 4 m), selected as a reference for a semi-public indoor environment; (ii) a campus dining area (approx. 25 m by 50 m), chosen to mimic a noisy indoor public area; and (iii) the inside of a car parked at a gas station, to simulate the use case of voice-driven payments at gas stations [gasstationpay]. For (i) and (ii), a FIFINE K669B microphone is used to record speech, and a Logitech S120 speaker system is used as the playback device to simulate users. The evaluation is performed at different distances (0.5-4 m) between the microphone and the playback device. For (iii), the microphone is identical, but the car's in-built speakers are used instead of the Logitech speakers, and the evaluation is performed at two distances, 0.5 m and 0.7 m.
AEOLUS is evaluated both on speech data captured from live participants in ReMASC dataset [gong2019remasc] and synthesized speech from Amazon Polly. A total of ten different speaker profiles are used (four of which are synthesized). For each profile, three different speech samples are used. Microsoft Azure’s speech and speaker recognition services [ms_speaker] are used for speech-to-text and speaker recognition. Text-independent speaker verification service [ms_verification] is used for speaker verification experiments. Each speaker’s profile is first enrolled using a speech sample longer than 20 seconds. Verification is performed using the speaker’s speech samples with and without acoustic nonce, and yields one of the following outputs: acceptance with “high”, “normal”, or “low” confidence, or a rejection.
Table 1: Optimal parameters in different environments (frequency step size: 200 Hz).
| Environment | Background Noise (dB) | Frequency | Amplitude | Bitrate (bits/sec) |
| Dining Hall | 58 | 5200 Hz | 1 | 25 |
| Gas Station (Car) | 47 | 4800 Hz | 0.61 | 33 |
Optimal Parameters. First, we estimate the optimal parameters for achieving reliability and imperceptibility simultaneously in a given environment using Algorithm 1. The frequency step size is set to 200 Hz, and the SPL difference threshold is set to 0.1 to ensure that the generated nonce stays below the average human speech SPL of 60 dB. It is observed that lower amplitudes and frequencies and higher bitrates are comparatively optimal for acoustic nonce embedding in low-noise environments such as the conference room (see Table 1). Relatively higher amplitudes and frequencies, and lower bitrates, are more optimal in environments with higher noise, such as the dining hall. This is because the background noise in such environments typically manifests as low-amplitude, low-frequency noise.
Computational Overhead. Algorithm 1 iteratively optimizes the parameter set for the given environment. For the conference room, a single iteration is sufficient. For the dining hall and gas station environments, however, a few iterations are required and the computation takes a few seconds. This does not cause a delay at the time of user interaction because it overlaps with the time a user typically takes to set up a VDS. The acoustic nonce embedding and decoding process occurs while the user is interacting with the VDS. The computational overhead is negligible because the nonce size is small, and the playback device is used only when the user is speaking.
Reliability. Next, we measure the reliability of embedding and retrieving the acoustic nonce in the three environments. For this, we compute the average BER over 5 independent trials at different distances between the VDS input device (microphone) and the user. Figure 7 shows that in all three environments, the average BER up to a distance of 1.5 m is 0%. Note, however, that BER increases significantly beyond 1.5 m in the dining hall because that environment is relatively noisy.
Speech Re-use Detection. In this experiment, we measure the efficacy of AEOLUS in detecting speech re-use (see Table 2). The input speech is flagged as re-used if (i) the current nonce is not decoded correctly, or (ii) any trace of an obsolete nonce is detected. FRR indicates the proportion of genuine samples falsely rejected because of incorrect nonce decoding or because the detected nonce is obsolete. FAR indicates the proportion of re-used speech samples with obsolete nonces that are falsely accepted. AEOLUS achieves 0.5% FRR at 0% FAR for speech re-use prevention up to a distance of 4 meters in the three environments.
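The two operating-point metrics used above can be computed from accept/reject counts as follows (a trivial helper with illustrative names):

```python
def frr_far(genuine_accepted, genuine_total, reuse_accepted, reuse_total):
    """FRR: fraction of genuine (fresh-nonce) samples falsely rejected.
    FAR: fraction of re-used (obsolete-nonce) samples falsely accepted."""
    frr = 1.0 - genuine_accepted / genuine_total
    far = reuse_accepted / reuse_total
    return frr, far
```

For instance, accepting 199 of 200 genuine samples while accepting 0 of 200 re-used samples corresponds to FRR = 0.5% and FAR = 0%.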
Impact on Speaker and Speech Recognition. The goal of this experiment is to assess the impact of acoustic nonce embedding and decoding on the performance of a commercial speaker and speech recognition system (Microsoft Azure). The results show that there is limited impact on the performance of the speaker recognition system in the conference room and dining hall environments up to a distance of 1.5 m (see Appendix Figure 7(a)). The embedded nonce marginally increases the average word error rate (WER) of the speech recognition system to 4.4%. This is close to the reported average WER of 5.1% for the same system on the NIST 2000 Switchboard test set [xiong2018microsoft]. Manually reviewing the results indicates that most errors involve an incorrect "-ing" suffix; a few errors are due to omissions of small words such as "an" and "a". Relatively more errors occur in the dining hall environment due to the background noise.
Table 3: Speaker verification score (0-1) and speech recognition WER (%) after nonce removal attempts, for two FHSS configurations (4-5 kHz with 2 hops, and 1-8 kHz with 10 hops), averaged over 30 samples.
Acoustic Nonce Removal. An adversary may attempt to remove the embedded acoustic nonce to obtain a "clean" speech sample for conducting re-use attacks. To test the robustness of AEOLUS in this adversarial setting, we use six common audio watermark removal techniques [wu2005efficiently] (see Appendix 0.B for details). Each technique is applied iteratively until the acoustic nonce can no longer be detected in a speech sample. The removal attempts are repeated for all speech samples in the dataset mentioned above. Table 3 shows that the nonce removal techniques reduce the average recognition performance of the commercial speaker recognition system by 79.52% and increase the average word error rate of the speech recognition system by 33.57%.
5.3 Human Perception
5.3.1 Empirical Measurement.
In this experiment, we empirically measure the change in SPL due to the embedded nonce. The distance between the VDS output and the recording device is 1 m in the dining hall and conference room, and 0.7 m in the car. The results (see Appendix Figure 9) show that the average SPL difference in each environment is less than 1.4%. The non-parametric Wilcoxon signed-rank test [wilcoxon1970critical] (at the 95% confidence level) is used to measure the statistical significance of the difference in SPL before and after acoustic nonce embedding. The results indicate that the differences are statistically insignificant.
Table 4 (excerpt), Gas Station (Car): 32.14%/25%, 42.86%/75%, 8.57%/0%.
User Study. Next, we study the impact of AEOLUS on VDS usability. For this, 120 participants (with prior experience using a VDS) from various demographics (age groups and genders shown in Appendix Table 6) are recruited on Amazon Mechanical Turk. Each participant is presented with a survey that takes approximately 10 minutes to complete. Upon completion of the survey, each participant is paid 4 USD as an incentive. The survey questions are designed so as not to introduce bias into the results (no leading questions).
Each participant is first asked to enter their demographic information. The participant is then instructed to play a test sample and adjust the playback volume to a comfortable level. Following this, the participant is presented with five speech samples and follow-up questions. One of the five samples contains only background noise (and no speech) and serves as an attention check: participants are asked to transcribe all samples, and the provided transcriptions are used to verify attentiveness. Eleven participants failed this check and were excluded.
The test samples are selected uniformly at random from a dataset of 24 speech recordings, 12 each with and without an embedded nonce, drawn from the three environments. The average length of the speech samples is 6.4 seconds. To avoid bias, all samples presented to a participant are from the same speaker and pertain to the same voice command. After listening to a speech sample, a participant is asked to rate its noise level on a scale of 0 to 10 (10 being the highest). The participant is also instructed to report any noticeable noise source that affects their experience, and to specify the characteristic of the noise that caused the disruption (e.g., human conversation, train or car noise). These answers are used to ascertain whether the embedded nonce is a source of noise that impacted usability.
Results. Table 4 reports the average human perception ratings. On average, the acoustic nonce does not degrade the overall user experience for 94.16% of speech samples across the three environments. Since this aggregate does not adequately capture participants' perception of the embedded nonce in individual samples, per-sample perception is also reported. No speech sample with an embedded nonce is perceived as disruptive by a majority of participants.
Perceived Noise Level. Table 5 shows that speech samples containing the acoustic nonce are perceived to have a higher noise level. A dependent-samples (paired) t-test (at 95% confidence) indicates that the difference between samples with and without the nonce is statistically significant in the conference room and car. However, the difference is statistically insignificant in the dining hall, owing to its higher background noise level compared to the conference room and car. Despite perceiving a higher noise level, the majority of participants do not find any speech sample disruptive (see Table 4).
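A dependent-samples t-test of this kind can be run with SciPy's `ttest_rel`; the 0-10 ratings below are made up for illustration and are not the study's data:

```python
# Sketch: paired t-test on perceived noise ratings with and without the
# nonce (illustrative 0-10 ratings only, not the study's data).
from scipy.stats import ttest_rel

ratings_without_nonce = [4, 5, 3, 4, 5, 4, 3, 5, 4, 4, 5, 3]
ratings_with_nonce    = [5, 6, 4, 5, 5, 5, 4, 6, 5, 5, 6, 4]

t_stat, p = ttest_rel(ratings_with_nonce, ratings_without_nonce)
# p < 0.05 at the 95% confidence level indicates the higher perceived
# noise level with the nonce is statistically significant.
print(f"t={t_stat:.2f}, p={p:.2e}, significant={p < 0.05}")
```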
6 Summary and Future Work
In this work, we present AEOLUS, a security overlay that can be integrated with any voice-driven service to prevent speech re-use. AEOLUS reliably embeds a dynamic acoustic nonce that is non-disruptive to a VDS user, and detects the embedded nonce in the recorded user speech. AEOLUS can also be combined with other defense mechanisms to further enhance VDS security. For example, it can be used alongside a dynamic challenge-response method in which, for each interaction, the user is expected to respond to a fresh challenge (e.g., repeat a phrase); the acoustic nonce can be embedded in each challenge and detected using AEOLUS. Experiments conducted in three different environments show that AEOLUS can be used in operational VDS scenarios with minimal overhead. In the future, we plan to investigate in depth methods for generating and encoding the acoustic nonce, e.g., using variations of music or other forms of sound along with different encoding schemes. We also plan to conduct large-scale live-subject testing to further evaluate reliability and imperceptibility.
Appendix 0.A: Distance between adversary and VDS speaker.
To launch a speech re-use attack, an adversary first records a user's interaction with the VDS. In this experiment, we measure how well the acoustic nonce is preserved as the distance between the VDS speaker and the adversary's recording device (microphone) varies at the time of capture. Figure 7(b) shows that this distance significantly impacts BER. The computed BER is 0% up to a distance of 4 m when the signal-to-noise ratio is low. If the captured speech is re-used, then, assuming a lossless audio input to the VDS, the entire nonce can be recovered and used to determine the particular interaction session in which the speech was originally captured.
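The BER between the embedded and recovered nonce bits reduces to a simple bit comparison; the bit strings below are hypothetical examples, not values from the experiment:

```python
# Sketch: bit error rate (BER) between the embedded nonce bits and the
# bits recovered from a recorded sample (hypothetical bit strings).
def bit_error_rate(embedded: str, recovered: str) -> float:
    """Fraction of nonce bits that differ after recording/decoding."""
    assert len(embedded) == len(recovered)
    errors = sum(a != b for a, b in zip(embedded, recovered))
    return errors / len(embedded)

nonce     = "1011001110001011"
recovered = "1011001110001011"  # perfect recovery, e.g. within 4 m
print(bit_error_rate(nonce, recovered))  # 0.0 -> full nonce recovered

corrupted = "1011101110001001"  # two bit flips at longer distance
print(bit_error_rate(nonce, corrupted))  # 0.125
```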
Appendix 0.B: Nonce Removal Techniques
Resampling. Samples the audio at a different frequency (e.g., 44.1 kHz) to remove a sampling-dependent watermark.
Amplitude Compression. Alters the amplitude of the recorded audio to bypass amplitude-related watermark detection.
Filtering. Uses a high-pass or low-pass filter to remove a specific frequency range from the audio, for example, to separate audible signals from inaudible ones (e.g., ultrasound).
Lossy Compression. Uses data encoding methods that rely on inexact approximations and partial data discarding to compress the audio. For example, MP3 (MPEG-1 Audio Layer III) [mp3] reduces audio file size by exploiting perceptual limitations of human hearing.
Equalization. Modifies the frequency response of the audio using linear filters, where frequency response measures the output relative to the input as a function of frequency. For example, an equalizer may attenuate the frequency band around 5 kHz, degrading any watermark embedded there.
Additive Noise. Adds noise to decrease the SNR (signal-to-noise ratio) so that the watermark cannot be extracted successfully.
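Assuming a NumPy/SciPy toolchain (not necessarily the toolset of [wu2005efficiently]), three of these transforms can be sketched on a synthetic signal:

```python
# Sketch: resampling, filtering, and additive noise applied to a
# synthetic 1 s signal standing in for a speech sample.
import numpy as np
from scipy.signal import resample, butter, sosfilt

rng = np.random.default_rng(0)
fs = 16_000
t = np.arange(fs) / fs                   # 1 s of audio at 16 kHz
speech = np.sin(2 * np.pi * 440 * t)     # stand-in for a speech sample

# Resampling: go to 44.1 kHz and back to disturb a
# sampling-dependent watermark.
attacked = resample(resample(speech, 44_100), fs)

# Filtering: low-pass at 4 kHz to strip high-frequency components.
sos = butter(8, 4_000, btype="low", fs=fs, output="sos")
attacked = sosfilt(sos, attacked)

# Additive noise: lower the SNR so the nonce decoder fails.
attacked = attacked + 0.01 * rng.standard_normal(attacked.shape)

print(attacked.shape)  # (16000,)
```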
Appendix 0.C: User Study Demographics
The demographics of the 120 participants in the user study are shown in Table 6. There are more female participants than male participants, and 56.7% of the participants are between 25 and 35 years of age.
[Table 6: participant demographics by gender group and age group.]