
Robust Audio Adversarial Example for a Physical Attack

by Hiromu Yakura et al.

The success of deep learning in recent years has raised concerns about adversarial examples, which allow attackers to force deep neural networks to output a specified target. Although a method has been proposed to generate audio adversarial examples targeting a state-of-the-art speech recognition model, this method cannot fool the model when the examples are played over the air, and thus the threat was considered limited. In this paper, we propose a method to generate adversarial examples that can attack even when played over the air in the physical world, by simulating the transformations caused by playback and recording and incorporating them into the generation process. An evaluation and a listening experiment demonstrated that audio adversarial examples generated by the proposed method may become a real threat.





1 Introduction

In recent years, deep learning has achieved vastly improved accuracy, especially in fields such as image classification and speech recognition, and has come into practical use [1]. On the other hand, deep learning models are known to be vulnerable to adversarial examples [2, 3]: an attacker can make a model misclassify an example by intentionally adding a small perturbation to it. Such modified examples are referred to as adversarial examples.

Figure 1: Illustration of the proposed attack. Carlini et al. [4] assumed that adversarial examples are provided directly to the recognition model. We propose a method that targets an over-the-air condition, which leads to a real threat.

For example, Carlini et al. [4] proposed a method by which to generate adversarial examples that make DeepSpeech [5], one of the state-of-the-art speech recognition models, output desired transcriptions. However, this method targets the case in which the waveform of the adversarial example is input directly to the model, as shown in Fig. 1 (A). In other words, it is not feasible to attack under an over-the-air condition, in which the adversarial example is played by a speaker and recorded by a microphone, as shown in Fig. 1 (B). This is attributed to the reverberation of the environment and noise from both the speaker and the microphone, which cause a massive distortion of the input waveform, as compared to the case of direct input.

In this paper, we propose a method by which to generate a robust audio adversarial example that can attack speech recognition models in the physical world by being played over the air. In the proposed method, we address the above problem by simulating the influence of the reverberation and the noise and incorporating the simulated influence into the generation process. To the best of our knowledge, this is the first approach to succeed in generating such adversarial examples that can attack complex speech recognition models based on recurrent neural networks, such as DeepSpeech, over the air.

Considering a scenario of a physical attack, image adversarial examples must be shown explicitly on the attack target. In contrast, audio adversarial examples can simultaneously attack numerous targets by spreading via outdoor speakers or radios. In this respect, we believe that our research, which enables generation of audio adversarial examples that can attack in the physical world, is a significant contribution. Moreover, we believe our research could contribute to improving the robustness of speech recognition models by training models to discriminate adversarial examples, through a process similar to adversarial training in the image domain [3].

2 Background

In this section, we briefly introduce adversarial examples and review related studies on audio adversarial examples.

2.1 Adversarial Example

An adversarial example is defined as follows. Given a trained classification model $f$ and an input sample $x$, an attacker wishes to add a perturbation $v$ to $x$ so that the model recognizes the perturbed sample as having a specified label $t$ while the modification does not change the sample significantly:

$$f(x + v) = t \quad \text{subject to} \quad \|v\| \leq \epsilon \tag{1}$$

Here, $\epsilon$ is a parameter that limits the magnitude of the perturbation added to the input sample and is introduced so that humans cannot notice the difference between a legitimate input sample and one modified by an attacker.

Focusing on the perturbation $v$, the derivation in Eqn. 1 can be regarded as the following optimization problem using the loss function $\mathrm{Loss}$ of model $f$:

$$v = \mathop{\arg\min}_{\|v\| \leq \epsilon} \mathrm{Loss}(x + v, t) \tag{2}$$

By solving this problem with optimization methods, the attacker can obtain an adversarial example.
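As an illustration, Eqn. 2 can be solved with projected gradient descent on the perturbation. The following is a minimal sketch on a toy differentiable model; the linear loss, target direction `target_w`, and all parameter values are our own illustrative assumptions, not part of the paper:

```python
import numpy as np

def loss(x, target_w):
    # Toy loss: lower when x aligns with the target weight vector,
    # i.e., when the model is "recognizing" x as the target.
    return -float(target_w @ x)

def grad_loss(x, target_w):
    # Gradient of the toy linear loss with respect to x.
    return -target_w

def pgd_attack(x, target_w, eps=0.5, lr=0.1, steps=100):
    """Projected gradient descent for Eqn. 2: minimize Loss(x + v, t)
    subject to the L-infinity constraint |v| <= eps."""
    v = np.zeros_like(x)
    for _ in range(steps):
        g = grad_loss(x + v, target_w)
        v = v - lr * g             # gradient step on the perturbation
        v = np.clip(v, -eps, eps)  # project back onto the eps-ball
    return v

rng = np.random.default_rng(0)
x = rng.normal(size=8)
target_w = rng.normal(size=8)
v = pgd_attack(x, target_w)
# v stays within the eps-ball while lowering the loss toward the target.
```

The projection step is what enforces the $\|v\| \leq \epsilon$ constraint from Eqn. 1; any gradient-based optimizer with this projection follows the same pattern.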

2.2 Image Adversarial Example for a Physical Attack

Considering attacks on physical recognition devices (e.g., object recognition of auto-driving cars), adversarial examples are given to the model through sensors. In the example of the auto-driving car, image adversarial examples are given to the model after being printed on physical materials and being photographed by a car-mounted camera. Through such a process, the adversarial examples are transformed and exposed to noise. However, adversarial examples generated by Eqn. 2 are assumed to be given directly to the model and do not work for such scenarios.

In order to address this problem, Athalye et al. [6] proposed a method to simulate the transformations caused by printing or photographing and incorporate them into the generation process of image adversarial examples. This method can be represented as follows, using a set of transformations $T$ consisting of, e.g., enlargement, reduction, rotation, change in brightness, and addition of noise:

$$v = \mathop{\arg\min}_{\|v\| \leq \epsilon} \mathbb{E}_{\tau \sim T}\left[\mathrm{Loss}(\tau(x + v), t)\right] \tag{3}$$

As a result, adversarial examples are generated such that the images work even after being printed and photographed.
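In practice, the expectation in Eqn. 3 is approximated by sampling transformations at each optimization step. A minimal numpy sketch of this expectation-over-transformation idea follows; the transformation family (random gain plus noise), the toy linear loss, and all parameters are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_transform(x):
    # Toy transformation family T: random gain change plus additive noise,
    # standing in for printing/photographing distortions.
    return rng.uniform(0.8, 1.2) * x + rng.normal(scale=0.05, size=x.shape)

def loss(x, target_w):
    # Toy linear loss: lower when x aligns with the target direction.
    return -float(target_w @ x)

def eot_gradient(target_w, n_samples=32):
    # Monte Carlo estimate of the gradient of E_tau[Loss(tau(x + v), t)];
    # for this toy loss, the gradient of one sample is -gain * target_w.
    gains = rng.uniform(0.8, 1.2, size=n_samples)
    return -np.mean(gains) * target_w

x = rng.normal(size=8)
target_w = rng.normal(size=8)
v = np.zeros_like(x)
for _ in range(50):
    v = np.clip(v - 0.1 * eot_gradient(target_w), -0.5, 0.5)

# Average loss under random transforms, before and after perturbation.
before = np.mean([loss(random_transform(x), target_w) for _ in range(200)])
after = np.mean([loss(random_transform(x + v), target_w) for _ in range(200)])
```

Averaging gradients over sampled transformations is what makes the resulting perturbation survive distortions it was never shown individually.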

2.3 Related Research

Several studies have proposed methods for generating audio adversarial examples against speech recognition models [7, 4]. Yuan et al. [7] proposed a method targeting the deep neural network of Kaldi [8]. Despite its success in over-the-air attacks, their method is based on frame-by-frame generation and cannot attack recurrent networks, which are used in most state-of-the-art models, such as DeepSpeech [5].

On the other hand, Carlini et al. [4] proposed a method of generating an entire audio waveform, rather than frame-wise generation, to attack DeepSpeech. They implemented MFCC feature extraction on the computational graph of TensorFlow and generated an adversarial example for the target phrase $t$ as follows:

$$v = \mathop{\arg\min}_{\|v\| \leq \epsilon} \mathrm{Loss}(\mathrm{MFCC}(x + v), t) \tag{4}$$

Here, $\mathrm{MFCC}(\cdot)$ represents the MFCC extraction from the waveform. They reported a success rate of 100% for the obtained adversarial examples when inputting waveforms directly into the recognition model, but the attack did not succeed at all under the over-the-air condition.

To the best of our knowledge, there has been no prior proposal for generating audio adversarial examples that work under the over-the-air condition against speech recognition models that use a recurrent network.

3 Proposed Method

In this research, we propose a method by which to generate a robust adversarial example that can attack DeepSpeech [5] in the over-the-air condition. The basic idea is to incorporate transformations caused by playback and recording into the generation process, similar to Athalye et al. [6]. We introduce three techniques: a band-pass filter, impulse response, and white Gaussian noise.

3.1 Band-pass Filter

Since the audible range of humans is 20 to 20,000 Hz, typical speakers are not designed to reproduce sounds outside this range. Moreover, microphones often automatically cut out all but the audible range in order to reduce noise. Therefore, if the obtained perturbation lies outside the audible range, it will be cut during playback and recording and will not function as an adversarial example.

Therefore, we introduce a band-pass filter into the generation process in order to explicitly limit the frequency range of the perturbation. Based on empirical observations, we set the band to 1,000 to 4,000 Hz, which exhibited less distortion. The generation process is then represented as follows, based on Eqn. 4:

$$v = \mathop{\arg\min}_{\|v\| \leq \epsilon} \mathrm{Loss}\left(\mathrm{MFCC}\left(x + \mathrm{BPF}_{1000\text{--}4000\,\mathrm{Hz}}(v)\right), t\right) \tag{5}$$
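A band-pass filter restricted to 1,000 to 4,000 Hz can be sketched as a zero-phase FFT mask. This frequency-domain masking is our own illustrative implementation; the paper does not specify the filter design it uses:

```python
import numpy as np

def band_pass(signal, sr, low_hz=1000.0, high_hz=4000.0):
    """Zero out all frequency components outside [low_hz, high_hz]."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    mask = (freqs >= low_hz) & (freqs <= high_hz)
    return np.fft.irfft(spectrum * mask, n=len(signal))

sr = 16000  # DeepSpeech operates on 16 kHz audio
t = np.arange(sr) / sr
# A 500 Hz tone (outside the band) plus a 2,000 Hz tone (inside the band).
sig = np.sin(2 * np.pi * 500 * t) + np.sin(2 * np.pi * 2000 * t)
filtered = band_pass(sig, sr)
# After filtering, only the 2,000 Hz component remains.
```

Restricting the perturbation this way guarantees that what the optimizer learns is energy the speaker and microphone can actually pass through.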
3.2 Impulse Response

Based on the fact that impulse responses can reproduce the reverberation of various environments by convolution, a method has been proposed that uses impulse responses in the training of a speech recognition model to enhance its robustness to reverberation [9]. Similarly, we introduce impulse responses into the generation process in order to make the obtained adversarial example robust to reverberation.

In addition, considering the scenario of attacking numerous devices at once via outdoor speakers or radios, we want the obtained adversarial example to work in various environments. Therefore, in the same manner as Athalye et al. [6], we take an expectation over impulse responses recorded in diverse environments. Here, Eqn. 5 is extended in the same way as Eqn. 3, assuming that the set of collected impulse responses is $H$ and the convolution with impulse response $h$ is $\mathrm{Conv}_h(\cdot)$:

$$v = \mathop{\arg\min}_{\|v\| \leq \epsilon} \mathbb{E}_{h \sim H}\left[\mathrm{Loss}\left(\mathrm{MFCC}\left(\mathrm{Conv}_h\left(x + \mathrm{BPF}_{1000\text{--}4000\,\mathrm{Hz}}(v)\right)\right), t\right)\right] \tag{6}$$
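Reverberation via an impulse response is a plain convolution, and the expectation over $H$ in Eqn. 6 is approximated by averaging over sampled responses. A minimal numpy sketch follows; the synthetic exponential-decay impulse responses are our own illustrative stand-ins for the recorded ones used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_impulse_response(length=512, decay=0.01):
    # Synthetic stand-in for a recorded room impulse response:
    # a direct path followed by exponentially decaying echoes.
    h = rng.normal(size=length) * np.exp(-decay * np.arange(length))
    h[0] = 1.0  # direct sound
    return h / np.linalg.norm(h)

def apply_reverb(x, h):
    # Conv_h(x): convolve the waveform with the impulse response,
    # truncated back to the original length.
    return np.convolve(x, h)[: len(x)]

# Approximate the expectation over H by averaging over sampled responses.
H = [make_impulse_response() for _ in range(8)]
x = rng.normal(size=4096)
expected_power = np.mean([np.mean(apply_reverb(x, h) ** 2) for h in H])
```

In the actual generation loop, the quantity averaged would be the recognition loss of the reverberated waveform rather than its power, but the sampling structure is the same.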
3.3 White Gaussian Noise

White Gaussian noise is used in the evaluation of speech recognition models to measure their robustness against background noise [10]. Accordingly, we introduce white Gaussian noise into the generation process in order to make the obtained adversarial example robust to background noise. Here, Eqn. 6 is extended as follows, where $w$ is white Gaussian noise drawn from $\mathcal{N}(0, \sigma^2)$:

$$v = \mathop{\arg\min}_{\|v\| \leq \epsilon} \mathbb{E}_{h \sim H,\, w \sim \mathcal{N}(0,\sigma^2)}\left[\mathrm{Loss}\left(\mathrm{MFCC}\left(\mathrm{Conv}_h\left(x + \mathrm{BPF}_{1000\text{--}4000\,\mathrm{Hz}}(v)\right) + w\right), t\right)\right] \tag{7}$$
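Putting the three techniques together, one sampled term of the expectation in Eqn. 7 simulates the over-the-air channel before the loss is computed: band-pass the perturbation, convolve with a sampled impulse response, and add white Gaussian noise. The sketch below shows this channel simulation only; the filter design, the toy impulse response, and the noise level are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def band_pass(signal, sr, low_hz=1000.0, high_hz=4000.0):
    # Zero-phase FFT masking restricted to [low_hz, high_hz].
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    mask = (freqs >= low_hz) & (freqs <= high_hz)
    return np.fft.irfft(spectrum * mask, n=len(signal))

def simulate_over_the_air(x, v, h, sr=16000, noise_std=0.01):
    """One sampled term of Eqn. 7: Conv_h(x + BPF(v)) + w,
    with w ~ N(0, noise_std^2)."""
    played = x + band_pass(v, sr)                # band-limited perturbation
    reverbed = np.convolve(played, h)[: len(x)]  # room reverberation
    noise = rng.normal(scale=noise_std, size=len(x))
    return reverbed + noise                      # microphone/background noise

x = rng.normal(size=16000)
v = rng.normal(scale=0.1, size=16000)
h = np.zeros(256); h[0] = 1.0; h[40] = 0.3       # toy impulse response
recorded = simulate_over_the_air(x, v, h)
```

In the generation loop, the loss on `recorded` would be averaged over several sampled impulse responses and noise draws per iteration, and the gradient propagated back to `v`.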
4 Evaluation

In order to confirm the effectiveness of the proposed method, we conducted evaluation experiments: we played and recorded audio adversarial examples generated by the proposed method and verified whether they were recognized as the target phrases.

Settings. We implemented Eqn. 7 using Mozilla’s implementation of DeepSpeech [5] and generated adversarial examples. Since calculating the expected value of the loss exactly is difficult, we instead implemented Eqn. 7 so as to sample a fixed number of impulse responses randomly from $H$ at each iteration; the expected value is then approximated by the average over the sampled impulse responses.

For the input sample $x$, we used the first four seconds of Bach’s Cello Suite No. 1, the same as the public samples of Carlini et al. [4]. For the target phrase $t$, we prepared three cases: “hello world,” “open the door” (used in [7] to discuss an attack scenario using voice commands), and “ok google” (used as a trigger word of Google Home). For the set of impulse responses $H$, we collected 615 impulse responses from various databases [11, 12, 13, 14].

Then, we played and recorded each adversarial example 10 times using a speaker and a microphone (JBL CLIP2 / Sony ECM-PCV80U) and evaluated the transcriptions produced by DeepSpeech. The audio files we used are available online.

Target phrase | SNR (dB) | Success rate | Transcriptions (10 trials) by DeepSpeech
(A) hello world | 9.3 | 100% | hello world (×10)
(B) open the door | −2.7 | 100% | open the door (×10)
(C) ok google | 7.5 | 0% | oh god (×2), oh good (×8)

Table 1: Recognition results of the generated audio adversarial examples. Both “hello world” and “open the door” were successfully recognized in all 10 trials.

Results. The results are shown in Tab. 1. The success rate of the adversarial example generated to be recognized as “hello world” was 100% with a signal-to-noise ratio (SNR) of 9.3 dB. In contrast, for the previous method [7] targeting Kaldi [8], a more prominent perturbation, i.e., an SNR of less than 2.0 dB, was reported as being required for an over-the-air attack. In other words, the proposed method generates adversarial examples that are harder to notice, even though it targets a more complex speech recognition model.
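The paper does not state its exact SNR definition, but the values in Tab. 1 presumably compare the power of the original input to that of the added perturbation via the standard formula, 10 log10 of the power ratio:

```python
import numpy as np

def snr_db(signal, perturbation):
    # SNR in dB: 10 * log10(signal power / perturbation power).
    return 10.0 * np.log10(np.mean(signal ** 2) / np.mean(perturbation ** 2))

rng = np.random.default_rng(0)
signal = rng.normal(size=16000)
# A perturbation whose power is 10**(-0.93) of the signal's power
# should measure close to 9.3 dB, the value reported for "hello world".
perturbation = rng.normal(scale=np.sqrt(10 ** -0.93), size=16000)
measured = snr_db(signal, perturbation)
```

Under this definition, a higher SNR means a quieter, harder-to-notice perturbation, which is why 9.3 dB compares favorably with less than 2.0 dB.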

Moreover, as shown in Tab. 1 (B), we succeeded in generating an adversarial example recognized as “open the door,” which can be a threat in the physical world. This adversarial example has a smaller SNR than the “hello world” case, but we also confirmed that the success rate remained 50% even when the perturbation was diminished to an SNR of 1.4 dB. In other words, the attack succeeds once in two attempts and, depending on the attack scenario, can be considered a sufficient threat.

On the other hand, the generation process targeting the phrase “ok google” did not converge after 10,000 iterations, and the transcriptions differed from the desired phrase, as shown in Tab. 1 (C). This is attributable to the language model included in DeepSpeech; that is, the success rate is highly affected by the occurrence probability of the target phrase. Specifically, in the training data of DeepSpeech (the Common Voice dataset), the word “google” appears only 23 times, whereas the word “good” appears 2,367 times. This difference in occurrence probabilities made it difficult to create an adversarial example recognized as “ok google.”

5 Listening Experiment

In order to consider an attack scenario using the generated adversarial examples, it is important to know whether humans can notice them. If an attacker can have intended phrases recognized without being noticed by humans, attacks exploiting speech recognition devices become possible.

For example, Yuan et al. [7] conducted listening experiments using Amazon Mechanical Turk (AMT) in the proposal of the adversarial example generation method for Kaldi [8]. As a result, they reported that only 2.2% of the participants realized that the lyrics had changed from the original songs used as input samples, whereas approximately 65% noticed abnormal noises in the generated adversarial examples.

We similarly conducted listening experiments using AMT in order to confirm whether humans notice an attack.

ok google / turn off / open the door / happy birthday / good night / call john / hello world / airplane mode on

Table 2: List of choices presented to participants in the listening experiments. We chose simple phrases with lengths similar to those of “hello world” or “open the door,” concentrating on phrases used as voice commands.

Settings. We used the generated adversarial examples of Tab. 1 (A) and (B), which were recognized as target phrases with a success rate of 100%. We conducted an online survey separately for each adversarial example with 25 participants. They listened to the adversarial example three times, and after each listening, we asked (1) whether they heard anything abnormal (for affirmative responses, they were asked to write what they felt), (2) (with the disclosure that some voice is included) whether they heard any words (for affirmative responses, they were asked to write down the words), and (3) (with the presentation of eight phrases in Tab. 2) which phrase they believe was included.

 | Heard anything abnormal | Heard a target phrase | With the choices in Tab. 2: Correct / Incorrect / Not sure
(A) | 36.0% | 0.0% | 4.0% / 28.0% / 68.0%
(B) | 64.0% | 0.0% | 12.0% / 16.0% / 72.0%

Table 3: Results of the listening experiments for Tab. 1 (A) and (B). Although a certain number of participants noticed something abnormal, most could not hear the target phrases, even when presented with choices.

Results. The results are shown in Tab. 3. As shown in Tab. 3 (A), only 36% of the participants noticed something abnormal, providing comments such as “the sound was unclear,” “there was background noise,” and “it sounded like birds in the background.” For Tab. 3 (B), although 64% of the participants noticed something abnormal, the comments were similar to those for (A), e.g., “it was like hearing over a bad Skype connection or phone call,” and none indicated awareness of any message or utterance. No participant could hear the target phrases in either (A) or (B).

Furthermore, even when presented with the choices for the target phrases, more than half of the participants responded that they could not catch anything. Among the participants who did not respond “not sure,” the percentage who selected the correct phrase for (A) was 12.5%, the same as random selection among the eight choices, whereas that for (B) was 42.9%. The reason that (B) is more likely to be noticed can be explained by its larger perturbation magnitude, as shown in Tab. 1. Note that these results were obtained under the condition in which we explicitly instructed the participants to listen for the adversarial examples and presented them with choices for the target phrases. Moreover, the overall correct rate for (B) was 12.0%, which we consider low enough not to deter an attack scenario that seeks to remain unnoticed.

Based on the above considerations, we concluded that the generated adversarial examples sound like mere noise and are almost unnoticeable to humans, which can be a real threat. In addition, based on the comments, adversarial examples may be more difficult to notice if we use birdsong as the input samples or play the samples through a telephone.

6 Conclusion

In this research, we proposed a method to generate audio adversarial examples targeting a state-of-the-art speech recognition model that can attack even in the physical world. We generated such robust adversarial examples by introducing a band-pass filter, impulse responses, and white Gaussian noise into the generation process in order to simulate the transformations caused by over-the-air playback. In the evaluation, we confirmed that the adversarial examples generated by the proposed method can have smaller perturbations than those of the conventional method, which cannot deal with recurrent networks. Moreover, the results of the listening experiments confirmed that the obtained adversarial examples are almost unnoticeable to humans. To the best of our knowledge, this is the first approach to successfully generate audio adversarial examples that attack, in the physical world, speech recognition models that use a recurrent network.

In future work, we would like to examine detailed attack scenarios and possible defense methods regarding the audio adversarial examples generated by the proposed method. We would also like to consider the possibility of realizing a robust speech recognition model using adversarial training, as discussed for image classification [3].

7 Acknowledgment

This study was supported by JST CREST JPMJCR1302 and KAKENHI 16H02864.


  • [1] Yann LeCun, Yoshua Bengio, and Geoffrey E. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • [2] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus, “Intriguing properties of neural networks,” in ICLR, 2014, pp. 1–10.
  • [3] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy, “Explaining and harnessing adversarial examples,” in ICLR, 2015, pp. 1–11.
  • [4] Nicholas Carlini and David A. Wagner, “Audio adversarial examples: Targeted attacks on speech-to-text,” in DLS, 2018, pp. 1–7.
  • [5] Awni Y. Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng, “Deep speech: Scaling up end-to-end speech recognition,” arXiv:1412.5567, 2014.
  • [6] Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok, “Synthesizing robust adversarial examples,” in ICML, 2018, pp. 284–293.
  • [7] Xuejing Yuan, Yuxuan Chen, Yue Zhao, Yunhui Long, Xiaokang Liu, Kai Chen, Shengzhi Zhang, Heqing Huang, Xiaofeng Wang, and Carl A. Gunter, “Commandersong: A systematic approach for practical adversarial voice recognition,” in USENIX Security, 2018, pp. 1–16.
  • [8] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely, “The kaldi speech recognition toolkit,” in ASRU, 2011, pp. 1–4.
  • [9] Vijayaditya Peddinti, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Reverberation robust acoustic modeling using i-vectors with time delay neural networks,” in INTERSPEECH, 2015, pp. 2440–2444.
  • [10] John H. L. Hansen and Bryan L. Pellom, “An effective quality evaluation protocol for speech enhancement algorithms,” in ICSLP, 1998, pp. 2819–2822.
  • [11] Keisuke Kinoshita, Marc Delcroix, Takuya Yoshioka, Tomohiro Nakatani, Armin Sehr, Walter Kellermann, and Roland Maas, “The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech,” in WASPAA, 2013, pp. 1–4.
  • [12] Satoshi Nakamura, Kazuo Hiyane, Futoshi Asano, Takanobu Nishiura, and Takeshi Yamada, “Acoustical sound database in real environments for sound scene understanding and hands-free speech recognition,” in LREC, 2000, pp. 965–968.
  • [13] Marco Jeub, Magnus Schafer, and Peter Vary, “A binaural room impulse response database for the evaluation of dereverberation algorithms,” in ICDSP, July 2009, pp. 1–5.
  • [14] Jimi Y. C. Wen, Nikolay D. Gaubitch, Emanuël A. P. Habets, Tony Myatt, and Patrick A. Naylor, “Evaluation of speech dereverberation algorithms using the mardy database,” in IWAENC, 2006, pp. 1–4.