Audio Adversarial Examples: Targeted Attacks on Speech-to-Text

by Nicholas Carlini, et al.

We construct targeted audio adversarial examples on automatic speech recognition. Given any audio waveform, we can produce another that is over 99.9% similar, but transcribes as any phrase we choose (recognizing up to 50 characters per second of audio). We apply our iterative optimization-based attack to Mozilla's implementation DeepSpeech end-to-end, and show it has a 100% success rate. The feasibility of this attack introduces a new domain to study adversarial examples.




I Introduction

As the use of neural networks continues to grow, it is critical to examine their behavior in adversarial settings. Prior work [8] has shown that neural networks are vulnerable to adversarial examples [40], instances $x'$ similar to a natural instance $x$, but classified by a neural network as any (incorrect) target $t$ chosen by the adversary.

Existing work on adversarial examples has focused largely on the space of images, be it image classification [40], generative models on images [26], image segmentation [1], face detection, or reinforcement learning by manipulating the images the RL agent sees [6, 21]. In the discrete domain, there has been some study of adversarial examples over text classification [23] and malware classification [16, 20].

There has been comparatively little study on the space of audio, where the most common use is performing automatic speech recognition. In automatic speech recognition, a neural network is given an audio waveform and performs the speech-to-text transform that gives the transcription of the phrase being spoken (as used in, e.g., Apple Siri, Google Now, and Amazon Echo).

Constructing targeted adversarial examples on speech recognition has proven difficult. Hidden and inaudible voice commands [11, 41, 39] are targeted attacks, but require synthesizing new audio and cannot modify existing audio (analogous to the observation that neural networks can make high confidence predictions for unrecognizable images [33]). Other work has constructed standard untargeted adversarial examples on different audio systems [24, 13]. The current state-of-the-art targeted attack on automatic speech recognition is Houdini [12], which can only construct audio adversarial examples targeting phonetically similar phrases, leading the authors to state:

"targeted attacks seem to be much more challenging when dealing with speech recognition systems than when we consider artificial visual systems."


In this paper, we demonstrate that targeted adversarial examples exist in the audio domain by attacking DeepSpeech [18], a state-of-the-art speech-to-text transcription neural network. Figure 1 illustrates our attack: given any natural waveform $x$, we are able to construct a perturbation $\delta$ that is nearly inaudible but so that $x + \delta$ is recognized as any desired phrase. We are able to achieve this by making use of strong, iterative, optimization-based attacks based on the work of [10].

Our white-box attack is end-to-end, and operates directly on the raw samples that are used as input to the classifier. This requires optimizing through the MFC pre-processing transformation, which has been proven to be difficult [11]. Our attack works with 100% success, regardless of the desired transcription or initial source audio sample.

By starting with an arbitrary waveform, such as music, we can embed speech into audio that should not be recognized as speech; and by choosing silence as the target, we can hide audio from a speech-to-text system.

Fig. 1: Illustration of our attack: given any waveform, adding a small perturbation makes the result transcribe as any desired target phrase.

Audio adversarial examples give a new domain to explore these intriguing properties of neural networks. We hope others will build on our attacks to further study this field. To facilitate future work, we make our code and dataset available. Additionally, we encourage the reader to listen to our audio adversarial examples.

II Background

Neural Networks & Speech Recognition.

A neural network is a differentiable parameterized function $f(\cdot)$. Its parameters can be updated by gradient descent to learn any function.

We represent audio as an $N$-dimensional vector $\vec{x}$. Each element $x_i$ is a signed 16-bit value, sampled at 16 kHz. To reduce the input dimensionality, the Mel-Frequency Cepstrum (MFC) transform is often used as a preprocessing step [18]. The MFC splits the waveform into 50 frames per second, and maps each frame to the frequency domain.

Standard classification neural networks take one input and produce an output probability distribution over all output labels. However, in the case of speech-to-text systems, there are exponentially many possible labels, making it computationally infeasible to enumerate all possible phrases.

Therefore, speech recognition systems often use Recurrent Neural Networks (RNNs) to map an audio waveform to a sequence of probability distributions over individual characters, instead of over complete phrases. An RNN is a function $f(\cdot)$ which maintains a state vector $s$ with $s_0 = \vec{0}$ and $(s_{i+1}, y^i) = f(s_i, x_i)$, where the input $x_i$ is one frame of input, and each output $y^i$ is a probability distribution over which character was being spoken during that frame.

We use the DeepSpeech [18] speech-to-text system (specifically, Mozilla’s implementation [32]). Internally, it consists of a preprocessing layer which computes the MFC followed by a recurrent neural network using LSTMs [19].

Connectionist Temporal Classification (CTC) [15] is a method of training a sequence-to-sequence neural network when the alignment between the input and output sequences is not known. DeepSpeech uses CTC because the inputs are an audio sample of a person speaking, and the unaligned transcribed sentences, where the exact position of each word in the audio sample is not known.

We briefly summarize the key details and notation. We refer readers to [17] for an excellent survey of CTC.

Let $\mathcal{X}$ be the input domain (a single frame of input) and $\mathcal{Y}$ be the range (the characters a-z, space, and the special $\epsilon$ token, described below). Our neural network $f: \mathcal{X}^N \to [0,1]^{N \cdot |\mathcal{Y}|}$ takes a sequence of $N$ frames $x \in \mathcal{X}$ and returns a probability distribution over the output domain for each frame. We write $f(x)^i_j$ to mean the probability of frame $x_i \in \mathcal{X}$ having label $j \in \mathcal{Y}$. We use $p$ to denote a phrase: a sequence of characters $\langle p_i \rangle$, where each $p_i \in \mathcal{Y}$.

While $f(\cdot)$ maps every frame to a probability distribution over the characters, this does not directly give a probability distribution over all phrases. The probability of a phrase is defined as a function of the probability of each character.

We begin with two short definitions. We say that a sequence $\pi$ reduces to $p$ if starting with $\pi$ and making the following two operations (in order) yields $p$:

  1. Remove all sequentially duplicated tokens.

  2. Remove all $\epsilon$ tokens.

For example, the sequence $a\,a\,b\,\epsilon\,\epsilon\,b$ reduces to $a\,b\,b$.
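The two reduction steps can be sketched in a few lines of Python. This is only an illustration: the function name `reduce_alignment` and the use of the string "ε" for the special token are assumptions made for this example, not part of DeepSpeech.

```python
EPS = "ε"  # hypothetical stand-in for the special CTC token


def reduce_alignment(tokens, blank=EPS):
    """Apply the two CTC reduction steps, in order."""
    # Step 1: remove all sequentially duplicated tokens.
    deduped = []
    for tok in tokens:
        if not deduped or deduped[-1] != tok:
            deduped.append(tok)
    # Step 2: remove all blank tokens.
    return [tok for tok in deduped if tok != blank]


print(reduce_alignment(["a", "a", "b", EPS, EPS, "b"]))  # ['a', 'b', 'b']
```

The order matters: removing blanks first would merge the two "b" tokens into one, so a repeated character can only survive if a blank separates its two occurrences.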

Further, we say that $\pi$ is an alignment of $p$ with respect to $y$ (formally: $\pi \in \Pi(p, y)$) if (a) $\pi$ reduces to $p$, and (b) the length of $\pi$ is equal to the length of $y$. The probability of alignment $\pi$ under $y$ is the product of the likelihoods of each of its elements:

$$\Pr(\pi \mid y) = \prod_i y^i_{\pi^i}$$

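With these definitions, we can now define the probability of a given phrase $p$ under the distribution $y = f(x)$ as

$$\Pr(p \mid y) = \sum_{\pi \in \Pi(p, y)} \Pr(\pi \mid y) = \sum_{\pi \in \Pi(p, y)} \prod_i y^i_{\pi^i}$$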

As is usually done, the loss function used to train the network is the negative log likelihood of the desired phrase:

$$\text{CTC-Loss}(f(x), p) = -\log \Pr(p \mid f(x))$$

Despite the exponential search space, this loss can be computed efficiently with dynamic programming [15].
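To make the dynamic-programming claim concrete, the sketch below computes the phrase probability twice: once by enumerating every alignment, and once with the standard CTC forward recursion. It assumes labels are integers with index 0 as the blank token, and that `y` is a list of per-frame probability lists; all function names are hypothetical and chosen for this example.

```python
import itertools
import math

BLANK = 0  # assumed index of the blank token


def reduce_alignment(pi):
    """Drop sequential duplicates, then drop blanks."""
    out, prev = [], None
    for tok in pi:
        if tok != prev:
            out.append(tok)
        prev = tok
    return [tok for tok in out if tok != BLANK]


def phrase_prob_bruteforce(y, p):
    """Sum the probability of every alignment that reduces to p (exponential)."""
    n_frames, n_labels = len(y), len(y[0])
    total = 0.0
    for pi in itertools.product(range(n_labels), repeat=n_frames):
        if reduce_alignment(pi) == list(p):
            total += math.prod(y[i][pi[i]] for i in range(n_frames))
    return total


def phrase_prob_dp(y, p):
    """Same quantity via the CTC forward recursion (polynomial time)."""
    ext = [BLANK]  # phrase interleaved with blanks: _ p1 _ p2 _ ...
    for c in p:
        ext += [c, BLANK]
    size = len(ext)
    alpha = [0.0] * size
    alpha[0] = y[0][ext[0]]
    if size > 1:
        alpha[1] = y[0][ext[1]]
    for t in range(1, len(y)):
        new = [0.0] * size
        for s in range(size):
            acc = alpha[s]
            if s >= 1:
                acc += alpha[s - 1]
            # Skipping a blank is allowed only between two different characters.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                acc += alpha[s - 2]
            new[s] = acc * y[t][ext[s]]
        alpha = new
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


def ctc_loss(y, p):
    """Negative log likelihood of phrase p under per-frame distributions y."""
    return -math.log(phrase_prob_dp(y, p))
```

On small inputs the two functions agree, which is a convenient sanity check before relying on the recursion alone.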

Finally, to decode a vector $y$ to a phrase $p$, we search for the phrase $p$ that best aligns to $y$:

$$C(x) = \arg\max_p \Pr(p \mid f(x))$$

Because computing $C(\cdot)$ requires searching an exponential space, it is typically approximated in one of two ways.

  • Greedy Decoding searches for the most likely alignment (which is easy to find) and then reduces this alignment to obtain the transcribed phrase: $C_{\text{greedy}}(x) = \text{reduce}(\arg\max_{\pi} \Pr(\pi \mid f(x)))$

  • Beam Search Decoding simultaneously evaluates the likelihood of multiple alignments and then chooses the most likely phrase under these alignments. We refer the reader to [15] for a complete algorithm description.
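Greedy decoding, the first of the two approximations above, is simple enough to sketch directly. The example assumes `y` is a list of per-frame probability lists and `alphabet` maps each index to a character, with "ε" as the blank; these names are illustrative only.

```python
def greedy_decode(y, alphabet, blank="ε"):
    """Pick the most likely label per frame, then reduce the alignment."""
    # The most likely alignment is just the per-frame argmax.
    best_path = [
        alphabet[max(range(len(frame)), key=lambda j: frame[j])] for frame in y
    ]
    decoded, prev = [], None
    for tok in best_path:
        # Keep a token only if it starts a new run and is not the blank.
        if tok != prev and tok != blank:
            decoded.append(tok)
        prev = tok
    return "".join(decoded)


alphabet = ["ε", "a", "b"]
frames = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.8],    # b
    [0.1, 0.2, 0.7],    # b
]
print(greedy_decode(frames, alphabet))  # ab
```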

Adversarial Examples.

Evasion attacks have long been studied on machine learning classifiers [29, 4, 5], and are practical against many types of models [8].

When discussing neural networks, these evasion attacks are referred to as adversarial examples [40]: for any input $x$, it is possible to construct a sample $x'$ that is similar to $x$ (according to some metric) but so that $C(x) \ne C(x')$ [8]. In the audio domain, these untargeted adversarial examples are usually not interesting: causing a speech-to-text system to transcribe “test sentence” as the misspelled “test sentense” does little to help an adversary.

Targeted Adversarial Examples are a more powerful attack: not only must the classification of $x$ and $x'$ differ, but the network must assign a specific label (chosen by the adversary) to the instance $x'$. The purpose of this paper is to show that targeted adversarial examples are possible with only slight distortion on speech-to-text systems.

III Audio Adversarial Examples

III-A Threat Model & Evaluation Benchmark

Threat Model.

Given an audio waveform $x$, and target transcription $y$, our task is to construct another audio waveform $x' = x + \delta$ so that $x$ and $x'$ sound similar (formalized below), but so that $C(x') = y$. We report success only if the output of the network matches exactly the target phrase (i.e., contains no misspellings or extra characters).

We assume a white-box setting where the adversary has complete knowledge of the model and its parameters. This is the threat model taken in most prior work [14]. Just as later work in the space of images showed black-box attacks are possible [35, 22], we expect that our attacks can be extended to black-box attacks. Additionally, we assume our adversarial examples are directly classified without any noise introduced (e.g., by playing them over-the-air and then recording them with a microphone). Initial work on image-based adversarial examples also made this same assumption, which was later shown unnecessary [27, 2].

Distortion Metric.

How should we quantify the distortion introduced by a perturbation $\delta$? In the space of images, despite some debate [36], most of the community has settled on $l_p$ metrics [10], most often using $l_\infty$ [14, 30], the maximum amount any pixel has been changed. We follow this convention for our audio attacks.

We measure distortion in Decibels (dB): a logarithmic scale that measures the relative loudness of an audio sample:

$$dB(x) = \max_i 20 \cdot \log_{10}(x_i)$$

To say that some signal is “10 dB” is only meaningful when comparing it relative to some other reference point. In this paper, we compare the dB level of the distortion $\delta$ to the original waveform $x$. To make this explicit, we write

$$dB_x(\delta) = dB(\delta) - dB(x)$$

Because the perturbation introduced is quieter than the original signal, the distortion is a negative number, where smaller values indicate quieter distortions.
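As a worked illustration of why the relative distortion is negative, the sketch below assumes the loudness of a waveform is its peak absolute amplitude on a $20 \log_{10}$ scale and that the relative distortion is the difference of the two levels. The helper names `db` and `relative_db` are hypothetical, and taking the absolute value of each sample is an assumption made so that negative samples are handled.

```python
import math


def db(samples):
    """Peak loudness in dB (assumes at least one non-zero sample)."""
    peak = max(abs(s) for s in samples)
    return 20.0 * math.log10(peak)


def relative_db(delta, x):
    """Loudness of the perturbation relative to the original waveform."""
    return db(delta) - db(x)


x = [1000, -2000, 500]   # toy "original" samples
delta = [20, -10, 5]     # toy perturbation, 100x quieter at its peak
print(relative_db(delta, x))  # approximately -40.0
```

A perturbation whose peak is 100 times smaller than the peak of the original comes out at roughly -40 dB, and every further factor of 10 subtracts another 20 dB.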

While this metric may not be a perfect measure of distortion, as long as the perturbation is small enough, it will be imperceptible to humans. We encourage the reader to listen to our adversarial examples to hear how similar they sound. Alternatively, later, in Figure 2, we visualize two waveforms which transcribe to different phrases overlaid.

Evaluation Benchmark.

To evaluate the effectiveness of our attack, we construct targeted audio adversarial examples on the first 100 test instances of the Mozilla Common Voice dataset. For each sample, we target different incorrect transcriptions, chosen at random such that (a) the transcription is incorrect, and (b) it is theoretically possible to reach that target.

III-B An Initial Formulation

As is commonly done [8, 40], we formulate the problem of constructing an adversarial example as an optimization problem: given a natural example $x$ and any target phrase $t$, we solve the formulation

$$\text{minimize } dB_x(\delta) \quad \text{such that } C(x + \delta) = t, \quad x + \delta \in [-M, M]$$

Here $M$ represents the maximum representable value ($2^{15}$ in our case). This constraint can be handled by clipping the values of $\delta$; for notational simplicity we omit it from future formulation. Due to the non-linearity of the constraint $C(x + \delta) = t$, standard gradient-descent techniques do not work well with this formulation.

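Prior work [40] has resolved this through the reformulation

$$\text{minimize } dB_x(\delta) + c \cdot \ell(x + \delta, t)$$

where the loss function $\ell(\cdot)$ is constructed so that $\ell(x', t) \le 0 \iff C(x') = t$. The parameter $c$ trades off the relative importance of being adversarial and remaining close to the original example.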

Constructing a loss function with this property is much simpler in the domain of images than in the domain of audio; on images, $f(x)_y$ directly corresponds to the probability of the input $x$ having label $y$. In contrast, for audio, we use a second decoding step to compute $C(\cdot)$, and so constructing a loss function is nontrivial.

To begin, we use the CTC loss as the loss function: $\ell(x', t) = \text{CTC-Loss}(x', t)$. For this loss function, one direction of the implication holds true (i.e., $\ell(x', t) \le 0 \implies C(x') = t$) but the converse does not. Fortunately, this means that the resulting solution will still be adversarial; it just may not be minimally perturbed.

The second difficulty we must address is that when using an $l_\infty$ distortion metric, this optimization process will often oscillate around a solution without converging [10]. Therefore, instead we initially solve the formulation

$$\text{minimize } |\delta|_2^2 + c \cdot \ell(x + \delta, t) \quad \text{such that } dB_x(\delta) \le \tau$$

for some sufficiently large constant $\tau$. Upon obtaining a partial solution $\delta^*$ to the above problem, we reduce $\tau$ and resume minimization, repeating until no solution can be found.
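The shrinking-bound procedure can be illustrated on a toy problem. The sketch below is not the attack on DeepSpeech: it swaps in a tiny linear scorer for the classifier, a hinge loss for the CTC loss, plain gradient descent for Adam, a per-sample amplitude bound for the dB constraint, and a fixed step budget for the stopping rule. Every name and constant here is an assumption chosen for illustration.

```python
def scores(weights, x):
    """Toy linear scorer: one score per label (a stand-in for the network)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def bounded_attack(weights, x, target, c=10.0, lr=0.01, tau=1.0,
                   shrink=0.8, steps=500):
    """Minimize |delta|^2 + c * loss subject to |delta_i| <= tau, shrinking tau."""
    delta = [0.0] * len(x)
    best = None
    for _ in range(steps):
        s = scores(weights, [a + d for a, d in zip(x, delta)])
        rival = max((j for j in range(len(s)) if j != target), key=lambda j: s[j])
        if s[target] > s[rival]:
            # Partial solution found: remember it, then tighten the bound.
            best = list(delta)
            tau *= shrink
        grad = [2.0 * d for d in delta]  # gradient of |delta|^2
        if s[rival] >= s[target]:        # hinge loss is active
            grad = [g + c * (wr - wt)
                    for g, wr, wt in zip(grad, weights[rival], weights[target])]
        # Gradient step followed by clipping to the current bound.
        delta = [min(max(d - lr * g, -tau), tau) for d, g in zip(delta, grad)]
    return best


W = [[1.0, 0.0], [0.0, 1.0]]
print(bounded_attack(W, [1.0, 0.0], target=1))
```

Once the bound becomes too tight for any solution to exist, the iterate stays pinned at the clipping boundary and the last successful perturbation is returned.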

To solve this formulation, we differentiate through the entire classifier to generate our adversarial examples — starting from the audio sample, through the MFC, and neural network, to the final loss. We solve the minimization problem over the complete audio sample simultaneously. This is in contrast with prior work on hidden voice commands [11], which were generated sequentially, one frame at a time. We solve the minimization problem with the Adam [25] optimizer using a learning rate of , for a maximum of iterations.


We are able to generate targeted adversarial examples with 100% success for each of the source-target pairs with a mean perturbation of dB. For comparison, this is roughly the difference between ambient noise in a quiet room and a person talking [38]. We encourage the reader to listen to our audio adversarial examples. The interval for distortion ranged from dB to dB.

The longer a phrase is, the more difficult it is to target: every extra character requires approximately a dB increase in distortion. However, conversely, we observe that the longer the initial source phrase is, the easier it is to make it target a given transcription. These two effects roughly counteract each other (although we were not able to measure this to a statistically significant degree of certainty).

Generating a single adversarial example requires approximately one hour of compute time on commodity hardware (a single NVIDIA 1080Ti). However, due to the massively parallel nature of GPUs, we are able to construct adversarial examples simultaneously, reducing the time for constructing any given adversarial example to only a few minutes. (Due to implementation difficulties, after constructing adversarial examples simultaneously, we must fine-tune them individually afterwards.)

III-C Improved Loss Function

Carlini & Wagner [10] demonstrate that the choice of loss function impacts the final distortion of generated adversarial examples by a factor of or more. We now show the same holds in the audio domain, but to a lesser extent. While CTC loss is highly useful for training the neural network, we show that a carefully designed loss function allows generating better lower-distortion adversarial examples. For the remainder of this section, we focus on generating adversarial examples that are only effective when using greedy decoding.

In order to minimize the CTC loss (as done in § III-B), an optimizer will make every aspect of the transcribed phrase more similar to the target phrase. That is, if the target phrase is “ABCD” and we are already decoding to “ABCX”, minimizing CTC loss will still cause the “A” to be more “A”-like, despite the fact that the only important change we require is for the “X” to be turned into a “D”.

This effect of making items classified more strongly as the desired label despite already having that label led to the design of a more effective loss function:

$$\ell(x, t) = \max\left(\max_{t' \ne t} f(x)_{t'} - f(x)_t,\ 0\right)$$

Once the probability of item $t$ is larger than any other item, the optimizer no longer sees a reduction in loss by making it more strongly classified with that label.

We now adapt this loss function to the audio domain. Assume we were given an alignment $\pi$ that aligns the phrase $p$ with the probabilities $y$. Then the loss of this sequence is

$$L(x, \pi) = \sum_i \ell(f(x)^i, \pi_i)$$
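A minimal sketch of this loss, assuming each frame's output is given as a plain list of scores and the alignment is a list of target indices, one per frame. The function names are hypothetical.

```python
def hinge_loss(frame_scores, target):
    """Zero once the target label scores at least as high as every other label."""
    rival = max(v for j, v in enumerate(frame_scores) if j != target)
    return max(rival - frame_scores[target], 0.0)


def alignment_loss(all_frame_scores, alignment):
    """Sum of per-frame hinge losses under a fixed alignment."""
    return sum(hinge_loss(f, t) for f, t in zip(all_frame_scores, alignment))


frames = [[2.0, 5.0, 1.0], [4.0, 0.0, 1.0]]
print(alignment_loss(frames, [0, 0]))  # 3.0: only the first frame is still wrong
```

Frames that already decode to the desired label contribute nothing, so the optimizer spends its distortion budget only on the frames that still need to change.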

We make one further improvement on this loss function. The constant $c$ used in the minimization formulation determines the relative importance of being close to the original symbol versus being adversarial. A larger value of $c$ allows the optimizer to place more emphasis on reducing $\ell(\cdot)$.

In audio, consistent with prior work [11] we observe that certain characters are more difficult for the transcription to recognize. When we choose only one constant $c$ for the complete phrase, it must be large enough so that we can make the most difficult character be transcribed correctly. This forces $c$ to be larger than necessary for the easier-to-target segments. To resolve this issue, we instead use the following formulation:

$$\text{minimize } |\delta|_2^2 + \sum_i c_i \cdot L_i(x + \delta, \pi_i) \quad \text{such that } dB_x(\delta) < \tau$$

where $L_i(x, \pi_i) = \ell(f(x)^i, \pi_i)$. Computing the loss function requires choice of an alignment $\pi$. If we were not concerned about runtime efficiency, in principle we could try all alignments and select the best one. However, this is computationally prohibitive.

Instead, we use a two-step attack:

  1. First, we let $x_0$ be an adversarial example found using the CTC loss (following § III-B). CTC loss explicitly constructs an alignment during decoding. We extract the alignment $\pi$ that is induced by $x_0$ (by computing $\pi = \arg\max_i f(x_0)^i$). We fix this alignment $\pi$ and use it as the target in the second step.

  2. Next, holding the alignment $\pi$ fixed, we generate a less-distorted adversarial example $x'$ targeting the alignment $\pi$ using the improved loss function above to minimize $|\delta|_2^2 + \sum_i c_i \cdot L_i(x + \delta, \pi_i)$, starting gradient descent at the initial point $\delta = x_0 - x$.


We repeat the evaluation from Section III-B (above), and generate targeted adversarial examples for the first 100 instances of the Common Voice test set. We are able to reduce the mean distortion from dB to dB. However, the adversarial examples we generate are now only guaranteed to be effective against a greedy decoder; against a beam-search decoder, the transcribed phrases are often more similar to the target phrase than the original phrase, but do not perfectly match the target.

Fig. 2: Original waveform (blue, thick line) with adversarial waveform (orange, thin line) overlaid; it is nearly impossible to notice a difference. The audio waveform was chosen randomly from the attacks generated and is 500 samples long.

Figure 2 shows two waveforms overlaid; the blue, thick line is the original waveform, and the orange, thin line the modified adversarial waveform. This sample was chosen randomly from among the training data, and corresponds to a distortion of dB. Even visually, these two waveforms are nearly indistinguishable.

III-D Audio Information Density

Recall that the input waveform is converted into 50 frames per second of audio, and DeepSpeech outputs one probability distribution of characters per frame. This places the theoretical maximum density of audio at 50 characters per second. We are able to generate adversarial examples that produce output at this maximum rate. Thus, short audio clips can transcribe to a long textual phrase.

The loss function is simpler in this setting. The only alignment of $p$ to $y$ is the assignment $\pi = p$. This means that the logit-based loss function can be applied directly without first heuristically finding an alignment; any other alignment would require omitting some character.

We perform this attack and find it is effective, although it requires a mean distortion of dB.

III-E Starting from Non-Speech

Not only are we able to construct adversarial examples that cause DeepSpeech to transcribe the incorrect text for a person’s speech, we are also able to begin with an arbitrary non-speech audio sample and make it be recognized as any target phrase. No technical novelty on top of what was developed above is required to mount this attack: we only let the initial audio waveform be non-speech.

To evaluate the effectiveness of this attack, we take five-second clips from classical music (which contain no speech) and target phrases contained in the Common Voice dataset. We have found that this attack requires more computational effort (we perform iterations of gradient descent) and the total distortion is slightly larger, with a mean of dB.

III-F Targeting Silence

Finally, we find it is possible to hide speech by adding adversarial noise that causes DeepSpeech to transcribe nothing. While performing this attack without modification (by just targeting the empty phrase) is effective, we can slightly improve on this if we define silence to be an arbitrary length sequence of only the space character repeated. With this definition, to obtain silence, we should let

We find that targeting silence is easier than targeting a specific phrase: with distortion less than dB below the original signal, we can turn any phrase into silence.

This partially explains why it is easier to construct adversarial examples when starting with longer audio waveforms than shorter ones: because the longer phrase contains more sounds, the adversary can silence the ones that are not required and obtain a subsequence that nearly matches the target. In contrast, for a shorter phrase, the adversary must synthesize new characters that did not exist previously.

IV Audio Adversarial Example Properties

IV-A Evaluating Single-Step Methods

In contrast to prior work which views adversarial examples as “blind spots” of a neural network, Goodfellow et al. [14] argue that adversarial examples are largely effective due to the locally linear nature of neural networks.

Fig. 3: CTC loss when interpolating between the original audio sample and the adversarial example (blue, solid line), compared to traveling equally far in the direction suggested by the fast gradient sign method (orange, dashed line). Adversarial examples exist far enough away from the original audio sample that solely relying on the local linearity of neural networks is insufficient to construct targeted adversarial examples.

The Fast Gradient Sign Method (FGSM) [14] demonstrates that this is true in the space of images. FGSM takes a single step in the direction of the gradient of the loss function. That is, given network $f$ with loss function $\ell$, we compute the adversarial example as

$$x' \leftarrow x - \epsilon \cdot \text{sign}(\nabla_x \ell(x, y))$$

Intuitively, for each pixel in an image, this attack asks “in which direction should we modify this pixel to minimize the loss?” and then takes a small step in that direction for every pixel simultaneously. This attack can be applied directly to audio, changing individual samples instead of pixels.
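The single-step update is short enough to sketch in plain Python. The example assumes the caller supplies a gradient function; here a toy quadratic loss with a known gradient stands in for the CTC loss of a real network, and all names are illustrative.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def fgsm(x, grad_fn, eps):
    """One targeted fast-gradient-sign step: move every sample by eps downhill."""
    grad = grad_fn(x)
    return [xi - eps * sign(g) for xi, g in zip(x, grad)]


def toy_grad(x, goal=(1.0, 1.0, -3.0)):
    """Gradient of the toy loss 0.5 * |x - goal|^2 (a stand-in for the real loss)."""
    return [xi - gi for xi, gi in zip(x, goal)]


print(fgsm([0.0, 1.0, -1.0], toy_grad, eps=0.1))
```

Each sample moves by exactly `eps` (or not at all where the gradient is zero), which is why the resulting perturbation has a fixed maximum amplitude regardless of how large the individual gradient entries are.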

However, we find that this type of single-step attack is not effective on audio adversarial examples: the inherent non-linearity introduced in computing the MFCCs, along with the depth of many rounds of LSTMs, introduces a large degree of non-linearity in the output.

In Figure 3 we compare the value of the CTC loss when traveling in the direction of a known adversarial example, compared to traveling in the fast gradient sign direction. While initially (near the source audio sample), the fast gradient direction is more effective at reducing the loss function, it quickly plateaus and does not decrease afterwards. On the other hand, iterative optimization-based attacks find a direction that eventually leads to an adversarial example. (Only when the CTC loss is below 10 does the phrase have the correct transcription.)

We do, however, observe that the FGSM can be used to produce untargeted audio adversarial examples that cause a phrase to be misclassified (although optimization methods again can do so with less distortion).

IV-B Robustness of Adversarial Examples

The minimally perturbed adversarial examples we construct in Section III-B can be made non-adversarial by trivial modifications to the input. Here, we demonstrate that it is possible to construct adversarial examples robust to various forms of noise.

Robustness to pointwise noise.

Given an adversarial example $x'$, adding pointwise random noise $\sigma$ to $x'$ and returning $C(x' + \sigma)$ will cause $x'$ to lose its adversarial label, even if the distortion is small enough to allow normal examples to retain their classification.

We generate a high confidence adversarial example [8, 10], and make use of Expectation over Transforms [2] to generate an adversarial example robust to this synthetic noise at . The adversarial perturbation increases by approximately dB when we do this.

Robustness to MP3 compression.

Following [3], we make use of the straight-through estimator [7] to construct adversarial examples robust to MP3 compression. We generate an adversarial example $x'$ such that $C(\text{MP3}(x'))$ is classified as the target label by computing gradients of the CTC-Loss assuming that the gradient of the MP3 compression is the identity function. While individual gradient steps are likely not correct, in aggregate the gradients average out to become useful. This allows us to generate adversarial examples with approximately larger distortion that remain robust to MP3 compression.
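The straight-through idea can be shown on a toy problem. In the sketch below a simple rounding quantizer stands in for MP3 compression (it is equally non-differentiable, but it is not MP3), and a squared-error loss stands in for the CTC loss. The forward pass applies the quantizer; the backward pass treats it as the identity. All names and constants are assumptions for illustration.

```python
def lossy(v, step=0.25):
    """Toy non-differentiable transform: snap a value to a grid."""
    return round(v / step) * step


def ste_attack(goal, lr=0.1, steps=200, step=0.25):
    """Gradient descent through lossy() using the straight-through estimator."""
    x = [0.0] * len(goal)
    for _ in range(steps):
        # Forward pass: the loss is computed on the transformed signal.
        out = [lossy(v, step) for v in x]
        # Backward pass: d(lossy)/dx is treated as 1, so the gradient of
        # (out - goal)^2 with respect to x is approximated by 2 * (out - goal).
        grad = [2.0 * (o - g) for o, g in zip(out, goal)]
        x = [v - lr * gr for v, gr in zip(x, grad)]
    return x


x = ste_attack([0.5, -1.0])
print([lossy(v) for v in x])  # [0.5, -1.0]
```

The true gradient of the quantizer is zero almost everywhere, so ordinary backpropagation would never move `x`; pretending the transform is the identity gives an imprecise but usable descent direction.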

V Open Questions

Can these attacks be played over-the-air?

Image-based adversarial examples have been shown to be feasible in the physical world [27, 2]. In the audio space, both hidden voice commands and Dolphin Attack’s inaudible voice commands are effective over-the-air when played by a speaker and recorded by a microphone [11, 41].

The audio adversarial examples we construct in this paper do not remain adversarial after being played over-the-air, and therefore present a limited real-world threat; however, just as the initial work on image-based adversarial examples did not consider the physical channel and only later was it shown to be possible, we believe further work will be able to produce audio adversarial examples that are effective over-the-air.

Do universal adversarial perturbations [31] exist?

One surprising observation is that on the space of images, it is possible to construct a single perturbation that when applied to an arbitrary image will make its classification incorrect. These attacks would be powerful on audio, and would correspond to a perturbation that could be played to cause any other waveform to be recognized as a target phrase.

Are audio adversarial examples transferable?

That is, given an audio sample $x$, can we generate a single perturbation $\delta$ so that $f_i(x + \delta) = y$ for multiple classifiers $f_i$? Transferability is believed to be a fundamental property of neural networks [34], significantly complicates constructing robust defenses [9], and allows attackers to mount black-box attacks [28]. Evaluating transferability on the audio domain is an important direction for future work.

Which existing defenses can be applied to audio?

To the best of our knowledge, all existing defenses to adversarial examples have only been evaluated on image domains. If the defender’s objective is to produce a robust neural network, then it should improve resistance to adversarial examples on all domains, not just on images. Audio adversarial examples give another point of comparison.

VI Conclusion

We demonstrate targeted audio adversarial examples are effective on automatic speech recognition. With optimization-based attacks applied end-to-end, we are able to turn any audio waveform into any target transcription with 100% success by only adding a slight distortion. We can cause audio to transcribe up to 50 characters per second (the theoretical maximum), cause music to transcribe as arbitrary speech, and hide speech from being transcribed.

We present preliminary evidence that audio adversarial examples have different properties from those on images by showing that linearity does not hold on the audio domain. We hope that future work will continue to investigate audio adversarial examples, and separate the fundamental properties of adversarial examples from properties which occur only on image recognition.


This work was supported by National Science Foundation award CNS-1514457, Qualcomm, and the Hewlett Foundation through the Center for Long-Term Cybersecurity.