Imperceptible, Robust, and Targeted Adversarial Examples for Automatic Speech Recognition

03/22/2019 ∙ by Yao Qin, et al.

Adversarial examples are inputs to machine learning models designed by an adversary to cause an incorrect output. So far, adversarial examples have been studied most extensively in the image domain. In this domain, adversarial examples can be constructed by imperceptibly modifying images to cause misclassification, and are practical in the physical world. In contrast, current targeted adversarial examples applied to speech recognition systems have neither of these properties: humans can easily identify the adversarial perturbations, and they are not effective when played over-the-air. This paper makes advances on both of these fronts. First, we develop effectively imperceptible audio adversarial examples (verified through a human study) by leveraging the psychoacoustic principle of auditory masking, while retaining a 100% targeted success rate. Second, we make progress towards physical-world over-the-air audio adversarial examples by constructing perturbations which remain effective even after applying realistic simulated environmental distortions.

1 Introduction

Adversarial examples (Szegedy et al., 2013) are inputs that have been specifically designed by an adversary to cause a machine learning algorithm to produce a misclassification (Biggio et al., 2013). Initial work on adversarial examples focused mainly on the domain of image classification. In order to differentiate properties of adversarial examples on neural networks in general from properties which hold true only on images, it is important to study adversarial examples in different domains. Indeed, adversarial examples are known to exist in domains ranging from reinforcement learning (Huang et al., 2017) to reading comprehension (Jia & Liang, 2017) to speech recognition (Carlini & Wagner, 2018). This paper focuses on the last of these domains, where Carlini & Wagner (2018) showed that any given source audio sample can be perturbed slightly so that an automatic speech recognition (ASR) system transcribes the audio as an arbitrary different target sentence.

To date, adversarial examples on ASR differ from adversarial examples on images in two key ways. First, adversarial examples on images are imperceptible to humans: it is possible to generate an adversarial example without changing the 8-bit brightness representation (Szegedy et al., 2013). Conversely, adversarial examples on ASR systems are often perceptible. While the perturbation introduced is often small in magnitude, upon listening it is obvious that the added perturbation is present (Schönherr et al., 2018). Second, adversarial examples on images work in the physical world (Kurakin et al., 2016) (e.g., even when taking a picture of them). In contrast, adversarial examples on ASR systems do not yet work in such an “over-the-air” setting where they are played by a speaker and recorded by a microphone.

In this paper, we improve the construction of adversarial examples on ASR and match the power of attacks on images by developing adversarial examples which are imperceptible, and by making steps towards robust adversarial examples. (The project webpage is at http://cseweb.ucsd.edu/~yaq007/imperceptible-robust-adv.html.)

In order to generate imperceptible adversarial examples, we depart from the ℓp distance measures widely used in adversarial example research. Instead, we make use of the psychoacoustic principle of auditory masking, and only add the adversarial perturbation to regions of the audio where it will not be heard by a human, even if this perturbation is not “quiet” in terms of absolute energy.

Further investigating properties of adversarial examples which appear to be different from images, we examine the ability of an adversary to construct physical-world adversarial examples (Kurakin et al., 2016). These are inputs that, even after taking into account the distortions introduced by the physical world, remain adversarial upon classification. We make initial steps towards developing audio which can be played over-the-air by designing audio which remains adversarial after being processed by random room-environment simulators (Scheibler et al., 2018).

Finally, we demonstrate that our attack is capable of attacking the modern, state-of-the-art Lingvo ASR system (Shen et al., 2019).

2 Related Work

We build on a long line of work studying the robustness of neural networks. This research area largely began with (Biggio et al., 2013; Szegedy et al., 2013), who first studied adversarial examples for deep neural networks.

This paper focuses on adversarial examples on automatic speech recognition systems. Early work in this space (Gong & Poellabauer, 2017; Cisse et al., 2017) was successful when generating untargeted adversarial examples that produced incorrect, but arbitrary, transcriptions. A concurrent line of work succeeded at generating targeted attacks in practice, even when played over a speaker and recorded by a microphone (a so-called “over-the-air” attack) but only by both (a) synthesizing completely new audio and (b) targeting older, traditional (i.e., not neural network based) speech recognition systems (Carlini et al., 2016; Zhang et al., 2017; Song & Mittal, 2017).

These two lines of work were partially unified by Carlini & Wagner (2018) who constructed adversarial perturbations for speech recognition systems targeting arbitrary (multi-word) sentences. However, this attack was neither effective over-the-air, nor was the adversarial perturbation completely inaudible; while the perturbations it introduces are very quiet, they can be heard by a human (see § 7.2). Concurrently, the CommanderSong (Yuan et al., 2018) attack developed adversarial examples that are effective over-the-air but at a cost of introducing a significant perturbation to the original audio.

Following this, concurrent work with ours develops attacks on deep learning ASR systems that either work over-the-air or are less obviously perceptible.

  • Yakura & Sakuma (2018) create adversarial examples which can be played over-the-air. These attacks are highly effective on short two- or three-word phrases, but not on the full-sentence phrases originally studied. Further, these adversarial examples often have a significantly larger perturbation, and in one case the perturbation introduced had a higher magnitude than the original audio.

  • Schönherr et al. (2018) work towards developing attacks that are less perceptible through using “Psychoacoustic Hiding” and attack the Kaldi system, which is partially based on neural networks but also uses some “traditional” components such as a Hidden Markov Model instead of an RNN for final classification. Because of the system differences we cannot directly compare our results to those of this paper, but we encourage the reader to listen to examples from both papers.

Our work, concurrent with these, manages to achieve both of these goals (almost) simultaneously: we generate adversarial examples that are both nearly imperceptible and also remain effective after simulated distortions. At the same time, we target a state-of-the-art neural network-based ASR system, Lingvo, as opposed to Kaldi, and generate full-sentence adversarial examples rather than targeting short phrases.

A final line of work extends adversarial example generation on ASR systems from the white-box setting (where the adversary has complete knowledge of the underlying classifier) to the black-box setting

(Khare et al., 2018; Taori et al., 2018) (where the adversary is only allowed to query the system). This work is complementary to and independent of ours: we assume a white-box threat model.

3 Background

3.1 Problem Definition

Given an input audio waveform x, a target transcription y, and an automatic speech recognition (ASR) system f(·) which outputs a final transcription, our objective is to construct an imperceptible and targeted adversarial example x' that can attack the ASR system when played over-the-air. That is, we seek to find a small perturbation δ, which enables x' = x + δ to meet three requirements:

  • Targeted: the classifier is fooled so that f(x') = y and y ≠ f(x). Untargeted adversarial examples on ASR systems often only introduce spelling errors and so are less interesting to study.

  • Imperceptible: x' sounds so similar to x that humans cannot differentiate x' and x when listening to them.

  • Robust: x' is still effective when played by a speaker and recorded by a microphone in an over-the-air attack. (We do not achieve this goal completely, but do succeed at simulated environments.)

3.1.1 ASR Model

We mount our attacks on the Lingvo classifier (Shen et al., 2019), a state-of-the-art sequence-to-sequence model (Sutskever et al., 2014) with attention (Bahdanau et al., 2014) whose architecture is based on the Listen, Attend and Spell model (Chan et al., 2016). It feeds MFCCs into an encoder consisting of a stack of convolutional and LSTM layers, which conditions an LSTM decoder that outputs the transcription. The use of the sequence-to-sequence framework allows the entire model to be trained end-to-end with the standard cross-entropy loss function.
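
To make the architecture concrete, below is a minimal sketch of a Listen, Attend and Spell style encoder-decoder in Python (PyTorch). It illustrates the general design only and is not the Lingvo implementation; the class name TinyLAS, the layer sizes, the 1-D convolution, and the dot-product attention are all illustrative choices.

import torch
import torch.nn as nn

class TinyLAS(nn.Module):
    def __init__(self, n_feat=40, n_hidden=256, n_vocab=32):
        super().__init__()
        # "Listen": convolutional front-end followed by an LSTM encoder over acoustic features.
        self.conv = nn.Conv1d(n_feat, n_hidden, kernel_size=3, padding=1)
        self.encoder = nn.LSTM(n_hidden, n_hidden, num_layers=2, batch_first=True)
        # "Attend and Spell": LSTM decoder conditioned on an attention context over encoder states.
        self.embed = nn.Embedding(n_vocab, n_hidden)
        self.decoder = nn.LSTMCell(2 * n_hidden, n_hidden)
        self.out = nn.Linear(2 * n_hidden, n_vocab)

    def forward(self, feats, targets):
        # feats: (B, T, n_feat) acoustic features; targets: (B, U) token ids (teacher forcing).
        h = torch.relu(self.conv(feats.transpose(1, 2))).transpose(1, 2)   # (B, T, H)
        enc, _ = self.encoder(h)                                           # (B, T, H)
        B, H = feats.size(0), enc.size(-1)
        state = (feats.new_zeros(B, H), feats.new_zeros(B, H))
        context = feats.new_zeros(B, H)
        logits = []
        for u in range(targets.size(1)):
            emb = self.embed(targets[:, u])                                # previous target token
            state = self.decoder(torch.cat([emb, context], dim=-1), state)
            att = torch.softmax((state[0].unsqueeze(1) * enc).sum(-1), dim=-1)   # dot-product attention
            context = (att.unsqueeze(-1) * enc).sum(1)                     # attention context vector
            logits.append(self.out(torch.cat([state[0], context], dim=-1)))
        return torch.stack(logits, dim=1)   # (B, U, n_vocab), trained with cross-entropy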

3.1.2 Threat Model

In this paper, as is done in most prior work, we consider the white box threat model where the adversary has full access to the model as well as its parameters. In particular, the adversary is allowed to compute gradients through the model in order to generate adversarial examples.

When we mount over-the-air attacks, we do not assume we know the exact configurations of the room in which the attack will be performed. Instead, we assume we know the distribution from which the room will be drawn, and generate adversarial examples so as to be effective on any room drawn from this distribution.

3.2 Adversarial Example Generation

Adversarial examples are typically generated by performing gradient descent with respect to the input on a loss function designed to be minimized when the input is adversarial (Szegedy et al., 2013). Specifically, let x be an input to a neural network f(·), let δ be a perturbation, and let ℓ(x + δ, y) be a loss function that is minimized when f(x + δ) = y. Most work on adversarial examples focuses on minimizing the max-norm (ℓ∞ norm) of δ. Then, the typical adversarial example generation algorithm (Szegedy et al., 2013; Carlini & Wagner, 2017; Madry et al., 2017) solves

minimize_δ ℓ(x + δ, y)   such that   ‖δ‖∞ ≤ ε

(some formulations instead fold this constraint into the loss as a penalty term). Here, ε controls the maximum perturbation introduced.

To generate adversarial examples on ASR systems, Carlini & Wagner (2018) set ℓ to the CTC-loss and use the max-norm bound, which has the effect of adding a small amount of adversarial perturbation consistently throughout the audio sample.
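
The following sketch illustrates this standard max-norm bounded attack. It is not the authors' code: loss_and_grad is an assumed helper returning the targeted loss ℓ(f(x + δ), y) and its gradient with respect to δ for some differentiable ASR model, and the signed-gradient update is just one common choice; the default ε, learning rate, and step count mirror the values reported in Appendix C.1.

import numpy as np

def make_adv_example(x, y, loss_and_grad, eps=2000.0, lr=100.0, steps=1000):
    """Find delta with ||delta||_inf <= eps so that f(x + delta) transcribes as y."""
    delta = np.zeros_like(x, dtype=np.float64)
    for _ in range(steps):
        loss, grad = loss_and_grad(x + delta, y)      # targeted loss, minimized when f(x + delta) = y
        delta = delta - lr * np.sign(grad)            # signed-gradient step (one common choice)
        delta = np.clip(delta, -eps, eps)             # project back onto the max-norm ball
        delta = np.clip(x + delta, -2**15, 2**15 - 1) - x   # keep 16-bit audio in the valid range
    return delta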

4 Imperceptible Adversarial Examples

Unlike on images, where minimizing the ℓp distortion between an image and the nearest misclassified example yields a visually indistinguishable image, on audio this is not the case (Schönherr et al., 2018). Thus, in this work, we depart from ℓp distortion measures and instead rely on the extensive work which has been done in the audio space for capturing the human perceptibility of audio.

4.1 Psychoacoustic Models

A good understanding of the human auditory system is critical in order to be able to construct imperceptible adversarial examples. In this paper, we use frequency masking, which refers to the phenomenon that a louder signal (the “masker”) can make other signals at nearby frequencies (the “maskees”) imperceptible (Mitchell, 2004; Lin & Abdulla, 2015). In simple terms, the masker can be seen as creating a “masking threshold” in the frequency domain. Any signals which fall under this threshold are effectively imperceptible.

Because the masking threshold is measured in the frequency domain, and because audio signals change rapidly over time, we first compute the short-time Fourier transform of the raw audio signal to obtain the spectrum of overlapping sections (called “windows”) of the signal. The window size N is 2048 samples; windows are extracted with a “hop size” of 512 samples and are windowed with the modified Hann window. We denote by s_x(k) the k-th bin of the spectrum of frame x.

Then, we compute the log-magnitude power spectral density (PSD) as follows:

p_x(k) = 10 log10 |(1/N) s_x(k)|².   (1)

The normalized PSD estimate p̄_x(k) is defined by Lin & Abdulla (2015) as

p̄_x(k) = 96 − max_k{p_x(k)} + p_x(k).   (2)
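
A minimal sketch of this spectral analysis in Python (numpy) is shown below. The Hann window and the 96 dB reference level follow the standard psychoacoustic-model convention assumed in Eqns 1-2; the exact “modified Hann” scaling and whether the maximum is taken per window are implementation details glossed over here.

import numpy as np

def frame_spectra(x, n_fft=2048, hop=512):
    win = np.hanning(n_fft)                            # Hann-type analysis window
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.stack(frames), axis=-1)      # s_x(k), one row per window

def normalized_psd(spectrum, n_fft=2048):
    p = 10.0 * np.log10(np.abs(spectrum / n_fft) ** 2 + 1e-20)   # Eqn 1: log-magnitude PSD
    return 96.0 - p.max() + p                                    # Eqn 2: normalize to a 96 dB reference
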
Masking Threshold

Given an audio input, in order to compute its masking threshold, we first need to identify the maskers, whose normalized PSD estimates must satisfy three criteria: 1) they must be local maxima in the spectrum; 2) they must be higher than the threshold in quiet; and 3) they must have the largest amplitude within 0.5 Bark (a psychoacoustically-motivated frequency scale) around the masker’s frequency. Then, each masker’s individual masking threshold can be approximated using the simple two-slope spread function, which is derived to mimic the excitation patterns of maskers. Finally, the global masking threshold is a combination of the individual masking thresholds as well as the threshold in quiet via addition (because the effect of masking is additive in the logarithmic domain). We refer interested readers to our appendix and to Lin & Abdulla (2015) for specifics on computing the masking threshold.

When we add the perturbation δ to the audio input x, if the normalized PSD estimate of the perturbation p̄_δ(k) is under the frequency masking threshold θ_x(k) of the original audio x, the perturbation will be masked out by the raw audio and therefore be inaudible to humans. The normalized PSD estimate of the perturbation can be calculated via

p̄_δ(k) = 96 − max_k{p_x(k)} + p_δ(k),   (3)

where p_δ(k) and p_x(k) are the PSD estimates of the perturbation and the original audio input, respectively.

4.2 Optimization with Masking Threshold

Loss function

Given an audio example x and a target phrase y, we formulate the problem of constructing an imperceptible adversarial example as minimizing the loss function L(x, δ, y), which is defined as:

L(x, δ, y) = ℓ_net(f(x + δ), y) + α · ℓ_θ(x, δ),   (4)

where ℓ_net requires that the adversarial example fool the audio recognition system into making the targeted prediction f(x + δ) = y, where y ≠ f(x). In the Lingvo model, the simple cross-entropy loss function is used for ℓ_net. The term ℓ_θ(x, δ) constrains the normalized PSD estimate of the perturbation p̄_δ(k) to be under the frequency masking threshold θ_x(k) of the original audio x. The hinge loss is used here to compute the loss for the masking threshold:

ℓ_θ(x, δ) = (1 / (⌊N/2⌋ + 1)) · Σ_{k=0}^{⌊N/2⌋} max{ p̄_δ(k) − θ_x(k), 0 },   (5)

where N is the predefined window size and ⌊·⌋ outputs the greatest integer no larger than its argument. The adaptive parameter α balances the relative importance of these two criteria.
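
A sketch of the hinge loss of Eqn 5 is given below; averaging over both frequency bins and analysis windows is assumed here, and the precise reduction over windows is an implementation choice.

import numpy as np

def masking_hinge_loss(p_bar_delta, theta_x):
    """p_bar_delta, theta_x: arrays of shape (num_windows, n_fft // 2 + 1)."""
    excess = np.maximum(p_bar_delta - theta_x, 0.0)    # hinge: zero wherever delta stays under the threshold
    return excess.mean()                               # average over frequency bins (and windows)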

Stability

The logarithm in the masking threshold θ_x(k) and in the normalized PSD estimate p̄_δ(k) of the perturbation leads to instability during back-propagation. Therefore, we remove the 10 log10(·) term in the PSD estimates of δ and x, which then become

p_x(k) = |(1/N) s_x(k)|²,   p_δ(k) = |(1/N) s_δ(k)|²,   (6)

and the normalized PSD of the perturbation turns into

p̄_δ(k) = (10^9.6 / max_k{p_x(k)}) · |(1/N) s_δ(k)|².   (7)

Correspondingly, the threshold becomes

θ̄_x(k) = 10^{θ_x(k)/10}.   (8)

4.2.1 Two Stage Attack

Empirically, we find it is difficult to directly minimize the masking threshold loss function ℓ_θ via back-propagation without any constraint on the magnitude of the perturbation δ. This is reasonable because it is very challenging to simultaneously fool the neural network and force a very large perturbation to lie under the masking threshold in the frequency domain. In contrast, if the perturbation is relatively small in magnitude, it is much easier to push the remaining distortion under the frequency masking threshold.

Therefore, we divide the optimization into two stages: the first stage of optimization focuses on finding a relatively small perturbation to fool the network (as was done in prior work (Carlini & Wagner, 2018)) and the second stage makes the adversarial examples imperceptible.

In the first stage, we set α in Eqn 4 to zero and clip the perturbation to be within a relatively small range. As a result, the first stage solves:

minimize_δ ℓ_net(f(x + δ), y)   such that   ‖δ‖∞ ≤ ε,   (9)

where ‖δ‖∞ represents the max-norm of δ. Specifically, we begin by setting δ := 0 and then on each iteration update

δ ← clip_ε( δ − lr_1 · ∇_δ ℓ_net(f(x + δ), y) ),   (10)

where lr_1 is the learning rate and ∇_δ ℓ_net is the gradient of ℓ_net with respect to δ. We initially set ε to a large value and then gradually reduce it during optimization, following Carlini & Wagner (2018).

The second stage focuses on making the adversarial examples imperceptible, with an unbounded max-norm; instead, δ is only constrained by the masking threshold. Specifically, we initialize δ with the perturbation optimized in the first stage and then on each iteration update

δ ← δ − lr_2 · ∇_δ L(x, δ, y),   (11)

where lr_2 is the learning rate and ∇_δ L is the gradient of L with respect to δ. The loss function L is defined in Eqn. 4. The parameter α that balances the network loss ℓ_net and the imperceptibility loss ℓ_θ is initialized with a small value, e.g., 0.05, and is adaptively updated according to the performance of the attack. Specifically, every twenty iterations, if the current adversarial example successfully fools the ASR system (i.e., f(x + δ) = y), then α is increased to attempt to make the adversarial example less perceptible. Correspondingly, every fifty iterations, if the current adversarial example fails to make the targeted prediction, we decrease α. We check for attack failure less frequently than success (fifty vs. twenty iterations) to allow more iterations for the network to converge. The details of the optimization algorithm are further explained in the appendix.
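
The following Python sketch summarizes this two-stage procedure. The helpers net_loss_grad, imp_loss_grad, and transcribe are assumed stand-ins (not part of any released API), and the specific ε-tightening and α-update factors are illustrative, not the exact schedule used in the paper.

import numpy as np

def two_stage_attack(x, y, net_loss_grad, imp_loss_grad, transcribe,
                     eps=2000.0, lr1=100.0, lr2=1.0, steps1=1000, steps2=4000, alpha=0.05):
    delta = np.zeros_like(x, dtype=np.float64)
    # Stage 1: find a small perturbation that already fools the model.
    for _ in range(steps1):
        _, g = net_loss_grad(x + delta, y)
        delta = np.clip(delta - lr1 * np.sign(g), -eps, eps)
        if transcribe(x + delta) == y:
            eps = 0.8 * np.abs(delta).max()        # tighten the max-norm bound (factor is illustrative)
    # Stage 2: push the remaining distortion under the masking threshold.
    for i in range(steps2):
        _, g_net = net_loss_grad(x + delta, y)
        _, g_imp = imp_loss_grad(x, delta)
        delta = delta - lr2 * (g_net + alpha * g_imp)
        if i % 20 == 0 and transcribe(x + delta) == y:
            alpha *= 1.2                           # attack still succeeds: weight imperceptibility more
        if i % 50 == 0 and transcribe(x + delta) != y:
            alpha *= 0.8                           # attack failing: back off on imperceptibility
    return delta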

5 Robust Adversarial Examples

5.1 Acoustic Room Simulator

In order to improve the robustness of adversarial examples when played over-the-air, we use an acoustic room simulator to create artificial utterances (speech with reverberations) that mimic playing the audio over-the-air. The transformation function in the acoustic room simulator, denoted t(·), takes the clean audio x as input and outputs the simulated speech with reverberation t(x). First, the room simulator applies the classic Image Source Method introduced in (Allen & Berkley, 1979; Scheibler et al., 2018) to create the room impulse response r based on the room configuration (the room dimensions, the source audio and target microphone locations, and the reverberation time). Then, the generated room impulse response r is convolved with the clean audio to create the speech with reverberation: t(x) = x ∗ r, where ∗ denotes the convolution operation. To make the generated adversarial examples robust to various environments, multiple room impulse responses are used. Therefore, the transformation function t follows a chosen distribution T over different room configurations.
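
A sketch of this simulation step using the pyroomacoustics implementation of the Image Source Method is shown below. The room-dimension and position ranges, the image-source order, and the omission of explicit reverberation-time control are illustrative simplifications, not the distribution used in the paper.

import numpy as np
import pyroomacoustics as pra

def simulate_reverb(audio, fs=16000, seed=None):
    rng = np.random.default_rng(seed)
    room_dim = rng.uniform([3.0, 3.0, 2.5], [8.0, 8.0, 4.0])       # random shoebox room (meters)
    source = rng.uniform([0.5, 0.5, 0.5], room_dim - 0.5)          # random loudspeaker position
    mic = rng.uniform([0.5, 0.5, 0.5], room_dim - 0.5)             # random microphone position
    room = pra.ShoeBox(room_dim, fs=fs, max_order=17)              # image-source model of the room
    room.add_source(source.tolist())
    room.add_microphone_array(pra.MicrophoneArray(mic.reshape(3, 1), fs))
    room.compute_rir()
    r = np.array(room.rir[0][0])                                   # room impulse response r
    return np.convolve(audio, r)[:len(audio)]                      # t(x) = x * r, trimmed to input length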

5.2 Optimization with Reverberations

In this section, our objective is to make the perturbed speech with reverberation t(x + δ) (rather than the clean audio x + δ) fool the ASR system. As a result, the generated adversarial example will be passed through the room simulator first to create the simulated speech with reverberation t(x + δ), mimicking playing the adversarial example over-the-air, and then the simulated t(x + δ) will be fed as the new input to fool the ASR system, aiming at f(t(x + δ)) = y. Simultaneously, the adversarial perturbation δ should be relatively small in order not to be audible to humans.

In the same manner as the Expectation over Transformation of (Athalye et al., 2018), we optimize the expectation of the loss function over different transformations t ∼ T as follows:

minimize_δ  E_{t∼T}[ ℓ_net(f(t(x + δ)), y) ]   such that   ‖δ‖∞ ≤ ε.   (12)

Rather than directly targeting f(x + δ) = y, we apply the loss function ℓ_net (the cross-entropy loss in the Lingvo network) to the classification of the transformed speech t(x + δ). We approximate the gradient of the expected value via independently sampling a transformation t from the distribution T at each gradient descent step.

In the first stage of the optimization, we initialize ε with a sufficiently large value and gradually reduce it, following Carlini & Wagner (2018). We consider the adversarial example successful if it fools the ASR system under a single random room configuration; that is, if f(t(x + δ)) = y for just one sampled t. Once this optimization is complete, we obtain the resulting max-norm bound for δ, denoted ε_1, and use the optimized perturbation as the initialization for δ in the next stage.

Then, in the following iterations, we finetune the perturbation δ with a much smaller learning rate. The max-norm bound is increased to ε_2 = ε_1 + Δε, where Δε > 0, and held constant during optimization. During this phase, we only consider the attack successful if the adversarial example successfully fools a set of randomly chosen transformations {t_1, …, t_M}, where t_i ∼ T and M is the size of the set. The transformation set is randomly sampled from the distribution T at each gradient descent step. In other words, the adversarial example generated in this stage satisfies f(t_i(x + δ)) = y for all i ∈ {1, …, M}. In this way, we can generate robust adversarial examples that successfully attack ASR systems when the exact room environment is not known ahead of time, but whose configuration is drawn from a pre-defined distribution. More details of the algorithm are given in the appendix.
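
The sketch below illustrates the per-step sampling and the stricter multi-room success check of this second stage. transcribe, sample_transform, and loss_grad_wrt_delta are assumed stand-ins (sample_transform could, e.g., wrap the simulator above); the set size of 10 rooms mirrors the value mentioned in Appendix C.3 but is otherwise a free parameter.

def robust_success(x, delta, y, transcribe, sample_transform, num_rooms=10):
    # Attack counts as successful only if every sampled room yields the target transcription.
    transforms = [sample_transform() for _ in range(num_rooms)]    # t_1, ..., t_M ~ T
    return all(transcribe(t(x + delta)) == y for t in transforms)

def eot_gradient(x, delta, y, loss_grad_wrt_delta, sample_transform):
    # One-sample Monte Carlo estimate of the gradient of E_{t~T}[ell_net(f(t(x + delta)), y)],
    # back-propagated through the room convolution (which is linear, hence differentiable).
    t = sample_transform()
    return loss_grad_wrt_delta(x, delta, y, t)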

It should be emphasized that there is a tradeoff between imperceptibility and robustness (as we show experimentally in Section 7.2). If we increase the maximum amplitude ε of the perturbation δ, the robustness can always be further improved. Correspondingly, it becomes much easier for humans to perceive the adversarial perturbation and become alert to the attack on the ASR system. In order to keep these adversarial examples mostly imperceptible, we therefore limit the amplitude of the perturbation to a reasonable range.

6 Imperceptible and Robust Attacks

By combining both of the techniques developed earlier, we now generate adversarial examples that are both imperceptible and robust. This can be achieved by minimizing the loss

E_{t∼T}[ ℓ_net(f(t(x + δ)), y) ] + α · ℓ_θ(x, δ),   (13)

where the cross-entropy loss ℓ_net is again the loss used for Lingvo, and the imperceptibility loss ℓ_θ is the same as that defined in Eqn 5. Since we need to fool the ASR system when the speech is played after random transformations, the cross-entropy loss forces the transcription of the transformed adversarial example t(x + δ) to be y (again, as done earlier).

To further make these adversarial examples imperceptible, we optimize ℓ_θ(x, δ) to constrain the perturbation δ to fall under the masking threshold of the clean audio x in the frequency domain. This is much easier than optimizing the hinge loss of the reverberated signals, because the frequency masking threshold of the clean audio can be pre-computed, while the masking threshold of the speech with reverberation varies with the room reverberation r. In addition, optimizing ℓ_θ(x, δ) and optimizing the corresponding loss on the reverberated signals have similar effects based on the convolution theorem: the Fourier transform of a convolution of two signals is the pointwise product of their Fourier transforms. Note that the speech with reverberation is a convolution of the clean audio and a simulated room impulse response r, hence:

F(t(x)) = F(x ∗ r) = F(x) ⊙ F(r),   (14)

where F is the Fourier transform, ∗ denotes the convolution operation and ⊙ represents the pointwise product. We apply the short-time Fourier transform to the perturbation δ and to the raw audio signal x in order to compute the power spectral density and the masking threshold in the frequency domain. Since most of the energy in the room impulse response falls within the spectral analysis window, the convolution theorem in Eqn 14 is approximately satisfied within each window. Therefore, we arrive at:

(15)

As a result, optimizing the imperceptibility loss ℓ_θ(x, δ) can help in finding the optimal δ and in constructing imperceptible adversarial examples that can attack ASR systems in the physical world.
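
For intuition, the convolution theorem invoked in Eqn 14 can be checked numerically on a single analysis window (with zero-padding so that circular and linear convolution coincide):

import numpy as np

x = np.random.randn(2048)        # one analysis window of audio
r = np.random.randn(256)         # a short room impulse response
n = len(x) + len(r) - 1          # length at which circular and linear convolution coincide
lhs = np.fft.fft(np.convolve(x, r), n)
rhs = np.fft.fft(x, n) * np.fft.fft(r, n)
assert np.allclose(lhs, rhs)     # F(x * r) equals the pointwise product F(x) . F(r)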

Specifically, we first initialize δ with the perturbation that makes the adversarial examples robust in Section 5. Then, in each iteration, we randomly sample a transformation t from the distribution T and update δ according to

δ ← δ − lr · ∇_δ [ ℓ_net(f(t(x + δ)), y) + α · ℓ_θ(x, δ) ],   (16)

where lr is the learning rate and α, a parameter that balances the importance of robustness and imperceptibility, is adaptively changed based on the performance of the adversarial example. Specifically, if the constructed adversarial example can successfully attack a set of randomly chosen transformations, then α is increased to focus more on the imperceptibility loss. Otherwise, α is decreased to make the attack more robust to multiple room environments. The implementation details are given in the appendix.

7 Evaluation

7.1 Datasets and Evaluation Metrics

Datasets

We use the LibriSpeech dataset (Panayotov et al., 2015) in our experiments, which is a corpus of 16KHz English speech derived from audiobooks and is used to train the Lingvo system (Shen et al., 2019). We randomly select 1000 audio examples as source examples, and 1000 separate transcriptions from the test-clean dataset to be the targeted transcriptions. We ensure that each target transcription is around the same length as the original transcription because it is unrealistic and overly challenging to perturb a short audio clip (e.g., 10 words) to have a much longer transcription (e.g., 20 words). Examples of the original and targeted transcriptions are available in the appendix.


Input           Clean    Adversarial
Accuracy (%)    58.60    100.00
WER (%)          4.47      0.00

Table 1: Sentence-level accuracy and WER for 1000 clean and (imperceptible) adversarially perturbed examples, fed without over-the-air simulation into the Lingvo model. For “Clean”, the ground truth is the original transcription. For “Adversarial”, the ground truth is the targeted transcription.
Input           Clean    Robust (ε = 300)    Robust (ε = 400)    Imperceptible & Robust
Accuracy (%)    31.37    62.96               64.64               49.65
WER (%)         15.42    14.45               13.83               22.98

Table 2: Sentence-level accuracy and WER for 100 clean and adversarially perturbed examples, fed with over-the-air simulation into the Lingvo model. The ground truth for “Clean” inputs is the original transcription, while the ground truth for the adversarial inputs is the targeted transcription. The adversarial perturbations are bounded in max-norm by the indicated ε.
Evaluation Metrics

For automatic speech recognition, we evaluate our model using the standard word error rate (WER) metric, which is defined as WER = (S + D + I) / N_W × 100%, where S, D and I are the numbers of substitutions, deletions and insertions of words, respectively, and N_W is the total number of words in the reference.

We also calculate the success rate (sentence-level accuracy) as N_correct / N_total × 100%, where N_total is the number of audio examples that we test and N_correct is the number of audio examples that are correctly transcribed. Here, “correctly transcribed” means matching the original transcription for clean audio and the targeted transcription for adversarial examples.
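
For concreteness, a straightforward implementation of the WER metric via word-level edit distance is sketched below (this is the standard definition, not code from the paper):

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance counting substitutions, deletions and insertions.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: word_error_rate("old will is a fine fellow", "old will is fine fellows") == 33.33...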

Figure 1: Results of the human study for imperceptibility. Here, “baseline” denotes the adversarial examples generated by Carlini & Wagner (2018), and “ours” denotes the imperceptible adversarial examples generated following the algorithm in Section 4.

7.2 Imperceptibility Analysis

To attack the Lingvo ASR system, we construct 1000 imperceptible and targeted adversarial examples, one for each of the examples we sampled from the LibriSpeech test-clean dataset. Table 1 shows the performance of the clean audio and the constructed adversarial examples. The word error rate (WER) of the clean audio is just 4.47% on the 1000 test examples, indicating the model is of high quality. Our imperceptible adversarial examples perform even better, reaching a 100% success rate with a WER of 0% against the targeted transcriptions.

7.2.1 Qualitative Human Study

Of the 1000 examples selected from the test set, we randomly selected a subset along with their corresponding imperceptible adversarial examples. We then generate an adversarial example using the prior work of Carlini & Wagner (2018) for the same target phrase; this attack again achieves a 100% success rate. We perform three experiments to validate that our adversarial examples are imperceptible, especially compared to prior work.

Experimental Design.

We recruit users online from Amazon Mechanical Turk. We give each user one of the three (nearly identical) experiments, each of which we describe below. In all cases, the experiments consist of 20 “comparison tasks”, where we present the evaluator with some audio samples and ask them questions (described below) about the samples. We ask the user to listen to each sample with headphones on, and answer a simple question about the audio samples (the question is determined by which experiment we run, as given below). We do not explain the purpose of the study other than that it is a research study, and do not record any personally identifying information. (Unfortunately, for this reason, we are unable to report aggregate statistics such as age or gender, slightly harming potential reproducibility.) We randomly include a small number of questions with known, obvious answers; we remove 3 users from the study who failed to answer these questions correctly.

In all experiments, users have the ability to listen to audio files multiple times when they are unsure of the answer, making it as difficult as possible for our adversarial examples to pass as clean data. Users additionally have the added benefit of hearing examples back-to-back, effectively “training” them to recognize subtle differences. Indeed, a permutation test finds users are statistically significantly better at distinguishing adversarial examples from clean audio during the second half of the experiment compared to the first half, although the magnitude of the difference is small.

Figure 1 summarizes the statistical results we give below.

Experiment 1: clean or noisy.

We begin with what we believe is the most representative experiment of how an attack would work in practice. We give users one audio sample and ask them to tell us if it has any background noise (e.g., static, echoing, people talking in the background).

As a baseline, a nontrivial fraction of users believed that the original clean audio samples contained some amount of noise, and a larger fraction believed that the adversarial examples generated by Carlini & Wagner (2018) contained noise. In comparison, the fraction of users who believed that the adversarial examples we generate contain any noise is not statistically significantly different from that for the clean audio. That is, when presented with just one audio sample in isolation, users do not believe the adversarial examples we generate are any noisier than the clean samples.

Experiment 2: identify the original.

We give users two audio samples and inform them that one of the audio samples is a modified version of the other; we ask the user to select the audio sample that sounds like the more natural audio. This setup is much more challenging: when users can listen to both the before and after, it is often possible to pick up on the small amount of distortion that has been added. When comparing the original audio to the adversarial examples generated by Carlini & Wagner (2018), evaluators chose the original audio a large majority of the time. When we instead have evaluators compare the imperceptible adversarial examples we generate to those of Carlini & Wagner (2018), our imperceptible examples are selected as the better audio sample at a rate that is not statistically distinguishable from the rate at which the clean audio itself is preferred over Carlini & Wagner (2018).

However, when directly comparing the adversarial examples we generate to the clean audio, users still prefer the clean audio more often than not. Observe that the baseline percentage, when the samples are completely indistinguishable, is 50%. Thus, users perform only somewhat better than random guessing at distinguishing our examples from clean examples.

Experiment 3: identical or not.

Finally, we perform the most difficult experiment: we present users with two audio files and ask them if the audio samples are identical or if there are any differences. As the baseline, when given the same audio sample twice, users agreed it was identical most of the time (that is, in a minority of cases the evaluator wrongly heard a difference between the two samples). When given a clean audio sample and the audio generated by Carlini & Wagner (2018), users believed the pair to be identical far less often. Comparing clean audio to the adversarial examples we generate, users believed the pair to be completely identical more often than for the adversarial examples generated by the baseline, but still below the rate observed for actually-identical audio.

7.3 Robustness Analysis

To mount our simulated over-the-air attacks, we consider a challenging setting in which the exact configuration of the room where the attack will be performed is unknown. Instead, we are only aware of the distribution from which the room configuration will be drawn. First, we generate 1000 random room configurations sampled from the distribution as the training room set. The test room set includes another 100 random room configurations sampled from the same distribution. Adversarial examples are created to attack the Lingvo ASR system when played in the simulated test rooms. We randomly choose 100 audio examples from the LibriSpeech dataset to perform this robustness test.

As shown in Table 2, when fed non-adversarial audio played in the simulated test rooms, the WER of the Lingvo ASR degrades to 15.42%, which suggests some robustness to reverberation. In contrast, the success rates of the adversarial examples in (Carlini & Wagner, 2018) and of our imperceptible adversarial examples from Section 4 drop to essentially zero in this setting. The success rate of our robust adversarial examples generated with the algorithm in Section 5 is over 60%, and the corresponding WER is smaller than that of the clean audio. Both the success rate and the WER demonstrate that our constructed adversarial examples remain effective when played in the highly-realistic simulated environment.

In addition, the robustness of the constructed adversarial examples can be improved further at the cost of increased perceptibility. As presented in Table 2, when we increase the max-norm bound on the amplitude of the adversarial perturbation (ε is increased from 300 to 400), both the success rate and the WER improve correspondingly. Since our final objective is to generate imperceptible and robust adversarial examples that can be played over-the-air in the physical world, we limit the max-norm bound of the perturbation to a relatively small range to avoid a large distortion of the clean audio.

To construct imperceptible as well as robust adversarial examples, we start from the robust attack and finetune it with the imperceptibility loss. In our experiments, we observe that 81 of the 100 robust adversarial examples can be further improved to be much less perceptible while still retaining high robustness (around 50% success rate, with a WER of 22.98%). (The other 19 adversarial examples lose their robustness because they cannot successfully attack the ASR system in 8 randomly chosen training rooms in any iteration during optimization.)

7.3.1 Qualitative Human Study

We run identical experiments (as described earlier) on the robust and robust-and-imperceptible adversarial examples.

In experiment 1, where we ask evaluators if there is any noise, only a small fraction of evaluators heard any noise on the clean audio, compared to a substantially larger fraction on the robust (but perceptible) adversarial examples and an intermediate fraction on the robust and imperceptible adversarial examples. (Evaluators stated they heard noise on clean examples less often compared to the baseline in the prior study. We believe this is due to the fact that when primed with examples which are obviously different, the baseline becomes more easily distinguishable.)

In experiment 2, where we ask evaluators to identify the original audio, evaluators correctly identified the original audio more often when comparing clean audio to the robust adversarial examples than when comparing the clean audio to the imperceptible and robust adversarial examples.

Finally, in experiment 3, where we ask evaluators if the audio is identical, the clean audio was judged different from the robust adversarial examples more often than it was judged different from the imperceptible and robust adversarial examples.

In all cases, the imperceptible and robust adversarial examples are statistically significantly less perceptible than the merely robust adversarial examples, but also statistically significantly more perceptible than the clean audio. When directly comparing the imperceptible and robust adversarial examples to the robust examples, evaluators believed the imperceptible examples had less distortion in the majority of cases.

Clearly, the adversarial examples that are robust are significantly easier to distinguish from clean audio, even when we apply the masking threshold. However, this result is consistent with work on adversarial examples on images, where completely imperceptible physical-world adversarial examples have not been successfully constructed either. On images, physical attacks require substantially more distortion to be effective in the physical world (see, for example, Figure 4 of Kurakin et al. (2016)).

8 Conclusion

In this paper, we successfully construct imperceptible adversarial examples (verified by a human study) for automatic speech recognition based on the psychoacoustic principle of auditory masking, while retaining a 100% targeted success rate on arbitrary full-sentence targets. Simultaneously, we also make progress towards developing robust adversarial examples that remain effective after being played over-the-air (processed by random room-environment simulators), increasing the practicality of actual real-world attacks using adversarial examples targeting ASR systems.

We believe that future work is still required: our robust adversarial examples do not play fully over-the-air, despite working in simulated room environments. Resolving this difficulty while maintaining a high targeted success rate is necessary for demonstrating a practical security concern.

As a final contribution of potentially independent interest, this work demonstrates how one might go about constructing adversarial examples for non-ℓp-based metrics. Especially on images, nearly all adversarial example research has focused on this highly limited class of distance measures. Devoting effort to identifying the different methods that humans use to assess similarity, and generating adversarial examples exploiting those metrics, is an important research effort we hope future work will explore.

References

Appendix

Appendix A Frequency Masking Threshold

In this section, we detail how we compute the frequency masking threshold for constructing imperceptible adversarial examples. This procedure is based on psychoacoustic principles which were refined over many years of human studies. For further background on psychoacoustic models, we refer the interested reader to (Lin & Abdulla, 2015; Mitchell, 2004).

Step 1: Identifications of Maskers

In order to compute the frequency masking threshold of an input audio signal x, we need to first identify the maskers. There are two different classes of maskers, tonal and nontonal, where nontonal maskers have stronger masking effects compared to tonal maskers. Here we simply treat all maskers as tonal ones to make sure the threshold that we compute can always mask out the noise. The normalized PSD estimate p̄_x(k) of a tonal masker must meet three criteria. First, it must be a local maximum in the spectrum, satisfying:

p̄_x(k) ≥ p̄_x(k + Δk),   Δk ∈ {−1, +1},   (17)

where k indexes the frequency bins of the spectrum.

Second, the normalized PSD estimate of any masker must be higher than the threshold in quiet ATH(f), which is:

p̄_x(k) ≥ ATH(f(k)),   (18)

where ATH(f) is approximated by the following frequency-dependent function (with f in Hz):

ATH(f) = 3.64 (f/1000)^−0.8 − 6.5 exp(−0.6 (f/1000 − 3.3)²) + 10^−3 (f/1000)⁴.   (19)

The quiet threshold is only defined over the human hearing range. When we perform the short-time Fourier transform (STFT) of a signal, the relation between the frequency f and the index k of the sampling points is

f(k) = (k / N) · f_s,   (20)

where f_s is the sampling frequency and N is the window size.

Last, the maskers must have the highest PSD within a range of 0.5 Bark around the masker’s frequency. The main human hearing range between 20 Hz and 16 kHz is divided into 24 non-overlapping critical bands, whose unit is the Bark, varying as a function of frequency as follows:

b(f) = 13 arctan(0.00076 f) + 3.5 arctan((f / 7500)²).   (21)

As the effect of masking is additive in the logarithmic domain, the PSD estimate of each masker is further smoothed with its neighbors by:

p̄_x(k̄) = 10 log10( 10^{0.1 p̄_x(k−1)} + 10^{0.1 p̄_x(k)} + 10^{0.1 p̄_x(k+1)} ).   (22)

Step 2: Individual masking thresholds

An individual masking threshold is more conveniently computed with frequency expressed on the Bark scale, because the spreading function of a masker is similar at different Barks. We use b(i) to represent the Bark scale of the frequency index i. There are a number of spreading functions introduced to imitate the characteristics of maskers, and here we choose the simple two-slope spread function:

(23)

where Δb = b(j) − b(i), and b(i) and b(j) are the Bark scales of the masker at frequency index i and the maskee at frequency index j, respectively. Then, SF[i, j] refers to the contribution of the masker at Bark index b(i) to the masking effect on the maskee at Bark index b(j). Empirically, the individual threshold T[i, j] is calculated by:

(24)

where SF[i, j] is the spreading function of Eqn 23.
Step 3: Global masking threshold

The global masking threshold is a combination of the individual masking thresholds as well as the threshold in quiet, combined via addition. The global masking threshold at frequency index k, measured in decibels (dB), is calculated according to:

θ_x(k) = 10 log10( 10^{0.1 ATH(f(k))} + Σ_{i ∈ M} 10^{0.1 T[i, k]} ),   (25)

where M is the set of all the selected maskers. The computed θ_x(k) is used as the frequency masking threshold of the input audio x to construct imperceptible adversarial examples.
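
The sketch below ties these steps together in simplified form. The absolute threshold of hearing and the Bark mapping are the standard approximations of Eqns 19 and 21, but the spreading function here is a deliberately crude triangular stand-in for the two-slope function of Eqn 23, and the 0.5-Bark masker-pruning step is omitted.

import numpy as np

def ath_db(f_hz):
    # Absolute threshold of hearing (Eqn 19), frequency in Hz, result in dB.
    f = np.maximum(f_hz, 20.0) / 1000.0
    return 3.64 * f ** -0.8 - 6.5 * np.exp(-0.6 * (f - 3.3) ** 2) + 1e-3 * f ** 4

def bark(f_hz):
    # Bark scale (Eqn 21).
    return 13.0 * np.arctan(7.6e-4 * f_hz) + 3.5 * np.arctan((f_hz / 7500.0) ** 2)

def masking_threshold(p_bar, fs=16000, n_fft=2048):
    """p_bar: normalized PSD of one analysis window, shape (n_fft // 2 + 1,)."""
    freqs = np.arange(len(p_bar)) * fs / n_fft
    quiet = ath_db(freqs)
    b = bark(freqs)
    # Step 1: maskers are local maxima that also exceed the threshold in quiet.
    maskers = [k for k in range(1, len(p_bar) - 1)
               if p_bar[k] >= p_bar[k - 1] and p_bar[k] >= p_bar[k + 1] and p_bar[k] >= quiet[k]]
    # Step 2: each masker spreads a crude triangular excitation over neighboring Barks
    # (a stand-in for the two-slope spread function of Eqn 23).
    threshold_lin = 10.0 ** (0.1 * quiet)
    for k in maskers:
        spread_db = p_bar[k] - 10.0 - 15.0 * np.abs(b - b[k])
        threshold_lin = threshold_lin + 10.0 ** (0.1 * spread_db)
    # Step 3: combine the contributions additively in the power domain (Eqn 25), return dB.
    return 10.0 * np.log10(threshold_lin)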

Appendix B Notations and Definitions

The notations and definitions used in our proposed algorithms are listed in Table 3.

x — The clean audio input
δ — The adversarial perturbation added to the clean audio
x' = x + δ — The constructed adversarial example
y — The targeted transcription
f(·) — The attacked neural network (ASR)
F — Fourier transform
k — The index of the spectrum
N — The window size in the short-time Fourier transform
s_x(k) — The k-th bin of the spectrum for the audio x
s_δ(k) — The k-th bin of the spectrum for the perturbation δ
p_x(k) — The log-magnitude power spectral density (PSD) for the audio x at index k
p̄_x(k) — The normalized PSD estimate for the audio x at index k
p_δ(k) — The log-magnitude power spectral density (PSD) for the perturbation δ at index k
p̄_δ(k) — The normalized PSD estimate for the perturbation δ at index k
θ_x(k) — The frequency masking threshold for the audio x at index k
L(x, δ, y) — Loss function optimized to construct adversarial examples
ℓ_net(f(x), y) — Loss function to fool the neural network with the input x and output y
ℓ_θ(x, δ) — Imperceptibility loss function
α — A hyperparameter to balance the importance of ℓ_net and ℓ_θ
‖·‖∞ — Max-norm
ε — Max-norm bound of the perturbation δ
∇_δ ℓ — The gradient of ℓ(·) with regard to δ
lr — The learning rate in gradient descent
r — Room reverberation (room impulse response)
t(·) — The room transformation related to the room configurations
T — The distribution from which the transformation t is sampled
δ* — The optimized δ in the first stage of constructing imperceptible adversarial examples
δ_r — The optimized δ in the first stage of constructing robust adversarial examples
ε_1 — The optimized ε in the first stage of constructing robust adversarial examples
ε_2 — The max-norm bound for δ used in the second stage of constructing robust adversarial examples
δ_r* — The optimized δ in the second stage of constructing robust adversarial examples
Δε — The difference between ε_2 and ε_1
{t_1, …, t_M} — A set of transformations sampled from the distribution T
M — The size of the transformation set
Table 3: Notations and definitions used in our algorithms.

Appendix C Implementation Details

The adversarial examples generated in our paper are all optimized via Adam optimizer (Kingma & Ba, 2014). The hyperparameters used in each section are displayed below.

c.1 Imperceptible Adversarial Examples

In order to construct imperceptible adversarial examples, we divide the optimization into two stages. In the first stage, the learning rate is set to 100 and the number of iterations is 1000, following (Carlini & Wagner, 2018). The max-norm bound ε starts from 2000 and is gradually reduced during optimization. In the second stage, the number of iterations is 4000. The learning rate starts from 1 and is reduced to 0.1 after 3000 iterations. The adaptive parameter α, which balances the importance of ℓ_net and ℓ_θ, begins at a small value (e.g., 0.05) and is gradually updated based on the performance of the adversarial example. Algorithm 1 shows the details of the two-stage optimization.

  Input: audio waveform x, target phrase y, ASR system f(·), loss functions ℓ_net and ℓ_θ, hyperparameters ε and α, learning rates lr_1 and lr_2 for the first and second stage, numbers of iterations T_1 and T_2 for the first and second stage.
  # Stage 1: minimize ℓ_net(f(x + δ), y)
  Initialize δ := 0, δ* := δ, and ε to a large value.
  for i = 1 to T_1 do
     δ ← δ − lr_1 · ∇_δ ℓ_net(f(x + δ), y)
     Clip δ so that ‖δ‖∞ ≤ ε
     if f(x + δ) = y then
        if ‖δ‖∞ < ε then
           reduce ε
        end if
        δ* ← δ
     end if
  end for

  # Stage 2: minimize the perceptibility
  Reassign δ := δ*
  for i = 1 to T_2 do
     δ ← δ − lr_2 · ∇_δ [ ℓ_net(f(x + δ), y) + α · ℓ_θ(x, δ) ]
     if f(x + δ) = y and i mod 20 = 0 then
        increase α
     end if
     if f(x + δ) ≠ y and i mod 50 = 0 then
        decrease α
     end if
  end for
  Output: adversarial example x' = x + δ
Algorithm 1 Optimization with Masking Threshold
Original phrase 1: the more she is engaged in her proper duties the less leisure will she have for it even as an accomplishment and a recreation
Targeted phrase 1: old will is a fine fellow but poor and helpless since missus rogers had her accident
Original phrase 2: a little cracked that in the popular phrase was my impression of the stranger who now made his appearance in the supper room
Targeted phrase 2: her regard shifted to the green stalks and leaves again and she started to move away
Table 4: Examples of the original and targeted phrases on the LibriSpeech dataset.

c.2 Robust Adversarial Examples

To develop robust adversarial examples that could work after being played over-the-air, we also optimize the adversarial perturbation in two stages. The first stage intends to find a relatively small perturbation, while the second stage focuses on making the constructed adversarial example more robust to random room configurations. In the first stage, the perturbation is updated for 2000 iterations, and the max-norm bound ε for the adversarial perturbation starts from 2000 as well and is gradually reduced. In the second stage, the number of iterations is set to 4000 and a much smaller learning rate is used. In this stage, ε is fixed and equals the ε optimized in the first stage plus a small constant offset. The transformation set used for the robustness check has a fixed size M.

c.3 Imperceptible and Robust Attacks

To construct imperceptible and robust adversarial examples, we begin with the robust adversarial examples generated in Section C.2. In the first stage, we focus on reducing the perceptibility by setting the initial α to 0.01, with the learning rate set to 1. We update the adversarial perturbation for 4000 iterations. If the adversarial example successfully attacks the ASR system in 4 out of 10 randomly chosen rooms, then α will be increased by 2. Otherwise, every 50 iterations, α will be decreased by 0.5.

In the second stage, we focus on making these less perceptible adversarial examples more robust. The learning rate is 1.5 and α starts from a very small value. The perturbation is further updated for 6000 iterations. If the adversarial example successfully attacks the ASR system in 8 out of 10 randomly chosen rooms, then α will be increased by 1.2.

Appendix D Transcription Examples

Some examples of the original phrases and targeted transcriptions from the LibriSpeech dataset (Panayotov et al., 2015) are shown in Table 4.