Enabling Fast and Universal Audio Adversarial Attack Using Generative Model

04/26/2020 · by Yi Xie, et al.

Recently, the vulnerability of DNN-based audio systems to adversarial attacks has attracted increasing attention. However, the existing audio adversarial attacks assume that the adversary possesses the user's entire audio input and has a sufficient time budget to generate the adversarial perturbations. These idealized assumptions make the existing audio adversarial attacks largely impossible to launch in a timely fashion in practice (e.g., playing unnoticeable adversarial perturbations along with the user's streaming input). To overcome these limitations, in this paper we propose the fast audio adversarial perturbation generator (FAPG), which uses a generative model to produce adversarial perturbations for the audio input in a single forward pass, thereby drastically improving the perturbation generation speed. Built on top of FAPG, we further propose the universal audio adversarial perturbation generator (UAPG), a scheme that crafts a universal adversarial perturbation which can be imposed on arbitrary benign audio input to cause misclassification. Extensive experiments show that our proposed FAPG achieves up to 167X speedup over the state-of-the-art audio adversarial attack methods, and that our proposed UAPG generates universal adversarial perturbations with much better attack performance than the state-of-the-art solutions.

1 Introduction

As the current most powerful artificial intelligence (AI) technique, deep neural networks (DNNs) have been widely adopted in many practical applications. Despite their success and popularity, DNNs suffer from several severe limitations, especially an inherently high vulnerability to adversarial attacks [7, 3], a very harmful attack approach that imposes a well-crafted adversarial perturbation on the benign input of a DNN to cause misclassification. Originally discovered in image classification, the vulnerability of DNNs, especially via various types of adversarial perturbation generation methods [9, 14, 11], has to date been extensively investigated in many image-domain applications.

Considering the rapidly increasing use of DNNs in modern audio-domain applications and systems, such as smart speakers (e.g., Apple HomePod, Amazon Echo) and voice assistants (e.g., Siri, Google Assistant, Alexa), both the machine learning and cyber security communities have recently begun to study the possibility of adversarial attacks in the audio domain. Some pioneering efforts [4, 13] on this topic have demonstrated that the idea of injecting inconspicuous perturbations into benign voice inputs to mislead DNN-based audio systems is not just conceptually attractive but also practically feasible. To date, several works have reported successful adversarial attacks on different audio-domain applications, including but not limited to speaker verification [10, 5], speech command recognition [2, 6], and speech-to-text transcription [4, 23].

Limitations of Prior Work. Although the existing works have already demonstrated the feasibility of audio adversarial attacks, they still face several challenges. More specifically, the state-of-the-art adversarial attack approaches, whether in the image or the audio domain, make several idealized assumptions about the temporal setting, especially a large time budget for generating the adversarial perturbation and the ability to observe the entire benign input. For image adversarial attacks, these simplified assumptions usually hold, since the benign input images are typically static and constant over time. However, audio signals have different temporal behaviors: 1) In practical audio applications, the benign inputs are typically quickly streaming voice inputs; therefore, the existing audio adversarial attacks, which rely on time-consuming iterative optimization approaches such as C&W [4] or genetic algorithms [2], are too slow to launch against these real-time audio processing systems. 2) The inherent sequential nature of audio signals also makes it impossible for the adversary to generate adversarial perturbations during the input-streaming phase, since the existing perturbation generation methods require full observation of the entire input. Consequently, current audio adversarial attacks can only be performed against recorded or playback voice instead of real-time audio signals, making them impractical for various real-world audio-domain attack scenarios.

Technical Preview and Contributions. To overcome these limitations, in this paper we propose to use a generative model to generate adversarial perturbations in the audio domain. This generative model learns the distribution of adversarial perturbations from training data in an offline way. Once well trained, the generative model can generate audio adversarial perturbations very quickly, thereby unlocking the possibility of realizing audio adversarial attacks in the real-time setting. The main contributions of this paper are summarized as follows:

  • We, for the first time, propose a generative model-based fast audio adversarial perturbation generator (FAPG). Unlike existing methods that require considerable perturbation generation time, our proposed FAPG generates the desired audio adversarial perturbation through a well-trained generative model, Wave-U-Net [17], in a single forward pass, thereby significantly accelerating perturbation generation.

  • We propose to integrate a set of trainable class-wise embedding feature maps into FAPG to encode all the label information of the audio data into a unified model. Unlike conventional generative model-based image-domain adversarial attacks, which require a different generative model for each targeted class, the proposed audio-domain FAPG can generate adversarial perturbations targeting any adversary-desired class using a single generator model. This reduction significantly saves memory cost and model training time when the adversary expects to launch attacks with multiple target classes.

  • Built on top of the input-dependent FAPG, we further propose an input-independent universal audio adversarial perturbation generator (UAPG). UAPG generates a single universal audio adversarial perturbation (UAP) that can be applied and re-used on different benign audio inputs without input-dependent re-generation. Moreover, since the universality of the UAP holds across different benign inputs, this characteristic removes the prior constraint of needing to observe the entire input for perturbation generation, thereby enabling real-time audio adversarial attacks.

  • We evaluate the attack performance of FAPG and UAPG against a speech command recognition model on the Google Speech Commands dataset [20]. Compared with the state-of-the-art input-dependent attacks, our FAPG-based attack achieves up to 167X speedup with a comparable attack success rate. Compared with the existing input-independent (universal) attack and the existing real-time attack, our UAPG-based attack achieves a substantially higher average fooling rate in both cases.

2 Fast Audio Adversarial Perturbation Generator (FAPG)

2.1 Motivation

Dilemma Between Speed and Performance. As analyzed in Section 1, one of the most challenging limitations of the existing audio adversarial attacks is their slow generation process for adversarial perturbations. Specifically, the commonly adopted underlying perturbation-generation approaches, such as BIM [9], C&W [4], and genetic algorithms [2], rely on many iterations to optimize or search for the perturbations. Although this iterative mechanism yields high attack performance, the required generation time is prohibitively long, e.g., seconds or even hours to produce one well-crafted perturbation. On the other hand, the existing one-step perturbation generation methods, e.g., FGSM [7], enjoy fast generation but suffer from poor attack performance, such as a much lower attack success rate (ASR) than their iteration-based counterparts.
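To make this dilemma concrete, the sketch below contrasts a one-step targeted FGSM with its iterative BIM counterpart in PyTorch. Here `f` is assumed to be any differentiable classifier over raw 1-D audio, and the step sizes and iteration counts are illustrative, not the paper's settings.

```python
# One-step vs. iterative perturbation generation: FGSM is fast but weak,
# BIM is slow but strong. Both are targeted (move toward class `target`).
import torch
import torch.nn.functional as F

def fgsm(f, x, target, eps):
    """One-step targeted FGSM: one gradient, comparatively low success."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(f(x), target)
    loss.backward()
    # Step *against* the gradient to move toward the target class.
    return (x - eps * x.grad.sign()).detach()

def bim(f, x, target, eps, alpha=0.01, steps=100):
    """Iterative BIM: many gradient steps, much higher success, much slower."""
    x_adv = x.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(f(x_adv), target)
        loss.backward()
        x_adv = (x_adv - alpha * x_adv.grad.sign()).detach()
        x_adv = x + (x_adv - x).clamp(-eps, eps)  # project back into eps-ball
    return x_adv
```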

Generative Model-based Solutions in the Image Domain. This dilemma between the speed and performance of adversarial perturbation generation is not audio-specific; it also widely exists in the image domain. To address it, recent image-domain studies [14, 16] have proposed to utilize generative models, such as generative adversarial networks (GANs) [8] and autoencoders [19], to accelerate the generation of image adversarial perturbations. Different from the multi-step optimization-based approaches (e.g., C&W and BIM), the generative model-based solutions aim to learn the distribution of adversarial perturbations from the training images. Once well trained, the generative model performs one-step generation from the input image to the adversarial perturbation, which is essentially a fast one-pass forward propagation over the generative model, thereby significantly improving the generation speed of image adversarial perturbations.

Challenges in the Audio Domain. Such progress in the image domain naturally encourages the exploration of generative models to accelerate audio adversarial perturbation generation. However, audio signals differ fundamentally from images. A speaker’s voice is essentially a 1-D time-series signal that carries important sequential-order information. Also, unlike well-defined fixed-size image data, voice data typically have very different signal lengths, even from the same user within the same dataset. Besides these audio-specific challenges, generative model-based audio adversarial perturbation also suffers from the same class-specific model-preparation problem as its image-domain counterpart: when utilizing a generative model to perform a targeted attack, an individual generative model has to be trained for each target class. Since the number of classes can be very large, e.g., hundreds or even thousands, the memory cost required to launch the attack is very high.

2.2 Proposed FAPG: Construction & Training

Figure 1: Overall architecture of the proposed FAPG.

Overall Architecture. To address these challenges, we propose FAPG, a fast audio adversarial perturbation generator, to launch audio-domain adversarial attacks in a rapid, high-performance, and low-memory-cost way. Figure 1 illustrates the overall architecture of FAPG, which contains a generative model G, e.g., Wave-U-Net [17], and multiple class-wise embedding feature maps. During the training phase, both the generative model and the embedding feature maps are jointly trained on the training dataset. After proper training, given a benign audio input x and a target class label t to which the adversary plans to mislead the DNN classifier f, the corresponding audio adversarial perturbation can be quickly generated by performing inference of the benign input over the well-trained generative model, where the embedding feature map of the target class is concatenated to one intermediate feature map of G. Next, we describe the details of the generative model and the set of embedding feature maps.

Audio-specific Generative Model. The generative model is the core component of FAPG. Although various types of generative models have been widely used in image-domain applications, they are not well suited for use in FAPG due to the inherent differences (e.g., sequential order and varying length) between image and audio signals. To address these challenges, we adopt Wave-U-Net [17], which was originally developed for audio source separation, as the underlying generative model of FAPG. Wave-U-Net is a special type of CNN consisting of 1-D convolutional blocks, decimating down-sampling blocks, and linear-interpolation up-sampling blocks. This inherent encoder-decoder structure gives Wave-U-Net strong distribution modeling capability. Meanwhile, its unique design of first-layer 1-D convolution and up/down-sampling blocks also enables Wave-U-Net to naturally capture the temporal information of 1-D varying-length data.
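To make this structure concrete, the following is a minimal, illustrative PyTorch sketch of a Wave-U-Net-style 1-D encoder/decoder. The depth, channel widths, and kernel sizes here are our own assumptions for illustration, not the configuration used in the paper.

```python
# A tiny Wave-U-Net-style generator: 1-D convolutions, decimating
# down-sampling, linear-interpolation up-sampling, and skip connections.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWaveUNet(nn.Module):
    def __init__(self, ch=24):
        super().__init__()
        self.down1 = nn.Conv1d(1, ch, kernel_size=15, padding=7)
        self.down2 = nn.Conv1d(ch, 2 * ch, kernel_size=15, padding=7)
        self.bottleneck = nn.Conv1d(2 * ch, 2 * ch, kernel_size=15, padding=7)
        self.up2 = nn.Conv1d(4 * ch, ch, kernel_size=5, padding=2)
        self.up1 = nn.Conv1d(2 * ch, 1, kernel_size=5, padding=2)

    def forward(self, x):                               # x: (batch, 1, time)
        d1 = F.leaky_relu(self.down1(x))
        d2 = F.leaky_relu(self.down2(d1[:, :, ::2]))    # decimate by 2
        b = F.leaky_relu(self.bottleneck(d2[:, :, ::2]))
        # Linear-interpolation up-sampling plus skip connections.
        u2 = F.interpolate(b, size=d2.shape[-1], mode="linear")
        u2 = F.leaky_relu(self.up2(torch.cat([u2, d2], dim=1)))
        u1 = F.interpolate(u2, size=d1.shape[-1], mode="linear")
        return torch.tanh(self.up1(torch.cat([u1, d1], dim=1)))
```

Because the first layer is a 1-D convolution and the up-sampling is interpolation to the skip connection's length, the same network handles inputs of varying length, which matches the audio-specific requirement described above.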

Class-wise Embedding Feature Maps. The purpose of the class-wise embedding feature maps is to ensure that a single generative model can be re-used for attacks against different target classes, instead of requiring a class-specific design. To this end, these class-aware embedding feature maps, denoted as {E_1, ..., E_K} for K target classes, are designed to be trainable, and each of them corresponds to one target class. After the joint training of the generative model G and the embedding feature maps, the label information of class t is encoded in the corresponding feature map E_t. During the generation phase, E_t is concatenated with one intermediate feature map of G to craft the adversarial perturbation for target class t. In our design, E_t has exactly the same shape as the intermediate feature map it is concatenated to. Specifically, E_t is aligned with the intermediate feature map at the intersection between the encoder and decoder parts of Wave-U-Net, because the feature map has the smallest size at this position, thereby minimizing the storage cost of each E_t.
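This conditioning idea can be sketched as follows, assuming the generator is split into an `encoder` and a `decoder` at the bottleneck; all names and shapes are illustrative.

```python
# One trainable feature map per target class, concatenated at the
# encoder/decoder intersection. The decoder must accept 2*emb_ch channels.
import torch
import torch.nn as nn

class ClassConditionedGenerator(nn.Module):
    def __init__(self, encoder, decoder, num_classes, emb_ch, emb_len):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        # E_t for each class t: shape matches the bottleneck feature map.
        self.class_maps = nn.Parameter(
            torch.randn(num_classes, emb_ch, emb_len) * 0.01)

    def forward(self, x, target):            # target: (batch,) class ids
        h = self.encoder(x)                  # bottleneck: (B, emb_ch, emb_len)
        e = self.class_maps[target]          # fetch E_t for each sample
        return self.decoder(torch.cat([h, e], dim=1))
```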

Require: Training dataset X, class labels {1, ..., K}, DNN classifier f, noise constraint constant ε
Result: Trained FAPG: generative model G, class-wise embedding feature maps {E_1, ..., E_K}
Initialize G, {E_1, ..., E_K}, and ε
for number of training iterations do
       for number of steps do
              x ← minibatch of m samples from X;
              t ← randomly selected target class;
              G_{E_t} ← embed G with E_t;
              δ ← G_{E_t}(x), δ ← clip_ε(δ);
              x′ ← x + δ;
              Loss ← ℓ_CE(f(x′), t) + α·‖δ‖₂;
              minimize Loss to update G and E_t;
              decrease ε
       end for
end for
Algorithm 1: Training Procedure of FAPG

Training Procedure of FAPG. Next we describe the training procedure of FAPG, or more specifically, the joint training of G and {E_1, ..., E_K}. In the forward propagation phase, for each batch of input voice data x, we first randomly select one target class t and fetch the corresponding embedding feature map E_t. This selected feature map is concatenated into the generative model G to form an overall model G_{E_t}, and a forward pass on G_{E_t} is performed with input x. The result, denoted as δ, is clipped to the range [-ε, ε] to constrain the generated perturbation to be imperceptible, where ε is a threshold parameter. Notice that, according to our experiments, ε should initially be set to a relatively large value and gradually decreased during the training procedure; empirically, such an adjusting scheme brings better training convergence.

After the perturbation δ is calculated from the generative model, it is imposed on the benign data x to form the adversarial input x′ = x + δ, which aims to cause misclassification by the DNN classifier f. The loss function, which is the key to the entire training procedure, is formulated as follows:

       Loss = ℓ_CE(f(x + δ), t) + α·‖δ‖₂        (1)

where the first and second terms are the cross-entropy loss and the L2 loss, respectively, and α is a pre-set coefficient. The L2 term in the overall loss controls the attack strength and makes the generated adversarial perturbation imperceptible.

Consequently, in the backward propagation phase, both the generative model G and the currently selected embedding feature map E_t are updated simultaneously by minimizing the loss function. Notice that for each batch of data, t is randomly selected. Therefore, after many rounds of iterations, the generative model itself learns the general distribution of adversarial perturbations, while each E_t learns the encoded information of its specific target class. The entire FAPG training procedure is summarized in Algorithm 1.
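A hedged sketch of one training step following Algorithm 1 is shown below; `gen` is a class-conditioned generator as sketched earlier, `f` is the target classifier, `alpha` plays the role of the pre-set coefficient α, and the ε-decay schedule in the trailing comment is purely illustrative.

```python
# One FAPG training step: sample a random target class, generate and clip
# the perturbation, then jointly update G and the selected E_t.
import torch
import torch.nn.functional as F

def fapg_step(gen, f, opt, x, num_classes, eps, alpha=0.1):
    target = torch.randint(num_classes, (x.size(0),), device=x.device)
    delta = gen(x, target).clamp(-eps, eps)   # clip_eps(G_{E_t}(x))
    logits = f(x + delta)                     # adversarial input x' = x + delta
    loss = F.cross_entropy(logits, target) + alpha * delta.norm(p=2)
    opt.zero_grad()
    loss.backward()                           # gradients reach both G and E_t
    opt.step()
    return loss.item()

# Typical usage: opt = torch.optim.Adam(gen.parameters()); decay eps across
# training, e.g. eps = max(eps_final, eps * 0.999) (schedule is illustrative).
```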

3 Universal Audio Adversarial Perturbation Generator (UAPG)

3.1 Motivation

Limitations of FAPG. As presented in Section 2.2, FAPG provides a fast solution for generating audio adversarial perturbations. However, it is essentially an input-dependent generation approach: its perturbation generation relies on observing the entire benign input. Such an assumption cannot be satisfied in many real-time applications, given the streaming nature of audio signals.

Universal Audio Adversarial Perturbation Generator (UAPG). To address this limitation, we further develop the universal audio adversarial perturbation generator (UAPG) to craft audio-domain universal adversarial perturbations (UAPs). As its name suggests, a single universal adversarial perturbation can be applied and re-used on different benign inputs to cause misclassification, without input-dependent re-generation. This unique universality completely removes the prior constraint of observing the entire input and makes UAPG well suited to launching real-time audio adversarial attacks with zero generation time at attack time.
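The operational simplicity this buys can be seen in a short sketch: since the UAP is computed entirely offline, applying it to incoming audio is a single addition. Here `uap` is assumed to be a precomputed waveform-shaped tensor; the function name is illustrative.

```python
# Overlaying a fixed, precomputed universal perturbation on a streaming
# chunk: no model inference is needed at attack time.
import torch

def attack_stream(uap: torch.Tensor, chunk: torch.Tensor) -> torch.Tensor:
    """Add the (fixed) universal perturbation to an incoming audio chunk."""
    n = min(uap.numel(), chunk.numel())
    out = chunk.clone()
    out[..., :n] += uap[..., :n]   # zero generation latency at attack time
    return out
```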

Challenges of UAPG Design. The attractive benefits of UAPs have already led to work on image-specific UAPs [11, 14]. Borrowing the methodology from this image-domain progress, recent works [18, 13] report audio-domain UAP generation methods for speech command recognition and speech-to-text systems, respectively. Besides, [6] proposes a technique to realize real-time audio adversarial attacks without using the entire voice input, which has a similar effect to using a UAP.

Despite these existing efforts, designing a robust and powerful UAPG remains non-trivial and faces two main challenges: 1) experimental results show that current audio-domain UAPs typically have much lower attack performance than input-dependent perturbations; and 2) some audio-domain UAPs only enable untargeted attacks, in which the adversary cannot precisely obtain the desired target result.

Figure 2: Overall architecture and training scheme of the proposed UAPG.

3.2 Proposed UAPG: Construction & Training

Overall Scheme. Different from existing studies, we aim to devise a UAPG that achieves high targeted attack performance. Figure 2 shows the idea: we produce a UAP from a single trainable vector z, which is trained to exhibit a certain degree of universality. After initialization, z is used to produce UAPs and is updated iteratively, gradually improving the universality of the derived UAPs across different training data samples. Finally, an effective UAPG is constructed from the well-trained z.

From FAPG to UAPG. The underlying method used for generating UAPs is our proposed FAPG. Intuitively, FAPG learns to estimate the distribution of adversarial perturbations instead of iteratively optimizing a perturbation for one specific audio input; therefore, an FAPG-generated perturbation naturally exhibits better universality than one produced by a non-generative method. Moreover, our FAPG integrates the information of all target classes into a single generative model, thereby enabling the production of targeted universal perturbations.

Training Procedure of UAPG. We now introduce the training details that facilitate an effective UAPG. In general, to formulate an input-agnostic universal attack, our goal is to find a universal perturbation δ_u that satisfies:

       f(x + δ_u) = t,  for most inputs x drawn from the data distribution        (2)

The training procedure of UAPG is shown in Algorithm 2. We aim to generate a single universal perturbation via the well-trained G and the corresponding E_t, both obtained from the well-trained input-dependent FAPG. Different from the input-dependent scenario, the audio input signal is now replaced by a single trainable vector z. The universal perturbation δ_u = clip_ε(G_{E_t}(z)) is then imposed on benign data to craft the adversarial audio example. By feeding such adversarial audio into the DNN classifier f, we update z by minimizing the following loss function:

       Loss = ℓ_CE(f(x + δ_u), t) + α·‖δ_u‖₂        (3)

where the first and second terms represent the cross-entropy loss and the L2 loss, respectively. Guided by this loss function, we optimize z by iteratively applying the derived δ_u across the entire training data. In particular, to construct a UAPG that can be universally applied to any target class, a random target class t is selected at each training step to help z learn inter-class representations. After constructing the unified z, the universal perturbations computed by our UAPG can be effectively applied to any input data to fool the DNN model in an audio-agnostic way, without re-generating the adversarial perturbation for each individual audio input.
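A hedged sketch of one UAPG training step following Algorithm 2 is given below: the trainable vector `z` replaces the audio input, and only `z` sits in the optimizer, so the generator and classifier stay fixed. All names are illustrative.

```python
# One UAPG training step: derive the universal perturbation from z, apply
# it to a minibatch of real audio, and update only z.
import torch
import torch.nn.functional as F

def uapg_step(gen, f, z, opt_z, x_batch, num_classes, eps, alpha=0.1):
    t = torch.randint(num_classes, (1,), device=z.device)
    # Derive the universal perturbation from z (a batch of one), then clip.
    delta_u = gen(z.unsqueeze(0), t).clamp(-eps, eps)
    logits = f(x_batch + delta_u)          # broadcast delta_u over the batch
    target = t.expand(x_batch.size(0))
    loss = F.cross_entropy(logits, target) + alpha * delta_u.norm(p=2)
    opt_z.zero_grad()
    loss.backward()
    opt_z.step()                           # only z is in the optimizer
    return loss.item()

# Typical setup (illustrative sizes, e.g. one second at 16 kHz):
#   z = torch.randn(1, 16000, requires_grad=True)
#   opt_z = torch.optim.Adam([z], lr=1e-3)
```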

Require: Training dataset X, class labels {1, ..., K}, DNN classifier f, generative model G, class-wise embedding feature maps {E_1, ..., E_K}, noise constraint constant ε
Result: Trained UAPG: universal vector z
Randomly initialize z
for number of training iterations do
       for number of steps do
              t ← randomly selected target class;
              G_{E_t} ← embed G with E_t;
              δ_u ← G_{E_t}(z), δ_u ← clip_ε(δ_u);
              x ← minibatch of m samples from X;
              x′ ← x + δ_u;
              Loss ← ℓ_CE(f(x′), t) + α·‖δ_u‖₂;
              minimize Loss to update z
       end for
end for
Algorithm 2: Training Procedure of UAPG

4 Attack Evaluation

4.1 Experimental Methodology

Target Model and Dataset.

We evaluate the proposed FAPG and UAPG on a state-of-the-art convolutional neural network (CNN) based speech command recognition system [15], which has been widely used as the baseline model in many previous studies of audio-domain adversarial attacks and defenses [2, 1, 22]. We use the crowd-sourced Speech Commands dataset [20], which contains utterances of 10 representative speech commands (i.e., “yes”, “no”, “up”, “down”, “left”, “right”, “on”, “off”, “stop”, and “go”), with audio files stored at a 16 kHz sample rate. Prior to the speech command recognition model, MFCC features are extracted from the audio input. We randomly split the dataset into training and testing sets, and the baseline model achieves high command recognition accuracy on the testing set.

Evaluation Metrics. (1) Fooling Rate (FR), used for evaluating both targeted and untargeted attacks, is the ratio of the number of adversarial examples that lead to a false classification to the total number of adversarial examples. (2) Attack Success Rate (ASR), used only for evaluating targeted attacks, is the ratio of the number of successful attacks to the total number of attack attempts. (3) Distortion Metric: we quantify the relative noise level of the perturbation δ with respect to the original audio x in decibels (dB), following [4]: dB_x(δ) = dB(δ) − dB(x), where dB(x) = max_i 20·log10(|x_i|).
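Under these definitions, the three metrics can be computed as in the following sketch; the dB formulation follows the convention of [4], and all variable names are our own assumptions.

```python
# FR counts any wrong prediction; ASR counts exact hits on the target;
# the distortion metric compares peak amplitudes in decibels.
import torch

def fooling_rate(preds_adv, labels):
    """FR: fraction of adversarial examples classified as anything wrong."""
    return (preds_adv != labels).float().mean().item()

def attack_success_rate(preds_adv, targets):
    """ASR (targeted attacks only): fraction classified exactly as target."""
    return (preds_adv == targets).float().mean().item()

def distortion_db(x, delta):
    """Relative loudness of delta w.r.t. x: dB_x(delta) = dB(delta) - dB(x)."""
    db = lambda s: 20 * torch.log10(s.abs().max())
    return (db(delta) - db(x)).item()
```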

Type    FGSM      BIM       C&W       FAPG
FR      48.36%    97.24%    98.9%     97.77%
ASR     17.83%    92.26%    96.16%    93.55%
Time    0.051s    1.36s     9.16s     0.055s

Table 1: Comparison of the overall performance of audio-dependent attacks.

4.2 Audio-dependent Targeted Attack via FAPG

FAPG Generator Implementation. We use the Wave-U-Net [17] model to construct our FAPG; the model contains a series of down-sampling blocks and corresponding up-sampling blocks, and each additional class-wise embedding feature map has the same size as the feature map of the last encoding layer. FAPG is trained on the same training set as used for training the target speech command recognition model, using the Adam optimizer. The clipping bound ε is initially set to a relatively large value and gradually reduced over the course of training.

Speedup of Generation Time. We compare the time consumption of our proposed FAPG with the one-step FGSM [9] and two iteration-based attack methods, i.e., BIM [9] and C&W [3]. For a fair comparison, the parameters of FGSM are tuned to produce the same magnitude of adversarial distortion (measured in dB) as FAPG, and the configuration of the other attacks follows previous work [22]. We evaluate each type of attack by alternately targeting every speech command and record the average performance over all testing samples. The average time consumption of adversarial perturbation generation is measured on an Nvidia Tesla V100 GPU. As shown in Table 1, the proposed FAPG achieves a comparably high ASR to the iteration-based methods such as BIM and C&W, while requiring generation time close to that of the single-step method (i.e., FGSM). These results correspond to an adversarial perturbation generation speedup of about 167X compared with the C&W-based attack.
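For reference, per-perturbation generation time on a GPU is typically measured as in the sketch below; explicit synchronization is needed because CUDA kernels execute asynchronously. The warm-up and run counts are illustrative.

```python
# Average the forward-pass latency of a generator over many runs.
import time
import torch

@torch.no_grad()
def time_generation(gen, x, target, warmup=10, runs=100):
    for _ in range(warmup):               # warm up CUDA kernels and caches
        gen(x, target)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        gen(x, target)
    torch.cuda.synchronize()              # wait for async GPU work to finish
    return (time.perf_counter() - start) / runs   # seconds per generation
```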

Figure 3: Class-wise ASR performance of FAPG (%).

Class-wise Attack Performance. Figure 3 illustrates the detailed class-wise performance of our FAPG. Specifically, each cell shows the ASR of an original-target command pair (cases with the same original and target command are excluded). As shown, our FAPG retains a high ASR for nearly all original-target command pairs, with only a single pair exhibiting a noticeably lower ASR. In addition, the average distortion of the perturbed audio, measured in dB, is approximately the difference between the ambient noise in a quiet room and a person talking [4].

Memory Cost Reduction. To show the memory cost reduction brought by the proposed embedding feature maps, we first train 10 separate generative models, one for each of the 10 target speech commands; together, these well-trained models multiply the memory consumption by the number of target classes. To achieve the same attacking task, our proposed FAPG requires only a single generative model with 10 class-wise embedding feature maps, and thus takes only marginally more memory than one generator. This shows the feasibility of using a single generative model with a 10-class set of embedding feature maps to generate adversarial perturbations targeting all 10 classes, which significantly reduces the storage cost of the generator, especially when the adversary targets hundreds or thousands of speech commands.

Figure 4: Visualization of audio-dependent perturbations and universal perturbations targeting at different speech commands.

4.3 Audio-agnostic Universal Attack via UAPG

UAPG Implementation. The proposed trainable vector z is initialized with the same size as the original audio input. A pre-trained FAPG model from Section 4.2 is used to construct the UAPG. The training of z uses the same training set as the speech command recognition model and is conducted with the Adam optimizer. We set ε to 0.03 to bound the average distortion of the generated adversarial perturbations.

Class-wise Universal Attack Performance. To investigate the effectiveness of UAPG, we plot the audio-dependent perturbations generated by FAPG as well as the audio-agnostic perturbations generated by UAPG using principal component analysis (PCA) [21], as shown in Figure 4. For the sake of illustration, we only show the adversarial perturbations targeting five commands. Although the universal perturbations are created without access to the distribution of real speech commands, every universal perturbation falls within the manifold of the corresponding audio-dependent perturbations generated for the same target class. This demonstrates that our UAPG efficiently learns the inherent adversarial representation of each target command.
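A sketch of how such a PCA projection can be produced with scikit-learn is given below; fitting on the input-dependent perturbations and overlaying the universal ones mirrors the described visualization, with all variable names assumed.

```python
# Project flattened perturbation waveforms to 2-D; color input-dependent
# points by target command and mark universal perturbations with stars.
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def plot_perturbations(dep_perts, uni_perts, labels):
    """dep_perts: (N, T) input-dependent; uni_perts: (K, T) universal."""
    pca = PCA(n_components=2).fit(dep_perts)
    dep2d, uni2d = pca.transform(dep_perts), pca.transform(uni_perts)
    plt.scatter(dep2d[:, 0], dep2d[:, 1], c=labels, alpha=0.3, s=8)
    plt.scatter(uni2d[:, 0], uni2d[:, 1], c=np.arange(len(uni_perts)),
                marker="*", s=200, edgecolors="k")
    plt.show()
```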

Original                UAPG                  UAP-HC [Vadillo et al., 2019]
Speech             FR         ASR            FR         ASR
“yes”              87.12%     83.14%         66.24%     -
“no”               96.48%     91.84%         66.23%     -
“up”               95.70%     93.92%         61.89%     -
“down”             88.28%     83.78%         44.82%     -
“left”             89.32%     88.04%         0.00%      -
“right”            84.16%     80.55%         55.84%     -
“on”               86.97%     85.08%         57.39%     -
“off”              90.87%     86.51%         68.60%     -
“stop”             87.97%     86.63%         43.48%     -
“go”               93.38%     89.60%         63.34%     -
Average            90.03%     86.90%         52.78%     -

Table 2: Comparison of the effectiveness of universal attacks. UAPG is our proposed targeted universal attack; UAP-HC is an untargeted universal attack.

Table 2 compares the performance of universal attacks between our proposed UAPG and the state-of-the-art audio universal attack, UAP-HC [18], which is based on the DeepFool algorithm [12]. To evaluate the UAPG attack, we generate universal perturbations targeting each of the 10 speech commands. For each type of original command, we add the corresponding perturbation, attempting to make it recognized as each of the remaining commands. Additionally, since UAP-HC is an untargeted universal attack regardless of the target command, we only report its FR. Specifically, UAP-HC achieves an average FR of only 52.78%, which indicates that the perturbation generated through UAP-HC has limited generality over other speech commands. In comparison, our proposed UAPG shows a significant improvement in universality over UAP-HC, achieving an average FR of 90.03% and an average ASR of 86.90%.

Comparison with Other Real-time Attacks. We also compare the performance of the proposed UAPG with the recent real-time adversarial attack (RAA) [6]. For a fair comparison, we launch the attack in a more practical black-box setting: we re-train our UAPG on another speech command recognition model (i.e., the CNN-3 model [24]) and evaluate the real-time attack performance on the same target speech recognition model as used in the previous sections. As shown in Table 3, due to its design, RAA incurs a 0.01s emitting delay before adversarial noise can be applied, whereas our UAPG launches the attack in real time without any delay. Moreover, RAA only achieves an untargeted real-time attack with a 43.5% FR, while our UAPG, even under the targeted attack setting, achieves a 73.48% FR and a 40.01% ASR. The proposed UAPG therefore generates more robust universal perturbations, enabling the realization of real-time attacks in practice.

Method             UAPG        RAA
Attack Scenario    Targeted    Untargeted
Emitting Delay     0s          0.01s
FR                 73.48%      43.5%
ASR                40.01%      -

Table 3: Comparison of real-time attacks. UAPG is our proposed targeted universal attack; RAA is a state-of-the-art real-time attack.

5 Conclusion

In this work, we propose a fast and universal adversarial attack on speech command recognition. By exploiting Wave-U-Net and the class-wise embedding feature maps, our proposed FAPG can launch a fast audio adversarial attack targeting any speech command within a single feed-forward pass, resulting in an adversarial perturbation generation speedup of up to 167X over the state-of-the-art solutions. Moreover, built on top of FAPG, our proposed UAPG generates a universal adversarial perturbation that can be applied to arbitrary benign audio input. Extensive experiments demonstrate the effectiveness and robustness of the proposed FAPG and UAPG, making real-time attacks against speech command recognition systems possible.

References

  • [1] S. Abdoli, L. G. Hafemann, J. Rony, I. B. Ayed, P. Cardinal, and A. L. Koerich (2019) Universal adversarial audio perturbations. arXiv preprint arXiv:1908.03173.
  • [2] M. Alzantot, B. Balaji, and M. Srivastava (2018) Did you hear that? Adversarial examples against automatic speech recognition. arXiv preprint arXiv:1801.00554.
  • [3] N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57.
  • [4] N. Carlini and D. Wagner (2018) Audio adversarial examples: targeted attacks on speech-to-text. In 2018 IEEE Security and Privacy Workshops (SPW), pp. 1–7.
  • [5] G. Chen, S. Chen, L. Fan, X. Du, Z. Zhao, F. Song, and Y. Liu (2019) Who is real Bob? Adversarial attacks on speaker recognition systems. arXiv preprint arXiv:1911.01840.
  • [6] Y. Gong, B. Li, C. Poellabauer, and Y. Shi (2019) Real-time adversarial attacks. arXiv preprint arXiv:1905.13399.
  • [7] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
  • [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
  • [9] A. Kurakin, I. Goodfellow, and S. Bengio (2016) Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533.
  • [10] Z. Meng, Y. Zhao, J. Li, and Y. Gong (2019) Adversarial speaker verification. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6216–6220.
  • [11] S. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard (2017) Universal adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1765–1773.
  • [12] S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) DeepFool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2574–2582.
  • [13] P. Neekhara, S. Hussain, P. Pandey, S. Dubnov, J. McAuley, and F. Koushanfar (2019) Universal adversarial perturbations for speech recognition systems. arXiv preprint arXiv:1905.03828.
  • [14] O. Poursaeed, I. Katsman, B. Gao, and S. Belongie (2018) Generative adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4422–4431.
  • [15] T. N. Sainath and C. Parada (2015) Convolutional neural networks for small-footprint keyword spotting. In Sixteenth Annual Conference of the International Speech Communication Association (INTERSPEECH).
  • [16] Y. Song, R. Shu, N. Kushman, and S. Ermon (2018) Constructing unrestricted adversarial examples with generative models. In Advances in Neural Information Processing Systems, pp. 8312–8323.
  • [17] D. Stoller, S. Ewert, and S. Dixon (2018) Wave-U-Net: a multi-scale neural network for end-to-end audio source separation. arXiv preprint arXiv:1806.03185.
  • [18] J. Vadillo and R. Santana (2019) Universal adversarial examples in speech command classification. arXiv preprint arXiv:1911.10182.
  • [19] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol (2008) Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103.
  • [20] P. Warden (2018) Speech Commands: a dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209.
  • [21] S. Wold, K. Esbensen, and P. Geladi (1987) Principal component analysis. Chemometrics and Intelligent Laboratory Systems 2(1-3), pp. 37–52.
  • [22] F. Yu, Z. Xu, Y. Wang, C. Liu, and X. Chen (2018) Towards robust training of neural networks by regularizing adversarial gradients. arXiv preprint arXiv:1805.09370.
  • [23] X. Yuan, Y. Chen, Y. Zhao, Y. Long, X. Liu, K. Chen, S. Zhang, H. Huang, X. Wang, and C. A. Gunter (2018) CommanderSong: a systematic approach for practical adversarial voice recognition. In 27th USENIX Security Symposium (USENIX Security 18), pp. 49–64.
  • [24] Y. Zhang, N. Suda, L. Lai, and V. Chandra (2017) Hello Edge: keyword spotting on microcontrollers. arXiv preprint arXiv:1711.07128.