Identifying Audio Adversarial Examples via Anomalous Pattern Detection

by   Victor Akinwande, et al.

Audio processing models based on deep neural networks are susceptible to adversarial attacks even when the adversarial audio waveform is 99.9% similar to a benign sample. Given the wide application of DNN-based audio recognition systems, detecting the presence of adversarial examples is of high practical relevance. By applying anomalous pattern detection techniques in the activation space of these models, we show that two recent state-of-the-art adversarial attacks on audio processing systems systematically lead to higher-than-expected activation at some subset of nodes, and that we can detect them with an AUC of up to 0.98 with no degradation in performance on benign samples.




1 Introduction

Speech-based interaction is widely used in virtual personal assistants (e.g. Siri, Google Assistant) and is also moving to more critical areas such as virtual assistants for physicians (e.g. Saykara). Given the increasing application of deep neural network-based audio processing systems, the robustness of these systems is of high relevance. Neural networks are susceptible to adversarial attacks, where an adversarial example (Szegedy et al., 2013), typically crafted by adding small perturbations to the input, causes an erroneous output that may be prespecified by the adversary (Carlini and Wagner, 2017). Existing work on adversarial examples focuses predominantly on image processing tasks (Akhtar and Mian, 2018), and recently more attention has been given to other domains such as audio for automatic speech recognition (ASR) (Alzantot et al., 2018; Gong and Poellabauer, 2017; Carlini and Wagner, 2018; Taori et al., 2018; Qin et al., 2019). Thus, the detection of adversarial attacks is a key component in building robust models.

Given an input audio waveform $x$, an ASR system $f$, and a target transcription $y$, most attacks seek to find a small perturbation $\delta$ such that $x' = x + \delta$ and $f(x') = y$ though $f(x) \neq y$. This is referred to as a targeted attack, and such an adversarial audio waveform may be 99.9% similar to a benign sample (Carlini and Wagner, 2018). Recent work (Schönherr et al., 2019; Qin et al., 2019; Yakura and Sakuma, 2018) has also demonstrated the feasibility of playing these adversarial samples over-the-air by simulating room impulse responses and making the samples robust to reverberations. We observe that the key difference between generating adversarial examples across tasks or input modalities such as images, audio, or text lies in a change of architecture, since these attacks generally attempt to maximize the training loss. It is therefore valuable to study properties of adversarial examples that hold across multiple domains. Whereas existing work on defending against audio adversarial attacks proposes preprocessing the audio waveform as a means of defense (Yang et al., 2018; Das et al., 2018; Rajaratnam et al., 2018; Subramanian et al., 2019), our work treats the problem as anomalous pattern detection and operates in an unsupervised manner without a priori knowledge of the attack or labeled examples. We also do not rely on training data augmentation or specialized training techniques, and our method can be complementary to existing preprocessing techniques.

We claim two novel contributions. First, we propose a detection mechanism that uses nonparametric scan statistics, efficiently optimized over node activations, to quantify the anomalousness of an input as a real-valued “score”. Second, we show empirical results across two state-of-the-art audio adversarial attacks (Carlini and Wagner, 2018; Qin et al., 2019) with performance comparable to the current state-of-the-art, without losing accuracy on benign samples, which is a known downside of preprocessing approaches.

Related Work: The susceptibility of deep neural networks to adversarial attack was largely established in (Biggio et al., 2013) and (Szegedy et al., 2013). Since then, numerous kinds of attacks have been designed across multiple data modalities including images (Goodfellow et al., 2014; Papernot et al., 2016; Carlini and Wagner, 2017) and audio (Cisse et al., 2017; Gong and Poellabauer, 2017; Alzantot et al., 2018; Carlini and Wagner, 2018; Qin et al., 2019). Early work on audio adversarial examples focused on untargeted attacks, where the goal is to produce an incorrect but arbitrary transcription from an ASR given a minimally perturbed input. In (Carlini and Wagner, 2018), an iterative optimization-based targeted attack is introduced with a 100% success rate on state-of-the-art audio models, with the limitation that the samples do not remain adversarial when played over-the-air. In (Qin et al., 2019), this limitation is addressed by leveraging psychoacoustics towards more imperceptible and over-the-air attacks.

On the other hand, defending against audio adversarial attacks has predominantly focused on preprocessing techniques such as MP3 compression, quantization, adding noise, or smoothing (Rajaratnam et al., 2018; Das et al., 2018; Yang et al., 2018; Subramanian et al., 2019). However, these approaches modify the input in some way and affect performance on benign samples. In particular, (Yang et al., 2018) propose using the explicit temporal dependency in audio, e.g. correlations in consecutive waveform segments, and show that it is affected by adversarial perturbations. However, this introduces an inductive bias in the choice of the time step(s) at which to break up the waveform sequence so that it can be evaluated as adversarial while minimizing the performance degradation if the sample is benign. Also, for an audio sample to be deleterious in the real world, only a small part of the transcription of a waveform sequence needs to be adversarially targeted. For example, a sample that transcribes to “Alexa, please call the doctor” and is changed to “Alexa, please call the doorman” using a combination attack may have a low Word Error Rate under Temporal Dependency. This motivates the need to explore other mechanisms for adversarial audio detection beyond the information provided by the input. Our work shows strong discriminative power (which we refer to as detection power) against adversarial samples without preprocessing the input, thus preventing performance degradation on clean samples.
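To make the “doctor” versus “doorman” example concrete, the following is a minimal sketch of a word-level error rate based on edit distance. The function name `word_error_rate` is illustrative and is not taken from the Temporal Dependency implementation, which compares transcriptions of waveform segments rather than these two full sentences.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("Alexa, please call the doctor",
                      "Alexa, please call the doorman")
print(wer)
```

Only one of the five words differs, so the error rate between the two sentences is 0.2 even though the meaning of the command has changed.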

2 Non-parametric Scan Statistics and Subset Scanning

Subset scanning treats the pattern detection problem as a search for the “most anomalous” subset of observations in the data. Herein, anomalousness is quantified by a scoring function, $F(S)$. Therefore, the goal is to efficiently identify $S^* = \arg\max_S F(S)$ over all relevant subsets of node activations within an ASR that is processing audio waveforms at runtime. This work uses non-parametric scan statistics (NPSS) that have been used in other pattern detection methods (Neill and Lingwall, 2007; McFowland III et al., 2013; McFowland et al., 2018; Chen and Neill, 2014).

Let there be $M$ clean audio samples $X_z$ included in $D_{H_0}$. These samples generate activations $A^{H_0}_{zj}$ at each node $O_j$. Let $X_i$ (not in $D_{H_0}$) be a test sample under evaluation. This audio sample creates activations $A_{ij}$ at each node $O_j$ in the network. The $p$-value, $p_{ij}$, is the proportion of background activations $A^{H_0}_{zj}$ greater than the activation $A_{ij}$ induced by the test sample at node $O_j$. We convert the test sample $X_i$ to a vector of $p$-values $P_{ij}$ of length $J = |O|$. The key assumption is that under the alternative hypothesis of an anomaly present in the activation data, at least some subset of the activations will systematically appear extreme. We now turn to non-parametric scan statistics to identify and quantify this set of $p$-values.
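The empirical $p$-value computation described above can be sketched as follows. This is a minimal illustration assuming activations are already extracted into NumPy arrays; the function name and array layout are hypothetical. It follows the text's definition literally (proportion of background activations strictly greater than the test activation); some implementations add smoothing so that no $p$-value is exactly zero.

```python
import numpy as np


def empirical_pvalues(background, test):
    """Right-tailed empirical p-values, one per node.

    background: shape (M, J), activations of M clean samples at J nodes.
    test:       shape (J,), activations of the sample under evaluation.
    """
    background = np.asarray(background, dtype=float)
    test = np.asarray(test, dtype=float)
    # Per node, count background activations that exceed the test activation.
    greater = (background > test[None, :]).sum(axis=0)
    # Convert counts to proportions of the M background samples.
    return greater / background.shape[0]


# Toy data: 4 background samples, 2 nodes.
background = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
test = np.array([4.0, 9.0])
pvals = empirical_pvalues(background, test)
print(pvals)
```

A small $p$-value at a node means the test sample's activation there is higher than almost all activations seen for clean samples, which is the kind of extremeness the scan statistic later aggregates.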

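The general form of the NPSS score function is

$$F(S) = \max_{\alpha} F_{\alpha}(S) = \max_{\alpha} \phi(\alpha, N_{\alpha}(S), N(S))$$

where $N(S)$ represents the number of empirical $p$-values contained in subset $S$ and $N_{\alpha}(S)$ is the number of $p$-values less than $\alpha$ (significance level) contained in subset $S$.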

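There are well-known goodness-of-fit statistics that can be utilized in NPSS (McFowland et al., 2018). In this work we use the Berk-Jones test statistic (Berk and Jones, 1979): $\phi_{BJ}(\alpha, N_{\alpha}, N) = N \cdot KL\left(\frac{N_{\alpha}}{N}, \alpha\right)$, where $KL(x, y) = x \log\frac{x}{y} + (1 - x) \log\frac{1 - x}{1 - y}$ is the Kullback-Leibler divergence between the observed and expected proportions of significant $p$-values.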
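As an illustration of the Berk-Jones statistic, the sketch below computes $N \cdot KL(N_\alpha / N, \alpha)$ for a given threshold. The function names are hypothetical, and returning 0 when the observed proportion does not exceed $\alpha$ is an assumption (a one-sided convention in which only an excess of small $p$-values counts as evidence), not something stated in the text above.

```python
import math


def kl_bernoulli(x, y):
    """KL divergence between Bernoulli(x) and Bernoulli(y), with 0*log(0) = 0."""
    total = 0.0
    if x > 0:
        total += x * math.log(x / y)
    if x < 1:
        total += (1 - x) * math.log((1 - x) / (1 - y))
    return total


def berk_jones(alpha, n_alpha, n):
    """Berk-Jones score N * KL(N_alpha / N, alpha) for 0 < alpha < 1."""
    observed = n_alpha / n
    if observed <= alpha:
        # Assumed one-sided convention: no excess of significant p-values.
        return 0.0
    return n * kl_bernoulli(observed, alpha)


# 20 of 100 p-values fall below alpha = 0.05, four times the expected 5.
score = berk_jones(0.05, 20, 100)
print(score)
```

The score grows both with the size of the subset and with how far the observed proportion of significant $p$-values departs from the expected proportion $\alpha$.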

Efficient Maximization of NPSS: Although NPSS provides a means to evaluate the anomalousness of a subset of node activations, discovering which of the $2^J$ possible subsets provides the most evidence of an anomalous pattern is computationally infeasible for moderately sized data sets. However, NPSS has been shown to satisfy the linear-time subset scanning (LTSS) property (Neill, 2012), which allows for an efficient and exact maximization over subsets of data.

The LTSS property uses a priority function to rank nodes and then proves that the highest-scoring subset consists of the “top-k” priority nodes for some $k$ in $1 \ldots J$. The priority of a node for NPSS is the proportion of $p$-values that are less than $\alpha$. However, because we are scoring a single audio sample and there is only one $p$-value at each node, the priority of a node is either 1 (when the $p$-value is less than $\alpha$) or 0 (otherwise). Therefore, for a fixed, given $\alpha$ threshold, the most anomalous subset is all and only nodes with $p$-values less than $\alpha$.

To maximize the scoring function over $\alpha$ we first sort the nodes by their $p$-values. Let $S_{(k)}$ be the subset containing the $k$ nodes with the smallest $p$-values. Let $\alpha_k$ be the largest $p$-value among these nodes. The LTSS property guarantees that the highest-scoring subset (over all $\alpha$ thresholds) will be one of these $J$ subsets $S_{(1)}, S_{(2)}, \ldots, S_{(J)}$ with their corresponding $\alpha_k$ threshold. Any subset of nodes that does not take this form (or uses an alternate $\alpha_k$) is provably sub-optimal and not considered. Critically, this drastically reduced search space still guarantees identifying the highest-scoring subset of nodes for a test audio sample under evaluation. Pseudo-code for subset scanning over activations for audio samples can be found in Algorithm 1 in the Appendix.
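The sorted search described above can be sketched as follows. This is an illustrative implementation, not the authors' code: the function name, the `alpha_max` value used in the example, and the `alpha_floor` guard against $p$-values of exactly zero are all assumptions. Because every node in the top-$k$ subset has a $p$-value at or below the subset's threshold, the Berk-Jones statistic is evaluated with the number of significant $p$-values equal to the subset size $k$, where it reduces to $k \log(1/\alpha_k)$.

```python
import numpy as np


def scan_single_sample(pvalues, alpha_max, alpha_floor=1e-6):
    """Return (best_score, best_alpha, best_nodes) for one sample's p-values."""
    pvalues = np.asarray(pvalues, dtype=float)
    order = np.argsort(pvalues, kind="stable")  # node indices, smallest p first
    sorted_p = pvalues[order]
    best_score, best_alpha, best_k = 0.0, None, 0
    for k in range(1, len(sorted_p) + 1):
        # Threshold for the top-k subset: its largest p-value (floored to
        # avoid log(1/0); the floor is an assumption of this sketch).
        alpha_k = max(sorted_p[k - 1], alpha_floor)
        if alpha_k > alpha_max:
            break  # larger thresholds are outside the allowed range
        score = k * np.log(1.0 / alpha_k)  # Berk-Jones with N_alpha = N = k
        if score > best_score:
            best_score, best_alpha, best_k = score, alpha_k, k
    return best_score, best_alpha, order[:best_k]


pvals = [0.9, 0.01, 0.02, 0.6, 0.03]
score, alpha, nodes = scan_single_sample(pvals, alpha_max=0.5)  # illustrative alpha_max
print(score, alpha, sorted(nodes.tolist()))
```

Only $J$ candidate subsets are scored after one sort, rather than all $2^J$ subsets, which is what makes scanning a layer with thousands of nodes practical.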

3 Experiments & Results

We introduce the datasets, target models, and attack types used to evaluate our method. We describe the experimental setup and summarize our results in Table 1.


Mozilla Common Voice dataset: Common Voice is a public audio dataset provided by Mozilla that contains voice recordings of human speakers. We resample the subset used in our experiments to 16 kHz; samples in our subset have an average duration of 3.9 seconds.

LibriSpeech dataset: LibriSpeech (Panayotov et al., 2015) is a corpus of approximately 1000 hours of 16 kHz English speech derived from audiobooks from the LibriVox project. Samples in our subset have an average duration of 4.3 seconds.

Target Models and Attacks:

DeepSpeech: We apply the CW attack (Carlini and Wagner, 2018) on version 0.4.1 of DeepSpeech (Hannun et al., 2014). We set the number of iterations to 100 with a learning rate of 100 and generate adversarial examples for Mozilla Common Voice (the first 100 test instances) and LibriSpeech (Panayotov et al., 2015) (the first 200 test-clean instances) with success rates of 92% and 94.5% respectively.

Lingvo: We apply the Qin attack (Qin et al., 2019) on the Lingvo system (Shen et al., 2019) with a stage 1 learning rate of 100 with 1000 iterations and a stage 2 learning rate of 0.1 with 4000 iterations. We generate adversarial examples for LibriSpeech (the first 130 test-clean instances) with a 100% success rate.

Experimental setup:

The goal of subset scanning is to identify the most anomalous (highest scoring according to a non-parametric scan statistic) subset of nodes in a specific layer of the ASR for a given audio waveform. For each of our experiments, we extract the node activations after the activation function, in this case ReLU (non-parametric scan statistics can be applied to any activation function and architecture). Offline, we extract activations from clean audio samples that form our distribution of activations under a null hypothesis of no adversarial noise present. Activations from samples in the evaluation set are compared against the activations from the clean samples to create empirical $p$-values for each sample. These $p$-values are scored by non-parametric scan statistics to quantify the anomalousness of each sample in the evaluation set (see Section 2). We set $\alpha_{max}$ to the same fixed value for all experiments. Future supervised experiments could tune $\alpha_{max}$ to increase detection power further. We evaluate the ability to separate clean samples from adversarial samples using AUC, which is a threshold-independent metric, and refer to this AUC score as detection power.
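Detection power as defined above is the AUC obtained when the scan scores are used to rank clean against adversarial samples. The sketch below computes it directly from its pairwise definition; the function name is hypothetical and the score values are made up for illustration, not taken from the experiments.

```python
import numpy as np


def detection_power(clean_scores, adversarial_scores):
    """AUC: probability that a random adversarial sample outscores a random
    clean sample, with ties counted as one half."""
    clean = np.asarray(clean_scores, dtype=float)
    adv = np.asarray(adversarial_scores, dtype=float)
    wins = (adv[:, None] > clean[None, :]).sum()
    ties = (adv[:, None] == clean[None, :]).sum()
    return (wins + 0.5 * ties) / (len(adv) * len(clean))


# Made-up scan scores for three clean and two adversarial samples.
auc = detection_power([1.0, 2.0, 3.0], [2.5, 4.0])
print(auc)
```

An AUC of 0.5 means the scores carry no information about which samples are adversarial, and values well below 0.5 mean the adversarial samples systematically score lower than clean ones at that layer.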

DeepSpeech experiment on Common Voice: We draw the first 1000 samples from the training set and choose the first 800 as our background and the remaining 200 as our clean samples. We randomly draw 90 samples from the 92 adversarially generated examples as our adversarial samples.
DeepSpeech experiment on LibriSpeech: We draw the first 1000 samples from the test-clean set and choose the first 800 as our background and the remaining 200 as our clean samples. We randomly draw 90 samples from the 189 adversarially generated examples as our adversarial samples.
Lingvo experiment on LibriSpeech: We draw samples 201 to 600 from the test-clean set and choose the first 300 as our background and the remaining 100 as our clean samples. We randomly draw 100 samples from the 130 adversarially generated examples as our adversarial samples.

Because the audio samples vary in length, the outputs of the activation nodes also vary in length. We take the minimum length across all sets (background, clean, and adversarial) and cut off the activations at this length. Since both evaluated attacks perturb the entire audio waveform, we believe the adversarial pattern will still be detected. We experimented with choosing different segments of the activations and observed no significant change in detection power.
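The truncation step can be sketched as follows, assuming each sample's layer output is stored as a NumPy array of shape (time steps, units). That layout and the function name are assumptions of this sketch; the paper does not specify how activations are stored.

```python
import numpy as np


def truncate_to_common_length(*activation_sets):
    """Cut every (time, units) activation array to the shortest time length
    found in any set, then flatten each sample into one vector of nodes."""
    min_len = min(a.shape[0] for s in activation_sets for a in s)
    return [np.stack([a[:min_len].reshape(-1) for a in s])
            for s in activation_sets]


# Toy layer with 3 units and samples of different durations.
background = [np.ones((5, 3)), np.ones((7, 3))]
clean = [np.ones((4, 3))]
adversarial = [np.ones((6, 3))]
bg, cl, adv = truncate_to_common_length(background, clean, adversarial)
print(bg.shape, cl.shape, adv.shape)
```

After truncation every sample has the same number of nodes, so the background activations and a test sample can be compared node by node.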


We show the results of scanning over specific layers of both evaluated models, as well as the number of node activations for each layer, in Table 1. Subset scanning achieves AUC as high as 0.973 on Common Voice and 0.982 on LibriSpeech with DeepSpeech, and 0.755 on LibriSpeech with Lingvo. We leave the exploration of which layers provide the most discriminative potential for future work. These results show that subset scanning is an effective method for detecting adversarial audio attacks. Given these results, we think that studying properties of adversarial examples that hold across multiple domains presents an interesting research direction.

Model, Data, Attack                 Layer dimensions       Temporal Dependency   Subset scanning
(Bgd, Clean, Adversarial sizes)     (see A.1 for names)    (WER)                 (Detection Power)

DeepSpeech, Common Voice, CW        80, 2048               0.936                 0.283
(800, 200, 90)                      80, 2048                                     0.158
                                    80, 4096                                     0.973
                                    80, 2048                                     0.903

DeepSpeech, LibriSpeech, CW         64, 2048               0.930                 0.568
(800, 200, 90)                      64, 2048                                     0.038
                                    64, 4096                                     0.982
                                    64, 2048                                     0.527

Lingvo, LibriSpeech, Qin            179, 40, 32            Not applied           0.755
(300, 100, 100)                     212, 20, 32                                  0.491
                                    423, 40, 32                                  0.571
                                    212, 20, 32                                  0.479

Table 1: Detection power across attacks and datasets. We compare with the current state-of-the-art detection method, Temporal Dependency, and show competitive detection power.

4 Conclusion

In this work, we proposed an unsupervised method for adversarial audio attack detection with subset scanning. Our method can detect multiple state-of-the-art adversarial attacks across multiple datasets. This detection power comes from the idea that adversarially noised samples produce anomalous activations in neural networks that are detectable by efficiently searching over subsets of these activations. Whereas existing work on defending against audio adversarial attacks proposes preprocessing the audio waveform as a means of defense, we treat the problem as anomalous pattern detection without a priori knowledge of the attack or labeled examples. We also do not rely on training data augmentation or specialized training techniques, and our method can be complementary to existing preprocessing techniques. Future work will focus on leveraging the information contained in which subset of nodes optimized the scoring function for a given sample. This could lead towards new methods of neural network visualization and explainability.


  • N. Akhtar and A. Mian (2018) Threat of adversarial attacks on deep learning in computer vision: a survey. IEEE Access 6, pp. 14410–14430. Cited by: §1.
  • M. Alzantot, B. Balaji, and M. Srivastava (2018) Did you hear that? adversarial examples against automatic speech recognition. arXiv preprint arXiv:1801.00554. Cited by: §1, §1.
  • R. H. Berk and D. H. Jones (1979) Goodness-of-fit test statistics that dominate the Kolmogorov statistics. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 47, pp. 47–59. Cited by: §2.
  • B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli (2013) Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 387–402. Cited by: §1.
  • N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. Cited by: §1, §1.
  • N. Carlini and D. Wagner (2018) Audio adversarial examples: targeted attacks on speech-to-text. In 2018 IEEE Security and Privacy Workshops (SPW), pp. 1–7. Cited by: §1, §1, §1, §1, §3.
  • F. Chen and D. B. Neill (2014) Non-parametric scan statistics for event detection and forecasting in heterogeneous social media graphs. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, pp. 1166–1175. Cited by: §2.
  • M. Cisse, Y. Adi, N. Neverova, and J. Keshet (2017) Houdini: fooling deep structured prediction models. arXiv preprint arXiv:1707.05373. Cited by: §1.
  • N. Das, M. Shanbhogue, S. Chen, L. Chen, M. E. Kounavis, and D. H. Chau (2018) Adagio: interactive experimentation with adversarial attack and defense for audio. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 677–681. Cited by: §1, §1.
  • Y. Gong and C. Poellabauer (2017) Crafting adversarial examples for speech paralinguistics applications. arXiv preprint arXiv:1711.03280. Cited by: §1, §1.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1.
  • A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al. (2014) Deep speech: scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567. Cited by: §3.
  • E. McFowland, S. Somanchi, and D. B. Neill (2018) Efficient Discovery of Heterogeneous Treatment Effects in Randomized Experiments via Anomalous Pattern Detection. ArXiv e-prints. External Links: 1803.09159 Cited by: §2, §2.
  • E. McFowland III, S. D. Speakman, and D. B. Neill (2013) Fast generalized subset scan for anomalous pattern detection. The Journal of Machine Learning Research 14 (1), pp. 1533–1561. Cited by: §2.
  • D. B. Neill and J. Lingwall (2007) A nonparametric scan statistic for multivariate disease surveillance. Advances in Disease Surveillance 4, pp. 106. Cited by: §2.
  • D. B. Neill (2012) Fast subset scan for spatial pattern detection. Journal of the Royal Statistical Society (Series B: Statistical Methodology) 74 (2), pp. 337–360. Cited by: §2.
  • V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. Cited by: §3, §3.
  • N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami (2016) The limitations of deep learning in adversarial settings. In 2016 IEEE European symposium on security and privacy (EuroS&P), pp. 372–387. Cited by: §1.
  • Y. Qin, N. Carlini, I. Goodfellow, G. Cottrell, and C. Raffel (2019) Imperceptible, robust, and targeted adversarial examples for automatic speech recognition. arXiv preprint arXiv:1903.10346. Cited by: §1, §1, §1, §1, §3.
  • K. Rajaratnam, K. Shah, and J. Kalita (2018) Isolated and ensemble audio preprocessing methods for detecting adversarial examples against automatic speech recognition. arXiv preprint arXiv:1809.04397. Cited by: §1, §1.
  • L. Schönherr, S. Zeiler, T. Holz, and D. Kolossa (2019) Robust over-the-air adversarial examples against automatic speech recognition systems. arXiv preprint arXiv:1908.01551. Cited by: §1.
  • J. Shen, P. Nguyen, Y. Wu, Z. Chen, M. X. Chen, Y. Jia, A. Kannan, T. Sainath, Y. Cao, C. Chiu, et al. (2019) Lingvo: a modular and scalable framework for sequence-to-sequence modeling. arXiv preprint arXiv:1902.08295. Cited by: §3.
  • V. Subramanian, E. Benetos, M. Sandler, et al. (2019) Robustness of adversarial attacks in sound event classification. Cited by: §1, §1.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1, §1.
  • R. Taori, A. Kamsetty, B. Chu, and N. Vemuri (2018) Targeted adversarial examples for black box audio systems. arXiv preprint arXiv:1805.07820. Cited by: §1.
  • H. Yakura and J. Sakuma (2018) Robust audio adversarial example for a physical attack. arXiv preprint arXiv:1810.11793. Cited by: §1.
  • Z. Yang, B. Li, P. Chen, and D. Song (2018) Characterizing audio adversarial examples using temporal dependency. arXiv preprint arXiv:1809.10875. Cited by: §1, §1.

Appendix A Appendix

This appendix provides implementation details that are relevant to readers wishing to implement their own version of our experiments. Code is provided on GitHub. We also provide the algorithm pseudo-code as well as detailed detection power plots. All the evaluated attacks are implemented and open-sourced by their authors.

A.1 Detection power plots across multiple layers for evaluated attack-dataset pairs

Figure 1: ROC curves for scores from each of the adversarial samples as compared to the scores from evaluation sets containing all clean samples for multiple layers.

A.2 Scanning Algorithm

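input : Background set of samples: $X_z \in D_{H_0}$, evaluation sample: $X_i$, $\alpha_{max}$.
output : Score $F(S^*)$ for the evaluation sample
$f \leftarrow$ TrainNetwork (training dataset);
$f_y \leftarrow$ Some flattened layer of $f$;
for $z \leftarrow 1$ to $M$ do
       for $j \leftarrow 1$ to $J$ do
             $A^{H_0}_{zj} \leftarrow$ ExtractActivation ($f_y$, $X_z$)
       end for
end for
for $j \leftarrow 1$ to $J$ do
       $A_{ij} \leftarrow$ ExtractActivation ($f_y$, $X_i$);
       $p_{ij} \leftarrow$ proportion of background activations $A^{H_0}_{zj}$ greater than $A_{ij}$
end for
$p^{s}_{i} \leftarrow$ SortAscending ($\{p_{ij} : p_{ij} < \alpha_{max}\}$);
for $k \leftarrow 1$ to $|p^{s}_{i}|$ do
       $\alpha_k \leftarrow$ the $k$-th smallest $p$-value in $p^{s}_{i}$;
       $F(S_{(k)}) \leftarrow$ NPSS ($\alpha_k$, k, k);
end for
$k^* \leftarrow \arg\max_k F(S_{(k)})$;
return $S^* = S_{(k^*)}$, $\alpha^* = \alpha_{k^*}$, and $F(S^*)$
Algorithm 1 Pseudo-code for subset scanning over activations of individual audio samples.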