Recent investigations in machine learning have demonstrated that machine learning systems are vulnerable to carefully designed inputs known as adversarial examples [1, 2], a fundamental type of adversarial learning attack called test-time evasion. Early adversarial attacks were applied to machine learning models in the image domain [4, 5, 6, 7, 8], and these attacking methods have since spread to other domains, e.g. speech signals [9, 10, 11, 12, 13]. The adversary adds a very small optimized perturbation, imperceptible to humans, to a legitimate input and generates an adversarial example that causes the learning model to return a wrong output. It is therefore important to be able to detect adversarial attacks and subsequently, for example, prevent passing the attacking signals on to the learning model, a speech recognizer in the context of this work, so as to avoid outputting wrong results.
In general, there are two main categories of adversarial attacks: 'targeted' and 'non-targeted' attacks. In non-targeted attacks the adversary aims to make the learning model return a wrong output, and any wrong output suffices. In targeted attacks the adversary aims to make the learning model return a particular output that is wrong and different from the expected output. Generating targeted examples is generally more difficult. This research focuses on targeted attacks on speech signal inputs for speech-to-text tasks. It is noted that adversarial examples in this context differ from data examples generated through generative adversarial networks [14, 15], where examples are generated via adversarial training for the purpose of e.g. data augmentation and enhancement.
The adversary’s level of access to the victim learning model categorizes attacking methods into two types: 1) white-box, where the adversary has full access to the layers and parameters of the victim learning model, and 2) black-box, where the adversary has no such access. The state-of-the-art white-box attack in the speech domain is the Carlini & Wagner (C&W) method, a gradient-based method using iterative optimization that achieved a 100% success rate in their experiments. The similarity between normal and corresponding adversarial examples is 99% when the victim model is Baidu DeepSpeech and the dataset is the Mozilla Common Voice dataset. As a black-box attack, the Alzantot method is a gradient-free method using genetic algorithm optimization that has reported an 87% overall success rate. The similarity between normal and corresponding generated adversarial examples is 85% when the victim model is the Speech Command classification algorithm and the dataset is the Google Speech Command dataset.
Due to the high success rates of adversarial attacks and their high similarity to normal examples, distinguishing adversarial from normal examples is a highly motivated task. Research on adversarial attack detection or characterization has mostly focused on the image domain [20, 21]. One of the latest and most comprehensive works in the speech domain uses temporal dependencies to characterize adversarial examples. Defense methods, e.g. signal transformation and obfuscated gradients, have also been investigated, but they provide rather limited robustness improvement in the face of advanced attacks. Audio preprocessing methods and their ensembles for defense against black-box attacks have been studied, as has noise flooding applied to signals for defending against black-box examples. All these methods are concerned with modifying input signals and testing the behaviours of the recognition model, and they have moderate success. There is a lack of dedicated systems for detecting both white-box and black-box adversarial examples in the audio and speech domain. This motivates us to formulate the defense against speech adversarial attacks as a classification problem, design a strategy for generating adversarial example datasets covering both black-box and white-box attacks, and propose a CNN-based detection system. The created adversarial example datasets and the source code for adversarial example detection are made publicly available at http://kom.aau.dk/~zt/online/adversarial_examples.
In generating adversarial examples, we take into account the length of the source speech signals, the proportion of speech and non-speech in a signal, and the length of the targeted sentences, and we consider both white-box and black-box attacks. For feature extraction, we apply cepstral features. For the detection model, we propose a CNN structure with small kernels in order to detect the small perturbations in adversarial examples. To our knowledge, this is the first practical investigation of adversarial attack detection in speech recognition tasks.
2 Adversarial attack algorithms and dataset generation
This section introduces two state-of-the-art attacking methods, one for white-box and another for black-box, and describes how we systematically generate datasets for detection of adversarial examples.
2.1 Attacking methods
One of the most successful white-box attacking methods is C&W. This method uses the Connectionist Temporal Classification (CTC) loss function for perturbation optimization. As in all white-box attacks, the adversary has full access to all layers and parameters of the speech recognition model and can use all gradients to minimize the perturbation while maximizing the success rate. Equation 1 shows the optimization of the perturbation in the C&W method:

$\min_{\delta} \; \|\delta\|_2^2 + c \cdot \ell_{CTC}(x + \delta, t) \quad \text{s.t.} \quad \mathrm{dB}_x(\delta) \le \tau$ (1)

where $\delta$ is the added perturbation, $x$ is the legitimate speech example, $t$ is the target desired by the adversary, and $\ell_{CTC}$ is the CTC loss function. The value $c$ is updated to balance changing the example into an adversarial one against keeping it close to the original normal example. The parameter $\tau$ is a threshold indicating that the scale of the perturbation should not exceed $\tau$, in order to keep the perturbation very low.
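This iterative scheme can be illustrated with a minimal numpy sketch. It is not the authors' implementation: a simple quadratic loss (`toy_loss_grad`) stands in for the CTC loss, whose gradient would in reality come from the recognition network, and an infinity-norm clip stands in for the dB threshold $\tau$.

```python
import numpy as np

def toy_loss_grad(x_adv, target):
    # Stand-in for the gradient of the CTC loss with respect to the input;
    # in the real attack this gradient is backpropagated through the recognizer.
    return x_adv - target

def cw_style_attack(x, target, c=0.1, tau=0.05, lr=0.01, steps=200):
    """Sketch: minimize |delta|_2^2 + c * loss(x + delta, target),
    while keeping the perturbation magnitude below the threshold tau."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        grad = 2.0 * delta + c * toy_loss_grad(x + delta, target)
        delta -= lr * grad
        delta = np.clip(delta, -tau, tau)  # enforce the perturbation bound
    return x + delta

x = np.zeros(100)        # "legitimate" signal (toy)
target = np.ones(100)    # stand-in for the adversary's target
x_adv = cw_style_attack(x, target)
```

The clip plays the role of the constraint $\mathrm{dB}_x(\delta) \le \tau$: the example moves toward the target only as far as the perturbation budget allows.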
As an example of black-box attacking methods, we choose the Alzantot attacking method. This is a gradient-free method that uses a genetic algorithm; it has access only to the input and output of the victim speech recognition system. The difference between normal and adversarial examples for the two attacking methods is illustrated in Fig. 1 using three spectrograms. The first one (a) belongs to a normal command example chosen from the Google Speech Command dataset; the original sentence of the sample is "yes". Spectrograms (b) and (c) are the adversarial examples generated with the C&W method and the Alzantot method, respectively, using the same target word "right".
Figure 1 shows that the C&W method makes fewer changes in the speech region of the original signal. The patterns of perturbation in the non-speech region are different in (b) and (c). When we listen to examples (b) and (c), we can clearly hear "yes", but low-level noise is present in the background.
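The gradient-free search behind the Alzantot method can be sketched as an elitist genetic algorithm. This is an illustrative toy, not the original implementation: `black_box_score` is a hypothetical fitness function standing in for querying the victim recognizer and scoring how close its output is to the target.

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box_score(candidate, target):
    # Hypothetical fitness: a real attack would query the victim model and
    # score the similarity of its output to the target transcript.
    return -np.mean((candidate - target) ** 2)

def genetic_attack(x, target, pop_size=20, n_elite=5, sigma=0.01, gens=200):
    """Gradient-free search: keep the fittest candidates (elitism) and
    mutate them with small Gaussian noise, generation after generation."""
    pop = [x + rng.normal(0.0, sigma, x.shape) for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=lambda c: black_box_score(c, target), reverse=True)
        elites = pop[:n_elite]  # best candidates survive unchanged
        children = [elites[rng.integers(n_elite)] + rng.normal(0.0, sigma, x.shape)
                    for _ in range(pop_size - n_elite)]
        pop = elites + children
    return max(pop, key=lambda c: black_box_score(c, target))

x = np.zeros(50)
target = 0.1 * np.ones(50)
x_adv = genetic_attack(x, target)
```

Only fitness queries are needed, which is what makes the approach applicable when gradients of the victim model are unavailable.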
2.2 Dataset generation
Based on the two attacking methods introduced above, we design and generate two separate datasets: A for white-box attacks and B for black-box attacks. For the white-box attacks, the C&W method is chosen to attack Baidu DeepSpeech as the victim learning model using the Mozilla Common Voice dataset. As speech signals in the Mozilla Common Voice dataset have different lengths, we need to choose from this huge dataset a subset of signals of various lengths in a principled way. According to our preliminary tests, there are three possibilities regarding the length of the original signal relative to the length of the target: 1) the utterance is longer than the target, 2) the utterance is as long as the target, and 3) the utterance is shorter than the target.
Therefore, we categorize speech signals in the dataset into three classes in terms of signal length:
Short signals: Audio files with the length of 1-2 seconds.
Medium signals: Audio files with the length of 3-4 seconds.
Long signals: Audio files with the length of 6-7 seconds.
The attack target is the other side of our equation. We thus also categorize our attacking targets into the same classes according to the common sentence lengths in the dataset, and in this work compose the following targets:
Short target: ”Open all doors”.
Medium target: ”Switch off wifi connection”.
Long target: ”I need a reservation for sixteen people at the seafood restaurant down the street”.
Another consideration when choosing audio files is to ensure a sufficient proportion of speech in the signal. Some samples are minutes long but contain no speech, and they are not good candidates for our dataset. We have chosen only audio files that contain more than 68% speech, as measured by the open-source robust Voice Activity Detection (rVAD) algorithm.
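The selection step above can be sketched as follows. The crude energy-based voice activity detector here is only a stand-in for the rVAD algorithm used in the paper; the frame length and RMS threshold are illustrative choices.

```python
import numpy as np

def speech_proportion(signal, frame_len=400, rms_threshold=0.01):
    # Crude energy-based VAD (stand-in for rVAD): a frame counts as
    # speech if its RMS energy exceeds a threshold.
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return float(np.mean(rms > rms_threshold))

def is_good_candidate(signal, min_speech=0.68):
    # Keep only files with more than 68% speech, as in the paper.
    return speech_proportion(signal) > min_speech

t = np.arange(16000) / 16000.0                  # 1 s at 16 kHz
speechy = 0.5 * np.sin(2 * np.pi * 200.0 * t)   # loud tone ~ "speech"
silent = 0.001 * np.ones(16000)                 # near-silence
print(is_good_candidate(speechy), is_good_candidate(silent))  # → True False
```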
For each of the three categories of normal examples (short, medium and long), we chose 100 examples to be attacked. Each example was attacked using the three targets (short, medium and long). Consequently, we generated 3 × 100 × 3 (i.e. 900) adversarial examples. To have a balanced dataset without repeated samples, we chose 900 normal examples (300 examples for each category), different from the previously chosen normal ones, to represent the normal class. In the end, dataset A contains 900 adversarial examples and 900 normal examples.
The black-box method uses mutual targeting on the Google Speech Command dataset, from which 10 different commands have been chosen. For each command, we use 20 samples. The attacking algorithm generates adversarial examples for each command using all other 9 commands as the target, so-called mutual targeting. As a result, we have 10 × 20 × 9 (i.e. 1800) adversarial examples in our generated dataset. Afterwards, we add another 1800 normal examples (180 examples for each command) to the dataset. Dataset B thus contains 1800 normal and 1800 adversarial examples. Note that all samples in both the Mozilla Common Voice and Google Speech Command datasets were recorded at a 16 kHz sampling rate, and adversarial examples were generated at the same rate.
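The mutual-targeting bookkeeping is easy to verify. The ten command labels below are illustrative stand-ins for the commands selected from the Google Speech Command dataset:

```python
from itertools import product

# Illustrative labels for the 10 chosen commands.
commands = ["yes", "no", "up", "down", "left", "right",
            "on", "off", "stop", "go"]
samples_per_command = 20

# Mutual targeting: every sample of each command is attacked with each of
# the other nine commands as the target.
attack_pairs = [(src, tgt) for src, tgt in product(commands, repeat=2) if src != tgt]
n_adversarial = len(attack_pairs) * samples_per_command
print(len(attack_pairs), n_adversarial)  # → 90 1800
```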
In this work, we used a Linux machine with four Nvidia GeForce GTX TITAN GPUs. We applied the default parameters of the original attacking methods to generate adversarial examples for both black-box and white-box attacks. The average time to generate a black-box adversarial example for a one-second audio file is approximately 40 seconds, and the average time to generate a white-box adversarial example for the same audio and the same target is around 4 minutes.
3 Detection Method
In this section we present the speech feature used in this work followed by our detection method.
3.1 Speech Feature
The Mel-frequency cepstral coefficient (MFCC) feature, which has shown good performance across many tasks, is used in this work, while cepstral features with different filter banks are of interest for further study. The process of MFCC feature extraction involves dividing the signal into small frames, applying Mel-frequency filterbanks to the frames, and taking logarithmic compression and the discrete cosine transform. The extracted MFCCs of a signal have dimensions N × M, where N indicates the number of coefficients and M the number of frames. In this work, N is fixed for all signals while M varies across speech samples depending on their length. As the classifier (detection model) we use a CNN, which is widely used for speech applications [30, 18], but input samples of different sizes present a problem for a CNN. This can be solved by zero padding all samples: the dimension M is set to the maximum value over our audio samples, which is 698 (the longest audio files in the long category of dataset A). In practice, there are various ways to make this computation more efficient, which is out of the scope of this work.
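The zero-padding step can be sketched as follows. The value N = 13 used in the toy example is an assumed, typical number of coefficients, not one stated in the paper; the maximum of M = 698 frames is taken from the paper.

```python
import numpy as np

MAX_FRAMES = 698  # longest file in the long category of dataset A

def pad_features(mfcc, max_frames=MAX_FRAMES):
    """Zero-pad an N x M MFCC matrix along the frame axis so that every
    sample shares one fixed CNN input size of N x max_frames."""
    n_coeffs, n_frames = mfcc.shape
    padded = np.zeros((n_coeffs, max_frames), dtype=mfcc.dtype)
    padded[:, :n_frames] = mfcc
    return padded

mfcc_short = np.ones((13, 120))  # N = 13 coefficients is an assumed choice
padded = pad_features(mfcc_short)
print(padded.shape)  # → (13, 698)
```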
3.2 CNN architecture
Let us consider the input speech signal represented in two dimensions as an N × M matrix. The correlations in both the time and cepstral dimensions are to be captured by a sliding receptive field. Therefore, we convolve the input with a weight matrix W of size a × b, where a and b give the extent of the receptive field in the time and cepstral directions, respectively. Consequently, we obtain feature maps of size p × q, where p and q indicate the dimensions of the feature maps.
The proposed CNN architecture, shown in Fig. 2, is a simple and typical one that uses three convolutional layers with small receptive fields. The first and second 2D convolutional layers have 64 filters with a ReLU activation function, while the third one has 32 filters with a linear SeLU activation function. A max-pooling layer follows each convolutional layer. Next, there is a fully connected layer of size 128 and a binary softmax output layer that predicts whether the input is normal or adversarial.
The CNN architecture is designed to be very sensitive to small changes in both time and cepstral directions. Consequently, it can recognize small changes with special patterns. We expect that this model performs well to detect adversarial perturbations. Detailed architecture of the proposed CNN model is presented in Table 1.
| Layer | Filters | Kernel / pool size | Activation |
| Convolutional 2D | 64 | kernel size = (2, 2) | ReLU |
| MaxPooling 2D | – | pool size = (1, 3) | – |
| Convolutional 2D | 64 | kernel size = (2, 2) | ReLU |
| MaxPooling 2D | – | pool size = (1, 1) | – |
| Convolutional 2D | 32 | kernel size = (2, 2) | Linear |
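The feature-map sizes through the layers of Table 1 can be traced with simple shape arithmetic. This sketch assumes 'valid' convolutions and an input of N = 13 coefficients by M = 698 frames; N = 13 is an assumed, typical value rather than one stated in the paper.

```python
def conv2d_valid(shape, kernel):
    # A 'valid' 2D convolution shrinks each dimension by (kernel - 1).
    return (shape[0] - kernel[0] + 1, shape[1] - kernel[1] + 1)

def maxpool(shape, pool):
    # Max pooling divides each dimension by the pool size (floor).
    return (shape[0] // pool[0], shape[1] // pool[1])

shape = (13, 698)                    # assumed N x M input
shape = conv2d_valid(shape, (2, 2))  # Conv2D, 64 filters, ReLU
shape = maxpool(shape, (1, 3))       # MaxPooling2D
shape = conv2d_valid(shape, (2, 2))  # Conv2D, 64 filters, ReLU
shape = maxpool(shape, (1, 1))       # MaxPooling2D (identity here)
shape = conv2d_valid(shape, (2, 2))  # Conv2D, 32 filters, linear
print(shape)  # → (10, 230)
```

The small (2, 2) kernels shrink each dimension only slightly per layer, which keeps fine-grained perturbation patterns visible deep into the network.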
4 Experiments and results
This section presents experimental settings, results and discussion.
4.1 Experimental settings
The experiments were conducted on the two generated datasets A (white-box) and B (black-box). Six experimental scenarios were designed, as shown in the first three columns of Table 2. The goal is to test the detection performance of a narrowly trained system (trained on only one type of attack) for both known and unknown attacks, and the performance of a broadly trained system (trained on both types of attack). We expect high accuracy when training and testing conditions match, while performance on unknown attacks is expected to be low. Multi-condition training with A and B is expected to boost the performance on both types of attacks with a single system.
Table 2: The scenarios and accuracy results on testing data with 95% confidence intervals for our experiments.
To evaluate our CNN detection model, we separated each dataset (A and B) into training and testing subsets, with 75% for training and 25% for testing. It was also ensured that the adversarial examples in the training subset were generated from source examples different from those used for generating the testing adversarial examples. All models were trained for 100 epochs, which is enough for all of them to converge.
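The constraint that training and testing adversarial examples come from disjoint source signals amounts to a grouped split. A minimal sketch, with hypothetical source identifiers:

```python
import random

def grouped_split(examples, source_of, test_frac=0.25, seed=0):
    """Split examples so that everything derived from one source signal
    lands in the same subset."""
    sources = sorted({source_of(e) for e in examples})
    rng = random.Random(seed)
    rng.shuffle(sources)
    n_test = int(len(sources) * test_frac)
    test_sources = set(sources[:n_test])
    train = [e for e in examples if source_of(e) not in test_sources]
    test = [e for e in examples if source_of(e) in test_sources]
    return train, test

# Toy usage: 8 source signals, 3 adversarial examples derived from each
examples = [("src%d" % s, t) for s in range(8) for t in range(3)]
train, test = grouped_split(examples, source_of=lambda e: e[0])
print(len(train), len(test))  # → 18 6
```

Splitting by source rather than by example prevents the detector from being tested on perturbed versions of signals it has already seen during training.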
4.3 Results and analysis
To obtain more reliable results, we ran all experiments under the same conditions with the same configurations, and each experiment was run 10 times. The last column of Table 2 shows overall accuracy with 95% confidence intervals for all scenarios on testing data. The accuracies for matched training and testing conditions (1 and 4) are above 98%, and those for multi-condition training (5 and 6) are above 96%. These results show that the CNN model can learn adversarial perturbations very well, although noise exists in some normal speech examples. The CNN model detects white-box examples better than black-box examples, which is slightly surprising since the similarity between normal and adversarial examples is 99% for white-box and 85% for black-box. This result can be explained by looking at the adversarial perturbations generated by each method: the white-box perturbation is very small but has special patterns (as shown in Fig. 1), whereas the black-box perturbation, although larger, is more similar to normal real-world noise.
The results of the mismatched training and testing conditions (2 and 3), as shown in Table 2, are very different. Accuracy for (Train A, Test B) is lower than in matched conditions, but is still around 81%. The accuracy of (Train B, Test A), however, is only at the level of random guessing. This result indicates that the perturbations of the two methods are quite different and that dedicated training is required.
The detailed results for Experiment 1 (Train A, Test A) are shown in Table 3. The highest detection accuracy occurs when the difference between the length of the source signal and the length of the attacking target is largest; conversely, the closer the two lengths are, the lower the accuracy.
Figure 3 illustrates the detailed results for Experiment 4 (Train B, Test B). It shows that the more similar the source labels and target sentences are in terms of phonemes, the lower the detection accuracy.
One further experiment was conducted to evaluate performance on unknown target sentences. Table 4 shows the results of testing adversarial examples with unknown targets and lengths using dataset A. The high accuracies indicate that the model is robust in detecting targets with unknown sentences and unknown lengths.
|Train Short & Medium, Test Long|
|Train Short & Long, Test Medium|
|Train Medium & Long, Test Short|
5 Conclusion
We have presented a strategy to systematically generate two separate datasets of adversarial speech examples using state-of-the-art attacking methods, one for white-box attacks and one for black-box attacks. We then proposed a CNN model for adversarial attack detection and evaluated it on the generated datasets through matched, mismatched and multi-condition training and testing strategies. Experimental results demonstrate that it is feasible to train a learning model to accurately detect adversarial examples generated by known attacking methods, while detecting unknown attacks deserves more attention. The work further provides insights into the behaviours of different training strategies.
We wish to thank the authors of the Mozilla Common Voice dataset and the Google Speech Command dataset, which our datasets are built upon, for making their datasets publicly available.
-  Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus, “Intriguing properties of neural networks,” CoRR, vol. abs/1312.6199, 2014.
-  Ian Goodfellow, Jonathon Shlens, and Christian Szegedy, “Explaining and harnessing adversarial examples,” in International Conference on Learning Representations, 2015.
-  David J Miller, Zhen Xiang, and George Kesidis, “Adversarial learning in statistical classification: A comprehensive review of defenses against attacks,” arXiv preprint arXiv:1904.06292, 2019.
-  Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Xiaodong Song, “Delving into transferable adversarial examples and black-box attacks,” CoRR, vol. abs/1611.02770, 2017.
-  Nicholas Carlini and David A. Wagner, “Towards evaluating the robustness of neural networks,” 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57, 2017.
-  Pin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh, “Ead: Elastic-net attacks to deep neural networks via adversarial examples,” 09 2017.
-  Dong Su, Huan Zhang, Hongge Chen, Jinfeng Yi, Pin-Yu Chen, and Yupeng Gao, “Is robustness the cost of accuracy? – a comprehensive study on the robustness of 18 deep image classification models,” 08 2018.
-  Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard, “Deepfool: A simple and accurate method to fool deep neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2574–2582, 2016.
-  Nicholas Carlini and David Wagner, “Audio adversarial examples: Targeted attacks on speech-to-text,” in 2018 IEEE Security and Privacy Workshops (SPW), 2018, pp. 1–7.
-  Moustafa Alzantot, Bharathan Balaji, and Mani B. Srivastava, “Did you hear that? adversarial examples against automatic speech recognition,” CoRR, vol. abs/1801.00554, 2017.
-  Moustapha Cissé, Yossi Adi, Natalia Neverova, and Joseph Keshet, “Houdini: Fooling deep structured prediction models,” CoRR, vol. abs/1707.05373, 2017.
-  Felix Kreuk, Yossi Adi, Moustapha Cisse, and Joseph Keshet, “Fooling end-to-end speaker verification by adversarial examples,” 01 2018.
-  Xuejing Yuan, Yuxuan Chen, Yue Zhao, Yunhui Long, Xiaokang Liu, Kai Chen, Shengzhi Zhang, Heqing Huang, Xiaofeng Wang, and Carl A. Gunter, “Commandersong: A systematic approach for practical adversarial voice recognition,” 01 2018.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
-  Daniel Michelsanti and Zheng-Hua Tan, “Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification,” in Interspeech 2017. ISCA, 2017, pp. 2008–2012.
-  Eric Battenberg, Jitong Chen, Rewon Child, Adam Coates, Yashesh Gaur, Yi Li, Hairong Liu, Sanjeev Satheesh, David Seetapun, Anuroop Sriram, and Zhenyao Zhu, “Exploring neural transducers for end-to-end speech recognition,” 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 206–213, 2017.
-  “Mozilla common voice dataset,” 2019.
-  Tara Sainath and Carolina Parada, “Convolutional neural networks for small-footprint keyword spotting,” in Interspeech, 2015.
-  Pete Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” CoRR, vol. abs/1804.03209, 2018.
-  Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman, “Pixeldefend: Leveraging generative models to understand and defend against adversarial examples,” CoRR, vol. abs/1710.10766, 2018.
-  Weilin Xu, David Evans, and Yanjun Qi, “Feature squeezing: Detecting adversarial examples in deep neural networks,” CoRR, vol. abs/1704.01155, 2018.
-  Zhuolin Yang, Bo Li, Pin-Yu Chen, and Dawn Xiaodong Song, “Characterizing audio adversarial examples using temporal dependency,” CoRR, vol. abs/1809.10875, 2018.
-  Anish Athalye, Nicholas Carlini, and David A. Wagner, “Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples,” in ICML, 2018.
-  Krishan Rajaratnam, Kunal Shah, and Jugal Kalita, “Isolated and ensemble audio preprocessing methods for detecting adversarial examples against automatic speech recognition,” arXiv preprint arXiv:1809.04397, 2018.
-  Krishan Rajaratnam and Jugal Kalita, “Noise flooding for detecting audio adversarial examples against automatic speech recognition,” in 2018 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT). IEEE, 2018, pp. 197–201.
-  Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning, New York, NY, USA, 2006, ICML ’06, pp. 369–376, ACM.
-  Zheng-Hua Tan and Børge Lindberg, “Low-complexity variable frame rate analysis for speech recognition and voice activity detection,” IEEE Journal of Selected Topics in Signal Processing, vol. 4, pp. 798–807, 2010.
-  Steven Davis and Paul Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE transactions on acoustics, speech, and signal processing, vol. 28, no. 4, pp. 357–366, 1980.
-  Hong Yu, Zheng-Hua Tan, Yiming Zhang, Zhanyu Ma, and Jun Guo, “Dnn filter bank cepstral coefficients for spoofing detection,” IEEE Access, vol. 5, pp. 4779–4787, 2017.
-  O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional neural networks for speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1533–1545, Oct 2014.