On the human evaluation of audio adversarial examples

by Jon Vadillo, et al.

Human-machine interaction is increasingly dependent on speech communication. Machine Learning models are usually applied to interpret human speech commands. However, these models can be fooled by adversarial examples, which are inputs intentionally perturbed to produce a wrong prediction without being noticed. While much research has focused on developing new techniques to generate adversarial perturbations, less attention has been given to the aspects that determine whether and how the perturbations are noticed by humans. This question is relevant since the high fooling rates of proposed adversarial perturbation strategies are only valuable if the perturbations are not detectable. In this paper we investigate to what extent the distortion metrics proposed in the literature for audio adversarial examples, which are commonly applied to evaluate the effectiveness of methods for generating these attacks, are a reliable measure of the human perception of the perturbations. Using an analytical framework, and an experiment in which 18 subjects evaluate audio adversarial examples, we demonstrate that the metrics employed by convention are not a reliable measure of the perceptual similarity of adversarial examples in the audio domain.




1 Introduction

Human-computer interaction increasingly relies on Machine Learning (ML) models such as Deep Neural Networks (DNNs) trained on, usually large, datasets Fang et al. (2018); Gao et al. (2019); Hassan et al. (2018); Nunez et al. (2018). The ubiquitous application of DNNs in security-critical tasks, such as face identity recognition Sun et al. (2014); Parkhi et al. (2015), speaker verification Heigold et al. (2016); Snyder et al. (2017), voice controlled systems Feng et al. (2017); Boles and Rad (2017); Gong and Poellabauer (2018) or signal forensics Bayar and Stamm (2018, 2017); Zeng et al. (2017); Athulya et al. (2017), requires these computational models to be highly reliable. However, it has been demonstrated that such models can be fooled by perturbing an input sample with malicious and quasi-imperceptible perturbations. These attacks are known in the literature as adversarial examples Szegedy et al. (2014); Goodfellow et al. (2014). Because these attacks are designed to be hardly detectable, they pose a serious concern regarding the reliable application of DNNs in adversarial scenarios.

The study of adversarial examples has focused primarily on the image domain and computer vision tasks Akhtar and Mian (2018), whereas domains such as text or audio have received much less attention. In fact, such domains imply additional challenges and difficulties. One of the evident differences between domains is the way in which the information is represented, and, therefore, the way in which adversarial perturbations are measured, bounded and perceived by human subjects. In the image domain, $\ell_p$ norms are mainly used as a basis to measure the distortion between the original signal and the adversarial example. However, recent works have pointed out that such metrics do not always properly represent the perceptual distortion introduced by adversarial perturbations Fezza et al. (2019); Jordan et al. (2019); Dukler et al. (2019). Although in some works $\ell_p$ norms are also used during the generation of adversarial examples to limit the amount of perturbation Alzantot et al. (2018); Gong and Poellabauer (2017), in the audio domain more representative metrics are usually employed for acoustic signals, such as signal-to-noise ratio (SNR) Yakura and Sakuma (2018); Du et al. (2019) or Sound Pressure Level (SPL) Zhang et al. (2017a); Roy et al. (2017); Abdoli et al. (2019). These metrics are computed in decibels (dB), which is a standard scale employed for acoustic signals. However, even for such metrics, measuring the perceptual distortion of the attacks is not straightforward, as other characteristics, such as time-frequency properties, have a high influence. In text problems the difficulty of characterizing the perceptual distortion is even greater, because every change is inevitably noticeable, and therefore, the aim is to produce semantically and syntactically similar adversarial examples Alzantot et al. (2018).

In this paper we focus on the human evaluation of adversarial examples in the audio domain. A more comprehensive approach to evaluating adversarial distortions can serve to better understand the risks of adversarial attacks in the audio domain. For instance, the development of adversarial defenses or secure human-machine interaction systems can focus on the more effective, unnoticeable attacks. Therefore, the goal of this study is to analyze the human perception of audio adversarial perturbations according to different factors. Based on these results, we will also study to what extent the similarity metrics employed in the literature are suitable to model such a subjective criterion.

The remainder of the paper is organized as follows: In the following section we introduce the main concepts related to adversarial examples and review previous approaches to evaluate the distortion produced by adversarial perturbation in the audio domain. This section also highlights a number of research questions related to the evaluation of audio distortion that have not been previously addressed. Section 3 describes the selected task, target models and dataset, as well as the particular method employed for generating adversarial perturbation in the audio domain. Section 4 presents a preliminary evaluation of the adversarial perturbation according to the metrics proposed in the literature. In Section 5, we present the design of an experiment to find answers to some of the issues involved in the perceptual evaluation of the perturbations. The results of the experiment in which human subjects evaluate different aspects of the adversarial perturbations are also presented and discussed. Section 6 concludes the paper and identifies lines for future research.

2 Related work

The existence of adversarial examples which are able to fool DNNs has been reported for many different audio related tasks, such as automatic speech recognition Alzantot et al. (2018); Carlini and Wagner (2018); Neekhara et al. (2019), music content analysis Kereliuk et al. (2015) or sound classification Abdoli et al. (2019). A common adversarial attack scheme is represented in Fig. 1. Note that it is assumed that an adversary can feed the perturbed signal directly into the model. Even if this is a common assumption, some works have demonstrated that such attacks can be designed to work in the physical world Yakura and Sakuma (2018); Carlini et al. (2016); Yuan et al. (2018).

Our work builds on previous research where adversarial perturbations for audio command classification have been introduced Vadillo and Santana (2019). However, the evaluation methods and type of analysis presented in this paper are valid for other approaches conceived to generate audio adversarial examples for other ML tasks.

Figure 1: Illustration of an adversarial attack, in which an adversarial perturbation is added to a clean audio waveform, forming an adversarial example which is misclassified by a target DNN model, while not altering the human perception of the audio.

2.1 Adversarial example: formal description

Let $f: X \rightarrow Y$ be a classification model, which classifies an input $x$ from the input space $X$ as one of the classes represented in $Y = \{y_1, \dots, y_k\}$. An adversarial example $x'$ is defined as $x' = x + v$, where $v$ represents the adversarial perturbation capable of producing a misclassification of $f$ for the (correctly classified) input $x$: $f(x') \neq f(x)$. A necessary requirement for an adversarial attack is that the perturbation should be imperceptible, and therefore, the goal is to minimize the distortion introduced by $v$ as much as possible, according to a suitable distortion metric.

Depending on the objective of the attack, adversarial examples can be categorized in different ways. First of all, a targeted adversarial example consists of a perturbed sample which satisfies $f(x') = t$, where $t$ represents the target (incorrect) label that we want the model to produce. In contrast, an untargeted adversarial example only requires the output label to be incorrect, $f(x') \neq f(x)$, without any additional requirement on the output class assigned to $x'$.

Furthermore, depending on the scope of the adversarial perturbation $v$, we can differentiate between individual and universal adversarial perturbations. In the first case, the perturbation is crafted specifically to be applied to one particular input $x$. Therefore, it is not expected that the same perturbation will be able to fool the model for a different sample. In the second case, universal adversarial examples are input-agnostic perturbations able to fool the model independently of the input. In Vadillo and Santana (2019), different levels of universality are proposed, depending on the number of classes for which the perturbation is expected to work. The first universality level comprises single-class universal perturbations that are conceived to fool the target model only for inputs of one particular class. We will focus on single-class universal perturbations, although our findings regarding the weaknesses and gaps in the evaluation of adversarial perturbations are not restricted to this universality level.

2.2 Methods for assessing audio adversarial perturbations

In this section we collect the strategies employed by previous works in order to verify that audio perturbations are not detectable by humans. Even if an essential requirement for adversarial perturbations to pose a real threat is that they must be imperceptible, a good specification of such a (mainly subjective) constraint is not straightforward, and, indeed, is not well established yet.

Furthermore, even if the analysis is constrained to the audio domain, the understanding and definition of what makes a sample natural is closely tied to the ML task that is being solved by the model (e.g., it might be harder to categorize a music tune as “unnatural” than a spoken command). With a large variety of ML tasks related to the analysis of acoustic signals (e.g., speech recognition, music content analysis or ambient sound classification), each of them may therefore require a different criterion to assess the distortion of the adversarial examples according to human perception. Although a number of strategies have been proposed in these domains Roy et al. (2017); Kereliuk et al. (2015); Zhang et al. (2017a); Schönherr et al. (2018); Carlini and Wagner (2018), we focus on those suitable for spoken commands. Among these strategies are:

  • Thresholding the perturbation amount

  • Models of human perception and hearing system

  • Human evaluation

2.2.1 Thresholding the perturbation amount

The methods discussed in this section rely on limiting or measuring the perturbation amount that is added to the original input, according to a distortion metric, to ensure that the perturbations are imperceptible or quasi-imperceptible, or that the distortion levels are below a maximum acceptable threshold.

In Alzantot et al. (2018), the perturbation applied to spoken commands is restricted to the 8 least-significant bits of a subset of samples in a 16 bits-per-sample audio file. Similarly, in Gong and Poellabauer (2017), the effectiveness of the proposed attack for speech paralinguistic tasks is measured for different perturbation amounts under the $\ell_\infty$ norm. The restrictions applied in both cases guarantee that the maximum change applicable to each value of the signal is constrained. However, such thresholds are not representative for acoustic signals, as they do not guarantee a low perceptual distortion on audio attacks.

In Carlini and Wagner (2018); Neekhara et al. (2019); Yang et al. (2018), which address audio adversarial perturbations for speech recognition models, the relative loudness of the adversarial perturbation $v$ with respect to the original signal $x$ is measured in decibels (dB), which is a more representative metric for acoustic signals:

$$dB_x(v) = dB(v) - dB(x), \quad \text{where } dB(x) = \max_i 20 \log_{10}(|x_i|). \tag{1}$$

In Du et al. (2019), the signal-to-noise ratio (SNR) is used to measure the relative distortion of adversarial perturbations for speech recognition models, computed as:

$$SNR(x, v) = 10 \log_{10} \frac{P(x)}{P(v)},$$

where $P(x)$ and $P(v)$ represent the power of the clean signal $x$ and the perturbation $v$, respectively. The SNR has been used in other works on audio adversarial examples Kereliuk et al. (2015); Carlini et al. (2016); Yuan et al. (2018); Yakura and Sakuma (2018); Abdoli et al. (2019). However, these works are not based on speech signals, as their approaches rely on data with very different characteristics, such as urban sound classification, music content analysis or the injection of malicious commands into songs. Therefore, the results are not directly comparable to speech recognition, the task addressed in this paper.
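As a rough illustration of the two decibel-based measures above, the following sketch computes the relative loudness of equation (1) and the SNR for signals stored as plain Python lists of float samples. The function names are hypothetical, power is assumed to be the mean squared amplitude, and both signals are assumed to contain at least one nonzero sample.

```python
import math

def db(signal):
    """Peak level in decibels: max_i 20*log10(|s_i|). Assumes a nonzero peak."""
    peak = max(abs(s) for s in signal)
    return 20.0 * math.log10(peak)

def db_difference(x, v):
    """Relative loudness of perturbation v w.r.t. clean signal x (more negative = quieter)."""
    return db(v) - db(x)

def snr(x, v):
    """Signal-to-noise ratio in dB; power is assumed to be the mean squared amplitude."""
    power_x = sum(s * s for s in x) / len(x)
    power_v = sum(s * s for s in v) / len(v)
    return 10.0 * math.log10(power_x / power_v)

# Toy example: a clean signal with peak 1.0 and a perturbation with peak 0.01.
x = [0.0, 0.5, -1.0, 0.5]
v = [0.01, -0.01, 0.01, -0.01]
print(db_difference(x, v))  # about -40 dB
print(snr(x, v))            # about 35.7 dB
```

Note that the two measures move in opposite directions: a quieter perturbation yields a more negative decibel difference but a higher SNR.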

2.2.2 Models of human perception and hearing system

The human hearing system is able to identify sounds in a range from 20Hz to 20kHz, so perturbations outside this range cannot be perceived Rosen and Howell (2010); Rossing (2007). Based on this fact, in Zhang et al. (2017a) and Roy et al. (2017), high frequencies are used to generate audio that is inaudible to humans but is captured and classified by a device. Although these attacks may not fit our specification of adversarial examples (since humans cannot perceive the generated audio, and therefore cannot judge it as benign either), they introduce the idea of using frequencies outside the human hearing range in adversarial scenarios.

A different strategy is employed in Schönherr et al. (2018), where a psychoacoustic model Zwicker and Fastl (2013) is used to compute the hearing thresholds of different zones of the clean audio signal, which is used to restrict the perturbation to the least perceptible parts.

2.2.3 Human evaluation

In Cisse et al. (2017) and Kreuk et al. (2018), an ABX test is performed, which is a standard method to identify detectable differences between two choices of sensory stimuli. In this method a subject is asked to listen to two audios A and B, and afterwards a third audio X, which will be either A or B, randomly selected. The objective of this test is to assess if the user is able to distinguish between A and B. Optimally, the accuracy ratio would be 50%, equal to the probability of selecting randomly between the two choices. In our scenario, the two initial audios A and B would correspond to the clean and perturbed audio (in any order).
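To make the 50% reference point concrete, the sketch below scores a set of ABX trials and computes how likely it would be to obtain at least that many correct answers by pure guessing. The helper names and the toy responses are hypothetical, and the one-sided exact binomial test is only one possible way to analyze such data; it is not taken from the cited works.

```python
import math

def abx_accuracy(responses, truths):
    """Fraction of trials in which the subject correctly identified X."""
    correct = sum(r == t for r, t in zip(responses, truths))
    return correct / len(truths)

def chance_p_value(correct, trials, chance=0.5):
    """Probability of at least `correct` hits in `trials` when guessing at random."""
    return sum(
        math.comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )

# Toy data: identity of X in each trial and the subject's answers.
truths    = ["A", "B", "A", "B", "A", "B", "A", "B", "A", "B"]
responses = ["A", "B", "B", "A", "A", "B", "B", "A", "A", "A"]

accuracy = abx_accuracy(responses, truths)
n_correct = round(accuracy * len(truths))
print(accuracy, chance_p_value(n_correct, len(truths)))
```

An accuracy close to 0.5 together with a large probability under guessing suggests that the subject cannot tell the clean and perturbed audio apart.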

In Yakura and Sakuma (2018) and Yuan et al. (2018) the adversarial perturbations are embedded in songs, which can be deployed in the physical world without raising the suspicion of human listeners (e.g., in elevators or TV advertisements) to force a target model to understand speech commands. In both works a human evaluation is carried out on Amazon Mechanical Turk. According to the results presented by the authors, almost none of the participants perceived speech in the perturbed signals. However, a considerable percentage of people reported that an abnormal noise could be noticed in the songs.

In Schönherr et al. (2018), a Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) test is carried out Schinkel-Bielefeld et al. (2013), to perform a subjective assessment of the audio quality of adversarial examples. The goal of the test is to score the quality of perturbed audio signals (anchors, e.g., adversarial examples) with respect to the original signal (hidden reference, in this context, the original audio). (It is worth mentioning that this test is mainly used to assess the intermediate quality level of coding systems, whereas for small impairments, which should be the case of audio adversarial perturbations, dedicated tests have been proposed.) According to the results, the adversarial examples obtained considerably lower scores than the clean audio signals.

Finally, in Vaidya et al. (2015); Carlini et al. (2016); Alzantot et al. (2018); Gong and Poellabauer (2017); Du et al. (2019), experiments with human subjects are performed with the aim of analyzing their response to the task, in order to assess if the adversarial perturbation has any influence on the responses provided by human listeners. However, no analysis of the perceptual distortion introduced by the perturbations is reported, except in Du et al. (2019), in which the subjects are asked to evaluate the noise level of the audio signals.

2.3 Summary

Although different methods have been proposed to measure the distortion levels introduced by audio adversarial perturbations, we found that the majority of the approaches are insufficient to adequately represent the human perception of these attacks. In addition, some of the thresholds and acceptable distortion levels assumed in previous works do not always guarantee that the perturbations are imperceptible, and therefore, the undetectability of the attacks is questionable. With this paper, we intend to provide evidence and raise awareness about these gaps. We hope that the results reported may contribute to establishing a more thorough measurement of the distortion, and therefore, to a more realistic study of audio adversarial examples.

3 Adversarial examples of speech commands

Our goal is to evaluate the detectability of audio adversarial perturbation, and to determine to what extent the metrics commonly used in the literature agree with the human evaluation. To accomplish this goal, we should first establish a number of stepping stones:

  1. Identify a suitable and representative audio task.

  2. Identify a model appropriate for the task.

  3. Collect or identify a dataset to train a model.

  4. Using the model, generate the adversarial examples for the task.

  5. Estimate the actual fooling rate of the adversarial examples.

3.1 Selection of the task, model, and dataset

The task we have selected is speech command classification since it is an exemplar machine learning task which is part of the repertoire of extensively used speech-based virtual assistants.

The DNN model we have selected is based on the architecture proposed for small-footprint keyword recognition Sainath and Parada (2015). This model takes as input an audio waveform, computes the spectrogram of the signal and extracts a set of MFCC features for different time intervals. This results in a two-dimensional representation of the audio signal, which is fed into the following topology: two convolutional layers with a ReLU activation function, a fully-connected layer and a softmax layer. The same architecture has been used in related works on adversarial examples Alzantot et al. (2018); Du et al. (2019) and as a baseline model in other research tasks Warden (2018); Zhang et al. (2017b).

We used the Speech Command Dataset Warden (2018), which is a widely used dataset in the study of adversarial attacks for speech recognition systems Alzantot et al. (2018); Yang et al. (2018); Du et al. (2019). The dataset is composed of recordings of 30 different spoken commands, provided by a large number of different people. Audio files are stored in 16-bit WAV format, with a sample rate of 16kHz and a fixed duration of one second. As in previous publications Warden (2018); Alzantot et al. (2018), we selected the following subset of commands to develop our work: “Yes”, “No”, “Up”, “Down”, “Left”, “Right”, “On”, “Off”, “Stop”, and “Go”. Additionally, we consider two special classes: “Unknown” (a spoken command not considered in the previous set), and “Silence” (no speech detected in the audio). Note that this selection comprises a wide variety of commands in terms of phonetic similarity.

3.2 Generating single-class universal perturbations

As previously mentioned, we focus on single-class universal perturbations, an attack approach that attempts to generate a single perturbation which is able to fool the model for any input corresponding to a particular class $y_i$. We decided to focus on universal perturbations because initial experimentation with individual perturbations (crafted using the Deepfool algorithm) led us to the conclusion that those perturbations were undoubtedly imperceptible. This conclusion has been reported before in the literature Fezza et al. (2019) for the case of image adversarial examples. Therefore, we selected the more challenging task of generating universal perturbations, which requires higher distortion levels. Moreover, we selected single-class universal attacks in order to study in more detail the results on different commands. The particular choice of the class to which the target perturbation is applied is a factor that may influence the perceptual distortion of the perturbations.

The selected attack method is based on the strategy proposed in Moosavi-Dezfooli et al. (2017), a state-of-the-art method to generate universal perturbations based on accumulating individual perturbations created for a set of training samples using the Deepfool algorithm Moosavi-Dezfooli et al. (2016). We use the UAP-HC reformulation of this strategy for audio samples as presented in Vadillo and Santana (2019), where more details about the process to generate the perturbations can be found.

We generated 5 different universal perturbations per class, starting from a different training set of 1000 samples in each case. During the crafting process, the norm of the universal perturbations was bounded by a fixed threshold. In addition, the Deepfool algorithm was limited to a maximum of 100 iterations, and its overshoot parameter was set to 0.1. Finally, the UAP-HC algorithm was restricted to 5 epochs, that is, 5 complete passes through the entire training set.

3.3 Effectiveness of the perturbations fooling the model

To measure the effectiveness of the universal perturbations, we compute the percentage of audios for which the prediction changes when the perturbation is applied. We will refer to this metric as fooling ratio (FR) Moosavi-Dezfooli et al. (2017). The effectiveness of the generated perturbations is shown in Table 1, for the training set (the set of samples used to optimize the universal perturbation) and for the validation set (the set of samples used to compute the effectiveness of the attack for inputs not used during the optimization process). (The samples used to optimize the perturbation are selected from the training set used to train the DNN. Equivalently, the validation set of the algorithm is selected from the validation set used during the training process of the model.) Results are shown for the average effectiveness of the 5 perturbations generated for each class, as well as for the one that maximizes the FR on the training set.
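The fooling ratio described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `predict` stands for any function mapping a signal (a list of float samples) to a label, the toy classifier is purely hypothetical, and clipping of the perturbed signal to the valid amplitude range is omitted.

```python
def fooling_ratio(predict, inputs, perturbation):
    """Percentage of inputs whose predicted label changes when the
    same (universal) perturbation is added to each of them."""
    fooled = 0
    for x in inputs:
        x_adv = [xi + vi for xi, vi in zip(x, perturbation)]
        if predict(x_adv) != predict(x):
            fooled += 1
    return 100.0 * fooled / len(inputs)

# Hypothetical toy classifier: the label is the sign of the sample sum.
def predict(signal):
    return "pos" if sum(signal) > 0 else "neg"

inputs = [[0.2, 0.1], [0.05, 0.0], [-0.3, -0.2], [0.5, 0.5]]
v = [-0.1, -0.1]
print(fooling_ratio(predict, inputs, v))  # only the second input changes label
```

Evaluating the same function on samples not used to craft the perturbation gives the validation figure.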

According to the results, the generated adversarial examples are highly effective for the majority of the classes, with a maximum FR above 70% for 7 out of 12 classes in both training and validation sets. Note that we obtain a considerably high effectiveness also in the class unknown, which is composed of a diverse set of spoken commands. However, the hardest class to fool is silence, in which the maximum FR is below 25% in both training and validation sets. This may be due to the fact that, according to the nature of the audios corresponding to that class, trying to fool the model by adding a small amount of noise is a challenging task.

Class      Max. FR%           Mean FR%
           Train    Valid     Train    Valid
Silence    23.80    19.46     22.24    19.61
Unknown    72.70    73.06     70.58    73.51
Yes        74.50    74.36     68.26    66.40
No         86.50    83.77     81.48    79.40
Up         84.20    75.45     82.20    74.73
Down       71.50    65.55     68.06    64.51
Left       52.30    49.73     42.20    40.59
Right      68.70    63.82     60.62    56.47
On         76.00    75.65     54.42    53.28
Off        80.10    73.48     75.18    70.85
Stop       61.40    61.82     56.92    57.30
Go         87.80    80.06     86.24    80.90
Table 1: Effectiveness of the generated single-class universal perturbations.

It is important to bear in mind that the effectiveness of a universal perturbation is directly correlated to the amount of distortion introduced. We show in Fig. 2, for each class, how the FR increases as the distortion introduced by the perturbations increases. These results have been obtained by scaling the magnitude of a universal perturbation according to two distortion criteria: the norm of the perturbation and the decibel difference between the perturbation and each sample of the dataset. In the first case, the perturbation signal is scaled so that its norm equals the desired threshold, and it is applied equally to every input sample. In the second case, the perturbation signal is scaled for every input sample $x$, so that the decibel difference of equation (1) equals the specified threshold.
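The two scaling criteria can be sketched as follows. The helper names are hypothetical, the Euclidean ($\ell_2$) norm is used only as an assumption because the specific norm is not stated here, and the decibel difference is taken to be the peak-based measure of equation (1).

```python
import math

def l2_norm(v):
    """Euclidean norm (assumed choice of norm for this sketch)."""
    return math.sqrt(sum(s * s for s in v))

def scale_to_norm(v, target_norm):
    """Rescale v so that its norm equals target_norm (same v for every input)."""
    factor = target_norm / l2_norm(v)
    return [factor * s for s in v]

def peak_db(signal):
    """max_i 20*log10(|s_i|); assumes a nonzero peak."""
    return 20.0 * math.log10(max(abs(s) for s in signal))

def scale_to_db(v, x, target_db):
    """Rescale v for one input x so that peak_db(v) - peak_db(x) equals target_db."""
    current = peak_db(v) - peak_db(x)
    # Multiplying amplitudes by 10**(d/20) shifts the level by d decibels.
    factor = 10 ** ((target_db - current) / 20.0)
    return [factor * s for s in v]

v = [0.5, -0.25]
x = [1.0, 0.1]
print(l2_norm(scale_to_norm(v, 0.1)))                 # close to 0.1
print(peak_db(scale_to_db(v, x, -40.0)) - peak_db(x)) # close to -40
```

The first criterion yields one perturbation shared by all inputs, whereas the second produces a differently scaled copy for each input, so quiet inputs receive a proportionally quieter perturbation.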

Figure 2: Variation in the effectiveness (FR%) in the validation set of the generated single-class universal adversarial perturbations according to two different criteria: norm of the perturbation (top) and decibel difference with respect to each input signal (bottom).

The fact that the FR is directly correlated with the distortion level implies that there is a trade-off between the effectiveness and the detectability of the attacks. Therefore, to adequately study the risk posed by audio adversarial attacks, it is important to establish realistic and rigorous criteria for assessing the human perception of such attacks.

4 Evaluation of the distortion using similarity metrics

While the ability to fool the model is an essential ingredient of adversarial examples, the other requirement is that the perturbation is not noticed by humans. In this section, we evaluate the distortion produced by the generated adversarial perturbations, according to different criteria.

4.1 Evaluating the distortion: the standard, uninformed way

We first computed the distortion according to the standard approaches employed in previous works on adversarial examples in speech related tasks Carlini and Wagner (2018); Neekhara et al. (2019); Yang et al. (2018), as described in equation (1) (Section 2.2). In Carlini and Wagner (2018), where individual adversarial perturbations are created for speech transcription scenarios, the mean distortion of the generated perturbations is -31dB, and the 95% interval for distortion ranges from -15dB to -45dB. (Note that according to this metric, the lower the distortion value, the less detectable the perturbation.) The same range of distortion is reported in Yang et al. (2018). In Neekhara et al. (2019), where universal adversarial perturbations are also generated for speech transcription models, the distortion level of the perturbations is bounded under different thresholds, and mean distortions are reported for the best and worst cases. Overall, distortion levels below -32dB are considered acceptable in these works.

Fig. 3 shows the distortion level of the generated perturbation with respect to each input sample in the validation set, according to the same approach. Results are computed independently for each class, and averaged over the 5 trials carried out for each of them. Table 2 shows the mean distortion level obtained for each class. As can be seen, the mean distortion is below -40dB in all the classes except silence, in which the mean distortion is -29.52dB. (This effect can be explained by the fact that, due to the nature of the samples corresponding to the class silence, their loudness level is lower than for the rest of the classes.) Moreover, without considering the class silence, roughly 90% or more of the samples (at least 89.18%) are below -32dB in every class. Therefore, our perturbations can be considered highly acceptable according to this standard.

Class      Mean distortion (dB)    % of samples below -32dB
Silence    -29.52                  48.04
Unknown    -41.35                  90.20
Yes        -40.58                  90.45
No         -42.56                  93.09
Up         -40.24                  89.18
Down       -40.63                  90.64
Left       -48.10                  99.03
Right      -43.30                  95.20
On         -46.31                  96.21
Off        -42.01                  94.03
Stop       -43.92                  96.11
Go         -41.88                  93.28
Table 2: Distortion levels produced by the generated single-class universal perturbations (standard evaluation). Results are averaged for the 5 experiments carried out for each class.
Figure 3: Distortion level of the generated single-class universal perturbations, evaluated in the validation set using the standard evaluation approach: applied to the whole signals. Results are averaged for the 5 perturbations generated for each class.

4.2 Evaluating the distortion: detailed and signal-part-informed way

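In order to measure the distortion in more detail, we employed the approach presented in Vadillo and Santana (2019). In this case, the distortion induced by the perturbation $v$ in the original sample $x$ is computed in terms of the difference between both the maximum (as defined in equation (1)) and the mean decibel values, defined as:

$$dB_{x,\max}(v) = \max_i 20 \log_{10}(|v_i|) - \max_i 20 \log_{10}(|x_i|),$$
$$dB_{x,\mathrm{mean}}(v) = \frac{1}{N} \sum_{i=1}^{N} 20 \log_{10}(|v_i|) - \frac{1}{N} \sum_{i=1}^{N} 20 \log_{10}(|x_i|). \tag{2}$$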

Furthermore, previous work on evaluating the naturalness of adversarial examples in the audio domain computes the distortion between two signals by applying the metrics to the entire signals Carlini and Wagner (2018); Yang et al. (2018); Neekhara et al. (2019). In this paper, we advocate the application of both metrics in two different parts of each audio signal: the vocal part and the background part. This differentiation is due to the fact that, for spoken commands, the amount of sound outside the vocal part is considerably lower. Thus, the same amount of perturbation would be perceived differently depending on the affected part. By mapping the distortive effect of the perturbations to these parts of the signals we also get a better assessment of how the attack works.
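A minimal sketch of the maximum and mean decibel differences follows, written so that it can be applied to any segment of a signal (for example, only the vocal or only the background samples). The function names are hypothetical, and the small floor `EPS` is an assumption introduced here to avoid taking the logarithm of exactly zero samples; it is not part of the cited approach.

```python
import math

EPS = 1e-12  # assumed floor so that silent (zero) samples do not break log10

def db_values(signal):
    """Per-sample level in decibels: 20*log10(|s_i|), floored at EPS."""
    return [20.0 * math.log10(max(abs(s), EPS)) for s in signal]

def db_max_diff(x, v):
    """Difference between the maximum decibel values of v and x."""
    return max(db_values(v)) - max(db_values(x))

def db_mean_diff(x, v):
    """Difference between the mean decibel values of v and x."""
    dv, dx = db_values(v), db_values(x)
    return sum(dv) / len(dv) - sum(dx) / len(dx)

# Toy segment: the clean signal has one loud and one quiet sample.
x = [1.0, 0.01]
v = [0.01, 0.001]
print(db_max_diff(x, v))   # about -40 dB
print(db_mean_diff(x, v))  # about -30 dB
```

In this toy case the mean-based difference is less favourable than the peak-based one, because the quiet sample of the clean signal lowers its mean level; this illustrates why the two measures can disagree on low-energy segments.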

As we are handling short single-command audio signals, the vocal part of an audio signal $x = \{x_1, \dots, x_N\}$ will be delimited by the continuous range $[a, b]$ containing 95% of the energy of the signal, that is:

$$\frac{\sum_{i=a}^{b} x_i^2}{\sum_{i=1}^{N} x_i^2} \approx 0.95.$$

Thus, we will assume that the ranges $[1, a)$ and $(b, N]$ are composed just of background noise. Notice that this partition is well suited for single-command audios in which it is assumed that the vocal part of the signal is contiguous. Audio signals belonging to the silence class will be omitted from the analysis of the vocal part, as they are composed only of background noise, without any vocal part.
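One way to locate such a range is sketched below. Since several contiguous ranges can hold 95% of the energy, this sketch assumes that the shortest window holding at least that fraction is wanted; that tie-breaking rule, the function names and the 0-based inclusive indices are assumptions made for illustration.

```python
def vocal_range(x, fraction=0.95):
    """Return (a, b), inclusive 0-based indices of the shortest contiguous
    window holding at least `fraction` of the total energy sum(x_i^2)."""
    energy = [s * s for s in x]
    target = fraction * sum(energy)
    best = (0, len(x) - 1)
    window = 0.0
    a = 0
    for b, e in enumerate(energy):
        window += e
        # Shrink from the left while the window still holds enough energy.
        while a < b and window - energy[a] >= target:
            window -= energy[a]
            a += 1
        if window >= target and (b - a) < (best[1] - best[0]):
            best = (a, b)
    return best

def split_parts(x, fraction=0.95):
    """Split a signal into its vocal samples and its background samples."""
    a, b = vocal_range(x, fraction)
    return x[a:b + 1], x[:a] + x[b + 1:]

# Toy signal: a short burst surrounded by low-level background noise.
x = [0.01, 0.01, 1.0, -1.0, 0.5, 0.01, 0.01]
print(vocal_range(x))
vocal, background = split_parts(x)
print(vocal, background)
```

The same indices can then be used to slice the perturbation, so that any distortion measure is evaluated separately on the vocal and background segments.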

4.3 Results of the different signal-part approach

The results obtained with the described evaluation approach are shown in Fig. 4. The first row of the figure shows the results obtained using the maximum-decibel metric, and the bottom row the results obtained using the mean-decibel metric. Notice the difference between the horizontal axis scales of the figures.

By comparing the perturbations in the vocal part and the background part, it can be seen that perturbations in the vocal part are less noticeable, with a significantly lower decibel difference under both distortion metrics.

Regarding the distortion amount in the vocal part, the obtained results are significantly below the threshold of dB in almost all the samples, independently of the metric. Compared to the sound intensity level of a normal conversation, a distortion of dB corresponds to the weakest audible signal between 10kHz and 100Hz frequency range Smith and others (1997), which is roughly the difference between the ambient noise in a quiet room and a person talking Carlini and Wagner (2018).

While the distortion level outside the vocal part is still acceptable under the metric, according to the metric the distortion exceeds the threshold of -32dB for a great majority of the samples. In fact, in about half of the cases the difference in decibels is greater than -20dB, which may indicate that the perturbations could be highly detectable in those parts.

Figure 4: Distortion level of the generated single-class universal perturbations, evaluated in the validation set using metric (top row) and metric (bottom row). For each audio, the distortion has been measured in the vocal part as well as in the background part. Results are averaged for the 5 perturbations generated for each class.

5 Human evaluation of voice command adversarial examples

While the methods presented in the previous section provide a more accurate and detailed assessment of the quality of the adversarial examples, the metrics used are not expected to capture all the subtleties of a proper human evaluation. Therefore, we designed an experiment in which human subjects listen to audio adversarial examples and judge them according to different criteria. The main goal of the experiment was to study to what extent the perturbations are detectable by humans. In this section we describe the experimental design and its results.

5.1 Experimental design

A set of eighteen subjects (12 men and 6 women), independent of the research, was selected to conduct the experiment. Each participant was instructed to listen to different audio clips and answer some questions about them. The experiment is composed of two parts:

  • In the first part, the naturalness of the generated universal adversarial examples is investigated, together with the extent to which the distortion produced by the perturbation affects the understandability of the spoken commands. To address these questions, the participants are asked to listen to a set of 12 audio clips, either clean or adversarially perturbed, and provide the following information:

    • Identify the command that can be heard in the audio clip, in order to determine if the adversarial perturbations affect the understandability of the spoken commands.

    • Assess the level of naturalness of the audio clip, in order to study whether the adversarial examples are perceived as perturbed audios in comparison to clean instances. As both clean and perturbed audios will be tested, the comparison between the results obtained in both cases may reflect if the perturbations are perceived just as a regular background noise or other ordinary perturbations, or whether they are perceived as artificial or malicious. In the experiment, the subjects evaluated the naturalness on a scale from 1 to 5, with the following scale provided as reference:

      • 1) Clearly perturbed audio with an artificial sound or noise.

      • 2) The audio is slightly perturbed by an artificial sound or noise, not likely to be caused by the low quality of the microphones or ambient sounds.

      • 3) Not sure

      • 4) No obvious signs of an artificial perturbation. The detectable perturbations are likely to be caused by a low- or mid-quality microphone, ambient sounds or ordinary noises.

      • 5) The audio clip clearly does not contain any artificial perturbation.

  • In the second part of the experiment, each participant performed an ABX test, a method to identify detectable differences between two choices of sensory stimuli. In this method, a subject is asked to listen to two audios A and B, and afterwards a third audio X, which will be either A or B, randomly selected. The goal of the test is to evaluate if the subject is able to determine if X corresponds to A or to B. In our experiment, the two initial audios A and B will correspond to the clean and perturbed audio, in any order. Thus, this test will determine if the perturbations are detectable in comparison to the clean audio sample. Six trials were carried out in each experiment, that is, six sets of three audio clips A, B and X.
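The ABX protocol described above can be sketched as a small trial generator. This is only an illustration of the procedure: the dictionary layout, the function names and the file names are hypothetical, and the random generator is seeded just to make the example repeatable.

```python
import random


def make_abx_trial(clean, perturbed, rng):
    """Build one ABX trial: A and B are the clean and perturbed clips in random
    order, and X is a randomly chosen copy of either A or B."""
    pair = [("clean", clean), ("perturbed", perturbed)]
    rng.shuffle(pair)
    a, b = pair
    answer = rng.choice(["A", "B"])  # which of A/B is replayed as X
    x = a if answer == "A" else b
    return {"A": a, "B": b, "X": x, "answer": answer}


def success_rate(trials, responses):
    """Fraction of trials in which the subject identified X correctly."""
    correct = sum(t["answer"] == r for t, r in zip(trials, responses))
    return correct / len(trials)


rng = random.Random(0)
# Six trials per experiment, as in the setup described above.
trials = [make_abx_trial("clean.wav", "adversarial.wav", rng) for _ in range(6)]
guesses = [rng.choice(["A", "B"]) for _ in trials]  # a subject guessing at random
print(success_rate(trials, guesses))
```

A subject who cannot hear the perturbation is expected to score around one half over many trials, which is why the success rate is later compared against chance.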

Due to the fact that the audio clips of the dataset contain different characteristics, such as the intensity of the spoken command or the amount of background noise, the perception of a perturbation may change according to these features. For this reason, we decided to classify the audios considering three levels of intensity: low, medium and high. The distortion metric presented in (5) will be used to measure the mean distortion of the audio signals. According to this metric, 99% of the intensities of the audio samples lie approximately in the decibel range . By performing a rough uniform binning of the intensity range (known as equal-width binning in the literature Dougherty et al. (1995)), the levels were defined as follows:

  • Low intensity level: audios with a mean distortion below 50dB.

  • Medium intensity level: audios with a mean distortion between 50dB and 70dB.

  • High intensity level: audios with a mean distortion above 70dB.
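The binning above can be sketched in a few lines. The 50 dB and 70 dB cut points come from the text; how a value lying exactly on a cut point is assigned is an assumption, and the endpoints 30 and 90 used in the usage line are illustrative values chosen only because three equal-width bins over that range reproduce those cut points.

```python
def equal_width_cuts(low, high, bins):
    """Interior cut points that split [low, high] into `bins` equal-width bins."""
    width = (high - low) / bins
    return [low + width * i for i in range(1, bins)]


def intensity_level(mean_db, low_cut=50.0, high_cut=70.0):
    """Map a mean decibel level to the low / medium / high intensity classes.

    Assumption: values exactly on a cut point are assigned to the medium class.
    """
    if mean_db < low_cut:
        return "low"
    if mean_db <= high_cut:
        return "medium"
    return "high"


print(equal_width_cuts(30.0, 90.0, 3))  # illustrative range, not taken from the text
print([intensity_level(v) for v in (45.0, 60.0, 75.0)])
```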

To ensure a uniform representation of the different levels of intensity, each experiment was composed of audio signals of only one of these levels. Nine different experiments were created (three experiments per intensity level), and each of them was assigned to two different participants, making a total of 18 participants, each completing one experiment. A summary of the final experimental setup is provided in Table 3. The minor imbalance in the number of original and modified audios in the first part of the experiment was accepted in order to ensure greater uniformity in the frequency of each command across the different factors influencing the experiment, as shown in Table 4, as well as to ensure that the model correctly classified the original audio samples but incorrectly classified the adversarial examples.

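| Experiment | Intensity | Clean (part 1) | Adv. (part 1) | Total (part 1) | ABX trials (part 2) |
|---|---|---|---|---|---|
| 1 | Low | 6 | 6 | 12 | 6 |
| 2 | Low | 6 | 6 | 12 | 6 |
| 3 | Low | 5 | 7 | 12 | 6 |
| 4 | Medium | 6 | 6 | 12 | 6 |
| 5 | Medium | 6 | 6 | 12 | 6 |
| 6 | Medium | 7 | 5 | 12 | 6 |
| 7 | High | 6 | 6 | 12 | 6 |
| 8 | High | 6 | 6 | 12 | 6 |
| 9 | High | 7 | 5 | 12 | 6 |

Table 3: Summary of the experimental setup designed for the human evaluation of the distortion produced by the universal perturbations. The clean, adversarial and total columns count the audio samples used in part 1.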
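
| Type | Sil. | Unk. | Yes | No | Up | Down | Left | Right | On | Off | Stop | Go |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Low intensity | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 4 | 6 | 6 | 8 | 6 |
| Medium intensity | 6 | 4 | 8 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| High intensity | 6 | 12 | 8 | 6 | 0 | 12 | 4 | 8 | 2 | 2 | 4 | 8 |
| Clean | 12 | 2 | 8 | 12 | 4 | 22 | 12 | 4 | 4 | 12 | 14 | 4 |
| Adversarial | 6 | 20 | 14 | 6 | 8 | 2 | 4 | 14 | 10 | 2 | 4 | 16 |
| Total frequency | 18 | 22 | 22 | 18 | 12 | 24 | 16 | 18 | 14 | 14 | 18 | 20 |

Table 4: Number of audios per command used in the experiment (part 1).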

5.2 Analysis of the results

5.2.1 Command classification task

The first factor to be analyzed is the accuracy percentage obtained by humans in the command classification task (first part of the experiment), that is, which percentage of samples have been correctly labeled by humans.

According to the results, the total number of wrongly classified instances, considering both clean and adversarial ones, is 13 out of 216 (9 corresponding to clean samples and 4 to adversarial samples), which corresponds to a total accuracy in the command classification of 93.98%. In order to provide more detailed information, the accuracies obtained for each intensity level are shown in Fig. 5, in which the percentages are computed independently for clean instances and for adversarial examples. Overall, these results indicate that the adversarially perturbed spoken commands are clearly recognizable and well classified by humans, independently of the intensity level of the original audio. In other words, although the adversarial perturbations are able to fool the target model, they do not affect the human understanding of the command. The obtained results are consistent with those achieved in Du et al. (2019), where the success rate of a set of people in classifying audio commands is reported using the same dataset as ours, but without considering silence or unknown as classes and without differentiating between the intensity levels of the original signals. According to the results reported in Du et al. (2019), the accuracy in recognizing the commands was 93.5% for clean samples and 92.0% for adversarial examples.
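The arithmetic behind these figures can be reproduced from the counts in Table 3 and the error counts given above. The sketch below assumes that every experiment was completed by exactly two participants, as stated in the experimental design; the variable names are illustrative.

```python
# Clean and adversarial clips per experiment, copied from Table 3 (experiments 1 to 9).
CLEAN_PER_EXPERIMENT = [6, 6, 5, 6, 6, 7, 6, 6, 7]
ADV_PER_EXPERIMENT = [6, 6, 7, 6, 6, 5, 6, 6, 5]
PARTICIPANTS_PER_EXPERIMENT = 2  # each experiment was assigned to two participants


def accuracy(total, errors):
    """Fraction of correctly labeled clips."""
    return (total - errors) / total


n_clean = PARTICIPANTS_PER_EXPERIMENT * sum(CLEAN_PER_EXPERIMENT)
n_adv = PARTICIPANTS_PER_EXPERIMENT * sum(ADV_PER_EXPERIMENT)
errors_clean, errors_adv = 9, 4  # misclassified clips reported in the text

overall = accuracy(n_clean + n_adv, errors_clean + errors_adv)
print(f"evaluated clips: {n_clean + n_adv}")
print(f"overall accuracy: {100 * overall:.2f}%")
print(f"clean accuracy: {100 * accuracy(n_clean, errors_clean):.2f}%")
print(f"adversarial accuracy: {100 * accuracy(n_adv, errors_adv):.2f}%")
```

Under these counts the participants labeled the adversarial clips at least as accurately as the clean ones, which is consistent with the claim that the perturbations do not hinder human understanding of the commands.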

Figure 5: Accuracy percentages achieved by the participants of the experiment in the speech command classification task. Results have been split for each sample type (clean or adversarial) as well as for the intensity levels of the original audios in the experiments (low, medium or high).

5.2.2 Naturalness

The results obtained in the analysis of the naturalness level assigned to the instances are displayed in Fig. 6. The figure shows the frequencies with which samples are classified in each naturalness level, split according to the sample type (clean or adversarial). In addition, the results are jointly computed for all the experiments (top left) as well as for each intensity level individually: low (top right), medium (bottom left) and high (bottom right). Considering all the experiments, it can be observed that the adversarial examples obtained lower scores in comparison to the clean samples. We verified by an exact multinomial statistical test that there exist significant differences regarding the scores assigned to clean and adversarial audios (achieving a p-value below a tolerance of ). Indeed, while % of the clean samples are classified with a naturalness level of 4 or 5, only % of adversarial examples have been classified in the same range. These results indicate that, in general, the adversarial perturbations are perceived in the audio signals as artificial sounds or noises with a considerably higher frequency than in clean samples.

Doing the same analysis independently for each intensity level, it can be observed that the main difference is given in the lowest intensity level, in which % of the adversarial examples achieved a score of or , while only % of clean samples were classified in that range. For the highest intensity level, however, the percentage of adversarial examples scored with a or is even greater than the corresponding percentage for clean samples. Thus, the human perception of the adversarial examples is clearly related to the intensity level of the original audio signals. This is a remarkable fact that should be taken into consideration in the evaluation of audio adversarial examples.

Figure 6: Analysis of the naturalness level assigned to the audio samples of the speech command classification task in all the experiments, split by sample type (clean or adversarial). The results are computed for all the experiments (top left) as well as for each intensity level individually: low (top right), medium (bottom left) and high (bottom right).

5.2.3 ABX test

In order to better evaluate if the perturbations are perceivable, the results obtained in the ABX test (second part of the experiment) have been analyzed. This is summarized in Fig. 7. The first row of the figure shows the percentage of success cases in the ABX test, that is, the percentage of cases in which the unknown audio (audio X) has been correctly classified. The second row shows the confidence level of the answers. All these results have been computed independently for each intensity level.

The success rate of the experiments with low and medium intensity levels is % and % respectively, revealing that the perturbations are clearly perceivable in such cases. In contrast, only a % success rate is achieved for high intensity levels, close to the value of 50% that corresponds to random guessing. We verified by an exact binomial test that the achieved success ratio is not significantly greater (achieving a p-value of ) than the probability $p = 0.5$ corresponding to a binomial distribution $B(n, p)$, where $n$ is the sample size. (The alternative hypothesis of the test is that the empirical success ratio is greater than $p$. The same test with the alternative hypothesis that the empirical ratio is not equal to $p$ obtained a p-value of .) This fact indicates that, in such cases, the adversarial examples are not distinguishable from their corresponding clean audio examples. It is worth noting that, given our experimental setup, the 95% (Clopper–Pearson) confidence intervals of the success ratio are for low intensity audios, for medium intensity audios and for high intensity audios. The results provided can, therefore, be considered representative of the human perception of the distortion.
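Both statistical tools mentioned here, the one-sided exact binomial test against chance and the Clopper–Pearson interval, can be sketched with the standard library alone. The sample size of 36 follows from the setup (three experiments per intensity level, two participants each, six ABX trials each); the success count used in the usage lines is a hypothetical value for illustration and is not a result of the study. The bisection-based interval is one possible implementation, not necessarily the one used by the authors.

```python
import math


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def upper_tail(k, n, p):
    """P(X >= k) for X ~ B(n, p): one-sided exact p-value for 'ratio > p'."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


def lower_tail(k, n, p):
    """P(X <= k) for X ~ B(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(0, k + 1))


def _solve(f, target, increasing, iterations=100):
    """Bisection on [0, 1] for a monotone function f with f(p) == target."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) two-sided confidence interval for a proportion."""
    lower = 0.0 if k == 0 else _solve(lambda p: upper_tail(k, n, p), alpha / 2, True)
    upper = 1.0 if k == n else _solve(lambda p: lower_tail(k, n, p), alpha / 2, False)
    return lower, upper


n_trials = 3 * 2 * 6   # experiments x participants x ABX trials per intensity level
k_hypothetical = 20    # hypothetical number of correct answers, for illustration only
print(upper_tail(k_hypothetical, n_trials, 0.5))
print(clopper_pearson(k_hypothetical, n_trials))
```

With only 36 trials per intensity level the exact interval is wide, which is worth keeping in mind when comparing success rates between levels.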

Consistently with the success rates, the subjects were highly confident in providing their answers in more than % of the cases in the experiments containing audios with low and medium intensity levels. Contrarily, in % of the answers the participants reported a low confidence in the experiments containing audios with high intensity levels.

Overall, these analyses demonstrate that the detectability of the perturbations largely depends on the intensity level of the clean audio, being detectable for audios with low and medium intensity levels, but not perceivable for audios with a high intensity level.

It is worth mentioning that, according to the standard approach used in previous related works to measure the detectability of audio adversarial examples, the crafted perturbations were far below the maximum acceptable distortion. However, the results obtained in this section reinforce our proposal about the need to employ more rigorous approaches in order to measure and set a threshold on the distortion produced by the adversarial perturbations in a more representative way. We encourage the reader to listen to some adversarial examples, to empirically assess the perceptual distortion of adversarial perturbations according to different characteristics: https://vadel.github.io/adversarialDistortion/AdversarialPerturbations.html

Figure 7: Success percentages obtained in the ABX test (top) and confidence levels of the answers in the test (bottom), both computed independently for each intensity level.

6 Conclusions

In this paper we have addressed the measurement of the perceptual distortion of audio adversarial examples, which remains a challenging task despite being a fundamental condition for effective adversarial attacks. For this purpose, we have performed an analysis of the human perception of audio adversarial perturbations for speech command classification tasks, and this analysis has been used to study whether the distortion metrics employed in the literature correlate with the human judgment.

We have found that, while the distortion levels of our perturbations are acceptable according to the standard evaluation approaches employed by convention, the same perturbations were highly detectable and judged as artificial by human subjects. For this reason, we have proposed a novel framework to measure the distortion in a more comprehensive way, based on a differential analysis of the vocal and background parts of the audio signals, which provides a more realistic and rigorous evaluation of the perceptual distortion. Our experiments with single-class universal perturbations for a set of varied commands also demonstrate that there exist differences regarding the effectiveness of the attacks, related to the relative distortion, and show how the perceptual distortion of the perturbations changes depending on the intensity level of the audio signal into which they are injected.

These results highlight the lack of audio metrics capable of modeling human perception in a realistic and representative way, and stress the need to include human evaluation as a necessary step for validating methods used to generate adversarial perturbations in the audio domain. We hope that future work will advance in this direction in order to fairly evaluate the risk that adversarial examples pose.


The authors would like to thank the Intelligent Systems Group (University of the Basque Country, Spain) for providing the computational resources needed to develop the project. This work has received support from the predoctoral grant that Jon Vadillo holds from the Basque Government (reference PRE_2019_1_0128). Roberto Santana acknowledges support by the Basque Government (IT1244-19 and ELKARTEK programs), and Spanish Ministry of Economy and Competitiveness MINECO (project TIN2016-78365-R).


  • [1] S. Abdoli, L. G. Hafemann, J. Rony, I. B. Ayed, P. Cardinal, and A. L. Koerich (2019) Universal adversarial audio perturbations. arXiv preprint arXiv:1908.03173. Cited by: §1, §2.2.1, §2.
  • [2] N. Akhtar and A. Mian (2018) Threat of adversarial attacks on deep learning in computer vision: a survey. IEEE Access 6, pp. 14410–14430. Cited by: §1.
  • [3] M. Alzantot, B. Balaji, and M. Srivastava (2018) Did you hear that? Adversarial examples against automatic speech recognition. arXiv preprint arXiv:1801.00554. Cited by: §1, §2.2.1, §2.2.3, §2, §3.1, §3.1.
  • [4] M. Alzantot, Y. Sharma, A. Elgohary, B. Ho, M. Srivastava, and K. Chang (2018) Generating natural language adversarial examples. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2890–2896. Cited by: §1.
  • [5] M. Athulya, P. Sathidevi, et al. (2017) Mitigating effects of noise in forensic speaker recognition. In 2017 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), pp. 1602–1606. Cited by: §1.
  • [6] B. Bayar and M. C. Stamm (2017) A generic approach towards image manipulation parameter estimation using convolutional neural networks. In Proceedings of the 5th ACM Workshop on Information Hiding and Multimedia Security, pp. 147–157. Cited by: §1.
  • [7] B. Bayar and M. C. Stamm (2018) Constrained convolutional neural networks: a new approach towards general purpose image manipulation detection. IEEE Transactions on Information Forensics and Security 13 (11), pp. 2691–2706. Cited by: §1.
  • [8] A. Boles and P. Rad (2017) Voice biometrics: deep learning-based voiceprint authentication system. In 12th System of Systems Engineering Conference (SoSE), pp. 1–6. Cited by: §1.
  • [9] N. Carlini, P. Mishra, T. Vaidya, Y. Zhang, M. Sherr, C. Shields, D. Wagner, and W. Zhou (2016) Hidden voice commands. In 25th USENIX Security Symposium (USENIX Security 16), pp. 513–530. Cited by: §2.2.1, §2.2.3, §2.
  • [10] N. Carlini and D. Wagner (2018) Audio adversarial examples: Targeted attacks on speech-to-text. arXiv preprint arXiv:1801.01944. Cited by: §2.2.1, §2.2, §2, §4.1, §4.2, §4.3.
  • [11] M. Cisse, Y. Adi, N. Neverova, and J. Keshet (2017) Houdini: fooling deep structured prediction models. arXiv preprint arXiv:1707.05373. Cited by: §2.2.3.
  • [12] J. Dougherty, R. Kohavi, and M. Sahami (1995) Supervised and unsupervised discretization of continuous features. In Machine Learning Proceedings 1995, pp. 194–202. Cited by: §5.1.
  • [13] T. Du, S. Ji, J. Li, Q. Gu, T. Wang, and R. Beyah (2019) SirenAttack: generating adversarial audio for end-to-end acoustic systems. arXiv preprint arXiv:1901.07846. Cited by: §1, §2.2.1, §2.2.3, §3.1, §3.1, §5.2.1.
  • [14] Y. Dukler, W. Li, A. Tong Lin, and G. Montúfar (2019) Wasserstein of Wasserstein loss for learning generative models. Cited by: §1.
  • [15] B. Fang, F. Sun, H. Liu, and C. Liu (2018) 3D human gesture capturing and recognition by the IMMU-based data glove. Neurocomputing 277, pp. 198–207. Cited by: §1.
  • [16] H. Feng, K. Fawaz, and K. G. Shin (2017) Continuous authentication for voice assistants. In Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking, pp. 343–355. Cited by: §1.
  • [17] S. A. Fezza, Y. Bakhti, W. Hamidouche, and O. Déforges (2019) Perceptual evaluation of adversarial attacks for CNN-based image classification. In 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), pp. 1–6. Cited by: §1, §3.2.
  • [18] J. Gao, M. Galley, L. Li, et al. (2019) Neural approaches to conversational AI. Foundations and Trends® in Information Retrieval 13 (2-3), pp. 127–298. Cited by: §1.
  • [19] Y. Gong and C. Poellabauer (2017) Crafting adversarial examples for speech paralinguistics applications. arXiv preprint arXiv:1711.03280. Cited by: §1, §2.2.1, §2.2.3.
  • [20] Y. Gong and C. Poellabauer (2018) An overview of vulnerabilities of voice controlled systems. arXiv preprint arXiv:1803.09156. Cited by: §1.
  • [21] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1.
  • [22] M. M. Hassan, M. Z. Uddin, A. Mohamed, and A. Almogren (2018) A robust human activity recognition system using smartphone sensors and deep learning. Future Generation Computer Systems 81, pp. 307–313. Cited by: §1.
  • [23] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer (2016) End-to-end text-dependent speaker verification. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5115–5119. Cited by: §1.
  • [24] M. Jordan, N. Manoj, S. Goel, and A. G. Dimakis (2019) Quantifying perceptual distortion of adversarial examples. arXiv preprint arXiv:1902.08265. Cited by: §1.
  • [25] C. Kereliuk, B. L. Sturm, and J. Larsen (2015) Deep learning and music adversaries. IEEE Transactions on Multimedia 17 (11), pp. 2059–2071. Cited by: §2.2.1, §2.2, §2.
  • [26] F. Kreuk, Y. Adi, M. Cisse, and J. Keshet (2018) Fooling end-to-end speaker verification with adversarial examples. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1962–1966. Cited by: §2.2.3.
  • [27] S. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard (2017) Universal adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1765–1773. Cited by: §3.2, §3.3.
  • [28] S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2574–2582. Cited by: §3.2.
  • [29] P. Neekhara, S. Hussain, P. Pandey, S. Dubnov, J. McAuley, and F. Koushanfar (2019) Universal adversarial perturbations for speech recognition systems. arXiv preprint arXiv:1905.03828. Cited by: §2.2.1, §2, §4.1, §4.2.
  • [30] J. C. Nunez, R. Cabido, J. J. Pantrigo, A. S. Montemayor, and J. F. Velez (2018) Convolutional neural networks and long short-term memory for skeleton-based human activity and hand gesture recognition. Pattern Recognition 76, pp. 80–94. Cited by: §1.
  • [31] O. M. Parkhi, A. Vedaldi, A. Zisserman, et al. (2015) Deep face recognition. In BMVC, Vol. 1, pp. 6. Cited by: §1.
  • [32] S. Rosen and P. Howell (2010) Signals and systems for speech and hearing. Vol. 29, Brill. Cited by: §2.2.2.
  • [33] T. Rossing (2007) Springer handbook of acoustics. Springer Science & Business Media. Cited by: §2.2.2.
  • [34] N. Roy, H. Hassanieh, and R. Roy Choudhury (2017) Backdoor: making microphones hear inaudible sounds. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services, pp. 2–14. Cited by: §1, §2.2.2, §2.2.
  • [35] T. N. Sainath and C. Parada (2015) Convolutional neural networks for small-footprint keyword spotting. In Sixteenth Annual Conference of the International Speech Communication Association, pp. 1–5. Cited by: §3.1.
  • [36] N. Schinkel-Bielefeld, N. Lotze, and F. Nagel (2013) Audio quality evaluation by experienced and inexperienced listeners. In Proceedings of Meetings on Acoustics ICA2013, Vol. 19, pp. 060016. Cited by: §2.2.3.
  • [37] L. Schönherr, K. Kohls, S. Zeiler, T. Holz, and D. Kolossa (2018) Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding. arXiv preprint arXiv:1808.05665. Cited by: §2.2.2, §2.2.3, §2.2.
  • [38] S. W. Smith et al. (1997) The Scientist and Engineer’s Guide to Digital Signal Processing. Cited by: §4.3.
  • [39] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur (2017) Deep neural network embeddings for text-independent speaker verification. In Interspeech, pp. 999–1003. Cited by: §1.
  • [40] Y. Sun, Y. Chen, X. Wang, and X. Tang (2014) Deep learning face representation by joint identification-verification. In Advances in neural information processing systems, pp. 1988–1996. Cited by: §1.
  • [41] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, pp. 1–10. Cited by: §1.
  • [42] J. Vadillo and R. Santana (2019) Universal adversarial examples in speech command classification. arXiv preprint arXiv:1911.10182. Note: Submitted for publication. Cited by: §2.1, §2, §3.2, §4.2.
  • [43] T. Vaidya, Y. Zhang, M. Sherr, and C. Shields (2015) Cocaine noodles: exploiting the gap between human and machine speech recognition. In 9th USENIX Workshop on Offensive Technologies (WOOT 15), pp. 1–14. Cited by: §2.2.3.
  • [44] P. Warden (2018) Speech commands: a dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209. Cited by: §3.1, §3.1.
  • [45] H. Yakura and J. Sakuma (2018) Robust audio adversarial example for a physical attack. arXiv preprint arXiv:1810.11793. Cited by: §1, §2.2.1, §2.2.3, §2.
  • [46] Z. Yang, B. Li, P. Chen, and D. Song (2018) Characterizing audio adversarial examples using temporal dependency. arXiv preprint arXiv:1809.10875. Cited by: §2.2.1, §3.1, §4.1, §4.2.
  • [47] X. Yuan, Y. Chen, Y. Zhao, Y. Long, X. Liu, K. Chen, S. Zhang, H. Huang, X. Wang, and C. A. Gunter (2018) Commandersong: a systematic approach for practical adversarial voice recognition. In 27th USENIX Security Symposium (USENIX Security 18), pp. 49–64. Cited by: §2.2.1, §2.2.3, §2.
  • [48] J. Zeng, J. Zeng, and X. Qiu (2017) Deep learning based forensic face verification in videos. In 2017 International Conference on Progress in Informatics and Computing (PIC), pp. 77–80. Cited by: §1.
  • [49] G. Zhang, C. Yan, X. Ji, T. Zhang, T. Zhang, and W. Xu (2017) DolphinAttack: inaudible voice commands. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 103–117. Cited by: §1, §2.2.2, §2.2.
  • [50] Y. Zhang, N. Suda, L. Lai, and V. Chandra (2017) Hello edge: keyword spotting on microcontrollers. arXiv preprint arXiv:1711.07128. Cited by: §3.1.
  • [51] E. Zwicker and H. Fastl (2013) Psychoacoustics: facts and models. Vol. 22, Springer Science & Business Media. Cited by: §2.2.2.