
Extending Adversarial Attacks to Produce Adversarial Class Probability Distributions

by   Jon Vadillo, et al.

Despite the remarkable performance and generalization levels of deep learning models in a wide range of artificial intelligence tasks, it has been demonstrated that these models can be easily fooled by the addition of imperceptible but malicious perturbations to natural inputs. These altered inputs are known in the literature as adversarial examples. In this paper we propose a novel probabilistic framework to generalize and extend adversarial attacks in order to produce a desired probability distribution for the classes when we apply the attack method to a large number of inputs. This novel attack strategy provides the attacker with greater control over the target model, and increases the complexity of detecting that the model is being attacked. We introduce four different strategies to efficiently generate such attacks, and illustrate our approach by extending DeepFool, a state-of-the-art algorithm for generating adversarial examples. We also experimentally validate our approach for the spoken command classification task, an exemplary machine learning problem in the audio domain. Our results demonstrate that we can closely approximate any probability distribution for the classes while maintaining a high fooling rate and injecting imperceptible perturbations into the inputs.





1 Introduction

Deep Neural Networks (DNNs) are currently the core of a wide range of technologies applied in security-critical tasks, such as self-driving vehicles [1, 2], identity recognition systems [3, 4, 5] or malware detection systems [6, 7], and therefore, effectiveness and robustness are two fundamental requirements for these models. However, it has been found that DNNs can be easily deceived by inputs perturbed imperceptibly for humans, known as adversarial examples [8], which implies a security breach that can be exploited by an adversary with malicious ends. Thus, the discovery of such vulnerabilities has opened a wide range of research lines focused on new attack strategies.

The most common adversarial attacks focus on perturbing an input to change the class predicted by a pretrained model during its prediction phase, known as evasion attacks, such as modifying a traffic sign so that it is misclassified by an image recognition system on a self-driving car. These perturbations can be created with the objective of changing the (originally correct) output to any other (incorrect) class, named untargeted attacks, or to force the model to produce a particular target class, for instance, to misclassify a stop sign as a speed limit sign. In the latter case, we refer to them as targeted attacks, which provide greater control over the target model than untargeted attacks.

Recently, more complex and effective attack strategies have been proposed in the literature, such as universal attacks [9], in which the aim is to create a single input-agnostic perturbation capable of fooling the model when it is applied to any incoming input. In addition, some strategies give the adversary more control over the target model. For instance, in backdoor attacks [10, 11] a small proportion of the training data is modified so that, in the presence of a particular trigger, it is possible to force the model to return a particular class with high probability, without reducing the effectiveness of the model on natural inputs. This creates a backdoor in the model that can be exploited by an adversary, for example, to impersonate people in identity recognition scenarios.

In this work, we focus on creating an attack strategy capable of providing a new form of control over the target models. In particular, we introduce a strategy which gives the adversary not only the ability to deceive the model, but also to produce any (a posteriori) probability distribution of the output classes, even in scenarios where we can only introduce very low amounts of distortion to the inputs.

1.1 Objective

Let us consider a target machine learning model that implements a classification function $f: X \rightarrow Y$, where $X \subseteq \mathbb{R}^d$ represents the $d$-dimensional input space, and $Y = \{y_1, \dots, y_k\}$ the set of possible output classes, where $y_i$ represents the $i$-th class.

The main objective of this paper is to create an attack method $\Phi$ that is able to efficiently produce adversarial examples $x' = \Phi(x)$ not only for the local objective of achieving $f(x') \neq f(x)$, but also to accomplish the global objective of producing a specific probability distribution for the classes $\tilde{P}(Y) = (\tilde{p}_1, \dots, \tilde{p}_k)$, that is:

$$P_{x \sim P(X)}\left[f(\Phi(x)) = y_i\right] = \tilde{p}_i, \quad 1 \leq i \leq k \tag{1}$$

where $P(X)$ represents the probability distribution of the natural inputs. The idea of controlling the probability distribution of the classes produced by the adversarial examples provides a novel perspective to design such attacks. For instance, an adversary can try to replicate the same probability distribution produced by the target DNN on clean inputs, or change it over time, in order to increase the complexity of detecting that the model is being systematically fooled.

1.2 Contributions

The contributions of our work are the following: (1) we introduce a novel probabilistic attack approach capable of producing a particular adversarial probability distribution of the classes while fooling any incoming input, (2) we propose four different methods to create the optimal policies to guide such attacks, and (3) we validate them for the speech command classification task, an exemplary machine learning problem in the audio domain, comparing the effectiveness of the four strategies under multiple criteria, such as the similarity between the produced probability distributions and the target distributions, the percentage of inputs fooled by the attack, or the number of parameters to be optimized for each method.

2 Related work

The intriguing vulnerability of deep learning models to adversarial attacks was first reported in [8] for the image classification task, in which box-constrained L-BFGS optimization approaches were employed to craft the adversarial examples. Although the study of this phenomenon has focused mainly on computer vision tasks, it has been shown that such vulnerabilities can be found in other tasks and domains, such as audio, text and natural language processing [12, 13, 14, 15, 16, 17].

Adversarial attacks must satisfy two fundamental requirements in order to pose an actual threat to deep learning models: they must be able to create perturbations capable of fooling the target model for any input sample, and those perturbations must be imperceptible. Therefore, adversarial attack methods mainly focus on optimizing these two objectives. However, more ambitious methods have been proposed in order to generate more powerful attacks, or achieve more complex objectives.

Most adversarial perturbation generation methods can be grouped in different classes according to the scope of the objective of the adversarial attack. The most frequent methods focus on fooling a target model during its prediction phase, with varying degrees of generality, such as individual perturbations (designed for one particular input), single-class perturbations [18] (designed to fool any input of one particular class) or universal perturbations [9] (input-agnostic perturbations). Other methods aim to increase the risk of adversarial attacks in more realistic scenarios, such as creating perturbations that can be deployed in the physical world [19, 20, 21, 22, 23], or maximizing the transferability of the perturbations across different target models or architectures, which have not been used to generate the perturbation [24, 25].

Relying on a different attack strategy, some methods are based on including adversarially perturbed inputs in the training dataset that will be used to train the model, in order to induce an adversarial behaviour during the test-phase [26, 27, 28, 29, 11]. More specific attack algorithms have been also proposed to address particular tasks, domains or scenarios, which imply additional restrictions or requirements, such as graph-based clustering [30] or traffic signal analysis [31].

The approach we introduce can be taken as a novel attack strategy, which targets the more complex goal of producing a particular probability distribution for the classes after the systematic application of the attack, while preserving the ability to fool the model for any incoming input sample. To the best of our knowledge, no prior work has addressed the problem of producing such an adversarial effect on the target model.

Closer to our work, [31] and [32] introduce a more general approach to create adversarial attacks, which allows the optimization of different objectives. In both methods, a generative model is used to create perturbations, and the training procedure of the model considers regularization terms used to model some features of the data. In [32], generative networks are used to sample adversarial inputs which are capable of fooling the network and satisfy additional restrictions, for instance, to occupy a low portion of the image, to be robust against real-world conditions such as lighting effects or rotations, or to be inconspicuous to human observers. In [31], domain constraints are imposed by using a remapping function, and an additional regularization term is included to model other particular requirements, such as ensuring that the perturbed input features follow a particular distribution.

However, there are significant differences between these methods and the approach we introduce. Firstly, both methods focus on ensuring that the adversarial inputs contain some specific data features, due to the particular domain requirements. In contrast, we focus on inducing a global behaviour on the output of the model, with no regard to particular data features of the input. As a consequence, our approach can be used to extend any targeted attack algorithm, and it is extensible to any task, which provides a much more general solution. Finally, our approach relies on an efficient linear program, avoiding costly non-linear optimization procedures.

2.1 DeepFool

Our approach relies on a method capable of creating targeted adversarial examples. Although any targeted attack algorithm can be used, in this paper we employed a reformulation of DeepFool [33], a state-of-the-art attack algorithm, in order to introduce and illustrate our approach. A detailed introduction to this algorithm is provided in Section 5.1.

Previous related works have proposed different extensions of DeepFool in order to increase the effectiveness of the attack. In [11, 34], the algorithm is reformulated to generate targeted individual adversarial attacks. In [35] the authors propose adding Gaussian noise to the perturbation during the crafting process to improve the effectiveness of the attack.

In [9], DeepFool is extended to create universal adversarial examples, that is, input-agnostic perturbations able to fool the model for any incoming input. Moreover, the strategy presented in [9] has been adapted in recent works to develop universal adversarial examples for different tasks in the audio domain [18]. Similarly to our method, universal attacks provide some global capabilities, in particular, fooling any incoming input with the same perturbation. However, such methods are not capable of controlling the probability distribution for the output classes. In addition, note that our method employs individual attacks, which produce more optimized perturbations for any input, and therefore, lower amounts of distortion.

3 Producing specific class probability distributions

We focus on generating a higher level attack approach in which the application of the attack to many incoming inputs $x$, assuming an input data distribution $P(X)$, can produce not only a misclassification for every $x$, but also a desired (fixed) probability distribution $\tilde{P}(Y)$ of the classes predicted by the target model $f$.

3.1 Assumptions and key concepts

In this section, we specify a number of assumptions and concepts that will be used to develop our methodology.

First of all, we assume that the clean input $x$ is correctly classified by the target classifier, and that $f(x') \neq f(x)$, in order to ensure that the attack is actually fooling the model. In addition, with $\varphi$ being a distortion metric and $\epsilon$ a maximum distortion threshold, we require the adversarial example to satisfy $\varphi(x, x') \leq \epsilon$, to ensure that $x'$ is as similar as possible to $x$.

The approach we introduce will use a targeted adversarial attack as basis, that is, an attack capable of forcing the model to produce a particular target class $y_j$. However, setting a maximum distortion $\epsilon$ means that we may not reach every possible target class by adversarially perturbing an input sample. For this reason, we consider that $y_j$ is a reachable class from $x$ if it is possible to generate a targeted adversarial example $x'$, so that $f(x') = y_j$ and $\varphi(x, x') \leq \epsilon$, and this will be denoted as $y_j \in \mathcal{Y}_x$, where $\mathcal{Y}_x \subseteq Y$ is the set of classes reachable from $x$. We assume that $f(x)$ is always a reachable class. However, if there are no reachable classes $y_j \neq f(x)$, we will consider that we cannot create any valid adversarial example for $x$.

3.2 Attack description

The main rationale of the approach we introduce is to extend or generalize a targeted adversarial attack method in order to achieve the global objective of producing any probability distribution of the output classes, while maintaining a high fooling rate and minimally distorted inputs. To enable such attacks, our method consists of generalizing the attack algorithm to be stochastic, so that the target class is randomly selected, and the probability of transitioning from the class $y_i$ to the class $y_j$ depends on the source class $y_i$ and the input at hand. These probabilities will be represented in a transition matrix $T = [t_{i,j}]_{i,j=1}^{k}$, where $t_{i,j}$ represents the probability of transitioning from the class $y_i$ to the class $y_j$.

Although the transition matrix $T$ depends only on the source class of the inputs, in practice, it might not be possible to move all the inputs towards any class without surpassing the distortion threshold $\epsilon$. In order to address this limitation, given an input $x$, the transition probabilities will be normalized considering that the probability of moving to a non-reachable class is zero. Let $T_i = (t_{i,1}, \dots, t_{i,k})$ be the probability distribution assigned in $T$ corresponding to the inputs with source class $y_i$. With $\mathcal{Y}_x \subseteq Y$ being the set of reachable classes for one particular input $x$ of class $y_i$, the probability of selecting $y_j$ as the target class is determined by:

$$P(y_j \mid x, y_i) = \begin{cases} \dfrac{t_{i,j}}{\sum_{y_r \in \mathcal{Y}_x} t_{i,r}} & \text{if } y_j \in \mathcal{Y}_x \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

Note that the probability $t_{i,i}$ will represent the probability of maintaining an input in its own class $y_i$, and therefore, these values should be as low as possible in order to ensure that we maximize the number of inputs that will fool the model. However, depending on the probability distribution of the classes we want to produce, a nonzero value for these probabilities may be needed to achieve such goals, for instance, if we require a high probability for one class but this class is seldom reached from inputs belonging to the rest of the classes.

Algorithm 1 provides the pseudocode of this approach. By modeling the decision to move from one class to another in this way, it is possible to approximate with which probability the model will predict each class. This capability allows the attacker to have more control over the model, or make it more difficult for the defender to detect attacks, for example, by maintaining the initial probability distribution of the classes despite inducing errors in the model.

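Algorithm 1 Generating adversarial class probability distributions
Input: A classification model $f$, a set of classes $Y = \{y_1, \dots, y_k\}$, an input sample $x$ of ground-truth class $y_i$, a maximum distortion threshold $\epsilon$, a transition matrix $T$
Output: An adversarial example $x'$
1: $(r_1, \dots, r_k) \leftarrow$ initialize with False
2: for $j \in \{1, \dots, k\}$ do
3:     $v_j \leftarrow$ Generate a perturbation for $x$ targeting class $y_j$.
4:     if $f(x + v_j) = y_j$ and $\varphi(x, x + v_j) \leq \epsilon$ then
5:         $r_j \leftarrow$ True
6:     end if
7: end for
8: $(t_{i,1}, \dots, t_{i,k}) \leftarrow$ probability distribution in the row of $T$ corresponding to the class $y_i$.
9: for $j \in \{1, \dots, k\}$ do
10:     if $r_j$ then
11:         $p_j \leftarrow t_{i,j} / \sum_{r_m = \text{True}} t_{i,m}$
12:     else
13:         $p_j \leftarrow 0$
14:     end if
15: end for
16: $y_j \leftarrow$ randomly select a class in $Y$ according to the probabilities $(p_1, \dots, p_k)$
17: Return the adversarial example with the targeted perturbation corresponding to class $y_j$: $x' = x + v_j$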
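The target-selection step of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `select_target_class`, the boolean `reachable` list (assumed to be produced beforehand by running the targeted attack once per class) and the fallback to the source class when all reachable classes have zero probability are assumptions of this sketch.

```python
import random

def select_target_class(source, reachable, T, rng=random):
    """Pick a target class index for one input, following Section 3.2.

    source:    index i of the ground-truth class of the input
    reachable: list of booleans; reachable[j] is True if class j can be
               reached from this input within the distortion threshold
    T:         transition matrix as a list of rows, T[i][j] = t_{i,j}
    """
    row = T[source]
    # Non-reachable classes get probability zero.
    weights = [t if ok else 0.0 for t, ok in zip(row, reachable)]
    total = sum(weights)
    if total == 0.0:
        # Assumption of this sketch: keep the input in its own class.
        return source
    # Renormalize over the reachable classes and sample.
    probs = [w / total for w in weights]
    return rng.choices(range(len(row)), weights=probs, k=1)[0]
```

For example, with `T[0] = [0.0, 0.5, 0.5]` and only classes 0 and 1 reachable, the input is always sent to class 1, because the mass assigned to the unreachable class 2 is redistributed.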

4 Constructing optimal transition matrices to guide targeted attacks

In this section we introduce different strategies to construct the optimal transition matrix $T$ which, used to stochastically decide the class transitions, produces a target probability distribution $\tilde{P}(Y)$ for the output classes. Formally, with $\mathcal{X}$ being a set of inputs sampled from a data distribution $P(X)$, and $P(Y) = (p_1, \dots, p_k)$ the original probability distribution of the classes assigned by the target classifier, we want to obtain a transition matrix $T$ that satisfies:

$$P(Y)\, T = \tilde{P}(Y) \tag{3}$$

The main purpose of these approaches is to ensure that the transition matrix is capable of producing the desired probability distribution given the conditions and particularities of the problem, such as the initial probability distribution or the reachability of the inputs of each class. Additionally, in order to increase the expected proportion of inputs to be fooled, the values in the diagonal need to be as low as possible.
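To make the requirement on the transition matrix concrete, the following sketch computes the class distribution obtained when a row vector of class probabilities is multiplied by a transition matrix. The two-class numbers are invented for illustration, and the helper name `output_distribution` is hypothetical.

```python
def output_distribution(p, T):
    """Return the row vector p T: the class distribution after the attack,
    assuming every transition prescribed by T can actually be performed."""
    k = len(p)
    return [sum(p[i] * T[i][j] for i in range(k)) for j in range(k)]

# Toy example: uniform initial distribution over two classes.
p = [0.5, 0.5]
T = [[0.0, 1.0],   # every input of class 0 is moved to class 1
     [0.4, 0.6]]   # 40% of the inputs of class 1 are moved to class 0
produced = output_distribution(p, T)
```

In this toy case the produced distribution is approximately (0.2, 0.8), and the second row shows why a nonzero diagonal value can be needed: class 1 must keep 60% of its own inputs to reach a probability of 0.8.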

We will define the problem of finding such matrices as a linear program, in which we will use different strategies to restrict the possible values of the matrix, in order to fulfill all the specified requirements. All the methods provide advantages and disadvantages, as will be discussed in Section 4.5.

4.1 Method 1: Baseline

The first method will only take into account the most basic requirements in order to generate a valid transition matrix, and therefore, the results obtained with this method will be used as baseline. In this way, we will be able to compare the gain in terms of effectiveness that the following methods imply, in which different strategies will be used to restrict the search space and make the transition matrices better fit the characteristics and limitations of the problem.

Thus, this method consists of directly searching for a transition matrix $T$ that satisfies equation (3), while minimizing the sum of the diagonal of $T$. In addition, null probabilities outside the diagonal of $T$ should be avoided, since they reduce the number of possible classes to which we can move the inputs of class $y_i$. To achieve this, we will introduce an auxiliary variable matrix $L = [l_{i,j}]$, with non-negative and upper-bounded values. Each $l_{i,j}$ will be included in the set of restrictions as a lower bound of $t_{i,j}$ to require a minimum probability, $l_{i,j} \leq t_{i,j}$, $i \neq j$. At the same time, the values in $L$ will be maximized in the objective function of the linear program.

Taking into account all these basic requirements, the optimal transition matrix can be obtained by solving the following linear program:

min (4)
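The body of the linear program did not survive in this copy of the text. A formulation consistent with the description above could look as follows. It is a sketch only: the equal weighting of the two terms in the objective and the name $\xi$ for the upper bound on the values of $L$ are assumptions, not taken from the source.

```latex
\begin{aligned}
\min_{T,\,L}\quad & \sum_{i=1}^{k} t_{i,i} \;-\; \sum_{i=1}^{k}\sum_{j \neq i} l_{i,j} \\
\text{s.t.}\quad & P(Y)\,T = \tilde{P}(Y), \\
& \sum_{j=1}^{k} t_{i,j} = 1, && 1 \leq i \leq k, \\
& 0 \leq l_{i,j} \leq t_{i,j} \leq 1, && i \neq j, \\
& l_{i,j} \leq \xi, && i \neq j, \\
& 0 \leq t_{i,i} \leq 1, && 1 \leq i \leq k.
\end{aligned}
```

Both the objective and the constraints are linear in the entries of $T$ and $L$, which is what allows the problem to be handled by a standard linear programming solver.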

4.2 Method 2

In the second method we will extend the linear program introduced in Method 1, to include an additional restriction to the values of $T$ in order to better fit the characteristics and limitations of the problem.

In this method, an auxiliary matrix $M = [m_{i,j}]$ will be considered, in which $m_{i,j}$ represents the number of samples in $\mathcal{X}$ that, with a ground-truth class $y_i$, can reach the class $y_j$:

$$m_{i,j} = \left|\{x \in \mathcal{X} : f(x) = y_i \wedge y_j \in \mathcal{Y}_x\}\right| \tag{5}$$

We assume that the ground-truth class of an input is always reachable, and therefore, $m_{i,i} = |\{x \in \mathcal{X} : f(x) = y_i\}|$. If we divide each $m_{i,j}$ by the number of inputs of class $y_i$, the value $r_{i,j}$ will represent the proportion of samples in $\mathcal{X}$ which, with a ground-truth class $y_i$, can reach the class $y_j$:

$$r_{i,j} = \frac{m_{i,j}}{m_{i,i}} \tag{6}$$

Note that we are estimating, by using the set $\mathcal{X}$, the maximum proportion of successful targeted attacks that it is possible to create for the inputs in $\mathcal{X}$, for any pair of source class $y_i$ and target class $y_j$. Therefore, to generate a realistic transition matrix $T$, we will maintain the following restriction: $t_{i,j} \leq r_{i,j}$. This restriction seeks to avoid assigning transition probabilities so high that, in practice, they will be unlikely to be obtained due to the distortion threshold, which may cause a loss of effectiveness regarding the global objective of producing $\tilde{P}(Y)$, as the algorithm may not be able to successfully follow the guidance of $T$. However, setting an upper bound on the values of $T$ according to the values of $R = [r_{i,j}]$ may increase the values in the diagonal, decreasing the fooling rate expectation.
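As an illustration of how the reachability counts and proportions used by Method 2 could be estimated from a set of samples, consider the sketch below. The function name and the input format (one ground-truth label and one set of reachable class indices per sample, assumed to be computed beforehand with the targeted attack) are hypothetical.

```python
def reach_matrices(labels, reachable_sets, k):
    """Estimate reachability counts and proportions from a sample set.

    labels[n]:         ground-truth class index of sample n
    reachable_sets[n]: set of class indices reachable from sample n
                       (assumed to contain labels[n])
    Returns (counts, proportions), both k x k lists of lists.
    """
    counts = [[0] * k for _ in range(k)]
    for i, reach in zip(labels, reachable_sets):
        for j in reach:
            counts[i][j] += 1
    # The diagonal holds the number of samples of each class, because the
    # ground-truth class is always reachable.
    proportions = [
        [counts[i][j] / counts[i][i] if counts[i][i] else 0.0 for j in range(k)]
        for i in range(k)
    ]
    return counts, proportions
```

With two samples of class 0, of which only one can reach class 1, the estimated proportion for that transition is 0.5, so Method 2 would not allow a transition probability from class 0 to class 1 above 0.5.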

Based on all these facts, we will generate the optimal transition matrix by solving the following linear program:

min (7)
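The body of this linear program is also missing from this copy. Following the text, it would be the program of Method 1 with the additional upper bounds given by the estimated reachability proportions, written here as $r_{i,j}$. As before, the weighting of the objective terms and the bound $\xi$ on the auxiliary variables $l_{i,j}$ are assumptions of this sketch.

```latex
\begin{aligned}
\min_{T,\,L}\quad & \sum_{i=1}^{k} t_{i,i} \;-\; \sum_{i=1}^{k}\sum_{j \neq i} l_{i,j} \\
\text{s.t.}\quad & P(Y)\,T = \tilde{P}(Y), \\
& \sum_{j=1}^{k} t_{i,j} = 1, && 1 \leq i \leq k, \\
& 0 \leq l_{i,j} \leq t_{i,j} \leq r_{i,j}, && i \neq j, \\
& l_{i,j} \leq \xi, && i \neq j, \\
& 0 \leq t_{i,i} \leq 1, && 1 \leq i \leq k.
\end{aligned}
```

Since the proportion for the diagonal entries equals one, the extra bound only restricts the off-diagonal entries, which is why tight bounds can push probability mass onto the diagonal.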

4.3 Method 3

The main drawback of the strategy used in Method 2 is that establishing upper bounds for every value of $T$ can significantly limit the space of possible transition matrices, and therefore, limit the range of objective probability distributions that can be produced.

For instance, even if it is estimated using the training set that it is not possible to move more than a certain proportion of cases from the class $y_i$ to the class $y_j$, in some cases it can be necessary to assign to $t_{i,j}$ values higher than $r_{i,j}$, for example, to produce $y_j$ with a very high probability. In such a case, even if reaching the class $y_j$ from the class $y_i$ is unlikely, we can specify that, when this transition is possible, we want to produce it with a high probability.

Therefore, in order to be able to produce a wide range of distributions $\tilde{P}(Y)$, Method 3 does not impose an upper bound constraint on the values of $T$. Apart from that, the row-normalized version of the reachability proportion matrix $R$ of Method 2 will be used in this method, denoted as $\bar{R}$, which already represents a transition matrix. In particular, the probability distribution $P(Y)\bar{R}$ is the one that would be achieved if the target class of each input $x$ is uniformly selected in the set of reachable classes for $x$. As our intention is to produce $\tilde{P}(Y)$, we aim to find an auxiliary transition matrix $Q$, so that $P(Y)\bar{R}Q = \tilde{P}(Y)$. If we denote $T = \bar{R}Q$, we can generate $T$ by solving the following linear program:

min (8)

In this case, the linear program will minimize the values in the diagonal of $T$, which is the product of two matrices. Therefore, note that if the values of $\bar{R}$ are non-zero for the majority of its positions, the sum of the values in the diagonal of $T$ cannot be zero, compromising the fooling rate of the attack. For the same reason, now we do not require an auxiliary variable $L$ to avoid null values in $T$.
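The linear program of Method 3 is not legible in this copy. Based on the description, and writing $\bar{R}$ for the row-normalized reachability matrix and $Q = [q_{i,j}]$ for the auxiliary transition matrix (both symbol names are assumptions of this sketch), it could be written as:

```latex
\begin{aligned}
\min_{Q}\quad & \sum_{i=1}^{k} \left[\bar{R}\,Q\right]_{i,i}
  \;=\; \sum_{i=1}^{k}\sum_{m=1}^{k} \bar{r}_{i,m}\, q_{m,i} \\
\text{s.t.}\quad & P(Y)\,\bar{R}\,Q = \tilde{P}(Y), \\
& \sum_{j=1}^{k} q_{i,j} = 1, && 1 \leq i \leq k, \\
& 0 \leq q_{i,j} \leq 1, && 1 \leq i, j \leq k.
\end{aligned}
```

Because $\bar{R}$ is fixed, the diagonal of the product is a linear function of the entries of $Q$, so the problem remains a linear program even though the resulting transition matrix is a matrix product.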

4.4 Method 4

As we introduced in Section 3.2, even if we specify a probability for every possible transition, in practice, an input sample may not be able to reach any possible class without surpassing the maximum distortion allowed, so we need to normalize those probabilities to consider only the reachable classes from that input. However, the two previous methods have not considered this effect during the training process, which may cause a reduction in the effectiveness of the resulting matrices when they are applied during the prediction-phase of the model. For this reason, in this method, we will make use of that information with the aim of achieving a more comprehensive attack.

To construct the transition matrix $T$, we start by estimating the probabilities that an input of class $y_i$ can reach a particular subset of classes $S \subseteq Y$ but not $Y \setminus S$, that is, that its set of reachable classes $\mathcal{Y}_x$ is exactly $S$. We will denote these probabilities $P(\mathcal{Y}_x = S \mid f(x) = y_i)$, or $P(S \mid y_i)$ for simplicity. In order to estimate these values, the set of reachable classes $\mathcal{Y}_x$ will be computed for each $x \in \mathcal{X}$, and the frequency of each subset $S$ will be calculated.

The next step is to define with which probability an input of class $y_i$ and with a set of reachable classes $S$ will be moved from $y_i$ to the class $y_j$, that is, $P(y_j \mid y_i, S)$. These probabilities will also be denoted as $v^{S}_{i,j}$ when referring to them as variables in the linear program. All these values will directly define the transition matrix $T$, in the following way:

$$t_{i,j} = \sum_{S \subseteq Y} P(S \mid y_i)\, v^{S}_{i,j} \tag{9}$$

As we assume that the ground-truth class of an input is always reachable, for the inputs of class $y_i$, the probabilities corresponding to those sets $S$ in which $y_i \notin S$ will be zero. That is, $P(S \mid y_i) = 0$ if $y_i \notin S$. Similarly, $v^{S}_{i,j}$ must be zero if $y_j \notin S$.

In order to find the appropriate values for the variables , we will solve the following linear program:

min (10)

The main disadvantage of this method is that it requires a considerably larger number of decision variables, bounded by assuming that for classes, there are possible subsets of reachable classes , each with an associated probability , and for each of them another distribution of probabilities .

Due to the high number of possible subsets, $P(S \mid y_i)$ will be zero for many of the subsets $S$ in practice. This reduces the number of parameters that can be tuned, and as a consequence, also the number of probability distributions that can be produced. For this reason, to avoid having multiple null values for those probabilities, in this method we will smooth every probability distribution using the Laplace correction.

In addition, after a preliminary experiment we discovered that due to the values of $P(\{y_i\} \mid y_i)$ the linear problem was infeasible for many objective probability distributions, especially for low distortion thresholds. This is because the values $t_{i,i}$ are highly influenced by such probabilities, which, indeed, are lower bounds for $t_{i,i}$. In addition, those values can be considerably greater than those of the rest of the subsets if there is a sufficiently large proportion of samples that cannot be fooled, especially for low values of $\epsilon$. This also translates into a lower fooling rate expectation.

To avoid all these consequences, we set to zero every $P(\{y_i\} \mid y_i)$, even if this can reduce the effectiveness of the method in producing the objective probability distribution, as we are not considering the estimated proportion of samples that cannot be fooled.

With these simplifications, the exact number of variables that need to be updated in the linear program are determined by:


where represents the binomial coefficient.
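The expression for the number of variables is not legible in this copy. One plausible way to count them, under the two simplifications stated in the text (the source class always belongs to the set of reachable classes, and the set containing only the source class is discarded), is sketched below. The function names are hypothetical and the count is this sketch's reading of the text, not necessarily the authors' exact formula.

```python
from itertools import combinations
from math import comb

def n_variables_closed_form(k):
    """Count one variable per (source class, reachable set, target class).

    For each of the k source classes there are comb(k - 1, s - 1) reachable
    sets of size s that contain the source class, and each such set has s
    possible target classes. Sets of size 1 are excluded.
    """
    return k * sum(comb(k - 1, s - 1) * s for s in range(2, k + 1))

def n_variables_brute_force(k):
    """Same count obtained by explicit enumeration (only for small k)."""
    total = 0
    for source in range(k):
        others = [j for j in range(k) if j != source]
        for size in range(1, k):
            for _ in combinations(others, size):
                total += size + 1  # the chosen classes plus the source class
    return total
```

The brute-force enumeration is included only as a cross-check of the closed form, since the number of subsets grows exponentially with the number of classes.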

Finally, it is worth mentioning that, as we are estimating a probability distribution for every source class $y_i$ and every set of reachable classes $S$, defined by $v^{S}_{i,j}$, it would be possible to use this distribution instead of using $T$ to compute the probability of selecting each possible class. However, we verified that this alternative achieved worse approximations of the objective distributions than directly using the probability distributions defined in $T$.

4.5 Overview of the attack strategies

All the strategies introduced in the previous sections can be used to efficiently generate the transition matrices needed to produce adversarial class probability distributions, all of them containing some advantages as well as some weaknesses.

Method 2 provides a simple framework to generate transition matrices, in which the modeling of the problem is particularly suited to minimize the diagonal of the transition matrix, which, as a consequence, maximizes the fooling rate expectation. However, the upper bounds used as restrictions may be too rigid, reducing the range of probability distributions that can be produced.

Method 3 specifies less strict constraints on the values of $T$, at the cost of compromising the fooling rate expectation, due to the fact that $T$ is defined as a matrix product, which makes the problem less suited to minimize the values in the diagonal when the row-normalized reachability matrix is dense.

A positive point in both Method 2 and Method 3 is the low number of parameters to be optimized, bounded by . Method 4, however, requires a considerably larger number of parameters, bounded by , but provides a more comprehensive and general approach to generate the transition matrix, which allows taking into account the particular set of reachable classes for each training instance.

5 Validating our proposal: setup and results

In this section we present the particular attack algorithm, task, dataset, model and further details regarding the experimental setup used to validate our proposals. We also report the obtained results, in which we measure the effectiveness of the introduced approaches according to different criteria.

5.1 Underlying attack algorithm: DeepFool

As the underlying attack process used to generate targeted adversarial perturbations, in this paper we focus on DeepFool [33], a state-of-the-art algorithm to generate adversarial attacks, initially introduced for images.

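This method consists of iteratively pushing an initial input $x_0$ towards the closest decision boundary of the decision space represented by the model. Let $y_i$ be the ground-truth class of $x_0$, $f$ the target classifier, and $f_j$ the output logits of $f$ corresponding to the class $y_j$. The region of space in which $f$ predicts $y_i$ is defined as:

$$\mathcal{D}_i = \bigcap_{j=1}^{k} \{x : f_i(x) \geq f_j(x)\} \tag{12}$$

At each iteration $t$, the region $\mathcal{D}_i$ is approximated by a polyhedron $\tilde{\mathcal{D}}_t$, defined as:

$$\tilde{\mathcal{D}}_t = \bigcap_{j=1}^{k} \{x : f_j(x_t) - f_i(x_t) + (\nabla f_j(x_t) - \nabla f_i(x_t))^{\top}(x - x_t) \leq 0\} \tag{13}$$

Therefore, the distances between the input $x_t$ and the boundaries of $\mathcal{D}_i$ are approximated by the distances between $x_t$ and the boundaries of $\tilde{\mathcal{D}}_t$. By using this approximation, the closest decision boundary, which will correspond to a class $y_l$, is determined according to the following criterion:

$$l = \arg\min_{j \neq i} \frac{|f'_j|}{\|w'_j\|_2} \tag{14}$$

where $f'_j = f_j(x_t) - f_i(x_t)$ and $w'_j = \nabla f_j(x_t) - \nabla f_i(x_t)$ is the direction in which the input will be moved. Finally, $x_t$ is pushed towards the selected decision boundary:

$$x_{t+1} = x_t + \frac{|f'_l|}{\|w'_l\|_2^2}\, w'_l \tag{15}$$

The algorithm stops when it finally reaches a new decision region, that is, when $f(x_t) \neq y_i$.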

Different works have extended or reformulated the original version of DeepFool in order to improve the attack or achieve more complex goals, as described in Section 2 [9, 18, 11, 35]. In order to fit our specification, we will employ a targeted version of DeepFool, and impose a stop criterion to prevent exceeding the maximum amount of perturbation $\epsilon$.

The targeted version of DeepFool can be obtained if, at every iteration $t$, the sample $x_t$ is moved in the direction of the target class $y_{\tau}$, that is, removing the criterion specified in equation (14) and applying the following update rule:

$$x_{t+1} = x_t + \frac{|f'_{\tau}|}{\|w'_{\tau}\|_2^2}\, w'_{\tau} \tag{16}$$

where $f'_{\tau} = f_{\tau}(x_t) - f_i(x_t)$ and $w'_{\tau} = \nabla f_{\tau}(x_t) - \nabla f_i(x_t)$. In this case, the process stops when the condition $f(x_t) = y_{\tau}$ is satisfied, or, again, when the amount of distortion exceeds $\epsilon$. Algorithm 2 provides the pseudocode for this approach. The targeted reformulation of the DeepFool algorithm has been reported before in [11, 34].

0:   A classification model , an input sample of class , a target class , maximum distortion threshold
0:   A targeted adversarial perturbation
3:   initialize with zeros Total perturbation
4:   initialize with zeros Local perturbation at step
5:  while  do
9:     if  then
13:     end if
14:  end while
Algorithm 2 Targeted DeepFool
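The targeted update and the distortion-based stop criterion can be illustrated on a toy affine classifier, for which the gradients of the logits are simply the rows of the weight matrix. This is a sketch under several assumptions that are not taken from the source: the classifier is affine, the distortion is measured with the Euclidean distance, a small `overshoot` factor is used to cross the boundary, and all function names are hypothetical.

```python
import math

def logits(W, b, x):
    """Logits of an affine classifier: one dot product plus bias per class."""
    return [sum(wi * xi for wi, xi in zip(w, x)) + bj for w, bj in zip(W, b)]

def predict(W, b, x):
    z = logits(W, b, x)
    return max(range(len(z)), key=z.__getitem__)

def targeted_deepfool_affine(W, b, x, target, eps, overshoot=1e-4, max_iter=50):
    """Return (x_adv, success). Stops when the target class is predicted or
    when the next step would exceed the distortion threshold eps."""
    source = predict(W, b, x)
    x_adv = list(x)
    for _ in range(max_iter):
        if predict(W, b, x_adv) == target:
            return x_adv, True
        z = logits(W, b, x_adv)
        # Logit gap and direction between the target and the source class.
        f_diff = z[target] - z[source]
        w_diff = [wt - ws for wt, ws in zip(W[target], W[source])]
        norm_sq = sum(v * v for v in w_diff)
        if norm_sq == 0.0:
            return x_adv, False
        step = (abs(f_diff) / norm_sq) * (1.0 + overshoot)
        candidate = [xi + step * wi for xi, wi in zip(x_adv, w_diff)]
        if math.dist(x, candidate) > eps:
            return x_adv, False  # the threshold would be exceeded
        x_adv = candidate
    return x_adv, predict(W, b, x_adv) == target

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b = [0.0, 0.0, 0.0]
x = [2.0, 0.5]  # classified as class 0
adv, ok = targeted_deepfool_affine(W, b, x, target=1, eps=2.0)
```

Running the same attack with a smaller threshold, for example `eps=0.5`, returns the unmodified input and a failure flag, which is the situation in which the target class would be treated as non-reachable.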

5.2 Case of study: speech command classification

Due to advances in automatic speech recognition technologies based on machine learning models, and their deployment in smartphones, voice assistants and industrial applications, there has been a rapid increase in the study of adversarial attacks and defenses for such models, despite being considerably less studied than computer vision problems. For these reasons, we have decided to validate our proposal in the task of speech command classification, an exemplary and representative task in this domain.

We will use the Speech Command Dataset [36], which consists on a set of WAV audio files of 30 different spoken commands. The duration of all the files is fixed to 1 second, and the sample-rate is 16kHz in all the samples, so that each audio waveform is composed by values, in the range . We will use a subset of ten classes, those standard labels selected in previous publications [36, 14]: Yes, No, Up, Down, Left, Right, On, Off, Stop, and Go. In order to provide a more realistic setup, two special classes have been also considered: Silence, representing that no speech has been detected, and Unknown, representing an unrecognized spoken command, or different to the ones mentioned before.

As in previous related works [14, 37], a Convolutional Neural Network will be used as classification model, based on the architecture proposed in [38] for small-footprint keyword recognition. For simplification purposes, we will assume a uniform initial probability distribution .

5.3 Experimental details

We generated a set of samples by randomly sampling 500 samples per class from the training set of the Speech Command Dataset. We ensured that all the samples were correctly classified by the model. This set will also be used to generate the auxiliary matrix as described in Section 4 and, therefore, to generate the transition matrix .

The ultimate goal is to validate that any desired probability distribution can be approximated with a low error by guiding the extended DeepFool (Algorithm 2) using Algorithm 1 and a transition matrix , optimized using any of the methods introduced in Section 4. This will be validated in another set of samples , disjoint from , composed of 500 inputs per class, forming a total of 6000 input samples.
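To make the role of the transition matrix concrete, the sketch below shows how a target class could be drawn stochastically for each input, and which output distribution results in expectation. It is only an illustration under stated assumptions: row i of the matrix is taken to hold the probabilities of sending an input of class i to each class j, every targeted attack is assumed to succeed, and the function names are hypothetical. Algorithm 1 itself is not reproduced here.

```python
import random

def sample_target(transition_row, rng):
    """Draw a target class index according to one row of the transition matrix."""
    classes = range(len(transition_row))
    return rng.choices(classes, weights=transition_row, k=1)[0]

def expected_output_distribution(p_init, T):
    """Expected class distribution after the attack: p_out[j] = sum_i p_init[i] * T[i][j].

    Assumes every targeted attack reaches its sampled target class.
    """
    k = len(T)
    return [sum(p_init[i] * T[i][j] for i in range(k)) for j in range(k)]

# Hypothetical 3-class example with a uniform initial distribution.
T = [[0.50, 0.50, 0.00],
     [0.00, 0.50, 0.50],
     [0.25, 0.25, 0.50]]
p_init = [1 / 3, 1 / 3, 1 / 3]
p_out = expected_output_distribution(p_init, T)

rng = random.Random(0)
target_for_class_0 = sample_target(T[0], rng)  # either class 0 or class 1
```

In practice some targeted attacks fail within the distortion budget, which is why the transition matrix has to be optimized against the reachability observed on a separate set of samples rather than chosen freely.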

The objective probability distributions will be randomly sampled from a Dirichlet distribution of parameters and . A total of 100 different objective probability distributions will be sampled, and the approach will be tested on each of them. Apart from this set, the particular case in which will also be tested, that is, when the aim is to reproduce the original probability distribution obtained by the model.
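The sampling of the objective distributions can be sketched as follows. The concentration parameters used below (all ones) are placeholders chosen for illustration only and are not the values used in the experiments. The seed and the function name are likewise assumptions. The number of classes (12) follows from the ten commands plus Silence and Unknown.

```python
import numpy as np

NUM_CLASSES = 12         # ten commands plus Silence and Unknown
NUM_DISTRIBUTIONS = 100  # number of objective distributions stated in the text

def sample_objective_distributions(alpha, num_distributions, seed=0):
    """Draw objective class distributions from a Dirichlet distribution."""
    rng = np.random.default_rng(seed)
    return rng.dirichlet(alpha, size=num_distributions)

# Placeholder concentration parameters (assumption, not the paper's values).
alpha = np.ones(NUM_CLASSES)
targets = sample_objective_distributions(alpha, NUM_DISTRIBUTIONS)
```

Each row of `targets` is a valid probability vector over the 12 classes and can be used as one objective distribution.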

The proportion of samples that have been classified as each particular class after the attack is applied to every input in will be taken as the empirical probability distribution of , and will be denoted . Using this collection of data, we will evaluate to what extent the empirical probability distributions match . The similarity between both distributions will be measured using different metrics: the maximum and mean absolute difference, the Kullback-Leibler divergence and the Spearman correlation.
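The empirical distribution and the four similarity metrics can be computed as in the following sketch. The function names are illustrative. The argument order of the Kullback-Leibler divergence is an assumption, since the direction used in the experiments is not stated in this section, and the Spearman correlation is computed here without averaging tied ranks.

```python
import math

def empirical_distribution(predictions, num_classes):
    """Proportion of inputs assigned to each class after the attack."""
    counts = [0] * num_classes
    for label in predictions:
        counts[label] += 1
    return [c / len(predictions) for c in counts]

def abs_differences(p, q):
    """Return (maximum, mean) absolute difference between two distributions."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    return max(diffs), sum(diffs) / len(diffs)

def kl_divergence(p, q, floor=1e-12):
    """KL(p || q); q is floored to avoid division by zero (an assumption)."""
    return sum(a * math.log(a / max(b, floor)) for a, b in zip(p, q) if a > 0)

def spearman(p, q):
    """Spearman correlation as the Pearson correlation of ranks (ties not averaged)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, idx in enumerate(order):
            result[idx] = rank
        return result
    rp, rq = ranks(p), ranks(q)
    n = len(p)
    mp, mq = sum(rp) / n, sum(rq) / n
    cov = sum((a - mp) * (b - mq) for a, b in zip(rp, rq))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_q = sum((b - mq) ** 2 for b in rq)
    return cov / math.sqrt(var_p * var_q)
```

With SciPy available, `scipy.stats.spearmanr` handles ties properly and would be a more robust choice than the rank computation above.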

The transition matrices will be generated using the four linear programs described in Sections 4.1, 4.2, 4.3 and 4.4. For Method 1 and Method 2, an upper-bound of will be set for the values in . The results will be computed under the following maximum distortion thresholds: .

5.4 Particular case: reproducing the initial probability distribution

For illustration purposes, we first report the results obtained for the particular scenario in which we want to produce the same probability distribution that the model produces when it is applied on clean samples. Notice that this distribution is the same as the ground truth distribution of the classes, since we assume that the model produces a correct classification for the original samples. Having the ability to reproduce such distributions allows an attacker to deploy attacks that are less likely to be detected, and without losing control over the frequency with which each class is predicted.

Figures 1, 2, 3 and 4 contain, for Method 1, 2, 3 and 4, respectively, a graphical comparison of the initial probability distribution and the one produced after perturbing the inputs with our attack, for different maximum distortion thresholds . The figures also include the Kullback-Leibler divergences between both distributions. Note that the bottom-right chart of the four figures represents the achieved fooling rate for every , and, as reference, the maximum fooling rate that can be achieved for such thresholds, that is, the percentage of inputs in for which it is possible to create a targeted attack capable of fooling the model.

According to the results, in all the cases the algorithms were able to maintain a probability distribution very close to the original one, with Method 1 being the least accurate. In addition, the attack also maintained fooling rates very close to the optimal values regardless of the distortion threshold, with a negligible loss in Method 2 and Method 4, and with a loss of approximately 10% in Method 3.

It is noteworthy that, in this particular case, the produced probability distributions are closer to the target for the lowest values of tried. This is due to the fact that for low distortion thresholds the number of inputs for which the model can be fooled is lower, and therefore a larger number of inputs remain correctly classified as their ground-truth class, which makes the empirical probability distribution closer to the original. However, note that the results obtained for high values of also represent close approximations of the target distributions, and at the same time, the model is fooled for almost all the input samples.

Fig. 1: Method 1 (Baseline): Comparison between the initial probability distribution and the produced probability distribution , for different values of , in the particular case in which the objective distribution is . The Kullback-Leibler divergence between both distributions, , is also reported above each figure. The bottom-right figure represents the achieved fooling rates, and the maximum fooling rate that can be achieved for every , that is, the percentage of inputs in for which it is possible to create a targeted attack capable of fooling the model.
Fig. 2: Method 2: Comparison between the initial probability distribution and the produced probability distribution , for different values of , in the particular case in which the objective distribution is . The Kullback-Leibler divergence between both distributions, , is also reported above each figure. The bottom-right figure represents the achieved fooling rates, and the maximum fooling rate that can be achieved for every , that is, the percentage of inputs in for which it is possible to create a targeted attack capable of fooling the model.
Fig. 3: Method 3: Comparison between the initial probability distribution and the produced probability distribution , for different values of , in the particular case in which the objective distribution is . The Kullback-Leibler divergence between both distributions, , is also reported above each figure. The bottom-right figure represents the achieved fooling rates, and the maximum fooling rate that can be achieved for every , that is, the percentage of inputs in for which it is possible to create a targeted attack capable of fooling the model.
Fig. 4: Method 4: Comparison between the initial probability distribution and the produced probability distribution , for different values of , in the particular case in which the objective distribution is . The Kullback-Leibler divergence between both distributions, , is also reported above each figure. The bottom-right figure represents the achieved fooling rates, and the maximum fooling rate that can be achieved for every , that is, the percentage of inputs in for which it is possible to create a targeted attack capable of fooling the model.

5.5 Deeper exploration

In this section we provide a deeper evaluation of the approach, testing it against 100 random probability distributions, randomly drawn from a Dirichlet distribution, as described in Section 5.3.

First, we computed the percentage of cases in which the method managed to generate a valid transition matrix, that is, one which satisfies all the restrictions of the corresponding linear program. This information is shown in Table I, for different values of . As can be seen, Methods 1, 3 and 4 managed to create a valid perturbation for all the cases tried, independently of the maximum distortion threshold. Method 2 also succeeded in all cases for distortion values greater than or equal to 0.05, but its success percentage drops dramatically for lower values of .

Table I also includes the success percentages of two variants of Method 4. In the first case, without the Laplace correction and without fixing the probabilities to zero, the method was not able to generate a valid transition matrix for distortions below , and even for the maximum distortion tried the method succeeded in only 50% of the cases. Applying the Laplace correction (without fixing ), those results improve significantly, succeeding in more than 80% of the cases for distortion thresholds above 0.1, in 74% of the cases for 0.05 and in 25% of the cases for the lowest distortion level. These results clearly reflect that those corrections are necessary to make the linear programs feasible.

Max. distortion amount ()
Method 1 % % % % %
Method 2 % % % % %
Method 3 % % % % %
Method 4 % % % % %
Method 4 (1) % % % % %
Method 4 (2) % % % % %
  (1) Without the Laplace correction and without fixing the values of to zero.
  (2) Without fixing the values of to zero.

TABLE I: Success percentages in generating valid transition matrices for the different methods introduced.

The similarity between the objective distributions and the corresponding empirical distributions is analyzed in Figure 5, for the different similarity metrics considered. All the values have been averaged over all the target probability distributions considered in the experiment, independently for every maximum distortion threshold. The results obtained with Method 2 for and are computed only for the cases in which the method achieved a valid transition matrix, and therefore the results might be biased.

First of all, the results show that Method 1 achieves worse results than the rest of the methods, which validates the hypothesis that Method 1 can be taken as an appropriate baseline for the problem, and also that the strategies employed in the other methods are capable of increasing the effectiveness of the attack.

Apart from that, it is clear that the effectiveness of the methods in reproducing the target distribution increases when the maximum allowed distortion increases. Regarding the results obtained with Methods 2, 3 and 4, the maximum difference is below for distortion thresholds of or greater, which reflects a very high similarity. In fact, for the mean difference of all the values, this value decreases to . The KLD also shows the same descending trend as the maximum and mean differences. Finally, the Spearman correlation between both distributions is above for those methods, which indicates that even if there are differences between the values, both distributions are highly correlated. This metric also reflects a considerably higher effectiveness in comparison to the baseline results obtained using Method 1, particularly for low threshold values.

Comparing the effectiveness of Methods 2, 3 and 4, as expected, Method 2 was the least effective in closely reproducing the target distributions (considering the values , in which the method succeeded in all the cases and therefore the results are directly comparable). Conversely, Method 3 was the most effective one according to all the metrics, followed by Method 4, which achieved intermediate results between the other two methods.

Fig. 5: Sensitivity analysis of different similarity metrics between the produced probability distribution and the objective probability distribution : mean absolute difference, maximum absolute difference, Kullback Leibler Divergence (KLD) and Spearman correlation.

Finally, Figure 6 compares the fooling rates obtained for each value of . The fooling rate represents the percentage of adversarial examples that produced a wrong output in the model. As in previous charts, this information has been averaged considering all the objective probability distributions tried, for each value of . In addition, the figure includes the maximum fooling rate that can be obtained with a maximum distortion .
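The fooling rate defined above reduces to a simple count, as the sketch below shows. Since all clean inputs are correctly classified by the model, a wrong output is taken here to mean a prediction that differs from the ground-truth label. The function name is illustrative.

```python
def fooling_rate(true_labels, adversarial_predictions):
    """Percentage of inputs whose prediction after the attack differs from the ground truth."""
    fooled = sum(
        1 for y, y_adv in zip(true_labels, adversarial_predictions) if y != y_adv
    )
    return 100.0 * fooled / len(true_labels)
```

Inputs for which the sampled target coincides with the ground-truth class, or for which the targeted attack fails within the distortion budget, count as not fooled under this definition.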

The results demonstrate that the introduced algorithms keep a very high fooling rate, close to the maximum in Methods 1, 2 and 4. Method 3, as we foresaw, achieved slightly lower fooling rates, approximately 10% below the maximum regardless of the distortion threshold. Therefore, the algorithm can effectively achieve the more complex global objective of producing a target probability distribution for the classes while remaining highly effective in the local objective of fooling any incoming input sample.

Fig. 6: Comparison of the fooling rate percentage between all the introduced approaches.

As an overview of the distortion, Table II shows the average distortion level introduced by the perturbations, in decibels (dB), computed independently for each ground-truth class. Following the methodology introduced in [39], the distortion has been computed as


being the clean signal and the perturbation, both of length , and considering only the signal ranges outside the vocal part in . As can be seen, even for the highest values of tried, the mean distortion level is below -32 dB for the majority of the classes, which is the maximum acceptable distortion threshold assumed in related works on adversarial perturbations in speech signals [13, 40, 39].

Max. distortion amount ()
Silence -17.01 -20.71 -22.41 -22.49 -22.54
Unknown -42.32 -41.67 -37.51 -36.11 -35.43
Yes -40.44 -39.59 -31.86 -29.17 -28.04
No -41.15 -40.71 -36.75 -35.19 -34.69
Up -39.50 -39.49 -36.90 -36.41 -36.31
Down -41.24 -40.05 -35.46 -33.92 -33.24
Left -40.91 -40.52 -35.88 -34.53 -33.94
Right -40.51 -40.45 -35.29 -33.33 -32.63
On -40.71 -40.43 -36.02 -34.41 -33.73
Off -40.54 -40.28 -35.71 -34.03 -33.30
Stop -40.43 -39.66 -32.38 -30.36 -29.72
Go -41.13 -41.06 -37.58 -36.41 -36.01
TABLE II: Distortion levels introduced by the adversarial perturbations generated, measured in decibels (dB).

5.6 General comparison of the introduced approaches

As a general overview of the effectiveness of the introduced strategies, focusing on Methods 2, 3 and 4, all three provided an effective way to find optimal transition matrices, capable of producing the desired objective probability distributions. In addition, considering that the effectiveness of a method depends on multiple factors, no single method is the best in all the cases. For instance, Method 3 was the most effective one in producing the desired probability distributions, but achieved lower fooling rates than Method 2 and Method 4, which achieved values close to the maximum fooling rates. In comparison to Method 2, Method 4 was capable of producing adversarial distributions closer to the target ones, but requires a considerably higher number of parameters, which can be prohibitive for problems with a very large number of classes. Thus, depending on the factor to be optimized, one method could be more suitable than the others.

Figure 7 provides a graphical comparison of the effectiveness according to the most relevant factors. Notice that some axes are flipped so that, in all the cases, a value is better the closer it is to the bottom-left corner. As can be seen, the methods form a Pareto front in the majority of the cases, particularly for .
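The notion of a Pareto front used in this comparison can be illustrated with a short sketch. Each method is described by a tuple of costs in which lower is always better, which mirrors the flipped axes of the figure: a quantity such as the fooling rate would be negated before being passed in. The function name and the numbers in the example are hypothetical and are not results from the experiments.

```python
def pareto_front(costs):
    """Return the names of the non-dominated entries.

    `costs` maps a name to a tuple of costs, all of them to be minimised.
    """
    def dominates(a, b):
        # a dominates b if it is no worse everywhere and strictly better somewhere.
        no_worse = all(x <= y for x, y in zip(a, b))
        strictly_better = any(x < y for x, y in zip(a, b))
        return no_worse and strictly_better

    return [
        name
        for name, cost in costs.items()
        if not any(
            dominates(other, cost)
            for other_name, other in costs.items()
            if other_name != name
        )
    ]

# Hypothetical costs: (distribution error, negated fooling rate).
example = {"A": (1.0, 3.0), "B": (2.0, 2.0), "C": (3.0, 3.0)}
front = pareto_front(example)
```

In this toy example, C is dominated by B, whereas A and B each win on one criterion and therefore both belong to the front.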

Fig. 7: Multi-factorial comparison of the effectiveness of the four methods introduced. Note that the number of parameters is bounded by for Methods 1, 2 and 3

6 Conclusions

In this paper we have introduced a novel strategy to generate adversarial attacks capable of producing any desired probability distribution for the classes when the attack is applied to multiple incoming inputs. The introduced attack has been conceived as an extension of targeted adversarial attacks, in which the target class is stochastically selected under the guidance of a transition matrix, which is optimized to achieve the desired goals. We have introduced four different strategies to optimize the transition matrices, which can be solved by using linear programs. We also experimentally validated our approach for the spoken command classification task, using as underlying attack an extension of the DeepFool algorithm, a state-of-the-art method to generate adversarial examples. Our results clearly show that the introduced methods are capable of producing close approximations of the target probability distribution for the output classes while achieving high fooling rates. This novel attack perspective can be used to produce more complex malicious behaviors in the target models, and to study adversarial attacks and defenses in more challenging scenarios.

7 Future lines

An interesting future research line could be trying to generate adversarial class distributions using a single universal perturbation. In this way, a single perturbation may not only cause the misclassification of every input, but also produce a desired probability distribution of the classes when applied to a large number of samples.

In addition, more than one class transition could be considered in order to increase the attack effectiveness of DeepFool, allowing consecutive jumps between two classes, which may increase the reachability between classes, and as a consequence, increase the range of probability distributions that can be approximated, or improve the approximation.


This work is supported by the Basque Government (BERC 2018-2021 and ELKARTEK programs, IT1244-19, and PRE_2019_1_0128 predoctoral grant) and by the Spanish Ministry of Economy and Competitiveness MINECO (project TIN2016-78365-R). Jose A. Lozano acknowledges support by the Spanish Ministry of Science, Innovation and Universities through BCAM Severo Ochoa accreditation (SEV-2017-0718).