Crafting Adversarial Examples For Speech Paralinguistics Applications

11/09/2017 ∙ by Yuan Gong, et al. ∙ University of Notre Dame

Computational paralinguistic analysis is increasingly being used in a wide range of applications, including security-sensitive applications such as speaker verification, deceptive speech detection, and medical diagnosis. While state-of-the-art machine learning techniques, such as deep neural networks, can provide robust and accurate speech analysis, they are susceptible to adversarial attacks. In this work, we propose a novel end-to-end scheme to generate adversarial examples by directly perturbing the raw waveform of an audio recording rather than specific acoustic features. Our experiments show that the proposed adversarial perturbation can lead to a significant performance drop of state-of-the-art deep neural networks, while only minimally impairing the audio quality.






Computational speech paralinguistic analysis is rapidly turning into a mainstream topic in the field of artificial intelligence. In contrast to computational linguistics, computational paralinguistics analyzes how people speak rather than what people say [Schuller and Batliner2013]. Studies have shown that human speech not only contains the basic verbal message, but also paralinguistic information, which can be used (when combined with machine learning) in a wide range of applications, such as speech emotion recognition [Trigeorgis et al.2016] and medical diagnostics [Bocklet et al.2011]. Many of these applications are security sensitive and must be very reliable, e.g., speaker verification systems used to prevent unauthorized access should have a low false positive rate [Variani et al.2014], while systems used to detect deceptive speech should have a low false negative rate [Schuller et al.2016]. A threat to these types of systems, which has not yet found widespread attention in the speech paralinguistic research community, is that the machine learning models these systems rely on can be vulnerable to adversarial attacks, even when these models are otherwise very robust to noise and variance in the input data. That is, such systems may fail even with very small, but well-designed perturbations of the input speech, leading the machine learning models to produce wrong results.

Figure 1: Illustration of an attack on a speaker verification system: the attacker can make the system accept an illegal input by adding a well-designed small perturbation.

An adversarial attack can lead to severe security concerns in security-sensitive computational paralinguistic applications. Since it is difficult for humans to distinguish adversarial examples generated by an attacking algorithm from the original speech samples, it is possible for an attacker to impact speech-based authentication systems, lie detectors, and diagnostic systems (e.g., Figure 1 illustrates an example of attacking a speaker verification system). In addition, with rapidly growing speech databases (e.g., recordings collected from speech-based IoT systems), machine learning techniques are increasingly being used to extract hidden information from these databases. However, if these databases are polluted with adversarial examples, the conclusions drawn from the analysis can be false or misleading. As a consequence, obtaining a deeper understanding of this problem will be essential not only to learning how to prevent future adversarial attacks from impacting speech analysis, but also to learning how to utilize such attacks to increase privacy protection of speech. That is, if well-designed, adversarial attacks can also be used to fool speaker recognition systems (thereby protecting the privacy of the speaker), while still maintaining enough sound quality to allow other types of analysis (e.g., for medical diagnostics) to succeed. Toward this end, this paper proposes a new method to generate adversarial speech examples from original examples. In this work, instead of manipulating specific speech acoustic features, our goal is to perturb the raw waveform of an audio recording directly. Our experiments show that the resulting adversarial speech examples are able to impact the outcomes of state-of-the-art paralinguistic models, while affecting the sound quality only minimally.

Related Work

Computer Vision.

The vulnerability of neural networks to adversarial examples, particularly in the field of computer vision, was first discovered and discussed in [Szegedy et al.2013]. In [Goodfellow, Shlens, and Szegedy2014], the authors analyzed the reasons and principles behind the vulnerability of neural networks to adversarial examples and proposed a simple but effective gradient based attack approach, which has been widely used in later studies. In recent years, properties of adversarial attacks have been studied extensively, e.g., in [Kurakin, Goodfellow, and Bengio2016], the authors found that machine learning models make wrong predictions even when the adversarial images are fed to the model through a camera instead of being used directly as input. Further, in [Papernot et al.2016a], the authors found that it is possible to attack a machine learning model even without knowing the details of the model.

Speech and Audio Processing.

In [Carlini et al.2016], the authors proposed an approach to generate hidden voice commands, which can be used to attack speech recognition systems. These hidden voice commands are unrecognizable by the human ear and completely different from the original command. In [Kereliuk, Sturm, and Larsen2015], the authors proposed an approach to generate adversarial examples of music by applying perturbation on the magnitude spectrogram. However, rebuilding time-domain signals from spectrograms is difficult, because of the overlapping windows used for analysis, which makes adjacent components in the spectrogram dependent. In [Iter, Huang, and Jermann2017], the authors also proposed an approach to generate adversarial speech examples, which can mislead speech recognition systems. They first extract MFCC features from the original speech, add perturbations to the MFCC features, and then rebuild speech from the perturbed MFCC features. While the rebuilt speech samples are still recognizable by the human ear, they are very different from the original samples, because extracting MFCC from audio is a very lossy conversion. All these prior efforts add perturbation at the feature level and require a reconstruction step using acoustic and spectrogram features, which will further modify and impact the original data (in addition to the actual perturbation). This additional modification can make adversarial examples sound strange to the human ear. In comparison, in the computer vision field, adversarial examples only contain very minor amounts of noise that are difficult or even impossible to detect by the human eye. Therefore, in order to avoid the downsides of performing the inverse conversions from features to audio, in this work we propose an end-to-end approach to crafting adversarial examples by directly modifying the original waveforms.

Contributions of this Paper

  • We propose an end-to-end speech adversarial example generation scheme that does not require an audio reconstruction step. The perturbation is added directly to the raw waveform rather than the acoustic features or spectrogram features so that no lossy conversion from the features back to the waveform is needed. To the best of our knowledge, this is the first end-to-end adversarial generation approach in the audio processing field.

  • We discover the vanishing gradient problem of using a gradient based attack approach on recurrent neural networks. We solve this problem by using a substitution network that replaces the recurrent structure with a feed-forward convolutional structure. We believe this solution is not limited to our application, but can be used in other applications with sequence inputs.

  • We conduct comprehensive experiments with three different speech paralinguistic tasks, which empirically prove the effectiveness of the proposed approach. The experimental results also indicate that the adversarial examples can be generalized to other models. To the best of our knowledge, this is the first work on adversarial attacks, which is a potentially severe security risk, in the field of computational speech paralinguistics. We expect that our results and discussions will bring useful insights for future studies and efforts in building more robust machine learning models.

Figure 2: The proposed end-to-end adversarial attack scheme.

Crafting Speech Adversarial Examples

The block diagram of the proposed speech adversarial attack scheme is shown in Figure 2. In this section, we first provide a formal definition of the adversarial example generation problem. We then briefly review the gradient based approach used in this work and discuss its use in our application. We explain the limitations of acoustic-feature-level adversarial attacks and why building an end-to-end speech adversarial attack scheme is essential. Further, we discuss the selection of a hypothetical model for the attack. Finally, we describe the vanishing gradient problem in gradient based attacks on recurrent neural networks and address this problem using a model substitution approach.


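We describe a classifier that maps raw speech audio waveform vectors into a discrete label set as $f: \mathbb{R}^n \rightarrow \{1, \dots, k\}$ and the parameters of $f$ as $\theta$. We further assume that there exists a continuous loss function associated with $f$, described as $J(\theta, x, y)$. We describe a given speech audio waveform using $x \in \mathbb{R}^n$, the ground-truth label of $x$ using $y$, and a perturbation on the vector $x$ using $\eta \in \mathbb{R}^n$. Our goal is to address the following optimization problem:

$$\min_{\eta} \|\eta\| \quad \text{s.t.} \quad f(x + \eta) \neq f(x) \tag{1}$$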


Gradient Based Adversarial Attacks

Due to the non-convexity of many machine learning models (e.g., deep neural networks), the exact solution of the optimization in Equation 1 can be difficult to obtain. Therefore, gradient based methods [Goodfellow, Shlens, and Szegedy2014] have been proposed to find an approximation of such an optimization problem. The gradient based method under the max-norm constraint is referred to as Fast Gradient Sign Method (FGSM). The FGSM is then used to generate a perturbation $\eta$:

$$\eta = \epsilon \, \mathrm{sign}(\nabla_x J(\theta, x, y)) \tag{2}$$

where $\epsilon$ is the perturbation factor. Let $w$ be the weight vector of a neuron. Consider the situation when the neuron takes the entire waveform $x$ as input; then, when we add a perturbation $\eta$, the activation of the neuron will be:

$$\text{activation} = w^T (x + \eta) = w^T x + w^T \eta \tag{3}$$

The change of the activation is then expressed as:

$$\Delta\text{activation} = w^T \eta \tag{4}$$

Assume that the average magnitude of the elements of $w$ is expressed as $m$, and note that $w$ has the same dimension $n$ as $x$. When the sign of each element of $\eta$ matches the sign of the corresponding weight, as in the linear case analyzed in [Goodfellow, Shlens, and Szegedy2014], the change becomes:

$$\Delta\text{activation} = \epsilon m n \tag{5}$$

Since $\|\eta\|_\infty$ does not grow with $n$, we find that even when $\|\eta\|_\infty$ is fixed, the change of the activation of the neuron grows linearly with the dimensionality $n$, which indicates that for a high dimensional problem, even a small change in each dimension can lead to a large change of the activation, which in turn will lead to a change of the output, because even for non-linear neural networks, the activation function primarily operates linearly in the non-saturated region. We call this the accumulation effect. Speech data processing, when done in an end-to-end manner, is an extremely high dimensional problem. Using typical sampling rates of 8kHz, 16kHz, or 44.1kHz, the dimensionality of the problem ranges from 240,000 to more than 1.3 million for a 30-second audio recording. Thus, taking advantage of the accumulation effect, a small, well-designed perturbation of each data point can cause a large shift in the output decision.
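The accumulation effect can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: it assumes a hypothetical single logistic neuron (weights `w`, bias `b`) trained with a cross-entropy loss, for which the input gradient has the closed form `(p - y) * w`. Under that assumption the FGSM perturbation of Equation 2 lines up with the sign of `w`, so the magnitude of the activation change equals the perturbation factor times the sum of the absolute weights. All function names and the random weights are illustrative.

```python
import numpy as np


def logistic_loss_grad_x(w, b, x, y):
    """Gradient of the cross-entropy loss of a logistic neuron w.r.t. the input x."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))  # predicted probability of class 1
    return (p - y) * w


def fgsm_perturbation(grad_x, epsilon):
    """FGSM: a max-norm bounded step along the sign of the input gradient."""
    return epsilon * np.sign(grad_x)


def activation_change(w, eta):
    """Change of the neuron's pre-activation caused by the perturbation eta."""
    return float(w @ eta)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    epsilon = 0.02
    for n in (1_000, 96_000):  # 96,000 samples = 6 s of audio at 16 kHz
        w = rng.normal(0.0, 0.01, size=n)  # hypothetical weights (assumption)
        x = rng.uniform(0.0, 1.0, size=n)  # stand-in waveform scaled into [0, 1]
        eta = fgsm_perturbation(logistic_loss_grad_x(w, 0.0, x, y=1), epsilon)
        m = np.mean(np.abs(w))             # average weight magnitude
        # The two printed values should agree: |w^T eta| versus epsilon * m * n.
        print(n, abs(activation_change(w, eta)), epsilon * m * n)
```

With the per-sample perturbation held at the same value, the printed activation change grows in proportion to the input length, which is the linear growth in the dimensionality described above.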

Figure 3: A feature-level attack scheme.

End-to-end Adversarial Example Generation

As discussed in the related work section, previous approaches to audio adversarial example generation are on the acoustic feature level. As shown in Figure 3, acoustic features are first extracted from the original audio, the perturbation is added to the acoustic features, and then audio is reconstructed from the perturbed acoustic features. We believe that this scheme has the following limitations:

  • Acoustic feature extraction is usually a lossy conversion and reconstructing audio from acoustic features cannot recover this loss. For example, in [Iter, Huang, and Jermann2017], an adversarial attack is conducted on the MFCC features, which typically represent each audio frame of 20-40ms (160-320 data points if the sample rate is 8kHz) using a vector of 13-39 dimensions. The conversion loss can be significant and even be larger than the adversarial perturbation due to the information compression. Thus, the final perturbation on the audio waveform consists of the adversarial perturbation plus an extra perturbation caused by the lossy conversion:

    $$\eta_{\text{audio}} = \eta_{\text{adversarial}} + \eta_{\text{conversion}} \tag{6}$$

  • Adversarial attacks on acoustic features and on raw audio waveform are actually two different optimization problems:

    $$\min \|\eta_{\text{feature}}\| \;\neq\; \min \|\eta_{\text{audio}}\| \tag{7}$$

    Since most acoustic features do not have a linear relationship with the audio amplitude, it is possible that a small perturbation on acoustic features will lead to large perturbations on the audio waveform and vice versa. Thus, an adversarial attack on acoustic features might not even be an approximation of an adversarial attack on the raw audio.
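The first limitation, that a lossy feature round trip perturbs the waveform even before any adversarial perturbation is added, can be illustrated with a small sketch. This is not an MFCC pipeline: as a stand-in assumption, a 320-sample frame is compressed to 13 coefficients with a truncated orthonormal DCT and then reconstructed. The function names `dct_matrix` and `lossy_round_trip` are hypothetical.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix of shape (n, n)."""
    k = np.arange(n)[:, None]   # frequency index (rows)
    i = np.arange(n)[None, :]   # sample index (columns)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] /= np.sqrt(2.0)     # scale the DC row so that rows are orthonormal
    return m


def lossy_round_trip(frame, n_coeffs):
    """Keep only the first n_coeffs DCT coefficients and reconstruct the frame."""
    d = dct_matrix(len(frame))
    features = d[:n_coeffs] @ frame      # "feature extraction" (lossy)
    return d[:n_coeffs].T @ features     # "audio reconstruction"


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.normal(size=320)         # stand-in for one 40 ms frame at 8 kHz
    recon = lossy_round_trip(frame, 13)  # 13 coefficients, as in low-dimensional features
    rel_err = np.linalg.norm(frame - recon) / np.linalg.norm(frame)
    print("relative reconstruction error with no attack:", rel_err)
```

Any perturbation added in the 13-dimensional feature space would come on top of this reconstruction error, which is the extra term the first bullet describes.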

Figure 4: Network topology of WaveRNN and WaveCNN.

In order to overcome the above-mentioned limitations, we propose a novel end-to-end adversarial attack scheme that directly perturbs the raw audio waveform. Compared to the feature-level attack scheme, the proposed scheme completely abandons the feature extraction and audio reconstruction steps to avoid any possible conversion loss. The optimization is then directly targeted at the perturbation on the raw audio. A key component of the proposed scheme is to use an end-to-end machine learning model that is able to directly map the raw audio to the output as our hypothetical model. A good choice of such a hypothetical model is the model proposed in [Trigeorgis et al.2016]; we refer to this model as WaveRNN in the remainder of this paper. WaveRNN is the first network that learns directly from raw audio waveforms to paralinguistic labels using recurrent neural networks (RNNs) and has become a state-of-the-art baseline approach for multiple paralinguistic applications due to its excellent performance compared to previous models [Schuller et al.2017]. As shown in Figure 4, WaveRNN first segments the input audio sequence into frames of 40ms and processes each frame separately in the first few layers (front-end layers). The output of the front-end layers is then fed, in order of time, to the back-end recurrent layers.

Previous studies have shown that adversarial examples are able to generalize, i.e., examples generated for attacking the hypothetical model are then often able to attack other models, even when they have different architectures [Goodfellow, Shlens, and Szegedy2014]. Further, the more similar the structures of the hypothetical model and a practical model are, the better the practical attack performance will be. Thus, the adversarial examples generated for the WaveRNN model are expected to have good attack performance with a variety of state-of-the-art WaveRNN-like networks widely used in different paralinguistic tasks. However, in our work, we also discovered the vanishing gradient problem, which prevents us from using WaveRNN directly as the hypothetical model. This problem is discussed in the next section.

The Vanishing Gradient Problem of Gradient Based Attacks on RNNs

One basic prerequisite of gradient based attacks is that the required gradients can be computed efficiently [Goodfellow, Shlens, and Szegedy2014]. While, in theory, the gradient based method applies as long as the model is differentiable [Papernot et al.2016b], even if there are recurrent connections in the model, we observe that when we use RNNs to process long input sequences (such as speech signals), most elements in the gradient $\nabla_x J(\theta, x, y)$ go to zero except the last few elements. This means that the gradient of the early input in the sequence is not calculated effectively, which will cause the perturbation to only have values in the last few elements. In other words, the perturbation is only added to the end of the input sequence. Since this problem has not been reported before (calculating the gradients with respect to the input is not typical, while calculating them with respect to the model parameters is), we formalize and describe it mathematically in the following paragraphs.

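We assume one single output (rather than a sequence of outputs) for each input sequence, which is the typical case for paralinguistic applications. Consider the simplest one-layer RNN and let $x_t$ be the input and $s_t$ be the state of neurons at time step $t$, $y$ be the ground-truth label, $\hat{y}$ be the prediction, $J$ be the loss function, and $T$ be the number of time steps of the sequence. The vanishing gradient problem in gradient based attacks can be described as follows:

$$\frac{\partial J}{\partial x_k} = \frac{\partial J}{\partial \hat{y}} \, \frac{\partial \hat{y}}{\partial s_T} \, \frac{\partial s_T}{\partial s_k} \, \frac{\partial s_k}{\partial x_k} \tag{8}$$

We can use the chain rule to expand $\partial s_T / \partial s_k$:

$$\frac{\partial s_T}{\partial s_k} = \prod_{j=k+1}^{T} \frac{\partial s_j}{\partial s_{j-1}} \tag{9}$$

Since the derivative of the $\tanh$ function is in $(0, 1]$, we have:

$$\left\| \frac{\partial s_j}{\partial s_{j-1}} \right\| \le 1 \tag{10}$$

When the time interval between $k$ and $T$ becomes larger, especially when $k \ll T$, the continuous multiplication in Equation 9 will approach zero:

$$\frac{\partial s_T}{\partial s_k} = \prod_{j=k+1}^{T} \frac{\partial s_j}{\partial s_{j-1}} \rightarrow 0 \tag{11}$$

We then substitute Equation 11 into Equation 8:

$$\frac{\partial J}{\partial x_k} \rightarrow 0, \quad k \ll T \tag{12}$$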

Equation 12 indicates that for a long input sequence vector, partial derivatives of the loss function with respect to the inputs at earlier time steps are disappearing, which will make the gradient based method fail. We believe it is an inherent problem of RNNs. Interestingly, using Long Short Term Memory networks (LSTM) cannot fix the problem, potentially because LSTM will still forget some input information. Figure 5 (upper graph) shows the partial derivative of the loss function of a trained WaveRNN model with respect to each input audio data point. Even when the WaveRNN uses LSTM recurrent layers, the partial derivatives with respect to the first 60,000 data points are close to zero.
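The decay of the input gradient can be reproduced on a toy model. The sketch below is an assumption-laden illustration rather than WaveRNN: it uses a plain tanh RNN with scalar inputs, hypothetical random weights whose recurrent matrix is rescaled to a spectral norm of 0.9, and the sum of the final state as a stand-in loss. It backpropagates by hand and returns the derivative with respect to every input time step.

```python
import numpy as np


def input_gradients(W, U, x):
    """Backpropagate through s_t = tanh(W s_{t-1} + U x_t) with scalar inputs x_t.

    Returns d(sum(s_T)) / d(x_t) for every time step t.
    """
    T, h = len(x), W.shape[0]
    states = np.zeros((T + 1, h))
    for t in range(T):                      # forward pass
        states[t + 1] = np.tanh(W @ states[t] + U * x[t])
    grads = np.zeros(T)
    g = np.ones(h)                          # d(sum(s_T)) / d(s_T)
    for t in range(T - 1, -1, -1):          # backward pass through time
        d = g * (1.0 - states[t + 1] ** 2)  # through the tanh nonlinearity
        grads[t] = U @ d                    # derivative w.r.t. the input x_t
        g = W.T @ d                         # derivative w.r.t. the previous state
    return grads


def make_rnn(hidden, rng, spectral_norm=0.9):
    """Hypothetical random RNN weights; W is rescaled to the given spectral norm."""
    W = rng.normal(size=(hidden, hidden))
    W *= spectral_norm / np.linalg.norm(W, 2)
    U = rng.normal(size=hidden)
    return W, U


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W, U = make_rnn(16, rng)
    grads = input_gradients(W, U, rng.uniform(0.0, 1.0, size=400))
    print("|gradient| at first, middle, last step:",
          abs(grads[0]), abs(grads[200]), abs(grads[-1]))
```

An FGSM step built from such a gradient would carry usable sign information only for the last part of the sequence, which matches the observation that the perturbation ends up concentrated at the end of the input.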

Figure 5: The vanishing gradient problem: partial derivatives of the RNN’s loss function with respect to inputs at earlier time steps in a long sequence tend to disappear. Upper graph: the partial derivative of the loss function of WaveRNN (with LSTM layers) with respect to each input audio data point. Lower graph: the partial derivative of the loss function of WaveCNN with respect to each input audio data point.

The Substitution Model Approach

According to the discussion in the last section, it is difficult to effectively calculate the gradient of the loss function with respect to the input sequence for RNNs. While RNNs are most commonly used for processing this kind of sequence input, they are not indispensable. In order to fix the vanishing gradient problem, we propose a new network with a completely feed-forward structure, which we refer to as WaveCNN. Similar to WaveRNN, WaveCNN also first divides the input audio sequence into frames of 40ms and processes each frame separately in the front-end layers. But after that, instead of feeding the output of the front-end layers of each frame into recurrent layers, the WaveCNN approach concatenates the outputs of the front-end layers of all frames and feeds them to the following convolutional layers. The WaveCNN approach uses convolutional structures as a substitution for the recurrent structures. This modification eliminates the recurrent structures in the network and hence fixes the vanishing gradient problem. As shown in Figure 5 (lower graph), WaveCNN can calculate the gradient over all input data points effectively. The design of WaveCNN still retains the local receptive fields arithmetic of WaveRNN and the back-end convolutional layers are also able to capture temporal information. In fact, in our experiments, WaveCNN performs almost the same as WaveRNN for a series of tasks. We therefore use WaveCNN as the hypothetical model in our work.

Figure 6: The error rate with different perturbation factors for the gender recognition task (left), emotion recognition task (middle), and speaker recognition task (right).

Figure 7: Comparison of the waveform (left), spectrogram (middle), and enlarged vocal spectrogram (right) between the adversarial example (upper graphs) and the original example (lower graphs).

Experimental Evaluation

Dataset and Test Strategy

We use the audio part of the IEMOCAP dataset [Busso et al.2008] for our experiments, which is a commonly used database in speech paralinguistic research. The IEMOCAP database consists of 10,039 utterances (average length: 4.46s) of 10 speakers (5 male, 5 female). To keep our experiments general, we evaluate three different speech paralinguistic tasks using IEMOCAP:

  • Gender Recognition: A binary classification task of predicting the gender of the speaker. The IEMOCAP database consists of 5239 utterances from male speakers and 4800 utterances from female speakers.

  • Emotion Recognition: A binary classification task to distinguish sad speech from angry speech. The IEMOCAP database consists of 1103 utterances annotated as angry and 1084 utterances annotated as sad.

  • Speaker Recognition: A four-class classification task of predicting the identity of the speaker. We use the data of four speakers in IEMOCAP sessions 1&2, which consist of two male speakers and two female speakers. The number of utterances of each speaker are 946, 873, 952, and 859, respectively.

Note that we simplified the emotion recognition task and the speaker recognition task in our experiment by limiting them to a two-class and four-class problem, respectively. This simplification gives the model a better performance before being attacked, and it is more meaningful to attack a model that originally has good performance. The simplification also makes the class distribution balanced in our experiments; therefore, accuracy is an effective metric for evaluating the performance of the model. All experiments are conducted with a hold-out test strategy, i.e., 75%, 5%, and 20% of the data is used for training, validation, and test, respectively. Hyperparameters are tuned only on the validation set. All utterances are padded with zeros or cut to 6 seconds of audio and further scaled into [0,1]. The audio sample rate is 16kHz, thus each utterance is represented by a 96000-dimensional vector.
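The fixed-length preprocessing described above can be sketched as follows. The text does not say where the zeros are added, which part of a long utterance is kept, or how the scaling into [0,1] is done, so this sketch makes explicit assumptions: zeros are appended at the end, long utterances keep their first 6 seconds, and scaling is a per-utterance min-max normalization. The function names are hypothetical.

```python
import numpy as np

SAMPLE_RATE = 16_000            # Hz, as stated in the text
TARGET_LEN = 6 * SAMPLE_RATE    # 6 s of audio = 96,000 samples


def pad_or_cut(waveform, target_len=TARGET_LEN):
    """Zero-pad at the end or keep the first target_len samples (assumption)."""
    waveform = np.asarray(waveform, dtype=np.float64)
    if len(waveform) >= target_len:
        return waveform[:target_len]
    return np.pad(waveform, (0, target_len - len(waveform)))


def scale_to_unit_range(waveform):
    """Min-max scale one utterance into [0, 1] (assumed scaling scheme)."""
    lo, hi = waveform.min(), waveform.max()
    if hi == lo:                # constant signal: avoid division by zero
        return np.zeros_like(waveform)
    return (waveform - lo) / (hi - lo)


if __name__ == "__main__":
    # Stand-in utterance of 4.46 s, the average IEMOCAP utterance length.
    utterance = np.sin(np.linspace(0.0, 200.0, int(4.46 * SAMPLE_RATE)))
    x = scale_to_unit_range(pad_or_cut(utterance))
    print(x.shape, x.min(), x.max())
```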

Network Training

The WaveRNN and WaveCNN models are trained separately for each paralinguistic task. The network structure is fixed for all experiments; the only hyperparameter we tune during training is the learning rate. We conduct a binary search for the learning rate in the range [1e-5, 1e-2]. The maximum number of epochs is 200. The best model and learning rate are selected according to the performance on the validation set.

Interestingly, we find that the WaveRNN and WaveCNN models perform very similarly. In the gender recognition task, both models have an accuracy of 88%; in the emotion recognition task, the WaveRNN has an accuracy of 84%, while the WaveCNN has an accuracy of 85%; in the speaker recognition task, the WaveRNN has an accuracy of 69%, while the WaveCNN has an accuracy of 73%. This indicates that the convolutional back-end layers can also process audio sequences well.

Adversarial Attack Evaluation

In our work, we focus on attacking WaveRNN, a state-of-the-art and widely used model for speech paralinguistic applications. However, due to the reasons mentioned in the previous section, it is not appropriate to set WaveRNN as the hypothetical model in the attack. We therefore use WaveCNN as our hypothetical model and expect that the attack can be generalized to the WaveRNN model. We perform the attack using FGSM described in Equation 2 on all three paralinguistic tasks with different perturbation factors $\epsilon$. We perform the attack twice in all tests. We can observe the following from the results shown in Figure 6:

  • The proposed attack approach is effective in all paralinguistic tasks. With a perturbation factor of 0.02, the gender recognition error rate increases from 12% to 31%, and from 12% to 25% for WaveRNN and WaveCNN, respectively. With a perturbation factor of 0.015, the emotion recognition error rate increases from 16% to 48%, and from 15% to 42.5% for WaveRNN and WaveCNN, respectively. With a perturbation factor of 0.032, the speaker recognition error rate increases from 31% to 62%, and from 29% to 44% for WaveRNN and WaveCNN, respectively. Note that 50% and 75% are the chance-level error rates for balanced 2-class and 4-class classification tasks, respectively.

  • The proposed attack approach can be generalized. While the adversarial examples are built to attack the WaveCNN as the hypothetical model, they can also cause a significant performance drop for the WaveRNN model. Interestingly, the attack affects WaveCNN and WaveRNN in different ways; the error rate of the WaveCNN model increases linearly with respect to the perturbation factor, while the error rate of the WaveRNN model does not change with small perturbation factors, but changes rapidly when the perturbation factor exceeds a specific level. We observe a performance fluctuation of the WaveRNN attack in the gender recognition task, which we believe is due to the loss function coincidentally reaching a sub-optimal solution with the perturbation rather than a flaw of the attack. The error rate still increases with the perturbation factor after the fluctuation.

  • The performance when there is no attack is not an indicator of the vulnerability of the model. In our experiments, the error rate of the emotion recognition model is similar to the gender recognition model when there is no attack, but increases much faster with larger perturbation factors than the gender recognition model. This indicates that even high performing models can be very vulnerable to adversarial examples.
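The evaluation protocol above, crafting examples with the gradient of a hypothetical model and measuring the error rate of a target model over several perturbation factors, can be sketched generically. This is a minimal illustration, not the experimental code: `grad_fn` and `predict_fn` are hypothetical callables standing in for the WaveCNN input gradient and the WaveRNN (or WaveCNN) classifier, and the toy linear model exists only to make the sketch runnable.

```python
import numpy as np


def attack_error_rates(grad_fn, predict_fn, X, y, epsilons):
    """Error rate of predict_fn on FGSM examples crafted with grad_fn, per epsilon."""
    rates = []
    for eps in epsilons:
        X_adv = X + eps * np.sign(grad_fn(X, y))   # FGSM step on the raw input
        rates.append(float(np.mean(predict_fn(X_adv) != y)))
    return rates


def make_linear_model(w):
    """Toy logistic classifier used as a stand-in model (assumption)."""
    def predict(X):
        return (X @ w > 0).astype(int)

    def grad(X, y):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))         # predicted probability of class 1
        return (p - y)[:, None] * w[None, :]       # cross-entropy gradient w.r.t. X

    return predict, grad


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 0.1, size=50)
    X = rng.normal(size=(200, 50))
    predict, grad = make_linear_model(w)
    y = predict(X)                                 # labels the toy model gets right
    epsilons = [0.0, 0.01, 0.05, 0.1, 0.5]
    for eps, rate in zip(epsilons, attack_error_rates(grad, predict, X, y, epsilons)):
        print(f"epsilon={eps}: error rate={rate:.3f}")
```

To mimic the transfer setting of the experiments, `grad_fn` and `predict_fn` would come from two different models; the sketch passes the same toy model for both only to stay self-contained.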

Perturbation Analysis

An adversarial example generated for the emotion recognition task is compared to an original example in Figure 7. At the perturbation factor used for this example, both the WaveRNN and WaveCNN models are close to random guessing, with error rates of 51% and 46.5%, respectively. Comparing the spectrograms of the proposed adversarial example and the adversarial example generated by the feature-level attack (i.e., Figure 1 in [Iter, Huang, and Jermann2017]), we observe that the proposed perturbation is much smaller, especially in the vocal parts. The feature-level attack greatly obscures the vocal spectrogram, while the proposed attack barely changes it. From the waveform we also see that the perturbation appears as noise added to the original sample. Listening to the adversarial audio reveals that the vocal part is unchanged, while the perturbation sounds the same as “normal” noise. With such perturbations (i.e., added noise), humans have no problem recognizing the emotion of the speaker at all. Further, when comparing the spectrograms of the original example and the adversarial example, we observe that the perturbation covers a broad spectrum, which means that it would be difficult to eliminate the attack through simple filtering.


Since WaveCNN and WaveRNN have completely different back-end layers, we believe that one reason for the generalizability of adversarial examples between WaveCNN and WaveRNN is that they use the same front-end layers, which have a high probability of learning similar representations and therefore having similar convolutional kernel weights. Considering Equation 4, the activation change of neurons is still very large. If this assumption is correct, then a variety of end-to-end speech processing models that use similar front-end layer structures might also suffer from the same attack.

More generally, the phenomenon that deep neural network based end-to-end speech paralinguistic models are robust to variance and noise in naturally occurring data, but vulnerable to man-made perturbation is somewhat counter-intuitive, but actually not surprising. End-to-end models experience performance improvements by processing problems in a much higher dimensional space in order to obtain a more accurate approximation function of the problem, but this also leaves blind spots in the space where data is distributed sparsely. Therefore, when manually perturbed data enters these blind spots, the model is not able to make correct predictions. Nonetheless, the problem is not unsolvable. If the exact type of attack is known, a defense model can be built by mixing adversarial examples into the training data, so that the deep neural network sees such adversarial examples and refines its approximation function to withstand them. To the best of our knowledge, current end-to-end audio processing algorithms (not just limited to speech paralinguistics) barely pay attention to this type of risk. Therefore, our goal is to provide a better understanding of such attacks to help design adversarial-robust models in the future.


Adversarial attacks on computational paralinguistic systems pose a critical security risk that has not yet received the attention it deserves. In this work, we propose an end-to-end adversarial example generation scheme, which directly perturbs the raw audio waveform. Our experiments with three different paralinguistic tasks empirically show that the proposed approach can effectively attack WaveRNN models, a state-of-the-art deep neural network approach that is widely used in paralinguistic applications, while the added perturbation is much smaller compared to previous feature-based audio adversarial example generation techniques.


Details of WaveRNN and WaveCNN

Layer Index    Network Parameters and Explanation
-----------    ----------------------------------
Common Front-end Layers
Reshape        Divide audio into frames of 40ms
1-16           8 × (convolutional layer + max pooling layer), kernel size=[1,40], feature number=32, pool size=(1,2), zero padding

WaveCNN Back-end Layers
Reshape        Concatenate output of each frame
17-28          6 × (convolutional layer + max pooling layer), kernel size=[1,40], feature number=32, pool size=(1,2), zero padding
Reshape        Flatten
29             Fully-connected layer with 64 units
30             Fully-connected layer with softmax output

WaveRNN Back-end Layers
17             LSTM recurrent layer with 64 units
18             Fully-connected layer with softmax output

Table 1: Network architecture of WaveCNN and WaveRNN

The details of the network architectures for WaveCNN and WaveRNN are shown in Table 1. The model training uses the following: Adam optimizer [Kingma and Ba2014], cross entropy loss function, learning rate decay of 0.1, max number of epochs of 200, and batch size of 100. The initial learning rate is selected within the range of [1e-5,1e-2] using a binary search.