Following the growing use of deep learning in fields like computer vision and natural language processing (NLP) came an increasing interest in the domain of adversarial learning, that is, attacking and defending deep learning models algorithmically. Of special interest are adversarial examples, which are classifier samples modified in order to be misclassified by the attacked classifier. Most research in deep adversarial learning has focused on computer vision, especially the image recognition domain, and therefore mostly on convolutional neural networks (CNNs), commonly used in this domain. In recent years, more and more adversarial example generation methods have been presented in the NLP domain, e.g., to bypass sentiment analysis classifiers, which are usually RNN classifiers. Those attacks were also extended into the cyber-security domain. For instance, attacks were developed against dynamic analysis based RNN classifiers that use the API calls of a running process as features. This domain raises special interest because it contains real adversaries: malware developers, who want to evade next generation machine and deep learning based classifiers. While attacks have been presented against RNN classifiers, to the best of our knowledge, there is currently no published and evaluated method to make an RNN model resistant to adversarial sequences, and this paper is the first to address it.
In the cyber-security domain, adversarial learning is used not only to increase the accuracy of a classifier on out-of-distribution input, but also to assess the robustness of the classifier against real-world attacks, making such research critical. For this reason, we focus on this domain and not, for instance, on the more explored NLP domain.
Beyond that, adversarial API sequences and their defense methods have unique challenges which separate them from other sequential input, e.g., natural language sentences for sentiment analysis tasks:
Adversarial API sequences must contain only valid APIs. Using an 'unknown-word' value as part of the vocabulary (commonly done in NLP) is not possible in API call sequences.
API sequences can easily exceed millions of API calls per sample, making the training of classifiers, adversarial attacks, and defenses on a GPU infeasible.
Our contribution is the presentation of four different novel defense methods against RNN adversarial sequences. To the best of our knowledge, no paper addresses defense methods against RNN adversarial attacks at all, and especially not in the cyber-security domain, in which adversaries (malware developers) actually exist.
2 Background and Related Work
2.1 DNN Adversarial Examples
The input $x$, correctly classified by the classifier $f$, is perturbed with a perturbation $\delta$ such that the resulting adversarial example $x^* = x + \delta$ remains in the input domain but is assigned a different label than $x$. To solve Equation 1, we need to transform the constraint $f(x^*) \neq f(x)$ into an optimizable formulation. Then we can easily use the Lagrange multiplier to solve it. To do this, we define a loss function $\ell$ to quantify this constraint. This loss function can be the same as the training loss, or it can be chosen differently, e.g., hinge loss or cross-entropy loss.
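As a concrete illustration of this formulation, the following sketch perturbs the input of a toy logistic-regression classifier along the sign of the loss gradient (an FGSM-style step). The model, weights, input, and epsilon are all illustrative assumptions, not values from the paper:

```python
import numpy as np

# Toy logistic-regression "classifier" f(x) = sigmoid(w.x + b).
# Weights, input, and epsilon are illustrative, not from the paper.
w = np.array([1.5, -2.0, 0.5])
b = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x):
    return int(sigmoid(w @ x + b) >= 0.5)

def fgsm_perturb(x, y_true, eps):
    """One gradient-sign step increasing the loss of the true label,
    pushing x toward the decision boundary."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y_true) * w   # d(cross-entropy)/dx for this model
    return x + eps * np.sign(grad_x)

x = np.array([1.0, 0.2, -0.5])
y = predict(x)                  # original label (here: 1)
x_adv = fgsm_perturb(x, y, eps=0.6)
```

For this toy model, `predict(x_adv)` flips to the other class while the perturbation stays bounded by epsilon per component, which is exactly the constrained-optimization view described above.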
 generates a black-box adversarial example in a two phase process:
Substitute model training: the attacker queries the black-box model with selected synthetic inputs, generated by augmenting an initial set of inputs representative of the input domain with their FGSM-perturbed variants, in order to build a model approximating the black-box model's decision boundaries.
Adversarial sample crafting: the attacker uses the substitute model to craft adversarial samples, which are then misclassified by the black-box model due to the transferability of adversarial examples.
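The two-phase process above can be sketched as follows. The black-box oracle, the synthetic query distribution, and the substitute architecture (a plain logistic regression) are all toy assumptions for illustration, not the paper's actual models:

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(x):
    """Oracle the attacker can only query; its rule is 'unknown' to her.
    (A stand-in for the remote classifier, illustrative only.)"""
    return int(x[0] + x[1] > 1.0)

# Phase 1: substitute model training on labels queried from the oracle.
X = rng.uniform(-1, 2, size=(500, 2))
y = np.array([black_box(x) for x in X], dtype=float)

w, b = np.zeros(2), 0.0
for _ in range(2000):                       # plain logistic regression
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.1 * (X.T @ (p - y)) / len(X)
    b -= 0.1 * float(np.mean(p - y))

def substitute(x):
    return int(1.0 / (1.0 + np.exp(-(x @ w + b))) >= 0.5)

# Phase 2: craft an adversarial sample against the substitute (an
# FGSM step) and rely on transferability to fool the black box.
x0 = np.array([0.8, 0.9])                   # classified 1 by the oracle
p0 = 1.0 / (1.0 + np.exp(-(x0 @ w + b)))
grad = (p0 - 1.0) * w                       # gradient of the loss of label 1
x_adv = x0 + 0.5 * np.sign(grad)
```

Because the substitute approximates the oracle's boundary, the sample crafted against the substitute also crosses the oracle's boundary, which is the transferability property the attack relies on.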
[15, 13] present a white-box evasion technique for an Android static analysis malware classifier. The features used were from the AndroidManifest.xml file, including permissions, suspicious API calls, activities, etc. The attack is performed iteratively in two steps, until a benign classification is achieved:
Compute the gradient of the white-box model with respect to the binary feature vector.
Find the element in the feature vector whose modification from 0 to 1 (i.e., only feature addition, not removal) would cause the maximum change in the benign score, and add this manifest feature to the adversarial example.
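These two steps can be sketched for a toy linear scoring model, where the gradient with respect to the input is simply the weight vector. The weights, feature vector, and benign threshold (score <= 0) are illustrative assumptions:

```python
import numpy as np

# Hypothetical linear "malware score" model over binary manifest
# features: higher score = more malicious. Weights are illustrative.
weights = np.array([2.0, -1.5, 0.5, -3.0, 1.0])

def malicious_score(x):
    return float(weights @ x)

def feature_addition_attack(x, max_steps=5):
    """Iteratively add (0 -> 1) the feature whose gradient most
    decreases the malicious score, never removing features."""
    x = x.copy()
    for _ in range(max_steps):
        if malicious_score(x) <= 0:        # "benign" classification reached
            break
        # For a linear model the gradient w.r.t. x is `weights`;
        # candidates are features still at 0 with negative weight.
        candidates = np.where((x == 0) & (weights < 0))[0]
        if len(candidates) == 0:
            break
        best = candidates[np.argmin(weights[candidates])]
        x[best] = 1                        # add the manifest feature
    return x

x0 = np.array([1, 0, 1, 0, 1])             # initially scored malicious
x_adv = feature_addition_attack(x0)
```

Note that the attack only ever sets features to 1, matching the feature-addition-only constraint of the original technique.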
 used API call uni-grams as static features, so the feature vector dimension equals the number of monitored API types. A generative adversarial network (GAN) was trained, where the discriminator simulates the malware classifier while the generator tries to generate adversarial samples that would be classified as benign by the discriminator, which uses labels from the black-box model.
2.2 RNN Adversarial Examples
Most sequence-based adversarial attacks take place in the natural language processing (NLP) domain, where changing a word can change the meaning of the sentence. For instance, changing “This is the best movie I have ever seen” to “This is the worst movie I have ever seen”.
 presented a white-box adversarial example attack against RNNs, demonstrated against an LSTM architecture, for sentiment classification of a movie reviews dataset, where the input is the review and the output is whether the review was positive or negative. The adversary iterates over the words in the review and modifies them as follows:
where $f(x)$ is the model's original label for the input $x$.
The Jacobian provides the direction in which one has to perturb each of the word embedding components in order to reduce the probability assigned to the current class, and thus change the class assigned to the sentence. However, the set of legitimate word embeddings is finite. Thus, one cannot set the word embedding coordinates to any real value. Instead, one finds the word $z$ in the dictionary such that the sign of the difference between the embeddings of $z$ and the original input word is closest to the sign of the Jacobian. This embedding takes the direction closest to the one indicated by the Jacobian as most impactful on the model's prediction. By iteratively applying this heuristic to a word sequence, one eventually finds an adversarial input sequence misclassified by the model. This approach of modifying a word by the gradient is common in many papers, as it achieves maximum classification impact with a minimal number of changes.
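A minimal sketch of this sign-matching heuristic, assuming a toy embedding dictionary and a given Jacobian sign direction (the vectors are illustrative, not real GloVe embeddings):

```python
import numpy as np

# Toy embedding dictionary (vectors are illustrative, not real GloVe).
emb = {
    "good":  np.array([ 0.9,  0.8, 0.1]),
    "great": np.array([ 1.0,  0.9, 0.2]),
    "bad":   np.array([-0.8, -0.9, 0.1]),
    "awful": np.array([-1.0, -0.8, 0.2]),
}

def replace_by_jacobian_sign(word, jac_sign):
    """Pick the dictionary word whose embedding difference from `word`
    has a sign pattern closest to the Jacobian-indicated direction."""
    best, best_dist = None, np.inf
    for cand, vec in emb.items():
        if cand == word:
            continue
        diff_sign = np.sign(vec - emb[word])
        dist = np.sum(diff_sign != jac_sign)   # Hamming distance on signs
        if dist < best_dist:
            best, best_dist = cand, dist
    return best

# Jacobian says: decrease dims 0 and 1 to lower the "positive" score.
adv_word = replace_by_jacobian_sign("good", np.array([-1.0, -1.0, 1.0]))
```

Here the heuristic replaces "good" with the sentiment-flipping word whose embedding moves most closely along the indicated direction.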
 proposes to use reinforcement learning to locate important words that could be deleted in sentiment classification.  and  generate adversarial sequences by inserting or replacing existing words with typos and synonyms.  attacks sentiment classification models in a black-box setting by inserting, deleting, or swapping characters to generate misspelled words mapped into the 'unknown' word in the NLP dictionary, developing scoring functions to find the most important words to modify. Beyond attacking text classifiers,  aims to fool reading comprehension systems by adding misleading sentences.
 uses a generative adversarial network (GAN) to craft natural adversarial examples.  and  attack seq2seq models using a word-level attack method, where the latter focuses on adding specific “malicious” keywords to the adversarial sentence.
presents an attack algorithm that exploits population-based gradient-free optimization via genetic algorithms.
 modifies text by either replacing a few characters to misspell words in the sentence, or replacing words with their nearest semantic-preserving neighbors.
Attacks have also been implemented in the cyber-security domain, mainly against malware classifiers based on API calls. The cyber-security domain is extremely relevant because in this domain adversaries exist: malware writers who want their malware to evade detection by next generation, machine learning based malware classifiers. Several attacks in the cyber-security domain have been presented:
proposed a generative RNN based approach to generate invalid APIs and insert them into the original API sequences. A substitute RNN is trained to fit the targeted RNN. Gumbel-Softmax, a one-hot continuous distribution estimator, was used to smooth the API symbols and deliver gradient information between the generative RNN and the substitute RNN. Null APIs were added, omitting them later to make the generated adversarial sequence shorter.
 presented a black box variant of the attack in , by creating a substitute model and attacking it using a similar method, and extended it to hybrid classifiers combining static and dynamic features and architectures.  presented a black box attack based on benign perturbation generated using a GAN that was trained on benign samples.
2.3 Defense Mechanisms against Non Sequence-Based Adversarial Attacks
To the best of our knowledge, there is currently no published and evaluated method to make a sequence-based RNN model resistant to adversarial sequences, beyond a brief mention of adversarial training as a defense method. This method has limitations:
It requires a dataset of adversarial examples to train on. Thus, it has limited generalization against novel adversarial attacks.
Our paper is the first to discuss defense methods for RNN classifiers, presenting four novel defense methods (in addition to evaluating adversarial training for comparison).
Previous work focusing on defense mechanisms against non sequence-based DNN attacks can be divided into two sub-categories:
Detection of adversarial examples.
Making the classifier robust against adversarial attack.
2.3.1 Detection of Adversarial Examples
Several methods have been suggested to detect whether a sample is an adversarial example.
leverages the fact that adversarial samples have a different distribution than normal samples. The statistical differences between them can be detected using a high-dimensional statistical test of maximum mean discrepancy or adding another class of adversarial examples to the classifier. In contrast to our work, this paper deals with non sequential input only.
 took a similar approach and augmented deep neural networks with a small “detector” subnetwork which is trained on the binary classification task of distinguishing genuine data from data containing adversarial perturbations.
detects adversarial examples using two new features: kernel density estimates in the subspace of the last hidden layer of the original classifier, and Bayesian neural network uncertainty estimates.
 has shown that most such techniques cannot handle a well-designed adversarial attack in the image recognition domain.
 uses feature squeezing to detect adversarial examples. This is done by reducing the search space available to an adversary by coalescing samples that correspond to many different feature vectors in the original space into a single sample, done by applying various image-specific dimensionality reduction transformations to the input features. If the original and squeezed inputs produce substantially different outputs from the model, the input is likely to be adversarial. In contrast to our work, this paper deals with feed forward networks (mostly CNN) only, in the computer vision domain.
2.3.2 Making the Classifier Robust against Adversarial Attacks
Instead of actively trying to detect adversarial examples, another approach is to passively make the classifier more robust against such attacks. Such methods avoid the false positives that might occur with the above-mentioned techniques.
Adversarial training was suggested in , which demonstrated injecting correctly labeled adversarial samples into the training set as a means of making the model robust.
Using an ensemble of DNNs as a classifier resistant to adversarial attacks of images was shown in . In contrast to our work, this paper deals with feed forward networks (mostly CNN) only, in the computer vision domain.  introduced Ensemble Adversarial Training, a technique that augments training data with perturbations transferred from other models.
 evaluates three defense methods: weight decay, ensemble of classifiers and Distillation for a dynamic analysis malware classifier based on a non-sequence based deep neural network.
 alternately trains both classifier and generator networks. The generator network generates an adversarial perturbation that can easily fool the classifier network by using the gradient of each image. Simultaneously, the classifier network is trained to correctly classify both the original and adversarial images generated by the generator. This procedure helps the classifier network become more robust to adversarial perturbations.
 trains a reformer network (which is an auto-encoder or a collection of auto-encoders) to differentiate between normal and adversarial examples by approximating the manifold of normal examples. When using a collection of auto-encoders, one reformer network is chosen at random at test time, thus strengthening the defense.
A GAN was trained to model the distribution of unperturbed images. At inference time, the closest output (which does not contain the adversarial changes) to the input image is found. This generated image is then fed to the classifier, and its prediction is used as the prediction of the original input. In contrast to our work, this paper deals with feed forward networks (mostly CNNs) only, in the computer vision domain.
We investigate five main approaches, divided into the following subgroups:
Detection of adversarial examples.
Making the classifier robust against adversarial attack.
Each method can be either attack-specific, meaning it requires adversarial examples generated by the attack algorithm the method is trying to defend against, or attack-agnostic, that is, it works against all types of adversarial examples, without the need to have a dataset of such examples.
3.1 RNN Adversarial Sequence Detection Methods
3.1.1 Sequence Squeezing
Feature squeezing was suggested in  as a method to detect adversarial examples. The rationale is reducing the search space available to an adversary by coalescing samples that correspond to many different feature vectors in the original space into a single sample.  uses various image-specific dimensionality reduction transformations to the input features, such as changing the image color depth (e.g., from 24 bit to 8 bit). If the original and squeezed inputs produce substantially different outputs from the model, the input is likely to be adversarial.
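The image-domain variant can be sketched as follows. The color-depth squeezer and the detection threshold are illustrative assumptions, shown here only to make the detection criterion concrete:

```python
import numpy as np

def reduce_color_depth(img, bits):
    """Squeeze: coalesce many pixel values (in [0, 1]) into the
    2**bits levels of a reduced color depth."""
    levels = 2 ** bits - 1
    return np.round(img * levels) / levels

def looks_adversarial(model, img, bits=3, threshold=0.5):
    """Flag the input if the model's outputs on the original and the
    squeezed input differ by more than `threshold` (L1 distance)."""
    diff = np.abs(model(img) - model(reduce_color_depth(img, bits)))
    return float(diff.sum()) > threshold
```

A legitimate input barely changes under squeezing, so the model's two outputs agree; an adversarial perturbation riding on fine-grained pixel values tends to be destroyed by the squeeze, producing a large output difference.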
When applying feature squeezing to discrete sequence input, such as API call traces for malware classification or words for sentiment analysis, transformations such as color depth reduction make no sense. We wanted a method that preserves the original rationale, while being generic enough to apply not just to the cyber-security domain, but to the NLP domain as well.
In order to do that, we developed our own method, which fits sequence input, termed sequence squeezing. Our method is shown in Algorithm 1.
We used GloVe , already shown to work effectively with API call traces , to generate the word embedding matrix (of size vocabulary size times embedding dimensionality) for each API call (= word), and then merged the closest word embeddings (using Euclidean distance, as cosine distance gave no significant improvement, as already mentioned in ). Every time we merge two embeddings, we replace them by their center of mass, which becomes the embedding of the merged group, to which each of the merged embeddings is mapped. This use of the center of mass preserves the overall semantics of the group, and allows us to keep merging the most semantically similar words into groups, including groups that were already merged before. After the merging is done, we replace each merged group's center of mass embedding by the Euclidean-closest original word merged into it, so we can use the original classifier, which was trained on the original embeddings. The rationale is that we want to choose the API or word with the closest semantic meaning to the merged group's members, represented by the merged group's center of mass embedding, in order to maintain the performance of the original classifier.
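The core greedy merge loop described above might look like the following sketch (a naive quadratic implementation over a toy embedding dictionary; Algorithm 1's exact bookkeeping may differ):

```python
import numpy as np

def squeeze_vocabulary(embeddings, target_size):
    """Greedily merge the two Euclidean-closest embedding groups,
    replacing each merged pair by its center of mass, until only
    `target_size` groups remain. Returns a word -> representative-word map.

    `embeddings` is a dict word -> vector (e.g. trained with GloVe).
    """
    groups = [([w], np.array(v, dtype=float)) for w, v in embeddings.items()]
    while len(groups) > target_size:
        # find the closest pair of group centers
        best, best_d = (0, 1), np.inf
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = np.linalg.norm(groups[i][1] - groups[j][1])
                if d < best_d:
                    best_d, best = d, (i, j)
        i, j = best
        words = groups[i][0] + groups[j][0]
        ni, nj = len(groups[i][0]), len(groups[j][0])
        center = (ni * groups[i][1] + nj * groups[j][1]) / (ni + nj)
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.append((words, center))
    # map every word to the member closest to the group's center of mass
    mapping = {}
    for words, center in groups:
        rep = min(words, key=lambda w: np.linalg.norm(embeddings[w] - center))
        for w in words:
            mapping[w] = rep
    return mapping
```

For instance, with toy embeddings where `GetUserNameA` and `GetUserNameW` are near each other and far from `WriteFile`, squeezing to two groups maps both `GetUserName*` variants to a single representative API, mirroring the merged groups observed in Section 4.2.1.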
This defense method does not require knowledge about the adversarial examples in order to mitigate them. Thus, it is attack-agnostic, by the definition in Section 3.
3.1.2 Statistical Sequence Irregularities Detection (a.k.a. Adversarial Signatures)
This method uses the statistical irregularities of adversarial sequences in order to detect them as adversarial. Adversarial examples are in fact out-of-distribution samples. Since the target classifier was not trained on samples from this distribution, generalization on adversarial examples is hard. However, those statistical differences can also differentiate between adversarial and non-adversarial samples.
 leverages the fact that adversarial samples have a different distribution than normal samples for non sequential input. The statistical differences between them can be detected using a high-dimensional statistical test of maximum mean discrepancy. The adversarial detection rate of this approach is around 80%.
In this paper, we extend the method of detecting adversarial examples by their statistical irregularities in two ways:
We extend the proposed method to sequential input.
We improve the detection rate (see Section 4.2) by leveraging the conditional probabilities between the sequence elements (API calls or words).
In order to do that, we start from the observation that in API call traces, as well as in natural language sentences, there is a strong dependence between the sequence elements. The reason is that an API call is rarely independent; in order to produce usable business logic, a sequence of API calls, each relying on the former API calls' output, must be implemented. For instance, the API call closesocket() would appear only after the API call socket(). The same is true for sentences: an adverb would follow a verb, etc. Since for most state-of-the-art adversarial examples, only a small fraction of API calls is added to the original, malicious trace, the malicious context (the original surrounding API calls of the original business logic) remains. Thus, we researched the probability of a certain API call sequence appearing, generating “signatures” of API call sequences that are likely to appear in adversarial sequences only, since they contain API calls (the adversarially added API calls) unrelated to their context.
We decided to analyze the statistical irregularities in n-grams of consecutive API calls. The trade-off when choosing n is to have a long enough n-gram to capture the irregularity in the proper context (surrounding API calls), while remaining short enough to allow generalization to other adversarial examples. For each unique n-gram of API calls (the features used in [33, 19]), we calculate the adversarial n-gram probability of the n-gram of monitored API calls, drawn from the vocabulary of available features. Here those features are all of the API calls recorded by the classifier.
The adversarial n-gram probability is the ratio of the n-gram's occurrences in the adversarial examples available to the defender to its occurrences in both the adversarial examples and the target (that is, benign) class samples in the training set. Note that this ratio is valid regardless of the sizes of the two sets, and there is no assumption on the ratio between them. The reason we don't include malicious samples is that we want statistical irregularities relative to the target class, which is the benign class in this dataset. Also note that we only look at the appearance of the signatures in the target class and not in other classes (here: we look only in the benign class, not the malicious class). This is because adversarial examples would naturally contain signatures available in the source class (the label of the sample to be perturbed), here: the malicious class.
We say that the n-gram of monitored API calls is an adversarial signature if the adversarial n-gram probability of this n-gram is larger than a threshold that is determined by the trade-off between the adversarial example detection rate and the number of target class samples falsely detected as adversarial; the higher the threshold, the lower both would be.
We say that a sample is an adversarial example if it contains more than a threshold number of adversarial signatures. The more irregular n-grams detected, the more likely the sequence is to be adversarial.
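The signature extraction and detection described above can be sketched as follows. The traces and thresholds are toy assumptions (a probability threshold of 1.0 keeps only n-grams that appear in adversarial traces exclusively):

```python
from collections import Counter

def ngrams(seq, n=4):
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def build_signatures(adv_traces, benign_traces, n=4, prob_threshold=1.0):
    """An n-gram is an adversarial signature if its adversarial n-gram
    probability  count_adv / (count_adv + count_benign)  reaches the
    threshold (1.0 = appears in adversarial traces only)."""
    adv_counts = Counter(g for t in adv_traces for g in ngrams(t, n))
    ben_counts = Counter(g for t in benign_traces for g in ngrams(t, n))
    sigs = set()
    for g, ca in adv_counts.items():
        p = ca / (ca + ben_counts[g])
        if p >= prob_threshold:
            sigs.add(g)
    return sigs

def is_adversarial(trace, sigs, n=4, min_hits=1):
    hits = sum(1 for g in ngrams(trace, n) if g in sigs)
    return hits >= min_hits

# Toy data: the adversarial trace is the benign one with an
# out-of-context API (GetTickCount) inserted into it.
benign = [["socket", "connect", "send", "recv", "closesocket"]]
adv = [["socket", "connect", "GetTickCount", "send", "recv", "closesocket"]]
sigs = build_signatures(adv, benign)
```

The inserted API breaks the expected socket()-to-closesocket() context, so every 4-gram covering it becomes a signature, and the adversarial trace is flagged while the benign one is not.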
This defense method requires a dataset of adversarial examples during its setup in order to make it robust against such examples, making it attack-specific, by the definition in Section 3. Note that while finding “non-adversarial signatures” using this method is possible, it is more problematic, especially when the number of possible n-grams is very big. Other methods presented in this paper, such as Defense SeqGAN (Section 3.2.3), implement this approach with less overhead.
3.2 Making an RNN Classifier Robust Against Adversarial Examples
3.2.1 Adversarial Training
Adversarial training is the method of adding adversarial examples, with their actual label (as opposed to the target class label), to the training set of the classifier. The rationale is that since adversarial examples are out-of-distribution samples, inserting them into the training set would cause the classifier to learn the entire sample distribution, including the adversarial examples.
Unlike the other methods mentioned in this paper, this method has already been tried for sequence-based input in the NLP domain ([1, 25]), with mixed results regarding the robustness it provides against adversarial attacks. We evaluate this method as well, to understand whether the cyber-security domain, with a much smaller dictionary (less than 400 API call types monitored in , as opposed to millions of possible words in NLP domains), would yield different results.
This defense method requires a dataset of adversarial examples during the model training in order to make it robust against such examples, making it attack-specific, by the definition in Section 3.
3.2.2 RNN Ensemble
An ensemble of models is a detour from the basic premise of deep neural networks (including recurrent neural networks): train a single classifier on all the data to get the best performance, while over-fitting is handled using different mechanisms, such as dropout. However, an ensemble of models can be used to mitigate adversarial examples. Since an adversarial example is crafted to bypass a classifier looking at the entire input, an ensemble of models, each focusing on a subset of the input features, might be more robust, since the models trained on the subsets that don't contain the perturbations would classify the adversarial example correctly.
We evaluate four types of models:
Regular models - Each model is trained on the entire training set and the entire set of input features. The differences are due to the training method: the initial (random) weights and the training optimizer can converge to a different function in each model.
Bagging models - Bagging  is used on the training data. The term bagging is derived from bootstrap aggregation, and it consists of drawing m samples with replacement from the training data set of m data points. Each of these new data sets is called a bootstrap replicate. On average, each of them contains 63.2% of the training data, with many data points repeated in the bootstrap replicates. A different bootstrap replicate is used as training data for each classifier in the ensemble. This means each model is trained on a random subset of the training set samples. Thus, each model learns a slightly different data distribution, (hopefully) making it more robust to adversarial examples generated to fit the data distribution of the entire dataset.
Adversarial models - These models are trained on both the training set and adversarial examples generated against a regular model. Thus, they are actually regular models trained using adversarial training (Section 3.2.1).
Offset models - Since the classifier's input is sequential, we can train each model on a subset of the input sequence, starting from a different offset. That is, if our model is trained over sequences of 200 API calls, we can split the model into 10 sub-models: one on API calls 1..100, the second on API calls 10..110, and the tenth on API calls 100..200. The idea is that models which classify an API trace of an adversarial example in a sub-sequence without a perturbed part (that is, a purely malicious trace) would classify it correctly, while the perturbed parts would be divided and analyzed separately, making it easier to notice that the trace is not benign after all.
The decision of the ensemble was calculated in two possible methods:
Hard voting - Every model predicts its own classification, and the final classification is selected by majority voting.
Soft voting - Every model calculates its own confidence score. The average confidence score is used to determine the classification. Soft voting gives “confident” models more power than hard voting.
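The two voting schemes can be sketched as follows, assuming each model outputs a maliciousness confidence in [0, 1] (the example scores are illustrative):

```python
import numpy as np

def hard_vote(scores):
    """Majority vote over per-model binary predictions (threshold 0.5)."""
    preds = [int(s >= 0.5) for s in scores]
    return int(sum(preds) > len(preds) / 2)

def soft_vote(scores):
    """Average confidence: 'confident' models pull the mean harder."""
    return int(float(np.mean(scores)) >= 0.5)

# One very confident model vs. two borderline ones: the schemes disagree.
scores = [0.9, 0.45, 0.4]
```

With these scores, hard voting returns benign (one malicious vote out of three) while soft voting returns malicious, because the single confident model dominates the average. This is exactly the extra power soft voting grants to confident models.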
This defense method does not require knowledge about the adversarial examples during its setup in order to mitigate them, making it attack-agnostic, with the exception of adversarial models, which are attack-specific, by the definition in Section 3.
3.2.3 Defense SeqGAN
 presented Defense-GAN: a GAN is trained to model the distribution of unperturbed images. At inference time, the closest output (which does not contain the adversarial changes) to the input image is found. This generated image is then fed to the classifier, and its prediction is used as the prediction of the original input.
GANs were originally defined for real-valued data only, while the API calls of a malware classifier are discrete symbols. For instance, the small perturbations suggested in  are not applicable to discrete API calls: you can't change WriteFile() to WriteFile()+0.001 in order to estimate the gradient and perturb the adversarial example in the right direction; you need to modify it to an entirely different API. The discrete outputs from the generative model make it difficult to pass the gradient update from the discriminative model to the generative model. We therefore tried several GAN architectures designed to produce output sequences:
In the SeqGAN  implementation, a discriminative model is trained to minimize the binary classification loss between real benign API call sequences and generated ones. Beyond the pre-training procedure that follows the MLE (maximum likelihood estimation) metric, the generator is modeled as a stochastic policy in reinforcement learning (RL), bypassing the generator differentiation problem by directly performing a gradient policy update. Given the API sequence generated so far and the next API to be sampled from the model, the RL algorithm, REINFORCE, optimizes the GAN objective:
The RL reward signal comes from the GAN discriminator judged on a complete sequence and is passed back to the intermediate state-action steps using Monte Carlo search, to compute the Q-value for generating each token, for the sake of variance reduction.
 proposes a method that optimizes the Maximum Mean Discrepancy (MMD) loss, which is the reconstructed feature distance, by adding a reconstruction term in the objective.
The Gumbel-Softmax trick is a reparametrization trick used to replace the multinomial stochastic sampling in text generation. It relies on the fact that sampling from a categorical distribution with class probabilities $p_i$ is equivalent to computing $\arg\max_i(\log p_i + g_i)$, where $g_i$ is drawn from a standard Gumbel distribution, and replaces the non-differentiable argmax with a temperature-controlled softmax. Since this process is differentiable, back-propagation can be directly applied to optimize the GAN objective.
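A minimal sketch of the Gumbel-Softmax relaxation for sampling one API symbol (numpy only; the temperature value and the 3-API vocabulary are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0):
    """Differentiable surrogate for sampling one symbol from a
    categorical distribution: add Gumbel(0, 1) noise to the logits,
    then apply a temperature-controlled softmax instead of argmax."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1)
    y = (logits + g) / tau
    y = np.exp(y - y.max())          # numerically stable softmax
    return y / y.sum()

# "Soft" one-hot over a 3-API vocabulary; as tau -> 0 the output
# approaches the hard one-hot sample argmax(log p + g).
probs = gumbel_softmax(np.log(np.array([0.7, 0.2, 0.1])), tau=0.5)
```

Because the whole pipeline is built from differentiable operations, gradients can flow from the discriminator back through the sampled (soft) API symbols to the generator.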
The basic structure of MaliGAN  follows that of the SeqGAN. To stabilize the training and alleviate the gradient saturating problem, MaliGAN rescales the reward in a batch.
For each GAN type, we trained two GANs: a “benign GAN” to produce API call sequences drawn from the benign distribution used to train the GAN, and a “malicious GAN” for malicious API call sequences. For a given input sequence, benign API call sequences are generated by the “benign GAN” and malicious API call sequences are generated by the “malicious GAN”. We calculate the distance between the input and each of the generated sequences, choosing the nearest sequence to the original input sequence. We then classify the sequence using one of two approaches:
Nearest Neighbor Classification: Give the input sequence the label of the nearest sequence (so we don’t use the classifier at all).
Defense SeqGAN Classification: Return the classifier prediction on the nearest sequence to the input sequence.
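Both classification modes can be sketched as follows. The "GANs" are stubbed here as fixed lists of generated sequences, and edit distance is assumed as the sequence distance metric (the actual metric used may differ):

```python
def levenshtein(a, b):
    """Edit distance between two API call sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def defense_seqgan(trace, benign_gen, malicious_gen, classifier=None):
    """Pick the generated sequence nearest to the input; either return its
    source label (nearest-neighbor mode) or the classifier's prediction
    on it (defense-SeqGAN mode)."""
    candidates = [(s, 0) for s in benign_gen] + [(s, 1) for s in malicious_gen]
    nearest, label = min(candidates, key=lambda c: levenshtein(trace, c[0]))
    if classifier is None:
        return label                      # nearest-neighbor classification
    return classifier(nearest)            # defense-SeqGAN classification

# Toy "GAN outputs" and a malicious trace padded with an inserted no-op API.
benign_gen = [["socket", "connect", "send"]]
malicious_gen = [["OpenProcess", "WriteProcessMemory", "CreateRemoteThread"]]
trace = ["OpenProcess", "GetTickCount", "WriteProcessMemory",
         "CreateRemoteThread"]
```

The perturbed trace is nearest to a malicious generated sequence, so the inserted API is effectively stripped before classification, which is how this defense removes adversarial perturbations.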
This defense method does not require knowledge about the adversarial examples during its setup to mitigate them, making this defense method attack-agnostic, by the definition in Section 3.
4 Experimental Evaluation
4.1 Dataset and Target Malware Classifiers
We use the same dataset used in , because of its size: it contains 500,000 files (250,000 benign samples and 250,000 malware samples), faithfully representing the malware families in the wild and allowing us a proper setting for attacks and defense methods comparison. Details are shown in Appendix A. Each sample was run in Cuckoo Sandbox, a malware analysis system, for two minutes per sample. The API call sequences generated by the inspected code during its execution were extracted from the JSON file generated by Cuckoo Sandbox. The extracted API call sequences are used as the malware classifier’s features. The samples were run on Windows 8.1 OS, since most malware targets the Windows OS. Anti-sandbox malware were filtered to prevent dataset contamination (see Appendix A). After filtering, the final training set size is 360,000 samples, 36,000 of which serve as the validation set. The test set size is 36,000 samples. All sets are balanced between malicious and benign samples.
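Extracting the API call sequences from a Cuckoo Sandbox report might look like the following sketch, assuming the Cuckoo 2.x JSON report layout (behavior.processes[].calls[].api); the exact fields used by the paper's pipeline are not specified:

```python
import json

def extract_api_calls(report_path):
    """Flatten a Cuckoo Sandbox JSON report into per-process lists of
    API call names, in recorded order."""
    with open(report_path) as f:
        report = json.load(f)
    traces = []
    for proc in report.get("behavior", {}).get("processes", []):
        traces.append([call["api"] for call in proc.get("calls", [])])
    return traces
```

Each per-process trace can then be split into fixed-size windows and fed to the classifiers described below.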
There are no commercial trial versions or open source API call based deep learning intrusion detection systems available (such commercial products target enterprises and involve supervised server installation). Dynamic models are also not available in VirusTotal. Therefore, we used the malware classifiers shown in Appendix B. Many classifier types are covered, allowing us to evaluate the attack's effectiveness against many types of classifiers.
The API call sequences are split into fixed-size windows of API calls, and each window is classified in turn. Thus, the input of all of the classifiers is a vector of API call types (larger window sizes didn't improve the classifier's accuracy), with 314 possible values (those monitored by Cuckoo Sandbox). The implementation and hyperparameters (loss function, dropout, activation functions, etc.) of the target classifiers are described in Appendix B, as are the malware classifiers' performance and architecture. On the test set, all DNNs have an accuracy higher than 95%, and all other classifiers have an accuracy higher than 90%. The false positive rate of all of the classifiers varied between 0.5-1%.
4.2 Defense Methods Performance
The different defense methods mentioned in Section 3 are measured using two factors:
The adversarial recall is the fraction of adversarial sequences generated by the attack which were detected by the defense method. Notice that we consider only adversarial sequences which were classified as benign by the target classifier (99.99% of the adversarial sequences for LSTM in ), since those are the only problematic samples. The better our defense performs, the higher the adversarial recall, and the harder it is to generate an adversarial example that evades our classifier.
The classifiers' performance was measured using the accuracy ratio, which applies equal weight to both FP and FN (unlike precision or recall), thereby providing an unbiased overall performance indicator:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
where: TP are true positives (malicious samples classified as malicious by the black-box classifier), TN are true negatives, FP stands for false positives (benign samples classified as malicious), and FN are false negatives.
The higher our defense performance, the higher the performance of the classifier using it.
The performance of the attack against the various defense methods, and the performance of the classifier using each of them, are shown in Table 1 for the LSTM classifier (the other classifiers behave similarly).
| Defense Method | Adversarial Recall [%] | Classifier Accuracy [%] |
|---|---|---|
| RNN Regular Ensemble | 68.26 | 91.97 |
| RNN Offsets Ensemble | 54.36 | 92.67 |
| RNN Bagging Ensemble | 68.26 | 92.36 |
| RNN Bagging Offsets Ensemble | 54.36 | 92.90 |
| RNN Adversarial Ensemble | 68.26 | 92.36 |
| RNN Adversarial Offsets Ensemble | 54.36 | 92.90 |
We see that sequence squeezing provides an attack-agnostic defense method. Additional details about the implementation of each defense method are provided in the following subsections.
4.2.1 Sequence Squeezing
We used Stanford's GloVe implementation. The vocabulary used by our malware classifiers contains all the API calls monitored by Cuckoo Sandbox, as documented in the Cuckoo Sandbox repository. Running Algorithm 1 on our training set of API call traces (Section 4.1) produced an interesting squeezed vocabulary. The squeezed groups maintained the "API contextual semantics", merging, for instance, variants of the same API, e.g., GetUserNameA() and GetUserNameW(). The sequence squeezing we used is shown in Appendix C. The squeezed vocabulary size was chosen to both maintain the "API contextual semantics" (not merging unrelated API calls) on the one hand, and sufficiently narrow the adversarial space on the other. Other hyperparameters were less effective.
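The merging behavior described above can be illustrated with a toy sketch. This is not the paper's Algorithm 1: the embedding vectors below are hypothetical stand-ins (the paper uses GloVe vectors trained on API call traces), and the greedy cosine-similarity merge is an assumed simplification of the squeezing procedure.

```python
# Toy sketch of vocabulary squeezing: merge API calls whose embeddings are
# close, so that e.g. GetUserNameA() and GetUserNameW() share a representative.
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def build_squeeze_map(embeddings: Dict[str, List[float]],
                      threshold: float = 0.95) -> Dict[str, str]:
    """Greedily assign each API call to the first representative whose
    embedding is within the cosine-similarity threshold."""
    squeeze_map: Dict[str, str] = {}
    representatives: List[str] = []
    for api, vec in embeddings.items():
        for rep in representatives:
            if cosine(vec, embeddings[rep]) >= threshold:
                squeeze_map[api] = rep
                break
        else:
            representatives.append(api)
            squeeze_map[api] = api
    return squeeze_map


def squeeze_sequence(seq: List[str], squeeze_map: Dict[str, str]) -> List[str]:
    """Replace each API call in a trace with its squeezed representative."""
    return [squeeze_map.get(api, api) for api in seq]
```

At prediction time, the classifier can then be run on both the original and the squeezed sequence, matching the "running the classifier twice per prediction" cost noted later.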
4.2.2 Adversarial Signatures
We chose n=4, that is, we used 4-grams of API calls for the "adversarial signatures". Shorter API call sub-sequences caused more "adversarial false positives", that is, identifying regular samples as adversarial, while longer API call sub-sequences were too specific. We looked for API sub-sequences which appear only in adversarial examples, and a single matching adversarial signature is enough to classify a sample as an adversarial example. Other hyperparameters were less effective.
4.2.3 Adversarial Training
We ran the adversarial attack 10 times, each time on a different subset of 2,000 malicious hold-out set samples (which were not part of the training, validation, or test sets). Eventually, 18,000 malicious adversarial examples replaced 18,000 (two thirds of the) malicious samples in the original training set. Other sizes resulted in lower accuracy.
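The training-set construction described above can be sketched as a simple replacement step. The function name and signature are hypothetical; the attack that produces the adversarial variants is assumed to exist elsewhere.

```python
# Sketch: build an adversarial training set by replacing n_replace malicious
# samples with adversarial example variants, keeping the benign set untouched.
import random
from typing import List


def build_adversarial_training_set(malicious: List, benign: List,
                                   adversarial: List, n_replace: int,
                                   seed: int = 0) -> List:
    """Replace n_replace of the malicious samples with adversarial examples."""
    rng = random.Random(seed)
    kept = rng.sample(malicious, len(malicious) - n_replace)
    return kept + adversarial[:n_replace] + benign
```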
4.2.4 RNN Ensemble
We used six variants of ensembles of 9 models each:
Regular ensemble - Each model is trained on the entire dataset.
Offsets ensemble - The first model is trained on API calls 1..140, the second model on API calls 11..150, and so on; the ninth model is trained on API calls 91..230.
Bagging ensemble - Each model is trained on a random subset of the dataset, as discussed in Section 3.2.2.
Bagging offsets ensemble - A combination of the bagging and offsets ensembles: each model is trained not only on a different API call offset (as in the offsets ensemble described above), but also on a random subset of the training set, as in a bagging ensemble.
Adversarial ensemble - Each model has 14,000 (out of 27,000) malicious samples replaced with their adversarial example variants.
Adversarial offsets ensemble - A combination of the adversarial and offsets ensembles: the adversarial examples' API traces used for training also start at an offset.
The ensemble decision was made using soft voting (Section 3.2.2), since it outperformed hard voting in all of our tests.
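The soft-voting decision can be sketched as follows: average the models' predicted malicious probabilities and threshold the mean (hard voting would instead take a majority over the binary decisions). The 0.5 threshold is an assumption for illustration.

```python
# Sketch of soft voting over an ensemble of probabilistic classifiers.
from typing import Callable, List, Sequence


def soft_vote(models: Sequence[Callable[[List[int]], float]],
              sample: List[int],
              threshold: float = 0.5) -> bool:
    """Average the malicious probability over all models; flag if >= threshold."""
    avg = sum(m(sample) for m in models) / len(models)
    return avg >= threshold
```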
4.2.5 Defense SeqGAN
To implement the benign perturbation GAN, we tested several GAN types using Texygen with its default parameters. We used MLE training as the pretraining process for all baseline models except GSGAN, which requires no pretraining. In pretraining, we first train the generator for 80 epochs, and then train the discriminator for 80 epochs. The adversarial training comes next. In each adversarial epoch, we update the generator once and then update the discriminator for 15 mini-batch gradients. We generated a window of 140 API calls, each with 314 possible API call types, in each iteration. As mentioned in Section 3.2.3, we tested several GAN implementations with discrete sequence output: SeqGAN, TextGAN, GSGAN, and MaliGAN. We trained our "benign GAN" using a benign hold-out set (3,000 sequences). Next, we generated sequences with the "benign GAN", using an additional benign hold-out set (3,000 sequences) as a test set. We used the same procedure to train our "malicious GAN" and generate additional sequences from it.
SeqGAN outperformed all other models by achieving the lowest average minimal distance between the 400 generated sequences and the test set vectors, meaning the generated sequences were the closest to the requested distribution; thus we used it, naming our method accordingly.
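The selection metric above can be sketched as follows: for each generated sequence, take its minimal distance to any test-set sequence, then average over all generated sequences; the GAN with the lowest value best matches the target distribution. Using Hamming distance over equal-length API windows is an assumption here; this excerpt does not fix the distance measure.

```python
# Sketch: average minimal distance between generated sequences and a test set.
from typing import List, Sequence


def hamming(u: Sequence[int], v: Sequence[int]) -> int:
    """Position-wise mismatch count between two equal-length sequences."""
    return sum(a != b for a, b in zip(u, v))


def avg_min_distance(generated: List[Sequence[int]],
                     test_set: List[Sequence[int]]) -> float:
    """Mean, over generated sequences, of the distance to the nearest test vector."""
    return sum(min(hamming(g, t) for t in test_set)
               for g in generated) / len(generated)
```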
Table 1 reveals several interesting findings:
Sequence squeezing provides decent robustness against adversarial attacks in an attack-agnostic manner, and also has an added benefit: the classifier's performance using it is higher than without it. The reason is that some malicious non-adversarial samples, not caught by the original classifier, are caught by the sequence squeezing defense method. This insight requires further analysis and will be part of our future work.
Adversarial signatures provide an attack-specific defense method with good adversarial recall.
Adversarial training under-performs: it is attack-specific, and its adversarial recall is low compared to all of the other methods.
RNN ensembles provide some adversarial recall. Interestingly, the bagging and adversarial ensembles provide no additional robustness, and the adversarial recall improves only when offsets are used.
Defense SeqGAN provides high adversarial recall. However, the GAN's inability to capture the complex input sequence distribution significantly reduces the classifier's performance.
Overall, it seems that sequence squeezing provides decent adversarial recall, improved classifier accuracy, and an attack-agnostic approach, at the relatively low cost of running the classifier twice per prediction.
In this paper, we provide four novel defense methods against RNN adversarial examples. To the best of our knowledge, this is the first paper to focus on this challenge, which is different from non-sequence based adversarial examples defense methods.
Our future work will focus on three directions:
Investigating additional defense methods and evaluating the performance of several defense methods combined.
Implementing and mitigating attacks designed to bypass the defense methods discussed in this paper, as described in the image recognition domain (against, e.g., statistical-irregularity-based adversarial example detection, feature squeezing, and ensemble adversarial training).
Extending our work to other domains with sequential input, such as NLP.
-  Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani B. Srivastava, and Kai-Wei Chang. Generating natural language adversarial examples. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2890–2896. Association for Computational Linguistics, 2018.
-  Hyrum S. Anderson, Anant Kharkar, Bobby Filar, David Evans, and Phil Roth. Learning to evade static PE machine learning malware models via reinforcement learning. CoRR, abs/1801.08917, 2018.
-  Hyrum S. Anderson, Anant Kharkar, Bobby Filar, and Phil Roth. Evading machine learning malware detection. In Black Hat US, 2017.
-  Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Machine Learning and Knowledge Discovery in Databases, pages 387–402. Springer Berlin Heidelberg, 2013.
-  Leo Breiman. Bagging predictors. Mach. Learn., 24(2):123–140, August 1996.
-  Nicholas Carlini and David Wagner. Adversarial examples are not easily detected. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security - AISec 2017. ACM Press, 2017.
-  Tong Che, Yanran Li, Ruixiang Zhang, R. Devon Hjelm, Wenjie Li, Yangqiu Song, and Yoshua Bengio. Maximum-likelihood augmented discrete generative adversarial networks. CoRR, abs/1702.07983, 2017.
-  Minhao Cheng, Jinfeng Yi, Huan Zhang, Pin-Yu Chen, and Cho-Jui Hsieh. Seq2sick: Evaluating the robustness of sequence-to-sequence models with adversarial examples. CoRR, abs/1803.01128, 2018.
-  Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. Hotflip: White-box adversarial examples for text classification. In Iryna Gurevych and Yusuke Miyao, editors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, pages 31–36. Association for Computational Linguistics, 2018.
-  R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner. Detecting Adversarial Samples from Artifacts. ArXiv e-prints, March 2017.
-  J. Gao, J. Lanchantin, M. L. Soffa, and Y. Qi. Black-box generation of adversarial text sequences to evade deep learning classifiers. In 2018 IEEE Security and Privacy Workshops (SPW), pages 50–56, May 2018.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
-  K. Grosse, N. Papernot, P. Manoharan, M. Backes, and P. McDaniel. Adversarial Perturbations Against Deep Neural Networks for Malware Classification. ArXiv e-prints, June 2016.
-  Kathrin Grosse, Praveen Manoharan, Nicolas Papernot, Michael Backes, and Patrick D. McDaniel. On the (statistical) detection of adversarial examples. ArXiv e-prints, abs/1702.06280, 2017.
-  Kathrin Grosse, Nicolas Papernot, Praveen Manoharan, Michael Backes, and Patrick McDaniel. Adversarial examples for malware detection. In Computer Security – ESORICS 2017, pages 62–79. Springer International Publishing, 2017.
-  Warren He, James Wei, Xinyun Chen, Nicholas Carlini, and Dawn Song. Adversarial example defense: Ensembles of weak defenses are not strong. In William Enck and Collin Mulliner, editors, 11th USENIX Workshop on Offensive Technologies, WOOT 2017, Vancouver, BC, Canada, August 14-15, 2017. USENIX Association, 2017.
-  J. Hendrik Metzen, T. Genewein, V. Fischer, and B. Bischoff. On Detecting Adversarial Perturbations. 2017.
-  Jordan Henkel, Shuvendu K. Lahiri, Ben Liblit, and Thomas W. Reps. Code vectors: understanding programs through embedded abstracted symbolic traces. In Gary T. Leavens, Alessandro Garcia, and Corina S. Pasareanu, editors, Proceedings of the 2018 ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/SIGSOFT FSE 2018, Lake Buena Vista, FL, USA, November 04-09, 2018, pages 163–174. ACM, 2018.
-  Weiwei Hu and Ying Tan. Black-box attacks against RNN based malware detection algorithms. ArXiv e-prints, abs/1705.08131, 2017.
-  Weiwei Hu and Ying Tan. Generating adversarial malware examples for black-box attacks based on GAN. ArXiv e-prints, abs/1702.05983, 2017.
-  Wenyi Huang and Jack W. Stokes. MtNet: A multi-task neural network for dynamic malware classification. In Detection of Intrusions and Malware, and Vulnerability Assessment, pages 399–418. Springer International Publishing, 2016.
-  Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension systems. In Martha Palmer, Rebecca Hwa, and Sebastian Riedel, editors, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 2021–2031. Association for Computational Linguistics, 2017.
-  Matt J. Kusner and José Miguel Hernández-Lobato. GANS for sequences of discrete elements with the gumbel-softmax distribution. CoRR, abs/1611.04051, 2016.
-  Hyeungill Lee, Sungyeob Han, and Jungwoo Lee. Generative adversarial trainer: Defense to adversarial perturbations with GAN. CoRR, abs/1705.03387, 2017.
-  Jinfeng Li, Shouling Ji, Tianyu Du, Bo Li, and Ting Wang. Textbugger: Generating adversarial text against real-world applications. CoRR, abs/1812.05271, 2018.
-  Jiwei Li, Will Monroe, and Dan Jurafsky. Understanding neural networks through representation erasure. CoRR, abs/1612.08220, 2016.
-  Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, and Wenchang Shi. Deep text classification can be fooled. In Jérôme Lang, editor, Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden., pages 4208–4215. ijcai.org, 2018.
-  Dongyu Meng and Hao Chen. Magnet: A two-pronged defense against adversarial examples. In Bhavani M. Thuraisingham, David Evans, Tal Malkin, and Dongyan Xu, editors, Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS 2017, Dallas, TX, USA, October 30 - November 03, 2017, pages 135–147. ACM, 2017.
-  Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, ASIA CCS ’17, pages 506–519, New York, NY, USA, 2017. ACM.
-  Nicolas Papernot, Patrick McDaniel, Ananthram Swami, and Richard Harang. Crafting adversarial input sequences for recurrent neural networks. In MILCOM 2016 - 2016 IEEE Military Communications Conference. IEEE, nov 2016.
-  Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
-  Ishai Rosenberg, Asaf Shabtai, Yuval Elovici, and Lior Rokach. Low resource black-box end-to-end attack against state of the art API call based malware classifiers. CoRR, abs/1804.08778, 2018.
-  Ishai Rosenberg, Asaf Shabtai, Lior Rokach, and Yuval Elovici. Generic black-box end-to-end attack against state of the art API call based malware classifiers. In Michael Bailey, Thorsten Holz, Manolis Stamatogiannakis, and Sotiris Ioannidis, editors, Research in Attacks, Intrusions, and Defenses - 21st International Symposium, RAID 2018, Heraklion, Crete, Greece, September 10-12, 2018, Proceedings, volume 11050 of Lecture Notes in Computer Science, pages 490–510. Springer, 2018.
-  Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defense-gan: Protecting classifiers against adversarial attacks using generative models. CoRR, abs/1805.06605, 2018.
-  Suranjana Samanta and Sameep Mehta. Towards crafting text adversarial samples. CoRR, abs/1707.02812, 2017.
-  Jack W. Stokes, De Wang, Mady Marinescu, Marc Marino, and Brian Bussone. Attack and defense of dynamic analysis-based, adversarial neural malware classification models. CoRR, abs/1712.05919, 2017.
-  T. Strauss, M. Hanselmann, A. Junginger, and H. Ulmer. Ensemble Methods as a Defense to Adversarial Perturbations Against Deep Neural Networks. ArXiv e-prints, September 2017.
-  Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. volume abs/1312.6199, 2014.
-  Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Dan Boneh, and Patrick D. McDaniel. Ensemble adversarial training: Attacks and defenses. CoRR, abs/1705.07204, 2017.
-  Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. In 25th Annual Network and Distributed System Security Symposium, NDSS 2018, San Diego, California, USA, February 18-21, 2018. The Internet Society, 2018.
-  Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial nets with policy gradient. In Satinder P. Singh and Shaul Markovitch, editors, Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA., pages 2852–2858. AAAI Press, 2017.
-  Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. Adversarial feature matching for text generation. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 4006–4015. PMLR, 2017.
-  Zhengli Zhao, Dheeru Dua, and Sameer Singh. Generating natural adversarial examples. In International Conference on Learning Representations (ICLR), 2018.
-  Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. In Kevyn Collins-Thompson, Qiaozhu Mei, Brian D. Davison, Yiqun Liu, and Emine Yilmaz, editors, The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR 2018, Ann Arbor, MI, USA, July 08-12, 2018, pages 1097–1100. ACM, 2018.
Appendix A: Tested Dataset
We used identical implementation details (e.g., dataset, classifiers' hyperparameters, etc.) as , so the attacks can be compared. Those details are included here for the reader's convenience.
The dataset being used is large and includes the latest malware variants, such as the Cerber and Locky ransomware families. Each malware type (ransomware, worms, backdoors, droppers, spyware, PUA, and viruses) has the same number of samples, to prevent a prediction bias towards the majority class. 20% of the malware families (such as the NotPetya ransomware family) were only used in the test set to assess generalization to an unseen malware family. 80% of the malware families (such as the Virut virus family) were distributed between the training and test sets, to determine the classifier’s ability to generalize to samples from the same family. The temporal difference between the training set and the test set is six months (i.e., all training set samples are older than the test set samples), based on VirusTotal’s ’first seen’ date. The ground truth labels of the dataset were determined by VirusTotal, an online scanning service, which contains more than 60 different security products. A sample with 15 or more positive (i.e., malware) classifications from the 60 products is considered malicious. A sample with zero positive classifications is labeled as benign. All samples with 1-14 positives were omitted to prevent false positive contamination of the dataset. Family labels for dataset balancing were taken from the Kaspersky Anti Virus classifications.
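The VirusTotal-based labeling rule described above can be sketched directly. This is an illustrative sketch of the stated thresholds (15+ positive detections means malicious, 0 means benign, 1-14 means the sample is omitted), not the paper's pipeline code.

```python
# Sketch of the dataset labeling rule based on VirusTotal detection counts.
from typing import Optional


def label_sample(positives: int) -> Optional[str]:
    """Label by number of positive (malware) detections out of ~60 engines."""
    if positives >= 15:
        return "malicious"
    if positives == 0:
        return "benign"
    return None  # 1-14 positives: omitted to avoid label noise
```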
It is crucial to prevent dataset contamination by malware that detects that it is running in Cuckoo Sandbox (or on a virtual machine) and, if so, quits immediately to prevent reverse engineering efforts. In those cases, the sample's label is malicious, but its behavior recorded in Cuckoo Sandbox (its API call sequence) isn't, due to its anti-forensic capabilities. To mitigate such contamination of the dataset, two countermeasures were used: 1) considering only API call sequences with more than 15 API calls (as in ), thereby omitting malware that detects a virtual machine (VM) and quits, and 2) applying YARA rules to find samples trying to detect sandbox programs such as Cuckoo Sandbox, and omitting all such samples. One might argue that evasive malware applying such anti-VM techniques is extremely challenging and relevant; however, in this paper we focus on the adversarial attack. This attack is generic enough to work for such evasive malware as well, assuming that other mitigation techniques (e.g., anti-anti-VM) are applied. After this filtering and balancing of the benign samples, about 400,000 valid samples remained. The final training set size is 360,000 samples, 36,000 of which serve as the validation set. The test set size is 36,000 samples. All sets are balanced between malicious and benign samples.
Due to hardware limitations, a subset of the dataset was used as a training set: 54,000 training samples and test and validation sets of 6,000 samples each. The dataset was representative and maintained the same distribution as the dataset described above.
Appendix B: Tested Malware Classifiers
As mentioned in Section 4, we used the malware classifiers from , since many classifier types are covered, allowing us to evaluate the defense performance against each of them. The maximum input sequence length was limited to 140 API calls; longer sequences are split into windows of 140 API calls each, and each window is classified in turn. If any window is malicious, the entire sequence is considered malicious. Thus, the input of all of the classifiers is a vector of 140 API call types in one-hot encoding, using 314 bits each, since there were 314 monitored API call types in the Cuckoo reports for the dataset. The output is a binary classification: malicious or benign. An overview of the LSTM architecture is shown in Figure 2.
The same implementation was used for all neural network classifiers, with TensorFlow as the backend.
The loss function used for training was binary cross-entropy, and the Adam optimizer was used for all of the neural networks. The output layer was fully connected with sigmoid activation for all neural networks. A rectified linear unit (ReLU) was chosen as the activation function for the input and hidden layers due to its fast convergence compared to the alternatives, and dropout was used to improve the generalization potential of the network. A batch size of 32 samples was used.
The classifiers also have the following classifier-specific hyper parameters:
RNN, LSTM, GRU, BRNN, BLSTM, bidirectional GRU - a hidden layer of 128 units, with a dropout rate of 0.2 for both inputs and recurrent states.
Deep LSTM and BLSTM - two hidden layers of 128 units, with a dropout rate of 0.2 for both inputs and recurrent states in both layers.
The classifiers' performance was measured using the accuracy ratio, which gives equal importance to both false positives and false negatives (unlike precision or recall). The false positive rate of the classifiers varied between 0.5-1%. (The false positive rate was chosen to be on the high end of production systems; a lower false positive rate would mean lower recall as well, due to the trade-off between them, making our attack even more effective.)
The performance of the classifiers is shown in Table 2. The accuracy was measured on the test set, which contains 36,000 samples.
| Classifier Type | Accuracy (%) |
As can be seen in Table 2, the LSTM variants are the best malware classifiers, in terms of accuracy.
Appendix C: Sequence Squeezing for API Calls monitored by Cuckoo Sandbox