AST-Based Deep Learning for Detecting Malicious PowerShell

10/03/2018
by   Gili Rusak, et al.
MIT
Stanford University

With the celebrated success of deep learning, some attempts to develop effective methods for detecting malicious PowerShell programs employ neural nets in a traditional natural language processing setup while others employ convolutional neural nets to detect obfuscated malicious commands at a character level. While these representations may express salient PowerShell properties, our hypothesis is that tools from static program analysis will be more effective. We propose a hybrid approach combining traditional program analysis (in the form of abstract syntax trees) and deep learning. This poster presents preliminary results of a fundamental step in our approach: learning embeddings for nodes of PowerShell ASTs. We classify malicious scripts by family type and explore embedded program vector representations.

1. Introduction

PowerShell is a popular scripting language and command-line shell. Originally compatible only with Windows, PowerShell has gained a multitude of users over the last several years, especially with its cross-platform, open-source version, PowerShell Core. PowerShell is built on the .NET framework and allows third-party users to write cmdlets and scripts that they can disseminate to others through PowerShell (et al., 2018). Along with increasing usage, PowerShell has unfortunately also been used in malicious attacks through different types of computer viruses (Wueest, 2016). PowerShell scripts can easily be encoded and obfuscated, making it increasingly difficult to detect malicious activity (Hendler et al., 2018). According to the FireEye Dynamic Threat Intelligence (DTI) cloud, malicious PowerShell attacks have been rising throughout the past year (Fang, 2018).

Detecting malicious PowerShell behavior can be challenging for a number of reasons. Attackers can perform malicious activity without deploying binaries on the attacked machines (Wueest, 2016). Additionally, PowerShell is installed by default on Windows machines. Further, attackers have shifted towards sophisticated obfuscation techniques that make detecting malicious scripts difficult (White, 2017). Notably, attackers use the -EncodedCommand flag to pass Base-64 encoded commands, bypassing PowerShell's execution restrictions.

Recently, emerging research has deployed machine learning based models to detect malware in general (Al-Dujaili et al., 2018; Huang et al., 2018) and malicious PowerShell in particular (Fang, 2018; Hendler et al., 2018), where deep learning is employed to analyze malicious PowerShell scripts, inspired by natural language understanding and computer vision approaches. Though these approaches may support learning the features necessary to distinguish malicious scripts, given the wide range of obfuscation options used in PowerShell scripts, we speculate that they might overlook some of the rich structural information in the code. We therefore propose to break away from text-based deep learning and to use structure-based deep learning.

Figure 1. AST-based deep learning for malicious PowerShell detection.

Our proposition is motivated by the successful use of Abstract Syntax Trees (ASTs) in manually crafting features to detect obfuscated PowerShell scripts (Bohannon and Holmes, 2017). While this use case does consider structural information, manually-crafted features can be vulnerable to high-level obfuscation (e.g., AST-based techniques (Cobb, 2017)). Therefore, in this paper, we propose to learn representations of PowerShell scripts in an end-to-end deep learning framework based on their parsed ASTs. Specifically, we build on the work of Peng et al. (2015) to learn representations (embeddings) for AST nodes. These representations can then be incorporated in any of the tasks associated with PowerShell analysis, including malware detection as shown in Fig. 1.

2. Background

Deep Learning for PowerShell

Hendler et al. (2018) proposed to use several deep learning models to distinguish benign and malicious PowerShell commands. With a dataset of malicious and clean PowerShell commands, they implemented both Natural Language Processing (NLP) based detectors and detectors based on character-level Convolutional Neural Networks (CNNs) for text classification, the latter treating the text as a raw signal at the character level. According to their results on different architectures (including a 9-layer CNN, a 4-layer CNN, and a long short-term memory net), all of the detectors obtained high AUC levels. The authors suggest that the best performing classifier was an ensemble classifier that combined traditional NLP techniques with a CNN-based classifier. However, performance was worse on their held-out test set, with higher false positive rates. In a recent blog post, FireEye (Fang, 2018) applies a supervised classifier to detect malicious PowerShell commands, leveraging a prefix-tree based stemmer for the PowerShell syntax. The input to the machine learning model is a vectorized representation of the stemmed tokens. The above propositions focused on detecting malicious PowerShell commands rather than scripts, which pose a more difficult challenge. Moreover, the features are derived from the commands’ textual form, which may not capture the commands’ functional semantics and is prone to character frequency tampering.

AST for PowerShell

Bohannon and Holmes (2017) studied obfuscated PowerShell scripts. They presented a baseline character frequency analysis and used cosine similarity to detect obfuscation in PowerShell scripts. They reported promising preliminary results and noted a significant difference between obfuscated and non-obfuscated code. Like Hendler et al. (2018), the authors ran into the issue of false negatives and suggested taking advantage of PowerShell Abstract Syntax Trees (ASTs), since PowerShell’s API allows for simple AST extraction. Based on the parsed ASTs, the authors crafted distributional features (e.g., the distribution of AST types). The engineered feature vectors led to robust obfuscation classifiers on the test set. Similar to the character frequency tampering challenge in text-based representations, the AST-based distributional features can be vulnerable to AST-based obfuscation (Cobb, 2017).
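The character frequency baseline described above can be illustrated with a minimal sketch. It assumes that each script is reduced to a vector of relative character frequencies and compared against a reference vector with cosine similarity; the function names are hypothetical, and details such as case folding or the choice of reference corpus are not given in the text and are left out here.

```python
import math
from collections import Counter


def char_frequencies(text):
    """Map each character in `text` to its relative frequency."""
    counts = Counter(text)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse frequency vectors (dicts)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Usage: compare a script against a reference profile (both strings are
# illustrative, not taken from any dataset).
reference = char_frequencies("Get-ChildItem -Path C:\\Users | Select-Object Name")
candidate = char_frequencies("&('G'+'et-C'+'hildItem') -Path C:\\Users")
similarity = cosine_similarity(reference, candidate)
```

A low similarity to the profile of ordinary scripts would flag a candidate for closer inspection. As the paragraph above notes, such a measure is easy to tamper with by padding a script with characters that restore the expected distribution.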

Deep Learning with AST

Peng et al. (2015) developed a technique to build program vector representations, or embeddings, of different abstract syntax node types based on a corpus of ASTs for deep learning approaches. They used nearest-neighbor similarity and k-means clustering to assess the quality of their resulting embeddings. Their qualitative and quantitative results suggested that deep learning is a promising direction for program analysis. In this project, we build on the findings of Peng et al. (2015) and further study this claim.

3. Methods

To learn a robust representation of PowerShell scripts, we take a hybrid approach combining traditional program analysis and deep learning approaches. We convert the PowerShell scripts to their AST counterparts, and then build embedding vector representations of each AST node type based on a corpus of PowerShell programs.

PowerShell scripts to Abstract Syntax Trees

The dataset we considered was composed of Base-64 encoded PowerShell scripts. Thus, as a preprocessing step, each PowerShell script/command was decoded. Given a decoded PowerShell script, we determined its abstract syntax tree representation by recursively traversing the script’s properties using [object.PSObject.Properties] and storing items of type [System.Management.Automation.Language.Ast]. We stored the parent-child relationships among the AST nodes in depth-first-search order as a text file. There were 37 different AST node types. On multi-core machines, AST generation can be carried out in parallel.
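The decoding step can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: it assumes the payloads follow the convention of the -EncodedCommand flag, which expects Base-64 over UTF-16LE text, and the helper names are hypothetical. Payloads in the dataset that use another encoding would need a different decode call.

```python
import base64


def decode_encoded_command(b64_text):
    """Decode a Base-64 PowerShell payload, assuming UTF-16LE text."""
    raw = base64.b64decode(b64_text.strip(), validate=True)
    return raw.decode("utf-16-le")


def encode_command(script):
    """Inverse helper, useful for building test inputs."""
    return base64.b64encode(script.encode("utf-16-le")).decode("ascii")


# Usage: round-trip a harmless command.
encoded = encode_command("Get-Date")
decoded = decode_encoded_command(encoded)
```

The decoded text is what would then be handed to the PowerShell parser to obtain the AST.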

Preliminary Analysis of Abstract Syntax Trees

After collecting the tree structures of our PowerShell scripts corpus, we conducted an exploratory analysis of the ASTs and their statistics. Furthermore, we used a random forest classifier to label a PowerShell script by its malware family type. As will be shown in Section 4, a few simple AST-based features were indicative of the malware family.

Abstract Syntax Trees to Vector Representations

Having outlined our approach to the problem of malicious PowerShell programs, we herein take a fundamental step towards learning robust AST-based representations. We employed (Peng et al., 2015, Algorithm 1) on the PowerShell dataset to learn real-valued vector representations of the AST node types. To this end, we parsed each constructed AST to a list of data structures which we refer to as subtrees. A subtree of an AST represents a non-leaf node and its immediate child nodes, each labeled by its type. Next, we shuffled the subtrees to avoid reaching a local minimum specific to a given script. For each subtree, with parent node $p$ and $n$ child nodes $c_1, \dots, c_n$, define $l_i = \frac{\#\text{leaves under } c_i}{\#\text{leaves under } p}$. Similar to (Peng et al., 2015), we define a loss function to measure how well the learnt vectors are describing the subtrees. Let $T$ be the number of distinct AST types whose embeddings we are trying to learn. Let $V \in \mathbb{R}^{T \times N_f}$ be the embedding matrix of the AST node types and define $\mathrm{vec}(p) \in \mathbb{R}^{N_f}$ as the embedding vector that corresponds to the type of node $p$. The same holds for $c_i$. Additionally, let $W_l, W_r \in \mathbb{R}^{N_f \times N_f}$ be weight matrices and $b \in \mathbb{R}^{N_f}$ be a bias vector. Further, define $W_i$ as the weights matrix of node $c_i$ as

$$W_i = \frac{n-i}{n-1} W_l + \frac{i-1}{n-1} W_r \qquad (1)$$

Let the distance metric be defined by

$$d = \left\| \mathrm{vec}(p) - \tanh\left( \sum_{i=1}^{n} l_i W_i \, \mathrm{vec}(c_i) + b \right) \right\|_2^2 \qquad (2)$$

Let $d_c$ be the distance function applied on a negative example of a given subtree where $k$ of the children nodes are changed to different AST types. Given the parameters $V, W_l, W_r, b$, we minimized $\max\{0, \Delta + d - d_c\}$, which compares the distance between a normal subtree’s construction and that of a corrupted adversarial subtree. We used the Adam optimizer to find optimal embedding vectors and adjust the hyperparameters $\Delta$ and $k$ from their default values.
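To make the objective concrete, the following is a minimal sketch of the per-subtree computation. It assumes the form used by Peng et al. (2015): a position-weighted mix of a "left" and a "right" weight matrix for each child, a tanh reconstruction of the parent vector, a squared Euclidean distance, and a hinge-style margin between a true subtree and a corrupted one. All function names, the handling of single-child subtrees, and the margin value in the signature are illustrative assumptions, not the authors' implementation or settings.

```python
import math
import random


def mat_vec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def child_weight(i, n, W_l, W_r):
    """Weight matrix for the i-th of n children (1-indexed)."""
    if n == 1:
        a_l, a_r = 0.5, 0.5  # assumption: a single child mixes both equally
    else:
        a_l, a_r = (n - i) / (n - 1), (i - 1) / (n - 1)
    return [[a_l * l + a_r * r for l, r in zip(row_l, row_r)]
            for row_l, row_r in zip(W_l, W_r)]


def subtree_distance(parent_vec, child_vecs, leaf_fracs, W_l, W_r, b):
    """Squared distance between the parent vector and its reconstruction."""
    n = len(child_vecs)
    total = list(b)
    for i, (c, l) in enumerate(zip(child_vecs, leaf_fracs), start=1):
        contribution = mat_vec(child_weight(i, n, W_l, W_r), c)
        total = [t + l * x for t, x in zip(total, contribution)]
    recon = [math.tanh(t) for t in total]
    return sum((p - r) ** 2 for p, r in zip(parent_vec, recon))


def corrupt_children(child_types, all_types, k, rng):
    """Negative example: replace k children with different node types."""
    corrupted = list(child_types)
    for idx in rng.sample(range(len(corrupted)), min(k, len(corrupted))):
        corrupted[idx] = rng.choice([t for t in all_types if t != corrupted[idx]])
    return corrupted


def margin_loss(d, d_c, delta=1.0):
    """Hinge loss; delta=1.0 is a placeholder, not the paper's default."""
    return max(0.0, delta + d - d_c)
```

In a full training loop, the embeddings, weight matrices, and bias would be updated by a gradient-based optimizer such as Adam; that part requires an automatic differentiation library and is omitted from this sketch.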

4. Experiments

Setup.

We utilize a corpus of hand-annotated and thoroughly analyzed malicious PowerShell scripts (White, 2017). This dataset consists of known malicious PowerShell scripts annotated and classified based on their family types. These include ShellCode Inject, Powerfun Reverse, and others. The code repository will be made available at https://github.com/ALFA-group.

Experiment 1: Malware Family Classification.

As a preliminary experiment, we attempted to classify malicious PowerShell scripts by family type. We used properties of the abstract syntax tree representation to conduct this classification. Specifically, we used only two features: depth and number of nodes per PowerShell AST. We used the family types as the labels of our classifier. Since the dataset suffered from a class-imbalance problem, we weighted the classes when training the classifier (in this case a random forest classifier) based on how many examples each class contained. We tuned the maximum depth of the classifier as a hyperparameter and fit the classifier with the selected value. Due to the sparsity of the dataset, we limited our experiment to family types with enough examples per family, resulting in eight different families. We randomly split the data into train and test sets. The confusion matrix of the held-out test data is shown in Fig. 2. To our surprise, we found that two naive AST-based features (AST node count and AST depth) were enough to perform well under 3-fold cross-validation. Notably, even very simple features performed well because ASTs inherently capture program structure. This serves as a motivating example for the effectiveness of ASTs and exemplifies the power of harnessing ASTs to understand program representations.

Figure 2. Heatmap for the confusion matrix results on the held out test set in the Malware Family Classification experiment.
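The two features used in this experiment, along with one common way to weight imbalanced classes, can be sketched as follows. The sketch assumes the AST is available as a list of (parent_id, child_id) pairs, which is one plausible reading of the stored parent-child relationships; the depth convention (root at depth 1) and the inverse-frequency weighting formula are assumptions, since the text does not specify them.

```python
from collections import Counter, defaultdict


def ast_features(edges, root):
    """Return (depth, node_count) for a tree given as parent-child pairs."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    node_count, max_depth = 0, 0
    stack = [(root, 1)]  # assumption: the root sits at depth 1
    while stack:
        node, depth = stack.pop()
        node_count += 1
        max_depth = max(max_depth, depth)
        stack.extend((c, depth + 1) for c in children.get(node, []))
    return max_depth, node_count


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Usage on a toy tree: node 0 has children 1 and 2, node 2 has child 3.
features = ast_features([(0, 1), (0, 2), (2, 3)], root=0)
weights = balanced_class_weights(["familyA", "familyA", "familyA", "familyB"])
```

The resulting feature pairs and weights could then be passed to any random forest implementation that accepts per-class weights.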

Experiment 2: Learning AST Node Representations.

Extending these results, we build program vector representations of the dataset. As a case study, we analyzed a random sample of the subtrees in the malicious PowerShell corpus. We built the embedding matrix for the AST node types appearing in this sample using the method described earlier. We trained our model until the loss stabilized towards 0. The qualitative results are summarized in a dendrogram in Fig. 3, which shows how each embedding relates to similar ones. Notably, the TryStatement and CatchClause node types are neighbors, as are ForStatement and DoWhileStatement, and Command and CommandParameter. This is promising, since one would expect such node types to serve similar functions in scripts. This preliminary experiment has limitations: for example, one would expect the ForEachStatement to land near the ForStatement as well. Additional training on the full malicious dataset is required to fully assess the validity of these methods. As next steps, we hope to use these embeddings to build robust classifiers that label a malicious script by family. Afterwards, we will use these embeddings to build robust classifiers that determine whether a given PowerShell script is malicious.

Figure 3. Dendrogram of node types and their relationships in the Learning Node Representations experiment.
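The kind of neighbor relationship that a dendrogram exposes can be checked directly with a nearest-neighbor query over the embedding vectors. The sketch below assumes Euclidean distance, which the text does not specify, and the two-dimensional vectors are made-up toy values chosen for illustration; they are not embeddings learned in the experiment.

```python
import math


def nearest_neighbors(embeddings):
    """Map each node type to the closest other node type."""
    result = {}
    for name, vec in embeddings.items():
        candidates = (
            (math.dist(vec, other_vec), other_name)
            for other_name, other_vec in embeddings.items()
            if other_name != name
        )
        result[name] = min(candidates)[1]
    return result


# Toy vectors (hypothetical, for illustration only).
toy_embeddings = {
    "TryStatement": [0.9, 0.1],
    "CatchClause": [0.8, 0.2],
    "ForStatement": [-0.7, 0.5],
    "DoWhileStatement": [-0.6, 0.6],
}
neighbors = nearest_neighbors(toy_embeddings)
```

On learned embeddings, pairs such as TryStatement and CatchClause appearing as mutual nearest neighbors would agree with the dendrogram, while a pair that one expects to be close but is not would point to the need for further training.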

5. Conclusion

Malicious PowerShell scripts have targeted industries including Higher Education, High Tech, Professional and Legal Services, and Healthcare. This paper motivated the use of static program analysis (in the form of abstract syntax trees) to supplement deep learning techniques with rich structural information about the code, instead of relying on text-based representations. We seek to use deep learning in an end-to-end unsupervised framework to identify intrinsic common patterns in our programs, since even ASTs can be obfuscated. We saw that the depth and node count of an AST were enough to distinguish malware families, and we took our first fundamental step in learning representations of PowerShell programs.

Acknowledgement

This work was supported by the MIT-IBM Watson AI Lab and CSAIL CyberSecurity Initiative. We thank Palo Alto Networks for the dataset.

References