Detecting Malicious PowerShell Scripts Using Contextual Embeddings

by Amir Rubin, et al.
Ben-Gurion University of the Negev

PowerShell is a command-line shell that is widely used in organizations for configuration management and task automation. Unfortunately, PowerShell is also increasingly used by cybercriminals for launching cyber attacks against organizations, mainly because it is pre-installed on Windows machines and exposes strong functionality that attackers can leverage. This makes the problem of detecting malicious PowerShell scripts both urgent and challenging. We address this important problem by presenting several novel deep-learning-based detectors of malicious PowerShell scripts. Our best model obtains a true positive rate of nearly 90% while maintaining a false positive rate of less than 0.1%. Our models employ pre-trained contextual embeddings of words from the PowerShell "language". A contextual word embedding is able to project semantically similar words to proximate vectors in the embedding space. A known problem in the cybersecurity domain is that labeled data is relatively scarce in comparison with unlabeled data, making it difficult to devise effective supervised detection of many types of malicious activity. This is also the case with PowerShell scripts. Our work shows that this problem can be largely mitigated by learning a pre-trained contextual embedding based on unlabeled data. We trained our models' embedding layer using a scripts dataset that was enriched by a large corpus of unlabeled PowerShell scripts collected from public repositories. As established by our performance analysis, the use of unlabeled data for the embedding significantly improved the performance of our detectors. We expect that the use of pre-trained contextual embeddings based on unlabeled data for improved classification accuracy will find additional applications in the cybersecurity domain.



