Anomaly Detection in Cyber Network Data Using a Cyber Language Approach

08/15/2018 ∙ by Bartley D. Richardson, et al. ∙ KeyW Corporation

As the amount of cyber data continues to grow, cyber network defenders must analyze ever larger volumes of data to ensure the security of their networks. In addition, new types of attacks are constantly being created and executed globally. Current rules-based approaches are effective at characterizing and flagging known attacks, but they typically fail when presented with a new attack or new types of data. By comparison, unsupervised machine learning offers a distinct advantage: it does not require labeled data to learn from large amounts of network traffic. In this paper, we present a natural language-based technique (suffix trees) as applied to cyber anomaly detection. We illustrate one methodology for generating a language from cyber data features, and our experiments show positive preliminary results in applying this technique to flow-type data. As an underlying assumption of this work, we claim that malicious cyber actors leave observables in the data as they execute their attacks. This work seeks to identify those artifacts and exploit them to detect a wide range of cyber attacks without the need for labeled ground-truth data.

I Introduction

As the amount of cyber data continues to grow, cyber network defenders must analyze ever larger volumes of data to ensure the security of their networks. In addition, new types of attacks are constantly being created and executed globally. Current rules-based approaches are effective at characterizing and flagging known attacks, but they typically fail when presented with a new attack or new types of data. By comparison, unsupervised machine learning offers a distinct advantage: it does not require labeled data to learn from large amounts of network traffic. In this paper, we present a natural language-based technique (suffix trees) as applied to cyber anomaly detection. We illustrate one methodology for generating a language from cyber data features, and our experiments show positive preliminary results in applying this technique to flow-type data. As an underlying assumption of this work, we claim that malicious cyber actors leave observables in the data as they execute their attacks. This work seeks to identify those artifacts and exploit them to detect a wide range of cyber attacks without the need for labeled ground-truth data.

II Previous Work

Previous work on pattern-of-life analysis and anomaly detection in network data informs the approaches taken in this paper. Pattern-of-life work focuses on identifying and classifying users within a network [1, 2, 3, 4]. Clustering for anomaly detection [5, 6, 7] is a common technique for network data because most network datasets are unlabeled and contain no ground truth. These works focus largely on intrusion detection and assume that intrusions are anomalous relative to the network as a whole. Likewise, previous work has used the probabilistic suffix tree (PST) to model and predict protein families [8] and to determine anomalous user behavior from event logs [9].

III Data Sources and Experimental Configuration

Our data for this effort consists of network traffic data collected in a traditional compute environment. Specifically, we utilize the University of New Brunswick Information Security Centre of Excellence (ISCX) Intrusion Detection Evaluation DataSet [10]. We selected this dataset because it contains labeled data for known attacks, which permits us to produce ROC curves and calculate AUC in order to evaluate the effectiveness of the technique. We concentrate on Bro netflow data. By default, Bro creates separate log files for different actions. For example, the DNS log file contains DNS resolution requests and associated metadata, while the HTTP log file contains, among other things, GET and POST requests (with the associated URLs) for activity over HTTP and HTTPS ports. The Bro CONN (connection) log contains high-level metadata for other log types, including IP addresses, ports used, bytes transferred, packets transferred, duration, TCP flags, and protocol. Bro netflow is aggregated to the session level and is bidirectional, enabling it to represent many PCAP frames in a single row of information. The relationship between PCAP and flow-type data is illustrated in Figure 1. The ISCX dataset contains 2,028,053 labeled Netflow records, with 96.6% of them labeled normal and the remaining 3.4% labeled as attack. The volume of traffic is shown in Figure 2. Statistical analysis of real network traffic is used to create agent-based background traffic in a testbed environment, and the attacks are real and planned by a white hat team based on the architecture of the testbed environment.

Figure 1: Flow-type data (top) compared with PCAP data (bottom). Flow data is aggregated and contains information across multiple PCAP frames, and raw PCAP data provides access to the application payload that may not be represented in aggregated flow data.
Figure 2: Volume of attack (top) and normal (bottom) traffic in the ISCX dataset

Our analytics run in a cloud compute environment using Cloudera CDH v5.11.0 and a heavily modified Spark v.2.1.0.cloudera1. The physical machine contains 6.15TB of addressable RAM with 420 VCores. Actual physical servers include: 8x24 cores, 22x40 cores, 1x12 core, and 1x4 cores. HDFS is currently configured for a total capacity of 260.4TiB. Our analytics utilize Spark MLlib heavily, and input/output is managed via Hive tables in HDFS. It is important to note that this is a research cloud and is not a production environment.

IV Creating Sequences of Cyber Data

Before we can apply the analytic to the cyber data, we must transform it into a sequence of activity (i.e., create the communication language). Figure 3 illustrates this process in detail. In general, some feature engineering is performed prior to sequence creation. Various combinations of protocol, port, bytes, packets, and other features can be encoded into discrete tokens, and sequences of these tokens effectively compress the communication between two networked computers. For this work, we focus on proto-bytes (a protocol identifier with the sum of the bytes transferred) and proto-density (a protocol identifier with the sum of bytes/packet transferred). To keep the vocabulary at a manageable size, we also bin the quantitative features. One method that has shown promise is to take the floor of the log2 of the feature value. This allows us to keep some sense of magnitude (e.g., bytes, KB, MB, GB) while reducing the number of tokens produced.
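The binning described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the function names, the "proto:bin" token format, and the choice to map values below 1 to bin 0 are all assumptions made for the example. The bucket width of 10 for proto-density matches the binning reported in the experimental results.

```python
def bin_log2(value):
    """Return floor(log2(value)) for value >= 1.

    Assumption: values below 1 (e.g., zero-byte flows) map to bin 0.
    """
    if value < 1:
        return 0
    # For a positive integer n, n.bit_length() - 1 equals floor(log2(n)).
    return int(value).bit_length() - 1


def proto_bytes_token(proto, total_bytes):
    """Hypothetical proto-bytes token: protocol plus log2-binned byte count."""
    return f"{proto}:{bin_log2(total_bytes)}"


def proto_density_token(proto, total_bytes, total_packets, bucket=10):
    """Hypothetical proto-density token: protocol plus bytes/packet,
    rounded down to the nearest multiple of `bucket`."""
    density = total_bytes / total_packets if total_packets else 0.0
    return f"{proto}:{int(density // bucket) * bucket}"
```

With log2 binning, a 1,500-byte flow and a 2,000-byte flow share one token, while a 1 MB flow lands roughly ten bins higher, so the vocabulary stays small but magnitude is preserved.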

Figure 3: Construction of cyber language sequences from flow data

Another issue to consider when creating sequences of cyber data is the sequence length. Sequence length directly relates to time and how long the communication remains open. Various ways to sessionize exist, including by hour, day, or week, or after 30 minutes of no activity between two IP addresses. For this work, we construct sequences that terminate after an hour. This has the added benefit of keeping most sequences at a similar length, so we avoid having to normalize sequences of widely varying lengths. These sequences are created using a parallelized Spark-based approach that can construct multiple types of sequences relatively quickly.
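A single-machine sketch of hourly sessionization is given below. It is not the parallelized Spark approach used in this work. It assumes that each flow record has already been reduced to a (timestamp in seconds, source IP, destination IP, token) tuple, that the two endpoints are treated as an unordered pair because the flows are bidirectional, and that windows are fixed clock hours rather than an hour measured from the first flow. The function name and record layout are hypothetical.

```python
from collections import defaultdict


def build_sequences(flows, window_seconds=3600):
    """Group tokens into time-ordered sequences per IP pair and time window.

    flows: iterable of (timestamp_seconds, src_ip, dst_ip, token) tuples.
    Returns a dict mapping ((ip_a, ip_b), window_index) -> list of tokens.
    """
    sequences = defaultdict(list)
    # Sorting the tuples orders the flows by timestamp first.
    for ts, src, dst, token in sorted(flows):
        pair = tuple(sorted((src, dst)))      # assumption: direction is ignored
        window = int(ts // window_seconds)    # assumption: fixed hourly windows
        sequences[(pair, window)].append(token)
    return dict(sequences)


flows = [
    (10, "10.0.0.1", "10.0.0.2", "tcp:10"),
    (20, "10.0.0.2", "10.0.0.1", "tcp:5"),
    (3700, "10.0.0.1", "10.0.0.2", "udp:7"),
    (15, "10.0.0.1", "10.0.0.3", "tcp:3"),
]
sequences = build_sequences(flows)
```

In the example, the first two flows fall into the same hour for the same pair of hosts and form one two-token sequence, while the flow at 3,700 seconds starts a new sequence.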

V Modeling Cyber Data using a Probabilistic Suffix Tree

Creating the PST model is relatively straightforward. The cyber language sequences are fed into a slightly modified PST implementation that distributes the learning across a Spark cluster. Typical starting hyperparameter values set the depth of the tree to 14, the minimum probability to 0.0001, the probability threshold (specifying the minimum probability necessary for inclusion of the suffix in the tree) to 0.0005, and the two smoothing parameters at [...] and [...].

After creating the sequences and the PST model, we then score each sequence using the model. The overall process is shown in Figure 4. Each sequence receives a likelihood score (a probability between 0 and 1), and we flag for investigation those sequences that receive a non-zero likelihood score below a set limit. These sequences represent those less likely to exist in the data (i.e., anomalous sequences).
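To make the scoring and flagging step concrete, the sketch below uses a deliberately simplified stand-in for the PST: a variable-order context model that counts which token follows each context of up to max_depth tokens and backs off to the longest context seen in training. It omits the pruning and smoothing of a real PST and is not the Spark implementation used in this work. The class name, method names, small max_depth default, and the use of p_min as a floor on each conditional probability are assumptions made for illustration.

```python
from collections import defaultdict


class SimpleContextModel:
    """Simplified stand-in for a PST (no pruning, no smoothing parameters)."""

    def __init__(self, max_depth=2, p_min=1e-4):
        if max_depth < 1:
            raise ValueError("max_depth must be at least 1")
        self.max_depth = max_depth
        self.p_min = p_min  # assumption: floor applied to every conditional probability
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, sequences):
        """Count next-token occurrences for every context of length 0..max_depth."""
        for seq in sequences:
            for i, token in enumerate(seq):
                for depth in range(min(self.max_depth, i) + 1):
                    self.counts[tuple(seq[i - depth:i])][token] += 1
        return self

    def next_prob(self, context, token):
        """P(token | longest suffix of context that was seen in training)."""
        context = tuple(context)[-self.max_depth:]
        while context and context not in self.counts:
            context = context[1:]  # back off to a shorter suffix
        nexts = self.counts.get(context, {})
        total = sum(nexts.values())
        prob = nexts.get(token, 0) / total if total else 0.0
        return max(prob, self.p_min)

    def score(self, seq):
        """Likelihood of the whole sequence: product of conditional probabilities."""
        likelihood = 1.0
        for i, token in enumerate(seq):
            likelihood *= self.next_prob(seq[max(0, i - self.max_depth):i], token)
        return likelihood


def flag_anomalies(model, sequences, limit):
    """Return sequences whose likelihood is non-zero but below `limit`."""
    return [seq for seq in sequences if 0.0 < model.score(seq) < limit]
```

For long sequences the product of probabilities can underflow to zero, so a practical version would likely sum log-probabilities instead; the plain product is kept here because it matches the description of a likelihood between 0 and 1.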

Figure 4: Analytic flow for creating PST models from cyber network data

To build intuition, we present Figure 5. In this figure, the application of PST modeling to English words is on the left while the cyber application is on the right. In the traditional application, we seek to quantify a word’s conformity to traditional spelling patterns. Notice that words like “actions” and “stations” are more likely (therefore further to the right on the histogram) while words like “chutzpah” and “syzygy” are less likely (further to the left). In our application to cyber data via construction of a cyber language, we seek to add interpretability to findings using similar intuition. Our application necessitates an additional step to transform the data and sequence the tokens. Instead of analyzing English spelling patterns, we are quantifying spelling patterns of our tokenized sequences representing machine-to-machine communication. The underlying assumption is that sequences of less-likely spellings are anomalous to the network environment where they are observed, and these events warrant increased scrutiny by a cyber expert.

Figure 5: Application of PST to natural language (left) and cyber data (right)

VI Experimental Results

This section presents experimental results of the PST approach to cyber anomaly detection on the ISCX dataset. Figure 6 shows the ROC curves for two different types of tokens. We experimented with two main types of tokenization: proto-density binned in buckets of 10 (left) and proto-bytes binned using log2 (right). For the ISCX dataset, using proto-bytes as a feature significantly outperformed using proto-density.

Figure 6: The effect of tokenization on the performance of the PST analytic

Another factor in PST performance is the tuning of the analytic hyperparameters. As implemented, the PST has several hyperparameters. The tree depth specifies the maximum depth of the model generated, while the probability threshold determines whether a sequence is significant and therefore a candidate to add to the PST. Raising the probability threshold makes the PST more restrictive. The other parameters (tau, epsilon, and probability minimum) are used together to remove useless nodes from the PST model and to provide smoothing. Figure 7 illustrates the effect of tuning these PST hyperparameters to a specific dataset.

Figure 7: The effect of tuning PST hyperparameters to increase the AUC

By tuning the PST hyperparameters, we are able to increase the AUC from 0.545 (shown on the right side of Figure 6) to 0.748. It should also be noted that the shape of the ROC curve is of interest from the view of a cyber analyst. By noting the sharp rise in the ROC curve at the beginning, we observe that the results presented to a cyber analyst (assuming this same ordering) are less likely to be false positives (i.e., less likely to degrade trust in the system). In an operational environment, we would typically not have ground truth labels for our data. It is important to build trust in the system by presenting minimal false positives to the cyber security analyst.
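The two evaluation views discussed above, overall AUC and the quality of the first results shown to an analyst, can be computed from likelihood scores and ground-truth labels as sketched below. This is an illustrative pure-Python version, not the evaluation code used in this work. It assumes that a lower likelihood means a more anomalous sequence and that labels are 1 for attack and 0 for normal; the function names are hypothetical, and the pairwise AUC loop is quadratic and only suitable for small samples.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random attack sequence has a lower
    likelihood score than a random normal sequence (ties count one half)."""
    attack = [s for s, y in zip(scores, labels) if y == 1]
    normal = [s for s, y in zip(scores, labels) if y == 0]
    if not attack or not normal:
        raise ValueError("need at least one attack and one normal example")
    wins = 0.0
    for a in attack:
        for n in normal:
            if a < n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(attack) * len(normal))


def precision_at_k(scores, labels, k):
    """Fraction of true attacks among the k least likely sequences."""
    ranked = sorted(zip(scores, labels))[:k]
    return sum(y for _, y in ranked) / len(ranked)


# Toy data: two attacks score very low, one attack looks normal.
scores = [0.001, 0.002, 0.5, 0.6, 0.003, 0.7]
labels = [1, 1, 0, 0, 0, 1]
auc = roc_auc(scores, labels)
top2 = precision_at_k(scores, labels, 2)
```

On this toy data the AUC is only moderate because one attack scores as highly likely, yet the two lowest-scoring sequences are both attacks. This is the situation described above: a sharp initial rise in the ROC curve means the first items an analyst reviews are mostly true positives.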

VII Conclusions and Future Work

This work demonstrates that there is a method to view and interpret cyber communications as a language and that applying language-based analytic techniques to this new synthetic language has potential. We have shown that there is value in viewing network traffic as a language, and that PSTs can be used to characterize that language. By selectively engineering the input features and tuning the PST hyperparameters, we can substantially increase the AUC while maintaining a favorable ROC curve shape.

Future work in this area includes determining how best to retain labels when aggregating flows into sessions. While we have characterized traffic only as attack vs. non-attack (normal), various types of attacks exist, and generating labels more nuanced than these binary indicators would be useful. Additional research and experimentation should be devoted to identifying the correct evaluation criteria for results. Do cyber analysts care about the overall AUC or, perhaps more importantly, the true positive rate for the first predicted anomalies/attacks? In addition, experiments are needed to evaluate how well the results obtained on the ISCX data generalize. One way to do this is to use additional labeled datasets from ISCX and then generate model-fit comparisons on labeled and unlabeled data to show correlation.

References

  • [1] Xiaodan Gu, Ming Yang, Jiaxuan Fei, Zhen Ling, and Junzhou Luo. A novel behavior-based tracking attack for user identification. In Advanced Cloud and Big Data, 2015 Third International Conference on, pages 227–233. IEEE, 2015.
  • [2] Esam Sharafuddin, Nan Jiang, Yu Jin, and Zhi-Li Zhang. Know your enemy, know yourself: Block-level network behavior profiling and tracking. In Global Telecommunications Conference (GLOBECOM 2010), 2010 IEEE, pages 1–6. IEEE, 2010.
  • [3] Nino Vincenzo Verde, Giuseppe Ateniese, Emanuele Gabrielli, Luigi Vincenzo Mancini, and Angelo Spognardi. No nat’d user left behind: Fingerprinting users behind nat from netflow records alone. In Distributed Computing Systems (ICDCS), 2014 IEEE 34th International Conference on, pages 218–227. IEEE, 2014.
  • [4] Sebastian Abt, Sebastian Gärtner, and Harald Baier. A small data approach to identification of individuals on the transport layer using statistical behaviour templates. In Proceedings of the 7th International Conference on Security of Information and Networks, page 25. ACM, 2014.
  • [5] Kingsly Leung and Christopher Leckie. Unsupervised anomaly detection in network intrusion detection using clusters. In Proceedings of the Twenty-eighth Australasian conference on Computer Science-Volume 38, pages 333–342. Australian Computer Society, Inc., 2005.
  • [6] Leonid Portnoy, Eleazar Eskin, and Sal Stolfo. Intrusion detection with unlabeled data using clustering. In Proceedings of ACM CSS Workshop on Data Mining Applied to Security (DMSA-2001). Citeseer, 2001.
  • [7] Gerhard Münz, Sa Li, and Georg Carle. Traffic anomaly detection using k-means clustering. In GI/ITG Workshop MMBnet, 2007.
  • [8] Gill Bejerano and Golan Yona. Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics, 17(1):23–43, 2001.
  • [9] Xumin Liu, Hua Liu, and Chen Ding. Incorporating user behavior patterns to discover workflow models from event logs. In Web Services (ICWS), 2013 IEEE 20th International Conference on, pages 171–178. IEEE, 2013.
  • [10] Ali Shiravi, Hadi Shiravi, Mahbod Tavallaee, and Ali A. Ghorbani. Toward developing a systematic approach to generate benchmark datasets for intrusion detection. Computers and Security, 31(3):357 – 374, 2012.