Once a corporate or government entity discovers that its network has been compromised, a response team begins an investigation to respond to the breach. As part of the investigation, heavy use is made of memory, disk, and network forensic capabilities. The cyber security analysts look for IOCs, such as IP addresses, domain names, file hashes, and Windows Registry keys, that will allow them to automatically detect the adversary in the future. Additionally, the analyst may create other more advanced signatures, such as regular expressions, to detect the adversary. As adversaries change their TTPs, some of the IOCs lose their effectiveness. If all of the IOCs become invalid, the adversary will be able to engage in malicious activities without being detected. It is likely that the adversary will be rediscovered after a future breach and ensuing incident response and then this process of extracting and deploying IOCs to sensors is repeated. A better approach to this problem would be to automatically adapt the IOCs so that known adversaries can continue to be tracked as they change their TTPs and the data they generate correspondingly drifts over time.
The TTK problem is one of the fundamental challenges in cyber security. The complex adversary and defender dynamics that are inherent in this problem can be modeled as a non-cooperative game in game theory. Adversaries are aware that defenders have deployed signatures to sensors to detect their activity, and this leads them to constantly change TTPs such as command and control (C&C) messaging, the IPs and domains of servers used in C&C, and exploits directed at discovered vulnerabilities. In addition, many of these indicators are extremely brittle, which makes it easy for the adversary to intentionally manipulate his or her footprint to avoid detection. For example, if an IOC is a key in a Registry entry, the adversary simply needs to write to a different location and then the indicator will be invalidated. If the indicator is the IP address of a server used in C&C communication, the adversary simply needs to use another IP address and the IOC is no longer valid. Similarly, the defender is also aware that the adversaries are constantly changing their tactics to avoid detection. Manually updating the IOCs to continue to bracket the adversary as behavior changes is not possible given current resource allocations.
In this work, we describe a framework for solving the Tracking the Known problem. Starting with a base model of indicators, our framework automatically updates the current model based on its predicted labels on the data stream. The intent is to self-adapt the model to concept drift in the data stream via a data-driven approach. In cyber defense, analysts often rely on regular expressions as indicators to detect adversaries. A regular expression is a sequence of characters that concisely represents a search pattern. Therefore, as an initial demonstration, we explore inducing regular expressions from labeled data. Many algorithms for learning regular expressions have been proposed in response to the Regex Golf problem: given two datasets of strings, create the shortest regular expression that matches all the strings in one set while not matching any of the strings in the other.
II Related Work
To the best of our knowledge, there has been no prior work on the TTK problem. However, the notion of adapting or updating a model based on its own outcomes is known as self-training in the semi-supervised learning literature.
Regular expressions can be used to express a regular language, which is a formal language in theoretical computer science. Finite automata can be used to recognize regular languages. In fact, regular expressions and finite automata are known to be equivalent by Kleene’s theorem. Argyros et al. inferred web application firewall filters (i.e., regular expressions) by generating queries and observing the responses. They showed an improvement over existing automata learning algorithms by reducing the number of required queries through the use of symbolic representations. This work overlaps with Regex Golf, since the queries that are filtered can be considered part of the adversary dataset and those that are passed can be considered part of the non-adversary dataset. Prasse et al. attempted to infer a regular expression that matched a given set of strings (i.e., email messages) and was as close as possible to the regular expression that a human expert would have created. They applied their method to the problem of identifying spam email messages. Their technique frequently predicted the exact regular expression a human expert would have created, or the predicted regular expression was accepted by the expert with little modification. Becchi et al. showed how finite automata can be extended to accommodate Perl-Compatible Regular Expressions (PCRE).
Regular expressions have been used in many applications to solve challenging cyber security problems. Micron Technology, Inc. built a massively parallel semiconductor architecture that directly implemented regular expressions in hardware. The Machine Learning Lab at the University of Trieste recently used genetic programming to learn regular expressions. The Norvig solution discussed in Section IV was used as their baseline for comparison.
III Tracking the Known
The TTK problem is essentially a decision-theoretic problem on a data stream $D$. Given $x \in D$, where $x$ represents some cyber datum/event (e.g., an HTTP session, Windows Registry key, IP address), the goal is to use a model $M$ to determine if $x$ was generated by an adversary or not, which is typically labeled by $y \in \{0, 1\}$, where $1$ denotes the class of interest (i.e., the positive class). However, if the model is not updated against the stream, its performance will degrade over time as the data stream drifts.
Algorithm 1 shows the process for updating the model in the TTK framework. For a given window $W$, the current model $M$ is used to label an event $x$ and then $x$ is added to the appropriate set ($S^{+}$ for a positive prediction and $S^{-}$ for a negative). At the end of the window, an algorithm $A$ is used to derive a new model $M'$, which is then concatenated with the current model $M$ in an ensemble-like fashion. (Some algorithms might update $M$ directly.) Then, the process is repeated on a new window.
For the use case considered in this paper, the model is a list of regular expressions that attempts to match on the positive class, potentially indicating the presence of adversarial activity. However, the framework is generic and, for instance, the model could be induced using other machine learning algorithms. Basically, any algorithm that can categorize an event as being generated by the adversary or not could be utilized by this framework.
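The windowed update described above can be sketched as follows. This is a minimal illustration rather than the paper's Algorithm 1: the function names (`predict`, `ttk_update`, `toy_learner`), the representation of the model as a list of pattern strings, and the toy learner itself are all assumptions made for the example.

```python
import re
from typing import Callable, Iterable, List

Model = List[str]  # assumption: the model is a list of regex pattern strings


def predict(model: Model, event: str) -> bool:
    """An event is labeled positive if any pattern in the model matches it."""
    return any(re.search(pattern, event) for pattern in model)


def ttk_update(stream: Iterable[str], model: Model, window_size: int,
               learn: Callable[[List[str], List[str]], Model]) -> Model:
    """Label each event with the current model and extend the model every window."""
    positives, negatives = [], []
    for event in stream:
        (positives if predict(model, event) else negatives).append(event)
        if len(positives) + len(negatives) == window_size:
            new_model = learn(positives, negatives)
            # Concatenate new patterns with the current model (ensemble-like).
            model = model + [p for p in new_model if p not in model]
            positives, negatives = [], []
    return model


def toy_learner(positives: List[str], negatives: List[str]) -> Model:
    """Hypothetical stand-in learner: propose the first label of each positive
    as a prefix pattern and keep only patterns that match no negative."""
    candidates = {"^" + re.escape(s.split(".")[0]) + r"\." for s in positives}
    return sorted(c for c in candidates
                  if not any(re.search(c, n) for n in negatives))
```

In this sketch the labels that drive learning come from the model's own predictions, which is the self-training behavior the framework relies on. Any learner with the same signature could be substituted for `toy_learner`.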
In order to solve the TTK problem, we created an automated, cyclic pipeline to keep models of IOCs up-to-date with the data generated by an adversary. Fig. 1 shows a high-level view of the implementation of our system. The cyclic nature of the pipeline can be described by the following stages:
III-1 Collect Data
As cyber data is extremely voluminous, it must be captured, processed, and stored in a near real-time, streaming manner. Therefore, a Data Stream Processor must be efficient enough to process streaming cyber data without losing events. Additionally, custom, high-performance, neuromorphic hardware was utilized to further increase performance where possible.
The Event Labeler is used to warm-start the detection system by annotating events based on alternative mechanisms, such as a threat feed or an analyst’s detection rules. Additionally, a human-in-the-loop might also be employed to relabel events and adapt the system.
III-2 Ingest Data
A high-speed Ingest Engine was developed to ingest the processed cyber data into a High-Performance Database. This database stores historical cyber events that have been annotated. This historical data allows models to be developed and validated before deployment.
III-3 Partition Data
Depending on the task of interest, a user can partition the data as needed and derive a model based on that data.
III-4 Generate Model
The model can be induced from any algorithm that can discriminate between two or more sets, such as supervised machine learning.
III-5 Validate Model
Before being deployed to the production system, the model is validated against the historical, annotated data. Standard statistical machine learning performance metrics are calculated (e.g., true positive rate, false positive rate, etc.). If the model performs well historically, it is deployed to the production system.
III-6 Deploy Model
During this stage, the new model is deployed and used by the Data Stream Processor to process and annotate incoming cyber data.
IV Learning Regular Expressions
In this work, our approach to solving TTK requires an algorithm for learning regular expressions: given two datasets of strings, create the shortest regular expression that matches all the strings in one dataset while not matching any of the strings in the other dataset. As long as any of the IOCs remain valid, our solution continues to track the adversary and hence identify the data that he or she produces. This data goes in one dataset while data not generated by the adversary goes in another. The algorithm for deriving regular expressions is then repeated to identify a more robust set of regular expressions for bracketing the adversary and then those new, more current, rules are deployed.
If it is assumed that it is possible to create two distinct sets as required by Norvig’s algorithm, there is a straightforward solution: combine the full contents of the adversary dataset using logical disjunction (i.e., the ’|’ operator) to create a very long regular expression that provides a perfect solution. Of course, this solution is unwieldy, so we want to optimize the size of the regular expression required to separate the two sets. A regular expression to do this will consist of some set of matching conditions OR-ed together. This problem can be generalized as the set cover problem, usually stated as:
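The straightforward (but unwieldy) disjunction can be sketched in a few lines. The function name `trivial_regex` is hypothetical, and anchoring each alternative with `^` and `$` and escaping metacharacters are assumptions made so that each alternative matches exactly one adversary string.

```python
import re


def trivial_regex(adversary):
    """Join every adversary string, escaped and anchored, with the | operator."""
    alternatives = "|".join(re.escape(s) for s in sorted(set(adversary)))
    return "^(?:" + alternatives + ")$"
```

The resulting pattern grows linearly with the adversary dataset and cannot match any string it has not already seen, which is why a shorter, more general expression is preferable.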
Given a set of elements $U$ and a collection $S$ of sets whose union equals $U$, identify the smallest sub-collection of $S$ whose union equals $U$.
As an example, consider $U$ as a small range of integers and $S$ as a collection of subsets of $U$. The union of all sets in $S$ is $U$, meaning $S$ covers $U$, but we can find a smaller sub-collection that still has this property. In this case, a proper sub-collection of $S$ also covers $U$.
Optimizing to find the smallest collection that covers the original set is an NP-hard problem, and the decision version of the problem (the optimization problem involves finding a solution, while the decision version involves determining if a solution exists) was proved NP-complete by Karp. Given this, the best way to approach the problem is through approximation.
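One common approximation is the greedy heuristic, which repeatedly picks the subset that covers the most still-uncovered elements. The sketch below is illustrative only; the function name `greedy_set_cover` and the small instance in the usage note are assumptions chosen for the example, not values taken from this paper.

```python
def greedy_set_cover(universe, subsets):
    """Greedy approximation: repeatedly take the subset covering the most
    uncovered elements until the whole universe is covered."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(subsets, key=lambda s: len(uncovered & s))
        if not uncovered & best:
            raise ValueError("the subsets do not cover the universe")
        chosen.append(best)
        uncovered -= best
    return chosen
```

For the illustrative instance with universe {1, 2, 3, 4, 5} and subsets {1, 2, 3}, {2, 4}, {3, 4}, {4, 5}, the heuristic first selects {1, 2, 3} and then {4, 5}, so two of the four subsets suffice. In general the greedy result is not guaranteed to be the smallest cover.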
Norvig’s algorithm begins by generating a set of regex components as follows. Each dataset is first broken into $n$-grams with $n$ ranging in size from 1 to the length of the longest string in the dataset and then a set of all possible subsequences of these sizes is created. For this set, a ’.’ is then added in every possible iteration of these subsequences replacing some number of characters. Next, regular expression components are created by inserting special repetition characters, such as ’+’, ’*’, or ’?’, after each character that is not ’.’ in every possible combination. From this set, any component that matches anything in the non-adversary dataset is filtered out. This provides a set of components to choose from that match at least one string in the adversary set and no strings in the non-adversary set.
These components are ranked based on how many strings they match and the best is added to a solution set. The strings already covered are removed from consideration, and the remaining components are again ranked by how many of the remaining strings they match. This process is repeated until all strings are covered. This is a purely greedy algorithm.
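The component generation and greedy selection described above can be sketched as follows. This is a simplified illustration and not Norvig's code: it caps the n-gram length at `max_n` for tractability, omits the repetition characters, escapes literal characters, adds each whole adversary string as an anchored component so that a cover always exists, and breaks ties in favor of shorter components. All function names are assumptions.

```python
import itertools
import re


def ngrams(s, max_n):
    """All substrings of s with length 1..max_n."""
    return {s[i:i + n] for n in range(1, max_n + 1) for i in range(len(s) - n + 1)}


def dotify(gram):
    """Variants of gram with every subset of characters replaced by '.'."""
    choices = [(re.escape(ch), ".") for ch in gram]
    return {"".join(combo) for combo in itertools.product(*choices)}


def components(adversary, non_adversary, max_n=4):
    """Candidate components that match no string in the non-adversary set."""
    candidates = {"^" + re.escape(s) + "$" for s in adversary}  # guarantees a cover
    for s in adversary:
        for gram in ngrams(s, max_n):
            candidates |= dotify(gram)
    return {c for c in candidates
            if not any(re.search(c, n) for n in non_adversary)}


def regex_golf(adversary, non_adversary, max_n=4):
    """Greedily pick the component covering the most uncovered adversary strings."""
    if set(adversary) & set(non_adversary):
        raise ValueError("datasets must be disjoint")
    pool = sorted(components(adversary, non_adversary, max_n))
    uncovered = set(adversary)
    solution = []
    while uncovered:
        best = max(pool, key=lambda c: (sum(1 for s in uncovered if re.search(c, s)),
                                        -len(c)))
        solution.append(best)
        uncovered = {s for s in uncovered if not re.search(best, s)}
    return "|".join(solution)
```

Because every component in the pool has already been filtered against the non-adversary set, the final disjunction matches no non-adversary string by construction.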
V Case Study: Self-Adaptive Block List
In order to adequately test the TTK framework, we selected a common cyber security problem that could be solved by our framework, collected and processed data for an extended period of time, and performed experiments to validate our solution. A common task in network defense involves developing and deploying rules to detect and block potentially malicious network traffic. Therefore, we selected this problem as an initial case study.
V-A Collect Data
The raw data are the ingress and egress packets collected at a corporate network border. The majority of these data are Hypertext Transfer Protocol (HTTP) and HTTP over Transport Layer Security (HTTPS) flows initiated by users inside the network as they visit various sites on the internet. A streaming analysis tool was used to decode the HTTP flows and extract the metadata/features.
Note that the Event Labeler attached the bootstrapping labels from a blacklist to the HTTP flows. After these labels were attached, the initial adversary and non-adversary datasets were created from which the initial regular expressions were derived. The Event Labeler was no longer needed at this point as future labels were provided by the regular expressions themselves.
Two different methods were developed to process the regular expression matching. For ease of initial development of the overall system, the WaterSlide open source software tool was used, which is a modular metadata processing engine that is highly optimized for processing streaming data. WaterSlide has a module that uses the standard re2 library, a highly optimized regular expression engine, for processing regular expressions. While the re2 library is very efficient, processing complex regular expressions is computationally intensive and represented a significant performance bottleneck for the system.
As part of the research to develop the TTK system, a module for WaterSlide was developed that utilizes an FPGA-based regular expression processing accelerator, i.e., the Neural Processing Unit (NPU). The architecture for the NPU was motivated by the latest understanding of neuromorphic processing models and greatly accelerated the processing of many regular expressions, as demonstrated in Fig. 2. The implementation of the NPU was a PCI-attached FPGA development board. The NPU API was transparently implemented, completely hiding the complexity of the NPU in a generally available library.
V-B Ingest Data
The output of the data collection, the HTTP metadata and the labels of any regular expressions that matched, was loaded into a database. One of the challenges was how to seed or bootstrap the system to create the initial set of regular expressions. Recall that our approach relied on a dataset representing adversarial activity and a dataset representing non-adversarial activity to generate the regular expressions. Initially, there were no deployed regular expressions, hence they could not be used to split the incoming stream into the two datasets. To create the initial datasets, a blacklist was used to classify the various domains into categories (e.g., ads, news, e-commerce, banking, and dating) after the HTTP data had been loaded into the database. In an actual cyber security context, these initial labels (or more generally IOCs) would have been extracted by cyber security analysts in the context of a forensics investigation into a breach.
The output of the Data Stream Processor is a set of tab-separated values (TSV) files containing the HTTP metadata and the labels of any regular expressions that matched. We developed an ingest engine to read the TSV files and then, using the Apache Phoenix API, loaded those into a database, Apache HBase.
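The shape of such an ingest step can be sketched as below. To keep the example self-contained, an in-memory SQLite database stands in for Phoenix and HBase; the table name `http_events`, the three-column layout (timestamp, domain, labels), and the function name `ingest_tsv` are hypothetical and are not the schema used in this work.

```python
import csv
import io
import sqlite3


def ingest_tsv(tsv_file, conn):
    """Load TSV rows of (timestamp, domain, labels) into a SQL table.
    SQLite is only a stand-in here for the Phoenix/HBase store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS http_events (ts TEXT, domain TEXT, labels TEXT)"
    )
    reader = csv.reader(tsv_file, delimiter="\t")
    rows = [tuple(row) for row in reader if len(row) == 3]  # skip malformed rows
    conn.executemany("INSERT INTO http_events VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

Phoenix exposes a SQL interface over HBase, so a production version would issue comparable statements through a Phoenix client, using Phoenix's `UPSERT` in place of `INSERT`.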
V-C Partition Data
Our solution required that the incoming HTTP flows be split into the dataset of interest and the dataset not of interest, based on the specific task that an analyst might want to automate. Then, a set of regular expressions to detect the dataset of interest was derived and deployed to the sensor. After deployment, it was the regular expressions themselves, not the bootstrapping labels from the blacklist, that bifurcated the incoming flows into the appropriate datasets.
V-D Generate Model
Once the data has been split into the two datasets, we play Regex Golf to generate the more up-to-date regular expressions to bracket the threat. Our implementation is based on a solution by Peter Norvig [9, 10]. Norvig’s solution requires that the datasets be disjoint, i.e., no string can appear in both the adversary and the non-adversary datasets. This assumption may be problematic in the cyber security context. For example, a network sensor may detect a downloaded file that could appear both in conjunction with a malware toolset but also occur normally as part of a legitimate toolset. A good example of this is the netcat program, which is often an indicator of malware, but is also a legitimate sysadmin tool. In fact, many antivirus products will flag a zip file if it contains netcat as potential malware. Norvig’s algorithm will also always find a perfect solution, i.e., it will find a set of regular expressions that match everything in the adversary dataset and nothing in the non-adversary dataset. Relaxing these two constraints is an area of future work that might lead to an algorithm that is better-suited for deployment. Section IV provides more detail on this algorithm.
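Because the algorithm requires disjoint datasets, a pre-processing step is needed whenever the same string is observed in both. The sketch below shows one possible policy, which is an assumption rather than the approach taken in this work: strings that appear in both datasets are removed from both and returned separately for analyst review. The function name `make_disjoint` is hypothetical.

```python
def make_disjoint(adversary, non_adversary):
    """Remove strings seen in both datasets and report them separately."""
    overlap = set(adversary) & set(non_adversary)
    clean_adversary = sorted(set(adversary) - overlap)
    clean_non_adversary = sorted(set(non_adversary) - overlap)
    return clean_adversary, clean_non_adversary, sorted(overlap)
```

Other policies, such as assigning ambiguous strings to the majority class, are equally plausible; the right choice depends on how costly false positives are in the deployment.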
V-E Validate Model
We developed a method to validate the regular expressions generated by the Regex Golf model. First, a vector of tuples for each dataset showing the true class of each string (e.g., adversary) and the class predicted by the algorithm was created. These vectors were passed to the scikit-learn library to calculate standard metrics such as precision, recall, and accuracy. Given the limitations of the current algorithm, these metrics are all currently perfect. However, if we relax the perfect classification constraint, these metrics could become meaningful.
V-F Deploy Model
The regular expressions learned by the Regex Golf model were written to a file, along with appropriate labels, after the generation and validation processes. The streaming analysis engine recognized that the file had been modified, which triggered a reload of that file and its associated regular expressions and labels. These regular expressions were compiled by the NPU and then the NPU was used to accelerate application of the regular expressions to the network data. At this point, the system had returned to the beginning of the pipeline, and data were collected using the newly deployed regular expressions. This created new adversary and non-adversary datasets, and the automated, cyclic process continued.
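The reload-on-modification behavior can be sketched as follows. This is an illustration in Python, not the WaterSlide or NPU implementation: the class name `RuleFile`, the one-rule-per-line `label<TAB>regex` file layout, and the polling of the file's modification time are all assumptions.

```python
import os
import re


class RuleFile:
    """Holds labeled regexes loaded from a file and reloads them when it changes."""

    def __init__(self, path):
        self.path = path
        self.mtime_ns = None
        self.rules = []

    def reload_if_changed(self):
        """Recompile the rules if the file's modification time has changed."""
        mtime_ns = os.stat(self.path).st_mtime_ns
        if mtime_ns == self.mtime_ns:
            return False
        rules = []
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue
                label, pattern = line.split("\t", 1)
                rules.append((label, re.compile(pattern)))
        self.rules = rules
        self.mtime_ns = mtime_ns
        return True

    def labels_for(self, text):
        """Return the labels of every rule that matches the text."""
        return [label for label, regex in self.rules if regex.search(text)]
```

A stream processor built this way would call `reload_if_changed()` periodically between batches of events, so that newly generated expressions take effect without restarting the sensor.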
VI Experimental Results
As an initial experiment, we investigated the problem of tracking advertising domains. Data was collected at a corporate network border for two weeks, resulting in approximately 23 million unique HTTP flows. Of those flows, roughly 34% were identified as advertising domains by the list of domain categories, which was used as ground truth.
The TTK problem is analogous to finding positive examples in a binary classification task. Therefore, some common metrics for estimating classification performance were used. The true positive rate (TPR) is the fraction of positive examples that were correctly identified by the model and is represented as $\mathrm{TPR} = \mathrm{TP}/\mathrm{P}$, where TP is the number of positives correctly classified and P is the total number of positive samples. The false positive rate (FPR) is the fraction of negative examples (non-advertising domains) that were identified as advertising domains. It is defined as $\mathrm{FPR} = \mathrm{FP}/\mathrm{N}$, where FP is the number of negatives incorrectly classified and N is the total number of negative samples. A common metric for summarizing the detection performance is the Receiver Operating Characteristic (ROC) curve, which plots the FPR (on the $x$-axis) against the TPR (on the $y$-axis). Computing the area under the ROC curve (AUC) provides a single metric for summarizing detection performance. An AUC value of 1.0 represents perfect detection performance, while an AUC score of 0.5 means that the model does no better than random.
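These definitions translate directly into code. The sketch below computes TPR and FPR from (true label, predicted label) pairs. Since a regular expression yields a hard label rather than a score, the sketch also assumes that the AUC is taken from the single-point ROC curve through (0, 0), (FPR, TPR), and (1, 1), which has area (1 + TPR - FPR) / 2; whether the reported AUC values were computed this way is not stated in the text. The function names are hypothetical.

```python
def rates(pairs):
    """Return (TPR, FPR) from an iterable of (true_label, predicted_label) pairs."""
    pairs = list(pairs)
    positives = sum(1 for truth, _ in pairs if truth)
    negatives = len(pairs) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")
    true_pos = sum(1 for truth, pred in pairs if truth and pred)
    false_pos = sum(1 for truth, pred in pairs if not truth and pred)
    return true_pos / positives, false_pos / negatives


def single_point_auc(tpr, fpr):
    """Area under the ROC curve through (0, 0), (fpr, tpr), and (1, 1)."""
    return (1 + tpr - fpr) / 2
```

For example, with three of four positives detected and one of four negatives misclassified, TPR is 0.75, FPR is 0.25, and the single-point AUC is 0.75.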
Fig. 3 shows the cumulative performance for window sizes of 1,000 and 10,000. The window size indicates how many domains are used to seed the models, as well as how often metrics are collected. The naïve solution is simply the list of advertising domains seen in the first window and is analogous to a standard blocklist that only utilizes domain names. For the naïve solution, the FPR did not change from 0.0 because our ground truth doesn’t change. In other words, the same domains that were marked ads in the beginning are still labeled ads as we continue the experiment. We also see that for the naïve solution the TPR went down because new domains were seen that were in the advertising category, but weren’t in the original list of advertising domains used for detection.
For the Regex Golf solution, the FPR went up because the learned regular expression marked domain names as advertisements that were not. This increase is an unfortunate side effect of the self-training paradigm in that errors tend to propagate. On the other hand, the TPR remained steady or decreased only slightly because the learned regular expression had some generalizability and could correctly recognize some previously unseen domains as advertisements.
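The contrast between the two solutions can be illustrated with a toy example. The domains and the pattern `^ads` below are invented for illustration and are not taken from the experimental data.

```python
import re

# Hypothetical seed window: advertising domains observed at the start.
seed_ads = ["ads.example.com", "ads.shop.net"]

naive_blocklist = set(seed_ads)     # exact-match list of domain names
learned = re.compile(r"^ads")       # hypothetical regex learned from the seed


def naive_predict(domain):
    return domain in naive_blocklist


def regex_predict(domain):
    return learned.search(domain) is not None


unseen_ad = "ads.news.org"          # new advertising domain (positive)
benign = "adsorption-lab.edu"       # non-advertising domain (negative)
```

Here the naive list misses the unseen advertising domain, which lowers its TPR over time, while the learned expression catches it. The same expression also flags the benign domain, and under self-training that false positive would be fed back into the adversary dataset for the next window.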
For the smaller window size, the naïve solution shows a 60% decrease in ability to detect positives, while the Regex Golf solution only shows a 22% decrease over time with a negligible difference in overall detection performance. For the larger window size, the naïve solution shows a 50% decrease in ability to detect positives, while the Regex Golf solution only shows a 33% decrease over time. The Regex Golf solution has a 6% lower overall detection performance than the naïve solution. In summary, it is apparent that the Regex Golf solution is finding more positive instances with only a slight degradation in performance, as indicated by the slightly lower AUC scores.
VII Conclusions and Future Work
In this work, we demonstrated the ability to keep IOCs (i.e., regular expressions) up-to-date with an evolving adversary. This will allow network sensors to continue to bracket these existing threats as they change their TTPs. Not losing the ability to detect these adversaries will save countless hours in incident response because they will be identified before they have breached our networks or at least before they have spread laterally. It will also prevent these adversaries from accomplishing their objectives, e.g., exfiltrating desired information.
The current solution for generating regular expressions requires that the two datasets are disjoint. This constraint is not realistic to impose in a production cyber security context. In addition, the current model will always find a perfect solution; as the datasets grow larger, this may prevent the model from ever converging. Relaxing this constraint could lead to much faster build times, again advantageous in a production cyber security context.
Currently, our system is only utilizing a single feature: the domain name. By utilizing other features of HTTP traffic, we anticipate even better results. Given the performance increase provided by the NPU, deploying a larger set of regular expressions that match on different features is feasible, perhaps by using a custom set multicover optimization. Additionally, our framework allows for more complex models to be employed. Any algorithm that can partition data or solve a learning task can be supported by the framework. Therefore, trying more advanced methods, such as supervised learning or deep learning, is also an area of future work. Deep learning can also be used in the Data Stream Processor to transform features into more appropriate representations for learning.
Finally, we would like to deploy this in cyber security operations using the regular expressions discovered in a forensics investigation as the first set of deployed regular expressions. We will then be able to quantify how well the system brackets an adversary and their changing TTPs over time.
The authors acknowledge financial support from Sandia National Laboratories’ Laboratory Directed Research and Development Program, and specifically the Hardware Acceleration of Adaptive Neural Algorithms (HAANA) Grand Challenge Project. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.
-  G. Argyros, I. Stais, A. Kiayias, and A. D. Keromytis, “Back in black: towards formal, black box analysis of sanitizers and filters,” in Security and Privacy (SP), 2016 IEEE Symposium on. IEEE, 2016, pp. 91–109.
-  A. Bartoli, A. De Lorenzo, E. Medvet, and F. Tarlao, “Playing regex golf with genetic programming,” in Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation. ACM, 2014, pp. 1063–1070.
-  M. Becchi and P. Crowley, “Extending Finite Automata to Efficiently Match Perl-compatible Regular Expressions,” in Proceedings of the 2008 ACM CoNEXT Conference. ACM, 2008, p. 25.
-  P. Dlugosch, D. Brown, P. Glendenning, M. Leventhal, and H. Noyes, “An efficient and scalable semiconductor architecture for parallel automata processing,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 12, pp. 3088–3098, 2014.
-  C. T. Do, N. H. Tran, C. Hong, C. A. Kamhoua, K. A. Kwiat, E. Blasch, S. Ren, N. Pissinou, and S. S. Iyengar, “Game Theory for Cyber Security and Privacy,” ACM Computing Surveys (CSUR), vol. 50, no. 2, p. 30, 2017.
-  D. Follett, D. Townsend, G. Karpman, J. Naegle, R. Suppona, J. Aimone, and C. James, “Neuromorphic Data Microscope,” Neuromorphic Computing Symposium, forthcoming.
-  R. M. Karp, “Reducibility among combinatorial problems,” in Complexity of computer computations. Springer, 1972, pp. 85–103.
-  S. C. Kleene, “Representation of events in nerve nets and finite automata,” DTIC Document, Tech. Rep., 1951.
-  P. Norvig. (2014) xkcd 1313: Regex Golf. Retrieved 2017-11-10. [Online]. Available: http://nbviewer.jupyter.org/url/norvig.com/ipython/xkcd1313.ipynb?create=1
-  ——. (2014) xkcd 1313: Regex Golf (Part 2: Infinite Problems). Retrieved 2017-11-10. [Online]. Available: http://nbviewer.jupyter.org/url/norvig.com/ipython/xkcd1313-part2.ipynb
-  O. Chapelle, B. Schölkopf, and A. Zien, Eds., Semi-Supervised Learning. MIT Press, 2006.
-  P. Prasse, C. Sawade, N. Landwehr, and T. Scheffer, “Learning to identify concise regular expressions that describe email campaigns,” Journal of Machine Learning Research, vol. 16, pp. 3687–3720, 2015.
-  Open Source Community. (2016) WaterSlide. Retrieved 2017-11-10. [Online]. Available: https://github.com/waterslideLTS/waterslide
-  Apache Software Foundation. (2017) Apache HBase. Retrieved 2017-11-10. [Online]. Available: https://hbase.apache.org/
-  ——. (2017) Apache Phoenix. Retrieved 2017-11-10. [Online]. Available: https://phoenix.apache.org/
-  Explain xkcd. (2016) 1313: Regex Golf. Retrieved 2017-11-10. [Online]. Available: https://www.explainxkcd.com/wiki/index.php/1313:_Regex_Golf
-  Google. (2010) re2: A Principled Approach to Regular Expression Matching. Retrieved 2017-11-10. [Online]. Available: http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html
-  Machine Learning Lab. (2017) Machine Learning Lab. Retrieved 2017-11-10. [Online]. Available: http://machinelearning.inginf.units.it/
-  Open Source Community. (2017) scikit-learn. Retrieved 2017-11-10. [Online]. Available: http://scikit-learn.org/
-  URLBlacklist.com. (2017) URLBlacklist.com. Website no longer available. [Online]. Available: http://www.urlblacklist.com/