Malware attack intelligence describes how attacks work, their tactics, techniques, and procedures (TTPs), and the technology vulnerabilities that the malware exploits. This intelligence can equip security researchers with information to build better defenses against advanced cyber attacks and issue early warnings about future threats. Evidence of attack intelligence is spread across thousands of diverse and heterogeneous sources on the Internet, collectively known as cyber threat intelligence (CTI). Discerning and utilizing this knowledge speedily and accurately for longitudinal studies mandates the rigorous development of techniques that constantly evolve and adapt to the complexities of malware attacks. Therefore, threat intelligence is best utilized when it is timely, actionable, contextual, understandable, and shared in a universally acceptable format.
CVE (https://cve.mitre.org/) and NVD (https://nvd.nist.gov/) are vulnerability tracking programs in which information is combined through a centralized platform in a semi-structured way. Industry standards like Structured Threat Information eXpression (STIX) (stix) and Trusted Automated eXchange of Indicator Information (TAXII) (connolly2014trusted) provide a language-agnostic framework for storing, sending, and receiving packages. However, despite concerted efforts by experts towards organizing malware threat information, there is a lack of context and transparency in sharing CTI. Analysts require more than data-driven threat intelligence. They seek trustworthy data sources, relevant threat indicators, sources and motivations behind attacks, and the likelihood of an attack. They also demand context to get a comprehensive picture of the threats, victims, and the distinct tactics an attacker deploys.
Malware threat ontologies, like MalONT2.0, enable communicating contextual CTI feeds by representing them in a structured format that encapsulates data and information characterized by properties that vary according to context. Our main contributions are as follows:
We present MalONT2.0, an ontology for capturing malware threat intelligence through classes and relations that combine semantic features (such as malware, attacker, infrastructure) with syntactic features and factual data (extracted from VirusTotal, virustotal.com).
We provide annotated CTI reports on Android malware attacks, where each annotation is instantiated into a class and shares a relationship with other instances. Both classes and relations are defined and described in MalONT2.0 and share similarities with the STIX2.1 framework, where applicable. The instances are stored as RDF triples (pujara2013knowledge).
We provide a dynamically growing knowledge graph generated by automatically feeding new CTI reports into the existing KG. We demonstrate the use of this knowledge graph through three queries (shown in Section 3).
1.1. What’s new in MalONT2.0
MalONT2.0 is a significant improvement over the prior version (rastogi2020malont), as it comprehensively captures the semantic, syntactic, and factual description of a malware threat. The prior version emphasized contextual information coming from semantic data and contained only partial factual data. Its main source of CTI was unstructured threat reports written on diverse threats such as malware and APTs. The latest knowledge graph is generated from CTI reports that focus exclusively on Android malware threats. Designing an ontology requires answering competency questions that provide wide coverage of the domain. These questions (or narrower versions of them) are validated by confirming the answers to queries on the instantiated triples. While building MalONT2.0, we updated the competency questions to the following:
Find all missing intelligence from the KG relating to various attack vectors - malware, actors (attacker, attacker-group, organizations, country), infrastructure (software, applications, platform, infrastructure used, TTPs). This information may be spread across triples generated across multiple CTI reports describing an attack vector.
Triples collected from CTI reports from a wide range of sources can be aggregated with syntactic intelligence from a source like VirusTotal to provide a richer description of the attack vector.
Identifying similar properties and grouping attack vectors can reveal latent behaviors and can be used in predictive models to forecast future events, both short- and long-term.
1.2. Example Knowledge Graph
See Figure 4 (left) for a sub-graph of MalKG. A CTI report contains information about a malware, Pegasus, which is mapped to the Malware class. This malware is also known as Chrysaor; it logs user keystrokes and leaks the data of popular apps. The CTI report contains the hash for a single sample of the malware. According to VirusTotal, this sample was first seen in April 2017.
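The Pegasus example above can be written down as a handful of subject-predicate-object triples. The following is a minimal sketch in plain Python, not the MalKG implementation: the relation names (`hasAlias`, `hasCapability`, `hasSample`, `firstSeen`), the `Hash` class name, and the `<sample-hash>` placeholder are assumptions made for illustration, since the actual property names and the hash value are not given in the text.

```python
from collections import defaultdict

# Triples taken from the CTI report (semantic information).
# Relation names are hypothetical placeholders, not MalONT2.0 property names.
REPORT_TRIPLES = [
    ("Pegasus", "type", "Malware"),
    ("Pegasus", "hasAlias", "Chrysaor"),
    ("Pegasus", "hasCapability", "logs user keystrokes"),
    ("Pegasus", "hasCapability", "leaks data of popular apps"),
    ("Pegasus", "hasSample", "<sample-hash>"),  # real hash not given in the text
]

# Triples derived from VirusTotal for the same sample (factual information).
VIRUSTOTAL_TRIPLES = [
    ("<sample-hash>", "type", "Hash"),
    ("<sample-hash>", "firstSeen", "2017-04"),
]


def build_graph(*triple_sets):
    """Index triples as graph[subject][predicate] -> set of objects."""
    graph = defaultdict(lambda: defaultdict(set))
    for triples in triple_sets:
        for subject, predicate, obj in triples:
            graph[subject][predicate].add(obj)
    return graph


graph = build_graph(REPORT_TRIPLES, VIRUSTOTAL_TRIPLES)

# Follow the malware -> sample -> first-seen path across both sources.
sample = next(iter(graph["Pegasus"]["hasSample"]))
print(sorted(graph["Pegasus"]["hasAlias"]))
print(sorted(graph[sample]["firstSeen"]))
```

Merging both sources into one index is what lets a single lookup move from report-level facts about the malware to sample-level facts about its hash.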
2. System Architecture
The proposed framework dynamically gathers unstructured CTI reports from the Internet. It extracts threat intelligence information in the form of RDF triples and assigns them classes and relations from MalONT2.0 (see Figure 1), forming a new knowledge graph that it appends to the existing MalKG. In this section, we describe the main components of this framework.
CTI reports corpus
– MalONT2.0 is used to instantiate 25 CTI reports written between 2011 and 2021 and downloaded from the Internet. These reports were authored by analysts from security organizations such as McAfee Labs. We followed the process of natural language annotation for machine learning (pustejovsky2013natural) and created mutually agreed-upon annotation guidelines, including a tie-breaking process managed by a security expert. The annotated text has approximately 3,400 tags extracted by annotators using BRAT (https://brat.nlplab.org/), resulting in 1,100 entities and 2,300 relations. Triples generated from these are the structural components of MalKG that capture large-scale facts related to Android malware threat intelligence.
– Semantic text patterns map to classes (called entities in the KG) and object properties (called relations in the KG) defined by MalONT2.0. An ideal open-source ontology can systematically capture cyber threat and attack information (facts and analysis) to model the contents of CTI reports. For instance, in MalONT2.0, three classes largely describe malware behavior: Malware, Vulnerability, and Indicator. Instances of these classes can equip analysts with information on malware behavior and TTPs. See Figure 3 for a snapshot of an instance of MalONT2.0. See the GitHub repository (https://github.com/aiforsec/DemoCCS2021/) for the annotated text and corresponding triples for all CTI reports.
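The step from BRAT annotations to triples described above can be sketched as follows. This is an illustrative parser for BRAT's standoff format (tab-separated `T` lines for text-bound entities and `R` lines for binary relations), not the authors' conversion code; the sample annotation and the `hasAlias` relation label are hypothetical.

```python
def parse_brat_ann(ann_text):
    """Turn BRAT standoff annotations into entities and (head, relation, tail) triples."""
    entities = {}   # annotation id -> (class label, surface text)
    relations = []  # (relation label, arg1 id, arg2 id)
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            # e.g. "T1<TAB>Malware 0 7<TAB>Pegasus"
            label = fields[1].split(" ")[0]
            entities[ann_id] = (label, fields[2])
        elif ann_id.startswith("R"):
            # e.g. "R1<TAB>hasAlias Arg1:T1 Arg2:T2"
            relation, arg1, arg2 = fields[1].split()
            relations.append((relation, arg1.split(":")[1], arg2.split(":")[1]))
    # Resolve ids only after the full pass, since relation lines may
    # precede the entity lines they refer to.
    triples = [
        (entities[a1][1], relation, entities[a2][1])
        for relation, a1, a2 in relations
    ]
    return entities, triples


# Hypothetical annotation of the sentence "Pegasus, also known as Chrysaor".
SAMPLE_ANN = "\n".join([
    "T1\tMalware 0 7\tPegasus",
    "T2\tMalware 23 31\tChrysaor",
    "R1\thasAlias Arg1:T1 Arg2:T2",
])

entities, triples = parse_brat_ann(SAMPLE_ANN)
print(entities)
print(triples)
```

Other BRAT annotation types (events, attributes, notes) are ignored here; a full converter would need to decide how each maps to the ontology.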
Knowledge Graph Generation and Querying– We construct a knowledge graph corresponding to the malware ontology by populating it with triple instances derived from actual CTI reports. One may question whether a knowledge graph is necessary, given that the instantiated ontology already captures facts and knowledge about the domain. The answer lies in two key features of knowledge graphs: the ability to reason on deduced information and the ability to infer latent information. MalKG captures properties connecting nodes (also called entities) and employs a reasoner to draw associations among entities that would otherwise not be recognized.
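To make the idea of inferring latent associations concrete, here is a toy forward-chaining sketch. It is not the reasoner used by MalKG; it assumes just two hand-written rules (an alias relation is symmetric, and facts stated about one alias also hold for the other), and the `hasAlias` and `targets` relation names are hypothetical.

```python
def infer(triples, alias_relation="hasAlias"):
    """Return triples implied by two toy rules, applied until nothing new appears.

    Rule 1: alias_relation is symmetric.
    Rule 2: any non-alias fact about an entity also holds for its aliases.
    """
    known = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        # Rule 1: (a, alias, b) implies (b, alias, a).
        for s, p, o in known:
            if p == alias_relation:
                new.add((o, p, s))
        alias_pairs = {(s, o) for s, p, o in known | new if p == alias_relation}
        # Rule 2: copy facts across each alias pair.
        for a, b in alias_pairs:
            for s, p, o in known:
                if s == a and p != alias_relation:
                    new.add((b, p, o))
        if not new <= known:
            known |= new
            changed = True
    return known - set(triples)


# Two facts that could come from two different CTI reports.
asserted = [
    ("Pegasus", "hasAlias", "Chrysaor"),
    ("Chrysaor", "targets", "Android"),
]
inferred = infer(asserted)
print(sorted(inferred))
```

Neither report states that Pegasus targets Android, yet the association follows once the alias link is taken into account, which is the kind of connection a plain list of instantiated facts would not surface.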
Dynamically generated KG– In addition to the MalKG generated from annotated triples, we also present a dynamically growing knowledge graph called TINKER. CTI reports are regularly published by security and technology companies. These reports, especially those freely available for public access, are shared on the Twitter platform by company personnel. We have built an interface in Python that uses the academic Twitter API to extract unique occurrences of Android malware CTI reports. Triple extraction models trained on the annotated instances are then used for dynamic extraction from frequent batches of CTI reports.
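The "unique occurrences" step can be illustrated without touching the Twitter API at all. The sketch below assumes tweets have already been fetched and are available as dictionaries with a `text` field containing expanded URLs; the field name, the keyword filter, and the sample tweets are all assumptions, and the actual interface may filter and deduplicate differently.

```python
import re
from urllib.parse import urlsplit, urlunsplit

URL_PATTERN = re.compile(r"https?://\S+")


def normalize_url(url):
    """Drop query string, fragment, and trailing punctuation so that
    differently decorated links to the same report compare equal."""
    parts = urlsplit(url.rstrip(".,);"))
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path.rstrip("/"), "", "")
    )


def unique_report_urls(tweets, keywords=("android", "malware")):
    """Return first-seen-ordered unique URLs from tweets mentioning all keywords."""
    seen = set()
    ordered = []
    for tweet in tweets:
        text = tweet["text"]
        lowered = text.lower()
        if not all(keyword in lowered for keyword in keywords):
            continue
        for raw_url in URL_PATTERN.findall(text):
            url = normalize_url(raw_url)
            if url not in seen:
                seen.add(url)
                ordered.append(url)
    return ordered


# Hypothetical tweets; example.com stands in for a vendor blog.
SAMPLE_TWEETS = [
    {"text": "New Android malware report: https://example.com/reports/golden-cup?utm_source=tw"},
    {"text": "ICYMI our android malware analysis https://Example.com/reports/golden-cup/"},
    {"text": "Quarterly earnings call https://example.com/ir"},
]

report_urls = unique_report_urls(SAMPLE_TWEETS)
print(report_urls)
```

Each URL that survives this step would then be downloaded and passed to the triple extraction models in batches.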
3. Outline of Poster and Demonstration
Our demonstration describes the MalONT2.0 ontology and briefly compares it with other ontologies, namely UCO (uco), MalONT (our prior work) (rastogi2020malont), and the Swimmer ontology (swimmer), arguably the first malware ontology. These ontologies have been chosen for comparison based on the competency questions defined in Section 1.1. We have also prepared a live demonstration of the MalONT2.0 ontology, as well as queries for the knowledge graph. A few queries of the knowledge graph are visualized using Neo4j Bloom (https://neo4j.com/developer/neo4j-bloom/) in Figures 2, 3, and 4.
3.1. MalONT2.0 based annotations
First, we will demonstrate (a) MalONT2.0 in Protégé, showing all the classes, sub-classes, and their descriptions, (b) the configuration of BRAT (https://brat.nlplab.org/) using the OWL file generated by MalONT2.0 prior to annotating CTI reports, (c) the semantic features in MalONT2.0 based on the STIX2.1 framework and the syntactic features and factual data based on VirusTotal, and (d) triple generation in the form of RDF from the annotated text, along with the assignment of classes and relations. See Figure 5 for a snippet of a McAfee threat report on a spyware, "Golden Cup".
3.2. Demonstrate MalKG
We will demonstrate (a) MalKG generated using only annotated triples, (b) MalKG generated using all CTI reports collected so far using the dynamically growing platform, (c) an entire sub-graph of a malware extracted from the larger KG. This will include all the triples connected to this malware including those generated from the CTI reports and VirusTotal (see Figure 4).
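Item (c) above, pulling the sub-graph around one malware out of the larger KG, amounts to a bounded graph traversal. The following sketch works on an in-memory list of triples rather than on the Neo4j store used in the demonstration; the relation names, the `<sample-hash>` placeholder, and the unrelated `OtherMalware` entry are hypothetical.

```python
from collections import deque


def extract_subgraph(triples, root, max_hops=2):
    """Collect every triple reachable from `root` within `max_hops` edges,
    following edges in either direction."""
    neighbours = {}
    for triple in triples:
        subject, _, obj = triple
        neighbours.setdefault(subject, []).append((obj, triple))
        neighbours.setdefault(obj, []).append((subject, triple))

    visited = {root}
    queue = deque([(root, 0)])
    selected = set()
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for other, triple in neighbours.get(node, []):
            selected.add(triple)
            if other not in visited:
                visited.add(other)
                queue.append((other, depth + 1))
    return selected


KG = [
    ("Pegasus", "hasAlias", "Chrysaor"),           # from a CTI report
    ("Pegasus", "hasSample", "<sample-hash>"),     # from a CTI report
    ("<sample-hash>", "firstSeen", "2017-04"),     # from VirusTotal
    ("OtherMalware", "hasSample", "<other-hash>"), # unrelated entry
]

pegasus_subgraph = extract_subgraph(KG, "Pegasus", max_hops=2)
for triple in sorted(pegasus_subgraph):
    print(triple)
```

With a hop limit of two, the result contains both the report-derived triples attached to the malware and the VirusTotal-derived triple attached to its sample, while the unrelated entry is left out.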
3.3. Run queries on MalKG
4. Conclusion and Future Work
We present the MalONT2.0 ontology for capturing contextual threat intelligence from heterogeneous sources. CTI reports provide semantic information, which is used to instantiate the classes and relations of MalONT2.0. VirusTotal provides additional syntactic information for hashes that occur in the CTI reports, and the graph generated from this is appended to the existing MalKG. We also propose a dynamic knowledge graph, TINKER. For future work, we plan to validate the triples generated for TINKER and to demonstrate the use of TINKER for forecasting threat vectors.
This work is supported by the IBM AI Research Collaboration (AIRC). The authors would like to thank RPI researchers Erin Turnbull and Yueting Liao for their support.