Automated Dataset Generation System for Collaborative Research of Cyber Threat Intelligence Analysis

11/25/2018 ∙ by Daegeon Kim, et al. ∙ Korea University

The objectives of cyber attacks are becoming more sophisticated, and attackers conceal their identity by disguising their characteristics as those of others. Cyber Threat Intelligence (CTI) analysis is gaining attention as a way to generate meaningful knowledge for understanding the intention of an attacker and, eventually, for making predictions. Developing analysis techniques requires a high-volume, high-quality dataset. However, the organizations that hold useful data do not release it to the research community because they do not want to disclose the threats they face or the data assets they hold. Because of this data inaccessibility, academic research tends to be biased toward techniques for the steps of the CTI process other than the analysis and production step. In this paper, we propose an automated dataset generation system named CTIMiner. The system collects threat data from publicly available security reports and malware repositories and stores it in a structured format. We release the source code and the dataset to the public; the dataset includes about 628,000 records from 423 security reports published from 2008 to 2017. We also present statistical features of the dataset and the techniques that can be developed using it. Moreover, we demonstrate one application example of the dataset that analyzes the correlation and characteristics of incidents. We believe our dataset promotes collaborative research on threat information analysis to generate CTI.

1 Introduction

Cyber Threat Intelligence (CTI) is evidence-based knowledge including context, mechanisms, indicators, implications and actionable advice regarding existing or emerging threats to assets McMillan [2013]. One can use CTI to gain broad situational awareness, to collaborate with others in defeating the cyber threats one faces, and to prevent cyber threats by applying CTI to defense systems.

With the increase of global cyber threats, CTI is gaining more attention as a response to such threats. Many nations and organizations also try to promote the use of CTI by enacting laws that legalize and encourage collecting CTI Council on Foreign Relations [2015], by sharing it through bilateral or multilateral cooperation cam [2016] ccd [2008] eni [2004] The White House [2015], and by establishing standards Johnson and Waltermire [2014] MIT [2013]. Furthermore, during the recent decade, the number of articles related to CTI has increased about 24 times, as can be seen in Fig. 1 (footnote 1: this is the Google Scholar search result with exact keyword matching of “cyber threat intelligence”, including patents and citations, on January 5th, 2018).

Figure 1: The number of articles related to CTI within the recent decade.

During the Olympic Winter Games PyeongChang 2018, a cyber attack took place targeting the server operated by the organizing committee. What drew attention to this case is that security researchers attributed the attack to actors from different countries. Rosenberg [2018] and GReAT [2018] respectively argued that Chinese and Russian threat actors were responsible for the cyber attack. Paul Rascagneres [2018] and Juan Andres Guerrero-Saade and Lesnewich [2018] pointed out that attribution could not be concluded from the small amount of code overlap with the malware used by Lazarus Group, the North Korean hacking group. In particular, GReAT [2018] argued that there was evidence that the Russian attacker tried to make the North Korean hacking group appear to be the culprit. From this example, we can expect that evidence-based, precise analysis that considers all possibly related cases is vitally important for CTI generation.

However, among the steps of the traditional intelligence process US Joint Chiefs of Staff [2013] (planning and direction, collection, processing and exploitation, analysis and production, and dissemination and integration), most technical research on CTI tends to focus on the steps other than analysis and production, which requires a real CTI dataset. Despite the many advantages of CTI analysis, such as 1) interoperability of data (machine, vendor, and organization independent), 2) compact expression of heterogeneous sources of threat information, and 3) the possibility of performing long-term and nation-wide threat analysis Kim et al. [2016], we believe that the most challenging aspect of the study is the limited accessibility of data to researchers. Some web services provide functionality to search threat data, but they do not offer a large and useful enough set of data for research purposes. Also, most datasets consist only of specific data types, e.g., IP, URL, or hash value, and access to some datasets is strictly restricted to certain regions or to people of particular nationalities.

In this paper, we propose a cyber threat dataset generation system named CTIMiner, which automatically collects data from public security reports and malware repository websites and stores it in a structured format. The generated dataset contains several types of data, such as malware analysis information consisting of file path, mutex, and code sign information, in addition to the data types listed above. The main contributions of our work are:

  • Promoting collaborative CTI analysis research by proposing the cyber threat data generation system and the public database

  • Demonstrating the usages of the dataset for correlation analysis

  • Suggesting the techniques to be developed to generate CTI from the dataset

It would be better to also introduce the techniques to generate CTI from the dataset. However, that is beyond the scope of this paper and remains future work. We believe that suggesting the techniques required for analyzing the dataset can inspire researchers and promote CTI analysis research.

The remainder of this paper is organized as follows. The intelligence process and its associations with CTI activities are presented in section 2, together with several studies related to each step. The overall system architecture of CTIMiner and the phases composing the run-time process are described in section 3. The dataset structure, the categories of data, and the statistical features are explained in section 4. After demonstrating the dataset usage and suggesting techniques to analyze it in section 5, we introduce access to the source code and dataset in section 6 and conclude the paper in section 7.

2 Related Works

2.1 Intelligence Process and Automated CTI Activities

In the field of military operations, a well-defined intelligence process has been adopted to efficiently generate intelligence from low-level data collected in the field in order to support decision making US Joint Chiefs of Staff [2013]. This process is intended to be followed by a human intelligence officer, but it can also be projected onto automated CTI activities.

Once the operation direction is determined to fulfill the identified intelligence requirement, the raw data is collected and extracted from the sensors and data sources that have the ability and the functionality to obtain it. The data gathered from the various sources is combined and converted into forms, in other words information, so that it can be analyzed efficiently. The information is passed into analysis algorithms, such as big data or machine learning based methods, which produce the intelligence that supports human analysts. Such intelligence is disseminated to others who have access to it. The shared intelligence can also be integrated into the intelligence one already possesses.

The association between the intelligence process and the automated CTI activities is illustrated in Fig. 2. In the following subsections, previous studies concerning CTI are introduced in the order of the intelligence process, except for the planning and direction step, since it is a strategic matter rather than a technical one.

Figure 2: The Association between The Intelligence Process and The Automated CTI Activities.

2.2 Collection

Since CTI is also a product of the threat data processing through the intelligence process, low-level threat data can be collected in this step. Goel classified the types of data to be collected into unstructured data and network data Goel [2011]. The former typically consists of hacker forum postings, blogs, and websites, and the latter is generated from information security systems such as firewalls, intrusion detection systems, and honeynets.

Benjamin et al. proposed the method of extracting information from hacker forums, IRC channels, and carding shops to identify threats to them Benjamin et al. [2015]. Also, Fachkha and Debbabi characterized the darknet and compared several methods to extract threat information from it Fachkha and Debbabi [2016].

As a data repository for research on cyber security analysis, IMPACT U.S. Department of Homeland Security [2014], which is the newest version of PREDICT, provides several types of datasets such as network flow data, IDS and firewall data, and unsolicited email data. It also provides useful tools for data analysis. However, the service is only available to the DHS-approved countries: United States, Australia, Canada, Israel, Japan, Netherlands, Singapore, and United Kingdom.

2.3 Processing and Exploitation

During the processing and exploitation step, collected raw data is converted into forms that can be readily used by intelligence analysts and other consumers. Unstructured data and heterogeneous sources of data having different structures can be stored in a unified data format in this step for further analysis.

STIX (Structured Threat Information eXpression) Barnum [2014] and OpenIOC MANDIANT [2011], proposed by MITRE and MANDIANT respectively, are representative standards to express threat data. In particular, STIX is widely used due to the scalability of its schema, which uses components such as CybOX (Cyber Observable eXpression), MAEC (Malware Attribute Enumeration and Characterization) and CAPEC (Common Attack Pattern Enumeration and Classification). Liao et al. proposed a method of extracting elements to construct structured data from unstructured data Liao et al. [2016]. One thing to notice in this approach is that the meaning of the elements in context can also be retrieved by using natural language processing techniques.

2.4 Analysis and Production

During the analysis and production step, all processed information is integrated, evaluated, analyzed, and interpreted to produce intelligence. Kornmaier and Jaouën argued that to generate operational or strategic intelligence beyond tactical intelligence, which is technical in nature, the threat data should be fused with data collected from different disciplines such as Human Based Intelligence (HUMINT), Imagery Intelligence (IMINT), Signal Intelligence (SIGINT), and Geographic Intelligence (GeoINT) Kornmaier and Jaouën [2014].

Modi et al. proposed an automated threat data fusion system that correlates data crawled from the web by applying a string-matching based approach Modi et al. [2016]. Similar commercial CTI services also exist, such as iDefense® IntelGraph by Verisign and the web intelligence engine by Recorded Future, which allow users to navigate through extensive threat data following string-matching correlation. One key feature of Recorded Future is that it can perform predictive analytics for specific future events by using information noticed ahead of time Truvé [2016]. However, the commercial services take an indicator-centric analysis approach, so it is hard to trace correlations between incidents.

Kim et al. proposed the general framework for efficient CTI correlation analysis by adopting the novel concept that expresses similarity between threat events in graphical structures Kim et al. [2016]. The graphical structures allow the analysts to trace the specifications and the transition of related cyber incidents to infer the attacker’s intention.

Using a threat report as the source of information, Qamar et al. [2017] proposed an automated mechanism to determine the risk that the threat analyzed in the report poses to a networked system. For this purpose, they defined an ontology of IoCs, network, associated risk, and the relations among them. For the risk analysis of the networked system, four parameters are defined: threat relevance, threat likelihood, total loss of affected assets, and threat reachability.

2.5 Dissemination and Integration

During the dissemination and integration step, intelligence is delivered to and used by the consumer. A guideline Johnson and Waltermire [2014] and a technical standard protocol Connolly et al. [2014] exist for sharing CTI. Also, MISP (Malware Information Sharing Platform, http://www.misp-project.org), MANTIS (Model-based Analysis of Threat Intelligence Sources, https://github.com/siemens/django-mantis) and CIF (Collective Intelligence Framework, http://csirtgadgets.org/collective-intelligence-framework) are useful open-source platforms to store and share CTI.

As more participants in a community share CTI, access control issues with the shared data often arise. Zhao and White proposed an access control model that extends the group-centric Secure Information Sharing (g-SIS) model to support collaborative information sharing in a community Zhao and White [2012]. Even though such assistive technologies promote CTI sharing, social and political issues, for example the authority to operate CTI sharing policies and trust management within a community, often remain obstacles to establishing collaborative CTI sharing.

2.6 Data, Information, and Intelligence

In much of the CTI-related literature, the terms data, information, and intelligence are often used interchangeably without clarification. We use them as defined in US Joint Chiefs of Staff [2013].

Data is the individual facts collected from sensors in the operational environment. Information is data gathered and processed into an intelligible form, and intelligence is the new understanding of current and past information that allows prediction of the future and informs decisions.

These definitions apply not only to the general intelligence process but also to CTI activities. In the context of data fusion and mining, Bass defined data as measurements and observations, information as data placed in context, indexed, and organized, and knowledge, which is equivalent to intelligence, as information explained or understood Bass [2000].

3 CTIMiner System Architecture

We propose a cyber threat data collecting system, CTIMiner, with the system architecture presented in Fig. 3. The CTI collecting procedure is composed of three phases. During the first phase, it gathers threat data from publicly accessible cyber intelligence reports published by organizations and companies. It also collects additional related data from malware repository during the second phase. Finally, all collected data is stored in the database after passing through the last phase that generates combined information in a structured format.

Figure 3: CTIMiner System Architecture.

3.1 Phase 1: Parsing Indicator of Compromise

This phase starts with collecting cyber intelligence reports that analyze cyber incidents and malware related to APT campaigns and groups. For this, we obtain a list of papers from APTnotes (https://github.com/aptnotes/data), which provides publicly available articles and blog contents related to malicious attacks, activity, and software associated with vendor-defined APT groups and/or tool-sets. To maintain the usability of the dataset, we exclude from the list the periodically published threat analysis reports that combine analysis results about different APT groups that are unrelated to each other. Therefore, one can assume that the data extracted from a report in phases 1 and 2 is related to the same (or related) threat actors. We can use this property to set the ground truth of data for analysis. This property and the dataset usability are explained in detail in sections 4 and 5, respectively.

Next, Indicators of Compromise (IoCs) are extracted from the reports using a parser. We utilize ioc_parser (https://github.com/armbues/ioc_parser), which extracts IoCs matched by predefined regular expressions, such as URL, host, IP address, e-mail account, hashes (MD5, SHA1, SHA256), Common Vulnerabilities and Exposures (CVE), registry, file names ending with specific extensions, and Program Database (PDB) path. Among the obtained data, the malware hash values are passed to the second phase for further data collection, and the others to the last phase.
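The regular-expression matching and the hash/non-hash split described above can be sketched as follows. This is a minimal illustration, not the ioc_parser implementation: the pattern set, `extract_iocs`, and `split_for_phases` are hypothetical names and simplified patterns chosen for this example.

```python
import re

# Simplified, illustrative patterns (assumption: not the patterns shipped with ioc_parser).
IOC_PATTERNS = {
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha1": re.compile(r"\b[a-fA-F0-9]{40}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}
HASH_TYPES = {"md5", "sha1", "sha256"}


def extract_iocs(text):
    """Return {ioc_type: set of matched strings} for every pattern that matches."""
    found = {}
    for ioc_type, pattern in IOC_PATTERNS.items():
        matches = set(pattern.findall(text))
        if matches:
            found[ioc_type] = matches
    return found


def split_for_phases(iocs):
    """Hashes go to phase 2 (analysis lookup); everything else goes to phase 3."""
    hashes = set()
    others = {}
    for ioc_type, values in iocs.items():
        if ioc_type in HASH_TYPES:
            hashes |= values
        else:
            others[ioc_type] = values
    return hashes, others


report_text = (
    "The dropper d41d8cd98f00b204e9800998ecf8427e contacted 10.0.0.5 "
    "and exploited CVE-2017-0144."
)
hashes, others = split_for_phases(extract_iocs(report_text))
```

The word-boundary anchors keep a 32-character pattern from matching inside a longer SHA1 or SHA256 string, which is one reason the three hash lengths can share the same character class.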

3.2 Phase 2: Collecting Analysis Data

Due to the functional limitations of the parser, some IoCs in the reports may remain unextracted; these can be found in malware analysis data. Moreover, the analysis results can provide additional data that does not appear in the reports. Notably, valuable data that cannot be expressed as a regular expression, such as mutexes, file mappings, code signs, and other strings, can be collected only from the malware analysis results.

To collect malware analysis data, we use the malware repository service malwares.com, operated by SAINT SECURITY Inc., the first cloud-based malware analysis platform in South Korea. It possesses over 800 million malware samples and maintains a partnership with VirusTotal. If malware analysis results are retrieved by querying the hash value, the data in the results (hashes, URL, IP address, PDB path, code sign, file name, and other strings) is passed to the last phase; otherwise, only the hash value itself is passed. We do not store malware samples in the database because of possible copyright concerns when the dataset is publicly released. For new hash values found in the results, the analysis data is gathered through the same procedure.
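The lookup procedure above, including the repeated lookup for newly found hashes, can be sketched as a simple worklist loop. The `lookup` callable is a hypothetical stand-in for a repository client and the `"hashes"` key is an assumed result field; neither reflects the actual malwares.com API.

```python
from collections import deque


def collect_analysis(seed_hashes, lookup):
    """Query every hash once; enqueue hashes newly found in analysis results.

    lookup(hash) is a hypothetical callable returning a dict of analysis
    data, or None when the repository has no result for that hash.
    """
    seeds = list(seed_hashes)
    queue = deque(seeds)
    seen = set(seeds)
    analyzed = {}       # hash -> analysis data, passed to phase 3
    unresolved = set()  # hashes passed to phase 3 on their own
    while queue:
        current = queue.popleft()
        result = lookup(current)
        if result is None:
            unresolved.add(current)
            continue
        analyzed[current] = result
        for new_hash in result.get("hashes", []):
            if new_hash not in seen:
                seen.add(new_hash)
                queue.append(new_hash)
    return analyzed, unresolved


# Toy in-memory repository used only for illustration.
fake_repository = {
    "aaa": {"hashes": ["bbb"], "ip": ["10.0.0.5"]},
    "bbb": {"hashes": ["aaa"]},
}
analyzed, unresolved = collect_analysis(["aaa", "ccc"], fake_repository.get)
```

Tracking `seen` separately from `analyzed` keeps the loop from re-querying hashes that reference each other, as `aaa` and `bbb` do in the toy data.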

3.3 Phase 3: Data Filtering and Storing

The data collected from several sources may be redundant or noisy, and such data is filtered out in this phase. For example, some files are automatically generated by the operating system when the malware is executed, regardless of the intent of the malware creator. We merge repeated data and remove noisy data in this phase. What needs to be considered for noise removal is the trade-off between false positives and false negatives. The filtered data is stored in a MISP server, which provides an API to manage and export data in various structured formats.
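One simple way to realize the merge-and-filter step is a denylist combined with de-duplication, sketched below. The denylist entries and the `filter_attributes` helper are assumptions for illustration; the paper does not specify which filtering rules CTIMiner applies.

```python
def filter_attributes(attributes, noise_values):
    """Drop denylisted values and merge repeats, preserving first-seen order.

    attributes: iterable of (attribute_type, value) pairs.
    noise_values: set of lowercased values treated as noise (assumed list).
    """
    seen = set()
    kept = []
    for attr_type, value in attributes:
        cleaned = value.strip()
        key = (attr_type, cleaned.lower())
        if key[1] in noise_values or key in seen:
            continue
        seen.add(key)
        kept.append((attr_type, cleaned))
    return kept


# Hypothetical examples of files created by the operating system.
NOISE = {"thumbs.db", "desktop.ini"}
raw = [
    ("filename", "Thumbs.db"),
    ("ip", "10.0.0.5"),
    ("ip", "10.0.0.5 "),
    ("filename", "evil.dll"),
]
filtered = filter_attributes(raw, NOISE)
```

A longer denylist removes more noise at the risk of discarding meaningful indicators, which is the false-positive versus false-negative trade-off mentioned above.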

Optionally, we categorize the data types composing the dataset and analyze their statistical characteristics in this phase; the results are presented in the next section.

3.4 System Processing Results of Phase 1 & 2

We ran this system on the collected 423 APT reports published from 2008 to 2017, and the numerical processing results are in Table 1. Among the 10,391 malware hashes extracted from the reports, we obtained analysis results for 81.5% of them from the malware repository. Within the analysis information, we found 406 new malware hashes that were not contained in the APT reports and added their analysis information to the dataset. The value of including the malware analysis data in addition to the IoCs extracted from the reports is explained in the statistical analysis of the dataset in section 4.2.

Types                                     |  Counts |    %
# of the reports                          |     423 |    -
# of the data stored in the dataset       | 628,067 |    -
# of the malware hashes in the reports    |  10,391 |    -
# of the analyzed malware                 |   8,468 | 81.5
# of the additionally extracted malware   |     406 |  4.8
Table 1: Phase 1 & 2 Processing Results
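The percentages in Table 1 can be reproduced from the counts. The check below assumes that the 4.8% figure is taken relative to the number of analyzed malware, which is the denominator that matches the table; the variable names are illustrative.

```python
hashes_in_reports = 10_391
analyzed_malware = 8_468
additional_malware = 406

# Share of report hashes for which the repository returned analysis results.
analyzed_pct = round(100 * analyzed_malware / hashes_in_reports, 1)

# Newly discovered hashes relative to the analyzed malware (assumed denominator).
additional_pct = round(100 * additional_malware / analyzed_malware, 1)

print(analyzed_pct, additional_pct)
```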

4 Dataset Descriptions

4.1 Dataset Structure and Data Types

Figure 4: The Relationship of a Set of Events.
Figure 5: The Data Schema of an Event.
Figure 6: The Example of a Set of Events.

The dataset is composed of several sets of events, and Fig. 4 shows the relationship within one set of events. One set of events is composed of two types of events: one report event and several malware events. A report event includes the data extracted in the first phase explained in section 3, which parses textual IoCs from the APT reports. Whenever malware hashes are detected and their analysis data can be obtained in phase 2, malware events are created. These malware events and the report event from which the malware hashes originated can be grouped under the title of the report.

The data schema of an event is presented in Fig. 5, and one short example of a set of events is in Fig. 6. Since all malware events originating from one report include the same report file name, this can be used as the ground truth for correlation analysis of the data. Also, compilation dates of malware and publication dates of reports can be useful for temporal analysis of the dataset. A sample application of the dataset for correlation analysis using these dataset characteristics is demonstrated in section 5.
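The grouping of a report event and its malware events under a shared report file name can be pictured with a small data structure. The field names below are hypothetical and do not reproduce the schema of Fig. 5; the sketch only shows how the shared file name serves as a grouping key.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Event:
    kind: str          # "report" or "malware"
    report_file: str   # file name of the originating report (ground-truth key)
    date: str          # publication date (report) or compilation date (malware)
    attributes: list = field(default_factory=list)  # (type, value) pairs


def group_event_sets(events):
    """Group events into sets keyed by the report they originate from."""
    event_sets = defaultdict(list)
    for event in events:
        event_sets[event.report_file].append(event)
    return dict(event_sets)


events = [
    Event("report", "report_a.pdf", "2017-02-20", [("ip", "10.0.0.5")]),
    Event("malware", "report_a.pdf", "2016-10-01", [("mutex", "Global42")]),
    Event("report", "report_b.pdf", "2017-02-03"),
]
event_sets = group_event_sets(events)
```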

The types of attributes stored in the dataset are IP, URL, e-mail address, date and time, vulnerability (CVE), file name, PDB path, digital code sign serial number, and other string data such as the author and title of a document. The amounts of data, report events, and malware events are in Table 2. Using the source code we publicly released, one can create one's own dataset composed of the attribute types of interest.

Data types (Hash through total) and the number of report and malware events per year:

Year  |   Hash |      IP |    URL | e-mail | date time | CVE | file name | PDB | code sign | others |   total | Report | Malware
2008  |      0 |       3 |    171 |      0 |         0 |   0 |        17 |   0 |         0 |      0 |     191 |      2 |       0
2009  |      2 |       7 |     84 |      2 |         0 |   0 |        10 |   0 |         0 |      0 |     105 |      2 |       0
2010  |    223 |      79 |    280 |     14 |        32 |   2 |       213 |   0 |         0 |      0 |     800 |      7 |      32
2011  |  1,440 |     412 |    478 |     17 |       319 |   7 |       713 |   2 |        38 |     25 |   3,340 |     14 |     319
2012  |  2,240 |     433 |    637 |     46 |       465 |  30 |       828 |   2 |        43 |      7 |   4,524 |     22 |     465
2013  |  8,329 |   2,505 |  3,032 |    599 |     1,798 |  45 |     3,003 |  97 |       802 |     61 |  19,571 |     47 |   1,798
2014  |  5,614 |   5,484 |  3,282 |    476 |     1,116 |  83 |     2,804 |  22 |       438 |     28 |  18,842 |    100 |   1,116
2015  |  6,801 |   2,752 |  2,658 |    334 |     1,554 |  48 |     3,077 |  28 |       206 |     34 |  17,258 |     78 |   1,554
2016  |  8,001 | 525,020 |  3,449 |    235 |     1,833 |  81 |     4,873 |  43 |       154 |     14 | 543,703 |     79 |   1,974
2017  |  4,343 |   3,316 |  3,582 |    534 |       935 |  49 |     2,780 |  13 |        99 |      9 |  15,660 |     72 |   1,017
Total | 36,993 | 540,727 | 17,917 |  2,313 |     8,060 | 345 |    19,433 | 207 |     1,891 |    181 | 628,067 |    423 |   8,785
Table 2: The Number of Data for Each Type

4.2 Data Categories and Statistics

We observed that the data collected from the reports and from the malware analysis information is related to common cyber campaigns or threat actors and can be categorized as shown in Fig. 7.

Figure 7: The Data Categories in the Dataset.

The characteristics of each category are as follows.

  1. The data that can only be extracted by the parser belongs to this category. The quality and the quantity of this type of data depend heavily on the contents of the reports and the functionality of the parser.

  2. The malware analysis data that is contained in reports but unable to be extracted by the parser belongs in this category. The volume of this type of data shows how much the malware analysis data can compensate for the limitation of the parser. Also, the indicator about this category can be used to compare the quality of analysis results of several malware repositories.

  3. This category includes the data extracted by the parser as well as by the malware analysis results.

  4. Some data related to campaigns or threat actors can be omitted from the APT reports due to its low priority compared to other information or the analysis limitations of the authors. Such data found in malware analysis results belongs to this category.

  5. The noise data generated by the parser belongs to this category. The functional limitation of the parser increases the portion of data in this category.

  6. The data in this category is the noise generated from malware analysis information. It is hard to distinguish between 4⃝ and 6⃝, but meaningless data generated by default by the runtime environment of the malware is a typical case belonging to this category.

  7. There is a lot of data in the reports that is difficult for the parser or the malware analysis information to obtain. Especially, nontechnical information such as actors and groups of cyber campaigns mainly lies in this category. These data need to be extracted manually or by other supplemental methods.

  8. This category is similar to 7⃝ in the sense that neither data extraction method can discover the data in the category. Publishers may intentionally exclude the data, or they may not even know about it. The volume of this category can be minimized by comparing several reports related to the same campaigns or threat actors, or by gathering multi-source information such as HUMINT and SIGINT.

The statistical features of the categories of the dataset generated through phase 3 of the system are in Table 3.

Category 1⃝ 2⃝ 3⃝ 4⃝
% 46 17 11 26
Table 3: The Percentage of Data for Each Category

It is worth noting that 43% of the data comes from malware analysis results (2⃝ and 4⃝) and 26% is newly discovered data that is not contained in the reports (4⃝). Whereas the vast majority of the data in 2⃝ is hash values, 4⃝ consists of various types of data, including code sign, IP address, and other string information, that are valuable for identifying the incidents.
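Categories 1⃝ to 4⃝ can be read as set relations between what the parser extracted, what the malware analysis returned, and what the report text contains. The sketch below is one possible reading under that assumption; the `categorize` helper and the substring test against the report text are illustrative simplifications, and the noise categories 5⃝ and 6⃝ are ignored.

```python
def categorize(parser_values, analysis_values, report_text):
    """Assign values to categories 1-4 using set operations.

    1: extracted by the parser only
    2: from malware analysis, present in the report but missed by the parser
    3: extracted by both the parser and the malware analysis
    4: from malware analysis, absent from the report
    """
    text = report_text.lower()
    categories = {
        1: parser_values - analysis_values,
        2: set(),
        3: parser_values & analysis_values,
        4: set(),
    }
    for value in analysis_values - parser_values:
        target = 2 if value.lower() in text else 4
        categories[target].add(value)
    return categories


report = "The sample beacons to evil.com and 10.0.0.5 and creates mutex GlobalMutex42."
parsed = {"10.0.0.5", "evil.com"}
from_analysis = {"evil.com", "GlobalMutex42", "c2.example.net"}
cats = categorize(parsed, from_analysis, report)

# Shares reported in Table 3 for categories 1 to 4.
table3 = {1: 46, 2: 17, 3: 11, 4: 26}
from_analysis_only = table3[2] + table3[4]
```

Under this reading, the 43% attributed to malware analysis results is the sum of categories 2⃝ and 4⃝ in Table 3.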

5 Dataset Application

As mentioned above, the objective of generating our dataset is to promote academic research related to CTI analysis. In this section, we propose three research topics that apply the dataset and demonstrate one dataset application example. Proposing novel analysis techniques would be desirable, but that is out of the scope of this paper. The application example provided is the correlation analysis result of the dataset automatically generated by MISP.

5.1 Noise Removal

As explained in section 4, the dataset includes several types of noise, which hinders further data analysis and causes erroneous results. The dataset contains noise because of the malfunctions of the data extraction methods and the inclusion of less meaningful data. An effective noise removal technique should be able to consider the contextual necessity of data within the whole dataset or within the sets of events. For example, data shared by several event sets that otherwise bear little similarity to each other is likely to be noise, since it correlates event sets that are dissimilar.

5.2 Correlation Analysis

Good usability of the dataset comes from finding the underlying relations of data. Without the correlations, the dataset itself is nothing but a significant amount of scattered data that can only be used for searching existence of some items.

Since an event in the dataset is composed of several pieces of threat data, the correlations between events are determined by analyzing the relations among the threat data they contain. The string-matching based method that many commercial cyber intelligence services provide would be one way to find relations between events. However, this simple method has several limitations. If two events contain attacker names such as ‘Bart Simpson’ and ‘B. Simpson’, the simple string-matching based method will not find the relation between the events. Similarly, if the events include the URLs ‘bartsimpson.com’ and ‘bsimpson.net’ respectively, the relation will not be discovered. String-similarity analysis and heuristics can be adopted to overcome such limitations. Moreover, probabilistic approaches can extend the analysis to the event level by considering the relations among sets of data in the events.
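The difference between exact string matching and string-similarity analysis can be illustrated with Python's standard `difflib` module. The `similar` helper and the 0.6 threshold are assumptions chosen for this example, not values proposed in the paper.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similar(a, b, threshold=0.6):
    """Treat two values as related when their ratio reaches the threshold."""
    return similarity(a, b) >= threshold


pairs = [
    ("Bart Simpson", "B. Simpson"),
    ("bartsimpson.com", "bsimpson.net"),
]
for left, right in pairs:
    # Exact matching fails for both pairs, while the similarity test relates them.
    print(left == right, similar(left, right))
```

A fixed threshold is a blunt instrument: lowering it finds more relations but also links unrelated values, so in practice it would need tuning per attribute type.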

5.3 Temporal Analysis

Understanding the history of cyber campaigns by adversaries is crucial not only for defending against current incidents and inferring the underlying intents but also for seeing the direction of adversarial activities in the big picture. Furthermore, the Tactics, Techniques, and Procedures (TTPs) identified from the campaigns by temporal analysis can characterize the behavior of adversarial groups. These characteristics can therefore be used as features for correlation analysis of sequences of events.

5.4 Dataset Application Example

The proposed dataset can be used for correlation analysis of cyber incidents. The threat actor group whose correlations we retrieve as an example is the Lazarus group, to which many major cyber campaigns have been attributed, including the following:

  • Sony Pictures Entertainment attack (2014)

  • The bank heist including the Bangladesh Bank (2016)

  • The worldwide WannaCry ransomware distribution (2017)

We conducted a correlation analysis of the dataset collected by CTIMiner with the help of the MISP correlation graph shown in Fig. 8.

The security report used as the starting point of the correlation analysis is ‘Lazarus’ False Flag Malware’ Shevchenko [2017], marked as a⃝. As mentioned in the report, the Lazarus group was involved in the Polish banks heist, for which the corresponding report is BadCyber [2017], marked as b⃝. The data-wise correlation of incidents can be found in Fig. 8. The data in 1⃝, which is extracted from the reports and from malware analysis results, correlates a⃝ and b⃝, and the data in 2⃝ links a⃝ to c⃝, which is another report from BAE Systems regarding the Lazarus group. Therefore, b⃝ and c⃝ can be correlated through a⃝.

Figure 8: The Sample Application of the Dataset to the Correlation Analysis.
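The kind of data-wise correlation shown in Fig. 8 can be approximated by linking reports that share attribute values and then following those links transitively. This is a simplified sketch, not the MISP correlation engine; the report names and attribute values are placeholders.

```python
from collections import defaultdict
from itertools import combinations


def shared_attribute_links(reports):
    """Map each pair of reports to the attribute values they have in common.

    reports: dict of report name -> set of attribute values.
    """
    reports_by_value = defaultdict(set)
    for name, values in reports.items():
        for value in values:
            reports_by_value[value].add(name)
    links = defaultdict(set)
    for value, names in reports_by_value.items():
        for a, b in combinations(sorted(names), 2):
            links[(a, b)].add(value)
    return dict(links)


def correlated_with(start, links):
    """Reports reachable from start through direct or indirect links."""
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor in adjacency[node] - seen:
            seen.add(neighbor)
            stack.append(neighbor)
    return seen - {start}


# Placeholder data: a shares one value with b and a different one with c.
reports = {
    "a": {"hash-1", "ip-1"},
    "b": {"hash-1"},
    "c": {"ip-1"},
    "d": {"unrelated"},
}
links = shared_attribute_links(reports)
related_to_b = correlated_with("b", links)
```

In the placeholder data, b and c share no value directly but both link to a, which mirrors how two reports can be correlated through a third.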

Although this paper does not intend to propose CTI analysis techniques, from this example we can deduce practical lessons on how the dataset can be used for CTI generation. At a basic level, a CTI analysis algorithm should be able to find the connections among the data extracted from the same APT report. At a more advanced level, the algorithm can correlate reports that analyze the same attribution and campaigns. To generate actionable intelligence, a CTI analysis algorithm should eventually aim to find attack patterns for predicting the intents of attackers and preparing against similar attacks.

Kim et al. proposed an event-centric correlation analysis approach to assist in generating such CTI. They suggested a novel concept and a construction algorithm that express similarity between threat events and temporal characteristics in graphical structures Kim et al. [2016]. Further research is needed to use our CTI dataset for such advanced analysis.

6 Source code and Dataset Access

The source code of the CTIMiner system and the generated dataset described in this paper are available to the public. They are accessible at our GitHub repository (https://github.com/dgkim0803/CTIMiner). Using the source code, security reports, and MISP, one can generate a dataset composed of the data types of interest.

7 Conclusion

As cyber threats become prevalent and the volume of collectible data increases rapidly, research on techniques for each step of the intelligence process is being actively conducted. However, compared to the other steps, studies on the analysis and production step, which requires a real CTI dataset, have been limited. We pointed out that dataset unavailability is the main reason holding back this research despite considerable interest. To address the problem, we proposed the CTIMiner system, which generates a dataset consisting of the data contained in security reports, supplemented with malware analysis data related to the reports. After categorizing the types of data collected by the system, we provided statistical features of the dataset. To show the usability and applicability of the dataset, we proposed several research topics that can be pursued using the dataset and demonstrated the correlation analysis result for an event in the dataset.

Our future research direction is to develop and enhance the proposed analysis technique using the dataset on top of the CTI correlation analysis framework Kim et al. [2016]. We believe that releasing this dataset to the public can promote threat information analysis research for generating CTI.

8 Acknowledgement

We sincerely thank SAINT SECURITY Inc. for allowing us to use the API of the malwares.com repository to construct the CTI dataset.

References

  • McMillan [2013] Rob McMillan. Definition: Threat intelligence. Gartner, 2013.
  • Council on Foreign Relations [2015] Council on Foreign Relations. Senate Resolution 754, Cybersecurity Information Sharing Act of 2015 (CISA), 2015.
  • cam [2016] Cybersecurity alliance for mutual progress (camp). https://www.cybersec-alliance.org, 2016.
  • ccd [2008] Nato cooperative cyber defence centre of excellence (ccdcoe). https://ccdcoe.org, 2008.
  • eni [2004] European network and information security agency (enisa). https://www.enisa.europa.eu, 2004.
  • The White House [2015] The White House. FACT SHEET: President Xi Jinping’s State Visit to the United States, 2015.
  • Johnson and Waltermire [2014] Chris Johnson and David Waltermire. NIST Special Publication 800-150 Guide to Cyber Threat Information Sharing, 2014.
  • MIT [2013] The MITRE cybersecurity standards. https://www.mitre.org/capabilities/cybersecurity/overview/cybersecurity-resources/standards, 2013.
  • Rosenberg [2018] Jay Rosenberg. 2018 winter cyber olympics: Code similarities with cyber attacks in pyeongchang. http://www.intezer.com/2018-winter-cyber-olympics-code-similarities-cyber-attacks-pyeongchang/, 2018.
  • GReAT [2018] GReAT. Olympicdestroyer is here to trick the industry. https://securelist.com/olympicdestroyer-is-here-to-trick-the-industry/84295/, 2018.
  • Paul Rascagneres [2018] Paul Rascagneres and Martin Lee. Who wasn’t responsible for olympic destroyer? https://blog.talosintelligence.com/2018/02/who-wasnt-responsible-for-olympic.html, 2018.
  • Juan Andres Guerrero-Saade and Lesnewich [2018] Juan Andres Guerrero-Saade, Priscilla Moriuchi, and Greg Lesnewich. Targeting of olympic games it infrastructure remains unattributed. https://www.recordedfuture.com/olympic-destroyer-malware/, 2018.
  • US Joint Chiefs of Staff [2013] US Joint Chiefs of Staff. Joint Publication 2-0 Joint Intelligence, 2013.
  • Kim et al. [2016] Daegeon Kim, Jiyoung Woo, and Huy Kang Kim. “I Know What You Did Before”: General Framework for Correlation Analysis of Cyber Threat Incidents. In IEEE 35th International Conference on Military Communications, pages 782–787, 2016. ISBN 9781509037810.
  • Goel [2011] Sanjay Goel. Cyberwarfare: Connecting the Dots in Cyber Intelligence. Communications of the ACM, 54(8):132, 2011. ISSN 00010782. doi: 10.1145/1978542.1978569.
  • Benjamin et al. [2015] Victor Benjamin, Weifeng Li, Thomas Holt, and Hsinchun Chen. Exploring threats and vulnerabilities in hacker web: Forums, IRC and carding shops. In IEEE International Conference on Intelligence and Security Informatics, pages 85–90, 2015. ISBN 9781479998883. doi: 10.1109/ISI.2015.7165944.
  • Fachkha and Debbabi [2016] Claude Fachkha and Mourad Debbabi. Darknet as a Source of Cyber Intelligence: Survey, Taxonomy, and Characterization. IEEE Communications Surveys and Tutorials, 18(2):1197–1227, 2016. ISSN 1553877X. doi: 10.1109/COMST.2015.2497690.
  • U.S. Department of Homeland Security [2014] U.S. Department of Homeland Security. IMPACT. https://www.impactcybertrust.org, 2014.
  • Barnum [2014] Sean Barnum. Standardizing cyber threat intelligence information with the Structured Threat Information eXpression (STIX), 2014.
  • MANDIANT [2011] MANDIANT. Sophisticated Indicators for the Modern Threat Landscape: An Introduction to OpenIOC, 2011.
  • Liao et al. [2016] Xiaojing Liao, Kan Yuan, Xiaofeng Wang, Zhou Li, Luyi Xing, and Raheem Beyah. Acing the IOC Game: Toward Automatic Discovery and Analysis of Open-Source Cyber Threat Intelligence. In ACM SIGSAC Conference on Computer and Communications Security, pages 755–766, 2016. ISBN 9781450341394. doi: 10.1145/2976749.2978315.
  • Kornmaier and Jaouën [2014] Andreas Kornmaier and Fabrice Jaouën. Beyond technical data - a more comprehensive Situational Awareness fed by available Intelligence Information. In 6th International Conference on Cyber Conflict, pages 139–154, 2014.
  • Modi et al. [2016] Ajay Modi, Zhibo Sun, Anupam Panwar, Tejas Khairnar, Ziming Zhao, Adam Doupé, and Paul Black. Towards Automated Threat Intelligence Fusion. In IEEE 2nd International Conference on Collaboration and Internet Computing, pages 1–9, 2016.
  • Truvé [2016] Staffan Truvé. Temporal Analytics for Predictive Cyber Threat Intelligence. In 25th International Conference Companion on World Wide Web, pages 867–868, 2016. ISBN 978-1-4503-4144-8. doi: 10.1145/2872518.2889294. URL http://dx.doi.org/10.1145/2872518.2889294.
  • Qamar et al. [2017] Sara Qamar, Zahid Anwar, A. Mohammad Rahman, Ehab Al-Shaer, and Bei-Tseng Chu. Data-Driven Analytics for Cyber-Threat Intelligence and Information Sharing. Computers & Security, 67:35–58, 2017. ISSN 0167-4048.
  • Connolly et al. [2014] Julie Connolly, Mark Davidson, and Charles Schmidt. The Trusted Automated eXchange of Indicator Information (TAXII), 2014.
  • Zhao and White [2012] Wanying Zhao and Gregory White. A Collaborative Information Sharing Framework for Community Cyber Security. In IEEE International Conference on Technologies for Homeland Security, pages 457–462, 2012. ISBN 978-1-4673-2709-1. doi: 10.1109/THS.2012.6459892.
  • Bass [2000] Tim Bass. Intrusion detection systems and multisensor data fusion. Commun. ACM, 43(4):99–105, April 2000. ISSN 0001-0782. doi: 10.1145/332051.332079. URL http://doi.acm.org/10.1145/332051.332079.
  • Shevchenko [2017] Sergei Shevchenko. Lazarus’ False Flag Malware. http://baesystemsai.blogspot.kr/2017/02/lazarus-false-flag-malware.html, 2017.
  • BadCyber [2017] BadCyber. Several Polish banks hacked, information stolen by unknown attackers. https://badcyber.com/several-polish-banks-hacked-information-stolen-by-unknown-attackers/, 2017.