ThreatCrawl: A BERT-based Focused Crawler for the Cybersecurity Domain

04/24/2023
by Philipp Kuehn, et al.

Publicly available sources contain valuable information for Cyber Threat Intelligence (CTI). This information can be used to prevent attacks that have already taken place on other systems. Ideally, only the initial attack succeeds and all subsequent ones are detected and stopped. But while there are different standards for exchanging this information, much of it is shared in articles or blog posts in non-standardized ways. Manually scanning multiple online portals and news pages to discover new threats and extract them is a time-consuming task. To automate parts of this scanning process, multiple papers propose extractors that use Natural Language Processing (NLP) to extract Indicators of Compromise (IOCs) from documents. However, while this addresses the problem of extracting the information from documents, the search for these documents is rarely considered. In this paper, a new focused crawler called ThreatCrawl is proposed, which uses Bidirectional Encoder Representations from Transformers (BERT)-based models to classify documents and adapt its crawling path dynamically. While ThreatCrawl has difficulty classifying the specific type of Open Source Intelligence (OSINT) named in texts, e.g., IOC content, it can successfully find relevant documents and modify its path accordingly. It yields harvest rates of up to 52%, which are, to the best of our knowledge, better than the current state of the art.
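
To make the terms "focused crawler" and "harvest rate" concrete, the following is a minimal sketch and not ThreatCrawl's implementation. It assumes a toy in-memory link graph (`TOY_WEB`) in place of real HTTP requests and a keyword-based scorer (`keyword_score`) as a stand-in for the BERT-based classifier; all names and the threshold are hypothetical. The harvest rate is computed as the number of relevant pages divided by the number of pages crawled.

```python
import heapq
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
Web = Dict[str, Tuple[str, List[str]]]

TOY_WEB: Web = {
    "a": ("New ransomware exploit reported", ["b", "c"]),
    "b": ("Malware analysis of a vulnerability", ["d"]),
    "c": ("Recipe for apple pie", ["e"]),
    "d": ("Exploit details and malware samples", []),
    "e": ("Gardening tips", []),
}


def keyword_score(text: str) -> float:
    """Stand-in relevance scorer (NOT BERT): fraction of keywords present."""
    keywords = ("malware", "exploit", "vulnerability", "ransomware")
    lowered = text.lower()
    return sum(k in lowered for k in keywords) / len(keywords)


def focused_crawl(
    web: Web,
    seeds: List[str],
    score: Callable[[str], float] = keyword_score,
    threshold: float = 0.25,  # assumed cut-off for "relevant"
    budget: int = 10,         # maximum number of pages to crawl
) -> Tuple[List[str], float]:
    """Crawl best-first; links inherit the score of the page they came from."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant: List[str] = []
    crawled = 0

    while frontier and crawled < budget:
        _, url = heapq.heappop(frontier)
        if url not in web:
            continue  # dead link in the toy graph
        text, links = web[url]
        crawled += 1
        page_score = score(text)
        if page_score >= threshold:
            relevant.append(url)
        # Adapt the crawl path: links from relevant pages are visited first.
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-page_score, link))

    harvest_rate = len(relevant) / crawled if crawled else 0.0
    return relevant, harvest_rate


if __name__ == "__main__":
    pages, rate = focused_crawl(TOY_WEB, ["a"], budget=4)
    print(pages, rate)
```

In this sketch the priority of a link is simply the score of its parent page; a real focused crawler could swap `keyword_score` for any document classifier that returns a relevance score without changing the crawl loop.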

research
05/25/2023

Neural Natural Language Processing for Long Texts: A Survey of the State-of-the-Art

The adoption of Deep Neural Networks (DNNs) has greatly benefited Natura...
research
01/27/2023

Cybersecurity Threat Hunting and Vulnerability Analysis Using a Neo4j Graph Database of Open Source Intelligence

Open source intelligence is a powerful tool for cybersecurity analysts t...
research
04/17/2023

Classification of US Supreme Court Cases using BERT-Based Techniques

Models based on bidirectional encoder representations from transformers ...
research
04/14/2022

Brazilian Court Documents Clustered by Similarity Together Using Natural Language Processing Approaches with Transformers

Recent advances in Artificial intelligence (AI) have leveraged promising...
research
05/21/2021

Towards Automatic Comparison of Data Privacy Documents: A Preliminary Experiment on GDPR-like Laws

General Data Protection Regulation (GDPR) becomes a standard law for dat...
research
01/29/2021

Fine-tuning BERT-based models for Plant Health Bulletin Classification

In the era of digitization, different actors in agriculture produce nume...