DarkBERT: A Language Model for the Dark Side of the Internet

05/15/2023
by Youngjin Jin, et al.

Recent research suggests that there are clear differences between the language used on the Dark Web and that of the Surface Web. Because studies of the Dark Web commonly require textual analysis of the domain, a language model specific to the Dark Web may provide valuable insights to researchers. In this work, we introduce DarkBERT, a language model pretrained on Dark Web data. We describe the steps taken to filter and compile the text data used to train DarkBERT, countering the extreme lexical and structural diversity of the Dark Web that could otherwise prevent the model from building a proper representation of the domain. We evaluate DarkBERT and its vanilla counterpart alongside other widely used language models to validate the benefits that a Dark Web domain-specific model offers in various use cases. Our evaluations show that DarkBERT outperforms current language models and may serve as a valuable resource for future research on the Dark Web.
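The masked-language-model probing implied by the abstract's comparison of DarkBERT with its "vanilla counterpart" can be sketched with the Hugging Face transformers fill-mask pipeline. This is a minimal illustration, not the paper's evaluation protocol: "roberta-base" stands in for the vanilla baseline (DarkBERT is RoBERTa-based), the prompt is made up for this example, and the released DarkBERT checkpoint itself is access-controlled, so it is not loaded here.

```python
from transformers import pipeline

# Sketch of masked-token prediction used to compare language models.
# "roberta-base" is a stand-in for DarkBERT's vanilla counterpart; the
# DarkBERT checkpoint is gated and not loaded in this example.
fill = pipeline("fill-mask", model="roberta-base")

# Illustrative prompt, not taken from the paper's evaluation set.
for pred in fill("Researchers study the language of the <mask> Web.")[:3]:
    print(f"{pred['token_str'].strip()}\t{pred['score']:.3f}")
```

A domain-adapted model would be swapped in by changing the `model` argument; differences in the predicted fillers and their scores on Dark Web text are the intuition behind the paper's evaluations.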


Related research

- 09/05/2023: On the Planning, Search, and Memorization Capabilities of Large Language Models
  The rapid advancement of large language models, such as the Generative P...

- 04/14/2022: Shedding New Light on the Language of the Dark Web
  The hidden nature and the limited accessibility of the Dark Web, combine...

- 05/27/2023: The Curse of Recursion: Training on Generated Data Makes Models Forget
  Stable Diffusion revolutionised image creation from descriptive text. GP...

- 08/04/2021: Mitigating harm in language models with conditional-likelihood filtration
  Language models trained on large-scale unfiltered datasets curated from ...

- 02/03/2017: ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation
  Web archives are a valuable resource for researchers of various discipli...

- 07/31/2023: Generative Models as a Complex Systems Science: How can we make sense of large language model behavior?
  Coaxing out desired behavior from pretrained models, while avoiding unde...
