Counteracting Dark Web Text-Based CAPTCHA with Generative Adversarial Learning for Proactive Cyber Threat Intelligence

01/08/2022
by   Ning Zhang, et al.

Automated monitoring of dark web (DW) platforms on a large scale is the first step toward developing proactive Cyber Threat Intelligence (CTI). While efficient methods exist for collecting data from the surface web, large-scale dark web data collection is often hindered by anti-crawling measures. In particular, text-based CAPTCHA is the most prevalent and prohibitive of these measures on the dark web. Text-based CAPTCHA identifies and blocks automated crawlers by forcing the user to enter a combination of hard-to-recognize alphanumeric characters. On the dark web, CAPTCHA images are meticulously designed with additional background noise and variable character length to prevent automated CAPTCHA breaking. Existing automated CAPTCHA breaking methods have difficulty overcoming these dark web challenges. As a result, solving dark web text-based CAPTCHA has relied heavily on human involvement, which is labor-intensive and time-consuming. In this study, we propose a novel framework for automated breaking of dark web CAPTCHA to facilitate dark web data collection. The framework encompasses a novel generative method to recognize dark web text-based CAPTCHA with noisy backgrounds and variable character length. To eliminate the need for human involvement, the proposed framework uses a Generative Adversarial Network (GAN) to counteract dark web background noise and leverages an enhanced character segmentation algorithm to handle CAPTCHA images with variable character length. Our proposed framework, DW-GAN, was systematically evaluated on multiple dark web CAPTCHA testbeds. DW-GAN significantly outperformed the state-of-the-art benchmark methods on all datasets, achieving a success rate of over 94.4% on a carefully collected real-world dark web dataset...
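To make the variable-length problem concrete, the sketch below shows a generic column-projection segmentation in Python. It is not the enhanced segmentation algorithm from the paper, whose details are not given in this abstract. It assumes the image has already been denoised and binarized (1 = ink, 0 = background), and the function name `segment_columns` and the `min_width` parameter are hypothetical names chosen for illustration.

```python
from typing import List, Tuple


def segment_columns(image: List[List[int]], min_width: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) column spans, end exclusive, that contain ink.

    Assumption: `image` is a denoised, binarized image given as rows of
    0/1 values, where 1 marks a character pixel. This is a generic
    projection-based split, not the paper's segmentation algorithm.
    """
    if not image:
        return []
    width = len(image[0])
    # Vertical projection: count of ink pixels in each column.
    projection = [sum(row[x] for row in image) for x in range(width)]

    spans: List[Tuple[int, int]] = []
    start = None
    for x, count in enumerate(projection):
        if count > 0 and start is None:
            start = x  # a run of inked columns begins
        elif count == 0 and start is not None:
            if x - start >= min_width:
                spans.append((start, x))  # the run ended at an empty column
            start = None
    # Close a run that touches the right edge of the image.
    if start is not None and width - start >= min_width:
        spans.append((start, width))
    return spans


# Toy 3x9 image with three separated glyphs.
toy = [
    [1, 1, 0, 0, 1, 0, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 1, 0, 0, 1, 1],
]
spans = segment_columns(toy)
print(len(spans), spans)
```

Because the number of spans is derived from the image itself, the character count does not need to be fixed in advance. A plain projection like this fails when characters touch or when background noise fills the gaps between them, which is presumably why the framework removes noise before segmenting.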

research
07/12/2018

Deep Learning for Imbalance Data Classification using Class Expert Generative Adversarial Network

Without any specific way for imbalance data classification, artificial i...
research
08/16/2023

Diff-CAPTCHA: An Image-based CAPTCHA with Security Enhanced by Denoising Diffusion Model

To enhance the security of text CAPTCHAs, various methods have been empl...
research
03/13/2023

AGTGAN: Unpaired Image Translation for Photographic Ancient Character Generation

The study of ancient writings has great value for archaeology and philol...
research
06/28/2019

ProtoNet: Learning from Web Data with Memory

Learning from web data has attracted lots of research interest in recent...
research
09/17/2019

ShamFinder: An Automated Framework for Detecting IDN Homographs

The internationalized domain name (IDN) is a mechanism that enables us t...
research
11/27/2020

Leveraging Regular Fundus Images for Training UWF Fundus Diagnosis Models via Adversarial Learning and Pseudo-Labeling

Recently, ultra-widefield (UWF) 200-degree fundus imaging by Optos camer...
research
12/01/2022

Leveraging Large-scale Multimedia Datasets to Refine Content Moderation Models

The sheer volume of online user-generated content has rendered content m...
