DarkDiff: Explainable web page similarity of TOR onion sites

08/23/2023
by   Pieter Hartel, et al.

In large-scale data analysis, near-duplicates are often a problem. For example, when two phishing emails are near-duplicates, a difference in the salutation (Mr versus Ms) is not essential, but whether the email names bank A or bank B is important. The state of the art in near-duplicate detection is a black-box approach (MinHash), so one learns only that two emails are near-duplicates, not why. We present DarkDiff, which efficiently detects near-duplicates and also gives the reason why two items are near-duplicates. We developed DarkDiff to detect near-duplicate homepages on the Darkweb. DarkDiff works well on those pages because they resemble the clear web of the past.
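
The phishing-email example can be illustrated with a minimal sketch. This is not DarkDiff's algorithm, which the abstract does not describe. It only shows the general idea of returning the differing tokens alongside a similarity score, here using Python's standard `difflib`. The function name `explain_similarity` and the two sample emails are assumptions made for illustration.

```python
import difflib


def explain_similarity(a: str, b: str):
    """Return (ratio, differences) for two texts compared word by word.

    Hypothetical helper for illustration; not part of DarkDiff.
    """
    words_a, words_b = a.split(), b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    # Keep only the non-matching spans: these are the "why" behind the score.
    differences = [
        (tag, " ".join(words_a[i1:i2]), " ".join(words_b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return matcher.ratio(), differences


# Invented sample emails mirroring the Mr/Ms and bank A/B example.
email_1 = "Dear Mr Smith, your account at Bank A has been suspended"
email_2 = "Dear Ms Smith, your account at Bank B has been suspended"

ratio, differences = explain_similarity(email_1, email_2)
# differences == [("replace", "Mr", "Ms"), ("replace", "A", "B")]
```

A hash-based estimate such as MinHash would report only a similarity score for this pair. The list of differences is what lets an analyst decide that the salutation change is noise while the bank name is significant.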


