Short Text Classification Approach to Identify Child Sexual Exploitation Material

10/29/2020
by Mhd Wesam Al-Nabki, et al.

Producing or sharing Child Sexual Exploitation Material (CSEM) is a serious crime that Law Enforcement Agencies (LEAs) fight vigorously. When an LEA seizes a computer from a potential producer or consumer of CSEM, it needs to analyze the files on the suspect's hard disk for evidence. However, manually inspecting file content for CSEM is time-consuming, and in most cases it is infeasible within the time available to the Spanish police under a search warrant. An alternative that speeds up the process is to identify CSEM by analyzing file names and their absolute paths instead of file content. The main challenge of this task lies in dealing with short text that the owners of this material deliberately distort using obfuscated words and user-defined naming patterns. This paper presents and compares two approaches based on short text classification to identify CSEM files. The first employs two independent supervised classifiers, one for the file name and the other for the path, whose outputs are later fused into a single score. In contrast, the second approach uses only the file name classifier, iterating it over the file's absolute path. Both approaches operate at the character n-gram level; binary and orthographic features enrich the file name representation, and a binary Logistic Regression model is used for classification. The presented file classifier achieved an average class recall of 0.98. This solution could be integrated into forensic tools and services to help Law Enforcement Agencies identify CSEM without processing every file's visual content, which is far more computationally demanding.
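The abstract does not give implementation details, so the following is only a minimal Python sketch of the ideas it names: character n-gram extraction, a few orthographic features of a file name, iterating a file name scorer over the components of an absolute path, and fusing two scores. The function names, the specific orthographic features, the max-over-components aggregation, and the weighted-average fusion rule are all assumptions made for illustration, not the authors' method. The `toy_scorer` is a hypothetical stand-in for a trained Logistic Regression classifier.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Callable, Dict


def char_ngrams(text: str, n_min: int = 2, n_max: int = 3) -> Counter:
    """Count overlapping character n-grams of a lowercased string."""
    text = text.lower()
    grams: Counter = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return grams


def orthographic_features(name: str) -> Dict[str, float]:
    """Simple shape features of a file name (an assumed, illustrative set)."""
    length = max(len(name), 1)  # avoid division by zero on empty names
    return {
        "digit_ratio": sum(c.isdigit() for c in name) / length,
        "upper_ratio": sum(c.isupper() for c in name) / length,
        "symbol_ratio": sum(not c.isalnum() for c in name) / length,
        "has_extension": float("." in name.strip(".")),
    }


def score_path(path: str, score_name: Callable[[str], float]) -> float:
    """Apply a file name scorer to each path component and keep the highest score.

    Taking the maximum is an assumption; the abstract only says the file name
    classifier is iterated over the absolute path.
    """
    parts = [p for p in PurePosixPath(path).parts if p != "/"]
    return max((score_name(p) for p in parts), default=0.0)


def fuse(name_score: float, path_score: float, weight: float = 0.5) -> float:
    """Fuse two classifier scores with a weighted average (assumed fusion rule)."""
    return weight * name_score + (1.0 - weight) * path_score


def toy_scorer(name: str) -> float:
    """Hypothetical placeholder for a trained classifier's probability output."""
    return 1.0 if "flagged" in name.lower() else 0.0
```

With this sketch, `score_path("/home/user/flagged_dir/photo.jpg", toy_scorer)` scores each of the four components and returns the highest value, so a suspicious directory name can raise the score of a file whose own name looks innocuous. In a real system the n-gram counts and orthographic features would be vectorized and fed to the trained classifier rather than to a keyword check.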


research
10/05/2020

Metadata-Based Detection of Child Sexual Abuse Material

In the last decade, the scale of creation and distribution of child sexu...
research
01/27/2023

Adversarial Networks and Machine Learning for File Classification

Correctly identifying the type of file under examination is a critical p...
research
02/25/2021

File fragment recognition based on content and statistical features

Nowadays, the speed up development and use of digital devices such as sm...
research
05/16/2019

Learning from Context: Exploiting and Interpreting File Path Information for Better Malware Detection

Machine learning (ML) used for static portable executable (PE) malware d...
research
04/13/2022

A Natural Language Processing Approach for Instruction Set Architecture Identification

Binary analysis of software is a critical step in cyber forensics applic...
research
11/10/2019

A Multimodal CNN-based Tool to Censure Inappropriate Video Scenes

Due to the extensive use of video-sharing platforms and services for the...
research
10/05/2022

Using Full-Text Content to Characterize and Identify Best Seller Books

Artistic pieces can be studied from several perspectives, one example be...
