Seeing without Looking: Analysis Pipeline for Child Sexual Abuse Datasets

04/29/2022
by Camila Laranjeira, et al.

The online sharing and viewing of Child Sexual Abuse Material (CSAM) are growing so fast that human experts can no longer keep up with manual inspection. However, the automatic classification of CSAM is a challenging field of research, largely due to the inaccessibility of target data that is - and should forever be - private and in the sole possession of law enforcement agencies. To help researchers draw insights from unseen data and safely gain further understanding of CSAM images, we propose an analysis template that goes beyond the statistics of the dataset and its labels. It focuses on extracting automatic signals, provided both by pre-trained machine learning models (e.g., object categories and pornography detection) and by image metrics such as luminance and sharpness. Only aggregated statistics of sparse signals are provided, to guarantee the anonymity of the victimized children and adolescents. The pipeline allows filtering the data by applying thresholds to each specified signal, and it provides the distribution of those signals within the resulting subset, correlations between signals, and a bias evaluation. We demonstrated our proposal on the Region-based annotated Child Pornography Dataset (RCPD), one of the few CSAM benchmarks in the literature, composed of over 2000 samples spanning regular and CSAM images and produced in partnership with Brazil's Federal Police. Although automatic signals are noisy and limited in several senses, we argue that they can highlight important aspects of the overall distribution of data, which is valuable for databases that cannot be disclosed. Our goal is to safely publicize the characteristics of CSAM datasets, encouraging researchers to join the field and perhaps other institutions to provide similar reports on their benchmarks.
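
The filter-then-aggregate idea described in the abstract can be illustrated with a minimal sketch, assuming the per-image signals have already been extracted into numeric records. The field names (`luminance`, `sharpness`, `nsfw_score`), the threshold ranges, and the helper functions are all hypothetical, the values are synthetic placeholders, and this is not the authors' implementation. Only aggregate statistics leave the functions; no individual record is reported.

```python
import math
from statistics import mean, median

# Hypothetical per-image signal records. The values are synthetic
# placeholders and are not taken from RCPD or any real dataset.
records = [
    {"luminance": 0.62, "sharpness": 0.40, "nsfw_score": 0.10},
    {"luminance": 0.35, "sharpness": 0.75, "nsfw_score": 0.80},
    {"luminance": 0.50, "sharpness": 0.55, "nsfw_score": 0.45},
    {"luminance": 0.20, "sharpness": 0.90, "nsfw_score": 0.95},
]


def filter_by_thresholds(rows, thresholds):
    """Keep rows whose signals fall inside every (low, high) range given."""
    return [
        row
        for row in rows
        if all(low <= row[name] <= high for name, (low, high) in thresholds.items())
    ]


def summarize(rows, signal):
    """Return aggregate statistics of one signal over the given rows."""
    values = [row[signal] for row in rows]
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


def pearson(rows, signal_a, signal_b):
    """Pearson correlation between two signals (NaN if either is constant)."""
    xs = [row[signal_a] for row in rows]
    ys = [row[signal_b] for row in rows]
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom > 0 else float("nan")


# Select a subset by thresholding one signal, then report only aggregates.
subset = filter_by_thresholds(records, {"nsfw_score": (0.4, 1.0)})
luminance_stats = summarize(subset, "luminance")
luminance_vs_sharpness = pearson(records, "luminance", "sharpness")
```

In this sketch a threshold is a closed `(low, high)` range per signal, so several signals can be constrained at once by adding entries to the dictionary.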


