A Graph-based Stratified Sampling Methodology for the Analysis of (Underground) Forums

08/18/2023
by Giorgio Di Tizio, et al.

[Context] Researchers analyze underground forums to study abuse and cybercrime activities. Because of the size of the forums and the domain expertise required to identify criminal discussions, most approaches employ supervised machine learning techniques to automatically classify the posts of interest. [Goal] Human annotation is costly. How can we select samples to annotate that account for the structure of the forum? [Method] We present a methodology to generate stratified samples based on information about the centrality properties of the population and evaluate classifier performance. [Result] We observe that by employing a sample obtained from a uniform distribution of the post degree centrality metric, we maintain the same level of precision but significantly increase the recall (+30%) compared to a sample whose distribution respects the population stratification. We find that classifiers trained with similar samples disagree on the classification of criminal activities up to 33% ...
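The contrast between the two sampling strategies in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the degree values are synthetic, and the bin edges and the helper names (`build_strata`, `proportional_sample`, `uniform_sample`) are assumptions made purely for illustration. Posts are grouped into strata by their degree; one sample follows the population proportions, the other draws the same number of posts from every stratum.

```python
import random
from collections import defaultdict


def build_strata(degree_by_post, bin_edges):
    """Group post ids into strata by degree.

    bin_edges are ascending, inclusive upper bounds (hypothetical values);
    degrees above the last edge fall into one extra, final stratum.
    """
    strata = defaultdict(list)
    for post_id, degree in degree_by_post.items():
        idx = next(
            (i for i, edge in enumerate(bin_edges) if degree <= edge),
            len(bin_edges),
        )
        strata[idx].append(post_id)
    return dict(strata)


def proportional_sample(strata, n, rng):
    """Draw about n posts, keeping each stratum's share of the population."""
    total = sum(len(posts) for posts in strata.values())
    sample = []
    for idx in sorted(strata):
        k = round(n * len(strata[idx]) / total)
        sample.extend(rng.sample(strata[idx], min(k, len(strata[idx]))))
    return sample


def uniform_sample(strata, n, rng):
    """Draw the same number of posts from every stratum (capped by its size)."""
    per_stratum = n // len(strata)
    sample = []
    for idx in sorted(strata):
        sample.extend(rng.sample(strata[idx], min(per_stratum, len(strata[idx]))))
    return sample


# Synthetic population: many low-degree posts, few high-degree ones.
degree_by_post = {}
for post_id in range(1000):
    if post_id < 900:
        degree_by_post[post_id] = 1
    elif post_id < 990:
        degree_by_post[post_id] = 5
    else:
        degree_by_post[post_id] = 50

rng = random.Random(0)
strata = build_strata(degree_by_post, bin_edges=[1, 10])
proportional = proportional_sample(strata, 100, rng)
uniform = uniform_sample(strata, 30, rng)

# Count how many high-degree posts each strategy hands to the annotators.
high_in_proportional = sum(1 for p in proportional if degree_by_post[p] == 50)
high_in_uniform = sum(1 for p in uniform if degree_by_post[p] == 50)
print(high_in_proportional, high_in_uniform)
```

In this toy population the proportional sample is dominated by low-degree posts, while the uniform sample gives rare high-degree posts equal weight. That matches the kind of difference between samples the abstract compares, though the sketch makes no claim about classifier precision or recall.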


