Simplistic Collection and Labeling Practices Limit the Utility of Benchmark Datasets for Twitter Bot Detection

01/17/2023
by Chris Hays, et al.

Accurate bot detection is necessary for the safety and integrity of online platforms. It is also crucial for research on the influence of bots in elections, the spread of misinformation, and financial market manipulation. Platforms deploy infrastructure to flag or remove automated accounts, but their tools and data are not publicly available. Thus, the public must rely on third-party bot detection. These tools employ machine learning and often achieve near-perfect classification performance on existing datasets, suggesting that bot detection is accurate, reliable, and fit for use in downstream applications. We provide evidence that this is not the case and show that high performance is attributable to limitations in dataset collection and labeling rather than to the sophistication of the tools. Specifically, we show that simple decision rules (shallow decision trees trained on a small number of features) achieve near-state-of-the-art performance on most available datasets, and that bot detection datasets, even when combined, do not generalize well to out-of-sample datasets. Our findings reveal that predictions depend heavily on each dataset's collection and labeling procedures rather than on fundamental differences between bots and humans. These results have important implications both for transparency in sampling and labeling procedures and for potential biases in research that uses existing bot detection tools for pre-processing.
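The core claim can be illustrated with a toy experiment. The sketch below is not the authors' code or data: it fits a depth-1 decision tree (a single threshold rule) on one synthetic dataset and scores it on a second synthetic dataset whose bots were "collected" differently. The feature names (`followers`, `tweets_per_day`), the distributions, and all helper names are assumptions made purely for illustration.

```python
import random

def make_dataset(rng, n_per_class, bot_followers_mean, bot_followers_sd):
    """Synthetic accounts as (followers, tweets_per_day); label 1 = bot, 0 = human.
    All distributions here are invented for illustration only."""
    rows, labels = [], []
    for _ in range(n_per_class):
        # Humans follow the same distribution in every synthetic dataset.
        rows.append((rng.gauss(500, 100), rng.gauss(10, 5)))
        labels.append(0)
        # Bots differ per dataset, mimicking a collection artifact.
        rows.append((rng.gauss(bot_followers_mean, bot_followers_sd), rng.gauss(10, 5)))
        labels.append(1)
    return rows, labels

def predict(stump, rows):
    """Apply a rule (feature index, threshold, label predicted above threshold)."""
    j, t, above = stump
    return [above if r[j] > t else 1 - above for r in rows]

def accuracy(stump, rows, labels):
    preds = predict(stump, rows)
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def fit_stump(rows, labels):
    """Exhaustively pick the single-feature threshold rule with the best training accuracy."""
    best_acc, best_stump = -1.0, None
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            for above in (0, 1):
                acc = accuracy((j, t, above), rows, labels)
                if acc > best_acc:
                    best_acc, best_stump = acc, (j, t, above)
    return best_stump

rng = random.Random(0)
# Dataset A: bots happen to have very few followers.
dataset_a = make_dataset(rng, 200, bot_followers_mean=50, bot_followers_sd=20)
# Dataset B: bots happen to have very many followers.
dataset_b = make_dataset(rng, 200, bot_followers_mean=5000, bot_followers_sd=1000)

stump = fit_stump(*dataset_a)
acc_in = accuracy(stump, *dataset_a)   # same dataset the rule was fit on
acc_out = accuracy(stump, *dataset_b)  # dataset with a different collection artifact
print(f"in-sample accuracy: {acc_in:.2f}, cross-dataset accuracy: {acc_out:.2f}")
```

In this toy setup the single rule separates dataset A almost perfectly but performs close to chance on dataset B, because the rule captures how A's bots were gathered rather than anything intrinsic to bots. Real datasets are far messier, so this shows only the mechanism, not the paper's results.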

