A ground-truth dataset and classification model for detecting bots in GitHub issue and PR comments

10/07/2020 ∙ by Mehdi Golzadeh, et al.

Bots are frequently used in GitHub repositories to automate repetitive activities that are part of the distributed software development process. They communicate with human actors through comments. While detecting their presence is important for many reasons, no large and representative ground-truth dataset is available, nor are there classification models to detect and validate bots on the basis of such a dataset. This paper proposes such a ground-truth dataset, based on a manual analysis, with high interrater agreement, of pull request and issue comments from 5,000 distinct GitHub accounts, of which 527 have been identified as bots. Using this dataset, we propose an automated classification model based on the random forest classifier, taking as main features the number of empty and non-empty comments of each account, the number of comment patterns, and the inequality between comments within comment patterns. We obtained very high classification performance (weighted F1-score of 0.99) on the remaining test set, in which 40 accounts were misclassified as humans. We integrated the classification model into an open-source command-line tool to allow practitioners to detect which accounts in a given GitHub repository actually correspond to bots.
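To make the feature set named in the abstract more concrete, the following is a minimal sketch of how such per-account features might be computed. It is not the authors' implementation. The function names `gini` and `account_features` are hypothetical, a "comment pattern" is assumed here to be a group of comments that are identical after lowercasing and whitespace normalization, and the Gini coefficient over pattern sizes is an assumed stand-in for the inequality measure, which the abstract does not specify.

```python
from collections import Counter


def gini(values):
    """Gini coefficient of non-negative values (0.0 means perfectly equal)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard closed form over sorted values with 1-based ranks.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def account_features(comments):
    """Compute illustrative features from one account's comment strings."""
    empty = sum(1 for c in comments if not c.strip())
    non_empty = [c for c in comments if c.strip()]
    # Assumption: comments form one pattern when they match exactly after
    # lowercasing and collapsing whitespace.
    patterns = Counter(" ".join(c.lower().split()) for c in non_empty)
    return {
        "empty_comments": empty,
        "non_empty_comments": len(non_empty),
        "comment_patterns": len(patterns),
        "pattern_size_gini": gini(patterns.values()),
    }
```

Under these assumptions, an account that posts many near-identical comments yields few patterns relative to its comment count, which is the kind of signal a random forest could use to separate bots from humans.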


