A ground-truth dataset and classification model for detecting bots in GitHub issue and PR comments

10/07/2020
by   Mehdi Golzadeh, et al.
0

Bots are frequently used in Github repositories to automate repetitive activities that are part of the distributed software development process. They communicate with human actors through comments. While detecting their presence is important for many reasons, no large and representative ground-truth dataset is available, nor are classification models to detect and validate bots on the basis of such a dataset. This paper proposes such a ground-truth dataset, based on a manual analysis with high interrater agreement, of pull request and issue comments in 5,000 distinct Github accounts of which 527 accounts have been identified as bots. Using this dataset we propose an automated classification model based on the random forest classifier, taking as main features the number of empty and non-empty comments of each account, the number of comment patterns, and the inequality between comments within comment patterns. We obtained a very high accuracy (weighted F1-score of 0.99) on the remaining test set containing 40 misclassified as humans. We integrated the classification model into an open source command-line tool, to allow practitioners to detect which accounts in a given Github repository actually correspond to bots.

READ FULL TEXT

Please sign up or login with your details

Forgot password? Click here to reset