The Mafiascum Dataset: A Large Text Corpus for Deception Detection

11/19/2018 ∙ by Bob de Ruiter, et al.

Detecting deception in natural language has a wide variety of applications, but because deception is hidden by nature, there are no public, large-scale sources of labeled deceptive text. This work introduces the Mafiascum dataset [1], a collection of over 700 games of Mafia, in which players are randomly assigned either deceptive or non-deceptive roles and then interact via forum postings. Almost 10,000 documents were compiled from the dataset, each containing all messages written by a single player in a single game. This corpus was used to construct a set of hand-picked linguistic features based on prior deception research and a set of average word vectors enriched with subword information. An SVM classifier fit on a combination of these feature sets achieved an area under the precision-recall curve of 0.35 (chance = 0.26) and an ROC AUC of 0.64 (chance = 0.50).
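The two chance levels quoted above differ because the metrics treat class imbalance differently. A scorer that ranks documents at random is expected to reach an ROC AUC of about 0.50 regardless of class balance, while its area under the precision-recall curve is expected to sit near the share of positive documents. The sketch below illustrates this with plain Python on synthetic labels. It assumes that the 0.26 chance level corresponds to roughly 26% of documents carrying the deceptive label, and it uses average precision as a stand-in for the area under the precision-recall curve; the function names and the synthetic data are illustrative and are not taken from the paper.

```python
import random


def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive item."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Synthetic stand-in data (assumption): 2,000 documents, 26% labeled deceptive.
rng = random.Random(0)
labels = [1] * 520 + [0] * 1480
random_scores = [rng.random() for _ in labels]  # a scorer with no signal at all

chance_roc = roc_auc(labels, random_scores)
chance_ap = average_precision(labels, random_scores)
print(f"ROC AUC of random scorer:           {chance_roc:.3f}")
print(f"average precision of random scorer: {chance_ap:.3f}")
```

Under these assumptions, the reported 0.35 and 0.64 are best read against their respective baselines rather than against each other.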


