An Investigation of Supervised Learning Methods for Authorship Attribution in Short Hinglish Texts using Char & Word N-grams

12/26/2018
by Abhay Sharma, et al.

A person's writing style can serve as a unique identity indicator: word choice and sentence structure are measures that can identify the author of a specific work. Stylometry and its subfield, authorship attribution, have a long history beginning in the 19th century and remain in use today. The emergence of the Internet has shifted attribution studies toward non-standard texts that are shorter than, and different from, the long texts on which most research has been done. This paper studies short online texts retrieved from the messaging application WhatsApp, examining the distinctive features of a macaronic language (Hinglish) using supervised learning methods and then comparing the resulting models. Features such as word n-grams and character n-grams are compared across four methods, namely the Naive Bayes Classifier, Support Vector Machine, Conditional Tree, and Random Forest, to find the best discriminator for such corpora. Our results showed that SVM attained a test accuracy of up to 95.079 [...] an accuracy of up to 94.455 [...] failed to perform as well as expected. We also found that word unigram and character 3-gram features were more likely to distinguish authors accurately than other features.
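To make the feature types named in the abstract concrete, the following is a minimal sketch of character and word n-gram extraction feeding a small multinomial Naive Bayes classifier with add-one smoothing. It is not the authors' implementation: the function names, the `NaiveBayesAttributor` class, the tokenization (lowercasing and whitespace splitting), and the four toy Hinglish messages are all assumptions made for illustration, and the paper's actual preprocessing and tooling are not stated here.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n=1):
    """Return overlapping word n-grams (tuples) after whitespace tokenization."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class NaiveBayesAttributor:
    """Hypothetical multinomial Naive Bayes over n-gram counts (add-one smoothing)."""

    def __init__(self, featurize):
        self.featurize = featurize    # function: message -> list of features
        self.counts = {}              # author -> Counter of feature counts
        self.doc_counts = Counter()   # author -> number of training messages
        self.vocab = set()            # all features seen during training

    def fit(self, messages, authors):
        for message, author in zip(messages, authors):
            feats = self.featurize(message)
            self.counts.setdefault(author, Counter()).update(feats)
            self.doc_counts[author] += 1
            self.vocab.update(feats)
        return self

    def predict(self, message):
        feats = self.featurize(message)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_author, best_score = None, -math.inf
        for author, counter in self.counts.items():
            total = sum(counter.values())
            # Log prior plus smoothed log likelihood of each observed feature.
            score = math.log(self.doc_counts[author] / total_docs)
            for feat in feats:
                score += math.log((counter[feat] + 1) / (total + vocab_size))
            if score > best_score:
                best_author, best_score = author, score
        return best_author


# Toy, made-up messages standing in for a real chat corpus.
messages = [
    "kya haal hai bro",
    "bro kal milte hai",
    "hello ji kaise ho aap",
    "aap kab aaoge ji",
]
authors = ["A", "A", "B", "B"]

# Character 3-grams as features; swap in word_ngrams for word unigrams.
model = NaiveBayesAttributor(lambda t: char_ngrams(t, 3)).fit(messages, authors)
print(model.predict("milte hai bro"))
print(model.predict("kaise ho ji"))
```

Character n-grams are a natural fit for romanized, spelling-variable text because they capture sub-word habits without depending on a fixed vocabulary. A real comparison along the lines described above would need a much larger corpus, a held-out test split, and the other classifiers as well.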
