Learning Stylometric Representations for Authorship Analysis

06/03/2016
by Steven H. H. Ding, et al.

Authorship analysis (AA) is the study of unveiling the hidden properties of authors from a rapidly growing body of textual data. It infers an author's identity and sociolinguistic characteristics from the writing style reflected in the text. It is an essential process in areas such as cybercrime investigation, psycholinguistics, and political socialization. However, most previous techniques depend critically on manual feature engineering, and the choice of feature set has consequently been shown to be scenario- or dataset-dependent. In this paper, to mimic the human sentence composition process using a neural network approach, we propose to incorporate different categories of linguistic features into distributed representations of words in order to simultaneously learn writing style representations from unlabeled texts for authorship analysis. In particular, the proposed models allow topical, lexical, syntactical, and character-level feature vectors of each document to be extracted as stylometrics. We evaluate our approach on the problems of authorship characterization and authorship verification with Twitter, novel, and essay datasets. The experiments suggest that our proposed text representation outperforms the bag-of-lexical-n-grams, Latent Dirichlet Allocation, Latent Semantic Analysis, PVDM, PVDBOW, and word2vec representations.
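For orientation, the kind of hand-engineered n-gram representation that the abstract lists as a baseline can be sketched in a few lines. This is not the model proposed in the paper; it is a minimal illustration of a character n-gram bag compared by cosine similarity for a verification-style decision. The function names (`char_ngrams`, `cosine`, `same_author`), the choice of `n=3`, and the `threshold` value are all assumptions made for this sketch.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no style signal to compare
    return dot / (norm_a * norm_b)


def same_author(known, questioned, threshold=0.5, n=3):
    """Hypothetical verification rule: accept if similarity meets a threshold.

    The threshold is an arbitrary illustrative value, not one from the paper.
    """
    return cosine(char_ngrams(known, n), char_ngrams(questioned, n)) >= threshold
```

In practice the threshold and the n-gram order would have to be tuned per dataset, which is the scenario dependence the abstract identifies as a weakness of manually engineered features.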


research
02/24/2019

Text Analysis in Adversarial Settings: Does Deception Leave a Stylistic Trace?

Textual deception constitutes a major problem for online security. Many ...
research
07/22/2018

Examining Scientific Writing Styles from the Perspective of Linguistic Complexity

Publishing articles in high-impact English journals is difficult for sch...
research
04/23/2018

Discovering Style Trends through Deep Visually Aware Latent Item Embeddings

In this paper, we explore Latent Dirichlet Allocation (LDA) and Polyling...
research
04/07/2023

GEMINI: Controlling the Sentence-level Writing Style for Abstractive Text Summarization

Human experts write summaries using different techniques, including rewr...
research
07/12/2017

The Case for Being Average: A Mediocrity Approach to Style Masking and Author Obfuscation

Users posting online expect to remain anonymous unless they have logged ...
research
08/14/2017

Continuous Representation of Location for Geolocation and Lexical Dialectology using Mixture Density Networks

We propose a method for embedding two-dimensional locations in a continu...
research
05/16/2022

Quantitative Discourse Cohesion Analysis of Scientific Scholarly Texts using Multilayer Networks

Discourse cohesion facilitates text comprehension and helps the reader f...
