PART: Pre-trained Authorship Representation Transformer

09/30/2022
by Javier Huertas-Tato, et al.

Authors imprint identifying information in the documents they write: vocabulary, register, punctuation, misspellings, or even emoji usage. Detecting these details is highly relevant for author profiling, revealing traits such as gender, occupation, or age. Most importantly, recurring writing patterns can help attribute authorship to a text. Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors. A better approach is to learn stylometric representations, but this is itself an open research challenge. In this paper, we propose PART: a contrastively trained model fit to learn authorship embeddings instead of semantics. By comparing pairs of documents written by the same author during training, we are able to determine the authorship of a text by evaluating the cosine similarity of the documents' embeddings, a zero-shot generalization to authorship identification. To this end, a pre-trained Transformer with an LSTM head is trained with a contrastive objective. We train our model on a diverse set of authors from literature, anonymous blog posts, and corporate emails: a heterogeneous set with distinct and identifiable writing styles. Evaluated on these datasets, the model achieves 72.39% zero-shot accuracy and 86.73% top-5 accuracy on the joint evaluation dataset when determining authorship from a set of 250 different authors. We qualitatively assess the representations with data visualizations on the available datasets, profiling features such as book type, gender, age, or occupation of the author.
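The zero-shot identification scheme the abstract describes can be sketched as follows: each document is mapped to an embedding, and a query text is attributed to whichever known author's reference embedding is closest by cosine similarity. This is a minimal illustration, not the paper's implementation; the random vectors stand in for outputs of the Transformer-plus-LSTM encoder, and the function names (`cosine_similarity`, `attribute_author`) are hypothetical.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def attribute_author(query: np.ndarray, reference: dict) -> str:
    """Return the author whose reference embedding is most similar to the query."""
    return max(reference, key=lambda name: cosine_similarity(query, reference[name]))

# Stand-in embeddings; in PART these would come from the trained encoder.
rng = np.random.default_rng(0)
refs = {name: rng.standard_normal(128) for name in ("author_a", "author_b", "author_c")}

# A query embedding close to author_b's style vector is attributed to author_b.
query = refs["author_b"] + 0.1 * rng.standard_normal(128)
print(attribute_author(query, refs))  # → author_b
```

Because attribution reduces to a nearest-neighbor lookup in embedding space, previously unseen authors can be handled at evaluation time simply by adding their reference embeddings, which is what makes the approach zero-shot.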
