Boosting word frequencies in authorship attribution

11/02/2022
by Maciej Eder, et al.

In this paper, I introduce a simple method of computing relative word frequencies for authorship attribution and similar stylometric tasks. Rather than computing the relative frequency of a word as its number of occurrences divided by the total number of tokens in a text, I argue that a more effective normalization factor is the total number of relevant tokens only. The set of relevant words includes synonyms and, usually, a few dozen other words that are semantically similar to the word in question. To determine this semantic background, a word embedding model can be used. The proposed method substantially outperforms classical most-frequent-word approaches, usually by a few percentage points, depending on the input settings.
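To make the normalization concrete: for a word w with semantic background B(w) (w itself plus its nearest neighbours in an embedding space), the boosted relative frequency is count(w) divided by the summed counts of all words in B(w), rather than by the total token count of the text. Below is a minimal sketch of that idea in Python. The function name, the neighbourhood size of 50, and the use of a gensim KeyedVectors model are my own illustrative assumptions, not the paper's reference implementation.

```python
from collections import Counter

def relevant_frequency(word, tokens, embeddings, background_size=50):
    """Relative frequency of `word`, normalized by its semantic background.

    Instead of dividing by the total number of tokens in the text,
    divide by the joint count of `word` and its nearest neighbours
    in a word embedding model (here: a gensim KeyedVectors object).
    """
    counts = Counter(tokens)
    # Semantic background: the word itself plus its closest neighbours
    # in the embedding space (assumes `word` is in the model's vocabulary).
    neighbours = [w for w, _ in embeddings.most_similar(word, topn=background_size)]
    background = [word] + neighbours
    denominator = sum(counts[w] for w in background)
    return counts[word] / denominator if denominator else 0.0

# Hypothetical usage:
# from gensim.models import KeyedVectors
# vectors = KeyedVectors.load("embeddings.kv")
# freq = relevant_frequency("said", tokenized_text, vectors)
```

A feature vector per text can then be built by applying this to each of the most frequent words, exactly as one would with classical relative frequencies.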


Related research

08/05/2021
Counting scattered palindromes in a finite word
We investigate the scattered palindromic subwords in a finite word. We s...

06/09/2010
Measuring Meaning on the World-Wide Web
We introduce the notion of the 'meaning bound' of a word with respect to...

03/15/2018
Advancing Acoustic-to-Word CTC Model
The acoustic-to-word model based on the connectionist temporal classific...

08/18/2019
Parsimonious Morpheme Segmentation with an Application to Enriching Word Embeddings
Traditionally, many text-mining tasks treat individual word-tokens as th...

03/07/2020
Discovering linguistic (ir)regularities in word embeddings through max-margin separating hyperplanes
We experiment with new methods for learning how related words are positi...

09/08/2017
A Statistical Comparison of Some Theories of NP Word Order
A frequent object of study in linguistic typology is the order of elemen...
