Stylistic Fingerprints, POS-tags and Inflected Languages: A Case Study in Polish

06/05/2022
by Maciej Eder, et al.

In stylometric investigations, frequencies of the most frequent words (MFWs) and character n-grams outperform other style-markers, even though their performance varies significantly across languages. In inflected languages, word endings play a prominent role, and hence different forms of the same word cannot be recognized as such by generic text tokenization. The countless inflected word forms make frequencies sparse, which complicates most statistical procedures. Presumably, applying NLP techniques such as lemmatization and/or parsing might increase classification performance. The aim of this paper is to examine the usefulness of grammatical features (as assessed via POS-tag n-grams) and lemmatized forms in recognizing authorial profiles, in order to address the underlying issue of the degree of freedom of choice within lexis and grammar. Using a corpus of Polish novels, we performed a series of supervised authorship attribution benchmarks to compare the classification accuracy of different types of lexical and syntactic style-markers. Even though the performance of POS-tags and lemmatized forms was consistently worse than that of lexical markers, the difference was not substantial and never exceeded ca. 15
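To make the feature types named in the abstract concrete, the following is a minimal Python sketch of how relative frequencies of the most frequent words and of POS-tag n-grams could be turned into feature vectors. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the toy Polish sentence, and the simplified tag labels are all hypothetical, and in practice the tags would come from a morphological tagger.

```python
from collections import Counter


def ngrams(items, n):
    """Return all n-grams (as tuples) over a sequence of tokens or tags."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def relative_freqs(items, n=1):
    """Relative frequency of each n-gram within a single text."""
    counts = Counter(ngrams(items, n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def most_frequent_features(texts, n=1, top=100):
    """Select the `top` most frequent n-grams across the whole corpus."""
    corpus_counts = Counter()
    for items in texts:
        corpus_counts.update(ngrams(items, n))
    return [gram for gram, _ in corpus_counts.most_common(top)]


def feature_vector(items, features, n=1):
    """Map one text onto a fixed feature list; absent features get 0.0."""
    freqs = relative_freqs(items, n)
    return [freqs.get(feature, 0.0) for feature in features]


# Toy input (hypothetical): word forms and simplified POS tags for one sentence.
words = ["kot", "śpi", "i", "pies", "śpi"]
tags = ["NOUN", "VERB", "CONJ", "NOUN", "VERB"]

# Lexical markers: most frequent word forms (n=1).
word_features = most_frequent_features([words], n=1, top=2)
word_vector = feature_vector(words, word_features, n=1)

# Grammatical markers: POS-tag bigrams (n=2).
tag_features = most_frequent_features([tags], n=2, top=2)
tag_vector = feature_vector(tags, tag_features, n=2)
```

Vectors of this kind, computed per text, could then be passed to any supervised classifier. Running the same functions on lemmas instead of word forms would give the lemmatized variant.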
