Linguistic data mining with complex networks: a stylometric-oriented approach

08/16/2018
by Tomasz Stanisz, et al.

Representing a text as a set of words and their co-occurrences yields a word-adjacency network, which is in a way a reduced representation of the given language sample. This paper studies whether such a network representation can be used to extract information about the individual language styles of literary texts. Determining selected quantitative characteristics of the networks and applying machine learning algorithms makes it possible to distinguish between texts by different authors. Within the studied set of texts in English and Polish, the properly rescaled weighted clustering coefficients and weighted degrees of only a few nodes in the word-adjacency networks suffice to attribute authorship with an accuracy of over 90%. A correspondence between the authorship of texts and the structure of word-adjacency networks can therefore clearly be found; the network representation makes it possible to distinguish individual language styles by comparing the way the authors use particular words and punctuation marks. The presented approach can be viewed as a generalization of authorship attribution methods based on the simplest lexical features. Apart from the characteristics given above, other network parameters are studied, both local and global, for both unweighted and weighted networks. Their potential to capture the diversity of writing styles is discussed, and some differences between languages are also observed.
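To make the two node-level quantities named in the abstract concrete, here is a minimal sketch of a word-adjacency network built with the Python standard library. Several choices are assumptions and are not taken from the paper: the tokenizer, the undirected edges weighted by adjacency counts, the skipping of self-loops, and the use of the weighted clustering coefficient as defined by Barrat et al. The function names are hypothetical. The rescaling step that the abstract calls "properly rescaled" is not reproduced here.

```python
import re
from collections import defaultdict
from itertools import combinations


def tokenize(text):
    # Assumption: words and a few punctuation marks are both treated as nodes.
    return re.findall(r"[a-z']+|[.,;:!?]", text.lower())


def build_network(tokens):
    # Undirected weighted graph stored as a dict of dicts.
    # The weight of an edge is the number of times two tokens are adjacent.
    graph = defaultdict(lambda: defaultdict(int))
    for a, b in zip(tokens, tokens[1:]):
        if a != b:  # assumption: self-loops are ignored
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def weighted_degree(graph, node):
    # Weighted degree (node strength): sum of the weights of incident edges.
    return sum(graph[node].values())


def weighted_clustering(graph, node):
    # Weighted clustering coefficient in the Barrat et al. form (an assumption):
    # C_i = 1 / (s_i * (k_i - 1)) * sum over ordered neighbour pairs (j, h)
    #       that are themselves connected of (w_ij + w_ih) / 2.
    neighbours = list(graph[node])
    k = len(neighbours)
    if k < 2:
        return 0.0
    s = weighted_degree(graph, node)
    total = 0.0
    for j, h in combinations(neighbours, 2):
        if h in graph[j]:
            total += (graph[node][j] + graph[node][h]) / 2
    # combinations() yields unordered pairs, hence the factor of 2.
    return 2 * total / (s * (k - 1))


if __name__ == "__main__":
    sample = "the cat sat , the cat ran , the dog sat ."
    network = build_network(tokenize(sample))
    for word in ("the", ","):
        print(word, weighted_degree(network, word), weighted_clustering(network, word))
```

In a stylometric setting, values like these for a handful of frequent nodes (for example common function words and punctuation marks) could be collected into a feature vector per text and passed to a classifier; which nodes and which classifier to use is left open in this sketch.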

research
04/09/2015

Concentric network symmetry grasps authors' styles in word adjacency networks

Several characteristics of written texts have been inferred from statist...
research
05/29/2017

On the "Calligraphy" of Books

Authorship attribution is a natural language processing task that has be...
research
03/26/2015

Unsupervised authorship attribution

We describe a technique for attributing parts of a written text to a set...
research
05/11/2017

On the role of words in the network structure of texts: application to authorship attribution

Well-established automatic analyses of texts mainly consider frequencies...
research
10/18/2016

Stylometric Analysis of Early Modern Period English Plays

Function word adjacency networks (WANs) are used to study the authorship...
research
01/30/2018

Manuscripts in Time and Space: Experiments in Scriptometrics on an Old French Corpus

Witnesses of medieval literary texts, preserved in manuscript, are layer...
research
04/09/2020

Two halves of a meaningful text are statistically different

Which statistical features distinguish a meaningful text (possibly writt...
