Representation of texts as complex networks: a mesoscopic approach

06/30/2016
by Henrique F. de Arruda, et al.

Statistical techniques for analyzing texts, collectively known as text analytics, have moved beyond simple word-count statistics towards a new paradigm. Text mining now hinges on a more sophisticated set of methods, including representations in terms of complex networks. While well-established word-adjacency (co-occurrence) methods successfully capture syntactical features of written texts, they are unable to represent important aspects of textual data, such as topical structure, i.e., the sequence of subjects developing at a mesoscopic level along the text. Such aspects are often overlooked by current methodologies. To capture the mesoscopic characteristics of semantic content in written texts, we devised a network model that is able to analyze documents in a multi-scale fashion. In the proposed model, each node represents a limited number of adjacent paragraphs, and two nodes are connected whenever they share a minimum amount of semantic content. To illustrate the capabilities of our model, we present, as a case example, a qualitative analysis of "Alice's Adventures in Wonderland". We show that the mesoscopic structure of a document, modeled as a network, reveals many semantic traits of texts. Such an approach paves the way to a myriad of semantic-based applications. In addition, our approach is illustrated in a machine learning context, in which texts are classified as either real texts or randomized instances.
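
The construction described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how shared semantic content is measured, so the sketch assumes cosine similarity between raw word-count vectors, and the names `window_vectors`, `cosine`, `mesoscopic_network`, `window`, and `threshold` are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def window_vectors(paragraphs, window=2):
    """Build one word-count vector per sliding window of adjacent paragraphs.

    Assumption: windows overlap and advance one paragraph at a time.
    """
    vectors = []
    for start in range(len(paragraphs) - window + 1):
        chunk = " ".join(paragraphs[start:start + window]).lower()
        vectors.append(Counter(re.findall(r"[a-z']+", chunk)))
    return vectors


def cosine(a, b):
    """Cosine similarity between two word-count vectors (assumed measure)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mesoscopic_network(paragraphs, window=2, threshold=0.3):
    """Return (number of nodes, set of edges).

    Each node is a window of adjacent paragraphs; two nodes are linked
    when their similarity reaches the (hypothetical) threshold.
    """
    vectors = window_vectors(paragraphs, window)
    edges = set()
    for i, j in combinations(range(len(vectors)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            edges.add((i, j))
    return len(vectors), edges


# Toy input: paragraphs are assumed to be separated by blank lines.
text = "a b c\n\nx y z\n\na b c"
paragraphs = [p for p in text.split("\n\n") if p.strip()]
n_nodes, edges = mesoscopic_network(paragraphs, window=1, threshold=0.5)
print(n_nodes, sorted(edges))
```

Varying `window` changes the scale at which the text is observed, which is one way to read the "multi-scale" aspect mentioned above. With raw counts, frequent function words inflate similarity, so a realistic pipeline would likely remove stopwords or reweight terms before thresholding.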


research
06/22/2018

Paragraph-based complex networks: application to document classification and authenticity verification

With the increasing number of texts made available on the Internet, many...
research
05/01/2017

Labelled network subgraphs reveal stylistic subtleties in written texts

The vast amount of data and increase of computational capacity have allo...
research
07/29/2016

Text authorship identified using the dynamics of word co-occurrence networks

The identification of authorship in disputed documents still requires hu...
research
01/17/2022

Accessibility and Trajectory-Based Text Characterization

Several complex systems are characterized by presenting intricate charac...
research
05/18/2019

Semantic flow in language networks

In this study we propose a framework to characterize documents based on ...
research
05/04/2022

Using virtual edges to extract keywords from texts modeled as complex networks

Detecting keywords in texts is important for many text mining applicatio...
research
11/15/2017

Detecting and assessing contextual change in diachronic text documents using context volatility

Terms in diachronic text corpora may exhibit a high degree of semantic d...
