Paragraph-based complex networks: application to document classification and authenticity verification

06/22/2018
by   Henrique F. de Arruda, et al.

With the increasing number of texts made available on the Internet, many applications have relied on text mining tools to tackle a diversity of problems. A relevant model to represent texts is the so-called word adjacency (co-occurrence) representation, which is known to capture mainly syntactical features of texts. In this study, we introduce a novel network representation that considers the semantic similarity between paragraphs. Two main properties of paragraph networks are considered: (i) their ability to incorporate characteristics that can discriminate real from artificial, shuffled manuscripts and (ii) their ability to capture syntactical and semantic textual features. Our results revealed that real texts are organized into communities, which turned out to be an important feature for discriminating them from artificial texts. Interestingly, we have also found that, differently from traditional co-occurrence networks, the adopted representation is able to capture semantic features. Additionally, the proposed framework was employed to analyze the Voynich manuscript, which was found to be compatible with texts written in natural languages. Taken together, our findings suggest that the proposed methodology can be combined with traditional network models to improve text classification tasks.


1 Introduction

Due to the ever-increasing number of available online texts, many machine learning techniques have been developed to treat this kind of information Manning et al. (2008); Stamatatos (2009); Pak and Paroubek (2010); Yaveroğlu et al. (2014); Symeonidis et al. (2018); Sicilia et al. (2018); Xiong et al. (2018). Among many statistical methods, network-based approaches have also been proposed to address several natural language processing problems, including writing style analysis Amancio (2015c), authorship attribution Mehri et al. (2012); Amancio et al. (2011) and sentiment analysis Zhao et al. (2014). Several graph-based approaches hinge on the topological information of the obtained networks to perform some type of classification Cancho and Solé (2001); Liu and Cong (2013); Amancio (2015b); Erkan and Radev (2004); Angelova and Weikum (2006); Jin and Srihari (2007); Yu et al. (2017).

A well-known representation of texts as complex networks is the co-occurrence model Cancho and Solé (2001); Liu and Cong (2013); Amancio (2015b); Wachs-Lopes and Rodrigues (2016). This model represents words as nodes, and edges are established for every pair of adjacent words. Recently, this representation was found to capture mainly syntax features Amancio et al. (2013); Masucci and Rodgers (2006), which has been confirmed by numerous works using co-occurrence networks to study language styles Segarra et al. (2015); Masucci and Rodgers (2006); Arruda et al. (2016b); Mehri et al. (2012); Amancio (2015a); Cong and Liu (2014). In order to grasp features that go beyond syntax, other models have been proposed. In Arruda et al. (2016a), the authors still consider words as nodes, but the connections are created considering a larger window, rather than only consecutive words. Upon applying community detection methods, this approach was successfully employed to detect topics. Regarding the mesoscopic scale, a network based on similarity of large chunks was proposed in Arruda et al. (2018). This methodology was found to be useful to understand and visualize the unfolding of stories Marinho et al. (2017). In the summarization context, another approach that also took into consideration larger chunks of texts is the network of connected paragraphs Salton et al. (1997).

In this work, we propose a novel paragraph-based network, which takes into consideration textual similarity by employing tf-idf (term frequency-inverse document frequency) weighting Manning and Schütze (1999) together with the cosine similarity. Differently from previous approaches Salton et al. (1997), the paragraph-based networks considered here are analyzed in terms of their topological and dynamical properties. The properties of the adopted network representation were probed by considering two different criteria. To test the informativeness of the networks, we investigated whether paragraph-based networks are able to discriminate real from shuffled texts. In the second test, we analyzed whether the networks are able to capture syntactic and, most importantly, semantic textual information. Our results showed that the modularity played an important role in distinguishing real from shuffled texts, since the presence of communities turned out to be a characteristic inherent to real texts. We also found that particular measurements are able to capture semantic features of texts, a feature that has not been observed in most co-occurrence networks modeling texts Amancio et al. (2013).

In addition to the analysis aimed at better understanding the statistical properties of paragraph-based networks, we probed the statistical properties of an unknown text, the Voynich manuscript, using the framework proposed here. Differently from other approaches, we did not assume that the pages are organized in any specific order. This is an important feature because a recent study revealed that the traditionally assumed page ordering might be unreliable Reddy and Knight (2011). Interestingly, our results indicate that the Voynich manuscript is compatible with natural languages and incompatible with shuffled texts. These conclusions were mostly corroborated by observing the community structure arising from the manuscript.

The remainder of this paper is organized as follows. In Section 2, we present the employed datasets, the proposed methodology, and the adopted complex network measurements. Section 3 presents an analysis of the paragraph-network properties, in which real documents are compared with two versions of shuffled texts. Furthermore, in the same section, we present a case study in which we analyze the Voynich manuscript. Finally, in Section 4, we conclude the study and provide perspectives for further works.

2 Materials and methods

This section describes the employed datasets, the approach devised to create paragraph-based networks and the measurements extracted from the text networks.

2.1 Dataset

We employed two datasets. The first one, henceforth referred to as the Holy Bible dataset, was used to represent the variation of syntax across different languages when the text/content is the same. It comprises three books from the New Testament of the Holy Bible: Matthew, Mark and Luke. Sixteen different languages were considered: Arabic, Basque, English, Esperanto, German, Greek, Hebrew, Hungarian, Korean, Latin, Maori, Portuguese, Russian, Swahili, Vietnamese, and Xhosa. The three books were concatenated into a single document so as to obtain a larger text, as our method is more reliable when larger pieces of text are used to construct the network. This same procedure has been applied in similar studies Amancio et al. (2013). For all considered languages, the paragraphs comprise the same verses and were manually identified.

The second dataset, henceforth referred to as Books dataset, comprises books in different languages, namely English, French, German, Italian and Portuguese. This dataset was used to analyze how the network structure varies across different documents in the same language. The list of books is presented in Appendix A.

2.2 Paragraph-based networks

In this work, texts are modeled as complex networks. A network (or graph) can be defined as a set of nodes and a set of edges. In an unweighted network, the element $a_{ij}$ of the adjacency matrix is equal to 1 if node $i$ is connected to node $j$; otherwise, $a_{ij} = 0$. In weighted networks, the element $w_{ij}$ corresponds to the weight of the link between nodes $i$ and $j$.

The main objective of the adopted network model is to represent how short contexts (i.e. paragraphs) semantically relate to each other in a textual document. To create a paragraph-based network, the raw text is divided into chunks of paragraphs. Each paragraph is considered as a network node, as illustrated in Figure 1(a). In order to establish links between paragraphs, each node is considered as a document $d$ in the set of documents. The tf-idf (term frequency-inverse document frequency) weighting map Manning and Schütze (1999) is then computed to quantify the relevance of each word $w$:

(1)   $\text{tf-idf}(w, d) = \frac{f_{w,d}}{N_d} \log\left(\frac{D}{D_w}\right),$

where $f_{w,d}$ is the frequency of $w$ in $d$, $N_d$ is the total number of words in $d$, $D$ is the total number of documents (paragraphs) and $D_w$ is the number of documents (paragraphs) in which $w$ appears. For each paragraph, a vector containing the tf-idf weights of the words is created, and then edge weights are computed by using the cosine similarity for all pairs of paragraphs (nodes). Note that this methodology creates a fully connected, weighted graph, as illustrated in Figure 1(a). Because many complex network measurements are defined only for unweighted networks, we removed the weakest edges using a threshold $T$. For the considered networks, we chose a threshold $T$ for each network in order to keep all networks with the same size and density. This is an important step in the pre-processing phase because several network measurements are known to be very sensitive to both size and density Costa et al. (2007); Amancio (2015c). In preliminary experiments, we found that a perturbation in $T$ does not alter the conclusions reported here. The effect of thresholding the weighted network is illustrated in Figure 1(b).

Figure 1: Example of thresholding paragraph-based networks. In the weighted version (a), all nodes (paragraphs) are connected among themselves and the weight of each edge is given by the textual (semantical) similarity between the nodes. The unweighted version (b) is obtained by removing all edges with weights below a given threshold T.
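As an illustration of the pipeline described above, the sketch below builds a paragraph network with scikit-learn and networkx. It is a minimal prototype under our own assumptions (paragraphs separated by blank lines, an arbitrary target density, and illustrative function names), not the authors' implementation.

```python
# Minimal sketch of the paragraph-based network construction (illustrative only).
# Assumptions: paragraphs separated by blank lines; target density chosen arbitrarily.
import numpy as np
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def paragraph_network(text, target_density=0.1):
    # Split the raw text into paragraphs (nodes).
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    # tf-idf vectors, one per paragraph, and pairwise cosine similarities.
    tfidf = TfidfVectorizer().fit_transform(paragraphs)
    sim = cosine_similarity(tfidf)
    np.fill_diagonal(sim, 0.0)
    # Choose the threshold T that keeps roughly the desired edge density:
    # keep the top fraction of the strongest similarities.
    weights = np.sort(sim[np.triu_indices_from(sim, k=1)])[::-1]
    n_keep = max(1, int(target_density * len(weights)))
    T = weights[n_keep - 1]
    # Build the unweighted network by keeping edges with weight >= T.
    g = nx.Graph()
    g.add_nodes_from(range(len(paragraphs)))
    for i in range(len(paragraphs)):
        for j in range(i + 1, len(paragraphs)):
            if sim[i, j] >= T:
                g.add_edge(i, j)
    return g

# Example usage:
# g = paragraph_network(open("book.txt", encoding="utf8").read())
# print(g.number_of_nodes(), g.number_of_edges())
```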

The proposed methodology is similar to the mesoscopic networks approach Arruda et al. (2018) in terms of the network edge weights. Actually, paragraph networks can be understood as a specific case of mesoscopic networks in which each chunk of text is a single paragraph, with no forced overlap between adjacent chunks. An advantage of the present approach is that we can analyze texts in which the order of the pages and paragraphs is unknown. Note that other word representation techniques, such as word embeddings, were not considered here because the proposed method was developed to be applicable even to texts whose language is unknown.

2.3 Network variations

In order to compare real and shuffled texts, three types of networks were considered. The paragraph-based network – denoted as real text (RT) – is obtained from the pre-processed texts of the considered datasets, as described in Section 2. The other networks are obtained from shuffled versions of the original text, created by shuffling words (SW) or sentences (SS). The three versions, exemplified with an extract of the book The Adventures of Sherlock Holmes, by Arthur Conan Doyle, are listed below (a sketch of the shuffling procedures is given after the examples):

  1. Real Text (RT) version: "Quite so," he answered, lighting a cigarette, and throwing himself down into an armchair. "You see, but you do not observe. The distinction is clear.

  2. Shuffled words (SW) version: "Quite a into do distinction armchair. but lighting and answered, The observe. himself down you so," not throwing he see, cigarette, is clear. "You an

  3. Shuffled Sentences (SS) version: "You see, but you do not observe. "Quite so," he answered, lighting a cigarette, and throwing himself down into an armchair. The distinction is clear.
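A minimal sketch of how the SW and SS variants can be generated is given below; it assumes sentences can be split on sentence-ending punctuation, which is an approximation of whatever tokenization the authors actually used.

```python
# Illustrative sketch of the two shuffling procedures (SW and SS).
# Assumption: a simple regex-based sentence splitter stands in for the
# (unspecified) tokenization used by the authors.
import random
import re

def shuffle_words(text):
    # SW version: the order of all words is randomized.
    words = text.split()
    random.shuffle(words)
    return " ".join(words)

def shuffle_sentences(text):
    # SS version: sentences are kept intact but their order is randomized.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    random.shuffle(sentences)
    return " ".join(sentences)

extract = ('"Quite so," he answered, lighting a cigarette, and throwing '
           'himself down into an armchair. "You see, but you do not observe. '
           'The distinction is clear.')
print(shuffle_words(extract))
print(shuffle_sentences(extract))
```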

2.4 Network characterization

The following network measurements were used to characterize the paragraph-based networks:

  1. Degree ($k$): This measurement quantifies the number of immediate neighbors of a node Costa et al. (2007) and is obtained as $k_i = \sum_j a_{ij}$.

  2. Betweenness ($B$): This measurement quantifies the relevance of a node (or edge) in terms of the number of shortest paths including that node (or edge) Boccaletti et al. (2006). The betweenness centrality of a given node $i$ is calculated as

    (2)   $B_i = \sum_{s \neq i \neq t} \frac{\sigma_{st}(i)}{\sigma_{st}},$

    where $\sigma_{st}(i)$ is the number of shortest paths connecting nodes $s$ and $t$ that include node $i$, and $\sigma_{st}$ is the number of shortest paths connecting $s$ and $t$, for all pairs $s$ and $t$. In text networks, this measurement has been applied to identify whether a concept/node is semantically related to one or more topological communities Amancio et al. (2011).

  3. Clustering coefficient (cc): The clustering coefficient represents the probability of two neighbors of a given node being connected to each other Costa et al. (2007). Locally, the clustering coefficient is calculated as $cc_i = 2 e_i / [k_i (k_i - 1)]$, where $e_i$ is the number of edges among the neighbors of node $i$. In text analysis, the clustering coefficient has also been used to identify whether a concept appears in generic or specific contexts. Differently from the betweenness, only local information is considered.

  4. Neighborhood: this measurement quantifies the number of nodes in the $h$-th concentric level around node $i$ Newman (2010). In this study, a fixed concentric level $h$ was used.

  5. Eccentricity (Ecc): the eccentricity of a node $i$ is a centrality index equal to the maximum length of all the shortest paths from $i$ to the other nodes in the network Harary (1969).

  6. Eigenvector centrality (EC): the eigenvector centrality assigns to a given node $i$ a value proportional to the sum of the eigenvector centrality values of the nodes connected to $i$. By doing so, the centrality value of a node increases when it is connected to nodes with high eigenvector centrality Newman (2010).

  7. Closeness centrality ($C$): this measurement is given by the inverse of the average distance from a node to the other nodes in the network Newman (2010). It is obtained as $C_i = 1/\ell_i$, where $\ell_i = \frac{1}{N-1} \sum_{j} d_{ij}$ is the average distance from node $i$ to all the other nodes, and $d_{ij}$ is the length of a geodesic path connecting nodes $i$ and $j$.

  8. Accessibility ($A_h$): This measurement quantifies the number of accessible nodes at the $h$-th concentric level centered at node $i$ Travençolo and Costa (2008) (a fixed value of $h$ was used in this study). The analysis accounts for the accessibility of a node by taking into account the probability $p_j^{(h)}(i)$ of a random walker departing from $i$ to reach a given node $j$ in $h$ steps. The equation that describes this measurement is based on the Shannon entropy, as follows:

    (3)   $A_h(i) = \exp\left( -\sum_j p_j^{(h)}(i) \ln p_j^{(h)}(i) \right).$

    In language networks, the accessibility (and its variations) has been used as an important feature to identify the relevance of words in the context of structural/stylistic analysis Amancio (2015c, b). A sketch of this computation is provided after the list.

  9. Generalized Accessibility ($A_g$): The generalized accessibility does not depend on the parameter $h$. In contrast with the previous measurement, the generalized accessibility uses a modified random walk, called the accessibility random walk, which assigns higher weights to the shortest paths and penalizes the longest ones de Arruda et al. (2014). Mathematically, the measurement is defined as

    (4)   $A_g(i) = \exp\left( -\sum_j P_{ij} \ln P_{ij} \right),$

    where $P_{ij}$ is the transition probability of this dynamics for each pair of nodes $i$ and $j$. More details are available in de Arruda et al. (2014).

  10. Symmetry ($S$): As another variation of the accessibility, this measurement quantifies the symmetry of the topology around a given node $i$ by considering its neighborhood Silva et al. (2016b). The symmetry is defined in a two-fold manner: (i) the backbone symmetry ($S_b$), in which the connections between nodes in the same hierarchical level are removed, and (ii) the merged symmetry ($S_m$), where the nodes that are connected and belong to the same hierarchical level are merged into a single node. The measurement is computed as

    (5)   $S_h(i) = \frac{\exp\left( -\sum_{j \in \Gamma_h(i)} p_j \ln p_j \right)}{|\Gamma_h(i)| + \eta_h(i)},$

    where $p_j$ is the probability of a walker departing from $i$ reaching node $j$ in the $h$-th concentric level, $\Gamma_h(i)$ is the set of all nodes in the hierarchical level $h$ of node $i$, $|\Gamma_h(i)|$ is the number of nodes in $\Gamma_h(i)$, and, considering a given hierarchical level $r < h$, $\eta_h(i)$ accounts for the number of nodes without edges connecting to the next hierarchical level. In this study, a fixed set of hierarchical levels $h$ was employed. In text networks, the symmetry has been useful to identify the authorship of texts Amancio et al. (2015).

  11. Modularity ($Q$): proposed by Newman and Girvan (2004), the modularity measures the quality of a given network partition in terms of its communities. It can be obtained as:

    (6)   $Q = \frac{1}{2M} \sum_{i=1}^{N} \sum_{j=1}^{N} \left( a_{ij} - \frac{k_i k_j}{2M} \right) \delta(c_i, c_j),$

    where $M$ is the number of edges, $N$ is the number of nodes, and $\delta(c_i, c_j) = 1$ if nodes $i$ and $j$ belong to the same class (community) and $\delta(c_i, c_j) = 0$ otherwise. This measurement ranges between $-1$ and $1$. For $Q > 0$, the number of edges inside the communities is greater than the number expected in an equivalent random network. In other words, a positive value of modularity is an indication that the network is organized into communities.
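As a rough illustration of the $h$-step accessibility in Equation (3), the sketch below builds the transition matrix of a uniform random walk, raises it to the power $h$, and takes the exponential of the Shannon entropy of each row. Note that this is a simplified reading of Travençolo and Costa (2008), whose original definition is based on self-avoiding walks; the example graph and the value of $h$ are arbitrary.

```python
# Sketch of the h-step accessibility (Eq. 3); illustrative, not the authors' code.
import numpy as np
import networkx as nx

def accessibility(g, h=2):
    a = nx.to_numpy_array(g)
    # Transition matrix of a uniform (non-self-avoiding) random walk.
    p = a / a.sum(axis=1, keepdims=True)
    # Probabilities of reaching each node in exactly h steps.
    ph = np.linalg.matrix_power(p, h)
    acc = np.zeros(len(g))
    for i in range(len(g)):
        probs = ph[i][ph[i] > 0]
        acc[i] = np.exp(-np.sum(probs * np.log(probs)))  # exp of Shannon entropy
    return acc

g = nx.karate_club_graph()
print(accessibility(g, h=2)[:5])
```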

Apart from the modularity, all of the aforementioned measurements are locally defined, i.e. each node has a specific value. To summarize the values obtained for a measurement across all nodes of the network, we took the average ($\mu$) and the standard deviation ($\sigma$). Note that this approach has already been adopted in similar works Arruda et al. (2016b); Marinho et al. (2017).
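A compact sketch of how a per-network feature vector (mean and standard deviation of node-level measurements, plus the modularity of a detected partition) could be assembled with networkx is shown below. The community-detection routine (greedy modularity maximization) is our own assumption, since the paper does not state which algorithm was used to obtain $Q$.

```python
# Sketch of feature extraction from a paragraph-based network (illustrative).
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

def network_features(g):
    node_measures = {
        "degree": dict(g.degree()),
        "betweenness": nx.betweenness_centrality(g),
        "clustering": nx.clustering(g),
        "eccentricity": nx.eccentricity(g),      # requires a connected graph
        "eigenvector": nx.eigenvector_centrality(g, max_iter=1000),
        "closeness": nx.closeness_centrality(g),
    }
    features = {}
    for name, values in node_measures.items():
        vals = np.array(list(values.values()))
        features[f"{name}_mean"] = vals.mean()
        features[f"{name}_std"] = vals.std()
    # Modularity of the partition found by greedy modularity maximization
    # (assumed here; the paper does not specify the algorithm).
    communities = greedy_modularity_communities(g)
    features["modularity"] = modularity(g, communities)
    return features

# Example usage with any connected networkx graph, e.g. one built by the earlier sketch:
# print(network_features(g))
```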

An important issue arising from the characterization and classification of networks concerns the comparison of networks with different sizes. Since several network measurements depend on the total number of nodes, we decided to construct the networks so that the total number of nodes (paragraphs) is constant.

2.5 Informativeness analysis

In the adopted network representation, we define as informative the measurements whose values obtained from real books and from the respective shuffled versions are significantly different. Measurements complying with this condition are therefore capable of discriminating between real and random manuscripts. Note that an informative measurement is useful to verify whether an unknown manuscript is compatible with a known textual structure (e.g. the structure observed in documents written in natural languages).

Two criteria were used to test the informativeness of the networks:

  1. Criterion A: this criterion is aimed at verifying whether the values obtained from the set of all shuffled texts of the dataset can be discriminated from the values obtained for all real texts. Let $B$ and $S$ be the total number of books in the RT dataset and the number of shuffled versions generated for each book in RT, respectively. Here, we compare the $B$ values obtained in RT with the $B \times S$ values obtained from the shuffled texts.

  2. Criterion B: it consists in comparing the value obtained for the real (RT) text with the values obtained in the corresponding shuffled versions of the same text. For a given measurement, the distance between a real text and the respective shuffled versions is obtained by computing the z-score (i.e. the standard score):

    (7)   $z = \frac{x - \mu(X_s)}{\sigma(X_s)},$

    where $x$ is the value obtained in the real text, $X_s$ is the set of values obtained from the shuffled versions (SW or SS), and $\mu(X_s)$ and $\sigma(X_s)$ represent the mean and the standard deviation of that distribution, respectively.

In our tests, for each real text, several samples were created for both the SW and SS versions.
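For concreteness, a small numerical sketch of Criterion B is given below. The values and the significance cut-off ($|z| > 2$) are illustrative assumptions, not results from the paper.

```python
# Illustrative z-score test (Eq. 7) for a single measurement; values are made up.
import numpy as np

def informativeness_z(real_value, shuffled_values):
    shuffled_values = np.asarray(shuffled_values, dtype=float)
    return (real_value - shuffled_values.mean()) / shuffled_values.std()

# e.g. modularity of the real network vs. modularity of its shuffled versions
z = informativeness_z(0.42, [0.11, 0.13, 0.09, 0.12, 0.10])
print(abs(z) > 2)  # assumed significance criterion: |z| > 2
```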

2.6 Dependency with language and semantics

An important property to be verified in a text network is the ability of the extracted measurements to capture syntactical and/or semantic features of the represented texts Amancio et al. (2013). In order to study the dependency of the measurements on syntax and semantics, the measurements are extracted from two classes of datasets. For a given measurement $X$, $X_L$ represents the set of values obtained for $X$ in a dataset comprising the same book written in different languages (the Holy Bible dataset). In a similar fashion, $X_S$ represents the set of values obtained for $X$ in a dataset comprising different texts (the Books dataset) written in the same language. If a given network measurement depends more on the language (i.e. the syntax) than on the approached subject (i.e. the semantics), one expects the variability of the distribution of $X_L$ to be larger than the variability of $X_S$. Conversely, if $X$ is more dependent on semantics, one expects the variability of $X_S$ to be larger than the variability of $X_L$ Amancio et al. (2013). Here, the variability of a distribution is computed by using its coefficient of variation (CV), i.e.

(8)   $\mathrm{CV}(X) = \frac{\sigma(X)}{\mu(X)}.$
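The syntax-versus-semantics comparison boils down to contrasting coefficients of variation, as in the short sketch below (the numbers are made up for illustration).

```python
# Coefficient of variation (Eq. 8) and syntax-vs-semantics comparison (illustrative).
import numpy as np

def cv(values):
    values = np.asarray(values, dtype=float)
    return values.std() / values.mean()

x_language = [0.31, 0.35, 0.28, 0.40]   # same book, different languages (X_L)
x_semantics = [0.30, 0.52, 0.21, 0.44]  # different books, same language (X_S)
if cv(x_semantics) > cv(x_language):
    print("measurement appears more sensitive to semantics")
else:
    print("measurement appears more sensitive to syntax/language")
```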

3 Results and discussion

In this section, we analyze the properties of the metrics extracted from the proposed network representation. Here we focus on two main properties: informativeness and the ability of the metrics to capture syntactical and/or semantic textual features. The applicability of the adopted representation is then illustrated in the analysis of an unknown text: the Voynich manuscript.

3.1 Informativeness

In this study, we used two distinct ways to quantify informativeness Amancio et al. (2013). In the first approach, we consider a measurement $X$ as informative if the value obtained for $X$ in a real (RT) text differs from the values of $X$ obtained in any other shuffled (SW and SS) text of the considered dataset (see Criterion A described in the methodology). The results obtained for this type of analysis are shown in Table 1. To facilitate the comparison of measurements taking values in distinct intervals, a normalization was applied. For each measurement, and for each of the datasets (Holy Bible and Books), the results are standardized considering all three types of texts (RT, SS and SW). As such, the average value of each normalized measurement in Table 1 is zero and the standard deviation is one.

Table 1: Measurements obtained for the different network types (RT, SS, and SW) by considering the Holy Bible dataset, the English part of the Books dataset, and the entire Books dataset (see Appendix A). All presented values are standardized so that different measurements can be compared.

Considering the Holy Bible dataset, the modularity ($Q$) was the measurement that best discriminated real from shuffled texts, with the modularity of real networks differing markedly from that of both the SW and SS versions. This result suggests that the community structure is much more apparent in real networks, which might be a consequence of the bursty topical structure present in real texts Arruda et al. (2016a). In addition to the modularity, other measurements were also found to be informative. When comparing RT and SW, the largest differences were found for the accessibility, the symmetry (backbone and merged) and the clustering coefficient. The best discrimination between RT and SS was found for the symmetry, the accessibility and the closeness. Interestingly, several of the measurements were able to distinguish between real and shuffled texts, regardless of the considered shuffling process.

Considering the Books dataset, the modularity also turned out to be the measurement that best discriminated real from shuffled texts. Once again, real texts oftentimes displayed a clearer community structure. This means that the informativeness achieved by the modularity is a characteristic that seems to depend neither on syntax nor on semantics. Apart from $Q$, the following measurements were also found to discriminate real texts from both shuffled versions: the clustering coefficient, the degree and the accessibility.

As a complementary test, for each measurement, we used the z-score (see Equation 7) to compare a real text and its corresponding shuffled versions (informativeness test based on Criterion B). Note that this is a less strict informativeness test because, differently from the previous case, we do not compare a real text with the shuffled versions of all texts of the dataset. Rather, we compare a real text only with the shuffled versions generated from the same book. In Table 2, we show the percentage of documents in which we observed a significant difference between real and shuffled texts, according to the z-score defined in Equation 7.

Table 2: Percentage of documents in each dataset (Holy Bible and Books, English part) for which the difference between a real text and the corresponding shuffled version (SW or SS) was found to be significant, reported for each measurement (degree, betweenness, clustering, neighborhood, eccentricity, eigenvector centrality, closeness, symmetry, accessibility, and modularity). Apart from the modularity, the informativeness seems to depend on the type of dataset used.

As found in the first test, the modularity ($Q$) is the most informative measurement for both of the considered datasets, reaching 100% informativeness. Other measurements had similar results in both datasets and were found to be informative for approximately 50% of the samples. However, for many other measurements, the level of informativeness varied according to the dataset: some were found to be more informative in the Holy Bible dataset, while others seemed to be more informative in the Books dataset.

All in all, the results obtained here suggest that, apart from the modularity, it is important to analyze the characteristics of the dataset to decide whether network measurements extracted from paragraph networks can be classified as informative, even if a less strict definition of informativeness is taken into account. Interestingly, the results obtained here confirm that paragraph networks are less informative than other types of text networks Amancio (2015c). In the case of word adjacency networks, most of the measurements were found to be informative, independently of the characteristics of the considered datasets.

3.2 Dependency on syntax and semantics

In this section, we evaluate the dependency of the measurements by considering their variability in two distinct scenarios: (i) datasets where the semantics (content) is constant and the language (syntax) varies; and (ii) datasets where the language is constant and the semantics varies. To represent (i), we used the Holy Bible dataset. The dataset employed in the second scenario was created by selecting only the books in English from the Books dataset. We decided to use the English language because, in the considered dataset, a larger number of books written in this language is available.

In the first analysis, we identified the measurements that were able to capture syntax/language subtleties. The measurements found to display significant variability in this scenario (i.e. in the Holy Bible dataset) were: the accessibility, the degree, the eccentricity, the symmetry, the neighborhood and the betweenness.

We also identified the measurements that are sensitive to changes in semantics. The measurements taking the highest coefficients of variation in the English Books dataset were: the eccentricity, the closeness, the symmetry, the betweenness, the degree, the eigenvector centrality, the accessibility and the neighborhood. Note that some measurements depend on both syntax and semantics. Interestingly, for both the symmetry and the accessibility, the ability to capture syntax or semantics subtleties depends on the hierarchical level being analyzed.

In addition to the aforementioned tests, we probed, for each measurement, which of the two abilities is more prevalent: (i) the ability to detect changes in syntax; or (ii) the ability to detect changes in semantics. This prevalence analysis was conducted by comparing the coefficients of variation in the considered datasets, as described in the methodology. The obtained results are shown in Figure 2. The top sub-panels illustrate the results obtained for the first group of measurements: in most of these cases, while the variability across languages (Holy Bible dataset) and across topics (English Books dataset) is high, there is no significant difference between these values. This means that, for these measurements, both syntax and semantics are captured.

Figure 2: Comparison of the measurements that provided the lowest CV values. The error bars represent the 95% confidence interval of the mean.

A different behavior can be observed for the measurements depicted in the bottom sub-panels of Figure 2, for which a significant difference between the coefficients of variation was found: the variability across topics turned out to be significantly higher than the variability across languages. This is an interesting finding for text networks, since measurements extracted from other text networks (such as co-occurrence networks) are mostly dependent on syntax Amancio et al. (2013). This result suggests that paragraph-based networks can be used to complement the analysis based on traditional co-occurrence networks when both syntax and semantics are relevant for the problem being addressed.

3.3 Classification tests

To illustrate the applicability of paragraph-based networks in classification tasks, some classification problems were tackled using the measurements extracted from the proposed representation. In the first example, we considered the problem of deciding whether a manuscript has a structure compatible with a shuffled, meaningless document. In the second classification problem, we probed whether an unknown text, the Voynich manuscript, can be considered compatible with real texts.

3.3.1 Discriminating real and shuffled texts

We applied our method to distinguish real from shuffled texts in order to illustrate the capability of paragraph-based networks to characterize texts in a real application. For each book presented in Appendix A, the three paragraph-based networks (RT, SW, and SS) were created. After that, the network measurements described previously were extracted, standardized, and used as classification features. To select the features for this task, we considered the most informative measurements obtained from Table 1. More specifically, for each pair of real vs. shuffled texts (i.e., RT vs. SS and RT vs. SW), we identified the top 10 measurements providing the best discrimination and then selected the measurements appearing in both top-10 lists.

The classification was evaluated with leave-one-out cross-validation and the SMO classifier, an SVM implementation available in Weka Witten et al. (2016); Hall et al. (2009). The parameters were chosen according to the procedure defined in Amancio et al. (2014). When considering three classes (RT, SW, and SS), the true positive rate was 0.98 for the RT samples, but only 0.71 and 0.50 for SW and SS, respectively. The false negative rates were 0.02, 0.24, and 0.14 for RT, SW, and SS, respectively. These results mean that the proposed framework can easily differentiate between real and shuffled texts. Conversely, the discrimination between the two classes of shuffled documents represents a more challenging task.

A variation of the same classification problem considered both shuffled versions as a single class. In this case, the SVM made only two classification mistakes, both of them real texts classified as shuffled texts, and the false positive rate of the RT class was 0. Figure 3 illustrates the separation between the two classes by considering the projection onto a single dimension obtained via linear discriminant analysis Friedman et al. (2001). Given the importance of the modularity highlighted in the informativeness analysis, we also evaluated the performance when only this measurement is used for the classification. In this case, the accuracy remained high, which confirms the importance of the modularity in discriminating real and shuffled texts.

Figure 3: Probability density function (pdf) of the linear discriminant analysis projection obtained from the selected features in the classification of texts in two classes: real vs. shuffled (SW and SS) texts.
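The paper evaluates Weka's SMO classifier with leave-one-out cross-validation; the sketch below reproduces the same protocol with scikit-learn as a stand-in, so the classifier settings and the feature matrix shown here are placeholders rather than the authors' exact setup.

```python
# Illustrative re-implementation of the evaluation protocol with scikit-learn
# (the paper used Weka's SMO; parameters and data here are placeholders).
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: one row of network measurements per book version; y: 0 = real, 1 = shuffled.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 9))          # placeholder feature matrix
y = np.array([0] * 20 + [1] * 40)     # placeholder labels (RT vs. SW/SS)

clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("leave-one-out accuracy:", scores.mean())
```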

3.3.2 Case Example: Voynich manuscript

The Voynich manuscript is known to be a mysterious text, and many of its aspects have been studied for several years Reddy and Knight (2011). Some studies have relied on textual analysis Reddy and Knight (2011), while others have used complex network tools to study its properties Amancio et al. (2013); Montemurro and Zanette (2013). In order to handle the manuscript, originally written in an unknown alphabet, it is necessary to transliterate its characters into a known set of symbols. Here we used the European Voynich Alphabet (EVA) Zandbergen (2018), which provides the original characters manually transliterated into European characters. Because several transcriptions are available for each line of the text, we took, for each position in a line, the character most frequently used across the different transcriptions of that line (a majority vote). Additionally, because our approach relies on text paragraphs, we detected paragraphs by visually inspecting the original manuscript. When comparing the Voynich manuscript with shuffled texts, we disregarded the SS versions because there is no trivial way to detect sentences in the Voynich manuscript.
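A possible implementation of the per-line majority vote over EVA transcriptions is sketched below; the input format (equal-length alternative strings for each line) and the function name are simplifying assumptions made for illustration.

```python
# Illustrative per-line majority vote over alternative transcriptions of a line.
# Assumption: the alternatives for a line are equal-length strings; real EVA
# transcription files need additional parsing and alignment.
from collections import Counter

def consensus_line(transcriptions):
    consensus = []
    for chars in zip(*transcriptions):
        consensus.append(Counter(chars).most_common(1)[0][0])
    return "".join(consensus)

alternatives = ["qokeedy", "qokeedy", "qokeody"]
print(consensus_line(alternatives))  # -> "qokeedy"
```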

First, we analyzed whether the Voynich manuscript, when characterized with the metrics extracted from paragraph-based networks, is compatible with real texts and not with gibberish, shuffled texts. This is a long-standing question about the manuscript, since several scholars have questioned the existence of a meaningful textual structure in this mysterious text Belfield (2007). An illustration comparing the structure of the Voynich manuscript and a shuffled network is shown in Figure 4. It is clear from the visualizations that the Voynich manuscript presents a well-defined community structure, with two dominant groups. The communities seem to capture, to some degree, the topical organization of the manuscript: the section about plants seems to be concentrated in a specific community. The equivalent shuffled network, shown in Figure 4(b), reveals no apparent community structure. Since the modularity was found to be informative in the previous analysis, the organization into communities in Figure 4(a) suggests that, at the paragraph level, the Voynich manuscript is not compatible with shuffled texts. Interestingly, the same conclusion has been reported when different types of networks are used to represent the manuscript Amancio et al. (2013); Montemurro and Zanette (2013); Reddy and Knight (2011).

Figure 4: Network visualizations of the two versions of paragraph-based networks of the Voynich manuscript. In (a), the paragraphs were labelled by considering the figures in the corresponding pages. The topics considered were: (i) text, when no images are available; (ii) plants; (iii) bath, with figures of women and bath-like shapes; (iv) astronomy, with spatial-like figures; and (v) vase. The visualization was produced with the software implemented by Silva et al. (2016a).

As a complementary analysis, taking into account the community structure of the networks, we analyzed the modularity $Q$, which is much higher for the set of real texts (RT) than for the sets of shuffled texts (SS and SW), as shown in Figure 5. The modularity obtained for the Voynich manuscript (represented by a blue arrow in the figure) is not compatible with either of the two distributions obtained for shuffled texts. On the other hand, the modularity of the manuscript is compatible with the modularity extracted from real texts.

In order to analyze the Voynich manuscript, we employed the same classifier as in the previous section. As a result, the document was classified as a real text. A set of 30 SW networks generated from the Voynich manuscript was also classified, and an accuracy of 100% was obtained. This perfect classification can also be seen in Figure 5, which shows that the SW networks generated from the Voynich manuscript (orange arrow) are mostly compatible with the distributions obtained for shuffled texts.

Figure 5: Probability density functions (pdf) of the modularity for the three types of networks (RT, SW, and SS). The modularity values of the two versions of the Voynich manuscript (RT and SW) are represented by the blue and orange arrows, respectively.

4 Conclusions

In the current study, we probed the properties of a paragraph-based network representation of texts. Two main properties were considered: the ability of the networks to distinguish real from shuffled texts (informativeness test) and the ability to capture syntactic and/or semantic textual features. Interestingly, we found that the most informative measurement is the modularity, since artificial, shuffled texts are not organized into well-defined communities. Our results also revealed that several measurements are able to capture semantic features. This is an important property, since the well-known word adjacency (co-occurrence) networks are only able to capture syntactic features. Our findings suggest that co-occurrence and paragraph-based networks can be used in a complementary way when both syntax and semantics are important for a natural language processing task.

The adopted network representation was used to analyze the statistical nature of the Voynich manuscript. Previous studies hinging on word networks showed that the Voynich syntax is coherent with natural languages Amancio et al. (2013); Montemurro and Zanette (2013). Recently, an extensive analysis using several natural languages argued that Hebrew is the most probable language of the manuscript Hauer and Kondrak (2016). Here, we proposed a different analysis, focusing on the organization in paragraphs. Our analysis revealed that the Voynich manuscript is compatible with natural languages at the paragraph level. This finding was confirmed by analyzing the organization of the text into well-defined communities: similarly to several natural languages, the Voynich manuscript also displays a clear community structure. Furthermore, we applied our classification approach, and the Voynich manuscript was classified as a real text. As a complement, an accuracy of 100% was found when we classified 30 samples of the shuffled version of the Voynich manuscript.

As future work, many other natural language processing problems can be addressed by considering the proposed network. For instance, several problems currently being addressed by word adjacency models could benefit from the paragraph-based network approach. Examples of applications include machine translation quality assessment, plagiarism analysis and authorship attribution Amancio (2015a). Additionally, other unknown documents can also be examined in terms of their organization in paragraphs Belfield (2007).

Acknowledgments

Henrique F. de Arruda acknowledges Capes-Brazil for sponsorship. Vanessa Q. Marinho thanks FAPESP (grant no. 2015/05676-8) for financial support. Luciano da F. Costa thanks CNPq (grant no. 307333/2013-2) and NAP-PRP-USP for sponsorship. Diego R. Amancio acknowledges FAPESP (grant no. 16/19069-9 and 17/13464-6) for financial support. This work has been supported also by FAPESP grants 11/50761-2 and 2015/22308-2. The authors acknowledge Filipi Nascimento Silva for fruitful conversations.

Appendix A Dataset

The list of books used to analyze how the network structure varies across different documents in the same language is shown below. Five different languages were considered: English, French, German, Italian and Portuguese. The list of books is organized by language, with the author of each book listed between parentheses after the title. The books were obtained from Project Gutenberg (http://www.gutenberg.org).

  1. English: The Adventures of Sherlock Holmes (Arthur Conan Doyle), The Tragedy of the Korosko (Arthur Conan Doyle), The Valley of Fear (Arthur Conan Doyle), Uncle Bernac - A Memory of the Empire (Arthur Conan Doyle), Dracula’s Guest (Bram Stoker), The Lair of the White Worm (Bram Stoker), The Jewel Of Seven Stars (Bram Stoker), The Man (Bram Stoker), The Mystery of the sea (Bram Stoker), A Tale of Two Cities (Charles Dickens), Barnaby Rudge: A Tale of the Riots of Eighty (Charles Dickens), American Notes (Charles Dickens), Great Expectations (Charles Dickens), Hard Times (Charles Dickens), The Works of Edgar Allan Poe – Volumes 2 and 4 (Edgar Allan Poe), Beasts and Super-Beasts (Hector H. Munro), The Chronicles of Clovis (Hector H. Munro), The Toys of Peace (Hector H. Munro), The Girl on the Boat (P. G. Wodehouse), My Man Jeeves (P. G. Wodehouse), Something New (P. G. Wodehouse), The Adventures of Sally (P. G. Wodehouse), The Clicking of Cuthbert (P. G. Wodehouse), A Pair of Blue Eyes (Thomas Hardy), Far from the Madding Crowd (Thomas Hardy), Jude the Obscure (Thomas Hardy), The Mayor of Casterbridge (Thomas Hardy), The Hand of Ethelberta (Thomas Hardy), Barry Lyndon (William M. Thackeray), The History of Pendennis (William M. Thackeray), The Virginians (William M. Thackeray) and Vanity Fair (William M. Thackeray).

  2. French: Le fils du Soleil (Gustave Aimard), Face au Drapeau (Jules Verne), Pierre de Villerglé (Louis Amédée Achard), Les Idoles d’argile (Louis Reybaud) and Han d’Islande (Victor Hugo).

  3. German: Die Wahlverwandtschaften (Goethe), Der Moloch (Jakob Wassermann), Königliche Hoheit (Thomas Mann) and Lichtenstein (Wilhelm Hauff).

  4. Italian: Il Peccato di Loreta (Alberto Boccardi), La Montanara (Anton Giulio Barrili), Alla Finestra (Enrico Castelnuovo), Sciogli la treccia, Maria Maddalena (Guido da Verona) and La Pergamena Distrutta (Virginia Mulazzi).

  5. Portuguese: Amor de Perdição (Camilo Castelo Branco), A Cidade e as Serras (Eça de Queirós), Os Bravos do Mindello (Faustino da Fonseca), Transviado (Jaime de Magalhães Lima) and Uma Família Inglesa (Júlio Dinis).

References

  • Amancio (2015a) Amancio, D. R., 2015a. Comparing the topological properties of real and artificially generated scientific manuscripts. Scientometrics 105 (3), 1763–1779.
  • Amancio (2015b) Amancio, D. R., 2015b. A complex network approach to stylometry. PLoS ONE 10 (8), e0136076.
  • Amancio (2015c) Amancio, D. R., 2015c. Probing the topological properties of complex networks modeling short written texts. PLoS ONE 10 (2), e0118394.
  • Amancio et al. (2011) Amancio, D. R., Altmann, E. G., Oliveira Jr., O. N., Costa, L. F., 2011. Comparing intermittency and network measurements of words and their dependence on authorship. New Journal of Physics 13 (12), 123024.
  • Amancio et al. (2013) Amancio, D. R., Altmann, E. G., Rybski, D., Oliveira Jr, O. N., Costa, L. F., 2013. Probing the statistical properties of unknown texts: application to the voynich manuscript. PLoS ONE 8 (7), e67310.
  • Amancio et al. (2014) Amancio, D. R., Comin, C. H., Casanova, D., Travieso, G., Bruno, O. M., Rodrigues, F. A., Costa, L. F., 2014. A systematic comparison of supervised classifiers. PloS one 9 (4), e94137.
  • Amancio et al. (2015) Amancio, D. R., Silva, F. N., Costa, L. F., 2015. Concentric network symmetry grasps authors’ styles in word adjacency networks. EPL (Europhysics Letters) 110 (6), 68001.
  • Angelova and Weikum (2006) Angelova, R., Weikum, G., 2006. Graph-based text classification: learn from your neighbors. In: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, pp. 485–492.
  • Arruda et al. (2016a) Arruda, H. F., Costa, L. F., Amancio, D. R., 2016a. Topic segmentation via community detection in complex networks. Chaos: An Interdisciplinary Journal of Nonlinear Science 26 (6), 063120.
  • Arruda et al. (2016b) Arruda, H. F., Costa, L. F., Amancio, D. R., 2016b. Using complex networks for text classification: Discriminating informative and imaginative documents. EPL (Europhysics Letters) 113 (2), 28007.
  • Arruda et al. (2018) Arruda, H. F., Silva, F. N., Marinho, V. Q., Amancio, D. R., Costa, L. F., 2018. Representation of texts as complex networks: a mesoscopic approach. Journal of Complex Networks 6 (1), 125–144.
  • Belfield (2007) Belfield, R., 2007. The Six Unsolved Ciphers: Inside the Mysterious Codes that Have Confounded the World’s Greatest Cryptographers. Ulysses Press.
  • Boccaletti et al. (2006) Boccaletti, S., Latora, V., Moreno, Y., Chavez, M., Hwang, D.-U., Feb. 2006. Complex networks: Structure and dynamics. Physics Reports 424, 175–308.
  • Cancho and Solé (2001) Cancho, R. F., Solé, R. V., 2001. The small world of human language. Proceedings of the Royal Society of London B: Biological Sciences 268 (1482), 2261–2265.
  • Cong and Liu (2014) Cong, J., Liu, H., 2014. Approaching human language with complex networks. Physics of Life Reviews 11 (4), 598–618.
  • Costa et al. (2007) Costa, L. F., Rodrigues, F. A., Travieso, G., Villas Boas, P. R., 2007. Characterization of complex networks: A survey of measurements. Advances in Physics 56 (1), 167–242.
  • de Arruda et al. (2014) de Arruda, G. F., Barbieri, A. L., Rodríguez, P. M., Rodrigues, F. A., Moreno, Y., Costa, L. F., 2014. Role of centrality for the identification of influential spreaders in complex networks. Phys. Rev. E 90, 032812.
  • Erkan and Radev (2004) Erkan, G., Radev, D. R., 2004. Lexrank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research 22, 457–479.
  • Friedman et al. (2001) Friedman, J., Hastie, T., Tibshirani, R., 2001. The elements of statistical learning. Vol. 1. Springer series in statistics New York.
  • Hall et al. (2009) Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I. H., 2009. The weka data mining software: an update. ACM SIGKDD explorations newsletter 11 (1), 10–18.
  • Harary (1969) Harary, F., 1969. Graph Theory. Addison-Wesley Series in Mathematics. Addison Wesley.
  • Hauer and Kondrak (2016) Hauer, B., Kondrak, G., 2016. Decoding anagrammed texts written in an unknown language and script. Transactions of the Association for Computational Linguistics 4, 75–86.
  • Jin and Srihari (2007) Jin, W., Srihari, R. K., 2007. Graph-based text representation and knowledge discovery. In: Proceedings of the 2007 ACM symposium on Applied computing. ACM, pp. 807–811.
  • Liu and Cong (2013) Liu, H., Cong, J., 2013. Language clustering with word co-occurrence networks based on parallel texts. Chinese Science Bulletin 58 (10), 1139–1144.
  • Manning et al. (2008) Manning, C. D., Raghavan, P., Schütze, H., et al., 2008. Introduction to information retrieval. Vol. 1. Cambridge university press Cambridge.
  • Manning and Schütze (1999) Manning, C. D., Schütze, H., 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA.
  • Marinho et al. (2017) Marinho, V. Q., Arruda, H. F., Lima, T. S., Costa, L. F., Amancio, D. R., 2017. On the “calligraphy” of books. In: TextGraphs. Association for Computational Linguistics, pp. 1–10.
  • Masucci and Rodgers (2006) Masucci, A. P., Rodgers, G. J., 2006. Network properties of written human language. Physical Review E 74 (2), 026102.
  • Mehri et al. (2012) Mehri, A., Darooneh, A. H., Shariati, A., 2012. The complex networks approach for authorship attribution of books. Physica A: Statistical Mechanics and its Applications 391 (7), 2429–2437.
  • Montemurro and Zanette (2013) Montemurro, M. A., Zanette, D. H., 2013. Keywords and co-occurrence patterns in the voynich manuscript: An information-theoretic analysis. PLoS ONE 8 (6), e66344.
  • Newman (2010) Newman, M., 2010. Networks: An Introduction. Oxford University Press, Inc., New York, NY, USA.
  • Newman and Girvan (2004) Newman, M. E. J., Girvan, M., 2004. Finding and evaluating community structure in networks. Physical Review E 69 (026113).
  • Pak and Paroubek (2010) Pak, A., Paroubek, P., 2010. Twitter as a corpus for sentiment analysis and opinion mining. In: LREc. Vol. 10.
  • Reddy and Knight (2011) Reddy, S., Knight, K., 2011. What we know about the voynich manuscript. In: Proceedings of the 5th ACL-HLT workshop on language technology for cultural heritage, social sciences, and humanities. Association for Computational Linguistics, pp. 78–86.
  • Salton et al. (1997) Salton, G., Singhal, A., Mitra, M., Buckley, C., 1997. Automatic text structuring and summarization. Information Processing & Management 33 (2), 193–207.
  • Segarra et al. (2015) Segarra, S., Eisen, M., Ribeiro, A., 2015. Authorship attribution through function word adjacency networks. IEEE Transactions on Signal Processing 63 (20), 5464–5478.
  • Sicilia et al. (2018) Sicilia, R., Giudice, S. L., Pei, Y., Pechenizkiy, M., Soda, P., 2018. Twitter rumour detection in the health domain. Expert Systems with Applications.
  • Silva et al. (2016a) Silva, F. N., Amancio, D. R., Bardosova, M., Costa, L. F., Oliveira Jr, O. N., 2016a. Using network science and text analytics to produce surveys in a scientific topic. Journal of Informetrics 10 (2), 487–502.
  • Silva et al. (2016b) Silva, F. N., Comin, C. H., Peron, T. K., Rodrigues, F. A., Ye, C., Wilson, R. C., Hancock, E. R., F. Costa, L., 2016b. Concentric network symmetry. Information Science 333, 61–80.
  • Stamatatos (2009) Stamatatos, E., 2009. A survey of modern authorship attribution methods. Journal of the Association for Information Science and Technology 60 (3), 538–556.
  • Symeonidis et al. (2018) Symeonidis, S., Effrosynidis, D., Arampatzis, A., 2018. A comparative evaluation of pre-processing techniques and their interactions for twitter sentiment analysis. Expert Systems with Applications.
  • Travençolo and Costa (2008) Travençolo, B., Costa, L. F., 2008. Accessibility in complex networks. Physics Letters A 373 (1), 89 – 95.
  • Wachs-Lopes and Rodrigues (2016) Wachs-Lopes, G. A., Rodrigues, P. S., 2016. Analyzing natural human language from the point of view of dynamic of a complex network. Expert Systems with Applications 45, 8–22.
  • Witten et al. (2016) Witten, I. H., Frank, E., Hall, M. A., Pal, C. J., 2016. The WEKA Workbench. Online Appendix for “Data Mining: Practical Machine Learning Tools and Techniques”. Morgan Kaufmann.
  • Xiong et al. (2018) Xiong, R., Wang, J., Zhang, N., Ma, Y., 2018. Deep hybrid collaborative filtering for web service recommendation. Expert Systems with Applications.
  • Yaveroğlu et al. (2014) Yaveroğlu, Ö. N., Malod-Dognin, N., Davis, D., Levnajic, Z., Janjic, V., Karapandza, R., Stojmirovic, A., Pržulj, N., 2014. Revealing the hidden language of complex networks. Scientific Reports 4, 4547.
  • Yu et al. (2017) Yu, D., Wang, W., Zhang, S., Zhang, W., Liu, R., 2017. Hybrid self-optimized clustering model based on citation links and textual features to detect research topics. PLoS ONE 12 (10), e0187164.
  • Zhao et al. (2014) Zhao, L., Wang, J., Huang, R., Cui, H., Qiu, X., Wang, X., 2014. Sentiment contagion in complex networks. Physica A: Statistical Mechanics and its Applications 394, 17–23.
  • Zandbergen (2018) Zandbergen, R., Jan. 2018. Text analysis - transcription of the text.
    URL http://www.voynich.nu/transcr.html