Authorship Attribution Based on Life-Like Network Automata

10/20/2016 ∙ by Jeaneth Machicao, et al. ∙ 0

Authorship attribution is a problem of considerable practical and technical interest. Several methods have been designed to infer the authorship of disputed documents in multiple contexts. While traditional statistical methods based solely on word counts and related measurements have provided a simple yet effective solution in particular cases, they are prone to manipulation. Recently, texts have been successfully modeled as networks, where words are represented by nodes linked according to textual similarity measurements. Such models are useful to identify informative topological patterns for the authorship recognition task. However, there is no consensus on which measurements should be used. Thus, we propose a novel method to characterize text networks by considering both topological and dynamical aspects of networks. Using concepts and methods from cellular automata theory, we devised a strategy to grasp informative spatio-temporal patterns from this model. Our experiments revealed that the method outperforms traditional analyses relying only on topological measurements. Remarkably, we have found that the obtained results depend on pre-processing steps (such as lemmatization), a feature that has mostly been disregarded in related works. The optimized results obtained here pave the way for a better characterization of textual networks.


I Introduction

The current massive production of data has brought up plenty of challenges to the areas of Data Mining, Natural Language Processing (NLP) and Machine Learning. An example of a current challenge in information sciences is the authorship attribution task, which amounts to the ability to assign authorship to anonymous or disputed documents. This task has drawn attention from researchers mostly for its implications in real applications, such as plagiarism detection FrancoSalvador2016550 ; Labbe , forensics against cyber crimes Vacca:2005:CFC:1076307 and resolution of disputed documents ASI:ASI21001 .

Several methods have been proposed to undertake the authorship attribution problem ASI:ASI21001 . Traditional techniques use text analytics and natural language processing concepts to characterize authors' writing styles ASI:ASI21001 . For example, several studies have shown that the raw frequency of function words or the intermittency of content words is notably useful to discriminate authors' styles amancio2015authorship ; Brennan:2012:ASC:2382448.2382450 . In recent years, deeper paradigms have been employed to tackle this problem. Syntactical and semantical features are some examples of features not relying only on simple statistical analyses Halteren:2007:AVL:1187415.1187416 . Despite being effective in particular contexts, deeper paradigms require more complex data handling, a painstaking effort that may not yield good results in generic scenarios. Even though methods based on simple statistical analyses yield, in general, excellent results with the advantage of not requiring large corpora for training or language-dependent resources, they are prone to manipulation via obfuscation or imitation attacks Brennan:2012:ASC:2382448.2382450 . For this reason, more robust statistical methods have been proposed.

A recent trend in authorship attribution research is the use of the complex network framework, motivated by its success in related tasks, mostly in text classification Martinici ; Dorogovtsev2603 ; 0295-5075-100-5-58002 ; 10.1371/journal.pone.0067310 ; 0295-5075-93-2-28005 ; 0295-5075-83-1-18002 ; Mehri20122429 . In this paradigm, documents are modeled by means of a co-occurrence network 10.1371/journal.pone.0067310 , and the properties of the formed networks are used as authors' fingerprints in the classification process amancio2011comparing . Although such methods have proven useful for discriminating writing styles with a certain robustness provided by topological analysis, they usually provide no better results than traditional techniques based, e.g., on n-gram models when used as a single source of text characterization. However, complex network topologies are less prone to manipulation, which makes these network-based methods more robust in real scenarios. Note that complex network-based measurements provide a complementary view of unstructured documents, a feature that can be further explored in hybrid approaches.

In a typical network-based authorship recognition system, texts are modeled as networks and the structure of these networks is then used as a relevant feature to discriminate distinct authors amancio2011comparing . While traditional network topology measurements are useful to understand the main topological properties of texts, they may provide an ambiguous characterization, mainly when subtleties in style are not mapped into equivalent informative network structures. For this reason, the creation of informative, efficient and unambiguous network measurements for specific models remains an open problem in network science. In this context, we explore a novel network characterization based on cellular automata (CA) theory WOLFRAM19841 .

In the last decade, the fusion of networks and cellular automata has appeared in the literature watts1999small ; tomassini2005evolution ; marr2012cellular ; lifelikeNA . This discrete dynamical system, called Network Automata (NA), uses the network structure as the tessellation of the cellular automaton, whose dynamics is governed by a rule that defines the states of its nodes at each time step. NAs turned out to be a powerful tool for pattern recognition purposes because they combine the advantages of networks for modeling and analysis with the capabilities of CAs to extract complex patterns gonccalves2012complex ; lifelikeNA .

In this manuscript, we propose a method to characterize networks representing written texts to tackle the authorship attribution task. The proposed method is based on Life-Like Network Automata (LLNA) lifelikeNA , which was inspired by the 2D Life-like CA GardnerMathematicalGT , a well-known set of rules explored in diverse fields Soto2015TheXU ; machicao2012chaotic ; Broderick2004ALV ; CsuhajVarj1997EcoGrammarSA . We depart from the well-known word-adjacency model and include an LLNA dynamics to characterize text networks. More specifically, our approach relies on a selection of informative LLNA rules and, therefore, we expect to obtain spatio-temporal patterns possessing two important properties: (i) books written by the same author display similar patterns; and (ii) books written by distinct authors display distinct spatio-temporal patterns. Using a collection of texts written by 8 authors, we obtained an accuracy of 76%, which is considerably higher than that of traditional methods based solely on topological properties of networks and therefore demonstrates the good performance of the proposed method.

II Material and methods

II.1 Proposal overview

In this section, we give an overview of the main proposal (see Figure 1), which clarifies not only the sequence of mathematical preliminaries but also the experimental setup presented in Section III. First, we introduce the well-known network model of text representation, the word-adjacency model. We also present optional text pre-processing strategies which may be applied to improve the characterization of texts. Some network measurements used to explore the properties of networks are presented. Next, we discuss the Life-Like network automata representation used in this article and its respective measurements. The measurements extracted from the Life-Like network automata dynamics are then used to characterize the style of each author.

Figure 1: Authorship attribution framework based on LLNA method. The following steps are applied: (1) a written text is pre-processed; (2) a network is generated based on the extraction of keywords from the pre-processing; (3) a selected LLNA rule evolves over the textual network topology; (4) spatio-temporal features from the LLNA are extracted and then are used for the authorship attribution task.

II.2 Modeling and characterizing texts as networks

In recent years, distinct ways to model texts as complex networks and graphs have been proposed mihalcea2011graph . In the current study, we have used the so-called word adjacency (or co-occurrence) model, as it has proven useful to grasp stylistic textual patterns sole2010language ; amancio2011using ; amancio2011comparing . In this model, each node represents a word and the edges are created whenever two words appear as adjacent in the raw text. Mathematically, the word adjacency network is represented by an adjacency matrix $A$, whose elements $a_{ij}$ are defined as

$$a_{ij} = \begin{cases} 1, & \text{if words } i \text{ and } j \text{ are adjacent in the text,} \\ 0, & \text{otherwise.} \end{cases} \qquad (1)$$
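The word adjacency model can be sketched in a few lines of Python. This is only an illustration: the function name `word_adjacency_network` is hypothetical, and treating edges as undirected and dropping self-loops are assumptions that the text does not state.

```python
from collections import defaultdict


def word_adjacency_network(tokens):
    """Map each word to the set of words that appear next to it in the text."""
    adj = defaultdict(set)
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # assumption: a repeated word does not create a self-loop
            adj[left].add(right)
            adj[right].add(left)  # assumption: edges are undirected
    for word in tokens:
        adj.setdefault(word, set())  # keep words that ended up without neighbors
    return dict(adj)


tokens = "the valley of fear is the valley".split()
network = word_adjacency_network(tokens)
for word in sorted(network):
    print(word, sorted(network[word]))
```

Each key of `network` is a node, and membership of word `j` in `network[i]` plays the role of a nonzero entry of the adjacency matrix.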

II.2.1 Network construction

Prior to the transformation of the text into a network, some pre-processing steps may be required. In most applications devoted to representing texts as networks, the following three steps are performed. The first step is tokenization, which is responsible for splitting the document into meaningful units, such as words and punctuation marks. The second step is the removal of stopwords, which are words conveying little semantic meaning, such as articles and prepositions. The list of stopwords is shown in Section S3 of the Supplementary Information (available at https://dl.dropboxusercontent.com/u/2740286/automata.pdf). Note that, in this phase, punctuation marks are also disregarded, as they do not contribute to the semantic meaning of the text. Finally, in the third step, lemmatization is applied to map the remaining words into their canonical forms. As such, verbs and nouns are mapped to their infinitive and singular forms, respectively. The lemmatization process usually requires the identification of the individual parts of speech to resolve possible ambiguities. In this paper, we have used the Averaged Perceptron part-of-speech tagger proposed by Collins collins2002discriminative . An example of the pre-processing steps applied to a text extracted from the book The Valley of Fear, by Doyle, is shown in Section S4 of the Supplementary Information.
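The first two pre-processing steps can be illustrated with a minimal sketch. The stopword set below is a tiny hypothetical stand-in for the list in Section S3, the regular expression is only one possible tokenizer, and lemmatization is left out because it depends on a part-of-speech tagger.

```python
import re

# Hypothetical, deliberately tiny stopword list (the real list is in Section S3).
STOPWORDS = {"the", "of", "a", "an", "in", "to", "and", "is"}


def tokenize(text):
    """Split lowercase text into words and single punctuation marks."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|[^\w\s]", text.lower())


def remove_stopwords_and_punctuation(tokens):
    """Keep only alphabetic tokens that are not stopwords."""
    return [t for t in tokens if t[0].isalpha() and t not in STOPWORDS]


text = "The man, in the valley of fear, is waiting."
tokens = tokenize(text)
content = remove_stopwords_and_punctuation(tokens)
print(tokens)
print(content)
```

The list `content` is what would be handed to the lemmatizer (when one is used) and then to the network construction step.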

Although lemmatization is often used in NLP tasks, Toman toman2006influence argued that this pre-processing step does not affect the performance of general text classification systems. To our knowledge, there is no systematic analysis on the effect of lemmatization on network-based authorship recognition methods. For this reason, we have considered the following three variations in the application of pre-processing in raw texts: (i) none, no lemmatization is performed; (ii) partial, only nouns are lemmatized; and (iii) full, all words are lemmatized, as it is done in more traditional works.

II.2.2 Network measurements

In this section, we present a brief description of measurements used to characterize the topological properties of complex networks. These measurements are used here to study how the properties of text networks vary with distinct pre-processing steps. In addition, these measurements are also used for comparison and validation purposes.

The simplest measurements are the number of nodes ($N$) and edges ($E$). The density of a network is defined as $D = 2E / [N(N-1)]$, i.e. the ratio between the total number of edges and the maximum possible number of edges obtained in an equivalent fully connected network.

The degree $k_i$ of a node $i$ is defined as the number of neighbors that $i$ has and is given by

$$k_i = \sum_{j} a_{ij}. \qquad (2)$$

The coefficient $\gamma$ of the degree distribution, $P(k) \sim k^{-\gamma}$, is another widely known measurement in network science Newman:2010:NI:1809753 . Similar to other real-world networks, text adjacency networks display scale-free behavior 10.1371/journal.pone.0067310 . To estimate the coefficient $\gamma$, we used the strategy defined in Clauset-PowerLawMatlab . The degree is also usually measured in global terms as

$$\langle k \rangle = \frac{1}{N} \sum_{i} k_i. \qquad (3)$$

The quantity defined in equation 3 is the average degree, a measurement that has been applied in a myriad of network contexts Newman:2010:NI:1809753 , even though it is not representative of many of the studied distributions, since many networks display a fat-tailed behavior Li2016649 ; 10.1371/journal.pone.0110121 ; sscoor ; 0295-5075-99-2-28002 . This is the case of text networks, whose fat-tailed degree distribution stems from Zipf's law lantiq . However, in several cases, the average degree is useful to discriminate distinct topologies Newman:2010:NI:1809753 .

Another well-known connectivity measurement is the hierarchical degree $k_h$, which corresponds to the number of neighbors at distance $h$. This is a simple extension of the concept of node degree to further hierarchies. Despite its seeming simplicity, the use of hierarchies has proven useful to improve the characterization of several real-world networks amancio2011using .

While the degree is essentially a local measurement, some other indexes were specially devised to characterize the global topology of networks. This is the case of distance-based metrics. Measurements based on geodesic paths include the average shortest path length ($\ell$) and the diameter ($d$). The average shortest path length of a network is computed as

$$\ell = \frac{1}{N(N-1)} \sum_{i \neq j} d_{ij}, \qquad (4)$$

where $d_{ij}$ is the length of the shortest path between nodes $i$ and $j$. The diameter of a network, $d = \max_{i,j} d_{ij}$, is the largest path length among all distances.

The transitivity of the network was measured by the average clustering coefficient $C = \frac{1}{N} \sum_{i} c_i$, where $c_i$ is the clustering coefficient computed for node $i$ and measures the probability of any two neighbors of $i$ being linked. Mathematically, the local clustering coefficient is computed as

$$c_i = \frac{2 e_i}{k_i (k_i - 1)}, \qquad (5)$$

where $e_i$ represents the number of edges between the neighbors of node $i$. Even though this measure was originally used in social sciences, the clustering coefficient has been used to identify the specificity of words in distinct contexts.

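Finally, we used the assortativity to measure whether similar nodes are connected to each other. In this case, we used the concept of degree correlation, which assigns a high assortativity value to networks with edges established mostly between nodes with similar degree newman2002 . The assortativity is given by

$$r = \frac{E^{-1} \sum_{e} j_e k_e - \left[ E^{-1} \sum_{e} \tfrac{1}{2} (j_e + k_e) \right]^2}{E^{-1} \sum_{e} \tfrac{1}{2} (j_e^2 + k_e^2) - \left[ E^{-1} \sum_{e} \tfrac{1}{2} (j_e + k_e) \right]^2}, \qquad (6)$$

where the sums run over the $E$ edges of the network and $j_e$ and $k_e$ are the degrees of the nodes at the two ends of edge $e$. In general, text networks are disassortative, i.e. $r < 0$ 10.1371/journal.pone.0067310 .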
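Several of the measurements in this section can be computed directly from an adjacency map. The sketch below is illustrative only: the toy graph and all function names are hypothetical, the graph is assumed to be undirected and connected, and the formulas used are the standard textbook ones (density $2E/[N(N-1)]$, local clustering $2e_i/[k_i(k_i-1)]$).

```python
from collections import deque
from itertools import combinations


def density(adj):
    n = len(adj)
    e = sum(len(nbrs) for nbrs in adj.values()) / 2  # each edge is stored twice
    return 2 * e / (n * (n - 1))


def average_degree(adj):
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def local_clustering(adj, node):
    k = len(adj[node])
    if k < 2:
        return 0.0  # convention: nodes with fewer than two neighbors get zero
    links = sum(1 for u, v in combinations(adj[node], 2) if v in adj[u])
    return 2 * links / (k * (k - 1))


def average_clustering(adj):
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


def shortest_path_lengths(adj, source):
    """Breadth-first search distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_statistics(adj):
    """Average shortest path length and diameter (assumes a connected graph)."""
    lengths = [d for s in adj
               for t, d in shortest_path_lengths(adj, s).items() if t != s]
    return sum(lengths) / len(lengths), max(lengths)


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(density(toy), average_degree(toy), average_clustering(toy))
print(path_statistics(toy))
```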

II.3 Life-Like network automata

A network automaton can be defined as a tuple $\langle G, S, s, s_0, \phi \rangle$. $G$ represents the NA space, which is the topology of a network comprising $N$ nodes (cells). $S$ is the set of binary states $\{0, 1\}$, where $1$ is the live state and $0$ the dead state. The cell's state can be identified by the function $s$, such that $s(c_i, t)$ gives the state of cell $c_i$ at time $t$. Finally, $s_0$ represents the initial configuration of all cells (i.e. the configuration at $t = 0$) and $\phi$ is a transition function, i.e., the rule that governs the NA dynamics by defining how cell states are updated over time lifelikeNA . Hereafter, we consider that the automaton dynamics is stopped when $t = T$.

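The LLNA was proposed as a class of binary NA inspired by the rules of the Life-like Cellular Automata (CA) lifelikeNA , which use a set of outer-totalistic rules, i.e., rules that depend on the current state of a cell $c_i$ and on the states of its neighboring cells. The LLNA transition function is stated as

$$\phi: s(c_i, t+1) = \begin{cases} 1, & \text{if } s(c_i, t) = 0 \text{ and } \frac{x}{r} \leq \rho_i < \frac{x+1}{r} \quad \text{(born)} \\ 1, & \text{if } s(c_i, t) = 1 \text{ and } \frac{y}{r} \leq \rho_i < \frac{y+1}{r} \quad \text{(survive)} \\ 0, & \text{otherwise,} \end{cases} \qquad (7)$$

where the neighborhood density $\rho_i$ of node $i$ is the proportion of alive neighbors, i.e.

$$\rho_i = \frac{1}{k_i} \sum_{j} a_{ij} \, s(c_j, t). \qquad (8)$$

In the LLNA method, $r = 9$ due to Moore's neighborhood lifelikeNA . As a consequence, there exists a total of $2^{18} = 262144$ possible transition rules in the Life-Like family of rules lifelikeNA . In equation 7, the parameters $x$ and $y$ serve to label the rule in the form B$x$-S$y$, where B and S stand for "born" and "survive", respectively, and $x$ and $y$ are the possible digits in the rule described by equation 7. For instance, the rule B3-S23 is given by

$$\phi: s(c_i, t+1) = \begin{cases} 1, & \text{if } s(c_i, t) = 0 \text{ and } \frac{3}{9} \leq \rho_i < \frac{4}{9} \\ 1, & \text{if } s(c_i, t) = 1 \text{ and } \frac{2}{9} \leq \rho_i < \frac{4}{9} \\ 0, & \text{otherwise.} \end{cases} \qquad (9)$$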
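A synchronous update of a Life-Like network automaton can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference implementation: the names `parse_rule`, `density_bin`, `llna_step` and `evolve` are hypothetical, the neighborhood density is binned into nine intervals of width 1/9, a density of exactly 1 is placed in the last interval, and a node without neighbors is treated as having density 0.

```python
def parse_rule(label):
    """Turn a label such as 'B3-S23' into (born digits, survive digits)."""
    born, survive = label.split("-")
    return {int(d) for d in born[1:]}, {int(d) for d in survive[1:]}


def density_bin(alive, k, r=9):
    # Integer form of floor(rho * r) with rho = alive / k.
    # Assumption: rho == 1 is clamped into the last bin (r - 1).
    return min((alive * r) // k, r - 1)


def llna_step(adj, state, rule):
    """Apply the rule to every node at once and return the new state map."""
    born, survive = rule
    new_state = {}
    for node, nbrs in adj.items():
        k = len(nbrs)
        alive = sum(state[n] for n in nbrs)
        b = density_bin(alive, k) if k else 0  # assumption: isolated node -> bin 0
        digits = survive if state[node] == 1 else born
        new_state[node] = 1 if b in digits else 0
    return new_state


def evolve(adj, state, rule_label, steps):
    """Return the list of states from t = 0 to t = steps."""
    rule = parse_rule(rule_label)
    history = [dict(state)]
    for _ in range(steps):
        state = llna_step(adj, state, rule)
        history.append(state)
    return history


# Toy star network: center c linked to three leaves.
star = {"c": {"n1", "n2", "n3"}, "n1": {"c"}, "n2": {"c"}, "n3": {"c"}}
initial = {"c": 0, "n1": 1, "n2": 0, "n3": 0}
history = evolve(star, initial, "B3-S23", steps=2)
for t, snapshot in enumerate(history):
    print(t, [snapshot[n] for n in ("c", "n1", "n2", "n3")])
```

In this toy run the center has one alive neighbor out of three (density 1/3, digit 3), so it is born at the first step under B3-S23, while the alive leaf dies because its only neighbor is dead.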

II.3.1 LLNA measurements

The dynamics of a network automaton provides a global spatio-temporal pattern of evolution. Thus, each network node can be analyzed as a sequence of ones and zeros. A set of measurements, such as the Shannon entropy and the Lempel-Ziv complexity, was suggested to extract quantitative properties from the generated spatio-temporal patterns lifelikeNA .

The Shannon entropy of a binary sequence is defined as

$$H = -p_1 \log_2 p_1 - p_0 \log_2 p_0, \qquad (10)$$

where $p_1$ and $p_0$ are the probabilities of having ones and zeros in the sequence, respectively shannon19481mathematical . The Shannon entropy ranges in the interval $[0, 1]$, where oscillating and complex spatio-temporal patterns tend to higher entropy values, while steady patterns tend to lower values.
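As a small worked illustration, the entropy of a node's binary state sequence can be computed as below. The function name `shannon_entropy` is hypothetical, and the usual convention that a zero-probability term contributes nothing is assumed.

```python
from math import log2


def shannon_entropy(bits):
    """Binary Shannon entropy (base 2) of a sequence of zeros and ones."""
    n = len(bits)
    if n == 0:
        return 0.0
    p1 = sum(bits) / n
    entropy = 0.0
    for p in (p1, 1 - p1):
        if p > 0:  # convention: 0 * log2(0) is taken as 0
            entropy -= p * log2(p)
    return entropy


print(shannon_entropy([0, 1, 0, 1]))  # oscillating sequence
print(shannon_entropy([1, 1, 1, 1]))  # steady sequence
print(shannon_entropy([1, 0, 0, 0]))
```

An oscillating node reaches the upper end of the range, while a node that never changes state sits at the lower end.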

The Lempel-Ziv complexity , different from Shannon entropy, is a measurement based on the number of different blocks () that a sequence can contain LEMPELZIV76 . A minimum block is defined using the first bit on the left of the sequence. Then, one moves rightward, bit by bit, until an unseen subsequence appears, which is formed starting exactly after a previous block and ending at the current position. For instance, the binary sequence of length , can be divided into minimum blocks: . Given the number of blocks , the Lempel-Ziv complexity is computed as

(11)
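The block-counting idea can be sketched with the classical Lempel-Ziv (1976) exhaustive-history parsing. This is an assumption about the variant intended in the text: here a block is extended until it no longer occurs anywhere in the preceding part of the sequence. The normalization $W \log_2(n) / n$ used in `lz_complexity` is a commonly used choice and is likewise an assumption; the function names and the example sequence are illustrative.

```python
from math import log2


def lz76_blocks(seq):
    """Parse a string of '0'/'1' into Lempel-Ziv (1976) blocks."""
    blocks = []
    i, n = 0, len(seq)
    while i < n:
        j = i + 1
        # Grow the candidate block while it already occurs in seq[:j-1].
        while j <= n and seq[i:j] in seq[:j - 1]:
            j += 1
        blocks.append(seq[i:min(j, n)])
        i = j
    return blocks


def lz_complexity(seq):
    """Normalized complexity W * log2(n) / n (assumed normalization)."""
    n = len(seq)
    if n < 2:
        return 0.0
    return len(lz76_blocks(seq)) * log2(n) / n


sequence = "0001101001000101"
print(lz76_blocks(sequence))
print(lz_complexity(sequence))
```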

In literature, there exist several statistical similarity measurements designed to compare two binary sequences and  Lesot:2009:SMB:1479242.1479248 . Most of these measurements are defined in terms of the following binary instances =, , and . The most traditional measurements are

In our experiments, we have compared binary sequences by considering both spatial and temporal patterns. For spatial patterns, the binary sequences generated by all nodes at two distinct time steps are compared. Analogously, for temporal patterns, the sequences generated by two nodes are compared over all time steps. In short, the spatio-temporal states of nodes can be represented in matrix form, where the element stored in the $t$-th row and $i$-th column represents the state of node $i$ at time $t$. Thus, spatial patterns are analyzed via comparison of horizontal sequences (rows), while temporal patterns are analyzed by comparing vertical sequences (columns): pairs of horizontal sequences obtained at two time steps are compared and, in a similar fashion, pairs of vertical sequences obtained from two nodes are compared. Each kind of comparison yields its own similarity distribution, one spatial and one temporal. Further experiments regarding the influence of the parameters that control which pairs of sequences are compared are explained in Section S1 of the Supplementary Information.

II.3.2 LLNA-based pattern recognition

We employed the LLNA method to extract the intrinsic patterns from textual networks, with the aim of distinguishing among authors' writing styles. Supervised classification techniques were used for this purpose. In the so-called training phase, these techniques first identify patterns for each author's writing style. Then, the patterns identified in the training phase are used to classify unseen instances in the classification phase. In this manuscript, several well-known supervised classification methods were employed: Bayesian Networks (BNT), Naive Bayes (NVB), RBF Networks (RBF), Multi Layer Perceptron (MLP), Support Vector Machines (SVM), k Nearest Neighbors (kNN), C4.5 (C45) and Random Forest (RFO) Bishop:2006:PRM:1162264 . All classifiers were set up with their default configuration of parameters, as suggested in 10.1371/journal.pone.0094137 .

To evaluate the performance of the classification, we used the k-fold cross-validation strategy Bishop:2006:PRM:1162264 . At each iteration, this method splits the data into two sets: the training dataset is the set of samples used for training purposes, while the test set is used for validation purposes. Since these two sets are mutually exclusive and, therefore, the evaluation is performed over unknown instances, the cross-validation method is a reliable strategy. In this study, we use $k = 5$ because each author was characterized by a set of 5 books (see the description of the dataset in Section II.4). Thus, at each iteration, one book of each author is chosen to compose the test dataset, while the remaining books form the training dataset.
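The splitting scheme described above, with one book per author in each test set, can be sketched as follows. The function `author_folds` and the placeholder book labels are hypothetical, and the assignment of a particular book to a particular fold (here simply by position) is an assumption.

```python
def author_folds(books_by_author, k=5):
    """Build k (train, test) splits with one book per author in each test set."""
    folds = []
    for f in range(k):
        test = [(author, books[f]) for author, books in books_by_author.items()]
        train = [(author, book)
                 for author, books in books_by_author.items()
                 for j, book in enumerate(books) if j != f]
        folds.append((train, test))
    return folds


# Placeholder corpus: two authors with five books each (labels are made up).
corpus = {
    "author_A": ["A1", "A2", "A3", "A4", "A5"],
    "author_B": ["B1", "B2", "B3", "B4", "B5"],
}
folds = author_folds(corpus, k=5)
for train, test in folds:
    print(len(train), test)
```

With 8 authors and 5 books each, every fold would hold 8 test books and 32 training books.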

The results were also further probed by using confusion matrices, which are structures reporting, for each possible class (in our case, for each distinct author), the relationship between predicted and real classes. Traditionally, a confusion matrix is used to identify the following patterns of performance: the diagonal element $c_{ii}$, which is the number of instances belonging to class $i$ that were correctly assigned to $i$; and the off-diagonal element $c_{ij}$, which is the number of instances belonging to class $i$ that were incorrectly assigned to class $j$. In particular, the quantity $c_{ij}$ will be useful to identify which authors cannot be discriminated with the proposed technique.
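A confusion matrix of this kind can be built with a few lines of Python. The sketch below uses three author names that appear in the paper, but the true and predicted labels are invented purely for illustration and do not reproduce any reported result.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are real classes, columns are predicted classes."""
    index = {c: n for n, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for real, predicted in zip(true_labels, predicted_labels):
        matrix[index[real]][index[predicted]] += 1
    return matrix


classes = ["Darwin", "Dickens", "Munro"]
# Made-up labels for six hypothetical test books.
real = ["Darwin", "Darwin", "Dickens", "Dickens", "Munro", "Munro"]
predicted = ["Darwin", "Darwin", "Munro", "Dickens", "Munro", "Munro"]

matrix = confusion_matrix(real, predicted, classes)
for name, row in zip(classes, matrix):
    print(name, row)
```

Diagonal entries count correct assignments; a nonzero off-diagonal entry such as the Dickens row, Munro column above flags a pair of authors that the classifier confuses.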

II.4 Dataset

An English corpus of known authors (labeled instances in the supervised training phase) was created to evaluate the accuracy of the proposed method. The corpus comprises 100 books, which were extracted from the Project Gutenberg repository (see www.gutenberg.org). The books in our dataset were written by 20 distinct authors. The full list of books and their respective authors is provided in Section S2 of the Supplementary Information. The distribution of books per author is uniform, i.e. each author is represented by a set of 5 books. In this study, we considered the task of discriminating among 8 distinct authors. This dataset is hereafter referred to as validation-dataset. Note that datasets using a similar distribution of authors and genres have been considered in related works amancio2011comparing ; ebrahimpour2013automated ; amancio2015concentric ; amancio2015authorship . The remaining set of 12 authors, hereafter referred to as rule-selection-dataset, was used for the particular process of selecting the best set of LLNA rules. Note that the choice of the best rules was performed on a different dataset because, if the same dataset were used for selecting rules and evaluating classifiers, the obtained results might not represent a true classifier generalization Bishop:2006:PRM:1162264 .

In the general scenario of textual classification, the application of pre-processing steps may be useful for the task at hand. In semantical tasks, such as word sense disambiguation, the lemmatization of words plays an important role in the performance Navigli:2009:WSD:1459352.1459355 . In the authorship attribution task, conversely, this same lemmatization step may lead to a great loss of information, hindering the accurate identification of authors' particular writing choices ASI:ASI21001 . However, it has been shown that in network-based techniques, the lemmatization step is important to cluster distinct written forms into the same node. In our experiments, we evaluated three types of lemmatization strategies to generate the textual networks, which led to the creation of three distinct variations of both the validation-dataset and the rule-selection-dataset.

  1. none-dataset: the original dataset was kept, i.e. the lemmatization step was disregarded.

  2. partial-dataset: the lemmatization was applied only in nouns. Thus, all nouns are mapped to their singular forms.

  3. full-dataset: the lemmatization was applied to all words. Therefore, verbs and nouns are mapped to their infinitive and singular forms, respectively.

III Results and discussion

The main purpose of this manuscript is to characterize networks representing written texts to obtain informative features for the authorship attribution task. Unlike traditional approaches, here we explored the use of LLNA rules to discriminate network topologies. We have used this approach because it has been shown that authors' particular writing choices modify word adjacency networks in a consistent way amancio2011comparing .

As described in Section II.4, our dataset comprises books written by 20 distinct authors, and three distinct pre-processing strategies were probed to generate the textual networks. In Section III.1, we qualitatively discuss the patterns arising from the dynamics of the LLNA modelling for each book. In Section III.2, we perform the selection of the best LLNA rules, which are then applied in the authorship problem described in Section III.3. In Section III.4, we compared the proposed approach with the one based on traditional topological measurements amancio2011comparing . Finally, in Section III.5, we explore the effects of the lemmatization process on the properties of the networks.

III.1 LLNA spatio-temporal pattern

Table 1 shows the spatio-temporal diagrams of the 40 networks of the partial-dataset using rule B024678-S4. A spatio-temporal diagram is the representation of the states over time; thus, each column represents the state of a given node and each line represents one time step. In this particular case, for each spatio-temporal diagram, the columns were ordered by node degree. Thus, the left-most columns are the nodes with the lowest degrees $k$ and the right-most columns are the ones with the largest values of $k$. Note that the number of nodes varies across networks (also reported in Figure 5); therefore, the diagrams are formed by a different number of columns. For simplicity's sake, the diagrams were scaled to fit within the columns of the table.

Table 1: Spatio-temporal diagrams using the LLNA rule B024678-S4 obtained from books written by eight authors. The partial-dataset was used in this case. The LLNA dynamics was run up to the final time step and the initial states were defined by a random uniform distribution. The spatio-temporal diagram shows the nodes' states: dead, in black; and alive, in white. While the horizontal axis represents the nodes (sorted in increasing order of degree $k$), the vertical axis represents the temporal variable.

Notice that, for the particular LLNA rule B024678-S4, Table 1 reveals a general pattern among all the authors. Several notable regions arise: the leftmost corresponds to an oscillatory pattern with a higher tendency of alive nodes, followed by a region with a tendency of dead nodes (region comprising nodes with average degree ). Then, another shorter oscillatory region appears, followed by a second region that also presents a higher frequency of dead nodes (region comprising nodes with average degree ). The reader should note that rule B024678-S4 does not favor nodes with average degrees and for birth and survival conditions, which explains the distribution of these vertical patterns in the diagrams. The influence of this rule over the nodes with average degree is less apparent due to the lower frequency of these nodes. The rightmost nodes, which correspond to hubs in the network, also show oscillatory patterns that are directly related to the dynamics of rule B024678-S4, which favors the birth of nodes and penalizes their survival. Therefore, there is a dependency between the rule and the network topology.

Despite the above-mentioned similar structures in the spatio-temporal diagrams, author-dependent patterns can also be noted. For instance, the patterns obtained for Darwin in all five books are strongly similar. Darwin's textual networks present a bigger region corresponding to nodes with average degree , and a larger ratio of nodes with high connectivity which are influenced by the rule. Therefore, the spatio-temporal diagrams lead us to deduce that books written by the same author exhibit similar patterns that allow them to be distinguished from those of the other authors, and that there is a strong dependency on the LLNA rule.

Based on the spatio-temporal diagram displayed in Table 1, we applied measurements (see Section II.3.1) that allow the characterization of the textual networks in terms of a time series containing only zeros and ones. Before presenting the results of the classification based on time series analysis in Section III.3, we first address the LLNA rule selection in the next section.

III.2 LLNA rule selection

The rule selection is an important parameter to achieve higher accuracies using the LLNA method lifelikeNA . We exhaustively evaluated each of the possible Life-Like rules using the rule-selection-dataset comprising 12 authors. As discussed before, the reader should note that the rule selection was performed on a different dataset in order to obtain LLNA rules that best represent a true classifier generalization Bishop:2006:PRM:1162264 .

To characterize the dynamics of the LLNA, we used a feature vector storing the Shannon entropy and the Lempel-Ziv distributions computed over the time steps of the dynamics. Because the choice of the best rule encompasses the induction and evaluation of classifiers, we only used the kNN method in this phase. We have chosen this method in particular because, in general, it generates good results while keeping an excellent processing time Bishop:2006:PRM:1162264 . Note that the application of other methods in this phase, such as neural networks or SVM, would be impractical owing to the time complexity associated with these methods Bishop:2006:PRM:1162264 .
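To make the low cost of this classifier concrete, a nearest-neighbor prediction can be written as below. This is a sketch assuming k = 1 and the Euclidean distance; the function name `nearest_neighbor_predict`, the toy feature vectors and the author labels are all hypothetical.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def nearest_neighbor_predict(training_set, query):
    """Return the label of the training vector closest to the query."""
    closest = min(training_set, key=lambda item: dist(item[0], query))
    return closest[1]


# Toy 3-bin feature vectors (made up) for two hypothetical authors.
training_set = [
    ([0.7, 0.2, 0.1], "author_A"),
    ([0.6, 0.3, 0.1], "author_A"),
    ([0.1, 0.3, 0.6], "author_B"),
    ([0.2, 0.2, 0.6], "author_B"),
]
print(nearest_neighbor_predict(training_set, [0.65, 0.25, 0.10]))
print(nearest_neighbor_predict(training_set, [0.15, 0.25, 0.60]))
```

There is no training step beyond storing the vectors, which is what makes an exhaustive sweep over a very large rule space feasible.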

Figure 2 depicts the histogram distribution of the accuracies obtained for the complete rule-space of the LLNA. Most of the rules yielded low accuracy classifiers. Typically, accuracies lower than 40% have been found. In this study, we only selected the 400 rules yielding the highest accuracy rates. Note that the selection of best rules is performed independently in each of three datasets: none-, partial- and full- from the rule-selection-dataset. Moreover, as the selection rule is a preliminary phase, one should expect that among the set of best rules further improvement can be achieved by using other LLNA measurements lifelikeNA .

Figure 2: Histogram of the distribution of accuracy for all evaluated rules of the LLNA in the rule-selection-dataset comprising 12 authors. From left to right, the histograms for the 3 datasets none, partial and full are shown, respectively. As an example, the five highlighted rules maximize the classification of the rule-selection-dataset when a partial lemmatization was applied. For this rule selection experiment, both the Shannon entropy and the Lempel-Ziv complexity were considered as feature vectors, together with the kNN classifier.

III.3 Classification of authorship networks

For the authorship identification problem, we applied the best rules obtained to identify authorship in the validation-dataset comprising the 8 authors. First, we compared the three datasets, none-, partial- and full-dataset by using different measurements extracted from the LLNA dynamics: the Shannon entropy distribution , the Lempel-Ziv distribution , and the binary distance distribution, which can be analyzed in a twofold way: horizontally and vertically (see Section II.3.1).

We evaluated the performance of the classification by using different LLNA measurements, extracted from the spatio-temporal pattern, in two ways: isolated and combined. Thus, four feature vectors were used to characterize authors' styles. The first feature vector is composed of the distribution of the Shannon entropy, which is divided into 40 bins and therefore contains 40 attributes. Similarly, the second feature vector is composed of the Lempel-Ziv complexity distribution divided into 40 bins. This vector was normalized by the maximum value achieved among the group of samples. The third and fourth feature vectors are the binary distance distributions, which were explored by means of vertical and horizontal analyses and contain 30 attributes per measurement. Finally, the combined vector concatenates the four vectors and contains 140 attributes (40 + 40 + 30 + 30).
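The construction of these feature vectors can be sketched as a binned distribution per measurement followed by concatenation. The helper `histogram`, the use of relative frequencies, the value range [0, 1] and the sample values are assumptions made for illustration; only the bin counts (40, 40, 30, 30) come from the text.

```python
def histogram(values, bins, lo=0.0, hi=1.0):
    """Relative-frequency histogram of values in [lo, hi] with equal-width bins."""
    counts = [0] * bins
    for v in values:
        b = min(int((v - lo) / (hi - lo) * bins), bins - 1)  # hi falls in last bin
        counts[b] += 1
    return [c / len(values) for c in counts]


# Made-up per-node (or per-pair) measurements, all scaled to [0, 1].
entropy_values = [0.0, 0.5, 0.99, 1.0]
lempel_ziv_values = [0.1, 0.4, 0.4, 0.9]
vertical_distances = [0.2, 0.3, 0.8]
horizontal_distances = [0.05, 0.5, 0.95]

entropy_vector = histogram(entropy_values, bins=40)
combined = (entropy_vector
            + histogram(lempel_ziv_values, bins=40)
            + histogram(vertical_distances, bins=30)
            + histogram(horizontal_distances, bins=30))
print(len(entropy_vector), len(combined))
```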

We tested the accuracy of the 400 selected rules (see Section III.2) with different feature vectors as well as with their combination. Table 2 presents the best rules obtained for the validation-dataset. The first four accuracy columns show the rates obtained for each distinct feature vector. The results obtained when combining these distributions are shown in the last column of the same table. Note that a single isolated feature vector (fourth accuracy column) yielded the maximum accuracy of 76.03% (± 12.02%) for rule B024678-S4 when using the partial-dataset.

Lemmat.  Rule            Isolated feature vectors (one per column)                       Combined
None     B03468-S0368    45.48 (±11.86)  39.40 (±13.85)  22.08 (±13.33)  72.75 (±13.54)  50.13 (±14.28)
None     B138-S3         43.95 (±15.82)  41.23 (±15.02)  18.58 (±11.32)  70.88 (±12.94)  41.73 (±15.32)
None     B0124678-S4568  40.40 (±14.06)  47.03 (±14.10)  43.03 (±16.52)  68.10 (±14.37)  53.85 (±16.07)
Partial  B024678-S4      35.58 (±15.74)  52.63 (±15.74)  41.18 (±14.65)  76.03 (±12.02)  49.85 (±16.24)
Partial  B02468-S1346    42.85 (±12.24)  54.35 (±11.54)  22.75 (±11.79)  68.25 (±11.28)  53.75 (±12.09)
Partial  B01346-S1357    31.08 (±14.94)  47.55 (±14.19)  36.13 (±15.44)  64.43 (±14.56)  45.23 (±13.03)
Full     B1457-S3568     36.75 (±10.32)  46.38 (±14.32)  9.73 (±8.71)    72.72 (±13.20)  50.45 (±13.61)
Full     B15-S278        31.68 (±11.61)  35.85 (±13.81)  30.30 (±11.77)  65.80 (±16.13)  33.83 (±13.31)
Full     B014568-S13478  42.95 (±16.30)  24.38 (±14.18)  32.20 (±14.07)  65.78 (±13.34)  38.90 (±13.59)
Table 2: Accuracy rate (%) obtained using different measurements, and their combination, as attributes to classify the 8 authors of the validation-dataset. To select the best rules, we used kNN with k=1 and 5-fold cross-validation. The best results among all classifiers (see Section II.3.2) were also obtained with the kNN method.
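The evaluation protocol named in the caption (kNN with k=1 and 5-fold cross-validation) can be sketched as below. This is a minimal illustration, not the implementation used in the experiments: the Euclidean distance, the fold assignment and the synthetic feature vectors are all assumptions, and the function names are hypothetical. A single pass is shown; the standard deviations reported in Table 2 suggest that the actual procedure involved repetitions.

```python
import math
import random


def nearest_label(train, query):
    """Return the label of the training sample closest to `query` (Euclidean distance)."""
    best = min(train, key=lambda item: math.dist(item[0], query))
    return best[1]


def one_nn_cv_accuracy(samples, n_folds=5):
    """Stratified k-fold cross-validation accuracy of a 1-nearest-neighbour classifier.

    `samples` is a list of (feature_vector, label) pairs. The i-th sample of each
    label is sent to fold i % n_folds, so every fold holds one book per author
    when each author has exactly n_folds books.
    """
    seen = {}
    folds = [[] for _ in range(n_folds)]
    for features, label in samples:
        i = seen.get(label, 0)
        folds[i % n_folds].append((features, label))
        seen[label] = i + 1

    correct = 0
    for held_out in range(n_folds):
        train = [s for f, fold in enumerate(folds) if f != held_out for s in fold]
        correct += sum(nearest_label(train, x) == y for x, y in folds[held_out])
    return correct / len(samples)


# Synthetic, well-separated data: 8 authors with 5 books each.
random.seed(1)
samples = []
for author in range(8):
    for _ in range(5):
        vec = [author * 10 + random.uniform(-1, 1) for _ in range(4)]
        samples.append((vec, author))

accuracy = one_nn_cv_accuracy(samples)
print(accuracy)
```

With real feature vectors the clusters overlap, so the accuracy would fall below the value obtained on this artificially separable example.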

To illustrate the discriminability obtained with our method, in Figure 3-a) we show a principal component analysis projection into two dimensions. In this case, the partial-dataset was analyzed, with a dynamics based on the rule B024678-S4 and a characterization based on a single feature vector. Even though only two dimensions were used to visualize our data, there is a clear separation between Darwin and the other authors. A similar pattern occurs for Munro. Interestingly, some authors display a very consistent style (Munro and Wodehouse), while others can vary their styles considerably from book to book (e.g. Dickens).

Figure 3: a) Principal component analysis performed for the authorship recognition task using the five books from each author of the validation-dataset with partial lemmatization. For this plot, rule B024678-S4 was used together with a single feature vector. b) Confusion matrix obtained with the kNN method for the best classification rate. Each cell shows the number of correctly predicted instances, where nonzero elements are indicated.

In Figure 3-b) we provide the confusion matrix obtained with the best rule. As expected, Darwin is easily distinguished from the other authors. In a similar fashion, the induced classifier can perfectly discriminate among Wodehouse, Darwin, Poe, and Munro. The author with the lowest classification accuracy is Doyle, since three of his books were incorrectly assigned to Dickens, Hardy and Munro.

The best accuracy rate, obtained with the best configuration of parameters, shows that the proposed features can capture authors’ particularities in writing style, thus allowing the discrimination of authors in unknown texts. Note that a random authorship attribution would correctly recognize an author with probability 1/8 in our dataset comprising 40 books (8 authors with five books each). Thus, the p-value associated with the obtained number of correctly classified books (see Figure 3-b)) is

(12)

thus confirming the significance of the obtained results. In the next section, we assess the relevance of our results in comparison with the traditional characterization relying only on topological measurements of networks.
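One common way to obtain a p-value of this kind is the upper tail of a binomial distribution, sketched below. The sketch makes several assumptions: each book is attributed independently, a random guess is correct with probability 1/8 (8 authors), the dataset has 40 books (8 authors with five books each), and the number of correctly classified books, k_correct = 30, is a hypothetical value chosen only for illustration. It is not claimed to reproduce the exact expression or value of Equation (12).

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of at least k successes by luck."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_books = 40      # assumed: 8 authors x 5 books
p_random = 1 / 8  # assumed: uniform random guess among 8 authors
k_correct = 30    # hypothetical count of correctly classified books

p_value = binomial_tail(n_books, k_correct, p_random)
print(f"{p_value:.2e}")
```

Under these assumptions the tail probability is many orders of magnitude below the usual significance thresholds, which is consistent with the conclusion drawn in the text.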

III.4 Evaluation of traditional measurements and robustness analysis

We compared the results obtained with the Life-Like network automata against those obtained with traditional measurements used to characterize complex networks amancio2011comparing. The left side of Table 3 shows the accuracy obtained in the classification of the network models when using traditional network measurements. Note that the performance of the traditional method is, in general, better when no lemmatization is applied. The best result was obtained with the SVM classifier (61.30%), which is similar to the best results reported in amancio2011comparing. A similar performance was also obtained with the MLP classifier (59.23%). The right side of Table 3 shows the results obtained with the proposed method. Rules B03468-S0368, B024678-S4 and B1457-S3568 provided the highest accuracies for the none-, partial- and full-dataset, respectively, when using only the binary distance distribution. Considering all the variations of both datasets and classifiers, the highest accuracy rate was 76.03%. This means that our method outperformed the traditional technique by a margin of 14.73 percentage points when comparing the best configuration of both strategies. The best results obtained by each strategy are also illustrated in Figure 4-a).

            Traditional network measurements               Proposed method (LLNA)
Classifier  None           Partial        Full            None            Partial         Full
                                                          (B03468-S0368)  (B024678-S4)    (B1457-S3568)
BN          48.23 (14.88)  45.43 (14.18)  44.23 (13.62)   65.58 (14.30)   44.73 (12.86)   50.55 (14.30)
NVB         58.28 (15.16)  56.23 (14.5)   51.13 (15.07)   62.80 (15.24)   57.48 (15.37)   50.10 (14.42)
MLP         59.23 (13.92)  50.73 (14.13)  45.03 (15.2)    69.63 (14.36)   59.25 (14.11)   60.50 (14.41)
KNN         52.00 (14.9)   49.40 (14.52)  43.65 (16.38)   72.75 (13.54)   76.03 (12.02)   72.72 (13.20)
C45         44.25 (13.05)  42.55 (14.72)  42.13 (14.47)   52.15 (14.56)   32.53 (13.29)   45.05 (15.22)
RF          53.43 (15.01)  54.60 (14.44)  45.88 (14.18)   69.45 (12.70)   61.25 (14.5)    63.32 (14.51)
RBF         52.48 (14.17)  51.68 (14.68)  46.38 (15.52)   27.30 (6.83)    51.15 (15.67)   38.40 (8.33)
SVM         61.30 (15.56)  49.28 (13.97)  50.20 (14.70)   72.65 (12.69)   70.03 (13.38)   66.45 (14.13)
Table 3: Comparison of the accuracy rate (%) obtained using traditional network measurements and the proposed method based on network automata. Remarkably, our method outperforms the traditional approach by an average margin of .
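The best-versus-best margin between the two strategies can be recomputed directly from the mean accuracies in Table 3. The short sketch below does so; the dictionary layout and variable names are illustrative choices, and only the mean values (not the standard deviations) are used.

```python
# Mean accuracies (%) from Table 3, ordered as (none, partial, full).
traditional = {
    "BN": (48.23, 45.43, 44.23),
    "NVB": (58.28, 56.23, 51.13),
    "MLP": (59.23, 50.73, 45.03),
    "KNN": (52.00, 49.40, 43.65),
    "C45": (44.25, 42.55, 42.13),
    "RF": (53.43, 54.60, 45.88),
    "RBF": (52.48, 51.68, 46.38),
    "SVM": (61.30, 49.28, 50.20),
}
llna = {
    "BN": (65.58, 44.73, 50.55),
    "NVB": (62.80, 57.48, 50.10),
    "MLP": (69.63, 59.25, 60.50),
    "KNN": (72.75, 76.03, 72.72),
    "C45": (52.15, 32.53, 45.05),
    "RF": (69.45, 61.25, 63.32),
    "RBF": (27.30, 51.15, 38.40),
    "SVM": (72.65, 70.03, 66.45),
}

# Best configuration of each strategy over all classifiers and datasets.
best_traditional = max(max(row) for row in traditional.values())
best_llna = max(max(row) for row in llna.values())
margin = round(best_llna - best_traditional, 2)

print(best_traditional, best_llna, margin)  # 61.3 76.03 14.73
```

The best traditional result corresponds to SVM on the none-dataset, and the best LLNA result to kNN on the partial-dataset.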

The robustness of the proposed methodology with regard to the total number of authors considered was verified by considering other variations of authors in the validation-dataset. To do so, we selected all variations of authors among the total of authors. We then applied the proposed methodology to probe the sensitivity of our method to specific datasets. As shown in Figure 4-b), there is only a minor variation in the accuracy when considering datasets of authors, suggesting that our method is robust with regard to the variation of datasets. A similar procedure was performed to study the robustness in datasets comprising a distinct number of authors (from 2 to 7 authors). Note that, in these other scenarios, a similarly robust behavior was found. Interestingly, similar accuracy results have been obtained when considering 3 and 8 authors, thus suggesting that our method is more effective when more complex authorship attribution tasks are considered.
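The enumeration of author subsets used in such a robustness analysis can be sketched with combinations, as below. The author list is assembled from the names mentioned in this section and is assumed to correspond to the eight authors of the validation-dataset; the function name is hypothetical, and the classification of each subset is left out.

```python
from itertools import combinations

# Assumed to be the eight validation authors (names as mentioned in the text).
authors = ["Darwin", "Dickens", "Doyle", "Hardy", "Munro", "Poe", "Stoker", "Wodehouse"]


def author_subsets(names, size):
    """Return every subset of `size` distinct authors, as tuples."""
    return list(combinations(names, size))


# Number of distinct datasets available for each number of authors.
subset_counts = {k: len(author_subsets(authors, k)) for k in range(2, len(authors) + 1)}
print(subset_counts)  # {2: 28, 3: 56, 4: 70, 5: 56, 6: 28, 7: 8, 8: 1}
```

Each subset would then be classified with the same pipeline, and the accuracies averaged per subset size.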

Figure 4: a) Comparison of the accuracy obtained by the proposed method (left side) and the classical network measurements (right side). The histograms on the left (mean and standard deviation) represent the best accuracies obtained when using rules B03468-S0368, B024678-S4 and B1457-S3568 for the none-, partial- and full-dataset, respectively. In a similar way, the histograms on the right show the best accuracies obtained using network measurements as a feature vector. For all these experiments the kNN method was used. b) Average accuracy obtained in the variations of the original dataset. Each variation considers a different number of authors, which ranges from to .

III.5 Effect of the lemmatization on network measurements

Table 4 shows the preliminary topological properties for one of Doyle’s books modeled as a network, considering the three lemmatization processes (none, partial and full). The columns show the measurements presented in Section II.2.2, as follows: number of nodes, number of edges, average degree, clustering coefficient, average path length, power-law exponent, diameter, density and degree assortativity.

From the same table, one can note a decrease in both the number of nodes and the number of edges, while the average degree increases. This effect can be explained by the fact that, when the lemmatization process is performed, the multiple representations of a word are all transformed to its canonical form, e.g., the words has and have will have only one representation in a network, the node have, instead of two. Moreover, the diameter of all the networks remains around 11. We also observed that all networks studied here follow a power law with an exponent around . Therefore, these textual networks have a scale-free structure, which is supported by the maximum likelihood method and the Kolmogorov-Smirnov statistic, which accepts the hypothesis of a reasonable fit. Moreover, this property is consistent with the scale-free textual networks found in the literature.

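Lemm.    Nodes  Edges  Avg. degree  Clustering coeff.  Avg. path length  Power-law exp.  Diameter  Density  Assortativity
None     5914   22991  7.78         0.04               3.63              2.33            11        0.0013   -0.06
Partial  5374   22775  8.48         0.04               3.54              2.29            11        0.0016   -0.06
Full     4977   22451  9.02         0.05               3.47              2.20            10        0.0018   -0.07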
Table 4: Measurements extracted for the textual network corresponding to Doyle’s book “Uncle Bernac - A Memory of the Empire” regarding the three types of lemmatization process (none-, partial- and full-dataset).
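As a quick consistency check, the average degree and density in Table 4 can be recomputed from the node and edge counts. The sketch below assumes the standard formulas for an undirected simple graph, 2E/N for the average degree and 2E/(N(N-1)) for the density; the reported values are consistent with these formulas, although the exact definitions are given in Section II.2.2 and may be stated differently there.

```python
# (nodes, edges) for each lemmatization setting, taken from Table 4.
table4 = {
    "none": (5914, 22991),
    "partial": (5374, 22775),
    "full": (4977, 22451),
}


def average_degree(n_nodes, n_edges):
    """Average degree of an undirected graph: each edge contributes to two nodes."""
    return 2 * n_edges / n_nodes


def density(n_nodes, n_edges):
    """Fraction of possible undirected edges that are present."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


for name, (n, e) in table4.items():
    print(name, round(average_degree(n, e), 2), round(density(n, e), 4))
# none 7.78 0.0013
# partial 8.48 0.0016
# full 9.02 0.0018
```

The recomputed figures make the trend explicit: merging word forms removes nodes faster than edges, so both the average degree and the density rise with the degree of lemmatization.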
Figure 5: Average network measurements for the eight authors highlighted in the diagrams and for the three datasets: none-, partial- and full-dataset (see description in Section II.4). The following distributions are shown for each author: number of nodes, number of edges, average connectivity, average clustering coefficient, average path length, diameter, density, power-law exponent and degree assortativity.

Figure 5 presents a set of average topological measurements calculated for each author of the validation-dataset. The standard deviation was obtained considering the five books of each author. Figure 5 also shows the values obtained for the three variations of dataset. The main results concerning each measurement are described below:

  • Total number of nodes and edges: the number of nodes decreases with the lemmatization process, whereas the number of edges is hardly influenced by this process. This occurs because, even when nodes are removed during the lemmatization, adjacency relationships are not affected and, consequently, the degree of the remaining nodes tends to increase. This effect is evident in the top-right diagram displaying the average network connectivity.

  • Average clustering coefficient: This measurement is influenced by both the number of nodes and the number of edges. It tends to increase with the lemmatization process because the network retains almost the same number of edges, while the number of nodes decreases as a consequence of mapping distinct variations of the same concept into the same node.

  • Average shortest path length: Similarly to the number of edges, the average shortest path length is not much affected by the lemmatization process. However, note that its values tend to decrease as a consequence of the decrease in the total number of nodes.

  • Diameter: In most cases, the diameter increases by a small margin when the lemmatization process is performed. However, this pattern seems to vary from author to author. Note, e.g., that the average diameter decreases when the full lemmatization is applied to books authored by Doyle. Conversely, the lemmatization process seems to cause the opposite effect on networks modelling books written by Allan Poe.

  • Density: The density of links increases in most cases, as the lemmatization process removes nodes while the number of edges is practically unaffected. An exception occurs for Darwin. Remarkably, the average densities of the none- and full-datasets are similar.

  • Power-law exponent: Almost all the textual networks present a power-law exponent between 2 and 3, which is a characteristic that has been demonstrated for many real-world networks Newman:2010:NI:1809753 ; dorogovtsev2002evolution and, particularly in text networks, is a consequence of Zipf’s Law. Concerning the effect of the lemmatization process on this feature, no clear pattern can be identified, as opposite effects have been found, e.g., for Stoker and Poe.

IV Conclusion

In this paper, we have addressed the authorship attribution problem, which is a task of practical relevance in many contexts of information science research. We have specifically studied the effect of the textual organization on the discriminability of documents written by distinct authors. To capture the structural properties of texts, we have used the well-known network framework, given its potential revealed in related applications. Unlike the traditional approach based only on topological properties of networks, we have proposed here a methodology to capture further information concerning authors’ particular styles. To do so, we have represented networks modelling texts as network automata with a dynamics based on Life-Like rules. Upon selecting a set of discriminative rules that govern the automata dynamics, we have found that the variations in the binary states of nodes are more discriminative than a simple traditional topological characterization. More specifically, we have obtained an improvement of almost 15 percentage points in the classification of distinct authors. Interestingly, the best results were obtained with a partial lemmatization process, suggesting that this procedure is more adequate than lemmatizing all words when text networks are used as the underlying model for this task.

The methodology proposed here paves the way for improving the characterization of related information systems modelled in terms of networks. This is evident if we recall that network automata approaches are especially suitable to describe networks with scale-free distributions lifelikeNA and, as a consequence, documents following Zipf’s Law. Future work could investigate the effectiveness of our approach, e.g., in the analysis of the complexity of texts or in applications related to extractive summarization. Given the complementarity of the analysis provided by the network automata framework, we argue that a combination relying on traditional superficial and networked features could lead to optimized results in a variety of natural language processing applications.

Acknowledgements

J.M. is grateful for the support of the Coordination for the Improvement of Higher Education Personnel (CAPES). E.A.C.J. and D.R.A. are grateful for the support from Google (Google Research Awards in Latin America grant). D.R.A. is also grateful for the financial support from São Paulo Research Foundation (FAPESP grant #2014/20830-0). G.H.B.M. is grateful for the support from CAPES and FAPESP with grant #2015/05899-7. O.M.B. gratefully acknowledges the financial support of CNPq (National Council for Scientific and Technological Development, Brazil) (Grant #307797/2014-7 and Grant #484312/2013-8) and FAPESP (Grant #11/01523-1 and Grant #2015/05899-7).

References

  • (1) Franco-Salvador, M., Rosso, P. & Montes-y-Gómez, M. A systematic study of knowledge graph analysis for cross-language plagiarism detection. Information Processing & Management 52, 550–570 (2016).
  • (2) Labbé, C. & Labbé, D. Duplicate and fake publications in the scientific literature: how many scigen papers in computer science? Scientometrics 94, 379–396 (2013).
  • (3) Vacca, J. R. Computer Forensics: Computer Crime Scene Investigation (Networking Series) (Charles River Media, Inc., Rockland, MA, USA, 2005).
  • (4) Stamatatos, E. A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology 60, 538–556 (2009).
  • (5) Amancio, D. R. Authorship recognition via fluctuation analysis of network topology and word intermittency. Journal of Statistical Mechanics: Theory and Experiment 2015, P03005 (2015).
  • (6) Brennan, M., Afroz, S. & Greenstadt, R. Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity. ACM Trans. Inf. Syst. Secur. 15, 12:1–12:22 (2012).
  • (7) Halteren, H. V. Author verification by linguistic profiling: an exploration of the parameter space. ACM Trans. Speech Lang. Process. 4, 1–17 (2007).
  • (8) Martincic-Ipsic, S., Margan, D. & Mestrovic, A. Multilayer network of language: a unified framework for structural analysis of linguistic subsystems. Physica A: Statistical Mechanics and its Applications 457, 117–128 (2016).
  • (9) Dorogovtsev, S. N. & Mendes, J. F. F. Language as an evolving word web. Proceedings of the Royal Society of London B: Biological Sciences 268, 2603–2606 (2001).
  • (10) Amancio, D. R., Aluisio, S. M., Oliveira Jr., O. N. & Costa, L. F. Complex networks analysis of language complexity. EPL (Europhysics Letters) 100, 58002 (2012).
  • (11) Amancio, D. R., Altmann, E. G., Rybski, D., Oliveira Jr., O. N. & Costa, L. F. Probing the statistical properties of unknown texts: application to the voynich manuscript. PLoS ONE 8, e67310 (2013).
  • (12) Liu, H. & Xu, C. Can syntactic networks indicate morphological complexity of a language? EPL (Europhysics Letters) 93, 28005 (2011).
  • (13) Liu, H. & Hu, F. What role does syntax play in a language network? EPL (Europhysics Letters) 83, 18002 (2008).
  • (14) Mehri, A., Darooneh, A. H. & Shariati, A. The complex networks approach for authorship attribution of books. Physica A: Statistical Mechanics and its Applications 391, 2429 – 2437 (2012).
  • (15) Amancio, D. R., Altmann, E. G., Oliveira Jr., O. N. & Costa, L. F. Comparing intermittency and network measurements of words and their dependence on authorship. New Journal of Physics 13, 123024 (2011).
  • (16) Wolfram, S. Universality and complexity in cellular automata. Physica D: Nonlinear Phenomena 10, 1–35 (1984).
  • (17) Watts, D. J. Small worlds: the dynamics of networks between order and randomness (Princeton university press, 1999).
  • (18) Tomassini, M., Giacobini, M. & Darabos, C. Evolution and dynamics of small-world cellular automata. Complex Systems 15, 261–284 (2005).
  • (19) Marr, C. & Hütt, M.-T. Cellular automata on graphs: Topological properties of er graphs evolved towards low-entropy dynamics. Entropy 14, 993–1010 (2012).
  • (20) Miranda, G., Machicao, J. & Bruno, O. M. Exploring spatio-temporal patterns as network descriptors based on cellular automata. Scientific Reports (under review) (2016).
  • (21) Gonçalves, W. N., Martinez, A. S. & Bruno, O. M. Complex network classification using partially self-avoiding deterministic walks. Chaos: An Interdisciplinary Journal of Nonlinear Science 22, 033139 (2012).
  • (22) Gardner, M. Mathematical games: The fantastic combinations of John Conway’s new solitaire game “life”. Scientific American 223, 120–123 (1970).
  • (23) Soto, J. M. G. & Wuensche, A. The x-rule: Universal computation in a non-isotropic life-like cellular automaton. J. Cellular Automata 10, 261–294 (2015).
  • (24) Machicao, J., Marco, A. G. & Bruno, O. M. Chaotic encryption method based on life-like cellular automata. Expert Systems with Applications 39, 12626–12635 (2012).
  • (25) Broderick, G., Rúaini, M., Chan, E. & Ellison, M. J. A life-like virtual cell membrane using discrete automata. In Silico Biology 5, 163–178 (2004).
  • (26) Csuhaj-Varjú, E., Kelemen, J., Kelemenová, A. & Paun, G. Eco-grammar systems: A grammatical framework for studying life-like interaction. Artificial Life 3, 1–28 (1997).
  • (27) Mihalcea, R. & Radev, D. Graph-based natural language processing and information retrieval (Cambridge University Press, 2011).
  • (28) Solé, R. V., Corominas-Murtra, B., Valverde, S. & Steels, L. Language networks: Their structure, function, and evolution. Complexity 15, 20–26 (2010).
  • (29) Amancio, D. R. et al. Using metrics from complex networks to evaluate machine translation. Physica A: Statistical Mechanics and its Applications 390, 131–142 (2011).
  • (30) Collins, M. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10, 1–8 (Association for Computational Linguistics, 2002).
  • (31) Toman, M., Tesar, R. & Jezek, K. Influence of word normalization on text classification. Proceedings of InSciT 4, 354–358 (2006).
  • (32) Newman, M. E. J. Networks: An Introduction (Oxford University Press, Inc., New York, NY, USA, 2010).
  • (33) Clauset, A., Shalizi, C. R. & Newman, M. E. J. Power-law distributions in empirical data. SIAM Rev. 51, 661–703 (2009).
  • (34) Li, T. et al. An epidemic spreading model on adaptive scale-free networks with feedback mechanism. Physica A: Statistical Mechanics and its Applications 450, 649–656 (2016).
  • (35) Williams, O. & Del Genio, C. I. Degree correlations in directed scale-free networks. PLoS ONE 9, 1–6 (2014).
  • (36) Morita, S. Six susceptible-infected-susceptible models on scale-free networks. Scientific Reports 6, 22506 (2016).
  • (37) Carron, P. M. & Kenna, R. Universal properties of mythological networks. EPL (Europhysics Letters) 99, 28002 (2012).
  • (38) Costa, L. F., Sporns, O., Antiqueira, L., Nunes, M. G. V. & Oliveira Jr., O. N. Correlations between structure and random walk dynamics in directed complex networks. Applied Physics Letters 91 (2007).
  • (39) Newman, M. E. J. Assortative mixing in networks. Phys. Rev. Lett. 89, 208701 (2002).
  • (40) Shannon, C. A mathematical theory of communication. The Bell System Technical Journal 27, 379–423 (1948).
  • (41) Lempel, A. & Ziv, J. On the complexity of finite sequences. IEEE Trans. Inf. Theor. 22, 75–81 (1976).
  • (42) Lesot, M. J., Rifqi, M. & Benhadda, H. Similarity measures for binary and numerical data: a survey. Int. J. Knowl. Eng. Soft Data Paradigm. 1, 63–84 (2009).
  • (43) Bishop, C. M. Pattern Recognition and Machine Learning (Information Science and Statistics) (Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006).
  • (44) Amancio, D. R. et al. A systematic comparison of supervised classifiers. PLoS ONE 9, e94137 (2014).
  • (45) Ebrahimpour, M. et al. Automated authorship attribution using advanced signal classification techniques. PLoS ONE 8, e54998 (2013).
  • (46) Amancio, D. R., Silva, F. N. & Costa, L. F. Concentric network symmetry grasps authors’ styles in word adjacency networks. EPL (Europhysics Letters) 110, 68001 (2015).
  • (47) Navigli, R. Word sense disambiguation: A survey. ACM Comput. Surv. 41, 10:1–10:69 (2009).
  • (48) Dorogovtsev, S. N. & Mendes, J. F. F. Evolution of networks. Advances in physics 51, 1079–1187 (2002).