Digital text forensics aims at examining the originality and credibility of information in electronic documents and, in this regard, at extracting and analyzing information about the authors of the respective texts (Potthast et al., 2019). Among the most important tasks of this field are authorship attribution (AA) and authorship verification (AV), where the former deals with the problem of identifying the most likely author of a document with unknown authorship, given a set of texts of candidate authors. AV, on the other hand, focuses on the question of whether the unknown document was in fact written by a known author, where only a set of reference texts of this author is given. Both disciplines are strongly related to each other, as any AA problem can be broken down into a series of AV problems (Potha and Stamatatos, 2017). Breaking down an AA problem into multiple AV problems is especially important in scenarios where the presence of the true author of the unknown document in the candidate set cannot be guaranteed.
In the past two decades, researchers from different fields including linguistics, psychology, computer science and mathematics have proposed numerous techniques and concepts that aim to solve the AV task. Probably due to the interdisciplinary nature of this research field, AV approaches have become increasingly diverse, as can be seen in the respective literature. In 2013, for example, Veenman and Li (Veenman and Li, 2013) presented an AV method based on compression, which has its roots in the field of information theory. In 2015, Bagnall (Bagnall, 2015) introduced a deep learning approach based on language modeling. In 2017, Hernández-Castañeda and Calvo (Hernández-Castañeda and Calvo, 2017) proposed an AV method that applies a semantic space model through Latent Dirichlet Allocation, a generative statistical model used in information retrieval and computational linguistics.
Despite the increasing number of AV approaches, a closer look at the respective studies reveals that only minor attention is paid to their underlying characteristics such as reliability and robustness. These, however, must be taken into account before AV methods can be applied in real forensic settings. The objective of this paper is to fill this gap and to propose important properties and criteria that are not only intended to characterize AV methods, but also allow their assessment in a more systematic manner. By this, we hope to contribute to the further development of this young research field. According to the literature (Stamatatos et al., 2014), Stamatatos et al. were the first researchers who discussed AV in the context of natural language texts in 2000 (Stamatatos et al., 2000); AV can therefore be seen as a young field in contrast to AA, which dates back to the 19th century (Holmes, 1998). Based on the proposed properties, we investigate the applicability of 12 existing AV approaches on three self-compiled corpora, where each corpus involves a specific challenge.
The rest of this paper is structured as follows. Section 2 discusses the related work that served as an inspiration for our analysis. Section 3 comprises the proposed criteria and properties to characterize AV methods. Section 4 describes the methodology, consisting of the used corpora, examined AV methods, selected performance measures and experiments. Finally, Section 5 concludes the work and outlines future work.
2. Related Work
Over the years, researchers in the field of authorship analysis have identified a number of challenges and limitations regarding existing studies and approaches. Azarbonyad et al. (Azarbonyad et al., 2015), for example, focused on the questions of whether the writing styles of authors of short texts change over time and how this affects AA. To answer these questions, the authors proposed an AA approach based on time-aware language models that incorporate the temporal changes of the writing style of authors. In one of our experiments, we focus on a similar question, namely, whether it is possible to recognize the writing style of authors despite large time spans between their documents. However, there are several differences between our experiment and the study of Azarbonyad et al. First, the authors consider an AA task, where one anonymous document has to be attributed to one of several possible candidate authors, while we focus on an AV task, where the anonymous document is compared against one document of a known author. Second, the authors focus on texts with informal language (emails and tweets) in their study, while in our experiment we consider documents written in a formal language (scientific works). Third, Azarbonyad et al. analyzed texts with a time span of four years, while in our experiment the average time span is 15.6 years. Fourth, in contrast to the approach of the authors, none of the 12 examined AV approaches in our experiment considers a special handling of temporal stylistic changes.
In recent years, the new research field of author obfuscation (AO) has evolved, which concerns itself with the task of fooling AA or AV methods in such a way that the true author can no longer be correctly recognized. To achieve this, AO approaches, which according to Gröndahl and Asokan (Gröndahl and Asokan, 2019) can be divided into manual, computer-assisted and automatic types, perform a variety of modifications on the texts. These include simple synonym replacements, rule-based substitutions or word order permutations. In 2016, Potthast et al. (Potthast et al., 2016) presented the first large-scale evaluation of three AO approaches that aim to attack 44 AV methods, which were submitted to the PAN-AV competitions during 2013-2015 (Juola and Stamatatos, 2013; Stamatatos et al., 2014; Stamatatos et al., 2015). One of their findings was that even basic AO approaches have a significant impact on many AV methods. More precisely, the best-performing AO approach was able to flip on average % of an authorship verifier’s decisions towards choosing N (“different author”), while in fact Y (“same author”) was correct (Potthast et al., 2016). In contrast to Potthast et al., we do not focus on AO to measure the robustness of AV methods. Instead, we investigate in one experiment how trained AV models behave when the questioned documents become progressively shorter. To the best of our knowledge, this question has not been addressed in previous authorship verification studies.
3. Characteristics of Authorship Verification
Before we can assess the applicability of AV methods, it is important to understand their fundamental characteristics. Due to the increasing number of proposed AV approaches in the last two decades, the need arose to develop a systematization including the conception, implementation and evaluation of authorship verification methods. In regard to this, only a few attempts have been made so far. In 2004, for example, Koppel and Schler (Koppel and Schler, 2004) described for the first time the connection between AV and unary classification, also known as one-class classification. In 2008, Stein et al. (Stein et al., 2008) compiled an overview of important algorithmic building blocks for AV where, among other things, they also formulated three AV problems as decision problems. In 2009, Stamatatos (Stamatatos, 2009) coined the phrases profile-based and instance-based approaches, which initially were used in the field of AA but later also found their way into AV. In 2013 and 2014, Stamatatos et al. (Juola and Stamatatos, 2013; Potha and Stamatatos, 2014) introduced the terms intrinsic and extrinsic models, which aim to further distinguish between AV methods. However, a closer look at previous attempts to characterize authorship verification approaches reveals a number of misunderstandings, for instance, when it comes to drawing the borders between their underlying classification models. In the following subsections, we clarify these misunderstandings, where we redefine previous definitions and propose new properties that enable a better comparison between AV methods.
3.1. Reliability (Determinism)
Reliability is a fundamental property any AV method must fulfill in order to be applicable in real-world forensic settings. However, since the screened literature offers neither a consistent concept nor a uniform definition of the term “reliability” in the context of authorship verification, we decided to reuse a definition from applied statistics and adapt it carefully to AV.
In his standard reference book, Bollen (Bollen, 1989) gives a clear description of this term: “Reliability is the consistency of measurement” and provides a simple example to illustrate its meaning. (According to Google Scholar, Bollen’s book has been cited more than 30,000 times, making it a standard reference across different research fields.) At a first point in time, we ask a large number of persons the same question Q and record their responses. Afterwards, we remove their memory of the dialogue. At a second point in time, we ask them the same question Q again and record their responses again. “The reliability is the consistency of the responses across individuals for the two time periods. To the extent that all individuals are consistent, the measure is reliable” (Bollen, 1989). This example deals with the consistency of the measured objects as a factor for the reliability of measurements. In the case of authorship verification, the analyzed objects are static data, and hence these cannot be a source of inconsistency. However, the measurement system itself can behave inconsistently and hence unreliably. This aspect can be described as intra-rater reliability.
Reliability in authorship verification is satisfied if an AV method always generates the same prediction for the same input or, in other words, if the method behaves deterministically. Several AV approaches, including (Halvani and Steinebach, 2014; Halvani et al., 2017, 2016, 2018; Hürlimann et al., 2015; Jr and Ryan, 2012; Jankowska et al., 2013, 2014; Potha and Stamatatos, 2014), fall into this category. In contrast, if an AV method behaves non-deterministically such that two different predictions for the same input are possible, the method can be rated as unreliable. Many AV approaches, including (Hernández-Castañeda and Calvo, 2017; Koppel and Schler, 2004; Neal et al., 2018; Potha and Stamatatos, 2017; Seidman, 2013; Bagnall, 2015; Bevendorff et al., 2019; Hernández et al., 2015; Kocher and Savoy, 2015), belong to this category, since they involve randomness (e. g., weight initialization, feature subsampling, chunk generation or impostor selection), which might distort the evaluation, as every run on a test corpus very likely leads to different results. Under lab conditions, the variance in the results of non-deterministic AV methods can (and should) be counteracted by averaging multiple runs. However, it remains highly questionable whether such methods are generally applicable in realistic forensic cases, where the prediction regarding a verification case might sometimes result in Y and sometimes in N.
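The notion of reliability as determinism can be probed empirically by querying a method several times with the same input. The following is a minimal sketch under stated assumptions: the verifier interface (a callable taking the known documents and the unknown document and returning "Y" or "N"), the function names, and both toy verifiers are hypothetical and do not correspond to any of the cited AV methods. Agreement across repeated runs is only a necessary indication of determinism, not a proof.

```python
import random
from collections import Counter
from typing import Callable, List

# Hypothetical interface: (known_docs, unknown_doc) -> "Y" or "N".
Verifier = Callable[[List[str], str], str]


def prediction_counts(verifier: Verifier, known_docs: List[str],
                      unknown_doc: str, runs: int = 20) -> Counter:
    """Query the verifier repeatedly with the same input and count each answer."""
    return Counter(verifier(known_docs, unknown_doc) for _ in range(runs))


def is_consistent(verifier: Verifier, known_docs: List[str],
                  unknown_doc: str, runs: int = 20) -> bool:
    """True if all repeated runs on the same input agree."""
    return len(prediction_counts(verifier, known_docs, unknown_doc, runs)) == 1


def length_verifier(known_docs: List[str], unknown_doc: str) -> str:
    """Toy deterministic verifier: compares document lengths only."""
    mean_len = sum(len(d) for d in known_docs) / len(known_docs)
    return "Y" if abs(len(unknown_doc) - mean_len) <= 5 else "N"


def make_random_verifier(seed: int) -> Verifier:
    """Toy non-deterministic verifier: its answer depends on internal random state."""
    rng = random.Random(seed)

    def verify(known_docs: List[str], unknown_doc: str) -> str:
        return "Y" if rng.random() < 0.5 else "N"

    return verify


if __name__ == "__main__":
    known, unknown = ["abc", "abcd"], "abcde"
    print(is_consistent(length_verifier, known, unknown))
    print(prediction_counts(make_random_verifier(0), known, unknown))
```

In a lab setting, the counts returned by `prediction_counts` could also serve as the basis for averaging over multiple runs.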
3.2. Optimizability
Another important property of an AV method is optimizability. We define an AV method as optimizable if it is designed in such a way that it offers adjustable hyperparameters that can be tuned against a training/validation corpus, given an optimization method such as grid or random search. Hyperparameters might be, for instance, the selected distance/similarity function, the number of layers and neurons in a neural network or the choice of a kernel method. The majority of existing AV approaches in the literature (for example, (Koppel and Schler, 2004; Jr and Ryan, 2012; Jankowska et al., 2013; Hürlimann et al., 2015; Castro Castro et al., 2015; Hernández-Castañeda and Calvo, 2017; Koppel and Winter, 2014; Potha and Stamatatos, 2014)) belong to this category. On the other hand, if a published AV approach involves hyperparameters that have been entirely fixed such that there is no further possibility to improve its performance from outside (without deviating from the definitions in the publication of the method), the method is considered to be non-optimizable. Non-optimizable AV methods are preferable in forensic settings as, here, the existence of a training/validation corpus is not always self-evident. Among the proposed AV approaches in the respective literature, we identified only a small fraction (Halvani et al., 2018; Veenman and Li, 2013; Kocher and Savoy, 2015) that fall into this category.
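To illustrate what tuning an optimizable method against a training/validation corpus can look like, here is a minimal grid search sketch. Everything in it is an assumption made for illustration: the problem format (tuples of known document, unknown document and label), the `build_verifier` factory, and the toy length-based verifier with a single hyperparameter `threshold`. Plain accuracy is used as the selection criterion only for brevity.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Problem = Tuple[str, str, str]  # (known_doc, unknown_doc, label "Y" or "N")


def grid_search(build_verifier: Callable[..., Callable[[str, str], str]],
                grid: Dict[str, Sequence],
                problems: List[Problem]) -> Tuple[dict, float]:
    """Try every hyperparameter combination and keep the most accurate one."""
    names = sorted(grid)
    best_params, best_acc = None, -1.0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        verifier = build_verifier(**params)
        correct = sum(verifier(known, unknown) == label
                      for known, unknown, label in problems)
        accuracy = correct / len(problems)
        if accuracy > best_acc:  # ties keep the first combination found
            best_params, best_acc = params, accuracy
    return best_params, best_acc


def build_length_verifier(threshold: int) -> Callable[[str, str], str]:
    """Toy verifier factory: accepts if the length difference is small enough."""
    def verify(known: str, unknown: str) -> str:
        return "Y" if abs(len(known) - len(unknown)) <= threshold else "N"
    return verify


if __name__ == "__main__":
    validation = [("aaaa", "aaaaa", "Y"), ("aaaa", "aaaaaaaaaa", "N")]
    print(grid_search(build_length_verifier, {"threshold": [0, 2, 10]}, validation))
```

A non-optimizable method, in contrast, would expose no such grid at all.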
3.3. Model Category
From a machine learning point of view, authorship verification represents a unary classification problem (Hürlimann et al., 2015; Koppel and Schler, 2004; Potha and Stamatatos, 2014, 2018; Stein et al., 2008). Yet, in the literature, it can be observed that sometimes AV is treated as a unary (Jankowska et al., 2014; Jr and Ryan, 2012; Neal et al., 2018; Potha and Stamatatos, 2014) and sometimes as a binary classification task (Kocher and Savoy, 2015; Koppel and Winter, 2014; Hürlimann et al., 2015; Veenman and Li, 2013). We define the way an AV approach is modeled by the phrase model category. However, before explaining this in more detail, we wish to recall what unary/one-class classification exactly represents. For this, we list the following verbatim quotes, which characterize one-class classification, as can be seen, almost identically (emphasis by us):
“One-class classification (OCC) […] consists in making a description of a target class of objects and in detecting whether a new object resembles this class or not. […] The OCC model is developed using target class samples only.” (Rodionova et al., 2016)
Note that in the context of authorship verification, target class refers to the known author, such that for a document of an unknown author the task is to verify whether the unknown author and the known author are the same. One of the most important requirements of any existing AV method is a decision criterion, which aims to accept or reject a questioned authorship. A decision criterion can be expressed through a simple scalar threshold or a more complex model such as a hyperplane in a high-dimensional feature space. As a consequence of the above statements, the determination of the threshold or model has to be performed solely on the basis of the target class (the known author); otherwise the AV method cannot be considered to be unary. However, our literature research regarding existing AV approaches revealed that there are uncertainties about how to precisely draw the borders between unary and binary AV methods (for instance, (Boukhaled and Ganascia, 2014; Potha and Stamatatos, 2014, 2018)). Nonetheless, a few attempts have been made to distinguish both categories from another perspective. Potha and Stamatatos (Potha and Stamatatos, 2018), for example, categorize AV methods as either intrinsic or extrinsic (emphasis by us):
(1) “Intrinsic verification models view it [i. e., the verification task] as a one-class classification task and are based exclusively on analysing the similarity between [the known documents] and [the unknown document]. […] Such methods […] do not require any external resources.” (Potha and Stamatatos, 2018)
(2) “On the other hand, extrinsic verification models attempt to transform the verification task to a pair classification task by considering external documents to be used as samples of the negative class.” (Potha and Stamatatos, 2018)
While we agree with statement (2), the former statement (1) is unsatisfactory, as intrinsic verification models are not necessarily unary. For example, the AV approach GLAD proposed by Hürlimann et al. (Hürlimann et al., 2015) directly contradicts statement (1). Here, the authors “decided to cast the problem as a binary classification task where class values are Y  and N . […] We do not introduce any negative examples by means of external documents, thus adhering to an intrinsic approach.” (Hürlimann et al., 2015).
A misconception similar to statement (1) can be observed in the paper of Jankowska et al. (Jankowska et al., 2013), who introduced the so-called CNG approach, claimed to be a one-class classification method. CNG is intrinsic in that it considers only the documents of the verification problem at hand when deciding that problem. However, the decision criterion, which is a threshold, is determined on a set of verification problems labeled either as Y or N. This incorporates “external resources” for defining the decision criterion, and it constitutes an implementation of binary classification between Y and N, in analogy to the statement of Hürlimann et al. (Hürlimann et al., 2015) mentioned above. Thus, CNG is in conflict with the unary definition mentioned above. In a subsequent paper (Jankowska et al., 2014), however, Jankowska et al. refined their approach and introduced a modification, where the threshold was determined solely on the basis of the target class (the known author's documents). Thus, the modified approach can be considered a true unary AV method, according to the quoted definitions for unary classification.
In 2004, Koppel and Schler (Koppel and Schler, 2004) presented the Unmasking approach which, according to the authors, represents a unary AV method. However, if we take a closer look at the learning process of Unmasking, we can see (cf. the intuitive illustration provided in (Bevendorff et al., 2019, Figure 1)) that it is based on a binary SVM classifier that consumes feature vectors (derived from “degradation curves”) labeled as Y (“same author”) or N (“different author”). Unmasking, therefore, cannot be considered to be unary, as the decision is not solely based on the documents within the verification problem at hand, in analogy to the CNG approach of Jankowska et al. (Jankowska et al., 2013) discussed above.
It should be highlighted again that the aforementioned three approaches are binary-intrinsic, since their decision criteria (a threshold or a model) were determined on a set of problems labeled in a binary manner (Y and N), while after training the verification is performed in an intrinsic manner, meaning that the known and unknown documents are compared against the learned threshold or model but not against documents within other verification problems (cf. Figure 1). A crucial aspect, which might have led to misperceptions regarding the model category of these approaches in the past, is the fact that two different class domains are involved. On the one hand, there is the class domain of authors, where the task is to distinguish the known author from all other authors. On the other hand, there is the elevated or lifted domain of verification problem classes, which are Y and N. The training phase of binary-intrinsic approaches is used for learning to distinguish these two classes, and the verification task can be understood as putting the verification problem as a whole into class Y or class N, whereby the class domain of authors fades from the spotlight (cf. Figure 1).
Besides unary and binary-intrinsic methods, there is a third category of approaches, namely binary-extrinsic AV approaches (for example, (Bagnall, 2015; Kocher and Savoy, 2015; Hernández et al., 2015; Khonji and Iraqi, 2014; Koppel and Winter, 2014; Potha and Stamatatos, 2017; Veenman and Li, 2013)). These methods use external documents during a potentially existing training phase and, more importantly, during testing. In these approaches, the decision between the known author and all other authors is put into focus, where the external documents serve to construct the counter class (authors other than the known one).
Based on the above observations, we conclude that the key requirement for judging the model category of an AV method depends solely on how its decision criterion (a threshold or a model) is determined (cf. Figure 1):
An AV method is unary if and only if its decision criterion (threshold or model) is determined solely on the basis of the target class (the known author) during testing. As a consequence, an AV method cannot be considered to be unary if documents not belonging to the known author are used to define the threshold or model.
An AV method is binary-intrinsic if its decision criterion (threshold or model) is determined on a training corpus comprising verification problems labeled either as Y or N (in other words, documents of several authors). However, once the training is completed, a binary-intrinsic method no longer has access to external documents, such that the decision regarding the authorship of the unknown document is made on the basis of the reference data of the known author as well as the learned threshold or model.
An AV method is binary-extrinsic if its decision criterion (threshold or model) is determined during testing on the basis of external documents that represent the outlier class (authors other than the known one).
Note that optimizable AV methods such as (Halvani and Steinebach, 2014; Jankowska et al., 2014) are not precluded from being unary. Provided that the threshold or model is not subject to the optimization procedure, the model category remains unary. The reason for this is that hyperparameters might influence the resulting performance of unary AV methods, while the decision criterion itself remains unchanged.
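The three definitions above can be condensed into a small decision rule. The sketch below is merely an illustration of that rule; the enum, the function name and the two boolean flags are hypothetical names introduced here, and the rule assumes that the two flags fully describe how the decision criterion of a method is obtained.

```python
from enum import Enum


class ModelCategory(Enum):
    UNARY = "unary"
    BINARY_INTRINSIC = "binary-intrinsic"
    BINARY_EXTRINSIC = "binary-extrinsic"


def model_category(criterion_learned_from_labeled_problems: bool,
                   external_docs_used_during_testing: bool) -> ModelCategory:
    """Classify an AV method by how its decision criterion is determined."""
    # External documents at test time model the outlier class: binary-extrinsic,
    # regardless of whether a training phase exists.
    if external_docs_used_during_testing:
        return ModelCategory.BINARY_EXTRINSIC
    # Criterion fitted on Y/N-labeled verification problems, but no external
    # documents once training is completed: binary-intrinsic.
    if criterion_learned_from_labeled_problems:
        return ModelCategory.BINARY_INTRINSIC
    # Criterion determined solely on the target class (the known author): unary.
    return ModelCategory.UNARY


if __name__ == "__main__":
    # A threshold fitted on labeled problems, with no external documents at
    # test time, is binary-intrinsic under the definitions above.
    print(model_category(True, False).value)
```

Tuning other hyperparameters does not enter this rule, which mirrors the remark that optimizable methods can still be unary.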
3.4. Implications
Each model category has its own implications regarding prerequisites, evaluability, and applicability.
3.4.1. Unary AV Methods:
One advantage of unary AV methods is that they do not require a specific document collection strategy to construct the counter class (authors other than the known one), which reduces their complexity. On the downside, the choice of the underlying machine learning model of a unary AV approach is restricted to one-class classification algorithms or unsupervised learning techniques, given a suitable decision criterion.
However, a far more important implication of unary AV approaches concerns their performance assessment. Since unary classification (not necessarily AV) approaches depend on a fixed decision criterion (a threshold or a model), performance measures such as the area under the ROC curve (AUC) are meaningless. Recall that ROC analysis is used for evaluating classifiers where the decision threshold is not finally fixed. ROC analysis requires that the classifier generates scores which are comparable across classification problem instances. The ROC curve and the area under this curve are then computed by considering all possible discrimination thresholds for these scores. While unary AV approaches might produce such scores, introducing a variable threshold would change the semantics of these approaches. Since unary AV approaches have a fixed decision criterion, they provide only a single point in the ROC space. To assess the performance of a unary AV method, it is, therefore, mandatory to consider the confusion matrix that leads to this point in the ROC space.
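The single point that a fixed decision criterion yields in ROC space follows directly from the confusion matrix. The sketch below computes it, assuming that Y ("same author") is treated as the positive class; the function name and the example counts are purely illustrative and are not results from the experiments.

```python
from typing import Tuple


def roc_point(tp: int, fn: int, fp: int, tn: int) -> Tuple[float, float]:
    """Return (false positive rate, true positive rate) for one confusion matrix.

    Assumption: Y ("same author") is the positive class, so
    tp = Y predicted as Y, fn = Y predicted as N,
    fp = N predicted as Y, tn = N predicted as N.
    """
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fpr, tpr


if __name__ == "__main__":
    # Illustrative counts only.
    fpr, tpr = roc_point(tp=18, fn=6, fp=3, tn=21)
    print(f"FPR={fpr:.3f}, TPR={tpr:.3f}")
```

Reporting the four counts alongside this point keeps the assessment transparent, since the same point can arise from corpora of very different sizes.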
Another implication is that unary AV methods are necessarily instance-based and, thus, require a set of multiple documents of the known author. If only one reference document is available (that is, the reference set contains a single document), this document must be artificially turned into multiple samples from the author. In general, unary classification methods need multiple samples from the target class, since it is not possible to determine a relative closeness to that class based on only one sample.
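One conceivable way to turn a single reference document into multiple samples is to split it into chunks. The text does not prescribe how the samples are to be generated, so the following is only a sketch under that assumption: whitespace tokenization, contiguous chunks of nearly equal size, and the function name `split_into_chunks` are all choices made here for illustration.

```python
from typing import List


def split_into_chunks(text: str, n_chunks: int) -> List[str]:
    """Split a document into n contiguous chunks of nearly equal token count."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    if n_chunks < 1 or n_chunks > len(tokens):
        raise ValueError("n_chunks must be between 1 and the number of tokens")
    base, extra = divmod(len(tokens), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks receive one additional token.
        end = start + base + (1 if i < extra else 0)
        chunks.append(" ".join(tokens[start:end]))
        start = end
    return chunks


if __name__ == "__main__":
    print(split_into_chunks("a b c d e", 2))
```

Note that joining tokens with single spaces discards the original whitespace, which may matter for methods that treat layout as a style marker.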
3.4.2. Binary AV Methods:
On the plus side, binary-intrinsic or binary-extrinsic AV methods benefit from the fact that we can choose among a variety of binary and n-ary classification models. However, if we consider designing a binary-intrinsic AV method, it should not be overlooked that the involved classifier will learn nothing about individual authors, but only similarities or differences that hold in general for Y and N verification problems (Koppel and Winter, 2014).
If, on the other hand, the choice falls on a binary-extrinsic method, a strategy has to be considered for collecting representative documents for the outlier class (authors other than the known one). Several existing methods such as (Koppel and Winter, 2014; Potha and Stamatatos, 2017; Veenman and Li, 2013) rely on search engines for retrieving appropriate documents, but these search engines might refuse their service if a specified quota is exhausted. Additionally, the retrieved documents render these methods inherently non-deterministic. Moreover, such methods cause relatively high runtimes (Juola and Stamatatos, 2013; Stamatatos et al., 2014). Using search engines also requires an active Internet connection, which might not be available or allowed in specific scenarios. But even if we can access the Internet to retrieve documents, there is no guarantee that the true author is not among them. With these points in mind, the applicability of binary-extrinsic methods in real-world cases, i. e., in real forensic settings, remains highly questionable.
4. Methodology
In the following, we introduce our three self-compiled corpora, where each corpus represents a different challenge. Next, we describe which authorship verification approaches we considered for the experiments and classify each AV method according to the properties introduced in Section 3. Afterwards, we explain which performance measures were selected with respect to the conclusion made in Section 3.4.1. Finally, we describe our experiments, present the results and highlight a number of observations.
4.1. Corpora
A serious challenge in the field of AV is the lack of publicly available (and suitable) corpora, which are required to train and evaluate AV methods. Among the few publicly available corpora are those that were released by the organizers of the well-known PAN-AV competitions (https://pan.webis.de) (Juola and Stamatatos, 2013; Stamatatos et al., 2014; Stamatatos et al., 2015). In regard to our experiments, however, we cannot use these corpora, due to the absence of relevant meta-data such as the precise time spans in which the documents have been written as well as the topic category of the texts. Therefore, we decided to compile our own corpora based on English documents, which we crawled from different publicly accessible sources. In what follows, we describe our three constructed corpora, which are listed together with their statistics in Table 1. Note that all corpora are balanced such that verification cases with matching (Y) and non-matching (N) authorships are evenly distributed.
4.1.1. DBLP Corpus
As a first corpus, we compiled the DBLP corpus, which represents a collection of 80 excerpts from scientific works including papers, dissertations, book chapters and technical reports, which we have chosen from the well-known Digital Bibliography & Library Project (DBLP) platform (https://dblp.uni-trier.de). Overall, the documents (each of which is single-authored) were written by 40 researchers, where for each author, there are exactly two documents. Given the 80 documents, we constructed for each author two verification problems: a Y-case and an N-case. For the Y-case, we set the author’s first document as the known document and the second document as the unknown document. For the N-case, we reused the known document from the Y-case and selected a text from another (random) author as the unknown document. The result of this procedure is a set of 80 verification problems, which we split into a training and test set based on a 40/60% ratio. Where possible, we tried to restrict the content of each text to the abstract and conclusion of the original work. However, since in many cases these sections were too short, we also considered other parts of the original works such as introduction or discussion sections. To ensure that the extracted text portions are appropriate for the AV task, each original work was preprocessed manually. More precisely, we removed tables, formulas, citations, quotes and sentences that include non-language content such as mathematical constructs or specific names of researchers, systems or algorithms. The average time span between both documents of an author is 15.6 years. The minimum and maximum time span are 6 and 40 years, respectively. Besides the temporal aspect of this corpus, another challenge is the formal (scientific) language, where the usage of stylistic devices (for example, repetitions, metaphors, rhetorical questions, oxymorons, etc.) is more restricted, in contrast to other genres such as novels or poems.
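The problem construction procedure described above can be sketched as follows. This is not the original implementation: the data layout (a mapping from author to two documents), the function names, the seeded random choices and the way the 40/60 split is drawn (a simple shuffle) are assumptions made for illustration, since the text does not specify these details.

```python
import random
from typing import Dict, List, Tuple

Problem = Tuple[str, str, str]  # (known_doc, unknown_doc, label "Y" or "N")


def build_problems(docs_by_author: Dict[str, List[str]], seed: int = 0) -> List[Problem]:
    """Build one Y-case and one N-case per author (two documents per author)."""
    rng = random.Random(seed)
    authors = sorted(docs_by_author)
    problems = []
    for author in authors:
        first, second = docs_by_author[author]
        # Y-case: first document is the known one, second the unknown one.
        problems.append((first, second, "Y"))
        # N-case: reuse the known document; the unknown one comes from a
        # randomly chosen other author.
        other = rng.choice([a for a in authors if a != author])
        problems.append((first, rng.choice(docs_by_author[other]), "N"))
    return problems


def split_problems(problems: List[Problem], train_fraction: float = 0.4,
                   seed: int = 0) -> Tuple[List[Problem], List[Problem]]:
    """Shuffle and split into training and test set (assumed: a random split)."""
    rng = random.Random(seed)
    shuffled = problems[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    corpus = {f"author{i}": [f"doc{i}a", f"doc{i}b"] for i in range(40)}
    train, test = split_problems(build_problems(corpus))
    print(len(train), len(test))
```

By construction, the resulting set is balanced, with one Y-case and one N-case per author.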
4.1.2. Perverted Justice Corpus
As a second corpus, we compiled the Perverted Justice corpus, which represents a collection of 1,645 chat conversations of 550 sex offenders crawled from the Perverted-Justice portal (http://www.perverted-justice.com). The chat conversations stem from a variety of sources including emails and instant messengers (e. g., MSN, AOL or Yahoo), where for each conversation, we ensured that only chat lines from the offender were extracted. We applied the same problem construction procedure as for the DBLP corpus, which resulted in 1,100 verification problems that again were split into a training and test set given a 40/60% ratio. In contrast to the DBLP corpus, we only performed slight preprocessing. Essentially, we removed user names, time-stamps, URLs, multiple blanks as well as annotations that were not part of the original conversations from all chat lines. Moreover, we did not normalize words (for example, shorten words such as “nooooo” to “no”) as we believe that these represent important style markers. Furthermore, we did not remove newlines between the chat lines, as the positions of specific words might play an important role regarding the individual’s writing style.
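A light preprocessing step of this kind could look like the following sketch. The regular expressions are assumptions: the text does not state the concrete format of time-stamps, user names or annotations, so the patterns below cover only URLs, a few common time-stamp shapes and runs of blanks, and they omit user-name and annotation removal altogether. Elongated words and newlines are deliberately left untouched, as described above.

```python
import re

# Assumed patterns; the actual chat formats are not specified in the text.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
TIME_RE = re.compile(r"[\[(]?\b\d{1,2}:\d{2}(?::\d{2})?(?:\s?[AaPp][Mm])?\b[\])]?")
BLANKS_RE = re.compile(r"[ \t]{2,}")


def clean_chat_line(line: str) -> str:
    """Remove URLs and time-stamps and collapse multiple blanks in one chat line."""
    line = URL_RE.sub("", line)
    line = TIME_RE.sub("", line)
    line = BLANKS_RE.sub(" ", line)
    return line.strip()


def clean_conversation(text: str) -> str:
    """Clean each chat line separately so that newlines between lines survive."""
    return "\n".join(clean_chat_line(line) for line in text.split("\n"))


if __name__ == "__main__":
    raw = "[12:34] nooooo  way\nsee http://example.com/x now"
    print(clean_conversation(raw))
```

Processing line by line is what keeps the positions of words within each chat line intact.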
4.1.3. Reddit Corpus
As a third corpus, we compiled the Reddit corpus, which is a collection of 200 aggregated postings crawled from the Reddit platform (https://www.reddit.com). Overall, the postings were written by 100 Reddit users and stem from a variety of subreddits. In order to construct the Y-cases, we selected exactly two postings from disjoint subreddits for each user such that the known and the unknown document differ in their topic. Regarding the N-cases, we applied the opposite strategy such that the known and the unknown document belong to the same topic. The rationale behind this is to figure out to what extent AV methods can be fooled in cases where the topic matches but not the authorship, and vice versa. Since for this specific corpus we have to control the topics of the documents, we did not perform the same procedure applied for the DBLP and Perverted Justice corpora to construct the training and test sets. Instead, we used for the resulting 100 verification problems a 40/60% hold-out split, where both training and test set are entirely disjoint.
4.2. Examined Authorship Verification Methods
As a basis for our experiments, we reimplemented 12 existing AV approaches, which have shown their potential in the previous PAN-AV competitions (Juola and Stamatatos, 2013; Stamatatos et al., 2015) as well as in a number of AV studies. The methods are listed in Table 2 together with their classifications regarding the AV characteristics, which we proposed in Section 3.
|AV Method|Model Category|Determinism|Optimizability|
|---|---|---|---|
|MOCC (Halvani and Steinebach, 2014)|unary|deterministic|optimizable|
|OCCAV (Halvani et al., 2018)|unary|deterministic|non-optimizable|
|COAV (Halvani et al., 2017)|binary-intrinsic|deterministic|optimizable|
|AVeer (Halvani et al., 2016)|binary-intrinsic|deterministic|optimizable|
|GLAD (Hürlimann et al., 2015)|binary-intrinsic|deterministic|optimizable|
|DistAV (Jr and Ryan, 2012)|binary-intrinsic|deterministic|optimizable|
|Unmasking (Koppel and Schler, 2004)|binary-intrinsic|non-deterministic|optimizable|
|Caravel (Bagnall, 2015)|binary-extrinsic|non-deterministic|optimizable|
|GenIM (Seidman, 2013)|binary-extrinsic|non-deterministic|optimizable|
|ImpGI (Potha and Stamatatos, 2017)|binary-extrinsic|non-deterministic|optimizable|
|SPATIUM (Kocher and Savoy, 2015)|binary-extrinsic|non-deterministic|non-optimizable|
|NNCD (Veenman and Li, 2013)|binary-extrinsic|deterministic|non-optimizable|
All (optimizable) AV methods were tuned regarding their hyperparameters, according to the original procedure mentioned in the respective paper. However, in the case of the binary-extrinsic methods GenIM, ImpGI and NNCD, we had to use an alternative impostor generation strategy in our reimplementations, due to technical problems. In the respective papers, the authors used search engine queries to generate the impostor documents, which are needed to model the counter class (authors other than the known one). Regarding our reimplementations, we used the documents from the static corpora (similarly to the idea of Kocher and Savoy (Kocher and Savoy, 2015)) to generate the impostors in the following manner: given a corpus of verification problems, for each problem we choose the unknown documents of all other problems in the corpus and append them to the impostor set. Here, it should be highlighted that both GenIM and ImpGI consider the number of impostors as a hyperparameter such that the resulting impostor set is a subset of this full impostor set. In contrast to this, NNCD considers all of these documents as possible impostors. This fact plays an important role in the later experiments, where we compare the AV approaches to each other. Although our strategy is not as flexible as using a search engine, it has the advantage that it can be assumed that the true author of an unknown document is not among the impostors, since in our corpora the user/author names are known beforehand. However, it might be possible that behind multiple user names there is only one person (in other words, we cannot guarantee: one user = one account).
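The corpus-based impostor generation can be sketched as follows. The problem format (tuples of known document, unknown document and label) and the function names are hypothetical, and drawing the subset by random sampling is an assumption: the text only states that the number of impostors is a hyperparameter for GenIM and ImpGI, while NNCD uses the full set.

```python
import random
from typing import List, Tuple

Problem = Tuple[str, str, str]  # (known_doc, unknown_doc, label "Y" or "N")


def impostor_pool(problems: List[Problem], index: int) -> List[str]:
    """Unknown documents of all problems except the one at position `index`."""
    return [unknown for j, (_, unknown, _) in enumerate(problems) if j != index]


def sample_impostors(problems: List[Problem], index: int, k: int,
                     rng: random.Random) -> List[str]:
    """Subset of the pool with at most k impostors (assumed: uniform sampling)."""
    pool = impostor_pool(problems, index)
    return rng.sample(pool, min(k, len(pool)))


if __name__ == "__main__":
    toy = [("k0", "u0", "Y"), ("k1", "u1", "N"), ("k2", "u2", "Y")]
    print(impostor_pool(toy, 1))                        # full pool
    print(sample_impostors(toy, 0, 1, random.Random(0)))  # sized subset
```

Because the pool is drawn from a static corpus rather than from a search engine, repeated calls to `impostor_pool` return the same documents.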
4.3. Performance Measures
According to our extensive literature research, numerous measures (e. g., Accuracy, F, c@1, AUC, AUC@1, or EER) have been used so far to assess the performance of AV methods. In regard to our experiments, we decided to use c@1 and AUC for several reasons. First, Accuracy, F and are not applicable in cases where AV methods leave verification problems unanswered, which concerns some of our examined AV approaches. Second, using AUC alone is meaningless for non-optimizable AV methods, as explained in Section 3.4.1. Third, both have been used in the PAN-AV competitions (Stamatatos et al., 2014; Stamatatos et al., 2015). Note that we also list the confusion matrix outcomes.
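For illustration, both measures can be computed as in the following sketch. It assumes the definition of c@1 that is commonly used in the PAN-AV evaluations, (n_c + n_u · n_c / n) / n, with n problems, n_c correct answers and n_u unanswered problems, and it computes AUC as the fraction of (Y, N) problem pairs that are ranked correctly. The function names and the use of `None` for unanswered problems are choices made here.

```python
from typing import Optional, Sequence


def c_at_1(truth: Sequence[bool],
           predictions: Sequence[Optional[bool]]) -> float:
    """c@1: accuracy that gives partial credit for unanswered problems.

    A prediction of None marks a problem that the method left unanswered.
    """
    n = len(truth)
    if n == 0:
        raise ValueError("at least one verification problem is required")
    n_correct = sum(1 for t, p in zip(truth, predictions)
                    if p is not None and p == t)
    n_unanswered = sum(1 for p in predictions if p is None)
    return (n_correct + n_unanswered * n_correct / n) / n


def auc(truth: Sequence[bool], scores: Sequence[float]) -> float:
    """AUC as the probability that a Y-problem is scored above an N-problem.

    Ties between a Y-score and an N-score count as half a correct ranking.
    """
    pos = [s for t, s in zip(truth, scores) if t]
    neg = [s for t, s in zip(truth, scores) if not t]
    if not pos or not neg:
        raise ValueError("both Y- and N-problems are required")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Note that `auc` depends only on the ranking of the scores, whereas `c_at_1` depends on the concrete Y/N decisions, so the two measures capture different aspects of a method.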
Overall, we focus on three experiments, which are based on the corpora introduced in Section 4.1:
- The Effect of Stylistic Variation Across Large Time Spans
- The Effect of Topical Influence
- The Effect of Limited Text Length
In the following, each experiment is described in detail.
4.4.1. The Effect of Stylistic Variation Across Large Time Spans:
In this experiment, we seek to answer the question of whether the writing style of an author can be recognized, given a large time span between two documents of that author. The motivation behind this experiment is based on the statement of Olsson (Olsson, 2008) that language acquisition is a continuous process, in which language is not only acquired but can also be lost. Therefore, an important question that arises here is whether the writing style of a person remains “stable” across a large time span, given the fact that language in each individual’s life is never “fixed” (Olsson, 2008). Regarding this experiment, we used the corpus.
The results of the 12 examined AV methods are listed in Table 3, where it can be seen that the majority of the examined AV methods yield useful recognition results with a maximum value of 0.792 in terms of c@1. With the exception of the binary-intrinsic approach COAV, the remaining top performing methods belong to the binary-extrinsic category. This category of AV methods has also been superior in the PAN-AV competitions (Juola and Stamatatos, 2013; Stamatatos et al., 2014; Stamatatos et al., 2015), where they outperformed binary-intrinsic and unary approaches three times in a row (2013–2015).
The top performing approaches Caravel, COAV and NNCD deserve closer attention. All three are based on character-level language models that capture low-level features similar to character n-grams, which have been shown in numerous AA and AV studies (for instance, (Stamatatos, 2013; Neal et al., 2018)) to be highly effective and robust. In (Halvani et al., 2017; Bevendorff et al., 2019), Caravel and COAV were also the two top-performing approaches: in (Halvani et al., 2017) they were evaluated on the PAN-2015 AV corpus (Stamatatos et al., 2015), while in (Bevendorff et al., 2019) they were applied to texts obtained from Project Gutenberg (note that the implementation in (Bevendorff et al., 2019) differs from the one used in this paper). Although both approaches perform similarly, they differ in how the decision criterion is determined. While COAV requires a training corpus to learn its threshold, Caravel assumes that the given test corpus (which provides the impostors) is balanced. Given this assumption, Caravel first computes similarity scores for all verification problems in the corpus and then sets the threshold to the median of all similarities (cf. Figure 3). Thus, from a machine learning perspective, there is some undue training on the test set. Moreover, the applicability of Caravel in realistic scenarios is questionable, as a forensic case is not part of a corpus where the Y/N-distribution is known beforehand.
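The median-based thresholding step can be illustrated with a short sketch. It covers only the decision step under the balance assumption, not Caravel's language model. The function name, the variable name `theta` and the rule that scores equal to the median are answered with N are assumptions made here.

```python
from statistics import median
from typing import List


def decide_by_corpus_median(similarities: List[float]) -> List[bool]:
    """Answer Y (True) for every problem whose similarity exceeds the median.

    The threshold is derived from the scores of the test corpus itself,
    which is why this step amounts to tuning on the test set.
    """
    theta = median(similarities)
    return [s > theta for s in similarities]


# Even if all four problems were same-author cases, half of them would
# still be answered with N, because the median splits the corpus in two.
decisions = decide_by_corpus_median([0.9, 0.8, 0.7, 0.6])
```

The example makes the limitation concrete: the rule is only appropriate when roughly half of the problems in the corpus are Y-cases, which cannot be known for a single forensic case.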
Another interesting observation can be made regarding COAV, NNCD and OCCAV. Although all three differ regarding their model category, they use the same underlying compression algorithm (PPMd) that is responsible for generating the language model. While the former two approaches perform similarly well, OCCAV achieves a poor c@1 score. An obvious explanation for this is a wrongly calibrated threshold, as can be seen from the confusion matrix, where almost all answers are N-predictions. Regarding the NNCD approach, one should consider that the unknown document is compared against the known author as well as against all impostors taken from the other verification problems in the corpus. Therefore, a Y-result is correct with relatively high certainty (i.e., the method has high precision compared to other approaches with a similar c@1 score), as NNCD decided that the known author fits the unknown document best among all candidates. In contrast to Caravel, NNCD only retrieves the impostors from the given corpus, but it does not exploit background knowledge about the distribution of problems in the corpus.
Overall, the results indicate that it is possible to recognize writing styles across large time spans. To gain more insight into which features led to the correct predictions, we inspected the AVeer method. Although the method achieved only average results, it benefits from the fact that it can be interpreted easily, as it relies on a simple distance function, a fixed threshold and predefined feature categories such as function words. Regarding the correctly recognized Y-cases, we noticed that conjunctive adverbs such as “hence”, “therefore” or “moreover” contributed most to AVeer’s correct predictions. However, a more in-depth analysis is required in future work to figure out whether the decisions of the remaining methods are also primarily affected by these features.
4.4.2. The Effect of Topical Influence:
In this experiment, we investigate the question of whether the writing style of authors can be recognized under the influence of topical bias. In real-world scenarios, the topic of the documents within a verification problem is not always known beforehand, which can pose a serious challenge for the recognition of the writing style. Imagine, for example, a verification problem that consists of a known and an unknown document that are written by the same author (a Y-case) but differ regarding their topic. In such a case, an AV method that focuses “too much” on the topic (for example, on specific nouns or phrases) will likely predict a different authorship (N). On the other hand, when both documents match regarding their topic while being written by different authors, a topically biased AV method might erroneously predict Y.
In the following we show to which extent these assumptions hold. As a data basis for this experiment, we used the corpus introduced in Section 4.1.3. The results regarding the 12 AV methods are given in Table 4, where it can be seen that our assumptions hold. All examined AV methods (with no exception) are fooled by the topical bias in the corpus. Here, the highest achieved results in terms of c@1 and AUC are very close to random guessing. A closer look at the confusion matrix outcomes reveals that some methods, for example ImpGI and OCCAV, perform almost entirely inverse to each other, where the former predicts nothing but Y and the latter nothing but N (except 1 Y). Moreover, we can assume that the lower c@1 is, the stronger is the focus of the respective AV method on the topic of the documents. Overall, the results of this experiment suggest that none of the examined AV methods is robust against topical influence.
4.4.3. The Effect of Limited Text Length:
In our third experiment, we investigate the question of how text lengths affect the results of the examined AV methods. The motivation behind this experiment is based on the observation of Stamatatos et al. (Stamatatos et al., 2015) that text length is an important issue, which has not been thoroughly studied within authorship verification research. To address this issue, we make use of the corpus introduced in Section 4.1.2. The corpus is suitable for this purpose, as it comprises a large number of verification problems, where more than 90% of all documents have sufficient text lengths (2,000 characters). This allows a stepwise truncation and thereby an analysis of the relationship between the text lengths and the recognition results. However, before considering this, we first focus on the results (shown in Table 5) after applying all 12 AV methods to the original test corpus.
As can be seen in Table 5, almost all approaches perform very well with c@1 scores up to 0.991. Although these results are quite impressive, it should be noted that a large fraction of the documents comprises thousands of words. Thus, the methods can learn precise representations based on a large variety of features, which in turn enable a good determination of (dis)similarities between known/unknown documents. To investigate this issue in more detail, we constructed four versions of the test corpus and equalized the unknown document lengths to 250, 500, 1000, and 2000 characters. Then, we applied the top performing AV methods (in terms of c@1) to the four corpora. Here, we reused the same models and hyperparameters (including the decision criteria) that were determined on the training corpus. The intention behind this was to observe the robustness of the trained AV models, given the fact that during training they were confronted with longer documents.
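Constructing the four corpus versions can be sketched as follows. The sketch assumes that "equalizing" means keeping the first characters of each unknown document and that documents shorter than the target length are left unchanged. Neither detail is stated in the text, and the names used are illustrative.

```python
from typing import Dict, Iterable, List, Sequence

TARGET_LENGTHS = (250, 500, 1000, 2000)  # lengths in characters


def build_truncated_versions(
        unknown_docs: Iterable[str],
        lengths: Sequence[int] = TARGET_LENGTHS) -> Dict[int, List[str]]:
    """Return one list of truncated unknown documents per target length.

    Assumption: truncation keeps the document prefix; shorter documents
    are passed through as they are.
    """
    docs = list(unknown_docs)
    return {n: [doc[:n] for doc in docs] for n in lengths}
```

The known documents and the trained models stay untouched in this setup, so any change in the results can be attributed to the shortened unknown documents.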
The results are illustrated in Figure 2, where it can be observed that GLAD yields the most stable results across the four corpus versions; even for the corpus with the 250-character unknown documents, it achieves a c@1 score of 0.727. Surprisingly, Unmasking performs similarly well, despite the fact that the method has been designed for longer texts, i.e., book chunks of at least 500 words (Koppel and Schler, 2004). Sanderson and Guenter also point out that the Unmasking approach is less useful when dealing with relatively short texts (Sanderson and Guenter, 2006). However, our results show a different picture, at least for this corpus.
One explanation for the resilience of GLAD across the varying text lengths might be its decision model (an SVM with a linear kernel), which withstands the absence of features caused by the truncation of the documents, in contrast to the distance-based approaches AVeer, NNCD and COAV, where the decision criterion is reflected by a simple scalar. Table 6 lists the confusion matrix outcomes of the six AV methods regarding the 250-character version of the corpus.
|AV Method||TP||FN||FP||TN||Total (Y/N/UP)|
Here, it can be seen that the underlying SVM model of GLAD and Unmasking is able to regulate its Y/N-predictions, in contrast to the three distance-based methods, where the majority of predictions fall either on the Y- or on the N-side. To gain a better picture regarding the stability of the decision criteria of the methods, we decided to take a closer look at the ROC curves (cf. Figure 3) generated by GLAD, Caravel and COAV for the four corpus versions, where a number of interesting observations can be made. When focusing on AUC, it turns out that all three methods perform very similarly to each other, whereas big discrepancies between GLAD and COAV can be observed regarding c@1. When we consider the current and maximum achievable results (depicted by the circles and triangles, respectively), it becomes apparent that GLAD’s model remains stable, while that of COAV becomes increasingly vulnerable the more the documents are shortened. When looking at the ROC curve of Caravel, it can be clearly seen that the actual and maximum achievable results are very close to each other. This is not surprising, due to the fact that Caravel’s threshold always lies at the median point of the ROC curve, provided that the given corpus is balanced.
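The gap between the current and the maximum achievable result of a scalar-threshold method can be illustrated with the following sketch. It assumes that every problem is answered, in which case c@1 reduces to accuracy, and that a problem is answered with Y when its score exceeds the threshold. The function names and the toy scores are hypothetical and are not taken from the experiments.

```python
from typing import Sequence, Tuple


def accuracy_at(truth: Sequence[bool], scores: Sequence[float],
                theta: float) -> float:
    """Accuracy when every problem with score > theta is answered with Y."""
    if not truth:
        raise ValueError("at least one verification problem is required")
    hits = sum(1 for t, s in zip(truth, scores) if (s > theta) == t)
    return hits / len(truth)


def best_threshold(truth: Sequence[bool],
                   scores: Sequence[float]) -> Tuple[float, float]:
    """Return the (threshold, accuracy) pair with the highest accuracy.

    One candidate below all scores plus every distinct score covers all
    possible Y/N splits under the 'score > theta' rule.
    """
    distinct = sorted(set(scores))
    candidates = [distinct[0] - 1.0] + distinct
    best = max(candidates, key=lambda th: accuracy_at(truth, scores, th))
    return best, accuracy_at(truth, scores, best)


truth = [True, True, False, False]
shortened_scores = [0.45, 0.40, 0.30, 0.20]  # toy scores after truncation
current = accuracy_at(truth, shortened_scores, theta=0.5)  # fixed threshold
optimum = best_threshold(truth, shortened_scores)          # re-fitted threshold
```

In this toy case the ranking of the scores is still perfect, so AUC is unaffected, but the fixed threshold of 0.5 now answers every problem with N. This is the kind of discrepancy between AUC and c@1 described above for a method whose scalar threshold was fitted on longer documents.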
While inspecting the 250 characters long documents in more detail, we identified that they share similar vocabularies consisting of chat abbreviations such as “lol” (laughing out loud) or “k” (ok), smileys and specific obscene words. Therefore, we assume that the verification results of the examined methods are mainly caused by the similar vocabularies between the texts.
5. Conclusion and Future Work
We highlighted the problem that the underlying characteristics of authorship verification approaches have received little attention in past research and that these affect the applicability of the methods in real forensic settings. Then, we proposed several properties that enable a better characterization of, and thereby a better comparison between, AV methods. Among others, we explained that the performance measure AUC is meaningless in regard to unary or specific non-optimizable AV methods, which involve a fixed decision criterion (for example, NNCD). Additionally, we mentioned that determinism must be fulfilled such that an AV method can be rated as reliable. Moreover, we clarified a number of misunderstandings in previous research works and proposed three clear criteria for classifying the model category of an AV method, which in turn influences its design and the way it should be evaluated. In regard to binary-extrinsic AV approaches, we explained which challenges exist and how they affect their applicability.
In an experimental setup, we applied 12 existing AV methods to three self-compiled corpora, where the intention behind each corpus was to focus on a different aspect of the methods' applicability. Our findings regarding the examined approaches can be summarized as follows: Despite the good performance of the five AV methods GenIM, ImpGI, Unmasking, Caravel and SPATIUM, none of them can be truly considered as reliable and therefore applicable in real forensic cases. The reason for this is not only the non-deterministic behavior of the methods but also their dependence (excepting Unmasking) on an impostor corpus. Here, it must be guaranteed not only that the true author is not among the candidates, but also that the impostor documents are suitable, such that the AV task does not inadvertently degenerate from style to topic classification. In particular, the applicability of the Caravel approach remains highly questionable, as it requires a corpus where the information regarding the Y/N-distribution is known beforehand in order to set the threshold. In regard to the two examined unary AV approaches MOCC and OCCAV, we observed that these perform poorly on all three corpora in comparison to the binary-intrinsic and binary-extrinsic methods. Most likely, this is caused by a wrong threshold setting, as both tend to generate more N-predictions. Of the remaining approaches, GLAD and COAV seem to be a good choice for realistic scenarios. However, the former has been shown to be more robust in regard to varying text lengths given a fixed model, while the latter requires a retraining of the model (note that both performed almost equally in terms of AUC). Our hypothesis, which we leave open for future work, is that AV methods relying on a complex model are more robust than methods based on a scalar threshold. Lastly, we wish to underline that all examined approaches failed in the cross-topic experiment.
One possibility to counteract this is to apply text distortion techniques (for instance, (Stamatatos, 2017)) in order to control the topic influence in the documents.
As a next step, we will compile additional and larger corpora to investigate the question of whether the evaluation results of this paper hold more generally. Furthermore, we will address the important question of how the results of AV methods can be interpreted in a more systematic manner, which will further influence the practicability of AV methods besides the proposed properties.
Acknowledgements. This work was supported by the German Federal Ministry of Education and Research (BMBF) under the project “DORIAN” (Scrutinise and thwart disinformation).
- Azarbonyad et al. (2015) Hosein Azarbonyad, Mostafa Dehghani, Maarten Marx, and Jaap Kamps. 2015. Time-Aware Authorship Attribution for Short Text Streams. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’15). ACM, New York, NY, USA, 727–730.
- Bagnall (2015) Douglas Bagnall. 2015. Author Identification Using Multi-headed Recurrent Neural Networks. In Working Notes of CLEF 2015 - Conference and Labs of the Evaluation forum, Toulouse, France, September 8-11, 2015.
- Bevendorff et al. (2019) Janek Bevendorff, Benno Stein, Matthias Hagen, and Martin Potthast. 2019. Generalizing Unmasking for Short Texts. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 654–659.
- Bollen (1989) Kenneth A. Bollen. 1989. Structural Equations with Latent Variables. Wiley.
- Boukhaled and Ganascia (2014) Mohamed Amine Boukhaled and Jean-Gabriel Ganascia. 2014. Probabilistic Anomaly Detection Method for Authorship Verification. Springer International Publishing, Cham, 211–219.
- Castro Castro et al. (2015) Daniel Castro Castro, Yaritza Adame Arcia, María Pelaez Brioso, and Rafael Muñoz Guillena. 2015. Authorship Verification, Average Similarity Analysis. In Proceedings of the International Conference Recent Advances in Natural Language Processing. INCOMA Ltd. Shoumen, BULGARIA, 84–90.
- Gröndahl and Asokan (2019) Tommi Gröndahl and N. Asokan. 2019. Text Analysis in Adversarial Settings: Does Deception Leave a Stylistic Trace? CoRR abs/1902.08939 (2019). arXiv:1902.08939
- Halvani et al. (2018) Oren Halvani, Lukas Graner, and Inna Vogel. 2018. Authorship Verification in the Absence of Explicit Features and Thresholds. In Advances in Information Retrieval, Gabriella Pasi, Benjamin Piwowarski, Leif Azzopardi, and Allan Hanbury (Eds.). Springer International Publishing, 454–465.
- Halvani and Steinebach (2014) Oren Halvani and Martin Steinebach. 2014. An Efficient Intrinsic Authorship Verification Scheme Based on Ensemble Learning. In Ninth International Conference on Availability, Reliability and Security, ARES 2014, Fribourg, Switzerland, September 8-12, 2014. Washington, DC, USA, 571–578.
- Halvani et al. (2017) Oren Halvani, Christian Winter, and Lukas Graner. 2017. On the Usefulness of Compression Models for Authorship Verification. In Proceedings of the 12th International Conference on Availability, Reliability and Security (ARES ’17). ACM, New York, NY, USA, Article 54, 10 pages.
- Halvani et al. (2016) Oren Halvani, Christian Winter, and Anika Pflug. 2016. Authorship Verification for Different Languages, Genres and Topics. Digit. Investig. 16, S (March 2016), S33–S43.
- Hernández et al. (2015) Josué Gerardo Gutiérrez Hernández, José Casillas, Paola Ledesma, Gibran Fuentes Pineda, and Iván Vladimir Meza Ruíz. 2015. Homotopy Based Classification for Author Verification Task: Notebook for PAN at CLEF 2015. In Working Notes of CLEF 2015 - Conference and Labs of the Evaluation forum, Toulouse, France, September 8-11, 2015.
- Hernández-Castañeda and Calvo (2017) Ángel Hernández-Castañeda and Hiram Calvo. 2017. Author Verification Using a Semantic Space Model. Computación y Sistemas 21, 2 (2017).
- Holmes (1998) David I. Holmes. 1998. The Evolution of Stylometry in Humanities Scholarship. Literary and Linguistic Computing 13, 3 (1998), 111–117.
- Hürlimann et al. (2015) Manuela Hürlimann, Benno Weck, Esther von den Berg, Simon Šuster, and Malvina Nissim. 2015. GLAD: Groningen Lightweight Authorship Detection. In Working Notes of CLEF 2015 – Conference and Labs of the Evaluation forum, Toulouse, France, September 8–11, 2015. 12.
- Jankowska et al. (2013) Magdalena Jankowska, Vlado Keselj, and Evangelos E. Milios. 2013. Proximity Based One-class Classification with Common N-Gram Dissimilarity for Authorship Verification Task Notebook for PAN at CLEF 2013. In Working Notes for CLEF 2013 Conference, Valencia, Spain, September 23-26, 2013.
- Jankowska et al. (2014) Magdalena Jankowska, Evangelos E. Milios, and Vlado Keselj. 2014. Author Verification Using Common N-Gram Profiles of Text Documents. In COLING 2014, 25th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, August 23-29, 2014, Dublin, Ireland, Jan Hajic and Junichi Tsujii (Eds.). ACL, 387–397.
- Jr and Ryan (2012) John Noecker Jr and Michael Ryan. 2012. Distractorless Authorship Verification. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12) (23-25), Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis (Eds.). European Language Resources Association (ELRA), Istanbul, Turkey.
- Juola and Stamatatos (2013) Patrick Juola and Efstathios Stamatatos. 2013. Overview of the Author Identification Task at PAN 2013. In Working Notes for CLEF 2013 Conference, Valencia, Spain, September 23-26, 2013. 20.
- Khonji and Iraqi (2014) Mahmoud Khonji and Youssef Iraqi. 2014. A Slightly-Modified GI-Based Author-Verifier with Lots of Features (ASGALF). In Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15-18, 2014. 977–983.
- Kocher and Savoy (2015) Mirco Kocher and Jacques Savoy. 2015. UniNE at CLEF 2015 Author Identification: Notebook for PAN at CLEF 2015. In CLEF (Working Notes) (CEUR Workshop Proceedings), Vol. 1391. CEUR-WS.org.
- Koppel and Schler (2004) Moshe Koppel and Jonathan Schler. 2004. Authorship Verification as a One-Class Classification Problem. In Machine Learning, Proceedings of the Twenty-first International Conference (ICML 2004), Banff, Alberta, Canada, July 4-8, 2004 (ACM International Conference Proceeding Series), Carla E. Brodley (Ed.), Vol. 69. ACM.
- Koppel and Winter (2014) Moshe Koppel and Yaron Winter. 2014. Determining if Two Documents are Written by the Same Author. JASIST 65, 1 (2014), 178–187.
- Neal et al. (2018) Tempestt J. Neal, Kalaivani Sundararajan, and Damon L. Woodard. 2018. Exploiting Linguistic Style as a Cognitive Biometric for Continuous Verification. In 2018 International Conference on Biometrics, ICB 2018, Gold Coast, Australia, February 20-23, 2018. IEEE, 270–276.
- Olsson (2008) J. Olsson. 2008. Forensic Linguistics: Second Edition: An Introduction To Language, Crime and the Law. Bloomsbury Academic.
- Potha and Stamatatos (2014) Nektaria Potha and Efstathios Stamatatos. 2014. A Profile-Based Method for Authorship Verification. In Artificial Intelligence: Methods and Applications: 8th Hellenic Conference on AI, SETN 2014, Ioannina, Greece, May 15–17, 2014. Proceedings. Springer International Publishing, 313–326.
- Potha and Stamatatos (2017) Nektaria Potha and Efstathios Stamatatos. 2017. An Improved Impostors Method for Authorship Verification. In Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Dublin, Ireland, September 11-14, 2017, Proceedings. 138–144.
- Potha and Stamatatos (2018) Nektaria Potha and Efstathios Stamatatos. 2018. Intrinsic Author Verification Using Topic Modeling. In Proceedings of the 10th Hellenic Conference on Artificial Intelligence, SETN 2018, Patras, Greece, July 09-12, 2018. ACM, 20:1–20:7.
- Potthast et al. (2016) Martin Potthast, Matthias Hagen, and Benno Stein. 2016. Author Obfuscation: Attacking the State of the Art in Authorship Verification. In Working Notes Papers of the CLEF 2016 Evaluation Labs (CEUR Workshop Proceedings), Vol. 1609. CLEF and CEUR-WS.org, 716–749.
- Potthast et al. (2019) Martin Potthast, Paolo Rosso, Efstathios Stamatatos, and Benno Stein. 2019. A Decade of Shared Tasks in Digital Text Forensics at PAN. In Advances in Information Retrieval, Leif Azzopardi, Benno Stein, Norbert Fuhr, Philipp Mayr, Claudia Hauff, and Djoerd Hiemstra (Eds.). Springer International Publishing, Cham, 291–300.
- Rodionova et al. (2016) Oxana Ye. Rodionova, Paolo Oliveri, and Alexey L. Pomerantsev. 2016. Rigorous and Compliant Approaches to One-Class Classification. Chemometrics and Intelligent Laboratory Systems 159 (2016), 89 – 96.
- Sanderson and Guenter (2006) Conrad Sanderson and Simon Guenter. 2006. Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP ’06). Association for Computational Linguistics, Stroudsburg, PA, USA, 482–491.
- Seidman (2013) Shachar Seidman. 2013. Authorship Verification Using the Impostors Method Notebook for PAN at CLEF 2013. In Working Notes for CLEF 2013 Conference, Valencia, Spain, September 23-26, 2013.
- Stamatatos (2009) Efstathios Stamatatos. 2009. A Survey of Modern Authorship Attribution Methods. J. Am. Soc. Inf. Sci. Technol. 60, 3 (March 2009), 538–556.
- Stamatatos (2013) Efstathios Stamatatos. 2013. On the Robustness of Authorship Attribution Based on Character N-Gram Features. Journal of Law and Policy 21 (01 2013), 421–439.
- Stamatatos (2017) Efstathios Stamatatos. 2017. Authorship Attribution Using Text Distortion. In Proceedings of the 15th Conference of the European Chapter of the Association for the Computational Linguistics, EACL 2017, April 3-7, 2017, Valencia, Spain. The Association for Computer Linguistics.
- Stamatatos et al. (2015) Efstathios Stamatatos, Walter Daelemans, Ben Verhoeven, Patrick Juola, Aurelio López-López, Martin Potthast, and Benno Stein. 2015. Overview of the Author Identification Task at PAN 2015. In Working Notes of CLEF 2015 – Conference and Labs of the Evaluation forum, Toulouse, France, September 8–11, 2015. 17.
- Stamatatos et al. (2014) Efstathios Stamatatos, Walter Daelemans, Ben Verhoeven, Benno Stein, Martin Potthast, Patrick Juola, Miguel A. Sánchez-Pérez, and Alberto Barrón-Cedeño. 2014. Overview of the Author Identification Task at PAN 2014. In Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15–18, 2014. 877–897.
- Stamatatos et al. (2000) Efstathios Stamatatos, Nikos Fakotakis, and George K. Kokkinakis. 2000. Automatic Text Categorization in Terms of Genre and Author. Computational Linguistics 26, 4 (2000), 471–495.
- Stein et al. (2008) Benno Stein, Nedim Lipka, and Sven Meyer zu Eissen. 2008. Meta Analysis within Authorship Verification. In 19th International Workshop on Database and Expert Systems Applications (DEXA 2008), 1-5 September 2008, Turin, Italy. IEEE Computer Society, 34–39.
- Tax (2001) David Martinus Johannes Tax. 2001. One-Class Classification: Concept Learning In the Absence of Counter-Examples. Ph.D. Dissertation. Delft University of Technology.
- Veenman and Li (2013) Cor J. Veenman and Zhenshi Li. 2013. Authorship Verification with Compression Features. In Working Notes for CLEF 2013 Conference, Valencia, Spain, September 23–26, 2013. 6.