
Higher Criticism for Discriminating Word-Frequency Tables and Testing Authorship

10/30/2019
by Alon Kipnis, et al.

We adapt the Higher Criticism (HC) goodness-of-fit test to detect differences between word-frequency tables. We apply the test to authorship attribution, where the goal is to identify the author of a document using other documents whose authorship is known. The method is simple yet performs well without handcrafting or tuning. As an inherent side effect, the HC calculation identifies a subset of discriminating words. In practice, the identified words have low variance across documents belonging to a corpus of homogeneous authorship. We conclude that, when testing a new document against the corpus of an author, HC is mostly affected by words characteristic of that author and is relatively unaffected by topic structure.
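
The abstract does not spell out the computation, so the following is only a minimal sketch of how an HC comparison of two word-frequency tables might look, not the authors' implementation. It assumes a per-word exact two-sided binomial test under the null that both tables share the same word frequencies, followed by the standard HC statistic over the sorted P-values. The function names (`binom_two_sided_pvalue`, `higher_criticism`, `hc_between_tables`), the search fraction `gamma=0.4`, and the toy counts are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def binom_two_sided_pvalue(k, n, p):
    """Exact two-sided binomial P-value: total mass of outcomes no more likely than k."""
    if n == 0:
        return 1.0
    log_p, log_q = math.log(p), math.log(1.0 - p)
    # Work with log-probabilities so that large counts do not overflow.
    pmf = [
        math.exp(
            math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
            + j * log_p + (n - j) * log_q
        )
        for j in range(n + 1)
    ]
    cutoff = pmf[k] * (1.0 + 1e-7)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))


def higher_criticism(pvalues, gamma=0.4):
    """Return (HC statistic, P-value at the maximizing index).

    gamma is the fraction of smallest P-values searched (an assumed default).
    """
    n = len(pvalues)
    if n < 2:
        return 0.0, 0.0
    sorted_p = sorted(pvalues)
    limit = max(1, int(gamma * n))
    best_hc, best_p = -math.inf, 0.0
    for i, p in enumerate(sorted_p[:limit], start=1):
        if i == n:  # u would equal 1 and the denominator would vanish
            break
        u = i / n
        hc = math.sqrt(n) * (u - p) / math.sqrt(u * (1.0 - u))
        if hc > best_hc:
            best_hc, best_p = hc, p
    return best_hc, best_p


def hc_between_tables(counts_a, counts_b, gamma=0.4):
    """Compare two word-count tables; return (HC score, discriminating words)."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both tables must contain at least one word")
    # Under the null, each occurrence of a word lands in table A with this probability.
    p_null = total_a / (total_a + total_b)
    vocab = sorted(set(counts_a) | set(counts_b))
    pvals = {
        w: binom_two_sided_pvalue(
            counts_a.get(w, 0), counts_a.get(w, 0) + counts_b.get(w, 0), p_null
        )
        for w in vocab
    }
    hc, threshold = higher_criticism(list(pvals.values()), gamma)
    # Words whose P-value falls at or below the HC threshold are flagged.
    discriminating = [w for w in vocab if pvals[w] <= threshold]
    return hc, discriminating


# Toy tables (invented): the two differ only in their use of "upon".
table_a = Counter({"the": 50, "of": 30, "upon": 20, "and": 40, "to": 35})
table_b = Counter({"the": 50, "of": 30, "upon": 0, "and": 40, "to": 35})
score, words = hc_between_tables(table_a, table_b)
print(score, words)
```

In this sketch a larger HC score suggests a larger discrepancy between the two tables, and the words at or below the HC threshold play the role of the discriminating subset mentioned in the abstract. For attribution, one would presumably compute the score of a new document against each candidate author's pooled table and pick the smallest, though that decision rule is likewise an assumption here.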

07/11/2012

The Author-Topic Model for Authors and Documents

We introduce the author-topic model, a generative model for documents th...
10/12/2018

HiTR: Hierarchical Topic Model Re-estimation for Measuring Topical Diversity of Documents

A high degree of topical diversity is often considered to be an importan...
10/10/2022

A two-stage approach for table extraction in invoices

The automated analysis of administrative documents is an important field...
09/29/2020

Neural Topic Modeling by Incorporating Document Relationship Graph

Graph Neural Networks (GNNs) that capture the relationships between grap...
01/02/2020

Why Molière most likely did write his plays

As for Shakespeare, a hard-fought debate has emerged about Molière, a su...
09/30/2014

An agent-driven semantical identifier using radial basis neural networks and reinforcement learning

Due to the huge availability of documents in digital form, and the decep...
07/02/2009

Bounding the Probability of Error for High Precision Recognition

We consider models for which it is important, early in processing, to es...