Higher Criticism for Discriminating Word-Frequency Tables and Testing Authorship

by Alon Kipnis, et al.

We adapt the Higher Criticism (HC) goodness-of-fit test to detect changes between word frequency tables. We apply the test to authorship attribution, where the goal is to identify the author of a document using other documents whose authorship is known. The method is simple yet performs well without handcrafting and tuning. As an inherent side effect, the HC calculation identifies a subset of discriminating words. In practice, the identified words have low variance across documents belonging to a corpus of homogeneous authorship. We conclude that in testing a new document against the corpus of an author, HC is mostly affected by words characteristic of that author and is relatively unaffected by topic structure.
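To make the idea concrete, here is a minimal sketch of how an HC comparison of two word-frequency tables might look. The abstract does not spell out the details, so several choices here are assumptions made for illustration: the per-word p-value comes from an exact two-sided binomial test with null proportion N1/(N1+N2), the HC statistic is maximized over the smallest `gamma` fraction of p-values, and the function names (`binom_pvalue`, `higher_criticism`, `hc_compare`) are hypothetical rather than taken from the authors' code.

```python
import math
from collections import Counter


def binom_pvalue(k, n, p):
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than the observed count k."""
    if n == 0:
        return 1.0
    pmf = [math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))


def higher_criticism(pvalues, gamma=0.3):
    """Return (HC score, p-value threshold at the maximizing index).

    gamma is an assumed tuning parameter: only the smallest gamma
    fraction of the sorted p-values is searched."""
    n = len(pvalues)
    ordered = sorted(pvalues)
    last = min(n - 1, max(1, int(gamma * n)))  # keep i/n strictly below 1
    best_score, best_threshold = -math.inf, 0.0
    for i in range(1, last + 1):
        frac = i / n
        score = math.sqrt(n) * (frac - ordered[i - 1]) / math.sqrt(frac * (1 - frac))
        if score > best_score:
            best_score, best_threshold = score, ordered[i - 1]
    return best_score, best_threshold


def hc_compare(tokens_a, tokens_b, gamma=0.3):
    """Compare two token lists; return (HC score, discriminating words)."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    p_null = total_a / (total_a + total_b)  # assumed null: same word rates
    pvals = {
        w: binom_pvalue(counts_a[w], counts_a[w] + counts_b[w], p_null)
        for w in set(counts_a) | set(counts_b)
    }
    score, threshold = higher_criticism(list(pvals.values()), gamma)
    # Words whose p-value falls at or below the HC threshold
    words = sorted(w for w, pv in pvals.items() if pv <= threshold)
    return score, words
```

Under these assumptions, a larger score suggests the two tables are less likely to share the same underlying word frequencies, and the returned word list corresponds to the subset of discriminating words mentioned above. In an authorship setting, one would presumably compare a new document against the pooled corpus of each candidate author and prefer the author with the smallest score, though the exact decision rule is not stated in the abstract.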


The Author-Topic Model for Authors and Documents

We introduce the author-topic model, a generative model for documents th...

HiTR: Hierarchical Topic Model Re-estimation for Measuring Topical Diversity of Documents

A high degree of topical diversity is often considered to be an importan...

A two-stage approach for table extraction in invoices

The automated analysis of administrative documents is an important field...

Neural Topic Modeling by Incorporating Document Relationship Graph

Graph Neural Networks (GNNs) that capture the relationships between grap...

Why Molière most likely did write his plays

As for Shakespeare, a hard-fought debate has emerged about Molière, a su...

An agent-driven semantical identifier using radial basis neural networks and reinforcement learning

Due to the huge availability of documents in digital form, and the decep...

Bounding the Probability of Error for High Precision Recognition

We consider models for which it is important, early in processing, to es...