
Towards Evaluation of Cultural-scale Claims in Light of Topic Model Sampling Effects

by Jaimie Murdock, et al.

Cultural-scale models of full-text documents are prone to over-interpretation by researchers making unintentionally strong socio-linguistic claims (Pechenick et al., 2015) without recognizing that even large digital libraries are merely samples of all the books ever produced. In this study, we test the sensitivity of topic models to the sampling process by taking random samples of books in the HathiTrust Digital Library from different areas of the Library of Congress Classification Outline. For each classification area, we train several topic models over the entire class with different random seeds, generating a set of spanning models. Then, we train topic models on random samples of books from the classification area, generating a set of sample models. Finally, we perform a topic alignment between each pair of models by computing the Jensen-Shannon distance (JSD) between the word probability distributions for each topic. We take two measures on each model alignment: alignment distance and topic overlap. We find that sample models with a large sample size typically have an alignment distance that falls within the range of alignment distances between spanning models. Unsurprisingly, as sample size increases, alignment distance decreases. We also find that topic overlap increases as sample size increases. However, the decomposition of these measures by sample size differs by number of topics and by classification area. We speculate that these measures could be used to find classes which have a common "canon" discussed among all books in the area, as shown by high topic overlap and low alignment distance even at small sample sizes.
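
The alignment step described in the abstract can be illustrated with a minimal sketch. The abstract does not define the two measures precisely, so the definitions below are assumptions made for illustration: each topic in one model is matched to its nearest topic in the other model by Jensen-Shannon distance, alignment distance is taken as the mean distance over those matches, and topic overlap as the fraction of the second model's topics that are matched at least once. The function names (`js_distance`, `align`, `alignment_distance`, `topic_overlap`) and the toy distributions are hypothetical and are not taken from the authors' code.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance (square root of the base-2 JS divergence)
    between two discrete probability distributions of equal length."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; zero-probability terms contribute 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


def align(model_a, model_b):
    """For each topic in model_a, return (index of nearest topic in model_b, distance).
    Assumption: a greedy nearest-neighbour match, not a one-to-one assignment."""
    pairs = []
    for topic in model_a:
        dists = [js_distance(topic, other) for other in model_b]
        j = min(range(len(dists)), key=dists.__getitem__)
        pairs.append((j, dists[j]))
    return pairs


def alignment_distance(pairs):
    """Assumed definition: mean JS distance over the matched topic pairs."""
    return sum(d for _, d in pairs) / len(pairs)


def topic_overlap(pairs, n_topics_b):
    """Assumed definition: fraction of model_b topics matched at least once."""
    return len({j for j, _ in pairs}) / n_topics_b


# Toy models: each row is a topic's word probability distribution over 3 words.
spanning_model = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
sample_model = [[0.1, 0.2, 0.7], [0.7, 0.2, 0.1]]

pairs = align(sample_model, spanning_model)
print(alignment_distance(pairs), topic_overlap(pairs, len(spanning_model)))
```

In this toy case the two models contain the same topics in a different order, so every topic finds an exact match. With real topic models, each row would be a topic's distribution over the full vocabulary, and both models would need to share the same word ordering.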



