SeNMFk-SPLIT: Large Corpora Topic Modeling by Semantic Non-negative Matrix Factorization with Automatic Model Selection

08/21/2022
by Maksim E. Eren, et al.

As the amount of text data continues to grow, topic modeling plays an important role in understanding the content hidden in an overwhelming quantity of documents. One popular topic modeling approach is non-negative matrix factorization (NMF), an unsupervised machine learning (ML) method. Recently, Semantic NMF with automatic model selection (SeNMFk) has been proposed as a modification to NMF. In addition to heuristically estimating the number of topics, SeNMFk incorporates the semantic structure of the text. It does so by jointly factorizing the term frequency-inverse document frequency (TF-IDF) matrix with the co-occurrence (word-context) matrix, whose values represent the number of times two words co-occur within a predetermined window of the text. In this paper, we introduce a novel distributed method, SeNMFk-SPLIT, for semantic topic extraction suitable for large corpora. Unlike SeNMFk, our method enables the joint factorization of large documents by decomposing the word-context and term-document matrices separately. We demonstrate the capability of SeNMFk-SPLIT by applying it to the entire artificial intelligence (AI) and ML scientific literature uploaded to arXiv.
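
To make the word-context matrix concrete, here is a minimal sketch of how window-based co-occurrence counts could be built. It is an illustration only, not the SeNMFk or SeNMFk-SPLIT implementation: the function names `cooccurrence_counts` and `to_matrix`, the whitespace tokenization, the toy documents, and the choice of a symmetric window of 2 tokens are all assumptions made for this example.

```python
from collections import Counter


def cooccurrence_counts(documents, window=2):
    """Count how often two distinct words fall within `window` tokens of each other.

    Hypothetical helper: whitespace tokenization and lowercasing are assumptions.
    """
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for i, word in enumerate(tokens):
            # Look only ahead, so each pair of positions is counted once.
            for other in tokens[i + 1 : i + 1 + window]:
                if other != word:
                    # Sort the pair so (a, b) and (b, a) share one key.
                    counts[tuple(sorted((word, other)))] += 1
    return counts


def to_matrix(counts):
    """Turn pair counts into a symmetric words-by-words matrix (list of lists)."""
    vocab = sorted({word for pair in counts for word in pair})
    index = {word: k for k, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for (a, b), n in counts.items():
        matrix[index[a]][index[b]] = n
        matrix[index[b]][index[a]] = n
    return vocab, matrix


# Toy corpus, invented for illustration.
docs = [
    "topic models factorize matrices",
    "semantic topic models use context",
]
counts = cooccurrence_counts(docs, window=2)
vocab, matrix = to_matrix(counts)
```

In this sketch the entry for the pair ("models", "topic") counts one co-occurrence per document. A matrix of this kind, alongside the TF-IDF matrix, is what the abstract describes as the input to the factorization; how the two separate decompositions are combined in SeNMFk-SPLIT is described in the full paper and is not reproduced here.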

research
12/01/2021

Topic Analysis of Superconductivity Literature by Semantic Non-negative Matrix Factorization

We utilize a recently developed topic modeling method called SeNMFk, ext...
research
06/12/2017

Topic supervised non-negative matrix factorization

Topic models have been extensively used to organize and interpret the co...
research
02/23/2017

Stability of Topic Modeling via Matrix Factorization

Topic models can provide us with an insight into the underlying latent s...
research
02/24/2021

Deep NMF Topic Modeling

Nonnegative matrix factorization (NMF) based topic modeling methods do n...
research
04/28/2021

Analysis of Legal Documents via Non-negative Matrix Factorization Methods

The California Innocence Project (CIP), a clinical law school program ai...
research
02/08/2022

Police Text Analysis: Topic Modeling and Spatial Relative Density Estimation

We analyze a large corpus of police incident narrative documents in unde...
research
01/12/2022

Topic Modeling on Podcast Short-Text Metadata

Podcasts have emerged as a massively consumed online content, notably du...
