Likelihood estimation of sparse topic distributions in topic models and its applications to Wasserstein document distance calculations

07/12/2021
by   Xin Bing, et al.
5

This paper studies the estimation of high-dimensional, discrete, possibly sparse, mixture models in topic models. The data consists of observed multinomial counts of p words across n independent documents. In topic models, the p× n expected word frequency matrix is assumed to be factorized as a p× K word-topic matrix A and a K× n topic-document matrix T. Since columns of both matrices represent conditional probabilities belonging to probability simplices, columns of A are viewed as p-dimensional mixture components that are common to all documents while columns of T are viewed as the K-dimensional mixture weights that are document specific and are allowed to be sparse. The main interest is to provide sharp, finite sample, ℓ_1-norm convergence rates for estimators of the mixture weights T when A is either known or unknown. For known A, we suggest MLE estimation of T. Our non-standard analysis of the MLE not only establishes its ℓ_1 convergence rate, but reveals a remarkable property: the MLE, with no extra regularization, can be exactly sparse and contain the true zero pattern of T. We further show that the MLE is both minimax optimal and adaptive to the unknown sparsity in a large class of sparse topic distributions. When A is unknown, we estimate T by optimizing the likelihood function corresponding to a plug in, generic, estimator  of A. For any estimator  that satisfies carefully detailed conditions for proximity to A, the resulting estimator of T is shown to retain the properties established for the MLE. The ambient dimensions K and p are allowed to grow with the sample sizes. Our application is to the estimation of 1-Wasserstein distances between document generating distributions. We propose, estimate and analyze new 1-Wasserstein distances between two probabilistic document representations.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
01/22/2020

Optimal estimation of sparse topic models

Topic models have become popular tools for dimension reduction and explo...
research
02/17/2022

Refined Convergence Rates for Maximum Likelihood Estimation under Finite Mixture Models

We revisit convergence rates for maximum likelihood estimation (MLE) und...
research
11/04/2016

Generalized Topic Modeling

Recently there has been significant activity in developing algorithms wi...
research
06/26/2022

The Sketched Wasserstein Distance for mixture distributions

The Sketched Wasserstein Distance (W^S) is a new probability distance sp...
research
05/17/2018

A fast algorithm with minimax optimal guarantees for topic models with an unknown number of topics

We propose a new method of estimation in topic models, that is not a var...
research
04/20/2023

Minimum Φ-distance estimators for finite mixing measures

Finite mixture models have long been used across a variety of fields in ...
research
11/24/2019

Optimal Permutation Recovery in Permuted Monotone Matrix Model

Motivated by recent research on quantifying bacterial growth dynamics ba...

Please sign up or login with your details

Forgot password? Click here to reset