Local and Global Topics in Text Modeling of Web Pages Nested in Web Sites

03/30/2021
by Jason Wang, et al.

Topic models are popular models for analyzing a collection of text documents. The models assert that documents are distributions over latent topics and that latent topics are distributions over words. In a nested document collection, documents are nested inside a higher-order structure, such as stories in a book, articles in a journal, or web pages in a web site. In a single collection of documents, topics are global, that is, shared across all documents. For web pages nested in web sites, topic frequencies likely vary between web sites, and within a web site, topic frequencies almost certainly vary between web pages. A hierarchical prior for topic frequencies models this nested structure and specifies a global topic distribution: web site topic distributions vary around the global topic distribution, and web page topic distributions vary around their web site's topic distribution. In a nested collection of web pages, some topics are likely unique to a single web site; these are called local topics. For US local health department web sites, brief inspection of the text shows local geographic and news topics specific to each department that are not present in others. Topic models that ignore the nesting may identify local topics, but they neither label those topics as local nor explicitly identify the web site that owns each local topic. For web pages nested inside web sites, local topic models explicitly label local topics and identify the owning web site. This identification can be used to adjust inferences about global topics. In the US public health web site data, topic coverage is defined at the web site level after removing local topic words from pages. Hierarchical local topic models can be used to identify local topics, adjust inferences about whether web sites cover particular health topics, and study how well health topics are covered.
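
The three-level structure described above (global, web site, web page) can be illustrated with a small simulation. This is a minimal sketch and not the authors' model: the abstract does not state the form of the prior, so the Dirichlet distributions, the function name `sample_nested_topic_distributions`, and the concentration parameters `site_conc` and `page_conc` are all assumptions chosen for illustration. Local topics are left out to keep the example short.

```python
import numpy as np

def sample_nested_topic_distributions(n_topics, pages_per_site,
                                      site_conc=50.0, page_conc=10.0, seed=0):
    """Draw global, per-site, and per-page topic distributions.

    Assumption: each level is Dirichlet-distributed around the level above.
    Larger concentration values keep a child distribution closer to its parent.
    """
    rng = np.random.default_rng(seed)
    eps = 1e-3  # keeps every Dirichlet parameter strictly positive

    # Global topic distribution shared by the whole collection.
    global_dist = rng.dirichlet(np.ones(n_topics))

    site_dists = []
    page_dists = []
    for n_pages in pages_per_site:
        # Each web site varies around the global distribution.
        site = rng.dirichlet(site_conc * global_dist + eps)
        site_dists.append(site)
        # Each web page varies around its own web site's distribution.
        page_dists.append(rng.dirichlet(page_conc * site + eps, size=n_pages))

    return global_dist, np.array(site_dists), page_dists

# Two hypothetical web sites with 3 and 2 pages, and 5 global topics.
global_dist, site_dists, page_dists = sample_nested_topic_distributions(5, [3, 2])
print(global_dist.round(3))
print(site_dists.round(3))
print(page_dists[0].round(3))
```

Under these assumptions, lowering `site_conc` makes web sites differ more from one another, while lowering `page_conc` makes pages within a site differ more from their site's distribution.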


