Empath: Understanding Topic Signals in Large-Scale Text

02/22/2016
by Ethan Fast, et al.

Human language is colored by a broad range of topics, but existing text analysis tools only focus on a small number of them. We present Empath, a tool that can generate and validate new lexical categories on demand from a small set of seed terms (like "bleed" and "punch" to generate the category violence). Empath draws connotations between words and phrases by deep learning a neural embedding across more than 1.8 billion words of modern fiction. Given a small set of seed words that characterize a category, Empath uses its neural embedding to discover new related terms, then validates the category with a crowd-powered filter. Empath also analyzes text across 200 built-in, pre-validated categories we have generated from common topics in our web dataset, like neglect, government, and social media. We show that Empath's data-driven, human-validated categories are highly correlated (r=0.906) with similar categories in LIWC.
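The seed-expansion step described in the abstract can be illustrated with a minimal sketch. This is not Empath's implementation: the three-dimensional vectors in `TOY_EMBEDDING`, the helper names `cosine` and `expand_category`, and the `top_n` parameter are all assumptions made for illustration, and the crowd-powered validation step is not modeled. The sketch only shows the general idea of ranking vocabulary words by cosine similarity to the average of the seed vectors.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only. A real
# embedding would be learned from a large corpus and have many more
# dimensions.
TOY_EMBEDDING = {
    "bleed":  [0.90, 0.10, 0.00],
    "punch":  [0.80, 0.20, 0.10],
    "stab":   [0.85, 0.15, 0.05],
    "wound":  [0.90, 0.05, 0.10],
    "senate": [0.00, 0.90, 0.20],
    "vote":   [0.10, 0.95, 0.10],
    "tweet":  [0.10, 0.20, 0.90],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def expand_category(seeds, embedding, top_n=2):
    """Return the top_n non-seed words closest to the seed centroid.

    Hypothetical helper: averages the seed vectors, then ranks every
    other word in the vocabulary by cosine similarity to that average.
    """
    dims = len(next(iter(embedding.values())))
    centroid = [
        sum(embedding[s][i] for s in seeds) / len(seeds) for i in range(dims)
    ]
    scored = [
        (word, cosine(centroid, vec))
        for word, vec in embedding.items()
        if word not in seeds
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:top_n]]


# Seed terms taken from the abstract's "violence" example.
violence = expand_category(["bleed", "punch"], TOY_EMBEDDING)
print(violence)
```

In this toy vocabulary, the candidate terms nearest the seeds are the other violence-related words, while the government and social media words score much lower. In the system the abstract describes, candidates like these would then pass through a crowd-powered filter before being accepted into the category.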

research
06/26/2019

Assessing Post Deletion in Sina Weibo: Multi-modal Classification of Hot Topics

Widespread Chinese social media applications such as Weibo are widely kn...
research
11/20/2021

Weakly Supervised Prototype Topic Model with Discriminative Seed Words: Modifying the Category Prior by Self-exploring Supervised Signals

Dataless text classification, i.e., a new paradigm of weakly supervised ...
research
02/01/2023

You Are What You Talk About: Inducing Evaluative Topics for Personality Analysis

Expressing attitude or stance toward entities and concepts is an integra...
research
12/12/2022

Effective Seed-Guided Topic Discovery by Integrating Multiple Types of Contexts

Instead of mining coherent topics from a given text corpus in a complete...
research
02/23/2021

Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks

Text categorization is an essential task in Web content analysis. Consid...
research
04/12/2023

Filler Word Detection with Hard Category Mining and Inter-Category Focal Loss

Filler words like "um" or "uh" are common in spontaneous speech. It is d...
research
12/19/2016

The iCrawl Wizard -- Supporting Interactive Focused Crawl Specification

Collections of Web documents about specific topics are needed for many a...
