Combining Lexical and Syntactic Features for Detecting Content-dense Texts in News

by Yinfei Yang, et al.

Content-dense news reports important factual information about an event in a direct, succinct manner. Information-seeking applications such as information extraction, question answering and summarization normally assume that all the text they deal with is content-dense. Here we empirically test this assumption on news articles from the business, U.S. international relations, sports and science journalism domains. Our findings clearly indicate that about half of the news texts in our study are in fact not content-dense and motivate the development of a supervised content-density detector. We heuristically label a large training corpus for the task and train a two-layer classifying model based on lexical and unlexicalized syntactic features. On manually annotated data, we compare the performance of domain-specific classifiers, trained on data only from a given news domain, and a general classifier in which data from all four domains is pooled together. Our annotation and prediction experiments demonstrate that the concept of content density varies depending on the domain and that naive annotators provide judgements biased toward the stereotypical domain label. Domain-specific classifiers are more accurate for domains in which content-dense texts are typically fewer. Domain-independent classifiers better reproduce naive crowdsourced judgements. Classification prediction is high across all conditions, around 80
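
The abstract does not spell out the feature set, so the following is only a minimal sketch of what combining lexical and unlexicalized syntactic features could look like. It assumes each sentence is already available as a Penn Treebank style bracketed parse, takes lowercased word counts as the lexical features, and takes counts of grammar productions over nonterminals only (no words) as the unlexicalized syntactic features. The function names and the `word:` / `rule:` prefixes are hypothetical, chosen for illustration and not taken from the paper.

```python
from collections import Counter


def parse_tree(bracketed):
    """Parse '(S (NP (NNP Acme)) (VP (VBD rose)))' into (label, children) tuples.

    Leaves (words) are kept as plain strings inside the children list.
    """
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip the opening "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip the closing ")"
        return (label, children)

    return read()


def lexical_features(tree, feats=None):
    """Count lowercased words at the leaves (assumed lexical features)."""
    feats = Counter() if feats is None else feats
    _, children = tree
    for child in children:
        if isinstance(child, tuple):
            lexical_features(child, feats)
        else:
            feats["word:" + child.lower()] += 1
    return feats


def syntactic_features(tree, feats=None):
    """Count productions over nonterminals only, e.g. 'S -> NP VP'.

    Productions that rewrite a tag to a word are skipped, which is what
    makes these features unlexicalized in this sketch.
    """
    feats = Counter() if feats is None else feats
    label, children = tree
    kids = [c for c in children if isinstance(c, tuple)]
    if kids:
        feats["rule:" + label + " -> " + " ".join(k[0] for k in kids)] += 1
        for kid in kids:
            syntactic_features(kid, feats)
    return feats


def combined_features(bracketed):
    """Merge both feature families into one sparse count vector."""
    tree = parse_tree(bracketed)
    feats = lexical_features(tree)
    feats.update(syntactic_features(tree))
    return feats
```

A vector produced by `combined_features` could then be fed to any standard classifier, either one per news domain or a single one trained on the pooled data, which mirrors the comparison described above.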


