Intentional control of type I error over unconscious data distortion: a Neyman-Pearson classification approach

02/07/2018
by Lucy Xia, et al.

The rise of social media enables millions of citizens to generate information on sensitive political issues and social events, which is scarce in authoritarian countries and is tremendously valuable for surveillance and social studies. In efforts to use social media information, censorship stands as a formidable obstacle to informative description and accurate statistical inference. Likewise, in medical research, the disease type proportions in a sample might not represent the proportions in the general population. To address the information distortion caused by unconscious data distortion, such as unpredictable censorship and non-representative sampling, we propose a new distortion-invariant statistical approach to parsing data, based on the Neyman-Pearson (NP) classification paradigm. Under general conditions, we derive explicit formulas for the after-distortion oracle classifier, with explicit dependence on the distortion rates β_0 and β_1 on Class 0 and Class 1 respectively, and show that the NP oracle classifier is independent of the distortion scheme. We illustrate the method by combining the recently developed NP umbrella algorithm with topic modeling to automatically detect posts related to strikes and corruption in samples of randomly selected posts extracted from Sina Weibo, the Chinese equivalent of Twitter. In situations where type I errors are unacceptably large under the classical classification framework, our proposed approach allows type I errors to be controlled under a desired upper bound.
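The abstract names the NP umbrella algorithm but does not describe it. As a rough illustration of how a type I error bound can be enforced, the sketch below follows the order-statistic thresholding rule commonly associated with that algorithm: scores from held-out Class 0 observations are sorted, and the threshold is the smallest order statistic whose violation probability bound is at most a tolerance. The function names, the tolerance parameter `delta`, and the default values are assumptions made for this sketch and do not come from the abstract.

```python
from math import comb


def np_order_index(n, alpha, delta):
    """Return the smallest 1-based rank k whose violation bound is <= delta.

    The bound used here is sum_{j=k}^{n} C(n, j) (1 - alpha)^j alpha^(n - j),
    the probability that thresholding at the k-th smallest of n held-out
    Class 0 scores gives a type I error above alpha. Returns None when even
    the largest score cannot meet the tolerance (n is too small).
    """
    for k in range(1, n + 1):
        violation = sum(
            comb(n, j) * (1 - alpha) ** j * alpha ** (n - j)
            for j in range(k, n + 1)
        )
        if violation <= delta:
            return k
    return None


def np_threshold(class0_scores, alpha=0.05, delta=0.05):
    """Pick a score threshold from held-out Class 0 scores (assumed setup)."""
    scores = sorted(class0_scores)
    k = np_order_index(len(scores), alpha, delta)
    if k is None:
        raise ValueError("too few Class 0 scores for this alpha and delta")
    return scores[k - 1]


def np_classify(score, threshold):
    """Predict Class 1 only when the score strictly exceeds the threshold."""
    return 1 if score > threshold else 0
```

Because the threshold in this sketch is computed from Class 0 scores alone, changing how many Class 0 or Class 1 observations survive censorship or sampling does not alter the target type I error level, which is in the spirit of the distortion-invariance claim above.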
