Uncertainty Estimation For Community Standards Violation In Online Social Networks

09/30/2020
by Narjes Torabi, et al.

Online Social Networks (OSNs) provide a platform for users to share their thoughts and opinions with their community of friends or with the general public. To keep the platform safe for all users, and compliant with local laws, OSNs typically create a set of community standards organized into policy groups, and use Machine Learning (ML) models to identify and remove content that violates any of the policies. However, of the billions of content items uploaded daily, only a small fraction is so unambiguously violating that the automated models can remove it. Prevalence estimation is the task of estimating the fraction of violating content among the residual items by sending a small sample of them to human labelers for ground-truth labels. This task is exceedingly hard: although ML scores or features are available for all of the billions of items, practical considerations limit ground-truth labels to a few thousand of them. Indeed, the prevalence can be so low that even after a judicious choice of items to label, many days may pass without a single item being labeled violating. A pragmatic choice in such low-prevalence regimes (10^-4 to 10^-5) is to report the upper-bound prevalence (UBP), i.e., the 97.5% confidence upper bound, which takes the uncertainties of the sampling and labeling processes into account and gives a smoothed estimate. In this work we present two novel techniques for the UBP task, Bucketed-Beta-Binomial and Bucketed-Gaussian Process, and demonstrate on real and simulated data that they achieve much better coverage than the commonly used bootstrapping technique.
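To make the UBP idea concrete, here is a minimal single-bucket Beta-Binomial sketch; the paper's Bucketed-Beta-Binomial and Bucketed-Gaussian Process methods are not reproduced here, and the uniform Beta(1, 1) prior is an assumption for illustration. With that prior, the posterior over prevalence p after observing k violating items in a sample of n is Beta(1 + k, 1 + n - k), and the UBP is its 97.5% quantile, computed below with only the standard library.

```python
import math

def beta_cdf(x: float, a: float, b: float, steps: int = 8000) -> float:
    """CDF of Beta(a, b) via midpoint-rule integration of its pdf (assumes a, b >= 1)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = x / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h  # midpoint, strictly inside (0, 1)
        la = (a - 1.0) * math.log(t) if a != 1.0 else 0.0
        lb = (b - 1.0) * math.log(1.0 - t) if b != 1.0 else 0.0
        total += math.exp(log_norm + la + lb)
    return min(total * h, 1.0)

def beta_ppf(q: float, a: float, b: float) -> float:
    """Quantile of Beta(a, b) by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if beta_cdf(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def upper_bound_prevalence(k: int, n: int, q: float = 0.975) -> float:
    """97.5% upper bound on prevalence, given k violating labels out of n sampled."""
    return beta_ppf(q, 1.0 + k, 1.0 + n - k)

# With zero violating labels in the sample, a bootstrap percentile interval
# collapses to [0, 0]; the Beta-Binomial upper bound stays sensibly positive.
print(upper_bound_prevalence(0, 2000))  # roughly 1.8e-3
```

This illustrates why the abstract favors posterior-based bounds over bootstrapping in low-prevalence regimes: the Beta posterior yields a nonzero upper bound even on days with no violating labels, whereas resampling an all-zero sample cannot.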

Related research

Jury Learning: Integrating Dissenting Voices into Machine Learning Models (02/07/2022)
Whose labels should a machine learning (ML) algorithm learn to emulate? ...

Ground-Truth, Whose Truth? – Examining the Challenges with Annotating Toxic Text Datasets (12/07/2021)
The use of machine learning (ML)-based language models (LMs) to monitor ...

Weakly supervised collective feature learning from curated media (02/13/2018)
The current state-of-the-art in feature learning relies on the supervise...

A Truthful Owner-Assisted Scoring Mechanism (06/14/2022)
Alice (owner) has knowledge of the underlying quality of her items measu...

Don't Throw it Away! The Utility of Unlabeled Data in Fair Decision Making (05/10/2022)
Decision making algorithms, in practice, are often trained on data that ...

A Human-ML Collaboration Framework for Improving Video Content Reviews (10/18/2022)
We deal with the problem of localized in-video taxonomic human annotatio...

Learning to estimate label uncertainty for automatic radiology report parsing (10/01/2019)
Bootstrapping labels from radiology reports has become the scalable alte...
