Using Large Language Model Annotations for Valid Downstream Statistical Inference in Social Science: Design-Based Semi-Supervised Learning

06/07/2023
by Naoki Egami, et al.

In computational social science (CSS), researchers analyze documents to explain social and political phenomena. In most scenarios, CSS researchers first obtain labels for documents and then, in a second step, explain those labels using interpretable regression analyses. Recent advances in large language models (LLMs) can lower costs for CSS research by annotating documents cheaply at scale, but such surrogate labels are often imperfect and biased. We present a new algorithm for using outputs from LLMs in downstream statistical analyses while guaranteeing statistical properties, such as asymptotic unbiasedness and proper uncertainty quantification, that are fundamental to CSS research. We show that direct use of LLM-predicted surrogate labels in downstream statistical analyses leads to substantial bias and invalid confidence intervals, even when surrogate accuracy is as high as 80–90%. To address this, we build on debiased machine learning to propose the design-based semi-supervised learning (DSL) estimator. DSL employs a doubly-robust procedure to combine surrogate labels with a smaller number of gold-standard labels. Because the researcher controls the probability of sampling documents for gold-standard labeling, our approach guarantees valid inference for downstream statistical analyses without requiring stringent assumptions, even when surrogates are arbitrarily biased. Both our theoretical analysis and experimental results show that DSL provides valid statistical inference while achieving root mean squared errors comparable to existing alternatives that focus only on prediction without statistical guarantees.
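The idea of correcting biased surrogates with a small, randomly sampled set of gold-standard labels can be illustrated with a minimal sketch. The abstract does not state the estimator's formula, so the pseudo-outcome used here (surrogate minus an inverse-probability-weighted correction on the sampled documents) is an assumption based on the standard doubly-robust construction, and the example estimates a simple label mean rather than regression coefficients. The function names, the simulated data, and the error rates are all hypothetical.

```python
import random


def dsl_pseudo_outcomes(surrogate, gold, sampled, prob):
    """Assumed design-adjusted pseudo-outcome: q - (R / pi) * (q - y).

    surrogate: surrogate label q for every document
    gold:      gold-standard label y (only read where sampled is True)
    sampled:   R, whether the document was sampled for gold labeling
    prob:      pi, the known probability of sampling each document
    """
    out = []
    for q, y, r, p in zip(surrogate, gold, sampled, prob):
        correction = (q - y) / p if r else 0.0
        out.append(q - correction)
    return out


def simulate(n=20000, pi=0.1, seed=0):
    """Hypothetical simulation: a surrogate that misses 30% of positives."""
    rng = random.Random(seed)
    truth = [1 if rng.random() < 0.3 else 0 for _ in range(n)]
    surrogate = []
    for y in truth:
        if y == 1:
            surrogate.append(1 if rng.random() < 0.7 else 0)   # 30% false negatives
        else:
            surrogate.append(1 if rng.random() < 0.05 else 0)  # 5% false positives
    # The researcher controls the sampling probability pi for gold labeling.
    sampled = [rng.random() < pi for _ in range(n)]
    gold = [y if r else None for y, r in zip(truth, sampled)]

    pseudo = dsl_pseudo_outcomes(surrogate, gold, sampled, [pi] * n)
    true_mean = sum(truth) / n
    naive_mean = sum(surrogate) / n   # uses surrogate labels directly
    dsl_mean = sum(pseudo) / n        # bias-corrected estimate
    return true_mean, naive_mean, dsl_mean


if __name__ == "__main__":
    true_mean, naive_mean, dsl_mean = simulate()
    print(f"true={true_mean:.3f} naive={naive_mean:.3f} corrected={dsl_mean:.3f}")
```

In this sketch the correction term has mean zero only because the sampling probability is known by design, which is why the corrected estimate does not depend on how accurate the surrogate is; a more accurate surrogate mainly reduces variance.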

research
01/26/2022

Control Variate Polynomial Chaos: Optimal Fusion of Sampling and Surrogates for Multifidelity Uncertainty Quantification

We present a hybrid sampling-surrogate approach for reducing the computa...
research
06/16/2020

Efficient nonparametric statistical inference on population feature importance using Shapley values

The true population-level importance of a variable in a prediction task ...
research
11/01/2021

Combating Noise: Semi-supervised Learning by Region Uncertainty Quantification

Semi-supervised learning aims to leverage a large amount of unlabeled da...
research
08/11/2022

Surrogate-based global sensitivity analysis with statistical guarantees via floodgate

Computational models are utilized in many scientific domains to simulate...
research
07/01/2018

Robust Inference Under Heteroskedasticity via the Hadamard Estimator

Drawing statistical inferences from large datasets in a model-robust way...
research
07/04/2022

Statistical inference of random graphs with a surrogate likelihood function

Spectral estimators have been broadly applied to statistical network ana...