Evaluating the Construct Validity of Text Embeddings with Application to Survey Questions

02/18/2022
by Qixiang Fang, et al.

Text embedding models from Natural Language Processing can map text data (e.g. words, sentences, documents) to supposedly meaningful numerical representations (a.k.a. text embeddings). While such models are increasingly applied in social science research, one important issue is often not addressed: the extent to which these embeddings are valid representations of constructs relevant for social science research. We therefore propose the use of the classic construct validity framework to evaluate the validity of text embeddings. We show how this framework can be adapted to the opaque and high-dimensional nature of text embeddings, with application to survey questions. We include several popular text embedding methods (e.g. fastText, GloVe, BERT, Sentence-BERT, Universal Sentence Encoder) in our construct validity analyses. We find evidence of convergent and discriminant validity in some cases. We also show that embeddings can be used to predict respondents' answers to completely new survey questions. Furthermore, BERT-based embedding techniques and the Universal Sentence Encoder provide more valid representations of survey questions than the other methods do. Our results thus highlight the need to examine the construct validity of text embeddings before deploying them in social science research.
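To make the idea of convergent and discriminant validity concrete, here is a minimal sketch of one possible check. It is not the authors' procedure. It assumes that survey questions measuring the same construct should have more similar embeddings than questions measuring different constructs, and it uses cosine similarity as the similarity measure. The question names, construct labels, and three-dimensional vectors are hypothetical toy values; in practice the vectors would come from an embedding model.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy data: question id -> (embedding, construct label).
# Real embeddings would have hundreds of dimensions.
embeddings = {
    "trust_q1": ([0.9, 0.1, 0.0], "trust"),
    "trust_q2": ([0.8, 0.2, 0.1], "trust"),
    "health_q1": ([0.1, 0.9, 0.2], "health"),
    "health_q2": ([0.0, 0.8, 0.3], "health"),
}


def mean_similarities(items):
    """Return (mean within-construct, mean between-construct) similarity."""
    within, between = [], []
    for (_, (vec_a, con_a)), (_, (vec_b, con_b)) in combinations(items.items(), 2):
        sim = cosine(vec_a, vec_b)
        # Same construct label -> convergent pair; otherwise discriminant pair.
        (within if con_a == con_b else between).append(sim)
    return sum(within) / len(within), sum(between) / len(between)


within_mean, between_mean = mean_similarities(embeddings)
print(f"within-construct mean similarity:  {within_mean:.3f}")
print(f"between-construct mean similarity: {between_mean:.3f}")
```

Under these assumptions, a within-construct mean that is clearly higher than the between-construct mean would be consistent with convergent and discriminant validity. It would not establish either on its own.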

research
10/02/2021

Clustering and Network Analysis for the Embedding Spaces of Sentences and Sub-Sentences

Sentence embedding methods offer a powerful approach for working with sh...
research
08/17/2022

Transformer Encoder for Social Science

High-quality text data has become an important data source for social sc...
research
03/04/2019

SECNLP: A Survey of Embeddings in Clinical Natural Language Processing

Traditional representations like Bag of words are high dimensional, spar...
research
07/16/2020

Towards Debiasing Sentence Representations

As natural language processing methods are increasingly deployed in real...
research
06/03/2022

Extracting Similar Questions From Naturally-occurring Business Conversations

Pre-trained contextualized embedding models such as BERT are a standard ...
research
06/04/2018

Neural Network-based exploration of construct validity for Russian version of the 10-item Big Five Inventory

This study aims to present a new method of exploring construct validity ...
research
03/09/2023

On the Robustness of Text Vectorizers

A fundamental issue in natural language processing is the robustness of ...
