Detecting covariate drift in text data using document embeddings and dimensionality reduction

09/17/2023
by Vinayak Sodar, et al.

Detecting covariate drift in text data is essential for maintaining the reliability and performance of text analysis models. In this research, we investigate the effectiveness of different document embeddings, dimensionality reduction techniques, and drift detection methods for identifying covariate drift in text data. We explore three popular document embeddings: term frequency-inverse document frequency (TF-IDF) with latent semantic analysis (LSA) for dimensionality reduction, and Doc2Vec and BERT embeddings, with and without principal component analysis (PCA) for dimensionality reduction. To quantify the divergence between training and test data distributions, we employ the Kolmogorov-Smirnov (KS) statistic and the Maximum Mean Discrepancy (MMD) test as drift detection methods. Experimental results demonstrate that certain combinations of embeddings, dimensionality reduction techniques, and drift detection methods outperform others in detecting covariate drift. Our findings contribute to the advancement of reliable text analysis models by providing insights into effective approaches for addressing covariate drift in text data.
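As a rough illustration of the pipeline the abstract describes (embed, reduce, then test), the following minimal sketch runs a per-dimension KS test and an MMD permutation test on PCA-reduced vectors. It is not the authors' implementation. Several pieces are assumptions made only for this example: synthetic Gaussian vectors stand in for real TF-IDF, Doc2Vec, or BERT embeddings; the per-dimension KS p-values are combined with a Bonferroni correction; the MMD uses an RBF kernel with a median-heuristic bandwidth; and all function names (`pca_project`, `ks_drift`, `mmd_permutation_test`) are hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp


def pca_project(train, test, n_components):
    """Project both sets onto the top principal components of the training set."""
    mean = train.mean(axis=0)
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    components = vt[:n_components].T
    return (train - mean) @ components, (test - mean) @ components


def ks_drift(train, test, alpha=0.05):
    """Two-sample KS test per dimension; flag drift if any p-value
    falls below the Bonferroni-corrected level (an assumed aggregation rule)."""
    n_dims = train.shape[1]
    pvals = np.array(
        [ks_2samp(train[:, j], test[:, j]).pvalue for j in range(n_dims)]
    )
    return bool(pvals.min() < alpha / n_dims), pvals


def mmd_permutation_test(x, y, n_perm=200, seed=0):
    """Biased squared-MMD estimate with an RBF kernel, plus a permutation p-value."""
    rng = np.random.default_rng(seed)
    z = np.vstack([x, y])
    sq = np.sum(z ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * z @ z.T, 0.0)
    gamma = 1.0 / np.median(d2[d2 > 0])  # median heuristic (assumption)
    k = np.exp(-gamma * d2)
    n = len(x)

    def mmd2(idx):
        a, b = idx[:n], idx[n:]
        return (
            k[np.ix_(a, a)].mean()
            + k[np.ix_(b, b)].mean()
            - 2.0 * k[np.ix_(a, b)].mean()
        )

    observed = mmd2(np.arange(len(z)))
    null = np.array([mmd2(rng.permutation(len(z))) for _ in range(n_perm)])
    pvalue = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return float(observed), float(pvalue)


# Synthetic stand-ins for document embeddings (not real text data).
rng = np.random.default_rng(42)
scales = np.ones(16)
scales[:3] = [5.0, 3.0, 2.0]  # a few dominant directions for PCA to pick up
train_emb = rng.normal(size=(300, 16)) * scales
test_emb = rng.normal(size=(300, 16)) * scales
test_emb[:, 0] += 5.0  # simulated covariate shift along the dominant direction

train_red, test_red = pca_project(train_emb, test_emb, n_components=3)
ks_flag, ks_pvals = ks_drift(train_red, test_red)
mmd_stat, mmd_p = mmd_permutation_test(train_red, test_red)
print("KS drift detected:", ks_flag)
print("MMD^2 estimate:", mmd_stat, "permutation p-value:", mmd_p)
```

With real data, `train_emb` and `test_emb` would be replaced by the embedding matrices of the training and test corpora. The KS test looks at each reduced dimension separately, while the MMD test compares the joint distributions, which is one reason the two detectors can disagree.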


research
09/21/2019

Application of Fuzzy Clustering for Text Data Dimensionality Reduction

Large textual corpora are often represented by the document-term frequen...
research
07/17/2023

Large-Scale Evaluation of Topic Models and Dimensionality Reduction Methods for 2D Text Spatialization

Topic models are a class of unsupervised learning algorithms for detecti...
research
04/22/2021

On Geodesic Distances and Contextual Embedding Compression for Text Classification

In some memory-constrained settings like IoT devices and over-the-networ...
research
06/20/2023

Unexplainable Explanations: Towards Interpreting tSNE and UMAP Embeddings

It has become standard to explain neural network latent spaces with attr...
research
08/09/2023

Gaussian Image Anomaly Detection with Greedy Eigencomponent Selection

Anomaly detection (AD) in images, identifying significant deviations fro...
research
10/04/2021

DenDrift: A Drift-Aware Algorithm for Host Profiling

Detecting and reacting to unauthorized actions is an essential task in s...
