On the Robustness of Text Vectorizers

03/09/2023
by Rémi Catellier, et al.

A fundamental issue in natural language processing is the robustness of models with respect to changes in their input. A critical step in most NLP pipelines is the embedding of documents, which transforms sequences of words or tokens into vector representations. Our work formally proves that popular embedding schemes, such as concatenation, TF-IDF, and Paragraph Vector (a.k.a. doc2vec), are robust in the Hölder or Lipschitz sense with respect to the Hamming distance. We provide quantitative bounds for these schemes and show how the constants involved depend on the length of the document. These findings are illustrated through a series of numerical examples.
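As a rough intuition for what such a guarantee means: if Φ denotes the embedding map and d_H the Hamming distance between two word sequences x and y of the same length, a bound of the form ‖Φ(x) − Φ(y)‖ ≤ C · d_H(x, y)^α controls how far the embedding can move under token substitutions, with α = 1 corresponding to the Lipschitz case. The snippet below is a minimal illustrative sketch, not code from the paper: it uses scikit-learn's TfidfVectorizer on an invented toy corpus to compare the embedding shift caused by a single-token substitution against the corresponding Hamming distance.

```python
# Illustrative sketch (not the paper's code): measure how far a TF-IDF
# embedding moves when a document is perturbed at a single token position,
# i.e. at Hamming distance 1 from the original.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy reference corpus used only to fit the IDF weights.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a bird flew over the mat",
]
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

original = "the cat sat on the mat"
# Perturbation at Hamming distance 1: one token substituted.
perturbed = "the cat sat on the log"

emb_orig, emb_pert = vectorizer.transform([original, perturbed]).toarray()

hamming = sum(a != b for a, b in zip(original.split(), perturbed.split()))
shift = np.linalg.norm(emb_orig - emb_pert)

# A Lipschitz-type bound would control `shift` by C * hamming, with C
# depending on the document length; here we simply report both quantities.
print(f"Hamming distance: {hamming}, embedding shift: {shift:.3f}")
```

Repeating the same comparison on longer documents would be one way to probe the length dependence of the constant: with L2-normalized TF-IDF one would expect a single substituted token to move the embedding less as the document grows.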


Related research:

- 05/04/2023: Sentence Embedding Leaks More Information than You Expect: Generative Embedding Inversion Attack to Recover the Whole Sentence. "Sentence-level representations are beneficial for various natural langua..."
- 05/23/2023: On Robustness of Finetuned Transformer-based NLP Models. "Transformer-based pretrained models like BERT, GPT-2 and T5 have been fi..."
- 04/17/2021: Robust Embeddings Via Distributions. "Despite recent monumental advances in the field, many Natural Language P..."
- 11/22/2016: Learning to Distill: The Essence Vector Modeling Framework. "In the context of natural language processing, representation learning h..."
- 06/28/2016: Hierarchical Neural Language Models for Joint Representation of Streaming Documents and their Content. "We consider the problem of learning distributed representations for docu..."
- 02/18/2022: Evaluating the Construct Validity of Text Embeddings with Application to Survey Questions. "Text embedding models from Natural Language Processing can map text data..."
