On the Robustness of Text Vectorizers
A fundamental issue in natural language processing is the robustness of the models with respect to changes in the input. One critical step in this process is the embedding of documents, which transforms sequences of words or tokens into vector representations. Our work formally proves that popular embedding schemes, such as concatenation, TF-IDF, and Paragraph Vector (a.k.a. doc2vec), exhibit robustness in the Hölder or Lipschitz sense with respect to the Hamming distance. We provide quantitative bounds for these schemes and demonstrate how the constants involved are affected by the length of the document. These findings are exemplified through a series of numerical examples.
READ FULL TEXT