Text vectorization via transformer-based language models and n-gram perplexities

07/18/2023
by Mihailo Škorić, et al.

As the probability (and hence perplexity) of a text is computed as the product of the probabilities of its individual tokens, a single unlikely token can sharply reduce the probability (i.e., increase the perplexity) of an otherwise highly probable input, while potentially representing nothing more than a simple typographical error. Moreover, since perplexity is a scalar value that describes the entire input, information about the probability distribution within the input is lost in the calculation: a relatively good text containing one unlikely token and another text in which every token is equally likely can have the same perplexity value, especially for longer texts. As an alternative to scalar perplexity, this research proposes a simple algorithm for calculating vector representations based on n-gram perplexities within the input. Such representations take the aforementioned aspects into account: instead of a single value, the relative perplexity of each text token is calculated, and these values are combined into a single vector representing the input.
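The abstract does not spell out the exact procedure, but the idea can be sketched roughly as follows: score every token with a causal language model, compute the perplexity of each n-token window inside the input, and pool the windowed values into a fixed-length vector. The sketch below is an illustration under assumptions, not the authors' implementation: GPT-2 (via Hugging Face transformers) stands in for the scoring model, and the window size n, the bucket-averaging pooling step, and the output dimension dim are hypothetical choices not taken from the paper.

```python
# Illustrative sketch (not the paper's exact algorithm): turn per-token
# n-gram perplexities from a causal LM into a fixed-length vector.
# Assumed/hypothetical choices: GPT-2 as the scoring model, a sliding
# n-token window, and bucket-averaging into `dim` components.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def token_log_probs(text: str) -> torch.Tensor:
    """Log probability of each token given its left context."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Shift so that position i predicts token i + 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    return log_probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)

def ngram_perplexities(text: str, n: int = 3) -> torch.Tensor:
    """Perplexity of every n-token window inside the input."""
    lp = token_log_probs(text)
    windows = lp.unfold(0, min(n, len(lp)), 1)   # all sliding windows
    return torch.exp(-windows.mean(dim=1))       # PPL = exp(-mean log p)

def perplexity_vector(text: str, n: int = 3, dim: int = 16) -> torch.Tensor:
    """Pool the windowed perplexities into a fixed-length vector by
    averaging them over `dim` roughly equal buckets."""
    ppl = ngram_perplexities(text, n)
    buckets = torch.tensor_split(ppl, dim)
    return torch.stack(
        [b.mean() if len(b) else torch.tensor(0.0) for b in buckets]
    )

print(perplexity_vector("A relatively good text with one unlikey tokken."))
```

With this kind of representation, a single typo shows up as a spike in a few components of the vector rather than inflating one global perplexity score, which is the distinction the abstract draws between scalar and vector perplexity.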

