Pragmatic Constraint on Distributional Semantics

11/20/2022
by Elizaveta Zhemchuzhina, et al.
This paper studies the limits of language models' statistical learning in the context of Zipf's law. First, we demonstrate that a Zipf-law token distribution emerges irrespective of the chosen tokenization. Second, we show that the Zipf distribution is characterized by two distinct groups of tokens that differ both in their frequency and their semantics. Namely, tokens that have a one-to-one correspondence with a single semantic concept have different statistical properties from those with semantic ambiguity. Finally, we demonstrate how these properties interfere with statistical learning procedures motivated by distributional semantics.
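The abstract's first claim can be illustrated empirically: tokenize a corpus, count token frequencies, and inspect the rank-frequency relation. Under Zipf's law, frequency is roughly proportional to 1/rank. The sketch below is a minimal illustration with a toy whitespace-tokenized corpus, not the paper's method; a real check would use a large corpus and compare several tokenizers (e.g. word-level vs. subword).

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, frequency) pairs sorted by descending frequency."""
    counts = Counter(tokens)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

# Toy corpus for illustration only; the Zipfian shape becomes
# convincing only on corpora far larger than this.
text = ("the quick brown fox jumps over the lazy dog " * 50
        + "the cat sat on the mat " * 30
        + "a stitch in time saves nine " * 10)
tokens = text.split()

pairs = rank_frequency(tokens)
# If Zipf's law held exactly, rank * frequency would be constant;
# in practice it stays within the same order of magnitude.
for rank, freq in pairs[:5]:
    print(rank, freq, rank * freq)
```

Repeating this with different tokenizations (characters, subwords, words) shows the same heavy-tailed rank-frequency shape, which is the behavior the abstract refers to.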


Related research:

- Using Distributional Principles for the Semantic Study of Contextual Language Models (11/23/2021)
- Where New Words Are Born: Distributional Semantic Analysis of Neologisms and Their Semantic Neighborhoods (01/21/2020)
- Autocorrelations Decay in Texts and Applicability Limits of Language Models (05/11/2023)
- Types, Tokens, and Hapaxes: A New Heap's Law (12/31/2018)
- A Bayesian nonparametric approach to count-min sketch under power-law data streams (02/07/2021)
- Deriving Language Models from Masked Language Models (05/24/2023)
- Scaling Laws for Generative Mixed-Modal Language Models (01/10/2023)
