Signal in Noise: Exploring Meaning Encoded in Random Character Sequences with Character-Aware Language Models

03/15/2022
by Mark Chu, et al.

Natural language processing models learn word representations based on the distributional hypothesis, which asserts that word context (e.g., co-occurrence) correlates with meaning. We propose that n-grams composed of random character sequences, or garble, provide a novel context for studying word meaning both within and beyond extant language. In particular, randomly generated character n-grams lack meaning but carry primitive information based on the distribution of characters they contain. By studying the embeddings of a large corpus of garble, extant language, and pseudowords using CharacterBERT, we identify an axis in the model's high-dimensional embedding space that separates these classes of n-grams. Furthermore, we show that this axis relates to structure within extant language, including word part-of-speech, morphology, and concept concreteness. Thus, in contrast to studies that are mainly limited to extant language, our work reveals that meaning and primitive information are intrinsically linked.
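As a rough, self-contained sketch of the kind of experiment the abstract describes (not the authors' released code), the snippet below generates random character n-grams ("garble"), embeds them alongside extant words, and fits a linear probe whose weight vector gives one candidate axis separating the two classes. The embed function is a stand-in assumption: it uses simple character-frequency features so the sketch runs without model weights, whereas the paper uses CharacterBERT embeddings.

```python
import random
import string

import numpy as np
from sklearn.linear_model import LogisticRegression

def random_ngram(min_len=3, max_len=10):
    """Generate one 'garble' token: a uniformly random character sequence."""
    length = random.randint(min_len, max_len)
    return "".join(random.choice(string.ascii_lowercase) for _ in range(length))

def embed(token, dim=26):
    """Placeholder embedder (assumption): normalized character counts.

    In the paper this would be a CharacterBERT embedding of the token;
    any function mapping a string to a fixed-length vector fits here.
    """
    vec = np.zeros(dim)
    for ch in token:
        if ch in string.ascii_lowercase:
            vec[ord(ch) - ord("a")] += 1.0
    return vec / max(len(token), 1)

extant = ["signal", "noise", "meaning", "random", "character", "language"]
garble = [random_ngram() for _ in range(200)]

X = np.stack([embed(t) for t in extant + garble])
y = np.array([1] * len(extant) + [0] * len(garble))  # 1 = extant word, 0 = garble

# A linear probe over the embeddings: its normalized weight vector is one
# candidate 'axis' in embedding space separating extant language from garble.
clf = LogisticRegression(max_iter=1000).fit(X, y)
axis = clf.coef_[0] / np.linalg.norm(clf.coef_[0])

for tok in extant + garble[:6]:
    print(f"{tok:>12s}  projection onto axis: {embed(tok) @ axis:+.3f}")
```

Whether the probe direction here tracks anything like the paper's axis depends entirely on the encoder; with the character-count stand-in it can only capture distributional character statistics, which is precisely the "primitive information" the abstract contrasts with meaning.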


Related research

12/19/2022
Inducing Character-level Structure in Subword-based Language Models with Type-level Interchange Intervention Training
Language tasks involving character-level manipulations (e.g., spelling c...

09/18/2020
Will it Unblend?
Natural language processing systems often struggle with out-of-vocabular...

06/06/2022
What do tokens know about their characters and how do they know it?
Pre-trained language models (PLMs) that use subword tokenization schemes...

03/26/2021
Functorial Language Models
We introduce functorial language models: a principled way to compute pro...

07/20/2017
A Sub-Character Architecture for Korean Language Processing
We introduce a novel sub-character architecture that exploits a unique c...

09/17/2023
A novel approach to measuring patent claim scope based on probabilities obtained from (large) language models
This work proposes to measure the scope of a patent claim as the recipro...

02/22/2017
Context-Aware Prediction of Derivational Word-forms
Derivational morphology is a fundamental and complex characteristic of l...
