The Natural Selection of Words: Finding the Features of Fitness

by Peter D. Turney, et al.

We introduce a dataset for studying the evolution of words, constructed from WordNet and the Google Books Ngram Corpus. The dataset tracks the evolution of 4,000 synonym sets (synsets), containing 9,000 English words, from 1800 AD to 2000 AD. We present a supervised learning algorithm that is able to predict the future leader of a synset: the word in the synset that will have the highest frequency. The algorithm uses features based on a word's length, the characters in the word, and the historical frequencies of the word. It can predict change of leadership (including the identity of the new leader) fifty years in the future, with an F-score considerably above random guessing. Analysis of the learned models provides insight into the causes of change in the leader of a synset. The algorithm confirms observations linguists have made, such as the trend to replace the -ise suffix with -ize, the rivalry between the -ity and -ness suffixes, and the struggle between economy (shorter words are easier to remember and to write) and clarity (longer words are more distinctive and less likely to be confused with one another). The results indicate that integration of the Google Books Ngram Corpus with WordNet has significant potential for improving our understanding of how language evolves.
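To make the notion of a synset "leader" concrete, here is a minimal Python sketch of the definition used above: the leader is the word in the synset with the highest frequency, and a change of leadership occurs when the leader differs between two points in time. This is not the paper's supervised learning algorithm. The function names, the alphabetical tie-breaking rule, and the frequency counts are all illustrative assumptions and are not taken from the dataset.

```python
from typing import Dict, Tuple


def leader(freqs: Dict[str, int]) -> str:
    """Return the word with the highest frequency in a synset.

    Ties are broken alphabetically; this rule is an assumption made
    for the sketch, not something stated in the abstract.
    """
    if not freqs:
        raise ValueError("a synset must contain at least one word")
    # Sort key: highest frequency first, then alphabetical order.
    return min(freqs, key=lambda word: (-freqs[word], word))


def leadership_change(past: Dict[str, int],
                      future: Dict[str, int]) -> Tuple[bool, str]:
    """Report whether the leader changed and who the later leader is."""
    old_leader = leader(past)
    new_leader = leader(future)
    return old_leader != new_leader, new_leader


# Made-up counts for a two-word synset, fifty years apart.
counts_earlier = {"colonise": 120, "colonize": 100}
counts_later = {"colonise": 80, "colonize": 150}

changed, new_leader = leadership_change(counts_earlier, counts_later)
print(changed, new_leader)
```

A predictor of the kind the abstract describes would estimate the later leader from features available at the earlier date (word length, characters, historical frequencies); the functions here only compute the ground-truth label such a predictor would be scored against.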



A Statistical Model of Word Rank Evolution

The availability of large linguistic data sets enables data-driven appro...

Performance Comparison of Large Language Models on VNHSGE English Dataset: OpenAI ChatGPT, Microsoft Bing Chat, and Google Bard

This paper presents a performance comparison of three large language mod...

A Computational Investigation on Denominalization

Language has been a dynamic system and word meanings always have been ch...

Is language evolution grinding to a halt? The scaling of lexical turbulence in English fiction suggests it is not

Of basic interest is the quantification of the long term growth of a lan...

Bubble-Flip---A New Generation Algorithm for Prefix Normal Words

We present a new recursive generation algorithm for prefix normal words....

A Rule-based/BPSO Approach to Produce Low-dimensional Semantic Basis Vectors Set

We intend to generate low-dimensional explicit distributional semantic v...

It Means More if It Sounds Good: Yet Another Hypotheses Concerning the Evolution of Polysemous Words

This position paper looks into the formation of language and shows ties ...
