Universal versus system-specific features of punctuation usage patterns in major Western languages

12/21/2022
by Tomasz Stanisz, et al.

The celebrated proverb that "speech is silver, silence is golden" has a long multinational history and multiple specific meanings. In written texts, punctuation can be considered one of its manifestations. Indeed, the virtue of speaking and writing effectively involves, often decisively, the capacity to place breaks properly. In the present study, based on a large corpus of world-famous and representative literary texts in seven major Western languages, it is shown that the distribution of intervals between consecutive punctuation marks in almost all texts can universally be characterised by only two parameters of the discrete Weibull distribution, which can be given an intuitive interpretation in terms of the so-called hazard function. The values of these two parameters tend to be language-specific, however, and even appear to navigate translations. The properties of the computed hazard functions indicate that, among the studied languages, English is the least constrained by the necessity to place a consecutive punctuation mark to partition a sequence of words. This may suggest that, compared to the other studied languages, English is more flexible, in the sense of allowing longer uninterrupted sequences of words. Spanish reveals a similar tendency, though to a slightly lesser extent.
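To make the two-parameter description more concrete, the following is a minimal sketch in Python, not the authors' implementation. The abstract does not give the formula, so the sketch assumes the common Nakagawa-Osaki form of the discrete Weibull distribution with parameters `p` and `beta`, where the probability that an interval exceeds `k` words is `(1 - p) ** (k ** beta)`. The function names, the set of punctuation marks, and the convention of measuring an interval as the number of words preceding each mark are all illustrative assumptions.

```python
import re

def discrete_weibull_pmf(k, p, beta):
    """P(X = k) for k = 1, 2, ... under the assumed Nakagawa-Osaki form."""
    return (1 - p) ** ((k - 1) ** beta) - (1 - p) ** (k ** beta)

def discrete_weibull_hazard(k, p, beta):
    """Hazard h(k) = P(X = k | X >= k): the probability that a punctuation
    mark follows the k-th word, given that none followed the first k - 1."""
    return 1 - (1 - p) ** (k ** beta - (k - 1) ** beta)

def punctuation_intervals(text, marks=".,;:!?"):
    """Word counts between consecutive punctuation marks.

    The mark set and the tokenisation are assumptions made for illustration;
    a trailing run of words with no closing mark is discarded.
    """
    token_re = re.compile(r"\w+(?:['’-]\w+)*|[" + re.escape(marks) + "]")
    intervals, words = [], 0
    for token in token_re.findall(text):
        if token in marks:
            if words > 0:          # skip marks that directly follow another mark
                intervals.append(words)
            words = 0
        else:
            words += 1
    return intervals

sample = "Speech is silver, silence is golden. Indeed!"
print(punctuation_intervals(sample))
print([round(discrete_weibull_hazard(k, 0.2, 1.2), 3) for k in range(1, 6)])
```

Under this assumed form, `beta = 1` reduces to the geometric distribution with a constant hazard equal to `p`, while `beta > 1` gives a hazard that grows with `k`, meaning a punctuation mark becomes increasingly likely the longer a word sequence runs without one. A lower hazard curve therefore corresponds to the kind of flexibility the abstract attributes to English.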

