Assessing Keyness using Permutation Tests

08/25/2023
by Thoralf Mildenberger, et al.

We propose a resampling-based approach for assessing keyness in corpus linguistics, building on suggestions by Gries (2006, 2022). Traditional approaches based on hypothesis tests (e.g. the likelihood ratio test) model the corpora as independent, identically distributed samples of tokens. This model does not account for the frequently observed uneven distribution of a word's occurrences across a corpus. When the occurrences of a word are concentrated in a few documents, large values of the log-likelihood ratio (LLR) and similar scores are much more likely than the token-by-token sampling model predicts, which leads to false positives. We replace the token-by-token sampling model with one in which corpora are samples of documents rather than tokens, which is much closer to the way corpora are actually assembled. We then use a permutation approach to approximate the distribution of a given keyness score under the null hypothesis of equal frequencies and obtain p-values for assessing significance. The approach requires no assumptions about how tokens are organized within or across documents, and it works with essentially *any* keyness score. Hence, apart from obtaining more accurate p-values for scores like LLR, we can also assess significance for, e.g., the logratio, which has been proposed as a measure of effect size. An efficient implementation of the proposed approach is provided in the `R` package `keyperm`, available from GitHub.
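The document-level permutation idea described above can be illustrated with a minimal Python sketch. This is not the `keyperm` implementation or its API. The function names (`llr`, `permutation_p_value`), the representation of each document as a `(word_count, token_count)` pair, the four-cell form of the LLR, and the add-one p-value estimate are all assumptions made for illustration.

```python
import math
import random


def llr(k_a, n_a, k_b, n_b):
    """Four-cell log-likelihood ratio (G2) for a word that occurs
    k_a times in n_a tokens (corpus A) and k_b times in n_b tokens (corpus B)."""
    total_k = k_a + k_b
    total_n = n_a + n_b
    e_a = n_a * total_k / total_n  # expected count in A under equal frequencies
    e_b = n_b * total_k / total_n  # expected count in B under equal frequencies

    def term(obs, exp):
        # A cell with zero observations contributes nothing to G2.
        return obs * math.log(obs / exp) if obs > 0 else 0.0

    return 2.0 * (term(k_a, e_a) + term(k_b, e_b)
                  + term(n_a - k_a, n_a - e_a) + term(n_b - k_b, n_b - e_b))


def permutation_p_value(docs_a, docs_b, score=llr, n_perm=2000, seed=0):
    """Shuffle whole documents between the two corpora (keeping the number
    of documents per corpus fixed) and compare the observed score with the
    scores of the shuffled corpora. Returns (observed_score, p_value)."""
    rng = random.Random(seed)
    docs = list(docs_a) + list(docs_b)
    split = len(docs_a)

    def score_of(ds):
        k_a = sum(k for k, _ in ds[:split])
        n_a = sum(n for _, n in ds[:split])
        k_b = sum(k for k, _ in ds[split:])
        n_b = sum(n for _, n in ds[split:])
        return score(k_a, n_a, k_b, n_b)

    observed = score_of(docs)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(docs)
        # Small tolerance so ties are not lost to floating-point rounding.
        if score_of(docs) >= observed - 1e-9:
            hits += 1
    # Add-one estimate keeps the p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Same totals (50 occurrences in 10,000 tokens vs. 0 in 10,000),
# but different dispersion across documents.
concentrated_a = [(50, 1000)] + [(0, 1000)] * 9   # all hits in one document
even_a = [(5, 1000)] * 10                         # hits spread evenly
corpus_b = [(0, 1000)] * 10

score_conc, p_conc = permutation_p_value(concentrated_a, corpus_b)
score_even, p_even = permutation_p_value(even_a, corpus_b)
print(score_conc, p_conc)
print(score_even, p_even)
```

Both toy corpora have identical token-level totals and therefore the same LLR, so a token-level test would treat them alike. Under document-level permutation, the concentrated case is not significant at all: whichever corpus receives the single document containing the word, the score is the same as the observed one. In the evenly dispersed case, almost no shuffle reproduces a score as large as the observed one. Because `score` is an ordinary function argument, any other keyness score with the same signature could be substituted.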


