Inference at Scale: Significance Testing for Large Search and Recommendation Experiments

05/03/2023
by Ngozi Ihemelandu, et al.

A number of information retrieval studies have assessed which statistical techniques are appropriate for comparing systems. However, these studies focus on TREC-style experiments, which typically have fewer than 100 topics. There is no similar line of work for large search and recommendation experiments; such studies typically have thousands of topics or users and much sparser relevance judgements, so it is not clear whether recommendations for analyzing traditional TREC experiments apply to these settings. In this paper, we empirically study the behavior of significance tests with large search and recommendation evaluation data. Our results show that the Wilcoxon and sign tests have significantly higher Type-1 error rates for large sample sizes than the bootstrap, randomization, and t-tests, which were more consistent with the expected error rate. While the statistical tests displayed differences in their power for smaller sample sizes, they showed no difference in their power for large sample sizes. We recommend that the sign and Wilcoxon tests not be used to analyze large-scale evaluation results. Our results demonstrate that with Top-N recommendation and large search evaluation data, most tests would have a 100% chance of finding statistically significant results. Therefore, the effect size should be used to determine practical or scientific significance.
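To illustrate the closing point of the abstract, here is a minimal sketch of a paired randomization (sign-flip) test reported together with an effect size. It is not the authors' code: the function names `paired_randomization_test` and `paired_effect_size` are hypothetical, the per-user scores are synthetic, and the use of a standardized mean difference (Cohen's d for paired samples) as the effect size is an assumption, since the abstract does not name a specific measure.

```python
import random
import statistics


def paired_randomization_test(a, b, n_permutations=1000, seed=0):
    """Two-sided sign-flip randomization test on paired per-user scores."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(statistics.fmean(diffs))
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the two system labels are exchangeable,
        # so each paired difference is equally likely to have either sign.
        flipped_sum = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped_sum / len(diffs)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)


def paired_effect_size(a, b):
    """Standardized mean difference of the paired differences (assumed measure)."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.fmean(diffs) / statistics.stdev(diffs)


# Synthetic data (an assumption for illustration): system A beats system B
# by a very small average margin across many users.
rng = random.Random(42)
n_users = 5000
scores_b = [rng.random() for _ in range(n_users)]
scores_a = [s + rng.gauss(0.01, 0.1) for s in scores_b]

p_value = paired_randomization_test(scores_a, scores_b)
effect = paired_effect_size(scores_a, scores_b)
print(f"p = {p_value:.4f}, effect size d = {effect:.3f}")
```

With thousands of users, even a small average difference of this kind tends to yield a very small p-value, while the effect size stays small. This is why the p-value alone says little about practical significance at this scale.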


research
01/30/2019

Using Score Distributions to Compare Statistical Significance Tests for Information Retrieval Evaluation

Statistical significance tests can provide evidence that the observed di...
research
11/20/2018

Higher significance with smaller samples: A modified Sequential Probability Ratio Test

We describe a modified sequential probability ratio test that can be use...
research
05/27/2019

Statistical Significance Testing in Information Retrieval: An Empirical Analysis of Type I, Type II and Type III Errors

Statistical significance testing is widely accepted as a means to assess...
research
08/13/2022

Size Matters: The Use and Misuse of Statistical Significance in Discrete Choice Models in the Transportation Academic Literature

In this paper we review the academic transportation literature published...
research
11/26/2020

NLPStatTest: A Toolkit for Comparing NLP System Performance

Statistical significance testing centered on p-values is commonly used t...
research
05/08/2020

Incentive-Compatible Critical Values

Statistical hypothesis tests are a cornerstone of scientific research. T...
research
09/30/2019

Enhancing statistical inference in psychological research via prospective and retrospective design analysis

In the past two decades, psychological science has experienced an unprec...
