Ranger: A Toolkit for Effect-Size Based Multi-Task Evaluation

05/24/2023
by Mete Sertkan et al.

In this paper, we introduce Ranger, a toolkit that facilitates effect-size-based meta-analysis for multi-task evaluation in NLP and IR. Our communities often face the challenge of aggregating results over incomparable metrics and scenarios, which makes conclusions and take-away messages less reliable. Ranger addresses this issue as a task-agnostic toolkit that combines the effect of a treatment on multiple tasks into a single statistical evaluation, allowing metrics to be compared and an overall summary effect to be computed. The toolkit produces publication-ready forest plots that enable clear communication of evaluation results across multiple tasks. With the ready-to-use Ranger toolkit, we aim to promote robust, effect-size-based evaluation and to improve evaluation standards in the community. We provide two case studies for common IR and NLP settings to highlight Ranger's benefits.
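The abstract does not reproduce Ranger's API, so the following minimal sketch only illustrates the underlying idea: compute a standardized effect size (here, Cohen's d) for the treatment on each task, pool the per-task effects into one summary effect with inverse-variance weighting, and render the result as a forest plot. The function names, the synthetic data, and the fixed-effect pooling are illustrative assumptions, not Ranger's actual implementation.

    import numpy as np
    import matplotlib.pyplot as plt

    def cohens_d(treatment, baseline):
        # Standardized mean difference between per-query scores of the
        # treatment system and the baseline system on one task.
        t, b = np.asarray(treatment), np.asarray(baseline)
        n = len(t)
        pooled_sd = np.sqrt((t.var(ddof=1) + b.var(ddof=1)) / 2)
        d = (t.mean() - b.mean()) / pooled_sd
        # Textbook large-sample variance approximation for d with
        # two groups of size n.
        var_d = 2 / n + d ** 2 / (4 * n)
        return d, var_d

    # Hypothetical per-query scores (e.g., nDCG@10) for four tasks.
    rng = np.random.default_rng(0)
    tasks = {f"task {i}": (rng.normal(0.55, 0.10, 50),   # treatment
                           rng.normal(0.50, 0.10, 50))   # baseline
             for i in range(1, 5)}

    labels, effects, variances = [], [], []
    for name, (treat, base) in tasks.items():
        d, v = cohens_d(treat, base)
        labels.append(name); effects.append(d); variances.append(v)

    # Fixed-effect (inverse-variance) pooling into one summary effect.
    w = 1.0 / np.array(variances)
    summary = float(np.sum(w * effects) / np.sum(w))
    summary_se = float(np.sqrt(1.0 / np.sum(w)))

    # Forest plot: one row per task plus the pooled summary effect.
    y = np.arange(len(effects), 0, -1)
    fig, ax = plt.subplots()
    ax.errorbar(effects, y, xerr=1.96 * np.sqrt(variances), fmt="s", color="black")
    ax.errorbar([summary], [0], xerr=[1.96 * summary_se], fmt="D", color="blue")
    ax.set_yticks(list(y) + [0], labels + ["summary"])
    ax.axvline(0.0, linestyle="--", color="gray")
    ax.set_xlabel("effect size (Cohen's d) with 95% CI")
    plt.tight_layout()
    plt.show()

When tasks are heterogeneous, a random-effects model (e.g., DerSimonian-Laird) is usually preferred over the fixed-effect pooling sketched here; the forest plot itself stays the same, only the weights and the summary variance change.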

Related research

- Massive Choice, Ample Tasks (MaChAmp): A Toolkit for Multi-task Learning in NLP (05/29/2020)
  Transfer learning, particularly approaches that combine multi-task learn...

- NLPStatTest: A Toolkit for Comparing NLP System Performance (11/26/2020)
  Statistical significance testing centered on p-values is commonly used t...

- Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations (02/19/2021)
  Pyserini is an easy-to-use Python toolkit that supports replicable IR re...

- SentEval: An Evaluation Toolkit for Universal Sentence Representations (03/14/2018)
  We introduce SentEval, a toolkit for evaluating the quality of universal...

- InPars Toolkit: A Unified and Reproducible Synthetic Data Generation Pipeline for Neural Information Retrieval (07/10/2023)
  Recent work has explored Large Language Models (LLMs) to overcome the la...

- ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models (04/19/2022)
  Learning visual representations from natural language supervision has re...

- Wespeaker baselines for VoxSRC2023 (06/27/2023)
  This report showcases the results achieved using the wespeaker toolkit f...
