Hardness of Samples Need to be Quantified for a Reliable Evaluation System: Exploring Potential Opportunities with a New Task

10/14/2022
by Swaroop Mishra, et al.

Evaluation of models on benchmarks is unreliable without knowing the degree of sample hardness; this overestimates the capability of AI systems and limits their adoption in real-world applications. We propose a Data Scoring task that requires assigning each unannotated sample in a benchmark a score between 0 and 1, where 0 signifies easy and 1 signifies hard. The use of unannotated samples in our task design is inspired by humans, who can judge a question's difficulty without knowing its correct answer. It also rules out methods involving model-based supervision (since such methods require sample annotations for training), eliminating potential biases that models introduce when deciding sample difficulty. We propose a method based on Semantic Textual Similarity (STS) for this task, and we validate it by showing that existing models are more accurate on the easier sample chunks than on the harder ones. Finally, we demonstrate five novel applications.
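The abstract does not spell out the STS-based scoring procedure, so the sketch below is only illustrative. It assumes hardness is inversely related to a sample's mean semantic similarity to the rest of the benchmark; the hardness_scores helper and the all-MiniLM-L6-v2 encoder choice are hypothetical, not the authors' implementation. Note that it operates purely on unannotated inputs, matching the task's constraint that no gold answers or model-based supervision be used.

```python
# Illustrative STS-based hardness scorer (a sketch, not the paper's method).
# Assumption: a sample semantically dissimilar to the rest of the benchmark
# is treated as harder. Requires: pip install sentence-transformers
import numpy as np
from sentence_transformers import SentenceTransformer, util

def hardness_scores(samples, model_name="all-MiniLM-L6-v2"):
    """Assign each unannotated sample a hardness score in [0, 1]
    (0 = easy, 1 = hard), using no labels or trained supervision."""
    model = SentenceTransformer(model_name)
    emb = model.encode(samples, convert_to_tensor=True, normalize_embeddings=True)
    sim = util.cos_sim(emb, emb).cpu().numpy()      # pairwise STS matrix
    np.fill_diagonal(sim, 0.0)                      # drop self-similarity
    avg_sim = sim.sum(axis=1) / (len(samples) - 1)  # mean similarity to the pool
    # Min-max normalize, then invert: low similarity to the pool -> high hardness.
    norm = (avg_sim - avg_sim.min()) / (avg_sim.max() - avg_sim.min() + 1e-12)
    return 1.0 - norm

if __name__ == "__main__":
    benchmark = [
        "What is 2 + 2?",
        "What is the capital of France?",
        "Prove that the square root of 2 is irrational.",
    ]
    for score, q in zip(hardness_scores(benchmark), benchmark):
        print(f"{score:.2f}  {q}")
```

Sorting a benchmark by such scores and splitting it into chunks would support the validation described above: checking whether existing models are more accurate on low-scoring (easy) chunks than on high-scoring (hard) ones.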


Related research

11/14/2022
Are Hard Examples also Harder to Explain? A Study with Human and Model-Generated Explanations
Recent work on explainable NLP has shown that few-shot prompting can ena...

11/23/2020
Unsupervised Difficulty Estimation with Action Scores
Evaluating difficulty and biases in machine learning models has become o...

07/28/2022
Measuring Difficulty of Novelty Reaction
Current AI systems are designed to solve close-world problems with the a...

06/21/2021
Hardness of Samples Is All You Need: Protecting Deep Learning Models Using Hardness of Samples
Several recent studies have shown that Deep Neural Network (DNN)-based c...

08/18/2023
Deep Boosting Multi-Modal Ensemble Face Recognition with Sample-Level Weighting
Deep convolutional neural networks have achieved remarkable success in f...

09/14/2020
One-bit Supervision for Image Classification
This paper presents one-bit supervision, a novel setting of learning fro...

06/10/2021
How Robust are Model Rankings: A Leaderboard Customization Approach for Equitable Evaluation
Models that top leaderboards often perform unsatisfactorily when deploye...
