CrossCheck: Rapid, Reproducible, and Interpretable Model Evaluation

04/16/2020
by Dustin Arendt, et al.

Evaluation beyond aggregate performance metrics, e.g., F1-score, is crucial both to establish an appropriate level of trust in machine learning models and to identify future model improvements. In this paper we demonstrate CrossCheck, an interactive visualization tool for rapid cross-model comparison and reproducible error analysis. We describe the tool and discuss design and implementation details. We then present three use cases (named entity recognition, reading comprehension, and clickbait detection) that show the benefits of using the tool for model evaluation. CrossCheck enables data scientists to make informed decisions when choosing between multiple models, to identify when and on which examples the models are correct, to investigate whether the models make the same mistakes as humans, and to evaluate the models' generalizability while highlighting their limitations, strengths, and weaknesses. Furthermore, CrossCheck is implemented as a Jupyter widget, which allows rapid and convenient integration into data scientists' model development workflows.
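
The kind of cross-model, per-example error analysis described above can be approximated with a short pandas sketch. The code below is illustrative only: the DataFrame, column names, labels, and the two hypothetical models are assumptions, not CrossCheck's actual interface. It shows the agreement/disagreement breakdown that a tool like CrossCheck lets analysts explore interactively inside a notebook.

    # Minimal sketch (not the CrossCheck API) of cross-model, per-example
    # error analysis: given gold labels and predictions from two models,
    # tabulate where each model is right or wrong and find shared errors.
    import pandas as pd

    # Hypothetical evaluation results for two NER-style classifiers.
    results = pd.DataFrame({
        "example_id": [0, 1, 2, 3, 4],
        "gold":    ["PER", "ORG", "LOC", "ORG", "PER"],
        "model_a": ["PER", "ORG", "ORG", "ORG", "LOC"],
        "model_b": ["PER", "LOC", "LOC", "ORG", "LOC"],
    })

    results["a_correct"] = results["model_a"] == results["gold"]
    results["b_correct"] = results["model_b"] == results["gold"]

    # Cross-model agreement grid: counts of examples where each model is
    # right or wrong, the starting point for drilling into individual errors.
    grid = pd.crosstab(results["a_correct"], results["b_correct"],
                       rownames=["model_a correct"],
                       colnames=["model_b correct"])
    print(grid)

    # Examples both models miss are often the most informative to inspect.
    both_wrong = results[~results["a_correct"] & ~results["b_correct"]]
    print(both_wrong[["example_id", "gold", "model_a", "model_b"]])

In a Jupyter notebook, the same per-example table would be the input to an interactive widget rather than printed output, which is the workflow integration the abstract refers to.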
