The Reasonable Effectiveness of Diverse Evaluation Data

01/23/2023
by Lora Aroyo, et al.

In this paper, we present findings from a semi-experimental exploration of rater diversity and its influence on safety annotations of conversations generated by humans talking to a generative AI chatbot. We find significant differences in the judgments produced by raters from different geographic regions and annotation platforms, and correlate these perspectives with demographic sub-groups. Our work helps define best practices in model development – specifically human evaluation of generative models – against the backdrop of growing work on sociotechnical AI evaluations.
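
The abstract reports significant differences in judgments across rater groups but does not specify the statistical procedure here. As a minimal, hypothetical sketch of how such a cross-group comparison might be run, the snippet below applies a chi-squared test of independence to a contingency table of safety judgments; the group labels and counts are invented for illustration and are not the authors' data or method.

```python
# Hypothetical sketch: do safety judgments differ across rater groups?
# Uses a chi-squared test of independence; all numbers below are invented.
from scipy.stats import chi2_contingency

# Rows: rater groups (e.g., geographic regions).
# Columns: counts of "safe" vs. "unsafe" judgments on the same conversations.
judgment_counts = [
    [320, 180],  # region A: safe, unsafe
    [250, 250],  # region B
    [400, 100],  # region C
]

chi2, p_value, dof, expected = chi2_contingency(judgment_counts)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Judgment distributions differ significantly across rater groups.")
```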


Related research

10/05/2022
Embodying the Glitch: Perspectives on Generative AI in Dance Practice
What role does the break from realism play in the potential for generati...

06/07/2023
Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models
We systematically study a wide variety of image-based generative models ...

04/15/2023
The Design Space of Generative Models
Card et al.'s classic paper "The Design Space of Input Devices" establis...

09/07/2021
Hi, my name is Martha: Using names to measure and mitigate bias in generative dialogue models
All AI models are susceptible to learning biases in data that they are t...

12/10/2021
Assessing the Fairness of AI Systems: AI Practitioners' Processes, Challenges, and Needs for Support
Various tools and practices have been developed to support practitioners...

11/14/2014
A unified view of generative models for networks: models, methods, opportunities, and challenges
Research on probabilistic models of networks now spans a wide variety of...
