HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models

04/01/2019
by Sharon Zhou, et al.

Generative models often use human evaluations to measure the perceived quality of their outputs. Automated metrics are noisy, indirect proxies because they rely on heuristics or pretrained embeddings. However, until now, direct human evaluation strategies have been ad hoc: neither standardized nor validated. Our work establishes a gold-standard human benchmark for generative realism. We construct Human eYe Perceptual Evaluation (HYPE), a human benchmark that is (1) grounded in psychophysics research in perception, (2) reliable across different sets of randomly sampled outputs from a model, (3) able to produce separable model performances, and (4) efficient in cost and time. We introduce two variants: one that measures visual perception under adaptive time constraints to determine the threshold at which a model's outputs appear real (e.g., 250 ms), and a less expensive variant that measures the human error rate on fake and real images without time constraints. We test HYPE across six state-of-the-art generative adversarial networks and two sampling techniques on conditional and unconditional image generation using four datasets: CelebA, FFHQ, CIFAR-10, and ImageNet. We find that HYPE can track model improvements across training epochs, and we confirm via bootstrap sampling that HYPE rankings are consistent and replicable.
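To make the untimed variant and the bootstrap check concrete, here is a minimal sketch, not the authors' implementation. It assumes each human judgment is stored as a hypothetical `(image_is_real, judged_real)` pair, scores a model by the fraction of judgments that disagree with ground truth (fakes judged real plus reals judged fake), and estimates a percentile bootstrap confidence interval by resampling evaluators with replacement. The names `error_rate` and `bootstrap_ci`, the data layout, and the choice to resample evaluators (rather than images) are all illustrative assumptions.

```python
import random
from typing import List, Tuple

# Hypothetical layout: (image_is_real, judged_real) for a single human judgment.
Judgment = Tuple[bool, bool]


def error_rate(judgments: List[Judgment]) -> float:
    """Fraction of judgments that disagree with ground truth.

    Counts both fake images judged real and real images judged fake.
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    wrong = sum(1 for is_real, judged_real in judgments if is_real != judged_real)
    return wrong / len(judgments)


def bootstrap_ci(per_evaluator: List[List[Judgment]], n_boot: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for the error rate.

    Assumption: evaluators (not individual images) are the resampling unit.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        # Draw as many evaluators as we have, with replacement.
        sample = [rng.choice(per_evaluator) for _ in per_evaluator]
        pooled = [j for evaluator in sample for j in evaluator]
        scores.append(error_rate(pooled))
    scores.sort()
    lo = scores[int((alpha / 2) * n_boot)]
    hi = scores[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


if __name__ == "__main__":
    # Two hypothetical evaluators, four judgments each.
    evaluators = [
        [(False, True), (False, False), (True, True), (True, True)],  # 1 of 4 wrong
        [(False, True), (False, True), (True, False), (True, True)],  # 3 of 4 wrong
    ]
    pooled = [j for e in evaluators for j in e]
    print(error_rate(pooled))  # 4 wrong out of 8 judgments
    print(bootstrap_ci(evaluators))
```

A higher error rate means humans confuse the model's outputs with real images more often. Under this sketch, two models would be considered separable when their bootstrap intervals do not overlap, which is one simple way to read the abstract's claim that rankings are consistent and replicable.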
