HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models

04/01/2019 ∙ by Sharon Zhou, et al. ∙ Stanford University

Generative models often use human evaluations to measure the perceived quality of their outputs. Automated metrics are noisy indirect proxies because they rely on heuristics or pretrained embeddings. However, up until now, direct human evaluation strategies have been ad-hoc, neither standardized nor validated. Our work establishes a gold standard human benchmark for generative realism. We construct Human eYe Perceptual Evaluation (HYPE), a human benchmark that is (1) grounded in psychophysics research in perception, (2) reliable across different sets of randomly sampled outputs from a model, (3) able to produce separable model performances, and (4) efficient in cost and time. We introduce two variants: one that measures visual perception under adaptive time constraints to determine the threshold at which a model's outputs appear real (e.g., 250ms), and the other a less expensive variant that measures human error rate on fake and real images sans time constraints. We test HYPE across six state-of-the-art generative adversarial networks and two sampling techniques on conditional and unconditional image generation using four datasets: CelebA, FFHQ, CIFAR-10, and ImageNet. We find that HYPE can track model improvements across training epochs, and we confirm via bootstrap sampling that HYPE rankings are consistent and replicable.


1 Introduction

Generating realistic images is regarded as a focal task for measuring the progress of generative models. Automated metrics are either heuristic approximations (Rössler et al., 2019; Salimans et al., 2016; Denton et al., 2015; Karras et al., 2018; Brock et al., 2018; Radford et al., 2015) or intractable density estimations, shown to be inaccurate on high-dimensional problems (Hinton, 2002; Bishop, 2006; Theis et al., 2015). Human evaluations, such as those given on Amazon Mechanical Turk (Rössler et al., 2019; Denton et al., 2015), remain ad-hoc because “results change drastically” (Salimans et al., 2016) based on details of the task design (Liu et al., 2016; Le et al., 2010; Kittur et al., 2008). With both noisy automated and noisy human benchmarks, measuring progress over time has become akin to hill-climbing on noise. Even widely used metrics, such as Inception Score (Salimans et al., 2016) and Fréchet Inception Distance (Heusel et al., 2017), have been discredited for their application to non-ImageNet datasets (Barratt & Sharma, 2018; Rosca et al., 2017; Borji, 2018; Ravuri et al., 2018). Thus, to monitor progress, generative models need a systematic gold standard benchmark. In this paper, we introduce a gold standard benchmark for realistic generation, demonstrating its effectiveness across four datasets, six models, and two sampling techniques, and using it to assess the progress of generative models over time.

Realizing the constraints of available automated metrics, many generative modeling tasks resort to human evaluation and visual inspection (Rössler et al., 2019; Salimans et al., 2016; Denton et al., 2015). These human measures are (1) ad-hoc, each executed idiosyncratically without proof of reliability or grounding in theory, and (2) high variance in their estimates (Salimans et al., 2016; Denton et al., 2015; Olsson et al., 2018). These characteristics combine into a lack of reliability and, downstream, (3) a lack of clear separability between models. Theoretically, given sufficiently large sample sizes of human evaluators and model outputs, the law of large numbers would smooth out the variance and reach eventual convergence; but this would occur at (4) a high cost and a long delay.

We present HYPE (Human eYe Perceptual Evaluation) to address these criteria in turn. HYPE: (1) measures the perceptual realism of generative model outputs via a grounded method inspired by psychophysics methods in perceptual psychology, (2) is a reliable and consistent estimator, (3) is statistically separable to enable a comparative ranking, and (4) ensures a cost and time efficient method through modern crowdsourcing techniques such as training and aggregation. We present two methods of evaluation. The first, called HYPE_time, is inspired directly by the psychophysics literature (Klein, 2001; Cornsweet, 1962), and displays images using adaptive time constraints to determine the time-limited perceptual threshold a person needs to distinguish real from fake. The score is understood as the minimum time, in milliseconds, that a person needs to see the model’s output before they can distinguish it as real or fake. For example, a score of 250ms on HYPE_time indicates that humans can distinguish model outputs from real images at exposure times of 250ms or longer, but not under 250ms. The second method, called HYPE_∞, is derived from the first to make it simpler, faster, and cheaper while maintaining reliability. It is interpretable as the rate at which people mistake fake images for real images and vice versa, given unlimited time to make their decisions. A score of 50% on HYPE_∞ means that people differentiate generated results from real data only at chance rate, while a score above 50% represents hyper-realism in which generated images appear more real than real images.

We run two large-scale experiments. First, we demonstrate HYPE’s performance on unconditional human face generation using four popular generative adversarial networks (GANs) (Gulrajani et al., 2017; Berthelot et al., 2017; Karras et al., 2017, 2018) across CelebA-64 (Liu et al., 2015). We also evaluate two newer GANs (Miyato et al., 2018; Brock et al., 2018) on FFHQ-1024 (Karras et al., 2018). HYPE indicates that GANs have clear, measurable perceptual differences between them; this ranking is identical in both HYPE_time and HYPE_∞. The best performing model, StyleGAN trained on FFHQ and sampled with the truncation trick, only performs at a HYPE_∞ of 27.6% (Table 3), suggesting substantial opportunity for improvement. We can reliably reproduce these results with 95% confidence intervals using 30 human evaluators per model in a task that takes only minutes.

Second, we demonstrate the performance of HYPE_∞ beyond faces on conditional generation of five object classes in ImageNet (Deng et al., 2009) and unconditional generation of CIFAR-10 (Krizhevsky & Hinton, 2009). Early GANs such as BEGAN are not separable under HYPE_∞ when generating CIFAR-10: none of them produce convincing results to humans, verifying that this is a harder task than face generation. The newer StyleGAN shows separable improvement, indicating progress over the previous models. With ImageNet-5, GANs have improved on classes considered “easier” to generate (e.g., lemons), but score consistently low across all models on harder classes (e.g., French horns).

HYPE is a rapid solution for researchers to measure their generative models, requiring just a single click to produce reliable scores and measure progress. We deploy HYPE at https://<redacted>, where researchers can upload a model and retrieve a HYPE score. Future work will extend HYPE to additional generative tasks such as text, music, and video generation.

2 HYPE: A benchmark for Human eYe Perceptual Evaluation

HYPE displays a series of images one by one to crowdsourced evaluators on Amazon Mechanical Turk and asks the evaluators to assess whether each image is real or fake. Half of the images are real images, drawn from the model’s training set (e.g., FFHQ, CelebA, ImageNet, or CIFAR-10). The other half are drawn from the model’s output. We use modern crowdsourcing training and quality control techniques (Mitra et al., 2015) to ensure high-quality labels. Model creators can choose between two different evaluations: HYPE_time, which gathers time-limited perceptual thresholds to measure the psychometric function and reports the minimum time people need to make accurate classifications, and HYPE_∞, a simplified approach that assesses people’s error rate under no time constraint.

2.1 HYPE_time: Perceptual fidelity grounded in psychophysics

Our first method, HYPE_time, measures time-limited perceptual thresholds. It is rooted in psychophysics, a field devoted to the study of how humans perceive stimuli, and evaluates the time thresholds at which humans can reliably perceive an image. Our evaluation protocol follows the procedure known as the adaptive staircase method (Figure 3) (Cornsweet, 1962). An image is flashed for a limited length of time, after which the evaluator is asked to judge whether it is real or fake. If the evaluator consistently answers correctly, the staircase descends and flashes the next image with less time. If the evaluator is incorrect, the staircase ascends and provides more time.

Figure 2: Example images sampled with the truncation trick from StyleGAN trained on FFHQ. Images on the right exhibit the highest scores, the highest human perceptual fidelity.
Figure 3: The adaptive staircase method shows images to evaluators at different time exposures, decreasing when correct and increasing when incorrect. The modal exposure measures their perceptual threshold.

This process requires sufficient iterations to converge to the evaluator’s perceptual threshold: the shortest exposure time at which they can maintain effective performance (Cornsweet, 1962; Greene & Oliva, 2009; Fei-Fei et al., 2007). The process produces what is known as the psychometric function (Wichmann & Hill, 2001), the relationship of timed stimulus exposure to accuracy. For example, for an easily distinguishable set of generated images, a human evaluator would immediately drop to the lowest millisecond exposure.

HYPE_time displays three blocks of staircases to each evaluator. An image evaluation begins with a 3-2-1 countdown clock, with each number displayed briefly (Krishna et al., 2016). The sampled image is then displayed for the current exposure time. Immediately after each image, four perceptual mask images are rapidly flashed in succession. These noise masks are distorted to prevent retinal afterimages and further sensory processing after the image disappears (Greene & Oliva, 2009). We generate masks using an existing texture-synthesis algorithm (Portilla & Simoncelli, 2000). Upon each submission, HYPE_time reveals to the evaluator whether they were correct.

Image exposures are restricted to a bounded range derived from the perception literature (Fraisse, 1984). All blocks begin at the same starting exposure and last for a fixed number of images, half generated and half real, with values empirically tuned from prior work (Cornsweet, 1962; Dakin & Omigie, 2009). Exposure times are raised and lowered in fixed millisecond increments, following an up/down adaptive staircase approach that theoretically converges to an accuracy threshold approximating the human perceptual threshold (Levitt, 1971; Greene & Oliva, 2009; Cornsweet, 1962).

Every evaluator completes multiple staircases, called blocks, on different sets of images, so we obtain multiple threshold measurements per evaluator for each model. We employ three blocks, to balance quality of the estimates against evaluators’ fatigue (Krueger, 1989; Rzeszotarski et al., 2013; Hata et al., 2017). We average the modal exposure times across blocks to calculate a final value for each evaluator. Higher scores indicate a better model, whose outputs require longer exposures to distinguish from real images.
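To make the staircase logic concrete, here is a minimal sketch of one block under simplifying assumptions: it uses a plain 1-up/1-down rule with placeholder exposure bounds and a single step size, rather than the benchmark's tuned staircase parameters, and `respond` stands in for a human evaluator's judgment.

```python
from collections import Counter

def run_staircase_block(trials, respond, start_ms=500, min_ms=100, max_ms=1000, step_ms=30):
    """Simulate one HYPE_time block with a simplified adaptive staircase.

    `trials` is a list of (image, is_real) pairs; `respond(image, exposure_ms)`
    returns the evaluator's guess (True = "real"). The exposure shortens after a
    correct answer and lengthens after an incorrect one. The default bounds and
    step size are illustrative placeholders, not the benchmark's parameters.
    """
    exposure = start_ms
    seen_exposures = []
    for image, is_real in trials:
        guess = respond(image, exposure)                  # evaluator judges the flashed image
        seen_exposures.append(exposure)
        if guess == is_real:
            exposure = max(min_ms, exposure - step_ms)    # correct -> shorter flash
        else:
            exposure = min(max_ms, exposure + step_ms)    # incorrect -> longer flash
    # The modal exposure over the block approximates the perceptual threshold.
    return Counter(seen_exposures).most_common(1)[0][0]
```

An evaluator's final HYPE_time value would then be the average of these modal exposures across their blocks, as described above.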

2.2 HYPE_∞: Cost-effective approximation

Building on the previous method, we introduce HYPE_∞: a simpler, faster, and cheaper method, designed by ablating HYPE_time to optimize for speed, cost, and ease of interpretation. HYPE_∞ shifts from a measure of perceptual time to a measure of human deception rate, given infinite evaluation time. The score gauges total error on the task, enabling the measure to capture errors on both fake and real images, as well as the effects of hyperrealistic generation when fake images look even more realistic than real images. HYPE_∞ requires fewer images than HYPE_time to find a stable value, empirically producing a substantial reduction in time and cost (minutes per evaluator instead of roughly an hour, at the same hourly pay rate). Higher scores are again better: a score of 10% indicates that only 10% of images deceive people, whereas 50% indicates that people are mistaking real and fake images at chance, rendering fake images indistinguishable from real. Scores above 50% suggest hyperrealistic images, as evaluators mistake images at a rate greater than chance.

HYPE_∞ shows each evaluator a fixed set of images, half real and half fake. We calculate the proportion of images that were judged incorrectly, and aggregate the judgments over all evaluators and images to produce the final score for a given model.
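A minimal sketch of this aggregation, assuming judgments are pooled uniformly over all evaluators and images (any weighting beyond that is not assumed here):

```python
def hype_infinity(judgments):
    """HYPE_infinity deception rate (%) from pooled (said_real, is_real) pairs.

    50% means evaluators are at chance (fakes indistinguishable from reals);
    scores above 50% indicate hyper-realism.
    """
    errors = sum(said_real != is_real for said_real, is_real in judgments)
    return 100.0 * errors / len(judgments)

def error_breakdown(judgments):
    """Split the error rate into fakes judged real and reals judged fake,
    matching the 'Fakes Error' and 'Reals Error' columns reported later."""
    fakes = [said_real for said_real, is_real in judgments if not is_real]
    reals = [said_real for said_real, is_real in judgments if is_real]
    fake_error = 100.0 * sum(fakes) / len(fakes)                  # fakes mistaken for real
    real_error = 100.0 * sum(not s for s in reals) / len(reals)   # reals mistaken for fake
    return fake_error, real_error
```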

2.3 Consistent and reliable design

To ensure that our reported scores are consistent and reliable, we need to sample sufficiently from the model as well as hire, qualify, and appropriately pay enough evaluators.

Sampling sufficient model outputs. The selection of images to evaluate from a particular model is a critical component of a fair and useful evaluation. We must sample a large enough number of images to fully capture a model’s generative diversity, yet balance that against tractable evaluation costs. We follow existing work on evaluating generative output by sampling a large pool of generated images from each model (Salimans et al., 2016; Miyato et al., 2018; Warde-Farley & Bengio, 2016) and a pool of real images from the training set. From these pools, we randomly select the images shown to each evaluator.
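A sketch of this sampling step, assuming a hypothetical `generate_image` callable that wraps the model's sampler; the pool size and per-evaluator count are placeholders rather than the paper's settings:

```python
import random

def build_pools(generate_image, training_images, pool_size, seed=0):
    """Draw fixed pools of fake and real images for one HYPE run."""
    rng = random.Random(seed)
    fake_pool = [generate_image() for _ in range(pool_size)]
    real_pool = rng.sample(training_images, pool_size)
    return fake_pool, real_pool

def evaluator_batch(fake_pool, real_pool, n_per_side, rng):
    """Randomly select a balanced, shuffled set of images for one evaluator."""
    batch = [(img, False) for img in rng.sample(fake_pool, n_per_side)]
    batch += [(img, True) for img in rng.sample(real_pool, n_per_side)]
    rng.shuffle(batch)
    return batch
```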

Quality of evaluators. To obtain a high-quality pool of evaluators, each is required to pass a qualification task. Such a pre-task filtering approach, sometimes referred to as a person-oriented strategy, is known to outperform process-oriented strategies that perform post-task data filtering or processing (Mitra et al., 2015). Our qualification task displays 100 images, half real and half fake, with no time limits. Evaluators must correctly classify 65% of both the real and the fake images. This threshold should be treated as a hyperparameter and may change depending upon the GANs used in the tutorial and the desired discernment ability of the chosen evaluators. We choose 65% based on the cumulative binomial probability of at least 65 correct binary choice answers out of 100 total answers: there is only a one in one-thousand chance that an evaluator will qualify by random guessing. Unlike in the task itself, fake qualification images are drawn equally from multiple different GANs to ensure an equitable qualification across all GANs. The qualification is designed to be taken occasionally, such that a pool of evaluators can assess new models on demand.
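The one-in-one-thousand figure follows from the cumulative binomial distribution; a quick check, assuming scipy is available:

```python
from scipy.stats import binom

# Chance of a random guesser getting at least 65 of 100 binary answers correct (p = 0.5).
p_pass_by_chance = binom.sf(64, n=100, p=0.5)   # survival function: P(X >= 65)
print(round(p_pass_by_chance, 4))               # ~0.002, on the order of one in a thousand
# The actual requirement (65% correct on real and fake images separately) is stricter still,
# so the true probability of qualifying by guessing is even lower.
```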

Payment. Evaluators are paid a base rate for completing the qualification task. To incentivize evaluators to remain engaged throughout the task, all further pay after the qualification comes from a bonus for each correctly labeled image, typically totaling a consistent hourly wage.

3 Experimental setup

Datasets. We evaluate on four datasets. (1) CelebA-64 (Liu et al., 2015) is a popular dataset for unconditional image generation with about 200k images of human faces, which we align and crop to 64×64px. (2) FFHQ-1024 (Karras et al., 2018) is a newer face dataset with 70k images of size 1024×1024px. (3) CIFAR-10 consists of 60k images, sized 32×32px, across 10 classes. (4) ImageNet-5 is a subset of five classes from the ImageNet dataset (Deng et al., 2009), previously identified as easy (lemon, Samoyed, library) and hard (baseball player, French horn) (Brock et al., 2018).

Architectures. We evaluate four state-of-the-art models trained on CelebA-64 and CIFAR-10: StyleGAN (Karras et al., 2018), ProGAN (Karras et al., 2017), BEGAN (Berthelot et al., 2017), and WGAN-GP (Gulrajani et al., 2017). We also evaluate two models, SN-GAN (Miyato et al., 2018) and BigGAN (Brock et al., 2018), trained on ImageNet and sampled conditionally on each class in ImageNet-5. We sample BigGAN both with and without the truncation trick (Brock et al., 2018).

We also evaluate StyleGAN (Karras et al., 2018) trained on FFHQ-1024, sampled both with and without the truncation trick (Karras et al., 2018). For parity on our best models across datasets, the StyleGAN instances trained on CelebA-64 and CIFAR-10 are also sampled with the truncation trick.

We sample noise vectors from the spherical Gaussian noise prior during training and test times. We specifically opted to use the same standard noise prior for comparison, though we are aware of other priors that optimize for FID and IS scores (Brock et al., 2018). We select the training hyperparameters published in the corresponding papers for each model.
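For concreteness, a minimal sketch of drawing latents from a standard (spherical) Gaussian prior; the latent dimensionality shown is illustrative, since it varies by model:

```python
import numpy as np

def sample_latents(n_samples, latent_dim, seed=None):
    """Draw latent noise vectors from an isotropic (spherical) Gaussian prior,
    the same prior used at training and test time."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_samples, latent_dim))

z = sample_latents(16, 512)   # 512 is an illustrative latent dimension, not a prescribed value
```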

Evaluator recruitment. We recruit evaluators from Amazon Mechanical Turk, 30 for each run of HYPE. We explain our justification for this number in the Cost tradeoffs section. To maintain a between-subjects study in this evaluation, we recruit independent evaluators across tasks and methods.

Metrics. For HYPE_time, we report the modal perceptual threshold in milliseconds. For HYPE_∞, we report the error rate as a percentage of images, as well as the breakdown of this rate on real and fake images separately. To show that our results for each model are separable, we report a one-way ANOVA with Tukey pairwise post-hoc tests to compare all models.
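A sketch of this separability analysis, assuming scipy and statsmodels; per-evaluator scores (modal exposures for HYPE_time, error rates for HYPE_∞) are grouped by model:

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def separability_tests(scores_by_model):
    """One-way ANOVA (omnibus) plus Tukey HSD pairwise post-hoc comparisons.

    `scores_by_model` maps each model name to its list of per-evaluator scores.
    """
    groups = list(scores_by_model.values())
    f_stat, p_omnibus = f_oneway(*groups)                                   # omnibus difference
    values = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    labels = np.concatenate([[name] * len(g) for name, g in scores_by_model.items()])
    tukey = pairwise_tukeyhsd(values, labels)                               # all pairwise comparisons
    return f_stat, p_omnibus, tukey.summary()
```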

Reliability is a critical component of HYPE, as a benchmark is not useful if a researcher receives a different score when rerunning it. We use bootstrapping (Felsenstein, 1985), repeated resampling from the empirical label distribution, to measure variation in scores across multiple samples with replacement from a set of labels. We report 95% bootstrapped confidence intervals (CIs), along with the standard deviation of the bootstrap sample distribution, obtained by randomly sampling evaluators with replacement from the original set of evaluators over many iterations.
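A minimal sketch of this bootstrap procedure, resampling evaluators with replacement; the iteration count here is a placeholder, not the paper's setting:

```python
import numpy as np

def bootstrap_ci(per_evaluator_scores, n_iter=10_000, alpha=0.05, seed=0):
    """95% bootstrapped confidence interval and standard deviation of a HYPE score.

    Repeatedly resamples evaluators with replacement and recomputes the mean
    model score; returns (mean, std of bootstrap means, (ci_low, ci_high)).
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_evaluator_scores, dtype=float)
    means = np.array([rng.choice(scores, size=scores.size, replace=True).mean()
                      for _ in range(n_iter)])
    ci_low, ci_high = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), means.std(), (ci_low, ci_high)
```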

Rank GAN HYPE_time (ms) Std. 95% CI
1 StyleGAN_trunc 363.2 32.1 300.0 – 424.3
2 StyleGAN_no-trunc 240.7 29.9 184.7 – 302.7
Table 1: HYPE_time on StyleGAN_trunc and StyleGAN_no-trunc trained on FFHQ-1024.

Experiment 1: We run two large-scale experiments to validate HYPE. The first focuses on the controlled evaluation and comparison of HYPE_time against HYPE_∞ on established human face datasets. We recorded responses from 30 evaluators for each of the six models (four trained on CelebA-64 and two on FFHQ-1024), once under HYPE_time and once under HYPE_∞.

Experiment 2: The second experiment evaluates HYPE_∞ on general image datasets. We recorded responses from 30 evaluators for each of the four models trained on CIFAR-10 and for each model–class pair in ImageNet-5 (three models across five classes).

4 Experiment 1: HYPE_time and HYPE_∞ on human faces

We report results on both methods and demonstrate that HYPE_∞ approximates the results from HYPE_time at a fraction of the cost and time.

4.1 HYPE_time

CelebA-64. We find that StyleGAN_trunc resulted in the highest HYPE_time score (modal exposure time), indicating that evaluators required nearly a half-second of exposure to accurately classify its images. StyleGAN_trunc is followed by ProGAN, at a substantially shorter exposure. BEGAN and WGAN-GP are both easily identifiable as fake, so they are tied in third place around the minimum possible exposure time available; both exhibit a bottoming-out effect, reaching the minimum time exposure quickly and consistently. (We do not pursue shorter time exposures due to constraints on JavaScript browser rendering times.)

To demonstrate separability between models, we report results from a one-way analysis of variance (ANOVA) test, where each model’s input is the list of modal exposure times from its evaluators. The ANOVA results confirm a statistically significant omnibus difference. Pairwise post-hoc analysis using Tukey tests confirms that all pairs of models are separable except BEGAN and WGAN-GP.

FFHQ-1024. We find that StyleGAN_trunc resulted in a higher exposure time than StyleGAN_no-trunc, at 363.2ms and 240.7ms, respectively (Table 1). While the two confidence intervals exhibit a very conservative overlap of 2.7ms, an unpaired t-test confirms that the difference between the two models is significant.
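This pairwise check can be reproduced with an unpaired t-test over per-evaluator modal exposures; a sketch (Welch's variant is an assumption here, since the text does not state whether equal variances were assumed):

```python
from scipy.stats import ttest_ind

def pairwise_significance(modes_a, modes_b):
    """Unpaired t-test on per-evaluator modal exposures from two models."""
    t_stat, p_value = ttest_ind(modes_a, modes_b, equal_var=False)  # Welch's t-test
    return t_stat, p_value
```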

4.2 HYPE_∞

CelebA-64. Table 2 reports results for HYPE_∞ on CelebA-64. We find that StyleGAN_trunc resulted in the highest score, fooling evaluators 50.7% of the time. StyleGAN_trunc is followed by ProGAN at 40.3%, BEGAN at 10.0%, and WGAN-GP at 3.8%. No confidence intervals overlap and an ANOVA test is significant. Pairwise post-hoc Tukey tests show that all pairs of models are separable. Notably, HYPE_∞ produces separable results for BEGAN and WGAN-GP, unlike HYPE_time, where they were not separable due to a bottoming-out effect.

Rank GAN HYPE_∞ (%) Fakes Error Reals Error Std. 95% CI KID FID Precision
1 StyleGAN_trunc 50.7% 62.2% 39.3% 1.3 48.2 – 53.1 0.005 131.7 0.982
2 ProGAN 40.3% 46.2% 34.4% 0.9 38.5 – 42.0 0.001 2.5 0.990
3 BEGAN 10.0% 6.2% 13.8% 1.6 7.2 – 13.3 0.056 67.7 0.326
4 WGAN-GP 3.8% 1.7% 5.9% 0.6 3.2 – 5.7 0.046 43.6 0.654
Table 2: HYPE_∞ on four GANs trained on CelebA-64. Counterintuitively, real errors increase with the errors on fake images, because evaluators become more confused and distinguishing factors between the two distributions become harder to discern.

FFHQ-1024. We observe a consistently separable difference between StyleGAN_trunc and StyleGAN_no-trunc and clear delineations between models (Table 3). StyleGAN_trunc (27.6%) ranks above StyleGAN_no-trunc (19.0%) with no overlapping CIs. Separability is confirmed by an unpaired t-test.

Rank GAN HYPE_∞ (%) Fakes Error Reals Error Std. 95% CI KID FID Precision
1 StyleGAN_trunc 27.6% 28.4% 26.8% 2.4 22.9 – 32.4 0.007 13.8 0.976
2 StyleGAN_no-trunc 19.0% 18.5% 19.5% 1.8 15.5 – 22.4 0.001 4.4 0.983
Table 3: HYPE_∞ on StyleGAN_trunc and StyleGAN_no-trunc trained on FFHQ-1024. Evaluators were deceived most often by StyleGAN_trunc. Similar to CelebA-64, fake errors and real errors track each other as the line between real and fake distributions blurs.

4.3 Cost tradeoffs with accuracy and time

Figure 4: Effect of more evaluators on CI.

One of HYPE’s goals is to be cost and time efficient. When running HYPE, there is an inherent tradeoff between accuracy and time, as well as between accuracy and cost. This is driven by the law of large numbers: recruiting additional evaluators in a crowdsourcing task often produces more consistent results, but at a higher cost (as each evaluator is paid for their work) and a longer amount of time until completion (as more evaluators must be recruited and they must complete their work).

To manage this tradeoff, we run an experiment with HYPE_∞. We completed an additional evaluation with a larger pool of evaluators and computed bootstrapped confidence intervals while varying the number of evaluators sampled (Figure 4). We see that the CI begins to converge around 30 evaluators, our recommended number of evaluators to recruit.

Payment to evaluators was calculated as described in the Approach section. At 30 evaluators, the cost of running HYPE_time on one model was several times higher than the cost of running HYPE_∞ on the same model. Payment per evaluator for both tasks was at the same hourly rate, but evaluators spent an average of one hour each on a HYPE_time task and only minutes each on a HYPE_∞ task. Thus, HYPE_∞ achieves its goal of being significantly cheaper to run than HYPE_time while maintaining consistency.

4.4 Comparison to automated metrics

As FID (Heusel et al., 2017) is one of the most frequently used evaluation methods for unconditional image generation, it is imperative to compare HYPE against FID on the same models. We also compare to two newer automated metrics: KID (Bińkowski et al., 2018), an unbiased estimator independent of sample size, and precision (Sajjadi et al., 2018), which captures fidelity independently. We show through Spearman rank-order correlation coefficients that HYPE scores are not correlated with FID, where a Spearman correlation of -1.0 is ideal because lower FID and higher HYPE scores indicate stronger models. We therefore find that FID is not highly correlated with human judgment. Meanwhile, HYPE_time and HYPE_∞ exhibit strong correlation with each other, where 1.0 is ideal because they are directly related. We calculate FID using the standard protocol of evaluating 50K generated and 50K real images for both CelebA-64 and FFHQ-1024, reproducing previously reported scores. KID and precision both show a statistically insignificant but medium level of correlation with humans.
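The rank correlations can be computed directly from the per-model scores; a sketch assuming scipy:

```python
from scipy.stats import spearmanr

def hype_vs_metric(hype_scores, metric_scores):
    """Spearman rank-order correlation between HYPE and an automated metric,
    computed over the same ordered list of models. For FID, -1.0 would be the
    ideal value, since lower FID and higher HYPE both indicate stronger models."""
    rho, p_value = spearmanr(hype_scores, metric_scores)
    return rho, p_value
```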

4.5 HYPE_∞ during model training

HYPE can also be used to evaluate progress during model training. We find that HYPE_∞ scores increased as StyleGAN training progressed across successive training checkpoints.

5 Experiment 2: HYPE_∞ beyond faces

We now turn to another popular image generation task: objects. As Experiment 1 showed HYPE_∞ to be an efficient and cost-effective variant of HYPE_time, here we focus exclusively on HYPE_∞.

5.1 ImageNet-5

We evaluate conditional image generation on five ImageNet classes (Table 4). We also report FID (Heusel et al., 2017), KID (Bińkowski et al., 2018), and precision (Sajjadi et al., 2018) scores. To evaluate the relative effectiveness of the three GANs within each object class, we compute five one-way ANOVAs, one for each object class. We find that the HYPE_∞ scores are separable for images from the three easy classes: Samoyeds (dogs), lemons, and libraries. Pairwise post-hoc Tukey tests reveal that this difference is only significant between SN-GAN and the two BigGAN variants. We also observe that models have unequal strengths; e.g., SN-GAN is better suited to generating libraries than Samoyeds.

Comparison to automated metrics. Spearman rank-order correlation coefficients on all three GANs across all five classes show that there is a low to moderate correlation between the HYPE_∞ scores and KID and FID, and a negligible correlation with precision. Some correlation for our ImageNet-5 task is expected, as these metrics use pretrained ImageNet embeddings to measure differences between generated and real data.

Interestingly, we find that this correlation depends upon the GAN: considering only SN-GAN, we find stronger coefficients for KID, FID, and precision. When considering only BigGAN, we find far weaker coefficients for all three. This illustrates an important flaw with these automated metrics: their ability to correlate with human judgment depends upon the generative model being evaluated, varying by model and by task.

GAN Class HYPE_∞ (%) Fakes Error Reals Error Std. 95% CI KID FID Precision

Easy classes
BigGAN Lemon 18.4% 21.9% 14.9% 2.3 14.2–23.1 0.043 94.22 0.784
BigGAN Lemon 20.2% 22.2% 18.1% 2.2 16.0–24.8 0.036 87.54 0.774
SN-GAN Lemon 12.0% 10.8% 13.3% 1.6 9.0–15.3 0.053 117.90 0.656
BigGAN Samoyed 19.9% 23.5% 16.2% 2.6 15.0–25.1 0.027 56.94 0.794
BigGAN Samoyed 19.7% 23.2% 16.1% 2.2 15.5–24.1 0.014 46.14 0.906
SN-GAN Samoyed 5.8% 3.4% 8.2% 0.9 4.1–7.8 0.046 88.68 0.785
BigGAN Library 17.4% 22.0% 12.8% 2.1 13.3–21.6 0.049 98.45 0.695
BigGAN Library 22.9% 28.1% 17.6% 2.1 18.9–27.2 0.029 78.49 0.814
SN-GAN Library 13.6% 15.1% 12.1% 1.9 10.0–17.5 0.043 94.89 0.814

Hard classes
BigGAN French Horn 7.3% 9.0% 5.5% 1.8 4.0–11.2 0.031 78.21 0.732
BigGAN French Horn 6.9% 8.6% 5.2% 1.4 4.3–9.9 0.042 96.18 0.757
SN-GAN French Horn 3.6% 5.0% 2.2% 1.0 1.8–5.9 0.156 196.12 0.674
BigGAN Baseball Player 1.9% 1.9% 1.9% 0.7 0.8–3.5 0.049 91.31 0.853
BigGAN Baseball Player 2.2% 3.3% 1.2% 0.6 1.3–3.5 0.026 76.71 0.838
SN-GAN Baseball Player 2.8% 3.6% 1.9% 1.5 0.8–6.2 0.052 105.82 0.785
Table 4: HYPE_∞ on three models trained on ImageNet and conditionally sampled on five classes; the two BigGAN rows per class correspond to sampling with and without the truncation trick. BigGAN routinely outperforms SN-GAN. The two BigGAN variants are not separable.
GAN HYPE_∞ (%) Fakes Error Reals Error Std. 95% CI KID FID Precision
StyleGAN_trunc 23.3% 28.2% 18.5% 1.6 20.1–26.4 0.005 62.9 0.982
ProGAN 14.8% 18.5% 11.0% 1.6 11.9–18.0 0.001 53.2 0.990
BEGAN 14.5% 14.6% 14.5% 1.7 11.3–18.1 0.056 96.2 0.326
WGAN-GP 13.2% 15.3% 11.1% 2.3 9.1–18.1 0.046 104.0 0.654
Table 5: HYPE_∞ on four models trained on CIFAR-10. StyleGAN_trunc can generate realistic images from CIFAR-10.

5.2 CIFAR-10

For the difficult task of unconditional generation on CIFAR-10, we use the same four model architectures as in Experiment 1 on CelebA-64. Table 5 shows that HYPE_∞ was able to separate StyleGAN_trunc from the earlier BEGAN, WGAN-GP, and ProGAN, indicating that StyleGAN is the first among them to make human-perceptible progress on unconditional object generation with CIFAR-10.

Comparison to automated metrics. Spearman rank-order correlation coefficients on all four GANs show medium, yet statistically insignificant, correlations with KID, FID, and precision.

6 Discussion and conclusion

Envisioned Use. We created HYPE as a turnkey solution for human evaluation of generative models. Researchers can upload their model, receive a score, and compare progress via our online deployment. During periods of high usage, such as competitions, a retainer model (Bernstein et al., 2011) enables evaluation with HYPE_∞ to return scores within minutes, rather than the longer default turnaround.

Limitations. Extensions of HYPE to other tasks may require different task designs. In the case of text generation (translation, caption generation), HYPE_time will require much longer and higher-range adjustments to the perceptual time thresholds (Krishna et al., 2017; Weld et al., 2015). In addition to measuring realism, other metrics such as diversity, overfitting, entanglement, training stability, and computational and sample efficiency are additional benchmarks that could be incorporated but are outside the scope of this paper. Some may be better suited to a fully automated evaluation (Borji, 2018; Lucic et al., 2018).

Conclusion. HYPE provides two human evaluation benchmarks for generative models that (1) are grounded in psychophysics, (2) provide task designs that produce reliable results, (3) separate model performance, and (4) are cost and time efficient. We introduce two benchmarks: HYPE_time, which uses time-limited perceptual thresholds, and HYPE_∞, which reports the error rate sans time constraints. We demonstrate the efficacy of our approach on image generation across six models {StyleGAN, SN-GAN, BigGAN, ProGAN, BEGAN, WGAN-GP}, four image datasets {CelebA-64, FFHQ-1024, CIFAR-10, ImageNet-5}, and two sampling methods {with, without the truncation trick}.

References