Improving image generative models with human interactions

09/29/2017 ∙ by Andrew Kyle Lampinen, et al.

GANs provide a framework for training generative models which mimic a data distribution. However, in many cases we wish to train these generative models to optimize some auxiliary objective function within the data they generate, such as making more aesthetically pleasing images. In some cases, these objective functions are difficult to evaluate, e.g. they may require human interaction. Here, we develop a system for efficiently improving a GAN to target an objective involving human interaction, specifically generating images that increase rates of positive user interactions. To improve the generative model, we build a model of human behavior in the targeted domain from a relatively small set of interactions, and then use this behavioral model as an auxiliary loss function to improve the generative model. We show that this system is successful at improving positive interaction rates, at least on simulated data, and characterize some of the factors that affect its performance.

1 Introduction

Generative image models have improved rapidly in the past few years, in part because of the success of Generative Adversarial Networks, or GANs Goodfellow2014GenerativeNetworks . GANs attempt to train a “generator” to create images which mimic real images by training it to fool an adversarial “discriminator,” which attempts to discern whether images are real or fake. This is one solution to the difficult problem of learning when we don’t know how to write down an objective function for image quality: take an empirical distribution of “good” images, and try to match it.

Often, we want to impose additional constraints on our goal distribution besides simply matching empirical data. If we can write down an objective which reflects our goals (even approximately), we can often simply incorporate this into the loss function to achieve our goals. For example, when trying to generate art, we would like our network to be creative and innovative rather than just imitating previous styles, and including a penalty in the loss for producing recognized styles appears to make GANs more creative Elgammal2017CAN:Norms . Conditioning on image content class, training the discriminator to classify image content as well as making real/fake judgements, and including a loss term for fooling the discriminator on class together allow for targeted image generation and improve overall performance Odena2016ConditionalGANs .

However, sometimes it is not easy to write an explicit objective that reflects our goals. Often the only effective way to evaluate machine learning systems on complex tasks is by asking humans to determine the quality of their results (e.g. Christiano2017DeepPreferences ) or by actually trying them out in the real world. Can we incorporate this kind of feedback to efficiently guide a generative model toward producing better results? Can we do so without a prohibitively expensive and slow amount of data collection? In this paper, we tackle a specific problem of this kind: generating images that cause more positive user interactions. We imagine these are measured by a generic Positive Interaction Rate (PIR), which could come from a wide variety of sources.

For example, users might be asked to rate how aesthetically pleasing an image is from 1 to 5 stars. The PIR could be computed as a weighted sum of how frequently different ratings were chosen. Alternatively, these images could be used in the background of web pages. We can assess user interactions with a webpage in a variety of ways (time on page, clicks, shares, etc.), and summarize these interactions as the PIR. In both of these tasks, we don’t know exactly what features will affect the PIR, and we certainly don’t know how to explicitly compute the PIR for an image. However, we can empirically determine the quality of an image by actually showing it to users, and in this paper we show how to use a small amount of this data (results on 1000 images) to efficiently tune a generative model to produce images which increase PIR. In this work we focus on simulated PIR values as a proof of concept, but in future work we will investigate PIR values obtained from real user interactions.
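For concreteness, the star-rating version of the PIR mentioned above might be computed as a frequency-weighted sum like the small sketch below; the specific weights are illustrative, not taken from the paper:

```python
import numpy as np

# Hypothetical PIR from 1-5 star ratings: weight each star level and take the
# frequency-weighted sum (the weights here are illustrative assumptions).
rating_counts = np.array([50, 100, 300, 350, 200])   # counts for 1..5 stars
weights = np.array([0.0, 0.25, 0.5, 0.75, 1.0])      # map star levels to [0, 1]
pir = (rating_counts / rating_counts.sum()) @ weights
print(round(float(pir), 3))   # a single scalar positive interaction rate
```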

2 Approach

The most straightforward way to improve our image GAN might be to evaluate the images the model produces with real users at each training step. However, this process is far too slow. Instead, we want to be able to collect PIR data on a batch of images, and then use it to improve the generative model for many gradient steps; we want to do this despite the fact that the images the generator is producing may evolve to be very different from the original images we collected PIR data on. In order to do this, we use the batch of image and PIR data to train a “PIR Estimator Model” which predicts PIRs from images. We then use these estimated PIRs at each training step to produce a loss.

Our approach is inspired by the work of Christiano and colleagues Christiano2017DeepPreferences , who integrated human preference ratings between action sequences into the training of a reinforcement learning model. However, our problem and approach differ in several key ways. First, we are optimizing a generative image model rather than an RL model. This is more difficult in some ways, since the output space is much higher-dimensional than in typical RL problems, which means that low-dimensional feedback (like a scalar PIR) may be harder for the system to learn from. This difficulty is partially offset by the fact that we assume we get “reward” (PIR) information for an image when we evaluate it, instead of just getting a preference which we have to map to an underlying reward function. Perhaps most importantly, we use our PIR estimation model as a fully-differentiable loss function for training, instead of just using its estimated rewards. This allows us to more effectively exploit its knowledge of the objective function (at the risk of overfitting to its approximation of the real PIR function).

2.1 System

Figure 1: Diagram of our system. The GAN generates images, which are shown to users, who generate interactions; these interactions are used to train the PIR Estimator Model, whose output is used as a loss to improve the GAN.

Our system consists of three components:

  1. A generative image model.

  2. Users to interact with the generated images in some way.

  3. A PIR estimator that models user interactions given an image.

See Fig. 1 for a diagram of the general operation of our system. The generative model produces images, which are served to users. Using interaction data from these users, we train the PIR estimator model, which predicts PIRs given a background image, and then incorporate this estimated PIR into the loss of the generative model to tune it to produce higher quality images. Below, we discuss each of these components in more detail.

2.1.1 Generative model

We begin with a GAN Goodfellow2014GenerativeNetworks which we pre-trained to produce images from a target distribution (specifically, landscapes of mountains and coasts). (In fact, we used an ACGAN Odena2016ConditionalGANs because it generated better initial images, but we present the results here in terms of a GAN for clarity. See Appendix A for details of the ACGAN.) Let $D_S$ be the source (real vs. generated) probability estimated by the discriminator, $G$ the generator, $z$ a noise input to the generator sampled from a multivariate standard normal distribution $\mathcal{N}(0, I)$, and $X$ the set of real images shown to the discriminator. Define:

$$L_{\text{real}} = \mathbb{E}_{x \sim X}\left[\log D_S(x)\right], \qquad L_{\text{fake}} = \mathbb{E}_{z \sim \mathcal{N}(0, I)}\left[\log\left(1 - D_S(G(z))\right)\right], \qquad L_{\text{gen}} = \mathbb{E}_{z \sim \mathcal{N}(0, I)}\left[\log D_S(G(z))\right]$$

Then the discriminator and generator are trained to maximize the following losses (respectively), where the $\lambda$ are weights set as hyperparameters:

$$L_D = \lambda_{\text{real}}\, L_{\text{real}} + \lambda_{\text{fake}}\, L_{\text{fake}}, \qquad L_G = \lambda_{\text{gen}}\, L_{\text{gen}}$$

Note that there is a difference between these losses and the standard GAN formulation given in Goodfellow2014GenerativeNetworks – we maximize $L_{\text{gen}} = \mathbb{E}_{z}\left[\log D_S(G(z))\right]$ rather than minimizing $\mathbb{E}_{z}\left[\log\left(1 - D_S(G(z))\right)\right]$. This seems to result in the generation of slightly better images in practice.
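To make this distinction concrete, here is a minimal sketch (in PyTorch, chosen purely for illustration) of the two generator formulations, assuming the discriminator outputs a logit for its source prediction:

```python
import torch

def gen_loss_nonsaturating(d_fake_logits):
    """The formulation used here: maximize log D_S(G(z))."""
    return torch.log(torch.sigmoid(d_fake_logits)).mean()

def gen_loss_minimax(d_fake_logits):
    """The standard alternative: minimize log(1 - D_S(G(z))), which tends to
    give weak gradients early in training when the discriminator is confident."""
    return torch.log(1 - torch.sigmoid(d_fake_logits)).mean()
```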

We parameterized the generator as a deep neural network, which begins with a fully-connected mapping from the latent (noise) space to a low-resolution image, and then repeatedly upsamples (by a factor of 2, nearest neighbor), pads, applies a convolution (stride of 1), and applies a leaky ReLU nonlinearity. We repeated this process 5 times (except with no upsampling on the first step, and a tanh nonlinearity in place of the leaky ReLU on the last), while stepping the image depth down as follows: 512, 512, 256, 128, 64, and finally 3 (RGB) for the output image.
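A minimal PyTorch sketch of this generator follows; the 3x3 kernel size, leaky-ReLU slope of 0.2, and initial 4x4 spatial resolution are our assumptions (those values did not survive extraction), so treat this as illustrative rather than the exact architecture:

```python
import torch
from torch import nn

class Generator(nn.Module):
    """Fully-connected projection followed by 5 upsample/conv/leaky-ReLU blocks,
    stepping depth 512 -> 512 -> 256 -> 128 -> 64 -> 3, with tanh on the output."""
    def __init__(self, latent_size=64):
        super().__init__()
        depths = [512, 512, 256, 128, 64, 3]
        self.fc = nn.Linear(latent_size, depths[0] * 4 * 4)  # assumed 4x4 start
        blocks = []
        for i in range(5):
            if i > 0:  # no upsampling on the first step
                blocks.append(nn.Upsample(scale_factor=2, mode="nearest"))
            blocks.append(nn.Conv2d(depths[i], depths[i + 1], kernel_size=3,
                                    stride=1, padding=1))
            blocks.append(nn.Tanh() if i == 4 else nn.LeakyReLU(0.2))
        self.blocks = nn.Sequential(*blocks)

    def forward(self, z):
        x = self.fc(z).view(z.size(0), 512, 4, 4)
        return self.blocks(x)

# Generator()(torch.randn(8, 64)) yields a batch of shape [8, 3, 64, 64]
# under the assumed 4x4 starting resolution.
```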

We parameterized the discriminator as a convolutional network with 7 layers: 6 convolutions (strides 2, 1, 2, 1, 2, 1; dropout after the 1st, 3rd, and 5th layers; filter depths 16, 32, 64, 128, 256, 512; batch normalization after each layer) followed by a fully connected layer to a single output for real/fake. We used a leaky ReLU nonlinearity after each layer, except the final layer, where we used a tanh.

This GAN was trained on a dataset consisting of landscape images of mountains and coastlines, generated from Google Image searches for these topics. The generator was trained with the Adam optimizer Kingma2015Adam:Optimization , and the discriminator with RMSProp, with the same learning rate for both. We used a latent size of 64 units. The model was trained until the generated images appeared to stop improving.

It is worth noting that this generative model is not photorealistic (see Fig. 4(a) for some samples). Its expressive capacity is limited, and it has clear output modes with limited intra-mode variability. However, for our purposes this may not matter. Indeed, it is in some ways more interesting if we can tweak this model to optimize for many objective functions, since its limited expressive capacity will make it more difficult for us to estimate and pursue the real objective – a limited set of images will effectively give us fewer points from which to estimate the PIR function, and will reduce the space in which the model can easily produce images, thus reducing the possibility of obtaining highly optimal images from the model. For example, a model which produces images of birds may not produce data points which provide good estimates of a PIR based on how much the image looks like a car, and even if it could, it may not be able to produce images which are “more car-like.” If we are able to succeed in improving PIRs with this generative model, it is likely that a better generative model would yield even better results.

2.1.2 User interactions

We will show these images to users in a variety of ways, depending on our target domain. For the purposes of this paper, however, we will use simulated interaction data. See section 3 for more information about how we produced the simulated data.

Of course, since showing images to users is an expensive prospect, we wanted to limit the size of the datasets we used to train the model. Typical datasets used to train vision models are on the order of millions of images (e.g. ImageNet Russakovsky2015ImageNetChallenge ), but it is completely infeasible to collect user data on this number of images. We estimated that we could show 1000 images 1000 times each to generate our datasets. We used these dataset sizes and numbers of impressions for all experiments discussed here, and added noise to the PIRs that was binomially distributed according to the number of times each image was shown and the “true” PIR from the objective.
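For instance, this observation noise can be simulated by drawing the number of positive interactions from a binomial distribution and converting back to a rate (a small sketch, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_pir(true_pir, impressions=1000):
    """Observed PIR for an image shown `impressions` times: the empirical
    positive-interaction rate, binomially distributed around the true PIR."""
    return rng.binomial(impressions, true_pir) / impressions

# e.g. noisy_pir(0.12) returns roughly 0.12 +/- binomial sampling noise
```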

2.1.3 PIR estimator model

The final component of our system is the PIR estimator model, which learns to predict PIR from a background image; we denote this model by $\hat{P}$. We parameterize this model as a deep neural network. Specifically, we take the Inception v2 architecture Szegedy2016RethinkingVision , remove the output layer, and replace it with a fully-connected layer producing PIR estimates. Instead of predicting PIR as a scalar directly, we predict it by classifying into 100 bins via a softmax, which performs better empirically. This choice was motivated by noting that the scalar version was having trouble fitting some highly multi-modal distributions that appear in the data.

We initialize the Inception v2 parameters from a version of the model trained on an internal version of the JFT dataset Sun2017RevisitingEra , and trained the PIR estimator with the Adam optimizer. When evaluating, and when using this model for improving the GAN, we froze the weights of the PIR estimator. We also reduced the output softmax’s temperature to 0.01, so that it behaved almost like a max, which empirically improved results. Intuitively, a higher softmax temperature during training allows the system to rapidly get gradients from many different output bins and adjust the distribution appropriately, whereas when actually using the system we want to be conservative with our estimates and not be too biased by low-probability bins far from the modal estimate.
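As a concrete illustration of the 100-bin readout and the temperature trick, the scalar PIR estimate can be computed from the bin logits as sketched below; the placement of the bin centers on [0, 1] is our assumption:

```python
import numpy as np

BIN_CENTERS = (np.arange(100) + 0.5) / 100  # assumed: 100 equal PIR bins in [0, 1]

def expected_pir(logits, temperature=1.0):
    """Convert 100-way bin logits into a scalar PIR estimate. Training uses the
    ordinary softmax; when tuning the GAN the temperature is lowered to 0.01,
    so the estimate is dominated by the modal bin."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()                       # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return float(probs @ BIN_CENTERS)
```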

Why did we not make estimated PIR simply another auxiliary output from the discriminator, like class in the ACGAN (Appendix A)? Because the PIR estimator needs to be held constant in order to provide an accurate training objective. If the PIR estimates were produced by the discriminator, then as the discriminator changed to accurately discriminate the evolving generator images, the PIR estimates would tend to drift without a ground-truth to train them on. Separating the discriminator and the PIR estimator allows us to freeze the PIR estimator while still letting the discriminator adapt.

2.1.4 Integration

Once we have trained a PIR estimator model, we have to use it to improve the GAN. We do this as follows. Let $\hat{P}$ denote the PIR estimator model, as above. Define $L_{\text{PIR}}$ to be the expectation of the estimated PIR over images sampled from the generator:

$$L_{\text{PIR}} = \mathbb{E}_{z \sim \mathcal{N}(0, I)}\left[\hat{P}(G(z))\right]$$

Then we simply supplement the generator loss by adding this term times a weight $\lambda_{\text{PIR}}$, set as a hyperparameter:

$$L_G' = L_G + \lambda_{\text{PIR}}\, L_{\text{PIR}}$$

We set $\lambda_{\text{PIR}}$ so that the magnitude of the PIR loss and the other loss terms were roughly comparable. Otherwise we used the same parameters as in the GAN training above, except that we reduced the learning rate, to allow the system to adapt more smoothly to the multiple objectives, and we trained for 50,000 steps.
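A minimal sketch of this combined objective (assuming the discriminator outputs source logits, that the PIR estimator is applied to the same generated batch, and illustrative weights):

```python
import torch

def tuned_generator_objective(d_fake_logits, estimated_pirs,
                              lambda_gen=1.0, lambda_pir=1.0):
    """Generator objective while tuning: the usual 'fool the discriminator'
    term plus the weighted mean estimated PIR of the generated batch.
    Returns a quantity to maximize (minimize its negative in practice),
    with the PIR estimator's weights kept frozen."""
    l_gen = torch.log(torch.sigmoid(d_fake_logits)).mean()  # E[log D_S(G(z))]
    l_pir = estimated_pirs.mean()                           # E[P_hat(G(z))]
    return lambda_gen * l_gen + lambda_pir * l_pir
```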

3 Data

In this paper, we use simulated interaction data as proof of concept. This raises an issue: what functions should we use to simulate user interactions? Human behavior is complex, and if we already knew precisely what guided user interactions, there would be no need to actually collect human behavioral data at all. Since we don’t know what features will guide human behavior, the next best thing we can do is to ensure that our system is able to alter the image generation model in a broad variety of ways, ranging from low level features (like making the images more colorful) to altering complex semantic features (such as including more plants in outdoor scenery). We also want to avoid hand-engineering tasks to the greatest extent possible.

3.1 VGG Features

The first approach we took to evaluating our system’s ability to train for different features was to use activity from hidden layers of a computer vision model, specifically VGG 16 Simonyan2014VeryRecognition trained on ImageNet Russakovsky2015ImageNetChallenge . In particular, we took the norm of the activity of one filter within a layer, and normalized it by the norm of the total layer’s activity. That is, letting $a_{l,f}(x)$ be the vector of unit activations in filter $f$ of layer $l$ of VGG 16 on image $x$, and $a_l(x)$ the vector of all unit activations in layer $l$, we computed the PIR for that image and a given layer and filter as:

$$\text{PIR}(x) = \frac{\lVert a_{l,f}(x) \rVert_2}{\lVert a_l(x) \rVert_2}$$

(Note that if we did not normalize by the activity in the whole layer, the system might be able to “cheat” to improve the PIR by just increasing the contrast of the images, which would likely increase overall network activity.) As noted above, we also added binomially distributed noise to these PIRs.
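This computation can be sketched as follows with torchvision's VGG 16 (pretrained-weight loading and preprocessing are omitted; the sketch covers the convolutional/pooling layers, while objectives on the fully-connected layers would instead read from the classifier head):

```python
import torch
from torchvision.models import vgg16

def vgg_filter_pir(image, layer_index, filter_index, features=None):
    """Simulated PIR: L2 norm of one filter's activations at a chosen layer,
    normalized by the L2 norm of the whole layer's activations.
    `image` is a [3, 224, 224] tensor."""
    if features is None:
        features = vgg16().features.eval()  # load ImageNet weights in practice
    with torch.no_grad():
        acts = features[:layer_index + 1](image.unsqueeze(0))  # [1, C, H, W]
    return (acts[0, filter_index].norm() / acts.norm()).item()
```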

This approach to simulating PIRs has several benefits. First, it gives a wide variety of complex objectives that can nevertheless be easily computed to simulate data. Second, models like VGG exhibit hierarchical organization, where lower levels generally respond to lower-level features such as edges and colors, while higher levels respond to higher-level semantic features such as faces Zeiler2014Visualizing2013 . This means we can modulate the PIR by the activity of randomly chosen filters in a layer, and then compare performance across different layers in order to evaluate what types of objectives our system is capable of optimizing for effectively. Furthermore, since CNN features at various levels are related to those in corresponding levels of human and macaque visual cortex Yamins2014Performance-optimizedCortex , we might hope that a system which is capable of optimizing for these features is also capable of optimizing for human preferences.

There are some caveats to this approach, however. First, although the higher layers of CNNs are somewhat selective for “abstract” object categories, they are also fooled by adversarial images that humans would not be, and directly optimizing inputs for these high-level features does not actually produce semantically meaningful images Yosinski2014DeepImages . Thus, even if our system succeeds in increasing activity in a targeted layer which is semantically selective, it will likely do so by adversarially exploiting particulars of VGG 16’s parameterization of the classification problem (although the fact that we are not backpropagating through the true objective will make this harder). It is not necessarily a failure of the system if it exploits simple features of the objective it is given to increase PIRs – indeed, it should be seen as a success. However, success on this task does not necessarily guarantee success at changing semantic-level content when interacting with actual humans.

Indeed, it may be easier for the PIR estimator model (which is based on a CNN) to learn objectives which come from another CNN than more general possible objectives. The fact that adversarial examples can sometimes transfer between networks with different architectures Liu2016DelvingAttacks suggests that the computations being performed by these networks are somewhat architecture invariant. Thus CNN objectives may be easier for our PIR estimator to learn than real world objectives.

We have tried to minimize these problems to the greatest extent possible by using different network architectures (Inception V2 and VGG 16, respectively) trained on different datasets (JFT Sun2017RevisitingEra and ImageNet Russakovsky2015ImageNetChallenge , respectively) for the estimator and the objective. However, we cannot be certain that the network is not “cheating” in some way on the VGG 16 tasks, so our results must be considered with this qualification in mind.

Despite these qualifications, we think that evaluating our system’s ability to optimize for objectives generated from various layers of VGG will show its ability to optimize for a variety of complex objectives, and thus will serve as a useful indicator of its potential to improve PIRs from real users.

3.2 Multiple filters

Figure 2: Percent non-zero PIRs vs. effect size (Single-filter VGG tasks and color tasks)

After using our system on the tasks above, we noted that its performance was quite poor at layers 5, 6, and 7 of VGG compared to other tasks (see Fig. 3). This could suggest that our system was unable to capture the complex features represented at the higher levels of VGG. However, we also noticed that the feature representations at these layers tended to be quite sparse, so many of the simulated PIRs we generated were actually zero to within the bin width of our PIR estimator. Respectively, these layers had only 20%, 25%, and 24% non-zero PIRs (collapsing across filters), and around half the filters in each (resp. 6, 4, and 5) were producing zero PIRs. (By contrast, the layer with the next greatest number of zero PIRs, fc8, still had 69% nonzero PIRs overall, and had no filters in which 90% or more of the PIRs were zero.) In a few cases on layers 5, 6, and 7, all of the generated PIRs were zero. This clearly makes learning infeasible, and indeed we noted that there was a strong relationship between number of non-zero simulated PIRs in the training dataset and the ability of our system to improve PIR (see Fig. 2). This is somewhat troubling, since probably most images in the real world will not produce a PIR that is truly zero.

Thus, in order to evaluate whether the poor performance of our system at the higher layers of VGG was attributable to the number of zeros or to the complexity of the features, we created less sparse features from these layers by simply targeting a set of filters sampled without replacement from the layer, rather than a single filter. We did this by taking the norm across the target filters (or equivalently, by summing the squared norms of the filters before taking the square root), and then normalizing by the activity in the layer as before. Formally, letting $F$ be a set of filter indices sampled without replacement from layer $l$, we computed the PIR for an image $x$ as:

$$\text{PIR}(x) = \frac{\sqrt{\sum_{f \in F} \lVert a_{l,f}(x) \rVert_2^2}}{\lVert a_l(x) \rVert_2}$$

The single-filter cases above can be thought of as a special case of this, where $|F| = 1$. To complement these, we also tried $|F| = 20$. As above, we also added binomially distributed noise to these PIRs.
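In code, this is a one-line generalization of the single-filter sketch above (again illustrative, operating on a precomputed activation tensor):

```python
import torch

def vgg_filter_set_pir(acts, filter_indices):
    """Multi-filter PIR: the norm across a set F of filters (the square root
    of the summed squared filter norms), normalized by the norm of the whole
    layer. `acts` is a [1, C, H, W] activation tensor; `filter_indices` is a
    list of filter indices."""
    return (acts[0, filter_indices].norm() / acts.norm()).item()
```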

This can also be thought of as perhaps a more realistic simulation of human behavior, in the sense that it is highly unlikely that there is a single feature which influences human PIRs. Rather, there are probably many related features which influence PIR in various ways. Thus it is important to evaluate our system’s ability to target these types of features as well.

3.3 Colors

Finally, we also considered some simpler objectives based on targeting specific colors in the output images. Analogously to the VGG features, we computed the PIRs from the vector norm of a given image in the targeted color, normalized by the total image value. We considered several objectives of this type:

Single color:

Optimizing for a single color in the output image; e.g., for red the objective would be the norm of the image’s red channel divided by the norm of the whole image, $\text{PIR}(x) = \lVert x_R \rVert_2 \,/\, \lVert x \rVert_2$.

Two color:

We split the image horizontally into a left and right half, and then computed PIR from one color in the left half and a different color in the right half.

Three color:

Similar to two color, but split the image into thirds, and computed PIR from a different color in each third.

(As above, we also added binomially distributed noise to these PIRs.) These objectives provide a useful complement to the VGG objectives discussed in section 3.1. Although the single color objectives may be relevant to the classification task VGG 16 performs, the split color tasks are less likely to be relevant to classification. Note that it is important that we split the images along the width instead of the height dimension, as there may well be semantically relevant features corresponding to color divisions along the height dimension, e.g. a blue upper half and green lower half likely correlates with outdoor images, which would provide useful class information. By contrast, it is harder to imagine circumstances where different colors on the left and right halves of the image are semantically predictive, especially since flipping left to right is usually included in the data augmentation for computer vision systems. Thus success on optimizing for these objectives would increase our confidence in the generality of our system.
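A small sketch of these simulated color objectives follows; how the per-strip scores are combined into a single PIR is not stated, so averaging them is our assumption:

```python
import numpy as np

def color_pir(image, channels):
    """Color PIR: split the image into len(channels) vertical strips (left to
    right) and score each strip by the norm of its target channel divided by
    the norm of the whole strip; the strip scores are then averaged (our
    assumption). `image` is an [H, W, 3] array; `channels` is e.g. [2] for
    blue, [0, 1] for a red/green split, or [1, 2, 0] for the GBR objective."""
    strips = np.array_split(image, len(channels), axis=1)  # split along width
    scores = [np.linalg.norm(strip[..., c]) / (np.linalg.norm(strip) + 1e-8)
              for strip, c in zip(strips, channels)]
    return float(np.mean(scores))

# e.g. color_pir(np.random.rand(64, 64, 3), [0, 1])  # red left half, green right half
```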

Figure 3: Change produced by the system on different objectives. (a) Change in mean PIR on a variety of tasks. (b) Effect size (number of standard deviations change in mean PIR) on a variety of tasks. Values greater than zero indicate improvement. Each panel represents a set of tasks (such as single filters from VGG), each color/column represents a subset (such as a single layer within VGG), and each point represents one objective (such as a single filter from that layer) for which we optimized the generative model. Points for which the change in mean PIR is not significant by a Welch’s t-test are partially transparent. The reduced number of points for the pool1 and pool2 layers on the 20 VGG filters task is because those layers have only 64 and 128 filters, respectively.

4 Results

We present our results in terms of the change in mean PIR from 1000 images produced by the GAN before tuning to 1000 images produced after tuning, or in terms of the effect size of this change (Cohen’s $d$, i.e. the change in mean PIR standardized by the standard deviation of the PIRs in the pre- and post-tuning image sets). We assess whether these changes are significant by performing a Welch’s t-test (with a fixed significance threshold) between the pre- and post-tuning image sets.
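These summary statistics can be computed as follows (a sketch; the pooled-standard-deviation form of Cohen's $d$ is our assumption):

```python
import numpy as np
from scipy import stats

def evaluate_change(pre_pirs, post_pirs):
    """Change in mean PIR, Cohen's d, and the Welch's t-test p-value between
    PIRs of images sampled before and after tuning the generator."""
    pre, post = np.asarray(pre_pirs), np.asarray(post_pirs)
    delta = post.mean() - pre.mean()
    pooled_sd = np.sqrt((pre.var(ddof=1) + post.var(ddof=1)) / 2)
    cohens_d = delta / pooled_sd
    _, p_value = stats.ttest_ind(post, pre, equal_var=False)  # Welch's t-test
    return delta, cohens_d, p_value
```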

Overall, our system was quite successful at improving PIRs across a range of simulated objective functions (see Fig. 3). Below, we discuss these results in more detail.

4.1 VGG objectives

Our system largely succeeded at increasing PIRs on a variety of VGG objectives (see Fig. 3). However, there are several interesting patterns to note.

First, the system is not particularly successful at targeting single filters from the pool5, fc6, and fc7 layers. However, we believe this is due to the fact that filters in these layers produce relatively sparse activations (see section 3.2). Indeed, the performance of the system seems much more consistent when it is optimizing for sets of 20 filters than for single filters.

Even when using 20 filters, however, there is a noticeable decline in the effect size of the improvement our system is able to make at higher layers of VGG (a significant per-layer decline in a linear model controlling for initial standard deviation and percent of zero PIRs). This suggests that as the objectives grow more complex, the system may be finding less accurate approximations to them. However, the system’s continuing (if diminished) success at the higher layers of VGG suggests that our model is capable of at least partially capturing complex objective functions.

4.2 Color objectives

(a) Pre-training images
(b) Optimized for red
(c) Optimized for green
(d) Optimized for blue
(e) Optimized for red-green
(f) Optimized for green-blue
(g) Optimized for GBR
(h) Optimized for BRG
Figure 4: Color objective sample images. Samples are randomly drawn, not cherry-picked.

Overall, the system performed quite well at optimizing for the color objectives, particularly the single and two-color results (see Fig. 3). It had more difficulty optimizing for the three-color results, and indeed had produced only very small improvements after the usual 50,000 tuning steps for the generative model, but after 500,000 steps it was able to produce significant improvements for two out of the three objectives (these longer training results are the ones included here).

Because the color objectives are easiest to assess visually, we have included results for a variety of these objectives in Fig. 4. For the single color objectives, the improvement is quite clear; for example, the images in Fig. 4(d) appear much more blue than the pre-training ones. For the two color objectives, it appears that the system found the “trick” of reducing the third color; for example, the red-green split images in Fig. 4(e) appear much less blue than the pre-training images. Even on the three-color images where the system struggled, there are some visible signs of improvement; for example, on the green-blue-red task the system has started producing a number of images with a blue streak in the middle.

4.3 Other analyses

4.3.1 Introspection

Figure 5: Change in mean PIR vs. estimated change in mean PIR. Points for which the change in mean PIR is not significant by a Welch’s t-test are partially transparent.

Because $L_{\text{PIR}}$ is just the expected value of the estimated PIR, by looking at $L_{\text{PIR}}$ before and after tuning the generative model, we can tell how well the system thinks it is doing, i.e. how much it estimates that it improved PIR. This comparison reveals the interesting pattern that the system is overly pessimistic about its performance. In fact, it tends to underestimate its improvement by a factor of more than 1.5 (when regressing change in mean PIR on predicted change in mean PIR; see Fig. 5). However, it does so fairly consistently. This effect appears to be driven by the system consistently underestimating the (absolute) PIRs, which is probably caused by our change in the softmax temperature between training the PIR estimator and tuning the generative model (which we empirically found improves performance, as noted above).

This is in contrast to the possible a priori expectation that the model would systematically over-estimate its performance, because it is overfitting to an imperfectly estimated objective function. Although decreasing the softmax temperature between training and using the PIR obscures this effect, we do see some evidence of this; the more complex objectives (which the system produced lower effect sizes on) seem to both have lower estimated changes in mean PIR and true changes in PIR which are even lower than the estimated ones (see Fig. 5). Thus although the system is somewhat aware of its reduced effectiveness with these objectives (as evidenced by the lower estimates of change in mean PIR), it is not reducing its estimates sufficiently to account for the true difficulty of the objectives (as evidenced by the fact that the true change in PIR is even lower than the estimates). However, the system was generally still able to obtain positive results on these objectives (see Fig. 3).

4.3.2 Initial variability

Figure 6: Change in mean PIR vs. initial variability. Points for which the change in mean PIR is not significant by a Welch’s t-test are partially transparent.

There is a general trend (see Fig. 6) that the variability in PIR in the initial dataset is strongly positively related to the change in PIR the system is able to produce (in a linear model controlling for initial mean PIR and the initial percent of values that are zero). In fact, initial standard deviation explains about 50% of the variance in the change in mean PIR. This is perhaps not too surprising – more variability means that the generative model has the capacity to produce higher-PIR images without too much tweaking, and that the PIR estimator model gets a wider range of values to learn from. Still, when attempting to use this system in practice, it is important to keep in mind that starting with a sufficiently expressive generative model is more likely to produce good results than starting with a more limited model.

4.3.3 Iteration

Figure 7: Effect size evolution over two iterations

Given that our model improves PIRs, an obvious question is whether we can iterate the process. Once we have increased PIRs, can we train a new PIR estimator on samples from our new generative model, and use that to increase PIRs again? If we could iterate this many times, we might be able to create much larger improvements in PIR than we can on a single step. On the other hand, it is possible that after a single step of optimization we will effectively have saturated the easily achievable improvement in the model, and further steps will not result in much improvement.

To evaluate this, we took the subset of models trained on single filters from VGG layers pool2 and fc8, used the set of images generated from the post-tuning model along with the original set of images to tune their PIR estimators for another 25,000 steps, and then tuned the generative model using this updated PIR estimator for another 50,000 steps. We then evaluated them as before; see Fig. 7 for the results. The second-iteration results were mixed: while the pool2 models all improved from the first step to the second, none of them improved as much as they had on the first step, and many of the fc8 models actually performed worse after the second step. However, it is possible that further hyperparameter tuning could improve these results, and it is certainly the case that running multiple steps of iteration and selecting the best model by experimentation could yield better results, so this is worth investigating further.

5 Discussion

Overall, our system appears to be relatively successful. It can optimize a generative model to produce images targeting functions ranging from low-level visual features such as colors and early features of VGG to features computed at the top layers of VGG. This success across a wide variety of objective functions allows us to be somewhat confident that our system will be able to achieve success in optimizing for real human interactions.

Furthermore, the system did not require an inordinate amount of training data. In fact, we were able to successfully estimate many different objective functions from only 1000 images, several orders of magnitude fewer than are typically used to train CNNs for vision tasks. Moreover, these images came from a very biased and narrow distribution (samples from our generative model), which is reflective of neither the images that were used to pre-train the Inception model in the PIR estimator, nor the images the VGG model (which produced the simulated objectives) was trained on. Our success from this small amount of data suggests that not only will our system be able to optimize for real human interactions, it will be able to do so from a feasible number of training points.

These results are exciting – the model is able to approximate apparently complex objective functions from a small amount of data, even though this data comes from a very biased distribution that is unrelated to most of the objectives in question. But what is really being learned? In the case of the color images, it’s clear that the model is doing something close to correct. However, for the objectives derived from VGG we have no way to really assess whether the model is making the images better or just more adversarial. For instance, when we are optimizing for the logit for “magpie,” it’s almost certainly the case that the result of this optimization will not look more like a magpie to a human, even if VGG does rate the images as more “magpie-like.” On the other hand, this is not necessarily a failure of the system – it is accurately capturing the objective function it is given. What remains to be seen is whether it can capture how background images influence human behavior as well as it can capture the vagaries of deep vision architectures.

We believe there are many domains where a system similar to ours could be useful. We mentioned producing better webpage backgrounds and making more aesthetic images above, but there are many potential applications for improving GANs with a limited amount of human feedback. For example, a model could be trained to produce better music (e.g. song skip rates on streaming generated music could be treated as inverse PIRs). Our system could also offer a different approach to the problem tackled by Elgammal and colleagues Elgammal2017CAN:Norms , by asking artists to evaluate what is “original” art.

5.1 Trading image diversity for PIR

When tuning the GAN, it is worth noting that the decrease in the PIR loss is usually accompanied by an increase in the generator loss, and often by a partial collapse of the generator output (for example, the optimized images generally seem to have fewer output modes than the pre-training images in Fig. 4). This is not especially surprising – because we weighted the PIR loss very highly, the model is rewarded for trading some image diversity for image optimality. Depending on the desired application, the weight on the PIR loss could be adjusted as necessary to trade off between producing images close to the data distribution and optimizing PIR. At its most extreme, one could down-weight the generator loss entirely, and train until the model just produces a single optimal image. However, the generator likely provides some regularization by constraining the images to be somewhat close to the real images, which will reduce overfitting to an imperfect estimate of the PIR function. Furthermore, in many settings we will want to generate a variety of images (e.g. backgrounds for different websites). For these reasons, we chose to keep the generator loss when tuning the GAN.

5.2 Future Directions

There are a number of future directions suggested by this work. First, there are a number of techniques that could be explored to improve our system. As we mentioned above, iterating for multiple steps of click collection and generative model tuning is worth exploring further. Also, some form of data scaling might allow the system to perform better on tasks with low variance. We briefly tried normalizing all data for an objective to have mean 0.5 and standard deviation 0.25, but did not achieve particularly good results from this, possibly because there were many outliers getting clipped to 0 or 1. Still, there are many other possibilities for scaling data that could potentially result in some improvement in performance. One alternative approach to training a GAN to produce high-PIR images would be to use the PIR estimator objective in the Plug & Play Generative Networks framework Nguyen2017PlugSpace instead of using it to tune the GAN. This could be an interesting direction to explore, but its success would probably depend on the expressiveness of the initial generative model. With the mediocre model we started with, it is probably better to actually tune the model itself, which may allow it to explore parts of image space which it could not previously reach.

Another fascinating direction would be to use our system with objectives taken from CNN layers, as we did here, to try to understand the features those layers are attending to, by analyzing the distribution of images that are produced. In this sense, our system can be thought of as offering a new approach to multifaceted feature visualization Nguyen2016MultifacetedNetworks , because our system attempts to optimize a distribution of images for an objective and encourages diversity in the distribution produced, rather than just optimizing a single image.

Finally, we also believe the results on the VGG objectives are interesting from the perspective of distillation or imitation approaches, which attempt to train one network to emulate another (e.g. Hinton2015DistillingNetwork ; Parisotto2015Actor-Mimic:Learning ), as well as from the perspective of understanding the computations that these vision architectures perform. As far as we are aware, these results are the first to show that a deep vision model can be tuned rapidly, from relatively little data, to produce outputs which accurately emulate the behavior of hidden layers of another deep vision architecture trained on a different dataset. This suggests both that the inductive biases shared among these architectures are causing them to find similar solutions (which is also supported by work on transferable adversarial examples, e.g. Liu2016DelvingAttacks ), and that these networks’ final layers represent the computations of earlier hidden layers in a way that is somewhat accessible.

6 Conclusions

We have described a system for efficiently tuning a generative image model according to a slow-to-evaluate objective function. We have demonstrated the success of this system at targeting a variety of objective functions simulated from different layers of a deep vision model, as well as from low-level visual features of the images, and have shown that it can do so from a small amount of data. We have quantified some of the features that affect its performance, including the variability of the training PIR data and the number of zeros it contains. Our system’s success on a wide variety of objectives suggests that it will be able to improve real user interactions, or other objectives which are slow and expensive to evaluate. This may result in many exciting applications, such as improving machine-generated images, music, or art.

Acknowledgements

We thank Luke Metz for helpful comments on a draft of this paper.

References

  • (1) Christiano, P., Leike, J., Brown, T. B., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. arXiv (2017).
  • (2) Elgammal, A., Liu, B., Elhoseiny, M., and Mazzone, M. CAN: Creative Adversarial Networks Generating ”Art” by Learning About Styles and Deviating from Style Norms. arXiv, Iccc (2017), 1–22.
  • (3) Goodfellow, I., Pouget-Abadie, J., and Mirza, M. Generative Adversarial Networks. arXiv preprint arXiv: … (2014), 1–9.
  • (4) Hinton, G., Vinyals, O., and Dean, J. Distilling the Knowledge in a Neural Network. 1–9.
  • (5) Kingma, D. P., and Ba, J. Adam: A Method for Stochastic Optimization. Iclr (2015), 1–15.
  • (6) Liu, Y., Chen, X., Liu, C., and Song, D. Delving into Transferable Adversarial Examples and Black-box Attacks. 1–24.
  • (7) Nguyen, A., Yosinski, J., Bengio, Y., Dosovitskiy, A., and Clune, J. Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space. Iccv, 3 (2017), 33.
  • (8) Nguyen, A., Yosinski, J., and Clune, J. Multifaceted Feature Visualization: Uncovering the Different Types of Features Learned By Each Neuron in Deep Neural Networks.

  • (9) Odena, A., Olah, C., and Shlens, J. Conditional Image Synthesis With Auxiliary Classifier GANs. arXiv (2016), 1–14.
  • (10) Parisotto, E., Ba, J. L., and Salakhutdinov, R. Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning. 1–16.
  • (11) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115, 3 (2015), 211–252.
  • (12) Simonyan, K., and Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint (2014), 1–10.
  • (13) Sun, C., Shrivastava, A., Singh, S., and Gupta, A. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era.

  • (14) Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the Inception Architecture for Computer Vision. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (2016), 2818–2826.
  • (15) Yamins, D. L. K., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., and DiCarlo, J. J. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences of the United States of America 111, 23 (2014), 8619–24.
  • (16) Yosinski, J., Clune, J., and Nguyen, A. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images.
  • (17) Zeiler, M. D., and Fergus, R. Visualizing and Understanding Convolutional Networks. Computer Vision–ECCV 2014 8689 (2014), 818–833.

Appendix A ACGAN

We describe our results using the standard GAN framework for clarity, but we actually used an ACGAN Odena2016ConditionalGANs , which allows for conditioning on various user-specific features. This requires the following adjustments. Defining $C$ to be the set of possible classes and $X_c$ to be the set of real images corresponding to a class $c \in C$, and letting $G(z, c)$ denote the generator conditioned on class $c$ and $D_C(c \mid x)$ the class probability the discriminator assigns to image $x$, define:

$$L_{C,\text{real}} = \mathbb{E}_{c \sim C,\, x \sim X_c}\left[\log D_C(c \mid x)\right], \qquad L_{C,\text{fake}} = \mathbb{E}_{c \sim C,\, z \sim \mathcal{N}(0, I)}\left[\log D_C(c \mid G(z, c))\right]$$

Then the discriminator and generator losses are modified as follows:

$$L_D = \lambda_{\text{real}}\, L_{\text{real}} + \lambda_{\text{fake}}\, L_{\text{fake}} + \lambda_{\text{class}}\, L_{C,\text{real}}, \qquad L_G = \lambda_{\text{gen}}\, L_{\text{gen}} + \lambda_{\text{class}}\, L_{C,\text{fake}}$$

Note that unlike the standard ACGAN formulation given in Odena2016ConditionalGANs , we do not include $L_{C,\text{fake}}$ in the discriminator’s loss, to keep the discriminator from cooperating with the generator on the classification task.

We modified the generator network by adding one-hot class inputs, and the discriminator by adding class outputs alongside the source output, as in Odena2016ConditionalGANs .