Task Specific Adversarial Cost Function

09/27/2016 ∙ by Antonia Creswell, et al.

The cost function used to train a generative model should fit the purpose of the model. If the model is intended for tasks such as generating perceptually correct samples, it is beneficial to maximise the likelihood of a sample drawn from the model, Q, coming from the same distribution as the training data, P. This is equivalent to minimising the Kullback-Leibler (KL) divergence, KL[Q||P]. However, if the model is intended for tasks such as retrieval or classification, it is beneficial to maximise the likelihood that a sample drawn from the training data is captured by the model, equivalent to minimising KL[P||Q]. The cost function used in adversarial training optimises the Jensen-Shannon entropy, which can be seen as an even interpolation between KL[Q||P] and KL[P||Q]. Here, we propose an alternative adversarial cost function which allows easy tuning of the model for either task. Our task specific cost function is evaluated on a dataset of hand-written characters in the following tasks: generation, retrieval and one-shot learning.


1 Introduction

1.1 Generative and Discriminative Models

Discriminative models are trained to predict a label, $y$, given an input sample, $x$. Probabilistically, this is equivalent to learning a conditional probability $p(y|x)$. State-of-the-art discriminative models are able to outperform humans on tasks such as natural image recognition [szegedy2015going] and sketch recognition [yu2015sketch]. However, training models to achieve or exceed human levels of recognition requires large amounts of labelled training data, which is often expensive to acquire.

Recently, there has been immense interest in generative models, which are able to learn from unlabelled training data, which is often available in abundance. However, generative models are more challenging to learn. A generative model should be able to draw samples from $p(x)$; however, estimating $p(x)$ may be computationally intractable. Instead, we often learn a function that maps a vector to an image sample, $x$. The vector may be either a noise vector, $z$, drawn from a prior distribution [kingma2013auto, radford2015unsupervised], a label vector, $y$ [dosovitskiy2015learning], or a combination of the two [chen2016infogan, mirza2014conditional, gauthier2014conditional, vincent2008extracting]. Probabilistically, these may be interpreted as conditional probabilities: $p(x|z)$, $p(x|y)$ or $p(x|z,y)$. By sampling these conditional probabilities appropriately, novel samples of $x$ may be generated.

Generative models are not only useful for synthesising new samples, they may also learn a representation for the training data that can be applied to discriminative tasks, via semi-supervised learning [lake2015human, chen2016infogan, radford2015unsupervised]. Semi-supervised learning makes use of large amounts of accessible, unlabelled data to train a model that learns a representation for the data. A smaller set of labelled samples may be mapped to the learned representation space, which hopefully makes classes more separable, allowing a discriminative classifier to be trained using few labelled samples.

There is currently active research in applying generative models to image data, both to improve the quality of generated images [oord2016pixel, zhao2016energy, chen2016infogan, dosovitskiy2015learning] and to apply representations learned during training to discriminative image tasks [radford2015unsupervised, lake2015human].

1.2 Image Synthesis Using Generative Models

Auto-encoders learn an encoder, which maps from image space to a latent space, and a decoder, which maps back to image space. Auto-encoders are trained to reconstruct samples, rather than synthesise new samples; this is because the distribution, $p(z)$, of the latent space is unknown, and so the decoder cannot be sampled.

Variational auto-encoders [kingma2013auto] address this problem by constraining $p(z)$ to come from a prior distribution, e.g. a normal distribution. Variational auto-encoders can be implemented by first sampling $z$ from the prior and then sampling the conditional probability $p(x|z)$ to get a new sample. Samples generated using variational auto-encoders are often overly smoothed because of the constraint on the latent space. But it is not always necessary to constrain the latent space of an auto-encoder in order to generate new samples.

De-noising auto-encoders [bengio2013generalized] are trained to reconstruct an image from a corrupted version. If the corrupted image, $\tilde{x}$, is sampled from the conditional probability $p(\tilde{x}|x)$ and the de-noising auto-encoder samples $p(x|\tilde{x})$, new samples may be generated by alternately sampling $\tilde{x}_t \sim p(\tilde{x}|x_t)$ and $x_{t+1} \sim p(x|\tilde{x}_t)$, where a new sample is generated at each time step, $t$.

The generative models described thus far look at generating an entire image in one go. An alternative is to develop a single image sequentially. Gregor et al. [gregor2015draw] learn to generate hand-written digits sequentially using an auto-encoder architecture with recurrent connections and an attention mechanism. The attention mechanism allows the generator to focus on smaller regions of an input image, and generate an image a few pixels at a time. The sequential approach generates sharper samples than traditional variational auto-encoders [kingma2013auto]. Gregor et al. also modified their approach to generate two digits per image, and observed that their attention mechanism ensured that the model focused on generating one number at a time.

A more extreme approach to sequential generation is to generate images pixel by pixel. Oord et al. [oord2016pixel] generated natural images one pixel at a time, where the choice for the next pixel depended on the previous pixels. Though the “natural” images generated by Oord et al. [oord2016pixel] do not resemble class specific samples, the statistics of ensembles of samples appear to be consistent with those of natural images.

Lake et al. [lake2015human] also generated images sequentially, by using labelled data within a Bayesian probabilistic method for generating hand-written characters, one stroke at a time. Lake et al. [lake2015human] were able to synthesize very sharp image samples that closely resembled real image examples.

An approach to learning generative models by using labels was suggested by Dosovitskiy et al. [dosovitskiy2015learning]. The authors trained a convolutional neural network to generate images of tables, cars and chairs from a series of vectors that encoded object class, viewpoint and a spatial transform. They were able to generate examples of objects from varying viewpoints and morph different styles of chairs to suggest new chair designs.

1.3 Discriminative Tasks On Images Using Generative Models

De-noising auto-encoders [vincent2008extracting] can be trained on unlabelled data to learn a representation. Training involves learning both an encoder and a decoder; once trained, the decoder may be removed and the network architecture modified for classification. The network may then be fine-tuned by training the network for the classification task. It is suggested by Erhan et al. [erhan2010does] that this process of pre-training and fine-tuning prevents models that are designed for classification from over-fitting.

Lake et al. [lake2015human] learned generative models for hand-written characters from labelled training data. However, they were able to apply their generative model to also find matching characters for queries from unseen classes. This type of learning – from only one example – is known as one-shot learning, and is a very challenging task. When shown a character from an unseen class, and presented with samples from other unseen classes (including a sample from a similar class), their model was able to pick the correct matching sample more accurately than humans [lake2015human].

A generative model that was first introduced by Goodfellow et al. [goodfellow2014generative] and improved by Salimans et al. [salimans2016improved] was able to achieve state-of-the-art recognition in semi-supervised classification on CIFAR-10 (a dataset of small natural images), MNIST (a dataset of hand-written digits) and SVHN (a dataset of street numbers).

1.4 Task Specific Cost Function For Training Generative Models

A generative model may learn to generate samples with distribution, $Q$, which captures the underlying probability distribution of the training data, $P$, by minimising a cost function that measures the difference between the two distributions.

The cost function used to train a generative model should fit the purpose of the model. If the model is intended for generation of perceptually high quality samples, it is necessary for the model to capture the densest parts of the training data distribution. This can be achieved by learning a $Q$ that minimises $KL[Q\|P]$ [huszar2015not]. For a model with sufficient complexity and many samples, it may be possible to learn a $Q$ such that $Q = P$. However, for a finite model, and with insufficient training samples, the model is likely to fit only the densest parts of the distribution at the cost of not capturing other regions of high density. A pictorial example of this is shown in Fig. 1A.

If the representation of a generative model is intended to be used for discriminative tasks such as retrieval or classification, it is necessary to learn a model that captures the whole distribution of the training data. To achieve this, $KL[P\|Q]$ may be minimised. A model, $Q$, with finite capacity will be penalised if it does not capture states in $P$, which will encourage the model to capture all regions of high density in the data distribution at the cost of also capturing regions of low density. The model would be less suitable for generation because samples generated from regions of low probability are likely to be nonsensical. A pictorial example of this is shown in Fig. 1B.

Figure 1: For a data distribution, $P$, a model, $Q$, with finite capacity may be fit by minimising either A) $KL[Q\|P]$, which captures one region of high density well, but ignores others, or B) $KL[P\|Q]$, which captures all regions of high density while also assigning non-zero values to regions of low density.
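The asymmetry between the two KL terms can be illustrated on a small discrete example. The sketch below is not from the paper: the six-state distribution `P` and the two candidate models `Q_mode` and `Q_cover` are hypothetical values chosen only to mimic the two panels of Fig. 1.

```python
import math

def kl(p, q):
    """KL[p||q] for discrete distributions given as lists; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical bimodal data distribution P over six states (two modes at the ends).
P = [0.30, 0.19, 0.01, 0.01, 0.19, 0.30]
# Candidate A: concentrates on one mode (small non-zero mass elsewhere avoids log(0)).
Q_mode = [0.60, 0.37, 0.01, 0.01, 0.005, 0.005]
# Candidate B: spreads mass over every state, including the low-density ones.
Q_cover = [1 / 6] * 6

for name, Q in [("one mode", Q_mode), ("all states", Q_cover)]:
    print(name, "KL[Q||P] =", round(kl(Q, P), 3), "KL[P||Q] =", round(kl(P, Q), 3))
```

With these made-up numbers, the single-mode candidate scores better under KL[Q||P], while the spread-out candidate scores better under KL[P||Q], which is the qualitative behaviour described for panels A and B.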

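A cost function that may allow tuning towards one task or the other is the Jensen-Shannon divergence [lin1991divergence], $JS_\pi[P\|Q]$:

$$JS_\pi[P\|Q] = \pi \, KL[P \,\|\, \pi P + (1-\pi) Q] + (1-\pi) \, KL[Q \,\|\, \pi P + (1-\pi) Q]$$

for $0 < \pi < 1$.

Huszar et al. [huszar2015not] showed that $JS_\pi[P\|Q]$ is proportional to $KL[Q\|P]$ and $KL[P\|Q]$, respectively, at the upper and lower limits of $\pi$:

$$\lim_{\pi \to 1} \frac{JS_\pi[P\|Q]}{1-\pi} = KL[Q\|P], \qquad \lim_{\pi \to 0} \frac{JS_\pi[P\|Q]}{\pi} = KL[P\|Q]$$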
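The limiting behaviour can be checked numerically. The following is a minimal sketch, assuming the standard skewed Jensen-Shannon divergence with mixture weight $\pi$ on $P$; the three-state distributions `P` and `Q` and the step `eps` are arbitrary illustrative values, not data from the paper.

```python
import math

def kl(p, q):
    """KL[p||q] for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_pi(p, q, pi):
    """Skewed Jensen-Shannon divergence with mixture M = pi*P + (1 - pi)*Q."""
    m = [pi * a + (1 - pi) * b for a, b in zip(p, q)]
    return pi * kl(p, m) + (1 - pi) * kl(q, m)

P = [0.6, 0.3, 0.1]   # hypothetical data distribution
Q = [0.2, 0.5, 0.3]   # hypothetical model distribution
eps = 1e-4

# Near pi = 0 the divergence scaled by pi approaches KL[P||Q].
print(js_pi(P, Q, eps) / eps, kl(P, Q))
# Near pi = 1 the divergence scaled by (1 - pi) approaches KL[Q||P].
print(js_pi(P, Q, 1 - eps) / eps, kl(Q, P))
```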

In previous work, Goodfellow et al. [goodfellow2014generative] introduced adversarial training, where a pair of competing models (a generator and a discriminator) are trained. The generator is trained to produce samples that appear to come from the training data, and the discriminator is trained to distinguish the training data samples from generated samples. Training is successful when the discriminator cannot distinguish synthesized samples from samples that are drawn from the training data. Goodfellow et al. [goodfellow2014generative] applied adversarial training to learn a distribution over image space in order to synthesize new image samples. Further, Goodfellow et al. [goodfellow2014generative] showed that under certain conditions adversarial training minimizes the Jensen-Shannon divergence at a fixed value of $\pi = 0.5$; this is more commonly known as the Jensen-Shannon entropy. By optimising the Jensen-Shannon entropy, rather than the Jensen-Shannon divergence, $JS_\pi[P\|Q]$, the training algorithm and cost function proposed by Goodfellow et al. [goodfellow2014generative] do not depend on $\pi$ and so cannot be tuned towards one task or the other.

The training algorithm proposed by Goodfellow et al. [goodfellow2014generative] draws samples equally from $P$ and $Q$. Huszar [huszar2015not] proposes an alternative training algorithm for approximating $JS_\pi[P\|Q]$ for small and large values of $\pi$ by using biased sampling, drawing more values from $Q$ to approximate $KL[P\|Q]$ or more values from $P$ to approximate $KL[Q\|P]$. However, we are not aware of any experiments that explore the effects of such a sampling strategy, and only a qualitative relationship between the number of samples and the effect on the cost function was suggested.

Instead of using biased sampling during training to approximate $JS_\pi[P\|Q]$, we propose an alternative adversarial cost function which we show is equivalent to $JS_\pi[P\|Q]$ plus additional terms that depend only on $\pi$. In the limits of $\pi$, the additional terms tend to zero and $JS_\pi[P\|Q]$ becomes proportional to $KL[P\|Q]$ or $KL[Q\|P]$, depending on the choice of $\pi$. The parameter $\pi$ may be chosen to suit the desired task. We apply our novel cost function to both discriminative and generative tasks to show that a smaller value of $\pi$ improves performance on discriminative tasks while larger values improve performance on generative tasks.

2 Preliminaries: Generative Adversarial Networks

The purpose of image generation models is to learn a distribution, $Q$, which captures a training data distribution, $P$, over image space. Often, learning to draw samples from $Q$ directly is computationally intractable. Instead, we want to learn a parametrised function, $G_\theta$, which maps samples, $z$, from a prior distribution, $p_z(z)$, to an image sample in $Q$. During training, the parameters, $\theta$, are learned such that $Q$ is similar to $P$. This requires comparing generated samples to real image samples. For example, in an auto-encoder [larsen2015autoencoding], this could be achieved by calculating the pixel-wise error between the generated and real samples (e.g. using MSE or cross-entropy). However, comparing pixel values to evaluate generated samples has often been found to lead to poor quality image generation [larsen2015autoencoding]. Instead, another parametrised function, $D_\phi$, could be introduced to map all samples directly to a probability of whether that sample is likely to have come from the real data distribution or not, see Fig. 2. This is the idea behind adversarial training, which we will now explain more formally.

In adversarial training a pair of networks is trained: a generator, $G$, and a discriminator, $D$. The generator takes as input a vector of random values, $z$, drawn from a prior distribution, $p_z(z)$. During training, one objective is to learn a mapping $G: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}$ from latent space to sample space, where $m$ is the dimension of the sample space and $n$ is the scalar dimension of $z$. The discriminator takes samples either from the training dataset, $x \sim P$, or the generator, $G(z) \sim Q$. During training, the discriminator is trained under a different objective: to learn a mapping $D: \mathbb{R}^{m} \rightarrow [0, 1]$, predicting a label for whether a sample was drawn from the training data, $P$ (1 - real), or from the generator, $Q$ (0 - fake). The objective function for training the discriminator is to correctly classify examples as being either real or fake. A well-trained generator can create samples that are realistic enough to fool the discriminator into making incorrect classifications. See Fig. 2.

Figure 2: Generative Adversarial Networks in the context of generative models: A random sample, $z$, is drawn from a prior distribution and mapped by $G$ to be a sample in the model distribution space, $Q$. Samples, $x$, from the training data distribution, $P$, or the model distribution, $Q$, are mapped by $D$ to a prediction of whether the sample is from the training data distribution or not.

Previous work on adversarial training has primarily focused on either generating realistic looking samples [goodfellow2014generative, gauthier2014conditional, makhzani2015adversarial, kataokaimage], classification tasks [ganin2016domain, radford2015unsupervised, makhzani2015adversarial, chen2016infogan] or multi-label tagging [mirza2014conditional]. More recently, adversarial training has also been applied to image retrieval [creswell2016adversarial, donahue2016adversarial]. However, adversarial training optimises a cost function which approximates the Jensen-Shannon divergence at a fixed $\pi = 0.5$ [goodfellow2014generative]. The resulting generative model is not ideal for tasks such as classification or retrieval. We propose an alternative cost function that can be tuned to make adversarial training more suitable either for discriminative tasks or for generative tasks.

3 Motivation (Previous Work)

Generative Adversarial Networks (GANs) have recently attracted interest because of their ability to learn complex generative models with minimal labelling of data. Goodfellow et al. [goodfellow2014generative] introduced GANs, modelling both the generator, $G$, and the discriminator, $D$, as fully connected neural networks. Radford et al. [radford2015unsupervised] extended GANs by using fully convolutional neural networks for both $G$ and $D$. These convolutional networks are capable of generating images of realistic looking faces, bedrooms and numbers.

GANs may also be trained with labels [mirza2014conditional, gauthier2014conditional], such that images of specific categories may be generated. These networks are called conditional GANs (cGANs). When training cGANs, the generator takes in both a one-hot label vector, which describes the category that is to be generated, and a vector of random values drawn from some prior distribution, $p_z(z)$. An improvement to training cGANs was proposed by Chen et al. [chen2016infogan]; by seeking to maximise the mutual information between the one-hot label vector and the generated sample given the one-hot vector, the cGAN is encouraged to use the class label in the one-hot label vector, information which was often ignored in previous approaches [mirza2014conditional, gauthier2014conditional]. We continue to use GANs in an unsupervised setting, assuming no labels during training of the GANs.

Adversarial training has also been applied to generative auto-encoders to impose a prior distribution on their encoding vectors [makhzani2015adversarial]. For example, if the data distribution belongs to that of an ensemble of images, auto-encoders may be trained to compress each image to an encoding vector using an encoder and then to reconstruct an approximation of the same image from the encoding vector by using a decoder. Vectors may be passed into the decoder to generate new images, however the images will only be realistic if the input vectors to the decoder are from the same distribution as the encoding vectors of the ensemble. By imposing a prior distribution on the encoding during auto-encoder training, new “encoding” vectors may be drawn from the prior and passed through the decoder to generate new, meaningful image samples. Makhzani et al. [makhzani2015adversarial] showed that adversarial training is both better able to impose prior distributions and is able to impose more complex prior distributions.

GANs have not only been used for generation, but the representations learned during training have also been applied to discriminative tasks [radford2015unsupervised, chen2016infogan, ajakan2014domain, creswell2016adversarial, donahue2016adversarial, dumoulin2016adversarially]. Since GANs are able to learn representations from unlabelled data [creswell2016adversarial, radford2015unsupervised], they can be useful for learning representations when labelled data is not available, or the amount of labelled data is limited.

Until recently, representations used in discriminative tasks were obtained from trained GANs by passing samples through various layers of a trained discriminator, $D$. However, both Makhzani et al. [makhzani2015adversarial] and Dumoulin et al. [dumoulin2016adversarially] presented an alternative method for obtaining representations for samples by mapping image samples, $x$, back to $z$-space using an encoder which, under certain conditions, inverts the generator. This approach requires training an extra encoding network, and, in practice, this network often only approximately inverts the generator. In our work we continue to use the encoding from the discriminator so as to make minimal changes to the current adversarial network architecture. However, we would consider using encoding networks in future work.

A further application of adversarial training to representation learning is domain adaptation, which involves learning a single representation for samples across different domains, e.g. sketches and natural images. Both Ganin et al. and Ajakan et al. [ganin2016domain, ajakan2014domain] apply adversarial training to learn representations for similar objects from different domains such that the representation for one domain cannot be distinguished from that of the other domain. The representations that are learned in this way may be applied to classification.

Despite the success of GANs as both generative and discriminative models, there are several problems that may still be addressed. For example, Radford et al. [radford2015unsupervised] showed examples of interpolations between two random images, by generating images along a trajectory in $z$-space, see Fig. 3. Often, samples towards the centre of the interpolation are poor, suggesting that the model is assigning higher probability to regions where probability should be lower. An ideal generator, $G$, should generate realistic samples for any $z$.

Figure 3: Interpolation is performed between two image samples, $G(z_1)$ and $G(z_2)$, by generating images along a trajectory in $z$-space that lies between $z_1$ and $z_2$.
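A trajectory of this kind can be sketched as follows. The example assumes a straight-line path between the two latent vectors, which is one common choice; the text above only specifies "a trajectory", and the function name `interpolate` is hypothetical. Each returned point would be passed through the generator to produce one image of the interpolation.

```python
def interpolate(z1, z2, steps):
    """Return `steps` latent vectors evenly spaced on the line from z1 to z2 (inclusive)."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    if len(z1) != len(z2):
        raise ValueError("latent vectors must have the same dimension")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # t runs from 0.0 to 1.0
        path.append([(1 - t) * a + t * b for a, b in zip(z1, z2)])
    return path

# Example with two hypothetical 2-dimensional latent vectors.
path = interpolate([0.0, 1.0], [1.0, -1.0], steps=5)
for z in path:
    print(z)
```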

Further, previous work with GANs has ignored the implications of using a representation learned by a generative model for discriminative tasks [radford2015unsupervised, dumoulin2016adversarially, makhzani2015adversarial, creswell2016adversarial, salimans2016improved]. A representation learned using GANs tends to capture only a few regions of high density in $P$, failing to capture the whole data distribution [huszar2015not, theis2015note, radford2015unsupervised], which is not ideal for a representation that is intended for discriminative tasks: the representation is unlikely to generalise well to unseen samples. Failure of a GAN to capture the whole data distribution is evident when GANs generate similar samples for different inputs [salimans2016improved]. Salimans et al. [salimans2016improved] address the problem of failing to capture more of the data distribution, $P$, by introducing “mini-batch discrimination”, which provides the discriminator with information about all samples in a batch to prevent similar samples being generated. Their approach is based only on heuristics. Salimans et al. [salimans2016improved] found that employing “mini-batch discrimination” led to the learning of a representation that performed better on discriminative tasks. This is consistent with our argument: a model that captures the whole data distribution should have an improved ability to generalise to new concepts, allowing representations extracted from such a model to be useful for discriminative tasks.

We aim to address both of these problems by providing a single, novel cost function, parametrised by $\pi$, that can be tuned to suit either generative or discriminative tasks.

Our alternative cost function can be tuned for generation by using a large $\pi$, which approximates the non-symmetric $KL[Q\|P]$. Minimising $KL[Q\|P]$ penalises the model, $Q$, when generated samples, $G(z)$, do not come from the training data distribution, $P$. By doing this, we increase the likelihood that samples drawn from our model are consistent with real samples; we find that our alternative cost function prevents nonsensical images being generated when interpolating between two random images using their $z$-space representations.

To tune our cost function for discriminative tasks, a small value of $\pi$ may be used, which approximates the non-symmetric $KL[P\|Q]$. Minimising $KL[P\|Q]$ penalises the model, $Q$, when it does not capture all regions of density in $P$. We provide experimental evidence to show that this alternative cost function improves performance on several discriminative tasks, including one-shot learning and retrieval, when compared to regular GAN training.

4 Proposed Cost Function

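The original cost function proposed by Goodfellow et al. [goodfellow2014generative] is:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

We propose the alternative cost function:

$$\min_G \max_D V_\pi(D, G) = \pi \, \mathbb{E}_{x \sim P}[\log D(x)] + (1-\pi) \, \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

We now show that under similar conditions and assumptions to those made by Goodfellow et al. [goodfellow2014generative], this new cost function is approximately proportional to $KL[Q\|P]$ and $KL[P\|Q]$ for large and small $\pi$ respectively.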
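As a rough illustration, a Monte Carlo estimate of such a cost can be computed from a batch of discriminator outputs. This sketch assumes the proposed cost weights the real-sample term by $\pi$ and the generated-sample term by $1-\pi$, in line with the analysis of $JS_\pi$; the function name `task_specific_cost` and the example probabilities are hypothetical.

```python
import math

def task_specific_cost(d_real, d_fake, pi):
    """Estimate pi*E[log D(x)] + (1 - pi)*E[log(1 - D(G(z)))] from batch outputs.

    d_real: discriminator outputs D(x) on training samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return pi * real_term + (1.0 - pi) * fake_term

# Hypothetical discriminator outputs for a batch of two real and two generated samples.
d_real = [0.9, 0.8]
d_fake = [0.1, 0.3]
for pi in (0.1, 0.5, 0.9):
    print(pi, task_specific_cost(d_real, d_fake, pi))
```

Under this assumed form, setting `pi = 0.5` gives exactly half of the unweighted cost of Goodfellow et al., so it leads to the same optimisation problem up to a constant factor.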

4.1 Proposed Cost Function In The Limits Of $\pi$

First we show that for a fixed generator, $G$, there exists an optimal discriminator, $D^*$:

$$D^*(x) = \frac{\pi P(x)}{\pi P(x) + (1-\pi) Q(x)}$$

where $Q$ is the distribution of samples generated by $G$.

To find the stationary curve, $D^*(x)$, of an integral over $x$, we use the Euler-Lagrange theorem. For the general variational problem:

$$J[y] = \int_{x_1}^{x_2} L(x, y(x), y'(x)) \, dx$$

any differentiable and bounded minimiser, $y(x)$, is a solution to the boundary value problem:

$$\frac{\partial L}{\partial y} - \frac{d}{dx} \frac{\partial L}{\partial y'} = 0$$

In the case where the integrand does not contain a $y'$ term, the boundary value problem simplifies to:

$$\frac{\partial L}{\partial y} = 0$$

implying that the stationary curve of the integrand is also the stationary curve of the integral.

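Then, for any $(a, b) \in \mathbb{R}^2 \setminus \{0, 0\}$, the function $y \mapsto a \log(y) + b \log(1 - y)$ achieves a maximum in the interval [0,1] at $\frac{a}{a+b}$. Writing the cost as an integral over $x$ with integrand $\pi P(x) \log D(x) + (1-\pi) Q(x) \log(1 - D(x))$, and taking $a = \pi P(x)$ and $b = (1-\pi) Q(x)$, we get:

$$D^*(x) = \frac{\pi P(x)}{\pi P(x) + (1-\pi) Q(x)}$$

Note that because the discriminator takes samples either from $P$ or $Q$, it is only defined for $x$ in the union of the supports of $P$ and $Q$, and so $P(x)$ and $Q(x)$ do not simultaneously equal zero, satisfying the condition $(a, b) \in \mathbb{R}^2 \setminus \{0, 0\}$.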
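The location of this maximum can be checked numerically with a simple grid search. This is only a sanity check under illustrative values: `pi`, `p_x` and `q_x` below are hypothetical stand-ins for $\pi$, $P(x)$ and $Q(x)$ at a single point $x$.

```python
import math

def f(y, a, b):
    """The pointwise integrand a*log(y) + b*log(1 - y) for y in (0, 1)."""
    return a * math.log(y) + b * math.log(1.0 - y)

def argmax_on_grid(a, b, n=10000):
    """Return the grid point in (0, 1) at which f is largest."""
    ys = [i / n for i in range(1, n)]
    return max(ys, key=lambda y: f(y, a, b))

pi, p_x, q_x = 0.2, 0.7, 0.4          # hypothetical values at one point x
a, b = pi * p_x, (1 - pi) * q_x
closed_form = a / (a + b)              # the optimal discriminator value at x
numeric = argmax_on_grid(a, b)
print(closed_form, numeric)
```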

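If the generator and discriminator are trained iteratively [goodfellow2014generative], one may assume that the discriminator is optimised in the first step of an iteration, giving the new cost function for the second step of the iteration, $C(G)$:

$$C(G) = \pi \, \mathbb{E}_{x \sim P}\left[\log \frac{\pi P(x)}{\pi P(x) + (1-\pi) Q(x)}\right] + (1-\pi) \, \mathbb{E}_{x \sim Q}\left[\log \frac{(1-\pi) Q(x)}{\pi P(x) + (1-\pi) Q(x)}\right]$$

which can be re-arranged to give:

$$C(G) = JS_\pi[P\|Q] + \pi \log \pi + (1-\pi) \log(1-\pi)$$

Now, we consider the limits as $\pi \to 0$ and $\pi \to 1$, knowing that $JS_\pi[P\|Q]$ is proportional to $KL[P\|Q]$ for small $\pi$ and proportional to $KL[Q\|P]$ for large $\pi$ [huszar2015not]:

$$C(G) \approx \pi \, KL[P\|Q] + \pi \log \pi + (1-\pi) \log(1-\pi) \quad \text{as } \pi \to 0$$

and

$$C(G) \approx (1-\pi) \, KL[Q\|P] + \pi \log \pi + (1-\pi) \log(1-\pi) \quad \text{as } \pi \to 1$$

where the terms that depend only on $\pi$ tend to zero in both limits and do not depend on $G$.

We have shown that the cost function that we propose approximates $KL[P\|Q]$ for small $\pi$ and approximates $KL[Q\|P]$ for large $\pi$. This implies that to train a model, $Q$, suitable for retrieval, our proposed cost function can be used with a small $\pi$, and to train a model suitable for generation, our proposed cost function can be used with a larger value of $\pi$. We explore the practical implications of this in the context of retrieval, generation and one-shot learning in the experimental section.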
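The re-arrangement can be verified numerically for discrete distributions. The sketch below assumes the $\pi$-weighted cost, with the discriminator set to $\pi P(x) / (\pi P(x) + (1-\pi) Q(x))$, and checks that the resulting value equals $JS_\pi[P\|Q]$ plus the two terms that depend only on $\pi$. The distributions `P` and `Q` are arbitrary illustrative values.

```python
import math

def kl(p, q):
    """KL[p||q] for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_pi(p, q, pi):
    """Skewed Jensen-Shannon divergence with mixture M = pi*P + (1 - pi)*Q."""
    m = [pi * a + (1 - pi) * b for a, b in zip(p, q)]
    return pi * kl(p, m) + (1 - pi) * kl(q, m)

def cost_at_optimal_discriminator(p, q, pi):
    """Evaluate the pi-weighted cost with D set to its pointwise optimum."""
    total = 0.0
    for px, qx in zip(p, q):
        d_star = pi * px / (pi * px + (1 - pi) * qx)
        total += pi * px * math.log(d_star) + (1 - pi) * qx * math.log(1 - d_star)
    return total

P = [0.6, 0.3, 0.1]   # hypothetical data distribution (all entries non-zero)
Q = [0.2, 0.5, 0.3]   # hypothetical model distribution (all entries non-zero)

for pi in (0.1, 0.5, 0.9):
    lhs = cost_at_optimal_discriminator(P, Q, pi)
    rhs = js_pi(P, Q, pi) + pi * math.log(pi) + (1 - pi) * math.log(1 - pi)
    print(pi, lhs, rhs)
```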

5 Experiments & Results

There are currently two main types of application for generative models. The first is the synthesis of novel samples that resemble the training data, and the second is for discriminative tasks, such as classification and retrieval. The latter make use of the representation that is learned by a network during generative training.

In this section, we evaluate our alternative cost function on three tasks:

  • Generation of novel samples

  • One-shot classification

  • Retrieval of visually similar samples

We compare the performance of GANs trained using the alternative cost function on these tasks for different values of $\pi$; a consideration of the limiting values of $\pi$ suggests that a model trained using small values of $\pi$ should perform better on retrieval and one-shot classification tasks, whilst a model trained using a large value of $\pi$ should perform better on generative tasks. The purpose of these experiments is to provide experimental evidence to support this, based on the analysis of Section 4.

Figure 4: Examples of hand-written characters from the Omniglot background dataset [lake2015human].

5.1 Dataset

We apply our alternative cost function to the Omniglot dataset [lake2015human], see Fig. 4. Previously, generative adversarial networks have been trained on handwritten numbers (MNIST), street numbers (SVHN), faces (CelebA) and natural scenes (CIFAR10).

Once trained, the generator of a GAN [radford2015unsupervised] is able to generate hand-written digits that are indistinguishable from real samples, see Fig. 5C. Hand-written digits generated by the trained generator of a conditional GAN [mirza2014conditional] are recognisable as numbers, see Fig. 5D. On the other hand, generation of realistic looking natural image scenes has not yet been achieved. The MNIST dataset consists of only 10 classes each with examples in total. In contrast, the Omniglot [lake2015human] dataset has classes with only examples of each class. For all of our experiments, we use the Omniglot dataset for several reasons:

  • The Omniglot dataset is neither as simple as the MNIST dataset, nor as complex as the CIFAR-10 dataset, which means improvements to regular GAN training may be more evident.

  • There are only a few labelled examples per class, which makes the Omniglot dataset a perfect candidate for using adversarial training to learn a representation for discriminative tasks.

Figure 5: Previous work using GANs to generate hand-written digits. A) Examples of the MNIST samples used to train the GANs in B-D. B) Generations using fully connected GANs [goodfellow2014generative]. C) Improved generations using deep convolutional generative adversarial networks [radford2015unsupervised] (A-C: images modified from [goodfellow2014generative]). D) Conditional generations using conditional GANs (image modified from [mirza2014conditional]).

The Omniglot dataset [lake2015human] contains characters from different writing systems. The dataset is split into a background dataset of characters from writing systems, while the evaluation dataset consists of different characters from different writing systems. A GAN is trained on the background dataset, using our proposed alternative cost function with each value of $\pi$. Note that although the dataset has labels, the labels are not used at any point during training.

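  for number of training iterations do
     for $k$ iterations do
         $\{z_i\}_{i=1}^{m} \sim p_z(z)$  # Get m samples from the prior
         $\{x_i\}_{i=1}^{m} \sim P$  # Get m samples from the data
         # Calculate discriminator loss
         # using our proposed alternative cost function:
         $J_D = -\frac{1}{m} \sum_{i=1}^{m} \left[ \pi \log D(x_i) + (1-\pi) \log(1 - D(G(z_i))) \right]$
         $\theta_D \leftarrow \theta_D - \alpha \nabla_{\theta_D} J_D$  # Update weights
     end for
     $\{z_i\}_{i=1}^{m} \sim p_z(z)$  # Get m samples from the prior
     $J_G = \frac{1-\pi}{m} \sum_{i=1}^{m} \log(1 - D(G(z_i)))$  # Calculate the generator error
     $\theta_G \leftarrow \theta_G - \alpha \nabla_{\theta_G} J_G$  # Update weights
  end for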
Algorithm 1 Algorithm For Training a GAN: Similar to the training algorithm of Goodfellow et al. [goodfellow2014generative] but incorporating the proposed change to the cost function.
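The alternating structure of the algorithm can be sketched on a one-dimensional toy problem. Everything in this example is an assumption made for illustration and is not the authors' implementation: the data are drawn from a Gaussian, the generator is the affine map `a*z + b`, the discriminator is a logistic unit `sigmoid(w*x + c)`, the cost is taken to be the $\pi$-weighted form, and gradients are written out by hand so that only the standard library is needed.

```python
import math
import random

def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def train_toy_gan(pi, iterations=200, k=1, m=32, lr=0.05, seed=0):
    """Alternate k discriminator steps with one generator step, as in the algorithm."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator parameters: G(z) = a*z + b
    w, c = 0.1, 0.0   # discriminator parameters: D(x) = sigmoid(w*x + c)
    for _ in range(iterations):
        for _ in range(k):
            zs = [rng.uniform(-1.0, 1.0) for _ in range(m)]   # m samples from the prior
            xs = [rng.gauss(3.0, 0.5) for _ in range(m)]      # m samples from the toy data
            gs = [a * z + b for z in zs]
            d_x = [sigmoid(w * x + c) for x in xs]
            d_g = [sigmoid(w * g + c) for g in gs]
            # Gradients of J_D = -(pi*mean(log D(x)) + (1-pi)*mean(log(1 - D(G(z)))))
            grad_w = (-pi * sum((1 - d) * x for d, x in zip(d_x, xs))
                      + (1 - pi) * sum(d * g for d, g in zip(d_g, gs))) / m
            grad_c = (-pi * sum(1 - d for d in d_x) + (1 - pi) * sum(d_g)) / m
            w -= lr * grad_w
            c -= lr * grad_c
        zs = [rng.uniform(-1.0, 1.0) for _ in range(m)]       # fresh prior samples
        d_g = [sigmoid(w * (a * z + b) + c) for z in zs]
        # Gradients of J_G = (1-pi)*mean(log(1 - D(G(z)))) with respect to a and b
        grad_a = (1 - pi) * sum(-d * w * z for d, z in zip(d_g, zs)) / m
        grad_b = (1 - pi) * sum(-d * w for d in d_g) / m
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b, w, c

print(train_toy_gan(pi=0.5, iterations=50))
```

The toy models are far too simple to reproduce the behaviour studied in the paper; the sketch is only meant to show where $\pi$ enters the two update steps.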
Figure 6: GAN architecture: Fig. 2 gives a conceptual model for how a GAN works. Here we present the overall GAN architecture. A random sample, $z$, is drawn from a prior distribution and passed through the generator to generate an image. The generator consists of a fully connected layer and a series of deconvolutional layers. An image either from the generator or the training dataset is passed through the discriminator to predict if the image was from the training data or not. The discriminator consists of a series of convolutional layers and a fully connected layer. Details of both the generator and discriminator architecture can be found in Table I.

5.2 Architecture & Training

For training purposes, the generator, $G$, and discriminator, $D$, may be any differentiable functions; here we used deep convolutional neural networks, see Fig. 6. The $D$ network is a regular feed forward convolutional neural network. As suggested by Radford et al. [radford2015unsupervised], we used convolutions applied with stride two [springenberg2014striving, long2015fully] to down-sample the image instead of using pooling. The $G$ network requires upsampling, which cannot be achieved by a regular feed forward network. One method that can be used to upsample the images appropriately would be to use the error tensor (gradient image) for a convolution layer applied with a stride of two. However, we simply applied filters via convolution with stride one and upsampled the resulting image array using bilinear interpolation. The architecture of the networks is similar to that of Radford et al. [radford2015unsupervised]. The training images used by Radford et al. [radford2015unsupervised] were $64 \times 64$, compared to the Omniglot images, which are $105 \times 105$. To account for this difference in shape, the fully connected layers of both $G$ and $D$ have more nodes, so that the activation images entering the first convolutional layer in $G$ are of size $13 \times 13$ instead of $4 \times 4$. Another modification is the size of the filters in the final layer of $G$: we used filters of size

to accommodate for the output image having odd-valued dimensions. All networks were initially trained for

iterations using random batches of samples with learning rate of , a faster learning rate than that of Radford et al. [radford2015unsupervised] and a value of 3. However, we found that for , the network did not converge after iterations. Instead, we trained with a value of 1 for . The latent variable, , that is the input to the generator, has dimension

, and is drawn from a uniform distribution,

.
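The upsampling step described above (a stride-one convolution followed by bilinear interpolation) can be illustrated with a short NumPy sketch. This is not the authors' implementation: the function name `bilinear_upsample`, the factor of two and the corner-aligned sampling convention are all assumptions made for illustration.

```python
import numpy as np

def bilinear_upsample(img, factor=2):
    """Upsample a 2-D array by `factor` using bilinear interpolation.

    Corner-aligned sampling is an assumption; the text does not say
    which convention was used.
    """
    h, w = img.shape
    out_h, out_w = h * factor, w * factor
    # Positions of the output pixels expressed in input coordinates.
    ys = np.linspace(0.0, h - 1, out_h)
    xs = np.linspace(0.0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    # Fractional offsets act as interpolation weights.
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

# A 13x13 activation map (the size that appears in Table I) becomes 26x26.
feature_map = np.arange(13 * 13, dtype=float).reshape(13, 13)
upsampled = bilinear_upsample(feature_map)
print(upsampled.shape)
```

In a full generator this operation would be applied to every channel of the activation tensor after each stride-one convolution.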

Generator | Discriminator
FC: , reshape(256,13,13); batch norm., leakyReLU(0.2) | C: ; batch norm., ReLU
D: ; batch norm., leakyReLU(0.2) | C: ; batch norm., ReLU
D: ; batch norm., leakyReLU(0.2) | C: ; batch norm., ReLU, reshape(50176)
D: | FC:

Table I: Network Architecture Used. FC=fully connected layer, C=convolutional layer with stride 2, D=convolutional layer with stride 0.5, unless stated otherwise. For all experiments in this paper, =100. “batch norm.” refers to batch normalisation.
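As a sanity check on the shapes that survive in Table I, the spatial sizes can be traced with a few lines of arithmetic. This is a sketch under stated assumptions: an input width of 105 (the standard Omniglot image size, which is not legible in the text above), "same"-style padding so that a stride-two convolution maps n to ceil(n/2), and 256 channels in the last convolutional layer of the discriminator. Under those assumptions the flattened size matches the `reshape(50176)` entry, and three doublings from 13 give 104, one short of 105, which is consistent with the remark about odd-valued output dimensions.

```python
import math

def downsampled_size(n, stride=2):
    """Spatial size after a strided convolution, assuming 'same' padding."""
    return math.ceil(n / stride)

# Discriminator: three stride-2 convolutions (assumed input width 105).
sizes = [105]
for _ in range(3):
    sizes.append(downsampled_size(sizes[-1]))

# Assumed 256 channels before the fully connected layer.
flattened = 256 * sizes[-1] ** 2

# Generator: reshape(256,13,13) followed by three 2x upsampling stages.
gen_sizes = [13]
for _ in range(3):
    gen_sizes.append(gen_sizes[-1] * 2)

print(sizes, flattened, gen_sizes)
```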

5.3 Retrieval

The Omniglot dataset consists of a background and an evaluation set. The background set consists of characters from different alphabets to the evaluation dataset. A GAN is trained on the background dataset using , where training with is equivalent to normal GAN training. Retrieval is performed both on the background and the evaluation dataset. To retrieve examples of characters not seen before, the representation that is learned during training should capture the entire distribution of handwritten space in order to generalise well to new concepts. We expect a GAN that is trained using to outperform regular GAN training (), since a GAN trained using approximately minimises thus encouraging the model to capture more of the data distribution. By contrast, we expect a GAN trained using to perform worse than regular GAN training on the evaluation dataset: a GAN trained using approximately minimises , encouraging the model to only capture the densest parts of the training data distribution. Such a model would not be expected to generalise well to unseen parts of the distribution. However, when retrieving from the background dataset, it is likely that for , retrieval will be similar to that of because the model does not have to generalise to new concepts well, as training and testing are performed on the same dataset.
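The contrast between the two divergences discussed above can be made concrete with a toy discrete example. The distributions below are invented purely for illustration and are not taken from the paper: `p` stands in for a bimodal data distribution, `q_broad` for a model that spreads its mass over the whole space, and `q_narrow` for a model that concentrates on a single mode.

```python
import numpy as np

def kl(a, b):
    """KL[a||b] for discrete distributions with full support."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(a * np.log(a / b)))

p = [0.45, 0.04, 0.02, 0.04, 0.45]        # bimodal "data" distribution
q_broad = [0.2, 0.2, 0.2, 0.2, 0.2]       # covers both modes
q_narrow = [0.9, 0.07, 0.01, 0.01, 0.01]  # sits on one mode only

# KL[P||Q] penalises a model that misses part of the data.
print("KL[P||Q]:", kl(p, q_broad), kl(p, q_narrow))
# KL[Q||P] penalises a model that puts mass where the data is sparse.
print("KL[Q||P]:", kl(q_broad, p), kl(q_narrow, p))
```

In this example KL[P||Q] is smaller for the broad model and KL[Q||P] is smaller for the narrow one, which mirrors the argument that one direction favours covering the data distribution and the other favours its densest regions.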

To perform retrieval, both a query sample and samples in the retrieval dataset (either background or evaluation) are encoded. To encode a sample, it is passed through the discriminator up to its penultimate layer to give a k dimensional encoding vector. The cosine similarity measure is calculated between the query and all samples in the retrieval dataset to score their similarity. The most similar matches are returned in descending order of similarity.
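The retrieval procedure described above can be sketched as follows. The encodings here are random stand-ins; in the paper they would come from the penultimate layer of the trained discriminator. The function name `cosine_retrieve` and the array sizes are hypothetical.

```python
import numpy as np

def cosine_retrieve(query, database, top_n=5):
    """Return indices and scores of the `top_n` rows most similar to `query`."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    scores = db @ q                        # cosine similarity to every row
    order = np.argsort(-scores)[:top_n]    # descending order of similarity
    return order, scores[order]

rng = np.random.default_rng(0)
codes = rng.normal(size=(100, 64))         # stand-in for encoded samples
query = codes[7] + 0.01 * rng.normal(size=64)
idx, sims = cosine_retrieve(query, codes, top_n=3)
print(idx, sims)
```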

5.3.1 Retrieval Across Multiple Alphabets

For each query character in the Omniglot dataset, there are similar examples, so we retrieve the top matches for any query from the evaluation dataset. We treat every sample in the evaluation dataset, in turn, as a query and take the average accuracy across queries. Fig. 7 shows the average accuracy-retrieval curve for the top retrievals across all queries. As expected, using improved retrieval compared to regular GAN training () while worsened retrieval compared to regular GAN training.
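One way to compute a point on such an accuracy-retrieval curve is sketched below. The exact definition of accuracy is an assumption here: it is taken to be the fraction of the top-n retrieved samples that share the query's class, averaged over all queries, with each query excluded from its own results. The function name `retrieval_accuracy` is hypothetical.

```python
import numpy as np

def retrieval_accuracy(codes, labels, top_n):
    """Mean fraction of same-class samples among each query's top_n matches."""
    unit = codes / np.linalg.norm(codes, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)        # a query never retrieves itself
    nearest = np.argsort(-sims, axis=1)[:, :top_n]
    hits = labels[nearest] == labels[:, None]
    return float(hits.mean())

# Toy example: two classes, two samples each.
codes = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
labels = np.array([0, 0, 1, 1])
curve = [retrieval_accuracy(codes, labels, n) for n in (1, 2, 3)]
print(curve)
```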

For the task of retrieval, we are particularly interested in how setting improves performance. Fig. 8 shows the top retrievals using and for a selection of queries. Using , the accuracy of the top retrieval is achieved on the evaluation dataset compared to when using . The chance of randomly retrieving a matching sample is , a value of improves top-1 retrieval accuracy by nearly times that of chance.

Figure 7: Comparing accuracy vs. retrieval on the Omniglot evaluation dataset for regular GAN training () and our alternative cost function using .
Figure 8: Comparing top retrievals on the Omniglot evaluation dataset for regular GAN training () and our alternative cost function using .

When retrieving from the background dataset using , the accuracy of the top retrieval is compared to when using . The chance of randomly retrieving a matching sample is , so here a value of improves accuracy performance by a factor of relative to random choice. The accuracy-retrieval curve can be seen in Fig. 9. A summary of results is shown in Table II.

Figure 9: Comparing accuracy vs. retrieval on the Omniglot background dataset for regular GAN training () and our alternative cost function using .
Method Accuracy (Top 1)
GAN
GAN
GAN
Table II: Comparison of retrieval accuracy on the evaluation set training using our alternative cost function with different values. Note that is equivalent to regular GAN training.

5.3.2 Retrieval Within Alphabets

We also apply our proposed system to perform retrieval on individual alphabets, and compare GANs trained using our alternative cost function at where again, according to the theory in Section IV, we expect training with to perform best. Fig. 10 shows the accuracy of the top retrieval on each alphabet for and . Results show that a GAN trained using the alternative cost function with improves retrieval performance on all alphabets.

Figure 10: Comparing top 1 retrieval accuracy for each alphabet in the evaluation set for regular GAN training () and our proposed alternative cost function using .

5.4 One-Shot Classification

Humans are often able to learn very quickly from only a few examples; training machines to learn from few examples is more difficult. The machine equivalent to learning from few examples is -shot learning, where in the extreme case, and a classifier learns from only one example. Typically, classification models need to learn from many examples to capture the variation of samples in a dataset. Convolutional neural networks typically learn from millions of images [krizhevsky2012imagenet], making this task very challenging.

In these experiments, a representation for handwritten characters is learned by training two GANs with our alternative cost function. The first GAN is trained using , equivalent to regular GAN training. The second is trained using , which we would expect to learn a representation more suitable for one-shot learning.

Previous work [vinyals2016matching, santoro2016one] has looked at learning labels for five or randomly chosen classes from the Omniglot dataset across all alphabets having been shown one or five examples during training.

Learning to classify only five random samples from different classes across the dataset is of less practical significance than learning to classify all samples in the dataset or all samples within a single alphabet. For this reason, we perform the novel task of one-shot learning on both the whole dataset and on individual alphabets. These tasks are more challenging than those of Vinyals et al. [vinyals2016matching] and Santoro et al. [santoro2016one] for several reasons:

  1. By picking samples randomly across all alphabets, the chance of picking two samples from the same alphabet is minimised. Samples from within an alphabet often bear greater similarity to each other than samples from different alphabets, making it easier to perform one-shot learning across alphabets than within alphabets.

  2. Each alphabet in the evaluation dataset has between and character classes, which makes the classification task harder since the probability of randomly guessing the correct label is smaller.

  3. For training and testing, the dataset provided by Lake et al. [lake2015human] is used; it is split into training classes and testing classes, while Vinyals et al. [vinyals2016matching] and Santoro et al. [santoro2016one] split the data into training and testing.

Vinyals et al. [vinyals2016matching] and Santoro et al. [santoro2016one] further boost performance by employing data augmentation methods, which have been shown to improve classification results by preventing overfitting, a common problem when the quantity of training data is limited. We do not use data augmentation since we wish to focus our evaluation on the quality of the representation that is learned by using different values.

Fig. 11 shows the results of one-shot classification on individual alphabets. A Nearest Neighbours classifier for each alphabet at each value was trained on a single sample from each character class within that alphabet, encoded using the discriminator of the GAN trained on the background dataset. The classifier was evaluated on the rest of the samples in the dataset to give the scores shown in Fig. 11.

Results of one-shot learning for the entire evaluation dataset are shown in Table III, training Nearest Neighbours (NN) and LinearSVM on a single example of each character class by encoding them using the trained discriminators at each value.
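The one-shot nearest-neighbour protocol described above can be sketched as follows. The encodings are synthetic stand-ins for discriminator features, and the use of Euclidean distance is an assumption, since the distance measure used by the classifier is not stated. The function name `one_shot_nn` is hypothetical.

```python
import numpy as np

def one_shot_nn(support_codes, support_labels, test_codes):
    """1-nearest-neighbour prediction from one encoded example per class."""
    # Squared Euclidean distances, shape (n_test, n_classes).
    diffs = test_codes[:, None, :] - support_codes[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    return support_labels[dists.argmin(axis=1)]

rng = np.random.default_rng(0)
n_classes, dim, per_class = 5, 16, 19
centres = 5.0 * rng.normal(size=(n_classes, dim))   # synthetic class centres

# One "training" example per class, the rest held out for testing.
support = centres + 0.1 * rng.normal(size=(n_classes, dim))
support_labels = np.arange(n_classes)
test_labels = np.repeat(np.arange(n_classes), per_class)
test = centres[test_labels] + 0.1 * rng.normal(size=(len(test_labels), dim))

predictions = one_shot_nn(support, support_labels, test)
accuracy = float((predictions == test_labels).mean())
print(accuracy)
```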

Method Accuracy using Accuracy using
-NN
LinearSVM
Table III: Comparison of One-shot learning accuracy on the whole evaluation set training using our alternative cost function with different values, where is equivalent to regular GAN training.
Figure 11: Comparing One-shot learning accuracy for regular GAN training () and our proposed alternative cost function using .

One-shot classifiers trained with features that have been taken from a trained GAN with outperform classifiers trained with regular () GAN features on all the alphabets (Fig. 11) and across the dataset as a whole (Table III). This supports the assertion that alternative training with smaller values is better suited to discriminative tasks than regular training of GANs.

5.5 Generating Image Samples

Characters in the Omniglot dataset are made up of strokes [lake2015human], with some characters having similar strokes to each other. The background dataset used for training the GAN consists of characters with examples per character; this means that the GAN has nearly k examples from which to learn strokes, but only examples from which to learn each specific character.

A GAN is trained using our alternative cost function with . We generated random samples by drawing , -dimensional values from a uniform distribution and passing them through the trained generator. The results are shown in Fig. 12 and Fig. 13. In comparing these two figures, it is difficult to draw any conclusions about any benefit to using a larger value; however, experiments involving interpolation show a clear distinction in the way that the generator captures the image space through . This is explored in Section 5.5.2.

Figure 12: Omniglot random generations from a GAN trained using regular methods ().
Figure 13: Omniglot random generations from a GAN trained using our alternative adversarial cost function with .

5.5.1 Checking for overfitting

To show that our generator does not simply overfit to samples from the training data, we show, in Fig. 14, examples of generated samples alongside their pixel-wise nearest neighbour sample from the training data. Results show that the generations are not exact copies of samples from the training data. Further, they strongly suggest that some of the image samples that are generated belong to character classes from the training dataset. However, to match generated samples to a character class, pixel-wise nearest neighbours might not be sufficient.

Figure 14: Pixel-wise nearest neighbour real samples to generated samples. A: For regular GAN training (), B: Using our alternative cost function with .

5.5.2 Interpolating between random image samples

The generator should generate realistic looking samples for any sample, , drawn from the prior distribution, in this case a uniform distribution. According to the analysis in Section IV, training the networks with should encourage the generator to learn a model that captures only the densest parts of the training data distribution at the cost of ignoring the less dense regions. This suggests that samples drawn from a model trained using are more likely to give visually realistic samples than a model trained using .

To test this hypothesis, we would have to generate samples for all possible z, which is not feasible. Instead, we take two random values from the prior distribution and linearly interpolate between them at points and generate samples from these points. These are shown in Fig. 15. For both and , the samples at the intermediate points appear to fail, particularly towards the centre of the interpolation. However, for the change is more abrupt and the first and last samples in the interpolations are consistently good, whereas only the first and last are consistently good for .

Figure 15: Comparing uniform interpolations in z-space between random start and end samples, A: For regularly trained GANs () and B: GANs trained using the alternative cost function with

Using linear interpolation in high dimensions often leads to taking uneven steps between the samples. An alternative interpolation that takes even steps between samples is spherical interpolation [shoemake1985animating], giving a more representative view of the space between samples. Fig. 16 shows hyperspherical interpolations between random samples in -space for and . Here, the effect of is more evident. When evenly traversing -space between two random samples, there are more nonsensical gaps when samples are drawn from a GAN trained using than those drawn from a GAN trained using . The results for are consistent with a model that has optimised , to ensure that any sample drawn from the model is likely to come from the same distribution as the training data. This further supports the hypothesis that training a GAN using larger values is more suitable for sample generation.
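The difference between the two interpolation schemes can be sketched as below, using the spherical interpolation formula of Shoemake [shoemake1985animating]. The latent dimension of 100 and the uniform range of -1 to 1 are assumptions made for this illustration. The sketch shows that the norm of a linearly interpolated point dips towards the middle of the path, whereas the spherically interpolated point keeps a norm close to that of the endpoints.

```python
import numpy as np

def lerp(z0, z1, t):
    """Linear interpolation between z0 and z1."""
    return (1.0 - t) * z0 + t * z1

def slerp(z0, z1, t):
    """Spherical interpolation between z0 and z1."""
    cos_omega = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(np.sin(omega), 0.0):
        return lerp(z0, z1, t)   # (anti)parallel vectors: fall back to lerp
    return (np.sin((1.0 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

rng = np.random.default_rng(0)
z0, z1 = rng.uniform(-1, 1, size=(2, 100))   # assumed prior and dimension
ts = np.linspace(0.0, 1.0, 9)
lerp_norms = [np.linalg.norm(lerp(z0, z1, t)) for t in ts]
slerp_norms = [np.linalg.norm(slerp(z0, z1, t)) for t in ts]
print(np.round(lerp_norms, 2))
print(np.round(slerp_norms, 2))
```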

Figure 16: Comparing spherical interpolation in z-space between random start and end samples, A: for regularly trained GANs () and B: GANs trained using the alternative cost function with . Note the apparent missing samples in A.

6 Discussion

When showing that adversarial training is equivalent to divergence for large and small , it is assumed that the discriminator is near-optimal. To improve the chance that the discriminator is close to optimal, for every one iteration that we train the generator for values of , we train the discriminator for three iterations. For =0.9 we found that the network would not converge for the same number of training iterations used at , and so we reduced the number of discriminator iterations so that the discriminator was trained only once per generator iteration. Goodfellow et al. [goodfellow2014generative] suggest that, for sample generation, one iteration is sufficient. We have demonstrated the use of to show performance benefits of our proposed alternative cost function on both generative and discriminative tasks. We show that for , a model more suitable for discriminative tasks is learned. The images generated at are not shown because they are either very primitive strokes or blank samples. However, at we find that the hypothesis still holds, whereby discriminative tasks are improved compared to regular GAN training. We also find that generations are both more realistic than for and more varied than regular GAN training. This variance comes at the cost of some samples being non-realistic. This suggests that our approach may be used to address other issues, such as lack of variation in generated samples. We leave this for future work.
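The alternating schedule discussed above, in which the discriminator is updated several times for each generator update, can be sketched as a simple loop. The update functions here are hypothetical placeholders that only count calls; the name `k` for the number of discriminator updates per generator update is borrowed from common GAN notation and is an assumption.

```python
def alternate_updates(n_iterations, k, update_discriminator, update_generator):
    """Run k discriminator updates for every generator update."""
    for _ in range(n_iterations):
        for _ in range(k):
            update_discriminator()
        update_generator()

# Placeholder update steps that only record how often they are called.
counts = {"D": 0, "G": 0}

def fake_discriminator_step():
    counts["D"] += 1

def fake_generator_step():
    counts["G"] += 1

# Three discriminator updates per generator update, as described above.
alternate_updates(10, 3, fake_discriminator_step, fake_generator_step)
print(counts)
```

Setting `k` to 1 corresponds to the reduced schedule used when the network would not converge.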

7 Conclusion

Generative adversarial networks (GANs) are able to generate realistic looking image samples, while simultaneously learning representations for image samples from a limited set of labelled training data. GANs are able to achieve this by minimising an adversarial cost function, which under certain conditions can be shown to approximate the Jensen-Shannon entropy. However, adversarial training can be improved, particularly when a model is intended specifically for the task of generation or classification.

We propose an alternative adversarial cost function parametrised by which we show to be approximately proportional to , for small and approximately proportional to for large . We perform both generative and discriminative tasks using our alternative cost function to show experimental evidence to support the theory motivating our alternative cost function.

Retrieval and one-shot learning experiments compared regular GAN training to training using . Our results showed that GANs trained using our alternative cost function learned a representation for retrieval and one-shot learning that outperformed regularly trained GANs in all experiments. We also presented the first alphabet-wise one-shot classification scores on the Omniglot dataset, classifying all characters in each alphabet. Previous work had only attempted to classify randomly chosen samples [vinyals2016matching, santoro2016one].

Experiments on image generation compared regular GAN training to training using . Evidence for improved synthesis is shown by interpolating between two random samples, showing that when a GAN is trained using our alternative cost function with a large value, there are fewer gaps in the interpolation. This suggests that using our alternative cost function with a larger value learns a model more suitable for generation than regularly trained GANs.

Both theory and experimental results suggest that our alternative cost function, parametrised by , allows for tuning of generative models for either generative or discriminative tasks by choosing a suitable for the task.

Acknowledgment

We would like to acknowledge the Engineering and Physical Sciences Research Council for funding through a Doctoral Training studentship.