Adversarial Transfer Learning

12/06/2018 ∙ by Garrett Wilson, et al. ∙ Washington State University

In recent years there has been a large and growing interest in generative adversarial networks (GANs), which offer powerful features for generative modeling, density estimation, and energy function learning. GANs are difficult to train and evaluate but are capable of creating amazingly realistic, though synthetic, image data. Ideas stemming from GANs, such as adversarial losses, are creating research opportunities for other challenges such as domain adaptation. In this paper, we look at the field of GANs with emphasis on these areas of emerging research. To provide background for adversarial techniques, we survey the field of GANs, looking at the original formulation, training variants, evaluation methods, and extensions. Then we survey recent work on transfer learning, focusing on comparing different adversarial domain adaptation methods. Finally, we look forward to identify open research directions for GANs and domain adaptation, including some promising applications such as sensor-based human behavior modeling.


1. Introduction

In recent years there has been a large and growing interest in generative adversarial networks (GANs). By pitting two well-matched neural networks against each other in the roles of a data discriminator and a data generator, GANs refine each player's abilities in order to perform functions such as synthetic data generation. Goodfellow et al. (goodfellow2014nips) proposed this technique in 2014. Since that time, hundreds of papers have been published on the topic (ganzoo).

GANs have traditionally been applied to synthetic image generation, but recently researchers have been exploring other novel use cases such as domain adaptation. Supervised learning is arguably the most prevalent use of machine learning and has had much success. However, many common supervised learning methods make an assumption that is not always valid: that the training data and testing data are drawn from the same distribution. When these distributions differ but are related, transfer learning can be used to transfer what is learned on the training distribution to the testing distribution, which often results in improved performance on the testing data in comparison with inaccurately assuming that the training data and testing data were drawn from the same distribution. A popular case of transfer learning is domain adaptation, where the feature space and task remain fixed between a source domain and a separate target domain while the marginal probability distributions differ. As with other cases of transfer learning, the goal of domain adaptation is to achieve strong predictive performance on the target data. Recent advances in domain adaptation performance on image datasets have resulted from the use of adversarial losses, a technique inspired by GANs.

The goal of this paper is to investigate and compare current work in the area that highlights the use of GANs for transfer learning. Figure 1 shows the percentage of GAN-related papers that also mention terms related to transfer learning. For comparison, the chart also includes the traditional use of GANs for data generation (such as image generation). Because the most prevalent transfer learning area overlapping GANs is domain adaptation, in this paper we focus on domain adaptation and the GAN-inspired adversarial domain adaptation techniques.

Figure 1. Of the 500 papers referring to GANs, this figure shows how many of those papers also include terms related to transfer learning (right 8 terms). For comparison, the use of GANs for data generation such as image generation (or synthesis) are included (left 2 terms). Results are based on searching the first three pages of each of the 16,644 available papers published in AAAI, ACL, AISTATS, CVPR, ICLR, ICML, IJCAI, JMLR, and NIPS between January 2014 and August 2018.

To provide a background for the use of adversarial techniques in transfer learning, we start with a game-theoretic explanation of GANs along with the alternative interpretations, challenges, variants, and extensions found in the literature. We follow this with an overview of transfer learning. We next investigate both non-adversarial and the more recent adversarial approaches to domain adaptation. Finally, we identify future research directions for GANs, domain adaptation, and methods of combining and applying these ideas.

Figure 2. Realistic but entirely synthetic images of human faces generated by a GAN trained on the CelebA-HQ dataset. Images courtesy Karras et al. (karras2018progressive).

To facilitate the comparison of current work in GANs and domain adaptation, we will use image generation and image classification domain adaptation, respectively, as running examples due to the popularity of these use cases. Popular datasets for GAN-based synthetic image generation include human faces (CelebA (Liu_2015_ICCV)), handwritten digits (MNIST (lecun1998mnist)), bedrooms (LSUN (Yu2015LSUNCO)), and sets of other objects (CIFAR-10 (krizhevsky2009learning) and ImageNet (5206848; ILSVRC15)). A GAN can be trained using such a dataset. Following training, the goal is for the GAN generator to be capable of generating images that closely resemble images in the dataset but that are entirely synthetic. For example, a generator trained with CelebA will generate images of human faces that look realistic but are not images of real people, as shown in Figure 2.

For image classification domain adaptation using GAN-inspired adversarial losses, a model trained on the source image dataset is adapted to perform well on a target image dataset. One use case is unsupervised domain adaptation, where the target dataset is not required to have labels, which can reduce the cost of creating the dataset. For example, we might adapt a model that was trained to recognize traffic signs from computer-generated synthetic instances with known labels (moiseev2013evaluation) to a dataset consisting of photos of real traffic signs (Stallkamp-IJCNN-2011). Such adaptation saves the human time that would be spent labeling the images.

2. Generative Adversarial Networks

Generative adversarial networks (GANs) are a type of deep generative model (goodfellow2014nips). For synthetic image generation, a training dataset of images must be available. After training, the generative model will be able to generate synthetic images that resemble those in the training data. To learn to do this, GANs utilize two neural networks competing against each other (goodfellow2014nips). One network represents a generator. The generator accepts a noise vector as input, which contains random values drawn from some distribution such as normal or uniform. The goal of the generator network is to output a vector that is indistinguishable from the real training data. The other network represents a discriminator, which accepts as input either a real sample from the training data or a fake sample from the generator. The goal of the discriminator is to determine the probability that the input sample is real. During training, these two networks play a minimax game, where the generator tries to fool the discriminator and the discriminator tries to not be fooled.

Using the notation from Goodfellow et al. (goodfellow2014nips), we define a value function employed by the minimax game between the two networks:

$$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \qquad (1)$$

Here, $x \sim p_{data}(x)$ draws a sample from the real data distribution, $z \sim p_z(z)$ draws a sample from the input noise, $D$ is the discriminator, and $G$ is the generator. As shown in the equation, the goal is to find the parameters of $D$ that maximize the log probability of correctly discriminating between real ($x$) and fake ($G(z)$) samples while at the same time finding the parameters of $G$ that minimize $\log(1 - D(G(z)))$. The term $D(G(z))$ represents the probability that generated data $G(z)$ is real. If the discriminator correctly classifies a fake input then $D(G(z)) = 0$. Equation 1 is minimized with respect to $G$ when $D(G(z)) = 1$, or when the discriminator misclassifies the generator's output as a real sample. Thus the discriminator's mission is to learn to correctly classify the input as real or fake while the generator tries to fool the discriminator into thinking that its generated output is real. This process is illustrated in Figure 3.

[Figure 3 diagram: a noise vector $z$ is fed to the generator $G$, producing a fake image; the fake image and a real image are each fed to the discriminator $D$, which outputs "real" or "fake".]

Figure 3. Illustration of the GAN generator and discriminator networks. The dashed line between the networks indicates that they share weights (or are the same network). In the top row a real image from the training data (horses and zebras dataset by Zhu et al. (zhu2017iccv)) is fed to the discriminator, and the goal of $D$ is to make $D(x) = 1$ (correctly classify as real). In the bottom row a fake image from the generator is fed to the discriminator, and the goal of $D$ is to make $D(G(z)) = 0$ (correctly classify as fake), which competes with the goal of $G$ to make $D(G(z)) = 1$ (misclassify as real).
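
To make the minimax game concrete, below is a minimal training-loop sketch in PyTorch. The toy fully connected networks, learning rates, and the train_step helper are illustrative assumptions rather than the setup of any particular paper; real GANs are typically deeper and convolutional.

import torch
import torch.nn as nn

# Hypothetical toy networks for illustration only.
G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)

def train_step(real_batch):
    z = torch.randn(real_batch.size(0), 64)   # noise vector z

    # Discriminator step: ascend log D(x) + log(1 - D(G(z))).
    fake = G(z).detach()                      # detach: do not update G here
    loss_D = -(torch.log(D(real_batch) + 1e-8).mean()
               + torch.log(1 - D(fake) + 1e-8).mean())
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step: descend log(1 - D(G(z))) (the minimax form of Eq. 1).
    loss_G = torch.log(1 - D(G(z)) + 1e-8).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()

# Example usage with a random stand-in for a batch of flattened 28x28 images:
real_batch = torch.rand(32, 784) * 2 - 1      # scaled to [-1, 1] to match Tanh
print(train_step(real_batch))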

2.1. Components

2.1.1. Generator

Because GANs are generative models, they fit into the generative model taxonomy described by Goodfellow (goodfellow2016survey) (comparing maximum likelihood variants of the different methods). There are two types of generative models: explicit and implicit. Explicit generative models utilize an explicit density function; therefore, the model being learned has an explicit function with parameters that are adjusted to increase the log likelihood of the training data given the parameters. For techniques like fully visible belief nets (frey1998graphical; bengio2000modeling; larochelle2011neural; germain2015made; pmlr-v48-oord16) and change of variables or nonlinear independent components analysis (ICA) models (deco1995higher; dinh2016density), the density function is tractable. For variational autoencoders (VAEs) (kingma2013auto; rezende2014stochastic) and Boltzmann machines (fahlman1983massively; ackley1985learning; hinton2012practical), the density function is intractable and thus these methods use approximations. In contrast, implicit generative models do not utilize an explicit density function at all but rather sample from the distribution without modeling it explicitly. Some methods such as generative stochastic networks (GSNs) (bengio2014deep) use a Markov chain to draw samples, but using a Markov chain requires multiple steps and in high-dimensional spaces (like images) can be slow to converge (goodfellow2016survey). Other methods, like GANs, sample directly from the distribution in one step without the use of a Markov chain (goodfellow2016survey). Thus, GANs fit into the generative model taxonomy as having an implicit density function and sampling directly from the distribution in a single step. A summary of the taxonomy is shown in Figure 4.

[Figure 4 diagram: a tree with root "Generative Models using Maximum Likelihood", branching into "Explicit Density" and "Implicit Density". Explicit Density branches into "Tractable Density (e.g. fully visible belief nets, change of variables models)" and "Approx. Density", which in turn branches into "Variational (e.g. VAE)" and "Markov Chain (e.g. Boltzmann machine)". Implicit Density branches into "Markov Chain (e.g. GSN)" and "Sample Directly (e.g. GAN)".]
Figure 4. The generative model taxonomy from Goodfellow (goodfellow2016survey). GANs fit into the bottom right leaf. They utilize an implicit density function and sample directly from the distribution in a single step.

2.1.2. Discriminator

To learn a generator that can sample directly from the distribution but without relying upon an explicit density function with a tractable log likelihood or an approximation to the log likelihood, GANs use a discriminator, which can be thought of as learning a loss (or cost) function (goodfellow2016survey; isola2017cvpr). For this discussion, let's assume that our generator is creating synthetic images that are close approximations to real images found in a sample dataset. The GAN loss function is the value function $V(D,G)$, which in Equation 1 represents the discriminator learning to classify input as real or fake. The GAN generator is trained to minimize this loss. Employing alternative loss functions results in corresponding changes to the type of data that is generated. As an example, another loss function is mean squared error. If this loss function is used, the trained generator would produce blurry images (lotter2015) as shown in Figure 5. Intuitively, this outcome is expected because for a given class of images there are multiple realistic outputs that the generator could produce. For example, on a human face there are various slight alterations of hair movement, head rotation, and other features that would look realistic. Mean squared error averages these possibilities and hence generates blurry images (goodfellow2016survey). By using a GAN discriminator as a learned loss function, the GAN generator is penalized if it averages together these alterations because doing so would create images that are easily distinguishable from the real training data.

Figure 5. Ground-truth face (left). Using mean squared error results in the slight alterations of possible human heads being averaged together, blurring the face (middle). An adversarial loss picks one of the possible outputs, resulting in a sharper face (right). Images courtesy Lotter et al. (lotter2015).

2.2. Alternative Interpretations

GANs lie at the intersection of many areas of investigation: generative models, deep learning, game theory, probability theory, energy-based models, and reinforcement learning. This has led to a number of alternative interpretations of GANs. These interpretations offer different insights that lead to techniques for improving stability. They could also potentially lead to new and more diverse applications of GANs. Rather than viewing GANs purely as a generative model with a learned loss function, another interpretation is that GANs are learning an energy function which maps to low energy values where the data falls in a high-dimensional space (e.g., where actual CelebA pictures are positioned in the image space) and high energy values everywhere else (zhao2016iclr). GANs offer one way to learn this energy function. The samples from the dataset indicate where the energy function should map to low energy values. In this interpretation, the generator outputs "contrastive samples", or fake points that the trained energy function maps to high energy values. Markov Chain Monte Carlo methods offer one strategy to generate such samples in a non-parametric way. Alternatively, GANs offer a parametric strategy for generating these contrastive samples, learning the parameters through the GAN minimax game.

GANs can also be interpreted as a way to estimate probabilistic models, i.e., density estimation (nowozin2016nips). In this interpretation, there are two distributions. These are $p_{model}$, the distribution based on the model parameters, and $p_{data}$, the true data distribution. Given a measure of the distance between $p_{model}$ and $p_{data}$, the GAN essentially adjusts the model parameters during training to minimize this distance. Building on this interpretation, Nowozin et al. (nowozin2016nips) generalize GANs to any $f$-divergence such as Kullback-Leibler, Pearson, or Jensen-Shannon divergence.

Some interpretations of GANs utilize reinforcement learning. Specifically, inverse reinforcement learning (IRL) learns a reward (or cost) function using expert demonstrations of a target task as training data (ng2000algorithms). For example, the demonstrations may be of how bees decide which flowers to visit (ng2000algorithms), how humans behave in an economic market (ng2000algorithms), how a human drives a car (ng2000algorithms), or how to label an image sequence of characters (ross2011reduction). In the context of GANs, if the generator’s density can be evaluated and is incorporated into the discriminator, Finn et al. (finn2016) showed that GANs are equivalent to maximum entropy IRL. GANs can also be viewed as a type of actor critic approach to reinforcement learning (pfau2016). Both GANs and actor critic methods are multilevel optimization problems in which only the critic (discriminator) has access to the reward (real sample) and the actor (generator) must learn only from the error (gradients) of the critic.

2.3. Training

In recent years there have been impressive results from GANs, fueled in part by the many interpretations that have been considered. At the same time, this research faces a number of challenges. First, training a GAN can encounter problems such as difficulty converging, mode collapse, and vanishing gradients. Second, once successfully trained, the model can be difficult to evaluate and compare with other models. In this section we will look at training challenges followed by a discussion of the evaluation challenges in Section 2.4. There has been much work in both regards, but both problems still require continued research.

2.3.1. Challenges

When training a GAN, multiple challenges may be encountered. GAN training may fail to converge. Because there are two players in the GAN game, each player's move (i.e., update to its neural network via gradient descent) toward a lower loss may undo the other player's progress toward reaching its own lower loss (goodfellow2016survey). For example, GANs have been observed to oscillate without making progress toward an equilibrium (goodfellow2016survey). In general, an equilibrium to a game may not even exist (e.g. rock-paper-scissors) (arora2017icml), but Arora et al. (arora2017icml) show that an approximate pure equilibrium does exist for a Wasserstein training objective if the generator wins the game. They additionally propose a technique wherein a mixture of discriminators and generators incorporate other objectives. However, while an approximate equilibrium does exist, that does not mean that backpropagation will find it when training the GAN (arora2017icml).

Figure 6. The true data distribution is shown in solid blue, the currently generated distribution in dotted red, and the discriminator decision boundary in dashed orange (left). The minimax GAN loss is graphed in solid blue and the NS-GAN loss in dotted red (right). Notice that early in training the generator has learned an incorrect distribution shifted to the right of the true distribution. The minimax GAN loss gradient has "vanished" on the right (i.e. is zero) where the generated samples lie, whereas the NS-GAN loss is non-zero and continues to provide a useful gradient. Figure courtesy Fedus et al. (fedus2017many).

A common type of non-convergence that GANs may suffer from is mode collapse, where the generator only learns to generate realistic samples for a few modes of the data distribution (goodfellow2016survey). Oliehoek et al. (oliehoek2018beyond) further classified mode collapse as either mode omission, where the generator is unable to generate samples in at least one of the modes of the dataset, or mode degeneration, where at least one of the modes is only partially covered. For example, a generator may learn to only generate images of a certain color when the dataset includes images having many colors or of a particular dog when the dataset contains images of many different types of animals (goodfellow2016survey).

Another problem is vanishing gradients, illustrated in Figure 6, for which a fix was proposed in the original GAN paper by Goodfellow et al. (goodfellow2014nips). A solution to the minimax game is found through iterative optimization: alternating between optimizing the discriminator objective and the generator objective. However, this approach faces a complication: initially, when the generated samples are very poor (the generated dotted red distribution on the left is far from the true solid blue distribution), the discriminator will be very confident in whether the generated image is real or fake (the dashed orange decision boundary on the left easily determines solid blue is real and dotted red is fake). Thus, $D(G(z))$, which is the probability of the generated sample being real, will be very close to zero, making the gradient of $\log(1 - D(G(z)))$ very small (the solid blue line is at zero for generated samples on the right). This generator optimization problem attempts to minimize the log probability of the discriminator correctly classifying the sample. Alternatively, the optimization objective could be changed to maximize the log probability of the discriminator incorrectly classifying the sample. Thus, Goodfellow et al. (goodfellow2014nips) suggest replacing the goal of minimizing $\log(1 - D(G(z)))$ with the goal of maximizing $\log D(G(z))$ (the dotted red line is non-zero for generated samples on the right). This trick is commonly used in practice to address vanishing gradients early in training and is referred to as the non-saturating GAN (NS-GAN) (fedus2017many). However, Arjovsky et al. (arjovsky2017iclr) later showed that this may remedy vanishing gradients at the cost of decreased stability.
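
The difference between the two generator objectives is small in code. The following sketch (PyTorch, with an assumed small epsilon for numerical stability) contrasts the saturating minimax loss with the non-saturating alternative; d_fake stands for the discriminator's output D(G(z)) on generated samples.

import torch

def generator_loss_minimax(d_fake):
    # Minimax objective: minimize log(1 - D(G(z))).
    # Saturates (near-zero gradient) when D(G(z)) is close to 0,
    # i.e. exactly when the generator is poor early in training.
    return torch.log(1 - d_fake + 1e-8).mean()

def generator_loss_nonsaturating(d_fake):
    # NS-GAN objective: maximize log D(G(z)), i.e. minimize -log D(G(z)).
    # Provides a strong gradient when D(G(z)) is near 0.
    return -torch.log(d_fake + 1e-8).mean()

d_fake = torch.tensor([0.01, 0.02])           # a confident discriminator
print(generator_loss_minimax(d_fake))         # gradient nearly flat here
print(generator_loss_nonsaturating(d_fake))   # gradient large here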

2.3.2. Tricks

There have been a variety of tricks used to improve the stability of GANs (ganhacks), one of which is label smoothing, proposed by Salimans et al. (salimans2016nips). Label smoothing replaces the ground-truth probabilities of 1 or 0 with smoothed values such as 0.9 or 0.1. In a GAN discriminator, this decreases the confidence in classifying the input as either real or fake. However, one-sided label smoothing is recommended, wherein only the positive label is smoothed (e.g. to 0.9) while the negative label is kept at 0. If the negative label is smoothed and the generator's output is obviously fake, there may not be incentive for the generator to produce a more realistic output.
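
One-sided label smoothing amounts to a one-line change in the discriminator's loss. A minimal sketch, assuming sigmoid discriminator outputs and a smoothing value of 0.9:

import torch
import torch.nn.functional as F

def discriminator_loss(d_real, d_fake, smooth=0.9):
    # One-sided label smoothing: targets for real inputs are 0.9 rather
    # than 1.0; targets for fake inputs remain 0.
    real_targets = torch.full_like(d_real, smooth)
    fake_targets = torch.zeros_like(d_fake)
    return (F.binary_cross_entropy(d_real, real_targets)
            + F.binary_cross_entropy(d_fake, fake_targets))

print(discriminator_loss(torch.tensor([0.8, 0.95]), torch.tensor([0.1, 0.3])))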

These authors also propose applying historical averaging. Using historical averaging, a term is included in each player’s cost to penalize parameters that differ from the historical average of the parameters’ values. In some low-dimensional, continuous non-convex games where gradient descent alone results in orbits and not convergence, this technique has resulted in convergence. Exploring whether historical averaging may help GAN convergence in higher dimensions remains an open question (salimans2016nips).

Along a similar line of thinking but using a history of generated data (such as images) rather than a history of parameter values, Shrivastava et al. (shrivastava2017cvpr) propose using a mix of the current batch of generated images together with past images from a history of generated images when training the discriminator. This both helps training to converge and also makes it less likely for the generator to reintroduce learned artifacts during training since the discriminator is more likely to remember these artifacts are fake.
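
The idea can be implemented as a small replay buffer of generated images. The following sketch uses an assumed buffer size of 500 and a 50/50 replacement policy; these are illustrative choices, not necessarily those of Shrivastava et al.

import random
import torch

class GeneratedImageHistory:
    def __init__(self, max_size=500):
        self.max_size = max_size
        self.buffer = []

    def sample(self, batch):
        # Return a batch mixing current generator outputs with past ones.
        out = []
        for img in batch:
            if len(self.buffer) < self.max_size:
                self.buffer.append(img)     # fill the buffer first
                out.append(img)
            elif random.random() < 0.5:
                i = random.randrange(len(self.buffer))
                out.append(self.buffer[i])  # show the discriminator an old image
                self.buffer[i] = img        # and remember the new one
            else:
                out.append(img)
        return torch.stack(out)

history = GeneratedImageHistory()
fake_batch = torch.randn(8, 3, 32, 32)      # stand-in for generator output
mixed_batch = history.sample(fake_batch)    # feed this to the discriminator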

For a theoretically backed but trivial-to-implement improvement, Heusel et al. (heusel2017nips) propose a two time-scale update rule (TTUR): the learning rates of the discriminator and generator differ. These learning rates must be chosen to balance convergence (favoring a small learning rate) against fast training (favoring a large learning rate). Typically the discriminator learning rate will be larger than that of the generator. The authors prove that TTUR results in a GAN converging to a stationary local Nash equilibrium given a few assumptions and empirically demonstrate that it increases training stability and performance.
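
In code, TTUR is simply two optimizers with different learning rates. The placeholder networks and the particular rates below are illustrative assumptions, not values prescribed by the paper.

import torch
import torch.nn as nn

G = nn.Linear(64, 784)    # placeholder generator network
D = nn.Linear(784, 1)     # placeholder discriminator network

# Two time-scale update rule: the discriminator typically gets the
# larger learning rate of the two.
opt_D = torch.optim.Adam(D.parameters(), lr=4e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)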

2.3.3. Network Architectures

In addition to the tricks described in the previous section, some specific network architecture choices are commonly introduced to ease training difficulties. Radford et al. (radford2015) propose DCGAN, a set of five network architecture choices that increase GAN training stability. The first choice is to replace spatial pooling layers with strided convolutions in the discriminator and fractional-strided convolutions in the generator, which require the generator and discriminator to learn how to spatially upsample and downsample, respectively. Second, batch normalization is applied to all layers except for the output layer of the generator and the input layer of the discriminator. Batch normalization is a technique that normalizes the inputs in a batch to have zero mean and unit variance. In their experiments, Radford et al. found this choice helped avoid mode collapse early on, mitigated the effect of poor initialization, and allowed gradients to flow when using very deep networks. Third, fully connected layers can be eliminated. They found that the alternative of using global average pooling improved stability although at the cost of convergence speed. Thus, as a balance between stability and speed, they recommend simply eliminating the fully connected layers and relying on the convolutional layers instead. The fourth choice is to use ReLU activations at all generator layers except for the output layer, which uses Tanh. Fifth, for the discriminator, leaky ReLU activations can be applied at all layers.
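
A sketch of a DCGAN-style generator for 3x64x64 images that follows the five guidelines (fractional-strided convolutions instead of pooling, batch norm on all but the output layer, no fully connected layers, ReLU hidden activations, Tanh output); the exact channel counts are illustrative.

import torch
import torch.nn as nn

generator = nn.Sequential(
    nn.ConvTranspose2d(100, 512, 4, 1, 0, bias=False),  # z (100x1x1) -> 512x4x4
    nn.BatchNorm2d(512), nn.ReLU(True),
    nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False),  # -> 256x8x8
    nn.BatchNorm2d(256), nn.ReLU(True),
    nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False),  # -> 128x16x16
    nn.BatchNorm2d(128), nn.ReLU(True),
    nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),   # -> 64x32x32
    nn.BatchNorm2d(64), nn.ReLU(True),
    nn.ConvTranspose2d(64, 3, 4, 2, 1, bias=False),     # -> 3x64x64, no batch norm
    nn.Tanh(),                                          # Tanh on the output layer
)

z = torch.randn(1, 100, 1, 1)
print(generator(z).shape)    # torch.Size([1, 3, 64, 64])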

Salimans et al. (salimans2016nips) propose a slight change to batch normalization to help with convergence called virtual batch normalization. They found that using just batch normalization resulted in the output of a network for a given input to be highly dependent on some of the other inputs in a batch. In virtual batch normalization, this problem is remedied by computing statistics on the input added to a fixed reference data batch chosen at the beginning of training (rather than the current batch).

A drastically different approach was proposed by Karras et al. (karras2018progressive): progressively growing a GAN for higher and higher image resolution output. Initially they train the generator and discriminator with only a few layers. This generator outputs a small image (e.g. 4x4 pixels). Then they add another layer to the generator and another to the discriminator, which now output a larger image (e.g. 8x8 pixels). This progressive growth continues until reaching the desired output resolution, which in their case was 1024x1024 pixels. They explain that by introducing this curriculum into the training, rather than learning all small-scale details and large-scale variations in all layers at the same time, during the progressive growth the network has likely already converged at lower-resolution layers and thus only has to learn to refine the smaller representation into a larger one. As an added benefit, this can accelerate the training process. A potential disadvantage of this approach is that while it generates stunning images (see Figure 2), the method may not work in applications where the individual features are not as related as they are in images.

2.3.4. Objective Modifications

Rather than applying tricks or restricting the network architecture, others have chosen to explore modifying the discriminator or generator objectives in an attempt to resolve training challenges. Metz et al. (metz2016) introduce unrolled GANs. Their method helps address mode collapse, increase diversity, and stabilize training. When training the generator, they add an auxiliary second term that accounts for how the discriminator will respond to the generator update. They add this auxiliary loss only to the generator since, most commonly when training a GAN, the discriminator will overpower the generator. By allowing the generator to see one step ahead, the networks will be more balanced. More than one unrolling step could be used, but they found that one was sufficient in their experiments. However, they note that for more unstable networks such as recurrent neural networks, more than one would likely be needed.

Arjovsky et al. (arjovsky2017icml) propose using a Wasserstein GAN (WGAN), and Gulrajani et al. (gulrajani2017nips) improve upon this with a gradient penalty (WGAN-GP). The original minimax GAN is viewed as minimizing a Jensen-Shannon divergence (goodfellow2014nips; nowozin2016nips), but this divergence may not always be continuous, which may increase the difficulty of training (arjovsky2017icml) since training involves computing gradients. Arjovsky et al. show that using the Earth Mover or Wasserstein-1 distance is advantageous because it is continuous everywhere and differentiable almost everywhere, assuming the discriminator is locally Lipschitz. To be $K$-Lipschitz, the norm of the discriminator's gradients must be upper bounded by $K$ (gulrajani2017nips), which Arjovsky et al. enforce by clipping the discriminator's weights to lie in $[-c, c]$ for some $c$ (e.g. $c = 0.01$). However, Gulrajani et al. found that on some simple experiments weight clipping only learns simple approximations to the optimal model. In addition, weight clipping can result in vanishing or exploding gradients, requiring careful tuning of $c$ (gulrajani2017nips). Instead, they propose replacing the weight clipping with a soft constraint by introducing a gradient penalty term into the discriminator's loss function that penalizes gradient norms far from 1. With this change and the resulting increase in stability, Gulrajani et al. were able to train a variety of architectures and deeper networks, including one that was 101 layers deep. Kodali et al. (kodali2017) used a similar gradient penalty in their deep regret analytic GAN (DRAGAN) training method, which helped to avoid mode collapse. However, gradient penalties are not limited to use in WGAN or DRAGAN. Fedus et al. (fedus2017many) found that adding a gradient penalty benefits the original NS-GAN as well.
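
The gradient penalty itself is short to implement. A sketch, assuming a critic D that maps a batch of samples to a batch of scores and the commonly used penalty weight of 10:

import torch

def gradient_penalty(D, real, fake, lam=10.0):
    # Evaluate the critic at random interpolations between real and fake
    # samples and penalize gradient norms far from 1 (soft Lipschitz
    # constraint, in the style of WGAN-GP).
    alpha_shape = [real.size(0)] + [1] * (real.dim() - 1)
    alpha = torch.rand(alpha_shape, device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    d_out = D(interp)
    grads = torch.autograd.grad(outputs=d_out, inputs=interp,
                                grad_outputs=torch.ones_like(d_out),
                                create_graph=True)[0]
    grad_norm = grads.reshape(grads.size(0), -1).norm(2, dim=1)
    return lam * ((grad_norm - 1) ** 2).mean()

# Usage: add this term to the critic's loss, e.g.
#   loss_D = fake_scores.mean() - real_scores.mean() + gradient_penalty(D, real, fake)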

Jolicoeur-Martineau (jolicoeur2018relativistic) explain that approaches such as WGAN-GP possess a relativistic discriminator: a discriminator that estimates the probability that the real data is more realistic than a sample of fake data (or fake data on average), which helps stabilize training. This is in contrast to the original, non-relativistic discriminator that estimates the probability that input data is real (Section 2). Rather than training the generator to make the discriminator output a probability of 1 that the data is real (i.e., fool the discriminator), they suggest both increasing the probability that the fake data is real (toward 0.5) and decreasing the probability that the real data is real (also toward 0.5). This makes use of the a priori knowledge that half of a minibatch used during training will be real data and half fake data and better matches the theoretical training dynamics of minimizing the Jensen-Shannon divergence (jolicoeur2018relativistic). They show that integral probability metric GANs (IPM-based GANs), which include WGAN and WGAN-GP, are a subset of relativistic GANs (RGANs). They found that using a relativistic discriminator helped to mitigate GANs from becoming stuck early on in training, generated higher-quality samples, allowed for training even on small datasets, and resulted in faster training if including a gradient penalty when compared with WGAN-GP.

Miyato et al. (miyato2018spectral) propose a normalization method called spectral normalization that stabilizes the discriminator during GAN training. Zhang et al. (zhang2018self) use spectral normalization for the generator in addition to the discriminator. Spectral normalization GANs (SN-GANs) resulted in higher Inception scores (Section 2.4), increased robustness to changes in network architecture, and were the first to produce reasonable images on the large number of ImageNet classes from a single generator-discriminator pair (i.e., not splitting up the classes among a number of generator-discriminator pairs) (miyato2018spectral). In contrast to weight normalization, SN-GANs generate more diverse and complex output images but are slightly slower to train. In contrast to WGAN-GP, SN-GANs can handle higher learning rates and are less computationally expensive (miyato2018spectral).
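
In PyTorch, spectral normalization can be applied with the built-in torch.nn.utils.spectral_norm wrapper, which constrains each wrapped layer's spectral norm. A sketch of a discriminator with every layer normalized; the channel counts and kernel sizes are illustrative assumptions.

import torch.nn as nn
from torch.nn.utils import spectral_norm

discriminator = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, 4, stride=2, padding=1)),
    nn.LeakyReLU(0.1),
    spectral_norm(nn.Conv2d(64, 128, 4, stride=2, padding=1)),
    nn.LeakyReLU(0.1),
    spectral_norm(nn.Conv2d(128, 1, 4, stride=1, padding=0)),  # critic score map
)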

Odena et al. (odena2018generator) introduce Jacobian Clamping, a regularization added to the generator loss that attempts to penalize large condition numbers by penalizing singular values of the Jacobian that fall outside a particular range (specified by two hyperparameters). In one of their experiments, they trained a GAN 10 times with different initializations. For half, the condition number initially increased and stayed high, whereas for the other half, the condition number initially increased and then decreased. They found the condition number to be predictive of GAN performance, as measured by the Inception Score and Fréchet Inception Distance, which are discussed in Section 2.4. By utilizing this regularization to make condition numbers more consistent, they could largely mitigate the variance in GAN performance across random network weight initializations.

2.3.5. Combining Methods

These approaches are not all mutually exclusive – a combination of these approaches can aid in training. For example, when Miyato et al. (miyato2018cgans) introduce projection-based conditional GANs, they use WGAN-GP for the generator and spectral normalization in the discriminator. Heusel et al. (heusel2017nips) used both their two time-scale update rule along with either DCGAN or WGAN-GP. When Zhang et al. (zhang2018self) introduce self-attention GANs (SAGANs), they use spectral normalization in both the generator and discriminator and use the two time-scale update rule. They tested this combination and found it allowed training to one million iterations without a decrease in performance (sample quality, FID, and Inception score), whereas with only spectral normalization the performance decreased after around 260,000 iterations (zhang2018self).

2.3.6. Ongoing Research

There has been much work on resolving training challenges. For a more in-depth discussion of these methods, a number of survey papers directed at GAN variants include a discussion of training challenges and the work addressing them (hong2017generative; manisha2018; hitawala2018). However, despite the many proposed methods, how to ensure GAN training converges to an equilibrium remains an open question that requires further research.

2.4. Evaluation

Because there are multiple components to a GAN (and multiple uses for a GAN), multiple approaches and measures have been introduced to evaluate GAN performance. Most commonly, the primary GAN model of interest after training is the generator (though some domain adaptation methods discussed in Section 4 end up using part of the discriminator). While training a GAN, the goal of the generator is to fool the discriminator. To do this, it needs to learn to generate samples that are indistinguishable from the training data distribution. Because the distribution may be multi-modal (e.g., images may contain a variety of objects as well as various lighting conditions or locations), the generator must not only learn to generate samples that are indistinguishable from real samples in that mode but also learn to generate samples similarly from every other mode. Not only should a GAN generate realistic samples, but it should ideally also generate diverse samples (avoid mode collapse). Our evaluation should address both of these concerns.

2.4.1. Past Generative Model Evaluation

The de-facto standard for evaluating generative models used for density estimation (the probabilistic interpretation of GANs) is computing the log-likelihood (theis2016iclr). The dataset can be split into a training and testing set, and a model trained on the training data should achieve high log-likelihood on the testing data (the likelihood of the test data given the model should be high) if the generator learned the data distribution well. However, in general, computing the log-likelihood of a GAN may not be tractable (goodfellow2016survey). Recall that in the generative model taxonomy, GANs have an implicit, rather than explicit, density function and they learn the distribution through sampling instead of maximizing the log-likelihood or an approximation of it (see Figure 4). Thus, for evaluation we similarly cannot rely on the log-likelihood and must instead use samples from the generator. At the same time, we need to be careful which methods we use because some sample-based evaluation methods can be misleading (theis2016iclr).

When using the log-likelihood directly is not possible, a Parzen window estimate is commonly used (theis2016iclr; goodfellow2014nips; makhzani2015; nowozin2016nips). Using this method, a tractable model is created from generated samples, which then facilitates computing the log-likelihood. However, Theis et al. (theis2016iclr) show that this approach should be avoided since in high dimensions the computed log-likelihood may be far from the model's true log-likelihood. It may even rank models incorrectly if used for comparison purposes (theis2016iclr).

2.4.2. Realistic Samples

Visually inspecting samples from the generator is common in GANs since most are used for generating images (pmlr-v80-santurkar18a). To evaluate the realism of generated images, Salimans et al. (salimans2016nips) instructed humans (via Amazon Mechanical Turk) to mark which images they thought were generated versus real, from which the researchers could then compare different models.

To provide a more automated evaluation, Salimans et al. (salimans2016nips) propose an "Inception score" based on the output of running a generated image through an Inception image classification network (see Szegedy et al. (szegedy2016cvpr)). The intuition is that images containing realistic objects will have low entropy in the softmax output layer of the image classification network (a low-entropy conditional label distribution $p(y|x)$). The low entropy results from the probability of some objects being high rather than exhibiting a uniform probability distribution over all the possible objects that the classification network can recognize. Furthermore, if the GAN generator outputs images with large diversity, then the marginal distribution $p(y)$ will have high entropy. This is due to the fact that when integrating over all the possible noise inputs $z$, the generator outputs a large variety of realistic objects in images rather than only a few select realistic objects. Determining this score requires performing evaluation on a large number of samples (salimans2016nips).
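
Given the classifier's softmax outputs, the Inception score is a few lines of NumPy. A sketch, where probs is an assumed (N, num_classes) array of p(y|x) values for N generated images; the Dirichlet sample merely stands in for real classifier outputs.

import numpy as np

def inception_score(probs, eps=1e-12):
    # Score = exp(E_x[ KL(p(y|x) || p(y)) ]): high when individual
    # predictions are confident (low entropy) but the marginal p(y)
    # is spread out (high entropy, i.e. diverse samples).
    p_y = probs.mean(axis=0)                                  # marginal p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

rng = np.random.default_rng(0)
probs = rng.dirichlet(alpha=[0.1] * 10, size=1000)   # confident, varied predictions
print(inception_score(probs))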

One way for a generator to create exceptionally realistic images (images indistinguishable by a human from the training distribution) is to simply memorize the training data (pmlr-v80-santurkar18a), which is an extreme form of overfitting. After training, many researchers thus verify that the generator has not simply memorized the training set. To do this, nearest neighbors in the training data can be computed for the generated images (goodfellow2014nips) using measures such as Euclidean (makhzani2015) or cosine (donahue2017iclr) distance. If the nearest neighbor is visually almost identical to the training data instance, the generator can be assumed to be memorizing the training data. However, because even a change as simple as a shift of a few pixels can result in an incorrect nearest neighbor based on Euclidean distance (and thus an inaccurate conclusion that the model is not overfitting), Theis et al. (theis2016iclr) recommend using perceptual distance metrics (e.g. (johnson2016perceptual; gatys2016image; ledig2017photo)). Instead of pixel-level differences, perceptual distance metrics (or losses) are based on high-level feature differences related to image content (johnson2016perceptual). These high-level features are extracted from pre-trained image classification deep CNNs that have already learned hierarchical feature representations (johnson2016perceptual; gatys2016image). In addition to using perceptual distance metrics, Theis et al. also recommend not limiting the comparison of images to only one nearest neighbor. For example, in the birthday paradox test (arora2018), human evaluators compare the 20 closest pairs.

Radford et al. (radford2015) note that if memorization has occurred, then there will be sharp transitions when walking the learned distribution. Recall that the input to a generator is a noise vector $z$. This vector represents a space that can be "walked" by changing a single value in the vector while fixing the others and outputting the generated image, then selecting a different value and repeating. If the generator has not memorized the data, then relevant semantic changes should occur when walking the learned distribution. For example, when processing a room image dataset, furniture and other objects might be added or removed. Similarly, when processing a human face image dataset, the hair style, gender, or expression might change. This approach is similarly used by Berthelot et al. (berthelot2017).

2.4.3. Diverse Samples

Not only should the generator generate realistic images but it should also generate diverse images, matching the full training distribution. Arora et al. (arora2018) propose the birthday paradox test to determine how well a GAN has generalized to the entire training distribution. The test is inspired by the birthday paradox. Imagine there are $N$ people in a room. How large does $N$ need to be in order to have a high probability that 2 people in the room share the same birthday? To guarantee it, $N = 366$ by the pigeonhole principle. However, to have probability >50%, surprisingly $N$ only needs to be about 23 (assuming birthdays are i.i.d.) (arora2018), which is the "paradox".

This birthday paradox test uses human evaluation to estimate an upper bound on the support size of the learned generator distribution by checking for duplicates. If a distribution has support $N$, then there will likely be a duplicate in a batch of about $\sqrt{N}$ samples drawn from that distribution (arora2018). Here, "support" refers to the training examples used to specify the learned distribution. This is similar to how "support vectors" in SVMs represent the training examples with non-zero alphas, i.e., the training examples that are used to specify the decision boundary (cortes1995support; osuna1997improved). If only a few training examples are used to learn the generator distribution, then this indicates mode collapse, where the generator can no longer generate diverse images. In the case of generating human face images, such a generator might only generate smiling males with blonde hair. With a small support size, a GAN may only be learning a small portion of the distribution rather than the full distribution.

The birthday paradox test (arora2018) is performed by generating a sample of $s$ images. The 20 closest pairs in that sample are found using a distance metric such as Euclidean distance (which the authors found to work on the celebrity face dataset but which will likely create issues on other datasets for the previously noted reasons). Next, a human looks at the 20 pairs and identifies any near duplicates. This process is repeated multiple times, and if there is a high probability of near duplicates in each set of 20 pairs, then the support size of the generator distribution is likely upper bounded by approximately $s^2$. If $s^2$ is too small, then the GAN may be generating realistic outputs, but it suffers from mode collapse and will not be able to generate diverse images throughout the entire training distribution. The one failure mode of this test occurs when the generator creates a few images with high probability (e.g. outputs one image 10 percent of the time), which would result in duplicate images being found in a set of 20 even though the support size may be very large (arora2018).
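
The automated portion of the test (finding the closest pairs for a human to inspect) can be sketched as follows; the random array below merely stands in for $s$ generated images, and Euclidean distance is used under the caveats noted above.

import numpy as np

def closest_pairs(images, k=20):
    # Return index pairs of the k most similar images in the sample,
    # to be shown to a human for near-duplicate inspection.
    flat = images.reshape(len(images), -1).astype(np.float64)
    sq = (flat ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * flat @ flat.T  # squared distances
    rows, cols = np.triu_indices(len(images), k=1)        # each pair once
    order = np.argsort(d2[rows, cols])[:k]
    return list(zip(rows[order], cols[order]))

sample = np.random.rand(200, 32, 32, 3)      # stand-in for s generated images
print(closest_pairs(sample)[:3])             # indices of the closest pairs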

However, the birthday paradox test still requires human interaction. To minimize human involvement, Santurkar et al. (pmlr-v80-santurkar18a) propose a completely automated test of diversity through detecting covariate shift by classification. They train an unconditional GAN and a multi-class classifier on the dataset. Then they create a fake dataset using the GAN generator. They run the multi-class classifier on this fake dataset and compare the results with that on the real dataset. If the GAN learns the true distribution, then there should not be any covariate shift. However, if the GAN suffers from mode collapse, only generating part of the distribution, then there will be a covariate shift. The multi-class classifier can be chosen at varying difficulty or number of classes to determine the extent of the mode collapse.

If labeled images are available, Odena et al. (pmlr-v70-odena17a) propose another entirely automated diversity measure using multi-scale structural similarity (MS-SSIM). MS-SSIM is a human-perception-based metric yielding higher values for visually similar images. Odena et al. repurpose it for measuring generator diversity by randomly sampling 100 image pairs from each image class. In this case, lower MS-SSIM scores indicate lower perceptual similarity and thus higher diversity. They could detect mode collapse early during training by calculating the mean of these scores on generated images. After training, they could compare the mean scores of the generated images with those of the real training data as a measure of the final generator diversity. However, Karras et al. (karras2018progressive) found that MS-SSIM is able to detect large-scale mode collapse but that it misses smaller-scale losses of variation.

2.4.4. Both Realistic and Diverse Samples

An improvement over the Inception score is the Fréchet Inception Distance (FID) (heusel2017nips). FID computes a distance between the real and generated distributions, taking advantage of training data statistics (in this case, mean and covariance). This distance increases both with poorer-quality images and with lower diversity, such as when only a few distinct images are generated for a class (lucic2017). However, it cannot detect memorization of the training set (lucic2017). To compute the score, they use an Inception network and select one layer of the network (with the assumption that the layer's output is multidimensional Gaussian). All of the real data samples are input to the network, and the mean and covariance of the chosen layer's output are calculated. The calculation of mean and covariance is then repeated for 50,000 generated samples. Finally, the Fréchet distance is computed between these two Gaussians (heusel2017nips).
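
Given the two sets of statistics, the final step is the closed-form Fréchet distance between Gaussians. A sketch using SciPy for the matrix square root; in FID, the means and covariances would come from a chosen Inception layer's activations over the real dataset and over 50,000 generated samples.

import numpy as np
from scipy import linalg

def frechet_distance(mu1, cov1, mu2, cov2):
    # Fréchet distance between N(mu1, cov1) and N(mu2, cov2):
    #   ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov1 cov2)^(1/2))
    diff = mu1 - mu2
    covmean = linalg.sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real           # drop tiny numerical imaginary parts
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))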

A further improvement is the Class-Aware Fréchet Distance (CAFD). Liu et al. (liu2018) highlight some cases where FID does not correlate well with human judgments but their proposed CAFD evaluation does. Instead of using an Inception network, they train an encoder on the GAN training data to obtain a domain-specific feature representation. The network used for FID and the Inception score is trained on ImageNet and learns features mostly based on color and shape, which may not work well on other datasets. CAFD assumes a multivariate Gaussian mixture model rather than a multivariate Gaussian, which they propose will avoid losing class information on multi-class datasets. By comparing the results between the training and testing sets, CAFD can provide an indication of the level of overfitting (liu2018).

Yet another automated metric is an approximation of the log-likelihood. Wu et al. (wu2017iclr) propose employing annealed importance sampling to calculate log-likelihoods. This could be used as an automated method of detecting memorization as well as another human-evaluation-based method of checking for diversity. Using a held-out testing set of images, if the training and testing log-likelihoods remain approximately the same throughout training, then the model is considered not to be memorizing the training data (wu2017iclr). To check for diversity, on both the training and testing sets they apply annealed importance sampling to approximate the latent vector $z$ from an image input and then generate new outputs based on $z$. As in the other human evaluation tests, a human can then check for diversity. These authors validated the annealed importance sampling algorithm through bidirectional Monte Carlo. They assume a Gaussian observation model with fixed variance, but they also show that this assumption alone could not account for the detected differences between methods in their validation. Still, this likely means the approach will have difficulties similar to those of Parzen window estimates in high dimensions (lucic2017). In addition, log-likelihood and visually realistic samples are largely independent. Theis et al. (theis2016iclr) showed that a model could have poor log-likelihood and generate great samples, great log-likelihood and generate poor samples, or good log-likelihood and generate great samples. Grover et al. (grover2017flow) similarly found a disconnect even in a GAN with tractable log-likelihood. Thus, if the goal is visually realistic samples, log-likelihood (or an approximation of it) is not a satisfactory evaluation metric.

2.4.5. Ongoing Research

Various modifications to the above metrics and some less-commonly-used alternatives have been proposed, which are included in a survey by Borji (borji2018). Yet, there remains no consensus on how to evaluate GANs. This is a key challenge for continued work on GANs. In fact, when GAN variants were given a large computational budget, Lucic et al. (lucic2017) did not find the variants any better than the original NS-GAN by Goodfellow et al. (goodfellow2014nips) (when evaluated with FID on 4 popular datasets and approximate precision, recall, and F1 on a synthetic convex polygons dataset). In order to develop improvements to GANs, one must have a way of detecting improvement. Thus, GAN evaluation continues to be an ongoing area of research.

2.5. Extensions and Applications

As shown in Sections 2.3 and 2.4, there has been much work on improving GAN training and evaluation. GANs have also been extended for learned inference and conditioning on a class label or input image. There has also been work applying the GAN-inspired idea of an adversarial loss to a variety of problems, in addition to the popular use for generating synthetic images.

2.5.1. Conditional GANs

In the original formulation, GANs only accept as input a noise vector (an unconditional GAN). Conditional GANs, on the other hand, also accept as input other information such as a class label, image, or other data (goodfellow2014nips; gauthier2014conditional; mirza2014conditional; denton2015deep). In the case of image generation, this means that a particular type of image can be specified for generation, for example an image of a particular class within an image dataset such as "cat" rather than a random object from the dataset.

Examples of popular and general-purpose conditional GANs are pix2pix by Isola et al. (isola2017cvpr) and CycleGAN by Zhu et al. (zhu2017iccv). These GANs perform image-to-image translation. CycleGAN is based on pix2pix but does not require the training examples to be pairs in the two domains of interest, making CycleGAN entirely unsupervised. CycleGAN accomplishes this by integrating an additional loss function. In addition to the adversarial loss, CycleGAN uses a cycle consistency loss. This means that after translating an image from one domain (e.g. Google Maps data) to another (e.g. Google Maps satellite view), the new image can be translated back to reconstruct the original image. A method that is similar to CycleGAN but is also multimodal is multimodal unsupervised image-to-image translation (MUNIT) by Huang et al. (huang2018multimodal). By assuming a decomposition into style (domain-specific) and content (domain-invariant) codes, MUNIT can generate diverse outputs for a given input image (e.g. multiple possible satellite view output images).
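
The cycle consistency term is straightforward to express. A sketch, assuming two generator networks G_ab (domain A to B) and G_ba (B to A), an L1 reconstruction penalty, and an illustrative weight of 10; the linear stand-ins below are toy placeholders for real translation networks.

import torch
import torch.nn as nn

def cycle_consistency_loss(G_ab, G_ba, real_a, real_b, lam=10.0):
    # Translating A -> B -> A (and B -> A -> B) should reconstruct the
    # input; this term is added to the two adversarial losses.
    rec_a = G_ba(G_ab(real_a))   # A -> B -> A
    rec_b = G_ab(G_ba(real_b))   # B -> A -> B
    return lam * (torch.mean(torch.abs(rec_a - real_a))
                  + torch.mean(torch.abs(rec_b - real_b)))

G_ab, G_ba = nn.Linear(16, 16), nn.Linear(16, 16)
print(cycle_consistency_loss(G_ab, G_ba, torch.randn(4, 16), torch.randn(4, 16)))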

Some of the many uses of conditional GANs include: transferring style (e.g. making a photo look like a Van Gogh painting) (zhu2017iccv), image colorization (isola2017cvpr), generating satellite images from Google Maps data (or vice versa) (isola2017cvpr; zhu2017iccv) as shown in Figure 7, converting labels to photos (e.g. semantic segmentation output to a photo) (isola2017cvpr; zhu2017iccv), and domain adaptation (e.g. changing a synthetic vehicle driving image into one that looks realistic, as shown in Figure 8 and discussed in the later sections on pixel-level and combined feature- and pixel-level adaptation).

Figure 7. A satellite view image (right) generated from Google Maps data (left). Images courtesy Isola et al. (isola2017cvpr).
Figure 8. Synthetic vehicle driving image (left) adapted to look realistic (right). Images courtesy Hoffman et al. (hoffman2018icml).

2.5.2. Inverse Mapping / Learned Inference

In the original GAN formulation, the generator accepts a noise vector $z$ as input and outputs a vector that is indistinguishable from the training data. In this formulation there is no way to go from the generated data back to $z$. Going from the generated data back to the latent space is the inverse mapping problem (donahue2017iclr), otherwise referred to as learned approximate inference (goodfellow2014nips). This problem was explored simultaneously by Donahue et al. (donahue2017iclr), who developed bidirectional GANs (BiGAN), and Dumoulin et al. (dumoulin2017iclr), who developed adversarially learned inference (ALI).

2.5.3. Adversarial Loss

Adversarial losses, an idea stemming from GANs, have been applied in multiple settings outside of generative modeling. Wang et al. (wang2017cvpr) created an adversarial spatial dropout network to add occlusions to images to improve the accuracy of object detection algorithms. They also created an adversarial spatial transformer network to add deformations such as rotations to objects to again increase object detection accuracy. In other application domains, Pinto et al. (pinto2017icra) used adversarial agents to improve a robot's ability to grasp an object via self-supervised learning by employing both shaking and snatching adversaries. Gui et al. (guiteaching) used an adversarial loss to predict and demonstrate (i.e., have a robot copy) human motion. Rippel et al. (waveone2017; rippel2018using) used a reconstruction and adversarial loss with an autoencoder to learn higher-quality image compression at low bit rates. In the next two sections, we will discuss transfer learning and then focus on GAN-inspired adversarial domain adaptation applications, which offer additional use cases for adversarial losses.

3. Transfer Learning

Transfer learning is defined as the learning scenario where a model is trained on a source domain or task and evaluated on a different but related target domain or task, where either the tasks or domains (or both) differ (pan2010tkde; dredze2010multi; weiss2016). For instance, we may wish to learn a model on a handwritten digit dataset (e.g. MNIST (lecun1998mnist)) with the goal of using it to recognize house numbers (e.g. SVHN (netzer2011reading)). Or, we may wish to learn a model on a synthetic, cheap-to-generate traffic sign dataset (e.g. (moiseev2013evaluation)) with the goal of using it to classify real traffic signs (e.g. GTSRB (Stallkamp-IJCNN-2011)). In these examples, the source dataset used to train the model is related to but different from the target dataset used to test the model: both pairs consist of digits and traffic signs, respectively, but each dataset looks significantly different. When the source and target differ but are related, transfer learning can be applied to obtain higher accuracy on the target data.

3.1. Terminology

3.1.1. Categorizing Methods

In a transfer learning survey paper, Pan et al. (pan2010tkde) defined two terms to help classify various transfer learning techniques: “domain” and “task.” A domain consists of a feature space and a marginal probability distribution, i.e. the features of the data and the distribution of those features in the dataset. A task consists of a label space and an objective predictive function, i.e. the set of labels and the learned predictive function (learned from the training data). Thus, a transfer learning problem might be either transferring from a source domain to a different target domain or transferring from a source task to a different target task (or a combination of both) (pan2010tkde; dredze2010multi; weiss2016).

By this definition, a change in domain may result from either a change in feature space or a change in the marginal probability distribution. When classifying documents using text mining, a change in the feature space may result from a change in language (e.g. English to Spanish) whereas a change in the marginal probability distribution may result from a change in document topics (e.g. computer science to English literature) (pan2010tkde). Similarly, a change in task may result from either a change in the label space or a change in the objective predictive function. In the case of document classification, a change in the label space may result from a change in the number of classes (e.g. from a set of 10 topic labels to a set of 100 topic labels). Similarly, a change in the objective predictive function may result from a large change in the distribution of the labels (e.g. the source domain has 100 instances of class A and 10,000 of class B, whereas the target has 10,000 instances of A and 100 of B) (pan2010tkde).

To classify transfer learning algorithms based on whether the task or domain differs between source and target, Pan et al. (pan2010tkde) introduced three terms: “inductive”, “transductive”, and “unsupervised” transfer learning. In inductive transfer learning, the target and source tasks are different and the domains may or may not differ, and some labeled target data is required. In transductive transfer learning, the tasks remain the same while the domains are different, and both labeled source data and unlabeled target data are required. Finally, in unsupervised transfer learning, the tasks differ as in the inductive case, but there is no requirement of labeled data in either the source domain or the target domain.

3.1.2. Domain Adaptation

One popular type of transfer learning that has recently been explored as a novel use of GANs is domain adaptation, which will be the focus of our transfer learning survey. Domain adaptation is a type of transductive transfer learning. Here, the task remains the same, as does the domain feature space, but the domain marginal probability distributions differ (pan2010tkde; purushotham2017variational). Only part of the domain changes since the feature space is required to remain fixed between source and target.

In addition to the previous terminology, machine learning techniques are often categorized based on whether or not labeled training data is available. Supervised learning assumes labeled data is available, semi-supervised learning utilizes both labeled data and unlabeled data, and unsupervised learning utilizes only unlabeled data. However, domain adaptation assumes data comes from both a source domain and a target domain. Thus, prepending one of these three terms to “domain adaptation” is ambiguous since it may refer to labeled data being available in the source or target domains.

Authors apply these terms in various ways to domain adaptation (weiss2016) (e.g. (jiang2008domain; pan2010tkde; saito2017icml; daume2007acl)). In this paper, we will refer to "unsupervised" domain adaptation as the case having labeled source data and unlabeled target data, "semi-supervised" domain adaptation as the case additionally having some labeled target data, and "supervised" domain adaptation as the case having labeled source and target data (beijbom2012domain). In essence, the adjective refers to the target domain, which resolves the ambiguity but restricts the definition to domain adaptation settings where labeled source data is available. These definitions are commonly used in the methods surveyed in the next section (e.g. used by (saito2017icml; goodfellow2016survey; long2015icml; ganin2015icml)).

3.1.3. Related Problems

Multi-domain learning (dredze2010multi; joshi2012multi) and multi-task learning (caruana1997multitask) are related to transfer learning and domain adaptation. In contrast to transfer learning, these learning approaches have the goal of obtaining high performance on all specified domains (or tasks) rather than just on a single target domain (or task) (pan2010tkde; yang2015iclr). For example, training data is often assumed to be drawn in an independent and identically distributed (i.i.d.) fashion, which may not be the case (joshi2012multi). One such example is the task of developing a spam filter for users who disagree on what is considered spam. If all the users' data are combined, the training data will consist of multiple domains; while each individual domain may be i.i.d., the aggregated dataset may not be. If the data is instead split by user, there may be too little data to learn a model for each user. Multi-domain learning can take advantage of the entire dataset to learn individual user preferences (dredze2010multi; joshi2012multi). When working with multiple tasks, instead of training models separately for different tasks (e.g. one model for detecting shapes in an image and one model for detecting text in an image), multi-task learning will learn these separate but related tasks simultaneously so that they can mutually benefit from the training data of other tasks through a (partially) shared representation (caruana1997multitask). If there are both multiple tasks and domains, then these approaches can be combined into multi-domain multi-task learning, as is described by Yang et al. (yang2015iclr).
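As a concrete illustration of the shared-representation idea, here is a minimal sketch of "hard" parameter sharing, where two task heads train through one shared trunk. The architecture, sizes, and the shapes/text task pairing are assumptions for illustration, not taken from the cited papers:

import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, in_dim=64, hidden=128, n_shapes=5, n_chars=26):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.shape_head = nn.Linear(hidden, n_shapes)  # task 1: shape detection
        self.text_head = nn.Linear(hidden, n_chars)    # task 2: text recognition

    def forward(self, x):
        h = self.shared(x)          # both tasks read the same representation
        return self.shape_head(h), self.text_head(h)

net = MultiTaskNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

x = torch.randn(32, 64)             # placeholder batch
y_shape = torch.randint(0, 5, (32,))
y_text = torch.randint(0, 26, (32,))

logits_shape, logits_text = net(x)
loss = ce(logits_shape, y_shape) + ce(logits_text, y_text)  # joint objective
opt.zero_grad(); loss.backward(); opt.step()  # both losses update the shared trunk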

Another related problem is domain generalization, in which a model is trained on multiple source domains with labeled data and then tested on a separate target domain that was not seen at all during training (muandet2013domain). This contrasts with domain adaptation where target examples (generally unlabeled) are available during training. Adversarial approaches have been designed to address this situation. Examples include an adversarial method introduced by Zhao et al. (zhao2017icml) and an autoencoder approach by Ghifary et al. (ghifary2015iccv):

Zhao et al. (zhao2017icml) propose an adversarial approach for sleep-stage classification from radio frequency (RF) signals. They want to learn a model from a dataset collected from a number of people in selected environments that is capable of generalizing well to new people and/or new environments (e.g., sleeping in a different room). This is a domain generalization problem: given multiple source domains, the model needs to generalize to an unseen target domain (muandet2013domain). In the sleep-stage classification setting, each source domain is a person-and-environment pair. Their feature extractor consists of a convolutional neural network (CNN) for extracting stage-specific features from an RF spectrogram and a recurrent neural network (RNN) for extracting time-dependent features. To help the model generalize, their adversarial training method removes conditional dependencies between the source domain and the learned representation by conditioning the discriminator on the predicted label distribution. In other words, the feature extractor is learned such that the discriminator cannot determine which source domain the learned representation came from. The result was an effective sleep-stage classifier that generalizes across people and environments.
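A schematic sketch of this conditioning idea follows. This is an assumed simplification: Zhao et al.'s actual model operates on RF spectrograms with a CNN+RNN, replaced here by a small MLP, and the loss weight is arbitrary:

import torch
import torch.nn as nn
import torch.nn.functional as F

n_stages, n_domains = 4, 10
feature = nn.Sequential(nn.Linear(100, 64), nn.ReLU())   # feature extractor
classify = nn.Linear(64, n_stages)                       # sleep-stage classifier
# The discriminator sees the features concatenated with the predicted label
# distribution, so it cannot "win" by simply destroying label information.
discriminate = nn.Sequential(nn.Linear(64 + n_stages, 64), nn.ReLU(),
                             nn.Linear(64, n_domains))   # which source domain?

x = torch.randn(32, 100)                 # placeholder batch of inputs
y = torch.randint(0, n_stages, (32,))    # sleep-stage labels
d = torch.randint(0, n_domains, (32,))   # source-domain (person/environment) labels

z = feature(x)
logits = classify(z)
p = F.softmax(logits, dim=1)

# Discriminator step: predict the source domain from (features, label distribution).
d_loss = F.cross_entropy(discriminate(torch.cat([z.detach(), p.detach()], 1)), d)

# Feature/classifier step: classify stages well while *fooling* the discriminator,
# which removes domain-specific cues from the learned representation.
task_loss = F.cross_entropy(logits, y)
adv_loss = -F.cross_entropy(discriminate(torch.cat([z, p], 1)), d)
total = task_loss + 0.1 * adv_loss       # 0.1: assumed trade-off weight
# (In training, d_loss and total are minimized in alternating steps.)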

Ghifary et al. (ghifary2015iccv) propose extending a denoising autoencoder to improve the generalizability of object recognition. Denoising autoencoders try to reconstruct an original image from a corrupted or noisy version fed into the network. Ghifary et al. instead treat a different view of the data (e.g. a rotation, a change in size, or a variation in lighting) as the corruption or noise: they feed in an input image of an object from one domain and try to reconstruct the corresponding views of the object in the other domains. Trained in this way, the autoencoder learns features that are robust to variations across domains. After learning the representation, the feature extractor from the trained autoencoder can be used in a classification task such as object recognition.
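A minimal sketch of this multi-view reconstruction objective follows; the layer sizes and the use of fully connected layers are assumptions, not Ghifary et al.'s exact architecture:

import torch
import torch.nn as nn

n_domains, in_dim, code_dim = 3, 784, 128
encoder = nn.Sequential(nn.Linear(in_dim, code_dim), nn.ReLU())  # shared encoder
decoders = nn.ModuleList([nn.Linear(code_dim, in_dim)            # one decoder per domain
                          for _ in range(n_domains)])
mse = nn.MSELoss()

x = torch.rand(16, in_dim)                                   # input view from one domain
views = [torch.rand(16, in_dim) for _ in range(n_domains)]   # corresponding views (placeholders)

h = encoder(x)
loss = sum(mse(dec(h), v) for dec, v in zip(decoders, views))  # reconstruct every domain's view
loss.backward()  # gradients from all views flow into the shared encoder, making it view-robust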

3.2. Theory

Now that we have introduced various types of transfer learning, we address the question of when applying them is likely to be beneficial. Ben-David et al. (ben2010ml) developed a theory that answers two questions: (1) when will a classifier trained on the source data perform well on the target data, and (2) given a small number of labeled target examples, how can they best be used during training to minimize target test error?

Answering the first question requires labeled source data and unlabeled target data (the unsupervised case); answering the second additionally requires some labeled target data (the semi-supervised case). The answers to both questions hold not only for domain changes but also for task changes (e.g. the labels can differ). These authors also address the case of multiple source domains, as do Mansour et al. (mansour2009nips). In this paper, we will focus on the case of a single source and a single target (as is common in the methods we survey).

3.2.1. When to use transfer learning

With respect to the first question posed in the previous section, the proposed approach is to bound the target error using the source error and the divergence between the source and target domains (ben2010ml). The empirical source error is easy to obtain by training and then testing a classifier. However, the divergence between the domains cannot be directly measured with standard metrics such as the Kullback-Leibler divergence because we only have a finite number of samples from each domain and do not assume any particular distribution. An alternative is to measure a classifier-induced divergence, the $\mathcal{H} \Delta \mathcal{H}$-divergence, whose estimates from finite samples converge to the true divergence. It can be estimated by measuring the error of a classifier trained to discriminate between the unlabeled source and target examples, though finding the theoretically required upper bound over all classifiers in the hypothesis class is often intractable. Using the source error (term 1 on the right-hand side of Equation 2), the upper bound on the true $\mathcal{H} \Delta \mathcal{H}$-divergence (terms 2 and 3), and the error of the best joint hypothesis for the source and target (term 4), the target error can be bounded as shown in Equation 2, which holds with probability at least $1 - \delta$ for $\delta \in (0, 1)$:

$\epsilon_T(h) \leq \epsilon_S(h) + \frac{1}{2} \hat{d}_{\mathcal{H} \Delta \mathcal{H}}(\mathcal{U}_S, \mathcal{U}_T) + 4 \sqrt{\frac{2d \log(2m') + \log(2/\delta)}{m'}} + \lambda$   (2)

Here, $\epsilon_T(h)$ is the target error, $\epsilon_S(h)$ is the source error, $\hat{d}_{\mathcal{H} \Delta \mathcal{H}}(\mathcal{U}_S, \mathcal{U}_T)$ is the divergence estimate computed from the unlabeled source and target samples $\mathcal{U}_S$ and $\mathcal{U}_T$, $d$ is the VC dimension of the hypothesis class $\mathcal{H}$, $m'$ is the number of unlabeled source and target examples (assumed to be equal), and $\lambda$ is the error of the ideal joint hypothesis on both domains.

To summarize, if $\lambda$ is large, then there is no hypothesis learned from the source domain that will work well on the target domain. However, as is more common in applications of domain adaptation, if $\lambda$ is small, then the bound depends on the source error and the $\mathcal{H} \Delta \mathcal{H}$-divergence (ben2010ml).
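In practice, this classifier-based divergence estimate is often computed as the "proxy A-distance": train a domain classifier on unlabeled source-versus-target data and convert its held-out error err into the score 2(1 - 2 err). A minimal sketch with synthetic placeholder data:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def proxy_a_distance(Xs, Xt):
    # Label source examples 0 and target examples 1, then measure how well a
    # classifier can tell them apart; low error means high divergence.
    X = np.vstack([Xs, Xt])
    y = np.array([0] * len(Xs) + [1] * len(Xt))
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
    err = 1.0 - LinearSVC(dual=False).fit(X_tr, y_tr).score(X_te, y_te)
    return 2.0 * (1.0 - 2.0 * err)

rng = np.random.RandomState(0)
print(proxy_a_distance(rng.randn(500, 10), rng.randn(500, 10)))        # near 0: same distribution
print(proxy_a_distance(rng.randn(500, 10), rng.randn(500, 10) + 2.0))  # near 2: easily separated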

3.2.2. Given some target labels

For the second question, a convex combination of the source and target errors is considered, called the $\alpha$-error: $\hat{\epsilon}_\alpha(h) = \alpha \hat{\epsilon}_T(h) + (1 - \alpha) \hat{\epsilon}_S(h)$ for $\alpha \in [0, 1]$. A bound can be placed on the target error of the hypothesis minimizing the empirical $\alpha$-error. The optimal choice of $\alpha$ depends on the empirical $\alpha$-error, the divergence between source and target, and the numbers of labeled source and target examples. Experimentation can be used to empirically determine values of $\alpha$ that perform well. These authors demonstrate the process on sentiment classification (ben2010ml), illustrating that the optimal $\alpha$ takes non-trivial values.

The bound is given in Equation 3. If $S$ is a labeled sample of size $m$, with $(1 - \beta)m$ points drawn from the source distribution and $\beta m$ points drawn from the target distribution, then with probability at least $1 - \delta$ for $\delta \in (0, 1)$:

$\epsilon_T(\hat{h}) \leq \epsilon_T(h_T^*) + 4 \sqrt{\frac{\alpha^2}{\beta} + \frac{(1-\alpha)^2}{1-\beta}} \sqrt{\frac{2d \log(2(m+1)) + 2 \log(8/\delta)}{m}} + 2(1 - \alpha) \left( \frac{1}{2} \hat{d}_{\mathcal{H} \Delta \mathcal{H}}(\mathcal{U}_S, \mathcal{U}_T) + 4 \sqrt{\frac{2d \log(2m') + \log(8/\delta)}{m'}} + \lambda \right)$   (3)

Here, $\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{\epsilon}_\alpha(h)$ is the empirical minimizer of the $\alpha$-error on $S$ and $h_T^* = \arg\min_{h \in \mathcal{H}} \epsilon_T(h)$ is the target error minimizer.

Then, the optimal $\alpha$ is:

$\alpha^*(m_T, m_S; D) = \begin{cases} 1 & \text{if } m_T \geq D^2 \\ \min(1, \nu) & \text{if } m_T \leq D^2 \end{cases}$   (4)

Here, $m_S = (1 - \beta)m$ is the number of source examples, $m_T = \beta m$ is the number of target examples, $D = \sqrt{d}/A$, and

$\nu = \frac{m_T}{m_T + m_S} \left( 1 + \frac{m_S}{\sqrt{D^2 (m_S + m_T) - m_S m_T}} \right)$   (5)

$A = \frac{1}{2} \hat{d}_{\mathcal{H} \Delta \mathcal{H}}(\mathcal{U}_S, \mathcal{U}_T) + 4 \sqrt{\frac{2d \log(2m') + \log(8/\delta)}{m'}} + \lambda$   (6)

$\lambda = \min_{h \in \mathcal{H}} \left[ \epsilon_S(h) + \epsilon_T(h) \right]$   (7)

In summary, when only source or only target data is available, that data should be used ($\alpha^* = 0$ or $\alpha^* = 1$, respectively). If the source and target are the same, then $\alpha^* = \beta$, which implies a uniform weighting of the examples. Given enough target data ($m_T \geq D^2$), source data should not be used at all because it might increase the test-time error. Likewise, with too little source data relative to target data, using the source data may not be worthwhile either, again giving $\alpha^* = 1$ (ben2010ml).
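To make the optimal mixing value concrete, below is a direct transcription of Equation 4 as reconstructed above; the example values of m_S, m_T, and D are arbitrary:

import math

def optimal_alpha(m_s, m_t, D):
    # alpha* from Equation 4: weight on the target error in the alpha-error,
    # given m_s source examples, m_t target examples, and D = sqrt(d) / A.
    if m_t >= D ** 2:            # enough target data: ignore the source entirely
        return 1.0
    nu = (m_t / (m_t + m_s)) * (
        1.0 + m_s / math.sqrt(D ** 2 * (m_s + m_t) - m_s * m_t))
    return min(1.0, nu)

print(optimal_alpha(m_s=10000, m_t=5, D=10))    # scarce target data: small alpha, lean on the source
print(optimal_alpha(m_s=10000, m_t=500, D=10))  # m_t >= D^2: alpha = 1, source unused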

4. Domain Adaptation Methods

We first consider non-adversarial domain adaptation methods. Numerous papers have been written on this topic, including other survey papers (margolis2011literature; beijbom2012domain; patel2015ieee), so we focus primarily on neural-network-based approaches, which are most comparable to the GAN-based approaches. We then look at the recent adversarial domain adaptation methods inspired by GANs. Tables 1 and 2 summarize the neural network based methods we discuss (both non-adversarial and adversarial). Additionally, Tables 3 through 5 summarize the results of evaluating many of these methods on datasets used in the domains of image processing and sentiment analysis.

4.1. Non-Adversarial Domain Adaptation

4.1.1. Without neural networks

Daumé (daume2007acl) lists six "obvious" domain adaptation methods that perform fairly well in the semi-supervised case. A "source only" model is trained only on the source data. Similarly, a "target only" model is trained only on the target data, and an "all" model is trained on the union of both. In contrast, a "weighted" model uses both data sources but weights (chosen by cross validation) source and target data differently to even out the imbalance. A "prediction" model uses the output of the "source only" model as additional features for learning on the target data. Finally, a "linearly interpolated" model linearly interpolates between the "source only" and "target only" models.

Daumé (daume2007acl) also describes two methods that outperform the above six baselines. First, the "prior" model uses the "source only" output as a prior for the weights of another model trained on the target data. Rather than regularizing the target model weights with an L2 norm, which prefers smaller-magnitude weights, the model instead prefers weights similar to those of the source model unless the target data strongly suggest otherwise. For example, if one of the source weights is 5, then the corresponding target weight would be regularized toward 5 rather than toward 0 as it would be with an L2 norm. Second, another option is to use three models for different types of information: source-specific, target-specific, and general. Each source example is either source-specific or general, and each target example is either target-specific or general. A previous paper (daume2006jail) proposed an EM algorithm for training such a model, but it was 10-15 times slower than training the "prior" model.
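A minimal sketch of the "prior" regularizer, assuming a simple linear least-squares model and an arbitrary regularization strength (not Daumé's exact formulation):

import torch

w_src = torch.tensor([5.0, -1.0, 0.5])    # weights learned by the "source only" model
w = w_src.clone().requires_grad_(True)    # target model weights, initialized at the prior
lam = 0.1                                 # regularization strength (assumed)

X = torch.randn(8, 3)                     # placeholder labeled target data
y = torch.randn(8)

task_loss = torch.mean((X @ w - y) ** 2)
# Penalize distance from the source weights instead of distance from zero, so a
# source weight of 5 pulls the corresponding target weight toward 5, not 0.
loss = task_loss + lam * torch.sum((w - w_src) ** 2)
loss.backward()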

Finally, Daumé proposes a method that is "frustratingly easy" and trains much faster. Instead of each example being one of the three cases, now each feature of the data is either source-specific, target-specific, or general. The data is augmented so that the source data has general and source-specific features (the target-specific features are zeros) and the target data has general and target-specific features (the source-specific features are zeros). The input data (either source or target) is duplicated into both the specific and the general features of each example. In other words, a source example $x$ becomes $\langle x, x, \mathbf{0} \rangle$ and a target example $x$ becomes $\langle x, \mathbf{0}, x \rangle$, where $\mathbf{0}$ is a zero vector the length of each example's original feature vector. A kernelized version is also described, in which the augmented kernel evaluates to $2K(x, x')$ when the two examples come from the same domain and to $K(x, x')$ when they come from different domains. The method can be extended to process multiple source domains and has been evaluated on sequence labeling tasks.
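The augmentation itself is only a few lines; a sketch follows (the column ordering general / source-specific / target-specific is one convention; any fixed ordering works):

import numpy as np

def augment(X, domain):
    # Each row becomes [general copy, source-specific copy, target-specific copy].
    zeros = np.zeros_like(X)
    if domain == "source":
        return np.hstack([X, X, zeros])   # target-specific block zeroed
    return np.hstack([X, zeros, X])       # source-specific block zeroed

Xs = np.ones((3, 4))                      # 3 source examples, 4 original features
Xt = np.full((2, 4), 2.0)                 # 2 target examples
X_all = np.vstack([augment(Xs, "source"), augment(Xt, "target")])
print(X_all.shape)                        # (5, 12): the feature space is tripled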


Table 1 compares the methods along the following columns (see Table 2 for definitions): Adaptation (feature, pixel), Loss Functions (distance, difference, cycle, semantic, pixel similarity, task), Adversarial Loss (feature, pixel), Gen. (uses a generator), and Shared (encoder weight sharing). The methods compared are:

- CyCADA (hoffman2018icml)
- Rozantsev et al. (rozantsev2018ieee): MMD distance loss; no sharing of weights, but the weights are regularized to be similar
- PixelDA (bousmalis2017cvpr)
- VADDA (purushotham2017variational): uses a variational RNN (NIPS2015_5653) rather than a CNN because it is applied to multimodal sequential (time series) data
- Saito et al. (saito2017icml): low weight sharing
- SimGAN (shrivastava2017cvpr): weight sharing not applicable, since it maps to the target domain and thus only has a feature extractor for the target (part of the classifier)
- ADDA (tzeng2017cvpr)
- CycleGAN (zhu2017iccv): weight sharing unspecified; originally not applied to domain adaptation, but Hoffman et al. (hoffman2018icml) tried it
- DSN (bousmalis2016nips): tried MMD but found an adversarial loss performed better; some weight sharing
- DRCN (ghifary2016)
- CoGAN (liu2016nips): some weight sharing
- Deep CORAL (sun2016): CORAL distance loss
- DANN (ganin2015icml; ganin2016jmlr)
- DAN (long2015icml): MK-MMD distance loss; low weight sharing
- Tzeng et al. (tzeng2015iccv): semi-supervised for some classes, i.e. requires some labeled target data for some of the classes


Table 1. Comparison of different neural network based domain adaptation methods.
Explanation of Comparison Terms
Adaptation – either feature-level or pixel-level (raw input) adaptation (or both)

Loss Functions
Distance – trying to align distributions through minimizing a distance function
Diff. – enforce different features between two networks, e.g. learning separate private and shared features
Cycle – cycle consistency (reconstruction) loss
Semantic – semantic consistency loss (same label before and after pixel-level translation)
Pix. Sim. – pixel-level similarity loss
Task – task loss, e.g. for classification, outputting the correct ground-truth source label, or for semantic segmentation, labeling each pixel with the correct ground-truth source label

Adversarial Loss
adversarial loss performed on either the feature-level output or the pixel-level output (or both); note the difference between an "adversarial loss" and the "loss functions" above: the loss functions are generally simple fixed equations, whereas an adversarial loss is a learned neural-network discriminator that the rest of the model is trained to fool until the discriminator cannot distinguish between two (or more) domains, i.e. a learned loss function (where learning is more than a hyperparameter search)

Gen.
uses a generator during training

Shared
encoders share weights, i.e. sharing weights between the source and target feature extractors
Table 2. Terms used for comparing neural network based domain adaptations in Table 1.
Dataset pairs (source → target): MN→US and US→MN (MNIST and USPS), SV→MN and MN→SV (SVHN and MNIST), MN→MN-M (MNIST to MNIST-M), Syn→SV (synthetic digits to SVHN), and Syn→GTSRB (synthetic signs to GTSRB).

Target only (i.e., if we had the target labels): MN→US 96.3±0.1 (hoffman2018icml), 96.5 (bousmalis2017cvpr); US→MN 99.2±0.1 (hoffman2018icml); SV→MN 99.2±0.1 (hoffman2018icml), 99.5 (bousmalis2016nips), 99.51 (ganin2015icml); MN→MN-M 96.4 (bousmalis2017cvpr), 98.7 (bousmalis2016nips), 98.91 (ganin2015icml); Syn→SV 92.44 (ganin2015icml), 92.4 (bousmalis2016nips); Syn→GTSRB 99.87 (ganin2015icml), 99.8 (bousmalis2016nips)
CyCADA† (hoffman2018icml): MN→US 95.6±0.2; US→MN 96.5±0.1; SV→MN 90.4±0.4
Rozantsev et al. (rozantsev2018ieee): MN→US 60.7; US→MN 67.3
PixelDA† (bousmalis2017cvpr): MN→US 95.9; MN→MN-M 98.2
Saito et al. (saito2017icml): SV→MN 85.0; MN→SV 52.8; MN→MN-M 94.0; Syn→SV 92.9; Syn→GTSRB 96.2
ADDA† (tzeng2017cvpr): MN→US 89.4±0.2; US→MN 90.1±0.8; SV→MN 76.0±1.8
DSN† (bousmalis2016nips) (results use the adversarial loss, which performed better than MMD; some target labels were also used as validation data for hyperparameter tuning): MN→US 91.3 (bousmalis2017cvpr); SV→MN 82.7; MN→MN-M 83.2; Syn→SV 91.2; Syn→GTSRB 93.1
DRCN (ghifary2016): MN→US 91.80±0.09; US→MN 73.67±0.04; SV→MN 81.97±0.16; MN→SV 40.05±0.07
CoGAN† (liu2016nips): MN→US 91.2±0.8; US→MN 89.1±0.8; MN→MN-M 62.0 (bousmalis2017cvpr)
DANN† (ganin2015icml; ganin2016jmlr): MN→US 85.1 (bousmalis2017cvpr); SV→MN 71.07, 70.7 (bousmalis2016nips), 71.1 (saito2017icml), 73.6 (hoffman2018icml); MN→SV 35.7 (saito2017icml); MN→MN-M 81.49, 77.4 (bousmalis2016nips), 81.5 (saito2017icml); Syn→SV 90.48, 90.3 (bousmalis2016nips; saito2017icml); Syn→GTSRB 88.66, 88.7 (saito2017icml), 92.9 (bousmalis2016nips)
DAN (long2015icml): MN→US 81.1 (bousmalis2017cvpr); SV→MN 71.1 (bousmalis2016nips); MN→MN-M 76.9 (bousmalis2016nips); Syn→SV 88.0 (bousmalis2016nips); Syn→GTSRB 91.1 (bousmalis2016nips)
Source only (i.e., no adaptation): MN→US 78.9 (bousmalis2017cvpr), 82.2±0.8 (hoffman2018icml); US→MN 69.6±3.8 (hoffman2018icml); SV→MN 59.19 (ganin2015icml), 59.2 (bousmalis2016nips), 67.1±0.6 (hoffman2018icml); MN→MN-M 56.6 (bousmalis2016nips), 57.49 (ganin2015icml), 63.6 (bousmalis2017cvpr); Syn→SV 86.65 (ganin2015icml), 86.7 (bousmalis2016nips); Syn→GTSRB 74.00 (ganin2015icml), 85.1 (bousmalis2016nips)
Table 3. Classification accuracy (source → target, mean ± std %) of different neural network based domain adaptation methods on various computer vision datasets (only including those used in the surveyed papers). Adversarial approaches are marked with †.
Office dataset transfers between Amazon (A), DSLR (D), and Webcam (W): A→W, D→W, W→D, W→A, A→D, D→A.

Rozantsev et al. (rozantsev2018ieee): A→W 76.0; D→W 96.7; W→D 99.6
ADDA† (tzeng2017cvpr): A→W 75.1; D→W 97.0; W→D 99.6
DRCN (ghifary2016): A→W 68.7±0.3; D→W 96.4±0.3; W→D 99.0±0.2; W→A 54.9±0.5; A→D 66.8±0.5; D→A 56.0±0.5
Deep CORAL (sun2016): A→W 66.4±0.4; D→W 95.7±0.3; W→D 99.2±0.1; W→A 51.5±0.3; A→D 66.8±0.6; D→A 52.8±0.2
DANN† (ganin2015icml; ganin2016jmlr): A→W 67.3±1.7, 72.6±0.3 (ghifary2016), 73.0 (rozantsev2018ieee; tzeng2017cvpr); D→W 94.0±0.8, 96.4±0.1 (ghifary2016), 96.4 (rozantsev2018ieee; tzeng2017cvpr); W→D 93.7±1.0, 99.2±0.3 (ghifary2016), 99.2 (rozantsev2018ieee; tzeng2017cvpr); W→A 52.7±0.2 (ghifary2016); A→D 67.1±0.3 (ghifary2016); D→A 54.5±0.4 (ghifary2016)
DAN (long2015icml): A→W 68.5±0.4, 63.8±0.4 (sun2016), 64.5 (rozantsev2018ieee), 68.5 (tzeng2017cvpr); D→W 96.0±0.3, 94.6±0.5 (sun2016), 95.2 (rozantsev2018ieee), 96.0 (tzeng2017cvpr); W→D 99.0±0.2, 98.6 (rozantsev2018ieee), 98.8±0.6 (sun2016), 99.0 (tzeng2017cvpr); W→A 53.1±0.3, 51.9±0.5 (sun2016); A→D 67.0±0.4, 65.8±0.4 (sun2016); D→A 54.0±0.4, 52.8±0.4 (sun2016)
Tzeng et al.† (tzeng2015iccv) (semi-supervised for some classes, but evaluated on 16 hold-out categories for which the labels were not seen during training): A→W 59.3±0.6; D→W 90.0±0.2; W→D 97.5±0.1; W→A 40.5±0.2; A→D 68.0±0.5; D→A 43.1±0.2
Source only (i.e., no adaptation): A→W 62.6 (tzeng2017cvpr); D→W 96.1 (tzeng2017cvpr); W→D 98.6 (tzeng2017cvpr) (these Office numbers use a ResNet-50 network, though Tzeng et al. (tzeng2017cvpr) also provided AlexNet results)

Table 4. Classification accuracy (source → target, mean ± std %) of different neural network based domain adaptation methods on the Office computer vision dataset. Adversarial approaches are marked with †.

Source → Target | DANN with mSDA features† (ganin2016jmlr) | DANN† (ganin2016jmlr) | CORAL (sun2016aaai) | No adaptation (sun2016aaai)

Books → DVD: 82.9 | 78.4 | – | –
Books → Electronics: 80.4 | 73.3 | 76.3 | 74.7
Books → Kitchen: 84.3 | 77.9 | – | –
DVD → Books: 82.5 | 72.3 | 78.3 | 76.9
DVD → Electronics: 80.9 | 75.4 | – | –
DVD → Kitchen: 84.9 | 78.3 | – | –
Electronics → Books: 77.4 | 71.3 | – | –
Electronics → DVD: 78.1 | 73.8 | – | –
Electronics → Kitchen: 88.1 | 85.4 | 83.6 | 82.8
Kitchen → Books: 71.8 | 70.9 | – | –
Kitchen → DVD: 78.9 | 74.0 | 73.9 | 72.2
Kitchen → Electronics: 85.6 | 84.3 | – | –

Notes: the DANN with mSDA results use 30,000-dimensional feature vectors from marginalized stacked denoising autoencoders (mSDA) by Chen et al. (chen2012marginalized), an unsupervised method of learning a feature representation from the training data; the DANN results use 5000-dimensional unigram and bigram feature vectors; the CORAL results use bag-of-words feature vectors including only the top 400 words (the authors suggest using deep text features in future work); the no-adaptation results use bag-of-words feature vectors.

Table 5. Classification accuracy comparison for domain adaptation methods for sentiment analysis (positive or negative review) on the Amazon review dataset (blitzer2007biographies, http://www.cs.jhu.edu/~mdredze/datasets/sentiment/). Adversarial approaches are marked with †.
Table 6. List and description of computer vision datasets from Tables 3 and 4