
Large Scale Adversarial Representation Learning

by   Jeff Donahue, et al.

Adversarially trained generative models (GANs) have recently achieved compelling image synthesis results. But despite early successes in using GANs for unsupervised representation learning, they have since been superseded by approaches based on self-supervision. In this work we show that progress in image generation quality translates to substantially improved representation learning performance. Our approach, BigBiGAN, builds upon the state-of-the-art BigGAN model, extending it to representation learning by adding an encoder and modifying the discriminator. We extensively evaluate the representation learning and generation capabilities of these BigBiGAN models, demonstrating that these generation-based models achieve the state of the art in unsupervised representation learning on ImageNet, as well as in unconditional image generation.




1 Introduction

In recent years we have seen rapid progress in generative models of visual data. While these models were previously confined to domains with single or few modes, simple structure, and low resolution, with advances in both modeling and hardware they have since gained the ability to convincingly generate complex, multimodal, high resolution image distributions biggan ; stylegan ; glow .

Intuitively, the ability to generate data in a particular domain necessitates a high-level understanding of the semantics of said domain. This idea has long-standing appeal as raw data is both cheap – readily available in virtually infinite supply from sources like the Internet – and rich, with images comprising far more information than the class labels that typical discriminative machine learning models are trained to predict from them. Yet, while the progress in generative models has been undeniable, nagging questions persist: what semantics have these models learned, and how can they be leveraged for representation learning?

The dream of generation as a means of true understanding from raw data alone has hardly been realized. Instead, the most successful approaches for unsupervised learning leverage techniques adopted from the field of supervised learning, a class of methods known as self-supervised learning carl ; splitbrain ; cpc ; rotation . These approaches typically involve changing or holding back certain aspects of the data in some way, and training a model to predict or generate aspects of the missing information. For example, colorful ; splitbrain proposed colorization as a means of unsupervised learning, where a model is given a subset of the color channels in an input image, and trained to predict the missing channels.

Generative models as a means of unsupervised learning offer an appealing alternative to self-supervised tasks in that they are trained to model the full data distribution without requiring any modification of the original data. One class of generative models that has been applied to representation learning is generative adversarial networks (GANs) gan . The generator in the GAN framework is a feed-forward mapping from randomly sampled latent variables (also called “noise”) to generated data, with learning signal provided by a discriminator trained to distinguish between real and generated data samples, guiding the generator’s outputs to follow the data distribution. The adversarially learned inference (ALI) ali or bidirectional GAN (BiGAN) bigan approaches were proposed as extensions to the GAN framework that augment the standard GAN with an encoder module mapping real data to latents, the inverse of the mapping learned by the generator.

In the limit of an optimal discriminator, bigan showed that a deterministic BiGAN behaves like an autoencoder minimizing $\ell_0$ reconstruction costs; however, the shape of the reconstruction error surface is dictated by a parametric discriminator, as opposed to simple pixel-level measures like the $\ell_2$ error. Since the discriminator is usually a powerful neural network, the hope is that it will induce an error surface which emphasizes “semantic” errors in reconstructions, rather than low-level details.

In bigan it was demonstrated that the encoder learned via the BiGAN or ALI framework is an effective means of visual representation learning on ImageNet for downstream tasks. However, it used a DCGAN dcgan style generator, incapable of producing high-quality images on this dataset, so the semantics the encoder could model were in turn quite limited. In this work we revisit this approach using BigGAN biggan as the generator, a modern model that appears capable of capturing many of the modes and much of the structure present in ImageNet images. Our contributions are as follows:

  • We show that BigBiGAN (BiGAN with BigGAN generator) matches the state of the art in unsupervised representation learning on ImageNet.

  • We propose a more stable version of the joint discriminator for BigBiGAN.

  • We perform a thorough empirical analysis and ablation study of model design choices.

  • We show that the representation learning objective also helps unconditional image generation, and demonstrate state-of-the-art results in unconditional ImageNet generation.

2 BigBiGAN








Figure 1: The structure of the BigBiGAN framework. The joint discriminator $\mathcal{D}$ is used to compute the loss $\ell$. Its inputs are data-latent pairs, either $(x \sim P_x, \hat{z} \sim \mathcal{E}(x))$, sampled from the data distribution $P_x$ and encoder $\mathcal{E}$ outputs, or $(\hat{x} \sim \mathcal{G}(z), z \sim P_z)$, sampled from the generator $\mathcal{G}$ outputs and the latent distribution $P_z$. The loss $\ell$ includes the unary data term $s_x$ and the unary latent term $s_z$, as well as the joint term $s_{xz}$ which ties the data and latent distributions.

The BiGAN bigan or ALI ali approaches were proposed as extensions of the GAN gan framework which enable the learning of an encoder that can be employed as an inference model ali or feature representation bigan . Given a distribution $P_x$ of data $x$ (e.g., images), and a distribution $P_z$ of latents $z$ (usually a simple continuous distribution like an isotropic Gaussian $\mathcal{N}(0, I)$), the generator $\mathcal{G}$ models a conditional distribution $P(x|z)$ of data $x$ given latent inputs $z$ sampled from the latent prior $P_z$, as in the standard GAN generator gan . The encoder $\mathcal{E}$ models the inverse conditional distribution $P(z|x)$, predicting latents $z$ given data $x$ sampled from the data distribution $P_x$.

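Besides the addition of $\mathcal{E}$, the other modification to the GAN in the BiGAN framework is a joint discriminator $\mathcal{D}$, which takes as input data-latent pairs $(x, z)$ (rather than just data $x$ as in a standard GAN), and learns to discriminate between pairs from the data distribution and encoder, versus the generator and latent distribution. Concretely, its inputs are pairs $(x \sim P_x, \hat{z} \sim \mathcal{E}(x))$ and $(\hat{x} \sim \mathcal{G}(z), z \sim P_z)$, and the goal of $\mathcal{G}$ and $\mathcal{E}$ is to “fool” the discriminator by making the two joint distributions $P_{x\mathcal{E}}$ and $P_{\mathcal{G}z}$ from which these pairs are sampled indistinguishable. The adversarial minimax objective in bigan ; ali , analogous to that of the GAN framework gan , was defined as follows:

$$\min_{\mathcal{G}\mathcal{E}} \max_{\mathcal{D}} \left\{ \mathbb{E}_{x \sim P_x,\, z \sim \mathcal{E}_\Phi(x)} \left[ \log \sigma(\mathcal{D}(x, z)) \right] + \mathbb{E}_{z \sim P_z,\, x \sim \mathcal{G}_\Phi(z)} \left[ \log \left( 1 - \sigma(\mathcal{D}(x, z)) \right) \right] \right\}$$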

Under this objective, bigan ; ali showed that with an optimal $\mathcal{D}$, $\mathcal{G}$ and $\mathcal{E}$ minimize the Jensen-Shannon divergence between the joint distributions $P_{x\mathcal{E}}$ and $P_{\mathcal{G}z}$, and therefore at the global optimum, the two joint distributions match, analogous to the results from standard GANs gan . Furthermore, bigan showed that in the case where $\mathcal{E}$ and $\mathcal{G}$ are deterministic functions (i.e., the learned conditional distributions $P(x|z)$ and $P(z|x)$ are Dirac $\delta$ functions), these two functions are inverses at the global optimum: e.g., $\forall x \in \mathrm{supp}(P_x): x = \mathcal{G}(\mathcal{E}(x))$, with the optimal joint discriminator effectively imposing $\ell_0$ reconstruction costs on $x$ and $z$.

While the crux of our approach, BigBiGAN, remains the same as that of BiGAN bigan ; ali , we have adopted the generator and discriminator architectures from the state-of-the-art BigGAN biggan generative image model. Beyond that, we have found that an improved discriminator structure leads to better representation learning results without compromising generation (Figure 1). Namely, in addition to the joint discriminator loss proposed in bigan ; ali which ties the data and latent distributions together, we propose additional unary terms in the learning objective, which are functions only of either the data $x$ or the latents $z$. Although bigan ; ali prove that the original BiGAN objective already enforces that the learnt joint distributions match at the global optimum, implying that the marginal distributions of $x$ and $z$ match as well, these unary terms intuitively guide optimization in the “right direction” by explicitly enforcing this property. For example, in the context of image generation, the unary loss term on $x$ matches the original GAN objective and provides a learning signal which steers only the generator to match the image distribution independently of its latent inputs. (In our evaluation we will demonstrate empirically that the addition of these terms results in both improved generation and representation learning.)

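Concretely, the discriminator loss $\mathcal{L}_{\mathcal{D}}$ and the encoder-generator loss $\mathcal{L}_{\mathcal{EG}}$ are defined as follows, based on scalar discriminator “score” functions $s_*$ and the corresponding per-sample losses $\ell_*$:

$$
\begin{aligned}
s_x(x) &= \theta_x^\top F_\Theta(x) \\
s_z(z) &= \theta_z^\top H_\Theta(z) \\
s_{xz}(x, z) &= \theta_{xz}^\top J_\Theta(F_\Theta(x), H_\Theta(z)) \\
\ell_{\mathcal{EG}}(x, z, y) &= y \left( s_x(x) + s_z(z) + s_{xz}(x, z) \right), \quad y \in \{-1, +1\} \\
\mathcal{L}_{\mathcal{EG}}(P_x, P_z) &= \mathbb{E}_{x \sim P_x,\, \hat{z} \sim \mathcal{E}_\Phi(x)} \left[ \ell_{\mathcal{EG}}(x, \hat{z}, +1) \right] + \mathbb{E}_{z \sim P_z,\, \hat{x} \sim \mathcal{G}_\Phi(z)} \left[ \ell_{\mathcal{EG}}(\hat{x}, z, -1) \right] \\
\ell_{\mathcal{D}}(x, z, y) &= h(y\, s_x(x)) + h(y\, s_z(z)) + h(y\, s_{xz}(x, z)), \quad y \in \{-1, +1\} \\
\mathcal{L}_{\mathcal{D}}(P_x, P_z) &= \mathbb{E}_{x \sim P_x,\, \hat{z} \sim \mathcal{E}_\Phi(x)} \left[ \ell_{\mathcal{D}}(x, \hat{z}, +1) \right] + \mathbb{E}_{z \sim P_z,\, \hat{x} \sim \mathcal{G}_\Phi(z)} \left[ \ell_{\mathcal{D}}(\hat{x}, z, -1) \right]
\end{aligned}
$$

where $h(t) = \max(0, 1 - t)$ is a “hinge” used to regularize the discriminator geometricgan ; tran , also used in BigGAN biggan . The discriminator $\mathcal{D}$ includes three submodules: $F$, $H$, and $J$. $F$ takes only $x$ as input and $H$ takes only $z$, and learned projections of their outputs with parameters $\theta_x$ and $\theta_z$ respectively give the scalar unary scores $s_x$ and $s_z$. In our experiments, the data $x$ are images and latents $z$ are unstructured flat vectors; accordingly, $F$ is a ConvNet and $H$ is an MLP. The joint score $s_{xz}$ tying $x$ and $z$ is given by the remaining $\mathcal{D}$ submodule, $J$, a function of the outputs of $F$ and $H$.

Footnote 1: We also considered an alternative discriminator loss $\ell'_{\mathcal{D}}$ which invokes the “hinge” $h$ just once on the sum of the three loss terms – $\ell'_{\mathcal{D}} = h(y (s_x(x) + s_z(z) + s_{xz}(x, z)))$ – but found that this performed significantly worse than $\ell_{\mathcal{D}}$ above, which clamps each of the three loss terms separately.

The $\mathcal{E}$ and $\mathcal{G}$ parameters $\Phi$ are optimized to minimize the loss $\mathcal{L}_{\mathcal{EG}}$, and the $\mathcal{D}$ parameters $\Theta$ are optimized to minimize the loss $\mathcal{L}_{\mathcal{D}}$. As usual, the expectations $\mathbb{E}$ are estimated by Monte Carlo samples taken over minibatches.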

3 Evaluation

Most of our experiments follow the standard protocol used to evaluate unsupervised learning techniques, first proposed in colorful . We train a BigBiGAN on unlabeled ImageNet, freeze its learned representation, and then train a linear classifier on its outputs, fully supervised using all of the training set labels. We also measure image generation performance, reporting Inception Score improvedgan (IS) and Fréchet Inception Distance frechet (FID) as the standard metrics there.
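The linear evaluation protocol can be sketched as follows: features from the frozen encoder are treated as fixed inputs, and only a multinomial logistic regression on top of them is trained. This is a minimal NumPy sketch using full-batch gradient descent; the optimizer, learning rate, and step count are assumptions chosen for the example rather than the settings used in the experiments.

```python
import numpy as np

def train_linear_classifier(features, labels, num_classes, lr=0.5, steps=200):
    """Fit multinomial logistic regression on frozen features.

    features: [n, d] array of encoder outputs (never updated here).
    labels:   [n] integer class labels.
    """
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad_logits = (probs - onehot) / n   # gradient of mean cross-entropy
        W -= lr * (features.T @ grad_logits)
        b -= lr * grad_logits.sum(axis=0)
    return W, b

def top1_accuracy(features, labels, W, b):
    """Fraction of examples whose highest-scoring class is the true label."""
    predictions = (features @ W + b).argmax(axis=1)
    return float((predictions == labels).mean())
```

Because the encoder is frozen, the resulting accuracy reflects how linearly separable the classes are in the learned representation.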

3.1 Ablation

We begin with an extensive ablation study in which we directly evaluate a number of modeling choices, with results presented in Table 1. Where possible we performed three runs of each variant with different seeds and report the mean and standard deviation for each metric.

We start with a relatively fully-fledged version of the model at $128 \times 128$ resolution (row Base), with the $\mathcal{G}$ architecture and the $F$ component of $\mathcal{D}$ taken from the corresponding architectures in BigGAN, including the skip connections and shared noise embedding proposed in biggan . $z$ is 120 dimensions, split into six groups of 20 dimensions fed into each of the six layers of $\mathcal{G}$ as in biggan . The remaining components of $\mathcal{D}$ – $H$ and $J$ – are 8-layer MLPs with ResNet-style skip connections (four residual blocks with two layers each) and size 2048 hidden layers. The $\mathcal{E}$ architecture is the ResNet-v2-50 ConvNet originally proposed for image classification in resnetv2 , followed by a 4-layer MLP (size 4096) with skip connections (two residual blocks) after ResNet’s globally average pooled output. The unconditional BigGAN training setup corresponds to the “Single Label” setup proposed in zurichfewer , where a single “dummy” label is used for all images (theoretically equivalent to learning a bias in place of the class-conditional batch norm inputs). We then ablate several aspects of the model, with results detailed in the following paragraphs. Additional architectural and optimization details are provided in Appendix A. Full learning curves for many results are included in Appendix D.
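The grouping of the 120-dimensional latent into six chunks of 20 dimensions can be illustrated with a short NumPy sketch. The helper name `split_latent` and the batch size are assumptions for this example; how each chunk is consumed inside a generator layer is not shown.

```python
import numpy as np

def split_latent(z, num_groups=6):
    """Split a [batch, dim] latent into equal groups along the feature axis."""
    return np.split(z, num_groups, axis=1)

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 120))   # batch of 4 latents, 120 dimensions each
groups = split_latent(z)            # one [4, 20] chunk per generator layer
```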

Latent distribution $P_z$ and stochastic $\mathcal{E}$.

As in ALI ali , the encoder $\mathcal{E}$ of our Base model is non-deterministic, parametrizing a distribution $\mathcal{N}(\mu, \sigma)$. $\mu$ and $\hat{\sigma}$ are given by a linear layer at the output of the model, and the final standard deviation $\sigma$ is computed from $\hat{\sigma}$ using a non-negative “softplus” non-linearity $\mathrm{softplus}(\hat{\sigma}) = \log(1 + \exp(\hat{\sigma}))$ softplus . The final $z$ uses the reparametrized sampling from kingmavae , with $z = \mu + \epsilon \sigma$, where $\epsilon \sim \mathcal{N}(0, I)$. Compared to a deterministic encoder (row Deterministic $\mathcal{E}$) which predicts $z$ directly without sampling (effectively modeling $P(z|x)$ as a Dirac $\delta$ distribution), the non-deterministic Base model achieves significantly better classification performance (at no cost to generation). We also compared to using a uniform $P_z = \mathcal{U}(-1, 1)$ (row Uniform $P_z$) with $\mathcal{E}$ deterministically predicting $z = \tanh(\hat{z})$ given a linear output $\hat{z}$, as done in BiGAN bigan . This also achieves worse classification results than the non-deterministic Base model.
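The stochastic encoder output described above can be sketched in NumPy as follows: a softplus maps the raw linear output to a non-negative standard deviation, and the latent is drawn by reparametrized sampling. The function names are assumptions for this example, and the linear layer that produces the mean and raw deviation is taken as given.

```python
import numpy as np

def softplus(a):
    """Numerically stable log(1 + exp(a))."""
    return np.logaddexp(0.0, a)

def sample_latent(mu, sigma_hat, rng):
    """Reparametrized sample z = mu + eps * sigma with eps ~ N(0, I)."""
    sigma = softplus(sigma_hat)              # non-negative standard deviation
    eps = rng.standard_normal(mu.shape)
    return mu + eps * sigma

def deterministic_uniform_latent(z_hat):
    """Deterministic variant for a uniform prior on (-1, 1): z = tanh(z_hat)."""
    return np.tanh(z_hat)
```

Writing the sample as a deterministic function of the mean, the deviation, and independent noise is what lets gradients flow back into the encoder through both outputs.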

Unary loss terms.

We evaluate the effect of removing one or both unary terms of the loss function proposed in Section 2, $s_x$ and $s_z$. Removing both unary terms (row No Unaries) corresponds to the original objective proposed in bigan ; ali . It is clear that the $x$ unary term has a large positive effect on generation performance, with the Base and $x$ Unary Only rows having significantly better IS and FID than the $z$ Unary Only and No Unaries rows. This result makes intuitive sense as it matches the standard generator loss. It also marginally improves classification performance. The $z$ unary term makes a more marginal difference, likely due to the relative ease of modeling relatively simple distributions like isotropic Gaussians, though it also results in slightly improved classification and generation in terms of FID – especially without the $x$ term ($z$ Unary Only vs. No Unaries). On the other hand, IS is worse with the $z$ term. This may be due to IS roughly measuring the generator’s coverage of the major modes of the distribution (the classes) rather than the distribution in its entirety, the latter of which may be better captured by FID and more likely to be promoted by a good encoder $\mathcal{E}$. The requirement of invertibility in a (Big)BiGAN could be encouraging the generator to produce distinguishable outputs across the entire latent space, rather than “collapsing” large volumes of latent space to a single mode of the data distribution.


$\mathcal{G}$ capacity.

To address the question of the importance of the generator $\mathcal{G}$ in representation learning, we vary the capacity of $\mathcal{G}$ (with $\mathcal{E}$ and $\mathcal{D}$ fixed) in the Small $\mathcal{G}$ rows. With a third of the capacity of the Base $\mathcal{G}$ model (Small $\mathcal{G}$ (32)), the overall model is quite unstable and achieves significantly worse classification results than the higher capacity base model. With two-thirds capacity (Small $\mathcal{G}$ (64)), generation performance is substantially worse (matching the results in biggan ) and classification performance is modestly worse. These results confirm that a powerful image generator is indeed important for learning good representations via the encoder. Assuming this relationship holds in the future, we expect that better generative models are likely to lead to further improvements in representation learning.

Footnote 2: Though the generation performance by IS and FID in row Small $\mathcal{G}$ (32) is very poor at the point we measured – when its best validation classification performance (43.59%) is achieved – this model was performing more reasonably for generation earlier in training, reaching IS 14.69 and FID 60.67.

Standard GAN.

We also compare BigBiGAN’s image generation performance against a standard unconditional BigGAN with no encoder $\mathcal{E}$ and only the standard $F$ ConvNet in the discriminator, with only the $s_x$ term in the loss (row No $\mathcal{E}$ (GAN)). While the standard GAN achieves a marginally better IS, the BigBiGAN FID is about the same, indicating that the addition of the BigBiGAN $\mathcal{E}$ and joint $\mathcal{D}$ does not compromise generation with the newly proposed unary loss terms described in Section 2. (In comparison, the versions of the model without the unary loss term on $x$ – rows $z$ Unary Only and No Unaries – have substantially worse generation performance in terms of FID than the standard GAN.) We conjecture that the IS is worse for similar reasons that the $s_z$ unary loss term leads to worse IS. Next we will show that with an enhanced $\mathcal{E}$ taking higher input resolutions, generation with BigBiGAN in terms of FID is substantially improved over the standard GAN.

High resolution $\mathcal{E}$ with varying resolution $\mathcal{G}$.

BiGAN bigan proposed an asymmetric setup in which $\mathcal{E}$ takes higher resolution images than $\mathcal{G}$ outputs and $\mathcal{D}$ takes as input, showing that an $\mathcal{E}$ taking $128 \times 128$ inputs with a $64 \times 64$ $\mathcal{G}$ outperforms a $64 \times 64$ $\mathcal{E}$ for downstream tasks. We experiment with this setup in BigBiGAN, raising the $\mathcal{E}$ input resolution to $256 \times 256$ – matching the resolution used in typical supervised ImageNet classification setups – and varying the $\mathcal{G}$ output and $\mathcal{D}$ input resolution in $\{64, 128, 256\}$. Our results in Table 1 (rows High Res $\mathcal{E}$ (256) and Low/High Res $\mathcal{G}$ (*)) show that BigBiGAN achieves better representation learning results as the $\mathcal{G}$ resolution increases, up to the full $\mathcal{E}$ resolution of $256 \times 256$. However, because the overall model is much slower to train with $\mathcal{G}$ at $256 \times 256$ resolution, the remainder of our results use the $128 \times 128$ resolution for $\mathcal{G}$.

Interestingly, with the higher resolution $\mathcal{E}$, generation improves significantly (especially by FID), despite $\mathcal{G}$ operating at the same resolution (row High Res $\mathcal{E}$ (256) vs. Base). This is an encouraging result for the potential of BigBiGAN as a means of improving adversarial image synthesis itself, besides its use in representation learning and inference.


$\mathcal{E}$ architecture.

Keeping the $\mathcal{E}$ input resolution fixed at 256, we experiment with varied and often larger $\mathcal{E}$ architectures, including several of the ResNet-50 variants explored in revisiting . In particular, we expand the capacity of the hidden layers by a factor of $2$ or $4$, as well as swap the residual block structure to a reversible variant called RevNet revnet with the same number of layers and capacity as the corresponding ResNets. (We use the version of RevNet described in revisiting .) We find that the base ResNet-50 model (row High Res $\mathcal{E}$ (256)) outperforms RevNet-50 (row RevNet), but as the network widths are expanded, we begin to see improvements from RevNet-50, with double-width RevNet outperforming a ResNet of the same capacity (rows RevNet $\times 2$ and ResNet $\times 2$). We see further gains with an even larger quadruple-width RevNet model (row RevNet $\times 4$), which we use for our final results in Section 3.2.

Decoupled $\mathcal{E}$/$\mathcal{G}$ optimization.

As a final improvement, we decoupled the $\mathcal{E}$ optimizer from that of $\mathcal{G}$, and found that simply using a $10\times$ higher learning rate for $\mathcal{E}$ dramatically accelerates training and improves final representation learning results. For ResNet-50 this improves linear classifier accuracy by nearly 3% (ResNet ($\uparrow \mathcal{E}$ LR) vs. High Res $\mathcal{E}$ (256)). We also applied this to our largest $\mathcal{E}$ architecture, RevNet-50 $\times 4$, and saw similar gains (RevNet $\times 4$ ($\uparrow \mathcal{E}$ LR) vs. RevNet $\times 4$).

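| Variant | $\mathcal{E}$: A. | $\mathcal{E}$: D. | $\mathcal{E}$: C. | $\mathcal{E}$: R. | $\mathcal{E}$: Var. | $\mathcal{E}$: $\eta$ | $\mathcal{G}$: C. | $\mathcal{G}$: R. | $s_{xz}$ | $s_x$ | $s_z$ | $P_z$ | IS (↑) | FID (↓) | Cls. (↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Base | S | 50 | 1 | 128 | ✓ | 1 | 96 | 128 | ✓ | ✓ | ✓ | $\mathcal{N}$ | 22.66 ± 0.18 | 31.19 ± 0.37 | 48.10 ± 0.13 |
| Deterministic $\mathcal{E}$ | S | 50 | 1 | 128 | (-) | 1 | 96 | 128 | ✓ | ✓ | ✓ | $\mathcal{N}$ | 22.79 ± 0.27 | 31.31 ± 0.30 | 46.97 ± 0.35 |
| Uniform $P_z$ | S | 50 | 1 | 128 | (-) | 1 | 96 | 128 | ✓ | ✓ | ✓ | ($\mathcal{U}$) | 22.83 ± 0.24 | 31.52 ± 0.28 | 45.11 ± 0.93 |
| $x$ Unary Only | S | 50 | 1 | 128 | ✓ | 1 | 96 | 128 | ✓ | ✓ | (-) | $\mathcal{N}$ | 23.19 ± 0.28 | 31.99 ± 0.30 | 47.74 ± 0.20 |
| $z$ Unary Only | S | 50 | 1 | 128 | ✓ | 1 | 96 | 128 | ✓ | (-) | ✓ | $\mathcal{N}$ | 19.52 ± 0.39 | 39.48 ± 1.00 | 47.78 ± 0.28 |
| No Unaries (BiGAN) | S | 50 | 1 | 128 | ✓ | 1 | 96 | 128 | ✓ | (-) | (-) | $\mathcal{N}$ | 19.70 ± 0.30 | 42.92 ± 0.92 | 46.71 ± 0.88 |
| Small $\mathcal{G}$ (32) | S | 50 | 1 | 128 | ✓ | 1 | (32) | 128 | ✓ | ✓ | ✓ | $\mathcal{N}$ | 3.28 ± 0.18 | 247.30 ± 10.31 | 43.59 ± 0.34 |
| Small $\mathcal{G}$ (64) | S | 50 | 1 | 128 | ✓ | 1 | (64) | 128 | ✓ | ✓ | ✓ | $\mathcal{N}$ | 19.96 ± 0.15 | 38.93 ± 0.39 | 47.54 ± 0.33 |
| No $\mathcal{E}$ (GAN)* | (-) | | | | | | 96 | 128 | (-) | ✓ | (-) | $\mathcal{N}$ | 23.56 ± 0.37 | 30.91 ± 0.23 | - |
| High Res $\mathcal{E}$ (256) | S | 50 | 1 | (256) | ✓ | 1 | 96 | 128 | ✓ | ✓ | ✓ | $\mathcal{N}$ | 23.45 ± 0.14 | 27.86 ± 0.13 | 50.80 ± 0.30 |
| Low Res $\mathcal{G}$ (64) | S | 50 | 1 | (256) | ✓ | 1 | 96 | (64) | ✓ | ✓ | ✓ | $\mathcal{N}$ | 19.40 ± 0.19 | 15.82 ± 0.06 | 47.51 ± 0.09 |
| High Res $\mathcal{G}$ (256) | S | 50 | 1 | (256) | ✓ | 1 | 96 | (256) | ✓ | ✓ | ✓ | $\mathcal{N}$ | 24.70 | 38.58 | 51.49 |
| ResNet-101 | S | (101) | 1 | (256) | ✓ | 1 | 96 | 128 | ✓ | ✓ | ✓ | $\mathcal{N}$ | 23.29 | 28.01 | 51.21 |
| ResNet $\times 2$ | S | 50 | (2) | (256) | ✓ | 1 | 96 | 128 | ✓ | ✓ | ✓ | $\mathcal{N}$ | 23.68 | 27.81 | 52.66 |
| RevNet | (V) | 50 | 1 | (256) | ✓ | 1 | 96 | 128 | ✓ | ✓ | ✓ | $\mathcal{N}$ | 23.33 ± 0.09 | 27.78 ± 0.06 | 49.42 ± 0.18 |
| RevNet $\times 2$ | (V) | 50 | (2) | (256) | ✓ | 1 | 96 | 128 | ✓ | ✓ | ✓ | $\mathcal{N}$ | 23.21 | 27.96 | 54.40 |
| RevNet $\times 4$ | (V) | 50 | (4) | (256) | ✓ | 1 | 96 | 128 | ✓ | ✓ | ✓ | $\mathcal{N}$ | 23.23 | 28.15 | 57.15 |
| ResNet ($\uparrow \mathcal{E}$ LR) | S | 50 | 1 | (256) | ✓ | (10) | 96 | 128 | ✓ | ✓ | ✓ | $\mathcal{N}$ | 23.27 ± 0.22 | 28.51 ± 0.44 | 53.70 ± 0.15 |
| RevNet $\times 4$ ($\uparrow \mathcal{E}$ LR) | (V) | 50 | (4) | (256) | ✓ | (10) | 96 | 128 | ✓ | ✓ | ✓ | $\mathcal{N}$ | 23.08 | 28.54 | 60.15 |

Table 1: Results for variants of BigBiGAN, given in Inception Score improvedgan (IS) and Fréchet Inception Distance frechet (FID) of the generated images, and ImageNet top-1 classification accuracy percentage (Cls.) of a supervised logistic regression classifier trained on the encoder features colorful , computed on a split of 10K images randomly sampled from the training set, which we refer to as the “train” split. The Encoder ($\mathcal{E}$) columns specify the $\mathcal{E}$ architecture (A.) as ResNet (S) or RevNet (V), the depth (D., e.g. 50 for ResNet-50), the channel width multiplier (C.), with 1 denoting the original widths from resnetv2 , the input image resolution (R.), whether the variance is predicted and a $z$ vector is sampled from the resulting distribution (Var.), and the learning rate multiplier $\eta$ relative to the $\mathcal{G}$ learning rate. The Generator ($\mathcal{G}$) columns specify the BigGAN $\mathcal{G}$ channel multiplier (C.), with 96 corresponding to the original width from biggan , and output image resolution (R.). The Loss columns ($s_{xz}$, $s_x$, $s_z$) specify which terms of the BigBiGAN loss are present in the objective. The $P_z$ column specifies the input distribution as a standard normal $\mathcal{N}(0, 1)$ or continuous uniform $\mathcal{U}(-1, 1)$. Changes from the Base setup in each row are marked with parentheses. Results with margins of error (written as “$\mu \pm \sigma$”) are the means and standard deviations over three runs with different random seeds. (Experiments requiring more computation were run only once.) (* Result for vanilla GAN (No $\mathcal{E}$ (GAN)) selected with early stopping based on best FID; other results selected with early stopping based on validation classification accuracy (Cls.).)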

3.2 Comparison with prior methods

Representation learning.
| Method | Architecture | Feature | Top-1 | Top-5 |
|---|---|---|---|---|
| BiGAN bigan ; splitbrain | AlexNet | conv3 | 31.0 | - |
| Motion Segmentation (MS) motionseg ; carl | ResNet-101 | AvePool | 27.6 | 48.3 |
| Exemplar (Ex) exemplar ; carl | ResNet-101 | AvePool | 31.5 | 53.1 |
| Relative Position (RP) carlorig ; carl | ResNet-101 | AvePool | 36.2 | 59.2 |
| Colorization (Col) colorful ; carl | ResNet-101 | AvePool | 39.6 | 62.5 |
| Combination of MS+Ex+RP+Col carl | ResNet-101 | AvePool | - | 69.3 |
| CPC cpc | ResNet-101 | AvePool | 48.7 | 73.6 |
| Rotation rotation ; revisiting | RevNet-50 $\times 4$ | AvePool | 55.4 | - |
| Efficient CPC cpcplusplus | ResNet-170 | AvePool | 61.0 | 83.0 |
| BigBiGAN (ours) | ResNet-50 | AvePool | 55.4 | 77.4 |
| BigBiGAN (ours) | ResNet-50 | BN+CReLU | 56.6 | 78.6 |
| BigBiGAN (ours) | RevNet-50 $\times 4$ | AvePool | 60.8 | 81.4 |
| BigBiGAN (ours) | RevNet-50 $\times 4$ | BN+CReLU | 61.3 | 81.9 |

Table 2: Comparison of BigBiGAN models on the official ImageNet validation set against recent competing approaches with a supervised logistic regression classifier. BigBiGAN results are selected with early stopping based on highest accuracy on our train subset of 10K training set images. ResNet-50 results correspond to row ResNet ($\uparrow \mathcal{E}$ LR) in Table 1, and RevNet-50 $\times 4$ corresponds to RevNet $\times 4$ ($\uparrow \mathcal{E}$ LR).

We now take our best model by train classification accuracy from the above ablations and present results on the official ImageNet validation set, comparing against the state of the art in recent unsupervised learning literature. For comparison, we also present classification results for our best performing variant with the smaller ResNet-50-based $\mathcal{E}$. These models correspond to the last two rows of Table 1, ResNet ($\uparrow \mathcal{E}$ LR) and RevNet $\times 4$ ($\uparrow \mathcal{E}$ LR).

Results are presented in Table 2. (For reference, the fully supervised accuracy of these architectures is given in Appendix A, Table 4.) Compared with a number of modern self-supervised approaches motionseg ; carlorig ; colorful ; cpc ; rotation ; cpcplusplus and combinations thereof carl , our BigBiGAN approach based purely on generative models performs well for representation learning and is state-of-the-art among recent unsupervised learning results. It improves top-1 accuracy from 55.4% to 60.8% over a recently published result from revisiting that uses rotation prediction pre-training with the same representation learning architecture and feature (labeled AvePool in Table 2), and it matches the results of the concurrent work in cpcplusplus based on contrastive predictive coding (CPC).

Footnote 3: Our RevNet $\times 4$ architecture matches the widest architectures used in revisiting , labeled as $\times 16$ there.

We also experiment with learning linear classifiers on a different rendering of the AvePool feature, labeled BN+CReLU, which boosts our best results with RevNet $\times 4$ to 61.3% top-1 accuracy. Given the global average pooling output $a$, we first compute $h = \mathrm{BatchNorm}(a)$, and the final feature is computed by concatenating $[\mathrm{ReLU}(h), \mathrm{ReLU}(-h)]$, sometimes called a “CReLU” (concatenated ReLU) non-linearity crelu . $\mathrm{BatchNorm}$ denotes parameter-free Batch Normalization batchnorm , where the scale ($\gamma$) and offset ($\beta$) parameters are not learned, so training a linear classifier on this feature does not involve any additional learning. The CReLU non-linearity retains all the information in its inputs and doubles the feature dimension, each of which likely contributes to the improved results.
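The BN+CReLU feature can be sketched in a few lines of NumPy. This is a minimal sketch that assumes the normalization statistics are computed over the batch of features passed in; the `eps` constant and the function name are likewise assumptions for the example.

```python
import numpy as np

def bn_crelu(a, eps=1e-5):
    """Parameter-free batch normalization followed by a concatenated ReLU.

    a: [num_examples, dim] array of globally average pooled features.
    Returns a [num_examples, 2 * dim] array.
    """
    mean = a.mean(axis=0, keepdims=True)
    var = a.var(axis=0, keepdims=True)
    h = (a - mean) / np.sqrt(var + eps)      # no learned scale or offset
    return np.concatenate([np.maximum(h, 0.0), np.maximum(-h, 0.0)], axis=1)
```

Subtracting the second half of the output from the first half recovers the normalized feature exactly, which is the sense in which the non-linearity retains all the information in its input.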

Finally, in Appendix C we consider evaluating representations by zero-shot nearest neighbors classification, achieving 43.3% top-1 accuracy in this setting. Qualitative examples of nearest neighbors are presented in Figure 12.

Unsupervised image generation.
| Method | Steps | IS (↑) | FID vs. Train (↓) | FID vs. Val. (↓) |
|---|---|---|---|---|
| BigGAN + SL zurichfewer | 500K | 20.4 (15.4 ± 7.57) | - | 25.3 (71.7 ± 66.32) |
| BigGAN + Clustering zurichfewer | 500K | 22.7 (22.8 ± 0.42) | - | 23.2 (22.7 ± 0.80) |
| BigBiGAN + SL (ours) | 500K | 25.38 (25.33 ± 0.17) | 22.78 (22.63 ± 0.23) | 23.60 (23.56 ± 0.12) |
| BigBiGAN High Res $\mathcal{E}$ + SL (ours) | 500K | 25.43 (25.45 ± 0.04) | 22.34 (22.36 ± 0.04) | 22.94 (23.00 ± 0.15) |
| BigBiGAN High Res $\mathcal{E}$ + SL (ours) | 1M | 27.94 (27.80 ± 0.21) | 20.32 (20.27 ± 0.09) | 21.61 (21.62 ± 0.09) |

Table 3: Comparison of our BigBiGAN for unsupervised (unconditional) generation vs. previously reported results for unsupervised BigGAN from zurichfewer . We specify the “pseudo-labeling” method as SL (Single Label) or Clustering. For comparison we train BigBiGAN for the same number of steps (500K) as the BigGAN-based approaches from zurichfewer , but also present results from additional training to 1M steps in the last row and observe further improvements. All results above include the median $m$ as well as the mean $\mu$ and standard deviation $\sigma$ across three runs, written as “$m$ ($\mu \pm \sigma$)”. The BigBiGAN result is selected with early stopping based on best FID vs. Train.

In Table 3 we show results for unsupervised generation with BigBiGAN, comparing to the BigGAN-based biggan unsupervised generation results from zurichfewer . Note that these results differ from those in Table 1 due to the use of the data augmentation method of zurichfewer (rather than ResNet-style preprocessing used for all results in our Table 1 ablation study). The lighter augmentation from zurichfewer results in better image generation performance under the IS and FID metrics. The improvements are likely due in part to the fact that this augmentation, on average, crops larger portions of the image, thus yielding generators that typically produce images encompassing most or all of a given object, which tends to result in more representative samples of any given class (giving better IS) and more closely matching the statistics of full center crops (as used in the real data statistics to compute FID). Besides this preprocessing difference, the approaches in Table 3 have the same configurations as used in the Base or High Res $\mathcal{E}$ (256) row of Table 1.

Footnote 4: See the “distorted” preprocessing method from the Compare GAN framework.

These results show that BigBiGAN significantly improves both IS and FID over the baseline unconditional BigGAN generation results with the same (unsupervised) “labels” (a single fixed label in the SL (Single Label) approach – row BigBiGAN + SL vs. BigGAN + SL). We see further improvements using a high resolution $\mathcal{E}$ (row BigBiGAN High Res $\mathcal{E}$ + SL), surpassing the previous unsupervised state of the art (row BigGAN + Clustering) under both IS and FID. (Note that the image generation results remain comparable: the generated image resolution is still $128 \times 128$ here, despite the higher resolution $\mathcal{E}$ input.) The alternative “pseudo-labeling” approach from zurichfewer , Clustering, which uses labels derived from unsupervised clustering, is complementary to BigBiGAN and combining both could yield further improvements. Finally, observing that results continue to improve significantly with training beyond 500K steps, we also report results at 1M steps in the final row of Table 3.

3.3 Reconstruction

As shown in bigan ; ali , the (Big)BiGAN $\mathcal{E}$ and $\mathcal{G}$ can reconstruct data instances $x$ by computing the encoder’s predicted latent representation $\mathcal{E}(x)$ and then passing this predicted latent back through the generator to obtain the reconstruction $\mathcal{G}(\mathcal{E}(x))$. We present BigBiGAN reconstructions in Figure 2. These reconstructions are far from pixel-perfect, likely due in part to the fact that no reconstruction cost is explicitly enforced by the objective – reconstructions are not even computed at training time. However, they may provide some intuition for what features the encoder $\mathcal{E}$ learns to model. For example, when the input image contains a dog, person, or a food item, the reconstruction is often a different instance of the same “category” with similar pose, position, and texture – for example, a similar species of dog facing the same direction. The extent to which these reconstructions tend to retain the high-level semantics of the inputs rather than the low-level details suggests that BigBiGAN training encourages the encoder to model the former more so than the latter. Additional reconstructions are presented in Appendix B.

Figure 2: Selected reconstructions from an unsupervised BigBiGAN model (Section 3.3). Top row images are real data $x \sim P_x$; bottom row images are generated reconstructions of the above image $x$ computed by $\mathcal{G}(\mathcal{E}(x))$. Unlike most explicit reconstruction costs (e.g., pixel-wise), the reconstruction cost implicitly minimized by a (Big)BiGAN bigan ; ali tends to emphasize more semantic, high-level details. Additional reconstructions are presented in Appendix B.

4 Related work

A number of approaches to unsupervised representation learning from images based on self-supervision have proven very successful. Self-supervision generally involves learning from tasks designed to resemble supervised learning in some way, but in which the “labels” can be created automatically from the data itself with no manual effort. An early example is relative location prediction carlorig , where a model is trained on input pairs of image patches and predicts their relative locations. Contrastive predictive coding (CPC) cpc ; cpcplusplus is a recent related approach where, given an image patch, a model predicts which patches occur in other image locations. Other approaches include colorization colorful ; splitbrain , motion segmentation motionseg , rotation prediction rotation , and exemplar matching exemplar . Rigorous empirical comparisons of many of these approaches have also been conducted carl ; revisiting . A key advantage offered by BigBiGAN and other approaches based on generative models, relative to most self-supervised approaches, is that their input may be the full-resolution image or other signal, with no cropping or modification of the data needed (though such modifications may be beneficial as data augmentation). This means the resulting representation can typically be applied directly to full data in the downstream task with no domain shift.

A number of relevant autoencoder and GAN variants have also been proposed. Associative compression networks (ACNs) acn learn to compress at the dataset level by conditioning data on other previously transmitted data which are similar in code space, resulting in models that can “daydream” semantically similar samples, similar to BigBiGAN reconstructions. VQ-VAEs vqvae pair a discrete (vector quantized) encoder with an autoregressive decoder to produce faithful reconstructions with a high compression factor and demonstrate representation learning results in reinforcement learning settings. In the adversarial space, adversarial autoencoders advae proposed an autoencoder-style encoder-decoder pair trained with pixel-level reconstruction cost, replacing the KL-divergence regularization of the prior used in VAEs kingmavae with a discriminator. In another proposed VAE-GAN hybrid learnedsimilarity the pixel-space reconstruction error used in most VAEs is replaced with feature space distance from an intermediate layer of a GAN discriminator. Other hybrid approaches like AGE age and $\alpha$-GAN alphagan add an encoder to stabilize GAN training. An interesting difference between many of these approaches and the BiGAN ali ; bigan framework is that BiGAN does not train the encoder or generator with an explicit reconstruction cost. Though it can be shown that (Big)BiGAN implicitly minimizes a reconstruction cost, qualitative reconstruction results (Section 3.3) suggest that this reconstruction cost is of a different flavor, emphasizing high-level semantics over pixel-level details.

5 Discussion

We have shown that BigBiGAN, an unsupervised learning approach based purely on generative models, achieves state-of-the-art results in image representation learning on ImageNet. Our ablation study lends further credence to the hope that powerful generative models can be beneficial for representation learning, and in turn that learning an inference model can improve large-scale generative models. In the future we hope that representation learning can continue to benefit from further advances in generative models and inference models alike, as well as scaling to larger image databases.


The authors would like to thank Aidan Clark, Olivier Hénaff, Aäron van den Oord, Sander Dieleman, and many other colleagues at DeepMind for useful discussions and feedback on this work.


Appendix A Model and optimization details

Our optimizer matches that of BigGAN [1] – we use Adam [18] with batch size 2048 and the same learning rates and other hyperparameters, using the $\mathcal{G}$ optimizer to update $\mathcal{E}$ simultaneously, with the same alternating optimization: two $\mathcal{D}$ updates followed by a single joint update of $\mathcal{G}$ and $\mathcal{E}$. (We do not use the orthogonal regularization used in [1], finding it gave worse results in the unconditional setting, matching the findings of [24].) Spectral normalization [26] is used in $\mathcal{G}$ and $\mathcal{D}$, but not in $\mathcal{E}$. Full cross-replica batch normalization is used in both $\mathcal{G}$ and $\mathcal{E}$ (including for the linear classifier training on $\mathcal{E}$ features used for evaluations). We also apply exponential moving averaging (EMA) with a decay of 0.9999 to the $\mathcal{G}$ and $\mathcal{E}$ weights in all evaluations. (We find this results in only a small improvement for $\mathcal{E}$ evaluations, but a substantial one for $\mathcal{G}$ evaluations.)
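The alternating schedule and the weight averaging described above can be illustrated with a minimal sketch. It is not the training code used for these experiments: it assumes, as in BigGAN, that the two consecutive updates are discriminator updates, it represents each network's weights as a single float, and `d_step` and `ge_step` are hypothetical callbacks standing in for the real optimizer updates.

```python
EMA_KEYS = ("G", "E")  # only generator and encoder weights are averaged


def ema_update(avg, current, decay=0.9999):
    """Blend the running average with the current weights (keys of `avg` only)."""
    return {k: decay * avg[k] + (1.0 - decay) * current[k] for k in avg}


def train(weights, d_step, ge_step, num_iters, d_steps_per_iter=2, decay=0.9999):
    """Run the alternating schedule and return (weights, ema_weights).

    `d_step` and `ge_step` are hypothetical callbacks: each takes the
    weight dict and returns an updated copy.
    """
    ema = {k: weights[k] for k in EMA_KEYS}  # start the average at the initial weights
    for _ in range(num_iters):
        for _ in range(d_steps_per_iter):
            weights = d_step(weights)   # discriminator-only update
        weights = ge_step(weights)      # single joint generator + encoder update
        ema = ema_update(ema, weights, decay)
    return weights, ema
```

With a decay of 0.9999 the average has an effective horizon on the order of 10,000 steps, so the averaged weights lag the raw weights noticeably early in training.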

At BigBiGAN training time, as well as linear classification evaluation training time, we preprocess inputs with ResNet [13]-style data augmentation, though with crops of size 128 or 256 rather than 224. (Footnote: the preprocessing code is taken from the TensorFlow ResNet TPU model.)

For linear classification evaluations in the ablations reported in Table 1, we hold out 10K randomly selected images from the official ImageNet [30] training set as a validation set and report accuracy on that validation set, which we call $\text{train}_{\text{val}}$. All results in Table 1 are run for 500K steps, with early stopping based on linear classifier accuracy on our $\text{train}_{\text{val}}$ split. In all of these models the linear classifier is initialized to 0 and trained for 5K Adam steps with a (high) learning rate of 0.01 and EMA smoothing with decay 0.9999. We have found it helpful to monitor representation learning progress during BigBiGAN training by periodically rerunning this linear classification evaluation from scratch given the current $\mathcal{E}$ weights, resetting the classifier weights to 0 before each evaluation.
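The hold-out split and the early-stopping rule described above can be sketched as follows. This is only an illustration: the function names, the fixed seed, and the use of integer image ids are assumptions of the sketch, not details given in the text.

```python
import random


def split_train_val(image_ids, num_val=10_000, seed=0):
    """Hold out `num_val` randomly selected ids as a validation split."""
    ids = list(image_ids)
    if num_val > len(ids):
        raise ValueError("num_val exceeds the number of available images")
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the split reproducible
    rng.shuffle(ids)
    return ids[num_val:], ids[:num_val]  # (remaining training ids, held-out ids)


def select_checkpoint(val_accuracy_by_step):
    """Early stopping: return the training step with the best held-out accuracy."""
    return max(val_accuracy_by_step, key=val_accuracy_by_step.get)
```

For example, `select_checkpoint({100: 0.30, 200: 0.50, 300: 0.45})` picks step 200, the evaluation with the highest held-out accuracy.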

In Table 2 we extend the BigBiGAN training time to 1M steps, and report results on the official validation set of 50K images for comparison with prior work. The classifier in these results is trained for 100K Adam steps, sweeping over several learning rates, again applying EMA with decay 0.9999 to the classifier weights. Hyperparameter selection and early stopping are again based on classification accuracy on $\text{train}_{\text{val}}$. As in [1], FID is reported against statistics over the full ImageNet training set, preprocessed by resizing the minor axis to the output resolution and taking the center crop along the major axis, except as noted in Table 3, where we also report FID against the validation set for comparison with [24].
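The geometry of the FID preprocessing described above (resize the minor axis to the output resolution, then center crop along the major axis) can be sketched as below. The function name and the rounding convention are assumptions of this sketch; an actual image library may round the resized dimension differently.

```python
def resize_and_center_crop_box(height, width, out_res):
    """Return ((new_h, new_w), (top, left, bottom, right)).

    The minor (shorter) axis is scaled to `out_res`; the major axis is
    scaled by the same factor and then center-cropped to `out_res`.
    """
    scale = out_res / min(height, width)
    # max() guards against floating-point rounding falling just below out_res.
    new_h = max(out_res, round(height * scale))
    new_w = max(out_res, round(width * scale))
    top = (new_h - out_res) // 2
    left = (new_w - out_res) // 2
    return (new_h, new_w), (top, left, top + out_res, left + out_res)
```

For a 375 × 500 (height × width) image and an output resolution of 128, this gives a resized shape of 128 × 171 and a crop box that discards 21 columns on the left and 22 on the right.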

All models were trained with data parallelism on TPU pod slices [11] using 32 to 512 cores.

Supervised model performance.

In Table 4 we present the results of fully supervised training with the model architectures used in our experiments in Section 3 for comparison purposes.

Architecture            Top-1   Top-5
ResNet-50               76.3    93.1
ResNet-101              77.8    93.8
RevNet-50               71.8    90.5
RevNet-50 $\times 2$    74.9    92.2
RevNet-50 $\times 4$    76.6    93.1
Table 4: ImageNet validation set accuracy for fully supervised end-to-end training of the model architectures used in our representation learning experiments.

First layer convolutional filters.

In Figure 3 we visualize the learned convolutional filters for the first convolutional layer of our BigBiGAN encoders using the largest RevNet $\times 4$ $\mathcal{E}$ architecture. Note the difference between the filters in (a) and (b) (corresponding to rows RevNet $\times 4$ and RevNet $\times 4$ ($\uparrow\mathcal{E}$ LR) in Table 1). In (b) we use the higher $\mathcal{E}$ learning rate and see a corresponding qualitative improvement in the appearance of the learned filters, with less noise and more Gabor-like and color filters, as observed in BiGAN [4]. This suggests that examining the convolutional filters of the input layer can serve as a diagnostic for undertrained models.

(a) RevNet $\times 4$
(b) + $\uparrow\mathcal{E}$ LR
Figure 3: Visualization of first layer convolutional filters for our unsupervised BigBiGAN models with the RevNet $\times 4$ $\mathcal{E}$ architecture, which includes 1024 filters. (Best viewed with zoom.)

Appendix B Samples and reconstructions

                               Samples                              Reconstructions
Model                          Image       IS (↑)    FID (↓)        Image       Rel. Error % (↓)
Base                           Figure 4    24.10     30.14          Figure 5    70.54
Light Augmentation             Figure 6    27.09     20.96          Figure 7    72.53
High Res $\mathcal{E}$ (256)   Figure 8    24.91     26.56          Figure 9    70.60
High Res $\mathcal{G}$ (256)   Figure 10   25.73     37.21          Figure 11   77.70
Table 5: Links to BigBiGAN samples and reconstructions with associated metrics.

In this Appendix we present BigBiGAN samples and reconstructions from several variants of the method. Table 5 includes pointers to samples and reconstruction images, as well as relevant metrics. The samples were selected by best FID vs. training set statistics, and we show the IS and FID along with sample images at that point. The reconstructions were selected by best (lowest) relative pixel-wise $\ell_1$ error, the error metric presented in Table 5, computed as:

$$\mathrm{Error}_{\mathrm{Rel}} = \frac{\mathbb{E}_{x \sim P_x} \left\| x - \mathcal{G}(\mathcal{E}(x)) \right\|_1}{\mathbb{E}_{x, x' \sim P_x} \left\| x - x' \right\|_1}$$

where $x$ and $x'$ are independent data samples, and the denominator serves as a “baseline” reconstruction error relative to a “random” input. For example, with a random initialization of $\mathcal{G}$ and $\mathcal{E}$, we have $\mathrm{Error}_{\mathrm{Rel}} \approx 1$. This relative metric penalizes degenerate reconstructions, such as the mean image, which would sometimes achieve low absolute reconstruction error despite having no perceptual similarity to the inputs. In practice, given $N$ data samples $x_0, \ldots, x_{N-1}$ (we use $N = $ 50K), we estimate the denominator by comparing each sample $x_i$ with a single neighbor $x_{(i+1) \bmod N}$, computing:

$$\frac{1}{N} \sum_{i=0}^{N-1} \left\| x_i - x_{(i+1) \bmod N} \right\|_1$$
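A relative reconstruction error of this kind can be sketched in a few lines. The sketch makes two assumptions that should be read as such: the pixel-wise error is taken to be an $\ell_1$ distance, and the “single neighbor” of each sample is taken to be the next sample in the list, wrapping around at the end. Images are represented as flat lists of pixel values, and the function names are illustrative.

```python
def l1(a, b):
    """Sum of absolute pixel differences between two flattened images."""
    return sum(abs(p - q) for p, q in zip(a, b))


def relative_error(images, reconstructions):
    """Mean reconstruction error divided by a shifted-neighbor baseline."""
    n = len(images)
    if n < 2 or n != len(reconstructions):
        raise ValueError("need matching lists of at least two images")
    recon = sum(l1(x, r) for x, r in zip(images, reconstructions)) / n
    # Baseline (assumed pairing): compare each image with its successor, wrapping around.
    base = sum(l1(images[i], images[(i + 1) % n]) for i in range(n)) / n
    return recon / base
```

Under these assumptions, reconstructing every input as an unrelated data sample gives a relative error of 1, while perfect reconstructions give 0. Multiplying by 100 yields a percentage as reported in Table 5.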

Figure 4: Samples from an unsupervised BigBiGAN generator $\mathcal{G}$, trained using the Base method from Table 1.
Figure 5: Reconstructions from an unsupervised BigBiGAN model, trained using the Base method from Table 1. The top rows of each pair are real data $x \sim P_x$, and bottom rows are generated reconstructions computed by $\mathcal{G}(\mathcal{E}(x))$.
Figure 6: Samples from an unsupervised BigBiGAN generator $\mathcal{G}$, trained using the lighter augmentation from [24] with generation results reported in Table 3.
Figure 7: Reconstructions from an unsupervised BigBiGAN model, trained using the lighter augmentation from [24] with generation results reported in Table 3. The top rows of each pair are real data $x \sim P_x$, and bottom rows are generated reconstructions computed by $\mathcal{G}(\mathcal{E}(x))$.
Figure 8: Samples from an unsupervised BigBiGAN generator $\mathcal{G}$, trained using the High Res $\mathcal{E}$ (256) configuration from Table 1.
Figure 9: Reconstructions of encoder input images from an unsupervised BigBiGAN model, trained using the High Res $\mathcal{E}$ (256) configuration from Table 1. Reconstructions are upsampled from $128 \times 128$ to $256 \times 256$ for visualization. The top rows of each pair are real data $x \sim P_x$, and bottom rows are generated reconstructions computed by $\mathcal{G}(\mathcal{E}(x))$.
Figure 10: Samples from an unsupervised BigBiGAN generator $\mathcal{G}$, trained with a high-resolution $\mathcal{E}$ and $\mathcal{G}$ (High Res $\mathcal{G}$ (256) from Table 1).
Figure 11: Reconstructions from an unsupervised BigBiGAN model, trained with a high-resolution $\mathcal{E}$ and $\mathcal{G}$ (High Res $\mathcal{G}$ (256) from Table 1). The top rows of each pair are real data $x \sim P_x$, and bottom rows are generated reconstructions computed by $\mathcal{G}(\mathcal{E}(x))$.

Appendix C Nearest neighbors

Top-1 / Top-5 Acc. (%)
Distance    $k = 1$      $k = 5$          $k = 25$         $k = 50$
$\ell_1$    38.09 / -    41.28 / 58.56    43.32 / 65.12    42.73 / 66.22
$\ell_2$    35.68 / -    38.61 / 55.59    40.65 / 62.23    40.15 / 63.42
Table 6: Accuracy of $k$ nearest neighbors classifiers in BigBiGAN feature space on the ImageNet validation set. We report results under the normalized $\ell_1$ distance as well as the normalized $\ell_2$ (cosine) distance.

In this Appendix we consider an alternative way of evaluating representations – by means of $k$ nearest neighbors classification, which does not involve learning any parameters during evaluation and is even simpler than learning a linear classifier as done in Section 3. For all results in this section, we use the outputs of the global average pooling layer (a flat 8192D feature) of our best performing model, RevNet $\times 4$, $\uparrow\mathcal{E}$ LR. We do not do any data augmentation for either the training or validation sets: we simply crop each image at the center of its larger axis and resize to $256 \times 256$.

We use a normalized $\ell_1$ or $\ell_2$ distance metric as our nearest neighbors criterion, defined as $d_p(a, b) = \left\| \frac{a}{\|a\|_p} - \frac{b}{\|b\|_p} \right\|_p$, for $p \in \{1, 2\}$. ($d_2$ corresponds to cosine distance.) For label predictions with multiple neighbors ($k > 1$), we use a simple counting scheme: the label with the most votes is selected as the prediction. Ties (multiple labels with the same number of votes) are broken by nearest neighbor classification among the data with the tied labels.
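The voting scheme and tie-breaking rule above can be sketched as follows. The sketch assumes the normalized distance is the $p$-norm of the difference between $p$-normalized feature vectors with $p \in \{1, 2\}$, and that feature vectors are non-zero; the function names are illustrative, and a brute-force sort stands in for whatever neighbor search is used at ImageNet scale.

```python
from collections import Counter


def normalize(v, p):
    """Scale v to unit p-norm (assumes v is not all zeros)."""
    norm = sum(abs(x) ** p for x in v) ** (1.0 / p)
    return [x / norm for x in v]


def distance(a, b, p):
    """p-norm distance between the p-normalized versions of a and b."""
    a, b = normalize(a, p), normalize(b, p)
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def knn_predict(query, train_feats, train_labels, k=1, p=1):
    """Predict a label by majority vote among the k nearest training features."""
    order = sorted(range(len(train_feats)),
                   key=lambda i: distance(query, train_feats[i], p))
    neighbors = order[:k]  # indices of the k nearest points, nearest first
    votes = Counter(train_labels[i] for i in neighbors)
    top = max(votes.values())
    tied = {label for label, count in votes.items() if count == top}
    # Tie-break: the nearest neighbor whose label is among the tied labels.
    for i in neighbors:
        if train_labels[i] in tied:
            return train_labels[i]
```

Because every tied label appears among the $k$ neighbors, the nearest training point carrying a tied label is always one of those neighbors, so scanning them in order of distance is sufficient.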

Quantitative results.

In Table 6 we present $k$ nearest neighbors classification results for $k \in \{1, 5, 25, 50\}$. Across all $k$, the $\ell_1$-based metric outperforms $\ell_2$, and the remainder of our discussion refers to the $\ell_1$ results. With just a single neighbor ($k = 1$) we achieve a top-1 accuracy around 38%. Top-1 accuracy reaches 43% with $k = 25$, dropping off slightly at $k = 50$ as votes from more distant neighbors are added.

Qualitative results.

Figure 12 shows sample nearest neighbors in the ImageNet training set for query images in the validation set. Despite being fully unsupervised, the neighbors in many cases match the query image in terms of high-level semantic content such as the category of the object of interest, demonstrating BigBiGAN’s ability to capture high-level attributes of the data in its unsupervised representations. Where applicable, the object’s pose and position in the image appear to be important as well – for example, the nearest neighbors of the RV (row 2, column 2) are all RVs facing roughly the same direction. In other cases, the nearest neighbors appear to be selected primarily based on the background or color scheme.


While our quantitative nearest neighbors classification results are far from the state of the art for ImageNet classification and significantly below the linear classifier-based results reported in Table 2, note that in this setup, no supervised learning of model parameters from labels occurs at any point: labels are predicted purely based on distance in a feature space learned from BigBiGAN training on image pixels alone. We believe this makes nearest neighbors classification an interesting additional benchmark for future approaches to unsupervised representation learning.

Figure 12: Nearest neighbors in BigBiGAN feature space, from our best performing model (RevNet $\times 4$, $\uparrow\mathcal{E}$ LR). In each row, the first (left) column is a query image, and the remaining columns are its three nearest neighbors from the training set (the leftmost being the nearest, the next being the second nearest, etc.). The query images above are the first 24 images in the ImageNet validation set.

Appendix D Learning curves

In this Appendix we present learning curves showing how the image generation and representation learning metrics that we measured evolve throughout training, as a more detailed view of the results in Section 3, Table 1. We include plots for the following results:

  • Image generation (Figure 13)

  • Latent distribution $P_z$ and stochastic $\mathcal{E}$ (Figure 14)

  • Unary loss terms (Figure 15)

  • $\mathcal{G}$ capacity (Figure 16)

  • High resolution $\mathcal{E}$ with varying $\mathcal{G}$ resolution (Figure 17)

  • $\mathcal{E}$ architecture (Figure 18)

  • Decoupled $\mathcal{E}$/$\mathcal{G}$ learning rates (Figure 19)

Figure 13: Image generation learning curves for several of the ablations in Section 3, including a comparison of BigBiGAN to standard GAN. Legend entries correspond to the following rows in Table 1: Base, No $\mathcal{E}$ (GAN), and High Res $\mathcal{G}$ (256).
Figure 14: Image generation and representation learning curves for the latent space variations explored in Section 3. Legend entries correspond to the following rows in Table 1: Base, Deterministic $\mathcal{E}$, and Uniform $P_z$.
Figure 15: Image generation and representation learning curves for the unary loss component variations explored in Section 3. Legend entries correspond to the following rows in Table 1: Base, $x$ Unary Only, $z$ Unary Only, and No Unaries (BiGAN).
Figure 16: Image generation and representation learning curves for the $\mathcal{G}$ size variations explored in Section 3. Legend entries correspond to the following rows in Table 1: Base, Small $\mathcal{G}$ (32), and Small $\mathcal{G}$ (64).
Figure 17: Image generation and representation learning curves for high resolution $\mathcal{E}$ with varying $\mathcal{G}$ resolution explored in Section 3. Legend entries correspond to the following rows in Table 1: High Res $\mathcal{E}$ (256), Low Res $\mathcal{G}$ (64), and High Res $\mathcal{G}$ (256).
Figure 18: Image generation and representation learning curves for the $\mathcal{E}$ architecture variations explored in Section 3. Legend entries correspond to the following rows in Table 1: High Res $\mathcal{E}$ (256), ResNet-101, ResNet $\times 2$, RevNet, RevNet $\times 2$, and RevNet $\times 4$.
Figure 19: Image generation and representation learning curves showing the effect of decoupling the $\mathcal{E}$ and $\mathcal{G}$ optimizers to train $\mathcal{E}$ with a higher learning rate. Legend entries correspond to the following rows in Table 1: High Res $\mathcal{E}$ (256), ResNet ($\uparrow\mathcal{E}$ LR), RevNet $\times 4$, and RevNet $\times 4$ ($\uparrow\mathcal{E}$ LR).