PatchVAE: Learning Local Latent Codes for Recognition

04/07/2020 ∙ by Kamal Gupta, et al. ∙ Google ∙ University of Maryland

Unsupervised representation learning holds the promise of exploiting large amounts of unlabeled data to learn general representations. A promising technique for unsupervised learning is the framework of Variational Auto-encoders (VAEs). However, unsupervised representations learned by VAEs are significantly outperformed by those learned by supervised learning for recognition. Our hypothesis is that to learn useful representations for recognition the model needs to be encouraged to learn about repeating and consistent patterns in data. Drawing inspiration from the mid-level representation discovery work, we propose PatchVAE, that reasons about images at patch level. Our key contribution is a bottleneck formulation that encourages mid-level style representations in the VAE framework. Our experiments demonstrate that representations learned by our method perform much better on the recognition tasks compared to those learned by vanilla VAEs.




Code Repositories


PyTorch implementation of "PatchVAE: Learning Local Latent Codes for Recognition" to appear in CVPR 2020


1 Introduction

Due to the availability of large labeled visual datasets, supervised learning has become the dominant paradigm for visual recognition. That is, to learn about any new concept, the modus operandi is to collect thousands of labeled examples for that concept and train a powerful classifier, such as a deep neural network. This is necessary because the current generation of models based on deep neural networks requires large amounts of labeled data [33]. This is in stark contrast to the insights that we have from developmental psychology on how infants develop perception and cognition without any explicit supervision [31]. Moreover, the supervised learning paradigm is ill-suited for applications, such as health care and robotics, where annotated data is hard to obtain either due to privacy concerns or the high cost of expert human annotators. In such cases, learning from very few labeled images or discovering underlying natural patterns in large amounts of unlabeled data can have a large number of potential applications. Discovering such patterns from unlabeled data is the standard setup of unsupervised learning.

Figure 1: PatchVAE learns to encode repetitive parts across a dataset, by modeling their appearance and occurrence. (top) Given an image, the occurrence map of a particular part learned by PatchVAE is shown in the middle, capturing the head/beak of the birds. Samples of the same part from other images are shown on the right, indicating consistent appearance. (bottom) More examples of parts discovered by our PatchVAE framework.

Over the past few years, the field of unsupervised learning in computer vision has followed two seemingly different tracks with different goals: generative modeling and self-supervised learning. The goal of generative modeling is to learn the probability distribution from which data was generated, given some training data. Such models, learned using reconstruction-based losses, can draw samples from the same distribution or evaluate the likelihoods of new data, and are useful for learning compact representation of images. However, we argue that these representations are not as useful for visual recognition. This is not surprising since the task of reconstructing images does not require the bottleneck representation to sort out meaningful data useful for recognition and discard the rest; on the contrary, it encourages preserving as much information as possible for reconstruction.

In comparison, the goal in self-supervised learning is to learn representations that are useful for recognition. The standard paradigm is to establish proxy tasks that do not require human supervision but can provide signals useful for recognition. Due to the mismatch between the goals of unsupervised learning for visual recognition and the representations learned from generative modeling, self-supervised learning is a more popular way of learning representations from unlabeled data. However, a fundamental limitation of this self-supervised paradigm is that we need to define a proxy task that can mimic the desired recognition task. It is not always possible to establish such a task, nor are these tasks generalizable across recognition tasks.

In this paper, our goal is to enable the unsupervised generative modeling approach of VAEs to learn representations useful for recognition. Our key hypothesis is that for a representation to be useful, it should capture just the interesting parts of the images, as opposed to everything in the images.

What constitutes an interesting image part has been defined and studied in earlier works that pre-date the end-to-end trained deep network methods [30, 7, 14]. Taking inspiration from these works, we propose a novel representation that encodes only a few such parts of an image that are repetitive across the dataset, i.e., the patches that occur often in images. By avoiding reconstruction of the entire image, our method can focus on regions that are repeating and consistent across many images. In an encoder-decoder based generative model, we constrain the encoder architecture to learn such repetitive parts – both in terms of representations for the appearance of these parts (or patches in an image) and where these parts occur. We formulate this using variational auto-encoders (β-VAEs) [19, 23], where we impose novel structure on the latent representations. We use discrete latents to model part presence or absence and continuous latents to model their appearance. Figure 1 shows an example of the discrete latents or occurrence map, and example parts discovered by our approach, PatchVAE. We present PatchVAE in Section 3 and demonstrate that it learns representations that are much better for recognition as compared to those learned by the standard β-VAEs [19, 23].

In addition, we present losses that favor foreground, which is more likely to contain repetitive patterns, in Section 3.4, and demonstrate that they result in representations that are much better at recognition. Finally, in Section 4, we present results on CIFAR100 [20], MIT Indoor Scene Recognition [27], Places [37], and ImageNet [4] datasets. To summarize, our contributions are as follows:

  1. We propose a novel patch-based bottleneck in the VAE framework that learns representations that can encode repetitive parts across images.

  2. We demonstrate that our method, PatchVAE, learns unsupervised representations that are better suited for recognition in comparison to traditional VAEs.

  3. We show that losses that favor foreground are better for unsupervised representation learning for recognition.

  4. We perform extensive ablation analysis of the proposed PatchVAE architecture.

2 Related Work

Due to its potential impact, unsupervised learning (particularly for deep networks) has been one of the most researched topics in visual recognition over the past few years. Generative models such as VAEs [19, 23, 18, 11], PixelRNN [34], PixelCNN [12, 29], and their variants have proven effective at learning compressed representations of images while being able to faithfully reconstruct them as well as draw samples from the data distribution. GANs [10, 28, 38, 3], on the other hand, do not model the probability density explicitly but can still produce high quality image samples from noise. There has been work combining VAEs and GANs to simultaneously learn the image data distribution while being able to generate high quality samples from it [15, 8, 21]. Convolutional sparse coding [1] is an alternative approach for reconstruction or image in-painting problems. Our work complements existing generative frameworks in that we provide a structured approach for VAEs that can learn beyond low-level representations. We show the effectiveness of the representations learned by our model by using them for visual recognition tasks.

There has been a lot of work in interpreting or disentangling representations learned using generative models such as VAEs [23, 9, 16]. However, there is little evidence of effectiveness of disentangled representations in visual recognition. Semi-supervised learning using generative models [17, 32], where partial or noisy labels are available to the model during training, has shown lots of promise in applications of generating conditioned samples from the model. In our work however, we focus on incorporating inductive biases in these generative models (e.g., VAEs) so they can learn representations better suited for visual recognition.

A related, but orthogonal, line of work is self-supervised learning where a proxy task is designed to learn representation useful for recognition. These proxy tasks vary from simple tasks like arranging patches in an image in the correct spatial order [5, 6] and arranging frames from a video in correct temporal order [35, 25], to more involved tasks like in-painting [26] and context prediction [24, 36]. We follow the best practices from this line of work for evaluating the learned representations.

Figure 2: (a) VAE Architecture: In a standard VAE architecture, the output of the encoder network is used to parameterize the variational posterior for $z$. Samples from this posterior are input to the decoder network. (b) Proposed PatchVAE Architecture: Our encoder network computes a set of feature maps $f$ using $\phi(x)$. This is followed by two independent single layer networks. The bottom network generates part occurrence parameters $Q^O$. We combine $Q^O$ with the output of the top network to generate part appearance parameters $Q^A$. We sample $\hat{z}_{\text{occ}}$ and $\hat{z}_{\text{app}}$ to construct $\hat{z}$ as described in Section 3.2, which is input to the decoder network. We also visualize the corresponding priors for latents $\hat{z}_{\text{app}}$ and $\hat{z}_{\text{occ}}$ in the dashed gray boxes.

3 Our Approach

Our work builds upon the VAE framework proposed by [19]. We briefly review relevant aspects of the VAE framework and then present our approach.

3.1 VAE Review

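The standard VAE framework assumes a generative model for data $x$ where first a latent $z$ is sampled from a prior $p(z)$ and then the data is generated from a conditional distribution $G(x \mid z)$. A variational approximation $Q(z \mid x)$ to the true intractable posterior is introduced and the model is learned by minimizing the following negative variational lower bound (ELBO),

$$\mathcal{L}_{\mathrm{VAE}}(x) = -\mathbb{E}_{z \sim Q(z \mid x)}\left[\log G(x \mid z)\right] + \mathrm{KL}\left[Q(z \mid x) \,\|\, p(z)\right] \qquad (1)$$

where $Q(z \mid x)$ is often referred to as an encoder as it can be viewed as mapping data to the latent space, while $G(x \mid z)$ is referred to as a decoder (or generator) that can be viewed as mapping latents to the data space. Both $Q$ and $G$ are commonly parameterized as neural networks. Fig. 2(a) shows the commonly used VAE architecture. If the conditional $G(x \mid z)$ takes a Gaussian form, the negative log likelihood in the first term of the RHS of Eq. 1 becomes the mean squared error between the generator output $\hat{x}$ and the input data $x$. In the second term, the prior $p(z)$ is assumed to be a multivariate normal distribution with zero mean and identity covariance, $\mathcal{N}(0, I)$, and the loss simplifies to

$$\mathcal{L}_{\mathrm{VAE}}(x) = \lVert x - \hat{x} \rVert^2 + \mathrm{KL}\left[Q(z \mid x) \,\|\, \mathcal{N}(0, I)\right] \qquad (2)$$

When $G$ and $Q$ are differentiable, the entire model can be trained with SGD using the reparameterization trick [19]. [23] propose an extension for learning disentangled representations by incorporating a weight factor $\beta$ for the KL divergence term, yielding

$$\mathcal{L}_{\beta\mathrm{VAE}}(x) = -\mathbb{E}_{z \sim Q(z \mid x)}\left[\log G(x \mid z)\right] + \beta \, \mathrm{KL}\left[Q(z \mid x) \,\|\, p(z)\right] \qquad (3)$$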
The VAE framework aims to learn a generative model for the images where the latents represent the corresponding low dimensional generating factors. The latents can therefore be treated as image representations that capture the necessary details about images. However, we postulate that representations produced by the standard VAE framework are not ideal for recognition as they are learned to capture all details, rather than capturing ‘interesting’ aspects of the data and dropping the rest. This is not surprising since their formulation does not encourage learning semantic information. For learning semantic representations, in the absence of any relevant supervision (as is available in self-supervised approaches), inductive biases have to be introduced. Therefore, taking inspiration from works on unsupervised mid-level pattern discovery [30, 7, 14], we propose a formulation that encourages the encoder to encode only a few such parts of an image that are repetitive across the dataset, i.e., the patches that occur often in images.

Since the VAE framework provides a principled way of learning a mapping from image to latent space, we consider it ideal for our proposed extension. We chose β-VAEs for their simplicity and widespread use. In Section 3.2, we describe our approach in detail and in Section 3.4 propose a modification in the reconstruction error computation to bias the error term towards foreground high-energy regions (similar to the biased initial sampling of patterns in [30]).
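As a concrete illustration of the objective reviewed in this section, the following is a minimal sketch of a β-weighted VAE loss for a diagonal-Gaussian posterior and a standard normal prior, using the well-known closed-form KL divergence. The function names and the use of plain Python lists are assumptions made for illustration; they are not taken from the authors' PyTorch implementation.

```python
import math

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def reparameterize(mu, log_var, eps):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I) supplied by the caller."""
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]

def beta_vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Squared-error reconstruction term plus a beta-weighted KL term."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + beta * gaussian_kl_to_standard_normal(mu, log_var)
```

Setting `beta=1.0` gives the standard VAE loss; larger values put more weight on matching the prior.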

3.2 PatchVAE

Given an image $x$, let $f = \phi(x)$ be a deterministic mapping that produces a 3D representation $f$ of size $h \times w \times d_e$, with a total of $L = h \times w$ locations (grid-cells). We aim to encourage the encoder network to only encode parts of an image that correspond to highly repetitive patches. For example, a random patch of noise is unlikely to occur frequently, whereas patterns like faces, wheels, windows, etc. repeat across multiple images. In order to capture this intuition, we force the representation $f$ to be useful for predicting frequently occurring parts in an image, and use just these predicted parts to reconstruct the image. We achieve this by transforming $f$ to $\hat{z}$, which encodes a set of parts at a small subset of the $L$ locations on the grid cells. We refer to $\hat{z}$ as “patch latent codes” for an image. Next we describe how we re-tool the β-VAE framework to learn these local latent codes. We first describe our setup for a single part and follow it up with a generalization to multiple parts (Section 3.3).

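Image Encoding. Given the image representation $f = \phi(x)$, we want to learn part representations at each grid location $l$ (where $l \in \{1, \dots, L\}$). A part is parameterized by its appearance $z_{\text{app}}$ and its occurrence $z^l_{\text{occ}}$ (i.e., presence or absence of the part at grid location $l$). We use two networks, $Q^A_f$ and $Q^O_f$, to parameterize the posterior distributions $Q^A_f(z_{\text{app}} \mid f)$ and $Q^O_f(z^l_{\text{occ}} \mid f)$ of the part parameters $z_{\text{app}}$ and $z^l_{\text{occ}}$ respectively. Since the mapping $f = \phi(x)$ is deterministic, we can re-write these distributions as $Q^A_f(z_{\text{app}} \mid \phi(x))$ and $Q^O_f(z^l_{\text{occ}} \mid \phi(x))$; or simply $Q^A(z_{\text{app}} \mid x)$ and $Q^O(z^l_{\text{occ}} \mid x)$. Therefore, given an image $x$, the encoder networks estimate the posteriors $Q^A(z_{\text{app}} \mid x)$ and $Q^O(z^l_{\text{occ}} \mid x)$. Note that $f$ is a deterministic feature map, whereas $z_{\text{app}}$ and $z^l_{\text{occ}}$ are stochastic.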

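Image Decoding. We utilize a generator or decoder network $G$ that, given $z_{\text{app}}$ and $z_{\text{occ}}$, reconstructs the image. First, we sample a part appearance $\hat{z}_{\text{app}}$ ($d_p$-dimensional, continuous) and then sample part occurrence $\hat{z}^l_{\text{occ}}$ ($L$-dimensional, binary), one for each location, from the posteriors

$$\hat{z}_{\text{app}} \sim Q^A(z_{\text{app}} \mid x), \qquad \hat{z}^l_{\text{occ}} \sim Q^O(z^l_{\text{occ}} \mid x) \quad \text{where } l \in \{1, \dots, L\} \qquad (4)$$

Next, we construct a 3D representation $\hat{z}$ by placing $\hat{z}_{\text{app}}$ at every location $l$ where the part is present (i.e., $\hat{z}^l_{\text{occ}} = 1$). This can be implemented by a broadcasted product of $\hat{z}_{\text{app}}$ and $\hat{z}^l_{\text{occ}}$. We refer to $\hat{z}$ as the patch latent code. Again note that $f$ is deterministic and $\hat{z}$ is stochastic. Finally, a deconvolutional network takes $\hat{z}$ as input and generates an image $\hat{x}$. This image generation process can be written as

$$\hat{x} \sim G\left(x \mid \hat{z}_{\text{app}}, \hat{z}^1_{\text{occ}}, \dots, \hat{z}^L_{\text{occ}}\right) \qquad (5)$$

Since all latent variables ($z^l_{\text{occ}}$ for all $l$ and $z_{\text{app}}$) are independent of each other, they can be stacked as

$$z_P = \left[z_{\text{app}};\, z^1_{\text{occ}};\, \dots;\, z^L_{\text{occ}}\right] \qquad (6)$$

This enables us to use a simplified notation (refer to (4) and (5)):

$$\hat{z}_P \sim Q^{\{A,O\}}(z_P \mid x), \qquad \hat{x} \sim G(x \mid \hat{z}_P) \qquad (7)$$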
Note that despite the additional structure, our model still resembles the setup of variational auto-encoders. The primary differences arise from: (1) the use of discrete latents for part occurrence, (2) the patch-based bottleneck imposing additional structure on latents, and (3) feature assembly for the generator.
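The feature assembly step described above (placing the sampled appearance vector at every grid cell where the part occurs) can be sketched as a broadcasted product. The sketch below uses nested Python lists so that it is self-contained; the function name `build_patch_latent` is hypothetical, and a real implementation would more likely use tensor broadcasting.

```python
def build_patch_latent(z_app, z_occ):
    """Place the appearance vector at every grid cell where the part occurs.

    z_app: list of d_p floats (one appearance vector for the part).
    z_occ: h x w nested list of 0/1 occurrence samples.
    Returns an h x w x d_p nested list (the broadcasted product).
    """
    return [[[occ * a for a in z_app] for occ in row] for row in z_occ]

# Example: a 2 x 2 grid where the part is present on the diagonal.
z_hat = build_patch_latent([0.5, -1.0], [[1, 0], [0, 1]])
```

Cells where the occurrence sample is 0 receive an all-zero vector, so the decoder only sees the part's appearance at the locations where it is predicted to be present.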

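Training. We use the training setup of β-VAE and train the encoder and decoder jointly by maximizing the variational lower bound (described in Section 3.1). The posterior $Q^A$, which captures the appearance of a part, is assumed to be a Normal distribution, with a zero-mean, identity-covariance prior $\mathcal{N}(0, I)$. The posterior $Q^O$, which captures the presence or absence of a part, is assumed to be a Bernoulli distribution, with prior $\mathrm{Bern}(z^{\text{prior}}_{\text{occ}})$ parameterized by the prior occurrence probability $z^{\text{prior}}_{\text{occ}}$. Therefore, the loss for our approach can be written as (refer to (3)):

$$\mathcal{L}_{\mathrm{PatchVAE}}(x) = -\mathbb{E}_{z_P \sim Q^{\{A,O\}}(z_P \mid x)}\left[\log G(x \mid z_P)\right] + \beta \, \mathrm{KL}\left[Q^{\{A,O\}}(z_P \mid x) \,\|\, p(z_P)\right] \qquad (8)$$

where the $\mathrm{KL}$ term can be expanded as:

$$\mathrm{KL}\left[Q^{\{A,O\}}(z_P \mid x) \,\|\, p(z_P)\right] = \sum_{l=1}^{L} \mathrm{KL}\left[Q^O(z^l_{\text{occ}} \mid x) \,\|\, \mathrm{Bern}(z^{\text{prior}}_{\text{occ}})\right] + \mathrm{KL}\left[Q^A(z_{\text{app}} \mid x) \,\|\, \mathcal{N}(0, I)\right] \qquad (9)$$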
Implementation details. As discussed in Section 3.1, the first and second terms of the RHS of (8) can be trained using an L2 reconstruction loss and the reparameterization trick [19]. In addition, we also need to compute the KL divergence loss for part occurrence. Learning a discrete probability distribution is a challenging task since there is no gradient defined to backpropagate the reconstruction loss through the stochastic layer at the decoder, even when using the reparameterization trick. Therefore, we use the relaxed-Bernoulli approximation [22, 2] for training the part occurrence distributions $Q^O$.
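The two ingredients mentioned above for the occurrence latents can be sketched as follows: the closed-form KL divergence between two Bernoulli distributions, and a relaxed-Bernoulli (binary Concrete) sample obtained by adding logistic noise to the logit and applying a tempered sigmoid. This is a sketch of the standard relaxation under the assumption that probabilities lie strictly inside (0, 1); the function names are illustrative and the exact variant used by the authors is not specified here.

```python
import math

def bernoulli_kl(q, p):
    """KL( Bern(q) || Bern(p) ) for probabilities strictly inside (0, 1)."""
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))

def relaxed_bernoulli_sample(q, temperature, u):
    """Binary Concrete sample: sigmoid((logit(q) + logit(u)) / temperature).

    u is uniform noise in (0, 1), passed in so the function stays deterministic.
    The output is a differentiable value in (0, 1) instead of a hard 0/1 draw.
    """
    logit_q = math.log(q) - math.log(1.0 - q)
    logistic_noise = math.log(u) - math.log(1.0 - u)
    return 1.0 / (1.0 + math.exp(-(logit_q + logistic_noise) / temperature))
```

As the temperature approaches zero the samples concentrate near 0 or 1, which is why the temperature is typically annealed during training.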

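For an image $x$, the network $\phi(x)$ first generates feature maps of size $h \times w \times d_e$, where $(h, w)$ are the spatial dimensions and $d_e$ is the number of channels. Therefore, the number of locations is $L = h \times w$. Encoders $Q^A_f$ and $Q^O_f$ are single layer neural networks that compute $\hat{z}_{\text{app}}$ and $\hat{z}^l_{\text{occ}}$. $\hat{z}^l_{\text{occ}}$ is an $(h \times w \times 1)$-dimensional multivariate Bernoulli parameter and $\hat{z}_{\text{app}}$ is a $(1 \times 1 \times d_p)$-dimensional multivariate Gaussian, where $d_p$ is the length of the latent vector for a single part. The input to the decoder, $\hat{z}$, is $(h \times w \times d_p)$-dimensional. In all experiments, we fix $h$ and $w$.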

Constructing $\hat{z}_{\text{app}}$. Notice that $f$ is an $(h \times w \times d_e)$-dimensional feature map and $\hat{z}^l_{\text{occ}}$ is an $(h \times w \times 1)$-dimensional binary output, but $\hat{z}_{\text{app}}$ is a $(1 \times 1 \times d_p)$-dimensional feature vector. If $\sum_l \hat{z}^l_{\text{occ}} > 1$, the part occurs at multiple locations in an image. Since all these locations correspond to the same part, their appearance should be the same. To incorporate this, we take the weighted average of the part appearance feature at each location, weighted by the probability that the part is present. Since we use the probability values for averaging, the result is deterministic. This operation is encapsulated by the $Q^A$ encoder (refer to Figure 2(b)). During image generation, we sample $\hat{z}_{\text{app}}$ once and replicate it at each location where $\hat{z}^l_{\text{occ}} = 1$. During training, this forces the model to: (1) only predict $\hat{z}^l_{\text{occ}} = 1$ where similar looking parts occur, and (2) learn a common representation for the part that occurs at these locations. Note that $z_{\text{app}}$ can be modeled as a mixture of distributions (e.g., a mixture of Gaussians) to capture complicated appearances. However, in this work we assume that the convolutional neural network based encoders are powerful enough to map the variable appearance of semantic concepts to similar feature representations. Therefore, we restrict ourselves to a single Gaussian distribution.
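The probability-weighted pooling of per-location appearance features described above can be sketched as below. Normalizing by the sum of the occurrence probabilities is an assumption made here so that the result is a true weighted average; the function name `pooled_appearance` is hypothetical.

```python
def pooled_appearance(app_features, occ_probs):
    """Average per-location appearance vectors, weighted by occurrence probability.

    app_features: list of L vectors (each of length d_p), one per grid location.
    occ_probs: list of L probabilities that the part is present at each location.
    Assumes at least one probability is non-zero.
    """
    total = sum(occ_probs)
    d_p = len(app_features[0])
    return [
        sum(p * feat[k] for p, feat in zip(occ_probs, app_features)) / total
        for k in range(d_p)
    ]
```

Because the weights are probabilities rather than samples, the pooled vector is a deterministic function of the encoder outputs, which matches the remark that the averaging result is deterministic.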

3.3 PatchVAE with multiple parts

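Next we extend the framework described above to use multiple parts. To use $N$ parts, we use $N \times 2$ encoder networks $Q^{A(i)}$ and $Q^{O(i)}$, where $Q^{A(i)}$ and $Q^{O(i)}$ parameterize the $i^{\text{th}}$ part. Again, this can be implemented efficiently as 2 networks by concatenating the outputs together. The image generator samples $\hat{z}^{(i)}_{\text{app}}$ and $\hat{z}^{l(i)}_{\text{occ}}$ from the outputs of these encoder networks and constructs $\hat{z}^{(i)}$. We obtain the final patch latent code $\hat{z}$ by concatenating all $\hat{z}^{(i)}$ in the channel dimension. Therefore, $\hat{z}^{(i)}$ is an $(h \times w \times d_p)$-dimensional and $\hat{z}$ is an $(h \times w \times (N \times d_p))$-dimensional stochastic feature map. For this multiple part case, (6) can be written as:

$$z_P = \left[z^{(1)}_{\text{app}};\, z^{1(1)}_{\text{occ}};\, \dots;\, z^{L(1)}_{\text{occ}};\, \dots;\, z^{(N)}_{\text{app}};\, z^{1(N)}_{\text{occ}};\, \dots;\, z^{L(N)}_{\text{occ}}\right] \qquad (10)$$

Similarly, (8) and (9) can be written as:

$$\mathcal{L}_{\mathrm{PatchVAE}}(x) = -\mathbb{E}_{z_P \sim Q^{\{A,O\}}(z_P \mid x)}\left[\log G(x \mid z_P)\right] + \beta \, \mathrm{KL}\left[Q^{\{A,O\}}(z_P \mid x) \,\|\, p(z_P)\right] \qquad (11)$$

$$\mathrm{KL}\left[Q^{\{A,O\}}(z_P \mid x) \,\|\, p(z_P)\right] = \sum_{i=1}^{N} \sum_{l=1}^{L} \mathrm{KL}\left[Q^{O(i)}(z^{l(i)}_{\text{occ}} \mid x) \,\|\, \mathrm{Bern}(z^{\text{prior}}_{\text{occ}})\right] + \sum_{i=1}^{N} \mathrm{KL}\left[Q^{A(i)}(z^{(i)}_{\text{app}} \mid x) \,\|\, \mathcal{N}(0, I)\right] \qquad (12)$$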
The training details and assumptions of posteriors follow the previous section.
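The channel-wise concatenation of per-part codes described in this section can be sketched as follows. Each per-part code is assumed to be an h x w x d_p nested list (for example, the output of a per-part broadcasted product); the name `concat_parts` is hypothetical.

```python
def concat_parts(part_codes):
    """Concatenate N per-part codes (each h x w x d_p) along the channel axis.

    Returns an h x w x (N * d_p) nested list.
    """
    h, w = len(part_codes[0]), len(part_codes[0][0])
    return [
        [[c for code in part_codes for c in code[i][j]] for j in range(w)]
        for i in range(h)
    ]

# Two parts on a 1 x 2 grid with d_p = 2: part 1 fires at the first cell,
# part 2 at the second cell.
z_hat = concat_parts([[[[1, 2], [0, 0]]], [[[0, 0], [3, 4]]]])
```

Each grid cell of the result holds one d_p-sized slot per part, so a cell is non-zero only in the slots of the parts that occur there.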

3.4 Improved Reconstruction Loss

The L2 reconstruction loss used for training β-VAEs (and other reconstruction based approaches) gives equal importance to each region of an image. This might be reasonable for tasks like image compression and image de-noising. However, for the purposes of learning semantic representations, not all regions are equally important. For example, “sky” and “walls” occupy large portions of an image, whereas concepts like “windows,” “wheels,” and “faces” are comparatively smaller, but arguably more important. To incorporate this intuition, we use a simple strategy to weigh the regions in an image in proportion to the gradient energy in the region. More concretely, we compute the Laplacian of an image to get the per-pixel intensity of gradients and average the gradient magnitudes in local patches. The weight multiplier for the reconstruction loss of each patch in the image is proportional to the average magnitude of the patch. All weights are normalized to sum to one. We refer to this as the weighted loss ($\mathcal{L}_w$). Note that this is similar to the gradient-energy biased sampling of mid-level patches used in [30, 7]. Examples of weight masks are provided in the supplemental material.
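The weighting scheme described above can be sketched on a single-channel image as follows. The 4-neighbour Laplacian stencil, the replicated borders, non-overlapping square patches, and the uniform fallback for a perfectly flat image are all assumptions made for this illustration; the function names are hypothetical.

```python
def laplacian_magnitude(img):
    """Absolute 4-neighbour Laplacian of a 2D grayscale image (borders replicated)."""
    h, w = len(img), len(img[0])

    def px(i, j):
        return img[min(max(i, 0), h - 1)][min(max(j, 0), w - 1)]

    return [
        [abs(px(i - 1, j) + px(i + 1, j) + px(i, j - 1) + px(i, j + 1) - 4 * px(i, j))
         for j in range(w)]
        for i in range(h)
    ]

def patch_weights(img, patch):
    """Per-patch weights proportional to mean Laplacian magnitude, summing to one."""
    mag = laplacian_magnitude(img)
    h, w = len(img), len(img[0])
    means = [
        [sum(mag[i][j]
             for i in range(r, min(r + patch, h))
             for j in range(c, min(c + patch, w)))
         / ((min(r + patch, h) - r) * (min(c + patch, w) - c))
         for c in range(0, w, patch)]
        for r in range(0, h, patch)
    ]
    total = sum(sum(row) for row in means)
    if total == 0:  # flat image: fall back to uniform weights
        n = len(means) * len(means[0])
        return [[1.0 / n for _ in row] for row in means]
    return [[m / total for m in row] for row in means]

def weighted_l2(x, x_hat, weights, patch):
    """Squared error per pixel, scaled by the weight of the patch containing it."""
    return sum(
        weights[i // patch][j // patch] * (x[i][j] - x_hat[i][j]) ** 2
        for i in range(len(x)) for j in range(len(x[0]))
    )
```

With this weighting, errors in flat regions such as sky or walls contribute little to the loss, while errors in high-gradient patches dominate it.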

In addition, we also consider an adversarial training strategy from GANs to train VAEs [21], where the discriminator network from the GAN implicitly learns to compare images and gives a more abstract reconstruction error for the VAE. We refer to this variant by using the ‘GAN’ suffix in experiments. In Section 4.2, we demonstrate that the proposed weighted loss ($\mathcal{L}_w$) is complementary to the discriminator loss from adversarial training, and these losses result in better recognition capabilities for both β-VAE and PatchVAE.

4 Experiments

Datasets. We evaluate PatchVAE on CIFAR100 [20], MIT Indoor Scene Recognition [27], Places [37] and ImageNet [4] datasets. CIFAR100 consists of 60k color images from 100 classes, with 600 images per class. There are 50000 training images and 10000 test images. The Indoor dataset contains 67 categories and a total of 15620 images. Train and test subsets consist of 80 and 20 images per class respectively. The Places dataset has 2.5 million images in 205 categories. The ImageNet dataset has over a million images from 1000 categories.

                                CIFAR100                         Indoor67                         Places205
Model                           Conv1      Conv[1-3]  Conv[1-5]  Conv1      Conv[1-3]  Conv[1-5]  Conv1      Conv[1-3]  Conv[1-5]
β-VAE                           44.12      39.65      28.57      20.08      17.76      13.06      28.29      24.34      8.89
β-VAE + $\mathcal{L}_w$         44.96      40.30      28.33      21.34      19.48      13.96      29.43      24.93      9.41
β-VAE-GAN                       44.69      40.13      29.89      19.10      17.84      13.06      28.48      24.51      9.72
β-VAE-GAN + $\mathcal{L}_w$     45.61      41.35      31.53      20.45      18.36      14.33      29.63      25.26      10.66
PatchVAE                        43.07      38.58      28.72      20.97      19.18      13.43      28.63      24.95      11.09
PatchVAE + $\mathcal{L}_w$      43.75      40.37      30.55      23.21      21.87      15.45      29.39      26.29      12.07
PatchVAE-GAN                    44.45      40.57      31.74      21.12      19.63      14.55      28.87      25.25      12.21
PatchVAE-GAN + $\mathcal{L}_w$  45.39      41.74      32.65      22.46      21.87      16.42      29.36      26.30      13.39
BiGAN                           47.72      41.89      31.58      21.64      17.09      9.70       30.06      25.11      10.82
Imagenet Pretrained             55.99      54.99      54.36      45.90      45.82      40.90      37.08      36.46      31.26
Table 1: Classification results on CIFAR100, Indoor67, and Places205. We initialize the classification model with the representations learned from the unsupervised learning task. The model comprises a conv layer followed by two residual blocks (each having 2 conv layers). The first column for each dataset (‘Conv1’) corresponds to Top-1 classification accuracy with the pre-trained model with the first conv layer frozen; the second and third columns correspond to results with the first three and first five conv layers frozen respectively. Details in Section 4.1.

Model                           Top-1 Acc.  Top-5 Acc.
β-VAE                           44.45       69.67
PatchVAE                        47.01       71.71
β-VAE + $\mathcal{L}_w$         47.28       71.78
PatchVAE + $\mathcal{L}_w$      47.87       72.49
Imagenet Supervised             61.37       83.79
Table 2: ImageNet classification results using ResNet18. We initialize weights using the unsupervised task and fine-tune the last two residual blocks. Details in Section 4.1.

Learning paradigm. In order to evaluate the utility of PatchVAE features for recognition, we set up the learning paradigm as follows: we first train the model in an unsupervised manner on all training images. After that, we discard the generator network and use only part of the encoder network to train a supervised model on the classification task of the respective dataset. We study different training strategies for the classification stage as discussed later.

Training details. In all experiments, we use the following architectures. For CIFAR100, Indoor67, and Places205, $\phi(x)$ has a conv layer followed by two residual blocks [13]. For ImageNet, $\phi(x)$ is a ResNet18 model (a conv layer followed by four residual blocks). For all datasets, $Q^A$ and $Q^O$ have a single conv layer each. For classification, we start from $\phi(x)$, and add a fully-connected layer with 512 hidden units and a final fully-connected layer as classifier. More details can be found in the supplemental material.

During the unsupervised learning phase of training, all methods are trained for 90 epochs for CIFAR100 and Indoor67, 2 epochs for Places205, and 30 epochs for ImageNet dataset. All methods use ADAM optimizer for training, with initial learning rate of

and a minibatch size of 128. For relaxed bernoulli in , we start with the temperature of 1.0 with an annealing rate of (following the details in [2]

). For training the classifier, all methods use stochastic gradient descent (SGD) with momentum with a minibatch size of 128. Initial learning rate is

and we reduce it by a factor of 10 every 30 epochs. All experiments are trained for 90 epochs for CIFAR100 and Indoor67, 5 epochs for Places205, and 30 epochs for ImageNet datasets.
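The step schedule used for classifier training (dividing the learning rate by 10 every 30 epochs) can be sketched as a small helper. The `base_lr` argument is a placeholder supplied by the caller, not the value used in the experiments, and the function name is hypothetical.

```python
def step_lr(base_lr, epoch, step_size=30, factor=10.0):
    """Learning rate after dividing base_lr by `factor` once every `step_size` epochs."""
    return base_lr / (factor ** (epoch // step_size))
```

For a 90-epoch run this yields three plateaus: epochs 0 to 29 at `base_lr`, epochs 30 to 59 at one tenth of it, and epochs 60 to 89 at one hundredth.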

Figure 3: Encoded part occurrence maps discovered on CIFAR100 and ImageNet. Each row represents a different part.

Baselines. We use the β-VAE model (Section 3.1) as our primary baseline. In addition, we use the weighted loss and the discriminator loss, resulting in the β-VAE-* family of baselines. We also compare against a BiGAN model from [8]. We use similar backbone architectures for the encoder/decoder (and discriminator if present) across all methods, and tried to keep the number of parameters in different approaches comparable to the best of our ability. Exact architecture details can be found in the supplemental material.

Figure 4: A few representative examples for several parts to qualitatively demonstrate the visual concepts captured by PatchVAE. For each part, we crop image patches centered on the part location where it is predicted to be present. Selected patches are sorted by part occurrence probability as score. We manually select a diverse set from the top-50 occurrences from the training images. As can be seen, a single part may capture diverse set of concepts that are similar in shape or texture or occur in similar context, but belong to different categories. We show which categories the patches come from (note that category information was not used while training the model).

4.1 Downstream classification performance

In Table 1, we report the top-1 classification results on CIFAR100, Indoor67, and Places205 datasets for all methods with different training strategies for classification. First, we keep all the pre-trained weights in $\phi(x)$ from the unsupervised task frozen and only train the two newly added conv layers in the classification network (reported under column ‘Conv[1-5]’). We notice that our method (with different losses) generally outperforms the β-VAE counterpart (for example, 11.09 vs. 8.89 on Places205 with the L2 loss). This shows that the representations learned by the PatchVAE framework are better for recognition compared to β-VAEs. Moreover, better reconstruction losses (‘GAN’ and $\mathcal{L}_w$) generally improve both β-VAE and PatchVAE, and are complementary to each other.

Next, we fine-tune the last residual block along with the two conv layers (‘Conv[1-3]’ column). We observe that PatchVAE performs better than β-VAE under all settings except for CIFAR100 with just the L2 loss. However, when using better reconstruction losses, the performance of PatchVAE improves over β-VAE. Similarly, we fine-tune all but the first conv layer and report the results in the ‘Conv1’ column. Again, we notice similar trends, where our method generally performs better than β-VAE on the Indoor67 and Places205 datasets, but β-VAE performs better on CIFAR100 by a small margin. When compared to BiGAN, PatchVAE representations are better on all datasets (‘Conv[1-5]’) by a huge margin. However, when fine-tuning the pre-trained weights, BiGAN performs better on two out of four datasets. We also report results using pre-trained weights in $\phi(x)$ from the supervised ImageNet classification task (last row, Table 1) for completeness. The results indicate that PatchVAE learns better semantic representations compared to β-VAE.

ImageNet Results. Finally, we report results on the large-scale ImageNet benchmark in Table 2. For these experiments, we use the ResNet18 [13] architecture for all methods. All weights are first learned using the unsupervised tasks. Then, we fine-tune the last two residual blocks and train the two newly added conv layers in the classification network (therefore, the first conv layer and the following two residual blocks are frozen). We notice that the PatchVAE framework outperforms β-VAE under all settings, and the proposed weighted loss helps both approaches. Finally, the last row in Table 2 reports classification results of the same architecture randomly initialized and trained end-to-end on ImageNet using supervised training for comparison.

Figure 5: Swapping source and target part appearance. Column 1, 2 show a source image with the occurrence map of one of the parts. We can swap the appearance vector of this part with appearance vectors of a different part in target images. Column 3, 4 show three target images with occurrence maps of one of their parts. Observe the change in reconstructions (column 5, 6) as we bring in the new appearance vector. The new reconstruction inherits properties of the source at specific locations in the target.

$N$     CIFAR100  Indoor67
4       27.59     14.40
8       28.74     12.69
16      28.94     14.33
32      27.78     13.28
64      29.00     12.76
Table 3: Effect of $N$: Increasing the maximum number of patches increases the discriminative power for CIFAR100 but has little or negative effect for Indoor67

$d_p$   CIFAR100  Indoor67
3       28.63     14.25
6       28.97     14.55
9       28.21     14.55
Table 4: Effect of $d_p$: Increasing the number of hidden units for a patch has very little impact on classification performance

$z^{\text{prior}}_{\text{occ}}$  CIFAR100  Indoor67
0.01                             28.86     14.33
0.05                             28.67     14.25
0.1                              28.31     14.03
Table 5: Effect of $z^{\text{prior}}_{\text{occ}}$: Increasing the prior probability of patch occurrence has an adverse effect on classification performance

$\beta$  CIFAR100  Indoor67
0.06     30.11     14.10
0.3      30.37     15.67
0.6      28.90     13.51
Table 6: Effect of $\beta$: Too high or too low $\beta$ can deteriorate the performance of learned representations

β-VAE     4.857  108.741  0.289
PatchVAE  4.342  113.692  0.235
Table 7: Reconstruction metrics on ImageNet. PatchVAE sacrifices reconstruction quality to learn discriminative parts, resulting in higher recognition performance (Table 2)

4.2 Qualitative Results

We present qualitative results to validate our hypothesis. First, we visualize whether the structure we impose on the VAE bottleneck is able to capture occurrence and appearance of important parts of images. We visualize the PatchVAE trained on images from CIFAR100 and Imagenet datasets in the following ways.

Concepts captured. First, we visualize the part occurrences in Figure 3. We can see that the parts can capture round (fruit-like) shapes in the top row and faces in the second row regardless of the class of the image. Similarly for ImageNet, the occurrence map of a specific part in images of chickens focuses on the head and neck. Note that semantically these parts are more informative than just the texture or color that a β-VAE can capture. In Figure 4, we show parts captured by the ImageNet model by cropping a part of the image centered around the occurring part. We can see that parts are able to capture multiple concepts, similar in either shape, texture, or the context in which they occur.

Swapping appearances. Using PatchVAE, we can swap appearance of a part with the appearance vector of another part from a different image. In Figure 5, keeping the occurrence map same for a target image, we modify the appearance of a randomly chosen part and observe the change in reconstructed image. We notice that given the same source part, the decoder tries similar things across different target images. However, the reconstructions are worse since the decoder has never encountered this particular combination of part appearance before.
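The swapping operation described above can be sketched on the latent codes alone, before decoding. Here each image's latents are assumed to be stored as a list of `(z_app, z_occ)` pairs, one per part; this data layout and the name `swap_appearance` are assumptions made for illustration.

```python
def swap_appearance(source_parts, target_parts, source_index, target_index):
    """Copy the target's latents, replacing one part's appearance with a source part's.

    Each *_parts argument is a list of (z_app, z_occ) pairs, one per part.
    The target's occurrence map for the modified part is left untouched.
    """
    swapped = list(target_parts)
    source_app, _ = source_parts[source_index]
    _, target_occ = target_parts[target_index]
    swapped[target_index] = (source_app, target_occ)
    return swapped
```

Decoding the returned latents would then show the source appearance rendered at the locations given by the target's occurrence map.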

Discriminative vs. Generative strength. As per our design, PatchVAE compromises its generative capabilities to learn more discriminative features. To quantify this, we use the images reconstructed from β-VAE and PatchVAE models (trained on ImageNet) and compute three different metrics to measure the quality of reconstructions of test images. Table 7 shows that β-VAE is better at reconstruction.

4.3 Ablation Studies

We study the impact of various hyper-parameters used in our experiments. For the purpose of this evaluation, we follow a similar approach as in the ‘Conv[1-5]’ column of Table 1 and keep all hyperparameters from the previous section. We use CIFAR100 and Indoor67 datasets for ablation analysis.

Maximum number of patches. This is the maximum number of parts used in our framework. Depending on the dataset, a higher maximum can provide a wider pool of patches to pick from. However, it can also make the unsupervised learning task harder, since in a minibatch of images we might not get many repeated patches. Table 6 (left) shows the effect of this maximum on CIFAR100 and Indoor67 datasets. We observe that while increasing the number of patches improves the discriminative power in the case of CIFAR100, it has little or negative effect in the case of Indoor67. A possible reason for this decline in performance on Indoor67 is the smaller size of the dataset (i.e., fewer images to learn from).

Number of hidden units for a patch appearance. Next, we study the impact of the number of channels in the appearance feature for each patch. This parameter reflects the capacity of an individual patch’s latent representation. While this parameter impacts the reconstruction quality of images, we observed that it has little or no effect on the classification performance of the base features. Results are summarized in Table 6 (right) for both CIFAR100 and Indoor67 datasets.

Prior probability for patch occurrence. In all our experiments, the prior probability for a patch is fixed to the inverse of the maximum number of patches. The intuition is to encourage each location on the occurrence maps to fire for at most one patch. Increasing this patch occurrence prior will allow all patches to fire at the same location. While this would make the reconstruction task easier, it will become harder for individual patches to capture anything meaningful. Table 6 shows the deterioration of classification performance on increasing this prior.

Patch occurrence loss weight. The weight for the patch occurrence KL divergence has to be chosen carefully. If this weight is too low, more patches can fire at the same location, which harms the learning capability of patches; if it is too high, the decoder will not receive any patches to reconstruct from, and both reconstruction and classification will suffer. Table 6 summarizes the impact of varying this weight.
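To make the roles of the occurrence prior and the occurrence loss weight concrete, the sketch below computes a weighted KL penalty between per-location Bernoulli occurrence probabilities and a prior equal to the inverse of the number of parts. It is an illustration under simplifying assumptions: the exact occurrence term used by PatchVAE is defined in the method section and may differ (for example, by using a relaxed distribution), and the names `bernoulli_kl` and `occurrence_penalty` are hypothetical.

```python
import math

def bernoulli_kl(q, p, eps=1e-8):
    """KL(Bernoulli(q) || Bernoulli(p)) for probabilities q and p."""
    # Clamp away from 0 and 1 to keep the logarithms finite.
    q = min(max(q, eps), 1.0 - eps)
    p = min(max(p, eps), 1.0 - eps)
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))

def occurrence_penalty(occ_probs, num_parts, weight):
    """Weighted sum of per-location KL terms against a prior of 1 / num_parts."""
    prior = 1.0 / num_parts
    return weight * sum(bernoulli_kl(q, prior) for q in occ_probs)

# Usage: confident firings are penalized more when the prior is small.
penalty = occurrence_penalty([0.9, 0.1, 0.05], num_parts=4, weight=2.0)
```

With a small prior, a location that fires strongly pays a large penalty, so a larger weight pushes occurrence maps toward sparsity, while a very small weight leaves firing essentially unconstrained.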

5 Conclusion

We presented a patch-based bottleneck in the VAE framework that encourages learning useful representations for recognition. Our method, PatchVAE, constrains the encoder architecture to only learn patches that are repetitive and consistent in images as opposed to learning everything, and therefore results in representations that perform much better for recognition tasks compared to vanilla VAEs. We also demonstrate that losses that favor high-energy foreground regions of an image are better for unsupervised learning of representations for recognition.


Appendix A Training Details

The generator network has two deconv layers with batchnorm and a final deconv layer with tanh activation. When training with ‘GAN’ loss, the additional discriminator has four conv layers, two of which have batchnorm.

Appendix B Visualization of Weighted Loss

Figure 6 shows an illustration of the reconstruction loss proposed in Section 3.4. Notice that in the first column, the guitar has more weight than the rest of the image. Similarly, in the second, fourth, and sixth columns, the train, painting, and people, respectively, are weighted more heavily than the rest of the image, thus favoring the capture of foreground regions.
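A rough sketch of how a Laplacian-based weight mask can drive a weighted reconstruction loss is given below. It is not the formulation of Section 3.4: the 4-neighbour Laplacian, the normalization by the maximum magnitude, and the function names are all assumptions chosen for illustration.

```python
import numpy as np

def laplacian_weight_mask(gray, eps=1e-8):
    """Weights in [0, 1] that are larger where the image Laplacian is strong."""
    padded = np.pad(gray.astype(np.float64), 1, mode="edge")
    # 4-neighbour discrete Laplacian of the grayscale image.
    lap = (padded[:-2, 1:-1] + padded[2:, 1:-1]
           + padded[1:-1, :-2] + padded[1:-1, 2:]
           - 4.0 * padded[1:-1, 1:-1])
    magnitude = np.abs(lap)
    return magnitude / (magnitude.max() + eps)

def weighted_mse(target, reconstruction, weights, eps=1e-8):
    """Squared error averaged with per-pixel weights."""
    squared_error = (target - reconstruction) ** 2
    return float(np.sum(weights * squared_error) / (np.sum(weights) + eps))
```

Flat regions receive zero weight under this mask, so errors there do not contribute, while errors near edges and textured regions dominate the loss.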

Appendix C Model Architecture

In this section, we share the exact architectures used in various experiments. As discussed in Section 4, we evaluated our proposed model on CIFAR100, Indoor67, and Places205 datasets. We resize and center-crop the images so that CIFAR100 uses one fixed input size, while Indoor67 and Places205 share a second fixed input size. PatchVAE can treat images of various input sizes in exactly the same way, allowing us to keep the architecture the same for different datasets. In the case of VAE and BiGAN, however, we have to go through a fixed-size bottleneck layer, and hence the architectures need to be slightly different for different input image sizes. Wherever possible, we have tried to keep the number of parameters in different architectures comparable.

C.1 Architecture for unsupervised learning task

Tables 8 and 9 show the architectures for encoders used in different models. In the unsupervised learning task, the encoder comprises a fixed neural network backbone that, given an input image, generates feature maps. This backbone architecture is common to the different models discussed in the paper and consists of a single conv layer followed by 2 residual blocks. We refer to this as Resnet-9, and it is described as the Conv1-5 layers in Table 12. The rest of the encoder architecture varies depending on the model under consideration and is described in Tables 8 and 9.

Tables 10 and 11 show the architectures for decoders used in different models. We use a pyramid-like network for the decoder, where the feature map size is doubled in consecutive layers while the number of channels is halved. The final non-linearity used in each decoder is tanh.
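The pyramid pattern can be made concrete by listing the shape at each stage. In the sketch below, the starting size, starting channel count, and number of layers are illustrative values rather than entries from Tables 10 and 11, and `pyramid_schedule` is a hypothetical helper.

```python
def pyramid_schedule(start_size, start_channels, num_layers):
    """List (feature_map_size, channels) per stage of a pyramid-like decoder.

    Each layer doubles the spatial size and halves the channel count.
    """
    shapes = [(start_size, start_channels)]
    for _ in range(num_layers):
        size, channels = shapes[-1]
        shapes.append((size * 2, max(channels // 2, 1)))
    return shapes

# Usage with illustrative numbers: three upsampling layers from a 4x4 map.
stages = pyramid_schedule(start_size=4, start_channels=256, num_layers=3)
```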

C.2 Architecture for supervised learning task

As discussed in Section 4, during the supervised learning phase, we discard the rest of the encoder model and only keep the backbone for classifier training. So the architectures for all baselines are exactly the same. Table 12 shows the architecture of the classifier used in our experiments.

Figure 6: Masks used for the weighted reconstruction loss. The first row contains images randomly sampled from the MIT Indoor dataset. The second and third rows show the corresponding image Laplacians and final reconstruction weight masks, respectively. In the last row, we take the product of the first and third rows to highlight which parts of the image receive more attention during reconstruction.
Layer β-VAE BiGAN PatchVAE
Features Resnet-9 Resnet-9 Resnet-9
- -
Parameters 888,192 789,792 922,896
Table 8: Encoder architecture for unsupervised learning task on CIFAR100 - All ‘convolutional’ layers are represented as (). BN stands for batch normalization layer and ReLU for Rectified Linear Units.

Layer β-VAE BiGAN PatchVAE
Features Resnet-9 Resnet-9 Resnet-9
- -
Parameters 1,478,016 1,084,704 922,896
Table 9: Encoder architecture for unsupervised learning task on Indoor67 and Places205 - All ‘convolutional’ layers are represented as (). BN stands for batch normalization layer and ReLU for Rectified Linear Units. Note that PatchVAE and β-VAE architectures are slightly different to account for input sizes.
Parameters 774,144 774,144 683,904
Table 10: Decoder architecture for unsupervised learning task on CIFAR100 - All ‘deconvolutional’ layers are represented as (). BN stands for batch normalization layer and ReLU for Rectified Linear Units.
Parameters 1,069,056 1,069,056 683,904
Table 11: Decoder architecture for unsupervised learning task on Indoor67 and Places205 - All ‘deconvolutional’ layers are represented as (). BN stands for batch normalization layer and ReLU for Rectified Linear Units. Note that PatchVAE and β-VAE architectures are slightly different to account for input sizes.
Layer CIFAR100 () Indoor67 () Places205 ()
Parameters 1,783,460 4,912,259 4,983,053
Table 12: Architecture for supervised learning task - same for all baselines and our model. All convolutional layers are represented as (). BN stands for batch normalization layer and ReLU for Rectified Linear Units. All pooling operations are MaxPool and are represented by (). Like Resnet-18, downsampling happens by convolutional layers that have a stride of 2. In our model, downsampling happens during Conv1, Pool, and after Conv4-5.