PyTorch implementation of "PatchVAE: Learning Local Latent Codes for Recognition" to appear in CVPR 2020
Unsupervised representation learning holds the promise of exploiting large amounts of unlabeled data to learn general representations. A promising technique for unsupervised learning is the framework of Variational Auto-encoders (VAEs). However, unsupervised representations learned by VAEs are significantly outperformed by those learned by supervised learning for recognition. Our hypothesis is that to learn useful representations for recognition the model needs to be encouraged to learn about repeating and consistent patterns in data. Drawing inspiration from the mid-level representation discovery work, we propose PatchVAE, which reasons about images at the patch level. Our key contribution is a bottleneck formulation that encourages mid-level style representations in the VAE framework. Our experiments demonstrate that representations learned by our method perform much better on recognition tasks compared to those learned by vanilla VAEs.
Due to the availability of large labeled visual datasets, supervised learning has become the dominant paradigm for visual recognition. That is, to learn about any new concept, the modus operandi is to collect thousands of labeled examples for that concept and train a powerful classifier, such as a deep neural network. This is necessary because the current generation of models based on deep neural networks requires large amounts of labeled data. This is in stark contrast to the insights that we have from developmental psychology on how infants develop perception and cognition without any explicit supervision. Moreover, the supervised learning paradigm is ill-suited for applications, such as health care and robotics, where annotated data is hard to obtain either due to privacy concerns or the high cost of expert human annotators. In such cases, learning from very few labeled images or discovering underlying natural patterns in large amounts of unlabeled data can have a large number of potential applications. Discovering such patterns from unlabeled data is the standard setup of unsupervised learning.
Over the past few years, the field of unsupervised learning in computer vision has followed two seemingly different tracks with different goals: generative modeling and self-supervised learning. The goal of generative modeling is to learn the probability distribution from which data was generated, given some training data. Such models, learned using reconstruction-based losses, can draw samples from the same distribution or evaluate the likelihoods of new data, and are useful for learning compact representation of images. However, we argue that these representations are not as useful for visual recognition. This is not surprising since the task of reconstructing images does not require the bottleneck representation to sort out meaningful data useful for recognition and discard the rest; on the contrary, it encourages preserving as much information as possible for reconstruction.
In comparison, the goal in self-supervised learning is to learn representations that are useful for recognition. The standard paradigm is to establish proxy tasks that do not require human supervision but can provide signals useful for recognition. Due to the mismatch in goals of unsupervised learning for visual recognition and the representations learned from generative modeling, self-supervised learning is a more popular way of learning representations from unlabeled data. However, a fundamental limitation of this self-supervised paradigm is that we need to define a proxy task that can mimic the desired recognition task. It is not always possible to establish such a task, nor are these tasks generalizable across recognition tasks.
In this paper, our goal is to enable the unsupervised generative modeling approach of VAEs to learn representations useful for recognition. Our key hypothesis is that for a representation to be useful, it should capture just the interesting parts of the images, as opposed to everything in the images.
What constitutes an interesting image part has been defined and studied in earlier works that pre-date the end-to-end trained deep network methods [30, 7, 14]. Taking inspiration from these works, we propose a novel representation that encodes only a few such parts of an image that are repetitive across the dataset, i.e., the patches that occur often in images. By avoiding reconstruction of the entire image, our method can focus on regions that are repeating and consistent across many images. In an encoder-decoder based generative model, we constrain the encoder architecture to learn such repetitive parts, both in terms of representations for the appearance of these parts (or patches in an image) and where these parts occur. We formulate this using variational auto-encoders ($\beta$-VAEs) [19, 23], where we impose novel structure on the latent representations. We use discrete latents to model part presence or absence and continuous latents to model their appearance. Figure 1 shows an example of the discrete latents or occurrence map, and example parts discovered by our approach, PatchVAE. We present PatchVAE in Section 3 and demonstrate that it learns representations that are much better for recognition as compared to those learned by the standard $\beta$-VAEs [19, 23].
In addition, we present losses that favor foreground, which is more likely to contain repetitive patterns, in Section 3.4, and demonstrate that they result in representations that are much better at recognition. Finally, in Section 4, we present results on the CIFAR100, MIT Indoor Scene Recognition, Places, and ImageNet datasets. To summarize, our contributions are as follows:
We propose a novel patch-based bottleneck in the VAE framework that learns representations that can encode repetitive parts across images.
We demonstrate that our method, PatchVAE, learns unsupervised representations that are better suited for recognition in comparison to traditional VAEs.
We show that losses that favor foreground are better for unsupervised representation learning for recognition.
We perform extensive ablation analysis of the proposed PatchVAE architecture.
Due to its potential impact, unsupervised learning (particularly for deep networks) has been one of the most researched topics in visual recognition over the past few years. Generative models such as VAEs [19, 23, 18, 11], PixelRNN, PixelCNN [12, 29], and their variants have proven effective at learning compressed representations of images while being able to faithfully reconstruct them as well as draw samples from the data distribution. GANs [10, 28, 38, 3], on the other hand, do not model the probability density explicitly but can still produce high-quality image samples from noise. There has been work combining VAEs and GANs to simultaneously learn the image data distribution and generate high-quality samples from it [15, 8, 21]. Convolutional sparse coding is an alternative approach for reconstruction or image in-painting problems. Our work complements existing generative frameworks in that we provide a structured approach for VAEs that can learn beyond low-level representations. We show the effectiveness of the representations learned by our model by using them for visual recognition tasks.
There has been a lot of work on interpreting or disentangling representations learned using generative models such as VAEs [23, 9, 16]. However, there is little evidence of the effectiveness of disentangled representations in visual recognition. Semi-supervised learning using generative models [17, 32], where partial or noisy labels are available to the model during training, has shown a lot of promise in applications that generate conditioned samples from the model. In our work, however, we focus on incorporating inductive biases in these generative models (e.g., VAEs) so they can learn representations better suited for visual recognition.
A related, but orthogonal, line of work is self-supervised learning where a proxy task is designed to learn representation useful for recognition. These proxy tasks vary from simple tasks like arranging patches in an image in the correct spatial order [5, 6] and arranging frames from a video in correct temporal order [35, 25], to more involved tasks like in-painting  and context prediction [24, 36]. We follow the best practices from this line of work for evaluating the learned representations.
Our work builds upon the VAE framework. We briefly review relevant aspects of the VAE framework and then present our approach.
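The standard VAE framework assumes a generative model for data in which a latent $\mathbf{z}$ is first sampled from a prior $p(\mathbf{z})$ and the data is then generated from a conditional distribution $G(\mathbf{x}|\mathbf{z})$. A variational approximation $Q(\mathbf{z}|\mathbf{x})$ to the true intractable posterior is introduced and the model is learned by minimizing the following negative variational lower bound (ELBO),
$$\mathcal{L}_{\text{VAE}}(\mathbf{x}) = -\mathbb{E}_{\mathbf{z} \sim Q(\mathbf{z}|\mathbf{x})}\left[\log G(\mathbf{x}|\mathbf{z})\right] + \mathrm{KL}\left[Q(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right] \quad (1)$$
where $Q(\mathbf{z}|\mathbf{x})$ is often referred to as an encoder as it can be viewed as mapping data to the latent space, while $G(\mathbf{x}|\mathbf{z})$ is referred to as a decoder (or generator) that can be viewed as mapping latents to the data space. Both $Q$ and $G$ are commonly parameterized as neural networks. Fig. 1(a) shows the commonly used VAE architecture. If the conditional $G(\mathbf{x}|\mathbf{z})$ takes a Gaussian form, the negative log likelihood in the first term of the RHS of Eq. 1 becomes the mean squared error between the generator output $\hat{\mathbf{x}}$ and the input data $\mathbf{x}$. In the second term, the prior $p(\mathbf{z})$ is assumed to be a multivariate normal distribution with zero mean and identity covariance, $\mathcal{N}(\mathbf{0}, \mathbf{I})$, and the loss simplifies to
$$\mathcal{L}_{\text{VAE}}(\mathbf{x}) = \|\mathbf{x} - \hat{\mathbf{x}}\|^2 + \mathrm{KL}\left[Q(\mathbf{z}|\mathbf{x}) \,\|\, \mathcal{N}(\mathbf{0}, \mathbf{I})\right] \quad (2)$$
When $G$ and $Q$ are differentiable, the entire model can be trained with SGD using the reparameterization trick. The $\beta$-VAE extends this for learning disentangled representations by incorporating a weight factor $\beta$ for the KL divergence term, yielding
$$\mathcal{L}_{\beta\text{VAE}}(\mathbf{x}) = \|\mathbf{x} - \hat{\mathbf{x}}\|^2 + \beta\, \mathrm{KL}\left[Q(\mathbf{z}|\mathbf{x}) \,\|\, \mathcal{N}(\mathbf{0}, \mathbf{I})\right] \quad (3)$$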
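As a minimal sketch of the loss reviewed above, the following standard-library Python computes a squared reconstruction error plus a $\beta$-weighted KL term for a diagonal Gaussian posterior, together with a reparameterized sample. The function names, the `mu`/`logvar` parameterization, and the toy inputs are assumptions made for illustration; they are not taken from the authors' PyTorch code.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def gaussian_kl(mu, logvar):
    """Closed-form KL[N(mu, diag(exp(logvar))) || N(0, I)]."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, logvar))


def beta_vae_loss(x, x_hat, mu, logvar, beta=1.0):
    """Squared reconstruction error plus the beta-weighted KL term."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + beta * gaussian_kl(mu, logvar)


# Toy usage with hypothetical values (beta=4.0 is an arbitrary choice).
rng = random.Random(0)
mu, logvar = [0.5, -0.5], [0.0, 0.0]
z = reparameterize(mu, logvar, rng)
loss = beta_vae_loss([1.0, 0.0], [0.8, 0.1], mu, logvar, beta=4.0)
```

Setting `beta=1.0` recovers the plain VAE objective, so the same function covers both the standard and the weighted form.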
The VAE framework aims to learn a generative model for images in which the latents represent the corresponding low-dimensional generating factors. The latents can therefore be treated as image representations that capture the necessary details about images. However, we postulate that representations produced by the standard VAE framework are not ideal for recognition, as they are learned to capture all details rather than capturing ‘interesting’ aspects of the data and dropping the rest. This is not surprising since the formulation does not encourage learning semantic information. For learning semantic representations, in the absence of any relevant supervision (as is available in self-supervised approaches), inductive biases have to be introduced. Therefore, taking inspiration from works on unsupervised mid-level pattern discovery [30, 7, 14], we propose a formulation that encourages the encoder to encode only the few parts of an image that are repetitive across the dataset, i.e., the patches that occur often in images.
Since the VAE framework provides a principled way of learning a mapping from image to latent space, we consider it ideal for our proposed extension. We chose $\beta$-VAEs for their simplicity and widespread use. In Section 3.2, we describe our approach in detail, and in Section 3.4 we propose a modification in the reconstruction error computation to bias the error term towards foreground high-energy regions (similar to the biased initial sampling of patterns in mid-level patch discovery work).
Given an image $\mathbf{x}$, let $\mathbf{f} = \phi(\mathbf{x})$ be a deterministic mapping that produces a 3D representation $\mathbf{f}$ of size $h \times w \times d_e$, with a total of $L = h \times w$ locations (grid-cells). We aim to encourage the encoder network to only encode parts of an image that correspond to highly repetitive patches. For example, a random patch of noise is unlikely to occur frequently, whereas patterns like faces, wheels, windows, etc. repeat across multiple images. In order to capture this intuition, we force the representation $\mathbf{f}$ to be useful for predicting frequently occurring parts in an image, and use just these predicted parts to reconstruct the image. We achieve this by transforming $\mathbf{f}$ to $\hat{\mathbf{z}}$, which encodes a set of parts at a small subset of the $L$ locations on the grid cells. We refer to $\hat{\mathbf{z}}$ as the “patch latent codes” for an image. Next we describe how we re-tool the $\beta$-VAE framework to learn these local latent codes. We first describe our setup for a single part and follow it up with a generalization to multiple parts (Section 3.3).
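Image Encoding. Given the image representation $\mathbf{f} = \phi(\mathbf{x})$, we want to learn part representations at each grid location $l$ (where $l \in \{1, \ldots, L\}$). A part is parameterized by its appearance $\mathbf{z}_{\text{app}}$ and its occurrence $z^l_{\text{occ}}$ (i.e., presence or absence of the part at grid location $l$). We use two networks, $Q^A_{\mathbf{f}}$ and $Q^O_{\mathbf{f}}$, to parameterize the posterior distributions $Q^A_{\mathbf{f}}(\mathbf{z}_{\text{app}} | \mathbf{f})$ and $Q^O_{\mathbf{f}}(z^l_{\text{occ}} | \mathbf{f})$ of the part parameters $\mathbf{z}_{\text{app}}$ and $z^l_{\text{occ}}$ respectively. Since the mapping $\mathbf{f} = \phi(\mathbf{x})$ is deterministic, we can re-write these distributions as $Q^A_{\mathbf{f}}(\mathbf{z}_{\text{app}} | \phi(\mathbf{x}))$ and $Q^O_{\mathbf{f}}(z^l_{\text{occ}} | \phi(\mathbf{x}))$, or simply $Q^A(\mathbf{z}_{\text{app}} | \mathbf{x})$ and $Q^O(z^l_{\text{occ}} | \mathbf{x})$. Therefore, given an image $\mathbf{x}$, the encoder networks estimate the posteriors $Q^A(\mathbf{z}_{\text{app}} | \mathbf{x})$ and $Q^O(z^l_{\text{occ}} | \mathbf{x})$. Note that $\mathbf{f}$ is a deterministic feature map, whereas $\mathbf{z}_{\text{app}}$ and $z^l_{\text{occ}}$ are stochastic.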
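Image Decoding. We utilize a generator or decoder network $G$ that, given $z_{\text{occ}}$ and $\mathbf{z}_{\text{app}}$, reconstructs the image. First, we sample a part appearance $\hat{\mathbf{z}}_{\text{app}}$ ($d_p$-dimensional, continuous) and then sample part occurrence $\hat{z}^l_{\text{occ}}$ ($L$-dimensional, binary), one for each location, from the posteriors
$$\hat{\mathbf{z}}_{\text{app}} \sim Q^A(\mathbf{z}_{\text{app}} | \mathbf{x}), \qquad \hat{z}^l_{\text{occ}} \sim Q^O(z^l_{\text{occ}} | \mathbf{x}) \;\; \text{where } l \in \{1, \ldots, L\} \quad (4)$$
Next, we construct a 3D representation $\hat{\mathbf{z}}$ by placing $\hat{\mathbf{z}}_{\text{app}}$ at every location $l$ where the part is present (i.e., $\hat{z}^l_{\text{occ}} = 1$). This can be implemented by a broadcasted product of $\hat{\mathbf{z}}_{\text{app}}$ and $\hat{z}^l_{\text{occ}}$. We refer to $\hat{\mathbf{z}}$ as the patch latent code. Again, note that $\mathbf{f}$ is deterministic and $\hat{\mathbf{z}}$ is stochastic. Finally, a deconvolutional network takes $\hat{\mathbf{z}}$ as input and generates an image $\hat{\mathbf{x}}$. This image generation process can be written as
$$\hat{\mathbf{x}} \sim G\left(\mathbf{x} \,|\, z^1_{\text{occ}}, z^2_{\text{occ}}, \ldots, z^L_{\text{occ}}, \mathbf{z}_{\text{app}}\right) \quad (5)$$
Since all latent variables ($z^l_{\text{occ}}$ for all $l$, and $\mathbf{z}_{\text{app}}$) are independent of each other, they can be stacked as
$$\mathbf{z}_p = \left[z^1_{\text{occ}};\, z^2_{\text{occ}};\, \ldots;\, z^L_{\text{occ}};\, \mathbf{z}_{\text{app}}\right] \quad (6)$$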
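The broadcasted product described above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the nested-list layout (an h x w occurrence grid and a flat appearance vector) and the function name are assumptions, and a tensor library would express the same operation as a single broadcast multiply.

```python
def build_patch_latent_code(z_occ, z_app):
    """Place the appearance vector at every location where the part occurs.

    z_occ: h x w grid of 0/1 occurrence samples (hypothetical layout).
    z_app: list of d_p floats, one sampled appearance vector for the part.
    Returns an h x w x d_p nested list (zeros where the part is absent).
    """
    return [[[occ * a for a in z_app] for occ in row] for row in z_occ]


# Toy usage: a 2 x 2 grid where the part fires on the diagonal.
z_occ = [[1, 0],
         [0, 1]]
z_app = [0.5, -1.0]
z_hat = build_patch_latent_code(z_occ, z_app)
```

Because the same `z_app` is reused at every active location, the decoder only ever sees one appearance per part, which is the structural constraint the text relies on.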
Note that despite the additional structure, our model still resembles the setup of variational auto-encoders. The primary differences arise from: (1) the use of discrete latents for part occurrence, (2) the patch-based bottleneck imposing additional structure on the latents, and (3) feature assembly for the generator.
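Training. We use the training setup of the $\beta$-VAE and use the maximization of the variational lower bound to train the encoder and decoder jointly (described in Section 3.1). The posterior $Q^A$, which captures the appearance of a part, is assumed to be a Normal distribution with a zero-mean, identity-covariance prior $\mathcal{N}(\mathbf{0}, \mathbf{I})$. The posterior $Q^O$, which captures the presence or absence of a part, is assumed to be a Bernoulli distribution with prior $\mathrm{Bern}(z^{\text{prior}}_{\text{occ}})$. Therefore, writing $\mathbf{z}_p$ for the stacked latents (all $z^l_{\text{occ}}$ together with $\mathbf{z}_{\text{app}}$), the ELBO for our approach can be written as (refer to (3)):
$$\mathcal{L}_{\text{PatchVAE}}(\mathbf{x}) = -\mathbb{E}_{\mathbf{z}_p \sim Q(\mathbf{z}_p|\mathbf{x})}\left[\log G(\mathbf{x}|\mathbf{z}_p)\right] + \beta\, \mathrm{KL}\left[Q(\mathbf{z}_p|\mathbf{x}) \,\|\, p(\mathbf{z}_p)\right]$$
where the KL term can be expanded as:
$$\mathrm{KL}\left[Q(\mathbf{z}_p|\mathbf{x}) \,\|\, p(\mathbf{z}_p)\right] = \sum_{l=1}^{L} \mathrm{KL}\left[Q^O(z^l_{\text{occ}}|\mathbf{x}) \,\|\, \mathrm{Bern}(z^{\text{prior}}_{\text{occ}})\right] + \mathrm{KL}\left[Q^A(\mathbf{z}_{\text{app}}|\mathbf{x}) \,\|\, \mathcal{N}(\mathbf{0}, \mathbf{I})\right]$$
In addition, we also need to compute the KL divergence loss for part occurrence. Learning a discrete probability distribution is challenging since there is no gradient defined to backpropagate the reconstruction loss through the stochastic layer at the decoder, even when using the reparameterization trick. Therefore, we use the relaxed-Bernoulli approximation [22, 2] for training the part occurrence distributions $z^l_{\text{occ}}$.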
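For an image $\mathbf{x}$, the network $\phi(\mathbf{x})$ first generates feature maps of size $(h \times w \times d_e)$, where $(h, w)$ are the spatial dimensions and $d_e$ is the number of channels. Therefore, the number of locations is $L = h \times w$. The encoders $Q^A_{\mathbf{f}}$ and $Q^O_{\mathbf{f}}$ are single-layer neural networks that compute $\hat{\mathbf{z}}_{\text{app}}$ and $\hat{z}^l_{\text{occ}}$. $\hat{z}^l_{\text{occ}}$ is an $(h \times w \times 1)$-dimensional multivariate Bernoulli parameter and $\hat{\mathbf{z}}_{\text{app}}$ is a $(1 \times 1 \times d_p)$-dimensional multivariate Gaussian, where $d_p$ is the length of the latent vector for a single part. The input to the decoder, $\hat{\mathbf{z}}$, is $(h \times w \times d_p)$-dimensional. In all experiments, we fix and .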
Constructing $\hat{\mathbf{z}}_{\text{app}}$. Notice that $\mathbf{f}$ is an $(h \times w \times d_e)$-dimensional feature map and $\hat{z}^l_{\text{occ}}$ is an $(h \times w \times 1)$-dimensional binary output, but $\hat{\mathbf{z}}_{\text{app}}$ is a $(1 \times 1 \times d_p)$-dimensional feature vector. If $\sum_l \hat{z}^l_{\text{occ}} > 1$, the part occurs at multiple locations in an image. Since all these locations correspond to the same part, their appearance should be the same. To incorporate this, we take the weighted average of the part appearance feature at each location, weighted by the probability that the part is present. Since we use the probability values for averaging, the result is deterministic. This operation is encapsulated by the $Q^A$ encoder (refer to Figure 1(b)). During image generation, we sample $\hat{\mathbf{z}}_{\text{app}}$ once and replicate it at each location where $\hat{z}^l_{\text{occ}} = 1$. During training, this forces the model to: (1) only predict where similar-looking parts occur, and (2) learn a common representation for the part that occurs at these locations. Note that $\mathbf{z}_{\text{app}}$ can be modeled as a mixture of distributions (e.g., a mixture of Gaussians) to capture complicated appearances. However, in this work we assume that the convolutional neural network based encoders are powerful enough to map the variable appearance of semantic concepts to similar feature representations. Therefore, we restrict ourselves to a single Gaussian distribution.
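The probability-weighted averaging step can be sketched as follows. This is a minimal illustration that assumes the per-location appearance features are given as a list of vectors and that the weights are normalized by their sum (the text says "weighted average" without spelling out the normalization); the function name is hypothetical.

```python
def aggregate_appearance(app_features, occ_probs):
    """Pool per-location appearance features into one vector for the part.

    app_features: list of L vectors (each a list of d_p floats).
    occ_probs: list of L occurrence probabilities, used as weights.
    Returns the probability-weighted average as a list of d_p floats.
    """
    total = sum(occ_probs)
    if total <= 0.0:
        raise ValueError("at least one occurrence probability must be positive")
    d_p = len(app_features[0])
    return [
        sum(p * feat[k] for p, feat in zip(occ_probs, app_features)) / total
        for k in range(d_p)
    ]


# Toy usage: two locations, the second is three times as likely to hold the part.
pooled = aggregate_appearance([[1.0, 0.0], [3.0, 2.0]], [0.25, 0.75])
```

Since the weights are probabilities rather than samples, the pooled vector is a deterministic function of the encoder outputs, matching the remark above.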
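Next we extend the framework described above to use multiple parts. To use $N$ parts, we use $N \times 2$ encoder networks $Q^{A(i)}(\mathbf{z}^{(i)}_{\text{app}} | \mathbf{x})$ and $Q^{O(i)}(z^{l(i)}_{\text{occ}} | \mathbf{x})$, where $\mathbf{z}^{(i)}_{\text{app}}$ and $z^{l(i)}_{\text{occ}}$ parameterize the $i^{\text{th}}$ part. Again, this can be implemented efficiently as 2 networks by concatenating the outputs together. The image generator samples $\hat{\mathbf{z}}^{(i)}_{\text{app}}$ and $\hat{z}^{l(i)}_{\text{occ}}$ from the outputs of these encoder networks and constructs $\hat{\mathbf{z}}^{(i)}$. We obtain the final patch latent code $\hat{\mathbf{z}}$ by concatenating all $\hat{\mathbf{z}}^{(i)}$ in the channel dimension. Therefore, $\hat{\mathbf{z}}^{(i)}$ is an $(h \times w \times d_p)$-dimensional and $\hat{\mathbf{z}}$ is an $(h \times w \times (N \times d_p))$-dimensional stochastic feature map. For this multiple part case, (6) can be written as:
$$\mathbf{z}_P = \left[\mathbf{z}^{(1)}_p;\, \mathbf{z}^{(2)}_p;\, \ldots;\, \mathbf{z}^{(N)}_p\right] \quad \text{where } \mathbf{z}^{(i)}_p = \left[z^{1(i)}_{\text{occ}};\, z^{2(i)}_{\text{occ}};\, \ldots;\, z^{L(i)}_{\text{occ}};\, \mathbf{z}^{(i)}_{\text{app}}\right]$$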
The training details and assumptions of posteriors follow the previous section.
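The channel-wise concatenation for multiple parts can be illustrated with nested lists. The sketch assumes each per-part code is already an h x w x d_p grid (for example, built by a broadcast product of an occurrence map and an appearance vector); the names are hypothetical.

```python
def concat_parts(part_codes):
    """Concatenate N per-part latent codes along the channel dimension.

    part_codes: list of N codes, each an h x w x d_p nested list.
    Returns an h x w x (N * d_p) nested list.
    """
    h, w = len(part_codes[0]), len(part_codes[0][0])
    return [
        [[v for code in part_codes for v in code[r][c]] for c in range(w)]
        for r in range(h)
    ]


# Toy usage: N = 2 parts on a 1 x 2 grid with d_p = 2.
part_a = [[[0.5, -1.0], [0.0, 0.0]]]   # part A fires at the first location
part_b = [[[0.0, 0.0], [2.0, 3.0]]]    # part B fires at the second location
z_hat = concat_parts([part_a, part_b])
```

Each location keeps a fixed channel slot per part, so the decoder can tell which part fired where from the position of the non-zero channels.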
The L2 reconstruction loss used for training $\beta$-VAEs (and other reconstruction-based approaches) gives equal importance to each region of an image. This might be reasonable for tasks like image compression and image de-noising. However, for the purposes of learning semantic representations, not all regions are equally important. For example, “sky” and “walls” occupy large portions of an image, whereas concepts like “windows,” “wheels,” and “faces” are comparatively smaller, but arguably more important. To incorporate this intuition, we use a simple strategy to weigh the regions in an image in proportion to the gradient energy in the region. More concretely, we compute the Laplacian of an image to get the per-pixel intensity of gradients and average the gradient magnitudes in local patches. The weight multiplier for the reconstruction loss of each patch in the image is proportional to the average magnitude of the patch. All weights are normalized to sum to one. We refer to this as the weighted loss. Note that this is similar to the gradient-energy biased sampling of mid-level patches used in [30, 7]. Examples of weight masks are provided in the supplemental material.
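The weighting scheme described above can be sketched on a small grayscale grid. Several details here are assumptions rather than facts from the paper: the 4-neighbour Laplacian kernel, replicated borders, non-overlapping patches, and the uniform fallback for a completely flat image.

```python
def laplacian_magnitude(img):
    """Absolute 4-neighbour Laplacian with replicated borders (H x W floats)."""
    H, W = len(img), len(img[0])

    def px(r, c):
        return img[min(max(r, 0), H - 1)][min(max(c, 0), W - 1)]

    return [
        [abs(px(r - 1, c) + px(r + 1, c) + px(r, c - 1) + px(r, c + 1)
             - 4.0 * px(r, c)) for c in range(W)]
        for r in range(H)
    ]


def patch_weights(img, patch):
    """Per-patch loss weights proportional to mean gradient energy.

    Averages the Laplacian magnitude over non-overlapping patch x patch
    blocks and normalizes the block means so that they sum to one.
    """
    mag = laplacian_magnitude(img)
    H, W = len(img), len(img[0])
    means = []
    for r0 in range(0, H, patch):
        row = []
        for c0 in range(0, W, patch):
            block = [mag[r][c]
                     for r in range(r0, min(r0 + patch, H))
                     for c in range(c0, min(c0 + patch, W))]
            row.append(sum(block) / len(block))
        means.append(row)
    total = sum(sum(row) for row in means)
    count = sum(len(row) for row in means)
    if total == 0.0:
        # Assumption: a flat image has no preferred region, so weigh uniformly.
        return [[1.0 / count for _ in row] for row in means]
    return [[m / total for m in row] for row in means]


# Toy usage: a dark 4 x 4 image with one bright pixel in the bottom-right patch.
img = [[0.0] * 4 for _ in range(4)]
img[3][3] = 1.0
weights = patch_weights(img, 2)
```

Multiplying each patch's squared reconstruction error by its weight before summing gives a loss that concentrates on high-energy regions such as the bright corner in this toy image.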
In addition, we also consider an adversarial training strategy from GANs to train VAEs, where the discriminator network from the GAN implicitly learns to compare images and gives a more abstract reconstruction error for the VAE. We refer to this variant by using the ‘GAN’ suffix in experiments. In Section 4.2, we demonstrate that the proposed weighted loss is complementary to the discriminator loss from adversarial training, and these losses result in better recognition capabilities for both $\beta$-VAE and PatchVAE.
Datasets. We evaluate PatchVAE on the CIFAR100, MIT Indoor Scene Recognition, Places, and ImageNet datasets. CIFAR100 consists of 60k color images from 100 classes, with 600 images per class. There are 50000 training images and 10000 test images. The Indoor dataset contains 67 categories and a total of 15620 images. Train and test subsets consist of 80 and 20 images per class respectively. The Places dataset has 2.5 million images with 205 categories. The ImageNet dataset has over a million images from 1000 categories.
|Model||Top-1 Acc.||Top-5 Acc.|
Learning paradigm. In order to evaluate the utility of PatchVAE features for recognition, we set up the learning paradigm as follows: we first train the model in an unsupervised manner on all training images. After that, we discard the generator network and use only part of the encoder network to train a supervised model on the classification task of the respective dataset. We study different training strategies for the classification stage, as discussed later.
Training details. In all experiments, we use the following architectures. For CIFAR100, Indoor67, and Places205, $\phi$ has a conv layer followed by two residual blocks. For ImageNet, $\phi$ is a ResNet18 model (a conv layer followed by four residual blocks). For all datasets, $Q^A$ and $Q^O$ have a single conv layer each. For classification, we start from $\phi$, and add a fully-connected layer with 512 hidden units and a final fully-connected layer as classifier. More details can be found in the supplemental material.
During the unsupervised learning phase of training, all methods are trained for 90 epochs for CIFAR100 and Indoor67, 2 epochs for Places205, and 30 epochs for the ImageNet dataset. All methods use the ADAM optimizer for training, with initial learning rate of and a minibatch size of 128. For the relaxed Bernoulli in $Q^O$, we start with a temperature of 1.0 with an annealing rate of (following the details in ). For training the classifier, all methods use stochastic gradient descent (SGD) with momentum with a minibatch size of 128. Initial learning rate is and we reduce it by a factor of 10 every 30 epochs. All experiments are trained for 90 epochs for CIFAR100 and Indoor67, 5 epochs for Places205, and 30 epochs for the ImageNet dataset.
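The classifier schedule ("reduce by a factor of 10 every 30 epochs") is a plain step decay, sketched below. The initial learning rate is not recoverable from the text, so `base_lr=0.1` in the usage lines is a placeholder assumption, not the value used in the experiments.

```python
def step_decay_lr(base_lr, epoch, step=30, factor=10.0):
    """Learning rate after dividing by `factor` once every `step` epochs."""
    return base_lr / (factor ** (epoch // step))


# Toy usage over a 90-epoch run; base_lr is a hypothetical placeholder.
schedule = [step_decay_lr(0.1, epoch) for epoch in range(90)]
```

With a 90-epoch budget this yields three plateaus of 30 epochs each, the last one a hundred times smaller than the first.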
Baselines. We use the $\beta$-VAE model (Section 3.1) as our primary baseline. In addition, we use the weighted loss and the discriminator loss, resulting in the $\beta$-VAE-* family of baselines. We also compare against a BiGAN model. We use similar backbone architectures for the encoder/decoder (and discriminator if present) across all methods, and tried to keep the number of parameters in different approaches comparable to the best of our ability. Exact architecture details can be found in the supplemental material.
In Table 1, we report the top-1 classification results on the CIFAR100, Indoor67, and Places205 datasets for all methods with different training strategies for classification. First, we keep all the pre-trained weights in $\phi$ from the unsupervised task frozen and only train the two newly added conv layers in the classification network (reported under column ‘Conv[1-5]’). We notice that our method (with different losses) generally outperforms the $\beta$-VAE counterpart by a healthy margin. This shows that the representations learned by the PatchVAE framework are better for recognition compared to $\beta$-VAEs. Moreover, better reconstruction losses (‘GAN’ and the weighted loss) generally improve both $\beta$-VAE and PatchVAE, and are complementary to each other.
Next, we fine-tune the last residual block along with the two conv layers (‘Conv[1-3]’ column). We observe that PatchVAE performs better than VAE under all settings except for CIFAR100 with just the L2 loss. However, when using better reconstruction losses, the performance of PatchVAE improves over $\beta$-VAE. Similarly, we fine-tune all but the first conv layer and report the results in the ‘Conv1’ column. Again, we notice similar trends, where our method generally performs better than $\beta$-VAE on the Indoor67 and Places205 datasets, but $\beta$-VAE performs better on CIFAR100 by a small margin. When compared to BiGAN, PatchVAE representations are better on all datasets (‘Conv[1-5]’) by a huge margin. However, when fine-tuning the pre-trained weights, BiGAN performs better on two out of four datasets. We also report results using pre-trained weights in $\phi$ from the supervised ImageNet classification task (last column, Table 1) for completeness. The results indicate that PatchVAE learns better semantic representations compared to $\beta$-VAE.
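The three fine-tuning strategies differ only in how many backbone layers stay frozen, which the following sketch makes explicit. The layer names and the mapping of the first conv layer to `conv1` and of the two residual blocks to `conv2`-`conv3` and `conv4`-`conv5` are assumptions inferred from the column names; they are not confirmed by the text.

```python
# Hypothetical names for the five backbone layers referred to as Conv1-5.
BACKBONE_LAYERS = ["conv1", "conv2", "conv3", "conv4", "conv5"]

# Each table column is named after the layers that stay frozen.
FROZEN_UP_TO = {"Conv[1-5]": 5, "Conv[1-3]": 3, "Conv1": 1}


def trainable_backbone_layers(strategy):
    """Return the backbone layers that are fine-tuned under a given strategy.

    The newly added classifier layers are always trained and are not listed.
    """
    return BACKBONE_LAYERS[FROZEN_UP_TO[strategy]:]


# Toy usage: list what each strategy fine-tunes.
plan = {name: trainable_backbone_layers(name) for name in FROZEN_UP_TO}
```

In a deep learning framework the frozen layers would simply be excluded from gradient updates, while everything returned here is handed to the optimizer together with the classifier head.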
ImageNet Results. Finally, we report results on the large-scale ImageNet benchmark in Table 2. For these experiments, we use the ResNet18 architecture for all methods. All weights are first learned using the unsupervised tasks. Then, we fine-tune the last two residual blocks and train the two newly added conv layers in the classification network (therefore, the first conv layer and the following two residual blocks are frozen). We notice that the PatchVAE framework outperforms $\beta$-VAE under all settings, and the proposed weighted loss helps both approaches. Finally, the last row in Table 2 reports classification results of the same architecture randomly initialized and trained end-to-end on ImageNet using supervised training for comparison.
Increasing the prior probability of patch occurrence has an adverse effect on classification performance.
We present qualitative results to validate our hypothesis. First, we visualize whether the structure we impose on the VAE bottleneck is able to capture occurrence and appearance of important parts of images. We visualize the PatchVAE trained on images from CIFAR100 and Imagenet datasets in the following ways.
Concepts captured. First, we visualize the part occurrences in Figure 3. We can see that the parts can capture round (fruit-like) shapes in the top row and faces in the second row regardless of the class of the image. Similarly for ImageNet, the occurrence map of a specific part in images of chickens focuses on the head and neck. Note that semantically these parts are more informative than just the texture or color that a $\beta$-VAE can capture. In Figure 4, we show parts captured by the ImageNet model by cropping a part of the image centered around the occurring part. We can see that parts are able to capture multiple concepts, similar in either shape, texture, or the context in which they occur.
Swapping appearances. Using PatchVAE, we can swap appearance of a part with the appearance vector of another part from a different image. In Figure 5, keeping the occurrence map same for a target image, we modify the appearance of a randomly chosen part and observe the change in reconstructed image. We notice that given the same source part, the decoder tries similar things across different target images. However, the reconstructions are worse since the decoder has never encountered this particular combination of part appearance before.
Discriminative vs. Generative strength. As per our design, PatchVAE compromises the generative capabilities to learn more discriminative features. To quantify this, we use the images reconstructed from $\beta$-VAE and PatchVAE models (trained on ImageNet) and compute three different metrics to measure the quality of reconstructions of test images. Table 7 shows that $\beta$-VAE is better at reconstruction.
We study the impact of various hyper-parameters used in our experiments. For the purpose of this evaluation, we follow a similar approach as in the ‘Conv[1-5]’ column of Table 1 and use all hyperparameters from the previous section. We use the CIFAR100 and Indoor67 datasets for ablation analysis.
Maximum number of patches. $N$ is the maximum number of parts used in our framework. Depending on the dataset, a higher value of $N$ can provide a wider pool of patches to pick from. However, it can also make the unsupervised learning task harder, since in a minibatch of images, we might not get many repeated patches. Table 6 (left) shows the effect of $N$ on the CIFAR100 and Indoor67 datasets. We observe that while increasing the number of patches improves the discriminative power in the case of CIFAR100, it has little or negative effect in the case of Indoor67. A possible reason for this decline in performance for Indoor67 is the smaller size of the dataset (i.e., fewer images to learn from).
Number of hidden units for a patch appearance. Next, we study the impact of the number of channels in the appearance feature for each patch ($d_p$). This parameter reflects the capacity of an individual patch’s latent representation. While this parameter impacts the reconstruction quality of images, we observed that it has little or no effect on the classification performance of the base features. Results are summarized in Table 6 (right) for both the CIFAR100 and Indoor67 datasets.
Prior probability for patch occurrence $z^{\text{prior}}_{\text{occ}}$. In all our experiments, the prior probability for a patch is fixed to $1/N$, i.e., the inverse of the maximum number of patches. The intuition is to encourage each location on the occurrence maps to fire for at most one patch. Increasing this patch occurrence prior will allow all patches to fire at the same location. While this would make the reconstruction task easier, it will become harder for individual patches to capture anything meaningful. Table 6 shows the deterioration of classification performance on increasing $z^{\text{prior}}_{\text{occ}}$.
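The intuition behind this choice can be checked with a short calculation, assuming (as a simplification that is not stated in the text) that the $N$ occurrence variables at a location $l$ are independent draws from a Bernoulli prior with probability $p$:
$$\mathbb{E}\left[\sum_{i=1}^{N} z^{l(i)}_{\text{occ}}\right] = \sum_{i=1}^{N} p = Np, \qquad p = \frac{1}{N} \;\Rightarrow\; \mathbb{E}\left[\sum_{i=1}^{N} z^{l(i)}_{\text{occ}}\right] = 1.$$
With $p = 1/N$, one part is expected to fire per location on average; any larger prior raises this expectation above one, which is consistent with the observation that several patches can then fire at the same location.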
Patch occurrence loss weight $\beta_{\text{occ}}$. The weight for the patch occurrence KL divergence has to be chosen carefully. If $\beta_{\text{occ}}$ is too low, more patches can fire at the same location and this harms the learning capability of patches; if $\beta_{\text{occ}}$ is too high, the decoder will not receive any patches to reconstruct from and both reconstruction and classification will suffer. Table 6 summarizes the impact of varying $\beta_{\text{occ}}$.
We presented a patch-based bottleneck in the VAE framework that encourages learning useful representations for recognition. Our method, PatchVAE, constrains the encoder architecture to only learn patches that are repetitive and consistent in images as opposed to learning everything, and therefore results in representations that perform much better for recognition tasks compared to vanilla VAEs. We also demonstrate that losses that favor high-energy foreground regions of an image are better for unsupervised learning of representations for recognition.
Computer Vision and Pattern Recognition, 2009. IEEE Conference on, pages 248–255. Ieee, 2009.
Revisiting unreasonable effectiveness of data in deep learning era. 2017 IEEE International Conference on Computer Vision (ICCV), pages 843–852, 2017.
Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.
The generator network has two deconv layers with batchnorm and a final deconv layer with tanh activation. When training with ‘GAN’ loss, the additional discriminator has four conv layers, two of which have batchnorm.
Figure 6 shows an illustration of the reconstruction loss proposed in Section 3.4. Notice that in the first column, the guitar has more weight than the rest of the image. Similarly, in the second, fourth, and sixth columns, the train, painting, and people, respectively, are weighed more heavily by the weighted loss than the rest of the image, thus favoring capture of the foreground regions.
In this section, we share the exact architectures used in various experiments. As discussed in Section 4, we evaluated our proposed model on CIFAR100, Indoor67, and Places205 datasets. We resize and center-crop the images such that input image size for CIFAR100 datasets is while for Indoor67 and Places205 datasets input image size is . PatchVAE can treat images of various input sizes in exactly same way allowing us to keep the architecture same for different datasets. In case of VAE and BiGAN however, we have to go through a fixed size bottleneck layer and hence architectures need to be a little different for different input image sizes. Wherever possible, we have tried to keep the number of parameters in different architectures comparable.
Tables 8 and 9 show the architectures for encoders used in different models. In the unsupervised learning task, the encoder comprises a fixed neural network backbone $\phi$ that, given an image of size , generates feature maps of size . This backbone architecture is common to the different models discussed in the paper and consists of a single conv layer followed by 2 residual blocks. We refer to this as Resnet-9 and it is described as the Conv1-5 layers in Table 12. The rest of the encoder architecture varies depending on the model in consideration and is described in Tables 8 and 9.
As discussed in Section 4, during the supervised learning phase, we discard the rest of the encoder model and only keep $\phi$ for classifier training. So the architectures for all baselines are exactly the same. Table 12 shows the architecture for the classifier used in our experiments.
|Layer||CIFAR100 ()||Indoor67 ()||Places205 ()|