
Laplacian Denoising Autoencoder

While deep neural networks have been shown to perform remarkably well in many machine learning tasks, labeling a large amount of ground truth data for supervised training is usually very costly to scale. Therefore, learning robust representations with unlabeled data is critical in relieving human effort and vital for many downstream tasks. Recent advances in unsupervised and self-supervised learning approaches for visual data have benefited greatly from domain knowledge. Here we are interested in a more generic unsupervised learning framework that can be easily generalized to other domains. In this paper, we propose to learn data representations with a novel type of denoising autoencoder, where the noisy input data is generated by corrupting latent clean data in the gradient domain. This can be naturally generalized to span multiple scales with a Laplacian pyramid representation of the input data. In this way, the agent learns more robust representations that exploit the underlying data structures across multiple scales. Experiments on several visual benchmarks demonstrate that better representations can be learned with the proposed approach, compared to its counterpart with single-scale corruption and other approaches. Furthermore, we also demonstrate that the learned representations perform well when transferring to other downstream vision tasks.





1 Introduction

In recent years, deep learning has made significant improvements on machine learning tasks. However, the success of deep learning approaches relies heavily on a large amount of human-labeled data for supervision, which is usually very costly and infeasible to scale to new data. Humans, in contrast, are exceptionally good at learning abstract knowledge from unsupervised data, i.e., without knowing the specific labels of the data. Thus, how to imitate this human cognitive ability and effectively learn robust representations from massive amounts of unlabeled data in an unsupervised manner is a crucial question that has attracted increasing interest in the literature.

Representation learning is a popular framework for unsupervised learning that aims to learn transferable representations from unlabeled data [1]. Although great progress has been achieved for visual data by some recent advances [2, 3, 4, 5, 6, 7, 8, 9], the approaches are mostly designed to boost the performance of high-level recognition tasks like classification and detection [10, 11]. We argue that good representations should benefit multiple kinds of tasks, including both high-level recognition tasks and low-level pixel-wise prediction tasks. In this paper, we present a novel unsupervised representation learning approach that is applicable to more generic types of data and tasks. The only assumption about the input data is that the learned representations should incorporate the underlying data structures along certain dimensions. For example, one would expect the representations for visual data to incorporate underlying image structures along the spatial dimension, while the representations for speech data might need to exploit structures along the temporal dimension.

Fig. 1: Illustrative visualization of the discriminative representation learning capability on the MNIST test dataset. The samples are projected to 2D domain by using the t-SNE technique [12]. (a) shows the projection from the original digits raw data. (b) is the projection from the embedding space of conventional denoising autoencoder (DAE). (c) visualizes the projected distribution from the embedding space of the proposed Laplacian denoising autoencoder (LapDAE) approach.
Fig. 2: Left: illustration of our Laplacian pyramid based corruption construction strategy compared to traditional spatial corruption, where “LPS” indicates the Laplacian pyramid scale; Right: learned kernels when corruption is added in spatial domain (DAE) and gradient domain (LapDAE).

Specifically, we propose to decouple the representations into different semantic levels in the Laplacian domain. A novel type of denoising autoencoder (DAE) [13] is proposed to distill both high- and low-level representations accordingly. Different from the conventional DAE, where the noisy input is generated from the clean data by adding noise in the original space, we propose to generate the noisy input by corrupting the clean data in the gradient domain. By perturbing the clean data in such a manner, the corruptions are diffused into larger scales and made more difficult to remove. More importantly, the gradient domain corruption can be naturally extended to span multiple scales with a Laplacian pyramid representation of the data [14]. In this way, the DAE is forced to learn more robust and discriminative representations (Fig. 1) that exploit the underlying data structures across multiple scales. In addition, the proposed learning approach can easily be incorporated into other representation learning frameworks and boost their performance accordingly.

Our motivation comes from how humans acquire knowledge through visual perception. Instead of trying to remember every single detail, human vision focuses more on the general concept of the object/scene, which favors a combined perception of both local and non-local information [15]. An example of the proposed gradient-domain corruption is illustrated in Fig. 2. Compared to directly adding noise spatially, editing different scales of the Laplacian pyramid leads to non-local random corruptions. We also show the kernels learned with corruption in the spatial domain and in the gradient domain on the right side of Fig. 2. More edge-sensitive and color-sensitive kernels and non-local responses are learned with the gradient domain corruption (right) than with spatial corruption (left), which prefers local responses. We argue that for an agent to be able to recover from corruptions at different scales non-locally, it requires an understanding of the context in the presented scene.

In Fig. 1, we illustrate the discriminative capability of our model on the MNIST [16] testing set. The visualization is achieved by projecting the high-dimensional data or features to a two-dimensional space using the t-SNE [12] technique. Compared to the raw data distribution, the embedding space of the conventional denoising autoencoder shows a better clustering ability, although with some background noise. In the embedding of the proposed Laplacian denoising autoencoder (LapDAE), we can observe that different categories (digits) are well discriminated from each other and with much less noise. For example, the digits 5 and 3 are better discriminated compared to those from the raw data and from the DAE.

We demonstrate the effectiveness of the proposed unsupervised learning approach in two ways: 1) by evaluating the clustering and discriminative capability on classic benchmarks (e.g., MNIST); 2) by training on large-scale data (e.g., ImageNet [17]) and transferring the learned representations to a variety of downstream vision tasks including multi-label classification, object detection, and semantic segmentation. The main contributions of our work are summarized as follows:

  • We propose a new unsupervised representation learning framework that enforces the model to learn more context and discriminative information in the Laplacian domain.

  • The proposed framework is trained purely based on the raw data itself and neither the data domain assumptions nor pseudo labels are necessary.

  • Our framework is superior to the conventional DAE and achieves competitive performance on several benchmarks for representation learning.

The paper is organized as follows. We discuss work related to our approach in Section 2. The proposed LapDAE approach for self-supervised representation learning is elaborated in Section 3. In Section 4 we perform extensive experiments to validate the effectiveness of the proposed approach on representation learning. The transfer learning ability is also demonstrated by transferring the learned representations to several downstream tasks. Finally, the paper is concluded in Section 5 with a discussion of potential future directions.

2 Related Works

2.1 Autoencoders

The conventional autoencoder (AE) [18] is based on the idea of learning a mapping from a high-dimensional space to a low-dimensional one so that the encoded representation can be used to reconstruct the original raw input. Bengio et al. [19] propose to learn the abstract representation by stacking single-layer autoencoders. Poultney et al. [20] impose a sparsity prior on the latent encoded space. Furthermore, the denoising autoencoder (DAE) [13] is proposed to achieve abstraction that is robust to noise and has been shown to learn better representations. The variational autoencoder (VAE) [21] aims to learn a parametric latent variable model by encouraging the latent space to satisfy a prior distribution. We refer the readers to [1] for a broader view of autoencoder-based approaches in the literature.

Fig. 3: Illustration of the corruption with a Laplacian pyramid. A Gaussian pyramid is first constructed from the clean image, from which a Laplacian pyramid is built. After adding random corruptions (e.g., noise) to a randomly selected level (slice) in the Laplacian pyramid, the final corrupted image is reconstructed from the modified Laplacian pyramid. Separate markers in the figure distinguish the Laplacian pyramid construction from the reconstruction.

2.2 Representation Learning

As a fundamental problem, representation learning has been studied for years. A comprehensive review can be found in [1]. Early classical algorithms mainly focus on reconstructing the raw data by learning compressed features [18, 13]. Some other methods instead use probabilistic models like Boltzmann machines [22, 23] and GANs [24, 25]. Recently, some approaches address the problem by defining pretext tasks (termed “self-supervised learning”), which have shown promising performance, including predicting the relative spatial location/ordering of image patches [6, 5, 7], motion in video [26, 8, 27], colorization [2, 3], predicting rotations [9] and transformations [28], to name a few. Such pretext-task-based methods form a different group from the AE/DAE-based ones. Compared to these representation learning methods, the proposed approach does not make any assumptions about the data and is therefore more generic.

3 Laplacian Denoising Autoencoder

In this section, we first introduce some background information that motivates our approach. Afterwards, we elaborate the proposed Laplacian Denoising AutoEncoder (LapDAE) in detail with methodology and architecture.

3.1 Background

3.1.1 Denoising Autoencoder

To avoid the “simply copy the input” solution that may occur in the traditional autoencoder, the authors of [13] introduce the denoising autoencoder (DAE), which reconstructs a “repaired input” from a corrupted version of it. Suppose the input is $x$. For a DAE, the corrupted version $\tilde{x}$ is mapped to a hidden representation $h = f_\theta(\tilde{x})$ by a basic AE. The reconstruction from the hidden representation is $\hat{x} = g_{\theta'}(h)$. The network parameters $\theta$ and $\theta'$ are learned by minimizing a reconstruction error, e.g., the mean square error $\|x - \hat{x}\|_2^2$. Such a DAE is shown to learn better representations than the AE in its classical form [13].
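As a minimal sketch of the DAE forward pass described above, the following NumPy snippet corrupts an input in the original space, encodes it, decodes it, and measures the error against the clean input. The single sigmoid layer, the additive Gaussian corruption, the layer sizes, and all function and variable names are assumptions made for illustration; they are not taken from [13] or from the LapDAE architecture.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_forward(x, w_enc, b_enc, w_dec, b_dec, noise_std, rng):
    """One forward pass of a single-layer denoising autoencoder (illustrative)."""
    # Corrupt the clean input in the original (input) space.
    x_noisy = x + rng.normal(0.0, noise_std, size=x.shape)
    # Encode the corrupted input into a hidden representation.
    h = sigmoid(x_noisy @ w_enc + b_enc)
    # Decode the hidden representation back to the input space.
    x_hat = sigmoid(h @ w_dec + b_dec)
    # The error is measured against the CLEAN input, not the corrupted one.
    loss = float(np.mean((x - x_hat) ** 2))
    return x_noisy, h, x_hat, loss

rng = np.random.default_rng(0)
x = rng.random((4, 8))                       # 4 samples, 8 features (hypothetical sizes)
w_enc = rng.normal(0.0, 0.1, size=(8, 3))    # encoder weights
w_dec = rng.normal(0.0, 0.1, size=(3, 8))    # decoder weights
x_noisy, h, x_hat, loss = dae_forward(
    x, w_enc, np.zeros(3), w_dec, np.zeros(8), noise_std=0.1, rng=rng
)
```

Because the target is the clean input, the identity mapping no longer minimizes the loss, which is what discourages the network from simply copying its input.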

3.1.2 Laplacian Pyramid and Gradient Domain Editing

The original Laplacian pyramid was proposed for image editing [14] and can easily be generalized to other types of data where a low-pass filter is applicable. Given input data $x$, its Gaussian pyramid is composed of a set of progressively lower resolution versions of the data, denoted as $\{G_k\}$, where $k$ is a pyramid level. In the pyramid, the bottom level is the data itself, i.e., $G_0 = x$, and $G_{k+1} = d(G_k)$, where $d(\cdot)$ denotes low-pass filtering followed by downsampling. The Laplacian pyramid $\{L_k\}$ is constructed by subtracting the neighboring levels in the Gaussian pyramid, $L_k = G_k - u(G_{k+1})$, where $u(\cdot)$ denotes upsampling. Note that the top level of the Laplacian pyramid is the residual and is the same as that in the Gaussian pyramid, $L_K = G_K$, where $K$ is the top level of the pyramid. The construction process is illustrated in Fig. 3. Given a Laplacian pyramid, the original data can be reconstructed by recursively applying $G_k = L_k + u(G_{k+1})$ until $G_0 = x$ is reached. Gradient domain editing on $x$ can be achieved by editing its Laplacian pyramid and then reconstructing the resulting $\tilde{x}$ from the modified Laplacian pyramid.
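The construction and reconstruction just described can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the paper's implementation: 2x2 average pooling stands in for the low-pass filter plus downsampling, nearest-neighbour repetition stands in for upsampling, and the function names are hypothetical.

```python
import numpy as np

def downsample(a):
    """2x2 average pooling: a simple low-pass filter followed by decimation."""
    h, w = a.shape
    return a.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(a):
    """Nearest-neighbour upsampling by a factor of two in each dimension."""
    return np.repeat(np.repeat(a, 2, axis=0), 2, axis=1)

def gaussian_pyramid(x, top_level):
    """Level 0 is the data itself; each further level halves the resolution."""
    levels = [x]
    for _ in range(top_level):
        levels.append(downsample(levels[-1]))
    return levels

def laplacian_pyramid(x, top_level):
    """Differences of neighbouring Gaussian levels, plus the coarsest residual."""
    gauss = gaussian_pyramid(x, top_level)
    lap = [gauss[k] - upsample(gauss[k + 1]) for k in range(top_level)]
    lap.append(gauss[top_level])  # the top level equals the top Gaussian level
    return lap

def reconstruct(lap):
    """Collapse the pyramid from the coarsest level down to level 0."""
    x = lap[-1]
    for level in reversed(lap[:-1]):
        x = level + upsample(x)
    return x

rng = np.random.default_rng(0)
image = rng.random((16, 16))
pyramid = laplacian_pyramid(image, top_level=3)
restored = reconstruct(pyramid)
```

Since each Laplacian level stores exactly what the upsampled coarser level misses, collapsing an unmodified pyramid returns the original data up to floating-point error, regardless of which filter pair is chosen.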

3.2 LapDAE Methodology

Following the denoising autoencoder framework [13], we attempt to distill the essential representations by training a convolutional network (ConvNet) to restore the clean data $x$ from the corrupted data $\tilde{x}$. In contrast to a standard DAE, we generate the corrupted data $\tilde{x}$ from $x$ with the aid of a Laplacian pyramid. Specifically, we construct a Laplacian pyramid from the clean data and randomly corrupt one level of the pyramid, such that $\tilde{x}$ is reconstructed from the corrupted pyramid. Fig. 3 illustrates the corruption process with an example of image data. Since corruption applied to higher levels of the pyramid affects larger spatial scales of the image (see Fig. 2), the randomly corrupted levels force the network to learn features that represent the underlying structures across multiple scales. As known in the literature [29], a ConvNet inherently favors both local and non-local features at different layers. Hence, with only local disturbances, it is difficult to capture non-local semantic concepts. This has also been verified to some extent by self-supervised learning methods that leverage patch-based context information [6, 5, 7], and similarly by the non-local scheme [30] on the network design side. By adding corruptions across multiple scales, the objective is to capture both local and non-local information during the representation learning phase. Additionally, to incorporate diverse types of corruptions and to force the network to “learn harder”, it is also possible to apply multiple types of corruptions to the pyramid during learning.
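To illustrate why corruption at a higher pyramid level affects a larger spatial scale, the sketch below uses the fact that pyramid reconstruction is linear whenever the upsampling operator is linear: adding noise to level k and collapsing the pyramid is then equivalent to adding that noise, upsampled k times, to the clean data. The nearest-neighbour upsampling, the noise level, and the function names are assumptions chosen for illustration, so the square footprints below are specific to this choice of operator.

```python
import numpy as np

def upsample(a):
    """Nearest-neighbour upsampling by a factor of two in each dimension."""
    return np.repeat(np.repeat(a, 2, axis=0), 2, axis=1)

def corrupt_at_level(x, level, noise_std, rng):
    """Image-domain effect of adding Gaussian noise to one Laplacian level.

    With a linear upsampling operator, collapsing the corrupted pyramid gives
    the clean data plus the level's noise upsampled `level` times.
    """
    h, w = x.shape
    noise = rng.normal(0.0, noise_std, size=(h >> level, w >> level))
    for _ in range(level):
        noise = upsample(noise)
    return x + noise

rng = np.random.default_rng(0)
clean = np.zeros((16, 16))          # an all-zero image isolates the corruption
blockwise_constant = {}
for level in range(4):
    delta = corrupt_at_level(clean, level, noise_std=0.1, rng=rng) - clean
    block = 2 ** level
    # View the corruption as a grid of block x block tiles.
    tiles = delta.reshape(16 // block, block, 16 // block, block)
    # True when every tile is constant, i.e. one noise sample covers a whole tile.
    blockwise_constant[level] = bool(np.all(tiles == tiles[:, :1, :, :1]))
```

Under these assumptions, a single noise sample at level k spreads over a 2^k by 2^k patch of pixels, so removing it requires information from outside the immediate neighbourhood of a pixel.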

The assumption here is that, by corrupting data in the Laplacian domain and reconstructing the original latent data from these multiple corrupted versions, the model needs to understand the underlying semantic features and extract semantic-invariant representations accordingly. As illustrated in Fig. 4, in the representation space, the original data and its Laplacian-corrupted versions are projected onto the same sphere, where the underlying semantic-invariant representation lives. This is achieved by enforcing the reconstructions from these projected features on the same sphere to be similar. Different spheres correspond to different semantic representations, e.g., dog and cat samples are projected onto different representation spheres.

Fig. 4: Illustration of the semantic-invariant projection. Images (including both the original latent image and its variants in the Laplacian domain) in the data space ($\mathcal{X}$) are projected to the representation space by a learning-based function $f_\theta$. The green and red spheres indicate different semantic-invariant representations.

Denote the corrupted data as $\tilde{x}_l^c$, where, for conciseness, the subscript $l$ denotes a corrupted level in the Laplacian pyramid and the superscript $c$ the corruption type. The corrupted input space is obtained as

$$\tilde{\mathcal{X}} = \{\tilde{x}_l^c \mid c \in C\}, \tag{1}$$

where $C$ is the corruption type set. The corrupted space is then mapped to a hidden (representation) space $\mathcal{H}$ through an encoder $f_\theta$ with parameter $\theta$ by

$$\mathcal{H} = f_\theta(\tilde{\mathcal{X}}). \tag{2}$$

Differing from a DAE that uses a sparse code as the hidden representation, each sample in our hidden space $\mathcal{H}$ has its own resolution, from which we recover the reconstruction space with a decoder $g_{\theta'}$ with parameter $\theta'$. During mini-batch optimization, each time a training sample $x$ is presented, one or multiple corrupted versions are constructed according to $C$. Each optimization step is therefore based on a sub-mini-batch, for which a sub-batch reconstruction objective is defined as

$$\mathcal{L}(\theta, \theta') = \frac{1}{|C|} \sum_{c \in C} \left\| x - \hat{x}^c \right\|_2^2, \tag{3}$$

where $\hat{x}^c = g_{\theta'}(f_\theta(\tilde{x}_l^c))$ is the data reconstructed by the network. The learning process of the proposed LapDAE is summarized in Algorithm 1.

The proposed Laplacian denoising autoencoder performs data reconstruction in the Laplacian pyramid space across multiple scales. Despite being simple, making no assumptions about the data, and requiring no specially designed domain-specific loss functions, the proposed framework is able to learn representations that are competitive with existing unsupervised (as well as some self-supervised) approaches. This will be shown in the following evaluation section. Similar to some recent work that has explored withholding parts of the data (e.g., AE to remove noise; inpainting and context-encoder to drop data in the spatial domain; colorization to drop data along the channel direction), our LapDAE model can be considered as removing context-aware noise along the scale direction in the Laplacian domain.

Being a purely unsupervised model, this is a generic framework that can be applied to other domains in addition to visual data. The proposed framework opens a new potential direction for representation learning in another transferred domain (e.g., gradient domain), which we believe to be beneficial to the community, where current work focuses mainly on knowledge mining in the original (spatial) domain.

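Algorithm 1: LapDAE Optimization
Initialize corruption set $C$ and parameters $\theta$, $\theta'$
while not converged do
       for each sample $x$ in the mini-batch do
             Compose the Gaussian pyramid $\{G_k\}$
             Construct the Laplacian pyramid $\{L_k\}$ from $\{G_k\}$
             for each corruption $c \in C$ do
                   Randomly select a pyramid level $k$
                   Apply corruption $c$ on level $L_k$
             end for
             Reconstruct the corrupted data $\tilde{x}$ in image domain
       end for
       Update $\theta$, $\theta'$ by minimizing the objective in Equation (3)
end while
Return $\theta$, $\theta'$ for the LapDAE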
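The loop in Algorithm 1 can be sketched end to end on toy data. The snippet below is a deliberately small stand-in and not the paper's model: it uses 1D signals instead of images, a linear encoder and decoder trained by plain gradient descent instead of a ConvNet trained with Adam, pair averaging and nearest-neighbour repetition as the pyramid filters, and one pyramid level shared by the whole batch at each step. All names, sizes, and hyperparameters are assumptions chosen for illustration.

```python
import numpy as np

def down(a):
    """Average neighbouring pairs along the last axis (low-pass + decimate)."""
    return a.reshape(*a.shape[:-1], -1, 2).mean(axis=-1)

def up(a):
    """Nearest-neighbour upsampling by two along the last axis."""
    return np.repeat(a, 2, axis=-1)

def lap_corrupt(x, top_level, noise_std, rng):
    """Add Gaussian noise to one random Laplacian level, then collapse the pyramid."""
    gauss = [x]
    for _ in range(top_level):
        gauss.append(down(gauss[-1]))
    lap = [gauss[k] - up(gauss[k + 1]) for k in range(top_level)] + [gauss[-1]]
    k = int(rng.integers(0, top_level + 1))      # randomly selected level
    lap[k] = lap[k] + rng.normal(0.0, noise_std, size=lap[k].shape)
    out = lap[-1]
    for level in reversed(lap[:-1]):
        out = level + up(out)
    return out

def train_lapdae(x, hidden=4, steps=300, lr=0.1, noise_std=0.3, seed=0):
    """Gradient descent on the mean squared reconstruction error."""
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    w_enc = rng.normal(0.0, 0.1, size=(d, hidden))   # encoder parameters
    w_dec = rng.normal(0.0, 0.1, size=(hidden, d))   # decoder parameters
    losses = []
    for _ in range(steps):
        x_tilde = lap_corrupt(x, top_level=3, noise_std=noise_std, rng=rng)
        h = x_tilde @ w_enc                      # hidden representation
        err = h @ w_dec - x                      # compare against the CLEAN data
        losses.append(float(np.mean(err ** 2)))
        grad_out = 2.0 * err / err.size          # d(loss) / d(reconstruction)
        grad_dec = h.T @ grad_out
        grad_enc = x_tilde.T @ (grad_out @ w_dec.T)
        w_enc -= lr * grad_enc
        w_dec -= lr * grad_dec
    return w_enc, w_dec, losses

# Toy data: random mixtures of three sinusoids sampled at 16 points.
t = np.arange(16)
basis = np.stack([np.sin(2 * np.pi * f * t / 16) for f in (1, 2, 3)])
signals = np.random.default_rng(1).normal(size=(64, 3)) @ basis
w_enc, w_dec, losses = train_lapdae(signals)
```

The only difference from a conventional DAE loop is the call that produces the corrupted input; swapping `lap_corrupt` for direct additive noise would recover the single-scale baseline.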

3.3 LapDAE Architecture

In this section, we describe the detailed architecture of the proposed LapDAE. We utilize a convolutional neural network (CNN) to implement the LapDAE and showcase its effectiveness on visual data. The corruption in the Laplacian domain is modeled as a Laplacian layer: we add Gaussian noise to a randomly chosen level in the Laplacian pyramid. The encoder consists of several simple convolutional (conv) layers, while the decoder mirrors the structure of the encoder and consists of up-conv (also termed deconv in some literature) layers. The model is trained with supervision from the reconstruction objective defined in Equation (3).

4 Experiments

4.1 Experimental Setup

The setup of the proposed framework is described in Sec. 3.3. For the basic LapDAE architecture, four layers are used to construct the encoder, while the decoder consists of three similar up-conv layers. For the experiments performed on large-scale datasets (Secs. 4.4 and 4.5), we use the AlexNet [31] structure for the encoder, and the decoder similarly consists of three up-conv layers. For simplicity, only one corruption type, random noise, is included in the corruption set in this study. The Laplacian pyramid is constructed with eight levels. The whole model is trained end-to-end with the Adam optimizer [32], and the learning rate is decayed by a fixed factor every 20 epochs. A batch size of 128 is used throughout the experiments.

Fig. 5: Left: An illustration of the reconstruction performance on the MNIST dataset. The original raw input images are randomly selected from the test set and are shown in the first row, while the second and last rows show the reconstructed results from conventional DAE and our LapDAE, respectively. DAE applied on Lap noise space and LapDAE on spatial noise space are also shown for reference in the third and fourth rows. Right: Illustration on model convergence. The horizontal axis shows the training iterations while vertical axis the training loss (in log scale).

4.2 Evaluation on MNIST

The MNIST dataset consists of 70,000 images of handwritten digits of size $28 \times 28$, of which 60,000 are used for training and the remaining 10,000 for testing. Randomly selected example images from MNIST are shown in Fig. 5 (left). In this experiment, the input images are fed into the model at a fixed size, with only horizontal flipping as data augmentation during training. As the objective of our model is the reconstruction error (Equation (3)), we first illustrate the qualitative performance on image reconstruction, shown in Fig. 5 (left). The conventional DAE generally reconstructs the digits, but they are unclear and include some noise. In contrast, the reconstruction of our LapDAE model is evidently much clearer and includes more details, e.g., the digits 0 and 4. To better understand the reconstruction capability, we also apply the conventional DAE to images corrupted with Laplacian noise, and compare against applying our LapDAE to images where the noise is added in the input space. The results again suggest that the proposed LapDAE performs better on reconstructing context information, i.e., the digits here. We also compare the convergence of the proposed LapDAE and the conventional DAE, as shown in Fig. 5 (right). With the proposed LapDAE, the model converges faster and reaches a lower training loss.

We perform an experiment on image clustering for both DAE and the proposed LapDAE models. The result is shown in Fig. 1. From the results, we can see that the proposed LapDAE has a far better discriminative capability compared to the conventional DAE.

Fig. 6: The reconstruction result comparison on the CIFAR-10 dataset, in which examples from each category are randomly selected for visualization.

4.3 Evaluation on CIFAR

In comparison to the MNIST dataset, the CIFAR-10 [33] dataset is composed of RGB natural images of size $32 \times 32$, covering 10 different categories of natural objects. The training set consists of 50,000 images while the testing set consists of 10,000. Each category includes 6,000 images. In this experiment, we first visualize the reconstructed images and compare them to those reconstructed by the conventional DAE, as shown in Fig. 6. With the proposed LapDAE, the representative context is well reconstructed, e.g., the face of the cat and the deer in the forest. We also explore the quality of the learned representation by performing an image retrieval task. The retrieval is based on similarity in the embedding space, using a nearest neighbor scheme. Given an input query image, the feature at the bottleneck (latent space) of the model is extracted and used for retrieval over the whole testing set. For this experiment, we compare with results from the conventional DAE. Fig. 7 shows example results, from which we can observe that our LapDAE model learns a much better representation than the DAE. The conventional DAE tends to retrieve based on the appearance of the images, while our LapDAE focuses more on the context/semantic information, e.g., the airplane in the first row. We attribute this to the multi-scale corruptions in the Laplacian domain. Overall, these results, together with the above evaluation on the MNIST dataset, can be considered a proof of concept that the proposed LapDAE is capable of capturing both low-level and high-level context information.
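The nearest-neighbor retrieval used in this experiment can be sketched as follows. The text does not state the distance measure, so Euclidean distance is an assumption here, as are the function name and the toy 2D features; in the experiment the features would be the bottleneck activations of the trained model.

```python
import numpy as np

def retrieve_top_k(query_feat, gallery_feats, k=5):
    """Indices of the k gallery features closest to the query (Euclidean distance)."""
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)
    # A stable sort keeps the original order among equally distant items.
    return np.argsort(dists, kind="stable")[:k]

# Toy gallery of four 2D "bottleneck" features and one query.
gallery = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
query = np.array([0.9, 0.1])
top2 = retrieve_top_k(query, gallery, k=2)   # indices 1 and 0, nearest first
```

A retrieved item counts as correct when its category matches that of the query, which is how the red (wrong category) markers in Fig. 7 would be determined.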

Fig. 7: The image retrieval results by nearest neighbor on the CIFAR-10 dataset. Given the query on the left, the top-5 (from left to right) retrieved results of DAE (middle) and LapDAE (right) are presented, in which red ones indicate wrong category.

4.4 Evaluation on ImageNet

In this section, we investigate the representation learning capability of the proposed LapDAE on a large-scale dataset. To this end, we perform evaluations on the ImageNet [17] dataset. Specifically, we use the ImageNet training set without labels to train our LapDAE model. The training set includes 1.2 million images covering 1,000 categories. Each image is first resized and then randomly cropped. Horizontal flipping is also applied for data augmentation.

4.4.1 Conv1 Learned Filter Visualization

In Fig. 8, we compare the filters learned in the first layer (i.e., conv1) of AlexNet by our approach with those learned under full supervision. In the supervised version (the left panel), both color blobs and edge filters are learned. Although some blobs are not as sharp as those learned by the supervised setup, our approach (the right panel) learns good filters, including edges along different directions, edges with different frequencies, and color contrast along different directions, similar to the supervised ones. Compared with the conventional DAE (the middle panel), the filters learned by our approach are much better.

Fig. 8: The learned convolutional filters (kernels) from the first layer of AlexNet (conv1) trained on the ImageNet dataset. Left: result with full supervision from the labeled data; Middle and Right: the filters learned in an unsupervised manner by DAE and our LapDAE, respectively.

| Method | Conv1 | Conv2 | Conv3 | Conv4 | Conv5 |
|---|---|---|---|---|---|
| ImageNet / Places labels | 19.3 | 36.3 | 44.2 | 48.3 | 50.5 |
| Random Gaussian | 11.6 | 17.1 | 16.9 | 16.3 | 14.1 |
| Random rescaled [34] | 17.5 | 23.0 | 24.5 | 23.2 | 20.6 |
| Context [6] | 16.2 | 23.3 | 30.2 | 31.7 | 29.6 |
| Context Encoder [4] | 14.1 | 20.7 | 21.0 | 19.8 | 15.5 |
| Colorization [2] | 12.5 | 24.5 | 30.4 | 31.5 | 30.3 |
| Jigsaw [5] | 18.2 | 28.8 | 34.0 | 33.9 | 27.1 |
| BiGAN [25] | 17.7 | 24.5 | 31.0 | 29.9 | 28.0 |
| Split-Brain [3] | 17.7 | 29.3 | 35.4 | 35.2 | 32.8 |
| Counting [7] | 18.0 | 30.6 | 34.3 | 32.5 | 25.7 |
| RotNet [9] | 18.8 | 31.7 | 38.7 | 38.2 | 36.5 |
| Domain Adapt [35] | 16.5 | 27.0 | 30.5 | 30.1 | 26.5 |
| Instance [36] | 16.8 | 26.5 | 31.8 | 34.1 | 35.6 |
| AND [37] | 15.6 | 27.0 | 35.9 | 39.7 | 37.9 |
| AET-project [28] | 19.2 | 32.8 | 40.6 | 39.7 | 37.7 |
| DAE | 12.5 | 18.5 | 21.8 | 20.4 | 14.8 |
| DAE + Trans | 17.6 | 31.8 | 39.2 | 37.4 | 34.5 |
| (Ours) LapDAE | 18.4 | 27.4 | 29.9 | 27.0 | 22.7 |
| (Ours) LapDAE + Trans | 19.3 | 33.2 | 43.2 | 41.1 | 39.6 |

TABLE I: Top-1 accuracy on ImageNet classification with a linear classifier. The results are reported on the ImageNet validation set, and all the listed methods except ImageNet labels and Domain Adapt are pre-trained on ImageNet without ground truth labels.

4.4.2 Controlled Classification

Here we quantitatively evaluate our learned representations on the ImageNet classification task [17]. Following the experimental settings in [2], we freeze the pre-trained weights of our model and train a linear classifier on top of each conv layer to perform the 1000-category classification task. To have approximately the same dimensions across different layers, the feature maps of each layer are interpolated to have around 9000 elements. Table I shows the evaluation results. Several state-of-the-art self-supervised representation learning methods [2, 3, 4, 6, 5, 25, 7, 9, 35, 36, 28, 37] are included for comparison. Since the proposed approach can be easily integrated into other representation learning frameworks, we also present the performance of our LapDAE combined with the task of predicting transformations (LapDAE+Trans). Specifically, we build on the AET framework [28] and predict the transformation between the original image and a transformed one corrupted by our LapDAE.
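As a worked example of resizing each layer's feature map to around 9000 elements, the sketch below picks, for each layer, the square spatial side whose total element count is closest to that target. The channel counts are those of the original AlexNet and are an assumption here, since they are not listed in the text; the rounding rule and the names are likewise illustrative.

```python
import math

# Output channels per conv layer of the original AlexNet (assumed, not from the text).
ALEXNET_CHANNELS = {"conv1": 96, "conv2": 256, "conv3": 384, "conv4": 384, "conv5": 256}

def pooled_side(channels, target_elements=9000):
    """Spatial side s such that channels * s * s is close to the target size."""
    return max(1, round(math.sqrt(target_elements / channels)))

feature_dims = {}
for name, channels in ALEXNET_CHANNELS.items():
    side = pooled_side(channels)
    # (channels, spatial side, total number of elements fed to the linear classifier)
    feature_dims[name] = (channels, side, channels * side * side)
```

Under these assumptions every layer ends up with between 9,216 and 9,600 features, so the linear classifiers on different layers have comparable capacity.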

From the results, we observe that the proposed method with the Laplacian pyramid substantially improves the performance compared to its counterpart without the Laplacian pyramid (DAE), especially for the lower convolutional layers. This is consistent with the above visualization and analysis of the conv1 layer, where the filter kernels have more representative power in the proposed LapDAE. When combined with the transformation prediction task, the performance is further boosted by a large margin. Compared to the AET-project method, our approach performs better at every layer, which again supports the effectiveness of the proposed LapDAE.

4.5 Transfer Learning Analysis

| Method | Conv1 | Conv2 | Conv3 | Conv4 | Conv5 |
|---|---|---|---|---|---|
| Places labels | 22.1 | 35.1 | 40.2 | 43.3 | 44.6 |
| Random Gaussian | 15.7 | 20.3 | 19.8 | 19.1 | 17.5 |
| Random rescaled [34] | 21.4 | 26.2 | 27.1 | 26.1 | 24.0 |
| Context [6] | 19.7 | 26.7 | 31.9 | 32.7 | 30.9 |
| Context Encoder [4] | 18.2 | 23.2 | 23.4 | 21.9 | 18.4 |
| Colorization [2] | 16.0 | 25.7 | 29.6 | 30.3 | 29.7 |
| Jigsaw [5] | 23.0 | 31.9 | 35.0 | 34.2 | 29.3 |
| BiGAN [25] | 22.0 | 28.7 | 31.8 | 31.3 | 29.7 |
| Split-Brain [3] | 21.3 | 30.7 | 34.0 | 34.1 | 32.5 |
| Counting [7] | 23.3 | 33.9 | 36.3 | 34.7 | 29.6 |
| RotNet [9] | 21.5 | 31.0 | 35.1 | 34.6 | 33.7 |
| Instance [36] | 18.8 | 24.3 | 31.9 | 34.5 | 33.6 |
| AET-project [28] | 22.1 | 32.9 | 37.1 | 36.2 | 34.7 |
| DAE | 15.9 | 22.6 | 24.2 | 22.1 | 18.1 |
| DAE + Trans | 21.6 | 31.7 | 35.7 | 34.1 | 32.8 |
| (Ours) LapDAE | 21.0 | 30.9 | 31.6 | 29.2 | 26.1 |
| (Ours) LapDAE + Trans | 22.2 | 33.8 | 38.2 | 37.3 | 36.1 |

TABLE II: Top-1 accuracy on Places classification with a linear classifier. The results are reported on the Places validation set, and all the listed methods except Places labels and Domain Adapt are pre-trained on ImageNet without ground truth labels.

4.5.1 Performance on Places

In addition to the controlled classification on ImageNet, here we evaluate the representation learning capability by transfer learning on the Places dataset [38]. We use the same experimental settings as in the ImageNet experiment, and the results are shown in Table II. Differing from the ImageNet experiments, the classifier trained on top of each layer is a 205-way logistic regression layer, which is trained with the Places labels. The proposed LapDAE performs better than its DAE counterpart and, when combined with transformation prediction, outperforms the other unsupervised methods from conv3 to conv5 while remaining competitive at conv1 and conv2.

4.5.2 Performance on Pascal VOC

Furthermore, we perform a transfer learning evaluation on the Pascal VOC dataset [39], covering multi-label classification and detection on VOC 2007 and semantic segmentation on VOC 2012. The learned weights of our model trained on ImageNet are transferred to a standard AlexNet for the evaluation. We then fine-tune the model on the Pascal VOC trainval set and test on the test set. Note that we do not apply any “magic” techniques such as weight rescaling [34]. For the classification task, we use the same network architecture as in the ImageNet evaluation, while for the detection and semantic segmentation tasks we use the publicly available frameworks of Fast R-CNN [10] and FCN [40], respectively, following the same setups as other state-of-the-art methods, e.g., [6, 2]. The results in Table III show that the visual representations learned by the proposed LapDAE perform well when transferred to other vision datasets or tasks, and compare favorably against the other state-of-the-art methods. For the classification task of Pascal VOC, the proposed LapDAE outperforms generic unsupervised learning methods such as DAE and GAN. More notably, its segmentation performance on Pascal VOC surpasses most of the representation learning methods, including self-supervised learning approaches with specifically designed pretext tasks. Comparison between the proposed framework and its counterpart DAE suggests that the improvement is partly due to the Laplacian pyramid. When incorporating the transformation prediction task (LapDAE+Trans), our approach is further improved and achieves the best results among the unsupervised methods in Table III.

max width= Method Classification (%mAP) Detection (%mAP) Segmentation (%mIoU) fc6-8 all all all ImageNet labels 78.9 79.9 56.8 48.0 Random Gaussian - 53.3 43.4 19.8 Random rescaled [34] 39.2 56.6 45.6 32.6 Context [6] - 55.3 45.7 - Context Encoder [4] 34.6 56.5 44.5 29.7 Colorization [2] 61.5 65.5 46.9 35.6 Counting [7] - 67.7 51.4 36.6 GAN [24] 40.5 56.4 - - BiGAN [25] 52.3 60.1 46.9 34.9 RotNet [9] 70.9 73.0 54.4 39.1 AET-project [28] 70.5 73.1 54.2 39.3 DAE 37.0 54.6 43.4 29.1 DAE + Trans 66.7 70.1 51.0 36.8 (Ours) LapDAE 50.6 59.0 45.6 38.3 (Ours) LapDAE + Trans 71.4 74.2 55.2 41.1

TABLE III: Comparison with state-of-the-art representation learning methods on the Pascal VOC vision tasks of classification, detection on VOC 2007, and semantic segmentation on VOC 2012. For classification, we also report a setup that fixes the layers before conv5 and trains only fc6-8. Except for the ImageNet labels row, the weights transferred to the new tasks are learned from unlabeled ImageNet.
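To make the DAE versus LapDAE comparison in Table III concrete, the following minimal sketch computes the per-column gains from the numbers reported in the table. The dictionary layout, the column labels, and the helper name `gain` are illustrative choices, not part of the original evaluation code.

```python
# Rows copied from Table III, in the column order given by COLUMNS.
COLUMNS = ("cls fc6-8 (%mAP)", "cls all (%mAP)", "det all (%mAP)", "seg all (%mIoU)")

TABLE_III = {
    "DAE": (37.0, 54.6, 43.4, 29.1),
    "DAE + Trans": (66.7, 70.1, 51.0, 36.8),
    "LapDAE": (50.6, 59.0, 45.6, 38.3),
    "LapDAE + Trans": (71.4, 74.2, 55.2, 41.1),
}


def gain(method, baseline):
    """Return the absolute improvement of `method` over `baseline` per column."""
    return {
        column: round(m - b, 1)
        for column, m, b in zip(COLUMNS, TABLE_III[method], TABLE_III[baseline])
    }


if __name__ == "__main__":
    for method, baseline in (("LapDAE", "DAE"), ("LapDAE + Trans", "DAE + Trans")):
        print(f"{method} vs. {baseline}: {gain(method, baseline)}")
```

Under these numbers, the multi-scale corruption improves on the single-scale DAE in every column, both with and without the transformation prediction task.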

5 Discussion and Conclusion

In this paper we introduced a novel type of denoising autoencoder for unsupervised representation learning. In contrast to a conventional DAE, the corrupted input to the proposed DAE is produced with the aid of a Laplacian pyramid. Because corruptions are added to randomly chosen levels of the pyramid, they span multiple scales of the original data space, which forces the model to learn to represent the underlying data structures across those scales. In our experiments, the proposed framework learns better representations than a conventional DAE and several other self-supervised learning methods. Through extensive experimental evaluations, we demonstrated the effectiveness of the proposed LapDAE for representation learning without external supervision. Its transfer learning ability is also validated on several visual tasks.
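The multi-scale corruption summarized above can be illustrated with a minimal NumPy sketch. Several simplifications here are assumptions rather than the paper's implementation: 2x2 average pooling and nearest-neighbour upsampling stand in for the usual Gaussian blur and interpolation, additive Gaussian noise stands in for whatever corruption is actually applied, and only one level is corrupted per call. All function names are hypothetical.

```python
import numpy as np


def downsample(x):
    """Halve the resolution by averaging 2x2 blocks (stand-in for blur + subsample)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def upsample(x):
    """Double the resolution by nearest-neighbour repetition."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


def build_laplacian_pyramid(img, levels):
    """Return `levels` band-pass images followed by the coarsest residual."""
    pyramid, current = [], img
    for _ in range(levels):
        coarse = downsample(current)
        pyramid.append(current - upsample(coarse))  # detail lost by downsampling
        current = coarse
    pyramid.append(current)
    return pyramid


def collapse(pyramid):
    """Invert build_laplacian_pyramid by adding details back, coarse to fine."""
    current = pyramid[-1]
    for band in reversed(pyramid[:-1]):
        current = upsample(current) + band
    return current


def corrupt_random_level(img, levels=3, noise_std=0.1, rng=None):
    """Add Gaussian noise (an assumed corruption) to one randomly chosen band."""
    rng = np.random.default_rng() if rng is None else rng
    pyramid = build_laplacian_pyramid(img, levels)
    k = int(rng.integers(levels))  # index of the band-pass level to corrupt
    pyramid[k] = pyramid[k] + rng.normal(0.0, noise_std, pyramid[k].shape)
    return collapse(pyramid), k
```

The image height and width must be divisible by `2 ** levels`. Noise injected at a coarse level is upsampled during reconstruction and therefore affects large image regions, whereas noise at the finest level stays pixel-local; this is the sense in which the corruption spans multiple scales.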

While in this paper we showcase the effectiveness of the proposed method for learning transferable representations on vision tasks, it would be interesting to see how it performs with other types of data. Another direction worth further investigation is to include additional constraints that regularize the learning procedure, for instance, by introducing a contrastive loss that maximizes the distance between different semantic spheres while minimizing the distance between samples belonging to the same semantic sphere. The core idea, performing both local and non-local learning in the Laplacian pyramid space in a way that is consistent with the hierarchical nature of ConvNets, is one we believe to be a promising direction for future research.
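As an illustration of the contrastive regularizer suggested above, the sketch below implements the common pairwise, margin-based form of a contrastive loss. The text does not specify which formulation is intended, so this particular form, the `margin` parameter, and the binary `same` indicator (1 if two samples belong to the same semantic sphere, 0 otherwise) are all assumptions.

```python
import numpy as np


def contrastive_loss(z1, z2, same, margin=1.0):
    """Pairwise margin-based contrastive loss on embeddings of shape (n, d).

    Pairs with same == 1 are pulled together (squared distance penalty);
    pairs with same == 0 are pushed apart until they are `margin` away.
    """
    z1, z2 = np.asarray(z1, dtype=float), np.asarray(z2, dtype=float)
    same = np.asarray(same, dtype=float)
    dist = np.linalg.norm(z1 - z2, axis=1)
    pull = same * dist ** 2
    push = (1.0 - same) * np.maximum(0.0, margin - dist) ** 2
    return float(np.mean(pull + push))
```

In such a setup, this term would be added to the reconstruction objective with a weighting factor, which would itself be a hyperparameter to tune.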


  • [1] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
  • [2] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in Proceedings of European Conference on Computer Vision (ECCV), 2016.
  • [3] ——, “Split-brain autoencoders: Unsupervised learning by cross-channel prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [4] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [5] M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in Proceedings of European Conference on Computer Vision (ECCV), 2016.
  • [6] C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual representation learning by context prediction,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
  • [7] M. Noroozi, H. Pirsiavash, and P. Favaro, “Representation learning by learning to count,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
  • [8] D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan, “Learning features by watching objects move,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [9] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” arXiv preprint arXiv:1803.07728, 2018.
  • [10] R. Girshick, “Fast R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
  • [11] H. Xu, X. Lv, X. Wang, Z. Ren, N. Bodla, and R. Chellappa, “Deep regionlets: Blended representation and deep learning for generic object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • [12] L. v. d. Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
  • [13] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research, vol. 11, no. Dec, pp. 3371–3408, 2010.
  • [14] P. Burt and E. Adelson, “The laplacian pyramid as a compact image code,” IEEE Transactions on communications, vol. 31, no. 4, pp. 532–540, 1983.
  • [15] A. Bubić, D. Y. von Cramon, and R. I. Schubotz, “Prediction, cognition and the brain,” Frontiers in Human Neuroscience, vol. 4, no. 25, 2010.
  • [16] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [17] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
  • [18] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” science, vol. 313, no. 5786, pp. 504–507, 2006.
  • [19] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of deep networks,” in Advances in neural information processing systems (NeurIPS), 2007.
  • [20] C. Poultney, S. Chopra, Y. L. Cun et al., “Efficient learning of sparse representations with an energy-based model,” in Advances in neural information processing systems (NeurIPS), 2007.
  • [21] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [22] G. E. Hinton and T. J. Sejnowski, “Learning and relearning in Boltzmann machines,” Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1, pp. 282–317, 1986.
  • [23] R. Salakhutdinov and H. Larochelle, “Efficient learning of deep boltzmann machines,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
  • [24] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems (NeurIPS), 2014.
  • [25] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” arXiv preprint arXiv:1605.09782, 2016.
  • [26] I. Misra, C. L. Zitnick, and M. Hebert, “Shuffle and learn: unsupervised learning using temporal order verification,” in Proceedings of European Conference on Computer Vision (ECCV), 2016.
  • [27] X. Wang and A. Gupta, “Unsupervised learning of visual representations using videos,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
  • [28] L. Zhang, G.-J. Qi, L. Wang, and J. Luo, “Aet vs. aed: Unsupervised representation learning by auto-encoding transformations rather than data,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [29] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proceedings of European Conference on Computer Vision (ECCV), 2014.
  • [30] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [31] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems (NeurIPS), 2012.
  • [32] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [33] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
  • [34] P. Krähenbühl, C. Doersch, J. Donahue, and T. Darrell, “Data-dependent initializations of convolutional neural networks,” arXiv preprint arXiv:1511.06856, 2015.
  • [35] Z. Ren and Y. J. Lee, “Cross-domain self-supervised multi-task feature learning using synthetic imagery,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [36] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, “Unsupervised feature learning via non-parametric instance discrimination,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [37] J. Huang, Q. Dong, S. Gong, and X. Zhu, “Unsupervised deep learning by neighbourhood discovery,” in International Conference on Machine Learning (ICML), 2019.
  • [38] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in Advances in neural information processing systems (NeurIPS), 2014.
  • [39] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
  • [40] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.