Learning Generalized Transformation Equivariant Representations via Autoencoding Transformations

06/19/2019 ∙ by Guo-Jun Qi, et al. ∙ 0

Learning Transformation Equivariant Representations (TERs) seeks to capture the intrinsic visual structures of images through representations that equivary to the applied transformations. It assumes that a transformation should be decodable from expressive representations of images before and after the transformation. This greatly expands the scope of the translation equivariance underpinning the success of Convolutional Neural Networks (CNNs) to a generic class of transformation equivariant representations. Unlike group equivariant convolutions, which are limited to discrete transformations or linear transformation equivariance, we present a more flexible and tractable AutoEncoding Transformation (AET) model that can handle various types of transformations. Both the deterministic AET and the probabilistic Autoencoding Variational Transformations (AVT) models are presented. While the former trains transformation equivariant representations by directly reconstructing applied transformations, the latter is trained by maximizing the joint mutual information between the representations and the transformations. This leads to Generalized TERs (GTERs) that equivary to transformations in a more general manner, capturing more complex patterns of transformed visual structures beyond the linear TERs of a transformation group. We further show that the presented approach can be extended to (semi-)supervised models by jointly maximizing the mutual information in the learned representations about the input labels and transformations. Experimental results following the standard evaluation protocols demonstrate that the proposed models outperform the existing state-of-the-art unsupervised and (semi-)supervised approaches in the literature.




1 Introduction

(a) AutoEncoding Data (AED)
(b) AutoEncoding Transformation (AET)
(c) (Semi-)Supervised Autoencoding Transformation (SAT)
Fig. 1: Comparison between AED, AET, and SAT. The AED and AET seek to reconstruct the input data and the transformation at the output end, respectively. The encoder (E) extracts the representation of input and transformed images. The decoder (D) reconstructs either the data in AED or the transformation in AET. The SAT builds a classifier (C) upon the output representation from the encoder by capturing the equivariant visual structures under various transformations.

Transformation Equivariant Representation (TER) learning seeks to learn representations that equivary to various transformations applied to images. In other words, the representation of an image ought to change according to the transformation applied to it. In this paper, TER learning is motivated by the assumption that representations equivarying under transformations should be able to encode the visual structures of images such that the transformations can be reconstructed from the representations of images before and after transformations. Based on this assumption, we formally present a novel principle of autoencoding transformations to learn a family of TERs.

Learning the TERs has been advocated in Hinton’s seminal work on learning transformation equivariant capsules [1], and plays a critical role in the success of Convolutional Neural Networks (CNNs) [2]. Specifically, the representations learned by the CNNs are translation equivariant, as their feature maps are shifted in the same way as input images are translated. On top of these feature maps that preserve the visual structures of translation equivariance, fully connected layers are built to output the predicted labels of input images.
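The translation equivariance described above can be checked on a toy example. The sketch below is not from the paper: it uses a 1-D circular cross-correlation in plain Python, the helper names `shift` and `circular_conv` are ours, and the circular boundary handling is an assumption made so that the equivariance holds exactly.

```python
def shift(signal, k):
    """Circularly shift a 1-D signal to the right by k positions."""
    n = len(signal)
    return [signal[(i - k) % n] for i in range(n)]


def circular_conv(signal, kernel):
    """Circular cross-correlation of a 1-D signal with a small kernel."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


signal = [0.0, 1.0, 3.0, 2.0, 0.0, 0.0]
kernel = [1.0, -1.0]

# Shifting the input and then filtering gives the same result as
# filtering first and then shifting the feature map.
features_of_shifted = circular_conv(shift(signal, 2), kernel)
shifted_features = shift(circular_conv(signal, kernel), 2)
```

Here `features_of_shifted` and `shifted_features` coincide, which is the sense in which the feature map "moves with" the input.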

Clearly, the translation equivariant convolutional features play a pivotal role in delivering the state-of-the-art performance of deep networks. They have therefore been extended beyond translations to learn more expressive representations that are equivariant to generic types of transformations, such as affine, projective and homographic transformations. Along this direction, the group equivariant CNNs [3] were developed to guarantee that a transformation of input images results in the same transformation of their feature maps.

However, the group equivariant CNNs [3] and their variants [4, 5] are restricted to discrete transformations, and the resultant representations are also limited to a group representation of linear transformations. These limitations restrict their abilities to model group representations of complex transformations that could be continuous and nonlinear in many learning tasks, ranging from unsupervised, to semi-supervised and supervised learning.

1.1 Unsupervised Learning of Transformation Equivariant Representations

The focus of this paper is on the principle of autoencoding transformations and its application to learn the transformation equivariant representations. The core idea is to encode data with the representations from which the transformations can be decoded as much as possible. We will begin with an unsupervised learning of such representations without involving any labeled data, and then proceed to a generalization to semi-supervised and supervised representations by encoding label information as well.

Unlike group equivariant CNNs that learn feature maps mathematically satisfying the transformation equivariance as a function of the group of transformations, the proposed AutoEncoding Transformations (AET) presents an autoencoding architecture to learn transformation equivariant representations by reconstructing applied transformations. As long as a transformation of input images results in equivariant representations, it should be well decoded from the representations of original and transformed images. Compared with the group equivariant CNNs, the AET model is more flexible and tractable in handling arbitrary transformations and their compositions, since it does not rely on a strict convolutional structure to guarantee the equivariance.

The AET also contrasts with the conventional AutoEncoding Data (AED) paradigm, which aims to reconstruct the data rather than the transformations. Figure 1(a) and (b) illustrate the comparison between the AED and the AET. Since the space of transformations (e.g., the few parameters of a transformation) has a much lower dimension than the data space (e.g., the pixel space of images), the decoder of the AET can be much shallower than that of the AED. This allows the backpropagated errors to train the encoder, which models the representations of input data in the AET architecture, more sufficiently.

Moreover, an AET model can be trained from an information-theoretic perspective by maximizing the information in the learned representation about the applied transformation and the input data. This generalizes the group representations of linear transformations to more general forms that can equivary nonlinearly to input transformations. It results in Generalized Transformation Equivariant Representations (GTERs) that can capture more complex patterns of visual structure under transformations. Unfortunately, maximizing the mutual information between representations and transformations is an intractable optimization problem. A variational lower bound of the mutual information can be derived by introducing a surrogate transformation decoder, yielding a novel model of Autoencoding Variational Transformation (AVT) as an alternative to the deterministic AET.

1.2 (Semi-)Supervised Learning of Transformation Equivariant Representations

While both AET and AVT are trained in an unsupervised fashion, they can act as the basic representation for building the (semi-)supervised classifiers. Along this direction, we can train (Semi-)Supervised Autoencoding Transformation (SAT) that jointly trains the transformation equivariant representations as well as the corresponding classifiers.

Figure 1(c) illustrates the SAT model, where a classifier head is added upon the representation encoder of an AET network. The SAT can be based on either the deterministic AET or the probabilistic AVT architecture. Particularly, along the direction pointed by the AVT, we seek to train the proposed (semi-)supervised transformation equivariant classifiers by maximizing the mutual information of the learned representations with the transformations and labels. In this way, the trained SAT model can not only handle the transformed data through their equivarying representations, but also encode the labeling information through the supervised classifier. The resultant SAT also contains the deterministic model based on the AET as a special case by fixing a deterministic model to representation encoder and the transformation decoder.

The transformation equivariance in the SAT model differs from the data augmentation by transformations in the deep learning literature [2]. First, data augmentation is only applicable to labeled examples for model training and cannot be extended to unlabeled data. This limits its use in semi-supervised learning, which explores the unlabeled data. Second, data augmentation aims to enforce transformation invariance, in which the labels of transformed data are supposed to be invariant. This differs from the motivation to encode the inherent visual structures that equivary under various transformations.

In fact, in the (semi-)supervised transformation equivariant classifiers, we aim to integrate the principles of training both transformation equivariant representations and transformation invariant classifiers seamlessly. Both principles have played a key role in the compelling performance of the CNNs and their modern variants. This is witnessed by the translation equivariant convolutional feature maps and the classifiers built atop them, which are supposed to make transformation-invariant predictions with the spatial pooling and fully connected layers. We will show that the proposed SAT extends the translation equivariance in the CNNs to cover a generic class of transformation equivariance, and also encodes the labels to train the representations and the associated transformation invariant classifiers. We hope this can deepen our understanding of the interplay between transformation equivariance and invariance, both of which play fundamental roles in training robust classifiers with labeled and unlabeled data.

The remainder of this paper is organized as follows. We will review the related works in Section 2. The unsupervised and (semi-)supervised learning of transformation equivariant representations will be presented in the autoencoding transformation framework in Section 3 and Section 4, respectively. We will present experiment results in Section 5 and Section 6 for unsupervised and semi-supervised tasks. We will conclude the paper and discuss the future works in Section 7.

2 Related Works

In this section, we will review the related works on learning transformation-equivariant representation, as well as unsupervised and (semi-)supervised models.

2.1 Transformation-Equivariant Representations

Learning transformation-equivariant representations can be traced back to the seminal work on training capsule nets [6, 1, 7]. The transformation equivariance is characterized by the various directions of capsules, while the confidence of belonging to a particular class is captured by their lengths.

Many efforts have been made in the literature [3, 4, 5] to extend the conventional translation-equivariant convolutions to cover more transformations. Among them are group equivariant convolutions (G-convolution) [3] that have been developed to equivary to more types of transformations. The idea of group equivariance has also been introduced to the capsule nets [5] by ensuring the equivariance of output pose vectors to a group of transformations with a generic routing mechanism. However, the group equivariant convolution is restricted to discrete transformations, which limits its ability to learn the representations equivariant to generic continuous transformations.

2.2 Unsupervised Representation Learning

Auto-Encoders and GANs. Unsupervised auto-encoders have been extensively studied in the literature [8, 9, 10]. Existing auto-encoders are trained by reconstructing input data from the outputs of encoders. A large category of auto-encoder variants has been proposed. Among them is the Variational Auto-Encoder (VAE) [11] that maximizes the lower bound of the data likelihood to train a pair of probabilistic encoder and decoder, while beta-VAE seeks to disentangle representations by introducing an adjustable hyperparameter on the capacity of the latent channel to balance between the independence constraint and the reconstruction accuracy [12]. Denoising auto-encoders [10] attempt to reconstruct noise-corrupted data to learn robust representations, while contractive Auto-Encoders [13] encourage learning representations invariant to small perturbations on data. Along this direction, Hinton et al. [1] propose capsule networks to explore transformation equivariance by minimizing the discrepancy between the reconstructed and target data.

On the other hand, Generative Adversarial Nets (GANs) have also been used to train unsupervised representations. Unlike the auto-encoders, the GANs [14] and their variants [15, 16, 17, 18] generate data from the noises drawn from a simple distribution, with a discriminator trained adversarially to distinguish between real and fake data. The sampled noises can be viewed as the representation of generated data over a manifold, and one can train an encoder by inverting the generator to find the generating noise. This can be implemented by jointly training a pair of mutually inverse generator and encoder [15, 16]. There also exist better generalizable GANs in producing unseen data based on the Lipschitz assumption on the real data distribution [17, 18], which can give rise to more powerful representations of data out of training examples [15, 16, 19]. Compared with the Auto-Encoders, GANs do not rely on learning one-to-one reconstruction of data; instead, they aim to generate the entire distribution of data.

Self-Supervisory Signals. There exist many other unsupervised learning methods using different types of self-supervised signals to train deep networks. Noroozi and Favaro [20] propose to solve Jigsaw puzzles to train a convolutional neural network. Doersch et al. [21] train the network by inferring the relative positions between sampled patches from an image as self-supervised information. Instead, Noroozi et al. [22] count features that satisfy equivalence relations between downsampled and tiled images. Gidaris et al. [23] propose to train RotNets by predicting a discrete set of image rotations, but they are unable to handle generic continuous transformations and their compositions. Dosovitskiy et al. [24] create a set of surrogate classes by applying various transformations to individual images. However, the resultant features could over-discriminate visually similar images as they always belong to different surrogate classes. Unsupervised features have also been learned from videos by estimating the self-motion of moving objects between consecutive frames.


2.3 (Semi-)Supervised Representation Learning

In addition, there exist a large number of semi-supervised models in the literature. Here, we particularly mention three state-of-the-art methods that will be compared in experiments. Temporal ensembling [26] and mean teachers [27] both use an ensemble of teachers to supervise the training of a student model. Temporal ensembling uses the exponential moving average of predictions made by past models on unlabeled data as targets to train the student model. Instead, mean teachers form the teacher from the exponential moving average of the weights of past student models. In contrast, Virtual Adversarial Training (VAT) [28] seeks to minimize the change of predictions on unlabeled examples when their inputs are adversarially perturbed. This could result in a robust model that prefers smooth predictions over unlabeled data.
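The difference between the two ensembling schemes above comes down to what is averaged. The following is a minimal sketch rather than either method's reference implementation: the helper `ema_update`, the decay value, and all numbers are hypothetical, and details such as start-up bias correction and the consistency loss are omitted.

```python
def ema_update(average, new_value, decay):
    """Element-wise exponential moving average of two equal-length lists."""
    return [decay * a + (1.0 - decay) * v for a, v in zip(average, new_value)]


decay = 0.9  # hypothetical smoothing factor

# Temporal ensembling: average the *predictions* made on one unlabeled
# example across training epochs and use the average as the target.
ensemble_target = [0.0, 0.0]
for epoch_prediction in ([0.2, 0.8], [0.4, 0.6]):
    ensemble_target = ema_update(ensemble_target, epoch_prediction, decay)

# Mean teacher: average the *weights* of the student across training
# steps to obtain the teacher that produces the targets.
teacher_weights = [1.0, -1.0]
student_weights = [0.0, 1.0]
teacher_weights = ema_update(teacher_weights, student_weights, decay)
```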

The SAT also differs from transformation-based data augmentation, in which the transformed samples and their labels are used directly as additional training examples [2]. First, in semi-supervised learning, unlabeled examples cannot be directly augmented to form training examples due to their missing labels. Moreover, data augmentation needs to preserve the labels on augmented images, and this prevents us from applying transformations that could severely distort the images (e.g., shearing, rotations with arbitrary angles, and projective transformations) or invalidate the associated labels (e.g., vertically flipping “6” to “9”). In contrast, the SAT avoids using the labels of transformed images to train the classifier directly; instead it attempts to encode the visual structures of images equivariant to various transformations without access to their labels. This leads to a label-blind TER regularizer that explores the unlabeled examples for the semi-supervised problem.

3 Unsupervised Learning of Transformation Equivariant Representations

In this section, we will first present the autoencoding transformation architecture to learn the transformation equivariant representations in a deterministic fashion. Then, a variational alternative approach will be presented to handle the uncertainty in the representation learning by maximizing the mutual information between the learned representations and the applied transformations.

3.1 AET: A Deterministic Model

We begin by defining the notations used in the proposed AutoEncoding Transformation (AET) architecture. Consider a random transformation $\mathbf{t}$ sampled from a transformation distribution $p(\mathbf{t})$ (e.g., warping, projective and homographic transformations), as well as an image $\mathbf{x}$ drawn from a data distribution $p(\mathbf{x})$ in a sample space $\mathcal{X}$. Then the application of $\mathbf{t}$ to $\mathbf{x}$ results in a transformed image $\mathbf{t}(\mathbf{x})$.

The goal of the AET is to learn a representation encoder $E_\theta$ with parameters $\theta$, which maps a sample $\mathbf{x}$ to its representation $E_\theta(\mathbf{x})$ in a linear space $\mathcal{Z}$. For this purpose, one needs to learn a transformation decoder $D_\phi$ with parameters $\phi$ that makes an estimate $\hat{\mathbf{t}}$ of the input transformation $\mathbf{t}$ from the representations of original and transformed samples. Since the transformation decoder takes the encoder outputs rather than original and transformed images, this pushes the encoder to capture the inherent visual structures of images to make a satisfactory estimate of the transformation.

Then the AET can be trained to jointly learn the representation encoder $E_\theta$ and the transformation decoder $D_\phi$. A loss function $\ell(\mathbf{t}, \hat{\mathbf{t}})$ measuring the deviation between a transformation $\mathbf{t}$ and its estimate $\hat{\mathbf{t}}$ is minimized to train the AET over $p(\mathbf{t})$ and $p(\mathbf{x})$:

$$\min_{\theta, \phi} \; \mathbb{E}_{\mathbf{t} \sim p(\mathbf{t}),\, \mathbf{x} \sim p(\mathbf{x})} \, \ell(\mathbf{t}, \hat{\mathbf{t}})$$

where the estimated transformation $\hat{\mathbf{t}}$ can be written as a function of the encoder $E_\theta$ and the decoder $D_\phi$ such that

$$\hat{\mathbf{t}} = D_\phi\left[E_\theta(\mathbf{x}),\, E_\theta(\mathbf{t}(\mathbf{x}))\right]$$

and the expectation is taken over the distributions of transformations and data.

In this way, the encoder $E_\theta$ and the decoder $D_\phi$ can be jointly trained over mini-batches by back-propagating the gradient of the loss $\ell$ to update their parameters.
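The data flow of one AET loss evaluation can be sketched as follows. This is a toy illustration, not the paper's model: the "image" is a small 2-D point set, the transformation is a translation, and `encoder` and `decoder` are fixed hand-written stand-ins (all names are ours) for networks that would in practice be deep and trained by back-propagation. The squared-error loss is likewise an assumed choice.

```python
import random


def apply_translation(points, t):
    """Apply the transformation t = (dx, dy) to every 2-D point."""
    return [(px + t[0], py + t[1]) for px, py in points]


def encoder(points):
    """Toy stand-in for E: represent a point set by its centroid."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def decoder(rep_original, rep_transformed):
    """Toy stand-in for D: estimate t from the pair of representations."""
    return (rep_transformed[0] - rep_original[0],
            rep_transformed[1] - rep_original[1])


def aet_loss(t, t_hat):
    """Mean squared error between a transformation and its estimate."""
    return sum((a - b) ** 2 for a, b in zip(t, t_hat)) / len(t)


rng = random.Random(0)
x = [(rng.random(), rng.random()) for _ in range(5)]   # sample x ~ p(x)
t = (rng.uniform(-1, 1), rng.uniform(-1, 1))           # sample t ~ p(t)

# The decoder only sees the two representations, never the images.
t_hat = decoder(encoder(x), encoder(apply_translation(x, t)))
loss = aet_loss(t, t_hat)
```

Because the centroid moves exactly with a translation, this toy encoder is equivariant and the loss is zero up to rounding; a learned encoder would be pushed toward similar behavior by minimizing the loss.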

3.2 AVT: A Probabilistic Model

Alternatively, we can train transformation equivariant representations to contain as much information as possible about applied transformations to recover them.

(a) Autoencoding Variational Transformations (AVT)
(b) (Semi-)Supervised Autoencoding Transformations (SAT)
Fig. 2: The variational approach to unsupervised learning and (semi-)supervised learning of autoencoding transformations, namely AVT and SAT respectively. The probability $p_\theta(\mathbf{z}|\mathbf{t},\mathbf{x})$ acts as the representation encoder, while $q_\phi(\mathbf{t}|\mathbf{z},\tilde{\mathbf{z}})$ and $q_\phi(y|\tilde{\mathbf{z}})$ play the roles of a transformation and label decoder, respectively. By setting the transformation $\mathbf{t}$ to an identity, the corresponding $\tilde{\mathbf{z}}$ is the representation of an original image.

3.2.1 Notations

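Formally, our goal is to learn an encoder that maps a transformed sample $\mathbf{t}(\mathbf{x})$ to a probabilistic representation with the mean $f_\theta$ and variance $\sigma^2_\theta$. This results in the following probabilistic representation $\mathbf{z}$ of $\mathbf{t}(\mathbf{x})$:

$$\mathbf{z} = f_\theta(\mathbf{t}(\mathbf{x})) + \sigma_\theta(\mathbf{t}(\mathbf{x})) \circ \epsilon \tag{1}$$

where $\epsilon$ is sampled from a normal distribution $p(\epsilon) \triangleq \mathcal{N}(\epsilon \,|\, \mathbf{0}, \mathbf{I})$ with $\circ$ denoting the element-wise product. Thus, the resultant probabilistic representation $\mathbf{z}$ follows a normal distribution

$$p_\theta(\mathbf{z}|\mathbf{t},\mathbf{x}) \triangleq \mathcal{N}\left(\mathbf{z} \,|\, f_\theta(\mathbf{t}(\mathbf{x})),\, \sigma^2_\theta(\mathbf{t}(\mathbf{x}))\right)$$

conditioned on the randomly sampled transformation $\mathbf{t}$ and input data $\mathbf{x}$.

On the other hand, the representation of the original sample $\mathbf{x}$ is a special case when $\mathbf{t}$ is an identity transformation, which is

$$\tilde{\mathbf{z}} = f_\theta(\mathbf{x}) + \sigma_\theta(\mathbf{x}) \circ \tilde{\epsilon} \tag{2}$$

whose mean and variance are computed by using the deep network with the same weights $\theta$, and $\tilde{\epsilon} \sim \mathcal{N}(\tilde{\epsilon} \,|\, \mathbf{0}, \mathbf{I})$.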
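Sampling a representation as a mean plus element-wise scaled Gaussian noise can be sketched in a few lines. The function name `sample_representation` and the example values are ours; parameterizing the variance by its logarithm is an assumption here, chosen because the experiments later mention a log-of-variance output.

```python
import math
import random


def sample_representation(mean, log_var, rng):
    """Draw z = mean + sigma * eps with eps ~ N(0, I), element-wise."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_var)
    ]


rng = random.Random(0)
mean = [0.5, -1.0, 2.0]          # hypothetical encoder mean output
log_var = [-2.0, -2.0, -2.0]     # hypothetical encoder log-variance output

z = sample_representation(mean, log_var, rng)

# With a vanishing variance the sample collapses onto the mean.
z_deterministic = sample_representation(mean, [-1e9] * 3, rng)
```

The same function would be applied to the encoder outputs for the original image to obtain its representation, with an independent noise draw.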

3.2.2 Generalized Transformation Equivariance

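In the conventional definition of transformation equivariance, there should exist an automorphism $\rho(\mathbf{t}) \in \mathrm{Aut}(\mathcal{Z}): \mathcal{Z} \to \mathcal{Z}$ in the representation space, such that the representation $\mathbf{z}$ of a transformed sample and the representation $\tilde{\mathbf{z}}$ of the original sample satisfy

$$\mathbf{z} = [\rho(\mathbf{t})](\tilde{\mathbf{z}}).$$

(The transformation $\mathbf{t}$ in the sample space and the corresponding transformation $\rho(\mathbf{t})$ in the representation space need not be the same, but the representation transformation $\rho(\mathbf{t})$ should be a function of the sample transformation $\mathbf{t}$.)

Here the transformation $\rho(\mathbf{t})$ is independent of the input sample $\mathbf{x}$. In other words, the representation $\mathbf{z}$ of a transformed sample is completely determined by the original representation $\tilde{\mathbf{z}}$ and the applied transformation $\mathbf{t}$ with no need to access the sample $\mathbf{x}$. This is called the steerability property in the literature [4], which enables us to compute $\mathbf{z}$ by applying the sample-independent transformation $\rho(\mathbf{t})$ directly to the original representation $\tilde{\mathbf{z}}$.

This property can be generalized without relying on the linear group representations of transformations through automorphisms. Instead of sticking with a linear $\rho(\mathbf{t})$, one can seek a more general relation between $\mathbf{z}$ and $(\tilde{\mathbf{z}}, \mathbf{t})$, independently of $\mathbf{x}$. From an information-theoretic point of view, this requires that $(\tilde{\mathbf{z}}, \mathbf{t})$ jointly contain all necessary information about $\mathbf{z}$ so that $\mathbf{z}$ can be best estimated from them without direct access to $\mathbf{x}$.

This leads us to maximize the mutual information $I_\theta(\mathbf{z}; \tilde{\mathbf{z}}, \mathbf{t})$ to learn the generalized transformation equivariant representations. Indeed, by the chain rule and the nonnegativity of mutual information, we have

$$I_\theta(\mathbf{z}; \tilde{\mathbf{z}}, \mathbf{t}) = I_\theta(\mathbf{z}; \tilde{\mathbf{z}}, \mathbf{t}, \mathbf{x}) - I_\theta(\mathbf{z}; \mathbf{x} \,|\, \tilde{\mathbf{z}}, \mathbf{t}) \le I_\theta(\mathbf{z}; \tilde{\mathbf{z}}, \mathbf{t}, \mathbf{x}),$$

which shows that $I_\theta(\mathbf{z}; \tilde{\mathbf{z}}, \mathbf{t})$ is upper bounded by the mutual information $I_\theta(\mathbf{z}; \tilde{\mathbf{z}}, \mathbf{t}, \mathbf{x})$ between $\mathbf{z}$ and $(\tilde{\mathbf{z}}, \mathbf{t}, \mathbf{x})$.

Clearly, when $I_\theta(\mathbf{z}; \mathbf{x} \,|\, \tilde{\mathbf{z}}, \mathbf{t}) = 0$, $I_\theta(\mathbf{z}; \tilde{\mathbf{z}}, \mathbf{t})$ attains the maximum value of its upper bound $I_\theta(\mathbf{z}; \tilde{\mathbf{z}}, \mathbf{t}, \mathbf{x})$. In this case, $\mathbf{x}$ would provide no more information about $\mathbf{z}$ than $(\tilde{\mathbf{z}}, \mathbf{t})$, which implies one can estimate $\mathbf{z}$ directly from $(\tilde{\mathbf{z}}, \mathbf{t})$ without accessing $\mathbf{x}$. Thus, we propose to solve

$$\max_\theta \; I_\theta(\mathbf{z}; \tilde{\mathbf{z}}, \mathbf{t})$$

to learn the probabilistic encoder in pursuit of such a generalized TER.

However, a direct maximization of the above mutual information needs to evaluate an intractable posterior $p_\theta(\mathbf{t}|\mathbf{z}, \tilde{\mathbf{z}})$ of the transformation. Thus, we instead lower bound the mutual information by introducing a surrogate decoder $q_\phi(\mathbf{t}|\mathbf{z}, \tilde{\mathbf{z}})$ with the parameters $\phi$ to approximate the true posterior.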

3.2.3 Variational Approach

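Unlike the variational autoencoder that lower-bounds the data likelihood [11], we directly take a lower bound of the mutual information [29] between $\mathbf{z}$ and $(\tilde{\mathbf{z}}, \mathbf{t})$ below

$$\begin{aligned}
I_\theta(\mathbf{z}; \tilde{\mathbf{z}}, \mathbf{t})
&= I_\theta(\mathbf{z}; \tilde{\mathbf{z}}) + I_\theta(\mathbf{z}; \mathbf{t} \,|\, \tilde{\mathbf{z}}) \ge I_\theta(\mathbf{z}; \mathbf{t} \,|\, \tilde{\mathbf{z}}) \\
&= H(\mathbf{t}|\tilde{\mathbf{z}}) - H(\mathbf{t}|\mathbf{z}, \tilde{\mathbf{z}}) \\
&= H(\mathbf{t}|\tilde{\mathbf{z}}) + \mathbb{E}_{p_\theta(\mathbf{t}, \mathbf{z}, \tilde{\mathbf{z}})} \log p_\theta(\mathbf{t}|\mathbf{z}, \tilde{\mathbf{z}}) \\
&= H(\mathbf{t}|\tilde{\mathbf{z}}) + \mathbb{E}_{p_\theta(\mathbf{t}, \mathbf{z}, \tilde{\mathbf{z}})} \log q_\phi(\mathbf{t}|\mathbf{z}, \tilde{\mathbf{z}}) + \mathbb{E}_{p_\theta(\mathbf{z}, \tilde{\mathbf{z}})} D\left(p_\theta(\mathbf{t}|\mathbf{z}, \tilde{\mathbf{z}}) \,\|\, q_\phi(\mathbf{t}|\mathbf{z}, \tilde{\mathbf{z}})\right) \\
&\ge H(\mathbf{t}|\tilde{\mathbf{z}}) + \mathbb{E}_{p_\theta(\mathbf{t}, \mathbf{z}, \tilde{\mathbf{z}})} \log q_\phi(\mathbf{t}|\mathbf{z}, \tilde{\mathbf{z}}) \triangleq \tilde{I}_{\theta,\phi}(\mathbf{z}; \tilde{\mathbf{z}}, \mathbf{t})
\end{aligned}$$

where $H(\cdot)$ denotes the (conditional) entropy, and $D\left(p_\theta(\mathbf{t}|\mathbf{z}, \tilde{\mathbf{z}}) \,\|\, q_\phi(\mathbf{t}|\mathbf{z}, \tilde{\mathbf{z}})\right)$ is the non-negative Kullback-Leibler divergence between $p_\theta(\mathbf{t}|\mathbf{z}, \tilde{\mathbf{z}})$ and $q_\phi(\mathbf{t}|\mathbf{z}, \tilde{\mathbf{z}})$.

We choose to maximize the lower variational bound $\tilde{I}_{\theta,\phi}(\mathbf{z}; \tilde{\mathbf{z}}, \mathbf{t})$. Since $H(\mathbf{t}|\tilde{\mathbf{z}})$ is nonnegative and independent of the model parameters $\theta$ and $\phi$, we choose to solve

$$\max_{\theta,\phi} \; \mathcal{L}^{\mathrm{unsup}}_{\theta,\phi} \triangleq \mathbb{E}_{p_\theta(\mathbf{t}, \mathbf{z}, \tilde{\mathbf{z}})} \log q_\phi(\mathbf{t}|\mathbf{z}, \tilde{\mathbf{z}}) = \mathbb{E}_{p(\mathbf{t}),\, p(\mathbf{x})} \, \mathbb{E}_{p(\epsilon),\, p(\tilde{\epsilon})} \log q_\phi(\mathbf{t}|\mathbf{z}, \tilde{\mathbf{z}}) \tag{3}$$

to learn $\theta$ and $\phi$ under the expectation over $p_\theta(\mathbf{t}, \mathbf{z}, \tilde{\mathbf{z}})$, and the equality follows from the generative process for the representations in Eqs. (1)–(2).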

3.2.4 Variational Transformation Decoder

To estimate a family of continuous transformations, we choose a normal distribution $\mathcal{N}\left(\mathbf{t} \,|\, d_\phi(\mathbf{z}, \tilde{\mathbf{z}}),\, \sigma^2_\phi(\mathbf{z}, \tilde{\mathbf{z}})\right)$ as the posterior $q_\phi(\mathbf{t}|\mathbf{z}, \tilde{\mathbf{z}})$ of the transformation decoder, where the mean $d_\phi(\mathbf{z}, \tilde{\mathbf{z}})$ and variance $\sigma^2_\phi(\mathbf{z}, \tilde{\mathbf{z}})$ are each implemented by a deep network.

For categorical transformations (e.g., horizontal vs. vertical flips, and rotations of different directions), a categorical distribution $\mathrm{Cat}\left(\mathbf{t} \,|\, \pi_\phi(\mathbf{z}, \tilde{\mathbf{z}})\right)$ can be adopted as the posterior $q_\phi(\mathbf{t}|\mathbf{z}, \tilde{\mathbf{z}})$, where each entry of $\pi_\phi(\mathbf{z}, \tilde{\mathbf{z}})$ is the probability mass for a transformation type. A hybrid distribution can also be defined to combine multiple continuous and categorical transformations, making the variational transformation decoder more flexible and appealing in handling complex transformations.

The posterior $q_\phi(\mathbf{t}|\mathbf{z}, \tilde{\mathbf{z}})$ of the transformation is a function of the representations of the original and transformed images. Thus, a natural choice is to use a Siamese encoder network with shared weights to output the representations of original and transformed samples, and construct the transformation decoder atop the concatenated representations. Figure 2(a) illustrates the architecture of the AVT network.

Finally, it is not hard to see that the deterministic AET model would be viewed as a special case of the AVT, if the probabilistic representation encoder and transformation decoder were set to deterministic functions.
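The Gaussian, categorical, and hybrid transformation decoders described in this subsection each contribute a log-likelihood term to the training objective. The sketch below shows one way these terms could be evaluated for a single example. All function names and numbers are ours, the diagonal Gaussian with a log-variance output is an assumption, and the hybrid form additionally assumes the continuous and categorical parts are independent given the representations.

```python
import math


def gaussian_log_prob(t, mean, log_var):
    """log N(t | mean, diag(exp(log_var))) for continuous parameters."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi) + lv + (ti - m) ** 2 / math.exp(lv))
        for ti, m, lv in zip(t, mean, log_var)
    )


def categorical_log_prob(index, logits):
    """log-softmax of `logits` evaluated at the true transformation type."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return logits[index] - log_norm


def hybrid_log_prob(t_cont, mean, log_var, t_type, logits):
    """Sum of the two terms, assuming the parts are independent."""
    return (gaussian_log_prob(t_cont, mean, log_var)
            + categorical_log_prob(t_type, logits))


# Hypothetical decoder outputs for one pair of representations.
log_q = hybrid_log_prob(
    t_cont=[0.1, -0.3], mean=[0.0, -0.2], log_var=[0.0, 0.0],
    t_type=1, logits=[0.0, 0.0, 0.0, 0.0],
)
```

Maximizing such a log-likelihood over mini-batches corresponds to maximizing the expected log-posterior of the surrogate decoder.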

4 (Semi-)Supervised Learning of Transformation Equivariant Representations

Autoencoding transformations can act as the basic representation block in many learning problems. In this section, we present its role in (semi-)supervised learning tasks to enable more accurate classification of samples by capturing their transformation equivariant representations.

4.1 SAT: (Semi-)Supervised Autoencoding Transformations

The unsupervised learning of autoencoding transformations can be generalized to (semi-)supervised cases with labeled samples. Accordingly, the goal is formulated as learning of representations that contain as much (mutual) information as possible about not only applied transformations but also data labels.

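Given a labeled sample $(\mathbf{x}, y)$, we can define the joint distribution over the representation, transformation and label,

$$p_\theta(y, \mathbf{t}, \mathbf{z} \,|\, \mathbf{x}) = p(\mathbf{t}) \, p_\theta(\mathbf{z}|\mathbf{t}, \mathbf{x}) \, p(y|\mathbf{x})$$

where we have assumed that $y$ is independent of $\mathbf{t}$ and $\mathbf{z}$ once the sample $\mathbf{x}$ is given.

In the presence of sample labels, the pursuit of transformation equivariant representations can be performed by maximizing the joint mutual information $I_\theta(y, \mathbf{z}; \mathbf{t}, \tilde{\mathbf{z}})$ such that the representation $\tilde{\mathbf{z}}$ of the original sample and the transformation $\mathbf{t}$ contain sufficient information to classify the label $y$ as well as learn the representation $\mathbf{z}$ equivariant to the transformed sample.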

Like in (3) for the unsupervised case, the joint mutual information can be lower bounded in the following way,

where the first two equalities apply the chain rule of mutual information, and the first inequality uses the nonnegativity of the mutual information. In particular, we usually have , which means the transformation should not change the label of a sample (i.e., transformation invariance of sample labels). The second inequality follows the variational bound we derived earlier in the last section.
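One chain of steps that fits the description above (two applications of the chain rule, a nonnegativity step, and a variational step) is sketched below. It is our own reconstruction under stated assumptions rather than the authors' exact derivation. Here $\mathbf{z}$ and $\tilde{\mathbf{z}}$ denote the representations of the transformed and original samples, $\mathbf{t}$ the transformation, $y$ the label, and $q_\phi$ the surrogate decoders; the label decoder is assumed to depend on $\tilde{\mathbf{z}}$ only.

```latex
\begin{aligned}
I_\theta(y, \mathbf{z}; \mathbf{t}, \tilde{\mathbf{z}})
&= I_\theta(y, \mathbf{z}; \tilde{\mathbf{z}})
 + I_\theta(y, \mathbf{z}; \mathbf{t} \mid \tilde{\mathbf{z}}) \\
&= \big[ I_\theta(y; \tilde{\mathbf{z}})
 + I_\theta(\mathbf{z}; \tilde{\mathbf{z}} \mid y) \big]
 + \big[ I_\theta(\mathbf{z}; \mathbf{t} \mid \tilde{\mathbf{z}})
 + I_\theta(y; \mathbf{t} \mid \mathbf{z}, \tilde{\mathbf{z}}) \big] \\
&\ge I_\theta(y; \tilde{\mathbf{z}})
 + I_\theta(\mathbf{z}; \mathbf{t} \mid \tilde{\mathbf{z}}) \\
&\ge H(y) + \mathbb{E}_{p_\theta(y, \tilde{\mathbf{z}})}
   \log q_\phi(y \mid \tilde{\mathbf{z}})
 + H(\mathbf{t} \mid \tilde{\mathbf{z}})
 + \mathbb{E}_{p_\theta(\mathbf{t}, \mathbf{z}, \tilde{\mathbf{z}})}
   \log q_\phi(\mathbf{t} \mid \mathbf{z}, \tilde{\mathbf{z}})
\end{aligned}
```

In this version, the first inequality drops the nonnegative terms $I_\theta(\mathbf{z}; \tilde{\mathbf{z}} \mid y)$ and $I_\theta(y; \mathbf{t} \mid \mathbf{z}, \tilde{\mathbf{z}})$; the latter vanishes when the transformation carries no additional information about the label. The second inequality applies the same variational argument as in the unsupervised case to each remaining term.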

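One can also assume that the surrogate posterior of labels can be simplified to $q_\phi(y|\tilde{\mathbf{z}})$, since the representation $\tilde{\mathbf{z}}$ of the original sample is supposed to provide sufficient information to predict the label.

Since the entropy terms in the bound are independent of the model parameters $\theta$ and $\phi$, we maximize the following variational lower bound

$$\max_{\theta,\phi} \; \mathcal{L}^{\mathrm{sup}}_{\theta,\phi} \triangleq \mathbb{E}_{p(\mathbf{t}),\, p(\mathbf{x}),\, p(y|\mathbf{x})} \, \mathbb{E}_{p(\epsilon),\, p(\tilde{\epsilon})} \left[ \log q_\phi(\mathbf{t}|\mathbf{z}, \tilde{\mathbf{z}}) + \log q_\phi(y|\tilde{\mathbf{z}}) \right] \tag{4}$$

where $\mathbf{z}$ and $\tilde{\mathbf{z}}$ are sampled by following Eqs. (1)–(2), and the ground truth $y$ is sampled from the label distribution $p(y|\mathbf{x})$ directly.

Furthermore, a semi-supervised model can be trained by combining the unsupervised and supervised objectives (3) and (4)

$$\max_{\theta,\phi} \; \mathcal{L}^{\mathrm{unsup}}_{\theta,\phi} + \lambda \, \mathcal{L}^{\mathrm{sup}}_{\theta,\phi}$$

with a positive balancing coefficient $\lambda$. This enables the model to jointly explore labeled and unlabeled examples and their representations equivariant to various transformations.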
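Combining the two objectives on a mini-batch amounts to a weighted sum of two averages. The following sketch uses hypothetical names and numbers; which of the two terms carries the balancing coefficient is an assumption here, as is averaging the unsupervised term over all examples and the supervised term over labeled examples only.

```python
def semi_supervised_objective(unsup_values, sup_values, balance):
    """Average unsupervised objective plus `balance` times the supervised one.

    unsup_values: per-example values of the unsupervised objective,
                  available for labeled and unlabeled examples alike.
    sup_values:   per-example values of the supervised objective,
                  available for labeled examples only.
    """
    unsup = sum(unsup_values) / len(unsup_values)
    sup = sum(sup_values) / len(sup_values)
    return unsup + balance * sup


# Hypothetical per-example log-likelihood values from the decoders.
objective = semi_supervised_objective(
    unsup_values=[-1.0, -2.0, -3.0], sup_values=[-0.5, -1.5], balance=0.5,
)
```

In practice the negated objective would be minimized with a gradient-based optimizer.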

We will demonstrate that the SAT can achieve superior performance to the existing state-of-the-art (semi-)supervised models. Moreover, the competitive performance also shows the great potential of the model as the basic representation block in many machine learning and computer vision tasks.

Figure 2(b) illustrates the architecture of the SAT model, in a comparison with its AVT counterpart. Particularly, in the SAT, the transformation and label decoders are jointly trained atop the representation encoder.

5 Experiments: Unsupervised Learning

In this section, we compare the proposed deterministic AET and probabilistic AVT models against the other unsupervised methods on the CIFAR-10, ImageNet and Places datasets. The evaluation follows the protocols widely adopted by many existing unsupervised methods by applying the learned representations to downstream tasks.

5.1 CIFAR-10 Experiments

First, we evaluate the AET and AVT models on the CIFAR-10 dataset.

5.1.1 Experiment Settings

Architecture. To make a fair and direct comparison with existing models, the Network-In-Network (NIN) is adopted on the CIFAR-10 dataset for the unsupervised learning task [23, 30]. The NIN consists of four convolutional blocks, each of which contains three convolutional layers. Both AET and AVT have two NIN branches with shared weights, which take the original and transformed images as input, respectively. The output features of the fourth block of the two branches are concatenated and average-pooled to form a -d feature vector. Then an output layer follows to output the predicted transformation for the AET, and the mean and the log-of-variance of the predicted transformation for the AVT, with the logarithm scaling the variance to a real value.

The first two blocks of each branch are used as the encoder network to output the deterministic representation for the AET, and the mean of the probabilistic representation for the AVT. An additional convolution followed by a batch normalization layer is added upon the encoder to produce the log-of-variance.


Implementation Details Both the AET and the AVT networks are trained by the SGD with a batch size of original images and their transformed versions. Momentum and weight decay are set to and . For the AET, the learning rate is initialized to and scheduled to drop by a factor of after , , , and epochs. The network is trained for a total of epochs. The AVT network is trained for epochs, and its learning rate is initialized to . Then it is gradually decayed to from epochs after it is increased to at the epoch .

In the AVT, a single representation is randomly sampled from the encoder $p_\theta(\mathbf{z}|\mathbf{t},\mathbf{x})$, which is fed into the decoder $q_\phi(\mathbf{t}|\mathbf{z},\tilde{\mathbf{z}})$. To fully exploit the uncertainty of the representations, five samples are drawn and averaged as the representation of an image to train the downstream classifiers. We found that averaging randomly sampled representations could outperform using only the mean of the representation.
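Averaging several sampled representations, as described above, can be sketched as follows. The function name and the example encoder outputs are hypothetical, and the log-variance parameterization is an assumption carried over from the architecture description.

```python
import math
import random


def averaged_representation(mean, log_var, num_samples, rng):
    """Average `num_samples` draws of z = mean + sigma * eps, element-wise."""
    total = [0.0] * len(mean)
    for _ in range(num_samples):
        for i, (m, lv) in enumerate(zip(mean, log_var)):
            total[i] += m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
    return [value / num_samples for value in total]


rng = random.Random(0)
mean = [1.0, -2.0]        # hypothetical encoder mean for one image
log_var = [-4.0, -4.0]    # hypothetical encoder log-variance

# Five draws are averaged to form the feature for a downstream classifier.
feature = averaged_representation(mean, log_var, num_samples=5, rng=rng)
```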

Applied Transformations Two types of transformations are considered for model training. One is the affine transformation. It is a composition of a random rotation with , a random translation by of image height and width in both vertical and horizontal directions, and a random scaling factor of , along with a random shearing of degree. The other is the projective transformation, which is formed by randomly translating four corners of an image in both horizontal and vertical directions by of its height and width, after it is randomly scaled by and rotated by or .
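A random affine transformation of the kind described above can be composed from its elementary parts in homogeneous coordinates. The sketch below is illustrative only: the sampling ranges in `sample_affine` are placeholders and not the values used in the experiments, and the order in which the parts are composed is likewise an assumption.

```python
import math
import random


def matmul3(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def affine_matrix(angle_deg, tx, ty, scale, shear_deg):
    """Compose rotation, scaling, shearing and translation (homogeneous)."""
    a, s = math.radians(angle_deg), math.tan(math.radians(shear_deg))
    rotation = [[math.cos(a), -math.sin(a), 0.0],
                [math.sin(a), math.cos(a), 0.0],
                [0.0, 0.0, 1.0]]
    scaling = [[scale, 0.0, 0.0], [0.0, scale, 0.0], [0.0, 0.0, 1.0]]
    shearing = [[1.0, s, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    translation = [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]
    return matmul3(translation, matmul3(rotation, matmul3(scaling, shearing)))


def sample_affine(rng):
    """Sample parameters from placeholder ranges (NOT the paper's values)."""
    return affine_matrix(
        angle_deg=rng.uniform(-180.0, 180.0),
        tx=rng.uniform(-0.2, 0.2), ty=rng.uniform(-0.2, 0.2),
        scale=rng.uniform(0.7, 1.3), shear_deg=rng.uniform(-20.0, 20.0),
    )


t = sample_affine(random.Random(0))
identity = affine_matrix(0.0, 0.0, 0.0, 1.0, 0.0)
```

The entries of such a matrix are the kind of low-dimensional target a transformation decoder would be asked to predict.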

5.1.2 Results

| Method | Error rate (%) |
|---|---|
| Supervised NIN [23] (Upper Bound) | 7.20 |
| Random Init. + conv [23] (Lower Bound) | 27.50 |
| Roto-Scat + SVM [31] | 17.7 |
| ExamplarCNN [24] | 15.7 |
| DCGAN [32] | 17.2 |
| Scattering [33] | 15.3 |
| RotNet + non-linear [23] | 10.94 |
| RotNet + conv [23] | 8.84 |
| AET-affine + non-linear | 9.77 |
| AET-affine + conv | 8.05 |
| AET-project + non-linear | 9.41 |
| AET-project + conv | 7.82 |
| AVT-project + non-linear | 8.96 |
| AVT-project + conv | 7.75 |

TABLE I: Comparison between unsupervised feature learning methods on CIFAR-10. The fully supervised NIN and the Random Init. + conv have the same three-block NIN architecture, but the first is fully supervised while the second is trained on top of the first two blocks that are randomly initialized and stay frozen during training.

Comparison with Other Methods. To evaluate the effectiveness of a learned unsupervised representation, a classifier is usually trained upon it. In our experiments, we follow the existing evaluation protocols [31, 24, 32, 33, 23] by building a classifier on top of the second convolutional block.

First, we evaluate the classification results by using the AET and AVT representations with both model-based and model-free classifiers. For the model-based classifier, we follow [23] by training a non-linear classifier with three Fully-Connected (FC) layers – each of the two hidden layers has neurons with batch-normalization and ReLU activations, and the output layer is a soft-max layer with ten neurons, one for each image class. We also test a convolutional classifier upon the unsupervised features by adding a third NIN block whose output feature map is average-pooled and connected to a linear soft-max classifier.

Table I compares both fully supervised and unsupervised methods on CIFAR-10. The unsupervised AET and AVT with the convolutional classifier achieve almost the same error rates as their fully supervised NIN counterpart with four convolutional blocks (7.82% and 7.75% vs. 7.20%).

Method 1 FC 2 FC 3 FC conv
RotNet [23] 18.21 11.34 10.94 8.84
AET-affine 17.16 9.77 10.16 8.05
AET-project 16.65 9.41 9.92 7.82
AVT-project 16.19 8.96 9.55 7.75
TABLE II: Error rates of different classifiers on CIFAR 10.

We also compare the models when trained with varying numbers of FC layers in Table II. The results show that the AVT, followed by the AET, consistently achieves the smallest errors no matter which classifier is used.

Method 3 5 10 15 20
RotNet [23] 25.67 25.01 24.97 25.85 26.00
AET-affine 24.88 23.29 23.07 23.34 23.94
AET-project 23.29 22.40 22.39 23.32 23.73
AVT-project 22.46 21.62 23.7 22.16 21.51
TABLE III: The comparison of the KNN error rates by different models with varying numbers of nearest neighbors on CIFAR-10.

We also note that the probabilistic AVT outperforms the deterministic AET in experiments. This is likely due to the ability of the AVT to model the uncertainty of representations when training the downstream classifiers. We also find that the projective transformation performs better than the affine transformation when they are used to train the AET, and thus we mainly use the projective transformation to train the AVT.

Comparison based on Model-free KNN Classifiers. We also test the model-free KNN classifier based on the average-pooled feature representations from the second convolutional block. The KNN classifier is model-free in that no classifier needs to be trained from labeled examples. This enables a direct evaluation of the quality of the learned features. Table III reports the KNN results with varying numbers of nearest neighbors. Again, both the AET and the AVT representations outperform the compared model across the different numbers of nearest neighbors used for classification.
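To make the model-free evaluation concrete, the following minimal sketch classifies a query feature by majority vote among its K nearest training features and reports the error rate in percent. The Euclidean metric, the majority-vote rule, and the toy 2-d features are assumptions made for illustration; the text does not state which distance is used on the pooled features.

```python
import math
from collections import Counter

def knn_predict(train_feats, train_labels, query, k=5):
    """Predict a label by majority vote among the k nearest training features."""
    dists = [(math.dist(f, query), y) for f, y in zip(train_feats, train_labels)]
    dists.sort(key=lambda pair: pair[0])
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

def knn_error_rate(train_feats, train_labels, test_feats, test_labels, k=5):
    """Percentage of test features whose predicted label is wrong."""
    wrong = sum(knn_predict(train_feats, train_labels, q, k) != y
                for q, y in zip(test_feats, test_labels))
    return 100.0 * wrong / len(test_labels)

# Toy stand-ins for average-pooled features (hypothetical data).
train_feats = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
               (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
train_labels = [0, 0, 0, 1, 1, 1]
test_feats = [(0.05, 0.05), (0.95, 0.95)]
test_labels = [0, 1]

err = knn_error_rate(train_feats, train_labels, test_feats, test_labels, k=3)
```

Because no parameters are fitted, differences in this error rate reflect only how well the frozen features separate the classes.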

Comparison with Few Labeled Data. We also conduct experiments in which only a small number of labeled examples are used to train the downstream classifiers on the learned representations. Table IV reports the results of different models on CIFAR-10. Both the AET and AVT outperform the fully supervised models as well as the other unsupervised models when only a few labeled examples ( samples per class) are available.

Method 20 100 400 1000 5000
Supervised conv 66.34 52.74 25.81 16.53 6.93
Supervised non-linear 65.03 51.13 27.17 16.13 7.92
RotNet + conv [23] 35.37 24.72 17.16 13.57 8.05
AET-project + conv 34.83 24.35 16.28 12.58 7.82
AET-project + non-linear 37.13 25.19 18.32 14.27 9.41
AVT-project + conv 35.44 24.26 15.97 12.27 7.75
AVT-project + non-linear 37.62 25.01 17.95 14.14 8.96
TABLE IV: Error rates on CIFAR-10 when different numbers of samples per class are used to train the downstream classifiers. A third convolutional block is trained with the labeled examples on top of the first two NIN blocks of unsupervised representations trained with all unlabeled data. We also compare with the fully supervised models when they are trained with the labeled examples from scratch.

5.2 ImageNet Experiments

We further evaluate the performance of the AET and AVT on the ImageNet dataset.

5.2.1 Architectures and Training Details

For a fair comparison with the existing methods [20, 34, 23], two AlexNet branches with shared parameters are created, taking the original and transformed images as inputs, respectively, to train unsupervised models. The -d output features from the second-to-last fully connected layer in each branch are concatenated and fed into the transformation decoder. We still use SGD to train the network, with a batch size of images and the transformed counterparts, a momentum of , and a weight decay of .

For the AET model, the initial learning rate is set to , and it is dropped by a factor of at epoch 100 and 150. The model is trained for epochs in total. For the AVT, the initial learning rate is set to , and it is dropped by a factor of at epoch 300 and 350. The AVT is trained for epochs in total. We still use the average over five samples from the encoder outputs to train the downstream classifiers to evaluate the AVT. Since the projective transformation has shown better performances, we adopt it for the experiments on ImageNet.

5.2.2 Results

Method Conv4 Conv5
Supervised from [34](Upper Bound) 59.7 59.7
Random from [20] (Lower Bound) 27.1 12.0
Tracking [35] 38.8 29.8
Context [21] 45.6 30.4
Colorization [36] 40.7 35.2
Jigsaw Puzzles [20] 45.3 34.6
BIGAN [15] 41.9 32.2
NAT [34] - 36.0
DeepCluster [37] - 44.0
RotNet [23] 50.0 43.8
AET-project 53.2 47.0
AVT-project 54.2 48.4
TABLE V: Top-1 accuracy with non-linear layers on ImageNet. AlexNet is used as backbone to train the unsupervised models. After unsupervised features are learned, nonlinear classifiers are trained on top of Conv4 and Conv5 layers with labeled examples to compare their performances. We also compare with the fully supervised models and random models that give upper and lower bounded performances. For a fair comparison, only a single crop is applied and no dropout or local response normalization is applied during the testing.
Method Conv1 Conv2 Conv3 Conv4 Conv5
ImageNet labels(Upper Bound) 19.3 36.3 44.2 48.3 50.5
Random (Lower Bound) 11.6 17.1 16.9 16.3 14.1
Random rescaled [38] 17.5 23.0 24.5 23.2 20.6
Context [21] 16.2 23.3 30.2 31.7 29.6
Context Encoders [39] 14.1 20.7 21.0 19.8 15.5
Colorization[36] 12.5 24.5 30.4 31.5 30.3
Jigsaw Puzzles [20] 18.2 28.8 34.0 33.9 27.1
BIGAN [15] 17.7 24.5 31.0 29.9 28.0
Split-Brain [40] 17.7 29.3 35.4 35.2 32.8
Counting [40] 18.0 30.6 34.3 32.5 25.7
RotNet [23] 18.8 31.7 38.7 38.2 36.5
AET-project 19.2 32.8 40.6 39.7 37.7
AVT-project 19.5 33.6 41.3 40.3 39.1
DeepCluster* [37] 13.4 32.3 41.0 39.6 38.2
AET-project* 19.3 35.4 44.0 43.6 42.4
AVT-project* 20.9 36.1 44.4 44.3 43.5
TABLE VI: Top-1 accuracy with linear layers on ImageNet. AlexNet is used as backbone to train the unsupervised models under comparison. A -way linear classifier is trained upon various convolutional layers of feature maps that are spatially resized to have about elements. Fully supervised and random models are also reported to show the upper and the lower bounds of unsupervised model performances. Only a single crop is used and no dropout or local response normalization is used during testing, except the models denoted with * where ten crops are applied to compare results.
Method Conv1 Conv2 Conv3 Conv4 Conv5
Places labels(Upper Bound)[41] 22.1 35.1 40.2 43.3 44.6
ImageNet labels 22.7 34.8 38.4 39.4 38.7
Random (Lower Bound) 15.7 20.3 19.8 19.1 17.5
Random rescaled [38] 21.4 26.2 27.1 26.1 24.0
Context [21] 19.7 26.7 31.9 32.7 30.9
Context Encoders [39] 18.2 23.2 23.4 21.9 18.4
Colorization[36] 16.0 25.7 29.6 30.3 29.7
Jigsaw Puzzles [20] 23.0 31.9 35.0 34.2 29.3
BIGAN [15] 22.0 28.7 31.8 31.3 29.7
Split-Brain [40] 21.3 30.7 34.0 34.1 32.5
Counting [40] 23.3 33.9 36.3 34.7 29.6
RotNet [23] 21.5 31.0 35.1 34.6 33.7
AET-project 22.1 32.9 37.1 36.2 34.7
AVT-project 22.3 33.1 37.8 36.7 35.6
TABLE VII: Top-1 accuracy on the Places dataset. A -way logistic regression classifier is trained on top of various layers of feature maps that are spatially resized to have about elements. All unsupervised features are pre-trained on the ImageNet dataset, and then frozen when training the logistic regression classifiers with Places labels. We also compare with fully-supervised networks trained with Places Labels and ImageNet labels, as well as with random models. The highest accuracy values are in bold and the second highest accuracy values are underlined.

Table V reports the Top-1 accuracies of the compared methods on ImageNet by following the evaluation protocol in [20]. Two settings are adopted for evaluation, where Conv4 and Conv5 denote training the remaining part of AlexNet on top of the Conv4 and Conv5 layers, respectively, with the labeled data. All the bottom convolutional layers up to Conv4 or Conv5 are frozen after they are trained in an unsupervised fashion. In both settings, the AVT model consistently outperforms the other unsupervised models, including the AET.

We also compare with the fully supervised models that give the upper bound of the classification performance by training the AlexNet with all labeled data end-to-end. The classifiers of random models are trained on top of Conv4 and Conv5 whose weights are randomly sampled, which sets the lower bound of the performance. By comparison, the proposed models narrow the performance gap to the upper-bound supervised models from 9.7% and 15.7% by RotNet and DeepCluster on Conv4 and Conv5, to 6.5% and 12.7% by the AET, and to 5.5% and 11.3% by the AVT.

Moreover, we also follow the testing protocol adopted in [40] to compare the models by training a -way linear classifier on top of different numbers of convolutional layers in Table VI. Again, the AVT consistently outperforms all the compared unsupervised models in terms of the Top-1 accuracy.

5.3 Places Experiments

We also compare different models on the Places dataset. Table VII reports the results. Unsupervised models are pretrained on the ImageNet dataset, and a linear logistic regression classifier is trained on top of different layers of convolutional feature maps with Places labels. This assesses the generalizability of unsupervised features from one dataset to another. The models are still based on AlexNet variants. We compare with the fully supervised models trained with the Places labels and ImageNet labels respectively, as well as with the random networks. Both the AET and the AVT models outperform the other unsupervised models, except that they perform slightly worse than Counting [40] with the shallow representations from Conv1 and Conv2.

6 Experiments: (Semi-)Supervised Learning

We compare the proposed SAT model with the other state-of-the-art semi-supervised methods in this section. For the sake of fair comparison, we follow the test protocol used in literature [27, 26] on both CIFAR-10 [42] and SVHN [43], which are widely used as the benchmark datasets to evaluate the semi-supervised models.

6.1 Network Architecture and Implementation Details

Network Architecture For the sake of a fair comparison, a 13-layer convolutional neural network, which has been widely used in existing semi-supervised models [26, 27, 28], is adopted as the backbone to build the SAT. It consists of three convolutional blocks, each of which contains three convolution layers. The SAT has two branches of such three blocks with shared weights, each taking the original and transformed images as input, respectively. The output feature maps from the third blocks of two branches are concatenated and average-pooled, resulting in a -d feature vector. A fully-connected layer follows to predict the mean and the log-of-variance of the transformation. The first two blocks are used as the encoder to output the mean of the representation, upon which an additional convolution layer with batch normalization is added to compute the log-of-variance .

In addition, a classifier head is built on the representation from the encoder. Specifically, we draw five random representations of an input image, and feed their average to the classifier. The classifier head has the same structure as the third convolutional block but its weights differ from the Siamese branches of transformation decoder. The output feature map of this convolutional block is globally average-pooled to -d feature vector, and a softmax fully connected layer follows to predict the image label.
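The sampling-and-averaging step can be sketched as follows, assuming the encoder outputs a per-dimension mean and log-variance of a Gaussian and that samples are drawn as mean plus scaled standard normal noise. The function names and the toy 3-d representation are hypothetical; only the number of samples (five) comes from the text.

```python
import math
import random

def sample_representation(mean, log_var, rng):
    """Draw z = mean + sigma * eps with eps ~ N(0, 1), where sigma = exp(log_var / 2)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]

def averaged_representation(mean, log_var, rng, num_samples=5):
    """Average several sampled representations, dimension by dimension."""
    samples = [sample_representation(mean, log_var, rng)
               for _ in range(num_samples)]
    return [sum(dim_values) / num_samples for dim_values in zip(*samples)]

rng = random.Random(0)
mean = [0.5, -1.0, 2.0]          # toy encoder mean (illustrative)
log_var = [-20.0, -20.0, -20.0]  # toy log-variance: almost no noise
z_bar = averaged_representation(mean, log_var, rng)
```

With a very small variance the averaged representation stays close to the mean; with a larger variance, averaging several draws reduces the noise fed to the classifier head while still exposing it to the encoder's uncertainty.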

Implementation Details The representation encoder, transformation decoder and the classifier are trained in an end-to-end fashion. In particular, SGD is adopted to iteratively update their weights over a minibatch with images, their transformed counterparts, and labeled examples. Momentum and weight decay are set to and , respectively. The model is trained for a total of epochs. The learning rate is initialized to . It is increased to at epoch , before it is linearly decayed to starting from epochs. For a fair comparison, we adopt the entropy minimization used in the state-of-the-art virtual adversarial training [28]. A standard set of data augmentations from the literature [26, 27, 28] is also adopted throughout the experiments, which includes both horizontal flips and random translations on CIFAR-10, and only random translations on SVHN. The projective transformation, which performs better than the affine transformation, is adopted to train the semi-supervised representations.
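The shape of such a learning-rate schedule (an initial value, an increase to a peak, a plateau, and a linear decay) can be sketched as below. The function name, the linear form of the ramp-up, and every numeric value in the usage lines are assumptions for illustration only; the actual settings are not reproduced here.

```python
def warmup_linear_decay_lr(epoch, init_lr, peak_lr, warmup_end,
                           decay_start, total_epochs, final_lr):
    """Learning rate with a ramp-up, a plateau, and a linear decay.

    The linear ramp-up is an assumption; the text only says the rate
    is increased to a peak value by a given epoch.
    """
    if epoch < warmup_end:
        return init_lr + (peak_lr - init_lr) * epoch / warmup_end
    if epoch < decay_start:
        return peak_lr
    frac = min((epoch - decay_start) / (total_epochs - decay_start), 1.0)
    return peak_lr + (final_lr - peak_lr) * frac

# Hypothetical configuration, not the values used in the experiments.
config = dict(init_lr=0.01, peak_lr=0.1, warmup_end=10,
              decay_start=50, total_epochs=100, final_lr=0.0)
lrs = {e: warmup_linear_decay_lr(e, **config) for e in (0, 10, 30, 75, 100)}
```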

6.2 Results

Method 1000 labels 2000 labels 4000 labels 50000 labels
GAN [44] - - 18.63 ± 2.32 -
Π model [26] - - 12.36 ± 0.31 5.56 ± 0.10
Temporal Ensembling [26] - - 12.16 ± 0.31 5.60 ± 0.10
VAT [28] - - 10.55 -
Supervised-only 46.43 ± 1.21 33.94 20.66 ± 0.57 5.81 ± 0.15
Π model [27] 27.36 ± 1.20 18.02 ± 0.60 13.20 ± 0.27 6.06 ± 0.11
Mean Teacher [27] 21.55 ± 1.48 15.73 ± 0.31 12.31 ± 0.28 5.94 ± 0.15
SAT 14.89 ± 0.38 11.71 ± 0.29 9.58 ± 0.11 4.91 ± 0.13
TABLE VIII: Error rate percentage of compared methods on CIFAR-10 over ten runs (four runs when all labels are used).
Method 250 labels 500 labels 1000 labels 73257 labels
GAN [44] - 18.44 ± 4.8 8.11 ± 1.3 -
Π model [26] - 6.65 ± 0.53 4.82 ± 0.17 2.54 ± 0.04
Temporal Ensembling [26] - 5.12 ± 0.13 4.42 ± 0.16 2.74 ± 0.06
VAT [28] - - 3.86 -
Supervised-only 27.77 ± 3.18 16.88 12.32 ± 0.95 2.75 ± 0.10
Π model [27] 9.69 ± 0.92 6.83 ± 0.66 4.95 ± 0.26 2.50 ± 0.07
Mean Teacher [27] 4.35 ± 0.50 4.18 ± 0.27 3.95 ± 0.19 2.50 ± 0.05
SAT 4.30 ± 0.22 3.72 ± 0.20 3.44 ± 0.10 2.15 ± 0.06
TABLE IX: Error rate percentage of compared methods on SVHN over ten runs (four runs when all labels are used).

We compare with the state-of-the-art semi-supervised methods in the literature [27, 26]. Tables VIII and IX show that the SAT outperforms the compared methods with different numbers of labeled examples on both the CIFAR-10 and SVHN datasets. The results demonstrate that the SAT captures useful representations from the transformations of both unlabeled and labeled examples, which deliver competitive classification performance when the network is trained in a semi-supervised manner with only a few labeled examples.

In particular, the proposed SAT reduces the average error rates of Mean Teacher (the second best performing method) by 30.9%, 25.6%, and 22.2% relatively with 1000, 2000, and 4000 labels on CIFAR-10, while reducing them by 1.1%, 11.0%, and 12.9% relatively with 250, 500, and 1000 labels on SVHN. The compared semi-supervised methods, including the Π model [26], Temporal Ensembling [26], and Mean Teacher [27], attempt to maximize the consistency of model predictions on the transformed and original images to train semi-supervised classifiers. While they also apply transformations to explore unlabeled examples, the competitive performance of the SAT model shows that transformation-equivariant representations are more compelling for classifying images than the compared methods, which predict consistent labels under transformations. This justifies the proposed criterion of pursuing transformation equivariance as a regularizer to train a classifier.
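The relative reductions quoted for CIFAR-10 follow directly from the mean error rates of Mean Teacher and the SAT in Tables VIII and IX. A minimal check, with the helper name `relative_reduction` introduced only for illustration:

```python
def relative_reduction(baseline_err, new_err):
    """Relative error reduction of new_err over baseline_err, in percent."""
    return 100.0 * (baseline_err - new_err) / baseline_err

# Mean error rates (%) taken from Tables VIII (CIFAR-10) and IX (SVHN).
mean_teacher_cifar = {1000: 21.55, 2000: 15.73, 4000: 12.31}
sat_cifar = {1000: 14.89, 2000: 11.71, 4000: 9.58}
mean_teacher_svhn = {250: 4.35, 500: 4.18, 1000: 3.95}
sat_svhn = {250: 4.30, 500: 3.72, 1000: 3.44}

cifar_gains = {n: round(relative_reduction(mean_teacher_cifar[n], sat_cifar[n]), 1)
               for n in mean_teacher_cifar}
svhn_gains = {n: round(relative_reduction(mean_teacher_svhn[n], sat_svhn[n]), 1)
              for n in mean_teacher_svhn}
# cifar_gains -> {1000: 30.9, 2000: 25.6, 4000: 22.2}
```

The same computation on the SVHN columns gives the corresponding relative reductions with 250, 500, and 1000 labels.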

Method 1000 labels 2000 labels 4000 labels 50000 labels
VAT w/o EntMin [28] - - 11.36 -
SAT w/o EntMin 15.32 ± 0.40 12.76 ± 0.26 10.90 ± 0.21 5.95 ± 0.17
VAT with EntMin [28] - - 10.55 -
SAT with EntMin 14.89 ± 0.38 11.71 ± 0.29 9.58 ± 0.11 4.91 ± 0.13
TABLE X: Comparison of error rate percentages of SAT and VAT with and without Entropy Minimization (EntMin) on CIFAR-10.

It is not hard to see that the SAT can be integrated into the other semi-supervised methods as their base representations, and we believe this could further boost their performances. This will be left to the future work as it is beyond the scope of this paper.

6.2.1 The Impact of Entropy Minimization

We also conduct an ablation study of the impact of Entropy Minimization (EntMin) on the model performance. EntMin was used in VAT [28], which outperformed the other semi-supervised methods in the literature. Here, we compare the error rates between the SAT and the VAT with and without EntMin. As shown in Table X, whether or not entropy minimization is adopted, the SAT always outperforms the corresponding VAT. We also note that, even without entropy minimization, the SAT still performs better than the other state-of-the-art semi-supervised classifiers such as Mean Teacher, Temporal Ensembling, and the Π model shown in Table VIII. This demonstrates the compelling performance of the SAT model.
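For readers unfamiliar with entropy minimization, the sketch below computes the quantity it penalizes: the mean Shannon entropy of the predicted class distributions, which is low for confident predictions and high for uncertain ones. This is a generic illustration under the assumption that the term is added to the training loss on unlabeled examples; the function names are hypothetical and the weighting of the term is not specified here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_loss(batch_logits):
    """Mean Shannon entropy (in nats) of the predicted class distributions."""
    total = 0.0
    for logits in batch_logits:
        probs = softmax(logits)
        total += -sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(batch_logits)

confident = entropy_loss([[10.0, 0.0, 0.0]])  # peaked prediction, near zero
uncertain = entropy_loss([[0.0, 0.0, 0.0]])   # uniform prediction, log(3)
```

Minimizing this term pushes predictions on unlabeled examples toward confident class assignments.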

6.2.2 Comparison with Data Augmentation by Transformations

Method 1000 labels 2000 labels 4000 labels
DAT 51.00 38.61 27.99
SAT 15.72 13.20 11.05
TABLE XI: Error rate percentage of Data Augmentation by Transformations (DAT) on CIFAR-10. To ensure a fair comparison, the same set of labeled examples are split from the training set for semi-supervised learning.

We also compare the performances between the SAT and a classification network trained with the augmented images by the transformations. Specifically, in each minibatch, input images are augmented with the same set of random projective transformations used in the SAT. The transformation-augmented images and their labels are used to train a network with the same 13-layer architecture that has been adopted as the SAT backbone. Note that the transformation augmentations are applied on top of the standard augmentations mentioned in the implementation details for a fair comparison with the SAT.

Table XI compares the results between the SAT and the Data Augmentation by Transformation (DAT) classifier on CIFAR-10. It shows that the SAT significantly outperforms the DAT. This is not surprising: data augmentation by transformations can only augment the labeled examples, limiting its ability to explore unlabeled examples, which play a very important role in semi-supervised learning.

Moreover, the projective transformations used in the SAT could severely distort training images, which could incur undesired updates to the model weights if the distorted images were naively used to train the network. This is evidenced by the result that data augmentation by transformations performs even worse than the supervised-only method (see Table VIII).

In contrast, the SAT avoids a direct use of the transformed images to supervise the model training with their labels. Instead, it trains the learned representations to contain as much information as possible about the transformations. The superior performance demonstrates its outstanding ability of classifying images by exploring the variations of visual structures induced by transformations on both labeled and unlabeled images.

7 Conclusion and Future Works

In this paper, we present a novel approach, AutoEncoding Transformations (AET), to learn representations that equivary to the transformations applied to images. Unlike the group equivariant convolutions that would become intractable with a composition of complex transformations, the AET model seeks to learn representations of arbitrary forms by reconstructing transformations from the encoded representations of original and transformed images. The idea is further extended to a probabilistic model by maximizing the mutual information between the learned representation and the applied transformation. The intractable maximization problem is handled by introducing a surrogate transformation decoder and maximizing a variational lower bound of the mutual information, resulting in the Autoencoding Variational Transformations (AVT). Along this direction, a (Semi-)Supervised Autoencoding Transformation (SAT) approach can be derived by maximizing the joint mutual information of the learned representation with both the transformation and the label for a given sample. The proposed AET paradigm lays a solid foundation for exploring transformation equivariant representations in many learning tasks. In particular, we conduct experiments to show its superior performance on both unsupervised and (semi-)supervised learning tasks following standard evaluation protocols. In the future, we will explore the potential of applying the learned AET representation as a building block in more learning tasks, such as (instance) semantic segmentation, object detection, super-resolution reconstruction, few-shot learning, and fine-grained classification.