1 Introduction
Figure 1 illustrates a comparison between the AED, AET, and SAT. The AED and AET seek to reconstruct the input data and the transformation at the output end, respectively. The encoder (E) extracts the representations of the original and transformed images. The decoder (D) reconstructs either the data in the AED, or the transformation in the AET. The SAT builds a classifier (C) upon the output representation from the encoder by capturing the equivariant visual structures under various transformations.
Transformation Equivariant Representation (TER) learning seeks to learn representations that equivary to various transformations applied to images. In other words, given an image, its representation ought to change according to the applied transformation. In this paper, TER learning is motivated by the assumption that representations equivarying under transformations should encode the visual structures of images such that the transformations can be reconstructed from the representations of images before and after transformations. Based on this assumption, we formally present a novel principle of autoencoding transformations to learn a family of TERs.
Learning TERs has been advocated in Hinton's seminal work on learning transformation equivariant capsules [1], and plays a critical role in the success of Convolutional Neural Networks (CNNs) [2]. Specifically, the representations learned by the CNNs are translation equivariant, as their feature maps are shifted in the same way as input images are translated. On top of these feature maps that preserve the visual structures of translation equivariance, fully connected layers are built to output the predicted labels of input images.
Obviously, the translation equivariant convolutional features play a pivotal role in delivering the state-of-the-art performances of deep networks. Thus, they have been extended beyond translations to learn more expressive representations equivariant to generic types of transformations, such as affine, projective and homographic transformations. Along this direction, the group equivariant CNNs [3] are developed to guarantee that a transformation of input images results in the corresponding transformation of their feature maps.
However, the group equivariant CNNs [3] and their variants [4, 5] are restricted to discrete transformations, and the resultant representations are also limited to a group representation of linear transformations. These limitations restrict their ability to model group representations of complex transformations that could be continuous and nonlinear in many learning tasks, ranging from unsupervised, to semi-supervised and supervised learning.
1.1 Unsupervised Learning of Transformation Equivariant Representations
The focus of this paper is on the principle of autoencoding transformations and its application to learning transformation equivariant representations. The core idea is to encode data with the representations from which the transformations can be decoded as much as possible. We will begin with the unsupervised learning of such representations without involving any labeled data, and then proceed to a generalization to semi-supervised and supervised representations by encoding label information as well.
Unlike the group equivariant CNNs that learn feature maps mathematically satisfying the transformation equivariance as a function of a group of transformations, the proposed AutoEncoding Transformations (AET) presents an autoencoding architecture to learn transformation equivariant representations by reconstructing applied transformations. As long as a transformation of input images results in equivariant representations, it should be well decoded from the representations of the original and transformed images. Compared with the group equivariant CNNs, the AET model is more flexible and tractable in handling arbitrary transformations and their compositions, since it does not rely on a strict convolutional structure to guarantee the equivariance.
The AET is also in contrast to the conventional AutoEncoding Data (AED) paradigm that instead aims to reconstruct the data rather than the transformations. Figure 1(a) and (b) illustrate the comparison between the AED and the AET. Since the space of transformations (e.g., the few parameters of transformations) is of much lower dimension than the data space (e.g., the pixel space of images), the decoder of the AET can be much shallower than that of the AED. This allows the backpropagated errors to more sufficiently train the encoder that models the representations of input data in the AET architecture.
Moreover, an AET model can be trained from an information-theoretic perspective by maximizing the information in the learned representation about the applied transformation and the input data. This generalizes the group representations of linear transformations to more general forms that could equivary nonlinearly to input transformations. It results in Generalized Transformation Equivariant Representations (GTERs) that can capture more complex patterns of visual structure under transformations. Unfortunately, this results in an intractable optimization problem to maximize the mutual information between representations and transformations. A variational lower bound of the mutual information can be derived by introducing a surrogate transformation decoder, yielding a novel model of Autoencoding Variational Transformations (AVT) as an alternative to the deterministic AET.
1.2 (Semi)Supervised Learning of Transformation Equivariant Representations
While both the AET and the AVT are trained in an unsupervised fashion, they can act as the basic representation for building (semi)supervised classifiers. Along this direction, we can train a (Semi)Supervised Autoencoding Transformation (SAT) model that jointly learns the transformation equivariant representations and the corresponding classifiers.
Figure 1(c) illustrates the SAT model, where a classifier head is added upon the representation encoder of an AET network. The SAT can be based on either the deterministic AET or the probabilistic AVT architecture. Particularly, along the direction pointed by the AVT, we seek to train the proposed (semi)supervised transformation equivariant classifiers by maximizing the mutual information of the learned representations with the transformations and labels. In this way, the trained SAT model can not only handle the transformed data through their equivarying representations, but also encode the labeling information through the supervised classifier. The resultant SAT also contains the deterministic model based on the AET as a special case, obtained by fixing the representation encoder and the transformation decoder to deterministic models.
The transformation equivariance in the SAT model is contrary to the data augmentation by transformations in the deep learning literature [2]. First, data augmentation is only applicable to augmenting the labeled examples for model training, and cannot be extended to unlabeled data. This limits its use in semi-supervised learning, which must explore the unlabeled data. Second, data augmentation aims to enforce transformation invariance, in which the labels of transformed data are supposed to be invariant. This differs from our motivation to encode the inherent visual structures that equivary under various transformations.

Actually, in the (semi)supervised transformation equivariant classifiers, we aim to seamlessly integrate the principles of training transformation equivariant representations and transformation invariant classifiers. Indeed, both principles have played key roles in the compelling performances of the CNNs and their modern variants. This is witnessed by the translation equivariant convolutional feature maps and the classifiers atop, which are supposed to make transformation-invariant predictions with spatial pooling and fully connected layers. We will show that the proposed SAT extends the translation equivariance in the CNNs to cover a generic class of transformation equivariance, as well as encodes the labels to train the representations and the associated transformation invariant classifiers. We hope this can deepen our understanding of the interplay between transformation equivariance and invariance, both of which play fundamental roles in training robust classifiers with labeled and unlabeled data.
The remainder of this paper is organized as follows. We review the related works in Section 2. The unsupervised and (semi)supervised learning of transformation equivariant representations are presented in the autoencoding transformation framework in Section 3 and Section 4, respectively. We present experimental results for the unsupervised and semi-supervised tasks in Section 5 and Section 6, and conclude the paper with a discussion of future work in Section 7.
2 Related Works
In this section, we will review the related works on learning transformation-equivariant representations, as well as unsupervised and (semi)supervised models.
2.1 TransformationEquivariant Representations
Learning transformation-equivariant representations can be traced back to the seminal work on training capsule nets [6, 1, 7]. The transformation equivariance is characterized by the various directions of capsules, while the confidence of belonging to a particular class is captured by their lengths.
Many efforts have been made in the literature [3, 4, 5] to extend the conventional translation-equivariant convolutions to cover more transformations. Among them are group equivariant convolutions (G-convolutions) [3] that have been developed to equivary to more types of transformations. The idea of group equivariance has also been introduced to the capsule nets [5] by ensuring the equivariance of output pose vectors to a group of transformations with a generic routing mechanism. However, the group equivariant convolution is restricted to discrete transformations, which limits its ability to learn representations equivariant to generic continuous transformations.
2.2 Unsupervised Representation Learning
AutoEncoders and GANs. Unsupervised autoencoders have been extensively studied in the literature [8, 9, 10]. Existing autoencoders are trained by reconstructing input data from the outputs of encoders, and a large category of autoencoder variants have been proposed. Among them is the Variational AutoEncoder (VAE) [11] that maximizes a lower bound of the data likelihood to train a pair of probabilistic encoder and decoder, while the beta-VAE [12] seeks to disentangle representations by introducing an adjustable hyperparameter on the capacity of the latent channel to balance between the independence constraint and the reconstruction accuracy. Denoising autoencoders [10] attempt to reconstruct noise-corrupted data to learn robust representations, while contractive autoencoders [13] encourage learning representations invariant to small perturbations on data. Along this direction, Hinton et al. [1] propose capsule networks to explore transformation equivariance by minimizing the discrepancy between the reconstructed and target data.

On the other hand, Generative Adversarial Nets (GANs) have also been used to train unsupervised representations. Unlike the autoencoders, the GANs [14] and their variants [15, 16, 17, 18] generate data from noises drawn from a simple distribution, with a discriminator trained adversarially to distinguish between real and fake data. The sampled noises can be viewed as the representations of generated data over a manifold, and one can train an encoder by inverting the generator to find the generating noise. This can be implemented by jointly training a pair of mutually inverse generator and encoder [15, 16]. There also exist GANs that generalize better in producing unseen data based on the Lipschitz assumption on the real data distribution [17, 18], which can give rise to more powerful representations of data outside the training examples [15, 16, 19]. Compared with the autoencoders, GANs do not rely on learning a one-to-one reconstruction of data; instead, they aim to generate the entire distribution of data.
Self-Supervisory Signals. There exist many other unsupervised learning methods using different types of self-supervised signals to train deep networks. Noroozi and Favaro [20] propose to solve Jigsaw puzzles to train a convolutional neural network. Doersch et al. [21] train the network by inferring the relative positions between sampled patches from an image as self-supervised information. Instead, Noroozi et al. [22] count features that satisfy equivalence relations between downsampled and tiled images. Gidaris et al. [23] propose to train RotNets by predicting a discrete set of image rotations, but RotNets are unable to handle generic continuous transformations and their compositions. Dosovitskiy et al. [24] create a set of surrogate classes by applying various transformations to individual images. However, the resultant features could over-discriminate visually similar images, as they always belong to different surrogate classes. Unsupervised features have also been learned from videos by estimating the self-motion of moving objects between consecutive frames
[25].

2.3 (Semi)Supervised Representation Learning
In addition, there exist a large number of semi-supervised models in the literature. Here, we particularly mention three state-of-the-art methods that will be compared in experiments. Temporal ensembling [26] and mean teachers [27] both use an ensemble of teachers to supervise the training of a student model. Temporal ensembling uses the exponential moving average of predictions made by past models on unlabeled data as targets to train the student model. Instead, mean teachers update the student model with the exponential moving average of the weights of past models. On the contrary, the Virtual Adversarial Training (VAT) [28] seeks to minimize the change of predictions on unlabeled examples when their output values are adversarially altered. This could result in a robust model that prefers smooth predictions over unlabeled data.
The SAT also differs from transformation-based data augmentation, in which the transformed samples and their labels are used directly as additional training examples [2]. First, in semi-supervised learning, unlabeled examples cannot be directly augmented to form training examples due to their missing labels. Moreover, data augmentation needs to preserve the labels on augmented images, and this prevents us from applying transformations that could severely distort the images (e.g., shearing, rotations with arbitrary angles, and projective transformations) or invalidate the associated labels (e.g., vertically flipping “6” to “9”). In contrast, the SAT avoids using the labels of transformed images to directly supervise the training of the classifier; instead, it attempts to encode the visual structures of images equivariant to various transformations without access to their labels. This leads to a label-blind TER regularizer that explores the unlabeled examples for the semi-supervised problem.
3 Unsupervised Learning of Transformation Equivariant Representations
In this section, we will first present the autoencoding transformation architecture to learn the transformation equivariant representations in a deterministic fashion. Then, a variational alternative approach will be presented to handle the uncertainty in the representation learning by maximizing the mutual information between the learned representations and the applied transformations.
3.1 AET: A Deterministic Model
We begin by defining the notations used in the proposed AutoEncoding Transformation (AET) architecture. Consider a random transformation $t$ sampled from a transformation distribution $p(t)$ (e.g., warping, projective and homographic transformations), as well as an image $x$ drawn from a data distribution $p(x)$ in a sample space $\mathcal{X}$. Then the application of $t$ to $x$ results in a transformed image $t(x)$.
The goal of AET focuses on learning a representation encoder $E_\theta$ with parameters $\theta$, which maps a sample $x$ to its representation $E_\theta(x)$ in a linear space. For this purpose, one needs to learn a transformation decoder $D_\phi$ with parameters $\phi$ that makes an estimate $\hat t$ of the input transformation from the representations of the original and transformed samples. Since the transformation decoder takes the encoder outputs rather than the original and transformed images, this pushes the encoder to capture the inherent visual structures of images to make a satisfactory estimate of the transformation.
Then the AET can be trained to jointly learn the representation encoder $E_\theta$ and the transformation decoder $D_\phi$. A loss function $\ell(t, \hat t)$ measuring the deviation between a transformation $t$ and its estimate $\hat t$ is minimized to train the AET over $\theta$ and $\phi$:
$$ \min_{\theta, \phi} \; \mathbb{E}_{t, x} \, \ell(t, \hat t), $$
where the estimated transformation can be written as a function of the encoder and the decoder such that $\hat t = D_\phi\big(E_\theta(x), E_\theta(t(x))\big)$, and the expectation is taken over the distributions of transformations and data.
In this way, the encoder and the decoder can be jointly trained over mini-batches by backpropagating the gradient of the loss to update their parameters.
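As an illustration of the training objective above, here is a minimal numpy sketch of one forward pass of the AET loss, with a 2D rotation as the transformation. The tiny encoder and decoder, and all names here, are hypothetical stand-ins for the deep networks, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": 8 points in the plane, flattened into a 16-d vector.
def rotate(x, angle):
    """Apply the sampled transformation t (a 2D rotation) to every point."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])
    return (x.reshape(-1, 2) @ R.T).reshape(-1)

# Hypothetical encoder E_theta: a random linear map with a tanh nonlinearity.
W_enc = rng.normal(size=(8, 16))
def encode(x):
    return np.tanh(W_enc @ x)

# Hypothetical decoder D_phi: a linear readout on the concatenated
# representations of the original and transformed samples.
w_dec = rng.normal(size=16)
def decode_angle(z_orig, z_trans):
    return np.concatenate([z_orig, z_trans]) @ w_dec

# One forward pass of the AET objective: sample t and x, estimate t-hat
# from the two representations, and measure the deviation loss l(t, t-hat).
x = rng.normal(size=16)
t = rng.uniform(-np.pi, np.pi)
z, z_t = encode(x), encode(rotate(x, t))
t_hat = decode_angle(z, z_t)
loss = (t - t_hat) ** 2
```

Training would then backpropagate this loss through both the decoder and the encoder over mini-batches; only the forward computation is sketched here.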
3.2 AVT: A Probabilistic Model
Alternatively, we can train transformation equivariant representations that contain as much information as possible about the applied transformations, so that the transformations can be recovered from them.
Figure 2 illustrates the variational approach to the unsupervised and (semi)supervised learning of autoencoding transformations, namely the AVT and the SAT, respectively. The probability $p_\theta(z \mid t, x)$ acts as the representation encoder, while $q_\phi(t \mid z, \tilde z)$ and $q_\phi(y \mid \tilde z)$ play the roles of the transformation and label decoders, respectively. By setting the transformation $t$ to an identity, the corresponding $\tilde z$ is the representation of an original image.

3.2.1 Notations
Formally, our goal is to learn an encoder that maps a transformed sample $t(x)$ to a probabilistic representation with mean $f_\theta\big(t(x)\big)$ and variance $\sigma^2_\theta\big(t(x)\big)$. This results in the following probabilistic representation of $t(x)$:
$$ z = f_\theta\big(t(x)\big) + \sigma_\theta\big(t(x)\big) \odot \epsilon, \qquad (1) $$
where $\epsilon$ is sampled from a normal distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$, with $\odot$ denoting the element-wise product. Thus, the resultant probabilistic representation $z$ follows a normal distribution conditioned on the randomly sampled transformation $t$ and input data $x$.
On the other hand, the representation $\tilde z$ of the original sample $x$ is a special case when $t$ is an identity transformation:
$$ \tilde z = f_\theta(x) + \sigma_\theta(x) \odot \tilde\epsilon, \qquad (2) $$
whose mean and variance are computed by the deep network with the same weights $\theta$, and $\tilde\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
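Equations (1)–(2) are the standard reparameterization of a Gaussian representation; a minimal numpy sketch (function and parameter names are illustrative assumptions, not from the paper):

```python
import numpy as np

def sample_representation(mu, log_var, rng):
    """Reparameterized draw z = mu + sigma * eps with eps ~ N(0, I),
    where * is the element-wise product, as in Eq. (1)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(1)

# Hypothetical encoder outputs (mean and log-of-variance) for one image.
mu = np.array([0.5, -1.0, 2.0])
log_var = np.full(3, -2.0)  # the network predicts the log-of-variance

# Averaging several draws (as done for the downstream classifiers)
# concentrates around the predicted mean while retaining uncertainty.
samples = np.stack([sample_representation(mu, log_var, rng) for _ in range(1000)])
z_avg = samples.mean(axis=0)
```

Sampling through this reparameterization keeps the draw differentiable with respect to the predicted mean and variance, which is what allows the encoder to be trained by backpropagation.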
3.2.2 Generalized Transformation Equivariance
In the conventional definition of transformation equivariance, there should exist an automorphism $\rho(t)$ of the representation space, such that the representation of a transformed sample is obtained by applying $\rho(t)$ to the representation of the original sample.¹ (¹ The transformation $t$ in the sample space and the corresponding transformation $\rho(t)$ in the representation space need not be the same, but the representation transformation should be a function of the sample transformation $t$.)

Here the transformation $\rho(t)$ is independent of the input sample $x$. In other words, the representation of a transformed sample is completely determined by the original representation and the applied transformation, with no need to access the sample $x$. This is called the steerability property in the literature [4], which enables us to compute the representation of a transformed sample by applying the sample-independent transformation $\rho(t)$ directly to the original representation.
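In symbols, the steerability condition above is commonly written as follows (with $E$ the encoder and $\rho(t)$ a sample-independent transformation acting on the representation space):

```latex
% Classical (linear) transformation equivariance / steerability:
% the representation of a transformed sample is obtained by applying
% a sample-independent automorphism \rho(t) to the original representation.
E\big(t(x)\big) \;=\; \rho(t)\, E(x), \qquad \forall\, x \in \mathcal{X},\; t \sim p(t).
```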
This property can be generalized without relying on the linear group representations of transformations through automorphisms. Instead of sticking with a linear $\rho(t)$, one can seek a more general relation between the representations of the original and transformed samples, independently of $x$. From an information-theoretic point of view, this requires the pair of representations $(z, \tilde z)$ to jointly contain all necessary information about $t$, so that $t$ can be best estimated from them without a direct access to $x$.
This leads us to maximizing the mutual information $I(t; z, \tilde z)$ between the applied transformation $t$ and the representations $z$ and $\tilde z$ of the transformed and original samples, to learn the generalized transformation equivariant representations. Indeed, by the chain rule and the nonnegativity of mutual information, we have
$$ I(t; z, \tilde z, x) = I(t; z, \tilde z) + I(t; x \mid z, \tilde z) \ge I(t; z, \tilde z), $$
which shows $I(t; z, \tilde z)$ is upper bounded by the mutual information $I(t; z, \tilde z, x)$ that additionally involves the sample $x$.

Clearly, when $I(t; x \mid z, \tilde z) = 0$, $I(t; z, \tilde z)$ attains the maximum value of its upper bound. In this case, $x$ would provide no more information about $t$ than the representations, which implies one can estimate $t$ directly from $(z, \tilde z)$ without accessing $x$. Thus, we propose to solve
$$ \max_\theta \; I(t; z, \tilde z) $$
to learn the probabilistic encoder in pursuit of such a generalized TER.
However, a direct maximization of the above mutual information needs to evaluate an intractable posterior of the transformation. Thus, we instead lower bound the mutual information by introducing a surrogate transformation decoder $q_\phi(t \mid z, \tilde z)$ with parameters $\phi$ to approximate the true posterior.
3.2.3 Variational Approach
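With the surrogate decoder $q_\phi(t \mid z, \tilde z)$ in place, the mutual information admits a standard variational (Barber–Agakov) lower bound; a sketch in the notation above:

```latex
I(t; z, \tilde z)
  \;=\; H(t) - H(t \mid z, \tilde z)
  \;\ge\; H(t) + \mathbb{E}\big[ \log q_\phi(t \mid z, \tilde z) \big],
```

where the expectation is taken over the transformations, the data, and the sampled representations, and the inequality follows from the nonnegativity of the KL divergence between the true posterior and $q_\phi$. Since the entropy $H(t)$ of the transformation distribution is a constant independent of $\theta$ and $\phi$, maximizing this lower bound reduces to maximizing the expected log-likelihood of the transformation decoder.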
3.2.4 Variational Transformation Decoder
To estimate a family of continuous transformations, we choose a normal distribution as the posterior of the transformation decoder, where the mean and variance are implemented by deep networks, respectively.
For categorical transformations (e.g., horizontal vs. vertical flips, and rotations of different directions), a categorical distribution can be adopted as the posterior, where each entry is the probability mass for a transformation type. A hybrid distribution can also be defined to combine multiple continuous and categorical transformations, making the variational transformation decoder more flexible and appealing in handling complex transformations.
The posterior of transformation is a function of the representations of the original and transformed images. Thus, a natural choice is to use a Siamese encoder network with shared weights to output the representations of original and transformed samples, and construct the transformation decoder atop the concatenated representations. Figure 2(a) illustrates the architecture of the AVT network.
Finally, it is not hard to see that the deterministic AET model can be viewed as a special case of the AVT, in which the probabilistic representation encoder and transformation decoder are set to deterministic functions.
4 (Semi)Supervised Learning of Transformation Equivariant Representations
Autoencoding transformations can act as the basic representation block in many learning problems. In this section, we present its role in (semi)supervised learning tasks to enable more accurate classification of samples by capturing their transformation equivariant representations.
4.1 SAT: (Semi)Supervised Autoencoding Transformations
The unsupervised learning of autoencoding transformations can be generalized to (semi)supervised cases with labeled samples. Accordingly, the goal is formulated as learning of representations that contain as much (mutual) information as possible about not only applied transformations but also data labels.
Given a labeled sample $(x, y)$, we can define the joint distribution over the representation $z$, the transformation $t$ and the label $y$, where we have assumed that $y$ is independent of $t$ and $z$ once the sample $x$ is given.
In the presence of sample labels, the pursuit of transformation equivariant representations can be performed by maximizing the joint mutual information, such that the representation $\tilde z$ of the original sample and the transformation $t$ contain sufficient information to classify the label $y$, as well as to learn the representation $z$ equivariant to the transformed sample.
Like the unsupervised objective (3), the joint mutual information can be lower bounded by applying the chain rule of mutual information twice and using its nonnegativity. In particular, we usually have $p(y \mid t(x)) = p(y \mid x)$, which means the transformation should not change the label of a sample (i.e., transformation invariance of sample labels). The final step follows the variational bound we derived in the last section.
One can also assume that the surrogate posterior of labels simplifies to $q_\phi(y \mid \tilde z)$, since the representation $\tilde z$ of the original sample is supposed to provide sufficient information to predict the label $y$.
Since the entropies of the transformations and labels are independent of the model parameters $\theta$ and $\phi$, we maximize the following variational lower bound:
$$ \max_{\theta, \phi} \; \mathbb{E}\big[\log q_\phi(y \mid \tilde z) + \log q_\phi(t \mid z, \tilde z)\big], \qquad (4) $$
where $z$ and $\tilde z$ are sampled by following Eqs. (1)–(2), and the ground-truth label $y$ is sampled from the label distribution directly.
Furthermore, a semi-supervised model can be trained by combining the unsupervised and supervised objectives (3) and (4) with a positive balancing coefficient $\lambda$ on the unsupervised term, which constitutes the overall objective (5). This enables the model to jointly explore labeled and unlabeled examples and their representations equivariant to various transformations.
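To make the combination concrete, here is a minimal numpy sketch of how a SAT-style objective is typically assembled from a supervised label-decoding term and an unsupervised transformation-decoding term; the particular loss choices and names are illustrative assumptions, not the paper's exact losses:

```python
import numpy as np

def cross_entropy(probs, labels):
    """Supervised term: negative log-likelihood of the label decoder."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def transformation_loss(t_true, t_pred):
    """Unsupervised term: transformation-decoding error,
    computable on unlabeled examples as well."""
    return np.mean((np.asarray(t_true) - np.asarray(t_pred)) ** 2)

def sat_objective(probs, labels, t_true, t_pred, lam=1.0):
    """Combine the supervised and unsupervised objectives with a
    positive balancing coefficient lam, as in the semi-supervised setting."""
    return cross_entropy(probs, labels) + lam * transformation_loss(t_true, t_pred)

# Near-perfect label predictions plus exact transformation estimates
# give a near-zero combined loss.
probs = np.array([[0.998, 0.001, 0.001],
                  [0.001, 0.998, 0.001]])
labels = np.array([0, 1])
loss = sat_objective(probs, labels, t_true=[0.3, -0.7], t_pred=[0.3, -0.7], lam=1.0)
```

In practice the supervised term is evaluated only on the labeled subset of a mini-batch, while the transformation term sees every example, labeled or not.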
We will demonstrate that the SAT achieves superior performances to the existing state-of-the-art (semi)supervised models. Moreover, the competitive performances also show the great potential of the model as a basic representation block in many machine learning and computer vision tasks.
Figure 2(b) illustrates the architecture of the SAT model, in a comparison with its AVT counterpart. Particularly, in the SAT, the transformation and label decoders are jointly trained atop the representation encoder.
5 Experiments: Unsupervised Learning
In this section, we compare the proposed deterministic AET and probabilistic AVT models against other unsupervised methods on the CIFAR-10, ImageNet and Places datasets. The evaluation follows the protocols widely adopted by many existing unsupervised methods, applying the learned representations to downstream tasks.
5.1 CIFAR10 Experiments
First, we evaluate the AET and AVT models on the CIFAR-10 dataset.
5.1.1 Experiment Settings
Architecture. To make a fair and direct comparison with existing models, the Network-In-Network (NIN) architecture is adopted on the CIFAR-10 dataset for the unsupervised learning task [23, 30]. The NIN consists of four convolutional blocks, each of which contains three convolutional layers. Both the AET and the AVT have two NIN branches with shared weights, taking the original and transformed images as their inputs, respectively. The output features of the fourth block of the two branches are concatenated and average-pooled to form a feature vector. Then an output layer follows to output the predicted transformation for the AET, and the mean and the log-of-variance of the predicted transformation for the AVT, with the logarithm scaling the variance to a real value.
The first two blocks of each branch are used as the encoder network to output the deterministic representation for the AET, and the mean of the probabilistic representation for the AVT. An additional convolution followed by a batch-normalization layer is added upon the encoder to produce the log-of-variance.

Implementation Details. Both the AET and the AVT networks are trained by SGD on batches of original images and their transformed versions, with momentum and weight decay applied. For the AET, the learning rate is initialized and scheduled to drop by a fixed factor at several milestone epochs over the course of training. The AVT network is trained for more epochs with its own schedule, in which the learning rate is first increased and then gradually decayed.
In the AVT, a single representation is randomly sampled from the encoder and fed into the decoder during training. To fully exploit the uncertainty of the representations, five samples are drawn and averaged as the representation of an image to train the downstream classifiers. We found that averaging randomly sampled representations could outperform using only the mean of the representation.
Applied Transformations. Two types of transformations are considered for model training. One is the affine transformation: a composition of a random rotation, a random translation along both the vertical and horizontal directions, a random scaling, and a random shearing. The other is the projective transformation, which is formed by randomly translating the four corners of an image in both the horizontal and vertical directions, after the image is randomly scaled and rotated.
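The projective transformation used above is fully determined by where the four image corners are moved; as an illustration, the 3x3 homography can be recovered from the four corner correspondences with the standard direct linear transform (DLT). The helper name is hypothetical:

```python
import numpy as np

def homography_from_corners(src, dst):
    """Solve for the 3x3 homography H (up to scale) that maps each src
    corner to its perturbed dst corner, via the direct linear transform."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the null vector of A, i.e. the right singular
    # vector associated with the smallest singular value.
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

# Unperturbed corners of a unit image map to themselves: H is the identity.
corners = [(0, 0), (1, 0), (1, 1), (0, 1)]
H_id = homography_from_corners(corners, corners)
```

Randomly perturbing the four destination corners, as described above, then yields a random projective transformation to apply to the image.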
5.1.2 Results
Table I. Error rates of different models on CIFAR-10.

Method                                    Error rate
Supervised NIN [23] (Upper Bound)         7.20
Random Init. + conv [23] (Lower Bound)    27.50
RotoScat + SVM [31]                       17.7
ExemplarCNN [24]                          15.7
DCGAN [32]                                17.2
Scattering [33]                           15.3
RotNet + non-linear [23]                  10.94
RotNet + conv [23]                        8.84
AET-affine + non-linear                   9.77
AET-affine + conv                         8.05
AET-project + non-linear                  9.41
AET-project + conv                        7.82
AVT-project + non-linear                  8.96
AVT-project + conv                        7.75
Comparison with Other Methods. To evaluate the effectiveness of a learned unsupervised representation, a classifier is usually trained upon it. In our experiments, we follow the existing evaluation protocols [31, 24, 32, 33, 23] by building a classifier on top of the second convolutional block.

First, we evaluate the classification results by using the AET and AVT representations with both model-based and model-free classifiers. For the model-based classifier, we follow [23] by training a non-linear classifier with three Fully-Connected (FC) layers: the two hidden layers use batch-normalization and ReLU activations, and the output layer is a softmax layer with ten neurons, one for each image class. We also test a convolutional classifier upon the unsupervised features by adding a third NIN block, whose output feature map is average-pooled and connected to a linear softmax classifier.
Table I shows the results by different models, comparing both fully supervised and unsupervised methods on CIFAR-10. The unsupervised AET and AVT with the convolutional classifier almost achieve the same error rate as the fully supervised NIN counterpart with four convolutional blocks (7.82 and 7.75 vs. 7.20).
Table II. Error rates on CIFAR-10 with classifiers trained on the learned representations, using a varying number of FC layers or a convolutional classifier.

Method         1 FC    2 FC    3 FC    conv
RotNet [23]    18.21   11.34   10.94   8.84
AET-affine     17.16   9.77    10.16   8.05
AET-project    16.65   9.41    9.92    7.82
AVT-project    16.19   8.96    9.55    7.75
We also compare the models when their classifiers are trained with a varying number of FC layers in Table II. The results show that the AVT, followed by the AET, consistently achieves the smallest errors no matter which classifier is used.
Table III. Comparison of the KNN error rates by different models with varying numbers of nearest neighbors on CIFAR-10.

Method         3       5       10      15      20
RotNet [23]    25.67   25.01   24.97   25.85   26.00
AET-affine     24.88   23.29   23.07   23.34   23.94
AET-project    23.29   22.40   22.39   23.32   23.73
AVT-project    22.46   21.62   23.7    22.16   21.51

We also note that the probabilistic AVT outperforms the deterministic AET in experiments. This is likely due to the AVT's ability to model the uncertainty of representations in training the downstream classifiers. We also find that the projective transformation performs better than the affine transformation when they are used to train the AET, and thus we mainly use the projective transformation to train the AVT.
Comparison based on Model-free KNN Classifiers. We also test a model-free KNN classifier based on the average-pooled feature representations from the second convolutional block. The KNN classifier is model-free, requiring no classifier training from labeled examples. This enables a direct evaluation of the quality of the learned features. Table III reports the KNN results with varying numbers of nearest neighbors. Again, both the AET and the AVT representations outperform the compared model across different numbers of nearest neighbors.
Comparison with Few Labeled Data. We also conduct experiments in which a small number of labeled examples are used to train the downstream classifiers with the learned representations. Table IV reports the results of different models on CIFAR-10. Both the AET and the AVT outperform the fully supervised models as well as the other unsupervised models when only a few labeled examples are available.
Table IV. Error rates on CIFAR-10 with downstream classifiers trained on a small number of labeled examples.

Method                        20      100     400     1000    5000
Supervised conv               66.34   52.74   25.81   16.53   6.93
Supervised non-linear         65.03   51.13   27.17   16.13   7.92
RotNet + conv [23]            35.37   24.72   17.16   13.57   8.05
AET-project + conv            34.83   24.35   16.28   12.58   7.82
AET-project + non-linear      37.13   25.19   18.32   14.27   9.41
AVT-project + conv            35.44   24.26   15.97   12.27   7.75
AVT-project + non-linear      37.62   25.01   17.95   14.14   8.96
5.2 ImageNet Experiments
We further evaluate the performance of the AET and AVT on the ImageNet dataset.
5.2.1 Architectures and Training Details
For a fair comparison with the existing methods [20, 34, 23], two AlexNet branches with shared parameters are created, taking the original and transformed images as inputs, respectively, to train the unsupervised models. The d output features from the second-to-last fully connected layer in each branch are concatenated and fed into the transformation decoder. We still use SGD to train the network, with a batch size of images and their transformed counterparts, a momentum of , and a weight decay of .
For the AET model, the initial learning rate is set to , and it is dropped by a factor of at epochs 100 and 150. The model is trained for epochs in total. For the AVT, the initial learning rate is set to , and it is dropped by a factor of at epochs 300 and 350. The AVT is trained for epochs in total. We still use the average over five samples from the encoder outputs to train the downstream classifiers when evaluating the AVT. Since the projective transformation has shown better performance, we adopt it for the experiments on ImageNet.
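The step-decay schedule described above can be sketched as a small helper. The base learning rate and decay factor are hypothetical placeholders (their exact values are elided in the text); only the drop epochs come from the description:

```python
def step_lr(base_lr, epoch, drop_epochs, factor):
    """Step-decay schedule: multiply the learning rate by `factor`
    each time `epoch` passes one of the `drop_epochs`.
    `base_lr` and `factor` are placeholders, not the paper's values."""
    lr = base_lr
    for e in drop_epochs:
        if epoch >= e:
            lr *= factor
    return lr

# e.g. for the AET schedule with hypothetical base_lr=0.1, factor=0.1:
#   step_lr(0.1, 50,  [100, 150], 0.1)  -> 0.1
#   step_lr(0.1, 120, [100, 150], 0.1)  -> 0.01
```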
5.2.2 Results
Table V: Top-1 accuracy on ImageNet. Conv4 and Conv5 denote training the remaining AlexNet layers on top of the frozen features up to Conv4 and Conv5, respectively.

Method  Conv4  Conv5
Supervised from [34] (Upper Bound)  59.7  59.7
Random from [20] (Lower Bound)  27.1  12.0
Tracking [35]  38.8  29.8
Context [21]  45.6  30.4
Colorization [36]  40.7  35.2
Jigsaw Puzzles [20]  45.3  34.6
BIGAN [15]  41.9  32.2
NAT [34]  –  36.0
DeepCluster [37]  –  44.0
RotNet [23]  50.0  43.8
AET-project  53.2  47.0
AVT-project  54.2  48.4
Table VI: Top-1 accuracy on ImageNet with linear logistic regression classifiers trained on top of frozen feature maps from different convolutional layers.

Method  Conv1  Conv2  Conv3  Conv4  Conv5
ImageNet labels (Upper Bound)  19.3  36.3  44.2  48.3  50.5
Random (Lower Bound)  11.6  17.1  16.9  16.3  14.1
Random rescaled [38]  17.5  23.0  24.5  23.2  20.6
Context [21]  16.2  23.3  30.2  31.7  29.6
Context Encoders [39]  14.1  20.7  21.0  19.8  15.5
Colorization [36]  12.5  24.5  30.4  31.5  30.3
Jigsaw Puzzles [20]  18.2  28.8  34.0  33.9  27.1
BIGAN [15]  17.7  24.5  31.0  29.9  28.0
Split-Brain [40]  17.7  29.3  35.4  35.2  32.8
Counting [22]  18.0  30.6  34.3  32.5  25.7
RotNet [23]  18.8  31.7  38.7  38.2  36.5
AET-project  19.2  32.8  40.6  39.7  37.7
AVT-project  19.5  33.6  41.3  40.3  39.1
DeepCluster* [37]  13.4  32.3  41.0  39.6  38.2
AET-project*  19.3  35.4  44.0  43.6  42.4
AVT-project*  20.9  36.1  44.4  44.3  43.5
Table VII: Top-1 accuracy on Places with linear logistic regression classifiers trained on top of frozen feature maps from different convolutional layers.

Method  Conv1  Conv2  Conv3  Conv4  Conv5
Places labels (Upper Bound) [41]  22.1  35.1  40.2  43.3  44.6
ImageNet labels  22.7  34.8  38.4  39.4  38.7
Random (Lower Bound)  15.7  20.3  19.8  19.1  17.5
Random rescaled [38]  21.4  26.2  27.1  26.1  24.0
Context [21]  19.7  26.7  31.9  32.7  30.9
Context Encoders [39]  18.2  23.2  23.4  21.9  18.4
Colorization [36]  16.0  25.7  29.6  30.3  29.7
Jigsaw Puzzles [20]  23.0  31.9  35.0  34.2  29.3
BIGAN [15]  22.0  28.7  31.8  31.3  29.7
Split-Brain [40]  21.3  30.7  34.0  34.1  32.5
Counting [22]  23.3  33.9  36.3  34.7  29.6
RotNet [23]  21.5  31.0  35.1  34.6  33.7
AET-project  22.1  32.9  37.1  36.2  34.7
AVT-project  22.3  33.1  37.8  36.7  35.6
(Caption for the linear-probing tables above: a multi-way logistic regression classifier is trained on top of various layers of feature maps that are spatially resized to have about elements. All unsupervised features are pretrained on the ImageNet dataset, and then frozen when training the logistic regression classifiers with Places labels. We also compare with fully supervised networks trained with Places labels and ImageNet labels, as well as with random models. The highest accuracy values are in bold and the second highest are underlined.)

Table V reports the Top-1 accuracies of the compared methods on ImageNet, following the evaluation protocol in [20]. Two settings are adopted for evaluation: Conv4 and Conv5 mean training the remaining part of AlexNet on top of Conv4 and Conv5 with the labeled data. All the bottom convolutional layers up to Conv4 or Conv5 are frozen after they are trained in an unsupervised fashion. From the results, in both settings the AVT model consistently outperforms the other unsupervised models, including the AET.
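The spatial-resizing step of the probing protocol can be sketched as follows, assuming average pooling onto a regular grid so that feature maps from different layers reach a comparable element budget before the linear classifier is trained. The grid size and function name are illustrative assumptions:

```python
import numpy as np

def pool_to_budget(feat_map, grid):
    """Spatially average-pool a (C, H, W) feature map down to
    (C, grid, grid), then flatten it, so feature maps from different
    layers have roughly the same number of elements before a linear
    (logistic-regression) probe is trained on top of them.
    `grid` is a hypothetical choice; the element budget in the
    protocol above is elided in the text."""
    c, h, w = feat_map.shape
    out = np.zeros((c, grid, grid))
    ys = np.linspace(0, h, grid + 1).astype(int)  # row boundaries of the pooling cells
    xs = np.linspace(0, w, grid + 1).astype(int)  # column boundaries
    for i in range(grid):
        for j in range(grid):
            out[:, i, j] = feat_map[:, ys[i]:ys[i+1], xs[j]:xs[j+1]].mean(axis=(1, 2))
    return out.reshape(-1)
```

A deeper layer with a small spatial map and a shallow layer with a large one then yield probe inputs of comparable size, keeping the linear classifiers comparable across layers.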
We also compare with the fully supervised models that give the upper bound of the classification performance by training the AlexNet end-to-end with all labeled data. The classifiers of the random models are trained on top of Conv4 and Conv5 whose weights are randomly sampled, which sets the lower-bound performance. By comparison, the unsupervised models narrow the performance gap to the upper-bound supervised models from 9.7% and 15.7% (by RotNet and DeepCluster on Conv4 and Conv5, respectively), to 6.5% and 12.7% by the AET, and to 5.5% and 11.3% by the AVT.
5.3 Places Experiments
We also compare different models on the Places dataset. Table VII reports the results. Unsupervised models are pretrained on the ImageNet dataset, and a linear logistic regression classifier is trained on top of different layers of convolutional feature maps with Places labels. This assesses the generalizability of unsupervised features from one dataset to another. The models are still based on AlexNet variants. We compare with the fully supervised models trained with the Places labels and ImageNet labels, respectively, as well as with the random networks. Both the AET and the AVT models outperform the other unsupervised models, except that they perform slightly worse than Counting [22] on the shallow representations from Conv1 and Conv2.
6 Experiments: (Semi-)Supervised Learning
We compare the proposed SAT model with the other state-of-the-art semi-supervised methods in this section. For the sake of fair comparison, we follow the test protocol used in the literature [27, 26] on both CIFAR10 [42] and SVHN [43], which are widely used as benchmark datasets to evaluate semi-supervised models.
6.1 Network Architecture and Implementation Details
Network Architecture For the sake of a fair comparison, a 13-layer convolutional neural network, which has been widely used in existing semi-supervised models [26, 27, 28], is adopted as the backbone to build the SAT. It consists of three convolutional blocks, each of which contains three convolution layers. The SAT has two branches of these three blocks with shared weights, each taking the original and transformed images as input, respectively. The output feature maps from the third blocks of the two branches are concatenated and average-pooled, resulting in a d feature vector. A fully-connected layer follows to predict the mean and the log-of-variance of the transformation. The first two blocks are used as the encoder to output the mean of the representation, upon which an additional convolution layer with batch normalization is added to compute the log-of-variance.
In addition, a classifier head is built on the representation from the encoder. Specifically, we draw five random representations of an input image and feed their average to the classifier. The classifier head has the same structure as the third convolutional block, but its weights differ from those of the Siamese branches of the transformation decoder. The output feature map of this convolutional block is globally average-pooled to a d feature vector, and a softmax fully connected layer follows to predict the image label.
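The averaging of five sampled representations fed to the classifier can be sketched with the standard Gaussian reparameterization; the use of the reparameterization trick here, the function name, and numpy are illustrative assumptions consistent with the mean/log-of-variance encoder described above:

```python
import numpy as np

def classifier_input(mu, logvar, n_samples=5, rng=None):
    """Draw `n_samples` representations from N(mu, diag(exp(logvar)))
    via the reparameterization trick and return their average, which
    is fed to the classifier head."""
    rng = np.random.default_rng(rng)
    std = np.exp(0.5 * logvar)                 # standard deviation from log-variance
    eps = rng.standard_normal((n_samples,) + mu.shape)
    samples = mu + std * eps                   # (n_samples, *mu.shape)
    return samples.mean(axis=0)
```

Averaging several samples reduces the variance of the classifier's input while still exposing it to the uncertainty of the representation during training.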
Implementation Details The representation encoder, the transformation decoder, and the classifier are trained in an end-to-end fashion. In particular, SGD is adopted to iteratively update their weights over a mini-batch with images, their transformed counterparts, and labeled examples. Momentum and weight decay are set to and , respectively. The model is trained for a total of epochs. The learning rate is initialized to . It is increased to at epoch , before it is linearly decayed to starting from epochs. For a fair comparison, we adopt the entropy minimization used in the state-of-the-art virtual adversarial training [28]. A standard set of data augmentations from the literature [26, 27, 28] is also adopted throughout the experiments, which includes both horizontal flips and random translations on CIFAR10, and only random translations on SVHN. The projective transformation, which performs better than the affine transformation, is adopted to train the semi-supervised representations.
6.2 Results
Table VIII: Error rates on CIFAR10 with varying numbers of labeled examples.

Method  1000 labels  2000 labels  4000 labels  50000 labels
GAN [44]  –  –  18.63 ± 2.32  –
Π model [26]  –  –  12.36 ± 0.31  5.56 ± 0.10
Temporal Ensembling [26]  –  –  12.16 ± 0.31  5.60 ± 0.10
VAT [28]  –  –  10.55  –
Supervised-only  46.43 ± 1.21  33.94  20.66 ± 0.57  5.81 ± 0.15
Π model [27]  27.36 ± 1.20  18.02 ± 0.60  13.20 ± 0.27  6.06 ± 0.11
Mean Teacher [27]  21.55 ± 1.48  15.73 ± 0.31  12.31 ± 0.28  5.94 ± 0.15
SAT  14.89 ± 0.38  11.71 ± 0.29  9.58 ± 0.11  4.91 ± 0.13
Table IX: Error rates on SVHN with varying numbers of labeled examples.

Method  250 labels  500 labels  1000 labels  73257 labels
GAN [44]  –  18.44 ± 4.8  8.11 ± 1.3  –
Π model [26]  –  6.65 ± 0.53  4.82 ± 0.17  2.54 ± 0.04
Temporal Ensembling [26]  –  5.12 ± 0.13  4.42 ± 0.16  2.74 ± 0.06
VAT [28]  –  –  3.86  –
Supervised-only  27.77 ± 3.18  16.88  12.32 ± 0.95  2.75 ± 0.10
Π model [27]  9.69 ± 0.92  6.83 ± 0.66  4.95 ± 0.26  2.50 ± 0.07
Mean Teacher [27]  4.35 ± 0.50  4.18 ± 0.27  3.95 ± 0.19  2.50 ± 0.05
SAT  4.30 ± 0.22  3.72 ± 0.20  3.44 ± 0.10  2.15 ± 0.06
We compare with the state-of-the-art semi-supervised methods in the literature [27, 26]. Tables VIII and IX show that the SAT outperforms the compared methods with different numbers of labeled examples on both the CIFAR10 and SVHN datasets. The results demonstrate that the SAT captures useful representations from the transformations of both unlabeled and labeled examples, delivering competitive classification performance when only a few labeled examples are available to semi-supervise the network training.
In particular, the proposed SAT reduces the average error rates of Mean Teacher (the second best performing method) by 30.9%, 25.6%, and 22.2% relatively with 1000, 2000, and 4000 labels on CIFAR10, while reducing them by 1.1%, 11.0%, and 12.9% relatively with 250, 500, and 1000 labels on SVHN. The compared semi-supervised methods, including the Π model [26], Temporal Ensembling [26], and Mean Teacher [27], attempt to maximize the consistency of model predictions on the transformed and original images to train semi-supervised classifiers. While they also apply transformations to explore unlabeled examples, the competitive performance of the SAT model shows that transformation-equivariant representations are more compelling for classifying images than the compared methods' predicting of consistent labels under transformations. It justifies the proposed criterion of pursuing transformation equivariance as a regularizer to train a classifier.
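The relative error-rate reductions quoted above follow directly from the table entries; a minimal sketch of the computation:

```python
def relative_reduction(baseline, ours):
    """Relative error-rate reduction in percent: how much `ours`
    shrinks the baseline's error rate."""
    return 100.0 * (baseline - ours) / baseline

# Mean Teacher vs. SAT on CIFAR10 (error rates from Table VIII):
cifar = [relative_reduction(b, o)
         for b, o in [(21.55, 14.89), (15.73, 11.71), (12.31, 9.58)]]
# ≈ [30.9, 25.6, 22.2] percent with 1000, 2000, and 4000 labels
```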
Table X: Error rates on CIFAR10 with and without Entropy Minimization (EntMin).

Method  1000 labels  2000 labels  4000 labels  50000 labels
VAT w/o EntMin [28]  –  –  11.36  –
SAT w/o EntMin  15.32 ± 0.40  12.76 ± 0.26  10.90 ± 0.21  5.95 ± 0.17
VAT with EntMin [28]  –  –  10.55  –
SAT with EntMin  14.89 ± 0.38  11.71 ± 0.29  9.58 ± 0.11  4.91 ± 0.13
It is not hard to see that the SAT can be integrated into the other semi-supervised methods as their base representation, and we believe this could further boost their performance. We leave this to future work as it is beyond the scope of this paper.
6.2.1 The Impact of Entropy Minimization
We also conduct an ablation study of the effect of Entropy Minimization (EntMin) on model performance. EntMin was used in VAT [28], which outperformed the other semi-supervised methods in the literature. Here, we compare the error rates between the SAT and the VAT with and without EntMin. As shown in Table X, no matter whether entropy minimization is adopted, the SAT always outperforms the corresponding VAT. We also note that, even without entropy minimization, the SAT still performs better than the other state-of-the-art semi-supervised classifiers such as Mean Teacher, Temporal Ensembling, and the Π model shown in Table VIII. This demonstrates the compelling performance of the SAT model.
6.2.2 Comparison with Data Augmentation by Transformations
Table XI: Error rates on CIFAR10 of the SAT vs. the Data Augmentation by Transformation (DAT) classifier.

Method  1000 labels  2000 labels  4000 labels
DAT  51.00  38.61  27.99
SAT  15.72  13.20  11.05
We also compare the performance of the SAT with a classification network trained on images augmented by the transformations. Specifically, in each mini-batch, input images are augmented with the same set of random projective transformations used in the SAT. The transformation-augmented images and their labels are used to train a network with the same 13-layer architecture adopted as the SAT backbone. Note that the transformation augmentations are applied on top of the standard augmentations mentioned in the implementation details, for a fair comparison with the SAT.
Table XI compares the results between the SAT and the Data Augmentation by Transformation (DAT) classifier on CIFAR10. It shows that the SAT significantly outperforms the DAT. This is not surprising: data augmentation by transformations can only augment the labeled examples, limiting its ability to explore unlabeled examples, which play a very important role in semi-supervised learning.
Moreover, the projective transformations used in the SAT can severely distort training images, which could incur undesired updates to the model weights if the distorted images were used to naively train the network. This is evidenced by the result that data augmentation by transformations performs even worse than the supervised-only method (see Table VIII).
In contrast, the SAT avoids a direct use of the transformed images to supervise the model training with their labels. Instead, it trains the learned representations to contain as much information as possible about the transformations. The superior performance demonstrates its outstanding ability to classify images by exploiting the variations of visual structures induced by transformations on both labeled and unlabeled images.
7 Conclusion and Future Works
In this paper, we have presented a novel approach of AutoEncoding Transformations (AET) to learn representations that equivary to transformations applied to images. Unlike the group equivariant convolutions, which would become intractable with a composition of complex transformations, the AET model seeks to learn representations of arbitrary forms by reconstructing transformations from the encoded representations of original and transformed images. The idea is further extended to a probabilistic model by maximizing the mutual information between the learned representation and the applied transformation. The intractable maximization problem is handled by introducing a surrogate transformation decoder and maximizing a variational lower bound of the mutual information, resulting in the Autoencoding Variational Transformations (AVT). Along this direction, a (Semi-)Supervised Autoencoding Transformation (SAT) approach can be derived by maximizing the joint mutual information of the learned representation with both the transformation and the label for a given sample. The proposed AET paradigm lays a solid foundation for exploring transformation equivariant representations in many learning tasks. In particular, we conduct experiments to show its superior performance on both unsupervised and (semi-)supervised learning tasks following standard evaluation protocols. In the future, we will explore the great potential of applying the learned AET representations as building blocks in more learning tasks, such as (instance) semantic segmentation, object detection, super-resolution reconstruction, few-shot learning, and fine-grained classification.
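The variational lower bound summarized above can be written out explicitly. The notation here is our assumption, since this section does not restate it: $t$ denotes the sampled transformation, $z$ and $\tilde{z}$ the representations of the original and transformed images, and $q_\phi$ the surrogate transformation decoder. The standard variational bound on the mutual information [29] gives

$$ I(t;\, z, \tilde{z}) \;\ge\; H(t) \;+\; \mathbb{E}_{p(t)\, p(z, \tilde{z} \mid t)}\big[\log q_\phi(t \mid z, \tilde{z})\big], $$

so maximizing the expected log-likelihood of the decoder over sampled transformations maximizes a lower bound of the mutual information, since the entropy $H(t)$ of the transformation sampling distribution is a constant independent of the model.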
References
 [1] G. E. Hinton, A. Krizhevsky, and S. D. Wang, “Transforming autoencoders,” in International Conference on Artificial Neural Networks. Springer, 2011, pp. 44–51.
 [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
 [3] T. Cohen and M. Welling, “Group equivariant convolutional networks,” in International conference on machine learning, 2016, pp. 2990–2999.
 [4] T. S. Cohen and M. Welling, “Steerable cnns,” arXiv preprint arXiv:1612.08498, 2016.
 [5] J. E. Lenssen, M. Fey, and P. Libuschewski, “Group equivariant capsule networks,” arXiv preprint arXiv:1806.05086, 2018.
 [6] S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” in Advances in Neural Information Processing Systems, 2017, pp. 3856–3866.
 [7] G. E. Hinton, S. Sabour, and N. Frosst, “Matrix capsules with em routing,” 2018.
 [8] G. E. Hinton and R. S. Zemel, “Autoencoders, minimum description length and helmholtz free energy,” in Advances in neural information processing systems, 1994, pp. 3–10.
 [9] N. Japkowicz, S. J. Hanson, and M. A. Gluck, “Nonlinear autoassociation is not equivalent to pca,” Neural computation, vol. 12, no. 3, pp. 531–545, 2000.

 [10] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 1096–1103.
 [11] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
 [12] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, “beta-vae: Learning basic visual concepts with a constrained variational framework,” in International Conference on Learning Representations, 2017.

 [13] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, “Contractive auto-encoders: Explicit invariance during feature extraction,” in Proceedings of the 28th International Conference on Machine Learning. Omnipress, 2011, pp. 833–840.
 [14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
 [15] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” arXiv preprint arXiv:1605.09782, 2016.
 [16] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville, “Adversarially learned inference,” arXiv preprint arXiv:1606.00704, 2016.
 [17] G.-J. Qi, “Loss-sensitive generative adversarial networks on lipschitz densities,” arXiv preprint arXiv:1701.06264, 2017.
 [18] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.
 [19] M. Edraki and G.-J. Qi, “Generalized loss-sensitive adversarial learning with manifold margins,” in Proceedings of the European Conference on Computer Vision (ECCV 2018), 2018.
 [20] M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in European Conference on Computer Vision. Springer, 2016, pp. 69–84.
 [21] C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual representation learning by context prediction,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1422–1430.
 [22] M. Noroozi, H. Pirsiavash, and P. Favaro, “Representation learning by learning to count,” in The IEEE International Conference on Computer Vision (ICCV), 2017.
 [23] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” arXiv preprint arXiv:1803.07728, 2018.
 [24] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox, “Discriminative unsupervised feature learning with convolutional neural networks,” in Advances in Neural Information Processing Systems, 2014, pp. 766–774.
 [25] P. Agrawal, J. Carreira, and J. Malik, “Learning to see by moving,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 37–45.
 [26] S. Laine and T. Aila, “Temporal ensembling for semi-supervised learning,” arXiv preprint arXiv:1610.02242, 2016.
 [27] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” in Advances in Neural Information Processing Systems, 2017, pp. 1195–1204.
 [28] T. Miyato, S.-i. Maeda, S. Ishii, and M. Koyama, “Virtual adversarial training: a regularization method for supervised and semi-supervised learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
 [29] D. B. F. Agakov, “The im algorithm: a variational approach to information maximization,” Advances in Neural Information Processing Systems, vol. 16, p. 201, 2004.
 [30] L. Zhang, G.J. Qi, L. Wang, and J. Luo, “Aet vs. aed: Unsupervised representation learning by autoencoding transformations rather than data,” arXiv preprint arXiv:1901.04596, 2019.

 [31] E. Oyallon and S. Mallat, “Deep roto-translation scattering for object classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2865–2873.
 [32] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
 [33] E. Oyallon, E. Belilovsky, and S. Zagoruyko, “Scaling the scattering transform: Deep hybrid networks,” in International Conference on Computer Vision (ICCV), 2017.
 [34] P. Bojanowski and A. Joulin, “Unsupervised learning by predicting noise,” arXiv preprint arXiv:1704.05310, 2017.
 [35] X. Wang and A. Gupta, “Unsupervised learning of visual representations using videos,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2794–2802.
 [36] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in European Conference on Computer Vision. Springer, 2016, pp. 649–666.
 [37] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” arXiv preprint arXiv:1807.05520, 2018.
 [38] P. Krähenbühl, C. Doersch, J. Donahue, and T. Darrell, “Datadependent initializations of convolutional neural networks,” arXiv preprint arXiv:1511.06856, 2015.
 [39] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2536–2544.
 [40] R. Zhang, P. Isola, and A. A. Efros, “Split-brain autoencoders: Unsupervised learning by cross-channel prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

 [41] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in Advances in Neural Information Processing Systems, 2014, pp. 487–495.
 [42] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
 [43] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” 2011.
 [44] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.