Towards a Hypothesis on Visual Transformation based Self-Supervision

11/24/2019 ∙ by Dipan K. Pal, et al. ∙ Carnegie Mellon University

We propose the first qualitative hypothesis characterizing the behavior of visual transformation based self-supervision, called the VTSS hypothesis. Given a dataset upon which a self-supervised task is performed while predicting instantiations of a transformation, the hypothesis states that if the predicted instantiations of the transformations are already present in the dataset, then the representation learned will be less useful. The hypothesis was derived by observing a key constraint in the application of self-supervision using a particular transformation. This constraint, which we term the transformation conflict for this paper, forces a network to learn degenerate features, thereby reducing the usefulness of the representation. The VTSS hypothesis helps us identify transformations that have the potential to be effective as a self-supervision task. Further, it helps to predict whether a particular transformation based self-supervision technique would be effective for a particular dataset. We provide extensive evaluations on CIFAR 10, CIFAR 100, SVHN and FMNIST confirming the hypothesis and the trends it predicts. We also propose novel cost-effective self-supervision techniques based on translation and scale, which when combined with rotation outperform all transformations applied individually. Overall, this paper aims to shed light on the phenomenon of visual transformation based self-supervision.




1 Introduction

Figure 1: The Visual Transformation Self-Supervision (VTSS) Hypothesis: Consider a dataset (indicated by the dotted green circle) that is to be used to learn features through a pretext VTSS task of augmenting the data with instantiations of a transformation and then predicting them. Take a sample in the dataset and two instantiations of the transformation applied to it (solid green dots) as part of the pretext VTSS task. The VTSS hypothesis predicts that if these transformed samples lie close to samples already in the dataset (light green dotted samples), then the VTSS task of predicting the transformation will learn less useful features, i.e. features less suited to a main task such as downstream classification using that representation. On the other hand, if the augmented samples are not present in the dataset and map to points outside it (solid yellow dots), then the features learned will be more useful. The hypothesis is based on the observation that, for a given VTSS task, if a transformed sample is close in distance to, or exactly the same as, a sample already in the dataset (dotted green dots), then this creates a destructive effect called the transformation conflict. This effect forces a network to learn degenerate features. In the figure, VTSS will learn more degenerate features if the transformed samples map to the green dots rather than the yellow dots.

The Mystery of Self-Supervision. Self-supervision loosely refers to the class of representation learning techniques where it is extremely cost effective to produce effective supervision for models. Indeed, in many cases, the data becomes its own ground-truth. While a lot of effort is being directed towards developing more effective techniques [12, 31, 39, 33, 5], there has not been enough attention on the problem of understanding these techniques, or at least a subset of them, at a deeper level. Indeed, while there have been many efforts which introduced self-supervision in different forms [40, 38, 10], there have been only a few efforts which shed more light on related phenomena. In one such work, the authors focus on the trends (and the lack thereof) that different architecture choices have on the performance of the learnt representations [17]. One emerging technique that has proven to learn useful representations while being deceptively elementary is RotNet [10, 9]. RotNet takes an input image and applies specific rotations to it. The network is then tasked with predicting the correct rotation applied. In doing so, it was shown to learn useful representations which could be used in many diverse downstream applications. It was argued that, in learning to predict rotations, the network is forced to attend to the shapes, sizes and relative positions of visual concepts in the image. This explanation, though intuitive, does not help us to better understand the behavior of the technique. For instance, it does not help us answer the question: Under what conditions would a particular method succeed in learning useful representations?

Towards an Initial Hypothesis of Self-Supervision. This study hopes to provide initial answers or directions to such questions. We first model supervised classification in a transformation framework. We then define a general self-supervision task utilizing a particular transformation within the same framework. In doing so, we uncover an effect depending on the dataset and the transformation used, which would force a network to learn noisy or degenerate features. Finding a simple condition which avoids this constraint, we arrive at a visual transformation based self-supervision (VTSS) hypothesis. This hypothesis is the first attempt to characterize and move towards a theory of the behavior of self-supervision techniques. The VTSS hypothesis also offers practical applications. For instance, the hypothesis can predict or suggest reasonable transformations to use as a prediction task in VTSS given a particular dataset. Indeed, we introduce two novel VTSS tasks based on translation and scale respectively which we study in our experiments. The hypothesis can also help predict the relative trends in performance between VTSS tasks based on one or more transformations on a particular dataset based on some of the dataset properties. We confirm the VTSS hypothesis and a few trends that it predicts in our experiments.

Our Contributions. 1) We provide an initial step towards a theory of behavior of visual transformation based self-supervision (VTSS) techniques in the form of a hypothesis. The hypothesis describes a condition when a self-supervision technique based on a particular visual transformation would succeed. 2) We use the hypothesis to propose two novel self-supervision tasks based on translation and scale and argue why they might be effective in particular cases. 3) We provide extensive empirical evaluation on CIFAR 10, CIFAR 100, SVHN and FMNIST confirming the hypothesis for all transformations studied, i.e. translation, rotation and scale. We further propose to combine transformations within the same self-supervision tasks leading to performance gains over individual transformations. Finally, we provide an array of ablation studies on VTSS using rotation, translation and scale.

2 Prior Art

Self-supervised learning has recently garnered a lot of attention from the community. For a brief overview of various techniques and methods, we encourage the reader to refer to [15]. Self-supervision has proven to be effective in areas other than vision such as NLP [4, 37], robotics and reinforcement learning [28, 13, 8, 22, 21]. One of the main methods of performing self-supervision to learn useful features is to solve a pretext task. Such a task is chosen to be computationally cheap and, more importantly, to allow training of a ‘good’ representation [11].

Pretext task based methods. Several pretext tasks have been proposed for self-supervision. Examples include solving a patch-based jigsaw puzzle [5, 24, 16], predicting color channels [40, 18, 41], predicting rotations on images [10], learning features through reversing inpainting [30], and learning to count [25]. Utilizing spatial context as a supervision signal was also explored [6]. Studies found that learning robustness to corruptions in the input was also an effective pretext task [30, 34]. Geometric transformations were found to be useful for learning representations in the study [7]; however, that study did not predict the instance or the transformation but rather aimed to learn invariance towards them. This is different from a VTSS task, which predicts the exact instance of the transformation applied. Another recent task that has shown considerable promise is to match a query representation to other keys in a set belonging to the same image [11]. Other methods utilize clustering even after utilizing a pretext task [26, 3]. Similar pretext tasks have also been proposed on videos, such as solving jigsaws on video frames [1, 35]. Augmenting and then predicting rotations on videos was also found to offer a useful self-supervision signal [14]. Contrastive predictive coding [27] and contrastive multiview coding [32] are other successful methods which utilize some form of prediction of the data. In the real world, the laws of physics along with time constantly provide valuable transforming data. These temporal visual transformations can be yet another source of supervision, as explored by [20, 36, 19, 29].

3 Supervision from the Transformation Perspective

The Transformation Model of Visual Images. We adopt a model of data transformations which accounts for all the variation that is seen in general data. An image, when vectorized, yields a seed vector sampled from a seed distribution. The seed vector is first acted upon by a class-specific transformation, which is parameterized and generates a sample from that specific class. For brevity, we drop the notation for the parameters when expressing a sample. The transformations are complex and non-linear, and can introduce features into a particular sample that to a receiver might appear to be associated with a particular class. For instance, they could potentially be features of a particular individual in the case of face recognition, or features of a face at a particular pose for pose estimation.

The Transformation Paradigm of Supervised Classification. In the context of supervised classification, different instantiations of the parameters along with different seed vectors give rise to all of the samples that can be observed for a given class in training and testing. Training data is assumed to include only a subset of all possible combinations of the parameters. Testing data would be sampled from the remaining space of combinations. Note that we do not account for any relation or overlap between the transformations of two different classes. For classification, given an input image, a classifier is tasked with predicting the class output. In other words, the task is to predict which of the transformations from the set of class transformations was applied.

Self-Supervision from the Visual Transformation Perspective. The transformation framework is general and can be applied to any classification problem, including self-supervision. A general self-supervision task utilizing a particular transformation would allow all data variation or transformations to be accounted for in the seed distribution, independent of that transformation. In fact, the different classes are simply the different instantiations of the transformation. For instance, self-supervision based on rotation would be modelled with the transformation being in-plane rotation and each class being a particular instantiation of it, e.g. a clockwise rotation by a specific angle. Note that for a self-supervision task under this framework, all image data is modelled as seed vectors in the seed distribution, with each class being a particular instantiation of a transformation, including the identity transformation.

4 A Hypothesis on Visual Transformation based Self-Supervision

For our purpose, we define the usefulness of a representation as the semi-supervised classification test accuracy of a downstream classifier using that representation on a classification task of higher abstraction. A classification task can be loosely termed to be at a higher abstraction level if it uses a more complicated transformation set than the one used for the self-supervision task.

The VTSS Hypothesis: Consider a set of transformations, including the identity transformation, acting on a vector space, together with the set of all seed vectors. We simulate a dataset with a set of pre-existing transformations by applying each of those transformations to the seed vectors. Now, for a usefulness measure of a representation that is trained using a transformation based self-supervision task which predicts instantiations of a given set of transformations, the VTSS hypothesis predicts that the usefulness is inversely proportional to the number of transformations that the predicted set has in common with the pre-existing set (Eq. 1).


In other words, consider a dataset of images and a visual transformation based self-supervision (VTSS) task that predicts instantiations of a set of transformations. If the dataset already contains a lot of variation in its samples due to any of the transformations in that set, then the VTSS hypothesis predicts that the features learnt on that dataset using the corresponding VTSS task will not be useful, or that their usefulness will be diminished. In the hypothesis statement, the pre-existing transformation set is the set of all possible transformations and variations that exist in the dataset. Thus, if the predicted set and the pre-existing set have a lot of transformations in common, the usefulness decreases. In other words, the usefulness is inversely proportional to the number of transformations common between the two sets. It is important to note, however, that every instantiation of a transformation is considered different. Therefore, a clockwise rotation by one angle is a different transformation from a clockwise rotation by another angle. Each instantiation can be used as a prediction target while constructing the corresponding VTSS task.
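The inverse relation described above can be written compactly. The following is only a sketch under assumed notation: the symbols $u$ (usefulness), $\mathcal{G}_t$ (the transformation instantiations predicted by the VTSS task) and $\mathcal{G}_d$ (the transformation instantiations already present in the dataset) are placeholders chosen here for illustration and are not necessarily the notation of the original equation.

```latex
% Assumed notation (illustrative only):
%   u               usefulness of the learned representation
%   \mathcal{G}_t   instantiations predicted by the VTSS task
%   \mathcal{G}_d   instantiations already present in the dataset
u \;\propto\; \frac{1}{\lvert\, \mathcal{G}_t \cap \mathcal{G}_d \,\rvert}
```

This should be read as a qualitative trend rather than an exact law: as the overlap between the two sets grows, the usefulness is predicted to fall, and the expression says nothing precise about the case of an empty overlap.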

Figure 2: The Effect of Transformation Conflict: Consider a VTSS task of predicting between three instantiations (including the identity) of a transformation on two samples (blue and yellow dots) from a dataset. When the transformation is applied to the samples, it results in the corresponding transformed samples (light red and green dotted dots). Each color signifies the specific label or instantiation of a transformation that the network is tasked with predicting (in the figure there are two colored labels, red and green, with the identity transformation being the sample itself). For instance, RotNet [10] predicts between 4 angles including 0°. The self-supervised network will take as input each transformed or original sample (all dots), and predict the corresponding label (transformation instantiation). Left: In this case, each dataset sample is transformed into points that are distinct and away from other transformed samples or data points. Hence, there is no transformation conflict. The VTSS hypothesis in this case predicts that the features learnt will be useful. Center: Case A. Here, one of the samples (yellow) transforms into a point (corresponding light red) close to a transformed version (nearby light green) of a separate data sample (blue). This presents one way of incurring transformation conflict within the dataset. For the similar inputs of the closely overlapping dotted red and green dots, the network is expected to output both red and green labels. This causes the network to learn degenerate features, as it is trying to maximize discrimination between the two close-by samples. Right: Case B. Here, one of the samples (blue) transforms into a point (corresponding dotted light green) close to another original data point (yellow). This presents a second way of incurring transformation conflict within the dataset. For the similar inputs of the closely overlapping yellow and dotted green dots, the network is expected to output both green and identity labels. In both cases A and B, the VTSS hypothesis predicts that less useful features are learnt by the self-supervision task.

Approaching RotNet from a new perspective. The hypothesis discourages the use, for VTSS tasks, of transformations which are already present in the data. This might discourage us from utilizing in-plane rotations as a VTSS task, since small yet appreciable amounts of in-plane rotation exist in most real-world datasets. However, we must recall that each instantiation of the transformation is considered different. Hence, if we consider rotation angles that are large enough, such as the multiples of 90° predicted by RotNet [10], they are unlikely to exist in the dataset. Thus, the VTSS hypothesis predicts that in-plane rotations would be an effective VTSS task provided the range of rotation is large enough, and this indeed has been the observation [10].

Identifying Effective Transformations for VTSS. Following this train of thought, it is natural to ask what other transformations can be used for VTSS? Translation and scale are two relatively simple transformations that are easier to apply (especially translation). It would however be a fair observation to make that both transformations are in fact the most common transformations of variation in real-world visual data, which according to our hypothesis would result in an ineffective learning task. However, owing to manual and automated labelling efforts, there are many datasets in which the visual concept of interest is fairly centered in the image across all samples. This creates an opportunity for the direct use of translation as a computationally inexpensive transformation to apply to self-supervision, while being consistent with the VTSS hypothesis. Scale variation as well can be controlled and accounted for. Nonetheless, in many datasets even when the object is localized, there is relatively more scale variation than translation jitter. As part of this study, we propose the use of both translation and scale as VTSS tasks for use whenever the conditions are favorable.

Predicting Trends in Relative Performance of VTSS on Datasets. Currently, a barrage of self-supervision tasks is applied to a particular dataset as part of a trial and error process towards obtaining a desirable level of performance. It seems that there exists no heuristic to predict even a trend of effectiveness. The VTSS hypothesis provides an initial heuristic to predict trends in relative performance on any given dataset, given some properties of the dataset. It is at times possible to estimate how much variation due to a particular transformation exists in a given dataset. This might be possible through control or knowledge of the data collection process, coarse estimation through techniques like PCA, or more sophisticated techniques such as disentanglement through an information bottleneck [2]. In such cases, the VTSS hypothesis can assist in rejecting particular VTSS tasks and prioritizing others. For instance, a rotation based VTSS technique is predicted to not be beneficial on a dataset such as Rotated MNIST. Indeed, in our experiments, we observe cases when rotation based self-supervision fails completely.

Understanding the VTSS hypothesis. We discussed a few ways the VTSS hypothesis could be useful. We now provide a qualitative explanation for it. Consider two samples belonging to a dataset, and a network which will be trained for a VTSS task utilizing a transformation set that contains the identity transformation. The network therefore learns to predict one output per instantiation in the set. Specifically, given a transformed sample as input (where the transformation applied could be the identity), the network would need to predict the correct instantiation, including the identity. Now, the VTSS hypothesis predicts that as long as no instantiation in the set maps one of the two samples onto the other, the VTSS task will learn useful features. In other words, as long as there exists no transformation instantiation in the set through which the two samples can be related to one another, a useful feature will be learned. To see why, assume that such an instantiation exists, so that applying it to the first sample yields the second. Under this assumption, the network's output for the transformed first sample should be the label of that instantiation. However, the network's output for the second sample, left untransformed, should be the identity class. Notice that a conflict arises between these two requirements: because the transformed first sample and the untransformed second sample are the same input, the network is expected to output two separate classes for it. We term this phenomenon a transformation conflict for this paper (see Fig. 2), and we observe it in our experiments. Over the course of many iterations, this condition leads the network to learn noisy filters. This is because, in practice, there exist small differences between the transformed first sample and the second sample. The network will be forced to amplify such differences while trying to minimize the loss, leading to noise being learned as features.
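The argument above can be laid out step by step. This is a sketch under assumed notation rather than a reproduction of the original equations: $x_1, x_2$ denote the two samples, $f$ the network, $\mathcal{G}_t$ the transformation set of the VTSS task with identity $e$, and class indices $0$ (identity) and $i$ (for $g_i$) are hypothetical labels introduced here.

```latex
% Assumed notation (illustrative only): x_1, x_2 samples, f the network,
% \mathcal{G}_t the VTSS transformation set, e the identity, i the class index of g_i.
\begin{aligned}
\text{Assume:}\quad & \exists\, g_i \in \mathcal{G}_t,\; g_i \neq e,\ \text{such that}\ g_i(x_1) = x_2 \\
\text{Target for } g_i(x_1):\quad & f\big(g_i(x_1)\big) = i \\
\text{Target for } e(x_2) = x_2:\quad & f(x_2) = 0 \quad (\text{identity class}) \\
\text{Since } g_i(x_1) = x_2:\quad & f(x_2) = i \ \text{ and } \ f(x_2) = 0 \ \text{ with } i \neq 0
\end{aligned}
```

The last line is the transformation conflict: a single input is assigned two different targets. When the equality only holds approximately, the loss can be reduced only by separating the two nearly identical inputs, which is where the noisy features come from.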

Rotation N-way C10 SVHN
Baseline () 4
, 4
, , 4
, , , 4
Translation N-way C10 SVHN
Baseline (C) 5
C, U 5
C, U, R 5
C, U, D, L 5
C, U, D, L, R 5
FS 3 blocks 10
FS 4 blocks 10
Table 1: VTSS Hypothesis Confirmation. The transformation instantiations in the left column were added into the original data, independent of the additional augmentation by the VTSS task. For each dataset, the number denotes the semi-supervised accuracy following the protocol described, with an N-way transformation prediction task that is fixed for Rotation and for Translation. FS indicates Fully Supervised. The smaller number in brackets denotes the N-way (number of classes) test accuracy on the VTSS task itself. The VTSS task remains constant while more transformations are added into the original data. This artificially increases the overlap between the predicted and the pre-existing transformations in Eq. 1. We see that the usefulness decreases, confirming Eq. 1.

5 Experimental Validation

Our goal through an extensive experimental validation is threefold. 1) To confirm (or find evidence against) the VTSS hypothesis for VTSS tasks based on rotation and translation. 2) To explore the efficacy of solving VTSS tasks with individual transformations and with their additive combinations. 3) To perform ablation studies on VTSS tasks based on rotation, translation and scale to help gain insights into effects on semi-supervised performance. For these experiments, we utilize the CIFAR 10, CIFAR 100, FMNIST and SVHN datasets (additional experiments, including more datasets, are provided in the supplementary). Our aim is not to maximize any individual performance metric or achieve state-of-the-art performance on any task, but rather to discover overall trends in behavior leading to deeper insights into the phenomenon of self-supervision through visual transformations.

General Experimental Protocol. For each transformation and dataset, the overall experimental setup remained unchanged. We follow the training protocol introduced in the RotNet study [10], where a backbone network with 4 convolution blocks is first trained with a VTSS task based on some transformation (rotation, translation and/or scale). This network is tasked with predicting specific transformation instances following the self-supervision protocol. After training, the network weights are frozen and the feature representations from the second convolution block are utilized for training a downstream convolutional classifier, which is tasked with the usual supervised classification task, i.e. predicting class labels for a particular dataset. Our choice of exploring the performance trends of the second block is informed by the original RotNet study, where the second conv block exhibited maximum performance for CIFAR 10 [10]. However, since our focus is on discovering overall trends rather than maximizing individual performance numbers, this choice is inconsequential for our study. Thus, the overall learning setting is semi-supervised, since part of the pipeline utilizes the frozen self-supervised weights. The final semi-supervised test accuracies reported on the test data of each dataset were obtained with this semi-supervised pipeline.


The network architecture for all experiments consists of four convolutional blocks, followed by global average pooling and a fully connected layer. Each convolutional block is a stack of three convolution layers, each of which is followed by batch normalization and a ReLU. Finally, there exists a pooling layer between two blocks (more details are provided in the supplementary).

EXP 1: Confirming the VTSS Hypothesis

Goal: Recall that the VTSS hypothesis predicts that a VTSS task would learn useful features using a particular transformation only when the predicted instantiations of that transformation do not already exist in the data. In this experiment, we test this hypothesis for the VTSS tasks based on rotation [10] and translation. The overall approach is to break the assumption of the VTSS hypothesis that the predicted instantiations are not present in the original data. We do this by introducing an increasing number of the predicted instantiations into the original data itself, independent of the fact that the VTSS task would additionally apply and predict those instantiations to learn a useful representation. This artificially increases the overlap between the predicted and the pre-existing transformations. Checking whether the usefulness (the semi-supervised performance of the learned representations) varies inversely allows one to confirm whether the VTSS hypothesis holds.

Experimental Setup: We explore three VTSS tasks based on rotation, translation and scale respectively. The prediction ranges of these transformations are as follows (more details are provided in the supplementary):

1) VTSS Rotation [10]: Image rotations by 0°, 90°, 180° and 270°, leading to 4-way classification. The input image was rotated by one of the four angles. The VTSS task was to predict the correct rotation angle applied.

2) VTSS Translation: Image translations by 5 pixels in the directions up, down, left and right, plus no translation (center), leading to a 5-way classification task. From the original image, a center crop with a 5 pixel margin was taken, which was then considered to be the ‘no translation’ input (center crop). Translations by 5 pixels were applied to this center patch in one of the directions up, down, left and right. The VTSS task was to predict the direction in which the image was translated. The 5 pixel margin allows for a 5 pixel translation with no artifacts.
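The two pretext tasks listed above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it assumes the four rotation angles are the multiples of 90° used by RotNet, applied clockwise, and it assumes a particular convention for how a direction such as "up" maps to a shift of the crop window. Images are plain nested lists so the sketch runs without third-party dependencies, and all function names are hypothetical.

```python
import random

ROTATION_LABELS = (0, 90, 180, 270)  # degrees; clockwise direction is an assumption
TRANSLATION_LABELS = ("center", "up", "down", "left", "right")
# Unit offsets (rows, cols) of the crop window per direction (assumed convention).
_UNIT = {"center": (0, 0), "up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def rot90_cw(img):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def rotate_by_label(img, label):
    """Apply `label` quarter turns, i.e. a rotation by ROTATION_LABELS[label] degrees."""
    out = [list(row) for row in img]
    for _ in range(label):
        out = rot90_cw(out)
    return out


def translated_crop(img, direction, margin=5):
    """Crop a (H - 2*margin) x (W - 2*margin) window, shifted by `margin` pixels.

    "center" is the plain center crop; the other directions move the window
    by `margin` pixels, which the margin makes possible without padding artifacts.
    """
    h, w = len(img), len(img[0])
    dy, dx = _UNIT[direction]
    top, left = margin * (1 + dy), margin * (1 + dx)
    return [row[left:left + w - 2 * margin] for row in img[top:top + h - 2 * margin]]


def make_example(img, task, rng=random):
    """Pick one instantiation at random; return (transformed image, class index)."""
    if task == "rotation":
        label = rng.randrange(len(ROTATION_LABELS))
        return rotate_by_label(img, label), label
    label = rng.randrange(len(TRANSLATION_LABELS))
    return translated_crop(img, TRANSLATION_LABELS[label]), label


# Toy 32x32 "image" whose pixel value encodes its position.
image = [[r * 32 + c for c in range(32)] for r in range(32)]
patch, target = make_example(image, "translation")
```

In both tasks the class index returned alongside the transformed image is the only supervision signal, which is why the data effectively labels itself.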

Figure 3: Representative samples from (a) the CIFAR 10 and (b) the SVHN datasets.

For each transformation, more of its instantiations were sequentially added into the original data, independent of the corresponding VTSS task. Therefore, for each image, there are in fact two separate stages where a transformation is added: a) the ablation study itself and b) the VTSS task, independent of the ablation study. We now explain the protocol in detail for rotation, which has 4 runs (experiments). Run 1) Baseline. The original data contains no rotations added in. This is used for the standard RotNet VTSS task of rotating the image by one of the four rotations and predicting it as a 4-way task. This model is evaluated for semi-supervised accuracy and is set as the baseline. Run 2) Next, the same procedure is followed; however, the original data that is sent to the VTSS task already contains all images at two of the rotations. It is crucial to note that the VTSS task of applying one of the four rotations and then predicting it remains unchanged. The VTSS task transforms and predicts on the original images and the added rotated images identically. Run 3) Now, the same procedure (VTSS task followed by semi-supervised evaluation) is followed with all images present at each of three rotations. Run 4) Finally, yet another run uses all four rotations added in to all images of the original data. This protocol is followed similarly for translation (predicting 5-way between no translation, up, down, left and right), where the particular transformations that were measured by the VTSS task were added into the original data sequentially. Further details are provided in the supplementary. The performance metric considered for each transformation is the semi-supervised accuracy (obtained using the protocol explained in the general experimental settings) on the CIFAR 10 and SVHN test sets.
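The two-stage structure of this ablation can be made concrete with a toy simulation. The sketch below is an illustration under stated assumptions and not the authors' code: rotations are quarter turns applied clockwise, the order in which rotations are added to the data is assumed, and the helper names are hypothetical. It counts how many different target labels the unchanged 4-way VTSS task ends up assigning to one and the same input as more rotations are added to the original data.

```python
from collections import defaultdict


def rot90_cw(img):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def rotate(img, quarter_turns):
    """Rotate by the given number of quarter turns (clockwise, assumed)."""
    out = [list(row) for row in img]
    for _ in range(quarter_turns % 4):
        out = rot90_cw(out)
    return out


def add_rotations(dataset, num_present):
    """Ablation stage: keep every image at the first `num_present` quarter turns.

    num_present=1 is the baseline (nothing added); the order of addition is assumed.
    """
    return [rotate(img, k) for img in dataset for k in range(num_present)]


def vtss_rotation_pairs(dataset):
    """VTSS stage, identical in every run: each image at all four rotations, with label."""
    return [(rotate(img, k), k) for img in dataset for k in range(4)]


def labels_per_input(pairs):
    """Map each distinct network input to the set of labels it is paired with."""
    seen = defaultdict(set)
    for img, label in pairs:
        seen[tuple(map(tuple, img))].add(label)
    return seen


original = [[[1, 2], [3, 4]]]  # one toy image with no rotational symmetry
for run, num_present in enumerate(range(1, 5), start=1):
    pairs = vtss_rotation_pairs(add_rotations(original, num_present))
    worst = max(len(labels) for labels in labels_per_input(pairs).values())
    print(f"run {run}: up to {worst} different target labels for one input")
```

In this toy setting the count grows from one label per input in the baseline to four when all rotations are present, at which point the input carries no information about the label. Real images are only approximately equal after rotation, so the effect there is a matter of degree.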

Results and Discussion. Table 1 showcases the results of these experiments. The left column indicates which transformation instantiations (for rotation and translation respectively) were added into the original data as part of this ablation study. The semi-supervised accuracy indicates the performance of the learned features on the downstream classification task (protocol introduced in [10]). The number in brackets is the self-supervised accuracy, which indicates the test accuracy on the VTSS task itself. Higher accuracy indicates that the model is able to distinguish between the transformations added in. We make a few observations.

Observation 1: We find that VTSS Rotation performs well when there are no rotations already present in the data (both for CIFAR 10 and SVHN). However, the method completely breaks down for both datasets when all rotations are present. This indicates that the model has learnt noisy features.

Observation 2: Notice that the self-supervision classification accuracy for all transformations steadily decreases as more rotations are added into the original data. This is in alignment with Eq. 1. Indeed, as more ablation transformations are added into the original data, it becomes difficult for the VTSS task to learn useful features due to the transformation conflict effect. Observations 1 and 2 together confirm the VTSS hypothesis.

Pre-existing Transformations in SVHN and CIFAR: For the next observation we take a look at the SVHN [23] and CIFAR 10 datasets, illustrated with a few samples in Fig. 3(b) and Fig. 3(a). For SVHN, we find that there exists considerable scale variation and blur within each digit class. This blur also acts as scale variation, as it simulates the process in which a small low resolution object is scaled up, leading to blur. However, note that since the dataset was created by extending each digit bounding box in the appropriate directions to form a square, each digit of interest is almost exactly centered. Thus, very little translation pre-exists in the dataset. Coupled with the fact that digits have less variation than general objects, the visual concepts of interest are more centered. CIFAR 10, on the other hand, has more complicated objects, also with some scale variation already present in the vanilla dataset. The complex nature of the visual classes results in relatively more translation jitter of the visual concepts of interest than in SVHN. Lastly, both datasets have some rotation variation, though not as extreme as the large rotation angles predicted by VTSS Rotation.

Observation 3: We observe that VTSS Translation performs closer to the fully supervised performance for SVHN compared to CIFAR 10. Keeping in mind that SVHN has relatively less translation than CIFAR 10, this is consistent with and supports the VTSS hypothesis.

Method               C10     SVHN    FMNIST  C100
R [10]               89.15   91.29   91.94   63.62
T                    86.20   91.16   88.98   57.10
S                    43.89   28.42   83.48   17.72
R+T                  89.58   91.56   92.18   64.79
S+T                  71.53   87.56   88.02   45.09
R+S                  89.00   89.16   91.81   63.78
R+T+S                89.39   91.72   91.63   64.87
FS 3 blocks          89.96   91.43   77.37   64.93
FS 4 blocks          90.26   92.50   92.21   65.95
S (full)             31.16    7.76   87.87   15.01
R (full)             88.50   90.08   93.70   61.04
FS 3 blocks (full)   91.35   80.83   92.65   52.60
FS 4 blocks (full)   91.66   90.57   94.62   67.95
Table 2: Visual Transformation based Self-Supervision (VTSS) through a combination of transformations. The column on the left denotes the transformation that the self-supervised backbone was trained with (R: rotation, T: translation, S: scale, FS: fully supervised). In the full crop (full) setting, the entire image was utilized for training and testing. In the base setting, the center crop of the image with a margin of 5 pixels on all sides was used. This was done for a better comparison with VTSS Translation, which requires a 5 pixel margin to allow room for translations as the VTSS task. Bold and italics indicate the best and second-best performances respectively. VTSS tasks using a combination of transformations performed the best for all four datasets.

EXP 2: Exploring VTSS Tasks with Multiple Transformations Simultaneously

Goal: Though self-supervision with images has shown considerable promise as an unsupervised technique, it is still a fairly recent paradigm. Typically, self-supervision using visual transformations has been applied with a single transformation type, for instance exclusively rotations for RotNet [10]. Given that in this study we have demonstrated the existence of a VTSS technique for translation and scale as well, it is natural to ask the question: how does the performance differ when using multiple transformations in conjunction? We explore answers and also observe phenomena that the VTSS hypothesis predicts.

Experimental Setup: For the datasets CIFAR 10, CIFAR 100, FMNIST and SVHN, we train the standard backbone network with 4 convolution blocks on the VTSS tasks of Rotation (4 classes, as in [10]), Translation (up, down, left, right and no translation (center), with a shift of 5 pixels) and Scale (0 pix, 2 pix and 4 pix zoom). VTSS Translation needs a small margin in order to translate without artifacts. We therefore apply this crop margin (of 5 pixels) to all data for all tasks, so that the default image for every task is the 5 pixel margin center crop of the original image. For VTSS Scale, this crop was designated to be the 0 pix zoom. A 2 pix zoom (or 4 pix zoom) performs yet another crop with a 2 pix (or 4 pix) margin on each side before resizing the image back to the center crop size. We combine two or more transformations in an additive fashion. For instance, if VTSS Rotation predicts 4 classes and VTSS Translation predicts 5 classes, the task VTSS Rotation + Translation will predict 4+5-1 = 8 classes overall (the identity classes of all transformations are merged into a single class). We perform experiments with all 4 combinations of the three transformations. Additionally, we run each transformation individually to serve as a baseline under the center crop setting. All individual and combined transformations were run in the center crop setting so that the data size is consistent with the VTSS Translation task in the combination experiments. In practice, however, when VTSS Scale and Rotation [10] are applied independently, the entire image would be used and not just the center crop. We therefore provide additional results for the individual transformations of VTSS Rotation and Scale on the full sized crop of side 32 (without any center cropping). The corresponding fully-supervised results with the full crop are also provided.
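The crop geometry and the additive label space described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the class names (`r1`, `zoom2`, and so on) and the convention that "up" moves the crop window towards the top of the image are all assumptions made here for the example.

```python
IMAGE_SIDE = 32  # full image side, as stated in the text
MARGIN = 5       # crop margin applied to all tasks, as stated in the text


def translation_boxes(shift=5, margin=MARGIN, side=IMAGE_SIDE):
    """Return crop boxes (top, left, crop_side) for the 5 translation classes.

    Assumption: a translation is realised by moving the crop window.
    """
    crop = side - 2 * margin
    offsets = {
        "center": (0, 0),
        "up": (-shift, 0),
        "down": (shift, 0),
        "left": (0, -shift),
        "right": (0, shift),
    }
    return {name: (margin + dy, margin + dx, crop)
            for name, (dy, dx) in offsets.items()}


def scale_boxes(zooms=(0, 2, 4), margin=MARGIN, side=IMAGE_SIDE):
    """Return crop boxes for the scale classes.

    Each box would afterwards be resized back to the center crop side.
    """
    return {z: (margin + z, margin + z, side - 2 * (margin + z))
            for z in zooms}


def combined_label_space(*tasks):
    """Merge tasks additively, sharing a single identity class."""
    labels = ["identity"]
    for name, classes in tasks:
        labels += [f"{name}:{c}" for c in classes if c != "identity"]
    return labels


# Hypothetical class names, used only for this illustration.
ROTATION = ("R", ["identity", "r1", "r2", "r3"])
TRANSLATION = ("T", ["identity", "up", "down", "left", "right"])
SCALE = ("S", ["identity", "zoom2", "zoom4"])

print(len(combined_label_space(ROTATION, TRANSLATION)))  # 4 + 5 - 1
```

Under these assumptions the center crop has side 22, a 5 pixel shift keeps every translated window inside the 32 pixel image, and the 4 pix zoom crops a window of side 14 before resizing.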

Figure 4: Samples from the Fashion-MNIST dataset. Note that compared to CIFAR 10 and SVHN, FMNIST contains considerably less scale variation. Thus, the VTSS hypothesis predicts that VTSS Scale would perform better on FMNIST than on CIFAR 10 and SVHN, which is indeed the case in Table 2. This is yet another confirmation of the VTSS hypothesis.

Pre-existing Transformations in Fashion-MNIST: A few sample images from the Fashion-MNIST dataset are shown in Fig. 4. One notices immediately that the dataset contains little to no translation jitter, no rotation and, importantly, very little scale variation compared to CIFAR and SVHN (see the corresponding figure in the main paper). The VTSS hypothesis therefore predicts that VTSS Scale would be effective, which makes FMNIST a good dataset for demonstrating the effectiveness of the VTSS Scale task. The dataset contains 60,000 training images and 10,000 testing images. Each image has side 28, which for our experiments was rescaled to 32. This does not affect overall trends in our experiments since all images were resized equally.

Figure 5: Results of the Ablation study: Effect of Transformation Range on VTSS Translation, Rotation [10] and Scale on CIFAR 10. Panels: (a) Rotation, (b) Translation, (c) Scale.

Results: Individual Transformations. The results of these experiments are presented in Table 2. We find that VTSS Rotation performs consistently well overall. However, given that SVHN has less translation (see the discussion on pre-existing transformations in datasets), VTSS Translation performs better on SVHN than on CIFAR 10 and 100. This is consistent with the VTSS hypothesis. Note that VTSS Scale performs worse on both the CIFAR datasets and SVHN. Recalling the prior discussion regarding the presence of scale variation in both CIFARs and SVHN, this result is consistent with the VTSS hypothesis. In fact, due to the presence of more blur, which acts as scale variation, VTSS Scale works worse on SVHN than on CIFAR 10, which has no common blurry artifacts. This observation is also consistent with the VTSS hypothesis. Interestingly, given that FMNIST has considerably less scale variation than CIFAR and SVHN, the VTSS hypothesis predicts that VTSS Scale would perform better on FMNIST than on CIFAR and SVHN. Indeed, this is what we observe. Both the full crop and the center crop VTSS Scale performances on FMNIST are significantly higher than those on CIFAR and SVHN. This provides further evidence towards the confirmation of the VTSS hypothesis.

Results: Combinations of Transformations. We find that a combination of VTSS R+T works better than isolated VTSS Rotation for all four datasets. This is also true for VTSS R+T+S for all datasets except FMNIST. This is the first evidence that utilizing multiple transformations simultaneously as a single VTSS task can provide performance gains over any individual transformation. Given that it is computationally inexpensive to train under such a setting, this result is encouraging. Notice also that even though VTSS Scale performs worse on SVHN than on CIFAR, the combination VTSS S+T performs better on SVHN than on CIFAR. Nonetheless, due to the inherent presence of scale variation in CIFAR and SVHN, the VTSS hypothesis predicts that VTSS tasks involving scale would suffer in performance. In contrast, scale achieves more success on FMNIST due to the absence of inherent scale variation (a hypothesis prediction). This is something we do observe in Table 2. From these experiments, we conclude that there is benefit in combining VTSS tasks for different transformations; however, this must be done while being aware of what transformations or factors of variation already exist in the data. Indeed, in the center crop setting, the best performing VTSS task for each of the four datasets used a combination of transformations rather than an individual transformation.

Overall Observation: Effectiveness of VTSS Rotation

Our results indicate that VTSS Rotation consistently performs better than translation and scale when applied individually. The VTSS hypothesis also begins to offer a probable justification as to why. The degree to which translation and scale are applied as part of the VTSS task to learn these features (which were effective nonetheless) is small. We found that larger variation typically reduced performance given a fixed sized image (see Fig. 5(b) and Fig. 5(c)). These transformations are likely to exist in the data in subtle amounts, at ranges similar to those applied as part of the VTSS task. This leads to the detrimental effect of transformation conflict (see Fig. 2). On the other hand, VTSS Rotation was found to work well at large ranges (0°, 90°, 180°, 270°) in the original study [10], and rotation seldom occurs naturally at such large ranges in most datasets (including real-world ones). Our VTSS hypothesis therefore predicts that rotation is a particularly well-suited transformation for VTSS for general visual data.

EXP 3: Ablation Study: Effect of Transformation Range. We trained a backbone feature extractor network (RotNet) with the VTSS Rotation task for different sets of rotations on CIFAR 10 (additional ablation studies are provided in the supplementary). No independent rotations were added (as in our VTSS hypothesis confirmation study) other than the rotations added by the VTSS task itself. We perform this experiment similarly for the VTSS Translation and Scale tasks. For translation, we steadily increase the number of directions in which the translation is applied and correspondingly increase the number of classes for prediction. The pixel range of the translation was also varied. For VTSS Scale, the number of scales and the range were varied.

Results. The results of this experiment are presented in Fig. 5(a), Fig. 5(b) and Fig. 5(c). We find that though predicting more rotation angles helps, the improvement is marginal. Further, predicting even a single instantiation of the transformation resulted in the learning of useful representations. These results are consistent with those reported in the original study [10]. In the case of translation, however, we find a significant increase initially, after which there are diminishing returns. It is interesting to note that VTSS Translation learnt useful features even with a single pixel shift (on each side) (while using traditional data augmentation). In fact, the 5 pixel shift performs only marginally better. However, an 8 pixel shift (and therefore a total crop of just 32-16=16) proves excessive for a dataset of 32 sized images and drastically decreases performance. For VTSS Scale, it was difficult to find an overall trend. Nonetheless, we find that the representations learned are in general poor. We hypothesize that this is largely due to the scale variation already present in CIFAR. However, it is interesting to note that a scale variation of even 1 and 2 pixels can be useful for representation learning.


6 Appendix: Additional Ablation Studies, Details and Observations

General Hyperparameters.

For each transformation and dataset, the evaluation protocol and hyperparameters remained constant. The self-supervised backbone network and the fully-supervised network were both trained on the training set for all training samples (unless specified) for 200 epochs. The learning rate was set at 0.1 and multiplied by 0.02 at 60, 120 and 180 epochs, using SGD with a batch size of 128, a momentum of 0.9 and weight decay. For every transformation to be added in for a particular VTSS task, e.g. rotation, the entire batch was transformed by that augmentation and added to the batch. For the VTSS hypothesis confirmation studies, however, the transformations were only added within the original 128 sized batch, as an ablation study. In that case, if there are several instantiations of the transformation to be added in, the 128 sized batch was divided by the number of instantiations and each shard was transformed by one instantiation. Once the self-supervised network was trained, the weights up to the second conv block were frozen and a classifier was added on top.
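As a rough sketch of the two batching modes and the step schedule described above, the bookkeeping could look like the following. The function names are hypothetical, samples are represented by plain indices, and dropping the remainder when the batch size is not divisible by the number of instantiations is an assumption of this example, since the text does not say how that case was handled.

```python
def learning_rate(epoch, base_lr=0.1, factor=0.02, milestones=(60, 120, 180)):
    """Step schedule: multiply the rate by `factor` at each milestone epoch."""
    drops = sum(epoch >= m for m in milestones)
    return base_lr * factor ** drops


def augment_full_batch(batch, num_instantiations):
    """Standard VTSS mode: every sample receives every instantiation.

    Returns (sample, instantiation_label) pairs, so the effective batch
    grows by a factor of `num_instantiations`.
    """
    return [(x, k) for k in range(num_instantiations) for x in batch]


def shard_batch(batch, num_instantiations):
    """Confirmation-study mode: the batch size stays fixed.

    The batch is split into equal shards and each shard receives one
    instantiation. Assumption: leftover samples are dropped.
    """
    shard = len(batch) // num_instantiations
    return [(x, k)
            for k in range(num_instantiations)
            for x in batch[k * shard:(k + 1) * shard]]


batch = list(range(128))  # stand-in for a batch of 128 images
print(len(augment_full_batch(batch, 4)), len(shard_batch(batch, 4)))
```

With four instantiations, the first mode produces 512 labelled samples per step while the second keeps the step at 128 samples, 32 per instantiation.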

Further Details in Architecture.

All networks were built from units called conv blocks. Each conv block consisted of 3 conv layers with 192 channels, each followed by batch normalization and ReLU. The fully-supervised and self-supervised backbone networks consisted of 4 conv blocks (unless specified otherwise) with average pooling of kernel size 3, stride 2 and padding 1 after each block. The semi-supervised classifier was trained on conv block 2 features after training the self-supervision model. It consisted of a single conv block with 192 channels and global average pooling, followed by a single linear layer.
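To make the effect of the pooling layers concrete, the spatial size of the feature maps after each conv block can be traced with the usual pooling output formula. This sketch assumes that the conv layers inside a block preserve spatial size and that pooling rounds down; neither detail is stated in the text, and the function names are made up for the example.

```python
def pool_output_side(side, kernel=3, stride=2, padding=1):
    """Output side of a pooling layer, rounding down."""
    return (side + 2 * padding - kernel) // stride + 1


def feature_map_sides(input_side, num_blocks=4):
    """Feature map side after each conv block plus its average pooling.

    Assumption: the three conv layers in a block keep the spatial size.
    """
    sides = [input_side]
    for _ in range(num_blocks):
        sides.append(pool_output_side(sides[-1]))
    return sides


print(feature_map_sides(32))  # full crop
print(feature_map_sides(22))  # center crop with a 5 pixel margin
```

Under these assumptions the semi-supervised classifier, which reads conv block 2 features, would see 8 x 8 maps in the full crop setting and 6 x 6 maps in the center crop setting.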

Ablation B: Effect of number of self-supervised and supervised samples. Typically, self-supervision tasks are trained on as much data as possible. This is primarily due to the availability of inexpensive self-labels. However, we explore the case where there is an imbalance of data between the self-supervision and semi-supervision tasks. We vary the number of samples available per class for both the VTSS Rotation and the downstream semi-supervision tasks.
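One simple way to build such per-class subsets is to sample a fixed number of indices from each class. The sketch below is only an illustration of that idea: the function name, the fixed seed and the uniform sampling without replacement are assumptions, and the text does not describe how the subsets were actually drawn.

```python
import random
from collections import defaultdict


def subsample_per_class(labels, per_class, seed=0):
    """Pick `per_class` sample indices from every class.

    Assumption: uniform sampling without replacement, with a fixed seed
    so that the subset is reproducible.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        chosen.extend(rng.sample(indices, min(per_class, len(indices))))
    return sorted(chosen)


# Toy labels: 10 classes with 100 samples each.
labels = [i % 10 for i in range(1000)]
vtss_subset = subsample_per_class(labels, per_class=20)
semi_subset = subsample_per_class(labels, per_class=50, seed=1)
print(len(vtss_subset), len(semi_subset))
```

Using separate calls for the self-supervised and semi-supervised subsets lets the two sample budgets be varied independently, which is the imbalance this ablation studies.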

Ablation B: Results. Fig. 6 showcases the results of this experiment. Interestingly, we find that the performance increases linearly at almost identical rates for both the self-supervision and the downstream semi-supervision tasks. For instance, the performance with 1000 samples/class for VTSS and just 20 samples/class for semi-supervised learning is very similar to that with 20 samples/class for VTSS and 1000 samples/class for semi-supervised learning. We find similar trends for other settings. This highlights the benefits of VTSS tasks. For a particular amount of data with an inexpensive self-labelling scheme, VTSS provides a level of performance to the downstream classifier similar to that of a semi-supervised model trained with the same amount of labeled data. Such linear parallels between VTSS and downstream semi-supervision are encouraging.

Ablation C: Effect of number of classes used for self-supervision. VTSS tasks are typically applied to a wide array of data. However, what is the level of returns that a VTSS task provides given a steady increase in both diversity and amount of data? For this experiment, we train the same network of 4 conv blocks with VTSS Rotation on data from a varying number of classes. We are interested in the trend with which the downstream semi-supervised accuracy increases.

Ablation C: Results. The results of this experiment are presented in Fig. 7. We find that the downstream semi-supervision performance increases as expected. However, the returns are diminishing. Indeed, when there are 5000 samples/class available for training the semi-supervised network, the performance saturates with just 5 classes=5000 samples in total for the VTSS task. Similar trends are observed for the cases where fewer samples/class are available for semi-supervised learning. This suggests that though VTSS tasks are powerful, they seem to hit a barrier in the diversity of features that the model can learn. Attention must be paid to other aspects of the learning problem, such as the size of the network, architecture etc. [17], in order to allow further improvements that leverage more self-supervised data.

Figure 6: Results of Ablation B. Effect of number of self-supervised and supervised samples.
Figure 7: Results of Ablation C. Effect of number of classes used for self-supervision.

Additional Observation: Effectiveness of VTSS tasks in general. Taking a step back from a detailed inter-transformation analysis, we observe the performance of self-supervision followed by semi-supervised tasks. Recall that the original VTSS backbone network consisted of 4 conv blocks of three conv layers each. Only the first two conv blocks were then used as a fixed feature extractor for the semi-supervised classifier trained on top. This classifier consisted of a single conv block that is identical to the other blocks. Therefore, in Table 2 and Table 2 in the main paper, the fully supervised networks (FS) with 3 blocks had similar complexity to the overall self-supervised and then semi-supervised models, and provided a fair comparison from the perspective of model complexity. Yet, in the case of SVHN and FMNIST, we find that VTSS Rotation performs better than FS 3 blocks for the full crop setting (including scale for FMNIST). For the center crop experiments, VTSS R+T performs better than FS 3 blocks for SVHN. This showcases the overall effectiveness of VTSS tasks compared to fully-supervised networks of similar complexity.