The Mystery of Self-Supervision. Self-supervision loosely refers to the class of representation learning techniques where it is extremely cost effective to produce effective supervision for models. Indeed, in many cases the data becomes its own ground truth. While much effort is being directed towards developing more effective techniques [12, 31, 39, 33, 5], not enough attention has been paid to the problem of understanding these techniques, or at least a subset of them, at a deeper level. Indeed, while there have been many efforts which introduced self-supervision in different forms [40, 38, 10], only a few efforts have shed more light on related phenomena. In one such work, the authors focus on the trends (and the lack thereof) that different architecture choices have on the performance of the learnt representations. One emerging technique that has proven to learn useful representations while being deceptively elementary is RotNet [10, 9]. RotNet takes an input image and applies specific rotations to it. The network is then tasked with predicting the correct rotation applied. In doing so, it was shown to learn useful representations which could be used in many diverse downstream applications. It was argued, however, that in learning to predict rotations, the network is forced to attend to the shapes, sizes and relative positions of visual concepts in the image. This explanation, though intuitive, does not help us better understand the behavior of the technique. For instance, it does not help us answer the question: under what conditions would a particular method succeed in learning useful representations?
Towards an Initial Hypothesis of Self-Supervision. This study hopes to provide initial answers or directions for such questions. We first model supervised classification in a transformation framework. We then define a general self-supervision task utilizing a particular transformation within the same framework. In doing so, we uncover an effect, dependent on the dataset and the transformation used, which forces a network to learn noisy or degenerate features. Finding a simple condition which avoids this constraint, we arrive at a visual transformation based self-supervision (VTSS) hypothesis. This hypothesis is a first attempt to characterize and move towards a theory of the behavior of self-supervision techniques. The VTSS hypothesis also offers practical applications. For instance, the hypothesis can predict or suggest reasonable transformations to use as a prediction task in VTSS given a particular dataset. Indeed, we introduce two novel VTSS tasks based on translation and scale respectively, which we study in our experiments. The hypothesis can also help predict relative trends in performance between VTSS tasks based on one or more transformations, given some properties of the particular dataset. We confirm the VTSS hypothesis and a few trends that it predicts in our experiments.
Our Contributions. 1) We provide an initial step towards a theory of the behavior of visual transformation based self-supervision (VTSS) techniques in the form of a hypothesis. The hypothesis describes a condition under which a self-supervision technique based on a particular visual transformation would succeed. 2) We use the hypothesis to propose two novel self-supervision tasks based on translation and scale and argue why they might be effective in particular cases. 3) We provide extensive empirical evaluation on CIFAR 10, CIFAR 100, SVHN and FMNIST confirming the hypothesis for all transformations studied, i.e. translation, rotation and scale. We further propose to combine transformations within the same self-supervision task, leading to performance gains over individual transformations. Finally, we provide an array of ablation studies on VTSS using rotation, translation and scale.
2 Prior Art
Self-supervised learning has recently garnered a lot of attention from the community. For a brief overview of various techniques and methods, we encourage the reader to refer to existing surveys. Self-supervision has proven to be effective in areas beyond vision, such as NLP [4, 37], robotics and reinforcement learning [28, 13, 8, 22, 21]. One of the main methods of performing self-supervision to learn useful features is to solve a pretext task. Such a task is chosen to be ideally computationally cheap and, more importantly, to allow training towards a 'good' representation.
Pretext task based methods. Several pretext tasks have been proposed for self-supervision: for instance, solving a patch-based jigsaw puzzle [5, 24, 16], predicting color channels [40, 18, 41], predicting rotations on images, learning features through reversing inpainting, and learning to count. Utilizing spatial context as a supervision signal has also been explored. Studies found that learning robustness to corruptions in the input was also an effective pretext task [30, 34]. Geometric transformations were found useful for learning representations in one study; however, that work did not predict the instance of the transformation but rather aimed to learn invariance towards it. This is different from a VTSS task, which predicts the exact instance of the transformation applied. Another recent task that has shown considerable promise is to match a query representation to other keys in a set belonging to the same image. Other methods utilize clustering even after utilizing a pretext task [26, 3]. Similar pretext tasks have also been proposed on videos, such as solving jigsaws on video frames [1, 35]. Augmenting and then predicting rotations on videos was also found to offer a useful self-supervision signal. Contrastive predictive coding and contrastive multiview coding are other successful methods which utilize some form of prediction of the data. In the real world, the laws of physics along with time constantly provide valuable transforming data. These temporally driven visual transformations can be yet another source of supervision, as explored by [20, 36, 19, 29].
3 Supervision from the Transformation Perspective
The Transformation Model of Visual Images.
We adopt a model of data transformations which accounts for all the variation that is seen in general data. Given an image which, when vectorized, results in a seed vector $s$ sampled from a seed distribution $\mathcal{S}$, it is first acted upon by a transformation $\mathcal{T}_k$, i.e. $x = \mathcal{T}_k(s; \theta_k)$. This transformation is parameterized by $\theta_k$ and generates a sample from the specific class $k$. For brevity, we drop the notation for the parameters and express a sample as $x = \mathcal{T}_k(s)$. The transformations are complex and non-linear, and can introduce features into a particular sample that, to a receiver, might appear to be associated with a particular class.
The Transformation Paradigm of Supervised Classification. In the context of supervised classification, different instantiations of the parameters $\theta_k$ along with different seed vectors give rise to all of the samples that can be observed for class $k$ in training and testing. Training data is assumed to include only a subset of all possible combinations of the parameters. Testing data would be sampled from the remaining space of combinations. Note that we do not account for any relation or overlap between $\theta_i$ and $\theta_j$ for classes $i$ and $j$. For classification, given an input image $x$, a classifier $f$ is tasked with predicting the class output $k$. In other words, the task is to predict which of the transformations from the set $\{\mathcal{T}_1, \dots, \mathcal{T}_K\}$ was applied.
Self-Supervision from the Visual Transformation Perspective. The transformation framework is general and can be applied to any classification problem, including self-supervision. A general self-supervision task utilizing a particular transformation $\mathcal{T}$ would allow all data variation or transformations to be accounted for in the seed distribution $\mathcal{S}$, independent of $\mathcal{T}$. In fact, the different classes are simply $t_i(s)$, where $t_i$ is a particular instantiation of the transformation $\mathcal{T}$. For instance, self-supervision based on rotation would be modelled with $\mathcal{T}$ being the in-plane rotation transformation and $t_i$ being a particular instantiation of it, e.g. $90°$ clockwise. Note that for a self-supervision task under this framework, all image data is modelled as seed vectors in the distribution $\mathcal{S}$, with $t_i$ being a particular instantiation of a transformation, including the identity transformation $t_0$.
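The framing above can be sketched in a few lines of code. This is a minimal illustration under our assumptions (the helper name `make_vtss_pairs` and the particular instantiations are hypothetical, not from any released implementation): each self-supervision "class" is one instantiation $t_i$ applied to a seed image, with index 0 reserved for the identity $t_0$.

```python
import numpy as np

# Illustrative sketch of VTSS class construction: each "class" is one
# instantiation t_i of a transformation T applied to a seed image.
def make_vtss_pairs(images, instantiations):
    """Pair every image with every instantiation; the label is the index
    of the instantiation applied (index 0 = identity t_0)."""
    pairs = []
    for img in images:
        for label, t in enumerate(instantiations):
            pairs.append((t(img), label))
    return pairs

# Example: in-plane rotation with four instantiations, t_0 = identity.
rotations = [
    lambda x: x,                 # 0 degrees (identity)
    lambda x: np.rot90(x, k=1),  # 90 degrees
    lambda x: np.rot90(x, k=2),  # 180 degrees
    lambda x: np.rot90(x, k=3),  # 270 degrees
]

seeds = [np.arange(16).reshape(4, 4)]
pairs = make_vtss_pairs(seeds, rotations)
assert len(pairs) == 4 and pairs[0][1] == 0
```

The network trained on such pairs never sees the original class labels; the instantiation index itself is the supervision signal.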
4 A Hypothesis on Visual Transformation based Self-Supervision
For our purpose, we define the usefulness of a representation as the semi-supervised classification test accuracy $U$ of a downstream classifier using that representation on a classification task of higher abstraction. A classification task can loosely be termed to be at a higher abstraction level if it uses a more complicated transformation set than the one used for the self-supervision task.
The VTSS Hypothesis: Let $\mathcal{T} = \{t_0, t_1, \dots, t_n\}$ be a set of transformations acting on a vector space, with $t_0$ being the identity transformation and the $t_i$ pairwise distinct. Let $\mathcal{S}$ be the set of all seed vectors $s$. Finally, we simulate a dataset $\mathcal{D}$ with a set of pre-existing transformations $\mathcal{T}_{\mathcal{D}}$ by letting $\mathcal{D} = \{t(s) \mid t \in \mathcal{T}_{\mathcal{D}},\, s \in \mathcal{S}\}$. Now, for a usefulness measure $U$ of a representation that is trained using a transformation-based self-supervision task which predicts instantiations of $\mathcal{T}$, the VTSS hypothesis predicts
$$U \;\propto\; \frac{1}{\left|\mathcal{T} \cap \mathcal{T}_{\mathcal{D}}\right|}. \qquad (1)$$
In other words, consider a dataset of images and a visual transformation based self-supervision (VTSS) task that predicts instantiations of $\mathcal{T}$. If the dataset already contains a lot of variation in its samples due to any of the transformations in $\mathcal{T}$, then the VTSS hypothesis predicts that the features learnt on that dataset using the VTSS task corresponding to $\mathcal{T}$ will not be useful, or their usefulness will be diminished. In the hypothesis statement, the transformation set $\mathcal{T}_{\mathcal{D}}$ is the set of all possible transformations and variations that exist in the dataset $\mathcal{D}$. Thus, if $\mathcal{T}$ and $\mathcal{T}_{\mathcal{D}}$ have a lot of transformations in common, $U$ decreases. In other words, $U$ is inversely proportional to the number of transformations common between $\mathcal{T}$ and $\mathcal{T}_{\mathcal{D}}$. It is important to note, however, that every instantiation of a transformation is considered different. Therefore, a clockwise rotation by one angle is a different transformation from a clockwise rotation by another angle. Each instantiation can be used as a prediction target while constructing the corresponding VTSS task.
Approaching RotNet from a new perspective. The hypothesis deters the use of transformations for VTSS tasks which are already present in the data. This might discourage us from utilizing in-plane rotations as a VTSS task, since small yet appreciable amounts of in-plane rotation exist in most real-world datasets. However, we must recall that each instantiation of the transformation is considered different. Hence, if we consider rotation angles large enough, such as the $90°$ multiples used by RotNet, they are unlikely to exist in the dataset. Thus, the VTSS hypothesis predicts that in-plane rotations would be an effective VTSS task provided the range of rotation is large enough, and this has indeed been the observation.
Identifying Effective Transformations for VTSS. Following this train of thought, it is natural to ask: what other transformations can be used for VTSS? Translation and scale are two relatively simple transformations that are easy to apply (especially translation). It would, however, be a fair observation that both are in fact the most common transformations of variation in real-world visual data, which according to our hypothesis would result in an ineffective learning task. However, owing to manual and automated labelling efforts, there are many datasets in which the visual concept of interest is fairly centered in the image across all samples. This creates an opportunity for the direct use of translation as a computationally inexpensive transformation for self-supervision, while being consistent with the VTSS hypothesis. Scale variation can be controlled and accounted for as well. Nonetheless, in many datasets, even when the object is localized, there is relatively more scale variation than translation jitter. As part of this study, we propose the use of both translation and scale as VTSS tasks for use whenever the conditions are favorable.
Predicting Trends in Relative Performance of VTSS on Datasets.
Currently, a barrage of self-supervision tasks is applied to a particular dataset as part of a trial-and-error process towards obtaining a desirable level of performance. There seems to exist no heuristic to predict even a trend of effectiveness. The VTSS hypothesis provides an initial heuristic to predict trends in relative performance on any given dataset, given some properties of the dataset. There is at times the possibility of estimating how much variation due to a particular transformation might exist in a given dataset. This might be possible due to control or knowledge of the data collection process, coarse estimation through techniques like PCA, or more sophisticated techniques such as disentanglement through an information bottleneck. In such cases, the VTSS hypothesis can assist in rejecting particular VTSS tasks and prioritizing others. For instance, a rotation based VTSS technique is predicted not to be beneficial on a dataset such as Rotated MNIST. Indeed, in our experiments, we observe cases where rotation based self-supervision fails completely.
Understanding the VTSS hypothesis. We discussed a few ways the VTSS hypothesis could be useful. We now provide a qualitative explanation for the same. Consider two samples $x_1$ and $x_2$ belonging to a dataset $\mathcal{D}$. Let there be a network $f$ which will be trained for a VTSS task utilizing the transformation set $\mathcal{T} = \{t_0, t_1, \dots, t_n\}$, where $t_0$ is the identity transformation. Therefore, $f$ learns to predict one out of $n+1$ outputs. Specifically, for any $x \in \mathcal{D}$, given $t_i(x)$ as input (where $t_i$ could be the identity), $f$ would need to predict the correct instantiation $i$ of $\mathcal{T}$, including the identity. Now, the VTSS hypothesis predicts that as long as there exists no $t_i \in \mathcal{T}$ s.t. $x_2 = t_i(x_1)$ or $x_1 = t_i(x_2)$, the VTSS task will learn useful features. In other words, as long as there exists no transformation instantiation in $\mathcal{T}$ through which $x_1$ and $x_2$ can be related to one another, a useful feature will be learned. To see why, assume there exists $t_j \in \mathcal{T}$ s.t. $x_2 = t_j(x_1)$. Under this assumption, the output of $f$ on $t_j(x_1)$ should be $j$, i.e. $f(t_j(x_1)) = j$. However, we also have $f(x_2) = 0$, i.e. predicting the identity class, since $x_2 \in \mathcal{D}$. Notice that a conflict arises between these two equations: 1) $f(t_j(x_1)) = j$ and 2) $f(t_j(x_1)) = f(x_2) = 0$. Therefore, for the same input, the network is expected to output two separate classes. We term this phenomenon a transformation conflict for this paper (see Fig. 2), and we observe it in our experiments. Over the course of many iterations, this condition will lead to noisy filters being learned. This is because, in practice, there exist small differences between $t_j(x_1)$ and $x_2$. The network will be forced to amplify such differences while trying to minimize the loss, leading to noise being learned as features.
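The conflict described above can be demonstrated numerically. The following is an illustrative sketch (the variable names are ours, and the $90°$ instantiation is chosen only as an example): when the dataset already contains $x_2 = t_j(x_1)$, the VTSS task assigns two different target labels to the same pixel array.

```python
import numpy as np

# Transformation conflict demo: if the data already contains x2 = t(x1)
# for some instantiation t in the VTSS set, identical network inputs
# receive contradictory targets.
x1 = np.arange(16).reshape(4, 4)
x2 = np.rot90(x1)          # pre-existing 90-degree variation in the data

# VTSS rotation labels: 0 = identity, 1 = 90 degrees, ...
# From x1, rotating by 90 degrees produces an input with label 1.
inp_a, label_a = np.rot90(x1), 1
# From x2, applying the identity produces an input with label 0.
inp_b, label_b = x2, 0

# Same input array, two different targets: a transformation conflict.
assert np.array_equal(inp_a, inp_b)
assert label_a != label_b
```

Minimizing a classification loss on such pairs forces the network to amplify incidental pixel-level differences, which is exactly the noisy-feature effect discussed above.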
[Table 1 row labels, recovered from extraction residue: rotation instantiations added (4 classes); translation instantiations added, C, U, R / C, U, D, L / C, U, D, L, R (5 classes); fully-supervised baselines FS 3 blocks and FS 4 blocks (10 classes).]
5 Experimental Validation
Our goal through an extensive experimental validation is threefold. 1) To confirm (or find evidence against) the VTSS hypothesis for VTSS tasks based on rotation and translation. 2) To explore the efficacy of solving VTSS tasks with individual transformations and their additive combinations. 3) To perform ablation studies on VTSS tasks based on rotation, translation and scale to help gain insights into their effects on semi-supervised performance. For these experiments, we utilize the CIFAR 10, CIFAR 100, FMNIST and SVHN datasets (we provide additional experiments including more datasets in the supplementary). Our effort is not to maximize any individual performance metric or achieve state-of-the-art performance on any task, but rather to discover overall trends in behavior leading to deeper insights into the phenomenon of self-supervision through visual transformations.
General Experimental Protocol. For each transformation and dataset, the overall experimental setup remained unchanged. We follow the training protocol introduced in the RotNet study, where a 4 convolution block backbone network is first trained with a VTSS task based on some transformation (rotation, translation and/or scale). This network is tasked with predicting specific transformation instances following the self-supervision protocol. After training, the network weights are frozen and the feature representations from the second convolution block are utilized to train a downstream convolutional classifier, which is tasked with the usual supervised classification task, i.e. predicting class labels for a particular dataset. Our choice of exploring the performance trends of the second block is informed by the original RotNet study, where the second conv block exhibited maximum performance for CIFAR 10. However, since our focus is on discovering overall trends rather than maximizing individual performance numbers, this choice is inconsequential for our study. Thus, the overall learning setting is semi-supervised learning, since part of the pipeline utilized the frozen self-supervised weights. The final semi-supervised test accuracies reported on the test data of each dataset utilized this semi-supervised pipeline.
The network architecture for all experiments consists of four convolutional blocks, followed by global average pooling and a fully connected layer. Each convolutional block is a stack of three convolution layers, each of which is followed by batch normalization and a ReLU. Finally, there is a pooling layer between consecutive blocks (more details are provided in the supplementary).
EXP 1: Confirming the VTSS Hypothesis
Goal: Recall that the VTSS hypothesis predicts that a VTSS task would learn useful features using a particular transformation $\mathcal{T}$ only when the predicted instantiations of $\mathcal{T}$ do not already exist in the data. In this experiment, we test this hypothesis for the VTSS tasks based on rotation and translation. The overall approach for this experiment is to break the assumption of the VTSS hypothesis that instantiations from $\mathcal{T}$ are not present in the original data $\mathcal{D}$. We do this by introducing an increasing number of elements from $\mathcal{T}$ into the original data itself, independent of the fact that the VTSS task would additionally apply and predict instantiations of $\mathcal{T}$ to learn a useful representation. This artificially increases $|\mathcal{T} \cap \mathcal{T}_{\mathcal{D}}|$. Checking whether $U$ (the semi-supervised performance of the learned representations) varies inversely allows one to confirm (or reject) the VTSS hypothesis.
Experimental Setup: We explore three VTSS tasks based on rotation, translation and scale respectively. The prediction ranges of these transformations are as follows (we provide more details in the supplementary):
1) VTSS Rotation: Image rotations by one of four angles ($0°$, $90°$, $180°$, $270°$), leading to 4-way classification. The input image was rotated by one of the four angles. The VTSS task was to predict the correct rotation angle applied.
2) VTSS Translation: Image translations by 5 pixels in the directions up, down, left, right, or no translation (center), leading to a 5-way classification task. From the original image, a center crop with a 5 pixel margin was taken, which was then considered to be the 'no translation' input (center crop). Translations by 5 pixels were applied to this center patch in one of the directions up, down, left and right. The VTSS task was to predict in which direction the image was translated. The 5 pixel margin allows for a 5 pixel translation with no artifacts.
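The margin-and-shift construction above can be sketched as follows. This is a hedged illustration (the helper name `translate_crop` and the label-to-offset mapping are ours), showing why a 5 pixel border permits a 5 pixel shift with no padding artifacts:

```python
import numpy as np

MARGIN = 5
# Hypothetical label-to-offset map: 0 = center (no translation),
# 1 = up, 2 = down, 3 = left, 4 = right (row/col offsets in pixels).
DIRS = {0: (0, 0), 1: (-MARGIN, 0), 2: (MARGIN, 0),
        3: (0, -MARGIN), 4: (0, MARGIN)}

def translate_crop(img, label):
    """Crop with a MARGIN border, shifted by the labelled direction,
    so a MARGIN-pixel translation needs no out-of-image padding."""
    dy, dx = DIRS[label]
    h, w = img.shape[:2]
    return img[MARGIN + dy : h - MARGIN + dy, MARGIN + dx : w - MARGIN + dx]

img = np.arange(32 * 32).reshape(32, 32)
center = translate_crop(img, 0)   # the 'no translation' input
up = translate_crop(img, 1)       # 5-pixel shift, same crop size
assert center.shape == (22, 22) and up.shape == (22, 22)
assert not np.array_equal(center, up)
```

All five crops have identical size, so the network can only rely on content shift, not on shape cues, to predict the direction.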
For each transformation $\mathcal{T}$, more instantiations of $\mathcal{T}$ were sequentially added into the original data, independent of the corresponding VTSS task. Therefore, for each image there are in fact two separate stages where a transformation is added: a) the ablation study itself, and b) the VTSS task, independent of the ablation study. We now explain the protocol in detail for rotation, which has 4 runs (experiments). Run 1) Baseline. The original data contains no rotations added in. This is used for the standard RotNet VTSS task of predicting a 4-way task after rotating the image by one of the four rotations. This model is evaluated for semi-supervision accuracy and is set as the baseline. Run 2) Next, the same procedure is followed; however, the original data that is sent to the VTSS task already contains all images at the first two rotations. It is crucial to note, however, that the VTSS task of applying one of the four rotations and then predicting the rotation remains unchanged. The VTSS task then transforms and predicts based on the original images and the rotated copies identically. Run 3) The same procedure (VTSS task followed by semi-supervision evaluation) is followed with all images rotated at each of the first three rotations. Run 4) Finally, yet another run adds all four rotations to all images of the original data. This protocol is followed similarly for translation (predicting 5-way among no translation, up, down, left and right), where the particular transformations that were measured by the VTSS task were added into the original data sequentially. Further details are provided in the supplementary. The performance metric considered for each transformation is the semi-supervised accuracy (obtained using the protocol explained in the general experimental settings) on the CIFAR 10 and SVHN test sets.
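The sequential injection of stage (a) can be sketched in a few lines. This is an illustrative sketch only (the function name `inject_rotations` is ours): before the VTSS task touches the data, rotated copies of every image are added to the pool.

```python
import numpy as np

def inject_rotations(images, ks):
    """Return the dataset augmented with np.rot90 copies for each
    quarter-turn count k in ks (stage (a) of the ablation)."""
    out = list(images)
    for k in ks:
        out += [np.rot90(x, k=k) for x in images]
    return out

base = [np.arange(16).reshape(4, 4)]
run1 = inject_rotations(base, [])         # Run 1: baseline, nothing added
run4 = inject_rotations(base, [1, 2, 3])  # Run 4: all four rotations present
assert len(run1) == 1 and len(run4) == 4
```

The VTSS task itself (stage (b)) is then applied unchanged to whichever pool a given run produces.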
Results and Discussion. Table 1 showcases the results of these experiments. The left column indicates which transformation instantiations (for rotation and translation respectively) were added into the original data as part of this ablation study. The semi-supervised accuracy indicates the performance of the learned features on the downstream classification task (following the protocol introduced above). The number in brackets is the self-supervised accuracy, which indicates the test accuracy on the VTSS task itself. Higher accuracy indicates the model is able to distinguish between the transformations added in. We make a few observations.
Observation 1: We find that VTSS Rotation performs well when there are no rotations already present in the data (both for CIFAR 10 and SVHN). However, the method completely breaks down for both datasets when all rotations are present. This indicates that the model has learnt noisy features.
Observation 2: Notice that the self-supervision classification accuracy for all transformations steadily decreased as more rotations were added into the original data. This is in alignment with Eq. 1. Indeed, as more ablation transformations are added to the original data, it becomes difficult for the VTSS task to learn useful features due to the transformation conflict effect. Observations 1 and 2 together confirm the VTSS hypothesis.
Pre-existing Transformations in SVHN and CIFAR: For the next observation, we take a look at the SVHN and CIFAR 10 datasets, illustrated with a few samples in Fig. 3(b) and Fig. 3(a). For SVHN, we find that there exists considerable scale variation and blur within each digit class. This blur also acts as scale variation, as it simulates the process whereby a small low resolution object is scaled up, leading to blur. However, note that since the dataset was created by extending each digit bounding box in the appropriate directions to form a square, each digit of interest is almost exactly centered. Thus, there pre-exists very little translation in the dataset. Coupled with the fact that digits have less variation than general objects, the visual concepts of interest are more centered. CIFAR 10, on the other hand, has more complicated objects, also with some scale variation already present in the vanilla dataset. The complex nature of the visual classes results in relatively more translation jitter of the visual concepts of interest than in SVHN. Lastly, both datasets have some rotation variation, though not as extreme as $90°$ or beyond.
Observation 3: We observe that VTSS Translation performs closer to the fully supervised performance for SVHN compared to CIFAR 10. Keeping in mind that SVHN has relatively less translation than CIFAR 10, this is consistent with and supports the VTSS hypothesis.
Table 2 (excerpt): fully-supervised (FS) baselines, one accuracy per dataset column in the order used in Table 2:
FS 3 blocks: 89.96, 91.43, 77.37, 64.93
FS 4 blocks: 90.26, 92.50, 92.21, 65.95
FS 3 blocks (full): 91.35, 80.83, 92.65, 52.60
FS 4 blocks (full): 91.66, 90.57, 94.62, 67.95
EXP 2: Exploring VTSS Tasks with Multiple Transformations Simultaneously
Goal: Though self-supervision with images has shown considerable promise as an unsupervised technique, it is still a fairly recent paradigm. Typically, self-supervision using visual transformations has been applied with a single transformation type, for instance exclusively rotations for RotNet. Given that in this study we have demonstrated the existence of a VTSS technique for translation and scale as well, it is natural to ask: how does the performance differ when using multiple transformations in conjunction? We explore answers and also observe phenomena that the VTSS hypothesis predicts.
Experimental Setup: For the datasets CIFAR 10, CIFAR 100, FMNIST and SVHN, we train the standard backbone network with 4 convolution blocks on the VTSS tasks of Rotation ($0°$, $90°$, $180°$, $270°$), Translation (up, down, left, right, or no translation (center), with a shift of 5 pixels) and Scale (0 pix, 2 pix zoom, 4 pix zoom). However, VTSS Translation needs a small margin (to translate without artifacts). We apply this crop margin (of 5 pixels) to all data for all tasks. Therefore, the default image for all tasks is the 5 pixel margin center crop of the original image. For VTSS Scale, this crop was designated to be the 0 pix zoom. A 2 pix zoom (or 4 pix zoom) would perform yet another crop with a 2 pix (or 4 pix) margin on each side before resizing the image back to the center crop size. We combine two or more transformations in an additive fashion. For instance, if VTSS Rotation predicts 4 classes and VTSS Translation predicts 5 classes, the task VTSS Rotation + Translation will predict 4+5-1 = 8 classes overall (where we combine the identity classes of all transformations into a single class). We perform experiments with all 4 combinations of the three transformations. Additionally, we run each transformation individually to serve as a baseline under the center crop setting. All individual and combination transformations were run in the center crop setting to be consistent in data size with the VTSS Translation task in the combination experiments. However, in practice, when VTSS Scale and Rotation are applied independently, the entire image would be used and not just the center crop. Thus, we provide additional results with just the individual transformations of VTSS Rotation and Scale on the full sized crop of side 32 (without any center cropping). The corresponding fully-supervised results with the full crop are also provided.
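The additive label arithmetic above can be made concrete with a short sketch. This is an illustration under our assumptions (the function `combined_label` is hypothetical): rotation contributes 4 classes, translation 5, and the identity classes of all transformations are merged into one shared class, giving 4+5-1 = 8 targets.

```python
N_ROT, N_TRANS = 4, 5  # classes of the individual VTSS tasks

def combined_label(kind, idx):
    """kind: 'rot' or 'trans'; idx 0 is that transformation's identity.
    All identities share combined class 0."""
    if idx == 0:
        return 0                      # shared identity class
    if kind == 'rot':
        return idx                    # rotation classes 1..3
    return N_ROT - 1 + idx            # translation classes 4..7

n_classes = N_ROT + N_TRANS - 1       # 4 + 5 - 1 = 8
labels = {combined_label('rot', i) for i in range(N_ROT)}
labels |= {combined_label('trans', i) for i in range(N_TRANS)}
assert n_classes == 8 and labels == set(range(8))
```

Adding a third transformation (scale, 3 classes) under the same merging rule would yield 4+5+3-2 = 10 combined classes.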
Pre-existing Transformations in Fashion-MNIST: A few sample images from the Fashion-MNIST dataset are shown in Fig. 4. One notices immediately that the dataset contains little to no translation jitter, no rotation and, importantly, very little scale variation as compared to CIFAR and SVHN (see the corresponding figure in the main paper). The VTSS hypothesis would therefore predict that VTSS Scale would be effective; FMNIST is hence a good dataset on which to prove the effectiveness of the VTSS Scale task. The dataset contains 60,000 training images and 10,000 testing images. Each image is sized 28×28, which for our experiments was rescaled to 32×32. This does not affect the overall trends in our experiments, since all images were resized equally.
Results: Individual Transformations. The results of these experiments are presented in Table 2. We find that VTSS Rotation performs consistently high overall. However, given that SVHN has less translation (see the discussion on pre-existing transformations in datasets), VTSS Translation performs better on SVHN than on CIFAR 10 and 100. This is indeed consistent with the VTSS hypothesis. Note that scale performs worse on both the CIFAR datasets and SVHN. Recalling the prior discussion regarding the presence of scale variation in both CIFARs and SVHN, this result is consistent with the VTSS hypothesis. In fact, due to the presence of more blur, which acts as scale variation, VTSS Scale works worse on SVHN than on CIFAR 10, which has no common blurry artifacts. This observation as well is consistent with the VTSS hypothesis. Interestingly, given the observation that FMNIST has considerably less scale variation than CIFAR and SVHN, the VTSS hypothesis predicts that VTSS Scale would perform better on FMNIST than on CIFAR and SVHN. Indeed, this is what we observe. Both the full crop and the center crop VTSS Scale performance on FMNIST are significantly higher than on CIFAR and SVHN. This provides further evidence towards the confirmation of the VTSS hypothesis.
Results: Combinations of Transformations. We find that the combination VTSS R + T works better than isolated VTSS Rotation for all four datasets. This is also true for VTSS R+T+S for all datasets except FMNIST. This is the first evidence that utilizing multiple transformations simultaneously as a single VTSS task can provide performance gains over any individual transformation. Given that it is computationally inexpensive to train under such a setting, this result is encouraging. Notice also that even though VTSS Scale performs worse on SVHN than on CIFAR, the combination VTSS S+T performs better on CIFAR. Nonetheless, due to the inherent presence of scale in CIFAR and SVHN, the VTSS hypothesis predicts that VTSS tasks involving scale would suffer in performance, whereas scale achieves more success on FMNIST due to the absence of inherent scale variation (a hypothesis prediction). This is something we do observe in Table 2. From these experiments, we conclude that there is benefit in combining VTSS tasks for different transformations; however, it must be done while being aware of what transformations or factors of variation already exist in the data. Indeed, VTSS tasks using some combination of transformations consistently outperformed all individual transformations for all four datasets.
Overall Observation: Effectiveness of VTSS Rotation
Our results indicate that VTSS Rotation consistently performs better than translation and scale when applied individually. The VTSS hypothesis begins to offer a probable justification as to why. The degrees to which translation and scale are applied as part of the VTSS task to learn these features (which were effective nonetheless) are small. We found that larger variation typically reduced performance given a fixed sized image (see Fig. 5(b) and Fig. 5(c)). These transformations are likely to exist in subtle amounts at similar ranges to those applied as part of the VTSS task. This leads to the detrimental effect of transformation conflict (see Fig. 2). On the other hand, VTSS Rotation was found to work well at large ranges (multiples of $90°$) in the original study. Nonetheless, rotation seldom occurs naturally at such large ranges in most datasets (including real-world ones). Our VTSS hypothesis therefore predicts that rotation is a particularly well-suited transformation for VTSS on general visual data.
EXP 3: Ablation Study: Effect of Transformation Range. We trained a backbone feature extractor network (RotNet) with the VTSS rotation task for different sets of rotations on CIFAR 10 (we provide additional ablation studies in the supplementary). No independent rotations were added (as in our VTSS hypothesis confirmation study) other than the rotations applied by the VTSS task itself. We also perform this experiment similarly for the VTSS translation and scale tasks. In this case, we steadily increase the number of directions in which the translation is applied and correspondingly increase the number of classes for prediction. Lastly, the pixel range of the translation was also varied. For VTSS Scale, the number of scales and the range were varied.
Results. The results of this experiment are presented in Fig. 5(a), Fig. 5(b) and Fig. 5(c). We find that though predicting more rotation angles helps, the improvement is marginal. Further, predicting even a single instantiation of the transformation resulted in the learning of useful representations. These results are consistent with those reported in the original study. In the case of translation, however, we find a significant initial increase, after which there are diminishing returns. It is interesting to note that VTSS translation learnt useful features even with a single pixel shift (on each side) (while using traditional data augmentation). In fact, the 5 pixel shift performs only marginally better. However, an 8 pixel shift (therefore a total crop of just 32-16=16) is excessive for a 32 sized dataset and drastically decreases performance. For VTSS Scale, it was difficult to find an overall trend. Nonetheless, we find that the representations learned are in general poor. This, we hypothesize, is largely due to the scale variation already existing in CIFAR. However, it is interesting to note that a scale variation of even 1 and 2 pixels can be useful for representation learning.
-  Unaiza Ahsan, Rishi Madhok, and Irfan Essa. Video jigsaw: Unsupervised learning of spatiotemporal context for video action recognition. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 179–189. IEEE, 2019.
-  Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-vae. arXiv preprint arXiv:1804.03599, 2018.
-  Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 132–149, 2018.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
-  Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015.
-  Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
-  Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in neural information processing systems, pages 766–774, 2014.
-  Frederik Ebert, Sudeep Dasari, Alex X Lee, Sergey Levine, and Chelsea Finn. Robustness via retrying: Closed-loop robotic manipulation with self-supervised learning. arXiv preprint arXiv:1810.03043, 2018.
-  Zeyu Feng, Chang Xu, and Dacheng Tao. Self-supervised representation learning by rotation feature decoupling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10364–10374, 2019.
-  Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
-  Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
-  Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and Dawn Song. Using self-supervised learning can improve model robustness and uncertainty. arXiv preprint arXiv:1906.12340, 2019.
-  Eric Jang, Coline Devin, Vincent Vanhoucke, and Sergey Levine. Grasp2vec: Learning object representations from self-supervised grasping. arXiv preprint arXiv:1811.06964, 2018.
-  Longlong Jing and Yingli Tian. Self-supervised spatiotemporal feature learning by video geometric transformations. arXiv preprint arXiv:1811.11387, 2018.
-  Longlong Jing and Yingli Tian. Self-supervised visual feature learning with deep neural networks: A survey. arXiv preprint arXiv:1902.06162, 2019.
-  Dahun Kim, Donghyeon Cho, Donggeun Yoo, and In So Kweon. Learning image representations by completing damaged jigsaw puzzles. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 793–802. IEEE, 2018.
-  Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. arXiv preprint arXiv:1901.09005, 2019.
-  Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Learning representations for automatic colorization. In European Conference on Computer Vision, pages 577–593. Springer, 2016.
-  Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE International Conference on Computer Vision, pages 667–676, 2017.
-  Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, pages 527–544. Springer, 2016.
-  Adithyavairavan Murali, Lerrel Pinto, Dhiraj Gandhi, and Abhinav Gupta. Cassl: Curriculum accelerated self-supervised learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6453–6460. IEEE, 2018.
-  Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Combining self-supervised learning and imitation for vision-based rope manipulation. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2146–2153. IEEE, 2017.
-  Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
-  Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016.
-  Mehdi Noroozi, Hamed Pirsiavash, and Paolo Favaro. Representation learning by learning to count. In Proceedings of the IEEE International Conference on Computer Vision, pages 5898–5906, 2017.
-  Mehdi Noroozi, Ananth Vinjimoor, Paolo Favaro, and Hamed Pirsiavash. Boosting self-supervised learning via knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9359–9367, 2018.
-  Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
-  Deepak Pathak, Dhiraj Gandhi, and Abhinav Gupta. Self-supervised exploration via disagreement. arXiv preprint arXiv:1906.04161, 2019.
-  Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2701–2710, 2017.
-  Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.
-  Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google Brain. Time-contrastive networks: Self-supervised learning from video. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1134–1141. IEEE, 2018.
-  Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
-  Trieu H Trinh, Minh-Thang Luong, and Quoc V Le. Selfie: Self-supervised pretraining for image embedding. arXiv preprint arXiv:1906.02940, 2019.
-  Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM, 2008.
-  Chen Wei, Lingxi Xie, Xutong Ren, Yingda Xia, Chi Su, Jiaying Liu, Qi Tian, and Alan L Yuille. Iterative reorganization with weak spatial constraints: Solving arbitrary jigsaw puzzles for unsupervised representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1910–1919, 2019.
-  Donglai Wei, Joseph J Lim, Andrew Zisserman, and William T Freeman. Learning and using the arrow of time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8052–8060, 2018.
-  Jiawei Wu, Xin Wang, and William Yang Wang. Self-supervised dialogue learning. CoRR, abs/1907.00448, 2019.
-  Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, et al. The visual task adaptation benchmark. arXiv preprint arXiv:1910.04867, 2019.
-  Liheng Zhang, Guo-Jun Qi, Liqiang Wang, and Jiebo Luo. Aet vs. aed: Unsupervised representation learning by auto-encoding transformations rather than data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2547–2555, 2019.
-  Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European conference on computer vision, pages 649–666. Springer, 2016.
-  Richard Zhang, Phillip Isola, and Alexei A Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1058–1067, 2017.
6 Appendix: Additional Ablation Studies, Details and Observations
For each transformation and dataset, the evaluation protocol and hyperparameters remained constant. The self-supervised backbone network and the fully-supervised network were both trained on the training set using all training samples (unless specified) for 200 epochs. The learning rate was set at 0.1 and multiplied by 0.02 at epochs 60, 120 and 180, using SGD with a batch size of 128, momentum of 0.9 and weight decay of . For every transformation to be added in for a particular VTSS task, e.g. rotation, the entire batch was transformed by that augmentation and appended to the batch. For the VTSS hypothesis confirmation studies, however, the transformations were instead applied within the original 128-sized batch as an ablation study: if there were multiple instantiations of the transformations to be added, the 128-sized batch was divided into as many shards as there were instantiations, and each shard was transformed by one instantiation. Once the self-supervised network was trained, the weights up to the second conv block were frozen and a classifier was added on top.
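The shard-based batch construction used in the confirmation studies can be sketched as follows. This is a minimal numpy sketch under our own naming; the rotation instantiations serve only as an example transformation set:

```python
import numpy as np

def shard_and_transform(batch, transforms):
    """Split the batch into len(transforms) shards, apply one transformation
    instantiation to each shard, and record per-sample instantiation labels.
    The total batch size is preserved (no samples are appended)."""
    shards = np.array_split(batch, len(transforms))
    out = np.concatenate([t(s) for t, s in zip(transforms, shards)])
    labels = np.concatenate([np.full(len(s), i) for i, s in enumerate(shards)])
    return out, labels

# Example: four rotation instantiations over a 128-sized batch
batch = np.random.rand(128, 32, 32, 3)
rots = [lambda x, k=k: np.rot90(x, k, axes=(1, 2)) for k in range(4)]
x, y = shard_and_transform(batch, rots)
```

Note the contrast with the default protocol, where the entire batch is transformed and appended (growing the batch); here the batch stays at 128 samples, with 32 per instantiation.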
Further Details on the Architecture.
All networks consisted of units we refer to as conv blocks. Each conv block consisted of 3 conv layers with 192 channels, each followed by batch normalization and ReLU. The fully-supervised and self-supervised backbone networks consisted of 4 conv blocks (unless specified otherwise), with average pooling of kernel size 3, stride 2 and padding 1 after each block. The semi-supervised classifier was trained on conv block 2 features after training the self-supervision model; it consisted of a single conv block with 192 channels and global average pooling, followed by a single linear layer.
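A minimal PyTorch sketch of this backbone follows. The conv kernel size and padding are assumptions (the text does not specify them), and the class and function names are ours:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, ch=192):
    """One conv block: 3 conv layers of 192 channels, each followed by
    batch normalization and ReLU. Kernel size 3 / padding 1 is an assumption."""
    layers, c = [], in_ch
    for _ in range(3):
        layers += [nn.Conv2d(c, ch, kernel_size=3, padding=1),
                   nn.BatchNorm2d(ch), nn.ReLU(inplace=True)]
        c = ch
    return nn.Sequential(*layers)

class VTSSBackbone(nn.Module):
    """4 conv blocks, each followed by average pooling (k=3, s=2, p=1),
    with a linear head predicting the transformation class."""
    def __init__(self, in_ch=3, num_classes=4):
        super().__init__()
        blocks, c = [], in_ch
        for _ in range(4):
            blocks += [conv_block(c), nn.AvgPool2d(3, stride=2, padding=1)]
            c = 192
        self.features = nn.Sequential(*blocks)
        self.head = nn.Linear(192, num_classes)

    def forward(self, x):
        h = self.features(x)    # (N, 192, H/16, W/16)
        h = h.mean(dim=(2, 3))  # global average pooling
        return self.head(h)

# Example: a 32x32 RGB batch produces one logit per transformation class
logits = VTSSBackbone()(torch.randn(2, 3, 32, 32))
```

For the downstream semi-supervised classifier, the first two entries of `self.features` would be frozen and a single fresh conv block plus linear layer trained on top, per the protocol above.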
Ablation B: Effect of number of self-supervised and supervised samples. Typically, self-supervision tasks are trained on as much data as possible, primarily due to the availability of inexpensive self-labels. Here, however, we explore the case where there is an imbalance of data between the self-supervision and semi-supervision tasks. We vary the number of samples available per class for both the VTSS Rotation task and the downstream semi-supervision task.
Ablation B: Results. Fig. 6 showcases the results of this experiment. Interestingly, we find that performance increases linearly at almost identical rates for both the self-supervision and the downstream semi-supervision tasks. For instance, the performance with 1000 samples/class for VTSS and just 20 samples/class for semi-supervised learning is very similar to that with 20 samples/class for VTSS and 1000 samples/class for semi-supervised learning. We find similar trends for other settings. This highlights a benefit of VTSS tasks: for a given amount of data with an inexpensive self-labelling scheme, VTSS provides the downstream classifier a level of performance similar to that of a semi-supervised model trained with the same amount of labeled data. Such linear parallels between VTSS and downstream semi-supervision are encouraging.
Ablation C: Effect of number of classes used for self-supervision. VTSS tasks are typically applied to a wide array of data. However, what is the level of returns that a VTSS task provides given a steady increase in both the diversity and the amount of data? For this experiment, we train the same 4-conv-block network with VTSS Rotation given a varying number of classes. We are interested in the trend of downstream semi-supervised accuracy as the number of classes increases.
Ablation C: Results. The results of this experiment are presented in Fig. 7. We find that downstream semi-supervision performance increases as expected, but with diminishing returns. Indeed, even when 5000 samples/class are available for training the semi-supervised network, performance saturates with just 5 classes (5000 samples in total) for the VTSS task. Similar trends are observed when fewer samples/class are available for semi-supervised learning. This suggests that though VTSS tasks are powerful, they seem to hit a barrier in the diversity of features the model can learn. Attention must be paid to other aspects of the learning problem, such as the size of the network and its architecture, in order to enable further improvements from more self-supervised data.
Additional Observation: Effectiveness of VTSS tasks in general. Taking a step back from the detailed inter-transformation analysis, we consider the performance of self-supervision followed by semi-supervised training. Recall that the original VTSS backbone network consisted of 4 conv blocks of three conv layers each, and that only the first two conv blocks were used as a fixed feature extractor for the semi-supervised classifier trained on top, which itself consisted of a single conv block identical to the others. Therefore, in Table 1 and Table 2 in the main paper, the fully supervised networks (FS) with 3 blocks had complexity similar to the overall self- and then semi-supervised models, providing a fair comparison from the perspective of model complexity. Yet, in the case of SVHN and FMNIST, we find that VTSS Rotation performs better than FS with 3 blocks in the full-crop setting (including scale for FMNIST). For the center-crop experiments, VTSS R+T performs better than FS with 3 blocks on SVHN. This showcases the overall effectiveness of VTSS tasks compared to fully-supervised networks of similar complexity.