Multiple Pretext-Task for Self-Supervised Learning via Mixing Multiple Image Transformations

12/25/2019 · Shin'ya Yamaguchi, et al.

Self-supervised learning is one of the most promising approaches to learning representations that capture semantic features in images without any manual annotation cost. To learn useful representations, a self-supervised model solves a pretext-task, which is defined by the data itself. Among a number of pretext-tasks, the rotation prediction task (Rotation) achieves better representations for solving various target tasks despite its simple implementation. However, we found that Rotation can fail to capture semantic features related to image textures and colors. To tackle this problem, we introduce a learning technique called multiple pretext-task for self-supervised learning (MP-SSL), which solves multiple pretext-tasks, in addition to Rotation, simultaneously. In order to capture features of textures and colors, we employ image enhancement transformations (e.g., sharpening and solarizing) as the additional pretext-tasks. MP-SSL efficiently trains a model by leveraging a Frank-Wolfe based multi-task training algorithm. Our experimental results show that MP-SSL models outperform Rotation on multiple standard benchmarks and achieve state-of-the-art performance on Places-205.


1 Introduction

Convolutional neural networks (CNNs) [27, 16, 44] are widely adopted to solve many target tasks in computer vision applications such as object recognition [30], semantic segmentation [4], and object detection [42]. However, these successes depend on supervised training of CNNs with a vast amount of labeled data [43], which is expensive and impractical because of the manual annotation cost. Since the cost of labeled data limits the practical applications of CNNs, a number of studies focus on training techniques that alleviate the requirement for large amounts of labeled data; these techniques include transfer learning, semi-supervised learning, and self-supervised learning.

Among the training methods with few or no labels, self-supervised learning has received attention as one of the most promising approaches. To improve the performance of target tasks, self-supervised learning uses a pre-training task called a pretext-task, which is solved to predict surrogate supervision defined from visual information on the input images themselves. Via pretext-tasks using unlabeled data, CNN models learn high-level semantic representations in advance, which can be helpful for the target supervised task. Therefore, a pre-trained CNN model achieves higher performance on the target task even if only a small volume of labeled data is available. To capture more sophisticated high-level semantic image features, various pretext-tasks have recently been proposed for self-supervised learning, such as tasks using image patches [8, 32, 33, 34, 10] and tasks predicting differences generated by image preprocessing [49, 50, 14, 11]. However, most of these existing works require specialized implementations for the pretext-tasks (e.g., synthetic images, loss functions, and network architectures); these specializations are not always easy to implement or efficient to compute.

Amid this progress in self-supervised learning, Gidaris et al. [14] proposed the rotation prediction task (Rotation), which can boost target task performance despite its simplicity. In Rotation, a model is trained to classify discrete labels representing the rotation degree of input images (e.g., 0, 90, 180, and 270 degrees). Rotation captures the semantic information of object shapes that is useful for target tasks, as in previous works, but its implementation is easier because no specialized components are needed. However, Rotation can fail to capture other semantic information such as object textures and colors. For recognizing images, the information of object textures plays a crucial role [12, 3, 13], and learning color information also contributes to target model performance [49, 25, 50]. For instance, when we apply Rotation to the Describable Texture Dataset (DTD) [6], which is a classification dataset for predicting the classes of color textures (Figure 1(a)), Rotation does not improve the target classification performance much (Figure 1(b)). Therefore, Rotation models can fail to improve the performance of target tasks that need to capture the features of textures and colors.

In this paper, to overcome this limitation of Rotation, we propose a novel multiple pretext-task for self-supervised learning (MP-SSL), which combines multiple image transformations for learning representations. Our key idea is to complementarily train a model to capture not only object shapes but also the textures and colors that are useful for target tasks. To learn textures and colors, we leverage image enhancements, which are transformations aiming to modify the appearance of images (e.g., sharpening, brightening, and solarizing), in combination with Rotation. We investigate the performance of five enhancement transformations as pretext-tasks, and reveal that sharpness and solarization are the best transformations for capturing the information of textures and colors. With these transformations, we formulate the multiple pretext-task and train models via the following procedure: MP-SSL generates one transformed image by applying the multiple transformations serially, and then solves the pretext classification task for each transformation on this image. In training, we optimize each pretext-task specific parameter and the parameter shared between the pretext-tasks. Finally, we can apply the trained shared parameter to target tasks. We confirmed that our MP-SSL can achieve state-of-the-art performance on Places-205 and performance comparable to the state of the art on ImageNet.

Our contributions are summarized as follows:

  • We experimentally found that Rotation does not work well on the task of predicting color texture classes of target images, by testing on the Describable Texture Dataset (DTD), and that image enhancements can work as pretext-tasks for this task.

  • We propose a novel multiple pretext-task for self-supervised learning (MP-SSL), which combines our proposed image enhancements with Rotation.

  • We confirmed that our MP-SSL is superior or comparable to other self-supervised learning techniques including Rotation on various datasets (CIFAR, ImageNet, and Places-205) and network architectures (AlexNet, VGG, and Wide-ResNet).

2 Related Works

2.1 Self-supervised Learning

Several pretext-tasks for self-supervised learning have succeeded in learning useful representations by focusing on various visual features that appear in images. For instance, pretext-tasks utilizing image patches predict correct patch positions [8, 31] and permutations [32], count output values of image patches [33], and inpaint images with a partial square region removed [35]. Further, the pretext-tasks of [49, 50, 25] colorize gray-scaled input images by using the output of CNNs. On the other hand, Rotation [14] is one of the most promising approaches because the model can learn powerful representations for various target tasks by solving a simple classification task. Because of this simplicity, Rotation is used as a part of training systems that solve classification and image generation tasks in few-label settings [29, 5, 48]. Additionally, Feng et al. [11] presented a method enhancing Rotation by explicitly decoupling the network architecture into rotation-related features learned by solving Rotation and rotation-unrelated features learned by solving instance classification. In this paper, we aim to improve Rotation while keeping its simplicity by focusing on the semantic information of textures and colors. In contrast to the study of [11], our method can be applied without any special modifications to the network architecture used in Rotation. Thus, MP-SSL and Feng et al.'s method can easily be combined, which might achieve more helpful representations complementarily.

2.2 Multi-task Self-Supervised Learning

Multi-task learning is a learning paradigm that simultaneously solves multiple tasks while sharing partial parameters in order to obtain better performance than learning each task independently [20, 47]. In the context of self-supervised learning, several approaches have been proposed for associating multiple pretext-tasks and obtaining better representations. Doersch et al. [9] presented a method that simultaneously solves four different pretext-tasks on a specialized multi-task network that adjusts the domain gap derived from the different input spaces of each pretext-task. The approach of [36] adopts three pretext-tasks (depth, edge, and surface normal prediction) defined on synthetic 3D data, and trains a model on these pretext-tasks together with the target task by applying both real and synthetic images. The authors of [28] showed a multi-task learning technique that applies three additional pretext-tasks defined by recycling bounding box labels for the object detection task. In contrast to these prior works, our algorithm shares the input space and training scheme between all pretext-tasks, and the pretext-tasks are defined by image transformations only. Thus, applying our algorithm requires no specialized network architectures, synthesized 3D data, or labels for target tasks.

2.3 Image Enhancement

Image enhancements have been used to make the visual features related to textures and colors easier to analyze by adjusting the characteristics of image channels [40]. In the context of training CNNs, since these transformations can amplify the patterns of images while preserving their semantic features, they are adopted for data augmentation and contribute to the performance of vision tasks [7]. Moreover, the enhancements can be applied simultaneously with the Rotation transformation because they transform pixel values along the channel direction rather than the horizontal or vertical directions of an image. From these facts, we hypothesize that image enhancements can be a good partner of Rotation for obtaining high-level semantic information.

                      Conv1  Conv2  Conv3  Conv4  Conv5  Perf. Ratio
ImageNet   Supervised  19.3   36.3   44.2   48.3   50.5
           Rotation    18.8   31.7   38.7   38.2   36.5   0.847
CIFAR-10   Supervised  52.7   66.1   71.2   74.7   76.9
           Rotation    52.3   63.2   66.0   65.7   62.1   0.912
DTD        Supervised  23.4   29.0   31.9   32.0   31.3
           Rotation    19.5   22.3   19.3   18.2   17.1   0.665
Table 1: Top-1 classification accuracy with linear layers. The ImageNet rows are reprinted from [14]. For CIFAR-10, we used 45,000/5,000/10,000 images for the train/validation/test sets, respectively. Each column from Conv1 to Conv5 reports the classification accuracy of a logistic regression model on top of the feature maps of the corresponding convolutional layer of AlexNet (all convolutional layers are frozen). The Supervised rows show the results when classification models are trained to predict the class labels and the trained convolutional layers are then used for the logistic regression models. The Performance Ratio compares each Rotation row to the corresponding Supervised row.

3 Motivation

In this section, we first introduce the foundation of self-supervised learning by Rotation. Then, we experimentally show that Rotation is not so effective on Describable Texture Dataset (DTD) because DTD focuses on color textures. Finally, we confirm that our image enhancement based pretext-tasks can achieve higher performances on DTD.

3.1 State of Rotation

The pretext-task of Rotation [14] is to predict the rotation degree $y$ of an input image $\tilde{x}$ rotated from the original image $x$, i.e., $\tilde{x} = \mathrm{Rot}(x, y)$, where $y$ corresponds to a degree in $\mathcal{Y} = \{0^\circ, 90^\circ, 180^\circ, 270^\circ\}$. For training, we optimize a network by minimizing the softmax cross-entropy loss with respect to the four classes in the set of rotation degrees $\mathcal{Y}$. The objective functions for Rotation are defined as follows:

$\min_{\theta, \phi} \frac{1}{N}\sum_{n=1}^{N} \ell_{\mathrm{rot}}(x_n; \theta, \phi),$   (1)
$\ell_{\mathrm{rot}}(x; \theta, \phi) = \frac{1}{|\mathcal{Y}|}\sum_{y \in \mathcal{Y}} -\log c_{\phi}^{y}\big(f_{\theta}(\mathrm{Rot}(x, y))\big),$   (2)

where $N$ is the number of training images, $f_{\theta}$ is a feature extractor parameterized by $\theta$, and $c_{\phi}^{y}$ is the $y$-th element of a classifier for predicting rotation degrees, parameterized by $\phi$. This is quite easy to implement since we can reuse existing code modules defined for common supervised classification tasks to construct the Rotation loss functions. In the target task, we use the trained feature extractor for generating feature maps or initializing target models.
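To make this concrete, the following PyTorch-style sketch shows how Eq. (1) and (2) reduce to a standard classification loss; this is an illustration rather than the authors' code, and `backbone` and `rot_head` are hypothetical modules standing in for $f_{\theta}$ and $c_{\phi}$.

```python
import torch
import torch.nn.functional as F

def rotation_loss(backbone, rot_head, images):
    """Rotation pretext loss of Eq. (1)-(2): average the softmax
    cross-entropy over the four rotated copies of each image batch."""
    losses = []
    for k in range(4):  # labels 0..3 correspond to 0, 90, 180, 270 degrees
        rotated = torch.rot90(images, k, dims=(2, 3))            # rotate NCHW batch
        labels = torch.full((images.size(0),), k,
                            dtype=torch.long, device=images.device)
        logits = rot_head(backbone(rotated))                     # c_phi(f_theta(.))
        losses.append(F.cross_entropy(logits, labels))           # softmax cross-entropy
    return torch.stack(losses).mean()
```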

Note that, in [14], although the authors formulated the above objective functions to generalize to arbitrary geometric image transformations (e.g., rotation, scale, and aspect-ratio transformations), they pointed out that transformations other than rotation may not be appropriate for the pretext-task because they produce easily detectable visual artifacts. However, this discussion is limited to "geometric" transformations and does not cover the image enhancements used in this paper. Therefore, the performance obtained when employing image enhancements as the pretext-task in the form of Eq. (1) and (2) is nontrivial.

(a) Rotation (0, 90, 180, 270)
(b) Brightness (0.1, 0.5, 1.0, 1.5)
(c) Contrast (0.1, 0.5, 1.0, 1.5)
(d) Saturation (0.0, 0.5, 1.0, 1.5)
(e) Sharpness (0.0, 0.5, 1.0, 1.5)
(f) Solarization (0, 85, 170, 256)
Figure 2: Samples of the image enhancement transformations designed for our proposed self-supervised learning. Panels (b) to (f) show the transformed images; the left-to-right order corresponds to the degrees given as the parenthesized values in the captions. More detailed settings for the degrees appear in the supplementary materials.
Conv1 Conv2 Conv3 Conv4 Conv5 Perf. Ratio
Supervised 23.4 29.0 31.9 32.0 31.3
Rotation 19.5 22.3 19.3 18.2 17.1 0.665
Brightness 18.3 22.4 21.1 19.2 17.4 0.673
Contrast 18.0 22.4 20.2 18.8 17.4 0.665
Saturation 19.5 22.0 20.3 17.6 17.7 0.669
Sharpness 21.0 23.6 21.0 20.4 18.5 0.718
Solarization 19.2 23.3 21.5 19.3 19.4 0.705
Table 2: Top-1 classification accuracy with linear layers using feature extractors trained on each pretext-task on the DTD dataset. The column descriptions are inherited from Table 1.

3.2 Limitation of Rotation

Next, we reveal that Rotation does not work well when the images in a target dataset consist of color textures, as in DTD [6]. DTD (Fig. 1(a)) is a dataset for predicting what kind of color texture appears in an image. We tested the Rotation models on DTD using AlexNet [24]. Following [14], the degrees for predicting rotation were 0, 90, 180, and 270. More detailed settings are given in the supplementary materials.

Table 1 shows the accuracy on the color texture classification task of DTD when using feature maps generated from the feature extractors. For comparison, we reprint the ImageNet [37] accuracies reported in [14] and show our results on CIFAR-10 [23]. To evaluate the performance of Rotation, we compute a Performance Ratio score from the accuracies, which indicates the performance degradation when moving from feature maps of supervised models to those of self-supervised models. The Performance Ratio is calculated as $\frac{1}{L}\sum_{l=1}^{L} \mathrm{Acc}^{\mathrm{SSL}}_{l} / \mathrm{Acc}^{\mathrm{Sup}}_{l}$, where $L$ is the number of convolutional layers and $\mathrm{Acc}_{l}$ denotes the accuracy when using the output of the $l$-th convolutional layer as a feature map. From the results, we can see that Rotation produces a lower Performance Ratio on DTD than on ImageNet and CIFAR-10. This is because the pretext-task of Rotation can be more difficult to learn on DTD than on the other datasets, since DTD images often contain patterns that are invariant to the rotation transformation and focus on color textures (Figure 1(a)). Therefore, the performance of Rotation models can be limited by the contents of images in the target dataset, and thus we should focus on multiple kinds of semantic information, including object textures and colors, not only the object shapes learned by Rotation.
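The Performance Ratio can be recomputed directly from the rows of Table 1; the short script below reproduces the reported scores.

```python
def performance_ratio(ssl_accs, sup_accs):
    """Mean, over layers, of the ratio of self-supervised to supervised accuracy."""
    return sum(s / t for s, t in zip(ssl_accs, sup_accs)) / len(sup_accs)

# Conv1-Conv5 accuracies from Table 1.
print(performance_ratio([18.8, 31.7, 38.7, 38.2, 36.5],
                        [19.3, 36.3, 44.2, 48.3, 50.5]))  # ImageNet: ~0.847
print(performance_ratio([52.3, 63.2, 66.0, 65.7, 62.1],
                        [52.7, 66.1, 71.2, 74.7, 76.9]))  # CIFAR-10: ~0.912
print(performance_ratio([19.5, 22.3, 19.3, 18.2, 17.1],
                        [23.4, 29.0, 31.9, 32.0, 31.3]))  # DTD: ~0.665
```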

3.3 Image Enhancement Transformations

To overcome the limitation of Rotation, we investigate using image enhancements as pretext-tasks for self-supervised learning. For obtaining the semantic information of object textures and colors, we adopt five image enhancements: Brightness, Contrast, Saturation, Sharpness, and Solarization. They are well-known representative image enhancements that manipulate the tendencies of image pixels, and they are implemented in open-source libraries (e.g., the Python Imaging Library, https://github.com/python-pillow/Pillow). These five transformations are often used for data augmentation, as mentioned in Section 2.3. Thus, the selected transformations are useful for combining with Rotation in terms of preserving its simplicity and complementing its effectiveness.

We arrange the enhancement transformations into the form of self-supervised learning in the same fashion as Rotation. That is, using Eq. (1) and (2), we train a model to predict the discretized degree of a transformation as a classification task, following [7]. This formulation is desirable because it preserves the properties of Rotation: no architecture modifications and no specialized loss function are required. For simplicity, we discretize each degree into four levels, including the original image, as in Rotation. Note that, in order to avoid creating the trivial image artifacts mentioned in [14], we set the first degree to 0.1 for Brightness and Contrast, since these transformations generate fully black or fully gray images when the degree is 0. Figure 2 illustrates the images transformed by the arranged image enhancements.
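As a minimal sketch (assuming Pillow's `ImageEnhance` and `ImageOps` interfaces), the four-way pretext labels for the two best-performing enhancements can be generated as follows; the degree values match Figure 2, while the helper names are illustrative.

```python
import random
from PIL import Image, ImageEnhance, ImageOps

# Discretized degrees per enhancement (as in Figure 2); the degree index is the pretext label.
SHARPNESS_DEGREES = [0.0, 0.5, 1.0, 1.5]
SOLARIZE_THRESHOLDS = [0, 85, 170, 256]

def sharpness_example(img: Image.Image):
    """Sample a Sharpness pretext example: (transformed image, label)."""
    label = random.randrange(4)
    return ImageEnhance.Sharpness(img).enhance(SHARPNESS_DEGREES[label]), label

def solarization_example(img: Image.Image):
    """Sample a Solarization pretext example: (transformed image, label).
    A threshold of 256 inverts no pixels, i.e., returns the original image."""
    label = random.randrange(4)
    return ImageOps.solarize(img, threshold=SOLARIZE_THRESHOLDS[label]), label
```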

In the same experimental setting as Section 3.2, we tested models trained on pretext-tasks using the image enhancements instead of Rotation. Table 2 reports the top-1 classification accuracies when using the output of each corresponding convolutional layer as the input to a linear classifier. Surprisingly, almost all of the pretext-tasks using image enhancements achieve a greater Performance Ratio than Rotation. More specifically, the Sharpness and Solarization models outperform the Rotation models by a larger margin than the other transformations. The performance gap among the transformations can derive from whether the visual effect of a transformation is independent of the natural variation of images taken by humans. That is, since variations of contrast, brightness, and color tone also exist in natural images taken by humans, it is hard for CNNs to learn the categories created by the Contrast, Brightness, or Saturation transformations. This is analogous to the discussion in [14]: pretext-tasks using geometric transformations that manipulate the scale or aspect of an image will not work as well as Rotation because various patterns of scale and aspect already exist in natural images. On the other hand, Sharpness and Solarization respectively manipulate an image to emphasize edges and to invert the color of each pixel, and thus the visual effects of these manipulations are not strongly related to the variations of natural images, as with Rotation. This might be the reason why Sharpness and Solarization can train CNNs more effectively than the other transformations.

To interpret the performance differences in Table 2 more specifically, we visualize the attention maps of Conv2 in AlexNet for the image enhancement based pretext-tasks using Grad-CAM [38]. Figure 3 illustrates the attention maps of models trained by supervised learning, Rotation, Sharpness, and Solarization. As can be seen, the Rotation model focuses on a limited region around the black dots and pays attention to the white blank areas between dots rather than the dots themselves. In contrast, the results of Sharpness and Solarization show that the models directly recognize the dots and respond to wider regions of the whole image than the supervised and Rotation models. In the case of Sharpness, since the pretext-task forces CNNs to recognize the edges constructing objects and textures in an image in order to distinguish degrees of blurriness, the trained CNNs can focus on the textures of an image, as shown in Figure 3(d). For Solarization, to classify the degrees of solarization (Figure 2(f)), CNNs are required to capture regions of simultaneously inverted color textures or semantic structures, so the models are trained to pay attention to semantic regions appearing over the whole image (Figure 3(e)). From these observations, we see that CNNs trained by the image enhancements capture the semantic information of textures and colors more clearly than those trained by supervised learning or Rotation.

(a) Input
(b) Super
(c) Rot
(d) Sharp
(e) Solar
Figure 3: Attention maps generated from models trained by (b) supervised learning (Super) and by self-supervised learning with (c) Rotation (Rot), (d) Sharpness (Sharp), and (e) Solarization (Solar). These attention maps indicate where a trained CNN concentrates in order to recognize the dotted image sampled from DTD; brighter regions indicate stronger attention.

4 Proposed Method

We propose an algorithm called multiple pretext-task for self-supervised learning (MP-SSL). In the previous section, we found that pretext-tasks using image enhancements help CNNs capture the semantic information related to object textures and colors in images from DTD. On the other hand, as shown in Figure 1(b), Rotation still yields powerful models when applied to general datasets where object shape is important, such as CIFAR-10. In order to exploit image enhancements and Rotation complementarily, MP-SSL solves multiple pretext-tasks by processing one shared image to which multiple image transformations have been applied. This encourages a model to capture semantic information related not only to object shapes but also to textures and colors. In this section, we describe the objective functions of MP-SSL and the learning algorithm.

4.1 Objective Functions

Consider self-supervised learning applying multiple pretext-tasks over an input space $\mathcal{X}$ and a collection of pretext-task label spaces $\{\mathcal{Y}^t\}_{t=1}^{T}$, where $T$ is the number of pretext-tasks (image transformations). We train a feature extractor $f_{\theta_{\mathrm{sh}}}$ parameterized by the shared parameter $\theta_{\mathrm{sh}}$ through solving all of the pretext-tasks with classifiers $c_{\theta_t}$, each parameterized by the task-specific parameter $\theta_t$. We define the set of labels for the $t$-th pretext-task as follows:

$\mathcal{Y}^t = \{1, 2, 3, 4\}.$   (3)

Note that we inherit the size of the label set from [14] because they found that the best number of recognized rotations is four. For the training, we use the following dataset:

$\mathcal{D} = \big\{\big(x_n, \{y_n^t\}_{t=1}^{T}\big)\big\}_{n=1}^{N},$   (4)

where $N$ is the number of input images, $x_n$ is the $n$-th input image in $\mathcal{X}$, and $y_n^t$ is the label of the $t$-th pretext-task for $x_n$. We randomly sample $y_n^t$ from $\mathcal{Y}^t$ with uniform distribution. Then, we transform the input image $x_n$ as follows:

$\tilde{x}_n = g^{T}\big(\cdots g^{2}\big(g^{1}(x_n, y_n^{1}), y_n^{2}\big) \cdots, y_n^{T}\big),$   (5)

where $g^t$ is the function that returns a transformed image according to the given label by applying the image transformation corresponding to the $t$-th pretext-task. Note that a transformed image is generated by applying all of the transformations serially; e.g., if we select Rotation and Solarization for the transformations, we first rotate an image and then solarize the rotated image.
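A hedged sketch of Eq. (5) for the Rot+Solar pair, assuming the Pillow-based transformations introduced in Section 3.3; labels are sampled uniformly and the transformations are applied serially.

```python
import random
from PIL import Image, ImageOps

ROTATION_DEGREES = [0, 90, 180, 270]
SOLARIZE_THRESHOLDS = [0, 85, 170, 256]

def generate_mp_ssl_example(img: Image.Image):
    """Eq. (5) for T = 2 (Rotation then Solarization): sample one label per
    pretext-task and apply the corresponding transformations serially."""
    y_rot = random.randrange(4)
    y_sol = random.randrange(4)
    x = img.rotate(ROTATION_DEGREES[y_rot])                          # g^1(x, y^1)
    x = ImageOps.solarize(x, threshold=SOLARIZE_THRESHOLDS[y_sol])   # g^2(., y^2)
    return x, {"rotation": y_rot, "solarization": y_sol}
```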

For the set of transformed images $\{\tilde{x}_n\}_{n=1}^{N}$ and the set of corresponding labels $\{y_n^t\}_{n=1}^{N}$ for the $t$-th pretext-task, a model is optimized by the following formulation for empirical risk minimization:

$\min_{\theta_{\mathrm{sh}}, \theta_1, \ldots, \theta_T} \sum_{t=1}^{T} \lambda^{t}\, \hat{\mathcal{L}}^{t}(\theta_{\mathrm{sh}}, \theta_t),$   (6)
$\hat{\mathcal{L}}^{t}(\theta_{\mathrm{sh}}, \theta_t) = \frac{1}{N}\sum_{n=1}^{N} -\log c_{\theta_t}^{y_n^t}\big(f_{\theta_{\mathrm{sh}}}(\tilde{x}_n)\big),$   (7)

where $\lambda^t$ is the scaling factor balancing the effect of losses across pretext-tasks. The form of Eq. (6) is a well-known objective function for multi-task learning [39], which minimizes a weighted sum over all tasks with respect to the shared parameter $\theta_{\mathrm{sh}}$ and the task-specific parameters $\theta_t$. By using the transformed images, we compute a softmax cross-entropy loss from $c_{\theta_t}^{y_n^t}(f_{\theta_{\mathrm{sh}}}(\tilde{x}_n))$ in Eq. (7), as shown in Figure 4. This means that we can easily implement the above loss functions with standard modules equipped in common deep learning frameworks.
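As an illustration of Eq. (6)-(7), the following PyTorch-style sketch uses a shared backbone and one linear head per pretext-task; the module and function names are hypothetical, and the weights `lam` would normally come from MGDA-UB (Section 4.2) rather than being fixed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MPSSLHeads(nn.Module):
    """Shared feature extractor f_{theta_sh} with one 4-way classifier per pretext-task."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_tasks: int):
        super().__init__()
        self.backbone = backbone
        self.heads = nn.ModuleList([nn.Linear(feat_dim, 4) for _ in range(num_tasks)])

    def forward(self, x):
        z = self.backbone(x)                      # shared representation
        return [head(z) for head in self.heads]   # one logit vector per task

def mp_ssl_loss(model, images, labels, lam):
    """Weighted sum of per-task cross-entropy losses (Eq. 6-7).
    labels: list of LongTensors, one per task; lam: list of scaling factors."""
    logits_per_task = model(images)
    losses = [F.cross_entropy(logits, y) for logits, y in zip(logits_per_task, labels)]
    return sum(w * l for w, l in zip(lam, losses))
```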

4.2 Optimization

Input: set of input images X, number of tasks T, learning rate η
Output: trained shared parameter θ_sh
1:  Randomly initialize parameters θ_sh, θ_1, …, θ_T
2:  while not converged do                              ▷ Assume standard mini-batch SGD
3:      Randomly sample labels {y^t}_{t=1}^T
4:      x̃ ← Generate(x, {y^t}_{t=1}^T)                  ▷ By Eq. (5)
5:      for t = 1 to T do
6:          θ_t ← θ_t − η ∇_{θ_t} L̂^t(θ_sh, θ_t)         ▷ Update task-specific parameters
7:      end for
8:      λ^1, …, λ^T ← FrankWolfe({L̂^t})                  ▷ Same as [39]
9:      θ_sh ← θ_sh − η ∇_{θ_sh} Σ_t λ^t L̂^t              ▷ Update shared parameter by MGDA-UB
10: end while
Algorithm 1 MP-SSL

To optimize a model with Eq. (6) in a naive way, we must grid-search the scaling factors, which is time-consuming. Thus, to solve the optimization problem efficiently, we adopt the approximation of the multiple gradient descent algorithm using an upper bound (MGDA-UB) [39]. Since MGDA-UB computes the scaling factors with the Frank-Wolfe algorithm [17] at each training step, we can train a model without specifying the scaling factors explicitly. Furthermore, MGDA-UB approximates the problem by differentiating the losses with respect to the output of the feature extractor instead of the shared parameter $\theta_{\mathrm{sh}}$, so that we can update the parameters in a single backward pass for all pretext-tasks during back-propagation. The overall algorithm of MP-SSL is summarized in Algorithm 1.
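For the two-task pairs used in the experiments (e.g., Rot+Solar), the Frank-Wolfe step of MGDA has a closed-form min-norm solution [39]; the sketch below is a simplified illustration of computing the scaling factors from flattened per-task gradients, not the authors' exact implementation.

```python
import torch

def two_task_scaling_factors(g1: torch.Tensor, g2: torch.Tensor):
    """Closed-form min-norm solution of MGDA for two tasks [39]:
    minimize || a*g1 + (1-a)*g2 ||^2 over a in [0, 1].
    g1, g2: flattened gradients of each task loss w.r.t. the shared representation."""
    diff = g1 - g2
    denom = diff.dot(diff)
    if denom.item() == 0.0:           # identical gradients: any weighting is optimal
        return 0.5, 0.5
    a = torch.clamp((g2 - g1).dot(g2) / denom, 0.0, 1.0).item()
    return a, 1.0 - a
```

The shared parameter is then updated once with the gradient of the weighted sum of losses, which is what allows a single backward pass over all pretext-tasks.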

AlexNet VGG-16 WRN-40-10
CIFAR-10 CIFAR-100 TinyImageNet CIFAR-10 CIFAR-100 TinyImageNet CIFAR-10 CIFAR-100 TinyImageNet
Supervised 76.9 58.3 45.4 82.9 47.8 31.3 92.4 72.2 56.3
Rotation 62.1 33.2 23.7 33.9 11.7 3.2 74.0 43.0 23.4
MP-SSL (Rot + Sharp) 59.8 32.3 23.4 35.1 17.5 3.8 74.2 44.5 23.6
MP-SSL (Rot + Solar) 62.3 34.7 25.1 47.8 23.3 5.7 75.4 49.0 26.1
Table 3: Evaluation summary of our MP-SSL algorithm with multiple network architectures and datasets. Each cell shows the mean top-1 test accuracy of a linear classifier using feature maps generated by the pretrained (frozen) CNN. We extracted the feature maps from the convolutional layers at the pre-logit level, i.e., Conv5 of AlexNet, Conv5-3 of VGG-16, and Block-3 of WRN-40-10.
Conv1 Conv2 Conv3 Conv4 Conv5
Supervised [24, 49] 19.3 36.3 44.2 48.3 50.5
Random [50] 11.6 17.1 16.9 16.3 14.1
Krähenbühl et al[22] 17.5 23.0 24.5 23.2 20.6
Pathak et al. (Inpainting) [35] 14.1 20.7 21.0 19.8 15.5
Zhang et al. (Split-Brain) [50] 17.7 29.3 35.4 35.2 32.8
Rotation [14] 18.8 31.7 38.7 38.2 36.5
Jenni & Favaro [18] 19.5 33.3 37.9 38.9 34.9
Mundhenk et al[31] 19.6 31.8 37.6 37.8 33.7
Noroozi et al. (Jigsaw++) [34] 18.9 30.5 35.7 35.4 32.2
Wu et al[45] 16.8 26.5 31.8 34.1 35.6
Feng et al[11] 19.3 33.3 40.8 41.8 44.3
Rotation (Our reimpl.) 21.1 34.2 39.3 38.0 36.3
MP-SSL (Rot+Solar) 22.3 38.4 41.7 41.4 37.3
Table 4: Top-1 linear classification accuracies on ImageNet validation set using different frozen convolutional layers.
Conv1 Conv2 Conv3 Conv4 Conv5
Supervised (Places labels) [51, 50] 22.1 35.1 40.2 43.3 44.6
Supervised (ImageNet labels) [24, 49] 22.7 34.8 38.4 39.4 38.7
Random [50] 15.7 20.3 19.8 19.1 17.5
Krähenbühl et al[22] 21.4 26.2 27.1 26.1 24.0
Pathak et al. (Inpainting) [35] 18.2 23.2 23.4 21.9 18.4
Zhang et al. (Split-Brain) [50] 21.3 30.7 34.0 34.1 32.5
Rotation [14] 21.5 31.0 35.1 34.6 33.7
Jenni & Favaro [18] 23.3 34.3 36.9 37.3 34.4
Mundhenk et al[31] 23.7 34.2 37.2 37.2 34.9
Noroozi et al. (Jigsaw++) [34] 22.5 33.0 36.2 36.1 34.2
Wu et al[45] 18.8 24.3 31.9 34.5 33.6
Feng et al[11] 22.9 32.4 36.6 37.3 38.6
Rotation (Our reimpl.) 24.3 40.6 41.9 41.1 38.4
MP-SSL (Rot+Solar) 25.3 44.1 45.8 44.3 41.8
Table 5: Top-1 linear classification accuracies on Places-205 validation set using different frozen convolutional layers pretrained by ImageNet dataset.

5 Results

In this section, we show the evaluation of the MP-SSL algorithm on multiple tasks with various datasets. We compare MP-SSL with existing self-supervised learning methods. This section is composed of the following evaluations: (i) confirming the efficacy of the combination of Rotation and image enhancements via MP-SSL on image classification tasks, (ii) comparing MP-SSL to current state-of-the-art methods in terms of the performance on the standard benchmark for self-supervised learning using ImageNet and Places-205, (iii) analyzing MP-SSL through additional evaluations including a semi-supervised learning setting, a comparison of MP-SSL and data augmentation, and an ablation study about MGDA-UB.

Figure 4: Illustration of MP-SSL in the case that the pretext-tasks are Rotation and Solarization.

5.1 Settings

Datasets

We used five datasets for evaluating our MP-SSL: CIFAR-10/-100 [23], Places-205 [51], Tiny ImageNet [26], and ImageNet (ILSVRC 2012) [37]. For the CIFARs and Tiny ImageNet, we randomly split the train set 9:1 and used the former for training and the latter for validation. The image size was set to 32×32; in the case of Tiny ImageNet, we randomly cropped 32×32 regions from the original images during training and applied a center crop at validation and test time. For testing, we used the test sets of CIFAR-10/-100 and the validation set of Tiny ImageNet. For ImageNet and Places-205, we randomly split the train set 99:1 and used the former for training and the latter for validation. We set the image size to 224×224 by random crop in training and center crop in testing. We tested models on the validation sets of ImageNet and Places-205.

Network Architectures

As the network architectures for the evaluations, we used AlexNet [24], VGG-16 [41], and Wide-ResNet [46]; the convolutional layers of these networks work as the feature extractor. For 32×32 images, we modified the original architectures of AlexNet and VGG-16 by resizing the kernel sizes and the input/output channels of the fully connected layers to adjust to the 32×32 input size (for more details, see the supplementary materials). We applied WRN-40-10 [46] as the instance of the Wide-ResNet architecture. As in the previous works [49, 14], we used a variant of the AlexNet architecture for the ImageNet experiments, with modified channel sizes and with local response normalization layers replaced by batch normalization layers.

Training

We selected settings and parameters for training following [46]. For fair evaluation and reproducibility, we share the hyperparameters for all models trained on 32×32 images; we train them by SGD with Nesterov momentum (initial learning rate 0.01, weight decay, batch size 128). In the ImageNet experiments, we used SGD with Nesterov momentum (initial learning rate 0.005, weight decay, batch size 256) for both the pretext-tasks and the target tasks. We dropped the learning rate by a factor of 0.1 at 30, 60, and 80 epochs and trained all models for a total of 100 epochs, except for the setting in Section 5.4, where we dropped the learning rate by 0.1 at 15, 30, and 40 epochs and trained for a total of 50 epochs. We initialized all weight parameters with He normal initialization [15]. For target-task training in Section 5.2, we used the logistic regression algorithm (L-BFGS) implemented in the scipy library [19] with the default parameters and set the maximum number of iterations to 10,000. In all experiments, each training run was repeated three times, and we report the average scores and standard deviations.
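For reference, a minimal PyTorch sketch of the optimizer and schedule described above; the momentum value is an assumption (only "Nesterov momentum" is stated here), and the weight-decay value is left out because it is not given in this section.

```python
import torch

def make_optimizer_and_schedule(model, imagenet: bool = False):
    """SGD with Nesterov momentum and the step schedule used for the main experiments."""
    lr = 0.005 if imagenet else 0.01
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=0.9,     # assumed value; text states Nesterov momentum
                                nesterov=True)    # weight decay omitted (value unspecified here)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[30, 60, 80], gamma=0.1)  # 100 epochs in total
    return optimizer, scheduler
```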

5.2 Comparison of Rotation and MP-SSL

First, we show the efficacy of MP-SSL by comparing it to Rotation on various datasets and network architectures. Based on the investigation in Section 3, we used Sharpness and Solarization as the image enhancements in MP-SSL, i.e., we tested the pairs {Rotation, Sharpness} and {Rotation, Solarization}; these pairs are denoted Rot+Sharp and Rot+Solar, respectively. Table 3 summarizes the comparison among learning strategies (Supervised, Rotation, and MP-SSL), datasets (CIFAR-10/-100 and TinyImageNet), and network architectures (AlexNet, VGG-16, and WRN-40-10). From the results, we can confirm that MP-SSL models outperform the Rotation models in almost all cases. In particular, MP-SSL (Rot+Solar) achieved the best performance in all settings and significantly boosted the performance of Rotation in the case of VGG-16. These results indicate that, while Rotation tends to make VGGs overly specialized for solving the pretext-task [21], MP-SSL can encourage VGGs to obtain better representations for the target tasks. We provide results for other combinations of pretext-tasks in the supplementary materials.

5.3 ImageNet Representation Learning

In order to compare MP-SSL to existing self-supervised learning methods, we test MP-SSL on ImageNet representation learning benchmarks following [49].

Table 4 reports the top-1 classification accuracy of our MP-SSL model on ImageNet using linear classifiers. We first trained models on the unlabeled dataset with MP-SSL and then trained linear classifier layers using feature maps extracted from the feature extractor with frozen weights. The results show that MP-SSL models achieve state-of-the-art performance when using feature maps from the input-side layers (Conv1, Conv2, Conv3). On the other hand, the performance of MP-SSL on the output-side layers (Conv4, Conv5) fails to outperform the current state-of-the-art method (Feng et al. [11]), which combines Rotation with instance classification in the pretext-task. This is because the instance classification in [11] focuses on recognizing object instances, and the information about object instances is often considered highly abstract information captured in output-side layers [2, 1]. Thus, since MP-SSL models focus on textures and colors rather than information specific to object instances, the performance boost concentrates on the input-side layers that capture more primitive visual information.

In Table 5, we also show the results of Places-205 classification models pretrained on ImageNet via MP-SSL. As in the ImageNet classification experiments, we used the representations with frozen weights for extracting feature maps. We can confirm that our MP-SSL model achieved the best performance for all of the layers. This indicates that MP-SSL models can capture semantic features that generalize to recognizing images even when applied to a different dataset.

5.4 Analysis

Semi-supervised Learning Setting

As can be seen in previous works [14, 29, 48], semi-supervised learning is one of the important practical applications of self-supervised learning models. Thus, we evaluate our MP-SSL in a semi-supervised learning setting for ImageNet classification with the AlexNet architecture. Similar to existing works [14, 48], we tested models as follows: (i) we pretrained models with self-supervised learning on 100% of the unlabeled data, and (ii) we fine-tuned the models with supervised learning on a reduced volume of labeled data. We used 10, 50, and 100% of the labeled data in step (ii). Table 6 reports the top-1 accuracy of the models fine-tuned from models pretrained by Rotation or MP-SSL (Rot+Solar). We can see that the MP-SSL models outperform the Rotation models in all cases, indicating that the representations learned by MP-SSL also work in semi-supervised settings.

Comparison to Rotation with Data Augmentation

Since image enhancements are used for data augmentation [7], MP-SSL might improve the performance of Rotation simply by augmenting data rather than by solving our proposed multiple pretext-task. Thus, to confirm the efficacy of our pretext-task, we compare MP-SSL to data augmentation. To this end, we tested Rotation with data augmentation by image enhancements and compared it to our MP-SSL models. When training the data augmentation models (DA), we applied the same transformation (Solarization) to the input images as in MP-SSL, but trained the models with only the Rotation loss. Table 7 shows the comparison of DA and MP-SSL models on the 32×32 datasets with WRN-40-10 in the same setting as Section 5.2. The performance of DA is similar to that of Rotation. In contrast, our MP-SSL outperforms both the Rotation and DA models, so we can say that the improvement of MP-SSL mainly comes from training with the multiple pretext-task.

Ablation Study of MGDA-UB

In MP-SSL, we adopt MGDA-UB [39] for simultaneously training the multiple pretext-tasks and balancing the weight of each pretext-task loss. In this section, we confirm the effectiveness of MGDA-UB by comparing it with models using fixed weights (denoted MT models). Table 8 shows the top-1 accuracy of linear classifiers under the same training setting as Section 5.2. From the results, our MP-SSL models achieve higher performance than the MT models. This indicates that determining appropriate weights is important for training a model with multiple pretext-tasks, and that MP-SSL based on MGDA-UB can easily improve the performance by dynamically computing the weights with the Frank-Wolfe method.

10 % labels 50% labels 100% labels
Scratch 25.3 49.8 59.5
Rotation 37.1 53.0 59.1
MP-SSL (Rot+Solar) 38.0 53.7 59.4
Table 6: ImageNet classification accuracy of models in the semi-supervised learning setting. We fine-tuned AlexNet models pretrained by Rotation and MP-SSL on the full unlabeled training set of ImageNet. The Scratch row denotes the results without pretraining.
CIFAR-10 CIFAR-100 TinyImageNet
Rotation 74.0 43.0 23.4
DA (Solar) 74.7 43.0 23.0
MP-SSL (Rot+Solar) 75.4 49.0 26.1
Table 7: Comparison of MP-SSL and Data Augmentation (DA) with respect to performance boosting over Rotation. We used WRN-40-10 as the network architecture and tested on the same setting as Table 3.
CIFAR-10 CIFAR-100 TinyImageNet
Rotation 74.0 43.0 23.4
MT (Rot+Solar) 74.5 41.9 24.3
MP-SSL (Rot+Solar) 75.4 49.0 26.1
Table 8: Comparison of MP-SSL and multi-task learning with fixed mean weights for optimization (MT). For the MT models, the weight of each image transformation was fixed to the same value. We used WRN-40-10 as the network architecture and tested in the same setting as Table 3.

6 Conclusion

This paper presented a novel multiple pretext-task for self-supervised learning (MP-SSL), which combines Rotation and image enhancements to achieve useful representations focusing not only on object shapes but also on textures and colors. We confirmed that MP-SSL with Rotation and Solarization improves target performance across various datasets and network architectures, and achieves state-of-the-art performance on the Places-205 dataset.

Appendix

Appendix A Experimental Setup in Section 3

The following are the experimental details for the preliminary experiments in Section 3.2 of the main paper.

  • We train a pretext model with SGD (learning rate 0.01, momentum 0.9, weight decay, batch size 256, fixed input size). The numbers of images in the train/validation/test sets of DTD were 3,384/376/1,880, respectively. We trained a model for 100 epochs in total and dropped the learning rate by a factor of 10 after 30, 60, and 80 epochs.

  • To train the target classification models, we used the logistic regression algorithm in the scipy library [19]. We optimized the regression models by L-BFGS with the default parameters and set the maximum number of iterations to 5,000. Following [21], we compute the regularization hyperparameter from the size of the representations and the number of classes.

  • All experiments were run three times, and we report the average scores with standard deviations.

Appendix B Details of Image Enhancements

In Section 3.3 of the main paper, we introduced five image enhancements as image transformations for the additional pretext-tasks. Here, we give detailed information about the transformations. Table 1 describes the definition of each image enhancement. We denote the degrees of the image enhancements following the interface of the Python Imaging Library (PIL, https://github.com/python-pillow/Pillow).

Transformation Description
Brightness
Controls the brightness of an image.
A degree of 0 returns a black image,
and a degree of 1 returns the original image.
Contrast
Controls the contrast of an image.
A degree of 0 returns a gray image,
and a degree of 1 returns the original image.
Saturation
Controls the color balance of an image.
A degree of 0 returns a black-and-white image,
and a degree of 1 returns the original image.
Sharpness
Controls the sharpness of an image.
A degree of 0 returns a blurred image,
and a degree of 1 returns the original image.
Solarization
Inverts all pixels above the threshold value given by the degree.
A degree of 0 returns a fully inverted image,
and a degree of 256 returns the original image.
Table 1: List of the image enhancements used for pretext-tasks

Appendix C Details of Network Architectures

Here, we show the detailed architecture of CNNs used for the evaluations in the main paper.

For 32×32 images

We used AlexNet [24], VGG-16 [41], and Wide-ResNet (WRN-40-10) [46] based network architectures for the evaluations. The detailed network parameters for each architecture are given in Tables 2, 3, and 4, respectively. The WRN-40-10 architecture follows [46].

RGB image
Conv1: 2D Convolution (in:3, out:96, kernel:3, stride:1, pad:1)
BN, ReLU, MaxPooling (kernel:2, stride:2)
Conv2: 2D Convolution (in:96, out:256, kernel:5, stride:1, pad:2)
BN, ReLU, MaxPooling (kernel:2, stride:2)
Conv3: 2D Convolution (in:256, out:384, kernel:3, stride:1, pad:1)
BN, ReLU
Conv4: 2D Convolution (in:384, out:384, kernel:3, stride:1, pad:1)
BN, ReLU
Conv5: 2D Convolution (in:384, out:256, kernel:3, stride:1, pad:1)
BN, ReLU, MaxPooling (kernel:2, stride:2)
Linear (in:256, out:4096)
Dropout (ratio:0.5)
Linear (in:4096, out:4096)
Dropout (ratio:0.5)
Linear (in:4096, out:4) for each pretext task
Table 2: AlexNet architecture for 32×32 images. The parenthesized values give the input channel size (in), output channel size (out), kernel size (kernel), stride, padding size (pad), and dropout ratio (ratio).
RGB image
Conv1-1: 2D Convolution (in:3, out:64, kernel:3, stride:1, pad:1)
BN, ReLU
Conv1-2: 2D Convolution (in:64, out:64, kernel:3, stride:1, pad:1)
BN, ReLU, MaxPooling (kernel:2, stride:2)
Conv2-1: 2D Convolution (in:64, out:128, kernel:3, stride:1, pad:1)
BN, ReLU
Conv2-2: 2D Convolution (in:128, out:128, kernel:3, stride:1, pad:1)
BN, ReLU, MaxPooling (kernel:2, stride:2)
Conv3-1: 2D Convolution (in:128, out:256, kernel:3, stride:1, pad:1)
BN, ReLU
Conv3-2: 2D Convolution (in:256, out:256, kernel:3, stride:1, pad:1)
BN, ReLU
Conv3-3: 2D Convolution (in:256, out:256, kernel:3, stride:1, pad:1)
BN, ReLU, MaxPooling (kernel:2, stride:2)
Conv4-1: 2D Convolution (in:256, out:512, kernel:3, stride:1, pad:1)
BN, ReLU
Conv4-2: 2D Convolution (in:512, out:512, kernel:3, stride:1, pad:1)
BN, ReLU
Conv4-3: 2D Convolution (in:512, out:512, kernel:3, stride:1, pad:1)
BN, ReLU, MaxPooling (kernel:2, stride:2)
Conv5-1: 2D Convolution (in:512, out:512, kernel:3, stride:1, pad:1)
BN, ReLU
Conv5-2: 2D Convolution (in:512, out:512, kernel:3, stride:1, pad:1)
BN, ReLU
Conv5-3: 2D Convolution (in:512, out:512, kernel:3, stride:1, pad:1)
BN, ReLU, MaxPooling (kernel:2, stride:2)
Linear (in:512, out:4096)
ReLU, Dropout (ratio:0.5)
Linear (in:4096, out:4096)
ReLU, Dropout (ratio:0.5)
Linear (in:4096, out:4) for each pretext task
Table 3: VGG-16 architecture for 32×32 images
RGB image
2D Convolution (in:3, out:16, kernel:3, stride:1, pad:1)
Block-1: ResBlocks (in:16, out:160, n:6, stride:1)
Block-2: ResBlocks (in:160, out:320, n:6, stride:2)
Block-3: ResBlocks (in:320, out:640, n:6, stride:2)
BN, ReLU, AveragePooling (kernel:8)
Linear (in:640, out:4) for each pretext task
Table 4: WRN-40-10 architecture for 32×32 images. ResBlocks(in, out, n, stride) is composed of n ResBlocks built with the given input/output channel sizes and stride parameter.
Feature from previous layer
BN
2D Convolution (in:c_in, out:c_out, kernel:3, stride:1, pad:1)
Dropout (ratio:0.3), BN
2D Convolution (in:c_out, out:c_out, kernel:3, stride:s, pad:1)
Add
Table 5: A ResBlock architecture for 32×32 images. The channel sizes c_in, c_out and the stride s are given at initialization.
RGB image
Conv1: 2D Convolution (in:3, out:64, kernel:11, stride:4, pad:2)
BN, ReLU, MaxPooling (kernel:2, stride:2)
Conv2: 2D Convolution (in:64, out:192, kernel:5, stride:1, pad:2)
BN, ReLU, MaxPooling (kernel:2, stride:2)
Conv3: 2D Convolution (in:192, out:384, kernel:3, stride:1, pad:1)
BN, ReLU
Conv4: 2D Convolution (in:384, out:256, kernel:3, stride:1, pad:1)
BN, ReLU
Conv5: 2D Convolution (in:256, out:256, kernel:3, stride:1, pad:1)
BN, ReLU, MaxPooling (kernel:2, stride:2)
Linear (in:256, out:4096)
Dropout (ratio:0.5)
Linear (in:4096, out:4096)
Dropout (ratio:0.5)
Linear (in:4096, out:4) for each pretext task
Table 6: AlexNet architecture for ImageNet
Supervised 72.2
Random [50] 19.3
Sharpness 25.6
Solarization 27.1
Rotation 43.0
MP-SSL (Sharp+Solar) 26.5
MP-SSL (Rot+Sharp) 44.5
MP-SSL (Rot+Solar) 49.0
MP-SSL (Rot+Sharp+Solar) 48.8
Table 7: Top-1 linear classification accuracy on CIFAR-100

For ImageNet

We used a variant of AlexNet [24] for the evaluations. The details of the network architecture are shown in Table 6. Note that this architecture is the same as in previous works [49, 14].

Appendix D Further Analysis of MP-SSL

In the main paper, we demonstrated the results of MP-SSL with Rotation and image enhancements, but did not include combinations of image enhancements such as Sharp+Solar. This is because MP-SSL aims to improve the Rotation models with an image enhancement, and combining multiple image enhancements might negatively affect training by transforming pixel values along the channel direction multiple times. Even so, we additionally investigated the performance when using a single image enhancement as the pretext-task, and the effect of other combinations of image transformations in MP-SSL such as Sharp+Solar and Rot+Sharp+Solar. We tested these models on CIFAR-100 with WRN-40-10 in the same setting as Section 5.2 of the main paper.

Table 7 summarizes the top-1 classification accuracy of the linear classification models. Although Sharpness and Solarization improve over Random, which corresponds to the feature maps of a CNN with random weights, they are inferior to the Rotation models, unlike the case of DTD in Section 3.3 of the main paper. These results imply that CIFAR-100 mainly requires object shapes rather than color textures, and thus Rotation is more effective than Sharpness or Solarization in this case.

Next, we turn to the results of MP-SSL with several combinations of image transformations, shown at the bottom of Table 7. Interestingly, the Sharp+Solar models do not improve over the single-transformation cases (Rotation, Sharpness, or Solarization). Since the image enhancements transform pixel values along the channel direction, simultaneously combining multiple image enhancements might prevent the efficient training of each pretext-task by transforming along the channel direction multiple times. Similarly, although the Rot+Sharp+Solar models outperform the Rotation models, they are not superior to the Rot+Solar models.

References

  • [1] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba (2017) Network dissection: quantifying interpretability of deep visual representations. In Computer Vision and Pattern Recognition.
  • [2] Y. Bengio, A. Courville, and P. Vincent (2013) Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • [3] W. Brendel and M. Bethge (2019) Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet. In International Conference on Learning Representations.
  • [4] L. Chen, M. Collins, Y. Zhu, G. Papandreou, B. Zoph, F. Schroff, H. Adam, and J. Shlens (2018) Searching for efficient multi-scale architectures for dense image prediction. In Advances in Neural Information Processing Systems.
  • [5] T. Chen, X. Zhai, M. Ritter, M. Lucic, and N. Houlsby (2019) Self-supervised GANs via auxiliary rotation loss. In The IEEE Conference on Computer Vision and Pattern Recognition.
  • [6] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi (2014) Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • [7] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2019) AutoAugment: learning augmentation strategies from data. In The IEEE Conference on Computer Vision and Pattern Recognition.
  • [8] C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision.
  • [9] C. Doersch and A. Zisserman (2017) Multi-task self-supervised visual learning. In Proceedings of the IEEE International Conference on Computer Vision.
  • [10] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox (2014) Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems.
  • [11] Z. Feng, C. Xu, and D. Tao (2019) Self-supervised representation learning by rotation feature decoupling. In Computer Vision and Pattern Recognition.
  • [12] L. Gatys, A. S. Ecker, and M. Bethge (2015) Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems.
  • [13] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel (2019) ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations.
  • [14] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition.
  • [17] M. Jaggi (2013) Revisiting Frank-Wolfe: projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning.
  • [18] S. Jenni and P. Favaro (2018) Self-supervised feature learning by learning to spot artifacts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • [19] E. Jones, T. Oliphant, P. Peterson, et al. (2001–) SciPy: open source scientific tools for Python.
  • [20] I. Kokkinos (2017) UberNet: training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • [21] A. Kolesnikov, X. Zhai, and L. Beyer (2019) Revisiting self-supervised visual representation learning.
  • [22] P. Krähenbühl, C. Doersch, J. Donahue, and T. Darrell (2016) Data-dependent initializations of convolutional neural networks.
  • [23] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report.
  • [24] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems.
  • [25] G. Larsson, M. Maire, and G. Shakhnarovich (2017) Colorization as a proxy task for visual understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • [26] Y. Le and X. Yang. Tiny ImageNet visual recognition challenge.
  • [27] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE.
  • [28] W. Lee, J. Na, and G. Kim (2019) Multi-task self-supervised object detection via recycling of bounding box annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • [29] M. Lucic, M. Tschannen, M. Ritter, X. Zhai, O. Bachem, and S. Gelly (2019) High-fidelity image generation with fewer labels. In Proceedings of the 36th International Conference on Machine Learning.
  • [30] D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten (2018) Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision.
  • [31] T. N. Mundhenk, D. Ho, and B. Y. Chen (2018) Improvements to context based self-supervised learning. In Computer Vision and Pattern Recognition.
  • [32] M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In Proceedings of the European Conference on Computer Vision.
  • [33] M. Noroozi, H. Pirsiavash, and P. Favaro (2017) Representation learning by learning to count. In Proceedings of the IEEE International Conference on Computer Vision.
  • [34] M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash (2018) Boosting self-supervised learning via knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • [35] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. Efros (2016) Context encoders: feature learning by inpainting. In Computer Vision and Pattern Recognition.
  • [36] Z. Ren and Y. Jae Lee (2018) Cross-domain self-supervised multi-task feature learning using synthetic imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • [37] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision.
  • [38] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision.
  • [39] O. Sener and V. Koltun (2018) Multi-task learning as multi-objective optimization. In Advances in Neural Information Processing Systems 31.
  • [40] V. Sharma, A. Diba, D. Neven, M. S. Brown, L. Van Gool, and R. Stiefelhagen (2018) Classification-driven dynamic image enhancement. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [41] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.
  • [42] B. Singh, M. Najibi, and L. S. Davis (2018) SNIPER: efficient multi-scale training. In Advances in Neural Information Processing Systems 31.
  • [43] C. Sun, A. Shrivastava, S. Singh, and A. Gupta (2017) Revisiting unreasonable effectiveness of data in deep learning era. In IEEE International Conference on Computer Vision.
  • [44] M. Tan and Q. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning.
  • [45] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • [46] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. In The British Machine Vision Conference.
  • [47] A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and S. Savarese (2018) Taskonomy: disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • [48] X. Zhai, A. Oliver, A. Kolesnikov, and L. Beyer (2019) S4L: self-supervised semi-supervised learning.
  • [49] R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In Proceedings of the European Conference on Computer Vision.
  • [50] R. Zhang, P. Isola, and A. A. Efros (2017) Split-brain autoencoders: unsupervised learning by cross-channel prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • [51] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva (2014) Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems.