Image representation has always been a focus of computer vision research. In the early stage of image processing, images were represented by manually extracted global features Jain and Vailaya (1996); Manjunath and Ma (1996), including color, texture, and edge information. As these features are sensitive to variable lighting conditions and occlusion, they were largely replaced by local feature extractors such as BoW Sivic and Zisserman (2003) and SIFT Lowe (2004), although the representation ability was still limited by the hand-crafted extraction rules.
For the past decade, automatic feature learning by deep neural networks (DNNs) has dominated the field of image representation. DNNs learn feature embeddings through multiple layers of non-linear transformations. The resulting high-level abstract features generalize well to various downstream tasks. In particular, convolutional neural networks trained on images with annotated class labels can capture visual similarity among different categories and thus perform semantic classification successfully. However, due to the high cost or expertise required for manual annotation, a substantial proportion of image data remains unlabeled and has not been effectively exploited.
In recent years, self-supervised learning methods Jing and Tian (2019) have achieved breakthrough performance in the field of computer vision. The key to the success of these methods lies in the design of pretext tasks. As a special form of unsupervised learning, self-supervised learning derives its “label” from the data itself. Proper pretext tasks can make full use of the inherent properties of data to improve the quality of the learned representation and enhance the performance of downstream tasks. Typical pretext tasks include image generation Goodfellow et al. (2014); Zhang et al. (2016), inpainting Pathak et al. (2016), and super-resolution Ledig et al. (2016), which focus on a specific operation but can also generalize well to downstream tasks. Another kind of pretext task is context-based Doersch et al. (2015); Noroozi and Favaro (2016); Gidaris et al. (2018), utilizing spatial relations among different image patches or intrinsic attributes of images.
In this study, we focus on the task of instance-wise discrimination based on data augmentation, which is commonly used in self-supervised learning. To create positive and negative labels, data augmentation techniques are utilized to generate views of images. Pairs of views originating from the same image form positive samples, and those from different images form negative samples. Representative methods of this kind are mainly batch-based or memory bank-based. A major limitation of these methods is the extra cost of storing the features of samples. For instance, the memory bank-based methods Wu et al. (2018b); Huang et al. (2019); Han et al. (2020) allocate large memory space to maintain the features of generated views for all samples. MoCo He et al. (2019) also requires building a queue for feature storage; it differs in the encoder used for generating the features of keys and in the variable length of the queue. Instead, SimCLR Chen et al. (2020) directly fetches samples from batches. To ensure good performance, SimCLR uses a much larger batch size, which restricts its application in labs with limited computational resources.
Therefore, how to get a comparable or even better result with less resource consumption has become an attractive topic in the self-supervised learning field. Considering that the end-to-end methods are more straightforward and flexible compared to memory bank and momentum-based methods, we mainly focus on the batch-based methods. To address the above challenge, we propose AAG, i.e. self-supervised representation learning by Auxiliary Augmentation with GNT-Xent Loss. Specifically, to obtain recognizable features, a self-supervised pretext task based on instance discrimination is adopted. The main idea is to treat each image instance as a single “class” so that the model has a direction for training.
The AAG model is driven by a hybrid data augmentation scheme using both basic and auxiliary data augmentation strategies, which generate three augmented views for each image. Combinations of these views compose positive and negative samples that are fed into a siamese neural network to encode image features. Besides, to speed up the model training process while maintaining stability, we propose a simple but efficient contrastive loss function named GNT-Xent Loss. Optimizing this modified contrastive loss pulls positive pairs closer together and pushes negative pairs further apart, thus yielding more discriminative representations of images. This method not only gets rid of the memory bank but also uses a much smaller batch size compared to previous batch-based methods.
To assess the discriminative ability of the learned representation, we use both the weighted NN algorithm and linear evaluation to evaluate the model performance. The contributions of our work are summarized as follows.
We design a new scheme of contrastive learning with basic and auxiliary data augmentation. The hybrid data augmentation strategy greatly alleviates the dependency on large batch size.
We propose a novel contrastive loss function which can not only maintain the stability of the training process but also improve the accuracy under both NN and linear evaluation.
The AAG method achieves SOTA accuracies on multiple benchmark datasets with low computation cost.
In this section, we provide a brief overview of research progress on self-supervised representation learning in recent years, mainly involving contrastive learning.
Contrastive learning
uses a contrastive loss function to measure the distance between samples Hadsell et al. (2006). The main idea is to first learn a low-dimensional mapping of the raw data, then decrease the Euclidean distance of similar pairs and increase that of dissimilar pairs. This allows us to retain the original semantics of samples even when the dimensions are greatly reduced. Contrastive learning is an effective method to learn the common representation of samples in different categories.
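To make the pull-together and push-apart behaviour concrete, the following minimal sketch implements a margin-based pairwise contrastive loss in the spirit of Hadsell et al. (2006). The function names, the default margin, and the factor of 0.5 are illustrative assumptions, not details taken from this paper.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_contrastive_loss(u, v, similar, margin=1.0):
    """Margin-based contrastive loss for one pair of embeddings.

    Similar pairs are penalised by their squared distance, so they are
    pulled together. Dissimilar pairs are penalised only while they are
    closer than `margin`, so they are pushed apart up to that margin.
    """
    d = euclidean(u, v)
    if similar:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2
```

A dissimilar pair that is already further apart than the margin contributes no loss, which is what keeps the embedding from being stretched indefinitely.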
Memory bank-based methods
compute the contrastive loss by comparing the image representations stored in the memory bank with the features of the current minibatch Wu et al. (2018b, a); He et al. (2019); Huang et al. (2019); Han et al. (2020). The memory bank itself is updated iteratively during the training process. Wu et al. (2018b) proposed a memory bank with appropriate parameter settings for the first time in the field of unsupervised learning. MoCo He et al. (2019) then replaced the memory bank with a variable-length queue and achieved breakthrough results on a large-scale dataset.
Batch-based methods
compute the contrastive loss within the current batch during the training process Ye et al. (2019); Chen et al. (2020). The advantage of these methods is that the features used for comparison are up-to-date at every time step. ISIF Ye et al. (2019) shows the superiority of batch-wise contrastive learning. SimCLR Chen et al. (2020) uses a very large batch size to train the network with diverse data augmentation approaches and achieves impressive results under the protocol of linear evaluation.
To fully exploit the potential of self-supervised learning in extracting features from images, we propose the Auxiliary Augmentation with GNT-Xent Loss (AAG) method. The overview of our method is illustrated in Figure 1. Given a dataset of images $X = \{x_1, \dots, x_n\}$, our goal is to learn a function $f_\theta$ without supervision, where $f_\theta$ is a deep neural network which maps an image $x_i$ to a feature $z_i = f_\theta(x_i)$. During the training process, we randomly sample a minibatch of $N$ images at each iteration. For every image $x_i$, we perform data augmentation three times, with basic augmentation and auxiliary augmentation. Three views are obtained by $t_1(x_i)$, $t_2(x_i)$, and $t_3(x_i)$, respectively, where $t_1$ and $t_2$ are sampled from the same family of basic augmentations while $t_3$ is sampled from the auxiliary augmentation. Then we feed these three views into the backbone network and get three feature embeddings, $z_i^{1}$, $z_i^{2}$, and $z_i^{3}$. Finally, we calculate the GNT-Xent Loss as defined in Eq. (1). The loss is a sum of three components, each of which is the loss arising from differentiating a positive pair of views against the corresponding negative ones. The formal definitions of these three components are shown in Algorithm 1.
Hybrid Data Augmentation Scheme
Data augmentation makes instance-wise discrimination a practical pretext task for self-supervised learning (SSL). As most of the downstream tasks depend on the discovery of high-level semantic meaning in images, a basic assumption of data augmentation-based SSL is that the semantic content of images is invariant to the augmentation operations. Thus, the basic goal of data augmentation is to bring diversification to image features and help the model distill the semantic information.
Basic data augmentation.
Basic data augmentation is defined as the superposition of a series of random data augmentation approaches. Previous studies Wu et al. (2018b); Chen et al. (2020) have shown that data augmentation is crucial for image preprocessing. Whether data augmentation is applied properly determines the final result of the trained model. Following the practice of previous batch-based methods Chen et al. (2020), our basic data augmentation consists of the following 5 operations that are performed sequentially with random magnitudes.
i) Random resize and crop;
ii) Color jitter;
iii) Random grayscale;
iv) Random horizontal flip;
v) Gaussian blur.
By calling the basic data augmentation on an image twice, we obtain two views with different image attributes, which are called core views.
Auxiliary data augmentation.
The basic data augmentation adopts a conservative scheme covering only five basic operations, which may be unable to generate sufficiently diversified image features. Some previous methods address this issue by increasing the number of samples, thus resulting in a large batch size. To reduce the dependency on batch size, we propose the auxiliary data augmentation strategy to generate additional views. The purpose of using auxiliary data augmentation is to bring in more data augmentation operations with more randomness so that the trained model can learn the semantic information from images with various attributes.
Ideally, the more semantic-invariant transformations are included, the better the model generalizes. However, the model performance does not necessarily benefit from adding augmentation operations, as the augmentation operations are not guaranteed to be semantic-invariant. The larger the operation pool, the higher the risk of introducing noisy samples into the dataset.
Therefore, to obtain a relatively reliable operation set, we choose good policies demonstrated by previous data augmentation studies, mainly from AutoAugment Cubuk et al. (2019a) and RandAugment Cubuk et al. (2019b). AutoAugment Cubuk et al. (2019a) uses an adaptive search algorithm to find the best data augmentation policies for different datasets. A policy has several sub-policies. Each sub-policy consists of 2 operations, each of which is associated with two values, i.e. the probability of using the operation and the magnitude of the operation. AutoAugment provides 25 best sub-policies for CIFAR10, SVHN, and ImageNet. The searched optimal policies can be used directly for the above datasets, while searching policies for a new dataset may be costly. RandAugment Cubuk et al. (2019b) does not require a search process and only needs to define two parameters, namely the number of operations in a sub-policy and the magnitude of the operations. It greatly reduces the time complexity of searching and achieves performance close to that of AutoAugment.
Obviously, the operation pool for generating auxiliary views is much larger than that of basic augmentation. Therefore, to control the risk of introducing too much noise, we generate only one view for each image, which is called auxiliary view. In this way, there is no positive pair consisting of two auxiliary views.
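The view-generation scheme can be sketched as follows: two core views come from the basic augmentation and one auxiliary view comes from a randomly chosen sub-policy of (operation, probability, magnitude) triples. This is a toy sketch under stated assumptions. The operations, their magnitudes, the two example sub-policies, and the representation of an image as a nested list of grey values are all hypothetical stand-ins, not the operation pools used in the experiments.

```python
import random

# Toy image type: a list of rows, each a list of grey values in [0, 1].


def hflip(img, _mag):
    """Mirror the image horizontally (magnitude is ignored)."""
    return [row[::-1] for row in img]


def brightness(img, mag):
    """Scale intensities by (1 + mag) and clip to [0, 1]."""
    return [[min(1.0, max(0.0, p * (1.0 + mag))) for p in row] for row in img]


def invert(img, _mag):
    """Invert intensities (magnitude is ignored)."""
    return [[1.0 - p for p in row] for row in img]


def posterize(img, mag):
    """Quantise intensities; a larger magnitude keeps fewer levels."""
    levels = max(2, int(round(8 * (1.0 - mag))))
    return [[round(p * (levels - 1)) / (levels - 1) for p in row] for row in img]


# Hypothetical operation pools, chosen only to keep the example small.
BASIC_OPS = [hflip, brightness]
AUX_SUB_POLICIES = [
    [(invert, 0.3, 0.0), (brightness, 0.8, 0.5)],
    [(posterize, 0.6, 0.7), (hflip, 0.5, 0.0)],
]


def basic_augment(img, rng):
    """Run the basic operations in sequence, each with a random magnitude."""
    for op in BASIC_OPS:
        if rng.random() < 0.5:  # assumed application probability
            img = op(img, rng.uniform(-0.4, 0.4))
    return img


def auxiliary_augment(img, rng):
    """Pick one sub-policy and apply its operations with their probabilities."""
    for op, prob, mag in rng.choice(AUX_SUB_POLICIES):
        if rng.random() < prob:
            img = op(img, mag)
    return img


def three_views(img, rng):
    """Return two core views and one auxiliary view of the same image."""
    return basic_augment(img, rng), basic_augment(img, rng), auxiliary_augment(img, rng)
```

Because only one auxiliary view is produced per image, no positive pair is ever formed from two auxiliary views, which matches the restriction stated above.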
Instance discrimination with three views.
Instance discrimination Wu et al. (2018b) forces the model to learn to recognize different instances rather than different classes in the absence of any semantic labels. As a consequence, representations that capture the similarity and differences between instances can be learned. Although it is a task for discriminating instances, semantic insights can also be learned. Given three views for each image, the process of contrastive learning is as follows.
Suppose we have a minibatch of $N$ images at some time step. The L2-normalized feature embeddings of the $i$-th image are $z_i^{1}$, $z_i^{2}$, and $z_i^{3}$ ($i = 1, \dots, N$), respectively, where $z_i^{1}$ and $z_i^{2}$ are derived from the basic views, and $z_i^{3}$ is derived from the auxiliary view. As we use cosine similarity to measure the distance between feature vectors, it is essential to apply L2 normalization to scale the length of each vector to 1. Let $s(u, v) = u^\top v$ denote the cosine similarity between two L2-normalized feature vectors $u$ and $v$; the value of $s$ then ranges from -1 to 1. In order to scale up the range of similarity, we follow Wu et al. (2018b) and apply a temperature parameter $\tau$. With the above parameters, we propose a new loss function termed GNT-Xent (the gradient-stabilized and normalized temperature-scaled cross-entropy loss), whose name is inspired by NT-Xent Chen et al. (2020). The GNT-Xent loss for a positive pair is formulated as

$$\ell_{\mathrm{GNT}} = -\log \frac{\exp(s_p/\tau)}{\sum_{n} \exp(s_n/\tau)}, \qquad (1)$$

where $s_p$ and $s_n$ denote the similarity of a positive pair and of a negative pair, respectively, and the sum runs over the negative pairs (formal definitions are in Algorithm 1). The main difference from NT-Xent is that $\exp(s_p/\tau)$ is removed from the denominator. Detailed explanations are given in the next section.
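As an illustration for a single positive pair, the following minimal sketch computes NT-Xent and a GNT-Xent variant in which the positive term is dropped from the denominator, as described above. The function names, the toy vectors, and passing the negative similarities as a plain list are assumptions made for illustration; they are not taken from Algorithm 1 or from the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity, computed as a dot product of unit vectors."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))


def nt_xent(s_p, s_negs, tau=0.1):
    """NT-Xent for one positive pair: the positive term is in the denominator."""
    pos = math.exp(s_p / tau)
    neg = sum(math.exp(s / tau) for s in s_negs)
    return -math.log(pos / (pos + neg))


def gnt_xent(s_p, s_negs, tau=0.1):
    """GNT-Xent sketch: the denominator contains the negative terms only."""
    neg = sum(math.exp(s / tau) for s in s_negs)
    return -(s_p / tau) + math.log(neg)


# Toy example: one anchor, one positive view, two negatives.
anchor = [1.0, 0.0]
s_p = cosine(anchor, [1.0, 0.1])
s_negs = [cosine(anchor, [0.0, 1.0]), cosine(anchor, [-1.0, 0.0])]
loss_nt = nt_xent(s_p, s_negs)
loss_gnt = gnt_xent(s_p, s_negs)
```

Under these definitions NT-Xent is always positive, whereas the GNT-Xent value is not bounded below by 0, since its denominator no longer contains the positive term.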
Design considerations of loss function.
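In this section, we give a formal explanation of the reason for modifying NT-Xent. The original NT-Xent loss can be formulated as

$$\ell_{\mathrm{NT}} = -\log \frac{\exp(s_p/\tau)}{\exp(s_p/\tau) + \sum_{n} \exp(s_n/\tau)},$$

where $s_p$ and $s_n$ denote the similarity of a positive pair and of a negative pair, respectively, $\tau$ is the temperature parameter, and the sum runs over the negative pairs. The loss above aims to minimize $s_n$ and maximize $s_p$. Thus the loss approaches its limit value 0 from the right. Actually, this limit is not required for contrastive learning, because contrastive learning only focuses on assimilating positive pairs while dissimilating negative pairs.
Based on this consideration, we propose a modified contrastive loss, GNT-Xent, as formulated in Eq. (4),

$$\ell_{\mathrm{GNT}} = -\log \frac{\exp(s_p/\tau)}{\sum_{n} \exp(s_n/\tau)}. \qquad (4)$$

As can be seen, the difference is that we subtract the term of the positive pair from the denominator. Let us ignore the constant parameter $\tau$ and compute the gradients of NT-Xent: $\partial \ell_{\mathrm{NT}} / \partial s_p = -\frac{\sum_{n} \exp(s_n)}{\exp(s_p) + \sum_{n} \exp(s_n)}$ and $\partial \ell_{\mathrm{NT}} / \partial s_n = \frac{\exp(s_n)}{\exp(s_p) + \sum_{n'} \exp(s_{n'})}$. During training, the value of $\exp(s_p)$ keeps increasing rapidly, so the gradients with respect to $s_p$ and $s_n$ are reduced. As a result, the training process can be hampered in the later stages. By contrast, the gradients of GNT-Xent are not reduced by the growth of $s_p$. It is trivial to deduce that $\partial \ell_{\mathrm{GNT}} / \partial s_p = -1$ and $\sum_{n} \partial \ell_{\mathrm{GNT}} / \partial s_n = 1$, both of which are constant. Then the training process can be steady and continuous.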
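The gradient argument can be checked numerically. The sketch below uses central finite differences with the temperature ignored (set to 1), and writes the positive-pair similarity as `s_p` and the negative-pair similarities as a list `s_negs`. The helper names and the toy similarity values are assumptions for illustration only.

```python
import math


def nt_xent(s_p, s_negs):
    """NT-Xent with the temperature ignored (tau = 1)."""
    neg = sum(math.exp(s) for s in s_negs)
    return -math.log(math.exp(s_p) / (math.exp(s_p) + neg))


def gnt_xent(s_p, s_negs):
    """GNT-Xent sketch with the temperature ignored (tau = 1)."""
    return -s_p + math.log(sum(math.exp(s) for s in s_negs))


def grad_wrt_positive(loss, s_p, s_negs, eps=1e-6):
    """Central finite-difference gradient with respect to s_p."""
    return (loss(s_p + eps, s_negs) - loss(s_p - eps, s_negs)) / (2 * eps)


def grads_wrt_negatives(loss, s_p, s_negs, eps=1e-6):
    """Central finite-difference gradient with respect to each negative."""
    grads = []
    for k in range(len(s_negs)):
        up = list(s_negs)
        down = list(s_negs)
        up[k] += eps
        down[k] -= eps
        grads.append((loss(s_p, up) - loss(s_p, down)) / (2 * eps))
    return grads


s_negs = [0.1, -0.2, 0.3]
for s_p in (0.0, 0.9):
    # NT-Xent: the magnitude shrinks as the positive similarity grows.
    print("NT ", s_p, grad_wrt_positive(nt_xent, s_p, s_negs))
    # GNT-Xent: the gradient stays at -1 regardless of s_p.
    print("GNT", s_p, grad_wrt_positive(gnt_xent, s_p, s_negs))
```

In this sketch the GNT-Xent gradients with respect to the individual negatives form a softmax over the negatives, so they sum to 1 and do not depend on the positive similarity.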
The total loss of AAG consists of three components. As three positive pairs can be derived from three views for each image, each component aims to penalize for the prediction error of a positive pair. Here we have different treatments regarding the negative samples when calculating the losses. The major difference is that in the first component, we consider all the core-core pairs of views from different images; while in the second and third components, although the positive pairs contain an auxiliary view, we do not take auxiliary-auxiliary pairs of views from different images into consideration. The major reason is that the auxiliary augmentation has much randomness and has more chances to produce noisy samples compared to basic augmentation.
For training the model, we set the batch size to 128 and the number of epochs to 200, and use SGD with momentum. The weight decay parameter is 5and momentum is set to 0.9. The embedding feature size of the last layer is 128. The initial learning rate is 0.03. We replace the StepLR schedule with a CosineLR schedule Loshchilov and Hutter (2016) without restarts. In the GNT-Xent loss, we set the temperature parameter to 0.1 as suggested in Wu et al. (2018b). In the experiments except the ablation study, we adopt the policies provided by AutoAugment as the default auxiliary augmentation policies. Most of the experiments were implemented in PyTorch and run on a GeForce RTX 2080 Ti. The experiments with large batch sizes were performed on an Ascend 910 processor on Huawei Cloud.
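For reference, a cosine learning-rate schedule without restarts can be sketched as below. The initial learning rate of 0.03 and the 200 epochs come from the settings above; the function name, the per-epoch granularity, and the final learning rate of 0 are assumptions.

```python
import math


def cosine_lr(epoch, total_epochs=200, base_lr=0.03, min_lr=0.0):
    """Cosine-annealed learning rate for a given epoch, without restarts.

    The rate starts at `base_lr` for epoch 0 and decays along half a
    cosine period to `min_lr` at `total_epochs`.
    """
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rates at the start, middle, and end of training.
schedule = [cosine_lr(e) for e in (0, 100, 200)]
```

Compared with a step schedule, the decay is smooth, so there are no abrupt drops in the learning rate during training.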
We adopt two common methods to evaluate the performance of self-supervised learning, namely NN evaluation Cover and Hart (1967); Wu et al. (2018b) and linear evaluation Zhang et al. (2016); Bachman et al. (2019); Chen et al. (2020). The NN evaluation computes the low-dimensional features of the last layer and compares them against those of the training images in the memory bank, using cosine similarity. The top nearest neighbors are used to make the prediction. Linear evaluation freezes the parameters of the trained network and uses the features before the last layer to train a one-layer network. The accuracy of the one-layer network is regarded as the result of linear evaluation. We use the Adam optimizer with an initial learning rate of 0.01 to train the one-layer network for 50 epochs. The CosineLR schedule is also applied.
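The NN evaluation can be illustrated with a small weighted nearest-neighbour classifier over a feature bank. This is a sketch under stated assumptions: the exponential vote weighting is modeled on the weighted scheme of Wu et al. (2018b), and the function names, the neighbour count `k`, and the temperature value are hypothetical choices for the example.

```python
import math
from collections import defaultdict


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def weighted_knn_predict(query, bank_feats, bank_labels, k=3, tau=0.1):
    """Predict a label from the k most similar training features.

    Similarity is the cosine similarity; each of the top-k neighbours
    votes for its label with weight exp(similarity / tau).
    """
    q = l2_normalize(query)
    scored = []
    for feat, label in zip(bank_feats, bank_labels):
        f = l2_normalize(feat)
        scored.append((sum(a * b for a, b in zip(q, f)), label))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    votes = defaultdict(float)
    for sim, label in scored[:k]:
        votes[label] += math.exp(sim / tau)
    return max(votes, key=votes.get)


# Toy feature bank with two classes.
bank = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
prediction = weighted_knn_predict([1.0, 0.05], bank, labels)
```

The exponential weighting lets very close neighbours dominate the vote, so a single distant neighbour among the top k has little influence.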
|Super-AND (NN Eval)||89.2||75.6||61.5||42.7||94.9||91.9|
|AAG (NN Eval)||91.2||81.2||64.9||50.0||95.6||93.0|
|AAG (Linear Eval)||91.6||81.2||66.4||51.2||96.3||94.3|
|Method||Network||Batch Size||Linear Eval|
CIFAR10 Krizhevsky (2012) is a natural image dataset which contains 60000 color images of size $32 \times 32$ in 10 classes, among which 50000 images are for training and 10000 for testing. The image size of CIFAR100 Krizhevsky (2012) is the same as in CIFAR10, while CIFAR100 has 100 classes and each class has 600 images. In the SVHN dataset Netzer et al. (2011), the images are cropped patches of house numbers from Google Street View images, with 73257 images for training and 26032 for testing. The number of classes is 10 and the size of the images is also $32 \times 32$.
Results and Discussions
Evaluation for short-term training.
Table 1 reports the NN evaluation of six models including the proposed AAG while excluding Super-AND, because Super-AND was trained for 5 rounds with 200 epochs per round (i.e. a total of 1000 epochs, which is five times ours). For a fair comparison with Super-AND, we conduct another experiment shown in Table 2. From Table 1, we can see that without a memory bank, AAG still outperforms all baselines except for one case. We also find that AAG benefits more from a more complex CNN. As a result, without using a memory bank, our method outperforms the state-of-the-art methods under the same conditions with lower computational complexity.
Evaluation for long-term training.
Experimental results show that AAG is far from convergence under the condition of 200 epochs. Table 3 shows that Super-AND performs a little bit better in the early stage but AAG achieves higher accuracy by a large margin in the later stage. It suggests that the accuracy of AAG can be further improved by more training epochs. Thus, we retrain our AAG for 1000 epochs on the above datasets. The results are shown in Table 2. AAG performs well with a longer training time under both NN and linear evaluation.
Moreover, we experiment with different numbers of training epochs to investigate the impact of epoch number on the accuracy of NN and linear evaluation on CIFAR10. As Figure 3 shows, the accuracy is improved as the number of epochs increases, and the accuracy of linear evaluation is always higher than that of NN evaluation.
Investigation on batch size.
To examine the effect of batch size in AAG, we conduct an experiment on CIFAR10 with varying batch sizes (Figure 3). We scale the initial learning rate with the batch size, following Goyal et al. (2017). The number of training epochs is set to 200. For batch sizes larger than 256, we warm up the learning rate for 10 epochs. The curve shows that the optimal batch size is not the largest one. A possible reason is that as the batch size gets larger, more samples that should be positive are treated as negative pairs within a batch. Consequently, the overall performance declines.
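A batch-size-dependent learning rate with warm-up can be sketched as follows. The sketch assumes the linear scaling rule of Goyal et al. (2017) with the reference point of 0.03 at batch size 128 taken from the experimental settings, and a linear warm-up shape; the exact expression used in the experiments is not reproduced here, and the function names are hypothetical.

```python
def scaled_lr(batch_size, base_lr=0.03, base_batch_size=128):
    """Linear scaling rule: the learning rate grows in proportion to batch size."""
    return base_lr * batch_size / base_batch_size


def warmup_lr(epoch, target_lr, warmup_epochs=10):
    """Ramp the learning rate linearly up to `target_lr` over the warm-up epochs."""
    if epoch < warmup_epochs:
        return target_lr * (epoch + 1) / warmup_epochs
    return target_lr


def initial_lr_for(epoch, batch_size):
    """Peak learning rate at `epoch`, with warm-up only for batch sizes above 256."""
    target = scaled_lr(batch_size)
    if batch_size > 256:
        return warmup_lr(epoch, target)
    return target
```

For example, under these assumptions a batch size of 512 would use a peak learning rate of 0.12, reached after the 10 warm-up epochs.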
To verify our theoretical analysis on GNT-Xent, we conduct a comparison experiment between GNT-Xent and NT-Xent (results shown in Figure 4). We use the same settings for both loss functions. As the values of gradients are influenced by the batch size, what we care about is the trend of changes rather than the specific value.
As can be seen, the gradients of GNT-Xent with respect to the positive-pair and negative-pair similarities are constant during training, whereas those of NT-Xent decline sharply at the beginning and then go down steadily. As a result, the cosine similarity of GNT-Xent is higher than that of NT-Xent and the gap always exists. Furthermore, GNT-Xent achieves higher accuracy compared to NT-Xent. The results are reported in Table 4. Besides, to show that GNT-Xent does not simply speed up the learning process, we use a larger learning rate of 0.3 to retrain the model. Even though the performance of both loss functions gets worse, GNT-Xent still performs better than NT-Xent.
To further examine the efficacy of the proposed loss function, we conduct another comparison experiment with two existing methods Ye et al. (2019); Chen et al. (2020) which use NT-Xent as their loss function. As shown in Table 5, we replace their loss functions by GNT-Xent. The modified versions outperform original ones by a large margin.
In addition, to visualize the discriminative capacity of the learned features, we project the 128D features onto 2D space via the t-SNE algorithm Maaten and Hinton (2008). As Figure 6 shows, the features of GNT-Xent have a more separated distribution than those of NT-Xent. The results shown in Figure 5, i.e. the curves of NN accuracy and loss during the training, again demonstrate the advantages of GNT-Xent over NT-Xent.
Train with a bigger network.
The SOTA method AMDIM Bachman et al. (2019) achieves an accuracy of 91.2 on CIFAR10 with ResNet50 (25), while SimCLR outperforms AMDIM by using a larger batch size and longer training steps with a standard ResNet50. To compare with SimCLR, we use the same network with a 2-layer MLP as SimCLR does and train for 1000 epochs. Note that different from SimCLR, we use a batch size of 64 which allows training on a single GPU like RTX 2080Ti. Table 6 shows that our method outperforms SimCLR’s best result with batch size 1024. This result suggests that the batch size can be effectively reduced by using our method.
To verify the validity of our method and investigate the impact of different components, we perform an ablation study with the following variants of AAG,
i) Using only two basic views;
ii) Replacing the auxiliary view with a basic view;
iii) Replacing CosineLR schedule with StepLR schedule;
iv) Replacing GNT-Xent loss with NT-Xent loss;
v) Using RandAugment (2 random operations with a magnitude of 10) as the auxiliary augmentation approach.
Table 7 displays the NN evaluation results on CIFAR10 and reveals that each component of AAG has a positive effect on the model accuracy. Among these components, auxiliary data augmentation contributes most to the performance gain. It can also be observed that using the policies of RandAugment achieves performance close to that of using AutoAugment, indicating that the AAG method is relatively robust to the augmentation policies used to generate the auxiliary view. Thus, it is applicable to various datasets where the optimal policies are unavailable.
|Two Basic Views||85.8|
|Three Basic Views||86.9|
|with NT-Xent Loss||86.8|
In this paper, we focus on the augmentation-based self-supervised learning and develop a new method called AAG, which contains a new auxiliary augmentation scheme and a new GNT-Xent loss. The former introduces an auxiliary view in addition to the basic views to enhance the diversity of views and increase the number of data samples within a batch in the meantime. And the latter aims to achieve stable and efficient training. Both of these two components show advantages over their counterpart methods in the experiments. AAG improves the overall performance and works well in the condition of a small batch size which reduces space consumption, showing its great potential for unsupervised learning in computer vision tasks.
- Bachman et al. (2019). Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, pp. 15535–15545.
- Caron et al. (2018). Deep clustering for unsupervised learning of visual features. pp. 139–156.
- Chen et al. (2020). A simple framework for contrastive learning of visual representations. arXiv: Learning.
- Cover and Hart (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13 (1), pp. 21–27.
- Cubuk et al. (2019a). AutoAugment: learning augmentation strategies from data. pp. 113–123.
- Cubuk et al. (2019b). RandAugment: practical automated data augmentation with a reduced search space. arXiv: Computer Vision and Pattern Recognition.
- Doersch et al. (2015). Unsupervised visual representation learning by context prediction. pp. 1422–1430.
- Gidaris et al. (2018). Unsupervised representation learning by predicting image rotations.
- Goodfellow et al. (2014). Generative adversarial nets. pp. 2672–2680.
- Goyal et al. (2017). Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.
- Hadsell et al. (2006). Dimensionality reduction by learning an invariant mapping. 2, pp. 1735–1742.
- Han et al. (2020). A comprehensive approach to unsupervised embedding learning based on AND algorithm. arXiv: Learning.
- He et al. (2019). Momentum contrast for unsupervised visual representation learning. arXiv: Computer Vision and Pattern Recognition.
- He et al. (2016). Deep residual learning for image recognition. pp. 770–778.
- Huang et al. (2019). Unsupervised deep learning by neighbourhood discovery. pp. 2849–2858.
- Jain and Vailaya (1996). Image retrieval using color and shape. Pattern Recognition 29 (8), pp. 1233–1244.
- Jing and Tian (2019). Self-supervised visual feature learning with deep neural networks: a survey. arXiv: Computer Vision and Pattern Recognition.
- Krizhevsky et al. (2012). ImageNet classification with deep convolutional neural networks. pp. 1097–1105.
- Krizhevsky (2012). Learning multiple layers of features from tiny images. University of Toronto.
- Ledig et al. (2016). Photo-realistic single image super-resolution using a generative adversarial network. arXiv: Computer Vision and Pattern Recognition.
- Loshchilov and Hutter (2016). SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
- Lowe (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 (2), pp. 91–110.
- Maaten and Hinton (2008). Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
- Manjunath and Ma (1996). Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern Analysis and Machine Intelligence 18 (8), pp. 837–842.
- Netzer et al. (2011). The street view house numbers (SVHN) dataset.
- Noroozi and Favaro (2016). Unsupervised learning of visual representations by solving jigsaw puzzles. pp. 69–84.
- Pathak et al. (2016). Context encoders: feature learning by inpainting. pp. 2536–2544.
- Sivic and Zisserman (2003). Video Google: a text retrieval approach to object matching in videos. pp. 1470–1477.
- Wu et al. (2018a). Improving generalization via scalable neighborhood component analysis. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 685–701.
- Wu et al. (2018b). Unsupervised feature learning via non-parametric instance discrimination. pp. 3733–3742.
- Ye et al. (2019). Unsupervised embedding learning via invariant and spreading instance feature. pp. 6210–6219.
- Zhang et al. (2016). Colorful image colorization. In European Conference on Computer Vision, pp. 649–666.