Self-Supervised Learning for Fine-Grained Image Classification

07/29/2021 · by Farha Al Breiki, et al.





1 Abstract

Fine-grained image classification involves identifying different subcategories of a class which possess very subtle discriminatory features. Fine-grained datasets usually provide bounding box annotations along with class labels to aid the process of classification. However, building large-scale datasets with such annotations is a mammoth task. Moreover, this extensive annotation is time-consuming and often requires expertise, which is a huge bottleneck in building large datasets. On the other hand, self-supervised learning (SSL) exploits the freely available data to generate supervisory signals which act as labels. The features learnt by performing pretext tasks on huge unlabelled datasets prove to be very helpful for multiple downstream tasks.

Our idea is to leverage self-supervision such that the model learns useful representations of fine-grained image classes. We experimented with three kinds of models: Jigsaw solving as a pretext task, an adversarial learning model (SRGAN), and a contrastive learning model (SimCLR). The learned features are used for downstream tasks such as fine-grained image classification. Our code is available at

2 Dataset

We used the fine-grained cassava plant disease dataset. It consists of 12k unlabelled images and 6k labelled images in .jpg format, divided among five subcategories according to plant disease type: cbb (cassava bacterial blight), cbsd (cassava brown streak disease), cgm (cassava green mite), cmd (cassava mosaic disease) and healthy. The images vary in size, lighting, background, and resolution.

The unlabelled images were used during the SSL pretext task and labelled images were used during the downstream task. The labelled images had an imbalance in the number of images per class. This was handled by obtaining more images from another publicly available dataset and by using class weights. We randomly split the dataset into 80% training and 20% validation.
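The report does not specify the weighting scheme used to handle the imbalance; a common choice, sketched below in numpy, is the inverse-frequency ("balanced") heuristic also used by scikit-learn. The helper name and exact formula are our assumption, not taken from the project's code.

```python
import numpy as np

def class_weights(labels):
    """Inverse-frequency class weights: weight_c = N / (K * n_c),
    where N is the dataset size, K the number of classes, and n_c
    the count of class c. Rare classes get proportionally larger
    weights in the loss."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))
```

These weights would typically be passed to the training loss so that under-represented disease classes contribute more per sample.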

3 Related Work

Fine-grained image classification is a well-known problem in the computer vision domain. Deep neural networks have been used extensively for fine-grained classification; some of the most popular approaches include kernel pooling with CNNs  8099808 and boosted CNNs  BMVC2016_24. Over the years, several approaches have been developed, ranging from those requiring direct human supervision to weakly supervised ones. Methods like  7298658; 493; huang2015partstacked; 6618954; zhang2014partbased require a lot of part-based annotation for the datasets.

Recently, weakly supervised methods requiring only image level labels have gained a lot of popularity. There have been numerous attention-based approaches like  xiao2014application; 8237819. Moreover, DCL  8953746, and PMG  du2020finegrained methods have shown remarkable performance without explicit attention.

Self-supervised learning has seen a huge surge in recent years owing to the difficulties posed by supervised learning methods. Expensive annotations, generalization error, spurious correlations, and adversarial attacks are examples of the issues associated with supervised learning  jing2019selfsupervised. Self-supervised learning involves learning features by performing a pretext task. Pretext tasks such as generation-based, context-based, free semantic label-based, and cross modal-based tasks have been used to learn discriminatory features  jing2019selfsupervised; noroozi2017unsupervised; larsson2017colorization. These learned features are then utilized for the downstream task, which can be classification, detection, segmentation, or others.

According to  Liu_2021, self-supervised learning can be classified into three main categories: generative learning, contrastive learning, and generative-contrastive (adversarial) learning. The main differences between them lie in their architectures and objectives.

Generative self-supervised models can be autoregressive (AR) models, flow-based models, auto-encoding (AE) models, and hybrid generative models  Liu_2021. Generative models have been popular in the NLP domain for text classification. BERT, MADE, and CBOW are some of the well-known applications for generative-based models  devlin2019bert; germain2015made; mikolov2013efficient.

Contrastive learning has been extensively employed for self-supervision  chen2020simple; he2020momentum. Unlike generative models, contrastive models try to reduce the dissimilarity between augmentations of the same image  Liu_2021. Models such as Deep InfoMax, MoCo, and SimCLR have been used for self-supervised classification applications in  hjelm2019learning; he2020momentum; chen2020simple.

Adversarial learning combines some generative and contrastive learning features, as it learns to reconstruct the original data distribution by minimizing the distributional divergence. The original data can be reconstructed by feeding complete or partial input. Applications of adversarial learning with complete input are BiGAN and ALI  donahue2017adversarial; dumoulin2017adversarially, and for partial input, we have colorization, inpainting, and super-resolution.

Self-supervised learning for fine-grained classification has not been explored much yet. In this project, we tried a pretext task, adversarial learning, and contrastive learning for self-supervised fine-grained classification. We used Jigsaw solving as the pretext task. For adversarial learning, we used a generation-based (super-resolution) task based on SRGAN  ledig2017photorealistic. Further, for contrastive learning, we experimented with the SimCLR model using different augmentations such as patch swapping, coarse dropout, and jigsaw shuffling.

4 Method

4.1 Baseline model

We chose a weakly supervised fine-grained classification model - Fine-Grained Visual Classification via Progressive Multi-Granularity Training of Jigsaw Patches  du2020finegrained. It progresses from a high granularity to a low granularity, thus combining low-level specific patch details along with the high-level view of the entire input image. It uses image patches of increasing sizes to learn the characteristic features in a step-by-step manner. The ResNet model achieved an accuracy of 88% on the labelled images.

4.2 Jigsaw as Pretext Task

The idea of using Jigsaw solving as a pretext task  noroozi2017unsupervised comes from the observation that solving a Jigsaw puzzle requires not only observing the individual patches but also understanding the spatial relationships between them. This in turn demands learning specific discriminatory features of the patches that can help in solving the puzzle. Fine-grained classes usually have very subtle distinguishing features. Hence, we explored how well the model could learn these fine features through solving the Jigsaw puzzle.

The input image is split into a 3x3 grid, i.e. 9 patches, which are used to create a jigsaw. Permutations for shuffling the patches are generated such that the average Hamming distance between the permutations is maximized. Moreover, multiple jigsaws are generated for each input image. These measures ensure that the model does not learn any shortcuts to predict the right order, and they make every position equally likely for every patch. The model is tasked with predicting the right permutation that solves the jigsaw. The learnt features were then used for downstream fine-grained classification. We fine-tuned using different numbers of layers; the results are summarized in Table 1. A downstream accuracy of 67% was obtained.
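The permutation set with a large average Hamming distance can be chosen greedily, as in Noroozi and Favaro's generation scheme. The sketch below is a simplified version of that idea: it samples a random candidate pool instead of enumerating all 9! orderings, so the pool size and greedy max-min criterion are our assumptions.

```python
import numpy as np

def select_permutations(n_perms, n_tiles=9, pool=2000, rng=None):
    """Greedily pick tile permutations that are far apart in Hamming
    distance, so no two target orderings look alike to the model."""
    rng = np.random.default_rng(0) if rng is None else rng
    cand = np.array([rng.permutation(n_tiles) for _ in range(pool)])
    chosen = [cand[0]]
    for _ in range(n_perms - 1):
        # distance of every candidate to its nearest already-chosen permutation
        d = np.min([(cand != p).sum(axis=1) for p in chosen], axis=0)
        chosen.append(cand[int(np.argmax(d))])  # most distant candidate wins
    return np.array(chosen)
```

Each selected permutation then becomes one class of the pretext prediction problem.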

4.3 Contrastive Learning: self-supervised learning using SimCLR

SimCLR is a self-supervised learning algorithm based on contrastive loss. It generates two augmented views of each image and ensures that the learned representations of those two views are close to each other in the feature space but far away from the representations of the other images in the batch. To achieve this, SimCLR uses NT-Xent, the Normalized Temperature-scaled Cross Entropy loss. Let z_i and z_j be the feature representations of the two augmented views of the same image. The loss function for a positive pair of examples (i, j) is defined as:

    l(i, j) = -log [ exp(sim(z_i, z_j) / τ) / Σ_{k=1}^{2N} 1[k ≠ i] exp(sim(z_i, z_k) / τ) ]

where sim(u, v) is the cosine similarity, τ is the temperature parameter, N is the batch size (giving 2N augmented views), and 1[k ≠ i] is an indicator that equals 1 iff k ≠ i.
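A minimal numpy sketch of the NT-Xent loss, assuming the batch is laid out so that rows 2m and 2m+1 are the two views of image m (SimCLR's own implementation differs in layout and runs on GPU tensors):

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent over 2N embeddings; rows 2m and 2m+1 are positives."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine-similarity prep
    n2 = z.shape[0]
    sim = (z @ z.T) / tau                              # scaled pairwise similarities
    np.fill_diagonal(sim, -np.inf)                     # drop k == i from the denominator
    pos = np.arange(n2) ^ 1                            # partner index: (0,1), (2,3), ...
    log_prob = sim[np.arange(n2), pos] - np.log(np.exp(sim).sum(axis=1))
    return float(-log_prob.mean())
```

Lower values mean the positive pairs are already close and the negatives far apart in the embedding space.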
The default augmentations used in SimCLR are random resize cropping, color jittering, random horizontal flip, random grayscale and Gaussian blurring. Considering the success of SimCLR, we decided to experiment and observe the performance of the model on fine-grained datasets.

We used the default augmentations of SimCLR (mentioned above), except Gaussian blurring, as it would also blur the discriminatory spots that the model needs to learn. After the contrastive learning phase, we performed the downstream classification task and achieved an accuracy of 73%. The results have been summarized in Table 2, with a sample Grad-CAM shown in Figure 3(b). It can be inferred from the Grad-CAM that the model used the spots on the leaf to make the class prediction, i.e. the model was able to localize the fine-grained features to a decent extent. This motivated us to experiment with some modifications to the SimCLR pipeline so that the fine-grained discriminatory features may be learnt better.

4.3.1 Gamma transform

The first image-level augmentation that we used is a random gamma transform from the albumentations library  Buslaev_2020. This transform changes the brightness of the image as shown in Figure 0(c). The gamma level is sampled randomly within a range of 50-250, producing brighter or darker variants of the image. The gamma transform was performed before the default SimCLR augmentations.
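A numpy sketch of such a random gamma adjustment; we assume albumentations' convention of dividing the sampled level by 100 to obtain the gamma exponent, and the function name is ours:

```python
import numpy as np

def random_gamma(img, low=50, high=250, rng=None):
    """Randomly re-map pixel intensities with a gamma curve.

    The sampled level is divided by 100 (albumentations-style), so
    levels in [50, 250] give exponents in [0.5, 2.5]; exponents below
    1 brighten midtones and exponents above 1 darken them."""
    rng = np.random.default_rng() if rng is None else rng
    gamma = rng.uniform(low, high) / 100.0
    out = 255.0 * (img.astype(np.float64) / 255.0) ** gamma
    return np.rint(out).astype(np.uint8)
```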

NOTE: In the default augmentations, random resize cropping was used. We realized that this may not be suitable because the random crop may or may not contain the essential fine-grained regions. Hence, we replaced it with resize for the following augmentations.

4.3.2 Coarse Dropout

Coarse dropout was the second image-level augmentation applied from the albumentations library  Buslaev_2020. This transform randomly removes squares from the input image; the size of the squares ranges from 10 to 25 pixels. Figure 0(b) shows an example of the output of the augmentation. Coarse dropout is generally used to prevent overfitting on image data. Hence, this motivated us to check whether the model was overfitting and to verify whether it could still learn the features despite the removed regions. We made sure that the fine-grained regions are not completely lost under this transform by controlling the size and number of squares.
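A minimal numpy version of the transform; the hole count and fill value below are our assumptions (albumentations exposes them as parameters):

```python
import numpy as np

def coarse_dropout(img, n_holes=8, min_size=10, max_size=25, fill=0, rng=None):
    """Zero out a few random squares, sized 10-25 px as in our setup.

    Keeping n_holes and max_size small makes it unlikely that the
    fine-grained spots are wiped out entirely."""
    rng = np.random.default_rng() if rng is None else rng
    out = img.copy()
    h, w = out.shape[:2]
    for _ in range(n_holes):
        s = int(rng.integers(min_size, max_size + 1))
        y = int(rng.integers(0, max(1, h - s)))
        x = int(rng.integers(0, max(1, w - s)))
        out[y:y + s, x:x + s] = fill
    return out
```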

4.3.3 Random Patch Swapping

In order to predict the correct class label, it is important that the model is able to localize the fine-grained features. Thus, the model should be able to identify those discriminatory features irrespective of their position. This intuition prompted us to use a simple random patch swapping component: two random patches of size 200x200 are extracted and swapped in every image. Training the model thus brings the representations of the original image (Figure 0(a)) and the image with random patches swapped (Figure 0(d)) closer in the embedding space.
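The component can be sketched as follows; copying both patches before writing avoids corruption if the two random windows overlap (whether the original implementation forces non-overlapping patches is not stated, so this is an assumption):

```python
import numpy as np

def random_patch_swap(img, patch=200, rng=None):
    """Swap two randomly located patch x patch windows of the image."""
    rng = np.random.default_rng() if rng is None else rng
    out = img.copy()
    h, w = out.shape[:2]
    (y1, x1), (y2, x2) = [(int(rng.integers(0, h - patch + 1)),
                           int(rng.integers(0, w - patch + 1))) for _ in range(2)]
    p1 = out[y1:y1 + patch, x1:x1 + patch].copy()
    p2 = out[y2:y2 + patch, x2:x2 + patch].copy()
    out[y1:y1 + patch, x1:x1 + patch] = p2
    out[y2:y2 + patch, x2:x2 + patch] = p1
    return out
```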

4.3.4 Random Jigsaw Shuffling

Our baseline model  du2020finegrained, the Progressive Multi-Granularity (PMG) model for fine-grained classification, uses multi-granularity Jigsaw shuffling to help improve feature learning. Drawing inspiration from this idea, we incorporated random Jigsaw shuffling into the SimCLR pipeline along with the existing augmentations. A random permutation of [0, 1, 2, ..., n^2 - 1] is generated, where n is the granularity of the Jigsaw. The image is divided into n x n uniform partitions, which are shuffled according to the generated permutation.
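The shuffling step can be sketched in numpy as below, assuming the image dimensions are divisible by the granularity n:

```python
import numpy as np

def jigsaw_shuffle(img, n=4, rng=None):
    """Cut img into an n x n grid of tiles and reassemble them in a
    random order, as in the random Jigsaw augmentation."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape[:2]
    th, tw = h // n, w // n
    tiles = [img[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
             for r in range(n) for c in range(n)]
    perm = rng.permutation(n * n)           # random permutation of [0 .. n^2 - 1]
    out = np.empty_like(img)
    for dst, src in enumerate(perm):
        r, c = divmod(dst, n)
        out[r * th:(r + 1) * th, c * tw:(c + 1) * tw] = tiles[src]
    return out
```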

We performed two such variations. The first comprised the original image and a randomly-shuffled Jigsaw puzzle (4x4) image; this achieved a downstream task accuracy of 69.5%. The second variation involved two randomly-shuffled Jigsaw puzzles with (4x4) and (2x2) granularities; this achieved a downstream task accuracy of 68.7%. The representations of these pairs of images (Figure 0(f)) are learnt to be closer in the embedding space.

4.3.5 DCL-based Jigsaw shuffling

From the results shown in Table 2, it can be observed that random Jigsaw shuffling did not perform very well. This can be attributed to the fact that during the Jigsaw formation (depending on the granularity), some of the fine-grained regions might be chopped and shuffled such that they are no longer identifiable. To handle this scenario, another algorithm from the fine-grained domain provided a plausible solution: DCL  8953746. It suggests that shuffling within the local neighborhood of the image can help preserve the fine-grained regions to a large extent. The algorithm is summarized in Algorithm 1.

1:procedure DclJigsaw(img, N, k)
2:     Uniformly partition img into N x N regions. The region in row j and column i is denoted as R(i, j).
3:     for every row j in R do
4:         Generate a random vector q(j) such that q(j, i) = i + r, where r ~ U(-k, k), 1 <= k < N
5:         Generate a new permutation sigma_row(j) of the regions in row j by sorting the array q(j), verifying: |sigma_row(j)(i) - i| < 2k
6:     Similarly, perform the above operation for each column i in R, obtaining sigma_col(i) such that: |sigma_col(i)(j) - j| < 2k
7:     The region R(i, j) in the original location is mapped to the new coordinate (sigma_row(j)(i), sigma_col(i)(j))
8:     Perform shuffling of img based on the new coordinate mapping.
9:     return img
Algorithm 1 DCL Jigsaw shuffling

The representations of the original image and DCL-based Jigsaw pair Figure 0(g) are learnt to be closer in the embedding space. We experimented with different granularities of Jigsaw (i.e. 3x3, 5x5, 7x7), out of which 3x3 Jigsaw showed the best accuracy.
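A numpy sketch of the neighborhood-constrained shuffle: we apply the row-wise permutations first and the column-wise permutations second, which keeps the mapping bijective; DCL's exact composition of the two steps may differ, so treat this as an approximation.

```python
import numpy as np

def neighbor_perm(n, k, rng):
    """Permutation of range(n) obtained by sorting indices perturbed
    with uniform noise in [-k, k]; each index moves less than 2k slots."""
    return np.argsort(np.arange(n) + rng.uniform(-k, k, size=n))

def dcl_jigsaw(img, n=3, k=1, rng=None):
    """Shuffle n x n tiles only within a small local neighborhood,
    preserving fine-grained regions better than a global shuffle."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape[:2]
    th, tw = h // n, w // n
    tiles = np.array([[img[j * th:(j + 1) * th, i * tw:(i + 1) * tw]
                       for i in range(n)] for j in range(n)])
    for j in range(n):                       # local shuffle within each row
        tiles[j] = tiles[j][neighbor_perm(n, k, rng)]
    for i in range(n):                       # then within each column
        tiles[:, i] = tiles[:, i][neighbor_perm(n, k, rng)]
    return np.block([[tiles[j, i] for i in range(n)] for j in range(n)])
```

Small k keeps every tile near its original position, so local spot patterns survive the shuffle.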

(a) Input image
(b) Coarse dropout
(c) Gamma transform
(d) Patch swap
(e) fine-grained region
(f) Random Jigsaw
(g) DCL Jigsaw
Figure 1: Augmentations experimented on SimCLR as explained in Section 4.3

4.3.6 Fine-grained region cropping

Using bounding box annotations for fine-grained regions has been very popular for fine-grained datasets. Using this as the intuition, we tried to form image pairs involving the original image and a crop of the most important fine-grained region in the image, which are expected to result in representations that are closer in the embedding space. We carried out this experiment on a small scale by generating bounding boxes for 200 images (40 images for each class). The results looked promising, and hence we tried to perform the same on the larger dataset.

As the dataset did not have any bounding box annotations, we employed a smartcropping technique to obtain the fine-grained regions. Smartcrop is a way of intelligently determining the most important part of the image and keeping it in focus while cropping. The tool was able to approximately localize the fine-grained regions to a good extent. The smartcrop algorithm can be summarized by the following steps:

1. Find edges using a Laplace filter
2. Find and boost regions high in saturation
3. Generate a set of candidate crops using a sliding window
4. Rank them using an importance function that focuses detail in the center and avoids it at the edges
5. Output the candidate crop with the highest rank

We directly used a smartcropping package available on GitHub  SmartCrop. After localizing the fine-grained regions, we overlaid them against a white background (Figure 0(e)) to maintain the image size.
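A simplified grayscale sketch of steps 1, 3, 4, and 5 above (saturation boosting is skipped, and the Laplacian kernel, raised-sine center weighting, and stride are our choices, not necessarily SmartCrop's):

```python
import numpy as np

def laplacian(gray):
    """Step 1: 4-neighbour Laplacian edge energy, via array shifts."""
    p = np.pad(gray.astype(np.float64), 1, mode="edge")
    return np.abs(p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2]
                  + p[1:-1, 2:] - 4.0 * p[1:-1, 1:-1])

def best_crop(gray, size, stride=8):
    """Steps 3-5: slide a size x size window, score it by edge energy
    weighted towards the window centre, and return the best corner."""
    energy = laplacian(gray)
    wy = np.sin(np.linspace(0.0, np.pi, size))   # ~0 at edges, 1 at centre
    weight = np.outer(wy, wy)
    best, best_score = (0, 0), -1.0
    h, w = gray.shape
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            score = float((energy[y:y + size, x:x + size] * weight).sum())
            if score > best_score:
                best, best_score = (y, x), score
    return best
```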

4.4 Scaling: Self-supervised learning using SRGAN

Scaling was chosen as a pretext task to allow the model to learn the finer features of the images, since the dataset images were obtained using phone cameras of varied quality and resolutions. In  ledig2017photorealistic, Ledig et al. developed a super-resolution generative adversarial network (SRGAN) to create high-resolution images. The network downscales the images by a factor of four using a bicubic kernel and regenerates the images at two, four, and eight times the resolution.

Following the work for the midterm where we investigated the potential of the SRGAN discriminator as a feature extractor, we now proceeded to replicate the same procedure with the SRGAN generator. The intuition behind this approach is that as the network upsamples an input image, it also learns the finer features of the image that are important to distinguish the fine-grained classes of the dataset.

Traditionally, an MSE content loss has been calculated to ensure the preservation of pixel-level content (eq. 2)  dong2015image; shi2016realtime:

    l_MSE = (1 / (r^2 W H)) Σ_{x=1}^{rW} Σ_{y=1}^{rH} ( I_HR(x, y) - G(I_LR)(x, y) )^2        (2)

However, the resulting upsampled image tends to be overly smoothed and lacks finer details.  ledig2017photorealistic introduced a VGG content loss instead, where the focus of the loss function optimization is shifted from the pixel space to the feature space to ensure high-level content preservation. This is achieved by calculating the MSE between the feature maps of the generated images G(I_LR) and those of the reference images I_HR, where phi_{i,j} denotes the feature map obtained after the j-th convolution before the i-th maxpooling layer of the pre-trained VGG network (eq. 3):

    l_VGG/i,j = (1 / (W_{i,j} H_{i,j})) Σ_{x=1}^{W_{i,j}} Σ_{y=1}^{H_{i,j}} ( phi_{i,j}(I_HR)(x, y) - phi_{i,j}(G(I_LR))(x, y) )^2        (3)

They eventually introduced a perceptual loss l_SR (eq. 4), defined as the weighted sum of the content loss l_X and the adversarial loss l_Gen:

    l_SR = l_X + 10^-3 · l_Gen        (4)


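The loss composition of eqs. 2-4 can be sketched as below; the feature arrays stand in for the phi_{i,j} VGG activations, which in practice come from a pre-trained network:

```python
import numpy as np

def mse_content_loss(hr, sr):
    """Pixel-space MSE content loss (eq. 2)."""
    return float(np.mean((hr - sr) ** 2))

def perceptual_loss(feat_hr, feat_sr, adv_loss):
    """Perceptual loss (eq. 4): VGG-feature MSE content term (eq. 3)
    plus the adversarial term weighted by 10^-3."""
    content = float(np.mean((feat_hr - feat_sr) ** 2))
    return content + 1e-3 * adv_loss
```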
The SRGAN generator consists of two main components: a ResNet backbone and an upsampling block. The ResNet backbone consists of 16 identical residual blocks (each a convolution, batch normalization, parametric ReLU sequence) and a bypass skip connection that relieves the network from modeling the identity transformation. The upsampling block consists of a convolution, a pixel shuffler, and a parametric ReLU (Figure 2).


The pixel shuffler is the main operation responsible for the upsampling. It acts as a deconvolution operation that reverses the convolution transformation to produce a higher-resolution output. This is done through a periodic reshuffling of the lower-resolution feature maps into a higher-resolution output, essentially going from a tensor of shape (C · r^2, H, W) to a tensor of shape (C, r · H, r · W), where r is the upscaling factor (Figure 3).
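The reshuffling is a pure tensor rearrangement, matching the sub-pixel convolution of Shi et al.; a numpy sketch for a single (C · r^2, H, W) input:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C*r^2, H, W) into (C, r*H, r*W): each group of r^2
    channels is interleaved into an r x r spatial neighbourhood."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    # split channels into (C, r, r), then weave them into the spatial axes
    return x.reshape(c, r, r, h, w).transpose(0, 3, 1, 4, 2).reshape(c, h * r, w * r)
```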

In our experiments, we retrained the model using the cassava dataset and reconstructed the images at the original resolution (i.e. four times upscaling). To leverage the power of self-supervised learning, we included all 12k unlabelled images in our training. We then isolated the generator from the network, froze its weights, and attached a classifier outputting five classes, trained with Adam with L2 regularization and a log-softmax output. Following the experimental procedure we performed with the discriminator, we changed the input image sizes, reduced the depth of the ResNet backbone, and added dropouts to the fully connected layer of the classifier to evaluate the model performance. The experiments were performed in progressive order such that the depth of the architecture was reduced at the best-accuracy input image size, and the dropout was introduced at the best-accuracy depth. The results are summarized in Table 3.


Figure 2: SRGAN architecture. The generator takes a low-resolution input through ResNet16 and upsampling block pipeline to output a high resolution image (top). The discriminator tries to distinguish the generated super-resolution image from the original high resolution image by passing them through a VGG backbone architecture (bottom). Reprinted from  8917633.
Figure 3: Upsampling pixel shuffle operation that converts a tensor of shape (C · r^2, H, W) to a tensor of shape (C, r · H, r · W), where r is the upscaling factor.

5 Results and Discussion

We chose accuracy as our primary metric of evaluation, as it is generally used for evaluating performance on fine-grained data. We have also included alternative metrics, including precision, recall, and F1 scores, in the Appendix for a more comprehensive evaluation.

For qualitative evaluation, we ran gradient-weighted class activation maps (Grad-CAM) as a visualization tool of our models. Grad-CAM provides a way of debugging the model and visually validating that it is “looking” and “activating” at the correct locations of an image. Grad-CAM works by finding a specified convolutional layer in the network and examining the gradient information flowing into that layer. We used Grad-CAM to visualize the regions the model was using to make a class prediction. Some Grad-CAM images are shown in Figures 4 and 5.

Table 1: Experimenting with various parameters
Parameter                               Value      Classification accuracy
No. of permutations for Jigsaw task     100        0.658
                                        200        0.676
Image size                              256 x 256  0.664
                                        550        0.673
No. of last layers used for finetuning  2          0.661
                                        5          0.687

5.1 Jigsaw as pretext task

It can be seen from Table 1 that the accuracy improved slightly with an increase in the number of permutations. Using a larger image size was also beneficial. Fine-tuning from more layers also helped increase accuracy. While the model performs well on the pretext task, it is unable to perform similarly on the downstream task and reaches an accuracy of around 68%. This could be because the model could not learn the distinguishing features from the patches which looked very similar.

5.2 SimCLR

The effect of different augmentations introduced in SimCLR has been summarized in Table 2. The random gamma transform resulted in a slightly better performance (73.8%) than the original SimCLR model (73%). This can be attributed to the fact that gamma transform increased the contrast in images, thus making the fine-grained spots more prominent. The Grad-CAM image on Figure 3(g) showed that the model was able to localize the spots on the leaves without getting confused with the background.

Using coarse dropout transform dropped the accuracy significantly to 48%. Even though we ensured the fine-grained regions are not completely covered by the squares, the model was not able to localize the important spots from the unobstructed regions of the leaves (Figure 3(h)).

With random patch swapping, the model was localizing the spots on the leaf but at the same time, it was also using the unimportant background elements (Figure 3(c)). Accuracy of 53% was obtained.

We observed that the model was able to localize the spots despite the slight distortion introduced due to random patch swapping. Hence, we increased the randomness by introducing a Jigsaw shuffling. As observed from Figure 3(d), the model was not very confident about the spots on the leaves and was also highlighting some features from the background. It was able to achieve an accuracy of only 69.5%.

With the intuition that random chopping and shuffling can make the fine-grained regions unidentifiable, we experimented with DCL-based  8953746 Jigsaw shuffling. As can be observed from Figure 3(f), the model was very confident and was able to localize the spots on the leaves, yet it was only able to reach an accuracy of 71.1%.

Inspired by the idea that fine-grained datasets usually provide bounding boxes to localize the features, we experimented with using the original image and the fine-grained region as a positive pair. As observed from Figure 3(e), the model was unable to localize the discriminatory spots and was highlighting background elements while reaching an accuracy of only 54%. Building upon this, we thought it would be interesting to see the effect of retaining the discriminatory fine-grained region and shuffling the remaining image region to induce randomness. This resulted in further deterioration of the downstream classification accuracy to 51%. It might be the case that the model was using the leaf boundaries to locate the spots, while with the smartcropping most of the images were zoomed onto the spots and the leaf structure was lost. We believe SimCLR has a lot of potential due to its simple and intuitive nature and more can be explored.

Table 2: Downstream validation accuracy using SimCLR.
Model Accuracy
Original 0.730
Coarse dropout 0.480
Gamma transform 0.738
Random patch swapping 0.530
Random Jigsaw (4x4) 0.695
DCL-based Jigsaw (3x3) 0.711
Fine-grained region crop 0.548
  • We used τ = 0.5 and a batch size of 64 in the pretext task.

(a) Input image
(b) SimCLR original
(c) Patch swapping
(d) Random Jigsaw
(e) Fine-grained crop
(f) DCL 3x3 Jigsaw
(g) Gamma transform
(h) Coarse dropout
Figure 4: SimCLR Grad-CAM images

5.3 SRGAN

For the SRGAN generator, the best-performing model used an input image size of 128 x 128 (Table 3). Our results showed roughly the same accuracy when no residual blocks were removed (64.5%) and when two residual blocks were removed (64.2%). We chose the simpler model with two residual blocks removed to perform the subsequent experiments. Our results showed a dramatic increase in accuracy (up to 83.3%) when dropouts were introduced in the classifier, with the best-performing model using a dropout of 0.5.

Comparing the results of the generator and discriminator, we found that the discriminator initially performed better than the generator when changing the input image size and reducing the number of identical blocks from the baseline architecture. However, once dropouts were introduced, the generator significantly outperformed the discriminator.

Table 3: Downstream validation accuracy using SRGAN
Parameter              Value                      Generator* accuracy  Discriminator** accuracy
Image size             Original (crop 88 x 88)    0.612                0.704
                       256 x 256 (crop 88 x 88)   0.621                0.713
                       128 x 128 (crop 88 x 88)   0.645                0.741
                       88 x 88 (no cropping)      0.619                0.645
Depth of architecture  Remove the last / blocks   0.642                0.744
                       Remove the last / blocks   0.635                0.715
                       Remove the last / blocks   0.628                0.701
Dropout                0.5                        0.833                0.732
                       0.7                        0.800                0.746
                       0.9                        0.824                0.748
  • The depth of ResNet was reduced in increments of two for the generator.

  • The depth of VGG was reduced in increments of one for the discriminator.

(a) Gen. (original)
(b) 128 x 128
(c) Remove 2 blocks
(d) Dropout 0.5
(e) Discr. (original)
(f) 128 x 128
(g) Remove last block
(h) Dropout 0.9
Figure 5: SRGAN Grad-CAM images for the best-performing models for the different experiments using the generator (top) and discriminator (bottom). (a) and (e) are the results of the original generator and discriminator, respectively, using the default parameters; (b) and (f) are the results for the best (i.e. highest accuracy) input image sizes; (c) and (g) are the results for the best baseline model depth; (d) and (h) are the results for the best dropout hyperparameter.

We noted an interesting observation in the Grad-CAM images (Figure 5). Before dropouts were added, the discriminator was still able to learn and roughly base its classification decision on certain spots of the leaves, while the generator was not able to distinguish the leaves from the background at all. Once dropouts were added, both the generator and discriminator exhibited an improvement in performance, with the generator surpassing the discriminator and accurately determining the correct identifying features of the cassava leaves. This suggests that the models (particularly the generator) have great potential as feature extractors, but they need to be used in conjunction with a regularization technique such as dropout to separate the unimportant background features from the important ones.

Figure 6: Success and failure cases of our best performing model. (a) and (c) are the original leaf images. (b) is an example of a success case where the model was clearly able to distinguish between the different leaf types. (d) is an example of a failure case where, in the presence of different-shaded cassava leaves, the model was probably not able to use color as a distinguishing factor.

6 Conclusion

Fine-grained classification is an important problem owing to its many real-world applications. However, large annotated datasets are very expensive to obtain for training models. Hence, self-supervised learning was explored as a possible solution to exploit freely available unlabelled fine-grained images. We experimented with Jigsaw as a pretext task, the SRGAN generator, and SimCLR.

Comparing the best models for the different pretext tasks, we found that the Jigsaw task performed most poorly, followed by SimCLR, the SRGAN discriminator, and the SRGAN generator. We achieved a highest downstream classification accuracy of 83% (compared to the 88% supervised fine-grained baseline model). We thus see a promising avenue for self-supervised learning as a learning mechanism.

Our further investigation into the successes and failures of the models reveals that our models were likely able to learn the texture and shape of the leaves, but were probably not able to learn color as a distinguishing factor between the classes (Figure 6). For future research, we would be interested in exploring color as a pretext task. We would also consider repeating this experiment on other fine-grained datasets to verify our findings.