Fine-grained image classification involves identifying subcategories of a class that differ only in very subtle discriminatory features. Fine-grained datasets usually provide bounding box annotations along with class labels to aid classification. However, such extensive annotation is time-consuming and often requires expertise, which is a major bottleneck in building large-scale datasets. Self-supervised learning (SSL), on the other hand, exploits freely available data to generate supervisory signals that act as labels. The features learnt by performing pretext tasks on large amounts of unlabelled data prove to be very helpful for multiple downstream tasks.
Our idea is to leverage self-supervision so that the model learns useful representations of fine-grained image classes. We experimented with three kinds of models: Jigsaw solving as a pretext task, adversarial learning (SRGAN), and contrastive learning (SimCLR). The learned features are used for downstream tasks such as fine-grained image classification. Our code is available at https://github.com/rush2406/Self-Supervised-Learning-for-Fine-grained-Image-Classification.
We used the fine-grained cassava plant disease dataset available at https://www.kaggle.com/c/cassava-disease/data. It consists of 12k unlabelled images and 6k labelled images in .jpg format, divided among five subcategories according to plant disease type: cbb (cassava bacterial blight), cbsd (cassava brown streak disease), cgm (cassava green mite), cmd (cassava mosaic disease) and healthy. The images are varied in size, lighting, background, and resolution.
The unlabelled images were used during the SSL pretext task and labelled images were used during the downstream task. The labelled images had an imbalance in the number of images per class. This was handled by obtaining more images from another publicly available dataset and by using class weights. We randomly split the dataset into 80% training and 20% validation.
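The report does not specify the exact weighting scheme used to counter the class imbalance; a minimal sketch of the common inverse-frequency ("balanced") heuristic, with `class_weights` as an illustrative name, could look like this:

```python
import numpy as np

def class_weights(labels, n_classes=5):
    """Inverse-frequency ('balanced') weights: n_samples / (n_classes * count_c).
    Rare classes get weights above 1, common classes below 1."""
    counts = np.bincount(labels, minlength=n_classes)
    return len(labels) / (n_classes * np.maximum(counts, 1))
```

Such weights would typically be passed to the classification loss so that mistakes on under-represented disease classes cost proportionally more.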
3 Related Work
Fine-grained image classification is a well-known problem in the computer vision domain, and deep neural networks have been used extensively for it. Some of the most popular approaches include kernel pooling with CNNs 8099808 and boosted CNNs BMVC2016_24. Over the years, approaches have evolved from requiring direct human supervision to weak supervision. Methods like 7298658; 493; huang2015partstacked; 6618954; zhang2014partbased require extensive part-based annotation of the datasets.
Recently, weakly supervised methods requiring only image level labels have gained a lot of popularity. There have been numerous attention-based approaches like xiao2014application; 8237819. Moreover, DCL 8953746, and PMG du2020finegrained methods have shown remarkable performance without explicit attention.
Self-supervised learning has received a huge surge in recent years owing to the difficulties posed by supervised learning methods. Expensive annotations, generalization error, spurious correlations, and adversarial attacks are examples of the issues associated with supervised learning jing2019selfsupervised. Self-supervised learning involves learning features by performing a pretext task. Pretext tasks like generation-based, context-based, free semantic label-based and cross modal-based have been used to learn discriminatory features jing2019selfsupervised; noroozi2017unsupervised; larsson2017colorization. These learned features are utilized to complete the downstream task, which can be classification, detection, segmentation and others.
According to Liu_2021, self-supervised learning can be classified into three main categories: generative learning, contrastive learning, and generative-contrastive (adversarial) learning. The main differences between them lie in their architectures and objectives.
Generative self-supervised models can be autoregressive (AR) models, flow-based models, auto-encoding (AE) models, and hybrid generative models Liu_2021. Generative models have been popular in the NLP domain for text classification. BERT, MADE, and CBOW are some of the well-known applications for generative-based models devlin2019bert; germain2015made; mikolov2013efficient.
Contrastive learning has been extensively employed for self-supervision chen2020simple; he2020momentum. Unlike generative models, contrastive models try to reduce the dissimilarity between augmentations of the same image Liu_2021. Models such as Deep InfoMax, MoCo, and SimCLR have been used for self-supervised classification applications hjelm2019learning; he2020momentum; chen2020simple.
Adversarial learning combines features of generative and contrastive learning, as it learns to reconstruct the original data distribution by minimizing the distributional divergence. The original data can be reconstructed from either complete or partial input. Applications of adversarial learning with complete input include BiGAN and ALI donahue2017adversarial; dumoulin2017adversarially.
Self-supervised learning for fine-grained classification has not been explored much yet. In this project, we tried a pretext task, adversarial learning, and contrastive learning for self-supervised fine-grained classification. We used Jigsaw solving as the pretext task. For adversarial learning, we used a generation-based (super-resolution) task based on SRGAN ledig2017photorealistic. For contrastive learning, we experimented with the SimCLR model using different augmentations such as patch swapping, coarse dropout, and Jigsaw shuffling.
4.1 Baseline model
We chose a weakly supervised fine-grained classification model: Fine-Grained Visual Classification via Progressive Multi-Granularity Training of Jigsaw Patches du2020finegrained. It progresses from a high granularity to a low granularity, thus combining low-level patch-specific details with a high-level view of the entire input image. It uses image patches of increasing sizes to learn the characteristic features in a step-by-step manner. This model, with a ResNet backbone, achieved an accuracy of 88% on the labelled images.
4.2 Jigsaw as Pretext Task
The idea of using Jigsaw solving as a pretext task noroozi2017unsupervised comes from the observation that solving a Jigsaw puzzle not only requires observing the individual patches but also understanding the spatial relationships between them. This in turn demands learning specific discriminatory features of the patches that help in solving the puzzle. Fine-grained classes usually have very subtle distinguishing features. Hence, we explored how well the model could learn these fine features through solving the Jigsaw puzzle.
The input image is split into a 3x3 grid of 9 patches, which are used to create a Jigsaw. Permutations for shuffling the patches are generated such that the average Hamming distance between the permutations is maximized. Moreover, multiple Jigsaws are generated for each input image. This ensures that the model does not learn shortcuts to predict the right order and that every position is equally likely for every patch. The model is tasked with predicting the permutation that solves the Jigsaw. The learnt features were then used for downstream fine-grained classification. We fine-tuned using different numbers of layers; the results are summarized in Table 1. A downstream accuracy of 67% was obtained.
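A simplified sketch of the permutation-set generation described above (a greedy variant that maximizes the minimum rather than the average Hamming distance, which is a common simplification; function names and pool sizes are illustrative):

```python
import random

def hamming(p, q):
    """Number of positions at which two permutations disagree."""
    return sum(a != b for a, b in zip(p, q))

def generate_permutations(n_patches=9, n_perms=100, pool_size=2000, seed=0):
    """Greedy selection: from a random pool, repeatedly pick the permutation
    whose minimum Hamming distance to the already-chosen set is largest.
    Assumes pool_size is far below n_patches!, so the pool fills quickly."""
    rng = random.Random(seed)
    identity = tuple(range(n_patches))
    pool = set()
    while len(pool) < pool_size:
        p = list(identity)
        rng.shuffle(p)
        pool.add(tuple(p))
    pool.discard(identity)
    chosen = [identity]
    while len(chosen) < n_perms and pool:
        best = max(pool, key=lambda p: min(hamming(p, c) for c in chosen))
        chosen.append(best)
        pool.remove(best)
    return chosen
```

Each chosen permutation becomes one class of the pretext classification problem: the network sees the shuffled patches and must predict which permutation produced them.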
4.3 Contrastive Learning: self-supervised learning using SimCLR
SimCLR is a self-supervised learning algorithm based on contrastive loss. It generates two augmented views of each image and ensures that the learned representations for those two views lie closer to each other in the feature space and farther away from the representations of the other images in the batch. To achieve this, SimCLR uses the NT-Xent (Normalized Temperature-scaled Cross Entropy) loss. Let $z_i$ and $z_j$ be the feature representations of two augmentations of the same image. The loss function for a positive pair of examples $(i, j)$ is defined as:

$$\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$

where $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity, $\tau$ is the temperature parameter, and $N$ is the batch size.
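A minimal NumPy sketch of the NT-Xent loss, assuming the batch is arranged so that rows 2k and 2k+1 are the two views of image k (an assumption of this sketch, not stated in the text):

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """NT-Xent over a batch of 2N projections; rows 2k and 2k+1 are assumed
    to be the two augmented views of image k."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-norm -> cosine sim
    sim = (z @ z.T) / tau
    n = sim.shape[0]
    np.fill_diagonal(sim, -np.inf)                     # drop self-similarity terms
    pos = np.arange(n) ^ 1                             # partner index: 2k <-> 2k+1
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(n), pos].mean()
```

Identical view pairs drive the positive similarity to its maximum, so their loss is lower than for unrelated embeddings, which is exactly the behaviour the contrastive objective rewards.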
The default augmentations used in SimCLR are random resize cropping, color jittering, random horizontal flip, random grayscale and Gaussian blurring. Considering the success of SimCLR, we decided to experiment and observe the performance of the model on fine-grained datasets.
We used the default augmentations of SimCLR (mentioned above), except Gaussian blurring, as it would blur the discriminatory spots that are essential to learn. After the contrastive learning phase, we performed the downstream classification task and achieved an accuracy of 73%. The results are summarized in Table 6, with a sample Grad-CAM shown in Figure 3(b). It can be inferred from the Grad-CAM that the model used the spots on the leaf to make the class prediction, i.e., it was able to localize the fine-grained features to a decent extent. This motivated us to experiment with modifications to the SimCLR pipeline so that the fine-grained discriminatory features may be learnt better.
4.3.1 Gamma transform
The first image-level augmentation that we used is a random gamma transform from the albumentations library Buslaev_2020. This transform changes the brightness of the image as shown in Figure 0(c). The brightness is altered randomly within a range of 50-250. Level 50 refers to a darker image and 250 corresponds to a very bright image. The gamma transform was performed before the default SimCLR augmentations.
NOTE: In the default augmentations, random resize cropping was used. We realized that this may not be suitable because the random crop may or may not contain the essential fine-grained regions. Hence, we replaced it with resize for the following augmentations.
4.3.2 Coarse Dropout
Coarse dropout was the second image-level augmentation that was applied from the albumentations library Buslaev_2020. This transform randomly removes squares from the input image. The size of the squares ranges from 10 to 25 pixels. Figure 0(b)
shows an example of the output of the augmentation. Coarse dropout is commonly used to prevent overfitting on image data. This motivated us to check whether the model was overfitting and to verify whether it could still learn the features despite the removed regions. We controlled the size and number of squares to ensure that the fine-grained regions are not completely lost.
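The experiments used the albumentations implementations of these transforms; as a self-contained illustration, pure-NumPy equivalents might look as follows (the mapping of albumentations' gamma_limit=(50, 250) to a gamma exponent in [0.5, 2.5], and the hole counts, are assumptions of this sketch):

```python
import numpy as np

def random_gamma(img, rng, g_low=0.5, g_high=2.5):
    """Gamma correction on a uint8 image; a gamma_limit of (50, 250) is
    assumed to correspond to an exponent in [0.5, 2.5]."""
    gamma = rng.uniform(g_low, g_high)
    return np.clip(255.0 * (img / 255.0) ** gamma, 0, 255).astype(np.uint8)

def coarse_dropout(img, rng, n_holes=8, min_size=10, max_size=25):
    """Zero out n_holes random squares with sides in [min_size, max_size]."""
    out = img.copy()
    h, w = img.shape[:2]
    for _ in range(n_holes):
        s = int(rng.integers(min_size, max_size + 1))
        y = int(rng.integers(0, h - s))
        x = int(rng.integers(0, w - s))
        out[y:y + s, x:x + s] = 0
    return out
```

Keeping the squares small relative to the leaf, as described above, is what preserves at least some of the fine-grained spots for the model to learn from.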
4.3.3 Random Patch swapping
In order to predict the correct class label, the model must be able to localize the fine-grained features irrespective of their position in the image. This intuition prompted us to use a simple random patch swapping component: two random patches of size 200x200 are extracted and swapped in every image. Training then encourages the original image (Figure 0(a)) and the image with random patches swapped (Figure 0(d)) to have representations that are closer in the embedding space.
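A minimal sketch of this augmentation (the function name is illustrative, and overlapping patch positions are not excluded here):

```python
import numpy as np

def swap_random_patches(img, patch=200, rng=None):
    """Swap two randomly-located patch x patch regions of the image.
    If the two regions overlap, the second write partially covers the first."""
    if rng is None:
        rng = np.random.default_rng()
    out = img.copy()
    h, w = img.shape[:2]
    y1, x1 = rng.integers(0, h - patch + 1), rng.integers(0, w - patch + 1)
    y2, x2 = rng.integers(0, h - patch + 1), rng.integers(0, w - patch + 1)
    tmp = out[y1:y1 + patch, x1:x1 + patch].copy()
    out[y1:y1 + patch, x1:x1 + patch] = out[y2:y2 + patch, x2:x2 + patch]
    out[y2:y2 + patch, x2:x2 + patch] = tmp
    return out
```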
4.3.4 Random Jigsaw Shuffling
Our baseline model du2020finegrained, the Progressive Multi-Granularity (PMG) model for fine-grained classification, uses multi-granularity Jigsaw shuffling to help improve feature learning. Drawing inspiration from this idea, we incorporated random Jigsaw shuffling into the SimCLR pipeline along with the existing augmentations. A random permutation of $[0, 1, 2, \ldots, n^2 - 1]$ is generated, where $n$ is the granularity of the Jigsaw. The image is divided into $n \times n$ uniform partitions and shuffled according to the generated permutation.
We performed two such variations. The first comprised the original image and a randomly-shuffled (4x4) Jigsaw puzzle image, and achieved a downstream task accuracy of 69.5%. The second involved two randomly-shuffled Jigsaw puzzles of (4x4) and (2x2) granularities, and achieved a downstream task accuracy of 68.7%. The representations of these pairs of images (Figure 0(f)) are learnt to be closer in the embedding space.
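The shuffling step described above can be sketched as follows (assuming, for simplicity, that the image dimensions are divisible by the granularity n):

```python
import numpy as np

def jigsaw_shuffle(img, n, rng=None):
    """Split img into an n x n grid of tiles and shuffle them with a random
    permutation of [0, 1, ..., n*n - 1]."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = img.shape[0] // n, img.shape[1] // n
    tiles = [img[i * h:(i + 1) * h, j * w:(j + 1) * w]
             for i in range(n) for j in range(n)]
    perm = rng.permutation(n * n)
    rows = [np.concatenate([tiles[perm[i * n + j]] for j in range(n)], axis=1)
            for i in range(n)]
    return np.concatenate(rows, axis=0)
```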
4.3.5 DCL-based Jigsaw shuffling
From the results shown in Table 2, it can be observed that random Jigsaw shuffling did not perform very well. This can be attributed to the fact that during Jigsaw formation (depending on the granularity), some of the fine-grained regions might be chopped and shuffled such that they are no longer identifiable. Another algorithm from the fine-grained domain, DCL 8953746, provided a plausible solution to this scenario. It suggests that shuffling within a local neighborhood of the image can preserve the fine-grained regions to a large extent. The algorithm is summarized in Algorithm 1.
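Algorithm 1 is not reproduced here, but a simplified sketch of DCL's neighbourhood-constrained shuffling idea (each tile moves at most roughly k grid positions; names and defaults are illustrative) might look like:

```python
import numpy as np

def dcl_shuffle_1d(n, k, rng):
    """Permute indices 0..n-1 so each element moves only locally:
    sort by original index plus uniform noise in [-k, k]."""
    noise = rng.uniform(-k, k, size=n)
    return np.argsort(np.arange(n) + noise)

def dcl_jigsaw(img, n=3, k=1, rng=None):
    """Neighbourhood-constrained jigsaw: shuffle tiles within each row,
    then within each column, so tiles stay near their origin."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = img.shape[0] // n, img.shape[1] // n
    tiles = np.array([[img[i * h:(i + 1) * h, j * w:(j + 1) * w]
                       for j in range(n)] for i in range(n)])
    for i in range(n):                      # shuffle within each row
        tiles[i] = tiles[i][dcl_shuffle_1d(n, k, rng)]
    for j in range(n):                      # then within each column
        tiles[:, j] = tiles[dcl_shuffle_1d(n, k, rng), j]
    return np.concatenate([np.concatenate(row, axis=1) for row in tiles], axis=0)
```

Because the noise is bounded by k, a spot spanning adjacent tiles is displaced but rarely torn far apart, which is the property that distinguishes this from fully random shuffling.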
The representations of the original image and DCL-based Jigsaw pair Figure 0(g) are learnt to be closer in the embedding space. We experimented with different granularities of Jigsaw (i.e. 3x3, 5x5, 7x7), out of which 3x3 Jigsaw showed the best accuracy.
4.3.6 Fine-grained region cropping
Using bounding box annotations for fine-grained regions has been very popular in fine-grained datasets. Using this as our intuition, we formed image pairs comprising the original image and a crop of its most important fine-grained region, expecting the pair to have representations that are closer in the embedding space. We carried out this experiment on a small scale by generating bounding boxes for 200 images (40 images per class). The results looked promising, and hence we performed the same on the larger dataset.
As the dataset did not have any bounding box annotations, we employed a technique of smartcropping to obtain the fine-grained regions. Smartcrop is a way of intelligently determining the most important part of the image and keeping it in focus while cropping the image. The tool was able to approximately localize the fine-grained regions to a good extent. The smartcrop algorithm can be summarized by the following steps -
1. Find edges using Laplace
2. Find and boost regions high in saturation
3. Generate a set of candidate crops using a sliding window
4. Rank the candidates using an importance function that favors detail near the center and penalizes detail near the edges
5. Output the candidate crop with the highest rank
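We did not reimplement the package, but a heavily simplified sketch of the scoring idea behind these steps (a 4-neighbour Laplacian for edges, the max-min channel difference as a crude stand-in for saturation, and a centre-weighting mask for the importance function) could be:

```python
import numpy as np

def smartcrop_sketch(img, crop, stride=16):
    """Return the (y, x) of the best crop: edge energy plus a saturation
    term, centre-weighted within each sliding window."""
    img = img.astype(float)
    gray = img.mean(axis=2)
    # 4-neighbour Laplacian as an edge detector (wrap-around at borders)
    lap = np.abs(4 * gray
                 - np.roll(gray, 1, 0) - np.roll(gray, -1, 0)
                 - np.roll(gray, 1, 1) - np.roll(gray, -1, 1))
    sat = img.max(axis=2) - img.min(axis=2)       # crude saturation proxy
    energy = lap + 0.5 * sat
    # importance mask: highest at the window centre, lowest at its edges
    yy, xx = np.mgrid[0:crop, 0:crop]
    centre = 1.0 - (np.abs(yy - crop / 2) + np.abs(xx - crop / 2)) / crop
    h, w = gray.shape
    best, best_score = (0, 0), -np.inf
    for y in range(0, h - crop + 1, stride):
        for x in range(0, w - crop + 1, stride):
            score = (energy[y:y + crop, x:x + crop] * centre).sum()
            if score > best_score:
                best, best_score = (y, x), score
    return best
```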
We have directly used a smartcropping package available on GitHub SmartCrop. After localizing the fine-grained regions, we overlaid them against a white background (Figure 0(e)) to maintain the image size.
4.4 Scaling: Self-supervised learning using SRGAN
Scaling was chosen as a pretext task to allow the model to learn the finer features of the images, since the dataset images were obtained using phone cameras of varied quality and resolution. In ledig2017photorealistic, Ledig et al. developed a super-resolution generative adversarial network (SRGAN) to create high-resolution images. The network downscales the images by a factor of four using a bicubic kernel and regenerates the images at two, four, and eight times the resolution.
Following the work for the midterm where we investigated the potential of the SRGAN discriminator as a feature extractor, we now proceeded to replicate the same procedure with the SRGAN generator.
The intuition behind this approach is that as the network upsamples an input image, it also learns the finer features of the image that are important to distinguish the fine-grained classes of the dataset.
Traditionally, an MSE content loss has been calculated to ensure the preservation of pixel-level content (eq. 2) dong2015image; shi2016realtime. However, the resulting upsampled image tends to be overly smoothed and lacks finer details. ledig2017photorealistic introduced a VGG content loss instead, where the focus of the loss function optimization is shifted from the pixel space to the feature space to ensure high-level content preservation. This is achieved by calculating the MSE between the feature maps of the generated image $\phi_{i,j}(G_{\theta_G}(I^{LR}))$ and those of the high-resolution reference $\phi_{i,j}(I^{HR})$, where $\phi_{i,j}$ denotes the feature map obtained after the $j$-th convolution before the $i$-th maxpooling layer of the pre-trained VGG network (eq. 3).
They eventually introduced a perceptual loss ($l^{SR}$), defined as the weighted sum of the content loss ($l^{SR}_X$) and the adversarial loss ($l^{SR}_{Gen}$) (eq. 4).
The SRGAN generator consists of two main components: a ResNet backbone and an upsampling block. The ResNet backbone consists of 16 identical residual blocks (each a convolution, batch normalization, and parametric ReLU sequence) and a bypass skip connection that relieves the network from modeling the identity transformation. The upsampling block consists of a convolution, a pixel shuffler, and a parametric ReLU (Figure 2).
The pixel shuffler is the main operation responsible for the upsampling. It acts as a deconvolution-like operation that reverses the convolution transformation to produce a higher resolution output. This is done through a periodic reshuffling of the lower resolution feature maps into a higher resolution output, essentially going from a tensor of shape $(C \cdot r^2, H, W)$ to a tensor of shape $(C, rH, rW)$, where $r$ is the upscaling factor (Figure 3).
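The reshuffling can be demonstrated in a few lines of NumPy, following the channel layout used by common implementations such as PyTorch's PixelShuffle:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) tensor into (C, H*r, W*r): each group of
    r*r channels becomes an r x r spatial block."""
    c2, h, w = x.shape
    c = c2 // (r * r)
    return (x.reshape(c, r, r, h, w)
             .transpose(0, 3, 1, 4, 2)   # (c, h, r, w, r)
             .reshape(c, h * r, w * r))
```

For r = 2, the four input channels at a given (h, w) position become the 2 x 2 block of output pixels at (2h, 2w), which is why no learned weights are needed for the resolution increase itself.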
In our experiments, we retrained the model using the cassava dataset and reconstructed the images at the original resolution (i.e. four times upscaling). To leverage the power of self-supervised learning, we included all 12k unlabelled images in our training. We then isolated the generator from the network, froze its weights, and attached a classifier outputting five classes, trained with the Adam optimizer with L2 regularization and a log-softmax output. Following the experimental procedure we performed with the discriminator, we varied the input image size, reduced the depth of the ResNet backbone, and added dropouts to the fully connected layer of the classifier to evaluate model performance. The experiments were performed progressively: the depth of the architecture was reduced at the best-accuracy input image size, and dropout was introduced at the best-accuracy depth. The results are summarized in Table 3.
5 Results and Discussion
We chose accuracy as our primary metric of evaluation as it is generally used for evaluating performance on fine-grained data. We have also included alternative metrics, including precision, recall, and F1 scores, in the Appendix for a more comprehensive evaluation.
For qualitative evaluation, we ran gradient-weighted class activation maps (Grad-CAM) as a visualization tool of our models. Grad-CAM provides a way of debugging the model and visually validating that it is “looking” and “activating” at the correct locations of an image. Grad-CAM works by finding a specified convolutional layer in the network and examining the gradient information flowing into that layer. We used Grad-CAM to visualize the regions the model was using to make a class prediction. Some Grad-CAM images are shown in Figures 4 and 5.
| Setting | Value | Accuracy |
| No. of permutations for Jigsaw task | 100 | 0.658 |
| Image size | 200 x 200 | 0.676 |
| Image size | 256 x 256 | 0.664 |
| No. of last layers used | 2 | 0.661 |
5.1 Jigsaw as pretext task
It can be seen from Table 1 that the accuracy improved slightly with an increase in the number of permutations. Using a larger image size was also beneficial. Fine-tuning from more layers also helped increase accuracy. While the model performs well on the pretext task, it is unable to perform similarly on the downstream task and reaches an accuracy of around 68%. This could be because the model could not learn the distinguishing features from the patches which looked very similar.
The effect of the different augmentations introduced in SimCLR is summarized in Table 2. The random gamma transform resulted in a slightly better performance (73.8%) than the original SimCLR model (73%). This can be attributed to the fact that the gamma transform increased the contrast in images, thus making the fine-grained spots more prominent. The Grad-CAM image in Figure 3(g) showed that the model was able to localize the spots on the leaves without getting confused with the background.
Using coarse dropout transform dropped the accuracy significantly to 48%. Even though we ensured the fine-grained regions are not completely covered by the squares, the model was not able to localize the important spots from the unobstructed regions of the leaves (Figure 3(h)).
With random patch swapping, the model was localizing the spots on the leaf but at the same time, it was also using the unimportant background elements (Figure 3(c)). Accuracy of 53% was obtained.
We observed that the model was able to localize the spots despite the slight distortion introduced due to random patch swapping. Hence, we increased the randomness by introducing a Jigsaw shuffling. As observed from Figure 3(d), the model was not very confident about the spots on the leaves and was also highlighting some features from the background. It was able to achieve an accuracy of only 69.5%.
With the intuition that random chopping and shuffling can make the fine-grained regions unidentifiable, we experimented with DCL-based 8953746 Jigsaw shuffling. As can be observed from Figure 3(f), the model was very confident and was able to localize the spots on the leaves, yet it was only able to reach an accuracy of 71.1%.
Inspired by the idea that fine-grained datasets usually provide bounding boxes to localize the features, we experimented with using the original image and the fine-grained region as a positive pair. As observed from Figure 3(e), the model was unable to localize the discriminatory spots and was highlighting background elements while reaching an accuracy of only 54%. Building upon this, we thought it would be interesting to see the effect of retaining the discriminatory fine-grained region and shuffling the remaining image region to induce randomness. This resulted in further deterioration of the downstream classification accuracy to 51%. It might be the case that the model was using the leaf boundaries to locate the spots, while with the smartcropping most of the images were zoomed onto the spots and the leaf structure was lost. We believe SimCLR has a lot of potential due to its simple and intuitive nature and more can be explored.
| Augmentation | Accuracy |
| Random patch swapping | 0.530 |
| Random Jigsaw (4x4) | 0.695 |
| DCL-based Jigsaw (3x3) | 0.711 |
| Fine-grained region crop | 0.548 |
We used $\tau = 0.5$ and a batch size of 64 in the pretext task.
On SRGAN generator, the best performing model used an input image size of 128 x 128 (Table 3). Our results showed roughly the same accuracy when no residual blocks were removed (64.5%) and two residual blocks were removed (64.2%). We chose the simpler model with two residual blocks removed to perform the subsequent experiments. Our results showed a dramatic increase in accuracy (up to 83.3%) when dropouts were introduced in the classifier, with the best performing model using a dropout of 0.5.
Comparing the results of the generator and discriminator, we found that the discriminator initially performed better than the generator when changing the input image size and reducing the number of identical blocks from the baseline architecture. However, once dropouts were introduced, the generator significantly outperformed the discriminator.
| Experiment | Setting | Generator | Discriminator |
| Image size | Original (crop 88 x 88) | 0.612 | 0.704 |
| Image size | 256 x 256 (crop 88 x 88) | 0.621 | 0.713 |
| Image size | 128 x 128 (crop 88 x 88) | 0.645 | 0.741 |
| Image size | 88 x 88 (no cropping) | 0.619 | 0.645 |
| Depth of architecture | Remove the last 2 / 1 blocks | 0.642 | 0.744 |
| Depth of architecture | Remove the last 4 / 2 blocks | 0.635 | 0.715 |
| Depth of architecture | Remove the last 6 / 3 blocks | 0.628 | 0.701 |
The depth of ResNet was reduced in increments of two for the generator.
The depth of VGG was reduced in increments of one for the discriminator.
SRGAN Grad-CAM images for the best performing models for the different experiments using the generator (top) and discriminator (bottom). (a) and (e) are the results of the original generator and discriminator, respectively using the default parameters; (b) and (f) are the results for the best (i.e. highest accuracy) input size images; (c) and (g) are the results for the best baseline model depth; (d) and (h) are the results for best dropout hyperparameter.
We noted an interesting observation on the Grad-CAM images (Figure 5). Before dropouts were added, the discriminator was still able to learn and roughly base its classification decision on certain spots of the leaves, while the generator was not able to distinguish the leaves from the background at all. Once dropouts were added, both the generator and discriminator exhibited an improvement in performance, with the generator being able to supersede the discriminator and accurately determine the correct identifying features of the cassava leaves. This suggests that the models (particularly the generator) have great potential to be used as a feature extractor, but they need to be used in conjunction with a regularization technique such as dropout to separate the unimportant background features from the important ones.
Fine-grained classification is an important problem owing to its many real-world applications. However, large annotated datasets are very expensive to obtain for training models. Hence, self-supervised learning was explored as a possible solution to exploit freely available unlabelled fine-grained images. We experimented with Jigsaw as a pretext task, the SRGAN generator, and SimCLR.
Comparing the best models for the different pretext tasks, we found that the Jigsaw task performed most poorly, followed by SimCLR, the SRGAN discriminator, and the SRGAN generator. We achieved a highest downstream classification accuracy of 83% (compared to the 88% supervised fine-grained baseline model). We thus see a promising avenue for self-supervised learning as a learning mechanism.
Our further investigation into the successes and failures of the models reveals that our models were likely able to learn the texture and shape of the leaves, but were probably not able to learn colors as a distinguishing factor between the classes (Figure 6). For future research, we would be interested in exploring color as a pretext task. We would also consider repeating this experiment on other fine-grained datasets to verify our findings.