1. Introduction
In conventional deep-learning-based computer vision, one can observe a positive relationship between the size of the training dataset and the performance of the model. In contrast, few-shot approaches attempt to achieve comparable performance while using significantly less training data.
There has been a lot of recent research in this domain (Fei-Fei et al., 2006; Lee et al., 2019; Li et al., 2019; Finn et al., 2017, 2018; Yoon et al., 2018; Nichol and Schulman, 2018; Lee and Choi, 2018). The major benefit of a few-shot solution to a computer vision problem, say image classification, is that the overall computational cost of achieving a certain level of accuracy is drastically lower than that of traditional data-driven approaches. As a few-shot approach by definition requires only a few examples per class, the task of data collection and annotation becomes significantly easier. This makes it well suited to problems where the collection of data for a particular class is difficult or where the event in question is naturally rare. Researchers have employed few-shot learning for anomaly detection (Lu et al., 2020; Ding et al., 2021; Sheynin et al., 2021)
as anomalies are naturally rare, which creates a huge class imbalance. While there have been approaches to solve the problem of class imbalance with synthetic data generation using generative adversarial networks
(Qasim et al., 2020; Yang and Zhou, 2021), they still suffer from drawbacks such as the huge computational cost involved in training. One of the most popular frameworks for this task is meta-learning, in which the model focuses on learning to learn rather than memorizing the particular features of images. This enables the model to distinguish objects without requiring a huge dataset. A few-shot problem is usually defined by the pair N-way K-shot, where N refers to the number of classes in question and K refers to the number of examples in each class on which the model is trained. The training set thus formed is called the support set, and the testing set is called the query set.
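For concreteness, episode construction can be sketched in a few lines of Python. The helper below is illustrative only, not from the paper; the 5-example support and 27-example query split described in Section 2.2 is used as the default.

```python
# Minimal sketch of N-way K-shot episode sampling (illustrative).
import numpy as np

def sample_episode(images_by_class, k_shot=5, q_queries=27, seed=None):
    """Split each class's images into a K-shot support set and a query set."""
    rng = np.random.default_rng(seed)
    support, query = [], []
    for label, images in enumerate(images_by_class.values()):
        picks = rng.choice(len(images), size=k_shot + q_queries, replace=False)
        support += [(images[i], label) for i in picks[:k_shot]]
        query += [(images[i], label) for i in picks[k_shot:]]
    return support, query
```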
For training the few-shot learner, two commonly used approaches are gradient-based and metric-based. In gradient-based approaches, the update of the base model is treated as a trainable function (Bengio et al., 1992) and the gradients are back-propagated through it (Maclaurin et al., 2015; Finn et al., 2017). In metric-based approaches, a feature embedding is learnt and then used to classify the query images with a similarity function (Gidaris and Komodakis, 2018; Sung et al., 2018).

In this paper, we present a new approach influenced by the work of Chowdhury et al. (2021). Chowdhury et al. employed a large library of learners, off-the-shelf pre-trained CNNs, and used them to compute feature embeddings in parallel. They combined the features using simple ensembling techniques and passed them to a multi-layer perceptron. Their approach simplifies the training process but requires a huge number of parameters to achieve acceptable accuracy. We develop a new ensembling strategy that uses a convolutional block to stack and combine the features obtained from each feature extractor. This drastically reduces the parameter count and boosts the classification accuracy.

We evaluate the reliability of our method on a powerline components dataset with huge class imbalance. The dataset was collected by drones and has five major classes: insulator, nest, bolt, spacer, and the anomaly class, missing bolt. It contains thousands of images of multiple resolutions in every class except the anomaly class, which contains significantly fewer examples. The dataset reflects practical scenarios well: the captured images are mostly of low resolution, which makes the classification task particularly challenging.
2. Strategy
We took some well-known off-the-shelf convolutional neural networks, ResNet (He et al., 2016), DenseNet (Huang et al., 2017), Inception (Szegedy et al., 2015), Xception (Chollet, 2017), and EfficientNet (Tan and Le, 2019), all trained on ILSVRC2012 (Russakovsky et al., 2015), and discarded the fully-connected layers to obtain their respective convolutional segments. These convolutional subnetworks are used to extract the feature embeddings corresponding to each image. We reshaped the obtained features into a stack of spatial channels and passed it to our proposed model for the few-shot classification task. We experimented with multiple pre-trained CNNs and found that a combination of three such networks provided the best results; a detailed performance evaluation using various off-the-shelf pre-trained models is provided in Section 3.

The support and query sets were generated by randomly sampling our dataset. The few-shot training was performed using the support set, which comprised a few examples of each class, while the query set was used to evaluate the model's performance. The model architecture and the training details are discussed in the subsequent sections.
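As a sketch of this feature-extraction step, the snippet below uses tf.keras.applications backbones with frozen ImageNet weights; `include_top=False` discards the fully-connected head, and average pooling yields one embedding vector per image. The input resolution is an assumption, not taken from the paper.

```python
# Sketch: frozen off-the-shelf backbones as parallel feature extractors.
import tensorflow as tf

def build_extractors(input_shape=(224, 224, 3)):  # resolution is an assumption
    kwargs = dict(weights="imagenet", include_top=False,
                  pooling="avg", input_shape=input_shape)
    extractors = [
        tf.keras.applications.ResNet50(**kwargs),        # 2048-d output
        tf.keras.applications.DenseNet201(**kwargs),     # 1920-d output
        tf.keras.applications.EfficientNetB5(**kwargs),  # 2048-d output
    ]
    for m in extractors:
        m.trainable = False  # backbones stay frozen; only the head is trained
    return extractors

def extract_features(extractors, images):
    # images: a batch of suitably preprocessed image tensors
    return tf.concat([m(images, training=False) for m in extractors], axis=-1)
```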
2.1. Architecture
For our best-performing approach, we used the combination of ResNet 50, EfficientNet B5, and DenseNet 201 to compute the feature embeddings corresponding to each image. The obtained features were reshaped into 4×4 spatial blocks and stacked to form the input of our proposed model. The stacked channels are passed through a convolutional block comprising [Conv2D, BatchNorm, AvgPool]. The output of this block is then flattened and passed through a multi-layer perceptron with two hidden layers of 256 and 32 neurons. The output layer has five neurons for the 5-class classification and is followed by a softmax layer that yields the final class label. The detailed architecture is illustrated in Figure 1 and the model summary is provided in Table 1.

Layers | Parameters |
---|---|
ResNet 50 | 23.6M (Frozen) |
EfficientNet B5 | 28.5M (Frozen) |
DenseNet 201 | 18.3M (Frozen) |
Concat | - |
Reshape stack | - |
Conv2D | 1.7M |
BatchNorm | 2k |
AvgPool | - |
Flatten | - |
Dense In | 262k |
Hidden Dense 1 | 131k |
Hidden Dense 2 | 8k |
Dense Out | 0.1k |
Total Trainable Parameters | 2.1M |
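A minimal sketch of this trainable head in tf.keras is given below. The layer order and parameter counts follow Table 1; the 3×3 convolution kernel and ReLU activations are assumptions, chosen to be consistent with the ~1.7M Conv2D parameters reported there (3·3·376·512 + 512 ≈ 1.73M).

```python
# Sketch of the trainable CNN-MLP head (layer order per Table 1).
import tensorflow as tf
from tensorflow.keras import layers

def build_head(feature_dim=6016, block=4, n_classes=5):
    channels = feature_dim // (block * block)  # 6016 / 16 = 376 stacked channels
    return tf.keras.Sequential([
        layers.Reshape((block, block, channels), input_shape=(feature_dim,)),
        layers.Conv2D(512, kernel_size=3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.AveragePooling2D(pool_size=(block, block)),  # -> (1, 1, 512)
        layers.Flatten(),
        layers.Dense(512, activation="relu"),  # "Dense In", ~262k parameters
        layers.Dense(256, activation="relu"),  # hidden layer 1, ~131k
        layers.Dense(32, activation="relu"),   # hidden layer 2, ~8k
        layers.Dense(n_classes, activation="softmax"),  # 5-way output, ~0.1k
    ])
```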

2.2. Training and Implementation
The dataset consists of five classes: insulator, nest, bolt, spacer, and missing bolt. We experimented with different numbers of examples per class (K) for training the few-shot learner and found that the model performs optimally at K = 5. We randomly selected 32 images of each class from the whole dataset and split them into two groups of 5 and 27 images for the support and query sets, respectively. The support set for the few-shot training process was created by combining the features extracted by the pre-trained networks with their associated labels. The same procedure was followed for the remaining 27 images, except that their labels were not supplied; the resulting features formed the query set. A sketch of this split follows below.
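This sketch builds on the extractor snippet from Section 2 and assumes a hypothetical `load_class_images` helper for reading the dataset.

```python
# Sketch: build support/query feature sets (5 support, 27 query per class).
import numpy as np
import tensorflow as tf

CLASSES = ["insulator", "nest", "bolt", "spacer", "missing_bolt"]
extractors = build_extractors()

support_x, support_y, query_x, query_y = [], [], [], []
rng = np.random.default_rng(0)
for label, name in enumerate(CLASSES):
    imgs = load_class_images(name, n=32)        # hypothetical data loader
    feats = extract_features(extractors, imgs)  # computed once; backbones frozen
    order = rng.permutation(32)
    support_x.append(tf.gather(feats, order[:5]))  # 5 labelled examples
    query_x.append(tf.gather(feats, order[5:]))    # 27 held-out examples
    support_y += [label] * 5
    query_y += [label] * 27
support_x, query_x = tf.concat(support_x, 0), tf.concat(query_x, 0)
```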
Figure 2 illustrates the training pipeline using three pre-trained networks as feature extractors. We use the proposed feature ensembling strategy to combine the extracted features, which are passed to the trainable CNN layers and then propagated to the trainable MLP layers to obtain the classified output. The reshaping and stacking technique is explained visually in the same figure. The network minimizes the categorical cross-entropy loss, which is backpropagated through the trainable layers. We use the Adam optimizer (Kingma and Ba, 2014), and a high L2 regularization constant ensures that the model does not overfit the training data. The network takes 300 epochs for the loss to saturate, but as each epoch executes within a few milliseconds, the overall process completes in a few seconds. All training and testing were performed on a system powered by an Intel Xeon 2.90 GHz quad-core CPU and an NVIDIA 1080 GPU with 8 GB of graphics memory.
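Training then reduces to fitting the small head on the 25 support embeddings. A minimal sketch follows; the paper's exact learning rate and L2 constant are not reproduced here, so the value below is a placeholder, and the L2 kernel regularizer on the dense layers is omitted for brevity.

```python
# Sketch: train the head on the support set, evaluate on the query set.
head = build_head()
head.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # placeholder
             loss=tf.keras.losses.CategoricalCrossentropy(),
             metrics=["accuracy"])
head.fit(support_x, tf.one_hot(support_y, 5), epochs=300, verbose=0)
loss, acc = head.evaluate(query_x, tf.one_hot(query_y, 5), verbose=0)
```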
3. Ablation Study and Hyperparameter Search
We performed an extensive ablation study to ensure the reliability of our approach. We ablate our model in terms of the feature extractors, the ensembling strategy, and the hyperparameters of the trainable CNN and MLP network. We considered eight different backbone networks, each with fewer than 30M parameters, and four different kernel sizes for the ensembling. For the ablation of ensembling techniques, we fixed the structure of the trainable CNN-MLP model on a trial-and-error basis to observe the general trend in accuracy, and later refined this structure based on further ablation studies. For all testing purposes we used k-fold cross-validation to obtain reliable performance scores.
Backbone | 32×32 | 16×16 | 8×8 | 4×4 |
---|---|---|---|---|
ResNet 50 | 87.86 | 86.57 | 85.53 | 88.02 |
ResNet 50 V2 | 82.32 | 85.74 | 86.28 | 87.98 |
DenseNet 121 | 76.39 | 78.47 | 77.78 | 84.22 |
DenseNet 201 | - | - | 79.04 | 87.53 |
Inception V3 | 64.67 | 63.87 | 69.02 | 75.37 |
Xception | 75.48 | 71.73 | 73.93 | 78.27 |
EfficientNet V2S | - | 78.83 | 78.56 | 82.79 |
EfficientNet B5 | 77.56 | 79.77 | 75.91 | 82.76 |


3.1. Ablation study of ensembling techniques
We begin by using only one pre-trained network as the backbone. The performance of each model for the various ensembling kernel sizes is presented in Table 2. A clear trend can be observed: the 4×4 kernel size works best for each model. We pick the three best-performing models for further ablation studies. Since the output size of most of the networks is either 1024 or 2048, it can easily be converted into 1 or 2 stacks of 32×32. The output size of DenseNet 201 is 1920, which can only be converted into stacks of 8×8 or smaller; therefore, some of the columns in Table 2 are missing. For similar reasons, the 32×32 accuracy for EfficientNet V2S (output size 1280) is also missing.
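The pattern of missing cells follows directly from divisibility: a d-dimensional feature vector can be reshaped into b×b blocks only when b² divides d, as the quick check below illustrates.

```python
# Which block sizes does each backbone's feature dimension admit?
dims = {"ResNet50": 2048, "DenseNet201": 1920, "EfficientNetV2S": 1280}
for name, d in dims.items():
    print(name, "->", [b for b in (32, 16, 8, 4) if d % (b * b) == 0])
# ResNet50 -> [32, 16, 8, 4]
# DenseNet201 -> [8, 4]
# EfficientNetV2S -> [16, 8, 4]
```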
3.2. Ablation Study of CNN-MLP network
Here we study the effect of changing the structure of the trainable CNN-MLP model. We fix the backbone to the best-performing model and the kernel size to 4×4, as obtained in Table 2. We experimented with different numbers of hidden layers in the MLP, each with varying numbers of neurons, including the case with no hidden layer at all. The results are listed in Table 3. The best result was obtained with 2 hidden layers of 256 and 32 neurons, respectively. The size of the MLP's input layer depends on the output of the CNN block; from this study we conclude that the optimal number of filters in the CNN is 512.
Hidden Layers | Structure | Accuracy (%) |
---|---|---|
0 | | 57.56 |
 | | 54.09 |
1 | | 65.33 |
 | | 70.68 |
 | | 66.68 |
2 | | 82.93 |
 | 256, 32 | 89.08 |
 | | 73.10 |
 | | 73.25 |
3.3. Ablation study of number of feature extractors

Backbone | Accuracy (%) |
---|---|
RN50 | 88.24 |
DN201 | 87.53 |
ENB5 | 82.76 |
RN50 + DN201 | 89.39 |
ENB5 + DN201 | 90.12 |
RN50 + ENB5 | 90.95 |
RN50 + DN201 + ENB5 | 92.30 |
In this section we study the impact of combining multiple models. We select the three best-performing models from the previous sections (we experimented with combinations involving the other models as well, but the combination of the three best models produced the best results). We compare the performance of the models taken one at a time, two at a time, and all three at once. Considering the practical memory constraints of most mobile devices, we limited our study to a maximum of three models taken together to restrict the total parameter count. Table 4
lists the accuracies thus obtained. We can observe a clear performance improvement as we increase the number of pre-trained networks for feature extraction.
The class separability of the combinations is visualised using a t-SNE (Van der Maaten and Hinton, 2008) plot in Figure 3. t-distributed stochastic neighbour embedding (t-SNE) is a dimensionality-reduction tool that helps visualize the clustering ability of a model. We observe that when the backbones are taken one at a time (panels 1, 2, and 3), the models fail to form sharp clusters and thus have the lowest accuracy. The clustering capability improves as the number of backbone networks increases. Figure 4 shows the sharp clustering capability of the proposed model.
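A sketch of how such a plot can be produced from the ensembled query features, assuming scikit-learn and matplotlib are available:

```python
# Sketch: 2-D t-SNE projection of the ensembled query-set features.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    np.asarray(query_x))
plt.scatter(emb[:, 0], emb[:, 1], c=query_y, cmap="tab10", s=12)
plt.title("t-SNE of ensembled features")
plt.show()
```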
4. Comparison with the state-of-the-art
The final results of our model are presented in this section. We first provide the performance scores on the powerline components dataset and then compare our model with other popular models on standard datasets.
4.1. Results on Powerline Dataset
All testing was done by running the model multiple times and using k-fold cross-validation to obtain an average score. Figure 6 shows some examples of images that were correctly classified by the model. Note that the images were of different resolutions; they are rescaled to the same size for display purposes.


K | 1 | 5 | 10 |
---|---|---|---|
Accuracy (%) | 70.11 | 92.30 | 95.42 |
Figure 7 shows some of the misclassified images. It can be observed that most of the misclassifications were due to heavy noise and blur.
Model | Aircraft | Traffic | Omniglot | Texture | FC100 | VGG Flower |
---|---|---|---|---|---|---|
MAML | 33.1 ± 0.6 | 67.4 ± 0.9 | 82.6 ± 0.7 | 56.9 ± 0.8 | 62.0 ± 0.8 | 78.0 ± 0.7 |
MatchingNet | 33.5 ± 0.6 | 73.7 ± 0.8 | 89.7 ± 0.5 | 54.7 ± 0.7 | 59.4 ± 0.8 | 74.2 ± 0.8 |
ProtoNet | 41.5 ± 0.7 | 75.0 ± 0.8 | 95.5 ± 0.3 | 62.9 ± 0.7 | 64.7 ± 0.8 | 86.7 ± 0.6 |
SUR | 45.2 ± 0.8 | 70.6 ± 0.8 | 98.7 ± 0.1 | 59.6 ± 0.7 | 67.2 ± 1.0 | 90.8 ± 0.5 |
Chowdhury et al. | 68.9 ± 0.9 | 85.8 ± 0.7 | 98.0 ± 0.2 | 85.7 ± 0.6 | 80.5 ± 0.6 | 97.9 ± 0.2 |
Ours | 65.6 ± 1.7 | 93.1 ± 0.3 | 99.0 ± 0.3 | 86.8 ± 0.6 | 91.4 ± 0.2 | 98.8 ± 0.3 |
Table 5 lists the results obtained by our best model with the three feature extractors, namely ResNet 50, DenseNet 201, and EfficientNet B5. The ensembling strategy used a kernel size of 4×4 and 512 filters for the CNN block, and two hidden layers (256 and 32 neurons) in the MLP block. The results were obtained by varying the number of training examples per class, K. Figure 5 contains the confusion matrices for the three values of K. For K = 1, the model was supplied with only one training image per class, which explains the sharp drop in accuracy.
4.2. Results on Standard Datasets
We compare our model with existing state-of-the-art methods on various datasets: Aircraft (Maji et al., 2013), Traffic (Houben et al., 2013), Omniglot (Lake et al., 2015), FC100 (Oreshkin et al., 2018), VGG Flower (Nilsback and Zisserman, 2008), and Texture (Cimpoi et al., 2014). As we performed our hyperparameter search on the powerline anomaly dataset, which contains five classes only, we stick to results for the 5-way 5-shot problem. Table 6 shows a detailed comparative study of our method against the existing state of the art. For comparison we chose some of the most popular few-shot classification techniques: MAML (Finn et al., 2017), MatchingNet (Vinyals et al., 2016), ProtoNet (Snell et al., 2017), SUR (Dvornik et al., 2020), and the model proposed by Chowdhury et al. (2021). It can be observed that our method outperforms the model by Chowdhury et al., our inspiration, by a significant margin on most datasets.
5. Conclusion
In this paper we presented a new approach for few-shot image classification. We evaluated our approach on a powerline anomaly dataset whose anomaly class was "missing bolt". We developed an ensembling technique that combines the extracted features of different pre-trained networks in a parameter-efficient way. The classification accuracy obtained by training the model with an N-way K-shot support set was above 90% for K ≥ 5. After extensive performance evaluation with multiple combinations of feature extractors, we found that the best accuracy was obtained with a strategic combination of three specialized pre-trained networks. We visualized the class separability of our method using t-SNE plots and confusion matrices, and obtained a peak classification accuracy of 92.30% on the 5-way 5-shot task. The dataset used to evaluate our framework is new and challenging because it includes realistic images of multiple resolutions. A major critique of our approach is the sensitivity of the accuracy to the particular images in the support set. For example, as the support set was randomly selected, there were cases where all the samples of a particular class were similar to each other and failed to represent other variations, thereby compromising the overall accuracy. The support set should therefore be selected with extreme care. The complete code will be made publicly available for further research.
Acknowledgements.
All computations were performed using the resources provided by the AI Computing Facility at CSIR-CEERI, Pilani.

References
- Bengio et al. (1992). Global optimization of a neural network–hidden Markov model hybrid. IEEE Transactions on Neural Networks 3(2), pp. 252–259.
- Chollet (2017). Xception: deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251–1258.
- Chowdhury et al. (2021). Few-shot image classification: just use a library of pre-trained feature extractors and a simple classifier. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9445–9454.
- Cimpoi et al. (2014). Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3606–3613.
- Ding et al. (2021). Few-shot network anomaly detection via cross-network meta-learning. In Proceedings of the Web Conference 2021, pp. 2448–2456.
- Dvornik et al. (2020). Selecting relevant features from a multi-domain representation for few-shot classification. In European Conference on Computer Vision, pp. 769–786.
- Fei-Fei et al. (2006). One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(4), pp. 594–611.
- Finn et al. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135.
- Finn et al. (2018). Probabilistic model-agnostic meta-learning. Advances in Neural Information Processing Systems 31.
- Gidaris and Komodakis (2018). Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4367–4375.
- He et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- Houben et al. (2013). Detection of traffic signs in real-world images: the German Traffic Sign Detection Benchmark. In The 2013 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
- Huang et al. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
- Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Lake et al. (2015). Human-level concept learning through probabilistic program induction. Science 350(6266), pp. 1332–1338.
- Lee et al. (2019). Meta-learning with differentiable convex optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10657–10665.
- Lee and Choi (2018). Gradient-based meta-learning with learned layerwise metric and subspace. In International Conference on Machine Learning, pp. 2927–2936.
- Li et al. (2019). Finding task-relevant features for few-shot learning by category traversal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1–10.
- Lu et al. (2020). Few-shot scene-adaptive anomaly detection. In European Conference on Computer Vision, pp. 125–141.
- Maclaurin et al. (2015). Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113–2122.
- Maji et al. (2013). Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151.
- Nichol and Schulman (2018). Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999.
- Nilsback and Zisserman (2008). Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729.
- Oreshkin et al. (2018). TADAM: task dependent adaptive metric for improved few-shot learning. Advances in Neural Information Processing Systems 31.
- Qasim et al. (2020). Red-GAN: attacking class imbalance via conditioned generation. Yet another medical imaging perspective. In Medical Imaging with Deep Learning, pp. 655–668.
- Russakovsky et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3), pp. 211–252.
- Sheynin et al. (2021). A hierarchical transformation-discriminating generative model for few shot anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8495–8504.
- Snell et al. (2017). Prototypical networks for few-shot learning. Advances in Neural Information Processing Systems 30.
- Sung et al. (2018). Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208.
- Szegedy et al. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
- Tan and Le (2019). EfficientNet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114.
- Van der Maaten and Hinton (2008). Visualizing data using t-SNE. Journal of Machine Learning Research 9(11).
- Vinyals et al. (2016). Matching networks for one shot learning. Advances in Neural Information Processing Systems 29.
- Yang and Zhou (2021). IDA-GAN: a novel imbalanced data augmentation GAN. In 2020 25th International Conference on Pattern Recognition (ICPR), pp. 8299–8305.
- Yoon et al. (2018). Bayesian model-agnostic meta-learning. Advances in Neural Information Processing Systems 31.