NDPNet: A novel non-linear data projection network for few-shot fine-grained image classification

06/13/2021 ∙ by Weichuan Zhang, et al. ∙ CSIRO ∙ Tencent QQ ∙ Griffith University ∙ Xidian University ∙ NetEase, Inc.

Metric-based few-shot fine-grained image classification (FSFGIC) aims to learn a transferable feature embedding network by estimating the similarities between query images and support classes from very few examples. In this work, we propose, for the first time, to introduce the non-linear data projection concept into the design of an FSFGIC architecture, both to address the limited-sample problem in few-shot learning and to increase the discriminability of the model for fine-grained image classification. Specifically, we first design a feature re-abstraction embedding network that not only obtains the semantic features required for effective metric learning but also re-enhances those features with finer details from the input images. The descriptors of the query images and the support classes are then projected into different non-linear spaces in our proposed similarity metric learning network to learn discriminative projection factors. Under the challenging and restricted conditions of an FSFGIC task, this design makes the distance between samples within the same class smaller and the distance between samples from different classes larger, and reduces the coupling between samples from different categories. Furthermore, a novel similarity measure based on the proposed non-linear data projection is presented for evaluating the relationship between a query image and a support set. It is worth noting that our proposed architecture can be easily embedded into any episodic training mechanism for end-to-end training from scratch. Extensive experiments on FSFGIC tasks demonstrate the superiority of the proposed method over state-of-the-art benchmarks.


1 Introduction

Humans are capable of learning stable feature representations from small numbers of training samples when dealing with image classification tasks [18]. Inspired by this human ability, few-shot learning, which intends to rapidly learn a classifier with good generalization capacity for understanding new concepts from a very limited number of labeled training examples, has attracted much attention in computer vision and pattern recognition.

Currently, the main problem of few-shot learning [12, 2] is that the very limited training examples of each class cannot effectively express a concept. In recent years, different few-shot learning methods [19, 24, 20, 11, 29] have been presented to tackle fine-grained image classification tasks by learning transferable knowledge [19] from an additional auxiliary dataset. Existing few-shot fine-grained image classification (FSFGIC) methods can be roughly divided into two main streams: meta-learning based methods [24, 29] and metric-learning based methods [19, 12, 27]. Meta-learning based methods aim to learn meta knowledge from only a few training examples per category; a set of functions then maps labeled training samples and test samples for visual classification. Metric-learning based methods learn a group of functions that transform test samples into an embedding space, where each test sample is classified by a given similarity measure (e.g., nearest neighbor [19] or cosine metric [11]).

In this work, our interest lies in addressing the metric-learning problem for FSFGIC tasks. One of the key issues of metric-based FSFGIC methods is how to utilize a convolutional neural network [23, 10] and a similarity measure to learn a common feature representation for each category [19]. Mainstream metric-based FSFGIC methods generally follow the episodic training paradigm [23]; image-level feature representations [28] or a group of local feature descriptors [2] are then extracted from each input image to measure the similarities [20] between query images and support classes. Although the aforementioned methods [19, 20, 28, 11, 2] have achieved a certain degree of success, the sample data in FSFGIC is extremely limited and the similarity between different categories may be very high. We therefore argue that projecting samples from query images and from support sets into a non-linear space with different non-linear projection factors, and letting the designed network learn appropriate projection factors, can make the distance between samples within a category smaller and the distance between samples from different categories larger, and thus can effectively improve classification performance in FSFGIC.

Currently, backbones of four convolutional blocks (i.e., Conv-64F [23]) are widely used in meta-learning based FSFGIC tasks [3, 15, 29] and metric-learning based FSFGIC tasks [11, 5, 2]. Feature maps [3, 15, 29, 11, 5, 2] extracted from an input image at a single scale are used as descriptors for calculating the relationships between query images and support sets and for performing FSFGIC. These methods follow the idea [4, 16] that descriptors extracted from a single-scale image offer a good trade-off between classification accuracy and speed. Our research indicates that they utilize descriptors with strong semantic information for FSFGIC, but ignore the useful information of finer details in an image.

In this paper, we formalize FSFGIC as a feature distribution problem and design a novel non-linear data projection based metric learning network (NDPNet) for tackling the classification problems of FSFGIC. Firstly, we design a novel feature re-abstraction embedding network that extracts descriptors with detail re-enhanced semantic information from single-scale input images; a group of such descriptors is used to represent each test image. Secondly, a novel non-linear data projection strategy is designed to make the distance between samples within the same class smaller and the distance between samples of different classes larger. Thirdly, a novel non-linear data projection based similarity measure is designed for obtaining the relationships between query samples and support classes after non-linear projection. Finally, the Adam optimization method [7] with a cross-entropy loss is employed to train the whole network, learning the weights (including the appropriate non-linear projection factors) and performing FSFGIC tasks.

The main contributions of our proposed method are as follows:

A novel feature re-abstraction embedding network is designed which is capable of representing a sample by a set of feature representations with detail re-enhanced semantic information from an input image.

A novel non-linear data projection strategy is designed for making the distance between the samples within the same class smaller and the distance between samples from different classes larger.

A novel non-linear data projection based metric learning network is designed to learn appropriate non-linear projection factors for performing FSFGIC tasks.

Experiments on four fine-grained image classification benchmark datasets show that our proposed method outperforms the state-of-the-art methods on various FSFGIC tasks.

2 Related Works

In the following, we provide a brief background on the existing FSFGIC methods and discuss the two main streams of approaches.

2.1 Background

The existing FSFGIC methods can be roughly classified into two groups: meta-learning based FSFGIC methods and metric-learning based FSFGIC methods. We briefly review the two main streams in the FSFGIC literature as follows.

Meta-learning based FSFGIC methods. The meta-learning based FSFGIC methods [22, 21] aim to train a meta-learner which learns how to update the parameters of a given initial model with only a few training examples for each category. In order to more effectively distinguish some subtle and local differences between different fine-grained categories, recent meta-learning works have been presented for tackling FSFGIC tasks by considering how to more effectively capture subtle differences from limited training samples such as a piecewise mapping strategy [24] and a multi-attention mechanism [29].

Although existing meta-learning based methods have achieved competitive results for FSFGIC, training a sophisticated memory-addressing framework is difficult because of the temporally-linear hidden state dependencies [14]. In contrast, the proposed NDPNet architecture can be easily embedded into the episodic training mechanism for end-to-end training from scratch.

Metric-learning based FSFGIC methods. The metric-learning based FSFGIC methods [23, 19, 12, 27, 11, 2] aim to learn transferable feature knowledge and obtain a distribution based on similarity metrics between different categories. Inspired by the Siamese neural network [8] and the episodic training mechanism [23], Prototypical-Net [19] was proposed for few-shot learning tasks. In [19], a prototype was designed as the mean of the embedded support examples of each class, and Euclidean distance was then used as the metric for FSFGIC.
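As background, the prototype-and-Euclidean-distance classification of Prototypical-Net can be sketched in a few lines of NumPy. This is an illustrative sketch, not the original implementation; the toy embeddings below are invented:

```python
import numpy as np

def prototype_classify(support, support_labels, query, n_way):
    """Classify queries by Euclidean distance to class prototypes.

    support: (n_support, d) embedded support examples
    support_labels: (n_support,) integer class ids in [0, n_way)
    query: (n_query, d) embedded query examples
    """
    # Prototype = mean of the embedded support examples of each class.
    prototypes = np.stack(
        [support[support_labels == c].mean(axis=0) for c in range(n_way)]
    )
    # Squared Euclidean distance from every query to every prototype.
    dists = ((query[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # nearest prototype wins

# Toy 2-way example: class 0 near the origin, class 1 near (10, 10).
support = np.array([[0.0, 0.1], [0.1, 0.0], [10.0, 10.1], [10.1, 10.0]])
labels = np.array([0, 0, 1, 1])
query = np.array([[0.2, 0.2], [9.8, 9.9]])
print(prototype_classify(support, labels, query, n_way=2))  # [0 1]
```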

Fig. 1: The framework of the proposed NDPNet network for a 5-way 1-shot FSFGIC task which contains two modules. (1) A feature re-abstraction embedding module. (2) A non-linear data projection based similarity measure module.

The aforementioned metric-learning based methods utilize image-level global features to represent each sample and perform few-shot learning tasks. It is worth noting that the training examples of each class are extremely limited in few-shot learning. Because local feature descriptors can represent the distribution of each category more effectively than image-level global features in this setting [12, 2, 11], some recent works, e.g., covariance metric networks (CovaMNet) [12], the adaptive task-aware local representations network (ATL-Net) [2], and the deep nearest neighbor neural network (DN4) [11], employ local feature descriptors to represent each sample. A covariance metric [12], an adaptive threshold for episodic attention based on classification probability [2], and a cosine metric [11] are then utilized to measure the relationship between a query sample and each support class. Compared with these metric-learning methods (e.g., CovaMNet [12], ATL-Net [2], and DN4 [11]), the feature descriptors extracted by the feature re-abstraction embedding network designed in this work re-enhance finer details into the abstract feature representation. More importantly, projecting samples from query images and from support sets into a non-linear space with different projection factors, and letting the network learn appropriate projection factors, makes the distance between samples within a category smaller and the distance between samples from different categories larger, and thus effectively improves classification performance in FSFGIC. Experiments on fine-grained image classification benchmark datasets exhibit the superior performance of our proposed method compared with state-of-the-art metric-learning benchmarks.

3 Methodology

3.1 Problem Statement

A common few-shot fine-grained dataset includes two parts: a support set S and a query set Q. The small support set contains C unseen classes, each of which has K labeled samples. The query set contains unlabeled samples. The sets S and Q share the same label space. The goal of FSFGIC is to correctly classify each query sample in Q into a corresponding class in S; the problem is thus denoted as a C-way K-shot task. However, the training samples of each class are too limited to effectively learn transferable knowledge [2].

In this work, an auxiliary set A and an episodic training paradigm [23] are utilized to tackle the aforementioned problem. The auxiliary set consists of a quantity of classes and labeled samples far larger than C and K respectively, and can be divided into many C-way K-shot FSFGIC tasks. It is worth noting that there is no intersection between the label spaces of A and S. Each task consists of an auxiliary support set and an auxiliary query set. In the episodic training stage, thousands of such tasks are constructed for learning transferable knowledge. Once the transferable knowledge is obtained, each sample from Q is classified into one class in S.
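The episodic construction described above can be illustrated with a small NumPy sketch. This is hypothetical index-sampling code, not the authors' pipeline; the class counts below are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_episode(labels, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot episode from an auxiliary set.

    labels: (n_samples,) class id of every auxiliary sample.
    Returns index arrays (support_idx, query_idx) and the chosen classes.
    """
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support_idx, query_idx = [], []
    for c in classes:
        pool = rng.permutation(np.where(labels == c)[0])
        support_idx.extend(pool[:k_shot])                 # K labeled shots
        query_idx.extend(pool[k_shot:k_shot + n_query])   # disjoint queries
    return np.array(support_idx), np.array(query_idx), classes

# Auxiliary set with 10 classes, 20 samples each.
aux_labels = np.repeat(np.arange(10), 20)
s_idx, q_idx, cls = sample_episode(aux_labels, n_way=5, k_shot=1, n_query=15, rng=rng)
print(len(s_idx), len(q_idx))  # 5 75
```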

3.2 The Proposed NDPNet

The overview of our proposed NDPNet framework for FSFGIC is illustrated in Fig. 1. It contains two main modules: a feature re-abstraction embedding (FRaE) module and a non-linear data projection based similarity measure (NDP-SM) module. The feature re-abstraction embedding module is designed to learn feature representations with detail re-enhanced semantic information for all samples. A similarity measure module then computes the relationships of the learned feature representations between a query sample and each support class. Notably, our proposed NDPNet architecture can be trained in an end-to-end manner from scratch. In the following, the proposed feature re-abstraction embedding module and the similarity measure module are presented in detail.

Feature Re-abstraction Embedding Module. Backbones of four convolutional blocks (i.e., Conv-64F [23]) are widely used in meta-learning based FSFGIC tasks [3, 15, 29] and metric-learning based FSFGIC tasks [23, 19, 17, 12, 11, 5, 2]. They utilize the most abstract descriptors from the deepest layer, which contain strong semantic information, for FSFGIC, as shown in Fig. 2(a). Here, we design a novel feature re-abstraction embedding network, shown in Fig. 2(b), that re-enhances the most abstract features with finer details and re-abstracts them back to their most abstract form for metric learning in the next module.

Fig. 2: (a) The framework of the existing four convolutional blocks. (b) Our proposed feature re-abstraction embedding network.

The detailed framework of our designed feature re-abstraction embedding network is shown in Fig. 3. Each convolutional block consists of a convolutional layer, a batch normalization layer, and a Leaky ReLU layer. In addition, a max pooling layer is attached to each of the first two convolutional blocks. It is worth noting that 64 filters of size 3×3 are utilized in each convolutional layer. For an input image, the first convolutional block produces a set of 64 feature tensors; three further sets of 64 feature tensors are obtained after the second, third, and fourth convolutional blocks.

Fig. 3: The feature re-abstraction embedding module.

In this work, the feature tensors obtained from the second, third, and fourth convolutional blocks are combined as follows

(1)

For the combined tensors, we upsample their spatial resolution with a scaling factor of 2

(2)

where a bilinear upsampling operator with a scaling factor of 2 is applied. The upsampled tensor is then merged with the corresponding finer-scale tensor as

(3)

Furthermore, a convolutional layer (with 64 filters of size 3×3) is attached to the merged tensor to reduce the aliasing effect caused by upsampling, yielding 64 output tensors. A max pooling layer is then attached to this convolutional layer to obtain the final feature tensor maps. Finally, the feature values at the same coordinate position across the different feature tensor maps are used to construct a feature descriptor. Because each descriptor is constructed from feature maps of different resolutions, and contains the semantic information of the original deepest feature tensor map, it has strong detail re-enhanced semantic characteristics.
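The top-down merge described above (upsample the coarser map, add it to the finer one, in the spirit of Eqs. (1)-(3)) can be sketched as follows. This is an illustrative NumPy version in which nearest-neighbour repetition stands in for the paper's bilinear upsampling and the 3×3 anti-aliasing convolution is omitted:

```python
import numpy as np

def upsample2(x):
    """Upsample a (C, H, W) tensor by a factor of 2.
    Nearest-neighbour repetition stands in for the bilinear operator."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def merge_topdown(feats):
    """Top-down merge of per-block feature maps, coarsest first.

    feats: list of (C, H_i, W_i) maps where each map is half the spatial
    size of the previous one (e.g. block4, block3, block2 outputs).
    Each deeper map is upsampled and added to the next finer one, so the
    final map keeps the deepest semantics re-enhanced with finer detail.
    """
    merged = feats[0]
    for finer in feats[1:]:
        merged = upsample2(merged) + finer  # elementwise merge
    return merged

c = 64
f4 = np.ones((c, 5, 5))    # deepest, most abstract
f3 = np.ones((c, 10, 10))
f2 = np.ones((c, 20, 20))
out = merge_topdown([f4, f3, f2])
print(out.shape)  # (64, 20, 20)
```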

In summary, through the designed feature embedding module, an input image can be represented by a group of descriptors as follows

(4)

where each descriptor is a vector whose dimension equals the number of filters (64 here), and the total number of descriptors equals the product of the height and width of the feature tensor map. These descriptors are used as the representation of the input image.
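The descriptor construction, taking the values at one spatial position across all channels as one local descriptor, amounts to a simple reshape (illustrative sketch; the toy feature map is invented):

```python
import numpy as np

def to_descriptors(fmap):
    """Turn a (C, H, W) feature map into H*W C-dimensional local descriptors:
    the values at one spatial position across all channels form one descriptor."""
    c, h, w = fmap.shape
    return fmap.reshape(c, h * w).T  # (H*W, C)

fmap = np.arange(2 * 3 * 3, dtype=float).reshape(2, 3, 3)
descs = to_descriptors(fmap)
print(descs.shape)  # (9, 2)
```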

Fig. 4: The framework of non-linear data projection based similarity measure module.

Non-linear Data Projection based Metric Learning. In this module, a novel non-linear data projection strategy is designed to make the distance between the descriptors within a class smaller and the distance between descriptors of different classes larger.

Through the designed feature re-abstraction embedding module, the descriptors of a query sample and of a support sample are obtained. For these descriptors, a standardization operation is performed as

(5)

where the mean and the variance of the descriptor are used for standardization. For each element of the standardized query and support descriptors, a non-linear projection is designed as follows

(6)

where the non-linear projection factors for the query and support branches are learned, and the sign function preserves the sign of each element. The descriptors of a query sample and of a support sample after non-linear projection can then be expressed accordingly.
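A minimal sketch of these two steps follows. The z-score standardization of Eq. (5) is standard; the projection is written here in an assumed sign-preserving power form sgn(x)·|x|^α with factor α, purely for illustration, since the exact expression of Eq. (6) is not reproduced above:

```python
import numpy as np

def standardize(x):
    """Zero-mean, unit-variance standardization of a descriptor (Eq. (5) style)."""
    return (x - x.mean()) / (x.std() + 1e-8)

def project(x, alpha):
    """Assumed sign-preserving power projection: sgn(x) * |x|^alpha.
    alpha plays the role of a (normally learned) non-linear projection factor;
    query and support descriptors would each get their own factor."""
    return np.sign(x) * np.abs(x) ** alpha

q = standardize(np.array([1.0, -2.0, 3.0, 0.5]))
print(project(q, alpha=0.5))
```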

The framework of the non-linear data projection based similarity metric-learning module is shown in Fig. 4. In this work, the inner product is employed to measure the similarity of feature information between a query image and a support class after non-linear projection, for performing FSFGIC tasks.

For each query descriptor, the inner product measure is used to find its corresponding k-nearest neighbors among the support descriptors, following the nearest-neighbor method [1], as follows

(7)

where the inner product serves as the similarity measure, an absolute value function is applied, and the similarity values are arranged from largest to smallest so that the k support descriptors most similar to the query descriptor are retained. The non-linear data projection based similarity measure between a query sample and a support sample is then defined as

(8)

In this work, k is set to 1.
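The descending-similarity selection of Eq. (7) can be sketched as follows (illustrative NumPy; the toy descriptors are invented):

```python
import numpy as np

def topk_neighbors(q_desc, support_descs, k):
    """For one query descriptor, return the indices of the k support
    descriptors with the largest inner-product similarity (Eq. (7) style),
    in descending order, together with those similarities."""
    sims = support_descs @ q_desc  # inner products against every support descriptor
    order = np.argsort(-sims)      # descending similarity
    return order[:k], sims[order[:k]]

support = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
idx, sims = topk_neighbors(np.array([1.0, 0.1]), support, k=2)
print(idx)  # [0 2]
```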

Furthermore, we extend the non-linear projection based inner product similarity measure to compute the similarity between a query sample and each support class for FSFGIC tasks. Specifically, the descriptors of a query sample and of a support class (the pooled descriptors of all samples in that class) are collected. From Equation (7), the k-nearest neighbors of each query descriptor within the support class are found. The non-linear data projection based image-to-class similarity measure [11] is then defined as

(9)
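A hedged sketch of the image-to-class measure of Eq. (9): each query descriptor contributes the sum of its k largest inner-product similarities against all descriptors of the class (illustrative; the toy descriptors are invented):

```python
import numpy as np

def image_to_class(query_descs, class_descs, k=1):
    """Image-to-class similarity (Eq. (9) style): for every query descriptor,
    take its k most similar class descriptors by inner product and sum
    those similarities over all query descriptors."""
    sims = query_descs @ class_descs.T    # (m, n) all descriptor pairs
    topk = np.sort(sims, axis=1)[:, -k:]  # k best per query descriptor
    return topk.sum()

q = np.array([[1.0, 0.0], [0.0, 1.0]])
cls = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
print(image_to_class(q, cls, k=1))  # 2.0
```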

The Adam optimization method [7] with a cross-entropy loss is then used to train the whole network, learning the weights (including the appropriate non-linear projection factors) and performing FSFGIC tasks.
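The training signal can be illustrated as a softmax cross-entropy over the per-class similarity scores of a query. The sketch below computes only the loss value; gradient updates are left to Adam (illustrative, not the authors' code, with invented scores):

```python
import numpy as np

def episode_loss(class_sims, true_class):
    """Cross-entropy over the per-class similarity scores of one query:
    a softmax turns the C similarities into class probabilities and the
    negative log-probability of the true class is the training loss."""
    z = class_sims - class_sims.max()  # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[true_class]

sims = np.array([2.0, 0.1, -1.0, 0.3, 0.0])  # query most similar to class 0
loss = episode_loss(sims, true_class=0)
print(round(float(loss), 3))  # ≈ 0.417
```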

4 Experiments

4.1 Datasets

Our proposed network is evaluated on four fine-grained datasets: Stanford Dogs [6], Stanford Cars [9], CUB-200 [25], and Cottons [26]. The Stanford Dogs dataset consists of 120 dog classes with 20,580 samples. The Stanford Cars dataset consists of 196 car classes with 16,185 samples. The CUB-200 dataset consists of 200 bird classes with 6,033 samples. The Cottons dataset consists of 80 cotton classes with 480 samples. For fair performance comparisons, we follow the same data splits as used in [2], which are illustrated in Table I.

Dataset         Auxiliary  Validation  Test
Stanford Dogs   70         20          30
Stanford Cars   130        17          49
CUB-200         130        20          50
Cottons         51         13          16
TABLE I: The class split of the four fine-grained datasets: the numbers of classes in the auxiliary set, validation set, and test set respectively.

4.2 Experimental Setup

In this work, both 5-way 1-shot and 5-way 5-shot FSFGIC tasks are performed on the aforementioned four datasets. Each input image is resized to a fixed size and randomly cropped. Random affine transformations, random horizontal flips, and random rotations are utilized for data augmentation. We randomly sample and construct 300,000 episodes for training our models under the episodic training paradigm [23]. For each episode, 15 query samples per class are randomly selected for the Stanford Dogs, Stanford Cars, and CUB-200 datasets. Because each class has only six samples in the Cottons dataset, 5 and 1 query samples are selected from each cotton class for the 1-shot and 5-shot settings respectively. The Adam optimization method [7] is utilized for training the models for 30 epochs. The learning rate is initially set to 0.001 and multiplied by 0.5 every 100,000 episodes.

In the testing stage, 600 episodes are randomly constructed from the test set. The top-1 mean accuracy is employed as the evaluation criterion. This process is repeated five times, and the final mean results are reported as the classification accuracy of FSFGIC, together with the corresponding confidence intervals.
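Numbers of the form "mean ± interval" can be computed from per-episode accuracies with a normal-approximation 95% interval. The sketch below assumes that this is the interval used (a common convention for 600-episode evaluations, not confirmed by the text), and the per-episode accuracies are synthetic:

```python
import numpy as np

def mean_and_ci95(accs):
    """Top-1 mean accuracy over test episodes with a 95% confidence
    half-width under the normal approximation."""
    accs = np.asarray(accs, dtype=float)
    mean = accs.mean()
    half_width = 1.96 * accs.std(ddof=1) / np.sqrt(len(accs))
    return mean, half_width

rng = np.random.default_rng(0)
episode_accs = rng.normal(loc=56.2, scale=8.0, size=600)  # synthetic accuracies
m, ci = mean_and_ci95(episode_accs)
print(f"{m:.2f} ± {ci:.2f}")
```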

Model          Stanford Dogs              Stanford Cars              CUB-200
               1-shot       5-shot        1-shot       5-shot        1-shot       5-shot
Matching Net   35.80±0.99   47.50±1.03    34.80±0.98   44.70±1.03    45.30±1.03   59.50±1.01
P-Net          37.59±1.00   48.19±1.03    40.90±1.01   52.93±1.03    37.36±1.00   45.28±1.03
GNN            46.98±0.98   62.27±0.95    55.85±0.97   71.25±0.89    51.83±0.98   63.69±0.94
CovaMNet       49.10±0.76   63.04±0.65    56.65±0.86   71.33±0.62    52.42±0.76   63.76±0.64
DN4            45.41±0.76   63.51±0.62    59.84±0.80   88.65±0.44    46.84±0.81   74.92±0.64
PABN           45.65±0.71   61.24±0.62    54.44±0.71   67.36±0.61    63.36±0.80   74.71±0.60
               45.72±0.75   60.94±0.66    60.28±0.76   73.29±0.58    63.63±0.77   76.06±0.58
ATL-Net        54.49±0.92   73.20±0.69    67.95±0.84   89.16±0.48    60.91±0.91   77.05±0.67
Our NDPNet     56.21±0.86   74.82±0.84    71.48±0.89   91.92±0.91    64.74±0.90   80.52±0.63
TABLE II: Comparison results (5-way accuracy, %) on three different standard datasets.

4.3 Performance Comparison

The experimental results on the Stanford Dogs, Stanford Cars, CUB-200, and Cottons datasets are presented in Tables II and III, compared with eight state-of-the-art metric-learning methods (i.e., Matching Net [23], Prototypical Net (P-Net) [19], GNN [17], CovaMNet [12], DN4 [11], PABN [5] and its variant [5], and ATL-Net [2]). From Tables II and III, it is observed that our proposed NDPNet method outperforms all the benchmark metric-learning methods on both 5-way 1-shot and 5-way 5-shot FSFGIC tasks. For example, on the Stanford Cars dataset, NDPNet improves over every compared method in both the 1-shot and 5-shot settings. Such improvements are due to the ability of NDPNet to make the distance between samples within the same class smaller and the distance between samples from different classes larger, and to reduce the coupling between samples of different categories.

Model         1-shot       5-shot
Matching Net  38.77±1.08   58.95±1.02
P-Net         39.12±1.03   60.25±1.02
GNN           43.91±0.71   68.61±0.81
CovaMNet      46.14±0.91   69.97±0.89
DN4           45.74±0.86   70.36±0.95
PABN          46.73±0.77   70.32±0.71
              46.81±0.73   71.82±0.99
ATL-Net       51.89±0.97   73.48±0.96
Our NDPNet    53.52±0.98   74.66±0.92
TABLE III: Comparison results (5-way accuracy, %) on the Cottons dataset.

4.4 Discussion

Below we conduct several further experiments to study the effectiveness of the presented NDPNet architecture.

Influence of the feature re-abstraction embedding network for FSFGIC. In this experiment, the designed embedding network in our architecture is replaced with the basic embedding network (i.e., Conv-64F). Conv-64F is combined with our non-linear data projection based similarity measure (NDP-SM) module for performing FSFGIC tasks on the Stanford Dogs and Stanford Cars datasets. The corresponding results on the 5-way 1-shot and 5-way 5-shot tasks are presented in Table IV. We can see from Table IV that the classification accuracies of our proposed NDPNet are far better than those of the Conv-64F variant, which demonstrates the effectiveness of the proposed feature re-abstraction embedding (FRaE) module in image feature description for metric learning.

Model         Stanford Dogs              Stanford Cars
              1-shot       5-shot        1-shot       5-shot
              53.17±0.92   69.26±0.98    67.80±0.99   89.70±0.83
FRaENet       52.59±1.03   68.37±1.12    65.97±0.98   86.93±1.07
              55.38±0.96   74.71±0.95    71.10±0.98   91.55±0.89
GK_FRaENet    54.28±0.85   73.57±0.65    69.15±0.92   89.25±0.91
Our NDPNet    56.21±0.86   74.82±0.84    71.48±0.89   91.92±0.91
TABLE IV: Ablation study results for the proposed NDPNet on the Stanford Dogs and Stanford Cars datasets.
Model  Stanford Dogs              Stanford Cars              CUB-200
       1-shot       5-shot        1-shot       5-shot        1-shot       5-shot
       54.71±0.74   73.33±0.91    70.11±0.77   89.52±0.83    63.18±0.88   79.73±0.79
       55.78±0.89   73.91±0.97    70.75±0.92   90.31±0.99    64.55±0.98   80.23±0.81
       56.21±0.86   74.82±0.84    71.48±0.89   91.92±0.91    64.74±0.90   80.52±0.63
TABLE V: The impact of the k-nearest neighbors on the proposed NDPNet on three different standard datasets.

Influence of the non-linear data projection for FSFGIC. In the following experiments, we first consider the classification performance of the network after removing the designed non-linear data projection strategy, which is named FRaENet. The descriptors of query images and support classes obtained from the designed embedding network are used directly to calculate the similarities via Equations (7), (8), and (9). FRaENet is employed for performing FSFGIC tasks on the Stanford Dogs and Stanford Cars datasets; the results on the 5-way 1-shot and 5-way 5-shot tasks are presented in Table IV. After removing the proposed non-linear data projection based similarity measure (NDP-SM) module, the classification accuracy of FRaENet is significantly lower than that of the proposed NDPNet, which validates the effectiveness and strength of the proposed NDP-SM module.

Secondly, the inner product similarity measure in our architecture is replaced with the cosine similarity measure for performing FSFGIC tasks on the Stanford Dogs and Stanford Cars datasets. The corresponding results on the 5-way 1-shot and 5-way 5-shot tasks are presented in Table IV. It is observed that the classification performance of the cosine variant is close to that of our proposed NDPNet. It can also be seen from Tables II and IV that the classification performance of the cosine variant is far better than that of the cosine similarity based DN4 [11] method. This indicates that the non-linear data projection strategy effectively improves the ability to discriminate between different samples.

Thirdly, we consider the classification performance of an inner product based Gaussian kernel function learning network, named GK_FRaENet, obtained after removing the designed non-linear data projection strategy. Based on the descriptors obtained from the feature re-abstraction embedding network, GK_FRaENet is defined as follows

(10)

where a scale factor controls the kernel width. It is easy to verify that the function in Equation (10) satisfies Mercer's theorem [13], which means it can be used for kernel learning to compute the relationships between query images and support classes via Equations (8) and (9). GK_FRaENet is employed for performing FSFGIC tasks on the Stanford Dogs and Stanford Cars datasets; the corresponding results on the 5-way 1-shot and 5-way 5-shot tasks are presented in Table IV. We can see from Table IV that our proposed NDPNet achieves better classification performance than GK_FRaENet. This further validates that the proposed NDP-SM can more effectively make the distance between samples within the same class smaller and the distance between samples from different classes larger.
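Since the exact kernel expression of Eq. (10) is not reproduced above, the sketch below uses the standard RBF form exp(-γ‖q-s‖²), which satisfies Mercer's condition, with γ playing the role of the scale factor (an assumed form, for illustration only):

```python
import numpy as np

def gaussian_kernel_sim(q, s, gamma=1.0):
    """Assumed RBF form of the GK_FRaENet similarity between two
    descriptors: exp(-gamma * ||q - s||^2). The kernel is positive
    definite, so Mercer's condition holds."""
    return np.exp(-gamma * np.sum((q - s) ** 2))

q = np.array([1.0, 0.0])
s = np.array([0.0, 1.0])
print(gaussian_kernel_sim(q, q))  # 1.0 (identical descriptors)
print(round(float(gaussian_kernel_sim(q, s, gamma=0.5)), 4))
```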

Influence of the k-nearest neighbors for FSFGIC. The selection of the parameter k affects the classification performance. Here, we analyse the effect of the parameter k in Equation (7) on FSFGIC performance. 5-way 1-shot and 5-way 5-shot tasks are performed on the Stanford Dogs, Stanford Cars, and CUB-200 datasets using different values of k. The corresponding results are presented in Table V. It can be seen from Table V that when k equals 1, our proposed method achieves the best classification performance. Thus, k = 1 is recommended for our proposed architecture.

5 Conclusions

In this paper, we present a novel non-linear data projection network, NDPNet, which consists of a feature re-abstraction embedding module and a non-linear data projection based similarity measure module, for the challenging problem of few-shot fine-grained image classification (FSFGIC). The designed feature re-abstraction embedding network is capable of representing a sample by a set of feature representations with detail re-enhanced semantic information. Furthermore, a novel non-linear data projection based similarity metric learning network is designed to make the distance between samples within the same class smaller and the distance between samples from different classes larger. Our proposed method does not require extra supervision information and can be easily embedded into the episodic training mechanism for end-to-end training. Experiments on four fine-grained image classification benchmark datasets demonstrate that the presented NDPNet outperforms the state-of-the-art methods on various FSFGIC tasks.

References

  • [1] O. Boiman, E. Shechtman, and M. Irani (2008) In defense of nearest-neighbor based image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §3.2.
  • [2] C. Dong, W. Li, J. Huo, Z. Gu, and Y. Gao (2020) Learning task-aware local representations for few-shot learning. In

    Proceedings of the International Joint Conferences on Artificial Intelligence

    ,
    pp. 716–722. Cited by: §1, §1, §1, §2.1, §2.1, §3.1, §3.2, §4.1, §4.3.
  • [3] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In

    International Conference on Machine Learning

    ,
    pp. 1126–1135. Cited by: §1, §3.2.
  • [4] R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448. Cited by: §1.
  • [5] H. Huang, J. Zhang, J. Zhang, J. Xu, and Q. Wu (2020) Low-rank pairwise alignment bilinear network for few-shot fine-grained image classification. IEEE Transactions on Multimedia. Cited by: §1, §3.2, §4.3.
  • [6] A. Khosla, N. Jayadevaprakash, B. Yao, and F. Li (2011) Novel dataset for fine-grained image categorization: stanford dogs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop on Fine-Grained Visual Categorization, Vol. 2. Cited by: §4.1.
  • [7] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, Cited by: §1, §3.2, §4.2.
  • [8] G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese neural networks for one-shot image recognition. In

    Proceedings of the International Conference on Machine Learning Deep Learning Workshop

    ,
    Vol. 2. Cited by: §2.1.
  • [9] J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3D object representations for fine-grained categorization. In Proceedings of the Proceedings of the IEEE International Conference on Computer Vision Workshops, Cited by: §4.1.
  • [10] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2017) ImageNet classification with deep convolutional neural networks. Communications of the ACM 60 (6), pp. 84–90. Cited by: §1.
  • [11] W. Li, L. Wang, J. Xu, J. Huo, Y. Gao, and J. Luo (2019) Revisiting local descriptor based image-to-class measure for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7260–7268. Cited by: §1, §1, §1, §2.1, §2.1, §3.2, §3.2, §4.3, §4.4.
  • [12] W. Li, J. Xu, J. Huo, L. Wang, Y. Gao, and J. Luo (2019) Distribution consistency based covariance metric networks for few-shot learning. In Proceedings of the Association for the Advancement of Artificial Intelligence, Vol. 33, pp. 8642–8649. Cited by: §1, §2.1, §2.1, §3.2, §4.3.
  • [13] J. Mercer (1909) Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society A 209 (441-458), pp. 415–446. Cited by: §4.4.
  • [14] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel (2018) A simple neural attentive meta-learner. In Proceedings of the International Conference on Learning Representations, Cited by: §2.1.
  • [15] T. Munkhdalai, X. Yuan, S. Mehri, and A. Trischler (2018) Rapid adaptation with conditionally shifted neurons. In International Conference on Machine Learning, pp. 3664–3673. Cited by: §1, §3.2.
  • [16] S. Ren, K. He, R. Girshick, and J. Sun (2016) Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1137–1149. Cited by: §1.
  • [17] V. G. Satorras and J. B. Estrach (2018) Few-shot learning with graph neural networks. In Proceedings of the International Conference on Learning Representations, Cited by: §3.2, §4.3.
  • [18] L. A. Schmidt (2009) Meaning and compositionality as statistical induction of categories and constraints. Ph.D. Thesis, Massachusetts Institute of Technology. Cited by: §1.
  • [19] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087. Cited by: §1, §1, §2.1, §3.2, §4.3.
  • [20] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208. Cited by: §1, §1.
  • [21] S. Thrun and L. Pratt (1998) Learning to learn: introduction and overview. In Learning to learn, pp. 3–17. Cited by: §2.1.
  • [22] S. Thrun (1998) Lifelong learning algorithms. In Learning to learn, pp. 181–209. Cited by: §2.1.
  • [23] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Advances in Neural Information Processing Systems 29, pp. 3630–3638. Cited by: §1, §1, §2.1, §3.1, §3.2, §4.2, §4.3.
  • [24] X. Wei, P. Wang, L. Liu, C. Shen, and J. Wu (2019) Piecewise classifier mappings: learning fine-grained learners for novel categories with few examples. IEEE Transactions on Image Processing 28 (12), pp. 6116–6125. Cited by: §1, §2.1.
  • [25] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona (2010) Caltech-UCSD Birds 200. Cited by: §4.1.
  • [26] X. Yu, Y. Zhao, Y. Gao, S. Xiong, and X. Yuan (2020) Patchy image structure classification using multi-orientation region transform. In Proceedings of the Association for the Advancement of Artificial Intelligence, pp. 12741–12748. Cited by: §4.1.
  • [27] C. Zhang, Y. Cai, G. Lin, and C. Shen (2020) DeepEMD: Few-Shot Image Classification with Differentiable Earth Mover’s Distance and Structured Classifiers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12203–12213. Cited by: §1, §2.1.
  • [28] H. Zhang and P. Koniusz (2019) Power normalizing second-order similarity network for few-shot learning. In Proceedings of the IEEE Winter Applications of Computer Vision, pp. 1185–1193. Cited by: §1.
  • [29] Y. Zhu, C. Liu, and S. Jiang (2020) Multi-attention meta learning for few-shot fine-grained image recognition. In Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence, pp. 1090–1096. Cited by: §1, §1, §2.1, §3.2.