Piecewise classifier mappings: Learning fine-grained learners for novel categories with few examples

05/11/2018 · by Xiu-Shen Wei, et al. · Nanjing University; The University of Adelaide

Humans are capable of learning a new fine-grained concept with very little supervision, e.g., a few exemplary images of a species of bird, yet our best deep learning systems need hundreds or thousands of labeled examples. In this paper, we try to reduce this gap by studying the fine-grained image recognition problem in a challenging few-shot learning setting, termed few-shot fine-grained recognition (FSFG). The FSFG task requires the learning systems to build classifiers for novel fine-grained categories from few examples (often only one, or fewer than five). To solve this problem, we propose an end-to-end trainable deep network which is inspired by a state-of-the-art fine-grained recognition model and is tailored for the FSFG task. Specifically, our network consists of a bilinear feature learning module and a classifier mapping module: while the former encodes the discriminative information of an exemplar image into a feature vector, the latter maps the intermediate feature into the decision boundary of the novel category. The key novelty of our model is a "piecewise mappings" function in the classifier mapping module, which generates the decision boundary by learning a set of more attainable sub-classifiers in a parameter-economic way. We learn the exemplar-to-classifier mapping on an auxiliary dataset in a meta-learning fashion, which is expected to generalize to novel categories. By conducting comprehensive experiments on three fine-grained datasets, we demonstrate that the proposed method achieves superior performance over the competing baselines.


1 Introduction

Fine-grained image recognition, as an important computer vision problem, has attracted tremendous attention and has seen rapid performance gains thanks to sophisticated deep network structures. However, the large-scale fine-grained data required to train such classification algorithms limits the range of scenarios where they can be successfully applied; e.g., only very sparse training samples can be collected for some rare bird species. Humans, in contrast, are capable of learning a new fine-grained concept with very little supervision. To mimic this human ability, in this work, we study fine-grained image recognition in a more practical and challenging few-shot setting: we aim to learn the classifiers of novel fine-grained categories from very few labeled training examples (a.k.a. exemplars, usually one or five).

Figure 1: Illustration of the few-shot fine-grained image recognition (FSFG) task. The aim is to learn the classifier for a fine-grained category, bird species in this example, from few exemplars. We train the exemplar-to-classifier mapping on an auxiliary dataset $\mathcal{A}$ and test the FSFG performance on another dataset $\mathcal{T}$. There are no category overlaps between these two sets.

Learning a classifier for a fine-grained category identified by few exemplars is a challenging problem: satisfactory classification performance can be expected only when the learned classifiers capture the subtle differences between categories and generalize beyond the very limited supervision. To realize such an exemplar-to-classifier mapping, we propose an end-to-end trainable network which is inspired by a state-of-the-art fine-grained recognition model and is tailored for the FSFG task. Specifically, the network consists of a bilinear feature learning module and a classifier mapping module. While the former encodes the discriminative information of an exemplar image into a feature vector, the latter, as the key part of the network, maps the intermediate image features into category-level decision boundaries. Two problems must be solved for such mappings to succeed. On one hand, the distribution of the image-level representation can be complex, which poses a great challenge for the mapping. On the other hand, the feature generated from bilinear pooling is very high dimensional, which further impedes the mapping due to the risk of parameter explosion.

The key novelty of our model to mitigate these problems is a “piecewise mappings” function in the classifier mapping module, which generates the decision boundary by learning a set of more attainable sub-classifiers in a much more parameter-economic way. Due to the outer product computation in bilinear pooling, the obtained feature can by nature be viewed as a set of sub-vectors, each of which implicitly attends to a part of the image. We perform the sub-vector to sub-classifier mapping using highly non-linear mappings. Then, these sub-classifiers are recombined into a global classifier so that it can distinguish samples from different categories. Intuitively, we learn the feature-to-classifier mapping based on the implicit “part”, which may encode simpler and purer information and consequently makes the mapping easier. As a by-product, the piecewise mappings significantly reduce the number of model parameters and enable more efficient computation. We learn the exemplar-to-classifier mapping using an auxiliary dataset in a meta-learning fashion, as shown in Fig. 1. The aim in the meta-training phase is to learn a “mapping paradigm” which is expected to generalize to novel categories.

In experiments, we evaluate the proposed FSFG method on three fine-grained benchmark datasets, i.e., CUB Birds [23], Stanford Dogs [9], and Stanford Cars [11]. Empirical results show that our FSFG model significantly outperforms the competing baseline methods.

In summary, our major contributions are three-fold:

  • We study fine-grained image recognition in a challenging few-shot setting and propose a novel meta-learning strategy to address this problem.

  • We devise a novel exemplar-to-classifier mapping strategy, named piecewise mappings, which resorts to the special structure of the bilinear CNN features to learn a discriminative classifier in a parameter-economic way.

  • We conduct comprehensive experiments on three fine-grained benchmark datasets, and our proposed model achieves superior performance over competing solutions on all these datasets.

2 Related work

As our work is related to both fine-grained image recognition and generic few-shot learning, in this section we will briefly review these two topics separately.

2.1 Fine-grained image recognition

Fine-grained recognition is a challenging problem and has recently emerged as an active topic [9, 11, 23]. Over the past decade, fine-grained recognition has achieved high performance levels thanks to the integration of powerful deep learning techniques with large annotated training datasets. A number of effective fine-grained recognition methods have been developed in the literature [2, 5, 7, 8, 16, 17, 28]. Among them, some works, e.g., [8, 17], attempted to learn a more discriminative feature representation by developing powerful deep models. Some methods aligned the objects in fine-grained images to eliminate pose variations and the influence of camera position, e.g., [2, 16]. Moreover, some of them relied on localizing discriminative parts with or without strong supervision, e.g., [5, 7, 16].

However, current fine-grained recognition systems assume a set of categories known a priori, despite the obviously dynamic and open nature of the visual world [1, 26, 25]. Compared with previous work, we are the first to study fine-grained image recognition in a challenging few-shot learning setting, where the model is required to recognize novel fine-grained categories from only a few labeled images.

2.2 Generic few-shot image recognition

Nowadays, few-shot image recognition (a.k.a. few-shot learning or low-shot learning) [1, 22] has attracted more and more attention in computer vision and pattern recognition. This line of research explores the possibility of endowing learning systems with the ability to rapidly learn novel categories from a few examples. More specifically, these systems are able to learn new concepts on the fly, from few or even a single example as in one-shot learning. Few-shot image recognition is usually tackled by using generative models [15, 19] or, in a discriminative setting, by using ad-hoc solutions such as exemplar support vector machines [18]. More recently, many methods have solved it in a learning-to-learn formulation [4, 24, 25, 26, 27].

However, previous few-shot image recognition studies all focused on generic images (e.g., images of the ImageNet [20] and CIFAR [12] datasets) or generic patterns (e.g., characters of the Omniglot [14] dataset). Compared with those tasks, we consider a novel few-shot image recognition topic, i.e., few-shot fine-grained image recognition. What most distinguishes our topic from generic few-shot image recognition is that fine-grained recognition relies on more subtle image cues, which makes it considerably more challenging. We demonstrate that the proposed model, especially our piecewise mappings component, can capture the subtle differences in a fine-grained scenario from limited training data, even a single shot.

3 Learning few-shot fine-grained learners

In this section, we first present our learning strategy for FSFG and introduce the relevant notations. A detailed elaboration of the various aspects of our method then follows in the subsequent sections.

3.1 Learning strategy and notations

Our work is built upon the framework of meta-learning, which treats the classifier generation process as a mapping function from the few labeled training samples of a category, called “exemplars” hereafter, to the corresponding category classifier. Fig. 2 shows the key idea of this learning scheme. This exemplar-to-classifier mapping is learned on an auxiliary training set $\mathcal{A}$, which contains labeled training images $\{(x_i, y_i)\}$, where $x_i$ is an example image and $y_i$ is its corresponding label. Once the mapping function is learned, it is applied on a testing set $\mathcal{T}$ to evaluate its performance, where $\mathcal{T}$ contains images of novel categories that do not appear in $\mathcal{A}$.

To train the mapping function, we randomly sample a set of “meta-training sets” from $\mathcal{A}$. Each meta-training set (corresponding to a training episode) contains $C$ randomly chosen categories and a few images associated with them. A meta-training set is composed of an “exemplar set” $\mathcal{E}$ and a “query set” $\mathcal{Q}$ to mimic the scenario at the testing stage. Specifically, $\mathcal{E}$ contains $K$ (e.g., 1 or 5) exemplar images per category. The query set $\mathcal{Q}$ is coupled with $\mathcal{E}$ (it has the same categories), but has no overlapping images. Each category of $\mathcal{Q}$ contains $N_q$ query images. During training, $\mathcal{E}$ is fed into the to-be-learned mapping function $\mathcal{M}$ to generate the category classifiers $\{w_c\}_{c=1}^{C}$:

$\{w_c\}_{c=1}^{C} = \mathcal{M}(\mathcal{E}; \theta). \qquad (1)$

Then, the classifiers $\{w_c\}$ are subsequently applied to $\mathcal{Q}$ to evaluate the classification loss. The training objective then amounts to learning the mapping function by minimizing this classification loss. This process is formally written as follows:

$\min_{\theta} \; \sum_{(\mathcal{E}, \mathcal{Q})} \mathcal{L}\big(\mathcal{M}(\mathcal{E}; \theta), \mathcal{Q}\big), \qquad (2)$

where $\theta$ denotes the model parameters of the mapping function (from $\mathcal{E}$ to $\{w_c\}$), $\mathcal{L}$ is the loss function, and $\mathcal{L}(\mathcal{M}(\mathcal{E}; \theta), \mathcal{Q})$ denotes applying the category classifiers generated from the exemplar set $\mathcal{E}$ on the query set $\mathcal{Q}$.
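To make the episode construction concrete, the following is a minimal Python sketch of sampling one meta-training set $(\mathcal{E}, \mathcal{Q})$ from $\mathcal{A}$. The function name and the default values of C, K, and N_q are illustrative assumptions, not settings taken from the paper.

```python
import random
from collections import defaultdict

def sample_episode(dataset, C=50, K=5, N_q=5):
    """Sample one meta-training set (an exemplar set and a query set)
    from the auxiliary set A: a sketch. `dataset` is a list of
    (image, label) pairs; the default C, K, and N_q are illustrative."""
    by_class = defaultdict(list)
    for img, lbl in dataset:
        by_class[lbl].append(img)
    episode_classes = random.sample(list(by_class), C)  # category subset
    exemplars, queries = {}, {}
    for c in episode_classes:
        picked = random.sample(by_class[c], K + N_q)    # without replacement
        exemplars[c], queries[c] = picked[:K], picked[K:]  # disjoint sets
    return exemplars, queries
```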

Figure 2: Key idea of the proposed FSFG model. In each episode, we sample an exemplar set $\mathcal{E}$ from $\mathcal{A}$, which is composed of a subset of categories (three categories in this example), where each category contains few exemplars (the images with red borders). We wish to learn a mapping that maps these exemplars into their corresponding category classifiers (the dashed lines). The mapping parameters are learned so that these classifiers can correctly distinguish the query images (the images with yellow borders).

3.2 Model

We implement the above exemplar-to-classifier mapping with a trainable neural network. Fig. 3 shows the overall architecture of the network. As we can see, the network is composed of two modules: a representation learning module and a classifier mapping module. While the former adopts a bilinear CNN structure to encode the discriminative information of an exemplar image into a high-dimensional feature vector, the latter, as the key part of the network, maps the intermediate image representation into a category classifier. In the next two sub-sections, we elaborate on these two modules in more detail.

Figure 3: Overall structure of our proposed FSFG model. On the left is the first component (the bilinear pooling module) for representation learning. On the right, the second component (the classifier mapping module) maps the intermediate image features into the category classifiers.

3.2.1 Representation learning

We employ a bilinear CNN (BCNN) structure [17] to learn the image representation, considering its state-of-the-art performance in fine-grained image recognition. BCNN consists of two feature extractors whose outputs are multiplied using the outer product at each location of the image and pooled to obtain an image representation. Concretely, given two convolutional networks ($\mathrm{CNN}^A$ and $\mathrm{CNN}^B$) as the two streams of BCNN, we assume their outputs are re-organized into $F^A \in \mathbb{R}^{d_A \times L}$ and $F^B \in \mathbb{R}^{d_B \times L}$, where $d_A$ and $d_B$ denote the dimensionality of the outputs and $L$ denotes the number of spatial locations. Then, at location $l$, the bilinear representation is

$B_l = f^A_l (f^B_l)^\top \in \mathbb{R}^{d_A \times d_B}, \qquad (3)$

where $f^A_l$ and $f^B_l$ denote the $l$-th columns of $F^A$ and $F^B$, respectively.

The vectorized versions of $B_l$ are pooled over the entire image to derive the image representation $x$ (for interpretation simplicity we let $d_A = d_B = d$), that is,

$x = \sum_{l=1}^{L} \mathrm{vec}(B_l) \in \mathbb{R}^{d^2}. \qquad (4)$

With the outer product computation, the bilinear structure modulates one feature stream with the other. Thus, the BCNN feature $x$ can be viewed as a set of $d$ sub-vectors $\{x_1, \ldots, x_d\}$:

$x = [x_1^\top, x_2^\top, \ldots, x_d^\top]^\top, \quad x_k \in \mathbb{R}^{d}, \qquad (5)$

where $x_k$ collects the features of one stream modulated by the $k$-th feature channel of the other. This is similar to the multiplicative feature interactions in attention mechanisms [17]. We observe that each modulated feature map tends to focus on an implicit “part” of an object, and thus $x_k$ can be viewed as the feature description for that “part”. In our implementation, we train the bilinear CNN following the same procedure as [17] and use it as the image representation extractor.
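For concreteness, the following is a PyTorch-style sketch of the bilinear pooling of Eqs. (3)-(5), assuming the two streams share the same spatial grid; the function name and tensor layout are our own choices.

```python
import torch

def bilinear_pool(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Bilinear pooling (Eqs. 3-4), a sketch assuming both streams share
    the same spatial grid.

    feat_a: (B, d_a, L) conv features of stream A at L locations.
    feat_b: (B, d_b, L) conv features of stream B.
    Returns (B, d_a * d_b) pooled bilinear features.
    """
    # Outer product at every location, summed over locations:
    # x = sum_l f_l^A (f_l^B)^T, then vectorized per image.
    outer = torch.einsum('bal,bkl->bak', feat_a, feat_b)  # (B, d_a, d_b)
    return outer.flatten(1)

# Before flattening, outer[:, k, :] is the k-th sub-vector of Eq. (5):
# the pooled features of stream B modulated by the k-th channel of stream A.
```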

To represent a set of exemplar images belonging to category $c$, we simply compute the mean image representation as the category-level representation $\bar{x}^c$:

$\bar{x}^c = \frac{1}{K} \sum_{i=1}^{K} x^{(i)}, \qquad (6)$

where $x^{(1)}, \ldots, x^{(K)}$ are the representations of the exemplars with label $c$.
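Since Eq. (6) is a plain average, the category-level representation is a one-liner in PyTorch; the shapes below assume $K = 5$ and $d = 256$ purely for illustration.

```python
import torch

# Eq. (6) is a plain average over the K exemplar features of a category.
exemplar_feats = torch.randn(5, 256 * 256)   # K = 5 exemplars, d^2 features
category_repr = exemplar_feats.mean(dim=0)   # category-level representation
```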

3.2.2 Classifier mapping

Now that the information of each category, identified by few exemplars, has been encoded into a bilinear feature vector, the task of the classifier mapping module is to map these intermediate category-level representations into their corresponding category classifiers. Mathematically, this module computes a $d^2$-dimensional classifier $w_c$ for each category $c$ through a mapping $\mathcal{M}: \bar{x}^c \mapsto w_c$.

A straightforward way to realize this mapping is a global mapping, either linear or nonlinear. For example, a linear mapping can be

$w_c = W \bar{x}^c + b, \qquad (7)$

where $W$ and $b$ denote the parameters of the global mapping. However, this strategy suffers from two drawbacks. First, as the feature $\bar{x}^c$ is supposed to encode category-level information, its distribution can be highly complex, which poses a great challenge for the global mapping to find a decision boundary in such a complex feature space. Second, since the bilinear feature tends to be high dimensional, this mapping may cause a parameter explosion, which makes the network training hard or infeasible.

To mitigate these problems, we propose a novel “piecewise mappings” strategy, which exploits the structure of the bilinear features. As analyzed in Sec. 3.2.1, the bilinear feature can be viewed as a set of sub-vectors, with each sub-vector describing an implicit “part” of the object. Intuitively, we can test whether an object falls into the category described by the exemplars by checking whether each “part” of it is compatible with the exemplars. This motivates us to apply a piecewise mapping that first maps each sub-vector $\bar{x}^c_k$ into its corresponding sub-classifier $w^c_k$, and then combines these sub-classifiers to generate the global category classifier. Fig. 3 shows this mapping in more detail.

Concretely, a sub-vector $\bar{x}^c_k$ is first mapped into a sub-classifier $w^c_k$ via a nonlinear multilayer perceptron (MLP) $g_k$ as

$w^c_k = g_k(\bar{x}^c_k). \qquad (8)$

We learn $d$ such MLPs to derive the $d$ sub-classifiers $\{w^c_1, \ldots, w^c_d\}$, which are then concatenated to generate the global category classifier $w_c$:

$w_c = \big[(w^c_1)^\top, (w^c_2)^\top, \ldots, (w^c_d)^\top\big]^\top. \qquad (9)$

Essentially, our model simplifies the global mapping approach by assuming that the sub-classifier for the $k$-th sub-vector is solely determined by the information from the $k$-th sub-vector in the exemplar set. Despite resulting in a more restrictive mapping function, this assumption makes the network much easier to train. Note that this mapping scheme significantly reduces the number of model parameters involved in classifier generation. Taking a one-layer mapping as an example and assuming $d = 256$ (so the bilinear feature is $d^2 = 65{,}536$-dimensional), the global mapping requires more than $4 \times 10^9$ parameters ($d^2 \times d^2$), whereas the proposed piecewise mappings reduce this number to about $1.7 \times 10^7$ ($d$ mappings of size $d \times d$). In addition, although there are parameter-economic variants of BCNN [6], our piecewise classifier mappings still show better performance. This suggests that the proposed classifier mapping function brings benefits beyond merely reducing the model size (cf. Table 2).
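A minimal PyTorch sketch of the piecewise mappings module (Eqs. 8-9) is given below. The class name, hidden width, and depth are our assumptions; the paper's exact layer sizes are not reproduced here.

```python
import torch
import torch.nn as nn

class PiecewiseMapping(nn.Module):
    """Piecewise classifier mappings (Eqs. 8-9): a sketch, not the
    paper's exact configuration. The bilinear feature is split into d
    sub-vectors, each mapped by its own MLP g_k to a sub-classifier,
    and the sub-classifiers are concatenated into the classifier w_c."""

    def __init__(self, d: int = 256, hidden: int = 512, depth: int = 3):
        super().__init__()
        def make_mlp() -> nn.Sequential:
            layers, in_dim = [], d
            for _ in range(depth - 1):
                layers += [nn.Linear(in_dim, hidden), nn.ELU()]
                in_dim = hidden
            layers.append(nn.Linear(in_dim, d))  # sub-classifier w_k in R^d
            return nn.Sequential(*layers)
        self.d = d
        # One mapping g_k per sub-vector. With depth=1 each g_k is a single
        # d x d linear map: d*d*d ~ 1.7e7 parameters in total, versus
        # (d*d)^2 ~ 4.3e9 for a one-layer global mapping.
        self.mlps = nn.ModuleList([make_mlp() for _ in range(d)])

    def forward(self, x_bar: torch.Tensor) -> torch.Tensor:
        # x_bar: (C, d*d) category-level bilinear features (Eq. 6).
        sub = x_bar.view(-1, self.d, self.d)                    # d sub-vectors
        w = [g(sub[:, k, :]) for k, g in enumerate(self.mlps)]  # Eq. (8)
        return torch.cat(w, dim=1)                              # Eq. (9)
```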

3.2.3 Network training

Given a query sample $q$ with label $y$, we compute its prediction distribution via softmax as

$p(y = c \mid q) = \frac{\exp(w_c^\top x_q)}{\sum_{c'=1}^{C} \exp(w_{c'}^\top x_q)}, \qquad (10)$

where $x_q$ is the bilinear feature of $q$. The model parameters are trained by minimizing the negative log-likelihood $-\log p(y \mid q)$. With this, we can summarize the training in an episode as follows. First, we select an exemplar set $\mathcal{E}$ from $\mathcal{A}$ and generate the classifiers $\{w_c\}$. Then, we establish a query set $\mathcal{Q}$, and the model parameters are optimized by minimizing the accumulated loss over $\mathcal{Q}$. Algorithm 1 illustrates the training process in more detail; a code sketch follows the algorithm.

Require: $\mathcal{A}$ is an auxiliary training set with images belonging to $N_{\mathcal{A}}$ categories; $\mathcal{A}_n$ denotes the subset of $\mathcal{A}$ containing all images belonging to the $n$-th category; $C$ denotes the number of categories in the exemplar set as well as in the query set of an episode; $K$ denotes the number of exemplars per category in $\mathcal{E}$; $N_q$ denotes the number of query images per category in $\mathcal{Q}$; $d$ denotes the number of piecewise mappings; RandomSample($S$, $m$) denotes a set of $m$ elements chosen uniformly at random from set $S$, without replacement; $V$ denotes a category set and $V_c$ denotes its $c$-th element.
1:  Select a category subset $V \leftarrow$ RandomSample($\{1, \ldots, N_{\mathcal{A}}\}$, $C$) for an episode;
2:  for $c = 1, \ldots, C$ do
3:     Select $\mathcal{E}_c \leftarrow$ RandomSample($\mathcal{A}_{V_c}$, $K$);
4:     Compute the category-level representation $\bar{x}^c$ following Eq. 6;
5:     Generate the category classifier $w_c$ by Eq. 8 and Eq. 9;
6:     Select $\mathcal{Q}_c \leftarrow$ RandomSample($\mathcal{A}_{V_c} \setminus \mathcal{E}_c$, $N_q$);
7:  end for
8:  Initialize the loss $J \leftarrow 0$;
9:  for $c = 1, \ldots, C$ do
10:     for $(q, y)$ in $\mathcal{Q}_c$ do
11:         $J \leftarrow J - \log p(y = c \mid q)$ (cf. Eq. 10);
12:     end for
13:  end for
14:  $J \leftarrow J / (C \cdot N_q)$;
15:  Update the model parameters by minimizing $J$;
16:  return the $d$ piecewise mappings $\{g_k\}_{k=1}^{d}$.
Algorithm 1 Training-episode loss computation for the proposed piecewise mappings.
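The episode loss of Algorithm 1 can be written compactly in PyTorch; below is a sketch under the assumption that `feature_net` returns the (frozen) bilinear features and `mapping_net` implements the piecewise mappings above.

```python
import torch
import torch.nn.functional as F

def episode_loss(feature_net, mapping_net, exemplars, queries):
    """Episode loss of Algorithm 1, a PyTorch-style sketch.

    exemplars: (C, K, ...) exemplar images; queries: (C, N_q, ...) query
    images. `feature_net` is the frozen bilinear feature extractor and
    `mapping_net` the piecewise classifier mappings."""
    C, K = exemplars.shape[:2]
    N_q = queries.shape[1]
    with torch.no_grad():  # the BCNN is fine-tuned beforehand and frozen
        ex = feature_net(exemplars.flatten(0, 1)).view(C, K, -1)
        qf = feature_net(queries.flatten(0, 1))      # (C * N_q, d^2)
    w = mapping_net(ex.mean(dim=1))                  # Eqs. 6, 8, 9: (C, d^2)
    logits = qf @ w.t()                              # classifier scores
    labels = torch.arange(C, device=qf.device).repeat_interleave(N_q)
    # Cross-entropy equals the averaged negative log-likelihood of Eq. (10),
    # matching the accumulation in lines 8-14 of Algorithm 1.
    return F.cross_entropy(logits, labels)
```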

4 Experiments

In this section, we first describe the experimental setups, implementation details and the datasets used in experiments. Then, we present the few-shot fine-grained image recognition results on three fine-grained benchmark datasets. Finally, ablation studies are given to further evaluate the effectiveness of our proposed classifier mapping strategy.

4.1 Datasets, setups and implementation details

Our experiments are conducted on three fine-grained benchmark datasets: CUB Birds (200 categories of birds, 11,788 images) [23], Stanford Dogs (120 categories of dogs, 20,580 images) [9], and Stanford Cars (196 categories of cars, 16,185 images) [11]. For each dataset, we randomly split its original image categories into two disjoint subsets: one as the auxiliary training set $\mathcal{A}$, and the other as the FSFG testing set $\mathcal{T}$. Table 1 presents the details of the category split. For each category in $\mathcal{A}$, we follow the raw splits provided by these datasets to divide the data into training and validation parts: the former is used to train the parameters, and the latter is used to monitor the learning process.

                CUB Birds   Stanford Dogs   Stanford Cars
N_total            200           120             196
N_A                150            90             147
N_T                 50            30              49
Table 1: Category split for the three datasets. N_total denotes the total number of categories in a dataset, N_A the number of categories in $\mathcal{A}$, and N_T the number of categories in $\mathcal{T}$.

To mimic the testing condition, in each training episode we set the number of categories in the exemplar set to be the same as the number of categories in the testing set $\mathcal{T}$, i.e., $C = N_T$. Further, we set $K = 1$ ($K = 5$) for one-shot (five-shot) learning, and the number of query images per category, $N_q$, is fixed across all settings. Similarly, during the testing phase, for each category in $\mathcal{T}$, we randomly choose one exemplar (five exemplars) for one-shot (five-shot) learning, and additional samples are randomly selected to evaluate the recognition performance. We repeat this evaluation process twenty times, and the mean classification accuracy is used as the evaluation criterion.
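A sketch of this evaluation protocol is shown below; `sample_split` and `classify_episode` are hypothetical helpers standing in for the episode construction and classification steps described above.

```python
import statistics

def evaluate_fsfg(test_set, classify_episode, shots=1, repeats=20):
    """FSFG evaluation protocol, a sketch. `sample_split` and
    `classify_episode` are hypothetical helpers: the former draws
    exemplars and held-out evaluation samples for every category in T,
    the latter builds classifiers from the exemplars and returns the
    accuracy on the evaluation samples."""
    accs = []
    for _ in range(repeats):
        exemplars, eval_samples = sample_split(test_set, shots)  # hypothetical
        accs.append(classify_episode(exemplars, eval_samples))
    return statistics.mean(accs), statistics.stdev(accs)
```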

In theory, we can choose any network structure as the base network for our bilinear feature learning module. Since our key contribution lies in the classifier mapping scheme, we choose AlexNet [13] as the two streams of BCNN, considering the trade-off between its representation capacity and computational efficiency. Specifically, we adopt the AlexNet model pre-trained on the Places 205 database [29] to initialize the representation learning parameters. The reason why we use the Places dataset [29] instead of ImageNet [20] is to prevent the FSFG testing categories from being present in the pre-training dataset. We first fine-tune the bilinear feature learning module on the auxiliary training set $\mathcal{A}$ and then freeze it during the classifier learning process. For the classifier mapping module, unless otherwise stated, we choose the mapping function $g_k$ to be a three-layer MLP with Exponential Linear Units (ELU) [3] as the non-linear activation function in each layer. SGD is used to optimize the parameters. We implement our model using the open-source library PyTorch.

4.2 Main results

We present the main results of FSFG by first introducing the baseline methods and then reporting the empirical results on the three datasets.

4.2.1 Comparison methods

In our experiments, we compare our proposed model against the following competitive baselines. Note that, apart from the original bilinear CNN, we also implement a compact bilinear CNN [6] as the image feature extractor to facilitate the comparison; it enables much lower feature dimensionality while retaining almost the same discriminative ability for classification [6]. For compact bilinear pooling, we follow the optimal settings suggested in [6]: the dimensionality of the compact bilinear representation is 8,192-d (much less than the 65,536-d of full bilinear pooling). In our empirical results, the results of compact bilinear pooling are denoted as “CB” in Table 2, and the results of full bilinear pooling are denoted as “FB”.

  • $k$-NN ($k$-nearest neighbors): Following the testing setting introduced in Sec. 4.1, we choose one sample (five samples) for each category in $\mathcal{T}$ as exemplar(s) and further samples from the same categories for evaluation. We use the BCNN (either the original or the compact version) fine-tuned on $\mathcal{A}$ as the image representation extractor, and the nearest-neighbor rule is adopted as the classifier to categorize the evaluation images; a sketch is given after this list. Specifically, the image representations are first $\ell_2$-normalized and the cosine distance is used as the distance metric. Note that, for five-shot learning, the representations of the five exemplars are averaged before normalization to serve as the category-level representation. This process is repeated twenty times, as for our method. (This applies to all other baselines as well, so we omit it when introducing the following baselines.)

  • SVM (support vector machine): After obtaining the bilinear representations for the exemplars of the testing categories in $\mathcal{T}$, we train a classifier for each category based on these representations. In particular, for one-shot learning, this baseline becomes exemplar-SVMs [18].

  • Siamese-Net [10]: As a standard metric-learning strategy, Siamese-Net is a competitive solution for few-shot learning. It learns a feature space in which images of the same category are close while images belonging to different categories are separated. We train a Siamese-Net on $\mathcal{A}$ by sampling pair-wise examples and the corresponding binary labels (“1” indicates that the two examples are from the same category and “0” that they are not). Similar to [10], the regularized cross-entropy loss on the binary classifier is used. During evaluation, the Siamese-Net ranks the similarities between the exemplars and the testing data.

  • Global mapping: As mentioned in Sec. 3.2.2, an alternative to our proposed piecewise classifier mappings is a global mapping. It follows the idea of mapping a global feature to a global classifier by applying the mapping function directly to the category-level representation.
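As referenced in the $k$-NN item above, here is a sketch of the nearest-neighbor baseline with $\ell_2$-normalized features and cosine similarity; the function name and tensor shapes are our own choices.

```python
import torch
import torch.nn.functional as F

def knn_predict(exemplar_feats, exemplar_labels, query_feats):
    """Nearest-neighbor baseline, a sketch: l2-normalize the bilinear
    features and assign each query the label of the most cosine-similar
    category-level exemplar representation.

    exemplar_feats: (C, D), one (shot-averaged) feature per category.
    exemplar_labels: (C,) category labels.
    query_feats: (Q, D) features of the evaluation images."""
    ex = F.normalize(exemplar_feats, dim=1)  # l2 normalization
    qf = F.normalize(query_feats, dim=1)
    sims = qf @ ex.t()                       # cosine similarities, (Q, C)
    return exemplar_labels[sims.argmax(dim=1)]
```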

Method                      CUB Birds               Stanford Dogs           Stanford Cars
                       1-shot      5-shot       1-shot      5-shot      1-shot      5-shot
$k$-NN (FB)            38.85±3.43  55.58±0.84   24.53±2.36  40.30±2.34  26.99±2.91  43.40±1.68
$k$-NN (CB)            24.52±1.80  41.85±1.51   18.31±1.81  32.37±1.15  21.25±1.78  39.42±1.57
SVM (FB)               34.47±1.93  59.19±1.28   23.37±3.18  39.50±1.07  25.66±1.53  51.07±1.51
SVM (CB)               24.94±1.97  41.93±1.69   18.25±2.83  30.50±1.76  21.34±1.94  39.43±1.46
Siamese-Net (FB)       37.38±1.53  57.73±1.38   23.99±1.66  39.69±1.17  25.81±1.67  48.95±1.31
Siamese-Net (CB)       26.58±2.47  43.51±1.53   19.28±2.60  31.49±1.22  22.41±1.55  40.07±1.88
Global mapping (FB-)   24.12±1.39  34.59±1.77   20.55±1.48  30.93±1.91  20.50±1.60  30.58±1.82
Global mapping (CB)    25.42±2.22  36.37±1.04   20.77±2.75  32.33±2.11  20.24±1.94  32.66±1.86
Ours                   42.10±1.96  62.48±1.21   28.78±2.33  46.92±2.00  29.63±2.38  52.28±1.46
Table 2: Comparison results (mean±std.) on three fine-grained datasets. The highest average accuracy of each column is marked in bold. Significance of our proposed model performing better/worse than a corresponding method is assessed by the pairwise $t$-test with confidence level 0.05. “FB” stands for using the full bilinear pooling representations, and “CB” is for using compact bilinear pooling.

4.2.2 Comparison results

Table 2 presents the average accuracy rates of FSFG on the novel categories of three fine-grained datasets. For each dataset, we report both one-shot and five-shot recognition results. As shown in that table, our proposed model consistently and significantly outperforms the other baseline methods on these datasets.

Generally, we see that the simple baseline $k$-NN performs well; it even outperforms the more sophisticated baselines in some settings, e.g., on Stanford Dogs. This is due to the discriminative capacity of the bilinear CNN features. SVM shows a more obvious advantage over $k$-NN when exploiting five training exemplars. Siamese-Net, as another discriminative method, achieves performance comparable to SVM but is outperformed by our method, which reflects that our meta-learning strategy generalizes better to unseen/novel fine-grained categories. For the global mapping, because BCNN generates image representations of ultra-high dimensionality (i.e., 65,536-d in our case), it is infeasible to learn a global mapping on such high-dimensional feature vectors. In order to realize the global mapping, we apply an additional linear mapping to first reduce the 65,536-d features to lower-dimensional feature vectors, and conduct the global mapping on these reduced features; this variant is denoted as “Global mapping (FB-)” in Table 2. The global mapping is also implemented as a three-layer network. As seen, our proposed piecewise mappings significantly outperform the global mapping. In the ablation studies, we further compare these two types of mapping schemes.

Another interesting observation is that the few-shot recognition performance gap between FB and CB is large. Note that both FB and CB are trained on the same training set and achieve comparable classification performance on the validation set. This phenomenon may be explained by the CB feature being less suitable for similarity matching (i.e., the situation on the testing set). It is an open problem worth further exploration.

4.3 Ablation studies

To further inspect our piecewise mappings strategy for FSFG, we conduct ablation experiments on two aspects. First, we compare the global mapping and the piecewise mappings in a fairer setting. Second, we investigate the influence of variations of the mapping function on the FSFG performance.

4.3.1 Piecewise mappings vs. global mapping

As aforementioned, due to the high dimensionality of the bilinear feature (65,536-d), it is infeasible in practice to learn a non-linear (or even a simple linear) global mapping on the original bilinear features. To perform the global mapping, we modify the original AlexNet structure by reducing the number of units in the last convolution layer, which shrinks the bilinear feature to a dimensionality at which learning a non-linear global mapping becomes feasible. In the experiments, a three-layer MLP acts as the global mapping, and the number of hidden units is selected via cross-validation from a set of candidate values, with the best-performing setting being used.

For our proposed piecewise mappings, based on the modified BCNN, the piecewise mapping functions are applied to the correspondingly lower-dimensional sub-vectors, with one piecewise mapping per sub-vector. Each of them is implemented as a three-layer network. ELU [3] is used as the activation function for both the global mapping and the piecewise mappings.

Table 3 presents the comparison of the piecewise mappings vs. the global mapping. The piecewise mappings still significantly outperform the global mapping on all three datasets. These observations serve as further evidence for the superiority of our proposed method.

Method                          CUB Birds               Stanford Dogs           Stanford Cars
                           1-shot      5-shot       1-shot      5-shot      1-shot      5-shot
Global mapping             27.36±1.64  38.05±1.55   19.55±2.27  32.53±2.35  16.06±2.06  26.17±1.02
Piecewise mappings (Ours)  31.00±2.85  48.80±2.33   23.07±3.24  41.02±2.50  18.98±2.18  31.51±1.38
Table 3: Comparison results of the global mapping and the piecewise mappings (our proposal) on three datasets. The highest average accuracy of each column is marked in bold. The piecewise mappings outperform the global mapping with confidence level 0.05 by the pairwise $t$-test.

Apart from the above quantitative evaluation, we present some qualitative results by visualizing the generated category classifiers of the global mapping and the piecewise mappings in 2D space in Fig. 4. Dots with the same color denote classifiers generated from different exemplar images of the same category in $\mathcal{T}$; different colors represent classifiers of different categories. We randomly select five exemplars per category to conduct five-shot recognition and repeat this fifty times; thus, one category yields fifty versions of classifiers (fifty dots in the same color). As shown in the figure, the classifiers generated by the piecewise mappings exhibit better category separability and tighter intra-category aggregation. This, in some sense, reflects that the classifiers generated by our method tend to capture the essence of the corresponding categories and maintain better distinguishing capacity.

(a) By global mapping
(b) By our piecewise mappings
Figure 4: Visualization of the category classifiers generated by global mapping and piecewise mappings in 2D space by t-SNE [21]. Each dot denotes a generated classifier and different colors represent different categories. For each category, fifty classifiers are shown, each of which is obtained via randomly sampled five exemplars. This visualization is based on CUB Birds. (The figures are best viewed in color.)

4.3.2 Mapping functions $g_k$ with different numbers of layers

We implement the mapping functions $g_k$ in our classifier mapping module as MLPs. Since depth plays an important role in determining the modeling capacity of MLPs, in this part we investigate how the FSFG performance changes w.r.t. the number of layers in $g_k$ by varying the depth of the mapping functions. The ablation study results are shown in Fig. 5.

Generally, we can see that a single-layer mapping leads to the worst performance, due to its limited modeling capacity, which cannot realize the complex feature-to-classifier mapping. The FSFG performance rises when another layer is added and peaks when three-layer mappings are used. Beyond that point, continuing to increase the depth of the mapping functions harms the recognition performance, especially in the one-shot scenario. This study confirms the need for a highly non-linear mapping to learn a satisfactory classifier. A sketch of building $g_k$ with a configurable depth follows.
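As noted above, the following sketch constructs one mapping function $g_k$ with a configurable number of layers, mirroring the MLP builder in the earlier module sketch; the hidden width is an assumption.

```python
import torch.nn as nn

def make_g(num_layers: int, d: int = 256, hidden: int = 512) -> nn.Sequential:
    """Build one mapping function g_k with a configurable number of layers,
    as varied in this ablation; the hidden width is our assumption. With
    num_layers = 1 the mapping degenerates to a single linear map."""
    layers, in_dim = [], d
    for _ in range(num_layers - 1):
        layers += [nn.Linear(in_dim, hidden), nn.ELU()]
        in_dim = hidden
    layers.append(nn.Linear(in_dim, d))
    return nn.Sequential(*layers)
```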

(a) 1-shot on CUB Birds
(b) 5-shot on CUB Birds
(c) 1-shot on Stanford Dogs
(d) 5-shot on Stanford Dogs
(e) 1-shot on Stanford Cars
(f) 5-shot on Stanford Cars
Figure 5: Ablation study on $g_k$ with different numbers of layers. In each sub-figure, the horizontal axis is the number of layers and the vertical axis represents the accuracy rate.

5 Conclusion

In this paper, we have presented the first study of fine-grained image recognition in a practical and challenging few-shot learning setting, which requires learning the classifier for a fine-grained category identified by few exemplars. To address this problem, we proposed an end-to-end trainable network which is inspired by the bilinear CNN model and tailored for the fine-grained few-shot learning task. The key novelty of our network is the piecewise classifier mappings module. By exploiting the special structure of bilinear CNN features, it decomposes the exemplar-to-classifier mapping into a set of more attainable “part”-to-“part classifier” mappings. As a by-product, it significantly reduces the model parameters. Through comprehensive experiments on three standard fine-grained image classification datasets, our method showed promising results.

In the future, it appears promising to use transfer learning techniques, leveraging the experience already gained (e.g., the classifiers of the known categories) on the base set to generalize the learning ability to the novel set.

References

  • [1] L. Bertinetto, J. F. Henriques, J. Valmadre, P. H. S. Torr, and A. Vedaldi. Learning feed-forward one-shot learners. In Advances in Neural Information Processing Systems, pages 523–531, Barcelona, Spain, Dec. 2016.
  • [2] S. Branson, G. V. Horn, S. Belongie, and P. Perona. Bird species categorization using pose normalized deep convolutional nets. In British Machine Vision Conference, pages 1–14, Nottingham, England, Sept. 2014.
  • [3] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). In Proceedings of International Conference on Learning Representations, pages 1–14, San Juan, Puerto Rico, May. 2016.
  • [4] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of International Conference on Machine Learning, pages 1–10, Sydney, Australia, Aug. 2017.
  • [5] J. Fu, H. Zheng, and T. Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 4438–4446, Honolulu, HI, Jul. 2017.
  • [6] Y. Gao, O. Beijbom, N. Zhang, and T. Darrell. Compact bilinear pooling. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 317–326, Las Vegas, NV, Jun. 2016.
  • [7] S. Huang, Z. Xu, D. Tao, and Y. Zhang. Part-stacked CNN for fine-grained visual categorization. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1173–1182, Las Vegas, NV, Jun. 2016.
  • [8] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2008–2016, Montréal, Canada, Dec. 2015.
  • [9] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei. Novel dataset for fine-grained image categorization. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshop on Fine-Grained Visual Categorization, pages 806–813, Colorado Springs, CO, Jun. 2011.
  • [10] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In Proceedings of International Conference on Machine Learning, pages 1–8, New York, NY, Jun. 2016.
  • [11] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3D object representations for fine-grained categorization. In Proceedings of IEEE International Conference on Computer Vision Workshop on 3D Representation and Recognition, pages 554–561, Sydney, Australia, Dec. 2013.
  • [12] A. Krizhevsky and G. E. Hinton. Convolutional deep belief networks on CIFAR-10. Technical Report, 2010.
  • [13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, Lake Tahoe, NV, Dec. 2012.
  • [14] B. M. Lake, R. Salakhutdinov, J. Gross, and J. B. Tenenbaum. One shot learning of simple visual concepts. In Proceedings of Annual Meeting of the Cognitive Science Society, pages 1–6, Boston, MA, 2011.
  • [15] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • [16] D. Lin, X. Shen, C. Lu, and J. Jia. Deep LAC: Deep localization, alignment and classification for fine-grained recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1666–1674, Boston, MA, Jun. 2015.
  • [17] T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In Proceedings of IEEE International Conference on Computer Vision, pages 1449–1457, Santiago, Chile, Dec. 2015.
  • [18] T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of exemplar-SVMs for object detection and beyond. In Proceedings of IEEE International Conference on Computer Vision, pages 89–96, Barcelona, Spain, Nov. 2011.
  • [19] D. J. Rezende, S. Mohamed, I. Danihelka, K. Gregor, and D. Wierstra. One-shot generalization in deep generative models. In Proceedings of International Conference on Machine Learning, pages 1521–1529, New York, NY, Jun. 2016.
  • [20] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [21] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
  • [22] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, Barcelona, Spain, Dec. 2016.
  • [23] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD birds-200-2011 dataset. Technical Report CNS-TR-2011-001, 2011.
  • [24] Y.-X. Wang and M. Hebert. Learning from small sample sets by combining unsupervised meta-training with CNNs. In Advances in Neural Information Processing Systems, pages 244–252, Barcelona, Spain, Dec. 2016.
  • [25] Y.-X. Wang and M. Hebert. Model recommendation: Generating object detectors from few samples. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1619–1628, Boston, MA, Jun. 2015.
  • [26] Y.-X. Wang and M. Hebert. Learning to learn: Model regression networks for easy small sample learning. In Proceedings of European Conference on Computer Vision, pages 616–634, Amsterdam, Netherlands, Oct. 2016.
  • [27] S. Yeung, V. Ramanathan, O. Russakovsky, L. Shen, G. Mori, and L. Fei-Fei. Learning to learn from noisy web videos. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, Honolulu, HI, Jul. 2017.
  • [28] Y. Zhang, X.-S. Wei, J. Wu, J. Cai, J. Lu, V.-A. Nguyen, and M. N. Do. Weakly supervised fine-grained categorization with part-based image representation. IEEE Transactions on Image Processing, 25(4):1713–1725, 2016.
  • [29] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems, pages 487–495, Montréal, Canada, Dec. 2014.