This problem is particularly challenging due to the low inter-category variance yet high intra-category discordance caused by various object postures, illumination conditions, distances from the cameras, etc.
In general, the majority of fine-grained classification approaches need to be fed with a large amount of training data before obtaining a trustworthy classifier [6, 7, 8, 9, 10, 11]. However, labeling fine-grained data requires strong domain knowledge, e.g., only ornithologists can accurately identify different bird species, which makes annotation significantly more expensive than in the generic object classification task. Moreover, in some fine-grained datasets such as Wildfish and iNaturalist, the data distributions are usually imbalanced and follow a long-tailed distribution. In some categories, well-labeled training samples are limited, e.g., it is hard to collect large-scale samples of endangered species. How to tackle fine-grained image classification with limited training data remains an open problem.
Human beings can easily learn novel generic concepts from only one or a few samples. To simulate this intelligent ability, machine few-shot learning was initially identified by Li et al., who propose to utilize probabilistic models to represent object categories and update them with few training examples. Most recently, inspired by the advanced representation learning ability of deep neural networks, deep machine few-shot learning [15, 16, 17, 18, 19, 20] has revived and achieved significant improvements over previous methods. However, considering the cognitive process of human beings, preschool students can easily distinguish between generic concepts like ‘Cat’ and ‘Dog’ after seeing a few exemplary images of these animals, but they may be confused about fine-grained dog categories such as ‘Husky’ and ‘Alaskan’ when given limited samples. The undeveloped classification ability of children in processing information compared to adults [21, 22] indicates that generic few-shot methods cannot cope well with the few-shot fine-grained classification task. To this end, in this paper, we focus on dealing with Few-Shot Fine-Grained (FSFG) classification in a ‘developed’ way.
The FSFG task was recently introduced by Wei et al. Besides establishing the FSFG problem, they propose a deep neural network based model named Piece-wise Classifier Mapping (PCM). By adopting the meta-learning strategy on the auxiliary dataset, their model can classify different samples in the testing dataset with few labeled samples. The most critical issue in FSFG is to acquire subtle and informative image features. In PCM, the authors adopt the naive self-bilinear pooling to extract image representations, which is widely used in state-of-the-art fine-grained object classification [8, 9, 24]. Then, with the operation of bilinear feature grouping, the PCM model can generate low-rank subtle descriptors of the original image. Most recently, Li et al. propose to use covariance pooling to distill the image representation of each category. These matrix-outer-product based bilinear pooling operations extract second-order image features, which contain more information than traditional first-order features, and thus achieve better performance on fine-grained data than generic ones.
It is worth noting that both of these methods employ bilinear pooling on the input image itself to enhance the information of the original features, which we refer to as the self-bilinear pooling operation. However, when a human identifies similar objects, he or she tends to compare them thoroughly in a pairwise way, e.g., comparing the heads of two birds first, then the wings, and the feet last. Therefore, it is natural to enhance the information during the comparing process when dealing with FSFG classification tasks. Based on this motivation, we propose a novel pairwise bilinear pooling operation on the support and query images to extract comparative second-order image descriptors for FSFG.
RelationNet achieves state-of-the-art performance by combining a feature encoder and a non-linear relation comparator. However, the matching feature extraction in RelationNet only concatenates the support and query feature maps along the depth (channel) dimension and fails to capture nuanced features for fine-grained classification.
To address the above issues, we propose a novel end-to-end FSFG model that captures the fine-grained relations among different classes. This subtle comparative ability of our models is inherently more intelligent than merely modeling the data distribution [23, 19, 17]. The main contributions are summarized as follows:
Pairwise Bilinear Pooling. Existing second-order based FSFG methods [19, 23] enhance the encoded individual features by directly applying the self-bilinear pooling operation. However, such an operation fails to capture more nuanced relations between similar objects. Instead, we uncover the fine-grained relations between different support and query image pairs by using the matrix outer product operation, which is called pairwise bilinear pooling. By explicitly eliciting the correlative information of pair samples, the proposed operation can extract more discriminative features than existing approaches [23, 1, 17]. More importantly, we introduce a low-rank approximation for the comparative second-order feature, where a set of co-variance low-rank transform matrices are learned to reduce the complexity of the operation.
Effective Feature Alignment. The main advantage of self-bilinear based FSFG methods is the enhancement of depth information for individual spatial positions in the image, which is achieved by the matrix outer product operation on convolved feature maps. Inspired by the self-bilinear pooling operation, we design a simple yet effective alignment mechanism to match the pairwise convolved image features. By exploiting the compact image feature alignment, the ablation study shows that the proposed alignment mechanism is crucial for the significant improvements over the baseline model, where only the alignment loss is applied.
Performance. By incorporating the feature alignment mechanism and the pairwise bilinear pooling operation, the proposed model achieves state-of-the-art performance on four benchmark datasets.
The preliminary version of the proposed model was published at IEEE ICME-19. The differences between the preliminary version and the new materials lie mainly in three aspects:
The new materials in this paper comprise a more advanced pairwise pooling operation with a low-rank constraint. Instead of directly computing the matrix product, we propose to learn multiple transformations.
A novel alignment mechanism is introduced to ensure the input feature pairs of the bilinear operation are matched.
More comprehensive experimental results analysis and ablation studies are conducted, and the proposed model achieves the best performance against all compared methods.
The rest of this paper is organized as follows. Section II gives a brief introduction to related work on fine-grained object classification, generic deep few-shot learning, and recent progress in fine-grained few-shot learning. Section III presents the proposed LRPABN method. Section IV describes the datasets, the experimental setup, and the analysis of the experimental results. Section V concludes the paper.
II Related Work
II-A Fine-Grained Object Classification
Fine-grained object classification has been a trending topic in the computer vision research area for years, and most traditional fine-grained approaches use hand-crafted features as image representations [25, 26, 27]. However, due to the limited representative capacity of hand-crafted features, the performance of this type of method is moderate. In recent years, deep neural networks have developed advanced abilities in feature extraction and function approximation [28, 29, 30, 31, 32, 33, 34], bringing significant progress in the fine-grained image classification task [35, 36, 37, 38, 35, 39, 40, 41, 42, 6, 7, 8, 43, 44, 9, 10, 24, 45, 46].
Deep fine-grained classification approaches can be roughly divided into two groups: regional feature-based methods [35, 36, 37, 38, 35, 39, 40, 41, 42, 6, 7] and global feature-based methods [8, 43, 44, 9, 10, 24, 45, 46]. In fine-grained image classification, the most informative cues generally lie in the discriminative parts of the object. Therefore, regional feature-based approaches tend to detect such parts first and then fuse them to form a robust representation of the object. For instance, Zhang et al. first combine the R-CNN with a geometric prior in a fine-grained classifier, in which the modified R-CNN generates thousands of proposals and the most discriminative ones are then selected for object classification. Peng et al. adopt two attention modules to localize objects and choose the discriminative parts simultaneously. A spectral clustering method is then employed to align the parts with the same semantic meaning for the prediction. However, the classification performance of these models relies heavily on part localization. Obtaining a well-trained part detector requires a large number of finely annotated samples, which are infeasible to obtain. Moreover, the sophisticated regional feature fusion mechanism increases the complexity of the fine-grained classifier.
On the contrary, global feature-based fine-grained methods [8, 43, 44, 9, 10, 24, 45, 46] extract the feature from the whole image without explicitly localizing the object parts. The Bilinear CNN model (BCNN) is the first work that adopts the matrix outer product operation on the initial embedded features to generate a second-order representation for fine-grained classification. Li et al. (iSQRT-COV) further improve the naive bilinear model by using covariance matrices over the last convolutional features as fine-grained features. iSQRT-COV obtains state-of-the-art performance on both generic and fine-grained datasets.
However, the feature dimension of second-order models is the square of that of first-order ones. To reduce the computational complexity, Gao et al. propose a compact bilinear pooling operation, which applies Tensor Sketch to reduce the dimensions. Kong et al. introduce a low-rank co-decomposition of the covariance matrix that further decreases the complexity, while Kim et al. adopt the Hadamard product to redefine the bilinear matrix outer product and propose a factorized low-rank bilinear pooling for multimodal learning. Furthermore, Gao et al. devise a hierarchical approach for fine-grained classification using a cross-layer factorized bilinear pooling operation. Inspired by the flexibility and effectiveness of the Hadamard product for extracting second-order features between visual features and textual features in VQA tasks, in our LRPABN model, we propose to adopt factorized bilinear pooling to approximate pairwise second-order statistics for few-shot fine-grained image classification, while achieving better performance compared to the first-order models.
II-B Generic Deep Few-shot Learning
The majority of deep few-shot learning methods [50, 51][15, 16, 17, 18, 19, 20] follow the strategy of meta-learning [52, 53], which distills meta-knowledge from batches of auxiliary few-shot tasks. Each auxiliary task mimics the target few-shot tasks with the same split of support and query images, and after episodes of training on auxiliary tasks, the trained model can converge quickly to a good local optimum on the target data without suffering from overfitting.
One representative class of methods learns by fine-tuning. MAML designs a meta-learning framework that determines transferable weights for the initialization of the deep neural network. By fine-tuning the network with the limited training samples, the model can achieve reliable performance within a few gradient descent update steps. Moreover, Sachin et al. propose a gradient-based method that learns not only well-initialized weights but also an effective LSTM-based optimizer. Different from this type of approach, our model is free from retraining during the meta-testing stage.
Another class of few-shot learning methods follows the idea of learning to compare [15, 16, 17, 55, 20]. In general, these approaches consist of two main components: a feature embedding network and a similarity metric. These methods aim to optimize the transferable embedding of both auxiliary data and target data. Consequently, the test images can be identified by a simple nearest neighbor classifier [15, 16], a deep distance matrix based classifier, or a cosine distance based classifier [55, 20]. Considering that the FSFG task requires a more advanced information processing ability, we propose to capture more nuanced features from the image pairs rather than the first-order features extracted in learning-to-compare approaches.
II-C Few-shot Fine-grained Learning
Most recently, Wei et al. propose the first FSFG model by employing two sub-networks to tackle the problem jointly. The first component is a self-bilinear encoder, which adopts the matrix outer product operation on convolved features to capture subtle image features, while the second one is a mapping network that learns the decision boundaries of the input data. Li et al. further replace the naive self-bilinear pooling with covariance pooling. Moreover, they design a covariance metric to generate relation scores. However, self-bilinear pooling [23, 19] cannot extract the comparative features between pairs of images, and the dimension of the pooled features is usually very large. Pahde et al. propose a cross-modality FSFG model, which embeds the textual annotations and image features into a common latent space. They also introduce a discriminative text-conditional GAN for sample generation, which selects the representative samples from the auxiliary set. However, obtaining rich annotations for fine-grained samples is costly in both computation and time.
In this section, we first present the problem formulation of FSFG. Then the proposed LRPABN model is introduced, including the Low-Rank Pairwise Bilinear Pooling operation and the Feature Alignment Layer, which are the core parts of LRPABN. The detailed network architecture of LRPABN is given last.
III-A Problem Definition
Given a Fine-Grained target dataset
For the FSFG task, the target dataset contains two parts, a labeled subset and an unlabeled subset, where the samples in both subsets are fine-grained images. The model needs to classify the unlabeled data according to the few labeled samples, each of which comes with its ground-truth label. The problem is named after the number of categories (ways) and the number of labeled images per category (shots) in the labeled subset, e.g., 5-way-1-shot.
In order to obtain an ideal model on such a dataset, few-shot learning usually employs a fully annotated dataset, which has similar properties or a similar data distribution to the target dataset, as an auxiliary dataset.
The auxiliary dataset consists of images and their corresponding labels. In each round of training, the auxiliary dataset is randomly separated into two parts: a support dataset and a query dataset. By giving the support dataset the same composition as the labeled target subset, we can mimic the composition of the target dataset in each iteration. The auxiliary dataset is then employed to learn a meta-learner, which can transfer knowledge from the auxiliary data to the target data. Once the meta-learner is trained, it can be fine-tuned with the labeled target subset and finally classify the samples of the unlabeled subset into their corresponding categories [15, 17, 23, 18, 1, 20, 19].
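The episodic split described above can be illustrated with a minimal sketch. It assumes the auxiliary dataset is given as a flat list of class labels, one per image; the function name `sample_episode` and its parameters `n_way`, `k_shot`, and `n_query` are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict

def sample_episode(labels, n_way, k_shot, n_query, rng):
    """Draw one support/query episode from a labeled auxiliary set.

    labels: list of class labels, one per auxiliary image (index = image id).
    Returns (support, query) as lists of (image_id, label) pairs.
    """
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    # Pick n_way classes, then k_shot support and n_query query images per class.
    classes = rng.sample(sorted(by_class), n_way)
    support, query = [], []
    for lab in classes:
        picked = rng.sample(by_class[lab], k_shot + n_query)
        support += [(i, lab) for i in picked[:k_shot]]
        query += [(i, lab) for i in picked[k_shot:]]
    return support, query

# Toy auxiliary set (an assumption): 8 classes with 30 images each.
aux_labels = [c for c in range(8) for _ in range(30)]
support, query = sample_episode(aux_labels, n_way=5, k_shot=1, n_query=15,
                                rng=random.Random(0))
```

Because the support and query images of a class are drawn from one sample without replacement, the two sets never share an image, which mirrors the labeled/unlabeled split of the target dataset.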
III-B The Proposed LRPABN
The whole framework of LRPABN is shown in Figure 2, and the detailed architecture is given in Figure 3. Different from traditional few-shot embedding structures [15, 16, 17], we add the Low-Rank Pairwise Bilinear Pooling to construct the fine-grained image feature extractors. Moreover, we modify the non-linear comparator and apply it to the fine-grained task. As Figure 2 shows, given a support set consisting of five classes with one image per class, an Encoder that is trained with the auxiliary data extracts the first-order image features from the raw images, and then the Alignment Layer coordinates the embedded features in the support set with the query image feature in pairs. Next, the Low-Rank Bilinear Pooling is used to extract the comparative second-order image representation from the embedded feature pairs. Finally, the Comparator assigns the optimal label to the query from the support labels according to the similarity between the query and the different support classes.
The pairwise bilinear pooling layer is designed to capture the nuanced comparative features of image pairs by employing the bilinear pooling operation, which plays a crucial role in determining the relations between support and query pairs. However, if a pair of inputs is not well matched, the pooled features cannot deliver the maximum gain in classification performance. To this end, we introduce an alignment layer, which consists of a Multi-Layer Perceptron (MLP) and two feature alignment losses, to guarantee the registration of input pairs.
III-B1 Pairwise Bilinear Pooling Layer
The Bilinear CNN for image classification can be defined as a quadruple consisting of two encoders, one for each input stream, a self-bilinear pooling operation, and a classifier. The input is an image with a given height, width, and number of color channels. Through an encoder, the input image is transformed into a tensor with a number of feature channels and with the height and width of the embedded feature map. Given the two encoders, the self-bilinear pooling operates on the two feature vectors at each specific spatial location, one from each feature map, and the pooled feature is a vector. The classifier is a fully-connected layer trained with the cross-entropy loss.
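Different from conventional self-bilinear pooling, which operates on pairs of embedded features from the same image, the input pair of our pairwise bilinear pooling layer is generated from different source sets, i.e., the support set and the query set. With a single shared encoder, the pairwise bilinear pooling is defined analogously, with the two feature vectors at each location taken from the support and query feature maps, respectively.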
It is worth noting that in the pairwise bilinear pooling, we only have one shared embedding function. Different from the self-bilinear pooling that operates on the same input image, pairwise bilinear pooling applies the matrix outer product to two distinct samples. However, the dimension of the pooled pairwise feature grows quadratically with that of the original features. Inspired by the Factorized Bilinear Pooling applied in the visual question answering task, we further propose a low-rank pairwise bilinear pooling operation.
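To make the contrast between self-bilinear and pairwise bilinear pooling concrete, the following sketch sums the outer products of feature vectors over matching spatial positions. It is an illustration under stated assumptions rather than the authors' implementation: feature maps are plain lists of per-position vectors, sum pooling over positions is assumed, and the helper name `bilinear_pool` is hypothetical.

```python
def bilinear_pool(feat_a, feat_b):
    """Sum of outer products over matching spatial positions.

    feat_a, feat_b: lists of per-position feature vectors, each of length c.
    Returns a flattened vector of length c * c.
    """
    c = len(feat_a[0])
    pooled = [0.0] * (c * c)
    for x, y in zip(feat_a, feat_b):
        for i in range(c):
            for j in range(c):
                pooled[i * c + j] += x[i] * y[j]
    return pooled

# Toy embedded features: 2 spatial positions, 2 channels each.
support_feat = [[1.0, 2.0], [0.0, 1.0]]
query_feat = [[3.0, 0.0], [1.0, 1.0]]

# Self-bilinear pooling: both inputs come from the same image.
self_feat = bilinear_pool(support_feat, support_feat)
# Pairwise bilinear pooling: one support image and one query image.
pair_feat = bilinear_pool(support_feat, query_feat)
```

With c channels the pooled vector has c * c entries, which is the quadratic growth that motivates the low-rank variant.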
For the support and query feature maps from Equation 4, the pairwise bilinear operation at any spatial position can be re-formulated with a projection matrix (Equation 5), which fuses the two feature vectors at the same position of the support and query feature maps into a common scalar. Given a set of projection matrices, the redefined bilinear feature of any position is the vector of the resulting scalars, so its dimension equals the number of projection matrices. The comparative bilinear representation of the original pair is then built from these position-wise features. It is worth noting that Equation (5) differs from Equation (4) in that it adopts projection matrices in learning the bilinear feature. In Equation (5), the dimension of the comparative bilinear feature can be far smaller than in Equation (4). In this way, the model obtains a low-rank approximation of the original comparative bilinear feature. To further reduce the complexity of this model, we present a low-rank approximation of each projection matrix, in which the projection matrix is factorized into two low-rank matrices and the fusion is computed with the Hadamard product.
Equation (6) is the final form of low-rank pairwise bilinear pooling, which applies projection matrices and matrix factorization to approximate a full bilinear model (Equation 5). The proposed LRPABN is different from [49, 46], where the former adopts factorized bilinear pooling to fuse multi-modal features and the latter operates on convolutional features of the same image. In contrast, our method operates on pairs of support and query images. To the best of our knowledge, LRPABN is the first work that extracts the low-rank bilinear feature from pairs of distinct images for FSFG tasks.
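The factorization rests on a standard identity: for a projection matrix W = U V^T, the bilinear form x^T W y equals the sum of the Hadamard product of U^T x and V^T y. The sketch below checks this numerically on a toy example and then applies an output projection to the fused vector. The symbols `u`, `v`, `p` and all helper names are assumptions chosen for illustration; the exact notation and any normalization used in the paper are not recoverable from the text above.

```python
def matvec_t(m, vec):
    """Return m^T vec for a matrix m stored as a list of rows."""
    return [sum(m[i][k] * vec[i] for i in range(len(vec)))
            for k in range(len(m[0]))]

def hadamard(a, b):
    """Element-wise product of two vectors."""
    return [p * q for p, q in zip(a, b)]

def full_bilinear(x, w, y):
    """Scalar x^T W y with a full c-by-c projection matrix W."""
    return sum(x[i] * w[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))

# Toy sizes (assumptions): c = 3 channels, rank r = 2, output dimension 3.
u = [[1, 0], [2, 1], [0, 1]]          # c x r
v = [[1, 1], [0, 2], [1, 0]]          # c x r
x = [1, 2, 3]                         # support feature at one position
y = [2, 1, 0]                         # query feature at the same position

# Full projection matrix W = U V^T, built only to verify the identity.
w = [[sum(u[i][k] * v[j][k] for k in range(2)) for j in range(3)]
     for i in range(3)]

fused = hadamard(matvec_t(u, x), matvec_t(v, y))   # U^T x * V^T y
low_rank_scalar = sum(fused)                        # 1^T (U^T x * V^T y)
full_scalar = full_bilinear(x, w, y)                # x^T W y

# Output projection P (r x 3) turns the fused vector into the final feature.
p = [[1, 0, 1], [0, 1, 1]]
z = matvec_t(p, fused)
```

The full form stores c * c weights per output scalar, whereas the factorized form stores 2 * c * r weights shared across outputs plus the projection, which is where the saving comes from when r is small.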
III-B2 Feature Alignment Layer
The self-bilinear pooling operates on the same image, which means that at any spatial location of the embedded feature pairs, the operating features are entirely aligned. However, the proposed pairwise bilinear pooling operates on different inputs. Thus the encoded features may not always be matched. To overcome this obstacle, we introduce a feature alignment mechanism inspired by PointNet. Given a position transform function and the encoded feature, the transformed feature is computed by applying the transform to the encoded feature (Equation 7).
The constraint in Equation (7) is expressed with an identity matrix. In the transformed feature, only the positions of the original feature vectors are rearranged.
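A position transform that only rearranges feature vectors can be pictured as a permutation-like matrix acting on the spatial positions. The sketch below assumes, in the spirit of PointNet, that the transform matrix A is encouraged to satisfy A A^T = I through a squared Frobenius penalty; this regularizer and the names `matmul`, `transpose`, and `orthogonality_penalty` are assumptions, since the exact form of Equation (7) is not recoverable here.

```python
def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def orthogonality_penalty(a):
    """Squared Frobenius norm of (I - A A^T); zero for any permutation matrix."""
    prod = matmul(a, transpose(a))
    n = len(a)
    return sum(((1.0 if i == j else 0.0) - prod[i][j]) ** 2
               for i in range(n) for j in range(n))

# Encoded feature: 3 spatial positions (rows), 2 channels (columns).
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# A permutation matrix: output position 0 takes input position 1, and so on.
perm = [[0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0],
        [1.0, 0.0, 0.0]]
aligned = matmul(perm, features)

# A matrix that mixes positions instead of rearranging them is penalized.
mixing = [[1.0, 1.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]
```

For the permutation, the feature vectors themselves are unchanged and only their order differs, matching the statement that only positions are rearranged.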
The transform matrix can be learned with a shallow neural network, while the position transformation can be integrated into the feature encoder . Moreover, to ensure the effectiveness of the alignment, we further design two feature alignment losses as follows:
The first loss is a rough approximation of the two embedded image descriptors, which minimizes the Euclidean distance between the two transformed features.
The second loss is a more concise feature alignment loss. Inspired by the pooling operation, we first sum all the raw features along the channel dimension and then measure the MSE between the summed features. By training with the proposed alignment losses, we encourage the network to automatically learn matching features and thus generate a better pairwise bilinear feature.
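The two alignment losses can be sketched as follows. This is a minimal illustration under assumptions: feature maps are lists of per-position channel vectors, the first loss is written as a mean squared difference over all entries (the paper speaks of Euclidean distance, and any scaling is unknown), and the names `align_loss_rough` and `align_loss_compact` are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equally long vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def align_loss_rough(feat_s, feat_q):
    """First loss (sketch): squared distance over every channel of every position."""
    flat_s = [val for pos in feat_s for val in pos]
    flat_q = [val for pos in feat_q for val in pos]
    return mse(flat_s, flat_q)

def align_loss_compact(feat_s, feat_q):
    """Second loss (sketch): sum over channels per position, then MSE of the maps."""
    return mse([sum(pos) for pos in feat_s], [sum(pos) for pos in feat_q])

# Toy support and query features: 2 positions, 2 channels.
feat_s = [[1.0, 2.0], [3.0, 4.0]]
feat_q = [[2.0, 1.0], [3.0, 6.0]]

rough = align_loss_rough(feat_s, feat_q)
compact = align_loss_compact(feat_s, feat_q)
```

In this toy example the first position differs only by a swap of channel values, so it contributes to the rough loss but not to the compact one, which compares the channel-summed response per position.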
As Figure 2 indicates, after passing through the above layers, the pairwise comparative bilinear features are sent to a comparator. This module aims to learn the relations between the query images and the support classes. In the one-shot-W-way setting, each support class is represented by a single image, whereas in the K-shot-W-way setting, each support class is represented by the sum of the embedded features of its K images. Thus, for each query image, the model generates W comparative bilinear features, one for each class.
For a pair consisting of a query image and a support class, the relation score is computed by feeding their comparative bilinear feature to the comparator.
III-B4 Model Training
The training loss of our bilinear comparator is the MSE loss, which regresses the relation score to the label similarity of the images. At a given iteration during episodic training, the loss is computed over all pairs of query features and support class features.
The regression target is an indicator, which equals one when the label of the query matches the label of the support class and zero otherwise. LRPABN has two optional alignment losses, and we back-propagate their gradients immediately after the alignment loss is computed. That is, during the training stage, the model is updated twice in one iteration.
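The class representation and the comparator loss can be sketched in a few lines. The code assumes that the relation scores are already available as a query-by-class table and that the loss is the plain sum of squared errors against the indicator target, as in RelationNet; whether the paper sums or averages is an assumption, and `class_feature` and `relation_mse` are hypothetical names.

```python
def class_feature(shot_feats):
    """Element-wise sum of the K embedded support features of one class."""
    return [sum(vals) for vals in zip(*shot_feats)]

def relation_mse(scores, query_labels, class_labels):
    """Sum of squared errors between relation scores and indicator targets.

    scores[i][j] is the relation score of query i and support class j.
    """
    total = 0.0
    for i, q_lab in enumerate(query_labels):
        for j, c_lab in enumerate(class_labels):
            target = 1.0 if q_lab == c_lab else 0.0
            total += (scores[i][j] - target) ** 2
    return total

# Three-shot class feature from toy 2-dimensional embeddings.
husky_feature = class_feature([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Two queries compared against two support classes.
scores = [[0.9, 0.1],
          [0.2, 0.6]]
loss = relation_mse(scores, ["husky", "alaskan"], ["husky", "alaskan"])
```

Each query contributes one term per support class, so a well-trained comparator pushes the matching score toward one and all others toward zero.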
III-C Network Architecture
The detailed network architecture is shown in Figure 3. It consists of three parts: Embedding Network, Low-rank Bilinear Pooling Layer and Comparator Network.
Embedding Network: For a fair comparison with the state-of-the-art generic few-shot and FSFG approaches, we adopt the same encoder structure as in [15, 16, 17, 18, 19]. It consists of four convolutional blocks, where each block contains a 2D convolutional layer and a max-pooling layer is added. For simplicity, we integrate the feature alignment layer into the embedding network as the first-order feature extractor, as indicated in Figure 3.(a). Unlike the alignment mechanisms used in [57, 40], we devise a simple two-layer MLP with the regularization in Equation (7). Besides, two optional alignment losses, (8) and (9), are applied in the alignment layer to generate well-matched pairwise features.
Low-rank Bilinear Pooling Layer: For the Low-Rank Pairwise Bilinear Pooling layer in Figure 3.(b), we use a convolutional layer with kernel followed by the batch normalization and a ReLU layer, then the Hadamard product and normalization layers are appended to generate the comparative bilinear features.
In this section, we evaluate the proposed LRPABN on four widely used fine-grained datasets. First, we give a brief introduction to these datasets. Then we describe the experimental setup in detail. Finally, we analyze the experimental results of the proposed models and compare them with other few-shot learning approaches. For a fair comparison, we conduct two groups of experiments on these datasets. For the first group, we follow the setting used by Wei et al. [23, 1], while for the second group, we follow the newest settings of the recent few-shot methods [19, 20].
There are four datasets used to investigate the proposed models:
CUB Birds contains 200 categories of birds and a total of 11,788 images.
DOGS contains 120 categories of dogs and a total of 20,580 images.
CARS contains 196 categories of cars and a total of 16,185 images.
NABirds contains 555 categories of North American birds and a total of 48,562 images.
Following Section III-A, we randomly divide these datasets into two disjoint sub-datasets: the auxiliary dataset and the target dataset. For the first group of experiments, we use the splits of PCM, as shown in Table I. For the second group, we adopt the dataset splits of Li et al. [19, 20], as indicated in Table II. Neither of these methods uses the NABirds dataset. Thus, for this dataset only, we create our own splits.
IV-B Experimental Setup
In each round of training and testing, for the one-shot image classification setting, the number of support samples in each class equals 1. Therefore, we use the embedded features of these support samples directly as the class features. For the few-shot setting, we extract the class features by summing all the embedded features in each category. In our experiments, we compare the following few-shot (FS) and FSFG approaches:
The state-of-the-art methods:
RelationNet, a state-of-the-art generic few-shot method proposed in CVPR 2018. It uses a mini network to learn the similarity between the query image and the support class.
DN4, the newest generic few-shot method published in CVPR 2019. By using a deep nearest neighbor neural network, DN4 can aggregate the discriminative information of local features and thus improve the final classification performance.
PCM, the first FSFG model published in IEEE TIP 2019. It adopts a self-bilinear model to extract the fine-grained features of the image and achieves excellent performance on several FSFG tasks.
CovaMNet, the newest FSFG model published in AAAI 2019. It replaces the bilinear pooling with covariance bilinear pooling and achieves state-of-the-art performance on FSFG classification.
PABN without alignment: this model does not use an alignment loss on the embedded pair features.
PABN with alignment: two models that adopt the first and the second alignment loss, respectively, for feature alignment. As discussed in Section III-B2, the first loss is a naive alignment loss, whereas the second is a more compact loss.
The PABN+ models: these models apply the proposed alignment layer to the PABN models, in order to investigate the effectiveness of the proposed feature alignment transform function (7).
LRPABN: two models that use the first and the second alignment loss, respectively, in the alignment layer.
In the first experiment, the LRPABN models are compared with RelationNet, PCM, and our previously proposed PABN models. We follow the data splits (Table I) of RelationNet, PCM, and PABN; none of these approaches uses a validation dataset.
In the second experiment, besides RelationNet, the PABN+ models, and the proposed LRPABN models, we compare with the newest state-of-the-art few-shot method DN4 and the newest FSFG approach CovaMNet. For a fair comparison, we use the same data splits (Table II) and training strategy as DN4 and CovaMNet.
For all the compared methods, we conduct both 5-way-1-shot and 5-way-5-shot classification experiments. In the training stage of the first group of experiments, both the 5-way-1-shot and the 5-way-5-shot experiments use 15 query images, which determines the number of images in each mini-batch for the two settings. For the testing stage, we follow RelationNet, which uses 1 query for 5-way-1-shot and 5 queries for 5-way-5-shot in each mini-batch. In both the training and testing stages of the second group of experiments, we randomly select 15 and 10 queries from each category for the 5-way-1-shot and 5-way-5-shot settings, respectively, which is the same setting as in [19, 20].
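The mini-batch sizes implied by these settings follow from simple arithmetic. The sketch below assumes that the query counts are per class, which the text states explicitly only for the second group of experiments; the helper name `episode_size` is hypothetical.

```python
def episode_size(n_way, k_shot, queries_per_class):
    """Number of images in one episode: support plus query images of every class."""
    return n_way * (k_shot + queries_per_class)

# First group, training stage, assuming 15 queries per class in both settings.
first_group_1shot = episode_size(n_way=5, k_shot=1, queries_per_class=15)
first_group_5shot = episode_size(n_way=5, k_shot=5, queries_per_class=15)

# Second group: 15 queries per class for 1-shot and 10 for 5-shot.
second_group_1shot = episode_size(n_way=5, k_shot=1, queries_per_class=15)
second_group_5shot = episode_size(n_way=5, k_shot=5, queries_per_class=10)
```

Under this assumption, the first-group training episodes contain 80 and 100 images for the 1-shot and 5-shot settings, and the second-group episodes contain 80 and 75 images.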
Moreover, in the training stage, we select the optimal models using the same validation strategies as the respective prior methods ([20, 19] for the second group of experiments). In the first group, we randomly sample and construct 100,000 episodes to train the LRPABN and PABN+ models, and each episode contains only one learning task, while in the second group, we randomly select 10,000 episodes for training, and in each episode, 100 tasks are randomly batched to train the models. For the LRPABN models, we set the dimension of the pairwise bilinear feature to 512, whereas the feature dimension of PABN and PABN+ is 4096. We resize all the input images from all datasets to a fixed size. All experiments use the Adam optimizer with an initial learning rate of 0.001, and all models are trained end-to-end from scratch.
IV-C Results and Analysis
is only applied for image retrieval task. It is unfair to compare with these methods directly. Therefore, we compare our LRPABN with PCM, PABN, and CovaMNet. We also compare our methods with the state-of-the-art generic few-shot learning methods RelationNet and DN4. The original implementation of RelationNet does not report results on the four fine-grained datasets. For fair comparisons, we use the open source code of RelationNet (https://github.com/floodsung/LearningToCompare_FSL) to conduct the FSFG image classification on these datasets.
In the first group of experiments, we compute both one-shot and five-shot classification accuracies on the four datasets by averaging on 10,000 episodes in testing. We show the experimental results of 10 compared models in Table III. As the table shows, the proposed LRPABN models achieve significant improvements on both 1-shot and 5-shot classification tasks on all datasets compared to the state-of-the-art FSFG methods and generic few-shot methods, which indicates the effectiveness of the proposed framework.
More specifically, the LRPABN, PABN+, and PABN models all obtain around 10% to 30% higher classification accuracy than PCM, which demonstrates that the comparative pairwise bilinear feature outperforms the self-bilinear feature on FSFG tasks. In addition, the pairwise bilinear feature-based approaches achieve better classification performance than RelationNet, which validates that the extraction of second-order image descriptors surpasses the naive concatenation of feature pairs for FSFG problems.
From Table III, compared to the PABN models, the PABN+ and LRPABN models obtain a clear boost in classification performance. For instance, on the CUB Birds dataset, both the PABN+ and the LRPABN models improve over the PABN models in the one-shot and five-shot settings, and the same holds on the CARS dataset. These results demonstrate that the proposed feature alignment layer is more effective than the previously proposed feature matching mechanism. Thus, for the proposed pairwise bilinear pooling, the position alignment performance of the embedding feature map is a key factor that impacts the final classification result.
More specifically, it can be observed from Table III that the LRPABN models achieve the best or second best classification performance on nearly all datasets compared to other methods under various experimental settings. For the CARS data, LRPABN obtains significant improvements over PABN+, PABN, and RelationNet on both the 1-shot-5-way and the 5-shot-5-way tasks. It is worth noting that the dimension of the pairwise bilinear feature in LRPABN is 512, whereas the corresponding feature dimension of PABN and PABN+ is 4096. This is because the LRPABN models adopt the low-rank factorized bilinear pooling operation, which learns a set of projection transform functions that fuse the feature pairs, as discussed in Equation 6. Each projection represents a pattern of coalescing the image pairs in the depth feature channels over all the matching positions. Meanwhile, the naive pairwise bilinear pooling in the PABN and PABN+ approaches only applies the matrix outer product on feature pairs once to merge them. Therefore, the LRPABN models can obtain more types of feature extraction than the PABN and PABN+ models, which in turn achieves better performance with smaller feature dimensions.
For a further analysis of our models, we conduct an additional experiment on these four datasets comparing the LRPABN models with DN4 and CovaMNet. In this experiment, we also compare the PABN+ models. Moreover, we use the same setting to rerun RelationNet on the four datasets as the baseline method. We follow the same dataset split as DN4 and CovaMNet. The original papers of these two methods do not report results on CUB Birds (CUB-2011) and NABirds, so we use the publicly released code of DN4 (https://github.com/WenbinLee/DN4) and CovaMNet (https://github.com/WenbinLee/CovaMNet) to obtain the results. During testing, 600 episodes are randomly selected from the data.
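The episodic test protocol can be sketched as follows. This is a hedged illustration using only the Python standard library: the helper names (`sample_episode`, `evaluate`, `nearest_mean`), the toy one-dimensional data, the 15 queries per class, and the 95% confidence interval are all assumptions made for the example; only the 600 randomly selected episodes and the 5-way setting come from the text.

```python
import random
import statistics

def sample_episode(data, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot episode from a dict of class -> samples."""
    classes = rng.sample(sorted(data), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        picked = rng.sample(data[cls], k_shot + n_query)
        support += [(s, label) for s in picked[:k_shot]]
        query += [(s, label) for s in picked[k_shot:]]
    return support, query

def evaluate(data, predict, n_episodes=600, n_way=5, k_shot=1, n_query=15, seed=0):
    """Average per-episode accuracy with an approximate 95% interval."""
    rng = random.Random(seed)
    accs = []
    for _ in range(n_episodes):
        support, query = sample_episode(data, n_way, k_shot, n_query, rng)
        correct = sum(predict(support, q) == label for q, label in query)
        accs.append(correct / len(query))
    mean = statistics.mean(accs)
    ci95 = 1.96 * statistics.stdev(accs) / len(accs) ** 0.5
    return mean, ci95

def nearest_mean(support, q):
    """Toy stand-in for a model: pick the class whose support mean is closest."""
    groups = {}
    for s, label in support:
        groups.setdefault(label, []).append(s)
    return min(groups, key=lambda l: abs(statistics.mean(groups[l]) - q))

# Toy "novel classes": class c has scalar samples scattered around c.
toy_rng = random.Random(1)
toy = {f"class_{c}": [c + toy_rng.gauss(0, 0.1) for _ in range(40)] for c in range(10)}

mean, ci95 = evaluate(toy, nearest_mean)
print(f"accuracy: {mean:.4f} +/- {ci95:.4f}")
```

Swapping `k_shot=1` for `k_shot=5` gives the five-shot variant of the same protocol.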
Table IV presents the average accuracies of different models on the novel classes of the fine-grained datasets. Both the one-shot and five-shot classification results are reported. As the table shows, the proposed LRPABN models obtain steady and notable improvements on almost all fine-grained datasets under the different experimental settings. In more detail, compared with CovaMNet, our proposed models achieve clear performance gains on the CUB Birds, CARS and NABirds datasets in both the one-shot and five-shot settings. Especially for NABirds data, the LRPABN obtains and gain over CovaMNet for one-shot and five-shot respectively. These results again support the claim that the proposed pairwise bilinear pooling is superior to the self-bilinear pooling operation. Meanwhile, the feature alignment layer further boosts the final performance.
For the comparisons against the DN4 method, Table IV shows that LRPABN models obtain the highest accuracy in the one-shot setting on the CUB Birds, CARS and NABirds datasets and get the second-best results on DOGS data, whereas DN4 performs poorly in one-shot tasks on almost all datasets. However, for the five-shot setting, DN4 achieves the highest classification accuracy on all four datasets, while LRPABN achieves the second-highest performance on CUB Birds, CARS and NABirds. We are surprised to observe that on the one-shot-five-way task in NABirds, LRPABN gains over DN4. Nevertheless, DN4 gets boosts over LRPABN on the five-shot-five-way task in the CARS dataset. That is, the proposed LRPABN method holds a large advantage over DN4 for one-shot classification tasks but is slightly inferior to DN4 for five-shot classification. The reason is that DN4 uses a deep nearest neighbor neural network to search for the optimal local features in the support set as the support classes' feature for a given query image. For the target query features (e.g., a set of local features), the algorithm selects the top k nearest local features in the whole support dataset according to the cosine similarity between the query local features and the support local features. That is, the more images in the support classes, the better the generated class feature. Thus, for five-shot classification, DN4 outperforms LRPABN, whereas under the one-shot setting, DN4 has fewer support features from which to extract a good representation of the class. Considering that the proposed LRPABN only sums the image features in each category as the class feature, generating a better category representation could further improve the classification performance of our methods.
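The contrast described above can be illustrated with a small NumPy sketch. It is a simplified reading of the top-k cosine-similarity measure, not the released DN4 code, and the descriptor counts, the dimension, `k = 3`, and all function names are assumptions chosen for the example. The second helper shows the summed class feature mentioned for LRPABN.

```python
import numpy as np

def l2_normalize(a, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)

def image_to_class_score(query_local, support_local, k=3):
    """Sum, over query descriptors, of their top-k cosine similarities.

    query_local: (m, d) local descriptors of one query image.
    support_local: (n, d) local descriptors pooled from all support
    images of one class.
    """
    sims = l2_normalize(query_local) @ l2_normalize(support_local).T  # (m, n)
    topk = np.sort(sims, axis=1)[:, -k:]
    return float(topk.sum())

def summed_class_feature(support_feats):
    """Class feature formed by summing per-image features, shape (shots, d) -> (d,)."""
    return support_feats.sum(axis=0)

rng = np.random.default_rng(0)
m, d = 36, 64                                   # assumed descriptor count and size
query = rng.standard_normal((m, d))
shots = rng.standard_normal((5, m, d))          # five support images of one class

one_shot_pool = shots[0]                        # descriptors from a single image
five_shot_pool = shots.reshape(-1, d)           # descriptors from all five images

score_1 = image_to_class_score(query, one_shot_pool)
score_5 = image_to_class_score(query, five_shot_pool)
print(score_1, score_5)
```

Because the five-shot pool contains the one-shot pool, each query descriptor's top-k neighbors can only stay the same or get closer, which mirrors the observation that the nearest-neighbor search benefits from more support images.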
The classification examples of LRPABN, PABN+, and RelationNet models are shown in Figure 4. We select LRPABN and PABN+ as the representatives of the LRPABN and PABN+ approaches. To investigate the low-rank approximation, we set the low-rank comparative feature dimensions to 512 and 128 for the LRPABN-512 and LRPABN-128 models, respectively. When a fixed testing batch, consisting of one support sample and five query samples for each of five classes, is sent through the model, the prediction of LRPABN-512 contains only six mislabels among the 25 queries, while the predictions of LRPABN-128, PABN+ and RelationNet have 7, 8 and 10 wrong labels, respectively. This supports the effectiveness of the LRPABN models. We also find that in some classes, such as Nighthawk and Harris Sparrow, the high intra-class variance and low inter-class variance confuse all the models.
IV-D Ablation studies
To further verify the effectiveness of the proposed low-rank pairwise bilinear pooling for FSFG tasks, we perform an ablation study from two aspects. First, we conduct a feature dimension selection experiment to inspect the influence of the low-rank approximation of the pairwise bilinear feature. Then, we investigate the classification performance of the learned comparative features using t-SNE visualization.
For the feature dimension selection, we set the number of dimensions to 16, 32, 64, 128, 256, 512, 1024 and 2048 for both 1-shot and 5-shot classification tasks on CUB Birds data. The model used for this experiment is LRPABN. The results are shown in Figure 5. It can be observed that as the feature dimension gets larger, the test accuracy first improves gradually to a peak and then drops drastically. For the 1-shot setting, the performance changes smoothly when the dimension is below 1024. For the 5-shot task, the performance fluctuates more, yet it still grows quickly as the dimension increases. Moreover, we find that even with a very compact low-rank approximation (i.e., a dimension of 16), the model can still achieve a decent classification performance, which further verifies the stability of the proposed method. When the dimension becomes too large, the model performs poorly; this may be because the increased complexity of the framework prevents it from modeling the data distribution well with few training samples.
The visualization for different comparative features is presented in Figure 6. We randomly select five support images and thirty query images per category from CUB Birds data to conduct the five-way-five-shot tasks. The original comparative feature dimension of RelationNet is , and we use the convolved feature before the first fully-connected layer in the classifier as the final comparative feature, with dimension size 576. The comparative feature of PABN+ is , and we choose LRPABN with comparative dimensions 128 and 512 (denoted as LRPABN-Dim-128 and LRPABN-Dim-512, respectively) for comparison. As the figure shows, the learned LRPABN-Dim-512 feature, which can be grouped into five classes correctly, outperforms the others. The discriminative performance of LRPABN-Dim-128 and PABN+ is similar, and both outperform RelationNet's feature. These visualization results again support the superior capacity of the proposed low-rank pairwise bilinear features for FSFG tasks.
In this paper, we propose a novel few-shot fine-grained image classification method, which is inspired by the advanced information processing ability of human beings. The main contribution is the low-rank pairwise bilinear pooling operation, which extracts the second-order comparative features for the pair of support images and query images. Moreover, to get a more precise comparative feature, we propose an effective feature alignment mechanism to match the embedded support image features with query ones. Through comprehensive experiments on four fine-grained datasets, we verify the effectiveness of the proposed method.
-  H. Huang, J. Zhang, J. Zhang, Q. Wu, and J. Xu, “Compare more nuanced: Pairwise alignment bilinear network for few-shot fine-grained learning,” arXiv preprint arXiv:1904.03580, 2019.
-  C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The Caltech-UCSD Birds-200-2011 Dataset,” California Institute of Technology, Tech. Rep. CNS-TR-2011-001, 2011.
-  G. Van Horn, S. Branson, R. Farrell, S. Haber, J. Barry, P. Ipeirotis, P. Perona, and S. Belongie, “Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection,” in CVPR, June 2015.
-  A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei, “Novel dataset for fine-grained image categorization,” in First Workshop on Fine-Grained Visual Categorization, CVPR, Colorado Springs, CO, June 2011.
-  J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3d object representations for fine-grained categorization,” in 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.
-  N. Zhang, J. Donahue, R. Girshick, and T. Darrell, “Part-based r-cnns for fine-grained category detection,” in ECCV. Springer, 2014, pp. 834–849.
-  J. Fu, H. Zheng, and T. Mei, “Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition,” in CVPR, July 2017.
-  T.-Y. Lin, A. RoyChowdhury, and S. Maji, “Bilinear cnn models for fine-grained visual recognition,” in ICCV, December 2015.
-  Y. Cui, F. Zhou, J. Wang, X. Liu, Y. Lin, and S. Belongie, “Kernel pooling for convolutional neural networks,” in CVPR, July 2017.
-  P. Li, J. Xie, Q. Wang, and Z. Gao, “Towards faster training of global covariance pooling networks by iterative matrix square root normalization,” in CVPR, June 2018.
-  J. Krause, H. Jin, J. Yang, and L. Fei-Fei, “Fine-grained recognition without part annotations,” in CVPR, June 2015.
-  P. Zhuang, Y. Wang, and Y. Qiao, “Wildfish: A large benchmark for fish recognition in the wild,” in MM. ACM, 2018, pp. 1301–1309.
-  G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie, “The inaturalist species classification and detection dataset,” in CVPR, June 2018.
-  L. Fei-Fei, R. Fergus, and P. Perona, “One-shot learning of object categories,” IEEE TPAMI, vol. 28, no. 4, pp. 594–611, 2006.
-  O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra et al., “Matching networks for one shot learning,” in NIPS, 2016, pp. 3630–3638.
-  J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” in NIPS, 2017, pp. 4077–4087.
-  F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales, “Learning to compare: Relation network for few-shot learning,” in CVPR, June 2018.
-  Y. Liu, J. Lee, M. Park, S. Kim, E. Yang, S. Hwang, and Y. Yang, “Learning to propagate labels: Transductive propagation network for few-shot learning,” in ICLR, 2019.
-  W. Li, J. Xu, J. Huo, L. Wang, G. Yang, and J. Luo, “Distribution consistency based covariance metric networks for few-shot learning,” in AAAI, 2019.
-  W. Li, L. Wang, J. Xu, J. Huo, G. Yang, and J. Luo, “Revisiting local descriptor based image-to-class measure for few-shot learning,” in CVPR, 2019.
-  A. L. Brown, “The development of memory: Knowing, knowing about knowing, and knowing how to know,” in Advances in child development and behavior. Elsevier, 1975, vol. 10, pp. 103–152.
-  D. R. John and C. A. Cole, “Age differences in information processing: Understanding deficits in young and elderly consumers,” Journal of consumer research, vol. 13, no. 3, pp. 297–315, 1986.
-  X.-S. Wei, P. Wang, L. Liu, C. Shen, and J. Wu, “Piecewise classifier mappings: Learning fine-grained learners for novel categories with few examples,” IEEE TIP, 2019.
-  T.-Y. Lin, A. RoyChowdhury, and S. Maji, “Bilinear convolutional neural networks for fine-grained visual recognition,” IEEE TPAMI, vol. 40, no. 6, pp. 1309–1322, 2018.
-  L. Xie, Q. Tian, M. Wang, and B. Zhang, “Spatial pooling of heterogeneous features for image classification,” IEEE TIP, vol. 23, no. 5, pp. 1994–2008, 2014.
-  S. Gao, I. W.-H. Tsang, and Y. Ma, “Learning category-specific dictionary and shared dictionary for fine-grained image categorization,” IEEE TIP, vol. 23, no. 2, pp. 623–634, 2014.
-  X. Zhang, H. Xiong, W. Zhou, and Q. Tian, “Fused one-vs-all features with semantic alignments for fine-grained visual categorization,” IEEE TIP, vol. 25, no. 2, pp. 878–892, 2016.
-  J. Xu, V. Jagadeesh, and B. Manjunath, “Multi-label learning with fused multimodal bi-relational graph,” IEEE TMM, vol. 16, no. 2, pp. 403–412, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE CVPR, 2016, pp. 770–778.
-  Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in ICML, 2016, pp. 1050–1059.
-  Y. Yao, F. Shen, J. Zhang, L. Liu, Z. Tang, and L. Shao, “Discovering and distinguishing multiple visual senses for web learning,” IEEE TMM, 2018.
-  J. Zhang, Q. Wu, C. Shen, J. Zhang, and J. Lu, “Multilabel image classification with regional latent semantic dependencies,” IEEE TMM, vol. 20, no. 10, pp. 2801–2813, 2018.
-  S. Qiao, C. Liu, W. Shen, and A. L. Yuille, “Few-shot image recognition by predicting parameters from activations,” in IEEE CVPR, 2018, pp. 7229–7238.
-  J. Zhang, Q. Wu, C. Shen, J. Zhang, and J. Lu, “Mind your neighbours: Image annotation with metadata neighbourhood graph co-attention networks,” IEEE CVPR, 2018.
-  C. Huang, H. Li, Y. Xie, Q. Wu, and B. Luo, “Pbc: Polygon-based classifier for fine-grained categorization,” IEEE TMM, vol. 19, no. 4, pp. 673–684, 2016.
-  Z. Xu, D. Tao, S. Huang, and Y. Zhang, “Friend or foe: Fine-grained categorization with weak supervision,” IEEE TIP, vol. 26, no. 1, pp. 135–146, 2017.
-  Y. Zhang, X.-S. Wei, J. Wu, J. Cai, J. Lu, V.-A. Nguyen, and M. N. Do, “Weakly supervised fine-grained categorization with part-based image representation,” IEEE TIP, vol. 25, no. 4, pp. 1713–1725, 2016.
-  B. Zhao, X. Wu, J. Feng, Q. Peng, and S. Yan, “Diversified visual attention networks for fine-grained object classification,” IEEE TMM, vol. 19, no. 6, pp. 1245–1256, 2017.
-  H. Yao, S. Zhang, Y. Zhang, J. Li, and Q. Tian, “Coarse-to-fine description for fine-grained visual categorization,” IEEE TIP, vol. 25, no. 10, pp. 4858–4872, 2016.
-  Y. Peng, X. He, and J. Zhao, “Object-part attention model for fine-grained image classification,” IEEE TIP, vol. 27, no. 3, pp. 1487–1500, 2018.
-  L. Zhang, Y. Yang, M. Wang, R. Hong, L. Nie, and X. Li, “Detecting densely distributed graph patterns for fine-grained image categorization,” IEEE TIP, vol. 25, no. 2, pp. 553–565, 2016.
-  A. Iscen, G. Tolias, P.-H. Gosselin, and H. Jégou, “A comparison of dense region detectors for image search and fine-grained classification,” IEEE TIP, vol. 24, no. 8, pp. 2369–2381, 2015.
-  Y. Gao, O. Beijbom, N. Zhang, and T. Darrell, “Compact bilinear pooling,” in CVPR, June 2016.
-  S. Kong and C. Fowlkes, “Low-rank bilinear pooling for fine-grained classification,” in CVPR, July 2017.
-  Y. Suh, J. Wang, S. Tang, T. Mei, and K. Mu Lee, “Part-aligned bilinear representations for person re-identification,” in ECCV, September 2018.
-  C. Yu, X. Zhao, Q. Zheng, P. Zhang, and X. You, “Hierarchical bilinear pooling for fine-grained visual recognition,” in ECCV, September 2018.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, June 2014.
-  N. Pham and R. Pagh, “Fast and scalable polynomial kernels via explicit feature maps,” in ACM SIGKDD. ACM, 2013, pp. 239–247.
-  J.-H. Kim, K. W. On, W. Lim, J. Kim, J.-W. Ha, and B.-T. Zhang, “Hadamard Product for Low-rank Bilinear Pooling,” in ICLR, 2017.
-  C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in ICML, vol. 70. International Convention Centre, Sydney, Australia: PMLR, 06–11 Aug 2017, pp. 1126–1135.
-  S. Ravi and H. Larochelle, “Optimization as a model for few-shot learning,” in ICLR, 2017.
-  J. Schmidhuber, “Evolutionary principles in self-referential learning, or on learning how to learn: The meta-meta-… hook,” Diplomarbeit, Technische Universität München, München, 1987.
-  S. Thrun and L. Pratt, Eds., Learning to Learn. Norwell, MA, USA: Kluwer Academic Publishers, 1998.
-  W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. F. Wang, and J.-B. Huang, “A closer look at few-shot classification,” ICLR, 2019.
-  W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. Wang, and J.-B. Huang, “A closer look at few-shot classification,” in ICLR, 2019.
-  F. Pahde, P. Jähnichen, T. Klein, and M. Nabi, “Cross-modal hallucination for few-shot fine-grained recognition,” arXiv preprint arXiv:1806.05147, 2018.
-  C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in CVPR, July 2017.
-  H. Yao, S. Zhang, Y. Zhang, J. Li, and Q. Tian, “One-shot fine-grained instance retrieval,” in MM. ACM, 2017, pp. 342–350.
-  L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” JMLR, vol. 9, no. Nov, pp. 2579–2605, 2008.