TOAN: Target-Oriented Alignment Network for Fine-Grained Image Categorization with Few Labeled Samples

05/28/2020 ∙ by Huaxi Huang, et al. ∙ University of Technology Sydney

The challenges of high intra-class variance yet low inter-class fluctuations in fine-grained visual categorization are more severe with few labeled samples, i.e., Fine-Grained categorization problems under the Few-Shot setting (FGFS). High-order features are usually developed to uncover subtle differences between sub-categories in FGFS, but they are less effective in handling the high intra-class variance. In this paper, we propose a Target-Oriented Alignment Network (TOAN) to investigate the fine-grained relation between the target query image and support classes. The feature of each support image is transformed to match the query ones in the embedding feature space, which explicitly reduces the disparity within each category. Moreover, unlike existing FGFS approaches, which devise high-order features over the global image with less explicit consideration of discriminative parts, we generate discriminative fine-grained features by integrating compositional concept representations into global second-order pooling. Extensive experiments are conducted on four fine-grained benchmarks to demonstrate the effectiveness of TOAN compared with state-of-the-art models.




I Introduction

Fine-Grained (FG) visual recognition aims to distinguish different sub-categories belonging to the same entry-level category, such as animal identification [1, 2, 3] and vehicle recognition [4]. The core challenges of this problem are the high intra-class variance yet low inter-class fluctuations within datasets [5, 6]. As Fig. 1 shows, two different gulls, ‘Herring Gull’ and ‘Western Gull’, share similar visual appearances. However, within each class, the posture, illumination condition, and object background can change dramatically, which makes the problem more challenging than generic image classification [7, 8, 9].

Traditional FG models [10, 5, 11, 12, 6, 13, 14, 15, 16] utilize large-scale and fully-annotated datasets to ‘understand’ and ‘memorize’ the training data, thus achieving satisfactory performance in identifying new samples from the same label space. However, it is hard to obtain extensively labeled data in many practical scenarios, e.g., in industrial defect detection, the majority of defects fall into a few common categories, while most categories contain only a small portion. Moreover, annotating a fine-grained dataset requires massive effort, e.g., only a diagnostician with expert knowledge can accurately determine the lesion part from CT images. Consequently, how to obtain an effective model from sparsely labeled samples remains an open problem. In this paper, we focus on one of the representative limited-sample learning settings for FG tasks, i.e., Fine-Grained image classification under Few-Shot settings (FGFS). Most recent FS methods [17, 18, 19, 20, 21, 22, 23] are designed for generic image classification problems, which aim to learn to classify unlabeled query images when only a few labeled support examples are available for each class. Given the specialties of FGFS, directly applying these models without explicitly addressing the FG properties may result in suboptimal performance.

Fig. 1: The minute inter-class visual differences but significant intra-class variations make FGFS tasks more severe and challenging than general FG tasks.

With large-scale and fully-annotated datasets, the high intra-class variance, a core challenge of FG problems, can be partly relieved through supervised training that yields a robust representation of each class. However, for FGFS, each class only contains limited labeled samples. As seen from Fig. 1, in the one-shot bird classification scenario, if the single support (labeled) sample shows a diving posture whereas the query (unlabeled) ones are standing, it can be ‘confusing’ for classifiers to match them. Therefore, the large intra-class differences among support images, as well as between support-query pairs, significantly affect representation learning. Nevertheless, current FGFS models [24, 25, 26, 27] rarely focus on this problem. Second-order representation learning [28, 24, 25, 26, 27] is usually carried out to address the low inter-class variance of the FGFS problem. This type of method focuses on learning FG features over the global image. However, the most discriminative features of FG data usually exist in some small parts. Though these models obtain promising results, further improvements can be achieved by exploring more powerful features.

Fig. 2: The overview of TOAN in the N-way-1-shot task; other support samples are omitted. The model consists of three parts: the feature embedding learns the convolved features; the fine-grained relation extractor contains TOMM and GPBP, which aim to generate robust deep bilinear features from support-query pairs, where PBP stands for Pair-wise Bilinear Pooling; and the comparator maps the query to its corresponding class.

To solve the above challenges, in this paper, we propose a novel neural network in a meta-learning fashion, named the Target-Oriented Alignment Network (TOAN). By jointly employing a feature alignment transformation and a second-order comparative feature extraction, TOAN learns a robust fine-grained relation on each support-query pair, which achieves superior performance for FGFS tasks. More specifically, to eliminate the training biases brought by the intra-class variance, we propose to reformulate the convolutional representations of support images according to the target query by an attention-weighted sum operation, noted as the Target-Oriented Matching Mechanism (TOMM). Unlike the conventional self-attention mechanism [29] that operates on the input itself, TOMM generates the attention weights in a target-oriented fashion. That is, the similarities of the convolutional features between the support-query pairs are computed first and then converted into a soft-attention map. The support features are reformulated based on these attention weights to rule out the possible variance compared to the query. Different from previous works [28, 24, 25, 27], TOMM makes the most use of the spatial relation between support and query images to explicitly cope with the sizeable intra-class variation in FGFS data.

Moreover, we further enhance the discriminative ability of bilinear features to address the low inter-class variance by mining the compositional concept representation of bilinear features. Compositionality helps humans learn new concepts from limited samples since it can decompose concepts into known primitives [30, 31, 32]. For a convolutional neural network, convolutional feature channels correspond to different sets of visual patterns [33, 34, 35]. Therefore, inspired by [36, 37, 35, 16], we incorporate the compositional concepts into the fine-grained feature extraction by combining the channel grouping operation with the pair-wise bilinear pooling, named Group Pair-wise Bilinear Pooling (GPBP).

In summary, this paper makes the following contributions:

  • We propose to automatically learn an explicit feature transformation to reduce the biases caused by the intra-class variance in FGFS, named as TOMM. By adopting a global cross-correction attention mechanism, TOMM can transfer the support image features to align the query image features in the embedding space and generate the new support class features simultaneously.

  • We propose GPBP, which uses convolutional channel grouping to aggregate regional representations into pair-wise bilinear pooling and thus devises second-order features from both global and local views. This leads to more powerful features. To the best of our knowledge, GPBP is the first work to adopt group bilinear pooling in FGFS.

  • Extensive experiments are conducted on four benchmark datasets to investigate the effectiveness of the proposed model, and TOAN outperforms related state-of-the-art methods.

The rest of this paper is organized as follows. Section II introduces related works. Section III presents the proposed TOAN method. In Section IV, we evaluate the proposed method on four widely-used fine-grained datasets. The conclusion is discussed in Section V.

II Related Work

II-A Fine-Grained Image Categorization

Most recent Fine-Grained (FG) models [38, 39, 5, 40, 11, 12, 6, 13, 14, 34, 35, 41, 42, 43] can be roughly grouped into two categories: regional feature-based models [38, 5, 40, 11, 34, 35, 42] and global feature-based methods [44, 10, 39, 45, 12, 6, 13, 16, 46]. The regional feature-based methods focus on mining the discriminative parts of fine-grained objects, as these are the most informative parts. For instance, [38] adopts adversarial learning by ‘destructing’ and ‘constructing’ the input image to integrate the discriminative features, while attention mechanisms [5, 35, 41, 42] are widely used to learn the discriminative masks automatically. Global high-order image descriptors are also favored to enhance the convolutional features, e.g., Lin et al. [6] apply a matrix outer-product operation on the embedded features to generate a second-order representation, while Gao et al. [13] devise a hierarchical approach by using a cross-layer factorized bilinear pooling operation. The square covariance matrices are used in [12]. Our work follows the global feature-based methods to generate second-order features for FGFS.

II-B Generic Few-Shot Learning

Previous works on generic Few-Shot (FS) learning are conducted from various perspectives, such as learning with memory [47, 48], which leverages recurrent neural networks to store the historical information; learning from fine-tuning [49, 17, 50, 51], which designs a meta-learning framework to obtain good initial weights for the neural network; and learning to compare [18, 24, 19, 20, 21, 52]. Among these, learning to compare is the most widely used [53, 54, 55, 18, 24, 19, 20, 21, 26, 56, 57, 27]. In general, learning-to-compare methods can be divided into two modules: the feature embedding and the similarity measurement. By adopting the episode training mechanism [21], these approaches optimize the transferable embedding of both auxiliary data and target data. Then, the query images can be identified by distance-based classifiers [54, 55, 18, 58, 19, 20, 21, 57]. Most recently, [54, 18, 56] focus on exploring regional information for an accurate similarity comparison. Our work is close to the learning-to-compare methods. However, we propose to capture a more robust representation for fine-grained tasks by both eliminating intra-class variations through feature alignment and enhancing pair-wise second-order features, which is tailored for FG challenges.

II-C Fine-Grained Categorization with Few Labeled Samples

Wei et al. [25] propose a Piecewise Classifier Mappings (PCM) framework to tackle fine-grained image categorization under the few-shot setting. PCM injects the bilinear feature [6] into a group of mapping networks to reduce the dimension of the features. A deep distance classifier is then appended to generate the final prediction. SoSN [27] adopts power-normalized second-order pooling to generate the fine-grained features of the input images, and a pair-wise mechanism is then proposed to capture the relationship of support-query pairs; it achieves superior performance on the fine-grained Open MIC dataset [59]. Li et al. [24] replace the bilinear pooling with a covariance pooling operation [60, 61], and a covariance metric is proposed as the distance classifier. Moreover, [26] designs a localization network to generate the foreground and background features for an input image with external bounding box annotations. After that, the bilinear-pooled foreground and background features are concatenated and fed into the classifier. All of [24, 25, 26, 27] adopt second-order pooling on the input image itself (noted as self-bilinear pooling) to capture the fine-grained representation. To further consider the second-order pair-wise relationship between the support and query images, our previous work [28] proposes a factorized low-rank bilinear pooling on them to learn the pair-wise comparative features directly (noted as pair-wise bilinear pooling). Moreover, [28] presents a feature position arrangement module as the feature alignment, trained with a global MSE loss, to boost the discrimination of the fine-grained features. However, this alignment does not consider the spatial dependencies between the support and query images. Different from [28], the proposed model explicitly learns the pair-wise similarities to generate an attention map and reformulates the support image feature based on it, without external supervision like [26]. Furthermore, we also incorporate a group bilinear pooling by integrating the compositional concept representations into pair-wise learning.

Besides the above bilinear-based works, generative models [62, 63, 64] are also used to synthesize more samples for the support classes. Moreover, the MAML-based model [65] adopts a meta-learning strategy to learn good initial FGFS learners.

III Methodology

III-A Problem Definition

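In an FGFS task, we have a small labeled support set $\mathcal{S}$ of $C$ different classes. Given an unlabeled query sample $x_q$ from the query set $\mathcal{Q}$, the goal is to assign the query to one of the $C$ support classes. This target dataset is defined as

$$\mathcal{D} = \{\mathcal{S}, \mathcal{Q}\}, \qquad \mathcal{S} = \{(x_i, y_i)\}_{i=1}^{C \times K}, \qquad \mathcal{Q} = \{x_q\},$$

where $x_i$ and $y_i$ denote the feature and label of a support image. The support set and query set share the same label space. If $\mathcal{S}$ contains $K$ labeled samples for each of $C$ categories, the task is noted as a $C$-way-$K$-shot problem.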

It is difficult to obtain an ideal classifier from the limited annotated support set alone. Therefore, FGFS models usually utilize a fully annotated auxiliary dataset, which has a similar data distribution to, but a disjoint label space from, the target dataset. To make full use of the auxiliary set, we follow the widely used episode training strategy [21] as our meta-training mechanism. Specifically, at each training iteration, one support set and one query set are randomly selected from the auxiliary set to construct a meta-task, and the sampled support set contains the same number of classes and labeled samples per class as the target task. In this way, each training task can mimic the target few-shot problem with the same setting. By adopting thousands of these meta-training operations, the model can transfer the knowledge from the auxiliary dataset to the target dataset.
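The episode construction described above can be sketched as follows. This is a minimal illustration, not the authors' data loader: the function name `sample_episode`, the dictionary layout of the auxiliary set, and the toy class and image identifiers are all hypothetical.

```python
import random


def sample_episode(samples_by_class, n_way, k_shot, n_query, rng):
    """Draw one n_way-way-k_shot meta-task from an auxiliary dataset.

    samples_by_class maps a class name to a list of sample ids
    (a hypothetical layout chosen for this sketch).
    Returns (support, query) as lists of (sample_id, class_name) pairs.
    """
    classes = rng.sample(sorted(samples_by_class), n_way)
    support, query = [], []
    for cls in classes:
        # Draw disjoint support and query samples for this class.
        picked = rng.sample(samples_by_class[cls], k_shot + n_query)
        support += [(s, cls) for s in picked[:k_shot]]
        query += [(s, cls) for s in picked[k_shot:]]
    return support, query


# Toy auxiliary set: 20 classes with 30 sample ids each.
aux = {f"class_{c}": [f"img_{c}_{i}" for i in range(30)] for c in range(20)}
rng = random.Random(0)
support, query = sample_episode(aux, n_way=5, k_shot=1, n_query=15, rng=rng)
```

Because every episode is drawn with the same number of classes and shots as the target task, the model is trained under the conditions it will face at test time.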

Fig. 3: The detailed architecture of the fine-grained relation extractor. The left figure denotes TOMM, and the right one represents the GPBP operation. indicate the embedded support sample and query sample, is the fine-grained relation.

III-B The Proposed TOAN

Given a support image set , where is the -th sample in class , and a query image set , the generic learning-to-compare FS model consists of two parts: the feature embedding module and the comparator , which can be described as:


where denotes the operator of the function composition, aims to learn the feature embedding for raw images, and is the classifier. However, this framework cannot capture a good representation of the subtle differences in FG data. Accordingly, FGFS models [28, 24, 25, 26, 27] incorporate a high-order feature generation module to relieve the low inter-class variance. Nevertheless, they seldom consider the sizeable intra-class variance.

To this end, we propose TOAN to jointly cope with these two problems through a deep fine-grained relation extractor . Fig. 2 illustrates the design of the proposed model:


the comparator assigns each to its nearest category in according to the fine-grained relation , which is generated by applying on the embedded features and . The fine-grained relation is learned as:


is composed of two parts, TOMM and GPBP, to learn from and jointly. TOMM is designed to generate the query image feature and a set of support class representations , (e.g., represents the prototype of class ), which are well-matched in the embedding space, while GPBP focuses on extracting the second-order comparative features from the aligned support-classes prototypes and query image features. We introduce the details of the fine-grained relation extraction and the comparator as follows.

III-B1 Fine-grained Relation Extraction

Target-Oriented Matching Mechanism:

In FG data, different sub-classes belong to the same entry-level class, which means that all samples share a similar appearance. Therefore, the similarities of the same parts among sub-classes are higher than those in different parts, which inspires us to transform support features according to the query using cross-correction attention. That is, for a pair of support image and query image , TOMM is expressed as:


where indicate the channel number and the size for the convolutional feature map, and are the embedded support and query features. capture the task-agnostic similarity between two features (). The aligned support feature and the support class prototype are computed as:


where operates in a row-wise way. By using this matching mechanism, is transformed to , where the similarity of each spatial position between and reaches the highest. By averaging all aligned features in the given support class, TOMM obtains well-matched support-class prototypes and query features. Since TOMM aligns the support features in each class to match the query ones, the intra-class variance in each class is thus reduced.
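A minimal sketch of the matching step described above, under stated assumptions rather than the paper's exact formulation: features are stored as one channel vector per spatial position, the learned task-agnostic transformations are replaced by the identity, similarity is a plain dot product, and the row-wise normalization is taken to be a softmax over support positions. The helper names `align_support` and `class_prototype` are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(row):
    # Numerically stable softmax over one row of similarities.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def align_support(support, query):
    """Re-express support features at the query's spatial positions.

    support, query: lists of position vectors (one per spatial location),
    each of length c. Assumption: identity in place of the learned
    transformations that the paper applies before the similarity.
    """
    aligned = []
    for q_vec in query:
        # Attention of this query position over all support positions.
        weights = softmax([dot(q_vec, s_vec) for s_vec in support])
        aligned.append([
            sum(w * s_vec[ch] for w, s_vec in zip(weights, support))
            for ch in range(len(q_vec))
        ])
    return aligned


def class_prototype(supports, query):
    """Average the aligned features of all support images of one class."""
    aligned = [align_support(s, query) for s in supports]
    k = len(aligned)
    return [
        [sum(a[p][ch] for a in aligned) / k for ch in range(len(query[0]))]
        for p in range(len(query))
    ]


# Toy example: the support holds the same parts as the query, but at
# shifted spatial positions (a different "posture").
query = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
support = [[0.0, 0.0, 10.0], [10.0, 0.0, 0.0], [0.0, 10.0, 0.0]]
aligned = align_support(support, query)
prototype = class_prototype([support, support], query)
```

In the toy example the aligned support matches the query position by position, which is the sense in which the spatial mismatch between the two images is removed before comparison.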

It is worth noting that existing works [66, 56] adopt cross-correction attention to generate efficient features (using attention maps to figure out the semantically related features between support-query pairs), yet do not change the position of the convolutional channels, while [54] only uses cross-correction attention to find the closest local features. These works are designed for generic FS problems and are essentially different from the proposed TOMM, which explicitly transfers the support image features to match the query ones globally and is tailored for FG challenges.

Group Pair-wise Bilinear Pooling:

Semantic compositional information plays an important role in FG tasks, as the most discriminative information always exists in some small parts. However, current FGFS models [25, 27, 26, 28] focus on learning the FG features over the global image. Moreover, studies show that high-level convolutional channels represent specific semantic patterns [37, 16, 35]. To this end, we propose to combine compositional concept representations into second-order feature extraction to generate more powerful features for FGFS.

GPBP is composed of the convolutional channel grouping operation followed by pair-wise bilinear feature extraction. Given a pair of support class feature and query image feature , semantic grouping operation is formulated as:


where converts the original feature into disjoint groups along the channel dimension. Each of these feature groups contains channels, which corresponds to a semantic subspace [36]. For and , we define a bilinear feature of and as:


where represent the spatial features of and in the given position . is a projection matrix that fuses and into a scalar. By adopting on each spatial position of feature pairs, a bilinear feature is obtained. For each channel group , GPBP learns projection matrices, and then we concatenate these scalars to generate a fine-grained relation:


After obtaining the fine-grained relations of each group, we then combine them into the final relation (for brevity, its subscript is omitted) as:


where is the final dimension of . Similar to [67, 13], we adopt a low-rank approximation of the projection matrix to reduce the number of parameters for regularization:


where , , and denotes the Hadamard product.
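The grouping and low-rank pooling steps can be illustrated with a short sketch. It relies on the standard identity that, for a projection matrix factorized as W = U Vᵀ, the bilinear form aᵀ W b equals the sum of the Hadamard product (Uᵀ a) ∘ (Vᵀ b). The function names, the matrix shapes, and the choice to keep the Hadamard product as the per-group feature are assumptions made for illustration, not details confirmed by the text.

```python
def split_channels(vec, n_groups):
    """Split a channel vector into n_groups disjoint, equal-sized groups."""
    size = len(vec) // n_groups
    return [vec[g * size:(g + 1) * size] for g in range(n_groups)]


def project(mat, vec):
    """Return mat^T vec, where mat has len(vec) rows."""
    return [
        sum(mat[r][k] * vec[r] for r in range(len(vec)))
        for k in range(len(mat[0]))
    ]


def low_rank_bilinear(a, b, U, V):
    """Hadamard product (U^T a) * (V^T b); its sum equals a^T (U V^T) b."""
    return [x * y for x, y in zip(project(U, a), project(V, b))]


def gpbp_position(support_vec, query_vec, Us, Vs):
    """Grouped pair-wise bilinear feature at one spatial position.

    Us, Vs hold one (hypothetical) factor matrix per channel group.
    """
    n_groups = len(Us)
    out = []
    for a, b, U, V in zip(split_channels(support_vec, n_groups),
                          split_channels(query_vec, n_groups), Us, Vs):
        out += low_rank_bilinear(a, b, U, V)
    return out


# Toy factors: 2 channels per group, rank 3.
U = [[1, 0, 2], [0, 1, 1]]
V = [[1, 1, 0], [2, 0, 1]]
feature = gpbp_position([1, 2, 0, 1], [3, 4, 1, 1], [U, U], [V, V])
```

With four channels split into two groups, `feature` has 2 × 3 entries. The factorization stores 2 × (c/N) × d values per group instead of a full projection matrix for every output dimension, which is the parameter saving the low-rank approximation is meant to provide.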

III-B2 Comparator

After capturing the comparative bilinear features of query image and support class , the comparator is defined as:


where the comparator learns the distance between the support class and the query image; that is, for each query, the comparator generates one similarity per support category. The query image is assigned to the nearest category. As in [55, 20], we use the MSE loss as our training loss to regress the predicted label to the ground truth.
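As a small worked example of this training objective, the sketch below regresses a query's similarity scores toward a one-hot label with the MSE loss and predicts the class with the highest similarity. The score values are made up for illustration, and the helper names are hypothetical.

```python
def mse_loss(scores, true_index):
    """Mean squared error between similarity scores and a one-hot target."""
    target = [1.0 if i == true_index else 0.0 for i in range(len(scores))]
    return sum((s - t) ** 2 for s, t in zip(scores, target)) / len(scores)


def predict(scores):
    """Assign the query to the support class with the highest similarity."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical similarities of one query to five support classes.
scores = [0.1, 0.8, 0.1, 0.0, 0.0]
predicted_class = predict(scores)
loss = mse_loss(scores, true_index=1)
```

Here the loss is (0.01 + 0.04 + 0.01) / 5 = 0.012, and it reaches zero only when the correct class scores 1 and every other class scores 0.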

III-C Network Architecture

Feature Embedding Module: In FGFS and FS tasks, the feature embedding module can be any suitable convolutional neural network, such as Conv4 [49, 20, 21], ResNet [49, 53, 55, 18], or AlexNet [25].

Fine-grained Relation Extractor: We show the detailed architecture of the fine-grained relation extraction module in Fig. 3. TOMM: To construct and , we use a convolutional layer with a kernel followed by batch normalization and a LeakyReLU layer. The Target-Oriented Matching is implemented by Eq. (6). GPBP: For the channel grouping, we split the embedded feature map into groups along the channel dimension. Pair-wise bilinear pooling consists of a convolutional layer with kernel followed by batch normalization and a ReLU layer. Then the Hadamard product operation is applied to generate the final bilinear features.

Comparator: The comparator consists of two convolutional blocks and two fully-connected layers. Each block contains a convolution, a batch normalization, and a ReLU nonlinearity layer. The activation function of the first fully-connected layer is ReLU, and a Sigmoid transformation is added after the output of the last fully-connected layer to generate similarities for input pairs.

IV Experiment

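| Split | CUB | DOGS | CARS | NABirds |
|---|---|---|---|---|
| Total categories | 200 | 120 | 196 | 555 |
| Target categories | 50 | 30 | 49 | 139 |
| Auxiliary training categories | 120 | 70 | 130 | 350 |
| Auxiliary validation categories | 30 | 20 | 17 | 66 |

TABLE I: The dataset splits, giving for each dataset the total number of categories, the number of categories in the target dataset, and the numbers of training and validation categories in the auxiliary dataset.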

IV-A Dataset

We evaluate the proposed method on four datasets: Caltech-UCSD Birds-200-2011 (CUB) [3], which contains 11,788 images spanning 200 sub-categories of birds; Stanford Dogs (DOGS) [1], which consists of 20,580 images with 120 dog species; Stanford Cars (CARS) [4], which has 196 categories of cars and a total number of 16,185 images; North America Birds (NABirds) [2], which consists of 48,562 bird images from 555 bird species. For a fair comparison, we follow the latest data splits [28, 18, 24] of FG benchmarks in DN4 [18], as Table I shows.

| Methods | MiniImageNet 1-shot (%) | MiniImageNet 5-shot (%) |
|---|---|---|
| RelationNet [20] | 50.44±0.82 | 65.32±0.70 |
| RelationNet, ours | 51.87±0.45 | 64.75±0.57 |
| ProtoNet [19] | 49.42±0.78 | 68.20±0.66 |
| ProtoNet, ours | 47.57±0.63 | 66.21±0.58 |
| MatchingNet [21] | 43.56±0.84 | 55.31±0.73 |
| MatchingNet, ours | 48.90±0.62 | 65.67±0.55 |

TABLE II: Re-implementations validation.

IV-B Experimental Setting

All experiments are conducted in the 5-way-1-shot and 5-way-5-shot fashions on the above datasets. During each episode of training and testing, we randomly select five categories to construct an FGFS task. For the 5-way-1-shot setting, we randomly sample 5×1 + 5×15 = 80 images from the selected categories, with one support image and 15 query images in each class. Similarly, we randomly choose 5×5 + 5×10 = 75 images to set up the 5-way-5-shot experiment. We resize the input image to 84×84 and train models from scratch using Adam optimization [68]. The initial learning rate is 0.001. We set the group number of GPBP as four and the bilinear feature dimension as 1024. Moreover, we fix the output channel of . Besides FGFS models, generic FS models can still be applied to fine-grained data. Following the first FGFS method [25], we select representative ones for comparison with the proposed TOAN on FGFS tasks. We compare various models as follows:

| Methods | CUB 1-shot | CUB 5-shot | CARS 1-shot | CARS 5-shot | DOGS 1-shot | DOGS 5-shot | NABirds 1-shot | NABirds 5-shot |
|---|---|---|---|---|---|---|---|---|
| MatchingNet [21] | 57.59±0.74 | 70.57±0.62 | 48.03±0.60 | 64.22±0.59 | 45.05±0.66 | 60.60±0.62 | 60.70±0.78 | 76.23±0.62 |
| ProtoNet [19] | 53.88±0.72 | 70.85±0.63 | 45.27±0.61 | 64.24±0.61 | 42.58±0.63 | 59.49±0.65 | 55.85±0.78 | 75.34±0.63 |
| RelationNet [20] | 59.82±0.77 | 71.83±0.61 | 56.02±0.74 | 66.93±0.63 | 44.75±0.70 | 58.36±0.66 | 64.34±0.81 | 77.52±0.60 |
| CovaMNet [24] | 58.51±0.94 | 71.15±0.80 | 56.65±0.86 | 71.33±0.62 | 49.10±0.76 | 63.04±0.65 | 60.03±0.98 | 75.63±0.79 |
| DN4 [18] | 55.60±0.89 | 77.64±0.68 | 59.84±0.80 | 88.65±0.44 | 45.41±0.76 | 63.51±0.62 | 51.81±0.91 | 83.38±0.60 |
| PCM [25] | 42.10±1.96 | 62.48±1.21 | 29.63±2.38 | 52.28±1.46 | 28.78±2.33 | 46.92±2.00 | - | - |
| PABN+ [28] | 63.36±0.80 | 74.71±0.60 | 54.44±0.71 | 67.36±0.61 | 45.65±0.71 | 61.24±0.62 | 66.94±0.82 | 79.66±0.62 |
| LRPABN [28] | 63.63±0.77 | 76.06±0.58 | 60.28±0.76 | 73.29±0.58 | 45.72±0.75 | 60.94±0.66 | 67.73±0.81 | 81.62±0.58 |
| SoSN [27] | 63.95±0.72 | 78.79±0.60 | 62.84±0.68 | 75.75±0.52 | 48.01±0.76 | 64.95±0.64 | 69.53±0.77 | 83.87±0.51 |
| TOAN | 65.34±0.75 | 80.43±0.60 | 65.90±0.72 | 84.24±0.48 | 49.30±0.77 | 67.16±0.49 | 70.02±0.80 | 85.52±0.50 |
| TOAN:ResNet | 67.17±0.81 | 82.09±0.56 | 76.62±0.70 | 89.57±0.40 | 51.83±0.80 | 69.83±0.66 | 76.14±0.75 | 90.21±0.40 |

TABLE III: Few-shot classification accuracy (%) comparisons on four FG benchmarks. All results are with confidence intervals where reported. We highlight the best and the second best methods.

| Methods | CUB 1-shot | CUB 5-shot | CARS 1-shot | CARS 5-shot | DOGS 1-shot | DOGS 5-shot | NABirds 1-shot | NABirds 5-shot |
|---|---|---|---|---|---|---|---|---|
| MatchingNet [21] | 57.59±0.74 | 70.57±0.62 | 48.03±0.60 | 64.22±0.59 | 45.05±0.66 | 60.60±0.62 | 60.70±0.78 | 76.23±0.62 |
| MatchingNet+TOMM | 60.87±0.78 | 75.12±0.61 | 53.79±0.72 | 72.67±0.55 | 47.06±0.74 | 63.22±0.62 | 65.83±0.75 | 80.73±0.57 |
| Gain | +3.28 | +4.55 | +5.76 | +8.45 | +2.01 | +2.62 | +5.13 | +4.50 |
| ProtoNet [19] | 53.88±0.72 | 70.85±0.63 | 45.27±0.61 | 64.24±0.61 | 42.58±0.63 | 59.49±0.65 | 55.85±0.78 | 75.34±0.63 |
| ProtoNet+TOMM | 61.60±0.76 | 75.09±0.61 | 52.50±0.69 | 68.13±0.58 | 46.36±0.73 | 61.56±0.65 | 64.77±0.79 | 80.84±0.56 |
| Gain | +7.72 | +4.24 | +7.23 | +3.89 | +3.78 | +2.07 | +9.92 | +5.50 |
| RelationNet [20] | 59.82±0.77 | 71.83±0.61 | 56.02±0.74 | 66.93±0.63 | 44.75±0.70 | 58.36±0.66 | 64.34±0.81 | 77.52±0.60 |
| RelationNet+TOMM | 64.84±0.77 | 79.75±0.54 | 62.35±0.77 | 81.57±0.51 | 47.24±0.78 | 65.23±0.66 | 69.55±0.77 | 85.01±0.51 |
| Gain | +5.02 | +7.92 | +6.33 | +14.64 | +2.49 | +6.87 | +5.21 | +7.49 |

TABLE IV: Ablation study for TOMM. Every model improves (%) in all settings after incorporating TOMM.

| Methods | CUB 1-shot | CUB 5-shot | CARS 1-shot | CARS 5-shot | DOGS 1-shot | DOGS 5-shot | NABirds 1-shot | NABirds 5-shot |
|---|---|---|---|---|---|---|---|---|
| RelationNet+GPBP | 60.00±0.74 | 74.01±0.60 | 58.35±0.73 | 73.49±0.59 | 46.45±0.70 | 61.70±0.65 | 65.43±0.81 | 80.13±0.58 |
| TOAN-GP* | 65.80±0.78 | 79.37±0.61 | 65.88±0.74 | 82.69±0.50 | 50.10±0.79 | 65.90±0.68 | 69.48±0.75 | 85.48±0.53 |
| TOAN-w/o | 64.48±0.76 | 78.82±0.59 | 60.02±0.73 | 81.65±0.49 | 47.27±0.72 | 63.98±0.65 | 68.70±0.79 | 83.70±0.53 |
| TOAN | 65.34±0.75 | 80.43±0.60 | 65.90±0.72 | 84.24±0.48 | 49.30±0.77 | 67.16±0.49 | 70.02±0.80 | 85.52±0.50 |
| TOAN_224 | 69.03±0.79 | 83.19±0.56 | 69.48±0.74 | 87.38±0.45 | 53.67±0.80 | 69.77±0.70 | 75.17±0.76 | 88.77±0.46 |
| TOAN:ResNet_224 | 69.91±0.82 | 84.86±0.57 | 77.25±0.73 | 91.19±0.40 | 55.77±0.79 | 72.16±0.72 | 77.32±0.70 | 91.39±0.41 |

TABLE V: Ablation study of TOAN for other choices. Few-shot classification results (%) on four FG datasets.

FS Baselines:

MatchingNet [21] (NIPS2016), ProtoNet [19] (NIPS2017), and RelationNet [20] (CVPR2018) are three exemplary generic few-shot learning methods. For fair comparisons, we re-implement these methods by referring to the source codes with our experimental settings. Moreover, we conduct several experiments on the miniImageNet dataset [21] with the Conv4 backbone [20, 55, 28] to validate the correctness of our implementations of the three baselines [21, 19, 20]. All the training and testing settings of the re-implementations are the same as for TOAN. We present the comparison results between our re-implementations and the original baselines in Table II. It can be observed that the classification accuracies of our re-implementations do not decrease by more than 2% compared to the originally reported performance. These minor differences are attributed to the modifications of some implementation details in our experimental settings, as investigated in [49].


For generic FS models, we compare our model with DN4 [18] (CVPR2019) and CovaMNet [24] (AAAI2019). Their results on CUB and NABirds are obtained from their open-sourced models, while other results are quoted. For FGFS methods, we select PCM [25] (TIP2019), SoSN [27] (WACV2019), and the FGFS models PABN+ and LRPABN from [28] for comparison.

TOAN Family:

First of all, we add TOMM to the FS baseline models to investigate its effectiveness for FGFS tasks, noted as FS+TOMM, where FS can be any one of the baseline models. Similarly, GPBP is plugged into RelationNet, named RelationNet+GPBP. To investigate the grouping function, we replace the proposed function by a convolutional layer with a group parameter [37], noted as TOAN-GP*. Moreover, we remove the task-agnostic transformation in TOMM, noted as TOAN-w/o, where pair-wise similarities are computed directly from the embedded support and query features. We replace the embedding Conv4 network [20] with a deeper ResNet network [18] to study the influence of backbones on TOAN, noted as TOAN:ResNet. Finally, we use a larger input image size with different backbones to study the effects of the input image size on TOAN, named TOAN_224 and TOAN:ResNet_224, respectively.

IV-C Experimental Results

Comparison with State-of-the-Art: The comparisons between TOAN and other state-of-the-art methods are shown in Table III. It can be observed that our methods compare favorably with most generic FS and FGFS approaches, often by large margins, on 1-shot and 5-shot experiments. Specifically, under the 5-way-1-shot setting, the classification accuracies are 65.34% vs. 63.95% [27], 65.90% vs. 59.84% [18], 49.30% vs. 49.10% [24], and 70.02% vs. 67.73% [28] on CUB, CARS, DOGS, and NABirds, respectively. Moreover, by replacing the shallow backbone Conv4 with a deeper ResNet model [18], the accuracy of TOAN:ResNet improves further. For example, under the 5-way-5-shot setting, TOAN:ResNet achieves 82.09%, 89.57%, 69.83%, and 90.21% compared with 80.43%, 84.24%, 67.16%, and 85.52% for the TOAN model on CUB, CARS, DOGS, and NABirds.

Ablation Studies about TOMM: First of all, we investigate the effectiveness of TOMM for FGFS tasks. As Table IV shows, there is an average increase of approximately 5% after adopting TOMM in the three FS baselines. Moreover, when incorporating TOMM into RelationNet, the model achieves superior performance over most of the compared approaches. For instance, under the 5-way-5-shot setting, the classification accuracies of RelationNet+TOMM are 79.75% vs. 78.79% [27], 65.23% vs. 63.04% [24], and 85.01% vs. 83.38% [18] on CUB, DOGS, and NABirds, respectively. This verifies that the significant intra-class variance is a crucial problem in FGFS tasks and that TOMM is an effective mechanism to tackle it across models. Moreover, as the proposed task-agnostic transformation can better capture the similarities of input pairs, TOAN outperforms TOAN-w/o, as shown in Table V. Fig. 4 gives a visualization of TOMM. Instead of visualizing the embedded feature directly, we utilize the original image to get a more vivid description of the proposed feature alignment. More specifically, we resize the original images to the same size as the Target-Oriented attention map (19×19). Then we multiply the image with the corresponding attention map by matrix multiplication to generate aligned features in a way similar to Equation (6). We observe that for each support image (each row in Fig. 4), TOMM transforms its feature to match each query (the image at the top of each column). For instance, in the fourth column of Fig. 4(a), the postures of the five support cars are reshaped to match the red query car in the top row. Therefore, TOAN can reduce the intra-class variance in the dataset to some extent.
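The average gain quoted from Table IV can be checked directly from the accuracies listed there. The sketch below copies the mean accuracies of the three baselines with and without TOMM (columns in the order CUB, CARS, DOGS, NABirds, each with 1-shot then 5-shot) and averages the differences; the variable names are illustrative only.

```python
# Mean accuracies (%) from Table IV; confidence intervals are ignored.
baseline = {
    "MatchingNet": [57.59, 70.57, 48.03, 64.22, 45.05, 60.60, 60.70, 76.23],
    "ProtoNet": [53.88, 70.85, 45.27, 64.24, 42.58, 59.49, 55.85, 75.34],
    "RelationNet": [59.82, 71.83, 56.02, 66.93, 44.75, 58.36, 64.34, 77.52],
}
with_tomm = {
    "MatchingNet": [60.87, 75.12, 53.79, 72.67, 47.06, 63.22, 65.83, 80.73],
    "ProtoNet": [61.60, 75.09, 52.50, 68.13, 46.36, 61.56, 64.77, 80.84],
    "RelationNet": [64.84, 79.75, 62.35, 81.57, 47.24, 65.23, 69.55, 85.01],
}

# Per-setting gain of each baseline after adding TOMM.
gains = {
    name: [round(t - b, 2) for b, t in zip(baseline[name], with_tomm[name])]
    for name in baseline
}
all_gains = [g for per_model in gains.values() for g in per_model]
mean_gain = sum(all_gains) / len(all_gains)
smallest_gain = min(all_gains)
```

Computed this way, the mean gain over the 24 settings falls between 5 and 6 percentage points, and every individual gain is positive.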

Ablation Studies about GPBP: Unlike simple distance-based FS methods, which measure similarities of embedded feature pairs through Euclidean [19] or cosine distance [21], GPBP learns to mine fine-grained relations between support-query pairs. Thus, we combine GPBP with RelationNet to study its capability. RelationNet+GPBP brings consistent gains over RelationNet, as shown in Table V. After combining TOMM and GPBP, the final model TOAN achieves significant improvements over the ablation models, as expected, which indicates that TOMM and GPBP benefit each other. From the second row of results in Table V, the compared grouping [37] model TOAN-GP* achieves performance comparable to TOAN under the 1-shot setting. However, its performance is lower than that of TOAN under the 5-shot setting, which verifies the effectiveness of our grouping operation.

(a) CARS Dataset.
(b) CUB Dataset.
Fig. 4: TOMM visualization. The first image in each row is the support image, and the remaining images in the row are its aligned results, each matched to a query image (shown at the top of each column).
(a) Semantic Grouping Validation.
(b) Features Dimension Selection.
Fig. 5: Discussions of GPBP, including semantic channel grouping validation (a) and feature dimension selection (b).
(a) TOAN, 80.67% accuracy.
(b) RelationNet, 76.67% accuracy.
Fig. 6: t-SNE visualization of the features learned by TOAN (a) and RelationNet (b). We randomly select 5 classes, each with 30 query images.

We conduct two further experiments to discuss the proposed GPBP operation. First, we evaluate the semantic grouping in Fig. 5(a). As can be seen, when the number of groups is less than or equal to eight, using more groups results in higher performance; e.g., the classification accuracy is highest (80.69%) with eight groups under the 5-shot setting, which indicates the effectiveness of semantic grouping in boosting the performance of the bilinear features. When the number of groups is greater than eight, the performance tends to be stable with small fluctuations.

Next, we conduct a feature dimension selection experiment, as shown in Fig. 5(b). A higher feature dimension brings a slight improvement in the 5-shot experiment for TOAN. For example, under 5-way-5-shot CUB categorization, the accuracy is 78.93% vs. 80.10% when the length of the bilinear feature is 64 vs. 2048, while the model remains relatively stable in the 1-shot experiments.

Input Image Size for TOAN: In high-order-based FG methods [10, 12, 6], a higher input resolution can yield more fine-grained features, which is consistent with the findings reported for current FGFS models [28, 27]. Therefore, we conduct experiments to investigate the effect of input resolution on TOAN. As Table V shows, TOAN_224 and TOAN:ResNet_224, which use a larger input size, achieve further improvements over TOAN with its smaller image resolution. This also validates that a higher input resolution can bring better performance in FGFS tasks with our model.

t-SNE Visualization for the Learned Features: Fig. 6(a) visualizes the distribution of the learned fine-grained features using t-SNE [69]. The features are generated under the 5-way-5-shot setting on CUB, with 30 query images per class. As can be observed, the features learned by TOAN form more compact and separable clusters than those of RelationNet (Fig. 6(b)), indicating that the obtained features are more discriminative.

Backbone          1-shot       5-shot       1-shot       5-shot       1-shot       5-shot       1-shot       5-shot
Conv-64 [20]      65.34±0.75   80.43±0.60   65.90±0.72   84.24±0.48   49.30±0.77   67.16±0.49   70.02±0.80   85.52±0.50
Conv-128          64.56±0.78   80.02±0.59   69.20±0.72   86.39±0.44   50.26±0.77   66.96±0.66   70.90±0.77   85.63±0.49
Conv-256          66.16±0.80   80.72±0.58   68.89±0.74   85.29±0.46   49.68±0.75   67.52±0.66   71.26±0.76   86.42±0.47
Conv-512 [70]     66.44±0.77   81.46±0.54   69.59±0.73   86.27±0.45   49.20±0.74   66.75±0.66   72.74±0.76   86.91±0.50
ResNet-64         69.25±0.81   81.90±0.61   74.64±0.76   90.20±0.41   53.33±0.82   69.96±0.70   75.98±0.72   89.55±0.44
ResNet-128        68.95±0.78   83.40±0.58   75.14±0.72   90.95±0.36   52.69±0.81   69.95±0.71   76.14±0.75   90.51±0.38
ResNet-256 [18]   67.17±0.81   82.09±0.56   76.62±0.70   89.57±0.40   51.83±0.80   69.83±0.66   76.14±0.75   90.21±0.40
ResNet-512        66.10±0.86   82.27±0.60   75.28±0.72   87.45±0.48   49.77±0.86   69.29±0.70   76.24±0.77   89.88±0.43
TABLE VI: Different backbone choices of TOAN. All results are with confidence intervals where reported.
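As a reminder of how a "mean ± interval" entry of the kind used in Table VI is typically obtained in episodic few-shot evaluation, the sketch below averages per-episode accuracies and attaches a normal-approximation interval. The 95% level (factor 1.96), the 600 episodes, and the synthetic accuracies are assumptions for illustration; the caption does not state them. The function name `mean_and_ci95` is hypothetical.

```python
import math
import random

def mean_and_ci95(accuracies):
    """Return (mean, half_width) of a normal-approximation 95% interval."""
    n = len(accuracies)
    mean = sum(accuracies) / n
    # Unbiased sample variance over episodes.
    variance = sum((a - mean) ** 2 for a in accuracies) / (n - 1)
    # Standard error of the mean, scaled by the 95% normal quantile.
    half_width = 1.96 * math.sqrt(variance / n)
    return mean, half_width

# Synthetic per-episode accuracies (in %), for illustration only.
random.seed(0)
episode_acc = [random.gauss(65.0, 10.0) for _ in range(600)]
mean, half_width = mean_and_ci95(episode_acc)
print(f"{mean:.2f} ± {half_width:.2f}")
```

Under these assumptions, a per-episode standard deviation of about 10 points over 600 episodes gives a half-width of roughly 1.96 × 10 / √600 ≈ 0.8, the same order of magnitude as the intervals in the table.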
Methods (CUB data set)   1-shot (%)   5-shot (%)   Model Size   Inference Time ( s)   Feature Dim
ProtoNet [19]            53.88        70.85        113,088      0.69                  64
MatchingNet [21]         57.59        70.57        113,088      0.68                  64
RelationNet [20]         59.82        71.83        228,686      1.14                  128
DN4 [18]                 55.60        77.64        112,832      15.20                 64
PABN [71]                63.36        74.71        375,361      8.65                  4096
LRPABN [28]              63.63        76.06        344,251      2.53                  512
TOAN                     65.60        78.93        198,417      0.66                  64
TOAN                     65.61        79.81        237,585      0.87                  128
TOAN                     64.69        80.35        315,921      1.04                  256
TOAN                     64.17        79.19        472,593      1.23                  512
TOAN                     65.34        80.43        785,937      2.34                  1024
TABLE VII: Ablation study of TOAN about model complexity. Model size indicates the number of parameters for each model, and the Inference Time is the testing time for each input query image.
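The "Model Size" column can be sanity-checked with simple arithmetic. The sketch below counts the learnable parameters of a four-block convolutional embedding, assuming each block is a 3 × 3 convolution on RGB input followed by batch normalization with a learnable scale and shift. This layer layout is an assumption about the standard Conv-64 backbone, not a description taken from this section, and the function names are hypothetical.

```python
def conv_bn_params(in_channels, out_channels, kernel_size=3, bias=True):
    # Convolution weights, plus one bias per output channel if enabled.
    conv = in_channels * out_channels * kernel_size * kernel_size
    if bias:
        conv += out_channels
    # Batch normalization: one learnable scale and one shift per channel.
    batch_norm = 2 * out_channels
    return conv + batch_norm

def conv4_params(width=64, in_channels=3, bias=True):
    # Four identical blocks; only the first sees the input image channels.
    total = 0
    channels = in_channels
    for _ in range(4):
        total += conv_bn_params(channels, width, bias=bias)
        channels = width
    return total

with_bias = conv4_params(64, bias=True)
without_bias = conv4_params(64, bias=False)
print(with_bias, without_bias)
```

Under these assumptions the count is 113,088 with convolution biases and 112,832 without, which is consistent with the sizes listed for ProtoNet and MatchingNet and for DN4, respectively.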

Different Backbones for TOAN: In our experiments, except for TOAN:ResNet, which adopts ResNet [18] as its backbone, all models use Conv4 [20, 55, 28] (Conv-64 in Table VI) as their embedding module. We therefore evaluate our proposed model with additional backbones. First, we adopt Conv-512 [70] as the embedding network, which is derived from Conv-64 by increasing the width across layers to 512 feature channels. In the same way, we further revise Conv-64 into Conv-128 and Conv-256. Similarly, we design ResNet-64, ResNet-128, and ResNet-512. From Table VI, we observe that a wider Conv-based backbone can result in higher performance in FGFS classification. On the other hand, deeper backbones achieve further performance improvements compared to shallow ones. For instance, ResNet-64 outperforms Conv-512 in both the 1-shot and 5-shot experiments. In the 5-shot setting, TOAN achieves relatively stable performance as the width of the ResNet changes.

Model Complexity and Inference Time: The main complexity of our model comes from the TOMM operation, whose cost depends on the size of the convolutional feature map. In general, a deeper convolutional network results in a smaller feature map before the classifier, so the TOMM operation is efficient with deeper backbones. We conduct additional experiments to compare the model size and inference time of TOAN with those of previous works [19, 21, 20, 18, 71, 28]. As shown in Table VII, with the same feature dimension, the proposed TOAN achieves the best performance among the compared models while keeping a small model size and a short inference time. With a larger feature dimension, the classification performance can be further improved.

V Conclusion

In this paper, we propose a target-oriented alignment network (TOAN) for few-shot fine-grained image categorization. Specifically, a target-oriented matching mechanism is proposed to eliminate the biases brought by the intra-class variance in fine-grained datasets, a crucial issue that has received little attention in current studies. Moreover, group pair-wise bilinear pooling is adopted to learn compositional bilinear features. We demonstrate the effectiveness of the proposed model on four benchmark datasets, where it achieves state-of-the-art performance.


  • [1] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei, “Novel dataset for fine-grained image categorization,” in First Workshop on Fine-Grained Visual Categorization, CVPR, 2011.
  • [2] G. Van Horn, S. Branson, R. Farrell, S. Haber, J. Barry, P. Ipeirotis, P. Perona, and S. Belongie, “Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection,” in CVPR, 2015.
  • [3] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The Caltech-UCSD Birds-200-2011 Dataset,” California Institute of Technology, Tech. Rep. CNS-TR-2011-001, 2011.
  • [4] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3d object representations for fine-grained categorization,” in 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), 2013.
  • [5] J. Fu, H. Zheng, and T. Mei, “Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition,” in CVPR, 2017.
  • [6] T.-Y. Lin, A. RoyChowdhury, and S. Maji, “Bilinear cnn models for fine-grained visual recognition,” in ICCV, 2015.
  • [7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
  • [9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NeurIPS, 2012.
  • [10] Y. Cui, F. Zhou, J. Wang, X. Liu, Y. Lin, and S. Belongie, “Kernel pooling for convolutional neural networks,” in CVPR, 2017.
  • [11] J. Krause, H. Jin, J. Yang, and L. Fei-Fei, “Fine-grained recognition without part annotations,” in CVPR, 2015.
  • [12] P. Li, J. Xie, Q. Wang, and Z. Gao, “Towards faster training of global covariance pooling networks by iterative matrix square root normalization,” in CVPR, 2018.
  • [13] C. Yu, X. Zhao, Q. Zheng, P. Zhang, and X. You, “Hierarchical bilinear pooling for fine-grained visual recognition,” in ECCV, 2018.
  • [14] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, “Part-based r-cnns for fine-grained category detection,” in ECCV, 2014.
  • [15] Y. Zhang, H. Tang, and K. Jia, “Fine-grained visual categorization using meta-learning optimization with sample selection of auxiliary data,” in ECCV, 2018, pp. 233–248.
  • [16] H. Zheng, J. Fu, Z.-J. Zha, and J. Luo, “Learning deep bilinear transformation for fine-grained image representation,” in NeurIPS, 2019, pp. 4279–4288.
  • [17] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in ICML, 2017.
  • [18] W. Li, L. Wang, J. Xu, J. Huo, G. Yang, and J. Luo, “Revisiting local descriptor based image-to-class measure for few-shot learning,” in CVPR, 2019.
  • [19] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” in NeurIPS, 2017.
  • [20] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales, “Learning to compare: Relation network for few-shot learning,” in CVPR, 2018.
  • [21] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra et al., “Matching networks for one shot learning,” in NeurIPS, 2016.
  • [22] Y.-X. Wang, R. Girshick, M. Hebert, and B. Hariharan, “Low-shot learning from imaginary data,” in CVPR, 2018, pp. 7278–7286.
  • [23] Y.-X. Wang and M. Hebert, “Learning to learn: Model regression networks for easy small sample learning,” in ECCV.   Springer, 2016, pp. 616–634.
  • [24] W. Li, J. Xu, J. Huo, L. Wang, G. Yang, and J. Luo, “Distribution consistency based covariance metric networks for few-shot learning,” in AAAI, 2019.
  • [25] X.-S. Wei, P. Wang, L. Liu, C. Shen, and J. Wu, “Piecewise classifier mappings: Learning fine-grained learners for novel categories with few examples,” IEEE TIP, vol. 28, no. 12, pp. 6116–6125, 2019.
  • [26] D. Wertheimer and B. Hariharan, “Few-shot learning with localization in realistic settings,” in CVPR, 2019, pp. 6558–6567.
  • [27] H. Zhang and P. Koniusz, “Power normalizing second-order similarity network for few-shot learning,” in WACV.   IEEE, 2019, pp. 1185–1193.
  • [28] H. Huang, J. Zhang, J. Zhang, J. Xu, and Q. Wu, “Low-rank pairwise alignment bilinear network for few-shot fine-grained image classification,” arXiv preprint arXiv:1908.01313, 2019.
  • [29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in NeurIPS, 2017.
  • [30] I. Biederman, “Recognition-by-components: a theory of human image understanding.” Psychological review, vol. 94, no. 2, p. 115, 1987.
  • [31] D. D. Hoffman and W. A. Richards, “Parts of recognition,” Cognition, vol. 18, no. 1-3, pp. 65–96, 1984.
  • [32] D. Marr and H. K. Nishihara, “Representation and recognition of the spatial organization of three-dimensional shapes,” Proceedings of the Royal Society of London. Series B. Biological Sciences, vol. 200, no. 1140, pp. 269–294, 1978.
  • [33] M. Simon and E. Rodner, “Neural activation constellations: Unsupervised part model discovery with convolutional networks,” in ICCV, 2015.
  • [34] X. Zhang, H. Xiong, W. Zhou, W. Lin, and Q. Tian, “Picking deep filter responses for fine-grained image recognition,” in CVPR, 2016.
  • [35] H. Zheng, J. Fu, T. Mei, and J. Luo, “Learning multi-attention convolutional neural network for fine-grained image recognition,” in ICCV, 2017.
  • [36] P. Hu, X. Sun, K. Saenko, and S. Sclaroff, “Weakly-supervised compositional feature aggregation for few-shot recognition,” arXiv preprint arXiv:1906.04833, 2019.
  • [37] T. Zhang, G.-J. Qi, B. Xiao, and J. Wang, “Interleaved group convolutions,” in ICCV, 2017.
  • [38] Y. Chen, Y. Bai, W. Zhang, and T. Mei, “Destruction and construction learning for fine-grained image recognition,” in CVPR, 2019.
  • [39] M. Engin, L. Wang, L. Zhou, and X. Liu, “Deepkspd: Learning kernel-matrix-based spd representation for fine-grained image recognition,” in ECCV, 2018, pp. 612–627.
  • [40] W. Ge, X. Lin, and Y. Yu, “Weakly supervised complementary parts models for fine-grained image classification from the bottom up,” in CVPR, 2019.
  • [41] H. Zheng, J. Fu, Z.-J. Zha, and J. Luo, “Looking for the devil in the details: Learning trilinear attention sampling network for fine-grained image recognition,” in CVPR, 2019.
  • [42] X. He, Y. Peng, and J. Zhao, “Which and how many regions to gaze: Focus discriminative regions for fine-grained visual categorization,” International Journal of Computer Vision, vol. 127, no. 9, pp. 1235–1255, 2019.
  • [43] X. He and Y. Peng, “Fine-grained visual-textual representation learning,” IEEE Transactions on Circuits and Systems for Video Technology, 2019.
  • [44] S. Cai, W. Zuo, and L. Zhang, “Higher-order integration of hierarchical convolutional activations for fine-grained visual categorization,” in ICCV, 2017.
  • [45] P. Koniusz, H. Zhang, and F. Porikli, “A deeper look at power normalizations,” in CVPR, 2018, pp. 5774–5783.
  • [46] Y. Hu, Y. Yang, J. Zhang, X. Cao, and X. Zhen, “Attentional kernel encoding networks for fine-grained visual categorization,” IEEE Transactions on Circuits and Systems for Video Technology, 2020.
  • [47] T. Munkhdalai and H. Yu, “Meta networks,” in ICML, 2017.
  • [48] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap, “Meta-learning with memory-augmented neural networks,” in ICML, 2016.
  • [49] W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. F. Wang, and J.-B. Huang, “A closer look at few-shot classification,” in ICLR, 2019.
  • [50] A. Rajeswaran, C. Finn, S. M. Kakade, and S. Levine, “Meta-learning with implicit gradients,” in NeurIPS, 2019, pp. 113–124.
  • [51] S. Ravi and H. Larochelle, “Optimization as a model for few-shot learning,” in ICLR, 2017.
  • [52] C. Zhang, C. Li, and J. Cheng, “Few-shot visual classification using image pairs with binary transformation,” IEEE Transactions on Circuits and Systems for Video Technology, 2019.
  • [53] S. Gidaris and N. Komodakis, “Dynamic few-shot visual learning without forgetting,” in CVPR, 2018.
  • [54] F. Hao, F. He, J. Cheng, L. Wang, J. Cao, and D. Tao, “Collect and select: Semantic alignment metric learning for few-shot learning,” in ICCV, 2019.
  • [55] H. Li, D. Eigen, S. Dodge, M. Zeiler, and X. Wang, “Finding task-relevant features for few-shot learning by category traversal,” in CVPR, 2019.
  • [56] Z. Wu, Y. Li, L. Guo, and K. Jia, “Parn: Position-aware relation networks for few-shot learning,” in ICCV, 2019, pp. 6659–6667.
  • [57] S. W. Yoon, J. Seo, and J. Moon, “Tapnet: Neural network augmented with task-adaptive projection for few-shot learning,” in ICML, 2019.
  • [58] Y. Liu, J. Lee, M. Park, S. Kim, E. Yang, S. Hwang, and Y. Yang, “Learning to propagate labels: Transductive propagation network for few-shot learning,” in ICLR, 2019.
  • [59] P. Koniusz, Y. Tas, H. Zhang, M. Harandi, F. Porikli, and R. Zhang, “Museum exhibit identification challenge for the supervised domain adaptation and beyond,” in ECCV, September 2018.
  • [60] A. Cherian, S. Sra, A. Banerjee, and N. Papanikolopoulos, “Jensen-bregman logdet divergence with application to efficient similarity search for covariance matrices,” IEEE TPAMI, vol. 35, no. 9, pp. 2161–2174, 2012.
  • [61] O. Tuzel, F. Porikli, and P. Meer, “Region covariance: A fast descriptor for detection and classification,” in ECCV.   Springer, 2006, pp. 589–600.
  • [62] F. Pahde, M. Nabi, T. Klein, and P. Jahnichen, “Discriminative hallucination for multi-modal few-shot learning,” in ICIP.   IEEE, 2018, pp. 156–160.
  • [63] S. Tsutsui, Y. Fu, and D. Crandall, “Meta-reinforced synthetic data for one-shot fine-grained visual recognition,” in NeurIPS, 2019, pp. 3057–3066.
  • [64] X. He and Y. Peng, “Only learn one sample: Fine-grained visual categorization with one sample training,” in Proceedings of the 26th ACM international conference on Multimedia, 2018, pp. 1372–1380.
  • [65] Y. Zhu, C. Liu, and S. Jiang, “Multi-attention meta learning for few-shot fine-grained image recognition.”
  • [66] R. Hou, H. Chang, B. Ma, S. Shan, and X. Chen, “Cross attention network for few-shot classification,” in NeurIPS, 2019, pp. 4005–4016.
  • [67] J.-H. Kim, K.-W. On, W. Lim, J. Kim, J.-W. Ha, and B.-T. Zhang, “Hadamard product for low-rank bilinear pooling,” in ICLR, 2017.
  • [68] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
  • [69] L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” JMLR, vol. 9, no. Nov, pp. 2579–2605, 2008.
  • [70] S. Gidaris, A. Bursuc, N. Komodakis, P. Perez, and M. Cord, “Boosting few-shot visual learning with self-supervision,” in ICCV, 2019.
  • [71] H. Huang, J. Zhang, J. Zhang, Q. Wu, and J. Xu, “Compare more nuanced: Pairwise alignment bilinear network for few-shot fine-grained learning,” in 2019 IEEE International Conference on Multimedia and Expo (ICME).   IEEE, 2019, pp. 91–96.