
Fine-grained Few-shot Recognition by Deep Object Parsing

In our framework, an object is made up of K distinct parts or units, and we parse a test instance by inferring the K parts, where each part occupies a distinct location in the feature space, and the instance features at this location manifest as an active subset of part templates shared across all instances. We recognize a test instance by comparing its active templates and the relative geometry of its part locations against those of the presented few-shot instances. We propose an end-to-end training method to learn part templates on top of a convolutional backbone. To combat visual distortions such as orientation, pose and size, we learn multi-scale templates, and at test-time parse and match instances across these scales. We show that our method is competitive with the state-of-the-art and, by virtue of parsing, enjoys interpretability as well.



1 Introduction

Deep neural networks (DNNs) can be trained to solve visual recognition tasks with large annotated datasets. In contrast, training DNNs for few-shot recognition [39, 46], and its fine-grained variant [40], where only a few examples are provided for each class by way of supervision at test-time, is challenging. Fundamentally, the issue is that a few shots of data are often inadequate to learn an object model across all of its myriad variations, which do not impact an object’s category. For our solution, we propose to draw upon two key observations from the literature.

  • (A) There are specific locations bearing distinctive patterns/signatures in the feature space of a convolutional neural network (CNN), which correspond to salient visual characteristics of an image instance [59, 4].

  • (B) Attention on only a few specific locations in the feature space leads to good recognition accuracy [61, 32, 42].

How can we leverage these observations?

Duplication of Traits. We posit that the visual characteristics found in one instance of an object are widely duplicated among other instances, and even among those belonging to other classes. It follows from our proposition that it is the particular collection of visual characteristics arranged in a specific geometric pattern that uniquely determines an object belonging to a particular class.

Juxtaposing these assumptions with (A) and (B) implies that these shared visual traits can be found in the feature maps of CNNs, and that only a few locations on the feature map suffice for object recognition. CNN features are important, as they distil essential information and suppress redundant or noisy information.

Parsing. We call these finitely many latent locations on the feature maps, which correspond to salient traits, parts. These parts manifest as patterns, where each pattern belongs to a finite (but potentially large) dictionary of templates. This dictionary embodies both the shared vocabulary and the diversity of patterns found across object instances. Our goal is to learn the dictionary of templates for different parts using training data, and at test-time, we seek to parse new instances by identifying part locations and the sub-collection of templates that are exhibited for the few-shot task. (We say “parse” because we view our dictionary as a collection of words, and the geometric relationship between different parts as the relationship between phrases.) The provided few-shot instances are parsed and then compared against the parsed query. The best matching class is then predicted as the output. As an example, see Fig. 1 (a), where the part locations recognized using the learned dictionary correspond to the head, breast and knee of the birds in their images, with corresponding locations in the convolutional feature maps. In matching the images, both the active templates and the geometric structure of the parts are utilized.

Figure 1: Motivation: a) In fine-grained few-shot learning, the most discriminating information is embedded in the salient parts (e.g. head and breast of a bird) and the geometry of the parts (obtuse or acute triangle). Our method parses the object into a structured combination of a finite set of dictionaries, such that both finer details and the shape of the object are captured and leveraged in recognition. b) In the few-shot learning (FSL) setting, the same part may be distorted or absent in the support samples due to perspective and pose changes. We propose to extract features from multiple scales for each part, and down-weight the matching if the scales are not matched.

Inferring part locations based on part-specific dictionaries is a low complexity task, and is analogous to the problem of detection of signals in noise in radar applications [45], a problem solved by matching the received signal against a known dictionary of transmitted signals.

Challenges. Nevertheless, our situation is somewhat more challenging. Unlike the radar situation, we do not a priori have a dictionary, and to learn one, we are only provided class-level annotations by way of supervision. In addition, we require that these learnt dictionaries are compact (because we must be able to reliably parse any input), and yet sufficiently expressive to account for diversity of visual traits found in different objects and classes.

Multi-Scale Dictionaries. Variations in pose and orientation lead to different appearances under perspective projection, which means there is variation in the scale and size of the visual characteristics of parts. To overcome this issue we train dictionaries at multiple scales, which leads us to a parsing scheme that parses input instances at multiple scales. Also, in matching parts across images, we down-weight the matching if there is a mismatch in either the part features or the scale of the dictionary (see Fig. 1 (b)).

Test-Time Inference. At test-time we must infer multiple parts, where the part locations must satisfy relative geometric constraints. Additionally, few-shot instances even within the same class exhibit significant variations in pose, orientation and size, which in turn induce variations in parsed outputs. To mitigate their effects we propose a novel instance-dependent re-weighting method for fusing instances based on their goodness-of-fit to the dictionary.

Contributions. We show that: (i) Our deep object parsing method results in improved performance in few-shot recognition and fine-grained few-shot recognition tasks, outperforming prior art on the Stanford-Car dataset by 2.64%. (ii) We provide an analysis of the different components of our approach, showing the effect of ablating each on final performance. Through a visualization, we show that the parts recognized by our model are salient and help recognize the object category.

2 Related Work

Few-Shot Classification (FSC). Modern FSC methods can be classified into three categories: metric-learning based, optimization-based, or data-augmentation methods. Methods in the first category focus on learning effective metrics to match query examples to support. Prototypical Network [39] utilizes Euclidean distance on the feature space for this purpose. Subsequent approaches built on this by improving the image embedding space [55, 9, 1, 60, 35] or focusing on the metric [41, 49, 3, 38, 27, 51, 11, 57, 56]. Some recent methods have also made use of graph-based methods, especially in transductive few-shot classification [7, 54]. Optimization-based methods train for fast adaptation using a few parameter updates with the support examples [14, 2, 31, 26, 34, 33]. Data-augmentation methods learn a generative model to synthesize additional training data for the novel classes to alleviate the issue of insufficient data [28, 37, 50, 53].

Fine-grained FSC. In fine-grained few-shot classification, different classes differ only in finer visual details. An example of this is to tease apart different species of birds in images. The approaches mentioned above have been applied in this context as well [29, 40, 30, 53]. [29] proposes to learn a local descriptor and an image-to-class measure to capture the similarity between objects. [48] uses a foreground object extractor to exclude noise from the background and synthesizes foreground features to remedy the data insufficiency. BSNet [30] leverages a bi-similarity module to learn feature maps of diverse characteristics to improve the model’s generalization ability. Variational feature disentangling (VFD) [53], a data-augmentation method, is complementary to ours. It disentangles the feature representation into intra-class variance and class-discriminating information to emphasize the inter-class differences, and generates additional features for novel classes to mitigate data scarcity at test-time.

In addition, prior work on attention has been found to be effective both in few-shot learning and in its fine-grained version. This approach seeks to extract the most discriminative image features and ignore background information irrelevant to the object class [21, 61, 32, 22]. In related work, [6] proposes training fixed masks on the feature space, and leveraging outputs from each mask for FSC.

Recognition using Object Parts. Our method is closely related to recognition based on identifying object components, an approach motivated by how humans learn to recognize objects [5]. It draws inspiration from Ullman et al. [44], who showed that information maximization with respect to classes of images resulted in visual features such as eyes and mouth in facial images, and tyres, bumper and windows in images of cars. Along these lines, Deformable Part Models (DPM) [12, 13] proposed to learn object models by composing part features and geometries, and to utilize them for object detection. Neural network models for DPMs were proposed in [36, 15]. Multi-attention based models, which can be viewed as implicitly incorporating parts, have been proposed [58] in the context of fine-grained recognition problems. Although related, a principal difference is our few-shot setting, where new classes emerge, and we need to generate new object models on-the-fly.

Prior works on FSC have also focused on combining parts, albeit with different notions of the concept. As such, the term part is overloaded, and is unrelated to our notion. DeepEMD [57] focuses on an image-distance metric based on an earth mover’s distance between different parts. Here, parts are simply different physical locations in the image and not a compact collection of salient parts for recognition. [42] uses salient object parts for recognition, while [43] attempts to encode parts into image features. However, both these methods require additional attribute annotations for training, which may be expensive to gather and not always available. [17] and [52] discover salient object parts and use them for recognition via attention maps, similar to our method. They also re-weight their similarity task-adaptively, which is also a feature of our method. We differ in our use of a finite dictionary of templates to learn a compact representation of parts. Also, we use reconstruction as supervision for accurately localizing salient object parts, and impose a meaningful prior on the geometry of parts, which keeps us from degenerate solutions for part locations.

3 Deep Object Parsing Method

Figure 2: Overview of our method. An image gets parsed as a collection of salient parts. Once part peaks are located, different scales of attention maps are used for comparing the part in different images to account for any size discrepancies of a given part across images. (Best viewed with zoom; the operator symbol in the figure represents a channel-wise dot product.)

Input instances are passed through a CNN, whose output features have multiple channels and are supported on a 2D grid.

Parsing Instances. A parsed instance has K distinct units, which we call parts. These parts are derived from the CNN output features. The term “parts” is overloaded in prior works. Our notion of a part is a tuple consisting of a part location and a part expression at that location. A part location is an attention mask centered around a 2D vector in the CNN feature space. We derive the part expression for the instance using part templates, which are a finite collection of templates with a fixed spatial support. These templates are learnt during training, and are assumed to be exhaustive across all categories. Although our method allows for several templates per channel, we use one template per channel in this paper; we observed that increasing the number of templates only marginally improves performance. Since channel-wise features represent independent information, we consider channel-wise templates. For any instance, we reconstruct the feature vectors on the mask using a sparse subset of part templates, and the resulting reconstruction coefficients are the part expressions.

Part Expression as LASSO Regression. Given an instance, its feature output, and a candidate part location, we estimate sparse part expressions by optimizing the ℓ1-regularized reconstruction error at that location (Eq. 1). In Eq. 1, the subscript refers to projection of the features onto the support of the mask. The notation employed on the left-hand side denotes a vector of components (here, across channels). We use this for brevity, and its meaning is usually clear from the context.

Non-negativity. Part expressions signify presence or absence of part templates in the observed feature vectors, and as such can be expected to take on non-negative values. This fact turns out to be useful later for DNN implementation.

Part Location Estimation. Note that the part expression is a function of the part location, and as such, the part location can be estimated by plugging in the optimal part expressions for each candidate location value (Eq. 2). This couples the two estimation problems, and is difficult to implement with DNNs, motivating our approach below.

Feedforward DNNs for Parsing. To make the proposed approach amenable to DNN implementation, we approximate the solution to Eq. 1 by optimizing the reconstruction error followed by thresholding: we compute the coefficients that minimize the quadratic reconstruction error, and we threshold the resulting output by deleting entries smaller than a margin. This is closely related to thresholding methods employed in LASSO [18]. The quadratic component of the loss allows for an explicit solution, and the solution reduces to template matching per channel, which can further be expressed as a convolution [16]. In this perspective, we overload notation and consider the template as a convolution kernel. The resulting kernel is matched against the channel feature output, yielding Eq. 3.

For estimating the location, we plug this into Eq. 2 and bound it from above. The first term, denoting the energy across all channels for different values of the candidate location, typically has small variance, and we ignore it. As such, the problem reduces to optimizing the last two terms. As we argued before, ground-truth part expressions are non-negative. Completing the square then yields Eq. 4, which involves a channel-dependent constant and a temperature term that we will use later for efficient implementation. Strictly speaking, the sum over channels in Eq. 4 should only be over those with non-negative expressions, but we do not observe a noticeable difference in experiments, and consider the full sum here.
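The thresholded template-matching step can be illustrated with a small Python sketch. It is not the authors' implementation: the function name `part_expression`, the flattened-patch layout, and the normalization by the template's squared norm (the closed-form least-squares coefficient for a single template) are assumptions made for this example. The default threshold of 0.05 mirrors the margin reported in the implementation details.

```python
def part_expression(patches, templates, margin=0.05):
    """Per-channel least-squares coefficient followed by thresholding.

    patches[c] and templates[c] hold flattened values for channel c on the
    same spatial support (a hypothetical layout chosen for this sketch).
    """
    coeffs = []
    for patch, template in zip(patches, templates):
        energy = sum(t * t for t in template)
        # Closed-form minimizer of ||patch - z * template||^2 over a scalar z.
        z = sum(p * t for p, t in zip(patch, template)) / energy if energy > 0 else 0.0
        # Delete entries smaller than the margin (this also removes negatives).
        coeffs.append(z if z >= margin else 0.0)
    return coeffs


patches = [[1.0, 2.0], [0.01, 0.0], [-1.0, -1.0]]
templates = [[1.0, 2.0], [1.0, 1.0], [1.0, 1.0]]
print(part_expression(patches, templates))
```

In this toy input the first channel matches its template exactly, the second is too weak to survive the threshold, and the third is negatively correlated, so only the first coefficient stays active.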

Multi-Scale Extension. We extend our approach to incorporate multiple scales. This is often required because of significant differences in orientation and pose between query and support examples. To do so, we simply consider masks at varying grid sizes. As such, the proposed method directly generalizes to multiple scales, and part expressions can be obtained across different scales at any candidate location. To estimate the part location, we integrate across all the different scales to find a single estimate.

1 Input: image
2 Parametric functions: convolutional backbone, template collections
3 Get the convolutional feature
4 for each part do
5       Estimate the part location by Eq. 4
6       Compute the part expression at the estimated location
7       Thresholding: delete coefficients smaller than the margin
8 end for
Output: Part locations and template coefficients
Algorithm 1 Object Parsing using DNNs

3.1 Few-Shot Recognition

At test-time we are given a query instance and, by way of supervision, a few support examples for each class. Parsing each support example of each class yields its part locations and part expressions. Similarly, parsing the query yields the query part locations and part expressions.

Geometric Distance. To leverage part-location information, we compare geometries between query and support. To do so, we embed pairwise part relative distances and three-way part angles into a feature space, and use the squared distance in the embedded space.
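As a concrete illustration of the geometric comparison, the sketch below builds a hand-crafted embedding from pairwise distances and triangle angles and compares two sets of part locations. The function names are hypothetical, and the use of raw distances and angles as the embedding is an assumption; the text does not specify how the embedding is computed.

```python
import itertools
import math


def geometry_embedding(locations):
    """Embed 2D part locations as pairwise distances plus three-way angles."""
    feats = [math.dist(a, b) for a, b in itertools.combinations(locations, 2)]
    for a, b, c in itertools.combinations(locations, 3):
        # Interior angle at each vertex of the triangle (a, b, c).
        for apex, p, q in ((a, b, c), (b, a, c), (c, a, b)):
            u = (p[0] - apex[0], p[1] - apex[1])
            v = (q[0] - apex[0], q[1] - apex[1])
            norm = math.hypot(*u) * math.hypot(*v)
            cos = (u[0] * v[0] + u[1] * v[1]) / norm if norm > 0 else 1.0
            feats.append(math.acos(max(-1.0, min(1.0, cos))))
    return feats


def geometric_distance(query_locs, support_locs):
    """Squared Euclidean distance between the two geometry embeddings."""
    q = geometry_embedding(query_locs)
    s = geometry_embedding(support_locs)
    return sum((x - y) ** 2 for x, y in zip(q, s))


triangle = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
shifted = [(x + 1.0, y + 1.0) for x, y in triangle]
print(geometric_distance(triangle, shifted))
```

Because only relative distances and angles enter the embedding, a translated copy of the same part layout is at (numerically) zero distance, which is the kind of invariance a geometry term of this type is meant to provide.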

Part Expression Distance. Part expressions across examples and different parts exhibit significant variability due to differences in pose and orientation. The entropy of the location probability is a key indicator of poor part expressions. Leveraging this insight, we train a weighting function that takes the location entropies of all the support examples for a part, together with the corresponding location entropy of the query, and outputs a score. Additionally, for each example of a class and each part, we train a weighting function to output a composite weighted part expression.
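To make the entropy signal concrete, the following sketch computes a location probability with a softmax over a flattened grid of scores and then its Shannon entropy. The helper names and the softmax form of the location probability are assumptions for illustration; the learned weighting functions themselves are not reproduced here.

```python
import math


def location_probability(scores, temperature=1.0):
    """Softmax over a flattened grid of location scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def location_entropy(probs):
    """Shannon entropy in nats; zero-probability cells contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


peaked = location_probability([8.0, 0.0, 0.0, 0.0])
flat = location_probability([1.0, 1.0, 1.0, 1.0])
print(location_entropy(peaked), location_entropy(flat))
```

A sharply peaked location map has low entropy, while a diffuse one has high entropy, which is why entropy can flag parts that were localized with little confidence.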

In summary, the overall distance is the sum of the geometric and weighted part distances, with a tunable parameter balancing the two terms (Eq. 5).
Training. The training procedure (see Algo. 2) follows convention. We sample classes at random, and additionally sample support and query examples belonging to these classes from the training data. An additional issue arising for us is that we must enforce diversity of part locations during training to ensure that a diverse set of parts is chosen. There are many choices for the joint distribution of part locations to ensure that they are well-separated. We utilize the Hellinger distance [10] and enforce divergence for an example through the loss in Eq. 6, whose weight is a tunable parameter. The overall training loss is the sum of the cross-entropy loss (see Line 12 in Algo 2), based on the distance described in Eq. 5, and the divergence loss (Eq. 6). In Algo 1, the argmax in Line 5 and the delta function in Line 6 are non-differentiable, preventing the back-propagation of gradients during training. We approximate the argmax by taking the mean value of the location for a sufficiently small temperature. We then use a Gaussian function with a small variance (0.5 in our experiments) to approximate the delta function. The multi-scale extension is handled analogously. The templates for different dictionaries are trained in parallel, and the distance function in Eq. 5 is generalized to include part expressions at multiple scales. These modifications lead to an end-to-end differentiable scheme.
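The two differentiable approximations described above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the argmax is relaxed as the softmax-weighted mean grid position (a standard "soft-argmax"), and the delta function as an unnormalized Gaussian mask. The function names are hypothetical; the temperature 0.01 and variance 0.5 are the values quoted in the text.

```python
import math


def soft_argmax(score_grid, temperature=0.01):
    """Expected (row, col) position under a softmax over the grid scores."""
    peak = max(max(row) for row in score_grid)
    weights = [[math.exp((s - peak) / temperature) for s in row] for row in score_grid]
    total = sum(sum(row) for row in weights)
    row_mean = sum(i * w for i, row in enumerate(weights) for w in row) / total
    col_mean = sum(j * w for row in weights for j, w in enumerate(row)) / total
    return row_mean, col_mean


def gaussian_mask(center, size, variance=0.5):
    """Smooth stand-in for a delta function centered at `center`."""
    return [
        [math.exp(-((i - center[0]) ** 2 + (j - center[1]) ** 2) / (2 * variance))
         for j in range(size)]
        for i in range(size)
    ]


scores = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
center = soft_argmax(scores)
mask = gaussian_mask(center, size=3)
print(center)
```

With a small temperature the softmax concentrates almost all of its mass on the highest-scoring cell, so the mean position approaches the hard argmax while remaining a smooth function of the scores.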

1 Input: Training support image-label pairs and training query image-label pairs
2 for each support image and each query image do
3       Parse the image by Algorithm 1
4       Compute the divergence loss by Eq. 6
5 end for
6 for each query image do
7       for each class do
8             Estimate the composite part expression for the class
9             Predict part weight
10             Compute the weighted distance measure by Eq. 5
11       end for
12      Compute the cross-entropy loss
13 end for
Output: loss
Algorithm 2 Training episode loss computation. Each episode is characterized by the number of classes, the number of support examples per class, and the number of query samples per class.
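Line 4 of Algorithm 2 relies on a Hellinger-distance-based divergence between part location distributions. The sketch below shows the Hellinger distance itself for discrete distributions on a shared grid, together with one plausible way to summarize separation across parts. The averaging over all pairs and the function names are assumptions; the exact form of the divergence loss is not recoverable from the text.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same grid."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def mean_pairwise_hellinger(part_probs):
    """Average distance over all pairs of parts; larger means better separated."""
    pairs = [(i, j) for i in range(len(part_probs)) for j in range(i + 1, len(part_probs))]
    return sum(hellinger(part_probs[i], part_probs[j]) for i, j in pairs) / len(pairs)


overlapping = [[0.5, 0.5, 0.0], [0.5, 0.5, 0.0]]
separated = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(mean_pairwise_hellinger(overlapping), mean_pairwise_hellinger(separated))
```

The distance is 0 for identical distributions and 1 for distributions with disjoint support, so a loss that rewards larger values pushes the parts toward distinct locations.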

3.2 Implementation Details

We use two Resnet [19] variants (Resnet-12 and Resnet-18) as our feature extractor. We use the same Resnet-12 benchmarked in prior works [26], which has more output channels and more parameters (12.4M) compared to Resnet-18 (11.2M). The input image is resized to 84×84 for Resnet-12 and 224×224 for Resnet-18. The number of output channels and the size of the output grid depend on the backbone. The number of parts is set to 4 for most experiments, as we find it performs the best (see Sec. 4.2) and renders a diverse distribution of the parts. There are three scales in each part. The temperature in Eq. 4 is set to 0.01 and the margin in Eq. 3 is set to 0.05. One further hyper-parameter is set to 0.01. The weights used to re-weight the part expression distances between scales are predicted by a linear layer with ReLU activation. The part weight is predicted from the concatenation of the entropies of all support and query samples using a linear layer, and is normalized across parts by softmax.

4 Experiments

4.1 Fine-grained Few-Shot Classification

We compare DOPM on three fine-grained datasets: Caltech-UCSD-Birds (CUB) [47], Stanford-Dog (Dog) [23] and Stanford-Car (Car) [25] against state-of-the-art methods.

Caltech-UCSD-Birds (CUB) [47] is a fine-grained classification dataset with 11,788 images of 200 bird species. Following convention [20], the 200 classes are randomly split into 100 base, 50 validation and 50 test classes.

Stanford-Dog/Car [23, 25] are two datasets for fine-grained classification. Dog contains 120 dog breeds with a total number of 20,580 images, while Car consists of 16,185 images from 196 different car models. For few-shot learning evaluation, we follow the benchmark protocol proposed in [29]. Specifically, 120 classes of Dog are split into 70, 20, and 30 classes, for training, validation, and test, respectively. Similarly, Car is split into 130 train, 17 validation and 49 test classes.

| Methods | PA | FT | R&L | Geo | Reweighting |
|---|---|---|---|---|---|
| DOPM (ours) | ✓ | ✓ | ✓ | ✓ | ✓ |

Table 1: Components compared to prior works. PA: use attention to detect parts; FT: use finite templates to express a part; R&L: the part expression is learned by reconstruction and location supervision; Geo: leverage geometrical information; Reweighting: re-weight matching scores. Prior part-based FSL methods do not employ some of the components proposed in DOPM.

Experimental Setup. We conducted 5-way 1-shot and 5-way 5-shot classification tasks on all datasets. Following the episodic evaluation protocol in [46], at test time, we sample 600 episodes and report the averaged Top-1 accuracy. In each episode, 5 classes from the test set are randomly selected. For each class, 1 or 5 samples are drawn as support data, and another 15 examples are drawn as query data. The model is trained on the train split, and the validation split is used to select the hyper-parameters.
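The episodic protocol can be sketched in Python as below. This is a minimal illustration, not the evaluation code used for the reported numbers: `sample_episode` and `mean_and_ci95` are hypothetical helpers, the dataset is assumed to be a dictionary from class label to a list of items, and the 95% interval uses a normal approximation (1.96 standard errors), which is an assumption since the text does not state how the intervals were computed.

```python
import math
import random
import statistics


def sample_episode(images_by_class, n_way=5, n_shot=1, n_query=15, rng=random):
    """Draw one episode: n_way classes with n_shot support and n_query query items each."""
    support, query = {}, {}
    for label in rng.sample(sorted(images_by_class), n_way):
        items = rng.sample(images_by_class[label], n_shot + n_query)
        support[label], query[label] = items[:n_shot], items[n_shot:]
    return support, query


def mean_and_ci95(episode_accuracies):
    """Mean accuracy and half-width of a normal-approximation 95% interval."""
    mean = statistics.fmean(episode_accuracies)
    spread = statistics.stdev(episode_accuracies)
    return mean, 1.96 * spread / math.sqrt(len(episode_accuracies))


toy_data = {label: list(range(20)) for label in "abcdefg"}
support, query = sample_episode(toy_data, rng=random.Random(0))
print(sorted(support), mean_and_ci95([0.8, 0.9, 1.0, 0.9]))
```

Sampling support and query items in a single draw without replacement guarantees that no query item also appears in the support set of its class.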

Training Details. Our model is trained with 10,000 episodes on CUB and 30,000 episodes on Stanford-Dog/Car for both the ResNet12 and ResNet18 experiments. In each episode, we randomly select 10 classes and sample 5 and 10 samples as support and query data, respectively. The weight on the Hellinger distance is set to 1.0 on CUB and 0.1 on Stanford-Dog/Car. We train from scratch with the Adam optimizer [24]. The learning rate starts from 5e-4 on CUB and 1e-3 on Stanford-Car/Dog, and decays to 0.1x every 3,000 episodes on CUB and every 9,000 episodes on Dog/Car. On CUB, objects are cropped using the annotated bounding box before resizing to the input size. On Stanford-Car/Dog, we use the resized raw image as the input. We applied standard data augmentations, including horizontal flip and perspective distortion, to the input images.
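One reading of the learning-rate schedule is a step decay in which the rate is multiplied by 0.1 at every decay boundary. The helper below is a sketch under that assumption; the function name and signature are hypothetical and not taken from any training framework.

```python
def learning_rate(episode, base_lr, decay_every, factor=0.1):
    """Step decay: multiply the rate by `factor` once per `decay_every` episodes."""
    return base_lr * factor ** (episode // decay_every)


# CUB setting from the text: start at 5e-4, decay every 3,000 of 10,000 episodes.
for episode in (0, 2999, 3000, 6000, 9000):
    print(episode, learning_rate(episode, base_lr=5e-4, decay_every=3000))
```

Under this reading, CUB training passes through three decays within its 10,000 episodes, and Dog/Car training likewise passes through three decays within 30,000 episodes.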

| Methods | Backbone | 1-shot | 5-shot |
|---|---|---|---|
| Baseline++ [8] | ResNet18 | 67.02 ± 0.9 | 83.58 ± 0.54 |
| ProtoNet [39] | ResNet18 | 71.88 ± 0.91 | 87.42 ± 0.48 |
| SimpleShot [49] | ResNet18 | 62.85 ± 0.20 | 84.01 ± 0.14 |
| DN4 [29] | ResNet18 | 70.47 ± 0.72 | 84.43 ± 0.45 |
| FOT [48] | ResNet18 | 72.56 ± 0.77 | 87.22 ± 0.46 |
| AFHN [28] | ResNet18 | 70.53 ± 1.01 | 83.95 ± 0.63 |
| Δ-encoder [37] | ResNet18 | 69.80 ± 0.46 | 82.60 ± 0.35 |
| BSNet [30] | ResNet18 | 69.61 ± 0.92 | 83.24 ± 0.60 |
| COMET [6] | Conv6 | 72.20 ± 0.90 | 87.60 ± 0.50 |
| MetaOptNet [26]* | ResNet12 | 75.15 ± 0.46 | 87.09 ± 0.30 |
| DeepEMD [57] | ResNet12 | 75.65 ± 0.83 | 88.69 ± 0.50 |
| MTL [33]* | ResNet12 | 73.31 ± 0.92 | 82.29 ± 0.51 |
| VFD [53] | ResNet12 | 79.12 ± 0.83 | 91.48 ± 0.39 |
| FRN [51] | ResNet12 | 83.16 | 92.59 |
| RENet [22] | ResNet12 | 79.49 ± 0.44 | 91.11 ± 0.24 |
| DOPM | ResNet18 | 82.62 ± 0.65 | 92.61 ± 0.38 |
| DOPM | ResNet12 | 83.39 ± 0.82 | 93.01 ± 0.43 |

Table 2: Few-shot accuracy in % on CUB (along with 95% confidence intervals). If not specified, the results are those reported by the original paper. *: results reported in [53]. Some results are obtained by running the code released by the authors with a ResNet18 backbone.

Compared Methods. We compare our DOPM to state-of-the-art few-shot learning methods, including RENet [22], FRN [51], and DeepEMD [57]. In particular, we compare to methods like FOT [48], VFD [53], and DN4 [29], which are dedicated to the fine-grained setting. To highlight the contribution of DOPM, we tabulate in Tab. 1 the differences in model design compared to prior works [43, 17, 57, 52] in few-shot learning that also use part composition. Like prior works, DOPM extracts parts with an attention mechanism (PA) and re-weights the similarity scores to best match the query to the support samples (Reweighting). In addition, it uses finite templates (FT) learned by reconstruction and location supervision (R&L) during training to express each instance, and leverages part locations to represent the geometry of the object (Geo). These components are missing in previous works, as shown in Tab. 1.

Results. Our results along with comparisons against state-of-the-art on CUB and Stanford-Dog/Car are tabulated in Tab. 2 and Tab. 3, respectively. DOPM shows competitive performance on all the fine-grained benchmarks. On CUB, we achieve 83.39% 1-shot accuracy and 93.01% for 5-shot with Resnet-12 backbone, which outperforms all other state-of-the-art methods.

| Methods | Backbone | Car 1-shot | Car 5-shot | Dog 1-shot | Dog 5-shot |
|---|---|---|---|---|---|
| ProtoNet [39] | ResNet18 | 60.67 ± 0.87 | 75.56 ± 0.45 | 61.06 ± 0.67 | 74.31 ± 0.51 |
| DN4 [29] | ResNet18 | 78.77 ± 0.81 | 91.99 ± 0.41 | 60.73 ± 0.67 | 75.33 ± 0.38 |
| MetaOptNet [26] | ResNet18 | 60.56 ± 0.78 | 76.35 ± 0.52 | 65.48 ± 0.56 | 79.39 ± 0.43 |
| BSNet [30] | ResNet18 | 60.36 ± 0.98 | 85.28 ± 0.64 | - | - |
| MTL* [33] | ResNet12 | - | - | 54.96 ± 1.03 | 68.76 ± 0.65 |
| VFD* [53] | ResNet12 | - | - | 76.24 ± 0.87 | 88.00 ± 0.47 |
| Δ-encoder* [37] | ResNet12 | - | - | 68.59 ± 0.53 | 78.60 ± 0.78 |
| DOPM | ResNet18 | 81.41 ± 0.71 | 93.48 ± 0.38 | 70.56 ± 0.75 | 84.75 ± 0.41 |
| DOPM | ResNet12 | 81.83 ± 0.78 | 93.84 ± 0.45 | 70.10 ± 0.79 | 85.12 ± 0.55 |

Table 3: Few-shot classification accuracy in % on Stanford-Car/Dog benchmarks (along with 95% confidence intervals). *: results reported in [53]. Some results are obtained by running the code released by the authors with a ResNet18 backbone. Note that VFD [53] generates additional features at test-time for novel classes, and is as such complementary to DOPM.

DOPM is competitive with or outperforms recent works on fine-grained FSC. On Stanford-Car, we outperform the compared approaches by 3.06% and 1.85% on 1-shot and 5-shot, respectively. On Stanford-Dog, our method obtains 70.10% for 1-shot and 85.12% for 5-shot, which significantly outperforms most state-of-the-art methods except for VFD. We note that VFD underperforms our method on CUB, so it is possible that the method is specialized to the Dog dataset, where it outperforms all other methods by significant margins. In addition, VFD, being a data augmentation approach, is complementary to DOPM, which is focused on learning the representation for objects.
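The quoted Stanford-Car margins can be checked directly against Table 3. The short script below recomputes them from the tabulated accuracies, taking DN4 [29] as the strongest compared method on Car; the variable names are illustrative only.

```python
car_best_prior = {"1-shot": 78.77, "5-shot": 91.99}  # DN4 [29], ResNet18
car_dopm = {
    "ResNet18": {"1-shot": 81.41, "5-shot": 93.48},
    "ResNet12": {"1-shot": 81.83, "5-shot": 93.84},
}

# Margin of DOPM over the best compared method, in percentage points.
margins = {
    backbone: {shot: round(acc[shot] - car_best_prior[shot], 2) for shot in acc}
    for backbone, acc in car_dopm.items()
}
print(margins)
```

The 3.06 and 1.85 point margins correspond to the ResNet12 backbone, while the 2.64 point figure quoted in the contributions matches the 1-shot margin with ResNet18.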

| CUB | Dog | Car |
|---|---|---|
| 91.83 | 82.07 | 92.78 |
| 92.44 | 83.90 | 93.31 |
| 91.98 | 83.10 | 93.11 |
| 92.61 | 84.75 | 93.48 |

Table 4: Ablation study on re-weighting functions.

| Methods | CUB | Dog | Car |
|---|---|---|---|
| Baseline-1 | 79.59 | 74.31 | 75.56 |
| Baseline-2 | 83.21 | 79.35 | 89.10 |
| DOPM | 92.61 | 84.75 | 93.48 |

Table 5: Ablation study on object parsing.

4.2 Ablation Study

We conduct a series of ablative studies to expose salient aspects of DOPM  on fine-grained datasets based on 5-shot accuracy with the Resnet-18 backbone.

Figure 3: Exemplar part locations learned by DOPM. From left to right: CUB, Dog, Car, and failure cases. DOPM might locate parts on the background if it has visual signatures similar to an object.

Instance-dependent reweighting is beneficial: We proposed two re-weighting functions to mitigate pose/orientation variations: the inner weight to fuse the class representation, and the outer weight to balance the parts and scales. To validate their contributions, we evaluated DOPM without each individual weight. The results are tabulated in Tab. 4. Both weights improve performance over uniform weighting. The outer weight yields the larger gain, but employing both weights achieves the best performance.

Object parsing improves accuracy: To validate the efficacy of parsing the object using part templates, we compared to two baseline models using the same backbone. Baseline-1 replaces the whole part parsing module with an average pooling layer. The divergence loss, geometric cosine similarity and re-weighting are all removed, making the baseline similar to ProtoNet [39]. Baseline-2 removes the multi-scale dictionaries and extracts a feature vector for each part by applying a weighted average pooling using the part location probability. It is similar to a multi-attention convolutional network [58] that focuses on the part features. We compare DOPM to these baselines in Tab. 5. For a fair comparison, we did not employ re-weighting in our model. Baseline-2 achieves better performance than Baseline-1, indicating the benefit of detecting the parts in fine-grained classification. DOPM further improves on the performance of Baseline-2, validating the efficacy of representing each part as a mixture of dictionary atoms.

Multi-scale parsing is beneficial. In Tab. 6 we ablated different choices of scales on Dog. Using multiple scales is better than a single scale, and using all the scales obtains the best performance. This validates our hypothesis that parts are distorted due to pose variations, and a single scale is not sufficient to represent the object in a few-shot scenario.

| scales | [3] | [5] | [3,5] | [1,3,5] |
|---|---|---|---|---|
| Dog | 81.56 | 81.38 | 83.04 | 84.75 |

Table 6: Ablation studies on the scales

| # parts | 3 | 4 | 5 | 6 |
|---|---|---|---|---|
| CUB | 92.10 | 92.61 | 92.21 | 92.06 |

Table 7: Ablation studies on the number of parts

Effect of More Parts. The number of parts was ablated in Tab. 7. Four parts is the best choice on CUB. The accuracy drops with more parts as the model starts to learn irrelevant or background signatures.

Interpretable part locations: We visualize the locations learned by DOPM in Fig. 3. DOPM is able to detect consistent parts for the same task. The parts are interpretable, and we often find semantic parts like head and ears. However, DOPM might sometimes fail to locate parts on the object if similar visual signatures appear in the background.

Figure 4: Exemplar templates of the learned dictionary. The templates are randomly sampled along the channel dimension for two different scales (top and bottom).

Figure 5: Top 4 activated dictionary elements (templates), i.e. those with the highest part expression levels, for two Boston terriers (top 2 rows) and a golden retriever (bottom row). Templates for images of the same class are similar. The template heat map depicts which component of its kernel is large and must not be confused with the spatial field of the feature space.

Dictionary templates: Some templates of the learned dictionary are visualized in Fig. 4. Our model uses each template to reconstruct the original feature in the corresponding channel. We see diverse visual representations in different channels, implying that DOPM learns diverse visual templates from the training set to express objects. Fig. 5 shows the activated templates for different objects. The model uses the same templates to express the same class.

Qualitative results: effect of reweighting: In Fig. 6 we show an example that was misclassified when re-weighting is not employed. With the instance-dependent re-weighting, the noisy part features are down-weighted and the total posterior is dominated by features from the relevant scales, making the final prediction correct. Still, a limitation (Fig. 6) is that when the background bears similarity to an object part, parts can be mis-located, leading to poor parsing.

Figure 6: Example is misclassified without re-weighting. Re-weighting helps find the best match across different scales.
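The effect of re-weighting across scales can be sketched numerically. The snippet below is a minimal illustration and not the learned re-weighting module of DOPM: it assumes per-scale class log-posteriors are already available and that some per-scale reliability logits (hypothetical here, hand-set rather than learned) are turned into weights by a softmax. All names and numbers are assumptions made for the example.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_scales(log_post: Sequence[Sequence[float]],
                   weights: Sequence[float]) -> List[float]:
    """Weighted sum over scales; log_post[s][c] scores class c at scale s."""
    n_classes = len(log_post[0])
    return [sum(w * scale[c] for w, scale in zip(weights, log_post))
            for c in range(n_classes)]


def predict(total: Sequence[float]) -> int:
    """Index of the class with the highest combined score."""
    return max(range(len(total)), key=lambda c: total[c])


# Toy query with 2 scales and 3 classes; the true class is 0.
log_post = [
    [-0.2, -2.0, -3.0],  # scale 0: clean parts, favors class 0
    [-4.0, -0.5, -3.0],  # scale 1: noisy parts (e.g. background), favors class 1
]
uniform = [0.5, 0.5]               # no re-weighting
reweighted = softmax([2.0, -2.0])  # hypothetical reliability logits per scale

pred_uniform = predict(combine_scales(log_post, uniform))
pred_reweighted = predict(combine_scales(log_post, reweighted))
```

With uniform weights the noisy scale pulls the prediction to class 1, whereas down-weighting that scale lets the clean scale dominate and the prediction returns to class 0.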

5 Conclusions

We presented DOPM, a deep object-parsing method for few-shot recognition. Our fundamental concept is that, while object classes exhibit novel visual appearance, visual patterns are duplicated at a sufficiently small scale. Hence, by leveraging training data to learn a dictionary of templates distributed across different relative locations, an object can be recognized simply by identifying which templates in the dictionary are expressed and how these patterns are geometrically distributed. We build a statistical model for parsing that takes the output of a convolutional backbone as input and produces a parsed output. We then post-hoc learn to re-weight query and support instances to identify the best matching class, a procedure that helps mitigate visual distortions. Our proposed method is trained end-to-end as a deep neural network, and we show that its performance is competitive and that its outputs are interpretable.


  • [1] A. Afrasiyabi, J. Lalonde, and C. Gagne (2021) Mixture-based feature space learning for few-shot image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9041–9051. Cited by: §2.
  • [2] S. Baik, J. Choi, H. Kim, D. Cho, J. Min, and K. M. Lee (2021) Meta-learning with task-adaptive loss function for few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9465–9474. Cited by: §2.
  • [3] P. Bateni, R. Goyal, V. Masrani, F. Wood, and L. Sigal (2020) Improved few-shot visual classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14493–14502. Cited by: §2.
  • [4] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba (2017) Network dissection: quantifying interpretability of deep visual representations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6541–6549. Cited by: item (A).
  • [5] I. Biederman (1987) Recognition-by-components: a theory of human image understanding. Psychological review 94 (2), pp. 115. Cited by: §2.
  • [6] K. Cao, M. Brbic, and J. Leskovec (2020) Concept learners for few-shot learning. arXiv preprint arXiv:2007.07375. Cited by: §2, Table 2.
  • [7] C. Chen, X. Yang, C. Xu, X. Huang, and Z. Ma (2021) ECKPN: explicit class knowledge propagation network for transductive few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6596–6605. Cited by: §2.
  • [8] W. Chen, Y. Liu, Z. Kira, Y. F. Wang, and J. Huang (2019) A closer look at few-shot classification. arXiv preprint arXiv:1904.04232. Cited by: Table 2.
  • [9] R. Das, Y. Wang, and J. M. Moura (2021) On the importance of distractors for few-shot classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9030–9040. Cited by: §2.
  • [10] B. Everitt (1998) Cambridge dictionary of statistics. Cited by: §3.1.
  • [11] N. Fei, Y. Gao, Z. Lu, and T. Xiang (2021) Z-score normalization, hubness, and few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 142–151. Cited by: §2.
  • [12] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan (2009) Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence 32 (9), pp. 1627–1645. Cited by: §2.
  • [13] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester (2010) Cascade object detection with deformable part models. In 2010 IEEE Computer society conference on computer vision and pattern recognition, pp. 2241–2248. Cited by: §2.
  • [14] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135. Cited by: §2.
  • [15] R. Girshick, F. Iandola, T. Darrell, and J. Malik (2015) Deformable part models are convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 437–446. Cited by: §2.
  • [16] R. C. Gonzalez and R. E. Woods (2008) Digital image processing. Prentice Hall, Upper Saddle River, N.J.. Cited by: §3.
  • [17] F. Hao, F. He, J. Cheng, L. Wang, J. Cao, and D. Tao (2019) Collect and select: semantic alignment metric learning for few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8460–8469. Cited by: §2, §4.1, Table 1.
  • [18] T. Hastie, R. Tibshirani, and J. Friedman (2001) The elements of statistical learning. Springer Series in Statistics, Springer New York Inc., New York, NY, USA. Cited by: §3.
  • [19] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.2.
  • [20] N. Hilliard, L. Phillips, S. Howland, A. Yankov, C. D. Corley, and N. O. Hodas (2018) Few-shot learning with metric-agnostic conditional embeddings. arXiv preprint arXiv:1802.04376. Cited by: §4.1.
  • [21] Z. Jiang, B. Kang, K. Zhou, and J. Feng (2020) Few-shot classification via adaptive attention. arXiv preprint arXiv:2008.02465. Cited by: §2.
  • [22] D. Kang, H. Kwon, J. Min, and M. Cho (2021) Relational embedding for few-shot classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8822–8833. Cited by: §2, §4.1, Table 2.
  • [23] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei (2011-06) Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO. Cited by: §4.1, §4.1.
  • [24] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • [25] J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia. Cited by: §4.1, §4.1.
  • [26] K. Lee, S. Maji, A. Ravichandran, and S. Soatto (2019) Meta-learning with differentiable convex optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10657–10665. Cited by: §2, §3.2, Table 2, Table 3.
  • [27] A. Li, W. Huang, X. Lan, J. Feng, Z. Li, and L. Wang (2020) Boosting few-shot learning with adaptive margin loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12576–12584. Cited by: §2.
  • [28] K. Li, Y. Zhang, K. Li, and Y. Fu (2020) Adversarial feature hallucination networks for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13470–13479. Cited by: §2, Table 2.
  • [29] W. Li, L. Wang, J. Xu, J. Huo, Y. Gao, and J. Luo (2019) Revisiting local descriptor based image-to-class measure for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7260–7268. Cited by: §2, §4.1, §4.1, Table 2, Table 3.
  • [30] X. Li, J. Wu, Z. Sun, Z. Ma, J. Cao, and J. Xue (2020) BSNet: bi-similarity network for few-shot fine-grained image classification. IEEE Transactions on Image Processing 30, pp. 1318–1331. Cited by: §2, Table 2, Table 3.
  • [31] Z. Li, F. Zhou, F. Chen, and H. Li (2017) Meta-sgd: learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835. Cited by: §2.
  • [32] Y. Lifchitz, Y. Avrithis, and S. Picard (2021) Few-shot few-shot learning and the role of spatial attention. In 2020 25th International Conference on Pattern Recognition (ICPR), pp. 2693–2700. Cited by: item (B), §2.
  • [33] Q. Sun, Y. Liu, T. Chua, and B. Schiele (2018) Meta-transfer learning for few-shot learning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2, Table 2, Table 3.
  • [34] A. Rajeswaran, C. Finn, S. Kakade, and S. Levine (2019) Meta-learning with implicit gradients. Cited by: §2.
  • [35] M. N. Rizve, S. Khan, F. S. Khan, and M. Shah (2021) Exploring complementary strengths of invariant and equivariant representations for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10836–10846. Cited by: §2.
  • [36] P. Savalle, S. Tsogkas, G. Papandreou, and I. Kokkinos (2014) Deformable part models with cnn features. In European Conference on Computer Vision, Parts and Attributes Workshop, Cited by: §2.
  • [37] E. Schwartz, L. Karlinsky, J. Shtok, S. Harary, M. Marder, R. Feris, A. Kumar, R. Giryes, and A. M. Bronstein (2018) Delta-encoder: an effective sample synthesis method for few-shot object recognition. arXiv preprint arXiv:1806.04734. Cited by: §2, Table 2, Table 3.
  • [38] C. Simon, P. Koniusz, R. Nock, and M. Harandi (2020) Adaptive subspaces for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4136–4145. Cited by: §2.
  • [39] J. Snell, K. Swersky, and R. S. Zemel (2017) Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175. Cited by: §1, §2, §4.2, Table 2, Table 3.
  • [40] X. Sun, H. Xv, J. Dong, H. Zhou, C. Chen, and Q. Li (2020) Few-shot learning for domain-specific fine-grained image classification. IEEE Transactions on Industrial Electronics 68 (4), pp. 3588–3598. Cited by: §1, §2.
  • [41] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1199–1208. Cited by: §2.
  • [42] L. Tang, D. Wertheimer, and B. Hariharan (2020) Revisiting pose-normalization for fine-grained few-shot recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14352–14361. Cited by: item (B), §2.
  • [43] P. Tokmakov, Y. Wang, and M. Hebert (2019) Learning compositional representations for few-shot recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6372–6381. Cited by: §2, §4.1, Table 1.
  • [44] S. Ullman, M. Vidal-Naquet, and E. Sali (2002) Visual features of intermediate complexity and their use in classification. Nature neuroscience 5 (7), pp. 682–687. Cited by: §2.
  • [45] H. L. Van Trees (2004) Detection, estimation, and modulation theory, part i: detection, estimation, and linear modulation theory. John Wiley & Sons. Cited by: §1.
  • [46] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. Advances in neural information processing systems 29, pp. 3630–3638. Cited by: §1, §4.1.
  • [47] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology. Cited by: §4.1, §4.1.
  • [48] C. Wang, S. Song, Q. Yang, X. Li, and G. Huang (2021) Fine-grained few shot learning with foreground object transformation. Neurocomputing 466, pp. 16–26. Cited by: §2, §4.1, Table 2.
  • [49] Y. Wang, W. Chao, K. Q. Weinberger, and L. van der Maaten (2019) Simpleshot: revisiting nearest-neighbor classification for few-shot learning. arXiv preprint arXiv:1911.04623. Cited by: §2, Table 2.
  • [50] Y. Wang, R. Girshick, M. Hebert, and B. Hariharan (2018) Low-shot learning from imaginary data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7278–7286. Cited by: §2.
  • [51] D. Wertheimer, L. Tang, and B. Hariharan (2021) Few-shot classification with feature map reconstruction networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8012–8021. Cited by: §2, §4.1, Table 2.
  • [52] J. Wu, T. Zhang, Y. Zhang, and F. Wu (2021) Task-aware part mining network for few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8433–8442. Cited by: §2, §4.1, Table 1.
  • [53] J. Xu, H. Le, M. Huang, S. Athar, and D. Samaras (2021) Variational feature disentangling for fine-grained few-shot classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8812–8821. Cited by: §2, §2, §4.1, Table 2, Table 3.
  • [54] L. Yang, L. Li, Z. Zhang, X. Zhou, E. Zhou, and Y. Liu (2020) Dpgn: distribution propagation graph network for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13390–13399. Cited by: §2.
  • [55] H. Ye, H. Hu, D. Zhan, and F. Sha (2020) Few-shot learning via embedding adaptation with set-to-set functions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8808–8817. Cited by: §2.
  • [56] B. Zhang, X. Li, Y. Ye, Z. Huang, and L. Zhang (2021) Prototype completion with primitive knowledge for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3754–3762. Cited by: §2.
  • [57] C. Zhang, Y. Cai, G. Lin, and C. Shen (2020) DeepEMD: few-shot image classification with differentiable earth mover’s distance and structured classifiers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12203–12213. Cited by: §2, §2, §4.1, Table 1, Table 2.
  • [58] H. Zheng, J. Fu, T. Mei, and J. Luo (2017) Learning multi-attention convolutional neural network for fine-grained image recognition. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 5209–5217. Cited by: §2, §4.2.
  • [59] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2014) Object detectors emerge in deep scene cnns. arXiv preprint arXiv:1412.6856. Cited by: item (A).
  • [60] Z. Zhou, X. Qiu, J. Xie, J. Wu, and C. Zhang (2021) Binocular mutual learning for improving few-shot classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8402–8411. Cited by: §2.
  • [61] Y. Zhu, C. Liu, and S. Jiang (2020) Multi-attention meta learning for few-shot fine-grained image recognition. In IJCAI, pp. 1090–1096. Cited by: item (B), §2.