Few-shot fine-grained classification and person search share important similarities, as both require attention to detail, e.g. what distinguishes a person from other people, or a bird from other, possibly similar, species. Both fields have progressed considerably in recent years Kim et al. (2021); Wang et al. (2020). Few-shot learning eases the burden of large data collections when generalizing to new unseen (possibly rare) classes. Person search is useful for video surveillance, long-term tracking and person verification. Both tasks face similar challenges: background clutter, illumination and viewpoint changes, occlusions, image blur and distortions, including non-rigid deformations of the object body pose Xiao et al. (2017); Hariharan (2020).
Person search is the task of finding a specific person, as provided by a single query image, within a gallery image. It consists of localization within the gallery (detection) and re-identification (classification based on the single query example). Few-shot learning similarly stands for recognizing the queried object, either classifying or detecting, typically from one or a few (e.g., five) examples (1- and 5-shot learning). Fine-grained classification specifically describes the challenge of recognizing an object (bird, aircraft, dog etc.) from a few details (the shape of the beak, the pattern on the wings etc.). Person search is therefore a one-shot fine-grained classification task, which includes detection. Note that in few-shot fine-grained classification the query-gallery pair is termed support-query respectively, which is especially confusing for the role of the query. Throughout this work, we adopt the person search terminology and search a query person or object within a gallery image. See Sec. 3 for more details.
We propose a novel unified Query-Guided Network (QGN) to address both person search and few-shot fine-grained classification. Query guidance is novel and stands for processing the query and gallery images jointly, with a Siamese network design and query-gallery interaction modules. By contrast, prior literature in person search Xiao et al. (2017); Chen et al. (2020); Dong et al. (2020) and few-shot learning Snell et al. (2017); Mangla et al. (2020); Chen et al. (2019) typically extracts separate features for the query and gallery images, which prevents their models from emphasizing query-specific patterns in the gallery search.
QGN proposes three query-gallery interaction modules: i. the Query-guided Siamese Squeeze-and-Excitation Network (QSSE) re-weights both the query and gallery channel features, jointly conditioned on both images; ii. the Query Similarity Network (QSimNet) learns a similarity metric which is specific for comparing with the query; iii. the Query-guided RPN (QRPN) is used for detection, to provide query-specific proposals (besides the classic RPN).
The modularity of QGN allows to evaluate the core idea of extensively using query guidance in retrieval for detection and classification tasks. In both cases, query guidance enhances the relevance of ID features in the network backbone, matching function and, if present, in the region proposal. We consider person search as the detection task (in any case, this subsumes person re-identification) and few-shot fine-grained recognition as the classification task (to the best of our knowledge, there is no established few-shot fine-grained object detection benchmark yet).
Query-guidance is novel in the few-shot context. We evaluate QGN on five widely adopted few-shot fine-grained datasets: CUB Wah et al. (2011), Stanford Cars Krause et al. (2013), FGVC-Aircraft Maji et al. (2013), Stanford Dogs Khosla et al. (2011), and Oxford Flowers Nilsback and Zisserman (2006). QGN achieves state-of-the-art results on CUB, FGVC-Aircraft and Stanford Dogs. Particularly on CUB, QGN surpasses the current best S2M2 Mangla et al. (2020) by a large margin, i.e. 12pp and 5pp in 1- and 5-shot learning experiments, respectively. Moreover, when employing a shallower ResNet18, the performance of QGN surpasses S2M2, which employs the deeper WRN Mangla et al. (2020), by 3.1pp for 1-shot learning.
For person search, we add our query-guided components on top of a recently improved OIM implementation, and achieve performance competitive with the state of the art on the large-scale CUHK-SYSU Xiao et al. (2017) and PRW Zheng et al. (2017) datasets. We report a comparison with several competing person search techniques, including the ones following our original work Munjal et al. (2019). Both in person search and in few-shot fine-grained classification, we perform an in-depth analysis, including diverse backbones (ResNet10, ResNet18, ResNet50, WRN-28-10). Furthermore, we illustrate the intuition behind our proposed query-guided components via qualitative visualizations on both tasks.
2 Related Work
We review prior art on few-shot learning, fine-grained classification and person search, emphasizing methods which condition the feature extraction upon the query. To the best of our knowledge, QGN is the first technique addressing both tasks and it is the first query-guidance approach for few-shot fine-grained classification.
Few-shot learning. Few-shot learning aims to train models that can rapidly adapt and generalize to new concepts using only a few samples. The copious recent progress in the field can be loosely divided into five categories. In the first, metric-based methods Sung et al. (2018); Zhao et al. (2022) learn a shared embedding space for the comparison of the feature embeddings from the query and the gallery images. The proposed QSimNet resembles the relation module in the Relation Network Sung et al. (2018), but the input features of query and gallery are jointly extracted and end-to-end trained. In the second category, optimization-based methods Finn et al. (2017) adjust the optimization algorithm to learn from a few examples. Here the most popular is MAML Finn et al. (2017), which optimizes the initialization of the gradient-descent-based learner. Data hallucination is a third direction, which augments the scarce provided data with synthesized examples.
More recently, Chen et al. (2019) proposed a simpler transfer learning approach using a distance-based classifier, which is competitive with other more sophisticated approaches. S2M2 Mangla et al. (2020) extends their work with self-supervision techniques Su et al. (2020). Following Chen et al. (2019); Mangla et al. (2020), QGN also employs non-episodic training, hence it does not need to train separately for different few-shot protocols. Unlike transfer learning methods, QGN jointly processes the query and the gallery with a Siamese network model and it does not need any fine-tuning at inference time.
Finally, the category of dynamic network conditioning methods uses the query or gallery examples to either tune or condition the network by an attention-based mechanism Hou et al. (2019) or to generate network parameters Feng (2018). Matching networks Vinyals et al. (2016) apply conditioning as post-processing with a bidirectional LSTM. Liang et al. (2022) use a weight-centric learning strategy to push samples closer to their corresponding classifier weights. Other approaches generate weights by means of a kernel generator or by combining basis convolutional kernel filters Feng (2018). These techniques relate to QSSE, which we employ for feature extraction; however, our approach is the only one to make use of both the query and gallery features from the very first layers. Similar to ours, CAM Hou et al. (2019) generates query-gallery cross-attention maps, but it focuses on image parts rather than entire feature channels, as we do. Also, the correlation layer of CAM is applied only once at the output layer, due to its high memory and runtime requirements, while our simpler QSSE is applied at all network layers, which results in query-gallery interaction across both coarser and finer details.
Few-shot fine-grained classification. Fine-grained differs mainly from general few-shot learning as it focuses on categories with subtle distinctive traits, e.g. species of birds, dogs, flowers, car models. This is more complex and less researched. Within this literature, Hariharan (2020) targets fine-grained few-shot recognition by learning pose normalized embedding and uses extra part annotations. Zhu et al. (2020) uses attention modules after the feature extractor to infer spatial and channel attentions. Tang et al. (2022) employs a multi-scale feature pyramid and a multi-level attention pyramid to extract features of different granularities. More recently, Chen et al. (2019) evaluates the generic few-shot methods including ProtoNet Snell et al. (2017), MatchingNet Vinyals et al. (2016), RelationNet Sung et al. (2018) and MAML Finn et al. (2017) on few-shot fine-grained classification. S2M2 Mangla et al. (2020) also evaluates its approach on the fine-grained case. Sun et al. (2020) propose a unifying loss for various fine-grained tasks. Unlike the above methods, our QGN is a Siamese model and it leverages query-gallery cross-attention.
Person Search. There are several person search techniques, but they are distinct from the methods above, as none addresses both tasks. In person search, we distinguish sequential methods Dong et al. (2020); Han et al. (2019), which cascade the person detection and person re-identification sub-tasks, from joint methods Munjal et al. (2019); Liu et al. (2022), which perform both sub-tasks with a single network. The latter lag slightly behind sequential models in performance and are more complex to train, since detection and re-identification are conflicting sub-tasks. However, joint methods generally require less memory and fewer computational resources, and are therefore preferable for industrial applications. QGN belongs to this second category, but the proposed query-guidance components are applicable to a sequential method, too.
Among the joint models, QGN relates to Xiao et al. (2017) which introduces Online Instance Matching (OIM) into Faster RCNN Ren et al. (2015) as an additional multi-task loss. OIM is the de-facto standard re-identification loss, adopted by most recent person search approaches Kim et al. (2021); Yan et al. (2021) as well as by QGN. PGA Yan et al. (2021) uses the class prototype as a guidance for person attention. AlignPS Kim et al. (2021) proposes an anchor free framework for person search with a feature aggregation module. Similarly, BINet Dong et al. (2020) and NAE Chen et al. (2020) build on top of OIM. BINet Dong et al. (2020) employs an additional parallel branch that takes cropped patches and supervises the joint model with interaction losses. NAE Chen et al. (2020) decomposes the embeddings of OIM into angle and norm to accomplish re-ID and detection respectively. Li et al. (2021) uses a hierarchical distillation strategy to transfer knowledge from a stronger teacher model to a student model. QGN is the first to introduce query-gallery interaction modules at different stages of the network, as well as throughout the backbone.
Query-guided person search. Our prior work Munjal et al. (2019) was the first to introduce query guidance for person search. Afterwards, this has been adopted by a few techniques, including TCTS Wang et al. (2020) and IGPN Dong et al. (2020). TCTS proposes an identity-guided query detector to produce query-like person boxes for the subsequent re-ID network. IGPN replaces the standard two-stage detector with a query- or instance-guided detector. IGPN adopts the Siamese RPN, which correlates the query and gallery feature maps. By contrast, the proposed QRPN takes the query image crop as input and re-weights the feature channels of the gallery image, emphasizing the traits of the person we are searching for. Also, both IGPN and TCTS are sequential approaches that use two different models for detection and re-identification, while ours is a joint approach. Note that joint models require fewer resources than sequential approaches, as both the model parameters and the processing are shared by the backbone. Additionally, learning joint models provides an appealing multi-task objective, and addressing it successfully may result in a better use of data, higher performance and a better direction towards general intelligence, i.e. networks which understand multiple aspects of the scene.
In this section, we first formulate few-shot fine-grained classification and person search tasks. Then we discuss the proposed model and the three query-guided modules, as well as the optimization details.
3.1 Problem formulation
Let us describe the few-shot fine-grained classification and the person search tasks in a unified way.
The training and test sets for both tasks can be given as with classes and with classes, respectively. Here, represents the images and their corresponding ground-truth annotations. In particular, stands for the object classes in the case of few-shot fine-grained classification; and it means the person-ID and its location in the image for the task of person search. The set of and classes in and are disjoint, i.e. at test time the model needs to classify new classes and person-IDs.
Following literature from both tasks, we employ an episodic evaluation protocol, where a subset is sampled from with novel classes in each episode. A part of , i.e. examples from each of the classes, is considered as query. The remaining part of is the gallery, where the model needs to find the queries.
In the few-shot case, $N \times K$ examples are sampled per episode, i.e. $N$ classes with $K$ examples per class as query. This is termed $N$-way $K$-shot classification. A further set of examples per class is used as gallery. $N$ also represents the complexity of the evaluation: a larger $N$ means more competition among classes during classification. On the other hand, $K$ represents the number of examples per class in the episode that we can use as query: a larger $K$ means more information per class. Typically, $K$ is either 1 (1-shot learning) or 5 (5-shot learning).
In person search, an episode is sampled per query example. Here, the episode includes all positive examples corresponding to that query and a large number of random negatives from the test set, with the gallery size fixed by the benchmark protocol, e.g. for CUHK-SYSU Xiao et al. (2017). Therefore, $N = 2$ and $K = 1$. $N = 2$ means person search follows a binary classification strategy, i.e. either the gallery sample matches the query or not. $K = 1$ means only one example per class is given as a query at one time. Therefore, person search can be viewed as a special case of few-shot classification, i.e. 1-shot learning.
Note: The terminology used in few-shot classification literature is different from that of person search. In person search, the query image is the one for which the class (or ID) is already known, while the gallery image needs to be classified. Whereas, in few-shot classification, the query is the image that needs to be classified and the support is the image for which the class is already known. To keep the terminology consistent, we adopt the query-gallery convention of person search for few-shot case as well.
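The episodic protocol described above can be illustrated with a small sampler. This is only a sketch under stated assumptions: the function name `sample_episode`, the toy label list and the choice of 15 gallery images per class are hypothetical and are not taken from the benchmarks. It follows the person search terminology, so "query" denotes the labelled examples and "gallery" the examples to be classified.

```python
import random
from collections import defaultdict

def sample_episode(labels, n_way, k_shot, n_gallery, rng):
    """Sample one N-way K-shot episode from a list of class labels.

    Returns (query_idx, gallery_idx): indices of the labelled examples
    ("support" in few-shot literature) and of the examples to classify.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Only classes with enough images can take part in the episode.
    eligible = sorted(c for c, idxs in by_class.items()
                      if len(idxs) >= k_shot + n_gallery)
    classes = rng.sample(eligible, n_way)
    query_idx, gallery_idx = [], []
    for c in classes:
        picked = rng.sample(by_class[c], k_shot + n_gallery)
        query_idx.extend(picked[:k_shot])    # K labelled examples per class
        gallery_idx.extend(picked[k_shot:])  # remaining examples to classify
    return query_idx, gallery_idx

# Toy test set (hypothetical): 10 novel classes with 20 images each.
labels = [c for c in range(10) for _ in range(20)]
query_idx, gallery_idx = sample_episode(
    labels, n_way=5, k_shot=1, n_gallery=15, rng=random.Random(0))
print(len(query_idx), len(gallery_idx))  # 5 75
```

With `n_way=2` and `k_shot=1` the same sampler mimics the binary, 1-shot view of person search, although a real person search episode additionally involves detection within full gallery images.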
3.2 Query-guided Networks
When provided with one or a few query samples, humans focus on its relevant and distinguishing features to find a corresponding gallery image and the object within it. Inspired by this, QGN proposes to process jointly the query and gallery images by a Siamese network design, and to model the query-gallery interactions by query-guided modules.
Few-shot fine-grained classification is accomplished by a Siamese network which processes the query and gallery images together, to produce an embedding for each of them, which are then used to assign the gallery image to one of the novel classes in the episode. The overall QGN model is illustrated in Figure 1. The image embeddings are computed by two convolutional backbones. QGN contributes several Query-guided Siamese Squeeze-and-Excitation Network (QSSE) blocks, which relate the feature extraction at multiple layers of the backbone. Finally, QGN classifies the embeddings with a Query Similarity Network (QSimNet), which learns the final metric similarity score. These components are described in detail in Sec. 3.3. The implementation of each branch in the Siamese network follows Mangla et al. (2020) and is trained with the OIM loss Xiao et al. (2017).
Person search is realized by two parallel Siamese detection networks, which extract the object crops from the query and gallery images, compute an embedding for each and compare the embeddings to assess whether they contain the same or different classes. The proposed QGN model is illustrated in Fig. 2. The image embeddings are extracted with convolutional backbones, leveraging the multi-layer query-gallery interaction provided by the QSSE. Then the object crops are extracted from the gallery by the proposed Query-guided Region Proposal Network (QRPN), i.e. proposals for bounding boxes tailored to the queried object, which are combined with the proposals of a standard RPN Ren et al. (2015). The top proposals are then passed to the subsequent network with a multi-task head for classification (person vs non-person), localization refinement (regression offsets), and ID feature learning. Finally, the ID embeddings of the query and of each gallery proposal are compared by the QSimNet to distinguish same vs different IDs. Details for the QGN components are provided in Sec. 3.3.
The implementation of each detection parallel branch follows details of Xiao et al. (2017), including the OIM loss. Differently from the few-shot fine-grained, person search includes a detection task, so the entire query and gallery images are provided to the network, not just the person crops. Note that we do not need proposals for the query branch, since the query crop is given as input.
3.3 Query-guided Network Components
We propose three components to provide query-guidance at different stages of the Siamese networks. QSSE considers the joint global context of the query and gallery to re-calibrate the channel features of the convolutional backbones. QRPN generates query-like proposals by exploiting the patterns specific to the query crop. QSimNet learns a distance metric to compare the query-gallery features.
In person search (Fig. 2), we adopt all three components. In few-shot fine grained classification (Fig. 1), there is no need to generate candidate proposals and QGN consists only of QSSE and QSimNet. In both cases, all network parts are trained end-to-end.
3.3.1 Query-guided Siamese Squeeze-and-Excitation Network (QSSE)
The query and gallery objects in the images may be taken from different viewpoints and with different lighting conditions. Their embeddings should ideally disentangle these nuances. To this goal, we propose the QSSE module, which leverages the interaction of query and gallery. More specifically, as shown in Fig. 2, the QSSE modules, inserted at the output of each network block (e.g. residual block for ResNet), allow a joint re-calibration of the feature maps.
The QSSE module draws inspiration from SE-Net Hu et al. (2018), extending it to pairs of images (Fig. 3). In more detail, inside a QSSE, first a squeeze operation is performed by global average pooling of query and gallery features. This operation summarizes the spatial information of each of the $C$ channels, giving descriptors $z_q$ and $z_g$ for query and gallery respectively.
After this, an excitation operation is performed where the two descriptors are first concatenated and then passed through a non-linear bottleneck. The first layer of the bottleneck is for dimensionality reduction, shrinking the dimension of the concatenated descriptor by a factor of $r$. This reduced feature ($W_1 [z_q, z_g]$) is then passed through the ReLU operation ($\delta$) modeling non-linear dependencies between channels. Finally, the feature is expanded to $C$ dimensions by the next fully connected layer $W_2$, followed by sigmoid activation ($\sigma$) to generate the weight vector $s$. Mathematically, the Siamese squeeze-and-excitation operation is given by
$$s = \sigma\big(W_2\, \delta(W_1 [z_q, z_g])\big)$$
where the parameters of the first and second fully connected layers are, respectively, $W_1 \in \mathbb{R}^{\frac{2C}{r} \times 2C}$ and $W_2 \in \mathbb{R}^{C \times \frac{2C}{r}}$.
Following Hu et al. (2018), we set the reduction ratio $r$ to 16 in all our experiments. As shown in Fig. 3, the scale operation employs the weight vector $s$ to re-weight the residual outputs $u_q$ (for query) and $u_g$ (for gallery), by channel-wise multiplication. These scaled outputs are then added to the original features $x_q$ and $x_g$ via skip connections, giving outputs $\tilde{x}_q$ and $\tilde{x}_g$ respectively. Mathematically, the above operation is defined as
$$\tilde{x}_q = x_q + F_{scale}(u_q, s), \qquad \tilde{x}_g = x_g + F_{scale}(u_g, s)$$
where $F_{scale}$ denotes the channel-wise scaling operation.
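To make the data flow of the Siamese squeeze-and-excitation concrete, the following pure-Python sketch walks through squeeze, excitation, scale and skip connection on toy feature maps. It is an illustration under assumptions, not the authors' implementation: the names (`qsse_weights`, `scale_and_skip`, `x_q`, `u_q`, ...) are hypothetical, the fully connected layers are bias-free, the squeeze is applied to the residual outputs as in SE-Net, and the weights are random rather than learned.

```python
import math
import random

def global_avg_pool(fmap):
    """Squeeze: average each channel (an H x W grid) into one number."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]

def linear(weights, x):
    """Bias-free fully connected layer; weights has shape (out, in)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def qsse_weights(u_q, u_g, w1, w2):
    """Excitation: sigmoid(W2 . relu(W1 . concat(z_q, z_g)))."""
    z = global_avg_pool(u_q) + global_avg_pool(u_g)   # list concat, length 2C
    hidden = [max(0.0, h) for h in linear(w1, z)]     # ReLU, length 2C / r
    return [1.0 / (1.0 + math.exp(-v)) for v in linear(w2, hidden)]  # length C

def scale_and_skip(x, u, s):
    """Channel-wise re-weighting of u, then the skip connection with x."""
    return [[[xv + sc * uv for xv, uv in zip(x_row, u_row)]
             for x_row, u_row in zip(x_ch, u_ch)]
            for x_ch, u_ch, sc in zip(x, u, s)]

rng = random.Random(0)
C, r, H, W = 4, 2, 3, 3  # toy sizes; the text uses r = 16 on real backbones

def random_map():
    return [[[rng.uniform(-1, 1) for _ in range(W)] for _ in range(H)]
            for _ in range(C)]

x_q, u_q, x_g, u_g = random_map(), random_map(), random_map(), random_map()
w1 = [[rng.uniform(-1, 1) for _ in range(2 * C)] for _ in range(2 * C // r)]
w2 = [[rng.uniform(-1, 1) for _ in range(2 * C // r)] for _ in range(C)]

s = qsse_weights(u_q, u_g, w1, w2)   # one weight vector shared by both branches
out_q = scale_and_skip(x_q, u_q, s)
out_g = scale_and_skip(x_g, u_g, s)
```

The point of the sketch is that a single weight vector, computed from both images, re-calibrates the query and the gallery branch alike, which is what distinguishes this block from applying SE-Net to each image independently.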
3.3.2 Query-guided RPN (QRPN)
QRPN is an attention-based region proposal network that leverages the local query features to generate query-like object proposals. QRPN consists of a channel-wise attention sub-network followed by a standard RPN Ren et al. (2015), as shown in Fig. 4. The attention network uses the cropped query features to re-weight the feature channels of the gallery image. The re-weighted features are then passed to a standard RPN to generate object proposals.
In more detail, the query-crop features are first pooled using a ROI-pool Ren et al. (2015). We then pass the pooled query features to a non-linear bottleneck. The first layer of the bottleneck reduces the pooled features to , where and . Note that is applied to all pixels of all the channels of the pooled map. In this way, our attention mechanism leverages the spatially localized query crop patterns to emphasize particular gallery channels. This also gives the network layer more freedom and lets the optimization dictate what specific local patterns to highlight, instead of just global features. This is in contrast with the squeeze operation of SE-Net Hu et al. (2018). The second fully connected layer then expands the features back to dimensions, followed by a sigmoid () activation to generate weights. Finally, the output weights are used to re-calibrate the gallery features and not the query itself.
We further complement QRPN with the standard RPN in a parallel branch, which takes into account a generic objectness score (cf. Fig. 2). This helps in retrieving further proposals when they are quite different from the query. The objectness score from the RPN and the query-similarity score from QRPN are summed to generate the final score for each anchor, which is used for non-maximum suppression (NMS) at the proposal generation stage. Note that both the RPN included in QRPN and the parallel RPN follow the same design and use the same anchors.
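The score fusion followed by NMS can be sketched as below. This is a minimal illustration with assumed values: the boxes, the per-anchor scores and the IoU threshold of 0.5 are hypothetical, and a plain greedy NMS stands in for whatever NMS variant the actual detector uses.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def fuse_scores(objectness, query_similarity):
    """Final anchor score: RPN objectness plus QRPN query similarity."""
    return [o + q for o, q in zip(objectness, query_similarity)]

def nms(boxes, scores, iou_thresh):
    """Greedy NMS: keep the best box, drop boxes overlapping a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep

# Two clusters of overlapping proposals (hypothetical values).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (21, 21, 31, 31)]
objectness = [0.9, 0.6, 0.3, 0.5]        # from the parallel standard RPN
query_similarity = [0.1, 0.2, 0.9, 0.4]  # from QRPN

scores = fuse_scores(objectness, query_similarity)
kept = nms(boxes, scores, iou_thresh=0.5)
print(kept)  # [2, 0]
```

In this toy case the query-like proposal (index 2) is ranked first even though its objectness alone is the lowest, which is the effect the summed score is meant to have.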
QRPN is trained using the QRPN loss $L_{qrpn}$, which is a binary cross-entropy loss given as
$$L_{qrpn} = -\frac{1}{N} \sum_{i=1}^{N} \log p_i$$
where $p_i$ is the probability of the true class for the $i$-th anchor out of a total of $N$ anchors.
3.3.3 Query-guided Similarity Net (QSimNet)
QSimNet is a deep query-dependent metric that is trained end-to-end with the other network components. Unlike standard offline metrics such as the Euclidean distance Xiao et al. (2017); Chen et al. (2020), QSimNet adapts the similarity measure to each query, to account for the relative importance of attributes such as color and shape.
As shown in Fig. 5, QSimNet works by first calculating the L2 distance between the two features, i.e. element-wise subtraction followed by a square operation. This is followed by batch normalization and a fully connected layer with two outputs. Finally, a softmax is applied to generate similarity/dissimilarity scores.
QSimNet is trained using the Sim loss $L_{sim}$, which is defined as a binary cross-entropy loss similar to the QRPN loss. $L_{sim}$ is given as
$$L_{sim} = -\frac{1}{N} \sum_{i=1}^{N} \log p_i$$
where $N$ defines the number of pairs in the mini-batch and $p_i$ is the probability of the true class for the $i$-th pair.
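The QSimNet forward pass and its loss can be sketched in a few lines. The sketch is hedged in several ways: batch normalization is omitted (treated as identity), the fully connected weights are hand-picked rather than learned, the output order (index 0 = "same", index 1 = "different") is an assumption, and the loss is written as the mean negative log-probability of the true class, which is one common form of binary cross-entropy.

```python
import math

def qsimnet_scores(f_q, f_g, fc_weights, fc_bias):
    """Return [p_same, p_different] for one query-gallery feature pair."""
    # Element-wise subtraction followed by squaring.
    diff_sq = [(a - b) ** 2 for a, b in zip(f_q, f_g)]
    # Batch normalization would go here; omitted in this toy sketch.
    logits = [sum(w * v for w, v in zip(row, diff_sq)) + b
              for row, b in zip(fc_weights, fc_bias)]
    # Numerically stable softmax over the two outputs.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sim_loss(probs, targets):
    """Mean negative log-probability of the true class over all pairs."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)

# Hand-picked parameters (hypothetical): large differences favour "different".
fc_weights = [[-1.0, -1.0], [1.0, 1.0]]
fc_bias = [1.0, -1.0]

p_match = qsimnet_scores([1.0, 0.0], [1.0, 0.0], fc_weights, fc_bias)
p_mismatch = qsimnet_scores([1.0, 0.0], [0.0, 1.0], fc_weights, fc_bias)
loss = sim_loss([p_match, p_mismatch], targets=[0, 1])
print(round(p_match[0], 3), round(p_mismatch[0], 3))  # 0.881 0.119
```

Because the fully connected layer weights each squared difference separately, the learned metric can emphasize some embedding dimensions over others, unlike a fixed Euclidean distance.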
3.4 Training Query-guided Networks
We discuss in detail the optimization of QGN for each of the tasks.
3.4.1 Few-shot fine-grained classification
The QGN network is optimized in an end-to-end fashion, which considers both the classification backbone, as well as the QSSE and QSimNet.
Self-supervision has been shown to improve few-shot learning in various recent works Su et al. (2020); Mangla et al. (2020), as it helps to overcome supervision-collapse Su et al. (2020), a phenomenon where training on the base classes forces the network to discard information that is irrelevant for the discrimination of base classes but crucial for the novel classes. Various pretext tasks have been proposed in the literature for self-supervision. In this work, we opt for rotation prediction Su et al. (2020), mainly because of its simplicity and effectiveness Su et al. (2020); Mangla et al. (2020). In more detail, each image in the batch is rotated by four angles (0°, 90°, 180°, 270°) and a 4-way rotation classifier is added on top. The network is optimized with an additional rotation loss ($L_{rot}$), together with the OIM loss $L_{oim}$ and the Sim loss $L_{sim}$. The overall loss function is therefore:
$$L = L_{oim} + L_{sim} + L_{rot}$$
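The rotation pretext task only needs a way to produce the four rotated copies of each image together with their rotation labels. The sketch below does this for images stored as plain 2-D grids; the helper names and the clockwise direction are assumptions made for illustration, and a real pipeline would operate on image tensors instead.

```python
def rot90(img):
    """Rotate a 2-D grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def rotation_batch(images):
    """Return every image at 0, 90, 180 and 270 degrees with labels 0..3."""
    rotated, labels = [], []
    for img in images:
        current = [list(row) for row in img]
        for k in range(4):
            rotated.append(current)
            labels.append(k)          # target for the 4-way rotation classifier
            current = rot90(current)
    return rotated, labels

image = [[1, 2],
         [3, 4]]
rotated, labels = rotation_batch([image])
print(labels)      # [0, 1, 2, 3]
print(rotated[1])  # [[3, 1], [4, 2]]
```

Each batch therefore grows by a factor of four, and the rotation classifier is trained to predict the label from the rotated image alone.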
Note that we do not follow episode-based training and use the same trained model for both the 1- and 5-shot cases. The inference architecture of the 1-shot case looks similar to the training phase (without the loss functions), as shown in Fig. 1: we simply pass the query and gallery to the network to obtain their similarity score. In the 5-shot case, however, each of the 5 queries is passed to the CNN together with the gallery. This results in 5 different pairs of feature vectors, one per query-gallery pass. We compute the sum of these 5 features, which are then normalised and passed to QSimNet to get the similarity score:
$$\mathrm{score} = \mathrm{QSimNet}\left(\frac{\sum_{i=1}^{5} f_q^i}{\lVert \sum_{i=1}^{5} f_q^i \rVert}, \frac{\sum_{i=1}^{5} f_g^i}{\lVert \sum_{i=1}^{5} f_g^i \rVert}\right)$$
where $f_g^i$ is the gallery feature and $f_q^i$ is the corresponding query (support) feature, $i \in \{1, \dots, 5\}$.
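The 5-shot aggregation step can be illustrated as follows. This sketch assumes that "normalised" means L2 normalization of the summed vector and uses made-up 2-D features; the function names are hypothetical. Since the gallery features depend on the query through the joint processing, the gallery side is aggregated over the 5 passes in the same way as the query side.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def aggregate(features):
    """Sum the per-pass feature vectors, then L2-normalize the sum."""
    summed = [sum(column) for column in zip(*features)]
    return l2_normalize(summed)

# One feature vector per query-gallery pass (hypothetical 2-D features).
query_feats = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
gallery_feats = [[0.5, 0.5], [0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

f_q = aggregate(query_feats)    # sum is [3, 4]
f_g = aggregate(gallery_feats)  # sum is [3, 3]
print(f_q)  # [0.6, 0.8]
```

The two aggregated vectors would then be fed to QSimNet exactly as in the 1-shot case.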
3.4.2 Person search
The QGN end-to-end network training includes the detection network and the identification network, as well as QSSE, QRPN and QSimNet. The overall loss function is:
$$L = L_{cls} + L_{reg} + L_{rpn}^{reg} + L_{rpn}^{obj} + L_{oim} + L_{qrpn} + L_{sim}$$
where $L_{cls}$, $L_{reg}$, $L_{rpn}^{reg}$ and $L_{rpn}^{obj}$ are the standard Faster-RCNN losses Ren et al. (2015) for classification, regression, RPN regression and RPN objectness. The ID feature learning is supervised by the standard OIM loss $L_{oim}$ Xiao et al. (2017), while our new components QRPN and QSimNet are supervised by $L_{qrpn}$ and $L_{sim}$ respectively. The losses are shown in Fig. 2 as dark gray or dark orange boxes.
During inference, it is typical for object detection pipelines to apply NMS at the end using final classification scores. However, we use the final similarity score from QSimNet for such NMS stage during inference. The classification score from ClsNet is only used to remove least confident detections with score less than 0.01.
QRPN Anchor Sampling:
Since a typical gallery image contains only one target person matching the query crop, the positive anchors are far fewer than the negatives. This leads to a skewed positive-to-negative ratio when training with the QRPN loss. Therefore, we augment the target person in the gallery via jittering, i.e. the target box is moved randomly in the nearby region. Additionally, we keep a lower anchor-to-target IoU threshold of 0.6 for positive anchor sampling. To further reduce the number of negatives, we use a batch size of 128 instead of the standard 256, hence improving the positive-to-negative ratio. Note that the negative anchors are sampled from background regions that do not cover other people in the gallery. This is because the non-target people in the gallery are positives for the standard RPN, and treating them as negatives would lead to contrasting objectives for QRPN and RPN.
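The anchor labelling rules above can be sketched as follows. Only the positive IoU threshold of 0.6 comes from the text; the jitter range, the negative IoU threshold of 0.3 (borrowed from the usual Faster R-CNN convention) and the rule that any overlap with a non-target person makes an anchor ignored are assumptions made for this illustration.

```python
import random

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def jitter_box(box, max_shift, rng):
    """Shift a box by a random fraction of its size; the size is preserved."""
    dx = rng.uniform(-max_shift, max_shift) * (box[2] - box[0])
    dy = rng.uniform(-max_shift, max_shift) * (box[3] - box[1])
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)

def label_anchors(anchors, target, other_people, pos_iou=0.6, neg_iou=0.3):
    """1 = positive, 0 = background negative, -1 = ignored during training."""
    labels = []
    for anchor in anchors:
        overlap = iou(anchor, target)
        if overlap >= pos_iou:
            labels.append(1)
        elif overlap >= neg_iou:
            labels.append(-1)   # too close to the target to be a clean negative
        elif any(iou(anchor, person) > 0 for person in other_people):
            labels.append(-1)   # covers a non-target person: not a negative
        else:
            labels.append(0)
    return labels

target = (10, 10, 30, 50)
other_people = [(60, 10, 80, 50)]
anchors = [(10, 10, 30, 50), (12, 10, 32, 50), (60, 10, 80, 50),
           (100, 100, 120, 140), (20, 10, 40, 50)]
labels = label_anchors(anchors, target, other_people)
print(labels)  # [1, 1, -1, 0, -1]

jittered = jitter_box(target, max_shift=0.1, rng=random.Random(0))
```

Jittered copies of the target box provide additional positive regions, while the ignore label keeps non-target people out of the negative set.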
4 Experimental evaluation
We experimentally evaluate QGN on recent datasets for few-shot fine-grained classification and person search. On few-shot fine-grained classification, QGN outperforms the current state of the art by a large margin. On person search, QGN performs competitively with other approaches. In both cases, we provide novel qualitative visualizations of the query guidance.
4.1 Experiments on few-shot fine-grained classification
We evaluate QGN on the widely adopted Caltech-UCSD birds dataset (CUB) Wah et al. (2011) and four other fine-grained datasets from different domains: Stanford Cars Krause et al. (2013), FGVC-Aircraft Maji et al. (2013), Stanford Dogs Khosla et al. (2011), and Oxford Flowers Nilsback and Zisserman (2006). Further to evaluating various backbones, we also provide a visualization of the QSSE.
4.1.1 Benchmarks and Implementation details
The few-shot fine-grained datasets CUB Wah et al. (2011), Stanford Cars Krause et al. (2013), FGVC-Aircraft Maji et al. (2013), Stanford Dogs Khosla et al. (2011) and Oxford Flowers Nilsback and Zisserman (2006) are composed of 100-200 classes and a few thousands of images for each class. For CUB, we follow the split of Chen et al. (2019), as used by most previous approaches. For the other four datasets, we follow the split of Su et al. (2020). In Table 1, we provide details of these datasets.
Evaluation Criteria: Following Mangla et al. (2020), we adopt an episodic few-shot evaluation and report the mean classification accuracy over randomly generated 5-way 1-shot and 5-way 5-shot episodes, with a fixed number of gallery images per class.
Implementation Details: We integrate the QSSE and QSimNet modules Munjal et al. (2019) and the OIM loss Xiao et al. (2017) with the Rotation self-supervision of Mangla et al. (2020). We experiment with three network architectures: ResNet10, ResNet18 and WRN-28-10 (depth 28, widening factor 10). Following Chen et al. (2019); Mangla et al. (2020), the image size is set separately for ResNet10/18 and for WRN. The feature embedding dimension is 512 for ResNet10/18 and 640 for WRN-28-10. In all experiments, the batch size is 8 (8 query-support pairs). The negative-to-positive ratio is 3 to 1, (3 query-support samples from the same class and 1 from different ones). We train for 120 epochs using the Adam optimizer with an initial learning rate of 0.001. During training, we augment the data via random crop, image jittering and random horizontal flip.
Table 2: Comparison with the state of the art on CUB (5-way classification accuracy, %).

| Method | Backbone | 1-shot | 5-shot | Venue |
|---|---|---|---|---|
| MatchingNet Vinyals et al. (2016) | ResNet18 | 73.49 | 84.45 | NIPS16 |
| MAML Finn et al. (2017) | ResNet18 | 68.42 | 83.47 | ICML17 |
| ProtoNet Snell et al. (2017) | ResNet18 | 72.99 | 86.64 | NIPS17 |
| RelationNet Sung et al. (2018) | ResNet18 | 68.58 | 84.05 | CVPR18 |
| Baseline++ Chen et al. (2019) | ResNet18 | 67.02 | 83.58 | ICLR19 |
| S2M2 Mangla et al. (2020) | ResNet18 | 71.81 | 86.22 | WACV20 |
| Proto+Jig Su et al. (2020) | ResNet18 | - | 89.8 | ECCV20 |
| Baseline++ Mangla et al. (2020) | WRN | 70.40 | 82.92 | WACV20 |
| S2M2 Mangla et al. (2020) | WRN | 80.68 | 90.85 | WACV20 |
| TEAM Qiao et al. (2019) | ResNet18 | 80.16 | 87.17 | ICCV19 |
| ICI Wang et al. (2020) | WRN | 91.11 | 92.98 | CVPR20 |
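Table 3: Comparison on five few-shot fine-grained datasets (classification accuracy, %).

| Method | Birds (CUB) | Cars | Aircraft | Dogs | Flowers | Venue |
|---|---|---|---|---|---|---|
| MAML Finn et al. (2017) | 81.2 | 86.9 | 88.8 | 77.3 | 79.0 | ICML17 |
| ProtoNet Snell et al. (2017) | 87.3 | 91.7 | 91.4 | 83.0 | 89.2 | NIPS17 |
| Proto+Jig Su et al. (2020) | 89.8 | 92.4 | 91.8 | 85.7 | 92.2 | ECCV20 |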
4.1.2 Comparison to the state of the art
In Table 2, we compare QGN to state-of-the-art few-shot fine-grained classification methods on the CUB dataset. QGN with the ResNet18 backbone achieves an accuracy of 83.82 and 91.22 for the 1-shot and 5-shot cases respectively, surpassing the previous best technique S2M2 Mangla et al. (2020) by the large margins of 12pp and 5pp. These results also surpass the performance of S2M2 with the larger WRN backbone, by 3.1pp and 0.4pp respectively. Similarly, QGN with the shallower ResNet10 backbone also surpasses S2M2 with the ResNet18 backbone by 9pp and 3.2pp. For completeness, we report in Table 2 all most recent techniques. Methods below the double line either use additional unlabeled data (semi-supervised) or evaluate all queries together (transductive), hence they do not make a fair comparison to our approach. However, these techniques appear complementary to the proposed query guidance and they could be integrated into QGN in future work.
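The margins quoted above follow directly from the accuracies stated in the text, as this short arithmetic check shows. The variable names are hypothetical; the numbers are the CUB accuracies given in this section, as (1-shot, 5-shot) pairs.

```python
# Accuracies (%) on CUB as quoted in the text: (1-shot, 5-shot).
qgn_resnet18 = (83.82, 91.22)
qgn_resnet10 = (80.83, 89.39)
s2m2_resnet18 = (71.81, 86.22)
s2m2_wrn = (80.68, 90.85)

def margin_pp(ours, theirs):
    """Difference in percentage points for the 1-shot and 5-shot cases."""
    return tuple(round(a - b, 2) for a, b in zip(ours, theirs))

vs_s2m2_resnet18 = margin_pp(qgn_resnet18, s2m2_resnet18)
vs_s2m2_wrn = margin_pp(qgn_resnet18, s2m2_wrn)
resnet10_vs_s2m2 = margin_pp(qgn_resnet10, s2m2_resnet18)

print(vs_s2m2_resnet18)  # (12.01, 5.0)
print(vs_s2m2_wrn)       # (3.14, 0.37)
print(resnet10_vs_s2m2)  # (9.02, 3.17)
```

These values round to the 12pp and 5pp, 3.1pp and 0.4pp, and 9pp and 3.2pp margins reported in the paragraph.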
In Table 3, we compare QGN to other approaches on four other few-shot fine-grained datasets in addition to birds (CUB). As shown in the table, for 3 out of 5 datasets i.e birds, aircrafts and dogs, we outperform the previous best results by 1.4pp, 0.2pp and 0.2pp respectively.
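Table 4: Ablation of the QGN components on CUB (classification accuracy, %).

| Method | ResNet10 1-shot | ResNet10 5-shot | ResNet18 1-shot | ResNet18 5-shot | WRN-28-10 1-shot | WRN-28-10 5-shot |
|---|---|---|---|---|---|---|
| Rotation Mangla et al. (2020) | | | 72.40 | 84.83 | | |
| + QSSE + QSimNet (=QGN) | 80.83 | 89.39 | 83.82 | 91.22 | 84.15 | 91.86 |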
4.1.3 Ablation Studies
QGN components. We evaluate the effectiveness of query-guided components applicable to few-shot classification, QSSE and QSimNet, with ablation studies.
CUB. In Table 4, we analyse QGN with the ResNet10, ResNet18 and WRN-28-10 backbones. The reference baseline combines the OIM classifier with an auxiliary rotation prediction task for self-supervision; in the following we refer to this model as the baseline. It coincides with Mangla et al. (2020), which we indicate as Rotation, except that the cosine classifier is replaced with OIM. For ResNet18, the baseline achieves 80.27 and 89.81 for 1- and 5-shot classification, outperforming Rotation, which only achieves 72.40 and 84.83. Since OIM is the leading technique for person search but had not been adopted for few-shot classification, this result motivates the QGN proposition of a unified approach to both tasks.
Next, we add our proposed QSSE on top of this baseline. For ResNet10, the addition of QSSE brings an improvement of almost 1pp for both 1-shot and 5-shot. For ResNet18, it brings an improvement of 0.5pp for the 1-shot and of 1.5pp for the 5-shot case. Then we add QSimNet on top of the baseline. For ResNet10, it improves by almost 2.4pp and 1.2pp for the 1-shot and 5-shot cases respectively. For ResNet18, it improves by almost 2pp and 1pp. QGN for few-shot fine-grained classification is given by combining QSSE and QSimNet. For ResNet10, QGN achieves an accuracy of 80.83 and 89.39 for the 1-shot and 5-shot cases respectively. For ResNet18, QGN achieves an accuracy of 83.82 and 91.22. A similar improvement can be seen for the deeper WRN. Overall, the best performance is achieved in most cases by combining the two components, showing that QSSE and QSimNet are complementary.
QSSE Analysis. In Table 5, we compare the number of parameters and the runtime complexity of the baseline with and without QSSE. The comparison shows that the inclusion of QSSE adds only marginal additional parameters %, however runtime complexity has increased by %. This is due to the Siamese design of the QSSE architecture, which processes a pair of images together.
|Params (M)||Runtime Complexity (TFLOPS)|
Figure 6 panel label: 1 (-shot) query example from each of 5 (-way) classes.
Figure 7 panel labels: +QSSE; Negative Pairs; +QSSE.
Figure 8 panel labels: (i) Non-discriminant corresponding pairs; (ii) Focus on background.
4.1.4 Qualitative Results
In Figure 6, we illustrate some sample results of QGN for the 5-way 1-shot case on the CUB dataset. Given a gallery image in the first column, we show one query example from each of the 5 (-way) classes in the next 5 columns. In the first four rows, some challenging examples are given where QGN correctly classifies (green box) the gallery image. In the last two rows, there are examples where QGN assigns the gallery to wrong classes (red box). Note that the failure cases are also challenging for human observers, as they mainly correspond to matching front to back views of the birds.
Next, in Figure 7, we delve into the QSSE component. Using GradCam Selvaraju et al. (2017), we visualize some class activation maps for the baseline and +QSSE models. With reference to the left panel, reporting positive query (Q) and gallery (G) pairs, note how the +QSSE model focuses on corresponding body regions that are mostly discriminative. For example, in the first row / left panel, +QSSE looks at the discriminant grey wing and yellow beak of the bird in both query and gallery, while the baseline fails to focus on the wings. In the third row / left panel, high activations spread over the query example for the baseline, while for +QSSE high activations appear on a region which looks similar to the gallery. With reference to the right panel, reporting negative pairs, note that the head part of the query (yellow bird) is blue in color, while that of the gallery is black, and that +QSSE focuses only on the discriminant head part.
In Figure 8, we demonstrate some examples where +QSSE could not recognize the correct bird. The failure happens mainly for two reasons: i. when the corresponding pairs attended by QSSE are not discriminative enough; and ii. when +QSSE focuses on background. In general, the proposed +QSSE finds the correct discriminative corresponding parts, better than when not using QSSE.
4.2 Experiments on Person Search
Here we evaluate QGN on the CUHK-SYSU Munjal et al. (2019) and PRW Zheng et al. (2017) datasets; we quantitatively analyse the influence of backbone architectures, input image sizes and ROI-Pool vs. ROI-Align; and we illustrate the effect of QRPN.
4.2.1 Benchmarks and Implementation details
CUHK-SYSU Xiao et al. (2017) is the most used dataset for evaluating person search. It comprises 18,184 images annotated with 96,143 person bounding boxes of 8,432 identities. The training set contains 11,206 images of 5,532 identities. The test set consists of 6,978 images and 2,900 queries.
PRW Zheng et al. (2017) is a dataset acquired by 6 stationary cameras on a university campus. The dataset comprises 11,816 images annotated with 43,110 bounding boxes. The training set includes 5,134 images with 482 identities, while the test set has 6,112 images with 450 identities and 2,057 queries.
Evaluation metrics: Following previous works Xiao et al. (2017), we report the performance using two metrics: Cumulative Matching Characteristic (CMC top-K) and mean Average Precision (mAP). CMC top-K is measured as the probability of retrieving at least one match in the top-K predictions. Average Precision (AP) is measured for each query by calculating the area under the precision-recall curve. mAP is then calculated as the mean of the APs over all queries.
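The two metrics can be illustrated with a minimal sketch. It assumes that each query comes with a ranked list of booleans (True where the retrieved box is a correct match) and that AP is discretised as the mean of the precision values at each correct rank; the function names are hypothetical and the official benchmark scripts may compute the area under the curve differently.

```python
from typing import List, Sequence


def cmc_top_k(ranked_matches: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of queries with at least one correct match in the top-k results."""
    hits = sum(1 for matches in ranked_matches if any(matches[:k]))
    return hits / len(ranked_matches)


def average_precision(matches: Sequence[bool], num_positives: int) -> float:
    """Non-interpolated AP: mean of precision@rank over the correct ranks.

    num_positives is the number of ground-truth matches for the query, so
    missed matches lower the score (assumption: one common discretisation).
    """
    if num_positives == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_positives


def mean_average_precision(
    ranked_matches: Sequence[Sequence[bool]], num_positives: Sequence[int]
) -> float:
    """Mean of the per-query APs."""
    aps: List[float] = [
        average_precision(m, n) for m, n in zip(ranked_matches, num_positives)
    ]
    return sum(aps) / len(aps)


if __name__ == "__main__":
    # Two toy queries: the first has matches at ranks 1 and 3, the second at rank 2.
    rankings = [[True, False, True], [False, True, False]]
    positives = [2, 1]
    print("CMC top-1:", cmc_top_k(rankings, 1))
    print("mAP:", mean_average_precision(rankings, positives))
```

In this toy case only the first query has a correct match at rank 1, while both have one within the top 2.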
Implementation Details: We use OIM Xiao et al. (2017) as baseline and extend it with the three proposed query-guided components. The images are re-scaled such that their shorter side is 600 pixels, unless mentioned explicitly. All models are trained using SGD for 4 epochs, starting from a pre-trained OIM model. The learning rate is set to 0.001, then dropped by a factor of 10 after 2 epochs. For CUHK-SYSU, we consider as query-gallery pairs all combinations for each ID. The training set is further augmented by flipping both query and gallery images. For PRW, we sample only three gallery images for each possible query image of an ID, since the number of boxes per ID is already very large.
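The pair construction and the learning-rate schedule described above can be sketched as follows. This is a simplified illustration, not the training code: it assumes that "all combinations" means every ordered (query, gallery) pair of distinct images of the same ID, that epochs are counted from 0, and the names `build_pairs` and `learning_rate` are hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple


def build_pairs(
    images_by_id: Dict[str, List[str]],
    max_galleries: Optional[int] = None,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Build (query, gallery) training pairs per identity.

    max_galleries=None keeps every combination (CUHK-SYSU setting);
    max_galleries=3 samples at most three galleries per query (PRW setting).
    """
    rng = random.Random(seed)
    pairs: List[Tuple[str, str]] = []
    for _, images in sorted(images_by_id.items()):
        for query in images:
            galleries = [g for g in images if g != query]
            if max_galleries is not None and len(galleries) > max_galleries:
                galleries = rng.sample(galleries, max_galleries)
            pairs.extend((query, g) for g in galleries)
    return pairs


def learning_rate(epoch: int, base_lr: float = 0.001,
                  drop_after: int = 2, factor: float = 10.0) -> float:
    """Step schedule: base_lr for the first drop_after epochs, then base_lr / factor."""
    return base_lr if epoch < drop_after else base_lr / factor


if __name__ == "__main__":
    toy = {"id1": ["a", "b", "c"], "id2": ["d", "e", "f", "g", "h"]}
    print(len(build_pairs(toy)))                   # all combinations
    print(len(build_pairs(toy, max_galleries=3)))  # capped sampling
    print([learning_rate(e) for e in range(4)])    # 4 training epochs
```

Horizontal flipping of both images would double the number of pairs; it is left out here to keep the sketch short.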
4.2.2 Comparison to the state of the art
In Table 6, we compare QGN to the state of the art. In the top section we report joint end-to-end methods; in the bottom section we list cascaded approaches. In each section the approaches are chronologically ordered.
As visible from the table regarding the CUHK-SYSU dataset, QGN achieves an accuracy of 91.5 mAP and 92.1 top-1, surpassing APNet Zhong et al. (2020) by 2.6pp mAP and 2.8pp top-1, and BINet Dong et al. (2020) by 1.5pp mAP and 1.4pp top-1. Following recent approaches Yan et al. (2021); Han et al. (2021), we further report the performance of QGN leveraging the better FPN Lin et al. (2017) backbone. As shown in the last row of the table, FPN+QGN achieves an accuracy of 93.7 mAP and 94.4 top-1, surpassing the most recent joint approaches, including DMRNet Han et al. (2021) by 0.5pp mAP and 0.2pp top-1, and DKD Zhang et al. (2021) by 0.6pp mAP and 0.2pp top-1. Note that FPN+QGN is also competitive with AlignPS Yan et al. (2021), only 0.3pp away in terms of mAP.
On PRW, QGN achieves an accuracy of 42.9 mAP and 81.9 top-1, surpassing APNet by 1pp mAP and 0.5pp top-1, and NAE by 0.8pp top-1. Adopting the better FPN backbone further improves the performance. In particular, FPN+QGN achieves an accuracy of 46.7 mAP and 82.9 top-1, surpassing NAE by 2.7pp mAP and 1.8pp top-1, PGA by 2.5pp mAP, and AlignPS by 0.6pp mAP and 0.8pp top-1. Also note that FPN+QGN is competitive with DMRNet.
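The margins quoted above are plain differences in percentage points between reported accuracies. As a small worked check, the sketch below recomputes a few of them from the values stated in the text and in Table 6; the dictionary layout and the helper `margin_pp` are illustrative assumptions, not part of any released code.

```python
from typing import Dict, Tuple

# (CUHK-SYSU mAP, CUHK-SYSU top-1, PRW mAP, PRW top-1), as reported in the text / Table 6.
Scores = Tuple[float, float, float, float]

REPORTED: Dict[str, Scores] = {
    "APNet": (88.9, 89.3, 41.9, 81.4),
    "BINet": (90.0, 90.7, 45.3, 81.7),
    "NAE": (92.1, 92.9, 44.0, 81.1),
    "FPN+AlignPS": (94.0, 94.5, 46.1, 82.1),
    "FPN+DMRNet": (93.2, 94.2, 46.9, 83.3),
    "QGN": (91.5, 92.1, 42.9, 81.9),
    "FPN+QGN": (93.7, 94.4, 46.7, 82.9),
}


def margin_pp(ours: str, theirs: str) -> Scores:
    """Difference ours - theirs in percentage points, rounded to one decimal."""
    a, b = REPORTED[ours], REPORTED[theirs]
    return tuple(round(x - y, 1) for x, y in zip(a, b))  # type: ignore[return-value]


if __name__ == "__main__":
    for ours, theirs in [("QGN", "APNet"), ("QGN", "BINet"),
                         ("FPN+QGN", "FPN+DMRNet"), ("FPN+QGN", "FPN+AlignPS")]:
        print(f"{ours} vs {theirs}: {margin_pp(ours, theirs)}")
```

Negative entries indicate metrics on which the compared method is ahead, e.g. AlignPS on CUHK-SYSU mAP.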
|Method||CUHK-SYSU mAP||CUHK-SYSU top-1||PRW mAP||PRW top-1||Venue|
|Joint|
|OIM Xiao et al. (2017)||75.5||78.7||21.3||49.9||CVPR17|
|Context Yan et al. (2019)||84.1||86.5||33.4||73.6||CVPR19|
|APNet Zhong et al. (2020)||88.9||89.3||41.9||81.4||CVPR20|
|BINet Dong et al. (2020)||90.0||90.7||45.3||81.7||CVPR20|
|NAE Chen et al. (2020)||92.1||92.9||44.0||81.1||CVPR20|
|PGA Kim et al. (2021)||92.3||94.7||44.2||85.2||CVPR21|
|FPN + AlignPS Yan et al. (2021)||94.0||94.5||46.1||82.1||CVPR21|
|FPN + DMRNet Han et al. (2021)||93.2||94.2||46.9||83.3||AAAI21|
|DKD Zhang et al. (2021)||93.1||94.2||50.5||87.1||AAAI21|
|FPN + QGN||93.7||94.4||46.7||82.9||Proposed|
|Cascaded|
|FPN+RDLR Han et al. (2019)||93.0||94.2||42.9||70.2||ICCV19|
|IGPN Dong et al. (2020)||90.3||91.4||47.2||87.0||CVPR20|
|TCTS Wang et al. (2020)||93.9||95.1||46.9||87.5||CVPR20|
|Method||ResNet50 mAP||ResNet50 top-1||ResNet18 mAP||ResNet18 top-1|
|OIM Xiao et al. (2017)||75.5||78.7||-||-|
|+ QSSE + QRPN||82.4||82.8||74.7||74.4|
|+ QSSE + QSimNet||83.3||83.4||76.1||75.9|
|+ QRPN + QSimNet||83.1||83.3||75.9||75.5|
|+ QSSE + QRPN + QSimNet (= QGN)||84.4||84.4||78.4||77.7|
4.2.3 Ablation Studies
First we evaluate the impact of QGN components, then the effect of model hyper-parameters on both OIM and QGN.
QGN components. In Table 7, we quantify the improvements of the QGN components when integrated into OIM Xiao et al. (2017), considering two network architectures (ResNet50, ResNet18) and gallery size 100. We re-implement OIM, named Baseline in the table, yielding slightly better performance than Xiao et al. (2017). As illustrated, each QGN component improves over OIM. Also, improvements are consistent for each component across different backbone architectures. Taking the representative case of ResNet50, the baseline OIM (77.2 mAP) is improved by 2.9pp with QSSE (80.1 mAP), it is improved by 2.4pp with QRPN (79.6 mAP), and by 5.4pp with QSimNet (82.6 mAP), which is the strongest single component.
QGN components are also complementary. In Table 7, considering ResNet50, QSSE+QRPN gives 82.4 mAP, QSSE+QSimNet gives 83.3 mAP, QRPN+QSimNet gives 83.1 mAP, and the full QGN set (QSSE+QRPN+QSimNet) reaches 84.4 mAP. This amounts to an overall improvement of 7.2pp with respect to the baseline OIM (77.2 mAP).
Reduction Ratio of QRPN. For QRPN we choose reduction ratio to be 16 as in Hu et al. (2018). Our experiments (cf. Table 8) also confirm this to be a reasonable choice as it maintains a good balance between mAP and parameter size.
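To give a sense of how the reduction ratio trades off against parameter size, the sketch below counts the parameters of a squeeze-and-excitation style bottleneck as in Hu et al. (2018), i.e. two fully connected layers mapping C channels to C/r and back. The channel width of 1024 used in the example is an assumption for illustration only, and `se_bottleneck_params` is a hypothetical helper, not part of QRPN's implementation.

```python
def se_bottleneck_params(channels: int, reduction: int = 16, bias: bool = True) -> int:
    """Parameters of an SE-style bottleneck: FC(C -> C/r) followed by FC(C/r -> C)."""
    if reduction <= 0 or channels % reduction != 0:
        raise ValueError("reduction must be positive and divide channels")
    hidden = channels // reduction
    weights = channels * hidden + hidden * channels
    biases = (hidden + channels) if bias else 0
    return weights + biases


if __name__ == "__main__":
    channels = 1024  # assumed channel width, for illustration only
    for r in (1, 4, 16, 64):
        print(f"r={r:>2}: {se_bottleneck_params(channels, r):,} parameters")
```

The weight count scales as 2C²/r, so raising r from 1 to 16 shrinks the bottleneck by roughly a factor of 16; the effect of r on mAP has to be measured empirically.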
Hyper-parameters of OIM and QGN. In Table 9, we evaluate different design choices for OIM and QGN using the ResNet50 backbone.
CUHK-SYSU: As shown in the first few rows, the OIM baseline (77.2 mAP) improves by 3.6pp (80.8 mAP) when adopting the larger ROI pooling size (vs. the standard ). It further improves slightly by 0.4pp (81.2 mAP) when switching to the more complex pooling method, ROI-Align. It improves by 2.7pp (83.9 mAP) when considering larger input images (smaller side re-scaled to 900 vs. 600). Also, a larger batch size gives an additional improvement, taking the accuracy to 86.1 mAP (row 5). Following NAE (https://github.com/DeanChan/NAE4PS), OIM may be further improved by concatenating globally pooled 1024-d features after ROI-Align with the 2048-d feature from ClsIdenNet, bringing the OIM accuracy to 88.6 mAP. We treat this particular OIM as our baseline. Adding the QGN components on top of this baseline gives our proposed model QGN, with a performance of 91.5 mAP.
PRW: Similarly, on the PRW dataset, the largest improvements are due to increasing the pool size (32.8 vs. 29.2 mAP), image size (36.9 vs. 33.6 mAP), batch size (38.7 vs. 36.9 mAP) and using finer features with gCat (40.4 vs. 38.7 mAP). As shown in the last row, our proposed QGN gives an accuracy of 42.0 mAP.
Discussion on Runtime. Our method jointly processes each query-gallery pair. This means that, for a test set of M queries and N galleries, an exhaustive search over all M×N combinations is required, which makes it computationally expensive. However, note that in practical person search scenarios M is usually a small number (typically 1, i.e., only one query person is being searched).
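The cost argument can be made concrete with a small count of network evaluations. The sketch below contrasts joint query-gallery processing (one pass per pair) with independent feature extraction (one pass per image, as in the prior approaches that extract separate features); it ignores the cost of each pass and of the final matching, and the function names are hypothetical.

```python
def joint_passes(num_queries: int, num_galleries: int) -> int:
    """Query-guided processing: every query-gallery pair is evaluated jointly."""
    return num_queries * num_galleries


def independent_passes(num_queries: int, num_galleries: int) -> int:
    """Separate feature extraction: each image is processed once."""
    return num_queries + num_galleries


if __name__ == "__main__":
    # CUHK-SYSU test queries with a gallery of 100 images per query.
    print(joint_passes(2900, 100))
    # Practical search scenario: a single query person against the same gallery.
    print(joint_passes(1, 100), independent_passes(1, 100))
```

With a single query, the joint approach needs N passes against N + 1 for independent extraction, so the overhead comes from the heavier paired forward pass rather than from the number of passes.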
Fig. 9 panel labels: (a) Query; (b) RPN; (c) QRPN.
Fig. 11 panel labels: (a) Query; (b) OIM‡ Top-1; (c) QGN Top-1.
Fig. 12 panel labels: Query; QGN Top-1.
4.2.4 Qualitative results
First we compare the standard RPN Ren et al. (2015) vs. the proposed QRPN, then we compare OIM and QGN results.
RPN vs. QRPN Proposals. Fig. 9 illustrates region proposals by the RPN vs. the proposed QRPN. Given a query-gallery image pair, in column (a) we show the query images with the person bounding boxes (in yellow). In columns (b) and (c) we illustrate the top 10 region proposals in the gallery by RPN and QRPN, respectively. Note that the proposals by the RPN cover any person in the image, as it is trained for generic person detection. By contrast, the QRPN proposals in column (c) are query-guided and focus on those people who most resemble the queried person. Specific examples are the second row / left panel and the third row / right panel, where QRPN specifically proposes people wearing clothes of the same color, and the last row / right panel, where RPN fails due to contrast challenges while QRPN leverages the query person pattern and successfully estimates regions over it.
We support the qualitative result with Fig. 10, i.e. a plot of the number of query-specific proposals (y-axis) among the top-N proposals (x-axis). A query-specific proposal is one that has IoU with the target, one which serves to detect the queried person. Note how QRPN and QRPN+RPN consistently provide more query-specific proposals than the standard RPN. Additionally, training with both QRPN and RPN sub-networks results in better performances.
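The quantity plotted in Fig. 10 can be computed as sketched below. The IoU threshold that defines a query-specific proposal is not legible in the text above, so the default of 0.5 used here is an assumption, as are the box format (x1, y1, x2, y2) and the function names.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def query_specific_count(
    proposals: Sequence[Box], target: Box, top_n: int, iou_threshold: float = 0.5
) -> int:
    """Number of the top-N proposals (sorted by score) that overlap the target person.

    iou_threshold=0.5 is an assumed value, not taken from the paper.
    """
    return sum(1 for p in proposals[:top_n] if iou(p, target) >= iou_threshold)


if __name__ == "__main__":
    target = (0.0, 0.0, 10.0, 10.0)
    ranked = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 5.0),
              (20.0, 20.0, 30.0, 30.0), (1.0, 1.0, 10.0, 10.0)]
    for n in (1, 2, 4):
        print(n, query_specific_count(ranked, target, n))
```

Averaging this count over all query-gallery pairs for increasing N gives one curve per proposal network.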
OIM Vs. QGN. Fig. 11 illustrates some challenging queries (column (a)) and gallery images, where these are searched for, either with OIM (column (b)) or QGN (column (c)). Top-1 search results are reported. Note how QGN retrieves a query person from a crowd (first row / left panel), distinguishes a query person from similarly dressed ones (second row / right panel), and also re-identifies the query in low contrast and illumination conditions (third row / right panel).
In Fig. 12, we illustrate typical failure cases of QGN. In (a), QGN successfully retrieves the correct person, but the bounding box is poorly aligned (IoU ). (b) is an interesting case of missing annotation for the target person, i.e. QGN detects the reflection of the girl in the mirror, which is considered false positive. (c) is challenging due to the similar appearance and low visibility of the people.
5 Conclusion and Future Work
This work has addressed, for the first time, few-shot fine-grained classification and person search with a unified Query-Guided Network (QGN). Uniting best practices from the two tasks has allowed QGN to set a new state of the art in few-shot fine-grained classification and to be on par with it for person search. A second contribution has been to propose query guidance via three components, which may be plugged in at various stages of classification and detection models. Query guidance is novel for few-shot fine-grained classification, and it has been shown to be effective both quantitatively and qualitatively. In person search, query guidance was first introduced in our earlier work Munjal et al. (2019) and has since been adopted by various state-of-the-art techniques; here we confirm its effectiveness. A drawback of our approach is its computational complexity, which is due to the interaction of a pair of images at all levels of the network, notably in the Siamese QSSE network. In future work, following the spirit of a unified query-guided framework, we plan to research few-shot fine-grained detection, for which the query-guided proposal network module of QGN may also be relevant.
This work is partially supported by Sapienza (Bandi d’Ateneo) and by the project of the Italian Ministry of Education, Universities and Research (MIUR) “Dipartimenti di Eccellenza 2018-2022”.
- Norm-aware embedding for efficient person search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12615–12624.
- A closer look at few-shot classification. In International Conference on Learning Representations (ICLR). URL https://openreview.net/forum?id=HkxLXnAcFQ
- Bi-directional interaction network for person search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2839–2848.
- Instance guided proposal network for person search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2585–2594.
- Dynamic conditional networks for few-shot learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–35.
- Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 1126–1135.
- Re-id driven localization refinement for person search. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9814–9823.
- Decoupled and memory-reinforced networks: towards effective feature learning for one-step person search. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 1505–1512.
- Revisiting pose-normalization for fine-grained few-shot recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14352–14361.
- Cross attention network for few-shot classification. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 4005–4016.
- Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7132–7141.
- Novel dataset for fine-grained image categorization. In Workshop on Fine-Grained Visual Categorization (FGVC), IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Prototype-guided saliency feature learning for person search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4865–4874.
- 3D object representations for fine-grained categorization. In IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 554–561.
- Hierarchical distillation learning for scalable person search. Pattern Recognition 114, pp. 107862.
- Learning multi-level weight-centric features for few-shot learning. Pattern Recognition 128, pp. 108662.
- Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2117–2125.
- Making person search enjoy the merits of person re-identification. Pattern Recognition 127, pp. 108654.
- Fine-grained visual classification of aircraft. Technical report.
- Charting the right manifold: manifold mixup for few-shot learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 2218–2227.
- Query-guided end-to-end person search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 811–820.
- Knowledge distillation for end-to-end person search. In Proceedings of the British Machine Vision Conference (BMVC), pp. 31.1–31.16.
- A visual vocabulary for flower classification. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1447–1454.
- Transductive episodic-wise adaptive metric for few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3603–3612.
- Faster r-cnn: towards real-time object detection with region proposal networks. In Proceedings of the 29th International Conference on Neural Information Processing Systems, pp. 91–99.
- Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 618–626.
- Prototypical networks for few-shot learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4077–4087.
- When does self-supervision improve few-shot learning?. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 645–666.
- Circle loss: A unified perspective of pair similarity optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6398–6407.
- Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1199–1208.
- Learning attention-guided pyramidal features for few-shot fine-grained recognition. Pattern Recognition 130, pp. 108792.
- Matching networks for one shot learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 3637–3645.
- The caltech-ucsd birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.
- TCTS: a task-consistent two-stage framework for person search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11952–11961.
- Instance credibility inference for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12836–12845.
- Joint detection and identification feature learning for person search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3415–3424.
- Anchor-free person search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7690–7699.
- Learning context graph for person search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2158–2167.
- Diverse knowledge distillation for end-to-end person search. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 3412–3420.
- Self-guided information for few-shot classification. Pattern Recognition 131, pp. 108880.
- Person re-identification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1367–1376.
- Robust partial matching for person search in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6826–6834.
- Multi-attention meta learning for few-shot fine-grained image recognition. In Proceedings of the International Joint Conference On Artificial Intelligence (IJCAI), pp. 1090–1096.