Visual Recognition by Request

by Chufeng Tang, et al.

In this paper, we present a novel protocol of annotation and evaluation for visual recognition. Different from traditional settings, the protocol does not require the labeler/algorithm to annotate/recognize all targets (objects, parts, etc.) at once, but instead raises a number of recognition instructions and the algorithm recognizes targets by request. This mechanism brings two beneficial properties that reduce the burden of annotation, namely, (i) variable granularity: different scenarios can have different levels of annotation; in particular, object parts can be labeled only in large and clear instances; (ii) being open-domain: new concepts can be added to the database at minimal cost. To deal with the proposed setting, we maintain a knowledge base and design a query-based visual recognition framework that constructs queries on-the-fly based on the requests. We evaluate the recognition system on two mixed-annotated datasets, CPP and ADE20K, and demonstrate its promising ability to learn from partially labeled data as well as to adapt to new concepts with only text labels.





1 Introduction

Recognition is a fundamental topic in computer vision. While deep neural networks LeCun et al. (2015) have largely advanced the recognition accuracy of image classification, object detection, instance/semantic segmentation, etc., existing benchmarks for visual recognition mostly follow a conventional evaluation protocol: a set of classes (dictionary) is pre-defined, and each image is labeled with semantic regions and/or instances according to the fixed dictionary. Such a protocol suffers from a series of limitations. In particular, we focus on its unsatisfying consistency and scalability in annotation and evaluation.

  • Consistency. Images may reflect a complex scene that contains various objects as well as complicated and/or hierarchical semantics. Inevitably, the granularity of annotation can vary across scenarios (e.g., a large object can be labeled with parts while a small object barely can be). This inconsistency (i.e., the ground-truth labels are not always accurate) may confuse the algorithm in both the training and inference stages.

  • Scalability. Once a class is added to the dictionary, every single image in the dataset needs to be checked to confirm the existence of the new class. This is a costly burden for large-scale datasets (e.g., COCO Lin et al. (2014), OpenImages Kuznetsova et al. (2020), etc.) or pixel-level annotations (e.g., COCO, Cityscapes Cordts et al. (2016), ADE20K Zhou et al. (2019), etc.). In addition, it is difficult to extend this mechanism to the scenarios of open-domain recognition, i.e., new concepts are added on-the-fly with merely a few or even no training samples (e.g., by texts only).

Figure 1: Left: A comparison between visual recognition all at once (conventional) and visual recognition by request (proposed). The example is a complex street-scene image, where the conventional method (the blue arrows) cannot annotate dense and small objects (the yellow box) or small parts (the red box), resulting in inevitable inaccuracy in evaluation. In comparison, the proposed method only recognizes what is asked, hence the evaluation is less biased. This figure is best viewed in color. Right: A small part of the knowledge base that corresponds to the left example.

Motivated by the above, in this paper, we present a new protocol for annotation and evaluation. The protocol, named visual recognition by request, differs from the conventional one, which we name visual recognition all at once. The idea is shown in Figure 1. Initially, the whole image is an instance of the scene class. Then, the annotation process involves a few requests that gradually add new semantic regions, instances, and new properties to the image. We consider two types of requests.

  1. Whole-to-part semantic segmentation. Given an instance and a dictionary of its parts (i.e., the class names), find the semantic segmentation map. The first request on each image always belongs to this task (see Section 3 for detailed descriptions).

  2. Instance segmentation. Given a semantic region and one or a few pixels (named probes), find the instance(s) that occupy the probe(s). In the proposed setting, this is the only way to create instances. Not all instances need to be annotated/recognized.

Note that each instance can be further partitioned into parts, so the above requests can be recursively called to support unlimited granularity. In addition, since the dictionary of semantic segmentation is not fixed, the system naturally supports open-domain recognition by simply adding unseen (but related) class names at the inference stage.

To perform the above tasks, we propose a query-based recognition framework which takes four inputs: the input image, the current segmentation map (which is empty initially), the knowledge base (i.e., a hierarchical dictionary of parts and sub-classes), and the request. The recognition approach is straightforward. For each request, the image is fed into a vision backbone for visual features, the knowledge base and request are fed into a text/positional encoder for query vectors, and the segmentation map serves as constraints. Hence, the output of each request is produced by the interaction between the query and visual features, filtered by the constraints. The two request types are dealt with by different queries and interaction modules. Please see Section 4 for technical details.

We establish our benchmark on the Cityscapes-Panoptic-Parts (CPP) dataset de Geus et al. (2021) and the ADE20K dataset Zhou et al. (2019). The knowledge base for CPP contains the object classes and non-duplicate part classes, while that for ADE20K has the object and part classes which appear frequently. Based on it, we parse the annotations into a set of queries to be used in both the training and inference stages. The recognition accuracy is computed by a hierarchical version of PQ Kirillov et al. (2019) (HPQ), which we explain in Section 5.1. Qualitative and quantitative results demonstrate some intriguing properties of visual recognition by request; in particular, our approach works well in scenarios where (i) the object parts are sparsely annotated – see Sections 5.2 and 5.3, and (ii) new concepts (relative, compositional, etc.) are added by texts only – see Section 5.4.

The main contribution of this paper involves (i) a new annotation and evaluation protocol, (ii) a unified recognition approach, and (iii) a new usage of the ADE20K dataset that integrates objects and parts – thanks to the proposed setting, the sparse part annotations can be easily used. We expect our proposal to pave a new path for alleviating the burden of non-scalable and inconsistent annotations for visual recognition. The open problems are discussed in Section 6.

2 Related Work

Benchmarks of visual recognition. High-quality benchmarks are one of the fundamentals of visual recognition. Although the datasets for image classification Deng et al. (2009) and object detection Lin et al. (2014); Kuznetsova et al. (2020); Shao et al. (2019) often have a larger number of images, we consider semantic/instance segmentation in this paper towards accurate, pixel-level recognition results. Popular segmentation datasets include PASCAL VOC Everingham et al. (2010), Cityscapes Cordts et al. (2016), MS-COCO Lin et al. (2014), ADE20K Zhou et al. (2019), etc. While MS-COCO contains more images than ADE20K, ADE20K has a larger number of semantic classes than MS-COCO. Due to the diversity of images and/or objects, it is almost impossible to keep a consistent annotation standard throughout an entire dataset, e.g., for an instance that occupies a large area, the labeler can easily depict the border and even decompose it into parts, while this is difficult for small objects. Therefore, inevitably, there is a tradeoff He et al. (2021) between the granularity (e.g., object parts) and accuracy (e.g., annotating parts of every object – even if it only occupies tens of pixels) of annotations. This paper aims to alleviate the conflict by defining a new protocol.

Open-domain visual recognition based on vision-language interaction. It is widely accepted that visual recognition is an open-set scenario Scheirer et al. (2012); Bendale and Boult (2016), where new semantic classes should be allowed to be added and, ideally, the cost of updating data and/or models is as low as possible. To enable this, the core topic is how to define new classes, where using natural language is a promising direction. Pre-trained on large-scale image-text datasets Changpinyo et al. (2021), language-driven or text-based visual recognition has been extended from image classification Radford et al. (2021); Jia et al. (2021) to object detection Chen et al. (2021b); Zareian et al. (2021); Gu et al. (2021); Li et al. (2022b) and semantic segmentation Ghiasi et al. (2021); Xu et al. (2021); Li et al. (2022a); Xu et al. (2022). The task is also related to other cross-modal recognition topics, including image captioning Vinyals et al. (2015); Hossain et al. (2019), visual question answering Antol et al. (2015); Wu et al. (2017); Das et al. (2018); Gordon et al. (2018), referring expression localization Hu et al. (2016); Mao et al. (2016); Zhu et al. (2016); Liu et al. (2017), visual reasoning Zellers et al. (2019), etc.

Query-based visual recognition. Our protocol requires the computational model to flexibly deal with different types of requests. This leads to a recently popular computational framework which we name query-based visual recognition. The design principle is to create a query from the request on-the-fly and use the query to interact with the extracted visual features. Some open-domain recognition algorithms discussed above Xu et al. (2021); Ghiasi et al. (2021); Gu et al. (2021); Li et al. (2022a) were based on this framework, in which queries were generated by texts. The idea is also related to the DETR series for object detection Carion et al. (2020); Zhu et al. (2021); Chen et al. (2021a); Meng et al. (2021); Sun et al. (2021) and its extension to semantic/instance/panoptic segmentation Dong et al. (2021); Cheng et al. (2021b, a); Fang et al. (2021); Strudel et al. (2021); Zhang et al. (2021).

3 Problem Formulation

Let an image be $\mathbf{X}$, where $\mathbf{x}_{w,h}$ indicates the intensity value/vector at the position of $(w,h)$, and $W$ and $H$ are the image width and height, respectively. Conventional methods often annotate a pixel-wise segmentation map for $\mathbf{X}$, where each entry represents the semantic and/or instance label of the corresponding pixel. Our proposal instead organizes the recognition task into a set of requests, $\mathcal{R} = \{(Q_m, A_m)\}_{m=1}^{M}$, where $Q_m$ and $A_m$ denote the $m$-th query and answer, respectively. Meanwhile, we maintain a tree, $\mathcal{T}$, for the recognition results, where each node is a semantic region or an instance. If $u$ is the parent node of $v$, then $v$ is recognized by answering a request on $u$. In our definition, an instance must be a child node of a semantic region. For the $n$-th node $v_n$, $\mathbf{M}_n$ is the binary mask, $c_n$ is the class index in the knowledge base (see the next paragraph), $o_n$ indicates whether the node corresponds to an instance, and $\mathcal{C}_n$ is the set of child nodes of $v_n$. Note that $\mathcal{R}$ and $\mathcal{T}$ are tightly related: each non-leaf node in $\mathcal{T}$ corresponds to a request in $\mathcal{R}$, say $(Q_m, A_m)$, and the answer $A_m$ corresponds to the set of all child nodes of that node.

The annotation is built upon a pre-defined knowledge base. It is formulated as a directed graph $\mathcal{G}$. For the $k$-th node, $t_k$ is the semantic class label (in text) and $\mathcal{P}_k$ is the set of graph nodes that correspond to its named parts. In our formulation, the class labels appear as texts rather than fixed integer IDs – this is to ease the generalization towards open-domain recognition: given a pre-trained text embedding (e.g., CLIP Radford et al. (2021)), some classes are recognizable by language even if they never appear in the training set. The root class is scene, which defines the entire picture.
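For illustration, such a knowledge base can be held as a plain mapping from class names (texts) to the names of their parts. Below is a minimal sketch in Python; the vocabulary is hypothetical (loosely following Figure 1), and the `sub_classes` helper is our own, not the paper's API:

```python
# A minimal sketch of the knowledge base: a directed graph stored as a
# mapping from a class name (text) to the names of its parts.
# Class/part names are illustrative, loosely following the CPP vocabulary.
KNOWLEDGE_BASE = {
    "scene": ["sky", "road", "person", "car"],   # root: top-level classes
    "person": ["head", "torso", "arm", "leg"],   # parts of person
    "car": ["window", "wheel", "light", "chassis"],
    # leaf classes simply have no named parts
    "sky": [], "road": [],
    "head": [], "torso": [], "arm": [], "leg": [],
    "window": [], "wheel": [], "light": [], "chassis": [],
}

def sub_classes(name: str) -> list:
    """Look up the parts (sub-classes) of a class; texts, not fixed IDs."""
    return KNOWLEDGE_BASE.get(name, [])

# Open-domain extension costs one line: add a new concept by text only.
KNOWLEDGE_BASE["truck"] = list(KNOWLEDGE_BASE["car"])  # shares car-like parts
```

Because classes are keyed by text, adding a new concept does not require re-indexing any existing annotation.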

Figure 2: An example showing how an image with mixed annotations is parsed into requests. Note that the parsing is built upon a knowledge base (see Figure 1), indicating the set of candidate classes when each Type-I request is called. We highlight the newly annotated contents for better visualization.

Based on the above definitions, a typical procedure of annotation is described as follows and illustrated in Figure 2. Initially, nothing is annotated: the tree contains only a root node whose mask covers the entire image, whose class is scene (i.e., this is an instance of scene), and which has no child nodes. Each request finds the corresponding node in the tree, performs the recognition task, translates the answer into child nodes of that node, and adds them back to the tree. There are two types of requests, defined below.

  1. Whole-to-part semantic segmentation. Prerequisite: the queried node must be an instance. The system fetches its class label and looks up the knowledge base for the set of its sub-classes in texts (e.g., in the CPP dataset de Geus et al. (2021), person has sub-classes of head, torso, arm, and leg). The task is to segment the node's mask into sub-masks – each pixel can belong to one of the specified sub-classes or not (i.e., a reject option is available for each pixel). For each image, the first request always belongs to this type, where the 'parts' contain all possible semantic classes in a scene (sky, road, person, etc.).

  2. Instance segmentation. Given a semantic region (the queried node must not be an instance), a probing pixel (or probe) is provided with the request, and the task is to segment the instance that occupies the probe. In the proposed setting, this is the only way to create instances, and not all instances need to be annotated/recognized. Note that, although an additional pixel is needed, this setting is easily converted into and fairly compared to conventional instance segmentation – we will show the results in Section 5.2.
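The two request types can be mirrored in a toy data structure. In the sketch below, all field and function names are our own invention, and masks are simplified to sets of pixel coordinates; a real system would of course use dense binary masks:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A node of the recognition tree: a semantic region or an instance."""
    class_name: str            # class label as text (knowledge-base key)
    is_instance: bool          # True for instances, False for semantic regions
    mask: set = field(default_factory=set)   # pixel coordinates (toy stand-in)
    children: list = field(default_factory=list)

def answer_type1(node: Node, part_masks: dict) -> None:
    """Whole-to-part semantic segmentation: split an instance into parts."""
    assert node.is_instance, "Type-I requests apply to instances only"
    for part_name, mask in part_masks.items():
        node.children.append(Node(part_name, is_instance=False, mask=mask))

def answer_type2(node: Node, probe, instance_mask: set) -> Node:
    """Instance segmentation: create the instance occupying a probing pixel."""
    assert not node.is_instance, "Type-II requests apply to semantic regions"
    assert probe in node.mask, "the probe must lie inside the region"
    inst = Node(node.class_name, is_instance=True, mask=instance_mask)
    node.children.append(inst)
    return inst

# Initially the whole image is one instance of the 'scene' class.
root = Node("scene", is_instance=True,
            mask={(x, y) for x in range(4) for y in range(4)})
answer_type1(root, {"person": {(0, 0), (0, 1)}, "road": {(1, 0), (1, 1)}})
person_region = root.children[0]
inst = answer_type2(person_region, probe=(0, 0), instance_mask={(0, 0), (0, 1)})
```

Since instances can again receive Type-I requests, the two functions can be called recursively to any depth, matching the unlimited granularity described above.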

The proposed setting makes it easier to scale up annotations. On the one hand, a concept (e.g., arm of person) labeled in one image does not mean that it must be labeled in another image – the reason for not labeling may be that the part is occluded, the object is too small, the request is not raised, or the labeler declines to respond. Yet, this does not deliver inaccurate information to the algorithm – as we shall see in Sections 5.2 and 5.3, the recognition system naturally supports data-efficient learning on mixed annotations. On the other hand, it is easy to add new concepts to the knowledge base by texts only – this refers to the open-domain tests shown in Section 5.4.

Figure 3: The proposed approach to process a Type-I request. It looks up the knowledge base for the corresponding classes (in texts), applies a pre-trained text encoder to transform them into embedding vectors, and uses them to interact with the visual features. The current recognition result is taken for constraints and updated afterwards. To process a Type-II request, the query embedding vectors are further augmented with the positional embedding. This figure is best viewed in color.

4 Query-based Recognition: A Baseline

The above setting calls for a query-based recognition algorithm. The input data includes the image, the current recognition result (prior to the current request), the request itself, and the knowledge base. Processing each request involves extracting visual features, constructing queries, performing recognition and filtering, and updating the current recognition results.

Visual feature extraction. We extract visual features from the image using a deep neural network (e.g., a conventional convolutional neural network Krizhevsky et al. (2012); He et al. (2016) or a vision transformer Dosovitskiy et al. (2021); Liu et al. (2021)), obtaining a feature map whose spatial resolution is usually a down-sampled scale of the original image.

Language-based queries. Each request is transferred into a set of query embedding vectors, for which a pre-trained text encoder (e.g., CLIP Radford et al. (2021)) and the knowledge base are required. (1) For Type-I requests (i.e., whole-to-part semantic segmentation), the target class (in text, e.g., person) is used to look up the knowledge base, and its child nodes are found (in text, e.g., head, arm, leg, torso). We feed them into the text encoder to obtain one embedding vector per sub-class. (2) For Type-II requests (i.e., instance segmentation), given a probing pixel, a triplet is obtained, comprising the pixel coordinates and the target class, which is determined by the current recognition result. To construct the query, the target class (in text) is directly fed into the text encoder to obtain the semantic embedding vector, and the pixel coordinates are fed into a positional encoder to obtain the positional embedding. A simple implementation is to assign the positional embedding as the raw pixel coordinates. The query embedding is obtained by combining the semantic and positional embeddings. For simplicity, all text embedding vectors have the same dimensionality as the visual features.
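Under the simple choices mentioned above, the query construction can be sketched as follows. The text encoder is replaced by a deterministic stand-in (a real system would use a frozen CLIP text encoder), and combining the semantic and positional embeddings by concatenation is our assumption:

```python
import numpy as np

D = 8  # embedding dimensionality (toy value)

def text_embed(name: str) -> np.ndarray:
    """Stand-in for a frozen text encoder such as CLIP: any deterministic
    map from a class name to a normalized D-dimensional vector suffices
    for this sketch."""
    seed = int.from_bytes(name.encode(), "little") % (2 ** 32)
    v = np.random.default_rng(seed).standard_normal(D)
    return v / np.linalg.norm(v)

def type1_queries(part_names):
    """Type-I request: one query embedding per candidate sub-class."""
    return np.stack([text_embed(n) for n in part_names])

def type2_query(class_name, probe_xy):
    """Type-II request: semantic embedding combined with a positional
    embedding; here the positional embedding is just the coordinates."""
    pos = np.asarray(probe_xy, dtype=float)
    return np.concatenate([text_embed(class_name), pos])

Q1 = type1_queries(["head", "torso", "arm", "leg"])  # (4, D) query matrix
Q2 = type2_query("person", (0.3, 0.7))               # (D + 2,) query vector
```

In practice the positional part would be projected to the same dimensionality as the visual features before interaction; the concatenation here only illustrates that a Type-II query carries both "what" and "where".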

Language-driven recognition. We conduct feature interaction between the visual features and the language-based queries to segment the target objects. (1) For Type-I requests, we directly compute the inner-product between the visual feature vector at each location and the query embedding vectors, obtaining a class score vector with one entry per candidate sub-class. The entry with the maximal response is taken as the class label. To allow open-set recognition, we augment the score vector with an additional constant entry that serves as a threshold and stands for the others class – that said, if the responses of all normal entries are smaller than this threshold, the corresponding position is considered an unseen (anomalous) class – see such examples in Section 5.4. (2) For Type-II requests, different from semantic segmentation, existing instance segmentation methods usually generate a set of proposals (e.g., region proposals in Mask R-CNN He et al. (2017), center points in CondInst Tian et al. (2020)) and predict the class label, bounding box, and binary mask for each proposal. To achieve instance segmentation by request, we first filter the proposals by the positional embedding. Specifically, the proposals (e.g., feature locations in CondInst) near the probing pixel are kept for classification and segmentation. Finally, we choose the prediction with the highest score – the scores are obtained in a similar way (inner-product with text embeddings) as in processing Type-I requests.
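The Type-I interaction then reduces to inner products between per-pixel features and the class embeddings, with one extra constant entry acting as the reject option. A toy sketch (the threshold value of 0 is our assumption):

```python
import numpy as np

def classify_pixels(features, class_embeds, reject_thresh=0.0):
    """features: (H, W, D) visual features; class_embeds: (K, D) text
    embeddings. Returns (H, W) integer labels in [0, K], where label K
    stands for the 'others' (unseen/anomalous) class."""
    scores = features @ class_embeds.T              # (H, W, K) inner products
    H, W, K = scores.shape
    # Append a constant entry: if every class score falls below the
    # threshold, the argmax picks this entry, i.e., the reject option.
    reject = np.full((H, W, 1), reject_thresh)
    return np.argmax(np.concatenate([scores, reject], axis=-1), axis=-1)

# Toy example: a 1x2 image with 2 classes along orthogonal axes.
embeds = np.array([[1.0, 0.0], [0.0, 1.0]])
feats = np.array([[[0.9, 0.1],      # clearly class 0
                   [-0.5, -0.5]]])  # all scores below threshold -> 'others'
labels = classify_pixels(feats, embeds)   # first pixel: 0, second: 2 (others)
```

The same scoring (inner product with text embeddings) is what ranks the surviving proposals in the Type-II branch.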

Top-down filtering. We combine the request and the current recognition result to obtain a mask that constrains the segmentation masks predicted above. That is, we follow the top-down criterion to resolve segmentation conflicts across requests. For example, instance segmentation results should lie strictly inside the corresponding semantic region, and similarly, part segmentation results should lie inside the instance region. We believe that applying advanced mask fusion methods Li et al. (2018, 2019); Mohan and Valada (2021); Ren et al. (2021) may produce better results. The filtered segmentation masks are added back to the current recognition results, which lays the foundation of subsequent requests.
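In its simplest form, the top-down criterion amounts to intersecting each predicted mask with its parent's mask, a minimal sketch:

```python
import numpy as np

def top_down_filter(pred_mask, parent_mask):
    """Constrain a predicted binary mask to lie inside its parent region:
    an instance must stay within its semantic region, and a part within
    its instance. Both arrays are boolean and share the same shape."""
    return np.logical_and(pred_mask, parent_mask)

# Toy example: pixels predicted outside the parent region are dropped.
parent = np.array([[True, True, False],
                   [True, True, False]])
pred = np.array([[True, False, True],
                 [False, True, True]])
filtered = top_down_filter(pred, parent)
```

More sophisticated fusion (as in the mask fusion methods cited above) could reassign rather than simply discard the conflicting pixels; intersection is the baseline choice.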

Implementation and efficient training. In our implementation, the above framework contains two individual models. (1) For Type-I requests, we adopt a prevailing Transformer-based model, SegFormer Xie et al. (2021) with its customized backbone MiT-B0/B5, to extract pixel-wise visual features. For efficient training, the model processes all possible requests in parallel for each image, i.e., computing inner-products between a feature vector and the text embeddings of all annotated classes. To support joint training of whole and part classes, we adopt a grouped softmax cross-entropy loss, where softmax is applied within each subtree (nodes sharing the same parent node). The final loss is the weighted sum over all subtrees. (2) For Type-II requests, we adopt CondInst Tian et al. (2020) to produce instance proposals, where all positive feature locations are viewed as probing pixels and processed in parallel for each image. Note that CondInst samples positive locations from the center region of an instance only, while the probing pixel can lie at any position of the instance during testing. To resolve this conflict, we sample positive locations from the whole ground-truth instance masks directly, referred to as mask sampling in the following. For both models, we use a pre-trained and frozen CLIP Radford et al. (2021) model to provide text embedding vectors. For more details, please refer to Appendix A.
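The grouped softmax cross-entropy can be sketched as follows: softmax is applied separately within each subtree (classes sharing the same parent), and the per-group losses are summed with weights. The grouping, the weights, and the single-target simplification below are illustrative, not the paper's exact configuration:

```python
import numpy as np

def grouped_softmax_ce(logits, target, groups, weights=None):
    """logits: (C,) scores over all classes; target: index of the true
    class; groups: list of index lists, each a subtree sharing a parent.
    Cross-entropy is computed only inside the group containing the target
    (a pixel competes only against its siblings); the final loss is the
    weighted sum over the active groups."""
    weights = weights or [1.0] * len(groups)
    loss = 0.0
    for g, w in zip(groups, weights):
        if target in g:
            z = logits[g]
            z = z - z.max()                        # numerical stability
            logp = z - np.log(np.exp(z).sum())     # log-softmax within group
            loss += -w * logp[g.index(target)]
    return loss

# Toy example: two subtrees, e.g., {person, car} and {head, torso, arm}.
logits = np.array([2.0, 1.0, 0.5, 0.2, 3.0])
groups = [[0, 1], [2, 3, 4]]
loss = grouped_softmax_ce(logits, target=0, groups=groups)
```

The key property is that a part label (e.g., head) never competes against a top-level label (e.g., road), which is what allows whole and part classes to be trained jointly on mixed annotations.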

5 Experiments

5.1 Dataset and the Evaluation Metric

We evaluate visual recognition by request on two datasets. (1) The Cityscapes-Panoptic-Parts (CPP) dataset de Geus et al. (2021) inherited the images and object classes from the Cityscapes dataset Cordts et al. (2016), and further annotated two groups of part classes for vehicles and persons. As a result, several object classes have parts, annotated with a set of non-duplicate part classes. There is a high ratio of vehicles and persons annotated with parts (see the statistics in Appendix B.1), so it is relatively easy for conventional segmentation approaches to make use of the dataset. (2) The ADE20K dataset Zhou et al. (2019) has a much larger number of object and part classes. We follow the convention to use the most frequent object classes. Regarding the parts, we check all valid candidates satisfying the following conditions: (i) the part belongs to one of the object classes used for instance segmentation, and (ii) the number of occurrences is sufficiently large. Finally, a set of non-duplicate part classes belonging to these object classes is obtained. Please refer to Appendix C.1 for the details of data preparation. ADE20K has very sparse annotations on parts – to the best of our knowledge, conventional approaches have not yet quantitatively reported part-based segmentation results on this dataset.

To measure the segmentation quality, we design a metric named hierarchical panoptic quality (HPQ), which can measure the accuracy of a recognition tree of any depth. HPQ is a recursive metric. For a leaf node, HPQ equals the mask IoU; otherwise, we first compute the class-wise HPQ:

    HPQ_c = \frac{\sum_{(p,g) \in TP_c} HPQ(p,g)}{|TP_c| + \frac{1}{2}|FP_c| + \frac{1}{2}|FN_c|},

where $TP_c$, $FP_c$, and $FN_c$ denote the sets of true positives, false positives, and false negatives of the $c$-th class, respectively, and the true-positive criterion is $HPQ(p,g) > 0.5$. The HPQ values of all active classes are averaged into the overall HPQ at the root node. HPQ is related to prior metrics, e.g., it degenerates to the original PQ Kirillov et al. (2019) if there is no object-part hierarchy, and is similar to PartPQ de Geus et al. (2021) if the object-part relationship is one-level (i.e., parts cannot have parts) and parts are only semantically labeled (i.e., no part instances), such as in the CPP dataset (see Appendix B.2 for detailed differences). Yet, HPQ has the ability of being generalized to more complex knowledge bases.
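Assuming the matching between predicted and ground-truth nodes is given and, for simplicity, at most one node per class at each level, the recursion can be sketched on toy pixel sets:

```python
def iou(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def hpq(pred, gt):
    """HPQ of a matched (prediction, ground-truth) node pair.
    Each node: {'mask': set of pixels, 'children': {class_name: node}}.
    Leaf: HPQ = mask IoU. Otherwise, average the class-wise HPQ, where a
    matched child pair is a TP if its HPQ exceeds 0.5 (contributing its
    HPQ over |TP| + 0.5|FP| + 0.5|FN|), and otherwise counts as FP + FN."""
    if not pred["children"] and not gt["children"]:
        return iou(pred["mask"], gt["mask"])
    classes = set(pred["children"]) | set(gt["children"])
    per_class = []
    for c in classes:
        p, g = pred["children"].get(c), gt["children"].get(c)
        if p is not None and g is not None:
            h = hpq(p, g)
            # with one node per class: TP -> h / 1; non-TP -> 0 / (0.5 + 0.5)
            per_class.append(h if h > 0.5 else 0.0)
        else:
            per_class.append(0.0)   # unmatched node: pure FP or FN
    return sum(per_class) / len(per_class)

gt = {"mask": {1, 2, 3, 4}, "children": {
    "head": {"mask": {1, 2}, "children": {}},
    "torso": {"mask": {3, 4}, "children": {}}}}
pred = {"mask": {1, 2, 3, 4}, "children": {
    "head": {"mask": {1, 2}, "children": {}},   # perfect match: HPQ = 1.0
    "torso": {"mask": {3}, "children": {}}}}    # IoU = 0.5, fails the TP test
score = hpq(pred, gt)
```

A full implementation would additionally search for the best matching among multiple nodes per class, exactly as in the original PQ computation.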

5.2 Results on Cityscapes-Panoptic-Parts

There are two settings of evaluation, namely, non-probing-based (NP-based) and probing-based (P-based). The NP-based setting is to fairly compare against the conventional instance segmentation approaches, i.e., finding all instances at once. For this purpose, we densely sample probes on the semantic segmentation regions, each of which generates a candidate instance, and filter the results using non-maximum suppression (NMS) (see Appendix A.3 for details). The P-based setting aligns with the training stage. For each labeled instance, we simulate the user click to place a probe that lies within the intersection of the predicted semantic region and the ground-truth instance region – if the intersection is empty, the instance is lost (i.e., its IoU is zero). An important issue lies in the choice of probes – intuitively, the segmentation task becomes easier if the probe lies in typical regions, e.g., close to the instance center. To study this, we set a centerness hyper-parameter: at one extreme, the probe lies exactly on the instance center; at the other, the probe is uniformly distributed over the instance; intermediate values interpolate between these two situations (see Appendix A.4 for details).
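The probe-placement strategy can be sketched as an interpolation between center sampling and uniform sampling over the instance mask; the interpolation scheme below is our assumption and not necessarily the exact scheme of the paper's Appendix A.4:

```python
import random

def sample_probe(mask, centerness, rng=random):
    """mask: list of (x, y) pixels of the instance. centerness in [0, 1]:
    1.0 -> the pixel closest to the mask centroid; 0.0 -> uniform over the
    whole mask; intermediate values -> uniform over the fraction of pixels
    nearest the centroid (an assumed interpolation)."""
    cx = sum(x for x, _ in mask) / len(mask)
    cy = sum(y for _, y in mask) / len(mask)
    ranked = sorted(mask, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
    k = max(1, round(len(ranked) * (1.0 - centerness)))
    return rng.choice(ranked[:k])

# Toy example: a 5x5 square instance; full centerness hits the centroid pixel.
mask = [(x, y) for x in range(5) for y in range(5)]
center_probe = sample_probe(mask, centerness=1.0)
```

Lower centerness values make the probe placement noisier, which is exactly the axis varied in the P-based columns of the tables below.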


Table 1: Semantic segmentation results on non-part (Non-P) and part classes of CPP.

mIoU (%)                             Non-P   Part
SegFormer (B0) Xie et al. (2021)     76.54   –
  w/ CLIP                            77.35   –
  w/ CLIP & parts (ours)             77.23   75.58
SegFormer (B5) Xie et al. (2021)     82.25   –
  w/ CLIP                            82.40   –
  w/ CLIP & parts (ours)             82.22   78.55

Table 2: Instance segmentation accuracy with non-probing (NP) and probing (P-based) tests on the CPP dataset. The centerness of the probes decreases from left to right across the P-based columns – please refer to the main texts for the definition.

AP (%)                               NP      P-based (w.r.t. probe centerness)
CondInst (R50) Tian et al. (2020)    36.6    –
  w/ CLIP                            36.8    39.3   39.0   34.6   24.5
  w/ CLIP & mask samp.               37.8    39.4   39.1   37.4   33.5

Table 3: Overall segmentation accuracy (HPQ) on the CPP dataset. We use the instance segmentation with CLIP and the mask sampling strategy (the last row of Table 2).

HPQ (%)                              NP-based   P-based (w.r.t. probe centerness)
SegFormer (B0) + CondInst (R50)      55.8       56.7   56.6   56.5   56.1
SegFormer (B5) + CondInst (R50)      60.2       61.4   61.2   61.0   60.4

We first test semantic and instance segmentation individually. Results are summarized in Tables 1 and 2, respectively. For semantic segmentation, introducing text embeddings as flexible classes improves the accuracy slightly, and adding parts causes a slight accuracy drop on non-part classes. That said, language-based segmentation models can handle non-part and part classes simultaneously. For instance segmentation, sampling positive locations from ground-truth masks during training improves the results, especially when the quality of probes is not guaranteed (i.e., when the probes can lie far from the instance center).

Next, we combine the best practice into overall segmentation and report the results in Table 3. Interestingly, although the NP-based setting surpasses the low-centerness P-based settings in instance AP, it reports lower HPQ values, because the NP-based tests usually generate some false positives with low confidence scores – HPQ, compared to AP, is much more sensitive to these prediction errors.


PartPQ (%)                        All    Non-P   Part
PPS-small de Geus et al. (2021)   55.1   59.7    42.3
PPF-small Li et al. (2022c)       57.4   62.2    43.9
Ours-small                        57.0   61.5    44.2
Ours-small†                       57.9   61.9    47.0
PPS-large de Geus et al. (2021)   61.4   67.0    45.8
PPF-large Li et al. (2022c)       61.9   68.0    45.6
Ours-large                        61.2   65.8    48.1
Ours-large†                       62.4   66.6    50.5

Table 4: Comparison with recent approaches: PartPQ on all, non-part, and part classes. † indicates P-based tests.

Comparison to recent approaches. We compare our method with two recently published works on CPP by PartPQ. Results are shown in Table 4, where 'small' and 'large' indicate different model sizes, e.g., we use SegFormer (B0) and (B5) as the small and large models, respectively. Our model shows competitive recognition accuracy. On the one hand, the PartPQ for non-part classes is slightly lower, because we use a relatively lightweight model for instance segmentation and do not use any post-refinement methods Liang et al. (2020); Tang et al. (2021). Applying advanced instance segmentation methods Zhang et al. (2021); Cheng et al. (2021a) may produce better results, but it is not the focus of this paper. On the other hand, our model has a clear advantage in part segmentation, which confirms the effectiveness of (i) using text embeddings to represent part classes and (ii) separate training of instance and part segmentation (to avoid optimization difficulties).


Ratio   mIoU Non-P   mIoU Part   HPQ NP   HPQ P
100%    82.22        78.55       60.2     61.4
50%     81.82        77.59       59.8     61.0
30%     81.99        76.52       59.4     60.7
15%     81.72        65.51       57.3     59.6

Table 5: Segmentation accuracy (in %) of using different ratios of part annotations.

Data-efficient learning. A benefit of the proposed setting is its ability to support data-efficient learning. To reveal this, we assume that much fewer parts are labeled on the CPP dataset. For this purpose, we randomly choose a subset of part annotations and discard the others (details are elaborated in Appendix B.3). We report the segmentation accuracy in Table 5. The proposed method, performing visual recognition by request, is easily adapted to such scenarios with partial annotations. This property largely benefits our investigation on the ADE20K dataset, where a considerable portion of part annotation is missing.

5.3 Results on ADE20K

Next, we report recognition accuracy on the ADE20K dataset. Note that recognizing objects with parts in ADE20K is a naturally data-efficient learning scenario, since the portion of objects with part annotations is very low (see statistics in Appendix C.1). This makes it very difficult to apply the conventional segmentation paradigm (i.e., visual recognition all at once) – we conjecture it to be the main reason that no prior works have reported mIoU for part segmentation on this dataset. Our setting, visual recognition by request, is easily applied to this scenario.


                                  mIoU            NP-based   P-based (HPQ w.r.t. probe centerness)
                                  Non-P   Part    HPQ
SegFormer (B0) + CondInst (R50)   37.0    29.0    29.9       35.7   35.4   35.3   34.5
SegFormer (B5) + CondInst (R50)   48.5    35.8    36.5       41.7   41.5   41.1   40.5

Table 6: Semantic segmentation accuracy and the overall HPQ on the ADE20K dataset. We have by default used the instance segmentation with CLIP and the mask sampling strategy. All numbers are in %.

Quantitative results are shown in Table 6. We report semantic segmentation mIoU as well as HPQ under different probing settings. All these values are significantly lower than the values reported on CPP, indicating that ADE20K is much more challenging in terms of the richness of semantics and the granularity of annotation. In addition, the high sparsity of part annotations results in a low recall of small parts, e.g., the gaze (the eyes of persons) class is rarely recognized – however, these elements are often missing in the annotation of the validation set, which implies that the true segmentation accuracy shall be even lower. Overall, there is much room for improvement in such a complex, multi-level, and sparsely annotated segmentation dataset.

5.4 Visualization and Open-domain Tests

Figure 4: Examples of visual recognition results on CPP (top) and ADE20K (bottom). Green spots indicate the example probing pixels. Best viewed digitally with zoom. Detailed vocabulary and more results (including richer interaction) are provided in Appendices B.4 and C.2.

We show some visualization results in Figure 4. In each row, the 1st–4th columns show the input, top-level semantic segmentation, instance segmentation (based on probes), and part segmentation (upon the chosen instances), respectively. For the example from ADE20K, we also compare our result with the ground-truth part labels (5th column), showing that the annotations are often incomplete, yet our approach can recognize many of the missing classes (e.g., bed of pool table).

Figure 5: Examples of open-domain visual recognition. From top to bottom: recognizing anomalies, relative classes, and compositional concepts, respectively. Text labels that are unseen during training are underlined. Black regions correspond to the others class. Please refer to the main texts for details.

Besides, we qualitatively investigate the ability of open-domain recognition. Examples are shown in Figure 5 (see Appendix D for more results). Specifically, we consider the following challenges:

  1. Detecting anomalies. With a native others class available at each round of part segmentation, the model can distinguish anomalous classes (i.e., those not in the knowledge base) from known ones. We show two examples of anomalies (dog and box) from the Fishyscapes dataset Blum et al. (2021) that were not detected by the baseline (see Figure 2 of Blum et al. (2021)) but were found by our approach, which was not specifically designed for this task.

  2. Recognizing unseen classes. Since the queries are generated from texts, it is easy to recognize relative classes even though their texts did not appear in the training process, e.g., the unseen vehicle class automatically covers car, bus, motorbike, bike, etc., which exist in the knowledge base.

  3. Understanding compositional concepts. The text-based whole-part hierarchy allows the model to transfer the part configuration of a labeled object (e.g., car) to objects with similar part-level structures (e.g., caravan, trailer, etc.). This transfer mainly relies on the similarity of visual features; for example, it is often difficult to transfer the body structure from person to animal due to the significant semantic gap.
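The unseen-class behavior above can be illustrated with a toy cosine-similarity grouping. The embeddings, class names, and threshold below are illustrative stand-ins for the CLIP text embeddings used by the actual system:

```python
import numpy as np

def covered_classes(query_vec, kb_vecs, kb_names, threshold=0.8):
    """Return knowledge-base classes whose (toy) text embedding is close to
    the query embedding. The real system compares CLIP text embeddings; the
    vectors and threshold here are illustrative."""
    q = query_vec / np.linalg.norm(query_vec)
    out = []
    for name, v in zip(kb_names, kb_vecs):
        sim = float(q @ (v / np.linalg.norm(v)))
        if sim >= threshold:
            out.append(name)
    return out

# Toy 2-D "embeddings": vehicle-like classes point one way, person another.
names = ["car", "bus", "person"]
vecs = [np.array([1.0, 0.1]), np.array([0.9, 0.2]), np.array([0.1, 1.0])]
print(covered_classes(np.array([1.0, 0.0]), vecs, names))  # -> ['car', 'bus']
```

An unseen query such as "vehicle" thus covers the known vehicle classes purely through embedding similarity, without retraining.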

6 Conclusions, Limitations, and Future Work

This paper presents a novel evaluation protocol named visual recognition by request. The core idea is to formulate the recognition task into a series of requests and construct a query-based framework to solve them. Experimental results show that the protocol works well under mixed and/or sparse labels, hence improving the scalability of annotations. Despite the benchmark and baseline we have established, our work still has some limitations, discussed as follows.

  • The proposed protocol shall be capable of generalizing to new types of requests. For this purpose, the knowledge base needs to be upgraded. Currently, this is done manually, but the methodology seems difficult to extend to new request types; for example, there is no existing database for all nameable visual attributes (size, shape, etc.). We look forward to learning from web data Radford et al. (2021), yet filtering out noise remains a challenging topic.

  • When the proposed protocol is adopted for annotating a new dataset or in other scenarios (e.g., video or infrared images), new challenges may emerge in guaranteeing annotation quality. For example, if the requests are arbitrarily determined by the labelers, they may always choose to annotate easy samples (e.g., large instances with a clean background). Randomly generated requests (especially for Type-II) may be required, yet a rejection option (when the labeler cannot depict an instance at the specified location) shall be allowed.

  • The proposed protocol does not require the model to accomplish recognition all at once. This may impact the design principles of network architectures. We conjecture that the advantage of very heavy backbones (e.g., with billions of parameters Zhai et al. (2021); Liu et al. (2021)) may shrink, while flexible and/or compositional backbones may show their efficiency.

Acknowledgement. The authors would like to thank Mr. Longhui Wei and Dr. Junran Peng for valuable discussions.


  • [1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh (2015) Vqa: visual question answering. In International Conference on Computer Vision, Cited by: §2.
  • [2] A. Bendale and T. E. Boult (2016) Towards open set deep networks. In Computer Vision and Pattern Recognition, Cited by: §2.
  • [3] H. Blum, P. Sarlin, J. Nieto, R. Siegwart, and C. Cadena (2021) The fishyscapes benchmark: measuring blind spots in semantic segmentation. International Journal of Computer Vision 129 (11), pp. 3119–3135. Cited by: item I.
  • [4] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In European Conference on Computer Vision, Cited by: §2.
  • [5] S. Changpinyo, P. Sharma, N. Ding, and R. Soricut (2021) Conceptual 12m: pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Computer Vision and Pattern Recognition, Cited by: §2.
  • [6] L. Chen, T. Yang, X. Zhang, W. Zhang, and J. Sun (2021) Points as queries: weakly semi-supervised object detection by points. In Computer Vision and Pattern Recognition, Cited by: §2.
  • [7] T. Chen, S. Saxena, L. Li, D. J. Fleet, and G. Hinton (2021) Pix2seq: a language modeling framework for object detection. arXiv preprint arXiv:2109.10852. Cited by: §2.
  • [8] B. Cheng, A. Choudhuri, I. Misra, A. Kirillov, R. Girdhar, and A. G. Schwing (2021) Mask2Former for video instance segmentation. arXiv preprint arXiv:2112.10764. Cited by: §2, §5.2.
  • [9] B. Cheng, A. Schwing, and A. Kirillov (2021) Per-pixel classification is not all you need for semantic segmentation. In Advances in Neural Information Processing Systems, Cited by: §2.
  • [10] M. Contributors (2020) MMSegmentation: openmmlab semantic segmentation toolbox and benchmark. Cited by: §A.1.
  • [11] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Computer Vision and Pattern Recognition, Cited by: §B.1, 2nd item, §2, §5.1.
  • [12] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra (2018) Embodied question answering. In Computer Vision and Pattern Recognition, Cited by: §2.
  • [13] D. de Geus, P. Meletis, C. Lu, X. Wen, and G. Dubbelman (2021) Part-aware panoptic segmentation. In Computer Vision and Pattern Recognition, Cited by: §1, item I, §5.1, §5.1, Table 4.
  • [14] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Computer Vision and Pattern Recognition, Cited by: §2.
  • [15] B. Dong, F. Zeng, T. Wang, X. Zhang, and Y. Wei (2021) Solq: segmenting objects by learning queries. In Advances in Neural Information Processing Systems, Cited by: §2.
  • [16] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2021) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations, Cited by: §4.
  • [17] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88 (2), pp. 303–338. Cited by: §2.
  • [18] Y. Fang, S. Yang, X. Wang, Y. Li, C. Fang, Y. Shan, B. Feng, and W. Liu (2021) Instances as queries. In International Conference on Computer Vision, Cited by: §2.
  • [19] G. Ghiasi, X. Gu, Y. Cui, and T. Lin (2021) Open-vocabulary image segmentation. arXiv preprint arXiv:2112.12143. Cited by: §2, §2.
  • [20] D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi (2018) Iqa: visual question answering in interactive environments. In Computer Vision and Pattern Recognition, Cited by: §2.
  • [21] X. Gu, T. Lin, W. Kuo, and Y. Cui (2021) Open-vocabulary object detection via vision and language knowledge distillation. In International Conference on Learning Representations, Cited by: §2, §2.
  • [22] J. He, S. Yang, S. Yang, A. Kortylewski, X. Yuan, J. Chen, S. Liu, C. Yang, and A. Yuille (2021) PartImageNet: a large, high-quality dataset of parts. arXiv preprint arXiv:2112.00933. Cited by: §2.
  • [23] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In International Conference on Computer Vision, Cited by: §4.
  • [24] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, Cited by: §4.
  • [25] M. Z. Hossain, F. Sohel, M. F. Shiratuddin, and H. Laga (2019) A comprehensive survey of deep learning for image captioning. ACM Computing Surveys 51 (6), pp. 1–36. Cited by: §2.
  • [26] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell (2016) Natural language object retrieval. In Computer Vision and Pattern Recognition, Cited by: §2.
  • [27] C. Jia, Y. Yang, Y. Xia, Y. Chen, Z. Parekh, H. Pham, Q. Le, Y. Sung, Z. Li, and T. Duerig (2021) Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, Cited by: §2.
  • [28] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár (2019) Panoptic segmentation. In Computer Vision and Pattern Recognition, Cited by: §1, §5.1.
  • [29] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, Cited by: §4.
  • [30] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, et al. (2020) The open images dataset v4. International Journal of Computer Vision 128 (7), pp. 1956–1981. Cited by: 2nd item, §2.
  • [31] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §1.
  • [32] B. Li, K. Q. Weinberger, S. Belongie, V. Koltun, and R. Ranftl (2022) Language-driven semantic segmentation. In International Conference on Learning Representations, Cited by: §A.1, §2, §2.
  • [33] J. Li, A. Raventos, A. Bhargava, T. Tagawa, and A. Gaidon (2018) Learning to fuse things and stuff. arXiv preprint arXiv:1812.01192. Cited by: §4.
  • [34] L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J. Hwang, K. Chang, and J. Gao (2022) Grounded language-image pre-training. In Computer Vision and Pattern Recognition, Cited by: §A.1, §2.
  • [35] X. Li, S. Xu, Y. Y. Cheng, Y. Tong, D. Tao, et al. (2022) Panoptic-partformer: learning a unified model for panoptic part segmentation. arXiv preprint arXiv:2204.04655. Cited by: Table 4.
  • [36] Y. Li, X. Chen, Z. Zhu, L. Xie, G. Huang, D. Du, and X. Wang (2019) Attention-guided unified network for panoptic segmentation. In Computer Vision and Pattern Recognition, Cited by: §4.
  • [37] J. Liang, N. Homayounfar, W. Ma, Y. Xiong, R. Hu, and R. Urtasun (2020) Polytransform: deep polygon transformer for instance segmentation. In Computer Vision and Pattern Recognition, Cited by: §5.2.
  • [38] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European Conference on Computer Vision, Cited by: 2nd item, §2.
  • [39] C. Liu, Z. Lin, X. Shen, J. Yang, X. Lu, and A. Yuille (2017) Recurrent multimodal interaction for referring image segmentation. In International Conference on Computer Vision, Cited by: §2.
  • [40] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang, L. Dong, et al. (2021) Swin transformer v2: scaling up capacity and resolution. arXiv preprint arXiv:2111.09883. Cited by: §4, 3rd item.
  • [41] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy (2016) Generation and comprehension of unambiguous object descriptions. In Computer Vision and Pattern Recognition, Cited by: §2.
  • [42] P. Meletis, X. Wen, C. Lu, D. C. de Geus, and G. Dubbelman (2020) Cityscapes-panoptic-parts and pascal-panoptic-parts datasets for scene understanding. arXiv preprint arXiv:2004.07944. Cited by: §B.1.
  • [43] D. Meng, X. Chen, Z. Fan, G. Zeng, H. Li, Y. Yuan, L. Sun, and J. Wang (2021) Conditional detr for fast training convergence. In International Conference on Computer Vision, Cited by: §2.
  • [44] R. Mohan and A. Valada (2021) Efficientps: efficient panoptic segmentation. International Journal of Computer Vision 129 (5), pp. 1551–1579. Cited by: §4.
  • [45] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, Cited by: §A.1, §A.1, §2, §3, §4, §4, 1st item.
  • [46] J. Ren, C. Yu, Z. Cai, M. Zhang, C. Chen, H. Zhao, S. Yi, and H. Li (2021) REFINE: prediction fusion network for panoptic segmentation. In AAAI Conference on Artificial Intelligence, Cited by: §4.
  • [47] W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, and T. E. Boult (2012) Toward open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (7), pp. 1757–1772. Cited by: §2.
  • [48] S. Shao, Z. Li, T. Zhang, C. Peng, G. Yu, X. Zhang, J. Li, and J. Sun (2019) Objects365: a large-scale, high-quality dataset for object detection. In International Conference on Computer Vision, Cited by: §2.
  • [49] R. Strudel, R. Garcia, I. Laptev, and C. Schmid (2021) Segmenter: transformer for semantic segmentation. In International Conference on Computer Vision, Cited by: §2.
  • [50] P. Sun, R. Zhang, Y. Jiang, T. Kong, C. Xu, W. Zhan, M. Tomizuka, L. Li, Z. Yuan, C. Wang, et al. (2021) Sparse r-cnn: end-to-end object detection with learnable proposals. In Computer Vision and Pattern Recognition, Cited by: §2.
  • [51] C. Tang, H. Chen, X. Li, J. Li, Z. Zhang, and X. Hu (2021) Look closer to segment better: boundary patch refinement for instance segmentation. In Computer Vision and Pattern Recognition, Cited by: §5.2.
  • [52] Z. Tian, H. Chen, X. Wang, Y. Liu, and C. Shen (2019) AdelaiDet: a toolbox for instance-level recognition tasks. Cited by: §A.1.
  • [53] Z. Tian, C. Shen, H. Chen, and T. He (2019) FCOS: fully convolutional one-stage object detection. In International Conference on Computer Vision, pp. 9626–9635. Cited by: §A.4.
  • [54] Z. Tian, C. Shen, and H. Chen (2020) Conditional convolutions for instance segmentation. In European Conference on Computer Vision, Cited by: §A.1, §A.4, §4, §4, Table 2.
  • [55] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In Computer Vision and Pattern Recognition, Cited by: §2.
  • [56] Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, and A. van den Hengel (2017) Visual question answering: a survey of methods and datasets. Computer Vision and Image Understanding 163, pp. 21–40. Cited by: §2.
  • [57] Y. Wu, A. Kirillov, F. Massa, W. Lo, and R. Girshick (2019) Detectron2. Cited by: §A.2.
  • [58] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo (2021) SegFormer: simple and efficient design for semantic segmentation with transformers. In Advances in Neural Information Processing Systems, Cited by: §A.1, §4, Table 2.
  • [59] J. Xu, S. De Mello, S. Liu, W. Byeon, T. Breuel, J. Kautz, and X. Wang (2022) GroupViT: semantic segmentation emerges from text supervision. arXiv preprint arXiv:2202.11094. Cited by: §2.
  • [60] M. Xu, Z. Zhang, F. Wei, Y. Lin, Y. Cao, H. Hu, and X. Bai (2021) A simple baseline for zero-shot semantic segmentation with pre-trained vision-language model. arXiv preprint arXiv:2112.14757. Cited by: §2, §2.
  • [61] A. Zareian, K. D. Rosa, D. H. Hu, and S. Chang (2021) Open-vocabulary object detection using captions. In Computer Vision and Pattern Recognition, Cited by: §2.
  • [62] R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi (2019) From recognition to cognition: visual commonsense reasoning. In Computer Vision and Pattern Recognition, Cited by: §2.
  • [63] X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer (2021) Scaling vision transformers. arXiv preprint arXiv:2106.04560. Cited by: 3rd item.
  • [64] W. Zhang, J. Pang, K. Chen, and C. C. Loy (2021) K-net: towards unified image segmentation. In Advances in Neural Information Processing Systems, Cited by: §2, §5.2.
  • [65] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba (2019) Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision 127 (3), pp. 302–321. Cited by: §C.1, 2nd item, §1, §2, §5.1.
  • [66] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai (2021) Deformable detr: deformable transformers for end-to-end object detection. In International Conference on Learning Representations, Cited by: §2.
  • [67] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei (2016) Visual7w: grounded question answering in images. In Computer Vision and Pattern Recognition, Cited by: §2.


Appendix A Implementation Details

a.1 Model Implementation

Text encoder. We adopt the Contrastive Language–Image Pre-training (CLIP) Radford et al. (2021) model as the text encoder throughout, which supports variable-length text labels by design. We use the pre-trained and frozen CLIP-ViT-B/32 model, which produces an embedding vector of fixed dimensionality for each text label. In this paper, we simply use the class names (mostly containing 1 or 2 words, e.g., person, traffic sign) defined in the datasets.

Type-I (SegFormer). We adopt the SegFormer Xie et al. (2021) implementation in the MMSegmentation Contributors (2020) codebase to perform semantic segmentation (including part segmentation). SegFormer uses a series of Mix Transformer (MiT) encoders with different sizes as the vision backbones. In this paper, we mainly adopt the lightweight model (MiT-B0) for fast inference and the largest model (MiT-B5) for good performance. The output feature maps match the dimensionality of the text embedding vectors, with which they can directly interact. Note that the logits vector after vision-language feature interaction is usually divided by a temperature parameter τ, i.e., z' = z/τ, where τ controls the concentration level of the distribution Radford et al. (2021); Li et al. (2022a). We follow CLIP and set τ as a learnable parameter, initialized at the start of the training stage.
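As a minimal sketch of this vision-language interaction (the feature dimensionality and the temperature value below are illustrative, not the paper's settings):

```python
import numpy as np

def vision_language_logits(pixel_feats, text_embeds, tau=0.07):
    """Inner-product interaction between pixel features and class text
    embeddings, divided by a temperature tau. In the actual model tau is a
    learnable scalar; it is fixed here for illustration."""
    # (N, C) @ (C, K) -> (N, K) class logits per pixel
    return (pixel_feats @ text_embeds.T) / tau

feats = np.random.randn(4, 8)   # 4 pixels, 8-dim features (toy sizes)
texts = np.random.randn(3, 8)   # 3 class-name embeddings
logits = vision_language_logits(feats, texts)
print(logits.shape)  # (4, 3)
```

A smaller tau sharpens the per-pixel class distribution, which is why its value matters during training.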

Type-II (CondInst). We adopt the CondInst Tian et al. (2020) implementation in the AdelaiDet Tian et al. (2019a) codebase to perform instance segmentation. CondInst adopts feature locations on the FPN feature maps as instance proposals, which is naturally compatible with the design of probing-based inference. Since the output feature maps (from the last layer of the classification branch) have a different number of channels than the text embedding, we apply a linear projection to transform the dimension of the text vector accordingly before feature interaction. We mainly adopt ResNet-50 as the visual backbone of CondInst. We empirically observed that the results improved only slightly in mAP with a larger backbone, ResNet-101; we conjecture that the limited improvement is due to the small dataset sizes of CPP and ADE20K. In addition, we observed that the instance segmentation model is more sensitive to the choice of the temperature, where an improper value leads to sub-optimal results. We empirically found that tuning the temperature and adding a learnable bias term (partially following GLIP Li et al. (2022b)) works consistently better.
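A sketch of the dimension-matching projection with a learnable logit bias; the dimensions, initialization, and names below are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a text embedding projected down to the channel count
# of the instance-branch feature maps (neither value is quoted from the paper).
TEXT_DIM, FEAT_DIM = 512, 256
W = rng.standard_normal((FEAT_DIM, TEXT_DIM)) * 0.01  # learnable projection
b = 0.0                                                # learnable logit bias

def instance_logit(pixel_feat, text_vec, tau=1.0):
    """Score one FPN feature location against one class text embedding:
    project the text vector, take the inner product, scale by the
    temperature tau, and add the learnable bias."""
    projected = W @ text_vec              # (FEAT_DIM,)
    return float(pixel_feat @ projected) / tau + b

score = instance_logit(rng.standard_normal(FEAT_DIM),
                       rng.standard_normal(TEXT_DIM))
print(score)
```

In training, W, b, and tau would all be optimized jointly with the network.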

a.2 Model Optimization

Figure 6: Illustration of sampling probes with a fixed stride.

For Type-I requests, we follow almost the same training protocol as SegFormer, except that the whole-level and part-level classes are jointly trained in one model. The MiT encoder was pre-trained on ImageNet-1K and the decoder was randomly initialized. For data augmentation, random resizing, random horizontal flipping, and random cropping (with dataset-specific crop sizes for CPP and ADE20K) were applied. The model was trained with the AdamW optimizer on NVIDIA Tesla V100 GPUs; the initial learning rate was decayed using the poly learning rate policy, and the batch size differs between CPP and ADE20K. Training takes on the order of hours for both backbones (MiT-B5/B0) on CPP and ADE20K.

For Type-II requests, since CondInst does not report results on Cityscapes and ADE20K, we implemented the models ourselves. Due to the small dataset size, we initialized the model with the MS-COCO pre-trained CondInst, which improves the AP. The model was trained with the SGD optimizer on NVIDIA Tesla V100 GPUs. For CPP, we used the same training configurations as Mask R-CNN on Cityscapes Wu et al. (2019). For ADE20K, the maximum number of proposals sampled during training was increased, since more instances appear in an image than in MS-COCO, and the learning rate was decayed during training. During training, random resizing and random cropping were applied. Other configurations are the same as in the original CondInst. Training takes on the order of hours on both CPP and ADE20K.

a.3 Non-probing Segmentation


AP (%), NP-based      (strides, densest → sparsest)
w/o mask sampling     37.8   37.8   37.1   33.6
w/ mask sampling      38.5   38.5   37.9   35.1

Table 7: Non-probing-based instance segmentation results w.r.t. different strides.

The non-probing-based (NP-based) inference is used to fairly compare against conventional instance segmentation approaches, i.e., finding all instances at once. For this purpose, we densely sample a set of probing pixels based on the semantic segmentation results. Specifically, we regularly sample points with a fixed stride over the whole image and keep the points inside the corresponding semantic regions, as illustrated in Figure 6 (white dots are sampled probes). Each sampled point is viewed as a probing pixel to produce a candidate instance prediction (see Appendix A.4 for details). Finally, the results are filtered with NMS (using the same threshold as CondInst). The above procedure can be viewed as an improved CondInst that replaces the built-in classification branch with a standalone pixel-wise classification model. We present results with respect to different strides in Table 7. As shown, the denser the sampled probes, the higher the mask AP. In addition, enabling mask sampling during training further improves the results (see Appendix A.4).
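The dense probe sampling can be sketched as follows; this is a simplified stand-in for the actual implementation, with an illustrative grid offset of half a stride:

```python
import numpy as np

def sample_probes(semantic_mask, stride):
    """Regularly sample probing pixels with a fixed stride and keep only
    those that fall inside the predicted semantic region (a boolean mask)."""
    h, w = semantic_mask.shape
    probes = []
    for y in range(stride // 2, h, stride):
        for x in range(stride // 2, w, stride):
            if semantic_mask[y, x]:
                probes.append((y, x))
    return probes

mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True           # a small predicted semantic region
print(sample_probes(mask, 2))   # -> [(3, 3), (3, 5), (5, 3), (5, 5)]
```

A smaller stride yields denser probes, matching the trend in Table 7 where denser sampling gives higher mask AP.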

a.4 Probing Segmentation

The probing-based (P-based) inference is used to simulate the user click that places a probing pixel within the intersection of the predicted semantic region and the ground-truth instance region. Specifically, we first compute the mass center (x_c, y_c) and the bounding box [x_min, x_max] × [y_min, y_max] of the intersection region. The actual sampling bounding box is determined by the hyper-parameter δ, which controls the centerness of the probing pixels:

x'_min = δ·x_min + (1 − δ)·x_c,  x'_max = δ·x_max + (1 − δ)·x_c,

and similarly for the y-coordinates. The probe is randomly sampled from the intersection of this bounding box and the original sampling region. If no candidate pixel lies in that region (since the mass center may lie outside the instance), the probe is sampled from the bounding box instead. As mentioned in Section 5.2, the probe lies exactly on the mass center if δ = 0, and the probe is uniformly distributed over the instance if δ = 1.
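A minimal sketch of this sampling procedure, assuming the centerness hyper-parameter (called `delta` here, a name chosen for illustration) linearly interpolates the sampling box between the mass center and the full bounding box:

```python
import numpy as np

def sample_probe(region_mask, delta, rng=np.random.default_rng(0)):
    """Sample a probing pixel from `region_mask` (boolean). delta=0 returns
    the mass center; delta=1 samples uniformly over the region; intermediate
    values shrink the sampling box toward the center."""
    ys, xs = np.nonzero(region_mask)
    cy, cx = ys.mean(), xs.mean()                  # mass center
    if delta == 0:
        return int(round(cy)), int(round(cx))
    # interpolate the bounding box of the region toward the mass center
    y0, y1 = cy + delta * (ys.min() - cy), cy + delta * (ys.max() - cy)
    x0, x1 = cx + delta * (xs.min() - cx), cx + delta * (xs.max() - cx)
    inside = [(y, x) for y, x in zip(ys, xs) if y0 <= y <= y1 and x0 <= x <= x1]
    # fall back to the box center if no region pixel lies inside the box
    if not inside:
        return int(round(cy)), int(round(cx))
    return inside[rng.integers(len(inside))]

mask = np.zeros((9, 9), dtype=bool)
mask[2:7, 2:7] = True
print(sample_probe(mask, 0.0))  # -> (4, 4), the mass center
```

The fallback branch mirrors the text: when the mass center lies outside the instance, the probe is taken from the box itself.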

During inference, for each Type-II request we have a triplet (see Section 4 for details; the subscript is omitted here for simplicity). To perform instance segmentation by request, we first compute the mapped feature location (⌊x/s_i⌋, ⌊y/s_i⌋) on each FPN level, where s_i is the stride of the i-th FPN level (s_i ∈ {8, 16, 32, 64, 128} for CondInst). The mapped feature locations are kept for producing predictions. Finally, we choose only the one prediction with the highest score, where the scores are computed by the inner product of the pixel-wise feature vectors (extracted from the FPN feature maps) and the text embedding vector of the target class (generated by the text encoder). For P-based inference, there is at most one prediction per instance, which naturally eliminates the need for non-maximum suppression (NMS).
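The probe-to-feature mapping and the scoring step can be sketched as follows; the integer-division mapping and the stride set are the standard FCOS/CondInst convention, assumed here rather than quoted verbatim:

```python
import numpy as np

def map_probe_to_fpn(px, py, strides=(8, 16, 32, 64, 128)):
    """Map a probing pixel to its feature location on every FPN level via
    integer division by the level stride (a common FCOS-style mapping)."""
    return [(px // s, py // s) for s in strides]

def best_prediction(location_feats, text_vec):
    """Among the kept feature locations, pick the one whose feature vector
    has the highest inner product with the target class's text embedding."""
    return int(np.argmax(location_feats @ text_vec))

print(map_probe_to_fpn(100, 60))
print(best_prediction(np.eye(3), np.array([0.0, 1.0, 0.0])))  # -> 1
```

Because only the single highest-scoring location produces a prediction, no NMS is needed in the probing-based setting.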

In Table 2, we present the mask AP with respect to different values of the centerness hyper-parameter. As shown, AP increases consistently as the probes move closer to the mass center. We also observed that AP decreases dramatically for off-center probes if the mask sampling strategy is not used during training (the second row). We diagnosed the issue and found that some probes away from the instance center produced unsatisfying results, as shown in Figure 7. The reason is that CondInst, by design, only samples positive positions from a small central region of the instance Tian et al. (2019b, 2020), so not all possible probes are properly trained. To address this issue, we instead sample positive positions from the entire ground-truth instance masks. The results are presented in Table 2 (the last row) and Figure 7. As shown, with mask sampling enabled, the results are much more robust to the quality of the probes.
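The difference between central-region sampling and mask sampling can be sketched as follows; the central-region fraction is an illustrative stand-in for CondInst's actual radius-based rule:

```python
import numpy as np

def positive_positions(gt_mask, center_only=False, center_frac=0.2):
    """Choose positive training positions for an instance. CondInst by
    default keeps only a small central region (approximated here by
    `center_frac`); mask sampling treats every ground-truth mask pixel
    as a candidate positive."""
    ys, xs = np.nonzero(gt_mask)
    if not center_only:
        return list(zip(ys.tolist(), xs.tolist()))   # entire mask
    cy, cx = ys.mean(), xs.mean()
    ry = max(1.0, center_frac * (ys.max() - ys.min()))
    rx = max(1.0, center_frac * (xs.max() - xs.min()))
    return [(y, x) for y, x in zip(ys.tolist(), xs.tolist())
            if abs(y - cy) <= ry and abs(x - cx) <= rx]

mask = np.zeros((10, 10), dtype=bool)
mask[1:9, 1:9] = True
print(len(positive_positions(mask)))                    # all 64 mask pixels
print(len(positive_positions(mask, center_only=True)))  # far fewer
```

Training on the full mask is what makes off-center probes behave well at inference time.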

Figure 7: Results of using different probes (from top to bottom): the mass center of the instance, random points on the instance, and handcrafted low-quality probes close to the instance boundaries. Best viewed digitally with color.

Appendix B Details of the CPP Experiments

b.1 Data Statistics

The Cityscapes Panoptic Parts (CPP) dataset Meletis et al. (2020) extends the popular Cityscapes Cordts et al. (2016) dataset with part-level semantic annotations. Part classes belonging to five scene-level classes are annotated in the training and validation images of Cityscapes. Specifically, the two human classes (person, rider) are labeled with 4 parts (torso, head, arm, leg), and the three vehicle classes (car, truck, bus) are labeled with 5 parts (chassis, window, wheel, light, license plate). The CPP dataset provides exhaustive part annotations: each instance mask (belonging to the chosen classes) is completely partitioned into the corresponding part masks. In our experiments, we use all semantic classes (the thing classes among them are used for instance segmentation) and the non-duplicate part classes. Results on the CPP val set are reported.

b.2 HPQ vs. PartPQ in CPP

The only difference between PartPQ and HPQ lies in computing the accuracy of objects that have parts. PartPQ directly averages the mask IoU values of all parts, while HPQ calls for a recursive mechanism. CPP is a two-level dataset (i.e., parts cannot have parts), and all parts are semantically labeled (i.e., no instances are labeled on parts, although some of them, e.g., wheel of car, could be labeled at the instance level). In this scenario, (1) the HPQ of a part (as a leaf node) is directly defined as its mask IoU if the corresponding prediction is a true positive (IoU no smaller than the matching threshold), and zero otherwise; and (2) since each part has only one unit, the recognition of each part yields either a true positive (IoU no smaller than the threshold) or a false positive plus a false negative (IoU smaller than the threshold). That is, the denominator of HPQ is a constant, equal to the number of parts. As a result, the values of HPQ are usually lower than PartPQ (see Tables 3 and 4).
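A toy computation contrasting the two metrics for a single object; the 0.5 matching threshold is the conventional PQ value and is an assumption here:

```python
def object_hpq(part_ious, thr=0.5):
    """HPQ of an object over its parts in a two-level dataset: each part
    contributes its IoU if matched (IoU >= thr) and 0 otherwise, and the
    denominator is the constant number of parts."""
    return sum(iou if iou >= thr else 0.0 for iou in part_ious) / len(part_ious)

def object_partpq(part_ious):
    """PartPQ simply averages all part IoUs, with no per-part threshold."""
    return sum(part_ious) / len(part_ious)

ious = [0.9, 0.6, 0.4]          # the 0.4 part counts as FP + FN under HPQ
print(object_hpq(ious))         # (0.9 + 0.6 + 0) / 3 = 0.5
print(object_partpq(ious))      # (0.9 + 0.6 + 0.4) / 3 ≈ 0.633
```

The unmatched part is zeroed out by HPQ but still contributes to PartPQ, which is why HPQ values run lower.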

b.3 Sampling Part Annotations for Data-efficient Learning

In Table 5, we report the results in data-efficient scenarios, where only a subset of part annotations is retained for training. Specifically, we randomly sample a certain ratio of the annotated part masks in the CPP train set for data-efficient learning. Evaluation was conducted on the val set with complete part annotations.
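The sampling protocol can be sketched as follows; the ratio and seed are illustrative:

```python
import random

def subsample_part_masks(part_masks, ratio, seed=0):
    """Randomly keep a given ratio of part annotations for data-efficient
    training (a sketch of the sampling protocol; the actual ratios are
    those reported in Table 5)."""
    rng = random.Random(seed)
    k = round(len(part_masks) * ratio)
    return rng.sample(part_masks, k)

masks = list(range(100))            # stand-ins for annotated part masks
kept = subsample_part_masks(masks, 0.3)
print(len(kept))  # 30
```

Fixing the seed makes the retained subset reproducible across training runs.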

b.4 More Visualization Results

We provide more visualization results on CPP, as shown in Figure 8. The results are obtained through probing-based inference, where the mass center was used as the probe. Examples are chosen from the CPP val set. The probing pixels are omitted in the figure for simplicity.

Figure 8: Examples of visual recognition results on CPP. Best viewed digitally with color.

Appendix C Details of the ADE20K Experiments

c.1 Data Preparation and Statistics

The ADE20K Zhou et al. (2019) dataset provides pixel-wise annotations for a large number of semantic classes, including instance-level and part-level annotations. SceneParse150 is a widely used subset of ADE20K for semantic segmentation, which consists of training and validation images covering the 150 most frequent classes. For instance segmentation, 100 foreground object classes are chosen from these classes, termed InstSeg100 (see Section 3.4 of Zhou et al. (2019)), though few works have reported results on InstSeg100. As for part segmentation, we found the annotations significantly sparse and incomplete (the statistics here are based on the newest version of ADE20K from the official website). There are a number of non-duplicate part classes belonging to the instance classes (in our definition, only instance classes have part-level annotations). The labeling ratio for part classes is quite low: only a small fraction of instances have part annotations on average, and the ratio is also low for each instance class individually. In our experiments, we keep the part classes whose number of occurrences is no fewer than a fixed threshold, resulting in the part-level classes listed below. We conjecture that the sparse annotation property is the main reason that no prior works have reported quantitative results for part segmentation on this dataset. Results on the ADE20K validation set are reported. For the vocabulary used in the experiments, the semantic and instance classes can be found in the original dataset Zhou et al. (2019), and we additionally list the part classes as follows.

  • instance class name (the number of part classes): [part class names]

  • bed (4): [footboard, headboard, leg, side rail]

  • windowpane (5): [pane, upper sash, lower sash, sash, muntin]

  • cabinet (7): [drawer, door, side, front, top, skirt, shelf]

  • person (13): [head, right arm, right hand, left arm, right leg, left leg, right foot, left foot, left hand, neck, gaze, torso, back]

  • door (9): [hinge, knob, handle, door frame, pane, mirror, window, muntin, door]

  • table (4): [drawer, top, leg, apron]

  • chair (7): [back, seat, leg, arm, stretcher, apron, seat cushion]

  • car (9): [mirror, door, wheel, headlight, window, license plate, taillight, bumper, windshield]

  • painting (1): [frame]

  • sofa (7): [arm, seat cushion, seat base, leg, back pillow, skirt, back]

  • shelf (1): [shelf]

  • mirror (1): [frame]

  • armchair (9): [back, arm, seat, seat cushion, seat base, earmuffs, leg, back pillow, apron]

  • desk (1): [drawer]

  • wardrobe (2): [door, drawer]

  • lamp (9): [canopy, tube, shade, light source, column, base, highlight, arm, cord]

  • bathtub (1): [faucet]

  • chest of drawers (1): [drawer]

  • sink (2): [faucet, tap]

  • refrigerator (1): [door]

  • pool table (3): [corner pocket, side pocket, leg]

  • bookcase (1): [shelf]

  • coffee table (2): [top, leg]

  • toilet (3): [cistern, lid, bowl]

  • stove (3): [stove, oven, button panel]

  • computer (4): [monitor, keyboard, computer case, mouse]

  • swivel chair (3): [back, seat, base]

  • bus (1): [window]

  • light (5): [shade, light source, highlight, aperture, diffusor]

  • chandelier (4): [shade, light source, bulb, arm]

  • airplane (1): [landing gear]

  • van (1): [wheel]

  • stool (1): [leg]

  • oven (1): [door]

  • microwave (1): [door]

  • sconce (5): [shade, arm, light source, highlight, backplate]

  • traffic light (1): [housing]

  • fan (1): [blade]

  • monitor (1): [screen]

  • glass (4): [opening, bowl, base, stem]

  • clock (1): [face]
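The occurrence-based filtering used to build the part vocabulary above can be sketched as follows; the minimum count and the toy annotations are illustrative:

```python
from collections import Counter

def filter_part_classes(occurrences, min_count=10):
    """Keep part classes whose number of annotated occurrences is at least
    `min_count` (the actual threshold used for ADE20K is not reproduced
    here)."""
    counts = Counter(occurrences)
    return sorted(c for c, n in counts.items() if n >= min_count)

anns = ["leg"] * 12 + ["gaze"] * 3 + ["drawer"] * 25
print(filter_part_classes(anns))  # -> ['drawer', 'leg']
```

Rarely annotated parts such as gaze fall below the cutoff, which matches the sparsity discussion in Appendix C.1.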

c.2 More Visualization Results

We provide more visualization results on ADE20K, as shown in Figure 9. The results are obtained through probing-based inference, where the mass center was used as the probe. Examples are chosen from the ADE20K validation set. The probing pixels are omitted in the figure for simplicity.

Figure 9: Examples of visual recognition results on ADE20K. Best viewed digitally with color.

Appendix D More Results of Open-domain Recognition

In Figure 10, we provide the results of more examples (from Fishyscapes) to demonstrate the ability of detecting anomalies. Our approach successfully recognized the anomalies as others in most cases except some failures in the details. For example, the legs of the cat and cow are wrongly recognized as person. In Figure 11, we provide more results of recognizing unseen classes. Interestingly, our approach successfully handled the unseen compound words (e.g., vehicle and human).

Figure 10: Examples of detecting anomalies. Black region corresponds to others. Best viewed digitally with color.
Figure 11: Examples of recognizing unseen classes. Text labels that unseen during training are underlined. Best viewed digitally with color.