FLOAT: Factorized Learning of Object Attributes for Improved Multi-object Multi-part Scene Parsing

03/30/2022
by   Rishubh Singh, et al.
Google

Multi-object multi-part scene parsing is a challenging task which requires detecting multiple object classes in a scene and segmenting the semantic parts within each object. In this paper, we propose FLOAT, a factorized label space framework for scalable multi-object multi-part parsing. Our framework involves independent dense prediction of object category and part attributes, which increases scalability and reduces task complexity compared to the monolithic label space counterpart. In addition, we propose an inference-time 'zoom' refinement technique which significantly improves segmentation quality, especially for smaller objects/parts. Compared to state of the art, FLOAT obtains an absolute improvement of 2.0 mIOU and 4.8 segmentation quality IOU (sqIOU) on the Pascal-Part-58 dataset. For the larger Pascal-Part-108 dataset, the improvements are 2.1 mIOU and 3.9 sqIOU. We incorporate previously excluded part attributes and other minor parts of the Pascal-Part dataset to create the most comprehensive and challenging version, which we dub Pascal-Part-201. On it, FLOAT obtains improvements of 8.6 mIOU and 7.5 sqIOU across a challenging diversity of objects and parts. The code and datasets are available at floatseg.github.io.


1 Introduction

Semantic scene parsing is a foundational image understanding problem in the vision community [zheng2021rethinking, zhao2018icnet, li2020improving, yu2018bisenet, yang2018denseaspp, zhang2018exfuse, yuan2020object]. Typically, the goal is to segment objects and “stuff” regions (e.g. road, background) in the scene. Multi-object multi-part parsing is a significantly more challenging variant which requires part-level segmentation of each scene object [bsanet, gmnet, co-rank]. Compared to traditional object-level segmentation, semantic representations infused with fine-grained part-level knowledge can provide richer information for downstream reasoning tasks including visual question answering [hong2021ptr], perceptual concept learning [DBLP:journals/corr/abs-2111-05251], shape modelling [achlioptas2019shapeglot, dubrovina2019composite] and many others  [dong2014humanparsing, chen2014detect, 10.1007/978-3-642-33718-5_60, DBLP:journals/corr/ZhangDGD14, sun2013learning, krause2015fine].

For part-based object segmentation, some existing approaches tackle the simpler problem of single-object part parsing [gong2018instance, fang2018weakly, wang2015joint, wang2015semantic, haggag2016semantic]. Although a few recent approaches have addressed multi-object multi-part parsing [bsanet, gmnet, co-rank], they consider part labels to be independent and do not take advantage of intra/inter ontological relationships among objects and parts at label level. They also tend to perform poorly on smaller and infrequent parts/categories. To address these shortcomings, we propose FLOAT, a novel factorized label space framework for scalable multi-object multi-part parsing. Our approach is motivated by the following observations:

Observation #1: Object part names in datasets typically consist of a root component and side component(s). Many object categories contain parts with the same root component. For example, the root component of ‘left front leg’ found in horse, cow etc. and ‘right leg’ found in person, is leg. Therefore, parts can be grouped based on their root component.

The example also suggests that object categories whose instances contain shared category-level attributes (e.g. “living things that move”) are likely to contain same root components (such as leg). Using this criterion, some object categories (e.g. cow, person, bird) can be grouped as ‘animate’. Similarly, some categories (e.g. “rigid bodied”) can be grouped as ‘inanimate’. As with the ‘animate’ group, ‘inanimate’ group categories also share many root part components (e.g. ‘wheel’ in aeroplane, bicycle, car).

Observation #2: Similar to Observation #1, parts can also be grouped by side component – e.g. ‘front’ is a side component of ‘front wheel’ found in bike and ‘left front leg’ found in horse.

Factoring the object/part label space in terms of these groups (‘animate’, ‘inanimate’, ‘side’) greatly reduces the effective number of output labels. In turn, this increases scalability in terms of object categories and part cardinality. The design choice (‘factoring’) also enables efficient data sharing when learning semantic representations for grouped parts and improves performance for infrequent classes (see Fig. 1).

A second key feature of our framework is IZR, an inference-time segmentation refinement technique. IZR transforms ‘zoomed in’ versions of preliminary per-object label maps into refined counterparts which are finally composited back onto the segmentation canvas. Apart from the advantage of not requiring additional training, IZR is empirically superior to alternate inference-time schemes and significantly improves segmentation quality, especially for smaller objects/parts.

In existing works, results are reported on simplified, label-merged versions of the original dataset (Pascal-Part [chen2014detect]). In our work, we incorporate previously excluded part attributes and other minor parts to create Pascal-Part-201, the most comprehensive and challenging version of Pascal-Part [chen2014detect]. Along with the standard mean IOU (mIOU) and mAvg scores, we report sqIOU [kirillov2019panoptic] and sqAvg – normalized segmentation quality measures which are less affected by spatial scale of objects and parts.

In summary, our contributions are the following:

  • FLOAT, a novel factorized label space framework for scalable multi-object multi-part parsing (Sec. 3).

  • IZR, an inference-time refinement technique which significantly improves segmentation quality especially for smaller objects/parts in the scene (Sec. 3.4).

  • Pascal-Part-201, the most comprehensive and challenging version of the Pascal-Part [chen2014detect] dataset (Sec. 4). Experimental evaluation demonstrates FLOAT’s superior performance on Pascal-Part-201 relative to existing approaches (Sec. 5).

2 Related Work

Semantic segmentation is a broad area with intensive research; rather than surveying it exhaustively, we focus on the most directly relevant works. A common design pattern for semantic segmentation is the encoder-decoder setup [7803544, zhao2017pyramid, chen2017deeplab, article_123]. In particular, the baselines, the existing approaches and our proposed approach all adopt the popular DeepLab architecture [chen2017deeplab] for various components of the segmentation task pipeline.

Single-Object Multi-Part Parsing has been extensively explored. Existing approaches typically consider object category subsets such as persons [fang2018weakly, liang2018look, liang2016semantic, nie2018mutual, xia2016zoom, xia2017joint, xia2015pose, zhao2017self, gong2018instance, liang2015human, luo2018macro, liu2020hybrid], animals [haggag2016semantic, wang2015semantic, wang2015joint] and vehicles [liang2016semantic, nie2018mutual, song2017embedding, liu2021cgpart]. However, in this setting, most works assume a single object of interest per image.

Figure 2: An overview diagram of our FLOAT framework (Sec. 3). Given an input image, an object-level semantic segmentation network (in blue) generates the object prediction map. Two decoders (in orange) produce object-category-grouped part-level prediction maps for ‘animate’ and ‘inanimate’ objects in the scene. Another decoder (in red) produces part-attribute-grouped prediction maps for ‘left-right’ and ‘front-back’. At inference time (shown by dotted lines), outputs from the decoders are merged in a top-down manner. The resulting prediction is further refined using the IZR technique (see Fig. 3) to obtain the final segmentation map.

Multi-object multi-part parsing is a relatively new and understudied problem [bsanet, gmnet, co-rank]. The approaches of Zhao et al. [bsanet] and Michieli et al. [gmnet] tackle multi-object multi-part parsing by providing object-level feature guidance to the part segmentation network during optimization. Zhao et al. [bsanet] additionally provide boundary-level awareness to features. Tan et al. [co-rank] introduce a semantic co-ranking loss modelling intra- and inter-part relationships. Xiao et al. [xiao2018unified] introduce a composite dataset and an approach for predicting perceptual visual concepts in scenes. However, in contrast to our framework, these approaches report results on simplified (label-merged) versions of standard datasets and empirically exhibit inferior performance for smaller parts.

Factorization: In machine vision applications, early works such as Zheng et al. [DenseObjAtt_CVPR2014] used factorial Conditional Random Field models to separately predict object category, coarse object labels and object attributes such as shape, material and surface type. Other works involve jointly learning object and attribute-related information as a separable latent representation [nagarajan2018attributes] or using graph networks [naeem2021learning]. Misra et al. [misra2017red] propose a factorization over global object attributes and object classifiers to enable compositionality. Other works extend this idea to inter-object relationships, e.g. noun-preposition-noun triplets [malinowski2014pooling, lan2012image, hong2021ptr]. In all these works, a simple global property of the object (e.g. material, texture, color, size, shape) is learnt jointly with the object category information. In their work on panoptic part segmentation, de Geus et al. [de2021part] conduct experiments involving two categories from Pascal-Part-58 with some parts grouped by semantic similarity. Graphonomy, a framework by Lin et al. [lin2020graphonomy], can span multiple datasets with a flat label structure but requires a manually specified graph per category. Such rigid connectivity relationships are unsuitable for modelling highly articulated objects (e.g. animals) found in our setting. To the best of our knowledge, we are the first to show that object parts can be factorized across diverse object categories at scale, and that such factorization significantly improves segmentation performance, in resonance with theories of visual recognition [biederman1987recognition, HOFFMAN198465].

Zooming in on image regions using bounding boxes generated by attention maps [wang2017zoom] and reinforcement learning policies [dong2018reinforced, xu2021adazoom] has been found to improve detection and segmentation. Other works use the technique on object instances for video interpolation [yuan2019zoom-in-to-check] and on part instances for object parsing [xia2016zoom]. Porzi et al. [porzi2021improving] use zoomed-in crops based on object classes for improving panoptic segmentation of high-resolution images. Similar to the latter set of approaches, FLOAT also employs zooming in on object regions. However, our zoom-based refinement does not require any extra training and can be directly used during inference for improved performance.

3 Our framework (FLOAT)

As mentioned earlier, FLOAT’s design leverages the shared-attribute groups that naturally exist within object categories (‘animate’, ‘inanimate’) and part attributes (‘left’, ‘right’, ‘front’, ‘back’) - see Fig. 2. The sections that follow describe how we operationalize the idea. Although our approach is general in nature, we use object categories and part names from the Pascal-Part dataset [chen2014detect] for ease of understanding.

Figure 3: An overview of Inference-time Zoom Refinement (IZR) - Sec. 3.4. During inference, predictions from the object-level network are used to obtain padded bounding boxes for scene objects. The corresponding object crops are processed by the factorized network (Sec. 3). The resulting label maps are composited to generate the final refined part segmentation map. Notice the improvement in segmentation quality relative to the part label map without IZR (included for comparison).

3.1 Relabeling images with factored labels

The original Pascal-Part dataset contains object and part level label maps. We re-label or partition these maps to obtain five new label groups as described below.

object: The label set for this group comprises the unique object category labels. For example, Fig. 2 shows a label map from this group containing person and bicycle objects.

animate: For this group, the label set comprises root components of part labels from the object categories bird, cat, cow, dog, horse, person, sheep. The part labels are pooled across all object categories. For example, a single label leg covers all corresponding part instances from all objects in the ‘animate’ group. This can also be seen in Fig. 2 – the left foot and right foot of person are color-coded the same (‘orange’) and assigned the common label foot.

inanimate: The label set comprises root components of part labels from aeroplane, bicycle, bottle, bus, car, motorbike, pottedplant, train, tv. Note that (i) these categories are disjoint from the ‘animate’ group (see Fig. 2) and (ii) the part label pooling mentioned for ‘animate’ is applicable here as well.

side: In this case, two disjoint label groups exist. One group comprises all part labels which have the words ‘left’ or ‘right’ in their name (e.g. left hand, right wing). Label map regions whose part labels contain ‘left’/‘right’ are considered seed pixels for a flood-fill style procedure which produces the corresponding ‘left’/‘right’ label maps (Fig. 2). The same procedure is used for the label group whose part labels contain the words ‘front’ or ‘back’ (Fig. 2). Appendix A.2 contains a detailed explanation of the flood-fill algorithm.
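The seed-based flood fill can be sketched as a breadth-first traversal over the part region (a minimal hypothetical version; the actual algorithm is detailed in Appendix A.2, and the representation of pixels as coordinate sets is our simplification):

```python
from collections import deque

def flood_fill_sides(region, seeds):
    """Propagate side labels ('left'/'right' or 'front'/'back') from seed
    pixels to the rest of a part region via 4-connected BFS; each pixel
    receives the label of the nearest seed."""
    labels = dict(seeds)                      # {(y, x): 'left', ...}
    queue = deque(seeds.items())
    while queue:
        (y, x), side = queue.popleft()
        for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
            if (ny, nx) in region and (ny, nx) not in labels:
                labels[(ny, nx)] = side
                queue.append(((ny, nx), side))
    return labels
```

For instance, a 1x4 strip of pixels with a ‘left’ seed at one end and a ‘right’ seed at the other is split evenly between the two labels.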

Broadly, object parts from living things that move are in the ‘animate’ group while other parts, typically from rigidly shaped non-living things, are in the ‘inanimate’ group. As mentioned before, such grouping enables data-efficient representation learning for common parts (e.g. torso in ‘animate’ group). A similar reasoning holds for ‘side’ directional grouping ({‘left’, ‘right’}, {‘front’,‘back’}).
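As an illustration of the grouping above, a raw Pascal-Part label can be split into its factors with a few lines of string handling (a sketch; the category lists follow Sec. 3.1, while the function name and return convention are our assumptions):

```python
# Object categories per group, as listed in Sec. 3.1.
ANIMATE = {"bird", "cat", "cow", "dog", "horse", "person", "sheep"}
INANIMATE = {"aeroplane", "bicycle", "bottle", "bus", "car",
             "motorbike", "pottedplant", "train", "tv"}
SIDES = {"left", "right", "front", "back"}

def factorize(obj, part):
    """Split a part label into (group, root, left/right, front/back) factors."""
    tokens = part.split()
    lr = next((t for t in tokens if t in ("left", "right")), None)
    fb = next((t for t in tokens if t in ("front", "back")), None)
    root = " ".join(t for t in tokens if t not in SIDES)
    group = "animate" if obj in ANIMATE else "inanimate"
    return group, root, lr, fb
```

Under this factoring, ‘horse left front leg’ and ‘person right leg’ both map to the shared root leg, which is what enables the data sharing discussed above.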

3.2 Factorized semantic segmentation architecture

We configure the segmentation architecture to output the factorized label maps described in the previous section. As Fig. 2 shows, we employ two semantic segmentation networks, one for object-level and the other for part-level label maps. The object-level network outputs the object prediction map. The part-level network consists of a shared encoder and three decoders: the ‘animate’ decoder, which outputs the ‘animate’ label map; the ‘inanimate’ decoder, which outputs the ‘inanimate’ label map; and the ‘side’ decoder, which outputs the ‘left/right’ and ‘front/back’ label maps. The outputs from the object-level and part-level networks are merged at inference time. We describe this merging process next.
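The wiring of the shared encoder and the three decoders can be summarized as follows (framework-agnostic sketch; the paper builds these components from DeepLab-v3, and the dictionary keys are our naming):

```python
def factorized_forward(image, encoder, decoders):
    """One forward pass of the part-level network: a shared encoder feeds
    the 'animate', 'inanimate' and 'side' decoders."""
    feats = encoder(image)
    outputs = {
        "animate": decoders["animate"](feats),
        "inanimate": decoders["inanimate"](feats),
    }
    # The single 'side' decoder emits both directional maps.
    outputs["left_right"], outputs["front_back"] = decoders["side"](feats)
    return outputs
```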

3.3 Top-Down Merge

To combine the factorized label maps output by the segmentation architecture (see Fig. 2), we adopt a top-down merging strategy. For each object (e.g. bicycle) in the object prediction map, we examine the labels of corresponding pixel locations in the part-level label maps. Depending on the type of object (‘animate’ or ‘inanimate’), the corresponding label regions are copied to the scene-level prediction canvas (e.g. for bicycle, the considered labels would be wheel, chainwheel, handlebar, headlight, saddle). Similarly, the object-level map’s pixel locations are referenced from the ‘side’ label maps (‘left’/‘right’ and ‘front’/‘back’). In case of conflicts, the prediction defaults to background. The corresponding label regions are copied to the scene prediction canvas. A detailed explanation of top-down merging can be found in Appendix A.1.
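On a toy label map, the top-down merge amounts to routing each object's pixels to the matching part branch (a sketch with made-up label ids; conflict handling and the ‘side’ maps are elided):

```python
import numpy as np

ANIMATE_IDS = {1}      # e.g. 1 = person (illustrative ids)
INANIMATE_IDS = {2}    # e.g. 2 = bicycle

def top_down_merge(obj_map, animate_map, inanimate_map):
    """Copy part labels onto the canvas from the branch that matches
    each pixel's predicted object class; background (0) stays background."""
    canvas = np.zeros_like(obj_map)
    for o in np.unique(obj_map):
        sel = obj_map == o
        if o in ANIMATE_IDS:
            canvas[sel] = animate_map[sel]
        elif o in INANIMATE_IDS:
            canvas[sel] = inanimate_map[sel]
    return canvas
```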

In the next section, we describe how the resulting prediction map is refined using a per-object ‘zooming’ technique.

3.4 Inference-time Zoom Refinement (IZR)

The Inference-time Zoom Refinement (IZR) technique improves segmentation quality by ‘zooming’ into each scene object. As the first step, the input image is processed by the object-level network to obtain the object-level map (see Fig. 3). The bounding box corresponding to each object component is then padded so that the object is centered and the aspect ratio is preserved (Fig. 3). Image crops corresponding to the padded bounding box extents are then obtained. Note that the padding enables scene context to be included for each cropped object and also helps account for inaccuracies in the object map prediction. The cropped object images are then processed by FLOAT’s factorized network to obtain the corresponding part-level label maps. These label maps are then composited to generate the final refined segmentation map. In the next two sections, we describe the optimizer formulation for the networks in FLOAT and implementation details.
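A minimal sketch of IZR for a single object follows (the real pipeline also resizes each crop to the network's input resolution and handles multiple objects; `part_net` and `pad_frac` are our placeholder names for any per-pixel labeller and the padding ratio):

```python
import numpy as np

def padded_bbox(mask, pad_frac=0.2):
    """Bounding box of a binary object mask, grown by pad_frac per side
    and clipped to the image; the padding keeps some scene context."""
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    ph, pw = int((y1 - y0) * pad_frac), int((x1 - x0) * pad_frac)
    H, W = mask.shape
    return max(0, y0 - ph), min(H, y1 + ph), max(0, x0 - pw), min(W, x1 + pw)

def izr_single(image, obj_mask, part_net, pad_frac=0.2):
    """Zoom into the padded object crop, re-run the part network on it,
    and composite the refined labels back onto an empty canvas."""
    y0, y1, x0, x1 = padded_bbox(obj_mask, pad_frac)
    crop_pred = part_net(image[y0:y1, x0:x1])
    canvas = np.zeros(obj_mask.shape, dtype=crop_pred.dtype)
    sub_mask = obj_mask[y0:y1, x0:x1]
    canvas[y0:y1, x0:x1][sub_mask] = crop_pred[sub_mask]
    return canvas
```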

3.5 Optimization

We train the object model (Sec. 3.2) using the standard per-pixel cross-entropy loss. For training the part-level model, we use a combination of the cross-entropy loss (L_CE) and the graph matching loss (L_GM) [gmnet]. The cross-entropy loss is applied to each of the 4 output part-level maps, i.e. the ‘animate’, ‘inanimate’, ‘left/right’ and ‘front/back’ maps (see Fig. 2).

The graph matching loss [gmnet] captures proximity relationships between part pairs within the map and scores the matching of these pairs between the ground truth and the predicted map. The degree of proximity between a part pair is represented by the number of pixels in one part situated T pixels or less from the other part, where T is an empirically set threshold. For efficiency, the pairwise proximity map is approximated by dilating each part mask by T and computing the intersecting region. The ground truth proximity matrix A^GT (and similarly the predicted matrix A^pred) is formally defined as A^GT_ij = |(M_i ⊕ T) ∩ (M_j ⊕ T)|, where A^GT_ij is the proximity between the i-th and j-th parts, M_i and M_j are the respective part masks, ⊕ is the morphological 2D dilation operator and |·| is the cardinality of the given set. A row-wise normalization is applied to the proximity matrix: Â_ij = A_ij / Σ_k A_ik. The graph matching loss is computed as the Frobenius norm between the two adjacency matrices: L_GM = ‖Â^GT − Â^pred‖_F.

Additionally, for the ‘animate’ and ‘inanimate’ branches, a composite foreground-background binary cross-entropy loss L_aux serves as extra guidance. The loss for the part-level network is a weighted combination of the losses for all part branches: L_part = Σ_b w_b · L_b, where b ranges over the ‘animate’, ‘inanimate’, ‘left/right’ and ‘front/back’ branches and the w_b are scalar weights.
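The dilation-based proximity matrix and the resulting graph matching term can be sketched in a few lines (numpy-only sketch following the description above; the naive square-element dilation and all function names are our assumptions):

```python
import numpy as np

def dilate(mask, t):
    """Naive binary dilation with a (2t+1)x(2t+1) square structuring element."""
    padded = np.pad(mask, t)
    H, W = mask.shape
    out = np.zeros_like(mask)
    for dy in range(2 * t + 1):
        for dx in range(2 * t + 1):
            out |= padded[dy:dy + H, dx:dx + W]
    return out

def proximity_matrix(part_masks, t=2):
    """A[i, j] = |(M_i dilated by t) & (M_j dilated by t)|, row-normalized."""
    dilated = [dilate(m, t) for m in part_masks]
    n = len(part_masks)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                A[i, j] = np.count_nonzero(dilated[i] & dilated[j])
    rows = A.sum(axis=1, keepdims=True)
    return np.divide(A, rows, out=np.zeros_like(A), where=rows > 0)

def graph_matching_loss(gt_masks, pred_masks, t=2):
    """Frobenius norm between ground-truth and predicted proximity matrices."""
    return np.linalg.norm(proximity_matrix(gt_masks, t) - proximity_matrix(pred_masks, t))
```

A prediction that preserves which parts touch which has zero loss; moving a part away from its neighbours changes the proximity structure and is penalized.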

3.6 Implementation and Training Details

For fair comparison with previous works [bsanet, gmnet, co-rank], we employ the DeepLab-v3 [chen2017deeplab] architecture with an ImageNet pre-trained ResNet-101 [he2016deep] as the encoder (backbone) and follow the same training scheme and augmentations. During training, images are randomly left-right flipped and scaled relative to the original resolution with bilinear interpolation. Results at the testing stage are reported at the original image resolution. The threshold T employed for the proximity matrix (Sec. 3.5) is set empirically, as are the batch size and the weight for the graph matching loss relative to the cross-entropy loss. The model is trained for 40K steps with the base learning rate decreased according to a polynomial decay rule, together with weight decay regularization. We use 2 NVIDIA A100 GPUs, each with 40GB GPU memory, to train our models and run experiments. Full computational and memory requirements can be found in Appendix C.

4 Datasets and Evaluation Metrics

Figure 4: An illustration of labelling granularity in different versions of the Pascal-Part dataset. Pascal-Part-108 [gmnet] adds smaller parts (e.g. eyes, ears) to Pascal-Part-58 [bsanet]. Our newly introduced Pascal-Part-201 further adds directional information to parts as appropriate (e.g. {‘left’,‘right’} to eyes, ears; {‘front’,‘back’} to legs).

Pascal-Part: For experiments, we use Pascal-Part [chen2014detect], which is currently the largest multi-object multi-part parsing dataset. It contains variable-sized images with pixel-level part annotations on the 20 Pascal VOC2010 [everingham2010pascal] semantic object classes (plus the background class). We use the original split from Pascal-Part with 4,998 images for training and 5,105 images in the publicly provided validation set for testing.

Pascal-Part-58/108: For comparison with previous work, we use the datasets Pascal-Part-58 [bsanet] and Pascal-Part-108 [gmnet]. Both Pascal-Part variants simplify the original semantic classes by grouping some parts together, and contain 58 and 108 part classes respectively. Pascal-Part-58 mostly contains large parts of objects such as head, torso, leg etc. for animals and body, wheel etc. for non-living objects. Pascal-Part-108 is more challenging and additionally contains relatively smaller parts (e.g. eye, neck, foot etc. for animals and roof, door etc. for non-living objects).

Pascal-Part-201: We incorporate part attributes (‘left’, ‘right’, ‘front’, ‘back’, ‘upper’, ‘lower’) and other minor parts (e.g. eyebrow) excluded in both the mentioned variants (58/108) to create the most comprehensive and challenging version of the dataset, containing 201 part classes, which we dub Pascal-Part-201. We observed that the original part labelling scheme in Pascal-Part leaves large chunks of an object’s pixels unlabelled for the bike, motorbike and tv categories, leading to disconnected objects. To address this, we add a body part annotation for bike and motorbike, and a frame part for tv. An example illustrating the differences in part labelling and granularity of the Pascal-Part variants can be seen in Fig. 4.

Model | bgr aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train TV | mIOU mAvg
Baseline | 91.0 31.6 47.7 24.3 56.7 46.4 31.0 36.7 24.2 35.6 17.5 38.6 27.3 20.7 38.0 26.9 50.8 13.3 42.1 14.7 57.6 | 26.3 36.8
GMNet [gmnet] | 90.8 26.6 33.1 21.2 55.0 43.5 24.6 27.5 21.7 35.5 15.1 40.3 25.0 17.5 31.9 21.9 44.2 11.9 43.3 14.0 53.2 | 22.5 33.2
BSANet [bsanet] | 91.2 34.6 41.7 27.9 61.2 51.7 34.1 38.1 26.1 35.4 24.0 43.6 28.4 23.0 37.4 27.7 54.7 14.3 40.4 17.8 59.4 | 28.5 38.7
FLOAT | 92.5 36.7 49.7 34.4 75.3 51.4 35.8 42.0 37.8 59.6 35.5 58.2 41.0 34.0 40.2 40.8 52.2 28.5 69.0 15.1 56.1 | 37.1 46.9

Model | bgr aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train TV | sqIOU sqAvg
Baseline | 89.6 28.9 39.3 17.1 57.4 32.3 27.1 26.0 20.5 39.8 14.8 34.7 22.7 17.2 31.5 19.2 34.9 10.8 52.6 14.4 53.8 | 21.5 32.6
GMNet [gmnet] | 89.4 20.7 23.5 12.6 53.1 25.8 19.3 17.2 18.1 38.2 11.2 35.2 15.9 14.2 25.4 13.8 26.9 8.5 52.0 13.8 46.9 | 16.9 27.7
BSANet [bsanet] | 89.9 30.7 33.5 18.6 60.2 31.2 29.2 26.4 21.2 37.8 17.5 38.0 22.3 17.8 31.2 18.2 33.6 10.8 47.2 17.5 55.4 | 22.1 32.8
FLOAT | 90.8 32.5 41.8 24.5 63.9 36.1 30.4 29.9 33.0 50.8 28.1 47.6 35.6 26.1 33.6 29.9 34.5 20.6 69.0 13.6 56.8 | 29.6 39.5
Table 1: Category-wise results for Pascal-Part-201. FLOAT outperforms competing methods by large margins w.r.t mIOU (top) and sqIOU (bottom).

4.1 Evaluation Metrics

For performance evaluation, we use two versions of Intersection over Union (IOU) metric. We first describe mIOU and mAvg, the standard segmentation quality metrics reported for the problem setting. We then describe balanced variants of these metrics – sqIOU and sqAvg.

mIOU: Let P_ip and G_ip be the prediction and ground truth masks respectively for the p-th part in the i-th image. Suppose the dataset contains N images. The mIOU for the part (mIOU_p) is calculated as:

mIOU_p = ( Σ_{i=1..N} 1_ip · |P_ip ∩ G_ip| ) / ( Σ_{i=1..N} 1_ip · |P_ip ∪ G_ip| )    (1)

where 1_ip is the indicator function for part p being present in image i (i.e. the summations are performed only over images where part p is present). The mIOU for the dataset is then calculated as mIOU = (1/K) Σ_p mIOU_p, where K is the number of part categories (classes) in the dataset (58/108/201).

mAvg: The mIOU score for an object category o is the average of its per-part scores, i.e. mIOU_o = (1/n_o) Σ_{p∈o} mIOU_p, where n_o is the number of unique part labels in object category o. Finally, mAvg is calculated as mAvg = (1/C) Σ_o mIOU_o, where C is the number of object categories (C = 20 for the Pascal-Part datasets).

sqIOU: This is a modified version of the Segmentation Quality (SQ) metric [kirillov2019panoptic] tailored for semantic segmentation. The sqIOU for the p-th part is calculated as:

sqIOU_p = (1/N_p) Σ_{i=1..N} 1_ip · ( |P_ip ∩ G_ip| / |P_ip ∪ G_ip| )    (2)

where N_p = Σ_i 1_ip is the number of images containing part p.
Figure 5: Toy example comparing mIOU and sqIOU with two images from a toy-person category containing parts head and torso. ‘Red’ and ‘blue’ represent ground-truth areas; ‘pink’ and ‘green’ represent prediction overlap areas. mIOU fails to reflect the bad segmentation of the head in one of the images, while sqIOU is fairer.

The calculation for sqAvg is similar to that of mAvg. Due to their formulation, mIOU and mAvg [gmnet, bsanet] tend to be dominated by contributions from bigger instances (informally, an instance is deemed “big” if it is among the largest instances for an object part category by area). In contrast, sqIOU and sqAvg weight parts of all sizes equally – compare Eqn. 1 and 2, and also see the toy example in Fig. 5. Therefore, sqIOU and sqAvg can be considered a more ‘fair’ measure for segmentation quality.
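The difference between the pooled and per-image-averaged formulations is easy to see on a toy pair of images (a sketch under the reading that mIOU pools intersections/unions over the dataset while sqIOU averages per-image IOUs; function names are ours):

```python
import numpy as np

def miou_part(preds, gts):
    """Dataset-pooled IOU: intersections and unions are summed over all
    images before dividing, so big instances dominate."""
    inter = sum(np.count_nonzero(p & g) for p, g in zip(preds, gts))
    union = sum(np.count_nonzero(p | g) for p, g in zip(preds, gts))
    return inter / union

def sqiou_part(preds, gts):
    """Per-image IOUs averaged, so every instance counts equally."""
    ious = [np.count_nonzero(p & g) / np.count_nonzero(p | g)
            for p, g in zip(preds, gts)]
    return sum(ious) / len(ious)
```

With one perfectly segmented 100-pixel part and one fully missed 4-pixel part, the pooled score stays near 0.96 while the per-image average drops to 0.5, mirroring the toy example in Fig. 5.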

5 Experimental Results

For evaluation, we compare the performance of FLOAT with BSANet [bsanet], GMNet [gmnet] and CO-Rank [co-rank]. As a baseline, we train a DeepLab-v3 [chen2017deeplab] model with independently paired object category and associated part names (e.g. cow left eye, cow right ear) as labels. BSANet and CO-Rank report results on Pascal-Part-58 while GMNet additionally reports results on Pascal-Part-108. We report results on all variants of the Pascal-Part dataset, including our newly introduced Pascal-Part-201. To enable comparison, we train GMNet and BSANet on our dataset, Pascal-Part-201. For evaluation, we employ the mIOU, mAvg and sqIOU, sqAvg metrics described previously (Sec. 4.1). In addition, we analyze the relative contribution of various components in FLOAT via ablation studies. Full results table can be found in Appendix F.

5.1 Pascal-Part-201

Table 1 shows the category-wise and overall performance on Pascal-Part-201. Overall, we see that FLOAT outperforms baselines and existing approaches by a significantly large margin. We obtain large gains of 10.8% on mIOU and 8.1% on sqIOU relative to the baseline. We outperform the next best method BSANet [bsanet] by large margins of 8.6% on mIOU and 7.5% on sqIOU as well.

Empirically, we obtain significant sqIOU gains of 10%-30% on small parts – e.g. left/right eye, left/right ear, left/right horn of ‘animate’ categories such as bird, cat, cow. For ‘inanimate’ categories (e.g. bus, car, aeroplane), we obtain sqIOU improvements in the range of 5%-11% on small parts such as front/back plate and left/right wing. The performance improvement is similarly substantial for most parts containing side components (‘left/right’ or ‘front/back’).

5.2 Pascal-Part-58 and Pascal-Part-108

Method Dataset mIOU mAvg sqIOU sqAvg
Baseline 58 54.3 55.4 46.0 48.4
BSANet[bsanet] 58.2 58.9 49.3 51.5
GMNet[gmnet] 59.0 61.8 49.4 54.3
CO-Rank[co-rank] 60.7 60.6 - -
FLOAT 61.0 64.2 54.2 57.1
Baseline 108 41.3 43.6 32.2 36.1
BSANet[bsanet] 45.9 48.4 36.6 41.0
GMNet[gmnet] 45.8 50.5 35.8 41.9
FLOAT 48.0 53.0 40.5 45.6
Table 2: Results on Pascal-Part-58 and Pascal-Part-108: FLOAT outperforms the baseline and other existing methods on mIOU, and with a significant gap on sqIOU. Missing CO-Rank entries are due to an incomplete official codebase and missing details in the paper.
Figure 6: Qualitative comparison on Pascal-Part-201. We observe that FLOAT gets small object parts right – person in the upper image, cat in the middle image. FLOAT also gets left-right and front-back correct – leg(s) of dog and cat, side of car, wheel of bike.
Method | Dataset | Output Heads | No Factorization | Object | Part | Anim/Inanim | Side | Inference Augmentation | mIOU sqIOU
Baseline 58 58 - 54.3 46.0
45 - 60.7 51.5
45 - 60.9 51.7
FLOAT 45 - IZR 61.0 54.2
Baseline 108 108 - 41.3 32.2
68 - 46.1 36.7
68 - 47.8 38.4
FLOAT 68 - IZR 48.0 40.5
Baseline 201 201 26.3 21.5
119 29.1 22.8
119 31.3 24.1
80 36.9 27.8
* 80 ✓* 36.9 27.6
+ RCZ 80 RCZ 36.6 28.0
FLOAT 80 IZR 37.1 29.6
Table 3: Ablation study: Starting from baseline with no factorization at all, we see that systematically adding components of FLOAT pipeline noticeably improves segmentation quality. is combined decoder for all part-level labels, FLOAT (see Fig. 2) is the proposed model. RCZ stands for Random Crop Zoom (see Sec. 5.3). The * indicates separate decoders for ‘left/right’ and ‘front/back’. ‘Output heads’ – total number of output channels of a model. ‘No factorization’ – parts are labelled with concatenated category and associated part name. ‘Object’ – predicting object labels separately.

We also show results on previously proposed datasets Pascal-Part-58 [bsanet] and Pascal-Part-108 [gmnet]. As shown in Table 2, FLOAT framework achieves the best performance on both these datasets. In terms of mIOU, we outperform CO-Rank [co-rank] by 0.3% on Pascal-Part-58 and GMNet [gmnet] by 2.0%. In terms of sqIOU, we outperform other methods by large margins as well – 4.8% over GMNet and 4.9% over BSANet. A similar trend is seen for Pascal-Part-108 with large improvements of 2.1% on mIOU and 3.9% on sqIOU over the next best method BSANet [bsanet].

Overall, the results across existing and challenging new variants of the Pascal-Part dataset demonstrate the strengths of our factorized label space setup. In particular, the increasing gains with increasing dataset complexity demonstrate the superior scaling capacity of the FLOAT framework.

5.3 Ablation Studies

We perform multiple experiments with ablative variants of FLOAT to verify the effectiveness of our design choices. From the results in Table 3, we see that starting from the baseline (first row for each dataset variant), systematically adding components of the FLOAT pipeline noticeably improves segmentation quality. The gains are most apparent for the Pascal-Part-201 dataset, particularly when factorized components are included. From the last two rows, we also see that IZR is a superior choice compared to Random Crop Zoom (RCZ), a variant which uses random crops whose cardinality matches the number of objects in the scene. Some part names in the original Pascal-Part dataset [chen2014detect] contain the side component ‘upper/lower’. We attempted to train a FLOAT variant with these components as outputs of the ‘side’ decoder. However, the model failed to converge. We hypothesize this is due to the drastically smaller quantum of training data compared to the other side attributes, i.e. ‘left/right’ and ‘front/back’.

5.4 Qualitative Analysis

Fig. 6 shows qualitative comparisons of our framework with existing approaches on Pascal-Part-201, reflecting the improvements we observe on the mIOU and sqIOU metrics (Table 1). FLOAT is visually superior at segmenting smaller object parts: notice the significantly improved segmentation of parts in the object categories person (first row) and cat (second row). From the examples, we see that FLOAT is also better at learning directionality (‘left/right’, ‘front/back’). Similar improvements are evident in the examples provided in Figure 1 (Appendix E contains additional examples). Limitations of FLOAT include missing predictions for the smallest parts (e.g. the eyes of people far from the camera) and partial predictions for thin parts, leading to disconnections.

6 Conclusion

FLOAT is a simple but effective framework for improving semantic segmentation performance in multi-object multi-part parsing. Our idea of a factorized label space is a key contribution which takes full advantage of the intra/inter-label ontological relationships among objects and parts. The factorization not only enables scalability in terms of both object categories and part labels, but also improves segmentation performance substantially. Another key contribution is our inference-time zoom. By focusing only on object-centric regions of interest, IZR efficiently enhances segmentation quality without requiring explicit object feature guidance or other modifications to the part network setup. Beyond the framework, we introduce a new variant of Pascal-Part called Pascal-Part-201, which constitutes the most challenging benchmark dataset for the problem. Our experimental evaluation, using fairer versions of existing measures, shows that FLOAT clearly outperforms existing state-of-the-art approaches on both existing and newly introduced Pascal-Part variants. The gains from our framework increase with part and object dataset complexity, empirically supporting our assertion of FLOAT’s scalability. Although presented in a 2D scene parsing setting, we expect ideas from FLOAT to be useful for the 3D scene parsing counterpart and, in general, for scenarios with appropriately factorizable attributes.

References

Appendix A Algorithm details

A.1 Top Down Merge

The flowchart on the following page describes the “Top Down Merge” algorithm, applied per pixel to obtain that pixel's final label (aggregating across the image gives the final prediction). As described in the paper, each label consists of an object, a root part component and side component(s). For FLOAT, these are determined separately and merged to obtain the final label at each pixel. For each pixel:

  1. We obtain the predicted object category. We now have an “object” label.

  2. Choose the part from the animate part map or the inanimate part map depending on the object category. We now have an “object part” label.

  3. We now add side components:

    1. Animate:

      1. For animate categories, a part can have both left/right and front/back labels.

      2. Depending on which side components the “object part” needs to match the original label space, the same are added from the left/right and front/back side maps.

      3. To ensure each pixel has both a left/right and a front/back label, we ignore the background category prediction when taking the softmax.

    2. Inanimate:

      1. For inanimate categories, a part can have only one of the left/right/front/back labels.

      2. We compute the combined Left-Right-Front-Back (LRFB) map by combining the Left-Right (LR) and Front-Back (FB) maps using confidence (softmax) values.

      3. If the “object part” needs a side component, it is added from the LRFB map.

Hence, we obtain all components required for predicting the final label at each pixel: “Object L/R F/B Part” for animate objects and “Object L/R/F/B Part” for inanimate objects.
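The per-pixel merge above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: `merge_pixel`, `needs_side`, and the string encoding of side requirements are assumed names, and the maps are taken to be already argmax'd into discrete predictions.

```python
# Animate object categories with parts (7 classes, per Appendix B).
ANIMATE = {"bird", "cat", "cow", "dog", "horse", "person", "sheep"}

def merge_pixel(obj, part_a, part_i, lr, fb, lrfb, needs_side):
    """Combine factorized predictions into one label string for a pixel.

    obj: predicted object category; part_a / part_i: predictions from
    the animate / inanimate part maps; lr, fb: animate side predictions;
    lrfb: combined side prediction for inanimate objects.
    needs_side maps (obj, part) -> the side components the original
    label space expects: "lr", "fb", "lr+fb", "lrfb", or None.
    """
    animate = obj in ANIMATE
    # Step 2: pick the part map matching the object's group.
    part = part_a if animate else part_i
    side = needs_side.get((obj, part))
    if side is None:                      # part carries no side component
        return f"{obj} {part}"
    if animate:                           # Step 3.1: up to two side labels
        pieces = []
        if "lr" in side:
            pieces.append(lr)             # 'left' or 'right'
        if "fb" in side:
            pieces.append(fb)             # 'front' or 'back'
        return f"{obj} {' '.join(pieces)} {part}"
    # Step 3.2: inanimate parts take a single side from the LRFB map.
    return f"{obj} {lrfb} {part}"
```

Applying this at every pixel and stacking the results reproduces the monolithic label space from the factorized outputs.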

A.2 Flood Fill for Side Component Ground Truths

As an approximation to a breadth-first-search-style flood fill for generating ground truths, we compute the side component label for each pixel by assigning it the label of the closest labelled pixel in the map without flood fill.

Assume, for an object, that the original left-right map is LR_org and the map we want to compute is LR_fill. The 0-1 mask of the object under consideration is obj_mask. The Python snippet for computing LR_fill given LR_org and obj_mask is given in Figure 8 (FB_fill can be computed from FB_org in the same way):
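The snippet survives only as a figure in the source; the following is a minimal NumPy sketch of the same nearest-label assignment, using a brute-force nearest-seed search (the paper's actual code may be implemented differently, e.g. with a distance transform):

```python
import numpy as np

def fill_side_map(side_org, obj_mask):
    """Give every pixel the side label of its nearest labelled pixel.

    side_org: (H, W) integer map with 0 = unlabelled (e.g. LR_org);
    obj_mask: (H, W) 0-1 object mask. Brute-force O(H*W*N) search over
    the N labelled "seed" pixels; assumes at least one seed exists.
    """
    ys, xs = np.nonzero(side_org)                # labelled seed pixels
    labels = side_org[ys, xs]
    H, W = side_org.shape
    gy, gx = np.mgrid[0:H, 0:W]
    # Squared distance from every pixel to every seed: shape (H, W, N).
    d2 = (gy[..., None] - ys) ** 2 + (gx[..., None] - xs) ** 2
    nearest = labels[np.argmin(d2, axis=-1)]     # closest seed's label
    return nearest * obj_mask                    # restrict to the object
```

For large masks, `scipy.ndimage.distance_transform_edt` with `return_indices=True` computes the same nearest-seed assignment in near-linear time.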

A.3 Illustration of Factorization described in Introduction

Objects are split into animate and inanimate groups. The parts in each group share root components, which are merged to form the label set for part prediction for each group of objects. See Figure 9 for a pictorial illustration.

Figure 9:
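The factorization can also be illustrated with a toy label parser. This sketch is illustrative only: `factorize` and the side vocabulary are assumed names, the label strings follow the Pascal-Part-201 naming used in the tables below, and multi-word object names (e.g. 'dining table') are not handled.

```python
# Side components predicted by the dedicated side decoders.
SIDES = {"left", "right", "front", "back"}

def factorize(label):
    """Split a monolithic label into (object, side components, root part).

    e.g. "cow left front upper leg" -> object "cow", sides
    ("left", "front"), root part "upper leg". Root parts are what the
    shared per-group part decoder predicts; sides come from the LR/FB
    decoders.
    """
    obj, *rest = label.split()
    sides = tuple(w for w in rest if w in SIDES)
    root = " ".join(w for w in rest if w not in SIDES)
    return obj, sides, root
```

Deduplicating the root parts across all labels of a group yields the much smaller label set that the group's part decoder actually has to predict.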

Appendix B Animate/Inanimate object group split

There are 7 animate and 10 inanimate object categories with parts in total. See Figure 10 for the group split.

Figure 10:

Appendix C Memory and Compute

Table 4 summarizes the compute requirements for various model and dataset configurations. Despite a somewhat larger number of parameters compared to other models, FLOAT trains faster and provides significant segmentation performance gains.

Method Dataset Params (M) Train Time (mins) Test Time (secs)
BSANet 58 63.9 40.2 0.45
GMNet 58 124.9 33.6 0.49
FLOAT (45) 58 135.4 30.1 1.02 (0.55)
BSANet 108 63.9 43.7 0.72
GMNet 108 124.9 37.2 0.75
FLOAT (68) 108 135.4 34.8 1.38 (0.80)
BSANet 201 64.0 47.1 1.30
GMNet 201 124.9 40.3 1.35
FLOAT (80) 201 153.6 38.6 2.14 (1.43)

Table 4: Compute comparison of FLOAT with previous methods. Train time is per epoch; test time is per instance (batch size 5). Total output heads for FLOAT are given in brackets under the method name; test times in brackets for FLOAT are without IZR.

Appendix D Limitations

  • Partial predictions of objects with only a few parts visible in the scene.

  • Poor predictions around complicated boundaries, e.g. a rider on a bicycle.

  • Missing some very small/obscure objects in an image.

  • Missing predictions for objects with bad lighting or extremely varying shapes.

Appendix E Results

E.1 Part-58

Figure 11:
Figure 12:

E.2 Part-108

Figure 13:
Figure 14:

E.3 Part-201

Figure 15:
Figure 16:

Appendix F Pascal-Part Results

F.1 Pascal-Part-58 mIOU comparison

Part Baseline BSANet GMNet FLOAT
Background 90.1 91.6 92.7 92.9
Aeroplane Body 65.3 69.5 69.6 70.2
Aeroplane Engine 24.9 28.9 25.7 29.7
Aeroplane Wing 33.9 37.3 34.2 37.6
Aeroplane Stern 56.3 57.2 57.2 58.3
Aeroplane Wheel 43.8 51.8 46.8 45.7
Bicycle Wheel 77.8 76.9 81.3 81.2
Bicycle Body 48.4 51.1 51.5 53.0
Bird Head 64.6 69.9 71.1 73.7
Bird Wing 34.1 40.6 38.6 41.6
Bird Leg 28.9 34.7 28.7 30.1
Bird Torso 65.5 71.4 69.5 69.2
Boat 54.4 60.4 70.0 75.3
Bottle Cap 32.7 31.5 33.9 31.9
Bottle Body 68.8 73.7 77.6 73.8
Bus Window 72.7 74.9 75.4 76.7
Bus Wheel 55.3 56.1 58.1 61.0
Bus Body 74.8 77.5 79.9 80.3
Car Window 63.6 68.5 64.8 70.7
Car Wheel 64.8 69.1 70.3 73.7
Car Light 46.2 54.0 48.4 54.6
Car Plate 0.0 0.0 0.0 0.0
Car Body 72.1 77.4 77.6 79.0
Cat Head 80.2 84.0 83.8 85.7
Cat Leg 48.6 49.8 49.4 51.7
Cat Tail 41.2 45.8 46.0 45.5
Cat Torso 70.3 72.3 73.8 73.6
Chair 35.4 35.6 51.4 59.6
Cow Head 74.3 78.7 80.7 80.9
Cow Tail 0.0 0.6 8.1 18.3
Cow Leg 46.1 54.2 53.5 57.0
Cow Torso 67.9 76.4 77.1 76.7
Dining Table 43.0 43.1 51.3 58.2
Dog Head 78.7 85.1 85.0 84.3
Dog Leg 48.1 54.4 53.8 53.8
Dog Tail 27.1 33.6 31.4 37.3
Dog Torso 63.6 67.3 68.0 67.3
Horse Head 74.7 77.2 73.9 81.6
Horse Tail 47.0 52.0 50.4 52.2
Horse Leg 55.2 60.9 59.3 60.3
Horse Torso 71.3 74.6 73.9 77.2
Motorbike Wheel 72.9 72.3 73.5 76.3
Motorbike Body 64.1 73.2 74.3 75.0
Person Head 82.5 84.9 84.7 84.2
Person Torso 65.3 67.5 67.0 68.6
Person Lower Arm 46.9 51.1 48.6 51.8
Person Upper Arm 51.5 52.7 52.4 54.6
Person Lower Leg 38.6 42.0 40.2 42.4
Person Upper Leg 43.8 46.3 44.5 47.4
Potted Plant Pot 47.3 51.3 56.0 50.8
Potted Plant Plant 52.4 55.5 56.4 58.9
Sheep Head 60.9 63.6 70.8 70.6
Sheep Leg 8.6 19.4 14.3 24.4
Sheep Torso 68.3 71.7 75.6 76.0
Sofa 43.2 42.6 56.1 69.1
Train 76.6 80.9 85.0 86.0
TV Screen 69.5 72.3 77.0 72.0
TV Frame 44.4 49.0 54.1 47.3

F.2 Pascal-Part-58 sqIOU comparison

Part Baseline BSANet GMNet FLOAT
Background 89.6 90.2 91.0 91.2
Aeroplane Body 60.9 63.9 62.2 64.2
Aeroplane Engine 22.9 39.4 34.1 38.6
Aeroplane Wing 30.8 35.5 30.2 35.1
Aeroplane Stern 48.2 49.2 48.9 50.5
Aeroplane Wheel 30.5 32.7 27.7 39.2
Bicycle Wheel 68.7 66.8 68.4 72.9
Bicycle Body 38.6 40.5 38.4 43.9
Bird Head 48.1 54.6 52.1 58.2
Bird Wing 28.9 32.0 35.0 39.1
Bird Leg 15.7 19.4 15.1 21.2
Bird Torso 54.8 59.1 57.7 59.5
Boat 53.3 59.4 60.7 64.3
Bottle Cap 16.0 17.9 18.7 24.7
Bottle Body 43.1 45.5 48.6 50.1
Bus Window 68.1 70.3 69.5 72.2
Bus Wheel 48.7 46.6 49.6 55.0
Bus Body 71.4 73.0 74.1 75.5
Car Window 45.1 51.7 46.8 60.5
Car Wheel 45.9 47.6 47.1 58.1
Car Light 24.0 28.7 23.7 32.4
Car Plate 0.0 0.0 0.0 0.0
Car Body 59.2 61.5 61.7 67.7
Cat Head 76.3 78.7 78.0 81.4
Cat Leg 45.0 47.1 46.4 49.5
Cat Tail 31.3 36.4 36.4 37.5
Cat Torso 66.9 68.6 69.9 70.8
Chair 34.4 37.5 48.3 50.8
Cow Head 58.7 65.3 65.2 69.7
Cow Tail 0.0 0.5 3.4 16.1
Cow Leg 38.1 42.7 42.6 52.1
Cow Torso 63.2 72.7 73.1 74.1
Dining Table 38.4 36.8 43.2 47.6
Dog Head 71.0 76.2 74.9 78.8
Dog Leg 42.5 50.3 47.5 52.9
Dog Tail 22.7 25.5 27.7 36.2
Dog Torso 58.3 62.3 62.2 63.9
Horse Head 61.3 63.7 63.9 70.7
Horse Tail 30.8 33.6 34.1 39.7
Horse Leg 47.7 52.4 50.7 56.3
Horse Torso 68.5 70.0 71.9 74.2
Motorbike Wheel 60.0 63.0 62.3 68.0
Motorbike Body 60.2 64.6 63.6 66.3
Person Head 69.0 69.7 69.8 74.1
Person Torso 55.2 57.2 56.0 61.1
Person Lower Arm 33.9 39.5 35.5 45.2
Person Upper Arm 42.0 43.7 42.2 49.8
Person Lower Leg 31.6 33.0 32.2 37.5
Person Upper Leg 38.2 39.9 38.7 43.9
Potted Plant Pot 29.2 28.5 32.6 32.8
Potted Plant Plant 33.2 33.8 37.3 39.3
Sheep Head 48.1 55.3 56.4 61.5
Sheep Leg 6.0 10.0 8.1 21.1
Sheep Torso 67.2 71.2 75.4 74.9
Sofa 42.6 49.6 58.1 69.2
Train 75.8 80.0 83.4 82.1
TV Screen 65.4 68.9 70.4 70.8
TV Frame 41.0 44.6 44.6 46.8

F.3 Pascal-Part-108 mIOU comparison

Part Baseline BSANet GMNet FLOAT
Background 90.2 91.8 92.7 92.9
Aeroplane Body 60.9 69.6 61.9 70.0
Aeroplane Engine 53.2 56.2 27.2 57.6
Aeroplane Wing 27.9 37.2 34.3 34.3
Aeroplane Stern 24.7 29.4 57.4 27.6
Aeroplane Wheel 40.9 51.0 51.5 45.5
Bicycle Wheel 76.4 77.1 80.2 81.2
Bicycle Saddle 34.1 38.3 38.0 36.9
Bicycle Handlebar 23.3 25.2 22.4 18.8
Bicycle Chainwheel 42.3 41.6 44.1 53.6
Bird Head 51.5 66.6 65.3 70.0
Bird Beak 40.4 51.3 44.3 56.0
Bird Torso 61.7 67.3 64.8 63.4
Bird Neck 27.5 34.5 28.4 34.3
Bird Wing 35.9 41.3 37.2 36.2
Bird Leg 23.5 30.8 23.8 25.5
Bird Foot 13.9 18.3 17.7 17.4
Bird Tail 28.1 35.7 32.5 33.6
Boat 53.7 60.7 69.2 74.8
Bottle Cap 30.4 31.0 33.4 37.6
Bottle Body 63.7 71.4 78.7 72.4
Bus Side 70.1 74.9 75.7 77.1
Bus Roof 7.5 6.3 13.5 17.1
Bus Mirror 2.1 8.0 6.6 0.0
Bus Plate 0.0 0.0 0.0 0.0
Bus Door 40.1 47.4 38.1 40.6
Bus Wheel 53.8 56.6 56.7 59.8
Bus Headlight 25.6 31.0 30.4 44.7
Bus Window 71.8 74.7 74.6 75.7
Car Side 64.0 70.2 70.5 70.8
Car Roof 21.0 22.2 22.5 27.1
Car Plate 0.0 0.0 0.0 0.0
Car Door 40.1 42.1 42.3 46.8
Car Wheel 65.8 68.8 70.2 72.7
Car Headlight 42.9 53.8 46.4 52.5
Car Window 61.0 69.1 65.0 69.4
Cat Head 73.9 76.7 77.5 77.6
Cat Eye 58.8 64.8 62.8 67.8
Cat Ear 65.5 67.9 67.1 67.8
Cat Nose 39.1 46.6 46.3 44.7
Cat Torso 64.2 66.9 68.7 68.0
Cat Neck 22.8 22.4 24.4 26.2
Cat Leg 36.5 39.6 39.1 40.3
Cat Paw 40.6 42.0 41.7 43.2
Cat Tail 40.2 44.5 45.8 43.9
Chair 35.4 35.7 49.1 59.4
Cow Head 51.2 62.5 63.8 64.7
Cow Ear 51.2 57.8 60.0 60.9
Cow Muzzle 61.2 72.4 74.9 70.6
Cow Horn 28.8 45.5 44.0 34.9
Cow Torso 63.4 73.5 73.2 72.6
Cow Neck 9.5 15.9 20.3 26.8
Cow Leg 46.5 54.8 54.8 54.8
Cow Tail 6.5 3.1 13.6 22.9
Dining Table 33.0 45.6 50.6 58.0
Dog Head 60.5 64.7 64.0 63.3
Dog Eye 50.1 57.0 54.7 60.9
Dog Ear 52.0 57.8 56.8 57.4
Dog Nose 63.5 69.8 66.0 66.7
Dog Torso 58.4 62.3 63.2 62.2
Dog Neck 27.1 28.0 28.1 26.5
Dog Leg 39.2 43.2 43.7 43.1
Dog Paw 39.4 45.2 43.7 47.8
Dog Tail 24.7 35.0 30.8 31.0
Dog Muzzle 65.1 70.1 68.9 67.0
Horse Head 54.4 59.9 55.9 62.4
Horse Ear 49.7 56.8 52.2 59.2
Horse Muzzle 61.3 66.6 62.9 65.3
Horse Torso 56.7 61.1 60.7 63.1
Horse Neck 42.1 44.8 47.2 49.3
Horse Leg 54.1 59.3 56.4 58.0
Horse Tail 48.1 51.9 51.4 53.4
Horse Hoof 22.1 19.8 25.3 18.2
Motorbike Wheel 69.6 71.6 73.6 76.4
Motorbike Hbar 0.0 0.0 0.0 0.0
Motorbike Saddle 0.0 0.0 0.8 0.1
Motorbike Hlight 25.8 21.3 28.5 35.0
Person Head 68.2 71.3 69.3 72.6
Person Eye 35.1 44.6 38.7 49.6
Person Ear 37.4 46.4 41.4 47.6
Person Nose 53.0 57.4 56.7 62.4
Person Mouth 48.9 53.1 51.3 58.4
Person Hair 70.8 73.2 71.8 70.9
Person Torso 63.4 66.3 65.2 66.1
Person Neck 49.7 53.1 51.2 54.5
Person Arm 54.7 58.4 57.4 58.3
Person Hand 43.0 50.1 44.1 47.8
Person Leg 50.8 53.8 53.0 53.6
Person Foot 29.8 33.0 31.3 31.8
Potted Plant Pot 41.6 52.3 56.0 50.1
Potted Plant Plant 42.9 56.1 56.6 47.7
Sheep Head 45.6 50.2 54.0 51.6
Sheep Ear 43.2 48.9 45.3 54.8
Sheep Muzzle 58.2 66.6 64.9 65.5
Sheep Horn 3.0 5.1 5.4 31.8
Sheep Torso 62.6 66.3 68.8 69.9
Sheep Neck 26.9 29.8 30.3 36.0
Sheep Leg 8.6 21.1 11.7 23.9
Sheep Tail 6.7 6.3 9.1 15.2
Sofa 39.2 43.0 53.9 68.9
Train Head 5.3 6.0 4.5 4.0
Train Head Side 61.9 60.8 60.8 66.6
Train Head Roof 23.0 19.9 21.1 26.5
Train Headlight 0.0 0.0 0.0 0.0
Train Coach 28.6 35.7 31.4 36.4
Train Coach Side 15.6 18.4 14.9 15.5
Train Coach Roof 10.8 6.3 18.1 7.7
TV Screen 64.8 70.4 70.7 69.6

F.4 Pascal-Part-108 sqIOU comparison

Part Baseline BSANet GMNet FLOAT
Background 88.7 90.6 91.2 91.3
Aeroplane Body 55.1 63.9 62.3 64.5
Aeroplane Engine 47.1 49.1 32.5 50.2
Aeroplane Wing 22.9 35.5 30.0 31.0
Aeroplane Stern 29.7 40.2 50.1 34.5
Aeroplane Wheel 25.9 33.2 29.4 35.1
Bicycle Wheel 63.1 67.3 68.4 71.2
Bicycle Saddle 23.4 27.0 26.0 28.2
Bicycle Handlebar 15.7 18.9 16.3 16.7
Bicycle Chainwheel 29.4 30.2 29.1 33.5
Bird Head 42.7 47.6 45.8 52.4
Bird Beak 22.8 30.8 25.9 37.5
Bird Torso 49.6 57.0 54.2 54.7
Bird Neck 23.7 30.2 26.0 33.1
Bird Wing 31.2 31.5 33.2 36.4
Bird Leg 8.2 14.6 10.0 14.3
Bird Foot 7.7 10.8 9.7 12.0
Bird Tail 21.6 25.2 25.7 28.0
Boat 53.5 59.2 62.3 64.0
Bottle Cap 15.0 18.0 17.8 25.4
Bottle Body 40.0 45.1 49.1 49.3
Bus Side 64.2 70.5 70.2 72.3
Bus Roof 7.7 9.8 17.2 24.2
Bus Mirror 0.6 4.7 2.8 0.0
Bus Plate 0.0 0.0 0.0 0.0
Bus Door 27.5 33.7 28.7 32.4
Bus Wheel 43.2 48.1 48.1 52.9
Bus Headlight 12.2 23.4 16.8 30.5
Bus Window 67.1 70.6 69.2 71.7
Car Side 51.8 54.9 55.5 58.7
Car Roof 13.0 20.7 15.3 25.8
Car Plate 0.0 0.0 0.0 0.0
Car Door 34.1 37.3 40.8 46.8
Car Wheel 44.2 46.9 47.4 57.1
Car Headlight 21.2 28.5 25.5 31.4
Car Window 45.0 52.6 47.7 58.4
Cat Head 67.6 70.9 70.5 72.4
Cat Eye 30.9 40.4 35.0 46.6
Cat Ear 54.6 58.2 56.1 60.9
Cat Nose 13.9 25.3 23.6 29.2
Cat Torso 60.4 63.4 65.0 64.6
Cat Neck 23.0 22.1 24.8 28.0
Cat Leg 33.6 36.3 36.2 38.5
Cat Paw 36.9 39.5 38.8 41.5
Cat Tail 32.6 36.7 36.5 35.3
Chair 37.4 37.4 46.8 50.6
Cow Head 46.4 51.8 52.7 56.6
Cow Ear 36.8 42.3 41.3 47.3
Cow Muzzle 50.0 60.4 60.2 63.4
Cow Horn 16.9 22.4 23.4 23.5
Cow Torso 60.8 69.6 71.8 71.3
Cow Neck 9.6 13.3 15.2 23.8
Cow Leg 35.7 43.5 44.0 49.3
Cow Tail 3.7 2.2 6.9 15.7
Dining Table 31.5 39.2 44.1 47.5
Dog Head 54.0 58.6 56.6 60.0
Dog Eye 23.0 31.6 26.9 39.0
Dog Ear 40.1 50.0 47.4 54.5
Dog Nose 34.5 45.2 37.5 48.8
Dog Torso 53.6 57.9 57.8 59.4
Dog Neck 20.3 20.8 21.2 25.3
Dog Leg 33.6 40.0 37.5 41.9
Dog Paw 32.8 39.8 35.9 41.7
Dog Tail 22.6 28.1 27.4 33.2
Dog Muzzle 53.8 60.5 57.8 62.7
Horse Head 46.6 49.9 49.3 54.2
Horse Ear 29.3 38.2 35.1 44.4
Horse Muzzle 54.1 54.7 55.6 61.6
Horse Torso 55.7 58.0 61.0 60.6
Horse Neck 46.1 45.9 51.4 52.7
Horse Leg 46.1 51.7 48.9 53.3
Horse Tail 33.0 34.5 36.4 40.4
Horse Hoof 12.1 14.0 16.3 13.3
Motorbike Wheel 58.8 61.7 62.1 66.3
Motorbike Hbar 0.0 0.0 0.0 0.0
Motorbike Saddle 0.0 0.0 0.5 0.1
Motorbike Hlight 18.3 16.9 17.4 21.7
Person Head 50.5 54.0 52.4 58.5
Person Eye 9.4 15.7 11.0 20.0
Person Ear 14.7 22.9 17.7 26.0
Person Nose 22.6 26.8 26.0 35.9
Person Mouth 19.7 24.5 21.9 31.8
Person Hair 50.9 56.2 52.3 58.8
Person Torso 52.7 56.1 54.8 58.6
Person Neck 36.1 38.7 37.1 46.1
Person Arm 43.5 48.6 46.5 53.0
Person Hand 27.3 34.0 29.0 38.2
Person Leg 43.5 46.2 45.9 49.1
Person Foot 20.6 23.3 22.7 26.6
Potted Plant Pot 23.1 29.5 31.6 31.9
Potted Plant Plant 29.1 34.3 37.8 35.1
Sheep Head 37.0 41.7 42.3 45.8
Sheep Ear 22.2 29.1 25.1 35.3
Sheep Muzzle 35.0 40.4 40.9 50.4
Sheep Horn 1.7 2.1 2.1 17.5
Sheep Torso 63.7 66.5 70.2 69.0
Sheep Neck 24.3 28.3 26.5 36.0
Sheep Leg 4.8 12.2 6.4 30.1
Sheep Tail 4.1 5.3 4.6 14.6
Sofa 47.2 48.9 59.0 68.9
Train Head 3.2 3.7 3.0 2.6
Train Head Side 65.3 68.3 69.2 70.9
Train Head Roof 16.6 18.8 20.0 22.8
Train Headlight 0.0 0.0 0.0 0.0
Train Coach 11.7 13.6 14.8 12.6
Train Coach Side 25.5 29.2 29.8 29.7
Train Coach Roof 9.3 6.2 14.7 8.6
TV Screen 60.6 65.3 68.4 68.2

F.5 Pascal-Part-201 mIOU comparison

Part Baseline BSANet GMNet FLOAT
Background 91.0 91.2 90.7 92.5
Aeroplane Body 67.3 71.1 62.2 68.9
Aeroplane Engine 27.0 30.0 19.2 28.5
Aeroplane Left Wing 3.8 10.7 7.2 28.6
Aeroplane Right Wing 19.6 20.3 13.4 25.6
Aeroplane Stern 53.3 56.7 48.1 55.3
Aeroplane Tail 0.0 0.0 0.0 0.0
Aeroplane Wheel 50.4 52.9 36.1 49.9
Bicycle Back Wheel 63.8 63.6 55.6 67.9
Bicycle Chainwheel 41.2 44.9 35.1 42.0
Bicycle Body 44.6 44.9 40.1 47.3
Bicycle Front Wheel 68.4 70.9 61.7 72.9
Bicycle Handlebar 27.1 26.1 18.1 24.9
Bicycle Headlight 0.0 0.0 0.0 0.0
Bicycle Saddle 41.1 41.6 20.9 43.5
Bird Beak 53.0 57.3 37.2 49.3
Bird Head 66.5 66.4 54.3 66.5
Bird Left Eye 26.2 27.6 17.9 57.8
Bird Left Foot 5.9 12.0 2.2 9.5
Bird Left Leg 5.1 9.3 4.8 15.9
Bird Left Wing 4.2 11.9 8.8 29.4
Bird Neck 34.0 35.8 31.7 34.4
Bird Right Eye 0.0 11.6 0.9 55.2
Bird Right Foot 0.0 1.2 0.0 7.4
Bird Right Leg 11.1 11.1 14.6 11.2
Bird Right Wing 18.7 16.3 18.1 20.3
Bird Tail 30.0 36.2 26.1 29.5
Bird Torso 60.6 65.3 61.1 61.2
Boat 56.7 61.2 55.0 75.3
Bottle Body 64.6 72.5 65.5 67.6
Bottle Cap 28.1 30.9 21.4 35.1
Bus Back Plate 0.0 0.0 0.0 13.0
Bus Back Side 49.0 44.1 49.8 43.5
Bus Door 40.9 46.1 31.1 38.2
Bus Front Plate 26.3 42.2 0.0 45.3
Bus Front Side 68.9 66.9 60.9 48.6
Bus Headlight 32.6 34.8 6.1 38.8
Bus Left Mirror 0.0 0.8 0.0 7.5
Bus Left Side 21.4 25.1 27.1 34.6
Bus Right Mirror 0.0 12.5 0.0 9.7
Bus Right Side 33.9 31.5 29.2 36.7
Bus Roof 0.0 8.0 1.0 13.5
Bus Wheel 57.1 56.2 48.8 59.3
Bus Window 73.5 74.8 66.4 76.5
Car Back Plate 25.6 26.9 6.7 39.2
Car Back Side 45.0 44.5 38.0 44.6
Car Door 41.4 44.1 37.8 43.6
Car Front Plate 43.0 38.1 12.5 48.6
Car Front Side 66.0 65.6 60.1 56.1
Car Headlight 54.4 51.5 39.4 54.3
Car Left Mirror 12.6 14.0 1.4 21.0
Car Left Side 20.5 20.5 16.6 28.7
Car Right Mirror 0.3 7.1 0.0 17.8
Car Right Side 16.7 18.7 14.6 29.0
Car Roof 17.4 27.8 11.7 22.5
Car Wheel 68.3 69.2 63.7 72.4
Car Window 65.4 67.1 55.4 68.4
Cat Head 75.4 76.4 72.3 77.8
Cat Left Back Leg 9.7 9.9 6.7 11.6
Cat Left Back Paw 9.3 10.8 4.1 10.7
Cat Left Ear 12.6 22.3 24.2 59.7
Cat Left Eye 7.9 11.6 13.4 59.2
Cat Left Front Leg 11.5 15.0 14.8 25.3
Cat Left Front Paw 15.3 16.7 13.7 19.0
Cat Neck 21.5 18.5 19.6 23.3
Cat Nose 39.7 46.3 32.9 43.1
Cat Right Back Leg 1.5 10.1 6.8 16.0
Cat Right Back Paw 0.2 7.5 7.1 16.1
Cat Right Ear 33.2 28.6 25.1 59.7
Cat Right Eye 34.0 33.8 23.5 62.8
Cat Right Front Leg 16.2 12.1 12.0 26.5
Cat Right Front Paw 17.6 12.1 12.2 21.9
Cat Tail 40.6 45.4 15.3 43.0
Cat Torso 65.6 66.7 64.9 67.6
Chair 35.6 35.4 35.5 59.6
Cow Head 60.1 60.6 54.7 61.1
Cow Left Back Lower Leg 0.8 3.3 1.1 15.3
Cow Left Back Upper Leg 13.5 16.2 12.7 19.6
Cow Left Ear 1.9 24.2 8.2 53.0
Cow Left Eye 0.0 0.0 0.0 41.1
Cow Left Front Lower Leg 15.9 14.5 12.1 25.4
Cow Left Front Upper Leg 14.4 18.6 14.4 33.2
Cow Left Horn 0.0 13.3 0.0 28.3
Cow Muzzle 71.0 72.1 64.7 70.4
Cow Neck 5.7 15.1 9.0 21.9
Cow Right Back Lower Leg 16.3 18.4 1.7 17.4
Cow Right Back Upper Leg 5.6 12.9 5.2 22.9
Cow Right Ear 27.4 28.8 22.1 56.5
Cow Right Eye 1.9 11.0 0.0 38.5
Cow Right Front Lower Leg 2.6 9.5 0.5 25.1
Cow Right Front Upper Leg 19.2 21.8 14.4 32.7
Cow Right Horn 0.0 30.7 2.9 24.1
Cow Tail 5.6 12.2 0.0 17.8
Cow Torso 70.0 73.0 63.1 70.8
Dining Table 38.6 43.6 40.3 58.2
Dog Head 61.7 63.4 58.9 62.7
Dog Left Back Leg 5.0 8.6 6.2 17.1
Dog Left Back Paw 6.8 6.7 3.4 12.6
Dog Left Ear 22.1 19.5 19.9 56.6
Dog Left Eye 21.0 21.4 12.6 54.9
Dog Left Front Leg 9.4 18.6 14.2 32.3
Dog Left Front Paw 8.2 16.6 16.2 30.2
Dog Muzzle 67.7 70.3 63.2 64.9
Dog Neck 27.5 27.4 20.1 25.6
Dog Nose 64.0 69.2 55.1 66.0
Dog Right Back Leg 18.0 12.3 13.5 24.4
Dog Right Back Paw 2.5 5.5 2.5 18.1
Dog Right Ear 20.6 23.5 26.6 53.0
Dog Right Eye 20.7 17.5 17.5 59.8
Dog Right Front Leg 20.6 14.1 17.1 33.4
Dog Right Front Paw 23.5 18.7 17.2 31.8
Dog Tail 31.8 35.5 27.5 33.5
Dog Torso 60.3 62.3 58.8 62.0
Horse Head 58.0 58.2 48.8 63.5
Horse Left Back Hoof 0.0 1.7 1.2 8.2
Horse Left Back Lower Leg 4.4 8.0 3.1 19.6
Horse Left Back Upper Leg 0.4 10.0 14.0 24.0
Horse Left Ear 7.0 13.5 10.2 50.1
Horse Left Eye 0.0 0.0 3.2 39.8
Horse Left Front Hoof 0.0 3.9 0.0 2.1
Horse Left Front Lower Leg 15.5 20.1 11.2 23.3
Horse Left Front Upper Leg 14.2 24.9 14.4 30.1
Horse Muzzle 65.0 66.4 56.0 69.7
Horse Neck 50.8 48.9 38.6 51.5
Horse Right Back Hoof 0.0 2.6 2.1 7.7
Horse Right Back Lower Leg 16.1 19.6 7.2 21.2
Horse Right Back Upper Leg 22.4 23.9 14.3 28.6
Horse Right Ear 29.3 25.7 28.1 49.7
Horse Right Eye 17.2 19.4 1.2 52.1
Horse Right Front Hoof 0.0 2.2 0.0 2.9
Horse Right Front Lower Leg 4.1 9.6 5.3 21.7
Horse Right Front Upper Leg 21.5 12.6 13.2 33.3
Horse Tail 47.9 49.9 39.0 49.6
Horse Torso 61.3 61.7 56.4 65.1
Motorbike Back Wheel 60.7 63.9 52.3 63.7
Motorbike Body 67.8 70.7 64.5 70.8
Motorbike Front Wheel 68.9 72.2 63.4 71.9
Motorbike Handlebar 0.0 0.0 0.0 0.1
Motorbike Headlight 30.7 17.8 11.2 34.7
Motorbike Saddle 0.0 0.0 0.0 0.0
Person Hair 73.8 74.0 68.3 72.7
Person Head 70.0 70.8 63.9 71.2
Person Left Ear 19.5 16.6 8.2 45.1
Person Left Eye 4.5 12.9 3.1 42.7
Person Left Eyebrow 0.0 3.6 0.1 17.1
Person Left Foot 18.4 16.2 11.6 17.8
Person Left Hand 8.4 15.7 13.2 33.7
Person Left Lower Arm 19.6 18.2 12.1 37.4
Person Left Lower Leg 16.2 18.6 17.3 23.5
Person Left Upper Arm 21.0 19.0 16.1 47.2
Person Left Upper Leg 15.9 10.7 13.2 30.8
Person Mouth 52.2 53.6 30.2 57.9
Person Neck 51.3 52.2 42.6 52.9
Person Nose 59.2 58.2 40.1 61.1
Person Right Ear 16.3 19.8 9.8 47.8
Person Right Eye 22.8 17.9 13.7 47.3
Person Right Eyebrow 0.0 3.6 0.4 12.2
Person Right Foot 9.3 12.4 7.2 18.4
Person Right Hand 28.7 24.0 23.4 36.0
Person Right Lower Arm 17.9 19.6 18.5 37.2
Person Right Lower Leg 15.6 16.6 10.2 24.8
Person Right Upper Arm 22.0 21.4 23.4 45.9
Person Right Upper Leg 19.7 23.8 19.8 32.9
Person Torso 64.2 65.3 58.6 65.3
Potted Plant Plant 51.6 56.2 48.2 54.5
Potted Plant Pot 50.0 53.3 40.2 49.9
Sheep Head 49.2 45.1 42.6 47.6
Sheep Left Back Lower Leg 2.3 3.6 1.5 7.7
Sheep Left Back Upper Leg 0.0 0.0 0.0 9.7
Sheep Left Ear 18.3 27.2 6.9 51.9
Sheep Left Eye 0.0 0.0 1.3 37.6
Sheep Left Front Lower Leg 0.0 0.0 2.7 15.7
Sheep Left Front Upper Leg 0.0 0.0 0.0 14.4
Sheep Left Horn 0.0 0.7 0.0 26.1
Sheep Muzzle 59.8 61.1 58.6 66.0
Sheep Neck 24.9 23.1 17.1 32.0
Sheep Right Back Lower Leg 0.0 1.3 0.8 4.9
Sheep Right Back Upper Leg 0.2 1.6 1.1 7.1
Sheep Right Ear 15.5 12.9 24.1 49.5
Sheep Right Eye 8.5 15.8 1.9 37.0
Sheep Right Front Lower Leg 0.4 0.0 1.7 12.7
Sheep Right Front Upper Leg 2.3 0.0 2.7 16.3
Sheep Right Horn 0.0 8.4 0.0 25.0
Sheep Tail 6.8 5.1 0.1 12.3
Sheep Torso 65.1 65.5 62.5 69.2
Sofa 42.1 40.4 43.3 69.0
Train Coach Back Side 0.0 3.8 5.9 11.1
Train Coach Front Side 0.0 0.0 0.0 0.5
Train Coach Left Side 5.9 6.5 6.1 6.1
Train Coach Right Side 3.4 10.2 4.9 9.1
Train Coach Roof 0.0 9.1 1.6 0.0
Train Coach 30.7 35.2 33.5 28.0
Train Head 4.3 9.0 5.7 4.4
Train Head Back Side 0.0 0.0 0.0 1.3
Train Head Front Side 71.0 72.6 62.6 34.5
Train Head Left Side 19.3 16.0 22.8 27.2
Train Head Right Side 14.3 19.8 18.4 22.2
Train Head Roof 18.7 25.2 11.4 22.2
Train Headlight 23.1 24.1 9.1 29.5
TV Monitor Frame 46.8 47.2 40.4 44.7
TV Monitor Screen 68.5 71.5 63.9 67.4

F.6 Pascal-Part-201 sqIOU comparison

Part Baseline BSANet GMNet FLOAT
Background 89.6 89.9 89.4 90.8
Aeroplane Body 61.3 65.0 54.2 62.2
Aeroplane Engine 36.4 40.6 19.1 38.6
Aeroplane Left Wing 3.8 10.6 3.7 22.2
Aeroplane Right Wing 19.6 16.1 9.1 22.7
Aeroplane Stern 47.5 49.4 39.7 46.4
Aeroplane Tail 0.0 0.0 0.0 0.0
Aeroplane Wheel 33.5 33.1 19.4 35.6
Bicycle Back Wheel 53.6 53.7 43.2 59.2
Bicycle Chainwheel 30.3 33.6 18.6 33.2
Bicycle Body 37.6 36.7 29.6 39.9
Bicycle Front Wheel 58.5 60.2 47.6 61.6
Bicycle Handlebar 23.4 20.8 11.3 24.2
Bicycle Headlight 0.0 0.0 0.0 0.0
Bicycle Saddle 32.4 29.3 14.5 32.5
Bird Beak 31.1 31.9 12.7 36.2
Bird Head 49.4 49.4 40.7 52.2
Bird Left Eye 7.4 11.4 2.4 28.3
Bird Left Foot 3.9 6.5 3.1 6.4
Bird Left Leg 2.1 4.2 0.2 8.7
Bird Left Wing 3.9 10.5 5.2 23.4
Bird Neck 24.2 26.5 14.7 27.1
Bird Right Eye 0.0 1.0 0.1 23.1
Bird Right Foot 0.0 0.8 0.0 6.0
Bird Right Leg 4.1 4.4 8.4 6.7
Bird Right Wing 18.3 12.9 11.7 21.4
Bird Tail 24.0 25.7 17.1 27.5
Bird Torso 53.8 56.0 48.1 52.1
Boat 57.4 60.2 53.1 63.9
Bottle Body 46.2 45.6 39.4 47.8
Bottle Cap 18.5 17.7 12.3 24.5
Bus Back Plate 0.0 0.0 0.0 15.5
Bus Back Side 29.5 21.1 27.6 19.9
Bus Door 31.4 32.8 14.5 28.7
Bus Front Plate 17.6 32.6 0.0 39.3
Bus Front Side 66.0 67.4 58.0 46.1
Bus Headlight 20.4 28.0 1.2 25.9
Bus Left Mirror 0.0 0.4 0.0 3.5
Bus Left Side 23.4 24.7 20.0 32.5
Bus Right Mirror 0.0 6.3 0.0 4.0
Bus Right Side 43.1 36.5 30.3 35.1
Bus Roof 0.0 10.5 1.3 19.5
Bus Wheel 51.0 49.7 38.1 53.3
Bus Window 70.4 69.7 60.6 71.6
Car Back Plate 13.8 13.6 4.9 19.7
Car Back Side 22.9 24.8 16.4 25.6
Car Door 45.8 40.8 38.1 44.4
Car Front Plate 21.6 19.4 4.8 28.0
Car Front Side 41.9 43.2 35.0 35.8
Car Headlight 30.3 30.0 13.5 31.7
Car Left Mirror 4.9 7.5 0.0 7.7
Car Left Side 19.2 18.9 15.1 24.6
Car Right Mirror 0.2 5.0 0.0 8.1
Car Right Side 17.3 17.6 10.3 25.9
Car Roof 17.3 20.1 6.8 22.9
Car Wheel 50.7 49.2 38.9 56.6
Car Window 52.5 52.9 40.4 57.8
Cat Head 70.7 70.6 67.4 72.4
Cat Left Back Leg 7.2 7.5 5.0 10.0
Cat Left Back Paw 7.6 9.4 3.5 10.7
Cat Left Ear 10.7 16.8 18.7 51.8
Cat Left Eye 5.0 10.1 11.1 42.4
Cat Left Front Leg 10.6 14.1 11.1 27.3
Cat Left Front Paw 15.2 17.3 11.7 21.9
Cat Neck 20.3 18.5 16.5 23.0
Cat Nose 23.7 25.9 12.8 29.7
Cat Right Back Leg 0.9 6.8 6.8 11.5
Cat Right Back Paw 0.1 6.8 5.6 14.5
Cat Right Ear 30.0 23.5 19.5 53.6
Cat Right Eye 18.4 15.0 11.5 41.9
Cat Right Front Leg 13.5 9.8 10.3 26.6
Cat Right Front Paw 13.7 8.2 11.2 23.0
Cat Tail 38.6 36.7 24.4 36.5
Cat Torso 63.1 63.2 61.4 65.0
Chair 39.8 37.8 38.2 50.8
Cow Head 51.6 50.1 42.8 54.4
Cow Left Back Lower Leg 0.9 2.0 2.3 11.4
Cow Left Back Upper Leg 8.8 11.2 6.5 14.5
Cow Left Ear 3.3 17.4 3.9 42.1
Cow Left Eye 0.0 0.0 0.0 19.1
Cow Left Front Lower Leg 13.2 9.6 7.1 21.7
Cow Left Front Upper Leg 9.8 12.0 10.5 26.0
Cow Left Horn 0.0 8.2 0.0 18.7
Cow Muzzle 61.3 60.1 48.5 63.1
Cow Neck 5.6 9.6 2.6 19.9
Cow Right Back Lower Leg 10.2 12.8 0.8 13.9
Cow Right Back Upper Leg 3.6 7.2 2.9 21.0
Cow Right Ear 25.7 21.4 18.6 43.3
Cow Right Eye 0.4 4.1 0.0 15.9
Cow Right Front Lower Leg 2.4 6.7 0.4 22.0
Cow Right Front Upper Leg 12.6 12.6 5.2 26.4
Cow Right Horn 0.0 11.1 0.4 17.2
Cow Tail 2.8 7.4 0.0 13.6
Cow Torso 69.2 68.5 61.1 70.5
Dining Table 34.7 38.0 35.2 47.6
Dog Head 57.9 57.6 47.7 59.4
Dog Left Back Leg 5.3 7.1 3.4 13.9
Dog Left Back Paw 5.8 5.6 1.7 13.2
Dog Left Ear 15.5 14.1 8.1 48.9
Dog Left Eye 10.8 9.9 4.2 36.9
Dog Left Front Leg 7.0 14.6 9.5 28.6
Dog Left Front Paw 8.9 15.5 11.8 28.8
Dog Muzzle 61.5 60.7 49.6 62.4
Dog Neck 21.8 20.2 8.2 21.3
Dog Nose 44.6 45.0 24.7 48.0
Dog Right Back Leg 12.2 8.3 7.9 17.8
Dog Right Back Paw 2.5 5.6 0.3 17.3
Dog Right Ear 21.5 19.5 16.8 49.7
Dog Right Eye 9.7 9.6 4.0 38.2
Dog Right Front Leg 16.2 9.8 10.0 31.0
Dog Right Front Paw 19.7 13.5 6.1 31.8
Dog Tail 30.0 27.4 20.1 34.4
Dog Torso 57.1 57.4 52.9 58.9
Horse Head 51.4 49.0 42.5 54.3
Horse Left Back Hoof 0.0 1.4 2.9 6.6
Horse Left Back Lower Leg 3.2 5.6 4.1 13.6
Horse Left Back Upper Leg 0.7 8.8 7.8 18.4
Horse Left Ear 3.7 9.9 2.8 35.1
Horse Left Eye 0.0 0.0 5.1 14.7
Horse Left Front Hoof 0.0 2.8 0.0 2.4
Horse Left Front Lower Leg 11.5 15.6 6.9 21.3
Horse Left Front Upper Leg 11.9 21.5 9.8 29.4
Horse Muzzle 58.3 54.4 47.3 62.2
Horse Neck 50.8 45.6 40.5 50.6
Horse Right Back Hoof 0.0 1.2 5.9 5.1
Horse Right Back Lower Leg 12.7 15.0 3.6 15.6
Horse Right Back Upper Leg 18.5 17.8 12.1 20.4
Horse Right Ear 19.6 14.4 14.9 31.8
Horse Right Eye 3.9 6.0 1.1 16.9
Horse Right Front Hoof 0.0 1.4 0.0 2.6
Horse Right Front Lower Leg 2.4 5.1 3.6 20.0
Horse Right Front Upper Leg 16.5 7.3 9.3 29.4
Horse Tail 37.1 33.1 23.7 36.7
Horse Torso 58.2 58.7 55.3 60.7
Motorbike Back Wheel 49.7 49.3 38.4 55.4
Motorbike Body 62.7 62.9 57.8 62.9
Motorbike Front Wheel 57.4 59.5 50.5 62.7
Motorbike Handlebar 0.0 0.0 0.0 0.1
Motorbike Headlight 19.1 15.3 5.9 20.1
Motorbike Saddle 0.0 0.0 0.0 0.0
Person Hair 58.6 55.6 47.6 59.1
Person Head 56.2 54.3 47.7 58.5
Person Left Ear 11.3 8.7 3.0 23.4
Person Left Eye 1.6 4.9 0.5 19.8
Person Left Eyebrow 0.0 0.4 0.0 4.2
Person Left Foot 13.2 9.9 7.2 15.8
Person Left Hand 6.9 10.7 8.4 26.6
Person Left Lower Arm 14.9 14.0 7.6 33.4
Person Left Lower Leg 12.5 11.4 11.2 21.9
Person Left Upper Arm 16.3 13.3 10.3 40.8
Person Left Upper Leg 9.4 6.3 8.3 27.8
Person Mouth 25.3 25.4 10.5 31.9
Person Neck 39.7 35.7 26.7 42.3
Person Nose 30.5 27.2 14.4 36.2
Person Right Ear 9.1 9.9 1.8 23.3
Person Right Eye 5.7 3.9 5.6 19.6
Person Right Eyebrow 0.0 0.8 0.0 3.1
Person Right Foot 6.7 6.9 7.1 16.3
Person Right Hand 22.2 16.6 16.1 29.6
Person Right Lower Arm 15.2 14.6 12.5 34.0
Person Right Lower Leg 12.4 11.9 5.6 22.5
Person Right Upper Arm 16.5 16.7 16.8 40.5
Person Right Upper Leg 19.2 19.9 15.2 28.8
Person Torso 56.7 55.5 48.1 57.7
Potted Plant Plant 38.2 35.0 29.9 38.2
Potted Plant Pot 31.6 32.1 23.9 30.8
Sheep Head 42.5 40.7 35.8 44.2
Sheep Left Back Lower Leg 0.6 1.2 2.6 6.2
Sheep Left Back Upper Leg 0.0 0.0 0.0 5.4
Sheep Left Ear 8.0 14.3 2.3 33.8
Sheep Left Eye 0.0 0.0 1.4 11.5
Sheep Left Front Lower Leg 0.0 0.0 0.6 15.1
Sheep Left Front Upper Leg 0.0 0.0 0.0 11.7
Sheep Left Horn 0.0 0.4 0.0 15.1
Sheep Muzzle 43.9 40.1 33.4 48.1
Sheep Neck 25.4 24.6 12.6 29.0
Sheep Right Back Lower Leg 0.0 0.5 0.1 6.5
Sheep Right Back Upper Leg 0.0 1.1 1.1 8.0
Sheep Right Ear 13.4 5.7 8.2 29.9
Sheep Right Eye 0.8 3.9 0.1 11.1
Sheep Right Front Lower Leg 0.1 0.0 0.7 10.1
Sheep Right Front Upper Leg 0.5 0.0 0.4 10.3
Sheep Right Horn 0.0 2.9 0.0 11.3
Sheep Tail 2.9 3.9 0.0 16.0
Sheep Torso 66.4 66.7 63.1 69.0
Sofa 52.6 47.2 52.0 69.0
Train Coach Back Side 0.0 3.5 12.8 11.5
Train Coach Front Side 0.0 0.0 0.0 2.6
Train Coach Left Side 16.1 17.3 10.5 12.3
Train Coach Right Side 7.8 15.2 7.5 10.7
Train Coach Roof 0.0 7.5 1.7 0.0
Train Coach 10.7 13.3 13.1 11.5
Train Head 3.6 6.7 4.0 3.0
Train Head Back Side 0.0 0.0 0.0 5.2
Train Head Front Side 67.6 66.1 63.3 34.7
Train Head Left Side 32.8 33.4 37.7 22.8
Train Head Right Side 16.2 25.4 20.6 21.6
Train Head Roof 16.5 21.9 5.6 22.1
Train Headlight 15.5 16.1 3.2 18.4
TV Monitor Frame 42.4 43.0 34.5 44.8
TV Monitor Screen 65.3 67.8 59.4 68.7