
Learning Instance Occlusion for Panoptic Segmentation

by   Justin Lazarow, et al.

Recently, the vision community has shown renewed interest in panoptic segmentation, previously known as image parsing. While a large amount of progress has been made within the instance and semantic segmentation tasks separately, panoptic segmentation implies knowledge of both (countable) "things" and semantic "stuff" within a single output. A common approach involves fusing the respective instance and semantic segmentation proposals; however, this method has not explicitly addressed the jump from instance segmentation to non-overlapping placement within a single output, and it often fails to lay out overlapping instances adequately. We propose a straightforward extension to the Mask R-CNN framework that is tasked with resolving how two instance masks should overlap one another in the fused output as a binary relation. We show competitive increases in overall panoptic quality (PQ) and particular gains in the "things" portion of the standard panoptic segmentation benchmark, reaching state-of-the-art against methods with comparable architectures.



1 Introduction

Image understanding has been a long standing problem in both human perception [3] and computer vision [38]. The image parsing framework [50] is concerned with the task of decomposing and segmenting an input image into constituents such as objects (text and faces) and generic regions through integration of image segmentation, object detection, and object recognition. Image parsing consists of three key characteristics: 1) integration of generic image segmentation and object recognition; 2) combination of bottom-up (object detection, edge detection, image segmentation) and top-down modules (object recognition, shape prior); 3) competition from background regions and foreground objects through analysis-and-synthesis. Similar methods have been developed for grammar-based models [59], scene understanding [30, 17], human pose recognition [40], segmentation [4], and object detection [28]. Scene parsing bears a similar spirit and it consists of both non-parametric [48] and parametric [57] approaches.

Figure 1: Images and ground truth masks from the COCO dataset. (a) Cases where predicting the ground truth mask creates ambiguity when attempting to assign pixels in the output that belong to both the dining table and bowl, or to both the person and tie. The correct choice is shown above while the wrong choice of assignment is below. (b) Cases where a Mask R-CNN baseline produces an instance prediction that occludes the entire image and creates the same ambiguity as in (a) despite an unambiguous ground truth annotation.

After the initial development, the problem of image understanding was studied separately as semantic segmentation and object detection (or extended to instance segmentation). For details, please see Section 2. Instance segmentation [18] requires the detection and segmentation of each thing (countable object) within an image, while semantic segmentation [57] provides a dense labeling of what kind of stuff (a non-thing) each pixel belongs to. Summarizing the development in instance segmentation and semantic segmentation, Kirillov et al. [26] proposed the panoptic segmentation task, which essentially combines the strengths of semantic segmentation and instance segmentation. In this task, each pixel in an image is assigned either to a background class (stuff) or to a specific foreground object (instance of things). After its initial release, panoptic segmentation quickly attracted attention and inspired follow-up works in computer vision [31, 25]. More recent methods also include [55, 35], which are works concurrent with ours.

Naturally, a common architecture for panoptic segmentation has emerged in a number of works [25, 31, 55] that relies on incorporating both kinds of segmentation into either a separate or shared architecture and then fusing the results from a semantic segmentation branch and an instance segmentation branch into a single output. One must decide whether a pixel in the output belongs to a thing or stuff class and then additionally decide an instance identifier for the former case. Specifically, because the two branches can produce inconsistent predictions, choices must be made to resolve these conflicts. Furthermore, one must resolve conflicts within the instance segmentation branch: a pixel in the output can only be assigned to a single class and instance, but instance segmentation proposals often overlap across and within classes (for example, see Figure 1(a)). In particular, if two high confidence instance masks have appreciable overlap, any fusion process must decide which instance should be placed on top of the other in the final panoptic output. In fact, some higher level “reasoning” was proposed as future work in [26] in order to combat this problem.

In this work, we focus on enriching the fusion process with a binary relationship between instances with respect to occlusion. We start with the basic unified architecture [25] and add an additional head (subnetwork) to the instance segmentation pipeline that is tasked with determining which of two instance masks should lie on top of (or below) the other. This constitutes a binary relation that can be consulted at inference time when the fusion process must actually resolve the previous questions. The relation can be fine-tuned on top of an existing Panoptic Feature Pyramid Network (Panoptic FPN) [25] architecture with minimal difficulty. We call our module the occlusion fusion head (OCFusion).

Once the fusion process is equipped with OCFusion, we observe significant gains in the overall performance on the COCO panoptic segmentation benchmark, specifically large gains in the performance corresponding to correct placement of things. Additionally, since minimal changes are needed to both the underlying architecture [25] and fusion process from [26], reasoning about the proposed instances and fusion results becomes more straightforward. We hope that this straightforward addition can be complementary to other approaches and provide a strong baseline for further developments in panoptic segmentation.

2 Related Work

Now, we briefly discuss related developments towards panoptic segmentation by broadly categorizing previous approaches into several groups.

Semantic segmentation. With supervised machine learning techniques [52, 15, 5] being commonly adopted in computer vision, the semantic segmentation/labeling task [46, 13] became an important area where each pixel in an image is assigned a class label from a fixed set of pre-defined categories [46, 49]. Later, the family of fully convolutional network (FCN) [37] methods [7, 58], based on deep convolutional neural networks (CNNs), significantly advanced the state-of-the-art results in semantic image segmentation.
Stuff and things. Interestingly, a different thread emerged by enriching the generic background into different classes called “stuff” [1, 19] that do not exhibit strong shape boundaries or distinct individual instances, such as sky, grass, and ground. On the other hand, common objects of interest are summarized into “things” [19, 29, 6] that are described by the so-called objectness [2, 51].
Object detection. At the same time, object detectors became increasingly practical, including those from before the deep learning era [14, 11] and after [16, 44, 43, 36]. A repulsion loss is added to a pedestrian detection algorithm [54] to deal with the crowd occlusion problem, but it focuses on detection only, without instance segmentation.
Instance segmentation. The development in the above areas has inspired the creation of another direction, instance segmentation [42], in which the main objective is to perform simultaneous object detection and segmentation. Existing methods for instance segmentation can be roughly divided into detection-based [41, 42, 9, 32, 18] and segmentation-based [45, 56, 24].
Next, we discuss in detail the difference between OCFusion and the existing approaches for panoptic segmentation, occlusion ordering, and non-maximum suppression.
Panoptic segmentation. [26] introduced the task of panoptic segmentation along with the baseline where predictions from instance segmentation (Mask R-CNN [18]) and semantic segmentation (PSPNet [57]) are combined via fusion heuristics. A strong baseline based on a single Feature Pyramid Network (FPN) [33] backbone followed by multi-task heads (consisting of semantic and instance segmentation branches) was concurrently proposed by [31, 29, 25, 55]. On top of this baseline, [31] added attention layers to the instance segmentation branch, guided by the semantic segmentation branch; [29] added a loss term enforcing consistency between things and stuff predictions; [55] added a parameter-free panoptic head which computes the final panoptic mask by pasting instance mask logits onto semantic segmentation logits. These works do not employ explicit reasoning about instance occlusion, thus resulting in a relatively small improvement over the baseline.

Occlusion ordering and layout learning. Occlusion handling is a long-studied computer vision task [53, 12, 47, 20]. In the context of semantic segmentation, occlusion ordering has been adopted in [48, 8]. [60] considers occlusion by way of amodal semantic segmentation; however, they annotate a small subset of COCO without a clear sense of how to apply this to larger segmentation tasks. In a broader sense, occlusion ordering can also be considered as part of the layout learning task, although previous efforts in learning object layout [10] are still somewhat limited due to particular assumptions about the types of layout models. Here, we study the particular occlusion ordering problem for instance maps in panoptic segmentation, which has been underexplored. In a concurrent work [35], the authors attempt to learn a spatial ranking of instance masks to aid the fusion process for panoptic segmentation. However, this is supervised in a weaker manner and the resulting ranking is less explicit about what occlusions, in fact, exist.

Learnable NMS. One can relate this method to non-maximum suppression (NMS), which is usually applied to boxes, while our method acts to suppress intersections between masks. In this sense, our method acts as a learnable version of NMS for instance masks with similar computations to the analogous ideas for boxes such as [21].

3 Learning Instance Occlusion for Panoptic Fusion

We consider the setting for our method, OCFusion, within an architecture that has both a semantic segmentation branch and an instance segmentation branch (possibly separate or shared in some manner). The key task is then to fuse separate proposals from each branch into one that produces a single output defining the panoptic segmentation. We adopt the coupled approach of [25], [55] that uses a shared Feature Pyramid Network (FPN) [33] backbone with a top-down process for semantic segmentation and Mask R-CNN [18] for its instance segmentation branch. For the fusion process, we build off of the fusion heuristic introduced in [26].

3.1 Fusing instances

The fusion protocol in [26] adopts a greedy strategy during inference. In an iterative manner, pixels in the output are (exclusively) assigned to segments denoted by their class and an additional instance ID for “things”. First, proposed instances are sorted in order of decreasing detection confidence. Second, the current proposal mask’s intersection with the mask of all already assigned pixels is considered. If this is above a certain ratio (usually 0.5), this instance is entirely skipped in the output. Otherwise, pixels in this mask that have yet to be assigned are assigned to the instance in the output. After all instance proposals above some minimum detection threshold (usually 0.5) are considered, the semantic segmentation is merged into the output by considering the pixels corresponding to each non-thing class predicted. If this total area exceeds some threshold (usually 4096) when removing already assigned pixels (a semantic proposal can never override an instance one), then these pixels are assigned to the corresponding “stuff” category. Pixels that are unassigned after this entire process are considered void predictions and have special treatment in the panoptic scoring process.
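The greedy protocol described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the reference implementation of [26]: the function name `fuse_by_confidence`, its argument layout, and the use of `0` as the void id are assumptions made for this example, and the thresholds default to the "usual" values quoted in the text.

```python
import numpy as np


def fuse_by_confidence(masks, scores, classes, semantic, stuff_ids,
                       min_score=0.5, max_overlap=0.5, min_stuff_area=4096):
    """Greedy fusion by detection confidence (hypothetical helper).

    masks:     list of HxW boolean instance masks (may overlap)
    scores:    detection confidence for each mask
    classes:   thing class id for each mask
    semantic:  HxW integer map of predicted semantic class ids
    stuff_ids: class ids that count as "stuff"
    Returns (panoptic, segments): an HxW segment-id map (0 = void) and a
    dict mapping segment id -> {"class", "is_thing"}.
    """
    panoptic = np.zeros(semantic.shape, dtype=np.int32)
    segments = {}
    next_id = 1
    # Visit proposals in order of decreasing confidence.
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    for idx in order:
        if scores[idx] < min_score:
            continue
        mask = np.asarray(masks[idx], dtype=bool)
        area = mask.sum()
        if area == 0:
            continue
        already_taken = (mask & (panoptic > 0)).sum()
        if already_taken / area > max_overlap:
            continue  # the instance is skipped entirely
        panoptic[mask & (panoptic == 0)] = next_id
        segments[next_id] = {"class": int(classes[idx]), "is_thing": True}
        next_id += 1
    # Merge stuff: a semantic prediction never overrides an instance pixel.
    for cls in stuff_ids:
        free = (semantic == cls) & (panoptic == 0)
        if free.sum() >= min_stuff_area:
            panoptic[free] = next_id
            segments[next_id] = {"class": int(cls), "is_thing": False}
            next_id += 1
    return panoptic, segments
```

Note that a small mask lying entirely inside a higher-confidence mask is always dropped by this procedure, which is the failure mode examined in the rest of this section.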

Immediately, certain flaws can be seen in this process. Detection scores have little to do with mask quality [22], and they carry no information about layout. Nevertheless, in this process higher detection scores are made to imply a more foreground ordering. The latter is important since Mask R-CNN exhibits behavior that can assign near maximum confidence to very large objects (see the dining table images in Figure 1(b)) that are both of poor mask quality and not truly foreground. We denote this type of fusion as fusion by confidence. Because the process is greedy with no reclamation process for lower confidence proposals, it is not uncommon to see images with a significant number of instances suppressed by a single instance with large area when it additionally has high confidence.

3.2 A dataset problem

Instance segmentation datasets such as COCO are often crafted in a way such that instances are treated independently of one another (without respect to occlusion). If one instance occludes another object, one has to consider what the ground truth segment should be. The mask for a dining table really depends on the specific situation; see Figure 1(a) and 1(b), where plateware and utensils sit on top. This creates an immediate gap in understanding between an annotated panoptic ground truth that allows no such overlaps (always requiring “holes” for occlusion) and a panoptic segmentation system that relies on instance segmentation results.

3.3 Softening the greed

While the greedy process outlined previously is efficient, its most glaring flaw is the complete reliance on detection confidences (for R-CNN, those from the box head) for a tangential task. This conflates detection confidence with a layout ranking and generally leads to poor performance when an instance that overlaps another (a tie on a shirt in Figure 1(a)) has lower detection confidence than the parent. This often causes entire hierarchies that Mask R-CNN successfully proposes to be wiped out. Our approach replaces the reliance on confidences with a fusion process that can query which of two instances with appreciable intersection should be placed on top of the other in the final output, i.e., it learns the occlusion between instances.

3.4 Formulation

Consider two masks $M_i$ and $M_j$ proposed by an instance segmentation model, and denote their intersection as $I_{ij} = M_i \cap M_j$. We are interested in the case where one of the masks is heavily occluded by the other. Therefore, we consider the respective intersection ratios $R_i = \mathrm{Area}(I_{ij}) / \mathrm{Area}(M_i)$ and $R_j = \mathrm{Area}(I_{ij}) / \mathrm{Area}(M_j)$, where $\mathrm{Area}(M)$ denotes the number of “on” pixels in mask $M$. As noted in Section 3.1, the fusion process considers the intersection of the current instance proposal with the mask consisting of all already claimed pixels. Here, we are looking at the intersection between two masks and denote some threshold $\rho$ (usually 0.2). If either $R_i \geq \rho$ or $R_j \geq \rho$, we define these two masks as having appreciable occlusion/overlap. In this case, we must then resolve which instance the pixels in $I_{ij}$ belong to. For simplicity, we assume one of these masks will claim all the pixels belonging to the intersection (although extensions could be made to treat this on a per-pixel basis). The original protocol of [26] resolves this question by assigning $I_{ij}$ to the mask with higher confidence. We attempt to learn the answer to this question by learning a binary relation $\mathrm{Occlude}(M_i, M_j)$ such that whenever $M_i$ and $M_j$ have appreciable intersection:

$$\mathrm{Occlude}(M_i, M_j) = \begin{cases} 1 & \text{if } M_i \text{ should be placed on top of } M_j \\ 0 & \text{if } M_j \text{ should be placed on top of } M_i \end{cases}$$

where $\mathrm{Occlude}(M_i, M_j)$ is only expected to be given two masks with appreciable intersection. Since this relation deals with pairs of masks, it offers more flexibility than approaches [35] that attempt to “rerank” the masks in a linear fashion. Certain occlusion relationships can be lost in such a ranking because an occlusion-based ordering is not a total order.
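As a small illustration of the quantities defined in this section, the following sketch computes the two intersection ratios for a pair of boolean masks and tests them against the overlap threshold. The function names and the choice of a non-strict comparison are assumptions for this example; the default threshold of 0.2 is the "usual" value quoted above.

```python
import numpy as np


def intersection_ratios(mask_i, mask_j):
    """Return (R_i, R_j): intersection area divided by each mask's own area."""
    mask_i = np.asarray(mask_i, dtype=bool)
    mask_j = np.asarray(mask_j, dtype=bool)
    inter = np.logical_and(mask_i, mask_j).sum()
    area_i = mask_i.sum()
    area_j = mask_j.sum()
    r_i = inter / area_i if area_i else 0.0
    r_j = inter / area_j if area_j else 0.0
    return float(r_i), float(r_j)


def has_appreciable_overlap(mask_i, mask_j, rho=0.2):
    """True if either intersection ratio reaches the threshold rho."""
    r_i, r_j = intersection_ratios(mask_i, mask_j)
    return r_i >= rho or r_j >= rho
```

The test is deliberately asymmetric in area: a tie covering a small fraction of a person still counts as appreciable overlap, because the intersection makes up nearly all of the tie's own mask.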

3.5 Fusion with occlusion

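We now describe our modifications to the inference-time fusion heuristic of [26] that incorporate $\mathrm{Occlude}(M_i, M_j)$, in Algorithm 1.

$P$ is an $H \times W$ matrix, initially empty

$\rho$ is a hyperparameter, the minimum intersection ratio for occlusion

$\tau$ is a hyperparameter

for each proposed instance mask $M_i$ do
     $C_i = M_i - P$   (pixels in $M_i$ that are not assigned in $P$)
     for $j < i$ do   (each already merged segment)
          $I_{ij}$ is the intersection between mask $M_i$ and $M_j$
          if $R_i \geq \rho$ or $R_j \geq \rho$ then   (significant intersection)
               if $\mathrm{Occlude}(M_i, M_j) = 1$ then
                    $C_i = C_i \cup (C_j \cap I_{ij})$
                    $C_j = C_j - I_{ij}$
               end if
          end if
     end for
     if $1 - \mathrm{Area}(C_i) / \mathrm{Area}(M_i) \leq \tau$ then
          assign the pixels in $C_i$ to the panoptic mask $P$
     end if
end for
Algorithm 1 Outline of Fusion with Occlusion Head (OCFusion)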

After the instance fusion component has completed, the semantic segmentation is then incorporated in the original manner, only considering pixels assigned to stuff classes and determining whether the number of unassigned pixels corresponding to the class in the current panoptic output exceeds some threshold (e.g., 4096). The process is illustrated in Figure 2.
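The instance-fusion part of this procedure can be sketched as follows, with the learned relation replaced by a caller-supplied callback. This is a simplified illustration, not the authors' code: the name `ocfusion`, the `occlude(i, j)` callback signature, and the choice to commit pixels taken from earlier segments only once the new instance is accepted are all assumptions of this sketch.

```python
import numpy as np


def ocfusion(masks, scores, occlude, rho=0.2, tau=0.5, min_score=0.6):
    """Fuse instance masks, querying `occlude` for overlapping pairs.

    masks:   list of HxW boolean instance masks
    scores:  detection confidence for each mask
    occlude: callable (i, j) -> True if mask i should lie on top of mask j
    Returns a dict {mask index: boolean mask of pixels it owns}.
    """
    masks = [np.asarray(m, dtype=bool) for m in masks]
    ranked = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    order = [int(i) for i in ranked if scores[int(i)] >= min_score]
    assigned = np.zeros(masks[0].shape, dtype=bool)  # all claimed pixels
    claimed = {}                                     # pixels owned per segment
    for i in order:
        m_i = masks[i]
        area_i = m_i.sum()
        if area_i == 0:
            continue
        c_i = m_i & ~assigned        # pixels of mask i that are still free
        taken = []                   # pixels reclaimed from earlier segments
        for j in claimed:
            inter = m_i & masks[j]
            n = inter.sum()
            if n == 0:
                continue
            r_i = n / area_i
            r_j = n / masks[j].sum()
            if (r_i >= rho or r_j >= rho) and occlude(i, j):
                steal = claimed[j] & inter
                c_i = c_i | steal
                taken.append((j, steal))
        # Accept only if the part hidden by existing segments is small enough.
        if 1.0 - c_i.sum() / area_i <= tau:
            for j, steal in taken:
                claimed[j] = claimed[j] & ~steal
            claimed[i] = c_i
            assigned |= c_i
    return claimed
```

With a callback that reports the tie as lying on top of the person, the tie is kept even though its detection confidence is lower; with a callback that always returns `False`, the sketch reduces to fusion by confidence and the tie is dropped.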

Figure 2: An example of fusion using masks sorted by detection confidence alone versus with the ability to query for occlusions. Mask R-CNN proposes three instance masks listed with decreasing confidence. The heuristic of [26] occludes all subsequent instances after the “person”, while our method retains them in the final output by querying the occlusion head.
Figure 3: The architecture of OCFusion head.

3.6 Occlusion head

We implement the binary occlusion relation $\mathrm{Occlude}(M_i, M_j)$ as an additional “head” within Mask R-CNN [18]. Mask R-CNN already contains two heads: a box head that takes region proposal network (RPN) proposals, refines the bounding box, and assigns classification scores, and a mask head that predicts a fixed-size binary mask (usually $28 \times 28$) for all classes independently from the output of the box head. Each head derives its own set of features from the underlying FPN. We name our additional head the “occlusion head” and implement it as a binary classifier that takes two masks $M_i$ and $M_j$ along with their respective FPN features (determined by their respective boxes) as input. The classifier output is interpreted as the value of $\mathrm{Occlude}(M_i, M_j)$.

3.7 Ground truth occlusion

In order to supervise the occlusion head, one must have ground truth information describing the layout ordering of two masks with overlap. The ground truth panoptic mask along with ground truth instance masks can be used to determine this. Therefore, we pre-compute the intersection between all pairs of masks in the ground truth. Those pairs where either intersection ratio, $R_i$ or $R_j$, is larger than the threshold $\rho$ are considered to have significant overlap. We then find the pixels corresponding to the intersection of the masks in the panoptic ground truth. We determine the occlusion by taking the class corresponding to the majority of pixels in this intersection. It is sometimes possible for neither mask to have this majority, and we discard this as a valid occlusion in the data we assemble. The resulting “occlusion matrix” we store for each image is then an $N \times N$ matrix, where $N$ is the number of instances in the image and the value at position $(i, j)$ is either $-1$, indicating no occlusion, or encodes the value of $\mathrm{Occlude}(M_i, M_j)$.
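The construction of the per-image occlusion matrix can be sketched as below. Several details are assumptions made for illustration: the panoptic ground truth is taken to be an integer map of segment ids, the majority vote is taken over segment ids rather than class labels, and the function and argument names are hypothetical.

```python
import numpy as np


def occlusion_matrix(instance_masks, panoptic_ids, instance_ids, rho=0.2):
    """Build an NxN matrix of ground-truth occlusion labels.

    instance_masks: list of N HxW boolean masks (may overlap each other)
    panoptic_ids:   HxW integer segment-id map (non-overlapping ground truth)
    instance_ids:   segment id used in panoptic_ids for each instance
    Entry (i, j) is -1 for no occlusion, 1 if instance i lies on top of
    instance j, and 0 if instance j lies on top of instance i.
    """
    masks = [np.asarray(m, dtype=bool) for m in instance_masks]
    n = len(masks)
    occ = -np.ones((n, n), dtype=np.int8)
    areas = [m.sum() for m in masks]
    for i in range(n):
        for j in range(i + 1, n):
            inter = masks[i] & masks[j]
            k = inter.sum()
            if k == 0 or max(k / areas[i], k / areas[j]) < rho:
                continue  # no significant overlap
            # Count which instance owns the contested pixels in the
            # panoptic ground truth.
            votes_i = (panoptic_ids[inter] == instance_ids[i]).sum()
            votes_j = (panoptic_ids[inter] == instance_ids[j]).sum()
            if votes_i > k / 2:
                occ[i, j], occ[j, i] = 1, 0
            elif votes_j > k / 2:
                occ[i, j], occ[j, i] = 0, 1
            # Otherwise neither instance holds a majority: leave as -1.
    return occ
```

For a bowl annotated as a hole in a dining table, the contested pixels all carry the bowl's id in the panoptic map, so the bowl is recorded as lying on top of the table.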

3.8 Implementation

We extend the Mask R-CNN benchmark framework [39], built on top of PyTorch, to implement the design described in [25] (the framework of [39] does not yet support panoptic segmentation) as well as our occlusion head. The design in [25] employs either a ResNet-50 or ResNet-101 FPN backbone with 256-dimensional features per scale, which are then reduced to 128 for producing the semantic segmentation. Batch normalization [23] layers are frozen and not fine-tuned.

The occlusion head borrows the architecture in [22], as shown in Figure 3. For two mask representations $M_i$ and $M_j$, we apply max pooling to produce a $14 \times 14$ representation and concatenate each with the corresponding RoI features to produce the input to the head. Three layers of $3 \times 3$ convolutions with 512 feature maps and stride 1 are applied before a final one with stride 2. The features are then flattened before a 1024-dimensional fully connected layer and finally a projection to a single logit.
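To make the sizes in this description concrete, the sketch below traces tensor shapes through the head in plain Python. Several values are assumptions not stated in the text and are marked as such in the comments: a 28 x 28 mask pooled by 2 x 2 to 14 x 14, 3 x 3 kernels with padding 1, 256-channel RoI features at the pooled resolution, and channel-wise stacking of the two instances' inputs.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def occlusion_head_shapes(mask_size=28, roi_channels=256,
                          conv_channels=512, fc_dim=1024):
    """Return a list of (layer name, output shape) pairs for the head."""
    shapes = []
    pooled = mask_size // 2          # assumed: 2x2 max pooling with stride 2
    per_instance = roi_channels + 1  # RoI features plus one mask channel
    channels = 2 * per_instance      # assumed: both instances are stacked
    shapes.append(("input", (channels, pooled, pooled)))
    size = pooled
    for k in range(3):               # three stride-1 convolutions
        size = conv_out(size, kernel=3, stride=1, padding=1)
        shapes.append((f"conv{k + 1}", (conv_channels, size, size)))
    size = conv_out(size, kernel=3, stride=2, padding=1)  # final, stride 2
    shapes.append(("conv4", (conv_channels, size, size)))
    shapes.append(("flatten", (conv_channels * size * size,)))
    shapes.append(("fc", (fc_dim,)))
    shapes.append(("logit", (1,)))
    return shapes
```

Under these assumptions the stride-2 convolution halves the spatial size from 14 to 7, so the fully connected layer receives a vector of length 512 * 7 * 7.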

3.9 Training

During training, the occlusion head is designed to first find pairs of masks that match to different ground truth instances. Then, the intersection between these pairs of masks is computed and the ratio of intersection to mask area taken. A pair of masks is added for consideration when one of these ratios is at least as large as the pre-determined threshold $\rho$ (as mentioned in Section 3.7). We then subsample the set of all pairs meeting this criterion to decrease computational cost. It is desirable for the occlusion head to reflect the consistency of $\mathrm{Occlude}$; therefore, we also invert all pairs so that $\mathrm{Occlude}(M_i, M_j) = 0 \iff \mathrm{Occlude}(M_j, M_i) = 1$ whenever the pair $(M_i, M_j)$ meets the intersection criterion. This also mitigates class imbalance. Since this is a binary classification problem, the overall loss from the occlusion head is given by the binary cross entropy over all subsampled pairs of masks that meet the intersection criterion.
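The pair construction and loss described above can be sketched in plain Python. The helper names, the nested-list representation of the occlusion matrix, and the fixed random seed are assumptions for illustration; the limit of 128 pairs follows the per-image subsampling reported in the experiments.

```python
import math
import random


def symmetric_pairs(occ, max_pairs=128, seed=0):
    """Collect (i, j, label) training pairs in both orders, then subsample.

    occ is an NxN matrix whose entries are -1 (no occlusion) or the 0/1
    value of Occlude for the ordered pair (i, j).
    """
    n = len(occ)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            if occ[i][j] in (0, 1):
                label = int(occ[i][j])
                pairs.append((i, j, label))
                pairs.append((j, i, 1 - label))  # inverted pair
    if len(pairs) > max_pairs:
        pairs = random.Random(seed).sample(pairs, max_pairs)
    return pairs


def bce_with_logits(logits, labels):
    """Mean binary cross entropy, computed stably from raw logits."""
    total = 0.0
    for z, y in zip(logits, labels):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)
```

Because every pair is emitted in both orders before subsampling, the positive and negative labels are balanced by construction.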

Method                  PQ     SQ     RQ     PQTh   PQSt
Models with ResNet-50-FPN
Panoptic FPN [25]       39.0   –      –      45.9   28.7
Panoptic FPN [25]*      39.3   77.0   48.4   46.3   28.8
OCFusion (ours)         41.0   77.1   50.6   49.0   29.0
relative improvement    +1.7   +0.1   +2.2   +2.7   +0.2
Models with ResNet-101-FPN
Panoptic FPN [25]       40.3   –      –      47.5   29.5
Panoptic FPN [25]*      41.0   78.3   50.1   47.9   30.7
OCFusion (ours)         43.0   78.2   52.6   51.1   30.7
relative improvement    +2.0   -0.1   +2.5   +3.2   +0.0
Table 1: Panoptic segmentation results against the baseline in [25] on the COCO validation set with ResNet-50-FPN and ResNet-101-FPN backbones. * Our implementation. – indicates not reported.
Figure 4: Comparison against [25] which uses fusion by confidence.
Figure 5: Comparison against Spatial Ranking Module [35].
Figure 6: Comparison against UPSNet [55].

4 Experiments

4.1 Implementation details

Method                        PQ     SQ     RQ     PQTh   PQSt
Panoptic FPN [25]             39.0   –      –      45.9   28.7
AUNet [31]                    39.6   –      –      49.1   25.2
UPSNet [55]                   42.5   78.0   52.4   48.5   33.4
Spatial Ranking Module [35]   39.0   77.1   47.8   48.3   24.9
OCFusion (ours)               41.0   77.1   50.6   49.0   29.0
Table 2: Comparison to prior work on the MS-COCO 2018 val dataset. All results are based on a ResNet-50 FPN backbone. – indicates not reported.
Method                        Architecture                      PQ     SQ     RQ     PQTh   PQSt
Megvii (Face++)               ensemble                          53.2   83.2   62.9   62.2   39.5
Caribbean                     ensemble                          46.8   80.5   57.1   54.3   35.5
PKU_360                       ResNeXt-152-FPN                   46.3   79.6   56.1   58.6   27.6
Panoptic FPN [25]             ResNet-101-FPN                    40.9   –      –      48.3   29.7
AUNet [31]                    ResNeXt-152-FPN                   46.5   81.0   56.1   55.9   32.5
UPSNet [55]                   ResNet-101-FPN (deform. conv)     46.6   80.5   56.9   53.2   36.7
Spatial Ranking Module [35]   ResNet-101-FPN                    41.3   –      –      50.4   27.7
OCFusion (ours)               ResNeXt-101-FPN (deform. conv)    46.1   79.6   56.2   53.6   34.7
Table 3: Comparison to prior work on the MS-COCO 2018 test-dev dataset. Most of the competing algorithms in the table use multi-scale testing; we choose not to employ it due to the significant increase in computational cost at inference time. Experiments of [55, 29] suggest that multi-scale testing can improve the PQ by 0.7 point. – indicates not reported.

We perform all experiments on the COCO dataset [34] with panoptic annotations [26]. The COCO 2018 panoptic segmentation task consists of 80 thing and 53 stuff classes. We find the most stable and efficient way to train the occlusion head is by fine-tuning. Therefore, we train the FPN-based architecture described in [25] for 90K iterations on 8 GPUs with 2 images per batch. The base learning rate of 0.02 is decayed at both 60K and 80K iterations. We then fine-tune with the occlusion head for 2500 more iterations, with the learning rate unchanged from its value at the end of the initial training. We subsample 128 mask occlusions per image. We add only a single additional loss to the fine-tuning process, so that the total loss is $L = \lambda_i (L_c + L_b + L_m + L_o) + \lambda_s L_s$, where $L_c$, $L_b$, $L_m$, and $L_o$ are the box head classification loss, bounding-box regression loss, mask loss, and occlusion head loss, while $L_s$ is the semantic segmentation cross-entropy loss. We choose $\lambda_i = 1.0$ and $\lambda_s = 0.5$ for our experiments, while for the occlusion head we choose the intersection ratio $\rho$ as 0.2. During fusion, we only consider instance masks with a detection confidence of at least 0.6 and reject segments when their overlap with the existing panoptic mask (after occlusions are resolved) exceeds $\tau = 0.5$. Lastly, when considering the segments of stuff generated from the semantic segmentation, we only consider those which have at least 4096 pixels remaining after discarding those already assigned.

For scoring, we adopt the panoptic quality (PQ) metric from [26]. The metric can be interpreted as a product between both the segmentation quality (SQ) across things and stuff as well as recognition quality (RQ) across segments. PQ can be further broken down into scores specific to things and stuff, denoted PQTh and PQSt, respectively.
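For reference, the decomposition of PQ into SQ and RQ for a single class, as defined in [26], can be sketched as follows. In [26] a predicted and a ground truth segment are matched when their IoU exceeds 0.5, and the final PQ averages the per-class values; the function below is a hypothetical helper that assumes the matching has already been done.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) for one class.

    matched_ious: IoU of each matched (true positive) segment pair
    num_fp:       number of unmatched predicted segments
    num_fn:       number of unmatched ground truth segments
    """
    tp = len(matched_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp                   # mean IoU over matches
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)  # F1-style recognition term
    return sq * rq, sq, rq
```

For example, two matches with IoUs 0.8 and 0.6 plus one false positive and one false negative give SQ = 0.7 and RQ = 2/3, so PQ is their product.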

We choose not to perform multi-scale testing of any kind due to the significant increase in computational cost at inference time. This differs from [55] which uses it within their test-dev results. As expected, we generally see that our results are at the state of the art for PQTh. Unlike [55] and [31], we do not make any attempt to improve the semantic segmentation branch of the architecture (nor understand occlusion of stuff) and leave that for future work.

4.2 Visual comparisons

Since panoptic segmentation within COCO is a relatively new task, the most recent papers offer only comparisons against the baseline presented in [26]. Here we additionally compare with a few other methods [35, 55] that are works concurrent with ours. We first compare our method against [25] in Figure 4, and then against two recent works: UPSNet [55] in Figure 6 and the Spatial Ranking Module of [35] in Figure 5. The latter two have similar underlying architectures alongside modifications to their fusion process. We note that except for the comparisons against [25], the comparison images shown are those included in the respective papers and not of our own choosing. Overall, we see that our method is able to preserve a significant number of instance occlusions and is competitive with the comparison images while even adding some instances lost previously.

4.3 COCO panoptic benchmark results

Table 1 shows the contribution of our method on the COCO panoptic segmentation validation set. We observe that our method consistently improves the panoptic quality metric by 1.7 to 2.0 points across different backbones. Table 2 shows the performance of our system against state-of-the-art methods on the COCO validation set. Although our method does not beat the state-of-the-art UPSNet [55] in overall PQ, it achieves better PQTh than UPSNet. We hypothesize that the gap in overall PQ arises because the baseline of UPSNet performs better than our baseline in terms of PQSt (31.6 vs. 28.8). The PQTh of our system compares favorably to the best performing method, AUNet [31]. Finally, Table 3 shows the performance of our system against state-of-the-art methods on the COCO test-dev set. Our system is on par with state-of-the-art methods despite not using multi-scale testing. Experiments of [55, 29] suggest that multi-scale testing can improve the PQ by more than 0.7 point. However, we choose not to use multi-scale testing, as it requires an excessive amount of computation; e.g., the 11 scales used in UPSNet require at least 11 times the GPU compute resources.

5 Conclusion

We have introduced an explicit notion of instance occlusion to Mask R-CNN so that instances may be better fused when producing a panoptic segmentation. We assemble ground truth occlusions in the COCO dataset and learn an additional head for Mask R-CNN tasked with predicting occlusion between two masks. Empirical results show that when fusion heuristics are allowed to query for occlusion relationships, state of the art performance can be reached with respect to the subset of panoptic quality dedicated to things and competitive results for overall panoptic quality. We hope to exploit how further understanding of occlusion, including relationships of stuff, could be helpful in the future.

Acknowledgements. This work is supported by NSF IIS-1618477 and NSF IIS-1717431. The authors thank Yifan Xu, Weijian Xu, Sainan Liu, and Yu Shen for valuable discussions.


  • [1] E. H. Adelson. On seeing stuff: the perception of materials by humans and machines. In Human vision and electronic imaging VI, volume 4299, pages 1–13. International Society for Optics and Photonics, 2001.
  • [2] B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. IEEE transactions on pattern analysis and machine intelligence, 34(11):2189–2202, 2012.
  • [3] I. Biederman. Recognition-by-components: a theory of human image understanding. Psychological review, 94(2):115, 1987.
  • [4] E. Borenstein and S. Ullman. Combined top-down/bottom-up segmentation. IEEE Transactions on pattern analysis and machine intelligence, 30(12):2109–2125, 2008.
  • [5] L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
  • [6] H. Caesar, J. Uijlings, and V. Ferrari. Coco-stuff: Thing and stuff classes in context. In CVPR, pages 1209–1218, 2018.
  • [7] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2018.
  • [8] Y.-T. Chen, X. Liu, and M.-H. Yang. Multi-instance object segmentation with occlusion handling. In CVPR, 2015.
  • [9] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016.
  • [10] C. Desai, D. Ramanan, and C. C. Fowlkes. Discriminative models for multi-class object layout. International journal of computer vision, 95(1):1–12, 2011.
  • [11] P. Dollár, Z. Tu, P. Perona, and S. Belongie. Integral channel features. 2009.
  • [12] M. Enzweiler, A. Eigenstetter, B. Schiele, and D. M. Gavrila. Multi-cue pedestrian classification with partial occlusion handling. In CVPR, pages 990–997. IEEE, 2010.
  • [13] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
  • [14] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence, 32(9):1627–1645, 2010.
  • [15] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
  • [16] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [17] S. Gould, R. Fulton, and D. Koller. Decomposing a scene into geometric and semantically consistent regions. In ICCV, 2009.
  • [18] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In ICCV, 2017.
  • [19] G. Heitz and D. Koller. Learning spatial context: Using stuff to find things. In European conference on computer vision, pages 30–43. Springer, 2008.
  • [20] D. Hoiem, A. N. Stein, A. A. Efros, and M. Hebert. Recovering occlusion boundaries from a single image. In ICCV, 2007.
  • [21] J. Hosang, R. Benenson, and B. Schiele. Learning non-maximum suppression. In CVPR, 2017.
  • [22] Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang. Mask scoring r-cnn. arXiv preprint arXiv:1903.00241, 2019.
  • [23] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • [24] L. Jin, Z. Chen, and Z. Tu. Object detection free instance segmentation with labeling transformations. arXiv preprint arXiv:1611.08991, 2016.
  • [25] A. Kirillov, R. Girshick, K. He, and P. Dollár. Panoptic feature pyramid networks. arXiv preprint arXiv:1901.02446, 2019.
  • [26] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár. Panoptic segmentation. arXiv preprint arXiv:1801.00868, 2018.
  • [27] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Backpropagation applied to handwritten zip code recognition. In Neural Computation, 1989.
  • [28] B. Leibe, A. Leonardis, and B. Schiele. Robust object detection with interleaved categorization and segmentation. International journal of computer vision, 77(1-3):259–289, 2008.
  • [29] J. Li, A. Raventos, A. Bhargava, T. Tagawa, and A. Gaidon. Learning to fuse things and stuff. arXiv preprint arXiv:1812.01192, 2018.
  • [30] L.-J. Li, R. Socher, and L. Fei-Fei. Towards total scene understanding: Classification, annotation and segmentation in an automatic framework. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2036–2043. IEEE, 2009.
  • [31] Y. Li, X. Chen, Z. Zhu, L. Xie, G. Huang, D. Du, and X. Wang. Attention-guided unified network for panoptic segmentation. arXiv preprint arXiv:1812.03904, 2018.
  • [32] X. Liang, Y. Wei, X. Shen, J. Yang, L. Lin, and S. Yan. Proposal-free network for instance-level object segmentation. arXiv preprint arXiv:1509.02636, 2015.
  • [33] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  • [34] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755, 2014.
  • [35] H. Liu, C. Peng, C. Yu, J. Wang, X. Liu, G. Yu, and W. Jiang. An end-to-end network for panoptic segmentation. arXiv preprint arXiv:1903.05027, 2019.
  • [36] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37, 2016.
  • [37] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [38] D. Marr. Vision: A computational investigation into the human representation and processing of visual information. Henry Holt and Co., Inc., New York, NY, 1982.
  • [39] F. Massa and R. Girshick. maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch., 2018. Accessed: January 5, 2019.
  • [40] G. Mori, X. Ren, A. A. Efros, and J. Malik. Recovering human body configurations: Combining segmentation and recognition. In CVPR, 2004.
  • [41] P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In Advances in neural information processing systems, 2015.
  • [42] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In ECCV, 2016.
  • [43] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
  • [44] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [45] H. Riemenschneider, S. Sternig, M. Donoser, P. M. Roth, and H. Bischof. Hough regions for joining instance localization and segmentation. In ECCV, 2012.
  • [46] J. Shotton, J. Winn, C. Rother, and A. Criminisi. Textonboost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In European conference on computer vision, pages 1–15. Springer, 2006.
  • [47] J. Sun, Y. Li, S. B. Kang, and H.-Y. Shum. Symmetric stereo matching for occlusion handling. In CVPR, 2005.
  • [48] J. Tighe, M. Niethammer, and S. Lazebnik. Scene parsing with object instances and occlusion ordering. In CVPR, 2014.
  • [49] Z. Tu. Auto-context and its application to high-level vision tasks. In CVPR, 2008.
  • [50] Z. Tu, X. Chen, A. L. Yuille, and S.-C. Zhu. Image parsing: Unifying segmentation, detection, and recognition. International Journal of computer vision, 63(2):113–140, 2005.
  • [51] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International journal of computer vision, 104(2):154–171, 2013.
  • [52] V. Vapnik. Estimation of dependences based on empirical data. Springer Science & Business Media, 2006.
  • [53] X. Wang, T. X. Han, and S. Yan. An hog-lbp human detector with partial occlusion handling. In ICCV, 2009.
  • [54] X. Wang, T. Xiao, Y. Jiang, S. Shao, J. Sun, and C. Shen. Repulsion loss: Detecting pedestrians in a crowd. In CVPR, pages 7774–7783, 2018.
  • [55] Y. Xiong, R. Liao, H. Zhao, R. Hu, M. Bai, E. Yumer, and R. Urtasun. Upsnet: A unified panoptic segmentation network. arXiv preprint arXiv:1901.03784, 2019.
  • [56] Z. Zhang, S. Fidler, and R. Urtasun. Instance-level segmentation for autonomous driving with deep densely connected mrfs. In CVPR, 2016.
  • [57] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. CoRR, abs/1612.01105, 2016.
  • [58] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In ICCV, pages 1529–1537, 2015.
  • [59] S.-C. Zhu, D. Mumford, et al. A stochastic grammar of images. Foundations and Trends® in Computer Graphics and Vision, 2(4):259–362, 2007.
  • [60] Y. Zhu, Y. Tian, D. Metaxas, and P. Dollár. Semantic amodal segmentation. In CVPR, pages 1464–1472, 2017.