SOGNet: Scene Overlap Graph Network for Panoptic Segmentation
The panoptic segmentation task requires a unified result from semantic and instance segmentation outputs that may contain overlaps. However, current studies largely ignore the modeling of overlaps. In this study, we aim to model overlap relations among instances and resolve them for panoptic segmentation. Inspired by scene graph representation, we formulate the overlapping problem as a simplified case, named scene overlap graph. We leverage each object's category, geometry and appearance features to perform relational embedding, and output a relation matrix that encodes overlap relations. In order to overcome the lack of supervision, we introduce a differentiable module to resolve the overlap between any pair of instances. The mask logits after removing overlaps are fed into per-pixel instance classification, which leverages the panoptic supervision to assist in the modeling of overlap relations. Besides, we generate an approximate ground truth of overlap relations as weak supervision, to quantify the accuracy of the overlap relations predicted by our method. Experiments on COCO and Cityscapes demonstrate that our method accurately predicts overlap relations and outperforms state-of-the-art methods for panoptic segmentation. Our method also won the Innovation Award in the COCO 2019 challenge.
Convolutional Neural Networks (CNNs) have achieved huge success in computer vision tasks such as image recognition [He et al.2016, Yang et al.2018], semantic segmentation [Long, Shelhamer, and Darrell2015, Chen et al.2018], object detection [Girshick2015, Ren et al.2015], and instance segmentation [He et al.2017]. The semantic segmentation task answers which background scene a pixel belongs to, while the instance segmentation task predicts foreground object masks. Recently, the panoptic segmentation task introduced in [Kirillov et al.2019b] aims to unify the results of semantic and instance segmentation into a single pipeline. The system performs semantic segmentation for pixels that belong to amorphous background scenes, named stuff. For countable foreground objects, named things, the goal is to assign each object region the right thing class, as well as an instance id identifying which object it belongs to. As a result, panoptic segmentation cannot have overlapping segments. However, most cutting-edge high-performance instance segmentation methods [He et al.2017] adopt the region-based strategy [Girshick et al.2014], and output overlapping segments. As shown in Figure 1, object pairs such as cup-dining table, bottle-dining table, and bowl-dining table share overlapping regions in the instance segmentation output. Therefore, resolving overlaps and producing coherent segmentation results is the main challenge for the panoptic segmentation task [Kirillov et al.2019b].
In [Kirillov et al.2019b], the semantic and instance segmentation are trained separately, and their panoptic results are merged by heuristic post-processing steps. Later studies aim to unify the semantic and instance segmentation into an end-to-end training framework [Kirillov et al.2019a, Li et al.2019b, Liu et al.2019, Xiong et al.2019, Porzi et al.2019, Yang et al.2019, Li et al.2018]. The panoptic results are usually produced by fusion strategies [Kirillov et al.2019a, Li et al.2019b], or predicted by a panoptic head [Liu et al.2019, Xiong et al.2019]. These studies do not explicitly model overlap relations among objects, which is especially important for datasets with rich categories and complex scenes. However, modeling overlap is challenging without the supervision of object relations or depth information.
In this study, we introduce the scene overlap graph network (SOGNet) for panoptic segmentation. The SOGNet consists of four components: the joint segmentation, the relational embedding module, the overlap resolving module, and the panoptic head. The SOGNet trains semantic and instance segmentation in an end-to-end fashion, explicitly encodes overlap relations, resolves the overlap between any pair of objects in a differentiable way, and outputs a unified panoptic result in the panoptic head.
Similar to [Kirillov et al.2019a, Li et al.2019b, Liu et al.2019, Xiong et al.2019, Porzi et al.2019, Li et al.2018], we also use ResNets [He et al.2016] with feature pyramid network (FPN) [Lin et al.2017] as the shared backbone for our semantic and instance segmentation branches. Inspired by the relation classification in scene graph parsing tasks [Zellers et al.2018, Woo et al.2018], we formulate the overlapping problem in panoptic segmentation as a simplified scene graph with directed edges, in which there are only three relation types for instance $i$ with respect to instance $j$: no overlap, covering as a subject, and being covered as an object. We name this representation the scene overlap graph. We leverage the category, geometry, and appearance information of objects to perform edge feature embedding for the scene overlap graph, and output a matrix that explicitly encodes overlap relations. However, unlike scene graph parsing tasks, whose commonly used Visual Genome dataset [Krishna et al.2017] has relation annotations, the panoptic segmentation task does not offer annotations of object relations or depth information, so the overlap relations cannot be modeled with direct supervision.
In order to overcome this problem, we develop the overlap resolving module, which resolves the overlaps between any pair of instances in a differentiable way. The mask logits after removing overlaps are then used for per-pixel instance id classification in the panoptic head with the panoptic annotation. In doing so, the supervision from pixel-level classification helps the instance-level modeling of overlap relations.
We list the contributions in this study as follows:
We formulate the overlapping problem in panoptic segmentation as a structured representation, named scene overlap graph. Using category, geometry and appearance features, we perform relational embedding and output a matrix that explicitly encodes overlap relations.
In order to deal with the lack of supervision on overlap relations, we develop an overlap resolving module that resolves overlaps between any pair of instances in a differentiable way. The supervision from per-pixel instance id classification in the panoptic head helps to encode overlap relations. We also generate an approximate ground truth as weak supervision to quantify the accuracy of the overlap relations predicted by our network.
Experiments on the COCO and Cityscapes datasets show that our proposed method accurately predicts overlap relations and outperforms state-of-the-art methods for panoptic segmentation.
The semantic segmentation task focuses on background scenes and is based on fully convolutional networks (FCNs) [Long, Shelhamer, and Darrell2015]. Because detail information is important for dense prediction problems, later studies learn finer representation by deconvolution [Noh, Hong, and Han2015], encoder-decoder structures [Badrinarayanan, Kendall, and Cipolla2017], or introducing skip connections between down-sampling and up-sampling paths [Ronneberger, Fischer, and Brox2015]. Other methods aim to aggregate multi-scale context [Farabet et al.2013, Chen et al.2018, Zhao et al.2017], and better capture long-range dependencies [Zheng et al.2015, Li et al.2019a]. The instance segmentation task deals with foreground objects. Similar to object detection [Girshick2015, Ren et al.2015], many instance segmentation studies [Li et al.2017, He et al.2017] also adopt the region-based strategy [Girshick et al.2014], and are able to achieve strong performance due to accurate localization for instances. As another stream, segmentation-based methods [Liang et al.2018, Arnab and Torr2017] perform pixel-wise classification and then construct object instances by grouping.
The recently proposed task, panoptic segmentation [Kirillov et al.2019b], requires a unified result for background scenes and foreground objects. A naive implementation is to train the two sub-tasks separately, and then fuse the results by heuristic rules [Kirillov et al.2019b]. Follow-up studies train semantic and instance segmentation in an end-to-end network by sharing the backbone [de Geus, Meletis, and Dubbelman2018, Kirillov et al.2019a, Li et al.2019b, Liu et al.2019, Xiong et al.2019, Porzi et al.2019, Yang et al.2019, Li et al.2018]. Most of them use fusion heuristics to produce the final output. In [Liu et al.2019, Xiong et al.2019], a panoptic head is constructed to predict the instance id. Li et al. [Li et al.2018] introduce a binary mask to differentiate between thing and stuff for each pixel. A semi- and weakly-supervised method is proposed in [Li, Arnab, and Torr2018] to relieve the cost of pixel-level annotation.
An important aspect ignored by current panoptic segmentation studies is modeling and resolving overlaps. The study [Lazarow, Lee, and Tu2019] tries to learn instance occlusions but cannot resolve them in the end-to-end training. As a comparison, our study is able to explicitly model overlap relations, telling us whether an instance lies upon or beneath another, and resolve their overlaps in a differentiable way to generate the panoptic output.
Parsing relationships of objects has been one of the core components of visual understanding. In [Hu et al.2018], appearance and geometry features are used to build interactions for object detection. The visual relationship datasets, such as Visual Genome, inspire a series of studies on scene graph generation. In [Zellers et al.2018, Woo et al.2018], the low-rank outer product [Kim et al.2017] is adopted to perform relational embedding from object features. Other relation reasoning methods are proposed by graph-based propagation [Xu et al.2017], associative embedding [Newell and Deng2017], and introducing an efficient module [Santoro et al.2017].
In our study, we formulate the overlapping problem as a simplified scene graph, and also perform relational embedding to encode overlap relations. Our method differs from these studies in that our problem does not offer relation annotation to supervise. We use the supervision from panoptic head to help the modeling of overlap relations.
In the scene graph generation task [Zellers et al.2018, Woo et al.2018, Xu et al.2017], objects in an image are constructed as a graph and their relations are directed edges. We formulate the overlapping problem in panoptic segmentation as a similar structure, named scene overlap graph (SOG). There are three relation types for instance $i$ with respect to instance $j$: no overlap, covering as a subject, and being covered as an object. Our proposed SOGNet consists of four components. The joint segmentation connects semantic and instance segmentation in a unified network. The relational embedding module explicitly encodes overlap relations of objects. After the overlap resolving module, overlaps among instances are removed in a differentiable way. Finally, the panoptic head performs per-pixel instance id classification. An illustration of our SOGNet architecture is shown in Figure 2.
Following current popular methods, we use ResNet with FPN as the shared backbone of the semantic and instance segmentation branches. The Mask R-CNN structure is adopted for our instance segmentation head, which outputs the box regression, class prediction, and mask segmentation for foreground objects. As for the semantic head, the FPN feature maps first go through three deformable convolution layers [Dai et al.2017], and then are up-sampled to a common scale. Finally, they are concatenated to generate the per-pixel category prediction. This branch is supervised with both stuff and thing classes, and then the semantic logits of stuff classes are extracted into the panoptic head. We train our model using instance and panoptic annotations. The panoptic annotation, which gives the per-pixel category and instance id, supervises the semantic and panoptic heads, respectively. The instance annotation contains overlaps and is used for instance segmentation.
For any training image, we are given the ground truth $\{b_i, c_i, M_i\}_{i=1}^{N}$, where $b_i$, $c_i$, and $M_i$ refer to the bounding box, one-hot category, and corresponding mask for instance $i$, respectively, and $N$ is the number of instances in this image. As illustrated in Figure 2, we perform relational embedding using the ground truth in the training phase. During inference, we replace them with the predictions from the Mask R-CNN branch. The $b_i \in \mathbb{R}^{4}$ and $c_i \in \mathbb{R}^{80}$ (there are 80 thing classes for COCO) encode geometry and category information, respectively. In order to include the appearance feature, we resize the values of $M_i$ inside box $b_i$ to $28 \times 28$, which is consistent with the size of Mask R-CNN’s output. The resized mask is flattened into a vector, denoted as $m_i \in \mathbb{R}^{784}$.
The bilinear pooling method learns a joint representation for a pair of features and is widely applied to visual question answering [Kim et al.2017, Kim, Jun, and Zhang2018] and image recognition [Yu et al.2018] tasks. We construct our category and appearance relation features using the low-rank outer product in [Kim et al.2017]. For a pair of instances $i$ and $j$, their category relation feature is calculated as:
$$E^{(c)}_{i|j} = P^{\top}\big(\sigma(U^{\top} c_i) \circ \sigma(V^{\top} c_j)\big), \qquad (1)$$
where $\circ$ denotes the Hadamard product (element-wise multiplication), $\sigma$ is the ReLU non-linear activation, $U$ and $V$ are two linear embeddings that project the input into subject and object features, respectively, and $P$ maps the relation feature into the output dimension $d_c$. We then have the category relation features as:
$$E^{(c)} = \big[\, E^{(c)}_{i|j} \,\big]_{i,j=1}^{N} \in \mathbb{R}^{N \times N \times d_c}, \qquad (2)$$
where “[ ]” is the concatenation operation. In a similar way, using $m_i$ and $m_j$ as the input of Eq. (1), we can also construct the appearance relation features $E^{(m)} \in \mathbb{R}^{N \times N \times d_m}$.
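The low-rank outer product described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `lowrank_relation`, the hidden size `d_hid`, the output size `d_out`, and the random weights are all assumptions made for the example.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def lowrank_relation(feats, U, V, P):
    """Pairwise relation features via a low-rank outer product.

    feats: (N, d_in) per-instance inputs (e.g. one-hot categories).
    U, V:  (d_in, d_hid) subject / object embeddings (hypothetical shapes).
    P:     (d_hid, d_out) output projection.
    Returns E of shape (N, N, d_out), where
    E[i, j] = P^T (relu(U^T f_i) * relu(V^T f_j)).
    """
    subj = relu(feats @ U)                       # (N, d_hid)
    obj = relu(feats @ V)                        # (N, d_hid)
    joint = subj[:, None, :] * obj[None, :, :]   # (N, N, d_hid), Hadamard product per pair
    return joint @ P                             # (N, N, d_out)

rng = np.random.default_rng(0)
num_classes, d_hid, d_out = 80, 16, 8            # d_hid and d_out are arbitrary choices
cats = np.eye(num_classes)[[0, 5, 5]]            # three instances; the last two share a class
U = rng.normal(size=(num_classes, d_hid))
V = rng.normal(size=(num_classes, d_hid))
P = rng.normal(size=(d_hid, d_out))
E_c = lowrank_relation(cats, U, V, P)            # category relation features, (3, 3, 8)
```

Feeding flattened masks instead of one-hot categories into the same function would give the appearance relation features, with their own (separately learned) weights.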
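The relative geometry provides strong information to infer whether two objects overlap. Following [Hu et al.2018, Woo et al.2018], we have the translation- and scale-invariant relative geometry feature encoded as:
$$E^{(g)}_{i|j} = K^{\top}\left[\frac{x_i - x_j}{w_j},\ \frac{y_i - y_j}{h_j},\ \log\frac{w_i}{w_j},\ \log\frac{h_i}{h_j}\right]^{\top}, \qquad (3)$$
where $x_i$, $y_i$, $w_i$, and $h_i$ are the coordinates and scales extracted from $b_i$, and $K$ is a linear matrix that maps the 4-dimensional relative geometry feature into a high-dimensional space of size $d_g$. We can further have the geometry relation features $E^{(g)} \in \mathbb{R}^{N \times N \times d_g}$. We concatenate these edge representations about appearance, category, and geometry as:
$$E = \big[\, E^{(c)},\ E^{(m)},\ E^{(g)} \,\big], \qquad (4)$$
where $E \in \mathbb{R}^{N \times N \times (d_c + d_m + d_g)}$. The relational embedding is further used to encode overlap relations.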
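To make the translation and scale invariance concrete, the sketch below computes a 4-dimensional relative geometry feature for every pair of boxes. The exact functional form (normalized center offsets and log size ratios, in the style of the cited relation works) and the `(x_center, y_center, w, h)` box layout are assumptions of this example; the learned linear map to the high-dimensional space is omitted.

```python
import numpy as np

def relative_geometry(boxes):
    """Pairwise relative geometry, assuming boxes are (x_center, y_center, w, h).

    Returns G of shape (N, N, 4), where G[i, j] describes box i relative to box j.
    """
    boxes = np.asarray(boxes, dtype=float)
    x, y, w, h = boxes.T
    dx = (x[:, None] - x[None, :]) / w[None, :]   # center offset, normalized by width of j
    dy = (y[:, None] - y[None, :]) / h[None, :]   # center offset, normalized by height of j
    dw = np.log(w[:, None] / w[None, :])          # log width ratio
    dh = np.log(h[:, None] / h[None, :])          # log height ratio
    return np.stack([dx, dy, dw, dh], axis=-1)

boxes = np.array([[50.0, 40.0, 20.0, 10.0],
                  [55.0, 42.0, 40.0, 30.0],
                  [10.0, 10.0, 5.0, 5.0]])
G = relative_geometry(boxes)

# Shift every box by the same offset, then rescale the whole image.
moved = boxes.copy()
moved[:, :2] += np.array([7.0, -3.0])
moved *= 2.5
G_moved = relative_geometry(moved)               # numerically equal to G
```

Because only differences and ratios enter the feature, a global shift or rescaling of the image leaves it unchanged.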
Based on relational embedding, we introduce the overlap resolving module to explicitly model overlap relations and resolve overlaps among instances in a differentiable way.
As illustrated in Figure 2, the relation features, $E$, go through a layer to have a single-channel output with the sigmoid activation to restrict the values within $(0, 1)$. We reshape the output as a square matrix, denoted as $Q \in \mathbb{R}^{N \times N}$. The element $Q_{ij}$ has a physical meaning: it represents the potential of instance $i$ being covered by instance $j$. Because there can be only one overlap relation between instances $i$ and $j$, we then introduce the overlap relation matrix defined as:
$$O = \sigma\big(Q - Q^{\top}\big), \qquad (5)$$
where $\sigma$ denotes the ReLU activation that is used to filter out the negative differences between potentials on symmetric positions. In doing so, if $O_{ij} > 0$, it encodes that instance $i$ is being covered by instance $j$, while on its symmetric position, $O_{ji} = 0$. When $O_{ij} = O_{ji} = 0$, the instances $i$ and $j$ do not overlap. Besides, all diagonal elements equal 0. As explained later, the positive elements in $O$ will be optimized towards 1 in implementations. We now show how to leverage the overlap relation matrix to resolve overlaps.
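The construction of the overlap relation matrix from the potentials can be checked on a small example. The sketch below assumes the potentials are already available as an N x N array `Q` with values in (0, 1); the name `Q` and the toy values are illustrative only.

```python
import numpy as np

def overlap_relation(Q):
    """ReLU of the antisymmetric difference of potentials.

    Q[i, j] is the potential of instance i being covered by instance j.
    The result O keeps, for each pair, only the direction with the larger potential.
    """
    return np.maximum(Q - Q.T, 0.0)

Q = np.array([[0.5, 0.9, 0.2],
              [0.1, 0.5, 0.2],
              [0.6, 0.2, 0.5]])
O = overlap_relation(Q)
# O[0, 1] > 0: instance 0 is covered by instance 1.
# O[2, 0] > 0: instance 2 is covered by instance 0.
# O[1, 2] == O[2, 1] == 0: instances 1 and 2 are treated as non-overlapping.
```

At most one of `O[i, j]` and `O[j, i]` can be positive, and the diagonal is always zero, which matches the "only one overlap relation per pair" requirement.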
For each bounding box, $b_i$, of the ground truth instances, we have its mask logits (the activations before sigmoid) of size $28 \times 28$ from the Mask R-CNN output. We then interpolate these logits back to the image scale $H \times W$ by bilinear interpolation and padding outside the box. These logits, denoted as $\{A_i\}_{i=1}^{N}$, may have overlaps because Mask R-CNN is region-based and operates on each region independently. Using the matrix $O$, we can deal with the overlaps between instances $i$ and $j$ as:
$$A'_i = A_i - A_i \circ \big[s(A_i) \circ s(A_j)\big]\, O_{ij}, \qquad (6)$$
where $A'_i$ is the output logit of instance $i$, and $s(\cdot)$ represents the sigmoid activation that turns the logit $A_i$ into a binary-like mask. The element-wise multiplication, $s(A_i) \circ s(A_j)$, calculates the intersecting region between instances $i$ and $j$. The value $O_{ij}$ decides whether the elements in the intersecting region should be removed from the logit $A_i$. When $O_{ij}$ approaches 0, $A'_i$ equals $A_i$, thus the logit will not be affected, and vice versa.
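Considering the overlap relations of all the other instances on $i$, we have:
$$A'_i = A_i - A_i \circ s(A_i) \circ \sum_{j=1}^{N} s(A_j)\, O_{ij}, \qquad (7)$$
and then the computational steps of the overlap resolving module can be formulated as:
$$A' = A - A \circ \big( s(A) \circ (s(A) \times_3 O) \big), \qquad (8)$$
where $A \in \mathbb{R}^{H \times W \times N}$, and $\times_3$ denotes the Tucker product along the 3rd dimension (reshape $s(A)$ as $HW \times N$ for the inner product with $O^{\top}$, and then return to $H \times W \times N$). We see that our module is friendly to tensor operations in current deep learning frameworks, and is differentiable for resolving overlaps, so that the SOGNet can be trained in an end-to-end fashion.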
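The tensor form of the overlap resolving step (reshape, inner product with the transposed relation matrix, reshape back) can be sketched in a few lines of NumPy. This is an illustrative reading of the description above under the assumption that the removed quantity is the logit itself weighted by the soft intersection; the toy logits and the hard 0/1 relation matrix are made up for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def resolve_overlaps(A, O):
    """Remove overlapping regions from mask logits.

    A: (H, W, N) mask logits at image scale.
    O: (N, N) overlap relations, O[i, j] > 0 meaning instance i is covered by j.
    """
    H, W, N = A.shape
    S = sigmoid(A)                                        # binary-like masks
    # For each instance i: sum_j S_j * O[i, j], computed as (HW x N) @ O^T.
    covering = (S.reshape(H * W, N) @ O.T).reshape(H, W, N)
    return A - A * S * covering

# Two instances on a 4x4 image that overlap in column 2.
A = np.full((4, 4, 2), -10.0)
A[:, :3, 0] = 10.0                                        # instance 0: columns 0..2
A[:, 2:, 1] = 10.0                                        # instance 1: columns 2..3
O = np.array([[0.0, 1.0],
              [0.0, 0.0]])                                # instance 0 is covered by instance 1
A_res = resolve_overlaps(A, O)
```

In the overlapping column the logit of instance 0 is pushed to roughly zero, while instance 1 and the non-overlapping part of instance 0 are left essentially unchanged. Every operation is a product, sum, or sigmoid, so gradients flow to both the logits and the relation matrix.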
The overlap relation matrix, $O$, explicitly encodes whether there is an intersection between any pair of instances and, if there is, from which instance the overlapping region should be removed. However, the panoptic segmentation task does not provide supervision of overlap relations. Because accurately resolving overlaps has a strong correlation with the quality of the final panoptic output, we can exploit the pixel-level panoptic annotation to assist in the process of modeling the overlap relations encoded by $O$. As illustrated in Figure 2, the instance logits after the SOG module are then fed into the panoptic head.
Following UPSNet [Xiong et al.2019], we incorporate the logits from the semantic head into the mask logits $A'_i$. We get the logits of the $i$-th object from the semantic output, denoted as $X_i$, by taking the values inside its ground truth box $b_i$ from the channel corresponding to its ground truth category $c_i$, and padding zeros outside the box. In UPSNet, they are combined by addition, which is denoted as “Panoptic Head 1”. Here we develop an improved combination denoted as “Panoptic Head 2”. They are compared as:
where $Z_i$ is the combined logit, $s(\cdot)$ denotes the sigmoid function, and $k$ is a factor to balance the numerical difference between semantic output values and mask logits. We set $k$ to be 2 in our experiments. Finally, we concatenate the combined instance logits and the stuff logits from the semantic head to perform per-pixel instance classification with the standard cross entropy loss function.
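As a rough illustration of the additive combination ("Panoptic Head 1") and the per-pixel classification that follows it, the sketch below crops the semantic logits per instance, adds them to the mask logits, concatenates the stuff channels, and evaluates a cross entropy loss. The improved "Panoptic Head 2" is not reproduced here. All function names, the `(x0, y0, x1, y1)` integer box layout, and the toy shapes are assumptions of this example.

```python
import numpy as np

def instance_semantic_logits(sem_logits, boxes, classes):
    """Take semantic logits inside each box from the instance's class channel; zero elsewhere.

    sem_logits: (H, W, C_thing); boxes: list of (x0, y0, x1, y1) ints; classes: list of ints.
    """
    H, W, _ = sem_logits.shape
    X = np.zeros((H, W, len(boxes)))
    for i, ((x0, y0, x1, y1), c) in enumerate(zip(boxes, classes)):
        X[y0:y1, x0:x1, i] = sem_logits[y0:y1, x0:x1, c]
    return X

def panoptic_head_additive(A_res, X, stuff_logits):
    """Additive combination, followed by concatenation with the stuff channels."""
    Z = A_res + X                                          # (H, W, N)
    return np.concatenate([stuff_logits, Z], axis=-1)      # (H, W, N_stuff + N)

def cross_entropy(logits, target):
    """Mean per-pixel cross entropy; target holds one channel index per pixel."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    H, W = target.shape
    picked = log_prob[np.arange(H)[:, None], np.arange(W)[None, :], target]
    return -picked.mean()

rng = np.random.default_rng(0)
H, W = 4, 4
sem_thing = rng.normal(size=(H, W, 2))                     # 2 thing classes
stuff = rng.normal(size=(H, W, 1))                         # 1 stuff class
boxes, classes = [(0, 0, 2, 2), (1, 1, 4, 4)], [0, 1]
A_res = rng.normal(size=(H, W, 2))                         # mask logits after overlap resolving
X = instance_semantic_logits(sem_thing, boxes, classes)
logits = panoptic_head_additive(A_res, X, stuff)
target = logits.argmax(axis=-1)                            # stand-in for the panoptic annotation
loss = cross_entropy(logits, target)
```

The channel index predicted at each pixel is either a stuff class or an instance id, so a single softmax over the concatenated logits yields a panoptic output without overlaps.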
Although we do not have the supervision to know which instance lies on the other one, we can leverage the ground truth binary masks, $\{M_i\}_{i=1}^{N}$, to infer whether two instances overlap. We produce a symmetric relation matrix $R \in \mathbb{R}^{N \times N}$ defined as:
$$R_{ij} = \mathbb{1}\!\left[\frac{|M_i \circ M_j|}{\min(|M_i|, |M_j|)} > \tau\right], \quad i \neq j, \qquad (11)$$
where $|\cdot|$ calculates the area of a binary mask through the sum operation, $M_i \circ M_j$ calculates the intersection mask through element-wise multiplication, and $\mathbb{1}[\cdot]$ denotes the indicator function that equals 1 when the condition holds. All diagonal elements are filled with 0. When $R_{ij} = 1$, it indicates that the ratio of the intersection to the smaller object is larger than the threshold $\tau$, which means there is a significant overlap between instances $i$ and $j$. With the symmetric relation matrix $R$, we can introduce the relation loss function as:
$$L_r = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \big(O_{ij} + O_{ji} - R_{ij}\big)^2, \qquad (12)$$
which calculates the mean squared error between $O + O^{\top}$ and $R$. In doing so, when there is overlap between instances $i$ and $j$, i.e., $R_{ij} = 1$, the overlap relation $O_{ij}$ or $O_{ji}$ is forced to approach 1, so that it will not contribute trivially when removing overlaps by Eq. (6).
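The symmetric relation matrix and the relation loss can be sketched as below. The threshold value passed as `thresh` is a placeholder chosen for the example, not the value used in the paper, and the loss is written under the assumption that the symmetric target is compared against the sum of the relation matrix and its transpose.

```python
import numpy as np

def symmetric_relation(masks, thresh):
    """R[i, j] = 1 if intersection / smaller area exceeds thresh; diagonal is 0.

    masks: (N, H, W) binary ground-truth masks (may overlap).
    """
    flat = masks.reshape(len(masks), -1).astype(float)
    inter = flat @ flat.T                                  # pairwise intersection areas
    area = flat.sum(axis=1)
    smaller = np.minimum(area[:, None], area[None, :])
    R = (inter / np.maximum(smaller, 1.0) > thresh).astype(float)
    np.fill_diagonal(R, 0.0)
    return R

def relation_loss(O, R):
    """Mean squared error between the symmetrized relations and the target."""
    return np.mean((O + O.T - R) ** 2)

masks = np.zeros((3, 8, 8), dtype=bool)
masks[0, 0:4, 0:4] = True                                  # 16 px
masks[1, 2:6, 2:6] = True                                  # 16 px, shares 4 px with mask 0
masks[2, 6:8, 6:8] = True                                  # disjoint from both
R = symmetric_relation(masks, thresh=0.2)                  # placeholder threshold

O_good = np.array([[0.0, 1.0, 0.0],
                   [0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])                       # one direction active for pair (0, 1)
O_none = np.zeros((3, 3))
loss_good = relation_loss(O_good, R)
loss_none = relation_loss(O_none, R)
```

The loss does not say which of the two instances is on top; it only penalizes overlapping pairs whose relation is left at zero in both directions. The direction is learned from the panoptic supervision.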
In total, our SOGNet has the loss functions for semantic and instance segmentation, the panoptic loss for instance id classification, and the relation loss to help optimize the overlap relation matrix $O$.
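We adopt the evaluation metric introduced in [Kirillov et al.2019b], called Panoptic Quality (PQ). It can be viewed as the multiplication of a segmentation term (SQ) and a recognition term (RQ):
$$\mathrm{PQ} = \underbrace{\frac{\sum_{(p, g) \in TP} \mathrm{IoU}(p, g)}{|TP|}}_{\mathrm{SQ}} \times \underbrace{\frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}}_{\mathrm{RQ}}, \qquad (13)$$
where $p$ and $g$ are predicted and ground truth segments, and $TP$, $FP$, and $FN$ denote the true positive, false positive and false negative sets, respectively.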
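For a single class, the decomposition of PQ into SQ and RQ can be computed from the IoUs of the matched segment pairs and the counts of unmatched segments. The sketch below assumes matching has already been done; the function name and the toy numbers are illustrative.

```python
def panoptic_quality(tp_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) for one class.

    tp_ious: IoU values of the matched (predicted, ground truth) segment pairs.
    num_fp:  number of unmatched predicted segments.
    num_fn:  number of unmatched ground truth segments.
    """
    num_tp = len(tp_ious)
    denom = num_tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0, 0.0, 0.0
    sq = sum(tp_ious) / num_tp if num_tp else 0.0   # average IoU of matched pairs
    rq = num_tp / denom                             # F1-style recognition term
    return sq * rq, sq, rq

pq, sq, rq = panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1)
```

With two matches of IoU 0.8 and 0.6, one false positive, and one false negative, SQ is 0.7 and RQ is 2/3. The reported dataset-level PQ averages the per-class values.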
For datasets such as COCO, the instance annotation permits overlapping instances, while the panoptic annotation contains no overlaps. We can leverage the difference between the two annotations to generate an approximate ground truth of overlap relations, in order to test the quality of the overlap relations predicted by our model. This method is also used in [Lazarow, Lee, and Tu2019] to generate their occlusion ground truth.
Concretely, we are provided with the instance annotation $\{M_i\}_{i=1}^{N}$ and the panoptic annotation $\{\tilde{M}_i\}_{i=1}^{N}$. For any pair of instances $i$ and $j$, we calculate the intersecting region by $M_i \circ M_j$, and inspect which one of $\tilde{M}_i$ and $\tilde{M}_j$ mainly covers the intersecting region, to know if $i$ lies upon $j$ or the other way round. Note that the instance and panoptic annotations are not seamlessly matched. Thus this method can only produce approximately true overlap relations.
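The procedure for deriving approximate overlap relations from the two annotations can be sketched as follows. For simplicity the example assumes that the panoptic annotation is given as a single id map whose pixel value is the index of the instance that owns the pixel (and -1 elsewhere); in practice the two annotations must first be matched, and the function name is hypothetical.

```python
import numpy as np

def approx_overlap_gt(inst_masks, pan_ids):
    """Approximate asymmetric overlap relations.

    inst_masks: (N, H, W) bool instance masks, which may overlap.
    pan_ids:    (H, W) int map; value i marks pixels assigned to instance i, -1 otherwise.
    Returns R of shape (N, N) with R[i, j] = 1 if instance i is judged covered by j.
    """
    N = len(inst_masks)
    R = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            inter = inst_masks[i] & inst_masks[j]
            if not inter.any():
                continue
            votes_i = np.sum(pan_ids[inter] == i)          # intersection pixels owned by i
            votes_j = np.sum(pan_ids[inter] == j)          # intersection pixels owned by j
            if votes_j > votes_i:
                R[i, j] = 1.0                              # j is on top, so i is covered by j
            elif votes_i > votes_j:
                R[j, i] = 1.0
    return R

inst_masks = np.zeros((2, 6, 6), dtype=bool)
inst_masks[0, 0:4, 0:4] = True
inst_masks[1, 2:6, 2:6] = True
pan_ids = np.full((6, 6), -1)
pan_ids[0:4, 0:4] = 0
pan_ids[2:6, 2:6] = 1                                      # instance 1 owns the shared region
R_approx = approx_overlap_gt(inst_masks, pan_ids)
```

Here the shared 2x2 region belongs to instance 1 in the panoptic map, so instance 0 is marked as covered by instance 1.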
Using the synthetic ground truth as weak supervision, we construct a new asymmetric relation matrix $R'$. When $R'_{ij} = 1$, we have $R'_{ji} = 0$, and it means instance $i$ is covered by instance $j$. In this weakly-supervised setting, we can replace Eq. (12) with a new relation loss function, Eq. (14), which directly supervises the overlap relation matrix $O$. In experiments, the weakly-supervised manner by Eq. (14) and our method by Eq. (12) have similar performance. Note that this supervision is only valid for datasets such as COCO whose instance and panoptic annotations differ. It will be ineffective for datasets such as Cityscapes. Our method by Eq. (12), however, works in both cases.
Thus the weakly-supervised manner by Eq. (14) serves to test the efficacy of our method. Using the weak supervision $R'$, we develop a metric, named overlap accuracy (OA), to quantify the quality of the overlap predictions encoded by $O$. The OA of an image is calculated as:
where , and . Our reported OA is an average over all images in the validation set.
We conduct experiments on the COCO and Cityscapes datasets for panoptic segmentation, and show that our proposed SOGNet accurately predicts overlap relations and outperforms state-of-the-art methods.
We set the weights of the loss functions following [Xiong et al.2019]. The weight of the panoptic head is 0.1 for COCO and 0.5 for Cityscapes. The weight of the relation loss is set to 1.0. We train the models with a batch size of images distributed on 8 GPUs. The region proposal network (RPN) is also trained end-to-end. The SGD optimizer with 0.9 Nesterov momentum and a weight decay of is used. We use an equivalent setting to UPSNet for fair comparison. Images are resized with the shorter edge as 800, and the longer edge less than 1333. We freeze all batch normalization (BN) [Ioffe and Szegedy2015] layers within the ResNet backbone. For COCO, we train the SOGNet for 180K iterations. The initial learning rate is set to and is divided by at the 120K-th and 160K-th iterations. For Cityscapes, we train for 24K iterations and drop the learning rate at the 18K-th iteration. Besides, in order to test the quality of our overlap predictions, we perform an ablation study on COCO using a shorter training schedule because our relation loss converges quickly. We only train for 45K iterations and drop the learning rate at iterations 30K and 40K. We do not adopt the void channel prediction proposed in UPSNet. In implementations, we filter out the instances that have no overlap with any other instance to reduce negative samples and computation overhead.
During the inference phase, the ground truths used as the input of our relational embedding are replaced with the predictions from the Mask R-CNN branch. In order to remove invalid instances, we filter out instances whose probability is lower than a threshold, and perform an NMS-like procedure, following [Kirillov et al.2019b, Xiong et al.2019]. For highly overlapped predictions of the same class, we keep the mask with the higher confidence score and discard the other one if the intersection is larger than a threshold. Otherwise, we keep the non-intersecting part and deal with the next instance. The final output is predicted by our panoptic head. For stuff segments whose area is lower than 4096, we set the corresponding region as void.
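The NMS-like filtering of predicted instances can be sketched as a greedy pass over the masks in order of decreasing confidence. Both thresholds are placeholders, since their values are not given here, and the function name and the use of the intersection-to-area ratio as the overlap measure are assumptions of this example.

```python
import numpy as np

def merge_instances(masks, scores, classes, score_thresh, overlap_thresh):
    """Greedy, per-class filtering of predicted masks.

    masks:   list of bool arrays with identical shape.
    scores:  confidence per mask; classes: class label per mask.
    Returns a list of (original_index, kept_mask) pairs.
    """
    order = np.argsort(-np.asarray(scores))
    used = {}                                    # class label -> union of masks kept so far
    kept = []
    for i in order:
        if scores[i] < score_thresh:
            continue                             # drop low-confidence predictions
        mask = masks[i]
        area = mask.sum()
        if area == 0:
            continue
        taken = used.get(classes[i])
        if taken is not None:
            inter = (mask & taken).sum()
            if inter / area > overlap_thresh:
                continue                         # mostly duplicates a higher-scoring mask
            mask = mask & ~taken                 # keep only the non-intersecting part
        used[classes[i]] = mask if taken is None else (taken | mask)
        kept.append((int(i), mask))
    return kept

def span(lo, hi, n=10):
    m = np.zeros(n, dtype=bool)
    m[lo:hi] = True
    return m

masks = [span(0, 6), span(1, 6), span(5, 10), span(0, 3)]
scores = np.array([0.9, 0.8, 0.7, 0.1])
classes = [1, 1, 1, 1]
kept = merge_instances(masks, scores, classes, score_thresh=0.5, overlap_thresh=0.5)
```

In this toy case the second mask is discarded as a near duplicate of the first, the third keeps only its non-intersecting part, and the fourth is dropped for low confidence. The surviving instances are then passed to the relational embedding and the panoptic head.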
|Method||PQ||SQ||RQ|
|PlainNet + heuristics||39.6||78.7||48.4|
|PlainNet + heuristics + label prior||40.9||78.8||49.7|
|PlainNet + PH1||42.3||78.6||52.1|
We use ResNet-50 as the backbone with a short training schedule, and conduct experiments to analyze feature combinations for our relational embedding and to test the quality of the overlap relations predicted by our method. As shown in Table 1, we use different features as the input of our relational embedding. When only the category or geometry feature is adopted, the improvement in PQ is small, and the overlap prediction does not reach high accuracy. When category and geometry features are used together, the embedding becomes much more powerful. The mask feature also slightly improves the overlap accuracy. We expect that a more sophisticated feature design will further boost the performance. It is observed that the weakly-supervised method by Eq. (14) achieves a similar result to our method by Eq. (12). As shown in Figure 3, we visualize the overlap relations predicted by $O$, as well as the approximate ground truth, $R'$, on images from the validation set. More examples can be found in the supplementary material. The matrix $O$ accurately predicts some overlap relations, including baseball glove → person, tie → person → bus, and spoon → cup → dining table. The results demonstrate that the overlap relations are modeled well with the help of supervision from per-pixel instance id classification in the panoptic head. Our method is able to encode overlap relations without direct supervision on them.
Using the standard training schedule and ResNet-50 as the backbone, we also perform comparisons between SOGNet and heuristic inference. The heuristics in [Kirillov et al.2019b] sort instances according to their objectness scores to deal with overlaps. In [Li et al.2019b], some hand-crafted label priors are defined to fix overlap orders. For example, tie should always cover person. In contrast, SOGNet explicitly predicts overlap relations and resolves overlaps in a differentiable way. We train the joint segmentation component of SOGNet as a PlainNet, and perform inference with different methods. As shown in Table 2, the label prior helps to improve the performance. When PlainNet adds the panoptic head for inference to produce the panoptic results, the performance becomes better. The SOGNet with relational embedding and overlap resolving brings a further improvement, and our proposed Panoptic Head 2 (PH2) performs better than PH1. In Figure 4, we visualize the panoptic segmentation results of heuristic inference and SOGNet. It is shown that SOGNet deals better with the overlapping problem.
We run SOGNet on the COCO and Cityscapes datasets, and compare the results with state-of-the-art methods including the method in [Li, Arnab, and Torr2018], JSIS [de Geus, Meletis, and Dubbelman2018], TASCNet [Li et al.2018], Panoptic FPN [Kirillov et al.2019a], OANet [Liu et al.2019], AUNet [Li et al.2019b], UPSNet [Xiong et al.2019], and OCFusion [Lazarow, Lee, and Tu2019].
As shown in Table 3, with ResNet-101-FPN as the backbone, our proposed SOGNet achieves the highest single-model performance on the COCO test-dev set. It has a 1.3% PQ improvement over AUNet, which uses a larger backbone. SOGNet also performs better than UPSNet using the same backbone and training schedule.
The results of SOGNet on the COCO and Cityscapes validation sets are shown in Table 4. It is observed that SOGNet generalizes well to Cityscapes, with a 0.7% improvement over TASCNet and UPSNet. On the COCO validation set, SOGNet has a 1.2% improvement over UPSNet using the same backbone. The mIoU and AP of SOGNet are 54.56 and 34.2 on COCO, which are similar to the results of UPSNet (54.3 and 34.3 as reported). This indicates that our better panoptic performance is not derived from a stronger semantic or instance segmentation model. More importantly, SOGNet is the only method that can explicitly encode overlap relations and tell us which instance lies upon or beneath another.
|Q.Li et al.||ResNet-101||53.8||42.5||62.1|
In this study, we aim to model overlap relations and resolve overlaps in a differentiable way for panoptic segmentation. We develop the SOGNet composed of the joint segmentation, the relational embedding module, the overlap resolving module, and the panoptic head. It is able to explicitly encode overlap relations without direct supervision on them. Ablation studies break SOGNet down and analyze the efficacy of each component. Experiments demonstrate that SOGNet accurately predicts overlap relations, and outperforms the state-of-the-art methods on both COCO and Cityscapes.
Z. Lin is supported by NSF China under grant nos. 61625301 and 61731018 and Zhejiang Lab.
We offer more visualization examples of the overlap relations predicted by our method, as shown in Figure 5.
Conditional random fields as recurrent neural networks. In ICCV, 1529–1537.