Instance segmentation, which segments out every object of interest, is a fundamental task in computer vision. It is crucial for autonomous driving, where the position of every object instance on the road must be known. Previous approaches to instance segmentation on images, such as Mask-RCNN, operate only on RGB imagery. However, image data can be affected by illumination, color change, shadows, or optical defects, and these factors can degrade the performance of image-based instance segmentation. Another modality that provides geometric cues about the scene can add more robust information, because object shapes are independent of object texture and color change and therefore act as strong priors. A prior work that goes beyond the dominant paradigm by incorporating depth information uses it only for naive ordering, rather than directly regressing masks or building an end-to-end trainable model that propagates depth information. Moreover, its depth maps are predicted from monocular images, making the depth ordering unreliable.
In outdoor scenes, stereo cameras or lidar sensors are commonly used for depth acquisition. Stereo cameras are low-cost, and their adjustable parameters, such as longer baselines ($b$) and focal lengths ($f$), favor stereo matching at far fields. The relationship between depth $D$ and disparity $d$ is given by

$$D = \frac{f \cdot b}{d}. \tag{1}$$

A disparity of 1 pixel (the minimal pixel difference, which gives the ideal longest range a stereo system can detect) therefore corresponds to a farther distance when $f$ and $b$ are longer. In addition, longer baselines and focal lengths favor more precise geometric estimation: longer baselines produce smaller triangulation error, and longer focal lengths project objects onto more image pixels, which makes stereo matching more robust and shows more complete shapes.
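As a worked illustration of the depth-disparity relation, the sketch below computes depth from disparity and the ideal range at 1-pixel disparity. The function name and the two sample rigs are hypothetical values chosen for illustration; they are not the parameters of any dataset discussed here.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth in meters for a rectified stereo pair: D = f * b / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Hypothetical rigs: doubling both f and b quadruples the range at 1-pixel disparity.
short_rig = depth_from_disparity(focal_px=1000.0, baseline_m=0.25, disparity_px=1.0)
long_rig = depth_from_disparity(focal_px=2000.0, baseline_m=0.5, disparity_px=1.0)
print(short_rig, long_rig)
```

Because the range at a fixed disparity scales with the product of focal length and baseline, a rig can be compared against another by that product alone.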
In this paper, we propose the Geometry-Aware Instance Segmentation Network (GAIS-Net), which takes advantage of both semantic information from the image domain and geometric information from disparity maps. Our contributions are summarized as follows:
1. To our knowledge, we are the first to perform instance segmentation on imagery by fusing images and disparity information to regress object masks.
2. We collect the High-Quality Driving Stereo (HQDS) dataset, with a total of 8.8K stereo pairs and 4 times larger than the current best dataset, Cityscapes.
3. We present GAIS-Net, a design that aggregates representations for instance segmentation using images together with image-based and point cloud-based networks. We train GAIS-Net with different losses and fuse the resulting predictions using mask scoring. GAIS-Net achieves state-of-the-art performance.
Our goal is to construct an end-to-end trainable network that performs instance segmentation for autonomous driving. Our system segments out each instance and outputs confidence scores for the bounding box and mask of each instance. To exploit geometric information, we adopt PSMNet, the state-of-the-art stereo matching network, and introduce disparity information at the ROI heads. The whole network design is shown in Fig. 1.
We build a two-stage detector with a backbone network, such as ResNet50-FPN, and a region proposal network (RPN) with non-maximum suppression. Object proposals are collected by feeding the left image of a stereo pair into the backbone network and RPN. As in Mask-RCNN, we perform bounding box regression, class prediction for proposals, and mask prediction based on image-domain features. The corresponding losses are denoted as $L_{box}$, $L_{cls}$, and $L_{2Dmask}$, and are defined as in Mask-RCNN.
2.1 Geometry-Aware Mask Prediction
2.5D ROI and 3D ROI. We use PSMNet and stereo pairs to predict dense disparity maps, projected onto the left stereo frame. Next, the RPN outputs region proposals. We collect the proposals and crop the corresponding areas out of the disparity map. We call these cropped-out disparity areas 2.5D ROIs.
Based on the observations from the pseudo-lidar work, which describes the advantage of back-projecting 2D grid-structured data into a 3D point cloud and processing it with point cloud networks, we back-project the disparity map into $\mathbb{R}^3$ space, where for each point, the first and second components describe its 2D grid coordinates, and the third component stores its disparity value. We name this representation the 3D ROI.
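The cropping and back-projection steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `roi_to_points`, the end-exclusive `(x0, y0, x1, y1)` box convention, and the row-major point order are all assumptions.

```python
import numpy as np


def roi_to_points(disparity: np.ndarray, box: tuple) -> np.ndarray:
    """Crop a 2.5D ROI from a disparity map and lift it to (u, v, disparity) points.

    box is (x0, y0, x1, y1) in pixels, end-exclusive (an assumed convention).
    Points are emitted in row-major order, so each point's grid cell stays recoverable.
    """
    x0, y0, x1, y1 = box
    roi = disparity[y0:y1, x0:x1]  # the 2.5D ROI
    vs, us = np.meshgrid(np.arange(y0, y1), np.arange(x0, x1), indexing="ij")
    return np.stack([us.ravel(), vs.ravel(), roi.ravel()], axis=1).astype(np.float32)


# Toy disparity map and a hypothetical proposal box.
disp = np.arange(20, dtype=np.float32).reshape(4, 5)
pts = roi_to_points(disp, (1, 1, 4, 3))
print(pts.shape)
```

Each row of `pts` holds the two grid coordinates followed by the disparity value, matching the three components described above.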
Instance Segmentation Networks.
Each 3D ROI contains a different number of points. To facilitate training, we uniformly sample each 3D ROI to 1024 points and collect all the 3D ROIs into a tensor. We develop a PointNet-structured instance segmentation network to extract point features and perform per-point mask probability prediction. We re-project the 3D features onto the 2D grid to calculate the mask prediction and its loss. The re-projection is efficient because the point order is not broken in the point cloud-based instance segmentation. The loss $L_{3Dmask}$, like the image-domain mask loss $L_{2Dmask}$, is a cross-entropy loss between a predicted probability mask and its matched groundtruth.
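A minimal sketch of fixed-size sampling and order-preserving re-projection is given below. It assumes evenly strided sampling over a row-major point list and leaves unsampled grid cells at zero; both choices, and the function names, are illustrative assumptions rather than details taken from the text.

```python
import numpy as np


def sample_indices(num_points: int, num_samples: int = 1024) -> np.ndarray:
    """Evenly spaced indices into a row-major point list; the order is preserved."""
    if num_points < 1:
        raise ValueError("ROI must contain at least one point")
    return np.linspace(0, num_points - 1, num_samples).round().astype(np.int64)


def scatter_to_grid(point_probs: np.ndarray, idx: np.ndarray,
                    height: int, width: int) -> np.ndarray:
    """Write per-point mask probabilities back to their 2D grid cells."""
    grid = np.zeros(height * width, dtype=point_probs.dtype)
    grid[idx] = point_probs  # idx maps each sampled point to its flat grid position
    return grid.reshape(height, width)


# Hypothetical 40x60 ROI with a constant per-point prediction.
height, width = 40, 60
idx = sample_indices(height * width)
point_probs = np.ones(idx.shape[0], dtype=np.float32)
grid = scatter_to_grid(point_probs, idx, height, width)
print(grid.shape, int(grid.sum()))
```

Because the indices are kept alongside the sampled points, re-projection is a single indexed assignment with no nearest-neighbor search.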
To fully utilize the advantages of different representations, we further perform 2.5D ROI instance segmentation with an image-based CNN. Similar to instance segmentation on 2D ROIs, this network extracts local features of the 2.5D ROI and then performs per-pixel mask probability prediction. The mask prediction loss is denoted as $L_{2.5Dmask}$.
2.2 Mask Continuity
We sample each 3D ROI to 1024 points uniformly. However, the predicted masks, denoted as $M_{3D}$, and their outlines are sensitive to the pseudo-lidar sampling strategy. An undesirable sampling is illustrated in Fig. 2. To compensate for this effect, we introduce a mask continuity loss. Since objects are structured and continuous, we calculate a mask Laplacian as $\nabla^2 M_{3D} = \frac{\partial^2 M_{3D}}{\partial x^2} + \frac{\partial^2 M_{3D}}{\partial y^2}$, where $x$ and $y$ denote the dimensions of $M_{3D}$. The mask Laplacian measures the continuity of $M_{3D}$. The mask continuity loss is then calculated as $L_{cont} = \lVert \nabla^2 M_{3D} \rVert^2$, penalizing discontinuities of $M_{3D}$.
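To make the continuity term concrete, the sketch below evaluates a discrete mask Laplacian with the standard 5-point stencil and reduces it to a scalar. The stencil, the restriction to interior pixels, and the mean-of-squares reduction are assumptions made for illustration.

```python
import numpy as np


def mask_laplacian(mask: np.ndarray) -> np.ndarray:
    """5-point finite-difference Laplacian on the interior of a 2D probability mask."""
    return (mask[:-2, 1:-1] + mask[2:, 1:-1] + mask[1:-1, :-2] + mask[1:-1, 2:]
            - 4.0 * mask[1:-1, 1:-1])


def continuity_loss(mask: np.ndarray) -> float:
    """Mean squared Laplacian; the mean reduction is an assumed choice."""
    return float(np.mean(mask_laplacian(mask) ** 2))


smooth = np.ones((14, 14))
holey = smooth.copy()
holey[7, 7] = 0.0  # a single dropped pixel, as a sparse sampling might leave
print(continuity_loss(smooth), continuity_loss(holey))
```

A uniformly filled mask has zero loss, while a single isolated hole produces a positive penalty at the hole and its four neighbors.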
2.3 Representation Correspondence
We use the point cloud-based network and the image-based network to extract features and regress $M_{3D}$ and $M_{2.5D}$. These two masks should be similar because they are derived from the same disparity map. To evaluate the similarity, the cross-entropy between $M_{3D}$ and $M_{2.5D}$ is calculated and serves as a self-supervised correspondence loss $L_{corr}$. Minimizing this term lets the networks of different representations supervise each other to extract more descriptive features for mask regression, resulting in similar probability distributions for $M_{3D}$ and $M_{2.5D}$. Mask-RCNN uses a $14 \times 14$ feature grid after ROI pooling to regress masks. We also use this size at the mask heads.
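The correspondence term can be sketched as a per-pixel binary cross-entropy between the two probability masks. Which mask plays the role of the target, the clipping constant `eps`, and the mean reduction are assumptions; the text only states that cross-entropy is computed between the two predictions.

```python
import numpy as np


def correspondence_loss(mask_a: np.ndarray, mask_b: np.ndarray, eps: float = 1e-7) -> float:
    """Per-pixel binary cross-entropy with mask_b as soft target, averaged over the grid."""
    p = np.clip(mask_a, eps, 1.0 - eps)  # avoid log(0)
    q = mask_b
    return float(np.mean(-(q * np.log(p) + (1.0 - q) * np.log(1.0 - p))))


# Two hypothetical 14x14 predictions that agree, and two that disagree.
target = np.full((14, 14), 0.9)
loss_close = correspondence_loss(np.full((14, 14), 0.9), target)
loss_far = correspondence_loss(np.full((14, 14), 0.1), target)
print(loss_close, loss_far)
```

Note that with soft targets the loss does not reach zero when the masks agree; it reaches its minimum there, which is what the gradient needs.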
2.4 Mask Scores and Mask Fusion
MS-RCNN introduces mask scoring, which directly regresses a MaskIoU score from a predicted mask and its matched groundtruth to indicate the quality of the mask prediction. However, these scores are not used at inference time to refine mask shapes.
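We adopt mask scoring and further exploit MaskIoU scores to fuse mask predictions from different representations at inference time. The mask fusion process is illustrated in Fig. 3. During inference, we concatenate the features and predicted masks of each representation as inputs to the MaskIoU head. The scores of $M_{2.5D}$ and $M_{3D}$, denoted $S_{2.5D}$ and $S_{3D}$, are outputs from the MaskIoU head. We fuse mask predictions using their corresponding mask scores. We first linearly combine ($M_{2.5D}$, $S_{2.5D}$) and ($M_{3D}$, $S_{3D}$) to obtain ($M_D$, $S_D$) for the disparity. The formulation is as follows.

$$M_D = M_{2.5D} \cdot \frac{S_{2.5D}}{S_{2.5D} + S_{3D}} + M_{3D} \cdot \frac{S_{3D}}{S_{2.5D} + S_{3D}}, \qquad S_D = S_{2.5D} \cdot \frac{S_{2.5D}}{S_{2.5D} + S_{3D}} + S_{3D} \cdot \frac{S_{3D}}{S_{2.5D} + S_{3D}}. \tag{2}$$

Later, we linearly fuse ($M_D$, $S_D$) and the image-based pair ($M_{2D}$, $S_{2D}$) likewise to obtain the final probability mask $M_f$ and its corresponding final mask score $S_f$. The inferred mask is created by binarizing $M_f$.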
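A minimal sketch of the two-stage score-weighted fusion described above follows. It assumes each mask is weighted by its MaskIoU score divided by the sum of the pair's scores, that the fused score uses the same weights, and that binarization uses a 0.5 threshold; the function names and sample values are hypothetical.

```python
import numpy as np


def fuse(mask_a, score_a, mask_b, score_b):
    """Score-weighted linear fusion of two probability masks and their scores."""
    total = score_a + score_b
    if total <= 0:
        raise ValueError("mask scores must sum to a positive value")
    w_a, w_b = score_a / total, score_b / total
    return w_a * mask_a + w_b * mask_b, w_a * score_a + w_b * score_b


def fuse_all(m2d, s2d, m25d, s25d, m3d, s3d, threshold=0.5):
    """Fuse the 2.5D and 3D predictions first, then fuse the result with the 2D one."""
    m_d, s_d = fuse(m25d, s25d, m3d, s3d)  # disparity-based pair
    m_f, s_f = fuse(m_d, s_d, m2d, s2d)    # final probability mask and score
    return m_f >= threshold, m_f, s_f      # threshold value is an assumption


m2d, m25d, m3d = (np.full((14, 14), v) for v in (0.8, 0.4, 0.6))
binary, m_f, s_f = fuse_all(m2d, 0.9, m25d, 0.6, m3d, 0.6)
print(float(m_f[0, 0]), s_f)
```

Normalizing the weights keeps the fused mask a valid probability map, so a fixed threshold remains meaningful after both fusion stages.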
The mask scoring process should not differ across representations. We therefore use only 2D image features and the image-based mask $M_{2D}$ to train a single MaskIoU head, instead of constructing 3 MaskIoU heads, one for each representation. In this way, the MaskIoU module does not add much memory use, and the training remains effective. The MaskIoU loss is denoted as $L_{miou}$.
3.1 HQDS Dataset
Outdoor RGBD scene understanding is still less explored, since much longer-range sensing is required to align information from images and depth. For example, distant vehicles that appear in images but are undetected by a depth sensor introduce ambiguity into RGBD methods. To explore outdoor RGBD methods and provide high-quality data that reveals the advantages of sensor fusion, we collect the High-Quality Driving Stereo (HQDS) dataset in urban environments. Table 1 shows a comparison with other public datasets for instance segmentation. Image resolution of HQDS is . From the table and Eq. 1, HQDS has the largest $f \cdot b$. The ideal measuring range of this configuration is up to 1650 meters at 1-pixel disparity, compared with only 440 and 350 meters for Cityscapes and KITTI, respectively. Note that the disparity maps are computed by stereo matching methods, so actual working distances depend on the methods' robustness and on image noise. However, longer baselines and focal lengths still favor far-field stereo matching, since they show better geometry and more complete shapes for distant objects.
|Dataset|Resolution (megapixels)|Stereo Pairs #|$b$ (m)|$f$ (pixels)|
HQDS contains 6K/1.2K stereo pairs for training/testing. We follow a semi-automatic process to annotate the data with a group of supervised annotators. Our internal large-scale labeling system produces preliminary annotations, and the annotators adjust the resulting bounding boxes and mask shapes or filter out false predictions to produce the HQDS groundtruth.
There are 60K instances in the training set and 11K in the testing set. We adopt 3 instance classes: human, bicycle/motorcycle, and vehicle. Other driving datasets adopt more classes (Cityscapes, for example, uses 8), but according to the Mask-RCNN study they suffer from considerable inter-class ambiguity, which leads to biased results.
The numbers of instances per class in the training and testing sets are (5.5K, 1.5K, 52.8K) and (2.4K, 1K, 8.4K), respectively. Most non-synthetic datasets suffer from class imbalance. To remedy the imbalance, we initialize both our implementation and the comparison methods with weights pretrained on the COCO dataset (instance segmentation for common objects), with class pruning.
Evaluation and Metrics. For a fair comparison, we evaluate against recent state-of-the-art methods validated on the large-scale COCO dataset, including Mask-RCNN, MS-RCNN, Cascade Mask RCNN, and HTC (w/o semantics), using their publicly released code and COCO pretrained weights. We follow their training procedures to conduct the comparison experiments.
We report numerical results in the standard COCO style. Average precision (AP) is averaged across IoU levels from 0.5 to 0.95 at intervals of 0.05. AP$_{50}$ and AP$_{75}$ report AP at two typical IoU levels. The units are %. Table 2 shows the comparison with other methods. The proposed GAIS-Net attains the state of the art. We exceed Mask-RCNN with the same backbone by 9.7% and 6.8% for bounding box and mask AP, respectively.
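The averaging over IoU levels can be illustrated with the short sketch below. It shows only the mask IoU computation and the final average; the per-threshold AP values are taken as given inputs, and the helper names are hypothetical rather than part of any COCO evaluation toolkit.

```python
import numpy as np

# IoU levels from 0.5 to 0.95 at intervals of 0.05.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union > 0 else 0.0


def coco_style_ap(ap_per_threshold: dict) -> float:
    """Average the per-threshold AP values over the ten IoU levels."""
    return sum(ap_per_threshold[t] for t in IOU_THRESHOLDS) / len(IOU_THRESHOLDS)


pred = np.zeros((4, 4), dtype=bool)
gt = np.zeros((4, 4), dtype=bool)
pred[:2, :2] = True  # 4 predicted pixels
gt[:2, :3] = True    # 6 groundtruth pixels, 4 of them shared
iou = mask_iou(pred, gt)
matched_levels = [t for t in IOU_THRESHOLDS if iou >= t]
print(iou, matched_levels)
```

In this toy case the prediction counts as a match at the looser IoU levels only, which is why AP at the 0.75 level is usually lower than at the 0.5 level.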
3.2 Cityscapes Dataset
We also conduct experiments on the Cityscapes dataset. However, its baseline and focal length are shorter than those of HQDS, and its maximal measuring distance is only 1/4 that of HQDS. The much shorter focal length and baseline limit the working distance of stereo matching and produce disparity maps that are reliable only at near fields, with poor shapes and geometry elsewhere. As Table 3 shows, GAIS-Net still performs better than Mask-RCNN. The difference in improvement between HQDS and Cityscapes is mainly caused by the latter's shorter baseline and focal length.
- (2018) Pyramid stereo matching network. In CVPR.
- (2019) Hybrid task cascade for instance segmentation. In CVPR.
- (2017) Mask R-CNN. In ICCV.
- (2019) Mask scoring R-CNN. In CVPR.
- (1993) A multiple-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
- (2019) Pseudo-LiDAR from visual depth estimation: bridging the gap in 3D object detection for autonomous driving. In CVPR.
- (2017) Depth-aware object instance segmentation. In ICIP.