Holistic scene understanding plays a critical role in the competent functioning of an autonomous robot. It enables the robot to perceive and reason about its surroundings, which in turn allows it to interact with them. Broadly, the components of a scene can be classified into ‘stuff’ and ‘thing’ categories, where ‘stuff’ classes are defined as uncountable amorphous regions such as road, sidewalk, and buildings, while ‘thing’ classes are defined as countable objects such as people, cars, and cyclists. The panoptic segmentation task
aims to jointly predict ‘stuff’ and ‘thing’ classes using a single coherent Convolutional Neural Network (CNN) architecture. More precisely, the network assigns a semantic ID to each pixel that belongs to a ‘stuff’ class, and it assigns a class label as well as an instance ID to each pixel that belongs to a ‘thing’ class. We address the panoptic segmentation task using our recently proposed Efficient Panoptic Segmentation (EfficientPS) architecture, which combines an efficient shared backbone with our new feature-aligning semantic head, a new variant of Mask R-CNN as the instance head, and our novel adaptive panoptic fusion module.
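The per-pixel labeling rule described above can be illustrated with a minimal sketch. The class IDs, the use of instance ID 0 for ‘stuff’ pixels, and the function name `to_panoptic` are assumptions made for this example only; they are not taken from the EfficientPS implementation or from any RVC dataset.

```python
from typing import List, Tuple

# Hypothetical class IDs for illustration only.
STUFF_CLASSES = {0, 1}  # e.g. road, sidewalk
THING_CLASSES = {2, 3}  # e.g. person, car


def to_panoptic(
    semantic: List[List[int]], instance: List[List[int]]
) -> List[List[Tuple[int, int]]]:
    """Pair every pixel with (class_id, instance_id).

    'Stuff' pixels carry only a semantic ID, so their instance ID is set to 0
    (an assumed convention); 'thing' pixels keep their instance ID.
    """
    panoptic = []
    for sem_row, inst_row in zip(semantic, instance):
        row = []
        for cls, inst in zip(sem_row, inst_row):
            if cls in THING_CLASSES:
                row.append((cls, inst))
            else:
                row.append((cls, 0))
        panoptic.append(row)
    return panoptic


# A 2x3 toy image: one person (class 2) and two separate cars (class 3).
semantic = [[0, 0, 2], [1, 3, 3]]
instance = [[0, 0, 1], [0, 1, 2]]
print(to_panoptic(semantic, instance))
```

The two car pixels share a class label but receive different instance IDs, which is what distinguishes panoptic output from plain semantic segmentation.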
We benchmark our model on the Robust Vision Challenge (RVC) datasets. The goal of RVC is to encourage the development of vision systems that are robust and that can consequently perform well on a variety of datasets with different characteristics. The RVC panoptic segmentation challenge requires participants to train a single model on a joint dataset comprising six different datasets and to evaluate it individually on each corresponding benchmark server. The models are then ranked using the Schulze Proportional Ranking (PR) method. To address this challenge, we employ a lightweight version of our EfficientPS architecture, which we refer to as ‘EffPS_b1bs4_RVC’; it achieves rank #1 in RVC 2020.
For this challenge, we use our EfficientPS architecture with minor modifications due to resource and time constraints. The original EfficientPS uses a modified version of EfficientNet-B5 as the encoder, whereas for RVC we use EfficientNet-B1 due to memory limitations. Additionally, EfficientPS employs depthwise separable convolutions in the Mask R-CNN instance head, but we use standard convolutions in the RVC model. Given the weaker EfficientNet-B1 encoder that we employ, we thereby trade the reduction in the number of parameters for a minor increase in performance.
For the sake of completeness, we briefly discuss the topology of the EfficientPS architecture shown in Figure 1. It consists of four components: a shared backbone, an instance head, a semantic head, and the panoptic fusion module. The shared backbone comprises a modified EfficientNet-B5 and our novel 2-way FPN architecture. We remove the Squeeze-and-Excitation (SE) connections of the standard EfficientNet-B5 and replace the batch normalization layers with synchronized Inplace Activated Batch Normalization (iABN sync). To overcome the problem of unidirectional flow of information, we add a parallel bottom-up branch to the standard FPN architecture and sum the outputs of the two branches at corresponding scales to obtain the proposed 2-way FPN. The left half of Figure 1 depicts the backbone network.
The instance segmentation head of the EfficientPS architecture comprises Mask R-CNN in which the standard convolutions, batch normalization, and ReLU activation layers are replaced with depthwise separable convolutions, iABN sync, and Leaky ReLU, respectively. The novel semantic head uses three different modules to effectively capture features of different scales and to subsequently correlate them to avoid the mismatch problem. To this end, we employ Dense Prediction Cells (DPC) on small-scale features and our LSFE module on large-scale features. We then align the features of different scales using the Mismatch Correction module before the segmentation logits are obtained, as shown in Figure 1. Finally, the panoptic fusion module adaptively fuses logits from the semantic and instance heads based on their mask confidences. The fusion is performed according to equation (1), which enables selective attenuation or amplification of fused logit scores to integrate instance-specific ‘thing’ classes with ‘stuff’ classes as
where and are logits from the semantic and instance heads respectively.
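As an illustration of how such adaptive fusion can attenuate or amplify logits, the following is a minimal sketch. It assumes the element-wise form (σ(a) + σ(b)) ⊙ (a + b) described in the EfficientPS paper, where σ is the sigmoid function and a, b are the two mask logits; the function names and the flat list representation are hypothetical simplifications.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic sigmoid, used here as a per-pixel confidence score."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_logits(semantic: List[float], instance: List[float]) -> List[float]:
    """Assumed fusion rule: (sigmoid(a) + sigmoid(b)) * (a + b), element-wise.

    `semantic` and `instance` hold the mask logits of one object from the
    semantic head and the instance head, flattened to equal-length lists.
    """
    return [
        (sigmoid(a) + sigmoid(b)) * (a + b)
        for a, b in zip(semantic, instance)
    ]


# Both heads agree the pixel is foreground: the summed logit is amplified.
print(fuse_logits([4.0], [4.0]))
# Both heads agree the pixel is background: the summed logit is attenuated.
print(fuse_logits([-4.0], [-4.0]))
# The heads disagree with equal strength: the fused logit is zero.
print(fuse_logits([4.0], [-4.0]))
```

Under this assumed form, the sum of sigmoids acts as a confidence gate in the range (0, 2), so pixels on which both heads are confident are scaled up while low-confidence pixels are scaled down.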
In this section, we first describe the datasets in Section 3.1 and then present the data augmentation strategies that we employ to mitigate problems induced by combining all the datasets in Section 3.2. Finally, we detail the training protocol that we employ in Section 3.3.
We use the PyTorch deep learning library for implementing our architecture, and we trained our model on a system with an Intel Xeon@2.20GHz processor and NVIDIA TITAN RTX GPUs.
We combine six different panoptic segmentation benchmark datasets to form the joint training set, namely Microsoft COCO, Cityscapes, Mapillary Vistas, VIPER, WildDash, and KITTI. These six diverse datasets contain images that range from congested urban city driving scenarios, rural areas, and highways to simulated scenes as well as indoor environments.
Microsoft COCO: There are 118K, 5K, and 20K images for training, validation, and testing, respectively. The COCO dataset consists of 80 ‘thing’ classes and 53 ‘stuff’ classes. The images are of different resolutions.
Cityscapes: The Cityscapes dataset consists of 2975, 500, and 1525 urban road scene images for training, validation, and testing, respectively. The images are a resolution of pixels and contain 8 ‘thing’ classes and 11 ‘stuff’ classes.
Mapillary Vistas: The Mapillary Vistas dataset is a large-scale driving dataset consisting of 18K, 2K, and 5K images for training, validation and testing, respectively. The images in this dataset have varying resolutions and range from pixels to pixels. It contains 37 ‘thing’ classes and 28 ‘stuff’ classes.
VIPER: The VIPER dataset consists of realistic imagery of virtual urban scenes, which are at a resolution of pixels. It contains 18K, 2K, and 5K images for training, validation and testing, respectively. It comprises 10 ‘thing’ classes and 13 ‘stuff’ classes.
WildDash: The WildDash dataset is a collection of road scene images taken from all over the world with many diverse and challenging scenarios. The images were captured at a resolution of pixels. It consists of 4256 training images and 776 test images. WildDash contains 13 ‘thing’ classes and 12 ‘stuff’ classes. We further split the training images randomly with a ratio of 80:20 to obtain the training and validation sets.
KITTI: The KITTI dataset consists of 200 training and testing images, where the resolution of the images is pixels. It contains 8 ‘thing’ classes and 11 ‘stuff’ classes.
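The random 80:20 split of the WildDash training images mentioned above can be sketched as follows. The image identifiers, the fixed seed, and the function name `split_train_val` are assumptions for illustration; the report does not specify how the split was drawn.

```python
import random
from typing import List, Sequence, Tuple


def split_train_val(
    items: Sequence[str], train_ratio: float = 0.8, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle a copy of `items` and cut it into training and validation parts."""
    shuffled = list(items)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical identifiers standing in for the 4256 WildDash training images.
image_ids = [f"wd_{i:04d}" for i in range(4256)]
train_ids, val_ids = split_train_val(image_ids)
print(len(train_ids), len(val_ids))
```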
3.2 Data Augmentation
The major problems in combining different benchmark datasets are the differing annotation schemes, differences in environments across datasets, such as indoors and outdoors, and the imbalance in the number of images available for training across datasets. The difference in annotation schemes results in the same class being assigned to the ‘stuff’ category in some datasets and to the ‘thing’ category in others. For example, the pole class belongs to the ‘stuff’ category in the Cityscapes dataset but to the ‘thing’ category in the Mapillary Vistas dataset. To address this discrepancy, we create a unified label mapping space in which all such classes are treated as distinct. In the aforementioned example, we treat the pole class in the Cityscapes dataset as a different class from the pole class in the Mapillary Vistas dataset. As a result, our joint dataset has 109 ‘thing’ classes and 77 ‘stuff’ classes. To address the imbalance problem, which can lead to under- or over-representation of a particular dataset due to the sheer number of training images, we replicate the datasets with fewer training examples for balanced training. In our case, the COCO dataset has 118K images, which is roughly three times more than all of the other datasets combined. Therefore, we replicate all the datasets other than COCO three times and then combine them with COCO to form one large dataset, which constitutes one epoch.
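The two remedies described above, a unified label space and dataset replication, can be sketched as follows. The class lists and sample names are toy data, and merging classes whose name and category agree across datasets is an assumption of this sketch; the report only states that conflicting classes are kept separate.

```python
from typing import Dict, List, Tuple

# Hypothetical, heavily shortened class lists: (class name, category).
DATASET_CLASSES: Dict[str, List[Tuple[str, str]]] = {
    "cityscapes": [("road", "stuff"), ("pole", "stuff"), ("car", "thing")],
    "mapillary": [("road", "stuff"), ("pole", "thing"), ("car", "thing")],
}


def build_unified_labels(
    dataset_classes: Dict[str, List[Tuple[str, str]]]
) -> Dict[Tuple[str, str], int]:
    """Map (dataset, class name) to an ID in the unified label space.

    A class annotated as 'stuff' in one dataset and 'thing' in another gets
    two distinct unified IDs; classes that agree share one ID (assumption).
    """
    unified: Dict[Tuple[str, str], int] = {}  # (name, category) -> unified ID
    mapping: Dict[Tuple[str, str], int] = {}  # (dataset, name) -> unified ID
    for dataset, classes in dataset_classes.items():
        for name, category in classes:
            key = (name, category)
            if key not in unified:
                unified[key] = len(unified)
            mapping[(dataset, name)] = unified[key]
    return mapping


def build_epoch(
    samples_by_dataset: Dict[str, List[str]], large: str = "coco", repeats: int = 3
) -> List[str]:
    """Replicate every dataset except `large` `repeats` times and concatenate."""
    epoch: List[str] = []
    for dataset, samples in samples_by_dataset.items():
        factor = 1 if dataset == large else repeats
        epoch.extend(samples * factor)
    return epoch


labels = build_unified_labels(DATASET_CLASSES)
epoch = build_epoch({"coco": ["c0", "c1", "c2", "c3", "c4", "c5"], "kitti": ["k0", "k1"]})
print(labels[("cityscapes", "pole")], labels[("mapillary", "pole")], len(epoch))
```

In this toy setup the pole class ends up with two unified IDs, one per category, while road and car are shared between the two datasets.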
Furthermore, we limit the longest side of the Mapillary Vistas images to 1920 pixels due to GPU memory constraints. We perform a limited set of random data augmentations including flipping and scaling within the range of . We train on variable-sized crops: for a given input image, its original size, i.e., its size before any scaling augmentation, is taken as the crop size. Training on crops is also necessitated by memory constraints.
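Limiting the longest image side can be sketched as below. Preserving the aspect ratio and rounding to the nearest pixel are assumptions of this example, as is the function name `limit_longest_side`; the report only states the limit of 1920.

```python
from typing import Tuple


def limit_longest_side(width: int, height: int, max_side: int = 1920) -> Tuple[int, int]:
    """Return a target size whose longest side does not exceed `max_side`.

    Images that are already small enough are left unchanged; larger images
    are scaled down uniformly so that the aspect ratio is preserved.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return round(width * scale), round(height * scale)


print(limit_longest_side(3840, 2160))
print(limit_longest_side(1024, 768))
```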
3.3 Training Protocol
We initialize the backbone of our EffPS_b1bs4_RVC architecture with weights from the EfficientNet model pre-trained on the ImageNet dataset and initialize the weights of the iABN sync layers to 1. We use Xavier initialization for the other layers and zero constant initialization for the biases, and we use Leaky ReLU with a slope of 0.01. We use the same hyperparameters as EfficientPS unless explicitly mentioned in this report. We train our model with Stochastic Gradient Descent (SGD) with momentum, using a multi-step learning rate schedule, i.e., we start with an initial base learning rate and train the model for a certain number of iterations, then lower the learning rate by a factor of 10 at each milestone, and continue training for 10 epochs. We use an initial learning rate of 0.01 and successively reduce it by a factor of 10 at 400k and 520k iterations. At the beginning of training, we have a warm-up phase in which the learning rate is increased linearly over 200 iterations. We train our EffPS_b1bs4_RVC with a batch size of 4 on 4 NVIDIA TITAN RTX GPUs, where each GPU processes a single image. Please note that our benchmarked model is trained only on the training set, i.e., exclusive of the validation set.
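The multi-step schedule with linear warm-up can be written as a small function of the iteration index. The base learning rate, milestones, decay factor, and warm-up length follow the text; the warm-up start factor of 1/3 is a hypothetical placeholder, since the start and end values are not recoverable here, and the function name is likewise illustrative.

```python
from typing import Sequence


def learning_rate(
    iteration: int,
    base_lr: float = 0.01,
    milestones: Sequence[int] = (400_000, 520_000),
    gamma: float = 0.1,
    warmup_iters: int = 200,
    warmup_start_factor: float = 1.0 / 3.0,  # assumed value, not from the report
) -> float:
    """Learning rate at a given iteration for a warm-up plus multi-step schedule."""
    if iteration < warmup_iters:
        # Linear ramp from warmup_start_factor * base_lr up to base_lr.
        alpha = iteration / warmup_iters
        factor = warmup_start_factor + (1.0 - warmup_start_factor) * alpha
        return base_lr * factor
    # After warm-up, multiply by gamma once for every milestone already reached.
    passed = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** passed


for it in (0, 100, 200, 400_000, 520_000):
    print(it, learning_rate(it))
```

In a PyTorch training loop the same behaviour is typically obtained by combining an SGD optimizer with a multi-step scheduler and a warm-up scheduler; the plain function above only makes the resulting values explicit.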
We use test-time augmentation such as flipping and multi-scale inference only for COCO and Cityscapes. The scales that we use for COCO are  and the scales for Cityscapes are . For the Mapillary Vistas dataset, we upsample the predictions, whose longest side we limited to 1920, to the original image size for benchmark submission. The parameter used in the panoptic fusion module is set to for COCO and for the other datasets.
5 Benchmark Results
Table 1 presents the panoptic segmentation benchmark results of EffPS_b1bs4_RVC on the five datasets from the Robust Vision Challenge 2020. For simplicity, we only report the PQ, SQ and RQ scores. Other metric values can be accessed via the corresponding benchmark servers. Furthermore, Figure 2 shows example results from each of the datasets in the RVC 2020 challenge.
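For readers unfamiliar with the reported metrics, the following minimal sketch shows how PQ, SQ, and RQ relate, using the standard definitions from the panoptic segmentation literature: SQ is the mean IoU over matched segments, RQ is an F1-style detection score, and PQ is their product. The function name and the example numbers are illustrative and are not results from this report.

```python
from typing import List, Tuple


def panoptic_quality(
    tp_ious: List[float], num_fp: int, num_fn: int
) -> Tuple[float, float, float]:
    """Compute (PQ, SQ, RQ) for one class.

    `tp_ious` holds the IoU of every matched (true positive) segment pair;
    a prediction and a ground-truth segment match when their IoU exceeds 0.5.
    """
    num_tp = len(tp_ious)
    if num_tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(tp_ious) / num_tp                           # segmentation quality
    rq = num_tp / (num_tp + 0.5 * num_fp + 0.5 * num_fn)  # recognition quality
    return sq * rq, sq, rq


pq, sq, rq = panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1)
print(pq, sq, rq)
```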
In this report, we presented our EffPS_b1bs4_RVC architecture that achieves first place in the Robust Vision Challenge 2020 for the panoptic segmentation task. The challenge has a higher level of difficulty than any individual benchmark, since it measures the performance of a solution on a combination of challenging datasets with vastly varying characteristics, and at the same time it demonstrates the robustness of the given solution. Consequently, our result in RVC 2020 shows the robustness of EfficientPS across various benchmarks, as our winning model is essentially a lightweight version of it. Additionally, we expect that the performance of our model can be considerably improved by training for more epochs with larger batch sizes and by including the validation set in the training set.
- (2019) Robot localization in floor plans using a room layout edge extraction network. arXiv preprint arXiv:1903.01804.
- (2018) Searching for efficient multi-scale architectures for dense image prediction. In Advances in Neural Information Processing Systems, pp. 8699–8710.
- (2016) The Cityscapes dataset for semantic urban scene understanding. In , pp. 3213–3223.
- Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.
- (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
- (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
- (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
- (2019) Panoptic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9404–9413.
- (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
- (2015) Object scene flow for autonomous vehicles. In Conference on Computer Vision and Pattern Recognition (CVPR).
- (2020) EfficientPS: efficient panoptic segmentation. arXiv preprint arXiv:2004.02307.
- (2017) The Mapillary Vistas dataset for semantic understanding of street scenes. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4990–4999.
- (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8026–8037.
- (2017) Playing for benchmarks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 2232–2241.
- (2018) In-place activated batchnorm for memory-optimized training of DNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5639–5647.
- (2011) A new monotonic, clone-independent, reversal symmetric, and Condorcet-consistent single-winner election method. Social Choice and Welfare 36 (2), pp. 267–303.
- (2019) EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946.
- Convoluted mixture of deep experts for robust semantic segmentation. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Workshop, State Estimation and Terrain Perception for All Terrain Mobile Robots.
- (2016) Towards robust semantic segmentation using deep fusion. In Robotics: Science and Systems (RSS 2016) Workshop, Are the Sceptics Right? Limits and Potentials of Deep Learning in Robotics.
- (2018) Incorporating semantic and geometric priors in deep pose regression. In Workshop on Learning and Inference in Robotics: Integrating Structure, Priors and Models at Robotics: Science and Systems (RSS).
- (2018) WildDash - creating hazard-aware benchmarks. In Proceedings of the European Conference on Computer Vision (ECCV).