Recently, Kirillov et al. proposed the panoptic segmentation task, which unifies instance segmentation and semantic segmentation. The task requires a model to assign each pixel of an image a semantic label and an instance ID. Several panoptic datasets, such as Cityscapes, Microsoft COCO, and Mapillary Vistas, have been released. Much research attention has focused on developing new models [22, 18, 7]. So far, the best performing models on the leaderboards of the major panoptic challenges are exclusively CNN based.
In this paper, instead of developing new models, we focus on the data augmentation aspect of the panoptic segmentation task and develop a novel panoptic data augmentation method, PanDA, that further improves the performance of top performing models.
We first identify the data deficiency and class imbalance problems and propose a pixel space data augmentation method for panoptic segmentation datasets that efficiently generates synthetic datasets from the original dataset. The method is computationally cheap and fast, and it requires no training or additional data.
Next, we experimentally show that training with PanDA augmented Cityscapes further improves all major performance metrics of the UPSNet model, which was one of the top performing CNN based models in the Cityscapes challenge as of 2019. Trained on PanDA augmented Cityscapes, the UPSNet model achieves 1.2% to 6.9% relative gain over the original version in various metrics on the validation set.
To further demonstrate the generalizability of PanDA across scales and domains, we apply it to Cityscapes subsets with 10, 100, 1,000, and 3,000 images, as well as a 10 times larger COCO subset with 30,000 images. We report performance gains across scales and datasets. An ablation study confirms that the combination of operations implemented in this paper yields the best performance.
Finally, results from the ablation study show that, contrary to common belief, less realistic looking images improve model performance more. This suggests that the level of image realism is not positively correlated with data augmentation efficiency.
2 Related Work
2.1 Panoptic Segmentation
The panoptic segmentation task was first proposed in 2019 by Kirillov et al. as an attempt to formulate a unified task that bridges the gap between semantic segmentation and instance segmentation. The task divides objects into two super-categories: things and stuff. For classes in things, each pixel in the image is labeled with a class ID and an instance ID (Fig 1). For classes in stuff, each pixel is labeled with a class ID only. In addition to traditional metrics such as mean intersection over union (mIoU) and average precision (AP), the panoptic quality (PQ) is defined as

$$\mathrm{PQ} = \underbrace{\frac{\sum_{(p,g) \in TP} \mathrm{IoU}(p,g)}{|TP|}}_{\mathrm{SQ}} \times \underbrace{\frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}}_{\mathrm{RQ}},$$

where true positives (TP) are predicted segments whose IoU with a ground truth segment is strictly greater than 0.5, and FP and FN are the unmatched predicted and ground truth segments, respectively. PQ is thus the product of segmentation quality (SQ) and recognition quality (RQ).
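To make the metric concrete, the following minimal sketch computes PQ, SQ, and RQ for a single class from a list of matched IoU values and the counts of unmatched segments. The function name and the example numbers are illustrative assumptions, not part of any evaluation toolkit.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) for one class.

    matched_ious: IoU of every predicted/ground-truth pair with IoU > 0.5 (the TPs).
    num_fp: number of unmatched predicted segments.
    num_fn: number of unmatched ground truth segments.
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp if tp else 0.0  # mean IoU over matched pairs
    rq = tp / denom                             # F1-style recognition term
    return sq * rq, sq, rq


# Hypothetical example: three matches, one false positive, one false negative.
pq, sq, rq = panoptic_quality([0.9, 0.8, 0.6], num_fp=1, num_fn=1)
```

Because SQ averages IoU over matched segments regardless of their area, a small segment contributes as much to the score as a large one.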
2.2 Data Augmentation
Despite the efforts towards low-shot learning, modern CNN based models are still data-hungry: their capacity is large enough to memorize even randomly labeled large datasets. One efficient way to regularize models and promote generalization is data augmentation.
Many methods still in use for detection and segmentation models [22, 18, 7] are largely inherited from the ImageNet visual recognition era. These methods take advantage of object invariances in pixel space: manipulations such as cropping, horizontal flipping, resizing, color distortion, and noise injection do not usually change the identity of the object. These methods are simple in concept and computationally very cheap, but they do not take advantage of the more informative ground truth labelling that accompanies the harder detection and segmentation tasks. Some recent methods have started to explore the use of bounding box and segmentation information [15, 1], but to our knowledge no pixel space data augmentation method for panoptic segmentation tasks has been proposed.
Data simulation is another flavor of data augmentation. Thanks to the development of graphical simulation engines, we can generate photo-realistic images together with pixel perfect ground truth. The method has been shown to work for many tasks such as human pose estimation, wildlife classification, and object detection. However, it often requires handcrafted 3D models, and there is usually a significant domain gap between real and simulated images.
Recently, with the popularization of generative models such as generative adversarial networks (GAN), researchers also add images generated by GANs to the training set[6, 2, 15]. However, as a type of neural network based model, the GAN itself requires training which in turn is both computationally expensive and data hungry.
Figure 4: (a) Original Cityscapes. (b) PanDA Cityscapes. (c) Per class pixel percentage per image.
Figure 5: (a) Original. (b) Dropout only. (d) Dropout + Resize + Shift.
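Table 1: Panoptic segmentation results on the Cityscapes validation set.
| Model | PQ | SQ | RQ | PQ Th | PQ St | mIoU | AP | AP box |
| UPSNet-50 Xiong et al. | 59.3 | 79.7 | 73.0 | 54.6 | 62.7 | 75.2 | 33.3 | 39.1 |
| UPSNet-50 our baseline | 58.8 | 79.5 | 72.6 | 54.5 | 62.0 | 75.1 | 33.5 | 38.7 |
| UPSNet-50 w/ PanDA 1x | 59.9 | 79.9 | 73.8 | 55.4 | 63.3 | 76.2 | 34.6 | 39.6 |
| UPSNet-50 w/ PanDA 2x | 60.0 | 80.3 | 73.5 | 55.8 | 63.1 | 76.2 | 35.6 | 40.5 |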
3 Panoptic Data Augmentation
Annotating images for panoptic segmentation can be costly, since every pixel in an image has to be semantically labeled and pixels that belong to things classes must also carry an instance ID. One could argue that it is orders of magnitude harder for a human annotator to generate panoptic ground truth than to assign a single class ID to an image, as needed in visual recognition. But with great challenges usually come great opportunities: PanDA takes advantage of the rich information embedded in panoptic annotations and uses it to synthesize new images with ground truth.
A fundamental feature that distinguishes PanDA from other data augmentation methods is that PanDA does not aim to create images that look realistic to human eyes. Many 3D simulation based methods [1, 19, 21] implicitly or explicitly suggest that the level of realism is key to achieving high performance. Similarly, GAN based image synthesis methods [6, 2, 11] usually optimize for realism, since the very goal of the generator network in a GAN is to synthesize images realistic enough to fool the discriminator network into thinking that they are real. In contrast, PanDA takes a more principled approach: it balances classes, breaks local pixel structures, and increases the variation of objects, and the final synthesized images do not look fully realistic to human eyes.
As shown in Fig 1, PanDA first decomposes a panoptically labelled training image by using the segmentation ground truth to divide it into foreground and background segments. The background segments, which contain unlabeled and background classes, are padded with noise patterns to make a new background image of the original size. Foreground segments are then overlaid on top of the new background image one by one. The same manipulation is applied to the ground truth segmentation image to generate the new ground truth segmentation.
For foreground segments, we can perform a series of pixel space operations such as dropout, resize, and shift to control different aspects of each individual object instance. Dropout removes segments from the image. The resize operation changes the size of a segment while preserving the original aspect ratio. It simulates object movement in depth and introduces more size variance in objects, and it resembles zooming and multi-scale training at the level of individual objects. Random shifting moves the segment in x and y in pixel space. It prevents the model from memorizing object locations and breaks the local pixel relationship between the object and its background, thus creating new local contexts for the object. Resize and shift together simulate three-dimensional translation of objects in space, and dropout controls the frequency of objects. Because the ground truth depth information needed to recover the depth ordering of segments is not available, and we do not want larger segments to occlude smaller ones, we sort the segments by area and put the largest segments on the bottom layer. It is worth noting that more operations can be added to the PanDA tool set, and different variants of the aforementioned operations may also be implemented. In this paper, we limit the scope to the operations discussed above to allow for in-depth experiments with PanDA.
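The decomposition and recomposition steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the constant `drop_prob`, the `max_shift` bound, the white-noise fill, the nearest-neighbour resize, and the use of ID 0 for unlabeled pixels are all choices made here for brevity.

```python
import numpy as np


def resize_nearest(arr, scale):
    """Nearest-neighbour resize of an (h, w, ...) array, preserving aspect ratio."""
    h, w = arr.shape[:2]
    new_h, new_w = max(1, int(round(h * scale))), max(1, int(round(w * scale)))
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return arr[rows][:, cols]


def panda_synthesize(image, seg_ids, foreground_ids, rng, drop_prob=0.3,
                     scale_range=(0.5, 1.5), max_shift=20, unlabeled_id=0):
    """Return a synthetic (image, segmentation) pair built from one labelled image."""
    h, w = seg_ids.shape
    fg_mask = np.isin(seg_ids, list(foreground_ids))
    # Background: keep original background pixels, fill foreground holes with noise.
    new_img = rng.integers(0, 256, size=image.shape, dtype=np.uint8)
    new_img[~fg_mask] = image[~fg_mask]
    new_seg = np.where(fg_mask, unlabeled_id, seg_ids)

    # Paste the largest segments first so that smaller ones end up on top.
    areas = {sid: int((seg_ids == sid).sum()) for sid in foreground_ids}
    for sid in sorted(areas, key=areas.get, reverse=True):
        if areas[sid] == 0 or rng.random() < drop_prob:  # dropout
            continue
        ys, xs = np.nonzero(seg_ids == sid)
        y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
        scale = rng.uniform(*scale_range)                # resize
        patch = resize_nearest(image[y0:y1, x0:x1], scale)
        mask = resize_nearest(seg_ids[y0:y1, x0:x1] == sid, scale)
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)  # shift
        top, left = y0 + dy, x0 + dx
        ph, pw = mask.shape
        ty0, tx0 = max(top, 0), max(left, 0)             # clip to the frame
        ty1, tx1 = min(top + ph, h), min(left + pw, w)
        if ty0 >= ty1 or tx0 >= tx1:
            continue
        m = mask[ty0 - top:ty1 - top, tx0 - left:tx1 - left]
        new_img[ty0:ty1, tx0:tx1][m] = patch[ty0 - top:ty1 - top, tx0 - left:tx1 - left][m]
        new_seg[ty0:ty1, tx0:tx1][m] = sid               # same edit on the labels
    return new_img, new_seg


# Toy example: one background class (id 1) and two foreground segments (ids 2, 3).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
seg = np.ones((64, 64), dtype=np.int64)
seg[10:30, 10:30] = 2
seg[40:50, 40:50] = 3
new_image, new_seg = panda_synthesize(image, seg, foreground_ids={2, 3}, rng=rng)
```

Applying every operation to the image and the label map with the same mask keeps the synthetic ground truth pixel-aligned with the synthetic image, which is what makes the result usable for training without further annotation.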
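Table 2: Per class instance segmentation AP on the Cityscapes validation set.
| Model | person | rider | car | truck | bus | train | motorcycle | bicycle |
| UPSNet-50 Xiong et al. | 31.2 | 25.1 | 50.9 | 33.6 | 55.0 | 33.2 | 19.2 | 18.3 |
| UPSNet-50 our baseline | 31.1 | 24.8 | 51.0 | 33.2 | 53.5 | 35.8 | 19.4 | 18.9 |
| UPSNet-50 w/ PanDA 1x | 31.5 | 25.7 | 52.2 | 34.3 | 54.4 | 37.9 | 20.8 | 20.3 |
| UPSNet-50 w/ PanDA 2x | 32.3 | 25.6 | 52.7 | 36.9 | 56.6 | 39.3 | 20.6 | 20.7 |
Table 3: Per class detection box AP on the Cityscapes validation set.
| Model | person | rider | car | truck | bus | train | motorcycle | bicycle |
| UPSNet-50 Xiong et al. | 38.6 | 42.1 | 56.6 | 33.4 | 56.3 | 27.2 | 28.4 | 30.1 |
| UPSNet-50 our baseline | 38.5 | 41.6 | 56.3 | 33.6 | 55.4 | 26.0 | 27.1 | 31.2 |
| UPSNet-50 w/ PanDA 1x | 39.0 | 43.4 | 57.8 | 33.6 | 55.8 | 27.2 | 28.4 | 31.2 |
| UPSNet-50 w/ PanDA 2x | 40.1 | 43.4 | 58.6 | 36.3 | 57.5 | 29.3 | 27.4 | 31.7 |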
4.1 UPSNet Model
The UPSNet model is one of the top performing single models on both the Cityscapes and MS COCO panoptic challenges. It uses a shared CNN feature extractor backbone with multiple heads on top of it. The backbone is a Mask R-CNN feature extractor built on a deep residual network (ResNet) with a feature pyramid network (FPN). UPSNet has an instance head, which follows the Mask R-CNN design, and a semantic head, which consists of a deformable convolution based sub-network. The outputs of the two heads are fed into the panoptic head, which generates the final panoptic prediction. The Seamless Scene Segmentation (SSS) model, a competitor developed around the same time in early 2019, roughly follows the same meta-architecture, with a Mask R-CNN instance head and a Mini Deeplab (MiniDL) semantic head taking input from a common FPN feature extractor and feeding into a fusion head for panoptic prediction. We choose UPSNet over SSS primarily for its fast training and inference, well documented implementation, code availability, and compatibility with both the Cityscapes and MS COCO panoptic tasks.
We used all the original hyper-parameters of the UPSNet-50 model for training and inference, with two exceptions. First, we scale the number of training iterations so that the number of training epochs remains roughly the same across a wide range of dataset sizes and GPU counts. Second, we scale the learning rate based on the number of GPUs used in training. Training starts from the same ImageNet pretrained ResNet-50 model as reported in Xiong et al. and runs on 2-8 Nvidia 1080Ti/2080Ti GPUs. For each experiment run, we evaluate the last few model checkpoints and report the best performance.
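The two schedule adjustments can be sketched as below. The text does not give the exact scaling rule, so this sketch assumes the common linear rule for the learning rate and a fixed per-GPU batch size; the function name and the reference values in the example (48,000 iterations, learning rate 0.02, 4 GPUs) are hypothetical placeholders rather than the settings used in the experiments.

```python
def scale_schedule(base_iters, base_lr, base_images, base_gpus, num_images, num_gpus):
    """Keep the number of epochs roughly constant and scale the learning rate.

    Assumes a fixed number of images per GPU per iteration, so that
    epochs = iterations * gpus * images_per_gpu / dataset_size.
    """
    # More images need more iterations; more GPUs need fewer.
    iters = round(base_iters * (num_images / base_images) * (base_gpus / num_gpus))
    # Linear learning-rate scaling with the number of GPUs (an assumption).
    lr = base_lr * (num_gpus / base_gpus)
    return iters, lr


# Hypothetical reference schedule: 48,000 iterations at lr 0.02 on 4 GPUs for 2,975 images.
iters, lr = scale_schedule(48000, 0.02, 2975, 4, num_images=5950, num_gpus=8)
```

In this example the dataset doubles and the GPU count doubles, so the iteration count stays the same while the learning rate doubles.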
4.2 Datasets
The Cityscapes panoptic dataset has 2,975 training, 500 validation, and 1,525 test images of ego-centric driving scenes in urban environments in many cities in Europe. Examples are shown in Fig 2 (a-c). The dense annotation covers 97% of the pixels and consists of 8 things classes and 11 stuff classes. We choose Cityscapes as the main testbed for PanDA for several reasons: 1) It is one of the most popular panoptic datasets available, and its 19 diverse classes cover a wide range of driving related scenarios. Many modern models [22, 18, 7] report performance on Cityscapes and have published code available. 2) Results on a relatively small set are likely to generalize to other specialized domains where annotated panoptic data is scarce. 3) The small size also keeps both data synthesis and training cost manageable. As a result, it is suitable for exploratory studies like this one.
To demonstrate the generalizability of PanDA, we also performed additional experiments on the MS COCO dataset, which has 118K training and 5K validation images covering 53 stuff classes and 80 things classes of common objects. It is more than an order of magnitude larger and more diverse in the number of classes.
4.3 Augmenting Cityscapes with PanDA
The fact that multi-scale inference and pretraining with additional data improve the performance of the lightweight UPSNet-50 model suggests that model capacity is not the major limiting factor for the final performance and that data augmentation may further improve performance. To investigate the relationship between model performance and the size of the training dataset, we trained the UPSNet-50 model from scratch on subsets of the Cityscapes training set consisting of 10, 100, 1,000, and 2,975 images. Fig 3 shows near perfect log-linear relationships between the number of training images and various performance metrics (PQ, AP, AP box). This suggests that adding training images is likely to further improve model performance.
To test whether adding training images indeed helps, we used PanDA to generate synthetic images from the Cityscapes training set. A key discrepancy between model training and the final PQ score is that classes with large areas provide overwhelming training signals and are thus favored during training, whereas SQ, the first term of the PQ formula, weights small and large objects equally. To mitigate the issue, we aim to balance the average per class pixel count per image by applying random dropout of segments, where the dropout rate is linearly proportional to the size of the segment. Specifically, segments occupying over 50% of an image are guaranteed to drop out, and segments smaller than 10% of the image frame are never dropped out. We also apply resizing and shifting to introduce size and location variance of objects. Each segment is resized by a factor between 0.5x and 1.5x, drawn from a uniform distribution. We couple resize and shift with a zooming operation: enlarged segments are pushed to the periphery and shrunk ones are pulled towards the center. This simulates the object moving towards or away from the camera.
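The size-dependent dropout and the zoom-coupled shift can be sketched as follows. The thresholds (10% and 50%) and the scale range (0.5x to 1.5x) come from the text; the linear ramp between the two thresholds and the rule that moves a segment's center radially by the same factor as its size are assumptions made for this illustration, as are the function names.

```python
import numpy as np


def dropout_probability(area_fraction, lo=0.10, hi=0.50):
    """Dropout rate that grows linearly with segment size.

    Segments below `lo` of the image are never dropped; segments above `hi`
    are always dropped. The linear ramp in between is an assumption.
    """
    return float(np.clip((area_fraction - lo) / (hi - lo), 0.0, 1.0))


def zoomed_center(center_xy, image_size_wh, scale):
    """Move a segment's center radially from the image center by `scale`.

    Enlarged segments (scale > 1) move towards the periphery and shrunk
    segments (scale < 1) move towards the center.
    """
    center = np.asarray(center_xy, dtype=float)
    mid = np.asarray(image_size_wh, dtype=float) / 2.0
    return mid + scale * (center - mid)


rng = np.random.default_rng(0)
scale = rng.uniform(0.5, 1.5)                  # resize factor for one segment
drop = rng.random() < dropout_probability(0.30)  # a segment covering 30% of the image
new_center = zoomed_center((1536, 512), (2048, 1024), scale)
```

With these thresholds a segment covering 30% of the frame is dropped about half of the time, which is how large classes such as road and building lose their dominance in the synthetic set.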
Summary statistics of the original Cityscapes dataset and the PanDA augmented dataset are shown in Fig 4. In original Cityscapes, road and building together occupy over 50% of the pixels of an image on average, which supports our observation that a few large classes dominate the learning signal. Although PanDA removes a large proportion of non-background pixels on average, it significantly reduces the dominance of common large classes while largely preserving small and rare classes.
Examples of original and synthetic image pairs are shown in Fig 2: commonly seen classes with large area, such as road and building, are dropped out more frequently, and smaller instances, such as person and pole, are often kept. We note that the synthetic images are no longer realistic, coherent scenes but look rather nonsensical to human eyes. However, we show in later sections that adding these synthetic images improves model performance.
To demonstrate the usefulness of PanDA, we generated synthetic training sets and trained the standard UPSNet on the augmented training sets. We first used PanDA to double the Cityscapes training set and trained the original UPSNet-50 model on the PanDA augmented Cityscapes dataset with a total of about 6,000 images. The model (UPSNet-50 w/ PanDA 1x in Table 1) outperforms both our baseline model and the originally reported model in all metrics. We also tripled Cityscapes by running PanDA on it twice, generating 6,000 synthetic images. We observed further improvement in most panoptic segmentation metrics. In summary, our best performing model improves on the published baseline by 0.7 PQ, 2.3 AP, 1.4 box AP, and 1.0 mIoU points (1.2%, 6.9%, 3.6%, and 1.3% relative gain, respectively) on the validation set. Additionally, overall and per class performance in both instance segmentation (Table 2) and detection (Table 3) improves upon the baseline.
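The absolute and relative gains quoted above follow directly from the Table 1 rows for the published UPSNet-50 model and for UPSNet-50 w/ PanDA 2x. The short check below recomputes them; the dictionary keys are labels chosen here, and the values are the first, sixth, seventh, and eighth entries of those two rows, which the differences quoted in the text identify as PQ, mIoU, AP, and box AP.

```python
published = {"PQ": 59.3, "AP": 33.3, "AP_box": 39.1, "mIoU": 75.2}
panda_2x = {"PQ": 60.0, "AP": 35.6, "AP_box": 40.5, "mIoU": 76.2}

gains = {}
for metric, base in published.items():
    absolute = panda_2x[metric] - base      # gain in points
    relative = 100.0 * absolute / base      # gain as a percentage of the baseline
    gains[metric] = (round(absolute, 1), round(relative, 1))

for metric, (absolute, relative) in gains.items():
    print(f"{metric}: +{absolute} points, +{relative}% relative")
```

The relative gain is largest for AP because the baseline AP is the smallest of the four numbers, so the same absolute change weighs more.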
4.4 PanDA Across Scales and Datasets
To test how well PanDA generalizes, we froze the parameters used in the previous section and applied PanDA to datasets of different scales and domains. We first tested whether PanDA generalizes to smaller subsets of Cityscapes. This would be particularly useful when developing a new panoptic task in a new domain where annotated data is expensive and scarce. In Table 4, we show consistent performance improvement for UPSNet models trained with PanDA augmented Cityscapes subsets at scales of 10, 100, 1,000, and 3,000 images.
We then ask whether the improvement with PanDA is specific to the Cityscapes dataset. Arguably, the ego-centric driving scenarios in urban environments in Cityscapes are a specialized domain, which may raise concerns about generalizability across domains. In addition, Cityscapes is one of the smaller panoptic datasets available. Although the previous experiments demonstrate that PanDA performs well when the dataset scales down, it remains unknown how well it performs when scaled up. To address these concerns, we applied PanDA to a 30,000 image subset of the COCO dataset, obtained by going through the list of the original 118K training images with a step size of 4. The COCO 30K subset not only tests whether PanDA generalizes to a different domain, but also breaks the 3,000-image upper limit of Cityscapes. The bottom two rows of Table 4 show that PanDA indeed works on the 10x larger COCO subset, which further supports the claim that PanDA generalizes well across scales and domains.
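Taking every fourth image from an ordered list is a one-line slice. The sketch below uses a placeholder list of 118,000 integer IDs and assumes the traversal starts at the first entry; the actual image IDs and their ordering are not specified in the text.

```python
def subsample(items, step=4):
    """Keep every `step`-th item, starting from the first one."""
    return items[::step]


# Placeholder IDs standing in for the roughly 118K COCO training images.
image_ids = list(range(118_000))
subset = subsample(image_ids, step=4)
```

A fixed stride gives a deterministic subset of roughly 30,000 images without requiring a random seed, which makes the experiment easy to reproduce.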
4.5 Data Efficiency
In this section, we explore the data efficiency of PanDA generated images. As shown in Fig 3, PQ, AP, and AP box scale linearly with the logarithm of the number of training images. The log-linear regression functions have the form

$$\mathrm{PQ} = a_{\mathrm{PQ}} \log N + b_{\mathrm{PQ}}, \quad \mathrm{AP} = a_{\mathrm{AP}} \log N + b_{\mathrm{AP}}, \quad \mathrm{AP}^{\mathrm{box}} = a_{\mathrm{box}} \log N + b_{\mathrm{box}},$$

where $N$ is the number of original images used to train the model, PQ, AP, and AP box are the respective model performance, and $a$ and $b$ are the fitted slope and intercept of each regression. The PQ, AP, and AP box values are plugged into the regression functions to estimate, in turn, the effective training set size for experiments with either original images only or original plus PanDA images. We define data efficiency (DE) as

$$\mathrm{DE} = \frac{N_{o+p} - N_{o}}{N_{o}} \times 100\%,$$

where $N_{o}$ is the estimated effective training set size of models trained on original images only, and $N_{o+p}$ is that of models trained on original and PanDA images. The definition is specific to the case where the ratio of original to synthetic images is 1. A higher DE suggests higher per image data efficiency, and one would expect a DE of 100% if a synthetic image is as informative as an original image.
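The procedure can be sketched in a few lines: fit a line to metric versus log dataset size, invert it to get an effective dataset size from an observed metric, and compare the two effective sizes. The sketch assumes a base-10 logarithm and takes DE to be the relative increase in effective training set size, which matches the stated property that DE is 100% when a synthetic image is as informative as an original one at a 1:1 ratio. All numbers in the example are made up for illustration and are not results from the paper.

```python
import numpy as np


def fit_log_linear(num_images, metric):
    """Least-squares fit of metric = slope * log10(N) + intercept."""
    slope, intercept = np.polyfit(np.log10(num_images), metric, deg=1)
    return float(slope), float(intercept)


def effective_size(metric_value, slope, intercept):
    """Invert the regression: the dataset size that would yield `metric_value`."""
    return 10.0 ** ((metric_value - intercept) / slope)


def data_efficiency(n_original, n_augmented):
    """Relative increase in effective training set size, in percent."""
    return 100.0 * (n_augmented - n_original) / n_original


# Hypothetical PQ scores for models trained on 10, 100, and 1,000 original images.
sizes = np.array([10, 100, 1000])
pq_scores = np.array([20.0, 35.0, 50.0])
slope, intercept = fit_log_linear(sizes, pq_scores)

# Hypothetical PQ with originals only (50.0) and with originals plus PanDA images (52.6).
n_o = effective_size(50.0, slope, intercept)
n_op = effective_size(52.6, slope, intercept)
de = data_efficiency(n_o, n_op)
```

Because the relationship is logarithmic, a gain of a few metric points can correspond to a large change in effective dataset size, which is why DE is a more interpretable measure than the raw metric difference.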
Two conclusions can be drawn from the results in Table 5. First, across scales and metrics, PanDA images are not as efficient as original images, as none of the DEs is near 100%. This is expected, since we only reuse the object instances in the original images and the synthetic images have only 40% non-background pixels per image on average. Second, the fact that a synthetic image can be half as efficient as a real one suggests that there is a significant amount of information embedded in the original images that is not learned by current models. One can expect superior models to capture this additional information without the help of data augmentation.
4.6 Ablation Study
We conducted an ablation study to evaluate the effectiveness of the individual operations of PanDA. To make fair comparisons, in this set of experiments the best performing UPSNet-50 model we could reproduce by training on the original Cityscapes training set was used as the baseline (first row in Table 6). For each experiment group, we varied PanDA's parameters under the ablation constraint and report the best performance obtained. Experiment groups are trained on their respective PanDA 1x augmented Cityscapes dataset. Doubling the number of training epochs without any augmented data (second row in Table 6) does not improve model performance, which suggests that the original UPSNet training parameters are near optimal. PanDA with the dropout operation only improves model performance, presumably by mitigating the data imbalance issue (Fig 4). Without resizing or shifting, the original object relationships are preserved and the images thus remain realistic (Fig 5(b)). As more operations are included in PanDA, the level of realism of the synthesized images decreases (Fig 5). However, the fact that the best performance is achieved with the least realistic image set suggests that, contrary to popular belief, the level of realism of synthetic images is not necessarily correlated with data augmentation performance.
Compared with GAN based and 3D simulation based data augmentation methods, PanDA has several advantages. First of all, PanDA does not require training. GAN based methods need to be trained to generate realistic images in the desired domain first before the generative network can be used for data augmentation. Second, no additional data is needed for PanDA. As mentioned before, training GANs not only takes time but also usually requires a large unlabelled dataset from the same domain. 3D simulation based methods almost always require hand crafted 3D models of classes of interest. Finally, it is computationally very cheap to use PanDA since it operates exclusively in pixel space. Many image operations such as cropping and resizing could be further parallelized and optimized which allows for real time image synthesis at training time in an active learning fashion.
There are certain limitations within the current implementation of PanDA that leave doors open for more exploration. First, the functions used in background padding, dropout, resize, and shift can be substituted with more optimized functions. For example, instead of using white noise to pad the background, uniform gray or pink noise can be used. Further optimization can be performed either by a more systematic grid search or in an automated way. Second, since the 3,000 training images can be divided into 78,000 segments, allowing hybrid images, in which segments in a new image come from different original images, would further expand the image variation combinatorially. Third, it remains unclear how many useful synthetic images can be generated with PanDA per original image. In principle, one could create an infinite number of synthetic images from the original dataset. Due to the high computation cost of training with very large datasets, this paper limits the scope to 2x and 3x of the original size. Investigating the limit of the number of useful synthetic images per original image is an interesting direction. Lastly, we do not know how well PanDA will generalize to other models. Recall that UPSNet, the model used in this paper, shares many features with competing models, such as a ResNet backbone, a Mask R-CNN instance head, and an FPN feature extractor. It is therefore likely that competing models would also benefit from PanDA augmented datasets.
It is worth noting that for challenges on ego-centric driving datasets like Cityscapes, it is also possible to improve model performance by pretraining it on a larger dataset from a related domain. For example, better performing models can be obtained by pretraining on MS COCO and fine tuning with Cityscapes. However, in a new and specialized domain, there might not exist a dataset available to pretrain on. In addition, for a large dataset like MS COCO, pretraining on smaller datasets is unlikely to help.
Finally, although PanDA operations are well justified from the standpoint of statistics, we were surprised that the best performing PanDA augmented datasets do not look natural or realistic to human eyes at all. Many PanDA generated images look qualitatively similar to the ones shown in Fig 2, where the positioning and occlusion of objects are not realistic. Many objects in the synthetic images appear to be "floating" on top of the noise background, out of their original context (Fig 2 (d), Fig 5 (c)); sometimes they cluster and overlap each other in the wrong depth order (Fig 2 (f)). It is known that contextual information helps human visual detection. We suspect that by taking objects out of their original context, PanDA presents harder challenges to the model and therefore forces the model to pay more attention to the pixels within the object foreground. In addition, although PanDA drastically reduces the total number of non-background pixels per synthetic image, the augmented datasets are more balanced. This suggests that a balanced but small dataset might be more helpful than a large but unbalanced one.
In this paper we present a simple and efficient method for data augmentation of annotated panoptic images to improve panoptic segmentation performance. PanDA is computationally cheap, and requires no training or additional data. After training with PanDA augmented datasets, top performing panoptic segmentation models further improve performance on two popular datasets, Cityscapes and MS COCO. To the best of our knowledge, PanDA is the first pixel-space data augmentation method with demonstrated performance gain for leading models on panoptic segmentation tasks. Further improvement is possible with fine-tuned parameters. The effectiveness of unrealistic images suggests that we should reconsider maximizing realism in image synthesis for data augmentation. Finally, PanDA opens new opportunities for exploring efficient pixel space data augmentation methods for detection and segmentation datasets, and we believe the community would benefit from these explorations in data augmentation.
References
[1] (2019) Synthetic examples improve generalization for rare classes. arXiv preprint arXiv:1904.05916.
[2] (2018) GAN augmentation: augmenting training data using generative adversarial networks. arXiv preprint arXiv:1810.10863.
[3] (2012) Multi-column deep neural networks for image classification. arXiv preprint arXiv:1202.2745.
[4] (2016) The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223.
[5] (2018) AutoAugment: learning augmentation policies from data. arXiv preprint arXiv:1805.09501.
[6] (2018) GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing 321, pp. 321–331.
[7] (2019) SSAP: single-shot instance segmentation with affinity pyramid. In Proceedings of the IEEE International Conference on Computer Vision, pp. 642–651.
[8] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
[9] (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034.
[10] (2019) An annotation saved is an annotation earned: using fully synthetic training for object detection. In Proceedings of the IEEE International Conference on Computer Vision Workshops.
[11] (2017) Beyond face rotation: global and local perception GAN for photorealistic and identity preserving frontal view synthesis. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2439–2448.
[12] (2019) Panoptic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9404–9413.
[13] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
[14] (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
[15] (2019) Pixel level data augmentation for semantic image segmentation using generative adversarial networks. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1902–1906.
[16] (2006) Scene context guides eye movements during visual search. Vision Research 46 (5), pp. 614–621.
[17] (2017) The Mapillary Vistas dataset for semantic understanding of street scenes. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4990–4999.
[18] (2019) Seamless scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8277–8286.
[19] (2016) Playing for data: ground truth from computer games. In European Conference on Computer Vision, pp. 102–118.
[20] (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
[21] (2017) Learning from synthetic humans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 109–117.
[22] (2019) UPSNet: a unified panoptic segmentation network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8818–8826.
[23] (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.