Semantic Segmentation on Multiple Visual Domains

07/09/2021 ∙ by Floris Naber, et al. ∙ 0

Semantic segmentation models only perform well on the domain they are trained on. Datasets for training are scarce and often have small label-spaces, because the pixel-level annotations required are expensive to make. Training models on multiple existing domains is therefore desirable to increase the output label-space. Current research shows that there is potential to improve accuracy across datasets by using multi-domain training, but this has not yet been successfully extended to datasets of three different non-overlapping domains without manual labelling. In this paper a method for this is proposed for the datasets Cityscapes, SUIM and SUN RGB-D, by creating a label-space that spans all classes of the datasets. Duplicate classes are merged and discrepant granularity is solved by keeping classes separate. Results show that the multi-domain model has a higher accuracy than all baseline models together if hardware performance is equalized, as resources are not limitless, showing that models benefit from additional data even from domains that have nothing in common.




I Introduction

In the field of computer vision, one of the big challenges is semantic segmentation: the task of dividing an image into labelled regions that describe the contents of the image on a pixel level. These labels (or classes) could include anything, for example cars, people, chairs, trees and fish. Semantic segmentation has many diverse applications, like autonomous driving [11], medical image analysis [10], infrastructure monitoring [1], robotics [21], photo editing [7] and many more. It is especially important in autonomous driving and robotics, because the models need to understand the context of the environment in which they are operating [16]. Semantic segmentation can also be considered a substantial preprocessing step for other tasks, including object detection, instance segmentation and scene understanding.


Currently in the field, semantic segmentation is tackled with deep neural networks that are trained using supervised learning. This training is done using datasets consisting of images that are labelled with a set of predetermined labels, the so-called label-space. These datasets are labelled manually on a pixel level and are therefore very expensive to make, compared to for example datasets for object detection, which only require bounding box labelling. As a result, only a limited amount of training data is available for semantic segmentation. Datasets are often small, have a limited domain and also a limited label-space, like urban driving scenes (Cityscapes [4]), indoor scenes (SUN RGB-D [19]) or underwater scenes (SUIM [8]) (see Fig. 1 for examples). This is undesirable, as neural networks thrive when they have access to a lot of diverse training data.
The accuracy of a model within a given domain can be very high, but as soon as a model leaves its domain, its performance can drop very quickly. For example a model trained on Cityscapes will perform well on urban roads, but will perform badly when facing a sidewalk with a restaurant and terrace, which is a situation better suited to a model trained on SUN RGB-D. While determining what image belongs to what dataset is very easy for datasets with a very clear domain, like the datasets mentioned above, this is not the case in the real world, as the border between different domains can be very blurry, as described by the example.
Making a single model that is trained simultaneously on these different domains can improve the accuracy on both domains, compared to training a separate model for each domain [6]. In applications where a system needs to operate across multiple domains, it is also desirable to use only one model, as otherwise developers are burdened with making multiple models for multiple domains. These models then either have to run simultaneously on multiple GPUs (expensive), or a controller needs to be developed to determine which model to use in every situation [12, 9].

Fig. 1: Example images (top) and colored annotations (bottom) of Cityscapes (left), SUIM (middle) and SUN RGB-D (right)

In this paper a method is proposed to combine three datasets from different domains (Cityscapes, SUN RGB-D and SUIM) without manual relabelling, to increase the size and diversity of the label-space a single model can train on. The goal is to research the hypothesis that model performance can be improved when trained on multiple domains, compared to models trained specifically for these domains combined. The system’s hardware requirements during operation of the system should not be increased and the design should be simple enough to allow for inclusion of more datasets in the future without difficulty.
First the challenges of multi-domain training for the three used datasets are presented in Sec. II and in Sec. III the solutions to these challenges proposed by state-of-the-art research and their shortcomings are explained. In Sec. IV a method that combines the three datasets is proposed that builds on the current research, after which the results are presented in Sec. V. The research is wrapped up with a discussion in Sec. VI and a conclusion in Sec. VII.

II Challenges of multi-domain training

Training a single neural network on multiple domains is not trivial, as there are challenges that need to be faced. The main challenge is the incompatibility between different datasets, as datasets often have incompatible label-spaces. Incompatibilities between the datasets used in this research consist of duplicate classes and discrepant granularity.
The first problem occurs when multiple datasets have identical classes. For example, Cityscapes and SUN RGB-D both have a class person, and these are treated as separate classes if the datasets are merged without addressing this. This would increase the label-space size unnecessarily, which could impair prediction accuracy and could hinder model training, as the model will try to learn differences between the two person classes, while there are none in reality. For this reason a solution should be found such that the model sees these classes as identical.
The second problem, discrepant granularity, occurs when something in an image corresponds to a single class in one dataset, but to multiple classes in another dataset. For example, the class building in Cityscapes contains the classes door, window and wall of SUN RGB-D. A simple solution would be to merge these more detailed classes of SUN RGB-D and map them to the Cityscapes building class, but then the detailed labelling that is available in SUN RGB-D would be lost, which is undesirable. Also, in an indoor domain it would be out of place to consider a building class. A solution should therefore be implemented that maintains a high granularity as well as a high accuracy.

III Related work

Some research on multi-domain semantic segmentation has already addressed these incompatibility problems. MSeg [12] introduces a composite dataset made up of many popular datasets, including Cityscapes, Mapillary Vistas [17], ADE20K [22] and COCO [13]. The method for merging all the datasets is quite straightforward. A universal label-space was created by simply choosing one by hand, based on all the included datasets. The total of 316 classes in all the datasets was reduced to 194 for MSeg, by merging most classes that were incompatible. To maintain a high granularity, some classes were manually relabelled to make them compatible with the rest of the dataset. The shortcoming of this idea is that, even though a very detailed and diverse new dataset was created, manual labelling is still used to create a multi-domain dataset. Using this method, it is expensive to expand the label-space to new domains, like that of SUIM, on which MSeg performs very poorly.
A semi-automated method of training a model on multiple incompatible datasets was proposed by Meletis and Dubbelman [14, 15] using a hierarchical structure. Three datasets were used: Cityscapes, Mapillary Vistas and GTSDB [5] (a bounding box traffic sign dataset). The solution proposed addresses the issue of discrepant granularity using a hierarchy, where the model first determines the regions in an image from the label-space of the highest hierarchy (i.e. lowest granularity). If one of these classes consists of multiple classes from a lower hierarchy, the model will relabel the pixels using the label-space of the lower hierarchy. A hierarchical structure makes for complex inference and is not computationally friendly, as the model has a head (the part of the model that makes the predictions) for every hierarchical level. Also, this model did not expand the label-space to multiple domains, as every class in Cityscapes included in the final model corresponds to a class or a combination of classes from Mapillary Vistas, which means only the discrepant granularity problem within the driving domain is addressed.
Bevandić et al. [2] proposed a method for creating a flat label-space for multiple domains, by creating a universal label-space that contains all the most detailed classes from all datasets included, after which all original classes are mapped to the classes of the universal label-space. Duplicate classes are simply mapped to the matching universal class and discrepant granularity is addressed by making the universal label-space match all the most detailed classes. So the Cityscapes road is mapped to 8 more detailed universal road classes, which correspond to the road classes of Mapillary Vistas. This solution is very interesting, as the final model has a label-space with a higher granularity than all of the original datasets, but this method also makes the label-space very complicated, as many pixels end up having partial (or multiple) labels, which results in a dataset not having a defined ground truth for every pixel. As a result, the models trained with their universal label-space had varied results on different benchmarks, suggesting that a good solution for multi-domain semantic segmentation has still not been achieved.
Multi-domain semantic segmentation is an issue that still has no good semi-automated solution. Research that currently has been done in the field still has shortcomings and did not manage to address all the problems. This research aims to build upon the works described above and explores a method of merging datasets from different domains effectively.

IV Method

In this section a method is proposed to merge 3 datasets: Cityscapes, SUN RGB-D and SUIM. First, the datasets are introduced in Sec. IV-A. Secondly, the dataset preparation is described in Sec. IV-B. Then the merging method is described in Sec. IV-C and finally the model choice and optimization process are described in Sec. IV-D.

Fig. 2: Class mapping of Cityscapes, SUIM and SUN RGB-D to the new universal label-space of the merged dataset.

IV-A Choosing datasets

To achieve semantic segmentation on multiple domains effectively, datasets need to be chosen that match the criteria for this research.
Firstly, the datasets should be from different domains and have useful classes, so a method of expanding the label-space and image-space can be researched. A useful class is one that a model can realistically encounter in a real life scenario. For example, brain tissue in X-ray scans is not a useful class for a system that has regular camera images as input.
Secondly, the datasets should have similar sizes, as solving problems regarding dataset size is not the focus of this research. For example, if a dataset has 10 times more images than another dataset, it would dominate the training data and therefore the model would be more suited to the domain of this dataset.
Finally, the datasets should have a size and label-space large enough to achieve meaningful results and small enough to allow for more experimenting, as training on very large datasets with large label-spaces takes more time.
The datasets that match all criteria and are chosen for this research are the Cityscapes dataset, containing 3475 training and validation images of urban driving scenes and 19 evaluation classes, SUN RGB-D, containing 5285 training and validation images of indoor scenes and 37 evaluation classes, and the SUIM dataset, containing 1525 train and validation images of underwater scenes and 8 classes. The complete label-spaces of the datasets can be seen in Fig. 2.

IV-B Dataset preparation

To allow for easy training on all 3 datasets, dataset errors need to be removed and file types should be the same across datasets, so that the model does not have to deal with unnecessary differences during training.
Firstly, the annotation files of Cityscapes and SUIM are PNG-files and BMP-files respectively, while all the SUN RGB-D annotations are stored in a single matrix structure. To make the annotation file type the same for all datasets, the SUN RGB-D annotation data was extracted and saved in a PNG-file for every image. The binary color codes of the SUIM annotations were converted to integers and also stored in PNG-files.
Secondly, images and corresponding annotations that did not have matching sizes were found in SUIM and were removed from the dataset, reducing its size from 1525 to 1488 training and validation images.
Finally, SUIM and SUN RGB-D have no defined training and validation splits, while Cityscapes does. For SUN RGB-D and SUIM, 1000 and 292 images are chosen respectively for validation only, resulting in a roughly 1 to 5 split of training and validation data, which is similar to the Cityscapes split. The validation sets are chosen such that class representation per pixel was similar in the validation set and training set, to make sure the validation data is a fair representation of the training data.
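The color-code conversion described above for SUIM can be illustrated with a minimal sketch. It assumes that every annotation pixel is an RGB triple whose channels are either fully off or fully on, so that the three channels form a 3-bit code. The function names, the threshold and the bit order are illustrative assumptions and are not taken from the paper.

```python
def color_to_index(pixel, threshold=128):
    """Map an (R, G, B) pixel to an integer 0..7 using one bit per channel."""
    r, g, b = pixel
    return (int(r >= threshold) << 2) | (int(g >= threshold) << 1) | int(b >= threshold)


def convert_annotation(rgb_rows):
    """Convert a 2-D grid of RGB pixels into a 2-D grid of integer class ids."""
    return [[color_to_index(p) for p in row] for row in rgb_rows]


# Tiny 2x2 example annotation (made-up data).
annotation = [
    [(0, 0, 0), (0, 0, 255)],
    [(255, 255, 0), (255, 255, 255)],
]
print(convert_annotation(annotation))  # [[0, 1], [6, 7]]
```

The resulting integer grid can then be written to a single-channel PNG file with any image library, so that all three datasets share the same annotation format.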

IV-C Merging the datasets

For this research it was chosen to make a merging method that is independent from the training and runs before training, so that no complex training or inference is necessary, to allow for further research with different models and implementation of more datasets and also to keep the system performance fast. It is also desired that the system maintains a high granularity, so the challenges stated in Sec. II should be addressed properly.

Duplicate classes

This problem is addressed by mapping the duplicate classes to the same universal class. The person class occurs both in Cityscapes and SUN RGB-D, as mentioned in Sec. II, so the Cityscapes person class is mapped to the same universal class as SUN RGB-D person. It should also be mentioned that both Cityscapes and SUN RGB-D have a class wall, but these classes are unique. Cityscapes defines a wall as a standalone wall outside, not connected directly to a building, while SUN RGB-D defines walls as walls of a house. The classes have therefore been renamed to outside wall and inside wall to avoid confusion. Cityscapes classes which are typically ignored are mapped to the same class as SUN RGB-D unlabeled, as these are all ignored during training and evaluation. SUIM does not have ignored classes. The mapping of every class to the universal label-space can be seen in Fig. 2.

Discrepant granularity

A method that was researched to solve discrepant granularity is pseudo-labelling, which relabels the ground truth of a more general class like Cityscapes building to more detailed classes like SUN RGB-D door, window and wall, using a pretrained model. This would mean the final model trains on the predictions of another model, which is undesirable as it reduces accuracy. Hierarchical structures and partial labelling were also not chosen, due to their complexity during inference.
The final solution that was chosen for this problem was to keep the labels mentioned above separate, as the classes occur in a very different context in the Cityscapes and SUN RGB-D domains and as it would result in the largest label-space. This way a wall, door or window seen from the outside is considered as part of the building, but on the inside it would be split up in the 3 different classes. Since all 4 classes have separate, non-overlapping definitions it should not cause any issues during training or testing.
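The two merging rules of this section (one shared universal class for duplicates, separate universal classes where granularity differs) can be sketched as a simple lookup-table construction that runs before training. This is a minimal sketch under stated assumptions: the function `build_universal_label_space`, the `renames` table and the short class lists are hypothetical and only cover a small subset of the real label-spaces.

```python
def build_universal_label_space(datasets, renames):
    """Merge per-dataset class lists into one flat label-space.

    datasets: dict mapping dataset name -> list of class names.
    renames:  dict mapping (dataset, class) -> universal class name.
    Classes that end up with the same universal name share one index.
    """
    universal = []  # universal class names, list index = universal id
    mapping = {}    # (dataset, local id) -> universal id
    for dataset, classes in datasets.items():
        for local_id, name in enumerate(classes):
            uni_name = renames.get((dataset, name), name)
            if uni_name not in universal:
                universal.append(uni_name)
            mapping[(dataset, local_id)] = universal.index(uni_name)
    return universal, mapping


# Illustrative subsets only, not the full class lists of the datasets.
datasets = {
    "cityscapes": ["road", "building", "wall", "person"],
    "suim": ["fish", "reef"],
    "sunrgbd": ["wall", "door", "window", "person"],
}
# The two wall classes have different definitions, so they are kept apart.
renames = {
    ("cityscapes", "wall"): "outside wall",
    ("sunrgbd", "wall"): "inside wall",
}

universal, mapping = build_universal_label_space(datasets, renames)
print(universal)
# person is shared between Cityscapes and SUN RGB-D:
print(mapping[("cityscapes", 3)] == mapping[("sunrgbd", 3)])  # True
```

In this toy example building, door and window simply remain separate entries. Applied to the full class lists, the same bookkeeping gives 19 + 8 + 37 - 1 = 63 universal classes, which matches the output size reported in Table I.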

IV-D The model

Choosing the model

The focus of this research is not on developing a model, so an existing model is chosen and slightly optimized for this research. Models used in this research are trained and optimized using the MMSegmentation toolbox [3] of OpenMMLab. The model that was chosen for this research is a Pyramid Scene Parsing Network (PSPNet), as this model was available with good hardware performance and thus allows for more experimenting. The specific model has a lower accuracy than other, more complex networks, but state-of-the-art accuracy is not the main goal of this research.

Optimizing the model

Bevandić et al. [2] state that multi-domain semantic segmentation benefits a lot from large batch sizes and crop sizes during training. For this reason it was chosen to downscale all the images to a size that would allow for large batch sizes, while not removing too much detail from the images, as this would decrease the accuracy of the model. An image resolution of 512x512 pixels was chosen, as this is similar to the resolution of SUN RGB-D and SUIM images. These datasets have low resolutions and low quality annotations compared to Cityscapes, so a size that is similar to these datasets is the best compromise.
The crop size is chosen to be 256x256 pixels. A smaller crop size was deemed too small, because there would be too little data in a single crop, which would result in the model not training effectively. Experimentation showed that larger crop sizes resulted in over-fitting on the training set: during training, crops are taken randomly from the images, so if the crop size is large, the randomness decreases and the risk of over-fitting increases.
The final image and crop sizes would allow for a batch size of 6 images per iteration on a desktop with a GTX 1070 with 8GB of Memory, which is the system used for experimenting.
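The relation between crop size and randomness can be made concrete with a small standalone sketch. It is not the cropping code of MMSegmentation. The helper names are hypothetical, and the number of distinct crop offsets is used here only as a rough proxy for how much variation random cropping adds.

```python
import random


def random_crop_box(image_size, crop_size, rng):
    """Return (left, top, right, bottom) of a random crop inside the image."""
    width, height = image_size
    crop_w, crop_h = crop_size
    left = rng.randint(0, width - crop_w)  # randint is inclusive on both ends
    top = rng.randint(0, height - crop_h)
    return left, top, left + crop_w, top + crop_h


def num_crop_positions(image_size, crop_size):
    """Number of distinct crop offsets that fit inside the image."""
    width, height = image_size
    crop_w, crop_h = crop_size
    return (width - crop_w + 1) * (height - crop_h + 1)


rng = random.Random(0)
print(random_crop_box((512, 512), (256, 256), rng))
print(num_crop_positions((512, 512), (256, 256)))  # 66049
print(num_crop_positions((512, 512), (448, 448)))  # 4225
```

For a 512x512 image, a 256x256 crop can be placed at 66049 different offsets, while a hypothetical 448x448 crop has only 4225, so larger crops show the model far fewer distinct views of each image.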

The learning rate schedule that was chosen is a step function, which is independent of other training parameters like the total number of iterations, as this allows for easier experimenting compared to other schedules. The number of iterations was finally set to 80k, meaning enough epochs would pass for the results to be useful. 80k iterations is quite a lot, but the risk of over-fitting was very low due to the random crops. The learning rate starts at 0.01 and halves every 20k iterations, which resulted in sufficient accuracy during validation.
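The schedule described above can be written down in a few lines. This is a sketch of the stated rule (start at 0.01, halve every 20k iterations), not the MMSegmentation configuration that was used, and the function names are hypothetical. The epoch estimate assumes the standard Cityscapes training split of 2975 images and the validation sizes from Sec. IV-B.

```python
def step_learning_rate(iteration, base_lr=0.01, step=20_000, gamma=0.5):
    """Learning rate that is multiplied by `gamma` every `step` iterations."""
    return base_lr * gamma ** (iteration // step)


def epochs_seen(iterations, batch_size, num_train_images):
    """Approximate number of passes over the training set."""
    return iterations * batch_size / num_train_images


for it in (0, 20_000, 40_000, 60_000, 79_999):
    print(it, step_learning_rate(it))

# Training images: Cityscapes 2975 (assumed standard training split),
# SUN RGB-D 5285 - 1000 and SUIM 1488 - 292 (numbers from Sec. IV-B).
num_train = 2975 + (5285 - 1000) + (1488 - 292)
print(num_train, round(epochs_seen(80_000, 6, num_train), 1))  # 8456 56.8
```

Under these assumptions, 80k iterations with a batch size of 6 correspond to roughly 57 passes over the merged training set, and the learning rate ends at 0.00125.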

V Results

V-A Metrics

To analyze the performance of the final models, the evaluation metric mean Intersection over Union (mIoU) is used, which is currently the most used evaluation metric in the field. Popular benchmarks use this metric to determine the order of the leaderboards. The main reason to choose this metric over a metric such as pixel accuracy (true positives divided by the total number of pixels) is that pixel accuracy does not compensate for uncommon classes. mIoU, on the other hand, takes the average of the IoU (true positives divided by the sum of true positives, false positives and false negatives) of every class [20], so accuracy on every class is equally important.
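The difference between the two metrics can be illustrated with a minimal sketch that computes both from a confusion matrix. The helper names and the toy data are illustrative. Classes that occur in neither the ground truth nor the prediction are skipped, which is one common convention and an assumption here.

```python
def confusion_matrix(truth, pred, num_classes):
    """matrix[t][p] counts pixels with ground truth t that were predicted as p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, pred):
        matrix[t][p] += 1
    return matrix


def pixel_accuracy(matrix):
    """Correctly classified pixels divided by all pixels."""
    correct = sum(matrix[c][c] for c in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def mean_iou(matrix):
    """Average of TP / (TP + FP + FN) over the classes that occur at all."""
    ious = []
    for c in range(len(matrix)):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp
        fp = sum(row[c] for row in matrix) - tp
        if tp + fp + fn > 0:
            ious.append(tp / (tp + fp + fn))
    return sum(ious) / len(ious)


# 10 pixels: class 0 is common, class 1 is rare and always missed.
truth = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
m = confusion_matrix(truth, pred, num_classes=2)
print(pixel_accuracy(m))  # 0.8
print(mean_iou(m))        # 0.4
```

In this toy case the rare class is never predicted: pixel accuracy still reports 0.8, while mIoU drops to 0.4 because the missed class contributes an IoU of 0.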

V-B Analysis

The final model is compared to identical models trained on the same hardware setup. Three models are trained on the datasets individually, to see how the accuracy of the model trained on the merged dataset compares to the accuracy of separate models. Such a system has several disadvantages, as mentioned in Sec. I, but it is a good baseline to compare with. A model is also trained on Cityscapes and SUIM only, as there is no discrepant granularity between these datasets. This makes it possible to observe whether a model trained on easily merged datasets improves accuracy compared to separate models.
Hardware performance of the baseline models combined is worse than the models trained on the merged datasets, as can be seen in table I. The model trained on Cityscapes and SUIM performs best as expected, as the label-space is smaller than that of all 3 datasets combined, but this model does not work on the SUN RGB-D dataset, which can also be seen in Fig. 3.

| Train | Training time per iteration [ms] | Memory [GB] | Output size [# labels] |
|---|---|---|---|
| Individual datasets | 450 | 14.7 | 19/8/37 |
| Cityscapes + SUIM | 149 | 4.9 | 27 |
| Citys + SUIM + SUN | 180 | 4.9 | 63 |

TABLE I: Hardware performance of the baseline models added together, the model trained on Cityscapes and SUIM and the model trained on all 3 datasets merged.
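The ratios that follow from Table I can be checked with a few lines of arithmetic. The sketch below only reuses the numbers from the table. Time is the training time per iteration as listed there, and the variable names are illustrative.

```python
def ratio(a, b):
    """Plain ratio of two measurements."""
    return a / b


# Baselines combined versus the model trained on all 3 datasets (Table I).
time_speedup = ratio(450, 180)      # training time per iteration [ms]
memory_ratio = ratio(14.7, 4.9)     # memory [GB]
label_growth = ratio(63, 37) - 1    # versus the largest baseline label-space

print(time_speedup)               # 2.5
print(round(memory_ratio, 2))     # 3.0
print(round(100 * label_growth))  # 70
```

The merged model is thus 2.5 times faster per training iteration than the three baselines combined, needs about a third of their combined memory and has an output label-space that is about 70% larger than that of the largest baseline.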

When looking at the results from table II, it can be observed that the models trained on the individual datasets have the highest accuracy and the overall accuracy decreases when more datasets are merged. This can be explained by the fact that the individual models together have trained on 3 times more training data, as all models are trained with the same batch size and iteration total. If the batch size of the baseline models is changed such that the total amount of training data is the same for the baseline models together, then a fair comparison can be made. This is done in Sec. V-C. The drop in accuracy from the baseline models to the model trained on Cityscapes and SUIM is smaller than the drop when all datasets are merged. This can be explained by the fact that SUN RGB-D has the largest label-space and image total, so it stands to reason that this dataset has the largest influence on the results.

| Train | Cityscapes | SUIM | SUN RGB-D |
|---|---|---|---|
| Individual datasets | 60.92 | 60.79 | 32.16 |
| Cityscapes + SUIM | 56.39 | 57.39 | 0.00 |
| Citys + SUIM + SUN | 42.68 | 51.81 | 26.96 |

TABLE II: Semantic segmentation accuracy (mIoU [%]) by models trained on the datasets Cityscapes, SUIM and SUN RGB-D individually, Cityscapes and SUIM merged, and all 3 datasets merged.

V-C Accounting for batch size

If the batch size is reduced from 6 to 3 for the baseline models of Cityscapes and SUIM, the image total and training time are the same as for the model trained on both of them. The results in this case (table III) show that the model trained on both datasets has a higher accuracy on SUIM than the baseline model of SUIM, while both models have trained on approximately the same amount of training data from this dataset. This improvement can also clearly be seen in the segmentation results in Fig. 3. The model trained on both datasets still performs worse on Cityscapes than the baseline model of Cityscapes, but Fig. 3 and table III show that the difference is very minor. One of the causes could be that the SUIM dataset is too small to benefit a model trained on Cityscapes.
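The training budget behind this comparison can be written out as a short calculation. The sketch assumes 80k iterations for every model, as stated in Sec. IV-D, and counts processed training crops. The function name is hypothetical.

```python
def images_seen(iterations, batch_size, num_models=1):
    """Total number of training crops processed by `num_models` models."""
    return iterations * batch_size * num_models


ITERATIONS = 80_000

merged = images_seen(ITERATIONS, batch_size=6)
baselines_full = images_seen(ITERATIONS, batch_size=6, num_models=3)
baselines_half = images_seen(ITERATIONS, batch_size=3, num_models=2)

print(merged)                    # 480000
print(baselines_full // merged)  # 3
print(baselines_half == merged)  # True
```

With a batch size of 6, the three baselines together process three times as many crops as one merged model. With a batch size of 3, the Cityscapes and SUIM baselines together process exactly as many crops as the model trained on both datasets.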

Fig. 3: Semantic segmentation results on images (row 1) of Cityscapes (left), SUIM (middle) and SUN RGB-D (right) not seen by the models during training. Row 2 shows the ground truth, row 3 to row 5 show predictions by models trained on the individual datasets (batch size 3)(row 3), Cityscapes and SUIM (batch size 6)(row 4) and all 3 datasets (batch size 6)(row 5).

The batch size is also reduced from 6 to 3 for the baseline model of SUN RGB-D, as this dataset comprises around half of the total amount of training data. Now the model trained on all 3 datasets scores an mIoU that is almost 5 percentage points higher than that of the baseline model of SUN RGB-D (26.96% versus 22.12%, table III), even though they have seen the same amount of training data from this domain. This improvement can very clearly be seen in the segmentation results in Fig. 3, showing that training on multiple domains improves the accuracy on this domain a lot.
The overall performance increase of the model trained on all 3 domains, compared to these baseline models, can be explained by the fact that the model is exposed to a larger variety of data and labels at the same time. It therefore has a more detailed understanding of what features define each class and is able to separate classes better, even within a single domain, than models trained on every domain separately.

| Train | Batch size | Test: Cityscapes | Test: SUIM | Test: SUN RGB-D |
|---|---|---|---|---|
| Individual datasets | 3 | 57.45 | 55.02 | 22.12 |
| Cityscapes + SUIM | 6 | 56.39 | 57.39 | 0.00 |
| Citys + SUIM + SUN | 6 | 42.68 | 51.81 | 26.96 |

TABLE III: Semantic segmentation accuracy (mIoU [%]) by models trained on the datasets individually with reduced batch size and models trained on merged datasets with original batch sizes.

V-D Discrepant granularity and duplicate classes

The merging of the person classes showed a large accuracy increase on the SUN RGB-D validation set, as can be seen in table IV: the IoU improved from 0.09% to 2.65%. The accuracy increase can be explained by the fact that the person class is extremely rare in the SUN RGB-D dataset and therefore models trained on this dataset benefit a lot from additional training data that contains this class more often. On the other hand, it could also be the case that this improvement is caused by adding more datasets to the training data, as the increase is in line with the average accuracy increase. There is an accuracy decrease for the person class in Cityscapes, which is less than the overall mIoU decrease on Cityscapes (see table II), suggesting that the merging does not contribute towards the mIoU drop.
The solution to discrepant granularity between Cityscapes and SUN RGB-D resulted in a decrease of around 5% for all 4 involved classes when the datasets are merged, as can be seen in table IV. This drop in accuracy is significantly less than the overall drop in mIoU on the domains of both Cityscapes and SUN RGB-D when the datasets are merged.
When looking at the baseline model with a reduced batch size, it can be seen that there is no performance drop in any of the SUN RGB-D classes when the datasets are merged. The accuracy on the class door even increases by 2%. An explanation for no accuracy drop could be that the context in which the conflicting classes occur is very different, namely in the Cityscapes and SUN RGB-D domain, and that the class definitions do not necessarily overlap. As a result the model can still distinguish these classes.

| Class | Individual datasets: Cityscapes (batch size 6) | Individual datasets: SUN RGB-D (batch size 6) | Individual datasets: SUN RGB-D (batch size 3) | All 3 datasets: Cityscapes (batch size 6) | All 3 datasets: SUN RGB-D (batch size 6) |
|---|---|---|---|---|---|
| Person | 62.79 | 0.09 | 0.00 | 47.92 | 2.65 |
| Building | 86.61 | - | - | 82.08 | - |
| Inside wall | - | 70.92 | 66.29 | - | 65.94 |
| Door | - | 29.77 | 22.11 | - | 24.48 |
| Window | - | 43.55 | 37.83 | - | 38.08 |

TABLE IV: Semantic segmentation accuracy (IoU [%]) of 5 different classes by models trained on the datasets Cityscapes and SUN RGB-D individually and all 3 datasets merged together.

VI Discussion

The discrepant granularity solution shows accuracy can be improved on classes when their definitions do not overlap and the context in which they occur is different. This solution will not work between datasets where the context is similar or where definitions clearly overlap. Further research can explore how to address this while keeping a high granularity and flat label-space to avoid training and inference complexities.
The results of this research could be further improved if the hardware used during experimentation was more powerful, as batch size clearly has a large influence on the results. Further research can explore how batch size and batch choice influence the performance of a model trained on multiple domains.
During this research it has also been found that a model trained on multiple non-overlapping domains is extremely good at learning the differences between individual domains. During evaluation, the final model will almost exclusively predict classes from the domain that each image is from. Further research could dive into why this happens and whether there is a way to use this phenomenon to improve model performance.
A dataset that was first considered, but ultimately not used in this research, is COCO-stuff, because it has too many labels and images, 182 labels and over 100,000 images. The diverse label-space resulted in very long training times and the system was very slow at handling the large amount of files contained in the dataset, due to its sheer size. Further research could study if accuracy on large and diverse datasets also benefits from multi-domain training.

VII Conclusion

From the results it can be concluded that accuracy on individual domains can be improved by training on multiple domains, if batch size is taken into account to equalize hardware performance between all models, as hardware resources are not limitless. This means that semantic segmentation accuracy on a domain improves if the model is trained on additional data from a different domain, even when the domains have nothing in common. The multi-domain model has a 70% larger output label-space than the largest baseline model, uses the same memory during training and has 2.5 times faster inference than the baseline models combined.


  • [1] S. M. Azimi, C. Henry, L. Sommer, A. Schumann, and E. Vig (2020) SkyScapes - Fine-Grained Semantic Understanding of Aerial Scenes. arXiv. External Links: ISSN 23318422 Cited by: §I.
  • [2] P. Bevandić, M. Oršić, I. Grubišić, J. Šarić, and S. Šegvić (2020) Multi-domain Semantic Segmentation on Datasets with Overlapping Classes. pp. 1–12. External Links: 2009.01636, Link Cited by: §III, §IV-D.
  • [3] MMSegmentation Contributors (2020) MMSegmentation: OpenMMLab Semantic Segmentation Toolbox and Benchmark. Cited by: §IV-D.
  • [4] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The Cityscapes Dataset for Semantic Urban Scene Understanding. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2016-Decem, pp. 3213–3223. External Links: Document, 1604.01685, ISBN 9781467388504, ISSN 10636919 Cited by: §I.
  • [5] C. Ertler, J. Mislej, T. Ollmann, L. Porzi, G. Neuhold, and Y. Kuang (2020) The Mapillary Traffic Sign Dataset for Detection and Classification on a Global Scale. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 12368 LNCS, pp. 68–84. External Links: Document, 1909.04422, ISBN 9783030585914, ISSN 16113349 Cited by: §III.
  • [6] D. Fourure, R. Emonet, E. Fromont, D. Muselet, N. Neverova, A. Trémeau, and C. Wolf (2017) Multi-task, multi-domain learning: Application to semantic segmentation and pose regression. Neurocomputing 251, pp. 68–80. External Links: Document, ISSN 18728286 Cited by: §I.
  • [7] Y. Gao, T. Liao, X. Liu, and J. Mu (2018) Seamless Image Cloning with Semantic Segmentation. pp. 1–5. Cited by: §I.
  • [8] M. J. Islam, C. Edge, Y. Xiao, P. Luo, M. Mehtaz, C. Morse, S. S. Enan, and J. Sattar (2020) Semantic segmentation of underwater imagery: Dataset and benchmark. IEEE International Conference on Intelligent Robots and Systems, pp. 1769–1776. External Links: Document, 2004.01241, ISBN 9781728162126, ISSN 21530866 Cited by: §I.
  • [9] S. Jain, D. P. Pani, M. Danelljan, and L. Van Gool (2020) Scaling semantic segmentation beyond 1k classes on a single gpu. arXiv preprint arXiv:2012.07489. Cited by: §I.
  • [10] F. Jiang, A. Grigorev, S. Rho, Z. Tian, Y. S. Fu, W. Jifara, K. Adil, and S. Liu (2018) Medical image semantic segmentation based on deep learning. Neural Computing and Applications 29 (5), pp. 1257–1265. External Links: Document, ISSN 09410643 Cited by: §I.
  • [11] C. Kaymak and A. Ucar (2018) A Brief Survey and an Application of Semantic Image Segmentation for Autonomous Driving. arXiv. External Links: ISSN 23318422 Cited by: §I.
  • [12] J. Lambert, Z. Liu, O. Sener, J. Hays, and V. Koltun (2020) MSeg: A composite dataset for multi-domain semantic segmentation. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2876–2885. External Links: Document, ISSN 10636919 Cited by: §I, §III.
  • [13] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: Common objects in context. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 8693 LNCS (PART 5), pp. 740–755. External Links: Document, 1405.0312, ISSN 16113349 Cited by: §III.
  • [14] P. Meletis and G. Dubbelman (2018) Training of Convolutional Networks on Multiple Heterogeneous Datasets for Street Scene Semantic Segmentation. IEEE Intelligent Vehicles Symposium, Proceedings 2018-June, pp. 1045–1050. External Links: Document, 1803.05675, ISBN 9781538644522 Cited by: §III.
  • [15] P. Meletis, R. Romijnders, and G. Dubbelman (2019) Data selection for training semantic segmentation cnns with cross-dataset weak supervision. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pp. 3682–3688. Cited by: §III.
  • [16] D. Mwiti (2019) A 2019 Guide to Semantic Segmentation. External Links: Link Cited by: §I.
  • [17] G. Neuhold, T. Ollmann, S. R. Bulo, and P. Kontschieder (2017) The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes. Proceedings of the IEEE International Conference on Computer Vision 2017-Octob, pp. 5000–5009. External Links: Document, ISBN 9781538610329, ISSN 15505499 Cited by: §III.
  • [18] M. H. Saffar, M. Fayyaz, M. Sabokrou, and M. Fathy (2018) Semantic video segmentation: A review on recent approaches. arXiv. External Links: 1806.06172, ISSN 23318422 Cited by: §I.
  • [19] S. Song, S. P. Lichtenberg, and J. Xiao (2015) SUN RGB-D: A RGB-D scene understanding benchmark suite. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 07-12-June, pp. 567–576. External Links: Document, ISBN 9781467369640, ISSN 10636919 Cited by: §I.
  • [20] E. Tiu (2019) Metrics to Evaluate your Semantic Segmentation Model. External Links: Link Cited by: §V-A.
  • [21] D. Wolf, J. Prankl, and M. Vincze (2016) Enhancing Semantic Segmentation for Robotics: The Power of 3-D Entangled Forests. IEEE Robotics and Automation Letters 1 (1), pp. 49–56. External Links: Document, Link Cited by: §I.
  • [22] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba (2017) Scene parsing through ADE20K dataset. Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 2017-Janua, pp. 5122–5130. External Links: Document, ISBN 9781538604571 Cited by: §III.