aimed to keep up with the state-of-art semantic segmentation.
Image semantic segmentation is more and more being of interest for computer vision and machine learning researchers. Many applications on the rise need accurate and efficient segmentation mechanisms: autonomous driving, indoor navigation, and even virtual or augmented reality systems to name a few. This demand coincides with the rise of deep learning approaches in almost every field or application target related to computer vision, including semantic segmentation or scene understanding. This paper provides a review on deep learning methods for semantic segmentation applied to various application areas. Firstly, we describe the terminology of this field as well as mandatory background concepts. Next, the main datasets and challenges are exposed to help researchers decide which are the ones that best suit their needs and their targets. Then, existing methods are reviewed, highlighting their contributions and their significance in the field. Finally, quantitative results are given for the described methods and the datasets in which they were evaluated, following up with a discussion of the results. At last, we point out a set of promising future works and draw our own conclusions about the state of the art of semantic segmentation using deep learning techniques.READ FULL TEXT VIEW PDF
Deep learning has been revolutionary for computer vision and semantic
Deep learning has recently become one of the most popular sub-fields of
This paper gives an overview on semantic segmentation consists of an
In recent years, deep learning methods have outperformed other methods i...
Nowadays, Deep learning techniques show dramatic performance on computer...
Over the past decade deep learning has driven progress in 2D image
Accurate estimation of the positions and shapes of microscale objects is...
aimed to keep up with the state-of-art semantic segmentation.
Nowadays, semantic segmentation – applied to still 2D images, video, and even 3D or volumetric data – is one of the key problems in the field of computer vision. Looking at the big picture, semantic segmentation is one of the high-level task that paves the way towards complete scene understanding. The importance of scene understanding as a core computer vision problem is highlighted by the fact that an increasing number of applications nourish from inferring knowledge from imagery. Some of those applications include autonomous driving , human-machine interaction , computational photography , image search engines , and augmented reality to name a few. Such problem has been addressed in the past using various traditional computer vision and machine learning techniques. Despite the popularity of those kind of methods, the deep learning revolution has turned the tables so that many computer vision problems – semantic segmentation among them – are being tackled using deep architectures, usually Convolutional Neural Networks , which are surpassing other approaches by a large margin in terms of accuracy and sometimes even efficiency. However, deep learning is far from the maturity achieved by other old-established branches of computer vision and machine learning. Because of that, there is a lack of unifying works and state of the art reviews. The ever-changing state of the field makes initiation difficult and keeping up with its evolution pace is an incredibly time-consuming task due to the sheer amount of new literature being produced. This makes it hard to keep track of the works dealing with semantic segmentation and properly interpret their proposals, prune subpar approaches, and validate results.
To the best of our knowledge, this is the first review to focus explicitly on deep learning for semantic segmentation. Various semantic segmentation surveys already exist such as the works by Zhu et al. and Thoma
, which do a great work summarizing and classifying existing methods, discussing datasets and metrics, and providing design choices for future research directions. However, they lack some of the most recent datasets, they do not analyze frameworks, and none of them provide details about deep learning techniques. Because of that, we consider our work to be novel and helpful thus making it a significant contribution for the research community.
The key contributions of our work are as follows:
We provide a broad survey of existing datasets that might be useful for segmentation projects with deep learning techniques.
An in-depth and organized review of the most significant methods that use deep learning for semantic segmentation, their origins, and their contributions.
A thorough performance evaluation which gathers quantitative metrics such as accuracy, execution time, and memory footprint.
A discussion about the aforementioned results, as well as a list of possible future works that might set the course of upcoming advances, and a conclusion summarizing the state of the art of the field.
The remainder of this paper is organized as follows. Firstly, Section 2 introduces the semantic segmentation problem as well as notation and conventions commonly used in the literature. Other background concepts such as common deep neural networks are also reviewed. Next, Section 3 describes existing datasets, challenges, and benchmarks. Section 4 reviews existing methods following a bottom-up complexity order based on their contributions. This section focuses on describing the theory and highlights of those methods rather than performing a quantitative evaluation. Finally, Section 5 presents a brief discussion on the presented methods based on their quantitative results on the aforementioned datasets. In addition, future research directions are also laid out. At last, Section 6 summarizes the paper and draws conclusions about this work and the state of the art of the field.
In order to properly understand how semantic segmentation is tackled by modern deep learning architectures, it is important to know that it is not an isolated field but rather a natural step in the progression from coarse to fine inference. The origin could be located at classification, which consists of making a prediction for a whole input, i.e., predicting which is the object in an image or even providing a ranked list if there are many of them. Localization or detection is the next step towards fine-grained inference, providing not only the classes but also additional information regarding the spatial location of those classes, e.g., centroids or bounding boxes. Providing that, it is obvious that semantic segmentation is the natural step to achieve fine-grained inference, its goal: make dense predictions inferring labels for every pixel; this way, each pixel is labeled with the class of its enclosing object or region. Further improvements can be made, such as instance segmentation (separate labels for different instances of the same class) and even part-based segmentation (low-level decomposition of already segmented classes into their components). Figure 1 shows the aforementioned evolution. In this review, we will mainly focus on generic scene labeling, i.e., per-pixel class segmentation, but we will also review the most important methods on instance and part-based segmentation.
In the end, the per-pixel labeling problem can be reduced to the following formulation: find a way to assign a state from the label space
to each one of the elements of a set of random variables. Each label represents a different class or object, e.g., aeroplane, car, traffic sign, or background. This label space has possible states which are usually extended to and treating as background or a void class. Usually, is a 2D image of pixels . However, that set of random variables can be extended to any dimensionality such as volumetric data or hyperspectral images.
Apart from the problem formulation, it is important to remark some background concepts that might help the reader to understand this review. Firstly, common networks, approaches, and design decisions that are often used as the basis for deep semantic segmentation systems. In addition, common techniques for training such as transfer learning. At last, data pre-processing and augmentation approaches.
As we previously stated, certain deep networks have made such significant contributions to the field that they have become widely known standards. It is the case of AlexNet, VGG-16, GoogLeNet, and ResNet. Such was their importance that they are currently being used as building blocks for many segmentation architectures. For that reason, we will devote this section to review them.
AlexNet was the pioneering deep CNN that won the ILSVRC-2012 with a TOP-5 test accuracy of while the closest competitor, which made use of traditional techniques instead of deep architectures, achieved a accuracy in the same challenge. The architecture presented by Krizhevsky et al. 
was relatively simple. It consists of five convolutional layers, max-pooling ones,Rectified Linear Units as non-linearities, three fully-connected layers, and dropout. Figure 2 shows that CNN architecture.
Visual Geometry Group (VGG) is a CNN model introduced by the Visual Geometry Group (VGG) from the University of Oxford. They proposed various models and configurations of deep CNNs , one of them was submitted to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)-2013. That model, also known as VGG-16 due to the fact that it is composed by weight layers, became popular thanks to its achievement of TOP-5 test accuracy. Figure 3 shows the configuration of VGG-16. The main difference between VGG-16 and its predecessors is the use of a stack of convolution layers with small receptive fields in the first layers instead of few layers with big receptive fields. This leads to less parameters and more non-linearities in between, thus making the decision function more discriminative and the model easier to train.
GoogLeNet is a network introduced by Szegedy et al.  which won the ILSVRC-2014 challenge with a TOP-5 test accuracy of . This CNN architecture is characterized by its complexity, emphasized by the fact that it is composed by layers and a newly introduced building block called inception module (see Figure 4). This new approach proved that CNN layers could be stacked in more ways than a typical sequential manner. In fact, those modules consist of a Network in Network (NiN) layer, a pooling operation, a large-sized convolution layer, and small-sized convolution layer. All of them are computed in parallel and followed by convolution operations to reduce dimensionality. Thanks to those modules, this network puts special consideration on memory and computational cost by significantly reducing the number of parameters and operations.
Microsoft’s ResNet is specially remarkable thanks to winning ILSVRC-2016 with accuracy. Apart from that fact, the network is well-known due to its depth ( layers) and the introduction of residual blocks (see Figure 5). The residual blocks address the problem of training a really deep architecture by introducing identity skip connections so that layers can copy their inputs to the next layer.
The intuitive idea behind this approach is that it ensures that the next layer learns something new and different from what the input has already encoded (since it is provided with both the output of the previous layer and its unchanged input). In addition, this kind of connections help overcoming the vanishing gradients problem.
In order to extend Recurrent Neural Networks architectures to multi-dimensional tasks, Graves et al.  proposed a Multi-dimensional Recurrent Neural Network (MDRNN) architecture which replaces each single recurrent connection from standard RNNs with connections, where is the number of spatio-temporal data dimensions. Based on this initial approach, Visin el al.  proposed ReNet architecture in which instead of multidimensional RNNs, they have been using usual sequence RNNs. In this way, the number of RNNs is scaling linearly at each layer regarding to the number of dimensions of the input image (). In this approach, each convolutional layer (convolution + pooling) is replaced with four RNNs sweeping the image vertically and horizontally in both directions as we can see in Figure 6.
Training a deep neural network from scratch is often not feasible because of various reasons: a dataset of sufficient size is required (and not usually available) and reaching convergence can take too long for the experiments to be worth. Even if a dataset large enough is available and convergence does not take that long, it is often helpful to start with pre-trained weights instead of random initialized ones. Fine-tuning the weights of a pre-trained network by continuing with the training process is one of the major transfer learning scenarios.
Yosinski et al. proved that transferring features even from distant tasks can be better than using random initialization, taking into account that the transferability of features decreases as the difference between the pre-trained task and the target one increases.
However, applying this transfer learning technique is not completely straightforward. On the one hand, there are architectural constraints that must be met to use a pre-trained network. Nevertheless, since it is not usual to come up with a whole new architecture, it is common to reuse already existing network architectures (or components) thus enabling transfer learning. On the other hand, the training process differs slightly when fine-tuning instead of training from scratch. It is important to choose properly which layers to fine-tune – usually the higher-level part of the network, since the lower one tends to contain more generic features – and also pick an appropriate policy for the learning rate, which is usually smaller due to the fact that the pre-trained weights are expected to be relatively good so there is no need to drastically change them.
Due to the inherent difficulty of gathering and creating per-pixel labelled segmentation datasets, their scale is not as large as the size of classification datasets such as ImageNet. This problem gets even worse when dealing with RGB-D or 3D datasets, which are even smaller. For that reason, transfer learning, and in particular fine-tuning from pre-trained classification networks is a common trend for segmentation networks and has been successfully applied in the methods that we will review in the following sections.
Data augmentation is a common technique that has been proven to benefit the training of machine learning models in general and deep architectures in particular; either speeding up convergence or acting as a regularizer, thus avoiding overfitting and increasing generalization capabilities.
It typically consist of applying a set of transformations in either data or feature spaces, or even both. The most common augmentations are performed in the data space. That kind of augmentation generates new samples by applying transformations to the already existing data. There are many transformations that can be applied: translation, rotation, warping, scaling, color space shifts, crops, etc. The goal of those transformations is to generate more samples to create a larger dataset, preventing overfitting and presumably regularizing the model, balance the classes within that database, and even synthetically produce new samples that are more representative for the use case or task at hand.
Augmentations are specially helpful for small datasets, and have proven their efficacy with a long track of success stories. For instance, in , a dataset of portrait images is augmented synthesizing four new scales (), four new rotations (), and four gamma variations () to generate a new dataset of training images. That process allowed them to raise the accuracy of their system for portrait segmentation from to Intersection over Union (IoU) when including that augmented dataset for fine-tuning.
Two kinds of readers are expected for this type of review: either they are initiating themselves in the problem, or either they are experienced enough and they are just looking for the most recent advances made by other researchers in the last few years. Although the second kind is usually aware of two of the most important aspects to know before starting to research in this problem, it is critical for newcomers to get a grasp of what are the top-quality datasets and challenges. Therefore, the purpose of this section is to kickstart novel scientists, providing them with a brief summary of datasets that might suit their needs as well as data augmentation and preprocessing tips. Nevertheless, it can also be useful for hardened researchers who want to review the fundamentals or maybe discover new information.
Arguably, data is one of the most – if not the most – important part of any machine learning system. When dealing with deep networks, this importance is increased even more. For that reason, gathering adequate data into a dataset is critical for any segmentation system based on deep learning techniques. Gathering and constructing an appropriate dataset, which must have a scale large enough and represent the use case of the system accurately, needs time, domain expertise to select relevant information, and infrastructure to capture that data and transform it to a representation that the system can properly understand and learn. This task, despite the simplicity of its formulation in comparison with sophisticated neural network architecture definitions, is one of the hardest problems to solve in this context. Because of that, the most sensible approach usually means using an existing standard dataset which is representative enough for the domain of the problem. Following this approach has another advantage for the community: standardized datasets enable fair comparisons between systems; in fact, many datasets are part of a challenge which reserves some data – not provided to developers to test their algorithms – for a competition in which many methods are tested, generating a fair ranking of methods according to their actual performance without any kind of data cherry-picking.
In the following lines we describe the most popular large-scale datasets currently in use for semantic segmentation. All datasets listed here provide appropriate pixel-wise or point-wise labels. The list is structured into three parts according to the nature of the data: 2D or plain RGB datasets, 2.5D or RGB-Depth (RGB-D) ones, and pure volumetric or 3D databases. Table I shows a summarized view, gathering all the described datasets and providing useful information such as their purpose, number of classes, data format, and training/validation/testing splits.
|Name and Reference||Purpose||Year||Classes||Data||Resolution||Sequence||Synthetic/Real||Samples (training)||Samples (validation)||Samples (test)|
|PASCAL VOC 2012 Segmentation ||Generic||2012||21||2D||Variable||✗||R||Private|
|PASCAL-Context ||Generic||2014||540 (59)||2D||Variable||✗||R||N/A|
|Microsoft COCO ||Generic||2014||+80||2D||Variable||✗||R|
|SYNTHIA ||Urban (Driving)||2016||11||2D||✗||S||N/A||N/A|
|Cityscapes (fine) ||Urban||2015||30 (8)||2D||✓||R|
|Cityscapes (coarse) ||Urban||2015||30 (8)||2D||✓||R||N/A|
|CamVid ||Urban (Driving)||2009||32||2D||✓||R||N/A||N/A|
|CamVid-Sturgess ||Urban (Driving)||2009||11||2D||✓||R|
|Stanford background ||Outdoor||2009||8||2D||✗||R||N/A||N/A|
|Adobe’s Portrait Segmentation ||Portrait||2016||2||2D||✗||R||N/A|
|RGB-D Object Dataset||Household objects||2011||51||2.5D||✓||R||N/A||N/A|
|3D Mesh ||Object/Part||2009||19||3D||N/A||✗||S||380||N/A||N/A|
|Sydney Urban Objects Dataset||Urban (Objects)||2013||26||3D||N/A||✗||R||N/A||N/A|
|Large-Scale Point Cloud Classification Benchmark||Urban/Nature||2016||8||3D||N/A||✗||R||N/A|
Throughout the years, semantic segmentation has been mostly focused on two-dimensional images. For that reason, 2D datasets are the most abundant ones. In this section we describe the most popular 2D large-scale datasets for semantic segmentation, considering 2D any dataset that contains any kind of two-dimensional representations such as gray-scale or Red Green Blue (RGB) images.
PASCAL Visual Object Classes (VOC)111http://host.robots.ox.ac.uk/pascal/VOC/voc2012/: this challenge consists of a ground-truth annotated dataset of images and five different competitions: classification, detection, segmentation, action classification, and person layout. The segmentation one is specially interesting since its goal is to predict the object class of each pixel for each test image. There are 21 classes categorized into vehicles, household, animals, and other: aeroplane, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa, TV/monitor, bird, cat, cow, dog, horse, sheep, and person. Background is also considered if the pixel does not belong to any of those classes. The dataset is divided into two subsets: training and validation with 1464 and 1449 images respectively. The test set is private for the challenge. This dataset is arguably the most popular for semantic segmentation so almost every remarkable method in the literature is being submitted to its performance evaluation server to validate against their private test set. Methods can be trained either using only the dataset or either using additional information. Furthermore, its leaderboard is public and can be consulted online222http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?challengeid=11&compid=6.
PASCAL Context333http://www.cs.stanford.edu/~roozbeh/pascal-context/: this dataset is an extension of the PASCAL VOC 2010 detection challenge which contains pixel-wise labels for all training images (). It contains a total of classes – including the original 20 classes plus background from PASCAL VOC segmentation – divided into three categories (objects, stuff, and hybrids). Despite the large number of categories, only the 59 most frequent are remarkable. Since its classes follow a power law distribution, there are many of them which are too sparse throughout the dataset. In this regard, this subset of 59 classes is usually selected to conduct studies on this dataset, relabeling the rest of them as background.
PASCAL Part444http://www.stat.ucla.edu/~xianjie.chen/pascal_part_dataset/pascal_part.html: this database is an extension of the PASCAL VOC 2010 detection challenge which goes beyond that task to provide per-pixel segmentation masks for each part of the objects (or at least silhouette annotation if the object does not have a consistent set of parts). The original classes of PASCAL VOC are kept, but their parts are introduced, e.g., bicycle is now decomposed into back wheel, chain wheel, front wheel, handlebar, headlight, and saddle. It contains labels for all training and validation images from PASCAL VOC as well as for the 9637 testing images.
Semantic Boundaries Dataset (SBD)555http://home.bharathh.info/home/sbd: this dataset is an extended version of the aforementioned PASCAL VOC which provides semantic segmentation ground truth for those images that were not labelled in VOC. It contains annotations for images from PASCAL VOC . Those annotations provide both category-level and instance-level information, apart from boundaries for each object. Since the images are obtained from the whole PASCAL VOC challenge (not only from the segmentation one), the training and validation splits diverge. In fact, SBD provides its own training ( images) and validation ( images) splits. Due to its increased amount of training data, this dataset is often used as a substitute for PASCAL VOC for deep learning.
Microsoft Common Objects in Context (COCO)666http://mscoco.org/: is another image recognition, segmentation, and captioning large-scale dataset. It features various challenges, being the detection one the most relevant for this field since one of its parts is focused on segmentation. That challenge, which features more than classes, provides more than images for training, for validation, and its test set consist of more than images. In particular, the test set is divided into four different subsets or splits: test-dev ( images) for additional validation, debugging, test-standard ( images) is the default test data for the competition and the one used to compare state-of-the-art methods, test-challenge ( images) is the split used for the challenge when submitting to the evaluation server, and test-reserve ( images) is a split used to protect against possible overfitting in the challenge (if a method is suspected to have made too many submissions or trained on the test data, its results will be compared with the reserve split). Its popularity and importance has ramped up since its appearance thanks to its large scale. In fact, the results of the challenge are presented yearly on a joint workshop at the European Conference on Computer Vision (ECCV)777http://image-net.org/challenges/ilsvrc+coco2016 together with ImageNet’s ones.
SYNTHetic Collection of Imagery and Annotations (SYNTHIA)888http://synthia-dataset.net/: is a large-scale collection of photo-realistic renderings of a virtual city, semantically segmented, whose purpose is scene understanding in the context of driving or urban scenarios.The dataset provides fine-grained pixel-level annotations for classes (void, sky, building, road, sidewalk, fence, vegetation, pole, car, sign, pedestrian, and cyclist). It features training images from rendered video streams. It is also characterized by its diversity in terms of scenes (towns, cities, highways), dynamic objects, seasons, and weather.
Cityscapes999https://www.cityscapes-dataset.com/: is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense pixel annotations for 30 classes grouped into 8 categories (flat surfaces, humans, vehicles, constructions, objects, nature, sky, and void). The dataset consist of around 5000 fine annotated images and 20000 coarse annotated ones. Data was captured in 50 cities during several months, daytimes, and good weather conditions. It was originally recorded as video so the frames were manually selected to have the following features: large number of dynamic objects, varying scene layout, and varying background.
CamVid101010http://mi.eng.cam.ac.uk/research/projects/VideoRec/CamVid/: is a road/driving scene understanding database which was originally captured as five video sequences with a resolution camera mounted on the dashboard of a car. Those sequences were sampled (four of them at 1 fps and one at 15 fps) adding up to frames. Those stills were manually annotated with classes: void, building, wall, tree, vegetation, fence, sidewalk, parking block, column/pole, traffic cone, bridge, sign, miscellaneous text, traffic light, sky, tunnel, archway, road, road shoulder, lane markings (driving), lane markings (non-driving), animal, pedestrian, child, cart luggage, bicyclist, motorcycle, car, SUV/pickup/truck, truck/bus, train, and other moving object. It is important to remark the partition introduced by Sturgess et al. which divided the dataset into training, validation, and testing images respectively. That partition makes use of a subset of class labels: building, tree, sky, car, sign, road, pedestrian, fence, pole, sidewalk, and bicyclist.
KITTI: is one of the most popular datasets for use in mobile robotics and autonomous driving. It consists of hours of traffic scenarios recorded with a variety of sensor modalities, including high-resolution RGB, grayscale stereo cameras, and a 3D laser scanner. Despite its popularity, the dataset itself does not contain ground truth for semantic segmentation. However, various researchers have manually annotated parts of the dataset to fit their necessities. Álvarez et al. generated ground truth for images from the road detection challenge with three classes: road, vertical, and sky. Zhang et al. annotated ( for training and for testing) acquisitions – RGB and Velodyne scans – from the tracking challenge for ten object categories: building, sky, road, vegetation, sidewalk, car, pedestrian, cyclist, sign/pole, and fence. Ros et al.  labeled training images and testing images (from the visual odometry challenge) with classes: building, tree, sky, car, sign, road, pedestrian, fence, pole, sidewalk, and bicyclist.
Youtube-Objects is a database of videos collected from YouTube which contain objects from ten PASCAL VOC classes: aeroplane, bird, boat, car, cat, cow, dog, horse, motorbike, and train. That database does not contain pixel-wise annotations but Jain et al. manually annotated a subset of sequences. They took every 10th frame from those sequences and generated semantic labels. That totals annotated frames at pixels resolution.
Adobe’s Portrait Segmentation111111http://xiaoyongshen.me/webpage_portrait/index.html: this is a dataset of pixels portrait images collected from Flickr, mainly captured with mobile front-facing cameras. The database consist of training images and reserved for testing, both sets are fully binary annotated: person or background. The images were labeled in a semi-automatic way: first a face detector was run on each image to crop them to pixels and then persons were manually annotated using Photoshop quick selection. This dataset is remarkable due to its specific purpose which makes it suitable for person in foreground segmentation applications.
Materials in Context (MINC): this work is a dataset for patch material classification and full scene material segmentation. The dataset provides segment annotations for categories: wood, painted, fabric, glass, metal, tile, sky, foliage, polished stone, carpet, leather, mirror, brick, water, other, plastic, skin, stone, ceramic, hair, food, paper, and wallpaper. It contains labeled material segmentations for training, for test, and for validation. The main source for these images is the OpenSurfaces dataset , which was augmented using other sources of imagery such as Flickr or Houzz. For that reason, image resolution for this dataset varies. On average, image resolution is approximately or .
Densely-Annotated VIdeo Segmentation (DAVIS)121212http://davischallenge.org/index.html: this challenge is purposed for video object segmentation. Its dataset is composed by high-definition sequences which add up to and frames for training and validation respectively. Frame resolution varies across sequences but all of them were downsampled to p for the challenge. Pixel-wise annotations are provided for each frame for four different categories: human, animal, vehicle, and object. Another feature from this dataset is the presence of at least one target foreground object in each sequence. In addition, it is designed not to have many different objects with significant motion. For those scenes which do have more than one target foreground object from the same class, they provide separated ground truth for each one of them to allow instance segmentation.
Stanford background131313http://dags.stanford.edu/data/iccv09Data.tar.gz: dataset with outdoor scene images imported from existing public datasets: LabelMe, MSRC, PASCAL VOC and Geometric Context. The dataset contains 715 images (size of pixels) with at least one foreground object and having the horizon position within the image. The dataset is pixel-wise annotated (horizon location, pixel semantic class, pixel geometric class and image region) for evaluating methods for semantic scene understanding.
SiftFlow : contains 2688 fully annotated images which are a subset of the LabelMe database . Most of the images are based on 8 different outdoor scenes including streets, mountains, fields, beaches and buildings. Images are belonging to one of the 33 semantic classes. Unlabeled pixels, or pixels labeled as a different semantic class are treated as unlabeled.
With the advent of low-cost range scanners, datasets including not only RGB information but also depth maps are gaining popularity and usage. In this section, we review the most well-known 2.5D databases which include that kind of depth data.
NYUDv2 141414http://cs.nyu.edu/~silberman/projects/indoor_scene_seg_sup.html: this database consists of indoor RGB-D images captured with a Microsoft Kinect device. It provides per-pixel dense labeling (category and instance levels) which were coalesced into 40 indoor object classes by Gupta et al. for both training ( images) and testing () splits. This dataset is specially remarkable due to its indoor nature, this makes it really useful for certain robotic tasks at home. However, its relatively small scale with regard to other existing datasets hinders its application for deep learning architectures.
SUN3D 151515http://sun3d.cs.princeton.edu/: similar to the NYUDv2, this dataset contains a large-scale RGB-D
video database, with 8 annotated sequences. Each frame has a semantic segmentation of the objects in the scene and information about the camera pose. It is still in progress and it will be composed by 415 sequences captured in 254 different spaces, in 41 different buildings. Moreover, some places have been captured multiple times at different moments of the day.
SUNRGBD 161616http://rgbd.cs.princeton.edu/: captured with four RGB-D sensors, this dataset contains 10000 RGB-D images, at a similar scale as PASCAL VOC. It contains images from NYU depth v2 , Berkeley B3DO , and SUN3D . The whole dataset is densely annotated, including polygons, bounding boxes with orientation as well as a 3D room layout and category, being suitable for scene understanding tasks.
The Object Segmentation Database (OSD) 171717http://www.acin.tuwien.ac.at/?id=289 this database has been designed for segmenting unknown objects from generic scenes even under partial occlusions. This dataset contains 111 entries, and provides depth image and color images together withper-pixel annotations for each one to evaluate object segmentation approaches. However, the dataset does not differentiate the category of different objects so its classes are reduced to a binary set of objects and not objects.
RGB-D Object Dataset181818http://rgbd-dataset.cs.washington.edu/: this dataset is composed by video sequences of 300 common household objects organized in 51 categories arranged using WordNet hypernym-hyponym relationships. The dataset has been recorded using a Kinect style 3D camera that records synchronized and aligned RGB and depth images at . For each frame, the dataset provides, the RGB-D and depth images, a cropped ones containing the object, the location and a mask with per-pixel annotation. Moreover, each object has been placed on a turntable, providing isolated video sequences around 360 degrees. For the validation process, 22 annotated video sequences of natural indoor scenes containing the objects are provided.
Pure three-dimensional databases are scarce, this kind of datasets usually provide Computer Aided Design (CAD) meshes or other volumetric representations, such as point clouds. Generating large-scale 3D datasets for segmentation is costly and difficult, and not many deep learning methods are able to process that kind of data as it is. For those reasons, 3D datasets are not quite popular at the moment. In spite of that fact, we describe the most promising ones for the task at hand.
ShapeNet Part191919http://cs.stanford.edu/~ericyi/project_page/part_annotation/: is a subset of the ShapeNet repository which focuses on fine-grained 3D object segmentation. It contains meshes sampled from categories of the original dataset (airplane, earphone, cap, motorbike, bag, mug, laptop, table, guitar, knife, rocket, lamp, chair, pistol, car, and skateboard). Each shape class is labeled with two to five parts (totalling object parts across the whole dataset), e.g., each shape from the airplane class is labeled with wings, body, tail, and engine. Ground-truth labels are provided on points sampled from the meshes.
Stanford 2D-3D-S202020http://buildingparser.stanford.edu: is a multi-modal and large-scale indoor spaces dataset extending the Stanford 3D Semantic Parsing work . It provides a variety of registered modalities – 2D (RGB), 2.5D (depth maps and surface normals), and 3D (meshes and point clouds) – with semantic annotations. The database is composed of full high-definition RGB images ( resolution) along with their corresponding depth maps, surface normals, meshes, and point clouds with semantic annotations (per-pixel and per-point). That data were captured in six indoor areas from three different educational and office buildings. That makes a total of rooms and approximately million points annotated with labels from categories: ceiling, floor, wall, column, beam, window, door, table, chair, bookcase, sofa, board, and clutter.
A Benchmark for 3D Mesh Segmentation212121http://segeval.cs.princeton.edu/: this benchmark is composed by 380 meshes classified in 19 categories (human, cup, glasses, airplane, ant, chair, octopus, table, teddy, hand, plier, fish, bird, armadillo, bust, mech, bearing, vase, fourleg). Each mesh has been manually segmented into functional parts, the main goal is to provide a sample distribution over ”how humans decompose each mesh into functional parts”.
Sydney Urban Objects Dataset222222http://www.acfr.usyd.edu.au/papers/SydneyUrbanObjectsDataset.shtml: this dataset contains a variety of common urban road objects scanned with a Velodyne HDK-64E LIDAR. There are 631 individual scans (point clouds) of objects across classes of vehicles, pedestrians, signs and trees. The interesting point of this dataset is that, for each object, apart from the individual scan, a full 360-degrees annotated scan is provided.
Large-Scale Point Cloud Classification Benchmark 232323http://www.semantic3d.net/: this benchmark provides manually annotated 3D point clouds of diverse natural and urban scenes: churches, streets, railroad tracks, squares, villages, soccer fields, castles among others. This dataset features statically captured point clouds with very fine details and density. It contains large-scale point clouds for training and another for testing. Its scale can be grasped by the fact that it totals more than one billion labelled points.
|Name and Reference||Architecture||Accuracy||Efficiency||Training||Instance||Sequences||Multi-modal||3D||Source Code||Contribution(s)|
|Fully Convolutional Network||VGG-16(FCN)||✗||✗||✗||✗||✓||Forerunner|
|SegNet||VGG-16 + Decoder||✗||✗||✗||✗||✓||Encoder-decoder|
|Bayesian SegNet||SegNet||✗||✗||✗||✗||✓||Uncertainty modeling|
|DeepLab||VGG-16/ResNet-101||✗||✗||✗||✗||✓||Standalone CRF, atrous convolutions|
|MINC-CNN ||GoogLeNet(FCN)||✗||✗||✗||✗||✓||Patchwise CNN, Standalone CRF|
|CRFasRNN||FCN-8s||✗||✗||✗||✗||✓||CRF reformulated as RNN|
|ENet||ENet bottleneck||✗||✗||✗||✗||✓||Bottleneck module for efficiency|
|Multi-scale-CNN-Eigen||Custom||✗||✗||✗||✗||✓||Multi-scale sequential refinement|
|Multi-scale-CNN-Roy||Multi-scale-CNN-Eigen||✗||✗||✗||✗||Multi-scale coarse-to-fine refinement|
|Multi-scale-CNN-Bian||FCN||✗||✗||✗||✗||✗||Independently trained multi-scale FCNs|
|ParseNet||VGG-16||✗||✗||✗||✗||✓||Global context feature fusion|
|ReSeg||VGG-16 + ReNet||✗||✗||✗||✗||✓||Extension of ReNet to semantic segmentation|
Fast R-CNN + DeepMask
|✗||✗||✗||✗||✓||Fusion of contextual information from multiple sources|
|2D-LSTM||MDRNN||✗||✗||✗||✗||✗||Image context modelling|
|rCNN||MDRNN||✗||✗||✗||✗||✓||Different input sizes, image context|
|DAG-RNN||Elman network||✗||✗||✗||✗||✓||Graph image structure for context modelling|
|SDS||R-CNN + Box CNN||✗||✗||✗||✓||Simultaneous detection and segmentation|
|DeepMask||VGG-A||✗||✗||✗||✓||Proposals generation for segmentation|
|SharpMask||DeepMask||✗||✗||✗||✓||Top-down refinement module|
|MultiPathNet||Fast R-CNN + DeepMask||✗||✗||✗||✓||Multi path information flow through network|
|Huang-3DCNN||Own 3DCNN||✗||✗||✗||✗||3DCNN for voxelized point clouds|
|PointNet||Own MLP-based||✗||✗||✗||✓||Segmentation of unordered point sets|
|Clockwork Convnet||FCN||✗||✗||✗||✓||Clockwork scheduling for sequences|
|3DCNN-Zhang||Own 3DCNN||✗||✗||✗||✓||3D convolutions and graph cut for sequences|
|End2End Vox2Vox||C3D||✗||✗||✗||✗||3D convolutions/deconvolutions for sequences|
The relentless success of deep learning techniques in various high-level computer vision tasks – in particular, supervised approaches such as Convolutional Neural Networks (CNNs) for image classification or object detection  – motivated researchers to explore the capabilities of such networks for pixel-level labelling problems like semantic segmentation. The key advantage of these deep learning techniques, which gives them an edge over traditional methods, is the ability to learn appropriate feature representations for the problem at hand, e.g., pixel labelling on a particular dataset, in an end-to-end fashion instead of using hand-crafted features that require domain expertise, effort, and often too much fine-tuning to make them work on a particular scenario.
Currently, the most successful state-of-the-art deep learning techniques for semantic segmentation stem from a common forerunner: the Fully Convolutional Network (FCN) by Long et al.. The insight of that approach was to take advantage of existing CNNs as powerful visual models that are able to learn hierarchies of features. They transformed those existing and well-known classification models – AlexNet, VGG (16-layer net), GoogLeNet, and ResNet 
– into fully convolutional ones by replacing the fully connected layers with convolutional ones to output spatial maps instead of classification scores. Those maps are upsampled using fractionally strided convolutions (also named deconvolutions) to produce dense per-pixel labeled outputs. This work is considered a milestone since it showed how CNNs can be trained end-to-end for this problem, efficiently learning how to make dense predictions for semantic segmentation with inputs of arbitrary sizes. This approach achieved a significant improvement in segmentation accuracy over traditional methods on standard datasets like PASCAL VOC, while preserving efficiency at inference. For all those reasons, and other significant contributions, the FCN is the cornerstone of deep learning applied to semantic segmentation. The convolutionalization process is shown in Figure 7.
Despite the power and flexibility of the FCN model, it still lacks various features which hinder its application to certain problems and situations: its inherent spatial invariance does not take into account useful global context information, no instance-awareness is present by default, efficiency is still far from real-time execution at high resolutions, and it is not completely suited for unstructured data such as 3D point clouds or models. Those problems will be reviewed in this section, as well as the state-of-the-art solutions that have been proposed in the literature to overcome those hurdles. Table II provides a summary of that review. It shows all reviewed methods (sorted by appearance order in the section), their base architecture, their main contribution, and a classification depending on the target of the work: accuracy, efficiency, training simplicity, sequence processing, multi-modal inputs, and 3D data. Each target is graded from one to three stars () depending on how much focus puts the work on it, and a mark (✗) if that issue is not addressed. In addition, Figure 8 shows a graph of the reviewed methods for the sake of visualization.
Apart from the FCN architecture, other variants were developed to transform a network whose purpose was classification to make it suitable for segmentation. Arguably, FCN-based architectures are more popular and successful, but other alternatives are also remarkable. In general terms, all of them take a network for classification, such as VGG-16, and remove its fully connected layers. This part of the new segmentation network often receives the name of encoder and produce low-resolution image representations or feature maps. The problem lies on learning to decode or map those low-resolution images to pixel-wise predictions for segmentation. This part is named decoder and it is usually the divergence point in this kind of architectures.
SegNet is a clear example of this divergence (see Figure 9). The decoder stage of SegNet is composed by a set of upsampling and convolution layers which are at last followed by a softmax classifier to predict pixel-wise labels for an output which has the same resolution as the input image. Each upsampling layer in the decoder stage corresponds to a max-pooling one in the encoder part. Those layers upsample feature maps using the max-pooling indices from their corresponding feature maps in the encoder phase. The upsampled maps are then convolved with a set of trainable filter banks to produce dense feature maps. When the feature maps have been restored to the original resolution, they are fed to the softmax classifier to produce the final segmentation.
On the other hand, FCN-based architectures make use of learnable deconvolution filters to upsample feature maps. After that, the upsampled feature maps are added element-wise to the corresponding feature map generated by the convolution layer in the encoder part. Figure 10 shows a comparison of both approaches.
Semantic segmentation is a problem that requires the integration of information from various spatial scales. It also implies balancing local and global information. On the one hand, fine-grained or local information is crucial to achieve good pixel-level accuracy. On the other hand, it is also important to integrate information from the global context of the image to be able to resolve local ambiguities.
Vanilla CNNs struggle with this balance. Pooling layers, which allow the networks to achieve some degree of spatial invariance and keep computational cost at bay, dispose of the global context information. Even purely CNNs – without pooling layers – are limited since the receptive field of their units can only grow linearly with the number of layers.
Many approaches can be taken to make CNNs aware of that global information: refinement as a post-processing step with Conditional Random Fields, dilated convolutions, multi-scale aggregation, or even defer the context modeling to another kind of deep networks such as RNNs.
As we mentioned before, the inherent invariance to spatial transformations of CNN architectures limits the very same spatial accuracy for segmentation tasks. One possible and common approach to refine the output of a segmentation system and boost its ability to capture fine-grained details is to apply a post-processing stage using a Conditional Random Field (CRF). CRFs enable the combination of low-level image information – such as the interactions between pixels  – with the output of multi-class inference systems that produce per-pixel class scores. That combination is especially important to capture long-range dependencies, which CNNs fail to consider, and fine local details.
The DeepLab models  make use of the fully connected pairwise CRF by Krähenbühl and Koltun as a separated post-processing step in their pipeline to refine the segmentation result. It models each pixel as a node in the field and employs one pairwise term for each pair of pixels no matter how far they lie (this model is known as dense or fully connected factor graph). By using this model, both short and long-range interactions are taken into account, rendering the system able to recover detailed structures in the segmentation that were lost due to the spatial invariance of the CNN. Despite the fact that usually fully connected models are inefficient, this model can be efficiently approximated via probabilistic inference. Figure 11 shows the effect of this CRF-based post-processing on the score and belief maps produced by the DeepLab model.
The material recognition in the wild network by Bell et al. makes use of various CNNs trained to identify patches in the MINC database. Those CNNs are used on a sliding window fashion to classify those patches. Their weights are transferred to the same networks converted into FCNs
by adding the corresponding upsampling layers. The outputs are averaged to generate a probability map. At last, the sameCRF from DeepLab, but discretely optimized, is applied to predict and refine the material at every pixel.
Another significant work applying a CRF to refine the segmentation of a FCN is the CRFasRNN by Zheng et al.. The main contribution of that work is the reformulation of the dense CRF with pairwise potentials as an integral part of the network. By unrolling the mean-field inference steps as RNNs, they make it possible to fully integrate the CRF with a FCN and train the whole network end-to-end. This work demonstrates the reformulation of CRFs as RNNs to form a part of a deep network, in contrast with Pinheiro et al.  which employed RNNs to model large spatial dependencies.
Dilated convolutions, also named à-trous convolutions, are a generalization of Kronecker-factored convolutional filters  which support exponentially expanding receptive fields without losing resolution. In other words, dilated convolutions are regular ones that make use of upsampled filters. The dilation rate controls that upsampling factor. As shown in Figure 12, stacking
-dilated convolution makes the receptive fields grow exponentially while the number of parameters for the filters keeps a linear growth. This means that dilated convolutions allow efficient dense feature extraction on any arbitrary resolution. As a side note, it is important to remark that typical convolutions are just-dilated convolutions.
In practice, it is equivalent to dilating the filter before doing the usual convolution. That means expanding its size, according to the dilation rate, while filling the empty elements with zeros. In other words, the filter weights are matched to distant elements which are not adjacent if the dilation rate is greater than one. Figure 13 shows examples of dilated filters.
The most important works that make use of dilated convolutions are the multi-scale context aggregation module by Yu et al., the already mentioned DeepLab (its improved version), and the real-time network ENet. All of them use combinations of dilated convolutions with increasing dilation rates to have wider receptive fields with no additional cost and without overly downsampling the feature maps. Those works also show a common trend: dilated convolutions are tightly coupled to multi-scale context aggregation as we will explain in the following section.
Another possible way to deal with context knowledge integration is the use of multi-scale predictions. Almost every single parameter of a CNN affects the scale of the generated feature maps. In other words, the very same architecture will have an impact on the number of pixels of the input image which correspond to a pixel of the feature map. This means that the filters will implicitly learn to detect features at specific scales (presumably with certain invariance degree). Furthermore, those parameters are usually tightly coupled to the problem at hand, making it difficult for the models to generalize to different scales. One possible way to overcome that obstacle is to use multi-scale networks which generally make use of multiple networks that target different scales and then merge the predictions to produce a single output.
Raj et al. propose a multi-scale version of a fully convolutional VGG-16. That network has two paths, one that processes the input at the original resolution and another one which doubles it. The first path goes through a shallow convolutional network. The second one goes through the fully convolutional VGG-16 and an extra convolutional layer. The result of that second path is upsampled and combined with the result of the first path. That concatenated output then goes through another set of convolutional layers to generate the final output. As a result, the network becomes more robust to scale variations.
Roy et al. take a different approach using a network composed by four multi-scale CNNs. Those four networks have the same architecture introduced by Eigen et al. . One of those networks is devoted to finding semantic labels for the scene. That network extracts features from a progressively coarse-to-fine sequence of scales (see Figure 14).
Another remarkable work is the network proposed by Bian et al.. That network is a composition of FCNs
which operate at different scales. The features extracted from the networks are fused together (after the necessary upsampling with an appropriate padding) and then they go through an additional convolutional layer to produce the final segmentation. The main contribution of this architecture is the two-stage learning process which involves, first, training each network independently, then the networks are combined and the last layer is fine-tuned. This multi-scale model allows to add an arbitrary number of newly trained networks in an efficient manner.
Another way of adding context information to a fully convolutional architecture for segmentation is feature fusion. This technique consists of merging a global feature (extracted from a previous layer in a network) with a more local feature map extracted from a subsequent layer. Common architectures such as the original FCN make use of skip connections to perform a late fusion by combining the feature maps extracted from different layers (see Figure 15).
Another approach is performing early fusion. This approach is taken by ParseNet in their context module. The global feature is unpooled to the same spatial size as the local feature and then they are concatenated to generate a combined feature that is used in the next layer or to learn a classifier. Figure 16 shows a representation of that process.
This feature fusion idea was continued by Pinheiro et al. in their SharpMask network , which introduced a progressive refinement module to incorporate features from the previous layer to the next in a top-down architecture. This work will be reviewed later since it is mainly focused on instance segmentation.
As we noticed, CNNs have been successfully applied to multi-dimensional data, such as images. Nevertheless, these networks rely on hand specified kernels limiting the architecture to local contexts. Taking advantage of its topological structure, Recurrent Neural Networks have been successfully applied for modeling short- and long-temporal sequences. In this way and by linking together pixel-level and local information, RNNs are able to successfully model global contexts and improve semantic segmentation. However, one important issue is the lack of a natural sequential structure in images and the focus of standard vanilla RNNs architectures on one-dimensional inputs.
Based on ReNet model for image classification Visin et al. proposed an architecture for semantic segmentation called ReSeg  represented in Figure 17. In this approach, the input image is processed with the first layers of the VGG-16 network , feeding the resulting feature maps into one or more ReNet layers for fine-tuning. Finally, feature maps are resized using upsampling layers based on transposed convolutions. In this approach Gated Recurrent Units have been used as they strike a good performance balance regarding memory usage and computational power. Vanilla RNNs have problems modeling long-term dependencies mainly due to the vanishing gradients problem. Several derived models such as Long Short-Term Memory (LSTM) networks  and GRUs  are the state-of-art in this field to avoid such problem.
Inspired on the same ReNet architecture, a novel Long Short-Term Memorized Context Fusion (LSTM-CF) model for scene labeling was proposed by . In this approach, they use two different data sources: RGB and depth. The RGB pipeline relies on a variant of the DeepLab architecture  concatenating features at three different scales to enrich feature representation (inspired by ). The global context is modeled vertically over both, depth and photometric data sources, concluding with a horizontal fusion in both direction over these vertical contexts.
As we noticed, modeling image global contexts is related to 2D recurrent approaches by unfolding vertically and horizontally the network over the input images. Based on the same idea, Byeon et al.  purposed a simple 2D LSTM-based architecture in which the input image is divided into non-overlapping windows which are fed into four separate LSTMs memory blocks. This work emphasizes its low computational complexity on a single-core CPU and the model simplicity.
Another approach for capturing global information relies on using bigger input windows in order to model larger contexts. Nevertheless, this reduces images resolution and also implies several problems regarding to window overlapping. However, Pinheiro et al.  introduced Recurrent Convolutional Neural Networks which recurrently train with different input window sizes taking into account previous predictions by using a different input window sizes. In this way, predicted labels are automatically smoothed increasing the performance.
Undirected cyclic graphs (UCGs) were also adopted to model image contexts for semantic segmentation . Nevertheless, RNNs are not directly applicable to UCG and the solution is decomposing it into several directed graphs (DAGs). In this approach, images are processed by three different layers: image feature map produced by CNN, model image contextual dependencies with DAG-RNNs, and deconvolution layer for upsampling feature maps. This work demonstrates how RNNs can be used together with graphs to successfully model long-range contextual dependencies, overcoming state-of-the-art approaches in terms of performance.
Instance segmentation is considered the next step after semantic segmentation and at the same time the most challenging problem in comparison with the rest of low-level pixel segmentation techniques. Its main purpose is to represent objects of the same class splitted into different instances. The automation of this process is not straightforward, thus the number of instances is initially unknown and the evaluation of performed predictions is not pixel-wise such as in semantic segmentation. Consequently, this problem remains partially unsolved but the interest in this field is motivated by its potential applicability. Instance labeling provides us extra information for reasoning about occlusion situations, also counting the number of elements belonging to the same class and for detecting a particular object for grasping in robotics tasks, among many other applications.
For this purpose, Hariharan et al.  proposed a Simultaneous Detection and Segmentation (SDS) method in order to improve performance over already existing works. Their pipeline uses, firstly, a bottom-up hierarchical image segmentation and object candidate generation process called Multi-scale COmbinatorial Grouping (MCG)  to obtain region proposals. For each region, features are extracted by using an adapted version of the Region-CNN (R-CNN) , which is fine-tuned using bounding boxes provided by the MCG method instead of selective search and also alongside region foreground features. Then, each region proposal is classified by using a linear Support Vector Machine (SVM) on top of the CNN features. Finally, and for refinement purposes, Non-Maximum Suppression (NMS) is applied to the previous proposals.
Later, Pinheiro et al.  presented DeepMask model, an object proposal approach based on a single ConvNet. This model predicts a segmentation mask for an input patch and the likelihood of this patch for containing an object. The two tasks are learned jointly and computed by a single network, sharing most of the layers except last ones which are task-specific.
Based on the DeepMask architecture as a starting point due to its effectiveness, the same authors presented a novel architecture for object instance segmentation implementing a top-down refinement process  and achieving a better performance in terms of accuracy and speed. The goal of this process is to efficiently merge low-level features with high-level semantic information from upper network layers. The process consisted in different refinement modules stacked together (one module per pooling layer), with the purpose of inverting pooling effect by generating a new upsampled object encoding. Figure 18 shows the refinement module in SharpMask.
Another approach, based on Fast R-CNN as a starting point and using DeepMask object proposals instead of Selective Search was presented by Zagoruyko et al . This combined system called MultiPath classifier, improved performance over COCO dataset and supposed three modifications to Fast R-CNN: improving localization with an integral loss, provide context by using foveal regions and finally skip connections to give multi-scale features to the network. The system achieved a 66% improvement over the baseline Fast R-CNN.
As we have seen, most of the methods mentioned above rely on existing object detectors limiting in this way model performance. Even so, instance segmentation process remains an unresolved research problem and the mentioned works are only a small part of this challenging research topic.
As we noticed, a significant amount of work has been done in semantic segmentation by using photometric data. Nevertheless, the use of structural information was spurred on with the advent of low-cost RGB-D sensors which provide useful geometric cues extracted from depth information. Several works focused on RGB-D scene segmentation have reported an improvement in the fine-grained labeling precision by using depth information and not only photometric data. Using depth information for segmentation is considered more challenging because of the unpredictable variation of scene illumination alongside incomplete representation of objects due to complex occlusions. However, various works have successfully made use of depth information to increase accuracy.
The use of depth images with approaches focused on photometric data is not straightforward. Depth data needs to be encoded with three channels at each pixel as if it was an RGB images. Different techniques such as Horizontal Height Angle (HHA)  are used for encoding the depth into three channels as follows: horizontal disparity, height above ground, and the angle between local surface normal and the inferred gravity direction. In this way, we can input depth images to models designed for RGB data and improve in this way the performance by learning new features from structural information. Several works such as  are based on this encoding technique.
In the literature, related to methods that use RGB-D data, we can also find some works that leverage a multi-view approach to improve existing single-view works.
Zeng et al. present an object segmentation approach that leverages multi-view RGB-D data and deep learning techniques. RGB-D images captured from each viewpoint are fed to a FCN
network which returns a 40-class probability for each pixel in each image. Segmentation labels are threshold by using three times the standard deviation above the mean probability across all views. Moreover, in this work, multiple networks for feature extraction were trained (AlexNet and VGG-16 ), evaluating the benefits of using depth information. They found that adding depth did not yield any major improvements in segmentation performance, which could be caused by noise in the depth information. The described approach was presented during the 2016 Amazon Picking Challenge. This work is a minor contribution towards multi-view deep learning systems since RGB images are independently fed to a FCN network.
Ma et al. propose a novel approach for object-class segmentation using a multi-view deep learning technique. Multiple views are obtained from a moving RGB-D camera. During the training stage, camera trajectory is obtained using an RGB-D SLAM technique, then RGB-D images are warped into ground-truth annotated frames in order to enforce multi-view consistency for training. The proposed approach is based on FuseNet, which combines RGB and depth images for semantic segmentation, and improves the original work by adding multi-scale loss minimization.
3D geometric data such as point clouds or polygonal meshes are useful representations thanks to their additional dimension which provides methods with rich spatial information that is intuitively useful for segmentation. However, the vast majority of successful deep learning segmentation architectures – CNNs in particular – are not originally engineered to deal with unstructured or irregular inputs such as the aforementioned ones. In order to enable weight sharing and other optimizations in convolutional architectures, most researchers have resorted to 3D voxel grids or projections to transform unstructured and unordered point clouds or meshes into regular representations before feeding them to the networks. For instance, Huang et al. (see Figure 19 take a point cloud and parse it through a dense voxel grid, generating a set of occupancy voxels which are used as input to a 3D CNN to produce one label per voxel. They then map back the labels to the point cloud. Although this approach has been applied successfully, it has some disadvantages like quantization, loss of spatial information, and unnecessarily large representations. For that reason, various researchers have focused their efforts on creating deep architectures that are able to directly consume unstructured 3D point sets or meshes.
PointNet is a pioneering work which presents a deep neural network that takes raw point clouds as input, providing a unified architecture for both classification and segmentation. Figure 20 shows that two-part network which is able to consume unordered point sets in 3D.
As we can observe, PointNet is a deep network architecture that stands out of the crowd due to the fact that it is based on fully connected layers instead of convolutional ones. The architecture features two subnetworks: one for classification and another for segmentation. The classification subnetwork takes a point cloud and applies a set of transforms and Multi Layer Perceptrons
Multi Layer Perceptronsto generate features which are then aggregated using max-pooling to generate a global feature which describes the original input cloud. That global feature is classified by another MLP to produce output scores for each class. The segmentation subnetwork concatenates the global feature with the per-point features extracted by the classification network and applies another two MLPs to generate features and produce output scores for each point.
As we have observed, there has been a significant progress in single-image segmentation. However, when dealing with image sequences, many systems rely on the naïve application of the very same algorithms in a frame-by-frame manner. This approach works, often producing remarkable results. Nevertheless, applying those methods frame by frame is usually non-viable due to computational cost. In addition, those methods completely ignore temporal continuity and coherence cues which might help increase the accuracy of the system while reducing its execution time.
Arguably, the most remarkable work in this regard is the clockwork FCN by Shelhamer et al.. This network is an adaptation of a FCN
to make use of temporal cues in video to decrease inference time while preserving accuracy. The clockwork approach relies on the following insight: feature velocity – the temporal rate of change of features in the network – across frames varies from layer to layer so that features from shallow layers change faster than deep ones. Under that assumption, layers can be grouped into stages, processing them at different update rates depending on their depth. By doing this, deep features can be persisted over frames thanks to their semantic stability, thus saving inference time. Figure21 shows the network architecture of the clockwork FCN.
It is important to remark that the authors propose two kinds of update rates: fixed and adaptive. The fixed schedule just sets a constant time frame for recomputing the features for each stage of the network. The adaptive schedule fires each clock on a data-driven manner, e.g., depending on the amount of motion or semantic change. Figure 22 shows an example of this adaptive scheduling.
Zhang et al. took a different approach and made use of a 3DCNN, which was originally created for learning features from volumes, to learn hierarchical spatio-temporal features from multi-channel inputs such as video clips. In parallel, they over-segment the input clip into supervoxels. Then they use that supervoxel graph and embed the learned features in it. The final segmentation is obtained by applying graph-cut on the supervoxel graph.
Another remarkable method, which builds on the idea of using 3D convolutions, is the deep end-to-end voxel-to-voxel prediction system by Tran et al.. In that work, they make use of the Convolutional 3D (C3D) network introduced by themselves on a previous work , and extend it for semantic segmentation by adding deconvolutional layers at the end. Their system works by splitting the input into clips of frames, performing predictions for each clip separately. Its main contribution is the use of 3D convolutions. Those convolutions make use of three-dimensional filters which are suitable for spatio-temporal feature learning across multiple channels, in this case frames. Figure 23 shows the difference between 2D and 3D convolutions applied to multi-channel inputs, proving the usefulness of the 3D ones for video segmentation.
In the previous section we reviewed the existing methods from a literary and qualitative point of view, i.e., we did not take any quantitative result into account. In this Section we are going to discuss the very same methods from a numeric standpoint. First of all, we will describe the most popular evaluation metrics that can be used to measure the performance of semantic segmentation systems from three aspects: execution time, memory footprint, and accuracy. Next, we will gather the results of the methods on the most representative datasets using the previously described metrics. After that, we will summarize and draw conclusions about those results. At last, we enumerate possible future research lines that we consider significant for the field.
For a segmentation system to be useful and actually produce a significant contribution to the field, its performance must be evaluated with rigor. In addition, that evaluation must be performed using standard and well-known metrics that enable fair comparisons with existing methods. Furthermore, many aspects must be evaluated to assert the validity and usefulness of a system: execution time, memory footprint, and accuracy. Depending on the purpose or the context of the system, some metrics might be of more importance than others, i.e., accuracy may be expendable up to a certain point in favor of execution speed for a real-time application. Nevertheless, for the sake of scientific rigor it is of utmost importance to provide all the possible metrics for a proposed method.
Speed or runtime is an extremely valuable metric since the vast majority of systems must meet hard requirements on how much time can they spend on the inference pass. In some cases it might be useful to know the time needed for training the system, but it is usually not that significant, unless it is exaggeratedly slow, since it is an offline process. In any case, providing exact timings for the methods can be seen as meaningless since they are extremely dependant on the hardware and the backend implementation, rendering some comparisons pointless.
However, for the sake of reproducibility and in order to help fellow researchers, it is useful to provide timings with a thorough description of the hardware in which the system was executed on, as well as the conditions for the benchmark. If done properly, that can help others estimate if the method is useful or not for the application as well as perform fair comparisons under the same conditions to check which are the fastest methods.
Memory usage is another important factor for segmentation methods. Although it is arguably less constraining than execution time – scaling memory capacity is usually feasible – it can also be a limiting element. In some situations, such as onboard chips for robotic platforms, memory is not as abundant as in a high-performance server. Even high-end Graphics Processing Units, which are commonly used to accelerate deep networks, do not pack a copious amount of memory. In this regard, and considering the same implementation-dependent aspects as with runtime, documenting the peak and average memory footprint of a method with a complete description of the execution conditions can be extraordinarily helpful.
Many evaluation criteria have been proposed and are frequently used to assess the accuracy of any kind of technique for semantic segmentation. Those metrics are usually variations on pixel accuracy and IoU. We report the most popular metrics for semantic segmentation that are currently used to measure how per-pixel labeling methods perform on this task. For the sake of the explanation, we remark the following notation details: we assume a total of classes (from to including a void class or background) and is the amount of pixels of class inferred to belong to class . In other words, represents the number of true positives, while and are usually interpreted as false positives and false negatives respectively (although either of them can be the sum of both false positives and false negatives)..
Pixel Accuracy (PA): it is the simplest metric, simply computing a ratio between the amount of properly classified pixels and the total number of them.
Mean Pixel Accuracy (MPA): a slightly improved PA in which the ratio of correct pixels is computed in a per-class basis and then averaged over the total number of classes.
Mean Intersection over Union (MIoU): this is the standard metric for segmentation purposes. It computes a ratio between the intersection and the union of two sets, in our case the ground truth and our predicted segmentation. That ratio can be reformulated as the number of true positives (intersection) over the sum of true positives, false negatives, and false positives (union). That IoU is computed on a per-class basis and then averaged.
Frequency Weighted Intersection over Union (FWIoU): it is an improved over the raw MIoU which weights each class importance depending on their appearance frequency.
Of all metrics described above, the MIoU stands out of the crowd as the most used metric due to its representativeness and simplicity. Most challenges and researchers make use of that metric to report their results.
As we stated before, Section 4 provided a functional description of the reviewed methods according to their targets. Now we gathered all the quantitative results for those methods as stated by their authors in their corresponding papers. These results are organized into three parts depending on the input data used by the methods: 2D RGB or 2.5D RGB-D images, volumetric 3D, or video sequences.
The most used datasets have been selected for that purpose. It is important to remark the heterogeneity of the papers in the field when reporting results. Although most of them try to evaluate their methods in standard datasets and provide enough information to reproduce their results, also expressed in widely known metrics, many others fail to do so. That leads to a situation in which it is hard or even impossible to fairly compare methods.
Furthermore, we also came across the fact few authors provide information about other metrics rather than accuracy. Despite the importance of other metrics, most of the papers do not include any data about execution time nor memory footprint. In some cases that information is provided, but no reproducibility information is given so it is impossible to know the setup that produced those results which are of no use.
For the single 2D image category we have selected seven datasets: PASCAL VOC2012, PASCAL Context, PASCAL Person-Part, CamVid, CityScapes, Stanford Background, and SiftFlow. That selection accounts for a wide range of situations and targets.
The first, and arguably the most important dataset, in which the vast majority of methods are evaluated is PASCAL VOC-2012. Table III shows the results of those reviewed methods which provide accuracy results on the PASCAL VOC-2012 test set. This set of results shows a clear improvement trend from the firs proposed methods (SegNet and the original FCN) to the most complex models such as CRFasRNN and the winner (DeepLab) with IoU.
Apart from the widely known VOC we also collected metrics of its Context counterpart. Table IV shows those results in which DeepLab is again the top scorer ( IoU).
In addition, we also took into account the PASCAL Part dataset, whose results are shown in Table V. In this case, the only analyzed method that provided metrics for this dataset is DeepLab which achieved a IoU.
Moving from a general-purpose dataset such as PASCAL VOC, we also gathered results for two of the most important urban driving databases. Table VI shows the results of those methods which provide accuracy metrics for the CamVid dataset. In this case, an RNN-based approach (DAG-RNN) is the top one with a IoU.
Table VII shows the results on a more challenging and currently more in use database: CityScapes. The trend on this dataset is similar to the one with PASCAL VOC with DeepLab leading with a IoU.
Table VIII shows the results of various recurrent networks on the Stanford Background dataset. The winner, rCNN, achieves a maximum accuracy of IoU.
At last, results for another popular dataset such as SiftFlow are shown in Table IX. This dataset is also dominated by recurrent methods. In particular DAG-RNN is the top scorer with IoU.
Regarding the 2.5D category, i.e., datasets which also include depth information apart from the typical RGB channels, we have selected three of them for the analysis: SUN-RGB-D and NYUDv2. Table X shows the results for SUN-RGB-D that are only provided by LSTM-CF, which achieves IoU.
Table XI shows the results for NYUDv2 which are exclusive too for LSTM-CF. That method reaches IoU.
At last, Table XII gathers results for the last 2.5D dataset: SUN-3D. Again, LSTM-CF is the only one which provides information for that database, in this case a accuracy.
Two 3D datasets have been chosen for this discussion: ShapeNet Part and Stanford-2D-3D-S. In both cases, only one of the analyzed methods actually scored on them. It is the case of PointNet which achieved and IoU on ShapeNet Part (Table XIII) and Stanford-2D-3D-S (Table XIV) respectively.
The last category included in this discussion is video or sequences. For that part we gathered results for two datasets which are suitable for sequence segmentation: CityScapes and YouTube-Objects. Only one of the reviewed methods for video segmentation provides quantitative results on those datasets: Clockwork Convnet. That method reaches IoU on CityScapes (Table XV), and on YouTube-Objects (Table XVI).
In light of the results, we can draw various conclusions. The most important of them is related to reproducibility. As we have observed, many methods report results on non-standard datasets or they are not even tested at all. That makes comparisons impossible. Furthermore, some of them do not describe the setup for the experimentation or do not provide the source code for the implementation, thus significantly hurting reproducibility. Methods should report their results on standard datasets, exhaustively describe the training procedure, and also make their models and weights publicly available to enable progress.
Another important fact discovered thanks to this study is the lack of information about other metrics such as execution time and memory footprint. Almost no paper reports this kind of information, and those who do suffer from the reproducibility issues mentioned before. This void is due to the fact that most methods focus on accuracy without any concern about time or space. However, it is important to think about where are those methods being applied. In practice, most of them will end up running on embedded devices, e.g., self-driving cars, drones, or robots, which are fairly limited from both sides: computational power and memory.
Regarding the results themselves, we can conclude that DeepLab is the most solid method which outperforms the rest on almost every single RGB images dataset by a significant margin. The 2.5D or multimodal datasets are dominated by recurrent networks such as LSTM-CF. 3D data segmentation still has a long way to go with PointNet paving the way for future research on dealing with unordered point clouds without any kind of preprocessing or discretization. Finally, dealing with video sequences is another green area with no clear direction, but Clockwork Convnets are the most promising approach thanks to their efficiency and accuracy duality. 3D convolutions are worth remarking due to their power and flexibility to process multichannel inputs, making them successful at capturing both spatial and temporal information.
Based on the reviewed research, which marks the state of the art of the field, we present a list of future research directions that would be interesting to pursue.
3D datasets: methods that make full use of 3D information are starting to rise but, even if new proposals and techniques are engineered, they still lack one of the most important components: data. There is a strong need for large-scale datasets for 3D semantic segmentation, which are harder to create than their lower dimensional counterparts. Although there are already some promising works, there is still room for more, better, and varied data. It is important to remark the importance of real-world 3D data since most of the already existing works are synthetic databases. A proof of the importance of 3D is the fact that the ILSVRC will feature 3D data in 2018.
Sequence datasets: the same lack of large-scale data that hinders progress on 3D segmentation also impacts video segmentation. There are only a few datasets that are sequence-based and thus helpful for developing methods which take advantage of temporal information. Bringing up more high-quality data from this nature, either 2D or 3D, will unlock new research lines without any doubt.
Point cloud segmentation using Graph Convolutional Networks: as we already mentioned, dealing with 3D data such as point clouds poses an unsolved challenge. Due to its unordered and unstructured nature, traditional architectures such as CNNs cannot be applied unless some sort of discretization process is applied to structure it. One promising line of research aims to treat point clouds as graphs and apply convolutions over them   . This has the advantage of preserving spatial cues in every dimension without quantizing data.
Context knowledge: while FCNs are a consolidated approach for semantic segmentation, they lack several features such as context modelling that help increasing accuracy. The reformulation of CRFs as RNNs to create end-to-end solutions seems to be a promising direction to improve results on real-life data. Multi-scale and feature fusion approaches have also shown remarkable progress. In general, all those works represent important steps towards achieving the ultimate goal, but there are some problems that still require more research.
Real-time segmentation: In many applications, precision is important; however, it is also crucial that these implementations are able to cope with common camera frame rates (at least 25 frames per second). Most of the current methods are far from that framerate, e.g., FCN-8s takes roughly ms to process a low-resolution PASCAL VOC image whilst CRFasRNN needs more than ms. Therefore, during the next years, we expect a stream of works coming out, focusing more on real-time constraints. These future works will have to find a trade-off between accuracy and runtime.
Memory: some platforms are bounded by hard memory constraints. Segmentation networks usually do need significant amounts of memory to be executed for both inference and training. In order to fit them in some devices, networks must be simplified. While this can be easily accomplished by reducing their complexity (often trading it for accuracy), another approaches can be taken. Pruning is a promising research line that aims to simplify a network, making it lightweight while keeping the knowledge, and thus the accuracy, of the original network architecture .
Temporal coherency on sequences: some methods have addressed video or sequence segmentation but either taking advantage of that temporal cues to increase accuracy or efficiency. However, none of them have explicitly tackled the coherency problem. For a segmentation system to work on video streams it is important, not only to produce good results frame by frame, but also make them coherent through the whole clip without producing artifacts by smoothing predicted per-pixel labels along the sequence.
Multi-view integration: Use of multiple views in recently proposed segmentation works is mostly limited to RGB-D cameras and in particular focused on single-object segmentation.
To the best of our knowledge, this is the first review paper in the literature which focuses on semantic segmentation using deep learning. In comparison with other surveys, this paper is devoted to such a rising topic as deep learning, covering the most advanced and recent work on that front. We formulated the semantic segmentation problem and provided the reader with the necessary background knowledge about deep learning for such task. We covered the contemporary literature of datasets and methods, providing a comprehensive survey of datasets and methods. Datasets were carefully described, stating their purposes and characteristics so that researchers can easily pick the one that best suits their needs. Methods were surveyed from two perspectives: contributions and raw results, i.e., accuracy. We also presented a comparative summary of the datasets and methods in tabular forms, classifying them according to various criteria. In the end, we discussed the results and provided useful insight in shape of future research directions and open problems in the field. In conclusion, semantic segmentation has been approached with many success stories but still remains an open problem whose solution would prove really useful for a wide set of real-world applications. Furthermore, deep learning has proved to be extremely powerful to tackle this problem so we can expect a flurry of innovation and spawns of research lines in the upcoming years.
This work has been funded by the Spanish Government TIN2016-76515-R grant for the COMBAHO project, supported with Feder funds. It has also been supported by a Spanish national grant for PhD studies FPU15/04516. In addition, it was also funded by the grant Ayudas para Estudios de Máster e Iniciación a la Investigación from the University of Alicante.
2012 IEEE Conference on Computer Vision and Pattern Recognition, June 2012, pp. 3354–3361.
Y. Yoon, H.-G. Jeon, D. Yoo, J.-Y. Lee, and I. So Kweon, “Learning a deep convolutional network for light-field image super-resolution,” inProceedings of the IEEE International Conference on Computer Vision Workshops, 2015, pp. 24–32.
J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li, “Deep learning for content-based image retrieval: A comprehensive study,” inProceedings of the 22nd ACM international conference on Multimedia. ACM, 2014, pp. 157–166.
D. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber, “Deep neural networks segment neuronal membranes in electron microscopy images,” inAdvances in neural information processing systems, 2012, pp. 2843–2851.