The instance segmentation task aims to precisely localize objects in images, and it is a relatively new problem in computer vision. Currently, the main approach to the instance segmentation problem is based on ensemble methods: a combination of multiple convolutional neural networks, or multi-stage neural networks in which each network or stage is specialized to solve one specific subtask. The work presented by Cordts et al. groups instance segmentation solutions into three categories according to the order of the subtasks: segmentation + detection, detection + segmentation, and simultaneous detection and segmentation. Unlike most solutions, the Distance to Center of Mass Encoding (DCME) does not rely on detections (region proposal methods). The DCME is a mathematical representation based on linear algebra that indirectly encodes instances in the ground truth annotations. Each instance is represented by a center of mass and 2D displacement vectors. The center of mass is computed by treating each pixel as a unit mass, and a displacement vector is the difference between two pixel positions in the image. A displacement vector is assigned to each pixel, and it points towards the center of mass of the pixel's instance. Encoding instance information in the ground truth annotations is a simple and direct solution to the problem. This approach adds overhead to extract the instances, but most solutions also need tailored post-processing to extract them. To the best of our knowledge, no other work describes a similar representation for solving the instance segmentation problem. To allow scientific reproducibility, a demo will be made publicly available.
Following the success of deep learning, this work focuses on deep semantic segmentation models to solve the instance segmentation problem. In this context, any pixel-wise deep convolutional neural network is considered a segmentation model. Experiments were performed on SegNet and FCN to verify the generalization capabilities of segmentation models trained with DCME. These models were selected to cover two different approaches to deep segmentation: SegNet uses max-pooling indices to upsample feature maps, while FCN uses deconvolution (transposed convolution) filters.
The success of deep learning models is partially due to the existence of large datasets. In this context, the quality of a dataset is assessed by the amount of data and the quality of its annotations. There are only a few datasets for instance segmentation, and the most widely acknowledged are PASCAL VOC [8, 10], Microsoft COCO, and Cityscapes. The proposed solution was evaluated on the challenging Cityscapes dataset because of its urban street context and high-quality annotations. Depending on the dataset, instance segmentation is also called object segmentation, but this paper follows the Cityscapes naming convention.
The DCME was able to separate individual instances independently of their classes, but a complete solution to the instance segmentation problem also requires identifying the class of each instance. The solution was based on SegNet/FCN to separate instances and GoogLeNet to classify them. Cityscapes is an urban street scene dataset, but the solution is theoretically independent of the object type and shape.
2 Related work
Since detection-based solutions usually rely on region proposal techniques, Uhrig et al. separated instance segmentation solutions into proposal-based and proposal-free methods. Region proposal methods localize areas in images that might contain objects, generating object candidates.
The Multiscale Combinatorial Grouping (MCG) computes a hierarchical segmentation on images of different scales and combines them into a single multi-scale segmentation hierarchy. Finally, a grouping component ranks a list of object proposals.
More recently, the Region Proposal Network (RPN) was introduced as a CNN built to score region proposals; according to the authors, it operates in a sliding-window fashion. This method has been widely used since it allows multi-stage neural networks to share feature maps. The great advantage of multi-stage neural networks over stacking multiple networks lies in jointly training the different sub-networks, which improves feature learning because the different parts share feature maps.
2.1 Proposal-based methods
Frequently, proposal-based methods use multiple networks trained separately, or single networks with multi-stage tasks. They often present the best results in computer vision benchmarks, but they are also very computationally expensive. As benchmarks do not evaluate computational cost, there is no disadvantage in stacking different methods to solve a problem. The scientific value of these solutions is directly associated with benchmark scores: unless they reduce the computational cost (number of parameters and/or number of operations), architectures with higher scores are more acknowledged by the scientific community.
Simultaneous Detection and Segmentation uses MCG to find category-independent region proposals. It then uses a convolutional neural network to extract features from each region and a support vector machine to score each candidate, and finally applies non-maximum suppression to select candidates.
A fully convolutional encoder-decoder network was proposed by Yang et al. to detect the contours of all objects in the image and then use MCG to separate instances. A similar approach was proposed by Li et al., where a segmentation network generates saliency masks and contour masks. From the contour masks, MCG generates object proposals, which are compared to the saliency masks to generate the final object instances.
The multi-task network cascades specialize a single CNN to solve three different tasks: instance differentiation, mask estimation, and object categorization. The first stage is based on RPNs and proposes class-agnostic bounding boxes. Later network stages are built upon earlier ones, forming a cascade.
Hayder et al. propose a solution that is more robust to errors in the object candidate generation process based on RPNs. They propose the object mask network (OMN), which has a residual-deconvolution architecture, and integrate the OMN in a multi-task cascade framework with several stages.
2.2 Proposal-free methods
Usually, proposal-free solutions are bottom-up approaches that directly use segmentation models. These methods do not directly represent instances but they learn specific characteristics like contours, depth or position that are helpful to separate/identify instances. These solutions are diverse but they are highly dependent on post-processing techniques.
An FCN was used by van den Brand et al. to separate object contours into four classes based on their relative position: top, left, bottom, and right. Instance contours are extracted from the FCN output, and the flood fill algorithm is used to define instance masks. Uhrig et al. also employ an FCN to generate a combination of per-pixel semantic label, depth, and direction. These three outputs are used by a template matching scheme to extract and classify instances.
A discriminative loss function based on a distance metric was proposed by De Brabandere et al. to cluster instance pixels. Another loss function was designed by Romera-Paredes and Torr to train a recurrent neural network end to end. According to the authors, the solution is inspired by how humans count elements in a scene. This work assumes the instances belong to a single class.
Zhang et al. model the problem as a Markov Random Field. These works propose energy functions to train a network with image patches of different scales. They also propose some heuristics to extract the instances. These works can only detect instances from a single class and have a limit of 9 instances per image.
Bai and Urtasun employ a cascade neural network to predict a watershed energy landscape such that each basin corresponds to a single instance. The final solution generates a watershed transform energy map representation, which is cut at a fixed threshold to yield the final predictions. The proposal-free network also specializes a convolutional neural network to predict pixel-level semantic labels, the number of instances per class, and an instance location vector for each pixel.
Arnab and Torr use several methods to generate object instances from object detections. Their final representation is a segmentation map that assigns an object class and an instance label to each pixel. Liu et al. also use a sequence of neural networks to learn breakpoints that form line segments, which are then combined to generate instances.
This work describes a novel mathematical representation for object instance annotations that deep semantic segmentation models are able to learn and generalize, which places this solution among the proposal-free methods. The approach consists of transforming the annotations into 2D displacement vectors that point towards the instance centers of mass. Encoding and decoding steps are proposed to generate the representations and extract information.
Arnab and Torr also encode instance information in each pixel, but their representation is inferred from bounding box annotations and semantic segmentation annotations, while DCME indirectly encodes instance information in the ground truth training annotations.
The work introduced by Uhrig et al. and later extended by Levinkov et al. also computes directions towards instance centers. It divides a full turn into 8 regions of 45 degrees, and each pixel is classified into one of these regions according to its position relative to the instance center. Each of these 8 regions represents a class inferred by softmax. Unlike Uhrig et al., the 2D displacement vectors of DCME also encode magnitude information and are represented by two real numbers corresponding to the vector components along each axis. Their work also uses pixel-level semantic labels and depth information to train their models.
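The difference between a quantized direction class and a real-valued displacement vector can be illustrated with a short sketch. The function name `direction_class` and the bin layout (sector 0 starting at the positive x axis) are assumptions made for illustration; they are not taken from the cited works.

```python
import math

def direction_class(dx, dy, n_bins=8):
    """Quantize the direction of (dx, dy) into one of n_bins angular sectors.

    Assumed layout: sector 0 starts at the positive x axis and each sector
    spans 360 / n_bins degrees (45 degrees for n_bins == 8).
    """
    angle = math.atan2(dy, dx) % (2 * math.pi)   # angle in [0, 2*pi)
    sector = 2 * math.pi / n_bins
    return int(angle // sector) % n_bins

# Two displacement vectors with the same direction but different lengths.
near = (4.0, 1.0)
far = (40.0, 10.0)

# The direction class is identical for both pixels ...
print(direction_class(*near), direction_class(*far))
# ... while the real-valued components also carry the distance to the center.
print(math.hypot(*near), math.hypot(*far))
```

Under this layout both vectors fall in the same sector, so a direction-only encoding cannot tell how far each pixel is from its instance center, whereas the two real components can.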
This work shares several similarities with Bai and Urtasun. The watershed transform energy map representation is similar to the DCME magnitude map, although DCME is not limited to 16 quantization levels. The DCME aims to generate a 2D displacement vector as the segmentation model output, while Bai and Urtasun employ direction vectors without distance information as an intermediate representation to generate the watershed transform energy map. The energy map is directly used to extract instances; therefore, their solution does not handle partial occlusion.
Due to the high scores achieved by proposal-based methods on benchmark datasets, one may think this is the best approach to the problem; however, some questions remain unanswered. What are the most important subtasks for solving the problem? What is the order among these subtasks? How should these subtasks interact with each other? It is hard to answer these questions because convolutional neural networks automatically extract the most important features to solve the problem, and most of the works in the area are developed empirically. Although proposal-free methods usually present lower scores, mathematical models for instance segmentation annotations seem to better fit how deep learning models are currently being employed.
3 Distance to center of mass encoding
The distance to center of mass encoding (DCME) aims to generate instance representations that deep segmentation models are able to learn and generalize. This work proposes a model to represent instances and algorithms to encode and decode the information. To create a meaningful instance segmentation representation, some points must be addressed:
- images have a variable number of instances;
- overlapping regions between instances are critical regions;
- a single instance may have several disjoint regions (partial occlusion).
Unlike classification and segmentation problems, where the number of classes is fixed, images have a variable number of instances. Several solutions tend to group together different instances with overlapping regions, and some instance segmentation solutions are not able to resolve partial occlusion [4, 26].
The approach to these three problems was to generate a representation that indirectly encodes instance information for each pixel. The pixels are represented by 2D displacement vectors pointing to an instance center of mass (CM). Each instance is represented by a single center of mass and the 2D displacement vectors that point to it. The magnitude of a vector is the distance between a pixel and its instance center of mass, and its direction points to the center of mass. The vectors are represented by 2D components in a Cartesian coordinate system with its origin at the top left corner of the image. Background pixels, which do not belong to any instance, have zero components.
Theoretically, there is no limit on the number of centers of mass or on the magnitude of the displacement vectors. Any pixel may become a center of mass, a scene may have any number of instances, and these instances have no size restrictions. In practice, however, the number of instances is limited by the image resolution. The maximum number of instances is half the number of pixels (for example, half the pixels of a full HD image), in which case each pixel in the other half carries a 2D displacement vector pointing to a single center of mass. This happens because a center of mass needs at least one displacement vector pointing to it.
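As a worked instance of this bound, assume an image of height H and width W (symbols introduced here for illustration) and take full HD to mean 1920 × 1080 pixels:

```latex
N_{\max} = \left\lfloor \frac{H \cdot W}{2} \right\rfloor,
\qquad
H = 1080,\; W = 1920
\;\Rightarrow\;
N_{\max} = \frac{2\,073\,600}{2} = 1\,036\,800
```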
As explained in other works [4, 16, 25], the direction information helps to separate overlapping regions, since vectors in these areas may have components pointing in opposite directions. Finally, when each pixel carries information about the instance it belongs to, partial occlusions are simply background pixels or pixels that belong to other instances.
Two steps are required to transform the ground truth annotations according to the DCME. The first step finds the center of mass position of each object instance (Eq. 1). The second step computes the displacement vector for each pixel position of each object instance (Eq. 2). Each pixel position is relative to the coordinate system origin at the top left corner of the annotation, and the vertical axis points downward, as shown in Figure 1.
Step 1 - find the center of mass for each instance:
(1a) (1b) (1c)
Step 2 - compute the displacement vector for each pixel of each instance:
(2a) (2b) (2c) (2d)
The center of mass position is the average of the columns and rows of the pixels belonging to the instance. The mass is considered uniformly distributed, with each pixel having unit mass. Therefore, no region is heavier than another, and the center of mass tends to be located in regions with a higher surface density of pixels.
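One way to write Eq. 1 and Eq. 2 that is consistent with the description above is sketched below. The notation is an assumption introduced here: P_k is the set of pixel positions (x, y) of instance k, n_k is the number of pixels in P_k, x indexes columns and y indexes rows. The grouping into sub-equations is not meant to reproduce the original numbering.

```latex
\begin{align}
x_{cm}^{(k)} &= \frac{1}{n_k} \sum_{(x, y) \in P_k} x, &
y_{cm}^{(k)} &= \frac{1}{n_k} \sum_{(x, y) \in P_k} y, \\
d_x(x, y) &= x_{cm}^{(k)} - x, &
d_y(x, y) &= y_{cm}^{(k)} - y, \qquad (x, y) \in P_k, \\
\lVert \mathbf{d}(x, y) \rVert &= \sqrt{d_x(x, y)^2 + d_y(x, y)^2}.
\end{align}
```

With this sign convention, adding the displacement vector to a pixel position yields the center of mass of its instance, and background pixels keep d_x = d_y = 0.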
Initially, the centroid of the instance bounding box was used in place of the center of mass. However, this anchor was very susceptible to outliers: a few pixels placed far from the rest of the instance would greatly affect its position. The centroids of close instances also tend to be very close to each other, reducing the effectiveness of the decoding. Since the center of mass is located in dense regions, it improves the separation of close instances (Figure 2).
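The sensitivity of the bounding box centroid to outliers can be seen on a toy mask. The mask below is synthetic and chosen only for illustration: a dense 6 × 6 body plus a single far-away pixel that belongs to the same instance.

```python
import numpy as np

mask = np.zeros((20, 20), dtype=bool)
mask[2:8, 2:8] = True      # dense body of the instance (36 pixels)
mask[18, 18] = True        # one outlier pixel far from the body

rows, cols = np.nonzero(mask)

# Bounding box centroid: midpoint of the extreme rows and columns.
bbox_center = ((rows.min() + rows.max()) / 2, (cols.min() + cols.max()) / 2)

# Center of mass: mean row and mean column, every pixel with unit mass.
center_of_mass = (rows.mean(), cols.mean())

print("bounding box centroid:", bbox_center)
print("center of mass:", center_of_mass)
```

Without the outlier, both anchors would sit at (4.5, 4.5). With it, the bounding box centroid jumps to (10, 10), outside the dense body, while the center of mass moves only slightly.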
For convenience, the final annotation generated by the DCME will be called the 2D vector map. This representation has the same spatial dimensions as the input, and each pixel is a vector with two associated components, one along each axis.
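A minimal sketch of the encoding, assuming the annotation is available as an integer instance-id map in which 0 marks the background. The function name `encode_dcme`, the input format, and the channel order (dx first, dy second) are assumptions made for illustration.

```python
import numpy as np

def encode_dcme(instance_ids):
    """Turn an (H, W) instance-id map (0 = background) into an (H, W, 2) vector map.

    Channel 0 holds dx (along columns) and channel 1 holds dy (along rows),
    both pointing from the pixel to the center of mass of its instance.
    """
    h, w = instance_ids.shape
    vectors = np.zeros((h, w, 2), dtype=np.float32)
    for inst in np.unique(instance_ids):
        if inst == 0:                      # background keeps zero components
            continue
        rows, cols = np.nonzero(instance_ids == inst)
        cm_row, cm_col = rows.mean(), cols.mean()   # unit mass per pixel
        vectors[rows, cols, 0] = cm_col - cols
        vectors[rows, cols, 1] = cm_row - rows
    return vectors

ids = np.zeros((4, 6), dtype=np.int32)
ids[0:2, 0:2] = 1   # 2x2 instance, center of mass at row 0.5, col 0.5
ids[1:4, 3:6] = 2   # 3x3 instance, center of mass at row 2, col 4
vmap = encode_dcme(ids)
print(vmap.shape)
print(vmap[1, 3])   # pixel of instance 2, one step up-left of its center
```

Note that a pixel lying exactly on the center of mass receives a zero vector, the same value as a background pixel.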
Since the vector components are real numbers, the classification softmax layer of the segmentation models must be replaced by a regression layer. This is the only modification to the model architecture required by this solution. The segmentation model no longer computes a probability distribution over a fixed number of classes; instead, it directly infers the displacement vector components.
After training, the segmentation model is able to generate results similar to the annotations. The decoding step extracts the instances from the vector maps. It requires finding the centers of mass, which are local minima of the magnitude surface. The magnitude and direction are noisy, and consequently the magnitude surface is irregular, with several local minima. Therefore, a heuristic was developed to select the most likely centers of mass.
Initially, the center of mass pointed to by each 2D vector is evaluated. Due to noise in the segmentation model output, the vectors point to a region close to the instance center of mass. Centers of mass that are close to each other are clustered, and only those pointed to by a minimum number of vectors are selected. An instance is the set of all vectors that point to the region around the center of mass. In summary, the decoding is composed of the following steps:
1. compute center of mass proposals.
2. cluster close centers of mass: distance threshold (DT).
3. select centers of mass pointed to by a minimum number of vectors: vote threshold (VT).
4. select vectors that point to the center of mass region: error threshold (ET).
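The four steps above can be sketched as follows. This is only an illustration under stated assumptions, not the heuristic used in this work: the clustering is a simple greedy merge of integer-rounded proposals, pixels whose vector magnitude is at most `eps` are treated as background, and the function name `decode_dcme`, the parameter `eps`, and the toy threshold values are hypothetical.

```python
import numpy as np

def decode_dcme(vectors, dt, vt, et, eps=0.5):
    """Greedy sketch of the four decoding steps; returns an (H, W) label map."""
    h, w, _ = vectors.shape
    labels = np.zeros((h, w), dtype=np.int32)
    rows, cols = np.nonzero(np.hypot(vectors[..., 0], vectors[..., 1]) > eps)
    if rows.size == 0:
        return labels
    # Step 1: each foreground pixel proposes a center (row + dy, col + dx).
    prop = np.stack([rows + vectors[rows, cols, 1],
                     cols + vectors[rows, cols, 0]], axis=1)
    # Step 2: merge proposals closer than dt, most voted grid cells first.
    cells, counts = np.unique(np.round(prop).astype(int), axis=0,
                              return_counts=True)
    centers, votes = [], []
    for i in np.argsort(-counts):
        c = cells[i].astype(float)
        for k, ctr in enumerate(centers):
            if np.linalg.norm(c - ctr) <= dt:
                votes[k] += counts[i]
                break
        else:
            centers.append(c)
            votes.append(counts[i])
    # Step 3: keep only centers that received at least vt votes.
    centers = [c for c, v in zip(centers, votes) if v >= vt]
    if not centers:
        return labels
    # Step 4: a pixel joins the nearest kept center if it points within et of it.
    dist = np.linalg.norm(prop[:, None, :] - np.array(centers)[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)
    ok = dist[np.arange(len(prop)), nearest] <= et
    labels[rows[ok], cols[ok]] = nearest[ok] + 1
    return labels

# Toy vector map with two rectangular instances (noise-free, for illustration).
ids = np.zeros((12, 20), dtype=np.int32)
ids[1:6, 1:6] = 1
ids[6:11, 12:19] = 2
vmap = np.zeros(ids.shape + (2,), dtype=np.float32)
for inst in (1, 2):
    r, c = np.nonzero(ids == inst)
    vmap[r, c, 0] = c.mean() - c
    vmap[r, c, 1] = r.mean() - r
labels = decode_dcme(vmap, dt=3, vt=10, et=2)
print(np.unique(labels))
```

In this sketch the pixel that sits exactly on a center of mass has a zero vector and is therefore left unlabeled, and the label numbering follows the vote ranking rather than the original instance ids.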
There are three thresholds associated with this solution, and a compromise among them is required to achieve good results. A high vote threshold tends to discard small instances, which have only a small number of pixels pointing to the center of mass. Smaller vote thresholds include small instances but also include several false positives.
Large instances have large areas with potential centers of mass. A high error threshold and a high distance threshold improve the detection of the pixels of large instances, but small instances that are close to each other tend to be grouped as a single instance.
The DCME was evaluated on the Cityscapes urban street scene benchmark and dataset for pixel-level and instance-level semantic labeling. The dataset was designed to advance the understanding of complex traffic scenes and driving scenarios. Only the fine annotation train set was used to train the models, in a single phase, and no data augmentation, data balancing, or pre-processing technique was applied to improve the scores. However, the images and annotations had to be subsampled by a factor of 3, reducing their resolution from (1024, 2048) to (340, 680). The experiments were performed with the Caffe deep learning framework.
4.1 Instance segmentation pipeline
The DCME can be used to separate instances, but it does not encode information about instance classes. Currently, to completely solve the instance segmentation problem, a model is required to classify each of the instances. The classification step can be done by any classification model and is only required when the dataset has more than one class. If the problem has multiple instances of a single class, the classification step is not required.
Three convolutional neural networks were used to evaluate the DCME: two semantic segmentation models, SegNet and FCN-8s, to separate instances, and the GoogLeNet classification model to classify the extracted instances. These two segmentation models were chosen to cover two different approaches to image segmentation, thus demonstrating that DCME is independent of the segmentation model architecture/paradigm. The two models have different decoders: SegNet uses the max-pooling indices to upsample feature maps, while FCN uses deconvolution layers, also known as transposed convolutional layers. The basic-SegNet model was also evaluated, but its scores were roughly half those of the main SegNet model and are not presented here.
Both segmentation networks were modified to yield a 2D vector output with the same spatial dimensions as the input images/annotations. These networks perform a regression, and their weights were optimized with the mean squared error. The SegNet network was trained with a learning rate and a batch size of 15 images/annotations, the FCN-8s network was trained with a learning rate and a batch size of 14 images/annotations. The decoding thresholds were defined as (DT, VT, ET): DCME-SegNet (10, 50, 15) and DCME-FCN (15, 30, 20).
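A minimal sketch of a mean squared error between a predicted and a target vector map, written with NumPy for illustration. The function name `mse_loss` and the normalization (a plain mean over all pixels and both components) are assumptions; the exact loss scaling used in the experiments is not stated here.

```python
import numpy as np

def mse_loss(pred, target):
    """Mean squared error over all pixels and both vector components."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    return float(np.mean(diff ** 2))

target = np.zeros((2, 2, 2))
target[0, 0] = (3.0, -4.0)    # one foreground pixel, three background pixels
pred = target.copy()
pred[0, 0] = (2.0, -2.0)      # component errors of 1 and 2

print(mse_loss(pred, target))  # (1 + 4) / 8 = 0.625
```

Because background pixels have zero targets, a plain mean also penalizes any nonzero prediction on the background.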
The GoogLeNet inception-v1 model was used to classify the instances. To train the classification network, the instances from the fine train set were extracted and resized to (224, 224). It was trained with the “poly” learning rate policy, initial learning rate of and batch size of 64 images. The segmentation and classification models are trained separately, and future efforts will be directed at integrating these two phases.
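In Caffe, the “poly” policy decays the learning rate polynomially from its initial value to zero at the last iteration. The sketch below reproduces that schedule; the numeric values in the example are hypothetical and are not the ones used in the experiments.

```python
def poly_lr(base_lr, iteration, max_iter, power):
    """Learning rate under the "poly" policy: base_lr * (1 - iter / max_iter) ** power."""
    return base_lr * (1.0 - iteration / max_iter) ** power

# Hypothetical settings, for illustration only.
for it in (0, 250, 500, 750, 1000):
    print(it, poly_lr(base_lr=0.01, iteration=it, max_iter=1000, power=0.5))
```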
Due to the high computational cost of the segmentation model, the original images and annotations had to be subsampled by a factor of 3, from (1024, 2048) to (340, 680). The subsampling destroys information required to precisely delineate object contours. Moreover, the final result must be upsampled to the original resolution to be evaluated, which adds noise to it. Therefore, the reduction in image resolution has a great impact on the final score. Evaluating the validation set ground truth annotations at a (340, 680) resolution yields only 39.8% AP.
The Cityscapes instance segmentation problem has 8 classes, and 94% of the fine training set instances belong to 4 of them: car, person, bicycle, and rider. This imbalance is likely a reflection of the real scenario on German urban streets rather than a design error. There is no size limit for the instances to be detected, and the high resolution of the images makes Cityscapes very challenging.
Cityscapes evaluates instance segmentation submissions based on variations of the Average Precision (AP). Two submissions were made, and in both evaluations the same GoogLeNet model was used to classify the instances among the 8 classes. The DCME-SegNet scores will be published on the Cityscapes webpage. These solutions are compared with each other and with other Cityscapes solutions. Some specific modifications of the proposed solution are also evaluated.
Figures 3 and 4 present the output masks generated by both solutions on the fine annotation test set. The magnitude maps represent the distance between the pixels and their instance center of mass. Small instances are not visible in the magnitude maps since the values were normalized by the maximum magnitude and then scaled to fit the 0-255 interval. Comparing the DCME-SegNet and DCME-FCN magnitude maps, it is possible to notice that the quality of the segmentation model has a great impact on the final result. The DCME-FCN magnitude map has a blurry aspect, while the DCME-SegNet map presents a smooth intensity gradient. The DCME-SegNet is able to precisely delineate car instances, clearly representing wing mirrors. These images provide evidence that the main bottleneck of the solution is the quality of the segmentation model. This includes not only the selection of better segmentation network architectures but also design options like image resolution, data augmentation, and data balancing. These may be viewed as technical problems rather than scientific ones.
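The normalization described above explains why small instances fade in the magnitude maps. A minimal sketch follows, assuming an (H, W, 2) vector map; the function name `magnitude_image` and the rounding to 8-bit integers are illustrative choices.

```python
import numpy as np

def magnitude_image(vectors):
    """Magnitude map normalized by its maximum and scaled to the 0-255 interval."""
    mag = np.hypot(vectors[..., 0], vectors[..., 1])
    peak = mag.max()
    if peak == 0:                       # no instance pixels at all
        return np.zeros(mag.shape, dtype=np.uint8)
    return np.round(255.0 * mag / peak).astype(np.uint8)

vmap = np.zeros((1, 3, 2))
vmap[0, 0] = (3.0, 4.0)       # pixel of a small instance, magnitude 5
vmap[0, 1] = (60.0, 80.0)     # pixel of a large instance, magnitude 100
img = magnitude_image(vmap)
print(img)
```

The pixel from the small instance ends up with an intensity of 13 out of 255, which is barely distinguishable from the background when displayed.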
There are only 14 entries on the benchmark, indicating that instance segmentation is still an open problem (Table 1). Currently the best performing solution, method 1, has 31.95% AP. This is still a low score compared to the best performing solutions on image classification, object detection, and semantic segmentation benchmarks. Method 1 is proposal-based, and methods 12, 13, and 14 are proposal-free.
The DCME-SegNet AP is slightly higher than that of DCME-FCN, likely because SegNet is usually more precise than FCN, presenting higher scores in segmentation benchmarks. The AP100m and AP50m scores compute the AP over objects within ranges of 100 m and 50 m. In both submissions these two scores are close to those of method 13, while the AP50% is roughly half. This indicates that the DCME decoding tends to discard small instances.
Tables 2 and 3 present per-class scores from both submissions. The DCME-SegNet has higher scores than method 13 for the classes person, rider, and car. The low scores on other classes may be explained by the high data imbalance of the dataset. In the fine train set, all the instances that belong to the classes truck, bus, train, and motorcycle represent only 3.65% of all instances. Applying artificial data balancing techniques may increase the scores for these minority classes.
Also, the AP on the validation set was around 6.1% for the DCME-SegNet and 6.2% for the DCME-FCN; therefore, any technique that reduces the generalization error, like data augmentation, could effectively improve the final score. Although this solution presents low scores, the main purpose of the evaluation was to confirm that deep segmentation models are able to learn and generalize the DCME representation. Both SegNet and FCN-8s were able to separate instances of all eight classes, presenting results compatible with other instance segmentation solutions.
A different decoding was evaluated, replacing its fourth step with the watershed algorithm. The results were slightly worse for the DCME-SegNet, which scored 5.7% AP on the validation set, and considerably worse for the DCME-FCN, which scored 1.8% AP. In both cases the watershed could not detect regions close to contours. The watershed is computed over the magnitude map and does not make use of direction information. Therefore, the worse performance of the watershed may be explained by the fact that it uses less information than the proposed decoding.
This work presents a novel encoding technique to represent object instances in images. The representation is capable of separating object instances independently of their classes. Deep semantic segmentation models with minor modifications were able to learn this mathematical representation. The model was evaluated in the context of the instance segmentation problem on the Cityscapes dataset, presenting promising results. Future work will investigate solutions to integrate regression and classification in a single network, as well as improvements to the DCME decoding.
This work was funded by Sao Paulo Research Foundation (FAPESP) project: #2015/26293-0.
[1] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 328–335, 2014.
[2] A. Arnab and P. H. S. Torr. Pixelwise instance segmentation with a dynamically instantiated network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 879–888, July 2017.
[3] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. In IEEE Transactions on Pattern Analysis and Machine Intelligence. IEEE, 2017.
[4] M. Bai and R. Urtasun. Deep watershed transform for instance segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2858–2866, July 2017.
[5] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[6] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3150–3158, 2016.
[7] B. De Brabandere, D. Neven, and L. Van Gool. Semantic instance segmentation with a discriminative loss function. arXiv preprint arXiv:1708.02551, 2017.
[8] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.
[9] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, and J. Garcia-Rodriguez. A review on deep learning techniques applied to semantic segmentation. arXiv preprint arXiv:1704.06857, 2017.
[10] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In International Conference on Computer Vision (ICCV), 2011.
[11] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In European Conference on Computer Vision, pages 297–312. Springer, 2014.
[12] Z. Hayder, X. He, and M. Salzmann. Boundary-aware instance segmentation. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[13] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. arXiv preprint arXiv:1703.06870, 2017.
[14] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
[15] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[16] E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Schiele, and B. Andres. Joint graph decomposition & node labeling: Problem, algorithms, applications. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[17] G. Li, Y. Xie, L. Lin, and Y. Yu. Instance-level salient object segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 247–256, July 2017.
[18] X. Liang, Y. Wei, X. Shen, J. Yang, L. Lin, and S. Yan. Proposal-free network for instance-level object segmentation. arXiv preprint arXiv:1509.02636, 2015.
[19] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[20] S. Liu, J. Jia, S. Fidler, and R. Urtasun. SGN: Sequential grouping networks for instance segmentation. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[21] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[22] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[23] B. Romera-Paredes and P. H. S. Torr. Recurrent instance segmentation. In European Conference on Computer Vision, pages 312–329. Springer, 2016.
[24] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[25] J. Uhrig, M. Cordts, U. Franke, and T. Brox. Pixel-level encoding and depth layering for instance-level semantic labeling. In German Conference on Pattern Recognition, pages 14–25. Springer, 2016.
[26] J. van den Brand, M. Ochs, and R. Mester. Instance-level segmentation of vehicles by deep contours. In Asian Conference on Computer Vision, pages 477–492. Springer, 2016.
[27] J. Yang, B. Price, S. Cohen, H. Lee, and M.-H. Yang. Object contour detection with a fully convolutional encoder-decoder network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 193–202, 2016.
[28] Z. Zhang, S. Fidler, and R. Urtasun. Instance-level segmentation for autonomous driving with deep densely connected MRFs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 669–677, 2016.
[29] Z. Zhang, A. G. Schwing, S. Fidler, and R. Urtasun. Monocular object instance segmentation and depth ordering with CNNs. In Proceedings of the IEEE International Conference on Computer Vision, pages 2614–2622, 2015.