SOLO: Segmenting Objects by Locations

12/10/2019 ∙ by Xinlong Wang, et al. ∙ 27

We present a new, embarrassingly simple approach to instance segmentation in images. Compared to many other dense prediction tasks, e.g., semantic segmentation, it is the arbitrary number of instances that has made instance segmentation much more challenging. In order to predict a mask for each instance, mainstream approaches either follow the 'detect-then-segment' strategy as used by Mask R-CNN, or predict category masks first and then use clustering techniques to group pixels into individual instances. We view the task of instance segmentation from a completely new perspective by introducing the notion of "instance categories", which assigns categories to each pixel within an instance according to the instance's location and size, thus nicely converting instance mask segmentation into a classification-solvable problem. Now instance segmentation is decomposed into two classification tasks. We demonstrate a much simpler and more flexible instance segmentation framework with strong performance, achieving accuracy on par with Mask R-CNN and outperforming recent single-shot instance segmenters. We hope that this very simple and strong framework can serve as a baseline for many instance-level recognition tasks besides instance segmentation.




Code Repositories


SOLO and SOLOv2 for instance segmentation, ECCV 2020 & NeurIPS 2020.


1 Introduction

Instance segmentation is challenging because it requires the correct separation of all objects in an image while also semantically segmenting each instance at the pixel level. Objects in an image belong to a fixed set of semantic categories, but the number of instances varies. As a result, semantic segmentation can be easily formulated as a dense per-pixel classification problem, while it is challenging to predict instance labels directly following the same paradigm.

To overcome this obstacle, recent instance segmentation methods can be categorized into two groups, i.e., top-down and bottom-up paradigms. The former approach, namely ‘detect-then-segment’, first detects bounding boxes and then segments the instance mask in each bounding box. The latter approach learns an affinity relation, assigning an embedding vector to each pixel, by pushing away pixels belonging to different instances and pulling close pixels in the same instance. A grouping post-processing step is then needed to separate instances. Both paradigms are step-wise and indirect: they either rely heavily on accurate bounding box detection or depend on per-pixel embedding learning and grouping.

(a) Mask R-CNN
(b) SOLO
Figure 1: Comparison of the pipelines of Mask R-CNN and our SOLO.

In contrast, we aim to directly segment instance masks, under the supervision of full instance mask annotations instead of masks in boxes or additional pixel pairwise relations. We start by rethinking a question: What are the fundamental differences between object instances in an image? Take the challenging MS COCO dataset [12] for example. There are in total objects in the validation subset, of object pairs have center distance greater than pixels. As for the rest of object pairs, of them have size ratio greater than 1.5. Here we do not consider the few cases like two objects in ‘’ shape. To conclude, in most cases two instances in an image either have different center locations or have different object sizes. This observation makes one wonder whether we could directly distinguish instances by the center locations and object sizes?

In the closely related field of semantic segmentation, the dominant paradigm now leverages a fully convolutional network (FCN) to output dense predictions with $N$ channels. Each output channel is responsible for one of the semantic categories (including background). Semantic segmentation aims to distinguish different semantic categories. Analogously, in this work, we propose to distinguish object instances in the image by introducing the notion of “instance categories”, i.e., the quantized center locations and object sizes, which enables segmenting objects by locations, hence the name of our method, SOLO.

The core idea of our proposed SOLO is to separate object instances by locations and sizes.

Locations. An image can be divided into a grid of $S \times S$ cells, thus leading to $S^2$ center location classes. According to the coordinates of the object center, an object instance is assigned to one of the grid cells, as its center location category. Unlike DeepMask [21] and TensorMask [2], which pack the mask into the channel axis, we encode center location categories as the channel axis, similar to the semantic categories in semantic segmentation. Each output channel is responsible for one of the center location categories, and the corresponding channel map should predict the instance mask of the object belonging to that category. Thus, structural geometric information is naturally preserved in the spatial matrix with dimensions of height by width.

In essence, an instance category approximates the location of the object center of an instance. Thus, classifying each pixel into its instance category is equivalent to predicting the object center from each pixel by regression. The importance of converting the location prediction task into classification rather than regression is that, with classification, it is much more straightforward to model a varying number of instances using a fixed number of channels, without relying on post-processing such as grouping or learning embeddings.

Sizes. To distinguish instances with different object sizes, we employ the feature pyramid network (FPN) [11], so as to assign objects of different sizes to different levels of feature maps, as the object size classes. Thus, all the object instances are separated regularly, enabling us to classify objects by “instance categories”. Note that FPN was designed for the purpose of detecting objects of different sizes in an image.

In the sequel, we empirically show that FPN is one of the core components of our method and has a profound impact on the segmentation performance, especially when objects of varying sizes are present.

With the proposed SOLO framework, we are able to optimize the network in an end-to-end fashion for the instance segmentation task using mask annotations solely, and perform pixel-level instance segmentation free from the restrictions of local box detection and pixel grouping. Note that most instance segmentation methods to date need box annotations as one of the supervision signals. For the first time, we demonstrate a very simple instance segmentation approach achieving results on par with the dominant ‘detect-then-segment’ method Mask R-CNN [7], on the challenging COCO dataset [12] with diverse scenes and semantic classes. Additionally, we showcase the generality of our framework via the task of instance contour detection: by viewing the instance edge contours as a one-hot binary mask, SOLO can generate reasonable instance contours with almost no modification. The proposed SOLO only needs to solve two pixel-level classification tasks, analogous to semantic segmentation. Thus it may be possible to borrow some of the recent advances in semantic segmentation for improving SOLO. Essentially, SOLO converts coordinate regression into classification by discrete quantization. One benefit of doing so is the avoidance of the heuristic coordinate normalization and log-transformation typically used in detectors such as YOLO [22]. The embarrassing simplicity and strong performance of the proposed SOLO method may predict its application to a wide range of instance-level recognition tasks.

2 Related Work

We review some instance segmentation works that are closest to ours.

Top-down Instance Segmentation. Methods that segment object instances within a priori bounding boxes fall into the typical top-down paradigm. FCIS [10] assembles the position-sensitive score maps within the regions of interest (ROIs) generated by a region proposal network (RPN) to predict instance masks. Mask R-CNN [7] extends the Faster R-CNN detector [23] by adding a branch for segmenting the object instances within the detected bounding boxes. Based on Mask R-CNN, PANet [16] further enhances the feature representation to improve the accuracy, and Mask Scoring R-CNN [8] adds a mask-IoU branch to predict the quality of the predicted mask and rescore the masks to improve the performance. TensorMask [2] adopts the dense sliding window paradigm to segment the instance in the local window for each pixel with a predefined number of windows and scales. In contrast to the top-down methods above, our SOLO is totally box-free, thus not restricted by (anchor) box locations and scales, and naturally benefits from the inherent advantages of FCNs.

Bottom-up Instance Segmentation. This category of approaches generates instance masks by grouping the pixels into an arbitrary number of object instances present in an image. We briefly review several recent approaches. Pixels are grouped into instances using the learned associative embedding in [19]. A discriminative loss function [5] learns pixel-level instance embedding efficiently, by pushing away pixels belonging to different instances and pulling close pixels in the same instance. SGN [15] decomposes the instance segmentation problem into a sequence of sub-grouping problems. SSAP [6] learns a pixel-pair affinity pyramid, the probability that two pixels belong to the same instance, and sequentially generates instances by a cascaded graph partition. Typically, bottom-up methods lag behind top-down methods in accuracy, especially on datasets with diverse scenes and semantic classes. Instead of exploiting pixel pairwise relations and pixel grouping, SOLO learns directly from the instance mask annotations during training, and predicts instance masks and semantic categories end-to-end without grouping post-processing. In this sense, our proposed SOLO is a direct end-to-end instance segmentation approach.

Direct Instance Segmentation. To our knowledge, no prior methods train directly with mask annotations solely and predict instance masks and semantic categories in one shot without the need for grouping post-processing. Several recently proposed methods may be viewed as following a ‘semi-direct’ paradigm. AdaptIS [24] first predicts point proposals, and then sequentially generates the mask for the object located at the detected point proposal. PolarMask [27] proposes to use the polar representation to encode masks and transforms per-pixel mask prediction into distance regression. Neither needs bounding boxes for training, but they are either step-wise or founded on a compromise, e.g., a coarse parametric representation of masks. Our SOLO takes an image as input and directly outputs instance masks and corresponding class probabilities, in a fully convolutional, box-free and grouping-free paradigm. Our simple network can be optimized end-to-end without the need for box supervision. To predict, the network directly maps an input image to masks for each individual instance, relying on neither intermediate operators like RoI feature cropping, nor grouping post-processing.

3 Our Method: SOLO

Figure 2: SOLO framework. We reformulate instance segmentation as two sub-tasks: category prediction and instance mask generation. An input image is divided into a uniform grid, i.e., $S \times S$ cells. If the center of an object falls into a grid cell, that grid cell is responsible for predicting the semantic category (top) and masks of instances (bottom). We do not show the feature pyramid network (FPN) here for simpler illustration.

3.1 Problem Formulation

Given an arbitrary image, an instance segmentation system needs to determine whether there are instances of semantic objects and, if present, return their segmentation masks. The central idea of the SOLO framework is to reformulate instance segmentation as two simultaneous problems: category-aware prediction and instance-aware mask generation. Concretely, our system divides the input image into a uniform grid, i.e., $S \times S$ cells. If the center of an object falls into a grid cell, that grid cell is responsible for 1) predicting the semantic category as well as 2) segmenting that object instance.
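As a minimal sketch of the "center falls into a grid cell" rule, the helper below maps an object center in pixel coordinates to a zero-based grid cell. The function name `center_to_cell`, the argument order, and the clamping at the image border are illustrative assumptions, not taken from the original implementation.

```python
def center_to_cell(cx, cy, img_w, img_h, S):
    """Return (i, j): zero-based row and column of the S x S grid cell
    that contains the point (cx, cy) of an img_w x img_h image."""
    # Scale the coordinate to grid units and truncate; clamp so that a
    # center lying exactly on the right/bottom border stays in the grid.
    j = min(int(cx / img_w * S), S - 1)  # column index (x axis)
    i = min(int(cy / img_h * S), S - 1)  # row index (y axis)
    return i, j


# A center at (320, 100) in a 640 x 480 image with a 5 x 5 grid.
print(center_to_cell(320, 100, 640, 480, 5))  # (1, 2)
```

The returned cell is the one that would be made responsible for both the category prediction and the mask of that object.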

3.1.1 Semantic Category

For each grid, our SOLO predicts a $C$-dimensional output to indicate the semantic class probabilities, where $C$ is the number of classes. These probabilities are conditioned on the grid cell. If we divide the input image into $S \times S$ grids, the output space will be $S \times S \times C$, as shown in Figure 2 (top). This design is based on the assumption that each cell of the grid must belong to one individual instance, thus only belonging to one semantic category. During inference, the $C$-dimensional output indicates the class probability for each object instance.

3.1.2 Instance Mask

In parallel with the semantic category prediction, each positive grid cell will also generate the corresponding instance mask. For an input image $I$, if we divide it into $S \times S$ grids, there will be at most $S^2$ predicted masks in total. We explicitly encode these masks at the third dimension (channel) of a 3D output tensor. Specifically, the instance mask output will have dimension $H_I \times W_I \times S^2$. The $k$-th channel is responsible for segmenting the instance at grid $(i, j)$, where $k = i \cdot S + j$ (with $i$ and $j$ zero-based). (Footnote 1: We also show an equivalent and more efficient implementation in Section 5.) To this end, a one-to-one correspondence is established between the semantic category and class-agnostic mask (Figure 2).
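The correspondence between a grid cell and its mask channel can be sketched as below, assuming the row-major order mentioned later in the text, i.e. channel index k = i * S + j with zero-based i and j. The helper names are hypothetical.

```python
def cell_to_channel(i, j, S):
    """Row-major channel index of grid cell (i, j) in an S x S grid."""
    return i * S + j


def channel_to_cell(k, S):
    """Inverse mapping: recover (i, j) from the channel index k."""
    return divmod(k, S)  # (k // S, k % S)


S = 5
k = cell_to_channel(1, 2, S)
print(k, channel_to_cell(k, S))  # 7 (1, 2)
```

Because the mapping is a bijection on the S * S cells, the category prediction at cell (i, j) and the mask in channel k always refer to the same instance.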

A direct approach to predicting the instance mask is to adopt fully convolutional networks, like FCNs in semantic segmentation [17]. However, conventional convolutional operations are spatially invariant to some degree. Spatial invariance is desirable for some tasks such as image classification, as it introduces robustness. Here, in contrast, we need a model that is spatially variant, or more precisely, position sensitive, since our segmentation masks are conditioned on the grid cells and must be separated by different feature channels.

Our solution is very simple: at the beginning of the network, we directly feed normalized pixel coordinates to the network, inspired by the ‘CoordConv’ operator [14]. Specifically, we create a tensor of the same spatial size as the input that contains the normalized pixel coordinates. This tensor is then concatenated to the input features and passed to the following layers. By simply giving the convolution access to its own input coordinates, we add the spatial functionality to the conventional FCN model. It should be noted that CoordConv is not the only choice. For example, the semi-convolutional operators [20] may be competent, but we employ CoordConv for its simplicity and ease of implementation. If the original feature tensor is of size $H \times W \times D$, the size of the new tensor becomes $H \times W \times (D+2)$, in which the last two channels are the $x$-$y$ pixel coordinates. For more information on CoordConv, we refer readers to [14].
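The coordinate concatenation can be illustrated with a small, framework-free sketch. It assumes the coordinates are scaled linearly to [-1, 1] (a common CoordConv convention, taken here as an assumption) and represents the feature tensor as nested Python lists of shape H x W x D; the function name `add_coord_channels` is hypothetical.

```python
def add_coord_channels(feat):
    """feat: nested list of shape H x W x D.
    Returns a new H x W x (D + 2) nested list whose last two channels
    are the normalized x and y coordinates of each position."""
    H, W = len(feat), len(feat[0])
    out = []
    for y in range(H):
        # Map the row index to [-1, 1]; a single row maps to 0.
        ny = 2.0 * y / (H - 1) - 1.0 if H > 1 else 0.0
        row = []
        for x in range(W):
            nx = 2.0 * x / (W - 1) - 1.0 if W > 1 else 0.0
            row.append(list(feat[y][x]) + [nx, ny])
        out.append(row)
    return out


feat = [[[0.5] for _ in range(3)] for _ in range(2)]  # H=2, W=3, D=1
out = add_coord_channels(feat)
print(out[0][0], out[1][2])  # [0.5, -1.0, -1.0] [0.5, 1.0, 1.0]
```

In a real network the same operation would be a tensor concatenation along the channel axis before the first convolution of the mask branch.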

Forming Instance Segmentation. In SOLO, the category prediction and the corresponding mask are naturally associated by their reference grid cell, i.e., $k = i \cdot S + j$. Based on this, we can directly form the final instance segmentation result for each grid. The raw instance segmentation results are generated by gathering all grid results. Finally, non-maximum suppression (NMS) is used to obtain the final instance segmentation results. No other post-processing operations are needed.

3.2 Network Architecture

Figure 3: SOLO Head architecture. At each FPN feature level, we attach two sibling sub-networks, one for instance category prediction (top) and one for instance mask segmentation (bottom). In the mask branch, we concatenate the $x$, $y$ coordinates and the original features to encode spatial information. Here numbers denote spatial resolution and channels. In the figure, we assume 256 channels as an example. Arrows denote either convolution or interpolation. All convolutions are 3×3, except the output conv. ‘Align’ means adaptive-pooling, interpolation or region-grid-interpolation, which is discussed in Section 4.3. During inference, the mask branch outputs are further upsampled to the original image size.

We now present the networks utilized in our SOLO framework. SOLO attaches to a convolutional backbone. We use FPN [11], which generates a pyramid of feature maps of different sizes with a fixed number of channels (usually 256-d) for each level. These maps are used as input for each prediction head: semantic category and instance mask. Weights for the head are shared across different levels. The grid number may vary across pyramid levels. Only the last conv is not shared in this scenario.

To demonstrate the generality and effectiveness of our approach, we instantiate SOLO with multiple architectures. The differences include: (a) the backbone architecture used for feature extraction, (b) the network head for computing the instance segmentation results, and (c) the training loss function used to optimize the model. Most of the experiments are based on the head architecture shown in Figure 3. We also utilize different variants to further study the generality. We note that our instance segmentation heads have a straightforward structure. More complex designs have the potential to improve performance but are not the focus of this work.

3.3 SOLO Learning

3.3.1 Label Assignment

For the category prediction branch, the network needs to give the object category probability for each of the $S \times S$ grids. Specifically, grid $(i, j)$ is considered a positive sample if it falls into the center region of any ground-truth mask; otherwise it is a negative sample. Center sampling is effective in recent works on object detection [26, 9], and here we utilize a similar technique for mask category classification. Given the mass center $(c_x, c_y)$, width $w$, and height $h$ of the ground-truth mask, the center region is controlled by a constant scale factor $\epsilon$: $(c_x, c_y, \epsilon w, \epsilon h)$. With our setting of $\epsilon$, there are on average 3 positive samples for each ground-truth mask.
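A possible reading of the center-sampling rule is sketched below. It treats a grid cell as positive when the cell overlaps the shrunken box around the mass center; this overlap test, the function name `positive_cells`, and the default `eps=0.2` are assumptions made for illustration rather than details confirmed by the text.

```python
def positive_cells(cx, cy, w, h, img_w, img_h, S, eps=0.2):
    """Grid cells (i, j) overlapping the center region of a ground-truth
    mask with mass center (cx, cy), width w and height h.
    eps is the scale factor of the center region (illustrative value)."""
    half_w, half_h = 0.5 * eps * w, 0.5 * eps * h

    def to_idx(v, size):
        # Pixel coordinate -> grid index, clamped to [0, S - 1].
        return max(0, min(int(v / size * S), S - 1))

    left, right = to_idx(cx - half_w, img_w), to_idx(cx + half_w, img_w)
    top, bottom = to_idx(cy - half_h, img_h), to_idx(cy + half_h, img_h)
    return [(i, j) for i in range(top, bottom + 1)
            for j in range(left, right + 1)]


# A 200 x 100 object centered at (320, 240) in a 640 x 480 image, S = 12.
print(positive_cells(320, 240, 200, 100, 640, 480, 12))
# [(5, 5), (5, 6), (6, 5), (6, 6)]
```

Cells not returned for any ground-truth mask would be treated as negative samples for the category branch.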

Besides the label for instance category, we also have a binary segmentation mask for each positive sample. Since there are $S^2$ grids, we also have $S^2$ output masks for each image. For each positive sample, the corresponding target binary mask will be annotated. One may be concerned that the order of masks will impact the mask prediction branch; however, we show that the simplest row-major order works well for our method.

3.3.2 Loss Function

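We define our training loss function as follows:

$$L = L_{cate} + \lambda L_{mask}, \qquad (1)$$

where $L_{cate}$ is the conventional Focal Loss [13] for semantic category classification. $L_{mask}$ is the loss for mask prediction:

$$L_{mask} = \frac{1}{N_{pos}} \sum_{k} \mathbb{1}_{\{p^*_{i,j} > 0\}} \, d_{mask}(m_k, m^*_k). \qquad (2)$$

Here the indices are $i = \lfloor k/S \rfloor$ and $j = k \bmod S$, if we index the grid cells (instance category labels) from left to right and top to bottom. $N_{pos}$ denotes the number of positive samples, and $p^*$ and $m^*$ represent the category and mask targets respectively. $\mathbb{1}$ is the indicator function, being 1 if $p^*_{i,j} > 0$ and 0 otherwise.

In our implementation, we have compared different implementations of $d_{mask}(\cdot, \cdot)$: Binary Cross Entropy (BCE), Focal Loss [13] and Dice Loss [18]. Finally, we employ Dice Loss for its effectiveness and stability in training. $\lambda$ in Equation (1) is set to 3. The Dice Loss is defined as

$$L_{Dice} = 1 - D(p, q), \qquad (3)$$

where $D$ is the dice coefficient, which is defined as

$$D(p, q) = \frac{2 \sum_{x,y} (p_{x,y} \cdot q_{x,y})}{\sum_{x,y} p_{x,y}^2 + \sum_{x,y} q_{x,y}^2}. \qquad (4)$$

Here $p_{x,y}$ and $q_{x,y}$ refer to the value of the pixel located at $(x, y)$ in the predicted soft mask $p$ and the ground-truth mask $q$. We give further comparisons of the loss functions in the experimental section.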
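The Dice Loss can be sketched in a few lines. The version below assumes the squared-denominator form of the dice coefficient, D = 2 * sum(p * q) / (sum(p^2) + sum(q^2)), and adds a small smoothing constant `eps` to avoid division by zero; both the function name and `eps` are illustrative assumptions.

```python
def dice_loss(pred, target, eps=1e-6):
    """Dice loss between a predicted soft mask and a binary target.
    pred and target are flat sequences of equal length; pred values
    are expected in [0, 1], target values in {0, 1}."""
    inter = sum(p * q for p, q in zip(pred, target))
    denom = sum(p * p for p in pred) + sum(q * q for q in target)
    # eps only guards against an all-zero prediction and target.
    return 1.0 - 2.0 * inter / (denom + eps)


print(round(dice_loss([1, 0, 1, 0], [1, 0, 1, 0]), 4))  # 0.0
print(round(dice_loss([1, 1, 0, 0], [0, 0, 1, 1]), 4))  # 1.0
```

Because numerator and denominator are both sums over the whole mask, the loss is driven by overlap rather than by per-pixel counts, so a small foreground region is not swamped by the many background pixels.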

3.4 Inference

The inference of SOLO is very straightforward. Given an input image, we forward it through the backbone network and FPN, and obtain the category score $p_{i,j}$ at grid $(i, j)$ and the corresponding masks $m_k$, where $k = i \cdot S + j$ since we usually keep row-major order. We first apply a confidence threshold to filter out predictions with low confidence. Then we select the top-scoring masks and feed them into the NMS operation. To convert the predicted soft masks to binary masks, we binarize them with a fixed threshold. We keep the top-scoring instance masks for evaluation.
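The inference steps above (score filtering, top-k selection, binarization, NMS) can be sketched as follows. The text only says that NMS is used, so the greedy mask-IoU suppression here is a stand-in chosen for simplicity, and every threshold is a caller-supplied parameter rather than a value from the paper. Masks are flat lists for brevity, and `postprocess` and `mask_iou` are hypothetical names.

```python
def mask_iou(a, b):
    """IoU of two binary masks given as flat sequences of booleans."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0


def postprocess(scores, soft_masks, score_thr, pre_nms_k,
                mask_thr, nms_iou, keep_k):
    """scores[k]: category score of grid cell k (row-major order);
    soft_masks[k]: the matching soft mask as a flat list in [0, 1].
    Returns a list of (k, score, binary_mask) tuples."""
    # 1) Drop low-confidence cells, 2) keep the top-scoring candidates.
    cand = sorted((k for k, s in enumerate(scores) if s > score_thr),
                  key=lambda k: scores[k], reverse=True)[:pre_nms_k]
    kept = []
    for k in cand:
        binary = [v > mask_thr for v in soft_masks[k]]  # 3) binarize
        # 4) Greedy NMS: keep a mask only if it does not overlap too
        #    much with any higher-scoring mask that was already kept.
        if all(mask_iou(binary, m) <= nms_iou for _, _, m in kept):
            kept.append((k, scores[k], binary))
    return kept[:keep_k]


scores = [0.9, 0.8, 0.05, 0.7]
masks = [[0.9, 0.9, 0.1, 0.1], [0.8, 0.9, 0.2, 0.1],
         [0.5, 0.5, 0.5, 0.5], [0.1, 0.1, 0.9, 0.9]]
result = postprocess(scores, masks, 0.1, 500, 0.5, 0.5, 100)
print([k for k, _, _ in result])  # [0, 3]
```

In the example, channel 1 duplicates the instance found by channel 0 and is suppressed, while channel 2 is removed by the score threshold.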

4 Experiments

| method | backbone | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|
| MNC [3] | Res-101-C4 | 24.6 | 44.3 | 24.8 | 4.7 | 25.9 | 43.6 |
| FCIS [10] | Res-101-C5 | 29.2 | 49.5 | - | 7.1 | 31.3 | 50.0 |
| Mask R-CNN [7] | Res-101-FPN | 35.7 | 58.0 | 37.8 | 15.5 | 38.1 | 52.4 |
| Mask R-CNN [2] | Res-50-FPN | 36.8 | 59.2 | 39.3 | 17.1 | 38.7 | 52.1 |
| Mask R-CNN [2] | Res-101-FPN | 38.3 | 61.2 | 40.8 | 18.2 | 40.6 | 54.1 |
| TensorMask [2] | Res-50-FPN | 35.4 | 57.2 | 37.3 | 16.3 | 36.8 | 49.3 |
| TensorMask [2] | Res-101-FPN | 37.1 | 59.3 | 39.4 | 17.4 | 39.1 | 51.6 |
| YOLACT [1] | Res-101-FPN | 31.2 | 50.6 | 32.8 | 12.1 | 33.3 | 47.1 |
| PolarMask [27] | Res-101-FPN | 30.4 | 51.9 | 31.0 | 13.4 | 32.4 | 42.8 |
| SOLO | Res-50-FPN | 36.8 | 58.6 | 39.0 | 15.9 | 39.5 | 52.1 |
| SOLO | Res-101-FPN | 37.8 | 59.5 | 40.4 | 16.4 | 40.6 | 54.2 |
| SOLO | Res-DCN-101-FPN | 40.4 | 62.7 | 43.3 | 17.6 | 43.3 | 58.9 |

Table 1: Instance segmentation mask AP on COCO test-dev. All entries are single-model results. Here we adopt the “6×” schedule (72 epochs) for better results. The Mask R-CNN rows citing [2] are the improved version in [2].
Figure 4: SOLO behavior. We show the visualization of soft mask prediction. For each column, the top one is the instance segmentation result, and the bottom one shows the mask activation maps. Each sub-figure in an activation map shows the mask prediction generated by the corresponding mask channel. Different instances activate at different mask prediction channels.

We present experimental results on the MS COCO instance segmentation track [12], and report lesion and sensitivity studies by evaluating on the 5k val2017 split. For our main results, we report COCO mask AP on the test-dev split, which has no public labels and is evaluated on the evaluation server.

Training details.

SOLO is trained with stochastic gradient descent (SGD). We use synchronized SGD over 8 GPUs with a total of 16 images per mini-batch (2 images per GPU). Unless otherwise specified, all models are trained for 36 epochs, with the initial learning rate divided by 10 at the 27th and again at the 33rd epoch. Weight decay and momentum are used. All models are initialized from ImageNet pre-trained weights. We use scale jitter where the shorter image side is randomly sampled from 640 to 800 pixels.

4.1 Main Results

We compare SOLO to the state-of-the-art methods in instance segmentation on MS COCO test-dev in Table 1. SOLO with ResNet-101 achieves a mask AP of 37.8%, on par with existing two-stage instance segmentation methods such as Mask R-CNN (38.3 AP for the improved version in [2]). SOLO outperforms all previous one-stage methods, including TensorMask [2]. With a DCN-101 [4] backbone, SOLO further achieves 40.4 AP, which is considerably better than the currently dominant approaches on the COCO instance segmentation task. SOLO outputs are visualized in Figure 8, which shows that SOLO achieves good results even under challenging conditions.

4.2 How SOLO Works?

We show the network outputs generated by an $S \times S$ grid (Figure 4). Each sub-figure shows the soft mask prediction generated by the corresponding mask channel (after sigmoid). Here we can see that different instances activate at different mask prediction channels. By explicitly segmenting instances at different positions, SOLO converts the instance segmentation problem into a position-aware classification task.

Only one instance will be activated at each grid, and one instance may be predicted by multiple adjacent mask channels. During inference, we use NMS to suppress these redundant masks.

4.3 Ablation Experiments

Grid number. We compare the impact of the grid number on performance with a single output feature map, as shown in Table 2. The feature is generated by merging the C3, C4, and C5 outputs in ResNet (stride=8). To our surprise, $S = 12$ can already achieve 27.2 AP on the challenging MS COCO dataset. SOLO gets 29.0 AP when increasing the grid number to 24. These results indicate that our single-scale SOLO can be applicable to some scenarios where object scales do not vary much. However, the single-scale models largely lag behind the pyramid model, demonstrating the importance of FPN in dealing with multi-scale prediction.

| grid number | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|
| 12 | 27.2 | 44.9 | 27.6 | 8.7 | 27.6 | 44.5 |
| 24 | 29.0 | 47.3 | 29.9 | 10.0 | 30.1 | 45.8 |
| 36 | 28.6 | 46.3 | 29.7 | 9.5 | 29.5 | 45.2 |
| Pyramid | 35.8 | 57.1 | 37.8 | 15.0 | 38.7 | 53.6 |

Table 2: The impact of grid number and FPN. FPN (Table 3) significantly improves the performance thanks to its ability to deal with varying sizes of objects.

Multi-level Prediction. From Table 2 we can see that our single-scale SOLO struggles in segmenting multi-scale objects. In this ablation, we show that this issue can be largely resolved with multi-level prediction of FPN [11]. Beginning from the ablations of Table 2, we use five FPN pyramids to segment objects of different scales (Table 3). Scales of ground-truth masks are explicitly used to assign them to the levels of the pyramid. Based on our multi-level prediction, we further achieve 35.8 AP. As expected, the segmentation performance over all the metrics has been largely improved.

| pyramid | P2 | P3 | P4 | P5 | P6 |
|---|---|---|---|---|---|
| re-scaled stride | 8 | 8 | 16 | 32 | 32 |
| grid number | 40 | 36 | 24 | 16 | 12 |
| instance scale | <96 | 48~192 | 96~384 | 192~768 | ≥384 |

Table 3: We use five FPN pyramids to segment objects of different scales. The grid number increases for smaller instances due to larger existence space.
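The scale-based assignment in Table 3 can be sketched as a lookup. The sketch reads the instance-scale row as the ranges <96, 48 to 192, 96 to 384, 192 to 768 and at least 384, measures scale as sqrt(w * h), and treats each range as half-open; the scale measure and the boundary handling are assumptions, and the names are hypothetical.

```python
import math

# (level, lower bound, upper bound) read from the instance-scale row.
SCALE_RANGES = [("P2", 0, 96), ("P3", 48, 192), ("P4", 96, 384),
                ("P5", 192, 768), ("P6", 384, math.inf)]
# Grid number S used at each level (grid-number row of Table 3).
GRID_NUMBERS = {"P2": 40, "P3": 36, "P4": 24, "P5": 16, "P6": 12}


def levels_for_instance(w, h):
    """Pyramid levels responsible for an instance of size w x h.
    The ranges overlap, so an instance may map to two levels."""
    scale = math.sqrt(w * h)
    return [name for name, lo, hi in SCALE_RANGES if lo <= scale < hi]


print(levels_for_instance(60, 60))    # ['P2', 'P3']
print(levels_for_instance(400, 400))  # ['P5', 'P6']
# Total number of mask channels over all levels (sum of S * S).
print(sum(s * s for s in GRID_NUMBERS.values()))  # 3872
```

The overlapping ranges mean that objects near a scale boundary can be supervised at two adjacent levels.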

CoordConv. Another important component that facilitates our SOLO paradigm is the spatially variant convolution (CoordConv [14]). As shown in Table 4, the standard convolution already has the spatially variant property to some extent, which is in accordance with the observation in [14]. When the convolution is given access to its own input coordinates through concatenating extra coordinate channels, our method enjoys an absolute gain of 3.6 AP. Two or more CoordConvs do not bring noticeable improvement. This suggests that a single CoordConv already enables the predictions to be sufficiently spatially variant, i.e., position sensitive.

| #CoordConv | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|
| 0 | 32.2 | 52.6 | 33.7 | 11.5 | 34.3 | 51.6 |
| 1 | 35.8 | 57.1 | 37.8 | 15.0 | 38.7 | 53.6 |
| 2 | 35.7 | 57.0 | 37.7 | 14.9 | 38.7 | 53.3 |
| 3 | 35.8 | 57.4 | 37.7 | 15.7 | 39.0 | 53.0 |

Table 4: Conv vs. CoordConv. CoordConv can considerably improve AP upon standard convolution. Two or more layers of CoordConv are not necessary.

Loss function. Table 5 compares different loss functions for our mask optimization branch. The methods include conventional Binary Cross Entropy (BCE), Focal Loss (FL), and Dice Loss (DL). To obtain improved performance, for Binary Cross Entropy we set a mask loss weight of 10 and a pixel weight of 2 for positive samples. The mask loss weight of Focal Loss is set to 20. As shown, the Focal Loss works much better than the ordinary Binary Cross Entropy loss. This is because the majority of pixels of an instance mask are background, and the Focal Loss is designed to mitigate the sample imbalance problem by decreasing the loss of well-classified samples. However, the Dice Loss achieves the best results without the need to manually adjust the loss hyper-parameters. Dice Loss views the pixels as a whole object and can establish the right balance between foreground and background pixels automatically. Note that with careful tuning of the balance hyper-parameters and the introduction of other training tricks, the results of Binary Cross Entropy and Focal Loss may be considerably improved. However, the point here is that with Dice Loss, training typically becomes much more stable and more likely to attain good results without many heuristics.

| mask loss | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|
| BCE | 30.0 | 50.4 | 31.0 | 10.1 | 32.5 | 47.7 |
| FL | 31.6 | 51.1 | 33.3 | 9.9 | 34.9 | 49.8 |
| DL | 35.8 | 57.1 | 37.8 | 15.0 | 38.7 | 53.6 |

Table 5: Different loss functions may be employed in the mask branch. The Dice loss (DL) leads to the best AP and is more stable to train.

Alignment in category branch. In the category prediction branch, we must match the convolutional features with spatial size $H \times W$ to $S \times S$. Here, we compare three common implementations: interpolation, adaptive-pool, and region-grid-interpolation.

  • Interpolation: Directly bilinear interpolating to the target grid size;

  • Adaptive-pool: Applying a 2D adaptive max-pool over $H \times W$ to $S \times S$;

  • Region-grid-interpolation: For each grid cell, we use bilinear interpolation conditioned on dense sample points, and aggregate the results with average.

From our observation, there is no noticeable performance gap between these variants (±0.1 AP), indicating the alignment process is fairly flexible.
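Of the three alignment options, the adaptive max-pool is the easiest to show without a deep learning framework. The sketch below pools a single-channel H x W map down to S x S, choosing bin boundaries with floor and ceiling in the way adaptive pooling layers commonly do; the binning rule and the function name are assumptions for illustration.

```python
import math


def adaptive_max_pool2d(feat, S):
    """feat: H x W nested list (one channel). Returns an S x S nested
    list where each output cell is the max over its input bin."""
    H, W = len(feat), len(feat[0])
    out = []
    for i in range(S):
        # Rows covered by output row i: [floor(i*H/S), ceil((i+1)*H/S)).
        r0, r1 = (i * H) // S, math.ceil((i + 1) * H / S)
        row = []
        for j in range(S):
            c0, c1 = (j * W) // S, math.ceil((j + 1) * W / S)
            row.append(max(feat[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        out.append(row)
    return out


feat = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(adaptive_max_pool2d(feat, 2))  # [[6, 8], [14, 16]]
```

The same bin layout with an average instead of a max would give an adaptive average pool, which is one way to see why the variants behave similarly.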

Different head depth. In SOLO, instance segmentation is formulated as a pixel-to-pixel task and we exploit the spatial layout of masks by using an FCN. In Figure 5, we compare different head depths. Changing the head depth from 4 to 7 gives a gain of 1.2 AP. The results in Figure 5 show that when the depth grows beyond 7, the performance becomes stable. In this paper, we use a depth of 7 in other experiments.

Figure 5: Results on the COCO set using different head depth on ResNet-50-FPN.

Previous works (e.g., Mask R-CNN) usually adopt four convolutional layers for mask prediction. In SOLO, the mask is conditioned on the spatial position and we simply attach the coordinates to the beginning of the head. The mask head must have enough representation power to learn such a transformation. For the semantic category branch, the computational overhead is negligible since $S^2 \ll H \times W$.

4.4 SOLO-512

We also train a smaller version of SOLO designed to push the boundaries of real-time instance segmentation. We use a model with smaller input resolution (shorter image side of 512 instead of 800). Other training and testing parameters are the same between SOLO-512 and SOLO.

| method | backbone | AP | AP50 | AP75 | fps |
|---|---|---|---|---|---|
| SOLO | ResNet-50-FPN | 36.0 | 57.5 | 38.0 | 12.1 |
| SOLO | ResNet-101-FPN | 37.1 | 58.7 | 39.4 | 10.4 |
| SOLO-512 | ResNet-50-FPN | 34.2 | 55.9 | 36.0 | 22.5 |
| SOLO-512 | ResNet-101-FPN | 35.0 | 57.1 | 37.0 | 19.2 |

Table 6: SOLO-512. SOLO-512 uses a model with smaller input size (shorter image side of 512 instead of 800). All models are evaluated on val2017. Here the models are trained with the “6×” schedule.

With 34.2 mask AP, SOLO-512 achieves a model inference speed of 22.5 FPS, showing that SOLO has potential for real-time instance segmentation applications. The speed is reported on a single V100 GPU by averaging 5 runs.

Figure 6 shows some contour detection examples generated by our model. We provide these results as a proof of concept that SOLO can be used in contour detection task. Tuning of training and post-processing will likely improve performance, but the main message here is that SOLO serves well as a general technique for dense and arbitrary instance prediction tasks.

Figure 6: Visualization of SOLO for instance contour detection. The model is trained on COCO dataset with ResNet-50-FPN. Each instance contour is shown in a different color.

5 Decoupled SOLO

Given a predefined grid number $S$, our SOLO head outputs $S^2$ channel maps. However, this prediction is somewhat redundant, since in most cases objects are located sparsely in the image and it is unlikely that so many instances are present. In this section, we further introduce an equivalent and significantly more efficient variant of the vanilla SOLO, termed Decoupled SOLO, shown in Figure 7.

(a) Vanilla head
(b) Decoupled head
Figure 7: Decoupled SOLO head. $F$ is the input feature. Dashed arrows denote convolutions. $k = i \cdot S + j$. ‘$\otimes$’ denotes element-wise multiplication.

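In Decoupled SOLO, the original output tensor $M \in \mathbb{R}^{H \times W \times S^2}$ is replaced with two output tensors $X \in \mathbb{R}^{H \times W \times S}$ and $Y \in \mathbb{R}^{H \times W \times S}$, corresponding to the two axes respectively. Thus, the output space is decreased from $H \times W \times S^2$ to $H \times W \times 2S$. For an object located at grid location $(i, j)$, the vanilla SOLO segments its mask at the $k$-th channel of output tensor $M$, where $k = i \cdot S + j$. In Decoupled SOLO, by contrast, the mask prediction of that object is defined as the element-wise multiplication of two channel maps:

$$M_k = X_j \otimes Y_i,$$

where $X_j$ and $Y_i$ are the $j$-th and $i$-th channel maps of $X$ and $Y$ after the sigmoid operation.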
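The decoupled mask lookup can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes the two branch outputs are stored as arrays of shape (H, W, S) holding raw (pre-sigmoid) scores, that grid indices are zero-based, and that the vanilla head indexes its channels as k = i * S + j. The function names `vanilla_channel` and `decoupled_mask` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def vanilla_channel(i, j, S):
    """Channel index used by the vanilla head for grid cell (i, j)."""
    return i * S + j


def decoupled_mask(X, Y, i, j):
    """Soft mask for grid cell (i, j): sigmoid(X_j) * sigmoid(Y_i).

    X, Y: arrays of shape (H, W, S) with raw scores (assumed layout).
    Returns an (H, W) array with values strictly between 0 and 1.
    """
    return sigmoid(X[:, :, j]) * sigmoid(Y[:, :, i])


# Toy sizes chosen for illustration only.
H, W, S = 8, 8, 20
rng = np.random.default_rng(0)
X = rng.standard_normal((H, W, S))
Y = rng.standard_normal((H, W, S))

i, j = 3, 7
k = vanilla_channel(i, j, S)       # channel the vanilla head would use
mask = decoupled_mask(X, Y, i, j)  # same grid cell, decoupled head
```

Because the mask for any cell is assembled on demand from one channel of each branch, only the channels of grid cells that pass the category-score threshold need to be multiplied at inference time.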



                AP    AP_50  AP_75  AP_S  AP_M  AP_L
Vanilla SOLO    35.8  57.1   37.8   15.0  38.7  53.6
Decoupled SOLO  35.8  57.2   37.7   16.3  39.1  52.2
Table 7: Vanilla head vs. Decoupled head. The models are trained with the “3×” schedule and evaluated on val2017.

We conduct experiments using the same hyper-parameters as vanilla SOLO. As shown in Table 7, Decoupled SOLO achieves the same overall mask AP as vanilla SOLO (35.8), with only small differences in the other metrics. This indicates that Decoupled SOLO is an efficient variant of SOLO that is equivalent in accuracy. Note that, as the output space is largely reduced, Decoupled SOLO needs considerably less GPU memory during training.
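As a rough worked example of the reduction, consider only the size of the mask output tensor for a single grid number $S$. The value $S = 20$ is illustrative, and how much of the total training memory this tensor accounts for is not stated here, so the ratio should be read as a bound on the head's output rather than on overall memory use:

```latex
\frac{H \times W \times S^{2}}{H \times W \times 2S} = \frac{S}{2},
\qquad
S = 20 \;\Rightarrow\; \frac{400}{40} = 10 .
```

The saving therefore grows linearly with the grid number: finer grids benefit more from decoupling.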

6 SOLO for Instance Contour Detection

Our framework can easily be extended to instance contour detection by changing the optimization target of the mask branch. We first convert the ground-truth masks in MS COCO into instance contours using OpenCV’s findContours function [25], and then use the binary contours to optimize the mask branch in parallel with the semantic category branch. Here we use Focal Loss to optimize the contour detection; all other settings are the same as in the instance segmentation baseline.
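The target construction and loss can be sketched as follows. This is a hedged illustration, not the training code: to stay dependency-free it swaps OpenCV's border-following routine for a simple NumPy 4-neighbour boundary test, which may mark slightly different pixels. The focal loss uses alpha = 0.25 and gamma = 2.0 with mean reduction; these values and the function names are assumptions, since the settings used for contour detection are not given here.

```python
import numpy as np


def mask_to_contour(mask):
    """Mark boundary pixels of a binary instance mask.

    A foreground pixel is on the contour if at least one of its
    4-neighbours is background (outside the image counts as background).
    Returns a uint8 array with the same shape as `mask`.
    """
    m = mask.astype(bool)
    p = np.pad(m, 1, constant_values=False)
    # True where the up, down, left and right neighbours are all foreground.
    interior = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    return (m & ~interior).astype(np.uint8)


def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Per-pixel binary focal loss, averaged over all pixels."""
    p = 1.0 / (1.0 + np.exp(-logits))
    p_t = np.where(targets == 1, p, 1.0 - p)            # prob. of true class
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    loss = -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12)
    return float(loss.mean())


# Toy ground-truth mask: a 3x3 square inside a 5x5 image.
mask = np.zeros((5, 5), dtype=np.uint8)
mask[1:4, 1:4] = 1
contour = mask_to_contour(mask)  # ring of boundary pixels, centre removed

# Confident correct predictions give a much smaller loss than wrong ones.
good_logits = np.where(contour == 1, 5.0, -5.0)
loss_good = binary_focal_loss(good_logits, contour)
loss_bad = binary_focal_loss(-good_logits, contour)
```

Contour pixels are a small fraction of each mask, so the target is heavily imbalanced; the focal term down-weights the many easy background pixels, which is one plausible reason for preferring it in this setting.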

Figure 8: Visualization of instance segmentation results using the Res-101-FPN backbone. The model is trained on the COCO dataset, achieving a mask AP of 37.8 on COCO test-dev.

7 Conclusion

In this work we have developed a direct instance segmentation framework, termed SOLO, achieving competitive accuracy compared against the de facto instance segmentation method, Mask R-CNN. Our proposed model is end-to-end trainable and can directly map a raw input image to the desired instance masks with constant inference time, eliminating the need for the grouping post-processing as in bottom-up methods or the bounding-box detection and RoI operations in top-down approaches.

By introducing the new notion of ‘instance categories’, for the first time, we are able to reformulate instance mask prediction into a much simplified classification task, making instance segmentation significantly simpler than all current approaches. We have showcased two instance-level recognition tasks, namely instance segmentation and instance contour detection with the proposed SOLO. Given the simplicity, flexibility, and strong performance of SOLO, we hope that our SOLO can serve as a cornerstone for many instance-level recognition tasks.

Acknowledgements and Declaration of Conflicting Interests. Chunhua Shen and his employer received no financial support for the research, authorship, and/or publication of this article. Thanks to Chong Xu at ByteDance AI Lab for technical support, and to Enze Xie at Hong Kong University for constructive discussions.


  • [1] Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. YOLACT: Real-time instance segmentation. In Proc. IEEE Int. Conf. Comp. Vis., 2019.
  • [2] Xinlei Chen, Ross Girshick, Kaiming He, and Piotr Dollár. TensorMask: A foundation for dense object segmentation. In Proc. IEEE Int. Conf. Comp. Vis., 2019.
  • [3] Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
  • [4] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proc. IEEE Int. Conf. Comp. Vis., 2017.
  • [5] Bert De Brabandere, Davy Neven, and Luc Van Gool. Semantic instance segmentation with a discriminative loss function. arXiv:1708.02551, 2017.
  • [6] Naiyu Gao, Yanhu Shan, Yupei Wang, Xin Zhao, Yinan Yu, Ming Yang, and Kaiqi Huang. SSAP: Single-shot instance segmentation with affinity pyramid. In Proc. IEEE Int. Conf. Comp. Vis., 2019.
  • [7] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. In Proc. IEEE Int. Conf. Comp. Vis., 2017.
  • [8] Zhaojin Huang, Lichao Huang, Yongchao Gong, Chang Huang, and Xinggang Wang. Mask scoring R-CNN. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2019.
  • [9] Tao Kong, Fuchun Sun, Huaping Liu, Yuning Jiang, and Jianbo Shi. FoveaBox: Beyond anchor-based object detector. arXiv:1904.03797, 2019.
  • [10] Yi Li, Haozhi Qi, Jifeng Dai, Xiangyang Ji, and Yichen Wei. Fully convolutional instance-aware semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
  • [11] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
  • [12] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In Proc. Eur. Conf. Comp. Vis., 2014.
  • [13] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proc. IEEE Int. Conf. Comp. Vis., 2017.
  • [14] Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev, and Jason Yosinski. An intriguing failing of convolutional neural networks and the CoordConv solution. In Proc. Advances in Neural Inf. Process. Syst., 2018.
  • [15] Shu Liu, Jiaya Jia, Sanja Fidler, and Raquel Urtasun. Sequential grouping networks for instance segmentation. In Proc. IEEE Int. Conf. Comp. Vis., 2017.
  • [16] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2018.
  • [17] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
  • [18] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proc. Int. Conf. 3D Vision, 2016.
  • [19] Alejandro Newell, Zhiao Huang, and Jia Deng. Associative embedding: End-to-end learning for joint detection and grouping. In Proc. Advances in Neural Inf. Process. Syst., 2017.
  • [20] David Novotný, Samuel Albanie, Diane Larlus, and Andrea Vedaldi. Semi-convolutional operators for instance segmentation. In Proc. Eur. Conf. Comp. Vis., 2018.
  • [21] Pedro H. O. Pinheiro, Ronan Collobert, and Piotr Dollár. Learning to segment object candidates. In Proc. Advances in Neural Inf. Process. Syst., 2015.
  • [22] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv:1804.02767, 2018.
  • [23] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In Proc. Advances in Neural Inf. Process. Syst., 2015.
  • [24] Konstantin Sofiiuk, Olga Barinova, and Anton Konushin. AdaptIS: Adaptive instance selection network. In Proc. IEEE Int. Conf. Comp. Vis., 2019.
  • [25] Satoshi Suzuki et al. Topological structural analysis of digitized binary images by border following. Computer vision, graphics, and image processing, 1985.
  • [26] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In Proc. IEEE Int. Conf. Comp. Vis., 2019.
  • [27] Enze Xie, Peize Sun, Xiaoge Song, Wenhai Wang, Xuebo Liu, Ding Liang, Chunhua Shen, and Ping Luo. PolarMask: Single shot instance segmentation with polar representation. arXiv:1909.13226, 2019.