SOLO and SOLOv2 for instance segmentation, ECCV 2020 & NeurIPS 2020.
We present a new, embarrassingly simple approach to instance segmentation in images. Compared to many other dense prediction tasks, e.g., semantic segmentation, it is the arbitrary number of instances that has made instance segmentation much more challenging. In order to predict a mask for each instance, mainstream approaches either follow the 'detect-then-segment' strategy as used by Mask R-CNN, or predict category masks first and then use clustering techniques to group pixels into individual instances. We view the task of instance segmentation from a completely new perspective by introducing the notion of "instance categories", which assigns categories to each pixel within an instance according to the instance's location and size, thus nicely converting instance mask segmentation into a classification-solvable problem. Now instance segmentation is decomposed into two classification tasks. We demonstrate a much simpler and more flexible instance segmentation framework with strong performance, achieving on par accuracy with Mask R-CNN and outperforming recent single-shot instance segmenters in accuracy. We hope that this very simple and strong framework can serve as a baseline for many instance-level recognition tasks besides instance segmentation.
SOLO: Segmenting Objects by Locations
Instance segmentation is challenging because it requires the correct separation of all objects in an image while also semantically segmenting each instance at the pixel level. Objects in an image belong to a fixed set of semantic categories, but the number of instances varies. As a result, semantic segmentation can be easily formulated as a dense per-pixel classification problem, while it is challenging to predict instance labels directly following the same paradigm.
To overcome this obstacle, recent instance segmentation methods can be categorized into two groups, i.e., top-down and bottom-up paradigms. The former approach, namely ‘detect-then-segment’, first detects bounding boxes and then segments the instance mask in each bounding box. The latter approach learns an affinity relation, assigning an embedding vector to each pixel, by pushing away pixels belonging to different instances and pulling close pixels in the same instance. A grouping post-processing step is then needed to separate instances. Both paradigms are step-wise and indirect: they either rely heavily on accurate bounding box detection or depend on per-pixel embedding learning and the grouping processing.
In contrast, we aim to directly segment instance masks, under the supervision of full instance mask annotations instead of masks in boxes or additional pixel pairwise relations. We start by rethinking a question: What are the fundamental differences between object instances in an image? Take the challenging MS COCO dataset  for example. There are in total objects in the validation subset, of object pairs have center distance greater than pixels. As for the rest of object pairs, of them have size ratio greater than 1.5. Here we do not consider the few cases like two objects in ‘’ shape. To conclude, in most cases two instances in an image either have different center locations or have different object sizes. This observation makes one wonder whether we could directly distinguish instances by the center locations and object sizes?
In the closely related field of semantic segmentation, the dominant paradigm now leverages a fully convolutional network (FCN) to output dense predictions with $N$ channels. Each output channel is responsible for one of the semantic categories (including background). Semantic segmentation aims to distinguish different semantic categories. Analogously, in this work, we propose to distinguish object instances in the image by introducing the notion of “instance categories”, i.e., the quantized center locations and object sizes, which enables segmenting objects by locations, thus the name of our method, SOLO.
The core idea of our proposed SOLO is to separate object instances by locations and sizes.
Locations An image can be divided into a grid of $S \times S$ cells, thus leading to $S^2$ center location classes. According to the coordinates of the object center, an object instance is assigned to one of the grid cells, as its center location category. Unlike DeepMask and TensorMask, which pack the mask into the channel axis, we encode center location categories as the channel axis, similar to the semantic categories in semantic segmentation. Each output channel is responsible for one of the center location categories, and the corresponding channel map should predict the instance mask of the object belonging to that category. Thus, structural geometric information is naturally preserved in the spatial matrix with dimensions of height by width.
In essence, an instance category approximates the location of the object center of an instance. Thus, classifying each pixel into its instance category is equivalent to predicting the object center from each pixel using regression. The importance of converting the location prediction task into classification rather than regression is that, with classification, it is much more straightforward to model a varying number of instances using a fixed number of channels, without relying on post-processing such as grouping or learning embeddings.
Sizes To distinguish instances with different object sizes, we employ the feature pyramid network (FPN), so as to assign objects of different sizes to different levels of feature maps, as the object size classes. Thus, all the object instances are separated regularly, enabling us to classify objects by “instance categories”. Note that FPN was designed for the purpose of detecting objects of different sizes in an image.
In the sequel, we empirically show that FPN is one of the core components of our method and has a profound impact on the segmentation performance, especially when objects of varying sizes are present.
With the proposed SOLO framework, we are able to optimize the network in an end-to-end fashion for the instance segmentation task using mask annotations solely, and perform pixel-level instance segmentation free of the restrictions of local box detection and pixel grouping. Note that most instance segmentation methods to date need box annotations as one of the supervision signals. For the first time, we demonstrate a very simple instance segmentation approach achieving on par results to the dominant ‘detect-then-segment’ method Mask R-CNN, on the challenging COCO dataset with diverse scenes and semantic classes. Additionally, we showcase the generality of our framework via the task of instance contour detection: by viewing the instance edge contours as a one-hot binary mask, SOLO can generate reasonable instance contours with almost no modification. The proposed SOLO only needs to solve two pixel-level classification tasks, analogous to semantic segmentation. Thus it may be possible to borrow some of the recent advances in semantic segmentation for improving SOLO. Essentially, SOLO converts coordinate regression into classification by discrete quantization. One benefit of doing so is the avoidance of the heuristic coordinate normalization and log-transformation typically used in detectors such as YOLO. The embarrassing simplicity and strong performance of the proposed SOLO method may predict its application to a wide range of instance-level recognition tasks.
We review some instance segmentation works that are closest to ours.
Top-down Instance Segmentation. Methods that segment object instances within a priori bounding boxes fall into the typical top-down paradigm. FCIS assembles the position-sensitive score maps within the regions of interest (ROIs) generated by a region proposal network (RPN) to predict instance masks. Mask R-CNN extends the Faster R-CNN detector by adding a branch for segmenting the object instances within the detected bounding boxes. Based on Mask R-CNN, PANet further enhances the feature representation to improve the accuracy, and Mask Scoring R-CNN adds a mask-IoU branch that predicts the quality of the predicted mask and uses it to score the masks, improving performance. TensorMask adopts the dense sliding window paradigm to segment the instance in the local window for each pixel with a predefined number of windows and scales. In contrast to the top-down methods above, our SOLO is totally box-free, thus not being restricted by (anchor) box locations and scales, and naturally benefits from the inherent advantages of FCNs.
Bottom-up Instance Segmentation. This category of approaches generates instance masks by grouping the pixels into an arbitrary number of object instances present in an image. We briefly review several recent approaches. In one line of work, pixels are grouped into instances using a learned associative embedding. A discriminative loss function learns pixel-level instance embedding efficiently, by pushing away pixels belonging to different instances and pulling close pixels in the same instance. SGN decomposes the instance segmentation problem into a sequence of sub-grouping problems. SSAP learns a pixel-pair affinity pyramid, the probability that two pixels belong to the same instance, and sequentially generates instances by a cascaded graph partition. Typically, bottom-up methods lag behind top-down methods in accuracy, especially on datasets with diverse scenes and semantic classes. Instead of exploiting pixel pairwise relations and pixel grouping, SOLO learns directly from the instance mask annotations alone during training, and predicts instance masks and semantic categories end-to-end without grouping post-processing. In this sense, our proposed SOLO is a direct end-to-end instance segmentation approach.
Direct Instance Segmentation. To our knowledge, no prior methods train directly with mask annotations alone and predict instance masks and semantic categories in one shot without the need for grouping post-processing. Several recently proposed methods may be viewed as following a ‘semi-direct’ paradigm. AdaptIS first predicts point proposals, and then sequentially generates the mask for the object located at the detected point proposal. PolarMask proposes to use the polar representation to encode masks and transforms per-pixel mask prediction into distance regression. Neither needs bounding boxes for training, but they are either step-wise or founded on a compromise, e.g., a coarse parametric representation of masks. Our SOLO takes an image as input and directly outputs instance masks and corresponding class probabilities, in a fully convolutional, box-free and grouping-free paradigm. Our simple network can be optimized end-to-end without the need for box supervision. To predict, the network directly maps an input image to masks for each individual instance, relying on neither intermediate operators like RoI feature cropping, nor grouping post-processing.
Given an arbitrary image, an instance segmentation system needs to determine whether there are instances of semantic objects; and if present, the system returns the segmentation mask. The central idea of the SOLO framework is to reformulate instance segmentation as two simultaneous problems: category-aware prediction and instance-aware mask generation. Concretely, our system divides the input image into a uniform grid, i.e., $S \times S$. If the center of an object falls into a grid cell, that grid cell is responsible for 1) predicting the semantic category as well as 2) segmenting that object instance.
For each grid cell, SOLO predicts a $C$-dimensional output to indicate the semantic class probabilities, where $C$ is the number of classes. These probabilities are conditioned on the grid cell. If we divide the input image into $S \times S$ grids, the output space will be $S \times S \times C$, as shown in Figure 2 (top). This design is based on the assumption that each cell of the $S \times S$ grid must belong to one individual instance, thus only belonging to one semantic category. During inference, the $C$-dimensional output indicates the class probability for each object instance.
In parallel with the semantic category prediction, each positive grid cell will also generate the corresponding instance mask. For an input image $I$, if we divide it into $S \times S$ grids, there will be at most $S^2$ predicted masks in total. We explicitly encode these masks at the third dimension (channel) of a 3D output tensor. Specifically, the instance mask output will have dimension $H_I \times W_I \times S^2$. The $k$-th channel will be responsible for segmenting the instance at grid $(i, j)$, where $k = i \cdot S + j$ (with $i$ and $j$ zero-based). We also show an equivalent and more efficient implementation in Section 5. To this end, a one-to-one correspondence is established between the semantic category and class-agnostic mask (Figure 2).
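The row-major correspondence between grid cells and mask channels can be sketched in a few lines. This is only an illustration: the function names are hypothetical and not taken from the original implementation, and the grid size used in the usage note is an arbitrary example.

```python
from typing import Tuple


def cell_to_channel(i: int, j: int, S: int) -> int:
    """Map zero-based grid cell (row i, column j) to mask channel k = i * S + j."""
    if not (0 <= i < S and 0 <= j < S):
        raise ValueError("grid indices must lie in [0, S)")
    return i * S + j


def channel_to_cell(k: int, S: int) -> Tuple[int, int]:
    """Invert the mapping: i = k // S and j = k % S."""
    if not 0 <= k < S * S:
        raise ValueError("channel index must lie in [0, S * S)")
    return divmod(k, S)
```

For example, with an assumed grid of S = 5, cell (2, 3) maps to channel 13, and channel 13 maps back to cell (2, 3).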
A direct approach to predicting the instance mask is to adopt fully convolutional networks, like the FCNs used in semantic segmentation. However, conventional convolutional operations are spatially invariant to some degree. Spatial invariance is desirable for some tasks such as image classification, as it introduces robustness. Here, on the contrary, we need a model that is spatially variant, or more precisely, position sensitive, since our segmentation masks are conditioned on the grid cells and must be separated by different feature channels.
Our solution is very simple: at the beginning of the network, we directly feed normalized pixel coordinates to the network, inspired by the ‘CoordConv’ operator. Specifically, we create a tensor of the same spatial size as the input that contains normalized pixel coordinates. This tensor is then concatenated to the input features and passed to the following layers. By simply giving the convolution access to its own input coordinates, we add the spatial functionality to the conventional FCN model. It should be noted that CoordConv is not the only choice. For example, the semi-convolutional operators may be competent, but we employ CoordConv for its simplicity and ease of implementation. If the original feature tensor is of size $H \times W \times D$, the size of the new tensor becomes $H \times W \times (D+2)$, in which the last two channels are $x$-$y$ pixel coordinates. For more information on CoordConv, we refer readers to the original CoordConv paper.
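The coordinate concatenation can be illustrated with a minimal NumPy sketch. It assumes a channels-last (H, W, D) feature tensor and assumes that coordinates are normalized to the range [-1, 1]; both the function name and that range are choices made for this illustration rather than details confirmed by the text above.

```python
import numpy as np


def add_coord_channels(features: np.ndarray) -> np.ndarray:
    """Append normalized x and y coordinate channels to an (H, W, D) float tensor."""
    h, w, _ = features.shape
    # Assumed normalization range: [-1, 1] along each axis.
    ys = np.linspace(-1.0, 1.0, h, dtype=features.dtype)
    xs = np.linspace(-1.0, 1.0, w, dtype=features.dtype)
    x_grid, y_grid = np.meshgrid(xs, ys)  # both of shape (H, W)
    coords = np.stack([x_grid, y_grid], axis=-1)  # (H, W, 2)
    return np.concatenate([features, coords], axis=-1)  # (H, W, D + 2)
```

Applied to a tensor of shape (4, 6, 8), the function returns a tensor of shape (4, 6, 10) whose last two channels hold the x and y coordinates.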
Forming Instance Segmentation. In SOLO, the category prediction and the corresponding mask are naturally associated by their reference grid cell, i.e., $k = i \cdot S + j$. Based on this, we can directly form the final instance segmentation result for each grid. The raw instance segmentation results are generated by gathering all grid results. Finally, non-maximum suppression (NMS) is used to obtain the final instance segmentation results. No other post-processing operations are needed.
We now present the networks utilized in our SOLO framework. SOLO attaches to a convolutional backbone. We use FPN, which generates a pyramid of feature maps of different sizes with a fixed number of channels (usually 256-d) for each level. These maps are used as input for each prediction head: semantic category and instance mask. Weights for the head are shared across different levels. The grid number may vary across pyramid levels. Only the last conv is not shared in this scenario.
To demonstrate the generality and effectiveness of our approach, we instantiate SOLO with multiple architectures. The differences include: (a) the backbone architecture used for feature extraction, (b) the network head for computing the instance segmentation results, and (c) the training loss function used to optimize the model. Most of the experiments are based on the head architecture shown in Figure 3. We also utilize different variants to further study the generality. We note that our instance segmentation heads have a straightforward structure. More complex designs have the potential to improve performance but are not the focus of this work.
For the category prediction branch, the network needs to give the object category probability for each of the $S \times S$ grid cells. Specifically, grid cell $(i, j)$ is considered a positive sample if it falls into the center region of any ground truth mask; otherwise it is a negative sample. Center sampling is effective in recent works on object detection [26, 9], and here we also utilize a similar technique for mask category classification. Given the mass center $(c_x, c_y)$, width $w$, and height $h$ of the ground truth mask, the center region is controlled by constant scale factors $\epsilon$: $(c_x, c_y, \epsilon w, \epsilon h)$. With our setting of $\epsilon$, there are on average 3 positive samples for each ground truth mask.
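The center sampling rule can be sketched as follows. This is a simplified reading, not the authors' code: it assumes the center region is an axis-aligned box of size (scale × width) by (scale × height) centered at the mass center, marks every grid cell that this box touches as positive, and uses a default scale factor of 0.2 purely as a placeholder. The function and parameter names are hypothetical.

```python
import numpy as np


def positive_cells(cx, cy, w, h, img_w, img_h, S, eps=0.2):
    """Return a boolean (S, S) array marking grid cells touched by the center region.

    (cx, cy) is the mask's mass center and (w, h) its width and height, in pixels.
    eps is an assumed placeholder for the center-region scale factor.
    """
    half_w, half_h = 0.5 * eps * w, 0.5 * eps * h
    # Convert the corners of the center region from pixels to grid indices.
    left = int(np.clip((cx - half_w) / img_w * S, 0, S - 1))
    right = int(np.clip((cx + half_w) / img_w * S, 0, S - 1))
    top = int(np.clip((cy - half_h) / img_h * S, 0, S - 1))
    bottom = int(np.clip((cy + half_h) / img_h * S, 0, S - 1))
    labels = np.zeros((S, S), dtype=bool)
    labels[top:bottom + 1, left:right + 1] = True  # rows index y, columns index x
    return labels
```

Under these assumptions a small object activates a single cell, while a large object activates a small block of cells around its center, so the number of positive samples per object stays low.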
Besides the label for instance category, we also have a binary segmentation mask for each positive sample. Since there are $S^2$ grids, we also have $S^2$ output masks for each image. For each positive sample, the corresponding target binary mask will be annotated. One may be concerned that the order of masks will impact the mask prediction branch; however, we show that the simplest row-major order works well for our method.
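We define our training loss function as follows:
$$L = L_{cate} + \lambda L_{mask}, \tag{1}$$
where $L_{cate}$ is the conventional Focal Loss for semantic category classification. $L_{mask}$ is the loss for mask prediction:
$$L_{mask} = \frac{1}{N_{pos}} \sum_{k} \mathbb{1}_{\{p^*_{i,j} > 0\}} \, d_{mask}(m_k, m^*_k). \tag{2}$$
Here indices $i = \lfloor k / S \rfloor$ and $j = k \bmod S$, if we index the grid cells (instance category labels) from left to right and top to down. $N_{pos}$ denotes the number of positive samples, and $p^*$ and $m^*$ represent the category and mask target respectively. $\mathbb{1}$ is the indicator function, being 1 if $p^*_{i,j} > 0$ and 0 otherwise.
In our implementation, we have compared different implementations of $d_{mask}(\cdot, \cdot)$: Binary Cross Entropy (BCE), Focal Loss and Dice Loss. Finally, we employ Dice Loss for its effectiveness and stability in training. $\lambda$ in Equation (1) is set to 3. The Dice Loss is defined as
$$L_{Dice} = 1 - D(p, q), \tag{3}$$
where $D$ is the dice coefficient, which is defined as
$$D(p, q) = \frac{2 \sum_{x,y} \left( p_{x,y} \cdot q_{x,y} \right)}{\sum_{x,y} p_{x,y}^2 + \sum_{x,y} q_{x,y}^2}. \tag{4}$$
Here $p_{x,y}$ and $q_{x,y}$ refer to the value of the pixel located at $(x, y)$ in the predicted soft mask $p$ and the ground truth mask $q$. We give further comparisons of the loss functions in the experimental section.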
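As a concrete illustration of the Dice Loss, the sketch below computes one minus a Dice coefficient between a predicted soft mask and a binary ground truth mask. It assumes the common formulation with squared terms in the denominator, 2·Σ(p·q) / (Σp² + Σq²), and adds a small constant for numerical stability; that constant and the function name are assumptions of this example.

```python
import numpy as np


def dice_loss(pred, target, eps=1e-6):
    """Dice loss 1 - D(p, q) for a soft mask `pred` and a binary mask `target`.

    eps is an assumed stabilizer that avoids division by zero on empty masks.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    intersection = np.sum(pred * target)
    denominator = np.sum(pred ** 2) + np.sum(target ** 2)
    return 1.0 - 2.0 * intersection / (denominator + eps)
```

The loss is close to 0 when the prediction matches the target and equals 1 when the two masks do not overlap at all, regardless of how many background pixels the image contains.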
The inference of SOLO is very straightforward. Given an input image, we forward it through the backbone network and FPN, and obtain the category score $p_{i,j}$ at grid $(i, j)$ and the corresponding masks $m_k$, where $k = i \cdot S + j$ since we usually keep row-major order. We first apply a confidence threshold to filter out predictions with low confidence. We then select the top-scoring masks and feed them into the NMS operation. To convert the predicted soft masks to binary masks, we binarize them with a fixed threshold. Finally, we keep the top-scoring instance masks for evaluation.
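The inference steps can be sketched as a small post-processing routine. This is an illustrative sketch only: it uses a plain greedy mask-IoU NMS, and every numeric default below (score threshold, number of candidates, binarization threshold, NMS IoU threshold, number of kept masks) is an assumed placeholder rather than a value taken from the text. The function names are hypothetical.

```python
import numpy as np


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union > 0 else 0.0


def postprocess(scores, soft_masks, score_thr=0.1, pre_nms_top=500,
                mask_thr=0.5, nms_iou=0.5, keep_top=100):
    """scores: (K,) best class score per grid cell; soft_masks: (K, H, W) in [0, 1]."""
    scores = np.asarray(scores, dtype=np.float64)
    # 1) Drop low-confidence cells, 2) sort by score, 3) keep the top candidates.
    idx = np.flatnonzero(scores > score_thr)
    idx = idx[np.argsort(-scores[idx])][:pre_nms_top]
    kept_ids, kept_masks = [], []
    for k in idx:
        mask = soft_masks[k] > mask_thr  # binarize the soft mask
        # Greedy NMS: skip a mask that overlaps a higher-scoring kept mask.
        if all(mask_iou(mask, m) <= nms_iou for m in kept_masks):
            kept_ids.append(int(k))
            kept_masks.append(mask)
        if len(kept_ids) == keep_top:
            break
    return kept_ids, kept_masks
```

The returned indices are mask channel indices, so the grid cell and class score of each surviving instance can be looked up through the row-major correspondence.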
| Method | Backbone | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|
| Mask R-CNN | Res-101-FPN | 35.7 | 58.0 | 37.8 | 15.5 | 38.1 | 52.4 |
| Mask R-CNN* | Res-50-FPN | 36.8 | 59.2 | 39.3 | 17.1 | 38.7 | 52.1 |
| Mask R-CNN* | Res-101-FPN | 38.3 | 61.2 | 40.8 | 18.2 | 40.6 | 54.1 |
” schedule (72 epochs) for better results. Mask R-CNN* is the improved version in .
We present experimental results on the MS COCO instance segmentation track, and report lesion and sensitivity studies by evaluating on the 5k split. For our main results, we report COCO mask AP on the test-dev split, which has no public labels and is evaluated on the evaluation server.
SOLO is trained with stochastic gradient descent (SGD). We use synchronized SGD over 8 GPUs with a total of 16 images per mini-batch (2 images per GPU). Unless otherwise specified, all models are trained for 36 epochs; the initial learning rate is divided by 10 at the 27th epoch and again at the 33rd epoch. Weight decay and momentum are used. All models are initialized from ImageNet pre-trained weights. We use scale jitter where the shorter image side is randomly sampled from 640 to 800 pixels.
We compare SOLO to the state-of-the-art methods in instance segmentation on MS COCO test-dev in Table 1. SOLO with ResNet-101 achieves a mask AP of 37.8%, the state of the art among existing two-stage instance segmentation methods such as Mask R-CNN. SOLO outperforms all previous one-stage methods, including TensorMask. With a DCN-101 backbone, SOLO further achieves 40.4 AP, which is much better than the currently dominant approaches on the COCO instance segmentation task. SOLO outputs are visualized in Figure 8, which shows that SOLO achieves good results even under challenging conditions.
We show the network outputs generated by the grid cells (Figure 4). Sub-figure $(i, j)$ indicates the soft mask prediction results generated by the corresponding mask channel (after sigmoid). Here we can see that different instances activate at different mask prediction channels. By explicitly segmenting instances at different positions, SOLO converts the instance segmentation problem into a position-aware classification task.
Only one instance will be activated at each grid, and one instance may be predicted by multiple adjacent mask channels. During inference, we use NMS to suppress these redundant masks.
Grid number. We compare the impact of the grid number on performance with a single output feature map, as shown in Table 2. The feature is generated by merging the C3, C4, and C5 outputs of ResNet (stride = 8). To our surprise, a coarse grid can already achieve 27.2 AP on the challenging MS COCO dataset. SOLO gets 29.0 AP when increasing the grid number to 24. These results indicate that our single-scale SOLO can be applicable to some scenarios where object scales do not vary much. However, the single-scale models largely lag behind the pyramid model, demonstrating the importance of FPN in dealing with multi-scale prediction.
Multi-level Prediction. From Table 2 we can see that our single-scale SOLO struggles in segmenting multi-scale objects. In this ablation, we show that this issue can be largely resolved with multi-level prediction of FPN . Beginning from the ablations of Table 2, we use five FPN pyramids to segment objects of different scales (Table 3). Scales of ground-truth masks are explicitly used to assign them to the levels of the pyramid. Based on our multi-level prediction, we further achieve 35.8 AP. As expected, the segmentation performance over all the metrics has been largely improved.
CoordConv. Another important component that facilitates our SOLO paradigm is the spatially variant convolution (CoordConv). As shown in Table 4, the standard convolution already has the spatially variant property to some extent, which is in accordance with earlier observations. When the convolution is given access to its own input coordinates through concatenated extra coordinate channels, our method enjoys 3.6 absolute AP gains. Two or more CoordConvs do not bring noticeable improvement. This suggests that a single CoordConv already enables the predictions to be well spatially variant/position sensitive.
Loss function. Table 5 compares different loss functions for our mask optimization branch. The methods include conventional Binary Cross Entropy (BCE), Focal Loss (FL), and Dice Loss (DL). To obtain improved performance, for Binary Cross Entropy we set a mask loss weight of 10 and a pixel weight of 2 for positive samples. The mask loss weight of Focal Loss is set to 20. As shown, the Focal Loss works much better than the ordinary Binary Cross Entropy loss. This is because the majority of pixels of an instance mask are in the background, and the Focal Loss is designed to mitigate the sample imbalance problem by decreasing the loss of well-classified samples. However, the Dice Loss achieves the best results without the need to manually adjust the loss hyper-parameters. Dice Loss views the pixels as a whole object and can establish the right balance between foreground and background pixels automatically. Note that with careful tuning of the balance hyper-parameters and the introduction of other training tricks, the results of Binary Cross Entropy and Focal Loss may be considerably improved. However, the point here is that with Dice Loss, training typically becomes much more stable and more likely to attain good results without using many heuristics.
Alignment in category branch. In the category prediction branch, we must match the convolutional features with spatial size $H \times W$ to $S \times S$. Here, we compare three common implementations: interpolation, adaptive-pool, and region-grid-interpolation.
Interpolation: Directly bilinear interpolating to the target grid size;
Adaptive-pool: Applying a 2D adaptive max-pool over $H \times W$ to $S \times S$;
Region-grid-interpolation: For each grid cell, we use bilinear interpolation conditioned on dense sample points, and aggregate the results with average.
From our observation, there is no noticeable performance gap between these variants (about 0.1 AP), indicating that the alignment process is fairly flexible.
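Of the three alignment variants, the adaptive-pool option is the easiest to illustrate. The sketch below is a plain NumPy version that assumes a channels-last (H, W, D) feature map and uses floor/ceil bin boundaries, which is one common way to define adaptive pooling; it is not claimed to match any particular framework's implementation, and the function name is hypothetical.

```python
import numpy as np


def adaptive_max_pool(features: np.ndarray, S: int) -> np.ndarray:
    """Max-pool an (H, W, D) feature map down to an (S, S, D) grid.

    Bin i along an axis of length n covers floor(i * n / S) up to ceil((i + 1) * n / S).
    """
    h, w, d = features.shape
    out = np.empty((S, S, d), dtype=features.dtype)
    for i in range(S):
        r0, r1 = (i * h) // S, -((-(i + 1) * h) // S)  # floor and ceil
        for j in range(S):
            c0, c1 = (j * w) // S, -((-(j + 1) * w) // S)
            out[i, j] = features[r0:r1, c0:c1].max(axis=(0, 1))
    return out
```

Each output cell then summarizes the features of the image region it is responsible for, which is what the category branch needs in order to predict one class distribution per grid cell.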
Different head depth. In SOLO, instance segmentation is formulated as a pixel-to-pixel task and we exploit the spatial layout of masks by using an FCN. In Figure 5, we compare the different head depths utilized in our work. Changing the head depth from 4 to 7 gives 1.2 AP gains. The results in Figure 5 show that when the depth grows beyond 7, the performance becomes stable. In this paper, we use a depth of 7 in the other experiments.
Previous works (e.g., Mask R-CNN) usually adopt four convolutional layers for mask prediction. In SOLO, the mask is conditioned on the spatial position and we simply attach the coordinates to the beginning of the head. The mask head must have enough representation power to learn such a transformation. For the semantic category branch, the computational overhead is negligible since $S^2 \ll H \times W$.
We also train a smaller version of SOLO designed to push the boundaries of real-time instance segmentation. We use a model with smaller input resolution (shorter image side of 512 instead of 800). Other training and testing parameters are the same between SOLO-512 and SOLO.
With 34.2 mask AP, SOLO-512 achieves a model inference speed of 22.5 FPS, showing that SOLO has potential for real-time instance segmentation applications. The speed is reported on a single V100 GPU by averaging 5 runs.
Figure 6 shows some contour detection examples generated by our model. We provide these results as a proof of concept that SOLO can be used for the contour detection task. Tuning of training and post-processing will likely improve performance, but the main message here is that SOLO serves well as a general technique for dense and arbitrary instance prediction tasks.
Given a predefined grid number $S$, our SOLO head outputs $S^2$ channel maps. However, the prediction is somewhat redundant: in most cases objects are located sparsely in the image, and it is unlikely that so many instances are present in one image. In this section, we further introduce an equivalent and significantly more efficient variant of the vanilla SOLO, termed Decoupled SOLO, shown in Figure 7.
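In Decoupled SOLO, the original output tensor $M \in \mathbb{R}^{H \times W \times S^2}$ is replaced with two output tensors $X \in \mathbb{R}^{H \times W \times S}$ and $Y \in \mathbb{R}^{H \times W \times S}$, corresponding to the two axes respectively. Thus, the output space is decreased from $H \times W \times S^2$ to $H \times W \times 2S$. For an object located at grid location $(i, j)$, the vanilla SOLO segments its mask at the $k$-th channel of output tensor $M$, where $k = i \cdot S + j$. In Decoupled SOLO, by contrast, the mask prediction of that object is defined as the element-wise multiplication of two channel maps:
$$m_k = x_j \otimes y_i,$$
where $x_j$ and $y_i$ are the $j$-th and $i$-th channel maps of $X$ and $Y$ after the sigmoid operation.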
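The decoupled mask construction can be sketched in NumPy as follows. The example assumes channels-last (H, W, S) logit tensors for the two branches, assumes that a sigmoid is applied before the multiplication, and assumes that column index j selects the channel of the first tensor while row index i selects the channel of the second; the names are hypothetical.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def decoupled_mask(x_logits: np.ndarray, y_logits: np.ndarray, i: int, j: int) -> np.ndarray:
    """Soft mask for grid cell (i, j) from two (H, W, S) branch outputs.

    The mask is the element-wise product of the j-th X map and the i-th Y map.
    """
    return sigmoid(x_logits[..., j]) * sigmoid(y_logits[..., i])


def channel_counts(S: int) -> tuple:
    """Number of mask channels for the vanilla head (S * S) and the decoupled head (2 * S)."""
    return S * S, 2 * S
```

For an assumed grid of S = 40, `channel_counts` gives 1600 channels for the vanilla head versus 80 for the decoupled head, which illustrates why the decoupled variant needs less memory.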
We conduct experiments using the same hyper-parameters as vanilla SOLO. As shown in Table 7, Decoupled SOLO achieves the same performance as vanilla SOLO. This indicates that Decoupled SOLO serves as an efficient variant of SOLO that is equivalent in accuracy. Note that, as the output space is largely reduced, Decoupled SOLO needs considerably less GPU memory during training.
Our framework can easily be extended to instance contour detection by changing the optimization target of the mask branch. We first convert the ground-truth masks in MS COCO into instance contours using OpenCV, and then use the binary contours to optimize the mask branch in parallel with the semantic category branch. Here we use Focal Loss to optimize the contour detection; other settings are the same as in the instance segmentation baseline.
In this work we have developed a direct instance segmentation framework, termed SOLO, achieving competitive accuracy compared against the de facto instance segmentation method, Mask R-CNN. Our proposed model is end-to-end trainable and can directly map a raw input image to the desired instance masks with constant inference time, eliminating the need for the grouping post-processing as in bottom-up methods or the bounding-box detection and RoI operations in top-down approaches.
By introducing the new notion of ‘instance categories’, for the first time, we are able to reformulate instance mask prediction into a much simplified classification task, making instance segmentation significantly simpler than all current approaches. We have showcased two instance-level recognition tasks, namely instance segmentation and instance contour detection with the proposed SOLO. Given the simplicity, flexibility, and strong performance of SOLO, we hope that our SOLO can serve as a cornerstone for many instance-level recognition tasks.
Acknowledgements and Declaration of Conflicting Interests Chunhua Shen and his employer received no financial support for the research, authorship, and/or publication of this article. Thanks to Chong Xu at ByteDance AI Lab for technical support; and to Enze Xie at Hong Kong University for constructive discussions.
An intriguing failing of convolutional neural networks and the CoordConv solution. In Proc. Advances in Neural Inf. Process. Syst., 2018.