SOLO: Segmenting Objects by Locations https://arxiv.org/abs/1912.04488
In this work, we aim at building a simple, direct, and fast instance segmentation framework with strong performance. We follow the principle of the SOLO method of Wang et al., "SOLO: Segmenting Objects by Locations". Importantly, we take one step further by dynamically learning the mask head of the object segmenter such that the mask head is conditioned on the location. Specifically, the mask branch is decoupled into a mask kernel branch and a mask feature branch, which are responsible for learning the convolution kernel and the convolved features respectively. Moreover, we propose Matrix NMS (non-maximum suppression) to significantly reduce the inference time overhead due to NMS of masks. Our Matrix NMS performs NMS with parallel matrix operations in one shot, and yields better results. We demonstrate a simple direct instance segmentation system, outperforming a few state-of-the-art methods in both speed and accuracy. A light-weight version of SOLOv2 executes at 31.3 FPS and yields 37.1% mask AP. Moreover, our results in object detection (from our mask byproduct) and panoptic segmentation show the potential to serve as a new strong baseline for many instance-level recognition tasks besides instance segmentation. Code is available at: https://git.io/AdelaiDet
Generic object detection demands the functions of localizing individual objects and recognizing their categories. For representing object locations, the bounding box stands out for its simplicity. Localizing objects using bounding boxes has been extensively explored, including the problem formulation, network architecture, post-processing and all the techniques that focus on optimizing and processing bounding boxes. These tailored solutions largely boost performance and efficiency, thus enabling wide downstream applications. However, bounding boxes are coarse and unnatural. Human vision can effortlessly localize objects by their boundaries. Instance segmentation, i.e., localizing objects using masks, pushes object localization to the limit at the pixel level and opens up opportunities for more instance-level perception and applications. To date, most existing methods deal with instance segmentation from the viewpoint of bounding boxes, i.e., segmenting objects in (anchor) bounding boxes. How to develop pure instance segmentation, including its supporting facilities such as post-processing, is largely unexplored compared to bounding box detection and the instance segmentation methods built on top of it.
In the recently proposed SOLO (hereafter SOLO and SOLOv1 are used interchangeably, referring to the work of Wang et al.), the task of instance segmentation is formulated as two sub-tasks of pixel-level classification, solvable using standard FCNs, thus dramatically simplifying the formulation of instance segmentation. SOLO takes an image as input and directly outputs instance masks and corresponding class probabilities, in a fully convolutional, box-free and grouping-free paradigm. As such, the research focus is shifted to how to generate better object masks. We need to develop techniques focusing on masks rather than boxes, to boost the performance as well as to accelerate the inference. In this work, we improve SOLO from these two aspects: mask learning and mask NMS.
We first introduce a dynamic scheme, which enables dynamically segmenting objects by locations. Specifically, the mask learning can be divided into two parts: convolution kernel learning and feature learning. When classifying the pixels into different location categories, the classifiers are predicted by the network and conditioned on the input. As for feature learning, as envisioned in SOLO, more techniques developed in semantic segmentation could be applied to boost the performance. Inspired by the semantic FPN, we construct a unified and high-resolution mask feature representation for instance-aware segmentation. A step-by-step derivation of mask learning from SOLOv1 to SOLOv2 is shown in Section 3.
We further propose an efficient and effective Matrix NMS algorithm. As a post-processing step for suppressing duplicate predictions, non-maximum suppression (NMS) serves as an integral part of state-of-the-art object detection systems. Take the widely adopted multi-class NMS for example. For each class, the predictions are sorted in descending order by confidence. Then, for each prediction, all other highly overlapped predictions are removed. These sequential and recursive operations result in non-negligible latency. For mask NMS, the drawback is magnified: compared to bounding boxes, it takes more time to compute the IoU of each mask pair, leading to a large overhead. We address this problem by introducing Matrix NMS, which performs NMS with parallel matrix operations in one shot. Our Matrix NMS outperforms the existing NMS and its variants in both accuracy and speed. As a result, Matrix NMS processes 500 masks in less than 1 ms in a simple PyTorch implementation, and outperforms the recently proposed Fast NMS by 0.4% AP.
With these improvements, SOLOv2 outperforms SOLOv1 by 1.9% AP while being 33% faster. The Res-50-FPN SOLOv2 achieves 38.8% mask AP at 18 FPS on the challenging MS COCO dataset, evaluated on a single V100 GPU card. A light-weight version of SOLOv2 executes at 31.3 FPS and yields 37.1% mask AP. Interestingly, although the concept of the bounding box is thoroughly eliminated in our method, our bounding box byproduct, obtained by directly converting each predicted mask to its bounding box, yields 42.4% AP for bounding box object detection, which even surpasses many state-of-the-art, highly engineered object detection methods.
We believe that, with simple, fast and sufficiently strong solutions, instance segmentation should be an advanced alternative to the widely used object bounding box detection, and that SOLOv2 may play an important role in it. We foresee wide applications for it.
Here we review some recent work closest to ours.
Instance segmentation is a challenging task, as it requires instance-level and pixel-level predictions simultaneously. The existing approaches can be summarized into three categories. Top-down methods [20, 12, 27, 13, 6, 2, 3, 38] solve the problem from the perspective of object detection, i.e., detecting first and then segmenting the object in the box. In particular, the recent methods of [3, 38, 35] build on the anchor-free object detector FCOS, showing promising performance. Bottom-up methods [29, 9, 26, 10] view the task as a label-then-cluster problem, e.g., learning per-pixel embeddings and then clustering them into groups. The latest direct method, SOLO, aims at dealing with instance segmentation directly, without dependence on box detection or embedding learning. In this work, we inherit the core design of SOLO and further explore direct instance segmentation solutions.
We specifically compare our method with the recent YOLACT. YOLACT learns a group of coefficients which are normalized to $[-1, 1]$ for each anchor box. During inference, it first performs bounding box detection and then uses the predicted boxes to crop the assembled masks. BlendMask improves YOLACT, achieving a better balance between accuracy and speed.
In contrast, our method evolves from SOLO by directly decoupling the original mask prediction into kernel learning and feature learning. No anchor box is needed. No normalization is needed. No bounding box detection is needed. We directly map the input image to the desired object classes and object masks. Both training and inference are much simpler. As a result, our proposed framework is much simpler, yet achieves significantly better performance (6% AP better at a comparable speed); our best model achieves 41.7% AP vs. YOLACT's best 31.2% AP.
In traditional convolution layers, the learned convolution kernels stay fixed and are independent of the input, e.g., the weights are the same for every image and every location of the image. Some previous works explore the idea of bringing more flexibility into traditional convolutions. Spatial Transformer Networks predict a global parametric transformation to warp the feature map, allowing the network to adaptively transform feature maps conditioned on the input. The dynamic filter is proposed to actively predict the parameters of the convolution filters. It applies dynamically generated filters to an image in a sample-specific way. Deformable Convolutional Networks dynamically learn the sampling locations by predicting the offsets for each image location. We bring the dynamic scheme into instance segmentation and enable learning instance segmenters by locations.
NMS is widely adopted in many computer vision tasks and has become an essential component of object detection systems. Several recent works improve on the traditional NMS. They can be divided into two groups, aiming either at improving accuracy or at speeding up. Instead of applying hard removal to duplicate predictions according to a threshold, Soft-NMS decreases the confidence scores of neighbors according to their overlap with higher-scoring predictions. The detection accuracy is slightly improved over the traditional NMS, but inference is slow due to the sequential operations. Adaptive NMS applies a dynamic suppression threshold to each instance, which is tailored for pedestrian detection in a crowd. To accelerate inference, Fast NMS decides in parallel which predictions are kept and which are discarded. Note that it speeds up at the cost of performance deterioration. Different from the previous methods, our Matrix NMS addresses the issues of hard removal and sequential operations at the same time. As a result, the proposed Matrix NMS is able to process 500 masks in less than 1 ms in a simple PyTorch implementation, which is negligible compared with the time of network evaluation, and yields 0.4% AP better than Fast NMS.
The core idea of the SOLOv1 framework is to segment objects by locations. The input image is conceptually divided into $S \times S$ grids. If the center of an object falls into a grid cell, that grid cell is responsible for predicting the semantic category as well as assigning the per-pixel location categories. There are two branches: the category branch and the mask branch. The category branch predicts the semantic categories, while the mask branch segments the object instance. Concretely, the category branch outputs an $S \times S \times C$ shaped tensor, where $C$ is the number of object classes. The mask branch generates an output tensor $M \in \mathbb{R}^{H \times W \times S^2}$. The $k^{th}$ channel is responsible for segmenting the instance at grid $(i, j)$, where $k = i \cdot S + j$.
We zoom in to show what happens in the last layer of the mask branch. The last layer is a $1 \times 1$ convolution layer which takes the feature $F \in \mathbb{R}^{H \times W \times E}$ as input and produces $S^2$ output channels, i.e., the tensor $M$. The convolution kernel is $G \in \mathbb{R}^{1 \times 1 \times E \times S^2}$. The operation can be written as:
$$M = F \ast G. \quad (1)$$
This layer can be viewed as $S^2$ classifiers. Each classifier is responsible for classifying whether the pixels belong to its location category.
As discussed in SOLOv1, the prediction $M$ is somewhat redundant, as in most cases the objects are located sparsely in the image. This means that only a small part of the $S^2$ classifiers actually functions during a single inference. Decoupled SOLO addresses this issue by decoupling the $S^2$ classifiers into two groups of $S$ classifiers, corresponding to horizontal and vertical location categories respectively. Thus the output space is decreased from $H \times W \times S^2$ to $H \times W \times 2S$.
From another perspective, since the output $M$ is redundant and the feature $F$ is fixed, why not directly learn the convolution kernel $G$? In this way, we can simply pick the valid ones from the predicted $S^2$ classifiers and perform the convolution dynamically. The number of model parameters also decreases. What is more, as the predicted kernel is generated dynamically conditioned on the input, it benefits from flexibility and an adaptive nature. Additionally, each of the $S^2$ classifiers is conditioned on the location. This is in accordance with the core idea of segmenting objects by locations and goes a step further by predicting the segmenters by locations. We illustrate the detailed method in Section 4.1.
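The redundancy argument can be made concrete with a small NumPy sketch. It is only an illustration under assumed toy sizes: the names `valid_cells`, `M_full` and `M_valid` are hypothetical, and the kernel here is random rather than predicted by a network. It shows that a $1 \times 1$ convolution is a per-pixel matrix product, and that evaluating only the kernels of occupied grid cells reproduces exactly the corresponding channels of the full output.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, E, S = 8, 10, 16, 4            # toy sizes, assumed for illustration

F = rng.standard_normal((H, W, E))   # mask feature F
G = rng.standard_normal((E, S * S))  # 1x1 kernel G, one column per grid cell

# SOLOv1 style: a 1x1 convolution is a per-pixel matrix product,
# producing one mask channel for each of the S^2 grid cells.
M_full = F @ G                       # shape (H, W, S^2)

# Dynamic alternative: evaluate only the classifiers of occupied cells.
valid_cells = [(1, 2), (3, 0)]       # hypothetical occupied cells (i, j)
k = np.array([i * S + j for i, j in valid_cells])
M_valid = F @ G[:, k]                # shape (H, W, number of valid cells)

same = np.allclose(M_valid, M_full[..., k])
```

With two occupied cells out of sixteen, only two of the sixteen output channels need to be computed, which is the saving the dynamic head exploits.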
We present the details of the proposed SOLOv2 design in this section.
We inherit most of the settings from SOLOv1, e.g., grid cells, multi-level prediction, CoordConv and loss function. Based on that, we introduce the dynamic scheme, in which the original mask branch is decoupled into a mask kernel branch and a mask feature branch, for predicting the convolution kernel and the convolved features respectively. We show the comparisons with SOLOv1 in Figure 2.
The mask kernel branch lies in the prediction head, along with the semantic category branch. The head works on a pyramid of feature maps generated by FPN. Both branches in the head consist of four convolutions for feature extraction and a final convolution for prediction. Weights for the head are shared across different feature map levels. We add spatial functionality to the kernel branch by giving the first convolution access to the normalized coordinates, i.e., concatenating two additional input channels.
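As a rough illustration of the coordinate concatenation described above, the following NumPy sketch appends two normalized coordinate channels to a feature map. The function name `add_coord_channels`, the channels-first layout and the $[-1, 1]$ normalization range are assumptions made for this example, not details stated in the text.

```python
import numpy as np

def add_coord_channels(feat):
    """Append normalized x and y coordinate channels to a (C, H, W) feature map.

    Assumption: coordinates are normalized to [-1, 1] along each axis.
    """
    c, h, w = feat.shape
    ys = np.linspace(-1.0, 1.0, h)
    xs = np.linspace(-1.0, 1.0, w)
    y_grid, x_grid = np.meshgrid(ys, xs, indexing="ij")   # each of shape (H, W)
    coords = np.stack([x_grid, y_grid]).astype(feat.dtype)
    return np.concatenate([feat, coords], axis=0)         # (C + 2, H, W)

feat = np.zeros((256, 5, 7), dtype=np.float32)
out = add_coord_channels(feat)   # the first convolution then sees 258 input channels
```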
For each grid, the kernel branch predicts a $D$-dimensional output to indicate the predicted convolution kernel weights, where $D$ is the number of parameters. For generating the weights of a $1 \times 1$ convolution with $E$ input channels, $D$ equals $E$. As for a $3 \times 3$ convolution, $D$ equals $9E$. These generated weights are conditioned on the locations, i.e., the grid cells. If we divide the input image into $S \times S$ grids, the output space will be $S \times S \times D$. There is no activation function on the output.
The mask feature branch needs to predict instance-aware feature maps $F \in \mathbb{R}^{H \times W \times E}$, where $E$ is the dimension of the mask feature. $F$ will be convolved by the output of the mask kernel branch. If all the predicted weights are used, i.e., $S^2$ classifiers, the output instance masks after the final convolution will be in $H \times W \times S^2$, which is the same as the output space of SOLOv1.
Since the mask feature and mask kernel are decoupled and separately predicted, there are two ways to construct the mask feature branch. We can put it into the head, along with the kernel branch, which means that we predict the mask features for each FPN level. Alternatively, we can predict a unified mask feature representation for all FPN levels. We compare the two implementations experimentally in Section 5.1.3. In the end, we employ the latter for its effectiveness and efficiency.
After repeated stages of $3 \times 3$ convolution, group normalization, ReLU and $2\times$ bilinear upsampling, the FPN features P2 to P5 are merged into a single output at 1/4 scale. The last layer after the element-wise summation consists of a $1 \times 1$ convolution, group normalization and ReLU. The details are illustrated in Figure 3. It should be noted that we feed normalized pixel coordinates to the deepest FPN level (at 1/32 scale), before the convolutions and bilinear upsamplings. The accurate position information provided in this way is important for enabling position sensitivity and predicting instance-aware features.
For each grid cell at $(i, j)$, we first obtain the mask kernel $G_{i,j,:} \in \mathbb{R}^{D}$. Then $G_{i,j,:}$ is convolved with $F$ to get the instance mask. In total, there will be at most $S^2$ masks for each prediction level. Finally, we use the proposed Matrix NMS to get the final instance segmentation results.
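The mask generation step for a single grid cell can be sketched as follows. This is a minimal NumPy illustration under assumptions: the helper name `dynamic_mask`, the channels-last layout and the zero padding for kernels larger than $1 \times 1$ are choices made for the example. The predicted weight vector of length $k \cdot k \cdot E$ is reshaped into a $k \times k$ kernel and applied to the mask feature.

```python
import numpy as np

def dynamic_mask(feature, kernel_weights, ksize=1):
    """Convolve the mask feature (H, W, E) with one predicted kernel.

    kernel_weights has length D = ksize * ksize * E. Returns mask logits of
    shape (H, W). Zero padding (an assumption) keeps the spatial size.
    """
    h, w, e = feature.shape
    assert kernel_weights.size == ksize * ksize * e
    kernel = kernel_weights.reshape(ksize, ksize, e)
    pad = ksize // 2
    padded = np.pad(feature, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h, w))
    for dy in range(ksize):
        for dx in range(ksize):
            # Shifted view of the feature times the weights at this kernel tap.
            out += padded[dy:dy + h, dx:dx + w, :] @ kernel[dy, dx]
    return out

rng = np.random.default_rng(0)
S, E, H, W = 3, 4, 6, 5                     # toy sizes
F = rng.standard_normal((H, W, E))          # mask feature
G = rng.standard_normal((S, S, E))          # kernel branch output for 1x1 kernels (D = E)
mask_logits = dynamic_mask(F, G[1, 2])      # mask logits for grid cell (i, j) = (1, 2)
```

For a $1 \times 1$ kernel the loop runs once and reduces to a per-pixel dot product, which is why the $1 \times 1$ case is particularly cheap.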
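The label assignment and loss functions are the same as in SOLOv1. The training loss function is defined as follows:
$$L = L_{cate} + \lambda L_{mask}, \quad (2)$$
where $L_{cate}$ is the conventional Focal Loss for semantic category classification and $L_{mask}$ is the Dice Loss for mask prediction.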
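For readers unfamiliar with the SOLOv1 loss, the sketch below shows one way the mask term and the weighted sum could be computed. It assumes the SOLOv1 formulation, i.e., a category loss plus $\lambda$ times a Dice loss averaged over positive samples, with $\lambda = 3$. The function names, the `eps` stabilizer and the averaging convention are assumptions for illustration, and the category (focal) loss is taken as a precomputed number.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """1 - Dice coefficient between a soft mask and a binary target mask."""
    inter = np.sum(pred * target)
    denom = np.sum(pred ** 2) + np.sum(target ** 2)
    return 1.0 - 2.0 * inter / (denom + eps)

def total_loss(cate_loss, mask_losses, lam=3.0):
    """L = L_cate + lam * L_mask, with L_mask averaged over positive samples.

    lam = 3 follows the SOLOv1 setting as recalled here (an assumption).
    """
    l_mask = float(np.mean(mask_losses)) if len(mask_losses) else 0.0
    return cate_loss + lam * l_mask
```

A perfect prediction gives a Dice loss close to 0 and a prediction with no overlap gives 1, so the mask term is bounded regardless of object size.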
During inference, we forward the input image through the backbone network and FPN, and obtain the category score $p_{i,j}$ at grid $(i, j)$. We first use a confidence threshold of 0.1 to filter out predictions with low confidence. The corresponding predicted mask kernels are then used to perform convolution on the mask feature. After the sigmoid operation, we use a threshold of 0.5 to convert the predicted soft masks to binary masks. The last step is the Matrix NMS.
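The steps before NMS can be sketched in a few lines of NumPy for a single prediction level. The function name `filter_and_segment`, the flattened tensor layouts, the restriction to $1 \times 1$ kernels, and the default thresholds of 0.1 and 0.5 are assumptions made for this illustration.

```python
import numpy as np

def filter_and_segment(cate_scores, kernels, feature, score_thr=0.1, mask_thr=0.5):
    """Pre-NMS inference for one level (illustrative sketch).

    cate_scores: (S*S, C) class probabilities per grid cell.
    kernels:     (S*S, E) predicted 1x1 kernels per grid cell.
    feature:     (H, W, E) mask feature.
    The threshold defaults are assumed values.
    """
    # Keep only confident (grid cell, class) pairs.
    cell_idx, labels = np.nonzero(cate_scores > score_thr)
    scores = cate_scores[cell_idx, labels]
    # Dynamic 1x1 convolution with the kernels of the surviving cells.
    logits = np.einsum("hwe,ne->nhw", feature, kernels[cell_idx])
    soft_masks = 1.0 / (1.0 + np.exp(-logits))   # sigmoid
    binary_masks = soft_masks > mask_thr
    return labels, scores, binary_masks
```

Filtering before the convolution means the mask feature is only convolved with the kernels of confident cells, so the cost scales with the number of candidate instances rather than with $S^2$.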
Motivation. Our Matrix NMS is motivated by Soft-NMS. Soft-NMS decays the other detection scores as a monotonically decreasing function of their overlaps. By decaying the scores according to IoUs recursively, higher-IoU detections will be eliminated with a minimum score threshold. However, such a process is sequential like the traditional Greedy NMS and cannot be implemented in parallel.
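To see why this process is inherently sequential, consider the following simplified sketch of Soft-NMS with linear decay on a precomputed IoU matrix. The function name and the default values of `iou_thr` and `score_thr` are assumptions for illustration. Each iteration depends on the scores updated by the previous one, which is what prevents a parallel implementation.

```python
import numpy as np

def soft_nms_linear(iou, scores, iou_thr=0.3, score_thr=0.05):
    """Sequential Soft-NMS with linear decay on a symmetric IoU matrix.

    Returns the decayed scores. Predictions whose score falls below
    score_thr stop being processed. Threshold defaults are assumed values.
    """
    scores = np.asarray(scores, dtype=float).copy()
    remaining = list(range(len(scores)))
    while remaining:
        # Pick the highest-scoring remaining prediction ...
        best = max(remaining, key=lambda idx: scores[idx])
        remaining.remove(best)
        # ... and decay the remaining ones that overlap it strongly.
        for idx in remaining:
            if iou[best, idx] > iou_thr:
                scores[idx] *= 1.0 - iou[best, idx]
        remaining = [idx for idx in remaining if scores[idx] >= score_thr]
    return scores
```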
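Matrix NMS views this process from another perspective by considering how a predicted mask $m_j$ is suppressed. For $m_j$, its decay factor is affected by: (a) the penalty of each prediction $m_i$ on $m_j$ ($s_i > s_j$), where $s_i$ and $s_j$ are the confidence scores; and (b) the probability of $m_i$ being suppressed. For (a), the penalty of each prediction $m_i$ on $m_j$ can be easily computed by $f(\mathrm{iou}_{i,j})$. For (b), the probability of $m_i$ being suppressed is not so straightforward to compute. However, this probability usually has a positive correlation with the IoUs. So here we directly approximate the probability by the most overlapped prediction on $m_i$ as
$$f(\mathrm{iou}_{\cdot,i}) = \min_{\forall s_k > s_i} f(\mathrm{iou}_{k,i}). \quad (3)$$
To this end, the final decay factor becomes
$$\mathrm{decay}_j = \min_{\forall s_i > s_j} \frac{f(\mathrm{iou}_{i,j})}{f(\mathrm{iou}_{\cdot,i})}, \quad (4)$$
and the updated score is computed by $s_j = s_j \cdot \mathrm{decay}_j$.
We consider the two simplest decreasing functions, denoted as linear
$$f(\mathrm{iou}_{i,j}) = 1 - \mathrm{iou}_{i,j}, \quad (5)$$
and Gaussian
$$f(\mathrm{iou}_{i,j}) = \exp\Big(-\frac{\mathrm{iou}_{i,j}^2}{\sigma}\Big). \quad (6)$$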
Implementation. All the operations in Matrix NMS can be implemented in one shot without recurrence. We first compute an $N \times N$ pairwise IoU matrix for the top $N$ predictions sorted in descending order by score. For binary masks, the IoU matrix can be efficiently computed by matrix operations. Then we get the most overlapping IoUs by a column-wise max on the IoU matrix. Next, the decay factors of all higher-scoring predictions are computed, and the decay factor for each prediction is selected as the most effective one by a column-wise min (Eqn. (4)). Finally, the scores are updated by the decay factor. For usage, we just need thresholding and selecting the top-$k$ scoring masks as the final predictions.
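The steps above can be sketched in NumPy as follows. This is a hedged illustration, not the authors' code: the function names, the default $\sigma = 0.5$, the Gaussian form $\exp(-\mathrm{iou}^2/\sigma)$ and the small clamp that avoids division by zero in the linear case are assumptions. Each prediction's decay is the minimum, over all higher-scoring predictions, of the penalty $f(\mathrm{iou})$ divided by the same function evaluated at the suppressor's own largest overlap.

```python
import numpy as np

def mask_iou_matrix(masks):
    """Pairwise IoU of N binary masks (N, H, W) using one matrix product."""
    flat = masks.reshape(len(masks), -1).astype(np.float64)
    inter = flat @ flat.T                         # (N, N) intersection areas
    areas = flat.sum(axis=1)
    union = areas[:, None] + areas[None, :] - inter
    return inter / np.maximum(union, 1e-12)

def matrix_nms(masks, scores, method="gauss", sigma=0.5):
    """Return decayed scores in the original input order (illustrative sketch)."""
    scores = np.asarray(scores, dtype=np.float64)
    order = np.argsort(-scores)                   # sort descending by score
    # Keep iou[i, j] only where prediction i scores higher than prediction j.
    iou = np.triu(mask_iou_matrix(masks[order]), k=1)
    # Largest overlap of each prediction with any higher-scoring one.
    iou_cmax = iou.max(axis=0)
    compensate = iou_cmax[:, None]                # row i: value for suppressor i
    if method == "gauss":
        decay = np.exp(-(iou ** 2 - compensate ** 2) / sigma)
    else:                                         # linear
        decay = (1.0 - iou) / np.maximum(1.0 - compensate, 1e-12)
    decay = decay.min(axis=0)                     # column-wise min
    out = np.empty_like(scores)
    out[order] = scores[order] * decay
    return out
```

After this call, the final predictions would be obtained by thresholding the decayed scores and keeping the top-scoring masks, as described above. No loop over predictions is needed, so the whole procedure maps directly onto batched matrix operations.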
Figure 4 shows the pseudo-code of Matrix NMS in PyTorch style. In our code base, Matrix NMS is 9 times faster than traditional NMS while being more accurate (Table 7). We show that Matrix NMS serves as a superior alternative to traditional NMS in both accuracy and speed, and can be easily integrated into state-of-the-art detection and segmentation systems.
To evaluate the proposed method SOLOv2, we conduct experiments on three basic tasks, instance segmentation, object detection and panoptic segmentation on MS COCO . We also present experimental results on the recently proposed LVIS dataset, which has more than 1K categories and thus is considerably more challenging.
For instance segmentation, we report lesion and sensitivity studies by evaluating on the COCO 5K val2017 split. We also report COCO mask AP on the test-dev split, which is evaluated on the evaluation server.
SOLOv2 is trained with stochastic gradient descent (SGD). We use synchronized SGD over 8 GPUs with a total of 16 images per mini-batch. Unless otherwise specified, all models are trained for 36 epochs (i.e., the $3\times$ schedule) with an initial learning rate of 0.01, which is then divided by 10 at the 27th and again at the 33rd epoch. A weight decay of 0.0001 and a momentum of 0.9 are used. All models are initialized from ImageNet pre-trained weights. We use scale jitter, where the shorter image side is randomly sampled from 640 to 800 pixels.
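The step schedule described above can be written as a small helper. This is a sketch under assumptions: the function name is hypothetical, the base rate of 0.01 is taken as the assumed initial value, and the epoch index is treated as 1-based, with the drop taking effect from the milestone epoch onward.

```python
def learning_rate(epoch, base_lr=0.01, milestones=(27, 33), gamma=0.1):
    """Step schedule: multiply the rate by gamma once each milestone is reached.

    Assumptions: epoch is 1-based and the drop applies from the milestone
    epoch onward, so epochs 1 to 26 use base_lr.
    """
    passed = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** passed
```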
We compare SOLOv2 to the state-of-the-art methods in instance segmentation on MS COCO test-dev in Table 1. SOLOv2 with ResNet-101 achieves a mask AP of 39.7%, which is much better than SOLOv1 and other state-of-the-art instance segmentation methods. Our method shows its superiority especially on large objects (e.g., +5.0 $AP_L$ over Mask R-CNN).
We also provide the speed-accuracy trade-off on COCO to compare with some dominant instance segmenters (Figure 1). We show our models with ResNet-50, ResNet-101, ResNet-DCN-101 and two light-weight versions described in Section 5.1.3. The proposed SOLOv2 outperforms a range of state-of-the-art algorithms, both in accuracy and speed. The running time is tested on our local machine, with a single V100 GPU, PyTorch 1.2 and CUDA 10.0. We download code and pre-trained models to test the inference time for each model on the same machine.
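Table 1 (excerpt, mask AP on COCO test-dev):
| Method | Backbone | AP | AP50 | AP75 | APS | APM | APL |
| Mask R-CNN | Res-101-FPN | 35.7 | 58.0 | 37.8 | 15.5 | 38.1 | 52.4 |
Table 2 (excerpt, box AP on COCO test-dev):
| Method | Backbone | AP | AP50 | AP75 | APS | APM | APL |
| Faster R-CNN | Res-101-FPN | 36.2 | 59.1 | 39.0 | 18.2 | 39.0 | 48.2 |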
We visualize what SOLOv2 learns from two aspects: mask feature behavior and the final outputs after being convolved by the dynamically learned convolution kernels.
We visualize the outputs of the mask feature branch. We use a model which has 64 output channels (i.e., $E = 64$ for the last feature map prior to mask prediction) for easy visualization. Here we plot each of the 64 channels (recall that the channel spatial resolution is $H \times W$), as shown in Figure 5.
There are two main patterns. First and foremost, the mask features are position-aware: they show an obvious behavior of scanning the objects in the image horizontally and vertically. Interestingly, this is in accordance with the target in the decoupled-head SOLO: segmenting objects by their independent horizontal and vertical location categories. The other obvious pattern is that some feature maps are responsible for activating all the foreground objects, e.g., the one in the white boxes.
The final outputs are shown in Figure 8. Different objects are in different colors. Our method shows promising results in diverse scenes. It is worth pointing out that the details at the boundaries are segmented well, especially for large objects. We compare against Mask R-CNN on object details in Figure 6. Our method shows great advantages.
We investigate and compare the following four aspects in our methods: (a) the kernel shape used to perform convolution on mask features; (b) CoordConvs used in the mask kernel branch and mask feature branch; (c) unified mask feature representation and (d) the effectiveness of Matrix NMS.
Kernel shape. We consider the kernel shape from two aspects: the number of input channels and the kernel size. The comparisons are shown in Table 3. A $1 \times 1$ conv shows equivalent performance to a $3 \times 3$ conv. Changing the number of input channels from 128 to 256 attains a 0.4% AP gain. When it grows beyond 256, the performance becomes stable. In this work, we set the number of input channels to 256 in all other experiments.
Effectiveness of coordinates. Since our method segments objects by locations, or more specifically, learns the object segmenters by locations, the position information is very important. For example, if the mask kernel branch is unaware of the positions, objects with the same appearance may have the same predicted kernel, leading to the same output mask. On the other hand, if the mask feature branch is unaware of the position information, it would not know how to assign the pixels to different feature channels in the order that matches the mask kernel. As shown in Table 4, the model achieves 36.3% AP without explicit coordinate input. The results are reasonably good because CNNs can implicitly learn absolute position information from the commonly used zero-padding operation, as revealed by Islam et al. (ICLR 2020). The pyramid zero-paddings in our mask feature branch should have contributed considerably. However, the implicitly learned position information is coarse and inaccurate. When giving the convolution access to its own input coordinates by concatenating extra coordinate channels, our method enjoys a 1.5% absolute AP gain.
Unified mask feature representation. For mask feature learning, we have two options: to learn the feature in the head separately for each FPN level, or to construct a unified representation. For the former, we follow the SOLOv1 implementation and use seven $3 \times 3$ convs to predict the mask features. For the latter, the detailed implementation is illustrated in Figure 3. We compare these two implementations in Table 5. As shown, the unified representation achieves better results, especially for medium and large objects. This is easy to understand: in SOLOv1, large objects are assigned to high-level feature maps of low spatial resolution, leading to coarse boundary prediction.
Dynamic vs. decoupled. The dynamic head and the decoupled head both serve as efficient variants of the SOLO head. We compare the results in Table 6. All the settings are the same except the head type, which means that for the dynamic head we use the separate features as above. The dynamic head achieves 0.7% AP better than the decoupled head. We believe that the gains come from the dynamic scheme, which learns the kernel weights dynamically, conditioned on the input.
Matrix NMS. Our Matrix NMS can be implemented entirely in parallel. Table 7 presents the speed and accuracy comparison of Hard-NMS, Soft-NMS, Fast NMS and our Matrix NMS. Since all methods need to compute the IoU matrix, we pre-compute it for a fair comparison. The speed reported here is that of the NMS process alone, excluding the computation of IoU matrices.
Table 7 (excerpt):
| Method | Iterative? | AP |
| Our Matrix NMS | ✗ | 36.6 |
Hard-NMS and Soft-NMS are widely used in current object detection and segmentation models. Unfortunately, both methods are recursive and consume a substantial time budget (22 ms). Our Matrix NMS needs less than 1 ms and is almost cost free. Here we also show the performance of Fast NMS, which utilizes matrix operations but incurs a performance penalty. To conclude, our Matrix NMS shows its advantages in both speed and accuracy.
Real-time setting. We design two light-weight models for different purposes. 1) Speed priority: the number of convolution layers in the prediction head is reduced to two and the input shorter side is 448. 2) Accuracy priority: the number of convolution layers in the prediction head is reduced to three and the input shorter side is 512. Moreover, deformable convolution is used in the backbone and the last layer of the prediction head. We train both models with the $3\times$ schedule, with the shorter side randomly sampled from [352, 512]. Results are shown in Table 8. SOLOv2 not only pushes the state of the art, but is also ready for real-time applications.
Although our instance segmentation solution removes the dependence of bounding box prediction, we are able to produce the 4D object bounding box from each instance mask. In Table 2, we compare the generated box detection performance with other object detection methods on COCO. All models are trained on the train2017 subset and tested on test-dev.
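Converting a predicted mask into its box is a simple operation. The sketch below shows one plausible way to do it in NumPy. The function name `mask_to_box`, the inclusive pixel-index convention and the `(x_min, y_min, x_max, y_max)` ordering are assumptions for illustration.

```python
import numpy as np

def mask_to_box(mask):
    """Tightest axis-aligned box (x_min, y_min, x_max, y_max) around a binary mask.

    Coordinates are inclusive pixel indices. Returns None for an empty mask.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

mask = np.zeros((6, 8), dtype=bool)
mask[2:5, 3:7] = True
box = mask_to_box(mask)
```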
As shown in Table 2, our detection results outperform most methods, especially for objects of large scales, demonstrating the effectiveness of SOLOv2 in object box detection. Similar to instance segmentation, we also plot the speed/accuracy trade-off curve for different methods in Figure 7. We show our models with ResNet-101 and the two light-weight versions described above. The plot reveals that the bounding box performance of SOLOv2 beats most recent object detection methods in both accuracy and speed. Here we emphasize that our results are directly generated from the off-the-shelf instance masks, without any box-based supervised training or engineering.
An observation from Figure 7 is as follows. If one does not care much about the cost difference between mask annotation and bounding box annotation, it appears to us that there is no reason to use box detectors for downstream applications, considering the fact that our SOLOv2 beats most modern detectors in both accuracy and speed.
The proposed SOLOv2 can be easily extended to panoptic segmentation by adding a semantic segmentation branch, analogous to the mask feature branch. We use the annotations of the COCO 2018 panoptic segmentation task. All models are trained on the train2017 subset and tested on val2017. We use the same strategy as in Panoptic-FPN to combine instance and semantic results. As shown in Table 10, our method achieves state-of-the-art results and outperforms other recent box-free methods by a large margin. All methods listed use the same backbone (ResNet50-FPN) except SSAP (ResNet101) and Panoptic-DeepLab (Xception-71).
Table 10 (excerpt):
| Method | PQ | PQ^Th | PQ^St |
| SSAP (Res101) | 36.5 | - | - |
| Pano-DeepLab (Xcept71) | 39.7 | 43.9 | 33.2 |
LVIS  is a recently proposed dataset for long-tail object segmentation, which has more than 1000 object categories. In LVIS, each object instance is segmented with a high-quality mask that surpasses the annotation quality of the relevant COCO dataset. Since LVIS is new, only the results of Mask R-CNN are publicly available. Therefore we only compare SOLOv2 against the Mask R-CNN baseline.
Table 9 reports the performance on the rare (1-10 images), common (11-100), and frequent (> 100) subsets, as well as the overall AP. Our SOLOv2 outperforms the baseline method by about 1% AP. For large-size objects ($AP_L$), our SOLOv2 achieves a 6.7% AP improvement, which is consistent with the results on the COCO dataset.
In this work, we have significantly improved SOLOv1 instance segmentation from three aspects.
We have proposed to learn adaptive, dynamic convolutional kernels for the mask prediction head, conditioned on the location, leading to a much more compact yet more powerful head design, and achieving better results with reduced FLOPs.
We have re-designed the mask features as shown in Figure 3, which predicts more accurate boundaries. Especially for the medium-size and large-size objects, better mask APs are attained compared against SOLOv1.
Moreover, unlike box NMS in object detection, for instance segmentation a bottleneck in inference efficiency is the NMS of masks. Previous works either use box NMS as a surrogate, or speed it up via approximation, which harms mask AP. We have designed a simple and much faster NMS strategy, termed Matrix NMS, for NMS processing of masks, without sacrificing mask AP.
Our experiments on the Microsoft COCO and LVIS datasets demonstrate the superior performance of the proposed SOLOv2 in terms of both accuracy and speed. Being versatile for instance-level recognition tasks, SOLOv2 performs competitively for panoptic segmentation without any modification to the framework. Thanks to its simplicity (being proposal free, anchor free and FCN-like), its strong performance in both accuracy and speed, and its potential for solving many instance-level tasks, we hope that SOLOv2 can be a strong baseline approach to instance recognition and inspire future work, so that its full potential can be exploited, as we believe that there is still much room for improvement.
M. A. Islam, S. Jia, and N. D. B. Bruce. How much position information do convolutional neural networks encode? In Proc. Int. Conf. Learn. Representations, 2020.