TensorMask: A Foundation for Dense Object Segmentation

by   Xinlei Chen, et al.

Sliding-window object detectors that generate bounding-box object predictions over a dense, regular grid have advanced rapidly and proven popular. In contrast, modern instance segmentation approaches are dominated by methods that first detect object bounding boxes, and then crop and segment these regions, as popularized by Mask R-CNN. In this work, we investigate the paradigm of dense sliding-window instance segmentation, which is surprisingly under-explored. Our core observation is that this task is fundamentally different than other dense prediction tasks such as semantic segmentation or bounding-box object detection, as the output at every spatial location is itself a geometric structure with its own spatial dimensions. To formalize this, we treat dense instance segmentation as a prediction task over 4D tensors and present a general framework called TensorMask that explicitly captures this geometry and enables novel operators on 4D tensors. We demonstrate that the tensor view leads to large gains over baselines that ignore this structure, and leads to results comparable to Mask R-CNN. These promising results suggest that TensorMask can serve as a foundation for novel advances in dense mask prediction and a more complete understanding of the task. Code will be made available.




1 Introduction

The sliding-window paradigm, finding objects by looking in each window placed over a dense set of image locations, is one of the earliest and most successful concepts in computer vision [37, 39, 9, 10] and is naturally connected to convolutional networks [20]. However, while today’s top-performing object detectors rely on sliding-window prediction to generate initial candidate regions, a refinement stage is applied to these candidate regions to obtain more accurate predictions, as pioneered by Faster R-CNN [34] and Mask R-CNN [17] for bounding-box object detection and instance segmentation, respectively. This class of methods has dominated the COCO detection challenges [24].

Recently, bounding-box object detectors which eschew the refinement step and focus on direct sliding-window prediction, as exemplified by SSD [27] and RetinaNet [23], have witnessed a resurgence and shown promising results. In contrast, the field has not witnessed equivalent progress in dense sliding-window instance segmentation; there are no direct, dense approaches analogous to SSD / RetinaNet for mask prediction. Why is the dense approach thriving for box detection, yet entirely missing for instance segmentation? This is a question of fundamental scientific interest. The goal of this work is to bridge this gap and provide a foundation for exploring dense instance segmentation.

Figure 1: Selected output of TensorMask, our proposed framework for performing dense sliding-window instance segmentation. We treat dense instance segmentation as a prediction task over structured 4D tensors. In addition to obtaining competitive quantitative results, TensorMask achieves results that are qualitatively reasonable. Observe that both small and large objects are well delineated and more critically overlapping objects are properly handled.
Figure 2: Example results of TensorMask and Mask R-CNN [17] with a ResNet-101-FPN backbone (on the same images as used in Fig. 6 of Mask R-CNN [17]). The results are quantitatively and qualitatively similar, demonstrating that the dense sliding-window paradigm can indeed be effective for the instance segmentation task. We challenge the reader to identify which results were generated by TensorMask. (Footnote 1: In Fig. 2, Mask R-CNN results are on top; TensorMask results are on the bottom.)

Our main insight is that the core concepts for defining dense mask representations, as well as effective realizations of these concepts in neural networks, are both lacking. Unlike bounding boxes, which have a fixed, low-dimensional representation regardless of scale, segmentation masks can benefit from richer, more structured representations. For example, each mask is itself a 2D spatial map, and masks for larger objects can benefit from the use of larger spatial maps. Developing effective representations for dense masks is a key step toward enabling dense instance segmentation.

To address this, we define a set of core concepts for representing masks with high-dimensional tensors, which allows for the exploration of novel network architectures for dense mask prediction. We present and experiment with several such networks in order to demonstrate the merits of the proposed representations. Our framework, called TensorMask, establishes the first dense sliding-window instance segmentation system that achieves results close to those of Mask R-CNN.

The central idea of the TensorMask representation is to use structured 4D tensors to represent masks over a spatial domain. This perspective stands in contrast to prior work on the related task of segmenting class-agnostic object proposals, such as DeepMask [31] and InstanceFCN [7], that used unstructured 3D tensors, in which the mask is packed into the third ‘channel’ axis. The channel axis, unlike the axes representing object position, does not have a clear geometric meaning and is therefore difficult to manipulate. By using a basic channel representation, one misses an opportunity to benefit from using structural arrays to represent masks as 2D entities, analogous to the difference between MLPs [35] and ConvNets [20] for representing 2D images.

Unlike these channel-oriented approaches, we propose to leverage 4D tensors of shape $(V, U, H, W)$, in which both $(H, W)$, representing object position, and $(V, U)$, representing relative mask position, are geometric sub-tensors, i.e., they have axes with well-defined units and geometric meaning w.r.t. the image. This shift in perspective from encoding masks in an unstructured channel axis to using structured geometric sub-tensors enables the definition of novel operations and network architectures. These networks can operate directly on the $(V, U)$ sub-tensor in geometrically meaningful ways, including coordinate transformation, up-/downscaling, and use of scale pyramids.

Enabled by the TensorMask framework, we develop a pyramid structure over a scale-indexed list of 4D tensors, which we call a tensor bipyramid. Analogous to a feature pyramid, which is a list of feature maps at multiple scales, a tensor bipyramid contains a list of 4D tensors with shapes $(2^k V, 2^k U, \frac{1}{2^k} H, \frac{1}{2^k} W)$, where $k \geq 0$ indexes scale. This structure has a pyramidal shape in both the $(H, W)$ and $(V, U)$ geometric sub-tensors, but growing in opposite directions. This natural design captures the desirable property that large objects have high-resolution masks with coarse spatial localization (large $k$) and small objects have low-resolution masks with fine spatial localization (small $k$).

We combine these components into a network backbone and training procedure closely following RetinaNet [23] in which our dense mask predictor extends the original dense bounding box predictor. With detailed ablation experiments, we evaluate the effectiveness of the TensorMask framework and show the importance of explicitly capturing the geometric structure of this task. Finally, we show TensorMask yields similar results to its Mask R-CNN counterpart (see Figs. 1 and 2). These promising results suggest the proposed framework can help pave the way for future research on dense sliding-window instance segmentation.

2 Related Work

Classify mask proposals.

The modern instance segmentation task was introduced by Hariharan et al. [15] (before being popularized by COCO [24]). In their work, the method proposed for this task involved first generating object mask proposals [38, 1], then classifying these proposals [15]. In earlier work, the classify-mask-proposals methodology was used for other tasks. For example, Selective Search [38] and the original R-CNN [12] classified mask proposals to obtain box detections and semantic segmentation results; these methods could easily be applied to instance segmentation. These early methods relied on bottom-up mask proposals computed by pre-deep-learning era methods [38, 1]; our work is more closely related to dense sliding-window methods for mask object proposals as pioneered by DeepMask [31]. We discuss this connection shortly.

Detect then segment.

The now dominant paradigm for instance segmentation involves first detecting objects with a box and then segmenting each object using the box as a guide [8, 40, 21, 17]. Perhaps the most successful instantiation of the detect-then-segment methodology is Mask R-CNN [17], which extended the Faster R-CNN [34] detector with a simple mask predictor. Approaches that build on Mask R-CNN [26, 30, 4] have dominated leaderboards of recent challenges [24, 29, 6]. Unlike in bounding-box detection, where sliding-window [27, 33, 23] and region-based [11, 34] methods have both thrived, in the area of instance segmentation, research on dense sliding-window methods has been missing. Our work aims to close this gap.

Label pixels then cluster.

A third class of approaches to instance segmentation (e.g., [3, 19, 2, 25]) builds on models developed for semantic segmentation [28, 5]. These approaches label each image pixel with a category and some auxiliary information that a clustering algorithm can use to group pixels into object instances. These approaches benefit from improvements on semantic segmentation and natively predict higher-resolution masks for larger objects. Compared to detect-then-segment methods, label-pixels-then-cluster methods lag behind in accuracy on popular benchmarks [24, 29, 6]. Instead of employing fully convolutional models for dense pixel labeling, TensorMask explores the framework of building fully convolutional (i.e., dense sliding-window) models for dense mask prediction, where the output at each spatial location is itself a 2D spatial map.

Dense sliding window methods.

To the best of our knowledge, no prior methods exist for dense sliding-window instance segmentation. The proposed TensorMask framework is the first such approach. The closest methods are for the related task of class-agnostic mask proposal generation, specifically models such as DeepMask [31, 32] and InstanceFCN [7], which apply convolutional neural networks to generate mask proposals in a dense sliding-window manner. Like these approaches, TensorMask is a dense sliding-window model, but it spans a more expressive design space. DeepMask and InstanceFCN can be expressed naturally as class-agnostic TensorMask models, but TensorMask enables novel architectures that perform better. Also, unlike these class-agnostic methods, TensorMask performs multi-class classification in parallel to mask prediction, and thus can be applied to the task of instance segmentation.

3 Tensor Representations for Masks

The central idea of the TensorMask framework is to use structured high-dimensional tensors to represent image content (e.g., masks) in a set of densely sliding windows.

Consider a $V \times U$ window sliding on a feature map of width $W$ and height $H$. It is possible to represent all masks in all sliding-window locations by a tensor of shape $(C, H, W)$, where each mask is parameterized by $C = V \cdot U$ pixels. This is the representation used in DeepMask [31].

The underlying spirit of this representation, however, is in fact a higher-dimensional (4D) tensor with shape $(V, U, H, W)$. The sub-tensor $(V, U)$ represents a mask as a 2D spatial entity. Instead of viewing the channel dimension as a black box into which a mask is arranged, the tensor perspective enables several important concepts for representing dense masks, discussed next.
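
As a minimal sketch of the relation between the two views, the snippet below unpacks a channel-packed tensor with one channel per mask pixel into a nested 4D structure indexed by mask row, mask column, image row and image column. The row-major channel order (channel index equals mask row times window width plus mask column) and the helper name `unpack_channels` are assumptions made for illustration; the text does not state how channels are ordered.

```python
def unpack_channels(t3d, V, U):
    """Turn a (C, H, W) nested list with C == V * U into (V, U, H, W).

    Assumption: channel c stores mask pixel (v, u) with c == v * U + u.
    The (H, W) maps are shared by reference, not copied.
    """
    if len(t3d) != V * U:
        raise ValueError("channel count must equal V * U")
    return [[t3d[v * U + u] for u in range(U)] for v in range(V)]


# Toy example: a 2x3 window sliding over a 1x2 feature map.
V, U, H, W = 2, 3, 1, 2
t3d = [[[c * 10 + x for x in range(W)] for _ in range(H)] for c in range(V * U)]
t4d = unpack_channels(t3d, V, U)
```

Nothing is recomputed here: the 4D view only makes the two mask axes explicit so that later operations can address them as spatial axes.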

3.1 Unit of Length

The unit of length (or simply unit) of each spatial axis is a necessary concept for understanding 4D tensors in our framework. Intuitively, the unit of an axis defines the length of one pixel along it. Different axes can have different units.

The unit of the $H$ and $W$ axes, denoted as $\sigma_{HW}$, can be set as the stride w.r.t. the input image (e.g., res4 of ResNet-50 [18] has $\sigma_{HW} = 16$ image pixels). Analogously, the $V$ and $U$ axes define another 2D spatial domain and have their own unit, denoted as $\sigma_{VU}$. Shifting one pixel along the $V$ or $U$ axis corresponds to shifting $\sigma_{VU}$ pixels on the input image. The unit $\sigma_{VU}$ need not be equal to the unit $\sigma_{HW}$, a property that our models will benefit from.

Defining units is necessary because the interpretation of the tensor shape is ambiguous if units are not specified. For example, $(V, U)$ represents a $V \times U$ window in image pixels if $\sigma_{VU} = 1$ image pixel, but a $2V \times 2U$ window in image pixels if $\sigma_{VU} = 2$ image pixels. The units and how they change due to up/down-scaling operations are central to multi-scale representations (more in §3.6).

3.2 Natural Representation

With the definition of units, we can formally describe the representational meaning of a $(V, U, H, W)$ tensor. In our simplest definition, this tensor represents the windows sliding over $(H, W)$. We call this the natural representation. Denoting $\alpha = \sigma_{VU} / \sigma_{HW}$ as the ratio of units, formally we have:

Natural Representation: For a 4D tensor of shape $(V, U, H, W)$, its value at coordinates $(v, u, y, x)$ represents the mask value at $(y + \alpha v, x + \alpha u)$ in the $V \times U$ window centered at $(y, x)$. (Footnote 2, derivation: on the input image pixels, the center of a sliding window is $(y \cdot \sigma_{HW}, x \cdot \sigma_{HW})$, and a pixel located w.r.t. this window is at $(y \cdot \sigma_{HW} + v \cdot \sigma_{VU}, x \cdot \sigma_{HW} + u \cdot \sigma_{VU})$. Projecting to the $HW$ domain (i.e., normalizing by the unit $\sigma_{HW}$) gives us $(y + \alpha v)$ and $(x + \alpha u)$.)

Here $(v, u, y, x) \in [-\frac{V}{2}, \frac{V}{2}) \times [-\frac{U}{2}, \frac{U}{2}) \times [0, H) \times [0, W)$, where ‘$\times$’ denotes cartesian product. Conceptually, the tensor can be thought of as a continuous function in this domain. For implementation, we must instead rasterize the 4D tensor as a discrete function defined on sampled locations. We assume a sampling rate of one sample per unit, with samples located at integer coordinates (e.g., if $V = 3$, then $v \in \{-1, 0, 1\}$). This assumption allows the same value $V$ to represent both the length of the axis in terms of units (e.g., $3\sigma_{VU}$) and also the number of discrete samples stored for the axis. This is convenient for working with tensors produced by neural networks that are discrete and have lengths.

Fig. 3 (left) illustrates an example when $V = U = 3$ and $\alpha$ is 1. The natural representation is intuitive and easy to parse as the output of a network, but it is not the only possible representation in a deep network, as discussed next.
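
The sampling convention and the coordinate mapping of the natural representation can be sketched in a few lines. The function names below are hypothetical helpers introduced for illustration; the arithmetic follows the definition and footnote derivation above (one sample per unit at integer coordinates, window center at the sliding position, offsets scaled by the ratio of the two units).

```python
def centered_coords(n):
    """Integer sample coordinates in [-n/2, n/2), one sample per unit."""
    return list(range(-(n // 2), n - n // 2))


def natural_to_image(v, u, y, x, sigma_vu, sigma_hw):
    """Image-pixel location referred to by entry (v, u, y, x)."""
    return (y * sigma_hw + v * sigma_vu, x * sigma_hw + u * sigma_vu)


def natural_to_hw(v, u, y, x, sigma_vu, sigma_hw):
    """The same location expressed in units of the (H, W) domain."""
    alpha = sigma_vu / sigma_hw  # ratio of units
    return (y + alpha * v, x + alpha * u)


coords = centered_coords(3)
pixel = natural_to_image(-1, 1, 2, 3, sigma_vu=16, sigma_hw=16)
hw = natural_to_hw(-1, 1, 2, 3, sigma_vu=16, sigma_hw=16)
```

With equal units the ratio is 1, so the entry one step up and one step right of the window center at row 2, column 3 refers to row 1, column 4 of the feature map.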

Figure 3: Left: Natural representation. The sub-tensor at a pixel represents a window centered at this pixel. Right: Aligned representation. The sub-tensor at a pixel represents the values at this pixel in each of the windows overlapping it.

3.3 Aligned Representation

In the natural representation, a sub-tensor $(V, U)$ located at $(y, x)$ represents values at offset pixels $(y + \alpha v, x + \alpha u)$ instead of directly at $(y, x)$. When using convolutions to compute features, preserving pixel-to-pixel alignment between input pixels and predicted output pixels can lead to improvements (this is similar to the motivation for RoIAlign [17]). Next we describe a pixel-aligned representation for dense masks under the tensor perspective.

Formally, we define the aligned representation as:

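Aligned Representation: For a 4D tensor $(\hat V, \hat U, \hat H, \hat W)$, its value at coordinates $(\hat v, \hat u, \hat y, \hat x)$ represents the mask value at $(\hat y, \hat x)$ in the $\hat V \times \hat U$ window centered at $(\hat y - \hat\alpha \hat v, \hat x - \hat\alpha \hat u)$.

Here $\hat\alpha = \hat\sigma_{VU} / \hat\sigma_{HW}$ is the ratio of units in the aligned representation.

In this representation, the sub-tensor $(\hat V, \hat U)$ at pixel $(\hat y, \hat x)$ always describes the values taken at this pixel, i.e., it is aligned. The subspace $(\hat V, \hat U)$ does not represent a single mask, but instead enumerates mask values in all $\hat V \cdot \hat U$ windows that overlap pixel $(\hat y, \hat x)$. Fig. 3 (right) illustrates an example when $\hat V = \hat U = 3$ (nine overlapping windows) and $\hat\alpha$ is 1.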
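
To illustrate what the aligned sub-tensor at one pixel enumerates, the following sketch lists, for each entry of the sub-tensor, the center of the window whose mask value at that pixel is stored there. The helper name is hypothetical, and the integer-coordinate convention is the same one-sample-per-unit assumption used for the natural representation.

```python
def windows_covering_pixel(y_hat, x_hat, V_hat, U_hat, alpha_hat=1):
    """Map each (v_hat, u_hat) to the center of the window whose mask
    value at pixel (y_hat, x_hat) is stored in that entry."""
    vs = range(-(V_hat // 2), V_hat - V_hat // 2)
    us = range(-(U_hat // 2), U_hat - U_hat // 2)
    return {
        (v, u): (y_hat - alpha_hat * v, x_hat - alpha_hat * u)
        for v in vs
        for u in us
    }


centers = windows_covering_pixel(5, 5, V_hat=3, U_hat=3)
```

For a 3 by 3 sub-tensor with unit ratio 1 this yields the nine windows centered on the pixel and its eight neighbors, which matches the nine overlapping windows of the example.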

Note that we denote tensors in the aligned representation as $(\hat V, \hat U, \hat H, \hat W)$ (and likewise for coordinates/units). This is in the spirit of ‘named tensors’ [36] and proves useful.

Our aligned representation is related to the instance-sensitive score maps proposed in InstanceFCN [7]. We prove (in §A.2) that those score maps behave like our aligned representation but with nearest-neighbor interpolation on $(\hat V, \hat U)$, which makes them unaligned. We test this experimentally and show it degrades results severely.

3.4 Coordinate Transformation

We introduce a coordinate transformation between natural and aligned representations, so they can be used interchangeably in a single network. This gives us additional flexibility in the design of novel network architectures.

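For simplicity, we assume units in both representations are the same, i.e., $\sigma_{HW} = \hat\sigma_{HW}$ and $\sigma_{VU} = \hat\sigma_{VU}$, and thus $\alpha = \hat\alpha$ (for the more general case see §A.1). Comparing the definitions of the natural vs. aligned representations, we have the following two relations for $x$ and $u$: $x + \alpha u = \hat x$ and $x = \hat x - \hat\alpha \hat u$. With $\alpha = \hat\alpha$, solving these equations for $\hat x$ and $\hat u$ gives $\hat x = x + \alpha u$ and $\hat u = u$. A similar result holds for $y$ and $v$. So the transformation from the aligned representation ($\hat{\mathcal{F}}$) to the natural representation ($\mathcal{F}$) is:

$$\mathcal{F}(v, u, y, x) = \hat{\mathcal{F}}(v, u, y + \alpha v, x + \alpha u).$$

We call this transform align2nat. Likewise, solving this set of two relations for $x$ and $u$ gives the reverse transform, nat2align: $\hat{\mathcal{F}}(\hat v, \hat u, \hat y, \hat x) = \mathcal{F}(\hat v, \hat u, \hat y - \alpha \hat v, \hat x - \alpha \hat u)$. While all the models presented in this work only use align2nat, we present both cases for completeness.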

Without restrictions on $\alpha$, these transformations may involve indexing a tensor at a non-integer coordinate, e.g., if $x + \alpha u$ is not an integer. Since we only permit integer coordinates in our implementation, we adopt a simple strategy: when the op align2nat is called, we ensure that $\alpha$ is a positive integer. We can satisfy this constraint on $\alpha$ by changing units with up/down-scaling ops, as described next.
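
The two transforms amount to an index shift, which the sketch below spells out on tensors stored as dictionaries keyed by integer coordinates. This is an illustrative reading of the definitions, not the authors' implementation: the dictionary storage, the function signatures, and the choice to return a fill value of zero for coordinates that fall outside the source tensor are all assumptions.

```python
def centered(n):
    """Integer coordinates in [-n/2, n/2), e.g. n = 3 gives -1, 0, 1."""
    return range(-(n // 2), n - n // 2)


def align2nat(f_hat, V, U, H, W, alpha=1, fill=0.0):
    """Natural entry (v, u, y, x) reads aligned entry (v, u, y + a*v, x + a*u)."""
    if not isinstance(alpha, int) or alpha <= 0:
        raise ValueError("alpha must be a positive integer")
    return {
        (v, u, y, x): f_hat.get((v, u, y + alpha * v, x + alpha * u), fill)
        for v in centered(V) for u in centered(U)
        for y in range(H) for x in range(W)
    }


def nat2align(f, V, U, H, W, alpha=1, fill=0.0):
    """Aligned entry (v, u, y, x) reads natural entry (v, u, y - a*v, x - a*u)."""
    if not isinstance(alpha, int) or alpha <= 0:
        raise ValueError("alpha must be a positive integer")
    return {
        (v, u, y, x): f.get((v, u, y - alpha * v, x - alpha * u), fill)
        for v in centered(V) for u in centered(U)
        for y in range(H) for x in range(W)
    }


# Toy aligned tensor whose value encodes its own coordinates.
V = U = 3
H = W = 5
f_hat = {
    (v, u, y, x): 1000 * v + 100 * u + 10 * y + x
    for v in centered(V) for u in centered(U)
    for y in range(H) for x in range(W)
}
f = align2nat(f_hat, V, U, H, W)
back = nat2align(f, V, U, H, W)
```

Away from the borders, applying nat2align after align2nat returns the original aligned values; near the borders the assumed zero fill makes the round trip lossy.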

3.5 Upscaling Transformation

The aligned representation enables the use of coarse $(\hat V, \hat U)$ sub-tensors to create finer $(V, U)$ sub-tensors, which proves quite useful. Fig. 4 illustrates this transformation, which we call up_align2nat and describe next.

The up_align2nat op accepts a $(\hat V, \hat U, \hat H, \hat W)$ tensor as input. The $(\hat V, \hat U)$ sub-tensor is $\lambda\times$ coarser than the desired output (so its unit is $\lambda\times$ bigger). It performs bilinear upsampling, up_bilinear, in the $(\hat V, \hat U)$ domain by $\lambda$, reducing the underlying unit by $\lambda\times$. Next, the align2nat op converts the output into the natural representation. The full up_align2nat op is shown in Fig. 4.

As our experiments demonstrate, the up_align2nat op is effective for generating high-resolution masks without inflating channel counts in preceding feature maps. This in turn enables novel architectures, as described next.

3.6 Tensor Bipyramid

In multi-scale box detection it is common practice to use a lower-resolution feature map to extract larger-scale objects [10, 22]; this is because a sliding window of a fixed size on a lower-resolution map corresponds to a larger region in the input image. This also holds for multi-scale mask detection. However, unlike a box that is always represented by four numbers regardless of its scale, a mask’s pixel size must scale with object size in order to maintain constant resolution density. Thus, instead of always using $V \times U$ units to represent masks of different scales, we propose to adapt the number of mask pixels based on the scale.

Figure 4: The up_align2nat op is defined as a sequence of two ops. It takes an input tensor that has a coarse, $\lambda\times$ lower resolution on $\hat V \hat U$ (so the unit $\hat\sigma_{VU}$ is $\lambda\times$ larger). The op performs upsampling on $\hat V \hat U$ by $\lambda$ followed by align2nat, resulting in an output where $\sigma_{VU} = s$ (where $s$ is the stride).

Consider the natural representation $(V, U, H, W)$ on a feature map of the finest level. Here, the $(H, W)$ domain has the highest resolution (smallest unit). We expect this level to handle the smallest objects, so the $(V, U)$ domain should have the lowest resolution. With reference to this, we build a pyramid that gradually reduces $(H, W)$ and increases $(V, U)$. Formally, we define a tensor bipyramid as:

Tensor bipyramid: A tensor bipyramid is a list of tensors of shapes $(2^k V, 2^k U, \frac{1}{2^k} H, \frac{1}{2^k} W)$, for $k = 0, 1, 2, \dots$, with units $\sigma^{k+1}_{VU} = \sigma^{k}_{VU}$ and $\sigma^{k+1}_{HW} = 2\sigma^{k}_{HW}$, $\forall k$.

Because the units $\sigma^{k}_{VU}$ are the same across all levels, a $2^k V \times 2^k U$ mask has $4^k\times$ more pixels in the input image. In the $(H, W)$ domain, because the units $\sigma^{k}_{HW}$ increase with $k$, the number of predicted masks decreases for larger masks, as desired. Note that the total size of each level is the same (it is $V \cdot U \cdot H \cdot W$). A tensor bipyramid can be constructed using the swap_align2nat operation, described next.

This swap_align2nat op is composed of two steps: first, an input tensor with fine $(\hat H, \hat W)$ and coarse $(\hat V, \hat U)$ is upscaled to $(2^k V, 2^k U, H, W)$ using up_align2nat. Then $(H, W)$ is subsampled to obtain the final shape. The combination of up_align2nat and subsample, shown in Fig. 5, is called swap_align2nat: the units before and after this op are swapped. For efficiency, it is not necessary to compute the intermediate tensor of shape $(2^k V, 2^k U, H, W)$ from up_align2nat, which would be prohibitive. This is because only a small subset of values in this intermediate tensor appear in the final output after subsampling. So although Fig. 5 shows the conceptual computation, in practice we implement swap_align2nat as a single op that only performs the necessary computation and has complexity $O(V \cdot U \cdot H \cdot W)$ regardless of $k$.
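
The shape bookkeeping of the bipyramid is easy to check numerically. The sketch below lists the per-level shapes and units and confirms that every level holds the same number of entries; the function names and the example sizes (a 15 by 15 base window on a 64 by 64 finest feature map with unit 4) are illustrative assumptions.

```python
def bipyramid_shapes(V, U, H, W, num_levels):
    """Shapes (2^k V, 2^k U, H / 2^k, W / 2^k) for k = 0 .. num_levels - 1."""
    shapes = []
    for k in range(num_levels):
        s = 2 ** k
        if H % s or W % s:
            raise ValueError("H and W must be divisible by 2**k")
        shapes.append((s * V, s * U, H // s, W // s))
    return shapes


def bipyramid_units(sigma_vu, sigma_hw, num_levels):
    """Units per level: the mask unit stays fixed, the position unit doubles."""
    return [(sigma_vu, sigma_hw * 2 ** k) for k in range(num_levels)]


shapes = bipyramid_shapes(15, 15, 64, 64, num_levels=3)
sizes = [v * u * h * w for (v, u, h, w) in shapes]
units = bipyramid_units(4, 4, num_levels=3)
```

Each level trades spatial positions for mask pixels at a fixed budget, which is what allows swap_align2nat to run at a cost that does not grow with the level index.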

4 TensorMask Architecture

We now present models enabled by TensorMask representations. These models have a mask prediction head that generates masks in sliding windows and a classification head to predict object categories, analogous to the box regression and classification heads in sliding-window object detectors [27, 23]. Box prediction is not necessary for TensorMask models, but can easily be included.

Figure 5: The swap_align2nat op is defined by two ops. It upscales the input by up_align2nat (Fig. 4), then performs subsample on the $HW$ domain. Note how the op swaps the units between the $VU$ and $HW$ domains. In practice, we implement this op in place so the complexity is independent of $\lambda$.
Figure 6: Baseline mask prediction heads: Each of the four heads shown starts from a feature map (e.g., from a level of an FPN [22]) with an arbitrary channel number $C$. Then a $1 \times 1$ conv layer projects the features into an appropriate number of channels, which form the specified 4D tensor by reshape. The output units of these four heads are the same, and $\sigma_{VU} = \sigma_{HW}$.

4.1 Mask Prediction Heads

Our mask prediction branch attaches to a convolutional backbone. We use FPN [22], which generates a pyramid of feature maps with sizes $(C, \frac{1}{2^k} H, \frac{1}{2^k} W)$ with a fixed number of channels $C$ per level $k$. These maps are used as input for each prediction head: mask, class, and box. Weights for the heads are shared across levels, but not between tasks.

Output representation.

We always use the natural representation (§3.2) as the output format of the network. Any representation (natural, aligned, etc.) can be used in the intermediate layers, but it will be transformed into the natural representation for the output. This standardization decouples the loss definition from network design, making use of different representations simpler. Also, our mask output is class-agnostic, i.e., the window always predicts a single mask regardless of class; the class of the mask is predicted by the classification head. Class-agnostic mask prediction avoids multiplying the output size by the number of classes.

Baseline heads.

We consider a set of four baseline heads, illustrated in Fig. 6. Each head accepts an input feature map of shape $(C, H, W)$ for any $(H, W)$. It then applies a $1 \times 1$ convolutional layer (with ReLU) with the appropriate number of output channels such that reshaping it into a 4D tensor produces the desired shape for the next layer, denoted as ‘conv+reshape’. Fig. 6a and 6b are simple heads that use natural and aligned representations, respectively. In both cases, we use $V \cdot U$ output channels for the $1 \times 1$ conv, followed by align2nat in the latter case. Fig. 6c and 6d are upscaling heads that use the natural and aligned representations, respectively. Their $1 \times 1$ conv has $\lambda^2\times$ fewer output channels than in the simple heads.

In a baseline TensorMask model, one of these four heads is selected and attached to all FPN levels. The output forms a pyramid of $(V, U, \frac{1}{2^k} H, \frac{1}{2^k} W)$, see Fig. 7a. For each head, the output sliding window always has the same unit as the feature map on which it slides: $\sigma^{k}_{VU} = \sigma^{k}_{HW}$ for all levels $k$.

Figure 7: Conceptual comparison between: (a) a feature pyramid with any one of the baseline heads (Fig. 6) attached, and (b) a tensor bipyramid that uses swap_align2nat (Fig. 5). A baseline head on the feature pyramid has $\sigma_{VU} = \sigma_{HW}$ for each level, which implies that masks for large objects and small objects are predicted using the same number of pixels. On the other hand, the swap_align2nat head can keep the mask resolution high (i.e., $\sigma_{VU}$ is the same across levels) despite the $HW$ resolution changes.

Tensor bipyramid head.

Unlike the baseline heads, the tensor bipyramid head (§3.6) accepts a feature map of fine resolution $(H, W)$ at all levels. Fig. 8 shows a minor modification of FPN to obtain these maps. For each of the resulting levels, now all $(C, H, W)$, we first use conv+reshape to produce the appropriate 4D tensor, then run a mask prediction head with swap_align2nat, see Fig. 7b. The tensor bipyramid model is the most effective TensorMask variant explored in this work.

Figure 8: Conversion of FPN feature maps from $(C, \frac{1}{2^k} H, \frac{1}{2^k} W)$ to $(C, H, W)$ for use with tensor bipyramid (see Fig. 7b). For an FPN level $(C, \frac{1}{2^k} H, \frac{1}{2^k} W)$, we apply bilinear interpolation to upsample the feature map by a factor of $2^k$. As the upscaling can be large, we add the finest level feature map to all levels (including the finest level itself), followed by one $3 \times 3$ conv with ReLU.

4.2 Training

Label assignment.

We use a version of the DeepMask assignment rule [31] to label each window. A window satisfying three conditions w.r.t. a ground-truth mask $m$ is positive:

(i) Containment: the window fully contains $m$ and the longer side of $m$, in image pixels, is at least 1/2 of the longer side of the window, that is, $\max(U \cdot \sigma_{VU}, V \cdot \sigma_{VU})$. (Footnote 3: A fallback is used to increase small object recall: masks smaller than the minimum assignable size are assigned to windows of the smallest size.)

(ii) Centrality: the center of $m$’s bounding box is within one unit ($\sigma_{VU}$) of the window center in $\ell_2$ distance.

(iii) Uniqueness: there is no other mask $m' \neq m$ that satisfies the other two conditions.

If $m$ satisfies these three conditions, then the window is labeled as a positive example whose ground-truth mask, object category, and box are given by $m$. Otherwise, the window is labeled as a negative example.
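
The three conditions can be sketched as a small predicate. Masks are represented here by their tight bounding boxes in image pixels, which suffices because a rectangular window contains a mask exactly when it contains the mask's bounding box. The function names, the inclusive comparisons at the boundaries, and the omission of the small-object fallback from the footnote are assumptions of this sketch.

```python
import math


def matches_window(box, center, V, U, sigma_vu):
    """Containment and centrality test for one ground-truth box.

    box: (y0, x0, y1, x1) tight bounding box of the mask, in image pixels.
    center: (cy, cx) window center, in image pixels.
    """
    y0, x0, y1, x1 = box
    cy, cx = center
    win_h, win_w = V * sigma_vu, U * sigma_vu
    contained = (
        cy - win_h / 2 <= y0 and y1 <= cy + win_h / 2
        and cx - win_w / 2 <= x0 and x1 <= cx + win_w / 2
    )
    long_enough = max(y1 - y0, x1 - x0) >= 0.5 * max(win_h, win_w)
    by, bx = (y0 + y1) / 2, (x0 + x1) / 2
    central = math.hypot(by - cy, bx - cx) <= sigma_vu
    return contained and long_enough and central


def assign_label(boxes, center, V, U, sigma_vu):
    """Index of the unique matching mask, or None for a negative window."""
    hits = [i for i, b in enumerate(boxes)
            if matches_window(b, center, V, U, sigma_vu)]
    return hits[0] if len(hits) == 1 else None


# A 15x15 window with unit 16 covers 240x240 image pixels.
good = assign_label([(100, 120, 300, 280)], (200, 200), 15, 15, 16)
too_small = assign_label([(190, 190, 210, 210)], (200, 200), 15, 15, 16)
```

When two masks both pass containment and centrality, the uniqueness condition makes the window negative, so `assign_label` returns `None` in that case as well.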

In contrast to the IoU-based assignment rules for boxes in sliding-window detectors (e.g., RPN [34], SSD [27], RetinaNet [23]), our rules are mask-driven. Experiments show that our rules work well even when using only 1 or 2 window sizes with a single aspect ratio of 1:1, versus, e.g., RetinaNet’s 9 anchors of multiple scales and aspect ratios.


Loss.

For the mask prediction head, we adopt a per-pixel binary classification loss. In our setting, the ground-truth mask inside a sliding window often has a wide margin, resulting in an imbalance between foreground and background pixels. Therefore, instead of using binary cross-entropy, we use focal loss (FL) [23] to address the imbalance. The mask loss of a window is averaged over the pixels in the window (note that in a tensor bipyramid the window size varies across levels), and the total mask loss is averaged over all positive windows (negative windows do not contribute to the mask loss).

For the classification head, we again adopt FL. For box regression, we use a parameter-free $\ell_1$ loss. The total loss is a weighted sum of all task losses.
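
The two-stage averaging of the mask loss can be sketched as follows. The per-pixel term is the standard focal loss of RetinaNet [23]; the exact focal-loss variant and hyper-parameter values used for TensorMask are not given in this copy of the text, so the defaults below (focusing parameter 2, balancing weight 0.25) are RetinaNet's and serve only as placeholders. Function names and the flat-list window format are likewise illustrative.

```python
import math


def focal_loss_pixel(p, target, gamma=2.0, alpha=0.25, eps=1e-12):
    """Standard focal loss for one pixel.

    p: predicted foreground probability; target: 0 or 1.
    gamma and alpha are placeholder values (RetinaNet defaults).
    """
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)  # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def mask_loss(windows, gamma=2.0, alpha=0.25):
    """windows: list of (pred, gt, is_positive); pred and gt are flat lists.

    Average over the pixels of each positive window, then over positive
    windows. Negative windows are skipped.
    """
    per_window = []
    for pred, gt, is_positive in windows:
        if not is_positive:
            continue
        px = [focal_loss_pixel(p, t, gamma, alpha) for p, t in zip(pred, gt)]
        per_window.append(sum(px) / len(px))
    return sum(per_window) / len(per_window) if per_window else 0.0


example = mask_loss(
    [([0.5, 0.5], [1, 0], True), ([0.9], [0], False)],
    gamma=0.0, alpha=0.5,
)
```

Averaging per window first keeps a large window on a coarse bipyramid level from dominating the loss simply because it contains more pixels.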

Implementation details.

Our FPN implementation closely follows [23]; each FPN level is output by four $3 \times 3$ conv layers of $C$ channels with ReLU (instead of one conv in the original FPN [22]). As with the heads, weights are shared across levels, but not between tasks. In addition, we found that averaging (instead of summing [22]) the top-down and lateral connections in FPN improved training stability. We use FPN levels 2 through 7 ($k = 0, \dots, 5$) with 128 channels for the four conv layers in the mask and box branches, and 256 (the same as RetinaNet [23]) for the classification branch. Unless noted, we use ResNet-50 [18].

For training, all models are initialized from ImageNet pre-trained weights. We use scale jitter where the shorter image side is randomly sampled from [640, 800] pixels [16]. Following SSD [27] and YOLO [33], which train models longer (65 and 160 epochs) than [23, 17], we adopt the ‘6×’ schedule [16] (72 epochs), which improves results. The minibatch size is 16 images across 8 GPUs. The base learning rate is 0.02, with linear warm-up [14] of 1k iterations. Other hyper-parameters are kept the same as [13].

4.3 Inference

Inference is similar to dense sliding-window object detectors. We use a single scale of 800 pixels for the shorter image side. Our model outputs a mask prediction, a class score, and a predicted box for each sliding window. Non-maximum suppression (NMS) is applied to the top-scoring predictions using box IoU on the regressed boxes, following the settings in [22]. To convert predicted soft masks to binary masks at the original image resolution, we use the same method and hyper-parameters as Mask R-CNN [17].
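
The suppression step can be sketched as standard greedy NMS on the regressed boxes. This is the generic algorithm, not code from the paper: the IoU threshold of 0.5, the optional `top_k` cutoff and the function names are placeholders, since the actual settings are taken from [22] and are not listed here, and per-class handling is left out.

```python
def box_iou(a, b):
    """IoU of two boxes given as (y0, x0, y1, x1)."""
    ih = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iw = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ih * iw
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5, top_k=None):
    """Greedy NMS; returns indices of kept windows, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Each kept index identifies a sliding window, so its mask prediction and class score are retained together with the box.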

Figure 9: Baseline upscaling heads. Top: the natural upscaling head (a) produces coarse masks, and is ineffective for large $\lambda$. Left: for simple scenes, the unaligned head (b) and aligned head (c) (which use nearest-neighbor and bilinear interpolation, respectively) behave similarly. Right: for overlapping objects the difference is striking: the unaligned head creates severe artifacts.

5 Experiments

We report results on COCO instance segmentation [24]. All models are trained on the 118k train2017 images and tested on the 5k val2017 images. Final results are on test-dev. We use COCO mask average precision (denoted by AP). When reporting box AP, we denote it as $\text{AP}^{\text{bb}}$.

head      AP     AP50   AP75   APS    APM    APL
natural   29.4   52.5   30.2   14.4   31.6   41.4
aligned   29.6   52.6   30.5   15.5   31.9   40.5
Table 1: Simple heads: natural vs. aligned (Fig. 6a vs. 6b) with $V \times U = 15 \times 15$ perform comparably if upscaling is not used.

head                 λ     AP      AP50    AP75
natural              1.5   28.8    52.0    28.9
aligned              1.5   29.7    52.5    30.6
aligned - natural    1.5   +0.9    +0.5    +1.7
natural              3     25.4    48.8    23.7
aligned              3     29.5    52.2    30.3
aligned - natural    3     +4.1    +3.4    +6.6
natural              5     13.5    33.9    9.0
aligned              5     29.1    52.1    29.8
aligned - natural    5     +15.6   +18.2   +20.8

(a) Upscaling heads: natural vs. aligned heads (Fig. 6c vs. 6d). The $V \times U = 15 \times 15$ output is upscaled by $\lambda\times$: conv+reshape uses $\frac{1}{\lambda^2} V U$ output channels as input. The aligned representation has a large gain over its natural counterpart when $\lambda$ is large.

head                 λ     AP      AP50    AP75
nearest              1.5   29.4    52.4    30.1
bilinear             1.5   29.7    52.5    30.6
bilinear - nearest   1.5   +0.3    +0.1    +0.5
nearest              3     28.5    51.3    28.8
bilinear             3     29.5    52.2    30.3
bilinear - nearest   3     +1.0    +0.9    +1.5
nearest              5     25.9    47.8    25.6
bilinear             5     29.1    52.1    29.8
bilinear - nearest   5     +3.2    +4.3    +4.2

(b) Upscaling: bilinear vs. nearest-neighbor interpolation for the aligned head (Fig. 6d). The output has $V \times U = 15 \times 15$. With nearest-neighbor interpolation, the aligned upscaling head is similar to the InstanceFCN [7] head. Bilinear interpolation shows a large gain when $\lambda$ is large.

head                    AP     AP50   AP75   APS    APM    APL
feature pyramid, best   29.7   52.5   30.6   15.1   32.2   40.7
tensor bipyramid        33.8   54.8   35.8   16.1   36.3   47.7
difference              +4.1   +2.3   +5.2   +1.0   +4.1   +7.0

(c) The tensor bipyramid substantially improves results compared to the best baseline head (Tab. 6a, row 2) on a feature pyramid (Fig. 7a).

V×U                AP     AP50   AP75   APS    APM    APL
15×15              33.8   54.8   35.8   16.1   36.3   47.7
15×15, 11×11       35.4   56.5   37.5   16.4   37.9   50.0
difference         +1.6   +1.7   +1.7   +0.3   +1.6   +2.3

(d) Window sizes: extending from one window size (per level) to two increases all AP metrics. Both rows use the tensor bipyramid.
Table 6: Ablations on TensorMask representations on COCO val2017. All variants use ResNet-50-FPN and a 72 epoch schedule.
method backbone aug epochs AP AP50 AP75 APS APM APL
Mask R-CNN [13] R-50-FPN 24 34.9 57.2 36.9 15.4 36.6 50.8
Mask R-CNN, ours R-50-FPN 24 34.9 56.8 36.8 15.1 36.7 50.6
Mask R-CNN, ours R-50-FPN 72 36.8 59.2 39.3 17.1 38.7 52.1
TensorMask R-50-FPN 72 35.5 57.3 37.4 16.6 37.0 49.1
Mask R-CNN, ours R-101-FPN 72 38.3 61.2 40.8 18.2 40.6 54.1
TensorMask R-101-FPN 72 37.3 59.5 39.5 17.5 39.3 51.6
Table 7: Comparison with Mask R-CNN for instance segmentation on COCO test-dev.

5.1 TensorMask Representations

First we explore various tensor representations for masks using V=U=15 (15×15 mask windows) and a ResNet-50-FPN backbone. We report quantitative results in Tab. 6 and show qualitative comparisons in Figs. 2 and 9.

Simple heads.

Tab. 1 compares natural vs. aligned representations with simple heads (Fig. 6a vs. 6b). Both representations perform similarly, with a marginal gap of 0.2 AP. The simple natural head can be thought of as a class-specific variant of DeepMask [31] with an FPN backbone [22] and focal loss [23]. As we aim to use lower-resolution intermediate representations, we explore upscaling heads next.

Upscaling heads.

Tab. 6a compares natural vs. aligned representations with upscaling heads (Fig. 6c vs. 6d). The output size is fixed at 15×15. Given an upscaling factor λ, the conv in Fig. 6 has 225/λ² channels, e.g., 9 channels with λ=5 (225 channels if no upscaling). The difference in accuracy is large for large λ: the aligned variant improves AP by 15.6 points over the natural head (115% relative) when λ=5.
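The channel arithmetic above can be checked with a minimal sketch. The function name `upscale_head_channels` is hypothetical and not taken from the paper's code; it assumes the conv predicts a low-resolution square mask of side 15/λ that is then upscaled by λ to 15×15.

```python
def upscale_head_channels(out_size: int, factor: float) -> int:
    """Channels the conv must emit so that upscaling by `factor`
    yields an out_size x out_size mask (assumes a square mask)."""
    low_res = out_size / factor
    if abs(low_res - round(low_res)) > 1e-9:
        raise ValueError("out_size must be divisible by the upscaling factor")
    low_res = round(low_res)
    # One channel per pixel of the low-resolution mask.
    return low_res * low_res


if __name__ == "__main__":
    for factor in (1, 1.5, 3, 5):
        channels = upscale_head_channels(15, factor)
        print(f"upscaling factor {factor}: {channels} conv channels")
```

Under this assumption the channel count falls from 225 (no upscaling) to 9 at λ=5, which is why large factors are attractive only if accuracy survives the upscaling.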

The visual difference is clear in Fig. 9a (natural) vs. 9c (aligned). The upscale aligned head still produces sharp masks with a large upscaling factor λ. This is critical for the tensor bipyramid, where the mask output grows by 2^k at level k, which is achieved with a large upscaling factor of λ=2^k (e.g., 32); see Fig. 5.


Interpolation.

The tensor view reveals the mask sub-tensor at each location as a 2D spatial entity that can be manipulated. Tab. 6b compares the upscale aligned head with bilinear (default) vs. nearest-neighbor interpolation on this sub-tensor. We refer to this latter variant as unaligned since quantization breaks pixel-to-pixel alignment. The unaligned variant is related to InstanceFCN [7] (see §A.2).

We observe in Tab. 6b that bilinear interpolation yields solid improvements over nearest-neighbor interpolation, especially if the upscaling factor λ is large (+3.2 AP at λ=5). These interpolation methods lead to striking visual differences when objects overlap: see Fig. 9b (unaligned) vs. 9c (aligned).
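To make the difference between the two interpolation modes concrete, the following is a 1D analogue in plain Python, not the paper's implementation. The function names, the pixel-center alignment convention, and the border clamping are all assumptions made for illustration; bilinear interpolation applies the same linear rule along both mask axes.

```python
def upscale_nearest(values, factor):
    """Nearest-neighbor upscaling: repeat each sample `factor` times."""
    return [v for v in values for _ in range(factor)]


def upscale_linear(values, factor):
    """Linear upscaling between sample centers (1D analogue of bilinear)."""
    n = len(values)
    out = []
    for i in range(n * factor):
        # Position of output sample i in input coordinates,
        # assuming pixel centers are aligned (an assumed convention).
        pos = (i + 0.5) / factor - 0.5
        pos = min(max(pos, 0.0), n - 1.0)  # clamp at the borders
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo
        out.append((1.0 - w) * values[lo] + w * values[hi])
    return out


if __name__ == "__main__":
    coarse = [0.0, 1.0, 0.0]  # a toy low-resolution mask profile
    print("nearest:", upscale_nearest(coarse, 3))
    print("linear: ", [round(v, 3) for v in upscale_linear(coarse, 3)])
```

The nearest-neighbor output keeps hard steps at quantized positions, while the linear output varies smoothly between samples, which is the property the aligned head relies on at large upscaling factors.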

Tensor bipyramid.

Replacing the best feature pyramid model with a tensor bipyramid yields a large 4.1 AP improvement (Tab. 6c). Here, the mask size is 15×15 on level k=0, and is 480×480 for k=5; see Fig. 7b. The higher-resolution masks predicted for large objects (e.g., at k=5) have a clear benefit: AP_L jumps by 7.0 points. This improvement does not come at the cost of denser windows, as the k=5 output is at (H/32, W/32) resolution.

Again, we note that it is intractable to have, e.g., a conv with 480×480 = 230,400 output channels. The upscaling aligned head with bilinear interpolation is key to making the tensor bipyramid possible.
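The growth of mask size across levels can be tabulated with a short sketch. It assumes, based on the 15×15 and 480×480 endpoints quoted above, that the mask side doubles at each level over six levels; the function name `bipyramid_mask_sizes` is hypothetical.

```python
def bipyramid_mask_sizes(base=15, num_levels=6):
    """Mask side length per level k, assuming it doubles with each level."""
    return [base * 2 ** k for k in range(num_levels)]


if __name__ == "__main__":
    for k, side in enumerate(bipyramid_mask_sizes()):
        # Channels a conv would need to emit the side x side mask directly,
        # i.e. without an upscaling head.
        direct_channels = side * side
        print(f"k={k}: mask {side}x{side}, direct conv channels {direct_channels}")
```

At the coarsest level the direct channel count reaches 230,400, which illustrates why the upscaling head is needed there.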

Multiple window sizes.

Thus far we have used a single window size (per level) for all models, that is, 15×15. Analogous to the concept of anchors in RPN [34] that are also used in current detectors [33, 27, 23], we extend our method to multiple window sizes. We use two window sizes, 15×15 and 11×11, leading to two heads per level. Tab. 6d shows the benefit of having two window sizes: it increases AP by 1.6 points. More window sizes and aspect ratios are possible, suggesting room for improvement.

5.2 Comparison with Mask R-CNN

Tab. 7 summarizes the best TensorMask model on test-dev and compares it to the current dominant approach for COCO instance segmentation: Mask R-CNN [17]. We use the Detectron [13] code to reflect improvements since [17] was published. We modify it to match our implementation details (FPN average fusion, 1k warm-up, and box loss). Tab. 7 rows 1 and 2 verify that these subtleties have a negligible effect. Then we use training-time scale augmentation and a longer schedule [16], which yields a roughly 2 AP increase (Tab. 7 row 3) and establishes a fair and solid baseline for comparison.

The best TensorMask in Tab. 6d achieves 35.5 mask AP on test-dev (Tab. 7 row 4), close to its Mask R-CNN counterpart's 36.8. With ResNet-101, TensorMask achieves 37.3 mask AP, a 1.0 AP gap behind Mask R-CNN. These results demonstrate that dense sliding-window methods can close the gap to ‘detect-then-segment’ systems (§2). Qualitative results are shown in Figs. 2, 10, and 11.

We report box AP of TensorMask in §A.3. Moreover, compared to Mask R-CNN, one intriguing property of TensorMask is that masks are independent of boxes. In fact, we find that joint training of box and mask gives only a marginal gain over mask-only training; see §A.4.

Speed-wise, the best R-101-FPN TensorMask runs at 0.38s/im on a V100 GPU (all post-processing included), vs. Mask R-CNN's 0.09s/im. Predicting masks in dense sliding windows (100k) results in a computation overhead, vs. Mask R-CNN's sparse prediction on 100 final boxes. Accelerations are possible but outside the scope of this work.


6 Conclusion

TensorMask is a dense sliding-window instance segmentation framework that, for the first time, achieves results close to the well-developed Mask R-CNN framework, both qualitatively and quantitatively. It establishes a conceptually complementary direction for instance segmentation research. We hope our work will create new opportunities and make both directions thrive.

Appendix A

A.1 Generalized Coordinate Transformation

In Sec. 3.4 we have assumed and . Here we relax this condition and only assume . Again, we still have the following two relations for : and . Solving for and gives: and . Then align2nat is:


More generally, consider arbitrary units , , , and . Then the relations between the natural and aligned representation can be rewritten as:


Note that these relations only hold in the image pixel domain (hence the usage of all units). Solving for , gives:


And the align2nat transform becomes:


This version of the coordinate transformation demonstrates the role of units and may enable more general uses.

A.2 Aligned Representation and InstanceFCN

We prove that the InstanceFCN [7] output behaves as an upscaling aligned head with nearest-neighbor interpolation.

In [7], each output mask has pixels that are divided into bins. A mask pixel is read from the channel corresponding to the pixel’s bin. In our notation, [7] predicts which is related to the natural representation by:


where is a rounding operation and the integers and index a bin. Now, define a new function by:


and new coordinates: and (likewise for and ). Then can be written as:


Eqn.(8) says that is the nearest-neighbor interpolation of on . Eqn.(7), (6), and the new coordinates show that is computed from by the align2nat transform with . Thus, InstanceFCN masks can be constructed in the TensorMask framework by predicting , performing nearest-neighbor interpolation of on to get , and then using align2nat to compute natural masks .

A.3 Object Detection Results

In Tab. 8 we show the associated bounding-box (bb) object detection results. Overall, TensorMask has box AP comparable to that of Mask R-CNN and outperforms RetinaNet.

method                 aug  epochs  AP^bb  AP^bb_50  AP^bb_75
RetinaNet, ours        -    24      37.1   55.0      39.9
RetinaNet, ours        ✓    72      39.3   57.2      42.4
Faster R-CNN, ours     ✓    72      40.6   61.4      44.2
Mask R-CNN, ours       ✓    72      41.7   62.5      45.7
TensorMask, box-only   ✓    72      40.8   60.4      43.9
TensorMask             ✓    72      41.6   61.0      45.0
Table 8: Object detection box AP on COCO test-dev. All models use ResNet-50-FPN. ‘TensorMask, box-only’ is our model without the mask head: it resembles RetinaNet but with the mask-driven assignment rule and only 2 window sizes instead of 9 [23].

A.4 Mask-Only TensorMask

One intriguing property of TensorMask is that masks are not dependent on boxes. This not only opens up new model designs that are mask-specific, but also allows us to investigate whether box predictions improve masks in a multi-task setting. Here, we conduct experiments without the use of a box head. Note that although we predict masks densely, we still need to perform NMS for post-processing. If regressed boxes are absent, we simply use the bounding boxes of the masks as a substitute (and also to report box AP).
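As an illustration of the substitute boxes, the tight bounding box of a binary mask can be computed as in the sketch below. This is a plain-Python illustration, not the paper's code: the function name `mask_to_box`, the nested-list mask format, and the inclusive (x0, y0, x1, y1) convention are assumptions.

```python
def mask_to_box(mask):
    """Tight bounding box (x0, y0, x1, y1), inclusive, of the nonzero
    pixels of a binary mask given as a list of rows; None if empty."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


if __name__ == "__main__":
    toy_mask = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
    ]
    print(mask_to_box(toy_mask))
```

Boxes obtained this way can be passed to a standard box NMS routine in place of regressed boxes.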

Tab. 9 gives the results. We observe a slight degradation switching from the default setting which uses original boxes (row 1) for NMS to using mask bounding boxes (row 2). After accounting for this, TensorMask without a box head (row 3) has nearly equal mask AP to the mask+box variant (row 2). These results indicate that the role of the box head is auxiliary in our system, in contrast to Mask R-CNN.

box head  NMS      AP    AP_50  AP_75  AP^bb  AP^bb_50  AP^bb_75
✓         bb       35.4  56.5   37.5   41.5   60.7      44.7
✓         mask-bb  34.9  55.6   37.0   39.5   58.6      41.6
-         mask-bb  34.8  55.7   36.6   39.1   58.4      41.3
Table 9: Multi-task benefits of box training for mask prediction on COCO val2017 with our final ResNet-50-FPN model.

A.5 Qualitative Comparisons and Calibration

We show more results in Figs. 10 and 11. For these, and all visualizations in the main text, we display all detections that have a calibrated score ≥ 0.6. We use a simple calibration that maps uncalibrated detector scores to precision values: for each model and for each category, we compute its precision-recall (PR) curve on val2017. As a PR curve is parameterized by score, we can map an uncalibrated score for the detector-category pair to its corresponding precision value. Score-to-precision calibration enables a fair visual comparison between methods using a fixed threshold.
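The score-to-precision mapping can be sketched as follows for a single detector-category pair. This is a minimal illustration under stated assumptions, not the authors' code: the function name `fit_score_to_precision` is hypothetical, the true-positive flags are assumed to come from a matching step that is not shown, and scores above the highest validation score reuse the top detection's precision.

```python
from bisect import bisect_left


def fit_score_to_precision(scores, is_true_positive):
    """Build a score -> precision map from validation detections of one
    category. Precision at score s is TP / (all detections with score >= s)."""
    if not scores:
        raise ValueError("need at least one validation detection")
    pairs = sorted(zip(scores, is_true_positive), key=lambda p: p[0], reverse=True)
    cumulative_tp, tp = [], 0
    for _, hit in pairs:
        tp += 1 if hit else 0
        cumulative_tp.append(tp)
    ascending_scores = [s for s, _ in reversed(pairs)]

    def calibrate(score):
        # Number of validation detections scoring at least `score`.
        kept = len(ascending_scores) - bisect_left(ascending_scores, score)
        kept = max(kept, 1)  # assumption: clamp above the top score
        return cumulative_tp[kept - 1] / kept

    return calibrate


if __name__ == "__main__":
    calibrate = fit_score_to_precision(
        [0.9, 0.8, 0.7, 0.6, 0.5], [True, True, False, True, False]
    )
    for raw in (0.8, 0.7, 0.6, 0.5):
        print(f"raw score {raw} -> precision {calibrate(raw):.3f}")
```

A detection would then be displayed if its calibrated value clears the fixed visualization threshold, regardless of which model produced the raw score.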

Figure 10: More results of Mask R-CNN [17] (top row per set) and TensorMask (bottom row per set) on the last 65 val2017 images (continued in Fig. 11). These models use a ResNet-101-FPN backbone and obtain 38.3 and 37.3 AP on test-dev, respectively. Visually, TensorMask gives sharper masks than Mask R-CNN, although its AP is 1 point lower. Best viewed in a digital format with zoom.
Figure 11: More results of Mask R-CNN [17] (top row per set) and TensorMask (bottom row per set) continued from Fig. 10.

References


  • [1] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
  • [2] A. Arnab and P. H. Torr. Pixelwise instance segmentation with a dynamically instantiated network. In CVPR, 2017.
  • [3] M. Bai and R. Urtasun. Deep watershed transform for instance segmentation. In CVPR, 2017.
  • [4] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, et al. Hybrid task cascade for instance segmentation. arXiv:1901.07518, 2019.
  • [5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015.
  • [6] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
  • [7] J. Dai, K. He, Y. Li, S. Ren, and J. Sun. Instance-sensitive fully convolutional networks. In ECCV, 2016.
  • [8] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016.
  • [9] P. Dollár, Z. Tu, P. Perona, and S. Belongie. Integral channel features. In BMVC, 2009.
  • [10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 2010.
  • [11] R. Girshick. Fast R-CNN. In ICCV, 2015.
  • [12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [13] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He. Detectron. https://github.com/facebookresearch/detectron, 2018.
  • [14] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017.
  • [15] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014.
  • [16] K. He, R. Girshick, and P. Dollár. Rethinking imagenet pre-training. arXiv:1811.08883, 2018.
  • [17] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [19] A. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, and C. Rother. Instancecut: from edges to instances with multicut. In CVPR, 2017.
  • [20] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1989.
  • [21] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In CVPR, 2017.
  • [22] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  • [23] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In ICCV, 2017.
  • [24] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [25] S. Liu, J. Jia, S. Fidler, and R. Urtasun. SGN: Sequential grouping networks for instance segmentation. In ICCV, 2017.
  • [26] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path aggregation network for instance segmentation. In CVPR, 2018.
  • [27] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
  • [28] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [29] G. Neuhold, T. Ollmann, S. Rota Bulo, and P. Kontschieder. The mapillary vistas dataset for semantic understanding of street scenes. In ICCV, 2017.
  • [30] C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun. MegDet: A large mini-batch object detector. In CVPR, 2018.
  • [31] P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In NIPS, 2015.
  • [32] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In ECCV, 2016.
  • [33] J. Redmon and A. Farhadi. YOLO9000: better, faster, stronger. In CVPR, 2017.
  • [34] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [35] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 1958.
  • [36] A. Rush. Tensor considered harmful. 2019.
  • [37] R. Vaillant, C. Monrocq, and Y. LeCun. Original approach for the localisation of objects in images. IEE Proc. on Vision, Image, and Signal Processing, 1994.
  • [38] K. E. van de Sande, J. R. Uijlings, T. Gevers, and A. W. Smeulders. Segmentation as selective search for object recognition. In ICCV, 2011.
  • [39] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, 2001.
  • [40] S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross, S. Chintala, and P. Dollár. A multipath network for object detection. In BMVC, 2016.