1 Introduction
The sliding-window paradigm—finding objects by looking in each window placed over a dense set of image locations—is one of the earliest and most successful concepts in computer vision [37, 39, 9, 10] and is naturally connected to convolutional networks [20]. However, while today’s top-performing object detectors rely on sliding-window prediction to generate initial candidate regions, a refinement stage is applied to these candidate regions to obtain more accurate predictions, as pioneered by Faster R-CNN [34] and Mask R-CNN [17] for bounding-box object detection and instance segmentation, respectively. This class of methods has dominated the COCO detection challenges [24]. Recently, bounding-box object detectors that eschew the refinement step and focus on direct sliding-window prediction, as exemplified by SSD [27] and RetinaNet [23], have witnessed a resurgence and shown promising results. In contrast, the field has not witnessed equivalent progress in dense sliding-window instance segmentation; there are no direct, dense approaches analogous to SSD / RetinaNet for mask prediction. Why is the dense approach thriving for box detection, yet entirely missing for instance segmentation? This is a question of fundamental scientific interest. The goal of this work is to bridge this gap and provide a foundation for exploring dense instance segmentation.
Our main insight is that the core concepts for defining dense mask representations, as well as effective realizations of these concepts in neural networks, are both lacking. Unlike bounding boxes, which have a fixed, low-dimensional representation regardless of scale, segmentation masks can benefit from richer, more structured representations. For example, each mask is itself a 2D spatial map, and masks for larger objects can benefit from the use of larger spatial maps. Developing effective representations for dense masks is a key step toward enabling dense instance segmentation.
To address this, we define a set of core concepts for representing masks with high-dimensional tensors that allows for the exploration of novel network architectures for dense mask prediction. We present and experiment with several such networks in order to demonstrate the merits of the proposed representations. Our framework, called TensorMask, establishes the first dense sliding-window instance segmentation system that achieves results close to Mask R-CNN.
The central idea of the TensorMask representation is to use structured
4D tensors to represent masks over a spatial domain. This perspective stands in contrast to prior work on the related task of segmenting class-agnostic object proposals such as DeepMask
[31] and InstanceFCN [7] that used unstructured 3D tensors, in which the mask is packed into the third ‘channel’ axis. The channel axis, unlike the axes representing object position, does not have a clear geometric meaning and is therefore difficult to manipulate. By using a basic channel representation, one misses an opportunity to benefit from using structural arrays to represent masks as 2D entities—analogous to the difference between MLPs [35] and ConvNets [20] for representing 2D images. Unlike these channel-oriented approaches, we propose to leverage 4D tensors of shape (V, U, H, W), in which both (H, W)—representing object position—and (V, U)—representing relative mask position—are geometric sub-tensors, i.e., they have axes with well-defined units and geometric meaning w.r.t. the image. This shift in perspective from encoding masks in an unstructured channel axis to using structured geometric sub-tensors enables the definition of novel operations and network architectures. These networks can operate directly on the (V, U) sub-tensor in geometrically meaningful ways, including coordinate transformation, up/downscaling, and use of scale pyramids.
Enabled by the TensorMask framework, we develop a pyramid structure over a scale-indexed list of 4D tensors, which we call a tensor bipyramid. Analogous to a feature pyramid, which is a list of feature maps at multiple scales, a tensor bipyramid contains a list of 4D tensors with shapes (2^kV, 2^kU, H/2^k, W/2^k), where k indexes scale. This structure has a pyramidal shape in both the (V, U) and (H, W) geometric sub-tensors, but growing in opposite directions. This natural design captures the desirable property that large objects have high-resolution masks with coarse spatial localization (large k) and small objects have low-resolution masks with fine spatial localization (small k).
We combine these components into a network backbone and training procedure closely following RetinaNet [23] in which our dense mask predictor extends the original dense bounding-box predictor. With detailed ablation experiments, we evaluate the effectiveness of the TensorMask framework and show the importance of explicitly capturing the geometric structure of this task. Finally, we show TensorMask yields similar results to its Mask R-CNN counterpart (see Figs. 1 and 2). These promising results suggest the proposed framework can help pave the way for future research on dense sliding-window instance segmentation.
2 Related Work
Classify mask proposals.
The modern instance segmentation task was introduced by Hariharan et al. [15] (before being popularized by COCO [24]). In their work, the method proposed for this task involved first generating object mask proposals [38, 1]
, then classifying these proposals
[15]. In earlier work, the classify-mask-proposals methodology was used for other tasks. For example, Selective Search [38] and the original R-CNN [12] classified mask proposals to obtain box detections and semantic segmentation results; these methods could easily be applied to instance segmentation. These early methods relied on bottom-up mask proposals computed by pre-deep-learning-era methods
[38, 1]; our work is more closely related to dense sliding-window methods for mask object proposals as pioneered by DeepMask [31]. We discuss this connection shortly.
Detect then segment.
The now dominant paradigm for instance segmentation involves first detecting objects with a box and then segmenting each object using the box as a guide [8, 40, 21, 17]. Perhaps the most successful instantiation of the detect-then-segment methodology is Mask R-CNN [17], which extended the Faster R-CNN [34] detector with a simple mask predictor. Approaches that build on Mask R-CNN [26, 30, 4] have dominated leaderboards of recent challenges [24, 29, 6]. Unlike in bounding-box detection, where sliding-window [27, 33, 23] and region-based [11, 34] methods have both thrived, in the area of instance segmentation, research on dense sliding-window methods has been missing. Our work aims to close this gap.
Label pixels then cluster.
A third class of approaches to instance segmentation (e.g., [3, 19, 2, 25]) builds on models developed for semantic segmentation [28, 5]. These approaches label each image pixel with a category and some auxiliary information that a clustering algorithm can use to group pixels into object instances. These approaches benefit from improvements on semantic segmentation and natively predict higher-resolution masks for larger objects. Compared to detect-then-segment methods, label-pixels-then-cluster methods lag behind in accuracy on popular benchmarks [24, 29, 6]. Instead of employing fully convolutional models for dense pixel labeling, TensorMask explores the framework of building fully convolutional (i.e., dense sliding-window) models for dense mask prediction, where the output at each spatial location is itself a 2D spatial map.
Dense sliding window methods.
To the best of our knowledge, no prior methods exist for dense sliding-window instance segmentation. The proposed TensorMask framework is the first such approach. The closest methods are for the related task of class-agnostic mask proposal generation, specifically models such as DeepMask [31, 32] and InstanceFCN [7], which apply convolutional neural networks to generate mask proposals in a dense sliding-window manner. Like these approaches, TensorMask is a dense sliding-window model, but it spans a more expressive design space. DeepMask and InstanceFCN can be expressed naturally as class-agnostic TensorMask models, but TensorMask enables novel architectures that perform better. Also, unlike these class-agnostic methods, TensorMask performs multi-class classification in parallel to mask prediction, and thus can be applied to the task of instance segmentation.
3 Tensor Representations for Masks
The central idea of the TensorMask framework is to use structured high-dimensional tensors to represent image content (i.e., masks) in a set of densely sliding windows.
Consider a V×U window sliding on a feature map of width W and height H. It is possible to represent all masks in all sliding window locations by a tensor of shape (V·U, H, W), where each mask is parameterized by V·U pixels. This is the representation used in DeepMask [31].
The underlying spirit of this representation, however, is in fact a higher-dimensional (4D) tensor with shape (V, U, H, W). The (V, U) sub-tensor represents a mask as a 2D spatial entity. Instead of viewing the channel dimension as a black box into which a V×U mask is arranged, the tensor perspective enables several important concepts for representing dense masks, discussed next.
3.1 Unit of Length
The unit of length (or simply unit) of each spatial axis is a necessary concept for understanding 4D tensors in our framework. Intuitively, the unit of an axis defines the length of one pixel along it. Different axes can have different units.
The unit of the H and W axes, denoted as σ_hw, can be set as the stride w.r.t. the input image (e.g., the res4 feature map of ResNet-50 [18] has σ_hw = 16 image pixels). Analogously, the V and U axes define another 2D spatial domain and have their own unit, denoted as σ_vu. Shifting one pixel along the V or U axis corresponds to shifting σ_vu pixels on the input image. The unit σ_vu need not be equal to the unit σ_hw, a property that our models will benefit from.
Defining units is necessary because the interpretation of the tensor shape is ambiguous if units are not specified. For example, V = U = 15 represents a 15×15 window in image pixels if σ_vu = 1 image pixel, but a 30×30 window in image pixels if σ_vu = 2 image pixels. The units and how they change due to up/downscaling operations are central to multi-scale representations (more in §3.6).
3.2 Natural Representation
With the definition of units, we can formally describe the representational meaning of a 4D tensor. In our simplest definition, this tensor represents V×U windows sliding over the (H, W) domain. We call this the natural representation. Denoting the ratio of units as α = σ_vu/σ_hw, formally we have:
Natural Representation: For a 4D tensor F of shape (V, U, H, W), its value at coordinates (v, u, y, x) represents the mask value at (y + αv, x + αu) in the window centered at (y, x). (Derivation: on the input image pixels, the center of a sliding window is (σ_hw·y, σ_hw·x), and a pixel located in this window is at (σ_hw·y + σ_vu·v, σ_hw·x + σ_vu·u). Projecting to the (H, W) domain, i.e., normalizing by the unit σ_hw, gives us y + αv and x + αu.)
Here (v, u, y, x) ∈ V ⊗ U ⊗ H ⊗ W, where ‘⊗’ denotes the cartesian product of the axes. Conceptually, the tensor can be thought of as a continuous function in this domain. For implementation, we must instead rasterize the 4D tensor as a discrete function defined on sampled locations. We assume a sampling rate of one sample per unit, with samples located at integer coordinates (e.g., if V = 3, then v ∈ {−1, 0, 1}). This assumption allows the same value V to represent both the length of the axis in terms of units (e.g., 3) and also the number of discrete samples stored for the axis. This is convenient for working with tensors produced by neural networks that are discrete and have finite lengths.
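As a concrete illustration of this rasterized convention, the small sketch below (our own helper, not code from the paper) enumerates the (H, W)-domain coordinates covered by the mask pixels of one window in the natural representation:

```python
import numpy as np

def natural_mask_coords(y, x, V, U, alpha=1.0):
    """(H, W)-domain coordinates covered by each (v, u) mask pixel of the
    window centered at (y, x): pixel (v, u) lands at (y + alpha*v, x + alpha*u).
    Samples sit at integer offsets centered on zero, e.g. V = 3 -> v in {-1, 0, 1}
    (for even V the centering convention would need to be chosen explicitly)."""
    vs = np.arange(V) - (V - 1) // 2
    us = np.arange(U) - (U - 1) // 2
    yy = y + alpha * vs[:, None] * np.ones(U)[None, :]
    xx = x + alpha * us[None, :] * np.ones(V)[:, None]
    return yy, xx
```

With y = 5, x = 7, V = U = 3, and α = 1, the mask covers rows 4–6 and columns 6–8, matching the v ∈ {−1, 0, 1} convention above.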
Fig. 3 (left) illustrates an example when V = U = 3 and α is 1. The natural representation is intuitive and easy to parse as the output of a network, but it is not the only possible representation in a deep network, as discussed next.
3.3 Aligned Representation
In the natural representation, the (V, U) sub-tensor located at pixel (y, x) represents mask values at offset pixels, instead of directly at (y, x). When using convolutions to compute features, preserving pixel-to-pixel alignment between input pixels and predicted output pixels can lead to improvements (this is similar to the motivation for RoIAlign [17]). Next we describe a pixel-aligned representation for dense masks under the tensor perspective.
Formally, we define the aligned representation as:
Aligned Representation: For a 4D tensor F̂ of shape (V̂, Û, Ĥ, Ŵ), its value at coordinates (v̂, û, ŷ, x̂) represents the mask value at (ŷ, x̂) in the window centered at (ŷ + α̂v̂, x̂ + α̂û).
Here α̂ = σ̂_vu/σ̂_hw is the ratio of units in the aligned representation.
Here, the (V̂, Û) sub-tensor at pixel (ŷ, x̂) always describes the values taken at this pixel, i.e., it is aligned. The (V̂, Û) subspace does not represent a single mask, but instead enumerates mask values in all windows that overlap pixel (ŷ, x̂). Fig. 3 (right) illustrates an example when V̂ = Û = 3 (nine overlapping windows) and α̂ is 1.
Note that we denote tensors in the aligned representation as F̂ (and likewise hatted coordinates and units). This is in the spirit of ‘named tensors’ [36] and proves useful.
Our aligned representation is related to the instance-sensitive score maps proposed in InstanceFCN [7]. We prove (in §A.2) that those score maps behave like our aligned representation but with nearest-neighbor interpolation on (V, U), which makes them unaligned. We test this experimentally and show it degrades results severely.
3.4 Coordinate Transformation
We introduce a coordinate transformation between natural and aligned representations, so they can be used interchangeably in a single network. This gives us additional flexibility in the design of novel network architectures.
For simplicity, we assume units in both representations are the same: σ_vu = σ̂_vu and σ_hw = σ̂_hw, and thus α = α̂ (for the more general case see §A.1). Comparing the definitions of the natural vs. aligned representations, we have the following two relations for x: x̂ = x + αu and x̂ + α̂û = x. With α = α̂, solving this pair for x̂ and û gives: x̂ = x + αu and û = −u. A similar result holds for y and v̂. So the transformation from the aligned representation (F̂) to the natural representation (F) is:
F(v, u, y, x) = F̂(−v, −u, y + αv, x + αu)    (1)
We call this transform align2nat. Likewise, solving this set of two relations for v and y (and u and x) gives the reverse transform nat2align: F̂(v̂, û, ŷ, x̂) = F(−v̂, −û, ŷ + α̂v̂, x̂ + α̂û). While all the models presented in this work only use align2nat, we present both cases for completeness.
Without restrictions on α, these transformations may involve indexing a tensor at a non-integer coordinate, e.g., if α is not an integer. Since we only permit integer coordinates in our implementation, we adopt a simple strategy: when the op align2nat is called, we ensure that α is a positive integer. We can satisfy this constraint on α by changing units with up/downscaling ops, as described next.
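To make Eqn. (1) concrete, here is a minimal numpy sketch of align2nat as a gather, under assumptions of our own (odd V and U, integer α, and zero-padding for out-of-image reads, which the text does not specify):

```python
import numpy as np

def align2nat(f_hat, alpha=1):
    """Convert an aligned tensor f_hat[v_hat, u_hat, y, x] into the natural
    representation via F(v, u, y, x) = f_hat(-v, -u, y + alpha*v, x + alpha*u).
    A numpy sketch, not the paper's GPU op."""
    V, U, H, W = f_hat.shape
    assert V % 2 == 1 and U % 2 == 1, "odd window sizes assumed"
    cv, cu = V // 2, U // 2                      # array index of offset (0, 0)
    out = np.zeros_like(f_hat)
    for iv in range(V):
        for iu in range(U):
            v, u = iv - cv, iu - cu              # signed mask offsets
            src = f_hat[cv - v, cu - u]          # aligned channel at (-v, -u)
            ys = np.arange(H) + alpha * v        # source row per output row
            xs = np.arange(W) + alpha * u
            ok_y = (ys >= 0) & (ys < H)          # zero-pad outside the map
            ok_x = (xs >= 0) & (xs < W)
            out[iv, iu][np.ix_(ok_y, ok_x)] = src[np.ix_(ys[ok_y], xs[ok_x])]
    return out
```

Each output channel (v, u) is simply the mirrored aligned channel (−v, −u) shifted by (αv, αu) in the (H, W) domain, which is why the op is a pure re-indexing.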
3.5 Upscaling Transformation
The aligned representation enables the use of coarse (V̂, Û) sub-tensors to create finer (V, U) sub-tensors, which proves quite useful. Fig. 4 illustrates this transformation, which we call up_align2nat and describe next.
The up_align2nat op accepts a (V/λ, U/λ, H, W) tensor as input. Its (V, U) sub-tensor is λ× coarser than the desired output (so its unit is λ× bigger). It performs bilinear upsampling, up_bilinear, in the (V, U) domain by λ, reducing the underlying unit by λ. Next, the align2nat op converts the output into the natural representation. The full up_align2nat op is shown in Fig. 4.
As our experiments demonstrate, the up_align2nat op is effective for generating high-resolution masks without inflating channel counts in preceding feature maps. This in turn enables novel architectures, as described next.
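As an illustration of the upsampling step only, the sketch below bilinearly interpolates the (V, U) axes of a 4D tensor; an align-corners-style grid is our assumption here, and the subsequent align2nat step is omitted:

```python
import numpy as np

def up_bilinear_vu(t, lam):
    """Bilinearly upsample a (V, U, H, W) tensor along its V and U axes by a
    factor lam. Uses an align-corners-style grid, so an axis of length n maps
    to lam*(n-1)+1 samples; this is a sketch, not the paper's exact op."""
    def up_axis(a, axis):
        n = a.shape[axis]
        dst = np.linspace(0, n - 1, lam * (n - 1) + 1)  # fractional source coords
        lo = np.floor(dst).astype(int)
        hi = np.minimum(lo + 1, n - 1)
        w = dst - lo                                     # interpolation weights
        a_lo = np.take(a, lo, axis=axis)
        a_hi = np.take(a, hi, axis=axis)
        shape = [1] * a.ndim
        shape[axis] = len(dst)
        w = w.reshape(shape)
        return a_lo * (1 - w) + a_hi * w
    return up_axis(up_axis(t, 0), 1)
```

Interpolating along V then U is equivalent to one 2D bilinear pass because the kernel is separable.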
3.6 Tensor Bipyramid
In multi-scale box detection it is common practice to use a lower-resolution feature map to extract larger-scale objects [10, 22]—this is because a sliding window of a fixed size on a lower-resolution map corresponds to a larger region in the input image. This also holds for multi-scale mask detection. However, unlike a box that is always represented by four numbers regardless of its scale, a mask’s pixel size must scale with object size in order to maintain constant resolution density. Thus, instead of always using V×U units to represent masks of different scales, we propose to adapt the number of mask pixels based on the scale.
Consider the natural representation (V, U, H, W) on a feature map of the finest level. Here, the (H, W) domain has the highest resolution (smallest unit). We expect this level to handle the smallest objects, so the (V, U) domain should have the lowest resolution. With reference to this, we build a pyramid that gradually reduces (H, W) and increases (V, U). Formally, we define a tensor bipyramid as:
Tensor bipyramid: A tensor bipyramid is a list of tensors of shapes (2^kV, 2^kU, H/2^k, W/2^k), for 0 ≤ k ≤ K, with units σ_vu^k = σ_vu and σ_hw^k = 2^k·σ_hw.
Because the σ_vu units are the same across all levels, a 2^kV×2^kU mask has more pixels in the input image than a V×U mask. In the (H, W) domain, because the units increase with k, the number of predicted masks decreases for larger masks, as desired. Note that the total size of each level is the same (it is VUHW). A tensor bipyramid can be constructed using the swap_align2nat operation, described next.
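The shape bookkeeping above can be checked with a one-liner (the function name is ours, not the paper's):

```python
def bipyramid_shapes(V, U, H, W, K):
    """Tensor bipyramid shapes (2^k V, 2^k U, H/2^k, W/2^k) for k = 0..K.
    Every level holds the same total number of elements, V*U*H*W."""
    return [(2**k * V, 2**k * U, H // 2**k, W // 2**k) for k in range(K + 1)]
```

For example, with V = U = 15 and a 64×64 base map, every level contains 15·15·64·64 elements even though mask resolution grows by 2^k.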
The swap_align2nat op is composed of two steps: first, an input tensor whose (H, W) unit is fine and whose (V, U) unit is coarse is upscaled to (2^kV, 2^kU, H, W) using up_align2nat. Then the (H, W) domain is subsampled by 2^k to obtain the final (2^kV, 2^kU, H/2^k, W/2^k) shape. The combination of up_align2nat and subsample, shown in Fig. 5, is called swap_align2nat: the units before and after this op are swapped. For efficiency, it is not necessary to compute the intermediate tensor of shape (2^kV, 2^kU, H, W) from up_align2nat, which would be prohibitive. This is because only a small subset of values in this intermediate tensor appear in the final output after subsampling. So although Fig. 5 shows the conceptual computation, in practice we implement swap_align2nat as a single op that only performs the necessary computation and has complexity O(VUHW) regardless of k.
4 TensorMask Architecture
We now present models enabled by TensorMask representations. These models have a mask prediction head that generates masks in sliding windows and a classification head to predict object categories, analogous to the box regression and classification heads in sliding-window object detectors [27, 23]. Box prediction is not necessary for TensorMask models, but can easily be included.
4.1 Mask Prediction Heads
Our mask prediction branch attaches to a convolutional backbone. We use FPN [22], which generates a pyramid of feature maps with sizes H/2^k × W/2^k and a fixed number of channels per level k. These maps are used as input for each prediction head: mask, class, and box. Weights for the heads are shared across levels, but not between tasks.
Output representation.
We always use the natural representation (§3.2) as the output format of the network. Any representation (natural, aligned, etc.) can be used in the intermediate layers, but it will be transformed into the natural representation for the output. This standardization decouples the loss definition from network design, making use of different representations simpler. Also, our mask output is class-agnostic, i.e., the window always predicts a single mask regardless of class; the class of the mask is predicted by the classification head. Class-agnostic mask prediction avoids multiplying the output size by the number of classes.
Baseline heads.
We consider a set of four baseline heads, illustrated in Fig. 6. Each head accepts an input feature map of shape (C, H/2^k, W/2^k) for any level k. It then applies a 1×1 convolutional layer (with ReLU) with the appropriate number of output channels such that reshaping it into a 4D tensor produces the desired shape for the next layer, denoted as ‘conv+reshape’. Fig. 6a and 6b are simple heads that use the natural and aligned representations, respectively. In both cases, we use V·U output channels for the 1×1 conv, followed by align2nat in the latter case. Fig. 6c and 6d are upscaling heads that use the natural and aligned representations, respectively. Their 1×1 conv has λ²× fewer output channels than in the simple heads. In a baseline TensorMask model, one of these four heads is selected and attached to all FPN levels. The output forms a pyramid of (V, U, H/2^k, W/2^k) tensors, see Fig. 7a. For each head, the output sliding window always has the same unit as the feature map on which it slides: σ_vu = σ_hw for all levels.
Tensor bipyramid head.
Unlike the baseline heads, the tensor bipyramid head (§3.6) accepts a feature map of fine resolution at all levels. Fig. 8 shows a minor modification of FPN to obtain these maps. For each of the resulting levels, now all at the finest resolution, we first use conv+reshape to produce the appropriate 4D tensor, then run a mask prediction head with swap_align2nat, see Fig. 7b. The tensor bipyramid model is the most effective TensorMask variant explored in this work.
4.2 Training
Label assignment.
We use a version of the DeepMask assignment rule [31] to label each window. A window satisfying three conditions with respect to a ground-truth mask m is positive:
(i) Containment: the window fully contains m and the longer side of m, in image pixels, is at least 1/2 of the longer side of the window. (A fallback is used to increase small-object recall: masks smaller than the minimum assignable size are assigned to windows of the smallest size.)
(ii) Centrality: the center of m’s bounding box is within one unit of the window center in ℓ2 distance.
(iii) Uniqueness: there is no other mask m′ that satisfies the other two conditions.
If m satisfies these three conditions, then the window is labeled as a positive example whose ground-truth mask, object category, and box are given by m. Otherwise, the window is labeled as a negative example.
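For illustration, the containment and centrality conditions can be sketched as a hypothetical helper in image-pixel coordinates (uniqueness must still be checked across all masks, and the exact distance norm follows our reading of the rule):

```python
import math

def is_positive_window(win_cx, win_cy, win_size, mask_box, sigma):
    """Check containment (mask box inside the window, long side >= half the
    window) and centrality (mask-box center within one unit sigma of the
    window center) for one ground-truth mask box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = mask_box
    half = win_size / 2.0
    contains = (x1 >= win_cx - half and x2 <= win_cx + half and
                y1 >= win_cy - half and y2 <= win_cy + half)
    big_enough = max(x2 - x1, y2 - y1) >= win_size / 2.0
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    central = math.hypot(cx - win_cx, cy - win_cy) <= sigma
    return contains and big_enough and central
```

A window of size 40 centered at (50, 50) would, for instance, accept a 30-pixel mask box centered on it but reject a mask box lying outside the window.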
In contrast to the IoU-based assignment rules for boxes in sliding-window detectors (e.g., RPN [34], SSD [27], RetinaNet [23]), our rules are mask-driven. Experiments show that our rules work well even when using only 1 or 2 window sizes with a single aspect ratio of 1:1, versus, e.g., RetinaNet’s 9 anchors of multiple scales and aspect ratios.
Loss.
For the mask prediction head, we adopt a per-pixel binary classification loss. In our setting, the ground-truth mask inside a sliding window often has a wide margin, resulting in an imbalance between foreground and background pixels. Therefore, instead of using binary cross-entropy, we use focal loss [23] to address the imbalance. The mask loss of a window is averaged over the pixels in the window (note that in a tensor bipyramid the window size varies across levels), and the total mask loss is averaged over all positive windows (negative windows do not contribute to the mask loss).
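A per-pixel binary focal loss for one positive window might look like the following sketch; the focusing and weighting parameters gamma and alpha are placeholders of ours, not necessarily the paper's settings:

```python
import numpy as np

def window_mask_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss FL = -alpha_t * (1 - p_t)^gamma * log(p_t), averaged
    over the pixels of one window. gamma/alpha are assumed values."""
    p = 1.0 / (1.0 + np.exp(-logits))            # sigmoid probability
    pt = np.where(targets == 1, p, 1.0 - p)      # prob of the true class
    at = np.where(targets == 1, alpha, 1.0 - alpha)
    fl = -at * (1.0 - pt) ** gamma * np.log(np.clip(pt, 1e-12, None))
    return fl.mean()
```

The (1 − p_t)^γ factor down-weights the many easy background pixels, which is what makes the loss robust to the wide-margin windows described above.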
For the classification head, we again adopt focal loss. For box regression, we use a parameter-free loss. The total loss is a weighted sum of all task losses.
Implementation details.
Our FPN implementation closely follows [23]; each FPN level is output by four 3×3 conv layers with ReLU (instead of one conv in the original FPN [22]). As with the heads, weights are shared across levels, but not between tasks. In addition, we found that averaging (instead of summing [22]) the top-down and lateral connections in FPN improved training stability. We use FPN levels 2 through 7, with 128 channels for the four conv layers in the mask and box branches, and 256 (the same as RetinaNet [23]) for the classification branch. Unless noted, we use ResNet-50 [18].
For training, all models are initialized from ImageNet pre-trained weights. We use scale jitter where the shorter image side is randomly sampled from [640, 800] pixels [16]. Following SSD [27] and YOLO [33], which train models longer (65 and 160 epochs, respectively) than [23, 17], we adopt the ‘6×’ schedule [16] (72 epochs), which improves results. The minibatch size is 16 images in 8 GPUs. The base learning rate is 0.02, with linear warmup [14] of 1k iterations. Other hyperparameters are kept the same as [13].
4.3 Inference
Inference is similar to dense sliding-window object detectors. We use a single scale of 800 pixels for the shorter image side. Our model outputs a mask prediction, a class score, and a predicted box for each sliding window. Non-maximum suppression (NMS) is applied to the top-scoring predictions using box IoU on the regressed boxes, following the settings in [22]. To convert predicted soft masks to binary masks at the original image resolution, we use the same method and hyperparameters as Mask R-CNN [17].
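The box-IoU NMS step is the standard greedy procedure; a plain numpy sketch:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression. boxes is an (N, 4) array of
    (x1, y1, x2, y2); returns the indices of the kept boxes, highest score
    first. Suppresses any box whose IoU with a kept box exceeds iou_thresh."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # intersection of the top box with the remaining candidates
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                 (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]
    return keep
```

The 0.5 threshold here is illustrative; the actual threshold follows the referenced settings.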
5 Experiments
We report results on COCO instance segmentation [24]. All models are trained on the 118k train2017 images and tested on the 5k val2017 images. Final results are on test-dev. We use COCO mask average precision (denoted by AP). When reporting box AP, we denote it as AP^bb.
head  AP  AP_50  AP_75  AP_S  AP_M  AP_L
natural  29.4  52.5  30.2  14.4  31.6  41.4
aligned  29.6  52.6  30.5  15.5  31.9  40.5
method  backbone  aug  epochs  AP  AP_50  AP_75  AP_S  AP_M  AP_L
Mask R-CNN [13]  R-50-FPN    24  34.9  57.2  36.9  15.4  36.6  50.8
Mask R-CNN, ours  R-50-FPN    24  34.9  56.8  36.8  15.1  36.7  50.6
Mask R-CNN, ours  R-50-FPN  ✓  72  36.8  59.2  39.3  17.1  38.7  52.1
TensorMask  R-50-FPN  ✓  72  35.5  57.3  37.4  16.6  37.0  49.1
Mask R-CNN, ours  R-101-FPN  ✓  72  38.3  61.2  40.8  18.2  40.6  54.1
TensorMask  R-101-FPN  ✓  72  37.3  59.5  39.5  17.5  39.3  51.6
5.1 TensorMask Representations
First we explore various tensor representations for masks using V = U = 15 and a ResNet-50-FPN backbone. We report quantitative results in Tab. 6 and show qualitative comparisons in Figs. 2 and 9.
Simple heads.
Tab. 1 compares the natural vs. aligned representations with simple heads (Fig. 6a vs. 6b). Both representations perform similarly, with a marginal gap of 0.2 AP. The simple natural head can be thought of as a class-specific variant of DeepMask [31] with an FPN backbone [22] and focal loss [23]. As we aim to use lower-resolution intermediate representations, we explore upscaling heads next.
Upscaling heads.
Tab. 6a compares the natural vs. aligned representations with upscaling heads (Fig. 6c vs. 6d). The output size is fixed at 15×15. Given an upscaling factor λ, the conv in Fig. 6 has (V/λ)·(U/λ) output channels, e.g., 9 channels with λ = 5 (225 channels if no upscaling). The difference in accuracy is big for large λ: the aligned variant improves AP by +15.6 over the natural head (115% relative) when λ = 5.
Interpolation.
The tensor view reveals the (V, U) sub-tensor as a 2D spatial entity that can be manipulated. Tab. 6b compares the upscale aligned head with bilinear (default) vs. nearest-neighbor interpolation on (V, U). We refer to this latter variant as unaligned since the quantization breaks pixel-to-pixel alignment. The unaligned variant is related to InstanceFCN [7] (see §A.2).
Tensor bipyramid.
Replacing the best feature pyramid model with a tensor bipyramid yields a large 4.1 AP improvement (Tab. 6c). Here, the mask size is 15×15 on level k = 0, and is 480×480 for k = 5; see Fig. 7b. The higher-resolution masks predicted for large objects (e.g., at 480×480) have a clear benefit: AP_L jumps by 7.0 points. This improvement does not come at the cost of denser windows, as the (H, W) resolution of the output at each level is unchanged.
Again, we note that it is intractable to have, e.g., a conv with 480×480 output channels. The upscaling aligned head with bilinear interpolation is key to making the tensor bipyramid possible.
Multiple window sizes.
Thus far we have used a single window size (per level) for all models, that is, V = U = 15. Analogous to the concept of anchors in RPN [34] that are also used in current detectors [33, 27, 23], we extend our method to multiple window sizes. We use two window sizes per level, leading to two heads per level. Tab. 6d shows the benefit of having two window sizes: it increases AP by 1.6 points. More window sizes and aspect ratios are possible, suggesting room for improvement.
5.2 Comparison with Mask RCNN
Tab. 7 summarizes the best TensorMask model on test-dev and compares it to the current dominant approach for COCO instance segmentation: Mask R-CNN [17]. We use the Detectron [13] code to reflect improvements since [17] was published. We modify it to match our implementation details (FPN average fusion, 1k warmup, and box loss). Tab. 7 rows 1 & 2 verify that these subtleties have a negligible effect. Then we use training-time scale augmentation and a longer schedule [16], which yields a ~2 AP increase (Tab. 7 row 3) and establishes a fair and solid baseline for comparison.
The best TensorMask in Tab. 6d achieves 35.5 mask AP on test-dev (Tab. 7 row 4), close to its Mask R-CNN counterpart’s 36.8. With ResNet-101, TensorMask achieves 37.3 mask AP, with a 1.0 AP gap behind Mask R-CNN. These results demonstrate that dense sliding-window methods can close the gap to ‘detect-then-segment’ systems (§2). Qualitative results are shown in Figs. 2, 10, and 11.
We report the box AP of TensorMask in §A.3. Moreover, compared to Mask R-CNN, one intriguing property of TensorMask is that masks are independent of boxes. In fact, we find that joint training of box and mask gives only a marginal gain over mask-only training, see §A.4.
Speed-wise, the best R-101-FPN TensorMask runs at 0.38s/im on a V100 GPU (all post-processing included), vs. Mask R-CNN’s 0.09s/im. Predicting masks in dense sliding windows (>100k windows) results in a computation overhead, vs. Mask R-CNN’s sparse prediction on ≤100 final boxes. Accelerations are possible but outside the scope of this work.
Conclusion.
TensorMask is a dense sliding-window instance segmentation framework that, for the first time, achieves results close to the well-developed Mask R-CNN framework—both qualitatively and quantitatively. It establishes a conceptually complementary direction for instance segmentation research. We hope our work will create new opportunities and make both directions thrive.
Appendix A
A.1 Generalized Coordinate Transformation
In Sec. 3.4 we assumed σ_vu = σ̂_vu and σ_hw = σ̂_hw. Here we relax this condition and only assume σ_hw = σ̂_hw. Again, we still have the following two relations for x: x̂ = x + αu and x̂ + α̂û = x. Solving for x̂ and û gives: x̂ = x + αu and û = −(α/α̂)·u. Then align2nat is:
F(v, u, y, x) = F̂(−(α/α̂)·v, −(α/α̂)·u, y + αv, x + αu)    (2)
More generally, consider arbitrary units σ_vu, σ̂_vu, σ_hw, and σ̂_hw. Then the relations between the natural and aligned representations can be rewritten as:
σ̂_hw·x̂ = σ_hw·x + σ_vu·u  and  σ̂_hw·x̂ + σ̂_vu·û = σ_hw·x    (3)
Note that these relations only hold in the image pixel domain (hence the usage of all units). Solving for x̂ and û gives:
x̂ = (σ_hw·x + σ_vu·u) / σ̂_hw  and  û = −(σ_vu/σ̂_vu)·u    (4)
And the align2nat transform becomes:
F(v, u, y, x) = F̂(−(σ_vu/σ̂_vu)·v, −(σ_vu/σ̂_vu)·u, (σ_hw·y + σ_vu·v)/σ̂_hw, (σ_hw·x + σ_vu·u)/σ̂_hw)    (5)
This version of the coordinate transformation demonstrates the role of units and may enable more general uses.
A.2 Aligned Representation and InstanceFCN
We prove that the InstanceFCN [7] output behaves as an upscaling aligned head with nearest-neighbor interpolation.
In [7], each output mask has V×U pixels that are divided into k×k bins. A mask pixel is read from the channel corresponding to the pixel’s bin. In our notation, [7] predicts a tensor that is related to the natural representation by:
(6) 
where round(·) is a rounding operation and the two resulting integers index a bin. Now, define a new function by:
(7) 
and new coordinates (and likewise for the other axes). Then it can be written as:
(8) 
Eqn. (8) says that this function is the nearest-neighbor interpolation, on (V, U), of the predicted score maps. Eqns. (7), (6), and the new coordinates show that it is computed by the align2nat transform. Thus, InstanceFCN masks can be constructed in the TensorMask framework by predicting the score maps, performing nearest-neighbor interpolation on (V, U), and then using align2nat to compute the natural masks.
A.3 Object Detection Results
In Tab. 8 we show the associated bounding-box (bb) object detection results. Overall, TensorMask has a comparable box AP with Mask R-CNN and outperforms RetinaNet.
method  aug  epochs  AP^bb  AP^bb_50  AP^bb_75
RetinaNet, ours    24  37.1  55.0  39.9
RetinaNet, ours  ✓  72  39.3  57.2  42.4
Faster R-CNN, ours  ✓  72  40.6  61.4  44.2
Mask R-CNN, ours  ✓  72  41.7  62.5  45.7
TensorMask, box-only  ✓  72  40.8  60.4  43.9
TensorMask  ✓  72  41.6  61.0  45.0
A.4 Mask-Only TensorMask
One intriguing property of TensorMask is that masks are not dependent on boxes. This not only opens up new model designs that are mask-specific, but also allows us to investigate whether box predictions improve masks in a multi-task setting. Here, we conduct experiments without the use of a box head. Note that although we predict masks densely, we still need to perform NMS for post-processing. If regressed boxes are absent, we simply use the bounding boxes of the masks as a substitute (and also to report box AP).
Tab. 9 gives the results. We observe a slight degradation when switching from the default setting, which uses regressed boxes for NMS (row 1), to using mask bounding boxes (row 2). After accounting for this, TensorMask without a box head (row 3) has nearly equal mask AP to the mask+box variant (row 2). These results indicate that the role of the box head is auxiliary in our system, in contrast to Mask R-CNN.
box head  NMS  AP  AP_50  AP_75  AP^bb  AP^bb_50  AP^bb_75

✓  bb  35.4  56.5  37.5  41.5  60.7  44.7
✓  mask-bb  34.9  55.6  37.0  39.5  58.6  41.6
   mask-bb  34.8  55.7  36.6  39.1  58.4  41.3
A.5 Qualitative Comparisons and Calibration
We show more results in Figs. 10 and 11. For these, and all visualizations in the main text, we display all detections that have a calibrated score of at least 0.6. We use a simple calibration that maps uncalibrated detector scores to precision values: for each model and for each category, we compute its precision-recall (PR) curve on val2017. As a PR curve is parameterized by score, we can map an uncalibrated score for the detector-category pair to its corresponding precision value. Score-to-precision calibration enables a fair visual comparison between methods using a fixed threshold.
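The calibration above can be sketched as follows. The true-positive flags `is_tp` are hypothetical inputs standing in for the detection-to-ground-truth matching step (which the text does not spell out), and the convention of returning precision 1.0 above the highest score is an assumption:

```python
import numpy as np

# Sketch of score-to-precision calibration: compute a PR curve from scored
# detections, then map any uncalibrated score to the precision at the
# operating point whose threshold equals that score. The `is_tp` flags are
# hypothetical inputs (produced by matching detections to ground truth).
def calibrate_scores(scores, is_tp, query_scores):
    scores = np.asarray(scores, dtype=float)
    is_tp = np.asarray(is_tp, dtype=float)
    order = np.argsort(-scores)          # sort detections by descending score
    thresh = scores[order]               # the PR curve is parameterized by score
    precision = np.cumsum(is_tp[order]) / np.arange(1, scores.size + 1)
    # for each query, index of the last detection with score >= query
    idx = np.searchsorted(-thresh, -np.asarray(query_scores, dtype=float),
                          side="right") - 1
    # no detection at or above the query score -> precision 1.0 by convention
    return np.where(idx >= 0, precision[np.clip(idx, 0, None)], 1.0)
```

In practice this is done per model and per category, so each detector-category pair gets its own PR curve.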
References
 [1] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
 [2] A. Arnab and P. H. Torr. Pixelwise instance segmentation with a dynamically instantiated network. In CVPR, 2017.
 [3] M. Bai and R. Urtasun. Deep watershed transform for instance segmentation. In CVPR, 2017.
 [4] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, et al. Hybrid task cascade for instance segmentation. arXiv:1901.07518, 2019.
 [5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.

 [6] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
 [7] J. Dai, K. He, Y. Li, S. Ren, and J. Sun. Instance-sensitive fully convolutional networks. In ECCV, 2016.
 [8] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016.
 [9] P. Dollár, Z. Tu, P. Perona, and S. Belongie. Integral channel features. In BMVC, 2009.
 [10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 2010.
 [11] R. Girshick. Fast R-CNN. In ICCV, 2015.
 [12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
 [13] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He. Detectron. https://github.com/facebookresearch/detectron, 2018.
 [14] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017.
 [15] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014.
 [16] K. He, R. Girshick, and P. Dollár. Rethinking imagenet pretraining. arXiv:1811.08883, 2018.
 [17] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
 [18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [19] A. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, and C. Rother. InstanceCut: from edges to instances with multicut. In CVPR, 2017.
 [20] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
 [21] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In CVPR, 2017.
 [22] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
 [23] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In ICCV, 2017.
 [24] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
 [25] S. Liu, J. Jia, S. Fidler, and R. Urtasun. SGN: Sequential grouping networks for instance segmentation. In ICCV, 2017.
 [26] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path aggregation network for instance segmentation. In CVPR, 2018.
 [27] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
 [28] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
 [29] G. Neuhold, T. Ollmann, S. Rota Bulo, and P. Kontschieder. The mapillary vistas dataset for semantic understanding of street scenes. In ICCV, 2017.
 [30] C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun. MegDet: A large minibatch object detector. In CVPR, 2018.
 [31] P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In NIPS, 2015.
 [32] P. O. Pinheiro, T.Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In ECCV, 2016.
 [33] J. Redmon and A. Farhadi. YOLO9000: better, faster, stronger. In CVPR, 2017.
 [34] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.

 [35] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 1958.
 [36] A. Rush. Tensor considered harmful. 2019.
 [37] R. Vaillant, C. Monrocq, and Y. LeCun. Original approach for the localisation of objects in images. IEE Proc. on Vision, Image, and Signal Processing, 1994.
 [38] K. E. van de Sande, J. R. Uijlings, T. Gevers, and A. W. Smeulders. Segmentation as selective search for object recognition. In ICCV, 2011.
 [39] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, 2001.
 [40] S. Zagoruyko, A. Lerer, T.Y. Lin, P. O. Pinheiro, S. Gross, S. Chintala, and P. Dollár. A multipath network for object detection. In BMVC, 2016.