Code for ICCV2019 paper "InstaBoost: Boosting Instance Segmentation Via Probability Map Guided Copy-Pasting"
Instance segmentation requires a large number of training samples to achieve satisfactory performance and benefits from proper data augmentation. To enlarge the training set and increase its diversity, previous methods have investigated using data annotations from other domains (e.g. bbox, point) in a weakly supervised mechanism. In this paper, we present a simple, efficient and effective method to augment the training set using the existing instance mask annotations. Exploiting the pixel redundancy of the background, we are able to improve the performance of Mask R-CNN by 1.7 mAP on the COCO dataset and 3.3 mAP on the Pascal VOC dataset by simply introducing random jittering to objects. Furthermore, we propose a location probability map based approach to explore the feasible locations where objects can be placed based on local appearance similarity. With the guidance of such a map, we boost the performance of R101-Mask R-CNN on instance segmentation from 35.7 mAP to 37.9 mAP without modifying the backbone or network structure. Our method is simple to implement and does not increase the computational complexity. It can be integrated into the training pipeline of any instance segmentation model without affecting the training and inference efficiency. Our code and models have been released at https://github.com/GothicAi/InstaBoost
Instance segmentation aims to simultaneously perform instance localization and classification, and outputs pixel-level masks denoting the detected instances. It plays a vital role in computer vision and has many practical applications in autonomous driving, robotic manipulation, HOI detection [29, 36], etc. Recent research has proposed effective CNN (Convolutional Neural Network) architectures [28, 23] for the problem. To fully exploit the power of CNNs, a large amount of training data is indispensable. However, obtaining pixel-wise mask annotations is labor intensive, which limits the number of available training samples.
To tackle this problem, previous works utilize data from other domains and conduct weakly supervised learning to obtain extra information. These works mainly follow two lines: i) transforming annotations from other domains into object masks [12, 30] or ii) utilizing data from other domains as an extra regularization term [21, 4]. However, few of these works investigate leveraging the existing mask annotations to augment the training set.
Recently, crop-and-paste data augmentation has been exploited in the areas of instance detection and object detection. These methods crop objects using their masks and paste them onto a randomly chosen background, either at random positions or according to the visual context. However, such data augmentation methods do not work well for instance segmentation, as dataset priors are not efficiently exploited, resulting in poor performance in our experiments. Meanwhile, adopting a deep context model introduces significant computational overhead, making it less practical in real-world applications.
In this paper, we first propose a simple but surprisingly effective random augmentation technique. Inspired by the stochastic grammar of images, we paste each object in the neighborhood of its original position, with additional small jittering on scale and rotation. This method, which we call random InstaBoost, brings a 1.7 mAP improvement with Mask R-CNN on the COCO instance segmentation benchmark.
Further, we look back to the area of visual perception, from which we draw inspiration for a better-refined position transformation scheme. Previous research on Bayesian approaches to brain function shows that the brain's ability to extract perceptual information from sensory data can be modeled in terms of probabilistic estimation, and that visual inference requires prior experience of the world. These studies shed light on crop-paste data augmentation for instance segmentation.
Intuitively, there exists a probability map representing reasonable placements that align with real-world experience. Inspired by prior work, we link this probability map to an appearance consistency heatmap, which is based on local contour similarity, since the background usually has redundancy in continuous but non-aligned features. We sample feasible locations from the heatmap and conduct crop-paste data augmentation, reaching a total improvement of 2.2 mAP on the COCO dataset. This scheme is denoted as appearance consistency heatmap guided InstaBoost. An example of our appearance consistency heatmap is shown in Fig. 1.
We conduct exhaustive experiments on the Pascal VOC dataset and the COCO dataset. By augmenting through appearance consistency heatmap guided InstaBoost, we are able to achieve a 2.2 mAP improvement on COCO instance segmentation and 3.9 mAP on the Pascal VOC dataset.
Instance mask segmentation. Combining instance detection and semantic segmentation, instance segmentation [13, 23, 31, 42, 28, 41, 44, 34] is a much harder problem. Earlier methods either propose segmentation candidates followed by classification, or associate pixels on the semantic segmentation map into different instances. Recently, FCIS proposed the first fully convolutional end-to-end solution to instance segmentation, which predicts position-sensitive channels for instance segmentation. This idea was further developed in subsequent work, which outperforms competing methods on the COCO dataset. With the help of FPN and a precise pooling scheme named RoI Align, He et al. proposed the two-step model Mask R-CNN, which extends the Faster R-CNN framework with a mask head and achieves state-of-the-art results on instance segmentation and pose estimation tasks. Although these methods have reached impressive performance on public datasets, such heavy deep models are hungry for an extremely large amount of training data, which is usually not available in real-world applications. Furthermore, the potential of large datasets is not fully exploited by existing training methods.
Instance-level augmentation. One branch of recent work has emerged with more precise instance-level image augmentation, which has the potential to fully exploit the supervised information in the existing dataset [16, 26, 14, 15, 18, 25, 43]. Dwibedi et al. improved instance detection with a simple cut-and-paste strategy using extra instances that have annotated masks. Khoreva et al. generate pairs of synthetic images for video object segmentation using a cut-and-paste method. However, the object position is uniformly sampled and they only need to guarantee that changes between image pairs are kept small. Such a setting does not work for image-level instance segmentation, as we demonstrate in our experiments that randomly pasted objects decrease the segmentation accuracy. Another recent work proposed a context model to place segmented objects on backgrounds with proper context and demonstrated that it can improve object detection on the Pascal VOC dataset. Such a method requires training an extra model and preprocessing data offline. In this paper, we propose a simple but effective online augmentation method, which, to the best of our knowledge, is the first attempt that successfully improves overall accuracy on COCO instance segmentation.
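Given a cropped object patch from a specific image, the placement of that patch on the image can be defined by the affine transformation matrix

$$H = \begin{bmatrix} s\cos r & s\sin r & t_x \\ -s\sin r & s\cos r & t_y \end{bmatrix} \tag{1}$$

where $t_x$, $t_y$ denote the coordinate shift in the $x$- and $y$-axis respectively, $s$ denotes the scale variance and $r$ denotes the rotation in degrees. Thus, the placement can be uniquely determined by a 4D tuple

$$(t_x, t_y, s, r). \tag{2}$$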
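As a minimal sketch of how a 4D placement tuple maps to a 2x3 affine matrix, the snippet below builds the matrix and applies it to a point. The function names `placement_matrix` and `apply_placement`, the argument names `tx`, `ty`, `s`, `r_deg`, and the rotation sign convention are assumptions made for illustration; the text only states that the tuple holds two coordinate shifts, a scale and a rotation in degrees.

```python
import math


def placement_matrix(tx, ty, s, r_deg):
    """Return a 2x3 affine matrix for shift (tx, ty), scale s, rotation r_deg."""
    r = math.radians(r_deg)
    c, n = s * math.cos(r), s * math.sin(r)
    # Assumed convention: scale and rotate about the origin, then translate.
    return [[c, n, tx],
            [-n, c, ty]]


def apply_placement(matrix, x, y):
    """Map a point (x, y) through the 2x3 matrix."""
    (a, b, tx), (c, d, ty) = matrix
    return (a * x + b * y + tx, c * x + d * y + ty)


# The identity transform (no shift, unit scale, zero rotation) leaves points unchanged.
identity = placement_matrix(0.0, 0.0, 1.0, 0.0)
```

In practice the rotation and scaling would usually be taken about the object center rather than the image origin; that choice is left out here to keep the sketch short.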
From the view of the stochastic grammar of images, a probabilistic model can be defined on this 4D space to learn the natural occurrence frequency of objects and then sampled to synthesize a large number of configurations that cover novel instances in the test set. To this end, we define a probability density function $f(\cdot)$ measuring how reasonable it is to paste the object $O$ on the given image $I$ following a specific transformation tuple. Taking $(x_0, y_0)$ as the object's original coordinates and $(x, y) = (x_0 + t_x, y_0 + t_y)$ as the new coordinates after a shift $(t_x, t_y)$ with scale $s$ and rotation $r$, a probability map is defined on the set $\{(x, y, s, r)\}$, which is given as

$$P(x, y, s, r \mid I, O) = f(t_x, t_y, s, r \mid I, O). \tag{3}$$

The given image and object conditions will be omitted for simplicity in the following context. Specifically, the identity transform $(x_0, y_0, 1, 0)$, which corresponds to the original paste configuration, should have the highest probability, i.e.

$$P(x_0, y_0, 1, 0) = \max_{x, y, s, r} P(x, y, s, r). \tag{4}$$
Intuitively, in a small neighborhood of the identity transform, our probability map should also be high-valued, since images are usually continuous and redundant at the pixel level. Based on this observation, we propose a simple but effective augmentation approach, object jittering, which randomly samples transformation tuples from the neighboring space of the identity transform and pastes the cropped object following the affine transform in Eq. (1). Experimental results in Sec. 4.4 show the surprising effectiveness of this simple data augmentation strategy.
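Object jittering can be sketched as uniform sampling around the identity transform. In the snippet below, the function name `sample_jitter` and every default range (`shift_ratio`, `scale_range`, `max_rot_deg`) are hypothetical placeholders rather than the values used in the paper; the only grounded design choice is that the translation range is proportional to the object's width and height.

```python
import random


def sample_jitter(obj_w, obj_h, shift_ratio=0.1, scale_range=(0.9, 1.1),
                  max_rot_deg=5.0, rng=random):
    """Sample (tx, ty, s, r) uniformly from a neighborhood of the identity.

    All default ranges are hypothetical placeholders, not the paper's values.
    """
    # Translation range scales with the object size.
    tx = rng.uniform(-shift_ratio, shift_ratio) * obj_w
    ty = rng.uniform(-shift_ratio, shift_ratio) * obj_h
    # Scale and rotation are drawn independently around 1 and 0 degrees.
    s = rng.uniform(*scale_range)
    r = rng.uniform(-max_rot_deg, max_rot_deg)
    return tx, ty, s, r
```

Passing a seeded `random.Random` instance as `rng` makes the sampled augmentation reproducible, which can help when debugging a training pipeline.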
In addition, inspired by prior work, the feasible locations can be further extended without being restricted to the neighborhood of the original position if the background shares a similar pattern over a wide range. Therefore, we propose a simple appearance consistency heatmap to utilize the redundancy in continuous but non-aligned features of the background. With the guidance of this heatmap, we can maximize the utility of our object jittering.
We propose a simple but effective augmentation approach named random InstaBoost, which draws a sample from an instance segmentation dataset, separates its foreground and background using ground truth annotations aided by matting and inpainting, and applies a restricted random transform to generate an augmented image. InstaBoost generates visually appealing images, and experiments show the effectiveness of random InstaBoost, which achieves a 1.7 mAP improvement on COCO instance segmentation. Random InstaBoost mainly contains two steps: i) instance and background preparation via matting and inpainting and ii) a random transform sampled from the neighboring space of the identity transform.
Instance and background preparation. Given an image with ground truth labels for instance segmentation, we need to separate the target instance from the background, for which the annotations of an instance segmentation dataset already give sufficient information. However, in popular datasets, e.g. COCO, annotations are stored in the format of boundary points and edges, so the resulting outlines are jagged. To overcome this issue, matting is adopted to get a smoother outline with an alpha channel, which is much closer to the actual object boundary. In this manner, instances can be cut out of the original image properly.
After the cutting step, we get a reasonable instance patch and an incomplete background with an instance-shaped hole in it. An inpainting method is adopted to fill in such holes. Fig. 2 shows an example visualization of inpainting and matting.
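Once an instance patch with an alpha matte and an inpainted background are available, pasting reduces to alpha compositing. The following is a dependency-free sketch under simplifying assumptions: images are nested lists of grayscale floats, and the helper name `paste_with_alpha` is hypothetical. It does not reproduce the matting or inpainting methods themselves, only the final blending step.

```python
def paste_with_alpha(background, patch, alpha, top, left):
    """Alpha-composite `patch` onto a copy of `background` at (top, left).

    Images are nested lists of floats (grayscale for brevity); `alpha` holds
    matte values in [0, 1] with the same shape as `patch`.
    """
    out = [row[:] for row in background]
    for i, (p_row, a_row) in enumerate(zip(patch, alpha)):
        for j, (p, a) in enumerate(zip(p_row, a_row)):
            y, x = top + i, left + j
            # Pixels that fall outside the background are simply dropped.
            if 0 <= y < len(out) and 0 <= x < len(out[0]):
                # Soft matte edges blend the object with the new background.
                out[y][x] = a * p + (1.0 - a) * out[y][x]
    return out
```

A fractional alpha on the object boundary is what lets the smooth outline from matting avoid hard, jagged edges in the pasted result.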
Random transformation. With the 4D transformation tuple defined in Eq. (2), our simple but effective InstaBoost technique samples the tuple from a neighborhood of the identity transform. Slight blurring is introduced to the original image, which does not strongly violate the visual content of the original image, but at the same time provides additional supervision for training instance segmentation models.
In random InstaBoost, the feasible transformation of coordinates is restricted to the neighborhood of the original position. Its performance can be further improved with a more sophisticated metric on the image, i.e. the appearance consistency heatmap, which better refines the position where the new instance is pasted. Regarded as one implementation of the probability metric in Eq. (3), the appearance consistency heatmap evaluates similarity in RGB space between the background around any candidate position and the background around the original position. Examples of appearance consistency heatmaps on the COCO dataset are shown in Fig. 3. Each example in Fig. 3 consists of two images: the left one is the original image from the COCO dataset and the right one is the corresponding appearance consistency heatmap.
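We decompose $P(x, y, s, r)$ in Eq. (3) into three conditional probability functions $P_{xy}(x, y)$, $P_s(s)$ and $P_r(r)$, denoting the probability density functions w.r.t. the position $(x, y)$, the scale $s$ and the rotation $r$, respectively, whereby the formulation is simplified by assuming independence between $(x, y)$, $s$ and $r$:

$$P(x, y, s, r) = P_{xy}(x, y)\, P_s(s)\, P_r(r), \tag{5}$$

where $P_s(s)$ and $P_r(r)$ are the uniform distributions adopted by random InstaBoost in Sec. 3.2. The appearance consistency heatmap $h(x, y)$ is defined as the expectation of the probability map $P(x, y, s, r)$ given $(x, y)$, the input image $I$ and the object patch $O$, which is proportional to $P_{xy}(x, y)$:

$$h(x, y) = \mathbb{E}_{s, r}\big[P(x, y, s, r)\big] \propto P_{xy}(x, y). \tag{6}$$

Details of the appearance consistency heatmap are given as follows.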
Appearance descriptor. To measure the appearance similarity of an object patch pasted on two locations, we first need to define a descriptor which encodes the texture of the background in the neighbor area of the object. Intuitively, the influence of the ambient environment of the target instance on appearance consistency decreases with the increase of distance.
Based on this assumption, we define the appearance descriptor as the weighted combination of three fixed-width contour areas with different scales, which can be formulated as

$$D(c_x, c_y) = \{(C_i(c_x, c_y), w_i) \mid i \in \{1, 2, 3\}\}, \tag{7}$$

where $C_i(c_x, c_y)$ denotes contour area $i$ with weight $w_i$, given $(c_x, c_y)$ as the center of the instance. With $C_1$ being the innermost contour, we define $w_1 > w_2 > w_3$, emphasizing stronger consistency around the neighboring areas of the original object. Fig. 4 shows an example of the contour areas of the appearance consistency heatmap.
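One way to realize three fixed-width contour areas is to label background pixels by their distance to the instance mask. The sketch below is an assumed construction, not the paper's stated procedure: distance is counted in 8-connected steps from the mask, and the names `contour_rings` and `RING_WEIGHTS` are hypothetical. The 5-pixel width and the weights 0.4, 0.35 and 0.25 (inside to outside) follow the hyperparameters reported in the experiments.

```python
from collections import deque

# Weights from inside to outside, as reported in the hyperparameter settings.
RING_WEIGHTS = (0.4, 0.35, 0.25)


def contour_rings(mask, width=5, n_rings=3):
    """Label background pixels by ring index (1 = innermost), 0 elsewhere.

    `mask` is a nested list of 0/1 values; distance is measured in
    8-connected steps from the instance, which is an assumed definition.
    """
    h, w = len(mask), len(mask[0])
    dist = [[0 if mask[y][x] else None for x in range(w)] for y in range(h)]
    queue = deque((y, x) for y in range(h) for x in range(w) if mask[y][x])
    # Multi-source breadth-first search outward from every mask pixel.
    while queue:
        y, x = queue.popleft()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] is None:
                    dist[ny][nx] = dist[y][x] + 1
                    queue.append((ny, nx))
    rings = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            d = dist[y][x]
            if d and d <= width * n_rings:
                rings[y][x] = (d - 1) // width + 1
    return rings
```

The returned grid can then be paired with `RING_WEIGHTS` so that pixels closest to the object contribute most to the descriptor.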
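Appearance distance. In this part, the appearance distance is defined as a local appearance consistency metric between a pair of appearance descriptors, i.e. a pair of instance centers. Since we have already defined the appearance descriptor with three contour areas and corresponding weights, the appearance distance between $D_1 = D(c_{1x}, c_{1y})$ and $D_2 = D(c_{2x}, c_{2y})$ is defined as

$$d(D_1, D_2) = \sum_{i=1}^{3} \sum_{\substack{(x_1, y_1) \in C_{1i} \\ (x_2, y_2) \in C_{2i}}} w_i \, \Delta\big(I_1(x_1, y_1), I_2(x_2, y_2)\big), \tag{8}$$

where $C_{ki} = C_i(c_{kx}, c_{ky})$, $(x_1, y_1)$ and $(x_2, y_2)$ are corresponding pixels of the two contour areas, and $I_k(x, y)$ denotes the RGB value of image $k$ at pixel coordinate $(x, y)$. $\Delta$ can be any distance metric; Euclidean distance is adopted in our implementation.
An exception occurs when part of the appearance consistency effective area lies outside the background. In this situation, we set the appearance consistency distance of that pixel to infinity (and it is therefore ignored).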
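The weighted distance between the contour area around the original center and the same contour area shifted to a candidate center can be sketched as follows. This is an illustration under stated assumptions: the image is a nested list of RGB tuples, ring labels are given as a grid of integers (0 for non-ring pixels), both descriptors are taken on the same image, and a candidate whose ring leaves the image is given an infinite distance as a whole. The function name `appearance_distance` is hypothetical.

```python
import math


def appearance_distance(image, rings, weights, c0, c):
    """Weighted RGB distance between the ring pixels around c0 and around c.

    `image` is a nested list of (r, g, b) tuples, `rings` labels ring pixels
    around the original center c0 = (y0, x0), and c = (y, x) is a candidate.
    """
    h, w = len(image), len(image[0])
    dy, dx = c[0] - c0[0], c[1] - c0[1]
    total = 0.0
    for y in range(h):
        for x in range(w):
            ring = rings[y][x]
            if ring == 0:
                continue
            ny, nx = y + dy, x + dx
            if not (0 <= ny < h and 0 <= nx < w):
                # Assumed handling: the ring leaves the image, so the
                # candidate gets an infinite distance and is ignored.
                return math.inf
            # Euclidean distance in RGB space, weighted by ring index.
            total += weights[ring - 1] * math.dist(image[y][x], image[ny][nx])
    return total
```

Because corresponding pixels are related by a pure translation, the cost of one candidate is linear in the number of ring pixels.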
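Heatmap generation. By fixing one descriptor at the object's original position $(x_0, y_0)$ and scanning the appearance distance over all feasible centers $(x, y)$ in the image, a heatmap is produced over the candidate center positions $(x, y)$. Appearance distances are normalized and scaled via negative log to give the heatmap $h(x, y)$. The mapping is formulated as

$$h(x, y) = -\log \frac{d(x, y) - m}{M - m}, \tag{9}$$

where $d(x, y)$ is the appearance distance between the candidate center $(x, y)$ and the original position, $M$ represents the maximum distance over all candidate centers and $m$ represents the minimum distance. The heatmap is generated by applying this mapping to every pixel in the background image, with respect to the original instance's position $(x_0, y_0)$.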
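A hedged sketch of the distance-to-heat mapping is given below. It assumes the scaling is a negative logarithm of the min-max normalized distance, and it adds a small stabilizer `eps` (an assumption, not taken from the paper) so that the logarithm stays finite at the minimum distance. Candidates with infinite distance receive zero heat. The function name `distances_to_heatmap` is hypothetical.

```python
import math


def distances_to_heatmap(distances, eps=1e-6):
    """Map a grid of appearance distances to heat values via negative log.

    Infinite distances (invalid candidates) get heat 0. `eps` is an assumed
    stabilizer that keeps the logarithm finite at the minimum distance.
    """
    finite = [d for row in distances for d in row if math.isfinite(d)]
    if not finite:
        return [[0.0] * len(row) for row in distances]
    lo, hi = min(finite), max(finite)
    span = (hi - lo) or 1.0  # avoid division by zero on a flat distance grid

    def heat(d):
        if not math.isfinite(d):
            return 0.0
        normalized = (d - lo) / span
        # Small distances map to large heat; clamp tiny negative round-off.
        return max(0.0, -math.log(normalized * (1.0 - eps) + eps))

    return [[heat(d) for d in row] for row in distances]
```

With this mapping, positions whose surroundings closely match the original background receive the largest values, while the least similar valid position receives a value close to zero.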
Coordinate shift. Transformation is performed according to the 4D tuple introduced in Eq. (1, 2). As suggested in Eq. (6), heatmap values are proportional to the probability density function over the $(x, y)$ coordinates. Therefore, values in the appearance consistency heatmap are normalized and treated as probabilities, from which candidate points are sampled via the Monte Carlo method. Compared to random sampling from the uniform distribution, the feasible area for placing the new object grows significantly, while pasting the instance onto semantically inconsistent backgrounds is avoided. Such an operation on the heatmap introduces extra information for model training, which is an appealing feature for data augmentation.
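Sampling a paste position from the heatmap can be sketched with weighted random choice, treating heat values as unnormalized probabilities. The helper name `sample_center` is hypothetical, and the heatmap is assumed to be a nested list of non-negative floats.

```python
import random


def sample_center(heatmap, rng=random):
    """Draw one (y, x) position with probability proportional to its heat."""
    coords = [(y, x) for y, row in enumerate(heatmap) for x in range(len(row))]
    weights = [heatmap[y][x] for y, x in coords]
    if sum(weights) <= 0:
        raise ValueError("heatmap has no positive mass")
    # random.choices normalizes the weights internally.
    return rng.choices(coords, weights=weights, k=1)[0]
```

Positions with zero heat, such as candidates that were marked invalid, are never selected, so no separate rejection step is needed.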
Scaling and rotation. Scale and rotation parameters are sampled independently from uniform distributions in the neighborhood of the identity transform, as we assume independence among position, scale and rotation. This practice is identical to our implementation of random InstaBoost in Sec. 3.2.
Following the steps described in Section 3.3, we can generate a heatmap for any target instance. However, computing the heatmap is computationally inefficient, as it requires computing appearance consistency distances for each point in the original $W \times H$ effective area, where $W$ represents the width of the image and $H$ represents the height. The resulting time complexity of computing Eq. 8 over the whole image is unacceptable in real-world applications. Therefore, we calculate the similarity map after resizing the original images to a fixed size and then upsample the heatmap to the original image size through interpolation. With this acceleration strategy, the appearance consistency heatmap is calculated with high quality and at high speed, which is decisive for the implementation of our online InstaBoost algorithm.
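The upsampling half of this acceleration strategy can be sketched as a small interpolation routine: the heatmap is computed on a downsized image and then resized back to the original resolution. The snippet assumes bilinear interpolation with aligned corners, which is one reasonable choice; the text does not state which interpolation is used, and the function name `upsample_bilinear` is hypothetical.

```python
def upsample_bilinear(grid, out_h, out_w):
    """Resize a nested-list grid to (out_h, out_w) by bilinear interpolation."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        # Map output rows onto the input grid with corners aligned.
        fy = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(fy)
        y1 = min(y0 + 1, in_h - 1)
        wy = fy - y0
        row = []
        for j in range(out_w):
            fx = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(fx)
            x1 = min(x0 + 1, in_w - 1)
            wx = fx - x0
            # Interpolate along x on the two neighboring rows, then along y.
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out
```

Since the cost of scanning candidates grows with the number of pixels, computing the heatmap at a small fixed size and interpolating keeps the per-image overhead roughly constant regardless of the input resolution.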
Our InstaBoost data augmentation strategy can be integrated into the training pipeline of any existing CNN based framework. During the training phase, the dataloader takes an image and applies InstaBoost strategy with a given probability, together with other data augmentation strategies. Our implementation of InstaBoost only introduces little CPU overhead to the original framework, together with parallel processing of dataloader that guarantees the efficiency during the training phase.
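Applying an augmentation with a given probability inside a data loading pipeline can be sketched as a small callable wrapper. The class name `RandomApply` and the default probability are assumptions for illustration; `augment` stands for any function that takes a training sample and returns an augmented one, and is not the interface of the released InstaBoost package.

```python
import random


class RandomApply:
    """Apply `augment` to a sample with probability `p`, else pass it through."""

    def __init__(self, augment, p=0.5, rng=None):
        self.augment = augment
        self.p = p
        self.rng = rng or random.Random()

    def __call__(self, sample):
        # Draw once per sample so some images stay unaugmented.
        if self.rng.random() < self.p:
            return self.augment(sample)
        return sample
```

Because the wrapper runs on the CPU inside the loader, it can be executed in parallel worker processes alongside the other augmentation steps.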
A previous method investigated applying a context model to explicitly model the consistency of the object and background in semantic space. Different from that approach, our appearance consistency map does not consider semantic consistency explicitly, but enforces that the object is pasted at places with a background pattern similar to that of the original image. With such a tight constraint, although some configurations that are semantically consistent but present a different background pattern may be pruned, we can guarantee that the generated images are visually coherent in most cases. Compared to the context model, our method can generate images that are more photorealistic and display fewer blending artifacts, therefore introducing less noise when training the neural networks. Experimental results (Sec. 4.4, Sec. 4.5) show the superior performance of our method both qualitatively and quantitatively, with a much more efficient implementation.
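Table 1: Bounding box detection results on COCO (map guided InstaBoost).
|Model||InstaBoost||AP||AP50||AP75||APS||APM||APL|
|Mask R-CNN(Res-50-FPN)||map guided||40.5||62.0||44.2||23.0||42.7||51.8|
|Mask R-CNN(Res-101-FPN)||map guided||43.0||64.3||47.2||24.8||45.9||54.6|
|Cascade R-CNN(Res-101-FPN)||map guided||45.9||64.2||50.0||26.3||49.0||58.6|
Table 2: Instance segmentation results on COCO (map guided InstaBoost).
|Model||InstaBoost||AP||AP50||AP75||APS||APM||APL|
|Mask R-CNN(Res-50-FPN)||map guided||36.0||58.3||38.1||15.9||37.8||52.3|
|Mask R-CNN(Res-101-FPN)||map guided||37.9||60.9||40.2||17.0||40.0||54.7|
|Cascade R-CNN(Res-101-FPN)||map guided||39.5||61.4||42.9||21.2||42.5||52.1|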
The performance of models on both bounding box detection and instance segmentation is evaluated on popular benchmarks, including Pascal VOC with additional mask annotations from VOCSDS, and the COCO dataset.
Pascal VOC and VOCSDS. The original Pascal VOC dataset contains 17,125 images in 20 semantic categories with bounding box annotations. Of these, 2,913 images are annotated with instance masks for instance segmentation and semantic segmentation tasks. In this paper we adopt additional mask annotations from VOCSDS, with 11,355 images annotated with instance masks, and follow the existing train/test split, with 5,623 images for training and 5,732 for testing.
COCO dataset. The COCO dataset is the state-of-the-art evaluation benchmark for computer vision tasks including bounding box detection, instance segmentation, human pose estimation and captioning. COCO is a much larger-scale image set than Pascal VOC, with 80 categories and more than 200,000 labeled images. Objects in COCO are annotated with both bounding box and instance mask labels. It contains large numbers of small objects, complicated object-object occlusion and noisy backgrounds, which makes it challenging for augmentation methods to generate "fake" but visually coherent images that fully exploit the information in the dataset.
Nowadays, Mask R-CNN based methods are widely adopted for instance segmentation due to their promising performance and efficiency. In our experiments, we adopt the original Mask R-CNN and its variant Cascaded Mask R-CNN as our baseline networks. For Mask R-CNN, we experiment with both Res-50-FPN and Res-101-FPN backbones using an open implementation, while only Res-101-FPN is tested for Cascaded Mask R-CNN. Baselines are retrained using the corresponding open implementations. Experimental results reveal the generalizability of our augmentation approach.
Hyperparameters on COCO. For network training on the COCO dataset, we adopt the default configuration provided by the authors, modifying only the number of training epochs. We evaluated the network performance at four training lengths, equivalent to 1x, 2x, 3x and 4x the default number of epochs in their configuration. The reported results in Tab. 1 and Tab. 2 are obtained using training epochs. Analysis in Sec. 4.5 shows that the network improves substantially after adopting our InstaBoost, while it suffers from over-fitting without such data augmentation.
Hyperparameters on VOC. For the Pascal VOC dataset, we only test the performance of Res-50-FPN based Mask R-CNN to evaluate the effectiveness of our algorithm. We use learning rate to train iterations, then continue training for iterations with and iterations with . Other hyperparameters are kept unchanged from the Res-50-FPN training configuration on the COCO dataset.
Hyperparameters of InstaBoost. For our random InstaBoost, we need to set the range of each uniform distribution. For the translation, the ranges along the x- and y-axis are set proportional to the width and height of the object. The ratio is set as . For scaling, we set the range from to in our experiment. For the rotation, as described in Sec. 3.3.2, the rotation angle should be kept small. Thus, we set the range as . For the appearance descriptor, the three fixed-width contour areas are all 5 pixels wide and the weights for the contours are 0.4, 0.35 and 0.25 from inside to outside, respectively. For map generation acceleration, we set the fixed size as .
COCO dataset. InstaBoost is evaluated with state-of-the-art instance segmentation models on the popular COCO benchmark, on both the instance segmentation and bounding box detection tracks. Experimental results against competing methods are shown in Tab. 1 for bounding box detection and in Tab. 2 for instance segmentation. With InstaBoost, the performance of state-of-the-art models can be further improved on both bounding box detection and instance segmentation tasks.
VOC dataset. We report the instance segmentation results on the VOC dataset based on R-50-FPN Mask R-CNN in Tab. 3. We can see that the improvement on VOC is around mAP, indicating the effectiveness of our method on small datasets.
|Mask R-CNN||map guided||42.23||71.66||44.65||42.73||69.10||45.56|
Table 4: Sensitivity of segmentation AP, AP50 and AP75 on COCO to the translation ratio and scaling ratio.
We visualize some results of Mask R-CNN trained with and without InstaBoost in Fig. 6. We can see that with InstaBoost, Mask R-CNN predicts correct masks, while the vanilla model generates incomplete masks or misses the objects.
Comparison with context model. We compare our method with the previous state-of-the-art context model on COCO detection and instance segmentation. We adopt Res-101-FPN Mask R-CNN as the base network. Results are given in Tab. 5. They show that our data augmentation strategy achieves better performance on both tasks. Moreover, the context model requires an extra training step and offline data preprocessing before data augmentation, while our method can be integrated into the training pipeline without tedious preparation or affecting the training efficiency.
Comparison with random paste. To examine the decisive role that appearance consistency plays in InstaBoost, we compare our method with randomly pasting instances on the image without overlapping existing instances. Experiments are done with the Mask R-CNN(Res-50-FPN) framework on both the VOC and COCO datasets. Tab. 6 shows a performance degradation of 1.3 and 1.1 mAP compared to the original baseline on the instance segmentation task. These results are aligned with the findings of previous work.
Substantial improvement. We conduct experiments to validate the performance of the network using different numbers of training epochs with and without our InstaBoost. Results are shown in Fig. 7, where InstaBoost shows promising resistance to overfitting. Both detection and segmentation accuracy of the original Mask R-CNN stop increasing when epochs reaches . After applying the InstaBoost augmentation method, both accuracies continue to increase even with a large number of training epochs.
Sensitivity analysis InstaBoost has parameters translation ratio and scaling ratio to decide the extent of the augmentation. We vary these parameters and measure AP, AP50 and AP75 of segmentation task on COCO dataset, see Tab. 4. For translation ratio, AP is stable in range to , and drops a little when it approaches to . Scaling ratio is more sensitive than translation ratio, and a variation of can cause about - drop in AP. In our experiments, we set translation ratio to and scaling ratio to -.
Interior-boundary study. We compare Mask R-CNN trained with and without InstaBoost on interior and boundary masks respectively. Following an existing protocol, the interior and boundary masks are obtained from a trimap built from the edges of the ground truth mask. Results in Fig. 8 show that InstaBoost improves instance segmentation accuracy through both better interior detection and finer boundary prediction. The improvement on the instance boundary is more significant than on the interior part. Readers are referred to Sec. 5.1 and Fig. 4 of the referenced protocol for details of this evaluation.
This paper studies data augmentation techniques that address the lack of training data in instance segmentation. By uniformly sampling in the neighborhood of the identity transform in the 4D transformation tuple space, our simple but effective random InstaBoost achieves a 1.7 mAP improvement with Mask R-CNN on the COCO instance segmentation benchmark. We further devise InstaBoost with an appearance consistency heatmap, reaching a total improvement of 2.2 mAP on COCO instance segmentation. Our online implementation of InstaBoost can be easily embedded into existing instance segmentation frameworks, where a free-lunch improvement is offered with little CPU overhead.
This work is supported in part by the National Key R&D Program of China (No. 2017YFA0700800) and the National Natural Science Foundation of China under Grant 61772332.