Object recognition is fundamental to understanding our world. It underlies some of the most important topics in computer vision: classification [7, 33, 15], detection [13, 28], and segmentation [24, 14]. Although it has been studied extensively, current object recognition research mainly focuses on categories at a coarse semantic level, the class-level, high up in the object-category taxonomy tree (as shown in Fig. 1). Class-level categories, such as bottle or chair, are abstract and highly inclusive. Some researchers have gone further down to explore finer-grained object recognition [19, 39, 18, 10, 11]. However, this still does not cover the whole picture of object recognition. To complete the object-category taxonomy tree, we go to the finest level and extend the leaves to individual-level object recognition.
The individual-level means that no sub-category exists under it. Below the individual-level is the object entity, which is not a category but a concrete item in the real world. An individual-level category can represent a vast number of copies of an entity (a 500ml Coke bottle) or only one copy (a DIY chair, a signed baseball). Why do we need individual-level information for object recognition? Because it can be much more useful in practice than general category information. In many scenarios we need specific semantic information that only individual-level objects provide. For example, in a smart shop like Amazon Go (https://www.amazon.com/b?ie=UTF8&node=16008589011), we would like to buy a 500ml Diet Coke rather than a “bottle”. However, this is a non-trivial task even when a few images of the objects are given. We have to search for the object in a complex scene under occlusion and challenging lighting conditions, and distinguish the target from similar objects. The complex background and a 1L Diet Coke bottle can easily confuse fine-grained RGB-based instance matching or detection frameworks. The fine-grained setting assumes that sub-categories can be distinguished by discriminative parts, which may not hold at the individual-level: two objects can have the same appearance and differ only in size, and the categories are assigned according to other factors such as functionality (large clamp and small clamp) or market strategy (500ml Coke and 1L Coke).
In the class-level object recognition setting, each category has numerous training images from different individual objects. For a single individual object, however, it is not practical to collect many training images. Therefore, for individual-level object segmentation (abbreviated as “individual segmentation”), our training set consists of only several RGB-D images from different viewpoints for each object. In testing, we segment out the individual objects when they appear in images, possibly with occlusion and complex backgrounds. This new setting raises several challenges: 1) the number of training samples is small; 2) the backgrounds are unseen in training; 3) the use of depth is non-trivial; 4) the number of classes explodes.
To address these problems, we propose a pipeline named “Context Less-Aware” (CoLA). The annotation issue is bypassed by a cut-and-paste synthesis. Specifically, the synthesis produces object-predominated training images that contain only a small proportion of synthetic background context. The object-predominated design naturally carries information about object scales. As CNNs are not scale-invariant, conventional segmentation approaches (Mask R-CNN) must also learn the object scales for general category objects. In the synthetic setting, it is much easier to prepare training objects at different scales. We therefore decouple the scale learning through scale-aware training, which results in a performance boost. In the testing phase, we implement a sliding inference in a scale-by-scale manner. As the backgrounds of the testing environment are unseen during training, the model produces a great many false positives when the context is complex. We learn a novel depth verification module to address this issue. Moreover, the depth image encodes the 3D shape information of the objects, which also helps the classification of appearance-similar objects.
The evaluation is carried out on the YCB-Video dataset and on a more challenging “Supermarket-10K” dataset built by us. A detailed description of this dataset is given in Sec. 4.2.1. Extensive experiments demonstrate that the proposed approach significantly outperforms various baseline methods. To verify the robustness to class explosion, we conduct an experiment on 3000 individual objects using a dataset rendered from ShapeNet and achieve desirable results.
Our contributions are:
We explore the segmentation task at the finest-grained semantic level of the category taxonomy tree, the individual-level. As existing public datasets rarely support such a task, we build a dataset for it.
We propose a novel CoLA pipeline and equip Mask R-CNN with it to perform RGB-D instance segmentation, given a small number of training images for each individual object category.
2 Related Works
Instance segmentation is a fundamental task in computer vision and has been richly studied. Mainstream research focuses on class-level instance segmentation, including MNC, FCIS, Mask R-CNN and, more recently, PANet. For the individual segmentation in this work, the problem setting differs in two aspects: a) Semantic level of category. The individual-level is the finest-grained semantic level, so objects from two individual-level categories may look alike, or even the same. Such ambiguity rarely exists at higher semantic levels. b) Model training with useless context.
Fine-grained recognition deals with sub-category problems. Researchers have explored classification [10, 38] and detection at the fine-grained level. Our work can be regarded as the finest-grained instance segmentation. As the semantic level goes down the category taxonomy tree from root to leaf, the finer-grained a category becomes, the less general it is. Thus the generalization ability is a trade-off across semantic levels. At the class-level, models usually have the ability to generalize to unseen objects of known categories, whereas at the individual-level, “unseen objects of known categories” is a self-contradictory term. Hence the generalization of an individual-level category should mainly consider the intra-class variations (viewpoints, deformation, contamination) and the context variations (backgrounds, occlusion, lighting conditions).
Object Recognition By Instance Matching
is a template-based method for locating objects of interest in an image [5, 44, 16]. This setting is quite similar to ours. Such methods compute dense keypoint mappings between images, which can then be used to indicate object proposals. However, object matching usually takes a long time on complex images and generally has trouble handling occlusion, complex backgrounds, and intra-class variations. With CNN-based features, recently proposed template matching methods [29, 30, 26] address these problems to some extent; however, matching still requires multiple passes even for one object with multiple viewpoints. We compare a basic object-matching-based segmentation with ours in Table 2.
Synthetic Data for Training
has gained more and more attention, since data hunger is a long-standing problem for researchers and real-world applications. One representative line of work addresses data acquisition by synthesizing training data. We roughly divide data synthesis approaches into two categories: 3D model-based and 2D image patch-based (the cut-and-paste approach). Render CNN, rendering and domain adaptation, and domain randomization [37, 2] are representative works of 3D model rendering synthesis. In 3D space, almost everything is controllable, such as camera viewpoints, object poses, lighting conditions, textures and so forth. However, a finely designed 3D model requires expertise in computer graphics and considerable time. Besides, details that are hard to simulate, such as camera noise, may cause a serious domain shift when the generated data is used as training samples.
Therefore, we follow the cut-and-paste path. Qi et al. generated road data by first cutting the instances of interest into a “memory bank” and then reorganizing samples from the “memory bank” to compose a full image. They tried to preserve context information, which is not quite suitable for individual segmentation: in object-centric tasks, the object locations and poses vary more than those of instances on the road, so the context is harder to preserve. Dwibedi et al. proposed a data synthesis pipeline for object instance detection, but they do not consider depth during synthesis. In addition, to reduce the influence of the context, we propose to synthesize object-predominated images.
Given a set of objects, we capture RGB-D images of each object from several viewpoints and segment out the instance patches. The goal of individual segmentation is to segment those objects when they appear in images. This setting differs greatly from conventional category-level instance segmentation, since it is not practical to collect thousands of images of an individual object under different scenes. In contrast, capturing several images of an object in one scene is much easier.
This task poses several challenges: 1) The object can be located in any background context of a complex scene, but we have no prior knowledge of those background contexts. False positives from the background can significantly compromise the performance. 2) Some objects have a similar appearance, such as the large clamp and the extra large clamp in the YCB-Video dataset. It is hard for a pure RGB-based method to distinguish them. 3) How to incorporate the 3D object information to improve performance is non-trivial.
In what follows, “object scale” refers to the area of the object bounding box in the image. “Object size” means the physical size of an object in the real world, indicated by the volume of its 3D bounding box.
To address these challenges, we propose a new framework named CoLA. In the training phase, our novel designs include object-predominated image synthesis (see Sec. 3.2) and object scale-aware training (see Sec. 3.3). The motivation is to shrink the background area around a complete object segment and to help synthesize the depth map with fewer artifacts. Since training is conducted on object-predominated images, we propose a sliding inference in the testing phase to find objects in different regions (see Sec. 3.4). A depth verification module is applied afterwards to reduce the false positives from the background and to correct the misclassification of appearance-similar objects (see Sec. 3.5).
3.2 Object-Predominated Image Synthesis
Typical cut-and-paste methods synthesize full-range images with multiple objects and randomly selected backgrounds. However, this practice introduces considerable context noise for segmentation tasks. Thus, we propose to synthesize object-predominated images.
An object-predominated image is an image in which a foreground object occupies at least 30% of the pixels. The remaining pixels can be other contextual content that is not related to the foreground object. Some examples are illustrated in Fig. 4.
Object-Predominated Image Producing
Every object instance is photographed from multiple viewpoints on a turntable in advance. The instance segment can be obtained effortlessly when the photos are taken against a contrasting background. Aside from the objects of interest, we also prepare objects of no interest as distractors. We then have object segments and distractor segments from which to generate the object-predominated images. We obtain background images by randomly cropping natural images. One target object segment and some distractor segments are pasted with different layouts onto those background images, under the condition that the target object segment takes up at least 30% of the pixels. A distractor may occlude the target object to capture occlusion cases, so training on those images can handle occlusion to some extent. In the experiments, the viewpoints are obtained by taking snapshots at several camera heights. At the first three heights the turntable is rotated step by step, yielding multiple snapshots per height, while fewer snapshots are taken at the remaining height. The distractor objects are randomly selected from daily objects that are not among the objects of interest.
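The compositing step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`paste_object`, `visible_mask`, `is_object_predominated`) are hypothetical, the object is assumed to fit entirely inside the background crop, and only the 30% occupancy condition is taken from the text.

```python
import numpy as np

def paste_object(background, obj_rgb, obj_mask, top_left):
    """Paste an object segment onto a background crop.

    background: (H, W, 3) uint8, obj_rgb: (h, w, 3) uint8,
    obj_mask: (h, w) bool, top_left: (row, col) paste position.
    Assumes the segment fits inside the background (no truncation).
    Returns the composite image and the full-size object mask.
    """
    out = background.copy()
    full_mask = np.zeros(background.shape[:2], dtype=bool)
    r, c = top_left
    h, w = obj_mask.shape
    region = out[r:r + h, c:c + w]        # view into the composite
    region[obj_mask] = obj_rgb[obj_mask]  # copy only the object pixels
    full_mask[r:r + h, c:c + w] = obj_mask
    return out, full_mask

def visible_mask(target_mask, occluder_mask):
    """Part of the target that is still visible after a distractor is pasted on top."""
    return target_mask & ~occluder_mask

def occupancy(mask):
    """Fraction of image pixels covered by the mask."""
    return float(mask.sum()) / mask.size

def is_object_predominated(target_mask, min_ratio=0.30):
    """Check the occupancy condition (at least 30% of the pixels)."""
    return occupancy(target_mask) >= min_ratio
```

A synthesis loop would sample a layout, paste the target and the distractors, and keep the image only if `is_object_predominated` holds for the visible target mask; whether the condition is checked before or after occlusion is an assumption of this sketch.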
Deep learning frameworks may over-fit when the number of training samples is small. Previous works [12, 8] have shown that proper data augmentation techniques can help generalization. Unlike the practice in [12, 8], which changes the image appearance drastically, we treat augmentations that affect the RGB pattern carefully. Since objects of different individual-level categories can have similar shapes, drastic data augmentation may damage the RGB patterns and thus cause worse results. We apply color jittering with Gaussian random noise and saturation and hue jittering (here and hereinafter, random parameters are drawn from Gaussian or uniform distributions); the blur methods are similar to prior work but without Poisson blur. Other augmentations, which do not affect the appearance, are rotation by randomly sampled angles, horizontal and vertical flipping, and allowing truncation.
3.3 Scale-Aware Training
Consider a typical instance segmentation setting. The input image is denoted as $I$, and the objects in $I$ have different scales $S = \{s_1, \dots, s_n\}$, where $n$ is the number of object scales that appear in the image. $M$ is the predicted segmentation mask given $I$.
Conventional instance segmentation methods (Mask R-CNN) seek a scale-invariant representation. That is, they assume the object scale is unknown beforehand and must in effect predict the object segmentation and the object scale simultaneously, estimating $P(M, S \mid I)$. In this setting, neurons have to co-adapt to multi-scale objects. By contrast, if we know the object scales in advance, we only need to predict the segmentation at a given scale, that is, $P(M \mid I, s_i)$, which is much easier to learn. We compare these two kinds of multi-scale learning in Table 4. The results show that the latter is more suitable for individual segmentation.
Scale-Aware Training Strategy
We train 3 scale-specific networks, each for a specific input resolution of object-predominated images. Each model outputs segmentation masks and category confidences. This approach can be viewed as an ensemble of scale-specific networks. In the experiments, the object pixel occupation is randomly sampled from a fixed range when generating the images. By the definition of the predominating object, the pixel occupation could be anything from 30% upwards; however, a lower upper bound helps to control the object scales. With such a bound, the object scales at each of the three resolutions fall into rather narrow ranges compared to conventional methods [34, 35].
3.4 Sliding Inference
The object scale in the testing phase is unknown in the generic setting. Efforts have been made to determine the scales before inference by scale selection. However, existing scale selection requires forwarding the network several times for each object. We propose a novel scale selection method that requires only one pass to obtain the scale information for all objects in the testing image.
For faster scale selection, we propose an adaptive scale selection technique that applies EdgeBox to depth images. Proposals generated by EdgeBox are used as indicators of object scales in testing images. We run EdgeBox on depth rather than RGB images because it is sensitive to the texture of objects, and depth images are “edge-less” compared with RGB images. As illustrated in Fig. 3, the proposal scales on depth images are more consistent with the object scales than those on RGB images. Given a depth image, we generate proposals and then divide these boxes into groups according to their sizes, each group being represented by its center scale. We found that there is a small scale variation within each group, but this variation does not degrade segmentation performance in our experiments. The grouping is done by k-means clustering on the object sizes with a fixed number of clusters.
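The grouping step can be sketched as a one-dimensional k-means (Lloyd's algorithm) over proposal scales. This is only an illustration under stated assumptions: boxes are given as `(x1, y1, x2, y2)`, the scale of a box is taken to be its area, the initial centers are data quantiles chosen here for determinism, and the number of clusters `k` is a free parameter of the sketch.

```python
import numpy as np

def box_scales(boxes):
    """Scale of each (x1, y1, x2, y2) box, taken here as the box area."""
    boxes = np.asarray(boxes, dtype=float)
    return (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])

def kmeans_1d(values, k, iters=50):
    """Cluster scalar values into k groups; returns (centers, labels)."""
    values = np.asarray(values, dtype=float)
    # Deterministic initialisation: evenly spaced quantiles of the data.
    centers = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        # Assign every value to its nearest center.
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = values[labels == j]
            if members.size:  # keep the old center for an empty cluster
                new_centers[j] = members.mean()
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
    return centers, labels
```

The returned centers play the role of the group center scales, and the labels tell which proposals belong to which group.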
During inference, we first resize the image according to the center scale of each group. For each center scale, we find the closest of the trained scales and resize the image so that the center scale matches that trained scale. The resized images are then fed to the network for inference.
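A small sketch of this resizing rule follows. It assumes that scales are bounding-box areas in pixels (as defined earlier for “object scale”), so that side lengths are multiplied by the square root of the area ratio, and that “closest” means the smallest absolute difference; the function name `resize_plan` and the example numbers are hypothetical.

```python
import math

def resize_plan(image_hw, center_scales, trained_scales):
    """For each group center, pick the nearest trained scale and the resized image shape.

    image_hw: (height, width) of the test image.
    center_scales / trained_scales: box areas in pixels (an assumption of this sketch).
    Returns a list of (trained_scale, (new_height, new_width)).
    """
    h, w = image_hw
    plan = []
    for c in center_scales:
        nearest = min(trained_scales, key=lambda s: abs(s - c))
        # Areas scale with the square of the side length.
        factor = math.sqrt(nearest / c)
        plan.append((nearest, (round(h * factor), round(w * factor))))
    return plan

# Example with made-up numbers: two proposal groups, three trained scales.
plan = resize_plan((480, 640), [2500, 44100], [10000, 40000, 160000])
```

Each entry of the plan names the scale-specific network to use and the resolution at which the sliding inference would run for that group.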
3.5 Depth Verification Network
As the backgrounds in the testing environment are not accessible beforehand, the backgrounds for synthesizing the images are selected randomly from a large public dataset such as COCO. Nevertheless, if the testing background differs greatly from the randomly selected ones, the RGB-based model may be confused by the RGB pattern of the context. Besides, objects of different individual-level classes may resemble each other.
We propose a depth verification module to address these issues, as depth encodes not only the surface pattern of the backgrounds but also the object size in the real world. During inference, the segmentation network produces object segment masks with probabilities, as in Mask R-CNN. A mask may be coarse and may not crop the instance perfectly; however, it further reduces the context, and the content within the segment is mostly related to the object. Thus the depth information inside the coarse segment is mainly related to the object shape in the real world. We model the object shape by simply subtracting the mean value of the masked depth, which transforms the absolute depth into a relative depth and makes it an object-centered representation. We then forward the transformed object depth segment along with its RGB counterpart into a pretrained classification network. The depth verification network is adapted from ResNet18 by modifying the input to 4 channels, so that the depth is concatenated with the RGB information to predict the individual category label.
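The depth transformation can be sketched in a few lines of NumPy. This is an illustration rather than the authors' code: setting pixels outside the mask to zero and masking the RGB channels are assumptions of the sketch, and the names `relative_depth` and `rgbd_input` are hypothetical.

```python
import numpy as np

def relative_depth(depth, mask):
    """Subtract the mean depth inside the mask; pixels outside the mask are set to 0."""
    mask = mask.astype(bool)
    out = np.zeros_like(depth, dtype=np.float32)
    if mask.any():
        out[mask] = depth[mask] - depth[mask].mean()
    return out

def rgbd_input(rgb, depth, mask):
    """Stack the masked RGB and the relative depth into an (H, W, 4) array."""
    mask = mask.astype(bool)
    rgb_masked = rgb.astype(np.float32) * mask[..., None]
    rel = relative_depth(depth, mask)[..., None]
    return np.concatenate([rgb_masked, rel], axis=-1)
```

Because the mean is removed, the same object seen closer to or farther from the camera yields the same relative depth map, which is what makes the representation object-centered. The 4-channel array corresponds to the modified input of the classification network.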
Training the depth verification module
Thanks to the masks provided by the RGB segmentation, the depth verification module can be trained without knowing the testing backgrounds. As we already have RGB-D segments for each object, we can simulate coarse masks by randomly applying image morphology transformations such as dilation and erosion.
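As a rough sketch of this mask perturbation, the following NumPy code implements binary dilation and erosion with a 3x3 square structuring element and applies one of them at random. The structuring element, the iteration range and the helper name `coarsen` are assumptions made for illustration.

```python
import numpy as np

def dilate(mask, iterations=1):
    """Binary dilation with a 3x3 square structuring element."""
    out = mask.astype(bool)
    h, w = out.shape
    for _ in range(iterations):
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        grown = np.zeros_like(out)
        # OR together the nine shifted copies of the mask.
        for dr in range(3):
            for dc in range(3):
                grown |= padded[dr:dr + h, dc:dc + w]
        out = grown
    return out

def erode(mask, iterations=1):
    """Binary erosion as the complement of dilating the background.

    Pixels outside the image are treated as foreground, so a mask touching
    the border is not eroded from the border side.
    """
    return ~dilate(~mask.astype(bool), iterations)

def coarsen(mask, rng, max_iter=3):
    """Randomly dilate or erode a clean mask to imitate a coarse predicted mask."""
    n = int(rng.integers(1, max_iter + 1))
    op = dilate if rng.random() < 0.5 else erode
    return op(mask, n)
```

A call such as `coarsen(clean_mask, np.random.default_rng(0))` returns a perturbed mask of the same shape, which can then be used to crop the RGB-D training segment.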
3.6 Compared with Instance Matching
Individual-level segmentation shares a few similarities with instance matching, but some significant differences between the two tasks should be noted.
Problem Goal: Instance matching seeks the corresponding points of two object images, while individual segmentation segments the objects of interest from the images pixel-wise. In fact, instance matching can be largely improved when it is given a segmented object that is occlusion-free and background-free. Therefore, individual segmentation can be considered a pre-stage of instance matching.
Pipeline: Conventional instance matching makes use of key-point matching, while our individual segmentation is an end-to-end system.
Effectiveness: Instance matching by key-point matching suffers from occlusion and complex background clutter (which damage the local descriptors). Individual segmentation can be relieved of such difficulties by training a CNN on synthetic data with different occlusions and backgrounds. The experiments verify this.
Efficiency: Every object has different viewpoints. Given several objects in an image, each with several viewpoint patterns, key-point-based instance matching must run one round for every object-viewpoint pair. By contrast, individual segmentation needs only a single image as input.
4 Experiment Details and Results
4.1 Experiment Details
Throughout all the experiments, Mask R-CNN is built with ResNet50-FPN-RPN-Heads. The experiments can be divided into two categories according to the form of the input images: object-predominated images and full-range images. The CoLA strategy relies on the former, hence the related experiments are carried out on object-predominated images. We also conduct full-range image experiments for comparison.
For full-range image experiments, networks are trained on images with resolution (800, 1333), which means that the shorter side of an image is at most 800 and the longer side is at most 1333. Testing images are resized in the same way for inference.
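The (800, 1333) rule can be made concrete with a short sketch. It follows the convention commonly used for Mask R-CNN style inputs, which is assumed here: the shorter side is scaled to 800 unless that would push the longer side beyond 1333, in which case the longer side is scaled to 1333. The function names are hypothetical.

```python
def resize_scale(h, w, min_side=800, max_side=1333):
    """Scale factor that keeps the aspect ratio under the two side limits."""
    scale = min_side / min(h, w)
    if scale * max(h, w) > max_side:
        # The longer side would exceed its limit, so it becomes the binding constraint.
        scale = max_side / max(h, w)
    return scale

def resized_shape(h, w, min_side=800, max_side=1333):
    """Resized (height, width), rounded to whole pixels."""
    s = resize_scale(h, w, min_side, max_side)
    return round(h * s), round(w * s)
```

For instance, a 480 x 640 image is limited by its shorter side, while a 400 x 1000 image is limited by its longer side.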
For object-predominated experiments, we collect cropped backgrounds as negative samples during training. We also changed the positive-to-negative sample ratio and found no significant difference. During inference, we let EdgeBox generate proposals on the depth image. Note that, as the proposals are used for scale clustering rather than object localization, a modest number of proposals is enough. If we applied EdgeBox to the RGB images instead, the number of proposals would have to be increased, because in complex backgrounds unsupervised proposal generation methods may be more sensitive to visual cues such as edge patterns. The proposals are grouped into clusters that correspond to the object scales on which the object-predominated images are trained.
Training is optimized by SGD with momentum, with a learning rate of 0.005 for full-range image experiments and 0.0002 for object-predominated image experiments, for both the instance segmentation framework and the depth verification module. The outputs of the networks are put through Soft-NMS before depth verification.
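For readers unfamiliar with Soft-NMS [1], the Gaussian variant can be sketched as follows. This is a generic illustration, not the configuration used in the experiments: the values of `sigma` and `score_thresh` are placeholders, and boxes are assumed to be `(x1, y1, x2, y2)`.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping boxes instead of removing them.

    Returns the indices in the order they were selected and their final scores.
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float).copy()
    remaining = np.arange(len(scores))
    keep, kept_scores = [], []
    while remaining.size:
        best = remaining[np.argmax(scores[remaining])]
        keep.append(int(best))
        kept_scores.append(float(scores[best]))
        remaining = remaining[remaining != best]
        if remaining.size == 0:
            break
        overlaps = iou_one_to_many(boxes[best], boxes[remaining])
        scores[remaining] *= np.exp(-(overlaps ** 2) / sigma)
        # Drop boxes whose decayed score has become negligible.
        remaining = remaining[scores[remaining] > score_thresh]
    return keep, kept_scores
```

Unlike hard NMS, a heavily overlapping detection is kept with a reduced score, so a later stage such as depth verification still gets the chance to examine it.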
The training set contains 30,000 images, including both positive and negative samples, for all segmentation tasks unless otherwise specified.
4.2 Evaluation on Supermarket-10K
4.2.1 Supermarket-10K Building Process
The YCB-Video dataset is essentially designed for the 6D pose estimation task, hence its object layout configuration is not very challenging for instance segmentation. Publicly available object-centric instance segmentation datasets are quite rare. To demonstrate the improvement brought by our CoLA strategy, we build a new dataset for individual segmentation.
We took 1500 images with an Intel RealSense D435 in supermarkets where the objects of interest are present, and labeled them with LabelMe. The objects are listed in Fig. 5. The dataset contains 15452 (15k) object instances, about 10 instances per image on average. Some representative samples from Supermarket-10K are shown in Fig. 6. The volume of the dataset may seem rather small, but it only serves as the validation set and the testing set, and it has more varied viewpoints, challenging backgrounds and distant objects. Hence we believe it can test the power of our CoLA strategy.
4.2.2 Does Background Matter?
As we take the cut-and-paste approach to synthesize training images, the selection of backgrounds may affect the results. We analyze the background effects by comparing different numbers of backgrounds from the same source and different background sources with the same number of backgrounds.
The study of the influence of the number of backgrounds is conducted on the COCO dataset. The number of backgrounds is 500, 1000, 5000, 10000 and 15000, respectively.
The study of background sources includes backgrounds similar to the testing images (taken from the real world). This experiment has to be conducted on the Supermarket-10K dataset since we cannot access the scenes in the YCB-Video dataset. Note that we recalculate the mean and variance over the whole training set whenever the backgrounds change. The results are shown in Table 1.
According to the background number experiments, the diversity of the backgrounds also influences training. When 15000 synthetic object-predominated images are generated from only 500 backgrounds selected from COCO, the network can easily overfit the background patterns. As the number of sampled background images grows, the performance stabilizes. In addition, the background type is relevant, as expected: the “real” backgrounds that come from the testing environment help the network achieve the best performance. Most of the performance gain comes from fewer false positives on the backgrounds. However, with depth verification, such false positives can also be fixed during the inference phase.
4.2.3 Comparison with Object Template Matching
We also compare our method with object template matching, because individual segmentation shares a few similarities with it. The biggest difference is the matching efficiency. Every object has different viewpoints. Given several objects in an image, each with several viewpoints, key-point-based instance matching must run one forward round for every object-viewpoint pair, each time with a different object patch. By contrast, individual segmentation needs only a single image as input.
To compare the performance of our CoLA for individual segmentation with that of object template matching, we take a classic instance matching technique as the baseline. It follows the procedure of feature matching and homography computation. Originally, this method can only be applied to one object in an image at a time. Thus we first apply EdgeBox and then search for the object in each proposal. Finally, we use GrabCut to obtain the foreground masks of the detected regions. This baseline is denoted “TM” in Table 2.
In addition, stronger CNN-based instance matching methods, namely NC-Net and LF-Net, are also compared in Table 2. For a fair comparison, as our method adopts the depth verification module to utilize the 3D shape information, we also allow the instance matching methods to be trained with depth (“+depth”). All the experiments are conducted on the Supermarket-10K dataset, and during inference they all use the same pretrained RPN. Without first segmenting out the proposals, keypoint-based instance matching can easily be confused.
| TM + depth | 24.3 | 40.9 | 30.2 |
| LF-Net + depth | 40.4 | 58.2 | 47.0 |
| NC-Net + depth | 42.8 | 57.4 | 48.6 |
Instance matching typically fails when the background is complex. Besides, it cannot reduce false positives, as the goal of instance matching is to match rather than to segment.
4.2.4 Ablative Study on Synthetic Data
We carried out the ablative study of synthetic data on Supermarket-10K, with the backgrounds for synthetic samples taken from COCO. We later fine-tune the model on real data; the number of real images used for fine-tuning is 500. A performance boost of 4 mAP is observed with fine-tuning on real images.
The Mask R-CNN baseline is carried out on full-range images. We compare it with variants that change the inputs to object-predominated images (“OP”), followed by scale-aware training (“SA”) and depth verification (“DVM”).
| + fine-tune on real | 54.9 | 81.2 | 62.7 |
4.3 Evaluation on YCB-Video dataset
The YCB-Video dataset provides real images with semantic segmentation masks. Since objects of the same category never show up at the same time in this dataset, we can convert the semantic segmentation masks to instance segmentation masks and benchmark our pipeline.
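Under that one-object-per-class property, the conversion amounts to splitting the label map by class id. A minimal sketch follows; the label-map encoding with 0 as background and the function name `semantic_to_instances` are assumptions, not the dataset's documented format.

```python
import numpy as np

def semantic_to_instances(label_map, background_id=0):
    """Split a semantic label map into one binary mask per class id.

    Valid as an instance segmentation only when each class appears
    at most once in the image, as is the case described above.
    """
    instances = {}
    for class_id in np.unique(label_map):
        if class_id == background_id:
            continue
        instances[int(class_id)] = (label_map == class_id)
    return instances
```

Each returned mask can be used directly as the ground-truth instance mask of its class when computing AP.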
4.3.1 Ablative Study on Real Data
Though there are over 90,000 labeled real images in the YCB-Video dataset, the images are redundant, as they are cut from video sequences and the variation in object pose and occlusion is rather limited. We sampled 3731 images (one every 30 frames) for training.
Since this dataset is not very challenging for instance segmentation, the differences among methods are not very significant. However, the experiment with the depth verification module shows a 27.5 AP boost on large clamps (denoted as “large”) and a 47.4 AP boost on extra large clamps (denoted as “extra-large”).
The Mask R-CNN baseline is carried out on full-range images; the variants change the inputs to object-predominated images (“OP”), apply scale-aware training and testing (“SA”), and add depth verification (“DVM”).
4.4 On the Explosion of Classes
We are aware of the explosion of classes in this task setting, as the individual-level is the finest-grained level. However, it is cost-prohibitive to construct a dataset with that many classes. To show that the network can operate decently on thousands of classes, we conduct a toy experiment that selects 3K objects from ShapeNet. For the details of this experiment, please refer to the supplementary materials.
We explore the segmentation task at the individual level, which is quite different from the class-level, the conventional instance segmentation task setting. It requires segmenting objects of interest out of complex scenes given only a few images of the objects. We propose a novel pipeline named CoLA to address this problem, since training normally has to be carried out on synthetic images with useless context. The depth verification module effectively reduces the false positives from the backgrounds and helps to distinguish appearance-similar objects that differ only in size.
- [1] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Soft-NMS - improving object detection with one line of code. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 5562–5570. IEEE, 2017.
- [2] J. Borrego, A. Dehban, R. Figueiredo, P. Moreno, A. Bernardino, and J. Santos-Victor. Applying domain randomization to synthetic data for object category detection. arXiv preprint arXiv:1807.09834, 2018.
- [3] G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000.
- [4] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University, Princeton University, Toyota Technological Institute at Chicago, 2015.
- [5] Y. Chen, L. Guibas, and Q. Huang. Near-optimal joint object matching via convex relaxation. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML’14, pages II–100–II–108. JMLR.org, 2014.
- [6] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016.
- [7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
- [8] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in neural information processing systems, pages 766–774, 2014.
- [9] D. Dwibedi, I. Misra, and M. Hebert. Cut, paste and learn: Surprisingly easy synthesis for instance detection. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
- [10] J. Fu, H. Zheng, and T. Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In CVPR, volume 2, page 3, 2017.
- [11] T. Gebru, J. Hoffman, and L. Fei-Fei. Fine-grained recognition in the wild: A multi-task domain adaptation approach. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 1358–1367. IEEE, 2017.
- [12] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. CoRR, abs/1811.12231, 2018.
- [13] R. Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
- [14] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the International Conference on Computer Vision (ICCV), 2017.
- [15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- [16] N. Hu, Q. Huang, B. Thibert, and L. J. Guibas. Distributable consistent multi-object matching. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
- [17] E. Jahangiri, E. Yoruk, R. Vidal, L. Younes, and D. Geman. Information pursuit: A bayesian framework for sequential scene parsing. arXiv preprint arXiv:1701.02343, 2017.
- [18] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, June 2011.
- [19] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.
- [20] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2359–2367, 2017.
- [21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing.
- [22] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path aggregation network for instance segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- [23] S. P. Lloyd. Least squares quantization in pcm. IEEE Trans. Information Theory, 28:129–136, 1982.
- [24] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
- [25] C. Lu, Y. Lu, H. Chen, and C.-K. Tang. Square localization for efficient and accurate object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2560–2568, 2015.
- [26] Y. Ono, E. Trulls, P. Fua, and K. M. Yi. Lf-net: learning local features from images. In Advances in Neural Information Processing Systems, pages 6237–6247, 2018.
- [27] X. Qi, Q. Chen, J. Jia, and V. Koltun. Semi-parametric image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8808–8816, 2018.
- [28] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
- [29] I. Rocco, R. Arandjelović, and J. Sivic. Convolutional neural network architecture for geometric matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  I. Rocco, M. Cimpoi, R. Arandjelović, A. Torii, T. Pajdla, and J. Sivic. Neighbourhood consensus networks. In Proceedings of the 32nd Conference on Neural Information Processing Systems, 2018.
-  C. Rother, V. Kolmogorov, and A. Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. In ACM transactions on graphics (TOG), volume 23, pages 309–314. ACM, 2004.
-  B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. Labelme: a database and web-based tool for image annotation. International journal of computer vision, 77(1-3):157–173, 2008.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  B. Singh and L. S. Davis. An analysis of scale invariance in object detection–snip. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3578–3587, 2018.
-  B. Singh, M. Najibi, and L. S. Davis. SNIPER: Efficient multi-scale training. In NIPS, 2018.
-  H. Su, C. R. Qi, Y. Li, and L. J. Guibas. Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
-  J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pages 23–30. IEEE, 2017.
-  Y. Wang, V. I. Morariu, and L. S. Davis. Learning a discriminative filter bank within a cnn for fine-grained recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4148–4157, 2018.
-  P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
-  Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. In Robotics: Science and Systems (RSS), 2018.
-  J. Xiao, A. Owens, and A. Torralba. Sun3d: A database of big spaces reconstructed using sfm and object labels. In Proceedings of the IEEE International Conference on Computer Vision, pages 1625–1632, 2013.
-  W. Xu, Y. Li, and C. Lu. Srda: Generating instance segmentation annotation via scanning, reasoning and domain adaptation. In The European Conference on Computer Vision (ECCV), September 2018.
-  N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based r-cnns for fine-grained category detection. In European conference on computer vision, pages 834–849. Springer, 2014.
-  X. Zhou, M. Zhu, and K. Daniilidis. Multi-image matching via fast alternating minimization. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV ’15, pages 4032–4040, Washington, DC, USA, 2015. IEEE Computer Society.
-  C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014.