Object detection is one of the key problems in computer vision. While there has been significant effort and progress in detecting generic object classes (e.g. detect all the phones in an image), relatively little research has been devoted to detecting specific object instances (e.g. detect this particular phone). Recent approaches on this topic [rad2017bb8, xiang2018posecnn, zakharov2019dpod, kehl2017ssd] have achieved very good performance in detecting object instances, even under challenging occlusions. By relying on textured 3D models as a way to specify the object instances to be detected, these methods train detectors tailored to those objects. Because they know the objects to be detected at training time, these approaches essentially overfit to the objects themselves, i.e. become specialized at detecting only them.
While this is a promising and still very active research direction, requiring knowledge of the objects to be detected at training time might not always be practical. For instance, if a new object needs to be detected, the entire training process must be started over, which implies first generating a full training dataset and then optimizing the network. Having to wait hours for a network to be usable is not the only potential limitation: requiring such a specialized network for each object can be a severe constraint for embedded applications with limited memory.
In this work, we explore the case of training a generic instance detector, where 3D models of the specific objects are only available at test time. This is akin to a line of work which has received less attention recently, that of template matching. These techniques scan the image over a dense set of sub-windows and compare each of them with a template representing the object. A canonical example is Linemod [hinterstoisser2011multimodal], which detects a 3D object by treating several views of the object as templates, and by efficiently searching for matches over the image. While very efficient, traditional template-matching techniques were typically quite brittle, especially under occlusion, and tended to yield relatively large amounts of false positives.
In this paper, we revive this line of work and propose a novel instance detection method. Using a philosophy akin to meta-learning [vinyals2016matching], our method learns to match the templates of an object instance given only its textured 3D model, by leveraging a large-scale 3D object dataset and a synthetic image rendering pipeline. At test time, our approach takes as input a textured 3D model of a novel, unseen object and is able to detect it from a single RGB image immediately, without any additional training (fig. 1).
To this end, our main contribution is the design of a novel deep learning architecture which learns to match a set of templates against a query image to find an instance of the object. Instead of matching pixel intensities directly, the network learns to match the template with the image in a joint embedding space, trained specifically for that purpose. Our approach is trained exclusively on synthetic data and operates using a single RGB image as input. Second, we introduce a series of extensions to the architecture which improve detection performance, such as tunable filters that adapt the feature extraction process to the object instance at hand. Through a detailed ablation study, we quantify the contribution of each extension. Third, we present extensive experiments demonstrating that our method can successfully detect object instances that were unseen during training. In particular, we report performance that significantly outperforms the state of the art on the well-known Linemod [hinterstoisser2012model] and Occluded Linemod [brachmann2014learning] datasets. Notably, we attain a mAP of 50.71%, which is almost 30% better than LINE-2D [hinterstoisser2011multimodal] and on par with learning-based methods that overfit on the object instance during training. We hope this approach sets a baseline for the problem of unseen instance detection, and spurs further research in this direction.
2 Related work
Our work is most related to two areas: object instance detection in RGB images, and 2D tracking in RGB images. These are discussed below.
Object instance detection.
Our work focuses on the framework of retrieving the 2D pose of a particular object instance given its textured 3D model. This is in contrast with well-known methods such as Faster-RCNN [ren2015faster] and SSD [liu2016ssd], which provide 2D poses of classes.
Detecting a particular object is challenging due to the large variety of objects that can be found in the wild. Descriptor-based and template-based methods have been useful in such a context, as generic features including gradient histograms [hinterstoisser2011multimodal] and color histograms [tjaden2017real] can be computed offline and then retrieved from an object codebook. However, such handcrafted representations tend to be brittle under occlusion and clutter.
To circumvent these limitations, we propose a novel network architecture that is trained to detect a generic object which is unavailable at training time. Our method is trained on a large set of 3D objects, but can generalize to new objects at test time. Our architecture bears resemblance to TDID [ammirato2018target], which uses a template to detect a particular instance of an object. We show through an ablation study that our method performs significantly better than their architecture on unknown objects.
Tracking in 2D images.
Our work also shares commonalities with image patch tracking, as it generally operates in the same context of detection with limited prior knowledge of the object to track and a fast turn-around. Here, we strictly focus the discussion on tracking approaches in 2D images that employ deep neural networks. Many such approaches use an in-network cross-correlation operation (sometimes denoted ⋆) between a template and an image in feature space to track the 2D position of an object in a temporal sequence [wang2019fast, dave2019learning, li2019target]. Additionally, recent 6-DOF trackers achieve generic instance tracking using simple renders from a 3D model [garon2018framework, li2018deepim, manhardt2018deep]. These methods are limited by the requirement of a previous temporal state in order to infer the current position. Our method takes inspiration from these two lines of work: it uses an in-network cross-correlation with renders of the 3D object as templates, in order to detect the position of a 3D model in the image without prior knowledge of its pose.
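As an illustration of the in-network cross-correlation idea shared by these trackers and our method, here is a minimal NumPy sketch (function names and shapes are our own; actual implementations run this on learned feature maps inside the network, typically as a batched convolution):

```python
import numpy as np

def xcorr(feat_map, template):
    """Slide a template over a feature map and return a correlation map.

    feat_map: (C, H, W) image features; template: (C, h, w) template features.
    Returns a (H-h+1, W-w+1) response map ("valid" cross-correlation), whose
    peak indicates the most likely 2D position of the templated object.
    """
    C, H, W = feat_map.shape
    _, h, w = template.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # inner product between the template and the window at (i, j)
            out[i, j] = np.sum(feat_map[:, i:i + h, j:j + w] * template)
    return out
```

The response map is maximal where the window best matches the template, which is what lets a detector (or tracker) localize the object in a single forward pass.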
3 Network architecture
We begin by presenting an overview of our proposed network architecture, depicted in fig. 2. Then, we separately discuss the two main stages of our architecture: 1) correlation and 2) object detection. The correlation stage borrows from classical template matching methods, where the template of an object is compared to the query image in a sliding-window fashion. The second stage is inspired from the recent literature in class-based object detection.
3.1 Architecture overview
We design an architecture that receives knowledge of the object as input, computes the template correlation as a first stage, and detects objects from the correlation results in a second stage. As shown in fig. 2, the network takes as input the RGB query image as well as two types of templates: 1) a global template used as an attention mechanism to specialize early features in the network; and 2) a local template that helps extract viewpoint-related features. Each template is an RGB image representing the rendered 3D object from a given viewpoint on a black background, concatenated with its binary mask to form a four-channel image. The templates are obtained with a fast OpenGL render of the object, with diffuse reflectance and ambient occlusion, lit by a combination of one overhead directional light and constant ambient lighting.
3.2 Correlation stage
The query image is first processed by a conventional backbone to extract a latent feature representation. The global template is fed to an “object attention branch” (OAB), whose task is to inject a set of tunable filters early into this backbone network such that the features get specialized to the particular object instance. The local template, on the other hand, is consumed by the “pose-specific branch” (PSB) to compute an embedding of the object. The resulting features are then correlated with the backbone features using simple cross-correlation operations. Note that at test time, the backbone (which accounts for 85% of the total computation) is processed only once per instance, while the second stage is computed for each template.
The role of the backbone network is to extract meaningful features from the query image. For this, we use a DenseNet121 [huang2017densely] model pretrained on ImageNet [deng2009imagenet]. Importantly, this network is augmented by adding a set of tunable filters between the first part of the backbone (its initial 7×7 convolution layer with stride 2) and the rest of the model. These tunable filters are adjusted by the object attention branch, described below.
Object attention branch (OAB).
The main idea behind the “Object Attention Branch” (OAB) is to guide the low-level feature extraction of the backbone network by injecting high-level information pertaining to the object of interest. The output of the OAB can be seen as tunable filters, which are correlated with the feature map of the first layer of the backbone network. The correlation is done within a residual block, similarly to what is done in Residual Networks [he2016deep].
Our ablation study in sec. 5.3 demonstrates that these tunable filters are instrumental in conferring to a fixed backbone the ability to generalize to objects not seen during training.
The OAB network is a SqueezeNet [iandola2016squeezenet] pretrained on ImageNet, selected for its relatively small memory footprint and good performance. In order to receive a 4-channel input (RGB and binary mask), an extra channel is added to the first convolution layer. The pretrained weights for the first three channels are kept and the weights of the fourth channel are initialized by the Kaiming method [he2015delving].
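The following NumPy sketch illustrates the tunable-filter idea in isolation (a hypothetical simplification with our own names: the real OAB produces the filters with a SqueezeNet, and the correlation runs inside a residual block of the backbone):

```python
import numpy as np

def depthwise_corr_same(feat, filters):
    """Per-channel 'same' correlation of feat (C, H, W) with filters (C, k, k)."""
    C, H, W = feat.shape
    k = filters.shape[-1]
    p = k // 2
    padded = np.pad(feat, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(feat)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(padded[c, i:i + k, j:j + k] * filters[c])
    return out

def oab_residual(feat, tunable_filters):
    """Residual injection of object-specific filters: the backbone features are
    modulated by filters predicted from the global template, then added back."""
    return feat + depthwise_corr_same(feat, tunable_filters)
```

Because the filters depend on the global template, the same frozen backbone produces different, instance-specialized early features for each target object.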
Pose-specific branch (PSB).
The role of the “pose-specific branch” (PSB) is to produce a high-level representation of the input template, which we refer to as embeddings. These are used to localize the target object in the query image, while taking into account the spatial information available in the local template. This search, based on learned features, is accomplished by depth-wise correlations and subtractions with filters applied on the backbone output feature map. This correlation/subtraction approach is inspired by [ammirato2018target], who demonstrated increased detection performance when combining these two operations with embeddings. Siamese-based object trackers [bertinetto2016fully, wang2019fast] also use correlations, but with embeddings of higher spatial resolution. We found it beneficial to merge these two concepts in our architecture, by using depth-wise correlations at two spatial resolutions: a global one, devoid of spatial information, and a larger one that preserves some of the spatial relationships in the template. We conjecture that the latter helps the network be more sensitive to a template's orientation, thus providing some cues about the object pose.
This PSB branch has the same structure and weight initialization as the OAB, but is trained with its own specialized weights. The branch outputs two embeddings, one at each of the two spatial resolutions. Depth-wise correlations and subtractions are applied between these embeddings and the feature maps extracted from the backbone; the results pass through subsequent convolutions (C1–C3) and are then concatenated.
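A minimal sketch of the correlation/subtraction combination for the global (1×1-style) embedding, with illustrative shapes and names of our own (in the network, the outputs feed the C1–C3 convolutions before concatenation):

```python
import numpy as np

def corr_and_subtract(feat, emb):
    """Combine depth-wise correlation and subtraction, as in TDID-style matching.

    feat: (C, H, W) backbone features; emb: (C,) global template embedding.
    Returns a (2C, H, W) tensor: per-channel product map and difference map.
    """
    e = emb[:, None, None]                 # broadcast over spatial dims
    corr = feat * e                        # depth-wise correlation (global)
    sub = feat - e                         # depth-wise subtraction
    return np.concatenate([corr, sub], axis=0)
```

Correlation responds to where the features align with the template, while subtraction highlights where they differ; concatenating both gives the detection head complementary evidence.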
At test time, no a priori knowledge about the pose of the target object is available. Therefore, the local template is replaced by a stack of templates generated from multiple viewpoints. The embeddings are precomputed in an offline phase and stored in GPU memory to save processing time (they never need to be recomputed).
3.3 Object detection stage
The second stage of the network deals with estimating object information from the learned correlation map. The architecture comprises a main task (bounding box prediction) and two auxiliary tasks (segmentation and center prediction). The latter two are used to improve training/performance.
Bounding box prediction.
Bounding box classification and regression heads predict the presence and location of the object, respectively (as in [lin2017focal]). The classification head predicts the presence/absence of the object for each anchor at every location of the feature map, while the regression head predicts a relative shift in location and size with respect to each anchor. We use 24 anchors per location: 8 scales (30, 60, 90, 120, 150, 180, 210 and 240 pixels) and 3 aspect ratios (0.5, 1 and 2). Both heads are implemented as 5-layer convolution branches [lin2017focal]. Following RetinaNet [lin2017focal], anchors with an intersection-over-union (IoU) of at least 0.5 with the ground truth are considered positive examples, those with an IoU lower than 0.4 are considered negatives, and anchors in between are ignored. At test time, bounding box predictions for all templates are accumulated, and predictions with an IoU above 0.5 are filtered by non-maximum suppression (NMS). Also, for each bounding box prediction, a depth estimate can be obtained by multiplying the depth at which the local template was rendered by the ratio between the local template size (124 pixels) and the predicted box size. Predictions with an estimated depth outside the range of [0.4, 2.0] meters, which fits most tabletop settings, are filtered out.
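The depth-based filtering described above reduces to a simple size ratio; the sketch below assumes the 124-pixel template size and the [0.4, 2.0] m range from the text (function names are ours):

```python
def estimate_depth(render_depth, bbox_size, template_size=124):
    """Depth from apparent size: the same object looks smaller when farther.

    render_depth: depth (m) at which the local template was rendered;
    bbox_size: largest side (px) of the predicted bounding box.
    """
    return render_depth * template_size / bbox_size

def keep_detection(render_depth, bbox_size, depth_range=(0.4, 2.0)):
    """Reject detections whose implied depth is outside a plausible range."""
    d = estimate_depth(render_depth, bbox_size)
    return depth_range[0] <= d <= depth_range[1]
```

For example, a box twice the template size implies the object is at half the render depth.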
Segmentation and center prediction.
The segmentation head predicts a pixel-wise binary mask of the object in the query image at full resolution, using 5 convolutional layers with bilinear upsampling between each. The center prediction head predicts the location of the object center at the same resolution as the correlation map: the correlation channels are compressed to a single-channel heatmap with a single convolution layer. This task enforces low correlation when the object is not present.
3.4 Loss Functions
As mentioned, our network is trained with a main task (bounding box detection) and two auxiliary tasks (segmentation and center prediction). The training loss is a weighted sum of the four corresponding terms:

ℒ = λ₁ℒ_seg + λ₂ℒ_center + λ₃ℒ_cls + λ₄ℒ_reg ,

where ℒ_seg is a binary cross-entropy loss for segmentation, ℒ_center is a regression loss on the object-center heatmap, ℒ_cls is a focal loss [lin2017focal] associated with the object presence classification, and ℒ_reg is a smooth-ℓ1 loss for bounding box regression. The multi-task weights λ₁–λ₄ were set empirically.
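A NumPy sketch of the four loss terms (the exact norm of the center-heatmap loss and the weight values are not specified here, so the ℓ2 norm and unit weights below are assumptions):

```python
import numpy as np

def bce(p, y):
    """Binary cross-entropy; p = predicted probabilities, y in {0, 1}."""
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss [lin2017focal]: down-weights easy examples."""
    pt = np.where(y == 1, p, 1 - p)
    a = np.where(y == 1, alpha, 1 - alpha)
    return float(np.mean(-a * (1 - pt) ** gamma * np.log(pt)))

def smooth_l1(x, t):
    """Smooth-L1 (Huber-like) loss used for box regression."""
    d = np.abs(x - t)
    return float(np.mean(np.where(d < 1, 0.5 * d ** 2, d - 0.5)))

def total_loss(seg, seg_gt, hm, hm_gt, cls_p, cls_y, reg, reg_t,
               w=(1.0, 1.0, 1.0, 1.0)):
    """Weighted multi-task loss; weights w are placeholders (assumed)."""
    l_seg = bce(seg, seg_gt)
    l_center = float(np.mean((hm - hm_gt) ** 2))  # assumed l2 heatmap loss
    l_cls = focal_loss(cls_p, cls_y)
    l_reg = smooth_l1(reg, reg_t)
    return w[0] * l_seg + w[1] * l_center + w[2] * l_cls + w[3] * l_reg
```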
4 Training data
In this section, we detail all information related to the input images (query and templates). In particular, we define how the synthetic images are generated and how the dataset is augmented during training.
4.1 Domain randomization training images
Our fully-annotated training dataset is generated with a physics-based simulator similar to [mitash2017self], in which objects are randomly dropped on a table in a physical simulation. Every simulation takes place in a simple cubic room (four walls, a floor and a ceiling) containing a table placed on the floor in the middle of the room. Inspired by the success of domain randomization [tobin2017domain, tremblay2018training], we introduce randomness into the simulation parameters in order to reduce the domain gap between synthetic and real images. The following parameters are randomized: the texture of the environment (walls, floor and table), lighting (placement, type, intensity and color), object material (diffuse and specular reflection coefficients), and the type of anti-aliasing.
Our physics-based domain randomization dataset is composed of 10,000 images. To generate these images, we run 250 different simulations with different sets of objects (between 4 and 13 objects in each simulation). In 50% of the simulations, objects are automatically repositioned to rest on their bottom/main surface to replicate a bias found in many tabletop datasets. For each simulation, 20 camera positions are randomly sampled on half-spheres of radius ranging from 0.8 to 1.4 meters, all pointing towards the table center with random angular offsets on each rotation axis. For each sampled camera position, two image variations are rendered: one with realistic parameters (containing reflections and shadows) as shown in fig. 3-(c), and the other without, as shown in fig. 3-(b). Tremblay et al. [tremblay2018deep] showed that using different kinds of synthetic images reduces the performance gap between synthetic and real images. Accordingly, we generated an additional set of 10,000 simpler renders using OpenGL. For this, we rendered objects in random poses on top of real indoor backgrounds sampled from the Sun3D dataset [xiao2013sun3d] (fig. 3-(a)).
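The camera sampling described above can be sketched as follows (a simplified assumption of uniform sampling on the upper hemisphere; the simulator's exact scheme, and the random angular offsets, may differ):

```python
import numpy as np

def sample_camera(rng, r_min=0.8, r_max=1.4):
    """Sample a camera position on a half-sphere above the table center.

    Radius is uniform in [r_min, r_max]; the direction is drawn uniformly on
    the unit sphere (normalized Gaussian) and folded onto the upper hemisphere.
    """
    r = rng.uniform(r_min, r_max)
    v = rng.normal(size=3)
    v /= np.linalg.norm(v)
    v[2] = abs(v[2])  # keep the camera above the table plane
    return r * v
```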
After the simulations, we keep the 6-degree-of-freedom pose of each object as the ground truth. We use the pose together with the 3D model to generate a visibility mask for the segmentation task, and project the center of the 3D model onto the image plane to generate the pose heatmap. The ground-truth heatmap is a 2D Gaussian with an amplitude of 1 and a variance of 5, centered at the projected center of the object, at an image resolution equivalent to the output of the network.
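The ground-truth center heatmap can be generated as below (amplitude 1 and variance 5 as stated in the text; the grid convention and function name are ours):

```python
import numpy as np

def center_heatmap(h, w, cx, cy, var=5.0):
    """2D Gaussian heatmap of amplitude 1 and variance `var` at (cx, cy).

    h, w: heatmap resolution (matching the network output);
    cx, cy: projected object center in heatmap coordinates.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * var))
```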
The following section describes the template generation procedure for training. Note that a different procedure is used to generate test time templates and is described in sec. 5.
4.2 Templates
We use 115 different textured 3D models, mainly found in the various datasets of the BOP benchmark for 6D pose estimation [hodan2018bop] (excluding Linemod [hinterstoisser2011multimodal], since it is used for evaluation). For each training iteration, one of the objects from the query image is selected as the target object and the others are considered as background.
The global template (input of the object attention branch) is a render of the target object from a random viewpoint. In an offline phase, 240 templates are generated for each 3D model by sampling 40 viewpoints on an icosahedron with 6 in-plane rotations per viewpoint. During training, one of the 240 templates is sampled randomly.
The local template (input of the pose-specific branch) is rendered by taking into account the pose of the target object in the query image. The template image thus matches the scene object perfectly. We also apply perturbations on the orientation of the template image by sampling a random rotation axis and rotation magnitude. We show the impact of using different rotation magnitudes in sec. 5.3, with best performance obtained when training with random rotations in the range of 20–30° added to the pose of the target object.
Both template types have a resolution of 124×124 pixels. To render consistent templates for objects of various sizes, we adjust the distance of the object so that its largest extent on the image plane falls in the range of 100 to 115 pixels. The borders are then padded to reach the final 124×124 size.
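A sketch of the scale-and-pad logic, assuming the 124-pixel template size mentioned in sec. 3.3 (we also assume the object is scaled toward the middle of the 100–115 px range; the actual renderer adjusts the object's distance rather than rescaling pixels):

```python
def fit_template(extent_px, out=124, lo=100, hi=115):
    """Compute a scale and padding so the object's largest extent lands in
    [lo, hi] px inside an out x out template.

    extent_px: largest side (px) of the object's projection at the original
    distance. Returns (scale, pad_before, pad_after).
    """
    target = (lo + hi) / 2.0          # aim for the middle of the allowed range
    scale = target / extent_px
    new = int(round(extent_px * scale))
    pad = out - new
    return scale, pad // 2, pad - pad // 2
```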
4.3 Data augmentation
Online data augmentation is applied to the synthetic images during training. We use the segmentation mask of the object in the query image to change some of its properties: we randomly change the hue, saturation and brightness of the object and of its template. We also apply augmentations to the whole query image, such as brightness shifts, Gaussian blur and noise, horizontal and vertical flips, and random translations and scaling. To minimize the risk of overfitting to color as the main characteristic of the template, a random hue is applied to the whole image and the template 50% of the time. Finally, 20% of the time we apply a motion blur effect by convolving a line kernel with the image, as described in [Dwibedi_2017_ICCV].
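A minimal construction of such a line kernel (kernel size and angle parametrization are our own choices; convolving the image with this kernel smears pixels along the line, approximating motion blur):

```python
import numpy as np

def line_kernel(size=9, angle_deg=0.0):
    """Normalized line kernel for a simple motion-blur effect.

    Draws a line of the given angle through the kernel center and normalizes
    so the blur preserves overall image brightness.
    """
    k = np.zeros((size, size))
    c = size // 2
    t = np.deg2rad(angle_deg)
    # rasterize the line by stepping along its direction
    for s in np.linspace(-c, c, 2 * size):
        i = int(round(c + s * np.sin(t)))
        j = int(round(c + s * np.cos(t)))
        k[i, j] = 1.0
    return k / k.sum()
```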
5 Experiments
In this section, we provide the training hyper-parameters, followed by details on the datasets and metrics used to evaluate our approach. We also describe the ablation studies that validate our design choices, and finally present an extensive evaluation against state-of-the-art methods.
5.1 Training details
The network is trained for 50 epochs with the AMSGrad variant [reddi2019convergence] of the Adam optimizer [kingma2014adam]. The learning rate is decayed by a factor of 0.1 at epochs 20 and 40, and we use weight decay and mini-batches of size 6. We use 1k renders as a validation set and the remaining 19k images of the generated dataset (OpenGL and physics-based) for training. In each epoch, the network is trained for 1300 iterations, sampling images with an 80/20 ratio from the physics-based and OpenGL renders respectively. Once training is done, the network with the smallest validation loss is kept for testing.
5.2 Datasets and metrics
We evaluate on the well-known Linemod [hinterstoisser2012model] and Occluded Linemod [brachmann2014learning] datasets. Linemod consists of 15 sequences of heavily cluttered scenes, where annotations are available for a single object per sequence. Occluded Linemod is a subset of Linemod, for which annotations of 8 objects were added by [brachmann2014learning]. In keeping with previous work, we keep only the prediction with the highest score for each object and use the following metrics.
We use the “2D bounding box” metric proposed in [brachmann2014learning]. The metric calculates the ratio of images for which the predicted bounding box has an intersection-over-union (IoU) with the ground truth higher than 0.5.
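The metric reduces to a per-image IoU test; a self-contained sketch (box convention and names are ours):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def bbox_hit(pred, gt, thresh=0.5):
    """One image counts as correct if the predicted box overlaps enough."""
    return iou(pred, gt) >= thresh
```

The reported score is then the fraction of test images for which `bbox_hit` is true.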
The standard mean average precision (mAP) is used to evaluate the performance of multi-object detection. To allow for direct comparison, we regroup the predictions made for different objects and apply non-maximum suppression on predictions with an IoU above 0.5. We use the same methodology as in [brachmann2014learning]: the methods are evaluated on 13 of the 15 objects of the Linemod dataset (the “bowl” and “cup” objects are left out). Of the remaining 13 objects, 4 never appear in the images, yet they are still queried and kept in the evaluation (as a way to evaluate robustness to missing objects). The mAP is therefore computed using all predictions on the 9 remaining objects.
5.3 Ablation studies
We now evaluate the design decisions made for the network architecture (table 1) and the pose used for the local template (table 2) through an ablation study. We also evaluate the effect of the number of templates used at test time (table 3).
The experiments in this section are computed on the Linemod dataset using the “2D bounding box” metric from [brachmann2014learning] and described in sec. 5.2. Evaluations in this section are done on a subset of 25% of the images of the full dataset. In each table, we report the difference in performance between the best performing variation and the others.
| Architecture variation | Δ performance (%) |
| --- | --- |
| Full architecture | 0.00 (ref) |
| w/o tunable filters (OAB) | −19.76 |
| w/o correlation (PSB) | −5.37 |
| w/o auxiliary tasks | −7.73 |
| TDID correlations [ammirato2018target] | −26.48 |
| SiamMask correlations [wang2019fast] | −12.93 |
Table 1 reports the relative performance to the full architecture when each of the proposed modules in sec. 3 are removed (one at a time). First, removing the tunable filters computed with the “object attention branch” results in the largest performance drop, resulting in a decrease of almost 20%. Second, removing the higher-resolution embeddings and auxiliary tasks reduces performance by approximately 5% and 8% respectively.
We also compare our approach with the techniques used in TDID [ammirato2018target] and SiamMask [wang2019fast] to correlate the template with the query image. Instead of implementing their exact specifications (which may differ in backbones, for example), we provide a fairer comparison by adapting our architecture to their main ideas. We thus use a siamese network with our DenseNet-121 backbone for both methods (since they propose a shared-weights approach), and remove the tunable filters from our object attention branch. For TDID, we only use the global embedding (with both depth-wise cross-correlation and subtraction), whereas only the spatially larger embedding is used for SiamMask. The same training procedure and tasks as in our proposed approach are used. Our proposed approach improves performance by a large margin when compared with these baselines.
Local template pose.
We evaluate the impact of perturbing the object orientation in the local template during training. A random rotation of 0° means that the local template contains the object at the same orientation as the object in the image. Perturbations are added by randomly sampling a rotation axis (in spherical coordinates) and a magnitude. A network is retrained for each level of random rotations. Table 2 shows that the optimal amount of perturbation is around 20–30°; deviating from this range results in decreased performance, down to −16% when a completely random rotation (up to 180°) is used.
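The perturbation sampling can be sketched with Rodrigues' rotation formula (the axis parametrization via a normalized Gaussian is our assumption; the paper samples the axis in spherical coordinates):

```python
import numpy as np

def random_rotation(rng, max_deg):
    """Rotation matrix for a random axis and a magnitude up to max_deg."""
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    ang = np.deg2rad(rng.uniform(0.0, max_deg))
    # Rodrigues' formula: R = I + sin(a) K + (1 - cos(a)) K^2
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(ang) * K + (1 - np.cos(ang)) * (K @ K)
```

This rotation is composed with the ground-truth object pose before rendering the local template.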
| Random rotations | performance (%) |
Template density and runtime.
The impact of providing various numbers of templates to the network at test time is evaluated, both in terms of accuracy and speed, in table 3. Timings are reported on an Nvidia GeForce GTX 1080Ti. To generate a varying number of templates, we first render templates from 16 pre-defined viewpoints spanning a half-sphere on top of the object. Each template then undergoes a varying number of in-plane rotations: 5 (80 templates), 10 (160 templates) and 20 (320 templates). The table also compares performance with that obtained with an oracle which provides a template with the ground-truth object pose.
| # of templates | performance (%) | runtime (s) |
5.4 Comparative evaluation
We report an evaluation on Linemod and Occluded Linemod (OL) in table 4 and compare with other state-of-the-art RGB-only methods. Competing methods are divided into two main groups: those that know the test objects at training time (“known objects”), and those that do not. Approaches such as [brachmann2016uncertainty, kehl2017ssd, zakharov2019dpod, hodan2019photorealistic] are all learning-based methods that were specifically trained on the objects. We could identify only [tjaden2017real] and [hinterstoisser2011multimodal] as not including a training step targeted towards specific object instances.
It is worth noting that even though [tjaden2017real] is classified as not needing known objects at training time, it still requires an initialization phase performed on real images (to build a dictionary of histogram features). As in [brachmann2016uncertainty], they thus use parts of the Linemod dataset as a training set that covers most of the object viewpoints. These methods therefore have an unfair advantage compared to our approach and Line-2D, since they leverage domain-specific information (lighting, camera, background) of the evaluation dataset.
Our method can be directly compared with Line-2D [hinterstoisser2011multimodal], which also uses templates generated from 3D models. On the standard Linemod dataset, Line-2D outperforms our method by 8.5% on the “2D bounding box” metric. However, our method outperforms Line-2D by around 30% in mAP on Occluded Linemod. It also provides competitive performance, on par with or close to all other methods that test on known objects and/or have access to real images. Note how the accuracy of Line-2D severely degrades under occlusion, while our approach remains robust. We show qualitative results of our approach on Occluded Linemod in fig. 4.
| Method | Known objects | Real images | Linemod (2D bbox, %) | OL (mAP, %) |
| --- | --- | --- | --- | --- |
| Brachmann et al. [brachmann2016uncertainty] | Yes | Yes | 97.50 | 51.00 |
| Hodan et al. [hodan2019photorealistic] | Yes | No | N/A | 55.90* |
| Tjaden et al. [tjaden2017real] | No | Yes | 78.50 | N/A |
This paper presents what we believe to be the first deep learning approach for unseen instance detection. Inspired by template matching, we propose an architecture which learns a feature embedding where correlation between object templates and a query image can subsequently be converted to bounding box detection. Our experiments show that while the network has not been trained with any of the test objects, it is significantly more robust to occlusion than previous template-based methods (30% improvement in mAP over Line-2D [hinterstoisser2011multimodal]), while being highly competitive with networks that are specifically trained on the object instances.
A main limitation of our approach is the false positives that arise from clutter with similar color and shape as the object, as shown in fig. 4. We hypothesize that our late correlation at low spatial resolution prevents the network from leveraging detailed spatial information to fit the object shape more accurately. Another limitation is speed. Currently, our approach requires 0.43 s to detect a single object instance in an image (c.f. table 3), and this time scales linearly with the number of objects. The culprit here is the object attention branch, which makes the backbone features instance-specific via the tunable filters, and thus needs to be recomputed for each object.
By providing a generic and robust 2D instance detection framework, this work opens the way for new methods that can extract additional information about the object, such as its full 6-DOF pose. We envision a potential cascaded approach, which could first detect unseen objects, and subsequently regress the object pose from a high-resolution version of the detection window.
This work was supported by the NSERC/Creaform Industrial Research Chair on 3D Scanning: CREATION 3D. We gratefully acknowledge the support of Nvidia with the donation of the GPUs used for this research.
Appendix A Per object performances
In table 4, we reported the performance of our approach on the Linemod [hinterstoisser2012model] and Occluded Linemod [brachmann2014learning] datasets. We extend the reported results by showing the performance of our approach on each object in tables 5 and 6. Objects with their corresponding indices are shown in fig. 5.
| 2D BBox metric (%) | 89.16 | 71.50 | 94.00 | 46.88 | 92.47 | 80.75 | 82.74 | 77.19 | 63.31 | 96.89 | 89.51 | 67.83 | 87.67 | 86.39 | 42.48 | Mean: 77.92 |
| Object ID | 1 | 2 | 5 | 6 | 8 | 9 | 10 | 11 | 12 | mAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Average Precision (%) | 36.58 | 55.92 | 73.49 | 29.18 | 55.20 | 77.48 | 52.79 | 16.26 | 59.52 | 50.71 |
Appendix B Domain randomization training images
Additional examples of domain randomization images generated with our simulator are shown in fig. 6.
Appendix C Qualitative results on Linemod dataset
We show examples of good and bad predictions on Linemod dataset in fig. 7.