Learning to Match Templates for Unseen Instance Detection

11/26/2019 ∙ by Jean-Philippe Mercier, et al. ∙ 19

Detecting objects in images is a quintessential problem in computer vision. Much of the focus in the literature has been on the problem of identifying the bounding box of a particular type of object in an image. Yet, in many contexts such as robotics and augmented reality, it is more important to find a specific object instance (a unique toy or a custom industrial part, for example) rather than a generic object class. Here, applications can require a rapid shift from one object instance to another, thus requiring fast turnaround which affords little-to-no training time. In this context, we propose a method for detecting objects that are unknown at training time. Our approach frames the problem as one of learned template matching, where a network is trained to match the template of an object in an image. The template is obtained by rendering a textured 3D model of the object. At test time, we provide a novel 3D object, and the network is able to successfully detect it, even under significant occlusion. Our method offers an improvement of almost 30 mAP over the previous template matching methods on the challenging Occluded Linemod dataset (overall mAP of 50.7). With no access to the objects at training time, our method still yields detection results that are on par with those of existing methods that are allowed to train on the objects. By reviving this research direction in the context of more powerful, deep feature extractors, our work sets the stage for more development in the area of unseen object instance detection.




1 Introduction

Figure 1: Overview of the proposed method. At test time, a network predicts the location of an object never seen during training from a set of templates obtained from a textured 3D model.

Object detection is one of the key problems in computer vision. While there has been significant effort and progress in detecting generic object classes (e.g. detect all the phones in an image), relatively little research work has been devoted to detecting specific object instances (e.g. detect this particular phone). Recent approaches on this topic [rad2017bb8, xiang2018posecnn, zakharov2019dpod, kehl2017ssd] have achieved very good performance in detecting object instances, even under challenging occlusions. By relying on textured 3D models as a way to specify the object instances to be detected, these methods train detectors tailored to these objects. Because they know the objects to be detected at training time, these approaches essentially overfit to the objects themselves, i.e. they become specialized at detecting only them.

While this is a promising and still very active research direction, requiring knowledge of the objects to be detected at training time might not always be practical. For instance, if a new object needs to be detected, then the entire training process must be started over. This implies first generating a full training dataset and then optimizing the network. Having to wait hours for a network to be usable is not the only potential limitation: requiring a specialized network for each object can also be a severe constraint for embedded applications with limited memory.

In this work, we explore the case of training a generic instance detector, where 3D models of the specific objects are only available at test time. This is akin to a line of work which has received less attention recently, that of template matching. These techniques scan the image over a dense set of sub-windows and compare each of them with a template representing the object. A canonical example is Linemod [hinterstoisser2011multimodal], which detects a 3D object by treating several views of the object as templates, and by efficiently searching for matches over the image. While very efficient, traditional template-matching techniques were typically quite brittle, especially under occlusion, and tended to yield relatively large amounts of false positives.

In this paper, we revive this line of work and propose a novel instance detection method. Using a philosophy akin to meta-learning [vinyals2016matching], our method learns to match the templates of an object instance given only its textured 3D model, by leveraging a large-scale 3D object dataset and a synthetic image rendering pipeline. At test time, our approach takes as input a textured 3D model of a novel, unseen object and is able to detect it from a single RGB image immediately, without any additional training (fig. 1).

To this end, our main contribution is the design of a novel deep learning architecture which learns to match a set of templates to the background to find an instance of the object. Rather than matching pixel intensities directly, the network learns to match the template with the image in a joint embedding space, trained specifically for that purpose. Our approach is trained exclusively on synthetic data and operates on a single RGB image as input. Second, we introduce a series of extensions to the architecture which improve the detection performance, such as tunable filters that adapt the feature extraction process to the object instance at hand. Through a detailed ablation study, we quantify the contribution of each extension. Third, we present extensive experiments demonstrating that our method can successfully detect object instances that were unseen during training. In particular, we report performance that significantly exceeds the state of the art on the well-known Linemod [hinterstoisser2012model] and Occluded Linemod [brachmann2014learning] datasets. Notably, we attain a mAP of 50.71%, which is almost 30 points higher than LINE-2D [hinterstoisser2011multimodal] and on par with learning-based methods that overfit on the object instance during training. We hope this approach sets the bar for the problem of unseen instance detection, and spurs further research in this direction.

Figure 2: Our proposed architecture, shown during training. In stage 1, the network learns to localize an object solely from a set of templates. Object-specific features are learned by the “object attention” and “pose-specific” branches, and are subsequently correlated/subtracted with the generic features of the backbone network. In stage 2, the network leverages the learned representation to perform different tasks: binary segmentation, center and bounding box prediction.

2 Related work

Our work is most related to two areas: object instance detection in RGB images, and 2D tracking in RGB images. These are discussed below.

Object instance detection.

Our work focuses on the framework of retrieving the 2D pose of a particular object instance given its textured 3D model. This is in contrast with well-known methods such as Faster-RCNN [ren2015faster] and SSD [liu2016ssd], which provide 2D poses of classes.

Detecting a particular object is challenging due to the large variety of objects that can be found in the wild. Descriptor-based and template-based methods were useful in such contexts, as generic features including gradient histograms [hinterstoisser2011multimodal] and color histograms [tjaden2017real] could be computed and then retrieved from an object codebook.

Recent progress in machine learning enabled the community to develop approaches that automatically learn features from the 3D model of an object using neural network [rad2017bb8, xiang2018posecnn, zakharov2019dpod, kehl2017ssd] or random forest [brachmann2016uncertainty] classifiers. While these methods perform exceptionally well on known benchmarks [hodan2018bop], they share the important limitation that training these deep neural networks requires a huge amount of labeled data tailored to the object instances to be detected. Consequently, gathering the training dataset for specific objects is both costly and time-consuming. Despite this, efforts have been made to capture such real datasets [brachmann2014learning, hinterstoisser2012model, hodan2017t, rennie2016dataset, doumanoglou2016recovering, tejani2014latent] and to combine them together [hodan2018bop]. A side effect is that it confines most deep learning methods to the very limited set of objects present in these datasets, as the weights of a network are specifically tuned to detect only a single [kehl2017ssd] or a few instances [kehl2017ssd, rad2017bb8]. The difficulty of gathering real data can be partially alleviated using simple rendering techniques [hinterstoisser2018pre, kehl2017ssd, rad2017bb8] combined with data augmentation such as random backgrounds and domain randomization [tobin2017domain, zakharov2019deceptionnet, tremblay2018training], but the resulting images still suffer from a domain gap with real images. Recently, Hodan et al. [hodan2019photorealistic] demonstrated that the domain gap can be minimized with more realistic, physics-based rendering. Despite this progress, all of the above methods share the same limitation, in that they all require significant time (and compute power) to train a network on a new object. This implies a slow turn-around time, where a practitioner must wait hours before a new object can be detected.

To circumvent these limitations, we propose a novel network architecture that is trained to detect a generic object that is unavailable at training time. Our method is trained on a large set of 3D objects, but can generalize to new objects at test time. Our architecture bears resemblance to TDID [ammirato2018target] that uses a template to detect a particular instance of an object. We show through an ablation study that our method performs significantly better than their architecture on unknown objects.

Tracking in 2D images.

Our work also shares commonalities with image patch tracking, as it generally operates in the same context of detection with limited prior knowledge of the object to track and fast turn-around. Here, we strictly focus the discussion on tracking approaches in 2D images that employ deep neural networks. Many such approaches propose to use an in-network cross-correlation operation (sometimes denoted as ) between a template and an image in feature space to track the 2D position of an object in a temporal sequence [wang2019fast, dave2019learning, li2019target]. Additionally, recent 6-DOF trackers achieve generic instance tracking using simple renders from a 3D model [garon2018framework, li2018deepim, manhardt2018deep]. These methods are limited by the requirement of a previous temporal state in order to infer the current position. Our method takes inspiration from these two lines of work by using the in-network cross-correlation and renders from the 3D object as a template in order to detect the position of a 3D model in the image without previous knowledge of its pose.

3 Network architecture

We begin by presenting an overview of our proposed network architecture, depicted in fig. 2. Then, we separately discuss the two main stages of our architecture: 1) correlation and 2) object detection. The correlation stage borrows from classical template matching methods, where the template of an object is compared to the query image in a sliding-window fashion. The second stage is inspired by the recent literature in class-based object detection.

3.1 Architecture overview

We design an architecture that receives knowledge of the object as input, computes the template correlation as a first stage, and detects objects from the correlation results in a second stage. As shown in fig. 2, the network takes as input the RGB query image as well as two types of templates: 1) a global template used as an attention mechanism to specialize early features in the network; and 2) a local template that helps extract viewpoint-related features. Each template is an RGB image representing the rendered 3D object from a given viewpoint on a black background, concatenated with its binary mask to form a four-channel image. The templates are obtained with a fast OpenGL render of the object with diffuse reflectance and ambient occlusion, lit by a combination of one overhead directional light and constant ambient lighting.

3.2 Correlation stage

The query image is first processed by a conventional backbone to extract a latent feature representation. The global template is fed to an “object attention branch” (OAB), whose task is to inject a set of tunable filters early into this backbone network such that the features get specialized to the particular object instance. The local template, on the other hand, is consumed by the “pose-specific branch” (PSB) to compute an embedding of the object. The resulting features are then correlated with the backbone features using simple cross-correlation operations. Note that at test time, the backbone (85% of the total computation) is processed only once per instance, while the second stage is computed for each template.

Backbone network.

The role of the backbone network is to extract meaningful features from the query image. For this, we use a DenseNet121 [huang2017densely] model pretrained on ImageNet [deng2009imagenet]. Importantly, this network is augmented by adding a set of tunable filters between the first part of the backbone (the initial convolution layer with stride 2) and the rest of the model. These tunable filters are adjusted by the Object Attention Branch, described below.

Object attention branch (OAB).

The main idea behind the “Object Attention Branch” (OAB) is to guide the low-level feature extraction of the backbone network by injecting high-level information pertaining to the object of interest. The output of the OAB can be seen as tunable filters, which are correlated with the feature map of the first layer of the backbone network. The correlation is done within a residual block, similarly to what is done in Residual Networks [he2016deep].

Our ablation study in sec. 5.3 demonstrates that these tunable filters are instrumental in giving a fixed backbone the ability to generalize to objects not seen during training.

The OAB network is a SqueezeNet [iandola2016squeezenet] pretrained on ImageNet, selected for its relatively small memory footprint and good performance. In order to receive a 4-channel input (RGB and binary mask), an extra channel is added to the first convolution layer. The pretrained weights for the first three channels are kept and the weights of the fourth channel are initialized by the Kaiming method [he2015delving].
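The channel-extension step can be illustrated with a minimal, framework-free sketch. It assumes convolution weights stored as nested lists of shape [out][in][k][k] and a Kaiming-normal initialization with standard deviation sqrt(2 / fan_in); the function name `add_mask_channel`, the toy weights and the choice of the normal (rather than uniform) variant are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def add_mask_channel(weights, seed=0):
    """Extend conv weights of shape [out][3][k][k] to [out][4][k][k].

    The three pretrained RGB slices are kept unchanged. The new fourth
    slice (for the binary mask) is drawn from a zero-mean normal with
    std = sqrt(2 / fan_in), where fan_in is computed for the widened
    4-channel kernel (an assumption of this sketch).
    """
    rng = random.Random(seed)
    k = len(weights[0][0])
    fan_in = 4 * k * k
    std = math.sqrt(2.0 / fan_in)
    extended = []
    for filt in weights:
        mask_slice = [[rng.gauss(0.0, std) for _ in range(k)] for _ in range(k)]
        extended.append(list(filt) + [mask_slice])
    return extended


# Hypothetical toy "pretrained" layer: 2 filters, 3 input channels, 3x3 kernels.
pretrained = [
    [[[0.1 * (c + 1)] * 3 for _ in range(3)] for c in range(3)]
    for _ in range(2)
]
extended = add_mask_channel(pretrained)
```

In a deep learning framework the same idea would amount to copying the pretrained tensor into the first three input channels of a wider convolution and initializing only the extra channel.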

Pose-specific branch (PSB).

The role of the “pose-specific branch” (PSB) is to produce a high-level representation of the input template, which we refer to as embeddings. These are used to localize the target object in the query image, while taking into account the spatial information available in the local template. This search, based on learned features, is accomplished by depth-wise correlations and subtractions with filters applied on the backbone output feature map. This correlation/subtraction approach is inspired by [ammirato2018target], which demonstrated increased detection performance when combining these two operations with embeddings. Siamese-based object trackers [bertinetto2016fully, wang2019fast] also use correlations, but with embeddings of higher spatial resolution. We found it beneficial to merge these two concepts in our architecture by using depth-wise correlations with embeddings of two different spatial resolutions. The first one is devoid of spatial information, whereas the second one preserves some of the spatial relationships in the template. We conjecture that this makes the network more sensitive to the template orientation, thus providing some cues about the object pose.

The PSB has the same structure and weight initialization as the OAB, but is trained with its own specialized weights. The output of that branch consists of two embeddings, one at each of the two spatial resolutions. Depth-wise correlations and subtractions are applied between the embeddings generated by this branch and the feature maps extracted from the backbone. The results all pass through subsequent convolutions (C1–C3) and are then concatenated.
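To make the two operations concrete, here is a minimal sketch of depth-wise correlation and subtraction on nested lists. It assumes feature maps of shape [C][H][W], embeddings of shape [C][kh][kw], and "valid" correlation without padding; the function names, the toy values and the kernel sizes used below are illustrative choices, not the sizes used in the paper.

```python
def depthwise_correlation(features, embedding):
    """Correlate channel c of `features` with channel c of `embedding`.

    No mixing across channels takes place, which is what makes the
    operation depth-wise. Output has shape [C][H-kh+1][W-kw+1].
    """
    out = []
    for fmap, kernel in zip(features, embedding):
        kh, kw = len(kernel), len(kernel[0])
        h, w = len(fmap), len(fmap[0])
        rows = []
        for y in range(h - kh + 1):
            row = []
            for x in range(w - kw + 1):
                row.append(sum(fmap[y + i][x + j] * kernel[i][j]
                               for i in range(kh) for j in range(kw)))
            rows.append(row)
        out.append(rows)
    return out


def depthwise_subtraction(features, embedding):
    """Subtract a single-value-per-channel embedding from every location."""
    return [[[v - kernel[0][0] for v in row] for row in fmap]
            for fmap, kernel in zip(features, embedding)]


# Toy single-channel feature map (hypothetical values).
features = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
pooled = [[[2]]]                # embedding without spatial extent
spatial = [[[1, 0], [0, 1]]]    # embedding that keeps some spatial layout

corr_pooled = depthwise_correlation(features, pooled)
corr_spatial = depthwise_correlation(features, spatial)
diff_pooled = depthwise_subtraction(features, pooled)
```

The pooled embedding only rescales each channel, while the spatial embedding responds to a local pattern, which matches the intuition that the latter carries cues about template orientation.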

At test time, the pose of the target object is not known a priori. Therefore, the local template is replaced by a stack of templates generated from multiple viewpoints. The embeddings are precomputed in an offline phase and kept in GPU memory, so they do not have to be computed again for each query image.

3.3 Object detection stage

The second stage of the network deals with estimating object information from the learned correlation map. The architecture comprises a main task (bounding box prediction) and two auxiliary tasks (segmentation and center prediction). The latter two are used to improve training/performance.

Bounding box prediction.

Bounding box classification and regression heads predict the presence and the location of the object, respectively (as in [lin2017focal]). The classification head predicts the presence or absence of the object for the anchors at every location of the feature map, while the regression head predicts a relative shift in location and size with respect to every anchor. In our method, each location has 24 anchors: 8 scales (30, 60, 90, 120, 150, 180, 210 and 240 pixels) combined with 3 aspect ratios (0.5, 1 and 2). Both heads are implemented as 5-layer convolution branches [lin2017focal]. As in RetinaNet [lin2017focal], anchors with an Intersection-over-Union (IoU) of at least 0.5 are considered positive examples, while those with an IoU lower than 0.4 are considered negatives. The remaining anchors, with an IoU between 0.4 and 0.5, are not used. At test time, bounding box predictions for all templates are accumulated and filtered by Non-Maximum Suppression (NMS) with an IoU threshold of 0.5. Also, for each bounding box prediction, a depth estimate can be obtained by multiplying the depth at which the local template was rendered by the ratio between the local template size (124 pixels) and the size of the predicted box. Predictions whose estimated depth falls outside the range of [0.4, 2.0] meters, which fits most tabletop settings, are filtered out.
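The anchor labeling rule and the depth-based filter described above can be sketched as follows. The scales, ratios, IoU thresholds, template size and depth range come from the text; the convention that a ratio means height over width at constant area, and the use of the largest box side as the "size" of a prediction, are assumptions of this sketch.

```python
import math

SCALES = [30, 60, 90, 120, 150, 180, 210, 240]  # pixels (from the text)
RATIOS = [0.5, 1.0, 2.0]                        # from the text
TEMPLATE_SIZE = 124                             # pixels (from the text)
DEPTH_RANGE = (0.4, 2.0)                        # meters (from the text)


def anchor_shapes():
    """Return one (width, height) per scale/ratio pair, with area scale**2.

    Assumption: ratio = height / width.
    """
    return [(s / math.sqrt(r), s * math.sqrt(r)) for s in SCALES for r in RATIOS]


def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def anchor_label(anchor, gt_box):
    """1 = positive (IoU >= 0.5), 0 = negative (IoU < 0.4), -1 = ignored."""
    overlap = iou(anchor, gt_box)
    if overlap >= 0.5:
        return 1
    if overlap < 0.4:
        return 0
    return -1


def estimate_depth(template_depth, box):
    """Depth of a detection from the depth at which its template was rendered.

    Assumption: the box "size" is its largest side, in pixels.
    """
    size = max(box[2] - box[0], box[3] - box[1])
    return template_depth * TEMPLATE_SIZE / size


def keep_prediction(template_depth, box):
    """Keep only predictions whose estimated depth lies in DEPTH_RANGE."""
    depth = estimate_depth(template_depth, box)
    return DEPTH_RANGE[0] <= depth <= DEPTH_RANGE[1]
```

For example, a template rendered at 1 m that yields a 62-pixel box gives an estimated depth of 2 m: the object appears half as large as in the template, so it is taken to be twice as far.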

Segmentation and center prediction.

The segmentation head predicts a pixel-wise binary mask of the object in the scene image at full resolution, and does so with 5 convolutional layers with bilinear upsampling between each one. The center prediction head predicts the location of the object center at the same resolution as the correlation map. The correlation channels are compressed to a single-channel heatmap with a single convolution layer. This task forces the correlation to be low when the object is not present.

3.4 Loss Functions

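As mentioned, our network is trained with one main task (bounding box detection) and two auxiliary tasks (segmentation and center prediction). As such, the training loss is a weighted sum of four terms:

$$\mathcal{L} = \lambda_{seg} \mathcal{L}_{seg} + \lambda_{hm} \mathcal{L}_{hm} + \lambda_{FL} \mathcal{L}_{FL} + \lambda_{reg} \mathcal{L}_{reg}$$

where $\mathcal{L}_{seg}$ is a binary cross-entropy loss for segmentation, $\mathcal{L}_{hm}$ is a regression loss for the prediction of the object center in a heatmap, $\mathcal{L}_{FL}$ is a focal loss [lin2017focal] associated with the object presence classification and $\mathcal{L}_{reg}$ is a smooth-$L_1$ loss for bounding box regression. The multi-task weights $\lambda$ were set empirically.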
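As an illustration of how the four loss terms described above could be combined, here is a minimal scalar sketch. The focal loss and smooth-L1 definitions follow their commonly used forms, with the defaults alpha = 0.25, gamma = 2 and beta = 1 taken from common practice rather than from this paper; the weights passed to `total_loss` are placeholders, since the values used by the authors are not recoverable from the text.

```python
import math


def binary_cross_entropy(p, y, eps=1e-7):
    """Cross-entropy for one predicted probability p and a label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Focal loss: cross-entropy down-weighted for well-classified examples."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def smooth_l1(x, beta=1.0):
    """Quadratic near zero, linear for large residuals."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def total_loss(l_seg, l_center, l_cls, l_reg, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four task losses (placeholder weights)."""
    return sum(w * l for w, l in zip(weights, (l_seg, l_center, l_cls, l_reg)))
```

The focal term is much smaller than plain cross-entropy for an easy positive (for instance p = 0.9), which is why it is a common choice when most anchors are easy negatives.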

4 Training data

In this section, we detail all information related to the input images (query and templates). In particular, we define how the synthetic images are generated and how the dataset is augmented during training.

4.1 Domain randomization training images

Figure 3: Examples from our domain randomization training set. In (a), objects are randomly placed in front of the camera and rendered using OpenGL with a background sampled from the Sun3D dataset [xiao2013sun3d]. In (b) and (c), a physical simulation is used to drop several objects on a table with randomized parameters (camera position, textures, lighting, materials and anti-aliasing). For each render, 2 variations are saved: one with simple diffuse materials and without shadows (b) and one with more sophisticated specular materials and shadows (c).

Our fully-annotated training dataset is generated with a physics-based simulator similar to [mitash2017self], in which objects are randomly dropped on a table in a physical simulation. Every simulation is done in a simple cubic room (four walls, a floor and a ceiling) containing a table placed on the floor in the middle of the room. Inspired by the success of domain randomization [tobin2017domain, tremblay2018training], we introduced more randomness to the simulation parameters in order to reduce the domain gap between synthetic and real images. The following parameters are randomized: the texture of the environment (walls, floor and table), lighting (placement, type, intensity and color), object material (diffuse and specular reflection coefficients) and the type of anti-aliasing.


Our physics-based domain randomization dataset is composed of 10,000 images. To generate these images, we run 250 different simulations with different sets of objects (between 4 and 13 objects in each simulation). In 50% of the simulations, objects are automatically repositioned to rest on their bottom/main surface to replicate a bias found in many tabletop datasets. For each simulation, 20 camera positions are randomly sampled on half-spheres of radius ranging from 0.8 to 1.4 meters, all pointing towards the table center with a random angular offset on each rotation axis. For each sampled camera position, two image variations are rendered: one with realistic parameters (containing reflections and shadows) as shown in fig. 3-(c) and the other without, as shown in fig. 3-(b). Tremblay et al. [tremblay2018deep] showed that using different kinds of synthetic images reduced the performance gap between synthetic and real images. Accordingly, we have generated an additional set of 10,000 simpler renders using OpenGL. For this, we rendered objects in random poses on top of real indoor backgrounds sampled from the Sun3D dataset [xiao2013sun3d] (fig. 3-(a)).
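The camera sampling and the resulting image count can be sketched as below. The radius range, the number of simulations, cameras and variations come from the text; sampling uniformly over the half-sphere surface and uniformly over the radius is an assumption of this sketch, since the text only states that positions are randomly sampled.

```python
import math
import random


def sample_camera_position(rng, r_min=0.8, r_max=1.4):
    """Sample a camera position on an upper half-sphere around the table center.

    Drawing the unit height uniformly in [0, 1] yields points distributed
    uniformly over the half-sphere surface (assumed distribution).
    """
    radius = rng.uniform(r_min, r_max)
    azimuth = rng.uniform(0.0, 2.0 * math.pi)
    z_unit = rng.uniform(0.0, 1.0)
    xy_unit = math.sqrt(1.0 - z_unit * z_unit)
    return (radius * xy_unit * math.cos(azimuth),
            radius * xy_unit * math.sin(azimuth),
            radius * z_unit)


rng = random.Random(0)
cameras = [sample_camera_position(rng) for _ in range(20)]

# Dataset size implied by the text: simulations x cameras x render variations.
N_SIMULATIONS, N_CAMERAS, N_VARIATIONS = 250, 20, 2
total_images = N_SIMULATIONS * N_CAMERAS * N_VARIATIONS
```

The product 250 x 20 x 2 reproduces the 10,000 physics-based images reported above.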


After the simulations, we keep the 6-degree-of-freedom pose of each object as the ground truth. We use the pose together with the 3D model to generate a visibility mask for the segmentation task, and project the center of the 3D model onto the image plane to generate the pose heatmap. The ground-truth heatmap is a 2D Gaussian with an amplitude of 1 and a variance of 5, centered at the projected center of the object, at an image resolution equivalent to the output of the network.
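A ground-truth heatmap of this kind can be generated with a few lines of code. The amplitude of 1 and variance of 5 come from the text; the isotropic Gaussian form, the pixel units of the variance and the toy 8 x 8 resolution are assumptions of this sketch.

```python
import math


def center_heatmap(height, width, cx, cy, variance=5.0, amplitude=1.0):
    """2D isotropic Gaussian centered at (cx, cy), peaking at `amplitude`."""
    return [
        [amplitude * math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * variance))
         for x in range(width)]
        for y in range(height)
    ]


# Hypothetical 8x8 output resolution with the object center at column 3, row 4.
heatmap = center_heatmap(8, 8, cx=3, cy=4)
```

The value decays smoothly away from the projected center, so a prediction that is off by one pixel is penalized less than one that is far away.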

4.2 Templates

The following section describes the template generation procedure for training. Note that a different procedure is used to generate test time templates and is described in sec. 5.

We use 115 different textured 3D models mainly found in the various datasets of the benchmark for 6D pose estimation [hodan2018bop] (excluding Linemod [hinterstoisser2011multimodal] since it is used for evaluation). For each training iteration, one of the objects from the query image is selected as the target object and the others are considered as background.

The global template (input of the object attention branch) is a render of the target object from a random viewpoint. In an offline phase, 240 templates are generated for each 3D model by sampling 40 viewpoints on an icosahedron with 6 in-plane rotations per viewpoint. During training, one of the 240 templates is sampled randomly.

The local template (input of the pose-specific branch) is rendered by taking the pose of the target object in the query image into account. Without perturbation, the template image thus matches the scene object perfectly. We also apply perturbations to the orientation of the template image by sampling a random rotation axis and rotation magnitude. We show the impact of using different rotation magnitudes in sec. 5.3, with the best performance obtained when training with random rotations in the range of 20–30 degrees added to the pose of the target object.
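The orientation perturbation can be sketched with the axis-angle (Rodrigues) formula. Sampling the axis uniformly on the sphere and the magnitude uniformly between 0 and a maximum angle is an assumption of this sketch; the text only states that a random axis and magnitude are sampled. All function names are illustrative.

```python
import math
import random


def random_axis(rng):
    """Uniform random unit vector, parameterized in spherical coordinates."""
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    s = math.sqrt(1.0 - z * z)
    return (s * math.cos(phi), s * math.sin(phi), z)


def axis_angle_to_matrix(axis, angle):
    """Rodrigues formula: 3x3 rotation matrix for a unit axis and an angle (rad)."""
    x, y, z = axis
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [[t * x * x + c, t * x * y - s * z, t * x * z + s * y],
            [t * x * y + s * z, t * y * y + c, t * y * z - s * x],
            [t * x * z - s * y, t * y * z + s * x, t * z * z + c]]


def matmul3(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def perturb_pose(rotation, max_deg, rng):
    """Left-multiply the ground-truth rotation by a small random rotation."""
    angle = math.radians(rng.uniform(0.0, max_deg))
    return matmul3(axis_angle_to_matrix(random_axis(rng), angle), rotation)


def rotation_angle_deg(r):
    """Rotation magnitude of a 3x3 rotation matrix, in degrees."""
    cos_a = (r[0][0] + r[1][1] + r[2][2] - 1.0) / 2.0
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
perturbed = perturb_pose(IDENTITY, max_deg=20.0, rng=random.Random(0))
```

With a maximum of 180 degrees, the perturbed template orientation becomes essentially unrelated to the object pose in the image, which corresponds to the fully random setting evaluated in the ablation study.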

Both template types have a size of 124 pixels (the local template size given in sec. 3.3). To render consistent templates from multiple objects of various sizes, we adjust the distance of the object so that its largest length on the image plane falls in the range of 100 to 115 pixels. The borders are then padded to reach the template size.

4.3 Data augmentation

Online data augmentation is applied to the synthetic images during training. We use the segmentation mask of the object in the query image to change some of its properties. We randomly change the hue, saturation and brightness of the object and its template. We also apply augmentations to the whole query image, such as brightness shifts, Gaussian blur and noise, horizontal and vertical flips, and random translations and scaling. To minimize the risk of overfitting to color as the main characteristic of the template, a random hue is applied to the whole image and the template 50% of the time. Finally, we apply a motion blur effect to the image 20% of the time by convolving the image with a line kernel, as described in [Dwibedi_2017_ICCV].
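The motion blur augmentation can be illustrated on a grayscale image stored as nested lists. The 20% probability comes from the text; the horizontal orientation of the line kernel, its length, the border clamping and the function names are assumptions made to keep the sketch short.

```python
import random


def line_kernel(length):
    """Horizontal line kernel: equal weights that sum to 1."""
    return [1.0 / length] * length


def motion_blur_rows(image, length):
    """Convolve every row with a horizontal line kernel (borders are clamped)."""
    kernel = line_kernel(length)
    half = length // 2
    blurred = []
    for row in image:
        width = len(row)
        out_row = []
        for x in range(width):
            acc = 0.0
            for k, weight in enumerate(kernel):
                xi = min(max(x + k - half, 0), width - 1)
                acc += weight * row[xi]
            out_row.append(acc)
        blurred.append(out_row)
    return blurred


def maybe_motion_blur(image, rng, probability=0.2, length=5):
    """Apply the blur with the given probability, else return the image as-is."""
    if rng.random() < probability:
        return motion_blur_rows(image, length)
    return image


streak = motion_blur_rows([[0.0, 0.0, 3.0, 0.0, 0.0]], length=3)
```

A single bright pixel is spread into a short streak along the kernel direction, which mimics the effect of camera or object motion during exposure.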

5 Experiments

In this section, we provide the training hyper-parameters followed by details on the datasets and metrics used to evaluate our approach. We also describe the various ablation studies that validate our design choices and finally present an extensive evaluation against state-of-the-art methods.

5.1 Training details

The network is trained for 50 epochs with the AMSGrad variant [reddi2019convergence] of the ADAM optimizer [kingma2014adam]. The learning rate is multiplied by 0.1 at epochs 20 and 40, and training uses weight decay and mini-batches of size 6. We use 1k renders as a validation set and the remaining 19k images of the generated dataset (OpenGL and physics-based) for training. In each epoch, the network is trained for 1300 iterations, and images are sampled with a ratio of 80/20 from the physics-based and OpenGL renders, respectively. Once training is done, the network with the smallest validation loss is kept for testing.
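The step schedule can be written as a small function. The milestones (epochs 20 and 40), the factor of 0.1, the 50 epochs, the 1300 iterations per epoch and the batch size of 6 come from the text; the base learning rate below is a placeholder, since its value is not recoverable here, and zero-based epoch indexing is an assumption.

```python
def learning_rate(epoch, base_lr, milestones=(20, 40), gamma=0.1):
    """Step schedule: multiply base_lr by gamma for each milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


HYPOTHETICAL_BASE_LR = 1e-4  # placeholder value, not taken from the paper
schedule = [learning_rate(e, HYPOTHETICAL_BASE_LR) for e in range(50)]

# Number of training samples drawn over the whole run.
EPOCHS, ITERATIONS_PER_EPOCH, BATCH_SIZE = 50, 1300, 6
samples_seen = EPOCHS * ITERATIONS_PER_EPOCH * BATCH_SIZE
```

With 19k training images, 390,000 sampled images correspond to roughly 20 passes over the training set, each time with different online augmentations.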

5.2 Datasets and metrics

We evaluate on the well-known Linemod [hinterstoisser2012model] and Occluded Linemod [brachmann2014learning] datasets. Linemod consists of 15 sequences of objects containing heavy clutter where the annotations of a single object are available per sequence. Occluded Linemod is a subset of Linemod, where annotations for 8 objects have been added by [brachmann2014learning]. Keeping in line with previous work, we keep only the prediction with the highest score for each object and use the following metrics.


Linemod.

We use the “2D bounding box” metric proposed in [brachmann2014learning]. The metric calculates the ratio of images for which the predicted bounding box has an intersection-over-union (IoU) with the ground truth higher than 0.5.
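The “2D bounding box” metric described above can be computed as follows. Boxes are assumed to be (x1, y1, x2, y2) tuples with one top-scoring prediction per image, and a missing detection is represented by `None`; these conventions and the function names are choices made for this sketch.

```python
def box_iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def bbox_2d_metric(predictions, ground_truths, threshold=0.5):
    """Fraction of images whose predicted box has IoU > threshold with the GT."""
    hits = sum(
        1 for pred, gt in zip(predictions, ground_truths)
        if pred is not None and box_iou(pred, gt) > threshold
    )
    return hits / len(ground_truths)


# Toy example: one exact match, one poor overlap, one missed detection.
preds = [(0, 0, 10, 10), (0, 0, 10, 10), None]
gts = [(0, 0, 10, 10), (5, 5, 15, 15), (0, 0, 1, 1)]
score = bbox_2d_metric(preds, gts)
```

In this toy example only the first image counts as correct, so the score is one third.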

Occluded Linemod.

The standard mean average precision (mAP) is used to evaluate the performance of multi-object detection. To allow for direct comparison, we regroup the predictions made for different objects and apply Non-Maximum Suppression with an IoU threshold of 0.5. We use the same methodology as in [brachmann2014learning]: the methods are evaluated on 13 of the 15 objects of the Linemod dataset (the “bowl” and “cup” objects are left out). Of the remaining 13 objects, 4 are never found in the images, yet those are still detected and kept in the evaluation (as an attempt to evaluate the robustness to missing objects). The mAP is therefore computed by using all the predictions on the 9 other objects.

5.3 Ablation studies

We now evaluate the design decisions made for the network architecture (table 1) and the pose used for the local template (table 2) through an ablation study. We also evaluate the effect of the number of templates used at test time (table 3).

The experiments in this section are computed on the Linemod dataset using the “2D bounding box” metric from [brachmann2014learning] and described in sec. 5.2. Evaluations in this section are done on a subset of 25% of the images of the full dataset. In each table, we report the difference in performance between the best performing variation and the others.

Network variant                          Performance difference (%)
Full architecture                        0.00 (ref)
w/o tunable filters (OAB)                -19.76
w/o correlation (PSB)                    -5.37
w/o auxiliary tasks                      -7.73
TDID correlations [ammirato2018target]   -26.48
SiamMask correlations [wang2019fast]     -12.93
Table 1: Network architecture ablation study. Removing tunable filters results in the most notable performance drop of almost 20% while dismissing multiple resolution correlations and multiple tasks decreases accuracy by 5.37% and 7.73% respectively. The bottom part of the table illustrates that baseline correlation-based strategies from [ammirato2018target] and [wang2019fast] also fall short of our proposed architecture.

Network architecture.

Table 1 reports the performance relative to the full architecture when each of the proposed modules in sec. 3 is removed (one at a time). First, removing the tunable filters computed with the “object attention branch” results in the largest performance drop, a decrease of almost 20%. Second, removing the higher-resolution embeddings and the auxiliary tasks reduces performance by approximately 5% and 8% respectively.

We also compare our approach with the techniques used in TDID [ammirato2018target] and SiamMask [wang2019fast] to correlate the template with the query image. Instead of implementing their exact specifications (which may differ in backbones, for example), we provide a fairer comparison by adapting our architecture to their main ideas. We thus use a siamese network with our DenseNet-121 backbone for both methods (since they propose a shared-weights approach), and remove the tunable filters from our object attention branch. For TDID, we only use the lower-resolution embedding (with both depth-wise cross-correlation and subtraction), whereas the higher-resolution embedding is used for SiamMask. The same training procedure and tasks as in our proposed approach are used. Our proposed approach improves the performance by a large margin when compared with these baselines.

Local template pose.

We evaluate the impact of perturbing the object orientation in the local template during training. A random rotation of 0 degrees means that the local template contains the object at the same orientation as the object in the image. Perturbations are added by randomly sampling a rotation axis (in spherical coordinates) and a magnitude. A network is retrained with each level of random rotations. Table 2 illustrates that the optimal degree of perturbation seems to be around 20–30 degrees. Indeed, deviations from this range result in a decrease in performance of up to 16% when a completely random rotation (180 degrees) is used.

Random rotations (degrees)   Performance difference (%)
0                            -4.33
10                           -3.12
20                           0.00
30                           -0.42
40                           -5.18
180                          -16.07
Table 2: Ablation study on random rotations applied to the template orientation during training. A random rotation of 0 degrees represents a strict training where the local template perfectly matches the ground truth object, while 180 degrees is equivalent to a random rotation angle. Using a random perturbation of 20–30 degrees provides the best compromise.

Template density and runtime.

The impact of providing various numbers of templates to the network at test time is evaluated, both in terms of accuracy and speed, in table 3. Timings are reported on an Nvidia GeForce GTX 1080Ti. To generate a varying number of templates, we first generate templates from 16 pre-defined viewpoints spanning a half-sphere on top of the object. Each template subsequently undergoes a varying number of in-plane rotations: 5 (80 templates), 10 (160 templates) and 20 (320 templates). The table compares performance with that obtained with an oracle that provides a template with the ground truth object pose.

# of templates    Performance (%)    Runtime (s)
80                -2.80              0.23
160                0.00              0.43
320               +0.03              0.87
1 (oracle)        +16.75             0.06
Table 3: Evaluating the bounding box detection performance and runtime for varying numbers of templates at test time. While runtime grows linearly, the performance gain saturates around 160 templates. The oracle sets an upper bound of performance by providing the template with the ground truth object pose as input.
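
The template counts and the linear growth in runtime can be checked with a short calculation. The sketch below only uses numbers from the text and table 3; the value of 5 in-plane rotations for the smallest set is inferred from 80 / 16, and the function names are illustrative.

```python
NUM_VIEWPOINTS = 16  # pre-defined viewpoints on the half-sphere (from the text)

# (in-plane rotations per viewpoint, measured runtime in seconds) from table 3.
# The first rotation count, 5, is inferred from 80 templates / 16 viewpoints.
MEASUREMENTS = [(5, 0.23), (10, 0.43), (20, 0.87)]


def template_count(in_plane_rotations, viewpoints=NUM_VIEWPOINTS):
    """Total number of templates: one per (viewpoint, in-plane rotation) pair."""
    return viewpoints * in_plane_rotations


def per_template_ms(in_plane_rotations, runtime_s):
    """Average runtime per template, in milliseconds."""
    return 1000.0 * runtime_s / template_count(in_plane_rotations)


for rotations, runtime_s in MEASUREMENTS:
    count = template_count(rotations)
    cost = per_template_ms(rotations, runtime_s)
    print(f"{count:3d} templates: {runtime_s:.2f} s total, {cost:.2f} ms per template")
```

The per-template cost stays between roughly 2.7 and 2.9 ms across the three settings, which is consistent with the linear growth noted in the caption of table 3.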

5.4 Comparative evaluation

We report an evaluation on Linemod and Occluded Linemod (OL) in table 4 and compare with other state-of-the-art RGB-only methods. Competing methods are divided into two main groups: those that know the test objects at training time (“known objects”), and those that do not. Approaches such as [brachmann2016uncertainty, kehl2017ssd, zakharov2019dpod, hodan2019photorealistic] are all learning-based methods that were specifically trained on the objects. We could identify only [tjaden2017real] and [hinterstoisser2011multimodal] as methods that do not include a training step targeted at specific object instances.

It is worth noting that even though [tjaden2017real] is classified as not needing known objects at training time, it still requires an initialization phase to be performed on real images (to build a dictionary of histogram features). As in [brachmann2016uncertainty], the authors thus use parts of the Linemod dataset as a training set that covers most of the object viewpoints. These methods therefore have an unfair advantage over our approach and Line-2D, since they leverage domain-specific information (lighting, camera, background) from the evaluation dataset.

Our method can be directly compared with Line-2D [hinterstoisser2011multimodal], which also uses templates generated from 3D models. On the standard Linemod dataset, Line-2D outperforms our method by about 8.5 percentage points on the “2D bounding box” metric. However, our method outperforms Line-2D by around 30 percentage points in mAP on Occluded Linemod. It also provides competitive performance that is on par with or close to all other methods that test on known objects and/or have access to real images. Note how the accuracy of Line-2D severely degrades under occlusion, while our approach remains robust. We show qualitative results of our approach on Occluded Linemod in fig. 4.

Methods                                        Known objects    Real images    Linemod (2D BBox)    OL (mAP)
Brachmann et al. [brachmann2016uncertainty]    Yes              Yes            97.50                51.00
SSD-6D [kehl2017ssd]                           Yes              No             99.40                38.00
DPOD [zakharov2019dpod]                        Yes              No             N/A                  48.00
Hodan et al. [hodan2019photorealistic]         Yes              No             N/A                  55.90*
Tjaden et al. [tjaden2017real]                 No               Yes            78.50                N/A
LINE-2D [hinterstoisser2011multimodal]         No               No             86.50                21.0
Ours                                           No               No             77.92                50.71
Table 4: Quantitative comparison to the state of the art. The table lists the 2D bounding box metric on Linemod and mean average precision (mAP) on Occluded Linemod (OL). The 2D bounding box metric calculates the recall for the 2D bounding boxes with the highest prediction score. For both metrics, predictions are considered good if the IoU of the prediction and the ground truth is at least 0.5 (0.75 for Hodan et al. [hodan2019photorealistic]).
Figure 4: Qualitative results on the Occluded Linemod dataset [brachmann2014learning], showing good (green), false (blue) and missed (red) detections. For reference, the 15 objects are shown in the bottom row (image from [hodan2018bop]). To generate these results, all objects (except objects 3 and 7) are searched in each image.
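
The acceptance criterion used for both metrics in table 4 (IoU of at least 0.5 between prediction and ground truth) and the 2D bounding box metric (recall of the highest-scoring box) can be sketched as follows. This is a minimal illustration rather than the evaluation code behind the reported numbers: the `(x_min, y_min, x_max, y_max)` box layout, the one-ground-truth-box-per-image assumption, and the function names are choices made for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def bbox_recall(predictions, ground_truths, threshold=0.5):
    """Fraction of images whose highest-scoring box matches the ground truth.

    predictions: one list of (score, box) tuples per image.
    ground_truths: one ground truth box per image (assumption for this sketch).
    """
    if not ground_truths:
        return 0.0
    hits = 0
    for image_predictions, gt_box in zip(predictions, ground_truths):
        if not image_predictions:
            continue  # no detection counts as a miss
        _, best_box = max(image_predictions, key=lambda p: p[0])
        if iou(best_box, gt_box) >= threshold:
            hits += 1
    return hits / len(ground_truths)


# Toy example: the top-scoring box is correct in the first image only.
predictions = [
    [(0.9, (10, 10, 50, 50)), (0.4, (100, 100, 140, 140))],
    [(0.8, (0, 0, 20, 20))],
]
ground_truths = [(12, 12, 52, 52), (60, 60, 100, 100)]
print(f"2D bounding box recall: {bbox_recall(predictions, ground_truths):.2f}")
```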

6 Discussion

This paper presents what we believe to be the first deep learning approach for unseen instance detection. Inspired by template matching, we propose an architecture which learns a feature embedding where correlation between object templates and a query image can subsequently be converted to bounding box detection. Our experiments show that while the network has not been trained with any of the test objects, it is significantly more robust to occlusion than previous template-based methods (30% improvement in mAP over Line-2D [hinterstoisser2011multimodal]), while being highly competitive with networks that are specifically trained on the object instances.


A main limitation of our approach is the false positives that arise from clutter with similar color and shape as the object, as shown in fig. 4. We hypothesize that our late correlation at small spatial resolution (templates of and ) prevents the network from leveraging detailed spatial information to fit the object shape more accurately. Another limitation is speed. Currently, our approach requires 0.43 s to detect a single object instance in an image (cf. table 3), and this runtime scales linearly with the number of objects. The culprit here is the object attention branch, which makes the backbone features instance-specific via the tunable filters, and thus needs to be recomputed for each object.

Future directions.

By providing a generic and robust 2D instance detection framework, this work opens the way for new methods that can extract additional information about the object, such as its full 6-DOF pose. We envision a potential cascaded approach, which could first detect unseen objects, and subsequently regress the object pose from a high-resolution version of the detection window.


This work was supported by the NSERC/Creaform Industrial Research Chair on 3D Scanning: CREATION 3D. We gratefully acknowledge the support of Nvidia with the donation of the GPUs used for this research.


Appendix A Per object performances

In table 4, we reported the performance of our approach on the Linemod [hinterstoisser2012model] and Occluded Linemod [brachmann2014learning] datasets. We extend the reported results by showing the performance of our approach on each object in tables 5 and 6. The objects and their corresponding indices can be viewed in fig. 5.

Object ID 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Mean
2D BBox metric (%) 89.16 71.50 94.00 46.88 92.47 80.75 82.74 77.19 63.31 96.89 89.51 67.83 87.67 86.39 42.48 77.92
Table 5: Performances on the 2D bounding box metric for each object of the Linemod dataset.
Object ID 1 2 5 6 8 9 10 11 12 Mean Average Precision (mAP)
Average Precision (%) 36.58 55.92 73.49 29.18 55.20 77.48 52.79 16.26 59.52 50.71
Table 6: Average precision for each object evaluated on the Occluded Linemod dataset.
Figure 5: All 15 objects in the Linemod dataset (taken from [hodan2018bop]).
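
As a consistency check, the means reported in tables 5 and 6 (and in table 4) can be recomputed from the per-object values. The snippet below simply averages the numbers listed in the two tables; the variable and function names are illustrative.

```python
# Per-object 2D bounding box metric (%) on Linemod, copied from table 5.
LINEMOD_BBOX = {
    1: 89.16, 2: 71.50, 3: 94.00, 4: 46.88, 5: 92.47,
    6: 80.75, 7: 82.74, 8: 77.19, 9: 63.31, 10: 96.89,
    11: 89.51, 12: 67.83, 13: 87.67, 14: 86.39, 15: 42.48,
}

# Per-object average precision (%) on Occluded Linemod, copied from table 6.
OCCLUDED_AP = {
    1: 36.58, 2: 55.92, 5: 73.49, 6: 29.18, 8: 55.20,
    9: 77.48, 10: 52.79, 11: 16.26, 12: 59.52,
}


def mean(values):
    """Arithmetic mean of a non-empty collection of numbers."""
    values = list(values)
    return sum(values) / len(values)


print(f"Linemod mean 2D BBox metric: {mean(LINEMOD_BBOX.values()):.2f}")
print(f"Occluded Linemod mAP:        {mean(OCCLUDED_AP.values()):.2f}")
```

Both averages agree with the reported means of 77.92 and 50.71 after rounding to two decimals.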

Appendix B Domain randomization training images

Additional examples of domain randomization images generated with our simulator are shown in fig. 6.

Figure 6: More domain randomization images generated with our simulator.

Appendix C Qualitative results on Linemod dataset

We show examples of good and bad predictions on the Linemod dataset in fig. 7.

Figure 7: Qualitative results on the Linemod dataset [hinterstoisser2012model] with predictions (yellow) and ground truths (red). The first two rows show good predictions while the last row shows examples of bad predictions.

Appendix D Qualitative results on Occluded Linemod

We show additional qualitative results on Occluded Linemod in fig. 8 to expand results shown in fig. 4 of the paper.

Figure 8: More qualitative results on the Occluded Linemod dataset [brachmann2014learning].

Appendix E Architecture details

To expand on fig. 2 of the main paper, we show the networks in more detail in fig. 9.

Figure 9: Detailed networks