Real-time texturing for 6D object instance detection from RGB Images

12/13/2019 ∙ by Pavel Rojtberg, et al. ∙ Fraunhofer 23

For object detection, the availability of color cues strongly influences detection rates and is even a prerequisite for many methods. However, when training on synthetic CAD data, this information is not available. We therefore present a method for generating a texture-map from image sequences in real-time. The method relies on 6 degree-of-freedom poses and a 3D-model being available. In contrast to previous works, this allows interleaving detection and texturing for upgrading the detector on-the-fly. Our evaluation shows that the acquired texture-map significantly improves detection rates using the LINEMOD detector on RGB images only. Additionally, we use the texture-map to differentiate instances of the same object by surface color.




1 Texture extraction

(a) Image space
(b) Texture space
Figure 2: We are mapping from image to texture space, which is the reverse direction compared to rendering. Texture coordinates are encoded as red-green.

In rendering, the process of ”texture mapping” consists of the following two steps:

  1. creating a mapping from a texture to the surface of a 3D model and

  2. projecting the model and simultaneously mapping the texture into a 2D image.

The first step is also called ”texture atlas creation” and is typically performed by an artist during mesh creation. In contrast, our focus is the reverse direction, namely mapping from a 2D image of a projected 3D model back to the surface image as specified by the texture atlas (see Figure 2).

Generally texture atlas coordinates are not included in CAD data and therefore have to be generated. However, automatic texture atlas generation is still an active area of research [8] and outside of the scope of this work. Here, we just use the angle-based ”Smart UV Project” algorithm implemented in the Blender toolset (v2.79b) to generate the texture atlas and instead focus on the second step of texture mapping.

In the remainder of this section we first discuss a simple exposure normalization scheme, before we present our texture extraction method in detail and finally turn to merging multiple views into one texture. The full pipeline is illustrated in Figure 3.

Figure 3: Our texture extraction pipeline. Given a 3D model and its pose in an RGB frame, we first render the depth to determine visibility. Image regions around depth discontinuities are discarded as they are unreliable. Next, a texture increment is extracted and a per-pixel score is computed to decide whether to merge the visible pixels into the final texture. Only the following buffers are required on the GPU: ”final texture”, ”increment” and ”discontinuities”.

1.1 Exposure normalization

As our method does not explicitly compensate for different exposure times we pre-process the image stream to homogenize the brightness. For this we use the first captured frame as reference and modify the successive frames to match its brightness and contrast levels.

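Here we follow the idea of Reinhard et al. [13] of adapting an input image $I$ to match a reference image $R$ as

$$I' = \frac{\sigma_R}{\sigma_I} \left( I - \mu_I \right) + \mu_R \tag{1}$$

where $\mu_I, \sigma_I^2$ and $\mu_R, \sigma_R^2$ are the mean and variance of the input image and the reference image, respectively.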

However, whereas [13] apply the transfer for all channels in the Lab color space, we only apply it to the luma component Y in the YUV color space as we explicitly want to preserve the chrominance information.

This step is omitted if the exposure can be fixed during capturing.
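The luma transfer described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `match_luma` and the `eps` guard are assumptions, and the conversion to and from YUV as well as clipping to the valid value range are left out.

```python
import numpy as np

def match_luma(y_input, y_reference, eps=1e-6):
    """Shift and scale the luma channel of a frame to match a reference frame.

    Both arguments are 2D arrays holding only the Y (luma) channel; the
    chrominance channels are deliberately left untouched by the caller.
    """
    y_input = np.asarray(y_input, dtype=np.float64)
    y_reference = np.asarray(y_reference, dtype=np.float64)
    mu_i, sigma_i = y_input.mean(), y_input.std()
    mu_r, sigma_r = y_reference.mean(), y_reference.std()
    # Guard against a constant input frame (zero standard deviation).
    scale = sigma_r / max(sigma_i, eps)
    return (y_input - mu_i) * scale + mu_r
```

In a stream, the first captured frame would serve as `y_reference` for all successive frames.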

1.2 Texture-space to image-space mapping

Texture mapping can be formalized as follows: given a triangulated mesh, each vertex $v_i$ with an associated texture coordinate $t_i$ is projected into the current view by a world-to-image transform $P$ as $p_i = P v_i$. Here $p_i$ is a normalized pixel location in the image $I$.

On the interior of the triangle formed by $p_i$, a texture coordinate $t$ is interpolated and used for lookup in texture $T$:

$$I(p) = T(t) \tag{2}$$

This mapping is continuous in texture space and therefore allows for bi-linear interpolation to avoid aliasing artifacts.

For texture extraction, however, we are interested in the reverse mapping, namely

$$T(t) = I(p) \tag{3}$$

Instead of iterating over the mesh topology as defined by $v_i$ in 3D, we now iterate over $t_i$ as defined by the texture atlas in 2D. Conversely, we now require a continuous value of $p$ in image space for lookup. This is computed by interpolating in the triangle formed by $p_i$, of which each point is obtained as above by $p_i = P v_i$.
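The reverse mapping can be illustrated on the CPU for a single triangle: iterate over the texels covered by the triangle in texture space, interpolate the corresponding image position with barycentric weights, and copy the color found there. The sketch below is only meant to show the direction of the mapping. The function name `extract_triangle` is hypothetical, the vertex positions are assumed to be given in texel and pixel units, sampling is nearest-neighbor, and the visibility test is ignored.

```python
import numpy as np

def extract_triangle(texture, image, tex_tri, img_tri):
    """Copy image colors into the texels covered by one triangle (in place).

    tex_tri, img_tri: three (x, y) positions of the triangle corners in
    texel units and pixel units, respectively.
    """
    tex_tri = np.asarray(tex_tri, dtype=np.float64)
    img_tri = np.asarray(img_tri, dtype=np.float64)
    a, b, c = tex_tri
    # Edge matrix used to solve for the barycentric weights of a texel.
    m = np.array([[b[0] - a[0], c[0] - a[0]],
                  [b[1] - a[1], c[1] - a[1]]])
    if abs(np.linalg.det(m)) < 1e-12:
        return  # degenerate triangle in texture space
    m_inv = np.linalg.inv(m)
    x0, y0 = np.floor(tex_tri.min(axis=0)).astype(int)
    x1, y1 = np.ceil(tex_tri.max(axis=0)).astype(int)
    for ty in range(max(y0, 0), min(y1, texture.shape[0] - 1) + 1):
        for tx in range(max(x0, 0), min(x1, texture.shape[1] - 1) + 1):
            w1, w2 = m_inv @ (np.array([tx, ty], dtype=np.float64) - a)
            w0 = 1.0 - w1 - w2
            if min(w0, w1, w2) < -1e-9:
                continue  # texel lies outside the triangle
            # Continuous image position interpolated from the corners.
            p = w0 * img_tri[0] + w1 * img_tri[1] + w2 * img_tri[2]
            px, py = int(round(p[0])), int(round(p[1]))
            if 0 <= px < image.shape[1] and 0 <= py < image.shape[0]:
                texture[ty, tx] = image[py, px]
```

On the GPU the same effect is obtained by rasterizing the mesh with texture coordinates as output positions, so the per-texel loop is handled by the rasterizer.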

Here, visibility must be explicitly computed; with equation (2), we implicitly assumed overlapping points to be resolved by a depth-test, only retaining the points closest to the camera. This can no longer be exploited, as points do not overlap in the texture space.

(a) Depth test aliasing
(b) Slope biased depth
Figure 4: We store slope-scaled biased depth values to avoid aliasing errors during the visibility test.

To handle visibility we therefore introduce an additional depth buffer and render depth from the camera view. This allows comparing the depth of an interpolated coordinate to the actually visible depth value. However, this leads to aliasing; with non-planar objects the view resolution cannot be adapted to match the texture space resolution.

To remedy the aliasing artifacts we apply techniques from the shadow mapping domain [2] where the same problem occurs when a scene is rendered from a shadow camera and an observer camera view. Particularly, we

  1. focus the camera on the object bounding box to increase the sampling rate in image space and

  2. apply a slope-scale depth-bias to account for the remaining differences in sampling rates during visibility testing.

The latter is especially important: as the texture atlas has a higher sampling rate than the depth buffer, several points, interpolated in texture space, map to the same point in the image depth buffer. At steep angles the surface has a strong depth variation, and thus neighboring points alternately fail and pass the visibility test when compared to a single reference value (see Figure 4(b)).

To account for this we store a biased depth $\hat{z}$ that allows for a sampling offset of 1px in image space. The bias depends on the depth slope per pixel $m$ and the minimal depth buffer resolution $r$ as:

$$\hat{z} = z + m \cdot 1\,\mathrm{px} + r \tag{4}$$

The bias is large in steep regions while minimal for faces parallel to the camera. This computation can be implemented efficiently on the GPU by using e.g. glPolygonOffset. The effect can be observed by comparing Figure 4(a) and Figure 4(b). This allows us to map each visible pixel from a single image into the texture to record the object surface. The resulting reconstruction can already be applied for detecting the object in similar views (see Section 2).
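As a rough CPU stand-in for the slope-scaled bias, the following sketch adds the per-pixel depth slope (for a 1px offset) plus a constant minimal resolution to a rendered depth map and then performs the visibility comparison. The exact form of the bias, the use of `np.gradient` to estimate the slope, and the names `biased_depth` and `is_visible` are assumptions made for illustration; on the GPU this is handled by the polygon offset mechanism.

```python
import numpy as np

def biased_depth(depth, min_resolution):
    """Return the depth map with a slope-scaled bias added per pixel."""
    depth = np.asarray(depth, dtype=np.float64)
    dz_dy, dz_dx = np.gradient(depth)  # depth change per pixel along y and x
    slope = np.maximum(np.abs(dz_dx), np.abs(dz_dy))
    # Slope times a 1px sampling offset, plus the smallest resolvable step.
    return depth + slope * 1.0 + min_resolution

def is_visible(point_depth, stored_biased_depth):
    """A texture-space point is visible if it is not behind the biased depth."""
    return point_depth <= stored_biased_depth
```

With this bias, a face parallel to the image plane only receives the constant offset, while a steep face tolerates a depth difference of one pixel's worth of slope.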

1.3 Merging multiple views

Generally the object surface is only partially visible from a single view and therefore multiple images are needed to reconstruct the full texture.

Assuming that the same texture point $t$ will be observed in $N$ different images $I_k$ at image positions $p_k$, where $k = 1, \dots, N$, we discard edge pixels at object boundaries or strong depth discontinuities. These measurements are unreliable as they might come from different surfaces due to pose imprecisions and limited camera resolution. Instead, we aim for a view where $p_k$ is not at an observed edge. A pixel is considered to be part of an edge if the depth change is larger than 10% of the object diameter. All points in a 5px neighborhood of an edge pixel are discarded as well (see Figure 6(a)).
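The edge rejection rule can be sketched as a mask over the rendered depth map. The thresholds (10% of the object diameter, 5px neighborhood) come from the text; treating the neighborhood as a square window, comparing each pixel only to its direct neighbors, and the name `unreliable_mask` are assumptions of this sketch. Background pixels are assumed to carry a far depth value so that object boundaries show up as depth jumps.

```python
import numpy as np

def unreliable_mask(depth, object_diameter, rel_threshold=0.1, radius=5):
    """Mark pixels at or near strong depth discontinuities as unreliable."""
    depth = np.asarray(depth, dtype=np.float64)
    limit = rel_threshold * object_diameter
    edge = np.zeros(depth.shape, dtype=bool)
    # A jump between two neighbors marks both of them as edge pixels.
    jump_x = np.abs(np.diff(depth, axis=1)) > limit
    jump_y = np.abs(np.diff(depth, axis=0)) > limit
    edge[:, :-1] |= jump_x
    edge[:, 1:] |= jump_x
    edge[:-1, :] |= jump_y
    edge[1:, :] |= jump_y
    # Dilate the edge mask by `radius` pixels using shifted copies.
    padded = np.pad(edge, radius, mode="constant", constant_values=False)
    h, w = edge.shape
    mask = np.zeros_like(edge)
    for oy in range(2 * radius + 1):
        for ox in range(2 * radius + 1):
            mask |= padded[oy:oy + h, ox:ox + w]
    return mask
```

Pixels where the returned mask is true would be excluded before a texture increment is merged.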

To combine multiple valid observations of a texture point $t$, we define a score $s$ that, inspired by [3], weighs each observation by the distance to the camera and the angle between surface normal and view direction. The distance is assumed to be given in normalized device coordinates, and the angle is computed based on the interpolated surface normal, which can be defined per vertex (e.g. for a sphere) and therefore is not required to be constant for a single face.

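Using the score $s_k$ of the $k$-th observation $I_k(p_k)$ of a texture point $t$, we implemented two merging strategies: a weighted arithmetic mean

$$T(t) = \frac{\sum_k s_k \, I_k(p_k)}{\sum_k s_k} \tag{6}$$

and only retaining the best view

$$T(t) = I_{\hat{k}}(p_{\hat{k}}), \quad \hat{k} = \arg\max_k s_k \tag{7}$$

Both equations can be efficiently implemented on the GPU using a single RGBA buffer for accumulation.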

Figure 5 shows exemplary results. Eq. (6) produces a smooth surface, while retaining more detail than vertex coloring. However, the averaging over slightly inaccurate object poses results in a loss of fine detail when compared to Eq. (7).

Using Eq. (7) on the other hand retains all details, but emphasizes inconsistencies in exposure or object pose as seams between neighboring increment texture-patches.
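Both merging strategies can be sketched with a single four-channel buffer per texture. The layout used here (color in RGB, score or score sum in the fourth channel) is an assumption that is consistent with the single RGBA buffer mentioned above; the function names are hypothetical and scores are assumed to be positive.

```python
import numpy as np

def merge_weighted_mean(acc, color, score, valid):
    """Accumulate score-weighted colors in RGB and the score sum in A (in place)."""
    w = np.where(valid, score, 0.0)
    acc[..., :3] += color * w[..., None]
    acc[..., 3] += w
    return acc

def resolve_weighted_mean(acc, eps=1e-12):
    """Divide the accumulated color by the accumulated score."""
    return acc[..., :3] / np.maximum(acc[..., 3:4], eps)

def merge_best_view(tex, color, score, valid):
    """Keep the color with the highest score in RGB and that score in A (in place)."""
    better = valid & (score > tex[..., 3])
    tex[better, :3] = color[better]
    tex[better, 3] = score[better]
    return tex
```

The weighted mean needs a final division before the texture is used, whereas the best-view buffer is directly usable after every frame.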

(a) Vertex colors
(b) Weighted mean as in Eq. (6)
(c) Best score as in Eq. (7)
(d) Best score + blending
Figure 5: Exemplary surface color reconstructions of the ”Driller” object using (a) KinectFusion and (b, c, d) variations of our algorithm.

To alleviate this problem we blend increment-patches at their boundaries into the existing texture during accumulation. Instead of simply overwriting the texture content with the new maximum, we compute the distance transform to the patch boundaries over a 5x5px support using the L2 norm. Using the distance, we then linearly interpolate between the old and the new color value and pixel score.
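The boundary blending can be sketched as follows. For brevity the distance to the patch boundary is computed by brute force with an exact L2 norm, which is only practical for small patches and stands in for a proper distance transform. The ramp width `ramp`, over which old and new values are interpolated, is a hypothetical parameter, as are the function names.

```python
import numpy as np

def distance_to_boundary(patch_mask):
    """L2 distance of every patch pixel to the nearest pixel outside the patch."""
    inside = np.argwhere(patch_mask)
    outside = np.argwhere(~patch_mask)
    dist = np.zeros(patch_mask.shape, dtype=np.float64)
    if len(inside) == 0:
        return dist
    if len(outside) == 0:
        dist[patch_mask] = np.inf  # patch covers everything, no boundary
        return dist
    d = np.linalg.norm(inside[:, None, :] - outside[None, :, :], axis=2)
    dist[patch_mask] = d.min(axis=1)
    return dist

def blend_patch(old_rgba, new_rgba, patch_mask, ramp=3.0):
    """Linearly blend color and score from old to new towards the patch interior."""
    w = np.clip(distance_to_boundary(patch_mask) / ramp, 0.0, 1.0)[..., None]
    return (1.0 - w) * old_rgba + w * new_rgba
```

Pixels outside the patch keep the old value, pixels deep inside take the new one, and the transition in between is linear in texture space.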

(a) Initial view
(b) argmax based update with blending
Figure 6: Merge-maps of two successive frames when using Eq. (7). Valid pixels are colored blue.

Figure 6(b) shows an increment-patch for Figure 6(a), projected onto the object. Note the gradient at the edges, which is linear in texture space.

The blending not only produces visually more pleasing results (compare Figures 5(c) and 5(d)), but is crucial for computing the LINEMOD descriptor, which relies on local gradient orientation.

2 Object instance detection

In this section we describe how to employ the extracted textures for object instance detection i.e. differentiating multiple instances of the same object. Here, we extend the color based outlier rejection of [5] to multiple color hypotheses to simultaneously perform classification.

The idea of color based outlier rejection in [5] is to store the expected color of the object projection alongside the LINEMOD template and at run-time count how many pixels in the camera frame have the expected color.

To make the check robust against lighting variations, they convert the images to the HSV color space and compare only the hue component. However, hue does not cover the colors black (low value) and white (low saturation). Therefore, these are mapped to blue and yellow respectively, which completes the color-based descriptor (see Figure 7(a)).
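A minimal sketch of such a hue descriptor is given below. The threshold values `value_min` and `saturation_min` are placeholders chosen for illustration and are not taken from [5]; hue is assumed to be given in degrees, with saturation and value in [0, 1].

```python
import numpy as np

BLUE_HUE = 240.0    # hue assigned to black pixels
YELLOW_HUE = 60.0   # hue assigned to white pixels

def hue_descriptor(hsv, value_min=0.2, saturation_min=0.2):
    """Return a per-pixel hue map in which black and white are representable.

    hsv: array of shape (..., 3) with at least one leading dimension.
    """
    hsv = np.asarray(hsv, dtype=np.float64)
    hue = hsv[..., 0].copy()
    sat, val = hsv[..., 1], hsv[..., 2]
    black = val < value_min
    white = (sat < saturation_min) & ~black
    hue[black] = BLUE_HUE
    hue[white] = YELLOW_HUE
    return hue
```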

To extend this scheme for object instance detection as well as for on-the-fly recorded textures, we separate the expected color from the expected surface visibility. To this end, we store the texture coordinates of the object projection (compare Figure 2(a)) instead of storing the expected color directly. The template surface-texture is stored separately. At runtime we now use the texture coordinates to perform a lookup into the template-texture to retrieve the expected color, which gives us the same information as in [5].

However, it is now possible to easily swap the surface-texture to globally change the expected colors. Here a live-reconstructed texture can provide more accurate template colors and notably multiple template-textures can be used for object instance detection (see Figure 7(b)).

Finally, the outlier rejection scheme needs a slight modification for classification. Instead of returning the first inlier based on the expected color, it needs to allow multiple matches without repetition. For this, after finding an inlier, only the corresponding texture-template is removed and the remaining candidates are checked until all template-textures are found or all candidates are rejected.
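The modified outlier rejection can be sketched as a loop over the detector candidates that assigns each template-texture at most once. The 70% inlier ratio is the value used in the evaluation; the hue tolerance `max_hue_diff`, the data layout of a candidate, and all function names are assumptions of this sketch.

```python
import numpy as np

def hue_difference(a, b):
    """Smallest angular difference between two hue maps in degrees."""
    d = np.abs(a - b) % 360.0
    return np.minimum(d, 360.0 - d)

def inlier_ratio(observed_hue, expected_hue, mask, max_hue_diff):
    """Fraction of object pixels whose hue matches the expectation."""
    ok = hue_difference(observed_hue, expected_hue) <= max_hue_diff
    return float(ok[mask].mean()) if mask.any() else 0.0

def classify_candidates(candidates, template_hues, min_ratio=0.7, max_hue_diff=20.0):
    """Assign template-textures to candidates without repetition.

    candidates: list of (observed_hue, tex_coords, mask) in detector order,
        where tex_coords has shape (H, W, 2) and holds integer (u, v) texels.
    template_hues: dict mapping an instance name to its 2D hue texture.
    Returns a dict mapping instance name to the index of the matched candidate.
    """
    remaining = dict(template_hues)
    matches = {}
    for idx, (observed, tex_coords, mask) in enumerate(candidates):
        if not remaining:
            break  # every template-texture has been found
        u, v = tex_coords[..., 0], tex_coords[..., 1]
        for name, hue_tex in list(remaining.items()):
            expected = hue_tex[v, u]  # lookup via stored texture coordinates
            if inlier_ratio(observed, expected, mask, max_hue_diff) >= min_ratio:
                matches[name] = idx
                del remaining[name]
                break
    return matches
```

A candidate that matches none of the remaining template-textures is rejected as an outlier, exactly as in the single-texture case.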

While this is an integral part of the LINEMOD pipeline, it can optionally be integrated as a post-processing step into a CNN-based architecture that is capable of abstracting the object appearance to some degree. E.g. it can be executed after non-maximum-suppression in [15] to compute agreement with the color template.

(a) input image
(b) candidate / white template / red template
Figure 7: Hue based instance detection. The input image (Figure 1) is cropped based on the template bounding box and compared to a set of hue templates.

3 Evaluation

(a) LINEMOD [5]
(b) ”Single shot pose” [15]
(c) Ground truth
Figure 8: Qualitative results for on-the-fly surface color reconstructions of the ”driller” object in relation to different pose detection methods.

The presented method is evaluated in the context of object detection. To this end we train the LINE2D variant of the LINEMOD detector on the corresponding dataset [5]. The dataset does not contain views specifically for surface reconstruction and thus represents reconstruction during detection well. We use the publicly available LINEMOD implementation in OpenCV.

There are 15 sequences for different objects, consisting of RGB-D frames with ground-truth poses and recorded at distances of 65cm-115cm. We select a subset of 8 objects for which a 3D mesh is available and that are large enough to provide a reasonable texture resolution. The meshes included in the dataset were recorded using a variation of KinectFusion [11] and thus encode surface information as vertex-colors.

We apply our texturing algorithm on each sequence, using the ground-truth poses to simulate a tracking algorithm for better reproducibility. Then we train LINEMOD on synthetic renderings using the generated textures, and on the included vertex-colors as a baseline. We parametrize training and testing as in [5], in particular:

  • We use 89 views on the upper hemisphere around the object, derived by subdividing an icosahedron twice recursively.

  • For each view there are 7 in-plane rotations (roll angles).

  • Furthermore 6 distances, with 10cm increments, between 65cm and 115cm are used.

  • During color-based outlier rejection we discard candidates where less than 70% of the pixels have the expected color. A fixed threshold is applied to the per-pixel hue difference.

This results in a total of 3738 templates per object for training. However, in contrast to [5], we are only using RGB data without depth. Therefore we do not restrict the color gradient features to the contour, but compute them on the interior as well.
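The template count follows directly from the sampling parameters listed above, as this short check shows (variable names are chosen for illustration):

```python
views = 89                                 # viewpoints on the upper hemisphere
in_plane_rotations = 7                     # roll angles per view
distances_cm = list(range(65, 116, 10))    # 65, 75, ..., 115 cm

templates = views * in_plane_rotations * len(distances_cm)
print(templates)  # 3738
```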

For testing, we measure the true positive rate on the sequences. As in [4] we consider an object successfully detected when it is within a fixed radius around the ground truth position. We set this radius globally in our experiments to allow for depth mis-classification by one step.
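The detection criterion can be written down compactly. In this sketch, positions are 3D points in centimeters, a missed detection is encoded as a row of NaNs, and the radius is passed in by the caller; these conventions and the function name are assumptions for illustration.

```python
import numpy as np

def true_positive_rate(detected, ground_truth, radius):
    """Fraction of frames in which the detection lies within `radius` of the truth.

    detected, ground_truth: arrays of shape (num_frames, 3); rows of NaN in
    `detected` denote frames without any detection and count as misses.
    """
    detected = np.asarray(detected, dtype=np.float64)
    ground_truth = np.asarray(ground_truth, dtype=np.float64)
    dist = np.linalg.norm(detected - ground_truth, axis=1)
    # Comparisons with NaN are False, so missed frames are not counted as hits.
    return float(np.mean(dist <= radius))
```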

To maintain interactive performance we only consider the first 30 LINEMOD candidates for matching and outlier rejection.

To simulate the CAD data use-case without any surface information available, we additionally perform training using a white diffuse material for all objects. For generating gradient features on the interior of the object, we use ambient occlusion (AO) [1] as a lighting approximation. Ambient occlusion is a purely geometrical method that is independent of actual light and surface properties. We skip the outlier-rejection step as no color information is available.

| Object | AO | vertex color | texture, Eq. (7) | texture, Eq. (6) |
|---|---|---|---|---|
| benchvise | 0.54 | 0.75 | 0.82 | 0.82 |
| driller | 0.15 | 0.43 | 0.63 | 0.54 |
| iron | 0.53 | 0.71 | 0.67 | 0.68 |
| can | 0.38 | 0.67 | 0.78 | 0.83 |
| glue | 0.07 | 0.21 | 0.17 | 0.17 |
| cam | 0.1 | 0.28 | 0.62 | 0.55 |
| eggbox | 0.47 | 0.6 | 0.79 | 0.79 |
| holepuncher | 0.2 | 0.62 | 0.59 | 0.65 |
| average | 0.3 | 0.53 | 0.64 | 0.63 |

Table 1: True positive rates on the LINEMOD dataset with different training data. The ambient occlusion (AO) variant does not include any outlier rejection.

Table 1 shows the true positive rates for the variants mentioned above. As can be seen, the texture-based variants outperform the vertex-color baseline of [5] by about 10 percentage points on average (0.64 and 0.63 versus 0.53). However, there are strong variations between the individual objects; it therefore remains inconclusive whether variant (6) or variant (7) of our algorithm is preferable.

Notably, the AO variant cannot reach the performance of the other methods. For some objects it even becomes unusable (e.g. driller, cam). This emphasizes the need for surface information in object detection.

3.1 Using noisy pose data

To evaluate the applicability of our method for on-the-fly texturing with noisy pose data, we additionally used the state-of-the-art ”single shot pose” (SSP) detector [15] instead of relying on ground-truth poses.

Figure 8 shows qualitative results of texturing using ground-truth, SSP and LINEMOD poses. While the LINEMOD results only allow for a rough color-based outlier rejection, the results using SSP poses are very similar to those using the ground truth. To further quantify this, we repeated the training of SSP using synthetic renderings of the ”driller” object instead of using cross-validation as in the original paper, and measured the true positive rate (TPR) using the 5cm, 5deg metric. Training with renderings textured from ground-truth poses (Fig. 8(c)) and with the imperfectly textured objects (Fig. 8(b)) resulted in comparable TPRs, which supports the qualitative impression. When training with vertex-colored renderings only, the performance was significantly degraded.

3.2 Speed

The evaluation was performed on a notebook with an Intel i7-7700HQ CPU at 2.80GHz and an Intel HD 630 iGPU. The average time to accumulate one video frame into a 1024x1024 px sized texture is 2.69 ms. This allows running the texturing algorithm in parallel to tracking to reconstruct a texture on-the-fly.

The average time to perform a texture lookup as described in Section 2 is 0.82 ms using the software remap implementation in OpenCV. This step can therefore be applied generally without requiring GPU usage.

3.3 Multi instance detection

For the multi-instance detection we performed a qualitative analysis using a separate sequence where two toy cars are alternately and simultaneously visible. The surface colors are white and red, which are adjacent in HSV space (white is mapped to yellow as described in Section 2). Furthermore, the surface exhibits specular reflection, which is not filtered during texturing.

Nevertheless, our method was able to robustly discriminate both objects (see Figure 1 and the supplemental material).

4 Conclusion and future work

We have presented a method for real-time texturing that can be used to improve detection on-the-fly. We have shown that texturing itself is crucial for detection with CAD data where no surface information is available. However, even for meshes where vertex-colors were previously available, our approach improves detection performance significantly. Furthermore, we successfully applied the resulting textures to extend LINEMOD for object-instance recognition. By interleaving detection and texture extraction it now becomes possible to extend detection algorithms by color cues on-the-fly.

Our method currently requires the camera exposure to be fixed or relies on a global exposure compensation approach, which is error-prone. Here, reading the actual camera exposure could be used for accurate exposure fusion of the images. The surface specularity could be explicitly considered during merging [6]. Currently we assume diffuse reflection, which systematically over-brightens specular surfaces. In addition, a plausibility test during merging could be used to reject implausible colors as caused by e.g. occlusion. As the LINEMOD detector is no longer state-of-the-art, further investigation is needed to similarly integrate our approach into an existing CNN-based method. This will require breaking up the end-to-end trained ”black-box” to make the color information explicit.