1 Texture extraction
In rendering, the process of "texture mapping" consists of the following two steps:
1. creating a mapping from a texture to the surface of a 3D model and
2. projecting the model and simultaneously mapping the texture into a 2D image.
The first step is also called "texture atlas creation" and is typically performed by an artist during mesh creation. In contrast, our focus is the reverse direction, namely mapping from a 2D image of a projected 3D model back to the surface image as specified by the texture atlas (see Figure 2).
Generally, texture atlas coordinates are not included in CAD data and therefore have to be generated. However, automatic texture atlas generation is still an active area of research and outside the scope of this work. Here, we simply use the angle-based "Smart UV Project" algorithm implemented in the Blender toolset (v2.79b) to generate the texture atlas and instead focus on the second step of texture mapping.
In the remainder of this section we first discuss a simple exposure normalization scheme, before we present our texture extraction method in detail and finally turn to merging multiple views into one texture. The full pipeline is illustrated in Figure 3.
1.1 Exposure normalization
As our method does not explicitly compensate for different exposure times we pre-process the image stream to homogenize the brightness. For this we use the first captured frame as reference and modify the successive frames to match its brightness and contrast levels.
Here we follow the idea of Reinhard et al. of adapting an input image $I$ to match a reference image $R$ as
$$I' = \frac{\sigma_R}{\sigma_I}\,(I - \mu_I) + \mu_R, \qquad (1)$$
where $(\mu_I, \sigma_I^2)$ and $(\mu_R, \sigma_R^2)$ are the mean and variance of the input image and the reference image, respectively.
However, whereas Reinhard et al. apply the transfer to all channels in the Lab color space, we only apply it to the luma component Y in the YUV color space, as we explicitly want to preserve the chrominance information.
This step is omitted if the exposure can be fixed during capturing.
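The luma-only transfer can be sketched as follows; this is a minimal numpy sketch (function name, BT.601 luma weights and the delta-based chrominance preservation are our assumptions, not taken from the paper):

```python
import numpy as np

# BT.601 luma weights (assumption: the paper does not specify the RGB->YUV matrix)
LUMA = np.array([0.299, 0.587, 0.114])

def match_luma(frame, reference, eps=1e-6):
    """Scale/shift the luma of `frame` so its mean and standard deviation
    match those of `reference`; chrominance is left untouched."""
    y_in = frame @ LUMA
    y_ref = reference @ LUMA
    gain = y_ref.std() / (y_in.std() + eps)
    y_out = gain * (y_in - y_in.mean()) + y_ref.mean()
    # Adding the luma delta to every channel changes luma by exactly that
    # delta (the weights sum to 1) while keeping color differences intact.
    return frame + (y_out - y_in)[..., None]
```

In a capture loop, the first frame would serve as `reference` and every successive frame would be passed through `match_luma` before texture extraction.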
1.2 Texture-space to image-space mapping
Texture mapping can be formalized as follows: given a triangulated mesh, each vertex $v_i$ with an associated texture coordinate $t_i$ is projected into the current view by a world-to-image transform $P$ as $x_i = P(v_i)$. Here $x_i$ is a normalized pixel location in the image $I$.
On the interior of the triangle formed by $x_1, x_2, x_3$, a texture coordinate $t = \sum_i \lambda_i t_i$ is interpolated from the barycentric coordinates $\lambda_i$ and used for lookup in the texture $T$ as
$$I(x) = T(t). \qquad (2)$$
This mapping is continuous in texture space and therefore allows for bi-linear interpolation to avoid aliasing artifacts.
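The continuous lookup with bi-linear interpolation can be sketched as below; this is a generic numpy sketch (the clamped-border convention and texel-center placement are our assumptions):

```python
import numpy as np

def bilinear_lookup(texture, t):
    """Sample `texture` (H x W x C) at a continuous texture coordinate
    t = (u, v) in [0, 1]^2 using bi-linear interpolation."""
    h, w = texture.shape[:2]
    # Map normalized coordinates to pixel space; clamp at the border.
    x = np.clip(t[0] * (w - 1), 0, w - 1)
    y = np.clip(t[1] * (h - 1), 0, h - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    # Blend the four surrounding texels.
    top = (1 - fx) * texture[y0, x0] + fx * texture[y0, x1]
    bot = (1 - fx) * texture[y1, x0] + fx * texture[y1, x1]
    return (1 - fy) * top + fy * bot
```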
For texture extraction however we are interested in the reverse mapping, namely
$$T(t) = I(x(t)). \qquad (3)$$
Instead of iterating over the mesh topology as defined by the vertices $v_i$ in 3D, we now iterate over the coordinates $t_i$ as defined by the texture atlas in 2D. Conversely, we now require a continuous value of $x$ in image space for lookup. This is computed by interpolating in the triangle formed by $t_1, t_2, t_3$, of which each point is obtained as above by $x_i = P(v_i)$.
Here, visibility must be explicitly computed; with equation (2), we implicitly assumed overlapping points to be resolved by a depth-test, only retaining the points closest to the camera. This can no longer be exploited, as points do not overlap in the texture space.
To handle visibility we therefore introduce an additional depth buffer and render depth from the camera view. This allows comparing the depth of an interpolated coordinate to the actually visible depth value. However, this leads to aliasing; with non-planar objects the view resolution cannot be adapted to match the texture space resolution.
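The per-texel interpolation combined with the explicit depth test can be sketched as follows; this is a minimal numpy sketch (function names and the nearest-sample depth lookup are our assumptions):

```python
import numpy as np

def barycentric(p, a, b, c):
    """Barycentric coordinates of 2D point p in triangle (a, b, c)."""
    m = np.array([[b[0] - a[0], c[0] - a[0]],
                  [b[1] - a[1], c[1] - a[1]]], dtype=float)
    l12 = np.linalg.solve(m, np.asarray(p, dtype=float) - np.asarray(a, dtype=float))
    return np.array([1.0 - l12.sum(), l12[0], l12[1]])

def texel_to_image(t, tri_uv, tri_xy, tri_depth, depth_buffer, bias=0.0):
    """For texel t inside the atlas triangle `tri_uv`, interpolate the image
    coordinate and depth from the projected vertices, then test visibility
    against `depth_buffer` (smaller = closer). Returns the continuous image
    coordinate, or None if the point is occluded."""
    lam = barycentric(t, *tri_uv)
    x = sum(l * np.asarray(p, dtype=float) for l, p in zip(lam, tri_xy))
    z = sum(l * d for l, d in zip(lam, tri_depth))
    px = tuple(np.round(x).astype(int))  # nearest depth-buffer sample
    if z > depth_buffer[px[1], px[0]] + bias:
        return None  # occluded in this view
    return x
```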
To remedy the aliasing artifacts we apply techniques from the shadow mapping domain, where the same problem occurs when a scene is rendered from a shadow camera and an observer camera view. Particularly, we
1. focus the camera on the object bounding box to increase the sampling rate in image space and
2. apply a slope-scale depth-bias to account for the remaining differences in sampling rates during visibility testing.
The latter is especially important; as the texture atlas has a higher sampling rate than the depth buffer, several points $x$, interpolated in the texture space, map to the same point in the image depth-buffer. At steep angles the surface has a strong depth variation and thus neighboring points alternately fail and pass the visibility test when compared to a single reference value (see Figure (b)).
To account for this we store a biased depth that allows for a sampling offset of 1px in image space. The bias $b$ depends on the depth slope per pixel $m = \max\left(\left|\tfrac{\partial z}{\partial u}\right|, \left|\tfrac{\partial z}{\partial v}\right|\right)$ and the minimal depth buffer resolution $r$ as
$$b = m + r. \qquad (4)$$
The bias is large in steep regions while minimal for faces parallel to the camera. This computation can be implemented efficiently on the GPU, e.g. using glPolygonOffset. The effect can be observed by comparing Figure (a) and Figure (b). This allows us to map each visible pixel from a single image into the texture to record the object surface. The resulting reconstruction can already be used for detecting the object in similar views (see Section 2).
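The slope-scale bias can be sketched in numpy as below, mirroring the semantics of glPolygonOffset(factor, units); the function name and the 24-bit depth-buffer assumption are ours:

```python
import numpy as np

def slope_scale_bias(depth, factor=1.0, units=1.0, r=2**-24):
    """Per-pixel depth bias in the spirit of glPolygonOffset: `factor`
    scales the maximal depth slope per pixel, `units` scales the smallest
    resolvable depth difference r (assumption: a 24-bit depth buffer)."""
    dzdy, dzdx = np.gradient(depth)
    slope = np.maximum(np.abs(dzdx), np.abs(dzdy))
    return factor * slope + units * r
```

Before the visibility comparison, the reference depth would be offset as `depth + slope_scale_bias(depth)`, so steep regions tolerate a 1px sampling mismatch.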
1.3 Merging multiple views
Generally the object surface is only partially visible from a single view and therefore multiple images are needed to reconstruct the full texture.
Assuming that the same texture point $t$ is observed in $n$ different images as $I_i(x_i)$ with $i \in \{1, \dots, n\}$, we discard edge-pixels at object boundaries or strong depth discontinuities. These measurements are unreliable as they might come from different surfaces due to pose imprecisions and limited camera resolution. Instead, we aim for a view where $x_i$ is not at an observed edge. A pixel is considered to be part of an edge if the depth change is larger than 10% of the object diameter. All points in a 5px neighborhood of an edge-pixel are discarded as well (see Figure (a)).
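The edge criterion and the 5px rejection neighborhood can be sketched as follows; a minimal numpy sketch (the 4-neighbor depth difference and the naive wrap-around dilation are our simplifications):

```python
import numpy as np

def edge_mask(depth, diameter, thresh=0.10, dilate=5):
    """Mark pixels near strong depth discontinuities: a pixel is an edge if
    the depth change to a 4-neighbor exceeds `thresh` * object diameter;
    the mask is then grown by `dilate` pixels so that the neighborhood of
    an edge pixel is discarded as well."""
    edges = np.zeros(depth.shape, dtype=bool)
    for axis in (0, 1):
        d = np.abs(np.diff(depth, axis=axis)) > thresh * diameter
        if axis == 0:
            edges[:-1, :] |= d; edges[1:, :] |= d
        else:
            edges[:, :-1] |= d; edges[:, 1:] |= d
    # Naive square dilation via shifts (wraps at borders; fine for a sketch).
    grown = edges.copy()
    for dy in range(-dilate, dilate + 1):
        for dx in range(-dilate, dilate + 1):
            grown |= np.roll(np.roll(edges, dy, axis=0), dx, axis=1)
    return grown
```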
To combine multiple valid observations of $t$, we define a score $s_i$ that, inspired by masked photo blending, weighs each observation by the distance $d_i$ to the camera and the angle $\theta_i$ between surface normal and view direction as
$$s_i = (1 - d_i)\cos\theta_i, \qquad (5)$$
where $d_i$ is assumed in normalized device coordinates ranged $[0, 1]$ and $\theta_i$ is computed based on the interpolated surface normal, which can be defined per vertex (e.g. for a sphere) and therefore is not required to be constant for a single face.
Using $s_i$ we implemented two merging strategies; a weighted arithmetic mean
$$T(t) = \frac{\sum_i s_i\, I_i(x_i)}{\sum_i s_i} \qquad (6)$$
and only retaining the best view
$$T(t) = I_k(x_k), \quad k = \operatorname{arg\,max}_i s_i. \qquad (7)$$
Both equations can be efficiently implemented on the GPU using a single RGBA buffer for accumulation, storing the score-weighted color in RGB and the score in the alpha channel.
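Both merging strategies map naturally onto one RGBA accumulation buffer; the following numpy sketch illustrates this (buffer layout and function names are our assumptions):

```python
import numpy as np

def accumulate_mean(buf, color, score):
    """Weighted-mean merging: RGB holds the score-weighted color sum,
    alpha holds the score sum."""
    buf[..., :3] += score[..., None] * color
    buf[..., 3] += score

def accumulate_best(buf, color, score):
    """Best-view merging: keep the color with the highest score seen so
    far (score stored in the alpha channel)."""
    better = score > buf[..., 3]
    buf[better, :3] = color[better]
    buf[better, 3] = score[better]

def resolve_mean(buf, eps=1e-8):
    """Divide accumulated color by accumulated score to obtain the mean."""
    return buf[..., :3] / (buf[..., 3:4] + eps)
```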
Figure 5 shows exemplary results. Eq. (6) produces a smooth surface, while retaining more detail than vertex coloring. However, the averaging over slightly inaccurate object poses results in a loss of fine detail when compared to Eq. (7).
Using Eq. (7) on the other hand retains all details, but emphasizes inconsistencies in exposure or object pose as seams between neighboring increment texture-patches.
To alleviate this problem we blend increment-patches at their boundaries into the existing texture during accumulation. Instead of simply overwriting the texture content with the new maximum, we compute the distance transform to the patch boundaries over a 5x5px support using the L2 norm. Using the distance we then linearly interpolate between the old and the new color value as well as the pixel score.
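The distance-based blend weight can be sketched as below; this is a deliberately brute-force numpy sketch of the feathering idea (function name and saturation at the support radius are our assumptions, not an efficient distance transform):

```python
import numpy as np

def boundary_blend_weight(patch_mask, support=5):
    """Linear blend weight in [0, 1] from the L2 distance to the patch
    boundary, saturating at `support` pixels. Brute force over the mask."""
    h, w = patch_mask.shape
    outside = np.argwhere(~patch_mask)
    weight = np.zeros((h, w))
    for y, x in np.argwhere(patch_mask):
        if outside.size == 0:
            weight[y, x] = 1.0
            continue
        d = np.sqrt(((outside - (y, x)) ** 2).sum(axis=1)).min()
        weight[y, x] = min(d / support, 1.0)
    return weight
```

During accumulation, the blended result would be `w * new + (1 - w) * old` for both the color value and the pixel score, so patch interiors overwrite fully while boundaries fade into the existing texture.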
2 Object instance detection
In this section we describe how to employ the extracted textures for object instance detection, i.e. differentiating multiple instances of the same object. Here, we extend the color-based outlier rejection to multiple color hypotheses to simultaneously perform classification.
The idea of color-based outlier rejection is to store the expected color of the object projection alongside the LINEMOD template and, at run-time, count how many pixels in the camera frame have the expected color.
To make the check robust against lighting variations, they convert the images to the HSV color space and compare only the hue component. However, hue is undefined for the colors black (value near zero) and white (saturation near zero). Therefore, these are mapped to blue and yellow respectively, which completes the color-based descriptor (see Figure (a)).
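The hue descriptor with the black/white remapping can be sketched using the standard library; the thresholds below are our assumptions, the paper does not state them:

```python
import colorsys

def hue_descriptor(r, g, b, s_min=0.2, v_min=0.2):
    """Hue in degrees with the undefined cases remapped: near-black pixels
    (low value) are assigned the hue of blue, near-white pixels (low
    saturation) the hue of yellow."""
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    if v < v_min:      # black: hue undefined
        return 240.0   # blue
    if s < s_min:      # white/grey: hue undefined
        return 60.0    # yellow
    return h * 360.0
```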
To extend this scheme for object instance detection as well as for on-the-fly recorded textures, we separate the expected color from the expected surface visibility. To this end, we store the texture coordinates of the object projection (compare Figure (a)) instead of storing the expected color directly. The template surface-texture is stored separately. At runtime we now use the texture coordinates to perform a lookup into the template-texture to retrieve the expected color, which gives us the same information as before.
However, it is now possible to easily swap the surface-texture to globally change the expected colors. Here a live-reconstructed texture can provide more accurate template colors and notably multiple template-textures can be used for object instance detection (see Figure (b)).
Finally, the outlier rejection scheme needs a slight modification for classification. Instead of returning the first inlier based on the expected color, it needs to allow multiple matches without repetition. For this, after finding an inlier, only the corresponding template-texture is removed and the remaining candidates are checked until all template-textures are found or all candidates are rejected.
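This greedy matching without repetition can be sketched as follows (function names and the candidate/template representation are our assumptions):

```python
def classify_candidates(candidates, templates, inlier_test):
    """Walk the detection candidates and remove a template once it has
    matched, so every instance is reported at most once.
    `inlier_test(candidate, template)` is the color-based check."""
    matches = []
    remaining = list(templates)
    for cand in candidates:
        for tmpl in remaining:
            if inlier_test(cand, tmpl):
                matches.append((cand, tmpl))
                remaining.remove(tmpl)
                break
        if not remaining:  # all template-textures found
            break
    return matches
```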
While this is an integral part of the LINEMOD pipeline, it can optionally be integrated as a post-processing step into a CNN-based architecture that is capable of abstracting the object appearance to some degree. For example, it can be executed after non-maximum-suppression to compute agreement with the color template.
Qualitative results for on-the-fly surface color reconstructions of the "driller" object in relation to different pose detection methods.
3 Evaluation
The presented method is evaluated in the context of object detection. To this end, we train the LINE2D variant of the LINEMOD detector on the corresponding dataset. The dataset does not contain views specifically recorded for surface reconstruction and thus represents the reconstruction-during-detection setting well. We use the publicly available LINEMOD implementation in OpenCV.
There are 15 sequences for different objects, consisting of RGB-D frames with ground-truth poses and recorded at distances of 65cm-115cm. We select a subset of 8 objects for which a 3D mesh is available and that are large enough to provide a reasonable texture resolution. The meshes included in the dataset were recorded using a variation of KinectFusion  and thus encode surface information as vertex-colors.
We apply our texturing algorithm on each sequence, using the ground-truth poses to simulate a tracking algorithm for better reproducibility. Then we train LINEMOD on synthetic renderings using the generated textures as well as the included vertex-colors as a baseline. We parametrize training and testing as follows:
We use 89 views on the upper hemisphere around the object, derived by subdividing an icosahedron twice recursively.
For each view there are 7 in-plane rotations with varying roll angles.
Furthermore, 6 distances between 65cm and 115cm in 10cm increments are used.
During color-based outlier rejection we discard candidates where less than 70% of the pixels have the expected color, using a fixed threshold on the per-pixel hue difference.
This results in a total of 3738 templates per object for training. However, in contrast to the original approach, we only use RGB data without depth; therefore we do not restrict the color gradient features to the contour, but compute them on the interior as well.
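The view sampling by recursive icosahedron subdivision can be sketched as below; a twice-subdivided icosahedron has 162 vertices, of which the upper hemisphere yields the 89 views used above (the exact hemisphere count depends on orientation and equator handling, which we do not reproduce here):

```python
import itertools, math

def normalize(p):
    n = math.sqrt(sum(c * c for c in p))
    return tuple(c / n for c in p)

def icosphere(subdivisions=2):
    """Unit icosphere from recursive subdivision of an icosahedron; each
    new edge midpoint is re-projected onto the sphere."""
    phi = (1 + math.sqrt(5)) / 2
    verts = [normalize(p) for p in
             [(-1, phi, 0), (1, phi, 0), (-1, -phi, 0), (1, -phi, 0),
              (0, -1, phi), (0, 1, phi), (0, -1, -phi), (0, 1, -phi),
              (phi, 0, -1), (phi, 0, 1), (-phi, 0, -1), (-phi, 0, 1)]]
    # Faces = triples of mutually nearest vertices (all icosahedron
    # 3-cliques are faces, so this recovers the 20 faces).
    edge = min(math.dist(a, b) for a, b in itertools.combinations(verts, 2))
    faces = [f for f in itertools.combinations(range(12), 3)
             if all(abs(math.dist(verts[i], verts[j]) - edge) < 1e-9
                    for i, j in itertools.combinations(f, 2))]
    for _ in range(subdivisions):
        new_faces, midpoint = [], {}
        def mid(i, j):
            key = (min(i, j), max(i, j))
            if key not in midpoint:  # share midpoints between faces
                verts.append(normalize(tuple((a + b) / 2 for a, b in
                                             zip(verts[i], verts[j]))))
                midpoint[key] = len(verts) - 1
            return midpoint[key]
        for a, b, c in faces:
            ab, bc, ca = mid(a, b), mid(b, c), mid(c, a)
            new_faces += [(a, ab, ca), (b, bc, ab), (c, ca, bc), (ab, bc, ca)]
        faces = new_faces
    return verts, faces
```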
For testing, we measure the true positive rate on the sequences. An object is considered successfully detected when it is within a fixed radius around the ground truth position. The radius is set globally in our experiments to allow for depth mis-classification by one step.
For keeping interactive performance we only consider the first 30 LINEMOD candidates for matching and outlier rejection.
To simulate the CAD data use-case without any surface information available, we additionally perform training using a white diffuse material for all objects. For generating gradient features on the interior of the object, we use ambient occlusion (AO)  as a lighting approximation. Ambient occlusion is a purely geometrical method that is independent of actual light and surface properties. We skip the outlier-rejection step as no color information is available.
Table 1: True positive rates per object for the variants AO, vertex-color, texture (7) and texture (6).
Table 1 shows the true positive rates for the variants mentioned above. As can be seen, the texture-based variants outperform the vertex-color baseline by a margin of 10% on average. However, there are strong variations between the individual objects; therefore it remains inconclusive whether variant (6) or variant (7) of our algorithm is preferable.
Notably, the AO variant cannot reach the performance of the other methods; for some objects it even becomes unusable (e.g. driller, cam). This emphasizes the need for surface information in object detection.
3.1 Using noisy pose data
To evaluate the applicability of our method for on-the-fly texturing with noisy pose data, we additionally used the state-of-the-art "single shot pose" (SSP) detector instead of relying on ground-truth poses.
Figure 8 shows qualitative results of texturing using ground-truth, SSP and LINEMOD poses. While the LINEMOD results only allow for a rough color-based outlier rejection, the results using SSP poses are very similar to using the ground truth. To further quantify this, we repeated the training of SSP using synthetic renderings of the "driller" object instead of using cross-validation as in the original paper. Here, we measured the true positive rate (TPR) using the 5cm, 5deg metric. Training with textured renderings (Fig. (c)) resulted in a TPR of . Using the imperfectly textured objects (Fig. (b)) resulted in a TPR of , which supports the qualitative impression. When training with vertex-colored renderings only, the performance was significantly degraded, resulting in a TPR of .
3.2 Runtime
The evaluation was performed on a notebook with an Intel i7-7700HQ CPU at 2.80GHz and an Intel HD 630 iGPU. The average time to accumulate one video frame into a 1024×1024 px texture is 2.69 ms. This allows running the texturing algorithm in parallel to tracking to reconstruct a texture on-the-fly.
The average time to perform a texture lookup as described in Section 2 is 0.82 ms using the software remap implementation in OpenCV. This step can therefore be applied generally, without requiring a GPU.
3.3 Multi instance detection
For the multi-instance detection we performed a qualitative analysis using a separate sequence in which two toy cars are alternately and simultaneously visible. The surface colors are white and red, which are adjacent in HSV space (white is mapped to yellow as described in Section 2). Furthermore, the surface exhibits specular reflection, which is not filtered during texturing.
4 Conclusion and future work
We have presented a method for real-time texturing that can be used to improve detection on-the-fly. In doing so, we have shown that texturing itself is crucial for the detection of CAD data where no surface information is available. However, even for meshes where vertex-colors were previously available, our approach improves detection performance significantly. Furthermore, we successfully applied the resulting textures to extend LINEMOD for object-instance recognition. By interleaving detection and texture extraction it now becomes possible to extend detection algorithms by color cues on-the-fly.
Our method currently requires the camera exposure to be fixed or relies on a global exposure compensation approach, which is error-prone. Reading the actual camera exposure could instead be used for accurate exposure fusion of the images. The surface specularity could also be explicitly considered during merging. Currently we assume diffuse reflection, which systematically over-brightens specular surfaces. Additionally, a plausibility test during merging could be used to reject implausible colors as caused by e.g. occlusion. As the LINEMOD detector is no longer state-of-the-art, further investigation is needed to similarly integrate our approach into an existing CNN-based method. This will require breaking up the end-to-end trained "black-box" to make the color information explicit.
- (2008) Screen space ambient occlusion. NVIDIA developer information: http://developers.nvidia.com.
- (2002) Practical shadow mapping. Journal of Graphics Tools 7 (4), pp. 9–18.
- (2008) Masked photo blending: mapping dense photographic data set on high-resolution sampled 3D models. Computers & Graphics 32 (4), pp. 464–473.
- (2011) Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 858–865.
- (2012) Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In Asian Conference on Computer Vision, pp. 548–562.
- (2012) Real-time surface light-field capture for augmentation of planar specular surfaces. In 2012 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 91–97.
- (2017) SSD-6D: making RGB-based 3D detection and 6D pose estimation great again. In Proceedings of the International Conference on Computer Vision (ICCV 2017), Venice, Italy, pp. 22–29.
- (2018) OptCuts: joint optimization of surface cuts and parameterization. In SIGGRAPH Asia 2018 Technical Papers, pp. 247.
- (2015) Live texturing of augmented reality characters from colored drawings. IEEE Transactions on Visualization & Computer Graphics (11), pp. 1201–1210.
- (2018) Deep model-based 6D pose refinement in RGB. arXiv preprint arXiv:1810.03065.
- (2011) KinectFusion: real-time dense surface mapping and tracking. In Mixed and Augmented Reality (ISMAR), 2011 10th IEEE International Symposium on, pp. 127–136.
- (2018) Learning 6DoF object poses from synthetic single channel images. In 2018 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), pp. 164–169.
- (2001) Color transfer between images. IEEE Computer Graphics and Applications 21 (5), pp. 34–41.
- (2013) Optimal local searching for fast and robust textureless 3D object tracking in highly cluttered backgrounds. IEEE Transactions on Visualization and Computer Graphics (TVCG).
- (2018) Real-time seamless single shot 6D object pose prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 292–301.
- (2014) Let there be color! Large-scale texturing of 3D reconstructions. In European Conference on Computer Vision, pp. 836–850.
- (2013) Robust real-time visual odometry for dense RGB-D mapping. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pp. 5724–5731.
- (2014) Color map optimization for 3D reconstruction with consumer depth cameras. ACM Transactions on Graphics (TOG) 33 (4), pp. 155.