Polarimetric Pose Prediction

12/07/2021
by Daoyi Gao, et al.
Technische Universität München

Light has many properties that can be passively measured by vision sensors. Colour-band-separated wavelength and intensity are arguably the most commonly used ones for monocular 6D object pose estimation. This paper explores how complementary polarisation information, i.e. the orientation of light wave oscillations, can influence the accuracy of pose predictions. A hybrid model that leverages physical priors jointly with a data-driven learning strategy is designed and carefully tested on objects with different amounts of photometric complexity. Our design not only significantly improves the pose accuracy in relation to photometric state-of-the-art approaches, but also enables object pose estimation for highly reflective and transparent objects.


1 Introduction

"Fiat lux." (Latin for "let there be light.")

Light has always fascinated mankind. It is not only at the centre of many of the greatest scientific discoveries of the last century, but also plays a crucial role for society and even forms the basis of religions. Typical light sensors used in computer vision either send or receive pulses and waves for which the wavelength and energy are measured to retrieve colour and intensity within a specified spectrum. However, intensity and wavelength are not the only properties of an electromagnetic (EM) wave. The oscillation direction of the EM field relative to the light ray defines its polarisation. Most natural light sources such as the sun, a lamp or a candle emit unpolarised light, which means that the light wave oscillates in a multitude of directions. When such a wave is reflected off an object, the light becomes either perfectly or partially polarised. Polarisation therefore carries information on surface structure, material and reflection angle which can complement passively retrieved texture information from a scene [30]. These additional measurements can be particularly interesting for photometrically challenging objects with metallic, reflective or transparent materials, which all pose challenges to vision pipelines and effectively hamper their use in automation.

Figure 1: PPP-Net. Orthogonal to colour and depth images (left), polarisation data provides cues to surface normals especially for highly reflective (cutlery) and translucent (glass bottle) objects. Our Polarimetric Pose Prediction Pipeline (right) leverages the input of an RGBP camera and uniquely combines physical surface cues from polarisation properties with a data-driven approach to estimate accurate poses even for challenging objects which cannot be accurately predicted by current state-of-the-art approaches based on RGB and RGB-D.
Figure 2: PPP-Net Pipeline Overview. After the initial detection of the object of interest, the RGBP image - a quadruplet of differently polarised RGB images - is utilised to compute AOLP/DOLP and polarised normal maps through our physical model. The polarised information and the physical cues are individually encoded and fused in our hybrid model. The decoder predicts an object mask, normal map and NOCS, and finally the 6D object pose is predicted by Patch-PnP [53].

While robust pipelines [23, 41, 10, 13] have been designed for the task of 6D pose estimation and the poses of texture-less objects [25, 14] have been successfully predicted, photometrically challenging objects with reflectance and partial transparency have become the focus of research only very recently [39]. These objects pose challenges to RGB-D sensing and the field still lacks methods to cope with these problems. We move beyond previous methods based on light intensity and exploit the polarisation property of light as an additional prior for surface normals. This allows us to build a hybrid method that combines a physical model with a data-driven learning approach to facilitate 6D pose estimation. We show that this not only facilitates pose estimation for photometrically challenging objects, but also improves the pose accuracy for classical objects. To this end, our core contributions are:

  1. We propose polarisation as a new modality for object pose estimation and explore its advantages over previous modalities.

  2. We design a hybrid pipeline for pose estimation that leverages polarisation cues through a combination of physical model cues with learning.

  3. As a result, we propose the first solution to estimate 6D poses for photometrically challenging objects with high reflectance and translucency using polarisation.

2 Related Work

2.1 Polarimetric Imaging

Polarisation for 2D. Polarisation cues provide complementary information useful for various tasks in 2D computer vision that involve photometrically challenging objects. This has inspired a series of works on semantic [58] and instance [30] segmentation for reflective and transparent objects. The absence of strong glare behind specific polarisation filters further helps to remove reflections from images [36]. While one polarisation camera can already provide significant improvements compared to photometric acquisition setups, the use of multispectral polarimetric light fields [28] boosts the performance even more.

Polarisation for 3D. Due to the inherent connection of polarisation with surface shape and texture, the natural field of application seems to be 3D computer vision. Indeed, previous works on shape from polarisation (SfP) investigate the estimation of surface normals and depth from polarimetric data. However, intrinsic model ambiguities constrained the setups of early works. Classical methods leverage an orthographic camera model and restrict the investigations to lab scenarios with very controlled environmental conditions [18, 3, 56, 48]. Yu et al. [56] mathematically connect polarisation intensity with surface height and optimise for depth in a controlled scenario, while Atkinson et al. [3] recover surface orientation for fully diffuse surfaces. Others [48] add shape-from-shading principles or investigate normal estimation using circularly polarised light [18]. While these methods rely on monocular polarisation, more than one view can be combined with physical models for SfP [2, 11]. Some works also explore the use of complementary photometric stereo [4] and hybrid RGB+P approaches [61] which complement each other and allow for metrically accurate depth estimates if the light direction is known. If an initial depth map (e.g. from RGB-D) exists, polarimetric cues can further refine the measurements [29]. Furthermore, the polarimetric sensing model helps estimate the relative transformation of a moving polarisation sensor [12], assuming the scene is fully diffuse. Data-driven approaches can mitigate assumptions on surface properties, light direction and object shapes. Ba et al. [5] estimate surface normals by presenting a set of plausible cues to a neural network which can resolve these ambiguous cues for SfP. We take inspiration from this approach to complement our pose estimation pipeline with physical priors. In contrast to these works, we are interested in object poses in an unconstrained setup without further assumptions on the reflection properties or lighting. The insights of previous works enable, for the first time, the design of a pipeline that addresses pose prediction for photometrically challenging objects made of transparent and highly reflective materials.

2.2 6D Pose Prediction

Monocular RGB. Methods that predict the 6D pose from a single image can be separated into three main categories: those that directly optimise for the pose, those that learn a pose embedding, and those that establish correspondences between the 3D model and the 2D image. Works that leverage pose parameterisation either directly regress the 6D pose [55, 37, 41, 35] or discretise the regression task and solve for classification [32, 10]. Networks trained this way directly predict pose parameters in the elements of the parameterisation used for training. Pose parameterisation can also be implicitly learned [60]. The second branch of methods [54, 51, 50] utilises this to learn an implicit space that encodes the pose, from which the predictions can be decoded. The latest and currently best-performing methods follow a two-stage approach. A network is used to predict 2D-3D correspondences between image and 3D model which are used by a consecutive RANSAC/PnP pipeline that robustly optimises the displacement. Some methods in this field use sparse correspondences [45, 43, 49, 27] while others establish dense 2D-3D pairs [57, 42, 38, 24]. While these methods typically learn the correspondences alone, some works managed to learn the task end-to-end [26, 53, 13]. Inspired by the success of this, we also structurally follow the design of GDR-Net [53].

RGB-D and Refinement. Since the task of monocular pose estimation from RGB is an inherently ill-posed problem, depth maps serve as a geometrical rescue. The spatial cue given by the depth map can be leveraged to establish point pairs for pose estimation [16], which can be further improved with RGB [7]. In general, the pose can be recovered from depth or combined RGB-D, and most RGB-only methods (e.g. [51, 38, 42, 35]) benefit from a depth-driven refinement using ICP [6] or from indirect multi-view cues [35]. The complementary information of RGB and depth has also inspired the seminal work DenseFusion [52] in which deeply encoded features from both modalities are fused. FFB6D [20] further improves this through a tight coupling strategy with cross-modal information exchange in multiple feature layers, combined with a keypoint extraction [21] that leverages geometry and texture cues. These works, however, crucially depend on input quality, and depth sensing suffers in photometrically challenging regions, where polarisation cues for depth could expedite the pose prediction. However, to the best of our knowledge, this has not been proposed yet.

Photometric Challenges. The field of 6D pose estimation usually tests on a set of well-established datasets with RGB-D input [23, 8, 55, 31]. Photometrically challenging objects such as texture-less and reflective industrial parts are also part of publicly available datasets [25, 15]. While most of these datasets are carefully annotated for the pose, polarisation input is not available. Transparency is a further challenge which has already been addressed in the pioneering work of Saxena et al. [47] where the robotic grasp point of objects is determined from RGB stereo without a 3D model. Phillips et al. [44] demonstrate how transparent objects with rotational symmetry can be reconstructed from two views using an edge detector and contour fitting and, more recently, KeyPose [40] investigates instance- and category-level pose prediction from RGB stereo. Since their depth sensor fails on transparent objects, they leverage an opaque-transparent object pair to establish ground truth depth. ClearGrasp [46] constitutes an RGB-D method that can be used on transparent objects. More recently, Liu et al. [39] presented the extensive StereOBJ-1M dataset. It includes transparent, reflective and translucent objects with variations in illumination and symmetry, using a binocular stereo RGB camera for pose estimation. However, none of these datasets comprises RGBP data.

To this end, the next natural step connects the shape cues from polarisation to recover object geometry in challenging environments. We further ask how to do so, starting with a look at polarimetric image formation.

3 Polarimetric Pose Prediction

Figure 3: Polarisation Camera. Light from an unpolarised light source reflects on an object surface. The refracted and reflected parts are partially polarised. A polarisation sensor captures the light. In front of every pixel there are four polarisation filters (PF) arranged at different angles (0°, 45°, 90°, 135°). The colour filter array (CFA) separates the light into different wavebands.

In contrast to RGBP sensors (see Fig. 3), RGB-D sensors enjoy wide use in the pose estimation field. Their cost-efficiency and tight integration into many devices open up many possibilities in the vision field, but their design also comes with a few drawbacks.

3.1 Photometric Challenges for RGB-D

Commercial depth sensors typically use active illumination, either by projecting a pattern (e.g. Intel RealSense D series) or by using time-of-flight (ToF) measurements (e.g. Kinect v2 / Azure Kinect, Intel RealSense L series). While the former triangulates depth using stereo vision principles on projected or scene textures, the latter measures the roundtrip time of a light pulse that reflects from the scene. Since the measurement principle is photometric, both suffer on photometrically challenging surfaces where reflections artificially extend the roundtrip time of photons and translucent objects deteriorate the projected pattern to an extent that makes depth estimation infeasible. Fig. 4 illustrates such an example for a set of common household objects. The semi-transparent vase becomes almost invisible for the ToF sensor used (RealSense L515), which instead measures the distance to the objects behind it. The reflections on both the cutlery and the can lead to depth estimates significantly farther than the correct value, while strong reflections at boundaries invalidate pixel distances.

Figure 4: Depth Artifacts. A depth sensor (RealSense L515) miscalculates depth values for typical household objects. Reflective boundaries (1,3) invalidate pixels while strong reflections (2,3) lead to incorrect values that are too far away. Semi-transparent objects such as the vase (4) become partly invisible for the depth sensor, which measures the distance to the objects behind.

3.2 Surface Normals from Polarisation

Before working with RGBP data, we introduce some of the physics behind polarimetric imaging. Natural light and most artificially emitted light are unpolarised, meaning that the electromagnetic wave oscillates along all planes perpendicular to the direction of propagation of light [17]. When unpolarised light passes through a linear polariser or is reflected at Brewster's angle from a surface, it becomes perfectly polarised. How fast light travels through a material and how much of it is reflected are determined by the refractive index, which also determines Brewster's angle for that medium. When light is reflected at the same angle to the surface normal as the incident ray, we speak of specular reflection. The remaining part penetrates the object as refracted light. As the light wave traverses the medium, it becomes partially polarised. It then escapes from the object and creates the diffuse reflection. For all real physical objects, the resulting reflection is a combination of specular and diffuse reflection, where the ratio largely depends on the refractive index and the angle of the incident light, as exemplified in Fig. 5.

Figure 5: DOLP. Polarisation changes for the reflection of diffuse light on a translucent surface. Note the indicated differences in the polarimetric image quadruplet that directly relate to the surface normal. The degree of linear polarisation (DOLP) for the translucent and reflective surfaces is considerably higher than for the rest of the image.

Light reaches the sensor with a specific intensity $I$ and wavelength $\lambda$. The colour filter array (CFA) of the sensor then separates the incoming light into RGB wavebands as illustrated in Fig. 3. The incoming light also has a degree of linear polarisation (DOLP) $\rho$ and a direction (angle) of polarisation (AOLP) $\phi$. The measured intensity $I_{\phi_{pol}}$ behind a polariser with angle $\phi_{pol}$ depends on these parameters and the unpolarised intensity $I_{un}$ [30]:

$I_{\phi_{pol}} = I_{un} \cdot \big(1 + \rho \cos(2\phi - 2\phi_{pol})\big)$   (1)

We find $I_{un}$, $\rho$ and $\phi$ from the over-determined system of linear equations in Eq. (1) using linear least squares. Depending on the surface properties, the AOLP is calculated as

$\phi = \alpha$ (diffuse) or $\phi = \alpha - \frac{\pi}{2}$ (specular), modulo $\pi$   (2)

where modulo $\pi$ indicates the $\pi$-ambiguity and $\alpha$ is the azimuth angle of the surface normal n. We can further relate the viewing (zenith) angle $\theta$ to the degree of polarisation by considering the Fresnel coefficients, thus the DOLP is similarly given by [3]

$\rho_d = \frac{(n - 1/n)^2 \sin^2\theta}{2 + 2n^2 - (n + 1/n)^2 \sin^2\theta + 4\cos\theta\sqrt{n^2 - \sin^2\theta}}$ (diffuse),  $\rho_s = \frac{2\sin^2\theta\,\cos\theta\,\sqrt{n^2 - \sin^2\theta}}{n^2 - \sin^2\theta - n^2\sin^2\theta + 2\sin^4\theta}$ (specular)   (3)

with $n$ the refractive index of the observed object material. Solving Eq. (3) for $\theta$, we retrieve three solutions, one for the diffuse case and two for the specular case. For each of the cases, we can now find the 3D orientation of the surface by calculating the surface normal:

$\mathbf{n} = \left(\cos\alpha \sin\theta,\; \sin\alpha \sin\theta,\; \cos\theta\right)^{\top}$   (4)

We use these plausible normals as physical priors per pixel to guide our neural network to estimate the 6D object pose.
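To make the first step of this physical model concrete, the following NumPy sketch recovers the unpolarised intensity, DOLP and AOLP per pixel from the four polarimetric measurements by solving the linear least-squares system described above. It assumes the standard filter angles 0°, 45°, 90° and 135° and treats a single channel; function and variable names are illustrative rather than taken from our implementation.

```python
import numpy as np

def polarisation_parameters(images, angles_deg=(0.0, 45.0, 90.0, 135.0)):
    """Recover unpolarised intensity, DOLP and AOLP per pixel from a stack of
    images captured behind linear polarisers at the given angles.

    images: array of shape (4, H, W) with the measured intensities I_phi.
    Returns (i_un, dolp, aolp), each of shape (H, W).
    """
    phis = np.deg2rad(np.asarray(angles_deg))
    # Model: I_phi = i_un * (1 + rho * cos(2*aolp - 2*phi))
    #              = i_un + c1*cos(2*phi) + c2*sin(2*phi)
    # with c1 = i_un*rho*cos(2*aolp), c2 = i_un*rho*sin(2*aolp).
    A = np.stack([np.ones_like(phis), np.cos(2 * phis), np.sin(2 * phis)], axis=1)  # (4, 3)
    y = images.reshape(4, -1)                      # (4, H*W)
    x, *_ = np.linalg.lstsq(A, y, rcond=None)      # least-squares solution, shape (3, H*W)
    i_un, c1, c2 = x
    dolp = np.sqrt(c1 ** 2 + c2 ** 2) / np.clip(i_un, 1e-6, None)
    aolp = 0.5 * np.arctan2(c2, c1)                # pi-ambiguous angle of linear polarisation
    shape = images.shape[1:]
    return i_un.reshape(shape), dolp.reshape(shape), aolp.reshape(shape)
```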

3.3 Hybrid Polarimetric Pose Prediction Model

In this section, we present our Polarimetric Pose Prediction Network, short PPP-Net. Given polarimetric images at four different angles (0°, 45°, 90°, 135°), together with the calculated AOLP $\phi$, DOLP $\rho$, and normal maps as physical priors, we aim to utilise a network to learn the pose that transforms the target object from the object frame to the camera frame, given the 3D CAD model of the object.
Network Architecture. Our network architecture is depicted in Fig. 2. The network has two encoders which separately take as input, cropped to the zoomed-in ROI, (i) the joint polarisation information from the native polarimetric images and the calculated AOLP/DOLP maps and (ii) the physical normals as priors. The decoder takes the combined encoded information from both encoders, together with skip connections from different hierarchical levels of the encoders, to decode the object mask, the normal map, and a 3-channel dense correspondence map (NOCS) which maps each pixel to its corresponding normalised 3D coordinate. The predicted normal map and the dense correspondence map are subsequently fed into a pose estimator as used in GDR-Net [53]. The pose estimator is composed of convolutional and fully connected layers and outputs the final estimated 3D rotation and translation.
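The following PyTorch sketch illustrates the overall layout described above: two ResNet encoders, a shared decoder with dense heads for mask, normals and NOCS, and a small convolutional/fully connected pose head in the spirit of Patch-PnP. The channel counts (four polarised RGB images plus AOLP/DOLP, and three physical normal hypotheses), the omission of skip connections, and all layer sizes are simplifying assumptions, not the exact configuration of PPP-Net.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class Encoder(nn.Module):
    """ResNet-18 trunk returning the final feature map (stride 32)."""
    def __init__(self, in_ch):
        super().__init__()
        net = resnet18(weights=None)
        net.conv1 = nn.Conv2d(in_ch, 64, 7, stride=2, padding=3, bias=False)
        self.features = nn.Sequential(*list(net.children())[:-2])
    def forward(self, x):
        return self.features(x)

class PPPNetSketch(nn.Module):
    def __init__(self, polar_ch=14, prior_ch=9):
        super().__init__()
        self.enc_polar = Encoder(polar_ch)   # 4 polarised RGB images + AOLP + DOLP (assumed)
        self.enc_prior = Encoder(prior_ch)   # three plausible normal maps (assumed)
        self.fuse = nn.Conv2d(512 * 2, 512, 1)
        def up(cin, cout):                   # simple upsampling block (no skip connections here)
            return nn.Sequential(nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                                 nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))
        self.decoder = nn.Sequential(up(512, 256), up(256, 128), up(128, 64), up(64, 64), up(64, 64))
        self.head_mask = nn.Conv2d(64, 1, 1)
        self.head_normal = nn.Conv2d(64, 3, 1)
        self.head_nocs = nn.Conv2d(64, 3, 1)
        # Patch-PnP-style head: convolutions over the dense predictions, then FC layers.
        self.pose_head = nn.Sequential(
            nn.Conv2d(7, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(128 * 64, 256), nn.ReLU(inplace=True))
        self.fc_rot = nn.Linear(256, 6)      # allocentric continuous 6D rotation
        self.fc_trans = nn.Linear(256, 3)    # scale-invariant translation (dx, dy, dz)

    def forward(self, polar, priors):
        f = self.fuse(torch.cat([self.enc_polar(polar), self.enc_prior(priors)], dim=1))
        g = self.decoder(f)
        mask, normal, nocs = self.head_mask(g), self.head_normal(g), self.head_nocs(g)
        normal = nn.functional.normalize(normal, dim=1)          # unit-length normals
        h = self.pose_head(torch.cat([mask, normal, nocs], dim=1))
        return mask, normal, nocs, self.fc_rot(h), self.fc_trans(h)
```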

Pose Parametrisation. Inspired by recent works [60, 38, 53], we parameterise the rotation as an allocentric, continuous 6-dimensional representation, and the translation as a scale-invariant representation [38, 53, 13].
The continuous 6-dimensional representation of the rotation consists of the first two columns of the original rotation matrix [60]. We further turn it into an allocentric representation [53, 13], since our network only perceives the ROI of the target object, which favours a viewpoint-independent representation.

The zoomed-in ROI helps the network focus on the most relevant information in the image, i.e. our target object. To overcome the limitations of direct translation vector regression, we estimate a scale-invariant translation composed of the relative difference between the projected object centroid and the detected bounding box centre with respect to the bounding box size, together with the relative zoomed-in depth $\delta_z$:

$\delta_x = \frac{o_x - c_x}{w}, \quad \delta_y = \frac{o_y - c_y}{h}, \quad \delta_z = \frac{t_z}{r}$   (5)

with $(o_x, o_y)$ and $(c_x, c_y)$ being the projected object centroid and bounding box centre coordinates, and $(w, h)$ the size of the bounding box. The bounding box size is also used for calculating the zoom-in ratio $r = s_{out} / \max(w, h)$, where $s_{out}$ is the size of the network output. Note that we can recover both the rotation matrix and the translation vector with known camera intrinsics $K$ [34, 38].
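As an illustration of how the predicted parameterisation can be mapped back to a metric pose, the sketch below orthonormalises the 6D rotation output, undoes the scale-invariant translation encoding using the detected bounding box and camera intrinsics, and converts the allocentric rotation to an egocentric one by aligning the optical axis with the ray towards the recovered object centre. The exact normalisation and zoom-ratio conventions are assumptions for illustration and may differ from our implementation.

```python
import numpy as np

def rot6d_to_matrix(r6):
    """Gram-Schmidt the continuous 6D representation (first two columns) back to SO(3)."""
    a1, a2 = r6[:3], r6[3:]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - np.dot(b1, a2) * b1
    b2 /= np.linalg.norm(b2)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=1)

def recover_translation(delta, bbox, zoom_size, K):
    """Undo the scale-invariant translation encoding (conventions are assumptions).

    delta     : (dx, dy, dz) network output
    bbox      : (cx, cy, w, h) detected box in pixels
    zoom_size : side length of the zoomed-in ROI fed to the network
    K         : 3x3 camera intrinsics
    """
    cx, cy, w, h = bbox
    r = zoom_size / max(w, h)          # zoom-in ratio
    ox = delta[0] * w + cx             # projected object centroid in pixels
    oy = delta[1] * h + cy
    tz = delta[2] * r                  # absolute depth from the relative, zoom-corrected depth
    tx = (ox - K[0, 2]) * tz / K[0, 0] # back-project the centroid with the recovered depth
    ty = (oy - K[1, 2]) * tz / K[1, 1]
    return np.array([tx, ty, tz])

def allocentric_to_egocentric(R_allo, t):
    """One common convention: rotate the allocentric prediction so that the optical
    axis is aligned with the ray towards the object centre (object in front of camera)."""
    ray = t / np.linalg.norm(t)
    z = np.array([0.0, 0.0, 1.0])
    v = np.cross(z, ray)
    c = np.dot(z, ray)
    V = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    R_corr = np.eye(3) + V + V @ V / (1.0 + c)   # Rodrigues-style alignment of z to ray
    return R_corr @ R_allo
```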

Object Normal Map. The surface normal map contains the surface orientation at each discrete pixel coordinate and thus encodes the shape of the object. Inspired by previous works in SfP, we also aim to retrieve the surface normal map in a data-driven manner [5]. To better encode the geometric cue from the input physical priors apart from the polarisation cue, we do not concatenate the physical normals with the polarised images as suggested by Ba et al. [5], but encode them separately in two ResNet encoders. The decoder then learns to produce the object shape encoded by the surface normal map. The estimated normals are L2-normalised to unit length. As shown in Tab. 1, with the given physical normals as shape prior, we achieve high-quality normal map predictions.

Dense Correspondence Map. The dense correspondence map stores, for each object pixel, the normalised 3D object coordinate given the associated pose. This explicitly models correspondences between 3D object coordinates and projected 2D pixel locations. As shown by Wang et al. [53], this representation helps the consecutive differentiable pose estimator to achieve high accuracy in comparison with RANSAC/PnP.
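As a toy illustration of this representation, the sketch below builds a (sparse) correspondence map by normalising the model coordinates to [0, 1]^3 and splatting them into the image under a given pose; a real pipeline would render the map with proper z-buffering. All names are illustrative.

```python
import numpy as np

def nocs_map_from_pose(vertices, R, t, K, hw):
    """Splat normalised object coordinates into a (h, w, 3) map: every projected model
    vertex stores its NOCS value at the pixel it lands on (no z-buffering, sketch only)."""
    h, w = hw
    # Normalise model coordinates to [0, 1]^3 using the model's bounding box.
    vmin, vmax = vertices.min(0), vertices.max(0)
    nocs = (vertices - vmin) / (vmax - vmin)
    # Transform to the camera frame and project with the pinhole intrinsics.
    cam = vertices @ R.T + t
    uv = cam @ K.T
    uv = (uv[:, :2] / uv[:, 2:3]).round().astype(int)
    out = np.zeros((h, w, 3), dtype=np.float32)
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h) & (cam[:, 2] > 0)
    out[uv[valid, 1], uv[valid, 0]] = nocs[valid]
    return out
```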

3.4 Learning Objectives

The overall objective is composed of both the geometrical feature learning and the pose optimisation [53]:

$\mathcal{L} = \mathcal{L}_{Pose} + \mathcal{L}_{Geom}$   (6)

with

$\mathcal{L}_{Pose} = \mathcal{L}_{R} + \mathcal{L}_{centre} + \mathcal{L}_{z}$   (7)
$\mathcal{L}_{Geom} = \mathcal{L}_{mask} + \mathcal{L}_{normal} + \mathcal{L}_{NOCS}$   (8)

Specifically, we employ separate loss terms for the given ground truth rotation $R$, centroid offset $(\delta_x, \delta_y)$ and relative depth $\delta_z$ as

$\mathcal{L}_{R} = \underset{\mathbf{x} \in \mathcal{M}}{\mathrm{avg}} \lVert \hat{R}\mathbf{x} - R\mathbf{x} \rVert_1, \quad \mathcal{L}_{centre} = \lVert (\hat{\delta}_x, \hat{\delta}_y) - (\delta_x, \delta_y) \rVert_1, \quad \mathcal{L}_{z} = \lVert \hat{\delta}_z - \delta_z \rVert_1$   (9)

where $\hat{\cdot}$ denotes the prediction and $\mathcal{M}$ the set of 3D model points. For symmetrical objects, the rotation loss is calculated as the smallest loss over all ground-truth rotations that are possible under the symmetry.

To learn the intermediate geometrical features, we employ L1 losses for the mask and the dense correspondence map, and a cosine similarity loss for the normal estimation:

$\mathcal{L}_{mask} = \lVert \hat{M} - M \rVert_1, \quad \mathcal{L}_{NOCS} = \lVert M \odot (\hat{C} - C) \rVert_1, \quad \mathcal{L}_{normal} = \underset{pixels}{\mathrm{avg}}\; M \odot \big(1 - \langle \hat{\mathbf{n}}, \mathbf{n} \rangle\big)$   (10)

where $\odot$ indicates the Hadamard product of element-wise multiplication, and $\langle \cdot, \cdot \rangle$ denotes the dot product.
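A minimal PyTorch sketch of these objectives is given below, assuming GDR-Net-style terms: a point-matching loss for the rotation, an L1 term for the translation components, masked L1 terms for the dense maps and a masked cosine term for the normals. Loss weights and symmetry handling are omitted, and the exact terms in the paper may differ.

```python
import torch
import torch.nn.functional as F

def geometric_losses(pred_mask, pred_nocs, pred_normal, gt_mask, gt_nocs, gt_normal):
    """L1 terms for mask and NOCS plus a cosine-similarity term for unit normals;
    dense terms are restricted to the ground-truth object mask."""
    l_mask = F.l1_loss(pred_mask, gt_mask)
    m = gt_mask                                                # Hadamard product with the mask
    l_nocs = F.l1_loss(pred_nocs * m, gt_nocs * m)
    cos = (pred_normal * gt_normal).sum(dim=1, keepdim=True)  # per-pixel dot product
    l_normal = ((1.0 - cos) * m).sum() / m.sum().clamp(min=1.0)
    return l_mask + l_nocs + l_normal

def pose_losses(pred_R, pred_t, gt_R, gt_t, model_points):
    """Point-matching loss for the rotation and an L1 loss for the translation."""
    l_rot = (model_points @ (pred_R - gt_R).transpose(-1, -2)).norm(dim=-1).mean()
    l_trans = F.l1_loss(pred_t, gt_t)
    return l_rot + l_trans
```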

4 Experimental Results

| Object | Photo. Chall. | Input: RGB | Input: Polar RGB | Input: Physical N | Output: Normals | Output: NOCS | Normal mean | Normal med. | 11.25° | 22.5° | 30° | ADD |
| Cup | low | ✓ | | | | | - | - | - | - | - | 91.1 |
| | | | ✓ | | | | - | - | - | - | - | 91.3 |
| | | | ✓ | | ✓ | ✓ | 7.3 | 5.5 | 86.2 | 96.1 | 97.9 | 91.3 |
| | | | ✓ | ✓ | ✓ | ✓ | 4.5 | 3.5 | 94.7 | 99.1 | 99.6 | 97.2 |
| Knife | high | ✓ | | | | | - | - | - | - | - | 84.1 |
| | | | ✓ | | | | - | - | - | - | - | 88.0 |
| | | | ✓ | | ✓ | ✓ | 12.2 | 8.0 | 68.7 | 88.5 | 92.4 | 89.4 |
| | | | ✓ | ✓ | ✓ | ✓ | 6.8 | 5.4 | 88.2 | 97.3 | 98.6 | 96.4 |

Table 1: PPP-Net Modalities Evaluation. Different combinations of input and output modalities are used for training to study their influence on the pose estimation accuracy (ADD) for objects with different photometric complexity. Where applicable, metrics for the estimated normals are reported as well (mean/median error in degrees and the percentage of pixels within 11.25°, 22.5° and 30° of the ground truth). Results for the other objects are in the Supplementary Material.

The motivation of our proposed pipeline is to show the advantage of leveraging pixel-wise physical priors from polarised light (a.k.a. RGBP) for accurate 6D pose estimation of photometrically challenging objects, for which RGB-only and RGB-D methods often fail. For this purpose, we train and test PPP-Net with different modalities, first on two exemplary objects with very different levels of photometric complexity: a plastic cup, and a photometrically very challenging, reflective and textureless stainless steel cutlery knife. As detailed later, we find that polarimetric information yields a significant performance gain for photometrically challenging objects.

4.1 Polarimetric Data Acquisition

To evaluate our pipeline we leverage 6 models from the PhoCal [1] category-level pose estimation dataset which comprises 60 household objects with high-quality 3D models scanned by a structured light 3D stereo scanner (EinScan-SP 3D Scanner, SHINING 3D Tech. Co., Ltd., Hangzhou, China). The sub-millimetre scanning accuracy of the device allows for highly accurate models. We select the models cup, teapot, can, fork, knife and bottle with increasing photometric complexity, which we illustrate in Fig. 6. The last three models do not include texture due to their surface structure. The 3D scanning was done with a vanishing 3D scanning spray that makes the surface temporarily opaque. To acquire RGB-D images, we use a direct Time-of-Flight (dToF) camera, the Intel RealSense LiDAR Camera L515 (Intel, Santa Clara, California, USA), which captures RGB and depth data at 640x480 pixel resolution.

RGBP is acquired using the polarisation camera Phoenix 5.0 MP PHX050S1-QC comprising a Sony IMX264MYR CMOS (Color) Polarsens sensor (LUCID Vision Labs, Inc., Richmond, B.C., Canada) through a Universe Compact C-Mount 5MP 2/3" 6mm f/2.0 lens (Universe, New York, USA) at 612x512 pixel resolution. Both cameras are mounted jointly on a KUKA iiwa (KUKA Roboter GmbH, Augsburg, Germany) 7 DoF robotic arm that guarantees sub-millimetre positional reproducibility. Intrinsic and extrinsic calibration is performed following the standard pinhole camera model [59] with five distortion coefficients [22]. For pose annotation, we leverage the mechanical pose annotation method proposed in PhoCal [1], where the robotic manipulator is used to tip the object of interest and extract a point cloud. This point cloud is subsequently aligned to the 3D model using ICP [6] to obtain highly accurate pose labels even for photometrically challenging objects. We plan a robot trajectory and use this setup to acquire four scenes with four different trajectories each, and utilise a total of 8740 image sets for the dataset.

Figure 6: 3D Models. Test objects with increasing photometric complexity (left to right). Three objects have no texture as they are reflective (cutlery) or transparent (bottle).

4.2 Experiments Setup

Implementation Details. We initially fine-tune an off-the-shelf detector, Mask R-CNN [19], directly on the polarised images to provide useful object crops on our data (as needed for the RGB-only benchmark and ours). We follow a training/testing split strategy similar to the one commonly used for public datasets [9] and divide the RGBP images into training and testing sets accordingly. We train our network end-to-end with the Adam optimiser [33] for 200 epochs. The initial learning rate is set to 1e-4 and is halved every 50 epochs. As the depth sensor has a different field of view and is placed beneath the polarisation camera on a customised camera rig, the RGB-D benchmark split differs from the RGB training/testing split.
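The optimisation schedule can be set up as follows; the model is a stand-in and the inner training loop is elided.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for PPP-Net
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)
# Halve the learning rate every 50 epochs, as described above.
scheduler = torch.optim.lr_scheduler.StepLR(optimiser, step_size=50, gamma=0.5)

for epoch in range(200):
    # ... iterate over the training set and backpropagate the total loss of Eq. (6) ...
    scheduler.step()
```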

Evaluation Metrics. To evaluate our proposed 6D pose estimation approach, we report the pose estimation accuracy per object as the commonly used average distance (ADD) and its equivalent for symmetrical objects (ADD-S) [23] for the different benchmarks. For the surface normal estimation, we calculate the mean and median errors (in degrees) and the percentage of pixels where the estimated normals deviate by less than 11.25°, 22.5° and 30° from the ground truth. We additionally give insights into our proposed pipeline by performing detailed ablations on the input modalities, the fusion of complementary modalities, and the effect of explicitly learning physically plausible geometric information on the pose prediction accuracy (see Tab. 1), and we discuss limitations of our proposed approach.
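For reference, the ADD and ADD-S metrics reduce to a few lines of NumPy; the brute-force nearest-neighbour search in ADD-S is only meant for illustration on small point sets.

```python
import numpy as np

def add_metric(R_pred, t_pred, R_gt, t_gt, model_points):
    """Average distance of model points (ADD) between predicted and ground-truth pose."""
    p1 = model_points @ R_pred.T + t_pred
    p2 = model_points @ R_gt.T + t_gt
    return np.linalg.norm(p1 - p2, axis=1).mean()

def adds_metric(R_pred, t_pred, R_gt, t_gt, model_points):
    """ADD-S for symmetric objects: nearest-neighbour distance instead of
    point-wise correspondence (brute force, O(N^2), for illustration only)."""
    p1 = model_points @ R_pred.T + t_pred
    p2 = model_points @ R_gt.T + t_gt
    d = np.linalg.norm(p1[:, None, :] - p2[None, :, :], axis=-1)
    return d.min(axis=1).mean()

# A pose is typically counted as correct if ADD(-S) is below 10% of the model diameter.
```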

4.3 PPP-Net

Here, we perform a series of experiments to study the influence of the input modality on the pose estimation accuracy (compare Tab. 1), where we specifically analyse the contribution of polarimetric image information to 6D object pose estimation. We demonstrate that our network with RGBP input performs at the state-of-the-art level for non-reflective, textured objects, which we define as less photometrically challenging, e.g. the plastic cup, and outperforms current models for photometrically complex objects, e.g. the stainless steel cutlery.
To identify the direct influence of polarisation imaging on the task of accurate object pose estimation, we first establish an RGB-only baseline by stripping PPP-Net of its polarisation-specific contributions. To compute the unpolarised RGB image, we average the polarimetric images at complementary angles and use this as input for the RGB-only variant (see the sketch below). As shown in the first two rows in Tab. 1 for each object (RGB against Polar RGB), the polarisation modality yields larger accuracy gains for the photometrically challenging knife than for the cup. Auxiliary network predictions for normals and NOCS marginally enhance the performance, as the network is encouraged to explicitly encode this information from the input modalities. The physically-induced normals from polarisation images provide orthogonal information that significantly boosts the pose prediction quality and thus achieve the best ADD performance across all experiments. This behaviour is most prominent for the photometrically challenging knife.
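The sketch referenced above simply averages the measurements behind complementary polariser angles, which cancels the polarised component of Eq. (1) and leaves the unpolarised intensity.

```python
import numpy as np

def unpolarised_rgb(i_0, i_45, i_90, i_135):
    """I_0 + I_90 = I_45 + I_135 = 2 * I_un, so averaging all four polarised images
    recovers the unpolarised RGB image used by the RGB-only baseline."""
    return 0.25 * (i_0 + i_45 + i_90 + i_135)
```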

| Object | Photometric Challenges | Depth Quality | FFB6D (RGB-D split) | Ours (RGB-D split) | GDR-Net (RGB split) | Ours (RGB split) |
| Cup | - | + | 99.4 | 98.1 | 96.7 | 97.2 |
| Teapot | (partially reflective) | ++ | 86.8 | 94.2 | 99.0 | 99.9 |
| Can | reflective, metallic | - | 80.4 | 99.7 | 96.5 | 98.4 |
| Fork | reflective, metallic, textureless | -- | 37.0 | 72.4 | 86.6 | 95.9 |
| Knife | reflective, metallic, textureless | --- | 36.7 | 87.2 | 92.6 | 96.4 |
| Bottle | reflective, textureless, transparent, symmetric | None | 61.5 | 93.6 | 94.4 | 97.5 |
| Mean | | | 67.0 | 90.9 | 94.3 | 97.6 |

Table 2: Benchmark comparisons. We compare our method against recent RGB-D (FFB6D [20]) and RGB-only (GDR-Net [53]) methods on a variety of objects with different levels of photometric challenge and depth map quality (from good, ++, to low, ---, or entirely missing), the latter serving as input for FFB6D. RGB-D and RGB-only comparisons are trained and tested on different splits due to the different field of view of the depth camera (see Sec. 4 for details). We report the Average Recall of ADD(-S).

4.4 Comparison with established Benchmarks

The input modality experiments already demonstrate the strong capabilities of polarimetric imaging inputs, which allow PPP-Net to learn reliable 6D pose prediction with high accuracy for photometrically challenging objects. The depth map of an RGB-D sensor can also provide geometric information that can be utilised for the task of 6D object pose estimation. FFB6D [20] is currently the best-performing state-of-the-art learning pipeline which combines RGB and geometric information from depth maps. The design of FFB6D is motivated by principles similar to our proposed method, since it leverages geometric information for the task of 6D pose estimation, and it is therefore chosen as a strong geometric benchmark for comparison. The unique Full-Flow-Bidirectional fusion network [20] of FFB6D learns to combine appearance and depth information as well as local and global information from the two individual modalities.

We train FFB6D on our data for each object individually and report the best ADD(-S) metric for all objects in Tab. 2. The photometric challenge that each object constitutes is summarised in Tab. 2 and detailed by its properties (compare with Fig. 6). The objects are categorised into three classes based on the quality of the depth maps produced by the depth sensor (compare also Fig. 4). We observe that objects with good depth maps and minor photometric challenges achieve high ADD values for FFB6D. For challenging objects, the increase in photometric complexity (and lower depth map quality) correlates with a decrease in ADD. The transparent Bottle object is an exception to this pattern: its depth map is completely invalid (compare Fig. 4), but FFB6D still achieves high ADD. Our hypothesis is that the network successfully learns to ignore the depth map input from early training onward (see Sec. 5 for details). PPP-Net achieves comparable results for easy objects and outperforms the strong benchmark for photometrically complex objects. Our method does not suffer from reduced ADD due to noisy or inaccurate depth maps but rather leverages the orthogonal surface information from RGBP data.

As PPP-Net profits vastly from physical priors from polarisation, we thoroughly investigate to what extent this additional information improves the estimated poses, especially for photometrically challenging objects, by also comparing the results against the monocular RGB-only method GDR-Net [53]. We observe that, while using polarimetric information slightly improves pose estimation accuracy for non-challenging objects, we achieve superior performance for items with inconsistent photometric information due to reflection or transparency. In Tab. 2, the accuracy gain of PPP-Net over GDR-Net grows with the photometric complexity, since our physical priors provide additional information about the geometry of an object.

5 Discussion

Limitations of current geometric methods. As mentioned earlier, we postulate that the RGB-D method ignores invalid depth data already in early stages of training (e.g. for the transparent bottle) and eventually learns to also ignore noisy or corrupted depth information. To test this assumption, we perform adversarial attacks on the input depth map for the FFB6D [20] encoder to analyse which parts of the input modalities the network relies on when making a prediction. For this purpose we add small Gaussian noise to the depth-related feature embedding in the bottleneck of the network and compare the ADD under this attack. We purposely "overfit" the model on objects of different photometric complexity and compute the relative decrease in ADD under the attack. We observe that the relative decrease is smaller for photometrically challenging objects than for objects with accurate depth maps (a smaller drop in ADD for the knife than for the cup). These findings suggest that the network indeed relies on the RGB information only.
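A sketch of this probing procedure is given below; `encode_rgb`, `encode_depth` and `decode_pose` are hypothetical hooks into an FFB6D-like model and do not correspond to real functions of the released code.

```python
import torch

def depth_reliance_probe(pose_net, batch, sigma=0.05):
    """Perturb the depth-branch bottleneck embedding with Gaussian noise and compare
    the poses before and after; a small change suggests the network ignores depth.
    The three methods used here are hypothetical hooks into an FFB6D-like model."""
    with torch.no_grad():
        f_rgb = pose_net.encode_rgb(batch["rgb"])
        f_depth = pose_net.encode_depth(batch["depth"])
        pose_clean = pose_net.decode_pose(f_rgb, f_depth)
        pose_attacked = pose_net.decode_pose(f_rgb, f_depth + sigma * torch.randn_like(f_depth))
    return pose_clean, pose_attacked
```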

Benefits of Polarisation. We have shown that physical priors from polarised light can significantly improve 6D pose estimation results for photometrically challenging objects. RGB-only methods do not incorporate any geometric information and therefore show worse results in scenarios with reflective surfaces or objects with little texture. Methods which try to leverage geometric priors from RGB-D [20] often cannot reliably recover the 6D pose of such objects as the depth map is usually degenerate and corrupt. Our PPP-Net, as the first RGBP 6D object pose estimation method, successfully learns accurate poses even for very challenging objects by extracting geometric information from physical priors. Qualitative results are shown in Fig. 7 and additionally in the supplementary material. Another benefit of using RGBP lies in the sensor itself: as the polarisation filter is directly integrated on the same sensor as the Bayer filter, both modalities are intrinsically calibrated and the image can be acquired passively, paving the way for sensor integration in low-energy and mobile devices. RGB-D cameras, on the contrary, often require energy-costly active illumination and extrinsic calibration, which prevents simple integration and introduces additional uncertainty to the final RGB-D image.

Limitations. Our physical model requires the refractive index of the respective object to reliably compute the physical priors. To explore the potential of the physical model, and in contrast to prior works [48, 5] which fix the refractive index to a single constant for all experiments, we use physically plausible values according to the materials (approximated via the look-up tables provided by https://refractiveindex.info/). This means that one needs to manually choose this parameter, which limits the performance of the physical model when encountering objects with unknown composite materials. Moreover, strong changes in texture also affect the reflection of light and thus the DOLP calculation, which, in turn, influences our physical priors.

6 Conclusion

We have presented PPP-Net, the first learning-based 6D object pose estimation pipeline which leverages geometric information from polarisation images through physical cues. Our method outperforms current state-of-the-art RGB-D and RGB methods for photometrically challenging objects and demonstrates on-par performance for ordinary objects. Extensive ablations show the importance of the complementary polarisation information for accurate pose estimation, specifically for objects with little texture, reflective surfaces or transparency.

Figure 7: Qualitative Results. Input images with 2D detections are shown. Predicted and GT 6D poses are illustrated by blue and green bounding boxes, respectively.

A Physical Priors

We use physical priors as inputs to our network to improve the estimated 6D pose of an object. These priors form relations between polarisation properties and the azimuth and zenith angle of the surface normal, which serve as geometric cues orthogonal to colour information. We calculate the physical priors under the assumption of either specular or diffuse reflection.
To recover the azimuth and zenith angle of the surface normal, we present the calculation for solving the unknowns of Eq. A1.
A polarimetric camera registers the intensity $I_{\phi_{pol}}$ behind four linear polarisers with angles $\phi_{pol} \in \{0°, 45°, 90°, 135°\}$, which depends on the unpolarised intensity $I_{un}$, the degree of polarisation $\rho$, and the angle of polarisation $\phi$:

$I_{\phi_{pol}} = I_{un} \cdot \big(1 + \rho \cos(2\phi - 2\phi_{pol})\big)$   (A1)

Eq. A1 can be re-written as:

$I_{\phi_{pol}} = I_{un} + \underbrace{I_{un}\,\rho\cos(2\phi)}_{c_1}\cos(2\phi_{pol}) + \underbrace{I_{un}\,\rho\sin(2\phi)}_{c_2}\sin(2\phi_{pol})$   (A2)

For all angles $\phi_{pol} \in \{0°, 45°, 90°, 135°\}$, we obtain an over-determined linear equation system for each pixel location with the unknowns $I_{un}$, $c_1$ and $c_2$. After solving this system using least squares, the first unknown directly gives the unpolarised intensity, while the degree and angle of polarisation follow as:

$\rho = \frac{\sqrt{c_1^2 + c_2^2}}{I_{un}}, \qquad \phi = \frac{1}{2}\arctan\frac{c_2}{c_1}$   (A3)

The azimuth angle $\alpha$ can be found using Eq. 2. Then, we can estimate the zenith angle $\theta$ from Eq. 3 by linear interpolation. Both reflection models take the same value for the refractive index $n$, since it is an intrinsic property of the material and does not depend on the reflection model. The values used for our objects can be seen in Tab. A1.

| Object | Material | Refractive Index |
| Teapot | ceramic | 1.54 |
| Can | aluminium composite | 1.35 |
| Fork | stainless steel | 2.75 |
| Knife | stainless steel | 2.75 |
| Bottle | glass | 1.52 |
| Cup | plastics | 1.50 |

Table A1: Refractive Indices.
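The zenith-angle interpolation mentioned above can be sketched as follows for the diffuse model of Eq. 3 (the two specular solutions are handled analogously); together with the azimuth from Eq. 2, the surface normal of Eq. 4 follows. The function names and lookup-table resolution are illustrative choices.

```python
import numpy as np

def zenith_from_dolp_diffuse(dolp, n):
    """Invert the diffuse DOLP-zenith relation (Eq. 3) by linear interpolation over a
    lookup table; the diffuse DOLP is monotonic in the zenith angle on [0, pi/2)."""
    theta = np.linspace(0.0, np.pi / 2 - 1e-3, 1000)
    s, c = np.sin(theta), np.cos(theta)
    rho = ((n - 1.0 / n) ** 2 * s ** 2) / (
        2 + 2 * n ** 2 - (n + 1.0 / n) ** 2 * s ** 2 + 4 * c * np.sqrt(n ** 2 - s ** 2))
    return np.interp(dolp, rho, theta)

def normal_from_angles(azimuth, zenith):
    """Assemble the surface normal from azimuth and zenith angles (Eq. 4)."""
    return np.stack([np.cos(azimuth) * np.sin(zenith),
                     np.sin(azimuth) * np.sin(zenith),
                     np.cos(zenith)], axis=-1)

# Example: zenith angle in degrees for a glass surface (n = 1.52) observed with DOLP 0.3.
print(np.rad2deg(zenith_from_dolp_diffuse(0.3, 1.52)))
```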

B Additional Results

In Fig. A1, we visualise the 6D pose by overlaying the image with the corresponding transformed 3D bounding box. For better visualisation we cropped the images and zoomed into the area of interest. Tab. A2 extends Tab. 1 of the main paper and summarises the quantitative evaluation of the different modalities of PPP-Net for all objects under consideration in the dataset.

Figure A1: Qualitative Results. Predicted and GT 6D poses are illustrated by blue and green bounding boxes, respectively.
| Object | Photo. Chall. | Input: RGB | Input: Polar RGB | Input: Physical N | Output: Normals | Output: NOCS | Normal mean | Normal med. | 11.25° | 22.5° | 30° | ADD(-S) |
| Teapot | low | ✓ | | | | | - | - | - | - | - | 97.8 |
| | | | ✓ | | | | - | - | - | - | - | 99.5 |
| | | | ✓ | | ✓ | ✓ | 7.9 | 5.4 | 82.5 | 94.5 | 97.1 | 99.2 |
| | | | ✓ | ✓ | ✓ | ✓ | 5.3 | 4.0 | 91.6 | 98.7 | 99.5 | 99.9 |
| Can | medium | ✓ | | | | | - | - | - | - | - | 91.8 |
| | | | ✓ | | | | - | - | - | - | - | 93.2 |
| | | | ✓ | | ✓ | ✓ | 5.7 | 3.9 | 90.0 | 97.0 | 98.6 | 96.7 |
| | | | ✓ | ✓ | ✓ | ✓ | 6.0 | 4.5 | 89.0 | 97.3 | 98.9 | 98.4 |
| Fork | high | ✓ | | | | | - | - | - | - | - | 85.4 |
| | | | ✓ | | | | - | - | - | - | - | 86.1 |
| | | | ✓ | | ✓ | ✓ | 11.0 | 7.3 | 72.6 | 90.7 | 93.9 | 92.9 |
| | | | ✓ | ✓ | ✓ | ✓ | 6.5 | 4.3 | 87.6 | 95.9 | 97.6 | 95.9 |
| Bottle | high | ✓ | | | | | - | - | - | - | - | 90.5 |
| | | | ✓ | | | | - | - | - | - | - | 93.5 |
| | | | ✓ | | ✓ | ✓ | 5.6 | 4.7 | 92.9 | 99.0 | 99.6 | 94.7 |
| | | | ✓ | ✓ | ✓ | ✓ | 5.4 | 4.5 | 92.1 | 99.0 | 99.6 | 97.5 |

Table A2: PPP-Net Modalities Evaluation. Different combinations of input and output modalities are used for training to study their influence on the pose estimation accuracy ADD(-S) for objects with different photometric complexity. Where applicable, metrics for the estimated normals are reported as well.

References

  • [1] Anonymous (2021) PhoCaL: a multimodal dataset for category-level object pose estimation with photometrically challenging objects. In Under Submission, Cited by: §4.1, §4.1.
  • [2] G. A. Atkinson and E. R. Hancock (2005) Multi-view surface reconstruction using polarization. In Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, Vol. 1, pp. 309–316. Cited by: §2.1.
  • [3] G. A. Atkinson and E. R. Hancock (2006) Recovery of surface orientation from diffuse polarization. IEEE transactions on image processing 15 (6), pp. 1653–1664. Cited by: §2.1, §3.2.
  • [4] G. A. Atkinson (2017) Polarisation photometric stereo. Computer Vision and Image Understanding 160, pp. 158–167. Cited by: §2.1.
  • [5] Y. Ba, A. Gilbert, F. Wang, J. Yang, R. Chen, Y. Wang, L. Yan, B. Shi, and A. Kadambi (2020) Deep shape from polarization. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV 16, pp. 554–571. Cited by: §2.1, §3.3, §5.
  • [6] P. J. Besl and N. D. McKay (1992) Method for registration of 3-d shapes. In Sensor fusion IV: control paradigms and data structures, Vol. 1611, pp. 586–606. Cited by: §2.2, §4.1.
  • [7] T. Birdal and S. Ilic (2015) Point pair features based object detection and pose estimation revisited. In 2015 International Conference on 3D Vision, pp. 527–535. Cited by: §2.2.
  • [8] E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, and C. Rother (2014) Learning 6d object pose estimation using 3d object coordinates. In European conference on computer vision, pp. 536–551. Cited by: §2.2.
  • [9] E. Brachmann, F. Michel, A. Krull, M. Y. Yang, S. Gumhold, et al. (2016) Uncertainty-driven 6d pose estimation of objects and scenes from a single rgb image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3364–3372. Cited by: §4.2.
  • [10] B. Busam, H. J. Jung, and N. Navab (2020) I like to move it: 6d pose estimation as an action decision process. arXiv preprint arXiv:2009.12678. Cited by: §1, §2.2.
  • [11] Z. Cui, J. Gu, B. Shi, P. Tan, and J. Kautz (2017) Polarimetric multi-view stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1558–1567. Cited by: §2.1.
  • [12] Z. Cui, V. Larsson, and M. Pollefeys (2019) Polarimetric relative pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2671–2680. Cited by: §2.1.
  • [13] Y. Di, F. Manhardt, G. Wang, X. Ji, N. Navab, and F. Tombari (2021) SO-pose: exploiting self-occlusion for direct 6d pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12396–12405. Cited by: §1, §2.2, §3.3.
  • [14] B. Drost, M. Ulrich, P. Bergmann, P. Hartinger, and C. Steger (2017-10) Introducing mvtec itodd - a dataset for 3d object recognition in industry. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops, Cited by: §1.
  • [15] B. Drost, M. Ulrich, P. Bergmann, P. Hartinger, and C. Steger (2017) Introducing mvtec itodd-a dataset for 3d object recognition in industry. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 2200–2208. Cited by: §2.2.
  • [16] B. Drost, M. Ulrich, N. Navab, and S. Ilic (2010) Model globally, match locally: efficient and robust 3d object recognition. In 2010 IEEE computer society conference on computer vision and pattern recognition, pp. 998–1005. Cited by: §2.2.
  • [17] T. Fließbach (2012) Elektrodynamik: lehrbuch zur theoretischen physik ii. Vol. 2, Springer-Verlag. Cited by: §3.2.
  • [18] N. M. Garcia, I. De Erausquin, C. Edmiston, and V. Gruev (2015) Surface normal reconstruction using circularly polarized light. Optics express 23 (11), pp. 14391–14406. Cited by: §2.1.
  • [19] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §4.2.
  • [20] Y. He, H. Huang, H. Fan, Q. Chen, and J. Sun (2021-06) FFB6D: a full flow bidirectional fusion network for 6d pose estimation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2, §4.4, Table 2, §5, §5.
  • [21] Y. He, W. Sun, H. Huang, J. Liu, H. Fan, and J. Sun (2020-06) PVN3D: a deep point-wise 3d keypoints voting network for 6dof pose estimation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2.
  • [22] J. Heikkila and O. Silvén (1997) A four-step camera calibration procedure with implicit image correction. In Proceedings of IEEE computer society conference on computer vision and pattern recognition, pp. 1106–1112. Cited by: §4.1.
  • [23] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab (2012) Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Asian conference on computer vision, pp. 548–562. Cited by: §1, §2.2, §4.2.
  • [24] T. Hodan, D. Barath, and J. Matas (2020) Epos: estimating 6d pose of objects with symmetries. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11703–11712. Cited by: §2.2.
  • [25] T. Hodan, P. Haluza, Š. Obdržálek, J. Matas, M. Lourakis, and X. Zabulis (2017) T-less: an rgb-d dataset for 6d pose estimation of texture-less objects. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 880–888. Cited by: §1, §2.2.
  • [26] Y. Hu, P. Fua, W. Wang, and M. Salzmann (2020) Single-stage 6d object pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2930–2939. Cited by: §2.2.
  • [27] Y. Hu, J. Hugonot, P. Fua, and M. Salzmann (2019) Segmentation-driven 6d object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3385–3394. Cited by: §2.2.
  • [28] M. N. Islam, M. Tahtali, and M. Pickering (2021) Specular reflection detection and inpainting in transparent object through msplfi. Remote Sensing 13 (3), pp. 455. Cited by: §2.1.
  • [29] A. Kadambi, V. Taamazyan, B. Shi, and R. Raskar (2017) Depth sensing using geometrically constrained polarization normals. International Journal of Computer Vision 125 (1-3), pp. 34–51. Cited by: §2.1.
  • [30] A. Kalra, V. Taamazyan, S. K. Rao, K. Venkataraman, R. Raskar, and A. Kadambi (2020) Deep polarization cues for transparent object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8602–8611. Cited by: §1, §2.1, §3.2.
  • [31] R. Kaskman, S. Zakharov, I. Shugurov, and S. Ilic (2019) HomebrewedDB: rgb-d dataset for 6d pose estimation of 3d objects. International Conference on Computer Vision (ICCV) Workshops. Cited by: §2.2.
  • [32] W. Kehl, F. Manhardt, F. Tombari, S. Ilic, and N. Navab (2017) Ssd-6d: making rgb-based 3d detection and 6d pose estimation great again. In Proceedings of the IEEE international conference on computer vision, pp. 1521–1529. Cited by: §2.2.
  • [33] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
  • [34] A. Kundu, Y. Li, and J. M. Rehg (2018) 3d-rcnn: instance-level 3d object reconstruction via render-and-compare. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3559–3568. Cited by: §3.3.
  • [35] Y. Labbé, J. Carpentier, M. Aubry, and J. Sivic (2020) Cosypose: consistent multi-view multi-object 6d pose estimation. In European Conference on Computer Vision, pp. 574–591. Cited by: §2.2, §2.2.
  • [36] C. Lei, X. Huang, M. Zhang, Q. Yan, W. Sun, and Q. Chen (2020) Polarized reflection removal with perfect alignment in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1750–1758. Cited by: §2.1.
  • [37] Y. Li, G. Wang, X. Ji, Y. Xiang, and D. Fox (2018) Deepim: deep iterative matching for 6d pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 683–698. Cited by: §2.2.
  • [38] Z. Li, G. Wang, and X. Ji (2019) Cdpn: coordinates-based disentangled pose network for real-time rgb-based 6-dof object pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7678–7687. Cited by: §2.2, §2.2, §3.3, §3.3.
  • [39] X. Liu, S. Iwase, and K. M. Kitani (2021) StereOBJ-1m: large-scale stereo image dataset for 6d object pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10870–10879. Cited by: §1, §2.2.
  • [40] X. Liu, R. Jonschkowski, A. Angelova, and K. Konolige (2020) Keypose: multi-view 3d labeling and keypoint estimation for transparent objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11602–11610. Cited by: §2.2.
  • [41] F. Manhardt, D. M. Arroyo, C. Rupprecht, B. Busam, T. Birdal, N. Navab, and F. Tombari (2019) Explaining the ambiguity of object detection and 6d pose from visual data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6841–6850. Cited by: §1, §2.2.
  • [42] K. Park, T. Patten, and M. Vincze (2019) Pix2pose: pixel-wise coordinate regression of objects for 6d pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7668–7677. Cited by: §2.2, §2.2.
  • [43] S. Peng, Y. Liu, Q. Huang, X. Zhou, and H. Bao (2019) Pvnet: pixel-wise voting network for 6dof pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4561–4570. Cited by: §2.2.
  • [44] C. J. Phillips, M. Lecce, and K. Daniilidis (2016) Seeing glassware: from edge detection to pose estimation and shape recovery.. In Robotics: Science and Systems, Vol. 3. Cited by: §2.2.
  • [45] M. Rad and V. Lepetit (2017) Bb8: a scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3828–3836. Cited by: §2.2.
  • [46] S. Sajjan, M. Moore, M. Pan, G. Nagaraja, J. Lee, A. Zeng, and S. Song (2020) Clear grasp: 3d shape estimation of transparent objects for manipulation. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 3634–3642. Cited by: §2.2.
  • [47] A. Saxena, J. Driemeyer, and A. Y. Ng (2008) Robotic grasping of novel objects using vision. The International Journal of Robotics Research 27 (2), pp. 157–173. Cited by: §2.2.
  • [48] W. A. Smith, R. Ramamoorthi, and S. Tozza (2018) Height-from-polarisation with unknown lighting or albedo. IEEE transactions on pattern analysis and machine intelligence 41 (12), pp. 2875–2888. Cited by: §2.1, §5.
  • [49] C. Song, J. Song, and Q. Huang (2020) Hybridpose: 6d object pose estimation under hybrid representations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 431–440. Cited by: §2.2.
  • [50] M. Sundermeyer, M. Durner, E. Y. Puang, Z. Marton, N. Vaskevicius, K. O. Arras, and R. Triebel (2020) Multi-path learning for object pose estimation across domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13916–13925. Cited by: §2.2.
  • [51] M. Sundermeyer, Z. Marton, M. Durner, M. Brucker, and R. Triebel (2018) Implicit 3d orientation learning for 6d object detection from rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 699–715. Cited by: §2.2, §2.2.
  • [52] C. Wang, D. Xu, Y. Zhu, R. Martín-Martín, C. Lu, L. Fei-Fei, and S. Savarese (2019) Densefusion: 6d object pose estimation by iterative dense fusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3343–3352. Cited by: §2.2.
  • [53] G. Wang, F. Manhardt, F. Tombari, and X. Ji (2021) GDR-net: geometry-guided direct regression network for monocular 6d object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16611–16621. Cited by: Figure 2, §2.2, §3.3, §3.3, §3.3, §3.4, §4.4, Table 2.
  • [54] P. Wohlhart and V. Lepetit (2015) Learning descriptors for object recognition and 3d pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3109–3118. Cited by: §2.2.
  • [55] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox (2017) Posecnn: a convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199. Cited by: §2.2.
  • [56] Y. Yu, D. Zhu, and W. A. Smith (2017) Shape-from-polarisation: a nonlinear least squares approach. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 2969–2976. Cited by: §2.1.
  • [57] S. Zakharov, I. Shugurov, and S. Ilic (2019) Dpod: 6d pose object detector and refiner. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1941–1950. Cited by: §2.2.
  • [58] Y. Zhang, O. Morel, M. Blanchon, R. Seulin, M. Rastgoo, and D. Sidibé (2019) Exploration of deep learning-based multimodal fusion for semantic road scene segmentation. In VISIGRAPP (5: VISAPP), pp. 336–343. Cited by: §2.1.
  • [59] Z. Zhang (2000) A flexible new technique for camera calibration. IEEE Transactions on pattern analysis and machine intelligence 22 (11), pp. 1330–1334. Cited by: §4.1.
  • [60] Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li (2019) On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5745–5753. Cited by: §2.2, §3.3.
  • [61] D. Zhu and W. A. Smith (2019) Depth from a polarisation + rgb stereo pair. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7586–7595. Cited by: §2.1.