1 Introduction
Robust, accurate and fast object pose estimation and tracking, i.e. estimation of the object’s 3D position and orientation, has been a matter of intense research for many years. Applications of this estimation problem can be found in Robotics, Autonomous Navigation, Augmented Reality, etc. Although the Computer Vision community has studied object pose estimation and tracking for decades, the recent spread of affordable and reliable RGBD sensors like Kinect, along with advances in Deep Learning (DL) and especially the use of CNNs as the new state-of-the-art (SoA) image feature extractors, led to a new era of research and a reexamination of several problems, with the general aim of generalizing over different tasks. CNNs have achieved groundbreaking results in 2D problems like object classification, object detection and segmentation. Thus, it has been tempting for the research community to increasingly use them in the more challenging 3D tasks.
The innate challenges of object pose estimation from RGBD streams include background clutter, occlusions (both static, from other objects present in the scene, and dynamic, due to possible interactions with a human user), illumination variation, sensor noise, image blurring (due to fast movement) and appearance changes as the object viewpoint alters. Moreover, one should account for the pose ambiguity that is a direct consequence of the object’s own geometry in the presence of symmetries, the challenges of properly parameterizing rotations, and the inevitable difficulties that any model faces when extracting information about the 3D scene geometry from 2D-projected images.
Previous works attempted to tackle the problem using DL, focusing on two different directions. The first family of approaches proposed in the literature processes each video frame separately, without any feedback from the estimation of the previous timeframe. In [Xiang_2017], Xiang et al. constructed a CNN architecture that estimates binary object masks and then predicts the object class and its translation and rotation separately, while in [Kehl_2017b] Kehl et al. extended the Single Shot Detection (SSD) framework [Liu_2016] for 2D object detection by performing discrete viewpoint classification for known objects. Finally, they refined their initial estimations via ICP [Segal_2009] iterations. In [Zakharov_2019] a CNN framework was proposed that uses RGB images for pixel-wise semantic segmentation of objects at the mask level. Following this, UV texture maps are estimated to extract dense correspondences between 2D images and 3D object models, minimizing cross-entropy losses. Those correspondences are used for pose estimation via PnP [Lepetit_2009]. This estimation is ultimately fed as a prior to a refinement CNN that outputs the final pose prediction. More recently, iPose [Jafari_2019] is one of the attempts whose philosophy is closest to ours. Its authors segment binary masks with a pretrained Mask R-CNN [He_2017] to remove background clutter and occluders, and they map 2D pixels to dense 3D object coordinates, which, in turn, are used as input to a PnP geometric optimization. Our attention modules have the same effect, but are computationally cheaper than Mask R-CNN, as they relax the requirement for hard segmentation. The second category under study is temporal tracking, where feedback is utilized, to allow for skipping steps without prior knowledge of the previous pose. Garon et al.
[Garon_2017, Garon_2018] formulated the tracking problem exclusively as a learning one, by generating two streams of synthetic RGBD frame pairs from independent viewpoints and regressing the pose using a CNN. Liao et al. initialized a similar CNN architecture using a FlowNet2 [Ilg_2017] backbone and fused its two streams by subtraction. In [Zhou_2018], training was done with an optical-flow-based regularization term which encouraged the production of multiple heterogeneous pose hypotheses that got bootstrapped in the final layer.
In this paper we extend the aforementioned approaches for object pose tracking, while building upon previous works [Garon_2017, Garon_2018], delivering as main contributions:

An explicit background clutter and occlusion handling mechanism that leverages spatial attention and provides an intuitive understanding of the tracker’s region of interest at each frame, while boosting its performance. To the best of our knowledge, this is the first strategy that explicitly handles these two challenges within a CNN-based architecture while achieving real-time performance. Supervision for this mechanism is extracted by fully exploiting the synthetic nature of our training data.

The use of a novel multi-task pose tracking loss function that respects the geometry of both the object’s 3D model and the pose space, and boosts the tracking performance by optimizing auxiliary tasks along with the principal one.

SoA realtime performance in the hardest scenario of the benchmark dataset [Garon_2018], while achieving lower translation and rotation errors by an average of for translation and for rotation.
Accordingly, we provide the necessary methodological design details and experimental results that justify the importance of the proposed method in the challenging object pose tracking problem.
2 Methodology
2.1 Problem Formulation
Our problem consists in estimating the object pose, which is usually described as a rigid 3D transformation w.r.t. a fixed coordinate frame, namely an element of the Special Euclidean Lie group in 3D, SE(3). It can be disentangled into two components: a rotation matrix R, which is an element of the Lie group SO(3), and a translation vector t ∈ ℝ³. However, Bregier et al. [Bregier_2017] proposed a broader definition for the object pose, which can be considered as a family of rigid transformations, accounting for the ambiguity caused by possible rotational symmetry, denoted as G ⊆ SO(3). We leverage this augmented mathematical definition to introduce a relaxation of the pose space definition:
(1) 
For example, as stated in [Bregier_2017], the description of the pose of an object with spherical symmetry requires just 3 numbers, the components of its translation, as G can be any element of SO(3) with the imprinted shape of the object remaining the same. Obviously, for asymmetrical objects, G contains only the identity rotation, G = {I}.
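The effect of the symmetry group on the pose definition can be illustrated numerically. The following is a minimal sketch, not the authors' code: the helper names (`rot_z`, `make_pose`, `transform`, `same_point_set`) and the toy object, a ring of 8 points with 8-fold symmetry about its z-axis, are assumptions introduced only for illustration. It shows that right-multiplying R by an element of G yields a different matrix but the same imprinted shape.

```python
import numpy as np


def rot_z(theta):
    """Rotation by `theta` radians about the object-centric z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def make_pose(R, t):
    """Pack a rotation R (3x3) and a translation t (3,) into a 4x4 element of SE(3)."""
    P = np.eye(4)
    P[:3, :3] = R
    P[:3, 3] = t
    return P


def transform(P, points):
    """Apply the rigid transform P to an (N, 3) array of points."""
    return points @ P[:3, :3].T + P[:3, 3]


def same_point_set(a, b, tol=1e-9):
    """True if every point of `a` has a match in `b` and vice versa (order-independent)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return bool(np.all(d.min(axis=1) < tol) and np.all(d.min(axis=0) < tol))


# Hypothetical object: 8 points on a circle around z, invariant under
# rotations by multiples of 45 degrees about that axis.
angles = np.arange(8) * (np.pi / 4)
ring = np.stack([np.cos(angles), np.sin(angles), np.zeros(8)], axis=1)

R = rot_z(0.3)
t = np.array([0.1, -0.2, 0.8])
G = rot_z(np.pi / 4)  # one element of the toy object's symmetry group

posed = transform(make_pose(R, t), ring)
posed_sym = transform(make_pose(R @ G, t), ring)

# The two poses differ as matrices, yet place the object on the same point set.
print(np.allclose(posed, posed_sym), same_point_set(posed, posed_sym))
```

For an asymmetrical object the only admissible G is the identity, so the family collapses to the single transformation (R, t).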
2.2 Architecture Description
The proposed architecture is depicted in Fig.2. Our CNN inputs two RGBD frames of size : I(t),(t) (with I(t) being the “Observed” and (t) the “Predicted” one) and regresses an output pose representation , with 3 parameters for translation () and 6 for rotation. The first two layers of the “Observed” stream are initialized with the weights of a ResNet18 [He_2016], pretrained on ImageNet [Deng_2009], to narrow down the real-synthetic domain adaptation gap, as proposed in [Hinterstoisser_2018]. Since ImageNet contains only RGB images, we initialize the weights of the Depth input modality with the average of the weights corresponding to each of the three RGB channels. Contrary to [Hinterstoisser_2018], we find it beneficial not to freeze those two layers during training. The reason is that we aim to track the pose of the single object we train on and not to generalize to unseen ones, so overfitting to that object’s features helps the tracker focus only on distinguishing the pose change. To the output of the second “Observed” layer, we apply spatial attention for foreground extraction and occlusion handling, and we add the corresponding output feature maps to that of the second layer, along with a residual connection [He_2015] from the first layer. As a next step, we fuse the two streams by concatenating their feature maps and pass this concatenated output through three sequential Fire modules [Iandola_2016], all connected with residual connections [He_2016].
Background and Occlusion Handling: After our first “Observed” Fire layer, our model generates an attention weight map for each of the two tasks, occlusion handling and foreground extraction, by using a dedicated convolutional layer followed by a convolution that squeezes the feature map channels to a weight map (normalized by softmax). Our goal is to distil the soft foreground and occlusion segmentation masks from the hard binary ground-truth ones (that we keep from augmenting the object-centric image with random backgrounds and occluders) in order to have their estimations available during the tracker’s inference. To this end, we add the two corresponding binary cross-entropy losses to our overall loss function. We justify our design choice of using two attention modules by experimentation: we found that assigning a clear target to each of the two modules is more beneficial than relying on a single attention layer to resolve both challenges (see Sect.3.3.1).
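The parallel attention mechanism described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the kernel sizes are not recoverable from the text, so both convolutions are taken to be 1x1, the channel count and feature-map size are arbitrary, the weights are random, and the function names (`conv1x1`, `spatial_softmax`, `attention_branch`) are hypothetical. The binary cross-entropy supervision of the maps is omitted.

```python
import numpy as np


def conv1x1(x, w):
    """1x1 convolution: x has shape (C_in, H, W), w has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)


def elu(x):
    """ELU activation, computed without overflowing on large positive inputs."""
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))


def spatial_softmax(logits):
    """Softmax over all H*W positions of a single-channel (H, W) map."""
    z = logits - logits.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def attention_branch(features, w_feat, w_squeeze):
    """One attention module: task-specific conv, squeeze to 1 channel, softmax."""
    hidden = elu(conv1x1(features, w_feat))
    logits = conv1x1(hidden, w_squeeze)[0]  # (H, W)
    weights = spatial_softmax(logits)
    # Broadcast the weight map over channels to re-weight the input features.
    return features * weights[None, :, :], weights


rng = np.random.default_rng(0)
C, H, W = 4, 6, 6  # assumed toy dimensions
features = rng.normal(size=(C, H, W))

fg_out, fg_map = attention_branch(
    features, rng.normal(size=(C, C)), rng.normal(size=(1, C)))
occ_out, occ_map = attention_branch(
    features, rng.normal(size=(C, C)), rng.normal(size=(1, C)))

# Parallel fusion: both attended maps are added to the original feature map.
fused = features + fg_out + occ_out
```

Keeping the two branches separate lets each weight map be supervised by its own mask, which is the design choice examined in the ablation study.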
Overall Loss and rotation representation: From a mathematical standpoint, direct regression of pose parameters [Garon_2018] with a Euclidean loss is suboptimal: while the translation component belongs to the Euclidean space, the rotation component lies on the nonlinear manifold SO(3). Thus, it is natural to model the rotation loss using a Geodesic metric [Huynh_2009, Hartley_2013] on SO(3), i.e. the length, in radians, of the minimal path that connects two of its elements: (see Eq.2.2). In order to minimize the rotation errors due to ambiguities caused by the parameterization choice, we employ the 6D continuous rotation representation that was introduced in [Zhou_2019]: , where . Given , the matrix is obtained by:
(2) 
where , is the normalization function. Furthermore, as has already been discussed in [Bregier_2017], each 3D rotation angle has a different visual imprint regarding each rotation axis. So, we multiply both rotation matrices with a diagonal Inertial Tensor , calculated on the object model’s weighted surface and with respect to its center of mass, in order to assign a different weight to each rotational component. We note here that since we want that matrix product to still lie in SO(3), we perform a Gram-Schmidt orthonormalization on the Inertial Tensor before right-multiplying it with each rotation matrix. Finally, we weigh the translation and rotation losses using a pair of learnable weights that are trained along with the rest of the network’s parameters using a gradient-descent-based optimization method, as proposed by [Kendall_2017].
Symmetric Object Handling: In the special case of symmetric objects, we disentangle the ambiguities introduced by this property from the core of the rotation estimation. We regress a separate Euler angle triplet of symmetry-based parameters that is converted to a rotation matrix , which gets right-multiplied with before being weighted by the parameters of . We used a cylindrical cookie-jar model for the symmetric object case, the shape of which has only one axis of symmetry. Consequently, we estimate a single symmetry parameter, that of the object-centric z-axis. Before the conversion, that parameter is passed through a tanh function and multiplied by to constrain its values.
As a result, our overall tracking loss function is formulated as:
(3) 
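Two geometric ingredients of the tracking loss, the 6D-to-matrix conversion and the Geodesic distance on SO(3), can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the conversion uses the Gram-Schmidt-style construction commonly associated with [Zhou_2019], which may order its steps differently from Eq. (2), and the function names are assumptions. The Inertial Tensor weighting, the symmetry matrix and the learnable loss weights are not included.

```python
import numpy as np


def normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def rotation_from_6d(r6):
    """Map a 6D vector (two stacked 3D vectors) to a rotation matrix."""
    a1, a2 = r6[:3], r6[3:]
    b1 = normalize(a1)
    b2 = normalize(a2 - np.dot(b1, a2) * b1)  # remove the component along b1
    b3 = np.cross(b1, b2)                     # completes a right-handed basis
    return np.stack([b1, b2, b3], axis=1)     # columns are b1, b2, b3


def geodesic_distance(R1, R2):
    """Angle, in radians, of the relative rotation between R1 and R2."""
    cos_angle = (np.trace(R1.T @ R2) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from round-off.
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))


R_identity = rotation_from_6d(np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0]))
R_quarter = rotation_from_6d(np.array([0.0, 1.0, 0.0, -1.0, 0.0, 0.0]))

# R_quarter is a 90-degree rotation about z, so the distance is pi / 2.
print(geodesic_distance(R_identity, R_quarter))
```

Because any pair of non-parallel 3D vectors maps to a valid rotation, the network output needs no explicit orthogonality constraint.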
Using a similar external multitask learnable weighting scheme () as in (2.2), we combine our primary learning task, the pose tracking, with the two auxiliary ones: clutter and occlusion handling:
(4) 
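The learnable weighting of several task losses can be illustrated with a small sketch. It assumes the widely used formulation in which each task loss L_i is scaled by exp(-s_i) and regularized by s_i, with s_i a learnable scalar; the exact form of the equation above is not recoverable from the text, and the three loss values below are hypothetical placeholders for the tracking, clutter and occlusion terms.

```python
import math


def weighted_multitask_loss(losses, log_vars):
    """Combine task losses as sum_i exp(-s_i) * L_i + s_i."""
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))


def grad_log_vars(losses, log_vars):
    """Gradient of the combined loss w.r.t. each learnable weight s_i."""
    return [1.0 - math.exp(-s) * loss for loss, s in zip(losses, log_vars)]


# Hypothetical, fixed task losses: tracking, clutter handling, occlusion handling.
losses = [0.5, 2.0, 0.1]
log_vars = [0.0, 0.0, 0.0]

# Plain gradient descent on the weights only, to show where they settle.
for _ in range(2000):
    grads = grad_log_vars(losses, log_vars)
    log_vars = [s - 0.05 * g for s, g in zip(log_vars, grads)]

# With fixed losses, each s_i converges to log(L_i), so larger losses
# end up with smaller effective weights exp(-s_i).
print([round(s, 3) for s in log_vars])
```

In actual training the task losses change at every step, so the weights keep adapting instead of settling at fixed values.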
2.3 Data Generation and Augmentation
Following [Garon_2017], for our network (Fig.2), we generate two synthetic RGBD pairs I(t), and we modify the augmentation procedure of [Garon_2017, Garon_2018] as follows: Firstly, we blend the object image with a background image, sampled from a subset of the SUN3D dataset [Xiao_2013]. We also mimic the procedure of [Garon_2017, Garon_2018] in rendering a 3D hand model occluder on the object frame with probability . A twist we added is preparing our network for cases of occlusion, by completely covering the object with the occluder for of the occluded subset. Note that both the foreground and unoccluded object binary masks are kept during both of these augmentation procedures. Hence, we can use them as ground-truth segmentation signals for clutter extraction and occlusion handling in our auxiliary losses to supervise the corresponding spatial attention maps. We add to the “Observed” frame pair I(t): (i) Gaussian RGB noise, (ii) HSV noise, (iii) blurring (to simulate rapid object movement), (iv) depth downsampling and (v) probabilistic dropout of one of the modalities, all with the same parameters as in [Garon_2018]. With a probability of , we change the image contrast, using parameters , (where U() is a uniform distribution) and gamma correction with probability, to help generalize over cases of illumination differences between rendered and sensor-generated images. Instead of modelling the noise added to the “Observed” Depth modality with an ad-hoc Gaussian distribution as in [Garon_2018], we consider the specific properties of Kinect noise [Nguyen_2013] and model it with a 3D Gaussian noise (depending on depth and the ground-truth object pose), used for simulating the reality gap between synthetic and real images. The rest of the preprocessing follows [Garon_2017].
3 Evaluation and Results
3.1 Implementation Details
We use ELU activation functions, a minibatch size of 128, Dropout with probability 0.3, Adam optimizer with corrected weight decay [Loshchilov_2017] by a factor , learning rate and a scheduler with warm restarts [Loshchilov_2017] every 10 epochs. All network weights (except those transferred from ResNet18 [He_2016]) are initialized via a uniform K. He [He_2015] scheme. Since the Geodesic distance suffers from multiple local minima, following [Mahendran_2017], we first warm up the weights, aiming to minimize the LogCosh loss function for 25 epochs. Then, we train until convergence, minimizing the loss (2.2). The average training time is 12 hours on a single GeForce 1080 Ti GPU.
3.2 Dataset and Metrics
We test our approach on the “hard interaction scenario” of [Garon_2018], which is considered the most difficult. It comprises free 3D object motion, along with arbitrary occlusions by the user’s hand. Our assumption is that if our proposed method performs better in the most challenging scenario, it will behave at least equally well in every other scenario. As in [Garon_2018], we initialize our tracker every 15 frames, and use the same evaluation metrics. Due to limited computational resources, we produced only 20,000 samples, whose variability covers the pose space sufficiently, both for the ablation study and the final experimentation.
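The periodic re-initialization protocol can be sketched with a toy example. Everything here is hypothetical: poses are reduced to scalars, the tracker is a stand-in that adds a small constant drift per frame, and re-initialization is assumed to use the ground-truth pose, which the text does not state explicitly. The function names `run_tracker` and `drifting_step` are illustrative only.

```python
def run_tracker(gt_poses, step_fn, reset_every=15):
    """Track a sequence, re-initializing the estimate every `reset_every` frames."""
    estimates = []
    current = None
    for i, gt in enumerate(gt_poses):
        if i % reset_every == 0:
            current = gt  # assumed: re-initialization from the ground truth
        else:
            current = step_fn(current, i)  # tracker update from previous estimate
        estimates.append(current)
    return estimates


def drifting_step(previous_estimate, frame_index):
    """Toy tracker: true motion of 0.1 per frame plus a drift of 0.01."""
    return previous_estimate + 0.1 + 0.01


gt = [0.1 * i for i in range(45)]
estimates = run_tracker(gt, drifting_step)
errors = [abs(e - g) for e, g in zip(estimates, gt)]

# The error grows between resets and drops back to zero at frames 0, 15, 30.
print(max(errors), errors[15])
```

The reset interval bounds how far accumulated drift can grow, so reported errors reflect short-term tracking accuracy over each window.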
3.3 Ablation Study
3.3.1 Hierarchy choices for the attention modules
Here, we justify the need for both attention modules of our architecture (Fig. 2). We build upon the network proposed by [Garon_2018], and we first introduce a single convolutional attention map just for occlusion handling. Then, we explore the possibility of a separate attentional weighting of the “Observed” feature map for foreground extraction, prior to the occlusion one, and, finally, we leverage both in parallel and add their resulting maps together.
The comparison of Table 2 establishes not only the need for both attentional modules in our design, but also that parallel modules are optimal. We can observe the effect of the parallel connection in Fig.1, as both attentions present sharper peaks. We can also observe a visual tradeoff between the parallel attentions: while the object is not occluded (either in steady state or when moving), the module responsible for foreground extraction is highlighted more intensely than the occlusion one. As the object gets more and more covered by the user’s hand, the focus gradually shifts to the module responsible for occlusion handling. Note that this is not an ability we explicitly train our network to obtain, but rather a side effect of our approach, which fits our intuitive understanding of cognitive visual tracking.
3.3.2 Contributions of the rotation Loss components
We demonstrate the value of every component included in our rotation loss (leaving symmetries temporarily out of study), by: (i) regressing only the rotational parameters with the baseline architecture of [Garon_2018], (ii) replacing the MSE loss with the Geodesic one, (iii) replacing the rotation parameterization of [Garon_2018] with the continuous one of (2), and (iv) including the Inertial Tensor weighting of each rotational component.
Table 2 indicates the value that translation estimation brings to rotation estimation, as when the former’s regression is excluded, the latter’s performance decreases. Moreover, Table 2 justifies our progressive design selections in formulating our rotation loss, as with the addition of each ambiguity modelling step, the 3D rotation error metric decreases, starting from and reaching .
3.4 Experimental Results
Following our ablation study, we merge our parallel attention modules with the Geodesic rotation loss of (2.2), along with the remaining elements of Sect.2. We evaluate our method on two objects of [Garon_2018]: one asymmetrical with rich texture and complex shape (dragon) and one symmetrical with poor texture and simple 3D shape (cookiejar).
3.4.1 The Dragon model: the asymmetric case
Our approach reduces mean errors by about for translation and 57.7 for rotation w.r.t. baseline [Garon_2018]. When the object is not occluded, the tracker focuses mostly on its 3D center, implicitly realizing in this way that this is the main 3D point of tracking interest. When the user’s hand occludes parts of the dragon, the attention shifts to its body parts of interest that stand out of the grip, like its neck, wings or tail (Fig.1). The effectiveness of our method is demonstrated by the fact that, while [Garon_2018] keeps track only of the object’s 3D position under extreme occlusions, our improved version extends this property to 3D rotations as well. Although more computationally intensive, our CNN (40 frames/sec.) still runs within the real-time performance bounds set by [Garon_2018].
3.4.2 The CookieJar model: the symmetric case
For the special case of rotoreflective symmetry, we also report our results (Table 3) without/with accounting for the object’s symmetry axis in formulating our rotation loss.
We improve the approach of Garon et al. [Garon_2018] by and in translation and rotation, respectively, if we do not take the symmetry degree of freedom into account in the loss. When we disentangle the rotation estimation and symmetries, we try two different configurations: (i) learning one symmetry parameter over all possible pose changes in the training set and (ii) regressing a different one per pose pair. It is evident that in the first case the symmetry parameter does not improve the tracker’s performance, while in the second one it reduces both metrics to mm/ . This occurs because, in the second case, the minimization of the tracking loss w.r.t. the symmetry matrix (see [Bregier_2017]) manages to fully exploit the extra degree of rotational freedom, by relaxing the global-solution constraint of the first one and allowing one solution per pose pair. Finally, we improve [Garon_2018] by and for translation and rotation, respectively. Here, the differences between our method and the baseline are smaller than the ones of the asymmetric case. The attentions’ effect is less prominent here since the cookie-jar model is of simpler, symmetric shape and poorer texture. This replaces the distinctive clues of the dragon case (e.g. tails and wings standing out) with ambiguities, depriving the corresponding modules of the ability to easily identify the pose.
4 Conclusion
In this work, we propose a CNN for fast and accurate single object pose tracking. We adopt an explicitly modular design for clutter and occlusion handling, and we account for the geometrical properties of both the pose space and the object model during training. As a result, we reduce both SoA pose errors by an average of for translation and 57.5 for rotation and gain an intuitive understanding of our artificial tracking mechanism. In the future, we aim to extend this work to the object-agnostic case and model the temporal continuity of motion.