Estimating the 3D rotation of an object from 2D images is one of the fundamental problems in computer vision. Several applications relying on 3D rotation estimation have been developed such as a robot grasping an object[tremblay2018deep], a self driving vehicle constantly sensing its surrounding environment [Meng_2017], an augmented reality system combining computer-generated information onto the real world [marchand:hal-01246370], or a system detecting the face orientation to enhance human-computer interactions [Weidenbacher2006].
Advances of deep learning techniques have resulted in improvements in estimation of 3D orientation. However, precise orientation estimation remains an open problem. The main problem is that the space of all 3D rotations lies on a nonlinear and closed manifold, referred to as the special orthogonal group. This manifold has a different topology than unconstrained values in
, where neural network outputs exist. As a result it is hard to design a loss function which is continuous without disconnected local minima. For example using euler angles as an intermediate step causes problems due to the so-called gimbal lock. Quaternions have a double embedding giving rise to the existence of two disconnected local minimas. Some more complicated methods use Gram-Schmidt[zhou2019continuity]
which has a continuous inverse, but the function is not continuous with a discontinuity when the input vectors do not span.
Despite these issues various deep learning based solutions have been suggested. One approach is to use one of the rotation representations and model the constraint in the loss function or in the network architecture [mahendran20173d]. An alternative is to construct a mapping function, which directly converts the network output to a rotation matrix [zhou2019continuity].
Quantifying the 3D orientation uncertainty when dealing with noisy or otherwise difficult inputs is also an important task. Uncertainty estimation provides valuable information about the quality of the prediction during the process of decision making. Only recent efforts have been made on modeling the uncertainty of 3D rotation estimation [prokudin2018deep, gilitschenski2019deep]. However, those methods still rely on complex solutions to fulfil the constraints required by their parameterization.
In this paper, we instead propose a deep learning approach to estimate the 3D rotation uncertainty by using the matrix Fisher probability density function developed in the field of directional statistics[directional_statistics]. This unimodal distribution has been selected because of its relevant properties in regards to the problem of 3D orientation estimation: i) The parameterization is unconstrained so there is no need for complex functions to enforce constraints. ii) It is possible to create a loss for this distribution which has desirable properties such as convexity iii) The mode of the distribution can subsequently be estimated along with the uncertainty around that mode for further analysis.
Our method offers a simple solution to the problem of 3D orientation estimation, where a neural network learns to regress the parameters of the matrix Fisher distribution. While several other losses used for rotation estimation are discontinuous with multiple local minimas, with respect to the network output. In this paper we instead propose a loss which is convex with bounded gradient magnitudes, resulting in a stable training. In addition, we suggest a solution to efficiently compute the non-trivial normalizing constant of the distribution. Finally, the proposed method is evaluated on multiple problem domains and compared with the latest published approaches. The results show that our method outperforms all previous methods.
Our contributions include: 1) a method for estimating a probability distributions over the set of rotations with neural networks by using the matrix Fisher distribution, 2) a loss associated with this distribution and show it is convex with bounded gradients, and 3) an extensive analysis encompassing several datasets and recent orientation estimation works, where we demonstrate the superiority of our method over the state-of-the-art.
2 Related Work
3D rotation estimation has been studied over the last two decades. A common method estimates 3D rotation by aligning two sets of 3D feature points where each data set is matched and defined in a different coordinate system [eggert1997estimating]. Another well-known approach matches 2D keypoints extracted from images with features of a known 3D model and recovers the 3D pose given the 2D-3D correspondences [se2001vision, Lepetit09]
Recent methods using deep networks often predict 3D rotation directly from images without the knowledge of a 3D model of the object. Those methods can be grouped in two categories. The first one divides the set of rotations into regions and subsequently solves the 3D orientation estimation as a classification task. Subsequently, the classification network output is refined by a regression network [mahendran2018mixed, liao2019spherical].
The second category transforms the network output to a 3D rotation representation and learns to directly regress the 3D rotation given an image input. Commonly, quaternions [xiang2017posecnn] or Euler angles [mahendran20173d] representation are used. However, the paper [zhou2019continuity] shows any rotation representation of dimensions four or lower is discontinuous, which makes it difficult for the neural network to generalize over the set of rotations. They propose two continuous 5D and 6D rotation representations and construct a function that maps those representations to a rotation matrix.
Recently some studies have investigated the prediction of rotation uncertainty using probability distributions over rotations. In [prokudin2018deep] the parameters of a mixture of von Mises distribution using a biternion network are estimated. In [gilitschenski2019deep], the Bingham distribution over quaternions is used to jointly estimate a probability distribution over all rotation axes. However, their parameters have to be positive semidefinite due to their choice of probability distribution.
In this paper, we propose a solution which learns to regress the probability distribution with unconstrained parameters leading to a simple formulation of the problem of 3D rotation estimation.
We train a neural network to estimate the 3D pose of an input image. Specifically, the network outputs the parameters of the matrix Fisher distribution, which is a distribution over . From the predicted parameters we can obtain the maximum likelihood estimate of the input’s 3D pose. In the rest of this section we review the matrix Fisher distribution and provide some visualizations to help the reader’s interpretation of its parameters. Then we derive the loss, based on maximizing the likelihood of the labelled data, and finally explain how we deal with the distribution’s complex normalizing constant when we calculate our loss and calculate the gradient during back-propagation.
3.1 The matrix Fisher distribution on
We model 3D rotation matrices probabilistically with the matrix Fisher distribution [downs1972orientation, Khatri:rss:77]. This distribution has probability density function
where is an unconstrained matrix in parametrising the distribution, , and is the distribution’s normalizing constant. We will denote that is distributed according to a matrix Fisher distribution with . The distribution is unimodal but visualizing the distribution in equation (1) is hard as it has a 3D domain. Fortunately, [lee2018bayesian] describes a helpful visualization scheme, see figure 1 for details, which we use throughout the paper.
|(a)||(b)||(c)||(d) Compact Vis.|
Also not immediately apparent is how the shape of the distribution varies as varies. From [eggert1997estimating]
we know the mode of the distribution can be computed from the singular value decomposition of, where the singular values are sorted in descending order, and setting
The latter operation ensures is a proper rotation matrix. Similar results are available in [downs1972orientation]. Figure 2 displays examples of the distribution for simple matrices. These figures show that larger singular values correspond to more peaked distributions. To further help understanding of how the shape of the distribution relates to please consult section 3 of the supplementary material and [lee2018bayesian].
Finally, the deceptively simple form of the pdf in equation (1) hides the fact that its normalizing constant, , is rather complicated and non-trivial to compute accurately and efficiently. Explicitly the normalizing constant is
where is the generalized hypergeometric function of a matrix argument and
Since this matrix is not positive semidefinite we use the extension from [bingham1974antipodally] to define the function for all real matrices. Here are the singular values of and . Equation (3) can be derived using results from [jupp1979maximum] and a derivation appears in [lee2018bayesian].
3.2 A negative log-likelihood loss function
Assume we have a labelled training example where is the input and its ground truth 3D rotation matrix. To train a neural network that estimates for input , it is necessary to define a loss function measuring the compatibility between and . As the pdf in equation (1) has support in all of , we use the negative log-likelihood of given as the loss:
This loss has several interesting properties such as it is Lipschitz continuous, convex and has Lipschitz continuous gradients which makes it suitable for optimization. See supplementary material section 4 for proofs.
In practice the loss in equation 5 has an equilibrium far from the origin, in some experiments we believe this led to instability. To alleviate this problem we used a regularizing term which was 5% larger than what is analytically correct to move the equilibrium closer to the origin.
3.3 Efficiently estimating the normalizing constant
The normalizing constant is non-trivial to evaluate accurately and quickly. To allow computationally feasible training we fit a simple function to approximate where the input has the form of and also to its derivative w.r.t to the inputs . The latter approximating functions are used for back-prop training, while the former approximation is used to visualize the loss during training. We used software from [koev2006efficient] to compute the value and its numerical gradients for multiple values of . Via a process of manual inspection and curve fitting of simple functions we create an approximation of this function. The exact details are given in section 5 of the supplementary material.
Though perhaps our approach is far from optimal it does make training and testing computationally feasible and our results indicate that our approximation is sufficiently accurate to allow us to train a powerful network. Some preliminary experiments would also seem to indicate that the results are not so sensitive to the accuracy of the approximation, though having a more theoretically well-founded approach would be more satisfying such as adapting the approach in [gilitschenski2019deep] or approximating the relevant analytical derivatives calculated in [lee2018bayesian].
4 Experimental Details
We test our proposed approach on three separate datasets Pascal3D+, ModelNet10-
and UPNA head pose. We briefly describe these datasets, the pre-processing we applied to the images from each dataset before training and then the evaluation metrics.
4.1 Datasets & Pre-processing
has 12 rigid object classes and contains images from Pascal VOC and ImageNet of these classes. Each image is annotated with the object’s class, bounding box and 3D pose. The latter is found by having an annotator align a 3D CAD model to the object in the image.
We pre-process each image by applying a homography so that the transformed image appears to come from a camera with known intrinsics and a principal axis pointing towards the object. This approach is similar to [linden2018appearance] and [zhang2018revisiting]. More details are given in section 7 of the supplementary material. We perform data augmentation similar to the data augmentations introduced in [mahendran20173d], but adapted for our preprocessing. At test time we apply the same type of homography transformation as applied during training, but no data augmentation.
ModelNet10-[liao2019spherical] is a synthetic dataset. It is created by rendering rotated 3D models from ModelNet10 [wu20153d] with uniformly sampled rotations. The task is to estimate the applied rotation matrix.
We do not use any preprocessing for these images since the object is already centered and of a reasonable size. We do not use any data augmentation as the original paper did not use any and we want a fair comparison between the losses.
UPNA head pose [ariz2016novel] consists of videos with synchronized annotations of keypoints for the face in the image as well as its 3D rotation and position. The dataset has 10 people each with 11 recordings.
From the keypoint annotations we create a face bounding box for each image. After this we perform a small random perturbation of this bounding box to degrade the quality of the bounding box to be similar to what one would expect to get from a face detector. Using this artificial bounding box enables us to use the same data augmentation and preprocessing as we used for Pascal3D+. There is no official training/test split for this dataset. We create a split by using the last 2 people as the test set. We did not use a validation set since we did not apply a new hyperparameter search for the dataset. Our test split differs from[gilitschenski2019deep] as theirs is not publicly available.
4.2 Details of network & training
We run experiments with ResNet-101 as our backbone network. The ResNet-101 parameters are initialized from pre-trained ImageNet weights. The object’s class is encoded by an embedding layer that produces a 32-dimensional vector and which is appended to the ResNet’s activations obtained from the final average pooling layer. We apply 3 fully connected layers to this vector with nodes output at each layer.
We fine-tune the embedding and fully connected layer weights for 2 epochs. We use SGD and start with a learning rate of 0.01. We use a batch size of 32 and train for 120 epochs. For Pascal3D+ we reduce this learning rate by a factor 10 at epochs 30, 60 and 90. For ModelNet10-SO(3) we train for 50 epochs and reduce the learning rate by a factor of 10 at epochs 30, 40 and 45. For UPNA head pose we use the same hyperparameters as for Pascal3D+, except we do not use a class embedding since there are only faces in this dataset.
4.3 Evaluation metrics
The evaluation metrics used are based on the geodesic distance:
where and are the ground truth and estimated rotation respectively. This metric returns an angle error and we measure it in degrees. For a test set , containing tuples of input and its ground truth rotation , we summarize performance on with the median angle error and :
where is the indicator function, is the estimated rotation for input and is the cardinality of a set. To compute the overall performance on a dataset the median angle error and are first computed per class and then averaged across all classes. For the UPNA dataset we use the mean geodesic error angle instead of the median to allow more direct comparison with the results in [gilitschenski2019deep].
We have done all development and hyper-parameter optimization where the full training set was partitioned into a training and validation set. After hyper-parameter optimization, we have used the full training set for training and these are the numbers presented in the tables. For the Pascal3D+ dataset, we use the ImageNet validation split for the test set. Some samples of Pascal3D+ are labeled as “truncated”, “difficult” or “occluded”. We exclude these samples from our evaluations similar to other reported results [liao2019spherical]. This implementation detail had only a very slight effect on performance.
Table 5 compares the performance of our method and recent high performing approaches. Table 5 compares per class performance for some classes on on Pascal3D+ with the previous state-of-the-art method [mahendran2018mixed]. Our method significantly outperforms all the prior approaches. When the training set is augmented with the synthetic dataset from [su2015render], we further reduce the mean over medians angle error by approximately 1 degree.
The results reported in table LABEL:tab:modelnet_performance show that our method also achieves state-of-the-art performance on ModelNet10-SO(3). On the UPNA head pose dataset our algorithm gives a mean angle error of 4.5 degrees. This compares well to [gilitschenski2019deep] who quote a performance of 6.3 degrees. Unfortunately, since different test splits were used a direct comparison is hard. We use a test split where there is no overlap between individuals in the test and training set as compared to [gilitschenski2019deep] where the same individuals can appear in both the training and test sets (personal communication). Thus it is likely our test split is more challenging. Despite this we get a significantly lower average error.
|Method||MedErr||Acc@ (%)||Acc@ (%)||Use synth.|