Satellite Pose Estimation Challenge: Dataset, Competition Design and Results

11/05/2019 ∙ by Mate Kisantal, et al. ∙ 7

Reliable pose estimation of uncooperative satellites is a key technology for enabling future on-orbit servicing and debris removal missions. The Kelvins Satellite Pose Estimation Challenge aims at evaluating and comparing monocular vision-based approaches and pushing the state-of-the-art on this problem. This work is based on the Satellite Pose Estimation Dataset, the first publicly available machine learning set of synthetic and real spacecraft imagery. The choice of dataset reflects one of the unique challenges associated with spaceborne computer vision tasks, namely the lack of spaceborne images to train and validate the developed algorithms. This work briefly reviews the basic properties and the collection process of the dataset which was made publicly available. The competition design, including the definition of performance metrics and the adopted testbed, is also discussed. Furthermore, the submissions of the 48 participants are analyzed to compare the performance of their approaches and uncover what factors make the satellite pose estimation problem especially challenging.




I Introduction

In recent years, mission concepts such as debris removal and on-orbit servicing have gained increasing attention from academia and industry in order to address the congestion in Earth orbits and extend the lifetime of geostationary satellites. These include the RemoveDEBRIS mission by Surrey Space Centre [removedebris], the Phoenix program by DARPA [phoenix_darpa], the Restore-L mission by NASA [restore_L], and the on-orbit servicing programs proposed by Infinite Orbits, Effective Space, and many other startup companies. A key to performing these tasks is the availability of the target spacecraft’s position and attitude relative to the servicer spacecraft (i.e., pose). However, the targets of interest, including defunct satellites and debris pieces, are noncooperative and thus incapable of providing the servicer with information on their state. Moreover, the servicer cannot rely on the availability of known fiducial markers on these targets. Overall, the servicer must be able to estimate and predict the target’s relative pose on-board without a human in the loop. It is especially attractive to perform pose estimation using a vision-based sensor such as a camera due to its small mass and power requirements compared to active sensors such as lidars or radars. Moreover, monocular cameras are favored over stereo systems due to their relative simplicity and the fact that spacecraft, especially emerging small spacecraft such as CubeSats, do not allow for a large enough baseline to make stereovision effective. In order to enable autonomous pose estimation, the servicer must then harness fast and robust computer vision algorithms to compute the relative position and attitude of the target from a single monocular image or a set of them.

Starting with the success of AlexNet [Krizhevsky2012] in the ILSVRC challenge [ilsvrc] in 2012, deep learning models have been outperforming traditional approaches on a number of computer vision problems. However, deep learning relies on large annotated datasets. While there is a plethora of large-scale datasets for various terrestrial applications of computer vision and pose estimation that allow training state-of-the-art machine learning models, there is a lack of such datasets for spacecraft pose estimation. The main reason is the difficulty of acquiring thousands of spaceborne images of the desired target spacecraft with accurately annotated pose labels. Moreover, a lack of common datasets makes it impossible to systematically evaluate and compare the performance of different pose estimation algorithms. In order to address these difficulties, the Satellite Pose Estimation Challenge (SPEC) was organized by the Space Rendezvous Laboratory (SLAB) at Stanford University and the Advanced Concepts Team (ACT) of the European Space Agency (ESA). The challenge was hosted on the ACT’s Kelvins competition website, a platform hosting a number of space-related competitions. The primary aim of the SPEC was to provide a common benchmark for satellite pose estimation algorithms, identify the state-of-the-art, and show where further improvements can be made. Furthermore, such dedicated challenges have the potential to raise awareness of the problems of satellite pose estimation in the wider scientific community, bringing new ideas and researchers to this field.

The dataset for the SPEC, named Spacecraft Pose Estimation Dataset (SPEED), mostly consists of synthetic images and the submissions were solely ranked by their accuracy as evaluated on these images. The dataset also includes a smaller amount of real images which were collected using a realistic satellite mockup and the Testbed for Rendezvous and Optical Navigation (TRON) facility of SLAB. Even though the domain adaptation was not the main focus of the competition, evaluating the submissions on these images provides an indication of the generalization capability of the proposed algorithms.

The main contribution of this work is the analysis of the SPEC results. On the one hand, samples of the dataset are ranked based on performance of the submitted algorithms to uncover which factors contribute to the difficulty of the pose estimation task the most. Target distance and background were found to be the main challenges. On the other hand, an analysis of the submissions and comparison of the efficacy of different approaches are presented based on a survey conducted among the participants. Perspective-n-Point (PnP) solver-based approaches were found to be significantly more accurate compared to direct pose estimation approaches. Including a separate detection step was also found to be an important element of high performing pose estimation pipelines. It allows cropping the relevant part of the images and zooming on the satellite, which brings significant benefits in terms of orientation accuracy.

After a review of the related pose estimation research in Section II, Section III discusses the creation of the dataset, and Section IV briefly discusses the competition design considerations. This is followed by an in-depth analysis of the final submissions in Section V. Finally, the recommendations for further improvements are given in Section VI.

II Related Work

The classical approach to monocular-based pose estimation of a target spacecraft [Cropp2002PoseEO, Leinz2008_OrbitalExpress, Zhang2005_pose, Petit2011_CaseStudy, grompone2015_phdthesis, damico_benn_jorgensen_2014, kanani2012] would first extract hand-crafted features of the target from a 2D image. These features include Harris corners [Harris88acombined], Canny edges [Canny:1986:CAE:11274.11275], lines via Hough transform [ballard_1981], or scale-invariant features such as SIFT [Lowe2004], SURF [Bay:2008:SRF:1370312.1370556], and ORB features [Rublee:2011:OEA:2355573.2356268]. Upon successful extraction of said features, iterative algorithms are required to predict the best pose solution that minimizes a certain error criterion in the presence of outliers and unknown feature correspondences. The process is crucial in providing a good initial pose estimate to the on-board vision-based navigation system [sharma_damico_2017, KimJunkins2007_KFPose]. Earlier works on initial pose estimation tended to rely on coarse a priori knowledge of the target’s pose [Cropp2002PoseEO, Zhang2005_pose, Petit2011_CaseStudy] or assumed the availability of active fiducial markers or sensors on the target [Leinz2008_OrbitalExpress]. Without making any such assumptions, D’Amico et al. [damico_benn_jorgensen_2014] were among the first to publish pose estimation results using the Hough transform and the Canny edge detector on spaceborne images captured during the rendezvous phase of the PRISMA mission [damico_benn_jorgensen_2014, PRISMA_chapter]. By grouping edge features into a geometrically meaningful shape, they were able to reduce the size of the feature correspondence search space. The work was followed by Sharma et al. [Sharma2018_RobustPoseInitial], who additionally introduced the Weak Gradient Elimination (WGE) technique to separate the spacecraft’s edge features from the weak edge features of the background. While the proposed architecture showed improved performance on the spaceborne images from the PRISMA mission, the method was affected by low availability of high confidence solutions.

On the other hand, recent years have seen a significant breakthrough in computer vision with the advent of Deep Neural Networks (DNN). It was made possible by increasing computational resources, represented by Graphics Processing Units (GPU), and the availability of large-scale datasets to train the DNN, such as ImageNet for classification [Krizhevsky2012], MS COCO for object detection [Lin2014COCO], and LINEMOD for pose estimation [Hinterstoisser2013_LINEMOD] of ordinary household objects. While various DNN-based approaches have been proposed to perform pose estimation [Tulsiani2015, Su2015_RenderForCNN, Kendall2015_PoseNet, Rad2017_BB8, Kehl2017_ssd6d, Sundermeyer_2018_ECCV, Mahendran2017, Xiang2018_PoseCNN, Tekin2018, Zhao2018_KPD, Peng2019_PVNet], current state-of-the-art methods employ Convolutional Neural Networks (CNN) that either directly predict the 6D pose or predict intermediate information that can be used to compute the 6D pose, notably a set of keypoints defined a priori. For example, PoseCNN [Xiang2018_PoseCNN] directly regresses a 3D translation vector and a unit quaternion representing the relative attitude of the target, whereas SPN [Sharma2019, sharmaThesis2019] poses attitude prediction as a classification problem by discretizing the viewpoint space into a finite number of bins. Most recently, architectures like KPD [Zhao2018_KPD] and PVNet [Peng2019_PVNet] have been proposed to predict the locations of the 2D keypoints on the target’s surface. Given the corresponding 3D coordinates of the keypoints from available models, one can solve the PnP problem [Lepetit2008] to compute the relative position and attitude. It is worth noting that terrestrial applications of object pose estimation are not typically subject to navigation and computation requirements as strict as those of satellite on-orbit servicing.

III Dataset

This section provides a high-level description of SPEED, which comprises the training and test images of this challenge. SPEED represents the first publicly available machine learning data set for spacecraft pose estimation. The images of the Tango spacecraft from the PRISMA mission [damico_benn_jorgensen_2014, PRISMA_chapter] are generated from two different sources, referred to as synthetic and real images in the following. Both types of images are created using the same camera model. Specifically, the real images are captured using the Point Grey Grasshopper 3 camera with a Xenoplan 1.9/17 mm lens, while the synthetic images are created using the same camera properties. The ground-truth pose labels, consisting of the translation vector and a unit quaternion describing the relative orientation of the Tango spacecraft with respect to the camera, are released along with the associated training images. Readers are encouraged to consult [Sharma2019] and [sharmaThesis2019] for more details on the dataset.

Fig. 1: Examples of synthetic training images from SPEED.
Fig. 2: Cropped versions of (a) the flight imagery captured during the PRISMA mission [PRISMA_chapter], (b) synthetic imagery in Beierle and D’Amico [Beierle2019], (c) SPEED synthetic imagery, and (d) histogram comparison of image pixel intensities of the three images. They are cropped from the downscaled 224 × 224 images.

III-A Creation of the synthetic dataset

The synthetic images of the Tango spacecraft are created using the camera emulator software of the Optical Stimulator [Sharma2018_CNN, Beierle2019]. The software uses an OpenGL-based image rendering pipeline to generate photo-realistic images of the Tango spacecraft with desired ground-truth poses (examples are shown in Fig. 1). Random Earth images captured by the Himawari-8 geostationary weather satellite are inserted into the background of half of the synthetic images. For these images, the illumination conditions are created to best match those of the background Earth images. Finally, Gaussian blurring and noise are applied to all images.
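The final post-processing step mentioned above can be illustrated with a minimal sketch in plain Python. It applies a separable Gaussian blur followed by additive zero-mean Gaussian noise to a grayscale image stored as nested lists with intensities in [0, 1]. The function names, the blur and noise standard deviations, and the border handling are assumptions chosen for illustration; the actual parameters of the SPEED rendering pipeline are not stated here.

```python
import math
import random


def gaussian_kernel(sigma, radius):
    """Return a normalized 1D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    blurred = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, weight in enumerate(kernel):
                xi = min(max(x + k - radius, 0), width - 1)
                acc += weight * row[xi]
            new_row.append(acc)
        blurred.append(new_row)
    return blurred


def transpose(image):
    """Swap rows and columns of a nested-list image."""
    return [list(column) for column in zip(*image)]


def blur_and_add_noise(image, sigma_blur=1.0, sigma_noise=0.02, seed=0):
    """Blur along rows and columns, then add clipped Gaussian noise.

    sigma_blur and sigma_noise are hypothetical values, not SPEED's.
    """
    rng = random.Random(seed)
    kernel = gaussian_kernel(sigma_blur, radius=max(1, int(3 * sigma_blur)))
    blurred = transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))
    return [[min(1.0, max(0.0, value + rng.gauss(0.0, sigma_noise)))
             for value in row] for row in blurred]
```

Because the kernel is normalized and border pixels are clamped, a uniform image is left unchanged by the blur, which is a convenient sanity check for the sketch.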

From Fig. 2 it is clear that the synthetic imagery of SPEED can closely emulate the illumination conditions captured from the actual flight imagery, indicated by the overlapping histogram curves of the image pixel intensities of both imageries. This demonstrates significant improvement of SPEED’s image rendering pipeline over the previous work by Beierle and D’Amico [Beierle2019] and its capability of generating photorealistic images of any desired spacecraft with specified pose labels.

III-B Collecting real images with TRON

The real images of the Tango spacecraft are captured using the TRON facility of SLAB [sharmaThesis2019, Beierle2019] as shown in Fig. 3. At the time of image generation, the facility consisted of a 1:1 mockup model of the Tango spacecraft and a ceiling-mounted seven degrees-of-freedom robotic arm, which holds the camera at its end-effector. The facility also includes custom Light-Emitting Diode (LED) wall panels which can simulate the diffused illumination conditions due to Earth albedo and a xenon short-arc lamp to simulate collimated sunlight in various orbit regimes. The ground-truth pose labels for the real images are acquired using ten Vicon cameras [vicon_vero] that track infrared (IR) markers on the satellite mockup and the test camera. Careful calibration processes outlined in [sharmaThesis2019] are performed to remove any biases in the estimated target and camera reference frames. Overall, the independent pose measurement of the calibrated Vicon system provides the pose labels with degree-level and centimeter-level accuracy [sharmaThesis2019]. Work is under way to improve the accuracy of the ground-truth pose by one order of magnitude by fusing Vicon camera and robot measurements concurrently.

Fig. 3: Left: TRON facility at SLAB. Right: Two examples of real training images from SPEED.

Fig. 4 provides a qualitative comparison of synthetic and real images of SPEED. Note that while both images share identical ground-truth poses and general direction of Earth albedo, one can readily observe a number of discrepancies in the image properties, such as the spacecraft’s texture, illumination and eclipse of certain spacecraft features.

Fig. 4: Left: SPEED synthetic imagery. Right: SPEED real imagery.

III-C Basic Dataset Properties

The released dataset contains synthetic and real images and is partitioned into the training and test sets according to Table I. Note that while the synthetic images are partitioned in an 8:2 ratio, only five real images are provided with labels for training. This represents a common situation in spaceborne applications in which the images of an orbiting satellite are scarce and difficult to obtain. All images are high-resolution grayscale images.

               Synthetic   Real
Training set   12000       5
Test set       2998        300
TABLE I: Number of Images in Different Partitions of the Dataset
Fig. 5: Definition of the spacecraft body reference frame, the camera reference frame, the relative position, and the relative orientation.

Fig. 5 graphically describes the spacecraft body and camera reference frames to visualize the position and orientation distributions of the dataset. Specifically, one axis of the camera reference frame is aligned with the camera boresight, while one axis of the Tango’s body reference frame is perpendicular to the solar panel. The remaining two axes of each frame then span the plane perpendicular to that axis, as shown in Fig. 5.

Fig. 6 shows the range of relative position distributions in the dataset in the camera frame. The distance of the satellite in the synthetic images is between 3 and 40.5 meters. Due to physical limitations of the TRON facility in combination with the size of the satellite mockup, the distance distribution of real images is much more constrained, ranging from 2.8 to 4.7 meters.
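To give a feeling for what this distance range means in the image, the following sketch estimates the apparent size of the target in pixels with a simple pinhole model. The 17 mm focal length is taken as the nominal value of the lens named earlier; the pixel pitch and the characteristic target size are assumptions chosen for illustration only and should be replaced with the actual camera and spacecraft parameters.

```python
def apparent_size_px(target_size_m, distance_m,
                     focal_length_m=0.017, pixel_pitch_m=5.86e-6):
    """Approximate extent of the target on the sensor, in pixels.

    Pinhole model: size_px = f * size / (distance * pixel_pitch).
    pixel_pitch_m is an assumed value, not taken from the text.
    """
    return focal_length_m * target_size_m / (distance_m * pixel_pitch_m)


# Hypothetical characteristic target size of 0.8 m, evaluated at the
# closest and farthest synthetic distances quoted above.
near_px = apparent_size_px(0.8, 3.0)
far_px = apparent_size_px(0.8, 40.5)
ratio = near_px / far_px  # equals 40.5 / 3 regardless of the assumptions
```

Whatever the exact camera parameters, the apparent size scales inversely with distance, so the target covers 13.5 times fewer pixels along each axis at 40.5 m than at 3 m.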

Fig. 6: Position distributions of the pose labels across the dataset in the camera frame, for synthetic (left) and real (right) samples.
Fig. 7: Camera poses for real images in the Tango’s body frame from two views. The simplified wireframe model of the satellite is plotted in green; camera poses are plotted in red and black for test and training samples, respectively.
Fig. 8: Distribution of the camera’s relative positions for the synthetic images in the Tango’s body frame from two views. The satellite is at the origin; camera poses are plotted in red and black for test and training samples, respectively.

Fig. 7 visualizes the relative orientation and position distributions for real images in the satellite body frame. For synthetic images, Fig. 8 visualizes the relative position distribution in the satellite body frame. It especially visualizes the fact that for synthetic images, the relative orientations are well distributed across the 3D space. However, in case of real images, the diversity of orientations and distances is restricted due to physical limitations.

IV Competition Design

In an open scientific competition such as SPEC and other Kelvins competitions, scientific problems are turned into well-formulated mathematical problems that are solved by engaging the broader scientific community and citizen scientists. Therefore, there are two key factors that are considered in setting up the competition:

  • community engagement: The participants and the effort they put into solving the problems are our main resource. Therefore, a broad audience has to be reached to attract many individuals and teams. Then, the barrier to entry into the competition has to be as low as possible. Finally, engagement of the participants has to be maintained. This last point involves making sure that the problem can be solved based on the released dataset (e.g., “there is signal”, samples are well distributed, etc.), that solutions are quickly evaluated and added to a live leader board, and in general that the competition is fair (e.g., by keeping the test set private).

  • competition metric: The creation of the competition metric is the process in which the scientific problem of interest is turned into an optimization problem. Care should be taken in designing the competition metric, as it has to directly reflect the important aspects of the problem. Otherwise a solution to the optimization problem might not be relevant to the original scientific problem. In case the metric can be cheated, participants may focus on specific solutions that might lead to good scores but are of less practical value.

SPEC particularly aimed to focus community efforts on the problem of estimating pose of uncooperative satellites. The following sections describe the competition setup and the baseline solutions provided to the participants and introduce the competition metric.

IV-A Competition setup - the Kelvins competition platform

Kelvins, the platform which hosts SPEC and many other satellite-related challenges, was designed to provide a seamless experience for the participants. It features a live leaderboard that is key to maintaining community engagement over longer intervals. Teams have direct information about how their latest submission compares to those of their peers, the limits are constantly pushed further, and the competitive aspect motivates teams to put in more effort. Another important feature is the automated evaluation of submissions. This allows for keeping the test set private, which helps ensure a fair competition. During the competition, only a portion of the test set was used for evaluation and placement on the leaderboard in order to prevent the participants from overfitting on the entire test set.

IV-B Competition Metric

The competition metric has to faithfully reflect the underlying scientific problem in order to ensure that the high-scoring solutions are meaningful also outside the context of the competition. While it is not uncommon to have separate orientation and position metrics [Kendall2015_PoseNet], a single scalar score was used instead to rank the submissions on the leaderboard.

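To evaluate the submitted pose solutions, separate position ($e_t$) and orientation ($e_q$) errors are computed. Fig. 5 graphically describes the relevant reference frames to compute the errors. The position error, $e_t$, is defined as

$$e_t = \lVert \mathbf{t} - \hat{\mathbf{t}} \rVert_2, \tag{1}$$

the magnitude (2-norm) of the difference between the ground-truth ($\mathbf{t}$) and estimated ($\hat{\mathbf{t}}$) position vectors from the origin of the camera reference frame to that of the target body frame. The normalized position error, $\bar{e}_t$, is also defined as

$$\bar{e}_t = \frac{e_t}{\lVert \mathbf{t} \rVert_2}, \tag{2}$$

which penalizes the position errors more heavily when the target satellite is closer.

The orientation error is calculated as the angular distance between the predicted, $\hat{\mathbf{q}}$, and actual, $\mathbf{q}$, unit quaternions, i.e., the magnitude of the rotation between the predicted and actual orientations of the target body frame relative to the camera reference frame,

$$e_q = 2 \arccos\left( \lvert \langle \hat{\mathbf{q}}, \mathbf{q} \rangle \rvert \right). \tag{3}$$

The pose error for a single image is the sum (1-norm) of the orientation and the normalized position error,

$$e_{\mathrm{pose}} = e_q + \bar{e}_t. \tag{4}$$

Finally, the total error is the average of the pose errors over all $N$ images of the test set,

$$E = \frac{1}{N} \sum_{i=1}^{N} e_{\mathrm{pose}}^{(i)}. \tag{5}$$
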
A main concern during the creation of the competition metric was to balance its sensitivity to position and orientation errors and avoid situations where one factor dominates the metric while neglecting the other. Note that since the position error is dependent on the target distance, the balance between the two contributions also depends on the particular distance distribution of the test set.
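As a minimal sketch of how the score described above could be computed, the following Python functions evaluate the position error, the normalized position error, the orientation error, and the per-image pose error, and average the latter over a set of images. The function names, the (translation, quaternion) tuple layout, the use of radians, and the expression of the angular distance as twice the arccosine of the absolute quaternion inner product are assumptions made for illustration; this is not the platform's evaluation code.

```python
import math


def position_error(t_true, t_est):
    """2-norm of the difference between true and estimated positions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_true, t_est)))


def normalized_position_error(t_true, t_est):
    """Position error divided by the ground-truth distance to the target."""
    distance = math.sqrt(sum(a * a for a in t_true))
    return position_error(t_true, t_est) / distance


def orientation_error(q_true, q_est):
    """Angular distance in radians between two unit quaternions.

    The absolute value makes q and -q (the same rotation) equivalent;
    min() guards acos against rounding slightly above 1.
    """
    dot = abs(sum(a * b for a, b in zip(q_true, q_est)))
    return 2.0 * math.acos(min(1.0, dot))


def pose_error(label, estimate):
    """Sum of the orientation error and the normalized position error."""
    (t_true, q_true), (t_est, q_est) = label, estimate
    return (orientation_error(q_true, q_est)
            + normalized_position_error(t_true, t_est))


def total_error(labels, estimates):
    """Average pose error over all images."""
    errors = [pose_error(l, e) for l, e in zip(labels, estimates)]
    return sum(errors) / len(errors)
```

With this formulation, a 1 m position error at a range of 10 m contributes 0.1 to the pose error, while the same 1 m error at 40 m contributes only 0.025.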

In order to check the balance of the sensitivities, the total error was calculated over the test set for two cases: introducing a fixed orientation error in the first case, and adding a fixed translation error in the second case. For the particular distribution of poses in the test set, each injected error was then expressed as the equivalent average error of the other type. Such behavior is expected due to the underlying perspective equations which drive image formation. This suggested that the contributions of each error type are reasonably balanced, thus the total score combines both errors without the introduction of additional scaling factors.
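The equivalence between the two error types can be sketched as follows. If the pose error is the orientation error (assumed here to be in radians) plus the position error divided by the target distance, then a constant translation error raises the average score by that error times the mean inverse distance. The helper below, whose name and example distances are hypothetical, inverts this relation to find the translation error that has the same effect on the score as a given orientation error.

```python
import math


def equivalent_translation_error(angle_deg, distances):
    """Translation error [m] with the same effect on the average pose
    error as an orientation error of angle_deg degrees.

    A translation error d adds mean(d / r_i) = d * mean(1 / r_i) to the
    score, so d = angle_rad / mean(1 / r_i).
    """
    mean_inverse_distance = sum(1.0 / r for r in distances) / len(distances)
    return math.radians(angle_deg) / mean_inverse_distance


# Hypothetical test-set distances in meters, for illustration only.
example_distances = [5.0, 10.0, 20.0, 40.0]
equivalent = equivalent_translation_error(1.0, example_distances)
```

Because the mean inverse distance is dominated by close-range samples, the equivalence depends strongly on the distance distribution of the test set, as noted above.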

Two alternative metrics were also considered. The reprojection error is the average distance between projected keypoints measured in 2D on the image plane [Brachmann2016UncertaintyDriven6P]. The average distance error is the 3D distance between the ground truth and predicted keypoints (usually referred to as ADD metric [Hinterstoisser2013_LINEMOD]). Both have the disadvantage that the orientation and position sensitivity is dependent on the choice of keypoints, since the slope of orientation error is proportional to the distance of the keypoints from the origin of the target’s body frame. Furthermore, the reprojection error is numerically unstable in the case when predicted keypoints lie very close to the image plane.
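The dependence of the average distance error on the choice of keypoints can be illustrated with a short sketch. The code below computes the mean 3D distance between model points transformed by the ground-truth and the estimated pose. The scalar-first quaternion layout, the convention that the quaternion rotates body-frame points into the camera frame, and the example points are assumptions made for illustration.

```python
import math


def quat_to_rot(q):
    """Rotation matrix of a unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def transform(point, q, t):
    """Rotate a body-frame point by q and translate it by t."""
    rot = quat_to_rot(q)
    return [sum(rot[i][j] * point[j] for j in range(3)) + t[i]
            for i in range(3)]


def average_distance_error(points, q_true, t_true, q_est, t_est):
    """Mean 3D distance between points under the true and estimated pose."""
    total = 0.0
    for point in points:
        a = transform(point, q_true, t_true)
        b = transform(point, q_est, t_est)
        total += math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    return total / len(points)


identity = (1.0, 0.0, 0.0, 0.0)
half_turn_z = (0.0, 0.0, 0.0, 1.0)  # 180 degree rotation about z
t = (0.0, 0.0, 10.0)
near = average_distance_error([(1.0, 0.0, 0.0)], identity, t, half_turn_z, t)
far = average_distance_error([(2.0, 0.0, 0.0)], identity, t, half_turn_z, t)
```

For the same orientation error, the keypoint twice as far from the body-frame origin yields twice the distance error (`far` is `2 * near`), which is the keypoint dependence described above.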

IV-C Baseline solutions

Two different example solutions are provided to the participants in Python using two popular deep learning frameworks, Keras and PyTorch. The main reason for providing these baseline solutions is to lower the barriers to entering the competition. While the performance of these baselines is intentionally rather weak, they still allow competitors to submit their first result within an hour. Along with the example solutions, the competition platform provides useful tools that facilitate working with the dataset, such as functions to visualize samples and corresponding pose labels, or data loaders for the two deep learning frameworks.

The baseline solutions rely on pre-trained ResNet models where the last layer is replaced with a layer containing seven linear outputs for the pose variables. The models are fed with downscaled images and trained with a simple Mean-Squared Error (MSE) loss for a fixed number of epochs. These baselines leave considerable room for improvement. For instance, the outputs are not normalized, and the predicted distance along the camera boresight is typically one order of magnitude larger than all the other output variables, so with the MSE loss, errors in this direction dominate the loss. Furthermore, the MSE loss does not account for the periodicity of orientation.
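Both weaknesses of the plain MSE loss can be reproduced with a few lines of Python. The sketch below assumes a seven-element output made of a translation vector followed by a scalar-first unit quaternion; this layout and the numerical values are hypothetical and serve only to illustrate the argument.

```python
import math


def mse(a, b):
    """Mean-squared error between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def angular_distance(q1, q2):
    """Rotation angle in radians between two unit quaternions."""
    dot = abs(sum(x * y for x, y in zip(q1, q2)))
    return 2.0 * math.acos(min(1.0, dot))


t = (0.0, 0.0, 10.0)              # target 10 m along the boresight
q = (1.0, 0.0, 0.0, 0.0)          # identity orientation
q_neg = tuple(-c for c in q)      # same rotation, opposite sign
h = math.sqrt(0.5)
q_quarter = (h, 0.0, 0.0, h)      # 90 degree rotation about z

label = t + q
# q and -q describe the same attitude, yet MSE penalizes the sign flip.
loss_sign_flip = mse(label, t + q_neg)
angle_sign_flip = angular_distance(q, q_neg)
# A 10% range error outweighs a 90 degree attitude error in the loss.
loss_range = mse(label, (0.0, 0.0, 11.0) + q)
loss_rotation = mse(label, t + q_quarter)
```

In this example the sign-flipped quaternion incurs a nonzero loss although the angular distance is zero, and the 1 m range error produces a larger loss than the quarter-turn orientation error.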

Keeping the baseline solutions intentionally simple and weak helped to engage the participants in the competition. These baselines allow for incremental improvements, such as replacing the loss function or training on larger input images. Additionally, a stronger third baseline solution, also based on CNN, was developed during the competition by SLAB and is used for comparison purposes.

Fig. 9: Final results on the synthetic and real test sets.

V Competition Results

During the competition, 48 teams participated and submitted results. 20 teams filled a post-competition questionnaire and provided detailed descriptions about their approaches. This section analyzes and compares their submissions, evaluates the performance of the different approaches, and identifies difficult samples to show what are the current limits of this technology.

V-A Final results

Fig. 9 illustrates the final scores. The first 20 teams significantly outperformed the initial baseline, with the top teams achieving an improvement of two orders of magnitude over the baseline solutions.

Team                                    Score (synthetic)   Score (real)   PnP
1. UniAdelaide [Chen2019SPEC]           0.0094              0.3752         Yes
2. EPFL_cvlab                           0.0215              0.1139         Yes
3. pedro_fairspace [Proenca2019SPEC]    0.0571              0.1555         No
SLAB Baseline [Park2019_TowardsRL]      0.0626              0.3951         Yes
  • Best results for each metric are highlighted with bold fonts. The mean and the standard deviation of the orientation errors as in (3) and position errors as in (1) are measured on the synthetic test set.

TABLE II: Detailed Results of the Top Three Submissions Compared to the SLAB’s Baseline Performance

While the primary competition ranking criterion was the score on the synthetic test set, submissions were also evaluated on the real test set. Results on real images are weaker than those on synthetic images for most teams, except for three of the solutions. Machine learning models are generally expected to perform worse when evaluated on data with a statistical distribution that significantly differs from that of their training set. It is possible that the reason those three teams achieved better results on real imagery is related to its limited pose distribution.

The results of the top three teams are collected in Table II and compared to the baseline network developed by SLAB [Park2019_TowardsRL] during the course of the competition. While team UniAdelaide [Chen2019SPEC] won the competition by achieving the best score on the synthetic test set, EPFL_cvlab achieved the highest accuracy on real images. pedro_fairspace [Proenca2019SPEC] submitted the best solution that did not rely on PnP solvers, finishing in third place. These top three solutions were the only submissions to outperform the SLAB baseline. Before the competition, the best published result on SPEED was the Spacecraft Pose Network (SPN) by Sharma and D’Amico [Sharma2019, sharmaThesis2019]. SPN was also the first published result on the SPEED benchmark prior to its public release, with its performance reported in terms of the mean orientation and position errors.

V-B Survey on methods

Shortly after the competition, all participants were asked to answer a short questionnaire regarding their backgrounds, the approaches they used, and how they dealt with certain aspects of the problem. 20 teams, including the top 13 competitors, answered the survey. Most of the teams (except for three) consisted of a single individual contributor, affiliated with either an academic institution or industry. It is noteworthy that only half of the teams were involved with space-related research, and some were not working on pose estimation problems at all.

Fig. 10: Position (left) and orientation (right) error distributions for direct and PnP solver based methods.

Deep learning approaches dominated the submissions, as all teams used deep learning either in an end-to-end fashion or as an intermediary process in their pipelines. The teams addressed the pose estimation problem as a regression task, except for one team that framed orientation prediction as a soft classification problem. Various architectures were used from well known pre-trained models, such as ResNets [He2015_ResNet], Inception v3 [Szegedy2015Inceptionv3], and YOLO [redmonFarhadi2017YOLOv2, redmonFarhadi2018YOLOv3], to custom models trained from scratch. 18 of the 20 teams made use of the data augmentation techniques to maximize their performance, such as geometric transformations (e.g., rotation around the camera axis, zooming and cropping) and pixel intensity changes (e.g., adding noise, changing brightness).

SPEED consists of high resolution images that are not suitable as direct inputs to a neural network due to memory limitations of GPUs. Therefore, all teams downscaled the given images to a variety of sizes. Some teams cropped the input image, either taking a sufficiently large central crop or localizing the satellite first and then cropping the relevant part of the image. Specifically, a number of top-scoring teams used a separate CNN to perform localization before cropping in order to prevent any loss of information due to downscaling. Several teams used ImageNet pre-trained models that expect three-channel RGB input images. Since the dataset consists of single-channel grayscale images, this gave teams additional freedom in constructing their input. While most teams simply stacked the same input channel to form an RGB input, two teams included masked or filtered versions of the input on the extra channels.
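The two ways of building a three-channel input can be sketched in plain Python. The helper names are hypothetical, and the horizontal gradient is only one possible example of a filtered channel; the filters and masks actually used by the teams are not specified here.

```python
def horizontal_gradient(gray):
    """Absolute difference between horizontally adjacent pixels.

    The last column has no right neighbor and is set to zero.
    """
    return [[abs(row[x + 1] - row[x]) if x + 1 < len(row) else 0.0
             for x in range(len(row))] for row in gray]


def stack_channels(gray, extra_channels=()):
    """Build a channels-first 3 x H x W input from an H x W image.

    Missing channels are filled with copies of the grayscale image.
    """
    channels = [gray] + list(extra_channels)
    while len(channels) < 3:
        channels.append(gray)
    return [[list(row) for row in channel] for channel in channels[:3]]


gray = [[0.0, 0.5, 1.0],
        [1.0, 1.0, 0.0]]
replicated = stack_channels(gray)
with_filter = stack_channels(gray, extra_channels=[horizontal_gradient(gray)])
```

Replicating the channel keeps the pre-trained first-layer filters usable without adding information, whereas a filtered channel supplies the network with a complementary view of the same image.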

Since the 3D model for the satellite was not released as part of the competition, some teams chose to reconstruct the satellite model in order to use any keypoints-based architecture. Specifically, seven teams reconstructed the 3D coordinate locations of 8 to 11 keypoints using 10 to 20 hand-selected images and the provided pose labels. The keypoints generally correspond to the corners of the satellite body and the tips of the antennae. The method of reconstruction ranged from manually aligning the vertices to triangulation or reprojection-based optimization. The resulting models were used for generating bounding box or segmentation ground truth from the available pose labels, and in some cases directly in the pose estimation process with PnP solvers.
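Generating a bounding-box label from a pose label and reconstructed keypoints can be sketched as follows: the 3D keypoints are projected into the image with a pinhole model and the extremes of the projections define the box. The scalar-first quaternion layout, the convention that the quaternion rotates body-frame points into the camera frame, the intrinsic parameters, and the cube-shaped keypoints are all assumptions made for illustration.

```python
def quat_to_rot(q):
    """Rotation matrix of a unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def project(point, q, t, focal_px, center):
    """Project a body-frame point to pixel coordinates (pinhole model)."""
    rot = quat_to_rot(q)
    cam = [sum(rot[i][j] * point[j] for j in range(3)) + t[i]
           for i in range(3)]
    return (focal_px * cam[0] / cam[2] + center[0],
            focal_px * cam[1] / cam[2] + center[1])


def bounding_box(keypoints, q, t, focal_px, center, margin_px=0.0):
    """Axis-aligned box (u_min, v_min, u_max, v_max) around projections."""
    pixels = [project(p, q, t, focal_px, center) for p in keypoints]
    us = [u for u, _ in pixels]
    vs = [v for _, v in pixels]
    return (min(us) - margin_px, min(vs) - margin_px,
            max(us) + margin_px, max(vs) + margin_px)


# Hypothetical keypoints: corners of a 1 m cube centered on the body origin.
cube = [(sx * 0.5, sy * 0.5, sz * 0.5)
        for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]
box = bounding_box(cube, (1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 10.0),
                   focal_px=1000.0, center=(960.0, 600.0))
```

A crop taken around such a box, optionally enlarged with `margin_px`, retains the target at full resolution while discarding most of the background.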

Fig. 11: Position (left) and orientation (right) error distributions highlighting the effect of a localization step prior to orientation estimation.

V-C Comparing approaches

This section provides the analysis of survey results and submissions together to compare design decisions in light of the final results. In particular, it discusses how keypoint matching techniques compare to pure deep learning approaches and what the effect of a separate localization step is in the pose estimation pipeline.

V-C1 Keypoint matching techniques

Most teams designed an architecture that predicts the target’s pose in an end-to-end fashion. However, four teams designed an architecture that first predicts a set of pre-defined keypoints using a neural network. Then, they use a keypoint matching technique such as a PnP solver to align a known model of the satellite (e.g., reconstructed 3D keypoint coordinates) with the detected keypoints. While the PnP optimization is prone to local minima, it allows for explicitly incorporating the geometric constraints in the pose estimation process.
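The geometric constraint exploited by PnP-based pipelines can be made concrete with a sketch of the reprojection cost that such solvers minimize over the pose. The code below only evaluates the cost for a candidate pose and does not implement a solver. The scalar-first quaternion layout, the camera intrinsics, and the model keypoints are assumptions made for illustration.

```python
def quat_to_rot(q):
    """Rotation matrix of a unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def project(point, q, t, focal_px, center):
    """Project a body-frame point to pixel coordinates (pinhole model)."""
    rot = quat_to_rot(q)
    cam = [sum(rot[i][j] * point[j] for j in range(3)) + t[i]
           for i in range(3)]
    return (focal_px * cam[0] / cam[2] + center[0],
            focal_px * cam[1] / cam[2] + center[1])


def reprojection_cost(model_points, detections, q, t, focal_px, center):
    """Sum of squared pixel distances between projected model keypoints
    and the 2D keypoints detected by the network."""
    cost = 0.0
    for point, (u_det, v_det) in zip(model_points, detections):
        u, v = project(point, q, t, focal_px, center)
        cost += (u - u_det) ** 2 + (v - v_det) ** 2
    return cost


# Hypothetical model keypoints and a synthetic set of perfect detections.
model = [(0.4, 0.3, 0.0), (-0.4, 0.3, 0.0), (-0.4, -0.3, 0.0),
         (0.4, -0.3, 0.0), (0.0, 0.0, 0.5)]
q_true, t_true = (1.0, 0.0, 0.0, 0.0), (0.2, -0.1, 8.0)
focal_px, center = 1000.0, (960.0, 600.0)
detections = [project(p, q_true, t_true, focal_px, center) for p in model]
cost_true = reprojection_cost(model, detections, q_true, t_true, focal_px, center)
cost_off = reprojection_cost(model, detections, q_true, (0.2, -0.1, 9.0), focal_px, center)
```

The cost vanishes at the pose that generated the detections and grows for perturbed poses, so an iterative solver started from a poor initial guess can still become trapped in a local minimum of this nonlinear function.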

Fig. 10 illustrates the error distributions for the solutions based on PnP and direct pose estimation separately for position and orientation error. Specifically, the performances of the top 10 teams were analyzed to compare the PnP solutions with strong direct pose estimation submissions. Among these submissions, PnP-based solutions significantly outperform direct pose estimation in both position and orientation, taking first, second, and fourth place. The average orientation errors and their deviations are and for direct and PnP methods, respectively, while relative position errors are m and m.

V-C2 The effect of separate localization

Another recurring technique across the participants is the use of a separate localization step. In this case, the first step is the detection of the satellite, either by segmenting its contour or identifying a tight-fitting bounding box around it. This step separates the position and orientation estimation tasks and allows separate models to be trained for each. The main advantage is that an intermediate detection result allows the original high resolution image to be cropped, so that only the relevant part of the image is used downstream. The disadvantages of this approach are the added complexity and the need for segmentation/bounding box annotation via a separate model reconstruction step.
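The benefit of cropping can be made concrete with a little arithmetic. The sketch below turns a detected bounding box into a square crop and compares how much of the source resolution survives when the full image, versus the crop, is resized to a fixed network input. The image size, network input size, margin, and function names are illustrative assumptions and are not values reported by the participants.

```python
def square_crop(box, image_w, image_h, margin=0.2):
    """Square crop (left, top, side) around a box (u_min, v_min, u_max, v_max).

    The crop is enlarged by `margin` and shifted so it stays inside the image.
    """
    u0, v0, u1, v1 = box
    side = max(u1 - u0, v1 - v0) * (1.0 + margin)
    side = min(side, image_w, image_h)           # cannot exceed the image
    cu, cv = (u0 + u1) / 2.0, (v0 + v1) / 2.0    # box center
    left = min(max(cu - side / 2.0, 0.0), image_w - side)
    top = min(max(cv - side / 2.0, 0.0), image_h - side)
    return left, top, side

def retained_scale(source_side, net_input):
    """Linear scale factor after resizing `source_side` pixels to `net_input` pixels."""
    return min(1.0, net_input / source_side)

# Hypothetical numbers: a 1920x1200 image, a 224-pixel network input,
# and a detection that is 200 pixels wide.
left, top, side = square_crop((900.0, 500.0, 1100.0, 650.0), 1920, 1200)
full_image_scale = retained_scale(1920, 224)   # whole image squeezed into the input
cropped_scale = retained_scale(side, 224)      # only the crop squeezed into the input
```

Under these assumed numbers the full image is shrunk to roughly 12% of its linear resolution, while the crop keeps about 93%, so the satellite covers many more input pixels after cropping.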

Fig. 11 compares the error distributions of the top 8 teams that use direct pose estimation methods (i.e., no PnP solver). Specifically, half of the selected teams use an independent localization step in their direct pose estimation approach, whereas the other half use a combined architecture that performs localization and pose estimation simultaneously. Interestingly, the position error distributions are nearly identical, while separate localization significantly outperforms the combined approach in terms of orientation. This suggests that a separate localization step brings no benefit for position estimation: predicting the position simultaneously with the orientation of the satellite is just as accurate. However, the capability to crop irrelevant parts and zoom in on the important part of the image makes a significant difference in orientation estimation. Specifically, the mean orientation error and deviation with separate localization is as opposed to of the combined approach.

Fig. 12: Top: Test images ranked by difficulty, measured as minimum pose error across all submissions. Bottom: Nine example images from different parts of the distribution. Images are shown with scaled colors to maximize contrast.

V-D Difficulty of samples

In order to uncover which factors contribute the most to the difficulty of the satellite pose estimation task, the best prediction from all submissions is selected for each image of the test set. This ‘super pose estimator’ is used as a proxy for how difficult the pose estimation task is on a given sample. The resulting score distribution is plotted in Fig. 12 along with a number of selected images. Except for a few outliers, the error distribution is flat with pose errors well below . In fact, the average orientation error and its standard deviation is , while the average position error is m. (Footnote 6: In comparison, the winning team UniAdelaide achieved orientation error and m relative position error.)
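The construction of this per-image minimum can be sketched in a few lines. The data layout (a dictionary from team name to a list of per-image pose errors in a common image order), the team names, and the numbers are hypothetical and serve only to illustrate the idea.

```python
def super_pose_estimator(errors_by_team):
    """Per-image minimum pose error across all submissions.

    errors_by_team maps a team name to its list of per-image pose errors;
    all lists are assumed to follow the same image order.
    """
    per_image = zip(*errors_by_team.values())
    return [min(image_errors) for image_errors in per_image]

def rank_by_difficulty(min_errors):
    """Image indices sorted from hardest (largest minimum error) to easiest."""
    return sorted(range(len(min_errors)), key=lambda i: min_errors[i], reverse=True)

# Hypothetical pose errors for three test images and two teams.
errors = {
    "team_a": [0.10, 0.50, 0.30],
    "team_b": [0.20, 0.40, 0.05],
}
best = super_pose_estimator(errors)
hardest_first = rank_by_difficulty(best)
```

Because the minimum is taken independently for every image, the result is better than any single submission and reflects the difficulty of the sample rather than the weakness of one particular method.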

The general trend is that the images with black background, representing the case of an under-exposed star field, are easier compared to the samples with Earth background. Black background makes the detection of the satellite a straightforward task, given the sharp contrast of the satellite to its background. Having a cluttered Earth background makes the pose estimation more difficult.

Fig. 13: Distribution of the minimum pose error with respect to the inter-spacecraft distance. The minimum pose error is computed across all submissions. Mean and standard deviation are calculated over one meter wide distance bins.

The most challenging samples are the images with an Earth background and a small target due to a large inter-spacecraft distance. In this situation, the apparent size of the satellite can be comparable with features in the background image, and in some cases the contrast of the satellite to the background is minimal. This makes pose estimation particularly challenging. In fact, just spotting the satellite in these images is a demanding task for humans as well (see the first four images in Fig. 12). Fig. 13 also highlights the importance of the inter-spacecraft distance. It plots the distribution of the pose score within 1 m wide distance bins. The distribution of scores is correlated with the target distance, i.e., it is harder to estimate the pose of satellites that are farther away. This is expected, since a larger target distance results in a smaller apparent size of the satellite, corresponding to fewer pixels associated with the spacecraft.
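The binning behind such a plot can be sketched as follows, assuming the inter-spacecraft distance and the minimum pose error are available for every test image. The function name, the use of the population standard deviation, and the sample numbers are illustrative choices, not details given in the text.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev

def bin_errors_by_distance(distances, errors, bin_width=1.0):
    """Mean and standard deviation of the errors within fixed-width distance bins.

    Returns a dict mapping each bin's lower edge to (mean, standard deviation).
    """
    bins = defaultdict(list)
    for d, e in zip(distances, errors):
        bins[math.floor(d / bin_width)].append(e)
    return {k * bin_width: (mean(v), pstdev(v)) for k, v in sorted(bins.items())}

# Hypothetical samples: two near images and two far images.
stats = bin_errors_by_distance(
    distances=[3.2, 3.8, 10.1, 10.9],
    errors=[0.01, 0.03, 0.10, 0.20],
)
```

With these made-up numbers the far bin has both a larger mean and a larger spread than the near bin, which is the kind of trend the text describes.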

VI Conclusion and future work

The aim of organizing the Satellite Pose Estimation Challenge (SPEC) was to draw more attention to the satellite pose estimation problem and to provide a benchmark to gauge different approaches. Nearly 50 teams participated during the five-month duration of the competition. This paper summarizes the creation of the dataset and the considerations that went into designing the competition. Based on the submissions and a survey conducted amongst the top performing participants, an analysis of the different approaches to the problem is presented. The top performing participants managed to significantly outperform the previous state-of-the-art and push the boundaries of vision-based satellite pose estimation further.

The analysis of the submissions found that the target distance and cluttered backgrounds are the most significant factors contributing to the difficulty of samples. A general trend in computer vision that was also observed in this competition is the dominance of deep learning approaches. Virtually all teams relied on Deep Neural Networks (DNN), at least in some steps of their pose estimation pipeline. However, while DNNs proved to be indispensable in solving the problem of perception, they are still not the best choice throughout all steps of a pose estimation pipeline. Perspective-n-Point (PnP)-based keypoint matching techniques that used keypoints detected by DNNs won the first two places. Another finding was that, when the available images are high resolution but Graphics Processing Unit (GPU) memory limits the input resolution, a separate localization step can bring significant improvements in pose accuracy, as it allows the irrelevant parts of the image to be cropped away.

Overall, the scores of the top submissions indicate that various DNN architectures are able to perform good pose estimation of a noncooperative spacecraft, provided the servicer has access to the target’s 3D model or 3D keypoint coordinates as designed by the mission operators. However, the performance of the same architectures on real images is relatively poor, as the real images have different statistical distributions from the synthetic images that were used to train the DNNs. As any DNN deployed in future space missions will undoubtedly rely on synthetic images as the main source of training data, future editions of SPEC must design datasets and competition metrics that better reflect the significance of the domain gap. Ultimately, to support debris removal and other representative mission scenarios, SPEC must address the issue of estimating the pose of an unknown resident space object.


The authors would like to thank OHB Sweden for the 3D model of the Tango spacecraft used to create the images used in this work and for the flight images collected during the PRISMA extended mission.