Pose Estimation for Non-Cooperative Rendezvous Using Neural Networks

06/24/2019
by   Sumant Sharma, et al.

This work introduces the Spacecraft Pose Network (SPN) for on-board estimation of the pose, i.e., the relative position and attitude, of a known non-cooperative spacecraft using monocular vision. In contrast to other state-of-the-art pose estimation approaches for spaceborne applications, the SPN method does not require the formulation of hand-engineered features and only requires a single grayscale image to determine the pose of the spacecraft relative to the camera. The SPN method uses a Convolutional Neural Network (CNN) with three branches to solve for the pose. The first branch of the CNN bootstraps a state-of-the-art object detector to detect a 2D bounding box around the target spacecraft. The region inside the bounding box is then used by the other two branches of the CNN to determine the attitude by initially classifying the input region into discrete coarse attitude labels before regressing to a finer estimate. The SPN method then uses a novel Gauss-Newton algorithm to estimate the position by using the constraints imposed by the detected 2D bounding box and the estimated attitude. The secondary contribution of this work is the generation of the Spacecraft PosE Estimation Dataset (SPEED). SPEED consists of synthetic as well as actual camera images of a mock-up of the Tango spacecraft from the PRISMA mission. The synthetic images are created by fusing OpenGL-based renderings of the spacecraft's 3D model with actual images of the Earth captured by the Himawari-8 meteorological satellite. The actual camera images are created using a 7 degrees-of-freedom robotic arm, which positions and orients a vision-based sensor with respect to a full-scale mock-up of the Tango spacecraft. The SPN method, trained only on synthetic images, produces degree-level attitude error and cm-level position errors when evaluated on the actual camera images not used during training.



1 Introduction

The on-board estimation of the pose, i.e., the relative position and attitude, of a known non-cooperative spacecraft using a monocular camera is a key enabling technology for current and future on-orbit servicing and debris removal missions such as the RemoveDEBRIS mission by Surrey Space Centre [1], the Phoenix program by DARPA [2], and the Restore-L mission by NASA [3]. Knowledge of the current pose of the target spacecraft during the proximity operations of these missions allows real-time approach trajectory guidance and control. In contrast to systems based on LiDAR and stereo camera sensors, monocular navigation enables pose estimation under low mass and power requirements, making it a natural sensor candidate for future formation-flying missions, especially those employing small satellites.

Prior demonstrations of close-range pose estimation have relied on image processing based on hand-engineered features [4, 5, 6, 7, 8, 9, 10, 11, 12] and on a-priori knowledge of the pose [13, 14, 15, 16]. A key strength of many of these methods is their use of the perspective transformation between the scene and the image to hypothesize and test correspondences between features detected in the 2D image and the known 3D model of the target spacecraft. However, the formulation of specific features neither scales to spacecraft with different structural and physical properties nor remains robust to the dynamic illumination conditions of space. Moreover, a-priori knowledge of the target's pose is not always available due to mission operations constraints, nor is it desirable when full autonomy is required.

Recent advancements in pose estimation techniques for terrestrial applications have relied on deep learning algorithms. Broadly, these algorithms bypass the classical image processing pipeline and instead attempt to learn the non-linear transformation between the two-dimensional input image space and the six-dimensional output pose space in an end-to-end fashion. The deep learning-based methods either discretize the pose space and solve the resulting classification problem [17, 18, 19, 20] or directly regress the relative pose from the input image [21, 22, 23]. Render-for-CNN [18] demonstrated the use of rendered images to train a convolutional neural network for viewpoint estimation on actual camera images. The viewpoint estimation problem is cast as classifying the camera rotation parameters into fine-grained bins. PoseCNN [23] used separate branches of a convolutional neural network to predict semantic labels, object centers in the 2D image, and object rotation using direct regression. However, that approach is not accurate enough without further refinement using the iterative closest point method [24].

Classification-based approaches rely on a fine discretization of the pose space into a large number of pose labels in order to achieve reasonable pose accuracy. On the other hand, direct regression-based approaches require a careful choice of parameters to avoid unpredictable behavior while learning the transformation between the input two-dimensional pixel information and the output regression parameters describing the six-dimensional pose space. In addition, the applicability of these algorithms to space imagery is not trivial for two reasons. Firstly, unlike terrestrial applications, spaceborne navigation cameras are challenged by quickly varying illumination conditions and capture imagery with low signal-to-noise ratio and high contrast. Secondly, the massive datasets necessary to train machine learning algorithms are typically not available for spaceborne navigation.

Figure 1: Modules of the proposed SPN method, which takes as input a 2D image and a trained convolutional neural network for relative attitude and position determination.

The primary contribution of this work is the introduction of the Spacecraft Pose Network (SPN), a new deep learning-based pose estimation method. As shown in Figure 1, the SPN method uses a convolutional neural network to estimate the relative position and relative attitude in a decoupled fashion from a single grayscale image. One branch of the convolutional neural network is used to detect a 2D bounding box around the target spacecraft in the input image. The other two branches are used to estimate the relative attitude using a hybrid discrete-continuous method. The relative attitude and the 2D bounding box are then combined with the geometrical constraints of the perspective transformation to estimate the relative position using the Gauss-Newton algorithm. In contrast to current deep learning-based techniques, the relative attitude accuracy provided by the SPN method is not limited by the level of discretization of the pose space, and the method explicitly uses geometrical knowledge of the perspective transformation in the estimation of the relative position. Compared with techniques that match image features against an on-board model of the target spacecraft, the SPN method provides a pose estimate from a single image without requiring a long initialization phase or favorable relative translational motion. Further, due to the decoupling of the relative attitude and position estimation and the use of transfer learning, the SPN method can be trained to infer the pose of the target spacecraft from a relatively small number of synthetic images of the same spacecraft.

The secondary contribution of this work is the creation of the Spacecraft Pose Estimation Dataset (SPEED), which consists of high-fidelity imagery involving close proximity operations around a tumbling spacecraft. The dataset contains 15,300 images from two sources: a purely software-based Augmented Reality (AR) source that fuses synthetic and actual space imagery, and a purely reality-based source that uses an actual camera sensor to capture images of a mock-up spacecraft under high-fidelity illumination conditions. The convolutional neural network of the SPN method is trained only on a portion of these AR images while it is tested on the remaining AR images as well as the actual camera images of SPEED. SPEED will also be made publicly available to the community to enable a fair comparison of the state-of-the-art pose estimation techniques through a competition on pose estimation for spaceborne applications organized in collaboration with the European Space Agency[25].

The paper is organized as follows: Section 2 describes the framework for the SPN method; Section 3 describes the SPEED image generation; Section 4 describes the experiments conducted to validate the performance of SPN method; and Section 5 presents conclusions from this work as well as directions for further research and development.

2 Spacecraft Pose Network

Formally, the problem statement for this work is the estimation of the attitude and position of the camera frame, C, with respect to the body frame of the target spacecraft, B. As shown in Figure 2, $\mathbf{t}_{BC}$ denotes the relative position of the origin of the target's body reference frame w.r.t. the origin of the camera's reference frame. Similarly, $\mathbf{q}_{BC}$ denotes the quaternion associated with the rotation that aligns the target's body reference frame with the camera's reference frame.

Figure 2: Definition of the reference frames, relative position, and relative attitude.
Figure 3: Illustration of the convolutional neural network used in the SPN method. Branch 1 uses the region proposal network [26] to detect a 2D bounding box around the target spacecraft while Branches 2 and 3 of the network are used in a hybrid classification-regression fashion to obtain the relative attitude.

As shown in Figure 3, the SPN method uses three separate branches of a convolutional neural network to estimate $\mathbf{q}_{BC}$ and $\mathbf{t}_{BC}$. Branch 1 is used to detect a 2D bounding box in the image around the target spacecraft. The output of the convolutional layers corresponding to the detected 2D bounding box is used as an input to Branches 2 and 3 to obtain an estimate of the relative attitude, $\hat{\mathbf{q}}_{BC}$. Finally, the 2D bounding box and $\hat{\mathbf{q}}_{BC}$ are input to the Gauss-Newton algorithm to estimate the relative position, $\hat{\mathbf{t}}_{BC}$. The estimation of the relative position and relative attitude is discussed in detail in the following subsections.

2.1 Relative Position Estimation

The relative position estimation begins with the detection of a 2D bounding box in the image around the target spacecraft using the region proposal network [26]. The detected 2D bounding box not only effectively removes the background from the image before relative attitude estimation, but also makes the SPN method extensible to pose estimation for multiple objects or spacecraft components in the same image. The region proposal network used in the SPN method is an "off-the-shelf" object detection algorithm that takes the output of the five convolutional layers as its input. This network uses a sliding window approach on the output of the convolutional layers to produce region proposals based on predefined anchor boxes. Each anchor box is centered at the sliding window in question and is associated with a scale and an aspect ratio. Typically, 3 scales and 3 aspect ratios are used, yielding 3 × 3 = 9 anchor boxes at each sliding position. For each sliding window location, the region proposal network therefore outputs, for each anchor box, a probability (or score) of whether the target spacecraft is present and regresses the coordinates of the corresponding 2D bounding box. Finally, the 2D bounding box associated with the highest probability is provided as the output. The reader can find further implementation details in the original paper [26]. Even though the SPN method uses the region proposal network in its current implementation, it can be easily swapped with another state-of-the-art object detection algorithm [27] based on specific computational runtime, storage, and accuracy requirements.
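To make the anchor-box construction concrete, the following Python sketch generates the 3 × 3 = 9 anchors at a single sliding-window location. The specific scales and aspect ratios are illustrative placeholders, not the values used by the SPN implementation.

```python
import numpy as np

def make_anchors(center_x, center_y, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Generate len(scales) * len(ratios) anchor boxes (u1, v1, u2, v2)
    centered at a sliding-window location in the image plane."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)   # width grows with the aspect ratio
            h = s / np.sqrt(r)   # height shrinks so that the area stays s**2
            anchors.append([center_x - w / 2, center_y - h / 2,
                            center_x + w / 2, center_y + h / 2])
    return np.array(anchors)

# Example: nine anchors centered at pixel (400, 300)
print(make_anchors(400, 300).shape)  # (9, 4)
```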

The resulting 2D bounding box estimated by the region proposal network and the relative attitude estimated by Branches 2 and 3 of the SPN method are combined with geometrical constraints to estimate the relative position using the Gauss-Newton algorithm. The SPN method uses the fact that the perspective projection of the 3D wireframe model of the target spacecraft must fit tightly within the detected 2D bounding box.

Figure 4: Schematic of the projection of the 3D wireframe model of the target (pink) and the estimated bounding box (yellow) in the image plane.

As shown in Figure 4, the 2D bounding box can be parametrized by its top-left coordinates, $(u_1, v_1)$, and its bottom-right coordinates, $(u_2, v_2)$, defined in the image plane. Given the estimated relative attitude parametrized as a rotation matrix, $R$, the camera focal lengths, $(f_x, f_y)$, and the camera principal point, $(c_x, c_y)$, the 3D points defined in the body frame of the target spacecraft, $\mathbf{p}_i^B$, can be projected into the image plane using the perspective equation

$x_i = f_x \dfrac{\mathbf{r}_1^\top \mathbf{p}_i^B + t_x}{\mathbf{r}_3^\top \mathbf{p}_i^B + t_z} + c_x, \qquad y_i = f_y \dfrac{\mathbf{r}_2^\top \mathbf{p}_i^B + t_y}{\mathbf{r}_3^\top \mathbf{p}_i^B + t_z} + c_y$ (1)

where $\mathbf{r}_j^\top$ denotes the $j$-th row of $R$ and $\mathbf{t}_{BC} = [t_x, t_y, t_z]^\top$.

The constraint that the perspective projection of the 3D wireframe model fits tightly within the 2D bounding box requires that the four extremal projected points be as close as possible to the four edges of the 2D bounding box. Specifically, denote by $x_{\min}$ and $x_{\max}$ the horizontal coordinates of the "left-most" and "right-most" projected points, and by $y_{\min}$ and $y_{\max}$ the vertical coordinates of the "top-most" and "bottom-most" projected points. For example, $x_{\min}$ is the coordinate of the "left-most" projected point and it is constrained to be as close as possible to $u_1$, the coordinate representing the left edge of the 2D bounding box. Mathematically, the SPN method solves the following minimization problem subject to the constraint posed by Equation 1

$\min_{\mathbf{t}_{BC}} \; (x_{\min} - u_1)^2 + (y_{\min} - v_1)^2 + (x_{\max} - u_2)^2 + (y_{\max} - v_2)^2$ (2)
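The following Python sketch illustrates Equations 1 and 2: it projects body-frame points into the image plane for a candidate translation and evaluates the tight-fit residual against the detected bounding box. The function and variable names are illustrative, and a generic least-squares or Gauss-Newton solver can iterate on this residual.

```python
import numpy as np

def project(points_B, R, t, fx, fy, cx, cy):
    """Pinhole projection of 3D body-frame points into the image plane (Eq. 1)."""
    p_C = (R @ points_B.T).T + t            # rotate into the camera frame and translate
    u = fx * p_C[:, 0] / p_C[:, 2] + cx
    v = fy * p_C[:, 1] / p_C[:, 2] + cy
    return u, v

def tight_fit_residual(t, points_B, R, bbox, fx, fy, cx, cy):
    """Residual between the extremal projected points and the bounding box edges (Eq. 2).
    bbox = (u1, v1, u2, v2) with top-left and bottom-right corners."""
    u, v = project(points_B, R, t, fx, fy, cx, cy)
    u1, v1, u2, v2 = bbox
    return np.array([u.min() - u1, v.min() - v1, u.max() - u2, v.max() - v2])
```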

The minimization problem is solved using the Gauss-Newton algorithm, which requires an initial guess of $\mathbf{t}_{BC}$. The SPN method uses the detected 2D bounding box and the characteristic length of the 3D wireframe model to provide this initial guess.

Figure 5: Calculation of the relative position using the 2D bounding box.

Figure 5 shows that knowledge of the diagonal characteristic length, $L_B$, of the spacecraft 3D model and of the diagonal length of the detected 2D bounding box, $\ell_{bb}$, can be used to obtain this initial guess, $\mathbf{t}_0$. In particular, a coarse estimate of the distance to the target spacecraft from the origin of the camera frame is

$\|\mathbf{t}_0\| = f_x \dfrac{L_B}{\ell_{bb}}$ (3)

Assuming that the origin of the target spacecraft body frame, B, lies on the ray projected from the origin of the camera frame, C, towards the center of the detected 2D bounding box, $(u_{bb}, v_{bb})$, the azimuth and elevation angles, $(\beta, \epsilon)$, can be derived using $(f_x, f_y)$ and $(c_x, c_y)$

$\beta = \arctan\!\left(\dfrac{u_{bb} - c_x}{f_x}\right)$ (4)
$\epsilon = \arctan\!\left(\dfrac{v_{bb} - c_y}{f_y}\right)$ (5)

Finally, the coarse relative position used as the initial guess in the Gauss-Newton algorithm is

$\mathbf{t}_0 = \|\mathbf{t}_0\| \left[ \cos\epsilon\sin\beta, \;\; \sin\epsilon, \;\; \cos\epsilon\cos\beta \right]^\top$ (6)
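A minimal Python sketch of the coarse initialization in Equations 3 through 6 is given below, assuming a pinhole camera with focal lengths expressed in pixels and the camera z-axis along the boresight; the exact axis conventions of the SPN implementation may differ.

```python
import numpy as np

def coarse_position_guess(bbox, L_B, fx, fy, cx, cy):
    """Coarse relative position from the detected 2D bounding box (Eqs. 3-6).
    bbox = (u1, v1, u2, v2); L_B is the diagonal length of the 3D model [m]."""
    u1, v1, u2, v2 = bbox
    ell_bb = np.hypot(u2 - u1, v2 - v1)              # bounding box diagonal [pixels]
    rho = fx * L_B / ell_bb                          # coarse range via similar triangles (Eq. 3)
    u_bb, v_bb = 0.5 * (u1 + u2), 0.5 * (v1 + v2)    # bounding box center
    beta = np.arctan2(u_bb - cx, fx)                 # azimuth (Eq. 4)
    eps = np.arctan2(v_bb - cy, fy)                  # elevation (Eq. 5)
    # Unit ray through the bounding box center scaled by the coarse range (Eq. 6)
    return rho * np.array([np.cos(eps) * np.sin(beta),
                           np.sin(eps),
                           np.cos(eps) * np.cos(beta)])
```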

2.2 Relative Attitude Determination

The layers of a convolutional neural network transform the raw pixel information, $\mathbf{x}$, into gradually more abstract representations using predefined non-linear functions that contain unknown coefficients. The output of the network is then constrained to minimize a loss function that represents a discrepancy or error between the output of the network and the expected output for a set of training samples (supervised learning). For example, the network used in the SPN method can be represented by

$f(\mathbf{x}) = f_n(\,\cdots f_2(f_1(\mathbf{x}; W_1, \mathbf{b}_1); W_2, \mathbf{b}_2) \cdots ; W_n, \mathbf{b}_n)$ (7)

where $W_i$ are the unknown weights of the $i$-th layer, $\mathbf{b}_i$ are the unknown biases of the $i$-th layer, and $f_i$ are the known non-linear functions of the $i$-th layer. For relative attitude estimation, the network's output is expected to be values defined on a continuous non-Euclidean space, prohibiting the direct use of a typical L2 loss function. To handle this problem, other authors have proposed several loss functions [23, 28, 29] that directly regress to a relative attitude estimate; however, these have limited accuracy and require further refinement [30]. Instead, the SPN method uses two branches of fully-connected layers that share the output of the five preceding convolutional layers as their inputs. In order to be robust to the background in the images, the features output by the final convolutional layer associated with the detected 2D bounding box are selected using the RoI pooling layer [31]. Using these features as input, Branch 2 performs a classification task to find the $N$ closest predefined attitude classes that describe the input image. Branch 3 uses the same features as input to perform a regression task to find the relative weights of the attitude classes identified by Branch 2. For example, Figure 6 shows a simplified example where one dimension of the relative attitude has been discretized. In this case, Branch 2 is expected to identify attitude classes 4 and 5 as the two adjacent classes that best describe the input image, while Branch 3 is expected to find their relative weights as a function of the angular differences $\theta_4$ and $\theta_5$ associated with the corresponding attitude classes. The sizes of the five convolutional layers and of both sets of three fully connected layers have been adopted from the Zeiler and Fergus model architecture [32].

Figure 6: Visualization of the relative attitude discretization in a single dimension. Here attitude classes 4 and 5 are adjacent classes closest to the ground truth attitude.

The attitude classes are parametrized as $m$ unit quaternions representing uniformly distributed random rotations in $SO(3)$. Algorithm 1 shows the subgroup algorithm [33] used to obtain these random rotations. The algorithm amounts to multiplying a uniformly distributed element from the subgroup of planar rotations with a uniformly distributed coset represented by the rotations pointing the z-axis in different directions.

Input: Integer m
Output: m × 4 matrix quats
u1 ← rand(m, 1) // m uniformly distributed float values in [0, 1]
u2 ← rand(m, 1)
u3 ← rand(m, 1)
quats = [sqrt(1 - u1) .* sin(2π u2), sqrt(1 - u1) .* cos(2π u2), sqrt(u1) .* sin(2π u3), sqrt(u1) .* cos(2π u3)]
Algorithm 1 Computes m uniformly distributed random rotations parametrized as unit quaternions.
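For readers who prefer executable code, the following numpy rendering of Algorithm 1 samples uniformly distributed unit quaternions; the component ordering shown here is one possible convention.

```python
import numpy as np

def random_unit_quaternions(m, rng=np.random.default_rng()):
    """Sample m uniformly distributed rotations as unit quaternions (Algorithm 1)."""
    u1, u2, u3 = rng.random(m), rng.random(m), rng.random(m)
    return np.column_stack([np.sqrt(1 - u1) * np.sin(2 * np.pi * u2),
                            np.sqrt(1 - u1) * np.cos(2 * np.pi * u2),
                            np.sqrt(u1) * np.sin(2 * np.pi * u3),
                            np.sqrt(u1) * np.cos(2 * np.pi * u3)])

quats = random_unit_quaternions(1000)   # e.g., m = 1000 attitude classes
assert np.allclose(np.linalg.norm(quats, axis=1), 1.0)
```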

To determine which of the predefined attitude classes are the closest to the given image, Branch 2 is tasked to output an $m \times 1$ vector, $\mathbf{y}$, which represents a probability distribution. Each entry $y_i$ is the probability that the attitude class in question is one of the $N$ closest attitude classes to the relative attitude of the given image. Note that both $m$ and $N$ are hyper-parameters. Closeness is defined by the angular difference between the unit quaternion representing the attitude class, $\mathbf{q}_i$, and the unit quaternion representing the ground truth value of the relative attitude, $\mathbf{q}_{BC}$. To ensure that $\mathbf{y}$ is a valid probability distribution, the output of the final layer of Branch 2, $\mathbf{z}$, is passed through the softmax function

$y_i = \dfrac{e^{z_i}}{\sum_{j=1}^{m} e^{z_j}}$ (8)

Then the loss function for Branch 2, $\mathcal{L}_2$, representing the difference between the branch output and the expected probability distribution, $\mathbf{y}^*$, can be written as

$\mathcal{L}_2 = -\sum_{i=1}^{m} y_i^* \log y_i + \lambda \, \|\mathbf{W}_2\|_2^2$ (9)

where $\mathbf{W}_2$ represents the weights of the final three layers of Branch 2 and $\lambda$ is a scalar representing the strength of the L2 regularization. The aim of the L2 regularization is to penalize the existence of large weights and prevent overfitting to the training examples. Note that the entries of $\mathbf{y}^*$ corresponding to the $N$ closest attitude classes are set to $1/N$ while the rest of the entries are set to zero.
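A short sketch of the Branch 2 target construction and the loss of Equation 9 is shown below. The $1/N$ target value follows from requiring the target vector to be a valid probability distribution, while the regularization strength used here is an arbitrary placeholder rather than the value used in training.

```python
import numpy as np

def branch2_targets(angular_dist, N):
    """Target distribution y*: 1/N for the N attitude classes closest in angle, 0 elsewhere."""
    y_star = np.zeros_like(angular_dist, dtype=float)
    closest = np.argsort(angular_dist)[:N]
    y_star[closest] = 1.0 / N
    return y_star

def branch2_loss(logits, y_star, weights, lam=1e-4):
    """Cross-entropy between softmax(logits) and y*, plus L2 regularization (Eq. 9)."""
    z = logits - logits.max()                       # numerically stable softmax
    y = np.exp(z) / np.exp(z).sum()
    return -(y_star * np.log(y + 1e-12)).sum() + lam * (weights ** 2).sum()
```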

To estimate the relative attitude for the given image, Branch 3 is tasked to output an $m \times 1$ vector, $\mathbf{w}$, containing weights for the $N$ closest attitude classes identified by Branch 2. The weights for the remaining attitude classes are output but not used during either training or inference. The weights of the $N$ closest attitude classes are passed through the softmax function such that their sum adds to unity. Let $\mathcal{I}$ be the vector of indices of the $N$ largest values in $\mathbf{y}$; then the loss function for Branch 3 can be written as

$\mathcal{L}_3 = -\sum_{i \in \mathcal{I}} w_i^* \log w_i + \lambda \, \|\mathbf{W}_3\|_2^2$ (10)

where $\mathbf{W}_3$ represents the weights of the final three layers of Branch 3 and $\mathbf{w}^*$ is the ground truth vector of weights for the given image. The entries of $\mathbf{w}^*$ are set based on the angular difference between the quaternion of the attitude class in question and the ground truth quaternion of the given image. Specifically,

(11)

where $\theta_i$ is the angular difference between the unit quaternion representing the attitude class in question and the unit quaternion representing the ground truth of the relative attitude of the given image. Hence, the total loss function for relative attitude estimation can be written as

$\mathcal{L} = \mathcal{L}_2 + \mathcal{L}_3$ (12)

The classification loss, $\mathcal{L}_2$, is designed to force Branch 2's coefficients to be correlated with macroscopic changes in the relative attitude without penalizing classification predictions of attitude classes that have a small angular difference between them. In contrast, the regression loss, $\mathcal{L}_3$, is designed to force Branch 3's coefficients to be correlated with the microscopic differences between adjacent attitude classes.

During inference, the estimate of the relative attitude, $\hat{\mathbf{q}}_{BC}$, is computed as an average of the unit quaternions of the $N$ closest attitude classes, $\mathbf{q}_i$, weighted by their corresponding relative weights, $w_i$. This is akin to minimizing a weighted sum of the squared Frobenius norms of the differences between the rotation matrix representations of $\hat{\mathbf{q}}_{BC}$ and $\mathbf{q}_i$

$\hat{\mathbf{q}}_{BC} = \arg\min_{\mathbf{q} \in \mathbb{S}^3} \sum_{i \in \mathcal{I}} w_i \, \| R(\mathbf{q}) - R(\mathbf{q}_i) \|_F^2$ (13)

where $\mathbb{S}^3$ denotes the unit 3-sphere [34]. The pseudo-code used to compute the weighted average is presented below as Algorithm 2.

Input: M × 4 matrix Q of unit quaternions, M × 1 vector w of weights
Output: Vector q
A ← zeros(4, 4)
alpha ← 0
M ← length(Q, 1) // number of input quaternions
foreach i in range(M) do
    q_i ← Q(i, :)
    A ← w(i) .* (q_i' * q_i) + A
    alpha ← alpha + w(i)
end foreach
A ← (1.0 / alpha) * A

// Get the eigenvector corresponding to the largest eigenvalue

q ← eigs(A, 1)
Algorithm 2 Computes a unit quaternion that is an average of the input unit quaternions Q weighted by the input weights w.
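A compact numpy equivalent of Algorithm 2 is sketched below; the accumulated 4 × 4 matrix is the weighted sum of quaternion outer products, and its dominant eigenvector gives the weighted average quaternion.

```python
import numpy as np

def weighted_quaternion_average(Q, w):
    """Average of unit quaternions Q (M x 4) weighted by w (M,) -- Algorithm 2."""
    Q = np.asarray(Q, dtype=float)
    w = np.asarray(w, dtype=float)
    A = np.zeros((4, 4))
    for q_i, w_i in zip(Q, w):
        A += w_i * np.outer(q_i, q_i)      # accumulate weighted outer products
    A /= w.sum()
    eigvals, eigvecs = np.linalg.eigh(A)   # A is symmetric, so eigh applies
    q = eigvecs[:, np.argmax(eigvals)]     # eigenvector of the largest eigenvalue
    return q / np.linalg.norm(q)
```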

3 Spacecraft Pose Estimation Dataset (SPEED)

Training a convolutional neural network usually requires extremely large labeled image datasets such as ImageNet [35] and Places [36], which contain millions of images. Collecting and labeling such an amount of actual space imagery is extremely difficult. Therefore, this work introduces SPEED, a dataset of images that enables not only training and validation of the SPN method but also benchmarking of various state-of-the-art monocular vision-based pose estimation techniques. The SPEED images and the corresponding ground truth pose information are generated using two complementary sources at the Space Rendezvous Laboratory (SLAB) of Stanford University [37]. The first source produced 15,000 augmented reality images based on the Optical Stimulator [17, 38] camera emulator software and actual images of the Earth captured by the Himawari-8 geostationary meteorological satellite [39]. The second source produced 300 actual camera images of a 1:1 mock-up of the Tango spacecraft using the Testbed for Rendezvous and Optical Navigation (TRON). Table 1 lists the parameters of the camera model used for both sources.

Parameter Description Value
Number of horizontal pixels
Number of vertical pixels
Horizontal focal length m
Vertical focal length m
Horizontal pixel length m
Vertical pixel length m
Table 1: Parameters of the camera used to capture the SPEED images.

The first source renders synthetic images of the Tango spacecraft using MATLAB and C++ language bindings of OpenGL. To create a diverse set of views of the target spacecraft, a set of relative attitudes and relative positions is selected. Unit quaternion parametrizations of uniformly random rotations in $SO(3)$ are selected using Algorithm 1. Figure 10 shows the distribution of $\mathbf{q}_{BC}$ in the SPEED images, parametrized as Euler angles. The relative positions are obtained by separately selecting the relative distance and the bearing angles (defined in Equations 4 and 5). The bearing angles are uncorrelated random values selected from a multivariate normal distribution. The relative distance is randomly selected from a normal distribution, and any relative distance values below 3 meters or above 50 meters are rejected. Figure 7 shows the distribution of $\mathbf{t}_{BC}$ in the SPEED images.
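The sampling scheme can be sketched in Python as follows. The means and standard deviations of the distributions are not stated in the text above, so the values below are purely illustrative placeholders; only the 3 m and 50 m rejection bounds come from the description.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_relative_position(mu_rho=12.0, sigma_rho=8.0, sigma_bearing=0.1):
    """Draw one relative position: range from a normal distribution (rejecting values
    outside [3, 50] m) and uncorrelated bearing angles from a normal distribution.
    All distribution parameters here are illustrative placeholders."""
    while True:
        rho = rng.normal(mu_rho, sigma_rho)
        if 3.0 <= rho <= 50.0:
            break
    beta, eps = rng.normal(0.0, sigma_bearing, size=2)   # azimuth and elevation [rad]
    return rho * np.array([np.cos(eps) * np.sin(beta),
                           np.sin(eps),
                           np.cos(eps) * np.cos(beta)])
```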

Figure 7: The distribution of the relative position, $\mathbf{t}_{BC}$, in the SPEED images.
Figure 8: A montage of four of the 72 full-disk Earth images used to generate the background for half of the SPEED synthetic images.
Figure 9: A montage of six synthetic images from the SPEED training-set.

The azimuth and elevation angles of the solar illumination are specifically chosen to match the solar illumination in the 72 actual images of the Earth captured by the Himawari-8 geostationary meteorological satellite. Figure 8 shows a montage of these images. The 72 images each provide a full-disk view of the Earth and were taken 10 minutes apart over a period of 12 hours. Each of the images is converted to grayscale, cropped, and inserted as the background for half of the synthetic images of the Tango spacecraft. The location of the crop is selected at random from a uniform distribution spanning the Earth image. The size of the crop is selected to match the scale of the Earth when viewed through a camera (described in Table 1) located at an altitude of 700 km and pointed in the nadir direction. Gaussian blurring and white noise are added to all images to emulate the depth of field and shot noise, respectively. Figure 9 shows a montage of the resulting augmented reality images.
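A minimal sketch of the blur-and-noise degradation described above is given below using OpenCV and numpy; the blur sigma and the noise standard deviation are assumed values, not those used to generate SPEED.

```python
import numpy as np
import cv2

def degrade(image, blur_sigma=1.0, noise_std=5.0, rng=np.random.default_rng()):
    """Apply Gaussian blur (depth-of-field proxy) and additive white noise (shot-noise proxy)
    to an 8-bit grayscale image; the blur sigma and noise level are illustrative."""
    blurred = cv2.GaussianBlur(image, (0, 0), blur_sigma)     # kernel size derived from sigma
    noisy = blurred.astype(np.float32) + rng.normal(0.0, noise_std, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```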

The second source of the SPEED images is the TRON facility at SLAB. It consists of a 7 degrees-of-freedom robotic arm, which positions and orients a vision-based sensor with respect to a target object or scene. Custom illumination devices simulate Earth albedo and sunlight to high fidelity in order to emulate the illumination conditions present in space [40]. TRON provides images of a 1:1 mock-up model of the Tango spacecraft using a Point Grey Grasshopper 3 camera with a Xenoplan 1.9/17 mm lens, the same camera model as the one emulated by the first source. Calibrated motion capture cameras report the positions and attitudes of the camera and the Tango spacecraft, which are then used to calculate the "ground truth" pose of Tango with respect to the camera. Figure 11 shows a montage of the SPEED actual camera images.

Figure 10: The distribution of the relative attitude in the SPEED images. For purposes of visualization, the relative attitude is parametrized as Euler angles.
Figure 11: A montage of three actual camera images from the SPEED real test-set.

The SPEED training-set consists of 12,000 synthetic images from the first source, while the remaining 3,000 synthetic images and the 300 actual camera images from the second source are available as two separate test-sets. The motivation for excluding the actual camera images from the training-set is to evaluate the robustness and domain adaptation capabilities of the pose estimation techniques.

4 Experiments

The proposed SPN method was trained and tested using the SPEED images. For all experiments, the region proposal network, the convolutional layers, and the fully connected layers were pre-trained on the relatively larger ImageNet dataset [35]. Following that, the region proposal network and the fully connected layers were trained using 80% of the SPEED training-set while the remaining 20% was used for validation. The two sets of fully connected layers were trained jointly while the region proposal network was trained separately. Both training routines were carried out using stochastic gradient descent in batches of 16 images. The initial learning-rate was set to 0.003 with an exponential decay of 0.95 every 1000 steps. For attitude determination, the hyper-parameters $m$ and $N$ were set to 1000 and 5, respectively. During training, each image was resized to 224 × 224 pixels to match the input size of the Zeiler and Fergus model architecture [32].
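The learning-rate schedule described above (initial rate 0.003, decay 0.95 every 1000 steps, with stochastic gradient descent on batches of 16 images) can be sketched as a small Python helper. Whether the decay is applied as a staircase or continuously is an implementation detail not specified here, so both options are shown.

```python
def learning_rate(step, base_lr=0.003, decay=0.95, decay_steps=1000, staircase=True):
    """Exponential learning-rate schedule: base_lr decayed by `decay` every `decay_steps`."""
    exponent = step // decay_steps if staircase else step / decay_steps
    return base_lr * decay ** exponent

# Learning rate at the first few decay boundaries
for s in (0, 1000, 2000, 5000):
    print(s, learning_rate(s))
```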

For quantitative analysis of the performance, three separate metrics are reported. To measure the accuracy of the detected 2D bounding box, $B_{det}$, as compared with the ground truth 2D bounding box, $B_{gt}$, the Intersection-over-Union (IoU) metric is reported as

$\text{IoU} = \dfrac{\text{area}(B_{det} \cap B_{gt})}{\text{area}(B_{det} \cup B_{gt})}$ (14)
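A straightforward Python implementation of the IoU metric in Equation 14 for axis-aligned boxes is sketched below.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (u1, v1, u2, v2) -- Eq. 14."""
    u1 = max(box_a[0], box_b[0]); v1 = max(box_a[1], box_b[1])
    u2 = min(box_a[2], box_b[2]); v2 = min(box_a[3], box_b[3])
    inter = max(0.0, u2 - u1) * max(0.0, v2 - v1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```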

To measure the accuracy of the estimated relative position, $\hat{\mathbf{t}}_{BC}$, with respect to the ground truth relative position, $\mathbf{t}_{BC}$, the translation error for each image is calculated as

$\mathbf{E}_T = \left| \mathbf{t}_{BC} - \hat{\mathbf{t}}_{BC} \right|$ (15)

where the absolute value is taken element-wise.

To measure the accuracy of the estimated relative attitude, $\hat{\mathbf{q}}_{BC}$, with respect to the ground truth relative attitude, $\mathbf{q}_{BC}$, the attitude error for each image is calculated from the error quaternion

$\delta\mathbf{q} = \mathbf{q}_{BC} \otimes \hat{\mathbf{q}}_{BC}^{-1}$ (16)
$E_R = 2 \arccos\left( \left| \delta q_s \right| \right)$ (17)

where $\delta q_s$ denotes the scalar component of $\delta\mathbf{q}$.
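The two error metrics can be computed as in the following Python sketch; Equation 17 is evaluated here as twice the arccosine of the absolute inner product of the two unit quaternions, which is independent of the scalar-first or scalar-last convention.

```python
import numpy as np

def translation_error(t_gt, t_est):
    """Component-wise absolute translation error (Eq. 15)."""
    return np.abs(np.asarray(t_gt) - np.asarray(t_est))

def attitude_error(q_gt, q_est):
    """Rotation angle between two unit quaternions in radians (Eqs. 16-17)."""
    dot = abs(float(np.dot(q_gt, q_est)))          # |<q_gt, q_est>| handles the double cover
    return 2.0 * np.arccos(np.clip(dot, 0.0, 1.0))
```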
Metric SPEED synthetic test-set SPEED real test-set
Mean IoU (-) 0.8582 0.8596
Median IoU (-) 0.8908 0.8642
Mean $\mathbf{E}_T$ (m) [0.055 0.046 0.78] [0.036 0.015 0.189]
Median $\mathbf{E}_T$ (m) [0.024 0.021 0.496] [0.029 0.013 0.191]
Mean $E_R$ (deg) 8.4254 18.188
Median $E_R$ (deg) 7.0689 13.208
Table 2: Performance of the SPN pose estimation method on the synthetic and real SPEED test-sets.

Table 2 lists the mean and median values of the three performance metrics for the SPEED synthetic and real test-sets. The relative position and attitude errors improve upon those of other pose estimation methods applied to similar imagery in previous work [5, 17]. Notably, the 2D bounding box detection is less affected by the gap between the synthetic and real test-sets than the attitude estimation. The $\mathbf{E}_T$ values for the actual camera images are lower than those for the synthetic images because the relative distances in the actual camera images are comparatively much smaller than those in the synthetic images.

Figure 12: A montage of a few images with the 2D bounding box detections produced by the SPN method on the SPEED synthetic test-set.

Figure 12 shows the qualitative results of the 2D bounding box detection on the SPEED synthetic and real test-sets. The region proposal network used in the SPN method was successful in detecting the 2D bounding box in images regardless of whether the Earth is visible in the background. In general, the bounding box detection worked marginally better on actual camera images as compared to synthetic images since all actual camera images had a black background. The bounding box detection was successful even in cases with strong shadows as long as one side of the spacecraft was illuminated.

Figure 13: A montage of a few images from the SPEED synthetic test-set with inaccurate 2D bounding box detections.

However, the bounding box detection tended to fail in cases where the target spacecraft was in eclipse or closer than 5 meters. Figure 13 shows a few examples of both failure modes. At relative distances of less than 5 meters, the Tango spacecraft starts getting clipped by the image boundaries, leading to unpredictable behavior of the detection. In a docking scenario, this could potentially be resolved by performing pose estimation relative to the docking mechanism or another smaller fixture instead of the entire spacecraft. Note that the SPN method can be extended to detect multiple 2D bounding boxes corresponding to the fixtures of interest and to perform attitude estimation and position determination for each.

To further examine such trends, the three performance metrics are plotted against the relative distance. In particular, the images from the SPEED synthetic test-set were grouped into batches of 100 each according to their ground truth relative distance, $\|\mathbf{t}_{BC}\|$. The mean value of each performance metric was then plotted against the mean ground truth relative distance for each batch.

Figure 14: Mean IoU plotted against mean relative distance for the SPEED synthetic test-set. The shaded region shows the 25th and 75th percentile values.

Figure 14 shows the mean IoU values of the SPN method for the SPEED synthetic test-set plotted against the mean relative distance. The bounding box detection has the highest accuracy at ranges of 7 to 20 meters, while there is a sharp drop-off in performance at relatively close distances of less than 5 meters. In contrast, there is a gradual drop-off in performance at relatively far distances as the target spacecraft starts occupying too few pixels in the image plane to allow for an accurate 2D bounding box detection. The degradation in performance at larger separations is also attributable to the fact that the SPEED training-set contains more images at lower relative distances.

Figure 15: Mean $E_R$ plotted against mean relative distance for the SPEED synthetic test-set. The shaded region shows the 25th and 75th percentile values.

Figure 15 shows the mean $E_R$ values of the SPN method for the SPEED synthetic test-set plotted against the mean relative distance. The mean $E_R$ values are between 5 and 10 degrees for most of the relative distances. The range of relative distances where the attitude estimation has the highest accuracy is similar to that of the bounding box detection. This is expected since the region proposal network and the two sets of fully connected layers share the output of the same convolutional layers in the SPN method. Unlike the bounding box detection, the attitude estimation has a sharp drop-off in performance at both high and low relative distances. This is due to the presence of more outliers in the attitude estimation, as compared with the bounding box detection, at high relative distances.


Figure 16: Mean $\mathbf{E}_T$ plotted against mean relative distance for the SPEED synthetic test-set. The shaded region shows the 25th and 75th percentile values.

The estimated relative attitude for each image in the SPEED synthetic test-set was then combined with the corresponding 2D bounding box detection to produce estimated values of the relative position. Figure 16 shows the three components of the mean $\mathbf{E}_T$ values of the SPN method plotted against the mean relative distance. Note that the z-axis is aligned with the camera boresight direction while the x-axis and y-axis (lateral directions) are aligned with the image plane axes. The errors in the lateral directions were an order of magnitude lower than in the camera boresight direction, which implies that the bounding box detection was more successful at estimating the center of the bounding box than its size. Trends in the performance degradation of $\mathbf{E}_T$ mirror those observed for $E_R$ and IoU.

Figure 17: A montage of a few images with the pose solutions produced by the SPN method on the SPEED synthetic test-set.
Figure 18: A montage of a few images with the pose solutions produced by the SPN method on the SPEED real test-set.

Figure 17 and Figure 18 show qualitative results of the pose estimated by the SPN method on the SPEED synthetic and real test-sets, respectively. The probability distribution output by Branch 2 is also plotted for the respective images. In the future, this probability distribution could be used to set up a confidence metric to reject outliers, in addition to being used by a navigation filter that accumulates information from sequential images to provide a more accurate estimate at a high rate. Predictably, the peaks in the probability distribution are lower for the SPEED real test-set than for the SPEED synthetic test-set, since the convolutional neural network has overfit the synthetic images in the SPEED training-set to some extent. This could possibly be addressed by a stronger L2 regularization parameter and/or by augmenting the training-set with images containing randomized textures of the target spacecraft, similar to work in the areas of domain adaptation [41, 42] and domain randomization [43, 44].

5 Conclusions

This work introduces the SPN method to estimate the relative pose of a target spacecraft using a single grayscale image without requiring a-priori pose information. The SPN method makes novel use of a hybrid classification and regression technique to estimate the relative attitude. It leverages geometric knowledge of the perspective transformation and advances in 2D bounding box detection to estimate the relative position using the Gauss-Newton algorithm. This work also introduces SPEED, a publicly available dataset that allows for the training and validation of monocular pose estimation techniques. The SPEED training-set consists of 12,000 augmented reality images created by fusing synthetic images of a target spacecraft with actual camera images of the Earth. Apart from the 3,000 images of the Tango spacecraft in the SPEED synthetic test-set, 300 actual camera images of the same spacecraft are provided in the SPEED real test-set. The application of the SPN method to the SPEED synthetic test-set produces degree-level mean relative attitude errors and centimeter-level mean relative position errors, exceeding the performance of the conventional feature-based methods used in previous work. The pose estimation performance carried over to the actual camera images as well, albeit with slightly higher errors due to the gap between the synthetic images used during training and the actual camera images used for testing.

However, further work is required in a few directions. First, a complete assessment of how the SPN method compares against conventional feature-based approaches as well as more recent deep learning-based methods needs to be performed. The SPEED images and the associated performance metrics provide an excellent framework for carrying out this assessment. Second, data augmentation techniques and/or stronger regularization during training are required to bridge the performance gap of the SPN method between the synthetic and real test-sets. Third, the performance drop-off at relative distances where the target spacecraft is only partially visible needs to be addressed to allow for pose estimation during all stages of close proximity operations. Lastly, the SPN method needs to be embedded in flight-grade hardware to profile its computational runtime and memory usage.

6 Acknowledgments

The authors would like to thank the King Abdulaziz City for Science and Technology (KACST) Center of Excellence for research in Aeronautics & Astronautics (CEAA) at Stanford University for sponsoring this work. The authors would like to thank OHB Sweden, the German Aerospace Center (DLR), and the Technical University of Denmark (DTU) for the 3D model of the Tango spacecraft used to create the images used in this work. The authors would like to thank Nathan Stacey of the Space Rendezvous Laboratory at Stanford University for his technical contributions in generating ground truth pose information for the actual camera images of SPEED.

References