A key capability for on-orbit spacecraft servicing is accurate, real-time estimation of the position and orientation of the target spacecraft, as this allows the servicing spacecraft to automatically fine-tune its trajectory and timing. Such technology is also useful for close-formation flying [hadaegh-2001], precision formation flying (PFF) [hadaegh-2003, hadaegh-2004], active debris removal, and distributed space systems with planetary science applications [damico-2017, matsuka-2019]. In such missions, particularly in on-orbit assembly applications, vision-based sensors are well suited to estimating the pose of neighboring objects or spacecraft [sharma-2018, capuano-2019] when the target object is known but uncooperative. Monocular cameras have gained interest for in-space navigation in recent years [sharma-2018a, pedro-2019, capuano-2019a, capuano-2020, damico-2020] because these sensors are low-power, low-cost, small, and readily available, particularly for small satellites and CubeSats. In these settings, we want to determine the relative pose (position and attitude) of a target object with respect to the chaser (i.e., the camera). In this paper, we focus on relative pose estimation using a monocular camera for a known but uncooperative object.
Relative pose estimation is well studied for terrestrial tasks, particularly for indoor objects. A major challenge in terrestrial pose estimation is clutter and object occlusion [posecnn]. Relative pose estimation of objects in space, however, is a different problem. Clutter is less pronounced in space, where the visible foreground typically contains a single object, possibly accompanied by a few background distractors such as planets or stars. In-space visual conditions are nevertheless more challenging than those met on Earth: because there is no atmosphere, light diffusion is entirely absent. The lack of diffusion creates much stronger shadows: object surfaces are either exposed to the full power of incident sunlight or receive almost no light at all, resulting in extreme image contrast. Additionally, space hardware is subject to technological constraints such as radiation resilience, power consumption, and mass limits, which impact image resolution, sensor noise, and computational resources. Given these limitations, passive sensing is strongly preferred for pose estimation in space: active sensors such as light detection and ranging (LIDAR) and RADAR are massive and power-hungry, which makes them poorly suited to power-constrained space applications.
An on-orbit assembly task comprises autonomous state estimation and manipulation of external objects in a quasi-static environment. Because of the computational and power constraints discussed above, pose estimation using passive sensors such as a monocular camera is an important technology for mission-critical tasks. This technology will also be important for future missions such as DARPA's Phoenix program [phoenixdarpa] and NASA's Restore-L servicing mission [restoreL]. Recent efforts have demonstrated an assembly task using monocular vision-based tracking of a highly occluded object [surp, tracker] as a proof of concept. However, the tracker fell short in the initialization phase, where it must instantly localize the object in the image frame and lock onto it for further tracking. Feature-based tracking techniques are highly susceptible to inaccurate initialization caused by object occlusion. Instead of relying completely on a feature-based approach, a backup CNN-based pose estimator can be used as a corrective mechanism for feature-based tracking during the initialization phase. In this paper, we propose two transfer-learning-based models for pose estimation from monocular images. The models are not specific to a single type of spacecraft or object of interest and can thus be applied to other objects by training on a sufficiently large dataset of the new object models.
Our main contribution is modifying a convolutional neural network, initially designed for object recognition, to perform relative pose estimation of a known object, and improving its accuracy on that task. The network is trained on a synthetic dataset generated using the simulation testbed shown in fig. 2. It has been shown recently that convolutional neural networks (CNNs) trained solely on synthetic data exhibit improved performance on actual spaceborne images compared to existing onboard feature-based algorithms [tracker]. Specifically, CNNs are more robust to adverse illumination and the dynamic Earth background in spaceborne images. However, this has only been shown empirically to date, and there is no explanation of why an estimator succeeds in some situations and fails in others. The structure of this paper is as follows. Following this introduction, Section 1.1 reviews recent work in relative pose estimation for on-orbit assembly tasks, and Section 1.2 describes the problem that we are trying to solve. Section 2 describes the synthetic data generated for this study. Section 3 explains the design and implementation of the loss functions and CNN models. Section LABEL:sec:results then presents the results obtained on the dataset and compares them with other models. Finally, Section LABEL:sec:conclusions draws conclusions and discusses future work.
1.1 Relative Pose Estimation for On-orbit Assembly
On-orbit assembly of spacecraft or structures has been proposed several times in the past, motivated by factors such as the larger size of the assembled structure and deployment risk. In addition, space robotics has gained considerable attention in the last decade because of improvements in low-compute and space-grade hardware. According to [roadmaps], in May 2015 NASA updated many technology areas, such as Robotics and Autonomous Systems and Human Exploration Destination Systems, with the goal of developing in-orbit assembly technology [Zimpfer][Doggett_r]. Previously, robotic assembly of a space system was demonstrated at NASA Langley Research Center [langley]. However, that demonstration relied on complex vision sensing to complete the task, which makes it nearly impossible to implement on low-compute hardware. Nonetheless, in recent work, JPL designed an arm-augmented CubeSat [remora] to showcase hardware capable of autonomous in-orbit assembly. The CubeSat with its arm uses onboard passive sensing to obtain the relative pose of the object of interest, in this case a truss. This relative pose can then be used for visual-servoing-based control of the robotic arm to grasp and manipulate the object. Leveraging this design, the authors have designed the simulation testbed shown in fig. 1(b). More information about this testbed is given in Section 2, and the assembly framework for performing on-orbit assembly using relative pose estimation is illustrated in fig. 1.
1.2 Problem Statement
This paper addresses the tracking problem using CNN-based pose initialization. The emphasis is on designing a CNN model for relative pose estimation. The dataset used to train the CNN model is generated from a simulation of the assembly testbed. Given the promising applications of CNN-based pose estimation in future mission-critical tasks, an empirical study of the trained model is necessary. The authors have therefore analyzed the trained network on its best and worst predictions in order to understand which parts of the network learn better. Specifically, by looking at heatmaps of the convolutional layers, we can see where exactly the network focuses when making predictions, similar to [deep_features]. Furthermore, in an assembly task, real-time pose estimation and tracking of the object provide critical information for proper manipulation and grasping. Tracking may also be lost because of temporary occlusion of the object by the arm itself or partial visibility in the camera frame. In that case, reinitializing tracking by instantly estimating the object pose and latching onto the object is an important task. Future work will integrate this model with the tracker to test the performance of the assembly task on a real-world testbed with the framework shown in fig. 1.
In our previous work [surp], we demonstrated the ability to perform an assembly task using monocular vision-based tracking and the remora arm [remora] designed by JPL. That work showed successful real-time decision making using vision-based feedback and dynamic motion primitives on a low-compute platform. The simulation testbed designed for that work is used here to generate the synthetic dataset. Generating a set of images that are similar to real-world images can be challenging and costly, and is out of the scope of this work. Only a few academic facilities have such a capability, such as Stanford’s SLAB [slab] and Caltech’s CAST [cast]. For this work, we use a synthetically generated dataset that simulates the testbed under on-orbit conditions, as shown in fig. 2. In future work, real-world images of the object will be captured using the testbed shown in fig. 1(a) instead of synthetic ones. The following section describes the synthetic dataset generated for this work.
Training convolutional neural network models usually requires a large amount of training data, on the order of hundreds of thousands of images, e.g., [coco, openimage]. However, obtaining real-world images from space is challenging for the on-orbit assembly application. As a result, in this paper, we validate our framework using a synthetic dataset generated with the simulation testbed discussed previously. Fig. 4 shows a few synthetically generated images and the corresponding validated images. Our dataset is generated using a physics-based simulation platform, Gazebo [gazebo], within the Robot Operating System (ROS) [ros]. This simulation testbed can simulate on-orbit physical conditions such as the quasi-static behaviour of objects, illumination changes, and rigid-body dynamics in space. In addition, it can simulate a variety of sensors, such as monocular cameras, stereo cameras, and LIDAR, as well as actuators, which makes it an all-in-one framework for generating datasets and performing experiments.
In this dataset, a truss-shaped object with an asymmetric shape is used. With a symmetric object, two images taken 180 degrees apart would look identical yet carry different labels; the asymmetric shape avoids such ambiguous samples in the generated dataset. Using the simulated camera, the translation and orientation of the object of interest, in this case the truss, were obtained by carefully manipulating the object under simulated in-space lighting conditions. Furthermore, considering an in-space assembly task using a CubeSat, the truss measures approximately 0.38 × 0.2 × 0.05 meters and is rotated about the camera frame within the reachable distance of the arm. Images were initially captured at a resolution of 720 × 1280 pixels, and each image was labelled with the pose of the truss with respect to the camera frame. Once the dataset is generated, it is verified by reprojecting the labelled poses of the truss onto the image frame. Since the simulation gives access to the synthetic camera’s intrinsic parameters, the labelled pose is expressed with respect to the camera frame. The vertices of the object, obtained from its 3D CAD model, are then transformed to the labelled pose and projected onto the image using the quantities defined below. The reprojected images confirm the validity of the dataset, as shown in fig. 4.
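\[
\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = R \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} + t,
\qquad
z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix}
\]
where $K$ is the intrinsic matrix of the camera, $(x_c, y_c, z_c)$ are coordinates in the camera frame, $R$ and $t$ are the rotation and translation of the object, $(X, Y, Z)$ are the coordinates of the object’s vertices, and $(u, v)$ are the resulting pixel coordinates.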
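The reprojection check described above can be sketched in a few lines of Python. This is a minimal illustration under a standard pinhole camera model, not the authors' implementation: the function names, the intrinsic values, and the example pose are all hypothetical, and the principal point assumes an image 1280 pixels wide and 720 pixels high.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def mat_vec(m: Sequence[Sequence[float]], v: Sequence[float]) -> Vec3:
    """Multiply a 3x3 matrix by a 3-vector."""
    return tuple(sum(m[r][c] * v[c] for c in range(3)) for r in range(3))


def project_vertices(K, R, t, vertices) -> List[Tuple[float, float]]:
    """Project object-frame vertices to pixel coordinates (pinhole model)."""
    pixels = []
    for vertex in vertices:
        rotated = mat_vec(R, vertex)
        # Object frame -> camera frame: apply rotation, then translation.
        cam = tuple(rotated[i] + t[i] for i in range(3))
        if cam[2] <= 0.0:
            raise ValueError("vertex lies behind the camera")
        # Camera frame -> homogeneous pixel coordinates, then divide by depth.
        u_h, v_h, w = mat_vec(K, cam)
        pixels.append((u_h / w, v_h / w))
    return pixels


# Hypothetical intrinsics: focal length 1000 px, principal point (640, 360).
K = [[1000.0, 0.0, 640.0],
     [0.0, 1000.0, 360.0],
     [0.0, 0.0, 1.0]]
# Hypothetical labelled pose: no rotation, object 1 m in front of the camera.
R = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
t = (0.0, 0.0, 1.0)
# Two illustrative vertices (meters); a real check would use the CAD model.
vertices = [(0.0, 0.0, 0.0), (0.19, 0.10, 0.0)]

pixels = project_vertices(K, R, t, vertices)
for (u, v) in pixels:
    print(f"u = {u:.1f}, v = {v:.1f}")
```

Drawing the projected vertices over the rendered image and checking that they land on the truss is one way to confirm that a label and its image agree.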
The models were tested on the dataset described above. The images in the dataset were scaled down to 224 × 224 pixels to be consistent with ImageNet [imagenet], because learning from ImageNet was transferred to each model. Each image had a corresponding label consisting of seven numbers representing the position and orientation of the satellite with respect to the camera. The position is given by the distance in meters along the x, y, and z axes, while the rotation is specified by the other four numbers, which form a unit quaternion. To constrain the model to predict only unit quaternions, each predicted quaternion was normalized by dividing every component by the quaternion’s magnitude, both when calculating the loss during training and when determining the model’s overall accuracy during testing. Quaternions were chosen over rotation matrices and Euler angles to represent the orientation of the satellite because quaternions have less redundancy than rotation matrices while still avoiding gimbal lock, to which Euler angles are highly susceptible.
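The normalization step can be illustrated with a short Python sketch. It assumes the seven outputs are ordered as three position values followed by four quaternion components; that ordering, the function names, and the angular error measure at the end are assumptions made for illustration, since the text does not specify how orientation accuracy is computed.

```python
import math
from typing import Sequence, Tuple


def normalize_quaternion(q: Sequence[float]) -> Tuple[float, ...]:
    """Divide each component by the quaternion's magnitude."""
    norm = math.sqrt(sum(c * c for c in q))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero quaternion")
    return tuple(c / norm for c in q)


def split_pose(output: Sequence[float]):
    """Split a raw 7-number output into position and unit quaternion."""
    if len(output) != 7:
        raise ValueError("expected 7 values: 3 position + 4 quaternion")
    position = tuple(output[:3])
    quaternion = normalize_quaternion(output[3:])
    return position, quaternion


def angular_error_deg(q_pred: Sequence[float], q_true: Sequence[float]) -> float:
    """Rotation angle between two orientations (assumed metric, in degrees)."""
    a = normalize_quaternion(q_pred)
    b = normalize_quaternion(q_true)
    # abs() treats q and -q as the same rotation; min() guards acos's domain.
    dot = min(1.0, abs(sum(x * y for x, y in zip(a, b))))
    return math.degrees(2.0 * math.acos(dot))


# Hypothetical raw network output: the quaternion part is not unit length.
raw_output = [0.10, -0.05, 0.60, 2.0, 0.0, 0.0, 0.0]
position, quaternion = split_pose(raw_output)
print(position, quaternion)
```

Because the same normalization is applied during training and testing, the loss and the reported accuracy are both computed on valid rotations even though the final layer has no activation function.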
3 Relative Pose Estimator Models
Two similar CNN models were constructed to estimate the satellite pose. As in [sharmacnn], transfer learning was used with weights obtained from training on the ImageNet dataset [imagenet]. Unlike that work, however, the models in this paper are based on VGG-19 [vgg] rather than AlexNet [alexnet] because of the superior classification accuracy of VGG-19 on the ImageNet dataset. Since the task is to estimate seven continuous numbers rather than perform the 1000-way classification of ImageNet, the last layer of each model was replaced by a 7-node layer with no activation function.
The models use transfer learning with most of the weights frozen for three main reasons. The first is that ImageNet and the space-related dataset are somewhat similar, in that both consist of color images of one main object, so transfer learning can provide a good initialization and speed up training by transferring some of the knowledge of how low-level features are extracted and combined into more complicated features. The second reason is that the space-related dataset is relatively small, and freezing many of the weights so that they are not updated during training helps limit overfitting to the training data. The final reason is that [yosinski] showed that transfer learning can boost performance even after the original weights have had significant fine-tuning through training on the new dataset.
One key distinction between object classification and pose estimation is what the model should be robust against. Object classification models should predict the same labels even if the objects are slightly rotated or moved, while pose estimation models need to produce a slightly different output in that case. Both types of models, however, should be robust against random noise, different backgrounds, and varying illumination conditions. In many CNN architectures for object recognition, the robustness against small translations is in part due to the max pooling layers. Both models created for this work were designed to limit the feature location information lost due to the translational and rotational robustness of VGG-19. The architectures of the two models developed for this paper are shown in fig. LABEL:fig:branched_model and fig. LABEL:fig:cnn_arch_parallel.
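The effect of max pooling on location information can be seen in a toy example. The sketch below uses a one-dimensional signal and a window of two as a simplified stand-in for the 2 × 2 max pooling in VGG-19; the function name and the signals are hypothetical.

```python
from typing import List, Sequence


def max_pool_1d(signal: Sequence[float], size: int = 2) -> List[float]:
    """Non-overlapping max pooling: window length and stride both equal size."""
    return [max(signal[i:i + size])
            for i in range(0, len(signal) - size + 1, size)]


# A single strong feature response at index 2.
feature = [0, 0, 9, 0, 0, 0, 0, 0]
# The same response moved by one sample, still inside the same pooling window.
shifted_by_one = [0, 0, 0, 9, 0, 0, 0, 0]
# The response moved by two samples, into the next pooling window.
shifted_by_two = [0, 0, 0, 0, 9, 0, 0, 0]

print(max_pool_1d(feature))         # [0, 9, 0, 0]
print(max_pool_1d(shifted_by_one))  # [0, 9, 0, 0]
print(max_pool_1d(shifted_by_two))  # [0, 0, 9, 0]
```

The first two outputs are identical, so a shift within a pooling window is invisible to later layers. That is helpful for classification, but it discards exactly the kind of small positional change a pose estimator has to report.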