I Introduction
The decentralized coordinated control of multiple mobile robots has several applications, including exploration [1], coverage [2], sampling and patrolling [3], search & rescue [4], and manipulation [5]. Achieving coordination within a decentralized architecture requires each agent in the formation to have some knowledge of the pose of its neighboring agents. The relative pose information allows to locally optimize a goal function, replan trajectories, and avoid collisions. However, for some mission scenarios such as mine caves or catastrophic areas, it is particularly challenging to obtain an accurate estimate of the relative pose due to the constrained communication channel bandwidth and impossibility to install or employ a highprecision positioning system, e.g., geolocation devices (GPS and Galileo) or Motion Capture systems. This paper aims at studying the problem of relative localization for two autonomous nonholonomic ground robots using 3D points observed by an onboard visual system (See Fig. 1). Since cameras observe 3D points up to an unknown scale, the formulation considers the problem of concurrently estimating the unknown depth of the 3D points.
In monocular computer vision, the mapping task aims at estimating the inherent 3D structure of the world from the 2D pixels in the image frame [6]. Considering two or more sequences of images taking from different views in a static environment, the depth of the points is computed by estimating the 3D points where the correspondent projection rays intersect. More sophisticated solutions have been proposed [7, 8, 9]. However, these works do not take into account the motion imparted to the camera. Instead it is considered a sequence of images captured at a known pose in the environment. Some authors study the estimation of depth maps from a single image (e.g. [10, 11, 12]). However, these works assume some high level knowledge of the environment. Recently, continuous 3D depth estimation has gained considerable attention (e.g. [13, 14, 15, 16, 17]). The goal is to minimize the estimation error of the 3D features over a continuous sequence of images ordered in time, by accounting for the dynamics of the camera. In contrast to other solutions that aim at algebraically estimating the unknown parameters as a fitting problem, continuous estimation considers the error dynamics and its convergence to the origin over time. This strategy benefits from: 1) being able to provide a metric about the quality of their estimates, e.g. by measuring the error’s covariance or convergence rate; and 2) having robust performance by continuously improving the estimation based on new views. These advantages are particularly important for image based visual servoing schemes [18, 19], in which the depth of the 3D features are fed in the control loop.
Relative pose estimation is another active topic in the robotics and computer vision communities. From [20], geometrically one can see that the minimum required number of 2D point correspondences is five. The number of required point correspondences is reduced even more if the motion is constrained in a plane. In this case, only two 2D point correspondences are required [21]. One of the disadvantages of these methods is that they estimate the translation parameters up to a scale. This means that one needs to get additional information about the observed environment to get the complete relative pose. Over the years, there have been many approaches to solve the relative pose problem, such as [22, 23, 24]. However, all of these methods were built on the assumption that the data (2D images) were discretely acquired over time, without any continuously time assumption. In [25], the authors propose a continuous perception strategy to estimate the relative localization of a set of robots in formation. Their method is based on the bearingrigidity of the formation, and requires that the at least a pair robots should be in mutual visibility. In our work, we also aim at estimating the relative pose of the robots as in [25].
The scenario considered in this paper includes two mobile robots, navigating autonomously in an unknown environment. Both robots are equipped with perspective cameras. While previous solutions for this problem consider a set of two or more images from the environment or use some special fleet configuration (e.g. the robots are in each others’ fields of view or have the ability of sensing the bearinginformation about each others’ positions), in this paper we propose a framework that shares a set of common observations of the environment in the respective local frame of each robot (3D point features are employed). A graphical representation of the problem is shown in Fig. 1
. In this scenario, as will be shown in the experimental results, the proposed solution is able to estimate the 3 degrees of freedom (DoF) relative pose between the two robots using 3D points which are being continuously estimated by the two robotic agents.
To compute the relative pose, we use the fact that the two agents are tracking in the image plane two point correspondences. The proposed pipeline estimates the depth of the points seen by both robots while computing their relative pose. In contrast to [21]
, by taking account a continuous estimation of the relative transformation, our method is able to get the complete relative pose (the translation is not estimated up to a scale). As the robots navigate in the environment, they will be estimating the depth of the points being tracked in their respective image frames using a depth estimation filter. Then, an Extended Kalman Filter (EKF) receives the estimated depths to estimate the relative pose of the robots.
This paper is organized as follows. The next section presents an overview of the system proposed in this paper. In Sec. III we describe the proposed method for depth and relative pose estimation. Results with simulated and real data are shown in Sec. IV, and in Sec. V we discuss and conclude the paper.
Ii System Overview
This section presents the problem studied in this paper, an overview of the proposed system architecture, and the basic mathematical notation that holds for the remainder.
The problem discussed here is twofold: 1) each agent has to compute the unknown depth of the points to recover the 3D structure of the world; and 2) each agent has to estimate the relative pose of its neighboring agent. The robot and camera configuration is shown in Fig. 2, and the proposed pipeline solution is shown in Fig. 3.
The cameras are mounted at the front of a nonholonomic vehicle, in which the optical axis of the sensor is aligned with the forward moving direction (see Fig. 2). Without loss of generality, we assume that the camera lies in the origin of the vehicle. The robot actuation, denoted as , consists of a forward and an angular speed command. This translates into a linear speed along the optical axis of the camera and an angular speed around the axis of the camera.
A depth estimation filter runs in each agent independently, recovering the depth of the 3D points. The depth filter, discussed in Sec. IIIA, is an particularization of [14], constrained for cameras mounted on nonholonomic planar vehicles. Then, an Extended Kalman Filter (EKF) gathers the motion of each camera and the coordinates of the observed points to estimate the relative pose between the vehicles. This requires that one vehicle sends its 3D points estimation and actuation commands to its neighboring vehicle through the network. This is discussed in Sec. IIIB. The EKF allows to capture the uncertainty inherent to the problem: depths converge online and measurements and actuation commands are subjected to noise.
The proposed pipeline includes two additional modules: image processing and network. The former is related with the detection and tracking of a 3D target on the image plane. For that purpose, one can use techniques such as SIFT [26], SURF [27], or ORB [28]. As a proof of concept, a color scheme detection is employed. It also eases the match of the correspondences between the points detected by the two robots. The network module enable the robots to communicate with each other. In practice, the vehicles are equipped with WiFi.
The mathematical notation adopted in this document is as follows: vectors are written in lower case bold font; matrices are written in upper case font; coordinate frames are denoted within curly brackets; and a subscript indicates the coordinate frame that a vector is described in (e.g.
stands for the control input of camera ).Iii Methods
This section presents the methods proposed for the simultaneous depth and relative pose localization of ground robots. We start by describing the used depth filter (Sec. IIIA) for a single vehicle/point. Then, the relative pose filter is presented (Sec. IIIB).
Iiia Depth Filter
This method is based on the framework proposed in [13], adapted to the motion of ground vehicles. Consider a camera moving freely in the 3D space, the dynamic of a point relative to the camera frame is
where and are the camera linear and angular velocity described in the camera frame. Since only the direction of the point is known, it is common practice to use the normalized image coordinates . The previous equation can be restated as
As discussed in Sec. II, the vehicle input consists in a forward speed () and an angular rate (). The forward direction of the vehicle is aligned with the optical axis of the camera, i.e., the linear velocity expressed in the camera frame satisfies and . The rotational axis of the vehicle is aligned with the y–axis of the camera such that and . The reader is referred to Fig. 2 for a graphical illustration of the coordination frames at stake. Substituting to obtain a linear expression and simplifying the previous equation by removing the components multiplied by , the following dynamic system is obtained:
where
(1) 
In order to estimate the unknown depth , the estimation variables and are introduced, as well as the associated estimation errors and . For that purpose, the following observer is considered:
(2)  
(3) 
where is a positive gain and are symmetric positive matrices with . Since for nonnull the camera moves along the optical axis, . For this setup, the dynamics of the error can be proven to be locally exponentially stable. For a formal proof of the convergence of error to the origin, the reader is referred to [13]. The basic idea consists in guarantee that a Persistency of Excitation conditions holds. More precisely, this requires that the matrix is full rank throughout time, that is
where and
is an identity matrix of appropriate dimension. In particular, for a single point, this can be done by ensuring that
. From (1), we haveTherefore, the filter converges, as long as the projected point is not in the center of the image frame, i.e., or must hold for convergence. Geometrically, this constraint is easy to verify: if and the camera moves along the –axis, the motion will be along along the projection ray meaning and it would be impossible to recover the depth (not enough information to do triangulation).
In the next section, the recovered 3D points are employed for online relative pose estimation.
IiiB Relative Pose Filter
This section presents relative pose filter, which takes as input the depth estimation and the control inputs of the robots. First, consider that each robot has a bodyfixed coordinate frame, denoted as for robot and for robot . The transformation from the bodyfixed to the camera fixed frame is known for each robot. Now, instead of a single point, the robots observe –points in the scenario. For each observed 3D point, each robot runs an instance of the depth estimation filter, as presented before. The pose of with respect to described in is denoted as , where is a 2D translation and an orientation angle. In accordance with the definition presented in Sec. II, the control input for each vehicle is defined as and . The list of notations is summarized in Table I.
Variable  Description 

,  Vehicles 
,  Bodyfixed coordinate frames 
,  Control input of camera A and B 
,  Normalized coordinates of the 3D points 
,  Depth of the 3D points 
Pose of w.r.t described in 
In order to derive the dynamic model of the relative pose between the robots, let and be the pose of robot and , respectively, described in an inertial coordinate frame. The dynamic of each robot is
(4) 
The relative pose between the two nonholonomic vehicles described in is given by
(5) 
where the rotation matrix will be soon defined. The dynamics of the relative pose is
(6) 
where ,
(7) 
Substituting (5) in (6) yields
(8) 
Finally, applying (4) and (7) in the previous equation leads to the dynamic model
(9) 
The output model considers the relationship between the pose of the robots and the observed 3D points in the bodyfixed frame. From the geometric perspective, the following equation holds for each of the –points:
for which the corresponding time derivative is
By replacing , it is possible to describe the relationship between the dynamics of 3D points, stated in (2) and (3), and the dynamics of the relative pose:
If vehicle periodically estimates and their dynamics, and receives the corresponding estimates from vehicle , the following observation model holds:
The relative pose filter consists in an Extended Kalman Filter that assumes that the input of the vehicles and the observations are perturbed by Gaussian noise with zero mean. First, in the prediction step, the prior is obtained according to (9)
Then, for each pair of measurements, the update model is:
where K is the Kalman gain and
is computed from the measurements received from the neighboring vehicle. The Kalman gain and the covariance matrix are updated following the regular EKF framework.
With this we conclude the relative pose filter description. This setup allows us to obtain an estimate of the unknown state. We stress that this method builds on the assumption that at least two 3D points correspondences are required to align two coordinate systems [29]. Although we focus on the critical case with only two points, in general, the EKF framework benefits from multiple observations and increase robustness against occlusions.
Iv Results
In this section, we discuss simulated (Sec. IVA ) and experimental results (IVB). Simulations were performed using the Robot Operating System (ROS) and Gazebo/Turtlebot simulator at [Hz]. Similarly, for the experimental results, ROS and two TurtleBots were employed. A footage of this experiment is attached.
Iva Simulation
The first simulated scenario illustrates the depth filter properties presented in Sec. IIIA. A single nonholonomic ground robot observes four 3D points. At , the points coordinates w.r.t. to the camera frame are , , , and . For each point, a filter was initialized with 1 [m] depth estimation error. The robot actuation is constant, set to 0.1 [m/s] and 0 [rad], that is, the robot moves straight ahead in the direction of the optical axis. Figure 4 shows the results for the three points using the same filter parameters. The estimation error converges to zero when the point does not rest at the origin of the image frame.
In fact, the further the point is from the center of the image, the faster is the convergence due to the larger value of . However, this also amplifies numerical errors and noise. Figure 4 at the right depicts the impact of different gains on the depth estimation, was selected. Larger values of in (2) slow down the convergence rate, while higher values of increase the convergence rate. Here, once again, the designer must take into consideration that faster convergence leads to large steadystate errors.
Figure 5 shows the performance of the proposed depth estimation and relative pose estimation framework (method proposed in Sec. IIIB) altogether. Only the points and from the previous experience were considered and an additional robot was inserted in the scenario. For each set of robot/point, the initial depth estimation error was set to [m] and the initial relative pose estimation error was set to [m] in each transnational axis and in orientation. The depth estimation error for each point follows a convergence behaviour similar to the one shown in Fig. 4. Therefore, as shown in Fig. 5, the translation estimation error converges to zero as the depth estimation error converges to zero. The same applies for the orientation error.
IvB Experimental Results
Experiments were conducted at the Smart Mobility Lab^{1}^{1}1https://www.kth.se/dcs/research/controloftransport/smartmobilitylab/smartmobilitylab1.441539 [accessed August 27, 2019]., from KTH. We employed two TurtleBots (differential drive robots) named here Robot_A and Robot_B, respectively. Each robot was equipped with a perspective camera pointing towards the motion of the robot. The scenario contained two 3D targets (red and blue targets). For evaluation purpose, a motion capture system installed at the ceiling of the arena provides ground truth for the poses of the robots and the 3D points. Fig. 6 shows the proposed setup. The following velocities were sent to the robots: = 0.1[m/s] & = 0.02[rad/s], and = 0.1[m/s] & = 0.03[rad/s], for Robot_A and Robot_B, respectively. Through all the trajectory, we make ensure that the robots are observing the target points (around 40 [s]). The depth filter parameters were and for both robots.
The results are shown in Fig. 7. The initial error for the depth filters were approximately 1 [m]. The depth of both points converge after [s] for Robot_A and [s] for Robot_B. The offset between the initial guess and the real values for the relative pose filter was 1.5 [m] in each of the translation axis and in orientation. As one can see, after a transient period, the errors converge to zero, proving that the proposed pipeline is able to correctly estimate the relative pose between the two robots for the evaluated setup.
V Discussion
In this paper we presented a framework for depth and relative pose estimation for two visionbased robots. The depth estimation filter was built on top of recent results on continuous structure from motion methods. In particular, we constrained the problem to a camera mounted on a nonholonomic ground vehicle. The depth estimation and robot input commands are fed into an Extended Kalman Filter that estimates the relative pose between the ground robots. Simulations and experiments in a real scenario shown the convergence of both filters for several scenarios and set of initial conditions. In future work, the active control of the robots will be considered as well as the interplay between the gains and the position of the point in the image frame.
Acknowledgement
The authors are grateful for the invaluable support provided by Pedro Roque (KTH) in the experimental setup.
References
 [1] D. Fox, W. Burgard, H. Kruppa, and S. Thrun, “A probabilistic approach to collaborative multirobot localization,” Autonomous Robots, vol. 8, no. 3, pp. 325–344, Jun 2000.
 [2] J. Cortes, S. Martinez, T. Karatas, and F. Bullo, “Coverage control for mobile sensing networks,” IEEE Trans. Robotics and Automation (TRA), vol. 20, no. 2, pp. 243–255, 2004.
 [3] A. Marino, G. Antonelli, A. P. Aguiar, A. Pascoal, and S. Chiaverini, “A decentralized strategy for multirobot sampling/patrolling: Theory and experiments,” IEEE Transactions on Control Systems Technology (TCST), vol. 23, no. 1, pp. 313–322, 2015.
 [4] J. S. Jennings, G. Whelan, and W. F. Evans, “Cooperative search and rescue with a team of mobile robots,” in Int’l Conf. Advanced Robotics (ICAR), 1997, pp. 193–200.
 [5] D. Rus, B. Donald, and J. Jennings, “Moving furniture with teams of autonomous robots,” in IEEE/RSJ Int’l Conf. Intelligent Robots and Systems (IROS), vol. 1, 1995, pp. 235–242.
 [6] R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge University Press, 2004.
 [7] R. MurArtal, J. M. M. Montiel, and J. D. Tardós, “ORBSLAM: A versatile and accurate monocular SLAM system,” IEEE Trans. Robotics (TRO), vol. 31, no. 5, pp. 1147–1163, 2015.
 [8] H. Oleynikova, Z. Taylor, M. Fehr, R. Siegwart, and J. Nieto, “Voxblox: Incremental 3D euclidean signed distance fields for onboard mav planning,” in IEEE/RSJ Int’l Conf. Intelligent Robots and Systems (IROS), 2017, pp. 1366–1373.
 [9] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE Trans. Pattern Analysis and Machine Intelligence (TPAMI), vol. 40, no. 3, pp. 611–625, 2018.
 [10] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese, “3DR2N2: A unified approach for single and multiview 3D object reconstruction,” in European Conf. Computer Vision (ECCV), 2016, pp. 628–644.

[11]
H. Fan, H. Su, and L. Guibas, “A point set generation network for 3D
object reconstruction from a single image,” in
IEEE Conf. Computer Vision and Pattern Recognition (CVPR)
, 2017, pp. 2463–2471.  [12] K. Huang, Y. Wang, Z. Zhou, T. Ding, S. Gao, and Y. Ma, “Learning to parse wireframes in images of manmade environments,” in IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2018, pp. 626–635.
 [13] R. Spica and P. R. Giordano, “A framework for active estimation: Application to structure from motion,” in IEEE Conf. Decision and Control (CDC), 2013, pp. 7647–7653.
 [14] R. Spica, P. R. Giordano, and F. Chaumette, “Active structure from motion: Application to point, sphere, and cylinder,” IEEE Trans. Robotics (TRO), vol. 30, no. 6, pp. 1499–1513, Dec 2014.

[15]
R. Spica, P. R. Giordano, and F. Chaumette, “Plane estimation by active vision from point features and image moments,” in
IEEE Int’l Conf. Robotics and Automation (ICRA), 2015, pp. 6003–6010.  [16] A. Mateus, O. Tahri, and P. Miraldo, “Active structurefrommotion for 3d straight lines,” in IEEE/RSJ Int’l Conf. Intelligent Robots and Systems (IROS), 2018, pp. 5819–5825.
 [17] R. T. Rodrigues, M. Basiri, A. P. Aguiar, and P. Miraldo, “Lowlevel active visual navigation: Increasing robustness of visionbased localization using potential fields,” IEEE Robotis and Automation Letters (RAL), vol. 3, no. 3, pp. 2079–2086, 2018.
 [18] F. Chaumette and S. Hutchinson, “Visual servo control. i. basic approaches,” IEEE Robotics Automation Magazine (RAM), vol. 13, no. 4, pp. 82–90, 2006.
 [19] R. Spica, P. R. Giordano, and F. Chaumette, “Coupling active depth estimation and visual servoing via a large projection operator,” The International Journal of Robotics Research (IJRR), vol. 36, no. 11, pp. 1177–1194, 2017.
 [20] D. Nister, “An efficient solution to the fivepoint relative pose problem,” IEEE Trans. Pattern Analysis and Machine Intelligence (TPAMI), vol. 26, no. 6, pp. 756–770, 2004.
 [21] D. Ortín and J. M. M. Montiel, “Indoor robot motion based on monocular images,” Robotica, vol. 19, no. 3, p. 331–342, 2001.
 [22] D. Scaramuzza and F. Fraundorfer, “Visual odometry [tutorial],” IEEE Robotics Automation Magazine (RAM), vol. 18, no. 4, pp. 80–92, 2011.
 [23] L. Kneip and H. Li, “Efficient computation of relative pose for multicamera systems,” in IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2014, pp. 446–453.
 [24] R. Ranftl and V. Koltun, “Deep fundamental matrix estimation,” in European Conf. Computer Vision (ECCV), 2018, pp. 292–309.
 [25] R. Spica and P. R. Giordano, “Active decentralized scale estimation for bearingbased localization,” in IEEE/RSJ Int’l Conf. Intelligent Robots and Systems (IROS), 2016, pp. 5084–5091.
 [26] D. G. Lowe, “Object recognition from local scaleinvariant features,” in IEEE Int’l Conf. Computer Vision (ICCV), vol. 2, 1999, pp. 1150–1157.
 [27] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool, “Speededup robust features (SURF),” Computer Vision and Image Understanding (CVIU), vol. 110, no. 3, pp. 346–359, 2008.
 [28] G. Bradski, K. Konolige, V. Rabaud, and E. Rublee, “ORB: An efficient alternative to SIFT or SURF,” in IEEE Int’l Conf. Computer Vision (ICCV), 2011, pp. 2564–2571.
 [29] P. Miraldo, S. Saha, and S. Ramalingam, “Minimal solvers for miniloop closures in 3D multiscan alignment,” in IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2019.