Active Depth Estimation: Stability Analysis and its Applications

03/16/2020 ∙ by Romulo T. Rodrigues, et al. ∙ KTH Royal Institute of Technology ∙ Universidade do Porto

Recovering the 3D structure of the surrounding environment is an essential task in any vision-based control scheme. This paper focuses on the theoretical properties of a class of Structure-from-Motion (SfM) schemes known as incremental active depth estimation. The term incremental refers to estimating the 3D structure of the scene over a chronological sequence of image frames, and active means that the camera is actuated so as to improve estimation performance. Starting from a known depth estimation filter, this paper presents a stability analysis of the filter in terms of the control inputs of the camera. By analyzing the convergence of the estimator using Lyapunov theory, we relax the constraints on the projection of the 3D point in the image plane when compared with previous results. Nonetheless, our method is capable of dealing with the camera's limited field-of-view constraints. The main results are validated through experiments with simulated data.







I Introduction

Structure-from-Motion (SfM) aims at recovering the 3D structure of the environment from a moving camera, assuming the motion of the camera and its intrinsic parameters are known. This is one of the most important modules in applications such as autonomous navigation [1], UAV flight control [2], robot hand-eye calibration [3], topographic surveying [4], and multi-robot relative pose estimation [5]. The SfM problem has been studied for the last three decades by robotics and computer vision researchers. Below, we categorize available solutions as geometric- vs. filtering-based methods and passive vs. active techniques.

Geometric-based techniques [6, 7, 8] often apply triangulation for estimating the depth of points from two or more different viewpoints. The frames do not need to be consecutive, and this approach is usually followed by an offline non-linear refinement such as bundle adjustment [9]. Geometric-based techniques provide accurate results but degrade under small-baseline camera displacements. On the other hand, filter- or incremental-based methods, such as [10, 11, 12], explicitly consider the dynamics of projected 3D points over a sequence of continuously acquired images. Incremental strategies focus on efficient computation and take advantage of the small continuous motions of the camera (small displacements). Moreover, incremental techniques aim at a robust estimation of the model uncertainties.

The works mentioned in the previous paragraph are passive, i.e., the camera motion is not exploited for the goal of mapping the 3D environment. In the last decade, some authors have studied the use of active vision techniques to assist structure-from-motion modules. The authors in [13] propose the use of 3D reconstruction goals in the control loop and apply the method to the reconstruction of 3D points, cylinders, straight lines, and spheres. In [14], the authors address an active strategy for tuning the transient response of a particular class of nonlinear observers that are well suited for active SfM problems. The technique is applied to a 3D point active SfM scheme. The framework was later used for the cases of cylinders and spheres (see [15]), 3D planes (in [16]), and 3D straight lines (see [17, 18]). There are also works on high-level controllers based on SfM. For example, [19] presents a method to actively ensure the presence of good features in a structure-from-motion module, and [20] proposes an optimal path planning framework that maximizes the visual information during navigation.

In this paper, we study the stability analysis for an incremental active SfM using point features. The goal is to understand under what conditions it is possible to obtain an online estimation of the unknown depth of a point feature, from any initial condition. We resort to the knowledge of the motion of the camera and the 2D image plane coordinates of the projected 3D point. Our work builds on top of the incremental depth estimator addressed in [14, 15], where some guarantees for its stability and how to maximize its convergence speed were studied. However, for a point feature, the asymptotic stability result in [14, 15] only holds if 1) the camera motion drives the projection of the point to the origin of the image frame, and 2) the depth (unknown parameter being estimated) is constant after a transient. As a consequence, some issues arise in practical applications. For example, in [21], the results of [14] are applied to the coupled depth estimation and visual servo control problem. The strategy strives to increase the convergence speed, but the convergence properties are not met. This results from the requirement of translating the projection of a point to the origin of the image frame, which conflicts with the visual servoing goal.

In our work, we take a step back to first analyze the camera actuation policies that provide asymptotic stability guarantees on the depth estimation of a single feature. In contrast to previous works with similar stability properties, we do not require the tracked feature to lie in (or visit) the origin of the image frame. Moreover, the unknown depth is not necessarily constant throughout the estimation process.

The next section presents the notations and background work. Section III presents the stability analysis of the active filter. Then, Sec. IV discusses its use in a single 3D point mapping application. Simulation results are shown in Sec. V, and Sec. VI concludes the paper.

II Preliminaries

This section presents notations and background work that support the remainder of this document.

II-A Notation

Scalars are written in lower-case letters and column vectors are typed in bold lower-case letters. A vector can be split into smaller pieces using the notation . Matrices are printed in upper-case letters, as are the coordinates of a 3D point.

II-B Background

Consider a camera moving freely in space and let $\mathcal{F}_c$ be the coordinate frame attached to the origin of the sensor. The camera observes a static 3D point described in $\mathcal{F}_c$ as $P = (X, Y, Z)$. Let $\mathbf{s} = (x, y) = (X/Z, Y/Z)$ be the projection of $P$ into the camera's normalized image plane and consider the change of variable $\chi = 1/Z$. Applying the new variables in the well-known optical flow equation [22] gives

$$\dot{x} = (x v_z - v_x)\,\chi + x y\,\omega_x - (1 + x^2)\,\omega_y + y\,\omega_z$$
$$\dot{y} = (y v_z - v_y)\,\chi + (1 + y^2)\,\omega_x - x y\,\omega_y - x\,\omega_z \quad (1)$$

where $\mathbf{v} = (v_x, v_y, v_z)$ and $\boldsymbol{\omega} = (\omega_x, \omega_y, \omega_z)$ are the camera linear and angular velocities described in $\mathcal{F}_c$. The dynamics of the system can be stated in compact form

$$\dot{\mathbf{s}} = f_m(\mathbf{s}, \boldsymbol{\omega}) + \Omega^T(\mathbf{s}, \mathbf{v})\,\chi \quad (2)$$
$$\dot{\chi} = f_u(\mathbf{s}, \chi, \mathbf{u}) = v_z \chi^2 + (y\,\omega_x - x\,\omega_y)\,\chi \quad (3)$$

where $\Omega(\mathbf{s}, \mathbf{v}) = [\, x v_z - v_x, \; y v_z - v_y \,]$ and $\mathbf{u} = (\mathbf{v}, \boldsymbol{\omega})$.

Given $\mathbf{s}$, we want to estimate the unknown depth described by $\chi$ (also denoted as the unmeasurable variable). For that, consider the following notation. The estimation variables are $\hat{\mathbf{s}}$ and $\hat{\chi}$. The respective estimation errors are $\mathbf{e}_s = \mathbf{s} - \hat{\mathbf{s}}$ and $e_\chi = \chi - \hat{\chi}$. The state estimation problem addressed here uses an observer similar to [14]:

$$\dot{\hat{\mathbf{s}}} = f_m(\mathbf{s}, \boldsymbol{\omega}) + \Omega^T(\mathbf{s}, \mathbf{v})\,\hat{\chi} + k_s \mathbf{e}_s$$
$$\dot{\hat{\chi}} = f_u(\mathbf{s}, \hat{\chi}, \mathbf{u}) + k_\chi\,\Omega(\mathbf{s}, \mathbf{v})\,\mathbf{e}_s \quad (4)$$

where $k_s, k_\chi > 0$ are the control gains. The corresponding estimation error dynamics are

$$\dot{\mathbf{e}}_s = \Omega^T(\mathbf{s}, \mathbf{v})\, e_\chi - k_s \mathbf{e}_s$$
$$\dot{e}_\chi = f_u(\mathbf{s}, \chi, \mathbf{u}) - f_u(\mathbf{s}, \hat{\chi}, \mathbf{u}) - k_\chi\,\Omega(\mathbf{s}, \mathbf{v})\,\mathbf{e}_s \quad (5)$$
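As a sanity check on this class of observers, the following minimal simulation implements a point-feature depth observer of the form used in [14, 15]. The motion profile, gains `k1`/`k2`, and initial conditions are illustrative assumptions, not the tuning used in this paper; with a constant lateral translation the inverse-depth estimate converges.

```python
import numpy as np

def simulate_observer(T=10.0, dt=1e-3, k1=8.0, k2=40.0):
    """Euler simulation of a point-feature depth observer.
    Gains k1, k2 and the motion profile are illustrative values."""
    P = np.array([0.2, -0.1, 2.0])   # static 3D point (X, Y, Z) in the camera frame
    v = np.array([0.3, 0.2, 0.0])    # constant linear velocity; v_z = 0 keeps Z constant
    chi_hat = 0.1                    # initial inverse-depth estimate (true value is 0.5)
    s_hat = P[:2] / P[2]             # initialize the feature estimate at the measurement
    for _ in range(int(T / dt)):
        s = P[:2] / P[2]             # measured normalized image coordinates
        Omega = np.array([s[0] * v[2] - v[0], s[1] * v[2] - v[1]])
        e_s = s - s_hat
        # Observer update; the rotational flow term vanishes since omega = 0.
        s_hat = s_hat + dt * (Omega * chi_hat + k1 * e_s)
        chi_hat = chi_hat + dt * (v[2] * chi_hat**2 + k2 * Omega @ e_s)
        P = P + dt * (-v)            # a static point drifts with -v in the camera frame
    return abs(chi_hat - 1.0 / P[2])

err = simulate_observer()
print(f"final inverse-depth error: {err:.2e}")
```

Note that the constant non-zero translation keeps the excitation term away from zero, which is exactly the kind of condition formalized in the next section.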
III Convergence of the Estimator

In this section we provide the stability analysis of the depth estimation filter. The goal is to provide guarantees for the convergence of the unmeasurable depth for recovering the 3D structure of the world given by .

Assumption 1.

The observed 3D point cannot lie behind the camera. Consequently, we restrict our analysis to the domain where is positive, that is, we assume .

This assumption has an explicit physical meaning: a camera cannot observe 3D points behind it, as that would require a negative depth.

Theorem 1.

Consider the estimator (4) for the dynamic system (2) under Assumption 1. The equilibrium point is stable and the estimation error converges to zero as provided that the following constraints hold simultaneously:

  1. ;

  2. ;

  3. ;

where , , and their time-derivatives are bounded signals.


Consider the Lyapunov function candidate


with , and its time-derivative


Substituting (5) in the previous equation:


By combining Assumption 1 and the input constraints stated in Theorem 1, we have that the three terms in the right-hand side of (9) are non-positive. Hence, and the equilibrium point is stable. We also conclude that , and therefore, that the signals and are bounded.

The critical case that precludes asserting asymptotic stability from (9) occurs when and , and consequently, . Let denote the nullspace of a matrix; then either because the feature lies in the origin of the image plane (), or because and simultaneously. For and , the second derivative of the Lyapunov candidate function is


As , , and (by definition) are bounded, the function is also bounded. Thus, is uniformly continuous and from Barbalat’s Lemma [23], we have that as . Now, for the asymptotic behaviour of when and , from (5) we have, as ,


The second equation states that the depth estimation error becomes a constant, but not necessarily zero. To show that indeed it will converge to zero, we first show that is uniformly bounded because its time derivative given by


is a function of bounded signals. Thus, since converges to the origin and is uniformly bounded, we conclude that as . Consequently, we have that


It must be the case that either or . If the function is persistently exciting for all time, then the depth estimation error converges to zero. The signal is persistently exciting if the integral


is positive definite . Hence, the persistency of excitation (PE) condition holds if


which is the case from condition (3) in Theorem 1.

Thus, one can now conclude that the equilibrium point is asymptotically stable. ∎
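The persistency-of-excitation condition at the heart of the proof can be checked numerically. The sketch below assumes a scalar unknown, so the PE integral reduces to the windowed energy of the excitation signal; the threshold and window length are illustrative values.

```python
import numpy as np

def is_persistently_exciting(Omega, dt, window, alpha):
    """Check the scalar-parameter PE condition: for every start time t,
    the windowed integral of Omega(tau) Omega(tau)^T stays above alpha."""
    n = int(window / dt)
    energy = np.array([Omega[k] @ Omega[k] for k in range(len(Omega))])
    sums = np.convolve(energy, np.ones(n), mode="valid") * dt
    return bool(np.all(sums >= alpha))

dt = 1e-2
t = np.arange(0.0, 10.0, dt)
# A signal that never vanishes (e.g., constant translation) is PE ...
Omega_good = np.stack([0.3 * np.ones_like(t), 0.2 * np.ones_like(t)], axis=1)
# ... while one that decays to zero eventually violates the condition.
Omega_bad = Omega_good * np.exp(-t)[:, None]
print(is_persistently_exciting(Omega_good, dt, 1.0, 0.05))  # True
print(is_persistently_exciting(Omega_bad, dt, 1.0, 0.05))   # False
```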

IV Constrained Active Depth Estimation

Any vision-based control scheme has to consider a key limitation of image sensors: their limited field of view. While tracking the projected 3D point (related to the unknown depth to be estimated), one needs to make sure the projection does not leave the image space. To achieve that, we have to include constraints on the motion of the camera. This section explores the theoretical stability guarantees derived in Sec. III for active depth estimation, while ensuring that the tracked projected point does not leave the image space.

To address the constraints on the camera motion, we introduce the continuous and smooth desired signal and define the tracking error


The signal is chosen such that the feature remains within the field of view of the camera during the depth estimation process. Assume that the feedback control law drives the tracking error to the origin, i.e., as . (For instance, if is constant, then the proportional controller , where , ensures the desired behaviour.) From inspection of (2), in addition to the camera's linear and angular velocities, depends on the unknown depth . Thus, it is only possible to shape the dynamics of up to an estimation error. That being said, the goal is to design a control law for such that tracks the signal , while:

  1. imposing the constraints stated in Theorem 1, to assure that the stability property holds;

  2. improving the performance of the estimator, by maximizing as defined in (16); and

  3. accounting for the kinodynamic constraints of the camera described by and , where and are the maximum linear and angular speeds of the camera, respectively.
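The proportional tracking law mentioned in the footnote above can be sketched with the classic point-feature interaction matrix from [22]. The matrix form and gain `k` below are textbook choices assumed for illustration, not necessarily the exact controller designed in this section.

```python
import numpy as np

def interaction_matrix(x, y, chi):
    """Textbook point-feature interaction matrix L with s_dot = L @ (v, omega);
    the true inverse depth chi is replaced by its estimate in practice."""
    return np.array([
        [-chi, 0.0, x * chi, x * y, -(1.0 + x * x), y],
        [0.0, -chi, y * chi, 1.0 + y * y, -x * y, -x],
    ])

def tracking_control(s, s_star, chi_hat, k=2.0):
    """Proportional law u = -k * pinv(L) @ (s - s*) for a constant reference s*."""
    L = interaction_matrix(s[0], s[1], chi_hat)
    return -k * np.linalg.pinv(L) @ (s - s_star)

# One-step sanity check: the commanded velocity decreases the tracking error.
s, s_star, chi_hat, dt = np.array([0.3, -0.2]), np.array([0.1, 0.1]), 0.5, 1e-2
u = tracking_control(s, s_star, chi_hat)
s_next = s + dt * interaction_matrix(s[0], s[1], chi_hat) @ u
print(np.linalg.norm(s_next - s_star) < np.linalg.norm(s - s_star))  # True
```

Such an unconstrained law does not, by itself, respect the three requirements enumerated above, which is what motivates the constrained formulation that follows.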

Since such constraints are typically not addressed when designing a control law to track the reference signal , simultaneously tracking it and respecting all the aforementioned constraints can lead to an infeasible problem. A workaround is proposed by introducing a scale factor such that is required to track the reference . As the depth converges, tracking the scaled vector – rather than minimizing a norm error – ensures that the path of the feature in the image frame follows the assignment specified by . This allows us to design a path for the feature that does not visit the origin of the image frame. The problem is formulated next:


This problem is addressed in two configurations. The estimation strategy proposed in Section IV-A does not implicitly impose the unknown depth to be constant. In contrast, Sec. IV-B addresses the particular case that requires a null depth rate. Both cases take advantage of the following theorem:

Theorem 2.

Consider the non-convex problem:


where , , and . The problem is always feasible if .

Due to space limitations, the reader is referred to [24] for the proof of Theorem 2 and a closed-form solution for the problem in (19), which is employed here. The solution does not impose restrictions on the feature coordinates, except at the origin of the image frame, i.e., , which is a singularity.

IV-A Case:

In this first scenario, and . This allows us to take advantage of Theorem 2, while still respecting the requirements for asymptotic convergence stated in Theorem 1. The PE condition of (16) simplifies to and its maximum attainable value is limited by the kinodynamic constraint of the camera, . Under this scenario, the problem in (18) can be formulated as


and solved with the following proposition:

Proposition 1.

Let the camera control input be


and , , and be defined as follows:


where , , and is a vector perpendicular to . In particular, define as


A sub-optimal solution for the problem in (20) can be obtained by casting it in the shape of the problem in (19), where the input variables are written as


and the outputs mapped into


First, we show that the control inputs are described as in (21). The constraint implies that . Combining with , the linear velocity vector can be written as , where is a unit vector. For the angular velocity, re-write the constraint using the slack variable , such that


From one concludes that . Applying this result into :


The column space of the first matrix on the right-hand side of the previous equation has dimension 1 and, consequently, it can be generated assuming . Thus, the following equivalence holds:


where . For , we conclude that any feasible angular velocity can be described as


where , , and are as defined in (22). Within this setup, the kinodynamic constraint is equivalent to :


This concludes the proof of (21) and (22).

Applying the control inputs into the first constraint of (20) and re-organizing the terms yields:


Let and notice that if , must be such that . Conversely, if , then one has to ensure . Maximizing the dot product in both cases requires and to be parallel. Both vectors are aligned if


where the symbol denotes that the relationship holds up to a scale factor. The matrix is orthogonal and, therefore, full rank. The matrix is also full rank since . From the Sylvester rank inequality, we have


Since both and are full rank matrices, one concludes that their product is also full rank (and invertible).

For feasibility, the first component of – corresponding to – must be non-positive. Solving the right hand side of (37), can be described as


If is positive in either case, it means that is the largest feasible value that maximizes the projection of into or . Using a compact notation:


where . For the maximum feasible value of , compute the norm of the previous equation and compare with (34). When , we have


The singular values of are 1 and . Since the maximum singular value is 1, the upper bound holds and


The same bound is obtained when in (40):


Finally, substituting (40) in (36):


which allows us to obtain a sub-optimal solution for the problem in (46) in the shape of the problem in (19) using the substitutions described by (48) and (49). ∎

The sub-optimality comes from the fact that the solution consists of projecting into when . The projection is done via the mapping . The singular values of are and . Therefore, if , there can exist a that is not projected into , but the shear transformation performed by allows for a higher value of . Since in practical applications , the solution obtained is not far from the optimal one. The main advantage of our approach is that it is possible to compute a direction for in closed form.

IV-B Case: and

Now, consider the specific scenario where the depth must be kept constant throughout the entire estimation process. For an unknown in (2), setting and guarantees that . Both aforementioned constraints are in accordance with Theorem 1. The problem, stated next:


is a particular case of problem (20). According to the following corollary, an optimal solution can be obtained using Theorem 2.

Corollary 1.

Let the camera control input be described as


Then, the problem in (46) is equivalent to the problem in (19), where


and the outputs are mapped as:


The proof is similar to the one presented in Sec. IV-A, imposing , that is, no slackness. In this case, the solution is optimal because the shear mapping is not involved.

V Experiments

The theoretical results derived in this work are validated using a numerical simulator. The following fixed parameters were employed:  m/s,  rad/s, , and . The sampling time of the simulations is  ms. In Fig. 1, we compare the methods proposed in Sec. IV-A and Sec. IV-B with the one presented in [14, 15]. For asymptotic stability, the strategy described in [14, 15] (continuous red line) and denoted here as Spica et al. (2014), requires the projection of the tracked 3D point to lie in the origin of the image plane and its corresponding depth to be constant, i.e., and . The method presented in Sec. IV-A (dashed green line) relaxes both requirements. The strategy described in Sec. IV-B (continuous blue line) is a particular case of the previous method which keeps the unknown depth constant throughout the trajectory of the camera. Aiming at a fair comparison, the initial visual servoing error and the inverse depth estimation error are the same in the three cases. The initial configurations are: and (with and ).

Fig. 1: Comparison of the estimation strategies described in [15] (Spica et al. 14), Sec. IV-A ( relaxed), and Sec. IV-B (). The initial inverse depth estimation error is and the initial tracking error is . From top to bottom: (a) the inverse depth estimation error, (b) the tracking error, (c) the persistency-of-excitation measure , and (d) the constraint described in Theorem 1.

The behaviour of the depth estimation error is almost the same for the three methods – see Fig. 1(a). In fact, as shown in Fig. 1(c), the three strategies continuously fulfill the PE condition, given by , at its maximum value. Figure 1(b) shows that the feature tracking error converges more slowly for the method described in Sec. IV-B. This is because the constraint imposes severe limitations on the angular velocity vector. Spica et al. (2014) guarantees asymptotic stability by driving the feature to the origin of the image frame, while the strategies proposed in this paper ensure that the constraints described in Theorem 1 hold throughout the entire estimation process regardless of the feature coordinates. In particular, the constraint associated with can be seen in Fig. 1(d). For the method in Sec. IV-A, is smaller than or equal to zero. For the method in Sec. IV-B, the constraint is always zero.

Fig. 2: True depth () and its estimation () using the strategy described in Sec. IV-A and the same setup as in Fig. 1

For the same scenario, Fig. 2 shows the ground truth and the depth estimate obtained with the method in Sec. IV-A. In contrast to other continuous estimation strategies presented in the literature (namely [15]), the method proposed in Sec. IV-A ensures that the depth estimation error converges to zero even though the depth of the point with respect to the camera is not constant throughout the estimation process.

In our formulation, the desired feature coordinate can be time-varying. Figure 3 shows a scenario where the goal is to have the projection of the feature moving in a circular pattern. More specifically, we define . As shown in Fig. 3(a), the speed of convergence of the depth estimation error does not change when compared to the previous case (constant ). Finally, Fig. 3(b) and (c) show that while the depth estimation converges, the proposed control law is able to follow the time-varying signal .
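A time-varying circular reference of the kind used in this experiment can be generated as follows. The radius, angular rate, and center below are illustrative values, not the exact parameters of the figure; by construction, the reference path keeps the feature away from the image origin.

```python
import numpy as np

def circular_reference(t, radius=0.1, rate=0.5, center=(0.0, 0.0)):
    """Desired feature coordinate s*(t) tracing a circle in the image plane,
    together with its feedforward derivative."""
    s_star = np.array([center[0] + radius * np.cos(rate * t),
                       center[1] + radius * np.sin(rate * t)])
    s_star_dot = np.array([-radius * rate * np.sin(rate * t),
                           radius * rate * np.cos(rate * t)])
    return s_star, s_star_dot

# With the circle centered at the origin, |s*(t)| = radius > 0 for all t,
# so the reference path never visits the image origin.
ts = np.linspace(0.0, 20.0, 2001)
norms = [np.linalg.norm(circular_reference(t)[0]) for t in ts]
print(np.allclose(norms, 0.1))  # True
```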

Fig. 3: Assessing the performance of the proposed depth estimation framework when the desired feature coordinate () is time-varying. (a) shows the depth estimation error, (b) shows the desired and the current projection of the 3D point in the image plane, (c) illustrates the trajectory of the camera (black line), the –axis (blue arrow), and the 3D point (black), and (d) shows the two previous signals over time, per axis.

VI Conclusions

In this paper, we analyzed the conditions required for asymptotic stability of a class of depth estimation observers when the control inputs of the camera are computed in an active manner. We applied the results to the depth estimation of a single 3D point. In contrast to previous works, our framework guarantees asymptotic stability even when the feature coordinates do not converge to the origin of the image frame and the depth with respect to the camera is not necessarily constant. We believe that relaxing the constraints on the feature coordinates within the image frame, while still providing asymptotic stability guarantees, is paramount for applying incremental depth estimation to multiple-point scenarios. Despite the relaxed constraints, which allow a larger set of motions with theoretical guarantees for depth estimation, the numerical simulations show that the proposed strategy performs similarly to related methods in the literature. In future work, we will extend our framework to multiple points and perform tests with a real robot/camera setup.


  • [1] C. F. Olson, L. H. Matthies, M. Schoppers, and M. W. Maimone, “Rover navigation using stereo ego-motion,” Robotics and Autonomous Systems (RAS), vol. 43, no. 4, pp. 215–229, 2003.
  • [2] N. H. M. Li and H. H. T. Liu, “Formation uav flight control using virtual structure and motion synchronization,” in American Control Conf. (ACC), 2008, pp. 1782–1787.
  • [3] N. Andreff, R. Horaud, and B. Espiau, “Robot hand-eye calibration using structure-from-motion,” The International Journal of Robotics Research (IJRR), vol. 20, no. 3, pp. 228–248, 2001.
  • [4] F. Clapuyt, V. Vanacker, and K. V. Oost, “Reproducibility of uav-based earth topography reconstructions based on structure-from-motion algorithms,” Geomorphology, vol. 260, pp. 4–15, 2016.
  • [5] R. T. Rodrigues, P. Miraldo, D. V. Dimarogonas, and A. P. Aguiar, “A framework for depth estimation and relative localization of ground robots using computer vision,” in IEEE/RSJ Int’l Conf. Intelligent Robots and Systems (IROS), 2019.
  • [6] J. J. Koenderink and A. J. van Doorn, “Affine structure from motion,” J. Opt. Soc. Am. A, vol. 8, no. 2, pp. 377–385, 1991.
  • [7] A. Bartoli and P. Sturm, “Structure-from-motion using lines: Representation, triangulation, and bundle adjustment,” Computer Vision and Image Understanding (CVIU), vol. 100, no. 3, pp. 416–441, 2005.
  • [8] J. L. Schönberger and J.-M. Frahm, “Structure-from-motion revisited,” in IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4104–4113.
  • [9] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, “Bundle adjustment — a modern synthesis,” in Vision Algorithms: Theory and Practice, 2000, pp. 298–372.
  • [10] J. Civera, A. J. Davison, and J. M. M. Montiel, “Inverse depth parametrization for monocular SLAM,” IEEE Trans. Robotics (T-RO), vol. 24, no. 5, pp. 932–945, 2008.
  • [11] A. D. Luca, G. Oriolo, and P. R. Giordano, “Feature depth observation for image-based visual servoing: Theory and experiments,” The International Journal of Robotics Research (IJRR), vol. 27, no. 10, pp. 1093–1116, 2008.
  • [12] A. Martinelli, “Vision and imu data fusion: Closed-form solutions for attitude, speed, absolute scale, and bias determination,” IEEE Trans. Robotics (T-RO), vol. 28, no. 1, pp. 44–60, 2012.
  • [13] F. Chaumette, S. Boukir, P. Bouthemy, and D. Juvin, “Structure from controlled motion,” IEEE Trans. Pattern Analysis and Machine Intelligence (T-PAMI), vol. 18, no. 5, pp. 492–504, 1996.
  • [14] R. Spica and P. Robuffo Giordano, “A framework for active estimation: Application to structure from motion,” in IEEE Conf. Decision and Control (CDC), 2013, pp. 7647–7653.
  • [15] R. Spica, P. Robuffo Giordano, and F. Chaumette, “Active structure from motion: Application to point, sphere, and cylinder,” IEEE Trans. Robotics (T-RO), vol. 30, no. 6, pp. 1499–1513, 2014.
  • [16] ——, “Plane estimation by active vision from point features and image moments,” in IEEE Int’l Conf. Robotics and Automation (ICRA), 2015, pp. 6003–6010.
  • [17] A. Mateus, O. Tahri, and P. Miraldo, “Active structure-from-motion for 3d straight lines,” in IEEE/RSJ Int’l Conf. Intelligent Robots and Systems (IROS), 2018, pp. 5819–5825.
  • [18] ——, “Active estimation of 3d lines in spherical coordinates,” in American Control Conf. (ACC), 2019, to appear.
  • [19] R. T. Rodrigues, M. Basiri, A. P. Aguiar, and P. Miraldo, “Low-level active visual navigation: Increasing robustness of vision-based localization using potential fields,” IEEE Robotics and Automation Letters (RA-L), vol. 3, no. 3, pp. 2079–2086, 2018.
  • [20] G. Costante, J. Delmerico, M. Werlberger, P. Valigi, and D. Scaramuzza, Exploiting Photometric Information for Planning Under Uncertainty.   Springer, 2018, vol. 1, pp. 107–124.
  • [21] R. Spica, P. R. Giordano, and F. Chaumette, “Coupling active depth estimation and visual servoing via a large projection operator,” The International Journal of Robotics Research, vol. 36, no. 11, pp. 1177–1194, 2017.
  • [22] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed.   New York, NY, USA: Cambridge University Press, 2003.
  • [23] J.-J. E. Slotine and W. Li, Applied nonlinear control.   Prentice-Hall, 1991.
  • [24] R. T. Rodrigues, P. Miraldo, D. V. Dimarogonas, and A. P. Aguiar, “On the Guarantees of Incremental Depth Estimation and its Applications in Visual Servoing (Proof of Theorem 2) – available here:,” Universidade do Porto, Tech. Rep., 09 2019.