I Introduction
Sensing the spatial particulars of a real-world scene and inferring information about it from images is a classical problem in robotic vision, with uses ranging from motion planning and situational awareness to medical imaging [1, 2, 3]. That said, reconstructing a complex 3D scene from 2D images is difficult because of the many uncertainties that must be accounted for in real-world scenarios. Although much progress has been made over the last few decades, reconstruction methodologies often fail as a result of imaging artifacts including, but not limited to, noise, occlusions, clutter, and nonuniform illumination. In short, no universal algorithm exists that works seamlessly across all image modalities [4]. To combat such complexities, there is a need for a domain expert or operator who can provide an estimate of the ideal result and subsequently verify the quality of the reconstruction. Here, we aim to “inject” 2D operator inputs in-loop to drive a (multi-agent) 3D surface deformation while ensuring the resulting system is stable in the sense of Lyapunov [5].
While this work builds on our previous work in image segmentation [4] and reconstruction [1], there are a few subtle yet important distinctions. First, we show that 2D operator inputs on a given set of images can be “mapped” to the 3D world and that such inputs are stable. Mathematically, this is not a trivial issue, as any input on a 2D background should also be corroborated by a 3D action on an infinitely large (“blue sky”) background (e.g., specifying the 3D action location based on a 2D background input is ill-posed). From a stability perspective, such singular 2D actions affect not only the 3D surface deformation but also, indirectly, the other 2D passive sensors via 3D-to-2D projections during the reconstruction process. Second, the control laws are developed in part based on a notion of absolute principal curvature, which is a main underlying theme of this work (i.e., the confluence of geometry and control). Third, curvature can be shown to relate to a notion of “trust” in the sense of how quickly our reconciled solution converges from both the operator and autonomous perspectives. This will be treated in detail in future work, but is presented here to place this work and its contributions in context. We now briefly revisit a few techniques as they pertain to this work.
I-A Brief 3D Reconstruction Literature Review
Most modern scene reconstruction methods use popular deep (reinforcement) learning variants and are often characterized by the need for massive numbers of training samples [6, 7]. Examples range from ScanNet [6], which uses over 2.5 million views to train a system that can understand indoor scenes, to [7], where the authors furnish a synthetic dataset in order to develop an understanding of surface normal prediction, semantic segmentation, and object boundary detection. Generally, such schemes are highly dependent on the quality of the training data. To combat this, [8] explores weak supervision as an alternative to expensive 3D annotation, employing perspective projection and backpropagation. On the other hand, such methods use local correspondence matching and hence are susceptible to drawbacks resulting from scene abnormalities (e.g., noise, nonuniform illumination [9]). With regard to robotic vision, such correspondence-based solutions generally involve the well-known concept of SLAM (Simultaneous Localization and Mapping) [10, 11, 12]. That said, SLAM-based methods traditionally require high computational power to sense a sizable area and to process the resulting data for both mapping and localization. There is also a tacit requirement that input scene images overlap from image to image. To this end, SFM (Structure From Motion) based methods provide a relaxed version of this problem [13, 14] (e.g., Google uses this approach in its popular Street View application on Google Maps [15]). More recently, [16] explores a recurrent neural network (3D-R2N2) that employs shape priors, learning a 2D-to-3D mapping from images of objects to their underlying 3D shapes from large collections of synthetic data. In particular, the authors show their method outperforming SLAM or SFM (albeit with learned knowledge) when there is a lack of texture or baseline.
Nevertheless, this paper does not argue the merits of the underlying reconstruction method itself; our particular focus on our previous work [1] is due in part to its correspondence-free formulation, its independence from local (image-gradient) structure, and its dependence on geometric techniques connected to image segmentation [17, 21, 22]. Undoubtedly, each approach, whether SLAM-based, a deep (reinforcement) learning variant, or a geometric method, works best with respect to its prospective operating environment (e.g., space, with low-power requirements compared to ground-based robotic vision). At the same time, no such reconstruction is immune to errors that arise in real-world dynamic scenes from a human-perception standpoint. That said, human perception is also fallible, and any operator input based on a visual estimate is prone to errors. Philosophically, we make the argument that terms such as overfitting and uncertainty are, in part, perceived by an expert who generally acts as a passive entity in such methods. Thus, the problem we seek to resolve is not only to rectify the expected and ideal reconstruction in real time [23], but also to provide the necessary feedback control characterization when invoking operator input [24].
The remainder of the paper is organized as follows. In the next section, we introduce stereoscopic reconstruction via classic image segmentation. Section III then provides a control framework along with the necessary conditions for stability. Section IV presents experimental results. We conclude with future work in Section V.
II From Segmentation to 3D Reconstruction
This section presents a general introduction to geometric stereoscopic segmentation.
II-A Geometric 2D Image Segmentation
Let us begin with the classic binary problem of segmenting an image into a foreground and background described by functionals , and , which measure the similarity of the image pixels with a statistical model over the regions and , respectively. Here, corresponds to the photometric variable of interest. Then, one can define a partitioning problem where the optimal partition between foreground and background is described by a partial differential equation [22, 26]; i.e.,
(1)
where can be considered “forces” along the curve (partition boundary) that describe the direction of the corresponding evolution in the normal direction. While a complete review of such methodology is beyond the scope of this note, we refer the reader to several seminal references [17, 21]. For the case of image segmentation, it suffices to understand that the partitioning curve “lives” in the 2D image domain.
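As a rough illustration of the region “forces” described above, the following Python sketch applies the mean-separable (piecewise-constant) model of [21] to a tiny image. It is a simplified sketch under stated assumptions rather than the implementation used in this work: the partition is stored as a boolean mask instead of a level-set function, the curvature regularization term is omitted, and the function names (`region_means`, `region_force`, `update_partition`) are hypothetical.

```python
def region_means(image, mask):
    """Mean intensity inside (mask True) and outside (mask False) the partition."""
    inside = [v for row, mrow in zip(image, mask) for v, m in zip(row, mrow) if m]
    outside = [v for row, mrow in zip(image, mask) for v, m in zip(row, mrow) if not m]
    c_in = sum(inside) / len(inside) if inside else 0.0
    c_out = sum(outside) / len(outside) if outside else 0.0
    return c_in, c_out


def region_force(image, mask):
    """Per-pixel force: positive where a pixel is closer to the foreground mean."""
    c_in, c_out = region_means(image, mask)
    return [[(v - c_out) ** 2 - (v - c_in) ** 2 for v in row] for row in image]


def update_partition(image, mask, iterations=10):
    """Relabel pixels by the sign of the force until the partition stops changing.

    Assumption: no smoothness (curvature) term, so this is a thresholded
    caricature of the PDE evolution, not the evolution itself.
    """
    for _ in range(iterations):
        force = region_force(image, mask)
        new_mask = [[f > 0 for f in row] for row in force]
        if new_mask == mask:
            break
        mask = new_mask
    return mask
```

Starting from a rough mask that overlaps a bright object on a dark background, `update_partition` settles on the object pixels once the two region means separate.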
II-B Stereoscopic 3D Reconstruction
Now, if we consider the problem of 3D reconstruction from 2D images, one can redefine the functional in equation (1) as follows:
(2) 
where the difference is that the functional now depends on image observations and where a particular 2D image silhouette curve is derived from a single 3D occluding curve (with a slight abuse of notation) on a given smooth surface in with a corresponding 3D background treated as an infinitely large sphere with angular coordinates . That is, where is the realization of the th pinhole camera (sensor) that projects the 3D world onto the 2D domain. Similarly, the background can be related in a one-to-one manner with the image coordinates of each observation through the mapping (“blue sky” assumption). To be more precise, is the surface coordinates of in and further note that denote the same points expressed in th calibrated camera coordinates relative to the th image. Moreover, is the aforementioned perspective projection due to the th pinhole camera . In turn, and are redefined to be radiance functions. That is, the foreground object of interest supports a radiance function of : with the usual area element . Similarly, the background supports a different radiance function : . As such, for a given 3D surface, it is possible to partition each image domain of into a foreground object region and the corresponding background region . Note that the operator is not one-to-one and hence is noninvertible. However, we can define a back-projection operator using the back tracing of rays from the image to the surface, i.e., we have which is a pseudo one-to-one operation.
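The non-invertibility of the perspective projection, and the ray-tracing back projection that works around it, can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes an ideal pinhole camera at the origin looking down the positive z-axis, uses a sphere as a stand-in for the smooth surface, and assumes the camera lies outside that sphere. The names `project` and `back_project_to_sphere` are hypothetical.

```python
import math


def project(point, focal=1.0):
    """Ideal pinhole projection of a 3D point (camera coordinates) to the image plane."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    return (focal * x / z, focal * y / z)


def back_project_to_sphere(pixel, center, radius, focal=1.0):
    """Trace the ray through `pixel` and return its first hit on a sphere.

    Returns None when the ray misses the surface, i.e. the pixel sees the
    ("blue sky") background. Assumes the camera center is outside the sphere.
    """
    d = (pixel[0], pixel[1], focal)
    norm = math.sqrt(sum(c * c for c in d))
    d = tuple(c / norm for c in d)
    # Solve |t*d - center|^2 = radius^2 for the smallest positive t.
    b = sum(di * ci for di, ci in zip(d, center))
    c = sum(ci * ci for ci in center) - radius ** 2
    disc = b * b - c
    if disc < 0:
        return None
    t = b - math.sqrt(disc)
    if t <= 0:
        return None
    return tuple(t * di for di in d)
```

Two distinct points on the same ray, such as `(1, 2, 4)` and `(2, 4, 8)`, project to the same pixel, which is why projection alone cannot be inverted. Back projection resolves the ambiguity by selecting the first surface point along the ray.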
Putting this together and assuming calibrated cameras, the deformation of the surface towards a reconstructed shape based on a set of image observations can be shown to be of the following form:
where we define a visibility characteristic function from a given location on a surface as:
This can be rewritten in terms of the smooth regularized Heaviside function along with the (outward) surface normals at each point of the surface :
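To make the role of a regularized Heaviside function concrete, the Python sketch below evaluates a soft front-facing test from a surface point and its outward normal. Several assumptions are made: the arctangent regularization is one common choice from the level-set literature and is not necessarily the one used here, the test ignores self-occlusion by other parts of the surface, and the names `regularized_heaviside` and `soft_visibility` are hypothetical.

```python
import math


def regularized_heaviside(z, eps=0.1):
    """Smooth approximation of the unit step; tends to a hard step as eps -> 0."""
    return 0.5 * (1.0 + (2.0 / math.pi) * math.atan(z / eps))


def soft_visibility(point, normal, eps=0.1):
    """Soft front-facing indicator for a surface point in camera coordinates.

    The camera sits at the origin. A point is treated as front-facing when its
    outward unit normal opposes the viewing ray. Assumption: occlusion by
    other surface patches is not modeled.
    """
    length = math.sqrt(sum(c * c for c in point))
    view = tuple(c / length for c in point)
    alignment = -sum(v * n for v, n in zip(view, normal))
    return regularized_heaviside(alignment, eps)
```

A point whose normal faces the camera yields a value near one, while a point on the far side of the object yields a value near zero. The smooth transition keeps the indicator differentiable near the occluding contour.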
Given the above, we are now able to formulate a control-based reconstruction scheme from which a given physical 2D action, based on visual perception (information), can be used to interactively “sculpt” a 3D shape in collaboration with the above autonomous 3D reconstruction algorithm.
III Control-Based Reconstruction
Let us begin by redefining the general form of a surface reconstruction evolution above in level-set notation as follows:
(3) 
where is the surface gradient information computed from the photometric image data, is a levelset function, and is the classical Kronecker delta function. Hence, to “close the loop” that incorporates a physical 3D operator performing 2D inputs in order to control the 3D evolution dynamics of the evolving surface, one has
(4) 
where is the to-be-defined control law that drives towards the ideal (perfect) surface as . In this note, an ideal surface is defined as a result with no errors. For this work, we use the mean-separable segmentation energy [21] as our reconstruction model. From this, can be expressed in terms of curvature for points on the surface, which leads us to the following lemma.
Lemma III.1 For a given characteristic function and a point that lies on the corresponding surface “imaged” from a given camera , we have that
(5) 
Proof: Following the nomenclature defined above and noting is the second fundamental form [14, 27], we have
(6) 
where is the normal curvature in a particular viewing direction on the corresponding surface . We refer to Figure 2 for a visualization of this type of curvature on a given manifold. From this, we can rewrite as the following:
(7) 
Furthermore, as we aim to define a control law such that , we define the error between our current estimate and ideal shape (no errors) as
(8) 
In doing so, we are now able to establish the existence of the control law via the Lyapunov method of stabilization.
Theorem III.1: Let us assume and as well as let and be the maximum principal curvature and minimum principal curvature at a given point with respect to an imaging referential camera , respectively. Then the control law
(9) 
where , asymptotically stabilizes the system given in equation (4) from the current evolving surface to the ideal surface, as .
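The principal curvatures named in Theorem III.1 bound the normal curvature in every tangent direction, a standard fact from differential geometry [27]. The Python sketch below checks this numerically. It assumes a local Monge patch z = (a x² + 2 b x y + c y²)/2 at the point of interest, so that the second fundamental form is the symmetric matrix [[a, b], [b, c]] and the first fundamental form is the identity. The function names are hypothetical.

```python
import math


def principal_curvatures(a, b, c):
    """Eigenvalues (min, max) of the symmetric second fundamental form [[a, b], [b, c]]."""
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return mean - radius, mean + radius


def normal_curvature(a, b, c, theta):
    """Normal curvature along the unit tangent direction (cos theta, sin theta)."""
    vx, vy = math.cos(theta), math.sin(theta)
    return a * vx * vx + 2.0 * b * vx * vy + c * vy * vy
```

Sweeping `theta` over the full circle shows that `normal_curvature` never leaves the interval returned by `principal_curvatures`, so a bound stated in terms of the two principal curvatures holds for every viewing direction.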
Proof: We choose the Lyapunov function defined in terms of as
(10) 
Differentiating with respect to time we get:
(11) 
The simplification over the union results from the application of the Kronecker delta function. Moreover, one can show that the resulting system is stable (i.e., has a negative semidefinite derivative):
In particular, the above control law depends on curvature. While the derivation is beyond the scope of this note and is deferred to future work for the sake of clarity, one can show exponential convergence whereby higher curvature coincides with faster convergence rates. We present such comments to highlight important caveats in terms of geometry and control, as well as how one can begin to relate notions of “trust” (from a reconciliation of an operator augmentation) to a geometric (curvature) quantity. We would like to highlight that analogous behavior exists in networked dynamical systems, in which one is able to use discrete Ricci curvature as a measure of network robustness [28]. In such work, one can leverage the concept of k-convexity, similarly to the above, to define a positive correlation between Boltzmann entropy, curvature, and rate functions from thermodynamics. Ultimately, this work will seek to build upon this area and, in particular, explore notions of “trust” in the sense of geometric quantities such as curvature.
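The qualitative claim that a larger gain yields faster decay of a Lyapunov function can be illustrated with a scalar toy problem in Python. This is only a sketch under strong assumptions: the error obeys de/dt = -gain * e, the Lyapunov function is V = e²/2, and the constant `gain` merely stands in for a curvature-dependent quantity. It is not the surface evolution analyzed above, and `simulate_lyapunov_decay` is a hypothetical name.

```python
def simulate_lyapunov_decay(e0, gain, dt=1e-3, steps=2000):
    """Forward-Euler integration of de/dt = -gain * e.

    Returns the history of V = 0.5 * e**2. Assumes dt * gain < 1 so that the
    discrete update remains a contraction.
    """
    e = e0
    history = [0.5 * e * e]
    for _ in range(steps):
        e += dt * (-gain * e)
        history.append(0.5 * e * e)
    return history
```

For this toy system V is non-increasing along the trajectory, and running it with `gain=4.0` ends at a smaller value of V than `gain=1.0` from the same initial error, mirroring the statement that higher curvature coincides with faster convergence.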
Nevertheless, in designing operator-guided inputs, we note that perfect knowledge of the ideal surface is not readily available (even from a human visualization perspective) for a myriad of reasons including, but not limited to, occlusions, clutter, and/or the inability to define a well-posed model across image modalities. As such, we allow an operator (who is also prone to errors) to interact with the system in order to reconcile their belief with the built-in autonomy toward an estimate of the ideal surface. We stress that input from a human is fallible and that such input indirectly affects our control law through the adjudication of an “ideal” estimate. This estimate is herein denoted as . Moreover, we define as the th input on a given image and the accumulated input as
That is, we seek to allow the physical operator to make 2D actions that deform a 3D surface. In other words, we are able to define a 3D control law based on 2D inputs, which is particularly helpful as the operator is generally ill-equipped to alter the 3D shape itself (i.e., we assume the operator is not an artist). To derive the coupled system that fuses 2D operator input to control the 3D surface deformation, we must also define the errors for both the operator and autonomous model:
(12) 
Given this, we can now define a coupled PDE system that unifies the operator-based inputs with the autonomous counterpart, which is representative of an estimator-observer behavior, as follows:
(13a)  
(13b)  
where the tuning function that is dependent on operator input from an image observation can be defined as
(14) 
That said, the above system must still be shown to be stable even under imperfect operator actions. To do so, we define the accumulated total errors for both the operator and autonomous model as
(15)  
(16) 
From this, we now arrive at the following result.
Theorem III.2: Let us assume the notation and results of Theorem III.1 and further assume that operator input has stopped (i.e., is constant in all viewing directions). Then the estimator
where will stabilize the resulting coupled system in equation (13a) and equation (13b). Namely, the total error has a negative semidefinite derivative.
Proof: Let us begin by differentiating with respect to :
(17) 
Similarly, differentiating with respect to :
(18) 
IV Experimental Results and Discussion
In this section, we demonstrate the proposed algorithm on a variety of scenarios. In all demonstrated results, green patches, or marks, are made by the user to denote regions in the foreground. Similarly, red denotes regions on images that are to be considered part of the background. In images where silhouettes are displayed, the yellow silhouette denotes the autonomous surface while the estimate of the ideal surface is always presented in cyan. Each reconstruction utilizes images with the resulting MATLAB code run on an iMac 4.2 GHz Core i7 with 32 GB memory.
We begin with an example that highlights the method in the face of occlusions by objects obstructing several different imaging views. This can be seen in Figure 3, along with how such inputs affect the energy minimization landscape in Figure 4. Here, naive reconstruction fails due to an ambient occlusion whose intensity is similar to the background. While there exist various approaches and shape prior models to overcome such a problem, defining such models for particular scenarios becomes quite cumbersome and yet may not yield stable results. We are able to properly reconstruct the shape through operator input with a simplified model as defined in [21]. For this experiment, the user made interactions for the foreground and interactions for the background. In particular, with regard to the operator input and its impact on the energy landscape, the user actions can be partitioned into 3 milestones: an initial incision (Figure: 3fig:0), followed by a repair of the surface (Figure: 3fig:1), and then a consolidation of the surface by helping it “free” itself from scene anomalies (Figure: 3fig:2).
More importantly, irrespective of the underlying model chosen for reconstruction, there will exist assumptions that are violated, possibly due to a variety of image artifacts such as noise, clutter, and/or the model assumptions themselves. That is, for the chosen autonomous reconstruction model, we make the classic assumption that the scene is “mean-separable” and piecewise constant. Of course, while there exist other, more advanced models, such a model helps illustrate where operator feedback may override basic fallible assumptions. Figure 8 presents a scene in which the piecewise-constant assumption is violated along with minor camera miscalibrations. An additional scene for which this assumption is violated can be seen in Figure 7, which aims to reconstruct a predator drone against a seemingly distinguishable background of clouds yet fails without operator input. In the context of stereoscopic reconstruction, overcoming nonuniform illumination is yet another subtle challenge. Figure 9 presents a scene where reconstruction of a sentinel drone fails due to illumination on the ailerons that varies over the dataset. This is due in part to the illumination on the left wing being lower than that on the right wing. The reconstruction results obtained with operator input are also demonstrated.
To further the idea in a quantitative, nonsubjective manner, we conduct numerical noise experiments on the reconstruction of a synthetic scene of a sentinel drone, which can be seen in Figure 6 and Table I. Ultimately, if the operator must do intensive work to assist the autonomous counterpart in such situations, then a manual operator alone would suffice (or improved built-in autonomy would be desired). That said, Table I presents efficiency results as the amount of user input needed (in terms of % “actions” per view and % relabeling of pixels) compared to the improvement in output (in terms of true and false positive pixel-label rates). For example, the second row states that under 30% noise, only one action (user input) on 95% of the views, which amounts to only 2.7% pixel relabeling per image view, increases the true positive rate from 78.4% to 99.2%. This is repeated on several versions of noise and occlusion, two of which are seen from different views in Figure 6. Nevertheless, the key point from an application standpoint is that failures of such reconstruction methods due to imaging artifacts such as noise can be naturally recovered from with minimal effort through human in-loop collaboration. In addition to these results, we provide the corresponding Lyapunov decay rates for these scenes in Figure 10.
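For reference, the true and false positive rates reported per pixel label can be computed as in the following Python sketch. It assumes binary foreground labels (1 for foreground, 0 for background) flattened into sequences and uses the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The function name `positive_rates` and the toy labels are hypothetical and are not data from the experiments.

```python
def positive_rates(predicted, truth):
    """Return (true positive rate, false positive rate) for binary pixel labels."""
    tp = fp = fn = tn = 0
    for p, t in zip(predicted, truth):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif not p and t:
            fn += 1
        else:
            tn += 1
    # Guard against empty classes to avoid division by zero.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr
```

Comparing the pair returned for the autonomous reconstruction against the pair returned after operator input gives the kind of before-and-after summary that each pair of rows in Table I reports.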
Lastly, we note the significant body of work on “feature”-based methods that rely on correspondences combined with machine learning to perform reconstruction tasks [6, 8, 9, 12]. While the aim of this paper is not to discuss the merits of such methods relative to the proposed underlying autonomous method, it is worthwhile to note that under such noisy conditions, correspondence methods (which depend on structural image information) begin to suffer. Here, the geometric method proposed can be considered a “coarse” approach to tackle such “featureless” environments. That said, future work will focus on fusing such correspondence-based and learning approaches in hopes of defining a notion of image integrity and leveraging recent learning success on data that is indeed well-structured.


V Conclusions and Future Works
In this paper, we have proposed a feedback control framework to guide the dynamics of an evolving surface in the context of multiview stereoscopic reconstruction. This is done to ensure robustness in the presence of low-fidelity datasets. From an optimization standpoint, the reconstruction minima that we often seek may not coincide with user expectations (due to modeling imperfections). As opposed to defining complex models for which overfitting may arise, we incorporate a user-defined input in-loop and “on-the-fly” from a feedback control perspective. We show that the resulting framework is stable via Lyapunov analysis and that, from a practical standpoint, there is an increase in efficiency through a human-autonomous collaboration in shape reconstruction. Mathematically, the thematic interest is the interplay of geometry and control, namely how notions of curvature from geometry imply convergence and, for this note, a notion of autonomous trust in user input. That said, future work will entail a much closer analysis of how Gaussian curvature implies convergence, as well as the study of the problem in a distributed optimization sense, nonconstant and time-delayed inputs, and the inclusion of stochastic optimal control to further characterize operator uncertainty.
Table I
Noise | Interactions (#, % View) | % Pixels | True Pos. Rate | False Pos. Rate
30% | (0, 0) | 0% | 78.4% | 0.96%
30% | (1, 95%) | 2.7% | 99.2% | 3.09%
50% | (0, 0) | 0% | 47.6% | 0.2%
50% | (1, 95%) | 2.7% | 99.8% | 4.8%
90% | (0, 0) | 0% | 21.8% | 1%
90% | (1, 95%) | 2.7% | 99.9% | 14.3%
90% | | | 99.9% | 4.92%
 | | | 99.6% | 4.3%
References

[1] A. Yezzi and S. Soatto. “Stereoscopic Segmentation”, International Journal of Computer Vision. 2003.
[2] O. Faugeras and R. Keriven. “Variational Principles, Surface Evolution, PDE’s, Level Set Methods and the Stereo Problem”, INRIA. 1996.
[3] F. Zhao and X. Xie. “An Overview of Interactive Medical Image Segmentation”, Annals of the BMVA. 2013.
[4] L. Zhu, P. Karasev, I. Kolesov, R. Sandhu, and A. Tannenbaum. “Guiding Image Segmentation On The Fly: Interactive Segmentation From A Feedback Control Perspective”, IEEE Transactions on Automatic Control. 2018.
[5] H. K. Khalil. “Nonlinear Systems”, Prentice-Hall, New Jersey. 1996.
[6] A. Dai, A. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. “ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes”, CVPR. 2017.
[7] Y. Zhang, S. Song, E. Yumer, M. Savva, J. Lee, H. Jin, and T. Funkhouser. “Physically-Based Rendering for Indoor Scene Understanding Using Convolutional Neural Networks”, CVPR. 2017.
[8] J. Gwak, C. B. Choy, M. Chandraker, A. Garg, and S. Savarese. “Weakly Supervised 3D Reconstruction with Adversarial Constraint”, 2017 International Conference on 3D Vision. 2017.
[9] Z. Chen, X. Sun, L. Wang, Y. Yu, and C. Huang. “A Deep Visual Correspondence Embedding Model for Stereo Matching Costs”, CVPR. 2015.
[10] J. Aulinas, Y. Petillot, J. Salvi, and X. Lladó. “The SLAM Problem: A Survey”, CCIA. 2008.
[11] G. Zhang and P. Vela. “Optimally Observable and Minimal Cardinality Monocular SLAM”, ICRA. 2015.
[12] Y. Zhao and P. Vela. “Good Line Cutting: Towards Accurate Pose Tracking of Line-assisted VO/VSLAM”, ECCV. 2018.
[13] A. Yezzi and S. Soatto. “Structure from Motion for Scenes without Features”, CVPR. 2003.
[14] O. Faugeras. “Three-Dimensional Computer Vision: A Geometric Viewpoint”, MIT Press. 1993.
[15] B. Klingner, D. Martin, and J. Roseborough. “Street View Motion-from-Structure-from-Motion”, CVPR. 2013.
[16] C. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. “3D-R2N2: A Unified Approach for Single and Multi-View 3D Object Reconstruction”, ECCV. 2016.
[17] D. Mumford and J. Shah. “Optimal Approximations by Piecewise Smooth Functions and Associated Variational Problems”, Communications on Pure and Applied Mathematics. 1989.
[18] K. Kutulakos and S. Seitz. “A Theory of Shape by Space Carving”, International Journal of Computer Vision. 2000.
[19] A. Mulayim, U. Yilmaz, and V. Atalay. “Silhouette-Based 3D Model Reconstruction From Multiple Images”, IEEE Transactions on Systems, Man, and Cybernetics. 2003.
[20] M. Jancosek and T. Pajdla. “Segmentation Based Multi-View Stereo”, Computer Vision Winter Workshop. 2009.
[21] T. Chan and L. Vese. “An Active Contour Model Without Edges”, International Conference on Scale-Space Theories in Computer Vision. 1999.
[22] M. Bertalmío, L. Cheng, S. Osher, and G. Sapiro. “Variational Problems and Partial Differential Equations on Implicit Surfaces”, Journal of Computational Physics. 2001.
[23] T. Nguyen, J. Cai, J. Zhang, and J. Zheng. “Robust Interactive Image Segmentation Using Convex Active Contours”, IEEE Transactions on Image Processing. 2012.
[24] J. Doyle, B. Francis, and A. Tannenbaum. “Feedback Control Theory”, Courier Corporation. 2013.
[25] R. Sandhu, S. Dambreville, A. Yezzi, and A. Tannenbaum. “A Nonrigid Kernel-Based Framework for 2D-3D Pose Estimation and 2D Image Segmentation”, IEEE TPAMI. 2010.
[26] S. Kichenassamy, A. Kumar, P. Olver, A. Tannenbaum, and A. Yezzi. “Conformal Curvature Flows: From Phase Transitions to Active Vision”, Archive for Rational Mechanics and Analysis. 1996.
[27] M. Do Carmo. “Differential Geometry of Curves and Surfaces: Revised and Updated Second Edition”, Courier Dover Publications. 2016.
[28] R. Sandhu, T. Georgiou, E. Reznik, L. Zhu, I. Kolesov, Y. Senbabaoglu, and A. Tannenbaum. “Graph Curvature for Differentiating Cancer Networks”, Scientific Reports. 2015.
[29] B. Bamieh, F. Paganini, and M. Dahleh. “Distributed Control of Spatially Invariant Systems”, IEEE TAC. 2002.