Three Dimensional Reconstruction of Botanical Trees with Simulatable Geometry

12/20/2018 ∙ by Ed Quigley, et al. ∙ 8

We tackle the challenging problem of creating full and accurate three dimensional reconstructions of botanical trees with the topological and geometric accuracy required for subsequent physical simulation, e.g. in response to wind forces. Although certain aspects of our approach would benefit from various improvements, our results exceed the state of the art especially in geometric and topological complexity and accuracy. Starting with two dimensional RGB image data acquired from cameras attached to drones, we create point clouds, textured triangle meshes, and a simulatable and skinned cylindrical articulated rigid body model. We discuss the pros and cons of each step of our pipeline, and in order to stimulate future research we make the raw and processed data from every step of the pipeline as well as the final geometric reconstructions publicly available.



There are no comments yet.


page 2

page 5

page 6

page 7

page 11

page 12

page 13

page 14

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Human-inhabited outdoor environments typically contain ground surfaces such as grass and roads, transportation vehicles such as cars and bikes, buildings and structures, and humans themselves, but are also typically intentionally populated by a large number of trees and shrubbery; most of the motion in such environments comes from humans, their vehicles, and wind-driven plants/trees. Tree reconstruction and simulation are obviously useful for AR/VR, architectural design and modeling, film special effects, etc. For example, when filming actors running through trees, one would like to create virtual versions of those trees with which a chasing dinosaur could interact. Other uses include studying roots and plants for agriculture [46, 9, 11] or assessing the health of trees especially in remote locations (similar in spirit to [48]

). 2.5D data, i.e. 2D images with some depth information, is typically sufficient for robotic navigation, etc.; however, there are many problems that require true 3D scene understanding to the extent one could 3D print objects and have accurate geodesics. Whereas navigating around objects might readily generalize into categories or strategies such as ‘move left,’ ‘move right,’ ‘step up,’ ‘go under,’ etc., the 3D object understanding required for picking up a cup, knocking down a building, moving a stack of bricks or a pile of dirt, or simulating a tree moving in the wind requires significantly higher fidelity. As opposed to random trial and error, humans often use mental simulations to better complete a task, e.g. consider stacking a card tower, avoiding a falling object, or hitting a baseball (visualization is quite important in sports); thus, physical simulation can play an important role in end-to-end tasks, e.g. see 

[19, 28, 15] for examples of combining simulation and learning.

Accurate 3D shape reconstruction is still quite challenging. Recently, Malik argued111Jitendra Malik, Stanford cs231n guest lecture, 29 May 2018 that one should not apply general purpose reconstruction algorithms to say a car and a tree and expect both reconstructions to be of high quality. Rather, he said that one should use domain-specific knowledge as he has done for example in [17]. Another example of this specialization strategy is to rely on the prior that many indoor surfaces are planar in order to reconstruct office spaces [13] or entire buildings [3, 2]. Along the same lines, [48] uses a base animal shape as a prior for their reconstructions of wild animals. Thus, we similarly take a specialized approach using a generalized cylinder prior for both large and medium scale features.

In Section 3, we discuss our constraints on data collection as well as the logistics behind the choices we made for the hardware (cameras and drones) and software (structure from motion, multi-view stereo, inverse rendering, etc.) used to obtain our raw and processed data. Sections 4, 5, and 6 then describe how we create geometry from that data with enough efficacy for physical simulation. Section 7

discusses our use of machine learning, and Section 

8 presents a number of experimental results.

2 Previous Work

Tree Modeling and Reconstruction: Researchers in computer graphics have been interested in modeling trees and plants for decades [21, 5, 39, 29, 35]. SpeedTree222

is probably the most popular software utilized, and their group has begun to consider the incorporation of data-driven methods. Amongst the data-driven approaches,

[36] is most similar to ours combining point cloud and image segmentation data to build coarse-scale details of a tree; however, they generate fine-scale details procedurally using a self-similarity assumption and image-space growth constraints, whereas we aim to capture more accurate finer structures from the image data. Other data-driven approaches include [22]

which automatically estimates skeletal structure of trees from point cloud data,

[42] which builds tree models by assembling pieces from a database of scanned tree parts, etc.

Many of these specialized, data-driven approaches for trees are built upon more general techniques such as the traditional combination of structure from motion (see e.g. [41]) and multi-view stereo (see e.g. [12]). In the past, researchers studying 3D reconstruction have engineered general approaches to reconstruct fine details of small objects captured by sensors in highly controlled environments [34]. At the other end of the spectrum, researchers have developed approaches for reconstructing building- or even city-scale objects using large amounts of image data available online [1]. Our goal is to obtain a 3D model of a tree with elements from both of these approaches: the scale of a large structure with the fine details of its many branches and twigs. However, unlike in general reconstruction approaches, we cannot simply collect images online or capture data using a high-end camera.

To address similar challenges in specialized cases, researchers take advantage of domain-specific prior knowledge. [47] uses a generalized cylinder prior (similar to us) for reconstructing tubular structures observed during medical procedures and illustrates that this approach performs better than simple structure from motion. The process of creating a mesh that faithfully reflects topology and subsequently refining its geometry is similar in spirit to [43], which poses a human model first via its skeleton and then by applying fine-scale deformations.

Learning and Networks: So far, our use of networks is limited to segmentation tasks, where we rely on segmentation masks for semi-automated tree branch labeling. Due to difficulties in getting sharp details from convolutional networks, the study of network-based segmentation of thin structures is still an active field in itself; there has been recent work on designing specialized multiscale architectures  [32, 20, 30] and also on incorporating perceptual losses [16] during network training [26].

Figure 1: We target a California oak for reconstruction and simulation. (Inset) The drone and camera setup used to collect video data of the tree.

3 Raw and Processed Data

As a case study, we select a California oak (quercus agrifolia) as our subject for tree reconstruction and simulation (see Figure 1). The mere size of this tree imposes a number of restrictions on our data capture: one has to deal with an outdoor, unconstrained environment, wind and branch motion will be an issue, it will be quite difficult to observe higher up portions of the tree especially at close proximities, there will be an immense number of occluded regions because of the large number of branches that one cannot see from any feasible viewpoint, etc.

Figure 2: Point cloud data (far left) is used as a guide for interactively placing generalized cylinders which are readily simulatable as articulated rigid bodies (middle left). The generalized cylinders are connected together into a skinned mesh (middle right), which is then perturbed based on the point cloud data to capture finer scale geometric details (far right). Notice the stumps that appear in the perturbed mesh reflecting the point cloud data (marked with arrows).

In an outdoor setting, commodity structured light sensors that use infrared light (e.g. the Kinect) fail to produce reliable depth maps as their projected pattern is washed out by sunlight; thus, we opted to use standard RGB cameras. Because we want good coverage of the tree, we cannot simply capture images from the ground; instead, we mounted our cameras on a quadcopter drone that was piloted around the tree. The decision to use a drone introduces additional constraints: the cameras must be lightweight, the camera locations cannot be known a priori, the drone creates its own air currents which can affect the tree’s motion, etc. Balancing the weight constraint with the benefits of using cameras with a global shutter and minimal distortion, we mounted a pair of Sony rx100 v cameras to a DJI Matrice 100 drone. We calibrated the stereo offset between the cameras before flight, and during flight each camera records a video with K resolution at fps.

Data captured in this manner is subject to a number of limitations. Compression artifacts in the recorded videos may make features harder to track than when captured in a RAW format. Because the drone must keep a safe distance from the tree, complete coverage of a given branch is often infeasible. This lack of coverage is compounded by occlusions caused by other branches and leaves (in seasons when the latter are present). Furthermore, the fact that the tree may be swaying slightly in the wind even on a calm day violates the rigidity assumption upon which many multi-view reconstruction algorithms rely. Since we know from the data collection phase that our data coverage will be incomplete, we will need to rely on procedural generation, inpainting, “hallucinating” structure, etc. in order to complete the model.

After capturing the raw data, we augment it to begin to estimate the 3D structure of the environment. We subsample the videos at a sparse or fps and use the Agisoft PhotoScan tool333Agisoft PhotoScan, to run structure from motion and multi-view stereo on those images, yielding a set of estimated camera frames and a dense point cloud. We align cameras and point clouds from separate structure from motion problems by performing a rigid fit on a sparse set of control points. This is a standard workflow also supported by open-source tools [40, 33, 27]. Some cameras may be poorly aligned (or in some cases, so severely incorrect that they require manual correction). Once the cameras are relatively close, one can utilize an inverse rendering approach like that of [24] adjusting the misaligned cameras’ parameters relative to the point cloud. In the case of more severely misaligned cameras, one may select correspondences between 3D points and points in the misaligned image and then find the camera’s extrinsics by solving a perspective-n-point problem [10].

4 Building a Simulatable Tree

Recent work enables the simulation of highly detailed trees modeled as articulated rigid bodies at real-time or interactive speeds (k rigid bodies simulated at frames/sec, or million rigid bodies simulated at sec/frame) [31]

. The scalability of this method makes the simulation of rich, highly detailed geometric models of real-world trees feasible. In order to apply such a method to our reconstruction, we need to create articulated rigid bodies with masses and inertia tensors connected via springs with stiffnesses and damping coefficients. Unfortunately, point clouds, triangle soups with holes, and other similar 3D representations are not readily amenable to such an approach. Commonly used techniques such as Poisson surface reconstruction

[18] produce potentially disconnected meshes that do not respect the topology of the underlying tree, and are thus not well-suited for simulation. In order to create a simulatable reconstructed tree, we make a strong prior assumption that the tree reconstruction consists of a number of generalized cylinders as underlying building blocks, appropriately skinned to provide smooth interconnections, and subsequently modified to provide the desired geometric detail.

We create the generalized cylinders interactively using the point cloud data obtained via multi-view stereo as a guide. An initial cylinder is positioned at the base of the tree’s trunk, then a second cylinder is attached to the first by a common endpoint, and so on, progressively “growing” the generalized cylinder model of the tree. Each cylinder endpoint may be connected to zero, one, or two subsequent cylinders to model a branch ending, curving, or bifurcating, respectively. A radius is also specified for each endpoint of the generalized cylinders. After approximating the trunk and branches using this generalized cylinder basis, the surfaces of the generalized cylinders are skinned together into a single contiguous triangle mesh representing the exterior surface of the tree. Although this model only roughly captures the geometry using the radii of the generalized cylinders as estimates of the tree’s cross-sectional thicknesses, the advantage of this representation is that the model has a topology consistent with the real tree. The generalized cylinders are also readily simulated in order to drive deformations of the skinned mesh. See Figure 2.

Although topologically accurate, the skinned generalized cylinders miss much of the rich geometric structure of the tree that is captured in part by the point cloud data. Thus we augment the generalized cylinder representation in a manner informed by the point cloud. For each vertex on the contiguous skinned mesh, we construct a cylindrical sampling region normal to the mesh; then, the average position of the point cloud points that fall within this sampling region is used to perturb the vertex in its normal direction, essentially creating a 2D height field with respect to the skinned mesh. That is, the point cloud informs a displacement map [6].

Some vertices may have no point cloud data within their respective sampling regions; often the point cloud data only models one side of a branch, so only vertices on that side of the skin mesh are perturbed. To avoid sharp discontinuities in the perturbed mesh, we solve Laplace’s equation for the heightfield displacements of the vertices with empty sampling regions using the adequately perturbed heights as Dirichlet boundary conditions. To obtain a visually desirable mesh, one can additionally utilize Laplacian smoothing (e.g. [37, 8]), vertex normal smoothing, Loop subdivision [23]

, etc., as well as point cloud subset selection interleaved with additional perturbations along the resulting vertex normal directions. In fact, extending image inpainting ideas 

[4] to the height fields on the two dimensional mesh surface (i.e. geometric inpainting) would likely give the best results, especially if informed from other areas of the tree where the geometry is more readily ascertained.

Finally, the mesh is textured by assigning each post-perturbation vertex the color of its nearest point cloud point. Note that we do not use an average of the nearby data as this tends to wash out the texture details. Here, image inpainting can also be used to fill in regions that have no point cloud data for textures.

5 Medium Scale Branches

Figure 3: (Left) We successfully reconstruct twig geometry using a traditional structure from motion and multi-view stereo pipeline under favorable conditions: complete coverage, indoor lighting, rigid geometry, etc. (Right) In the wild, difficult conditions preclude the ability for such reconstructions.

The aforementioned process fails on parts of the tree for which there is insufficient point cloud data (or no point cloud data at all). Although traditional structure from motion is sufficient for recovering fine twig details under favorable conditions, the process of capturing data from a real tree accrues errors that necessitate a specialized approach (see Figure 3). These errors may be attributed to many sources: some branches are heavily occluded by others, the drone cannot perform a full sweep of most branches without other branches acting as obstacles, twig features are often only a few pixels wide when maintaining a safe distance between the drone and the tree, the tree may be nonrigidly deforming in the breeze even on a relatively calm day (or even due to the air currents generated by the drone itself), etc. The net effect of these sources of error is that our approach for creating the trunk and thicker branches of the tree is insufficient for the tree’s finer structures that are not well-resolved by the point cloud data.

Figure 4: (Top) Triangle meshes recovered from point cloud data (Section 4) and image annotations (Section 8). (Bottom) A close-up view of medium scale branches.

Thus, we switch to an image-based approach for finer structures. Our 3D generalized cylinder prior can be extended to 2D images by choosing projected radii and projected lengths of hypothetical 3D generalized cylinders. These 2D projections of generalized cylinders can be extended back to three spatial dimensions using multiple images. Whereas significant geometric detail on the thicker parts of the tree comes from geometric roughness on the skin of the cylinder as caused by knots, bark, etc. and is captured by our aforementioned perturbation process (see Section 4), the most significant geometric detail on thinner branches is often simple bending of their centerline. Thus, the fact that thinner branches lack adequate representation in the 3D point cloud is less consequential.

Figure 5: (Left) Final simulatable geometry: articulated rigid bodies skinned and textured with the aid of the point cloud, along with thinner branch reconstructions with geometry and topology that follow the image data, all of which inform the motion of any reconstructed points and triangles that lack high enough fidelity reconstructions to form coherent structures. (Right) An RGB image of the actual tree from the same view.

We employ an image annotation approach (see Section 8.1) to obtain image space labelings of branch and twig curves and “keypoints,” or features that can be identified across multiple images. After annotating a number of images, we use the annotation data to recover 3D structures by triangulating keypoint positions, connecting 3D keypoints to match the topology of image space curves, and estimating the tree thickness for each 3D point (see Section 8.4). Then, we again create a contiguous skin mesh for these newly recovered 3D branches. In order to boost our ability to capture geometric changes in the centerline, we use the 3D positions as control points for a b-spline curve rather than directly meshing the piecewise linear segments. See Figure 4.

Finally, we project texture information from the annotated images onto the skinned branches. For each vertex on the skinned mesh, we estimate its corresponding position within each corresponding annotated curve by measuring its fractional length along the curve’s medial axis and its fractional thickness measured by projecting the vertex’s distance from its 3D segment onto a plane parallel to the current image plane. For each such annotated curve we compute a quality estimate based on how close the corresponding camera is to the vertex and how closely aligned the vertex’s surface normal is to the direction from the vertex to the annotated point. Since averaging smears out texture information, we assign each vertex the color with the maximum quality score.

6 Unresolved Structure

Because the image annotations depend on human labelers, many of the tree’s branches and twigs remain unmodeled even as more images are progressively covered; the automated and semi-automated approaches considered in Section 7 can help with this. In order to avoid discarding data, we additionally constrain the unstructured point cloud data obtained in Section 3 to the nearest generalized cylinders of the reconstructed model so that the point cloud deforms as the tree’s rigid bodies move during simulation. This allows points from leaves, branches, and other structures that remain “orphaned” even after all possible generalized cylinders are created to contribute to the virtual tree’s appearance and motion. See Figure 5.

7 Annotation and Learning

Annotating images is a challenging task for human labelers and automated methods alike. Branches and twigs heavily occlude one another, connectivity can be difficult to infer, and the path of even a relatively large branch can often not be traced visually from a single view. Thus it is desirable to augment the image data during annotation to aid human labelers.

One method for aiding the labeler is to automatically extract a “flow field” of vectors tracking the anisotropy of the branches in image space (see Figure 

10). The flow field is overlaid on the image in the annotation tool, and the labeler may select endpoints to be automatically connected using the projection-advection scheme discussed in Section 8.3. Section 8.3 also discusses how we generate the flow field itself, after first creating a segmentation mask. Note that segmentation (i.e. discerning tree or not tree for each pixel in the image) is a simpler problem than annotation (i.e. discerning medial axes, topology, and thickness in image space).

Obtaining segmentation masks is straightforward under certain conditions, e.g. in areas where branches and twigs are clearly silhouetted against the grass or sky, but segmentation can be difficult in visually dense regions of an image. Thus, we explore deep learning-based approaches for performing semantic segmentation on images from our dataset. In particular, we use U-Net 


, a state-of-the-art fully convolutional architecture for segmentation; the strength of this model lies in its many residual connections, which give the model the capacity to retain sharp edges despite its hourglass structure. See Section 

8.2 for further discussion.

8 Experiments

Since the approach to large scale structure discussed in Section 4 works well, we focus here on medium scale branches.

8.1 Image Annotation

We present a human labeler with an interface for drawing piecewise linear curves on an overlay of a tree image. User annotations consist of vertices with 2D positions in image space, per-vertex branch thicknesses, and edges connecting the vertices. Degree- vertices are curve endpoints, degree- vertices lie on the interior of a curve, and degree- vertices exist where curves connect. A subset of the annotated vertices are additionally given unique identifiers that are used to match common points between images; these will be referred to as “keypoints” and are typically chosen as bifurcation points or points on the tree that are easy to identify in multiple images. See Figure 6.

Figure 6: Human labelers use our annotation tool to draw curves with positions, thicknesses, connectivities, and unique identifiers on images of the tree.

We take advantage of our estimated 3D knowledge of the tree’s environment in order to aid human labelers and move towards automatic labeling. After some annotations have been created, their corresponding 3D structures are generated and projected back into each image, providing rough visual cues for annotating additional images. Additionally, since we capture stereo information, we augment our labeling interface to be aware of stereo pairs: users annotate one image, copy those annotations to the stereo image, and translate the curve endpoints along their corresponding epipolar lines to the correct location in the stereo image. This curve translation constrained to epipolar lines (with additional unconstrained translation if necessary to account for error) is much less time consuming than labeling the stereo image from scratch.

Human labelers often identify matching branches and twigs across images by performing human optical flow, toggling between adjacent frames of the source video and using the parallax effect to determine branch connectivity. This practice is an obvious candidate for automation, e.g. by annotating an initial frame then automatically carrying the annotated curves through subsequent frames via optical flow. Unfortunately, the features of interest are often extremely small and thin and the image data contains compression artifacts, making automatic optical flow approaches quite difficult. However, it is our hope that in future work the same tools that aid human labelers can be applied to automatic approaches making them more effective for image annotation.

8.2 Deep Learning

In order to generate flow fields for assisting the human labeler as discussed in Section 7, we first obtain semantic segmentations of tree and not tree using a deep learning approach. To train a network for semantic segmentation, we generate a training dataset by rasterizing the image annotations as binary segmentation masks of the labeled branches. From these 4K masks, we then generate a dataset of crops containing more than 4000 images. The crop centers are guaranteed to be at least pixels away from one another, and each crop is guaranteed to correspond to a segmentation mask containing both binary values. The segmentation problem on the raw 4K images must work on image patches with distinctly different characteristics: the more straightforward case of branches silhouetted against the grass, and the more complex case of highly dense branch regions. Therefore, we split the image patches into two sets via

-means clustering, and train two different models to segment the two different cases. For the same number of training epochs, our two-model approach yields qualitatively better results than the single-model approach.

Figure 7: A set of anisotropic kernels is used to obtrain directional activations in segmentation masks for both perceptual loss and flow field generation.

Instead of directly using the standard binary cross entropy loss, the sparseness and incompleteness of our data led us to use a weighted variant, in order to penalize false negatives more than false positives. As a further step to induce smoothness and sparsity in our results, we introduce a second order regularizer through the L2 difference of the output and ground truth masks’ gradients. We also experiment with an auxiliary loss similar to the VGG perceptual loss described in [26], but instead of using arbitrary feature layers of a pretrained network, we look at the L1 difference of hand-crafted multiscale directional activation maps. These activation maps are produced by convolving the segmentation mask with a series of Gabor filter-esque [14] feature kernels , where each kernel is scale-aware and piecewise sinusoidal (see Figure 7). A given kernel detects branches that are at an angle and have thicknesses within the interval . For our experiments, we generate 18 kernels spaced degrees apart and use , , and .

Figure 8: (Left) Image masks generated from image annotations and used as training data. (Right) Outputs of the segmentation network.

Figure 8 illustrates two annotated images used in training and the corresponding learned semantic segmentations. Note that areas of the semantic segmentation that are not part of the labeled annotation may correspond to true branches or may be erroneous; for the time being a human must still choose which pieces of the semantic segmentation to use in adding further annotations.

8.3 Learning-Assisted Annotation

Figure 9: (Top left) A ground truth mask of the tree taken by flattening the image annotation data into a simple binary mask. (Bottom left) A visualization of flow directions estimated by applying directional filters to the ground truth mask. (Right) Medial axes of the tree branches estimated from the flow field.

To generate a flow field, we create directional activation maps as in Section 8.2 again using the kernels from Figure 7, then perform a clustering step on the resulting per-pixel histograms of gradients [7] to obtain flow vectors. Each pixel labeled as tree with sufficient confidence is assigned one or more principal directions; pixels with more than one direction are potentially branching points. We find the principal directions by detecting clusters in each pixel’s activation weights; for each cluster, we take the sum of all relevant directional slopes weighted by their corresponding activation values.

Having generated a flow field of sparse image space vectors, we trace approximate medial axes through the image via an alternating projection-advection scheme. From a given point on a branch, we estimate the thickness of the branch by examining the surrounding flow field and project the point to the estimated center of the branch. We then advect the point through the flow field and repeat this process. In areas with multiple directional activations (e.g. at branch crossings or bifurcations), our advection scheme prefers the direction that deviates least from the previous direction. See Appendix E for further details. By applying this strategy to flow fields generated from ground truth image segmentations, we are able to recover visually plausible medial axes (see Figure 9). However, medial axes automatically extracted from images without ground truth labels are error prone. Thus, we overlay the flow field on the annotation interface and rely on the human labeler. The labeler may select curve endpoints in areas where the flow field is visually plausible, and these endpoints are used to guide the medial axis generation. See Figure 10 for an example flow field generated from the learned segmentation mask and the supplemental video for a demonstration of semi-automated medial axis generation.

Figure 10: The trained network infers a segmentation mask (bottom left) from an input image (top left). We then estimate a flow field (right) by applying anisotropic filters to the segmentation mask. A labeler can specify endpoints between which a medial axis and thickness values are automatically estimated (right, in green).

8.4 Recovering Medium Scale Branches

Given a set of image annotations and camera extrinsics obtained via structure from motion and stereo calibration, we first construct piecewise linear branches in 3D. We triangulate keypoints that have been labeled in multiple images, obtaining 3D positions by solving for the point that minimizes the sum of squared distances to the rays originating at each camera’s optical center and passing through the camera’s annotated keypoint. We then transfer the topology of the annotations to the 3D model by connecting each pair of 3D keypoints with a line segment if a curve exists between the corresponding keypoint pair in any image annotation.

Next, we subdivide and perturb the linear segments connecting the 3D keypoints to match the curvature of the annotated data. Each segment between two keypoints is subdivided by introducing additional vertices evenly spaced along the length of the segment. For each newly introduced vertex, we consider the point that is the same fractional length along the image-space curve between the corresponding annotated keypoints in each image for which such a curve exists. We trace rays through these intra-curve points to triangulate the position of each new vertex in the same way that we triangulated the original keypoints.

Figure 11: Branches labeled from a stereo pair of cameras are visually plausible from the perspective of those cameras (top left), but they can exhibit severe error when viewed from a different angle (top right). By clamping these branch positions, one can achieve a virtually identical projection to the original cameras (bottom left) while maintaining a nondegenerate albeit “flattened” appearance from a different angle (bottom right).

Finally, we estimate the thickness of each 3D vertex beginning with the 3D keypoints. We estimate the world space thickness of each keypoint by considering the corresponding thickness in all annotated camera frames. For each camera in which the keypoint is labeled, we estimate world space thickness using similar triangles, then average these estimates to get the final thickness value. We then set the branch thickness of each of the vertices obtained through subdivision simply by interpolating between the thicknesses of the keypoints at either end of the 3D curve. Using this strategy, we recover a set of 3D positions with local cross-sectional thicknesses connected by edges, which is equivalent to the generalized cylinder representation employed in Section 


The human users of our annotation tools encounter the traditional trade-off of stereo vision: it is easy to identify common features in images with a small baseline, but these features triangulate poorly exhibiting potentially extreme variance in the look-at directions of the corresponding cameras. Conversely, cameras whose look-at directions are close to orthogonal yield more stable triangulations, but common features between such images are more difficult to identify. One heuristic approach is to label each keypoint three times: twice in similar images and once from a more diverse viewpoint. However, it may be the case that some branches are only labeled in two images with a small baseline (e.g. a stereo pair). In this case, we propose a clamping strategy based on the topological prior of the tree. Designating a “root” vertex of a subtree for such annotations, we triangulate the annotated keypoints as usual obtaining noisy positions in the look-at directions of the stereo cameras. We then march from the root vertex to the leaf vertices. For each vertex

with location , we consider each outboard child vertex with location . For each camera in which the point is labeled, we consider the intersection of the ray from the location of ’s annotation to with the plane parallel to the image plane that contains ; let be the intersection point. We then clamp the location of between and the original location based on a user-specified fraction. This process is repeated for each camera in which is annotated, and we obtain the final location for by averaging the clamped location from each camera. See Figure 11.

9 Conclusion and Future Work

Figure 12: The tree model is deformed from its rest pose (left) to an exaggerated pose (right) via simulation.

We presented an end-to-end pipeline for reconstructing a 3D model of a botanical tree from RGB image data. Our reconstructed model may be readily simulated to create motion in response to external forces, e.g. to model the tree blowing in the wind (see Figure 12

). We use generalized cylinders to initialize an articulated rigid body system, noting that one could subdivide these primitives as desired for additional bending degrees of freedom, or decrease their resolution for faster performance on mobile devices. The simulated bodies drive the motion of the textured triangulated surfaces and/or the point cloud data as desired.

Although we presented one set of strategies to go all the way from the raw data to a simulatable mesh, it is our hope that various researchers will choose to apply their expertise to and improve upon various stages of this pipeline, yielding progressively better results. In particular, the rich topological information in the annotations has great potential for additional deep learning applications, particularly for topology extraction [38, 25, 44] and 2D to 3D topology generation [9].


Research supported in part by ONR N000014-13-1-0346, ONR N00014-17-1-2174, ARL AHPCRC W911NF-07-0027, and generous gifts from Amazon and Toyota. In addition, we would like to thank both Reza and Behzad at ONR for supporting our efforts into computer vision and machine learning, as well as Michael Black for many interesting suggestions especially regarding drones. E.Q. was supported in part by an NDSEGF and Y.Z. was supported in part by a Stanford School of Engineering Fellowship. E.Q. would also like to thank Michael Bao and David Hyde for contributing cameras, Pilot AI Labs for contributing drones, and Richard Heru, Katherine Liu, Martin Mbuthia, and Riley Wilson for helping with image annotation.


Appendix A Video Syncing

We collect data using a pair of drone-mounted cameras. In order to synchronize the two video feeds, we calculate a sequence of the L1 distances between adjacent video frames for each camera as a rough estimate of how much motion occurred between frames. We then use the maximum cross correlation of these sequences to find the integer frame offset that best aligns the two videos. Note that since the videos are recorded at fps, even the optimal offset could leave as much as a 1/60 second temporal error, in addition to the other sources of error mentioned in Section 5.

Appendix B Camera Calibration

For each structure from motion problem, we withhold images from camera 2 and obtain estimates of camera 1’s intrinsics as well as extrinsics for individual image samples. Then, we hold the camera intrinsics constant and solve for the stereo offset between cameras 1 and 2 using a standard checkerboard pattern [45]. The scale of the stereo transform relative to the structure from motion scene is unknown, but this scale may be found using image correspondences or even estimated visually.

Appendix C Semantic Segmentation

The branches and twigs in the image data are often either clustered together or silhouetted against the ground. Hypothesizing that two models trained on identical network architectures might perform better than a single monolithic model, we divided the training data into a ‘brown’ set and a ‘green’ set using -means clustering approximately capturing the ‘clustered’ and ‘silhouetted’ arrangements of twigs, respectively. Because color saturation is a defining feature in green vs. brown, we chose to cluster the image crops based on the saturation channel in HSV color space. However, using the average saturation over the entire image crop is not discriminatory enough due to the high amounts of brown that are present even in crops that contain branches silhouetted against grass; the median saturation value is also not sufficiently robust to varying amounts of grass in the background. Thus, for each image crop we took the th and th percentile of per-pixel saturations as a feature vector. We performed -means clustering on these 2D features to split the image crops into two groups. Finally, we trained two models with identical U-Net network architectures, each with image crops of the clusters.

To segment an image, we pass the image through both models to obtain two candidate segmentation masks. We then composite the candidate segmentations by taking a per-pixel weighted sum. To set the per-pixel weights, we compute a feature vector of the th and th percentile saturation values for a small neighborhood around each pixel; then, we calculate the feature vector’s distance to each of the two cluster centers previously obtained in the -means clustering step and use those distances to assign the corresponding pixel a weight between and . See Figures 13 and 14 for examples.

Note that while our experiments with -means clustering on training data for network models were reasonably successful, experiments with using clustering algorithms directly on images to perform segmentation were distinctively less so, as the color-based segmentation methods had trouble segmenting out tangled branches that are not clearly silhouetted against the grass or the sky. See Figure 15 for several examples.

Appendix D Flow Field Extraction

Our handcrafted scale-aware directional kernels are piecewise sinusoidal with a linear falloff, and are designed to have a maximum activation when convolved with an image patch that contains a line passing through the patch center with some specified thickness and orientation . A falloff threshold controls the maximum detectable thickness; convolving any line thicker than with one of the kernels will result in an activation value of zero. See Figure 16. The formulation of the kernel defined on the discrete pixel grid is

We use a bank of filters, with ranging from to in increments. Instead of varying and , for more efficient computation we use fixed, empirically chosen values for and and convolve these filters with a number of rescaled instances of the input image. These convolutions yield a set of directional activations (see Figure 16(b) and Figure 17(b)).

To extract flow field vectors from the directional activations, we first determine whether a meaningful nonzero flow direction exists for each pixel. First, we filter out noise by zeroing out all activation values that are below an empirically chosen threshold (see Figure 17(c)). Next, for each pixel and for each sample, we reduce the activation maps for different scales to a single map by taking the maximum activation across all scales. This yields a per-pixel histogram of activation values as a function of . See Figure 17(d).

Pixels with all zero activations and pixels with more than half their activations nonzero are ignored; the latter case is for filtering out pixels corresponding to roughly isotropic ‘blobs’ in the segmentation mask. We determine primary directions for the remaining pixels by clustering their activation values. Observing that an elongated structure in the segmentation mask may cause large activations in multiple kernels with similar values, we sort the directional activations by and find blocks of activations caused by adjacent samples bounded by local minima in the distribution. See Figures 17(d) and 17(e). For a given activation block with per-angle activations , we use the weighted sum of the directions as a single vector representing the block. After finding all such blocks and computing the corresponding vectors, we designate the vectors with the largest magnitudes as the primary flow directions for the corresponding pixel. In practice, selecting the top two vectors as primary flow directions has proven sufficient so far. See Figure 17(a) and Figure 17.

Appendix E Projection-Advection Scheme

For a point with nonzero flow , we consider the line perpendicular to that passes through . We then find the line segment on which includes and for which every point on the segment has a principal flow direction within degrees of . The center point of the line segment, referred to as , is now the projection of onto the medial axis, and the length is the thickness of the branch at point . is added to the set of medial axis vertices, and we next consider the flow field at point . If the flow field at is nonzero, we choose the principle direction that is closest to . is then updated to be , where is a user-specified step size, and the projection-advection scheme is repeated until we find a zero flow field at , i.e. where a nonzero does not exist.

Note that when estimating a medial axis within our annotation tool we additionally have labeler-specified curve endpoints and . To account for this we add an additional attraction force, picking the advection direction as , where is the normalized flow field direction, is the normalized direction from to , and .

Appendix F Recovering Medium Scale Branches

Here we provide additional implementation details for the approach described in Section 8.4. See Figure 19 for additional views of the resulting meshed geometry.

Misaligned Cameras: Some images may fail to be aligned properly after running structure from motion. There are few enough of these that it is feasible to visually inspect each image and mark those with incorrect extrinsics as ‘misaligned.’ During medium scale branch reconstruction, we perform an initial pass of keypoint triangulation while withholding annotations from misaligned images. We then use the resulting 3D triangulated keypoints to solve a perspective-n-point problem for each misaligned image. If this fails, then the annotations from that image are ignored; otherwise, the image is marked as ‘aligned.’ We then repeat the keypoint triangulation step using annotations from the newly aligned images along with those of the originally aligned images, thus potentially modifying the 3D positions of some keypoints while also potentially adding new keypoints.

Keypoint Thickness Estimation:

For a keypoint with triangulated position , we estimate the world space radius by considering the set of cameras in which the keypoint is labeled. For each such camera we know the camera’s optical center , the annotated point , and the annotated radius . Let be the plane that contains and is parallel to the camera’s image plane. We intersect the ray with obtaining a point , then estimate the keypoint’s radius using a ratio of distances: . Finally, we average the radius estimates over each camera in which the keypoint is annotated so that the final radius estimate is .

Figure 13: From left to right: original image, output of model trained on cluster 1 (‘brown’), output of model trained on cluster 2 (‘green’), visualization of contributions from each model in the composited output (orange for cluster 1 and green for cluster 2), and the final composited segmentation.
Figure 14: Examples of network-predicted segmentations and subsequent flow field extractions using images not seen during network training.
Figure 15: Examples of -means clustering versus our trained segmentation network. From left to right for each image triplet: input image, -means result, network output. For this visualization, we performed -means clustering with 2 clusters in Lab color space, which has a Euclidean distance that closely corresponds to color differences perceived by humans.
Figure 16: Heatmap visualizations of kernels, where red indicates positive weights and blue indicates negative weights. The middle row shows kernels with , , and increasing in increments; these are the values we use in practice. To illustrate the effect of the kernel parameters, the top and bottom rows depict kernels with , and , , respectively.
Figure 17: From left to right: input segmentation mask, a selected subset of multiscale directional activation maps, an HSV visualization of primary flow directions (in which hue represents angle and value represents magnitude), and a flow field visualization of primary and secondary flow directions (visualized in blue and orange, respectively).

(a) An image crop (left), its corresponding ground truth segmentation mask with a pixel of interest colored in red (middle), and the two directions detected at the pixel of interest during flow field generation (right).
(b) Plot of filter activation vs filter angle (in degrees) for the pixel of interest in (a). Each colored line corresponds to the activations at that pixel’s location for a different image scale.
(c) We discard activations from (b) that are below a user-specified threshold.
(d) We reduce the activations from (c) to a single activation map by taking the maximum value over all scales. These values are grouped into clusters bounded by local minima (here blue and orange represent two clusters). Note that a cluster may wrap around from to since the filters are bidirectional.
(e) Kernels corresponding to nonzero values in the final activations from (d).
Figure 18: An illustration of how filter activations are grouped into clusters as part of the flow field generation process.


Figure 19: (Top row) Triangle mesh geometry perturbed using point cloud data. (Middle two rows) Medium scale branches generated from image annotations. (Bottom row) Zoomed out views of the final tree model (so far).


  • [1] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski. Building rome in a day. In Computer Vision, 2009 IEEE 12th Int. Conf. on, pages 72–79. IEEE, 2009.
  • [2] I. Armeni, S. Sax, A. R. Zamir, and S. Savarese. Joint 2d-3d-semantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105, 2017.
  • [3] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3d semantic parsing of large-scale indoor spaces. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    , pages 1534–1543, 2016.
  • [4] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester. Image inpainting. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages 417–424. ACM Press/Addison-Wesley Publishing Co., 2000.
  • [5] J. Bloomenthal. Modeling the mighty maple. In ACM SIGGRAPH Computer Graphics, volume 19, pages 305–311. ACM, 1985.
  • [6] R. L. Cook. Shade trees. ACM Siggraph Comput. Graph., 18(3):223–231, 1984.
  • [7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.
  • [8] M. Desbrun, M. Meyer, P. Schröder, and A. H. Barr. Implicit fairing of irregular meshes using diffusion and curvature flow. Comput. Graph. (SIGGRAPH Proc.), pages 317–324, 1999.
  • [9] R. Estrada, C. Tomasi, S. C. Schmidler, and S. Farsiu. Tree topology estimation. IEEE Transactions on Pattern Analysis & Machine Intelligence, (1):1–1, 2015.
  • [10] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
  • [11] A. Fuentes, S. Yoon, S. C. Kim, and D. S. Park. A robust deep-learning-based detector for real-time tomato plant diseases and pests recognition. Sensors, 17(9):2022, 2017.
  • [12] Y. Furukawa and J. Ponce. Accurate, dense, and robust multiview stereopsis. IEEE transactions on pattern analysis and machine intelligence, 32(8):1362–1376, 2010.
  • [13] J. Huang, A. Dai, L. J. Guibas, and M. Nießner. 3DLite: towards commodity 3d scanning for content creation. ACM Trans. Graph., 36(6):203–1, 2017.
  • [14] A. K. Jain and F. Farrokhnia. Unsupervised texture segmentation using gabor filters. Pattern recognition, 24(12):1167–1186, 1991.
  • [15] Y. Jiang and C. K. Liu. Data-augmented contact model for rigid body simulation. arXiv preprint arXiv:1803.04019, 2018.
  • [16] J. Johnson, A. Alahi, and L. Fei-Fei.

    Perceptual losses for real-time style transfer and super-resolution.

    In European Conference on Computer Vision, pages 694–711. Springer, 2016.
  • [17] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik. Learning category-specific mesh reconstruction from image collections. In The European Conf. on Comput. Vision (ECCV), pages 371–386, September 2018.
  • [18] M. Kazhdan, M. Bolitho, and H. Hoppe. Poisson surface reconstruction. In Proc. of the Fourth Eurographics Symp. on Geom. Processing, SGP ’06, pages 61–70, Aire-la-Ville, Switzerland, Switzerland, 2006. Eurographics Association.
  • [19] A. Kloss, S. Schaal, and J. Bohg. Combining learned and analytical models for predicting action effects. arXiv preprint arXiv:1710.04102, 2017.
  • [20] G. Lin, A. Milan, C. Shen, and I. Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [21] A. Lindenmayer. Mathematical models for cellular interactions in development i. filaments with one-sided inputs. Journal of theoretical biology, 18(3):280–299, 1968.
  • [22] Y. Livny, F. Yan, M. Olson, B. Chen, H. Zhang, and J. El-sana. Automatic reconstruction of tree skeletal structures from point clouds. Proc. SIGGRAPH Asia 2010, 29:151:1–151:8, 2010.
  • [23] C. Loop. Triangle mesh subdivision with bounded curvature and the convex hull property. Technical Report MSR-TR-2001-24, Microsoft Research, 2001.
  • [24] M. M. Loper and M. J. Black. OpenDR: An approximate differentiable renderer. In European Conf. on Comput. Vision, pages 154–169. Springer, 2014.
  • [25] G. Máttyus, W. Luo, and R. Urtasun. Deeproadmapper: Extracting road topology from aerial images. In The IEEE International Conference on Computer Vision (ICCV), 2017.
  • [26] A. Mosinska, P. Márquez-Neila, M. Koziński, and P. Fua. Beyond the pixel-wise loss for topology-aware delineation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [27] P. Moulon, P. Monasse, R. Perrot, and R. Marlet. Openmvg: Open multiple view geometry. In International Workshop on Reproducible Research in Pattern Recognition, pages 60–74. Springer, 2016.
  • [28] X. B. Peng, G. Berseth, K. Yin, and M. Van De Panne.

    Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning.

    ACM Trans. Graph., 36(4):41, 2017.
  • [29] P. Prusinkiewicz, M. Hammel, J. Hanan, and R. Měch. Visual models of plant development. In Handbook of formal languages, pages 535–597. Springer, 1997.
  • [30] G. Qu, W. Zhang, Z. Wang, X. Dai, J. Shi, J. He, F. Li, X. Zhang, and Y. Qiao. Stripnet: Towards topology consistent strip structure segmentation. In Proceedings of the 26th ACM International Conference on Multimedia, MM ’18, pages 283–291, New York, NY, USA, 2018. ACM.
  • [31] E. Quigley, Y. Yu, J. Huang, W. Lin, and R. Fedkiw. Real-time interactive tree animation. IEEE transactions on visualization and computer graphics, 24(5):1717–1727, 2018.
  • [32] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • [33] J. L. Schönberger and J.-M. Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016.
  • [34] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In Proc. of the IEEE Conf. on Comput. Vision and Pattern Recognition (CVPR), pages 519–528. IEEE, 2006.
  • [35] O. Stava, S. Pirk, J. Kratt, B. Chen, R. Měch, O. Deussen, and B. Benes. Inverse procedural modelling of trees. In Computer Graphics Forum, volume 33, pages 118–131. Wiley Online Library, 2014.
  • [36] P. Tan, G. Zeng, J. Wang, S. B. Kang, and L. Quan. Image-based tree modeling. In ACM Trans. Graph., volume 26, page 87. ACM, 2007.
  • [37] G. Taubin. A signal processing approach to fair surface design. In Proc. of the 22nd annual conf. on Comput. graphics and interactive techniques, pages 351–358. ACM, 1995.
  • [38] C. Ventura, J. Pont-Tuset, S. Caelles, K.-K. Maninis, and L. Van Gool. Iterative deep learning for road topology extraction. In Proc. of the British Machine Vision Conf. (BMVC), 2018.
  • [39] J. Weber and J. Penn. Creation and rendering of realistic trees. In Proc. 22nd Ann. Conf. Comput. Graph. Int. Tech., pages 119–128. ACM, 1995.
  • [40] C. Wu. VisualSFM: A visual structure from motion system., 2011.
  • [41] C. Wu. Towards linear-time incremental structure from motion. In 3D Vision-3DV 2013, 2013 International Conference on, pages 127–134. IEEE, 2013.
  • [42] K. Xie, F. Yan, A. Sharf, O. Deussen, B. Chen, and H. Huang. Tree modeling with real tree-parts examples. IEEE TVCG, 22(12):2608–2618, Dec 2015.
  • [43] W. Xu, A. Chatterjee, M. Zollhöfer, H. Rhodin, D. Mehta, H.-P. Seidel, and C. Theobalt. MonoPerfCap: Human performance capture from monocular video. ACM Transactions on Graphics (TOG), 37(2):27, 2018.
  • [44] T. Xue, J. Wu, Z. Zhang, C. Zhang, J. B. Tenenbaum, and W. T. Freeman. Seeing tree structure from vibration. In The European Conf. on Comput. Vision (ECCV), pages 748–764, September 2018.
  • [45] Z. Zhang. A flexible new technique for camera calibration. IEEE Transactions on pattern analysis and machine intelligence, 22, 2000.
  • [46] Y. Zheng, S. Gu, H. Edelsbrunner, C. Tomasi, and P. Benfey. Detailed reconstruction of 3d plant root shape. In Comput. Vision (ICCV), 2011 IEEE Int. Conf. on, pages 2026–2033. IEEE, 2011.
  • [47] J. Zhou, A. Das, F. Li, and B. Li. Circular generalized cylinder fitting for 3d reconstruction in endoscopic imaging based on MRF. In Comput. Vision and Pattern Recognition Workshops, 2008. CVPRW’08. IEEE Comput. Society Conf. on, pages 1–8. IEEE, 2008.
  • [48] S. Zuffi, A. Kanazawa, and M. J. Black. Lions and tigers and bears: Capturing non-rigid, 3d, articulated shape from images. In Proc. of the IEEE Conf. on Comput. Vision and Pattern Recognition (CVPR), pages 3955–3963, 2018.