Commercial forest growers rely on routine inventories of their forests, recording the number of trees in a given area, their heights and other dimensions, often over large areas. With precise knowledge of the location of trees, their individual structures and the quantity and quality of wood they contain, resources can be utilised more efficiently during harvesting operations and supply-chain decisions can be planned more effectively. A combination of Airborne Laser Scanning (ALS) using manned aircraft [1, 2] and Terrestrial Laser Scanning (TLS) using static, ground-based sensors [3, 4] is typically used to gather data for inventory, but these traditional techniques suffer from several limitations. Manned aircraft ALS typically results in point clouds with insufficient density to identify individual trees, whereas TLS can only cover small areas of the forest. Recently developed UAV-borne LiDAR systems have demonstrated the ability to generate forest pointclouds with densities between those of ALS and TLS, and over large areas. However, issues remain in how to extract inventory data (such as tree counts and tree maps) from these systems automatically.
In this paper, we develop an automated approach to detecting, segmenting and counting trees in high resolution, aerially acquired LiDAR pointclouds over plantation forests. Processing of LiDAR point clouds is an active research area in robotics and computer vision [5, 6, 7], where techniques must be robust to challenging, unstructured environments. This work draws from the robotics and computer vision literature to address the problems of detecting individual trees in a high resolution ALS pointcloud and segmenting each tree into its stem and foliage components. This segmented representation can be used to further derive a number of important attributes about each tree such as the crown height, stem diameter and volume of wood [8, 9].
The specific contributions of this work are:
Detection of individual trees in a forest pointcloud.
Segmentation of each tree into its components via per-point labelling of foliage, lower stem and upper stem.
An automated pipeline for 3D pointcloud processing for forest inventory which comprises the ground removal, detection and segmentation of each tree.
Our processing methodology follows a machine learning paradigm based on state-of-the-art techniques in region-based convolutional neural networks (R-CNNs) and CNN-based 3D segmentation algorithms using a volumetric model of the forest derived from LiDAR pointclouds. Evaluation of our detection and segmentation algorithms is performed on high resolution ALS datasets acquired over two different commercial pine forests, with comparison against other methods for detection and segmentation of trees.
II Pipeline for Tree Detection and Segmentation
Given ALS data acquired over a forest, the aim of this pipeline is to detect the pointcloud subset associated with each tree in the forest, and predict a label for each point as either foliage, lower stem or upper stem. This process, summarised in Fig. 2, involves removal of the ground points, object detection to detect cuboids that delineate individual trees and segmentation of the points those trees comprise into their semantic components using a 3D fully convolutional network (3D-FCN) designed to encode and decode occupancy grids.
II-A Ground Removal
The ALS receives return pulses from the ground, as well as vegetation growing on the ground. It is useful to quantify the ground, for example, with a digital elevation model (DEM), to estimate forestry attributes such as canopy height. To simplify tree detection and segmentation, it is also important to remove all points associated with the ground and ground-based vegetation from the pointcloud.
To estimate a DEM for the ground, the pointcloud is first discretised into bins in the x and y axes, and the point in each bin with the smallest height (z axis) is stored. A regular grid with four metre resolution that spans the point cloud is established. A K-D Tree [10] is used to find the closest four stored points to the centre of each grid cell, and the average height of these points weighted by their distance to the cell centre is calculated as the ground height for that location. Once the height is computed for all cells in the grid, they are meshed using Delaunay triangulation [11] to output a smooth DEM. Finally, all points in the original pointcloud that lie less than a certain threshold height above their location on the DEM are removed, eliminating all ground points and most ground-based vegetation points.
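The ground-removal steps can be sketched in a few lines of Python. This is a simplified illustration rather than the implementation used in this work: it assumes inverse-distance weights, replaces the K-D Tree with a brute-force nearest-neighbour search, and evaluates the ground height directly under each point instead of building the four metre grid and Delaunay mesh. The function names, `bin_size` and `threshold` are hypothetical.

```python
import math

def lowest_point_per_bin(points, bin_size):
    """Keep the lowest (x, y, z) point in each x-y bin."""
    lowest = {}
    for x, y, z in points:
        key = (math.floor(x / bin_size), math.floor(y / bin_size))
        if key not in lowest or z < lowest[key][2]:
            lowest[key] = (x, y, z)
    return list(lowest.values())

def ground_height(seeds, cx, cy, k=4, eps=1e-9):
    """Inverse-distance-weighted height of the k seeds nearest to (cx, cy).

    Assumption: weights are 1 / distance. Brute-force search stands in
    for the K-D Tree query.
    """
    nearest = sorted(seeds, key=lambda p: math.hypot(p[0] - cx, p[1] - cy))[:k]
    weights = [1.0 / (math.hypot(p[0] - cx, p[1] - cy) + eps) for p in nearest]
    return sum(w * p[2] for w, p in zip(weights, nearest)) / sum(weights)

def remove_ground(points, bin_size, threshold, k=4):
    """Drop every point lying within `threshold` of the estimated ground."""
    seeds = lowest_point_per_bin(points, bin_size)
    return [p for p in points
            if p[2] - ground_height(seeds, p[0], p[1], k) > threshold]
```

For large pointclouds the brute-force search would be far too slow, which is why a spatial index such as a K-D Tree is used in practice.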
II-B Detecting Individual Trees in a Pointcloud
The structure in the data is leveraged to detect individual trees in the forest pointcloud. Trees are relatively uniform objects across a forest environment. They have vertical, cylindrical shapes with minimal overlap with adjacent trees and in most cases nothing can occlude a tree object in the vertical axis. Therefore, detections are made on 2D rasters from a bird’s-eye perspective, using a CNN-based object detector designed for 2D imagery. The Faster-RCNN object detector [12] is used to delineate trees by inferring bounding boxes in the xy-plane, which are projected into three-dimensional cuboids.
To train the detector, 3D crops of land containing several trees are extracted from the forest pointcloud after removal of the ground points. These pointcloud crops are converted to 2D rasters which represent the vertical density of points at spatially discretised locations in the xy-plane. The vertical density is computed by summing the number of occupied vertical bins at each xy-location and dividing by the total number of bins for that location. Each 2D vertical density raster is mapped to a colour image, where tree objects have a distinctive appearance. Trees are annotated with bounding box labels. Two background classes are also labelled: shrubs and partial trees (i.e. those cut off when the plot was cropped out from the forest). These reduce the number of false positive detections. The coloured raster images and bounding box labels are used to train the Faster-RCNN object detector.
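The vertical density computation described above can be illustrated as follows. This is a minimal sketch in plain Python; the function name, the `cell` and `bin_height` parameters and the grid dimensions are illustrative assumptions, and the mapping of the raster to a colour image is not shown.

```python
import math

def vertical_density_raster(points, origin, cell, bin_height, nx, ny, nz):
    """Fraction of occupied vertical bins at each x-y cell.

    Returns a list of rows indexed as raster[j][i] (y index, then x index).
    """
    ox, oy, oz = origin
    occupied = set()
    for x, y, z in points:
        i = math.floor((x - ox) / cell)
        j = math.floor((y - oy) / cell)
        k = math.floor((z - oz) / bin_height)
        if 0 <= i < nx and 0 <= j < ny and 0 <= k < nz:
            occupied.add((i, j, k))  # a bin counts once, however many points it holds
    counts = [[0] * nx for _ in range(ny)]
    for i, j, _ in occupied:
        counts[j][i] += 1
    return [[c / nz for c in row] for row in counts]
```

Because each vertical bin is counted at most once, the raster value reflects how much of the vertical column is occupied rather than the raw number of returns.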
During inference, a window slides in the xy-plane and the corresponding cuboid (with the z-axis bounded by the maximum and minimum altitude of the data) is used to extract a crop of the 3D points inside it. The pointcloud crop is converted to a coloured image raster using the same process as for training. The trained Faster-RCNN model is used to detect bounding boxes around all trees, shrubs and partial trees in the raster. The window slides with an overlap so that there is a full tree for every partial tree detected. Tree class bounding box detections are accumulated, and redundant boxes that significantly overlap with others are discarded. The remaining 2D bounding boxes corresponding to the tree class are projected into 3D cuboids, and all 3D points within are identified as belonging to an individual tree.
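The accumulation step can be sketched as below. The rule for discarding redundant boxes is assumed here to be a greedy, score-ordered suppression with an IoU threshold of 0.5; both the rule and the threshold are illustrative assumptions, as are the function names.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (xmin, ymin, xmax, ymax)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def discard_redundant(detections, max_iou=0.5):
    """Keep the highest-scoring boxes; drop any box overlapping a kept one.

    `detections` is a list of (box, score) pairs accumulated over all windows.
    """
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(box_iou(box, other) <= max_iou for other, _ in kept):
            kept.append((box, score))
    return kept

def points_in_cuboid(points, box, zmin, zmax):
    """Project a 2D box into a cuboid and return the 3D points inside it."""
    xmin, ymin, xmax, ymax = box
    return [p for p in points
            if xmin <= p[0] <= xmax and ymin <= p[1] <= ymax and zmin <= p[2] <= zmax]
```

Each surviving box then yields one tree pointcloud through `points_in_cuboid`, with `zmin` and `zmax` set to the altitude range of the data.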
II-C Segmenting Trees into Stem and Foliage
Once pointclouds for individual trees have been detected, they are segmented into foliage, lower stem, upper stem or clutter components. A CNN is trained to segment the pointclouds in 3D, inferring one of these labels for each point. The architecture for the CNN is based on VoxNet [13], a 3D-CNN designed for the classification of LiDAR scans of objects in urban environments, represented using occupancy grids. In this work it has been adapted for semantic segmentation, drawing from the structure of V-net [14], a 3D fully convolutional encoder-decoder network for segmenting volumetric medical images represented as occupancy grids.
The 3D-FCN accepts a binary occupancy grid representing a single tree as input, with voxels of resolution meters in the and axes respectively. The network is trained to reconstruct four binary occupancy grids - one for each class (foliage, lower stem, upper stem and empty space - Fig. 2). Every location is occupied in one and only one of the corresponding voxels across the four grids, with stem points having occupation priority over foliage points. If a location has no points in it then the voxel in the grid for the empty space class is occupied.
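The construction of the target grids, with stem points taking priority over foliage points, can be illustrated with a sparse voxel dictionary. This is a sketch under stated assumptions: the voxel size and grid shape passed in are placeholders, the names are hypothetical, and when lower and upper stem points share a voxel the first one encountered is kept (a tie-break chosen purely for illustration).

```python
import math

CLASSES = ("foliage", "lower_stem", "upper_stem", "empty")
PRIORITY = {"foliage": 1, "lower_stem": 2, "upper_stem": 2}  # stem outranks foliage

def target_labels(points, labels, origin, voxel, shape):
    """Map each occupied voxel index to a single class, honouring priority."""
    grid = {}
    for point, label in zip(points, labels):
        idx = tuple(math.floor((c - o) / v) for c, o, v in zip(point, origin, voxel))
        if not all(0 <= i < n for i, n in zip(idx, shape)):
            continue  # point falls outside the grid
        current = grid.get(idx)
        if current is None or PRIORITY[label] > PRIORITY[current]:
            grid[idx] = label
    return grid

def one_hot(grid, idx):
    """One-hot vector over CLASSES for a voxel; unoccupied voxels are 'empty'."""
    label = grid.get(idx, "empty")
    return [1 if c == label else 0 for c in CLASSES]
```

Stacking `one_hot` over all voxel indices gives the four binary target grids, in which exactly one grid is occupied at every location.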
As in VoxNet, the first two layers of the network are 3D convolutional layers, with the first layer having 32 filters of size
with a stride of two along all axes, such that the input is downsampled by half. The second layer has 32 filters of size with no downsampling. These two layers comprise the encoder, and the decoder comprises a mirrored version of these two layers with 3D deconvolutional layers instead. The second deconvolutional layer upsamples the data back to
. Each convolutional and deconvolutional layer precedes a leaky ReLU activation layer. There are skip connections between corresponding layers in the encoder and decoder to restore the resolution when upsampling. A final
3D convolutional layer maps the output of the decoder to the four target occupancy grids. A softmax nonlinearity is applied across corresponding voxels along the four occupancy grid outputs, treating them as one-hot vectors. A cross-entropy loss function is then used to compare predicted vectors to those in the target occupancy grids.
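The output stage can be made concrete with a small numerical sketch: a softmax over the four class scores of one voxel, followed by the cross-entropy against that voxel's one-hot target, averaged over voxels. This is written in plain Python for clarity and is not the training code used in this work; the function names are hypothetical and the class balancing mentioned later in the paper is omitted.

```python
import math

def softmax(logits):
    """Softmax over the four class scores of one voxel (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, one_hot_target):
    """Cross-entropy between softmax(logits) and a one-hot target vector."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    target_logit = sum(t * v for t, v in zip(one_hot_target, logits))
    return log_sum_exp - target_logit

def mean_voxel_loss(logit_grid, target_grid):
    """Average the per-voxel loss over a flattened grid of voxels."""
    losses = [cross_entropy(l, t) for l, t in zip(logit_grid, target_grid)]
    return sum(losses) / len(losses)
```

A voxel whose four scores are equal yields a loss of log 4 regardless of its target class, which is the value an untrained network would be expected to start near.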
Pointclouds for individual trees are manually annotated by labelling points as either the foliage, lower stem, upper stem or clutter class (from a harvesting perspective, the lower stem of a tree contains wood products of different value to those of the upper stem). To train the 3D-FCN, each batch of single tree pointclouds is converted to binary occupancy grids for the input, and their labelled equivalents are converted to the four target binary occupancy grids. Points labelled as clutter comprise vegetation on the ground or foliage from adjacent trees. Voxels occupied by clutter in the input grid are represented as ’empty space’ in the target grid so that the network will learn not to reconstruct them. Each batch of tree pointclouds is converted to input and target occupancy grids on the fly so that the batch can be augmented with random rotations and flipping about the z-axis (which are done on the pointcloud prior to voxelisation).
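The augmentation applied before voxelisation can be sketched as follows. Flipping about the z-axis is interpreted here as mirroring across a vertical plane containing the z-axis (negating x), which is an assumption, as are the function names and the 50% flip probability. Point coordinates are assumed to be expressed relative to the tree centre.

```python
import math
import random

def rotate_about_z(points, angle):
    """Rotate (x, y, z) points by `angle` radians about the vertical axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

def mirror_x(points):
    """Mirror points across the y-z plane; heights are unchanged."""
    return [(-x, y, z) for x, y, z in points]

def augment(points, rng):
    """Random rotation about z, followed by a mirror with probability 0.5.

    Point order is preserved, so per-point labels stay aligned.
    """
    out = rotate_about_z(points, rng.uniform(0.0, 2.0 * math.pi))
    if rng.random() < 0.5:
        out = mirror_x(out)
    return out
```

Both operations leave the height and the horizontal distance from the z-axis of every point unchanged, so the vertical structure that separates lower stem, upper stem and foliage is preserved.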
During inference, 3D pointcloud crops from bounding box tree detections are converted to binary occupancy grids and passed through the trained 3D-FCN. The network outputs the four binary occupancy grids, one for each semantic component of the tree. The occupied voxels in the foliage, lower stem and upper stem grids are converted back to a single labelled pointcloud.
At this stage, the resolution of the labelled pointcloud is low because it was downsampled when it was converted to an occupancy grid. To restore its former resolution, the labels of the low resolution points are mapped to the high resolution points of the original pointcloud crop using a K-D Tree (Fig. 2). Each point in the original crop queries the K-D Tree to find the nearest point in the low resolution, labelled pointcloud and inherits its label. If the distance to the nearest point exceeds a threshold, then the point is not given a label (these points are likely to be clutter). The result is a high resolution tree pointcloud with labels.
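The label transfer can be sketched as a nearest-neighbour lookup with a distance cut-off. For brevity this version uses a brute-force search in place of the K-D Tree; the function name and the `max_dist` parameter are illustrative assumptions.

```python
import math

def transfer_labels(dense_points, coarse_points, coarse_labels, max_dist):
    """Give each dense point the label of its nearest coarse point.

    Points farther than `max_dist` from every labelled point receive None
    (they are likely to be clutter). Brute-force search stands in for the
    K-D Tree query.
    """
    if not coarse_points:
        return [None] * len(dense_points)
    labels = []
    for p in dense_points:
        nearest = min(range(len(coarse_points)),
                      key=lambda i: math.dist(p, coarse_points[i]))
        close_enough = math.dist(p, coarse_points[nearest]) <= max_dist
        labels.append(coarse_labels[nearest] if close_enough else None)
    return labels
```

With a spatial index, each query costs roughly logarithmic rather than linear time in the number of labelled points, which matters given the point densities involved.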
III Experimental Setup
High resolution LiDAR pointclouds were collected over commercial pine plantations in Tumut forest (October 2016) and Carabost forest (February 2018) in New South Wales, Australia using the Riegl VUX-1, a compact and lightweight scanner designed for UAV/drone operations. Data was collected at flying heights of 60-90m from the ground, resulting in pointclouds with a density of approximately 300-700 points per m². During experiments, the scanner was attached to a manned helicopter; future flights are expected to be performed using a commercial UAV.
From the Tumut site, 17 plot rasters comprising 188 trees in total were labelled with bounding boxes and 75 trees from across the site were also labelled at the point level. From the Carabost site, three plot rasters comprising 71 trees were labelled with bounding boxes and 25 trees were labelled at the point level. For testing, the Tumut and Carabost sites had three and one plot respectively where all trees had bounding box and point labels. The three test plots for the Tumut site were spread out across the forest and had 12, 8 and 11 trees, whilst the test plot for Carabost had 9 trees. To train the detectors for each test plot, 16 and 2 plot rasters comprising 176-180 and 62 trees were used for the Tumut and Carabost sites respectively. To train the segmentation networks for each test plot, 60 and 14 of the point-labelled trees were used for the Tumut and Carabost sites respectively. All labelling was done using the open source software packages LabelImg [16] for bounding box labels and CloudCompare [17] for point labels.
To generate the 2D rasters for training the detector, a spatial grid with m resolution was used, where the bins in the z-axis were accumulated before the raster was mapped to a colour image. Thus the input to the detector was a colour image. The Faster R-CNN detectors were trained for 10000 iterations through the data, using a learning rate of 0.003, momentum of 0.9, batch size of 1, a ResNet-101 [18] backbone and Stochastic Gradient Descent (with momentum) optimisation.
The 3D-FCN segmentation network was trained until convergence (at least 3000 iterations). A learning rate of 0.001 was used for the first 500 iterations, and this was decayed to 0.0001 for the remaining iterations. The input shape of the data was . The batch size and amount of data augmentation were limited by the GPU memory (11GB) and the number of training samples. For Tumut, the batch size was six with four additional augmentations per sample (30 samples in total per batch). For Carabost, the batch size was seven with three additional augmentations per sample (28 samples per batch). Training was done with the Adam optimiser [19] and classes were balanced in the loss function.
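The learning-rate schedule and the effective batch sizes quoted above can be written down directly. The sketch assumes a single step change at iteration 500 (the text does not state whether the decay was abrupt or gradual) and a zero-based iteration counter; the function names are hypothetical.

```python
def learning_rate(iteration, switch_at=500, initial=1e-3, final=1e-4):
    """Step schedule: `initial` for the first `switch_at` iterations, then `final`."""
    return initial if iteration < switch_at else final

def samples_per_batch(batch_size, augmentations_per_sample):
    """Each original sample contributes itself plus its augmented copies."""
    return batch_size * (1 + augmentations_per_sample)
```

With these definitions, `samples_per_batch(6, 4)` gives the 30 samples per batch used for Tumut and `samples_per_batch(7, 3)` gives the 28 used for Carabost.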
The detection component of the pipeline was compared against two ALS approaches for detecting trees. The first computed a canopy height model (CHM) for each test plot and then used marker-controlled watershed segmentation to detect individual trees [20]. The second used DBSCAN to cluster the pointcloud such that each cluster with more than a certain number of points was considered a tree [21].
The segmentation component was compared against TLS methods used in mobile robotics applications. One method used Eigen features coupled with a classifier, trained to label tree points as lower stem, upper stem and foliage. The other used a RANSAC approach to determine stem points. Whilst the detection and segmentation components were evaluated separately, the same pointclouds detected as individual trees using the proposed approach were used as input for the segmentation experiments. Comparison methods for segmentation were given gold standard tree pointclouds as input.
Metrics used to evaluate detection were the precision, recall and F1 score for predicted tree pointclouds that had an intersection over union (IoU) with a ground truth tree pointcloud greater than 50%. For the segmentation, the IoU was calculated separately for each class, as well as a combined stem class which treated upper and lower stem as the same (although models were still trained on upper and lower stem classes).
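The detection metrics can be illustrated with a short sketch in which each tree is represented by the set of point indices it contains. One-to-one greedy matching of predictions to ground truth trees is an assumption made for this example, and the function names are hypothetical.

```python
def point_iou(pred, truth):
    """IoU between two trees, each given as a collection of point indices."""
    pred, truth = set(pred), set(truth)
    union = pred | truth
    return len(pred & truth) / len(union) if union else 0.0

def scores_from_counts(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def detection_scores(predictions, ground_truths, min_iou=0.5):
    """Match each prediction to at most one unmatched tree with IoU > min_iou."""
    matched = set()
    for pred in predictions:
        best, best_iou = None, min_iou
        for g, truth in enumerate(ground_truths):
            iou = point_iou(pred, truth)
            if g not in matched and iou > best_iou:
                best, best_iou = g, iou
        if best is not None:
            matched.add(best)
    tp = len(matched)
    return scores_from_counts(tp, len(predictions) - tp, len(ground_truths) - tp)
```

As a worked example of the arithmetic, 10 true positives, 1 false positive and 2 false negatives give a precision of 0.909, a recall of 0.833 and an F1 score of 0.870 on a plot of 12 trees.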
All processing was carried out on a 64-bit computer with an Intel Core i7-7700K quad-core CPU @ 4.20GHz and an Nvidia GeForce GTX 1080Ti graphics card. For the proposed method, on average, each segmentation model took four days to train and each detection model took 40 minutes to train. The average inference time for a single test plot was 50 seconds, with about 75% of that time spent on the K-D Tree operations, which are dependent on the density of points.
IV-A Tree Detection
Table I and Figure 3 show the individual tree detection results. The detection rates for the proposed method were high, with perfect scores for Tumut test plot 1 and Carabost. The proposed method missed only one tree in each of Tumut test plots 2 and 3, and it never predicted a tree where there was not one (the precision is 1.000 for all test plots). For the Tumut sites, the CHM with watershed method had similar F1 scores to the DBSCAN approach, but it achieved a perfect detection score on the Carabost data. The proposed method performed best overall.
|Test Dataset||Method||Precision||Recall||F1 Score|
|Tumut Plot 1||CHM + watershed||0.909||0.833||0.870|
|(12 trees)||DBSCAN||0.714||0.833||0.769|
|Tumut Plot 2||CHM + watershed||0.556||0.625||0.588|
|(8 trees)||DBSCAN||0.750||0.375||0.500|
|Tumut Plot 3||CHM + watershed||0.727||0.727||0.727|
|(11 trees)||DBSCAN||0.643||0.818||0.720|
|Carabost||CHM + watershed|
|(9 trees)||DBSCAN||0.750||0.333||0.462|
IV-B Tree Segmentation
The segmentation results (Table II and Figure 4) indicate that the proposed method performed best overall on the Tumut site data, particularly for the stem classes, where it outperformed the other methods by a large margin. The results for the Carabost site show that the RANSAC approach had the highest combined stem score.
When a 3D-FCN model trained on Tumut data was used for inference on Carabost data without any fine-tuning, the overall result was a decrease in performance (Table III). However, with fine-tuning of the network on the Carabost data, the results exceeded those where the model was trained solely on the Carabost data.
|Test Dataset||Method||Foliage||Lower Stem||Upper Stem||Combined Stem|
|Tumut Plot 1||Eigen features|
|Tumut Plot 2||Eigen features|
|Tumut Plot 3||Eigen features|
|Method of training model||Foliage||Lower Stem||Upper Stem||Combined Stem|
|Trained on Carabost|
|Trained on Tumut|
|Pre-trained on Tumut, fine-tuned on Carabost|
The Faster-RCNN detector has many training examples of trees in the raster representation and can successfully generalise to unseen data. Even in many circumstances where the other two detection methods fail, such as when tree stems bend too much or fork in two, or when trees are very close to each other, the proposed detector can delineate individual trees. In Tumut test plot 2, the one tree that is missed by the proposed method is close to an adjacent tree, has a bent stem and has the majority of its foliage distributed to the side of the adjacent tree (Figure 3(b)). The detector incorrectly detects it as being attached to the adjacent tree (Figures 3(d) and 3(f)). Similarly, in Tumut test plot 3, a tree with a small crown diameter that is close to an adjacent tree is misdetected as being a part of the adjacent tree. These are the only misdetections from the proposed approach. These incorrect detections have a knock-on effect on the segmentation result.
Regarding the segmentation results, there are more point-labelled trees for training from the Tumut site than the Carabost site, and hence the proposed method’s segmentation result was better for Tumut than Carabost. However, the RANSAC method, which did not rely on training examples, performed better on the Carabost data than the Tumut data because the pointclouds have a higher density and the stems are more exposed.
The lack of training examples for segmentation at Carabost was compensated for when the 3D-FCN network was pre-trained on the Tumut training examples and fine-tuned on the Carabost data. Whilst the foliage and combined stem results improved, the lower stem performance decreased. This is because more of the stem is denoted as lower stem in the Tumut site, and it is likely that the network assumes a similar structure for the trees in Carabost, labelling upper stem points as lower stem.
For the Tumut site, the upper stem was significantly harder to segment than the lower stem, which was reflected in the results for all methods. This is because it lies within the foliage, which blocks the LiDAR pulses, often resulting in large sections of stem missing from the scan. For the Carabost site, the structure of the trees is slightly different and the lower stem appears less frequently. The upper stem is also slightly more exposed than in the Tumut site. Thus the results of segmenting the upper stem were better than the lower stem. For both sites, the IoU scores for the stem classes are more sensitive to error than the foliage class because the stems occupy significantly less space. Slight misclassifications in the predictions cause large overlapping errors with the ground truth, resulting in big decreases in the IoU scores.
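The sensitivity of IoU to class size can be seen with a small numerical example. The counts below are invented purely for illustration and are not taken from the datasets; the function name is hypothetical. Passing both stem labels as `classes` gives the combined stem score.

```python
def class_iou(predicted, truth, classes):
    """IoU for one class, or a merged group of classes, over per-point labels."""
    classes = set(classes)
    inter = union = 0
    for p, t in zip(predicted, truth):
        in_pred, in_truth = p in classes, t in classes
        inter += in_pred and in_truth
        union += in_pred or in_truth
    return inter / union if union else 0.0

# Illustrative tree: 950 foliage points and 50 stem points.
truth = ["foliage"] * 950 + ["lower_stem"] * 50
predicted = list(truth)
predicted[0:10] = ["lower_stem"] * 10   # 10 foliage points mislabelled as stem
predicted[950:960] = ["foliage"] * 10   # 10 stem points mislabelled as foliage

foliage_iou = class_iou(predicted, truth, ["foliage"])
stem_iou = class_iou(predicted, truth, ["lower_stem", "upper_stem"])
```

The same 20 mislabelled points leave the foliage IoU at 940/960 (about 0.98) but reduce the stem IoU to 40/60 (about 0.67).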
The Eigen feature method was outperformed by the proposed approach on both sites. This was likely because it was designed for TLS data, whereas the ALS data has more noise and missing data due to pulses being occluded by thick canopy. The Eigen features are not as robust as the learnt 3D-FCN features. With sufficient training examples available from the Tumut site, the RANSAC approach was also outperformed by the proposed method at this site. One of the major shortcomings of the RANSAC approach is that it does not work well when stems bend. It would also be negatively impacted by the missing sections of stem, which were more prominent at the Tumut site.
One source of error in the proposed method’s segmentation of the stem is the downscaling and upscaling of the pointcloud resolution. When the point-labelled trees are converted to low resolution occupancy grids, the stem classes have priority over foliage. The network is trained on this data, and when the low resolution pointcloud is mapped back to the high resolution pointcloud, the stem points cover a wider space and encroach on the foliage class (see the Carabost trees in Figure 4). Whilst the recall for stem points remains high, the precision is negatively affected.
This paper presented a method for detecting and segmenting trees in high resolution airborne LiDAR. Using Faster-RCNN, trees were detected in a 2D coloured raster representation from a bird’s-eye perspective. Once detected, trees were segmented in 3D at the point level into their stem and foliage components using a 3D-FCN and K-D Tree. Overall, the proposed approach outperformed other methods for tree detection and segmentation. It was also shown that pre-training a 3D-FCN on data from a site with more training examples can improve results.
Future work will consider real time algorithms for deploying tree detection and segmentation on a UAV. Such a capability would enable aerial robotic applications in forestry such as targeted tree inspections, delivery of pesticides to the upper tree canopy and aerial robotic tree pruning.
This work was supported in part by Forest and Wood Products Australia research grant PNC377-1516. Thanks to David Herries, Susana Gonzales, Christine Stone and Interpine New Zealand for providing access to airborne laser scanning datasets.
-  H. Kaartinen, J. Hyyppä, X. Yu, M. Vastaranta, H. Hyyppä, A. Kukko, M. Holopainen, C. Heipke, M. Hirschmugl, F. Morsdorf, E. Næsset, J. Pitkänen, S. Popescu, S. Solberg, B. M. Wolf, and J. C. Wu, “An international comparison of individual tree detection and extraction using airborne laser scanning,” Remote Sensing, vol. 4, no. 4, pp. 950–974, 2012.
-  E. Ayrey and D. J. Hayes, “The Use of Three-Dimensional Convolutional Neural Networks to Interpret LiDAR for Forest Inventory,” Remote Sensing, vol. 10, no. 4, p. 649, 2018.
-  K. Olofsson and J. Holmgren, “Single tree stem profile detection using terrestrial laser scanner data, flatness saliency features and curvature properties,” Forests, vol. 7, no. 9, 2016.
-  J. Heinzel and M. O. Huber, “Detecting tree stems from volumetric TLS data in forest environments with rich understory,” Remote Sensing, vol. 9, no. 1, 2017.
-  Y. Li and E. B. Olson, “Extracting general-purpose features from LIDAR data,” in IEEE International Conference on Robotics and Automation, 2010, pp. 1388–1393.
-  M. D. Deuge, A. Quadros, C. Hung, and B. Douillard, “Unsupervised Feature Learning for Classification of Outdoor 3D Scans,” in Australasian Conference on Robotics and Automation, 2013.
-  U. Weiss and P. Biber, “Plant detection and mapping for agricultural robots using a 3D LIDAR sensor,” Robotics and Autonomous Systems, vol. 59, no. 5, pp. 265–273, 2011.
-  P. Pueschel, G. Newnham, G. Rock, T. Udelhoven, W. Werner, and J. Hill, “The influence of scan mode and circle fitting on tree stem detection, stem diameter and volume extraction from terrestrial laser scans,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 77, pp. 44–56, 2013.
-  X. Liang, P. Litkey, J. Hyyppä, H. Kaartinen, M. Vastaranta, and M. Holopainen, “Automatic stem mapping using single-scan terrestrial laser scanning,” IEEE Transactions on Geoscience and Remote Sensing, vol. 50, no. 2, pp. 661–670, 2012.
-  J. L. Bentley, “Multidimensional binary search trees used for associative searching,” Communications of the ACM, vol. 18, no. 9, pp. 509–517, 1975.
-  B. Delaunay, “Sur la sphere vide,” pp. 793–800, 1934.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2017.
-  D. Maturana and S. Scherer, “VoxNet: A 3D Convolutional Neural Network for Real-Time Object Recognition,” in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015, pp. 922–928.
-  F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in 3D Vision (3DV), 2016 Fourth International Conference on. IEEE, 2016, pp. 565–571.
-  A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier Nonlinearities Improve Neural Network Acoustic Models,” in Proceedings of the 30th International Conference on Machine Learning, vol. 30, 2013, p. 3.
-  Tzutalin, “LabelImg. Git code,” 2015. [Online]. Available: https://github.com/tzutalin/labelImg
-  D. Girardeau-Montaut, “Cloud compare—3d point cloud and mesh processing software,” 2015. [Online]. Available: http://www.cloudcompare.org/
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
-  D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, pp. 1–15, 2014.
-  Q. Chen, D. Baldocchi, P. Gong, and M. Kelly, “Isolating Individual Trees in a Savanna Woodland Using Small Footprint Lidar Data,” Photogrammetric Engineering & Remote Sensing, vol. 72, no. 8, pp. 923–932, 2006.
-  I. Smits, G. Prieditis, S. Dagis, and D. Dubrovskis, “Individual tree identification using different LIDAR and optical imagery data processing methods,” Biosystems and Information Technology, vol. 1, no. 1, pp. 19–24, 2012.
-  J. Lalonde, N. Vandapel, and M. Hebert, “Automatic three-dimensional point cloud processing for forest inventory,” Robotics Institute, vol. 334, 2006.
-  T. Högström and Å. Wernersson, “On Segmentation, Shape Estimation and Navigation Using 3D Laser Range Measurements of Forest Scenes,” IFAC Proceedings Volumes, vol. 31, no. 3, pp. 423–428, 1998.