A global method to identify trees inside and outside of forests with medium-resolution satellite imagery

by   John Brandt, et al.

Scattered trees outside of dense forests are very important for carbon sequestration, supporting livelihoods, maintaining ecosystem integrity, and climate change adaptation and mitigation. In contrast to trees inside of forests, not much is known about the spatial extent and distribution of scattered trees at a global scale. Due to the very high cost of high-resolution satellite imagery, global monitoring systems rely on medium resolution satellites to monitor land use and land use change. However, detecting and monitoring scattered trees with an open canopy using medium resolution satellites is difficult because individual trees often cover a smaller footprint than the satellites resolution. Here we present a globally consistent method to identify trees inside and outside of forests with medium resolution optical and radar imagery. Biweekly cloud-free, pansharpened 10 meter Sentinel-2 optical imagery and Sentinel-1 radar imagery are used to train a fully convolutional network, consisting of a convolutional gated recurrent unit layer and a feature pyramid attention layer. Tested across more than 215,000 Sentinel-1 and Sentinel-2 pixels distributed from -60 to +60 latitude, the proposed model exceeds 75 percent users and producers accuracy identifying trees in hectares with a low to medium density (less than 40 percent) of canopy cover, and 95 percent user's and producer's accuracy in hectares with dense (greater than 40 percent) canopy cover. When applied across large, heterogeneous landscapes, the results demonstrate potential to map trees in high detail and consistent accuracy over diverse landscapes across the globe. This information is important for understanding current land cover and can be used to detect changes in land cover such as agroforestry, buffer zones around biological hotspots, and expansion or encroachment of forests.



There are no comments yet.


page 4

page 6

page 7

page 11


Continental-scale land cover mapping at 10 m resolution over Europe (ELC10)

Widely used European land cover maps such as CORINE are produced at medi...

Spatio-temporal crop classification of low-resolution satellite imagery with capsule layers and distributed attention

Land use classification of low resolution spatial imagery is one of the ...

Combining Sentinel-1 and Sentinel-2 Time Series via RNN for object-based land cover classification

Radar and Optical Satellite Image Time Series (SITS) are sources of info...

High-resolution land cover change from low-resolution labels: Simple baselines for the 2021 IEEE GRSS Data Fusion Contest

We present simple algorithms for land cover change detection in the 2021...

Generating a Training Dataset for Land Cover Classification to Advance Global Development

Semantic segmentation of land cover classes is fundamental for agricultu...

High carbon stock mapping at large scale with optical satellite imagery and spaceborne LIDAR

The increasing demand for commodities is leading to changes in land use ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Forests cover about thirty percent of the world’s land surface (fra). Outside of forests, trees play an important role in agricultural and urban landscapes, as well as in Savannahs, grasslands, and deserts. Although not nearly as dense as forested regions, the vast extent of trees outside of forests contribute greatly to carbon biomass stocks in many countries (Schnell2014TheCO; sequestration). Trees outside of forests are also important sources of fuel for nearly two-thirds of the world’s developing populations (woodfuel). However, the spatial distribution of trees outside of closed canopy forests is not well understood or quantified, despite their importance in carbon sequestration, supporting livelihoods, and the attention given to these trees by major international development agenda (tofreview).

The difficulties of identifying trees outside forests with traditional remote sensing methods arise from a combination of the prohibitive expense of analyzing high-resolution imagery at large geographic scales, as well as the sub-pixel sizes of individual trees in medium-resolution imagery. Because of these difficulties, global analyses of dryland forest cover, which tends to be patchy and have an open canopy, have relied on human interpretation of high resolution satellite imagery rather than remote sensing classifiers


. Within contiguous, closed canopy forests, applying per-pixel approaches such as bagged decision trees to medium-resolution satellite imagery performs well at monitoring tree cover extent and change due to the high degree of between-pixel similarity in closed canopy forests

(hansen2013; KIM2014178). However, when considering trees in mosaic landscapes, or trees outside of forests, these medium-resolution, per-pixel approaches are not able to generate reliable maps of tree extent. Even though radoux2016 found that Sentinel-2, with a ten-meter resolution, has the photometric potential to detect sub-pixel objects as small as three meters, developing globally relevant models that can do so in varying land uses, cloud covers, and terrain has proven very difficult.

Previous approaches to quantitatively map tree cover with remote sensing classifiers have used a variety of supervised and unsupervised machine learning approaches, with either high resolution or medium resolution imagery. One of the most extensively used global datasets for forest monitoring is that of


, which developed a global map of tree cover based on Landsat (30 meter resolution) imagery and a random forests classifier. However, because

hansen2013 was designed for monitoring contiguous, closed-canopy forests, it routinely underestimates tree cover in arid landscapes and is not accurate in heterogeneous landscapes such as urban or peri-urban environments (bastin2017; OTTOSEN2020101947). Indeed, regional models have identified that hansen2013 underestimates tree cover by up to 80% in heterogenous landscapes in Europe and underestimates tree cover loss due to small-scale deforestation in the Amazon (OTTOSEN2020101947; Milodowski_2017).

At the regional scale, per-pixel machine learning classifiers for medium-resolution satellite imagery (Sentinel 1 and 2) have struggled to detect individual trees in heterogeneous landscapes, though these methods are adequate for tracking overall tree cover change at large geographic scales (zhang2019)

. Many other recent approaches have used a variety of machine learning methods, such as k-means, random forests, and gradient boosting, to map tree cover at regional scales with Sentinel 2. For instance,

OTTOSEN2020101947 mapped tree cover across Europe with Sentinel 2 and the unsupervised k-means classifier, achieving 53% user’s accuracy and 80% producer’s accuracy on tree detection. Similarly, Ho_ci_o_2019 developed a regional model of tree species in closed canopy forests in Poland using Sentinel 2 and the supervised random forests classifier, achieving greater than 80% overall accuracy distinguishing between eight tree species. Immitzer2016FirstEW mapped tree species in Europe with Sentinel 2 and a supervised random forest model with a 65% overall accuracy. In a single region in Belgium, Bolyn2018ForestMA were able to achieve over 90% overall accuracy distinguishing forests from non-forests with random forests and Sentinel 2. However, these models focus primarily on homogeneous areas within one geographic region, and do not address the difficulty of sparse tree detection in multiple heterogeneous regions in different biomes.

Over the past decade, convolutional neural networks (CNNs) have transformed image processing methods by supplanting both empirically derived classification methods (such as band thresholding or vegetation indices) and per-pixel machine learning methods (such as random forests and support vector machines) with machine learning methods that explicitly learn spatial patterns. Given a set of input images and their labels, a neural network minimizes the differences between the predictions and the labels by learning a set of nonlinear mathematical operations to parameterize the relationship between the input and output domains


. These nonlinear mathematical operations are tuned by optimizing a loss function such as cross entropy or mean squared error with gradient descent

(backprop). CNNs are a special class of neural networks that learn spatial patterns by subsequently applying convolutional operations between the input data and learned matrices of parameters (e.g. weights) (goodfellow). CNN models have established new state of the art accuracies for a number of remote sensing tasks with high resolution imagery, such as land-sea segmentation, land use classification, and building identification (Castelluccio2015LandUC; Li2017DeepUNetAD; ulu2018; ZHANG201857).

Although deep learning methods have come to popularity in recent years, their applications largely remain limited to high-resolution (less than 3 meter) imagery, which are often too expensive or computationally intense for large area analyses in real-world applications. In their review of deep learning approaches in remote sensing,

ma2019 identified five barriers to applying deep learning to medium-resolution satellite imagery. These include a tendency of papers to focus on small geographic regions, the lack of fine structural details in medium resolution imagery, difficulties with differently scaled objects, a lack of studies using time series imagery, and the tendency for deep learning approaches to generate blurry outputs. In homogeneous and geographically small landscapes, such as monoculture plantations, deep learning and medium-resolution sentinel imagery can classify tree density with higher than 80% accuracy (countinguncountable). However, existing deep learning models struggle to generalize to new geographies because background land uses can contribute more to the spectral information of the corresponding pixel than the tree itself due to the relative size differences between trees and satellite pixels. This noisiness is further complicated by the observation that many popular deep learning models generate fuzzy outputs with degraded accuracy around boundaries of objects (IBTEHAZ202074; pyramid). While this may be sufficient for high resolution imagery, small patches of trees in medium resolution imagery almost entirely consist of boundary pixels and thus classifiers must be designed with this constraint in mind.

While most CNN approaches rely on a single image input, satellite systems are designed to capture sometimes dozens of images per year of each location. Time series convolutional neural networks analyze a temporal sequence of imagery rather than a single input image. The temporal domain adds additional complexity to the satellite imagery that is especially useful for low- and medium-resolution satellite imagery where the complexity of each individual image alone may not be sufficient for deep learning. Additionally, time series models benefit from limiting the noise in individual satellite images driven by atmospheric conditions and the relative satellite and sun positions. Time series models, such as the convolutional long-short-term memory (cLSTM) and convolutional gated recurrent unit (cGRU) learn spatio-temporal relationships by extending the convolution operation to the temporal dimension

(convlstm). Each time step image is convolved with the prior time step and weight matrices that decide how much of the short term information to keep (e.g. forget gate) and how much long-term information to keep (e.g. reset gate). This allows for the generation of complex models of both short-term changes, such as leaf-out events, and long-term seasonal patterns. Indeed, the accuracy of medium-resolution remote sensing models have recently seen improvement by using multi-temporal image analysis (rubworm2018; Roy2019MultitemporalLU).

Recent developments in the field of computer vision include many modifications to CNN models that bring important improvements to the generalizability, pixel-level boundary accuracy, and performance on unbalanced classes, though their applicability to medium-resolution remote sensing models have often not been established. With regard to generalizability, new normalization methods such as layer normalization and batch renormalization standardize intermediate layers to add stochasticity and reduce the extreme predictions on test data that are outside of the range of the training data

(batchrenorm; layernorm). Squeeze and excitation layers force CNNs to learn finely-grained filters which reduce blurriness by rescaling the outputs based on a learned scoring map for either channels or pixels (csse). New loss functions, such as the focal loss, which weighs hard-to-classify samples more, the Lovász-Softmax loss, which focuses on regions rather than pixels, and the boundary loss, which focus on boundaries between classes, have greatly increased the abilities of CNN models to perform well in highly unbalanced classification scenarios, such as medical imagery analysis and land use classification (focalloss; boundaryloss; DBLP:journals/corr/BermanB17). Extending on these approaches, this paper uses fused multi-temporal imagery from Sentinel-1 and Sentinel-2 to construct a globally relevant deep learning model to classify tree presence at the ten meter scale.

2 Methodology

2.1 Data

A total of 4,500 training sample plots, distributed semi-randomly from -60 to +60 latitude, were labelled with visual interpretation of high-resolution imagery on Collect Earth Online, an online platform for systematically labelling geospatial data with high-resolution imagery (0.5 meters, WorldView) (Figure 1) (collectearthonline). Sample plots were sized 140x140 meters with sampling points positioned within at 10 meter intervals for 196 samples per plot. Pixels were marked positive if they intersected a tree identified through visual interpretation of a cloud-free, leaf-on high resolution image (Figure 2). The presence of a tree was determined based on the land use, geographical region, presence of a shadow, and size relative to identifiable shrubs or grass in proximity.

Tree presence at the ten meter scale was predicted with fused Sentinel-1 imagery, Sentinel-2 imagery, and slope derived from the MapZen digital elevation model (DEM). Sentinel-2 detects 13 bands with 10m, 20m and 60m resolution, including the visible, near infrared, and short-wave infrared spectrums. Sentinel-1 provides 10 meter synthetic aperture radar (SAR) data of the entire world every 12 days. Twenty-four cloud-free images of each 19,600 m2

(140 x 140 m) sample area, separated by fifteen days each, were created by removing and interpolating cloud cover and shadow from each Sentinel-2 image acquisition (process described below) and fusing the nearest Sentinel-1 image from January 1 to December 31 2019. For Sentinel-2, the 10 and 20 meter bottom-of-atmosphere (L2A) bands were selected. For Sentinel-1, VV-VH imagery with the gamma back scatter coefficient was used. Data was accessed through the Sentinelhub API. Twenty-meter Sentinel 2 bands were upscaled to ten meters with DSen2, which is a convolutional neural network (CNN) approach to pan sharpening Sentinel-2 imagery

(dsen2). Clouds were identified with S2Cloudless and cloud shadows were identified by generating a mask with the methodology proposed in candra2020 and removing cloud shadow masks which were more than 800 meters from an identified cloud. Images with more than 25% cloud or shadow cover were removed, and remaining clouds and cloud shadows greater than 250 m2 (25 pixels) were linearly interpolated with pixels from the nearest two clean time steps after which the Whittaker smoother was used ( = 800, d = 2) to interpolate missing pixels (whittaker).

The DEM was degraded with a 5x5-pixel median filter before calculating slope to reduce noise. In addition to the raw band values, the enhanced vegetation index (EVI, Eq. (1)), modified soil adjusted vegetation index (MSAVI2, Eq. (2)), and bare soil index (BI, Eq. (3)) were calculated and included as model input (evi; msavi; baresoilindex). Additional indices, including normalized difference vegetation index, soil adjusted vegetation index, and normalized difference moisture index, were tested but did not improve performance.

Figure 1: Locations of training sample plots. Each sample plot is 140 x 140 meters (approximately 2 hectares) with a 10 meter pixel size.
Figure 2: Screenshot from Collect Earth Online showing the plot labelling process. Red grid cells indicate the presence of a tree, while blue grid cells indicate the absence.

2.2 Model

The model consists of a fully convolutional neural network with a bidirectional convolutional gated recurrent unit (cGRU) encoder and a feature pyramid attention (FPA) decoder (convlstm; pyramid) (Figure 3). The cGRU takes as input the biweekly processed Sentinel-1 and Sentinel-2 bands, and uses two-dimensional 3 x 3 convolutions to generate feature encodings that represent the per-pixel change over time. The FPA layer takes the local features generated by the cGRU and increases their field of view, incorporating knowledge about features from up to 15 pixels away, while maintaining the fine-grained localization of feature maps. This is done by multiplying deeper convolutional layers by a 1x1 convolutional layer (Fig 6). The upsampling blocks in the FPA layer use an upsize convolution rather than a transpose convolution in accordance with the recommendations in odena2016deconvolution. These feature maps, which incorporate knowledge of both per-pixel and regional change over time, are then classified with a standard convolutional layer with a sigmoid activation.

Figure 3:

Overview of network architecture. Twenty-four input images for ten Sentinel-2, three indices, and two Sentinel-1 bands are passed to a bidirectional convolutional GRU to model temporal relationships. Next, the feature pyramid attention module integrates regional information with pixel-level information. Finally, Conv-BN-RELU-csSE-DropBlock blocks are combined with hypercolumns before sigmoid classification.

The cGRU layer uses layer normalization which standardizes features to avoid exploding gradients, and a channel squeeze and spatial excitation (SE) layer, which improves fine grained localization, between time steps (layernorm; csse). The SE block rescales each input channel c by multiplying by a sigmoid activated 1 x 1 x c convolution. Separate SE blocks are learned for the update and the reset gate of the GRU, respectively, and weights are shared between time steps.

The FPA layer is followed by two convolutional blocks with batch renormalization and csSE (Fig. 6) (batchrenorm; csse). Hypercolumns, which concatenate deep and shallow features to increase layer dimension, are placed before the final sigmoid classification layer to facilitate pixel-level accuracy (hypercolumns). The sigmoid bias layer was initialized in the same manner as in focalloss

to avoid model collapse in the early stages of training. All padding operations in the model used reflect padding to enforce the distributional consistency of border convolutions. Weights were initialized according to

heinitializer when combined with batch renormalization, and glorot uniform otherwise (glorotuniform)

. Non-linear activations for intermediate layers use the rectified linear unit (RELU)

(Nair2010RectifiedLU). The ‘rmax’ and ‘dmax’ batch renormalization parameters follow the learning schedule in the original paper (batchrenorm)

. All experiments were conducted in Tensorflow 1.13.1 and the presented model has 221 thousand learnable parameters

(tensorflow2015-whitepaper). Data and code are released for reproducibility on the author’s GitHub page.

The model was optimized with the Adabound optimizer with learning rates between 1e-4 and 2e-2, and was trained for 100 epochs on a NVidia K80 GPU

(adabound). A batch size of 20 was selected with equibatch sampling by tree cover percent (DBLP:journals/corr/BermanB17). To mitigate overfitting, several regularization methods were used. These include zoneout (prob = 0.2) in the GRU layer and dropblock (prob increases from 0 to 0.2 during training) after each intermediate convolutional layer (zoneout; dropblock). The loss function for the model was a combination of label-smoothed (prob = 0.10) binary cross entropy, which was weighted by the effective number of samples, and a modified boundary loss (labelsmooth; weighting; boundaryloss). The weight of the boundary loss (BL) increased proportionate to the cross entropy loss (CE) throughout training according to Equation (4), where is the label segmentation,

are the sigmoid probabilities,

is the spatial domain of , and is a distance map with respect to the boundary of the positive segments of .


The boundary loss was modified from boundaryloss to accommodate the smaller image sizes (196 pixels) and the higher variability in positive mask size in the present domain application. Specifically, we modify the distance map such that the loss attributed to pixels immediately proximal to a positive pixel is reduced for low tree cover and increased for high tree cover within a exponential [0, 1] range. In boundaryloss, these pixels all received a loss importance of one. Similarly, the loss attributed to pixels at the border is made more negative for low tree cover and approaches zero for high tree cover within a exponential [0, 1] range. In boundaryloss, these pixels all received a loss importance of zero. These changes have the effect of allowing low tree cover samples to have fuzzy borders (due to imagery mismatch between Sentinel and high-resolution images used for labeling), while enforcing that borders in high tree cover samples (non-forested patches in forested landscapes) have a high loss importance (Figure 4).

Figure 4: Overview of changes to loss from (boundaryloss). For the label mask, white pixels are positive, and black are negative. For the loss maps, brighter pixels indicate a higher loss importance. The loss used in this paper (middle) attributes higher importance to boundaries in samples with more trees, and vice versa.

The model outputs were post-processed to improve accuracy around the border of input tiles by smoothing the predictions over a tiled window with two-dimensional interpolation between overlapping patches. The prediction for each pixel was calculated as the weighted average of nine input images, which were shifted either up, left, down, right, or diagonally by 70 meters from the original bounds, weighted by the distance of each pixel to the center of each of the nine input images with a Gaussian filter with a 3.5 pixel standard deviation.

2.3 Validation Methods

Model performance was assessed at the global scale for pixel-level accuracy of tree identification and plot-level (140 x 140 meter) tree cover accuracy. Pixel accuracy measures the ability of the model to accurately segment local boundaries between tree patches, while plot-level tree cover accuracy validation measures the overall ability of the model to identify differences in tree cover. In addition to the global validation methods, model performance was also tested for regional accuracy in three selected 100,000 hectare landscapes in different biomes in Tanzania, Ghana, and Honduras. In all cases, accuracy validation was assessed against human-annotated high-resolution imagery, a method which has previously been used for accuracy assessments of wall-to-wall maps (hansen2013; zhang2019).

The proposed modelling approach was also tested against standard baselines, including random forests, support vector machines, and a U-net CNN (unet). The U-net is an often used CNN approach for remote sensing classification (DBLP:journals/corr/abs-1904-00592; Li2017DeepUNetAD)

. Baseline models were tested with different temporal aggregation methods, including mean, median, standard deviation, and stacking each image. The presented baseline models use the mean band reflectance for each pixel over the 24 cloud-free images. Common hyperparameters for each baseline approach were tuned using a brute force approach. The U-net CNN was designed in a similar manner to

unet, with the additions of reflect padding, batch renormalization, dropblock, and upsample convolution to match design choices of the proposed model. The U-net has 620 thousand trainable parameters and uses the Adabound optimizer with learning rates between 0.001 and 0.1.

2.3.1 Metrics

Model performance was evaluated with the user’s and producer’s accuracy. These metrics were modified in order to mitigate errors caused by variable coregistration consistency between WorldView and Sentinel imagery. Sub-pixel shifts in the Sentinel imagery are caused by resampling each image with nearest neighbor interpolation to fit within the plot boundaries due to the average coregistration error between images of 12 meters (sentinelqc). Additionally, because WorldView and Sentinel have different viewing geometries and use different digital elevation models (DEM) for orthorectification, the products may be locally misaligned by more than ten meters (K_b_2016; worldviewfuse; sentinelworldviewfuse). These local misalignments cause shifts between the WorldView labels and the Sentinel predictions, rendering per-pixel metrics useless in many cases (Figure 5). To account for this, the producer’s accuracy counts false negatives (FN) if the ground truth positive occurs more than ten meters from a predicted positive at location . Similarly, the user’s accuracy counts false positives (FP) if the predicted positive occurs more than ten meters from a positive ground truth pixel. The metrics are calculated according to Equation (5), where TP refers to true positives, refers to the prediction, and refers to the label.

Figure 5: The user’s (UA) and producer’s accuracy (PA) are sensitive to labelling choices due to coregistration errors between Sentinel and WorldView imagery. Red pixels denote trees, blue pixels denote background. The labelling scheme on the left results in low UA and PA scores when analyzed with Sentinel imagery, despite being visually accurate. The labelling scheme on the right achieves high UA and PA scores. Both labelling schemes achieve similar UA and PA scores (denoted with *) when adjusted with Equation (5).

2.3.2 Global validation

Figure 6: Locations of testing sample plots. Each sample plot is 140 x 140 meters (approximately 2 hectares) with a 10 meter pixel size.

Pixel-level performance was assessed with 1,100 2-hectare plots labelled in the same manner as the training data (section 2.1) (Figure 6). Seven hundred of the test plots were randomly distributed on land areas between -60 and +60 latitude. The other four hundred were randomly distributed within selected regions (including Central America, West Africa, East Africa, and India) to increase the ratio of test plots with high cloud cover and difficult land uses (such as plantation agriculture, agroforestry, step agriculture, and urbanization).

Figure 7: Locations of regional sample plots. Within each of the three regions, 100 evenly distributed 140 x 140 meter plots were labelled with a 10 meter pixel size for binary tree presence. Additionally, within each of the three regions, a randomly selected 200 hectare sub-region was labelled with a 10 meter pixel size for binary tree presence.

To further assess model performance at the global scale, the percent tree cover of a random sub-sample of 1,000 half-hectare plots across global drylands from bastin2017

, stratified by geography and tree cover decile, were reanalyzed with recent high resolution imagery (Figure

8). Thirty five percent of the reanalyzed plots disagreed with the labels in bastin2017 by more than 30% canopy cover. The labels for these plots were reassigned based on recent high resolution imagery, while the rest retained their original labels. One source of disagreement was due to land use change since the original analysis. Because the plots were stratified by tree cover percent, plots with partial canopy cover, were over-represented. These plots are also more likely to experience land use change than are plots with barren or dense canopy (woodyencroachment; kalamandeen). Another source of label disagreement arose from coregistration errors. Because more than half of 10-meter pixels in a half-hectare plot are border pixels, coregistration errors under 10 meters between the image analyzed in bastin2017 and recent high resolution imagery can alter canopy cover predictions by up to 50%. Finally, commentbastin identified significant classification error in the bastin2017 data set due to old imagery and a lack of quality assurance. Tree density was calculated as the proportion of plot pixels with predictions higher than a threshold determined by the receiver operating characteristic.

Figure 8: Locations of 1,000 half-hectare plots from bastin2017 stratified by canopy cover decile.

2.3.3 Regional validation

Model performance was also tested to ensure that the results for large geographic regions were consistent across different land uses and geographies. To do this, we prepared wall-to-wall maps of 100,000-hectare regions in three geographies (Figure 7). The selected geographies included a Semi-arid desert with agricultural production in Tanzania, a tree Savannah with agricultural production in Ghana, and a mosaic tropical broad leaf forest landscape in Honduras. Model performance was assessed for these three regions through two complementary methods. To assess performance over the total 100,000 hectares, 100 evenly distributed 2-hectare plots were labelled in the same manner as in section 2.1. Additionally, to assess performance across smaller regions, a 200 hectare subregion was randomly identified for each of the four regions, for which each 10 meter pixel was labelled for binary tree presence. Combining these two validation sources, a total of 39,600 pixels were evaluated for each of the three regions.

3 Results

3.1 Global accuracy

Figure 9: Performance of proposed model and baseline random forests (RF), support vector machine (SVM), and U-Net CNN measured with user’s and producer’s accuracy and the absolute percent error.
No tree Tree Prod. Acc.
Pred No tree
Usr. Acc.
Table 1:

Per pixel confusion matrix, and user’s, producer’s, and overall accuracy for background and positive classes.

The proposed model is able to segment trees in individual pixels across a variety of tree cover densities, land uses, biomes, terrains, and geographies. The model achieves 94% user’s accuracy and 95% producer’s accuracy across the 1,100 test plots, consisting of 215,600 pixels (Table 1). When compared to the baseline random forest, support vector machine, and U-net, the proposed model performs substantially better in low tree cover scenarios. In plots with below 40% canopy cover, the proposed model reaches 78% user’s and 75% producer’s accuracy (Figure 9). In comparison, the random forest only reaches 63% user’s and 50% producer’s accuracy for plots below 40% canopy cover, while the U-net reaches 56% and 55%, respectively. Averaged across each decile of canopy cover, the proposed model has a 7% and 9% higher user’s accuracy and 10% and 12% higher producer’s accuracy than the random forest and U-net, respectively.

Across all testing plots, the proposed model predicts tree cover (as the percentage of positive pixels in the 140x140m plot) with an average error of 7.1% (6.4 - 7.9%, 95% CI; Table 2). In comparison, the random forest’s tree cover error was 10.7% (9.7 - 11.6%) and the U-net’s was 11.3% (10.2 - 12.5%). In plots with between 30 and 70% tree cover, the proposed model reduces average error from 22.6% (20.0 - 25.1%) and 29% (26.2 - 31.6%) to 16.7% (14.6 - 18.8%) from the random forest and U-net, respectively (Figure 9).

The proposed model is not sensitive to cloud cover or steep terrain. In very cloudy test plots (at least 75% of imagery dates with at least 20% cloud cover), the proposed model maintains high accuracy, with 95 and 94% user’s and producer’s accuracy (Table 1). This reflects a 44% and 45% reduction in commission and omission errors when compared to the random forest, and a 37% and 25% reduction compared to the U-net (91% PA, 89% UA, random forest; 92% PA, 92% UA, U-net). In test plots with an average slope greater than 10%, the proposed model receives 97% user’s accuracy and 95% producer’s accuracy, again reducing omission and commission errors by almost half compared to the baseline approaches.

Parameter User’s acc. Producer’s acc. % error # pixels
Africa 0.91 0.95 6.5 88,984
Asia 0.92 0.92 8.1 24,304
Australia & Oceana 0.93 0.95 12.6 19,600
Europe 0.96 0.96 7.0 23,912
N. America 0.97 0.97 5.4 38,808
S. America 0.95 0.94 6.2 22,148
Cloud (%)
0-75 0.93 0.95 6.9 170,128
75+ 0.94 0.95 7.0 45,864
Slope (%)
0-10 0.92 0.95 6.9 176,400
10+ 0.97 0.95 6.9 39,592
Canopy (%)
0-20 0.66 0.60 3.9 115,248
20-40 0.88 0.87 14.6 21,560
40-60 0.90 0.93 17.0 11,956
60-100 0.97 0.99 7.5 67,228
Overall 0.94 0.96 7.1 215,992
Table 2: Performance of proposed model in different geographical and biophysical conditions, measured with user’s and producer’s accuracy and the absolute percent error.

The predicted tree cover for the 1,000 stratified plot locations from bastin2017 had an average of 0.85 (0.83-0.86, 95% CI) Pearson’s correlation with the cleaned tree cover labels from bastin2017. The Pearson’s correlation for the random forest model was 0.76 (0.74-0.79, 95% CI) and for the U-net was 0.77 (0.74-0.79, 95% CI). The proposed model classified above and below 10% tree cover with 92% and 91% user’s and producer’s accuracy, compared to random forest (90%, 82%, respectively) and U-net (92%, 85%, respectively) (Figure 10). This reflects relative decreases in errors of omission of 50% and 40% versus the random forest and U-net, respectively.

Figure 10: Confusion matrix of predicted versus labeled tree cover from bastin2017 for the proposed model, random forest, and U-net.

3.2 Regional accuracy

Proposed model Random Forest U-net
Parameter User Prod. User Prod. User Prod.
Ghana - region 0.80 0.82 0.77 0.77 0.70 0.70
Ghana - subregion 0.81 0.76 0.78 0.75 0.75 0.76
Tanzania - region 0.86 0.84 0.75 0.78 0.74 0.84
Tanzania - subregion 0.85 0.84 0.52 0.54 0.76 0.75
Honduras - region 0.99 0.98 0.97 0.97 0.96 0.95
Honduras - subregion 0.94 0.99 0.93 0.99 0.92 0.97
Table 3: User’s and producer’s accuracy for the selected 100,000 ha regions and 200 ha subregions in Ghana, Tanzania, and Honduras.
Figure 11: Predictions of the proposed model and baselines for the 200 hectare subregions in Ghana, Tanzania, and Honduras, along with the labels, based on high resolution imagery, and the high resolution base map. Predictions of hansen2013 are added for reference, where green pixels indicate tree cover, and pink pixels indicate tree cover loss.

The proposed model achieves an average of 87.5% user’s accuracy and 87.2% producer’s accuracy across the 118,800 labelled pixels in Ghana, Tanzania, and Honduras. When compared to the baseline models, this reflects a 7.2% and 7.3% increase over the random forests, and a 7% and 4.2% increase over the U-net in terms of user’s and producer’s accuracy (Table 3).

In the 200 hectare subregion in Ghana, the proposed model accurately locates both the scattered trees on farm land and the tree corridors. In comparison, the random forest misses the majority of the scattered trees on farm land, while the U-net overpredicts the density of large patches of trees, and completely misses many small patches of trees. Overall, the proposed model reduces relative commission and omission error by 13% and 22% versus the random forest, and 33% and 40% versus the U-net for the larger region.

The proposed model is able to locate most individual trees in the 200 hectare subregion in Tanzania. The random forest model only detected large patches of trees, and the U-net again over-predicted the density of large patches while missing most of the individual trees. When compared to the baseline random forest model, the proposed model reduces relative commission and omission error by 63% and 27% for the larger region, and commission error by 46% versus the U-net.

Figure 12: Performance comparisons with results from zhang2019 for a 42 hectare region in Senegal. Labels from brandt2018 are denoted in green and predictions are denoted in magenta. The proposed model identifies 85% of the trees labeled in brandt2018.

In the 200 hectare subregion in Honduras, the proposed model identified most of the small patches of barren land within the surrounding dense tree cover, while also identifying many of the individual trees that on farmland. In comparison, the random forest’s predictions were very noisy around the boundaries between dense and scattered tree cover, and the individual trees on farmland were not identified. The U-net was unable to identify the small patches of barren land, and overestimated tree cover substantially. When visually compared with the results from hansen2013, the proposed model performs much better at identifying trees scattered trees, while hansen2013 performs visually similar for trees inside closed canopy forests (Figure 11).

To further test the ability of the proposed model to accurately identify trees in regions with open canopy and sparsely distributed trees, we compare predictions with those of zhang2019. zhang2019 used a support vector machine to identify trees in 3 years of fused Sentinel 1 and Sentinel 2 imagery, with training data consisting of 20,000 labelled pixels in high resolution imagery across the western Sahel. While the proposed model has zero labelled training data in Senegal, it is able to identify more than 85% of the trees in a 42 hectare region of Senegal where zhang2019 only identify 35% (Figure 12).

4 Discussion

4.1 Implications for global forest monitoring efforts

Global assessments of tree cover have underestimated tree cover extent in drylands as well as scattered tree cover in urban environments and mosaic landscapes (bastin2017; OTTOSEN2020101947; Milodowski_2017)

. This has caused considerable uncertainty in estimating the extent of these forms of tree cover at a global scale. Through human photointerpretation of 500,000 plots with high resolution satellite imagery,

bastin2017 found that global remote sensing classifiers underestimated dryland forest extent by 40%. With regard to trees on cropland, treesonagland found that more than 40% of cropland on earth has at least 10% canopy cover. However, without a globally consistent and efficient method to measure these types of sparse and scattered tree cover, estimates continue to rely either on aggregating data from multiple, small-scale assessments, or on human photointerpretation.

The methodology presented in this paper enables the assessment of sparse and scattered tree cover with medium resolution satellite imagery. The accuracy of monitoring tree presence in highly heterogeneous areas such as those with sparse and scattered tree cover was improved by as much as 20% over standard remote sensing classifiers such as random forests and support vector machines. Additionally, the proposed approach reduces omission and commission errors in dense canopy cover, high cloud cover, and mountainous regions by nearly half compared to standard remote sensing classifiers. Global data that combines contiguous, closed canopy forest extent with tree cover data in low canopy cover regions will have significant implications for global land use and land change monitoring.

Many types and drivers of land use change, such as small area deforestation, encroachment, habitat fragmentation, and natural regeneration, occur primarily in highly heterogeneous regions like the urban fringe, forest perimeters, and riparian zones (Tyukavinaeaat2993; woodyencroachment; Pickett331; Rex1990TheFS). For instance, the majority of woody encroachment into open areas in continental Africa occurs in regions with moderate initial tree cover (between 30 - 60%) (woodyencroachment). Furthermore, more than two thirds of deforestation in both the Congo basin and the Amazon between 2000 and 2014 was driven by smallholder clearing (kalamandeen). Human-driven increases in canopy cover, such as afforestation and reforestation, also tend to be comprised of many small tree planting efforts, often by individual land owners (Holl455; MELO2013395). These highly heterogeneous regions where important land use changes are occurring are known to have lower classification accuracy for tree cover and land use than larger, homogeneous regions in many remote sensing approaches (Smith2002IMPACTSOP; Milodowski_2017; reddplus). This limits the applicability of existing global tree cover datasets to monitor small area deforestation, natural regeneration, and landscape restoration.

More than 50 countries have committed to putting more than 170 million hectares of land under restoration by 2030 (fao_wri). Restoration land use interventions target open canopy forests and trees outside of forests, including agroforestry, silviculture, and natural regeneration. While global data like that of hansen2013 have transformed forest monitoring and carbon accounting, there is no comparable global method or data to monitor progress on restoration. Instead, restoration monitoring programs are generally constructed at the national or landscape scale, and rely on field data and human interpretation of satellite imagery (Bey2016CollectEL; afr100). The methodological improvements for tree identification in areas with sparse and scattered tree cover presented in this paper may allow for increased spatial and temporal resolution of monitoring data for land use management decisions by reducing reliance on photointerpretation and field data.

4.2 Implications for remote sensing methodologies

The applications of deep learning methodologies such as CNNs to medium-resolution remote sensing imagery are not well understood. The majority of classification systems rely on random forests or support vector machines rather than CNNs (ma2019). Indeed, in the present study, the U-net baseline performed similarly to the random forest, despite being more than an order of magnitude more computationally intensive. Many studies have concluded that the U-net significantly outperforms other machine learning classifiers for remote sensing tasks with high-resolution imagery (DBLP:journals/corr/abs-1904-00592; Li2017DeepUNetAD; unet3; unet4; unet5). The relative underperformance of the U-net with medium-resolution imagery highlights a need for alternative CNN approaches such as the one proposed in this paper.

By explicitly designing the model architecture to mitigate the constraints of medium resolution data, the proposed model was able to significantly outperform the U-net model and the other machine learning baselines in every decile of canopy cover (Figure 9). These design choices included increasing data complexity with temporal data and the convolutional GRU, choosing model building blocks that are explicitly designed for pixel-level accuracy (FPA, csSE), and using a loss function explicitly designed for different tree cover scenarios and coregistration errors (boundary loss and label smoothing). Because the improvements in accuracy are rooted in the temporal CNN architecture, it is likely that the proposed model design could also greatly benefit land use modeling and the identification of objects in satellite imagery such as roofs, power plants, boats, or roads.

4.3 Future research

Despite the numerous improvements presented in this paper, there are several avenues for future research in imagery preprocessing that could further improve accuracy in low tree cover regions. Due to the global nature of this work, the present approach naively fuses Sentinel 1 and Sentinel 2 by simply stacking the Sentinel 1 ground range detected (GRD) product with the Sentinel 2 L2A product. Because of the sub-pixel size of trees in Sentinel imagery, the pixel-level match up between Sentinel 1 and Sentinel 2 is very important for reducing the blurriness of predictions. Many new approaches for fusing Sentinel 1 and Sentinel 2 have recently been proposed, such as probabilistic Bayesian models, affine transformations, and CNN-based approaches (plsa; fuse). While these approaches are often limited to small geographies and their benefit to global-scale preprocessing is not yet clear, increasing the pixel-level accuracy of Sentinel 1 and 2 fusion would likely improve the identification of small tree patches.

Because the proposed model uses multitemporal imagery, the interpolation of cloud and cloud shadow pixels is very important to accurately reconstruct imagery, especially when the gap between cloud-free image acquisitions is substantial. Although the proposed model in this paper performs much better than the baseline models in high cloud cover scenarios, there is room to improve the interpolation of cloud and cloud shadow when constructing time series imagery. The presented methodology uses the Whittaker smoother, which whittakermodis found to be better at interpolating time series gaps in MODIS imagery in comparison to temporal smoothing and gap filling, Gaussian functions, low pass filters, and many other per-pixel smoothing functions. In contrast to these per-pixel approaches, a number of spatio-temporal methods for gap filling have been proposed. Such methods include conditional generative adversarial networks (gan1), convolutional neural networks (ZHANG2020148), patch-based cloning (patchclone), k-nearest neighbors (knn), and spatio-temporal Markov random fields (CHENG201454).

Finally, future research should identify the benefits of the proposed model for change detection, such as identifying small-scale deforestation as well as forest and landscape restoration activities. While the present paper demonstrates comparative benefits over commonly used remote sensing classifiers for tree identification in heterogeneous landscapes, the potential benefits for change detection are left to future work. In a similar manner, future research should detail how the methodology presented in this paper can be integrated into national and subnational environmental monitoring programs. One potential avenue is through human-in-the-loop artificial intelligence, where locally generated data is used to improve the local accuracy of globally developed models

(logar2020pulsesatellite). Because neural networks allow for online learning, they can easily be ”fine-tuned” with new data, while models such as random forests and support vector machines cannot (Zhou_2017_CVPR; NOGUEIRA2017539). This allows for the combination of participatory mapping and crowd sourcing, where local stakeholders label high resolution imagery with the power of neural network based remote sensing classifiers.