CloudFindr: A Deep Learning Cloud Artifact Masker for Satellite DEM Data

10/26/2021 · by Kalina Borkiewicz, et al. · University of Illinois at Urbana-Champaign

Artifact removal is an integral component of cinematic scientific visualization, and is especially challenging with big datasets in which artifacts are difficult to define. In this paper, we describe a method for creating cloud artifact masks, which can be used to remove artifacts from satellite imagery, using traditional image processing combined with deep learning based on U-Net. Compared to previous methods, our approach does not require multi-channel spectral imagery but performs successfully on single-channel Digital Elevation Models (DEMs). DEMs are a representation of the topography of the Earth and have a variety of applications including planetary science, geology, flood modeling, and city planning.




1 Related Work

Cloud detection is a specific application of the broader field of anomaly detection, whose methods span many techniques and applications. Techniques range from information-theoretic to classification-based to statistical; applications span cyber-intrusion detection, image processing, and sensor networks. Deep learning methods can be applied to anomaly detection using algorithms that are supervised, unsupervised, hybrid, or one-class neural networks [anomaly-survey-deep]. An issue when attempting anomaly detection on spatiotemporal data is that there is often no clear boundary between normal and abnormal cases [cao2018] – in the case of cloud detection, it can be difficult to determine whether a pixel contains a cloud or a snow-capped mountain.

Much research on cloud detection in particular focuses on spectral imagery as input data, rather than DEM input. Cloud detection methods for these data are based on cloud optical properties and may detect cloud/no-cloud, cloud/snow, and/or thin/thick cloud regions of an image [cloud-survey]. Fmask [fmask2012] is a popular algorithm for detecting both clouds and cloud shadows in spectral imagery. A recent paper by Wu et al. [wu2016cloudfinder] uses DEM data, but to validate their spectral cloud-finding results rather than for detection directly.

The method described in this paper uses deep learning image segmentation to detect and mask out cloud regions. It is based on the popular U-Net algorithm [unet], initially developed for medical image segmentation but since adopted in other fields that require classifying image pixels. The RS-Net [JEPPESEN2019247] and MC-Net [mcnet2021] methods also use U-Net for cloud detection, but once again on spectral imagery rather than DEM data. Other notable recent machine learning image segmentation work based on U-Net includes a method for identifying vortex boundaries in scientific visualizations [berenjkoubvortex1] and a method for removing clouds in 3-channel RGB spectral imagery with generative adversarial networks.


2 Method


2.1 Ground Truth Mask Creation

Figure 3: Example showing the inputs (left, middle) used to produce a hand-drawn mask (right) for one sample timestep. Top row shows individual strips, bottom row shows accumulated buildup of strips. Left column shows DEM data, middle column shows artificially shaded preview, right column shows resulting mask (repeated in both rows).

The labelled dataset used as the ground truth in training was created as a byproduct of the work toward the documentary Atlas of a Changing Earth, co-produced by Thomas Lucas Productions and the Advanced Visualization Lab at the National Center for Supercomputing Applications. The artifacts were masked and removed manually in order to fit the timeline of the film production, and these resulting masks served a secondary purpose as the inputs to our machine learning model.

The first step in acquiring the data was identifying an area of interest and downloading a subset of the data at a suitable resolution. A x pixel region was initially selected around the Jakobshavn glacier, a -square km glacier in Greenland, and serves as our dataset. GEOTIFF images were downloaded from the ArcticDEM website and aligned using the georeferenced imagery, so that each new data strip would fall in the correct pixel location within our selected region of interest. Several derivative versions of the data were created, among others: (1) images that show one strip at a time and leave the rest of the frame blank; (2) images that accumulate strips up to the current timestep; (3) images where each pixel records the time at which an accumulated pixel was added; and (4) images artificially shaded using gdaldem’s “hillshade” mode for easier visual inspection.
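The derivative datasets (1)–(3) above can be sketched as a single pass over the strips. This is a hypothetical reconstruction, not the authors' pipeline: array shapes and the NaN no-coverage convention are assumptions, and the real strips would be read from GEOTIFFs with a geospatial library.

```python
import numpy as np

def build_derivatives(strips):
    """strips: list of 2D arrays in time order, NaN where a strip has no coverage.

    Returns per-timestep (strip, accumulation) pairs and a map of when
    each pixel was last updated (-1 = never). Sketch only; shapes assumed.
    """
    accum = np.full_like(strips[0], np.nan)   # (2) accumulated buildup
    added_at = np.full(strips[0].shape, -1)   # (3) timestep each pixel was added
    per_step = []
    for t, strip in enumerate(strips):
        valid = ~np.isnan(strip)
        accum = np.where(valid, strip, accum)  # new strip overwrites old pixels
        added_at[valid] = t
        per_step.append((strip.copy(), accum.copy()))  # (1) and (2) per timestep
    return per_step, added_at
```

The `added_at` map is what later makes the per-timestep "motion masks" cheap to derive: a pixel moved at timestep `t` exactly when `added_at == t`.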

A multimedia specialist on the team used the software Nuke to visually inspect the individual DEM strips, comparing them with strips gathered immediately before and after to identify and manually mask out areas that appeared to be artifact-ridden. Using a visual effects technique called rotoscoping, in which a vector mask is created in one image frame and filled in with imagery from another, the expert drew the masks for each new data strip by comparing the various images described above over time, interactively adjusting image intensity as needed for better visual acuity. Figure 3 shows a sample of the inputs to this manual process as well as the output mask for a single timestep.

The hand-drawn masks were not pixel-precise, but were overdrawn for reasons of convenience - e.g. if % of a strip was cloud-covered, it was more time-efficient to mask out the whole strip than to find the individual pixels that were valid. This was satisfactory for purposes of the documentary, but would not be suitable for a machine learning task. We therefore created a second set of “motion masks” where each pixel contained a 1 only if the pixel had been updated (moved) in the current timestep, and 0 otherwise, based on derivative data version (3) described above. Multiplying these two masks together clipped the expert-created overdrawn masks to only those pixels present in the strip at that timestep. The resulting masks are both expert-driven and pixel-precise.
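The clipping step above reduces to an element-wise product of the two binary masks. A minimal sketch (array contents are illustrative, not the paper's data):

```python
import numpy as np

def precise_mask(hand_mask, motion_mask):
    """Clip an overdrawn hand mask (1 = cloud) to pixels actually updated
    in this timestep (motion_mask, 1 = updated)."""
    return hand_mask * motion_mask

hand = np.array([[1, 1],
                 [1, 0]])    # expert overdrew a whole corner of the strip
motion = np.array([[1, 0],
                   [1, 1]])  # pixels the new strip actually touched
```

A pixel survives only if the expert flagged it *and* the strip updated it, which is exactly the "expert-driven and pixel-precise" property claimed above.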

2.2 Data Pre-Processing

Data must be processed prior to training in order to optimize training time and results. First, each image and its corresponding ground-truth mask is subdivided into patches of size x pixels. This size was chosen to divide cleanly into whole numbers when downsampled by the U-Net algorithm. Other patch sizes, ranging from roughly x to x, were tested during parameter tuning, and this size offered a good balance between processing speed and a manageable number of output images. Patches were set to overlap one another by pixels to account for artifacts around the borders of the image, which are known to occur with many Convolutional Neural Network-based image processing algorithms [JEPPESEN2019247]. This also had the result of creating more training data with different patch croppings. The overlap value was selected by visually inspecting a sample of predicted output masks and measuring the region of consistently inaccurate predictions around the borders. Because clouds are rarer than non-clouds in the data and are the subject of interest, only patches with at least one cloud pixel (as determined by the ground-truth mask) were saved. The original images of size x were converted into patches of size x. Scripts were developed to split a full-sized image into patches and to reassemble patches into a full-sized image.
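The splitting step might look like the following sketch. The 256 px patch size and 32 px overlap are placeholders (the paper's exact values are not reproduced here), and the cloud-containing filter is the "at least one cloud pixel" rule described above.

```python
import numpy as np

def extract_patches(image, mask, patch=256, overlap=32):
    """Split an image/mask pair into overlapping patches, keeping only
    patches that contain at least one cloud pixel. Sizes are assumptions."""
    step = patch - overlap
    out = []
    h, w = image.shape
    for y in range(0, h - patch + 1, step):
        for x in range(0, w - patch + 1, step):
            m = mask[y:y + patch, x:x + patch]
            if m.any():  # keep only patches with >= 1 cloud pixel
                out.append((image[y:y + patch, x:x + patch], m, (y, x)))
    return out
```

Storing the `(y, x)` origin with each patch is what allows the companion reassembly script to stitch predictions back into a full-sized mask.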

Our initial machine learning model used these images as training data, but produced poor results in which many discontinuous, individual pixels were identified as clouds rather than broad, connected areas. To resolve this issue, a second-order textural analysis pre-processing step was added to create derivative data that considers the spatial relationships among image pixels. A Gray Level Co-occurrence Matrix (GLCM) [glcm] is an image representation that tracks how combinations of pixel values (gray levels) co-occur in an image, from which texture features such as contrast, dissimilarity, homogeneity, and entropy can be derived. Figure 4 shows three of these features over different types of land cover. Calculating the GLCM requires specifying two parameters - the window size to use around each pixel, and the relationship direction, which is the distance vector between the reference pixel and the neighborhood pixel (often taken as a single unit distance in each of the directions left, right, up, and down). In order to consider both small-scale and large-scale texture features, -, -, and -pixel window sizes were used to create three derivative datasets, to be used in an ensemble method of cloud mask prediction. Each of these datasets consists of -channel textural “images”. After the GLCM calculations, the images were normalized, as a best practice for machine learning.
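To make the GLCM idea concrete, here is a minimal from-scratch sketch for a single direction (right neighbour) with two of the features named above. Libraries such as scikit-image provide a full implementation; the quantization level count (8) here is an arbitrary assumption.

```python
import numpy as np

def glcm_features(patch, levels=8):
    """Co-occurrence counts for horizontally adjacent pixels, then
    contrast and homogeneity. patch values assumed normalized to [0, 1]."""
    q = np.clip((patch * levels).astype(int), 0, levels - 1)  # quantize gray levels
    P = np.zeros((levels, levels))
    # count (left pixel, right neighbour) value pairs
    np.add.at(P, (q[:, :-1].ravel(), q[:, 1:].ravel()), 1)
    P /= P.sum()                                  # normalize to a joint distribution
    i, j = np.indices(P.shape)
    contrast = (P * (i - j) ** 2).sum()           # high when neighbours differ
    homogeneity = (P / (1 + np.abs(i - j))).sum() # high when neighbours agree
    return contrast, homogeneity
```

Computed over a sliding window, per-pixel features like these form the multi-channel textural "images" fed to the network; a flat surface yields zero contrast and maximal homogeneity, while noisy cloud texture scores the opposite.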

Figure 4: GLCM features for three main types of land covers.

2.3 Deep Learning for Cloud Prediction

Figure 5: The CloudFindr architecture, based on U-Net [unet].

U-Net was selected as the basis for CloudFindr. Other architectures were considered - notably RS-Net [JEPPESEN2019247] and MC-Net [mcnet2021] - which are specializations of the more basic underlying U-Net algorithm, optimized for different use cases: RS-Net for spectral and MC-Net for multi-channel satellite imagery. U-Net was chosen as it is more general and allows for customization at a lower level. The CloudFindr architecture is outlined in Figure 5. The downstream branch consists of four convolutional blocks, each a combination of two convolution-and-ReLU operations, followed by a maxpool that reduces the image dimensions by a factor of two (with stride and kernel size ). At the end of the downstream branch, the input is reduced to a size of width/16 by height/16 by features. The upstream branch consists of four upsampling convolutional blocks; each first upsamples the input by a factor of two using up-convolution followed by a ReLU operation. A final convolutional layer converts the resulting channels into two, followed by a softmax to obtain a probability for each class, “cloud” versus “non-cloud”. The resulting image contains a pixel-wise confidence between 0 and 1 for whether that pixel contains a cloud. This image is thresholded to produce discrete 0 or 1 values in the final output mask, giving a prediction of “cloud” or “no cloud”.
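The final classification step (two-channel logits → softmax cloud probability → binary mask) can be sketched as follows. The 0.5 threshold is an assumption for illustration; the paper's threshold value is not given here.

```python
import numpy as np

def logits_to_mask(logits, threshold=0.5):
    """logits: array of shape (2, H, W) for the classes (non-cloud, cloud).

    Returns (binary mask, per-pixel cloud probability). Sketch only.
    """
    # numerically stable softmax over the class axis
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    p_cloud = e[1] / e.sum(axis=0)
    return (p_cloud >= threshold).astype(np.uint8), p_cloud
```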

The dataset has a -- split between training, validation, and testing. The hyperparameters of loss function, optimizer, learning rate, regularization, and number of epochs were tuned via controlled experiments. A combined evaluation of IoUs and segmentation results was performed after each experiment to determine whether the current variable value would be retained for subsequent experiments. The optimal combination of parameters was found to be: loss function weights of [ , ] to account for the imbalance between the number of instances of each class, the Adam optimizer with a learning rate of , no dropout regularization, and epochs. Both the Adam and SGD optimizers were tested with learning rates between and ; the best results came from Adam.
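The class-weighted loss idea can be illustrated with a plain binary cross-entropy sketch. The weight values below are invented for demonstration (the paper's weights are not reproduced here); the point is only that under-weighting the abundant "non-cloud" class makes a missed cloud cost more than a false alarm.

```python
import numpy as np

def weighted_cross_entropy(p_cloud, target, w_cloud=5.0, w_clear=1.0):
    """Per-pixel weighted binary cross-entropy. p_cloud and target are
    arrays of cloud probabilities and 0/1 labels; weights are assumptions."""
    eps = 1e-7
    p = np.clip(p_cloud, eps, 1 - eps)  # avoid log(0)
    per_pixel = -(w_cloud * target * np.log(p)
                  + w_clear * (1 - target) * np.log(1 - p))
    return per_pixel.mean()
```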

Initially, the model was run on the derivative datasets with each GLCM window size of , , and , with the aim of finding a single optimal window size. As designed, all resulting predictions skewed toward higher recall rather than higher precision, tending to over-label areas as “cloud” rather than under-label them. However, visual analysis of the output masks made it clear that the three runs tended to agree with one another about areas correctly identified as clouds, but disagreed about areas labelled incorrectly. This inspired the use of an ensemble method for gathering the final result. The final prediction combines the results of all three runs by multiplying their outputs together. This significantly reduces the overall confidence value, but if any one run predicts a 0 value (i.e., that no clouds are present), it overrides the other predictions and a 0 is placed in the final output mask. The multiplied confidence is thresholded with a value of to create the final binary cloud/non-cloud prediction. Figure 6 shows one example patch prediction.
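The multiplicative ensemble vote reduces to a few lines. The threshold on the product is a placeholder value here, not the paper's; the behaviour to note is that any single near-zero prediction vetoes the pixel.

```python
import numpy as np

def ensemble_vote(preds, threshold=0.1):
    """preds: list of per-pixel cloud-confidence maps (one per GLCM window
    size). Multiplying means any confident 'no cloud' vote wins.
    The threshold value is an assumption."""
    prod = np.prod(np.stack(preds), axis=0)
    return (prod >= threshold).astype(np.uint8)
```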

Figure 6: One example patch where it would be difficult for a casual observer to identify clouds, but the expert and machine learning prediction have closely-aligned results. From left to right: Input DEM patch, ground truth mask hand-drawn by an expert, confidence of prediction after ensemble voting, final thresholded predicted mask.

When a cloud is mislabelled as a non-cloud, the error most often appears around the perimeter of a correctly labelled cloudy area. To account for this, a final post-processing step is applied to dilate the image masks with a kernel of size (,). This reduces the error around the edges of cloud regions, and creates masks that are slightly “overdrawn”, similar to how the human expert performed manual rotoscope labelling.
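The dilation step might be implemented with a standard morphological operation, for example via SciPy. The 5x5 structuring element below is an assumption; the paper's kernel size is not reproduced here.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def dilate_mask(mask, size=5):
    """Grow the binary cloud mask outward so edges are slightly overdrawn.
    Kernel size is an illustrative assumption."""
    return binary_dilation(mask, structure=np.ones((size, size))).astype(np.uint8)
```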

3 Results

Figure 7: Confusion matrix showing the success of the predictions after all processing.
Figure 8: Images showing the same single frame of a final 3D render. Top: using no cloud mask. Middle: using cloud mask created via the method described here. Bottom: using masks created manually by a rotoscoping expert. Red boxes draw attention to areas with especially visible clouds; yellow boxes show that the clouds have been mostly removed; green boxes show that they have been entirely removed.

The neural network was trained on a GM200GL Quadro M6000 NVIDIA GPU for approximately hours. In the final result, the model was able to correctly identify cloudy DEM pixels % of the time. The mean average precision of the optimal model described above is % and the mean IoU is %, with a further breakdown for each class shown in Figure 7.

The output of the described algorithm is patches of size x with values of 1 where there are likely clouds present, and 0 where there are not. These patches are stitched back together to create masks of size x which can be multiplied against the DEMs of size x around the Jakobshavn area. The DEM strips and masks are then accumulated to create the final DEMs to be used in the 3D cinematic rendering. Figure 8 shows how our result compares to the ground truth in final 3D rendered imagery, as well as what the render looks like without cloud removal. These renderings are created with the software Houdini, where the DEM values are used to drive both the height and the color of the land. In this figure, the vast majority of the cloud artifacts have been removed, and the ones that have been missed are not as visually disturbing as the more prominent spikes.

4 Conclusion and Future Work

In this paper, we describe CloudFindr, a method of labelling pixels as “cloud” or “non-cloud” from a single-channel DEM image. We first extract textural features from the image with varying window sizes. We feed this derived data into a U-Net based model, trained on labelled data created by an expert, to create image segmentation predictions. The results have high accuracy as demonstrated both by metrics and by a 3D rendering created from the data.

In future work, we plan a large hyperparameter tuning study covering feature window sizes, learning rate, momentum, and batch size to optimize our results. Additionally, we would like to apply this method to DEM datasets beyond the Jakobshavn region of ArcticDEM, and to incorporate the time dimension into training to differentiate strips that update a previously seen area from strips covering a new region.

Thank you to Donna Cox, Bob Patterson, AJ Christensen, Saurabh Gupta, Sebastian Frith, and the reviewers. This work was supported by the Blue Waters Project, National Science Foundation, National Geospatial-Intelligence Agency, and Fiddler Endowment.