Project Website: https://github.com/savasozkan/cloud_detection
The presence of clouds due to climate factors limits the clear acquisition of content information from the Earth surface for almost all optical sensors. Ultimately, this reduces visibility and adversely affects the processing of data for many remote sensing applications such as classification, segmentation and change detection. Hence, the detection/elimination of cloud coverage constitutes an important pre-processing step for remote sensing.
In particular, RGB color bands are more sensitive to these atmospheric scattering conditions than longer-wavelength sensors (i.e. infrared/multi-spectral). Thus, the problem becomes even harder, and the spatial content of the image needs to be leveraged rather than only the spectral properties of clouds, as is done for multi-spectral/infrared sensors. For this reason, addressing the problem from the perspective of object segmentation and classification can yield more intuitive results. Moreover, more generalized solutions, i.e. ones that do not rely on sensor-specific rules/thresholds, can be presented [3, 4].
In this paper, we tackle the cloud detection problem by presenting a framework based on a deep pyramid network (DPN) architecture [5, 6]. Compared to the existing rule-based methods [7, 8, 9, 10], the proposed method exploits the texture information exhibited by cloudy/non-cloudy pixels through high-level features. This improves classification decisions without the need for any specific spectral information, since a pre-trained encoder network is capable of extracting rich and distinct high-level representations for visual objects in the images. Moreover, due to the architecture, the network is optimized concurrently for both the segmentation and classification phases. Lastly, since the ground truth cloud masks are quite noisy (i.e. achieving perfect pixel-level annotations is quite difficult), the use of a pre-trained model for the abstract representation of an input provides robustness to the overall segmentation and classification phases.
The rest of the paper is organized as follows: related work is reviewed in Section 2. Section 3 presents the problem statement and the details of the proposed method. The dataset, baseline methods and experimental results are explained in Section 4, and the paper is concluded in Section 5.
2 Related Work
In this section, we review the literature on RGB color satellite images as well as on other optical sensors, such as multi-spectral/infrared, to demonstrate the complexity of the problem in the visible domain.
The methods used for multi-spectral/infrared bands are frequently based on radiometric properties of clouds/surface such as reflectance and temperature. The methods in [7, 8] exploit the variations of reflectance in thermal bands to distinguish clouds from the surface. Harb et al. propose a processing chain based on the thermal pattern of clouds with morphological filtering. Similarly, Braaten et al. extend the assumption to multi-spectral data. However, these methods depend heavily on sensor models (i.e. since they are rule-based methods), and the derived solutions cannot be generalized to different sensors by using similar assumptions for band information.
In contrast, multi-temporal methods aim to detect clouds based on background changes over time, using data acquired at different time instances. Zhu et al. combine thermal cloud patterns with time-series data to detect more accurate cloud masks. Moreover, temporal data has also been used to estimate clouds with a non-linear regression algorithm. However, the main limitation of these methods is that the time series is assumed to vary smoothly for ground surfaces while changing abruptly for clouds. Furthermore, recording such dense data increases the operational cost in practice.
In order to generalize the solution, classification-based approaches learn a set of parameters from training samples to distinguish clouds from the surface. Hu et al. extract several low-level features, such as color and texture features, to estimate pixel-level masks. More recently, locally sampled patches (i.e. obtained by a Super-Pixel (SP) algorithm) have been classified as cloud or non-cloud with a Convolutional Neural Network (CNN).
3 Cloud Detection With Pyramid Networks
As mentioned, since there is no explicit spectral/physical pattern for clouds in RGB color satellite images, we treat the problem as an object segmentation and classification problem in order to make a realistic problem formulation.
In particular, the texture details around cloudy regions exhibit distinct visual patterns that are useful for the detection/segmentation phases. Our aim is to extract high-level abstract representations from data and iteratively merge them to make pixel-level classification decisions. Moreover, the proposed method computes the segmentation and classification phases concurrently, so that the network is optimized in an end-to-end manner; thus, there is no need to employ these stages separately, as in approaches that combine Super-Pixel segmentation with a CNN.
Suppose we are given an RGB color satellite image $x$, and the method aims to generate an image mask $y$ that implicitly corresponds to two-channel classification decisions for ground surface and cloud/haze coverage.
Therefore, the main objective is to learn sets of parameters $\theta_e$ and $\theta_g$ for the encoder and generator functions $E$ and $G$ such that the input-target error is minimized for a set of training pairs $\{(x_i, y_i)\}$ based on a loss function:
$$\mathcal{L}(\theta_e, \theta_g) = -\frac{1}{N} \sum_{i=1}^{N} y_i \log \hat{y}_i \qquad (1)$$
where $\hat{y}_i = G(E(x_i; \theta_e); \theta_g)$ is the mask prediction of the network for the input $x_i$, and the product is summed over all pixels and both channels. The softmax cross-entropy loss in Eq. 1 drives the network toward the optimal input-target transformation by maximizing the similarity between predictions and targets. Moreover, $N$ corresponds to the mini-batch size. In the inference stage, the decisions of cloudy/non-cloudy coverage are computed based on the outputs of these learned functions.
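To make the objective concrete, the following is a minimal sketch of a pixel-wise softmax cross-entropy in plain Python. It is illustrative only: the function names and the flat list layout are assumptions, and averaging over every pixel of the mini-batch is one possible normalization rather than necessarily the one used in our implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pixel_cross_entropy(scores, label):
    """Cross-entropy of one pixel: -log of the probability of the true class."""
    return -math.log(softmax(scores)[label])

def batch_loss(batch_scores, batch_labels):
    """Mean pixel-wise softmax cross-entropy over a mini-batch.

    batch_scores: list of images, each a flat list of [ground, cloud] scores
    batch_labels: list of images, each a flat list of 0 (ground) / 1 (cloud)
    """
    total, count = 0.0, 0
    for scores_img, labels_img in zip(batch_scores, batch_labels):
        for scores, label in zip(scores_img, labels_img):
            total += pixel_cross_entropy(scores, label)
            count += 1
    # Assumption: normalize by the number of pixels in the mini-batch.
    return total / count
```

With equal scores for both classes the loss of a pixel is log 2, and it approaches zero as the score of the correct class dominates.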
Our deep network architecture consists of two main filter blocks [5, 6]. First, the encoder block extracts robust abstract representations from an RGB color image. Then, the generator block computes pixel-level segmentation and classification masks according to the responses of the encoder block. The overall architecture is illustrated in Figure 1.
Encoder: The encoder block takes an image as input and iteratively computes abstract representations by down-sampling responses. In practice, the goal of the block is to reveal distinct patterns in the data that assist the generator in obtaining an optimal image mask. Moreover, information flow to the generator is maximized with skip-connections.
Throughout the paper, we experimented with two different encoder models:
First, a model with 5 convolutional layers and random parameter initialization is used. At each layer, we employ a batch normalization layer and an activation function, i.e. ReLU, after a convolution layer. The responses are then down-sampled with stride 2. However, we found that random initialization fails to reach an optimal solution, because the ground truth unavoidably contains noisy labels caused by omission and/or registration noise during labeling. This ultimately has an adverse effect on the learned parameters, which tend to generate false alarms in the inference stage.
Second, we use the convolutional responses of a pre-trained model, i.e. ‘conv1_2’, ‘conv2_2’, ‘conv3_2’, ‘conv4_2’ and ‘conv5_2’ in VGG-19, and no fine-tuning is allowed for the encoder layers. This mitigates the problem, and more confident responses are obtained for an input. Note that even though the parameters of the model are trained for a different object recognition problem, studies have already shown that it is still capable of attaining the best accuracies on several remote sensing applications [16, 17].
Generator: At each layer, the generator block fuses the abstract representations extracted by the encoder block by recursively adding and up-sampling them, as illustrated in Fig. 2. As in the encoder, we use batch normalization and ReLU functions at the layers to speed up the optimization. Another advantage of these functions is that they improve the sparsity of the responses, as previously explained for remote sensing.
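The recursive add-and-up-sample fusion can be sketched as follows. This is a simplified illustration under stated assumptions: single-channel maps stored as lists of rows, an up-sampling factor of 2, nearest-neighbour interpolation, and no convolution, batch normalization or ReLU between steps. The function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour up-sampling of a 2-D map (list of rows) by a factor of 2."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                             # repeat each row
    return out

def add_maps(a, b):
    """Element-wise sum of two maps with identical shapes."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def fuse_pyramid(encoder_maps):
    """Fuse encoder responses from the coarsest level up to the finest one.

    encoder_maps is ordered finest -> coarsest, and each map is assumed to be
    half the size of the previous one.
    """
    fused = encoder_maps[-1]
    for skip in reversed(encoder_maps[:-1]):
        # Up-sample the running result, then add the skip-connected response.
        fused = add_maps(upsample2x(fused), skip)
    return fused
```

For example, fusing a 4x4 map of ones, a 2x2 map of tens and a 1x1 map holding 100 yields a 4x4 map in which every entry is 111, showing that each pyramid level contributes to every output pixel.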
At the last layer, we utilize a softmax activation to produce classification decisions, i.e. cloud or ground surface; by the end of the learning stage, it therefore tends to push the mask decisions toward either 0 or 1.
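As a small illustration of this last step, the sketch below turns two-channel scores into a binary mask. The 0.5 threshold (equivalent to picking the larger of the two softmax outputs) and the function names are assumptions made for the example.

```python
import math

def cloud_probability(ground_score, cloud_score):
    """Two-class softmax, returned as the probability of the cloud class."""
    peak = max(ground_score, cloud_score)
    e_ground = math.exp(ground_score - peak)
    e_cloud = math.exp(cloud_score - peak)
    return e_cloud / (e_ground + e_cloud)

def binary_mask(score_map, threshold=0.5):
    """Map rows of (ground_score, cloud_score) pairs to rows of 0/1 decisions."""
    return [[1 if cloud_probability(g, c) > threshold else 0 for g, c in row]
            for row in score_map]
```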
3.3 Implementation Details
As a pre-processing step, we first normalize each pixel in an image with a pre-computed constant value, regardless of whether a pre-trained encoder model is used. This centers the data at zero mean and makes it consistent with the inputs expected by the pre-trained model.
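A minimal sketch of this centering step is given below. The constant is an assumption: the per-channel ImageNet means commonly used with VGG models are taken as a placeholder, and the name `ASSUMED_RGB_MEAN` is hypothetical.

```python
# Assumed per-channel RGB means (values commonly used with VGG models);
# these are placeholders, not a constant confirmed by the text above.
ASSUMED_RGB_MEAN = (123.68, 116.779, 103.939)

def center_pixels(image, mean=ASSUMED_RGB_MEAN):
    """Subtract a constant per-channel value from every (r, g, b) pixel."""
    return [[tuple(c - m for c, m in zip(pixel, mean)) for pixel in row]
            for row in image]
```

The same function is applied to every image at training and inference time, so that both stages see identically centered inputs.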
For the parameter optimization, the Adam optimizer with momentum is used, and the learning rate is set to 0.0001. Moreover, the mini-batch size is set to 10 for RGB color images, and the maximum number of mini-batch iterations is set to 20K. Note that no data augmentation is utilized throughout the training stage. Lastly, all code is implemented in Python using the TensorFlow framework. The models are trained/evaluated on an NVIDIA Tesla K40 GPU card.
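For reference, a single Adam update for a scalar parameter can be sketched as below. The learning rate follows the value stated above; the momentum values (0.9 and 0.999) and the epsilon are assumptions, taken from the commonly used defaults of the optimizer, and the function name is hypothetical.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (param, m, v).

    t is the 1-based iteration counter; beta1, beta2 and eps are assumed defaults.
    """
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)                # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```

On the first iteration the bias correction makes the step size approximately equal to the learning rate, independent of the gradient magnitude.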
4 Experiments
In this section, we describe the dataset used in the experiments. We then report and discuss the experimental results obtained on this dataset.
4.1 Dataset
The dataset consists of 20 images acquired from the low-orbit RASAT and Gokturk-2 satellites (https://gezgin.gov.tr/), whose RGB resolutions are 15.0 m and 5.0 m, respectively. In particular, we opt to use the outputs of two different sensors in the dataset to demonstrate the generalization capacity of the proposed method. Moreover, Level-1 processed data is utilized in the experiments to reduce the defects caused by platform motion and optical distortion. The ground truth masks are manually labeled by human experts. Lastly, all methods are trained on 15 images, and the remaining 5 are reserved for the testing stage.
4.2 Experimental Results
To evaluate the proposed method, we compare it with two baselines: the deep pyramid network (DPN) and the combination of a CNN with Super-Pixel segmentation (SP+CNN). Moreover, performance is measured by three metrics, namely Accuracy (correctness of the prediction), Precision (reliability of the prediction) and Latency (inference time).
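The two mask-quality metrics can be computed as in the sketch below. It assumes the standard pixel-level definitions (accuracy as the fraction of correctly labeled pixels, precision as the fraction of predicted cloud pixels that are truly cloud); the function name and the flat 0/1 label layout are illustrative.

```python
def accuracy_and_precision(predicted, truth):
    """predicted, truth: flat lists of 0 (ground) / 1 (cloud) pixel labels."""
    pairs = list(zip(predicted, truth))
    true_pos = sum(1 for p, t in pairs if p == 1 and t == 1)
    false_pos = sum(1 for p, t in pairs if p == 1 and t == 0)
    correct = sum(1 for p, t in pairs if p == t)
    accuracy = correct / len(pairs)
    # Guard against masks with no predicted cloud pixels.
    predicted_pos = true_pos + false_pos
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    return accuracy, precision
```

Under these definitions, a method that raises many false alarms keeps a high accuracy when clouds cover few pixels, while its precision drops sharply.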
We report the performance scores in Table 1. From the results, the proposed method (i.e. DPN+VGG-19) achieves the best accuracy and precision scores. In particular, our method significantly improves the precision score. This stems from the fact that using a pre-trained model in the encoder block provides robustness to noisy-labeled data in the learning phase, which ultimately reduces false alarms in the inference stage. Another reason is that the proposed method performs the segmentation and classification phases concurrently; thus, the parameters are optimized to estimate the best segmentation masks rather than employing these steps separately as in SP+CNN. Lastly, this also provides an advantage in computation time (i.e. on CPU for an image of 3583×3584 resolution), as reported in Table 1 (note that SP+CNN needs to generate a decision with the CNN for each local patch).
Furthermore, we illustrate the classification masks of the proposed method and SP+CNN for the test images in Fig. 3 (the results of all methods, as well as the ground truth masks, can be found on the project webpage with better visual quality). Perceptually, our method obtains impressive results, particularly for hard cases such as snowy mountains. Moreover, the method is also able to detect haze coverage (i.e. the last column in Fig. 3), even though there is a limited number of training samples for this haze type in the dataset. The reason is that the network exploits the texture around clouds rather than color information, since texture patterns are more discriminative for clouds compared to snow/saturated cases.
Table 1: Performance scores.
|Method||Accuracy||Precision||Latency|
|SP+CNN||0.9820||0.6676||30 min.|
|DPN+VGG-19 (ours)||0.9874||0.8776||1 min.|
5 Conclusion
In this paper, we propose a deep pyramid network to tackle cloud detection from RGB color images. The method is able to generate pixel-level decisions by exploiting spatial texture information in visual data. Moreover, we show that the integration of a pre-trained CNN model at the encoder layer improves the accuracy of classification masks, since more confident hidden representations are extracted from noisy labeled data. From the experimental results, the proposed method quantitatively outperforms the baselines and obtains perceptually superior results on the dataset.
The authors are grateful to NVIDIA Corporation for the donation of Tesla K40 GPU card used for this research.
References
-  Q. Cheng et al., “Cloud removal for remotely sensed images by similar pixel replacement guided with a spatio-temporal mrf model.” ISPRS JPRS, 2014.
-  X. Hu et al., “Automatic recognition of cloud images by using visual saliency features.” IEEE GRSL, 2015.
-  F. Xie et al., “Multilevel cloud detection in remote sensing images based on deep learning.” IEEE JSTAR, 2017.
-  O. Ronneberger et al., “U-net: Convolutional networks for biomedical image segmentation.” MICCAI, 2015.
-  T.-Y. Lin et al., “Feature pyramid networks for object detection.” arXiv preprint, 2016.
-  Z. Zhu and C. E. Woodcock, “Object-based cloud and cloud shadow detection in landsat imagery.” Remote Sensing of Environment, 2012.
-  R. R. Irish et al., “Characterization of the landsat-7 etm+ automated cloud-cover assessment (acca) algorithm.” Photogrammetric engineering and remote sensing, 2006.
-  M. Harb et al., “Automatic delineation of clouds and their shadows in landsat and cbers (hrcc) data.” IEEE JSTAR, 2016.
-  J. D. Braaten et al., “Automated cloud and cloud shadow identification in landsat mss imagery for temperate ecosystems.” Remote Sensing of Environment, 2015.
-  V. Mnih, “Machine learning for aerial image labeling,” University of Toronto, 2013.
-  Z. Zhu and C. E. Woodcock, “Automated cloud, cloud shadow, and snow detection in multitemporal landsat data: An algorithm designed specifically for monitoring land cover change.” Remote Sensing of Environment, 2014.
-  L. Gómez-Chova et al., “Cloud masking and removal in remote sensing image time series.” Journal of Applied Remote Sensing, 2017.
-  K. He et al., “Deep residual learning for image recognition.” CVPR, 2016.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition.” arXiv preprint, 2014.
-  E. Maggiori et al., “High-resolution semantic labeling with convolutional neural networks,” IEEE TGRS, 2017.
-  D. Marmanis et al., “Deep learning earth observation classification using imagenet pretrained networks.” IEEE GRSL, 2016.
-  S. Ozkan et al., “Endnet: Sparse autoencoder network for endmember extraction and hyperspectral unmixing.” arXiv preprint, 2017.
-  D. Kingma and J. Ba, “Adam: A method for stochastic optimization.” arXiv preprint, 2014.
-  M. Teke, “Satellite image processing workflow for rasat and gokturk-2.” JAST, 2016.