Cloud forecasting remains one of the major unsolved challenges in meteorology, where cloud errors have wide-reaching impacts on the overall accuracy of weather forecasts [CloudStudy, boucher2013clouds]. Because clouds vary both vertically and horizontally, it is intrinsically difficult to measure them quantitatively and to evaluate the performance of cloud forecasts. This inability to accurately parameterize, and thus quantify, clouds, convective effects, and aerosols on a sub-grid scale in weather models is one reason model estimates can carry major uncertainties [boucher2013clouds].
The primary source of quantitative weather forecasts is numerical weather prediction (NWP) systems. These numerical methods model the future using governing equations from the field of atmospheric physics [NWP]. Over the past decades, there have been tremendous improvements in weather prediction owing to increased computational power, integration of new theory, and assimilation of large amounts of data. Regardless, these atmospheric simulations are still computationally expensive and operate on coarse spatial scales (9x9 km or above per pixel) [naturedeeplearning, ec]. Furthermore, the current amount of atmospheric data collection exceeds hundreds of petabytes per day [petabyte], implying that data collection far outpaces our ability to analyze and assimilate it. Consequently, the authors behind [naturedeeplearning] argue that we face two substantial challenges in this field for the future: 1) gaining knowledge from these extreme amounts of data and 2) developing models that are more data-driven than traditional approaches while still abiding by the laws of physics. One recent application of such data-driven methods found discrepancies in climate models’ estimation of photosynthesis in the tropical rainforests, which ultimately led to a more accurate description of these processes globally [beer2010terrestrial, bonan2011improving]. Ideally, similar insights can be discovered from data-driven methods for cloud dynamics, but obtaining adequate observations of clouds has been a substantial obstacle for developing data-driven cloud forecasting methods to date.
To tackle this problem and spark further research into data-driven atmospheric forecasting, we introduce a novel satellite-based dataset called “CloudCast” that facilitates the evaluation of cloud forecasting methods. Benchmark datasets of this kind have been paramount to progress in state-of-the-art methods in the computer vision literature, with examples such as MNIST [mnist], ImageNet [imagenet], and CIFAR10 [cifar10]. Current datasets for cloud forecasting exhibit coarse spatial resolution (9x9 km to 31x31 km) and low temporal granularity (one to multiple hours between images). We overcome both these issues by using geostationary satellite images, arguably the most consistent and regularly sampled global data source for clouds [CloudStudy]. Since these satellites can obtain images every 5 to 15 minutes with a relatively high spatial resolution (1x1 km to 3x3 km), they provide an essential ingredient for developing data-driven weather systems, namely an abundance of historical observations. It is possible to achieve higher accuracy in the vertical dimension with radar- and lidar-based profiling methods [cloudsat], but these fall short on temporal resolution because they are not geostationary. Our contributions are as follows:
We present a novel satellite-based dataset designed for cloud forecasting. The dataset has 10 different cloud types for multiple layers of the atmosphere annotated on a pixel level. It consists of 70080 images with a spatial resolution of 928 x 1530 pixels (3x3 km) and 15-min sampling intervals from 2017-01-01 to 2018-12-31. All frames are centered and projected over Europe. To the authors’ best knowledge, no equivalent dataset with high spatial- and temporal resolution exists for evaluating multi-layer cloud forecasting methods.
We evaluate four video prediction methods to serve as benchmarks for our dataset by predicting four hours into the future. Two of these are based on recent advancements in machine learning methods specifically for applications in atmospheric forecasting.
To evaluate our results, we present an evaluation study for measuring cloud forecasting accuracy in satellite-based systems. The evaluation design is based on best practices from the World Meteorological Organization when conducting cloud evaluation studies [CloudStudy], which include widely tested statistical metrics for categorical forecasts. Furthermore, we implement the Peak Signal-to-Noise Ratio and Structural Similarity Index from the computer vision literature. The combination of these two domains should provide the best and fairest evaluation of our results.
2 Related Work
We start by briefly reviewing related datasets commonly used in the cloud forecasting literature. After presenting related datasets, we will review methods for video prediction that are particularly suitable for forecasting in the spatiotemporal domain.
2.1 Related Datasets
The most common cloud measurements include total cloud amount or cover (fraction of sky covered by clouds), cloud amounts stratified by height into low-, medium- and high clouds, and cloud base- and top height [CloudStudy]. These are all macrophysical characteristics of clouds and included in three-dimensional NWP model output. For microphysical characteristics, the most commonly used include ice and liquid water content, liquid water path, and cloud optical depth. These are typically derived using radar reflectivity. Numerous methods for measuring or deriving cloud measurements exist. In general terms, these can be divided into radar- and lidar-based methods, surface-based, satellite-based or combined approaches [CloudStudy]. We narrow our scope to global observations with at least one sample per hour, as we want to establish a reference dataset that can be used globally for data-driven methods. This implies we only consider 1) satellite- and 2) model-based observations. Some overlap does exist between the two, as model-based cloud parameters usually involve some form of satellite data assimilation. Nevertheless, it makes sense to discuss them individually, as the spatial resolution is usually quite different between the two.
2.1.1 Satellite-Based Cloud Observations
Satellite-based cloud observations, when measured quantitatively, can be divided into raw infrared brightness temperature or satellite-derived cloud measurements [CloudStudy]. Raw satellite brightness temperature acts as a proxy for cloud top height, which will be available both during day and night. Satellite-derived cloud measurements typically involve some form of brightness-based algorithm on multispectral images that can extrapolate variables such as cloud mask, cloud type, cloud height, etc. [eumetsat]. The apparent disadvantage of satellite-based cloud observations relates to the "top-view" approach, where we cannot accurately measure clouds’ vertical nature, especially in the presence of high cloud cover. Nevertheless, it is the only method that provides global coverage with high spatial and temporal resolution in near real-time.
Several geostationary weather satellite systems exist worldwide. In Europe, the current constellation is called Meteosat Second Generation (MSG), which is operated by ESA/EUMETSAT and covers Europe, Africa, and the Indian Ocean [eumetsat]. Outside Europe, other major constellations include GOES, which covers the Americas, and MTSAT for Australasia [CloudStudy]. Our dataset is built exclusively from Meteosat satellite data, but it could be extended to any other geostationary satellite system worldwide.
2.1.2 Model-Based Cloud Observations
Model-based clouds are measured using output from an NWP model. Two commonly used global NWP models are the European ECMWF atmospheric IFS model [ec] and the American GFS model [gfs]. The resolution varies between the two, but the ECMWF model offers the highest spatial resolution with 9x9 km grid spacing [ec]. As both models are global, they can be used interchangeably. The advantage of using NWP model output is that a physics-based simulation of the future exists, while the clear disadvantage is the coarse spatial resolution. Other NWP models also exist, which operate on a much finer spatial scale, but these are restricted to local areas, usually on a country-basis [cosmo, arome].
Producing accurate and realistic video predictions in pixel-space is an open problem to date. Extrapolating frames in the near future can be done relatively accurately, but once the future sequence length grows, so does the inherent uncertainty of the predicted pixel values. Several approaches have been proposed for solving this complex and high-dimensional task, such as spatiotemporal-transformer networks [sttn], generative adversarial networks (GANs) [futuregan, stochasticgan, cloudgan], and recurrent-convolutional neural networks [eid3D, convlstm]. In the video prediction literature, the tasks are often governed by relatively simple physics, such as the Moving MNIST dataset [movingmnist]. However, for predicting atmospheric flow, the task becomes bound by much more complex physics. Therefore, our chosen methods focus on applications that have been explicitly applied for atmospheric forecasting, which will serve as the justification behind the chosen benchmark models for our dataset.
2.2.1 Convolutional- and Recurrent Neural Networks (ConvLSTM)
In 2015, [convlstm] introduced a method for incorporating convolutions into the inner workings of a long short-term memory (LSTM) model for video prediction. This method enabled the encoding and decoding of latent autoregressive spatiotemporal representations on radar images using backpropagation through time (BPTT) [goodfellow2016deep], which was then used to predict future radar images for precipitation forecasting. While the paper showed great promise and had a follow-up paper in 2017 [xingjian2017deep], the method produced increasingly blurry predictions further into the future. To address this blurriness problem, later work proposed several novel changes to video prediction methods, such as a new loss function that improves over the standard mean squared error (MSE). Newer methods have also been proposed following the ConvLSTM paper, such as PredRNN++ [wang2018predrnn] and Eidetic-3D LSTM [eid3D], but since these were neither applied to nor evaluated on any atmospheric-related datasets, they are outside the scope defined in Section 2.2.
2.2.2 Optical Flow-Based Video Prediction
In the computer vision literature, optical flow is a popular topic and has been studied in much greater depth than video prediction despite the two being conceptually similar [framepred]. Optical flow is one of the most important methods for global data assimilation in meteorology [eumetrain, forsythe2007atmospheric]. The method is analogous to advection equations in meteorology and has been used to extrapolate future radar images for precipitation nowcasting as early as 2004 [peura2004optical], and more recently in 2018 for nowcasting the effective cloud albedo in satellite imagery [urbich2018novel]. In the latter, the authors implement the optical flow method introduced by [zach2007duality] to capture cloud motions on multiple spatial scales based on two subsequent raw satellite images. The estimated flow is then inverted and applied to extrapolate the next image given the latter of the two input images. This approach can be used recursively to extrapolate any number of steps into the future. This model is expected to reach high accuracy in scenarios with a short forecast horizon, static cloud motion, and predominant advection. Because it can model neither convection nor dynamic weather patterns, it becomes less accurate as the sequence grows. Outside meteorology, this approach has also been used for video prediction [ranzato2014]. The success of this approach depends heavily on having an accurate optical flow estimator, which is often difficult to obtain for natural images.
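The recursive extrapolation described above can be illustrated with a small sketch. It is not the TV-L1 method itself: the flow field is assumed to be given (in practice it would be estimated from two consecutive images), it is held constant across steps, and the warp uses simple nearest-neighbour backward sampling. The function names are illustrative.

```python
import numpy as np


def warp_with_flow(frame, flow):
    """Advect `frame` one step along `flow` by backward nearest-neighbour sampling.

    frame: (H, W) array. flow: (H, W, 2) array of (dy, dx) displacements in pixels.
    Each output pixel (y, x) takes its value from (y - dy, x - dx) in the input.
    """
    h, w = frame.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Source coordinates are clipped so that border pixels are replicated.
    src_y = np.clip(np.rint(ys - flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs - flow[..., 1]).astype(int), 0, w - 1)
    return frame[src_y, src_x]


def extrapolate(last_frame, flow, n_steps):
    """Apply the same flow recursively to the previous prediction."""
    frames = []
    current = last_frame
    for _ in range(n_steps):
        current = warp_with_flow(current, flow)
        frames.append(current)
    return np.stack(frames)


# Toy example: a single cloudy pixel moving one pixel to the right per step.
frame = np.zeros((5, 5))
frame[2, 1] = 1.0
flow = np.zeros((5, 5, 2))
flow[..., 1] = 1.0  # dx = +1 everywhere
forecast = extrapolate(frame, flow, n_steps=2)
```

Because the same flow is reused at every step, the sketch can only translate existing structures, which mirrors the limitation noted above for convection and dynamic weather patterns.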
2.2.3 Generative Adversarial Networks
GANs have been applied for video prediction in several recent papers [framepred, cloudgan, stochasticgan]. One extension is based on conditional GANs [isola2017image, zhu2017unpaired], where the generator is trained to map pixels from one domain to another domain. In the case of video prediction, this would imply mapping previous frames to future frames directly in pixel-space. One recent application using GANs achieved state-of-the-art results for generating 32-frame time-lapse videos with 128 x 128 resolution of cloud movement in the sky using only one frame as input [cloudgan]. The authors use a two-stage generative adversarial network-based approach (MD-GAN), where the first-stage model is responsible for generating an initial video of realistic photos in the future with coarse motion. The second stage model then refines and corrects the initially generated video by enforcing motion dynamics using the Gram matrix in a novel fashion. It is important to note that the authors optimized video generation rather than video prediction, meaning the generation of a realistic version of the future but not necessarily the correct one given the past frames.
3 Dataset Description
The CloudCast dataset contains 70080 cloud-labeled satellite images with 10 different cloud types corresponding to multiple layers of the atmosphere, as seen in Table 1. The raw satellite images come from a satellite constellation in geostationary orbit centered at zero degrees longitude and arrive in 15-minute intervals from the European Organisation for the Exploitation of Meteorological Satellites (EUMETSAT) [eumetsat]. The resolution of these images is 3712 x 3712 pixels for the full disk of Earth, which implies that every pixel corresponds to an area of 3x3 km. This is the highest possible resolution from European geostationary satellites when including infrared channels [eumetsat]. EUMETSAT also applies some pre- and post-processing to the raw satellite images before releasing them to the public, such as removing airplanes [eumetsat]. We collect all the raw multispectral satellite images and annotate them individually on a pixel level using a segmentation algorithm developed by [nwcsaf] under the EUMETSAT Satellite Application Facility on Support to Nowcasting and Very Short Range Forecasting (NWCSAF) project [nwcsafsw]. Extensive validation of the segmentation algorithm has been carried out using both space-borne lidar and ground-based observations to verify its accuracy [nwcsaf]. The algorithm uses multispectral channels (visible light, near-infrared, infrared, and water vapor) together with climatological variables and metadata, such as geographical land-sea masks and viewing geometry, to annotate images on a pixel level. This implies that we are, to some extent, using a combined satellite- and model-based approach for cloud observations, which partially mitigates the problem of satellites being a top-view approach by improving low and mid-level cloud detection in the segmentation algorithm [nwcsaf]. The dataset has a spatial resolution of 928 x 1530 pixels, which corresponds to the same 3x3 km per pixel as the raw satellite images.
We also publish a standardized version for future studies, where we center and project the final annotated dataset to cover Central Europe, which implies a final resolution of 728 x 728 pixels. An example observation can be seen in Figure 1. Our novel dataset can be found at https://vision.eng.au.dk/cloudcast-dataset/.
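The image count quoted for the dataset follows directly from the sampling interval and the date range. A minimal sketch of that arithmetic, assuming one frame per 15-minute slot with no gaps over the whole period (the helper name is illustrative):

```python
from datetime import datetime, timedelta


def count_frames(first_day, last_day, step_minutes=15):
    """Count sampling slots from the start of first_day to the end of last_day."""
    step = timedelta(minutes=step_minutes)
    stop = last_day + timedelta(days=1)  # first instant after the last day
    return (stop - first_day) // step


n_frames = count_frames(datetime(2017, 1, 1), datetime(2018, 12, 31))
print(n_frames)  # 730 days x 96 frames per day = 70080
```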
|Class|Cloud type|
|---|---|
|0|No clouds or missing data|
|1|Very low clouds|
|4|High opaque clouds|
|5|Very high opaque clouds|
|7|High semitransparent thin clouds|
|8|High semitransparent moderately thick clouds|
|9|High semitransparent thick clouds|
|10|High semitransparent above low or medium clouds|
Current datasets [gfs, ec] for cloud forecasting and evaluation come with either low temporal granularity (one- to multiple hours between images) or coarse spatial resolution (9x9km to 31x31 km) as mentioned in Section 1, thus demonstrating the need for our novel high-resolution dataset.
We apply state-of-the-art computer vision methods, which have recently seen considerable success in similar atmospheric nowcasting studies [convlstm, urbich2018novel], to evaluate cloud nowcasting on our CloudCast image dataset. To match the resolution of most state-of-the-art video prediction methods [fidelity, xingjian2017deep], we crop and transform our dataset using a stereographic projection to cover Central Europe with a spatial resolution of 128x128. We still use the full temporal resolution of 15-minute intervals compared to hourly observations for other datasets, as mentioned in Section 2, and we also publish this final transformed dataset to ensure reproducibility for future studies. Several different definitions of nowcasting exist, but their horizons generally range from 0-2 hours to 0-6 hours [eumetcal, wmosite]. We select the future time frame to be four hours ahead in 15-minute increments (16 time steps), which is somewhere in the middle of most definitions. While forecasting beyond 6 hours is theoretically possible, we expect performance to deteriorate over time unless we incorporate additional variables that cannot be observed from satellite data alone to explain the medium- to long-term cloud dynamics.
We have divided the dataset into 1.5 years (75%) of training and 0.5 years (25%) of testing. Ideally, we would want our test data to cover all seasons of the year. However, the frequency distributions of the training and test sets are relatively similar for most classes, as seen in Table 2. We also group the 10 cloud types into four based on height: a) no clouds, b) low clouds, c) medium clouds, and d) high clouds. This ensures a more natural ordering of the classes and enables us to focus on the major cloud types also present in the global NWP models [gfs, ec].
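The split and the class grouping can be sketched as follows. The type-to-group mapping below is a hypothetical illustration only: the text states that the cloud types are grouped by height but does not list the assignment, so every entry in `GROUP_OF_TYPE` is an assumption, as is the choice of a purely chronological split.

```python
import numpy as np

# HYPOTHETICAL mapping from cloud-type label to height group
# (0 = no clouds, 1 = low, 2 = medium, 3 = high). Not taken from the paper.
GROUP_OF_TYPE = {0: 0, 1: 1, 2: 1, 3: 2, 4: 3, 5: 3, 6: 1, 7: 3, 8: 3, 9: 3, 10: 3}


def group_labels(labels, mapping=GROUP_OF_TYPE):
    """Map per-pixel cloud-type labels to height groups with a lookup table."""
    lut = np.zeros(max(mapping) + 1, dtype=np.uint8)
    for cloud_type, group in mapping.items():
        lut[cloud_type] = group
    return lut[labels]


def chronological_split(n_frames, train_fraction=0.75):
    """Return index slices for the first 75% (train) and last 25% (test) of frames."""
    n_train = int(round(n_frames * train_fraction))
    return slice(0, n_train), slice(n_train, n_frames)


train_idx, test_idx = chronological_split(70080)
grouped = group_labels(np.array([[0, 1], [5, 10]]))
```

With 70080 frames, this yields 52560 training frames and 17520 test frames, matching the 1.5-year and 0.5-year portions.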
4.1 Benchmark Models
We present an initial benchmark for our dataset based on the reviewed methods in Section 2.2. The results of the baseline models will be presented in Section 4.3 along with the advantages and disadvantages of the chosen methodologies.
4.1.1 Autoencoder ConvLSTM (AE-ConvLSTM)
For our first baseline, we implement a variant of the ConvLSTM model from [convlstm], where we introduce an autoencoder architecture with 2D CNNs and use the ConvLSTM layers on the final encoded representation instead of directly on the input frames. This helps us to a) encode the relevant spatial features from the input images before we start encoding and decoding the temporal representation, and b) make training more memory efficient, as ConvLSTM layers are memory-intensive. The autoencoder uses skip connections similar to UNet [unet]. The motivation behind including skip connections for video prediction is to transfer static pixels from the input to the output images, making the model focus on learning the movement of dynamic pixels instead [fidelity].
We start by reconstructing the first 16 input frames to initialize a spatiotemporal representation of the past cloud movement time-series. To predict 16 frames into the future, we use an autoregressive approach, where we feed the predicted output as input recursively to predict the next 16 steps. This is similar to the approach of other video prediction papers [stochasticgan]. To improve the sharpness of our results without introducing an adversarial loss function we have chosen to use the loss. Furthermore, we use the Adam optimizer with batch size = 4, and we implement an optimization schedule used in the Transformer paper [transformer], where we increase the learning rate linearly for a number of warmup rounds before decreasing it proportionally to the inverse square root of the current step number. We run the training schedule for 200 epochs with an initial learning rate of and momentum parameters , and 400 warmup rounds. We notice only a slight improvement in our loss function after 100 epochs.
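The warmup schedule described above can be written compactly. This is a sketch under assumptions: `peak_lr` is a placeholder for the learning rate reached at the end of warmup (the value used in the experiments is not reproduced here), and "rounds" are interpreted as optimizer steps.

```python
def warmup_inverse_sqrt_lr(step, warmup_steps=400, peak_lr=1.0):
    """Linear warmup followed by inverse-square-root decay.

    Equivalent in shape to the Transformer schedule
    min(step ** -0.5, step * warmup_steps ** -1.5), rescaled so that the
    maximum value, reached at step == warmup_steps, equals peak_lr.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    if step <= warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5


# The rate rises until step 400 and then decays.
schedule = [warmup_inverse_sqrt_lr(s) for s in (200, 400, 1600)]
```

Quadrupling the step count after warmup halves the learning rate, which is the inverse-square-root behaviour the text refers to.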
4.1.2 Multi-Stage Dynamic Generative Adversarial Networks (MD-GAN)
For training and optimizing the MD-GAN model, we follow the original authors [cloudgan] with some differences. As noted in Section 2.2.3, we are interested in video prediction rather than video generation, and therefore we make necessary adjustments to the experimental design to account for this. Instead of cloning one input frame into 16 and feeding them to the generator, we feed the previous 16 images to the generator.
Besides these changes, we largely followed the approach in [cloudgan]. We found that a single fixed learning rate did not produce satisfying results and often caused mode collapse for the generator. Instead, we employed the technique from [ttur], which sets a higher learning rate for the discriminator than for the generator. This overcomes situations where early mode collapse causes training to stall, and instead incentivizes smaller steps for the generator to fool the discriminator.
We find that the training procedure is inherently unstable, a frequent issue for GANs [improved]. This issue arises particularly for the second stage training, where training seems to stall after around 10-20 epochs. We expect that the implementation of novel advancements in the training of GANs tailored for video prediction [stochasticgan] rather than video generation [cloudgan] could considerably improve the stability.
4.1.3 TVL1 Optical Flow
We implement the optical flow algorithm following an approach similar to that of [urbich2018novel]. The algorithm can capture cloud motion on multiple spatial scales and is one of the most popular optical flow algorithms for meteorological purposes [urbich2018novel]. One of the underlying assumptions of optical flow is constant pixel intensity over time [urbich2018novel]. This assumption is violated by cloud formation and dissipation. Hence, the presence of convective clouds will negatively impact the accuracy of optical flow algorithms. As stated in Section 2.2.2, we can extrapolate multiple steps ahead in time using the estimated optical flow recursively on the predicted cloud images. The algorithm effectively has 11 parameters. The authors of [urbich2018novel] optimized these by finding the lowest absolute bias among 21 different parameter settings, calculated using a variant of the area-under-the-curve calculation over absolute bias as a function of forecast time [urbich2018novel]. Instead, we conduct a grid search over a hyperspace of 360 different combinations chosen relative to the default values and the optimal hyperparameters found in [urbich2018novel]. As estimated flows (and by extension, the predicted pixel values) will lie in a continuous real-valued space, we round our predictions to the nearest integer in our four-class setting.
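The rounding step for real-valued predictions could look like the sketch below. Clipping to the valid label range is an added assumption, included so that values slightly outside the range still map to a valid class; the function name is illustrative.

```python
import numpy as np


def to_class_labels(values, n_classes=4):
    """Round real-valued predictions to the nearest class index in [0, n_classes - 1]."""
    # np.rint rounds exact halves to the nearest even integer.
    rounded = np.rint(np.asarray(values, dtype=float))
    return np.clip(rounded, 0, n_classes - 1).astype(np.int64)


labels = to_class_labels([0.2, 1.6, 3.7, -0.4])
```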
One of the recommended benchmark models in cloud evaluation studies is called a persistence model [CloudStudy]. Persistence refers to the most recent observation, which in this case is the 15-minute lagged cloud-labeled satellite image, replicated 16 steps into the future. Under the case of limited cloud motion, we expect this model to perform relatively well, but obviously it is naïve and will not work in dynamic weather situations. The most challenging part of video prediction is usually realistic motion generation, and therefore comparing other models to the persistence model shows how well the model has captured and predicted future cloud motion dynamics. Hence, the persistence model will serve as the baseline for skill score calculations.
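A persistence forecast is simple enough to state in a few lines. The sketch below assumes the most recent labelled image is available as a 2D array; the function name is illustrative.

```python
import numpy as np


def persistence_forecast(last_frame, n_steps=16):
    """Repeat the most recent observation for every future time step."""
    return np.repeat(last_frame[np.newaxis, ...], n_steps, axis=0)


last_observation = np.array([[0, 1], [2, 3]])
forecast = persistence_forecast(last_observation)
```

Every slice `forecast[t]` equals the last observation, so any skill a model shows over this baseline has to come from predicted motion or change.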
4.2 Evaluation Metrics
Standardized evaluation metrics for video prediction with an emphasis on atmospheric applications are hard to come by. As stated in Section 1, we select our evaluation metrics from the World Meteorological Organization [CloudStudy]. Because numerous metrics are available, we select those with the highest ranking score in the referenced paper. As several of these are common in the computer vision and machine learning literature, we only go through the non-standard metrics. Frequency Bias, typically called the "bias score", measures the total number of predicted events relative to observed events. Any value above (below) one indicates the model tends to overforecast (underforecast) events.
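As a small illustration of the frequency bias, the sketch below treats the occurrence of one class as the event and compares predicted and observed pixel counts. Handling the metric per class in this way is an assumption about how it is applied in the multi-class setting.

```python
import numpy as np


def frequency_bias(predicted, observed, event_class):
    """Ratio of predicted to observed occurrences of `event_class`."""
    n_predicted = np.count_nonzero(np.asarray(predicted) == event_class)
    n_observed = np.count_nonzero(np.asarray(observed) == event_class)
    if n_observed == 0:
        return float("nan")  # undefined when the event never occurs
    return n_predicted / n_observed


predicted = np.array([1, 1, 1, 0])
observed = np.array([1, 1, 0, 0])
bias_class_1 = frequency_bias(predicted, observed, event_class=1)  # overforecast
bias_class_0 = frequency_bias(predicted, observed, event_class=0)  # underforecast
```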
[Figure: panels "Input Frames" and "Output Frames"]
The first non-standard metric is called the “Brier Score”. In this case, the Brier Score refers to the MSE between estimated probabilistic forecasts and binary outcomes. To extend it to the categorical multi-class setting, we sum the individual MSEs for all categorical probabilistic forecasts relative to the one-hot target class variable as follows:

$$\mathrm{BS} = \frac{1}{T}\sum_{t=1}^{T}\sum_{c=1}^{C}\mathrm{MSE}\left(f_{t,c},\, o_{t,c}\right),$$

where $f_{t,c}$ is the predicted probabilities for all class $c$ pixels in a given image at time $t$, $o_{t,c}$ the actual binary outcomes, $T$ the number of future time steps and $C$ the number of classes. The second is called the Brier Skill Score, and it is calculated using the MSE of a given model relative to some benchmark forecast. As mentioned in Section 4.1.4, we use the persistence forecast as the benchmark in skill score metrics. The formula for the Brier Skill Score is

$$\mathrm{BSS} = 1 - \frac{\mathrm{MSE}_{\text{model}}}{\mathrm{MSE}_{\text{persistence}}},$$

where any value above (below) zero implies superior (inferior) performance by the proposed model.
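A sketch of both scores in NumPy may help make the definitions concrete. It assumes the multi-class Brier Score is the per-class MSE against one-hot targets, summed over classes and averaged over time steps, and that the skill score takes the usual form of one minus the ratio of model score to reference score. The array layout and function names are illustrative.

```python
import numpy as np


def brier_score(probs, targets):
    """Multi-class Brier Score.

    probs: (T, C, H, W) predicted class probabilities.
    targets: (T, H, W) integer class labels in [0, C - 1].
    """
    n_classes = probs.shape[1]
    classes = np.arange(n_classes).reshape(1, -1, 1, 1)
    one_hot = (targets[:, np.newaxis] == classes).astype(float)  # (T, C, H, W)
    mse_per_step_and_class = ((probs - one_hot) ** 2).mean(axis=(2, 3))  # (T, C)
    # Sum over classes, then average over time steps.
    return mse_per_step_and_class.sum(axis=1).mean()


def brier_skill_score(score_model, score_reference):
    """Positive when the model beats the reference, negative when it is worse."""
    return 1.0 - score_model / score_reference


# Two classes, one time step, 1x2 image; true labels are [0, 1].
targets = np.array([[[0, 1]]])
perfect = np.array([[[[1.0, 0.0]], [[0.0, 1.0]]]])
uniform = np.full((1, 2, 1, 2), 0.5)
bs_perfect = brier_score(perfect, targets)
bs_uniform = brier_score(uniform, targets)
```

In this toy case the uniform forecast has an error of 0.25 per class, so its score is 0.5, while the perfect forecast scores zero.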
In addition to these metrics, we also include video prediction metrics from the computer vision literature, which, taken together with the meteorology metrics, should constitute the fairest evaluation in this complex setting. These include the Structural Similarity Index (SSIM) and the Peak Signal-to-Noise Ratio (PSNR).
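For reference, PSNR follows the standard definition of ten times the base-10 logarithm of the squared maximum pixel value over the MSE. The sketch below assumes the caller supplies `max_value`; which value is appropriate for class-labelled images is a choice not specified here.

```python
import numpy as np


def psnr(prediction, target, max_value):
    """Peak Signal-to-Noise Ratio in decibels."""
    diff = np.asarray(prediction, dtype=float) - np.asarray(target, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)


target = np.zeros((2, 2))
prediction = np.ones((2, 2))
psnr_unit_error = psnr(prediction, target, max_value=10.0)  # MSE = 1
```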
|Metric|AE-ConvLSTM|MD-GAN|TVL1|Persistence|
|---|---|---|---|---|
|Mean Accuracy|68.44%|67.06%|64.30%|63.60%|
|Brier Skill Score|0.11|0.07|0.02|NA|
Despite the proposed models showing relatively high overall accuracies, it is quite clear that none of the models show consistent performance across time and space. This is also evident when looking at the decline in accuracy across time. This suggests that we need to a) develop models more suitable for this particular problem or b) incorporate other data sources or variables to make more reasonable and causal predictions for the complex setting of multi-layer cloud movement and formation.
We include a visualization of the worst predictions from the test set measured by mean accuracy in Figure 3. Looking at the failure cases for MD-GAN S2 and AE-ConvLSTM, we observe that they struggle with situations where clouds are primarily scattered. This is unsurprising given that these models tend to generate predictions that are generally clustered and moderately blurry. The TVL1 model projects a considerable movement of clouds that is incorrect. The underlying reason could be the dissipation of clouds from the input images used in the optical flow estimation, which would violate the constant pixel intensity assumption. The Persistence model achieves poor performance in situations with substantial motion as in Figure 3.
4.3.1 Autoencoder ConvLSTM (AE-ConvLSTM)
The AE-ConvLSTM method achieves the highest accuracy on our dataset measured both temporally and spatially on all but medium clouds. For the BSS metric, we notice superior performance relative to the persistence model with a value of 0.11. This implies that applying ConvLSTM layers to cloud-labeled satellite images does capture spatiotemporal motion to some extent. On the other hand, we see in Figure 2 that predictions become increasingly blurry over time. This is in line with the discussion in Section 2.2.1.
4.3.2 Multi-Stage Dynamic Generative Adversarial Networks (MD-GAN)
The MD-GAN model outperforms the persistence model with a Brier Skill Score of 0.07. The categorical accuracy is not captured well, as MD-GAN achieves the lowest accuracy for medium clouds among all our models. The temporal accuracy closely matches that of the AE-ConvLSTM model, especially for the 2-hour forecast. Thus, by improving the stability and the initial forecasting accuracy of the MD-GAN model, we expect it could become the best and most consistent model.
4.3.3 TVL1 Optical Flow
The TVL1 algorithm shows marginally superior performance relative to the persistence model with a BSS of 0.02. The primary reason behind the close performance of TVL1 and Persistence relates to the choice of hyperparameters for the TVL1 algorithm, where hyperparameters yielding more static movement generally implied better performance across time. We believe the underlying reason for this result is the complexity of forecasting multi-layer clouds 16 steps ahead combined with the violation of the optical flow assumption of constant brightness intensity over time. Compared to AE-ConvLSTM and MD-GAN, it achieves lower overall and temporal accuracy but does reach higher accuracy for medium clouds. While optical flow methods have been popular for atmospheric forecasting, as stated in Section 2.2.2, their application to multi-layer cloud types has not been fully researched yet. Hence, the proposed machine learning methods currently seem more appropriate for this task given their superior performance.
The simple persistence model achieves relatively good results. The high short-term accuracy is not surprising given the limited cloud movement one hour ahead. Due to its static nature, however, it achieves the lowest accuracy near the end of our forecasting horizon.
We introduce a novel dataset for cloud forecasting called CloudCast, which consists of pixel-labeled satellite images with multi-layer clouds of high temporal and spatial resolution. The dataset facilitates the development and evaluation of methods for atmospheric forecasting and video prediction in both the vertical and horizontal domains. Four different cloud nowcasting models were evaluated on this dataset based on recent advancements in the machine learning literature for video prediction and traditional methods from the meteorology- and computer vision literature. Several evaluation metrics based on best practices in cloud forecasting studies were proposed in addition to the Peak Signal-to-Noise Ratio and Structural Similarity Index. The four models provided an initial benchmark for this dataset but showed ample room for improvement, especially for predictions near the end of our forecasting horizon. Hybrid methods combining machine-learning and NWP could be interesting approaches to address medium to long-term forecasting in a future study.
We hope this novel dataset will help advance and stimulate the development of new data-driven methods for atmospheric forecasting in a field heavily dominated by physics and numerical methods.