
Sky-image-based solar forecasting using deep learning with multi-location data: training models locally, globally or via transfer learning?

by   Yuhao Nie, et al.

Solar forecasting from ground-based sky images using deep learning models has shown great promise in reducing the uncertainty in solar power generation. One of the biggest challenges for training deep learning models is the availability of labeled datasets. With more and more sky image datasets open sourced in recent years, the development of accurate and reliable solar forecasting methods has seen a huge growth in potential. In this study, we explore three different training strategies for deep-learning-based solar forecasting models by leveraging three heterogeneous datasets collected around the world with drastically different climate patterns. Specifically, we compare the performance of models trained individually on local datasets (local models) and models trained jointly on the fusion of multiple datasets from different locations (global models), and we further examine the knowledge transfer from pre-trained solar forecasting models to a new dataset of interest (transfer learning models). The results suggest that the local models work well when deployed locally, but significant errors are observed for the scale of the prediction when applied offsite. The global model can adapt well to individual locations, while the possible increase in training effort needs to be taken into account. Pre-training models on a large and diversified source dataset and transferring to a local target dataset generally achieves superior performance over the other two training strategies. Transfer learning brings the most benefits when there are limited local data. With 80 achieve 1 dataset. Therefore, we call on the solar forecasting community to contribute to a global dataset containing a massive amount of imagery and displaying diversified samples with a range of sky conditions.



1 Introduction

The continuous growth of solar photovoltaic (PV) deployment forms a critical part of the global energy transition. According to the International Energy Agency, a record-high 145 GW capacity has been installed during 2020 even in the face of the pandemic Masson and Kaizuka (2021). The global cumulative PV capacity has amounted to 767 GW at the end of 2020, with around 70% of the capacity installed over the last five years Masson and Kaizuka (2021). The dramatic rise in PV installations will introduce challenges to the electricity grid due to the intermittency of solar energy, mainly caused by local and short-term cloud events Nie et al. (2021). To reduce the uncertainty in solar power generation, accurate solar forecasting is thus pivotal and urgently needed.

There have been numerous research efforts toward building reliable solar forecasting models over the decades, which have targeted forecasting either the solar irradiance or the power output of PV systems. This introduction mainly focuses on short-term or intra-hour solar forecasting research. Earlier efforts have tended to use statistical time series models Moreno-Munoz et al. (2008); Reikard (2009), e.g., auto-regression (AR), auto-regression moving average (ARMA) and auto-regression integrated moving average (ARIMA), to auto-correlate irradiance/PV measurements for prediction. These methods usually lack forecast ability as they do not use any information on the movement and distribution of clouds Sun et al. (2019); Yang et al. (2022). Since 2011, sky-image-based solar forecasting has become more and more popular. Early works first extracted features from ground-based sky images, such as red-blue ratio, cloud coverage and cloud motion vectors, and then used these features for building physical deterministic models Chow et al. (2011); Marquez and Coimbra (2013); Quesada-Ruiz et al. (2014) or training machine learning models Chu et al. (2013, 2015, 2015); Pedro et al. (2019a). In addition, several all-sky cameras can be used in stereo-vision mode to model the cloud cover in three dimensions to provide local irradiance maps Peng et al. (2015); Blanc et al. (2017); Kuhn et al. (2018). In the past five years, with the further development of computer vision techniques, efforts have shifted to concentrate on building end-to-end deep learning models, such as convolutional neural networks (CNNs) Sun et al. (2018, 2019); Venugopal et al. (2019); Nie et al. (2020); Feng and Zhang (2020); Paletta and Lasenby (2020b); Nie et al. (2021); Feng et al. (2022) or CNNs hybridized with recurrent neural networks (RNNs) Zhang et al. (2018); Paletta et al. (2021, 2022) to correlate irradiance/PV with sky images, which have generally achieved superior performance over the other methods despite some limitations Paletta et al. (2021). These modern methods will be the focus of this study.

One of the most important factors for deep-learning-based solar forecasting models is the dataset. To train a generalized deep learning model that works well not only on the model development dataset but also for unseen data, the training dataset needs to contain massive amounts of imagery and diversified samples with numerous sky conditions. Contrary to simulated datasets which can be easily extended, real-world data such as sky images are constrained by the period of collection. For this reason, researchers in the solar forecasting community have explored using data augmentation, a common technique in deep learning, to artificially increase the diversity of image samples Nie et al. (2021); Paletta et al. (2022b). Although data augmentation is viable without accessing additional data, the benefit it could bring is constrained by the diversity of cloud patterns in the dataset. With more and more sky image datasets open-sourced in recent years Kurtz et al. (2017); Pedro et al. (2019b); Feng et al. (2019); Terrén-Serrano et al. (2021); Ntavelis et al. (2021); Nie et al. (2022), potential options have not been explored thoroughly in solar forecasting, including training a model by fusing different datasets with more or less heterogeneity and knowledge sharing between different datasets via pre-trained models. These multi-location augmentation approaches are investigated in this study.

Specifically, we examine the following questions by using datasets collected globally from three different locations with disparate climate patterns (the details on these three datasets can be found in Section 3):

  1. Should deep learning models be trained locally using the location-specific dataset or globally via fusion of datasets from different locations?

  2. How to deal with the dataset heterogeneity for training global models, especially the different scales and distributions of prediction targets?

  3. Is there any knowledge that can be shared between different locations via transfer learning from pre-trained models?

To address these questions, two research groups from Stanford University and Cambridge University, as well as researchers from the Dubai Energy and Water Authority (DEWA), have worked in parallel and developed deep learning models for a short-term solar forecasting task, the goal of which is to predict 15-min-ahead PV power outputs (or irradiance values) based on the imagery and measurement data collected in the past 15 minutes (see Section 4 for more details on the model setup). Two different deep learning model architectures are utilized by the two University teams to avoid bias caused by particular model architectures and ensure the reliability of the results for understanding the impact of diverse training data.

The rest of this paper is organized as follows: in Section 2, we review the methods for training deep learning models with multi-location data, including dataset fusion and transfer learning. Section 3 describes the sky images and PV output/irradiance datasets used in this study from three different locations around the world. Section 4 presents the methodology, including the model architectures, training details and evaluation metrics. Section 5 delves into the experimental designs for exploring the optimal strategies for training PV/irradiance prediction models with multi-location data. Section 6 analyzes and discusses the results and provides directions for future research. Finally, we summarize the findings of this study in Section 7.

2 Review of multi-location dataset modeling

Two methods for multi-location dataset modeling are reviewed in this section. The first type is dataset fusion, basically integrating the datasets from multiple locations and training the model jointly. This approach is based on the expectation that fused data are more informative than any individual dataset. Another method which is widely used in the deep learning community but not yet well studied in image-based solar forecasting, is transfer learning. In transfer learning, the datasets are used sequentially, with model parameters passed from earlier to later models to pass learning forward in the training process.

2.1 Dataset fusion

Dataset fusion is commonly used in image-based solar forecasting studies with multi-location datasets. Pothineni et al. (2019) experimented with two sky image datasets collected in two regions with identical camera setup: one in Italy and one in the Swiss mountains. Irradiance measurements were used to determine the associated sky conditions to be either clear or occluded by thresholding them with the irradiance values derived from a clear sky model. The authors compared training CNN models on the individual datasets with training on the fused dataset for predicting the 5-min ahead sky conditions, with results showing superior performance from training the model jointly on both datasets. Bansal et al. (2021) developed a CNN-LSTM auto-regressive model to predict the satellite spectral channel values at the target sites based on a sequence of past satellite observations. They combined satellite data across 25 solar sites in the US to train one global model and test it on individual locations. The authors suggested that while training location-specific models could identify unique local attributes to improve accuracy, their deep learning model is quite data hungry. Under the situation where there is not enough data to train and test on any single location, a global model might be beneficial as it can be applied to any location without re-training.

One point to note here is that most of the existing studies using dataset fusion methods are based on the premise that the prediction targets are of similar scales, e.g., solar irradiance and satellite spectral channel value mentioned in studies above. Very few projects have studied fusing datasets with different scales or different units of the prediction targets, e.g., PV output measurements from systems with different capacity, or PV output and irradiance measurements which are highly correlated. Another challenge that needs attention is that the imagery data collected by these studies are generally based on similar camera setups, e.g., camera model and placement orientation. It is possible that in future fusion studies multiple data streams would be generated by different camera setups (e.g., resolution, contrast, color balance, spectral range, orientation).

2.2 Transfer learning

Transfer learning aims at improving the performance of target learners or solving new problems faster on target domains by transferring the knowledge learned from different (but related) source domains Zhuang et al. (2021). A condition for the transfer of knowledge is the existence of a similarity between tasks. In solar energy for instance, a solar site could benefit from datasets generated in other locations to improve the accuracy of its site-specific algorithm. This specific transfer learning approach, termed domain adaptation, aims at adapting a learning algorithm to a new data distribution while keeping the task unchanged. This could be especially beneficial for new solar facilities which have a limited amount of data. Other similar but distinct activities in solar forecasting are, for instance, PV power output versus solar irradiance forecasting, or cloud cover modelling from sky images versus satellite observations.

There are different ways to categorize transfer learning approaches. An approach is known as homogeneous transfer learning if the type of input variables (e.g., sky images) and labels (e.g., irradiance) are the same for source and target domains (location A and B). In contrast, if input variables (e.g., sky images versus satellite observations) or labels (e.g., irradiance versus PV output) are distinct, an approach is referred to as heterogeneous transfer learning Zhuang et al. (2021). Another review splits methods into four groups: instance-based (instance weighting strategy), feature-based (creation of a new feature representation), parameter-based (the transferred knowledge is encoded at the parameter/model level) and relational-based (transfer the relationship among the source data to the target domain) approaches Pan and Yang (2010). Instead of learning a new task from scratch, a common practice in deep learning is to start learning a new task such as solar forecasting Wen et al. (2021) with standard neural networks such as VGGNet Simonyan and Zisserman (2015), ResNet He et al. (2016), DenseNet Huang et al. (2017), etc., pretrained on large datasets, e.g., ImageNet Deng et al. (2009). Alternatively, self-supervised learning, a form of unsupervised learning that generates pseudo-labels from the data itself, offers a framework to pretrain a model on an unlabeled dataset close to the target dataset instead of on a generic dataset. We describe in the following paragraph some studies which have applied transfer learning techniques to image-based solar forecasting and related fields.

In sky image segmentation, a data representation can be learnt on a large unlabelled dataset of sky images with tasks such as image reconstruction, clustering and classification, prior to fine-tuning the model on the target segmentation task with a smaller labeled dataset Fabel et al. (2021). Models pretrained on the sky image dataset outperform those pretrained on ImageNet Deng et al. (2009) or randomly initialised. Pothineni et al. (2019) claimed the deep learning model they developed (KloudNet) can be used to improve the performance on other PV plants via transfer learning, but no quantitative results are presented in their study. In solar irradiance estimation from satellite data, a recent work applied transfer learning to a four-layer neural network Li et al. (2022). The source domain consists of simulated data used to pre-train the model while real-world satellite and in-situ observations are used for fine-tuning the last four layers. Although some negative transfer is observed, the proposed transfer learning approach benefits some tasks such as daily downward shortwave radiation estimation compared to baselines trained on simulated or in-situ data only.

3 Dataset

3.1 Dataset overview

Three datasets collected from three locations around the world with drastically different weather conditions are used for the experiments in this study. These datasets are (1) the Stanford dataset Nie et al. (2022), collected on the campus of Stanford University in California, United States (US), which is characterized by long summers with mostly clear sky and short winters with partly cloudy sky; (2) the SIRTA dataset Haeffelin et al. (2005), collected by the SIRTA Atmospheric Observatory in Palaiseau, France, which is dominated by partly cloudy and cloudy sky conditions over much of the year; and (3) the DEWA dataset, collected from the outdoor testing facility of the Dubai Energy and Water Authority (DEWA) in the United Arab Emirates (UAE), where the sky is clear most of the year but sandstorms usually occur in dry summers. Among the three datasets, the Stanford and SIRTA datasets are open source and can be accessed by the public (detailed information about the Stanford dataset can be found in its GitHub repository; the SIRTA dataset is available upon request), while the DEWA dataset is private and not publicly available. It should be noted that only team Stanford has access to the DEWA dataset due to an internal collaboration contract between Stanford and DEWA, and team Cambridge only has access to the two publicly available datasets. The detailed comparison of these datasets is shown in Table 3.1.

| Location | Stanford, US | Palaiseau, France | Dubai, UAE |
|---|---|---|---|
| Data type | sky images & PV power output | sky images & global horizontal irradiance (GHI) | sky images & GHI |
| Data frequency | 1-min | 1 to 2-min | 1-min |
| Image resolution | | | |
| Time window | 2017.3–2019.11 | 2017.1–2019.12 | 2021.1–2021.11 |
| Camera model | Hikvision DS-2CD6362F-IV | EKO SRF-02 | EKO ASI-16 |
| Camera orientation | 14° south by west | 2° south by east | due south |
| PV system | 30-kW rooftop system with elevation 22.5°, azimuth 195° | N/A | N/A |
| Number of days | 269 | 522 | 953 |
| Number of valid samples | 135,527 | 448,268 | 91,979 |
| Training, validation and test set split | 84%:9%:7% | 88%:10%:2% | 83%:9%:7% |

Notes: for the SIRTA dataset, the image data frequency is 1-min in 2017 and 2-min in 2018 to 2019, while the irradiance data frequency is 1-min for 2017 to 2019. The image resolution refers to images down-sized from the high-resolution raw images.

Table 3.1: Comparison of three studied datasets

The different weather conditions are reflected in the data from these three locations. Figure 3.1 shows sky image examples from the three locations under different weather conditions: sunny, cloudy and overcast. The images from the three datasets look somewhat different, especially in hue; for example, DEWA image samples appear yellowish and dusty due to the impact of sandstorms. In addition, different camera characteristics, e.g., camera contrast and saturation, can affect the images.

Figure 3.1: Sky image examples from three locations with different weather conditions

3.2 Data processing

To ensure consistent data processing, team Stanford processed the data from the three datasets and then shared them with team Cambridge for all the experiments. The high-resolution raw images collected by the cameras are first down-sized to $64 \times 64$ pixels to save computation in the model training process. The PV output and irradiance measurements collected from the data loggers are averaged over a minute. To form valid samples for the forecasting task, each time point ($t$) is checked to ensure the availability of (1) the prediction target, i.e., the PV output/GHI measurement 15 minutes ahead ($t+15$); and (2) the model inputs, i.e., the sky images and concurrent PV output/irradiance measurements over the past 15 minutes at a 2-minute resolution. All time stamps that do not satisfy these two conditions are dropped. For the Stanford dataset and the DEWA dataset, the sampling frequency is chosen to be 2 minutes (two consecutive samples differ in $t$ by 2 minutes) because a higher frequency led to a longer model training time with limited improvement in model accuracy (Sun et al. (2019)). For the SIRTA dataset, the sampling frequency is set to 1 minute for 2017 and 2 minutes for 2018 to 2019 to be consistent with the imagery data frequency (see Table 3.1). The number of valid samples after processing can be found in Table 3.1 for each of the three datasets. In this study, the three datasets share similar characteristics, such as similar camera orientations and their location in the same hemisphere. In other contexts, however, differences in these aspects might partially hinder the transfer of knowledge (e.g., a different trajectory of the sun due to the location of the camera or its orientation). This could be addressed by applying data augmentation techniques in the data processing stage. For instance, a transfer function independent of the camera orientation can be learnt by randomly rotating sky images during training or by representing the scene with polar coordinates centered on the sun Paletta and Lasenby (2020a); Paletta et al. (2022b, a), which will be explored in the future by accessing more diversified datasets around the world.
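The validity check described above can be sketched as follows. This is a minimal illustration rather than the processing code used in the study: the function name `valid_times` is hypothetical, and the input lags (t, t-2, ..., t-14, i.e., 8 steps) are an assumption chosen to match the 8 past frames at a 2-minute resolution.

```python
from datetime import datetime, timedelta

HORIZON_MIN = 15                     # forecast horizon stated in the text
INPUT_LAGS_MIN = range(0, 15, 2)     # assumed lags: t, t-2, ..., t-14 (8 steps)


def valid_times(image_times, measurement_times):
    """Return the time points t that have a target and a complete input history."""
    images = set(image_times)
    measurements = set(measurement_times)
    valid = []
    for t in sorted(images & measurements):
        # Condition (1): the measurement 15 minutes ahead must exist.
        has_target = t + timedelta(minutes=HORIZON_MIN) in measurements
        # Condition (2): every lagged image and measurement must exist.
        lags = [t - timedelta(minutes=m) for m in INPUT_LAGS_MIN]
        has_inputs = all(lag in images and lag in measurements for lag in lags)
        if has_target and has_inputs:
            valid.append(t)
    return valid
```

A single missing image invalidates every time point whose input window contains it, which is why the number of valid samples can be noticeably smaller than the number of raw records.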

3.3 Dataset partition

For model development and evaluation purposes, the valid samples of all three datasets are partitioned into the model development set (consisting of training and validation) and the test set. The test set is first separated out with 10 sunny days and 10 cloudy days across the entire time period and is never touched during the model development processes. The PV output/irradiance profiles of these 20 days for all three datasets are shown in Figure 3.2. The remaining data go to the model development set. Figure 3.3 presents the model development set data distribution of the three locations. It can be observed that Stanford PV and DEWA irradiance distributions share some similarity while SIRTA irradiance distribution shows a nearly opposite trend compared to the other two datasets. To avoid the bias from data partitioning and obtain a less optimistic estimate of the model performance, ten-fold cross-validation is employed in this study, which divides the development set into 10 folds, 9 folds for training the model and 1 fold for validating the model. The model is trained 10 times, each time with a different fold as the validation set, resulting in 10 sub-models. The final prediction made by the model is the ensemble mean of these 10 sub-models. Under this setup, the split of training, validation and testing samples in percent for each dataset is shown in Table 3.1. Moreover, to avoid the data from the same day ending up in both training and validation sets, which could potentially lead to over-estimates of the model performance due to the closeness between training and validation samples, day-block shuffling is performed during the cross-validation process Sun et al. (2019).
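A minimal sketch of day-block cross-validation with an ensemble mean is given below. The function names and the round-robin assignment of shuffled days to folds are assumptions made for illustration; the study only specifies that samples from the same day must not be split between training and validation.

```python
import random
from collections import defaultdict


def day_block_folds(sample_days, n_folds=10, seed=0):
    """Split sample indices into folds so that all samples of a day share one fold.

    sample_days holds one day label (e.g. '2017-03-09') per sample.
    """
    by_day = defaultdict(list)
    for idx, day in enumerate(sample_days):
        by_day[day].append(idx)
    days = sorted(by_day)
    random.Random(seed).shuffle(days)        # shuffle whole days, not samples
    folds = [[] for _ in range(n_folds)]
    for i, day in enumerate(days):
        folds[i % n_folds].extend(by_day[day])
    return folds


def cross_validation_splits(folds):
    """Yield (train_indices, val_indices); each fold is the validation set once."""
    for k, val in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, val


def ensemble_mean(sub_model_predictions):
    """Average the predictions of the sub-models sample by sample."""
    n_models = len(sub_model_predictions)
    return [sum(p) / n_models for p in zip(*sub_model_predictions)]
```

Each of the 10 splits would train one sub-model, and `ensemble_mean` would then combine the 10 sets of test predictions into the final forecast.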

Figure 3.2: The test set PV/irradiance profiles for the three datasets (the zigzags in the sunny days of the Stanford test set are due to missing data points)

Figure 3.3: The model development set data distribution of the three locations

4 Methodology

The specific solar forecasting task tackled in this study is to predict 15-min-ahead PV power output or GHI values based on the three datasets described in Section 3. This section presents the two deep learning models utilized by the two teams, including the model architectures and training details, and describes the metrics used to evaluate model performance. Note that we do not focus on comparing the performance of different architectures in this study; rather, we use two architectures as a way to avoid any bias caused by a particular architectural setup.

4.1 Deep learning models

While the two teams utilized different deep learning models for the forecasting task, the following common setups are shared: (1) using hybrid features as the model input, namely sky images and measurement history (PV output/GHI) in the past 15 minutes with a 2-min interval; (2) minimizing the mean squared error (MSE) loss function and using the stochastic gradient descent optimizer Adam Kingma and Ba (2014) for model training; (3) employing 10-fold cross-validation and training 10 sub-models; and (4) for model evaluation, the prediction is generated by the ensemble mean of the predictions from the 10 sub-models. An illustration of the model architectures of both teams can be found in Figure 4.1 and the comparisons of the architectures as well as training settings are listed in Table 4.1 with details described individually in the subsections below.

Figure 4.1: Model architectures used by team Stanford and Cambridge.

| | SUNSET (team Stanford) | ConvLSTM (team Cambridge) |
|---|---|---|
| Input sky images | Past 8 frames | Past 8 frames |
| Input PV or GHI history | Past 8 values | Past 8 values |
| Activation functions | ReLU | ReLU |
| Normalization layers | Batch normalization | None |
| Pooling layers | Max pooling | None |
| Dropout rate | 0.4 | 0.0 |
| Number of parameters | 13.66M | 4.25M |
| - Number of convolutional layers | 2 | 7 |
| - Number of residual blocks | 0 | 6 |
| Optimizer | Adam | Adam |
| Training loss | MSE | MSE |
| Learning rate | / | |
| Batch size | 256 | 10 |

Notes: the two SUNSET learning rates separated by "/" are for training on the Stanford dataset and for training on the SIRTA dataset, respectively.

Table 4.1: Comparison of the two deep learning architectures and training details

4.1.1 Stanford SUNSET model

A CNN architecture named SUNSET (Stanford University Neural Network for Solar Electricity Trend) is used by team Stanford. The SUNSET model was first introduced by Sun et al. (2019) to forecast 15-min-ahead PV output and is characterized by its use of a CNN with hybrid input features. The 1-min lag term interval used in the original SUNSET model is changed to 2-min in this study, and modifications are made accordingly to accommodate the different input and output for each dataset, either PV or irradiance measurements.

The basic structure of the SUNSET model includes two Convolutional (Conv.) blocks and one Fully-connected (FC) block. The Conv. block employs a sandwich-like structure, including sequentially one Conv. layer, one batch normalization (BatchNorm) layer, and one pooling layer. The Conv. layer utilizes a 3×3 filter, with a stride of 1 and same-value padding. The activation function used for the Conv. layer is a rectified linear unit (ReLU). In the pooling layer, 2×2 max pooling with a stride of 2 is used to reduce the activation spatial dimensions. The first Conv. block contains 24 filters, while the second contains 48 filters. After the two Conv. blocks, the processed input is vectorized and concatenated with PV output/irradiance history and passed through the FC block to produce the prediction. The FC block includes two dense layers, each containing 1024 neurons and using ReLU as its activation function. After each dense layer, a dropout layer with a 0.4 dropout rate is performed to prevent over-fitting.
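As a sanity check on this description, the number of parameters can be tallied layer by layer. The sketch below rests on several assumptions that are not stated in this section: the 8 RGB frames are stacked channel-wise into a 64×64×24 input, each BatchNorm layer carries 4 values per channel (scale, shift, running mean and running variance), and the FC block ends with a single-value output layer. Under these assumptions the total is about 13.66M, in line with Table 4.1.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus biases of a kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out + c_out


def dense_params(n_in, n_out):
    """Weights plus biases of a fully-connected layer."""
    return n_in * n_out + n_out


def sunset_parameter_count(size=64, frames=8, channels=3, history=8):
    c_in = frames * channels                 # assumed: frames stacked channel-wise
    total = 0
    for c_out in (24, 48):                   # the two Conv. blocks
        total += conv_params(3, c_in, c_out)
        total += 4 * c_out                   # BatchNorm values per channel (assumed 4)
        size //= 2                           # 2x2 max pooling with stride 2
        c_in = c_out
    features = size * size * c_in + history  # flatten, then concatenate history
    total += dense_params(features, 1024)    # first dense layer
    total += dense_params(1024, 1024)        # second dense layer
    total += dense_params(1024, 1)           # assumed single-value output layer
    return total


print(sunset_parameter_count())  # 13658665
```

Almost all of the parameters sit in the first dense layer, which takes the 12,296-dimensional concatenated feature vector as input.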

Different learning rates are used for model training on different datasets by team Stanford. For the Stanford dataset, a learning rate of is used as this was seen to be effective via prior studies Sun et al. (2019), while for the SIRTA dataset, a learning rate of 2.5

is used based on initial hyper-parameter selection experiments. Batch size is consistently set to be 256 for the stochastic gradient descent optimizer. An early stopping scheme is applied to prevent potential over-fitting, namely, the training is stopped when the validation loss is not observed to decrease for five consecutive epochs.
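The early stopping scheme can be sketched as below. Two details are assumptions: a "decrease" is measured against the best validation loss seen so far, and the hypothetical `max_epochs` cap only guards against an endless loop. `run_epoch` is a placeholder callable that trains for one epoch and returns the validation loss.

```python
def train_with_early_stopping(run_epoch, patience=5, max_epochs=1000):
    """Train until the validation loss fails to improve for `patience` epochs.

    Returns (best_validation_loss, number_of_epochs_run).
    """
    best = float("inf")
    stale = 0                     # consecutive epochs without improvement
    epochs_run = 0
    for epoch in range(1, max_epochs + 1):
        val_loss = run_epoch(epoch)
        epochs_run = epoch
        if val_loss < best:
            best, stale = val_loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best, epochs_run
```

The returned epoch count is the quantity that enters the training effort metric defined in Section 4.2.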

4.1.2 Cambridge ConvLSTM model

A ConvLSTM model architecture is used by team Cambridge, which was first presented by Paletta et al. (2021) to forecast 10-min-ahead GHI levels from past sky images and auxiliary data (irradiance measurements and solar position). Several modifications were made to the original model architecture for this study. The first convolutional layer with a stride of two was changed to a layer with unitary stride to account for the lower spatial resolution of images in the present study (64 pixels instead of 128). In addition, the model now takes three-channel images (RGB) as input instead of two (grey-scale short and long exposure images). Furthermore, the new model is fed only with the past GHI or PV output measurements as auxiliary input.

The ConvLSTM model is made of two parallel encoders for auxiliary data and sky images. Features from past irradiance or PV measurements are extracted with three dense layers and an LSTM layer Hochreiter and Schmidhuber (1997). Sky images are filtered through a sequence of convolutional layers and residual blocks He et al. (2015) to decrease the spatial resolution from to . The resulting set of feature maps is then sequentially fed into a ConvLSTM module Shi et al. (2015). A learning rate of is used to train the ConvLSTM model for both datasets with a batch size of 10.

4.2 Evaluation metrics

In this study, we evaluate the model performance from two aspects: (1) the prediction accuracy, which is measured by some common error metrics via applying the trained models to the test sets; and (2) the training cost, which reflects the consumption of computational resource during the model training.

To assess the prediction accuracy, the error metric root mean squared error (RMSE) is used. RMSE is the most commonly used metric and can be expressed by Equation (1); other similar metrics like mean absolute error (MAE), mean bias error (MBE), etc. are not covered in this study.


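$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2} \tag{1}$$

where $N$ is the number of samples, $\hat{y}_i$ is the prediction generated by the model and $y_i$ is the ground-truth measurement.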

For evaluating the training cost, we define a metric called training effort (TE), which is essentially the total number of samples seen by the model until convergence and can be expressed by Equation (2):


$$\mathrm{TE} = N_{\mathrm{train}} \times N_{\mathrm{epochs}} \tag{2}$$

where $N_{\mathrm{train}}$ is the number of training samples and $N_{\mathrm{epochs}}$ is the number of epochs, obtained when the model stops training. Specifically, team Stanford uses an early stopping scheme for the SUNSET model, and the training is stopped when the validation loss does not decrease for five consecutive epochs. Team Cambridge stops the training when the ConvLSTM model starts overfitting, i.e., when the validation loss increases for several consecutive epochs. We do not compare the training time in this study, as it varies with the different GPU models used in training (Stanford used Tesla A100, while Cambridge used GeForce GTX 1080).
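Both metrics are simple to compute; a minimal sketch is given below. The function names are illustrative, and the training effort is taken as the number of training samples multiplied by the number of epochs, which is one reading of "the total number of samples seen by the model until convergence".

```python
import math


def rmse(predictions, targets):
    """Root mean squared error between predictions and ground truth."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - y) ** 2 for p, y in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


def training_effort(n_training_samples, n_epochs):
    """Total number of samples seen by the model until training stops."""
    return n_training_samples * n_epochs
```

For example, a model trained on 1,000 samples that stops after 8 epochs has a training effort of 8,000 samples, whatever hardware it ran on.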

5 Experiments

This section details the experiments conducted to address the questions raised in the Introduction section. First, the local models are trained individually based on the dataset from each location, which serve as the baselines for comparison with other models developed in this study. Next, a global model is trained jointly based on the combination of datasets from different locations. In view of the sample heterogeneity of different datasets, we investigated the effect of different input and output normalization methods on the model performance and modifications of the model architectures to accommodate input data streams from different locations. Moreover, we explore the potential of knowledge transfer between different datasets via transfer learning.

5.1 Training local models

The two research teams first train their models individually on each of the three datasets to construct a local baseline. To examine the generalization of the local models, in other words, to test how well the local models perform when applied offsite without re-training, the models are trained on normalized data instead of the original data. Training on normalized data enables the models to be deployed across locations whose prediction targets differ in scale or type (e.g., trained on PV data but applied to predict irradiance values). Specifically, during the development phase, the models learn to predict the relative values of PV power output or irradiance, and during the implementation phase, the relative predictions generated by the models are post-processed to revert them to the original scale. Here, the data normalization method used by each team is the optimal normalization method identified when training the model jointly on the integrated datasets. Details can be found in Sections 5.2 and 6.2.

5.2 Training global models

The challenges of training global models are associated with the heterogeneity of samples from different datasets. A common problem is the different scales (e.g., 30-kW PV system versus 2-MW PV system) and/or different types (e.g., irradiance versus PV output) of the measurement data. In this study, we mainly deal with the latter. Moreover, differences in the camera setups or in the data distribution caused by the unique local weather conditions could add other layers of difficulty. The PV output data in the Stanford dataset and the irradiance measurements in the SIRTA dataset are a good example demonstrating all of the above issues (see Figure 3.3), and thus will be our main focus in this section. Other possible combinations of datasets (e.g., Stanford+DEWA, Stanford+SIRTA+DEWA) could essentially go through similar training processes. Specifically, to combine the two datasets for joint training, we explored using different normalization methods for processing the model input and output data. We also experimented with tuning the model architectures to better accommodate the multiple input data streams.

5.2.1 Input and output normalization

To be clear, the normalization in this study is defined in the form of , where is the data to be normalized, and , are normalization factors. Two types of data are involved, namely, images and sensor measurements. For sky images, both teams normalized the pixel values of the images from [0, 255] to [0, 1] by dividing by the maximum pixel value, 255, which is a common technique for image data normalization. For PV output/irradiance measurements, including both input and output, different normalization methods are examined in this section and the details are summarized in Table 5.1. Because the ground truth prediction data are inaccessible when the model is deployed, the normalization factors are based solely on the input data, which is statistically similar. After the input and output of each dataset are normalized, we combine them, and the combined dataset is used for global model training. Figure 5.1 shows the data distribution of both datasets after normalization. In the testing phase, the same normalization factors are applied to the input data of the test set and the predictions are post-processed to the original scales. We evaluate the global models on the test sets of the two datasets individually and compare them with the corresponding local models.

Normalization method Stanford SIRTA

Notes: represents the measurement values and is referred to as the normalization factor; and represent standard deviation and 95th percentile of the data, respectively.

Table 5.1: Normalization methods in the form of for Stanford and SIRTA datasets
Figure 5.1: The distribution of normalized measurements of the Stanford and SIRTA datasets (Note: the distributions of , , , are not shown here because they are similar to those of their counterparts, differing only by a scale factor of 100)
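The normalize-then-revert workflow of this section can be sketched roughly as follows. This is only an illustration under stated assumptions: the pure-division form, the nearest-rank percentile, the function names and the sample values are all hypothetical, and the normalization forms actually compared are those listed in Table 5.1.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, with q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def normalize(values, factor):
    """Scale measurements to relative values (assumed form: x / factor)."""
    return [v / factor for v in values]

def denormalize(values, factor):
    """Revert relative predictions to the original measurement scale."""
    return [v * factor for v in values]

# Hypothetical PV output history in kW; only input data is used to derive the
# factor, since ground truth for the forecast is unavailable at deployment.
pv_history_kw = [0.0, 5.0, 12.0, 20.0, 24.0, 25.0, 18.0, 9.0, 2.0, 0.0]
factor = percentile(pv_history_kw, 95)

relative_inputs = normalize(pv_history_kw, factor)
relative_prediction = [0.5]  # stand-in for a model output in relative units
prediction_kw = denormalize(relative_prediction, factor)
```

Because the model only ever sees relative values, the same trained network could in principle be paired with a different factor at another site, which is what makes the offsite tests in this study possible.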

5.2.2 Architecture tuning

Two alternative architectures are investigated to deal with the two data input streams, i.e. the Stanford and SIRTA data, and are compared with the baseline architectures described in Section 4.1. The main difference between the baseline and the two alternative architectures is that the location information is given explicitly to the alternative architectures to help the models distinguish between the input samples from different locations, while the baseline model attempts to learn the location-specific features implicitly from the data. Figure 5.2 illustrates the two alternative architectures, which are named Architecture 1 [Arch. 1, see Figure 5.2 (B)] and Architecture 2 [Arch. 2, see Figure 5.2 (C)]. The modifications from the baseline architecture are highlighted with a yellow background. It should be noted that here we use the SUNSET model to explain the architecture modifications, while the same idea can be applied to the ConvLSTM model. It should also be noted that the models are trained with the normalized data based on the optimal normalization method identified for each team via the experiments described in Section 5.2.1.

Figure 5.2: Illustration of different architecture set-ups using the SUNSET model as an example (the modifications from the baseline architecture are highlighted with yellow background)

Arch. 1 has minor modifications compared with the baseline. The only change is the introduction of a condition matrix, with the same resolution as the sky images (). All elements of the condition matrix are either 1 or 0, indicating the location of the data. For example, if the sky images come from the Stanford dataset, the condition matrix elements are all 0s, whereas if the sky images come from the SIRTA dataset, the condition matrix elements are all 1s.
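A minimal sketch of the condition-matrix idea is given below. The text does not say how the matrix is fed to the network, so appending it as one extra image channel is an assumption, as are the function names and the tiny 2x2 image used for illustration.

```python
# Location codes as described in the text: 0 for Stanford, 1 for SIRTA.
LOCATION_CODE = {"stanford": 0.0, "sirta": 1.0}

def condition_matrix(location, height, width):
    """Build an all-0 or all-1 matrix with the image's spatial resolution."""
    code = LOCATION_CODE[location]
    return [[code] * width for _ in range(height)]

def append_condition_channel(image, location):
    """Append the condition matrix as an extra channel (assumed wiring).

    image is a nested list of shape height x width x channels.
    """
    height, width = len(image), len(image[0])
    cond = condition_matrix(location, height, width)
    return [
        [image[r][c] + [cond[r][c]] for c in range(width)]
        for r in range(height)
    ]

# Hypothetical 2x2 RGB image with pixel values already scaled to [0, 1].
tiny_image = [
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    [[0.7, 0.8, 0.9], [1.0, 0.0, 0.5]],
]
sirta_input = append_condition_channel(tiny_image, "sirta")
stanford_input = append_condition_channel(tiny_image, "stanford")
```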

Arch. 2 adds more complexity to the baseline architecture. It has shared convolutional blocks and two separate fully-connected blocks, with the two fully-connected blocks consisting of the same components as the baseline architecture [see Figure 4.1 (A)]. The shared convolutional blocks act as a common feature extractor and learn features from both locations, while the two separate fully-connected blocks learn the correlation between the image features and the PV output/irradiance individually for each location, which is motivated by the fact that the sun angle trigonometry changes with location. During the training process, each of the fully-connected blocks generates a prediction and the two predictions are then stacked into a vector. To generate the final prediction, an inner product is computed between the prediction vector and the one-hot label vector, which indicates the location of the data, with [1,0] and [0,1] representing the Stanford and SIRTA data, respectively.
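The final selection step of Arch. 2 can be illustrated with a short sketch. The head outputs below are hypothetical numbers standing in for the two fully-connected blocks; only the one-hot encoding and the inner product follow the description above.

```python
# One-hot location labels as given in the text.
ONE_HOT = {"stanford": [1.0, 0.0], "sirta": [0.0, 1.0]}

def final_prediction(head_predictions, location):
    """Inner product between the stacked head predictions and the label."""
    label = ONE_HOT[location]
    return sum(p * l for p, l in zip(head_predictions, label))

# Hypothetical outputs of the Stanford head and the SIRTA head for one sample.
stacked = [0.42, 0.77]
stanford_output = final_prediction(stacked, "stanford")
sirta_output = final_prediction(stacked, "sirta")
```

Since the unselected head is multiplied by zero, it contributes nothing to the output for that sample, so the loss gradient with respect to its prediction is zero and only the matching head and the shared convolutional blocks are updated.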

5.3 Transfer learning

In this section, we examine if the knowledge, or specifically the feature representations from a pre-trained solar forecasting model can be leveraged, so that we do not have to train a model from scratch in a new location of interest. Here, we define a source dataset and a corresponding source model , as well as a target dataset and a target model . In most cases, is a large dataset that contains massive samples, whereas is limited in the number of samples. The goal of transfer learning, specifically, domain adaptation, is thus to learn using with the knowledge gained from learning based on . It should be noted that the same architecture is used for and in this study, namely, either SUNSET by team Stanford or ConvLSTM by team Cambridge.

To implement transfer learning, we first develop based on . Two different strategies are then investigated for transferring the knowledge from to :

  • warm-starting strategy (WS): instead of initializing the weights of randomly, it initializes the entire network with the weights of , and from that initial point, the training on is started and new weights are learned;

  • freezing Conv. blocks strategy (FConv): the weights for all Conv. blocks from are transferred and are frozen during the training process, while the weights of fully-connected blocks are initialized with the weights from and learned with the target dataset .

For the FConv strategy, the feature extractors of are reused by and only the mapping functions associated with the fully connected layers are learned based on new data. An illustration of these two transfer learning strategies can be found in Figure 5.3.
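The difference between the two strategies can be sketched with a toy parameter dictionary. This is not the teams' training code: the layer names mirror the Conv1/Conv2/FC1/FC2 example of Figure 5.3, while the weight values, gradients, learning rate and helper names are hypothetical.

```python
import copy

def init_target_from_source(source_weights, strategy):
    """Copy all source weights and decide which layers stay trainable."""
    target = copy.deepcopy(source_weights)
    if strategy == "WS":
        # Warm start: every layer is initialized from the source and trained.
        trainable = set(target)
    elif strategy == "FConv":
        # Conv. blocks are frozen; only fully-connected layers are trained.
        trainable = {name for name in target if not name.startswith("conv")}
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return target, trainable

def sgd_step(weights, grads, trainable, lr=0.1):
    """One gradient step that skips frozen layers."""
    for name in trainable:
        weights[name] = [w - lr * g for w, g in zip(weights[name], grads[name])]

# Hypothetical pre-trained source weights and target-data gradients.
source_weights = {"conv1": [0.5, -0.2], "conv2": [0.1, 0.3],
                  "fc1": [0.7, 0.7], "fc2": [-0.4, 0.9]}
grads = {name: [1.0, 1.0] for name in source_weights}

ws_weights, ws_trainable = init_target_from_source(source_weights, "WS")
fconv_weights, fconv_trainable = init_target_from_source(source_weights, "FConv")
sgd_step(ws_weights, grads, ws_trainable)
sgd_step(fconv_weights, grads, fconv_trainable)
```

After one step, all four layers of `ws_weights` have moved away from the source values, whereas in `fconv_weights` only `fc1` and `fc2` have changed.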

Given that generally has a wider data coverage than , we first use the DEWA dataset as , and experiment with the following combinations of transfer learning: {: Stanford : DEWA}, {: SIRTA : DEWA} and {: Stanford+SIRTA : DEWA}. We then examine the possibility of knowledge transfer between two sky image datasets with drastically different prediction targets and data distribution, namely, {: SIRTA : Stanford} and {: Stanford : SIRTA}. We also experiment with different amounts of training data for , including 1%, 5%, 10%, 20%, 50%, 75% and 100%, roughly corresponding to 4, 16, 29, 55, 124, 184, 249 days of data for the DEWA development set, 6, 29, 59, 118, 272, 390, 502 days for the Stanford development set, and 9, 42, 79, 133, 329, 631, 933 days for the SIRTA development set. The data are sampled chronologically in the whole dataset to mimic the real situation of data collection. It should be noted that as team Cambridge does not have access to the DEWA dataset, they only conduct the second part of the experiments, while team Stanford (with access to the DEWA dataset) conducts all the experiments mentioned above. In this study, both the source models and the target models are trained using the normalized data rather than the original data, again based on the optimal normalization method identified by each team. All transfer learning models are evaluated and compared with the local model baseline trained from scratch using in two aspects: prediction accuracy measured by the RMSE, and training cost measured by the TE as described in Section 4.2.
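One way to read "sampled chronologically" is that the earliest portion of the development set is kept for each data level. The sketch below implements that reading only; the interpretation, the function name and the toy timestamps are assumptions rather than details confirmed by the text.

```python
def chronological_subset(samples, fraction):
    """Keep the earliest fraction of (timestamp, payload) samples."""
    ordered = sorted(samples, key=lambda s: s[0])
    n_keep = max(1, int(fraction * len(ordered)))
    return ordered[:n_keep]

# Hypothetical development set: ten samples keyed by day index, out of order.
dev_set = [(day, f"sample_{day}") for day in (7, 2, 9, 0, 4, 1, 8, 3, 6, 5)]

subset_20 = chronological_subset(dev_set, 0.20)
subset_50 = chronological_subset(dev_set, 0.50)
```

Taking the earliest data mimics a newly installed camera that has only been collecting for a short time, which is the deployment scenario the experiment is meant to represent.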

Figure 5.3: Illustration of the different transfer learning strategies examined in this study, taking a CNN with two convolutional layers (Conv1 and Conv2) and two fully connected layers (FC1 and FC2) as an example.

6 Results and discussion

6.1 Comparison of local and global models for solar forecasting

We first compare the performance of local and global solar forecasting models. Besides serving as local baselines, the local models are also applied offsite without re-training by predicting the relative values of irradiance/PV output and post-processing to the original scales for performance evaluation. A global model is trained using the combination of the normalized Stanford and SIRTA datasets. The normalization methods used for both local models and global models are for team Stanford and for team Cambridge, which are the optimal normalization methods identified in the experiments presented in Section 6.2.

Table 6.1 shows the overall test set performance of local and global models trained with baseline SUNSET and ConvLSTM architectures. For both architectures, the local models achieve the best performance when applied locally, and the prediction errors significantly increase when applied offsite (without re-training). Although there are still large prediction errors when applying local models to other locations with similar measurement data distributions, e.g., applying the local model developed using Stanford data (PV output) to the DEWA test set (GHI) (Test RMSE 155.82 W/m²), it tends to have better performance than applying local models to locations with quite different data distribution even though they share the same target variable, e.g., applying the local model developed using SIRTA data (GHI) to the DEWA test set (GHI) (Test RMSE 219.79 W/m²).

The prediction errors of applying learned local models offsite can be largely attributed to two parts of the model architectures: the feature extraction part associated with the Conv. blocks and the regression part associated with the fully-connected blocks. We find that the errors mainly come from the regression part, namely, mapping extracted features to the prediction target values, whereas the feature representations learned based on each individual dataset can to some extent be shared. This is illustrated in Figure 6.1, which shows the predictions of local models developed based on Stanford and SIRTA datasets individually using the SUNSET architecture applied to 6 example days (3 sunny and 3 cloudy days) in the DEWA test set without re-training. The first column of the figure shows the original predictions, and the second column shows the scaled predictions obtained by multiplying the original predictions by a factor of 1.2 for the Stanford local model and 1.4 for the SIRTA local model for both example sunny and cloudy days, without any other treatment. The original prediction curves, including both Stanford and SIRTA local model predictions, tend to have similar shapes to the ground truth curves on both sunny and cloudy days, which suggests that feature representations learnt by the models are common for all the locations and can be shared. The errors are mostly caused by the scale or magnitude of the predictions. Once we manually scale the prediction by a factor, we can observe significant improvement in the prediction, although it is not perfect. This finding is further illustrated in Section 6.3 by freezing the weights of the Conv. blocks and only training the fully connected blocks during transfer learning.
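The effect of rescaling an offsite prediction can be illustrated with a toy calculation. The factor of 1.2 is the one quoted above for the Stanford local model; the ground truth and prediction values are hypothetical numbers chosen only to show a curve with the right shape but the wrong magnitude.

```python
import math

def rmse(predictions, truth):
    """Root mean squared error between two equally long sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, truth)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical GHI ground truth (W/m^2) and an offsite prediction that follows
# the shape of the curve but underestimates its magnitude.
ground_truth = [100.0, 400.0, 700.0, 400.0, 100.0]
original_pred = [80.0, 340.0, 590.0, 330.0, 85.0]

scale = 1.2  # factor quoted in the text for the Stanford local model
scaled_pred = [scale * p for p in original_pred]

rmse_original = rmse(original_pred, ground_truth)
rmse_scaled = rmse(scaled_pred, ground_truth)
```

In this toy case the scaled curve is much closer to the ground truth, which mirrors the observation that the offsite error lies mostly in the regression scale rather than in the learned features.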

| Model | Trained on | Training epochs | Training samples | Training effort | Test RMSE on Stanford | Test RMSE on SIRTA | Test RMSE on DEWA |
|---|---|---|---|---|---|---|---|
| SUNSET | Stanford | 21.2±6.1 | 125,876 | 2.67±0.77M | 2.68 | 136.96 | 155.82 |
| SUNSET | SIRTA | 9.6±1.8 | 438,172 | 4.21±0.81M | 6.53 | 96.42 | 219.79 |
| SUNSET | DEWA | 11.8±2.2 | 85,278 | 1.01±0.19M | 4.65 | 160.28 | 89.49 |
| SUNSET | Stanford+SIRTA | 28.6±5.6 | 564,048 | 16.13±3.15M | 2.61 | 96.34 | 129.21 |
| ConvLSTM | Stanford | 11.7±2.7 | 125,876 | 1.47±0.34M | 2.62 | 113.05 | N/A |
| ConvLSTM | SIRTA | 5.6±2.6 | 438,172 | 2.49±1.13M | 4.39 | 96.46 | N/A |
| ConvLSTM | Stanford+SIRTA | 6.0±1.7 | 564,048 | 3.36±0.99M | 2.62 | 95.53 | N/A |

Table 6.1: Overall test set performance of local and global models (the best prediction performance on each local test set is highlighted in bold font)
  • Notes: (1) The training effort reported in this table is defined as the number of samples seen by the model until convergence, i.e. the number of training epochs × the number of training samples, representing the average performance (mean ± std) for training one sub-model from ten-fold cross-validation. M stands for million samples. (2) The training epochs are obtained when the model stops training, namely, when the validation loss is not observed to decrease for five consecutive epochs. (3) The test RMSE is calculated based on the ensemble mean prediction of the ten sub-models from ten-fold cross-validation.
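The training effort definition and the stopping rule in the notes above can be sketched as follows. The epoch and sample counts are taken from the SUNSET Stanford row of Table 6.1; the validation loss sequence and the function names are hypothetical.

```python
def training_effort(epochs, n_samples):
    """Number of samples seen by the model until convergence."""
    return epochs * n_samples

def stopping_epoch(val_losses, patience=5):
    """Epoch at which training stops under the rule in note (2).

    Training stops once the validation loss has not decreased for
    `patience` consecutive epochs; epochs are counted from 1.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)

# SUNSET local model on Stanford: mean of 21.2 epochs over 125,876 samples.
effort_millions = training_effort(21.2, 125_876) / 1e6

# Hypothetical validation losses: the best value is reached at epoch 3.
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.7, 0.75, 0.73, 0.9]
stopped_at = stopping_epoch(losses)
```

The computed effort of about 2.67 million samples agrees with the corresponding table entry.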

In contrast, the global model trained with a combined dataset from two locations can perform well on individual test sets from both locations. The performance is close to, or even better than, the corresponding local models (e.g., Stanford+SIRTA global model versus Stanford local model versus SIRTA local model), which suggests that the global models can learn features from both locations simultaneously and the learned features can be correlated with the prediction targets in a relatively separate fashion without compromising the performance for each location. This property is further examined in Section 6.2 by comparing the baseline architectures with the modified architectures that explicitly disentangle the location information. Moreover, including diversified samples from both locations could improve the model generalization ability.

Although applying the Stanford+SIRTA global model to the DEWA test set gives much worse performance than the DEWA local model, it is significantly better than the Stanford and SIRTA local models applied to the DEWA test set. We could reasonably expect that training a global model on a combination of the Stanford, SIRTA and DEWA datasets would give promising results on each of the three datasets.

In terms of prediction accuracy, training a global model with a combined dataset and applying it to locations of interest is superior, especially if the local datasets have limited sample size. However, the training cost of global models remains a challenge. Depending on the specific model architecture, more effort might be required for the model to converge, due to increases in both training set size and training epochs. Section 6.4 further evaluates the training cost of local models, global models and models trained with transfer learning.

Figure 6.1: Visualization of Stanford and SIRTA local model predictions based on the SUNSET architecture applied to 6 example days (3 sunny days shown in the first row and 3 cloudy days shown in the second row) in the DEWA test set without re-training. The first column shows the original predictions and the second column shows the scaled predictions from simply multiplying the original prediction by a factor. The same factor is applied for both sunny days and cloudy days predictions. (GT: ground truth; pred.: prediction)

6.2 A further look into global solar forecasting models

In view of the superior prediction accuracy of the global models over the local models, in this section, we further present experiments on training global models given the dataset heterogeneity, especially the different scales and distributions of prediction targets. Also, we experiment with different alternative model architectures to accommodate the multi-location input data, and compare these with the baseline model architectures.

Figure 6.2 shows the comparison of different normalization methods for training the global model for SUNSET and ConvLSTM architectures, respectively. The global model is trained jointly using the combination of Stanford and SIRTA datasets and evaluated individually on the two local datasets. It can be observed that the SUNSET model is more sensitive to the scale of the normalization factors than the ConvLSTM model: dividing the normalization factor by 100 can significantly improve the model performance on both local datasets. In contrast, the ConvLSTM model behaves such that a small normalization factor generally leads to worse performance. The optimal normalization factors for SUNSET and ConvLSTM are identified as and , respectively. It should be noted that normalization factors derived from robust statistics, e.g. or , are less affected by outliers compared to, say, the value. Preliminary experiments suggest that the different responses to the normalization factors are due to the different architectures of the two models, especially the batch normalization layers (SUNSET uses batch normalization layers but ConvLSTM does not), though the underlying reasons need to be more thoroughly understood via future studies. We also suggest that researchers take care in deciding which normalization methods they use for their models.

Figure 6.2: Comparison of different normalization methods for training the global model for two different model architectures

The results of including dataset location as an auxiliary data input (see Figure 5.2 for a comparison of different global model architectures) are presented in Table 6.2. No significant performance improvement is observed for the two alternative architectures for both SUNSET and ConvLSTM models compared with the baseline architectures, which suggests that the baseline model can learn by itself to distinguish location using the subtle features with the prediction target. Thus, the modifications in the architecture to manually inject the location information or disentangle the prediction for each location are not necessary.

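| Model | Normalization method | Architecture choice | Trained on | Test RMSE on Stanford | Test RMSE on SIRTA |
|---|---|---|---|---|---|
| SUNSET | | Baseline | Stanford+SIRTA | 2.612 | 96.335 |
| SUNSET | | Arch. 1 | Stanford+SIRTA | 2.606 | 96.075 |
| SUNSET | | Arch. 2 | Stanford+SIRTA | 2.610 | 96.585 |
| ConvLSTM | | Baseline | Stanford+SIRTA | 2.622 | 95.529 |
| ConvLSTM | | Arch. 1 | Stanford+SIRTA | 2.765 | 96.850 |
| ConvLSTM | | Arch. 2 | Stanford+SIRTA | 2.663 | 96.264 |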
Table 6.2: Performance of different global model architectures on local test sets (the best performance is highlighted in bold font)

6.3 Knowledge transfer for solar forecasting

Figure 6.3: Comparison of different transfer learning strategies and different amounts of data available for the target dataset. The RMSEs are evaluated based on the DEWA test set. Subplot (A) shows the RMSE for models learned using different transfer learning strategies at each level of target data; the inset plot in (A) shows a zoom-in of the results from 20% to 100% target data, and the dotted line represents the RMSE of the DEWA local baseline model trained using 100% data. The baseline performance is represented by the dashed line in each subplot. Subplot (B) shows the relative performance of the models learned using different transfer learning strategies compared with the DEWA local model baseline for each level of target data. For each target data level, local models correspond to RMSE% = 0%, and negative RMSE% indicates that the models learned with different transfer learning strategies outperform the corresponding DEWA local model baseline.

Figure 6.4: Visualization of predictions (Pred.) from the DEWA local baseline model trained with different amounts of data using the SUNSET architecture versus ground truth (GT) on the DEWA test set. The green curve represents the ground truth. Different shades of blue represent the local models trained with different amounts of data (light blue stands for a small dataset for training, dark blue stands for a large dataset for training). Note that for sunny days, the predictions from model trained with 50% data and above almost overlap with the ground truth. Huge gaps can be observed between model predictions and ground truth if there is less than 20% data for training.

Lastly, we explore the potential of transferring knowledge from a pre-trained solar forecasting model developed from a source dataset to a new model based on a target dataset.

We first present the experiments using the DEWA dataset as the target dataset, and the knowledge is transferred from the models learned from source datasets Stanford, SIRTA and Stanford+SIRTA, respectively. Figure 6.3 shows the results of transfer learning with different strategies (WS and FConv) and different amounts of target data as described in Section 5.3. All models for this experiment are trained based on the SUNSET architecture and the performance is evaluated based on the DEWA test set.

In general, models trained using transfer learning strategies, whether WS or FConv, achieve better prediction performance than the DEWA local baseline models for every level of target data used. The smaller the local dataset is, the more benefit transfer learning can bring. When target data is less than 20% of the whole DEWA dataset (55-day-equivalent amount of data), the transfer learning models can significantly outperform the local baseline by approximately 10% and up to 60%. An extreme case is when there is only 1% of total data (4-day-equivalent amount of data). The transfer learning models can outperform the baseline by nearly 60%, which indicates that the knowledge gained from training models on other large datasets can be transferred and reused in the new task for a jump-start. From 20% target data and beyond, the benefits that transfer learning can bring gradually diminish. Although it still outperforms the baseline, the improvement is within 10%.

In terms of different transfer learning strategies, namely, WS and FConv, the difference in prediction performance is generally less than 1%. Although in some specific situations (e.g., 1% target data), a 6% difference can be observed, generally, there is no significant advantage for using one strategy over another. To this end, freezing all the Conv. blocks during the model training is totally acceptable, as the learned feature representations from the source models can largely be reused in the target model without compromising the performance. Even when the data distribution is drastically different, e.g., the SIRTA dataset versus the DEWA dataset, some common features learned by the source model can be shared.

Moreover, by comparing different source models for transferring the knowledge, it can be observed that the larger and the more diversified the dataset used for building the source model, the more benefit it can bring in the transfer learning process. Using the Stanford+SIRTA dataset to build the source model can take advantage of the two individual datasets. When target data is less than 20%, the Stanford→DEWA model shows better performance than the SIRTA→DEWA model, as a similar data distribution is shared by the Stanford and DEWA datasets and thus the learned features can be easily transferred, while when target data is greater than 20%, the opposite can be observed. When target data is greater than 20%, it is enough for the model to learn a solid sun angle equation (see Figure 6.4: when training data is greater than 20%, the sunny day predictions almost align with the ground truth), and the overall performance of the model on the test set is mainly determined by the prediction accuracy on cloudy days. Therefore, the features learned from the SIRTA dataset, which is dominated by cloudy data, can contribute more. Another result is the huge data reduction potential of transfer learning. It can be noticed that with 20% target data the Stanford+SIRTA→DEWA transfer learning model can even achieve slightly better performance than the DEWA local baseline model with 100% target data.

Figure 6.5: Transfer learning between Stanford dataset and SIRTA dataset for SUNSET and ConvLSTM models with different amounts of target data and different transfer learning strategies. Subplot (A) shows the models pre-trained on the Stanford dataset and transferred to the SIRTA dataset using different strategies; Subplot (B) shows the models pre-trained on the SIRTA dataset and transferred to the Stanford dataset using different strategies.

Transfer learning from SIRTA to DEWA shows promising results as described above, although the two datasets have very different data distribution. Here, we further examine two even harder cases, namely, transfer learning from Stanford to SIRTA and from SIRTA to Stanford, with differences both in prediction target (PV output versus GHI) and data distribution. The Stanford dataset contains only approximately a quarter of the data volume of the SIRTA dataset, and has relatively more balanced sunny and cloudy sample distribution than the SIRTA dataset, which is dominated by cloudy data. Figure 6.5 shows transfer learning between the Stanford and the SIRTA datasets for the SUNSET and ConvLSTM models with different amounts of target data and different transfer learning strategies.

In general, SUNSET and ConvLSTM show similar trends in both transfer learning cases. From Stanford→SIRTA, the transfer learning models outperform the local models on both sunny and cloudy days, while the benefits gradually saturate when more training data is available, which is similar to the previous experiments on transfer learning to the DEWA dataset. It also suggests that the knowledge learned from PV power prediction can be transferred to irradiance prediction, as these two variables are correlated with each other. However, a caveat should be added here. PV power output is affected by the panel temperature (i.e., high temperature can hinder the power generation of PV), which could be learned by the deep learning model from the data, while irradiance measurements are less affected by temperature changes. Therefore, the temperature parameter needs to be taken into account for transfer learning from PV to irradiance. The average temperature in Stanford is below 30°C in the summer, hence its impact is not observed in this study. From SIRTA→Stanford, different behaviors are observed on sunny and cloudy days. The local models outperform the transfer learning models on sunny days, while the transfer learning models do better on cloudy days. This makes sense as the knowledge transferred from the source model (SIRTA in this case) is mostly learned from cloudy data. The WS or FConv transfer learning strategy could put the model at a sub-optimal starting point and hinder the learning if there is not enough data, while the situation improves as more training data becomes available. In comparison, the Stanford dataset has a relatively more balanced distribution of sunny and cloudy samples. It can be observed that transfer learning from Stanford to SIRTA outperforms the SIRTA local baseline. It thus suggests that for transfer learning, attention should be paid to the balance of dataset samples and not purely to the size of the dataset.

6.4 Overall evaluation of different training strategies

Table 6.3 summarizes the performance of local, global and transfer learning models on each one of the three datasets for both SUNSET and ConvLSTM architectures. The performance is evaluated based on both the prediction accuracy (measured by target test dataset RMSE) as well as the training cost (measured by training effort as defined in Section 4.2). For the DEWA dataset as the target set, the results show that by using transfer learning, either WS or FConv, with 80% less training data and training effort (TE), we can achieve 1% improvement over the local baseline model trained with the whole dataset. If the whole target training data is available, models trained with transfer learning can outperform the local model by 5% with 14% less training effort. For the Stanford dataset and the SIRTA dataset as the target set, both SUNSET and ConvLSTM architectures are used in the experiments. The two architectures show different behaviors in model convergence, which leads to differences in TE, while the general trend is aligned. For both architectures, with the whole dataset available, transfer learning models can achieve slightly better prediction accuracy than the local model and global model (improvement within 3%). Depending on the model and the direction of the transfer between SIRTA and Stanford datasets, transfer learning strategies impact the TE differently. For the SUNSET model, the TE of transfer learning relative to the baseline local model ranges from to %, while for the ConvLSTM model it decreases from to %. In particular, the TE is down by a factor of 8 to 9 when ConvLSTM is used to transfer knowledge from SIRTA to Stanford. Although the global model is more generalized and can be applied to individual locations without compromising its prediction ability (as presented in Table 6.1), it generally needs more training effort than the local model and the transfer learning model due to the increasing dataset size and possibly more training epochs to converge.
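As a rough cross-check, the headline DEWA figures can be recomputed from the rounded SUNSET entries of Table 6.3. The helper name and the sign convention (negative means lower than the baseline) are assumptions for illustration, and results from rounded table values may differ slightly from figures computed on unrounded data.

```python
def relative_change_pct(value, baseline):
    """Percentage change of value relative to baseline (negative = lower)."""
    return 100.0 * (value - baseline) / baseline

# SUNSET, DEWA target: test RMSE and training effort (million samples).
local_100 = {"rmse": 89.49, "te": 1.01}       # local baseline, 100% data
tl_ws_20 = {"rmse": 88.44, "te": 0.22}        # WS transfer, 20% data
tl_ws_100 = {"rmse": 84.64, "te": 0.93}       # WS transfer, 100% data
tl_fconv_100 = {"rmse": 85.13, "te": 0.86}    # FConv transfer, 100% data

# Transfer learning with 20% of the data versus the full local baseline.
rmse_change_20 = relative_change_pct(tl_ws_20["rmse"], local_100["rmse"])
te_change_20 = relative_change_pct(tl_ws_20["te"], local_100["te"])

# Transfer learning with the full target dataset versus the local baseline.
rmse_change_100 = relative_change_pct(tl_ws_100["rmse"], local_100["rmse"])
te_change_fconv_100 = relative_change_pct(tl_fconv_100["te"], local_100["te"])
```

With these inputs the RMSE changes come out at roughly -1% and -5%, and the training effort changes at roughly -78% and -15%, which is broadly in line with the percentages quoted in this section.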

Overall, with its superior performance in terms of both prediction accuracy and computational cost in model training, transfer learning appears to be the best of the three training strategies. It is especially suitable for situations where there is limited data. As more and more open-source datasets become available to the community, we highly recommend joint efforts in building pre-trained solar forecasting models on a centralized large-scale dataset with massive and diversified samples (just as, in computer vision, different pre-trained models are built using ImageNet (Deng et al., 2009)), which can potentially accelerate the research and development of solar forecasting methods. Local retraining starting from this baseline will allow superior local models.

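| Model architecture | Target training dataset (% data) | Source training dataset | Model type | TL strategy | Target test set RMSE | Training samples | Training epochs | Training effort |
|---|---|---|---|---|---|---|---|---|
| SUNSET | DEWA (20%) | N/A | L | N/A | 95.76 | 17,056 | 23.4±3.4 | 0.40±0.06M |
| SUNSET | DEWA (20%) | Stanford+SIRTA | TL | WS | 88.44 | 17,056 | 12.7±2.9 | 0.22±0.05M |
| SUNSET | DEWA (20%) | Stanford+SIRTA | TL | FConv | 88.82 | 17,056 | 12.8±5.6 | 0.22±0.10M |
| SUNSET | DEWA (100%) | N/A | L | N/A | 89.49 | 85,278 | 11.8±2.2 | 1.01±0.19M |
| SUNSET | DEWA (100%) | Stanford+SIRTA | TL | WS | 84.64 | 85,278 | 10.9±2.4 | 0.93±0.21M |
| SUNSET | DEWA (100%) | Stanford+SIRTA | TL | FConv | 85.13 | 85,278 | 10.1±4.6 | 0.86±0.39M |
| SUNSET | Stanford (100%) | N/A | L | N/A | 2.68 | 125,876 | 21.2±6.1 | 2.67±0.77M |
| SUNSET | Stanford (100%) | Stanford+SIRTA | G | N/A | 2.61 | 564,048 | 28.6±5.6 | 16.13±3.15M |
| SUNSET | Stanford (100%) | SIRTA | TL | WS | 2.61 | 125,876 | 23.2±7.7 | 2.92±0.97M |
| SUNSET | Stanford (100%) | SIRTA | TL | FConv | 2.62 | 125,876 | 22.7±6.9 | 2.86±0.87M |
| SUNSET | SIRTA (100%) | N/A | L | N/A | 96.42 | 438,172 | 9.6±1.8 | 4.21±0.81M |
| SUNSET | SIRTA (100%) | Stanford+SIRTA | G | N/A | 96.34 | 564,048 | 28.6±5.6 | 16.13±3.15M |
| SUNSET | SIRTA (100%) | Stanford | TL | WS | 96.18 | 438,172 | 9.0±2.0 | 3.94±0.88M |
| SUNSET | SIRTA (100%) | Stanford | TL | FConv | 96.73 | 438,172 | 10.6±2.6 | 4.64±1.15M |
| ConvLSTM | Stanford (100%) | N/A | L | N/A | 2.62 | 125,876 | 11.7±2.7 | 1.47±0.34M |
| ConvLSTM | Stanford (100%) | Stanford+SIRTA | G | N/A | 2.62 | 564,048 | 6.0±1.7 | 3.36±0.99M |
| ConvLSTM | Stanford (100%) | SIRTA | TL | WS | 2.58 | 125,876 | 1.5±0.9 | 0.19±0.11M |
| ConvLSTM | Stanford (100%) | SIRTA | TL | FConv | 2.60 | 125,876 | 1.7±1.4 | 0.22±0.17M |
| ConvLSTM | SIRTA (100%) | N/A | L | N/A | 96.46 | 438,172 | 5.6±2.6 | 2.49±1.13M |
| ConvLSTM | SIRTA (100%) | Stanford+SIRTA | G | N/A | 95.53 | 564,048 | 6.0±1.7 | 3.36±0.99M |
| ConvLSTM | SIRTA (100%) | Stanford | TL | WS | 94.73 | 438,172 | 5.4±1.6 | 2.38±0.70M |
| ConvLSTM | SIRTA (100%) | Stanford | TL | FConv | 94.83 | 438,172 | 4.5±1.3 | 1.98±0.56M |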

Table 6.3: Summary of the performance of local, global and transfer learning (TL) models for SUNSET and ConvLSTM architectures. The performance is evaluated for each of the datasets based on both the prediction accuracy (measured by target test set RMSE) and the computational cost of training (measured by training effort).
  • Notes: (1) For “Model type”, L means local model, TL means transfer learning model, G means global model. The global models are not trained on the target training set. They are only trained on the source datasets and applied to the target test set. (2) The test RMSE is calculated based on the ensemble mean prediction of the ten sub-models from ten-fold cross-validation. (3) For “Training samples”, the meaning is different for different types of models. For local and transfer learning models, we count the number of training samples used for the target model without counting those used to develop the source model; this is because large-scale pre-trained models will mostly be open-source and accessible, and users do not need to train them by themselves. For the global model, we count the number of training samples for the source model as it is directly applied to the target test set without training a target model. (4) “Training epochs” is the number of epochs needed until model convergence, which is represented by mean ± std based on ten sub-models from ten-fold cross-validation. (5) Training effort = Training samples × Training epochs, and M stands for a million samples.
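The evaluation in note (2) can be sketched as follows: the sub-model predictions are averaged per time step and the RMSE is computed on that ensemble mean. The study uses ten sub-models; the three toy sub-models, their values and the function names below are hypothetical and kept small for readability.

```python
import math

def ensemble_mean(sub_model_predictions):
    """Average the predictions of all sub-models at each time step."""
    n_models = len(sub_model_predictions)
    return [sum(step) / n_models for step in zip(*sub_model_predictions)]

def rmse(predictions, truth):
    """Root mean squared error between two equally long sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, truth)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical predictions of three sub-models for two time steps.
sub_models = [
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
]
ground_truth = [2.0, 5.0]

mean_prediction = ensemble_mean(sub_models)
ensemble_rmse = rmse(mean_prediction, ground_truth)
```

Averaging before computing the error means the reported RMSE describes the ensemble forecast, not the average error of the individual sub-models.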

7 Conclusion

In this study, we explore the development of deep-learning-based solar forecasting models using three heterogeneous datasets collected around the world. We compare the performance of models trained individually on local datasets (local models) and models trained jointly on the combination of multiple datasets (global models), and we further examine the potential of knowledge transfer from pre-trained solar forecasting models to a new dataset of interest (transfer learning models).

The results suggest that the local models work best when deployed locally. When they are applied offsite, significant errors are observed in the scale of the prediction; however, the trend and shape of the PV/irradiance time series are still predicted well, which indicates that the feature representations learned by the local models can generally be shared, while the learned regression functions are location dependent. Training on a global dataset can improve the generalization of the model, as evidenced by better performance when adapting to individual locations, though one should be aware of the possible increase in training effort required. Transfer learning can help significantly when there is limited data for training, especially if the source model is pre-trained on a large and diversified source dataset. The results show that pre-training solar forecasting models on the combined global dataset and then transferring them to the local DEWA dataset, using either a warm-starting or a freezing-convolutional-blocks strategy, can reduce the training effort by 80% while achieving a 1% improvement in prediction accuracy compared with the local baseline model trained with the whole dataset. Thus, we further call on the community to contribute to a large-scale global dataset with massive and diversified data for solar forecasting. Open-sourcing pre-trained models built upon such a large-scale dataset can benefit local model development by saving significantly on training effort. Future studies will also investigate the benefit of data augmentation (e.g. image mixing, rotations, vertical/horizontal flips) and scene representation (e.g. polar coordinates, sun-centred sky images) on knowledge transfer in more diverse contexts.
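
The headline figures above can be traced back to Table 6.3 with a little arithmetic. The sketch below assumes that the comparison is between the SUNSET transfer learning models fine-tuned on 20% of the DEWA data and the SUNSET local model trained on 100% of the DEWA data; the helper `relative_reduction_pct` and the dictionary names are hypothetical, and the numbers are copied from the table.

```python
# Relative savings of the DEWA transfer learning models versus the local
# baseline, using the RMSE and training-effort values from Table 6.3.

def relative_reduction_pct(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


# Assumed baseline: SUNSET local model trained on DEWA (100%)
LOCAL_FULL = {"rmse": 89.49, "effort_m": 1.01}

# Assumed comparison rows: SUNSET TL models trained on DEWA (20%)
TRANSFER_20 = {
    "WS": {"rmse": 88.44, "effort_m": 0.22},
    "FConv": {"rmse": 88.82, "effort_m": 0.22},
}

for strategy, row in TRANSFER_20.items():
    effort_saving = relative_reduction_pct(LOCAL_FULL["effort_m"], row["effort_m"])
    rmse_gain = relative_reduction_pct(LOCAL_FULL["rmse"], row["rmse"])
    print(f"{strategy}: effort -{effort_saving:.1f}%, RMSE -{rmse_gain:.2f}%")
```

Under these assumptions, both strategies need roughly 78% less training effort, and the RMSE drops by about 1.2% with warm-starting and about 0.75% with frozen convolutional blocks, which is consistent with the rounded 80% and 1% quoted in the text.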


Acknowledgments

The research was supported by the Dubai Electricity and Water Authority (DEWA) through their membership in the Stanford Energy Corporate Affiliates (SECA) program. This work was also sponsored by ENGIE Lab CRIGEN, EPSRC (EP/R513180/1) and the University of Cambridge. The authors acknowledge SIRTA for sharing the data used in this study and the Stanford Research Computing Center for providing the computational resources for conducting the experiments.


References

  • A. S. Bansal, T. Bansal, and D. Irwin (2021) A moment in the sun: solar nowcasting from multispectral satellite data using self-supervised learning. arXiv. Cited by: §2.1.
  • P. Blanc, P. Massip, A. Kazantzidis, P. Tzoumanikas, P. Kuhn, S. Wilbert, D. Schüler, and C. Prahl (2017) Short-term forecasting of high resolution local DNI maps with multiple fish-eye cameras in stereoscopic mode. AIP Conference Proceedings 1850 (1), pp. 140004. ISSN 0094-243X. Cited by: §1.
  • C. W. Chow, B. Urquhart, M. Lave, A. Dominguez, J. Kleissl, J. Shields, and B. Washom (2011) Intra-hour forecasting with a total sky imager at the UC San Diego solar energy testbed. Solar Energy 85 (11), pp. 2881–2893. Cited by: §1.
  • Y. Chu, M. Li, H. T. C. Pedro, and C. F. M. Coimbra (2015) Real-time prediction intervals for intra-hour DNI forecasts. Renewable Energy 83, pp. 234–244. ISSN 18790682. Cited by: §1.
  • Y. Chu, H. T. C. Pedro, and C. F. M. Coimbra (2013) Hybrid intra-hour DNI forecasts with sky image processing enhanced by stochastic learning. Solar Energy 98, pp. 592–603. ISSN 0038-092X. Cited by: §1.
  • Y. Chu, B. Urquhart, S. M. I. Gohari, H. T. C. Pedro, J. Kleissl, and C. F. M. Coimbra (2015) Short-term reforecasting of power output from a 48 MWe solar PV plant. Solar Energy 112, pp. 68–77. ISSN 0038-092X. Cited by: §1.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. ISSN 1063-6919. Cited by: §2.2, §2.2, §6.4.
  • Y. Fabel, B. Nouri, S. Wilbert, N. Blum, R. Triebel, M. Hasenbalg, P. Kuhn, L. F. Zarzalejo, and R. Pitz-Paal (2021) Applying self-supervised learning for semantic cloud segmentation of all-sky images. Atmospheric Measurement Techniques Discussions, pp. 1–20. ISSN 1867-1381. Cited by: §2.2.
  • C. Feng, D. Yang, B. Hodge, and J. Zhang (2019) OpenSolar: Promoting the openness and accessibility of diverse public solar datasets. Solar Energy 188, pp. 1369–1379. ISSN 0038-092X. Cited by: §1.
  • C. Feng, J. Zhang, W. Zhang, and B. M. Hodge (2022) Convolutional neural networks for intra-hour solar forecasting based on sky image sequences. Applied Energy 310, pp. 118438. Cited by: §1.
  • C. Feng and J. Zhang (2020) SolarNet: A sky image-based deep convolutional neural network for intra-hour solar forecasting. Solar Energy 204 (April), pp. 71–78. Cited by: §1.
  • M. Haeffelin, L. Barthès, O. Bock, C. Boitel, S. Bony, D. Bouniol, H. Chepfer, M. Chiriaco, J. Cuesta, J. Delanoë, P. Drobinski, J.-L. Dufresne, C. Flamant, M. Grall, A. Hodzic, F. Hourdin, F. Lapouge, Y. Lemaître, A. Mathieu, Y. Morille, C. Naud, V. Noël, W. O’Hirok, J. Pelon, C. Pietras, A. Protat, B. Romand, G. Scialom, and R. Vautard (2005) SIRTA, a ground-based atmospheric observatory for cloud and aerosol research. Annales Geophysicae 23 (2), pp. 253–275. ISSN 1432-0576. Cited by: §3.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs]. Cited by: §4.1.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §2.2.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. ISSN 08997667. Cited by: §4.1.2.
  • G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger (2017) Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708. Cited by: §2.2.
  • D. P. Kingma and J. Ba (2014) Adam: A Method for Stochastic Optimization. In ICLR 2015, pp. 1–15. Cited by: §4.1.
  • P. Kuhn, B. Nouri, S. Wilbert, C. Prahl, N. Kozonek, T. Schmidt, Z. Yasser, L. Ramirez, L. Zarzalejo, A. Meyer, L. Vuilleumier, D. Heinemann, P. Blanc, and R. Pitz-Paal (2018) Validation of an all-sky imager–based nowcasting system for industrial PV plants. Progress in Photovoltaics: Research and Applications 26 (8), pp. 608–621. ISSN 1099-159X. Cited by: §1.
  • B. Kurtz, F. Mejia, and J. Kleissl (2017) A virtual sky imager testbed for solar energy forecasting. Solar Energy 158, pp. 753–759. ISSN 0038-092X. Cited by: §1.
  • R. Li, D. Wang, S. Liang, A. Jia, and Z. Wang (2022) Estimating global downward shortwave radiation from VIIRS data using a transfer-learning neural network. Remote Sensing of Environment 274, pp. 112999. ISSN 0034-4257. Cited by: §2.2.
  • R. Marquez and C. F. Coimbra (2013) Intra-hour DNI forecasting based on cloud tracking image analysis. Solar Energy 91, pp. 327–336. Cited by: §1.
  • G. Masson and I. Kaizuka (2021) Trends in Photovoltaic Applications 2021. Technical report, IEA PVPS. ISBN 9783907281284. Cited by: §1.
  • A. Moreno-Munoz, J. J. G. de la Rosa, R. Posadillo, and F. Bellido (2008) Very short term forecasting of solar radiation. In 2008 33rd IEEE Photovoltaic Specialists Conference, pp. 1–5. Cited by: §1.
  • Y. Nie, X. Li, A. Scott, Y. Sun, V. Venugopal, and A. Brandt (2022) SKIPP’d: a sky images and photovoltaic power generation dataset for short-term solar forecasting. arXiv. Cited by: §1, §3.1.
  • Y. Nie, Y. Sun, Y. Chen, R. Orsini, and A. Brandt (2020) PV power output prediction from sky images using convolutional neural network: The comparison of sky-condition-specific sub-models and an end-to-end model. Journal of Renewable and Sustainable Energy 12 (4), pp. 046101. Cited by: §1.
  • Y. Nie, A. S. Zamzam, and A. Brandt (2021) Resampling and data augmentation for short-term PV output prediction based on an imbalanced sky images dataset using convolutional neural networks. Solar Energy 224 (May), pp. 341–354. Cited by: §1, §1, §1.
  • E. Ntavelis, J. Remund, and P. Schmid (2021) SkyCam: A Dataset of Sky Images and their Irradiance values. arXiv:2105.02922 [cs]. Cited by: §1.
  • Q. Paletta, G. Arbod, and J. Lasenby (2021) Benchmarking of deep learning irradiance forecasting models from sky images – An in-depth analysis. Solar Energy 224, pp. 855–867. ISSN 0038-092X. Cited by: §1, §4.1.2.
  • Q. Paletta, G. Arbod, and J. Lasenby (2022a) Cloud flow centring in sky and satellite images for deep solar forecasting. In WCPEC-8, pp. 5. Cited by: §3.2.
  • Q. Paletta, A. Hu, G. Arbod, P. Blanc, and J. Lasenby (2022b) SPIN: Simplifying Polar Invariance for Neural networks Application to vision-based irradiance forecasting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 5182–5191. Cited by: §1, §3.2.
  • Q. Paletta, A. Hu, G. Arbod, and J. Lasenby (2022) ECLIPSE: Envisioning CLoud Induced Perturbations in Solar Energy. Applied Energy 326, pp. 119924. ISSN 0306-2619. Cited by: §1.
  • Q. Paletta and J. Lasenby (2020a) A temporally consistent image-based sun tracking algorithm for solar energy forecasting applications. In NeurIPS 2020 Workshop on Tackling Climate Change with Machine Learning, pp. 10. Cited by: §3.2.
  • Q. Paletta and J. Lasenby (2020b) Convolutional Neural Networks Applied to Sky Images for Short-Term Solar Irradiance Forecasting. In EU PVSEC, pp. 1834–1837. ISBN 3-936338-73-6. Cited by: §1.
  • S. J. Pan and Q. Yang (2010) A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering 22 (10), pp. 1345–1359. ISSN 1558-2191. Cited by: §2.2.
  • H. T. C. Pedro, C. F. M. Coimbra, and P. Lauret (2019a) Adaptive image features for intra-hour solar forecasts. Journal of Renewable and Sustainable Energy 11 (3), pp. 036101. Cited by: §1.
  • H. T. C. Pedro, D. P. Larson, and C. F. M. Coimbra (2019b) A comprehensive dataset for the accelerated development and benchmarking of solar forecasting methods. Journal of Renewable and Sustainable Energy 11 (3), pp. 036102. Cited by: §1.
  • Z. Peng, D. Yu, D. Huang, J. Heiser, S. Yoo, and P. Kalb (2015) 3D cloud detection and tracking system for solar forecast using multiple sky imagers. Solar Energy 118, pp. 496–519. ISSN 0038092X, ISBN 9781450324694. Cited by: §1.
  • D. Pothineni, M. R. Oswald, J. Poland, and M. Pollefeys (2019) KloudNet: Deep Learning for Sky Image Analysis and Irradiance Forecasting. In German Conference on Pattern Recognition, T. Brix, A. Bruhn, and M. Fritz (Eds.), Vol. 1, pp. 535–551. ISBN 9781498711425. Cited by: §2.1, §2.2.
  • S. Quesada-Ruiz, Y. Chu, J. Tovar-Pescador, H. T. C. Pedro, and C. F. M. Coimbra (2014) Cloud-tracking methodology for intra-hour DNI forecasting. Solar Energy 102, pp. 267–275. ISSN 0038-092X. Cited by: §1.
  • G. Reikard (2009) Predicting solar radiation at high resolutions: a comparison of time series forecasts. Solar Energy 83 (3), pp. 342–349. ISSN 0038-092X. Cited by: §1.
  • X. Shi, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28. Cited by: §4.1.2.
  • K. Simonyan and A. Zisserman (2015) Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 [cs]. Cited by: §2.2.
  • Y. Sun, G. Szűcs, and A. R. Brandt (2018) Solar PV output prediction from video streams using convolutional neural networks. Energy & Environmental Science 11 (7), pp. 1811–1818. Cited by: §1.
  • Y. Sun, V. Venugopal, and A. R. Brandt (2019) Short-term solar power forecast with deep learning: Exploring optimal input and output configuration. Solar Energy 188, pp. 730–741. Cited by: §1, §3.2, §3.3, §4.1.1, §4.1.1.
  • G. Terrén-Serrano, A. Bashir, T. Estrada, and M. Martínez-Ramón (2021) Girasol, a sky imaging and global solar irradiance dataset. Data in Brief 35, pp. 106914. ISSN 2352-3409. Cited by: §1.
  • V. Venugopal, Y. Sun, and A. R. Brandt (2019) Short-term solar PV forecasting using computer vision: The search for optimal CNN architectures for incorporating sky images and PV generation history. Journal of Renewable and Sustainable Energy 11 (6), pp. 066102. ISSN 1941-7012. Cited by: §1.
  • H. Wen, Y. Du, X. Chen, E. Lim, H. Wen, L. Jiang, and W. Xiang (2021) Deep Learning Based Multistep Solar Forecasting for PV Ramp-Rate Control Using Sky Images. IEEE Transactions on Industrial Informatics 17 (2), pp. 1397–1406. ISSN 1941-0050. Cited by: §2.2.
  • D. Yang, W. Wang, C. A. Gueymard, T. Hong, J. Kleissl, J. Huang, M. J. Perez, R. Perez, J. M. Bright, X. Xia, D. van der Meer, and I. M. Peters (2022) A review of solar forecasting, its dependence on atmospheric sciences and implications for grid integration: Towards carbon neutrality. Renewable and Sustainable Energy Reviews 161, pp. 112348. ISSN 1364-0321. Cited by: §1.
  • J. Zhang, R. Verschae, S. Nobuhara, and J. F. Lalonde (2018) Deep photovoltaic nowcasting. Solar Energy 176 (September), pp. 267–276. Cited by: §1.
  • F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He (2021) A Comprehensive Survey on Transfer Learning. Proceedings of the IEEE 109 (1), pp. 43–76. ISSN 1558-2256. Cited by: §2.2, §2.2.