Sub-Seasonal Climate Forecasting via Machine Learning: Challenges, Analysis, and Advances

Sub-seasonal climate forecasting (SSF) focuses on predicting key climate variables such as temperature and precipitation on 2-week to 2-month time scales. Skillful SSF would have immense societal value in areas such as agricultural productivity, water resource management, transportation and aviation systems, and emergency planning for extreme weather events. However, SSF is considered more challenging than either weather prediction or even seasonal prediction. In this paper, we carefully study a variety of machine learning (ML) approaches for SSF over the US mainland. While atmosphere-land-ocean couplings and the limited amount of good quality data make it hard to apply black-box ML naively, we show that with carefully constructed feature representations, even linear regression models, e.g., Lasso, can be made to perform well. Among a broad suite of 10 ML approaches considered, gradient boosting performs the best, and deep learning (DL) methods show some promise with careful architecture choices. Overall, ML methods are able to outperform the climatological baseline, i.e., predictions based on the 30-year average at a given location and time. Further, based on studying feature importance, ocean (especially indices based on climatic oscillations such as El Niño) and land (soil moisture) covariates are found to be predictive, whereas atmospheric covariates are not considered helpful.


1 Introduction

Over the past few decades, major advances have been made in weather forecasts on time scales of days to about a week Lorenc (1986); Simmons and Hollingsworth (2002b); National Research Council (2010); National Academies of Sciences (2016). Similarly, major advances have been made in seasonal forecasts on time scales of 2-9 months Barnston et al. (2012). However, making high quality forecasts of key climate variables such as temperature and precipitation on sub-seasonal time scales, defined here as the time range between 2-8 weeks, has long been a gap in operational forecasting National Academies of Sciences (2016). Skillful climate forecasts at sub-seasonal time scales would be of immense societal value, with impact in a wide variety of domains including agricultural productivity, hydrology and water resource management, and emergency planning for extreme weather Pomeroy et al. (2002); Klemm and McPherson (2017). The importance of sub-seasonal climate forecasting (SSF) has been discussed in great detail in two recent high profile reports from the National Academy of Sciences (NAS) National Research Council (2010); National Academies of Sciences (2016). Despite the scientific, societal, and financial importance of SSF, progress on the problem has been limited Braman et al. (2013); de Perez and Mason (2014), since it has attracted less attention than weather and seasonal climate prediction. Moreover, SSF is more difficult than weather or seasonal forecasting due to limited predictive information from land and ocean, and virtually no predictive information from the atmosphere, which forms the basis of numerical weather prediction (NWP) The National Oceanic and Atmospheric Administration (2018); Simmons and Hollingsworth (2002a) (Figure 1(a)).

There exists great potential to advance sub-seasonal prediction using machine learning techniques, which have revolutionized statistical prediction in many other fields. Due in large part to this promise, a real-time forecasting competition, the Sub-Seasonal Climate Forecast Rodeo, was recently sponsored by the Bureau of Reclamation in partnership with NOAA, USGS, and the U.S. Army Corps of Engineers Hwang et al. (2019). However, such a forecasting problem, regardless of its exact application, is non-trivial due to the high dimensionality and strong spatial correlation of climate data. Moreover, sub-seasonal forecasting does not lie in the big-data regime: about 40 years of reliable data exist for all climate variables, with each day corresponding to one data point, which totals fewer than 20,000 data points. Furthermore, different seasons have different predictive relations, and many climate variables have strong temporal correlations on daily time scales, further reducing the effective data size. It is therefore worth investigating the capability of machine learning (ML), especially black-box deep learning (DL) models, on SSF. To the best of our knowledge, no DL model has yet been properly applied to the specific application of SSF.

(a) Sources of Predictability
(b) MIC
(c) Results of FNN and CNN
Figure 1: (a) Sources of predictability at different forecast time scales. The atmosphere is most predictive at weather time scales, whereas for SSF, land and ocean are considered important sources of predictability The National Oceanic and Atmospheric Administration (2018). (b) Maximum information coefficient (MIC) Reshef et al. (2011) between the temperature of week 3 & 4 and week -2 & -1. Small MICs at a majority of locations indicate little information shared between the most recent dates and the forecasting target. (c) Predictive skills (cosine similarity) of fully connected neural networks (FNN) and convolutional neural networks (CNN) for temperature prediction over 2017-2018. Positive values closer to 1 (green) indicate better predictive skills.

In this paper, we perform a comprehensive empirical study on ML approaches for SSF and discuss the challenges and advancements. Our main contributions in this paper are as follows:


  • We illustrate the difficulty of SSF due to the complex physical couplings and the unique nature of climate data mentioned above.

  • We show that suitable ML models, e.g., XGBoost, to some extent capture predictability at sub-seasonal time scales from climate data, and consistently outperform existing approaches in climate science, such as climatology and the damped persistence model.

  • We demonstrate that even though DL models are not the obvious winner, they still show promising results with demonstrated improvements from careful architectural choices. With further improvements, DL models present a great potential topic for future research.

  • We find that ML models tend to select climate variables from the sources believed to be most useful in SSF: ocean (especially indices based on climatic oscillations such as El Niño) and land (soil moisture) covariates are the most predictive, whereas atmospheric covariates are not considered helpful.

Organization of the paper. We start with a review of related work in section 2. Section 3 provides a formal definition of sub-seasonal climate forecasting. Next, we briefly discuss the ML approaches we investigate (section 4), followed by details on data and experimental setup (section 5). Subsequently, section 6 presents experimental results, comparing the predictive skills of 10 ML models, including several DL models. Finally, we conclude in section 7.

2 Related Work

Although statistical models were used for weather prediction before the 1970s Frederik Nebeker (1995), since the 1980s weather forecasting has been carried out mainly using physics-based dynamical system models Barnston et al. (2012). More recently, there has been a surge of applications of ML approaches to both short-term weather forecasting Cofıno et al. (2002); Grover et al. (2015); Radhika and Shashi (2009) and longer-term climate prediction Badr et al. (2014); Cohen et al. (2019). However, little attention has been paid to forecasting at sub-seasonal time scales. Owing to the lack of skillful forecasts Robertson et al. (2015), the sub-seasonal time scale has been considered a “predictability desert” Vitart et al. (2012). Given how dramatically ML has advanced statistical prediction in many other fields, ML techniques hold great potential to advance sub-seasonal prediction. For instance, interest and advances have been seen in developing high-dimensional sparse models and variants suitable for spatial-temporal climate data Goncalves et al. (2014, 2016, 2017); DelSole and Banerjee (2017); He et al. (2019). Such models have been successfully applied to certain problems, e.g., predicting land temperature using oceanic climate data DelSole and Banerjee (2017); He et al. (2019). Recently, promising progress Hwang et al. (2019); He et al. (2019) has been seen in applying ML algorithms to SSF.

Since SSF can be formulated as a sequential modeling problem Sutskever et al. (2014); Venugopalan et al. (2015), bringing in the core strengths of DL-based sequential modeling, a thriving research area, has great potential to transform climate forecasting Ham et al. (2019); Reichstein et al. (2019); Schneider et al. (2017). In the past decade, recurrent neural networks (RNN) Funahashi and Nakamura (1993) and long short-term memory (LSTM) models Gers et al. (2000) have been two of the most popular sequential models, successfully applied to language modeling and other seq-to-seq tasks Sundermeyer et al. (2012). Starting from Sutskever et al. (2014); Srivastava et al. (2015), the encoder-decoder structure with RNNs or LSTMs has become one of the most competitive approaches for sequence transduction. Variants of this model that incorporate convolution Xingjian et al. (2015); Shi et al. (2017) or attention mechanisms Bahdanau et al. (2015) have achieved remarkable breakthroughs in audio synthesis, word-level language modeling, and machine translation Vaswani et al. (2017).

3 Sub-seasonal Climate Forecasting

Problem statement. In this paper, we focus on building temperature forecasting models at a forecast horizon of 15-28 days ahead, i.e., predicting the average daily temperature of week 3 & 4. The geographic region of interest is the US mainland (latitudes 25N-49N and longitudes 133W-76W) at a 2° by 2° resolution (197 grid points). For covariates, we consider climate variables, such as sea surface temperature, soil moisture, and geopotential height, that indicate the status of the three main components: land, ocean, and atmosphere. Table 1 provides a detailed description.

Spatial-temporal variables:

  • tmp2m: daily average temperature at 2 meters (°C); US mainland; CPC Global Daily Temperature Fan and Van den Dool (2008)
  • sm: monthly soil moisture (mm); US mainland; CPC Soil Moisture Huang et al. (1996); Van den Dool et al. (2003); Fan and van den Dool (2004)
  • sst: daily sea surface temperature (°C); North Pacific & Atlantic Ocean; Optimum Interpolation SST (OISST) Reynolds et al. (2007)
  • rhum: daily relative humidity near the surface (sigma level 0.995); US mainland and North Pacific & Atlantic Ocean; NCEP/NCAR Reanalysis Kalnay et al. (1996)
  • slp: daily pressure at sea level (Pa); same coverage and source as rhum
  • hgt10 & hgt500: daily geopotential height at 10 mb and 500 mb (m); same coverage and source as rhum

Temporal variables:

  • MEI: bimonthly multivariate ENSO index; NOAA ESRL MEI.v2 Zimmerman et al. (2016)
  • Niño 1+2, 3, 3.4, 4: weekly Oceanic Niño Index (ONI); NOAA National Weather Service, CPC Reynolds et al. (2007)
  • NAO: daily North Atlantic Oscillation index; NOAA National Weather Service, CPC Barnston and Livezey (1987); Van den Dool et al. (2000)
  • MJO phase & amplitude: Madden-Julian Oscillation index; Australian Government BoM Wheeler and Hendon (2004)

Table 1: Description of climate variables and their data sources.

Difficulty of the problem. To illustrate the challenge of SSF, we measure the dependence between the normalized average temperature of week -2 & -1 (0-14 days in the past) and week 3 & 4 (15-28 days in the “future") at each grid point using the maximum information coefficient (MIC) Reshef et al. (2011), a value between 0 and 1. A small MIC value close to 0 indicates weak dependence. More specifically, we apply the moving block bootstrap Kunsch (1989) to the time series of 2-week average temperature at each grid point from 1986 to 2018, with a block size of 365 days. The top panel in Figure 1(b) illustrates the average MIC over 100 bootstrap replicates across the US mainland, and the marginal distribution over all locations is shown at the bottom. The small MIC values, which indicate little predictive information shared between the most recent data and the forecasting target, demonstrate to some extent how difficult SSF is.
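
The moving block bootstrap above can be sketched as follows; `moving_block_bootstrap` is our own illustrative helper, and Pearson correlation stands in for MIC, which would require an external library (e.g., minepy):

```python
import numpy as np

def moving_block_bootstrap(series, block_size=365, rng=None):
    """Resample a time series by concatenating randomly chosen
    contiguous blocks, preserving dependence within each block."""
    rng = rng if rng is not None else np.random.default_rng()
    n = len(series)
    n_blocks = int(np.ceil(n / block_size))
    starts = rng.integers(0, n - block_size + 1, size=n_blocks)
    return np.concatenate([series[s:s + block_size] for s in starts])[:n]

# Dependence between the past (week -2 & -1) and target (week 3 & 4)
# 2-week averages on a resampled series; corrcoef is a stand-in for MIC.
rng = np.random.default_rng(0)
temps = rng.standard_normal(365 * 10)          # 10 years of daily anomalies
resampled = moving_block_bootstrap(temps, block_size=365, rng=rng)
past = np.convolve(resampled, np.ones(14) / 14, mode="valid")[:-28]
future = np.convolve(resampled, np.ones(14) / 14, mode="valid")[28:]
dependence = np.corrcoef(past, future)[0, 1]
```

With a block size of one year, each resampled series keeps the within-year dependence structure that a naive i.i.d. bootstrap would destroy.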

From an ML perspective, applying black-box DL approaches naively to SSF is unlikely to work due to the limited number of samples and the high-dimensional, spatial-temporally correlated features. Figure 1(c) shows the performance of two vanilla DL models, fully connected neural networks (FNN) with ReLU activation and convolutional neural networks (CNN), in terms of the cosine similarity between the prediction and the ground truth at each location over 2017-2018. For most locations, the cosine similarities are either negative or close to zero. In addition, we evaluate 10 ML models with suitable hyper-parameter tuning using another metric called relative $R^2$ (see the formal definition in Appendix A), which compares the predictive skill of a model to the best constant prediction based on climatology, the 30-year average from historical training data. Most of the models do not even achieve a positive relative $R^2$ (details are presented in Appendix A), indicating that they perform no better than the long-term average. Such results are another indication of how hard accurate SSF is to achieve.

4 Methods

Notation. Let $t$ denote a date and $s$ denote a location. The target variable at time $t$ is denoted as $y_t \in \mathbb{R}^m$, where $m$ represents the number of target locations. More specifically, $y_t$ is the normalized average temperature over time $t+15$ to $t+28$, i.e., weeks 3 & 4 (details on normalization can be found in section 5). $x_{t,s}$ denotes the $d$-dimensional covariates designed for time $t$ and location $s$, also denoted as $x_t$ if the covariates are shared by all locations $s$.

ML (non-DL) Models. We compare the following ML (non-DL) models with DL models.


  • MultiLLR Hwang et al. (2019). MultiLLR introduces a multitask feature selection algorithm to remove irrelevant predictors and integrates the remaining predictors linearly. For a location $s$ and target date $t$, the coefficients are estimated by weighted least squares over training dates that fall within a temporal span around the target date's day of the year, with each such date receiving a corresponding weight.

  • AutoKNN Hwang et al. (2019). An auto-regression model with weighted, temporally local samples, where the auto-regression lags are selected via a multitask k-nearest-neighbor criterion. It takes only historical measurements of the target variable as input, and the similarity between two dates is measured by the mean cosine similarity of the historical anomalies preceding the candidate dates.

  • Multitask Lasso Tibshirani (1996); Jalali et al. (2013). It assumes $y_t = W^\top x_t + \epsilon_t$, where $\epsilon_t$ is a Gaussian noise vector and $W \in \mathbb{R}^{d \times m}$ is the coefficient matrix for all locations. With $n$ samples, $W$ is estimated by $\hat{W} = \operatorname{argmin}_W \frac{1}{2n}\sum_{t=1}^{n}\|y_t - W^\top x_t\|_2^2 + \lambda \mathcal{R}(W)$, where $\lambda$ is a penalty parameter and the corresponding penalty term is $\mathcal{R}(W) = \sum_{j=1}^{d}\|W_{j\cdot}\|_2$, which encourages rows of $W$ to be zero jointly across locations.

  • Gradient Boosted Trees (XGBoost) Friedman (2001); Chen and Guestrin (2016). A functional gradient boosting algorithm using regression trees as its weak learners. The algorithm starts with one weak learner and iteratively adds new weak learners to approximate functional gradients. The final ensemble is a weighted sum of all weak learners. It is implemented using the Python package XGBoost.

  • SOTA Climate Baseline. We consider two baselines from the climate science perspective, both of which are least squares (LS) linear regression models Weisberg (2005). The first model uses climate indices as predictors, such as the NAO index and Niño indices, which are used to monitor ocean conditions. The predictor of the second model is the most recent anomaly of the target variable, i.e., the temperature anomaly of week -2 & -1; this model, known as damped persistence Van den Dool et al. (2007) in climate science, is essentially a first-order autoregressive model.
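
As a concrete illustration of the damped persistence baseline, the sketch below fits the first-order autoregressive coefficient between past and future anomalies with least squares through the origin; the function names and synthetic data are ours:

```python
import numpy as np

def fit_damped_persistence(past_anom, future_anom):
    """Least-squares slope through the origin: future ≈ rho * past."""
    return np.dot(past_anom, future_anom) / np.dot(past_anom, past_anom)

def predict_damped_persistence(rho, latest_anom):
    """Damp the most recent anomaly toward zero by the factor rho."""
    return rho * latest_anom

# Synthetic check: if future anomalies are exactly half the past ones,
# the fitted damping coefficient recovers 0.5.
past = np.array([1.0, -2.0, 0.5, 3.0])
future = 0.5 * past
rho = fit_damped_persistence(past, future)
```

On real anomalies the fitted rho is below 1, so forecasts relax toward the climatological mean as the horizon grows, which is exactly the "damping" the name refers to.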

(a) Encoder (LSTM)-Decoder (FNN)
(b) CNN-LSTM
Figure 2: Architectures of the designed DL models. (a) Encoder (LSTM)-Decoder (FNN) includes a few LSTM layers as the Encoder, and two fully connected layers as the Decoder. (b) CNN-LSTM consists of a few convolutional layers followed by an LSTM.

DL Models. As shown in Figure 1(c), it is hard for vanilla deep learning models like FNN and CNN to achieve high prediction accuracy. Therefore, we design two variants of DL models to further improve the performance, namely Encoder (LSTM)-Decoder (FNN) and CNN-LSTM.


  • Encoder (LSTM)-Decoder (FNN). Inspired by the encoder-decoder structure widely used in sequential modeling Sutskever et al. (2014), we design the Encoder (LSTM)-Decoder (FNN) model, whose architecture is shown in Figure 2(a). The input of the model consists of features extracted spatially from covariates using unsupervised methods such as Principal Component Analysis (PCA). The temporal component of the covariates is handled by feeding the features of each historical date into an LSTM Encoder recurrently. Then, the outputs of all dates from the LSTM are sent jointly to a two-layer FNN with ReLU activations. The output of the FNN Decoder is the predicted average temperature of week 3 & 4 over all target locations.

  • CNN-LSTM. The proposed CNN-LSTM model directly learns representations from the spatial-temporal data using CNN components LeCun et al. (1998). As shown in Figure 2(b), the CNN extracts features for each climate variable at every historical date separately. Then, the extracted features from the same date are collected and fed into an LSTM recurrently. The temperature prediction for all target locations is made by an FNN layer taking the output of the LSTM's last layer from the latest input.
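
A minimal PyTorch sketch of the Encoder (LSTM)-Decoder (FNN) idea, where the FNN decoder consumes every step of the LSTM output; all hyper-parameters here (hidden sizes, layer counts) are illustrative guesses, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """LSTM encoder over the covariate sequence; a two-layer FNN
    decoder reads the outputs of ALL time steps (not just the last)."""

    def __init__(self, n_features, hidden_size, seq_len, n_locations):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.decoder = nn.Sequential(
            nn.Linear(hidden_size * seq_len, 128),
            nn.ReLU(),
            nn.Linear(128, n_locations),
        )

    def forward(self, x):                 # x: (batch, seq_len, n_features)
        out, _ = self.encoder(x)          # out: (batch, seq_len, hidden)
        return self.decoder(out.flatten(1))

# 18-step sequence (target date + 17 earlier dates), 197 grid points
model = EncoderDecoder(n_features=10, hidden_size=32, seq_len=18, n_locations=197)
pred = model(torch.randn(4, 18, 10))
```

Flattening all encoder steps into the decoder input is the architectural change discussed in section 6.3, as opposed to feeding only the final LSTM state.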

5 Data and Experimental Setup

Data description. Climate agencies across the world maintain multiple datasets in different formats and resolutions. The climate variables (Table 1) have been collected from diverse data sources and converted into a consistent format. Temporal variables, e.g., Niño indices, are interpolated to a daily resolution, and spatial-temporal variables are interpolated to a spatial resolution of 2° by 2°.

Preprocessing. For spatial-temporal variables, we first extract the top 10 principal components (PCs) as features based on PC loadings from 1986 to 2016 (for details, refer to Appendix B). Next, we de-seasonalize the data by z-scoring each location and each date with the mean and standard deviation of the corresponding day of the year over 1986-2016. Note that both training and test sets are z-scored using the means and standard deviations of the same 30-year historical period. Temporal variables, e.g., Niño indices, are used directly without normalization.
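
The day-of-year z-scoring can be sketched as below; `deseasonalize` and its argument names are our own:

```python
import numpy as np

def deseasonalize(values, doy, hist_values, hist_doy):
    """z-score each observation with the mean/std of the same day of
    the year in the historical record (e.g., 1986-2016)."""
    out = np.empty(len(values), dtype=float)
    for d in np.unique(doy):
        hist = hist_values[hist_doy == d]          # same calendar day, all years
        out[doy == d] = (values[doy == d] - hist.mean()) / hist.std()
    return out

# Toy check: two calendar days, each with historical mean d and std 1.
hist_doy = np.array([1, 1, 2, 2])
hist_values = np.array([0.0, 2.0, 1.0, 3.0])
z = deseasonalize(np.array([2.0, 3.0]), np.array([1, 2]), hist_values, hist_doy)
```

Using only historical statistics for both training and test sets avoids leaking information from the evaluation period into the normalization.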

Feature set construction. We combine the PCs of the spatial-temporal covariates with the temporal covariates into a sequential feature set, which contains not only the covariates of the target date, but also the covariates at several fixed lags preceding the target date, as well as the covariates at the same day of the year, and at historical dates both before and after it, in the past 2 years (see Appendix B for a detailed example).
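
A sketch of assembling such a sequential feature set; the particular lags below are placeholders, not the paper's exact choices:

```python
import numpy as np

def build_sequence(features, t, lags):
    """Stack covariate vectors at target-date index t and at earlier
    indices; lag 0 denotes the target date itself."""
    return np.stack([features[t - lag] for lag in lags])

features = np.arange(40.0).reshape(20, 2)   # 20 dates, 2 covariates each
seq = build_sequence(features, t=19, lags=[0, 7, 14])
```

Each row of the result is one step of the sequence fed to a sequential model such as an LSTM.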

Evaluation pipeline. Predictive models are created independently for each month in 2017 and 2018. To mimic a live system, we generate 105 test dates during 2017-2018, one per week, and group them into 24 test sets by their month of the year. Given a test set, our evaluation pipeline consists of two parts: (1) “5-fold” training-validation pairs for hyper-parameter tuning, based on a “sliding-window” strategy designed for time-series data. Each validation set uses the data from the same month of the year as the test set, and we create 5 such sets from dates in the past 5 years. Their corresponding training sets contain the 10 years of data before each validation set; (2) the training-test pair, where the training set, which includes 30 years of past data, ends 28 days before the first date in the test set. We share more explanations, including a pictorial example, in Appendix B.
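
The sliding-window validation scheme above can be sketched as follows; `validation_splits` is a hypothetical helper returning (training-start, validation-start) date pairs:

```python
from datetime import date

def validation_splits(test_year, test_month, n_folds=5, train_years=10):
    """For each of the past n_folds years, pair a validation window in
    the same month of the year with the preceding train_years of data."""
    splits = []
    for k in range(1, n_folds + 1):
        val_start = date(test_year - k, test_month, 1)
        train_start = date(val_start.year - train_years, test_month, 1)
        splits.append((train_start, val_start))
    return splits

splits = validation_splits(2017, 6)   # 5 folds for a June 2017 test set
```

Matching the validation month to the test month respects the assumption that different seasons have different predictive relations.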

Evaluation metrics. Forecasts are evaluated by cosine similarity, a widely used metric in weather forecasting, between $\hat{y}$, a vector of predicted values, and $y$, the corresponding ground truth. It is computed as $\cos(\hat{y}, y) = \langle \hat{y}, y \rangle / (\|\hat{y}\|_2 \|y\|_2)$, where $\langle \cdot, \cdot \rangle$ denotes the inner product between the two vectors. If $\hat{y}$ represents the predicted values for a period of time at one location, this becomes the temporal cosine similarity, which assesses the prediction skill at a specific location; if $\hat{y}$ contains the predicted values for all target locations at one date, it becomes the spatial cosine similarity, measuring the prediction skill at that date. For intuition, one can view spatial and temporal cosine similarity as spatial and temporal correlation, respectively, measured between two centered vectors.
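
In code the metric is one line; whether it is "spatial" or "temporal" depends only on how the vector is sliced out of the prediction grid:

```python
import numpy as np

def cosine_sim(pred, truth):
    """Cosine similarity <pred, truth> / (||pred|| * ||truth||)."""
    return float(np.dot(pred, truth) /
                 (np.linalg.norm(pred) * np.linalg.norm(truth)))

# preds[date, location]: spatial similarity fixes a date (a row),
# temporal similarity fixes a location (a column).
preds = np.array([[1.0, 2.0], [0.5, 1.5]])
truth = np.array([[1.0, 2.0], [-0.5, 1.0]])
spatial = cosine_sim(preds[0], truth[0])          # one date, all locations
temporal = cosine_sim(preds[:, 0], truth[:, 0])   # one location, all dates
```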

6 Experimental results

We compare the predictive skills of 10 ML models on SSF over the US mainland. In addition, we discuss a few aspects that impact the ML models the most, and the evolution of our DL models.

6.1 Results of all methods

Temporal results. Table 2 lists the mean, median, 0.25 quantile, and 0.75 quantile of the spatial cosine similarity of all methods, with the corresponding standard errors. Additional results based on relative $R^2$ can be found in Appendix C. XGBoost, Encoder (LSTM)-Decoder (FNN), and Lasso achieve higher predictive skill than the other methods, and outperform climatology and the two climate baseline models, i.e., LS with NAO & Niño and damped persistence. Overall, XGBoost achieves the highest predictive skill in terms of both the mean and the median, demonstrating its predictive power. Surprisingly, linear regression with a proper feature set also performs well. Even though the DL models are not the obvious winner, with careful architectural choices they still show promising results.

Model Mean(se) Median (se) 0.25 quantile (se) 0.75 quantile (se)
Temporally Global Dataset
XGBoost - one day 0.3044(0.03) 0.3447(0.05) 0.0252(0.05) 0.5905(0.04)
Lasso - one day 0.2499(0.04) 0.2554(0.06) -0.0224(0.05) 0.5604(0.06)
Encoder (LSTM)-Decoder (FNN) 0.2616 (0.04) 0.2995 (0.07) -0.0719 (0.06) 0.6310 (0.05)
FNN 0.0792(0.01) 0.0920(0.02) 0.0085(0.02) 0.1655(0.02)
CNN 0.1688(0.04) 0.2324(0.06) -0.0662(0.06) 0.4768(0.04)
CNN-LSTM 0.1743(0.04) 0.2867(0.06) -0.1225(0.07) 0.5148(0.04)
LS with NAO & Niño 0.2415(0.03) 0.3169(0.04) 0.0454(0.05) 0.4624(0.03)
Damped persistence 0.2009(0.04) 0.2310(0.06) -0.0884(0.06) 0.5335(0.05)
MultiLLR 0.0684 (0.03) 0.1046 (0.05) -0.1764 (0.06) 0.3156 (0.04)
AutoKNN 0.1457 (0.03) 0.1744 (0.05) -0.1018 (0.06) 0.4000 (0.04)
Temporally Local Dataset
XGBoost - one day 0.1965(0.04) 0.2345(0.05) -0.0636(0.06) 0.5178(0.05)
Lasso - one day 0.1631(0.04) 0.2087(0.06) -0.1178(0.05) 0.5059(0.05)
Encoder (LSTM)-Decoder (FNN) 0.1277 (0.04) 0.1272 (0.06) -0.1558 (0.06) 0.4971 (0.06)
Table 2: Comparison of spatial cosine similarity of tmp2m forecasting for test sets over 2017-2018. Models achieve better performance using the temporally global dataset than the temporally local dataset. XGBoost and Encoder (LSTM)-Decoder (FNN) have the best performance.

Spatial results. Figure 3 shows the temporal cosine similarity of all methods evaluated on the test sets described in section 5. Among all methods, XGBoost and the Encoder (LSTM)-Decoder (FNN) achieve the overall best performance in terms of the number of locations with positive temporal cosine similarity. Qualitatively, coastal and southern regions are in general easier to predict than inland regions (e.g., the Midwest). This phenomenon might be explained by the influence of the slow-moving components, i.e., the Pacific and Atlantic Oceans, which exhibit inertia or memory: anomalous conditions can take a relatively long time to decay. However, each model has regions where it performs well and regions where it performs poorly. For example, XGBoost and Lasso do poorly in Montana, Wyoming, and Idaho, while Encoder (LSTM)-Decoder (FNN) performs much better in those regions. These observations naturally suggest that an ensemble of multiple models is a promising future direction.

Comparison with the state-of-the-art methods. MultiLLR and AutoKNN are two state-of-the-art methods designed for SSF over the western US Hwang et al. (2019). Both methods have shown good forecasting performance on their original target region. However, over the inland (Midwest), Northeast, and South regions, the methods do not perform as well (Figure 3). To be fair, even though we use a similar set of climate variables to the original paper Hwang et al. (2019), our data preprocessing and feature set construction differ slightly. Such differences may lead to the relatively poor performance of these two methods, especially MultiLLR.

Figure 3: Temporal cosine similarity over the US mainland of the ML models discussed in section 4 for temperature prediction over 2017-2018. Large positive values (green) closer to 1 indicate better predictive skills. Overall, XGBoost and Encoder (LSTM)-Decoder (FNN) perform the best. Qualitatively, coastal and southern regions are easier to predict than inland regions (e.g., the Midwest).

6.2 Analysis and Exploration

We analyze and explore several important aspects that could influence the performance of ML models.

Temporally “local” vs. “global” dataset. Our training set consists of continuous dates over the past 30 years, which we refer to as the temporally “global” dataset. Another way to construct the training set is to consider only a temporal neighborhood of the test date. For instance, to forecast June 2017, the training set may contain only dates in June (from earlier years) and in months close to June, e.g., April, May, July, and August, over the past 30 years. Such a construction assumes that different seasons have different predictive relations, so that a model predicting summer temperature should not use data from winter. We call this the temporally “local” dataset. Table 2 compares the “global” and “local” datasets: a significant drop in cosine similarity is noticeable when using the “local” dataset for all of our best predictors, including XGBoost, Lasso, and Encoder (LSTM)-Decoder (FNN). We suspect this performance drop stems from the additional temporal constraint on the training set, which further reduces the number of effective samples.
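
Restricting training to a temporal neighborhood of the test month can be sketched as a simple month filter with year wrap-around; `local_months` is our own helper name:

```python
def local_months(test_month, radius=2):
    """Months within `radius` of the test month, wrapping the year."""
    return sorted({(test_month - 1 + d) % 12 + 1
                   for d in range(-radius, radius + 1)})

summer = local_months(6)    # June test set -> April through August
winter = local_months(1)    # January wraps into November and December
```

Keeping only these months is exactly the extra constraint that shrinks the effective sample size of the "local" dataset.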

Feature importance. At sub-seasonal time scales, climate scientists believe The National Oceanic and Atmospheric Administration (2018); Delsole and Tippett (2017) that land and ocean are the important sources of predictability, while the impact of the atmosphere is limited (Figure 1(a)). We study which covariates the ML models consider important, based on feature importance scores computed from 2 ML models, XGBoost and Lasso (Figure 4). For XGBoost, the importance score is the average information gain across all tree nodes where a feature/covariate splits, while for Lasso, we simply count the non-zero coefficients of each model. The reported feature importance score is the average over 24 models (one per month in 2017-2018). We provide additional ways of measuring feature importance, e.g., Shapley values Lipovetsky and Conklin (2001), in Appendix C. Among all covariates, soil moisture is the variable most consistently selected by both models. Another set of important covariates is the family of Niño indices; an LS model using those indices alone as predictors performs fairly well (Table 2). Sea surface temperatures (of the Pacific and Atlantic) are also commonly selected. These observations indicate that ML models pick up ocean-based covariates and some land-based covariates, and almost entirely ignore the atmosphere-related covariates, which aligns well with domain knowledge The National Oceanic and Atmospheric Administration (2018); Delsole and Tippett (2017).

The influence of feature sequence length. To suit the LSTM, we construct a sequential feature set that includes not only the target date but also 17 other dates preceding it. However, other ML models, e.g., XGBoost and Lasso, which are not designed to handle sequential data, experience a drastic performance drop when we include more information from the past. More precisely, when covariates from the full historical sequence are included, the performance of XGBoost drops by approximately 50% compared to the XGBoost model using covariates from the most recent date only. A possible explanation for this degradation is that both models weight covariates from different dates identically, without considering temporal information, so increasing the sequence length mostly introduces noise. In Appendix C, we provide a more detailed comparison of results obtained with various sequence lengths.

(a) XGBoost
(b) Lasso
Figure 4: Feature importance scores computed from (a) XGBoost and (b) Lasso. Darker color means a covariate is of higher importance. The first 8 rows contain the top 10 principal components (PCs) extracted from the 8 spatial-temporal covariates, respectively, and the last row includes all the temporal indices. The land component, e.g., soil moisture, and the ocean components, e.g., sst (Pacific and Atlantic) and some climate indices, are the most commonly selected covariates.

6.3 What happened with DL models?

As discussed in section 3, applying black-box DL models naively does not work well for SSF. The improvement (Table 2) as we evolve from FNN to CNN-LSTM, and finally to Encoder (LSTM)-Decoder (FNN), demonstrates the important role of network architecture, and leaves plenty of room for further advancement. Below we focus on feature representation and architecture design for sequence modeling. More discussion is included in Appendix C.

Feature representation: CNN vs. PCA. Since SSF can be considered a spatial-temporal prediction problem, a CNN LeCun et al. (1998) is a natural choice for handling the spatial aspect of each climate covariate by viewing it as a map, and can serve as a “supervised" way of learning feature representations. However, our results imply that models with CNNs have limited predictive skill in terms of both spatial and temporal cosine similarity. A CNN convolving with a small kernel mainly focuses on spatially localized regions, but the strong spatial correlation of climate variables limits the effectiveness of CNN kernels for feature extraction. Meanwhile, PCA, termed Empirical Orthogonal Functions (EOF) Lorenz (1956); Von Storch and Zwiers (2001) in climate science, is a commonly used “unsupervised" feature representation method, which focuses on low-rank modeling of the spatial covariance structure, revealing spatial connections. Our results (Table 2) show that PCA-based models have higher predictive skill than CNN-based models, suggesting that PCA is the better feature extraction technique for SSF.

Sequential modeling: Encoder-Decoder. With features extracted by PCA, we formulate SSF as a sequential modeling problem Sutskever et al. (2014), where the input is the covariate sequence described in section 5 and the output is the target variable. Due to its immense success in sequential modeling Srivastava et al. (2015); Venugopalan et al. (2015), the standard Encoder-Decoder, in which both the Encoder and the Decoder are LSTMs Hochreiter and Schmidhuber (1997), is the first architecture we investigate. Unfortunately, this model does not perform well and suffers from over-fitting, possibly caused by an overly complex architecture. To reduce model complexity, we replace the LSTM Decoder with an FNN Decoder that takes only the last step of the output sequence from the Encoder. This change leads to an immediate boost in predictive performance. However, the input to the FNN Decoder then mainly contains information encoded from the latest day in the input sequence and can embed only a limited amount of historical information, owing to the recurrent architecture of the LSTM. To further improve performance, we adjust the connection between Encoder and Decoder so that the FNN Decoder takes every step of the output sequence from the LSTM Encoder, making better use of historical information. This architecture achieves the best performance among all investigated Encoder-Decoder variants (see a detailed comparison in Appendix C).
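The data flow of this final variant can be sketched in a few lines: an LSTM encoder produces one hidden state per time step, and an FNN decoder consumes the full sequence of hidden states rather than only the last one. The NumPy sketch below uses untrained random weights and purely illustrative dimensions; it shows the wiring, not the paper's actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_encode(x_seq, Wx, Wh, b, hidden):
    """Single-layer LSTM encoder: return the hidden state at every time step."""
    h, c = np.zeros(hidden), np.zeros(hidden)
    states = []
    for x in x_seq:
        z = Wx @ x + Wh @ h + b                   # stacked gate pre-activations
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)
        h = o * np.tanh(c)
        states.append(h)
    return np.stack(states)                       # shape (T, hidden)

def fnn_decode(states, W1, b1, W2, b2):
    """FNN decoder over EVERY encoder step (flattened), not just the last one."""
    z = np.tanh(W1 @ states.ravel() + b1)
    return W2 @ z + b2                            # one value per target location

# Illustrative dimensions and untrained random weights (placeholders).
rng = np.random.default_rng(0)
T, d_in, hidden, d_hid, n_loc = 12, 20, 16, 32, 5
x_seq = rng.normal(size=(T, d_in))                # covariate sequence (e.g., PCA features)
Wx = rng.normal(scale=0.1, size=(4 * hidden, d_in))
Wh = rng.normal(scale=0.1, size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)
W1 = rng.normal(scale=0.1, size=(d_hid, T * hidden))
b1 = np.zeros(d_hid)
W2 = rng.normal(scale=0.1, size=(n_loc, d_hid))
b2 = np.zeros(n_loc)

states = lstm_encode(x_seq, Wx, Wh, b, hidden)
y_hat = fnn_decode(states, W1, b1, W2, b2)        # forecast at n_loc locations
```

The "last step" variant discussed above corresponds to feeding only `states[-1]` to the decoder instead of the flattened sequence.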

7 Conclusion

In this paper, we investigate the potential of ML techniques to advance sub-seasonal climate forecasting. SSF, the skillful forecasting of temperature 2-8 weeks ahead, is a challenging task due to the complex coupling among atmosphere, land, and ocean, and the unique nature of climate data. We conduct a comprehensive analysis of 10 different ML models, including several DL models. Empirical results show that the gradient boosting model XGBoost, the DL model Encoder (LSTM)-Decoder (FNN), and linear models such as Lasso consistently outperform state-of-the-art forecasts. XGBoost has the highest skill of all models, demonstrating its predictive power. ML models are capable of picking out climate variables from the important sources of SSF predictability identified by climate scientists. In addition, DL models, with demonstrated improvements from careful architectural choices, hold great potential for future research.

8 Broader Impact

Skillful (i.e., accurate) climate forecasts on sub-seasonal time scales would have immense societal value. For instance, sub-seasonal forecasts of temperature and precipitation could assist farmers in determining planting dates, irrigation needs, and expected market conditions, anticipating pests and disease, and assessing the need for insurance. Emergency and disaster-relief supplies can take weeks or months to pre-stage, so skillful forecasts of areas likely to experience extreme weather a few weeks in advance could save lives. More generally, skillful sub-seasonal forecasts would have beneficial impacts on agricultural productivity, hydrology and water resource management, transportation and aviation systems, and emergency planning for extreme weather such as Atlantic hurricanes and midwestern tornadoes, among others Pomeroy et al. (2002); Klemm and McPherson (2017). Inaccurate spatial-temporal forecasts associated with extreme weather events and the associated disaster-relief planning can be expensive both in terms of loss of human lives and financial impact. On a steadier basis, water resource management and the planning of agricultural activities can be made considerably more precise and cost effective with skillful sub-seasonal climate forecasts.

Acknowledgement

The research was supported by NSF grants OAC-1934634, IIS-1908104, IIS-1563950, IIS-1447566, IIS-1447574, IIS-1422557, CCF-1451986.

References

  • [1] H. S. Badr, B. F. Zaitchik, and S. D. Guikema (2014) Application of statistical models to the prediction of seasonal rainfall anomalies over the sahel. Journal of Applied meteorology and climatology 53 (3), pp. 614–636. Cited by: §2.
  • [2] D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In Conference Track Proceedings of the 3rd International Conference on Learning Representations (ICLR), External Links: Link Cited by: §2.
  • [3] A. G. Barnston and R. E. Livezey (1987) Classification, seasonality and persistence of low-frequency atmospheric circulation patterns. Monthly weather review 115 (6), pp. 1083–1126. Cited by: Table 1.
  • [4] A. G. Barnston, M. K. Tippett, M. L. L’Heureux, S. Li, and D. G. DeWitt (2012) Skill of real-time seasonal enso model predictions during 2002–11: is our capability increasing?. Bulletin of the American Meteorological Society 93 (5), pp. 631–651. Cited by: §1, §2.
  • [5] L. M. Braman, M. K. van Aalst, S. J. Mason, P. Suarez, Y. Ait-Chellouche, and A. Tall (2013) Climate forecasts in disaster management: red cross flood operations in west africa, 2008. Disasters 37 (1), pp. 144–164. Cited by: §1.
  • [6] T. Chen and C. Guestrin (2016) Xgboost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 785–794. Cited by: 4th item.
  • [7] A. S. Cofiño, R. Cano, C. Sordo, and J. M. Gutiérrez (2002) Bayesian networks for probabilistic weather prediction. In Proceedings of the 15th European Conference on Artificial Intelligence (ECAI), pp. 695–699. Cited by: §2.
  • [8] J. Cohen, D. Coumou, J. Hwang, L. Mackey, P. Orenstein, S. Totz, and E. Tziperman (2019) S2S reboot: an argument for greater inclusion of machine learning in subseasonal to seasonal forecasts. Wiley Interdisciplinary Reviews: Climate Change 10 (2), pp. e00567. Cited by: §2.
  • [9] E. C. de Perez and S. J. Mason (2014) Climate information for humanitarian agencies: some basic principles. Earth Perspectives 1 (1), pp. 11. Cited by: §1.
  • [10] T. DelSole and A. Banerjee (2017) Statistical seasonal prediction based on regularized regression. Journal of Climate 30 (4), pp. 1345–1361. Cited by: §2.
  • [11] T. Delsole and M. Tippett (2017-10) Predictability in a changing climate. Climate Dynamics, pp. . External Links: Document Cited by: §6.2.
  • [12] Y. Fan and H. van den Dool (2004) Climate prediction center global monthly soil moisture data set at 0.5 resolution for 1948 to present. Journal of Geophysical Research: Atmospheres 109 (D10). Cited by: Table 1.
  • [13] Y. Fan and H. Van den Dool (2008) A global monthly land surface air temperature analysis for 1948–present. Journal of Geophysical Research: Atmospheres 113 (D1). Cited by: Table 1.
  • [14] F. Nebeker (1995) Calculating the weather: meteorology in the 20th century. Elsevier. Cited by: §2.
  • [15] J. H. Friedman (2001) Greedy function approximation: a gradient boosting machine. Annals of statistics, pp. 1189–1232. Cited by: 4th item.
  • [16] K. Funahashi and Y. Nakamura (1993) Approximation of dynamical systems by continuous time recurrent neural networks. Neural networks 6 (6), pp. 801–806. Cited by: §2.
  • [17] F. A. Gers, J. Schmidhuber, and F. Cummins (2000) Learning to forget: continual prediction with lstm. Neural Computation 12 (10), pp. 2451–2471. Cited by: §2.
  • [18] A. R. Goncalves, A. Banerjee, and F. J. Von Zuben (2017) Spatial projection of multiple climate variables using hierarchical multitask learning. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §2.
  • [19] A. R. Goncalves, P. Das, S. Chatterjee, V. Sivakumar, F. J. Von Zuben, and A. Banerjee (2014) Multi-task sparse structure learning. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pp. 451–460. Cited by: §2.
  • [20] A. R. Goncalves, F. J. Von Zuben, and A. Banerjee (2016) Multi-task sparse structure learning with gaussian copula models. The Journal of Machine Learning Research 17 (1), pp. 1205–1234. Cited by: §2.
  • [21] A. Grover, A. Kapoor, and E. Horvitz (2015) A deep hybrid model for weather forecasting. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 379–386. Cited by: §2.
  • [22] Y. Ham, J. Kim, and J. Luo (2019) Deep learning for multi-year enso forecasts. Nature 573 (7775), pp. 568–572. Cited by: §2.
  • [23] S. He, X. Li, V. Sivakumar, and A. Banerjee (2019) Interpretable predictive modeling for climate variables with weighted lasso. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 1385–1392. Cited by: §2.
  • [24] S. Hochreiter and J. Schmidhuber (1997-11) Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780. External Links: ISSN 0899-7667, Link, Document Cited by: §6.3.
  • [25] J. Huang, H. M. van den Dool, and K. P. Georgarakos (1996) Analysis of model-calculated soil moisture over the united states (1931–1993) and applications to long-range temperature forecasts. Journal of Climate 9 (6), pp. 1350–1362. Cited by: Table 1.
  • [26] J. Hwang, P. Orenstein, J. Cohen, K. Pfeiffer, and L. Mackey (2019) Improving subseasonal forecasting in the western us with machine learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2325–2335. Cited by: §1, §2, 1st item, 2nd item, §6.1.
  • [27] A. Jalali, P. Ravikumar, and S. Sanghavi (2013) A dirty model for multiple sparse regression. IEEE Transactions on Information Theory 59 (12), pp. 7947–7968. Cited by: 3rd item.
  • [28] E. Kalnay, M. Kanamitsu, R. Kistler, W. Collins, D. Deaven, L. Gandin, M. Iredell, S. Saha, G. White, J. Woollen, Y. Zhu, M. Chelliah, W. Ebisuzaki, W. Higgins, J. Janowiak, K. C. Mo, C. Ropelewski, J. Wang, A. Leetmaa, R. Reynolds, R. Jenne, and D. Joseph (1996) The ncep/ncar 40-year reanalysis project. Bulletin of the American Meteorological Society 77 (3), pp. 437–472. External Links: Document, Link, https://doi.org/10.1175/1520-0477(1996)077<0437:TNYRP>2.0.CO;2 Cited by: Table 1.
  • [29] T. Klemm and R. A. McPherson (2017) The development of seasonal climate forecasting for agricultural producers. Agricultural and forest meteorology 232, pp. 384–399. Cited by: §1, §8.
  • [30] H. R. Kunsch (1989) The jackknife and the bootstrap for general stationary observations. The annals of Statistics, pp. 1217–1241. Cited by: §3.
  • [31] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: 2nd item, §6.3.
  • [32] S. Lipovetsky and M. Conklin (2001) Analysis of regression in game theory approach. Applied Stochastic Models in Business and Industry 17 (4), pp. 319–330. Cited by: §C.2, §6.2.
  • [33] A. C. Lorenc (1986) Analysis methods for numerical weather prediction. Quarterly Journal of the Royal Meteorological Society 112 (474), pp. 1177–1194. Cited by: §1.
  • [34] E. N. Lorenz (1956) Empirical orthogonal functions and statistical weather prediction. Cited by: §6.3.
  • [35] National Academies of Sciences (2016) Next generation earth system prediction: strategies for subseasonal to seasonal forecasts. National Academies Press. Cited by: §1.
  • [36] National Research Council (2010) Assessment of intraseasonal to interannual climate prediction and predictability. National Academies Press. Cited by: §1.
  • [37] J. Pomeroy, D. Gray, N. Hedstrom, and J. Janowicz (2002) Prediction of seasonal snow accumulation in cold climate forecasts. Hydrological Processes 16 (18), pp. 3543–3558. Cited by: §1, §8.
  • [38] Y. Radhika and M. Shashi (2009) Atmospheric temperature prediction using support vector machines. International Journal of Computer Theory and Engineering 1 (1), pp. 55. Cited by: §2.
  • [39] M. Reichstein, G. Camps-Valls, B. Stevens, M. Jung, J. Denzler, N. Carvalhais, et al. (2019) Deep learning and process understanding for data-driven earth system science. Nature 566 (7743), pp. 195–204. Cited by: §2.
  • [40] D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti (2011) Detecting novel associations in large data sets. science 334 (6062), pp. 1518–1524. Cited by: Figure 1, §3.
  • [41] R. W. Reynolds, T. M. Smith, C. Liu, D. B. Chelton, K. S. Casey, and M. G. Schlax (2007) Daily high-resolution-blended analyses for sea surface temperature. Journal of Climate 20 (22), pp. 5473–5496. Cited by: Table 1.
  • [42] A. W. Robertson, A. Kumar, M. Peña, and F. Vitart (2015) Improving and promoting subseasonal to seasonal prediction. Bulletin of the American Meteorological Society 96 (3), pp. ES49–ES53. Cited by: §2.
  • [43] T. Schneider, S. Lan, A. Stuart, and J. Teixeira (2017) Earth system modeling 2.0: a blueprint for models that learn from observations and targeted high-resolution simulations. Geophysical Research Letters 44 (24), pp. 12–396. Cited by: §2.
  • [44] X. Shi, Z. Gao, L. Lausen, H. Wang, D. Yeung, W. Wong, and W. Woo (2017) Deep learning for precipitation nowcasting: a benchmark and a new model. In Advances in neural information processing systems, pp. 5617–5627. Cited by: §2.
  • [45] A. J. Simmons and A. Hollingsworth (2002) Some aspects of the improvement in skill of numerical weather prediction. Quarterly Journal of the Royal Meteorological Society 128 (580), pp. 647–677. Cited by: §1.
  • [46] A. J. Simmons and A. Hollingsworth (2002) Some aspects of the improvement in skill of numerical weather prediction. Quarterly Journal of the Royal Meteorological Society: A journal of the atmospheric sciences, applied meteorology and physical oceanography 128 (580), pp. 647–677. Cited by: §1.
  • [47] N. Srivastava, E. Mansimov, and R. Salakhudinov (2015) Unsupervised learning of video representations using lstms. In International conference on machine learning, pp. 843–852. Cited by: §2, §6.3.
  • [48] M. Sundermeyer, R. Schlüter, and H. Ney (2012) LSTM neural networks for language modeling. In Thirteenth annual conference of the international speech communication association, Cited by: §2.
  • [49] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §2, 1st item, §6.3.
  • [50] The National Oceanic and Atmospheric Administration (2018) Subseasonal and seasonal forecasting innovation: plans for the twenty-first century. Annotated outline of NOAA’s draft sub-seasonal and seasonal (S2S) forecasting report. Cited by: Figure 1, §1, §6.2.
  • [51] R. Tibshirani (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, pp. 267–288. Cited by: 3rd item.
  • [52] H. Van den Dool, S. Saha, and A. Johansson (2000) Empirical orthogonal teleconnections. Journal of Climate 13 (8), pp. 1421–1435. Cited by: Table 1.
  • [53] H. Van den Dool (2007) Empirical methods in short-term climate prediction. Oxford University Press. Cited by: 5th item.
  • [54] H. Van den Dool, J. Huang, and Y. Fan (2003) Performance and analysis of the constructed analogue method applied to us soil moisture over 1981–2001. Journal of Geophysical Research: Atmospheres 108 (D16). Cited by: Table 1.
  • [55] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2.
  • [56] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko (2015) Sequence to sequence - video to text. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4534–4542. Cited by: §2, §6.3.
  • [57] F. Vitart, A. W. Robertson, and D. L. Anderson (2012) Subseasonal to seasonal prediction project: bridging the gap between weather and climate. Bulletin of the World Meteorological Organization 61 (23), pp. . Cited by: §2.
  • [58] H. Von Storch and F. W. Zwiers (2001) Statistical analysis in climate research. Cambridge university press. Cited by: §6.3.
  • [59] L. Wasserman (2013) All of statistics: a concise course in statistical inference. Springer Science & Business Media. Cited by: §A.1.
  • [60] S. Weisberg (2005) Applied linear regression. Vol. 528, John Wiley & Sons. Cited by: 5th item.
  • [61] M. C. Wheeler and H. H. Hendon (2004) An all-season real-time multivariate mjo index: development of an index for monitoring and prediction. Monthly Weather Review 132 (8), pp. 1917–1932. External Links: Document, Link, https://doi.org/10.1175/1520-0493(2004)132<1917:AARMMI>2.0.CO;2 Cited by: Table 1.
  • [62] S. Xingjian, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo (2015) Convolutional lstm network: a machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, pp. 802–810. Cited by: §2.
  • [63] B. G. Zimmerman, D. J. Vimont, and P. J. Block (2016) Utilizing the state of enso as a means for season-ahead predictor selection. Water resources research 52 (5), pp. 3761–3774. Cited by: Table 1.

Appendix A Difficulty of the problem

a.1 Dependence between historical data and forecasting target

In section 3, the dependence between the most recent historical data (the normalized average temperature of weeks -2 & -1) and the forecasting target (the normalized average temperature of weeks 3 & 4) is measured by the maximum information coefficient (MIC). Here we show the results measured by the Pearson correlation coefficient [59] and Spearman's rank correlation coefficient [59] (Figure 5). The small values (0.2) of Pearson correlation and Spearman's rank correlation at a majority of locations verify that little information is shared between the most recent dates and the forecasting target, which once again demonstrates how difficult SSF is.
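For a single grid point, the two correlation measures can be computed as follows. The synthetic series below is only a stand-in illustrating the kind of weak dependence discussed above, not the actual temperature data:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation of two 1-D samples."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

def spearman(x, y):
    """Spearman's rank correlation: Pearson on the rank-transformed samples.
    Rank via double argsort (no tie handling; fine for continuous anomalies)."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(x), rank(y))

rng = np.random.default_rng(1)
past = rng.normal(size=500)                  # stand-in: week -2 & -1 anomaly at a grid point
target = 0.2 * past + rng.normal(size=500)   # weakly dependent week 3 & 4 anomaly
r, rho = pearson(past, target), spearman(past, target)
```

With such a weak linear relationship, both coefficients come out small, mirroring the pattern in Figure 5.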

Figure 5: Pearson correlation, Spearman's rank correlation, and MIC between 2m temperature of weeks -2 & -1 and weeks 3 & 4. The small values (0.2) of Pearson correlation and Spearman's rank correlation at a majority of locations verify the fact, illustrated in the main paper using MIC, that little information is shared between the most recent dates and the forecasting target.

a.2 Relative R²

In the main paper, we introduce cosine similarity, which is widely used in weather prediction evaluation, as an evaluation metric. Here we formally define the other evaluation metric, namely relative R², as

relative R² = 1 − ‖y − ŷ‖₂² / ‖y − ȳ‖₂²,    (1)

where ŷ denotes the vector of predicted values, y the corresponding ground truth, and ȳ the best constant predictor, i.e., the long-term average of the target variable at each date and each target location computed from the training set. We use relative R² to evaluate the relative predictive skill of a given prediction compared to this baseline. A model that achieves a positive relative R² is, at least, able to predict the sign of y accurately. The temporal and spatial relative R² over the US mainland for the ML models discussed in section 4 are shown in Table 3 and Figure 7, respectively.
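A minimal implementation of this metric, assuming equation (1) takes the standard form 1 − SSE(prediction) / SSE(climatology baseline); the equation itself was not legible in the source, so this sketch follows the verbal definition above:

```python
import numpy as np

def relative_r2(y_true, y_pred, y_clim):
    """Relative R^2: 1 - SSE(prediction) / SSE(climatology baseline)."""
    sse_model = np.sum((y_true - y_pred) ** 2)
    sse_clim = np.sum((y_true - y_clim) ** 2)
    return float(1.0 - sse_model / sse_clim)

y = np.array([1.0, -0.5, 2.0])          # toy ground-truth anomalies
clim = np.array([0.2, 0.1, 0.3])        # long-term averages (best constant predictor)
perfect = relative_r2(y, y, clim)       # perfect forecast -> 1.0
baseline = relative_r2(y, clim, clim)   # climatology itself -> 0.0
```

A forecast worse than climatology yields a negative value, which is why a positive relative R² indicates skill beyond the constant predictor.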

Appendix B Experimental setup

b.1 PCA preprocessing

As mentioned in section 5 of the main paper, one way to extract features is to apply PCA to spatial-temporal variables. Consider sst of the Pacific ocean as an example. Daily sst of the Pacific ocean is originally stored in a matrix, in which each element represents the sea surface temperature at a grid point of the Pacific ocean. The covariance matrix can be computed by flattening each daily matrix into a 1-D vector, viewing each element as a feature and each date as one observation. This covariance matrix captures the spatial connection among grid points of the Pacific ocean. Considering all dates from 1986 to 2016, we extract the top 10 principal components (PCs) as features, based on PC loadings computed from the corresponding covariance matrix.
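The procedure can be sketched with NumPy's SVD, which yields the same loadings as an eigendecomposition of the covariance matrix. The grid and date counts below are illustrative placeholders, not the actual data dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
n_days, n_grid = 1000, 400        # placeholder sizes for flattened daily sst maps
sst = rng.normal(size=(n_days, n_grid))

# Center each grid point over time; the top right singular vectors of the
# centered data matrix are the EOF loading patterns of the spatial covariance.
X = sst - sst.mean(axis=0)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
loadings = Vt[:10]                # (10, n_grid) spatial patterns (PC loadings)
pcs = X @ loadings.T              # (n_days, 10) daily PC features
```

Projecting each day's map onto the loadings gives the 10-dimensional feature vector used in place of the raw grid.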

(a) Sequential feature set for Mar 1, 2018
(b) Evaluation pipeline
Figure 6: (a) Sequential feature set: to construct the feature set for Mar. 1, 2018, we concatenate covariates from Mar. 1 in 2018, 2017, and 2016, their corresponding , , and days in the past, and , , and “future” days in 2017 and 2016. (b) Evaluation pipeline: to test SSF in Jan 2017, the training set covers the historical 30 years ending at Dec 4, 2016 (the last available date). The 5 validation sets include dates from each Jan between 2012 and 2016, with the corresponding training sets generated by applying a moving window of 10 years and a stride of 365 days on data starting from 2000.

b.2 Feature set construction

To better utilize historical information, we construct a sequential feature set that includes not only the covariates of the target date, but also the covariates of the , , and days previous to the target date, as well as the same day of the year in the past 2 years, together with the historical dates both before and after that day of the year. This selection of historical dates is mainly based on temporal correlation. Figure 6(a) provides a detailed example of constructing the feature set for Mar 1, 2018: we concatenate covariates from Mar. 1 in 2018, 2017, and 2016, their corresponding , , and days in the past, and , , and “future” days in 2017 and 2016.
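The index bookkeeping can be sketched as follows. The lag offsets in the source are not legible here, so the values used below (7, 14, and 28 days) are purely hypothetical placeholders, as is the helper name `build_sequence`:

```python
import numpy as np

def build_sequence(features, t, lags, years_back=2, year_len=365):
    """Gather feature rows for target index t: the target date, lagged dates,
    and the same day of year in past years with the same lags before and after.
    The lag offsets are hypothetical -- the paper's exact values are not shown."""
    idx = [t] + [t - L for L in lags]
    for y in range(1, years_back + 1):
        anchor = t - y * year_len                 # same day of year, y years back
        idx += [anchor] + [anchor - L for L in lags] + [anchor + L for L in lags]
    return features[np.array(idx)]

feats = np.arange(3000 * 4, dtype=float).reshape(3000, 4)   # 3000 days x 4 covariates
seq = build_sequence(feats, t=2900, lags=[7, 14, 28])       # hypothetical lags
```

With 3 lags and 2 look-back years this yields 4 + 2 × 7 = 18 rows per target date, which are then concatenated into the model input.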

b.3 Evaluation Pipeline

Predictive models are created independently for each month in 2017 and 2018. To mimic a live forecasting system, we generate 105 test dates during 2017-2018, one for each week, and group them into 24 test sets by their month of the year. Given a test set, our evaluation pipeline consists of two parts (Figure 6(b)):

  • “5-fold” training-validation pairs for hyper-parameter tuning, based on a “sliding-window” strategy designed for time-series data. Each validation set uses data from the same month of the year as the test set. For instance, if the test set is Jan 2017, the corresponding 5 validation sets are Jan 2012, Jan 2013, Jan 2014, Jan 2015, and Jan 2016. Each validation set corresponds to a training set containing 10 years of data and ending 28 days before the first date in the validation set. Specifically, if the validation set starts from Jan 1, 2016, the training set runs from Dec 4, 2005 to Dec 4, 2015. This construction is equivalent to applying a 10-year sliding window with a stride of 365 days on data from 2002.

  • The training-test pair, where the training set, comprising the past 30 years of data, ends 28 days before the first date in the test set. For example, to test SSF in Jan 2017, i.e., Jan 1, Jan 8, Jan 15, Jan 22, and Jan 29, the training set starts from Dec 4, 1986 and ends at Dec 4, 2016, which is 28 days before Jan 1 and the last date for which we have the ground truth of the target variable.
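The validation-window construction above can be sketched as a small date computation. The helper name `validation_pairs` is ours, but the windows reproduce the worked example in the text (validating on Jan 1, 2016 trains on Dec 4, 2005 through Dec 4, 2015):

```python
from datetime import date, timedelta

def validation_pairs(test_start, n_folds=5, train_years=10, gap_days=28):
    """Sliding-window train/validation pairs: fold k validates on the same month
    k years before the test month; each training window ends gap_days before the
    validation start and spans train_years years."""
    pairs = []
    for k in range(1, n_folds + 1):
        val_start = date(test_start.year - k, test_start.month, 1)
        train_end = val_start - timedelta(days=gap_days)
        train_start = date(train_end.year - train_years, train_end.month, train_end.day)
        pairs.append((train_start, train_end, val_start))
    return pairs

# Test month Jan 2017 -> validation months Jan 2016, ..., Jan 2012.
pairs = validation_pairs(date(2017, 1, 1))
```

The 28-day gap ensures no training label overlaps the weeks-3-&-4 target of any validation date, mimicking what would be available in a live forecasting system.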

Appendix C Additional Results

Model Mean(se) Median (se) 0.25 quantile (se) 0.75 quantile (se)
Temporally Global Set
XGBoost - one day 0.0760(0.03) 0.0974(0.03) -0.0449(0.03) 0.2434(0.03)
Lasso - one day 0.0552(0.02) 0.0321(0.02) -0.0309(0.01) 0.1295(0.02)
Encoder (LSTM)-Decoder (FNN) -0.0353 (0.05) 0.0596(0.05) -0.2409 (0.06) 0.2426 (0.05)
FNN -0.5777(0.29) -0.0183(0.15) -0.0794(0.13) 0.0213(0.13)
CNN -0.0564(0.03) 0.0284(0.02) -0.0266(0.02) 0.0570(0.02)
CNN-LSTM -0.1164(0.05) 0.0263(0.03) -0.0862(0.03) 0.0698(0.03)
LS with NAO & all nino - daily 0.0418(0.01) 0.0535(0.01) -0.0078(0.01) 0.0949(0.01)
Damped persistence 0.0266(0.01) 0.0414(0.02) -0.0542(0.02) 0.1354(0.02)
MultiLLR -0.0571 (0.02) 0.0034 (0.02) -0.1156 (0.03) 0.0797 (0.02)
AutoKNN 0.0181 (0.01) 0.0260 (0.02) -0.0531 (0.02) 0.1041 (0.01)
Temporally Local Set
XGBoost - one day -0.0337(0.03) 0.0396(0.03) -0.1310(0.04) 0.1873(0.03)
Lasso - one day -0.0028(0.02) 0.0327(0.02) -0.0613(0.02) 0.0996(0.02)
Encoder (LSTM)-Decoder (FNN) -0.2333 (0.06) -0.1116 (0.06) -0.4694 (0.09) 0.1808 (0.06)
Table 3: Comparison of relative R² of tmp2m forecasting for test sets over 2017-2018. A positive relative R² indicates a model predicting the sign of the target variable correctly. XGBoost achieves the highest relative R².
Figure 7: Temporal relative R² over the US mainland of the ML models discussed in section 4 for temperature prediction over 2017-2018. Large positive values (green), closer to 1, indicate better predictive skill.

c.1 Temporal and spatial results of relative R²

Table 3 lists the mean, the median, the 0.25 quantile, the 0.75 quantile, and the corresponding standard errors of relative R² for all models. A positive relative R² indicates that a model can at least predict the sign of the target variable correctly. Again, XGBoost achieves the highest predictive skill in terms of both the mean and the median, demonstrating its predictive power. Linear regression models such as Lasso, with a proper feature set, have good predictive performance. Both XGBoost and Lasso have larger positive relative R² in terms of the mean, and still outperform climatology and the two climate baseline models, i.e., LS with NAO & Niño and damped persistence. Even though Encoder (LSTM)-Decoder (FNN) has a slightly negative mean relative R², it has the second largest median and 0.75 quantile among all models, showing its potential for further improvement.

Figure 7 shows the spatial relative R² of all methods. XGBoost and Lasso achieve positive relative R² at most target locations. Encoder (LSTM)-Decoder (FNN) shows better predictive skill over the southern US than over other regions. MultiLLR and AutoKNN manage to obtain non-negative relative R² along the coastal area of the western US, but their predictive performance drops at the remaining locations. All other baseline methods struggle to reach positive relative R² at most target locations.

(a) XGBoost
(b) Lasso
Figure 8: Shapley values computed from (a) XGBoost and (b) Lasso. Darker color indicates higher importance. The first 8 rows contain the top 10 principal components (PCs) extracted from each of the 8 spatial-temporal covariates, and the last row includes all the temporal indices. Land components, e.g., soil moisture ( row from the top), and ocean components, e.g., sst (Pacific and Atlantic) and some climate indices, are the most commonly selected covariates.
Model Mean(se) Median (se) 0.25 quantile (se) 0.75 quantile (se)
XGBoost - one day 0.3044(0.03) 0.3447(0.05) 0.0252(0.05) 0.5905(0.04)
XGBoost - one day (w/o soil moisture) 0.2685(0.03) 0.2797(0.05) 0.0703(0.04) 0.5492(0.05)
XGBoost - one day (w/o nao & all nino) 0.2081(0.03) 0.1640(0.05) -0.0588(0.04) 0.5246(0.05)
Lasso - one day 0.2499(0.04) 0.2554(0.06) -0.0224(0.05) 0.5604(0.06)
Lasso - one day (w/o soil moisture) 0.2638(0.04) 0.2912(0.05) 0.0032(0.06) 0.5655(0.05)
Lasso - one day (w/o nao & all nino) 0.1956(0.04) 0.2573(0.07) -0.1657(0.06) 0.5533(0.05)
Encoder (LSTM)-Decoder (FNN) 0.2616 (0.04) 0.2995 (0.07) -0.0719 (0.06) 0.6310 (0.05)
Encoder (LSTM)-Decoder (FNN)(w/o soil moisture) 0.2157 (0.04) 0.2909 (0.07) -0.1106 (0.07) 0.5443 (0.07)
Encoder (LSTM)-Decoder (FNN)(w/o nao & all nino) 0.2236 (0.04) 0.2395 (0.06) -0.1527 (0.07) 0.5989 (0.06)
Table 4: Comparison of cosine similarity of tmp2m forecasting for test sets over 2017-2018 using different feature sets. Excluding soil moisture or climate indices (NAO & Niño) leads to a deterioration in predictive performance.
Model Mean(se) Median (se) 0.25 quantile (se) 0.75 quantile (se)
XGBoost - one day 0.0760(0.03) 0.0974(0.03) -0.0449(0.03) 0.2434(0.03)
XGBoost - one day (w/o soil moisture) 0.0370(0.03) 0.0322(0.03) -0.0564(0.03) 0.2225(0.03)
XGBoost - one day (w/o nao & all nino) -0.0161(0.03) -0.0079(0.04) -0.1618(0.03) 0.2426(0.04)
Lasso - one day 0.0552(0.02) 0.0321(0.02) -0.0309(0.01) 0.1295(0.02)
Lasso - one day (w/o soil moisture) -0.0161(0.03) -0.0079(0.04) -0.1618(0.03) 0.2426(0.04)
Lasso - one day (w/o nao & all nino) 0.0003(0.02) 0.0457(0.02) -0.1113(0.03) 0.1641(0.02)
Encoder (LSTM)-Decoder (FNN) -0.0353 (0.05) 0.0596(0.05) -0.2409 (0.06) 0.2426 (0.05)
Encoder (LSTM)-Decoder (FNN)(w/o soil moisture) -0.1083 (0.05) 0.0314 (0.05) -0.3022 (0.08) 0.2252 (0.05)
Encoder (LSTM)-Decoder (FNN)(w/o nao & all nino) -0.0802 (0.04) 0.0124 (0.05) -0.3032 (0.06) 0.2446 (0.05)
Table 5: Comparison of relative R² of tmp2m forecasting for test sets over 2017-2018. Excluding soil moisture or climate indices (NAO & Niño) leads to a smaller or even negative relative R², showing that it becomes harder for the model to predict the sign of the target variable correctly.

c.2 Analysis on feature importance

Shapley values [32]. A concept from game theory, the Shapley value is another way to evaluate the importance of a feature used in a model. To determine the Shapley value of a given feature, we compute the difference in prediction between a model trained with and without that feature. Since the effect of removing a feature also depends on which other features are present, we consider all possible subsets of the remaining features and compute the Shapley value as a weighted average of all such differences. Figure 8 shows the mean absolute Shapley value of each feature over 24 models (one per month in 2017-2018), computed from (a) XGBoost and (b) Lasso. What we observe based on Shapley values once again verifies the observations presented in the main paper: ML models pick up ocean-based covariates and some land-based covariates, and almost entirely ignore atmosphere-related covariates.
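The subset-enumeration definition described above can be implemented exactly when the number of features is small. The sketch below uses a simple least-squares model rather than XGBoost or Lasso, so it illustrates the computation, not the paper's actual setup:

```python
import numpy as np
from itertools import combinations
from math import factorial

def fit_predict(X, y, x0, subset):
    """Least-squares model on a feature subset; empty subset -> predict mean(y)."""
    if not subset:
        return float(y.mean())
    cols = list(subset)
    A = np.column_stack([X[:, cols], np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.concatenate([x0[cols], [1.0]]) @ coef)

def shapley(X, y, x0):
    """Exact Shapley values by enumerating all feature subsets (small n only)."""
    n = X.shape[1]
    phi = np.zeros(n)
    for j in range(n):
        rest = [f for f in range(n) if f != j]
        for r in range(len(rest) + 1):
            for S in combinations(rest, r):
                # Classic Shapley weight |S|! (n-|S|-1)! / n!
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[j] += w * (fit_predict(X, y, x0, S + (j,))
                               - fit_predict(X, y, x0, S))
    return phi

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)   # only feature 0 matters
phi = shapley(X, y, np.array([1.0, 0.0, 0.0]))
```

The informative feature receives a dominant Shapley value, and the values sum to the difference between the full-model prediction and the mean prediction (the efficiency property). In practice, approximation methods are used because the exact sum is exponential in the number of features.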

To emphasize the importance of land-based covariates, e.g., soil moisture, and ocean-based covariates, e.g., the NAO and Niño indices, we compare the prediction performance of (1) the model trained with all covariates, (2) the model trained without soil moisture, and (3) the model trained without the NAO and Niño indices (Table 4 and Table 5). Most models experience a performance deterioration when these “important” covariates are excluded.

Model Mean(se) Median (se) 0.25 quantile (se) 0.75 quantile (se)
XGBoost - all days 0.2080(0.03) 0.1582(0.05) -0.0466(0.05) 0.5383(0.05)
XGBoost - four days 0.2433(0.03) 0.2203(0.05) 0.0561(0.04) 0.5168(0.06)
XGBoost - one day 0.3044(0.03) 0.3447(0.05) 0.0252(0.05) 0.5905(0.04)
Lasso - all days 0.2160(0.04) 0.2258(0.07) -0.1381(0.06) 0.5384(0.06)
Lasso - four days 0.2247(0.04) 0.1952(0.07) 0.0572(0.06) 0.5700(0.06)
Lasso - one day 0.2499(0.04) 0.2554(0.06) -0.0224(0.05) 0.5604(0.06)
Table 6: Comparison of spatial cosine similarity for tmp2m forecasting over 2017-2018 using various lengths of feature sequences. Including longer historical sequences leads to a deterioration in the predictive performance of XGBoost and Lasso.
Model Mean(se) Median (se) 0.25 quantile (se) 0.75 quantile (se)
XGBoost - all days -0.0200(0.03) -0.0010(0.04) -0.1499(0.04) 0.2304(0.04)
XGBoost - four days 0.0242(0.03) 0.0193(0.03) -0.0786(0.03) 0.1882(0.04)
XGBoost - one day 0.0760(0.03) 0.0974(0.03) -0.0449(0.03) 0.2434(0.03)
Lasso - all days -0.0167(0.03) 0.0367(0.03) -0.0639(0.02) 0.1588(0.03)
Lasso - four days 0.0518(0.02) 0.0266(0.02) -0.0542(0.02) 0.1653(0.03)
Lasso - one day 0.0552(0.02) 0.0321(0.02) -0.0309(0.01) 0.1295(0.02)
Table 7: Comparison of relative R² (with training set mean) for tmp2m prediction for test sets over 2017-2018 using different lengths of feature sequences. Including longer historical sequences leads to a smaller or even negative relative R² for both XGBoost and Lasso.

c.3 The influence of feature sequence length

We compare prediction performance under 3 different settings, referred to as “one day”, “four days”, and “all days”. For feature set construction, “one day” includes covariates at the target date only, “four days” also covers the , , and days previous to the target date, and “all days” uses the exact feature sequence used for the LSTM-based models. Comparisons of predictive skill under each setting, measured by both cosine similarity and relative R², can be found in Table 6 and Table 7. Both XGBoost and Lasso enjoy a performance boost using “one day” features. For XGBoost in particular, the “one day” performance is approximately 50% better than using “all days”. A possible explanation for this degradation as the feature sequence grows is that both models weight covariates from different dates identically, without considering temporal order, so longer sequences mainly introduce more noise.

c.4 Discussion on deep learning models

Results of DL models. Table 8 and Table 9 compare the predictive skills of the 5 DL models discussed in section 6, measured by both cosine similarity and relative R². Significant improvements can be observed as we evolve from the standard Encoder (LSTM)-Decoder (LSTM), to Encoder (LSTM)-Decoder (FNN)-last step, where “last step” indicates that the FNN Decoder uses only the last step of the output sequence from the LSTM Encoder, and finally to Encoder (LSTM)-Decoder (FNN), in which the FNN Decoder uses every step of the output sequence from the LSTM Encoder.
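The “last step” versus “every step” distinction amounts to how much of the encoder's output sequence the FNN decoder is given. The toy numpy sketch below makes this concrete; the LSTM is replaced by a random hidden-state sequence, the one-layer decoder weights are random, and all dimensions (`T`, `d_h`, `n_loc`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

T, d_h, n_loc = 28, 16, 197           # sequence length, hidden size, target locations
H = rng.standard_normal((T, d_h))     # stand-in for the LSTM encoder's output sequence

# "last step": the FNN decoder sees only the final hidden state
x_last = H[-1]                        # shape (d_h,)

# full variant: the FNN decoder sees every step of the output sequence
x_all = H.ravel()                     # shape (T * d_h,)

# a one-layer FNN decoder mapping either input to all target locations
W_last = rng.standard_normal((n_loc, x_last.size)) * 0.01
W_all = rng.standard_normal((n_loc, x_all.size)) * 0.01
pred_last = W_last @ x_last           # forecast at all locations, (n_loc,)
pred_all = W_all @ x_all              # forecast at all locations, (n_loc,)
```

The “every step” decoder has access to the whole encoded history rather than a single summary vector, at the cost of a much wider input layer.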

One issue with Encoder (LSTM)-Decoder (FNN) is that the input features are shared by all target locations, which requires the model to identify the information useful for each location without any location-specific input.

Autoregressive (AR) component. The Encoder (LSTM)-Decoder (FNN) considers climate covariates only on a global scale, shared by all target locations. Nevertheless, SSF depends not only on global climate conditions but also on local weather variability. We therefore seek to improve the model by adding an autoregressive (AR) component that captures “local” information from historical data. We consider two variants of Encoder (LSTM)-Decoder (FNN). The first contains an AR component whose input is the historical temperature at each target location, denoted Encoder (LSTM)-Decoder (FNN)+AR. The second includes both historical temperature and historical temporal climate variables, i.e., climate indices, as input features, denoted Encoder (LSTM)-Decoder (FNN)+AR (CI). For both models, the final forecast at each location is a linear combination of the prediction from Encoder (LSTM)-Decoder (FNN) and the prediction from the AR component. Unexpectedly, as shown in Table 8 and Table 9, simply adding the AR component to Encoder (LSTM)-Decoder (FNN) does not improve performance. We believe, however, that there are better ways to incorporate local information, and such modifications are a promising direction for future investigation.
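The per-location combination of the two forecasts can be sketched as follows. This is a minimal numpy illustration: the function names and the assumption that the AR part is a fixed-coefficient linear model over recent temperatures are ours, and in the actual models the AR coefficients and combination weights would be learned jointly with the network.

```python
import numpy as np

def ar_forecast(hist_tmp, coef):
    """AR prediction at each target location.

    hist_tmp: array (p, n_loc) -- temperatures at the p most recent dates
    coef:     array (p,)       -- AR coefficients (learned in practice;
                                  plain inputs in this sketch)
    """
    return coef @ hist_tmp                      # shape (n_loc,)

def combine_forecasts(pred_encdec, pred_ar, w):
    """Final forecast: per-location linear combination of both parts.

    pred_encdec, pred_ar: arrays (n_loc,) -- the two component forecasts
    w:                    array (n_loc, 2) -- per-location weights
    """
    return w[:, 0] * pred_encdec + w[:, 1] * pred_ar
```

The global Encoder-Decoder signal and the local AR signal are thus mixed with weights that can differ by location, which is the intended mechanism for injecting local information.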

Model Mean(se) Median (se) 0.25 quantile (se) 0.75 quantile (se)
Encoder (LSTM)-Decoder (LSTM) 0.0740(0.03) 0.0358(0.04) -0.1569(0.03) 0.2584(0.04)
Encoder (LSTM)-Decoder (FNN)-last step 0.1614 (0.05) 0.2061 (0.08) -0.2590 (0.08) 0.5720 (0.08)
Encoder (LSTM)-Decoder (FNN) 0.2616 (0.04) 0.2995 (0.07) -0.0719 (0.06) 0.6310 (0.05)
Encoder (LSTM)-Decoder (FNN)+AR 0.1733 (0.04) 0.1922 (0.06) -0.0863 (0.07) 0.5225 (0.06)
Encoder (LSTM)-Decoder (FNN)+AR (CI) 0.1852 (0.04) 0.1986 (0.05) -0.0838 (0.06) 0.5164 (0.05)
Table 8: Comparison of cosine similarity of tmp2m forecasting for test sets over 2017-2018 using different deep learning architectures.
Model Mean (se) Median (se) 0.25 quantile (se) 0.75 quantile (se)
Encoder (LSTM)-Decoder (LSTM) -0.3947(0.05) -0.2999(0.05) -0.6606(0.08) -0.0537(0.05)
Encoder (LSTM)-Decoder (FNN)-last step -0.1709 (0.06) 0.0217 (0.06) -0.4569 (0.11) 0.2278 (0.06)
Encoder (LSTM)-Decoder (FNN) -0.0353 (0.05) 0.0596(0.05) -0.2409 (0.06) 0.2426 (0.05)
Encoder (LSTM)-Decoder (FNN)+AR -0.0414 (0.04) -0.0041 (0.05) -0.3027 (0.07) 0.2309 (0.05)
Encoder (LSTM)-Decoder (FNN)+AR (CI) -0.0563 (0.03) -0.0380 (0.05) -0.2365 (0.05) 0.1951 (0.04)
Table 9: Comparison of relative R² of tmp2m forecasting for test sets over 2017-2018. A positive relative R² indicates a model predicting the target variable better than the training-set mean.
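For reference, the two evaluation metrics used throughout Tables 6-9 can be computed as below. These are the standard definitions (cosine similarity across locations, and R² measured against the training-mean baseline); the paper's exact formulation may differ in minor details such as anomaly centering.

```python
import numpy as np

def spatial_cosine_similarity(pred, obs):
    """Cosine similarity between forecast and observation across locations."""
    return pred @ obs / (np.linalg.norm(pred) * np.linalg.norm(obs))

def relative_r2(pred, obs, train_mean):
    """R^2 of the forecast relative to predicting the training-set mean.

    Positive iff the model's squared error is below that of the
    climatological (training-mean) baseline; 1.0 for a perfect forecast.
    """
    sse_model = np.sum((obs - pred) ** 2)
    sse_base = np.sum((obs - train_mean) ** 2)
    return 1.0 - sse_model / sse_base
```

Under these definitions, a forecast that merely reproduces the training mean scores a relative R² of exactly 0, matching the interpretation of the negative entries in Table 9.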