## 1 Methods

### 1.1 Data

This section describes the datasets used for testing the SVD method. All datasets were normalized to zero mean and unit variance; for the Plasim model this was done individually for each variable and level.

#### 1.1.1 Lorenz95 model

The Lorenz95 model (sometimes also called Lorenz96) is a highly idealized model designed to mimic certain aspects of the mid-latitude atmosphere (lorenz_predictability_1996-2). It is regularly used as a toy model, both for atmosphere-targeted studies (for example for testing ideas in NWP, such as data assimilation) and, more generally, in studies of (chaotic) dynamical systems (e.g. bowler_comparison_2006; karimi2010extensive).

It is defined by the following set of ODEs, where N is the number of gridpoints in the model:

dx_i/dt = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F    (1)

with i = 1, 2, …, N and cyclic boundary conditions (x_{i-N} = x_{i+N} = x_i). It can therefore be seen as a single variable in a circular domain. In this study we use the standard configuration with N = 40 gridpoints and forcing F = 8. The system is integrated with a timestep of 0.01 model time units (MTUs), using the LSODA solver from ODEPACK, provided by the SciPy library (virtanen_scipy_2019). We use a long model run for training, and 200 timesteps (from an independent model run) for testing. Gaussian noise with a standard deviation of 0.1 is added to both the training and test data. With this, the initial conditions are no longer perfect, making the problem slightly more relevant to the weather-forecasting context.
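As an illustration, the integration described above can be sketched with SciPy's LSODA interface. The solver tolerances, run length and initial state below are illustrative choices, not the exact settings of this study:

```python
import numpy as np
from scipy.integrate import solve_ivp

def lorenz96(t, x, F=8.0):
    # dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F, with cyclic indices
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

N, dt, n_steps = 40, 0.01, 500                    # run length is illustrative
rng = np.random.default_rng(0)
x0 = 8.0 + rng.normal(0.0, 0.01, N)               # slightly perturbed equilibrium
t_eval = np.arange(n_steps + 1) * dt
sol = solve_ivp(lorenz96, (0.0, n_steps * dt), x0,
                method="LSODA", t_eval=t_eval, rtol=1e-8, atol=1e-8)
data = sol.y.T                                    # shape: (n_steps + 1, N)
noisy = data + rng.normal(0.0, 0.1, data.shape)   # imperfect "observations"
```

The `np.roll` calls implement the circular domain without explicit index arithmetic; the final line mimics the observation noise added to the training and test data.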

#### 1.1.2 Plasim

The Planet Simulator (Plasim, fraedrich_planet_2005-4) is a simplified General Circulation Model. Its atmospheric part is based on the primitive equations, thus modeling atmospheric dynamics. The other components (ocean, ice and land) are strongly simplified. It has already been shown that neural networks are able to forecast Plasim's "weather", i.e. the evolution of the model's atmosphere a couple of days ahead (scher_weather_2019-1). We train on a run of 100 years of daily data, and test on one separate year (using exactly the same Plasim configuration as in scher_weather_2019-1). Since we directly use the model data, the initial conditions are "perfect".

#### 1.1.3 Era5

Era5 is ECMWF's most recent reanalysis product (c3s_era5_2017). A reanalysis provides the best guess of the past state of the global atmosphere on a 3D grid, by combining a forecast model with all available observations. We use 6-hourly data of 500hPa geopotential over the Northern Hemisphere on a 2.5° grid for the period 1976-2016 for training, and 2017-2018 for testing. This follows weyn_can_2019. Since the computation of the network Jacobians is quite costly (see below), in the testing period we only used every 8th initial state (i.e. every second day).

### 1.2 Neural network architectures

Since our aim is to test different approaches for generating neural network ensemble forecasts, we rely on feed-forward neural network architectures previously tested in the literature for the different datasets. For the Lorenz95 model, we use the architecture developed in scher_toward_2018. For Plasim, the network from scher_weather_2019-1 is used. For Era5, we use one of the architectures presented in weyn_can_2019; specifically, their purely feed-forward architecture, with only 500hPa geopotential as input. The architecture for the Lorenz95 model is a deep convolutional architecture, while for Era5 and Plasim it is a deep convolutional encoder-decoder architecture. The weyn_can_2019 architecture was developed for a different reanalysis product. Here, we regridded Era5 to the same horizontal resolution as in weyn_can_2019, namely 2.5°, and we assume that the architecture is not overly sensitive to the change of reanalysis product. All networks are trained with the Adam optimizer (kingma_adam_2017) and a mean square error loss. The networks are trained to forecast one timestep of the data for Plasim and Era5, and 10 timesteps for Lorenz95. Longer forecasts are made through consecutive forecasts. For further details, we refer the reader to the referenced papers. Additionally, the code repository accompanying this paper contains the full neural network specifications.

### 1.3 Ensemble techniques

#### 1.3.1 Random perturbation

One of the conceptually simplest ways of creating an ensemble in a chaotic dynamical system context is via adding random noise to the initial conditions. This can be done for any type of numerical model that starts from initial conditions, and is equally applicable to a neural network forecast that is run from an input vector (its "initial condition").
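A minimal sketch of such a random initial-condition ensemble is given below. The perturbation scale is illustrative, and the per-gridpoint renormalization of the drawn sample follows the procedure described below:

```python
import numpy as np

def random_ensemble(x0, n_members, sigma, rng):
    """Add Gaussian noise to the initial state x0, renormalizing the drawn
    sample so that, at each gridpoint, the perturbations have exactly zero
    mean and standard deviation sigma across the ensemble."""
    pert = rng.normal(size=(n_members,) + x0.shape)
    pert -= pert.mean(axis=0)   # enforce zero ensemble mean
    pert /= pert.std(axis=0)    # enforce unit ensemble standard deviation
    return x0 + sigma * pert

rng = np.random.default_rng(42)
x0 = rng.normal(size=40)        # e.g. one Lorenz95 state; sigma is illustrative
members = random_ensemble(x0, n_members=100, sigma=0.1, rng=rng)
```

Each row of `members` is then propagated forward by the (neural network) forecast model.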

Here we follow the method of bowler_comparison_2006, with the simplification that we use a pure Gaussian distribution, instead of the convolution of a Gaussian and an exponential distribution. Independently for each gridpoint and each ensemble member, a value drawn from a Gaussian distribution with zero mean and standard deviation σ is added. Since the ensemble has finite size, the mean and standard deviation of the perturbations over the whole ensemble do not necessarily match those of the parent distribution. Therefore, the drawn samples are first normalized to zero mean and standard deviation σ. The scale σ is a free parameter that is varied experimentally (0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3). We then choose the value that gives the lowest ensemble mean error at a leadtime of 60 h (Era5), 96 h (Plasim) and 1 MTU (Lorenz95). The rationale behind optimizing at different leadtimes is the different leadtime horizon at which the networks can make good forecasts for the different systems. This ensemble method will be referred to as the "random ensemble" ("rand" in plot legends). Ensembles with 100 members are generated for all three datasets analyzed here.

#### 1.3.2 Singular value decomposition

Singular Value Decomposition (SVD) is a technique from linear algebra with a wide range of applications. One of its uses is to find optimal perturbation patterns for a function f: R^n -> R^m, where optimal means that the input perturbation pattern leads to the maximum output perturbation with respect to some norm, when linearizing the function around its input:

f(x + dx) ≈ f(x) + J dx,

where J is the Jacobian of f at x. The leading right singular vectors of J are then the input perturbations dx that maximize the output perturbation ||J dx|| for a given perturbation size ||dx||. In this paper, for the Lorenz95 model and for Era5, input and output dimensions are the same (m = n). For Plasim, the singular vectors are computed only for 500hPa geopotential, but still with respect to the whole input field (thus m < n). We use the Euclidean norm, which allows the use of standard SVD routines from numerical libraries.

To find the singular vectors, we first compute the Jacobian of the neural network,

J_ij = ∂f_i / ∂x_j,

where f is the function represented by the network and x its input. This is simple to implement, as gradients of the output of neural networks are central to the training procedure, and neural network libraries therefore usually also contain functions to compute gradients and Jacobians of the output with respect to the network inputs. If no explicit function for Jacobians is available, one can use the gradient function to compute all rows of the Jacobian individually by looping over the output dimension. Computationally, however, this is relatively expensive, as it requires one gradient computation for each element of the output space. We optimize spread growth over 48 hours (Era5 and Plasim) and 1 MTU (Lorenz95). For this, for a given input, the Jacobian matrix of the function defined by 10/2/8 (Lorenz95/Plasim/Era5) consecutive neural network forecasts is computed. We then compute the 10 leading singular vectors of this Jacobian matrix with a standard SVD routine from the numpy library. This yields the singular vectors of the system with respect to the Euclidean norm.
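The procedure can be sketched as follows, with a small random tanh network standing in for the trained model and a central finite-difference Jacobian standing in for the library's autodiff (both are illustrative stand-ins, not the actual networks of this study):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
W1, W2 = rng.normal(size=(n, n)), rng.normal(size=(n, n))

def net(x):
    # illustrative stand-in for the trained network: one tanh hidden layer
    return W2 @ np.tanh(W1 @ x)

def jacobian(f, x, eps=1e-6):
    # central finite differences, one column per input dimension
    # (a stand-in for the autodiff Jacobian provided by NN libraries)
    J = np.empty((f(x).size, x.size))
    for j in range(x.size):
        dx = np.zeros(x.size)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

x0 = rng.normal(size=n)
J = jacobian(net, x0)
U, s, Vt = np.linalg.svd(J)   # right singular vectors are the rows of Vt
v1 = Vt[0]                    # leading input perturbation pattern
# v1 maximizes ||J v|| / ||v||; the growth factor is the singular value s[0]
```

Perturbing the input along `v1` gives, in the linearized sense, the fastest-growing output perturbation.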

Following bowler_comparison_2006, the 10 leading singular vectors are then combined with random weights drawn from a truncated Gaussian distribution with standard deviation σ_svd (the distribution is truncated at a fixed number of standard deviations; since the random and the SVD ensemble techniques do not necessarily have the same optimal scales, we give the two scale parameters separate names).

This creates symmetric perturbations centered at zero (for each drawn perturbation, one member receives the positive and one the negative pattern). As for the random perturbations, σ_svd was optimized for the lowest ensemble mean error at a leadtime of 60 h (Era5), 96 h (Plasim) and 1 MTU (Lorenz95) on 1 year of training data. The algorithm is sketched in fig. 1 and presented in detail in alg. 1 (Appendix). The tested perturbation scales were the same as for the random ensemble, except that for Plasim higher perturbation scales (up to 20) were included as well after some experimentation.
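A sketch of this combination step is given below, assuming the leading singular vectors are available as the rows of a matrix. The truncation at 3 standard deviations is an assumed value, and `svd_perturbations` is a hypothetical helper, not the paper's actual implementation:

```python
import numpy as np
from scipy.stats import truncnorm

def svd_perturbations(singular_vectors, n_pairs, sigma_svd, rng, trunc=3.0):
    """Combine leading singular vectors (rows) with random weights drawn from
    a Gaussian truncated at +/- trunc standard deviations; return symmetric
    +/- perturbation pairs. trunc=3 is an assumed value, not from the paper."""
    k = singular_vectors.shape[0]
    w = truncnorm.rvs(-trunc, trunc, scale=sigma_svd,
                      size=(n_pairs, k), random_state=rng)
    pert = w @ singular_vectors           # shape: (n_pairs, state_dim)
    return np.concatenate([pert, -pert])  # one +, one - member per draw

rng = np.random.default_rng(1)
# 10 orthonormal stand-in "singular vectors" for a 40-dimensional state
vecs = np.linalg.qr(rng.normal(size=(40, 10)))[0].T
perts = svd_perturbations(vecs, n_pairs=50, sigma_svd=0.1, rng=rng)
```

The symmetric pairs guarantee that the perturbations have exactly zero mean over the ensemble.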

This ensemble method will be referred to as the SVD ensemble ("svd" in plot legends). Due to the relatively high computational expense of computing the full Jacobians, for the Era5 data only every second day was used as an initial state. Therefore, in order to allow a valid comparison, the random perturbation ensemble and the network ensemble for Era5 were also initialized only every second day.

#### 1.3.3 Network/training ensemble

The neural network training procedure used here has two random components, namely the random initialization of the network weights, and the random selection of training samples in the training loop of the optimizer. Therefore, a simple way to create an ensemble is via retraining the network, starting with different initial seeds for the random number generators. It would also be possible to add another level of randomness via selecting a different subset of the training data for each member, although we have not explored this possibility here.

We retrained the networks 10 times for Plasim, 20 times for Era5 and 100 times for the Lorenz95 model. Since we have only limited GPU computation time, the maximum number of ensemble members for Era5 and Plasim is 20 and 10, respectively (the Plasim network is the most expensive to train and to compute Jacobians for), compared to 100 for the ensembles with initial perturbations. For Era5, there was a very large variation in skill between different training realizations (some realizations had very poor forecasts at longer leadtimes, even though they had only small errors at the training leadtime of 6 h). Therefore, for Era5, we trained 50 members, and then selected the 20 members with the highest skill on the last year of the training data at a leadtime of 60 h. This ensemble will be referred to as the "network ensemble" ("netens" in plot legends). An overview of all ensemble types and the maximum number of ensemble members is given in table 1.

| | svd | rand | netens |
|---|---|---|---|
| Lorenz95 | 100 | 100 | 20 |
| Plasim | 100 | 100 | 10 |
| Era5 | 100 | 100 | 20 |

Table 1: Maximum number of ensemble members for each ensemble type and dataset.

### 1.4 Ensemble spread and error

In an ideal ensemble, the spread of a forecast is the expectation value of the error of the forecast, where spread is defined as the standard deviation of the ensemble members, and error as the Root Mean Square Error (RMSE) of the ensemble mean. Thus, for a perfect ensemble, the mean forecast spread should equal the mean forecast error when averaged over many forecasts. Since standard deviations and RMSEs cannot easily be averaged (a common pitfall when evaluating ensemble forecasts, see fortin_why_2014), we first compute the variance of the members at each gridpoint, and then average over all gridpoints. For mean spreads over multiple forecasts, we also average over forecasts, and only then take the square root to obtain the mean standard deviation. We follow the same procedure for the RMSE of the forecasts: we compute the MSE, then average, and then take the square root.
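The averaging order described above (variances and squared errors first, square root last) can be sketched as follows, with toy array shapes:

```python
import numpy as np

def mean_spread_and_rmse(ens, truth):
    """ens: (n_forecasts, n_members, n_grid); truth: (n_forecasts, n_grid).
    Average variances / squared errors first, take the square root last."""
    spread = np.sqrt(ens.var(axis=1).mean())                  # var -> mean -> sqrt
    rmse = np.sqrt(((ens.mean(axis=1) - truth) ** 2).mean())  # MSE -> mean -> sqrt
    return spread, rmse

rng = np.random.default_rng(0)
truth = rng.normal(size=(200, 40))
# synthetic ensemble: truth plus member noise of standard deviation 0.5
ens = truth[:, None, :] + rng.normal(scale=0.5, size=(200, 30, 40))
spread, rmse = mean_spread_and_rmse(ens, truth)
```

Averaging the standard deviations or RMSEs directly, instead of the variances and MSEs, would systematically underestimate both quantities.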

For Lorenz95 and Era5, the whole output vector is used. For Plasim, which forecasts 4 variables on 10 layers, only 500hPa geopotential is used for computing spread and error.

We evaluate the information content of the spread in two ways. First, we compare the mean spread to the mean error. As already mentioned, these should be equal for a perfect ensemble. The average error of forecasts of chaotic systems grows with increasing leadtime, before eventually saturating at some climatological level. Therefore, the average spread as a function of leadtime should follow the average error during the initial error-growth phase. Secondly, we compute the correlation between the spread and the error of all forecasts at a given leadtime. Since the spread is only a predictor of the expected, or average, error, this correlation would not be one even for a perfect ensemble. However, if the spread contains useful information about the day-to-day uncertainty in the forecasts, the correlation should be significantly larger than zero (buizza_potential_1997-2).

### 1.5 Statistics

To test the significance of differences in forecast skill and differences in spread-error correlation between two methods, we use bootstrapping to compute confidence intervals. Since all methods use the same initial conditions, their results are not independent, and computing individual confidence intervals for each method would be inappropriate. Therefore, we consider the difference in RMSE/correlation between the two methods (method1 minus method2), and compute confidence intervals for this difference. If the confidence interval is entirely positive, method1 has a significantly higher error/correlation; if it is entirely negative, it has a significantly lower error/correlation. If the confidence interval contains both positive and negative values, there is no significant difference between the two methods. The bootstrapping is done as follows: for the set of forecast dates, a new set of (not necessarily unique) dates is created by randomly drawing from the original dates with replacement. On this new set, the difference between the two methods is computed. This is repeated 1000 times, creating a distribution of 1000 values. From these, the 5th and 95th percentiles are computed and used as the confidence interval. The procedure for the difference in spread-error correlation is analogous.
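A sketch of this bootstrap for the skill difference of two methods is given below. It is simplified to the mean difference of toy per-date errors, and `bootstrap_diff_ci` is a hypothetical helper, not the paper's code:

```python
import numpy as np

def bootstrap_diff_ci(err1, err2, n_boot=1000, rng=None):
    """90% bootstrap confidence interval for the mean of err1 - err2,
    resampling forecast dates with replacement."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(err1)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)         # resampled dates
        diffs[b] = (err1[idx] - err2[idx]).mean()
    return np.percentile(diffs, [5, 95])

rng = np.random.default_rng(0)
err1 = rng.normal(1.2, 0.3, size=365)            # toy per-date errors, method1
err2 = rng.normal(1.0, 0.3, size=365)            # toy per-date errors, method2
lo, hi = bootstrap_diff_ci(err1, err2, rng=rng)
# interval entirely positive -> method1 has significantly higher error
significant = (lo > 0) or (hi < 0)
```

Resampling dates jointly for both methods preserves the pairing of forecasts that share initial conditions.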

In order to test whether there is significant positive spread-error correlation for a single method, we also use bootstrapping, analogous to the above procedure, but applied only to the correlation of that method.

## 2 Results

### 2.1 Lorenz95

The results for Lorenz95 are shown in fig. 2. Panels (a-d) show the results for ensemble size 20, which is the maximum ensemble size for the network ensemble. The network ensemble (multiple trained networks) has both the most realistic spread (almost identical to the error, panel (a)) and the largest improvement in skill of the ensemble mean relative to the deterministic control forecast. The SVD ensemble is comparable to the random ensemble in skill (no significant difference, panel (c)), but has lower spread (panel (a)). In terms of spread-error correlation, the network ensemble again clearly outperforms the SVD ensemble at all leadtimes. At short and long leadtimes, it is also significantly better than the random ensemble (panels (b, d)). This is not true at leadtimes of around 0.5 MTU, where the network ensemble's correlation performance drops significantly. Both the random and SVD ensembles can be made to have spread closer to the RMSE by using higher initial perturbation scales, but in this case the ensemble mean is worse than the control run (panels (g, h)). Panels (e-h) show the results at a fixed leadtime of 1 MTU, for different ensemble sizes. The results are consistent across ensemble sizes, except for the smallest one with 2 members, highlighting the generally superior performance of the network ensemble forecasts.

### 2.2 Plasim

The results for forecasting the "weather" of the Plasim model are shown in fig. 3. Again, panels (a-d) show the results for the maximum size of the network ensemble (10 in this case). As for the Lorenz95 model, the improvement in ensemble mean skill compared to deterministic skill is largest for the network ensemble (panel (a)). Contrary to the Lorenz95 case, however, the SVD ensemble ranks second in terms of ensemble mean skill, significantly outperforming the random ensemble at most leadtimes (panel (c)). When looking at the spread evolution over time, all ensembles are heavily underdispersive (panel (a)). The SVD ensemble is by far the best at leadtimes of up to roughly 96 hours. At longer leadtimes, however, the spread of the SVD ensemble surprisingly starts decreasing. The SVD method used here optimizes spread at a leadtime of 48 h, which may, at least partly, explain this counterintuitive behavior. Another interesting aspect is that the optimal initial perturbation scale for the random ensemble is quite high (see the high spread at leadtime 0). This spread then rapidly decreases, before slowly growing again. Apparently, much of the initial perturbation information is rapidly "lost". In terms of spread-error correlation, the network ensemble again performs best (panel (b)). The SVD and the random ensembles are comparable, with the SVD ensemble having higher correlation around the optimization leadtime of 48 hours, but lower correlation at longer leadtimes. However, their difference is never significant except at 72 h (panel (d)), and their correlation in general is very low. Changing the number of ensemble members does not have a large influence on the results (panels (e-f)). Changing the initial perturbation scale away from its optimum (where it minimizes ensemble mean forecast error) has an interesting effect here.
For the random perturbations, as expected, and as for the Lorenz95 model, the spread increases and the forecast skill decreases rapidly with increasing scale, as does the spread-error correlation after a local maximum (panels (g, h)). Note that the perturbation-scale axis has been capped at the higher end in the plot to focus on the most interesting region (for higher values the trend continues). For the SVD ensemble, on the other hand, the forecast skill decreases much less with higher perturbation scales, whereas the spread increases faster, finally crossing the error. The spread-error correlation keeps increasing up to an optimum at a perturbation scale of 10.

### 2.3 Era5

The results for Era5 500hPa geopotential forecasts are shown in fig. 4. All three ensemble types have better ensemble mean forecasts than the deterministic control run. The SVD and the random perturbation ensembles have very similar skill, whereas the network ensemble is again the best-performing one (panel (a)). The difference between the network ensemble and the other ensembles is comparable to the difference between the other two ensembles and the deterministic forecast. The ensemble spread is very similar among all methods, with the network ensemble having slightly higher spread, but still lower than its ensemble mean error. In terms of spread-error correlation, the network ensemble again clearly outperforms the other two, up to a leadtime of 60 h (panels (b, d)). Its correlation then drops, first below that of the random ensemble and then below that of the SVD ensemble. The correlation of the SVD ensemble is similar to that of the random ensemble, except at longer leadtimes, where it is lower.

Increasing the number of ensemble members beyond 20 does not have a large effect on the SVD and random perturbation ensembles (panels (e, f)); the two indeed behave very similarly with 100 members compared to 20 members. Changing the initial perturbation scale has a similar, but not identical, effect as for Plasim (panels (g, h)). For the random ensemble, the forecast error increases with increasing scale. There is an optimum in spread-error correlation, but it occurs where the forecast skill has already dropped far below that of the control run. For the SVD ensemble, on the other hand, the forecast skill hardly decreases at all with increasing perturbation scale, and the spread-error correlation is relatively constant after a (not very pronounced) local optimum.

## 3 Discussion and conclusions

In this paper we have tested the singular value decomposition approach, widely used in numerical weather prediction, for generating neural network ensemble forecasts of the Lorenz95 system, the Plasim general circulation model, and the Era5 atmospheric reanalysis data. We then compared this to neural network ensemble forecasts created with random initial perturbations and with network retraining. All ensembles were evaluated by analyzing the Root Mean Square Error and ensemble spread, and compared to a single, deterministic forecast.

For all three datasets, the ensemble mean of the SVD forecasts had better skill than a single, unperturbed forecast, and the ensemble spread contained some information on the expected uncertainty of the forecast. For the Lorenz95 model, the SVD ensemble is inferior to the network ensemble, and comparable to the random ensemble. The latter is in agreement with previous findings that SVD-generated perturbations are suboptimal for generating perturbed initial conditions for numerical forecasts of the Lorenz95 model (bowler_comparison_2006). This points to the broader question of the relevance of simple systems like the Lorenz95 model for predictability studies. Very recently, scher_generalization_2019 have indeed shown that data-driven methods can perform very differently across different simple dynamical systems. On the Plasim data, the SVD ensemble shows better spread growth in the beginning of the forecasts, but later on it is inferior to an ensemble created by retraining. Its performance in terms of RMSE is intermediate between the network ensemble, which performs better, and the random ensemble, which performs worse. In terms of spread-error correlation, both the random and the SVD ensembles perform poorly. On atmospheric reanalysis data, the performance of the SVD ensemble is comparable to that of the random ensemble. We further note that the general forecast skill of neural networks on reanalysis data is very low compared to NWP models. Thus, the main source of uncertainty is the network itself, rather than the initial conditions. In summary, the results for the different datasets analyzed here show that the SVD method, although highly effective in conventional NWP models, is not necessarily the optimal choice when performing neural network forecasts. Careful evaluation is thus necessary before choosing an ensemble initialization method.

While this study focused on weather prediction, the principles presented here can also be applied to the forecasting of other initial value problems. Indeed, the SVD technique can be used with any end-to-end differentiable function. It could therefore also be used for combinations of numerical models and neural networks, as long as they are differentiable. SVD itself is also differentiable, making it possible to include the generation of perturbed initial states in the neural network training procedure itself. In this way, one could for example optimize both on ensemble mean error and on ensemble spread, thus creating a novel probabilistic neural network approach for the forecasting of dynamical systems. Furthermore, applying the SVD technique to neural networks is in fact easier than applying it to numerical models, as the latter requires first creating a tangent-linear version of the model, either through re-coding the model or with automatic differentiation techniques. The computation of the singular vectors could also be sped up with the Lanczos algorithm, which is faster than explicitly computing the full Jacobian first. Finally, the fact that we could directly apply a method developed in the context of NWP models to neural networks shows that there are potential synergies between these two forecasting approaches. Indeed, more concepts developed for NWP, beyond the SVD technique, might be transferable to machine-learning-based weather forecasts.
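As a sketch of the Lanczos-type speed-up mentioned above, SciPy's `svds` can recover the leading singular triplets from matrix-vector products alone. The dense matrix below is a stand-in for Jacobian-vector products that would, in practice, come from forward- and reverse-mode automatic differentiation:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, svds

rng = np.random.default_rng(0)
n = 200
A = rng.normal(size=(n, n)) / np.sqrt(n)  # dense stand-in for the Jacobian

# matvec/rmatvec would in practice be forward- and reverse-mode
# Jacobian-vector products -- the full Jacobian is never formed.
op = LinearOperator((n, n), matvec=lambda v: A @ v,
                    rmatvec=lambda v: A.T @ v, dtype=A.dtype)
u, s, vt = svds(op, k=10)                 # 10 leading singular triplets
order = np.argsort(s)[::-1]               # svds does not guarantee ordering
leading_vectors = vt[order]               # input patterns, strongest first
```

For high-dimensional inputs this avoids the one-gradient-per-output-element cost of building the full Jacobian.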

The software used for this study was developed in Python, using the TensorFlow framework, and is available in the repository at https://doi.org/10.5281/zenodo.3662436 and on the author's GitHub page (https://github.com/sipposip/nn-svd-weather). The data underlying the figures are also available in the repository. Era5 data can be freely obtained through the Copernicus Climate Change Service at https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-pressure-levels?tab=overview. The source code for the Plasim model is freely available at https://www.mi.uni-hamburg.de/en/arbeitsgruppen/theoretische-meteorologie/modelle/plasim.html (last access: 11 February 2020). Details on the Plasim run used here, including configuration scripts, are available in the code repository at https://doi.org/10.5281/zenodo.2572863.

## Appendix A

SS conceived and implemented the methods presented in the study and drafted the manuscript. Both authors designed the study, interpreted the results and improved the manuscript.

The authors declare that they have no competing interests.
