1. Introduction
Meteorological elements, such as temperature, wind and humidity, profoundly affect many aspects of human livelihood (Gneiting and Raftery, 2005; Murphy, 1993). They provide analytical support for issues related to urban computing such as traffic flow prediction, air quality analysis, electric power generation planning and so on (Zheng et al., 2014). The most common method currently utilized in meteorology is the use of physical models to simulate and predict meteorological dynamics known as numerical weather prediction, or NWP. The advantage of NWP is that it is based on the numerical solution of atmospheric hydro thermo dynamic equations and is able to obtain high prediction accuracy if the initial solution is appropriately chosen. However, NWP may not be reliable due to the instability of these differential equations (Tolstykh and Frolov, 2005). With the growing availability of meteorological big data, researchers have realized that introducing datadriven approaches into meteorology can achieve considerable success. Several machine learning methods have been applied to weather forecasting (Sharma et al., 2011; Grover et al., 2015; Hernández et al., 2016). The merit of datadriven methods is that they can quickly model patterns through learning to avoid solving complex differential equations. Nevertheless, learning from historical observations alone requires big data and a tedious amount of feature engineering to achieve satisfying performance, which presented us with the following challenge. Could we combine the advantages of NWP and machine learning to make a more efficient and effective solution? At the same time, singlevalue (i.e. point estimation) forecasting lacks credibility and flexibility for numerous types of human decision. Could we provide more information to indicate the prediction interval based on highquality uncertainty quantification? This paper aims to introduce a unified deep learning method to address these problems through endtoend learning. In particular, we will predict multiple meteorological variables across different weather stations at multiple future steps. The proposed approach has several advantages: efficient data preprocessing, endtoend learning, high accuracy, uncertainty quantification and easytodeploy which makes it have considerable practical significance. The contributions of this work are summarized as follows:

It proposes an effective deep model and information fusion mechanism to handle weather forecasting problems. To the best of our knowledge, this is the first machine learning method which combines historical observations and NWP for weather forecasting. Data and source codes will be released and can be used as a benchmark for researchers to study machine learning in the meteorology field.

It establishes effective assumptions and constructs a novel negative loglikelihood error (NLE) loss function. Unlike Bayesian deep learning (BDL), deep uncertainty quantification (DUQ) can be seamlessly integrated with current deep learning frameworks such as Tensorflow and Pytorch. It can be directly optimized via backpropagation (BP). Our experiments show that compared with typical mean squared error (MSE) and mean absolute error (MAE) loss, training by NLE loss significantly improves the generalization of point estimation. This phenomenon has never been reported in previous researches.

Besides precise point estimation, DUQ simultaneously inferences the sequential prediction interval. This attractive feature has not been studied well in previous deep learning research for time series forecasting. It can be applied to various time series regression scenarios.

It explores efficient deep ensemble strategies. The experimental results demonstrate that the ensemble solution significantly improves accuracy.
The rest of the paper is structured as follows. We discuss related works in Section II and introduce our method in Section III. In Section IV, we discuss experiments and performance analysis . Last, we conclude with a brief summary and shed light on valuable future works in Section VI.
2. Related Works
Weather Forecasting Weather forecasting has been well studied for more than a century. Most contemporary weather forecasting relies on the use of NWP approaches to simulate weather systems using numerical methods (Tolstykh and Frolov, 2005; Marchuk, 2012; Richardson, 2007). Some researchers have addressed weather forecasting as a purely datadriven task using ARIMA (Chen and Lai, 2011), SVM (Sapankevych and Sankar, 2009)
, forward neural network
(Voyant et al., 2012), etc. These shallow models explore only a few variables, which may not capture the spatiotemporal dynamics of diverse meteorological variables. Deep learning has also shown promise in the field of weather prediction. The study in (Hernández et al., 2016)first adopted an autoencoder to reduce and capture nonlinear relationships between variables, and then trained a multilayer perceptron for prediction. In
(Grover et al., 2015), a deep hybrid model was proposed to jointly predict the statistics of a set of weatherrelated variables. The study in (Shi et al., 2015) formulated precipitation nowcasting as a spatiotemporal sequence forecasting problem and proposed convolutional LSTM to handle it. However, these purely datadriven models are limited in that: 1) they all ignore important prior knowledge contained in NWP, which may not capture the spatiotemporal dynamics of diverse meteorological variables; 2) some need tedious feature engineering, such as extracting seasonal features as inputs and kernel selection, which seems contrary to the endtoend philosophy of deep learning; 3) all lack the flexibility of uncertainty quantification.Deep Learning Although deep learning for regression has achieved great success and benefits from the powerful capability of learning representation, solutions like (Zhang et al., 2018; Zhou and Matteson, 2015; Yi et al., 2018) only focus on point estimation and there is a substantial gap between deep learning and uncertainty quantification.
Uncertainty Quantification For ease of explaining uncertainty in regression scenario, let us only consider the equation: , where statistically is the mean estimation (predictable point estimation) of the learned machine learning model and is also called the epistemic part. Its uncertainty comes from model variance denoted by ; is the irreducible noise, also named the aleatoric part. The reason it exists is because there are unobtained explanatory variables or unavoidable random factors, so it is called data variance. Due to the difficulty of expressing
with a deterministic equation, data variance is usually modeled by a Gaussian distribution with zero mean and a variance
(Central Limit Theorems). If
does not change, it is a homoskedastic problem, otherwise it is regarded as heteroskedastic. Then the total variance . The learning process is usually implemented by maximum likelihood estimation, which will learn the estimated .Uncertainty quantification can provide more reference information for decisionmaking and has received increased attention from researchers in recent years (Khosravi et al., 2011) . However, most uncertainty quantification methods are based on shallow models and do not take advantages of deep learning. Deep models can automatically extract desirable representations, which is very promising for highquality uncertainty quantification. To this end, Bayesian deep learning (BDL), which learns a distribution over weights, is currently the most popular technique (Wang and Yeung, 2016). Nevertheless, BDL has a prominent drawback in that it requires significant modification, adopting variational inference (VI) instead of backpropagation (BP), to train deep models. Consequently, BDL is often more difficult to implement and computationally slower. An alternative solution is to incorporate uncertainty directly into the loss function and directly optimize neural networks by BP (Nix and Weigend, 1994; Lakshminarayanan et al., 2017; Pearce et al., 2018). This still suffers from certain limitations. 1) Regression is solved as a mapping problem rather than curve fitting, hence this method cannot naturally be applied to multistep time series forecasting (Roberts et al., 2013; Pearce et al., 2018). 2) The output only consider a single dimension. If it is to be extended to multiple dimensions or for multistep time series forecasting, the method must be based on effective and reasonable assumptions. 3) Only a shallow forward neural network is used for illustration and the superior performance of deep learning is not explored.
The proposed DUQ addresses these limitations by combining deep learning and uncertainty quantification to forecast multistep meteorological time series. It can quantify uncertainty, fuse multisource information, implement multiout prediction, and can take advantage of deep models. Meantime, it is optimized directly by BP.
3. Our Method
3.1. Problem Statement
Let us say we have historical meteorological observations from a chosen number of weather stations and a preliminary weather forecast from NWP. For each weather station, we concern weather forecasting to approximate ground truth in the future. We define this formally below:
3.1.1. Notations
For a weather station , we are given:

Historical observed meteorological time series
, where the variable is one type of meteorological element, for . 
Another feature series consist of forecasting timesteps, station ID and NWP forecasting, i.e.,
, where the variable is one of features, for , and is the required number of forecasting steps. 
Ground truth of target meteorological variables denoted as , where the variable is one of target variables, for and its estimation denoted as .

Then we define:
.
3.1.2. Task Definition
Given , the point estimation will predict to approximate as far as possible. The prediction interval will ensure
(elementwise) with the predefined tolerance probability. The prediction interval will cover the ground truth with at least the expected tolerance probability.
This research was driven by a realworld weather forecasting competition. For feasible comparison, it focuses on a set time period, i.e., from 3:00 intraday to 15:00 (UTC) of the next day, hence . The target variables include temperature at 2 meters (t2m), relative humidity at 2 meters (rh2m) and wind at 10 meters (w10m), hence . The proposed method can be easily extended for any time interval prediction and more target variables.
The variation of mean (solid line) and 90% confidence interval (shaded area) for 10 stations during the target forecasting time zone (3:00 intraday to 15:00 of the next day) from 03/01/201505/31/2018. Values of Yaxis are following normalization.
3.2. Information Fusion Methodology
Data exploration analysis provides insights for the motivation and methodology of information fusion. Fig. 1 shows the variation of three target meteorological variables over the past three years. It can seen that only temperature reflects a strong seasonal variation, while relative humidity and wind speed are subjected to much noise. Based on this observation, methods that extract seasonal features from historical data may not provide the best results, since weather changes too dramatically (Chen and Lai, 2011; Grover et al., 2015). Frequent concept drift cause longterm historical meteorological data lack value (Lu et al., 2018). One conclusion summarizes that ”For many time series tasks only a few recent time steps are required”(Gers et al., 2002). On the other hand, NWP is a relatively reliable forecasting method, but inappropriate initial states can introduce undesirable error bias. To address this, we propose a balanced fusion methodology:

First, only recent observations, i.e., should be adopted for modeling recent meteorological dynamics.

Second, a wise NWP fusion strategy should incorporate NWP forecasting at a counterpart forecasting timestep to easily correcting bias in NWP. Conversely, an unwise fusion strategy that is not carefully designed may absorb NWP which is not conducive to capturing important NWP signals. Hence we incorporate NWP forcasting into rather than into or hidden coding (see Fig. 3).
Fig. 2 aggregates historical statistics of mean (solid line) and 90% confidence interval (shade area) for 10 stations from 3:00 intraday to 15:00 (UTC). We find that: 1) There exists obvious difference of mean and variance statistics, e.g., the mean value of stationID 7 follows a different trend compared with other stations. 2) Every hour at every station has different meteorological characteristics of mean and variance. To address this, we will introduce station ID and time ID into .
3.3. Data Preprocessing
3.3.1. Missing values
There are two kind of missing values, i.e. block missing (oneday data lost) and local missing (local noncontinuous time series), which vary in severity. For block missing (Yi et al., 2016)
, we just delete the data of those days from the dataset. For local missing data, we use linear interpolation to impute missing values. Taking the training set as an example, we delete 40 days with block missing values from a total of 1188 days, leaving the training data from 1148 (118840) days.
3.3.2. Normalization of Continuous Variables
Continuous variables without normalization sometimes result in training failure for deep leaning, so we use minmax normalization to normalize each continuous feature into [0, 1]. In the evaluation, we renormalize the predicted values back to the normal scale.
3.3.3. Category Variables
There are two category variables, i.e. Timesteps ID and Station ID. Rather than hardcoding, such as onehot or sincosine coding, we code them by embedding, which has achieved better performance than hardcoding (Mikolov et al., 2013).
3.3.4. Input/Output Tensors
Lastly, we load data from all stations and dates and reshape it to three tensors as follows:

(i.e., input tensors).

(i.e., ground truth tensor).
Note that is the date index and is the station index. When drawing training samples, we first draw integer date and station . We can then index by from these three tensors and obtain one training instance = and abbreviated as = and
for brevity. The advantage of organizing data in this fourtuple style is that we can conveniently index the data via the specific dimension for hindsight inspection and consideration of scalability. For example, we can index specific dates and stations for later transfer learning research. Readers can refer to the instantiated example
Parameter Settings for Reproducibility in Section V for deeper understanding.3.4. Model Architecture
The proposed DUQ is based on sequencetosequence (seq2seq, also a.k.a EncoderDecoder). Its detailed formula is not discussed here. Readers can refer to (Srivastava et al., 2015) for more detail. There are already many highperformance variants for different tasks, but most of them focus on making improvements from the structural perspective to make point estimation more precise. We first incorporate sequential uncertainty quantification for weather forecasting into the architecture presented in Fig. 3. The encoder first extracts latent representations c from the observed feature series :
where c captures the current meteorological dynamics and is then transferred to form the initial state of the decoder. Based on the memory of c, the decoder absorbs including station identity (StaID), forecasting time identity (TimeID), and NWP forecasting. Two embedding layers will be introduced for StaID and TimeID respectively to automatically learn the embedding representations. This architecture will generate sequential point estimation used as to predict as well as the variance utilized to estimate :
where and are learnable parameters. We use to represent the combination of EncoderDecoder and use which can then be regarded as:
3.5. Learning Phase
DUQ predicts two values at each timestep corresponding to the predicted mean and variance to parameterize the Gaussian distributions ^{2}^{2}2We enforce the positivity constraint on the variance by passing the second output through the function ), and add a minimum variance (e.g. ) for numerical stability.. The NLE is calculated for the Gaussian distributions, which must be based on reasonable hypotheses. Three mild but experimentally effective assumptions are proposed (degree of effectiveness can be seen in the experimental results in Table 3):

Each day and each station are independent. This assumption ensures it is reasonable that the number of all training samples can be regard as . Based on this, we can minimize the negative loglikelihood error loss:

Each target variable and each timestep at one specific station are conditionally independent given . Based on this, we can further decompose by the product rule and transform it via log operation as:

The target variables satisfy multivariate independent Gaussian distribution and is a function of the input features, i.e., . Based on this assumption, the final loss is:
where is a constant which can be omitted during training. is the ground truth of a target variable at timestep , and are respectively the variance and mean of a Gaussian distribution parameterized by DUQ. The aim of the entire learning phase is to minimize NLE. Optimizing by deep models can easily lead to overfitting on the training set, therefore it is necessary to implement earlystopping on the validation set. Algorithm 1 outlines the procedure for the learning phase.
3.6. Inference Phase
After training, we can implement statistical inference for an input by:
where is statistically the mean estimation i.e., given , which will be adopted for forecasting and is statistically the variance estimation, i.e., given . Recall our assumption that satisfies Gaussian distribution, so upper bound and lower bound can be inferenced as follows:
where
is the standard deviation and,
should be determined according to the predefined . In this research, thusis set to 1.65 according to the zscore of Gaussian distribution. Algorithm
2 gives the inference procedure.3.7. Ensemble Methodology
We adopt a simple but efficient principle for ensemble: each single model is a DUQbased model initialized with specified nodes. The ensemble point estimation is the averaged point estimation of all DUQbased models, which is scalable and easily implementable.
3.8. Evaluation Metrics
3.8.1. Point Estimation Measurement
We first calculate the root mean squared error () for each objective variable from all stations for daily evaluation.
where and are respectively the ground truth and the predicted value of the objective variable (i.e., , or in this paper) at station .
is the ultimate RMSE criterion in the experimental reports for each day. is the average over all days. To demonstrate the improvement over the classic NWP method, we employ the following evaluation using the associated skill score (, the higher the better):
where is the calculated by the NWP method and is calculated from the prediction made of machine learning models.
is the ultimate criterion in experimental reports for every day. is the average over all days, which is also the ultimate rank score in the online competition.
3.8.2. Prediction Interval Measurement
To evaluate the prediction interval, we introduce the metric called prediction interval coverage probability (PICP). First, an indicator vector
is defined, for . Each Boolean variable represents whether the objective variable at the predicted time step has been captured by the estimated prediction interval.
The total number of captured data points for the objective variable is defined as ,
Then for the objective variable is defined as:
Ideally, should be equal to or greater than the predefined value, i.e., where is the significant level and is set to 0.1 in our experiments.
4. Experiments and PERFORMANCE ANALYSIS
4.1. Baselines
SARIMA Seasonal autoregressive integrated moving average is a benchmark model for univariate time series, where parameters are chosen using AIC (Akaike information criterion).
SVR
Support vector regression is a nonlinear support vector machine for regression estimation.
GBRTGradient boosting regression tree is an ensemble method for regression tasks and is widely used in practice.
DUQ is one layer GRUbased seq2seq with 50 hidden nodes. The loss function is .
DUQ is two layers GRUbased seq2seq with 50 hidden nodes of each layer. The loss function is .
DUQ is one layer GRUbased seq2seq with 200 hidden nodes. The loss function is .
DUQ is two layers GRUbased seq2seq with 300 hidden nodes of each layer. The loss function is .
DUQ is the same as DUQ except that NWP forecasting (i.e. NWP of , refer to Fig. 3 ) is masked by zero values.
DUQ is the same as DUQ except that the observation features (i.e. ) are masked by zero values.
Seq2Seq is the same as DUQ except that the loss function is MSE.
Seq2Seq is the same as DUQ except that the loss function is MAE.
DUQ ensembles three DUQ models (i.e., DUQ, DUQ, DUQ) for online evaluation. This method achieved 2nd place in the online competition.
DUQ ensembles 10 DUQ models with different architecture to explore the effectiveness of the ensemble. It ensembles DUQ, DUQ, …, DUQ
(increasing at 10neuron intervals).
Model achieves the best during online comparison. According to the onsite report, the author also adopted a complicated stacking and ensemble learning strategy.
4.2. Experimental Environments
The experiments were implemented on a GPU server with Quadro P4000 GPU and Keras programming environment (Tensorflow backend).
4.3. Parameter Settings for Reproducibility
The batch size is set to 512. The embedding dimension of each embedding layer is set to 2. Since we adopted an earlystopping strategy, it was not necessary to set the epoch parameter. Instead, we set the number of maximum iterations to a relatively large number of 10000 to take sufficient batch iterations into consideration. The validation interval (vi) is set to 50 meaning that for every 50 iterations, we will test our model on validation set and calculate the validation loss. We set the earlystopping tolerance (est) to 10, meaning that if the validation loss over 10 continuous iterations did not decrease, training would be stopped early. We defined the validation times (vt) when earlystopping was triggered, hence the total iterations (ti) can be calculated by ti=vtvi. For the prediction interval, was set to 0.1, thus is set to 1.65 according to the zscore of Gaussian distribution.
We set to preprocess the original dataset. After preprocessing, the final dataset shape was as shown below.
For the training set:

Encoder inputs: (1148, 28, 10, 9)

Decoder inputs: (1148, 37, 10, 31)

Decoder outputs: (1148, 37, 10, 3)
For the validation set:

Encoder inputs: (87, 28, 10, 9)

Decoder inputs: (87, 37, 10, 31)

Decoder outputs: (87, 37, 10, 3)
For the test set on each day:

Encoder inputs: (1, 28, 10, 9)

Decoder inputs: (1, 37, 10, 31)

Decoder outputs: (1, 37, 10, 3)
The meaning of each number is explained as follows: we acquired data from 1148 days for the training set and data from 87 days for the validation set. Because our evaluation is based on online daily forecasting, the test day index is 1. Number 28 is a hyperparameter, meaning that the previous 28 hours of observations were used to model recent meteorological dynamics. Number 37 was set according to the specified forecasting steps for the next 37 hours. Number 9 is the dimension of observed meteorological variables. Number 31 (dimension of decoder inputs) consists of concatenating
Timesteps ID and Station ID into 29dimension of NWP forecasting (2+29=31). Number 3 is the ground truth number for 3 target variables. The size of the final training set is 1148*10=11480. The size of validation set is 87*10=870, which is used for earlystopping. The size of test set on each day is 1*10=10.4.4. Performance analysis
Table 1 presents the evaluation by score based on rolling forecasting, with incremental data releasd on a daily basis for nine days to mimic realworld forecasting processes.
4.4.1. Effect of information fusion
Comparing DUQ
with DUQ validates the effectiveness of fusing NWP forecasting. Comparing DUQ with DUQ validates the effectiveness of modeling recent meteorological dynamics.
4.4.2. Effect of deep learning
On average, the deep learningbased models (DUQ and Seq2Seq) perform better than the nondeep learning models (SARIMA, SVR, GBRT). Comparing DUQ and DUQ validates the influence of deeper layers. Comparing DUQ, DUQ, and DUQ validates the effectiveness of nodes under the same number of layers.
4.4.3. Effect of loss function
A notable result is that DUQ trained by NLE loss performs much better than Seq2Seq (MSE loss) and Seq2Seq (MAE loss). In order to empirically understand the reasons for better generalization when trained by NLE, we calculated the when earlystopping was triggered, as shown in Table 5. It can be seen that DUQ requires more iterations to converge. A reasonable interpretation is that NLE loss jointly implements two tasks i.e., mean optimization and variance optimization, which need more iterations to converge. This joint optimization may to some extent play a regularization role and help each other out of the local minimum. It may therefore require more iterations to converge and may have better generalization. We believe that this phenomenon deserves the attention of researchers, and that it should be proved by theory in followup work.
4.4.4. Effect of ensemble
The ensemble model DUQ was used in the online competition ^{3}^{3}3Readers can refer to https://challenger.ai/competition/wf2018 to check our online scores which are consistent with DUQ during Day 3Day 9. During Day 1 and Day 2 of the online competition, DUQ had not been developed. In this paper we reevaluate DUQ offline on Day 1 and Day 2. This is also why DUQ with SS=0.4673 is better than the Model (0.4671) but was only awarded 2nd place online.. DUQ achieved the best , which indicates that ensemble with more DUQ models would provide a better solution.
4.4.5. Significance of Ttest
Because realworld forecasting only covers nine days, we implemented a onetail paired Ttest with significance level to ensure that our results were statistically significant. The column Pvalue between DUQ
and others shows that each Ttest has been passed which means that our method DUQ
is significantly better than any other baseline under the specified significance level. For the single model, we also implemented a Ttest between DUQ and other baselines including DUQ, DUQ, Seq2Seq, Seq2Seq to ensure that DUQ had significant effectiveness, which is shown in Table. 4.4.4.6. Evaluation by RMSE
We also evaluated all methods by as shown in Table 2. Since and do not have a fully linear relationship, the counterpart assessment does not reach the optimum at the same time while DUQ still achieves the best . The related Pvalue also indicates that DUQ is significantly better than other baselines under . Because online evaluation does not release the RMSE ranking, the RMSE of Model place is not shown in Table 2.
4.4.7. Discuss instability of weather forecasting
Due to meteorological instability and variability, no single model can achieve the best scores every day  not even the ensemble method. Sometimes a single model can achieve the highest score, such as DUQ on Day 2 and DUQ on Day 7. Overall, however, the ensemble method DUQ achieves the greatest score benefit from the stability of ensemble learning. The instability of meteorological elements also reflects the need for a prediction interval.
4.4.8. Quantity of prediction interval
An effective prediction interval should satisfy that is equal to or greater than the predefined . Table 3 shows the results. In particular, the on Day 3 seems far below expectations. The main reason is that forecasting of is not accurate on that day. The online competition scores of all contestants were particular low on that day. Generally, our approach meets the requirement .
4.4.9. Quality of prediction interval
We take the model DUQ to visualize the quality of the prediction interval. Fig. 4 illustrates a forecasting instance at one station on a competition day. In each subfigure, the left green line is the observed meteorological value during the previous 28 hours, the right green line is the ground truth, the blue line is the NWP prediction, the red line is DUQ prediction and the red shaded area is the 90% prediction interval. A noticeable observation is that the prediction interval does not become wider over time, instead, it presents that the width of the middle part is narrower than both ends particularly for t2m and rh2m.A reasonable explanation is that meteorological elements largely change during the daytime and become more stable during night time. Having this prediction interval would provide more information for travel/production planning than only point prediction. Another noteworthy point is that because w10m fluctuates sharply, it is more difficult to forecast point estimate precisely, and the prediction interval tends to be wider than t2m and rh2m.
Method  Pvalue  
SARIMA  0.1249  1.4632  0.2417  0.4421  0.2631  0.2301  0.0630  0.2015  0.4579  0.3010  0.00 
SVR  0.7291  0.6342  0.1999  0.5918  1.1230  0.8568  0.6154  0.5123  0.5807  0.6492  0.00 
GBRT  0.0221  0.1318  0.0086  0.0396  0.0960  0.0067  0.0772  0.0859  0.0000  0.0199  0.00 
DUQ  0.4813  0.4833  0.2781  0.3053  0.4277  0.4853  0.4609  0.4987  0.2647  0.4095  0.00 
DUQ  0.4847  0.4969  0.3088  0.4012  0.4302  0.5051  0.5656  0.5502  0.3239  0.4518  0.00 
DUQ  0.5278  0.5088  0.2890  0.3797  0.4479  0.5358  0.4961  0.5235  0.3478  0.4507  0.02 
DUQ  0.5220  0.5002  0.3352  0.4067  0.4474  0.5289  0.5324  0.5463  0.3047  0.4582  0.00 
DUQ  0.2348  0.2992  0.0081  0.2440  0.1630  0.3125  0.2660  0.3003  0.1599  0.1853  0.00 
DUQ  0.4694  0.4744  0.2624  0.3447  0.3925  0.4588  0.4756  0.4901  0.3150  0.4092  0.00 
Seq2Seq  0.4978  0.3934  0.2860  0.3960  0.3965  0.4842  0.4820  0.5138  0.3192  0.4188  0.00 
Seq2Seq  0.5314  0.4346  0.2671  0.3980  0.4610  0.5391  0.4711  0.5565  0.2999  0.4399  0.00 
DUQ (2rd place)  0.5216  0.4951  0.3358  0.4050  0.4627  0.5359  0.5350  0.5664  0.3479  0.4673  0.04 
DUQ  0.5339  0.4940  0.3516  0.4355  0.4600  0.5575  0.5581  0.5776  0.3298  0.4776   
Model  0.4307  0.4847  0.3088  0.4572  0.5019  0.5753  0.5345  0.5726  0.3384  0.4671  0.24 
Method  Pvalue  
NWP  7.5923  9.7276  5.1079  5.7335  6.1542  7.5239  6.3647  7.0457  5.8819  6.7924  0.00 
SARIMA  7.3017  16.5954  7.2964  8.9968  8.0726  8.1440  7.1722  5.8466  8.1795  8.6228  0.00 
SVR  8.1788  10.0436  5.9310  7.1662  7.7576  8.4064  7.9094  8.4749  7.5431  7.9346  0.00 
GBRT  6.5551  8.0321  5.6892  6.1422  6.1083  6.5687  5.9449  6.7122  5.9573  6.4122  0.00 
DUQ  3.0432  4.5256  4.1858  4.5445  3.0069  3.4912  3.8087  3.2299  4.5545  3.8211  0.00 
DUQ  2.9013  4.2442  4.0203  3.6641  3.2275  3.2062  2.6428  2.8175  4.3089  3.4481  0.00 
DUQ  2.8128  4.0400  4.1624  4.0732  2.8580  2.8086  3.2870  3.1963  4.3326  3.5079  0.03 
DUQ  2.7168  4.0615  3.8866  3.7977  2.8083  2.9211  2.8012  2.9784  4.3308  3.3669  0.16 
DUQ  5.0371  5.5370  5.0529  4.6819  4.1385  4.4716  5.7058  5.8346  7.6805  5.3489  0.00 
DUQ  3.2170  4.8604  4.4150  4.1303  3.5896  3.8239  3.4992  3.2031  4.3008  3.8933  0.00 
Seq2Seq  3.1328  5.0769  4.1400  3.8426  3.2040  3.3142  3.3027  3.1785  4.6426  3.7594  0.00 
Seq2Seq  2.7272  5.0933  4.2837  4.0184  2.7888  3.0029  3.7165  2.7935  4.6509  3.6750  0.00 
DUQ  2.8000  4.4338  3.7054  3.7886  2.8566  2.7890  2.7979  2.8011  4.3310  3.3670  0.05 
DUQ  2.7027  4.3341  3.7999  3.5743  2.7627  2.6874  2.7799  2.7402  4.3949  3.3085   

Day 1  Day 2  Day 3  Day 4  Day 5  Day 6  Day 7  Day 8  Day 9  
0.9513  0.9351  0.8918  0.8891  0.9594  0.9648  0.9027  0.8945  0.9351  0.9249  
0.9945  0.8702  0.7648  0.8729  0.9621  0.9621  0.9243  0.9243  0.9135  0.9099  
0.9567  0.9567  0.9081  0.9594  0.9675  0.9621  0.9378  0.9648  0.9540  0.9519 
Pvalue on RMSE  Pvalue on SS  
DUQ     
DUQ  0.00  0.00 
DUQ  0.00  0.00 
Seq2Seq  0.00  0.00 
Seq2Seq  0.03  0.08 
Methods  ti (vt vi) 
DUQ  2900 (58*50) 
Seq2Seq  2100 (42*50) 
DUQ  1950 (39*50) 
Seq2Seq  1850 (37*50) 
DUQ  1450 (29*50) 
5. Conclusions and Future Works
This paper addresses the realworld problem in weather forecasting which has a profound impact on our daily life, by introducing a new deep uncertainty quantification (DUQ) method. A novel loss function called negative loglikelihood error (NLE) was designed to train the prediction model, which is capable of simultaneously inferring sequential point estimation and prediction interval. A noteworthy experimental phenomenon reported in this paper is that training by NLE loss significantly improves the generalization of point estimation. This may provide practitioners with new insights to develop and deploy learning algorithms for related problems such as time series regression. Based on the proposed method and an efficient deep ensemble strategy, stateoftheart performance on a realworld benchmark dataset of weather forecasting was achieved. The overall method was developed in Keras and was flexible enough to be deployed in the production environment. The data and source codes will be released and can be used as a benchmark for researchers and practitioners to investigate weather forecasting. Future works will be directed towards architecture improvement (e.g., attention mechanism), automatic hyperparametertuning, and theoretical comparison between NLE and MSE/MAE.
Acknowledgements.
This work was supported by the Australian Research Council (No. DP150101645), the Natural Science Foundation of China (No. 61773324) and the Fundamental Research Funds for the Central Universities (No. 2682015QM02) .References
 (1)
 Chen and Lai (2011) Ling Chen and Xu Lai. 2011. Comparison between ARIMA and ANN models used in shortterm wind speed forecasting. In Power and Energy Engineering Conference (APPEEC), 2011 AsiaPacific. IEEE, 1–4.
 Gers et al. (2002) Felix A Gers, Douglas Eck, and Jürgen Schmidhuber. 2002. Applying LSTM to time series predictable through timewindow approaches. In Neural Nets WIRN Vietri01. Springer, 193–200.
 Gneiting and Raftery (2005) Tilmann Gneiting and Adrian E Raftery. 2005. Weather forecasting with ensemble methods. Science 310, 5746 (2005), 248–249.
 Grover et al. (2015) Aditya Grover, Ashish Kapoor, and Eric Horvitz. 2015. A deep hybrid model for weather forecasting. In International Conference on Knowledge Discovery and Data Mining. ACM, 379–386.
 Hernández et al. (2016) Emilcy Hernández, Victor SanchezAnguix, Vicente Julian, Javier Palanca, and Néstor Duque. 2016. Rainfall prediction: A deep learning approach. In International Conference on Hybrid Artificial Intelligence Systems. Springer, 151–162.
 Khosravi et al. (2011) Abbas Khosravi, Saeid Nahavandi, Doug Creighton, and Amir F Atiya. 2011. Comprehensive review of neural networkbased prediction intervals and new advances. TNN 22, 9 (2011), 1341–1356.
 Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. In Neural Information Processing Systems. 6402–6413.
 Lu et al. (2018) J. Lu, A. Liu, F. Dong, F. Gu, J. Gama, and G. Zhang. 2018. Learning under concept drift: A review. IEEE Transactions on Knowledge and Data Engineering (2018), 1–1. https://doi.org/10.1109/TKDE.2018.2876857
 Marchuk (2012) G Marchuk. 2012. Numerical methods in weather prediction. Elsevier.
 Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Neural Information Processing Systems. 3111–3119.
 Murphy (1993) Allan H Murphy. 1993. What is a good forecast? An essay on the nature of goodness in weather forecasting. Weather and Forecasting 8, 2 (1993), 281–293.

Nix and Weigend (1994)
David A Nix and
Andreas S Weigend. 1994.
Estimating the mean and variance of the target probability distribution. In
International Conference on Neural Networks, Vol. 1. 55–60.  Pearce et al. (2018) Tim Pearce, Mohamed Zaki, Alexandra Brintrup, and Andy Neely. 2018. HighQuality prediction intervals for deep learning: A distributionfree, ensembled approach. In International Conference on Machine Learning, Vol. 80. 4075–4084.
 Richardson (2007) Lewis Fry Richardson. 2007. Weather prediction by numerical process. Cambridge University Press.
 Roberts et al. (2013) Stephen Roberts, Michael Osborne, Mark Ebden, Steven Reece, Neale Gibson, and Suzanne Aigrain. 2013. Gaussian processes for timeseries modelling. Philosophical Transactions of the Royal Society A 371, 1984 (2013), 20110550.
 Sapankevych and Sankar (2009) Nicholas I Sapankevych and Ravi Sankar. 2009. Time series prediction using support vector machines: a survey. IEEE Computational Intelligence Magazine 4, 2 (2009).
 Sharma et al. (2011) Navin Sharma, Pranshu Sharma, David Irwin, and Prashant Shenoy. 2011. Predicting solar generation from weather forecasts using machine learning. In International Conference on Smart Grid Communications. IEEE, 528–533.
 Shi et al. (2015) Xingjian Shi, Zhourong Chen, Hao Wang, DitYan Yeung, WaiKin Wong, and Wangchun Woo. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Neural Information Processing Systems. 802–810.
 Srivastava et al. (2015) Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. 2015. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning. 843–852.
 Tolstykh and Frolov (2005) MA Tolstykh and AV Frolov. 2005. Some current problems in numerical weather prediction. Izvestiya Atmospheric and Oceanic Physics 41, 3 (2005), 285–295.
 Voyant et al. (2012) Cyril Voyant, Marc Muselli, Christophe Paoli, and MarieLaure Nivet. 2012. Numerical weather prediction (NWP) and hybrid ARMA/ANN model to predict global radiation. Energy 39, 1 (2012), 341–355.
 Wang and Yeung (2016) Hao Wang and DitYan Yeung. 2016. Towards Bayesian deep learning: A framework and some existing methods. IEEE Transactions on Knowledge and Data Engineering 28, 12 (2016), 3395–3408.
 Yi et al. (2018) Xiuwen Yi, Junbo Zhang, Zhaoyuan Wang, Tianrui Li, and Yu Zheng. 2018. Deep Distributed Fusion Network for Air Quality Prediction. In International Conference on Knowledge Discovery and Data Mining. ACM, 965–973.
 Yi et al. (2016) Xiuwen Yi, Yu Zheng, Junbo Zhang, and Tianrui Li. 2016. STMVL: Filling Missing Values in Geosensory Time Series Data. In International Joint Conference on Artificial Intelligence.
 Zhang et al. (2018) Junbo Zhang, Yu Zheng, Dekang Qi, Ruiyuan Li, Xiuwen Yi, and Tianrui Li. 2018. Predicting citywide crowd flows using deep spatiotemporal residual networks. Artificial Intelligence 259 (2018), 147–166.
 Zheng et al. (2014) Yu Zheng, Licia Capra, Ouri Wolfson, and Hai Yang. 2014. Urban computing: Concepts, methodologies, and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 5, 3 (2014), 38.
 Zhou and Matteson (2015) Zhengyi Zhou and David S Matteson. 2015. Predicting ambulance demand: A spatiotemporal kernel approach. In International Conference on Knowledge Discovery and Data Mining. ACM, 2297–2303.