Spectral Attention Autoregressive Model (SAAM)
Time series forecasting is an important problem across many domains, playing a crucial role in multiple real-world applications. In this paper, we propose a forecasting architecture that combines deep autoregressive models with a Spectral Attention (SA) module, which merges global and local frequency-domain information in the model's embedded space. By characterizing in the spectral domain the embedding of the time series as occurrences of a random process, our method can identify global trends and seasonality patterns. Two spectral attention models, global and local to the time series, integrate this information within the forecast and perform spectral filtering to remove the time series' noise. The proposed architecture has a number of useful properties: it can be effectively incorporated into well-known forecast architectures, requiring a low number of parameters and producing interpretable results that improve forecasting accuracy. We test the Spectral Attention Autoregressive Model (SAAM) on several well-known forecast datasets, consistently demonstrating that our model compares favorably to state-of-the-art approaches.
Time series forecasting, which consists of analyzing historical signal patterns to predict future outcomes, is an important problem with scientific, business, and industrial applications, playing an important role in daily life. Several fields benefit from time series forecasting, such as finance, where these models estimate the future movements of stock markets [engle1982autoregressive, ding2015deep, chatigny2020financial, kim2003financial, sezer2020financial], climate prediction [ak2015interval, le2020deep, shi2015convolutional], forecasting of energy consumption and demand [masum2018multi, wang2019review, dudek2021hybrid], and product demand and supply [simchi2008designing, merkuryeva2019demand, akyuz2017ensemble], which helps optimize resource allocation, allowing for cost reduction and profit maximization.

Early approaches to time series forecasting rely on statistical models, such as State Space Models (SSMs) [durbin2012time], exponential smoothing [hyndman2008forecasting], [McKenzie_1984], or matrix factorization methods [yu2016temporal], [Hyndman_2011], which learn information via a matrix that relates different time series. Auto-Regressive Integrated Moving Average (ARIMA) [box2008time] has been one of the most popular solutions; it produces its predictions as a weighted sum of past observations, making the model vulnerable to error accumulation. For more information on classic techniques, we refer readers to [Box_1968, hamilton1994time, lutkepohl2005new].

Regardless, all these classic approaches share a number of weaknesses. They make linearity assumptions on the data, which, together with their limited scalability, makes them unsuitable for modern large-scale forecasting tasks. Furthermore, they incorporate prior knowledge about time series composition, like trends or seasonality patterns present in the data, which requires manual feature engineering and model design by domain experts in order to achieve good results [harvey1990forecasting]. Moreover, they do not usually share a global set of parameters among the different time series, which implies bad generalization and poor performance beyond single, linear time series prediction.
Deep neural networks [sutskever2014sequence], [graves2013generating] are an alternative solution for time series forecasting. They are able to model nonlinear temporal patterns, easily identify complex structures across time series, efficiently extract higher-order features, and allow a data-driven approach that requires little to no human feature engineering [hwang2015recurrent, sutskever2014sequence, graves2013generating, giuliari2021transformer]. Recurrent Neural Networks (RNNs) [funahashi1993approximation], [lai2018modeling] and Long Short-Term Memory (LSTM) networks [hochreiter1997long] have achieved good results in temporal modeling. DeepAR [salinas2020deepar] set a milestone by using an LSTM to compute an embedding later used by a probabilistic model to forecast in an encoder-decoder fashion. Recently, attention models [bahdanau2014neural], [li2019enhancing] have been used by these recurrent architectures to selectively focus on some segments of the data while making predictions; e.g., in machine translation, only certain words in the input sequence may be relevant for predicting the next word. To do so, these models use an inductive bias that connects each token in the input through a relevance-weighted basis of every other token.

The idea behind recurrent attention leads to Transformer models [vaswani2017attention], which have become one of the most popular methods for time series forecasting. Initially introduced for Natural Language Processing (NLP), Transformers proposed a completely new architecture where a self-attention mechanism is used to process sequences of data. Several modifications must be made in order to apply Transformer models to time series forecasting. ConvTrans [li2019enhancing] solved most of the difficulties associated with Transformers for this specific problem. An alternative to the canonical self-attention of these models was designed to make them aware of the local context via a CNN [lecun1995convolutional]. At the same time, a modification of the attention mechanism to reduce the computational cost of self-attention was also introduced.

Many models have also tried to join classical approaches with deep learning techniques, such as Deep State Space Models for Time Series Forecasting (DSSM) [rangapuram2018deep] or Deep Factors for Forecasting [wang2019deep]. Various attempts to merge signal processing techniques with deep neural networks can also be found in the literature. In [tamkin2020language], a framework that uses spectral filtering for NLP was proposed, while [Cao2020SpectralTG] uses the spectral domain to jointly capture inter-series correlations and temporal dependencies.
Nevertheless, deep autoregressive models also present some inconveniences. First, they tend to focus on recent past data to predict the future of the time series, frequently wasting important global information not encapsulated in previous predictions. Second, like classical time series forecasting methods, they suffer from error accumulation and propagation [cheng2006multistep], a problem closely related to the previous one. Third, they do not produce interpretable results, nor can we clearly explain how they reach them [castelvecchi2016can].
In this paper we show that the described problems can be partially alleviated by incorporating signal processing filtering techniques into the autoregressive models that perform the time series forecasting. With respect to the inability of these models to focus on the global context, we can obtain the time series' most important trends via a frequency-domain characterization. These trends can be intelligently incorporated during the forecasting, hence making the local context aware of the time series' global patterns. Regarding error accumulation and noisy local context, spectral filtering can be applied to decide at every time instant which frequencies are useful and which can be suppressed, eliminating unwanted components that do not help during forecasting. Finally, these signal processing tools, which operate in the spectral domain, produce more interpretable internal representations, as will be shown in the experiments section, making it possible to extract the explainable frequency-domain features that drive the predictions if necessary.
To integrate the previous solutions, we propose a general architecture, the Spectral Attention Autoregressive Model (SAAM). SAAM's modularity allows it to be effectively incorporated into a variety of deep autoregressive models. This architecture uses two spectral attention models to determine, at every time instant, relevant global patterns, as well as to remove the local context's noise while performing the forecasting. Both operations are performed in the frequency domain of the embedded space.

To the best of our knowledge, SAAM is the first deep neural autoregressive model that exploits attention mechanisms in the spectral domain. A global-local architecture marries deep neural networks with classic signal processing techniques in this new frequency-domain attention framework, incorporating relevant global trends into the forecast and performing spectral filtering to prevent error accumulation. Further, the additional complexity due to Spectral Attention is comparable to that of classic attention models in the temporal domain.
We perform extensive experiments on both synthetic and real-world time series datasets, showing the effectiveness of the proposed Spectral Attention module, which consistently outperforms the base models and reaches state-of-the-art results. Ablation studies further prove the effectiveness of the designed architecture.

The rest of this paper is organized as follows. Section II states the time series forecasting problem, presents a base architecture to which we append the proposed Spectral Attention module, and provides a characterization of the time series in the spectral domain. Section III describes our model and Section IV proves its effectiveness, both quantitatively and qualitatively. We conclude the paper in Section V.

In this section, we formally state the problem of time series forecasting and introduce a base architecture that represents the core of most deep learning-based autoregressive models in the state-of-the-art. A frequency-domain characterization of the time series is also proposed.
Given a set of $N$ univariate time series $\{z_{i,1:T}\}_{i=1}^{N}$, where $z_{i,1:T} = (z_{i,1}, \dots, z_{i,T})$, $t_0$ is the forecast horizon, $\tau$ the forecast length, and $T$ the sequences' total length, our goal is to model the conditional probability distribution of future trajectories of each time series given the past, namely, to predict the next $\tau$ time steps after the forecast horizon:

$$p\left(z_{i, t_0:T} \mid z_{i, 1:t_0-1}, x_{i, 1:T}; \Phi\right), \qquad (1)$$

where $\Phi$ are the learnable parameters of the model and $x_{i,1:T}$ the associated covariates. These covariates are, together with the time series' past observations, the input to our predictive model.
A number of deep autoregressive models in the state-of-the-art, including DeepAR [salinas2020deepar], N-BEATS [oreshkin2019n] and ConvTrans [li2019enhancing], can be characterized by means of a high-level architecture, represented in Fig. 1. This general framework is composed of two parts:
An embedding function $h_{i,t} = f(h_{i,t-1}, z_{i,t-1}, x_{i,t}; \theta)$, with transition function $f$ and parameters $\theta$. This embedding receives as input, at time $t$, the time series' previous value $z_{i,t-1}$, the covariates $x_{i,t}$, and the past value of the embedding $h_{i,t-1}$. The embedding function can be implemented in different ways, with an RNN [rumelhart1986learning], [hopfield1982neural], an LSTM [gers2000learning], or a Temporal Convolutional Network (TCN) [lea2016temporal].
A probabilistic model $p(z_{i,t} \mid h_{i,t}; \omega)$, with parameters $\omega$, which uses the embedding $h_{i,t}$ to estimate the time series' next value, $z_{i,t}$.

This probabilistic model is usually implemented as a function of a neural network that parameterizes the required probability distribution; e.g., a Gaussian distribution can be represented through its mean and standard deviation as $\mu_{i,t} = g_{\mu}(h_{i,t})$, $\sigma_{i,t} = g_{\sigma}(h_{i,t})$, where $g_{\mu}$ and $g_{\sigma}$ are neural networks. The optimal model parameters are obtained by maximizing the log-likelihood function for the observed data in the conditioning range, i.e., from $t = 1$ to $t_0 - 1$. All quantities required for computing the log-likelihood function are deterministic, which means that no inference is required:
$$\mathcal{L} = \sum_{i=1}^{N} \sum_{t=1}^{t_0 - 1} \log p\left(z_{i,t} \mid z_{i,1:t-1}, x_{i,1:t}; \Phi\right). \qquad (2)$$
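As a concrete illustration, the Gaussian head and its conditioning-range log-likelihood can be sketched in a few lines of NumPy. The single linear maps below stand in for the mean and standard-deviation networks, and all names and shapes are our own assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the neural networks producing mu and sigma:
# one linear map per parameter, acting on a d=5 dimensional embedding.
W_mu = rng.normal(size=5)
W_sigma = rng.normal(size=5)

def gaussian_params(h):
    """Map embeddings h of shape (T, d) to per-step mean and std."""
    mu = h @ W_mu
    sigma = np.log1p(np.exp(h @ W_sigma))  # softplus keeps sigma > 0
    return mu, sigma

def conditioning_log_likelihood(z, h):
    """Gaussian log-likelihood summed over the conditioning range;
    every quantity is deterministic, so no inference is needed."""
    mu, sigma = gaussian_params(h)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (z - mu) ** 2 / (2 * sigma**2))

h = rng.normal(size=(8, 5))  # embeddings h_1..h_8
z = rng.normal(size=8)       # observed values z_1..z_8
ll = conditioning_log_likelihood(z, h)
```

Maximizing this quantity with respect to the network parameters (here, the weight vectors) is what the training loop of the base architecture does.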
During both training and testing, the conditioning range $[1, t_0 - 1]$, which is analogous to the encoder of seq2seq models [sutskever2014sequence], transfers information to the forecasting range $[t_0, T]$, analogous to the decoder. Therefore, this base framework can be interpreted as an encoder-decoder architecture, with the consideration that both encoder and decoder are the same network, as Fig. 2 shows.

For forecasting, samples can be drawn directly from the model, $\tilde{z}_{i,t} \sim p(z_{i,t} \mid h_{i,t}; \omega)$, with the model consuming the previous timestep's prediction as input, unlike during the conditioning range, where $z_{i,t-1}$ is observed. This is illustrated in Fig. 2.

Note that Transformers, unlike RNNs or LSTMs, do not compute the embedding in a sequential manner. Accordingly, when obtaining the embedding through a Transformer model [wu2020deep], and in order to use the encoder-decoder architecture previously described, we use the Transformer in decoder-only mode, as introduced in [liu2018generating].
Our approach in this paper exploits information in the spectral domain. In this regard, the time series' embedding space can be statistically characterized as instances of a random process, for which spectral information can be analyzed using the expected autocorrelation and the Power Spectral Density (PSD) [buttkus2012spectral].
The power spectrum per embedding dimension can be calculated from an averaged autocorrelation estimated from a finite number $N$ of time series of duration $T$ each:

$$\hat{r}^{(j)}_{e}[l] = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{T} \sum_{t=1}^{T-l} e^{(j)}_{i,t}\, e^{(j)}_{i,t+l}, \qquad (3)$$

where $\hat{r}^{(j)}_{e}[l]$ is the expected autocorrelation for the $j$-th dimension of the embedded sequences $e^{(j)}_{i}$, and $l$ is the lag between observations.
The Blackman-Tukey method [blackman1958measurement], which takes advantage of the Discrete Fourier Transform (DFT) of a windowed autocorrelation, can be used to estimate the PSD once the autocorrelation has been computed:

$$\hat{S}^{(j)}_{e}[k] = \mathrm{DFT}_{K}\left\{ w[l]\, \hat{r}^{(j)}_{e}[l] \right\}, \qquad (4)$$

where the $K$-point DFT decomposes a function into its constituent frequencies ($K$ is usually equal to the time series' length $T$) and $w[l]$ is a window function, obtaining the estimated spectrum of the random process that generates the time series in the embedding space. Notice that $\hat{S}^{(j)}_{e}$ is a spectral characterization of a single dimension of the embedding space, hence not global to the process. In Section III, we propose an alternative autocorrelation function that considers each of the embedding's dimensions as independent realizations of the same random process that generates the time series, therefore obtaining a global spectral representation of the process.
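The averaged-autocorrelation-plus-windowed-DFT recipe above can be sketched in NumPy. The window choice, FFT size, and lag count below are illustrative assumptions rather than the paper's exact settings:

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Biased autocorrelation estimate r[l] = (1/T) sum_t x[t] x[t+l]."""
    T = len(x)
    return np.array([x[:T - l] @ x[l:] / T for l in range(max_lag)])

def blackman_tukey_psd(sequences, max_lag=64, n_fft=256):
    """Average the autocorrelation over N realizations, window the
    symmetric lag sequence, and take its DFT (Blackman-Tukey)."""
    r = np.mean([autocorrelation(s, max_lag) for s in sequences], axis=0)
    r_sym = np.concatenate([r[:0:-1], r])  # lags -(L-1) .. (L-1)
    w = np.blackman(len(r_sym))            # assumed window choice
    return np.abs(np.fft.rfft(r_sym * w, n=n_fft))

# N realizations of the same process: a 0.1 cycles/sample tone with a
# random phase per realization; the PSD should peak near bin 0.1 * n_fft.
rng = np.random.default_rng(1)
t = np.arange(200)
seqs = [np.cos(2 * np.pi * 0.1 * t + rng.uniform(0, 2 * np.pi))
        for _ in range(20)]
psd = blackman_tukey_psd(seqs)
```

Averaging the autocorrelation over realizations before the DFT is what distinguishes this process-level estimate from a periodogram of a single sequence.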
In this section, we introduce the Spectral Attention Autoregressive Model (SAAM), a general framework that incorporates a Spectral Attention (SA) module able to exploit the embedding function's spectral information using attention mechanisms to solve two main tasks:

Time series are governed by global trends and seasonality structures. SAAM captures these global patterns and incorporates them into the forecast.

Time series exhibit noise around these primary trends that hinders the forecasting process. SAAM removes this noise using spectral filtering, improving the signal-to-noise ratio and thus alleviating the error propagation problem that autoregressive models suffer from.

These two operations, incorporating global trends into the forecast and filtering the time series' local context, are encapsulated in the SA module, which is responsible for all frequency-domain operations. As such, it can be incorporated into any deep autoregressive structure.
The resulting architecture, displayed in Fig. 3, is therefore composed of three main parts: the embedding function, the SA module, and the probabilistic model. For further details on the embedding and probabilistic model, common to the base architecture of Fig. 1, we refer the reader to Section II-B. With respect to the SA module, we now explain in detail how it integrates both global and local information into the forecasting.

To incorporate global patterns into the prediction, we perform a frequency-domain characterization of the process, as explained in Section II-C. We aim to exploit information in the spectral domain associated with the neural embeddings of a number of time series.
Nevertheless, we are not interested in characterizing each embedding dimension independently, as using Eq. 3 entails. This would require a multidimensional PSD to characterize the process over all embedding dimensions. Instead, we consider each dimension of the embedding as an independent realization of the same process and average over them, resulting in the following autocorrelation function:

$$\hat{r}_{e}[l] = \frac{1}{N D} \sum_{i=1}^{N} \sum_{j=1}^{D} \frac{1}{T} \sum_{t=1}^{T-l} e^{(j)}_{i,t}\, e^{(j)}_{i,t+l}, \qquad (5)$$

where $e^{(j)}_{i}$ is the $j$-th dimension of the $i$-th embedded sequence and $D$ is the embedding's number of dimensions. Applying the Blackman-Tukey method of Eq. 4 to this autocorrelation function yields the complete frequency-domain characterization of the process, incorporating all the global patterns across the different dimensions of the embedding into a single spectral representation, $\hat{S}_{e}$.
To perform the local spectral filtering, we analyze the last $N'$ values of each individual sample: a buffer encapsulates the $N'$ previous embeddings, each of dimension $D$. This embedded, buffered signal at time $t$ is transformed to the frequency domain via a DFT with $K$ points. Note that we retain only the modulus of the Fourier transform.
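The local characterization can be sketched as the magnitude DFT of a sliding buffer of embeddings; the buffer length and FFT size below are illustrative:

```python
import numpy as np

def local_spectrum(buffer, n_fft=64):
    """buffer: (N', D) array holding the last N' embeddings of one
    sequence. Returns the magnitude DFT per embedding dimension,
    shape (n_fft // 2 + 1, D); only the modulus is retained."""
    return np.abs(np.fft.rfft(buffer, n=n_fft, axis=0))

# Toy buffer: dimension 0 oscillates at 0.25 cycles/sample, dimension 1
# is constant, so their magnitude spectra peak at different bins.
n_prime = 32
buffer = np.stack([np.cos(2 * np.pi * 0.25 * np.arange(n_prime)),
                   np.ones(n_prime)], axis=-1)
M = local_spectrum(buffer)
```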
We combine both global and local spectral information to modify the embedded representation of the time series. This is done through two spectral attention models contained in the SA module:

Global Spectral Attention: this frequency-domain attention model is responsible for incorporating, at time $t$ and for each time series, the global spectral information into the forecast. To do so, it uses the time series' local context, summarized via the embedding function, as the key to select the relevant frequency components of the global spectral representation that should be included during the prediction, with the global representation repeated along the dimensions of the embedding. The global filter's coefficients, produced by a neural network, take values in $[0, 1]$.

Local Spectral Attention: as in the Global Spectral Attention model, the embedded representation is used as the key to determine the relevant and irrelevant local spectral components, with the local filter's coefficients obtained for each of the embedding's dimensions.

We combine both spectral-domain attention models through multiplication and addition operations in the frequency domain. The multiplication over the embedding's local spectrum representation performs the local spectral filtering, setting to zero irrelevant frequency components through the local attention model. Furthermore, the addition includes significant global trends into the forecast via the global attention model, which selects relevant patterns from the process's global spectrum representation.
Finally, this spectral representation is transformed back to the time domain, and its last value is fed to the probabilistic model to forecast the next timestep.
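Putting the pieces together, one hedged sketch of the multiply-add-invert step follows. The attention coefficients are passed in as plain arrays (in SAAM they come from neural networks keyed on the embedding), and, for invertibility, this sketch applies the real-valued masks to the complex local spectrum rather than to its modulus:

```python
import numpy as np

def spectral_attention_step(buffer, global_spec, a_local, a_global,
                            n_fft=64):
    """Filter the local spectrum with a_local, add the attended global
    trends a_global * global_spec, and map back to the time domain.
    The last time-domain value is what feeds the probabilistic model."""
    S = np.fft.rfft(buffer, n=n_fft, axis=0)             # (n_fft//2+1, D)
    S_f = a_local * S + a_global * global_spec[:, None]  # filter + trends
    e = np.fft.irfft(S_f, n=n_fft, axis=0)               # back to time
    return e[len(buffer) - 1]                            # filtered h_t

rng = np.random.default_rng(3)
buffer = rng.normal(size=(32, 4))        # last N' = 32 embeddings, D = 4
bins = 64 // 2 + 1
global_spec = rng.uniform(size=bins)     # stand-in global PSD estimate

# Sanity check: an all-pass local filter with no global addition must
# return the unmodified last embedding.
identity = spectral_attention_step(buffer, global_spec,
                                   a_local=np.ones((bins, 4)),
                                   a_global=np.zeros((bins, 4)))
```

Setting entries of `a_local` to zero suppresses the corresponding local frequency bins, while nonzero entries of `a_global` inject energy at globally dominant frequencies.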
The likelihood of the model, which is maximized to match the statistical properties of the data, now includes the SA module's parameters, $\lambda$:

$$\mathcal{L} = \sum_{i=1}^{N} \sum_{t=1}^{t_0 - 1} \log p\left(z_{i,t} \mid z_{i,1:t-1}, x_{i,1:t}; \Phi, \lambda\right). \qquad (6)$$

Once the parameters have been learned over the conditioning range, forecasts can be produced as in Section II-B, computing the joint distribution over the forecasting range for each time series.

Notice that, during both training and testing, global spectral information is obtained using a minibatch of time series different from the one being forecasted.
We conduct experiments with both synthetic and real-world datasets in order to provide evidence of the superior forecasting ability of SAAM. Moreover, ablation studies are conducted. The code for our proposed model is publicly available on GitHub (https://github.com/fmorenopino/SAAM) [fernando_moreno_pino_2021_5086179].
Many well-known deep autoregressive models that define the state-of-the-art can take advantage of the Spectral Attention module proposed in Section III. This allows them to filter out irrelevant spectral components and to recognize global trends that can be incorporated into the forecast.

We integrated the SA module into two forecasting models: DeepAR [salinas2020deepar], a widely known and largely used model with several industrial applications [schelter2018automating, bose2017probabilistic, schelter2018challenges, fildes2019retail, alexandrov2019gluonts], which employs an LSTM to perform the embedding; and ConvTrans [li2019enhancing], a more recent Transformer-based proposal, which constitutes one of the most efficient approaches to using Transformers for time series forecasting.
We would like to remark that the complexity added by the Spectral Attention module is equivalent to that of classic attention models such as [bahdanau2014neural]. In those models, for a sequence of length $L$, computing scores between every pair of positions causes $O(L^2)$ memory use. The complexity of Spectral Attention instead depends on the embedding's dimension $D$ and the number of points $K$ used to compute the Fourier transform.
The normalized quantile loss, $QL_{\rho}$, with $\rho \in (0, 1)$, which quantifies the accuracy of a quantile $\rho$ of the predictive distribution, is the main metric used to report the results of the experiments, as in many other works [salinas2020deepar], [li2019enhancing], [rangapuram2018deep], [seeger2016bayesian]:

$$QL_{\rho}(z, \hat{z}) = 2\,\frac{\sum_{i,t} P_{\rho}\left(z_{i,t}, \hat{z}_{i,t}\right)}{\sum_{i,t} |z_{i,t}|}, \quad P_{\rho}(z, \hat{z}) = \begin{cases} \rho\,(z - \hat{z}) & \text{if } z > \hat{z}, \\ (1 - \rho)\,(\hat{z} - z) & \text{otherwise.} \end{cases} \qquad (7)$$
Rolling-window predictions after the last point seen in the conditioning range are used to obtain the results. To compute this metric, we use 200 samples from the decoder to estimate the predicted quantiles along time. Also, the normalized sum of the quantile losses is considered, as the previous equation shows.
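The quantile-loss metric can be sketched as follows; the factor-of-two normalization by the summed absolute targets follows common usage of this metric and is stated here as an assumption:

```python
import numpy as np

def quantile_loss(z, z_hat, rho):
    """Pinball loss P_rho summed over all series and time steps."""
    diff = z - z_hat
    return np.sum(np.maximum(rho * diff, (rho - 1) * diff))

def normalized_quantile_loss(z, z_hat, rho):
    """rho-risk: quantile loss normalized by the summed |targets|."""
    return 2 * quantile_loss(z, z_hat, rho) / np.sum(np.abs(z))

z = np.array([2.0, 4.0, 1.0])
risk_zero = normalized_quantile_loss(z, z.copy(), rho=0.5)       # exact
risk = normalized_quantile_loss(z, np.array([1.0, 4.0, 1.0]), rho=0.5)
```

At $\rho = 0.5$ the normalized quantile loss coincides with the Normalized Deviation, which is a useful sanity check when implementing it.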
The Normalized Deviation (ND) [yu2016temporal] and Root Mean Square Error (RMSE) were also used to evaluate the probabilistic forecasts, especially in the synthetic dataset experiments:

$$\mathrm{ND} = \frac{\sum_{i,t} |z_{i,t} - \hat{z}_{i,t}|}{\sum_{i,t} |z_{i,t}|}, \qquad \mathrm{RMSE} = \frac{\sqrt{\frac{1}{N(T - t_0)} \sum_{i,t} (z_{i,t} - \hat{z}_{i,t})^2}}{\frac{1}{N(T - t_0)} \sum_{i,t} |z_{i,t}|}. \qquad (8)$$
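Both metrics can be sketched in a few lines; normalizing the RMSE by the mean absolute target, as done here, follows common practice in this literature and is an assumption of the sketch:

```python
import numpy as np

def normalized_deviation(z, z_hat):
    """ND: summed absolute error over summed absolute targets."""
    return np.sum(np.abs(z - z_hat)) / np.sum(np.abs(z))

def normalized_rmse(z, z_hat):
    """RMSE divided by the mean absolute target value."""
    return np.sqrt(np.mean((z - z_hat) ** 2)) / np.mean(np.abs(z))

z = np.array([1.0, 2.0, 3.0, 4.0])
nd_perfect = normalized_deviation(z, z)
nd = normalized_deviation(z, z + 1.0)   # constant error of 1
nrmse = normalized_rmse(z, z + 1.0)
```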
In order to demonstrate SAAM's capabilities, we conducted experiments on synthetic data composed of sinusoidal signals with no covariates and a duration of 200 samples. Each of these signals is divided into two halves, with the components of each half randomly selected by variables independently generated from a Bernoulli distribution. This implies that each half of the time series can take the form of one of two signals, both of them composed of the addition of two sines of different frequencies plus a noise component, where the noise variance varies during the experiments, the amplitudes have fixed values, and each of the frequencies used is chosen from a different interval. A sampled time series is shown in Fig. 5.
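A hedged sketch of such a generator follows; the frequency intervals, unit amplitudes, and Bernoulli probability below are illustrative placeholders, not the paper's exact values:

```python
import numpy as np

def sample_series(T=200, sigma=0.5, p=0.5, seed=None):
    """Each half of the series is one of two candidate signals, each
    the sum of two sinusoids of different frequencies, plus noise."""
    rng = np.random.default_rng(seed)
    t = np.arange(T)
    # One frequency per sinusoid, each drawn from its own (assumed)
    # interval, in cycles per sample.
    f = [rng.uniform(lo, hi) for lo, hi in
         [(0.01, 0.04), (0.04, 0.08), (0.08, 0.12), (0.12, 0.16)]]
    s = [np.sin(2 * np.pi * f[0] * t) + np.sin(2 * np.pi * f[1] * t),
         np.sin(2 * np.pi * f[2] * t) + np.sin(2 * np.pi * f[3] * t)]
    y = np.empty(T)
    for sl in (slice(0, T // 2), slice(T // 2, T)):
        k = int(rng.random() < p)            # Bernoulli choice per half
        y[sl] = s[k][sl] + sigma * rng.normal(size=T // 2)
    return y

y = sample_series(seed=0)
```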
To evaluate the advantages of spectral attention, we trained two models on this dataset: 1) DeepAR as the base model and 2) SAAM. Both of them used the exact same architecture for the common parts: the embedding function was performed by an LSTM with 3 layers and 10 hidden units per layer, and the probabilistic model is composed of a fully connected network.

Both DeepAR and SAAM were trained using 500 signals from the synthetic dataset with a noise variance of $\sigma = 0.5$. The training loss evolution and the validation ND error are shown in Fig. 6.

We then evaluated the trained models in two different scenarios: first, in the absence of a noise component ($\sigma = 0$); second, under the same conditions in which they were trained ($\sigma = 0.5$).

Moreover, for each of the previous cases, we evaluated the models using different forecast horizons, starting with a forecast length of $\tau = 25$ and going up to $\tau = 95$. In this last case, the final 95 timesteps are forecasted after observing only the first five samples from the time series' second half. Forecasting the entire second half would make no sense, as the first half contains no evidence about the time series' configuration during the second half.



Table II: Comparison of DeepAR vs. SAAM on the synthetic dataset for different noise variances σ and forecast lengths τ. Each cell reports DeepAR / SAAM (mean ± standard deviation over 10 evaluations).

σ     t0    τ    ND                              RMSE                            ρ0.5                            ρ0.9
0     175   25   0.25494±0.000 / 0.24101±0.003   0.35004±0.000 / 0.28680±0.003   0.26250±0.000 / 0.24872±0.004   0.21830±0.000 / 0.22170±0.003
0     150   50   0.27403±0.000 / 0.20145±0.002   0.37677±0.000 / 0.24868±0.004   0.27882±0.000 / 0.20594±0.002   0.26836±0.000 / 0.26325±0.004
0     120   80   0.37867±0.000 / 0.17323±0.004   0.47434±0.000 / 0.21293±0.006   0.38511±0.000 / 0.17811±0.005   0.38712±0.000 / 0.13148±0.001
0     110   90   0.75234±0.000 / 0.20326±0.001   1.06398±0.000 / 0.28207±0.001   0.75810±0.001 / 0.20828±0.001   0.73909±0.001 / 0.14980±0.003
0     105   95   0.82233±0.000 / 0.37575±0.003   1.13000±0.000 / 0.66817±0.001   0.82801±0.001 / 0.38062±0.004   0.80492±0.001 / 0.34665±0.002
0.5   175   25   0.58778±0.000 / 0.56469±0.002   0.75065±0.000 / 0.71553±0.002   0.59492±0.001 / 0.57267±0.002   0.59893±0.001 / 0.61914±0.005
0.5   150   50   0.60786±0.000 / 0.55649±0.001   0.76756±0.000 / 0.70871±0.002   0.61322±0.000 / 0.56057±0.001   0.61094±0.000 / 0.60490±0.003
0.5   120   80   0.63071±0.000 / 0.57218±0.002   0.79379±0.000 / 0.73353±0.002   0.63531±0.001 / 0.57674±0.002   0.62907±0.001 / 0.52398±0.002
0.5   110   90   0.78535±0.000 / 0.67392±0.001   1.03135±0.000 / 0.89065±0.001   0.79090±0.001 / 0.67823±0.001   0.79474±0.001 / 0.61661±0.001
0.5   105   95   0.86135±0.000 / 0.70326±0.009   1.12157±0.000 / 0.93480±0.012   0.86594±0.001 / 0.70809±0.009   0.85522±0.001 / 0.65422±0.012





Table III: Degradation of each metric when the forecast window increases from τ=25 to τ=95, for both noise levels.

Model    Metric   σ = 0      σ = 0.5
DeepAR   ND       222.56%    46.55%
         RMSE     222.82%    49.41%
         ρ0.5     215.43%    45.56%
         ρ0.9     268.72%    42.79%
SAAM     ND       55.91%     24.54%
         RMSE     132.97%    30.64%
         ρ0.5     53.03%     38.34%
         ρ0.9     56.36%     5.67%
Table II shows the results reported by DeepAR and SAAM. After training the models, 10 evaluations per scenario were conducted during testing. On average, SAAM improved the ND by 28.4%, the RMSE by 27.7%, and the two quantile losses by 28.2% and 28.4%, proving that the inclusion of Spectral Attention enhances the model's performance.

Furthermore, note that the Global Spectral Attention model contained in the SA module helps accelerate the detection of global trends: the earlier the forecast window starts (smaller $t_0$), the better the results of SAAM with respect to the DeepAR base model. Specifically, for $\sigma = 0$ and $\tau = 25$ (the most favorable setting, as there is no noise and just 25 data points to forecast), SAAM and DeepAR report very similar results, while with an increased forecast window of $\tau = 95$ the SAAM architecture thoroughly improves the base model's results: the ND is improved by 54.3%, the RMSE by 40.9%, and the two quantile losses by 54% and 56.9%.

SAAM's ability to detect and incorporate relevant trends into the forecast has direct implications for the reported results. Table III shows, for different noise levels, how much the reported metrics degrade when the forecast window increases from $\tau = 25$ to $\tau = 95$. SAAM's deterioration is much smaller than DeepAR's.
Finally, as can be seen in Fig. 6, faster training convergence is achieved when using the SA module. Furthermore, SAAM reached the minimum validation error long before DeepAR did.
We now visualize how spectral attention affects the evolution of the embedding during the forecasting, studying SAAM's internal behavior while using an LSTM with 1 layer and 5 hidden units as the embedding function. In Fig. 7, we show the hidden representations produced by SAAM during the forecasting of a time series. Each row in this figure represents one of those 5 hidden dimensions at the time the series' final timestep is predicted. SAAM's hidden variables are displayed as follows:

Blue lines represent the hidden representation of the LSTM before the SA module, i.e., the representation that DeepAR would use to forecast.

Red lines represent the SA module's output, i.e., the representation that the DeepAR-based SAAM uses to perform the forecast.

We also represent the true signal in the first row of Fig. 7; the noise component was removed for clarity. This specific sequence can be described as:
(11) 
Fig. 7 shows how the SA module incorporates global trends into the hidden representation, making the model immediately aware of trend changes: in Dims. 0 and 2, the SA module's output exhibits a trend component from the start of the second half, while this component does not appear in the pre-SA embedding until later. Furthermore, the SA output incorporates in both Dims. 3 and 4 a component that is present in both halves of the series.

These examples show the ability of the proposed architecture to incorporate into the forecast the patterns that the time series exhibits.
The performance of SAAM on several real-world datasets was compared with other state-of-the-art models: two classic forecasting techniques, ARIMA [box2008time] and ETS [hyndman2018forecasting]; a recent matrix factorization method, TRMF [yu2016temporal]; an RNN-based State Space Model, DSSM [rangapuram2018deep]; DeepAR [salinas2020deepar]; and ConvTrans [li2019enhancing].

Two different configurations were proposed for SAAM: the first uses DeepAR as the base model, and the second combines ConvTrans with SA, both complying with the architecture of Fig. 3. A common framework for the shared parts was maintained during all the experiments: for the DeepAR base model, the embedding consisted of 3 LSTM layers with 40 hidden units per layer, while for the ConvTrans base proposal, a Transformer with 8 heads and 3 layers was used. Both sets of parameters appear in the original articles [salinas2020deepar], [li2019enhancing] as optimal choices.
The electricity (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014#) and traffic (https://archive.ics.uci.edu/ml/datasets/PEMS-SF) datasets [Dua:2019], [NIPS2016_85422afb] were evaluated using two different forecast windows, of one and seven days. The electricity dataset contains hourly time series of the energy consumption of 370 customers. Similarly, the traffic dataset contains the hourly occupancy rates, with values between zero and one, of 963 car lanes in San Francisco area freeways.

Three more datasets, with different forecast windows each, were also used. The solar dataset (https://www.nrel.gov/grid/solar-power-data.html) [solar_dataset] contains the solar power production records, with hourly measurements, from January to August 2006 of 137 plants in Alabama; forecast windows of 1 day were predicted during the evaluation. The wind dataset (https://www.kaggle.com/sohier/30-years-of-european-wind-generation) [wind_dataset] contains daily estimates of 28 countries' wind energy potential over the period from 1986 to 2015, expressed as a percentage of a power plant's maximum output. Finally, the M4-Hourly dataset (https://www.kaggle.com/yogesh94/m4-forecasting-competition-dataset) [makridakis2018m4] contains 414 hourly time series from the M4 competition [makridakis2020m4]. Table IV summarizes each dataset and the model architectures used.

All datasets used covariates composed of the hour of the day, day of the week, week of the year, and month of the year, capturing daily, weekly, monthly, and yearly patterns, respectively. Also, covariates that measure the distance to the first observation of the time series, as well as an item-index identification for each time series, were used.
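These time covariates can be sketched with the standard library; the field order, the integer encodings, and the absence of normalization are our own choices:

```python
import numpy as np
from datetime import datetime, timedelta

def make_covariates(start, T, step_hours=1):
    """Per-timestep covariates: hour of day, day of week, week of year,
    month, and the 'age' feature (distance to the first observation).
    A per-series item index would be appended alongside these."""
    stamps = [start + timedelta(hours=step_hours * i) for i in range(T)]
    return np.array([[s.hour, s.weekday(), s.isocalendar()[1], s.month, i]
                     for i, s in enumerate(stamps)], dtype=float)

x = make_covariates(datetime(2014, 1, 6), 48)   # a Monday, hourly data
```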


Table IV: Summary of the datasets and the model architectures used.

                       Electricity      Traffic          Solar    Wind     M4
Length                 32304            129120           5832     10957    748
# Time Series          370              963              137      28       414
Granularity            Hourly           Hourly           Hourly   Daily    Hourly
Domain                 —                [0, 1]           —        —        —
Batch Size             128              128              128      64       128
Learning Rate          1e-3             1e-3             1e-3     1e-3     1e-3
# LSTM Layers          3                3                3        3        3
# Hidden Units/Layer   40               40               40       40       40
# Heads                8                8                8        8        8
# Layers               3                3                3        3        3
Forecast Window        1 Day / 7 Days   1 Day / 7 Days   1 Day    30 Days  2 Days
Encoder Length         168 / 24         168 / 24         168      162      128
Decoder Length         24 / 168         24 / 168         24       30       48


Table V: Results on the electricity and traffic datasets (each cell: 0.5-risk / 0.9-risk; lower is better).

Dataset      ARIMA^△         ETS^△           TRMF^△      DSSM^△          DeepAR          ConvTrans       SAAM (DeepAR)     SAAM (ConvTrans)
elect-1d     0.154 / 0.102   0.101 / 0.077   0.084 / -   0.083 / 0.056   0.075 / 0.040   0.059 / 0.034   0.0635 / 0.0317   0.056 / 0.029
elect-7d     0.283 / 0.109   0.121 / 0.101   0.087 / -   0.085 / 0.052   0.082 / 0.053   0.079 / 0.051   0.076 / 0.037     0.073 / 0.046
traffic-1d   0.223 / 0.137   0.236 / 0.148   0.186 / -   0.167 / 0.113   0.159 / 0.106   0.152 / 0.102   0.123 / 0.099     0.120 / 0.083
traffic-7d   0.492 / 0.280   0.509 / 0.529   0.202 / -   0.168 / 0.114   0.251 / 0.169   0.172 / 0.110   0.246 / 0.167     0.155 / 0.098


Table VI: Results on the solar, M4, and wind datasets (each cell: 0.5-risk / 0.9-risk; lower is better).

Dataset   TRMF^△      DeepAR          ConvTrans       SAAM (DeepAR)   SAAM (ConvTrans)
Solar     0.241 / -   0.222 / 0.093   0.210 / 0.082   0.191 / 0.066   0.197 / 0.069
M4        - / -       0.085 / 0.044   0.067 / 0.049   0.048 / 0.029   0.061 / 0.044
Wind      0.311 / -   0.286 / 0.116   0.287 / 0.111   0.282 / 0.105   0.278 / 0.108


Table VII: Relative improvement of SAAM over each base model (0.5-risk / 0.9-risk).

Base Model   elect-1d          elect-7d         traffic-1d       traffic-7d      Solar             M4                Wind
DeepAR       15.33% / 20.75%   7.32% / 30.19%   22.64% / 6.60%   1.99% / 1.18%   13.96% / 29.03%   43.53% / 34.09%   1.05% / 9.48%
ConvTrans    5.08% / 14.71%    7.59% / 9.80%    21.1% / 18.6%    9.8% / 10.91%   6.19% / 15.85%    8.96% / 10.20%    3.14% / 2.70%
Table V shows the results obtained on both the electricity and traffic datasets, with forecast windows of 1 and 7 days. Table VI shows the results for the solar, M4, and wind datasets.
Several conclusions can be drawn from these results. The classic methods evaluated, ARIMA and ETS, performed worst, probably due to their inability to detect patterns shared across the different time series. The results reported for TRMF are slightly better, but for most configurations it was unable to outperform the approaches based on deep neural networks, among which DeepAR and ConvTrans clearly exceeded DSSM. The two proposed variations of SAAM outperformed all other models, as in the synthetic-dataset experiments of Section IV-C.
Finally, Table VII compares the base models, DeepAR and ConvTrans, with their SAAM versions, which consistently improved the base models' forecast accuracy.
For DeepAR, the two evaluation metrics improved on average by 15.1% and 18.8%, respectively, after including the SA module in SAAM. For ConvTrans, they improved by 8.8% and 11.8% when using our proposed SAAM architecture.
These results demonstrate how deep autoregressive models with significant differences between them can be improved by correctly incorporating frequency-domain information into the forecast, without any significant complexity overhead.
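The relative improvements reported in Table VII follow the usual formula for a lower-is-better metric; a small helper (our own, not from the paper's code) reproduces one of the tabulated entries:

```python
def relative_improvement(base: float, new: float) -> float:
    """Percentage improvement of `new` over `base` for a lower-is-better metric."""
    return 100.0 * (base - new) / base

# DeepAR vs. SAAM (DeepAR) on traffic-1d, first metric in Table V:
# relative_improvement(0.159, 0.123) gives roughly 22.64, matching Table VII.
```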
Finally, to quantify the real effect of the SA module and understand the effectiveness of its components, an ablation study was conducted. SA's basic blocks, the Local Spectral Attention model and the Global Spectral Attention model described in Section III, were evaluated separately.
To evaluate the behavior of both frequency-domain attention models, two ablation studies on two different datasets were conducted to ensure the robustness of the conclusions. In both studies, SAAM is trained using an LSTM as the embedding function.
Considering the expression for SA's output stated in Table I, three configurations of the model were tested: 1) the full SAAM; 2) SAAM without the Local Spectral Attention; 3) SAAM without the Global Spectral Attention.
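As a schematic illustration of these three configurations (a toy sketch, not the authors' implementation; the FFT-based weighting and all names are our own simplifications):

```python
import numpy as np

def spectral_attention(embedding, local_w, global_w, use_local=True, use_global=True):
    """Toy SA module: filter the embedding's spectrum with local attention
    weights and add a globally learned spectral pattern; either path can be
    disabled, mirroring the two ablated configurations."""
    spec = np.fft.rfft(embedding)                  # spectral representation of the embedding
    if use_local:                                  # local SA: spectral filtering
        spec = spec * local_w
    if use_global:                                 # global SA: inject global patterns
        spec = spec + global_w * np.abs(spec).mean()
    return np.fft.irfft(spec, n=len(embedding))

np.random.seed(0)
x = np.sin(np.linspace(0, 8 * np.pi, 128)) + 0.3 * np.random.randn(128)
w_loc = np.ones(65)
w_loc[30:] = 0.0                                   # attenuate high-frequency "noise" bins
w_glob = np.zeros(65)                              # no global pattern in this toy example
full = spectral_attention(x, w_loc, w_glob)                          # configuration 1
no_local = spectral_attention(x, w_loc, w_glob, use_local=False)     # configuration 2
no_global = spectral_attention(x, w_loc, w_glob, use_global=False)   # configuration 3
```

In the full model these weights are produced by the attention mechanisms; here they are fixed by hand only to make the three ablated configurations concrete.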



                 ND               RMSE
Full SAAM        0.028 ± 0.000    0.044 ± 0.000    0.029 ± 0.000    0.025 ± 0.000
No global SA     0.886 ± 0.003    1.064 ± 0.004    0.893 ± 0.003    1.077 ± 0.009
No local SA      2.084 ± 0.000    2.352 ± 0.000    2.100 ± 0.000    2.416 ± 0.001

Full SAAM        0.816 ± 0.000    1.084 ± 0.000    0.823 ± 0.001    0.859 ± 0.001
No global SA     0.949 ± 0.007    1.167 ± 0.009    0.956 ± 0.007    1.136 ± 0.007
No local SA      1.936 ± 0.000    2.263 ± 0.000    1.962 ± 0.000    1.936 ± 0.001

Full SAAM        0.911 ± 0.000    1.145 ± 0.000    0.915 ± 0.000    0.891 ± 0.001
No global SA     0.990 ± 0.006    1.231 ± 0.005    0.995 ± 0.006    1.126 ± 0.008
No local SA      1.567 ± 0.000    1.890 ± 0.000    1.591 ± 0.001    1.493 ± 0.002



                 ND               RMSE
Full SAAM        0.076 ± 0.000    0.496 ± 0.001    0.076 ± 0.000    0.03759 ± 0.000
No global SA     0.236 ± 0.001    2.283 ± 0.029    0.237 ± 0.001    0.078 ± 0.001
No local SA      0.263 ± 0.000    2.665 ± 0.001    0.264 ± 0.000    0.092 ± 0.000

Note that disabling the Local Spectral Attention leaves the local context unchanged, which implies that no spectral filtering is performed; likewise, disabling the Global Spectral Attention removes the model's ability to incorporate global trends into the local context.
The first ablation study used the synthetic dataset described in Section IV-C, with a fixed noise component, no covariates, and a fixed forecast window. Table VIII shows the results obtained. The degradation produced by the ablations under this setting is larger than in the other cases considered, which is expected given that the model was trained under these conditions. Also, disabling the local attention produced worse results than disabling the global attention; the latter causes a larger deterioration in the predictions' variance, which may indicate the model's inability to follow the trend once the global attention model is removed.
A second ablation study was performed on the electricity dataset using a 7-day forecasting window. Again, the Local and Global Spectral Attention models were evaluated separately. For this dataset, the degradation of the results is similar for both ablations, as Table IX shows. As in the synthetic-dataset ablation study, not using the Global Spectral Attention model translates into a higher standard deviation in the results.
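The ablation tables report ND and RMSE; the sketch below uses the normalized definitions common in the deep forecasting literature (these exact formulas are an assumption on our part, since the paper's evaluation code is not shown here):

```python
import numpy as np

def nd(y_true, y_pred):
    """Normalized deviation: total absolute error over total absolute target."""
    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true))

def nrmse(y_true, y_pred):
    """RMSE normalized by the mean absolute value of the target."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2)) / np.mean(np.abs(y_true))
```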
We have proposed a novel methodology for neural probabilistic time series forecasting that marries signal processing techniques with deep learning-based autoregressive models, developing an attention mechanism that operates over the frequency domain. Thanks to this combination, enclosed in the Spectral Attention module, local spectrum filtering and the incorporation of global patterns meet during the forecast. To do so, two attention models operate over the embedding's spectral-domain representation to determine, at every time instant and for each time series, which components of the frequency domain should be considered noise and hence filtered out, and which global patterns are relevant and should be incorporated into the predictions. Experiments on synthetic and real-world datasets confirm these claims and show how our modular architecture can be incorporated into a variety of base deep autoregressive models, consistently improving their results and achieving state-of-the-art performance. In particular, in noisy environments or with short conditioning ranges, our method stands out by explicitly filtering out noise and rapidly recognizing relevant trends.
This work has been supported by the Spanish government Ministerio de Ciencia, Innovación y Universidades under grants FPU18/00470, TEC2017-92552-EXP, and RTI2018-099655-B-100, by Comunidad de Madrid under grants IND2017/TIC-7618, IND2018/TIC-9649, IND2020/TIC-17372, and Y2018/TCS-4705, by the BBVA Foundation under the Deep-DARWiN project, and by the European Union (FEDER) and the European Research Council (ERC) through the European Union's Horizon 2020 research and innovation program under Grant 714161.