Deep Autoregressive Models with Spectral Attention

Time series forecasting is an important problem across many domains, playing a crucial role in multiple real-world applications. In this paper, we propose a forecasting architecture that combines deep autoregressive models with a Spectral Attention (SA) module, which merges global and local frequency domain information in the model's embedded space. By characterizing in the spectral domain the embedding of the time series as occurrences of a random process, our method can identify global trends and seasonality patterns. Two spectral attention models, global and local to the time series, integrate this information within the forecast and perform spectral filtering to remove noise from the time series. The proposed architecture has a number of useful properties: it can be effectively incorporated into well-known forecast architectures, requiring a low number of parameters and producing interpretable results that improve forecasting accuracy. We test the Spectral Attention Autoregressive Model (SAAM) on several well-known forecast datasets, consistently demonstrating that our model compares favorably to state-of-the-art approaches.




I Introduction

Time series forecasting, which consists of analyzing historical signal patterns to predict future outcomes, is an important problem with scientific, business, and industrial applications. Several fields benefit from it, such as finance, where these models are used to estimate the future movements of stock markets [engle1982autoregressive, ding2015deep, chatigny2020financial, kim2003financial, sezer2020financial], climate prediction [ak2015interval, le2020deep, shi2015convolutional], forecasting of energy consumption and demand [masum2018multi, wang2019review, dudek2021hybrid], and product demand and supply [simchi2008designing, merkuryeva2019demand, akyuz2017ensemble], which helps optimize resource allocation, allowing for cost reduction and profit maximization.

Early approaches to solving time series forecasting problems rely on statistical models, such as State Space Models (SSMs) [durbin2012time], exponential smoothing [hyndman2008forecasting], [McKenzie_1984], or matrix factorization methods [yu2016temporal], [Hyndman_2011], which learn information via a matrix that relates different time series. Auto Regressive Integrated Moving Average (ARIMA) [box2008time] has been one of the most popular solutions; it produces its predictions as a weighted sum of past observations, making the model vulnerable to error accumulation. For further information regarding classic techniques, we refer the reader to [Box_1968, hamilton1994time, lutkepohl2005new].

Regardless, all these classic approaches share a number of weaknesses. They make linearity assumptions on the data, which, together with their limited scalability, make them unsuitable for modern large-scale forecasting tasks. Furthermore, they incorporate prior knowledge about time series composition, such as trends or seasonality patterns present in the data, which requires manual feature engineering and model design by domain experts in order to achieve good results [harvey1990forecasting]. Moreover, they do not usually share a global set of parameters among the different time series, which implies bad generalization and poor performance aside from single-linear time series prediction.

Deep neural networks [sutskever2014sequence], [graves2013generating] are an alternative solution for time series forecasting. They are able to model non-linear temporal patterns, easily identify complex structures across time series, efficiently extract higher order features, and allow a data-driven approach that requires little to no human feature engineering [hwang2015recurrent, sutskever2014sequence, graves2013generating, giuliari2021transformer].

Recurrent Neural Networks (RNNs) [funahashi1993approximation], [lai2018modeling] and Long Short-Term Memory (LSTM) networks [hochreiter1997long] have achieved good results in temporal modeling. DeepAR [salinas2020deepar] set a milestone by using an LSTM to compute an embedding that is later used by a probabilistic model to forecast in an encoder-decoder fashion. Recently, attention models [bahdanau2014neural], [li2019enhancing] have been used by these recurrent architectures to selectively focus on some segments of the data while making predictions; e.g., in machine translation, only certain words in the input sequence may be relevant for predicting the next word. To do so, these models use an inductive bias that connects each token in the input through a relevance weighted basis of every other token.

The idea behind recurrent attention leads to Transformer models [vaswani2017attention], which have become one of the most popular methods for time series forecasting. Initially introduced for Natural Language Processing (NLP), Transformers proposed a completely new architecture where a self-attention mechanism is used to process sequences of data. Several modifications are needed in order to apply Transformer models to time series forecasting. ConvTrans [li2019enhancing] solved most of the difficulties associated with Transformers for this specific problem. An alternative to the canonical self-attention of these models was designed to make them aware of the local context via a CNN [lecun1995convolutional]. At the same time, a modification of the attention mechanism to reduce the computational cost of self-attention was also introduced.

Many models have also tried to join classical approaches with deep learning techniques, such as Deep State Space Models for Time Series Forecasting (DSSM) [rangapuram2018deep] or Deep Factors for Forecasting [wang2019deep]. Various attempts to merge signal processing techniques with deep neural networks can also be found in the literature. In [tamkin2020language], a framework that uses spectral filtering for NLP was proposed, while [Cao2020SpectralTG] uses the spectral domain to jointly capture inter-series correlations and temporal dependencies.

Nevertheless, deep autoregressive models also present some drawbacks. First, they tend to focus on recent past data to predict the future of the time series, frequently wasting important global information not encapsulated in previous predictions. Second, like classical time series forecasting methods, they suffer from error accumulation and propagation [cheng2006multistep], a problem closely related to the previous one. Third, they do not produce interpretable results, nor can we clearly explain how they reach them [castelvecchi2016can].

In this paper we show that the described problems can be partially alleviated by incorporating signal processing filtering techniques into the autoregressive models that perform the time series forecasting. With respect to the inability of these models to focus on the global context, we can obtain the most important trends of the time series via frequency domain characterization. These trends can be intelligently incorporated during the forecasting, hence making the local context aware of the global patterns of the time series. Regarding error accumulation and noisy local context, spectral filtering can be applied to decide at every time instant which frequencies are useful and which can be suppressed, eliminating unwanted components that do not help during forecasting. Finally, these signal processing tools, which operate in the spectral domain, produce more interpretable internal representations, as shown in the experiments section, making it possible to extract the explainable frequency domain features that are driving the predictions if necessary.

To integrate the previous solutions, we propose a general architecture, the Spectral Attention Autoregressive Model (SAAM). SAAM’s modularity allows it to be effectively incorporated into a variety of deep-autoregressive models. This architecture uses two spectral attention models to determine, at every time instant, the relevant global patterns and to remove the noise of the local context while performing the forecasting. Both operations are performed in the frequency domain of the embedded space.

To the best of our knowledge, SAAM is the first deep neural autoregressive model that exploits attention mechanisms in the spectral domain. A global-local architecture marries deep neural networks with classic signal processing techniques in this new frequency domain attention framework, incorporating relevant global trends into the forecast and performing spectral filtering to prevent error accumulation. Further, the additional complexity due to Spectral Attention is comparable to classic attention models in the temporal domain.

We perform extensive experiments on both synthetic and real-world time series datasets, showing the effectiveness of the proposed Spectral Attention module, consistently outperforming the base models and reaching state-of-the-art results. Ablation studies further prove the effectiveness of the designed architecture.

The rest of this paper is organized as follows. Section II states the time series forecasting problem, presents a base architecture to which we append the proposed Spectral Attention module, and provides a characterization of the time series in the spectral domain. Section III describes our model and Section IV shows its effectiveness, both quantitatively and qualitatively. We conclude the paper in Section V.

II Preliminaries

In this section, we formally state the problem of time series forecasting and introduce a base architecture that represents the core of most deep learning-based autoregressive models in the state-of-the-art. Also, a frequency domain characterization of the time series is proposed.

II-A Problem definition

Given a set of $N$ univariate time series $\{z_{i,1:t_0}\}_{i=1}^{N}$, where $z_{i,1:t_0} = (z_{i,1}, \dots, z_{i,t_0})$, $t_0$ is the forecast horizon, $\tau$ the forecast length, and $T = t_0 + \tau$ the sequences’ total length, our goal is to model the conditional probability distribution of future trajectories of each time series given the past, namely, to predict the next $\tau$ time steps after the forecast horizon:

$$p\left(z_{i,t_0+1:t_0+\tau} \mid z_{i,1:t_0}, x_{i,1:t_0+\tau}; \theta\right),$$

where $\theta$ are the learnable parameters of the model and $x_{i,1:t_0+\tau}$ the associated covariates. These covariates are, together with the past observations of the time series, the input to our predictive model.

II-B Base Architecture

A number of state-of-the-art deep-autoregressive models, including DeepAR [salinas2020deepar], N-BEATS [oreshkin2019n] and ConvTrans [li2019enhancing], can be characterized by means of a high-level architecture, represented in Fig. 1. This general framework is composed of two parts:

  1. An embedding function $e_{i,t} = f_{\phi}(e_{i,t-1}, z_{i,t-1}, x_{i,t})$, with transit function $f(\cdot)$ and parameters $\phi$. This embedding receives as input, at time $t$, the previous value of the time series $z_{i,t-1}$, the covariates $x_{i,t}$, and the past value of the embedding $e_{i,t-1}$. This embedding function can be implemented in different ways, with an RNN [rumelhart1986learning],[hopfield1982neural], an LSTM [gers2000learning], or a Temporal Convolutional Network (TCN) [lea2016temporal].

  2. A probabilistic model $p(z_{i,t} \mid e_{i,t}; \psi)$, with parameters $\psi$, which uses the embedding $e_{i,t}$ to estimate the next value of the time series, $\hat{z}_{i,t}$.

    This probabilistic model is usually implemented as a function of a neural network that parameterizes the required probability distribution. E.g., a Gaussian distribution can be represented through its mean and standard deviation as:

    $\mu_{i,t} = g_{\mu}(e_{i,t})$, $\sigma_{i,t} = g_{\sigma}(e_{i,t})$, where $g_{\mu}$ and $g_{\sigma}$ are neural networks.
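As a minimal sketch of the probabilistic model in item 2, the NumPy snippet below maps an embedding to the mean and standard deviation of a Gaussian. The single linear layers, the softplus used to keep the standard deviation positive, and the names `gaussian_head`, `w_mu` and `w_sigma` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softplus(a):
    # log(1 + exp(a)), computed stably; used here to keep sigma positive.
    return np.logaddexp(0.0, a)

def gaussian_head(e_t, w_mu, b_mu, w_sigma, b_sigma):
    """Map an embedding e_t of shape (D,) to the (mean, std) of a Gaussian."""
    mu = e_t @ w_mu + b_mu
    sigma = softplus(e_t @ w_sigma + b_sigma)
    return mu, sigma

# Usage with a hypothetical 4-dimensional embedding and random weights.
rng = np.random.default_rng(0)
e_t = rng.normal(size=4)
mu, sigma = gaussian_head(e_t, rng.normal(size=4), 0.0, rng.normal(size=4), 0.0)
```

In practice the two linear maps would be replaced by whatever networks the base model uses; only the constraint that the standard deviation stays positive matters for the sketch.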

Figure 1: Common base architecture of deep learning-based autoregressive models. Gray represents observed variables.

The optimal model’s parameters are obtained by maximizing the log-likelihood function for the observed data in the conditioning range, i.e., from $t = 1$ to $t = t_0$. All quantities required for computing the log-likelihood function are deterministic, which means that no inference is required:

$$\mathcal{L} = \sum_{i=1}^{N} \sum_{t=1}^{t_0} \log p\left(z_{i,t} \mid z_{i,1:t-1}, x_{i,1:t}; \theta\right)$$
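For a Gaussian output distribution, this log-likelihood is a sum of closed-form log-densities. The sketch below assumes the per-step means and standard deviations have already been produced by the model; the function name is hypothetical.

```python
import numpy as np

def gaussian_log_likelihood(z, mu, sigma):
    """Sum of log N(z_t | mu_t, sigma_t^2) over all observed time steps."""
    z, mu, sigma = (np.asarray(a, dtype=float) for a in (z, mu, sigma))
    log_p = -0.5 * np.log(2.0 * np.pi) - np.log(sigma) - 0.5 * ((z - mu) / sigma) ** 2
    return log_p.sum()

# Training would maximise this quantity (or minimise its negative).
ll = gaussian_log_likelihood([0.1, -0.2, 0.3], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
```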


During both training and testing, the conditioning range , which is analogous to the encoder of seq2seq models [sutskever2014sequence], transfers information to the forecasting range , analogous to the decoder. Therefore, this base framework can be interpreted as an encoder-decoder architecture, with the consideration that both encoder and decoder are the same network, as Fig. 2 shows.

For forecasting, directly sampling from the model can be done as , when the model consumes the previous time-step prediction as input, unlike during the conditioning range, where is observed. This is illustrated in Fig. 2.
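The difference between the conditioning range (observed values are fed in) and the forecasting range (the model's own samples are fed back) can be sketched as follows. The tanh RNN cell stands in for the LSTM or Transformer embedding, and all parameter names and shapes are assumptions made for illustration.

```python
import numpy as np

def init_params(rng, dim, n_cov):
    # Random toy weights; a real model would learn these by maximum likelihood.
    return {
        "W_e": 0.1 * rng.normal(size=(dim, dim)),
        "w_z": 0.1 * rng.normal(size=dim),
        "W_x": 0.1 * rng.normal(size=(dim, n_cov)),
        "b": np.zeros(dim),
        "w_mu": rng.normal(size=dim), "b_mu": 0.0,
        "w_sigma": rng.normal(size=dim), "b_sigma": 0.0,
    }

def rnn_step(e_prev, z_prev, x_t, p):
    # Toy transit function: previous embedding, previous value and covariates in.
    return np.tanh(p["W_e"] @ e_prev + p["w_z"] * z_prev + p["W_x"] @ x_t + p["b"])

def gaussian_params(e_t, p):
    mu = p["w_mu"] @ e_t + p["b_mu"]
    sigma = np.logaddexp(0.0, p["w_sigma"] @ e_t + p["b_sigma"])  # softplus
    return mu, sigma

def sample_forecast(z_cond, x, p, tau, rng):
    """Draw one trajectory of length tau after observing z_cond.

    z_cond: observed values, shape (t0,); x: covariates, shape (t0 + tau, C).
    """
    t0 = len(z_cond)
    e = np.zeros(p["b"].shape[0])
    z_prev = 0.0  # nothing has been observed before the first step
    # Conditioning range: the observed values drive the embedding.
    for t in range(t0):
        e = rnn_step(e, z_prev, x[t], p)
        z_prev = z_cond[t]
    # Forecasting range: each sample becomes the next step's input.
    path = np.empty(tau)
    for k in range(tau):
        e = rnn_step(e, z_prev, x[t0 + k], p)
        mu, sigma = gaussian_params(e, p)
        z_prev = rng.normal(mu, sigma)
        path[k] = z_prev
    return path
```

Repeating `sample_forecast` many times yields Monte Carlo trajectories from which quantiles of the predictive distribution can be estimated.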

Figure 2: Unrolled base architecture. On the left of the forecast horizon, the conditioning range can be found . On its right, the forecasting range .

Note that Transformers, unlike RNNs or LSTMs, do not compute the embedding in a sequential manner. Accordingly, when obtaining the embedding through a Transformer model [wu2020deep], we use the Transformer in decoder-only mode, introduced in [liu2018generating], so that the encoder-decoder architecture previously described can still be used.

II-C Characterizing the time series’ embedding in the frequency domain

Our approach in this paper exploits information in the spectral domain. In this regard, time series’ embedding space can be statistically characterized as instances of a random process for which spectral information can be analyzed using the expected autocorrelation and the Power Spectral Density (PSD) [buttkus2012spectral].

The power spectrum per embedding’s dimension can be calculated from an averaged autocorrelation estimated from a finite number of time series of duration each:


where is the expected autocorrelation for the -th dimension of the -th embedded sequence, , and is the lag between observations.

The Blackman-Tukey method [blackman1958measurement], which takes advantage of the Discrete Fourier Transform (DFT) of a windowed autocorrelation, can be used to estimate the PSD once the autocorrelation has been computed:


where decomposes a function into its constituent frequencies ( is usually equal to the time series’ length ), obtaining the estimated spectrum of the random process that generates the time series in the embedding space. Notice that is a spectral characterization of the -dimension of the embedding space, hence not global to the process. In Section III, we propose an alternative autocorrelation function that considers each embedding’s dimensions as independent realizations of the same random process that generates the time series, therefore obtaining a global spectral representation of the process.
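One way to read the Blackman-Tukey estimate for a single embedding dimension is sketched below in NumPy. The biased autocorrelation estimator, the Hann lag window, and the parameter names `max_lag` and `n_fft` are assumptions made for illustration; the text does not specify the window or the normalisation.

```python
import numpy as np

def averaged_autocorrelation(e, max_lag):
    """Autocorrelation for lags 0..max_lag, averaged over N sequences.

    e: array of shape (N, T) holding one embedding dimension of N sequences.
    """
    n_seq, length = e.shape
    r = np.empty(max_lag + 1)
    for k in range(max_lag + 1):
        # Lag-k products, averaged over sequences with the biased 1/T normalisation.
        r[k] = np.sum(e[:, : length - k] * e[:, k:]) / (n_seq * length)
    return r

def blackman_tukey_psd(e, max_lag, n_fft):
    """PSD estimate as the DFT of the windowed, averaged autocorrelation."""
    if max_lag < 1 or n_fft < 2 * max_lag + 1:
        raise ValueError("need max_lag >= 1 and n_fft >= 2 * max_lag + 1")
    r = averaged_autocorrelation(e, max_lag)
    window = np.hanning(2 * max_lag + 1)[max_lag:]  # positive-lag half of the window
    rw = r * window
    # Lay out the even (symmetric) sequence in circular order so its DFT is real.
    full = np.zeros(n_fft)
    full[: max_lag + 1] = rw
    full[-max_lag:] = rw[1:][::-1]
    return np.fft.rfft(full).real

# Usage: three sinusoids at normalised frequency 0.1 with different phases.
t = np.arange(200)
e = np.stack([np.sin(2 * np.pi * 0.1 * t + ph) for ph in (0.0, 1.0, 2.0)])
psd = blackman_tukey_psd(e, max_lag=50, n_fft=200)
```

A shorter `max_lag` gives a smoother but lower-resolution spectrum, which is the usual trade-off of this estimator.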

III Spectral Attention Autoregressive Model

In this section, we introduce the Spectral Attention Autoregressive Model (SAAM), a general framework that incorporates a Spectral Attention (SA) module able to exploit embedding function’s spectral information using attention mechanisms to solve two main tasks:

  1. Time series are governed by global trends and seasonality structures. SAAM captures these global patterns and incorporates them into the forecast.

  2. Time series exhibit noise around these primary trends that hinders the forecasting process. SAAM filters this noise using spectral filtering, improving the signal-to-noise ratio and thus alleviating the error propagation problem that autoregressive models suffer from.

These two operations, incorporating global trends into the forecast and filtering the time series’ local context, are encapsulated in the SA module, which is responsible for all the frequency-domain operations. As such, it can be incorporated in any deep-autoregressive structure.

The resulting architecture, displayed in Fig. 3, is therefore composed of three main parts: embedding function, SA module and probabilistic model. For further details on the embedding and probabilistic model, common to the base architecture of Fig. 1, we refer the reader to Section II-B. With respect to the SA module, we now explain in detail how it integrates both global and local information into the forecasting.

Figure 3: Spectral Attention Autoregressive Model general architecture. Gray represents observed variables. Green indicates frequency domain representations.
Table I: Summary of operations performed by SAAM to filter both local and global contexts.

III-A Global Spectral Information

To incorporate global patterns into the prediction, we perform a frequency domain characterization of the process, as explained in Section II-C. We aim to exploit information in the spectral domain associated with the neural embeddings of a number of time series.

Nevertheless, we are not interested in characterizing each embedding’s dimension independently, , as using Eq. 3 entails. This would require a multidimensional PSD for characterizing the process over all embedding’s dimensions, . Instead, we consider each dimension of the embedding as independent realizations of the same process we average over, resulting in the following autocorrelation function:


where is the -th dimension of the -th embedded sequence and the embedding’s number of dimensions. Applying the Blackman-Tukey method from Eq. 4 to the previous autocorrelation function yields the complete frequency-domain characterization of the process, incorporating all the global patterns across the different dimensions of the embedding into a single spectral representation, .

Therefore, Eqs. 4 and 5 allow us to calculate a Monte Carlo approximation of the process’ PSD for a batch of training sequences , via a Fourier Transform with points: . Consequently, is the global spectral representation of the process, shared among the time series to forecast.

III-B Local Spectral Information

To perform the local spectral filtering, we analyze the last values of each individual sample: encapsulates the previous embeddings, each of dimension . This embedded buffered signal at time is transformed to the frequency domain, , via a DFT with points. Note that we retain only the modulus of the Fourier Transform.
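A minimal sketch of this local spectrum computation is given below. The buffer length `window`, the number of DFT points `n_fft`, and the zero-padding used while fewer than `window` embeddings exist are assumptions; the text only states that the magnitude of the DFT of the buffered embeddings is kept.

```python
import numpy as np

def local_spectrum(e_hist, window, n_fft):
    """Magnitude spectrum of the last `window` embeddings.

    e_hist: array of shape (t, D) with the embeddings computed so far.
    Returns an array of shape (n_fft // 2 + 1, D), one spectrum per dimension.
    """
    buf = e_hist[-window:]
    if buf.shape[0] < window:
        # Left-pad with zeros until the buffer is full (illustrative choice).
        pad = np.zeros((window - buf.shape[0], buf.shape[1]))
        buf = np.vstack([pad, buf])
    return np.abs(np.fft.rfft(buf, n=n_fft, axis=0))

# Usage: dimension 0 oscillates with 4 cycles per 32 samples, dimension 1 is constant.
t = np.arange(40)
e_hist = np.stack([np.cos(2 * np.pi * 4 * t / 32), np.ones(40)], axis=1)
spec = local_spectrum(e_hist, window=32, n_fft=32)
```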

III-C Merging Global and Local Contexts with Spectral Attention

We combine both global and local spectral information to modify the embedded representation of the time series . This is done through two spectral attention models contained in the SA module, with parameters :

III-C1 Global Spectral Attention

This frequency-domain attention model, with parameters , is responsible for incorporating, at time and for each time series, the Global Spectral Information into the forecast. To do so, it uses the time series’ local context, summarized via the embedding function , as key to select the relevant frequency components on that should be included during the prediction, where is a repetition of along the dimensions of the embedding: . The global filter’s coefficients, , take values and is a neural network.

III-C2 Local Spectral Attention

As in the Global Spectral Attention model, the embedded representation is used as key to determine the relevant and irrelevant local spectral components, , where are the Local Spectral Attention’s parameters and are the local filter’s coefficients for each embedding’s dimension.

We combine both spectral-domain attention models through multiplication and addition operations in the frequency domain: . The multiplication over the embedding’s local spectrum representation performs the local spectral filtering, setting irrelevant frequency components to zero through the local attention model. Furthermore, the addition of includes significant global trends into the forecast via , which selects relevant patterns from the process’ global spectrum representation .

Finally, this spectral representation is transformed back to the time domain and the last value of , , is fed to the probabilistic model to forecast the next time-step, .
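The merge described above (multiplicative local filtering plus an additive, attention-weighted global term, followed by an inverse transform) can be sketched as follows. The single-layer sigmoid attention networks, the parameter names in `p`, and the direct inversion of the magnitude-only spectrum with `np.fft.irfft` are assumptions; the text does not state how phase is treated or how the attention networks are built.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def spectral_attention_step(e_t, local_spec, global_psd, p, n_fft):
    """Filter the local spectrum, add selected global components, go back to time.

    e_t: current embedding, shape (D,), used as the attention key.
    local_spec: local magnitude spectrum, shape (F, D) with F = n_fft // 2 + 1.
    global_psd: global spectrum shared by all series, shape (F,).
    Returns the filtered embedding, shape (D,).
    """
    n_freq, dim = local_spec.shape
    # Attention weights in (0, 1), one per frequency bin and embedding dimension.
    a_local = sigmoid(p["W_l"] @ e_t + p["b_l"]).reshape(n_freq, dim)
    a_global = sigmoid(p["W_g"] @ e_t + p["b_g"]).reshape(n_freq, dim)
    # Repeat the global spectrum along the embedding dimensions.
    global_rep = np.repeat(global_psd[:, None], dim, axis=1)
    merged = a_local * local_spec + a_global * global_rep
    # Inverse real FFT along the frequency axis; keep the last time sample.
    filtered = np.fft.irfft(merged, n=n_fft, axis=0)
    return filtered[-1]
```

A weight near zero removes a frequency bin from the local context, while a weight near one in the global branch injects that bin of the shared spectrum into the representation.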

Fig. 4 shows the unrolled architecture of the proposed model. Removing the SA module, the base framework displayed in Fig. 2 remains. Note that both the computation of the expected autocorrelation and the Fourier Transform are differentiable with respect to the model parameters.

Figure 4: The unrolled architecture of the Spectral Attention Autoregressive Model.

III-D Training

The likelihood of the model, which is maximized to match the statistical properties of the data, now includes SA’s parameters, :


Once parameters have been learned during the conditioning range, forecast can be produced as in Section II-B:

, computing the joint distribution over the forecasting range for each time series.

Notice that, during both training and testing, global spectral information is obtained using a mini-batch of time series different to the one being forecasted.

IV Experiments

We conduct experiments with both synthetic and real-world datasets in order to provide evidence of the superior forecasting ability of SAAM. Moreover, ablation studies are conducted. The code for our proposed model is publicly available on GitHub [fernando_moreno_pino_2021_5086179].

IV-A Applying the Spectral Attention Module to State-of-the-Art Models

Many well-known deep-autoregressive models that define the state-of-the-art can take advantage of the Spectral Attention module proposed in Section III. This allows them to filter out irrelevant spectral components and to recognize global trends that can be incorporated into the forecast.

We integrated the SA module into two forecasting models: DeepAR [salinas2020deepar], a well-known and widely used model with several industrial applications [schelter2018automating, bose2017probabilistic, schelter2018challenges, fildes2019retail, alexandrov2019gluonts], which employs an LSTM to perform the embedding; and ConvTrans [li2019enhancing], a more recent Transformer-based proposal, which constitutes one of the most efficient approaches for using Transformers for time series forecasting.

We would like to remark that the complexity added by the Spectral Attention module is equivalent to that of classic attention models such as [bahdanau2014neural]. In these models, for a sequence of length $L$, computing scores between every pair causes $O(L^2)$ memory use. The complexity of Spectral Attention depends on the embedding’s dimension and the number of points used to compute the Fourier Transform . The complexity, therefore, depends on .

IV-B Metrics

The normalized quantile loss, , with , which quantifies the accuracy of a quantile of the predictive distribution, is the main metric used to report the results of the experiments, as in many other works [salinas2020deepar], [li2019enhancing], [rangapuram2018deep], [seeger2016bayesian].


Rolling window predictions after the last point seen in the conditioning range are used to obtain the results. To compute this metric, we use 200 samples from the decoder to estimate along time. Also, the normalized sum of the quantile losses is considered, as can be appreciated in the previous equation.
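A minimal sketch of the normalized quantile loss follows, assuming the definition used in the cited works [salinas2020deepar], [li2019enhancing]: twice the summed pinball loss divided by the summed absolute targets. The function names and the use of `np.quantile` over Monte Carlo samples are illustrative choices.

```python
import numpy as np

def empirical_quantile(samples, rho):
    """rho-quantile over axis 0 of Monte Carlo samples of shape (n_samples, ...)."""
    return np.quantile(samples, rho, axis=0)

def quantile_loss(z, z_hat_q, rho):
    """Normalized rho-quantile loss between targets z and predicted quantiles."""
    z = np.asarray(z, dtype=float)
    z_hat_q = np.asarray(z_hat_q, dtype=float)
    diff = z - z_hat_q
    # Pinball loss: under-prediction is weighted by rho, over-prediction by 1 - rho.
    pinball = np.where(diff > 0, rho * diff, (rho - 1.0) * diff)
    return 2.0 * pinball.sum() / np.abs(z).sum()

# Usage: 200 hypothetical sample paths for 3 series over 24 steps.
rng = np.random.default_rng(0)
targets = rng.normal(loc=5.0, size=(3, 24))
samples = targets + rng.normal(size=(200, 3, 24))
ql_09 = quantile_loss(targets, empirical_quantile(samples, 0.9), 0.9)
```

With `rho = 0.5` the expression reduces to the sum of absolute errors divided by the sum of absolute targets.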

The Normalized Deviation (ND) [yu2016temporal] and Root Mean Square Error (RMSE) were also used to evaluate the probabilistic forecasts, especially on the synthetic dataset experiments:
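The two point-forecast metrics can be sketched as below. ND follows the usual definition from [yu2016temporal] (summed absolute error over summed absolute target). The RMSE here is the plain, unnormalized version, which is an assumption: some related works divide it by the mean absolute target, and the surviving text does not say which variant is used.

```python
import numpy as np

def normalized_deviation(z, z_hat):
    """ND: sum of absolute errors divided by sum of absolute targets."""
    z, z_hat = np.asarray(z, dtype=float), np.asarray(z_hat, dtype=float)
    return np.abs(z - z_hat).sum() / np.abs(z).sum()

def rmse(z, z_hat):
    """Plain root mean square error (unnormalized; see the note above)."""
    z, z_hat = np.asarray(z, dtype=float), np.asarray(z_hat, dtype=float)
    return np.sqrt(np.mean((z - z_hat) ** 2))

nd_value = normalized_deviation([1.0, 2.0, 3.0], [1.0, 1.0, 5.0])
rmse_value = rmse([1.0, 2.0, 3.0], [1.0, 1.0, 5.0])
```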


IV-C A Synthetic Dataset

In order to demonstrate SAAM capabilities, we conducted experiments on synthetic data composed of sinusoidal signals with no covariates and a duration of 200 samples. Each of these signals is divided into two halves, randomly selecting the components for each of them as:


where and are independently and randomly generated from a Bernoulli distribution, with probability . This implies that each half of the time series can take the form of one of two signals, or , both of them composed of the sum of two sines of different frequencies plus a noise component, :


where and will vary during the experiments, the amplitudes had fixed values, and each of the frequencies used are chosen from a different interval as: . A sampled time series is shown in Fig. 5.
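The generation procedure can be sketched as below. Every numeric value (frequency intervals, amplitudes, Bernoulli probability, noise level) is a hypothetical placeholder, since the actual values are not recoverable from this text; only the structure (two halves, a Bernoulli choice per half, two sines plus Gaussian noise, 200 samples) follows the description.

```python
import numpy as np

def make_signal(rng, ranges_a, ranges_b, amps=(1.0, 1.0), noise_std=0.5,
                p=0.5, length=200):
    """Generate one synthetic series whose two halves are chosen independently.

    ranges_a, ranges_b: for each candidate signal, two (low, high) intervals
    from which the two sine frequencies (cycles per sample) are drawn.
    Returns the noisy signal and the per-half Bernoulli draws.
    """
    if length % 2:
        raise ValueError("length must be even so the signal splits into two halves")
    half = length // 2
    t = np.arange(length)
    use_b = rng.random(2) < p  # one Bernoulli(p) draw per half
    clean = np.empty(length)
    for h in range(2):
        (lo1, hi1), (lo2, hi2) = ranges_b if use_b[h] else ranges_a
        f1, f2 = rng.uniform(lo1, hi1), rng.uniform(lo2, hi2)
        seg = slice(h * half, (h + 1) * half)
        clean[seg] = (amps[0] * np.sin(2 * np.pi * f1 * t[seg])
                      + amps[1] * np.sin(2 * np.pi * f2 * t[seg]))
    return clean + rng.normal(0.0, noise_std, size=length), use_b

# Usage with placeholder frequency intervals (not the paper's values).
signal, halves = make_signal(
    np.random.default_rng(0),
    ranges_a=((0.01, 0.02), (0.05, 0.06)),
    ranges_b=((0.03, 0.04), (0.08, 0.09)),
)
```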

Figure 5: Synthetic dataset example signal, generated with , , , . For each signal, each half randomly varies according to the Bernoulli distribution. Noise’s standard deviation is increased from top to bottom.

To evaluate the advantages of spectral attention, we trained two models on this dataset: 1) DeepAR as base model and 2) SAAM. Both of them used the exact same architecture for the common parts: the embedding function was implemented by an LSTM with 3 layers and 10 hidden units per layer, and the probabilistic model was composed of a fully connected network.

IV-C1 Empirical Analysis

Both models, DeepAR and SAAM, were trained using 500 signals from the synthetic dataset with a noise component . The training loss evolution and the validation ND error are shown in Fig. 6.

We then evaluated the trained models in two different scenarios. First, in the absence of a noise component, ; second, under the same conditions in which they were trained, .

Moreover, for each of the previous cases, we evaluated the models using different forecast horizons, starting with up to . In this last case, the final 95 time-steps are forecast after observing the first five samples from the time series’ second half. Setting would make no sense, as contains no evidence about the time series configuration during .




| Noise | Forecast horizon | Forecast length | ND | RMSE | QL (ρ = 0.5) | QL (ρ = 0.9) |
|---|---|---|---|---|---|---|
| 0 | 175 | 25 | 0.25494 ± 0.000 / 0.24101 ± 0.003 | 0.35004 ± 0.000 / 0.28680 ± 0.003 | 0.26250 ± 0.000 / 0.24872 ± 0.004 | 0.21830 ± 0.000 / 0.221701 ± 0.003 |
| 0 | 150 | 50 | 0.27403 ± 0.000 / 0.20145 ± 0.002 | 0.37677 ± 0.000 / 0.24868 ± 0.004 | 0.27882 ± 0.000 / 0.20594 ± 0.002 | 0.26836 ± 0.000 / 0.26325 ± 0.004 |
| 0 | 120 | 80 | 0.37867 ± 0.000 / 0.17323 ± 0.004 | 0.47434 ± 0.000 / 0.21293 ± 0.006 | 0.38511 ± 0.000 / 0.17811 ± 0.005 | 0.38712 ± 0.000 / 0.13148 ± 0.001 |
| 0 | 110 | 90 | 0.75234 ± 0.000 / 0.20326 ± 0.001 | 1.06398 ± 0.000 / 0.28207 ± 0.001 | 0.75810 ± 0.001 / 0.20828 ± 0.001 | 0.73909 ± 0.001 / 0.14980 ± 0.003 |
| 0 | 105 | 95 | 0.82233 ± 0.000 / 0.37575 ± 0.003 | 1.13000 ± 0.000 / 0.66817 ± 0.001 | 0.82801 ± 0.001 / 0.38062 ± 0.004 | 0.80492 ± 0.001 / 0.34665 ± 0.002 |
| 0.5 | 175 | 25 | 0.58778 ± 0.000 / 0.56469 ± 0.002 | 0.75065 ± 0.000 / 0.71553 ± 0.002 | 0.59492 ± 0.001 / 0.57267 ± 0.002 | 0.59893 ± 0.001 / 0.61914 ± 0.005 |
| 0.5 | 150 | 50 | 0.60786 ± 0.000 / 0.55649 ± 0.001 | 0.76756 ± 0.000 / 0.70871 ± 0.002 | 0.61322 ± 0.000 / 0.56057 ± 0.001 | 0.61094 ± 0.000 / 0.60490 ± 0.003 |
| 0.5 | 120 | 80 | 0.63071 ± 0.000 / 0.57218 ± 0.002 | 0.79379 ± 0.000 / 0.73353 ± 0.002 | 0.63531 ± 0.001 / 0.57674 ± 0.002 | 0.62907 ± 0.001 / 0.52398 ± 0.002 |
| 0.5 | 110 | 90 | 0.78535 ± 0.000 / 0.67392 ± 0.001 | 1.03135 ± 0.000 / 0.89065 ± 0.001 | 0.79090 ± 0.001 / 0.67823 ± 0.001 | 0.79474 ± 0.001 / 0.61661 ± 0.001 |
| 0.5 | 105 | 95 | 0.86135 ± 0.000 / 0.70326 ± 0.009 | 1.12157 ± 0.000 / 0.93480 ± 0.012 | 0.86594 ± 0.001 / 0.70809 ± 0.009 | 0.85522 ± 0.001 / 0.65422 ± 0.012 |

Table II: Comparison of DeepAR vs SAAM on the synthetic dataset for different variances of the noise component and forecast lengths. Each cell reports DeepAR / SAAM.



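| Model | Metric | Noise 0 | Noise 0.5 |
|---|---|---|---|
| DeepAR | ND | -222.56 % | -46.55 % |
| DeepAR | RMSE | -222.82 % | -49.41 % |
| DeepAR | QL (ρ = 0.5) | -215.43 % | -45.56 % |
| DeepAR | QL (ρ = 0.9) | -268.72 % | -42.79 % |
| SAAM | ND | -55.91 % | -24.54 % |
| SAAM | RMSE | -132.97 % | -30.64 % |
| SAAM | QL (ρ = 0.5) | -53.03 % | -38.34 % |
| SAAM | QL (ρ = 0.9) | -56.36 % | -5.67 % |

Table III: Comparison of DeepAR and SAAM accuracy degradation between forecast windows of 25 and 95 time steps.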

Table II shows the results reported by DeepAR and SAAM. After training the models, 10 evaluations per scenario were conducted during testing. On average, SAAM improved ND by 28.4%, RMSE by 27.7%, and the two quantile losses by 28.2% and 28.4%, showing that including Spectral Attention enhances the model’s performance.

Furthermore, note that the Global Spectral Attention model contained in the SA module helps accelerate the detection of global trends: the earlier the forecast window starts (smaller forecast horizon), the better the results of SAAM with respect to the DeepAR base model. Specifically, in the noise-free scenario with a forecast window of 25 time steps (the most favorable setting, as there is no noise and just 25 data points to forecast), SAAM and DeepAR report very similar results, while, with an increased forecast window of 95 time steps, the SAAM architecture substantially improves the base model’s results: ND is improved by 54.3%, RMSE by 40.9%, and the two quantile losses by 54% and 56.9%.

SAAM’s ability to detect and incorporate relevant trends into the forecast has direct implications for the reported results. Table III compares, for different noise levels, the degradation of the reported metrics when the forecast window is increased from 25 to 95 time steps. SAAM’s deterioration is much smaller than DeepAR’s.

Finally, as can be seen in Fig. 6, faster training convergence is achieved when using the SA module. Furthermore, SAAM reached the minimum validation error long before DeepAR.

Figure 6: Training loss (left) and validation ND (right). The latter for a forecast window .

IV-C2 Interpretability Analysis

We now visualize how spectral attention affects the embedding evolution during the forecasting, studying SAAM’s internal behavior while using an LSTM with 1 layer and 5 hidden units per layer as embedding. In Fig. 7, we show the hidden representations produced by SAAM during the forecasting of a time series . Each row in this figure represents one of those 5 hidden dimensions at time (predicting time series’ final time-step). SAAM’s hidden variables are displayed as:

  • Blue lines represent the hidden representation of the LSTM before SA. That is, the representation that DeepAR would use to forecast.

  • Red lines represent SA module’s output, with dimension , being and . This is the representation that the DeepAR-based SAAM uses in order to perform the forecast.

  • We also represent the true signal in the first row of Fig. 7; the noise component was removed for clarity. This specific sequence can be described as:


Fig. 7 shows how the SA module incorporates global trends into the hidden representation, making the model immediately aware of trend changes: in Dim. 0 and 2, exhibits a trend associated with from time , while this component does not appear in until . Furthermore, incorporates in both Dim. 3 and 4 a component that exhibits for both and .

These examples show the ability of the proposed architecture to incorporate into the forecast patterns that the time series exhibit.

Figure 7: Hidden representations of SAAM model while forecasting at time (marked by red dotted lines). A filtering window of size was used.

Finally, Fig. 8 displays some forecast examples by DeepAR and SAAM. DeepAR frequently failed to detect trend changes at and to incorporate certain frequency variations into the mean and standard deviation of the forecast, while SAAM correctly manages these situations, as Fig. 8 shows.

Figure 8: Forecast examples on the synthetic dataset by DeepAR (top rows) and SAAM (bottom rows) on the same two time series, , . Green dotted vertical lines mark the forecast start. Prediction mean and variation appear in red, ground truth in blue.

IV-D Real-world datasets

The performance of SAAM on several real-world datasets was compared with other state-of-the-art models: two classic forecasting techniques, ARIMA [box2008time] and ETS [hyndman2018forecasting]; a recent matrix factorization method, TRMF [yu2016temporal]; an RNN-based State Space Model, DSSM [rangapuram2018deep]; DeepAR [salinas2020deepar] and ConvTrans [li2019enhancing].

Two different configurations were proposed for SAAM: the first uses DeepAR as the base model and the second combines ConvTrans with SA. Both comply with the architecture of Fig. 3. A basic framework for the common parts was maintained throughout the experiments: for the DeepAR base model, the embedding consisted of 3 LSTM layers with 40 hidden units per layer, while for the ConvTrans base model a Transformer with 8 heads and 3 layers was used. Both sets of parameters appear in the original articles [salinas2020deepar], [li2019enhancing] as optimal choices.

The electricity and traffic datasets [Dua:2019], [NIPS2016_85422afb] were evaluated using two different forecast windows, of one and seven days. The electricity dataset contains hourly time series of the energy consumption of 370 customers. Similarly, the traffic dataset contains the hourly occupancy rates, with values between zero and one, of 963 car lanes on San Francisco area freeways.

Three more datasets, each with a different forecast window, were also used. The solar dataset [solar_dataset] contains hourly solar power production records from January to August 2006 from 137 plants in Alabama. Forecast windows of 1 day were predicted during the evaluation. The wind dataset [wind_dataset] contains daily estimates of the wind energy potential of 28 countries over the period from 1986 to 2015, expressed as a percentage of a power plant’s maximum output. Finally, the M4-Hourly dataset [makridakis2018m4] contains 414 hourly time series from the M4 competition [makridakis2020m4]. Table IV summarizes each dataset and the model architecture used.

All datasets used covariates , composed of the hour of the day, day of the week, week of the year and month of the year, for daily, weekly, monthly and yearly data, respectively. In addition, covariates that measure the distance to the first observation of the time series, as well as an item index that identifies each time series, were used.
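As an illustration of the covariates just described, the following minimal sketch builds them for an hourly series using only the Python standard library. The function name `build_covariates`, the dictionary keys and the start date are hypothetical choices made for this example; the values are left unscaled, whereas a real pipeline would typically normalize them.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def build_covariates(start: datetime, num_steps: int, item_index: int,
                     step: timedelta = timedelta(hours=1)) -> List[Dict[str, int]]:
    """Return one covariate dictionary per time step (illustrative only)."""
    rows = []
    for t in range(num_steps):
        ts = start + t * step
        rows.append({
            "hour_of_day": ts.hour,
            "day_of_week": ts.weekday(),          # Monday = 0
            "week_of_year": ts.isocalendar()[1],  # ISO week number
            "month_of_year": ts.month,
            "age": t,                             # distance to the first observation
            "item_index": item_index,             # identifies the time series
        })
    return rows


# Two days of hourly covariates for a hypothetical series with index 7.
covs = build_covariates(datetime(2014, 1, 1, 0), num_steps=48, item_index=7)
```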


                       Electricity      Traffic          Solar    Wind      M4

Length                 32304            129120           5832     10957     748
# Time Series          370              370              137      28        414
Granularity            Hourly           Hourly           Hourly   Daily     Hourly
Domain                                  [0,1]
Batch Size             128              128              128      64        128
Learning Rate          1e-3             1e-3             1e-3     1e-3      1e-3
# LSTM Layers          3                3                3        3         3
# Hidden Units/Layer   40               40               40       40        40
# Heads                8                8                8        8         8
# Layers               3                3                3        3         3
Forecast Window        1 Day / 7 Days   1 Day / 7 Days   1 Day    30 Days   2 Days
Encoder Length         168 / 24         168 / 24         168      162       128
Decoder Length         24 / 168         24 / 168         24       30        48


Table IV: Details of the evaluated datasets.
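The encoder and decoder lengths in Table IV can be read as the sizes of the conditioning range and the forecast range of each training window. The sketch below shows one way to cut a series into such windows; the function name `make_windows` and the `stride` parameter are assumptions made for this example, not details taken from the experiments.

```python
from typing import List, Sequence, Tuple


def make_windows(series: Sequence[float], encoder_length: int, decoder_length: int,
                 stride: int = 1) -> List[Tuple[List[float], List[float]]]:
    """Split a series into (conditioning range, forecast range) pairs."""
    total = encoder_length + decoder_length
    windows = []
    for start in range(0, len(series) - total + 1, stride):
        conditioning = list(series[start:start + encoder_length])
        target = list(series[start + encoder_length:start + total])
        windows.append((conditioning, target))
    return windows


# Ten days of hourly data, using the 1-day electricity setting of Table IV
# (encoder length 168, decoder length 24) and a hypothetical stride of one day.
series = [float(t) for t in range(10 * 24)]
windows = make_windows(series, encoder_length=168, decoder_length=24, stride=24)
```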


Dataset     ARIMA             ETS               TRMF              DSSM              DeepAR            ConvTrans         SAAM (DeepAR)     SAAM (ConvTrans)

elect-1d    0.154 / 0.102     0.101 / 0.077     0.084 / -         0.083 / 0.056     0.075 / 0.040     0.059 / 0.034     0.0635 / 0.0317   0.056 / 0.029
elect-7d    0.283 / 0.109     0.121 / 0.101     0.087 / -         0.085 / 0.052     0.082 / 0.053     0.079 / 0.051     0.076 / 0.037     0.073 / 0.046
traffic-1d  0.223 / 0.137     0.236 / 0.148     0.186 / -         0.167 / 0.113     0.159 / 0.106     0.152 / 0.102     0.123 / 0.099     0.120 / 0.083
traffic-7d  0.492 / 0.280     0.509 / 0.529     0.202 / -         0.168 / 0.114     0.251 / 0.169     0.172 / 0.110     0.246 / 0.167     0.155 / 0.098


Table V: Evaluations summary, using / metrics, on Electricity and Traffic datasets with forecast windows of 1 and 7 days, where are extracted from [li2019enhancing].


Dataset     TRMF              DeepAR            ConvTrans         SAAM (DeepAR)     SAAM (ConvTrans)

Solar       0.241 / -         0.222 / 0.093     0.210 / 0.082     0.191 / 0.066     0.197 / 0.069
M4          - / -             0.085 / 0.044     0.067 / 0.049     0.048 / 0.029     0.061 / 0.044
Wind        0.311 / -         0.286 / 0.116     0.287 / 0.111     0.282 / 0.105     0.278 / 0.108


Table VI: Evaluations summary, using / metrics, on Solar, M4 and Wind datasets with different forecast windows, where are extracted from [li2019enhancing].


Base Model  elect-1d          elect-7d          traffic-1d        traffic-7d        Solar             M4                Wind

DeepAR      15.33% / 20.75%   7.32% / 30.19%    22.64% / 6.60%    1.99% / 1.18%     13.96% / 29.03%   43.53% / 34.09%   1.05% / 9.48%
ConvTrans   5.08% / 14.71%    7.59% / 9.80%     21.1% / 18.6%     9.8% / 10.91%     6.19% / 15.85%    8.96% / 10.20%    3.14% / 2.70%


Table VII: Improvement percentage on / metrics for each base model.

Table V shows the results obtained for both electricity and traffic datasets, with forecast windows of 1 and 7 days. Table VI shows the results for solar, M4, and wind datasets.
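The following sketch shows how quantile-based scores of this kind can be computed. It assumes that the two reported metrics are the normalized quantile losses at quantiles 0.5 and 0.9 used in [li2019enhancing]; this is an assumption, and the function name `quantile_loss` and the toy values are illustrative only.

```python
from typing import Sequence


def quantile_loss(targets: Sequence[float], predictions: Sequence[float], rho: float) -> float:
    """Normalized rho-quantile loss: 2 * sum(P_rho(x, x_hat)) / sum(|x|).

    `predictions` holds the predicted rho-quantile for each target value.
    """
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    numerator = 0.0
    for x, x_hat in zip(targets, predictions):
        diff = x - x_hat
        # Under-prediction is weighted by rho, over-prediction by (1 - rho).
        numerator += rho * diff if diff > 0 else (rho - 1.0) * diff
    denominator = sum(abs(x) for x in targets)
    return 2.0 * numerator / denominator


# Toy example with three observations (values invented for illustration).
targets = [10.0, 20.0, 30.0]
predictions = [12.0, 18.0, 27.0]
loss_50 = quantile_loss(targets, predictions, rho=0.5)
loss_90 = quantile_loss(targets, predictions, rho=0.9)
```

At a quantile of 0.5 this loss reduces to the sum of absolute errors divided by the sum of absolute targets, whereas at 0.9 under-predictions are penalized nine times more heavily than over-predictions.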

Some conclusions can be drawn from these results. The classic methods evaluated, ARIMA and ETS, performed the worst, probably because they cannot detect patterns shared across the different time series. The results reported for TRMF are slightly better but, for most configurations, it could not beat the approaches based on deep neural networks, among which DeepAR and ConvTrans clearly outperformed DSSM. The two proposed variations of SAAM obtained the best results, as happened in Section IV-C with the synthetic dataset experiments.

Finally, Table VII compares the base models, DeepAR and ConvTrans, with their SAAM versions, which consistently improved the forecast accuracy of the base models.

For DeepAR, was improved by 15.1% and by 18.8% after including the SA module in SAAM. For ConvTrans, was improved by 8.8% and by 11.8% when using our proposed SAAM architecture.
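The arithmetic behind these figures can be checked with a short sketch. It assumes that each entry of Table VII is the relative reduction of the base model's score, and that the percentages quoted above are unweighted means over the seven dataset configurations; the text does not state this explicitly, although the numbers are consistent with it. Function and variable names are illustrative.

```python
from typing import List, Tuple


def relative_improvement(base: float, saam: float) -> float:
    """Percentage by which the SAAM score reduces the base model's score."""
    return 100.0 * (base - saam) / base


def mean_improvement(rows: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Unweighted mean of each metric's improvement across datasets."""
    n = len(rows)
    return (sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n)


# elect-1d, first metric in Table V: DeepAR 0.075, SAAM (DeepAR) 0.0635.
elect_1d = relative_improvement(0.075, 0.0635)

# Improvement percentages copied from Table VII (first / second metric).
table_vii = {
    "DeepAR": [(15.33, 20.75), (7.32, 30.19), (22.64, 6.60), (1.99, 1.18),
               (13.96, 29.03), (43.53, 34.09), (1.05, 9.48)],
    "ConvTrans": [(5.08, 14.71), (7.59, 9.80), (21.1, 18.6), (9.8, 10.91),
                  (6.19, 15.85), (8.96, 10.20), (3.14, 2.70)],
}
means = {model: mean_improvement(rows) for model, rows in table_vii.items()}
```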

These results show that different deep autoregressive models, with significant differences between them, can be improved by correctly incorporating frequency-domain information into the forecast, without any significant increase in complexity.

IV-E Ablation study

Finally, to quantify the real effect of the SA module and understand the effectiveness of its components, an ablation study was conducted. SA’s basic blocks, the Local Spectral Attention model and the Global Spectral Attention model described in Section III, were evaluated separately.

To evaluate the behavior of both frequency-domain attention models, two ablation studies with two different datasets were conducted to ensure the robustness of the conclusions. For both studies, SAAM was trained using an LSTM as the embedding function.

Considering that SA’s output obeys , as stated in Table I, three different configurations of the model were tested: 1) SAAM; 2) SAAM without using Local Spectral Attention, ; 3) SAAM without using Global Spectral Attention, .




Full SAAM      0.028 ± 0.000   0.044 ± 0.000   0.029 ± 0.000   0.025 ± 0.000
No global SA   0.886 ± 0.003   1.064 ± 0.004   0.893 ± 0.003   1.077 ± 0.009
No local SA    2.084 ± 0.000   2.352 ± 0.000   2.100 ± 0.000   2.416 ± 0.001

Full SAAM      0.816 ± 0.000   1.084 ± 0.000   0.823 ± 0.001   0.859 ± 0.001
No global SA   0.949 ± 0.007   1.167 ± 0.009   0.956 ± 0.007   1.136 ± 0.007
No local SA    1.936 ± 0.000   2.263 ± 0.000   1.962 ± 0.000   1.936 ± 0.001

Full SAAM      0.911 ± 0.000   1.145 ± 0.000   0.915 ± 0.000   0.891 ± 0.001
No global SA   0.990 ± 0.006   1.231 ± 0.005   0.995 ± 0.006   1.126 ± 0.008
No local SA    1.567 ± 0.000   1.890 ± 0.000   1.591 ± 0.001   1.493 ± 0.002


Table VIII: Ablation study on the synthetic dataset.




Full SAAM      0.076 ± 0.000   0.496 ± 0.001   0.076 ± 0.000   0.03759 ± 0.000
No global SA   0.236 ± 0.001   2.283 ± 0.029   0.237 ± 0.001   0.078 ± 0.001
No local SA    0.263 ± 0.000   2.665 ± 0.001   0.264 ± 0.000   0.092 ± 0.000


Table IX: Ablation study on the electricity dataset with 7 days forecast windows.

Note that fixing makes no change in the local context, which implies that no filtering is performed. Besides, disables the model’s ability to incorporate global trends into the local context.
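The idea that a fixed attention value amounts to "no filtering" can be illustrated with a generic frequency-domain mask. The sketch below is not the SA module itself: it assumes that filtering acts as an element-wise multiplication of a window's spectrum by a mask, so that an all-ones mask returns the window unchanged. The plain DFT, the function names and the toy signal are all choices made for this example.

```python
import cmath
import math
from typing import List, Sequence


def dft(x: Sequence[float]) -> List[complex]:
    """Discrete Fourier transform (O(n^2), adequate for short windows)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum: Sequence[complex]) -> List[float]:
    """Inverse DFT, keeping the real part of each reconstructed sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def spectral_filter(window: Sequence[float], mask: Sequence[float]) -> List[float]:
    """Multiply the window's spectrum by a mask and return to the time domain."""
    return idft([m * s for m, s in zip(mask, dft(window))])


n = 8
slow = [math.cos(2 * math.pi * t / n) for t in range(n)]   # slow component
window = [slow[t] + 0.5 * (-1) ** t for t in range(n)]     # plus alternating noise

unchanged = spectral_filter(window, [1.0] * n)             # all-ones mask: no filtering
mask = [1.0] * n
mask[n // 2] = 0.0                                         # drop the alternating component
denoised = spectral_filter(window, mask)
```

With the all-ones mask the output matches the input up to rounding error, while zeroing the highest-frequency bin recovers the slow component.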

The first ablation study used the synthetic dataset explained in Section IV-C with a noise component of , no covariates and a forecast window of . Table VIII shows the obtained results. The degradation produced by the ablation when is larger than in the other cases considered, which is expected given that the model was trained under those conditions. Also, disabling the local attention, , produced worse results than , where the global attention is not used. The latter causes a larger deterioration in the variance of the predictions, which could be a sign of the model’s inability to follow the trend once the global attention model is removed.

A second ablation study was performed on the electricity dataset using a 7-day forecast window. Again, the Local and Global Spectral Attention models were evaluated separately. For this dataset, the degradation of the results is similar for both ablations, as Table IX shows. As in the synthetic dataset ablation study, not using the Global Spectral Attention model translates into a higher standard deviation of the results.

V Conclusion

We have proposed a novel methodology for neural probabilistic time series forecasting that marries signal processing techniques with deep-learning-based autoregressive models, developing an attention mechanism that operates over the frequency domain. Thanks to this combination, which is encapsulated in the Spectral Attention module, local spectrum filtering and the incorporation of global patterns meet during the forecast. To do so, two attention models operate over the spectral-domain representation of the embedding to determine, at every time instant and for each time series, which components of the frequency domain should be considered noise and hence be filtered out, and which global patterns are relevant and should be incorporated into the predictions. Experiments on synthetic and real-world datasets support these statements and show how our modular architecture can be incorporated into a variety of base deep autoregressive models, consistently improving the results of these base models and achieving state-of-the-art performance. In particular, in noisy environments or with short conditioning ranges, our method stands out by explicitly filtering the noise and rapidly recognizing relevant trends.


This work has been supported by the Spanish government Ministerio de Ciencia, Innovación y Universidades under grants FPU18/00470, TEC2017-92552-EXP and RTI2018-099655-B-100, by Comunidad de Madrid under grants IND2017/TIC-7618, IND2018/TIC-9649, IND2020/TIC-17372, and Y2018/TCS-4705, by the BBVA Foundation under the Deep-DARWiN project, and by the European Union (FEDER) and the European Research Council (ERC) through the European Union’s Horizon 2020 research and innovation program under Grant 714161.