Topological Attention for Time Series Forecasting

07/19/2021 ∙ by Sebastian Zeng, et al. ∙ Universitätsbibliothek Salzburg 0

The problem of (point) forecasting univariate time series is considered. Most approaches, ranging from traditional statistical methods to recent learning-based techniques with neural networks, directly operate on raw time series observations. As an extension, we study whether local topological properties, as captured via persistent homology, can serve as a reliable signal that provides complementary information for learning to forecast. To this end, we propose topological attention, which allows attending to local topological features within a time horizon of historical data. Our approach easily integrates into existing end-to-end trainable forecasting models, such as , and in combination with the latter exhibits state-of-the-art performance on the large-scale M4 benchmark dataset of 100,000 diverse time series from different domains. Ablation experiments, as well as a comparison to a broad range of forecasting methods in a setting where only a single time series is available for training, corroborate the beneficial nature of including local topological information through an attention mechanism.



There are no comments yet.


page 16

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Time series are ubiquitous in science and industry, from medical signals (e.g., EEG), motion data (e.g., speed, steps, etc.) or economic operating figures to ride/demand volumes of transportation network companies (e.g., Uber, Lyft, etc.). Despite many advances in predicting future observations from historical data via traditional statistical [Brockwell91a]

, or recent machine-learning approaches

[Oreshkin20a, Wang19a, Salinas19a, Rangapuram18a, Li19a, Yu16a], reliable and accurate forecasting remains challenging. This is not least due to widely different and often heavily domain dependent structural properties of time related sequential data.

In this work, we focus on the problem of (point) forecasting univariate time series, i.e., given a length-vector of historical data, the task is to predict future observations for a given time horizon . While neural network models excel in situations where a large corpus of time series is available for training, the case of only a single (possibly long) time series is equally important. The arguably most prominent benchmarks for the former type of forecasting problem are the “(M)akridakis”-competitions, such as M3 [Makridakis00a] or M4 [Makridakis18a]. While combinations (and hybrids) of statistical and machine-learning approaches have largely dominated these competitions [Makridakis18a, see Table 4], Oreshkin et al. [Oreshkin20a] have recently demonstrated that a pure machine-learning based model (N-BEATS) attains state-of-the-art performance. Interestingly, the latter approach is simply built from a collection of common neural network primitives which are not specific to sequential data.

However, the majority of learning-based approaches directly operate on the raw input signal, implicitly assuming that viable representations for forecasting can be learned via common neural network primitives, composed either in a feedforward or recurrent manner. This raises the question of whether there exist structural properties of the signal, which are not easily extractable via neural network components, but offer complementary information. One prime example111Although learning-based approaches [Som20a] exist to approximate topological summaries (without guarantees). are topological features, typically obtained via persistent homology [Carlsson09a, Edelsbrunner2010]. In fact, various approaches [Perea15a, Gidea18a, Dlotko19a, Khasawneh18a, Gidea20a] have successfully used topological features for time series analysis, however, mostly in classification settings, for the identification of certain phenomena in dynamical systems, or for purely exploratory analysis (see Section 2).

Contribution. We propose an approach to incorporate local topological information into neural forecasting models. Contrary to previous works, we do not compute a global topological summary of historical observations, but features of short, overlapping time windows to which the forecasting model can attend to. The latter is achieved via self-attention and thereby integrates well into recent techniques, such as N-BEATS [Oreshkin20a]. Notably, in our setting, computation of topological features (via persistent homology) comes with little computational cost, which allows application in large-scale forecasting problems.

Figure 1: Illustration of topological attention, computed on time series observations . The signal is decomposed into a collection of overlapping windows of length . For each window, a topological summary, i.e., a persistence barcode , is computed. These local topological summaries are then vectorized (in ) via a differentiable map , fed through several transformer encoder layers [Vaswani17a] (implementing a multi-head self-attention mechanism) with positional encoding at the input, and finally mapped to by an MLP (best-viewed in color).

Problem statement. In practice, neural forecasting models typically utilize the last observations of a time series in order to yield (point) forecasts for a given time horizon . Under this perspective, the problem boils down to learning a function (parametrized as a neural network)


from a given collection of inputs (i.e., length- vectors) and targets (i.e., length- vectors). Specifically, we consider two settings, where either (1) a large collection of time series is available, as in the M4 competition, or (2) we only have access to a single time series. In the latter setting, a model has to learn from patterns within a time series, while the former setting allows to exploit common patterns across multiple time series.

2 Related work

Persistent homology and time series. Most approaches to topological time series analysis

are conceptually similar, building on top of work by de Silva et al. [deSilva12a] and Perea & Harer [Perea15a, Perea16a]. Herein, time series observations are transformed into a point cloud via a time-delay coordinate embedding [Takens81a] from which Vietoris-Rips (VR) persistent homology is computed. The resulting topological summaries, i.e., persistence barcodes, are then used for downstream processing. Within this regime, Gidea et al. [Gidea20a] analyze the dynamics of cryptocurrencies using persistence landscapes [Bubenik15a], Khasawneh et al. [Khasawneh18a] study chatter classification in synthetic time series from turning processes and Dłotko et al. [Dlotko19a] identify periodicity patterns in time series. In [Kim18a]

, Kim et al. actually compute one-step forecasts for Bitcoin prices and classify price patterns, essentially feeding barcode statistics as supplementary features to a MLP/CNN-based regression model. Surprisingly, very few works deviate from this pipeline, with the notable exception of

[Gidea18a], where VR persistent homology is not computed from a time-delay coordinate embedding, but rather from assembling observations (within sliding windows of size ) from a -variate time series into a -dimensional point cloud, followed by VR persistent homology computation.

Although these works clearly demonstrate that capturing the “shape” of data via persistent homology provides valuable information for time series related problems, they (1) rely on handcrafted features (i.e., predefined barcode summary statistics, or a fixed barcode-vectorization strategy), (2) consider topological summaries as the single source of information and (3) only partially touch upon forecasting problems (with the exception of [Kim18a]). Furthermore, in this existing line of work, sweeping a sliding window over the time series is, first and foremost, a way to construct a point cloud which represents the entire time series. Instead, in our approach, each window yields its own topological summary in the form of a persistence barcode, reminiscent to representing a sentence as a sequence of word embeddings in NLP tasks. When combined with learnable representations of persistence barcodes [Hofer19b, Carriere20a], this perspective paves the way for leveraging recent techniques for handling learning problems with sequential data, such as attention [Vaswani17a], and allows to seamlessly integrate topological features into existing neural forecasting techniques.

Neural network approaches to time series forecasting. Recently, various successful neural network approaches to (mostly probabilistic) time series forecasting have emerged, ranging from auto-regressive neural networks as in DeepAR [Salinas19a], to (deep) extensions of traditional state space models, such as DeepFactors [Wang19a] or DeepState [Rangapuram18a]. While these models are inherently tailored to the sequential nature of the forecasting problem, Li et al. [Li19a] instead rely on the concept of (log-sparse) self-attention [Vaswani17a], fed by the outputs of causal convolutions, and Oreshkin et al. [Oreshkin20a] even abandon sequential neural network primitives altogether. The latter approach, solely based on operations predominantly found in feed-forward architectures, achieves state-of-the-art performance for (point) forecasts across several benchmarks, including the large-scale M4 competition. Yet, a common factor in all aforementioned works is that raw time series observations are directly input to the model, assuming that relevant structural characteristics of the signal can be learned. While we choose an approach similar to [Li19a], in the sense that we rely on self-attention, our work differs in that representations fed to the attention mechanism are not obtained through convolutions, but rather through a topological analysis step which, by its construction, captures the “shape” of local time series segments.

3 Topological attention

The key idea of topological attention is to analyze local segments within an input time series, , through the lens of persistent homology. As mentioned in Section 2, the prevalent strategy in prior work is, to first construct a point cloud from via a time-delay coordinate embedding and to subsequently compute VR persistent homology. Historically, this is motivated by studying structural properties of an underlying dynamical system, with a solid theoretical foundation, e.g., in the context of identifying periodicity patterns [Perea15a, Perea16a]. In this regime, is encoded as a point cloud in by considering observations within a sliding window of size as a point in .

While the time-delay embedding strategy is adequate in settings where one aims to obtain one global topological summary, it is inadequate when local structural properties of time series segments are of interest. Further, unless large (computationally impractical) historical time horizons are considered, one would obtain relatively sparse point clouds that, most likely, carry little information.

3.1 Time series as local topological summaries

Different to time-delay embeddings, we follow an alternative strategy: a time series signal is still decomposed into a sequence of (overlapping) windows, but not to yield a point cloud element, but rather to be analyzed in isolation. In the following, we only discuss the necessities specific to our approach, and refer the reader to [Edelsbrunner02a, Carlsson09a, Boissonnat18a] for a thorough treatment of persistent homology.

To topologically analyze a length- time series, over the time steps , in a computationally tractable manner, lets consider a 1-dimensional simplicial complex of the form

where denote 0-simplices (i.e., vertices) and 1-simplices (i.e., edges) are in iff and are two consecutive time indices. Topologically, carries the connectivity properties of a time series of length , which is equivalent to a straight line. This is the same for all time series of length and thus offers no discriminative information.

Persistent homology, however, lets us combine the purely topological representation of the time series with its actual values. For a specific , let denote its increasingly sorted values and consider

In fact, forms an increasing sequence of subsets of , i.e., a filtration. Importantly, while is the same for all time series of length , the filtration, , is determined by the values of . Persistent homology then tracks the evolution of topological features throughout this sequence and summarizes this information in the form of persistence barcodes.

In our specific case, as is topologically equivalent to a straight line, we only get -dimensional features, i.e., connected components. Hence, we obtain one (0 degree) barcode . This barcode is a multiset of (birth, death) tuples, representing the birth () and death () of topological features. Informally, we may think of building , piece-by-piece, according to the sorting of the ’s, starting with the lowest value, and tracking how connected components appear / merge, illustrated in Fig. 2.

Figure 2: Illustration of -dimensional persistent homology computation for a time series of length . The barcode encodes topological changes, in the form of (birth,death) tuples, as we sweep through the growing sequence of subsets of . For example, the connected component born at , dies at , caused by the merge with the connected component born at (best-viewed in color).
Remark 3.1.

The information captured throughout this process has two noteworthy properties. First, it is stable in the sense that small changes in the observation values may not cause arbitrary changes in the respective barcodes, see [Cohen-Steiner2007]. Second, one may equally order the negative observations, i.e., , and thus obtain . In that manner, the signal is analyzed from below and above.

Finally, to extract local topological information, we do not compute one single barcode for , but one for each sliding window of size , see Fig. 1. Given a decomposition of into subsequent windows, we obtain barcodes, , which constitute the entry point for any downstream operation. Informally, those barcodes encode the evolution of local topological features over time.

3.2 Barcode vectorization

Although persistence barcodes concisely encode topological features, the space of persistence barcodes, denoted as , carries no linear structure [Turner14a] and the nature of barcodes as multisets renders them difficult to use in learning settings. Myriad approaches have been proposed to alleviate this issue, ranging from fixed mappings into a vector space (e.g., [Bubenik15a, Adams17a]), to kernel techniques (e.g., [Reininghaus14a, Kusano16a]) and, more recently, to learnable vectorization schemes (e.g., [Hofer19a, Carriere20a]). Here, we follow the latter approach, as it integrates well into the regime of neural networks. In particular, the core element in learnable vectorization schemes is a differentiable map of the form


where denotes a so called barcode coordinate function [Hofer19a], designed to preserve the stability property in Remark 3.1. Upon assembling a collection of such coordinate functions and subsuming parameters into , one obtains a -dimensional vectorization of via


Taking into account the representation of as persistence barcodes, we summarize the vectorization step as


This is distinctly different to [Perea15a, Perea16a, Khasawneh18a, Kim18a] (see Section 2), where one barcode is obtained and this barcode is represented in a fixed manner, e.g., via persistence landscapes [Bubenik15a] or via barcode statistics.

3.3 Attention mechanism

In order to allow a forecasting model to attend to local topological patterns, as encoded via the , we propose to use the encoder part of Vaswani et al.’s [Vaswani17a] transformer architecture, implementing a repeated application of a (multi-head) self-attention mechanism. Allowing to attend to local time series segments is conceptually similar to Li et al. [Li19a], but differs in the way local structural properties are captured: not via causal convolutions, but rather through the lens of persistent homology. In this setting, the scaled dot-product attention, at the heart of a transformer encoder layer, computes


and , denoting learnable (key, value, query) projection matrices. Recall that holds all -dimensional vectorizations the persistence barcodes. In its actual incarnation, one transformer encoder layer222omitting the additional normalization and projection layers for brevity, denoted as , computes and concatenates parallel instances of Eq. (5), (i.e., the attention heads), and internally adjusts such that . Composing such AttnEnc maps, one obtains


Finally, we use a two-layer

(with ReLU activations) to map the vectorized output of the transformer encoder to a

-dimensional representation. Topological attention thus implements


where denotes row-major vectorization operation.

Remark 3.2.

Notably, as the domain of TopAttn is

, error backpropagation stops at the persistent homology computation. However, we remark that, given recent works on differentiating through the persistent homology computation

[Hofer20a, Carriere21a], one could even combine topological attention with, e.g., [Li19a], in the sense that the outputs of causal convolutions could serve as the filter function for persistent homology. Error backpropagation would then consequently allow to learn this filter function.

3.4 Forecasting model

While the representation yielded by topological attention (see Eq. (7)) can be integrated into different neural forecasting approaches, it specifically integrates well into the N-BEATS model of Oreshkin et al. [Oreshkin20a]. We briefly describe the generic incarnation of N-BEATS next, but remark that topological attention can be similarly integrated into the basis-expansion variant without modifications.

Essentially, the generic N-BEATS model is assembled from a stack of double-residual blocks. For , each block consists of a non-linear map (implemented as a MLP)


with denoting the internal dimensionality, and two subsequent maps that yield the two-fold output of the -th block as

with , and . While the yield the connection to the following computation block, the are used to compute the final model prediction via (component-wise) summation, i.e., .

Importantly, in this computational chain, the can be leveraged as an interface to integrate additional information. In our case, we enrich the input signal to each block by the output of the topological attention mechanism through concatenation (see figure to the right), i.e.,


This means that the time series signal is (1) input (in its raw form) to N-BEATS and (2) its topological attention representation, , is supplied to each block as a complementary signal. In a similar manner (i.e., through concatenation), can be included in much simpler models as well (see Section 4.2).

Computational complexity. Aside from the computational overhead incurred by the multi-head attention module, we need to compute -dimensional persistent homology for each sliding window. This can be done efficiently, using union find data structures, with complexity , where with the sliding window size and denoting the inverse of the Ackermann function. As the latter grows very slowly, computational complexity is roughly linear for this part.

4 Experiments

We assess the quality of point forecasts in two different settings and perform ablation studies to isolate the impact of topological attention in the proposed regime.

Throughout all experiments, we compute persistent homology from and (see Remark 3.1) using Ripser [Bauer21]. Barcode vectorization, see Eq. (3), is based on rational hat coordinate functions [Hofer19a] with the position parameters (i.e., the locations of each coordinate function in ) initialized by -means++ clustering over all barcodes in the training data (with set to the number of coordinate functions). This yields a representation per sliding window. Full architecture details and dataset statistics can be found in the suppl. material.

Ablation setup. When assessing each component of topological attention in isolation, we refer to +Top as omitting the TransformerEncoder part in Eq. (7), and to +Attn as directly feeding the time series observations to the transformer encoder, i.e., omitting TopVec in Eq. (7).

4.1 Evaluation metrics

To evaluate the quality of point forecasts, two commonly used metrics are the symmetric mean absolute percentage error (sMAPE) and the mean absolute scaled error (MASE). Letting denote the length- forecast, the true observations and the length- history of input observations, both scores are defined as [Makridakis20a]


with depending on the observation frequency. For results on the M4 benchmark (see Section 4.3), we adhere to the competition guidelines and additionally report the overall weighted average (OWA) which denotes the arithmetic mean of sMAPE and MASE (with pre-specified), both measured relative to a naïve (seasonally adjusted) forecast (also provided by the M4 competition as Naive2).

4.2 Single time series experiments

We first consider the simple, yet frequently occurring, practical setting of one-step forecasts with historical observations available for only a single length- time series. Upon receiving a time series (with ), a model should yield a forecast for the time point (i.e., ).

4.2.1 Dataset

To experiment with several single (but long) time series of different characteristics, we use 10 time series from the publicly available electricity [Dua17a]demand dataset333 and four (third-party) time series of car part demands, denoted as car-parts. Based on the categorization scheme of [Syntetos05a], the time series are chosen such that not only smooth time series (regular demand occurrence and low demand quantity variation) are represented, but also lumpy ones (irregular demand occurrence and high demand quantity variation). For electricity, the respective percentages are 70% vs. 30%, and, for car-parts, 75% vs. 25%. All observations are non-negative. In case of electricity, which contain measurements in 15min intervals, we aggregate (by summation) within 7h windows, yielding a total of 3,762 observations. For car-parts, demand is measured on daily basis across a time span of 7-8 years (weekends and holidays excluded), yielding 4,644 observations on average. For each time series, 20% of held-out consecutive observations are used for testing, 5% for validation.

4.2.2 Forecasting model

We employ a simple incarnation of the forecasting model from Section 3.4. In particular, we replace N-BEATS by a single linear map (with bias), implementing


with denoting the concatenation of the topological attention signal and the input time series , as in Eq. (9). During training, we randomly extract consecutive observations from the training portion of the time series. The first observations are used as input , the observation at is used as target. Forecasts for all testing observations are obtained via a length- rolling window, moved forward one step a time.

In terms of hyperparameters for topological attention, we use a single transformer encoder layer with four attention heads and 32 barcode coordinate functions. We minimize the mean-squared-error via ADAM over 1.5k (

electricity) and 2k (car-parts) iterations, respectively, with a batch size of 30. Initial learning rates for the components of Eq. (7) are 9e-2 (TopVec, MLP) and 5e-2 (TransformerEncoder), as well as 9e-2 for the linear map of Eq (11). All learning rates are annealed following a cosine learning rate schedule.

Method Rank % Diff.
Lin.+TopAttn 1.50 2.82
Prophet 2.75 4.91
MLP 3.00 7.49
DeepAR 3.50 7.86
autoARIMA 5.25 13.45
LSTM 6.00 16.24
MQ-RNN 7.50 34.36
Naive 7.75 30.08
MQ-CNN 7.75 29.19
(a) car-parts (4)

Rank % Diff.
Lin.+TopAttn 1.50 10.59
DeepAR 1.90 12.14
MLP 2.90 15.39
MQ-CNN 4.70 45.44
autoARIMA 5.10 45.67
Prophet 6.10 61.55
LSTM 7.40 69.89
MQ-RNN 7.50 68.74
Naive 7.90 77.71
(b) electricity (10)
Table 1: Single time series experiments on car-parts and electricity, using the sMAPE as performance criterion. Listed are (1) the average rank ( Rank) of each method, as well as (2) the average percentual difference (% Diff.) to the Rank-1 approach per time series. denotes GluonTS [gluonts_jmlr] implementations.
4.2.3 Results & Ablation study

We compare against several techniques from the literature that are readily available to a practitioner. This includes autoARIMA [Hyndman08b], Prophet [Taylor17a], a vanilla LSTM model, as well as several approaches implemented within the GluonTS [gluonts_jmlr] library. With respect to the latter, we list results for a single-hidden-layer MLP, DeepAR [Salinas19a] and MQ-CNN/MQ-RNN [Wen17a]. By Naive, we denote a baseline, yielding as forecast for . Importantly, each model is fit separately to each time series in the dataset.

For a fair comparison, we further account for the fact that forecasting models typically differ in their sensitivity to the length of the input observations, . To this end, we cross-validate (for all methods) using the sMAPE on the validation set. Cross-validation points are determined by the topological attention parameters and , i.e., the number and lengths of the sliding windows. For ranging from 10 to 200 and ranging from 5 to 45, we obtain a wide range of input lengths, from 14 to 244. Instead of listing absolute performance figures, we focus on the average rank444the sMAPE determines the rank of a method per time series; these ranks are then averaged over all time series within the cohort of methods, as well as the average percentual difference to the best-ranking approach per time series.

Table 1 lists the overall statistics for electricity and car-parts. We observe that, while the overall ranking per dataset differs quite significantly, Lin+TopAttn consistently ranks well. Second, the average percentual difference to the best-ranking approach per time series is low, meaning that while Lin+TopAttn might not yield the most accurate forecasts on a specific time series, it still produces forecasts of comparable quality.

Table 2 provides the same performance statistics for an ablation study of the topological attention components. Specifically, we combine the linear model of Eq. (11) with each component of topological attention in isolation.

Rank % Diff.
car-parts (4)
Lin. 2.75 0.15
 +Top 3.50 1.45
 +Attn 1.25 0.15
 +TopAttn 2.50 0.64
electricity (10)
Lin. 2.40 5.95
 +Top 3.40 12.20
 +Attn 2.30 2.41
 +TopAttn 1.90 2.10
Table 2: Ablation study

Some observations are worth pointing out: First, the linear model (Lin.) alone already performs surprisingly well. This can possibly be explained by the fact that the task only requires one-step forecasts, for which the historical length- observations (directly preceding the forecast point) are already quite informative. Second, directly including topological features (i.e., +Top) has a confounding effect. We hypothesize that simply vectorizing local topological information from all sliding windows, without any focus, obfuscates relevant information, rather than providing a reliable learning signal. This also highlights the importance of attention in this context, which, even when directly fed with observations from each sliding window (i.e., +Attn), exhibits favorable performance (particularly on car-parts). However, in the latter strategy, the input dimensionality for the transformer encoder scales with the sliding window size . Contrary to that, in case of topological attention, the input dimensionality is always fixed to the of number of coordinate functions, irrespective of the sliding window size .

4.3 Large-scale experiments on the M4 benchmark

Different to Section 4.2, we now consider having multiple time series of different lengths and characteristics available for training. Further, instead of one-step forecasts, the sought-for model needs to output (multi-step) point forecasts for time horizons .

4.3.1 Dataset

Experiments are based on the publicly available M4 competition dataset555available at, consisting of 100,000 time series from six diverse domains, aggregated into six subgroups that are defined by the frequency of observations (i.e., yearly, quarterly, monthly, weekly, daily and hourly). Forecasting horizons range from (yearly) to (hourly). The test set is fixed and contains, for all time series in each subgroup, exactly observations to be predicted (starting at the last available training time point).

4.3.2 Forecasting model

We employ the forecasting model666based on the N-BEATS reference implementation of Section 3.4 and largely stick to the architecture and training configuration of [Oreshkin20a, Table 18]. Our implementation only differs in the hidden dimensionality of N-BEATS blocks (128 instead of 512) and in the ensembling step. In particular, for each forecast horizon (i.e., for each subgroup), [Oreshkin20a] train multiple models, varying from to

, using ten random initializations and three separate loss functions (

sMAPE, MASE, MAPE). One final forecast per time series is obtained by median-aggregation of each model’s predictions. In our setup, we solely rely on the sMAPE as loss function, vary only from to , but still use ten random initializations. Even with this (smaller) ensemble size (40 models per subgroup, instead of 180), N-BEATS alone already outperforms the winner of M4 (see Table 3). As we are primarily interested in the effect of integrating topological attention, sacrificing absolute performance for a smaller ensemble size is incidental.

In terms of hyperparameters for topological attention, the length () of sliding windows is set to , where varies per subgroup as specified above. The model uses 20 transformer encoder layers with two attention heads and 64 structure elements for barcode vectorization. For optimization, we use ADAM with initial learning rates of 1e-3 (for N-BEATS and the MLP part of Eq. (7)), 8e-3 (TopVec) and 5e-3 (TransformerEncoder). All learning rates are annealed according to a cosine learning rate schedule over 5,000 iterations with a batch size of 1,024.

Method Yearly Quarterly Monthly Others Average
(23k) (24k) (48k) (5k) (100k)
Winner M4 [Smyl20a] 13.176 / 0.778 9.679 / 0.847 12.126 / 0.836 4.014 /0.920 11.374 / 0.821
Benchmark 14.848 / 0.867 10.175 / 0.890 13.434 / 0.920 4.987 / 1.039 12.555 / 0.898
Naive2 16.342 / 1.000 11.011 / 1.000 14.427 / 1.000 4.754 / 1.000 13.564 / 1.000
N-BEATS [Oreshkin20a] 13.149 / 0.776 9.684 / 0.845 12.054 / 0.829 3.789 / 0.857 11.324 / 0.814
N-BEATS+TopAttn 13.063 / 0.771 9.687 / 0.845 12.025 / 0.828 3.803 / 0.860 11.291 / 0.811
Table 3: Performance comparison on the M4 benchmark in terms of sMAPE / OWA, listed by subgroup. N-BEATS and N-BEATS+TopAttn denote an ensemble formed by training multiple models, varying from to and randomly initializing each model ten times (i.e., a total of 40 models per subgroup). Forecasts are obtained by taking the median over the point forecasts of all models. denotes results from [Makridakis18b, Makridakis20a].
4.3.3 Results & Ablation study

Table 3 lists the sMAPE and OWA for the winner of the M4 competition [Smyl20a], as well as the Naive2 baseline (with respect to which the OWA is computed) and the M4 benchmark approach, obtained as the arithmetic mean over simple, Holt, and damped exponential smoothing.

Method sMAPE OWA
N-BEATS 11.488 0.827
 +Top 11.505 0.920
 +Attn 11.492 0.826
 +TopAttn 11.466 0.824
Table 4: Ablation study

In terms of the OWA, we see an overall 0.4% improvement over N-BEATS and a 1.2% improvement over the M4 winner [Smyl20a]. In particular, topological attention performs well on the large yearly / monthly subgroups of 23k and 48k time series, respectively. While OWA scores are admittedly quite close, the differences are non-negligible, considering the large corpus of 100k time series. In fact, several methods in the official M4 ranking differ by an even smaller amount with respect to the OWA measure.

Similar to the ablation results of Section 4.2, the ablation study in Table 4 (conducted for only) reveals the beneficial effect of topological attention, in particular, the beneficial nature of allowing to attend to local topological features. Contrary to the ablation in Table 2, we see that in this large-scale setting, neither topological features (+Top), nor attention (+Attn) alone yield any improvements over N-BEATS; when integrated separately into the N-BEATS model, both components even deteriorate performance in terms of the sMAPE.

5 Conclusion

While several prior forecasting works have pointed out the relevance of local structural information within historical observations (e.g., [Li19a]), it is typically left to the model to learn such features from data. Instead, we present a direct approach for capturing the “shape” of local time series segments via persistent homology. Different to the typical application of the latter in signal analysis, we capture the evolution of topological features over time, rather than a global summary, and allow a forecasting model to attend to these local features. The so obtained topological attention mechanism yields a complementary learning signal that easily integrates into neural forecasting approaches. In combination with N-BEATS [Oreshkin20a], for instance, large-scale experiments on the M4 benchmark provide evidence that including topological attention indeed allows to obtain more accurate point forecasts.

Societal impact. Due to the ubiquity of time series data, forecasting in general, certainly touches upon a variety of societally relevant and presumably sensible areas. As our work has potential impact in that sense, we perform large-scale experiments over a wide variety of time series from different domains, thereby obtaining a broad picture of the overall forecasting performance.


Appendix A Evaluation Criteria

In the following, we first replicate the definition of two commonly used, scale-free evaluation metrics, i.e., the

symmetric mean absolute percentage error (sMAPE) and the mean absolute scaled error (MASE), see [Manuscript, Section 4.1]. These scale-free metrics are standard in the practice of forecasting and used across all experiments in the manuscript. Subsequently, we define the overall weighted average (OWA), i.e., a M4 competition specific performance measure used to rank competition entries. We also provide a toy calculation example.

Letting denote the length- forecast, the true observations and the length- history of input observations, both metrics are defined as [Makridakis20a]


with depending on the observation frequency. In the M4 competition [Makridakis20a], the frequencies per subgroup are: 12 for monthly, four for quarterly, 24 for hourly and one for yearly / weekly / daily data. To obtain the OWA of a given forecast method, say Forecast, we compute [Oreshkin20a]


Thus, if Forecast displays a MASE of 1.63 and a sMAPE of 12.65% across the 100k time series of M4, while Naive2 displays a MASE of 1.91 and a sMAPE of 13.56%, the relative MASE and sMAPE of Forecast would be 1.63/1.91 = 0.85 and 12.65/13.56 = 0.93, respectively, resulting in an OWA of (0.93 + 0.85)/2 = 0.89. According to [Makridakis20a], this indicates that, on average, Forecast is about 11% more accurate than Naive2, taking into account both sMAPE and MASE.

Performance criteria for single time series experiments. In our single time series experiments of [Manuscript, Section 4.2], we use the sMAPE as an underlying evaluation measure and compute the following two statistics: first, the average rank ( Rank) of each method based on the rank on each single time series in car-parts and electricity (both datasets are treated separately); and, second, the average percentual difference (% Diff) to the best approach per time series. A calculation example, for four hypothetical models and three time series (TS, TS, TS), is listed in Table A.1.

TS TS TS  Rank % Diff
Model A 14.4 (3, 9.66) 11.8 (1, 0.00) 10.5 (2, 3.81) 2.00 6.73
Model B 14.3 (2, 8.39) 12.1 (2, 2.48) 10.8 (3, 6.48) 2.33 5.78
Model C 13.1 (1, 0.00) 12.2 (3, 3.28) 11.1 (4, 9.01) 2.67 6.14
Model D 14.5 (4, 9.66) 13.1 (4, 9.92) 10.1 (1, 0.00) 3.00 9.79
Table A.1: Example calculation of the average rank ( Rank) and the average percentual difference (% Diff), as reported in [Manuscript, Table 1]. In this calculation example, lower scores are better. For instance, on TS, the rank of Model C is 3 and the percentual difference to the best performing model on TS (i.e., Model A) is .

Appendix B Dataset Details

For completeness, Table B.1 replicates [Oreshkin20a, Table 2], providing an overview of the key statistics for the M4 competition dataset. For all results listed in the main manuscript, the subgroups Weekly, Daily and Hourly are aggregated into Others, accounting for 5,000 time series overall.

max width=           Frequency / Horizon Type Yearly / 6 Quarterly / 8 Monthly / 18 Weekly / 13 Daily / 14 Hourly / 48 Total Demographic 1,088 1,858 5,728 24 10 0 8,708 Finance 6,519 5,305 10,987 164 1,559 0 24,534 Industry 3,716 4,637 10,017 6 422 0 18,798 Macro 3,903 5,315 10,016 41 127 0 19,402 Micro 6,538 6,020 10,975 112 1,476 0 25,121 Other 1,236 865 277 12 633 414 3,437 Total 23,000 24,000 48,000 359 4,227 414 100,000 Min. Length 19 24 60 93 107 748 Max. Length 841 874 2812 2610 9933 1008 Mean Length 37.3 100.2 234.3 1035.0 2371.4 901.9 SD Length 24.5 51.1 137.4 707.1 1756.6 127.9 % Smooth 82% 89% 94% 84% 98% 83% % Erratic 18% 11% 6% 16% 2% 17%

Table B.1: Description / Statistics for the M4 competition dataset.

Table B.2 lists key statistics for the car-parts and the electricity time series we use in [Manuscript, Section 4.2]. Notably, there are no time series categorized into the erratic category, according to Syntetos et al. [Syntetos05a]. As car-parts is proprietary, Fig. B.1 additionally shows a visualization of all observations from the four spare part demand time series.

max width= car-parts electricity Frequency / Horizon Daily / 1 7Hourly / 1 Total 4 10777In reference to [Dua17a], time series IDs are: MT for Min. Length 4507 3762 Max. Length 4783 3762 Mean Length 3644 3762 SD Length 137 0 % Smooth 75% 70% % Lumpy 25% 30%

Table B.2: Description / Statistics for the car-parts and electricity time series.
Figure B.1: Visualization of the four proprietary car-parts time series.

Appendix C Additional Results

Table C.1 replicates [Manuscript, Table 3], listing sMAPE / OWA statistics on M4, as well as the corresponding MASE / OWA statistics888Detailed results for Weekly, Daily and Hourly are not listed in [Makridakis18b, Makridakis20a], but available here.. Table C.2 lists results, split by time series domains.

max width= Granularity Total Winner M4 Benchmark Naive2 N-BEATS [Oreshkin20a] N-BEATS+TopAttn Yearly (23k) 13.176 / 0.778 14.848 / 0.867 16.342 / 1.000 13.149 / 0.776 13.063 / 0.771 Quarterly (24k)   9.679 / 0.847 10.175 / 0.890 11.011 / 1.000 9.684 / 0.845   9.687 / 0.845 Monthly (48k) 12.126 / 0.836 13.434 / 0.920 14.427 / 1.000 12.054 / 0.829 12.025 / 0.828 Weekly (359)   7.817 / 0.851   8.944 / 0.926   9.191 / 1.000   6.447 / 0.703 6.361 / 0.699 Daily (4,227)   3.170 / 1.046   2.980 / 0.978   3.045 / 1.000 2.976 / 0.974   2.979 / 0.975 Hourly (414) 9.328 / 0.440 22.053 / 1.556 18.383 / 1.000 10.040 / 0.464 10.271 / 0.483 Average (100k) 11.374 / 0.821 12.555 / 0.898 13.564 / 1.000 11.324 / 0.814 11.291 / 0.811

(a) sMAPE / OWA

max width= Granularity Total Winner M4 Benchmark Naive2 N-BEATS [Oreshkin20a] N-BEATS+TopAttn Yearly (23k) 2.980 / 0.778 3.280 / 0.867 3.974 / 1.000 2.972 / 0.776 2.950 / 0.771 Quarterly (24k) 1.118 / 0.847 1.173 / 0.890 1.371 / 1.000 1.111 / 0.845 1.112 / 0.845 Monthly (48k) 0.884 / 0.836 0.966 / 0.920 1.063 / 1.000 0.875 / 0.829 0.874 / 0.828 Weekly (359) 2.356 / 0.851 2.432 / 0.926 2.777 / 1.000 1.950 / 0.703 1.953 / 0.699 Daily (4,227) 3.446 / 1.046 3.203 / 0.978 3.278 / 1.000 3.183 / 0.974 3.188 / 0.975 Hourly (414) 0.893 / 0.440 4.582 / 1.556 2.395 / 1.000 0.917 / 0.464 0.974 / 0.483 Average (100k) 1.536 / 0.821 1.663 / 0.898 1.912 / 1.000 1.516 / 0.814 1.511 / 0.811

(b) MASE / OWA
Table C.1: Performance comparison on the M4 benchmark in terms of (table:appendix:m4SMAPE) sMAPE / OWA and (table:appendix:m4MASE) MASE / OWA, listed by subgroup. N-BEATS and N-BEATS+TopAttn denote an ensemble formed by training multiple models, varying from to and randomly initializing each model ten times (i.e., a total of 40 models per subgroup). Forecasts are obtained by taking the median over the point forecasts of all models. denotes results from [Makridakis18b, Makridakis20a].

max width= Granularity Demographic Finance Industry Macro Micro Other (8,7k) (24,5k) (18,8k) (19,4k) (25,1k) (3,5k) Yearly 9.640 / 9.694 14.029 / 13.879 16.645 / 16.523 13.450 / 13.400 10.700 / 10.654 13.094 / 13.000 Quarterly 9.908 / 9.933 11.158 / 11.161   8.822 /   8.832   9.182 /   9.178   9.919 /   9.922   6.222 /   6.173 Monthly 4.605 / 4.599 13.629 / 13.625 12.918 / 12.913 12.490 / 12.428 13.180 / 13.122 11.987 / 11.932 Weekly 1.401 / 1.403   7.598 /   7.516   2.563 /   2.548 11.303 / 10.837   3.658 /   3.681 12.204 / 12.112 Daily 6.300 / 6.313   3.442 /   3.446   3.831 /   3.832   2.532 /   2.532   2.288 /   2.291   2.901 /   2.901 Hourly   9.787 /   9.997 Average 6.358 / 6.367 12.513 / 12.472 12.437 / 12.413 11.709 / 11.665 11.070 / 11.034 8.997 /   8.971

(a) sMAPE / sMAPE

max width= Granularity Demographic Finance Industry Macro Micro Other (8,7k) (24,5k) (18,8k) (19,4k) (25,1k) (3,5k) Yearly 2.410 / 2.428 3.086 / 3.055 3.021 / 2.996 2.956 / 2.932 2.994 / 2.981 2.647 / 2.616 Quarterly 1.234 / 1.238 1.110 / 1.111 1.075 / 1.077 1.123 / 1.121 1.128 / 1.128 0.866 / 0.861 Monthly 0.864 / 0.862 0.912 / 0.912 0.936 / 0.935 0.878 / 0.876 0.790 / 0.788 0.780 / 0.778 Weekly 1.782 / 1.839 1.661 / 1.634 3.724 / 3.808 2.042 / 2.114 2.393 / 2.399 0.910 / 0.899 Daily 9.604 / 9.641 3.396 / 3.402 3.784 / 3.787 3.198 / 3.205 2.597 / 2.603 3.519 / 3.520 Hourly 0.903 / 0.971 Average 1.148 / 1.151 1.695 / 1.687 1.447 / 1.443 1.380 / 1.374 1.558 / 1.554 1.993 / 1.989

Table C.2: Performance comparison on the M4 benchmark in terms of (table:appendix:m4DOMAINSMAPE) sMAPE / sMAPE and (table:appendix:m4DOMAINMASE) MASE / MASE, listed by subgroup and domain. N-BEATS and N-BEATS+TopAttn denote the same models as in Table C.1.

Appendix D Hyperparameter Settings

Hyperparameter settings for our single time series experiments of [Manuscript, Section 4.2] and the large-scale M4 experiments of [Manuscript, Section 4.3] are listed in Tables D.1 and D.2.

max width= car-parts electricity Parameters Daily 7Hourly Iterations 2k 1.5k Loss MSE (Forecast horizon) 1 Lookback period(s), 14 - 244 Batch size 30 Attention heads 4 Barcode coordinate functions 32 Encoder-layers 1 Hidden dimension 128

Table D.1: Single time series experiment hyperparameters for car-parts and electricity data.

As mentioned in the manuscript, for M4 experiments with N-BEATS (and N-BEATS+TopAttn), we closely follow the generic N-BEATS parameter configuration of Oreshkin et al. [Oreshkin20a, Table 18]; any additional parameters (for our N-BEATS+TopAttn approach) are highlighted in red. Note that we also mark Hidden dimension in red, as this is not only the hidden dimension of the N-BEATS blocks, but we equally use this setting for the hidden dimension of the transformer encoder layers.

max width= M4 Parameters Yearly Quarterly Monthly Weekly Daily Hourly (Forecast horizon) 6 8 18 13 14 48 1.5 1.5 1.5 10 10 10 Iterations 5k Loss sMAPE Lookback period(s), 2, 3, 4, 5 Batch size 1024 Attention heads 2 Barcode coordinate functions 64 Encoder-layers 20 Hidden dimension 128 Double-Residual Blocks 1 Block-layers 4 Stacks 30

Table D.2: Large-scale experiment hyperparameters across all subsets of the M4 dataset. Parameters specific to N-BEATS+TopAttn are highlighted in red. For a detailed description of the N-BEATS parameters, we refer to [Oreshkin20a, Section D.1].

Appendix E Sliding Window Configurations

Lets assume we have, at one point in training, a randomly extracted training portion of consecutive observations from a length- time series (). We use the first observations as (1) raw input signal to our models and (2) for extraction of complementary local topological properties. The consecutive observations, starting at , are used as target (to compute the mean-squared-error, or the sMAPE for instance).

Throughout all experiments, sliding windows are moved forward by one observation a time.

For extracting local topological properties from (of length ) via persistent homology, two parameters are necessary: the parameter determines the number of overlapping sliding windows and the parameter determines the length of a single sliding window (i.e., observations). For each sliding window, we obtain one barcode (or two, if is taken into account).

Singe time series experiments [Manuscript, Section 4.2]. In this setting, , as we compute one-step forecasts. Since, typically, forecast models differ in their sensitivity to the length of the input observations , we cross-validate (for all methods) using the sMAPE on the validation set.

The collection of used for cross-validation is constructed based on the following consideration: first, for persistent homology computation, we need a reasonable amount of observations in each sliding window; and, second, we need a reasonable amount of sliding windows for self-attention. Hence, we choose (1) and (2) . For one specific choice of , we get . Varying and thus determines the length, , of the input vector . For instance, setting gives a decomposition of (of length 14), into subsequent windows of length for which persistent homology is computed. Overall, in the described setup, ranges from 14 to 244.

Large-scale experiments [Manuscript, Section 4.3]. In this setting, . For comparability with N-BEATS, we stick to the original setup of considering input lengths as multiples of the forecast horizon (which is specific to each subgroup in M4). In particular, ranges from to , see Table D.2. As an example, on M4 Yearly, this yields a range of from 12 to 30. As mentioned in [Manuscript, Section 4.3.2], we set and is thus determined by .

Appendix F Ensemble Size

As described in [Manuscript, Section 4.3.2], we ensemble 40 models to obtain forecasts for each subgroup of the M4 dataset. One ensemble is formed per subgroup and consists of training N-BEATS, or N-BEATS+TopAttn, respectively, with 10 random initializations for four different values of , i.e., (where denotes the specific forecast horizon prescribed per subgroup), using the sMAPE as a loss function. In case of Yearly for instance, , see Table B.1.

Fig. F.1 shows a comparison of N-BEATS and N-BEATS+TopAttn over the ensemble size, illustrating the N-BEATS+TopAttn equally benefits from a larger ensemble. Notably, in [Oreshkin20a] the ensemble is much larger, as, in addition to training models with the sMAPE as loss, the MAPE and MASE are used and scales up to , resulting in 180 models in total.

Figure F.1: Comparison of N-BEATS and N-BEATS+TopAttn in terms of varying the ensemble size. At the maximum ensemble size of 40, the OWA corresponds to the OWA reported in, e.g., Table C.1.

Appendix G Model Details

In this section, we describe the details for the models used in the single time series experiments of [Manuscript, Section 4.2].

Prophet. We use the publicly available Python implementation of Prophet999 with default parameter choices.

autoARIMA. We use the publicly available Python implementation of autoARIMA101010 In terms of hyperparameters, the initial number of time lags of the auto-regressive (“AR”) and the moving-average (“MA”) model is set to 1 bounded by its maximum 6. The period for seasonal differencing is equal to 5; the order of first-differencing and of seasonal differencing is set to 2 and 0, respectively.

LSTM. We implement a LSTM model with hidden dimensionality 128, 8 recurrent layers and a dropout layer on the outputs of each LSTM

layer with dropout probability of 0.3. Outputs of the LSTM are fed to a subsequent single-hidden-layer


with hidden dimensionality equal to 64, including batch normalization and ReLU activation. Initial learning rate and weight decay are set to 1e-3 and 1.25e-5, respectively. We minimize the mean-squared-error (MSE) via

ADAM over 1.5k (electricity) and 2k (car-parts) iterations, respectively, using a batch size of 30. All learning rates are annealed following a cosine learning rate schedule.

For DeepAR, MQ-CNN, MQ-RNN and the MLP baseline, we use the publicly available GluonTS [gluonts_jmlr] implementations111111, mostly with default

parameter choices. We only adjust the number of (maximum) training epochs to 20 (for comparability to our approach, where we count iterations), change the hidden dimensionality of the

MLP to 64 and set the batch size to 30.

Appendix H System Setup

All experiments were executed on an Ubuntu Linux 20.04 system, using PyTorch v1.7.0 (CUDA 10.1), 128 GB of RAM and 16 Intel(R) Core(TM) i9-10980XE CPUs.

Appendix I Persistent Homology Runtime

To back up the “near-linear runtime” statement for 0-dimensional persistent homology computation in the proposed regime (see [Manuscript, Section 3.4]), Fig. I.1 shows a runtime plot (using Ripser121212 over 10,000 sliding window sizes, , in the range . The system setup for these runtime experiments is given in Section H. Fig. I.1 clearly corroborates the statement from the manuscript.

Figure I.1: Runtime (in seconds) for 0-dimensional persistent homology computation from observations within a sliding window, varying in length () from .