1 Introduction
Time series are ubiquitous in science and industry, ranging from medical signals (e.g., EEG), motion data (e.g., speed, steps, etc.) and economic operating figures to ride/demand volumes of transportation network companies (e.g., Uber, Lyft, etc.). Despite many advances in predicting future observations from historical data via traditional statistical [Brockwell91a] or recent machine-learning approaches [Oreshkin20a, Wang19a, Salinas19a, Rangapuram18a, Li19a, Yu16a], reliable and accurate forecasting remains challenging. This is not least due to widely different and often heavily domain-dependent structural properties of time-related sequential data.

In this work, we focus on the problem of (point) forecasting univariate time series, i.e., given a length-$T$ vector of historical data, the task is to predict future observations for a given time horizon $H$. While neural network models excel in situations where a large corpus of time series is available for training, the case of only a single (possibly long) time series is equally important. The arguably most prominent benchmarks for the former type of forecasting problem are the “(M)akridakis” competitions, such as M3 [Makridakis00a] or M4 [Makridakis18a]. While combinations (and hybrids) of statistical and machine-learning approaches have largely dominated these competitions [Makridakis18a, see Table 4], Oreshkin et al. [Oreshkin20a] have recently demonstrated that a pure machine-learning based model (NBEATS) attains state-of-the-art performance. Interestingly, the latter approach is simply built from a collection of common neural network primitives which are not specific to sequential data.
However, the majority of learning-based approaches directly operate on the raw input signal, implicitly assuming that viable representations for forecasting can be learned via common neural network primitives, composed either in a feedforward or recurrent manner. This raises the question of whether there exist structural properties of the signal which are not easily extractable via neural network components, but offer complementary information. One prime example are topological features, typically obtained via persistent homology [Carlsson09a, Edelsbrunner2010] (although learning-based approaches [Som20a] exist to approximate topological summaries, without guarantees). In fact, various approaches [Perea15a, Gidea18a, Dlotko19a, Khasawneh18a, Gidea20a] have successfully used topological features for time series analysis, however, mostly in classification settings, for the identification of certain phenomena in dynamical systems, or for purely exploratory analysis (see Section 2).
Contribution. We propose an approach to incorporate local topological information into neural forecasting models. Contrary to previous works, we do not compute a global topological summary of historical observations, but features of short, overlapping time windows to which the forecasting model can attend. The latter is achieved via self-attention and thereby integrates well into recent techniques, such as NBEATS [Oreshkin20a]. Notably, in our setting, computation of topological features (via persistent homology) comes at little computational cost, which allows application in large-scale forecasting problems.
Problem statement. In practice, neural forecasting models typically utilize the last $T$ observations of a time series in order to yield (point) forecasts for a given time horizon $H$. Under this perspective, the problem boils down to learning a function (parametrized as a neural network)

$f_\theta\colon \mathbb{R}^T \to \mathbb{R}^H$   (1)
from a given collection of inputs (i.e., length-$T$ vectors) and targets (i.e., length-$H$ vectors). Specifically, we consider two settings, where either (1) a large collection of time series is available, as in the M4 competition, or (2) we only have access to a single time series. In the latter setting, a model has to learn from patterns within a single time series, while the former setting allows it to exploit common patterns across multiple time series.
2 Related work
Persistent homology and time series. Most approaches to topological time series analysis are conceptually similar, building on top of work by de Silva et al. [deSilva12a] and Perea & Harer [Perea15a, Perea16a]. Herein, time series observations are transformed into a point cloud via a time-delay coordinate embedding [Takens81a], from which Vietoris–Rips (VR) persistent homology is computed. The resulting topological summaries, i.e., persistence barcodes, are then used for downstream processing. Within this regime, Gidea et al. [Gidea20a] analyze the dynamics of cryptocurrencies using persistence landscapes [Bubenik15a], Khasawneh et al. [Khasawneh18a] study chatter classification in synthetic time series from turning processes, and Dłotko et al. [Dlotko19a] identify periodicity patterns in time series. In [Kim18a], Kim et al. actually compute one-step forecasts for Bitcoin prices and classify price patterns, essentially feeding barcode statistics as supplementary features to an MLP/CNN-based regression model. Surprisingly, very few works deviate from this pipeline, with the notable exception of [Gidea18a], where VR persistent homology is not computed from a time-delay coordinate embedding, but rather from assembling observations (within sliding windows of size $w$) from a $d$-variate time series into a $d$-dimensional point cloud, followed by VR persistent homology computation.

Although these works clearly demonstrate that capturing the “shape” of data via persistent homology provides valuable information for time series related problems, they (1) rely on handcrafted features (i.e., predefined barcode summary statistics, or a fixed barcode-vectorization strategy), (2) consider topological summaries as the single source of information, and (3) only partially touch upon forecasting problems (with the exception of [Kim18a]). Furthermore, in this existing line of work, sweeping a sliding window over the time series is, first and foremost, a way to construct a point cloud which represents the entire time series. Instead, in our approach, each window yields its own topological summary in the form of a persistence barcode, reminiscent of representing a sentence as a sequence of word embeddings in NLP tasks. When combined with learnable representations of persistence barcodes [Hofer19b, Carriere20a], this perspective paves the way for leveraging recent techniques for handling learning problems with sequential data, such as attention [Vaswani17a], and allows seamless integration of topological features into existing neural forecasting techniques.
Neural network approaches to time series forecasting. Recently, various successful neural network approaches to (mostly probabilistic) time series forecasting have emerged, ranging from autoregressive neural networks as in DeepAR [Salinas19a], to (deep) extensions of traditional state space models, such as DeepFactors [Wang19a] or DeepState [Rangapuram18a]. While these models are inherently tailored to the sequential nature of the forecasting problem, Li et al. [Li19a] instead rely on the concept of (log-sparse) self-attention [Vaswani17a], fed by the outputs of causal convolutions, and Oreshkin et al. [Oreshkin20a] even abandon sequential neural network primitives altogether. The latter approach, solely based on operations predominantly found in feedforward architectures, achieves state-of-the-art performance for (point) forecasts across several benchmarks, including the large-scale M4 competition. Yet, a common factor in all aforementioned works is that raw time series observations are directly input to the model, assuming that relevant structural characteristics of the signal can be learned. While we choose an approach similar to [Li19a], in the sense that we rely on self-attention, our work differs in that representations fed to the attention mechanism are not obtained through convolutions, but rather through a topological analysis step which, by its construction, captures the “shape” of local time series segments.
3 Topological attention
The key idea of topological attention is to analyze local segments within an input time series $\mathbf{x}$ through the lens of persistent homology. As mentioned in Section 2, the prevalent strategy in prior work is to first construct a point cloud from $\mathbf{x}$ via a time-delay coordinate embedding and to subsequently compute VR persistent homology. Historically, this is motivated by studying structural properties of an underlying dynamical system, with a solid theoretical foundation, e.g., in the context of identifying periodicity patterns [Perea15a, Perea16a]. In this regime, $\mathbf{x}$ is encoded as a point cloud in $\mathbb{R}^w$ by considering the observations within a sliding window of size $w$ as a point in $\mathbb{R}^w$.
While the timedelay embedding strategy is adequate in settings where one aims to obtain one global topological summary, it is inadequate when local structural properties of time series segments are of interest. Further, unless large (computationally impractical) historical time horizons are considered, one would obtain relatively sparse point clouds that, most likely, carry little information.
3.1 Time series as local topological summaries
Different from time-delay embeddings, we follow an alternative strategy: a time series signal is still decomposed into a sequence of (overlapping) windows, but each window does not yield a point of a point cloud; rather, it is analyzed in isolation. In the following, we only discuss the necessities specific to our approach and refer the reader to [Edelsbrunner02a, Carlsson09a, Boissonnat18a] for a thorough treatment of persistent homology.
To topologically analyze a length-$w$ time series, over the time steps $1, \dots, w$, in a computationally tractable manner, let us consider a 1-dimensional simplicial complex of the form

$K = \{v_1, \dots, v_w\} \cup \big\{(v_i, v_{i+1}) : 1 \le i < w\big\}\ ,$

where $v_1, \dots, v_w$ denote 0-simplices (i.e., vertices) and 1-simplices (i.e., edges) $(v_i, v_j)$ are in $K$ iff $i$ and $j$ are two consecutive time indices. Topologically, $K$ carries the connectivity properties of a time series of length $w$, which is equivalent to a straight line. This is the same for all time series of length $w$ and thus offers no discriminative information.
Persistent homology, however, lets us combine the purely topological representation of the time series with its actual values. For a specific $\mathbf{x} = (x_1, \dots, x_w)$, let $x_{(1)} \le \dots \le x_{(w)}$ denote its increasingly sorted values and consider

$K_i = \big\{\sigma \in K : \max_{v_j \in \sigma} x_j \le x_{(i)}\big\}\ , \quad i = 1, \dots, w\ .$

In fact, $K_1 \subseteq K_2 \subseteq \dots \subseteq K_w = K$ forms an increasing sequence of subsets of $K$, i.e., a filtration. Importantly, while $K$ is the same for all time series of length $w$, the filtration, $K_1 \subseteq \dots \subseteq K_w$, is determined by the values of $\mathbf{x}$. Persistent homology then tracks the evolution of topological features throughout this sequence and summarizes this information in the form of persistence barcodes.
In our specific case, as $K$ is topologically equivalent to a straight line, we only get 0-dimensional features, i.e., connected components. Hence, we obtain one (degree-0) barcode $\mathcal{B}$. This barcode is a multiset of (birth, death) tuples, representing the birth ($b$) and death ($d$) of topological features. Informally, we may think of building $K$, piece by piece, according to the sorting of the $x_{(i)}$’s, starting with the lowest value, and tracking how connected components appear / merge, as illustrated in Fig. 2.
Remark 3.1.
The information captured throughout this process has two noteworthy properties. First, it is stable, in the sense that small changes in the observation values may not cause arbitrary changes in the respective barcodes, see [CohenSteiner2007]. Second, one may equally order the negated observations, i.e., $-x_1, \dots, -x_w$, and thus obtain a second barcode. In that manner, the signal is analyzed from below and from above.
Finally, to extract local topological information, we do not compute one single barcode for $\mathbf{x}$, but one for each sliding window of size $w$, see Fig. 1. Given a decomposition of $\mathbf{x}$ into $n$ subsequent windows, we obtain $n$ barcodes, $\mathcal{B}_1, \dots, \mathcal{B}_n$, which constitute the entry point for any downstream operation. Informally, those barcodes encode the evolution of local topological features over time.
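The degree-0 sublevel-set barcode of a single window can be computed with a short union-find routine. The sketch below (function name ours) processes vertices of the line complex in increasing order of their value and pairs births and deaths via the elder rule; mapping the one essential component to (global min, global max) is a common convention, not prescribed by the text.

```python
def persistence_0d(values):
    """Degree-0 sublevel-set persistence barcode of a 1-D signal over the
    line complex of Section 3.1. Vertices enter in increasing order of value;
    when two components merge, the younger one dies (elder rule). The
    essential component is mapped to (global min, global max)."""
    n = len(values)
    parent = list(range(n))

    def find(i):                      # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    bars, birth, processed = [], {}, set()
    for i in sorted(range(n), key=lambda i: values[i]):
        nbrs = [j for j in (i - 1, i + 1) if j in processed]
        processed.add(i)
        if not nbrs:                  # local minimum: a component is born
            birth[i] = values[i]
        elif len(nbrs) == 1:          # extend an existing component
            parent[i] = find(nbrs[0])
        else:                         # merge: the younger component dies here
            r1, r2 = find(nbrs[0]), find(nbrs[1])
            young, old = (r1, r2) if birth[r1] >= birth[r2] else (r2, r1)
            bars.append((birth[young], values[i]))
            parent[young] = parent[i] = old
    bars.append((min(values), max(values)))   # essential component
    return sorted(bars)
```

On the toy signal (0, 2, 1, 3), the local minimum with value 1 spawns a component that merges at value 2, yielding the bar (1, 2) in addition to the essential bar (0, 3).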
3.2 Barcode vectorization
Although persistence barcodes concisely encode topological features, the space of persistence barcodes, denoted as $\mathbb{B}$, carries no linear structure [Turner14a], and the nature of barcodes as multisets renders them difficult to use in learning settings. Myriad approaches have been proposed to alleviate this issue, ranging from fixed mappings into a vector space (e.g., [Bubenik15a, Adams17a]), to kernel techniques (e.g., [Reininghaus14a, Kusano16a]) and, more recently, to learnable vectorization schemes (e.g., [Hofer19a, Carriere20a]). Here, we follow the latter approach, as it integrates well into the regime of neural networks. In particular, the core element in learnable vectorization schemes is a differentiable map of the form

$s_\theta\colon \mathbb{B} \to \mathbb{R}\ ,$   (2)

where $s_\theta$ denotes a so-called barcode coordinate function [Hofer19a], designed to preserve the stability property of Remark 3.1. Upon assembling a collection of $D$ such coordinate functions and subsuming their parameters into $\theta$, one obtains a $D$-dimensional vectorization of a barcode $\mathcal{B}$ via

$V_\theta(\mathcal{B}) = \big(s_{\theta_1}(\mathcal{B}), \dots, s_{\theta_D}(\mathcal{B})\big) \in \mathbb{R}^D\ .$   (3)

Taking into account the representation of $\mathbf{x}$ as $n$ persistence barcodes, we summarize the vectorization step as

$\mathrm{TopVec}\colon \mathbf{x} \mapsto \big(V_\theta(\mathcal{B}_1), \dots, V_\theta(\mathcal{B}_n)\big) \in \mathbb{R}^{n \times D}\ .$   (4)
This is distinctly different from [Perea15a, Perea16a, Khasawneh18a, Kim18a] (see Section 2), where one barcode is obtained and this barcode is represented in a fixed manner, e.g., via persistence landscapes [Bubenik15a] or via barcode statistics.
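The vectorization of Eqs. (3)–(4) can be sketched in a few lines of NumPy. For readability we use Gaussian-style coordinate functions; this is a simplification of our choice — [Hofer19a] proposes rational-hat elements — but the sum-over-barcode structure is identical, and the center locations play the role of the learnable parameters.

```python
import numpy as np

def vectorize_barcode(bars, centers, sigma=1.0):
    """D-dimensional barcode vectorization in the spirit of Eq. (3): one
    coordinate function per center. Gaussian-style elements
    exp(-||p - c||^2 / sigma) stand in for the rational-hat functions of
    [Hofer19a]; each coordinate sums its responses over all barcode points."""
    bars = np.asarray(bars, dtype=float)        # (m, 2) (birth, death) pairs
    centers = np.asarray(centers, dtype=float)  # (D, 2) learnable locations
    d2 = ((bars[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / sigma).sum(axis=0)      # (D,) vector

def top_vec(barcodes, centers):
    """Stack per-window vectorizations into the n x D matrix of Eq. (4)."""
    return np.stack([vectorize_barcode(b, centers) for b in barcodes])
```

Note that the output dimensionality $D$ depends only on the number of coordinate functions, not on the number of points in a barcode — the property exploited in the ablation discussion of Section 4.2.3.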
3.3 Attention mechanism
In order to allow a forecasting model to attend to local topological patterns, as encoded via the $V_\theta(\mathcal{B}_i)$, we propose to use the encoder part of Vaswani et al.’s [Vaswani17a] transformer architecture, implementing a repeated application of a (multi-head) self-attention mechanism. Allowing to attend to local time series segments is conceptually similar to Li et al. [Li19a], but differs in the way local structural properties are captured: not via causal convolutions, but rather through the lens of persistent homology. In this setting, the scaled dot-product attention at the heart of a transformer encoder layer computes

$\mathrm{Attn}(X) = \mathrm{softmax}\!\left(\frac{(XW_Q)(XW_K)^\top}{\sqrt{d_k}}\right) XW_V\ ,$   (5)

with $W_Q, W_K, W_V$ denoting learnable (query, key, value) projection matrices. Recall that $X \in \mathbb{R}^{n \times D}$ holds all $D$-dimensional vectorizations of the persistence barcodes. In its actual incarnation, one transformer encoder layer (omitting the additional normalization and projection layers for brevity), denoted as $\mathrm{AttnEnc}$, computes and concatenates $h$ parallel instances of Eq. (5) (i.e., the attention heads), and internally adjusts $d_k$ such that input and output dimensionality agree. Composing $L$ such AttnEnc maps, one obtains

$\mathrm{AttnEnc}^{(L)} \circ \dots \circ \mathrm{AttnEnc}^{(1)}\colon \mathbb{R}^{n \times D} \to \mathbb{R}^{n \times D}\ .$   (6)
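A single attention head of Eq. (5) translates directly into NumPy; function and argument names are ours, and in practice the projection matrices are learned rather than passed in.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention of Eq. (5). X (n x D) holds the n
    barcode vectorizations; W_q, W_k, W_v are (learnable) projections."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (n, n) attention weights
    return A @ V                                 # rows: weighted mixes of values
```

Each output row is a convex combination of the (projected) window representations, which is precisely how the model focuses on the relevant local topological summaries.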
Finally, we use a two-layer MLP (with ReLU activations) to map the vectorized output of the transformer encoder to a $q$-dimensional representation. Topological attention thus implements

$\mathrm{TopAttn}\colon \mathbf{x} \mapsto \mathrm{MLP}\Big(\mathrm{vec}\big(\mathrm{AttnEnc}^{(L)} \circ \dots \circ \mathrm{AttnEnc}^{(1)} \circ \mathrm{TopVec}(\mathbf{x})\big)\Big) \in \mathbb{R}^q\ ,$   (7)

where $\mathrm{vec}$ denotes the row-major vectorization operation.
Remark 3.2.
Notably, as the learnable part of TopAttn operates on precomputed persistence barcodes, error backpropagation stops at the persistent homology computation. However, we remark that, given recent works on differentiating through the persistent homology computation [Hofer20a, Carriere21a], one could even combine topological attention with, e.g., [Li19a], in the sense that the outputs of causal convolutions could serve as the filter function for persistent homology. Error backpropagation would then consequently allow learning this filter function.

3.4 Forecasting model
While the representation yielded by topological attention (see Eq. (7)) can be integrated into different neural forecasting approaches, it integrates particularly well into the NBEATS model of Oreshkin et al. [Oreshkin20a]. We briefly describe the generic incarnation of NBEATS next, but remark that topological attention can be similarly integrated into the basis-expansion variant without modifications.
Essentially, the generic NBEATS model is assembled from a stack of double-residual blocks. For $r = 1, \dots, R$, each block consists of a non-linear map (implemented as an MLP)

$\mathbf{h}^r = \mathrm{MLP}^r(\mathbf{x}^r) \in \mathbb{R}^{d}\ ,$   (8)

with $d$ denoting the internal dimensionality, and two subsequent maps that yield the twofold output of the $r$-th block as

$\hat{\mathbf{x}}^r = \mathrm{FC}^r_b(\mathbf{h}^r) \in \mathbb{R}^T \quad \text{and} \quad \hat{\mathbf{y}}^r = \mathrm{FC}^r_f(\mathbf{h}^r) \in \mathbb{R}^H\ ,$

with $\mathbf{x}^1 = \mathbf{x}$ and $\mathbf{x}^{r+1} = \mathbf{x}^r - \hat{\mathbf{x}}^r$. While the backcasts $\hat{\mathbf{x}}^r$ yield the connection to the following computation block, the partial forecasts $\hat{\mathbf{y}}^r$ are used to compute the final model prediction via (component-wise) summation, i.e., $\hat{\mathbf{y}} = \sum_{r} \hat{\mathbf{y}}^r$.
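The doubly-residual chain above can be summarized in a few lines; the `block` callables stand in for the per-block MLP plus backcast/forecast heads.

```python
import numpy as np

def nbeats_forward(x, blocks):
    """Doubly-residual forward pass of the generic NBEATS stack: each block
    maps the running residual to a backcast x_hat (subtracted before the next
    block) and a partial forecast y_hat (summed into the final prediction)."""
    y = 0.0
    for block in blocks:          # block: residual -> (backcast, forecast)
        x_hat, y_hat = block(x)
        x = x - x_hat             # pass on what is still unexplained
        y = y + y_hat             # accumulate partial forecasts
    return y
```

With a toy block whose backcast reproduces its input exactly, a second identical block sees a zero residual and contributes nothing — illustrating how later blocks only model what earlier blocks left unexplained.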
Importantly, in this computational chain, the block inputs $\mathbf{x}^r$ can be leveraged as an interface to integrate additional information. In our case, we enrich the input signal to each block by the output of the topological attention mechanism through concatenation (see figure to the right), i.e.,

$\mathbf{x}^r \mapsto \big[\mathbf{x}^r, \mathrm{TopAttn}(\mathbf{x})\big] \in \mathbb{R}^{T+q}\ .$   (9)

This means that the time series signal $\mathbf{x}$ is (1) input (in its raw form) to NBEATS and (2) its topological attention representation, $\mathrm{TopAttn}(\mathbf{x})$, is supplied to each block as a complementary signal. In a similar manner (i.e., through concatenation), $\mathrm{TopAttn}(\mathbf{x})$ can be included in much simpler models as well (see Section 4.2).
Computational complexity. Aside from the computational overhead incurred by the multi-head attention module, we need to compute 0-dimensional persistent homology for each sliding window. This can be done efficiently, using union-find data structures, with complexity $\mathcal{O}(m\,\alpha(m))$, where $m$ is proportional to the sliding window size $w$ and $\alpha$ denotes the inverse of the Ackermann function. As the latter grows very slowly, computational complexity is roughly linear for this part.
4 Experiments
We assess the quality of point forecasts in two different settings and perform ablation studies to isolate the impact of topological attention in the proposed regime.
Throughout all experiments, we compute persistent homology from $\mathbf{x}$ and $-\mathbf{x}$ (see Remark 3.1) using Ripser [Bauer21]. Barcode vectorization, see Eq. (3), is based on rational hat coordinate functions [Hofer19a] with the position parameters (i.e., the locations of each coordinate function) initialized by $k$-means++ clustering over all barcodes in the training data (with $k$ set to the number of coordinate functions). This yields one vectorized representation per sliding window. Full architecture details and dataset statistics can be found in the suppl. material.
Ablation setup. When assessing each component of topological attention in isolation, we refer to +Top as omitting the TransformerEncoder part in Eq. (7), and to +Attn as directly feeding the time series observations to the transformer encoder, i.e., omitting TopVec in Eq. (7).
4.1 Evaluation metrics
To evaluate the quality of point forecasts, two commonly used metrics are the symmetric mean absolute percentage error (sMAPE) and the mean absolute scaled error (MASE). Letting $\hat{\mathbf{y}} = (\hat{y}_1, \dots, \hat{y}_H)$ denote the length-$H$ forecast, $\mathbf{y} = (y_1, \dots, y_H)$ the true observations and $\mathbf{x} = (x_1, \dots, x_T)$ the length-$T$ history of input observations, both scores are defined as [Makridakis20a]

$\mathrm{sMAPE} = \frac{200}{H} \sum_{i=1}^{H} \frac{|y_i - \hat{y}_i|}{|y_i| + |\hat{y}_i|}\ , \qquad \mathrm{MASE} = \frac{1}{H} \sum_{i=1}^{H} \frac{|y_i - \hat{y}_i|}{\frac{1}{T-m}\sum_{j=m+1}^{T} |x_j - x_{j-m}|}\ ,$   (10)
with $m$ depending on the observation frequency. For results on the M4 benchmark (see Section 4.3), we adhere to the competition guidelines and additionally report the overall weighted average (OWA), which denotes the arithmetic mean of sMAPE and MASE (with $m$ prespecified), both measured relative to a naïve (seasonally adjusted) forecast (also provided by the M4 competition as Naive2).
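Both metrics of Eq. (10) translate directly into code; the seasonal period `m` must match the observation frequency of the data.

```python
import numpy as np

def smape(y, y_hat):
    """Symmetric mean absolute percentage error (in %), cf. Eq. (10)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return 200.0 / len(y) * np.sum(np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat)))

def mase(y, y_hat, history, m=1):
    """Mean absolute scaled error: forecast errors scaled by the in-sample
    mean absolute error of a seasonal-naive forecast with period m."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    history = np.asarray(history, float)
    scale = np.mean(np.abs(history[m:] - history[:-m]))
    return np.mean(np.abs(y - y_hat)) / scale
```

A MASE of 1.0 thus means the forecast is, on average, exactly as accurate as the in-sample seasonal-naive baseline.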
4.2 Single time series experiments
We first consider the simple, yet frequently occurring, practical setting of one-step forecasts with historical observations available for only a single (long) time series. Upon receiving a length-$T$ time series $(x_{t-T+1}, \dots, x_t)$, a model should yield a forecast for the time point $t+1$ (i.e., $H = 1$).
4.2.1 Dataset
To experiment with several single (but long) time series of different characteristics, we use 10 time series from the publicly available electricity demand dataset [Dua17a] (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014) and four (third-party) time series of car part demands, denoted as carparts. Based on the categorization scheme of [Syntetos05a], the time series are chosen such that not only smooth time series (regular demand occurrence and low demand quantity variation) are represented, but also lumpy ones (irregular demand occurrence and high demand quantity variation). For electricity, the respective percentages are 70% vs. 30%, and, for carparts, 75% vs. 25%. All observations are non-negative. In case of electricity, which contains measurements in 15-min intervals, we aggregate (by summation) within 7h windows, yielding a total of 3,762 observations. For carparts, demand is measured on a daily basis across a time span of 7–8 years (weekends and holidays excluded), yielding 4,644 observations on average. For each time series, 20% of held-out consecutive observations are used for testing, 5% for validation.
4.2.2 Forecasting model
We employ a simple incarnation of the forecasting model from Section 3.4. In particular, we replace NBEATS by a single linear map (with bias), implementing

$\mathbf{x} \mapsto \mathbf{w}^\top \big[\mathbf{x}, \mathrm{TopAttn}(\mathbf{x})\big] + b\ ,$   (11)

with $[\mathbf{x}, \mathrm{TopAttn}(\mathbf{x})]$ denoting the concatenation of the input time series and the topological attention signal, as in Eq. (9). During training, we randomly extract $T+1$ consecutive observations from the training portion of the time series. The first $T$ observations are used as input $\mathbf{x}$, the observation at time step $T+1$ is used as target. Forecasts for all testing observations are obtained via a length-$T$ rolling window, moved forward one step at a time.
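The rolling evaluation protocol is simple to express; `model` below is a hypothetical callable, and instantiating it with the last window entry reproduces the Naive baseline of Section 4.2.3.

```python
def rolling_one_step_forecasts(model, series, T):
    """Slide a length-T window over `series` one step at a time and collect
    the model's one-step forecasts (one per position with a full history)."""
    return [model(series[t - T:t]) for t in range(T, len(series))]
```

For example, `rolling_one_step_forecasts(lambda w: w[-1], test_series, T)` yields the Naive forecasts for an entire held-out segment.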
In terms of hyperparameters for topological attention, we use a single transformer encoder layer with four attention heads and 32 barcode coordinate functions. We minimize the mean-squared error via ADAM over 1.5k (electricity) and 2k (carparts) iterations, respectively, with a batch size of 30. Initial learning rates for the components of Eq. (7) are 9e-2 (TopVec, MLP) and 5e-2 (TransformerEncoder), as well as 9e-2 for the linear map of Eq. (11). All learning rates are annealed following a cosine learning rate schedule.


4.2.3 Results & Ablation study
We compare against several techniques from the literature that are readily available to a practitioner. This includes auto-ARIMA [Hyndman08b], Prophet [Taylor17a], a vanilla LSTM model, as well as several approaches implemented within the GluonTS [gluonts_jmlr] library. With respect to the latter, we list results for a single-hidden-layer MLP, DeepAR [Salinas19a] and MQ-CNN/MQ-RNN [Wen17a]. By Naive, we denote a baseline yielding the last observation $x_t$ as forecast for $x_{t+1}$. Importantly, each model is fit separately to each time series in the dataset.
For a fair comparison, we further account for the fact that forecasting models typically differ in their sensitivity to the length $T$ of the input observations. To this end, we cross-validate $T$ (for all methods) using the sMAPE on the validation set. Cross-validation points are determined by the topological attention parameters $n$ and $w$, i.e., the number and lengths of the sliding windows. For $n$ ranging from 10 to 200 and $w$ ranging from 5 to 45, we obtain a wide range of input lengths, from 14 to 244. Instead of listing absolute performance figures, we focus on the average rank (the sMAPE determines the rank of a method per time series; these ranks are then averaged over all time series within the cohort of methods), as well as the average percentage difference to the best-ranking approach per time series.
Table 1 lists the overall statistics for electricity and carparts. We observe that, while the overall ranking per dataset differs quite significantly, Lin+TopAttn consistently ranks well. Second, the average percentage difference to the best-ranking approach per time series is low, meaning that while Lin+TopAttn might not yield the most accurate forecasts on a specific time series, it still produces forecasts of comparable quality.
Table 2 provides the same performance statistics for an ablation study of the topological attention components. Specifically, we combine the linear model of Eq. (11) with each component of topological attention in isolation.
                    Rank    % Diff.
  carparts (4)
    Lin.            2.75      0.15
    +Top            3.50      1.45
    +Attn           1.25      0.15
    +TopAttn        2.50      0.64
  electricity (10)
    Lin.            2.40      5.95
    +Top            3.40     12.20
    +Attn           2.30      2.41
    +TopAttn        1.90      2.10
Some observations are worth pointing out. First, the linear model (Lin.) alone already performs surprisingly well. This can possibly be explained by the fact that the task only requires one-step forecasts, for which the historical observations directly preceding the forecast point are already quite informative. Second, directly including topological features (i.e., +Top) has a confounding effect. We hypothesize that simply vectorizing local topological information from all sliding windows, without any focus, obfuscates relevant information, rather than providing a reliable learning signal. This also highlights the importance of attention in this context, which, even when directly fed with observations from each sliding window (i.e., +Attn), exhibits favorable performance (particularly on carparts). However, in the latter strategy, the input dimensionality for the transformer encoder scales with the sliding window size $w$. Contrary to that, in case of topological attention, the input dimensionality is always fixed to the number of coordinate functions, irrespective of the sliding window size $w$.
4.3 Largescale experiments on the M4 benchmark
Different from Section 4.2, we now consider having multiple time series of different lengths and characteristics available for training. Further, instead of one-step forecasts, the sought-for model needs to output (multi-step) point forecasts for time horizons $H > 1$.
4.3.1 Dataset
Experiments are based on the publicly available M4 competition dataset (https://github.com/Mcompetitions/M4methods), consisting of 100,000 time series from six diverse domains, aggregated into six subgroups that are defined by the frequency of observations (i.e., yearly, quarterly, monthly, weekly, daily and hourly). Forecasting horizons range from $H = 6$ (yearly) to $H = 48$ (hourly). The test set is fixed and contains, for all time series in each subgroup, exactly $H$ observations to be predicted (starting at the last available training time point).
4.3.2 Forecasting model
We employ the forecasting model of Section 3.4 (based on the NBEATS reference implementation, https://github.com/ElementAI/N-BEATS) and largely stick to the architecture and training configuration of [Oreshkin20a, Table 18]. Our implementation only differs in the hidden dimensionality of NBEATS blocks (128 instead of 512) and in the ensembling step. In particular, for each forecast horizon $H$ (i.e., for each subgroup), [Oreshkin20a] train multiple models, varying the input length $T$ from $2H$ to $7H$, using ten random initializations and three separate loss functions (sMAPE, MASE, MAPE). One final forecast per time series is obtained by median-aggregation of each model’s predictions. In our setup, we solely rely on the sMAPE as loss function, vary $T$ only from $2H$ to $5H$, but still use ten random initializations. Even with this (smaller) ensemble size (40 models per subgroup, instead of 180), NBEATS alone already outperforms the winner of M4 (see Table 3). As we are primarily interested in the effect of integrating topological attention, sacrificing absolute performance for a smaller ensemble size is of secondary concern.

In terms of hyperparameters for topological attention, the length ($w$) of the sliding windows is set to $H$, which varies per subgroup as specified above. The model uses 20 transformer encoder layers with two attention heads and 64 structure elements for barcode vectorization. For optimization, we use ADAM with initial learning rates of 1e-3 (for NBEATS and the MLP part of Eq. (7)), 8e-3 (TopVec) and 5e-3 (TransformerEncoder). All learning rates are annealed according to a cosine learning rate schedule over 5,000 iterations with a batch size of 1,024.
  Method (sMAPE / OWA)    Yearly (23k)     Quarterly (24k)   Monthly (48k)    Others (5k)     Average (100k)
  Winner M4 [Smyl20a]     13.176 / 0.778    9.679 / 0.847    12.126 / 0.836   4.014 / 0.920   11.374 / 0.821
  Benchmark               14.848 / 0.867   10.175 / 0.890    13.434 / 0.920   4.987 / 1.039   12.555 / 0.898
  Naive2                  16.342 / 1.000   11.011 / 1.000    14.427 / 1.000   4.754 / 1.000   13.564 / 1.000
  NBEATS [Oreshkin20a]    13.149 / 0.776    9.684 / 0.845    12.054 / 0.829   3.789 / 0.857   11.324 / 0.814
  NBEATS+TopAttn          13.063 / 0.771    9.687 / 0.845    12.025 / 0.828   3.803 / 0.860   11.291 / 0.811
4.3.3 Results & Ablation study
Table 3 lists the sMAPE and OWA for the winner of the M4 competition [Smyl20a], as well as the Naive2 baseline (with respect to which the OWA is computed) and the M4 benchmark approach, obtained as the arithmetic mean over simple, Holt, and damped exponential smoothing.
  Method      sMAPE     OWA
  NBEATS      11.488    0.827
  +Top        11.505    0.920
  +Attn       11.492    0.826
  +TopAttn    11.466    0.824
In terms of the OWA, we see an overall 0.4% improvement over NBEATS and a 1.2% improvement over the M4 winner [Smyl20a]. In particular, topological attention performs well on the large yearly / monthly subgroups of 23k and 48k time series, respectively. While OWA scores are admittedly quite close, the differences are non-negligible, considering the large corpus of 100k time series. In fact, several methods in the official M4 ranking differ by an even smaller amount with respect to the OWA measure.
Similar to the ablation results of Section 4.2, the ablation study in Table 4 (conducted for a single input-length setting only) reveals the beneficial effect of topological attention, in particular, the beneficial nature of allowing to attend to local topological features. Contrary to the ablation in Table 2, we see that, in this large-scale setting, neither topological features (+Top) nor attention (+Attn) alone yield any improvements over NBEATS; when integrated separately into the NBEATS model, both components even deteriorate performance in terms of the sMAPE.
5 Conclusion
While several prior forecasting works have pointed out the relevance of local structural information within historical observations (e.g., [Li19a]), it is typically left to the model to learn such features from data. Instead, we present a direct approach for capturing the “shape” of local time series segments via persistent homology. Different from the typical application of the latter in signal analysis, we capture the evolution of topological features over time, rather than a global summary, and allow a forecasting model to attend to these local features. The resulting topological attention mechanism yields a complementary learning signal that easily integrates into neural forecasting approaches. In combination with NBEATS [Oreshkin20a], for instance, large-scale experiments on the M4 benchmark provide evidence that including topological attention indeed yields more accurate point forecasts.
Societal impact. Due to the ubiquity of time series data, forecasting in general certainly touches upon a variety of societally relevant and presumably sensitive areas. As our work has potential impact in that sense, we perform large-scale experiments over a wide variety of time series from different domains, thereby obtaining a broad picture of the overall forecasting performance.
References
Appendix A Evaluation Criteria
In the following, we first replicate the definition of two commonly used, scale-free evaluation metrics, i.e., the symmetric mean absolute percentage error (sMAPE) and the mean absolute scaled error (MASE), see [Manuscript, Section 4.1]. These scale-free metrics are standard in the practice of forecasting and used across all experiments in the manuscript. Subsequently, we define the overall weighted average (OWA), i.e., an M4-competition-specific performance measure used to rank competition entries. We also provide a toy calculation example.

Letting $\hat{\mathbf{y}} = (\hat{y}_1, \dots, \hat{y}_H)$ denote the length-$H$ forecast, $\mathbf{y} = (y_1, \dots, y_H)$ the true observations and $\mathbf{x} = (x_1, \dots, x_T)$ the length-$T$ history of input observations, both metrics are defined as [Makridakis20a]

$\mathrm{sMAPE} = \frac{200}{H} \sum_{i=1}^{H} \frac{|y_i - \hat{y}_i|}{|y_i| + |\hat{y}_i|}\ , \qquad \mathrm{MASE} = \frac{1}{H} \sum_{i=1}^{H} \frac{|y_i - \hat{y}_i|}{\frac{1}{T-m}\sum_{j=m+1}^{T} |x_j - x_{j-m}|}\ ,$   (12)

with $m$ depending on the observation frequency. In the M4 competition [Makridakis20a], the frequencies per subgroup are: 12 for monthly, four for quarterly, 24 for hourly and one for yearly / weekly / daily data. To obtain the OWA of a given forecast method, say Forecast, we compute [Oreshkin20a]

$\mathrm{OWA} = \frac{1}{2}\left(\frac{\mathrm{sMAPE}_{\mathrm{Forecast}}}{\mathrm{sMAPE}_{\mathrm{Naive2}}} + \frac{\mathrm{MASE}_{\mathrm{Forecast}}}{\mathrm{MASE}_{\mathrm{Naive2}}}\right)\ .$   (13)
Thus, if Forecast displays a MASE of 1.63 and a sMAPE of 12.65% across the 100k time series of M4, while Naive2 displays a MASE of 1.91 and a sMAPE of 13.56%, the relative MASE and sMAPE of Forecast would be 1.63/1.91 = 0.85 and 12.65/13.56 = 0.93, respectively, resulting in an OWA of (0.93 + 0.85)/2 = 0.89. According to [Makridakis20a], this indicates that, on average, Forecast is about 11% more accurate than Naive2, taking into account both sMAPE and MASE.
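The toy calculation above can be verified in one line of code:

```python
def owa(smape_val, mase_val, smape_naive2, mase_naive2):
    """Overall weighted average of Eq. (13): the mean of sMAPE and MASE,
    each measured relative to the Naive2 baseline."""
    return 0.5 * (smape_val / smape_naive2 + mase_val / mase_naive2)
```

Evaluating `round(owa(12.65, 1.63, 13.56, 1.91), 2)` reproduces the 0.89 from the example, and a method identical to Naive2 attains an OWA of exactly 1.0.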
Performance criteria for single time series experiments. In our single time series experiments of [Manuscript, Section 4.2], we use the sMAPE as the underlying evaluation measure and compute the following two statistics: first, the average rank (Rank) of each method, based on its rank on each single time series in carparts and electricity (the two datasets are treated separately); and, second, the average percentage difference (% Diff) to the best approach per time series. A calculation example, for four hypothetical models and three time series (TS1, TS2, TS3), is listed in Table A.1.
Table A.1: sMAPE per time series; parentheses list (rank, % diff to the best method).

           TS1             TS2             TS3             Rank   % Diff
Model A    14.4 (3, 9.66)  11.8 (1, 0.00)  10.5 (2, 3.81)  2.00   6.73
Model B    14.3 (2, 8.39)  12.1 (2, 2.48)  10.8 (3, 6.48)  2.33   5.78
Model C    13.1 (1, 0.00)  12.2 (3, 3.28)  11.1 (4, 9.01)  2.67   6.14
Model D    14.5 (4, 9.66)  13.1 (4, 9.92)  10.1 (1, 0.00)  3.00   9.79
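One plausible implementation of the two statistics is sketched below (names are our own; the per-series difference is read as (sMAPE − best sMAPE) / sMAPE × 100, and we average it over all series, which is one of several readings compatible with the toy table):

```python
def rank_and_diff(smape_per_model):
    """Average rank and average % diff to the best method per series.

    smape_per_model: dict mapping model name -> list of sMAPE values,
    one value per time series (all lists have equal length).
    """
    models = list(smape_per_model)
    n_series = len(next(iter(smape_per_model.values())))
    ranks = {m: 0.0 for m in models}
    diffs = {m: 0.0 for m in models}
    for s in range(n_series):
        # rank models by sMAPE on series s (ties broken arbitrarily)
        order = sorted(models, key=lambda m: smape_per_model[m][s])
        best = smape_per_model[order[0]][s]
        for r, m in enumerate(order, start=1):
            x = smape_per_model[m][s]
            ranks[m] += r
            diffs[m] += (x - best) / x * 100.0
    return ({m: ranks[m] / n_series for m in models},
            {m: diffs[m] / n_series for m in models})

avg_rank, avg_diff = rank_and_diff({"A": [14.0, 12.0], "B": [13.0, 12.6]})
```

In this two-model example, each model is best on one of the two series, so both obtain an average rank of 1.5.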
Appendix B Dataset Details
For completeness, Table B.1 replicates [Oreshkin20a, Table 2], providing an overview of the key statistics for the M4 competition dataset. For all results listed in the main manuscript, the subgroups Weekly, Daily and Hourly are aggregated into Others, accounting for 5,000 time series overall.
Table B.2 lists key statistics for the carparts and the electricity time series we use in [Manuscript, Section 4.2]. Notably, there are no time series categorized into the erratic category, according to Syntetos et al. [Syntetos05a]. As carparts is proprietary, Fig. B.1 additionally shows a visualization of all observations from the four spare part demand time series.
Appendix C Additional Results
Table C.1 replicates [Manuscript, Table 3], listing sMAPE / OWA statistics on M4, as well as the corresponding MASE / OWA statistics (detailed results for Weekly, Daily and Hourly are not listed in [Makridakis18b, Makridakis20a], but available here). Table C.2 lists results, split by time series domains.
Appendix D Hyperparameter Settings
Hyperparameter settings for our single time series experiments of [Manuscript, Section 4.2] and the largescale M4 experiments of [Manuscript, Section 4.3] are listed in Tables D.1 and D.2.
As mentioned in the manuscript, for M4 experiments with N-BEATS (and N-BEATS+TopAttn), we closely follow the generic N-BEATS parameter configuration of Oreshkin et al. [Oreshkin20a, Table 18]; any additional parameters (for our N-BEATS+TopAttn approach) are highlighted in red. Note that we also mark Hidden dimension in red, as this is not only the hidden dimension of the N-BEATS blocks; we equally use this setting for the hidden dimension of the transformer encoder layers.
Appendix E Sliding Window Configurations
Let's assume we have, at one point in training, a randomly extracted training portion of $T+H$ consecutive observations from a length-$N$ time series ($N \geq T+H$). We use the first $T$ observations as (1) the raw input signal to our models and (2) for the extraction of complementary local topological properties. The $H$ consecutive observations, starting at position $T+1$, are used as target (to compute the mean-squared-error, or the sMAPE, for instance).
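The input/target split described above can be sketched as follows (T and H follow the notation in the text; the function name is our own):

```python
def split_training_portion(y, T, H):
    """Split T+H consecutive observations into model input and forecast target."""
    assert len(y) == T + H
    x = y[:T]        # raw input signal (also used for topological features)
    target = y[T:]   # the H observations to forecast
    return x, target

# e.g., a portion of 10 observations with T = 7 and H = 3
x, target = split_training_portion(list(range(10)), 7, 3)
```

Here `x` contains the first 7 observations and `target` the last 3.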
Throughout all experiments, sliding windows are moved forward by one observation at a time.
For extracting local topological properties from the input (of length $T$) via persistent homology, two parameters are necessary: the number of overlapping sliding windows, $n$, and the length of a single sliding window, $w$ (i.e., $w$ observations). For each sliding window, we obtain one barcode (or two, if the additional filtration described in the manuscript is taken into account).
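For intuition, the 0-dimensional persistence barcode of a single window under a sublevel-set filtration of the signal can be computed by a union-find sweep over the values in sorted order. The sketch below is our own minimal implementation, not the Ripser-based one used in the experiments:

```python
def sublevel_persistence_0d(f):
    """0-dim persistence barcode of the sublevel-set filtration of a 1D signal."""
    n = len(f)
    parent = list(range(n))
    active = [False] * n

    def find(i):  # union-find root with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    bars = []
    for i in sorted(range(n), key=f.__getitem__):
        active[i] = True
        for j in (i - 1, i + 1):
            if 0 <= j < n and active[j]:
                a, b = find(i), find(j)
                if a == b:
                    continue
                if f[a] < f[b]:      # elder rule: the older component survives
                    a, b = b, a
                if f[a] < f[i]:      # skip zero-persistence pairs
                    bars.append((f[a], f[i]))
                parent[a] = b
    bars.append((min(f), float("inf")))  # essential connected component
    return sorted(bars)
```

Each finite bar pairs a local minimum (birth) with the value at which its component merges into an older one; the global minimum yields the single essential bar. For example, `sublevel_persistence_0d([0, 2, 1, 3, 0.5])` returns `[(0, inf), (0.5, 3), (1, 2)]`.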
Single time series experiments [Manuscript, Section 4.2]. In this setting, $H=1$, as we compute one-step forecasts. Since forecast models typically differ in their sensitivity to the length $T$ of the input observations, we cross-validate $T$ (for all methods) using the sMAPE on the validation set.
The collection of values for $T$ used for cross-validation is constructed based on the following consideration: first, for persistent homology computation, we need a reasonable number of observations in each sliding window; and, second, we need a reasonable number of sliding windows for self-attention. Hence, we lower-bound both the window length $w$ and the number of windows $n$. For one specific choice of $(n, w)$, we get $T = n + w - 1$ (as windows overlap with stride one). Varying $n$ and $w$ thus determines the length, $T$, of the input vector. For instance, setting $n=10$ and $w=5$ gives a decomposition of the input (of length 14) into 10 subsequent windows of length 5, for which persistent homology is computed. Overall, in the described setup, $T$ ranges from 14 to 244.
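The stride-1 decomposition can be sketched as follows (we write `w` for the window length; with stride 1, a length-T input yields n = T − w + 1 windows, equivalently T = n + w − 1):

```python
def sliding_windows(x, w):
    """All overlapping windows of length w, moved forward one observation at a time."""
    return [x[i:i + w] for i in range(len(x) - w + 1)]

# a length-14 input with w = 5 decomposes into n = 10 overlapping windows
windows = sliding_windows(list(range(14)), 5)
```

Persistent homology is then computed per window, yielding one barcode each.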
Large-scale experiments [Manuscript, Section 4.3]. In this setting, $H$ equals the subgroup-specific forecast horizon. For comparability with N-BEATS, we stick to the original setup of considering input lengths $T$ as multiples of the forecast horizon $H$ (which is specific to each subgroup in M4). In particular, $T$ ranges from $2H$ to $5H$, see Table D.2. As an example, on M4 Yearly, this yields a range of $T$ from 12 to 30. As mentioned in [Manuscript, Section 4.3.2], we set $w$, and $n$ is thus determined by $T$.
Appendix F Ensemble Size
As described in [Manuscript, Section 4.3.2], we ensemble 40 models to obtain forecasts for each subgroup of the M4 dataset. One ensemble is formed per subgroup and consists of training N-BEATS, or N-BEATS+TopAttn, respectively, with 10 random initializations for four different values of $T$, i.e., $T \in \{2H, 3H, 4H, 5H\}$ (where $H$ denotes the specific forecast horizon prescribed per subgroup), using the sMAPE as loss function. In case of Yearly, for instance, $H=6$, see Table B.1.
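The per-subgroup ensemble grid can be enumerated as below (a sketch; we assume the four input lengths are the multiples 2H, …, 5H, consistent with the Yearly range of 12 to 30 given in Section E; how the 40 forecasts are aggregated is described in the manuscript):

```python
from itertools import product

H = 6  # forecast horizon for M4 Yearly (Table B.1)

# 10 random initializations x 4 input lengths -> 40 models per subgroup
configs = [(seed, mult * H) for seed, mult in product(range(10), (2, 3, 4, 5))]
```

On Yearly, the resulting input lengths are 12, 18, 24 and 30.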
Fig. F.1 shows a comparison of N-BEATS and N-BEATS+TopAttn over the ensemble size, illustrating that N-BEATS+TopAttn equally benefits from a larger ensemble. Notably, in [Oreshkin20a] the ensemble is much larger: in addition to training models with the sMAPE as loss, the MAPE and MASE are used, and $T$ scales up to $7H$, resulting in 180 models in total.
Appendix G Model Details
In this section, we describe the details for the models used in the single time series experiments of [Manuscript, Section 4.2].
Prophet. We use the publicly available Python implementation of Prophet (https://facebook.github.io/prophet/) with default parameter choices.
autoARIMA. We use the publicly available Python implementation of autoARIMA (https://alkaline-ml.com/pmdarima/). In terms of hyperparameters, the initial number of time lags of the autoregressive (“AR”) and the moving-average (“MA”) model is set to 1, bounded by a maximum of 6. The period for seasonal differencing is set to 5; the orders of first-differencing and of seasonal differencing are set to 2 and 0, respectively.
LSTM. We implement an LSTM model with hidden dimensionality 128, 8 recurrent layers, and a dropout layer (dropout probability 0.3) on the outputs of each LSTM layer. Outputs of the LSTM are fed to a subsequent single-hidden-layer MLP with hidden dimensionality 64, including batch normalization and ReLU activation. Initial learning rate and weight decay are set to 1e-3 and 1.25e-5, respectively. We minimize the mean-squared-error (MSE) via ADAM over 1.5k (electricity) and 2k (carparts) iterations, respectively, using a batch size of 30. All learning rates are annealed following a cosine learning rate schedule.

For DeepAR, MQCNN, MQRNN and the MLP baseline, we use the publicly available GluonTS [gluonts_jmlr] implementations (https://ts.gluon.ai), mostly with default
parameter choices. We only adjust the (maximum) number of training epochs to 20 (for comparability with our approach, where we count iterations), change the hidden dimensionality of the MLP to 64, and set the batch size to 30.

Appendix H System Setup
All experiments were executed on an Ubuntu Linux 20.04 system, using PyTorch v1.7.0 (CUDA 10.1), 128 GB of RAM and 16 Intel(R) Core(TM) i9-10980XE CPUs.
Appendix I Persistent Homology Runtime
To back up the “near-linear runtime” statement for 0-dimensional persistent homology computation in the proposed regime (see [Manuscript, Section 3.4]), Fig. I.1 shows a runtime plot (using Ripser, https://github.com/Ripser/ripser) over 10,000 sliding window sizes $w$ in the considered range. The system setup for these runtime experiments is given in Section H. Fig. I.1 clearly corroborates the statement from the manuscript.