TILDE-Q: A Transformation Invariant Loss Function for Time-Series Forecasting

by   Hyunwook Lee, et al.

Time-series forecasting has attracted increasing attention in AI research because of its importance for real-world problems across domains such as energy, weather, traffic, and economy. Across many types of data, handling drastic changes, temporal patterns, and shapes in sequential data remains a key challenge that previous models predict poorly. This is because most time-series forecasting approaches minimize L_p norm distances as loss functions, such as mean absolute error (MAE) or mean squared error (MSE). These loss functions struggle both to model temporal dynamics and to capture the shape of signals. In addition, they often make models misbehave and return results that are uncorrelated with the original time-series. To be effective, a loss function has to be invariant to the set of distortions between two time-series instead of just comparing exact values. In this paper, we propose a novel loss function, called TILDE-Q (Transformation Invariant Loss function with Distance EQuilibrium), that not only considers distortions in amplitude and phase but also allows models to capture the shape of time-series sequences. In addition, TILDE-Q supports modeling periodic and non-periodic temporal dynamics at the same time. We evaluate the effectiveness of TILDE-Q by conducting extensive experiments under periodic and non-periodic data conditions, from naive models to state-of-the-art models. The experimental results indicate that models trained with TILDE-Q outperform those trained with other training metrics (e.g., MSE, dynamic time warping (DTW), temporal distortion index (TDI), and longest common subsequence (LCSS)).


1 Introduction

Time-series forecasting has been a core problem across various domains, including traffic (Li18; Lee20), economy (Zhu02), and disease propagation analysis (Matsubara14). The crucial part of time-series forecasting is modeling complex temporal dynamics (e.g., non-stationary signals, periodicity). Temporal dynamics, or intuitively, shape, has always been one of the most attention-getting topics in time-series domains, for example the rush hour in traffic data or abnormal electricity usage (Keogh05b; Bakshi94; Weigend94; Wu21autoformer; Zhou21FEDformer). Deep learning methods are an appealing solution for modeling complex non-linear temporal dependencies and non-stationary signals, but recent work reveals that even deep learning is often insufficient to model temporal dynamics. To properly model temporal dynamics, Wu21autoformer and Zhou21FEDformer have proposed novel deep learning approaches with input sequence decomposition. Guen19dilate tries to model sudden changes timely and accurately with dynamic time warping (DTW). Bica20 adopts domain adversarial training to learn balanced representations, that is, treatment-invariant representations over time. However, Wu21autoformer and Zhou21FEDformer pay less attention to the essence of the problem: shape, in other words, temporal dynamics. Guen19dilate and Bica20 try to capture the shape but still have limitations, as shown in Fig. 1 (c).

A shape is a part of the patterns in time-series data within a given time interval that could give valuable information, such as a rise, drop, trough, peak, or plateau. We call a prediction informative when it properly considers the shape. Time-series forecasting models should both accurately forecast the value at each time-step and produce predictions whose shapes are similar to those in the original time-series. However, existing models do not consider learning shape (Wu21autoformer; Zhou21FEDformer; Bica20; Guen19dilate), so the forecasting results are often inaccurate and uninformative, because deep learning models tend to learn in the easiest way (Karras19stylegan). Fig. 1 shows three real forecasting results with the same model but different training metrics. When we use mean squared error (MSE) as the objective, the model only aims to reduce the gap between prediction and ground truth at each time-step. As a result, the model generates a relatively easy prediction regardless of temporal dynamics (Fig. 1 (b)), which rarely gives information about the original time-series. In contrast, if we consider both the gap and the shape of prediction and ground truth, the model can achieve both accuracy and temporal dynamics, as shown in Fig. 1 (a).

In this work, we aim to design a novel objective function that guides models to improve forecasting performance by learning the shapes in time-series data. To design such a shape-aware loss function, we review existing literature (Esling12; Bakshi94; Keogh03) and investigate the notions of shapes and of the distortions that interfere with measuring the similarity of two time-series in terms of shape (Sec. 3.1, Sec. 3.2, and Sec. 3.3). Based on this investigation, we propose the conditions required for constructing an objective function for shape-aware time-series forecasting (Sec. 3.4). We then present a novel loss function, TILDE-Q (Transformation Invariant Loss function with Distance EQuilibrium), that enables shape-aware representation learning with three different loss terms, which are invariant to the distortions (Sec. 4). For evaluation, we conduct extensive experiments with state-of-the-art deep learning models for time-series forecasting trained with TILDE-Q. The results indicate that TILDE-Q is model-agnostic and can improve the accuracy of existing models, compared to MSE and DILATE.

Figure 1: Ground-truth and forecasting results with three metrics: (a) TILDE-Q, (b) MSE, and (c) DTW-based loss function. (b) MSE tends to generate non-informative forecasting results, similar to an average value of the data, and (c) DTW often produces misaligned results.


We make the following contributions: (1) to understand shape-awareness and distortion invariances in time-series forecasting, we investigate existing distortions in amplitude and phase; (2) we implement TILDE-Q, which is invariant to many existing distortions and achieves shape-aware, informative forecasting in a timely manner; and (3) we show that the proposed TILDE-Q allows models to achieve, on average, higher accuracy than existing metrics such as DTW, TDI, and LCSS.

2 Related Work

2.1 Time-Series Forecasting

There are many methods for time-series forecasting, from traditional ones, such as the ARIMA model (Box15) and the hidden Markov model, to recent deep learning models. In this section, we briefly describe the recent deep learning models for time-series forecasting. Starting with the huge success of recurrent neural networks (RNNs) (Clevert16; Li18; Yu17), researchers have developed novel deep learning architectures that improve forecasting performance. To effectively capture long-term dependency, which is a weakness of RNNs, Stoller20 have proposed convolutional neural networks (CNNs). However, many CNN layers must be stacked to capture long-term dependency (Zhou21informer). Attention-based approaches have been another popular research direction in time-series forecasting, including Transformer (Vaswawni17) and Informer (Zhou21informer). Although the attention-based models effectively capture temporal dependencies, they incur high computational cost and often struggle to find proper temporal information (Wu21autoformer). To cope with this problem, Wu21autoformer and Zhou21FEDformer use input decomposition, which helps models better encode appropriate information. Other state-of-the-art models adopt neural memory networks (Kaiser17; Sukhbaatar15; Madotto18; Lee22), which refer to historical data stored in memory to generate meaningful representations.

2.2 Training Metrics

Conventionally, mean squared error (MSE), the $L_p$ norm, and its variants are the mainstream choices for optimizing forecasting models. However, they are not the best metrics for training forecasting models (Esling12) because time-series are temporally continuous. Additionally, the $L_p$ norm gives little information about temporal correlation within time-series data. To better model temporal dynamics in time-series data, researchers have used differentiable, approximated dynamic time warping (DTW) as an alternative to MSE (Cuturi17; Abid18; Mensch18). However, using DTW as a loss function ignores the temporal localization of changes. Recently, Guen19dilate suggested DILATE, a training metric that catches sudden changes of non-stationary signals in a timely manner with a smooth approximation of DTW and a penalized temporal distortion index (TDI). To guarantee timely predictions, Guen19dilate introduce a loss function that gives a harsh penalty when predictions show high temporal distortion. However, TDI relies on the DTW path, and DTW often produces misalignments because it is sensitive to noise and scale. Thus, DILATE often loses its advantage on complex data and shows disadvantages at the beginning of training. In this work, we discuss distortions and transformation invariances and design a new loss function that allows models to learn shapes in the data and produce noise-robust forecasting results.
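As a small illustration of why point-wise norms can reward uninformative forecasts, the sketch below compares a flat forecast against a forecast with the correct shape but a quarter-period phase shift. The signals are toy assumptions chosen for this example and are not taken from the paper's datasets.

```python
import numpy as np

t = np.arange(100)
y = np.sin(2 * np.pi * t / 25)            # ground truth: four full periods
flat = np.zeros_like(y)                   # "average" forecast with no shape at all
shifted = np.cos(2 * np.pi * t / 25)      # correct shape, quarter-period phase shift

def mse(a, b):
    """Plain mean squared error between two equal-length series."""
    return float(np.mean((a - b) ** 2))

mse_flat = mse(y, flat)
mse_shifted = mse(y, shifted)
print(f"flat: {mse_flat:.3f}  shifted: {mse_shifted:.3f}")
# flat: 0.500  shifted: 1.000
```

Under MSE, the flat forecast scores better than the shape-preserving one, which matches the motivation for a loss that looks beyond point-wise gaps.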

3 Preliminary

In this section, we aim to investigate common distortions without losing the goal of time-series forecasting (i.e., modeling temporal dynamics and accurate forecasting). To help understand the concepts, we first define notations and terms (Sec. 3.1). We then discuss common distortions in time-series in transformation perspectives that need to be considered for building a shape-aware loss function (Sec. 3.2) and describe how other loss functions (e.g., DTW and TDI) handle shapes during learning (Sec. 3.3). Last, we explain the conditions for effective time-series forecasting (Sec. 3.4).

3.1 Notations and Definitions

Let $x_t$ denote a data point at time step $t$. Then, we can define the time-series forecasting problem as:

Definition 1.

Given a $T$-length historical time-series $X = \{x_{t-T+1}, \dots, x_t\}$ at time $t$ and the corresponding $T'$-length future time-series $Y = \{x_{t+1}, \dots, x_{t+T'}\}$, time-series forecasting aims to learn a mapping function $f: X \mapsto Y$.
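A minimal sketch of how Definition 1 translates into training pairs: the snippet slices a univariate series into (history, future) windows. The helper name `make_windows` and its arguments `hist_len` and `fut_len` are hypothetical names introduced for this illustration.

```python
import numpy as np

def make_windows(series, hist_len, fut_len):
    """Split a 1-D series into (history, future) pairs for forecasting."""
    series = np.asarray(series, dtype=float)
    n_pairs = len(series) - hist_len - fut_len + 1
    if n_pairs <= 0:
        raise ValueError("series is too short for the requested window lengths")
    # Row i of X holds hist_len past values; row i of Y holds the fut_len values that follow.
    X = np.stack([series[i:i + hist_len] for i in range(n_pairs)])
    Y = np.stack([series[i + hist_len:i + hist_len + fut_len] for i in range(n_pairs)])
    return X, Y

X, Y = make_windows(np.arange(10), hist_len=4, fut_len=2)
print(X.shape, Y.shape)  # (5, 4) (5, 2)
```

A forecasting model is then trained to map each row of `X` to the matching row of `Y`.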

To distinguish the label (i.e., ground-truth) and prediction time-series data, we denote the label data as $Y$ and the prediction data as $\hat{Y}$. Next, we set up two goals for time-series forecasting, which requires not only precise but also informative forecasting (Wu21autoformer; Zhou21FEDformer; Guen19dilate), as follows:

  • The mapping function $f$ should be learnt to point-wisely reduce the distance between $Y$ and $\hat{Y}$; and

  • The output $\hat{Y}$ should have temporal dynamics similar to those of $Y$.

Temporal dynamics are informative patterns in time-series, such as rise, drop, trough, peak, and plateau. We define the temporal dynamics as follows:

Definition 2.

Temporal dynamics (or shapes) are the informative periodic and non-periodic patterns in time-series data.

In this work, we aim to design a shape-aware loss function that satisfies both goals. To this end, we first discuss distortions that two time-series with similar shapes can have.

Definition 3.

Given two time-series $F$ and $G$ with a similar shape, a distortion is a difference between $F$ and $G$.

Distortion generally occurs in different aspects. Distortions are categorized as temporal distortion (i.e., warping) and amplitude distortion (i.e., scaling), according to the dimension they affect: time or amplitude. Distortion in the data leads to misbehavior of the model, because measurements are disturbed by the distortion. For example, if we have two time-series $F$ and $G$ that have a similar shape but different means, $G$ could represent many temporal dynamics of $F$. However, measurements often judge $F$ and $G$ to be different (e.g., when measured with MSE), which misguides the model during training. As such, it is important to have measurements that recognize similar shapes regardless of distortion. We define a measurement that is invariant to a distortion as follows:

Definition 4.

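Let a transformation $\mathcal{H}$ represent a distortion. Then, we call a measurement $D$ invariant to $\mathcal{H}$ if there exists $\delta > 0$ such that $D(S, \mathcal{H}(S)) < \delta$ for any time-series $S$.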

Figure 2: Example of the six distortions on the amplitude axis (top) and temporal axis (bottom).

3.2 Time-Series Distortions in Transformation Perspectives

Distortion, a gap between two similar time-series, affects how well shapes in time-series data can be captured. As such, it is important to investigate different distortions and their impact on representation learning. There are six common time-series distortions that models encounter during learning (Esling12; Batista14cid; Berkhin06; Liao05; Kerr08): Amplitude Shifting, Phase Shifting, Uniform Amplification, Uniform Time Scaling, Dynamic Amplification, and Dynamic Time Scaling. Next, we explain each common time-series distortion in terms of a transformation between $n$-length time-series $F(t)$ and $G(t)$, where $t = 1, \dots, n$. Fig. 2 presents example distortions, categorized by amplitude and time dimensions.

  • Amplitude Shifting describes how much a time-series shifts against another time-series along the amplitude axis. This can be described with two time-series and the degree of shifting $k$: $G(t) = F(t) + k$, where $k$ is constant.

  • Phase Shifting is the same type of transformation (i.e., translation) as amplitude shifting, but it occurs along the temporal dimension. This distortion can be represented with two time-series functions and the degree of shift $k$: $G(t) = F(t + k)$, where $k$ is constant. Cross-correlation (Paparrizos15kshape; Vlachos05) is the most popular measure that is invariant to this distortion.

  • Uniform Amplification is a transformation that changes the amplitude by multiplication by $k$. This distortion can be described with two functions and a multiplication factor $k$: $G(t) = k \cdot F(t)$.

  • Uniform Time Scaling means a uniform shortening or lengthening on the temporal axis. This distortion can be represented as $G(t) = F(kt)$, where $k$ is a positive constant. Although Keogh04 propose uniform time warping methods to handle this distortion, it remains one of the most difficult distortion types to measure, because the scaling factor is hard to find without testing all possible cases (Keogh03).

  • Dynamic Amplification can be interpreted as any distortion caused by non-zero multiplication on the amplitude dimension. This distortion can be described as $G(t) = h(t) \cdot F(t)$ with a function $h$ such that $h(t) \neq 0$. Local amplification is a representative distortion of this type, and it remains challenging to solve.

  • Dynamic Time Scaling means any transformation that dynamically lengthens or shortens signals on the temporal dimension, including local time scaling (Batista14cid) and occlusion (Batista14cid; Vlachos03). It can be represented as $G(t) = F(h(t))$, where $h$ is a positive, strictly increasing function. Dynamic time warping (DTW) (Bellman59dtw; Berndt94; Keogh05) is the most popular technique for this distortion. Das97 also introduce the longest common subsequence (LCSS) algorithm to tackle occlusion, noise, and outliers in this distortion.
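The six distortions listed above can be sketched on a toy sinusoid as follows. The reference signal, the constants, and the two dynamic functions are illustrative assumptions, chosen only so that each transformation matches its verbal description.

```python
import numpy as np

n = 200
t = np.arange(n)

def base(x):
    """Reference series F, evaluated at (possibly warped) time points x."""
    return np.sin(2 * np.pi * x / 50.0)

F = base(t)
distorted = {
    "amplitude_shifting": F + 0.5,                       # add a constant
    "phase_shifting": base(t + 10),                      # translate in time
    "uniform_amplification": 2.0 * F,                    # multiply by a constant
    "uniform_time_scaling": base(1.5 * t),               # stretch time uniformly
    "dynamic_amplification": (1.0 + 0.5 * t / n) * F,    # non-zero, time-varying factor
    "dynamic_time_scaling": base(t + 0.002 * t ** 2),    # strictly increasing time warp
}

for name, G in distorted.items():
    print(f"{name:24s} MSE to F = {np.mean((F - G) ** 2):.4f}")
```

Every distorted series keeps a recognizably similar shape, yet each one has a non-zero point-wise error against `F`.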

There are several studies on shape-aware clustering (Bellman59dtw; Batista14cid; Paparrizos15kshape; Berkhin06; Liao05; Kerr08) and classification (Xi06; Batista14cid; Srisai09) tasks. On the other hand, only a few studies exist for time-series forecasting tasks, including Guen19dilate, which uses dynamic time warping (DTW) and the temporal distortion index (TDI) for modeling temporal dynamics. Next, we describe mean squared error (MSE) and DILATE, proposed by Guen19dilate, and discuss their invariance to the distortions.

3.3 Distortion Handling in Current Time-Series Forecasting Objectives

Many measurement metrics have been used in the time-series forecasting domain, and those based on the $L_p$ distance, including Euclidean distance, are widely used to handle time-series data. However, such metrics are not invariant to the aforementioned distortions (Ding08; Guen19dilate) because of their point-wise mapping. Specifically, since the $L_p$ distance compares values per time step, it cannot handle temporal distortions appropriately and is vulnerable to scaling of the data. Guen19dilate propose a loss function, called DILATE, to overcome this inadequacy of the distance metrics by recognizing temporal dynamics with DTW and TDI. In terms of transformation, DILATE handles dynamic time scaling, especially local time scaling, with DTW, and phase shifting with penalized TDI, defined as follows:

$$\mathrm{DTW}(Y, \hat{Y}) = \min_{A} \langle A, \Delta(Y, \hat{Y}) \rangle, \qquad \mathrm{TDI}(Y, \hat{Y}) = \langle A^{*}, \Omega \rangle, \quad A^{*} = \arg\min_{A} \langle A, \Delta(Y, \hat{Y}) \rangle,$$

where $A$, $\Delta(Y, \hat{Y})$, and $\Omega$ are the warping path, cost matrix, and squared penalization matrix, respectively.

While DILATE shows better performance than existing methods, it has a gap from the invariance point of view. DTW computes the Euclidean distance between two time-series after temporally aligning them with dynamic programming, and the alignment itself relies on the distance function. Consequently, the dynamic alignment of DTW can be properly achieved only when the two time-series have the same range (Esling12; Bellman59dtw). That means it hardly achieves invariance to amplitude distortion without appropriate pre-processing. Gong17 also show that DTW poorly matches the prediction and target (i.e., ground-truth) time-series under amplitude shifting. Even when the target time-series is normalized, we cannot guarantee that the predicted and target time-series are properly aligned, due to DTW's high sensitivity to noise. As a result, DILATE can generate poor alignments that lead to wrong optimization of TDI, which produces instability during optimization steps and incorrect results. To design an effective shape-aware loss function, we have to understand the measures and when they have transformation invariances. In the next section, we discuss how we interpret transformations from the time-series forecasting point of view and which types of transformations should be considered in objective function design.
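To make the dependence of TDI on the DTW alignment concrete, here is a minimal sketch of classic (non-differentiable) DTW with path backtracking, followed by a TDI computed from that path. It assumes a squared point-wise cost and a penalty of $(i - j)^2 / n^2$ per path cell; DILATE itself uses smooth approximations, which this example does not reproduce.

```python
import numpy as np

def dtw_path(a, b):
    """Classic O(n*m) DTW with squared point-wise cost; returns (cost, path)."""
    n, m = len(a), len(b)
    delta = (a[:, None] - b[None, :]) ** 2              # pairwise cost matrix
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = delta[i - 1, j - 1] + min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # Backtrack the optimal warping path from the last cell to the first one.
    i, j, path = n, m, [(n - 1, m - 1)]
    while (i, j) != (1, 1):
        steps = [(acc[i - 1, j - 1], i - 1, j - 1), (acc[i - 1, j], i - 1, j), (acc[i, j - 1], i, j - 1)]
        _, i, j = min(steps, key=lambda s: s[0])        # ties prefer the diagonal step
        path.append((i - 1, j - 1))
    return float(acc[n, m]), path[::-1]

def tdi(path, n):
    """Temporal distortion: squared deviation of the path from the diagonal."""
    return sum((i - j) ** 2 for i, j in path) / n ** 2

t = np.arange(40)
y = np.sin(2 * np.pi * t / 20)
cost_same, path_same = dtw_path(y, y)
cost_shift, path_shift = dtw_path(y, y + 1.0)           # same shape, shifted amplitude
print("identical:", cost_same, tdi(path_same, len(y)))
print("amplitude-shifted:", cost_shift, tdi(path_shift, len(y)))
```

Comparing the two printed TDI values shows how an amplitude offset alone can change the alignment that TDI is computed from, even though no temporal distortion was introduced.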

3.4 Transformation Invariances in Time-Series Forecasting

In the time-series domain, data often have various distortions, so measurements need to satisfy a number of transformation invariances to meaningfully model temporal dynamics. As discussed in Sec. 3.1, we set the goal of time-series forecasting as (1) point-wisely reducing the gap between prediction and target time-series and (2) preserving the temporal dynamics of the target time-series. To satisfy both, we have to consider (1) methods that do not have a negative impact on the traditional goal of accurate time-series forecasting and (2) the distortions that play a crucial role in capturing the temporal dynamics of the target time-series. In this section, we review whether invariance to each of the six distortions is feasible in a loss function, discuss the benefits and trade-offs, and identify the distortions that should be considered in time-series forecasting.

Amplitude Shifting

In a wide range of situations, it is beneficial to capture the trends of a time-series sequence in spite of shifts in amplitude. Thus, being invariant to amplitude shifting gives a loss function several advantages in time-series forecasting: (1) shape-awareness invariant to amplitude shifting, (2) accurate deviation of values in modeling, and (3) effective on-time prediction of peaks or sudden changes. To guarantee amplitude shifting invariance in the optimization stage, the loss function should induce an equal gap between prediction and ground truth at each step. Formally, a loss function that accounts for amplitude shifting should satisfy:

$$\mathcal{L}(Y, \hat{Y}) = 0 \iff \forall i,\; d(y_i, \hat{y}_i) = k, \tag{1}$$

where $k$ is an arbitrary and equal gap, and $d(\cdot, \cdot)$ is a signed distance. By allowing tolerance between prediction and target time-series, models can follow trends in time-series instead of only predicting exact point-wise values. In short, unlike existing loss functions that handle only point-wise distance (e.g., DTW), we should deal with both the point-wise distance and the relation among the distance values to guarantee amplitude shifting invariance.

Phase Shifting

There are forecasting tasks whose main objectives concern accurate forecasting of peaks and periodicity in time-series (e.g., heartbeat data and stock price data). For such tasks, phase shifting invariance is one of the best solutions for (1) modeling periodicity, regardless of translation on the temporal axis, and (2) obtaining precise statistics of shapes, such as peak and plateau values. If a loss function is to be invariant to phase shifting, the function should satisfy:

$$\mathcal{L}(Y, \hat{Y}) = 0 \iff \forall i,\; d(y_i, \hat{y}_{i+k}) = 0, \tag{2}$$

where $k$ is an arbitrary temporal shift. Note that Eq. 2 allows a forecast with a shape similar to that of the target time-series, not exactly the same shape (e.g., with the same dominant frequency).

Uniform Amplification

This proposition is useful for sparse data that contain a significant number of zeros. By adopting uniform amplification invariance, models are able to focus on non-zero sequences, while receiving less penalty in zero sequences. Since it guarantees shape-awareness up to a multiplication factor in a timely manner, as in Fig. 2, invariance to uniform amplification fits well. To train a model with uniform amplification invariance, the loss function should satisfy:

$$\mathcal{L}(Y, \hat{Y}) = 0 \iff \forall i,\; d(y_i, k \hat{y}_i) = 0, \tag{3}$$

where $k$ is an arbitrary multiplication factor.

Uniform Time Scaling, Dynamic Amplification, and Dynamic Time Scaling

After careful consideration, we conclude that uniform time scaling, dynamic amplification, and dynamic time scaling are incompatible with optimization. We describe the reasons below.

To achieve invariance to uniform time scaling, the loss function should satisfy:

$$\mathcal{L}(Y, \hat{Y}) = 0 \iff \forall i,\; d(y_i, \hat{y}_{ki}) = 0.$$

This proposition negatively influences the original temporal dynamics, because it tolerates mispredicted periodicity (e.g., of daily periodic signals) and cannot even catch events (e.g., abruptly changing values) in a timely manner. In summary, it hinders models from capturing shape and corrupts periodic information.

For both dynamic amplification and dynamic time scaling, loss functions are always zero for all pairs when we do not limit the tolerance. For example, if we do not limit the tolerance, the proposition for dynamic amplification invariance is as follows:

$$\mathcal{L}(Y, \hat{Y}) = 0 \iff \forall i,\; d(y_i, k_i \hat{y}_i) = 0.$$

If a loss function satisfies this proposition, it is always zero because a suitable $k_i$ always exists except when $\hat{y}_i = 0$. Therefore, it cannot give any information, because all random values could be an optimal solution. The same situation happens with dynamic time scaling if we do not limit the window. Consequently, uniform time scaling, dynamic amplification, and dynamic time scaling are all unsuitable as objectives in time-series forecasting.
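The degenerate case for dynamic amplification can be checked numerically. In the sketch below, an arbitrary non-zero "prediction" is mapped exactly onto the target by per-step factors; the random prediction and the variable names are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.sin(np.linspace(0, 4 * np.pi, 50))      # target series
y_hat = rng.uniform(0.5, 1.5, size=50)         # arbitrary non-zero "prediction"

# With unlimited tolerance, each step may pick its own factor k_i = y_i / y_hat_i.
k = y / y_hat
print(np.allclose(k * y_hat, y))               # True
```

Because such factors exist for any non-zero prediction, an unconstrained dynamic-amplification-invariant loss cannot distinguish good forecasts from random ones.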

4 Methods

In this section, we describe a novel loss function, TILDE-Q (a Transformation Invariant Loss function with Distance EQuilibrium), which allows models to perform shape-aware time-series forecasting based on the three distortion invariances. To build a transformation invariant loss function, we have to design a loss function that satisfies the propositions for amplitude shifting invariance (Eq. 1), phase shifting invariance (Eq. 2), and uniform amplification invariance (Eq. 3), as discussed in Sec. 3.4. We select them for our loss function because they help models capture the shape and do not harm the goal of traditional time-series forecasting (i.e., minimizing the gap between prediction and target time-series). The loss function should not only satisfy these propositions but also consider correlations between the whole sequences of outputs and ground truths, rather than optimizing the model point-wise. This is not achieved by other loss functions, such as MSE or DILATE. To handle all three distortions and whole-sequence correlations, we build three objective functions (a.shift, phase, and amp losses), each achieving one or more invariances, by using softmax, Fourier coefficients, and auto-correlation.

Amplitude Shifting Invariance with Softmax (Amplitude Shifting)

To strengthen amplitude shifting invariance, we design a loss function that satisfies Eq. 1. This means that $d(y_i, \hat{y}_i)$ needs to be the same value for all $i$. To satisfy this condition, we use the softmax function:

$$\mathcal{L}_{a.shift}(Y, \hat{Y}) = T' \sum_{i=1}^{T'} \left| \frac{1}{T'} - \mathrm{Softmax}\big(d(y_i, \hat{y}_i)\big) \right|, \tag{4}$$

where $T'$, Softmax, and $d(\cdot, \cdot)$ are the length of the sequence, the softmax function, and the signed distance function, respectively. Because Softmax produces the proportion of each value, the loss reaches its optimum only when Eq. 1 is satisfied. Also, with Softmax there is no need to know the arbitrary equal gap $k$.
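A minimal NumPy sketch of the softmax-based idea described above, under two stated assumptions: the signed distance is taken as `y_hat - y`, and the deviation of each softmax proportion from the uniform value is measured with an absolute difference scaled by the sequence length. The function names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def ashift_loss(y, y_hat):
    """Penalize deviation of softmax(signed gaps) from the uniform proportion 1/T.

    Assumption: signed distance d_i = y_hat_i - y_i.
    """
    d = y_hat - y
    T = len(y)
    return float(T * np.sum(np.abs(1.0 / T - softmax(d))))

rng = np.random.default_rng(0)
y = np.sin(np.linspace(0, 2 * np.pi, 32))
loss_offset = ashift_loss(y, y + 3.0)                    # equal gap at every step
loss_noisy = ashift_loss(y, y + rng.normal(size=32))     # unequal gaps
print(loss_offset, loss_noisy)
```

A constant offset gives (numerically) zero loss regardless of its size, while unequal gaps are penalized, which is the behavior the equal-gap condition asks for.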

Invariances with Fourier Coefficients (Phase Shifting)

As we discussed in Sec. 3.4, one candidate method to obtain phase shifting invariance is to use Fourier coefficients. As described in prior studies (Jason07), we can reconstruct the original time-series with only its dominant frequencies. We therefore use the norm of the dominant Fourier coefficients of the ground truth and prediction sequences as an additional objective function, achieving phase shifting invariance. For the other frequencies, we take the norm of the prediction's Fourier coefficients so that their values are reduced. Consequently, this loss function makes the model robust to noise, because the Fourier coefficients of white noise in the original time-series are relatively small. Simply, we optimize the distance between the Fourier coefficients of two time-series as:

$$\mathcal{L}_{phase}(Y, \hat{Y}) = \begin{cases} \| \mathcal{F}(Y) - \mathcal{F}(\hat{Y}) \|_p, & \text{dominant frequencies} \\ \| \mathcal{F}(\hat{Y}) \|_p, & \text{otherwise} \end{cases} \tag{5}$$

where $\mathcal{F}$ denotes the Fourier coefficients and $\| \cdot \|_p$ is the $L_p$ norm. This loss function obtains uniform amplification invariance by applying a normalization technique to the Fourier coefficients. For example, $Y$ and $k \cdot Y$ have the same Fourier coefficients if properly normalized. In summary, with Eq. 5, we can obtain (1) invariance to phase shifting, (2) invariance to uniform amplification, and (3) robustness to noise.
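The following is a rough sketch of a Fourier-coefficient objective in the spirit of the description above: coefficients at the label's dominant frequencies are matched, and the prediction's remaining coefficients are pushed toward zero. The function name, the `n_dominant` parameter, and the division by the sequence length are assumptions of this example, and the normalization step mentioned in the text is not reproduced.

```python
import numpy as np

def phase_loss(y, y_hat, n_dominant=1, p=2):
    """Match Fourier coefficients on the label's dominant bins; shrink the rest."""
    n = len(y)
    fy = np.fft.rfft(y) / n                      # one-sided spectrum of the label
    fp = np.fft.rfft(y_hat) / n                  # one-sided spectrum of the prediction
    dominant = np.argsort(np.abs(fy))[-n_dominant:]
    mask = np.zeros(len(fy), dtype=bool)
    mask[dominant] = True
    on_dominant = np.linalg.norm(fy[mask] - fp[mask], ord=p)
    off_dominant = np.linalg.norm(fp[~mask], ord=p)
    return float(on_dominant + off_dominant)

rng = np.random.default_rng(0)
t = np.arange(64)
y = np.sin(2 * np.pi * 4 * t / 64)               # single dominant frequency (bin 4)
loss_exact = phase_loss(y, y)
loss_noisy = phase_loss(y, y + 0.05 * rng.normal(size=64))
loss_zero = phase_loss(y, np.zeros(64))
print(loss_exact, loss_noisy, loss_zero)
```

In this toy setting a prediction that misses the dominant frequency entirely (`loss_zero`) is penalized by the magnitude of that coefficient, while small white noise spreads over many bins with small coefficients.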

Invariances with auto-correlation (Uniform Amplification)

Although Fourier coefficients are a reasonable solution for catching the periodicity of the target time-series, they are not fully invariant to phase shifting, for three reasons: (1) the statistics (e.g., mean and variance) of the data keep changing, (2) such changing statistics also change the Fourier coefficients even at the same frequency, and (3) objectives based only on a norm of them cannot fully represent the original time-series. Thus, we introduce an objective based on normalized cross-correlation, which satisfies Eq. 2 for a periodic signal:

$$\mathcal{L}_{amp}(Y, \hat{Y}) = \| R(Y, Y) - R(Y, \hat{Y}) \|_p, \tag{6}$$

where $R(\cdot, \cdot)$ is a normalized cross-correlation function. This loss function helps predicted sequences mimic label sequences by calculating the difference between the auto-correlation of the label sequence and the cross-correlation between the label and predicted sequences. Therefore, the label and prediction have similar temporal dynamics regardless of phase shifting and uniform amplification.
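As a sketch of the correlation-based objective, the snippet below compares the label's normalized auto-correlation with the normalized cross-correlation between label and prediction over all lags. Normalizing by the product of the two signal norms and adding a small epsilon are assumptions of this example.

```python
import numpy as np

def norm_xcorr(a, b, eps=1e-12):
    """Cross-correlation over all lags, normalized by the product of signal norms."""
    denom = np.linalg.norm(a) * np.linalg.norm(b) + eps
    return np.correlate(a, b, mode="full") / denom

def amp_loss(y, y_hat, p=2):
    """Distance between the label's auto-correlation and the label/prediction cross-correlation."""
    return float(np.linalg.norm(norm_xcorr(y, y) - norm_xcorr(y, y_hat), ord=p))

rng = np.random.default_rng(0)
t = np.arange(64)
y = np.sin(2 * np.pi * 4 * t / 64)
loss_scaled = amp_loss(y, 3.0 * y)               # uniformly amplified copy
loss_random = amp_loss(y, rng.normal(size=64))   # unrelated prediction
print(loss_scaled, loss_random)
```

Because both correlations are normalized, a uniformly amplified copy of the label yields a loss of numerically zero, whereas an unrelated prediction does not.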

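In summary, we introduce TILDE-Q (Transformation Invariant Loss function with Distance EQuilibrium), combining Eq. 4, Eq. 5, and Eq. 6 as follows:

$$\mathcal{L}_{\text{TILDE-Q}}(Y, \hat{Y}) = \alpha \, \mathcal{L}_{a.shift}(Y, \hat{Y}) + (1 - \alpha) \, \mathcal{L}_{phase}(Y, \hat{Y}) + \gamma \, \mathcal{L}_{amp}(Y, \hat{Y}), \tag{7}$$

where $\alpha \in [0, 1]$ and $\gamma$ is a hyperparameter.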

5 Experiments

In this section, we present the results of our comprehensive experiments, demonstrating the effectiveness of TILDE-Q and the importance of transformation invariance.

Dataset    Training objective  MSE     DTW     TDI     LCSS
Synthetic  MSE                 0.0107  3.5080  1.0392  0.3523
Synthetic  DILATE              0.0130  3.4005  1.1242  0.3825
Synthetic  TILDE-Q             0.0119  3.2873  1.1564  0.3811
ECG5000    MSE                 0.2152  1.9718  0.8442  0.7743
ECG5000    DILATE              0.8270  3.9579  2.0281  0.4356
ECG5000    TILDE-Q             0.2141  1.9575  0.7714  0.7773
Traffic    MSE                 0.0070  1.4628  0.2343  0.7209
Traffic    DILATE              0.0095  1.6929  0.2814  0.6806
Traffic    TILDE-Q             0.0072  1.4600  0.2276  0.7220
Table 1: Experimental results of short-term time-series forecasting with Seq2Seq GRU model.

Experimental Setup

We conduct the experiments with three state-of-the-art models, Informer (Zhou21informer), N-Beats (Oreshkin2020nbeats), and Autoformer (Wu21autoformer), and one simple sequence-to-sequence gated recurrent unit (GRU) model. We use five real-world datasets (ECG5000, Traffic, ETTh2, ETTm2, and ECL) and one synthetic dataset (Synthetic) for model training. We repeat each experiment with a model and dataset 10 times, in combination with three different objective functions. Appendix A provides detailed explanations of the datasets, hyperparameter settings, and model architectures.

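Methods: Informer + MSE | Informer + DILATE | Informer + TILDE-Q
Each row lists the forecasting length, followed by MSE, DTW, TDI, and LCSS for each of the three methods in the order above.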


96 0.2466 6.9254 3.6676 0.4633 0.3284 6.3109 3.5838 0.5037 0.1768 5.8437 1.6734 0.5379
192 0.2818 10.2654 11.1580 0.4254 0.4086 8.8262 7.1780 0.4893 0.2432 10.2134 9.9865 0.4317
336 0.3089 12.1822 18.7014 0.4434 0.4164 10.3779 13.2580 0.5062 0.2958 13.5586 20.2850 0.4165
720 0.2877 17.6369 38.4617 0.4425 0.4229 14.1196 23.9403 0.4815 0.3157 18.4617 43.3238 0.4262


96 0.0889 3.4007 1.5719 0.7386 0.1263 6.0144 2.7757 0.5129 0.0871 3.1354 1.3474 0.7817
192 0.1157 5.7964 2.8128 0.6705 0.2340 9.7004 7.8354 0.5266 0.1317 5.7093 2.9129 0.6983
336 0.1860 8.9971 6.7970 0.6365 0.2805 11.7889 13.3861 0.5025 0.1767 9.0866 7.4023 0.6555
720 0.2165 14.7685 24.6694 0.5768 0.3745 16.7734 29.2783 0.4747 0.2063 15.3057 24.1959 0.5860


96 0.2709 2.8067 0.1720 0.7032 0.9856 3.6394 1.4794 0.6324 0.2800 2.9466 0.2473 0.7275
192 0.2793 4.1193 0.1508 0.7060 1.1209 5.2289 2.1749 0.6053 0.3077 4.2693 0.2978 0.7336
336 0.3203 5.9533 0.1642 0.7222 1.2331 7.8470 3.0415 0.5694 0.3271 5.8090 0.1984 0.7143
720 0.6414 15.8561 4.4284 0.4564 1.3706 12.5981 5.6720 0.5506 0.4676 11.4027 0.7107 0.6298
Table 2: Experimental results on three real-world datasets (four cases) with Informer.

Evaluation Metrics

In the experiment, we evaluate with four metrics. Three of them, mean squared error (MSE), dynamic time warping (DTW), and its corresponding temporal distortion index (TDI), are used in Guen19dilate. As DTW is sensitive to noise and generates incorrect paths when one of the time-series is noisy (as discussed in Sec. 3.3), we additionally use the longest common subsequence (LCSS) for comparison, which is more robust to outliers and noise (Esling12). The longer the matched subsequence, the better the model captures the shapes.
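For readers unfamiliar with LCSS on real-valued series, a minimal dynamic-programming sketch follows. It assumes that two points match when their absolute difference is at most a threshold `eps` and that the score is normalized by the shorter series length; the threshold value and the function name are illustrative choices, not the paper's evaluation settings.

```python
import numpy as np

def lcss_similarity(a, b, eps=0.1):
    """Longest common subsequence length (|a_i - b_j| <= eps counts as a match),
    divided by the length of the shorter series."""
    n, m = len(a), len(b)
    table = np.zeros((n + 1, m + 1), dtype=int)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(a[i - 1] - b[j - 1]) <= eps:
                table[i, j] = table[i - 1, j - 1] + 1
            else:
                table[i, j] = max(table[i - 1, j], table[i, j - 1])
    return table[n, m] / min(n, m)

a = np.array([0.0, 1.0, 2.0, 3.0])
b = np.array([0.0, 1.0, 5.0, 3.0])     # one outlier at index 2
print(lcss_similarity(a, a))           # 1.0
print(lcss_similarity(a, b))           # 0.75
```

The single outlier only removes one matched point from the score, which illustrates why LCSS is considered more robust to outliers than a path-based measure.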

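Methods: N-Beats + MSE | N-Beats + DILATE | N-Beats + TILDE-Q
Each row lists the forecasting length, followed by MSE, DTW, TDI, and LCSS for each of the three methods in the order above.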


96 0.1869 7.2379 2.3787 0.4688 0.3105 6.5849 3.6490 0.4879 0.1557 5.1011 1.3240 0.5862
192 0.2385 11.5667 4.9153 0.4505 0.6186 9.7254 7.0831 0.4637 0.1738 7.6334 2.4122 0.5819
336 0.2889 16.5255 11.5207 0.4544 1.1406 13.7328 14.6986 0.4584 0.2132 11.3351 5.3556 0.5373
720 0.3881 24.1570 18.8462 0.4381 1.6713 19.4392 23.7028 0.4575 0.3044 17.6006 9.6636 0.5287


96 0.0790 3.9685 2.0436 0.6721 0.1524 7.9302 5.5597 0.4379 0.0952 4.0110 2.1939 0.6902
192 0.1224 6.8695 3.2834 0.5762 0.2055 10.0393 8.5602 0.5107 0.1286 6.3556 4.9798 0.6160
336 0.1824 12.1438 8.5915 0.4587 0.2501 12.6342 16.1473 0.4819 0.1705 8.9377 8.3539 0.6195
720 0.2370 22.8676 17.8458 0.4929 0.4170 17.7764 24.6877 0.5836 0.2336 14.2715 19.0883 0.7070


96 0.3666 3.5207 0.2989 0.6589 1.1156 5.1430 2.6613 0.5074 0.3183 2.9707 0.4844 0.7229
192 0.4307 5.7578 0.4253 0.6212 1.1859 7.3406 2.8488 0.4973 0.3383 4.1817 0.4229 0.7187
336 0.5199 8.5563 0.5384 0.5965 1.2460 9.5096 3.0517 0.5091 0.3831 5.6643 0.3024 0.7112
720 0.6240 13.9436 0.6510 0.5717 1.3061 13.1928 3.7279 0.5337 0.4540 8.9997 0.3251 0.6960
Table 3: Experimental results on three real-world datasets (four cases) with N-Beats.

Results and Analysis

Table 1 shows the short-term forecasting performance of a gated recurrent unit (GRU) model optimized with MSE, DILATE, and TILDE-Q. The Synthetic, ECG5000, and Traffic datasets are used for this experiment. On the Synthetic dataset, each training objective shows its own benefits, which indicates that shape-similarity and MSE measures have a clear advantage when a model is trained and evaluated with the same measure. On the real-world datasets, TILDE-Q outperforms the other objective functions on most evaluation metrics. These results indicate that our approach to learning shapes in time-series data works better than existing methods for forecasting. DILATE does not show impressive performance on ECG5000 due to its high sensitivity to noise, as discussed in Sec. 3.3.

Table 2 and Table 3 summarize the experimental results with two state-of-the-art models, Informer and N-Beats. The models make predictions for both short-term (prediction length of 24) and long-term (up to 720) horizons, so that we can investigate their performance at different forecasting difficulties. On most datasets, the models trained with TILDE-Q outperform those trained with the other metrics. Especially for long-term forecasting, N-Beats and Informer trained with TILDE-Q significantly outperform the same models trained with the other metrics. These results imply that TILDE-Q improves the models' ability to learn temporal dynamics, including the LCSS of N-Beats (improved by over 10%). Appendix B provides a more detailed analysis, qualitative examples with visualizations, and ablation study results.
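As a worked example of how such a relative improvement can be computed, the sketch below uses the first data row of Table 3 (length 96). It assumes that each method group lists its columns in the order MSE, DTW, TDI, LCSS, which the extracted table does not state explicitly; the helper name `relative_improvement` is hypothetical.

```python
def relative_improvement(
    baseline: float, candidate: float, higher_is_better: bool = True
) -> float:
    """Relative change of candidate over baseline, returned as a fraction."""
    change = (candidate - baseline) / abs(baseline)
    return change if higher_is_better else -change


# Values copied from the first row of Table 3 (length 96), under the assumed
# column order: N-Beats + MSE versus the third method group.
lcss_gain = relative_improvement(0.4688, 0.5862)  # LCSS: higher is better
mse_gain = relative_improvement(0.1869, 0.1557, higher_is_better=False)

print(f"LCSS improvement: {lcss_gain:.1%}")
print(f"MSE improvement:  {mse_gain:.1%}")
```

Under that column assumption, the LCSS gain for this row is well above the 10% threshold mentioned in the text.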

Methods Autoformer + MSE Autoformer + DILATE Autoformer + TILDE-Q
Length MSE DTW TDI LCSS MSE DTW TDI LCSS MSE DTW TDI LCSS


96 0.1538 5.2227 2.1865 0.6187 0.2211 6.0453 2.5345 0.5315 0.1494 5.1060 1.9752 0.6317
192 0.1974 7.8730 3.3382 0.6019 0.2825 8.6696 5.6671 0.5335 0.2079 7.8917 3.7532 0.5984
336 0.2393 10.8002 7.3141 0.5954 0.3759 11.0335 13.1347 0.5257 0.2360 10.7212 7.0085 0.5971
720 0.2859 16.3502 15.9233 0.5772 0.4296 15.9819 22.2173 0.4924 0.2378 16.0002 13.7906 0.5795


96 0.0990 4.3498 2.5052 0.6756 0.1135 5.3097 2.2211 0.5936 0.0940 3.9078 2.2587 0.7075
192 0.1340 6.3207 3.3676 0.6512 0.1854 8.5209 3.7894 0.5506 0.1259 6.0979 2.9278 0.6810
336 0.1587 9.4374 6.9205 0.6036 0.2001 12.0265 8.8305 0.5370 0.1548 9.5223 7.2875 0.6169
720 0.1999 14.8332 11.9655 0.6064 0.2665 17.8025 17.4114 0.5001 0.1885 14.5844 9.9918 0.6277


96 0.4209 3.5957 0.2461 0.6487 0.6813 3.6490 0.4780 0.6253 0.3515 3.2173 0.2298 0.6912
192 0.4206 4.9924 0.3416 0.6574 0.7319 5.5324 0.2775 0.6118 0.4032 4.8581 0.3301 0.6680
336 0.4621 6.6888 0.2795 0.6535 0.7895 7.5665 0.2503 0.6091 0.4637 6.7335 0.3923 0.6429
720 0.5005 10.8571 0.2383 0.6183 0.8630 12.1416 0.1877 0.6074 0.5049 9.8492 0.2525 0.6420
Table 4: Experimental results on three real-world datasets (four cases) with Autoformer.

6 Conclusion and Future Work

We propose TILDE-Q, a transformation invariant loss function with distance equilibrium, which allows shape-aware time-series forecasting in a timely manner. To design TILDE-Q, we review existing transformations in time-series data and discuss the conditions that ensure transformation invariance during optimization. The resulting TILDE-Q makes a model invariant to amplitude shifting, phase shifting, and uniform amplification so that it better captures the shape of time-series data. To demonstrate the effectiveness of TILDE-Q, we conduct comprehensive experiments with state-of-the-art models and real-world datasets. The results indicate that a model trained with TILDE-Q generates more timely, robust, accurate, and shape-aware forecasts in both short-term and long-term forecasting tasks. We conjecture that this work can facilitate future research on transformation invariances and shape-aware forecasting.


Appendix A Detailed Experiment Setup


In our experiment, we utilize six datasets: Synthetic, ECG5000, and Traffic for the simple model (i.e., a Sequence-to-Sequence Gated Recurrent Unit) and ETTh2, ETTm2, and Electricity for the state-of-the-art models (i.e., Informer and N-Beats). For each dataset, we describe its metadata and experimental settings, including the input length and prediction window.

Synthetic: As Guen19dilate describe, the Synthetic dataset is an artificial dataset for measuring model performance on sudden changes (step functions), with an input signal composed of two peaks. The amplitude and temporal position of the two peaks are randomly selected. The position and amplitude of the step are then determined by a peak position and amplitude. We use 500 time-series for training, 500 for validation, and 500 for testing. For the Synthetic dataset, we set input length as and prediction window as . The generation code is provided in the DILATE GitHub repository.

ECG5000: This dataset is originally a 20-hour-long ECG (electrocardiogram), downloaded from Physionet and archived in the UCR Time Series Classification Archive (Dau2019UCR). The data is split by heartbeat and processed into equal lengths (140). We use 500 series for training, 500 for validation, and 4000 for testing. We take first steps as input and predict last steps.

Traffic: The Traffic dataset is a collection of 48 months (2015-2016) of hourly road occupancy rates (between 0 and 1) from the California Department of Transportation. As Guen19dilate do, we use the univariate series of the first sensor, a total of 17544 data points. We frame the problem as forecasting future occupancy rates from historical data (the past week). We use 60% of the data for training, 20% for validation, and the rest for evaluation.
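The Traffic setup can be sketched as a chronological split followed by sliding windows. This is an illustration only: the one-week input of 168 hourly steps follows the text, while the 24-step horizon, the function names, and the placeholder series are assumptions (the forecast length is not stated in this section).

```python
from typing import List, Sequence, Tuple

Window = Tuple[Sequence[float], Sequence[float]]


def chronological_split(
    series: Sequence[float], train_frac: float = 0.6, val_frac: float = 0.2
) -> Tuple[Sequence[float], Sequence[float], Sequence[float]]:
    """Split without shuffling so that validation and test data lie in the future."""
    n = len(series)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return series[:train_end], series[train_end:val_end], series[val_end:]


def sliding_windows(
    series: Sequence[float], input_len: int, horizon: int
) -> List[Window]:
    """Build (input, target) pairs where the target directly follows the input."""
    pairs: List[Window] = []
    for start in range(len(series) - input_len - horizon + 1):
        split = start + input_len
        pairs.append((series[start:split], series[split:split + horizon]))
    return pairs


# Placeholder series with the same length as the Traffic data (17544 points).
series = [(t % 24) / 24.0 for t in range(17544)]
train, val, test = chronological_split(series)

# input_len=168 is one week of hourly data; horizon=24 is an assumed value.
pairs = sliding_windows(train, input_len=168, horizon=24)
print(len(train), len(val), len(test), len(pairs))
```

Splitting before windowing keeps every training window strictly earlier than the validation and test periods, which avoids leakage across the split boundaries.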

ETT: The ETT (Electricity Transformer Temperature) dataset, published by Zhou21informer, contains two years of data collected from two separate counties in China and includes the ETTh2 and ETTm2 datasets. Each data point has a target value, “oil temperature,” and six power load features. The ETTh2 and ETTm2 datasets have 1-hour and 15-minute intervals, respectively. As Zhou21informer do, we split them into 12/4/4 months for training/validation/testing. Detailed settings, such as the input and output lengths and hyperparameters, follow the Informer GitHub repository.

ECL: The ECL (Electricity Consuming Load) dataset records electricity consumption in kWh every 15 minutes from 2012 to 2014 for 321 clients. In our experiment, we split it into 15/3/4 months for training/validation/testing, as Zhou21informer do. Note that we use the same hyperparameter settings as for the ETTh2 dataset.

Deep Learning Model Architectures

We perform experiments with four different model architectures: Sequence-to-Sequence GRU, Informer, N-Beats, and Autoformer. To induce models to predict future time-series in a timely manner, we set and for TILDE-Q. Other training metrics, including MSE and DILATE, are used as described in their original papers. All models are trained with early stopping and the Adam optimizer.

Sequence-to-Sequence GRU To evaluate TILDE-Q with a simple model, we use a one-layer Sequence-to-Sequence GRU model. For training the GRU model, we set a learning rate of , a hidden size of 128, and a maximum of 1000 epochs with early stopping and the Adam optimizer.
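The early-stopping rule mentioned above can be sketched independently of any deep learning framework. The patience value, the `min_delta` tolerance, the class and function names, and the toy validation curve are assumptions for illustration; only the cap of 1000 epochs comes from the text.

```python
from typing import Callable


class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 10, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, val_loss: float, epoch: int) -> bool:
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def train(
    validate: Callable[[int], float], max_epochs: int = 1000, patience: int = 10
) -> EarlyStopping:
    stopper = EarlyStopping(patience=patience)
    for epoch in range(max_epochs):
        # A real loop would run one optimizer pass over the training set here.
        if stopper.step(validate(epoch), epoch):
            break
    return stopper


# Toy validation curve (hypothetical): improves until epoch 20, then degrades.
stopper = train(lambda epoch: abs(epoch - 20) + 1.0, patience=5)
print(stopper.best_epoch, stopper.best)
```

In practice the model weights from `best_epoch` would be checkpointed and restored after the loop ends.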

Informer When we train Informer on the ETTh2, ETTm2, and ECL datasets, we use the official code and hyperparameter settings. For the ECL dataset, following the authors' answer in their official code repository, we use the same hyperparameters and dataset splitting criteria as for the ETTh2 dataset.

N-Beats For N-Beats, we utilize two generic blocks with the hidden size of 128. Additionally, we set the learning rate as for all three datasets.

Autoformer For Autoformer, we use the official code and hyperparameter settings. For the ETTh2 dataset, we use the hyperparameter settings described in the official code of FEDformer.

Appendix B Additional Evaluations

B.1 Detailed Experiment Analysis

First, we observe that the model optimized with TILDE-Q outperforms the same model optimized with other objective functions in both short- and long-term forecasting tasks. An interesting point in the results is the large increase in TDI and DTW errors in long-term forecasting. For example, the TDI of Informer trained with DILATE increases dramatically on the ECL dataset as the forecasting window grows, while LCSS does not degrade to the same extent. We attribute this to the high sensitivity of DTW-based loss functions to noise. In contrast, TILDE-Q does not show such a large performance drop and even achieves better performance in long-term forecasting (e.g., Table 2, ETTh2). Additionally, Informer with TILDE-Q on the ECL data and N-Beats with TILDE-Q on all three datasets show significant improvements. This indicates that TILDE-Q succeeds in modeling shape where the other metrics do not. We provide additional qualitative results below.

Next, we present a qualitative analysis of the results. Fig. 3 shows how models trained with different metrics forecast on different datasets. From the figure, we notice that TILDE-Q allows the model to generate more robust, shape-aware forecasts, regardless of amplitude shifting, phase shifting, and uniform amplification. For example, in the case of N-Beats (Fig. 3 (b), bottom), TILDE-Q generates forecasts that are more robust and shape-aware than those of the other metrics. We also see this strength in the Informer case (Fig. 3 (b), top): even when the model does not have enough capacity to capture shape, TILDE-Q tries to retrieve it. When the model has enough capacity to capture shape (i.e., all cases except Informer on ETTh2), TILDE-Q yields noise-robust, smooth forecasts with correctly modeled temporal dynamics. In most of the N-Beats results and some of the Informer results, TILDE-Q reveals that these models can capture the temporal dynamics when given a proper loss function. In summary, these results indicate that TILDE-Q is model-agnostic, noise-robust, and able to capture shape. We provide additional qualitative results with visualizations below.

Figure 3: Qualitative results with simple sequence-to-sequence GRU model (a) and state-of-the-art model (b).

B.2 Additional Qualitative Examples

Figure 4: Qualitative results with simple sequence-to-sequence GRU model
Figure 5: Qualitative results with ETTh2 in short-term forecasting
Figure 6: Qualitative results with ETTh2 in long-term forecasting
Figure 7: Qualitative results with ETTm2 in short-term forecasting
Figure 8: Qualitative results with ETTm2 in long-term forecasting
Figure 9: Qualitative results with ECL in short-term forecasting
Figure 10: Qualitative results with ECL in long-term forecasting

B.3 Ablation Study

To evaluate the effect of the and measure the effect of each loss function, we conduct a set of experiments with the ETTh2 dataset and N-Beats on the long-term forecasting problem. As shown in Fig. 11, the model tends to produce phase- and amplification-free forecasts when decreases. These results support our motivation: “the amplitude shifting loss will return forecasts with the same standard deviation in a timely manner but without considering the proper average value,” and “the phase shifting loss will catch the dominant frequency of the target time-series, but it hardly matches its amplitude and phase.”

Furthermore, at the top of Fig. 12, we can observe three things: (1) with the amplitude shifting loss only, as intended, the forecast has a different average (-1.19 vs. 0.11) but a relatively similar standard deviation (0.408 vs. 0.299); (2) with the phase shifting loss only, the model captures the dominant frequency and produces relatively less noisy forecasts; (3) with the uniform amplification loss only, the forecast has a relatively similar average value (-1.195 vs. -0.319) but a far different standard deviation (0.408 vs. 8.592). In contrast, the forecasts of the model trained with MSE are very noisy and hard to interpret (Fig. 12, bottom). Note that we normalized the results in Fig. 12 because of the scale issue.
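The mean and standard deviation comparison in this paragraph can be reproduced on a toy signal. The sketch below is only an illustration of how a constant offset and a uniform rescaling affect these two statistics; the sine target, the offset of 1.3, and the factor of 3 are hypothetical and unrelated to the ETT data or to the TILDE-Q loss terms themselves.

```python
import math
from statistics import mean, pstdev

# Toy periodic target: four full periods of a sine wave (hypothetical data).
target = [math.sin(2 * math.pi * t / 24) for t in range(96)]

# Amplitude shifting: a constant offset changes the mean but not the spread.
shifted = [y - 1.3 for y in target]

# Uniform amplification: rescaling changes the spread by the same factor.
amplified = [3.0 * y for y in target]

for name, series in [
    ("target", target),
    ("shifted", shifted),
    ("amplified", amplified),
]:
    print(f"{name:>9}: mean={mean(series):+.3f}  std={pstdev(series):.3f}")
```

A forecast that matches the target's standard deviation but not its mean corresponds to the first case, and one that matches the mean but not the standard deviation corresponds to the second.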

Methods N-Beats
MSE DTW TDI LCSS
0.2388 ± 0.0324 13.9859 ± 0.5772 18.5824 ± 3.1545 0.7063 ± 0.0083
0.5767 ± 0.8339 17.4543 ± 0.4896 28.4352 ± 6.9420 0.6384 ± 0.0106
1.9377 ± 0.0222 16.2181 ± 1.2729 19.8596 ± 2.6131 0.6678 ± 0.0197
1.8361 ± 0.0652 16.7026 ± 0.8879 16.4266 ± 2.2085 0.6726 ± 0.0137
Amplitude only 1.5123 ± 0.0324 16.8485 ± 2.8237 29.6971 ± 5.2713 0.5740 ± 0.0175
Phase only 1.8453 ± 0.0470 13.7197 ± 0.1196 10.2519 ± 1.5417 0.6919 ± 0.0043
Amplification only 674.9155 ± 505.1760 14.7763 ± 0.6338 30.8568 ± 5.1776 0.6120 ± 0.0182
Table 5: Ablation study on with ETTh2, , and N-Beats
Figure 11: Ablation study result visualization with different on ETTm2 dataset
Figure 12: Ablation study result visualization of the three proposed loss functions on ETTm2 dataset