1 Introduction
Timeseries forecasting has been a core problem across various domains, including traffic (Li18; Lee20), economics (Zhu02), and disease propagation analysis (Matsubara14). The crucial part of timeseries forecasting is modeling complex temporal dynamics (e.g., nonstationary signals and periodicity). Temporal dynamics, intuitively shapes, have always been among the most attention-getting concepts in the timeseries domain, such as rush hours in traffic data or abnormal electricity usage (Keogh05b; Bakshi94; Weigend94; Wu21autoformer; Zhou21FEDformer). Deep learning methods are an appealing solution for modeling complex nonlinear temporal dependencies and nonstationary signals, but recent work reveals that even deep learning is often insufficient to model temporal dynamics. To properly model the temporal dynamics, Wu21autoformer; Zhou21FEDformer have proposed novel deep learning approaches with input sequence decomposition. Guen19dilate try to model sudden changes timely and accurately with dynamic time warping (DTW). Bica20 adopt domain adversarial training to learn balanced representations, i.e., treatment-invariant representations over time. However, Wu21autoformer; Zhou21FEDformer pay little attention to the essence of the problem: the shape, in other words, the temporal dynamics. Guen19dilate; Bica20 try to capture the shape but still have limitations, as in Fig. 1 (c). A shape is a pattern in timeseries data over a given time interval that can carry valuable information, such as a rise, drop, trough, peak, or plateau. We call a prediction informative when it properly reflects the shape. Timeseries forecasting models should both accurately forecast the value at each timestep and produce predictions with shapes similar to those of the original timeseries. However, existing models do not consider learning shapes (Wu21autoformer; Zhou21FEDformer; Bica20; Guen19dilate), so their forecasting results are often inaccurate and uninformative, because deep learning models tend to learn the easy way (Karras19stylegan). Fig. 1 shows three real forecasting results from the same model trained with different metrics. When we use mean squared error (MSE) as the objective, the model only aims to reduce the gap between prediction and ground truth at each timestep. As a result, the model generates relatively easy predictions regardless of temporal dynamics (Fig. 1 (b)), which give little information about the original timeseries. In contrast, if we consider both the gap and the shape of prediction and ground truth, the model can achieve both accuracy and faithful temporal dynamics, as shown in Fig. 1 (a).
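To make this contrast concrete, the following small sketch (our illustration, not one of the paper's experiments; all series and values are invented) shows how MSE can prefer a flat, uninformative forecast over a prediction that reproduces the true shape slightly out of phase:

```python
import numpy as np

# Ground truth: a sparse series with a few sharp peaks.
t = np.arange(100)
truth = np.where((t % 25) == 0, 5.0, 0.0)

# An "easy" flat prediction at the mean vs. a shape-preserving
# prediction that fires the peaks two steps late.
flat = np.full_like(truth, truth.mean())
shifted = np.roll(truth, 2)

mse_flat = np.mean((truth - flat) ** 2)
mse_shifted = np.mean((truth - shifted) ** 2)

# Under MSE, the flat forecast scores better even though it carries
# no information about the temporal dynamics.
print(mse_flat, mse_shifted)
```

A pointwise metric rewards hedging toward the mean; a shape-aware objective should instead reward the second prediction, which preserves the peaks.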
In this work, we aim to design a novel objective function that guides models to improve forecasting performance by learning the shapes in timeseries data. To design such a shape-aware loss function, we review the existing literature (Esling12; Bakshi94; Keogh03) and investigate the notions of shapes and of the distortions that interfere with measuring the similarity of two timeseries in terms of shape (Sec. 3.1, Sec. 3.2, and Sec. 3.3). Based on this investigation, we propose the conditions required of an objective function for shape-aware timeseries forecasting (Sec. 3.4). We then present a novel loss function, TILDE-Q (Transformation Invariant Loss function with Distance EQuilibrium), that enables shape-aware representation learning with three loss terms that are invariant to the distortions (Sec. 4). For evaluation, we conduct extensive experiments with state-of-the-art deep learning models for timeseries forecasting trained with TILDE-Q. The results indicate that TILDE-Q is model-agnostic and improves the accuracy of existing models compared with MSE and DILATE.
Contributions
We make the following contributions: (1) to understand shape-awareness and distortion invariances in timeseries forecasting, we investigate the existing distortions in amplitude and phase; (2) we implement TILDE-Q, which is invariant to many existing distortions and achieves shape-aware, informative forecasting in a timely manner; and (3) we show that the proposed TILDE-Q allows models to achieve, on average, higher accuracy than those trained with existing metrics, as evaluated by DTW, TDI, and LCSS.
2 Related Work
2.1 TimeSeries Forecasting
There are many methods for timeseries forecasting, from traditional ones, such as the ARIMA model (Box15; Pesaran04), to recent deep learning models. In this section, we briefly describe the recent deep learning models for timeseries forecasting. Starting with the huge success of recurrent neural networks (RNNs) (Clevert16; Li18; Yu17), researchers have developed novel deep learning architectures that improve forecasting performance. To effectively capture long-term dependencies, a weakness of RNNs, Stoller20 have proposed convolutional neural network (CNN) approaches. However, many identical CNN layers must be stacked to capture long-term dependencies (Zhou21informer). Attention-based approaches have been another popular research direction in timeseries forecasting, including Transformer (Vaswawni17) and Informer (Zhou21informer). Although attention-based models effectively capture temporal dependencies, they require high computational cost and often struggle to find the proper temporal information (Wu21autoformer). To cope with this problem, Wu21autoformer; Zhou21FEDformer utilize input decomposition, which helps models better encode the appropriate information. Other state-of-the-art models adopt neural memory networks (Kaiser17; Sukhbaatar15; Madotto18; Lee22), which refer to historical data stored in memory to generate meaningful representations.
2.2 Training Metrics
Conventionally, mean squared error (MSE), the $L_p$ norm, and their variants have been the mainstream choices to optimize forecasting models. However, they are not the best metrics for training forecasting models (Esling12) because timeseries are temporally continuous. Additionally, the $L_p$ norm gives little information about temporal correlation within timeseries data. To better model temporal dynamics, researchers have used differentiable, approximated dynamic time warping (DTW) as an alternative to MSE (Cuturi17; Abid18; Mensch18). However, using DTW as a loss function ignores the temporal localization of changes. Recently, Guen19dilate suggested DILATE, a training metric designed to catch sudden changes of nonstationary signals in a timely manner, with a smooth approximation of DTW and a penalized temporal distortion index (TDI). To guarantee timely predictions, Guen19dilate introduce a loss function that harshly penalizes predictions with high temporal distortion. However, TDI relies on the DTW path, and DTW often shows misalignment because of its sensitivity to noise and scale. Thus, DILATE often loses its advantage on complex data, showing disadvantages at the beginning of training. In this work, we discuss distortions and transformation invariances and design a new loss function that allows models to learn shapes in the data and produce noise-robust forecasting results.
3 Preliminary
In this section, we investigate common distortions without losing sight of the goals of timeseries forecasting (i.e., modeling temporal dynamics and accurate forecasting). To aid understanding, we first define notations and terms (Sec. 3.1). We then discuss common timeseries distortions, from a transformation perspective, that need to be considered for building a shape-aware loss function (Sec. 3.2), and describe how other loss functions (e.g., DTW and TDI) handle shapes during learning (Sec. 3.3). Last, we explain the conditions for effective timeseries forecasting (Sec. 3.4).
3.1 Notations and Definitions
Let $x_t \in \mathbb{R}$ denote a data point at time step $t$. Then, we can define the timeseries forecasting problem as:
Definition 1.
Given a length-$T$ historical timeseries $X = (x_{t-T+1}, \dots, x_t)$ at time $t$ and the corresponding length-$T'$ future timeseries $Y = (x_{t+1}, \dots, x_{t+T'})$, timeseries forecasting aims to learn a mapping function $f: X \rightarrow Y$.
To distinguish the label (i.e., ground truth) from the prediction, we denote the label timeseries as $Y$ and the prediction as $\hat{Y}$. Next, we set up two goals for timeseries forecasting, which require not only precise but also informative forecasting (Wu21autoformer; Zhou21FEDformer; Guen19dilate), as follows:

The mapping function $f$ should be learnt to pointwisely reduce the distance between $Y$ and $\hat{Y}$; and

The output $\hat{Y}$ should have temporal dynamics similar to those of $Y$.
Temporal dynamics are informative patterns in timeseries, such as rise, drop, trough, peak, and plateau. We define the temporal dynamics as follows:
Definition 2.
Temporal dynamics (or shapes) are the informative periodic and nonperiodic patterns in timeseries data.
In this work, we aim to design a shapeaware loss function that satisfies both goals. To this end, we first discuss distortions that two timeseries with similar shapes can have.
Definition 3.
Given two timeseries $X$ and $X'$ with a similar shape, a distortion is the difference between $X$ and $X'$.
Distortion generally occurs in different aspects: temporal distortion (i.e., warping) and amplitude distortion (i.e., scaling), with respect to the relevant dimension, time or amplitude. Distortion in the data leads to misbehavior of the model, as measurements are interrupted by the distortion. For example, if we have two timeseries $X$ and $X'$ with a similar shape but different means, $X'$ can still represent many of the temporal dynamics of $X$. However, measurements often evaluate $X$ and $X'$ as different (e.g., when measuring with MSE), which misguides the model during training. It is therefore important to have measurements that regard similar shapes as invariant to distortion. We define a measurement for a distortion as follows:
Definition 4.
Let a transformation $\Gamma$ represent a distortion. Then, we call a measurement $M$ invariant to $\Gamma$ if $M(X, \Gamma(X)) = 0$ for any timeseries $X$.
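Definition 4 can be checked numerically. The sketch below is our illustration (the helper names `mse`, `amplitude_shift`, and `mean_centered_mse` are our own): it tests whether a measurement evaluates a series and its amplitude-shifted copy as identical.

```python
import numpy as np

def mse(a, b):
    """Pointwise mean squared error between two series."""
    return np.mean((a - b) ** 2)

def amplitude_shift(x, k=3.0):
    """The transformation Gamma: Y(t) = X(t) + k."""
    return x + k

def mean_centered_mse(a, b):
    """A simple variant that removes each series' mean before comparing,
    making it invariant to amplitude shifting."""
    return mse(a - a.mean(), b - b.mean())

x = np.sin(np.linspace(0, 4 * np.pi, 64))
print(mse(x, amplitude_shift(x)))               # ~9: MSE is NOT invariant
print(mean_centered_mse(x, amplitude_shift(x))) # ~0: invariant to the shift
```

In Definition 4's terms, `mean_centered_mse` is invariant to `amplitude_shift`, while plain MSE is not.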
3.2 TimeSeries Distortions in Transformation Perspectives
Distortion, a gap between two similar timeseries, affects the capture of shapes in timeseries data. As such, it is important to investigate different distortions and their impact on representation learning. There are six common timeseries distortions that models encounter during learning (Esling12; Batista14cid; Berkhin06; Liao05; Kerr08): amplitude shifting, phase shifting, uniform amplification, uniform time scaling, dynamic amplification, and dynamic time scaling. Next, we explain each common distortion in terms of a transformation of a length-$n$ timeseries $X(t)$, where $t = 1, \dots, n$. Fig. 2 presents example distortions, categorized by the amplitude and time dimensions.

Amplitude Shifting describes how much a timeseries is shifted against another timeseries. It can be described with two timeseries and the degree of shifting $k$: $Y(t) = X(t) + k$, where $k$ is constant.

Phase Shifting is the same type of transformation (i.e., translation) as amplitude shifting, but it occurs along the temporal dimension. It can be represented with two timeseries functions and the degree of shift $k$: $Y(t) = X(t + k)$, where $k$ is constant. Cross-correlation (Paparrizos15kshape; Vlachos05) is the most popular measure that is invariant to this distortion.

Uniform Amplification is a transformation that changes the amplitude by multiplication with $k$. It can be described with two functions and a multiplication factor $k$: $Y(t) = kX(t)$.

Uniform Time Scaling uniformly shortens or lengthens the timeseries on the temporal axis. It can be represented as $Y(t) = X(kt)$, where $k > 0$ and $k \neq 1$. Although Keogh04 propose uniform time warping methods to handle this distortion, it remains one of the most difficult distortion types to measure, due to the difficulty of finding the scaling factor without testing all possible cases (Keogh03).

Dynamic Amplification is any distortion caused by nonzero multiplication on the amplitude dimension. It can be described as $Y(t) = g(t)X(t)$, with a function $g(t)$ such that $g(t) \neq 0$ for all $t$. Local amplification is a representative distortion of this type, which remains challenging to handle.

Dynamic Time Scaling is any transformation that dynamically lengthens or shortens signals on the temporal dimension, including local time scaling (Batista14cid) and occlusion (Batista14cid; Vlachos03). It can be represented as $Y(t) = X(g(t))$, where $g(t)$ is a positive, strictly increasing function. Dynamic time warping (DTW) (Bellman59dtw; Berndt94; Keogh05) is the most popular technique for this distortion. Das97 also introduce the longest common subsequence (LCSS) algorithm to tackle occlusion, noise, and outliers in this distortion.
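The six distortions above can be written down directly as transformations of a base series. The snippet below is an illustrative sketch; all factor choices ($k$, $g$) are arbitrary examples of ours:

```python
import numpy as np

n = 128
t = np.arange(n)
x = np.sin(2 * np.pi * t / 32)  # base series X(t)

amplitude_shift   = x + 2.0                              # Y(t) = X(t) + k
phase_shift       = np.roll(x, 8)                        # Y(t) = X(t + k), circular here
uniform_amplify   = 3.0 * x                              # Y(t) = kX(t)
uniform_timescale = np.sin(2 * np.pi * (2 * t) / 32)     # Y(t) = X(kt), k = 2
dynamic_amplify   = (1.0 + 0.5 * np.cos(2 * np.pi * t / n)) * x  # g(t) in [0.5, 1.5], never 0
# Strictly increasing warp g(t): samples X at sqrt-spaced positions.
warp              = np.interp(np.sqrt(t / (n - 1)) * (n - 1), t, x)
```

All six series have the same shape as $X(t)$ in the sense of Sec. 3.1, yet a pointwise metric would score each of them as a large error.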
There are several studies on shape-aware clustering (Bellman59dtw; Batista14cid; Paparrizos15kshape; Berkhin06; Liao05; Kerr08) and classification (Xi06; Batista14cid; Srisai09) tasks. On the other hand, only a few studies exist for timeseries forecasting, including Guen19dilate, which utilizes dynamic time warping (DTW) and the temporal distortion index (TDI) for modeling temporal dynamics. Next, we describe mean squared error (MSE) and DILATE, proposed by Guen19dilate, and discuss their invariance to the distortions.
3.3 Distortion Handling in Current TimeSeries Forecasting Objectives
Many measurement metrics have been used in the timeseries forecasting domain, and distance-based metrics, including the Euclidean distance, are widely used to handle timeseries data. However, such metrics are not invariant to the aforementioned distortions (Ding08; Guen19dilate) due to their pointwise mapping. Specifically, since the distance compares values per time step, it cannot handle temporal distortions appropriately and is vulnerable to scaling of the data. Guen19dilate propose a loss function, called DILATE, to overcome this inadequacy of distance metrics by recognizing temporal dynamics with DTW and TDI. In terms of transformations, DILATE handles dynamic time scaling, especially local time scaling, with DTW, and phase shifting with penalized TDI, defined as follows:
$\mathcal{L}_{\text{DILATE}}(\hat{Y}, Y) = \alpha \, \mathcal{L}_{shape}(\hat{Y}, Y) + (1 - \alpha) \, \mathcal{L}_{temporal}(\hat{Y}, Y),$
where $\mathcal{L}_{shape}$ is the smoothed DTW over the cost matrix and $\mathcal{L}_{temporal} = \langle A^*, \Omega \rangle$, with $A$, $\Delta$, and $\Omega$ being the warping path, cost matrix, and squared penalization matrix, respectively.
While DILATE shows better performance than existing methods, something is missing from an invariance point of view. Basically, DTW computes the Euclidean distance of two timeseries after temporally aligning them via dynamic programming, and the alignment relies on the distance function. Consequently, the dynamic alignment of DTW can be properly achieved only when the two timeseries have the same range (Esling12; Bellman59dtw). That means it hardly achieves invariance to amplitude distortion without appropriate preprocessing. Gong17 also show that DTW poorly matches the prediction and target (i.e., ground-truth) timeseries under amplitude shifting. Even when the target timeseries is aligned with normalization, we cannot guarantee that the predicted and target timeseries are properly aligned, due to DTW's high sensitivity to noise. As a result, DILATE can generate poor alignment results that cause wrong optimization of TDI, producing instability during optimization and incorrect results. To design an effective shape-aware loss function, we have to understand measures and when those measures have transformation invariances. In the next section, we discuss how we interpret transformations from the timeseries forecasting point of view and which types of transformations should be considered in objective function design.
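DTW's sensitivity to amplitude distortion can be reproduced with a plain dynamic-programming DTW (a textbook $O(nm)$ implementation of ours, not DILATE's smoothed variant):

```python
import numpy as np

def dtw(a, b):
    """Classic DTW: accumulated squared-difference cost under the usual
    (match, insert, delete) recurrence; returns the total alignment cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

t = np.linspace(0, 4 * np.pi, 64)
x = np.sin(t)
print(dtw(x, x))        # 0: identical series align perfectly
print(dtw(x, x + 5.0))  # large: a pure amplitude shift leaves no cheap alignment
```

Because every pointwise cost between the shifted pair is bounded away from zero, no warping path can recover the (perfect) shape match, which is exactly the amplitude-invariance gap discussed above.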
3.4 Transformation Invariances in TimeSeries Forecasting
In the timeseries domain, data often have various distortions, so measurements need to satisfy a number of transformation invariances to meaningfully model temporal dynamics. As discussed in Sec. 3.1, we set the goals of timeseries forecasting as (1) pointwisely reducing the gap between prediction and target timeseries and (2) preserving the temporal dynamics of the target timeseries. To satisfy both, we have to consider (1) that the method should not negatively impact the traditional goal of accurate timeseries forecasting and (2) the distortions that play a crucial role in capturing the temporal dynamics of the target timeseries. In this section, we review all six distortions, examining whether invariance to each is feasible in a loss function, discuss their benefits and trade-offs, and identify the distortions appropriate for consideration in timeseries forecasting.
Amplitude Shifting
In a wide range of situations, it is beneficial to capture the trends of a timeseries sequence in spite of shifts in amplitude. Thus, a loss function that is invariant to amplitude shifting brings many advantages in timeseries forecasting: (1) shape-awareness invariant to amplitude shifting, (2) accurate modeling of the deviation of values, and (3) effective on-time prediction of peaks or sudden changes. To guarantee amplitude shifting invariance in the optimization stage, the loss function should induce an equal gap between prediction and ground truth at each step. Formally speaking, a loss function $\mathcal{L}$ with consideration of amplitude shifting should satisfy:
$\mathcal{L}(Y, \hat{Y}) = 0 \iff d(y_t, \hat{y}_t) = k \quad \forall t, \qquad (1)$
where $k$ is an arbitrary, equal gap and $d$ is a signed distance with a bound. By allowing this tolerance between prediction and target timeseries, models can follow trends in the timeseries instead of tending to predict exact pointwise values. In short, unlike existing loss functions that handle only pointwise distance (e.g., DTW), we should deal with both the pointwise distances and their relational values to guarantee amplitude shifting invariance.
Phase Shifting
There are forecasting tasks whose main objective concerns the accurate forecasting of peaks and periodicity in timeseries (e.g., heartbeat data and stock price data). For such tasks, phase shifting invariance is one of the best solutions for (1) modeling periodicity regardless of translation on the temporal axis and (2) obtaining precise shape statistics, such as peak and plateau values. For a loss function to be invariant to phase shifting, it should satisfy:
$\mathcal{L}(Y, \hat{Y}) = 0 \iff \hat{Y}(t) = Y(t + k), \qquad (2)$
where $k$ is constant. Note that Eq. 2 allows the forecast to have a shape similar to the target timeseries (e.g., with the same dominant frequencies), not exactly the same shape.
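One way to see why frequency-domain magnitudes tolerate phase shifting (a property used in Sec. 4) is that a circular shift changes only the phases of the Fourier coefficients, never their magnitudes. A small numerical check of ours:

```python
import numpy as np

# A sine with exactly 4 cycles over 64 samples, and a circularly
# shifted copy (a pure phase shift in the sense of Eq. 2).
x = np.sin(2 * np.pi * np.arange(64) / 16)
shifted = np.roll(x, 5)

mag = np.abs(np.fft.rfft(x))
mag_shifted = np.abs(np.fft.rfft(shifted))

print(np.max(np.abs(mag - mag_shifted)))  # ~0: magnitudes are unchanged
print(np.argmax(mag))                     # the dominant frequency bin
```

Both series therefore share identical dominant-frequency magnitudes, so a magnitude-based comparison satisfies Eq. 2 for periodic signals.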
Uniform Amplification
This proposition is useful for sparse data that contain a significant number of zeros. By adopting uniform amplification invariance, models are able to focus on nonzero sequences, while receiving less penalty on zero sequences. Since it guarantees shape-awareness up to a multiplication factor in a timely manner, as in Fig. 2, invariance to uniform amplification fits well. For a model to be trained with uniform amplification invariance, the loss function should satisfy:
$\mathcal{L}(Y, \hat{Y}) = 0 \iff \hat{Y}(t) = kY(t), \ k \neq 0. \qquad (3)$
Uniform Time Scaling, Dynamic Amplification, and Dynamic Time Scaling
After careful consideration, we conclude that invariance to uniform time scaling, dynamic amplification, and dynamic time scaling is unsuitable for optimization. We describe the reasons below.
To achieve invariance to uniform time scaling, the loss function should satisfy:
$\mathcal{L}(Y, \hat{Y}) = 0 \iff \hat{Y}(t) = Y(kt), \ k > 0.$
This proposition would negatively influence the original temporal dynamics, considering that it tolerates mispredicted periodicity (e.g., daily periodic signals) and cannot even catch events (e.g., abruptly changing values) in a timely manner. In summary, it hinders models from capturing shape and corrupts periodic information.
For both dynamic amplification and dynamic time scaling, loss functions are always zero for all pairs when we do not limit the tolerance. For example, without a limit on the tolerance, the proposition for dynamic amplification invariance is as follows:
$\mathcal{L}(Y, \hat{Y}) = 0 \iff \hat{Y}(t) = g(t)Y(t), \ g(t) \neq 0.$
If a loss function satisfies this proposition, it is always zero, because for any prediction there exists a function $g(t) = \hat{y}_t / y_t$ satisfying the condition, except where $y_t = 0$. Therefore, it cannot give any information, because any random values could be an optimal solution. The same situation happens with dynamic time scaling if we do not limit the warping window. Consequently, uniform time scaling, dynamic amplification, and dynamic time scaling are all unsuitable as objectives in timeseries forecasting.
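The degeneracy argument can be verified directly: for any target and any arbitrary prediction (both kept away from zero here, an assumption of this illustration), a suitable $g(t)$ maps one onto the other exactly, so an unrestricted invariance condition is always satisfiable:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=32) + 10.0      # target, offset so y(t) != 0
y_hat = rng.normal(size=32) + 5.0   # an arbitrary "prediction"

# Construct g(t) = y_hat(t) / y(t); it is nonzero everywhere here,
# and g(t) * y(t) reproduces the prediction exactly.
g = y_hat / y
print(np.max(np.abs(g * y - y_hat)))  # ~0: the "loss" would be zero
```

Since this works for every prediction, a loss that is fully invariant to dynamic amplification provides no training signal at all.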
4 Methods
In this section, we describe TILDE-Q (a Transformation Invariant Loss function with Distance EQuilibrium), a novel loss function that allows models to perform shape-aware timeseries forecasting based on the three distortion invariances. To build a transformation invariant loss function, we have to design a loss function that satisfies the propositions for amplitude shifting invariance (Eq. 1), phase shifting invariance (Eq. 2), and uniform amplification invariance (Eq. 3), as discussed in Sec. 3.4. We select these because they help models capture the shape without harming the traditional goal of timeseries forecasting (i.e., minimizing the gap between prediction and target timeseries). Not only should the loss function satisfy these propositions, but it should also consider correlations between the whole sequences of outputs and ground truths rather than optimizing the model pointwisely, which is not achieved by other loss functions, such as MSE or DILATE. To handle all three distortions and whole-sequence correlations, we build three objective functions (the a.shift, phase, and amp losses), each achieving one or more invariances, by utilizing the softmax, Fourier coefficients, and autocorrelation.
Amplitude Shifting Invariance with Softmax (Amplitude Shifting)
To achieve amplitude shifting invariance, we design a loss function that satisfies Eq. 1. That is, $d(y_t, \hat{y}_t)$ needs to be the same value for all $t$. To satisfy this condition, we utilize the softmax function:
$\mathcal{L}_{a.shift}(Y, \hat{Y}) = \sum_{i=1}^{T'} \left| \frac{1}{T'} - \operatorname{Softmax}\big(d(y_i, \hat{y}_i)\big) \right|, \qquad (4)$
where $T'$, $\operatorname{Softmax}$, and $d$ are the length of the sequence, the softmax function, and a signed distance function, respectively. Because the softmax produces the proportion of each value, the loss reaches its optimal solution only when Eq. 1 is satisfied. In addition, with the softmax, there is no need to know the arbitrary equal gap $k$.
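A minimal NumPy sketch of Eq. 4, under our reading of the equation (an $L_1$ gap between the softmax of the signed distances and the uniform distribution; the paper's exact formulation may differ in details such as scaling):

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(v - v.max())
    return e / e.sum()

def l_ashift(y, y_hat):
    """Sketch of Eq. 4: the loss is minimal exactly when every timestep
    has the same signed gap d(y_t, y_hat_t), i.e. the softmax of the
    gaps is uniform."""
    d = y_hat - y  # signed distance, our choice of d
    n = len(y)
    return np.sum(np.abs(1.0 / n - softmax(d)))

y = np.sin(np.linspace(0, 2 * np.pi, 50))
print(l_ashift(y, y + 4.2))  # ~0: a constant gap (any amplitude shift)
print(l_ashift(y, y * 2.0))  # > 0: the gap varies across timesteps
```

An amplitude-shifted but shape-preserving forecast incurs (near-)zero loss, while a prediction whose gap varies over time is penalized.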
Invariances with Fourier Coefficients (Phase Shifting)
As discussed in Sec. 3.4, one candidate method for obtaining phase shifting invariance is to use Fourier coefficients. As described in prior studies (Jason07), we can reconstruct the original timeseries with only its dominant frequencies. Accordingly, we utilize the norm of the dominant Fourier coefficients of the ground truth and prediction sequences as an additional objective, achieving phase shifting invariance. For the other frequencies, we penalize the norm of the prediction sequence to reduce the magnitude of its Fourier coefficients. Consequently, this loss function also makes the model robust to noise, because the Fourier coefficients of white noise in the original timeseries are relatively small. Simply, we optimize the distance between the Fourier coefficients of the two timeseries as:
$\mathcal{L}_{phase}(Y, \hat{Y}) = \begin{cases} \| F(Y) - F(\hat{Y}) \|_p & \text{for dominant frequencies,} \\ \| F(\hat{Y}) \|_p & \text{otherwise,} \end{cases} \qquad (5)$
where $\| \cdot \|_p$ is the $L_p$ norm and $F$ denotes the Fourier transform. This loss function obtains uniform amplification invariance by applying a normalization technique to the Fourier coefficients. For example, $X(t)$ and $kX(t)$ have the same Fourier coefficients when properly normalized. In summary, with Eq. 5 we obtain (1) invariance to phase shifting, (2) invariance to uniform amplification, and (3) robustness to noise.
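A sketch of Eq. 5 in NumPy, under our reading of the equation: compare coefficient magnitudes at the target's dominant frequencies and push the prediction's remaining coefficients toward zero. The number of dominant frequencies (`n_dominant`) and the use of magnitudes are our illustrative choices:

```python
import numpy as np

def l_phase(y, y_hat, n_dominant=3, p=2):
    """Sketch of Eq. 5: L_p gap at the target's dominant frequency bins,
    plus the L_p norm of the prediction's non-dominant coefficients."""
    F_y, F_hat = np.fft.rfft(y), np.fft.rfft(y_hat)
    dom = np.argsort(np.abs(F_y))[-n_dominant:]          # dominant bins of Y
    rest = np.setdiff1d(np.arange(len(F_y)), dom)        # all other bins
    return (np.linalg.norm(np.abs(F_y[dom]) - np.abs(F_hat[dom]), p)
            + np.linalg.norm(np.abs(F_hat[rest]), p))

t = np.arange(64)
y = np.sin(2 * np.pi * t / 16)
print(l_phase(y, np.roll(y, 3)))        # ~0: phase-shifted copy of Y
print(l_phase(y, np.zeros_like(y)))     # large: dominant energy is missing
```

A phase-shifted copy of the target is (near-)lossless under Eq. 5, while a degenerate forecast that drops the dominant frequency is penalized heavily.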
Invariances with autocorrelation (Uniform Amplification)
Although Fourier coefficients can be considered a reasonable solution for catching the periodicity of the target timeseries, they are not fully invariant to phase shifting for three reasons: (1) the statistics (e.g., mean and variance) of the data keep changing, (2) such changing statistics also change the Fourier coefficients even at the same frequency, and (3) an objective based only on their norm cannot fully represent the original timeseries. Thus, we introduce an objective based on normalized cross-correlation, which satisfies Eq. 2 for a periodic signal:
$\mathcal{L}_{amp}(Y, \hat{Y}) = \| R(Y, Y) - R(Y, \hat{Y}) \|_p, \qquad (6)$
where $R(\cdot, \cdot)$ is the normalized cross-correlation function. This loss function helps predicted sequences mimic label sequences by penalizing the difference between the autocorrelation of the label sequence and the cross-correlation between the label and predicted sequences. Therefore, the label and prediction have similar temporal dynamics regardless of phase shifting and uniform amplification.
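A sketch of Eq. 6 in NumPy, assuming $R$ is the normalized circular cross-correlation over all lags (our implementation choice; the paper's $R$ may be windowed differently):

```python
import numpy as np

def ncc(a, b):
    """Normalized circular cross-correlation over all lags.
    Normalization makes the zero-lag autocorrelation equal 1."""
    a = (a - a.mean()) / (a.std() * len(a))
    b = (b - b.mean()) / b.std()
    return np.array([np.sum(a * np.roll(b, k)) for k in range(len(a))])

def l_amp(y, y_hat, p=2):
    """Sketch of Eq. 6: match the cross-correlation R(Y, Y_hat) to the
    autocorrelation R(Y, Y)."""
    return np.linalg.norm(ncc(y, y) - ncc(y, y_hat), p)

t = np.arange(64)
y = np.sin(2 * np.pi * t / 16)
print(l_amp(y, 5.0 * y))        # ~0: uniform amplification is lossless
print(l_amp(y, np.roll(y, 4)))  # > 0: a quarter-period phase shift
```

Because $R$ normalizes by the standard deviation, a uniformly amplified prediction matches the target's correlation structure exactly.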
In summary, we introduce TILDE-Q (Transformation Invariant Loss function with Distance EQuilibrium), combining Eq. 4, Eq. 5, and Eq. 6 as follows:
$\mathcal{L}_{\text{TILDE-Q}}(Y, \hat{Y}) = \alpha \, \mathcal{L}_{a.shift}(Y, \hat{Y}) + (1 - \alpha) \, \mathcal{L}_{phase}(Y, \hat{Y}) + \gamma \, \mathcal{L}_{amp}(Y, \hat{Y}), \qquad (7)$
where $\alpha \in [0, 1]$ and $\gamma$ are hyperparameters.
5 Experiments
In this section, we present the results of our comprehensive experiments, demonstrating the effectiveness of TILDE-Q and the importance of transformation invariance.
Methods  GRU + MSE  GRU + DILATE  GRU + TILDE-Q  

Eval  MSE  DTW  TDI  LCSS  MSE  DTW  TDI  LCSS  MSE  DTW  TDI  LCSS 
Synthetic  0.0107  3.5080  1.0392  0.3523  0.0130  3.4005  1.1242  0.3825  0.0119  3.2873  1.1564  0.3811 
ECG5000  0.2152  1.9718  0.8442  0.7743  0.8270  3.9579  2.0281  0.4356  0.2141  1.9575  0.7714  0.7773 
Traffic  0.0070  1.4628  0.2343  0.7209  0.0095  1.6929  0.2814  0.6806  0.0072  1.4600  0.2276  0.7220 
Experimental Setup
We conduct the experiments with three state-of-the-art models, Informer (Zhou21informer), NBeats (Oreshkin2020nbeats), and Autoformer (Wu21autoformer), and one simple sequence-to-sequence gated recurrent unit (GRU) model. We use five real-world datasets (ECG5000, Traffic, ETTh2, ETTm2, and ECL) and one synthetic dataset (Synthetic) for model training. We repeat each experiment 10 times for every combination of model, dataset, and the three objective functions. Appendix A provides detailed explanations of the datasets, hyperparameter settings, and model architectures.
Methods  Informer + MSE  Informer + DILATE  Informer + TILDE-Q  

Metric  MSE  DTW  TDI  LCSS  MSE  DTW  TDI  LCSS  MSE  DTW  TDI  LCSS  
ETTh2 
96  0.2466  6.9254  3.6676  0.4633  0.3284  6.3109  3.5838  0.5037  0.1768  5.8437  1.6734  0.5379 
192  0.2818  10.2654  11.1580  0.4254  0.4086  8.8262  7.1780  0.4893  0.2432  10.2134  9.9865  0.4317  
336  0.3089  12.1822  18.7014  0.4434  0.4164  10.3779  13.2580  0.5062  0.2958  13.5586  20.2850  0.4165  
720  0.2877  17.6369  38.4617  0.4425  0.4229  14.1196  23.9403  0.4815  0.3157  18.4617  43.3238  0.4262  
ETTm2 
96  0.0889  3.4007  1.5719  0.7386  0.1263  6.0144  2.7757  0.5129  0.0871  3.1354  1.3474  0.7817 
192  0.1157  5.7964  2.8128  0.6705  0.2340  9.7004  7.8354  0.5266  0.1317  5.7093  2.9129  0.6983  
336  0.1860  8.9971  6.7970  0.6365  0.2805  11.7889  13.3861  0.5025  0.1767  9.0866  7.4023  0.6555  
720  0.2165  14.7685  24.6694  0.5768  0.3745  16.7734  29.2783  0.4747  0.2063  15.3057  24.1959  0.5860  
ECL 
96  0.2709  2.8067  0.1720  0.7032  0.9856  3.6394  1.4794  0.6324  0.2800  2.9466  0.2473  0.7275 
192  0.2793  4.1193  0.1508  0.7060  1.1209  5.2289  2.1749  0.6053  0.3077  4.2693  0.2978  0.7336  
336  0.3203  5.9533  0.1642  0.7222  1.2331  7.8470  3.0415  0.5694  0.3271  5.8090  0.1984  0.7143  
720  0.6414  15.8561  4.4284  0.4564  1.3706  12.5981  5.6720  0.5506  0.4676  11.4027  0.7107  0.6298 
Evaluation Metrics
In the experiments, we evaluate with four metrics: mean squared error (MSE), dynamic time warping (DTW), its corresponding temporal distortion index (TDI), and the longest common subsequence (LCSS). The first three are used in Guen19dilate. As DTW is sensitive to noise and generates incorrect paths when one of the timeseries is noisy (as discussed in Sec. 3.3), we additionally use LCSS for comparison, which is more robust to outliers and noise (Esling12). The longer the matched subsequence, the better LCSS rates the modeling of shapes.
Methods  NBeats + MSE  NBeats + DILATE  NBeats + TILDE-Q  

Metric  MSE  DTW  TDI  LCSS  MSE  DTW  TDI  LCSS  MSE  DTW  TDI  LCSS  
ETTh2 
96  0.1869  7.2379  2.3787  0.4688  0.3105  6.5849  3.6490  0.4879  0.1557  5.1011  1.3240  0.5862 
192  0.2385  11.5667  4.9153  0.4505  0.6186  9.7254  7.0831  0.4637  0.1738  7.6334  2.4122  0.5819  
336  0.2889  16.5255  11.5207  0.4544  1.1406  13.7328  14.6986  0.4584  0.2132  11.3351  5.3556  0.5373  
720  0.3881  24.1570  18.8462  0.4381  1.6713  19.4392  23.7028  0.4575  0.3044  17.6006  9.6636  0.5287  
ETTm2 
96  0.0790  3.9685  2.0436  0.6721  0.1524  7.9302  5.5597  0.4379  0.0952  4.0110  2.1939  0.6902 
192  0.1224  6.8695  3.2834  0.5762  0.2055  10.0393  8.5602  0.5107  0.1286  6.3556  4.9798  0.6160  
336  0.1824  12.1438  8.5915  0.4587  0.2501  12.6342  16.1473  0.4819  0.1705  8.9377  8.3539  0.6195  
720  0.2370  22.8676  17.8458  0.4929  0.4170  17.7764  24.6877  0.5836  0.2336  14.2715  19.0883  0.7070  
ECL 
96  0.3666  3.5207  0.2989  0.6589  1.1156  5.1430  2.6613  0.5074  0.3183  2.9707  0.4844  0.7229 
192  0.4307  5.7578  0.4253  0.6212  1.1859  7.3406  2.8488  0.4973  0.3383  4.1817  0.4229  0.7187  
336  0.5199  8.5563  0.5384  0.5965  1.2460  9.5096  3.0517  0.5091  0.3831  5.6643  0.3024  0.7112  
720  0.6240  13.9436  0.6510  0.5717  1.3061  13.1928  3.7279  0.5337  0.4540  8.9997  0.3251  0.6960 
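The LCSS evaluation metric described above can be sketched as follows (a common real-valued variant of ours with a matching threshold `eps`; the evaluation used in the tables may additionally restrict matches to a temporal window):

```python
import numpy as np

def lcss(a, b, eps=0.1):
    """Longest common subsequence for real-valued series: two points match
    when they differ by less than eps. Returns the matched fraction."""
    n, m = len(a), len(b)
    L = np.zeros((n + 1, m + 1), dtype=int)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(a[i - 1] - b[j - 1]) < eps:
                L[i, j] = L[i - 1, j - 1] + 1
            else:
                L[i, j] = max(L[i - 1, j], L[i, j - 1])
    return L[n, m] / min(n, m)

x = np.sin(np.linspace(0, 4 * np.pi, 64))
noisy = x.copy(); noisy[10] += 50.0   # one large outlier

print(lcss(x, x))      # 1.0: identical series
print(lcss(x, noisy))  # a single outlier removes only one match
```

Unlike DTW, whose path (and cost) can be dominated by a single outlier, LCSS simply skips the unmatched point, which is why it is the more noise-robust shape metric here.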
Results and Analysis
Table 1 shows the short-term forecasting performance of the gated recurrent unit (GRU) optimized with the MSE, DILATE, and TILDE-Q metrics on the Synthetic, ECG5000, and Traffic datasets. On the Synthetic dataset, every metric shows its own benefits, indicating that shape-similarity and MSE measures each have a clear advantage when a model is both trained and evaluated with them. When the model is evaluated on the real-world datasets, TILDE-Q outperforms the other objective functions in most evaluation metrics. These results indicate that our approach to learning shapes in timeseries data works better than existing methods for forecasting. DILATE does not show impressive performance on ECG5000 due to its high sensitivity to noise, as discussed in Sec. 3.3.
Table 2 and Table 3 summarize the experiment results with two state-of-the-art models, Informer and NBeats. The models make predictions for both short-term ($T' = 24$) and long-term ($T'$ up to 720) horizons, so that we can investigate their performance across forecasting difficulties. On most datasets, the models with TILDE-Q outperform those with the other training metrics. Especially for long-term forecasting, we observe that NBeats and Informer with TILDE-Q significantly improve over the other metrics, which implies that TILDE-Q improves the models' ability to learn temporal dynamics, including the LCSS of NBeats (improved by over 10%). We provide detailed analysis, qualitative experiments with example visualizations, and ablation study results in Appendix B.
Methods  Autoformer + MSE  Autoformer + DILATE  Autoformer + TILDE-Q  

Metric  MSE  DTW  TDI  LCSS  MSE  DTW  TDI  LCSS  MSE  DTW  TDI  LCSS  
ETTh2 
96  0.1538  5.2227  2.1865  0.6187  0.2211  6.0453  2.5345  0.5315  0.1494  5.1060  1.9752  0.6317 
192  0.1974  7.8730  3.3382  0.6019  0.2825  8.6696  5.6671  0.5335  0.2079  7.8917  3.7532  0.5984  
336  0.2393  10.8002  7.3141  0.5954  0.3759  11.0335  13.1347  0.5257  0.2360  10.7212  7.0085  0.5971  
720  0.2859  16.3502  15.9233  0.5772  0.4296  15.9819  22.2173  0.4924  0.2378  16.0002  13.7906  0.5795  
ETTm2 
96  0.0990  4.3498  2.5052  0.6756  0.1135  5.3097  2.2211  0.5936  0.0940  3.9078  2.2587  0.7075 
192  0.1340  6.3207  3.3676  0.6512  0.1854  8.5209  3.7894  0.5506  0.1259  6.0979  2.9278  0.6810  
336  0.1587  9.4374  6.9205  0.6036  0.2001  12.0265  8.8305  0.5370  0.1548  9.5223  7.2875  0.6169  
720  0.1999  14.8332  11.9655  0.6064  0.2665  17.8025  17.4114  0.5001  0.1885  14.5844  9.9918  0.6277  
ECL 
96  0.4209  3.5957  0.2461  0.6487  0.6813  3.6490  0.4780  0.6253  0.3515  3.2173  0.2298  0.6912 
192  0.4206  4.9924  0.3416  0.6574  0.7319  5.5324  0.2775  0.6118  0.4032  4.8581  0.3301  0.6680  
336  0.4621  6.6888  0.2795  0.6535  0.7895  7.5665  0.2503  0.6091  0.4637  6.7335  0.3923  0.6429  
720  0.5005  10.8571  0.2383  0.6183  0.8630  12.1416  0.1877  0.6074  0.5049  9.8492  0.2525  0.6420 
6 Conclusion and Future Work
We propose TILDE-Q, a transformation invariant loss function with distance equilibrium, which enables shape-aware timeseries forecasting in a timely manner. To design TILDE-Q, we review existing transformations in timeseries data and discuss the conditions that ensure transformation invariances during optimization. TILDE-Q ensures that a model is invariant to amplitude shifting, phase shifting, and uniform amplification, so that the model better captures the shapes in timeseries data. To prove the effectiveness of TILDE-Q, we conduct comprehensive experiments with state-of-the-art models and real-world datasets. The results indicate that models trained with TILDE-Q generate more timely, robust, accurate, and shape-aware forecasts in both short-term and long-term forecasting tasks. We conjecture that this work can facilitate future research on transformation invariances and shape-aware forecasting.
References
Appendix A Detailed Experiment Setup
Dataset
In our experiment, we utilize six datasets – Synthetic, ECG5000, and Traffic for the simple model (i.e., Sequence-to-Sequence Gated Recurrent Unit) and ETTh2, ETTm2, and Electricity for the state-of-the-art models (i.e., Informer and N-BEATS). For each dataset, we describe its metadata and the experimental setting, including the input length and the prediction window.
Synthetic: As Guen19dilate describe, the Synthetic dataset is an artificial dataset for measuring model performance on sudden changes (step functions), with an input signal composed of two peaks. The amplitude and temporal position of the two peaks are randomly selected, and they in turn determine the position and amplitude of the step. We use 500 time-series for training, 500 for validation, and 500 for testing. For the Synthetic dataset, we set the input length as and the prediction window as . The generation code is provided in the DILATE GitHub repository (https://github.com/vincentleguen/DILATE).
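The exact generation procedure lives in the DILATE repository; the following is only a rough sketch of the idea (the constants and the peak-to-step mapping here are our assumptions, not the published recipe):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_synthetic(length=40, horizon=20):
    """Toy version of the two-peak / step-change signal:
    the input contains two random peaks, and the target is a
    step whose position and amplitude depend on those peaks.
    Illustrative only; see the DILATE repo for the real code."""
    x = np.zeros(length + horizon)
    p1, p2 = sorted(rng.integers(5, length - 5, size=2))
    a1, a2 = rng.uniform(0.5, 1.0, size=2)
    x[p1], x[p2] = a1, a2                    # two input peaks
    step_pos = length + (p2 - p1) % horizon  # step position from the peaks
    x[step_pos:] = a1 * a2                   # step amplitude from the peaks
    return x[:length], x[length:]

hist, future = make_synthetic()
```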
ECG5000: This dataset is originally a 20-hour-long ECG (electrocardiogram) recording, downloaded from PhysioNet (https://physionet.org/) and archived in the UCR Time Series Classification Archive (Dau2019UCR). The data is split by heartbeat and processed into sequences of equal length (140). We use 500 sequences for training, 500 for validation, and 4000 for testing. We take the first steps as input and predict the last steps.
Traffic: The Traffic dataset is a collection of 48 months (2015–2016) of hourly road occupancy rates (between 0 and 1) from the California Department of Transportation (http://pems.dot.ca.gov). As Guen19dilate do, we utilize the univariate series of the first sensor, a total of 17,544 data points. We formulate the problem as forecasting future occupancy rates from historical data (the past week). We use 60% of the data for training, 20% for validation, and the rest for evaluation.
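The 60/20/20 split above is chronological rather than shuffled, so validation and test data always come after the training period. A small sketch of such a split (the helper name `chrono_split` is ours):

```python
import numpy as np

def chrono_split(series, train=0.6, val=0.2):
    """Chronological 60/20/20 split -- no shuffling, so the
    validation and test sets always follow the training data."""
    n = len(series)
    i, j = int(n * train), int(n * (train + val))
    return series[:i], series[i:j], series[j:]

data = np.arange(17544)  # e.g., the hourly points of the Traffic series
tr, va, te = chrono_split(data)
print(len(tr), len(va), len(te))  # → 10526 3509 3509
```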
ETT: The ETT (Electricity Transformer Temperature) dataset, published by Zhou21informer, contains two years of data collected from two separate counties in China, including the ETTh2 and ETTm2 datasets. Each data point has a target value, the “oil temperature”, and six power-load features. The ETTh2 and ETTm2 datasets have 1-hour and 15-minute intervals, respectively. As Zhou21informer do, we split them into 12/4/4 months for training/validation/testing. Detailed settings, such as the input and output lengths and hyperparameters, follow the Informer GitHub repository (https://github.com/zhouhaoyi/Informer2020).
ECL: The ECL (Electricity Consuming Load) dataset records electricity consumption in kWh every 15 minutes from 2012 to 2014 for 321 clients. In our experiment, we split it into 15/3/4 months for train/validation/test, as Zhou21informer do. Note that we use the same hyperparameter settings as for the ETTh2 dataset.
Deep Learning Model Architectures
We perform experiments with three model architectures: Sequence-to-Sequence GRU, Informer, and N-BEATS. To induce models to predict future time-series in a timely manner, we set and for . Other training metrics, including MSE and DILATE, are used as described in their original papers. All models are trained with early stopping and the Adam optimizer.
Sequence-to-Sequence GRU To evaluate the proposed loss on a simple model, we utilize a one-layer Sequence-to-Sequence GRU. We set the learning rate to , the hidden size to 128, and train for at most 1000 epochs with early stopping and the Adam optimizer.
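The early-stopping criterion can be sketched as a small patience counter; this is a generic illustration of the technique, not the exact implementation used in our experiments (the class name and patience value are assumptions):

```python
class EarlyStopping:
    """Minimal early-stopping helper: stop when the validation
    loss has not improved for `patience` consecutive epochs."""
    def __init__(self, patience=10):
        self.patience, self.best, self.bad = patience, float("inf"), 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best:
            self.best, self.bad = val_loss, 0
        else:
            self.bad += 1
        return self.bad >= self.patience

stopper = EarlyStopping(patience=3)
losses = [1.0, 0.8, 0.9, 0.95, 0.97]  # validation loss plateaus after 0.8
stops = [stopper.step(l) for l in losses]
print(stops)  # → [False, False, False, False, True]
```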
Informer When training Informer on the ETTh2, ETTm2, and ECL datasets, we utilize the official code and hyperparameter settings. For the ECL dataset, as the authors answered in their official repository, we use the same hyperparameters and dataset-splitting criteria as for the ETTh2 dataset.
N-BEATS For N-BEATS, we utilize two generic blocks with a hidden size of 128. Additionally, we set the learning rate as for all three datasets.
Autoformer For Autoformer (https://github.com/thuml/Autoformer), we use the official code and hyperparameter settings. For the ETTh2 dataset, we utilize the hyperparameter settings described in the official code of FEDformer (https://github.com/MAZiqing/FEDformer).
Appendix B Additional Evaluations
B.1 Detailed Experiment Analysis
First, we observe that the model optimized with the proposed loss outperforms the same model optimized with other objective functions in both short- and long-term forecasting tasks. An interesting point in the results is the sharply increased TDI and DTW errors in long-term forecasting. For example, the TDI of Informer trained with DILATE increases dramatically on the ECL dataset as the forecasting window grows, while LCSS shows no such increase. We attribute this to a weakness of DTW-based loss functions: their high sensitivity to noise. In contrast, the proposed loss shows no such performance drop and even achieves better performance in long-term forecasting (e.g., Table 2, ETTh2). Additionally, Informer with the proposed loss on the ECL data and N-BEATS with the proposed loss on all three datasets show significant improvements. This indicates that the proposed loss succeeds in modeling shape, while the other metrics do not. We provide additional qualitative results below.
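The noise sensitivity of DTW-based objectives is easy to reproduce: every pointwise deviation contributes to the optimal alignment cost, so additive noise inflates the distance even when the underlying shape is unchanged. A minimal sketch with a textbook DTW implementation (illustrative; not the DILATE loss itself):

```python
import numpy as np

def dtw(a, b):
    """Classic O(n*m) dynamic time warping distance
    with absolute difference as the local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 200)
clean = np.sin(t)
noisy = clean + rng.normal(0, 0.3, size=t.size)

# The noisy series has the same underlying shape, yet its DTW
# distance to the clean one grows with the noise level.
print(dtw(clean, clean))                      # → 0.0
print(dtw(clean, noisy) > dtw(clean, clean))  # → True
```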
Next, we present a qualitative analysis of the results. Fig. 3 shows how models trained with different metrics forecast on different datasets. From the figure, we notice that the proposed loss allows the model to generate more robust, shape-aware forecasts, regardless of amplitude shifting, phase shifting, and uniform amplification. For example, in the case of N-BEATS (Fig. 3 (b), bottom), the proposed loss yields forecasts that are more robust and shape-aware than those of the other metrics. We also see this strength in the Informer case (Fig. 3 (b), top): even when the model lacks the capacity to capture shape, the proposed loss tries to recover it. When the model does have enough capacity (i.e., except Informer on ETTh2), the proposed loss produces noise-robust, smooth forecasts with correctly modeled temporal dynamics. Most of the N-BEATS results and some of the Informer results reveal that these models can capture temporal dynamics when given a proper loss function. In summary, the proposed loss is model-agnostic, noise-robust, and able to capture shape. We provide additional qualitative results with visualizations below.
B.2 Additional Qualitative Examples
B.3 Ablation Study
To evaluate the effect of the proposed loss and measure the contribution of each component, we conduct a set of experiments with the ETTh2 dataset and N-BEATS on the long-term forecasting problem. As shown in Fig. 11, the model tends to produce phase- and amplification-free forecasts as decreases. These results support our motivations: “ will return forecasting results with the same standard deviation in a timely manner but without consideration of the proper average value,” and “ will catch the dominant frequency of the target time-series but hardly matches its amplitude and phase.” Furthermore, at the top of Fig. 12, we observe three things: (1) if we utilize only, as intended, the forecast has a different average (1.19 vs. 0.11) but a relatively similar standard deviation (0.408 vs. 0.299); (2) with only, the model captures the dominant frequency and produces relatively less noisy forecasts; (3) yields a relatively similar average value (1.195 vs. 0.319) but a far different standard deviation (0.408 vs. 8.592). In contrast, the forecasts of the model trained with MSE are very noisy and hard to interpret (Fig. 12, bottom). Note that we normalized the results in Fig. 12 because of scale issues.
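The ablation observations (matched standard deviation or dominant frequency, mismatched mean) can be illustrated with a toy pair of series; the `dominant_freq` helper below is our own illustration, not a component of the actual loss:

```python
import numpy as np

t = np.linspace(0, 1, 256, endpoint=False)
target = 1.0 + 0.5 * np.sin(2 * np.pi * 8 * t)    # mean 1.0, 8 cycles
forecast = 0.1 + 0.5 * np.sin(2 * np.pi * 8 * t)  # wrong mean, right shape

def dominant_freq(x):
    """Index of the strongest frequency bin after mean removal."""
    spec = np.abs(np.fft.rfft(x - x.mean()))
    return int(np.argmax(spec))

# A mean-insensitive component can match the standard deviation and
# dominant frequency while leaving the average value unconstrained:
print(dominant_freq(target), dominant_freq(forecast))       # → 8 8
print(round(target.std(), 3) == round(forecast.std(), 3))   # → True
print(abs(target.mean() - forecast.mean()))                 # ≈ 0.9
```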
Methods (N-BEATS)

Eval                MSE                   DTW                TDI                LCSS
                    0.2388 ± 0.0324       13.9859 ± 0.5772   18.5824 ± 3.1545   0.7063 ± 0.0083
                    0.5767 ± 0.8339       17.4543 ± 0.4896   28.4352 ± 6.9420   0.6384 ± 0.0106
                    1.9377 ± 0.0222       16.2181 ± 1.2729   19.8596 ± 2.6131   0.6678 ± 0.0197
                    1.8361 ± 0.0652       16.7026 ± 0.8879   16.4266 ± 2.2085   0.6726 ± 0.0137
Amplitude only      1.5123 ± 0.0324       16.8485 ± 2.8237   29.6971 ± 5.2713   0.5740 ± 0.0175
Phase only          1.8453 ± 0.0470       13.7197 ± 0.1196   10.2519 ± 1.5417   0.6919 ± 0.0043
Amplification only  674.9155 ± 505.1760   14.7763 ± 0.6338   30.8568 ± 5.1776   0.6120 ± 0.0182