1 Introduction
Forecasting is one of the most fundamental problems in time series and spatiotemporal data analysis, with broad applications in energy, finance, and transportation. Deep learning models [li2019enhancing, salinas2020deepar, rasul2021autoregressive] have emerged as state-of-the-art approaches for forecasting rich time series and spatiotemporal data. In several forecasting competitions, such as the M5 forecasting competition [makridakis2020m5], the Argoverse motion forecasting challenge [chang2019argoverse], and the IARAI Traffic4cast contest [kreil2020surprising], almost all the winning solutions are based on deep neural networks.
Despite this encouraging progress, we discover that the forecasting performance of deep learning models exhibits long-tail behavior: a significant fraction of samples is very difficult to forecast. Existing works often measure forecasting performance by averaging across test samples. However, such average performance, measured by root mean square error (RMSE) or mean absolute error (MAE), can be misleading. A low RMSE or MAE may indicate good average performance, but it does not prevent the model from behaving disastrously in difficult scenarios.
From a practical perspective, long-tail behavior in forecasting performance can be alarming. Figure 1 visualizes examples of long-tail behavior for a motion forecasting task. In motion forecasting, the long tail corresponds to rare driving events such as turning maneuvers and sudden stops. Failure to forecast accurately in these scenarios would pose serious safety risks in route planning. In electricity forecasting, the tail behavior occurs during short circuits, power outages, grid failures, or sudden behavior changes. Merely focusing on average performance would ignore these electric load anomalies, significantly increasing maintenance and operational costs.
Long-tailed learning is an area heavily studied in classification settings, focusing on class imbalance. We refer readers to Table 2 in [menon2020long] and the survey paper by [zhang2021deep] for a complete review. The most common approaches to the long-tail problem include post-hoc normalization, data resampling, loss engineering, and learning class-agnostic representations. However, long-tail learning methods for classification do not translate directly to forecasting, as we do not have predefined classes. A recent work by [makansi2021exposing] proposes to use a Kalman filter to gauge the difficulty of different forecasting examples, but such difficulty estimates may not directly relate to the deep neural networks used for the actual forecasting task.
In this paper, we address the long-tail behavior of prediction error in deep probabilistic forecasting. We present two moment-based loss modifications: Kurtosis loss and Pareto loss. Kurtosis is a well-studied symmetric measure of tailedness, defined as the scaled fourth moment of a distribution. Pareto loss uses the Generalized Pareto Distribution (GPD) to fit the long-tailed error distribution and can also be described as a weighted summation of shifted moments. We investigate these tailedness measures as regularization and loss-weighting approaches for probabilistic forecasting tasks. We demonstrate significantly improved tail performance compared to the base model and the baselines, while achieving better average performance in most settings.
In summary, our contributions include:

We discover long-tail behavior in the forecasting performance of deep probabilistic models.

We investigate principled approaches to address long-tail behavior and propose two novel methods: Pareto loss and Kurtosis loss.

We significantly improve the tail performance on four forecasting tasks including two time series and two spatiotemporal trajectory forecasting datasets.
2 Related work
Deep probabilistic forecasting. There is a flurry of work on using deep neural networks for probabilistic forecasting. For time series forecasting, a common practice is to combine classic time series models with deep learning, resulting in DeepAR [salinas2020deepar], Deep State Space [rangapuram2018deep], Deep Factors [wang2019deep], and the normalizing Kalman filter [de2020normalizing]. Others introduce normalizing flows [rasul2020multivariate], denoising diffusion [rasul2021autoregressive], and particle filters [pal2021rnn] to deep learning. For trajectory forecasting, the majority of works focus on deterministic prediction. A few recent works propose to approximate the conditional distribution of future trajectories given the past with explicit parameterization [mfp, luo2020probabilistic], CVAEs [CVAE, desire, trajectron++], or implicit models such as GANs [socialgan, liu2019naomi]. Nevertheless, most existing works focus on average performance; the long-tail issue is largely overlooked in the community.
Long-tailed learning.
The main efforts to address the long-tail issue in learning revolve around reweighting, resampling, loss function engineering, and two-stage training, but mostly for classification. Rebalancing during training comes in the form of synthetic minority oversampling [chawla2002smote], oversampling with adversarial examples [Kozerawski_2020_ACCV], inverse class frequency balancing [liu2019large], balancing using the effective number of samples [cui2019class], or balance-oriented mixup augmentation [xu2021towards]. Another direction involves post-processing, either as normalized calibration [pan2021model] or logit adjustment [menon2020long]. An important direction is loss modification approaches such as Focal Loss [lin2017focal], Shrinkage Loss [lu2018deep], and Balanced Meta-Softmax [ren2020balanced]. Others utilize two-stage training [liu2019large, cao2019learning] or separate expert networks [zhou2020bbn, li2020overcoming, wang2020long]. We refer the readers to [zhang2021deep] for an extensive survey. [tang2020long] indicated that SGD momentum can aggravate the long-tail problem and suggested deconfounded training to mitigate its effects. [feldman2020does, feldman2020neural] performed theoretical analysis and suggested that label memorization in long-tail distributions is necessary for the network to generalize. A few methods were developed for imbalanced regression. Many approaches revolve around modifications of SMOTE, such as its adaptation to regression SMOTER [torgo2013smote], its augmentation with Gaussian noise SMOGN [branco2017smogn], or the work of [ribeiro2020imbalanced] extending it to the prediction of extremely rare values. [steininger2021density] proposed DenseWeight, a method based on kernel density estimation for better assessment of the relevance function for sample reweighting.
[yang2021delving] proposed distribution smoothing over the label (LDS) and feature (FDS) spaces for imbalanced regression. A concurrent work is [makansi2021exposing], where the authors noticed the long-tail error distribution in trajectory prediction. They used Kalman filter [kalman1960new] performance as a difficulty measure and utilized contrastive learning to alleviate the tail problem. However, the tail of the Kalman filter may differ from that of deep learning models, which we elaborate on in later sections.
3 Methodology
We first identify the long-tail phenomenon in probabilistic forecasting. Then, we propose two related strategies, based on Pareto loss and Kurtosis loss, to mitigate the tail issue.
3.1 Long tail in probabilistic forecasting
Given input x_{t-T+1:t} and output x_{t+1:t+H} respectively, the probabilistic forecasting task aims to predict the conditional distribution of future states given current and past observations:
(1) p(x_{t+1}, \ldots, x_{t+H} \mid x_{t-T+1}, \ldots, x_{t})
where T is the length of the history and H is the prediction horizon. We denote the maximum-likelihood prediction of the probabilistic forecasting model as \hat{x}_{t+1:t+H}.
Long-tail distributions of data can be seen in numerous real-world datasets. This is evident for the four benchmark forecasting datasets (electricity [Dua:2019], traffic [Dua:2019], ETH-UCY [pellegrini2009you, lerner2007crowds], and nuScenes [caesar2020nuscenes]) studied in this work. The distributions of ground truth values for all of them are shown in Figure 2. We use log-log plots to increase the visibility of the long-tail behavior present in the data: smaller values occur very frequently, while larger values are very rare (creating the tail). In addition to the long-tail data distribution, we also identify a long-tail distribution of forecasting error from deep learning models (such as DeepAR [salinas2020deepar], Trajectron++ [salzmann2020trajectron++], and Trajectron++EWTA [makansi2019overcoming]), as seen in Appendix G.
We hypothesize that long-tail behavior in the forecasting error distribution originates from long-tail behavior in the data distribution, as well as from the nature of gradient-based deep learning. Therefore, modifying the loss function to account for the shape of the distribution can potentially lead to better tail performance. Next, we present two loss functions based on moments of the error distribution.
3.2 Pareto Loss
Long-tail distributions naturally lend themselves to analysis using Extreme Value Theory (EVT). [mcneil1997estimating] shows that long-tail behavior can be modeled with a generalized Pareto distribution (GPD). The probability density function (pdf) of the GPD is
(2) f(x) = \frac{1}{\sigma}\left(1 + \frac{\xi (x - \mu)}{\sigma}\right)^{-\frac{1}{\xi} - 1}
where the parameters are location (\mu), scale (\sigma), and shape (\xi). The pdf of the GPD is defined for x \ge \mu when \xi \ge 0 and for \mu \le x \le \mu - \sigma/\xi when \xi < 0. \mu can be set to 0 without loss of generality, as it represents a translation along the x-axis. We can drop the scaling term 1/\sigma, as the pdf will be scaled by a hyperparameter. The simplified pdf is
(3) g(x) = \left(1 + \frac{\xi x}{\sigma}\right)^{-\frac{1}{\xi} - 1}
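As a concrete illustration, the simplified pdf in Eq. (3) can be computed directly. The sketch below is a minimal illustration (not the authors' code; the parameter values are arbitrary) showing that for a right-tailed shape ξ > 0 the density starts at 1 and decays polynomially, i.e., is heavy-tailed.

```python
def gpd_pdf_simplified(x, sigma, xi):
    """Simplified (unnormalized) GPD pdf from Eq. (3): g(x) = (1 + xi*x/sigma)^(-1/xi - 1)."""
    assert x >= 0.0, "the simplified pdf is defined for non-negative inputs when xi >= 0"
    return (1.0 + xi * x / sigma) ** (-1.0 / xi - 1.0)

# For xi > 0 the density starts at g(0) = 1 and decays polynomially in x:
vals = [gpd_pdf_simplified(x, sigma=1.0, xi=0.5) for x in (0.0, 1.0, 10.0)]
```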
The high-level idea of Pareto loss is to fit a GPD to the loss distribution in order to reprioritize the learning of easy and difficult (tail) examples. Let the loss function used by a given machine learning model be denoted as L. In probabilistic forecasting, a commonly used loss is the Negative Log Likelihood (NLL), computed for a training example and the corresponding model prediction. As the pdf in Eq. (3) only allows non-negative input, the loss has to be lower-bounded. We therefore propose to use an auxiliary loss L_aux to fit the GPD. For NLL, which can be unbounded for continuous distributions, the auxiliary loss can simply be the Mean Absolute Error (MAE). There are two main classes of methods for modifying loss functions to improve tail performance: regularization [ren2020balanced, makansi2021exposing] and reweighting [lin2017focal, lu2018deep, yang2021delving]. The two classes are characterized by different behavior on tail data [ren2020balanced]. Inspired by these, we propose two variations of the Pareto loss using the distribution fitted on L_aux: Pareto Loss Margin (PLM) and Pareto Loss Weighted (PLW).
PLM is based on the principles of margin-based regularization [ren2020balanced, liu2016large], which assigns larger penalties (margins) to harder examples. For a given hyperparameter \lambda, PLM is defined as,
(4) 
where
(5) 
which uses the GPD to calculate the additive margin.
An alternative is to reweigh the loss terms using the loss distribution. For a given hyperparameter \lambda, PLW is defined as,
(6) 
where
(7) 
which uses the GPD to reweigh the loss of each sample.
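Eqs. (4)–(7) give the exact margin and weight functions. As a schematic sketch only, the two variants can be organized as below; the specific margin 1 − g(L_aux) and the bounded weight 1 + γ(1 − g(L_aux)) are illustrative assumptions chosen to match the described behavior (larger penalties for harder examples, saturation deep in the tail), not the paper's exact definitions.

```python
def gpd_pdf(x, sigma=1.0, xi=0.5):
    # Simplified GPD pdf from Eq. (3): close to 1 for easy (low-loss) samples,
    # close to 0 deep in the tail.
    return (1.0 + xi * x / sigma) ** (-1.0 / xi - 1.0)

def pareto_loss_margin(base_loss, aux_loss, lam=0.1):
    # PLM (schematic): additive margin that grows with difficulty and saturates
    # in the tail; the form 1 - g(aux_loss) is an illustrative assumption.
    return base_loss + lam * (1.0 - gpd_pdf(aux_loss))

def pareto_loss_weighted(base_loss, aux_loss, gamma=1.0):
    # PLW (schematic): multiplicative weight bounded in [1, 1 + gamma];
    # again an illustrative assumption, not the exact Eq. (6)-(7) form.
    return (1.0 + gamma * (1.0 - gpd_pdf(aux_loss))) * base_loss
```

The additive margin leaves the base-loss gradient unscaled, while the multiplicative weight amplifies it; this mirrors the regularization-versus-reweighting distinction discussed in Sec. 4.4.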
3.3 Kurtosis Loss
Kurtosis measures the tailedness of a distribution as the scaled fourth moment about the mean. To increase the emphasis on tail examples, we use this measure to propose the kurtosis loss. For a given hyperparameter \lambda, and using the same notation as Sec. 3.2, the kurtosis loss is defined as,
(8) 
where k_i is the contribution of example i to the kurtosis of a batch:
(9) k_i = \left(\frac{L_{aux,i} - \mu_B}{\sigma_B}\right)^4
where \mu_B and \sigma_B are the mean and standard deviation of the auxiliary loss (L_aux) values over the batch of examples. We propose to use the auxiliary loss distribution to compute kurtosis, as performance metrics in forecasting tasks frequently involve versions of the L1 or L2 distance, such as RMSE, MAE, or ADE. The goal is to decrease the long tail for these metrics, which might not correlate well with the base loss L. The example in Sec. 3.2, where L is the NLL loss and L_aux is the MAE loss, illustrates this requirement well.
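Concretely, the per-example contribution in Eq. (9) can be computed from a batch of auxiliary-loss values. The sketch below reflects our reading of Eqs. (8)–(9), adding the λ-scaled contribution to the base loss; treat it as an illustration rather than the reference implementation.

```python
import statistics

def kurtosis_contributions(aux_losses):
    """Per-example contribution to batch kurtosis, Eq. (9): ((l_i - mu_B) / sigma_B)^4."""
    mu = statistics.fmean(aux_losses)
    sigma = statistics.pstdev(aux_losses)
    return [((l - mu) / sigma) ** 4 for l in aux_losses]

def kurtosis_loss(base_losses, aux_losses, lam=0.01):
    # Base loss plus lambda-scaled kurtosis contribution, so far-tail examples
    # (large deviation from the batch mean) receive a biquadratic penalty.
    return [b + lam * k for b, k in zip(base_losses, kurtosis_contributions(aux_losses))]
```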
Kurtosis loss and Pareto loss are related approaches to handling long-tail behavior: Pareto loss is a weighted sum of moments about the origin, while kurtosis loss is the fourth moment about the mean. The Taylor expansion of the GPD pdf from Eq. (3) is,
(10) 
For an appropriate range of the shape parameter \xi, the coefficients are positive for even moments and negative for odd moments. Even moments are always symmetric and positive, while odd moments are positive only for right-tailed distributions. Since we use the negative of the pdf, this yields an asymmetric measure of the right-tailedness of a value in the distribution.
Kurtosis loss uses the fourth moment about the distribution mean. This is a symmetric and positive measure, but in the context of right-tailed distributions, kurtosis serves as a good measure of the long-tailedness of the distribution. The GPD and kurtosis are visualized in Appendix F.
4 Experiments
We evaluate our methods on two probabilistic forecasting tasks: time series forecasting and trajectory prediction.
4.1 Setup
Datasets.
For time series forecasting, we use the electricity and traffic datasets from the UCI ML repository [Dua:2019], used as benchmarks in [salinas2020deepar]. We also generate three synthetic 1D time series datasets, Sine, Gaussian, and Pareto, to further our understanding of long-tail behavior.
For trajectory prediction, we use two benchmark datasets: the pedestrian trajectory dataset ETH-UCY (a combination of the ETH [pellegrini2009you] and UCY [lerner2007crowds] datasets) and the vehicle trajectory dataset nuScenes [caesar2020nuscenes]. Details regarding the datasets are available in Appendix A.
Baselines.
We compare with the following baselines, representing the SoTA in long-tail mitigation for different tasks:


Contrastive Loss: [makansi2021exposing] uses contrastive loss as a regularizer to group examples together based on Kalman filter prediction errors.

Focal Loss: [lin2017focal] uses L1 loss to reweigh loss terms.

Shrinkage Loss: [lu2018deep] uses a sigmoid-based function to reweigh loss terms.

Label Distribution Smoothing (LDS): [yang2021delving] uses a symmetric kernel to smooth the label distribution and uses its inverse to reweigh loss terms.
Focal Loss, Shrinkage Loss, and LDS were originally proposed for classification and/or regression, and required adaptation to be applicable to the forecasting task. For details on baseline adaptation, please see Appendix B.
Evaluation Metrics.
We use two common metrics for the evaluation of trajectory prediction models: Average Displacement Error (ADE), the average L2 distance between the predicted trajectory and the ground truth over all timesteps, and Final Displacement Error (FDE), the L2 distance at the final timestep. For time series forecasting, we use Normalized Deviation (ND) and Normalized Root Mean Squared Error (NRMSE).
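The two displacement metrics can be written down directly; the following is a minimal sketch, assuming trajectories are given as equal-length sequences of (x, y) points.

```python
import math

def ade(pred, gt):
    """Average Displacement Error: mean L2 distance over all timesteps."""
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists)

def fde(pred, gt):
    """Final Displacement Error: L2 distance at the final timestep."""
    return math.dist(pred[-1], gt[-1])
```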
Apart from the above-mentioned average performance metrics, we introduce metrics to capture performance on the tail. To measure performance at the tail of the distribution, we propose to adapt the Value-at-Risk (VaR, Eq. (11)) metric:
(11) \mathrm{VaR}_{\alpha}(E) = \inf\{ e : P(E > e) \le 1 - \alpha \}
VaR at level \alpha is the smallest error e such that the probability of observing an error larger than e is at most 1 - \alpha, where E is the error distribution. This evaluates to the \alpha-quantile of the error distribution. We measure VaR at three increasingly extreme levels. In addition, we use skew, kurtosis, and max error to further assess tail performance. Skew and kurtosis as metrics are meaningful only when examined in conjunction with the mean: a distribution with a higher mean but lower skew and kurtosis does not necessarily have a less severe tail.
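On a finite test set, VaR in Eq. (11) reduces to an empirical quantile of the observed errors. A minimal sketch (illustrative; not the exact evaluation code behind the tables):

```python
import math

def value_at_risk(errors, alpha):
    """Empirical VaR at level alpha: the smallest observed error e such that
    P(error > e) <= 1 - alpha, i.e. the alpha-quantile of the error sample."""
    s = sorted(errors)
    # ceiling rule so the tail-probability guarantee holds exactly on the sample
    idx = max(0, math.ceil(alpha * len(s)) - 1)
    return s[idx]
```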
4.2 Synthetic Dataset Experiments
To better understand the long-tail error distribution, we perform experiments on three synthetic datasets. The task is to forecast 8 steps ahead given a history of 8 time steps. We use AutoRegression (AR) and DeepAR [salinas2020deepar] as the forecasting models. The top row in Figure 3 shows that, among the datasets, only Gaussian and Pareto show tail behavior in the data distribution; the Pareto dataset in particular is the only one to display long-tail behavior. AR and DeepAR have different error distributions across the datasets. Based on these results, we make the following hypotheses about the sources of long-tailedness.
Source 1: Long Tail in Data. The data distributions for the Gaussian and Pareto datasets have tail behavior similar to the error distributions of both models, AR and DeepAR. This indicates that a long tail in the data is a likely cause of long-tail behavior in the error. This connection is also well established as class imbalance in classification tasks [van2018inaturalist, liu2019large].
Source 2: Deep Learning Model. The results on the Sine dataset illustrate that, even in the absence of a long tail in the data, we can have a long tail in the error distribution. The AR model, however, does not show long-tail behavior in its error. This indicates that the observed long-tail behavior in the error of DeepAR is model-induced. We hypothesize that it is caused by DeepAR overfitting to simpler examples due to the nature of gradient-based learning. Further results and analysis on these datasets can be found in Appendix H.
The difference between the AR and DeepAR error distributions also suggests that assuming tail overlap between deep learning and non-deep-learning methods (such as the Kalman filter used by [makansi2021exposing]) might not generalize well.
Method  Metric  Mean  VaR (three levels)  Max  Kurtosis  Skew

DeepAR  ND  0.0584  0.0796  0.2312  0.4429  4.1520  426.5906  18.4057 
NRMSE  0.2953  0.0972  0.2595  0.5263  5.4950  470.8968  19.4827  
+ Contrastive Loss  ND  0.0618  0.0872  0.2102  0.4274  4.0004  384.568  17.5604 
NRMSE  0.3062  0.1069  0.2481  0.5392  5.1606  415.3592  18.3051  
+ Focal Loss  ND  0.0628  0.0853  0.2694  0.4398  4.3263  412.5172  18.0739 
NRMSE  0.3139  0.1052  0.3137  0.5297  5.7797  469.7605  19.3916  
+ Shrinkage Loss  ND  0.0694  0.0956  0.2334  0.4446  4.4714  325.7401  16.3852 
NRMSE  0.3244  0.1156  0.2828  0.5177  5.4245  336.7777  16.5656  
+ LDS  ND  0.0634  0.0890  0.2238  0.4925  3.8625  335.2523  16.2944 
NRMSE  0.2923  0.1149  0.2787  0.5458  4.9234  373.4702  17.1249  
+ Kurtosis Loss (Ours)  ND  0.0567  0.0842  0.2151  0.4120  3.2738  300.3517  15.4597 
NRMSE  0.2631  0.1046  0.2732  0.4779  4.2613  339.3773  16.4892  
+ PLM (Ours)  ND  0.0564  0.0799  0.1900  0.4164  3.4576  359.6645  16.9243 
NRMSE  0.2783  0.1000  0.2343  0.5102  4.7494  423.2319  18.3994  
+ PLW (Ours)  ND  0.0578  0.0796  0.2121  0.3558  3.4647  329.0847  16.393 
NRMSE  0.2807  0.0984  0.2555  0.4809  4.6040  366.6818  17.3120 
Method  Metric  Mean  VaR (three levels)  Max  Kurtosis  Skew

DeepAR  ND  0.1741  0.6866  25.5840  32.1330  84.1582  41.2804  6.1700 
NRMSE  0.4465  1.2283  6.0283  7.5988  18.8103  37.0089  5.7343  
+ Contrastive Loss  ND  0.2052  0.7463  24.3737  30.5117  81.1716  42.1391  6.2282 
NRMSE  0.4667  1.2956  5.7747  7.2342  18.3360  36.4420  5.6834  
+ Focal Loss  ND  0.4903  1.1553  26.7537  30.1506  52.8272  28.5912  5.4325 
NRMSE  0.7302  1.6485  6.5880  7.3660  13.7985  24.6181  4.9104  
+ Shrinkage Loss  ND  0.2431  0.8380  25.3381  32.9147  85.2713  45.0172  6.3935 
NRMSE  0.5114  1.3099  6.0418  7.8882  19.0771  39.5592  5.8742  
+ LDS  ND  0.4763  1.4781  28.9162  38.4263  126.5733  49.2714  6.5445 
NRMSE  0.7829  1.8702  6.8826  9.2061  27.3684  39.8322  5.7109  
+ Kurtosis Loss (Ours)  ND  0.2022  0.7653  25.3752  31.4677  62.9173  35.298  5.8785 
NRMSE  0.4892  1.4072  6.0263  7.3369  13.7783  29.6338  5.2683  
+ PLM (Ours)  ND  0.1594  0.7115  24.5911  30.331  90.3169  42.5373  6.1829 
NRMSE  0.4600  1.3881  5.6779  7.0033  20.5736  36.7518  5.6005  
+ PLW (Ours)  ND  0.3751  1.0495  25.4471  31.6621  65.759  35.4836  5.8813 
NRMSE  0.6238  1.4914  6.0552  7.3491  13.8938  28.9214  5.1844 
Method  Mean  VaR (three levels)  Max  Kurtosis  Skew  (each entry: ADE/FDE)

Traj++  0.21/0.41  0.56/1.33  0.78/1.97  0.98/2.47  2.33/5.04  16.02/16.09  3.02/3.26 
Traj++EWTA  0.16/0.33  0.43/1.05  0.60/1.53  0.76/1.89  1.63/3.95  16.40/19.21  2.97/3.34 
+ Contrastive  0.17/0.34  0.43/1.03  0.62/1.56  0.79/1.89  1.67/4.02  16.37/18.51  2.96/3.35 
+ Focal Loss  0.16/0.32  0.40/0.89  0.54/1.28  0.66/1.57  1.50/3.50  14.95/17.80  2.74/3.18 
+ Shrinkage Loss  0.16/0.33  0.43/1.05  0.58/1.50  0.74/1.84  1.66/3.95  16.67/19.54  3.00/3.41 
+ LDS  0.17/0.35  0.44/1.04  0.57/1.45  0.78/1.86  1.69/3.85  19.80/19.12  3.18/3.39 
+ Kurtosis Loss (ours)  0.17/0.34  0.46/0.98  0.59/1.25  0.67/1.47  1.22/2.77  5.28/7.25  1.77/2.11 
+ PLM (ours)  0.16/0.30  0.38/0.81  0.52/1.20  0.63/1.49  1.30/3.20  12.01/16.90  2.41/3.04 
+ PLW (ours)  0.21/0.36  0.46/0.84  0.55/1.08  0.63/1.32  1.25/2.93  6.62/10.52  1.69/2.08 
Method  Mean  VaR (three levels)  Max  Kurtosis  Skew  (each entry: ADE/FDE)

Traj++  0.23/0.42  0.73/1.62  1.11/2.73  1.46/3.61  7.87/10.98  37.74/26.96  4.23/4.18 
Traj++EWTA  0.19/0.34  0.65/1.49  1.00/2.49  1.32/3.34  7.07/11.42  55.26/36.33  5.12/4.88 
+ Contrastive  0.19/0.35  0.65/1.51  1.01/2.58  1.36/3.46  6.82/10.48  52.62/32.32  5.07/4.71 
+ Focal Loss  0.19/0.33  0.56/1.09  0.85/1.95  1.11/2.65  6.55/11.71  60.48/53.60  5.14/5.55 
+ Shrinkage Loss  0.19/0.32  0.62/1.32  0.96/2.31  1.25/3.17  6.39/10.26  53.5/36.91  5.00/4.95 
+ LDS  0.19/0.32  0.62/1.26  0.94/2.23  1.20/2.99  5.20/10.53  46.71/40.00  4.75/5.08 
+ Kurtosis Loss (ours)  0.20/0.38  0.65/1.35  0.85/1.82  1.03/2.27  5.39/7.52  28.32/17.88  3.24/3.00 
+ PLM (ours)  0.19/0.33  0.62/1.32  0.95/2.31  1.25/3.18  6.10/10.96  46.43/37.63  4.71/4.96 
+ PLW (ours)  0.24/0.37  0.60/1.00  0.82/1.49  1.01/2.01  7.51/9.91  62.85/42.87  4.46/4.57 

4.3 RealWorld Experiments
Time Series Forecasting
We present average and tail metrics on ND and NRMSE for the time series forecasting task on the electricity and traffic datasets in Tables 1 and 2, respectively. We use DeepAR [salinas2020deepar], one of the SoTA models in probabilistic time series forecasting, as the base model. The task for both datasets is to use a 1-week history (168 hours) to forecast 1 day (24 hours) at an hourly frequency. DeepAR exhibits long-tail behavior in the error on both datasets (see Appendix G). The tail of the error distribution is significantly longer for the electricity dataset than for the traffic dataset.
Trajectory Forecasting
We present experimental results on the ETH-UCY and nuScenes datasets in Tables 3 and 4, respectively. Following [salzmann2020trajectron++] and [makansi2021exposing], we calculate model performance based on the best of 20 guesses. On both datasets, we compare our approaches with the current SoTA long-tail baseline methods, using Trajectron++EWTA [makansi2021exposing] as the base model due to its SoTA average performance on these datasets. We include the Trajectron++ [salzmann2020trajectron++] results for reference as the previous state-of-the-art method, to give a meaningful comparison for the magnitude of performance change obtained by each long-tail method.
Comparing tail lengths across datasets, we notice that the trajectory datasets manifest shorter tails than the 1D time series datasets. Our Pareto approaches work better on longer tails; for this reason, we augment the margin and weight for PLM and PLW with an additional Mean Squared Error term to internally elongate the tail during the training process.
4.4 Results Analysis
Cross-task consistency
As shown in Tables 3 and 4, our proposed approaches, kurtosis loss and PLM, are the only methods that improve on tail metrics across all tasks while maintaining the average performance of the base model. Our tasks differ in representation (1D, 2D), severity of the long tail, base-model loss function (Gaussian NLL, EWTA), and prediction horizon. This indicates that our methods generalize to diverse situations better than existing long-tail methods.
Long-tailedness across datasets
Using Eq. (12) as an indicative measure of the long-tailedness of the error distribution, we rank the datasets in order of long-tailedness for the base model as ETH-UCY, nuScenes, electricity, and traffic (details in Appendix E). We notice connections between the long-tailedness of a dataset and the performance of the different methods.
(12) 
Reweighting vs Regularization.
As mentioned in Section 3.2, we can categorize loss-modifying methods into two classes: reweighting (Focal Loss, Shrinkage Loss, LDS, and PLW) and regularization (Contrastive Loss, PLM, and Kurtosis Loss). Reweighting multiplies the loss of more difficult examples by higher weights. Regularization adds higher regularization values for examples with higher loss.
We notice that reweighting methods perform worse as the long-tailedness increases. In scenarios with longer tails, the weights of tail samples can be very high, and overemphasizing tail examples hampers the learning of other samples. Shrinkage loss, with its bounded weight, limits this issue but fails to show tail improvements in longer-tail scenarios. PLW is the best reweighting method on most datasets, likely due to its bounded weights. Its inconsistency in average performance is likely due to the reweighting nature of the loss, which limits its applicability.
In contrast, regularization methods perform consistently across all tasks, on both the tail and average metrics. The additive nature of regularization limits the adverse impact tail samples can have on learning. This enables these methods to handle different degrees of long-tailedness without degrading average performance.
PLM vs Kurtosis loss.
Kurtosis loss generally performs better on the most extreme tail metrics (the highest VaR level and Max). The biquadratic behavior of kurtosis puts higher emphasis on far-tail samples. Moreover, the magnitude of kurtosis varies significantly across distributions, making the choice of the hyperparameter \lambda (see Eq. (8)) critical. Further analysis is available in Appendix D.
PLM is the most consistent method across all tasks, improving on both tail and average metrics. As noted by [mcneil1997estimating], the GPD is well suited to model long-tail error distributions. PLM rewards examples moving away from the tail towards the mean with significantly lower margin values. The PLM margin saturates beyond a point in the tail, providing similar improvements for subsequent tail samples. Visualizations of PLM predictions for difficult tail examples can be seen in Fig. 4.
Kurtosis is sensitive to extreme samples in the tail, while PLM treats most samples in the tail similarly. This manifests in performance as kurtosis loss performing better on the most extreme metrics (the highest VaR level and Max), and PLM performing better on the lower VaR levels.
This provides guidance on choosing a method according to the objective: Kurtosis loss can improve performance in worst-case scenarios more significantly, while PLM provides less drastic changes to the most extreme values but works more effectively throughout the entire distribution.
Tail error and long-term forecasting
Based on the trajectory forecasting results in Tables 3 and 4, we can see that the error reduction for tail samples is more visible in FDE than in ADE. This indicates that the magnitude of the observed error increases with the prediction horizon: error accumulates through the prediction steps, making far-future predictions inherently more difficult. The larger improvements in FDE indicate that both Kurtosis and Pareto loss reduce high tail errors, which stem mostly from large, far-future prediction errors measured by FDE.
A natural direction of research in the forecasting domain is to increase the prediction horizon while maintaining high-accuracy predictions. As we can see in Fig. 5, the effect of tail examples is more pronounced at longer prediction horizons. Thus, methods addressing tail performance will be necessary to ensure the practical applicability and reliability of future long-term prediction.
5 Conclusion
We address the long-tail problem in deep probabilistic forecasting. We propose Pareto loss (with Margin and Weighted variants) and Kurtosis loss, two novel moment-based loss approaches that increase the emphasis on learning tail examples. We demonstrate their practical effects on two spatiotemporal trajectory datasets and two time series datasets. Our methods achieve significant improvements on tail examples over existing baselines without degrading average performance. Both proposed losses can be integrated with existing approaches in deep probabilistic forecasting to improve their performance in difficult and challenging scenarios.
Future directions include more principled ways to tune hyperparameters, new approaches to mitigate the long tail in long-term forecasting, and applications to more complex tasks such as video prediction. Based on our observations, we suggest evaluating additional tail performance metrics, apart from average performance, in machine learning tasks to identify potential long-tail issues across different tasks and domains.
Acknowledgments
This work was supported in part by U.S. Department Of Energy, Office of Science, U. S. Army Research Office under Grant W911NF2010334, Facebook Data Science Award, Google Faculty Award, and NSF Grant #2037745.
References
Appendix A Dataset description
The ETH-UCY dataset consists of five sub-datasets recorded from a bird's-eye view: ETH, Hotel, Univ, Zara1, and Zara2. As is common in the literature [makansi2021exposing, salzmann2020trajectron++], we present macro-averaged 5-fold cross-validation results in our experiment section. The nuScenes dataset includes 1000 scenes of 20-second length for vehicle trajectories recorded in Boston and Singapore.
The electricity dataset contains electricity consumption data for 370 homes over the period of Jan 1st, 2011 to Dec 31st, 2014 at a sampling interval of 15 minutes. We use the data from Jan 1st, 2011 to Aug 31st, 2011 for training and data from Sep 1st, 2011 to Sep 7th, 2011 for testing. The traffic dataset consists of occupancy values recorded by 963 sensors at a sampling interval of 10 minutes ranging from Jan 1st, 2008 to Mar 30th, 2009. We use data from Jan 1st, 2008 to Jun 15th, 2008 for training and data from Jun 16th, 2008 to Jul 15th, 2008 for testing. Both time series datasets are downsampled to 1 hour for generating examples.
The synthetic datasets are generated as 100 different time series, each consisting of 960 time steps. Each time series in the Sine dataset is generated using a random offset \phi and a random frequency \omega, both selected from a uniform distribution. The time series is then x_i = \sin(\omega i + \phi), where i is the index of the time step. The Gaussian and Pareto datasets are generated as order-1 lag autoregressive time series with randomly sampled Gaussian and Pareto noise, respectively. Gaussian noise is sampled from a Gaussian distribution with mean 1 and standard deviation 1. Pareto noise is randomly sampled from a Pareto distribution with shape 10 and scale 1.
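The generation procedure above can be sketched as follows. This reflects our reading of the description; the uniform ranges for the Sine parameters and the AR coefficient phi are assumed values, as they are not fully specified in the text.

```python
import math
import random

def sine_series(n=960, rng=random):
    # Random offset and frequency, each drawn from a uniform distribution (ranges assumed)
    phase = rng.uniform(0.0, 2.0 * math.pi)
    freq = rng.uniform(0.05, 0.5)
    return [math.sin(freq * i + phase) for i in range(n)]

def ar1_series(noise, n=960, phi=0.9, rng=random):
    # Order-1 lag autoregressive series: x_i = phi * x_{i-1} + noise_i (phi assumed)
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + noise(rng)
        out.append(x)
    return out

gaussian = ar1_series(lambda r: r.gauss(1.0, 1.0))    # Gaussian noise: mean 1, std 1
pareto = ar1_series(lambda r: r.paretovariate(10.0))  # Pareto noise: shape 10, scale 1
```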
Appendix B Method adaptation
Time Series forecasting
DeepAR uses the Gaussian Negative Log Likelihood as its loss, which is unbounded. Because of this, many baseline methods need to be adapted to be usable. For the same reason, we also need an auxiliary loss (L_aux). We use the MAE loss to fit the GPD, to calculate kurtosis, and to calculate the weight terms for Focal and Shrinkage Loss. For LDS, we treat all labels across time steps as part of a single distribution. Additionally, to avoid extremely high weights in LDS due to the nature of the long tail, we ensure a minimum probability for all labels.
Trajectory forecasting
We adapt Focal Loss and Shrinkage Loss to use the EWTA loss [makansi2019overcoming] in order to be compatible with the Trajectron++EWTA base model. LDS was originally proposed for a regression task, and we adapt it to the trajectory prediction task in the same way as for the time series task. We use MAE to fit the GPD, due to the evolving property of the EWTA loss.
Appendix C Implementation details
Time Series forecasting
We use the DeepAR implementation from https://github.com/zhykoties/TimeSeries as the base code for all time series experiments, as the original implementation is part of the AWS API and not publicly available. The implementation of contrastive loss is taken directly from the source code of [makansi2021exposing].
Trajectory forecasting
For all tested base methods in the trajectory forecasting experiments (Trajectron++ [salzmann2020trajectron++] and Trajectron++EWTA [makansi2021exposing]), we use the original implementations provided by the authors of each method. The implementation of the contrastive loss is taken directly from the source code of [makansi2021exposing].
The experiments have been conducted on a machine with 7 RTX 2080 Ti GPUs.
Appendix D Hyperparameter Tuning
During our experiments, we observe that the performance of the kurtosis loss is highly dependent on its weighting hyperparameter (see Eq. (8)). Results for different values of this hyperparameter on the electricity dataset are shown in Table 5. We also show how ND and NRMSE vary with the hyperparameter value in Figure 6. There is an optimal value of the hyperparameter, and the approach performs worse for both higher and lower values.
For both ETHUCY and nuScenes datasets we have used for Kurtosis loss, and for PLM and PLW. For both electricity and traffic datasets, we use for PLM, for PLW and for Kurtosis loss.
| Method | Metric | Mean |  |  |  | Max | Kurtosis | Skew |
|---|---|---|---|---|---|---|---|---|
| DeepAR | ND | 0.0584 | 0.0796 | 0.2312 | 0.4429 | 4.1520 | 426.5906 | 18.4057 |
| | NRMSE | 0.2953 | 0.0972 | 0.2595 | 0.5263 | 5.4950 | 470.8968 | 19.4827 |
| + Kurtosis Loss [0.001] | ND | 0.0581 | 0.0815 | 0.2087 | 0.3936 | 4.2381 | 488.7306 | 19.8207 |
| | NRMSE | 0.3046 | 0.1014 | 0.2325 | 0.4756 | 5.7144 | 529.7499 | 20.7713 |
| + Kurtosis Loss [0.005] | ND | 0.0574 | 0.0767 | 0.2147 | 0.4138 | 3.6767 | 351.3378 | 16.7597 |
| | NRMSE | 0.2843 | 0.0999 | 0.2617 | 0.4792 | 5.0062 | 417.0575 | 18.3039 |
| + Kurtosis Loss [0.01] | ND | 0.0567 | 0.0842 | 0.2151 | 0.4120 | 3.2738 | 300.3517 | 15.4597 |
| | NRMSE | 0.2631 | 0.1046 | 0.2732 | 0.4779 | 4.2613 | 339.3773 | 16.4892 |
| + Kurtosis Loss [0.1] | ND | 0.0677 | 0.0954 | 0.2269 | 0.4579 | 3.8772 | 312.7331 | 16.0062 |
| | NRMSE | 0.3073 | 0.1184 | 0.2768 | 0.5419 | 5.1345 | 334.8358 | 16.3366 |
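For reference, the ND and NRMSE metrics reported in the tables follow the standard definitions used in the DeepAR literature. The sketch below is illustrative; aggregation details may differ from the paper's evaluation code.

```python
import numpy as np

def nd(y_true, y_pred):
    # Normalized Deviation: sum of absolute errors over sum of |targets|.
    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true))

def nrmse(y_true, y_pred):
    # Root mean square error normalized by the mean absolute target value.
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / np.mean(np.abs(y_true))
```

Both metrics are scale-free, which is what makes them comparable across series with very different magnitudes.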
Appendix E Long tail severity
In Table 6 we present numerical values representing the approximate long-tailedness of each dataset. A larger value indicates a longer tail.
| Dataset | Metric | Long-tailedness |
|---|---|---|
| Electricity | ND | 15.56 |
| Traffic | ND | 45.08 |
| ETHUCY | FDE | 7.96 |
| nuScenes | FDE | 10.81 |
Appendix F Pareto and Kurtosis
Figure 7 illustrates GPDs for different values of the shape parameter. A higher shape value models more severe tail behavior.
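The effect Figure 7 illustrates can be reproduced with SciPy's genpareto (scale fixed at 1; the shape values below are chosen for illustration): as the shape parameter grows, the survival function at any fixed point grows, i.e. more probability mass sits in the tail.

```python
import numpy as np
from scipy.stats import genpareto

x = np.linspace(0.0, 10.0, 500)

# GPD densities for several shape parameters xi (scale = 1):
# larger xi means a heavier tail.
pdfs = {xi: genpareto.pdf(x, c=xi) for xi in (0.1, 0.5, 1.0)}

# Tail mass beyond x = 10 for each shape value.
tail_mass = {xi: genpareto.sf(10.0, c=xi) for xi in (0.1, 0.5, 1.0)}
```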
Appendix G Long tail error distribution
In Fig. 8 we show log-log plots of the error distribution of the base model for each dataset. Each distribution exhibits long-tail behavior.
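A log-log error-distribution plot of this kind can be produced as follows. This is a sketch on synthetic heavy-tailed errors, not the paper's actual model errors; on log-log axes, an approximately linear decay of the counts indicates power-law-like tail behavior.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted use
import matplotlib.pyplot as plt

# Synthetic heavy-tailed errors stand in for per-sample model errors.
rng = np.random.default_rng(0)
errors = np.abs(rng.standard_t(df=2, size=10000))

# Histogram with logarithmically spaced bins.
counts, edges = np.histogram(errors, bins=np.logspace(-2, 2, 50))
centers = np.sqrt(edges[:-1] * edges[1:])  # geometric bin centers

fig, ax = plt.subplots()
ax.loglog(centers[counts > 0], counts[counts > 0],
          marker="o", linestyle="none")
ax.set_xlabel("error magnitude")
ax.set_ylabel("count")
fig.savefig("error_loglog.png")
```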
Appendix H Synthetic datasets
We present the complete results of our experiments on the synthetic datasets in Table 7. We ran both of our methods, Kurtosis loss and PLM, on these datasets. Both methods show significant tail improvements over the base model across all datasets.
| Method | Metric | Mean |  |  |  | Max | Kurtosis | Skew |
|---|---|---|---|---|---|---|---|---|
| Sine Dataset | | | | | | | | |
| AutoReg | ND | 1.2255 | 2.162 | 2.7088 | 2.9306 | 3.1271 | 0.1565 | 0.1905 |
| | NRMSE | 1.5078 | 2.3134 | 2.7204 | 2.9379 | 3.1271 | 0.56 | 0.0369 |
| DeepAR | ND | 0.0513 | 0.1721 | 0.316 | 0.5913 | 1.5744 | 71.9164 | 7.9019 |
| | NRMSE | 0.1534 | 0.2009 | 0.3507 | 0.6199 | 1.654 | 64.4497 | 7.4304 |
| + Kurtosis Loss | ND | 0.0455 | 0.1412 | 0.2914 | 0.447 | 1.5571 | 90.6313 | 8.6956 |
| | NRMSE | 0.133 | 0.1624 | 0.3455 | 0.5387 | 1.5571 | 76.7183 | 7.9383 |
| + Pareto Loss Margin | ND | 0.0462 | 0.1326 | 0.3014 | 0.7151 | 1.582 | 78.6768 | 8.4086 |
| | NRMSE | 0.1517 | 0.1563 | 0.3551 | 0.737 | 1.7522 | 72.0235 | 7.9663 |
| Gaussian Dataset | | | | | | | | |
| AutoReg | ND | 0.573 | 1.0225 | 1.3334 | 1.6226 | 27.6956 | 845.0732 | 26.4337 |
| | NRMSE | 1.2705 | 1.1212 | 1.4045 | 1.6815 | 39.7474 | 1010.198 | 29.748 |
| DeepAR | ND | 0.4379 | 0.705 | 0.7908 | 0.8651 | 1.1362 | 0.8225 | 0.7469 |
| | NRMSE | 0.5518 | 0.8172 | 0.9246 | 0.9908 | 1.3009 | 0.5562 | 0.65 |
| + Kurtosis Loss | ND | 0.4378 | 0.704 | 0.7973 | 0.8597 | 1.1294 | 0.8037 | 0.7418 |
| | NRMSE | 0.5518 | 0.8191 | 0.9255 | 0.9865 | 1.2951 | 0.539 | 0.6449 |
| + Pareto Loss Margin | ND | 0.4391 | 0.7023 | 0.7946 | 0.8674 | 1.1069 | 0.7813 | 0.7352 |
| | NRMSE | 0.5534 | 0.8194 | 0.9232 | 0.9889 | 1.2786 | 0.4985 | 0.6333 |
| Pareto Dataset | | | | | | | | |
| AutoReg | ND | 1.9377 | 1.1748 | 1.7039 | 2.4782 | 2113.7503 | 2116.5018 | 44.2477 |
| | NRMSE | 81.1652 | 1.4027 | 1.9856 | 2.7312 | 4069.3972 | 2204.8078 | 45.3437 |
| DeepAR | ND | 0.4416 | 0.8336 | 1.0317 | 1.1763 | 2.015 | 6.9242 | 2.036 |
| | NRMSE | 0.6349 | 1.1511 | 1.4295 | 1.6688 | 2.8327 | 7.0681 | 2.1547 |
| + Kurtosis Loss | ND | 0.4413 | 0.8345 | 1.0295 | 1.1738 | 2.0326 | 6.8831 | 2.0318 |
| | NRMSE | 0.6352 | 1.1541 | 1.4305 | 1.6653 | 2.8335 | 6.9941 | 2.144 |
| + Pareto Loss Margin | ND | 0.4394 | 0.8497 | 1.0473 | 1.1955 | 2.086 | 6.6526 | 2.0185 |
| | NRMSE | 0.6397 | 1.1694 | 1.447 | 1.6735 | 2.845 | 6.4693 | 2.0711 |