
Taming the Long Tail of Deep Probabilistic Forecasting

Deep probabilistic forecasting is gaining attention in numerous applications, ranging from weather prognosis, through electricity consumption estimation, to autonomous vehicle trajectory prediction. However, existing approaches focus on improving performance on the most common scenarios without addressing the performance on rare and difficult cases. In this work, we identify long-tail behavior in the performance of state-of-the-art deep learning methods for probabilistic forecasting. We present two moment-based tailedness measurement concepts to improve performance on the difficult tail examples: Pareto Loss and Kurtosis Loss. Kurtosis loss is a symmetric measurement, based on the fourth moment about the mean of the loss distribution. Pareto loss is asymmetric, measuring right tailedness by modeling the loss using a generalized Pareto distribution (GPD). We demonstrate the performance of our approach on several real-world datasets, including time series and spatiotemporal trajectories, achieving significant improvements on the tail examples.





1 Introduction

Forecasting is one of the most fundamental problems in time series and spatiotemporal data analysis, with broad applications in energy, finance, and transportation. Deep learning models [li2019enhancing, salinas2020deepar, rasul2021autoregressive] have emerged as state-of-the-art approaches for forecasting rich time series and spatiotemporal data. In several forecasting competitions, such as the M5 forecasting competition [makridakis2020m5], the Argoverse motion forecasting challenge [chang2019argoverse], and the IARAI Traffic4cast contest [kreil2020surprising], almost all the winning solutions are based on deep neural networks.

Despite this encouraging progress, we discover that the forecasting performance of deep learning models exhibits long-tail behavior: a significant number of samples are very difficult to forecast. Existing works typically measure forecasting performance by averaging across test samples. However, such average performance, measured by root mean square error (RMSE) or mean absolute error (MAE), can be misleading. A low RMSE or MAE may indicate good performance on average, but it does not prevent the model from behaving disastrously in difficult scenarios.
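As a toy numeric illustration of this point (synthetic numbers of our own, not from the paper), a heavy-tailed error sample can look benign on average while its worst percentiles are an order of magnitude larger:

```python
import numpy as np

rng = np.random.default_rng(0)
# 10,000 absolute forecast errors: mostly small, plus a heavy Pareto tail.
errors = np.concatenate([
    rng.exponential(scale=0.05, size=9_500),      # "easy" examples
    (rng.pareto(a=2.0, size=500) + 1.0) * 0.5,    # rare, hard examples
])

mae = errors.mean()                  # looks fine on average
p99 = np.quantile(errors, 0.99)     # the tail is far worse
print(f"MAE = {mae:.3f}, 99th-percentile error = {p99:.3f}")
```

Here the 99th-percentile error is roughly an order of magnitude above the MAE, so reporting only the mean would hide exactly the scenarios where failure is most costly.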

Figure 1: Log-log plot of the error distribution for trajectory prediction on the ETH-UCY dataset using the SoTA model (Traj++EWTA). Also shown is a tail scenario with predictions using Traj++EWTA [purple] and Traj++EWTA + Pareto Margin Loss (ours) [teal].

From a practical perspective, the long-tail behavior in forecasting performance can be alarming. Figure 1 visualizes examples of long-tail behavior for a motion forecasting task. In motion forecasting, the long tail corresponds to rare driving events such as turning maneuvers and sudden stops. Failure to forecast accurately in these scenarios would pose paramount safety risks in route planning. In electricity forecasting, the tail behavior occurs during short circuits, power outages, grid failures, or sudden behavior changes. Merely focusing on average performance would ignore these electric load anomalies, significantly increasing maintenance and operational costs.

Long-tailed learning is an area heavily studied in classification settings focusing on class imbalance. We refer readers to Table 2 in [menon2020long] and the survey paper by [zhang2021deep] for a complete review. The most common approaches to address the long-tail problem include post-hoc normalization, data resampling, loss engineering, and learning class-agnostic representations. However, long-tail learning methods in classification are not directly translatable to forecasting, as we do not have pre-defined classes. A recent work by [makansi2021exposing] proposes to use a Kalman filter to gauge the difficulty of different forecasting examples, but such difficulty estimates may not directly relate to the deep neural networks used for the actual forecasting task.

In this paper, we address the long-tail behavior of prediction error in deep probabilistic forecasting. We present two moment-based loss modifications: Kurtosis loss and Pareto loss. Kurtosis is a well-studied symmetric measure of tailedness, defined as a scaled fourth moment of the distribution. Pareto loss uses the Generalized Pareto Distribution (GPD) to fit the long-tailed error distribution and can also be described as a weighted summation of shifted moments. We investigate these tailedness measurements as regularization and loss-weighting approaches for probabilistic forecasting tasks. We demonstrate significantly improved tail performance compared to the base model and the baselines, while achieving better average performance in most settings.

In summary, our contributions include:

  • We discover long-tail behavior in the forecasting performance of deep probabilistic models.

  • We investigate principled approaches to address long-tail behavior and propose two novel methods: Pareto loss and Kurtosis loss.

  • We significantly improve the tail performance on four forecasting tasks including two time series and two spatiotemporal trajectory forecasting datasets.

2 Related work

Deep probabilistic forecasting. There is a flurry of work on using deep neural networks for probabilistic forecasting. For time series forecasting, a common practice is to combine classic time series models with deep learning, resulting in DeepAR [salinas2020deepar], Deep State Space [rangapuram2018deep], Deep Factors [wang2019deep], and the normalizing Kalman filter [de2020normalizing]. Others introduce normalizing flows [rasul2020multivariate], denoising diffusion [rasul2021autoregressive], and particle filters [pal2021rnn] to deep learning. For trajectory forecasting, the majority of works focus on deterministic prediction. A few recent works propose to approximate the conditional distribution of future trajectories given the past with explicit parameterization [mfp, luo2020probabilistic], CVAEs [CVAE, desire, trajectron++], or implicit models such as GANs [socialgan, liu2019naomi]. Nevertheless, most existing works focus on average performance; the issue of the long tail is largely overlooked in the community.

Long-tailed learning.

The main efforts for addressing the long-tail issue in learning revolve around reweighing, resampling, loss function engineering, and two-stage training, but mostly for classification. Rebalancing during training comes either in the form of synthetic minority oversampling [chawla2002smote], oversampling with adversarial examples [Kozerawski_2020_ACCV], inverse class frequency balancing [liu2019large], balancing using the effective number of samples [cui2019class], or balance-oriented mixup augmentation [xu2021towards]. Another direction involves post-processing, either in the form of normalized calibration [pan2021model] or logit adjustment [menon2020long]. An important direction is loss modification approaches such as Focal Loss [lin2017focal], Shrinkage Loss [lu2018deep], and Balanced Meta-Softmax [ren2020balanced]. Others utilize two-stage training [liu2019large, cao2019learning] or separate expert networks [zhou2020bbn, li2020overcoming, wang2020long]. We refer the readers to [zhang2021deep] for an extensive survey. [tang2020long] indicated that SGD momentum can aggravate the long-tail problem and suggested de-confounded training to mitigate its effects. [feldman2020does, feldman2020neural] performed theoretical analyses and suggested that label memorization in long-tail distributions is necessary for the network to generalize.

A few methods were developed for imbalanced regression. Many revolve around modifications of SMOTE, such as SMOTER adapted to regression [torgo2013smote], SMOGN augmented with Gaussian noise [branco2017smogn], or the work of [ribeiro2020imbalanced] extending it to the prediction of extremely rare values. [steininger2021density] proposed DenseWeight, a method based on kernel density estimation for better assessment of the relevance function for sample reweighing. [yang2021delving] proposed distribution smoothing over the label (LDS) and feature (FDS) spaces for imbalanced regression. A concurrent work is [makansi2021exposing], which noticed the long-tail error distribution for trajectory prediction. They used Kalman filter [kalman1960new] performance as a difficulty measure and utilized contrastive learning to alleviate the tail problem. However, the tail of the Kalman filter may differ from that of deep learning models, which we elaborate on in later sections.

3 Methodology

We first identify the long-tail phenomenon in probabilistic forecasting. Then, we propose two related strategies, based on Pareto loss and Kurtosis loss, to mitigate the tail issue.

3.1 Long-tail in probabilistic forecasting

Given input x and output y, the probabilistic forecasting task aims to predict the conditional distribution of future states given current and past observations:

p(y_{t+1}, ..., y_{t+H} | y_{t−T+1}, ..., y_t, x),

where T is the length of the history and H is the prediction horizon. We denote the maximum likelihood prediction of the probabilistic forecasting model as ŷ.

Long-tail data distributions can be seen in numerous real-world datasets. This is evident for the four benchmark forecasting datasets (Electricity [Dua:2019], Traffic [Dua:2019], ETH-UCY [pellegrini2009you, lerner2007crowds], and nuScenes [caesar2020nuscenes]) studied in this work. Figure 2 shows the distribution of ground truth values (y) for all of them. We use log-log plots to increase the visibility of the long-tail behavior present in the data: small values occur very frequently, while the majority of the value range occurs very rarely (creating the tail). In addition to the long tail in the data distribution, we also identify a long-tail distribution of forecasting errors from deep learning models (such as DeepAR [salinas2020deepar], Trajectron++ [salzmann2020trajectron++], and Trajectron++EWTA [makansi2019overcoming]), as seen in Appendix G.

We hypothesize that the long-tail behavior in the forecasting error distribution originates both from the long-tail behavior in the data distribution and from the nature of gradient-based deep learning. Therefore, modifying the loss function to account for the shape of the distribution could lead to better tail performance. Next, we present two loss functions based on moments of the error distribution.

Figure 2: Log-log plots of the distribution of ground truth labels for the time series [Electricity: top left, Traffic: top right] and trajectory [ETH-UCY: bottom left, nuScenes: bottom right] forecasting datasets. The value for the time series datasets represents energy and occupancy; for the trajectory datasets it represents normalized 2D coordinates over the whole prediction horizon. All datasets exhibit long-tail behavior.

3.2 Pareto Loss

Long-tail distributions naturally lend themselves to analysis using Extreme Value Theory (EVT). [mcneil1997estimating] shows that long-tail behavior can be modeled with a generalized Pareto distribution (GPD). The probability density function (pdf) of the GPD is

f(z; μ, σ, ξ) = (1/σ) (1 + ξ(z − μ)/σ)^(−1/ξ − 1),

where the parameters are location (μ), scale (σ > 0), and shape (ξ). The pdf of the GPD is defined for z ≥ μ when ξ ≥ 0, and for μ ≤ z ≤ μ − σ/ξ when ξ < 0. μ can be set to 0 without loss of generality, as it represents translation along the x-axis. We can drop the scaling term 1/σ, as the pdf will be scaled using a hyperparameter. The simplified pdf is

f(u; σ, ξ) = (1 + ξu/σ)^(−1/ξ − 1).
The high-level idea of Pareto loss is to fit a GPD to the loss distribution to reprioritize the learning of easy and difficult (tail) examples. Let the loss function used by a given machine learning model be denoted as L. In probabilistic forecasting, a commonly used loss is the Negative Log Likelihood (NLL), L(x_i) = −log p(y_i | x_i), where (x_i, y_i) is the i-th training example and p the model prediction. As the pdf in Eq.(3) only allows non-negative input, the loss has to be lower-bounded. We propose to use an auxiliary loss L_aux to fit the GPD. For NLL, which can be unbounded for continuous distributions, the auxiliary loss can simply be the Mean Absolute Error (MAE): L_aux(x_i) = |y_i − ŷ_i|.
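The GPD fit itself can be obtained off the shelf. A minimal sketch using `scipy.stats.genpareto` (whose shape parameter `c` plays the role of ξ here, with the location pinned to 0 as in the text; the heavy-tailed "losses" below are synthetic stand-ins):

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(1)
# Stand-in for per-example auxiliary losses with a heavy right tail.
losses = rng.pareto(a=3.0, size=5_000)

# Fit a GPD with location fixed at 0, matching the simplification above.
xi, loc, sigma = genpareto.fit(losses, floc=0.0)

# Density of each loss under the fitted GPD; low density = tail example.
density = genpareto.pdf(losses, c=xi, loc=0.0, scale=sigma)
print(f"fitted shape xi={xi:.3f}, scale sigma={sigma:.3f}")
```

Examples far in the tail receive a very small density under the fitted GPD, which is the quantity the Pareto losses below turn into margins and weights.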

There are two main classes of methods for modifying loss functions to improve tail performance: regularization [ren2020balanced, makansi2021exposing] and re-weighting [lin2017focal, lu2018deep, yang2021delving]. Both classes are characterized by different behavior on tail data [ren2020balanced]. Inspired by these, we propose two variations of the Pareto Loss using the distribution fitted on : Pareto Loss Margin (PLM) and Pareto Loss Weighted (PLW).

PLM is based on the principles of margin-based regularization [ren2020balanced, liu2016large], which assigns larger penalties (margins) to harder examples. For a given hyperparameter λ, PLM is defined as

L_PLM(x_i) = L(x_i) + λ (1 − f(L_aux(x_i); σ, ξ)),

which uses the GPD to calculate the additive margin.

An alternative is to reweigh the loss terms using the loss distribution. For a given hyperparameter λ, PLW is defined as

L_PLW(x_i) = (1 + λ (1 − f(L_aux(x_i); σ, ξ))) · L(x_i),

which uses the GPD to reweigh the loss of each sample.
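In code, both variants reduce to small wrappers around the base loss. The sketch below is one plausible reading of the text, not the authors' implementation: the margin and the weight are taken as λ·(1 − f_GPD(L_aux)), which is near zero for easy examples and saturates for tail examples; `lam` stands in for the hyperparameter λ, and `sigma`, `xi` are the fitted GPD parameters:

```python
import numpy as np

def gpd_pdf(u, sigma, xi):
    """Simplified GPD pdf with location 0 and the 1/sigma factor dropped."""
    return np.power(1.0 + xi * u / sigma, -1.0 / xi - 1.0)

def pareto_margin_loss(base_loss, aux_loss, sigma, xi, lam=1.0):
    """PLM sketch: add a margin that grows (and saturates) with tail-ness."""
    return base_loss + lam * (1.0 - gpd_pdf(aux_loss, sigma, xi))

def pareto_weighted_loss(base_loss, aux_loss, sigma, xi, lam=1.0):
    """PLW sketch: reweigh each sample; weights are bounded in [1, 1+lam]."""
    return (1.0 + lam * (1.0 - gpd_pdf(aux_loss, sigma, xi))) * base_loss
```

Under this reading an easy example (aux_loss ≈ 0, pdf ≈ 1) keeps roughly its original loss, while a tail example receives at most an extra λ of margin (PLM) or at most a (1 + λ)× weight (PLW), which matches the bounded-weight behavior discussed in the results analysis.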

3.3 Kurtosis Loss

Kurtosis measures the tailedness of a distribution as the scaled fourth moment about the mean. To increase the emphasis on tail examples, we use this measure to propose the kurtosis loss. For a given hyperparameter λ and using the same notation as Sec. 3.2, the kurtosis loss is defined as

L_K(x_i) = L(x_i) + λ k(x_i),

where k(x_i) is the contribution of an example to the kurtosis for a batch:

k(x_i) = ((L_aux(x_i) − μ) / s)^4,

where μ and s are the mean and standard deviation of the auxiliary loss (L_aux) values for a batch of examples.
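A batch-level sketch of this idea (again our reading rather than the authors' code, with `lam` standing in for the hyperparameter): each example's regularizer is its standardized fourth power under the batch statistics of the auxiliary loss.

```python
import numpy as np

def kurtosis_loss(base_loss, aux_loss, lam=0.1, eps=1e-8):
    """Add each example's contribution to the batch kurtosis of aux_loss."""
    mu = aux_loss.mean()
    s = aux_loss.std() + eps  # eps guards against a zero-variance batch
    k = ((aux_loss - mu) / s) ** 4  # standardized fourth power per sample
    return base_loss + lam * k
```

The quartic term grows very fast for samples far from the batch mean, which is why this loss emphasizes the far tail more aggressively than the Pareto variants.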

We propose to use the auxiliary loss distribution to compute kurtosis, because performance metrics in forecasting tasks frequently involve versions of the L1 or L2 distance, such as RMSE, MAE, or ADE. The goal is to shorten the long tail for these metrics, which might not correlate well with the base loss L. The example in Sec. 3.2, where L is the NLL loss and L_aux is the MAE loss, illustrates this requirement well.

Kurtosis loss and Pareto loss are related approaches to handling long-tail behavior. Pareto loss is a weighted sum of moments about zero, while kurtosis loss is the fourth moment about the mean. Let a = ξ/σ and b = 1/ξ + 1; then the Taylor expansion of the GPD pdf from Eq.(3) is

f(u) = (1 + au)^(−b) = 1 − ab·u + (b(b+1)/2) a²u² − (b(b+1)(b+2)/6) a³u³ + ...

For ξ > 0, or equivalently a > 0 and b > 1, the coefficients of this expansion are positive for even moments and negative for odd moments. Even moments are always symmetric and positive, while odd moments are positive only for right-tailed distributions. Since we use the negative of the pdf, this yields an asymmetric measure of the right tailedness of a value in the distribution.

Kurtosis loss uses the fourth moment about the distribution mean. This is a symmetric and positive measure, but in the context of right-tailed distributions, kurtosis serves as a good measure of the long-tailedness of the distribution. The GPD and kurtosis are visualized in Appendix F.

4 Experiments

We evaluate our methods on two probabilistic forecasting tasks: time series forecasting and trajectory prediction.

4.1 Setup


For time series forecasting, we use electricity and traffic datasets from the UCI ML repository [Dua:2019] used in [salinas2020deepar] as benchmarks. We also generate three synthetic 1D time series datasets, Sine, Gaussian and Pareto, to further our understanding of long tail behavior.
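The generation recipes for the synthetic datasets are the authors'; as a purely illustrative stand-in (parameters of our own choosing), three such 1D series with increasingly heavy-tailed value distributions could be produced as:

```python
import numpy as np

rng = np.random.default_rng(42)
T = 1_000
t = np.arange(T)

series = {
    "sine": np.sin(2 * np.pi * t / 24),                  # periodic, no tail
    "gaussian": rng.normal(loc=0.0, scale=1.0, size=T),  # light tail
    "pareto": rng.pareto(a=1.5, size=T),                 # heavy (long) tail
}
```

The point of the three regimes is to separate the two hypothesized sources of long-tailed errors: a bounded series (Sine), a light-tailed one (Gaussian), and a genuinely long-tailed one (Pareto).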

For trajectory prediction, we use two benchmark datasets: a pedestrian trajectory dataset ETH-UCY (which is a combination of ETH [pellegrini2009you] and UCY [lerner2007crowds] datasets) and a vehicle trajectory dataset nuScenes  [caesar2020nuscenes]. Details regarding the datasets are available in Appendix A.


We compare with the following baselines representing SoTA in long tail mitigation for different tasks:


  • Contrastive Loss: [makansi2021exposing] uses contrastive loss as a regularizer to group examples together based on Kalman filter prediction errors.

  • Focal Loss: [lin2017focal] uses L1 loss to reweigh loss terms.

  • Shrinkage Loss: [lu2018deep] uses a sigmoid-based function to reweigh loss terms.

  • Label Distribution Smoothing (LDS): [yang2021delving] uses a symmetric kernel to smooth the label distribution and uses its inverse to reweigh loss terms.

Focal Loss, Shrinkage Loss, and LDS were originally proposed for classification and/or regression and required adaptation in order to be applicable to the forecasting task. For details on baseline adaptation, please see Appendix B.

Evaluation Metrics.

We use two common metrics for the evaluation of trajectory prediction models: Average Displacement Error (ADE), which is the average L2 distance between the predicted trajectory and the ground truth over all timesteps, and Final Displacement Error (FDE), which is the L2 distance at the final timestep. For time series forecasting, we use Normalized Deviation (ND) and Normalized Root Mean Squared Error (NRMSE).
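Both trajectory metrics are straightforward to compute; a sketch assuming trajectories are arrays of shape `(timesteps, 2)`:

```python
import numpy as np

def ade(pred, gt):
    """Average Displacement Error: mean L2 distance over all timesteps."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def fde(pred, gt):
    """Final Displacement Error: L2 distance at the last timestep."""
    return np.linalg.norm(pred[-1] - gt[-1])
```

With the best-of-20 protocol used later, these would be evaluated per guess and the minimum over the 20 guesses reported.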

Apart from the above-mentioned average performance metrics, we introduce metrics to capture performance on the tail. To measure performance at the tail of the distribution, we propose to adapt the Value-at-Risk (VaR, Eq. (11)) metric:

VaR_α(E) = inf { e : P(E > e) ≤ 1 − α },

where E is the error distribution. VaR at level α is the smallest error e such that the probability of observing an error larger than e is smaller than 1 − α. This evaluates to the α-quantile of the error distribution. We propose to measure VaR at three different levels of α.
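Since VaR at level α is simply the α-quantile of the error distribution, the empirical version is one line given a sample of errors (sketch; the α levels below are illustrative, not necessarily the paper's):

```python
import numpy as np

def value_at_risk(errors, alpha):
    """Empirical VaR: the alpha-quantile of the error sample."""
    return np.quantile(errors, alpha)

rng = np.random.default_rng(7)
errors = rng.pareto(a=2.0, size=10_000)  # synthetic heavy-tailed errors
for alpha in (0.95, 0.98, 0.99):         # example levels
    print(f"VaR@{alpha}: {value_at_risk(errors, alpha):.3f}")
```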

In addition, we use skew, kurtosis, and max error to further assess the tail performance. Skew and kurtosis as metrics are meaningful only when considered in conjunction with the mean: a distribution with a higher mean but lower skew and kurtosis does not necessarily have a less severe tail.

Figure 3: Top row: ground truth distribution for the synthetic datasets. Middle row: Normalized Deviation (ND) error distribution for prediction using AutoRegression. Bottom row: ND error distribution for prediction using DeepAR. Datasets (L to R): Sine, Gaussian, Pareto. The error distributions for AR and DeepAR on these datasets indicate that long-tail behavior in error originates from both long-tail data distributions and gradient-based learning. Note: the x-axes of plots in the same column and the y-axes of plots in the same row do not share the same range of values.

4.2 Synthetic Dataset Experiments

In order to better understand the long-tail error distribution, we perform experiments on three synthetic datasets. The task is to forecast 8 steps ahead given a history of 8 time steps. We use AutoRegression (AR) and DeepAR [salinas2020deepar] as forecasting models to perform this task. The top row in Figure 3 shows that, among the datasets, only Gaussian and Pareto show tail behavior in the data distribution; the Pareto dataset in particular is the only one to display long-tail behavior. AR and DeepAR have different error distributions across the datasets. Based on these results, we make the following hypotheses about the sources of long-tailedness.

Source 1: Long Tail in Data. The data distributions for Gaussian and Pareto datasets have similar tail behavior to the error distribution for both models, AR and DeepAR. This indicates that the long tail in data is a likely cause of long tail behavior in error. This connection is also well established as class imbalance for classification tasks [van2018inaturalist, liu2019large].

Source 2: Deep Learning Model. The results on the Sine dataset illustrate that even in the absence of long tail in the data, we can have long tail in the error distribution. The AR model, however, does not show long tail behavior for error. This indicates that the observed long tail behavior in error for DeepAR is model induced. We hypothesize that this is caused by DeepAR overfitting to simpler examples due to the nature of gradient based learning. Further results and analysis on these datasets can be found in Appendix H.

The difference between AR and DeepAR error distributions also suggests that assuming tail overlap between deep learning and non-deep learning methods (such as Kalman filter used by [makansi2021exposing]) might not generalize well.

Method Metric Mean VaR₁ VaR₂ VaR₃ Max Kurtosis Skew
DeepAR ND 0.0584 0.0796 0.2312 0.4429 4.1520 426.5906 18.4057
NRMSE 0.2953 0.0972 0.2595 0.5263 5.4950 470.8968 19.4827
+ Contrastive Loss ND 0.0618 0.0872 0.2102 0.4274 4.0004 384.568 17.5604
NRMSE 0.3062 0.1069 0.2481 0.5392 5.1606 415.3592 18.3051
+ Focal Loss ND 0.0628 0.0853 0.2694 0.4398 4.3263 412.5172 18.0739
NRMSE 0.3139 0.1052 0.3137 0.5297 5.7797 469.7605 19.3916
+ Shrinkage Loss ND 0.0694 0.0956 0.2334 0.4446 4.4714 325.7401 16.3852
NRMSE 0.3244 0.1156 0.2828 0.5177 5.4245 336.7777 16.5656
+ LDS ND 0.0634 0.0890 0.2238 0.4925 3.8625 335.2523 16.2944
NRMSE 0.2923 0.1149 0.2787 0.5458 4.9234 373.4702 17.1249
+ Kurtosis Loss (Ours) ND 0.0567 0.0842 0.2151 0.4120 3.2738 300.3517 15.4597
NRMSE 0.2631 0.1046 0.2732 0.4779 4.2613 339.3773 16.4892
+ PLM (Ours) ND 0.0564 0.0799 0.1900 0.4164 3.4576 359.6645 16.9243
NRMSE 0.2783 0.1000 0.2343 0.5102 4.7494 423.2319 18.3994
+ PLW (Ours) ND 0.0578 0.0796 0.2121 0.3558 3.4647 329.0847 16.393
NRMSE 0.2807 0.0984 0.2555 0.4809 4.6040 366.6818 17.3120
Table 1: Performance on the Electricity Dataset (ND/NRMSE). PLW, PLM and Kurtosis (Ours) all improve on the average as well as tail metrics. Baseline methods perform slightly worse on average as compared to DeepAR. Results indicated as Better and Best
Method Metric Mean VaR₁ VaR₂ VaR₃ Max Kurtosis Skew
DeepAR ND 0.1741 0.6866 25.5840 32.1330 84.1582 41.2804 6.1700
NRMSE 0.4465 1.2283 6.0283 7.5988 18.8103 37.0089 5.7343
+ Contrastive Loss ND 0.2052 0.7463 24.3737 30.5117 81.1716 42.1391 6.2282
NRMSE 0.4667 1.2956 5.7747 7.2342 18.3360 36.4420 5.6834
+ Focal Loss ND 0.4903 1.1553 26.7537 30.1506 52.8272 28.5912 5.4325
NRMSE 0.7302 1.6485 6.5880 7.3660 13.7985 24.6181 4.9104
+ Shrinkage Loss ND 0.2431 0.8380 25.3381 32.9147 85.2713 45.0172 6.3935
NRMSE 0.5114 1.3099 6.0418 7.8882 19.0771 39.5592 5.8742
+ LDS ND 0.4763 1.4781 28.9162 38.4263 126.5733 49.2714 6.5445
NRMSE 0.7829 1.8702 6.8826 9.2061 27.3684 39.8322 5.7109
+ Kurtosis Loss (Ours) ND 0.2022 0.7653 25.3752 31.4677 62.9173 35.298 5.8785
NRMSE 0.4892 1.4072 6.0263 7.3369 13.7783 29.6338 5.2683
+ PLM (Ours) ND 0.1594 0.7115 24.5911 30.331 90.3169 42.5373 6.1829
NRMSE 0.4600 1.3881 5.6779 7.0033 20.5736 36.7518 5.6005
+ PLW (Ours) ND 0.3751 1.0495 25.4471 31.6621 65.759 35.4836 5.8813
NRMSE 0.6238 1.4914 6.0552 7.3491 13.8938 28.9214 5.1844
Table 2: Performance on the Traffic Dataset (ND/NRMSE). Pareto Loss Margin (Ours) delivers best overall results improving on both average and the tail. Regularization methods in general fare better than weighting methods due to a very long tail. Among the baseline methods contrastive loss exhibits most consistent improvements. Results indicated as Better and Best
Method Mean VaR₁ VaR₂ VaR₃ Max Kurtosis Skew
Traj++ 0.21/0.41 0.56/1.33 0.78/1.97 0.98/2.47 2.33/5.04 16.02/16.09 3.02/3.26
Traj++EWTA 0.16/0.33 0.43/1.05 0.60/1.53 0.76/1.89 1.63/3.95 16.40/19.21 2.97/3.34
+ Contrastive 0.17/0.34 0.43/1.03 0.62/1.56 0.79/1.89 1.67/4.02 16.37/18.51 2.96/3.35
+ Focal Loss 0.16/0.32 0.40/0.89 0.54/1.28 0.66/1.57 1.50/3.50 14.95/17.80 2.74/3.18
+ Shrinkage Loss 0.16/0.33 0.43/1.05 0.58/1.50 0.74/1.84 1.66/3.95 16.67/19.54 3.00/3.41
+ LDS 0.17/0.35 0.44/1.04 0.57/1.45 0.78/1.86 1.69/3.85 19.80/19.12 3.18/3.39
+ Kurtosis Loss (ours) 0.17/0.34 0.46/0.98 0.59/1.25 0.67/1.47 1.22/2.77 5.28/7.25 1.77/2.11
+ PLM (ours) 0.16/0.30 0.38/0.81 0.52/1.20 0.63/1.49 1.30/3.20 12.01/16.90 2.41/3.04
+ PLW (ours) 0.21/0.36 0.46/0.84 0.55/1.08 0.63/1.32 1.25/2.93 6.62/10.52 1.69/2.08
Table 3: Macro-averaged performance on the ETH-UCY Dataset (ADE/FDE). Our approaches improve tail performance better than existing baselines. The improvements are most significant for far-future prediction (FDE). PLM improves well across prediction horizon (ADE). Results indicated as Better and Best.
Method Mean VaR₁ VaR₂ VaR₃ Max Kurtosis Skew
Traj++ 0.23/0.42 0.73/1.62 1.11/2.73 1.46/3.61 7.87/10.98 37.74/26.96 4.23/4.18
Traj++EWTA 0.19/0.34 0.65/1.49 1.00/2.49 1.32/3.34 7.07/11.42 55.26/36.33 5.12/4.88
+ Contrastive 0.19/0.35 0.65/1.51 1.01/2.58 1.36/3.46 6.82/10.48 52.62/32.32 5.07/4.71
+ Focal Loss 0.19/0.33 0.56/1.09 0.85/1.95 1.11/2.65 6.55/11.71 60.48/53.60 5.14/5.55
+ Shrinkage Loss 0.19/0.32 0.62/1.32 0.96/2.31 1.25/3.17 6.39/10.26 53.5/36.91 5.00/4.95
+ LDS 0.19/0.32 0.62/1.26 0.94/2.23 1.20/2.99 5.20/10.53 46.71/40.00 4.75/5.08
+ Kurtosis Loss (ours) 0.20/0.38 0.65/1.35 0.85/1.82 1.03/2.27 5.39/7.52 28.32/17.88 3.24/3.00
+ PLM (ours) 0.19/0.33 0.62/1.32 0.95/2.31 1.25/3.18 6.10/10.96 46.43/37.63 4.71/4.96
+ PLW (ours) 0.24/0.37 0.60/1.00 0.82/1.49 1.01/2.01 7.51/9.91 62.85/42.87 4.46/4.57

Table 4: Average performance on the nuScenes Dataset (ADE/FDE). Our approaches improve tail performance for far-future prediction (FDE) better than existing baselines. Results indicated as Better and Best.

4.3 Real-World Experiments

Time Series Forecasting

We present average and tail metrics, ND and NRMSE, for the time series forecasting task on the electricity and traffic datasets in Tables 1 and 2, respectively. We use DeepAR [salinas2020deepar], one of the SoTA models in probabilistic time series forecasting, as the base model. The task for both datasets is to use a one-week history (168 hours) to forecast one day (24 hours) ahead at an hourly frequency. DeepAR exhibits long-tail behavior in error on both datasets (see Appendix G). The tail of the error distribution is significantly longer for the electricity dataset than for the traffic dataset.

Trajectory Forecasting

We present experimental results on the ETH-UCY and nuScenes datasets in Tables 3 and 4, respectively. Following [salzmann2020trajectron++] and [makansi2021exposing], we calculate model performance based on the best of 20 guesses. On both datasets we compare our approaches with current SoTA long-tail baseline methods, using Trajectron++EWTA [makansi2021exposing] as the base model due to its SoTA average performance on these datasets. We include the Trajectron++ [salzmann2020trajectron++] results for reference as the previous state-of-the-art method, to give a meaningful sense of the magnitude of performance change obtained by each long-tail method.

Comparing tail lengths between datasets, we notice that the trajectory datasets manifest shorter tails than the 1D time series datasets. Our Pareto approaches work better on longer tails; for this reason, we augment the weight and margin of PLM and PLW with an additional Mean Squared Error term to internally elongate the tail during training.

4.4 Results Analysis

Cross-task consistency

As shown in Tables 3 and 4, our proposed approaches, kurtosis loss and PLM, are the only methods improving on tail metrics across all tasks while maintaining the average performance of the base model. Our tasks differ in representation (1D, 2D), severity of long-tail, base model loss function (GaussNLL, EWTA) and prediction horizon. This indicates that our methods generalize to diverse situations better than existing long-tail methods.

Long-tailedness across datasets

Using Eq. (12) as an indicative measure of the long-tailedness of the error distribution, we order the datasets by long-tailedness for the base model as ETH-UCY, nuScenes, electricity, and traffic (details in Appendix E). We notice connections between the long-tailedness of the dataset and the performance of different methods.


Re-weighting vs Regularization.

As mentioned in Section 3.2, we can categorize loss modifying methods into two classes: re-weighting (focal loss, shrinkage loss, LDS and PLW) and regularization (contrastive loss, PLM and kurtosis loss). Re-weighting multiplies the loss for more difficult examples with higher weights. Regularization adds higher regularization values for examples with higher loss.

We notice that re-weighting methods perform worse as the long-tailedness increases. In scenarios with longer tails, the weights of tail samples can be very high, and over-emphasizing tail examples hampers learning for the other samples. Shrinkage loss, with its bounded weight, limits this issue but fails to show tail improvements in longer-tail scenarios. PLW is the best re-weighting method on most datasets, likely due to its bounded weights. Its inconsistency in average performance is likely due to the re-weighting nature of the loss, which limits its applicability.

In contrast, regularization methods perform consistently across all tasks both on the tail and average metrics. The additive nature of regularization limits the adverse impact tail samples can have on the learning. This enables these methods to handle different long-tailednesses without degrading the average performance.

Figure 4: Visualization of difficult (tail) examples for Electricity (top row left half), Traffic (top row right half), ETH-UCY (bottom row left half) and nuScenes (bottom row right half) datasets. The difficulty in all datasets is captured by a significant departure in behavior with respect to the history. This manifests as sudden increase or decrease in the 1D time series datasets and as high velocity trajectories with sharp turns for the trajectory datasets. These examples represent critical events in real world scenarios where the performance of the model is of utmost importance. Our methods perform significantly better on such examples.

PLM vs Kurtosis loss.

Kurtosis loss generally performs better on the extreme tail metrics, the highest VaR level and Max. The bi-quadratic behavior of kurtosis puts higher emphasis on far-tail samples. Moreover, the magnitude of kurtosis varies significantly across distributions, making the choice of the hyperparameter (see Eq.(8)) critical. Further analysis is available in Appendix D.

PLM is the most consistent method across all tasks improving on both tail and average metrics. As noted by [mcneil1997estimating] GPD is well suited to model long tail error distributions. PLM rewards examples moving away from the tail towards the mean with significantly lower margin values. PLM margin values saturate beyond a point in the tail providing similar improvements for subsequent tail samples. Visualization of PLM predictions for difficult tail examples can be seen in Fig. 4.

Kurtosis is sensitive to extreme samples in the tail, while PLM treats most samples in the tail similarly. This manifests in performance as kurtosis loss performing better on the highest VaR level and Max, and PLM performing better on the lower VaR levels.

This provides guidance on the choice of method as per the objective. Kurtosis Loss can improve the performance in worst case scenarios more significantly. PLM provides less drastic changes to the most extreme values, but it works more effectively throughout the entire distribution.

Figure 5: Distribution of the top 5% most erroneous predictions (FDE) for different prediction horizons for ETH-UCY (Zara1) dataset. Predictions obtained using Trajectron++EWTA. The trend shows that the tail gets much worse as the prediction horizon is increased due to compounded error.

Tail error and long-term forecasting

Based on the trajectory forecasting results in Tables 3 and 4, we can see that error reduction on tail samples is more visible in FDE than in ADE. This indicates that the magnitude of the observed error increases with the prediction horizon: error accumulates across prediction steps, making far-future predictions inherently more difficult. The larger improvements in FDE indicate that both Kurtosis and Pareto loss reduce the high tail errors that stem mostly from large, far-future prediction errors.
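For reference, the two metrics discussed above can be computed as a minimal sketch using their standard definitions: ADE averages the L2 displacement over all prediction steps, while FDE measures only the final step, which is why FDE is most sensitive to error accumulated over the horizon.

```python
import numpy as np

def ade_fde(pred, gt):
    # pred, gt: arrays of shape (T, 2) holding predicted and ground-truth
    # 2D positions over T prediction steps.
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    disp = np.linalg.norm(pred - gt, axis=-1)  # per-step L2 displacement
    return disp.mean(), disp[-1]               # (ADE, FDE)
```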

Research in the forecasting domain is inevitably moving toward longer prediction horizons with highly accurate predictions. As we can see in Fig. 5, the effect of tail examples becomes more pronounced with longer prediction horizons. Thus, methods addressing tail performance will be necessary to ensure the practical applicability and reliability of future long-term prediction.

5 Conclusion

We address the long-tail problem in deep probabilistic forecasting. We propose Pareto loss (Margin and Weighted) and Kurtosis loss, two novel moment-based loss functions that increase emphasis on learning tail examples. We demonstrate their practical effects on two spatiotemporal trajectory datasets and two time series datasets. Our methods achieve significant improvements on tail examples over existing baselines without degrading average performance. Both proposed losses can be integrated with existing approaches in deep probabilistic forecasting to improve their performance on difficult scenarios.

Future directions include more principled ways to tune hyperparameters, new approaches to mitigating the long tail in long-term forecasting, and application to more complex tasks such as video prediction. Based on our observations, we suggest evaluating tail performance metrics in addition to average performance in machine learning tasks, to identify potential long-tail issues across different tasks and domains.


This work was supported in part by the U.S. Department of Energy, Office of Science; the U.S. Army Research Office under Grant W911NF-20-1-0334; a Facebook Data Science Award; a Google Faculty Award; and NSF Grant #2037745.


Appendix A Dataset description

The ETH-UCY dataset consists of five subdatasets, each with bird's-eye-view recordings: ETH, Hotel, Univ, Zara1, and Zara2. As is common in the literature [makansi2021exposing, salzmann2020trajectron++], we present macro-averaged 5-fold cross-validation results in our experiments section. The nuScenes dataset includes 1000 scenes of 20-second vehicle trajectories recorded in Boston and Singapore.

The electricity dataset contains electricity consumption data for 370 homes over the period of Jan 1st, 2011 to Dec 31st, 2014 at a sampling interval of 15 minutes. We use the data from Jan 1st, 2011 to Aug 31st, 2011 for training and data from Sep 1st, 2011 to Sep 7th, 2011 for testing. The traffic dataset consists of occupancy values recorded by 963 sensors at a sampling interval of 10 minutes ranging from Jan 1st, 2008 to Mar 30th, 2009. We use data from Jan 1st, 2008 to Jun 15th, 2008 for training and data from Jun 16th, 2008 to Jul 15th, 2008 for testing. Both time series datasets are downsampled to 1 hour for generating examples.

The synthetic datasets are generated as 100 different time series, each consisting of 960 time steps. Each time series in the Sine dataset is generated using a random offset and a random frequency, both sampled from a uniform distribution; the value at time step t is the sine of the frequency times t plus the offset. The Gaussian and Pareto datasets are generated as lag-1 autoregressive time series with randomly sampled Gaussian and Pareto noise, respectively. Gaussian noise is sampled from a Gaussian distribution with mean 1 and standard deviation 1. Pareto noise is sampled from a Pareto distribution with shape 10 and scale 1.
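The generators described above can be sketched as follows. The uniform sampling ranges for the sine offset and frequency and the AR coefficient are illustrative assumptions (the exact values were not reproduced here); the noise distributions match the text.

```python
import numpy as np

def make_synthetic(kind, n_steps=960, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    if kind == "sine":
        phase = rng.uniform(0.0, 2.0 * np.pi)  # random offset (assumed range)
        freq = rng.uniform(0.01, 0.1)          # random frequency (assumed range)
        t = np.arange(n_steps)
        return np.sin(freq * t + phase)
    # Lag-1 autoregressive series with the stated noise distributions.
    if kind == "gaussian":
        noise = rng.normal(1.0, 1.0, n_steps)        # mean 1, std 1
    elif kind == "pareto":
        # numpy's pareto draws Lomax samples; adding 1 gives a classical
        # Pareto with shape 10 and scale 1.
        noise = rng.pareto(10.0, n_steps) + 1.0
    x = np.zeros(n_steps)
    for i in range(1, n_steps):
        x[i] = 0.9 * x[i - 1] + noise[i]             # AR coefficient assumed
    return x
```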

Appendix B Method adaptation

Time Series forecasting

DeepAR uses the Gaussian negative log-likelihood as its loss, which is unbounded. Because of this, many baseline methods need to be adapted in order to be usable, and for the same reason we also need an auxiliary loss. We use the MAE loss to fit the GPD, to calculate kurtosis, and to calculate the weight terms for Focal and Shrinkage loss. For LDS, we treat all labels across time steps as part of a single distribution. Additionally, to avoid extremely high weights in LDS due to the long-tailed nature of the labels, we enforce a minimum probability for all labels.
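The minimum-probability safeguard for LDS can be sketched as follows. This is a minimal illustration of inverse-density reweighting with a density floor; `min_prob` is an illustrative value, not the one used in the paper.

```python
import numpy as np

def lds_weights(label_density, min_prob=1e-3):
    # Clamp the (smoothed) label density at a floor before inverting it,
    # which caps the weight of the rarest labels and prevents the extreme
    # weights that a long-tailed label distribution would otherwise produce.
    density = np.maximum(np.asarray(label_density, dtype=float), min_prob)
    weights = 1.0 / density
    return weights / weights.mean()  # normalize so the average weight is 1
```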

Trajectory forecasting

We adapt Focal loss and Shrinkage loss to use the EWTA loss [makansi2019overcoming] in order to be compatible with the Trajectron++EWTA base model. LDS was originally proposed for a regression task, and we adapt it to trajectory prediction in the same way as for the time series task. We use the MAE to fit the GPD, due to the evolving nature of the EWTA loss.

Appendix C Implementation details

Time Series forecasting

We use an open-source DeepAR implementation as the base code for all time series experiments, since the original code is an AWS API and not publicly available. The implementation of contrastive loss is taken directly from the source code of [makansi2021exposing].

Trajectory forecasting

For all base methods tested in the trajectory forecasting experiments (Trajectron++ [salzmann2020trajectron++] and Trajectron++EWTA [makansi2021exposing]), we used the implementations provided by the original authors of each method. The implementation of contrastive loss is taken directly from the source code of [makansi2021exposing].

The experiments were conducted on a machine with 7 RTX 2080 Ti GPUs.

Appendix D Hyperparameter Tuning

We observe in our experiments that the performance of kurtosis loss is highly dependent on its hyperparameter (see Eq. (8)). Results for different hyperparameter values on the electricity dataset are shown in Table 5. We also show the variation of ND and NRMSE with the hyperparameter value in Figure 6. There is an optimal value of the hyperparameter, and the approach performs worse with both higher and lower values.

For both ETH-UCY and nuScenes datasets we have used for Kurtosis loss, and for PLM and PLW. For both electricity and traffic datasets, we use for PLM, for PLW and for Kurtosis loss.

Method Metric Mean Max Kurtosis Skew
DeepAR ND 0.0584 0.0796 0.2312 0.4429 4.1520 426.5906 18.4057
NRMSE 0.2953 0.0972 0.2595 0.5263 5.4950 470.8968 19.4827
+ Kurtosis Loss [0.001] ND 0.0581 0.0815 0.2087 0.3936 4.2381 488.7306 19.8207
NRMSE 0.3046 0.1014 0.2325 0.4756 5.7144 529.7499 20.7713
+ Kurtosis Loss [0.005] ND 0.0574 0.0767 0.2147 0.4138 3.6767 351.3378 16.7597
NRMSE 0.2843 0.0999 0.2617 0.4792 5.0062 417.0575 18.3039
+ Kurtosis Loss [0.01] ND 0.0567 0.0842 0.2151 0.4120 3.2738 300.3517 15.4597
NRMSE 0.2631 0.1046 0.2732 0.4779 4.2613 339.3773 16.4892
+ Kurtosis Loss [0.1] ND 0.0677 0.0954 0.2269 0.4579 3.8772 312.7331 16.0062
NRMSE 0.3073 0.1184 0.2768 0.5419 5.1345 334.8358 16.3366
Table 5: Electricity dataset evaluation using DeepAR (ND/NRMSE) with different Kurtosis loss hyperparameters. The hyperparameter value is denoted in brackets next to the method name.
Figure 6: Left: Variation of ND by hyperparameter for kurtosis loss. Right: Variation of NRMSE by hyperparameter for kurtosis loss.

Appendix E Long tail severity

In Table 6 we present numerical values representing the approximate long-tailedness of each dataset. A larger value indicates a longer tail.

Dataset Metric Long-tailedness
Electricity ND 15.56
Traffic ND 45.08
nuScenes FDE 10.81
Table 6: Approximate long-tailedness (based on Eq. (12)) of the base model's performance on different datasets. A higher number indicates a longer tail. The trajectory datasets have shorter tails than the 1D time series datasets.

Appendix F Pareto and Kurtosis

Figure 7: Left: Generalized Pareto distributions with different shape parameters. Right: Illustration of the variation of kurtosis across distributions with the same mean.

Figure 7 illustrates GPDs with different shape parameter values; a higher shape value models more severe tail behavior.

Appendix G Long tail error distribution

In Fig. 8 we show log-log plots of the error distributions of the base model for each dataset. Each distribution exhibits long-tail behavior.

Figure 8: Log-log plots of the error distributions of the base model for the time series [Electricity: top left, Traffic: top right] and trajectory [ETH-UCY: bottom left, nuScenes: bottom right] forecasting datasets.

Appendix H Synthetic datasets

We present the complete results of our experiments on the synthetic datasets in Table 7. We ran both of our methods, Kurtosis loss and PLM, on these datasets; both show significant tail improvements over the base model across all datasets.

Method Metric Mean Max Kurtosis Skew
Sine Dataset
AutoReg ND 1.2255 2.162 2.7088 2.9306 3.1271 -0.1565 0.1905
NRMSE 1.5078 2.3134 2.7204 2.9379 3.1271 -0.56 -0.0369
DeepAR ND 0.0513 0.1721 0.316 0.5913 1.5744 71.9164 7.9019
NRMSE 0.1534 0.2009 0.3507 0.6199 1.654 64.4497 7.4304
+ Kurtosis Loss ND 0.0455 0.1412 0.2914 0.447 1.5571 90.6313 8.6956
NRMSE 0.133 0.1624 0.3455 0.5387 1.5571 76.7183 7.9383
+ Pareto Loss ND 0.0462 0.1326 0.3014 0.7151 1.582 78.6768 8.4086
Margin NRMSE 0.1517 0.1563 0.3551 0.737 1.7522 72.0235 7.9663
Gaussian Dataset
AutoReg ND 0.573 1.0225 1.3334 1.6226 27.6956 845.0732 26.4337
NRMSE 1.2705 1.1212 1.4045 1.6815 39.7474 1010.198 29.748
DeepAR ND 0.4379 0.705 0.7908 0.8651 1.1362 0.8225 0.7469
NRMSE 0.5518 0.8172 0.9246 0.9908 1.3009 0.5562 0.65
+ Kurtosis Loss ND 0.4378 0.704 0.7973 0.8597 1.1294 0.8037 0.7418
NRMSE 0.5518 0.8191 0.9255 0.9865 1.2951 0.539 0.6449
+ Pareto Loss ND 0.4391 0.7023 0.7946 0.8674 1.1069 0.7813 0.7352
Margin NRMSE 0.5534 0.8194 0.9232 0.9889 1.2786 0.4985 0.6333
Pareto Dataset
AutoReg ND 1.9377 1.1748 1.7039 2.4782 2113.7503 2116.5018 44.2477
NRMSE 81.1652 1.4027 1.9856 2.7312 4069.3972 2204.8078 45.3437
DeepAR ND 0.4416 0.8336 1.0317 1.1763 2.015 6.9242 2.036
NRMSE 0.6349 1.1511 1.4295 1.6688 2.8327 7.0681 2.1547
+ Kurtosis Loss ND 0.4413 0.8345 1.0295 1.1738 2.0326 6.8831 2.0318
NRMSE 0.6352 1.1541 1.4305 1.6653 2.8335 6.9941 2.144
+ Pareto Loss ND 0.4394 0.8497 1.0473 1.1955 2.086 6.6526 2.0185
Margin NRMSE 0.6397 1.1694 1.447 1.6735 2.845 6.4693 2.0711
Table 7: Performance on the Synthetic Datasets (ND/NRMSE).