1 Introduction
Forecasting is one of the most fundamental problems in time series and spatiotemporal data analysis, with broad applications in energy, finance, and transportation. Deep learning models [li2019enhancing, salinas2020deepar, rasul2021autoregressive] have emerged as state-of-the-art approaches for forecasting rich time series and spatiotemporal data. In several forecasting competitions such as M5 forecasting competition [makridakis2020m5], Argoverse motion forecasting challenge [chang2019argoverse], and IARAI Traffic4cast contest [kreil2020surprising]
, almost all the winning solutions are based on deep neural networks.
Despite the encouraging progress, we discover that the forecasting performance of deep learning models has long-tail behavior. That means, a significant amount of samples are very difficult to forecast. Existing works often measure the forecasting performance by averaging across test samples. However, such an average performance measured by root mean square error (RMSE) or mean absolute error (MAE) can be misleading. A low RMSE or MAE may indicate good averaged performance, but it does not prevent the model from behaving disastrously in difficult scenarios.

From a practical perspective, the long-tail behavior in forecasting performance can be alarming. Figure 1 visualizes examples of long-tail behavior for a motion forecasting task. In motion forecasting, the long-tail would correspond to rare events in driving such as turning maneuver and sudden stops. Failure to forecast accurately in these scenarios would pose paramount safety risks in route planning. In electricity forecasting, the tail behavior would occur during short circuits, power outage, grid failures, or sudden behavior changes. Merely focusing on the average performance would ignore the electric load anomalies, significantly increasing the maintenance and operational cost.
Long-tailed learning is an area heavily studied in classification settings focusing on class imbalance. We refer readers to Table 2 in [menon2020long] and the survey paper by [zhang2021deep] for a complete review. Most common approaches to address the long-tail problem include post-hoc normalization, data resampling, loss engineering, and learning class-agnostic representations. However, long-tail learning methods in classification are not directly translatable to forecasting as we do not have a pre-defined class. A recent work by [makansi2021exposing]
propose to use Kalman filter to gauge the difficulty of different forecasting examples but such difficulties may not directly relate to deep neural networks used for the actual forecasting task.
In this paper, we address the long-tail behavior of prediction error for deep probabilistic forecasting. We present two moment-based loss modifications: Kurtosis loss and Pareto loss. Kurtosis is a well studied symmetric measure of tailedness as a scaled fourth moment of the distribution. Pareto loss uses Generalized Pareto Distribution (GPD) to fit the long-tailed error distribution and can also be described as a weighted summation of shifted moments. We investigate these tailedness measurements as regularization and loss weighting approaches for probabilistic forecasting tasks. We demonstrate significantly improved tail performance compared to the base model and the baselines, while achieving better average performance in most settings.
In summary, our contributions include
-
We discover long-tail behavior in the forecasting performance of deep probabilistic models.
-
We investigate principled approaches to address long-tail behavior and propose two novel methods: Pareto loss and Kurtosis loss.
-
We significantly improve the tail performance on four forecasting tasks including two time series and two spatiotemporal trajectory forecasting datasets.
2 Related work
Deep probabilistic forecasting. There is a flurry of work on using deep neural networks for probabilistic forecasting. For time series forecasting, a common practice is to combine classic time series models with deep learning, resulting in DeepAR [salinas2020deepar], Deep State Space [rangapuram2018deep], Deep Factors [wang2019deep] and normalizing Kalman Filter [de2020normalizing]. Others introduce normalizing flow [rasul2020multivariate], denoising diffusion [rasul2021autoregressive] and particle filter [pal2021rnn] to deep learning. For trajectory forecasting, the majority of works focus on deterministic prediction. A few recent works propose to approximate the conditional distribution of future trajectories given the past with explicit parameterization [mfp, luo2020probabilistic], CVAE [CVAE, desire, trajectron++] or implicit models such as GAN [socialgan, liu2019naomi]. Nevertheless, most existing works focus on average performance, the issue of long-tail is largely overlooked in the community.
Long-tailed learning.
The main efforts for addressing the long-tail issue in learning revolve around reweighing, resampling, loss function engineering, and two-stage training, but mostly for classification. Rebalancing during training comes either in form of synthetic minority oversampling
[chawla2002smote], oversampling with adversarial examples [Kozerawski_2020_ACCV], inverse class frequency balancing [liu2019large], balancing using effective number of samples [cui2019class], or balance-oriented mixup augmentation [xu2021towards]. Another direction involves post-processing either in form of normalized calibration [pan2021model]or logit adjustment
[menon2020long]. An important direction is loss modification approaches such as Focal Loss [lin2017focal], Shrinkage Loss [lu2018deep], and Balanced Meta-Softmax [ren2020balanced]. Others utilize two-stage training [liu2019large, cao2019learning] or separate expert networks [zhou2020bbn, li2020overcoming, wang2020long]. We refer the readers to [zhang2021deep] for an extensive survey. [tang2020long] indicated SGD momentum can contribute to the aggravation of the long-tail problem and suggested de-confounded training to mitigate its effects. [feldman2020does, feldman2020neural] performed theoretical analysis and suggested label memorization in long-tail distribution as a necessity for the network to generalize.A few were developed for imbalanced regression. Many approaches revolve around modifications of SMOTE such as adapted to regression SMOTER [torgo2013smote], augmented with Gaussian Noise SMOGN [branco2017smogn], or [ribeiro2020imbalanced] work extending for prediction of extremely rare values. [steininger2021density]
proposed DenseWeight, a method based on Kernel Density Estimation for better assessment of the relevance function for sample reweighing.
[yang2021delving] proposed a distribution smoothing over label (LDS) and feature space (FDS) for imbalanced regression. A concurrent work is [makansi2021exposing] where they noticed the long-tail error distribution for trajectory prediction. They used Kalman filter [kalman1960new] performance as a difficulty measure and utilized contrastive learning to alleviate the tail problem. However, the tail of Kalman Filter may differ from that of deep learning models, which we elaborate on in later sections.3 Methodology
We first identify the long-tail phenomena in probabilistic forecasting. Then, we propose two related strategies based on Pareto loss and Kurtosis loss to mitigate the tail issue.
3.1 Long-tail in probabilistic forecasting
Given input and output respectively, probabilistic forecasting task aims to predict the conditional distribution of future states given current and past observations as:
(1) |
where is the length of the history and is the prediction horizon. We denote the maximum likelihood probabilistic forecasting model prediction as .
Long tail distribution of data can be seen in numerous real world datasets. This is evident for the four benchmark forecasting datasets (Electricity [Dua:2019], Traffic [Dua:2019], ETH-UCY [pellegrini2009you, lerner2007crowds], and nuScenes [caesar2020nuscenes]) studied in this work. We can see the distribution of ground truth values () for all of them in Figure 2. We use log-log plots to increase the visibility of the long tail behavior present in the data – smaller values (constituting the minority on a linear scale) occur very frequently, while majority of values are very rare (creating the tail). In addition to the long tail data distribution, we also identify the long tail distribution of forecasting error from deep learning models (such as DeepAR [salinas2020deepar], Trajectron++ [salzmann2020trajectron++], and Trajectron++EWTA [makansi2019overcoming]) (as seen in Appendix G).
We hypothesize that long tail behavior in forecasting error distribution originates from the long tail behavior in data distribution, as well as the nature of gradient based deep learning. Therefore, modifying the loss function to account for the shape of the distribution would potentially lead to better tail performance. Next, we present two loss functions based on the moment of the error distribution.
3.2 Pareto Loss
Long tail distributions naturally lend themselves to analysis using Extreme Value Theory (EVT). [mcneil1997estimating]
shows that long tail behavior can be modeled as a generalized Pareto distribution (GPD). The probability distribution function (pdf) of the GPD is,
(2) |
where the parameters are location (), scale () and shape (). The pdf for GPD is defined for when and for when . can be set to 0 without loss of generality as it represents translation along the x axis. We can drop the scaling term
as the pdf will be scaled using a hyperparameter. The simplified pdf is,
(3) |
The high-level idea of Pareto loss is to fit a GPD to the loss distribution to reprioritize the learning of easy and difficult (tail) examples. Let the loss function used by a given machine learning model be denoted as
. In probabilistic forecasting, a commonly used loss is Negative Log Likelihood (NLL) loss: where is the training example, and the model prediction. As the pdf in Eq.(3) only allows non-negative input, the loss has to be lower-bounded. We propose to use an auxiliary loss to fit the GPD. For NLL which can be unbounded for continuous distributions, the auxiliary loss can simply be Mean Absolute Error (MAE): .There are two main classes of methods for modifying loss functions to improve tail performance: regularization [ren2020balanced, makansi2021exposing] and re-weighting [lin2017focal, lu2018deep, yang2021delving]. Both classes are characterized by different behavior on tail data [ren2020balanced]. Inspired by these, we propose two variations of the Pareto Loss using the distribution fitted on : Pareto Loss Margin (PLM) and Pareto Loss Weighted (PLW).
PLM is based on the principles of margin-based regularization [ren2020balanced, liu2016large] which assigns larger penalties (margins) to harder examples. For a given hyperparameter , PLM is defined as,
(4) |
where
(5) |
which uses GPD to calculate the additive margin.
An alternative is to reweigh the loss terms using the loss distribution. For a given hyperparameter , PLW is defined as,
(6) |
where
(7) |
which uses GPD to reweigh the loss of each sample.
3.3 Kurtosis Loss
Kurtosis measures the tailedness of a distribution as the scaled fourth moment about the mean. To increase the emphasis on tail examples, we use this measure to propose kurtosis loss. For a given hyperparameter and using the same notation as Sec.3.2 kurtosis loss is defined as,
(8) |
where is the contribution of an example to kurtosis for a batch
(9) |
where and
are the mean and standard deviation of the auxiliary loss (
) values for a batch of examples.We propose to use the auxiliary loss distribution to compute kurtosis, as performance metrics in forecasting tasks frequently involve versions of L1 or L2 distance such as RMSE, MAE, or ADE. The goal is to decrease the long tail for these metrics, which might not correlate well with the base loss . The example in Sec. 3.2 where is NLL loss and is MAE loss illustrates this requirement well.
Kurtosis loss and pareto loss are related approaches to handling long tail behavior. Pareto Loss is a weighted sum of moments about while kurtosis loss is the fourth moment about the mean. Let and , then the Taylor expansion for the GPD pdf from Eq.(3) is,
(10) |
For or equivalently or
, the coefficients are positive for even moments and negative for odd moments. Even moments are always symmetric and positive, while odd moments are positive only for right-tailed distributions. Since we use the negative of the pdf, it yields an asymmetric measure of the right tailedness of a value in the distribution.
Kurtosis loss uses the fourth moment about the distribution mean. This is a symmetric and positive measure, but in the context of right tailed distributions, kurtosis serves as a good measure of the long tailedness of the distribution. GPD and kurtosis are visualised in Appendix F
4 Experiments
We evaluate our methods on two probabilistic forecasting tasks: time series forecasting and trajectory prediction.
4.1 Setup
Datasets.
For time series forecasting, we use electricity and traffic datasets from the UCI ML repository [Dua:2019] used in [salinas2020deepar] as benchmarks. We also generate three synthetic 1D time series datasets, Sine, Gaussian and Pareto, to further our understanding of long tail behavior.
For trajectory prediction, we use two benchmark datasets: a pedestrian trajectory dataset ETH-UCY (which is a combination of ETH [pellegrini2009you] and UCY [lerner2007crowds] datasets) and a vehicle trajectory dataset nuScenes [caesar2020nuscenes]. Details regarding the datasets are available in Appendix A.
Baselines.
We compare with the following baselines representing SoTA in long tail mitigation for different tasks:
-
[itemsep=1mm, topsep=0mm]
-
Contrastive Loss: [makansi2021exposing] uses contrastive loss as a regularizer to group examples together based on Kalman filter prediction errors.
-
Focal Loss: [lin2017focal] uses L1 loss to reweigh loss terms.
-
Shrinkage Loss: [lu2018deep] uses a sigmoid-based function to reweigh loss terms.
-
Label Distribution Smoothing (LDS): [yang2021delving] uses symmetric kernel to smooth the label distribution and use its inverse to reweigh loss terms.
Focal Loss, Shrinkage Loss, and LDS were originally proposed for classification and/or regression and required adaptation in order to be applicable to the forecasting task. For details on baseline adaptation, please see Appendix B.
Evaluation Metrics.
We use two common metrics for the evaluation of trajectory prediction models: Average Displacement Error (ADE), which is the average L2 distance between total predicted trajectory and ground truth, and Final Displacement Error (FDE) which is the L2 distance for the final timestep. For time series forecasting, we use Normalized Deviation (ND) and Normalized Root Mean Squared Error (NRMSE).
Apart from the above-mentioned average performance metrics, we introduce metrics to capture performance on the tail. To measure the performance at tail of the distribution, we propose to adapt the Value-at-Risk (VaR Eq. (11)) metric:
(11) |
VaR at level is the smallest error
such that the probability of observing error larger than
is smaller than , where is the error distribution. This evaluates to the quantile of the error distribution. We propose to measure VaR at three different levels: , , and .In addition, we use skew, kurtosis, and max error to further assess the tail performance. Skew and Kurtosis as metrics are meaningful only when looked at in conjunction with the mean. A distribution with a higher mean and lower skew and kurtosis does not imply a less severe tail.
4.2 Synthetic Dataset Experiments
In order to better understand the long tail error distribution, we perform experiments on three synthetic datasets. The task is to forecast 8 steps ahead given a history of 8 time steps. We use AutoRegression (AR) and DeepAR [salinas2020deepar] as forecasting models to perform this task. The top row in Figure 3 shows that among the datasets, only Gaussian and Pareto show tail behavior in the data distribution. Pareto dataset in particular is the only one to display long tail behavior. AR and DeepAR have different error distribution across the datasets. Based on these results, we make the following hypotheses for the sources of long-tailedness.
Source 1: Long Tail in Data. The data distributions for Gaussian and Pareto datasets have similar tail behavior to the error distribution for both models, AR and DeepAR. This indicates that the long tail in data is a likely cause of long tail behavior in error. This connection is also well established as class imbalance for classification tasks [van2018inaturalist, liu2019large].
Source 2: Deep Learning Model. The results on the Sine dataset illustrate that even in the absence of long tail in the data, we can have long tail in the error distribution. The AR model, however, does not show long tail behavior for error. This indicates that the observed long tail behavior in error for DeepAR is model induced. We hypothesize that this is caused by DeepAR overfitting to simpler examples due to the nature of gradient based learning. Further results and analysis on these datasets can be found in Appendix H.
The difference between AR and DeepAR error distributions also suggests that assuming tail overlap between deep learning and non-deep learning methods (such as Kalman filter used by [makansi2021exposing]) might not generalize well.
Method | Metric | Mean | Max | Kurtosis | Skew | |||
---|---|---|---|---|---|---|---|---|
DeepAR | ND | 0.0584 | 0.0796 | 0.2312 | 0.4429 | 4.1520 | 426.5906 | 18.4057 |
NRMSE | 0.2953 | 0.0972 | 0.2595 | 0.5263 | 5.4950 | 470.8968 | 19.4827 | |
+ Contrastive Loss | ND | 0.0618 | 0.0872 | 0.2102 | 0.4274 | 4.0004 | 384.568 | 17.5604 |
NRMSE | 0.3062 | 0.1069 | 0.2481 | 0.5392 | 5.1606 | 415.3592 | 18.3051 | |
+ Focal Loss | ND | 0.0628 | 0.0853 | 0.2694 | 0.4398 | 4.3263 | 412.5172 | 18.0739 |
NRMSE | 0.3139 | 0.1052 | 0.3137 | 0.5297 | 5.7797 | 469.7605 | 19.3916 | |
+ Shrinkage Loss | ND | 0.0694 | 0.0956 | 0.2334 | 0.4446 | 4.4714 | 325.7401 | 16.3852 |
NRMSE | 0.3244 | 0.1156 | 0.2828 | 0.5177 | 5.4245 | 336.7777 | 16.5656 | |
+ LDS | ND | 0.0634 | 0.0890 | 0.2238 | 0.4925 | 3.8625 | 335.2523 | 16.2944 |
NRMSE | 0.2923 | 0.1149 | 0.2787 | 0.5458 | 4.9234 | 373.4702 | 17.1249 | |
+ Kurtosis Loss (Ours) | ND | 0.0567 | 0.0842 | 0.2151 | 0.4120 | 3.2738 | 300.3517 | 15.4597 |
NRMSE | 0.2631 | 0.1046 | 0.2732 | 0.4779 | 4.2613 | 339.3773 | 16.4892 | |
+ PLM (Ours) | ND | 0.0564 | 0.0799 | 0.1900 | 0.4164 | 3.4576 | 359.6645 | 16.9243 |
NRMSE | 0.2783 | 0.1000 | 0.2343 | 0.5102 | 4.7494 | 423.2319 | 18.3994 | |
+ PLW (Ours) | ND | 0.0578 | 0.0796 | 0.2121 | 0.3558 | 3.4647 | 329.0847 | 16.393 |
NRMSE | 0.2807 | 0.0984 | 0.2555 | 0.4809 | 4.6040 | 366.6818 | 17.3120 |
Method | Metric | Mean | Max | Kurtosis | Skew | |||
---|---|---|---|---|---|---|---|---|
DeepAR | ND | 0.1741 | 0.6866 | 25.5840 | 32.1330 | 84.1582 | 41.2804 | 6.1700 |
NRMSE | 0.4465 | 1.2283 | 6.0283 | 7.5988 | 18.8103 | 37.0089 | 5.7343 | |
+ Contrastive Loss | ND | 0.2052 | 0.7463 | 24.3737 | 30.5117 | 81.1716 | 42.1391 | 6.2282 |
NRMSE | 0.4667 | 1.2956 | 5.7747 | 7.2342 | 18.3360 | 36.4420 | 5.6834 | |
+ Focal Loss | ND | 0.4903 | 1.1553 | 26.7537 | 30.1506 | 52.8272 | 28.5912 | 5.4325 |
NRMSE | 0.7302 | 1.6485 | 6.5880 | 7.3660 | 13.7985 | 24.6181 | 4.9104 | |
+ Shrinkage Loss | ND | 0.2431 | 0.8380 | 25.3381 | 32.9147 | 85.2713 | 45.0172 | 6.3935 |
NRMSE | 0.5114 | 1.3099 | 6.0418 | 7.8882 | 19.0771 | 39.5592 | 5.8742 | |
+ LDS | ND | 0.4763 | 1.4781 | 28.9162 | 38.4263 | 126.5733 | 49.2714 | 6.5445 |
NRMSE | 0.7829 | 1.8702 | 6.8826 | 9.2061 | 27.3684 | 39.8322 | 5.7109 | |
+ Kurtosis Loss (Ours) | ND | 0.2022 | 0.7653 | 25.3752 | 31.4677 | 62.9173 | 35.298 | 5.8785 |
NRMSE | 0.4892 | 1.4072 | 6.0263 | 7.3369 | 13.7783 | 29.6338 | 5.2683 | |
+ PLM (Ours) | ND | 0.1594 | 0.7115 | 24.5911 | 30.331 | 90.3169 | 42.5373 | 6.1829 |
NRMSE | 0.4600 | 1.3881 | 5.6779 | 7.0033 | 20.5736 | 36.7518 | 5.6005 | |
+ PLW (Ours) | ND | 0.3751 | 1.0495 | 25.4471 | 31.6621 | 65.759 | 35.4836 | 5.8813 |
NRMSE | 0.6238 | 1.4914 | 6.0552 | 7.3491 | 13.8938 | 28.9214 | 5.1844 |
Method | Mean | Max | Kurtosis | Skew | |||
---|---|---|---|---|---|---|---|
Traj++ | 0.21/0.41 | 0.56/1.33 | 0.78/1.97 | 0.98/2.47 | 2.33/5.04 | 16.02/16.09 | 3.02/3.26 |
Traj++EWTA | 0.16/0.33 | 0.43/1.05 | 0.60/1.53 | 0.76/1.89 | 1.63/3.95 | 16.40/19.21 | 2.97/3.34 |
+ Contrastive | 0.17/0.34 | 0.43/1.03 | 0.62/1.56 | 0.79/1.89 | 1.67/4.02 | 16.37/18.51 | 2.96/3.35 |
+ Focal Loss | 0.16/0.32 | 0.40/0.89 | 0.54/1.28 | 0.66/1.57 | 1.50/3.50 | 14.95/17.80 | 2.74/3.18 |
+ Shrinkage Loss | 0.16/0.33 | 0.43/1.05 | 0.58/1.50 | 0.74/1.84 | 1.66/3.95 | 16.67/19.54 | 3.00/3.41 |
+ LDS | 0.17/0.35 | 0.44/1.04 | 0.57/1.45 | 0.78/1.86 | 1.69/3.85 | 19.80/19.12 | 3.18/3.39 |
+ Kurtosis Loss (ours) | 0.17/0.34 | 0.46/0.98 | 0.59/1.25 | 0.67/1.47 | 1.22/2.77 | 5.28/7.25 | 1.77/2.11 |
+ PLM (ours) | 0.16/0.30 | 0.38/0.81 | 0.52/1.20 | 0.63/1.49 | 1.30/3.20 | 12.01/16.90 | 2.41/3.04 |
+ PLW (ours) | 0.21/0.36 | 0.46/0.84 | 0.55/1.08 | 0.63/1.32 | 1.25/2.93 | 6.62/10.52 | 1.69/2.08 |
Method | Mean | Max | Kurtosis | Skew | |||
---|---|---|---|---|---|---|---|
Traj++ | 0.23/0.42 | 0.73/1.62 | 1.11/2.73 | 1.46/3.61 | 7.87/10.98 | 37.74/26.96 | 4.23/4.18 |
Traj++EWTA | 0.19/0.34 | 0.65/1.49 | 1.00/2.49 | 1.32/3.34 | 7.07/11.42 | 55.26/36.33 | 5.12/4.88 |
+ Contrastive | 0.19/0.35 | 0.65/1.51 | 1.01/2.58 | 1.36/3.46 | 6.82/10.48 | 52.62/32.32 | 5.07/4.71 |
+ Focal Loss | 0.19/0.33 | 0.56/1.09 | 0.85/1.95 | 1.11/2.65 | 6.55/11.71 | 60.48/53.60 | 5.14/5.55 |
+ Shrinkage Loss | 0.19/0.32 | 0.62/1.32 | 0.96/2.31 | 1.25/3.17 | 6.39/10.26 | 53.5/36.91 | 5.00/4.95 |
+ LDS | 0.19/0.32 | 0.62/1.26 | 0.94/2.23 | 1.20/2.99 | 5.20/10.53 | 46.71/40.00 | 4.75/5.08 |
+ Kurtosis Loss (ours) | 0.20/0.38 | 0.65/1.35 | 0.85/1.82 | 1.03/2.27 | 5.39/7.52 | 28.32/17.88 | 3.24/3.00 |
+ PLM (ours) | 0.19/0.33 | 0.62/1.32 | 0.95/2.31 | 1.25/3.18 | 6.10/10.96 | 46.43/37.63 | 4.71/4.96 |
+ PLW (ours) | 0.24/0.37 | 0.60/1.00 | 0.82/1.49 | 1.01/2.01 | 7.51/9.91 | 62.85/42.87 | 4.46/4.57 |
|
4.3 Real-World Experiments
Time Series Forecasting
We present average and tail metrics on ND and NRMSE for the time series forecasting task on electricity and traffic datasets in Tables 1 and 2 respectively. We use DeepAR [salinas2020deepar], one of the SoTA in probabilistic time series forecasting, as the base model. The task for both datasets is to use a 1-week history (168 hours) to forecast for 1 day (24 hours) at an hourly frequency. DeepAR exhibits long tail behavior in error on both datasets (refer Appendix G). The tail of the error distribution is significantly longer for the electricity dataset as compared to the traffic dataset.
Trajectory Forecasting
We present experimental results on ETH-UCY and nuScenes datasets in Tables 3 and 4 respectively. Following [salzmann2020trajectron++] and [makansi2021exposing] we calculate model performance based on the best out of 20 guesses. On both datasets we compare our approaches with current SoTA long-tail baseline methods using Trajectron++EWTA [makansi2021exposing] as a base model due to its SoTA average performance on these datasets. We include the Trajectron++ [salzmann2020trajectron++] results for reference as the previous state-of-the-art method to add a meaningful comparison to the magnitude of performance change obtained by each long tail method.
On performing a comparative analysis of tail lengths between datasets, we notice that trajectory datasets manifest shorter tails compared to 1D time series datasets. Our Pareto approaches work better on longer tails and for this reason we augment weight and margin for PLM and PLW with an additional Mean Squared Error weight term to internally elongate the tail during the training process.
4.4 Results Analysis
Cross-task consistency
As shown in Tables 3 and 4, our proposed approaches, kurtosis loss and PLM, are the only methods improving on tail metrics across all tasks while maintaining the average performance of the base model. Our tasks differ in representation (1D, 2D), severity of long-tail, base model loss function (GaussNLL, EWTA) and prediction horizon. This indicates that our methods generalize to diverse situations better than existing long-tail methods.
Long-tailedness across datasets
Using Eq. (12) as an indicative measure of the long-tailedness in error distribution, we establish the datasets as ETH-UCY, nuScenes, electricity, and traffic in long-tailedness for the base model (Details in Appendix E). We notice the connections between long-tailedness of the dataset and the performance of different methods.
(12) |
Re-weighting vs Regularization.
As mentioned in Section 3.2, we can categorize loss modifying methods into two classes: re-weighting (focal loss, shrinkage loss, LDS and PLW) and regularization (contrastive loss, PLM and kurtosis loss). Re-weighting multiplies the loss for more difficult examples with higher weights. Regularization adds higher regularization values for examples with higher loss.
We notice that re-weighting methods perform worse as the long-tailedness increases. In scenarios with longer tails, the weights of tail samples can be very high. Over-emphasizing tail examples hampers the learning for other samples. Shrinkage loss with a bounded weight limits this issue but fails to show tail improvements in longer tail scenarios. PLW is the best re-weighting method on most datasets, likely due to its bounded weights. Inconsistency in average performance is likely due to re-weighting nature of the loss which limits its applicability.
In contrast, regularization methods perform consistently across all tasks both on the tail and average metrics. The additive nature of regularization limits the adverse impact tail samples can have on the learning. This enables these methods to handle different long-tailednesses without degrading the average performance.
PLM vs Kurtosis loss.
Kurtosis loss generally performs better on extreme tail metrics, and Max. The bi-quadratic behavior of kurtosis puts higher emphasis on far-tail samples. Moreover, the magnitude of kurtosis varies significantly for different distributions, making the choice of hyperparameter (See Eq.(8)) critical. Further analysis on the same is available in Appendix D.
PLM is the most consistent method across all tasks improving on both tail and average metrics. As noted by [mcneil1997estimating] GPD is well suited to model long tail error distributions. PLM rewards examples moving away from the tail towards the mean with significantly lower margin values. PLM margin values saturate beyond a point in the tail providing similar improvements for subsequent tail samples. Visualization of PLM predictions for difficult tail examples can be seen in Fig. 4.
Kurtosis is sensitive to extreme samples in the tail, while PLM treats most samples in the tail similarly. This manifests in performance as kurtosis loss performing better on and Max, and PLM performing better on and .
This provides guidance on the choice of method as per the objective. Kurtosis Loss can improve the performance in worst case scenarios more significantly. PLM provides less drastic changes to the most extreme values, but it works more effectively throughout the entire distribution.
Tail error and long-term forecasting
Based on the trajectory forecasting results in Tables 3 and 4 we can see that error reduction for tail samples is more visible in FDE than ADE. This indicates that the magnitude of the observed error increases with the prediction horizon. The error accumulates through prediction steps making far-future predictions inherently more difficult. Larger improvements in the FDE indicate that both Kurtosis and Pareto loss ensure that high tail errors (stemming mostly from large, far-future prediction errors measured by FDE) are decreased.
The inadvertent direction of research in the forecasting domain is aiming at increasing the prediction horizon with high accuracy predictions. As we can see in Fig. 5, the effect of the tail examples is more pronounced with longer prediction horizons. Thus, methods addressing the tail performance will be necessary in order to ensure the practical applicability and reliability of future, long-term prediction.
5 Conclusion
We address the long-tail problem in deep probabilistic forecasting. We propose Pareto loss (Margin and Weighted) and Kurtosis loss, two novel moment-based loss function approaches increasing emphasis on learning tail examples. We demonstrate their practical effects on two spatiotemporal trajectory datasets and two time series datasets. Our methods achieve significant improvements on tail examples over existing baselines without degrading average performance. Both proposed losses can be integrated with existing approaches in deep probabilistic forecasting to improve their performance on difficult and challenging scenarios.
Future directions include more principled ways to tune hyperparameters, new approaches to mitigate long tail for long-term forecasting and application to more complex tasks like video prediction. Based on our observations, we suggest evaluating additional tail performance metrics apart from average performance in machine learning task to identify potential long tail issues across different tasks and domains.
Acknowledgments
This work was supported in part by U.S. Department Of Energy, Office of Science, U. S. Army Research Office under Grant W911NF-20-1-0334, Facebook Data Science Award, Google Faculty Award, and NSF Grant #2037745.
References
Appendix A Dataset description
The ETH-UCY dataset consists of five subdatasets, each with Bird’s-Eye-Views: ETH, Hotel, Univ, Zara1, and Zara2. As is common in the literature [makansi2021exposing, salzmann2020trajectron++] we present macro-averaged 5-fold cross-validation results in our experiment section. The nuScenes dataset includes 1000 scenes of 20 second length for vehicle trajectories recorded in Boston and Singapore.
The electricity dataset contains electricity consumption data for 370 homes over the period of Jan 1st, 2011 to Dec 31st, 2014 at a sampling interval of 15 minutes. We use the data from Jan 1st, 2011 to Aug 31st, 2011 for training and data from Sep 1st, 2011 to Sep 7th, 2011 for testing. The traffic dataset consists of occupancy values recorded by 963 sensors at a sampling interval of 10 minutes ranging from Jan 1st, 2008 to Mar 30th, 2009. We use data from Jan 1st, 2008 to Jun 15th, 2008 for training and data from Jun 16th, 2008 to Jul 15th, 2008 for testing. Both time series datasets are downsampled to 1 hour for generating examples.
The synthetic datasets are generated as 100 different time series consisting of 960 time steps. Each time series in the Sine dataset is generated using a random offset and a random frequency
both selected from a uniform distribution
. Then the time series is whereis the index of the time step. Gaussian and Pareto datasets are generated as order 1 lag autoregressive time series with randomly sampled Gaussian and Pareto noise respectively. Gaussian noise is sampled from a Gaussian distribution with mean 1 and standard deviation 1. Pareto noise is randomly sampled from a Pareto distribution with shape 10 and scaling 1.
Appendix B Method adaptation
Time Series forecasting
DeepAR uses Gaussian Negative Log Likelihood as the loss which is unbounded. Due to this many baseline methods need to be adapted in order to be usable. For the same reason, we also need an auxiliary loss (). We use MAE loss to fit the GPD, calculate kurtosis, and to calculate the weight terms for Focal and Shrinkage loss. For LDS we treat all labels across time steps as a part of a single distribution. Additionally, to avoid extremely high weights () in LDS due to the nature of long tail we ensure a minimum probability of for all labels.
Trajectory forecasting
We adapt Focal Loss and Shrinkage Loss to use EWTA loss [makansi2019overcoming] in order to be compatible with Trajectron++EWTA base model. LDS was originally proposed for a regression task and we adapt it to the trajectory prediction task in the same way as for the time series task. We use MAE to fit the GPD, due to the Evolving property of EWTA loss.
Appendix C Implementation details
Time Series forecasting
We use the DeepAR implementation from https://github.com/zhykoties/TimeSeries as the base code to run all time series experiments. The original code is an AWS API and not publicly available. The implementation of contrastive loss is taken directly from the source code of [makansi2021exposing].
Trajectory forecasting
For all tested base methods in the trajectory forecasting experiments (Trajectron++ [salzmann2020trajectron++] and Trajectron++EWTA [makansi2021exposing]) we have used the original implementations provided by the original authors of each method. The implementation of contrastive loss is taken directly from the source code of [makansi2021exposing].
The experiments have been conducted on a machine with 7 RTX 2080 Ti GPUs.
Appendix D Hyperparameter Tuning
We observe during our experiments that the performance of kurtosis loss is highly dependent on the hyperparameter (See Eq. (8)). Results for different values of on the electricity dataset for kurtosis are shown in Table5. We also show the variation of ND and NRMSE with the hyperparameter value in Figure 6. We can see that there is an optimal value of the hyperparameter and the approach performs worse with higher and lower values.
For both ETH-UCY and nuScenes datasets we have used for Kurtosis loss, and for PLM and PLW. For both electricity and traffic datasets, we use for PLM, for PLW and for Kurtosis loss.
Method | Metric | Mean | Max | Kurtosis | Skew | |||
---|---|---|---|---|---|---|---|---|
DeepAR | ND | 0.0584 | 0.0796 | 0.2312 | 0.4429 | 4.1520 | 426.5906 | 18.4057 |
NRMSE | 0.2953 | 0.0972 | 0.2595 | 0.5263 | 5.4950 | 470.8968 | 19.4827 | |
+ Kurtosis Loss [0.001] | ND | 0.0581 | 0.0815 | 0.2087 | 0.3936 | 4.2381 | 488.7306 | 19.8207 |
NRMSE | 0.3046 | 0.1014 | 0.2325 | 0.4756 | 5.7144 | 529.7499 | 20.7713 | |
+ Kurtosis Loss [0.005] | ND | 0.0574 | 0.0767 | 0.2147 | 0.4138 | 3.6767 | 351.3378 | 16.7597 |
NRMSE | 0.2843 | 0.0999 | 0.2617 | 0.4792 | 5.0062 | 417.0575 | 18.3039 | |
+ Kurtosis Loss [0.01] | ND | 0.0567 | 0.0842 | 0.2151 | 0.4120 | 3.2738 | 300.3517 | 15.4597 |
NRMSE | 0.2631 | 0.1046 | 0.2732 | 0.4779 | 4.2613 | 339.3773 | 16.4892 | |
+ Kurtosis Loss [0.1] | ND | 0.0677 | 0.0954 | 0.2269 | 0.4579 | 3.8772 | 312.7331 | 16.0062 |
NRMSE | 0.3073 | 0.1184 | 0.2768 | 0.5419 | 5.1345 | 334.8358 | 16.3366 |
![]() |
![]() |
Appendix E Long tail severity
In Table 6 we present the numerical values representing the approximate long-tailedness for each of the datasets. Larger value indicates a longer tail.
Dataset | Metric | Long-tailedness |
---|---|---|
Electricity | ND | 15.56 |
Traffic | ND | 45.08 |
ETH-UCY | FDE | 7.96 |
nuScenes | FDE | 10.81 |
Appendix F Pareto and Kurtosis
![]() |
![]() |
Figure 7 illustrates different GPDs for different shape parameter values. Higher shape value models more severe tail behavior.
Appendix G Long tail error distribution
In Fig. 8 we can see log-log plots of the error distributions of base model for each of the datasets. We can see each distribution exhibits a long tail behavior.
Appendix H Synthetic datasets
We present complete results of our experiments on the synthetic datasets in Table 7. We ran our methods, kurtosis loss, and PLM on these datasets as well. Both our methods show significant tail improvements over the base model across all datasets.
Method | Metric | Mean | Max | Kurtosis | Skew | |||
---|---|---|---|---|---|---|---|---|
Sine Dataset | ||||||||
AutoReg | ND | 1.2255 | 2.162 | 2.7088 | 2.9306 | 3.1271 | -0.1565 | 0.1905 |
NRMSE | 1.5078 | 2.3134 | 2.7204 | 2.9379 | 3.1271 | -0.56 | -0.0369 | |
DeepAR | ND | 0.0513 | 0.1721 | 0.316 | 0.5913 | 1.5744 | 71.9164 | 7.9019 |
NRMSE | 0.1534 | 0.2009 | 0.3507 | 0.6199 | 1.654 | 64.4497 | 7.4304 | |
+ Kurtosis Loss | ND | 0.0455 | 0.1412 | 0.2914 | 0.447 | 1.5571 | 90.6313 | 8.6956 |
NRMSE | 0.133 | 0.1624 | 0.3455 | 0.5387 | 1.5571 | 76.7183 | 7.9383 | |
+ Pareto Loss | ND | 0.0462 | 0.1326 | 0.3014 | 0.7151 | 1.582 | 78.6768 | 8.4086 |
Margin | NRMSE | 0.1517 | 0.1563 | 0.3551 | 0.737 | 1.7522 | 72.0235 | 7.9663 |
Gaussian Dataset | ||||||||
AutoReg | ND | 0.573 | 1.0225 | 1.3334 | 1.6226 | 27.6956 | 845.0732 | 26.4337 |
NRMSE | 1.2705 | 1.1212 | 1.4045 | 1.6815 | 39.7474 | 1010.198 | 29.748 | |
DeepAR | ND | 0.4379 | 0.705 | 0.7908 | 0.8651 | 1.1362 | 0.8225 | 0.7469 |
NRMSE | 0.5518 | 0.8172 | 0.9246 | 0.9908 | 1.3009 | 0.5562 | 0.65 | |
+ Kurtosis Loss | ND | 0.4378 | 0.704 | 0.7973 | 0.8597 | 1.1294 | 0.8037 | 0.7418 |
NRMSE | 0.5518 | 0.8191 | 0.9255 | 0.9865 | 1.2951 | 0.539 | 0.6449 | |
+ Pareto Loss | ND | 0.4391 | 0.7023 | 0.7946 | 0.8674 | 1.1069 | 0.7813 | 0.7352 |
Margin | NRMSE | 0.5534 | 0.8194 | 0.9232 | 0.9889 | 1.2786 | 0.4985 | 0.6333 |
Pareto Dataset | ||||||||
AutoReg | ND | 1.9377 | 1.1748 | 1.7039 | 2.4782 | 2113.7503 | 2116.5018 | 44.2477 |
NRMSE | 81.1652 | 1.4027 | 1.9856 | 2.7312 | 4069.3972 | 2204.8078 | 45.3437 | |
DeepAR | ND | 0.4416 | 0.8336 | 1.0317 | 1.1763 | 2.015 | 6.9242 | 2.036 |
NRMSE | 0.6349 | 1.1511 | 1.4295 | 1.6688 | 2.8327 | 7.0681 | 2.1547 | |
+ Kurtosis Loss | ND | 0.4413 | 0.8345 | 1.0295 | 1.1738 | 2.0326 | 6.8831 | 2.0318 |
NRMSE | 0.6352 | 1.1541 | 1.4305 | 1.6653 | 2.8335 | 6.9941 | 2.144 | |
+ Pareto Loss | ND | 0.4394 | 0.8497 | 1.0473 | 1.1955 | 2.086 | 6.6526 | 2.0185 |
Margin | NRMSE | 0.6397 | 1.1694 | 1.447 | 1.6735 | 2.845 | 6.4693 | 2.0711 |