1 Introduction
Due to recent advances in artificial intelligence research, time-series forecasting is increasingly tackled with machine learning and deep learning techniques. A large number of approaches have been suggested, ranging from relatively simple machine learning models [1] to a variety of deep learning models [12]. These approaches have been used in a broad spectrum of forecasting tasks, such as wind power forecasting, stock market prediction and motor temperature prediction [12]. In these tasks, as well as in many other applied cases, some samples are more crucial from the user's point of view and thus require a more accurate prediction than the model delivers on average. At the same time, those data points may be scarce in the training data. Hence, if the issue is left unattended, performance on that data might be worse than average, which is highly undesirable. This issue is characterized as imbalanced regression, and so far it has been addressed with data preprocessing or ensemble model methods [3].
Despite the existing methods for tackling imbalanced regression problems, it is still a nontrivial task for data scientists and machine learning practitioners to identify and solve them in real-life time-series forecasting contexts. In the effort to develop the best performing machine learning model and minimize the error across all data points, some important artifacts in the data might be overlooked. Moreover, data sampling methods and ensemble model approaches [4][3] have so far focused on minimizing the prediction error on the underrepresented data samples, assuming that the remaining data is negligible. That assumption is inaccurate: in some applications, for example in stock market prediction, the cost of a larger forecasting error in the more frequent cases could, in the long run, offset the potential benefit of a smaller error in a rare case. In order to tackle real-world applications, there is a need for a broader, balanced, flexible and iterative approach, honed through interaction with domain experts and integrating the latest research in predictive models.
In this paper we propose such an approach that has been designed based on a case study in a large industrial company, targeted to forecast the temperature of a core component in a large production line. The approach involves three steps: first selecting a weight function which quantifies the sample importance; then applying one or more sampling methods to the data; and finally training and evaluating the model with and without sampling. In the last step, we also analyze the input importance learnt by the model using SHAP[7] to gain insights about the effect of the imbalance. To exemplify our approach, we show how it is used for the aforementioned industrial task. We study the impact of choices in each of the steps, comparing different sampling techniques and deep learning models. In the end, we also combine the sampling with attention mechanisms [13] to extract insights of what is learned by a deep learning model.
2 Related Work
The advancements in data availability from a plethora of sources, the increasing computational capacity and the progress in artificial intelligence research have led to the use of machine and deep learning models across a multitude of time-series forecasting applications. Previous work [12] surveys several use cases, including wind power forecasting, stock price prediction and estimation of the remaining useful life of motors. Despite this widespread use of models, the majority of works present specific architectures for specific datasets, while works focusing on integrated frameworks, in the sense of structured approaches to a more generalized problem, are scarcer.
Although there has been an extensive amount of work on handling imbalanced datasets in classification tasks [3], regression with imbalanced data has not been widely covered in the machine and deep learning literature. In [4], Branco et al. study the effect of three proposed sampling methods on the predictive performance of machine learning models. In an applied example [11], the objective is to predict high water flows in a timely and accurate manner, and the problem is addressed with various sampling and ensemble methods, with an artificial neural network as the base model. However, there is still a need for research into a structured approach to real-world imbalanced regression problems, especially in the context of state-of-the-art deep learning models.
3 Methodology
It is common in time-series forecasting that certain data samples are more important than others but are also underrepresented in the dataset, resulting in an imbalanced regression problem. In this section we present a new approach for identifying and tackling such imbalances in the context of regression tasks. This approach uses a weight function that quantifies the importance of each sample, which is combined with undersampling methods to create a more balanced dataset. The newly sampled dataset is then evaluated first visually, by making density plots of the data, and then numerically, by using it for training and testing a predictive model.
3.1 Steps for Identifying and Treating Imbalanced Regression
We propose a set of general steps for approaching imbalance in time-series forecasting problems, defined based on the experience we have collected in applying machine learning in a large-scale industrial company. The approach consists of three steps, illustrated in Figure 1. The first step is to select or define a weight function to quantify the sample importance, which allows us to identify and compare the different regions of interest in the data. The second step consists of selecting one or more sampling methods based on the weight function, applying them to the data and comparing the resulting distribution against the original data using density plots. Finally, in the third step, a predictive model is trained and evaluated using both the sampled and the original versions of the dataset. A feedback loop can take the user from step 3 back to step 2 if the current combination of the selected sampling methods and models does not provide satisfactory performance after evaluation. We provide more details about each step in the following sections.
3.1.1 Step 1: Weight Function Definition
In the example of forecasting the temperature of a motor, there can be several days where the temperature is stable, with only small fluctuations, and only a few days where the temperature increases or decreases substantially. Let us assume a use case where the user is interested in building a model that accurately predicts these rarer moments when the temperature changes more than usual. In such an example, the daily temperature variation can be computed as a function of the data, which we refer to in our framework as the weight function. In addition, we say that the data points mapped to a high variation by the weight function belong to a region of interest.
In general, the weight function can depend only on the target variable (or a transformation of it), only on a subset of the input variables (e.g. to signify working points of interest), or on a combination of input and output variables, to express complex regions of interest. It can be written as $w(x, y)$, where $x$ and $y$ refer to the input (a vector for multivariate input) and the prediction target, respectively.
Equation 1 gives an example of such a function for the target variation in the context of time-series forecasting. The weight $w_t$ models the variation of the forecast target $y$ over the forecast horizon $h$ at time step $t$.
$$w_t = \lvert y_{t+h} - y_t \rvert \qquad (1)$$
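The target-variation weight can be illustrated with a short NumPy sketch. It assumes the weight is the absolute change of the target over the horizon, which is one reading of the description above; the function name `target_variation_weight` and the toy temperature series are hypothetical.

```python
import numpy as np


def target_variation_weight(y, horizon):
    """Absolute change of the target over the next `horizon` steps.

    Entry t of the result is |y[t + horizon] - y[t]|, so the output has
    len(y) - horizon entries (the last `horizon` steps have no future value).
    """
    y = np.asarray(y, dtype=float)
    if horizon <= 0 or horizon >= len(y):
        raise ValueError("horizon must be in [1, len(y) - 1]")
    return np.abs(y[horizon:] - y[:-horizon])


# Toy series (made up): stable at first, then a fast rise.
temps = np.array([50.0, 50.0, 50.5, 52.0, 55.0, 55.0])
weights = target_variation_weight(temps, horizon=2)
```

Samples whose weight is large (here the ones covering the rise from 50.5 to 55.0) would fall into the region of interest.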
3.1.2 Step 2: Application of Sampling Method
At this step, a sampling method is applied to the dataset and its effects are analyzed. The sampling is based on the weight function previously selected and the relative proportions that the user wants to keep for the different regions of interest. We identify three scenarios for this step and we propose an undersampling method for each one of them.

Threshold undersampling: when the user identifies which region of the weight function is not important for the forecast task and can be removed;

Stochastic undersampling: when some regions are more important than others, but none can be entirely discarded;

Inverse histogram undersampling: when all regions are equally important and the user wants to have a balanced distribution of data over all of them.
Threshold undersampling (TUS) consists of removing all samples that lie below a given threshold of the weight function, so that all the remaining samples have the same chance of being selected. This method is suitable for cases where the user knows exactly what the region of interest is and can therefore select the best threshold; it assumes that the samples below the threshold are not interesting for the prediction task. Equation 2 expresses the chance of sampling data point $x_t$ given its weight $w_t$, where $\theta$ denotes the threshold.
$$P(x_t) \propto \begin{cases} 1 & \text{if } w_t \geq \theta \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$
Stochastic undersampling (SUS) uses the weight $w_t$ computed for each data point as the probability of sampling it. Unlike TUS, SUS allows every sample, even those with a lower weight, to be sampled, which avoids creating a new imbalance against those samples. Equation 3 models the relative probability of sampling a window $x_t$ from the dataset at time $t$ using SUS. The factor $\alpha$ is used to increase (or decrease) the effect of the weights, thus emphasizing the more interesting moments, which might be underrepresented in the data.
$$P(x_t) \propto w_t^{\alpha} \qquad (3)$$
Inverse histogram undersampling (IHS) is an automatic method to obtain a sample in which the data is approximately uniformly distributed across the selected weight function $w$. It consists of building a histogram of the values of $w_t$ in the dataset and taking the inverse of the frequency of each value as the chance of sampling it. It ensures that each $w_t$ is undersampled proportionally to its original frequency, so the most common values have a lower chance of being sampled, while the rarer values have a higher chance. Equation 4 formalizes the method, where $\mathrm{freq}(w_t)$ represents the frequency of $w_t$ in the data histogram.
$$P(x_t) \propto \frac{1}{\mathrm{freq}(w_t)} \qquad (4)$$
A good way to gain insight into the data and the result of the sampling method is to compare the density plot of the weight function with and without sampling. Such a plot can show the regions which are over- or underrepresented in the data, and can also give insights about how to tune the parameters of the selected method or which method should be selected.
In some real-life cases, it might not be easy to infer directly which of the three scenarios fits the problem better. In those cases, a subset of these methods can be selected for the next step, where we provide a heuristic to select the final method.
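The three undersampling schemes described in this step can be sketched as sampling-probability functions. This is a minimal illustration, not the authors' implementation: it assumes non-negative weights, treats "frequency" in IHS as the count of the histogram bin a weight falls into, and all function names and the toy weights are hypothetical.

```python
import numpy as np


def tus_probabilities(weights, threshold):
    """Threshold undersampling: uniform over samples at or above the threshold."""
    keep = (np.asarray(weights, dtype=float) >= threshold).astype(float)
    if keep.sum() == 0:
        raise ValueError("no sample reaches the threshold")
    return keep / keep.sum()


def sus_probabilities(weights, factor=1.0):
    """Stochastic undersampling: probability proportional to weight ** factor.

    Note that a weight of exactly zero yields a probability of zero.
    """
    powered = np.power(np.asarray(weights, dtype=float), factor)
    return powered / powered.sum()


def ihs_probabilities(weights, bins="fd"):
    """Inverse histogram undersampling: probability proportional to 1 / bin count."""
    weights = np.asarray(weights, dtype=float)
    edges = np.histogram_bin_edges(weights, bins=bins)
    # Bin index of every sample; using only the interior edges keeps the
    # indices in [0, n_bins - 1], with the maximum falling in the last bin.
    bin_index = np.digitize(weights, edges[1:-1])
    counts = np.bincount(bin_index, minlength=len(edges) - 1)
    inverse = 1.0 / counts[bin_index]
    return inverse / inverse.sum()


def undersample(probabilities, n_samples, rng):
    """Draw distinct sample indices according to the given probabilities."""
    return rng.choice(len(probabilities), size=n_samples, replace=False, p=probabilities)


# Toy weights (made up): six stable samples and two with high variation.
weights = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 2.0, 4.0])
rng = np.random.default_rng(0)
chosen = undersample(tus_probabilities(weights, threshold=1.0), n_samples=2, rng=rng)
```

With `bins="fd"` NumPy picks the bin width with the Freedman-Diaconis rule; any other binning accepted by `np.histogram_bin_edges` can be passed instead. Under IHS every non-empty bin receives the same total probability mass, which is what makes the resulting sample approximately uniform over the weight values.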
3.1.3 Step 3: Predictive Model Training and Evaluation
Finally, at this step, we can assess how much a predictive model improves by using the selected sampling methods. For that, we train and evaluate the model with and without using the sampling method on a separate evaluation set, so we end up with different combinations of training and evaluation sets which we will use to contrast the obtained evaluation errors and determine if the model benefits from the sampling. For the cases where the goal is to have a model that performs well on the samples of higher weight without sacrificing the performance on the rest of the data, we propose a heuristic for selecting the final sampling method to train the predictive model based on the results of the different evaluation sets. It is defined as:

For each sampling method, sample a training set and train a model with it;

For each sampling method, sample a separate evaluation set and evaluate the trained model on it;

For each trained model, take the highest error over all the evaluation sets as an upper bound on its RMSE;

Select the model with the lowest error in the list.
In addition to studying the impact of the sampling on the performance of the models, we also propose to study how the models themselves change by using SHAP [7], which is a model-agnostic technique. SHAP gives the relative importance of each input feature to the output of the model, which can be compared between models trained with and without the sampling.
Furthermore, we take advantage of deep learning models with attention mechanisms [13] to gain extra insights into what is learnt. As an example of an attention-based model, TACN [8] is a deep neural network model that provides the importance of the input time series across time steps through an attention mechanism. The change in the patterns shown by the mechanism also provides insights into the effects of sampling on the model.
4 Experimental Setup
To give a real-life example of our approach, in this section we present a case study based on a motor temperature prediction dataset. We also explain the techniques used at each step of the experiments and why they were chosen.
4.1 Motor Temperature Dataset
The dataset used in this experiment is made of sensor measurements extracted from a steel processing conveyor belt. The prediction target is the temperature of a bridle motor, which should be forecasted 5 minutes in advance to allow the operators to take preventive actions before a possible overheat. The rest of the data consist of properties of the steel strip (i.e. width, thickness and yield), the speed of the line, the tension applied by the bridles, the current temperature measurements of the motor, among others. The sensors are sampled every 10 seconds, and there are in total about 2 million samples.
In this dataset, we identify the temperature variation as a special property regarding the prediction target. We analyze the dataset based on this property and follow the steps of our framework: selecting a weight function, then selecting the sampling methods, and visualizing the sampling result.
4.2 Instantiation of the Framework
Here we describe the choices made at each step of the framework for analyzing the imbalance of the temperature variation.
4.2.1 Step 1 - Temperature Variation Weight Function
The temperature variation is an important property for this forecasting task, since the predictive model must accurately predict when the temperature will rise. Even if, on average, the model has satisfactory performance, it may still be inaccurate when predicting higher variation if the dataset is imbalanced. So for step 1 of the framework, we select the temperature variation as the weight function, which is modelled by Equation 1, using a forecast horizon of 30 time steps (5 minutes).
4.2.2 Step 2 - Sampling Method Choice
For step 2, we experiment with three sampling methods: SUS with factor 1, SUS with factor 3 and IHS. Each one undersamples a different amount of low temperature variation data, creating a different balance, as shown in Figure 2a. SUS with factors 1 and 3 are chosen to compare the effect of the factor on the proportion of data samples with low and high variation. For the IHS method, we use the Freedman-Diaconis estimator [5] to compute the bin width of the histogram. 10,000 training data samples are extracted using each method.
4.2.3 Step 3 - Predictive Model Choices
In our experiments, we choose a multilayer perceptron (MLP) [9], which has been used in time-series forecasting [1], as a deep neural network baseline, together with three deep neural networks specialized in temporal data. These specialized architectures are the long short-term memory (LSTM) [6], a popular recurrent neural network; the temporal convolutional network (TCN) [2], a sequence-to-sequence model which has shown promise when trained on a large amount of data [10]; and the temporal attention convolutional network (TACN) [8].
The TACN is an architecture which combines a TCN with an attention mechanism [13] to achieve interpretable and accurate forecasting. The per-instance interpretability comes in the form of a vector, with length equal to the input window size, which shows the importance of each input step to the forecasting output. The higher the value of the vector at a specific step, the higher the contribution of the input value at that step to the final output. By scaling the vector to the 0-1 range, we can estimate the relative importance among the input steps. Although this vector is produced per instance, we can draw conclusions about the generic learned behavior of the model by collecting and analyzing the vectors from a large number of instances.
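The scaling and aggregation of the importance vectors can be sketched as follows. This assumes the per-instance vectors have already been extracted from the model as plain arrays (how a TACN exposes them is not covered here), uses min-max scaling as one way to reach the 0-1 range, and uses hypothetical function names and made-up example vectors.

```python
import numpy as np


def scale_to_unit_range(vector):
    """Min-max scale one importance vector to the 0-1 range."""
    vector = np.asarray(vector, dtype=float)
    low, high = vector.min(), vector.max()
    if high == low:
        # A constant vector carries no relative ranking between steps.
        return np.zeros_like(vector)
    return (vector - low) / (high - low)


def mean_scaled_importance(vectors):
    """Average the scaled vectors of many instances, one value per input step."""
    scaled = np.stack([scale_to_unit_range(v) for v in vectors])
    return scaled.mean(axis=0)


# Two made-up importance vectors over a window of three input steps.
example = np.array([[2.0, 4.0, 6.0], [1.0, 1.0, 3.0]])
profile = mean_scaled_importance(example)
```

Averaging is only one possible summary; other aggregates over the collected vectors could equally be used to study the generic behavior of the model.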
For data preprocessing, we extract a window of 5 minutes (or 30 time steps) for each sample, which is the input for the TCN, LSTM, and TACN models. For the MLP model, we extract basic features of each sensor, such as the mean, standard deviation, minimum and maximum values, for each window. We also keep the last time step as an additional feature and for later analysis of the temperature variation case. All the models are evaluated using the root-mean-square error (RMSE) metric.
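A minimal sketch of this preprocessing, assuming the sensor data is available as a 2-D array with one row per time step and one column per sensor. The function names and the toy array are hypothetical, and the exact feature set used in the experiments may differ.

```python
import numpy as np


def sliding_windows(series, window):
    """Turn a (T, n_sensors) array into (T - window + 1, window, n_sensors) windows."""
    series = np.asarray(series, dtype=float)
    view = np.lib.stride_tricks.sliding_window_view(series, window, axis=0)
    # sliding_window_view appends the window axis last; move it to position 1.
    return np.moveaxis(view, -1, 1)


def mlp_features(windows):
    """Per-sensor mean, std, min, max and last value of every window."""
    stats = [
        windows.mean(axis=1),
        windows.std(axis=1),
        windows.min(axis=1),
        windows.max(axis=1),
        windows[:, -1, :],  # last time step of the window
    ]
    return np.concatenate(stats, axis=1)


def rmse(y_true, y_pred):
    """Root-mean-square error between targets and predictions."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Toy data: 6 time steps of 2 sensors, windows of 3 steps (30 in the experiments).
series = np.arange(12).reshape(6, 2)
windows = sliding_windows(series, window=3)
features = mlp_features(windows)
```

The windows themselves would feed the TCN, LSTM and TACN, while the flattened statistics would feed the MLP.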
5 Results
In this section we describe the results obtained after applying our framework starting from step 2. Step 1 is already defined in Section 4.
5.1 Step 2 - Comparison of the Sampling Methods
Figure 2a shows the variation distribution after applying the sampling methods. Without any sampling, the dataset has a strong bias towards samples with variation close to zero, meaning that the temperature is stable or varies very slowly most of the time. SUS with factor 3 gives more emphasis to samples with higher variation, while significantly reducing the number of samples with lower variation. The sampling using SUS with factor 1, on the other hand, is more conservative and preserves a considerable number of samples with low variation. Finally, IHS gives the best balance across all values, and is the one which gives the highest proportion of samples at the extreme of the temperature variation spectrum (above 6 degrees in Figure 2a).
5.2 Step 3 - Analysis of the Results
The results of the four models trained and tested with the selected sampling methods based on temperature variation can be seen in Table 1, with the lowest error per evaluation set highlighted. The effect of the imbalance of the original data distribution is clearly shown in the "None" rows, where the models were trained without sampling. For those rows, the RMSE is much higher in the SUS 3 column, where there is a smaller number of samples with low variation, suggesting that the models are biased towards low variation samples if trained without sampling methods.
On the other hand, these results show that there is a tradeoff between favoring samples with and without temperature variation. Models trained with a more aggressive kind of sampling, such as SUS with factor 3, have a much higher error when evaluated on the unsampled data than the models trained with SUS factor 1, for example. This can be explained by the difference in density of the samples with low variation (below 2.5 degrees in Figure 2a), the same samples that are more common in the "no sampling" dataset. With our approach, this tradeoff, which exhibits nonlinear behavior, can be estimated, taking into account the end-user preferences, and it can lead to a re-evaluation of the sampling method in step 2. Also, together with these metrics, insights about the model, as described later in this subsection, can indicate the sampling method that leads to the most encompassing, generalizable patterns learned by the models, thus creating a balance in performance across data samples.
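Table 1. RMSE of each model, by training set (rows) and evaluation set (columns).

| Model | Trained on | Evaluated on: None | SUS 1 | SUS 3 | IHS |
|-------|------------|--------------------|-------|-------|-----|
| MLP | None | 1.401 ± 0.181 | 2.270 ± 0.177 | 3.611 ± 0.322 | 2.984 ± 0.241 |
| MLP | SUS 1 | 1.704 ± 0.289 | 2.085 ± 0.239 | 2.886 ± 0.269 | 2.495 ± 0.234 |
| MLP | SUS 3 | 4.150 ± 0.635 | 3.089 ± 0.390 | 2.430 ± 0.305 | 2.836 ± 0.32 |
| MLP | IHS | 3.066 ± 0.401 | 2.728 ± 0.228 | 2.539 ± 0.248 | 2.663 ± 0.217 |
| LSTM | None | 1.032 ± 0.091 | 1.857 ± 0.112 | 3.275 ± 0.129 | 3.275 ± 0.13 |
| LSTM | SUS 1 | 1.283 ± 0.153 | 1.469 ± 0.123 | 2.769 ± 0.130 | 2.731 ± 0.094 |
| LSTM | SUS 3 | 3.595 ± 0.268 | 2.549 ± 0.128 | 1.464 ± 0.24 | 2.415 ± 0.095 |
| LSTM | IHS | 2.728 ± 0.174 | 2.131 ± 0.093 | 1.863 ± 0.106 | 2.27 ± 0.093 |
| TCN | None | 0.871 ± 0.021 | 1.684 ± 0.037 | 3.060 ± 0.068 | 3.142 ± 0.05 |
| TCN | SUS 1 | 1.007 ± 0.079 | 1.462 ± 0.041 | 2.686 ± 0.066 | 2.703 ± 0.063 |
| TCN | SUS 3 | 3.41 ± 0.213 | 2.4 ± 0.091 | 1.592 ± 0.124 | 2.283 ± 0.008 |
| TCN | IHS | 2.579 ± 0.231 | 2.016 ± 0.091 | 1.845 ± 0.062 | 2.145 ± 0.039 |
| TACN | None | 1.171 ± 0.016 | 2.637 ± 0.051 | 5.167 ± 0.1 | 5.064 ± 0.101 |
| TACN | SUS 1 | 1.334 ± 0.242 | 2.077 ± 0.355 | 3.878 ± 0.675 | 3.183 ± 0.694 |
| TACN | SUS 3 | 3.802 ± 0.538 | 2.786 ± 0.428 | 2.36 ± 0.758 | 2.883 ± 0.547 |
| TACN | IHS | 3.093 ± 0.918 | 2.424 ± 0.608 | 2.430 ± 0.748 | 2.752 ± 0.623 |
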
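Table 2. Maximum RMSE of the TCN over all evaluation sets, per training set.

| Trained on | Max. error | Measured on |
|------------|------------|-------------|
| No sampling | 3.142 ± 0.05 | IHS |
| SUS 1 | 2.703 ± 0.063 | IHS |
| SUS 3 | 3.41 ± 0.213 | No sampling |
| IHS | 2.579 ± 0.231 | No sampling |
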
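The worst-case selection behind Table 2 can be reproduced from the TCN rows of Table 1. The sketch below uses only the first reported RMSE value of each cell; the dictionary layout and the names `worst_case` and `ranking` are illustrative choices, not part of the original experiments.

```python
# First RMSE value of each TCN cell in Table 1: rmse[trained_on][evaluated_on].
rmse = {
    "None":  {"None": 0.871, "SUS 1": 1.684, "SUS 3": 3.060, "IHS": 3.142},
    "SUS 1": {"None": 1.007, "SUS 1": 1.462, "SUS 3": 2.686, "IHS": 2.703},
    "SUS 3": {"None": 3.41,  "SUS 1": 2.4,   "SUS 3": 1.592, "IHS": 2.283},
    "IHS":   {"None": 2.579, "SUS 1": 2.016, "SUS 3": 1.845, "IHS": 2.145},
}


def worst_case(errors_by_eval_set):
    """Return (evaluation set, error) for the highest error of one trained model."""
    eval_set = max(errors_by_eval_set, key=errors_by_eval_set.get)
    return eval_set, errors_by_eval_set[eval_set]


# Upper bound on the error of each trained model (the content of Table 2).
worst = {trained_on: worst_case(errors) for trained_on, errors in rmse.items()}

# Training sets ordered from lowest to highest worst-case error.
ranking = sorted(worst, key=lambda trained_on: worst[trained_on][1])
```

Taken literally, the minimum of the upper bounds points to IHS, with SUS 1 a close second; the discussion that follows compares these two candidates in more detail before settling on one.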
Since the results show that the TCN achieves a relatively lower error in all the evaluation sets, we select it as the best model and follow the heuristic described in Section 3.1.3. Table 2 shows the maximum error obtained by it over all the evaluation sets. The two lowest RMSE values reported in that table are from the TCN model trained with SUS factor 1 and with IHS, and the evaluation sets where they have the highest error are IHS and No sampling, respectively. By comparing the performance of the TCN trained with both methods on the evaluation set without sampling, we can clearly see that the model trained with SUS with factor 1 has a lower spread of the error (Fig. 2b). On the other hand, the same comparison on the evaluation set with higher variation (Fig. 2c) shows that both models have a similar error spread, and the small advantage of using IHS in this case does not compensate for the increase in error in the low variation samples. Therefore, we can conclude that SUS with factor 1 is the sampling method with the best performance across samples with low and high temperature variation.
5.2.1 Imbalance Effects in DL Models
In the third step of the framework, we also verify how the imbalances affect the performance and learned patterns of the predictive models, when they are trained on data with sampling versus unsampled data. To do that, we focus on the temperature variation property of the dataset, and we measure the SHAP values of the MLP model, as well as the attention importance values of the TACN.
To assess how the sampling methods influence the MLP model, we extract its SHAP values using the SUS 3 evaluation set. We compare the MLP trained with SUS 3 against the MLP trained without sampling. Figure 3 shows that both models rely mostly on the last temperature measurement to make the forecast. This could be explained by the fact that the last temperature is relatively close to the predicted temperature, even when there is high variation. One hypothesis for this behavior is that the MLP does not handle the time dependency of the inputs and thus has a disadvantage in comparison to temporal models such as the TCN or the LSTM.
To gain insights about the differences in the behavior of the trained TACN models using the interpretability mechanism, we run inference on the SUS with factor 3 evaluation set for the models trained on (a) unsampled data and (b) on the SUS factor 3 train set, and we study the resulting attention pattern variance.
In order to quantify this variance, we count, for both models, the unique learned attention values for each input time step across all test samples, rounded to the second decimal, and present the result in Fig. 4. For the model trained on unsampled data, there are at most 3 unique values for each position, while for the SUS model there are between 30 and 50. These observations lead us to the following conclusions: the model trained on the unsampled data has learned a high reliance on the last value and a limited number of patterns, which serves well in minimizing the error for the majority of the samples but results in low performance on the large variation samples. In contrast, the model trained on the SUS data is forced to learn a larger variety of patterns to accommodate this target variation.
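The count of unique attention values per input step can be sketched as below, assuming the attention vectors of all test samples are stacked into a 2-D array with one row per sample and one column per input step. The function name and the toy values are hypothetical.

```python
import numpy as np


def unique_values_per_step(attention, decimals=2):
    """Number of distinct attention values at every input step.

    `attention` has shape (n_samples, n_steps); values are rounded to
    `decimals` places before counting, as in the analysis above.
    """
    rounded = np.round(np.asarray(attention, dtype=float), decimals)
    return np.array(
        [np.unique(rounded[:, step]).size for step in range(rounded.shape[1])]
    )


# Toy attention values (made up) for three samples and two input steps.
attention = np.array([
    [0.101, 0.500],
    [0.104, 0.501],
    [0.300, 0.499],
])
counts = unique_values_per_step(attention)
```

A low count at every step suggests the model reuses a handful of attention patterns, while a high count suggests it adapts the pattern to the input.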
6 Conclusion
We presented a framework to analyze imbalanced time-series forecasting problems and to train and evaluate ML models while taking important properties of the data into account. To our knowledge, this is the first framework which provides clear steps to help practitioners select and compare different sampling methods and predictive models for such problems. It is put into practice to forecast the temperature of a motor in a steel processing conveyor belt, based on data extracted from a real-world industrial process, and is validated in cooperation with domain experts. The problem analysis is made through the lens of the temperature variation property. We study the dataset using three different sampling methods and we train four different DL models to evaluate and compare the effectiveness of each combination of sampling and model. We also show the imbalance of the temperature variation and how it changes the models’ predictions when they are trained with different proportions of samples with high temperature variation. Finally, we use SHAP values and the TACN model’s attention mechanism to show the effect of low temperature variation in the dataset on the forecast models, which induces them to rely mostly on the last observed temperature.
As future work, our framework could be applied to new time-series prediction tasks and combined with more sampling techniques. The framework can be extended with new weight functions, making it suitable for an even wider range of tasks. In addition, our results point to a possible relationship between the prediction error and the distribution of the training data, which might be worth investigating.
Acknowledgements
This work has been conducted as part of the Just in Time Maintenance project funded by the European Fund for Regional Development. We also thank Tata Steel Europe for providing the data and technical expertise required for our experiments.
References
[1] (2010) An empirical comparison of machine learning models for time series forecasting. Econometric Reviews 29 (5-6), pp. 594–621.
[2] (2018) An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv:1803.01271 [cs].
[3] (2016) A Survey of Predictive Modeling on Imbalanced Domains. ACM Computing Surveys 49 (2), pp. 1–50.
[4] (2019) Preprocessing approaches for imbalanced distributions in regression. Neurocomputing 343, pp. 76–99.
[5] (1981) On the histogram as a density estimator: L2 theory. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 57 (4), pp. 453–476.
[6] (1997) Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780.
[7] (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30, pp. 4765–4774.
[8] (2020) Interpretable multivariate time series forecasting with temporal attention convolutional neural networks. In 2020 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1687–1694.
[9] (1958) The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review 65 (6), pp. 386.
[10] (2019) A Comparative Study of State-of-the-Art Machine Learning Algorithms for Predictive Maintenance. In 2019 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 760–767.
[11] (2020) Resampling and ensemble techniques for improving ANN-based high streamflow forecast accuracy. Hydrology and Earth System Sciences Discussions 2020, pp. 1–35.
[12] (2021) Deep learning for time series forecasting: a survey. Big Data 9 (1), pp. 3–21.
[13] (2017) Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), pp. 5998–6008.