Supervised machine learning is the centerpiece of the achievements made in the era of artificial intelligence, largely owing to the availability and diversity of real-world data sets annotated with labels. A well-known property of supervised learning is its susceptibility to imbalanced or biased data distributions, which are very common for many reasons, including technical limitations in data collection, biased sampling methods and the inherent skewness of the data sources. Fig. 1 shows an example of a biased/skewed distribution of data collected from a massive number of sensors that are used to indicate the air pollution in the corresponding residential areas. When we use the data to model the population-pollution relation using machine learning (e.g., with gradient descent-based training), the resulting model may learn the pattern well from the samples of densely populated areas (e.g., cities), but less effectively from those of rural areas.
Although extensive efforts have been made to mitigate the impact of imbalanced data on classification tasks (e.g., [2, 1, 15]), none of the existing work can be directly applied to the regression problems. For example, the common techniques such as re-sampling and re-weighting for handling the classification tasks cannot be applied to the pre-processing of the regression data or the training process of regression models. This is because the target space in a classification task is a set of discrete labels (classes) while that of a regression task consists of continuous numerical values.
In fact, data bias exists in both the feature (i.e., input) space and the target (i.e., output) space of a regression problem. The data distribution can be skewed in the feature space due to the bias in data collection and the sourcing property of each feature. For example, the data samples may form a few clusters rather than scatter evenly over the entire feature space, which makes the distribution of the information largely uneven.
Biased data distribution may significantly degrade the efficacy of regression analysis, which includes a wide range of learning tasks (e.g., anomaly detection and time series prediction) and practical applications (e.g., power estimation [17, 11], price prediction, and event detection). The success in dealing with imbalanced classes (in classification) inspires us to improve the quality of model training for regression problems such as linear regression and numerical forecasting. In this work, we adopt a novel approach to re-valuing the data samples according to their distributions in the feature (input) space and the target (output) space, and apply the approach to re-weight the loss during gradient descent-based training. To the best of our knowledge, our study is the first to optimize the models for regression analysis by means of loss re-weighting. The key contributions of our work are outlined as follows:
We propose to partition the feature space into a grid of cells, and further introduce two metrics, uniqueness and abnormality, to indicate the data’s learning values based on their variation.
We propose a loss re-weighting method (VILoss) for optimizing the regression analysis and present an easy-to-implement method for determining its hyper-parameter through empirical studies.
We have conducted comprehensive experiments on both synthetic and public data sets; the results show a solid improvement in model accuracy when the proposed VILoss is used as the loss criterion.
II Related Work
The primary purpose of this work is to find an approach to optimizing model training for regression tasks. Nonetheless, it is worthwhile to discuss the popular methods for tackling the class imbalance problem. Through the discussions we can understand why the methods developed for classification problems cannot be applied to the regression problems we are targeting in this work.
The most popular methodologies for addressing data imbalance/bias are re-sampling and re-weighting. Re-sampling is applied to the raw data set directly by adding repeated, interpolated or synthesized samples [7, 9] to the minority classes and/or removing a portion of the samples in the majority classes. A problem of re-sampling is that samples carrying useful information are likely to be removed in undersampling, while the addition of redundant samples in oversampling may introduce extra noise. Re-weighting is also commonly used to address data imbalance. Its fundamental idea is to assign different weights to different samples. Generally, a re-weighting method can be either model- and loss-agnostic [8, 4] or error/loss-incentive [5, 10]. An example of the former is class frequency-based re-weighting, whilst the latter tends to assign larger weights to the "hard" samples that yield higher errors.
However, none of these existing methods can be applied to regression analysis, where the data are not affiliated with discrete class labels. Regression analysis underpins a great number of practical applications such as numerical estimation, prediction and anomaly/novelty detection. Biased data distribution commonly exists in these tasks and could affect the quality of the models learned for regression analysis, an issue that has hardly been investigated in previous studies.
III Re-valuing Data with Variation
The idea behind our re-weighting method is to differentiate each data sample by its learning value through gathering localized information on the data distribution (e.g., bias and outliers).
We propose to re-value data samples by gauging each sample in terms of its uniqueness and (potential) abnormality. Uniqueness indicates how valuable a sample is for learning compared to the others, while abnormality measures how likely it is to be a deviant/outlier. The properties of uniqueness and abnormality are closely associated with the data variation in the feature space and the target space, respectively.
We characterize the uniqueness and abnormality of a sample in the context of its vicinity, which consists of a set of neighboring data samples. A straightforward method of finding a sample's neighborhood is to calculate and sort its distances to all the others, and then select the closest k samples. However, repeating this for each of $n$ samples takes at least $O(n^2)$ time, which may result in unacceptably long pre-processing time for large data sets. In this work, we introduce the Data Cell as an approximate form of neighborhood. With the concept of the cell, the data can be examined efficiently in groups.
III-A Data Cell
A data cell, in this paper, is defined as a logical neighborhood of data that results from the dimension-wise partitioning of the feature space. For example, a three-dimensional feature space can be divided into $2^3 = 8$ cells with two divisions in each dimension. A cell can be formulated as:
$$C_{s_1 s_2 \cdots s_d} = \{(x, y) \mid x^{(j)} \text{ falls in the } s_j\text{-th division of dimension } j,\ j = 1, \ldots, d\}$$
where $(x, y)$ is a data sample (both $x$ and $y$ can be multi-dimensional), $x^{(j)}$ denotes the $j$-th feature and $d$ is the dimensionality of the feature space. The cell's subscript $s_1 s_2 \cdots s_d$ is a string of numbers indexing the cell, where $s_j$ stands for its projected index in dimension $j$. For ease of presentation, we refer to a cell by the notation $C_i$ in the rest of this paper, where $i$ can be regarded as the rank of the cell after all cells are sorted in ascending order by their index. The logic of binding samples to cells is similar to space discretization in numerical methods for solving equations, in which the continuous space is discretized into a grid of cells.
We use a hyper-parameter $k$ to control the number of divisions in each dimension, and each division is of equal size. Given $n$ samples, the time complexity of partitioning the space into cells is $O(n)$, as we only need to traverse the data set once to assign each data sample to its corresponding cell. Given a raw feature space $X$ with $d$ features, a full partitioning yields $k^d$ cells. But in the case where $X$ is high-dimensional, we can select a subset of features $X' \subset X$ for space partitioning and loss weighting (still training the model with the full feature set $X$). After the feature space is divided, we can efficiently characterize the uniqueness and abnormality of all samples in the context of each cell.
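The single-pass assignment of samples to cells can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `assign_cells`, the use of each feature's observed minimum and maximum as the partitioning range, and the rule that the maximum value falls in the last division are all assumptions made for the sketch.

```python
from collections import defaultdict


def assign_cells(features, k):
    """Group sample indices by grid cell, with k equal-width divisions per feature.

    features: list of equal-length feature vectors (lists of floats).
    Returns a dict mapping a cell index tuple (s_1, ..., s_d) to sample indices.
    """
    d = len(features[0])
    # Assumption: the range of each dimension is taken from the data itself.
    lows = [min(x[j] for x in features) for j in range(d)]
    highs = [max(x[j] for x in features) for j in range(d)]
    cells = defaultdict(list)
    for idx, x in enumerate(features):  # one pass over the samples
        key = []
        for j in range(d):
            width = (highs[j] - lows[j]) / k
            # A constant feature has zero width; everything goes to division 0.
            s = int((x[j] - lows[j]) / width) if width > 0 else 0
            key.append(min(s, k - 1))  # the maximum value joins the last division
        cells[tuple(key)].append(idx)
    return dict(cells)  # only non-empty cells are stored


points = [[0.0, 0.0], [0.1, 0.2], [0.9, 0.1], [1.0, 1.0]]
cells = assign_cells(points, k=2)
```

Storing only the non-empty cells in a dictionary keeps the memory cost proportional to the number of samples even though a full partitioning defines many more cells.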
III-B Uniqueness: Feature Variation
Intuitively, a unique sample is more likely to carry useful information for learning than a redundant one or one that overlaps with others. For example, consider the feature space shown in Fig. 2, which consists of two cells, one of which exhibits a strong bias: its samples reside within a small range and gather very closely together. When training a model on this data set, traditional training methods are likely to converge slowly, as the model learns very little from the majority of the samples in that cell.
We measure how unique a sample is with respect to its features using a simple criterion based on variation. Variation can be quantified statistically by the standard deviation, and we generalize its formulation to fit the multi-dimensional feature space in the context of a cell using the Euclidean distance. Given a cell $C_i$, we define:
$$\sigma_i = \sqrt{\frac{1}{|C_i|} \sum_{(x, y) \in C_i} \lVert x - \bar{x}_i \rVert_2^2}$$
where $|C_i|$ is the number of samples in $C_i$ and $\bar{x}_i$ is a vector of mean values over the corresponding elements of all $x$ in the cell $C_i$.
Using the generalized standard deviation, we can easily measure the variation of the data in the feature space and characterize their uniqueness. The samples with more distinct features (i.e., more unique) are deemed to carry more useful information for (further) optimizing the model compared to the less unique samples. Therefore, the uniqueness should be a relative value after considering all cells. We define the uniqueness of a cell as follows:
$$u(C_i) = \frac{\sigma_i}{\bar{\sigma}}$$
where $\bar{\sigma}$ represents the average value of $\sigma_i$ over all non-empty cells (empty cells are excluded from the calculation). All samples in a cell share the uniqueness value of this cell:
$$u(x) = u(C_i), \quad \forall (x, y) \in C_i$$
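As a rough illustration of the uniqueness measure, the sketch below computes a generalized standard deviation per cell (the root mean squared Euclidean distance to the cell mean) and divides it by the average over the non-empty cells. The function names, the dictionary layout and the population-style normalization by the cell size are assumptions of this sketch, and at least one cell is assumed to have non-zero variation.

```python
import math


def feature_std(cell_features):
    """Root mean squared Euclidean distance of the samples to the cell mean."""
    n = len(cell_features)
    d = len(cell_features[0])
    mean = [sum(x[j] for x in cell_features) / n for j in range(d)]
    squared = sum(
        sum((x[j] - mean[j]) ** 2 for j in range(d)) for x in cell_features
    )
    return math.sqrt(squared / n)  # assumption: divide by n, not n - 1


def cell_uniqueness(cells):
    """cells: dict cell_id -> list of feature vectors (non-empty cells only)."""
    sigma = {c: feature_std(xs) for c, xs in cells.items()}
    mean_sigma = sum(sigma.values()) / len(sigma)
    # Every sample in a cell shares the uniqueness value of that cell.
    return {c: s / mean_sigma for c, s in sigma.items()}


# Cell "A" is tightly packed, cell "B" is spread out.
u = cell_uniqueness({"A": [[0.0], [2.0]], "B": [[0.0], [6.0]]})
```

In this toy case the spread-out cell receives a uniqueness above one and the tightly packed cell a value below one, so samples from the dense cluster are down-weighted relative to the others.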
III-C Abnormality: Target Variation
It is very common for a large data set to contain corrupted data, which deviate significantly from the actual distribution of the underlying function and are poisonous to the training process. We judge the abnormality of the samples from the perspective of the target space. Given a cell of data, a large variation of the intra-cell samples in the target space usually indicates inconsistency, since neighboring samples have similar feature values and are thus expected to have relatively similar target values. For example, the two outliers (highlighted in red) among the data shown in Fig. 2 could mislead the model because they yield high losses and gradients, as their Y values deviate significantly from the true distribution. Therefore, we propose to gauge the abnormality of a given sample by its target value in the context of its cell. Taking the average target value $\bar{y}_i$ in the cell $C_i$ as a reference, the deviation of a sample $(x, y)$ in the target space can be measured using the $p$-norm distance, i.e., $\lVert y - \bar{y}_i \rVert_p$. In this paper we use the L1 norm ($p = 1$) and the Euclidean norm ($p = 2$). We define the target variation of cell $C_i$ as:
where $|C_i|$ denotes the size (i.e., the number of samples inside) of $C_i$ and $\bar{y}_i$ denotes the average target value of the samples in the cell. Using the target variation as the reference, given a sample $(x, y)$ in cell $C_i$, we quantify its level of abnormality using two forms (L1 and L2 forms):
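The abnormality idea can be sketched in code under an explicit assumption: each sample's $p$-norm deviation from the cell's mean target is divided by the mean deviation in that cell. This normalization is an illustrative choice made for the sketch, and the function name `abnormality` is hypothetical.

```python
def abnormality(cell_targets, p=2):
    """Return one abnormality score per sample of a single cell.

    Assumption (illustrative only): score = ||y - mean||_p divided by the
    average of that deviation over the cell.
    cell_targets: list of target vectors (lists of floats).
    """
    n = len(cell_targets)
    m = len(cell_targets[0])
    mean = [sum(y[j] for y in cell_targets) / n for j in range(m)]
    deviations = [
        sum(abs(y[j] - mean[j]) ** p for j in range(m)) ** (1.0 / p)
        for y in cell_targets
    ]
    variation = sum(deviations) / n
    if variation == 0:  # all targets identical: no sample stands out
        return [0.0] * n
    return [dev / variation for dev in deviations]


# Three consistent targets and one outlier in the same cell (L1 form).
scores = abnormality([[1.0], [1.0], [1.0], [5.0]], p=1)
```

With `p=1` this follows the L1 form and with `p=2` the L2 form; in both cases the outlier gets the largest score in its cell.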
IV Variation-Incentive Loss
We re-value the data samples to optimize the learning by taking into account their uniqueness and abnormality properties introduced in Section III. Specifically, we assign higher weights to the samples that exhibit strong uniqueness and low abnormality. Given a sample $(x, y)$ with uniqueness $u(x)$ and abnormality $a(x, y)$, a predicted output $\hat{y}$ and a base loss function $\mathcal{L}(\hat{y}, y)$ (for the regression task), the proposed Variation-Incentive Loss (VILoss) is formulated in (9):
$$\mathcal{L}_{VI}(\hat{y}, y) = \frac{u(x)}{a(x, y) + 1}\, \mathcal{L}(\hat{y}, y) \tag{9}$$
In (9), we weight the loss of a sample in proportion to its uniqueness and in inverse proportion to its abnormality plus one. By doing so we can modulate how much attention the sample obtains in training. We use $a(x, y) + 1$ in the denominator to avoid extreme values in case $a(x, y)$ is close to zero. Note that since the weight applied to the loss is independent of the predictions made by the model and of the model parameters, we can easily calculate the gradient of a sample with the VILoss as the loss function:
$$\nabla_{\theta} \mathcal{L}_{VI}(\hat{y}, y) = \frac{u(x)}{a(x, y) + 1}\, \nabla_{\theta} \mathcal{L}(\hat{y}, y)$$
where $\theta$ denotes the model parameters for training.
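Because the weight does not depend on the model parameters, it simply scales the base gradient. The sketch below checks this numerically for a hypothetical one-parameter linear model with a squared-error base loss; the model, the function names and the numeric values are illustrative assumptions.

```python
def vi_weight(u, a):
    """Loss weight: proportional to uniqueness u, inversely to abnormality a plus one."""
    return u / (a + 1.0)


def squared_error(theta, x, y):
    """Base loss of a toy model y_hat = theta * x."""
    return (theta * x - y) ** 2


def vi_loss(theta, x, y, u, a):
    return vi_weight(u, a) * squared_error(theta, x, y)


def vi_grad(theta, x, y, u, a):
    # The weight is constant with respect to theta, so it scales the base gradient.
    return vi_weight(u, a) * 2.0 * (theta * x - y) * x


theta, x, y, u, a = 0.5, 2.0, 3.0, 1.5, 0.5
eps = 1e-6
# Central finite difference of the weighted loss, for comparison.
numeric = (vi_loss(theta + eps, x, y, u, a) - vi_loss(theta - eps, x, y, u, a)) / (2 * eps)
analytic = vi_grad(theta, x, y, u, a)
```

In practice the weights can therefore be computed once in pre-processing and multiplied into the per-sample losses of any gradient-based trainer.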
VILoss for Numerical Regression
VILoss is designed for optimizing models for regression analysis that can be formulated as $f: \mathbb{R}^{d} \to \mathbb{R}^{m}$, where $d$ and $m$ are the dimensionality of the feature space and the target space, respectively. With the prevalence of gradient descent-based training methods, the commonly used loss functions for regression include the Mean Square Error (MSE) loss (a.k.a. L2 loss) and the Huber loss (a.k.a. smooth L1 loss).
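Given a sample $(x, y)$ and the corresponding output $\hat{y}$ predicted by a regression model, using MSE as the base loss we have:
$$\mathcal{L}_{VI}^{MSE}(\hat{y}, y) = \frac{u(x)}{a(x, y) + 1} \cdot \frac{1}{m} \sum_{j=1}^{m} \left(\hat{y}^{(j)} - y^{(j)}\right)^2$$
where $u(x)$ and $a(x, y)$ are the uniqueness and abnormality of the sample, $m$ is the number of dimensions in the target space and $y^{(j)}$ is the element in the $j$-th dimension. In a similar way we can apply our re-weighting method to higher-order loss functions like the Least Quartic Regression (LQR) loss [3], which uses a quartic function of the error (i.e., $(\hat{y} - y)^4$) that yields significantly higher losses on difficult samples.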
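Huber loss is a piece-wise function that is linear when the absolute error is above a threshold and is smoothed to a quadratic form otherwise. Applying our re-weighting method, we define $\mathcal{L}_{VI}^{Huber}$ as:
$$\mathcal{L}_{VI}^{Huber}(\hat{y}, y) = \frac{u(x)}{a(x, y) + 1} \cdot \frac{1}{m} \sum_{j=1}^{m} z_j$$
where $z_j$ is the smooth L1 error for the $j$-th output dimension, and $\beta$ is the threshold hyper-parameter of the Huber loss, which is often set to 1.0 empirically.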
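The two weighted regression losses can be sketched side by side. The piece-wise form used for the smooth L1 error below (quadratic below the threshold `beta`, linear above it) is one common convention and is an assumption here, as are the function names and the averaging over output dimensions.

```python
def smooth_l1(error, beta=1.0):
    """Quadratic for |error| < beta, linear otherwise (assumed convention)."""
    e = abs(error)
    return 0.5 * e * e / beta if e < beta else e - 0.5 * beta


def vi_mse(pred, target, u, a):
    """MSE over the output dimensions, scaled by u / (a + 1)."""
    base = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    return u / (a + 1.0) * base


def vi_huber(pred, target, u, a, beta=1.0):
    """Mean smooth L1 error over the output dimensions, scaled by u / (a + 1)."""
    base = sum(smooth_l1(p - t, beta) for p, t in zip(pred, target)) / len(target)
    return u / (a + 1.0) * base


# A two-dimensional output with one small and one large error; the weight is 2 / (1 + 1) = 1.
pred, target = [1.0, 4.0], [1.5, 1.0]
mse_value = vi_mse(pred, target, u=2.0, a=1.0)
huber_value = vi_huber(pred, target, u=2.0, a=1.0)
```

The large error dominates the MSE value but contributes only linearly to the Huber value, which is the usual motivation for choosing the latter on data with outliers.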
VILoss for Logistic Regression
VILoss is designed for regression tasks, which require the models to output numerical values. Nevertheless, the design of VILoss makes it applicable to training logistic regression models for binary classification. This is because in this case we can establish the cell average arithmetically in the target space, where the label $y$ takes the value of 0 or 1. Logistic regression models here refer to any form of model (e.g., linear regression, recurrent neural networks) with a sigmoid function applied to its output. As a result, the final output $\hat{y}$ for binary classification is a scalar probability. In this paper, we use Binary Cross Entropy (BCE) as the base loss function:
$$\mathcal{L}_{VI}^{BCE}(\hat{y}, y) = -\frac{u(x)}{a(x, y) + 1} \left[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right]$$
where $y$ is the class label for the sample $x$, taking either 0 or 1, and $\hat{y}$ denotes the prediction (in the range $(0, 1)$) made by the model with a logistic output layer.
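A minimal sketch of the weighted BCE for a single sample follows. The function name `vi_bce` and the clamping constant `eps` are assumptions added for numerical safety and are not part of the formulation above.

```python
import math


def vi_bce(prob, label, u, a, eps=1e-12):
    """Binary cross entropy for one sample, scaled by u / (a + 1).

    prob: predicted probability of class 1; label: 0 or 1.
    """
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    base = -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))
    return u / (a + 1.0) * base


# An uninformative prediction on a positive sample with weight 1 / (1 + 1) = 0.5.
loss = vi_bce(0.5, 1, u=1.0, a=1.0)
```

Since the labels are 0 or 1, the cell average in the target space is simply the fraction of positive samples in the cell, so a sample whose label disagrees with most of its neighbors deviates more from that average.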
V Study of the Hyper-parameter $k$
The hyper-parameter $k$ (which determines the number of cells) plays an important role in the training outcome, as we need to partition the feature space appropriately to differentiate the value of samples. On the one hand, as $k$ increases, the partitioning granularity gets finer (i.e., more cells), which is good because we use the cell uniqueness to represent the uniqueness of all samples in this cell. On the other hand, as $k$ increases, the cells become sparser, which means more information is lost regarding the local distribution of samples. To find the optimal value of $k$ before training, we define a simple metric called Localized Deviation (LD) to indicate whether the feature space is appropriately partitioned given a set of data. LD is defined as the sum of the per-cell deviation over all the non-empty cells (i.e., those with $|C_i| > 0$) that result from the partitioning:
We use the following empirical studies to show that the designed LD can be used to guide the selection of the optimal $k$. We conducted the empirical studies on two synthetic data sets: Synth-1D and Synth-2D. Each set is generated from a polynomial data model with non-biased Gaussian noise (simulating normal variation) and a certain portion of abnormal data (simulating data corruption). We trained models of varied complexity on these two data sets to observe the correlation between $k$, LD and the model's quality (measured in Mean Absolute Percentage Error, MAPE). Under each setting we report the lowest error achieved among all candidate models. Fig. 3 and Fig. 4 plot the results extracted from the full statistics in Tables V and VI in Appendix A.
On Synth-1D, VILoss outperformed MSE under all settings, and the lowest error is achieved by setting $k$ to 2, which also yields the maximum value of LD (Fig. 3). The case is different on Synth-2D, where LD is maximized at a different value of $k$ (Fig. 4); at that value VILoss also achieves the biggest reduction of error, which means the corresponding cells represent the best partitioning of this data set. In both cases we find that LD is a strong indicator of the effectiveness of our method. Based on these observations, we are able to determine an appropriate value of $k$ without actually running the training process, by running the following procedure instead: 1) select a group of candidate values of $k$; 2) calculate the value of LD for each candidate using (15); 3) choose the value of $k$ that maximizes LD.
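The three-step selection procedure amounts to an argmax over the candidates. In the sketch below the LD computation is passed in as a callable so that no particular formula is assumed; `select_k` and `toy_ld` are hypothetical names, and the toy curve is a stand-in for illustration, not a measured LD.

```python
def select_k(candidates, ld):
    """Return the candidate with the largest LD, plus all computed LD values.

    candidates: iterable of candidate values of k (step 1).
    ld: callable mapping k to the LD of the partitioning it induces.
    """
    scores = {k: ld(k) for k in candidates}  # step 2: one LD value per candidate
    best = max(scores, key=scores.get)       # step 3: the candidate maximizing LD
    return best, scores


def toy_ld(k):
    """Stand-in LD curve peaking at k = 5; a real one is computed from the data."""
    return -((k - 5) ** 2)


best_k, scores = select_k([2, 5, 10, 20], toy_ld)
```

Each LD evaluation only needs one pass over the data to form the cells, so scanning a handful of candidates is far cheaper than training a model per candidate.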
VI-A Experiment Setup
We conducted extensive experiments by running a diversity of regression tasks on both synthetic and real-world data sets. As summarized in Table I, we chose four public data sets that correspond to four typical application scenarios including house price regression (numerical value estimation), parking vacancy prediction (time series prediction), cyber intrusion detection (anomaly detection) and event detection (time series anomaly detection). For each set, 70% of the data are used for training and 30% for testing. All selected data sets (including our Synth-1D and Synth-2D) are more or less biased in data distribution. The data are normalized on both features and targets. In Fig. 5 we show the feature distributions of these data sets, among which only our synthetic data follow clear patterns of Gaussian distribution.
VI-B Experimental Results
We applied our re-weighting method to model training using MSE, Huber and the Least Quartic Regression (LQR) loss [3] as the base loss. MSE loss and Huber loss are the standard criteria for regression analysis, while the LQR loss is designed for learning from fat-tailed financial data that exhibit strong skewness or kurtosis. We also evaluated our method on logistic regression tasks by using the popular BCE as the base loss. Our method is compared against the base losses in terms of the model's performance in tests. All the results reported in the figures are collected under the optimal training settings (e.g., batch size, learning rate and time step).
We used the Boston (housing) and the Parking Birmingham data sets from the UCI repository for numerical regression tasks. The Boston housing data set (https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) contains the information collected by the U.S. Census Service with 14 attributes in total, of which the median house price is to be predicted. We use a linear regression model for this task. Parking Birmingham (https://archive.ics.uci.edu/ml/datasets/Parking+Birmingham) is a time-series data set collected by the city council. We extracted two months of the data (from Oct. 4 to Dec. 9, 2016) from the records of five car parks in the city of Birmingham and used them to predict the vacancies in two other car parks (Broad Street and Bull Ring) in the same city. On this data set we trained and tested two forms of LSTM neural nets with one and two stacked hidden layers, respectively. The size of the hidden layer is set to 10. Apart from the real-world data sets, the experimental results on the synthetic data sets are also included for reference. The errors are reported in two metrics: mean absolute percentage error (MAPE) and mean absolute error (MAE).
In Table II, MAE and MAPE are reported by selecting the best polynomial models in each test (see Appendix A for details). On both synthetic data sets, the models are enhanced in performance (i.e., error reduction) when trained with our VILoss. VILoss in the L2 form achieved the best performance, reducing the relative error by 7.4% and 11.9% on Synth-1D and Synth-2D, respectively.
The test results on the Boston Housing and Parking Birmingham datasets are summarized in Table III. On Boston Housing, the model quality is improved by 4-5% when VILoss is applied. When testing on the Parking Birmingham dataset, we also observed a significant performance gain especially for the stacked LSTM model (i.e., LSTM-2) with about 10% error reduction compared to that trained with the MSE loss. Combining Tables II and III, Fig. 6 provides the comparison in terms of test error between VILoss and the baseline losses (MSE, Huber and LQR). From the figure we can observe a solid improvement in the accuracy (i.e., lower test error) of different models when trained using VILoss on both synthetic data sets and real-world data sets.
Our approach can also be applied to train logistic regression models for binary classification. We extracted a subset of the KDDCup'99 data set (https://scikit-learn.org/stable/datasets/index.html#kddcup99-dataset) to include the TCP records only, and built a logistic regression model for predicting network intrusions (labelled 1 if true, 0 otherwise). For this data set, we weight the samples based on three of the features ('src_bytes', 'dst_bytes' and 'dst_host_count'). The CalIt2 data set (https://archive.ics.uci.edu/ml/datasets/CalIt2+Building+People+Counts) is from UCI and is a time-series data set containing the time stamps and two data streams (in-flow and out-flow) of a department building, where events take place occasionally and need to be detected. We merged the raw records hourly and built an RNN model with sigmoid output. Table IV shows the comprehensive test results of these two models with several classification metrics.
From Table IV we can observe the high recall rates for all the training criteria on KDDCup’99 TCP, but using VILoss improves the model’s accuracy, precision and F1-score. This is because VILoss pays more attention to those uncharacteristic negative samples (with some degree of deviation from the negative majority) and thus reduces false positive decisions (false alarm rate). Event detection is much more difficult on the time series data CalIt2 even with recurrent layers. Some events (e.g., small conferences) do not increase the in-/out-flow of the building significantly, resulting in a relatively low recall rate. In this case, our improvement in recall rate is marginal. However, from Table IV we can observe a notable improvement in precision by roughly 10-20% over the baseline and the comprehensive performance gain is about 3%, as reflected by the f1-score. Fig. 7 summarizes the experimental results on the KDDCup’99 TCP and CalIt2 data sets. It is shown that VILoss can effectively optimize the training of both traditional logistic regression models and recurrent neural nets with logistic output.
Biased data distribution commonly exists in all kinds of real-world data sets and is likely to deteriorate the models learned from these data. However, existing studies hardly pay attention to this problem for the regression analysis. In this paper, we present an approach to quantifying the uniqueness and abnormality of the samples in a skewed and biased data distribution. Combining the two metrics we propose a loss re-weighting method that can be applied to different base loss functions (e.g., MSE, Huber, LQR and BCE Loss) for the regression purpose. The experimental results on various models and data sets show a significant gain in model performance when using VILoss in training.
- [1] Shaza M Abd Elrahman and Ajith Abraham. A review of class imbalance problem. Journal of Network and Innovative Computing, 1(2013):332–340, 2013.
- [2] Aida Ali, Siti Mariyam Shamsuddin, and Anca L Ralescu. Classification with class imbalance problem. Int. J. Advance Soft Compu. Appl, 5(3), 2013.
- [3] Giuseppe Arbia, Riccardo Bramante, and Silvia Facchinetti. Least quartic regression criterion to evaluate systematic risk in the presence of co-skewness and co-kurtosis. Risks, 8(3):95, 2020.
- [4] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9268–9277, 2019.
- [5] Qi Dong, Shaogang Gong, and Xiatian Zhu. Class rectification hard mining for imbalanced deep learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 1851–1860, 2017.
- [6] Chris Drummond, Robert C Holte, et al. C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In Workshop on learning from imbalanced datasets II, volume 11, pages 1–8. Citeseer, 2003.
- [7] Hui Han, Wen-Yuan Wang, and Bing-Huan Mao. Borderline-smote: a new over-sampling method in imbalanced data sets learning. In International conference on intelligent computing, pages 878–887. Springer, 2005.
- [8] Chen Huang, Yining Li, Chen Change Loy, and Xiaoou Tang. Learning deep representation for imbalanced classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5375–5384, 2016.
- [9] Myoung-Jong Kim, Dae-Ki Kang, and Hong Bae Kim. Geometric mean based boosting algorithm with over-sampling to resolve data imbalance problem for bankruptcy prediction. Expert Systems with Applications, 42(3):1074–1082, 2015.
- [10] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
- [11] Weiwei Lin, Wentai Wu, Haoyu Wang, James Z Wang, and Ching-Hsien Hsu. Experimental and quantitative analysis of server power model for cloud data centers. Future Generation Computer Systems, 86:940–950, 2018.
- [12] Rushi Longadge and Snehalata Dongre. Class imbalance problem in data mining review. arXiv preprint arXiv:1305.1707, 2013.
- [13] Joseph Prusa, Taghi M Khoshgoftaar, David J Dittman, and Amri Napolitano. Using random undersampling to alleviate class imbalance on tweet sentiment data. In 2015 IEEE international conference on information reuse and integration, pages 197–202. IEEE, 2015.
- [14] Leilei Shi, Yan Wu, Lu Liu, Xiang Sun, and Liang Jiang. Event detection and identification of influential spreaders in social media data streams. Big Data Mining and Analytics, 1(1):34–46, 2018.
- [15] Yanmin Sun, Andrew KC Wong, and Mohamed S Kamel. Classification of imbalanced data: A review. International journal of pattern recognition and artificial intelligence, 23(04):687–719, 2009.
- [16] Wentai Wu, Ligang He, Weiwei Lin, Yi Su, Yuhua Cui, Carsten Maple, and Stephen A Jarvis. Developing an unsupervised real-time anomaly detection scheme for time series with multi-seasonality. IEEE Transactions on Knowledge and Data Engineering, 2020.
- [17] Wentai Wu, Weiwei Lin, and Zhiping Peng. An intelligent power consumption model for virtual machines under cpu-intensive workload in cloud environment. Soft Computing, 21(19):5755–5764, 2017.
Appendix A Experiment details for parameter study
The ground truth of both synthetic data sets is a sixth-degree polynomial, of a single feature for Synth-1D and of two features for Synth-2D. The number of samples generated for these two synthetic data sets is 300 and 1000, respectively. 70 percent of each data set is used for training and 30 percent for testing. On each data set we trained three polynomial models: a third-order polynomial model (which is under-parameterized because the degree of the ground truth model is 6), a sixth-order polynomial model (i.e., properly parameterized) and a tenth-order polynomial model (i.e., over-parameterized). The MSE loss is used as the baseline. The batch size is set to 1 and 5 for Synth-1D and Synth-2D, respectively. We use MAPE (Mean Absolute Percentage Error) as the error metric. Results are summarized in Tables V and VI.
We gather the results and investigate the correlation between the effectiveness of the proposed VILoss and the LD metric we define in (15). The best result is selected for each loss function (i.e., MSE, VILoss L1-form and VILoss L2-form) with each specified value of $k$ from 2 to 100. We use MAPE as the error metric.