1. Introduction
The imbalanced learning problem arises when the distribution of samples is significantly unequal among classes: the majority class can overwhelm the minority class in sample size. Since standard machine learning methods usually seek to minimize the training error, the resulting classifiers are naturally biased towards the majority class, leading to degraded performance on the important and interesting minority samples
(Castro and Braga, 2013; Chawla et al., 2002). Over the past two decades, a large number of class imbalance techniques have been proposed to combat imbalanced data learning (Chawla et al., 2002; Tang and He, 2017; Wong et al., 2014; Khan et al., 2018). These existing solutions can be roughly divided into algorithm-level approaches and data-level approaches. In algorithm-level approaches, traditional classification algorithms are modified to put more emphasis on learning the minority class by adjusting the training mechanism or prediction rule, for example through modification of the loss function
(Chung et al., 2015), introduction of class-dependent costs (Castro and Braga, 2013), and movement of the output threshold (Zhou and Liu, 2006). Data-level approaches aim to establish class balance by adding new minority samples (i.e., oversampling) (Zhu et al., 2017, 2019; Zhang et al., 2016), deleting part of the original majority samples (i.e., undersampling) (Wong et al., 2014), or combining both. Compared to algorithm-level methods, data-level techniques have two main advantages. First, they are more universal, since the preprocessed data can be fed into various machine learning algorithms to boost their prediction capability on the minority class. Second, data-level approaches can be flexibly integrated with other techniques, such as kernel methods and ensemble learning, to build elaborate hybrid solutions (Akbani et al., 2004; Tang and He, 2017). In this paper, we focus our attention on oversampling techniques, since oversampling directly addresses the source of difficulty in classifying imbalanced data by compensating for the insufficiency of minority class information, and, unlike undersampling, does not risk discarding informative majority samples.
1.1. Motivation
The problem of imbalanced time series classification is frequently encountered in many real-world applications, including medical monitoring (Villar et al., 2016), abnormal movement detection (Übeyli, 2007), and industrial hazards surveillance (Janusz et al., 2017). Although a large number of oversampling solutions exist in the literature, few of them are designed specifically for imbalanced time series data. Unlike conventional data, time series data presents high dimensionality and high inter-variable correlation, as a time series sample is an ordered set of variables extracted from a continuous signal. As a result, imbalanced time series classification poses some special difficulties compared to classical class imbalance problems (Cao et al., 2011). In terms of data oversampling, the designed algorithm should be able to cope with the additional challenges of high dimensionality, and should preserve the original correlation among variables so as not to confound the learning.
Therefore, we aim to develop an oversampling method for imbalanced time series data that can accurately acquire the correlation structure of the minority class and generate structure-preserving synthetic samples that maintain the main correlations implied in the minority class. The purpose is to substantially improve performance on the minority class without seriously damaging classification accuracy on the majority class.
1.2. Limitations of Existing Techniques
Interpolation-based techniques, probability distribution-based methods, and structure-preserving approaches are the three main types of oversampling.
In interpolation-based oversampling, synthetic samples are randomly interpolated between the feature vectors of two neighboring minority samples (Zhu et al., 2017, 2019). One of the most representative methods is SMOTE (Chawla et al., 2002). Because of the high dimensionality of time series data, there may exist a considerable space between any two minority time series samples. When synthetic samples are allowed to be created in such regions, they tend to scatter over the whole feature space, which leads to a severe over-generalization problem. In addition, interpolation-based oversampling methods can introduce a lot of random data variation, since they only take the local characteristics of minority samples into account; this weakens the inherent correlation of the original time series data. Probability distribution-based methods first estimate the underlying distribution of the minority class, then generate synthetic samples according to the estimated distribution (Cao and Zhai, 2016; Das et al., 2015)
. However, an accurate discrete probability distribution or probability density function is extremely hard to obtain due to the scarcity of minority samples, especially in high-dimensional space (Fukunaga, 2013). Structure-preserving oversampling methods generate synthetic samples on the premise of reflecting the main structure of the minority class. In (Abdi and Hashemi, 2016)
, the authors proposed Mahalanobis Distance-based Oversampling (MDO). MDO produces synthetic samples that obey the sample covariance structure of the minority class by operating on the value range of each feature in the principal component space. The major drawback of MDO is that the sample covariance matrix can seriously deviate from the true covariance matrix for high-dimensional data, i.e., the smallest (largest) eigenvalues of the sample covariance matrix can be greatly underestimated (overestimated) relative to the corresponding true eigenvalues
(Friedman, 1989). Different from MDO, the structure-preserving oversampling methods SPO (Cao et al., 2011) and INOS (Cao et al., 2013) first divide the eigenspectrum of the sample covariance matrix into reliable and unreliable subspaces, and then pull up the sample eigenvalues in the unreliable subspace. However, both SPO and INOS assume the minority class is unimodal. This assumption often does not hold for real-life data, since the samples of a single class may contain multiple modes (e.g., aircraft failure events involve multiple failure modes; a disease includes distinct subtypes). To handle a multimodal minority class, Cao et al. developed a parsimonious Mixture of Gaussian Trees model (MoGT), which constructs a Gaussian graphical model for each mode (Cao et al., 2014). However, MoGT only considers the correlations among pairs of nearest variables in order to reduce the number of estimated parameters. Besides, MoGT does not provide a reliable mechanism to identify the modes of the minority class; the authors, in fact, set the number of mixture components manually.
1.3. Our Method and Main Contributions
Based on the above analyses, existing oversampling algorithms cannot protect the structure of the minority class well for imbalanced time series data, especially when the minority class is multimodal. In this study, we propose a structure-preserving oversampling method, OHIT, which can accurately maintain the covariance structure of the minority class while dealing with multimodality. OHIT leverages a density-ratio-based shared nearest neighbor clustering algorithm (DRSNN) to cluster the minority samples in high-dimensional space, where each discovered cluster corresponds to the representative data of one mode. To overcome the problem of small sample size and high dimensionality, OHIT uses the shrinkage technique to estimate the covariance matrix of each mode. Finally, structure-preserving synthetic samples are generated from multivariate Gaussian distributions parameterized by the estimated covariance matrices.
The major contributions of this paper are as follows: 1) we design a robust DRSNN clustering algorithm to capture the potential modes of the minority class in high-dimensional space; 2) we improve the estimation of the covariance matrix under small sample size and high dimensionality by utilizing the shrinkage technique based on Sharpe's single-index model; 3) the proposed OHIT is evaluated on both unimodal and multimodal datasets, and the results show that OHIT performs better than existing representative methods.
2. The Proposed OHIT Framework
OHIT involves three key issues: 1) clustering high-dimensional data; 2) estimating a large-dimensional covariance matrix from limited data; and 3) generating structure-preserving synthetic samples. In Section 2.1, we introduce the clustering of high-dimensional data, where a new clustering algorithm, DRSNN, is presented. Section 2.2 describes the shrinkage estimation of the covariance matrix; the shrinkage estimator is more accurate and reliable than the sample covariance matrix when data are limited. Section 2.3 gives the generation of structure-preserving synthetic samples. Finally, the algorithm flow and complexity analysis of OHIT are provided together in Section 2.4.
2.1. Clustering of High-dimensional Data
2.1.1. Preliminary
Two significant challenges exist in clustering high-dimensional data. First, the distances or similarities between samples tend to become more uniform, which weakens the discriminative utility of similarity measures and makes clustering more difficult. Second, clusters usually present different densities, sizes, and shapes.
Some research works have developed Shared Nearest Neighbor (SNN) similarity-based density clustering methods to cluster high-dimensional data (Ertöz et al., 2003; Ertöz et al., 2002). In density clustering, the concept of a core point helps to handle clusters with different sizes and shapes. In SNN similarity, the similarity between a pair of samples is measured by the number of common neighbors in their nearest neighbor lists (Jarvis and Patrick, 1973). Since the rankings of distances remain meaningful in high-dimensional space, SNN is regarded as a good secondary similarity measure for handling high-dimensional data (Houle et al., 2010). Furthermore, given that SNN similarity only depends on the local configuration of samples in the data space, samples within dense clusters and sparse clusters show roughly equal SNN similarities, which mitigates the difficulty of clustering caused by density variations across clusters.
The main phases of SNN clustering approaches can be summarized as follows: 1) defining the density of a sample based on SNN similarity; 2) finding the core points according to the sample densities, and then defining the directly density-reachable sample set for each of them; 3) building the clusters around the core points. Below, we describe the key concepts associated with these phases.
SNN similarity and the density of a sample. For two samples x_i and x_j, their SNN similarity is given as follows:

(1)  SNN(x_i, x_j) = |NN(x_i) ∩ NN(x_j)|,

where NN(x_i) and NN(x_j) are the k nearest neighbors of x_i and x_j, respectively, determined by a primary similarity or distance measure (e.g., the L2 norm).
In traditional density clustering, the density of a sample is defined as the number of samples whose distances from this sample are not larger than a distance threshold (Ester et al., 1996). If this definition is extended to SNN clustering, the density of a sample x_i, den(x_i), can be expressed as (Ertöz et al., 2002):

(2)  den(x_i) = Σ_{x_j ∈ NN^{snn}(x_i)} I(SNN(x_i, x_j) ≥ Eps),

where NN^{snn}(x_i) is the set of x_i's k nearest neighbors according to SNN similarity, and I(·) is an indicator function (it returns 1 if the relational expression is true and 0 otherwise). However, this kind of definition can leave outliers and normal samples non-discriminated in density
(Ertöz et al., 2003). Consider Fig. 1a: the outliers still have a considerable degree of neighborhood overlap with their nearest neighbors, so they are also assigned high density according to Eqn. 2. To solve this problem, Eqn. 2 can be modified as

(3)  den(x_i) = Σ_{x_j ∈ SNN(x_i)} I(SNN(x_i, x_j) ≥ Eps),

where SNN(x_i) = {x_j : x_j ∈ NN(x_i) and x_i ∈ NN(x_j)} is the set of x_i's shared nearest neighbors. Eqn. 3 indicates that only samples which occur in each other's nearest neighbor lists can contribute similarity to their densities. Fig. 1b shows the shared nearest neighbor graph corresponding to Fig. 1a. From this figure, we can see that the outliers and their neighboring samples do not form pairs of shared nearest neighbors (i.e., there are no links), so the densities of the outliers tend to be 0; at the same time, the links of the border samples are relatively sparse, so their densities will naturally be lower than the densities of samples within the clusters. Eqn. 3 thus helps to obtain a reasonable distribution of sample density.
Core points and directly density-reachable sample set. In SNN clustering, the core points are the samples whose densities are higher than a density threshold MinPts, and the directly density-reachable sample set of a core point is defined as those shared nearest neighbors whose similarities with this core point exceed Eps (Ertöz et al., 2003).
The creation of clusters. The core points that are directly density-reachable from each other are put into the same cluster; all samples that are not directly density-reachable from any core point are categorized as outliers (or noisy samples); and the non-core, non-noise points are assigned to the clusters containing their nearest core points.
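To make these concepts concrete, the following small sketch (our illustration, not the authors' code) builds the SNN similarity matrix of Eqn. 1 while keeping only mutual-neighbor links, so that the density of Eqn. 3 can be read off from the rows; Euclidean distance is assumed as the primary measure.

```python
import numpy as np

def snn_graph(X, k):
    """SNN similarities (Eqn. 1) restricted to mutual k-NN pairs (Eqn. 3).

    A minimal sketch; Euclidean distance is assumed as the primary measure.
    """
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # k nearest neighbors of each sample (column 0 of argsort is the sample itself)
    nn = [set(row) for row in np.argsort(d, axis=1)[:, 1:k + 1]]
    snn = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in nn[i]:
            if i in nn[j]:                      # shared-nearest-neighbor link
                snn[i, j] = len(nn[i] & nn[j])  # |NN(x_i) ∩ NN(x_j)|
    return snn
```

An outlier whose neighbors do not reciprocate its links receives an all-zero row, so its density under Eqn. 3 is 0, matching the discussion of Fig. 1b.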
2.1.2. DRSNN: A Density-Ratio-based Shared Nearest Neighbor Clustering Method
Eps and MinPts are two important parameters in SNN clustering, but there is no general principle for setting the "right" values for them (Ertöz et al., 2003). In addition, SNN clustering is also sensitive to the neighbor parameter k (Ertöz et al., 2003; Houle et al., 2010). If k is set too small, a complete cluster may be broken up into several small pieces, due to the limited density-reachable samples and the local variations in similarity. Consider Fig. 1c, where k is set to 3: the directly density-reachable sets of the points in the two blocks are restricted to their respective blocks, and cannot be combined into one integrated cluster. On the other hand, if k is too large, such as greater than the size of the clusters, multiple clusters are prone to merge into one, as the density changes in the transition regions no longer have a substantial effect in separating different clusters. This is shown in Fig. 1d, which presents the shared nearest neighbor graph for a larger k: the border points contain points from different clusters in their directly density-reachable sets and show roughly equal densities to the points inside the clusters (i.e., they easily become core points), so neighboring clusters tend to form one uniform cluster. In conclusion, the major drawback of SNN clustering is that it is hard to set appropriate values for the parameters, causing unsteady clustering performance.
To solve the problems mentioned above, we propose a new clustering method based on density ratio and SNN similarity, DRSNN. We first present the key components of DRSNN, then summarize its algorithmic process.
The density of a sample. To avoid the use of the similarity threshold Eps, DRSNN defines the density of a sample as the sum of the similarities between this sample and each of its shared nearest neighbors. Formally, the density of a sample x_i, den(x_i), is

(4)  den(x_i) = Σ_{x_j ∈ SNN(x_i)} SNN(x_i, x_j),

where SNN(x_i) denotes the set of x_i's shared nearest neighbors.
The density ratio of a sample and the identification of core points. Instead of finding the core points based on a density estimate, DRSNN uses an estimate of the density ratio (Zhu et al., 2016). Specifically, the density ratio denRatio(x_i) of a sample x_i is the ratio of its density to the average density of its k nearest neighbors:

(5)  denRatio(x_i) = den(x_i) / ( (1/k) Σ_{x_j ∈ NN(x_i)} den(x_j) ),

where NN(x_i) is the set of x_i's k nearest neighbors.
The core points are defined as the samples whose density ratios are not lower than a threshold drT. The use of the density ratio has the following advantages: 1) it facilitates identifying core points, since core points are samples with locally high densities, so the density-ratio threshold drT can simply be set to around 1; 2) the density threshold parameter of SNN clustering can be eliminated; 3) the density ratios of samples are not affected by density variations across clusters.
Directly density-reachable sample set. In DRSNN, we define the directly density-reachable sample set D(x_i) of a core point x_i as follows:

(6)  D(x_i) = NN_κ(x_i) ∪ {x_j ∈ RNN_κ(x_i) : x_j is a core point},

where NN_κ(x_i) and RNN_κ(x_i) are x_i's κ nearest neighbors and reverse κ nearest neighbors, respectively. The directly density-reachable set thus includes two parts, i.e., the nearest neighbors of x_i and the core points among the reverse nearest neighbors of x_i. The definition of D(x_i) is based on two considerations. One is that the samples distributed closely around a core point should be directly density-reachable from this core point. The other is to ensure that direct density-reachability satisfies reflexivity and symmetry, which is a key condition for DRSNN to deterministically discover clusters of arbitrary shape (Sander et al., 1998). Note that the parameter κ can restrain the merging of clusters when a small value shrinks the directly density-reachable sample sets, and can reduce the risk of splitting clusters when a large value augments them.
The summary of the DRSNN algorithm. DRSNN can be summarized as follows:
1) Find the k nearest neighbors of the minority samples according to a primary similarity or distance measure.
2) Calculate SNN similarity: for all pairs of minority samples, compute their SNN similarities as in Eqn. 1.
3) Calculate the density of each sample as in Eqn. 4.
4) Calculate the density ratio of each sample as in Eqn. 5.
5) Identify the core points, i.e., all samples whose density ratio is not lower than drT.
6) Find the directly density-reachable sample set of each core point as in Eqn. 6.
7) Build the clusters: core points that are directly density-reachable from each other are placed in the same cluster; samples that are not directly density-reachable from any core point are treated as outliers; finally, all other points are assigned to the clusters containing their directly density-reachable core points.
Although DRSNN also contains three parameters (i.e., k, κ, and drT), a proper value of drT can be selected around 1. In addition, k and κ can be set in a complementary way to avoid the merging and splitting of clusters, i.e., a large k (relative to the number of samples) with a relatively low κ, or a small k accompanied by a relatively high κ.
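The whole clustering procedure above can be sketched as follows. This is an illustrative reimplementation under our own naming (k for the neighbor-list size, kappa for the reachable-neighborhood size, dr_t for the density-ratio threshold), with Euclidean distance as the primary measure; it is not the authors' reference code.

```python
import numpy as np

def drsnn_cluster(X, k, kappa, dr_t=1.0):
    """Illustrative sketch of the seven DRSNN steps (label -1 marks outliers)."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    order = np.argsort(d, axis=1)
    nn = [set(row) for row in order[:, 1:k + 1]]            # step 1: k-NN lists
    snn = np.zeros((n, n))                                  # step 2: Eqn. 1,
    for i in range(n):                                      # mutual pairs only
        for j in nn[i]:
            if i in nn[j]:
                snn[i, j] = len(nn[i] & nn[j])
    den = snn.sum(axis=1)                                   # step 3: Eqn. 4
    knn = order[:, 1:k + 1]
    ratio = den / np.maximum(den[knn].mean(axis=1), 1e-12)  # step 4: Eqn. 5
    core = ratio >= dr_t                                    # step 5
    nn_kap = [set(row) for row in order[:, 1:kappa + 1]]
    reach = [nn_kap[i] | {j for j in range(n) if core[j] and i in nn_kap[j]}
             for i in range(n)]                             # step 6: Eqn. 6
    parent = list(range(n))                                 # step 7: union-find
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    cores = np.where(core)[0]
    for i in cores:                 # merge mutually reachable core points
        for j in reach[i]:
            if core[j] and i in reach[j]:
                parent[find(i)] = find(int(j))
    labels = np.full(n, -1)
    for i in cores:
        labels[i] = find(i)
    for i in np.where(~core)[0]:    # attach non-core points to nearest core
        cands = [j for j in cores if i in reach[j]]
        if cands:
            labels[i] = find(cands[int(np.argmin(d[i, cands]))])
    return labels
```

On two well-separated groups of samples, the core points of each group only reach neighbors within their own group, so the two groups receive distinct cluster labels.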
2.2. Shrinkage Estimation of a Large-dimensional Covariance Matrix
In the setting of high dimensionality and small sample size, the sample covariance matrix is no longer an accurate and reliable estimate of the true covariance matrix (Friedman, 1989). The shrinkage technique, one of the most common methods for improving the covariance estimate, linearly combines the unrestricted sample covariance matrix S and a constrained target matrix T to yield a shrinkage estimator with smaller estimation error (Ledoit and Wolf, 2003a; Schäfer and Strimmer, 2009), i.e.,

(7)  Σ* = λT + (1 − λ)S,
where λ ∈ [0, 1] is the weight assigned to the target matrix T, called the shrinkage intensity. Since S contains many estimated parameters relative to the limited amount of data, the unbiased S will exhibit high variance, whereas the preset T will have relatively low variance but potentially high bias, as it is presumed to impose a certain low-dimensional structure. The shrinkage technique acquires a more precise estimate of the true covariance matrix Σ by a proper trade-off between S and T (Schäfer and Strimmer, 2009). A key question is how to find the optimal shrinkage intensity λ*; once λ* is obtained, the shrinkage estimator is determined. A popular solution is to choose the value of λ analytically by minimizing the Mean Squared Error (MSE) (Ledoit and Wolf, 2003b). The advantages of this approach are that the resulting estimator is distribution-free and computationally inexpensive. Specifically, the MSE can be expressed via the expected squared Frobenius norm of the difference between Σ* and Σ,

(8)  MSE(λ) = E‖λT + (1 − λ)S − Σ‖²_F,
which leads to the risk function

(9)  R(λ) = Σ_{i=1}^{p} Σ_{j=1}^{p} E[(λ t_ij + (1 − λ) s_ij − σ_ij)²],

where t_ij, s_ij, and σ_ij are the elements of T, S, and Σ, respectively, and p is the feature dimension. Since s_ij is an unbiased estimator of σ_ij, E(s_ij) is actually σ_ij. Computing the first and second derivatives of R(λ) yields the following equations:
(10)  R′(λ) = 2 Σ_{i,j} [ λ E((t_ij − s_ij)²) + Cov(t_ij, s_ij) − Var(s_ij) ],

(11)  R″(λ) = 2 Σ_{i,j} E[(t_ij − s_ij)²].

By setting R′(λ) = 0, we can obtain

(12)  λ* = ( Σ_{i,j} [Var(s_ij) − Cov(t_ij, s_ij)] ) / ( Σ_{i,j} E[(t_ij − s_ij)²] ).
R″(λ) is positive according to Eqn. 11; hence, λ* is a minimum of R(λ). Following (Schäfer and Strimmer, 2009), we replace the expectations, variances, and covariances in Eqn. 12 with their unbiased sample counterparts, which gives rise to

(13)  λ̂* = ( Σ_{i,j} [V̂ar(s_ij) − Ĉov(t_ij, s_ij)] ) / ( Σ_{i,j} (t_ij − s_ij)² ).
For the preset target T, we use the covariance matrix implied by Sharpe's single-index model (Sharpe, 1963). The single-index model was originally used to forecast stock returns from time series of stock exchange data. In this case, the elements of T can be expressed through the elements of S as follows:

(14)  t_ii = s_ii,  t_ij = s_im s_jm / s_mm  (i ≠ j),

where the subscript m denotes the single index. Putting Eqn. 14 into Eqn. 13, and noting that the diagonal terms vanish because t_ii = s_ii, an expression for λ̂* that involves only sample quantities is finally obtained:

(15)  λ̂* = ( Σ_{i≠j} [V̂ar(s_ij) − Ĉov(t_ij, s_ij)] ) / ( Σ_{i≠j} (t_ij − s_ij)² ).
Given that the value of λ̂* may be greater than 1 (or smaller than 0) due to limited samples, λ = max(0, min(1, λ̂*)) is often adopted in practice.
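As an executable illustration of Eqns. 7 and 13, the sketch below computes the analytic shrinkage intensity; for brevity it uses a diagonal target (off-diagonal entries shrunk to zero) rather than the single-index target of Eqn. 14, so Ĉov(t_ij, s_ij) = 0 for the off-diagonal terms.

```python
import numpy as np

def shrink_covariance(X):
    """Analytic shrinkage (Eqn. 7) with a diagonal target matrix.

    Sketch only: the paper uses the single-index target of Eqn. 14; here
    t_ij = 0 (i != j) and t_ii = s_ii, so Eqn. 13 reduces to
    lambda = sum Var^(s_ij) / sum s_ij^2 over the off-diagonal entries.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / (n - 1)                        # sample covariance S
    W = Xc[:, :, None] * Xc[:, None, :]            # w_kij, shape (n, p, p)
    # unbiased Var^(s_ij) = n / (n - 1)^3 * sum_k (w_kij - mean_k w_kij)^2
    var_s = W.var(axis=0) * n ** 2 / (n - 1) ** 3
    off = ~np.eye(p, dtype=bool)
    lam = var_s[off].sum() / (S[off] ** 2).sum()   # Eqn. 13, diagonal target
    lam = float(max(0.0, min(1.0, lam)))           # clip to [0, 1]
    T = np.diag(np.diag(S))
    return lam * T + (1.0 - lam) * S, lam
```

Because the target keeps the diagonal of S, the shrinkage estimator preserves the sample variances exactly while damping the noisy off-diagonal covariances.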
Dataset  Minority Class  Length  Training (Class Distribution / IR)  Test (Class Distribution / IR)
Yoga(Yg)  ’1’  426  137/163  1.19  1393/1607  1.15 
Herring(Hr)  ’2’  512  25/39  1.56  26/38  1.46 
Strawberry(Sb)  ’1’  235  132/238  1.8  219/394  1.8 
PhalangesOutlinesCorrect(POC)  ’0’  80  628/1172  1.87  332/526  1.58 
Lighting2(Lt2)  ’1’  637  20/40  2  28/33  1.18 
ProximalPhalanxOutlineCorrect(PPOC)  ’0’  80  194/406  2.09  92/199  2.16 
ECG200(E200)  ’1’  96  31/69  2.23  36/64  1.78 
Earthquakes(Eq)  ’0’  512  35/104  2.97  58/264  4.55 
Two_Patterns(Tp)  ’2’  128  237/763  3.22  1011/2989  2.96 
Car  ’3’  577  11/49  4.45  19/41  2.16 
ProximalPhalanxOutlineAgeGroup(PPOA)  ’1’  80  72/328  4.56  17/188  11.06 
Wafer(Wf)  ’1’  152  97/903  9.3  665/5499  8.27 
IR is the imbalance ratio (#majority class samples/#minority class samples). 
Dataset  Minority Class  Length  Training (Class Distribution / IR)  Test (Class Distribution / IR)
Worms(Ws)  ’5’, ’2’, ’3’  900  31/46  1.48  73/108  1.48 
Plane(Pl)  ’3’, ’5’  144  36/69  1.92  54/51  0.944 
Haptics(Ht)  ’1’, ’5’  1092  51/104  2.04  127/181  1.43 
FISH  ’4’, ’5’  463  43/132  3.07  57/118  2.07 
UWaveGestureLibraryAll(UWGLA)  ’8’, ’3’  945  206/690  3.35  914/2668  2.92 
InsectWingbeatSound(IWS)  ’1’, ’2’  256  40/180  4.5  360/1620  4.5 
Cricket_Z(CZ)  ’3’, ’5’  300  52/338  6.5  78/312  4 
SwedishLeaf(SL)  ’10’, ’7’  128  54/446  8.26  96/529  5.51 
FaceAll(FA)  ’1’, ’2’  131  80/480  12  210/1480  7.05 
MedicalImages(MI)  ’5’, ’6’, ’8’  99  23/358  15.57  69/691  10 
ShapesAll(SA)  ’1’, ’2’, ’3’  512  30/570  19  30/570  19 
NonInvasiveFatalECG_Thorax1(NIFT)  ’1’, ’23’  750  71/1729  24.35  100/1865  18.65 
2.3. Generation of Structure-preserving Synthetic Samples
The generation of synthetic samples in OHIT is simple. For each cluster discovered by DRSNN, we first compute its cluster mean μ and shrinkage covariance matrix Σ*, and then the synthetic samples are drawn from the Gaussian distribution N(μ, Σ*). In this way, the synthetic samples maintain the covariance structure of each mode.
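A minimal sketch of this generation step follows; the `shrink_cov` callable stands in for the Section 2.2 estimator and is an assumption of this illustration.

```python
import numpy as np

def oversample_cluster(cluster, n_new, shrink_cov, seed=0):
    """Draw structure-preserving synthetic samples for one DRSNN cluster.

    `cluster` is an (n, p) array of minority samples from a single mode;
    `shrink_cov` is any callable returning a (p, p) covariance estimate
    (e.g., the shrinkage estimator of Section 2.2).
    """
    rng = np.random.default_rng(seed)
    mu = cluster.mean(axis=0)                 # cluster mean
    sigma = shrink_cov(cluster)               # shrinkage covariance matrix
    # synthetic samples follow N(mu, sigma), preserving the mode's structure
    return rng.multivariate_normal(mu, sigma, size=n_new)
```

In OHIT, the number of synthetic samples drawn per cluster would be chosen so that the overall class distribution becomes balanced.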
2.4. OHIT Algorithm and Complexity Analysis
Algorithm 1 summarizes the process of OHIT. Note that the actual data may not follow a Gaussian distribution, but treating each mode separately is analogous to approximating the underlying distribution of the minority class by a mixture of multiple Gaussian distributions, which alleviates the negative impact of violating this assumption to some degree.
The computational complexity of OHIT primarily consists of performing DRSNN clustering and estimating the covariance matrices. Once the similarities are calculated for all pairs of samples, DRSNN only needs to accomplish the clustering process (Ertöz et al., 2003), while computing the shrinkage covariance estimator has the same time complexity as calculating the sample covariance matrix (Schäfer and Strimmer, 2009). Hence, in the case of high dimensionality and small sample size, the overall complexity of OHIT reduces to that of the pairwise similarity computation. This time requirement is the same as that of simple SMOTE, which shows that OHIT is computationally efficient.
3. Experimental Results and Discussion
3.1. Experimental setting
Experimental datasets. We construct two groups of binary imbalanced datasets from the UCR time series repository (Chen et al., 2015). For the first group, the minority class of each dataset is the smallest original class in the data, and the majority class consists of all the remaining original classes. We call this the "unimodal" data group, where "unimodal" means that the minority class is formed from only one original class (Cao et al., 2014). Table 1 presents the data characteristics of this group. It is worth pointing out that all the binary datasets in the 2015 UCR repository whose imbalance ratios are higher than 1.5 have been added to this group, including Hr, Sb, POC, Lt2, PPOC, E200, Eq, and Wf.
For the second group, the minority class of each dataset is constructed by merging the two or three smallest original classes, and the majority class is composed of the remaining original classes. If the smallest original classes have very limited samples, we combine the three smallest original classes into the minority class; otherwise, the two smallest original classes are merged. The datasets of this group simulate the scenario in which the minority class is indeed multimodal, so we call it the multimodal data group. Since OHIT considers the multimodality of the minority class, it is expected to perform well on this group. Table 2 summarizes the data characteristics of this group; note that the feature dimension is greater than the number of minority samples on all these datasets.
Metrics  Methods  Yg  Hr  Sb  POC  Lt2  PPOC  E200  Eq  TP  Car  PPOA  Wf  Average
F-value  Original  0.5652  NaN  0.9571  0.3584  0.5778  0.7013  0.7143  NaN  0.5780  0.6875  0.4242  0.3163  0.5812
ROS  0.6073  0.4053  0.9513  0.5642  0.6117  0.7379  0.7431  0.1107  0.6367  0.8128  0.5320  0.6511  0.6137  
SMOTE  0.5949  0.5027  0.9366  0.4809  0.6961  0.7244  0.7678  0.1946  0.6379  0.8399  0.4369  0.5997  0.6177  
MDO  0.6072  0.4055  0.9478  0.5535  0.6299  0.7471  0.7336  0.0624  0.6282  0.6875  0.4955  0.6003  0.5915  
INOS  0.6130  0.4174  0.9485  0.5060  0.6039  0.7426  0.7537  0.0750  0.6469  0.7809  0.5051  0.6150  0.6007  
MoGT  0.5717  0.3920  0.9451  0.5352  0.6533  0.7431  0.7739  0.1870  0.5918  0.6875  0.4360  0.6840  0.6001  
OHIT  0.6134  0.4655  0.9513  0.5497  0.6543  0.7496  0.7650  0.1971  0.6490  0.8265  0.5096  0.5606  0.6243  

G-mean  Original  0.6180  0.0000  0.9688  0.4753  0.6388  0.7506  0.7725  0.0000  0.6818  0.7421  0.6261  0.4366  0.5592
ROS  0.6386  0.5093  0.9677  0.6349  0.6624  0.8130  0.8002  0.2655  0.7724  0.8576  0.7691  0.8301  0.7101  
SMOTE  0.6232  0.5824  0.9603  0.5675  0.7222  0.8058  0.8219  0.3822  0.7827  0.8823  0.8311  0.8095  0.7309  
MDO  0.6399  0.5108  0.9651  0.6300  0.6734  0.8139  0.7904  0.1842  0.7586  0.7421  0.7259  0.8129  0.6873  
INOS  0.6448  0.5207  0.9667  0.5960  0.6555  0.8179  0.8092  0.2029  0.7785  0.8264  0.7758  0.8110  0.7005  
MoGT  0.6071  0.4993  0.9640  0.6160  0.6887  0.8178  0.8261  0.3788  0.7420  0.7421  0.7176  0.8129  0.7010  
OHIT  0.6421  0.5584  0.9685  0.6258  0.6952  0.8217  0.8181  0.3956  0.7830  0.8694  0.7928  0.8083  0.7316  
AUC  Original  0.6771  0.2490  0.9898  0.6691  0.7056  0.9044  0.9032  0.4681  0.8496  0.9294  0.8908  0.8019  0.7532
ROS  0.6772  0.6174  0.9908  0.6687  0.6992  0.8837  0.8935  0.5320  0.8529  0.9360  0.8731  0.8847  0.7924  
SMOTE  0.6620  0.6335  0.9929  0.6307  0.7218  0.8860  0.9000  0.5658  0.8555  0.9311  0.9008  0.7595  0.7866  
MDO  0.6770  0.6029  0.9920  0.6752  0.7267  0.8987  0.8955  0.5314  0.8559  0.9259  0.9002  0.8754  0.7964  
INOS  0.6862  0.6205  0.9919  0.6511  0.7000  0.8880  0.8924  0.5317  0.8625  0.9309  0.9075  0.8625  0.7938  
MoGT  0.6368  0.6235  0.9892  0.6696  0.7013  0.8882  0.9125  0.5266  0.8204  0.9067  0.8962  0.7578  0.7774  
OHIT  0.6821  0.6270  0.9931  0.6645  0.7063  0.8989  0.9008  0.5341  0.8599  0.9340  0.9098  0.8714  0.7985  
Best (worst) results are highlighted in bold (italics).
Metrics  Methods  Ws  Pl  Ht  FISH  UWGLA  IWS  CZ  SL  FA  MI  SA  NIFT  Average
F-value  Original  0.0267  0.9623  0.4388  0.8627  0.7178  0.6182  NaN  0.3667  0.8262  0.2000  0.6667  0.7345  0.5837
ROS  0.4639  0.9670  0.5365  0.9004  0.6533  0.5852  0.3699  0.8120  0.8338  0.3509  0.4777  0.6947  0.6371  
SMOTE  0.5049  0.9804  0.6220  0.8753  0.7050  0.6601  0.5528  0.6385  0.8099  0.3928  0.4748  0.3544  0.6309  
MDO  0.4926  0.9718  0.6077  0.8841  0.6757  0.6749  0.4309  0.7325  0.8193  0.6099  0.6837  0.6458  0.6857  
INOS  0.4680  0.9679  0.5924  0.8955  0.7350  0.6853  0.4727  0.7855  0.8147  0.4363  0.4758  0.6804  0.6675  
MoGT  0.4679  0.9751  0.6153  0.8882  0.7279  0.6962  0.3902  0.7868  0.7563  0.4429  0.6195  0.5874  0.6628  
OHIT  0.5053  0.9670  0.6293  0.9006  0.7405  0.6831  0.5148  0.7866  0.8108  0.4489  0.6872  0.6927  0.6972  

G-mean  Original  0.1165  0.9623  0.5385  0.8749  0.7925  0.7077  0.0000  0.4778  0.8768  0.3580  0.7290  0.8036  0.6031
ROS  0.5485  0.9669  0.6119  0.9216  0.7670  0.7720  0.5301  0.8778  0.8860  0.6328  0.8665  0.8072  0.7657  
SMOTE  0.5682  0.9801  0.6719  0.9177  0.8360  0.8443  0.7536  0.8752  0.8993  0.7561  0.9111  0.8794  0.8244  
MDO  0.5716  0.9717  0.6670  0.9003  0.7855  0.7980  0.5986  0.8479  0.8882  0.7727  0.9302  0.7857  0.7931  
INOS  0.5527  0.9679  0.6544  0.9157  0.8416  0.8474  0.6420  0.8902  0.9007  0.7656  0.8999  0.8367  0.8096  
MoGT  0.5533  0.9747  0.6728  0.9054  0.8288  0.8407  0.5760  0.9049  0.8871  0.7387  0.9078  0.8030  0.7994  
OHIT  0.5768  0.9670  0.6843  0.9134  0.8448  0.8501  0.6914  0.9071  0.8996  0.7799  0.9305  0.8520  0.8247  
AUC  Original  0.4359  0.9975  0.7101  0.9496  0.9072  0.8835  0.7349  0.9587  0.9598  0.8559  0.9271  0.9605  0.8567
ROS  0.5549  0.9961  0.6735  0.9487  0.8627  0.8388  0.7214  0.9506  0.9605  0.7120  0.9110  0.9708  0.8418  
SMOTE  0.5738  0.9981  0.7125  0.9500  0.9093  0.8978  0.8220  0.9444  0.9531  0.7977  0.9242  0.9534  0.8697  
MDO  0.5591  0.9990  0.7112  0.9457  0.8770  0.8753  0.7602  0.9214  0.9544  0.8053  0.9331  0.9681  0.8592  
INOS  0.5646  0.9956  0.7084  0.9517  0.9074  0.9009  0.7738  0.9493  0.9525  0.8547  0.9212  0.9721  0.8710  
MoGT  0.5598  0.9967  0.7107  0.9447  0.9101  0.8961  0.6951  0.9387  0.9389  0.8384  0.9267  0.9560  0.8593  
OHIT  0.5709  0.9985  0.7285  0.9519  0.9110  0.9011  0.8141  0.9610  0.9541  0.8767  0.9362  0.9734  0.8815  
Best (worst) results are highlighted in bold (italics).
Assessment metrics. In the imbalanced learning area, F-value and G-mean are two widely used comprehensive metrics that reflect the compromise between performance on the majority class and the minority class. They are defined as follows:

F-value = 2 × recall × precision / (recall + precision),  G-mean = √(recall × specificity),

where recall and precision measure completeness and exactness on the minority class, respectively, and specificity measures prediction accuracy on the majority class. Another popular overall metric is the Area Under the receiver operating characteristic Curve (AUC) (Fawcett, 2004). Unlike F-value and G-mean, AUC is not affected by the decision threshold of the classifier. In this paper, we use F-value, G-mean, and AUC to assess the performance of the algorithms.

Base classifier. Previous studies (Cao et al., 2011, 2013, 2014) have shown that Support Vector Machines (SVM) in conjunction with the oversampling techniques SPO/INOS/MoGT can achieve better performance in terms of F-value and G-mean than the state-of-the-art approaches 1NN (Nguyen et al., 2011) and 1NN-DTW (Xi et al., 2006) for classifying imbalanced time series data. Hence, we select SVM with a linear kernel as the base classifier for the experimental comparisons. The regularization parameter of SVM is optimized by nested 5-fold cross-validation over the training data. For each experimental dataset, the oversampling algorithm is applied to the training data to balance the class distribution. Given that oversampling techniques involve random numbers in the process of generating synthetic samples, we run each oversampling method 10 times on the training data, and the final performance is the average of the 10 results on the test data.
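Given a binary confusion matrix with the minority class treated as the positive class, the two threshold-dependent metrics above can be computed as in this small sketch (AUC, which requires ranked classifier scores, is omitted here):

```python
import math

def imbalance_metrics(tp, fn, tn, fp):
    """F-value and G-mean from a binary confusion matrix (minority = positive)."""
    recall = tp / (tp + fn)            # completeness on the minority class
    precision = tp / (tp + fp)         # exactness on the minority class
    specificity = tn / (tn + fp)       # accuracy on the majority class
    f_value = 2 * recall * precision / (recall + precision)
    g_mean = math.sqrt(recall * specificity)
    return f_value, g_mean
```

Because G-mean is the geometric mean of recall and specificity, it collapses to 0 whenever a classifier entirely misses one of the two classes, which is why the "Original" rows of the result tables contain 0.0000 entries.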
3.2. Comparison of Oversampling Methods
To evaluate the effectiveness of OHIT, we compare OHIT with existing representative oversampling methods, including random oversampling (ROS), interpolation-based synthetic oversampling (SMOTE), the structure-preserving oversampling methods MDO and INOS, and the mixture of Gaussian trees model MoGT. With respect to parameter settings, the parameter values of OHIT are , , and , where is the number of minority samples. All the other methods use the default values recommended by the corresponding authors. Specifically, the neighbor parameter of SMOTE is set to 5; the parameters and in MDO are and , respectively; for INOS, 70% of the synthetic samples are generated from the Gaussian distribution reflecting the covariance structure of the minority class (i.e., ); in MoGT, the Bayesian information criterion is used to determine the number of mixture components.
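SMOTE's interpolation mechanism can be sketched as follows (a simplified stdlib version with brute-force neighbor search; the experiments above use the neighbor parameter k = 5):

```python
import random

def smote(minority, n_synthetic, k=5, rng=random.Random(0)):
    """Minimal SMOTE sketch: each synthetic sample is interpolated
    between a minority sample and one of its k nearest minority
    neighbors (Euclidean distance, brute force)."""
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest minority neighbors of x (excluding x itself)
        neighbors = sorted(
            (m for m in minority if m is not x),
            key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic
```

Because every synthetic point lies on a segment between two minority samples, the generated set stays inside the convex hull of the minority class, which is exactly the local behaviour discussed in Section 3.2.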
OHIT vs   Unimodal data              Multimodal data
          F-measure  G-mean  AUC     F-measure  G-mean  AUC
Original  9.8e-4   9.8e-4  0.0552   0.0122  4.9e-4  0.0044
ROS       0.4131   0.0342  0.0425   0.042   0.0015  0.0034
SMOTE     0.6772   0.9097  0.3013   0.0342  0.6221  0.0342
MDO       0.0342   0.0093  0.377    0.1099  0.0015  0.0024
INOS      0.0342   0.0049  0.021    0.0425  0.0068  0.0015
MoGT      0.064    0.0122  0.0269   0.0269  0.0015  4.9e-4
Metrics  Methods  Datasets                                                    Average
                  Yg  Hr  Sb  POC  Lt2  PPOC  E200  Eq  TP  Car  PPOA  Wf
Recall  SMOTE  0.5904  0.4769  0.9913  0.4855  0.6786  0.8076  0.8639  0.1672  0.8675  0.8421  0.8471  0.7129  0.6943 
OHIT  0.6050  0.4038  0.9849  0.5587  0.5893  0.7957  0.8056  0.1845  0.7705  0.8158  0.6941  0.7316  0.6616  
Specificity  SMOTE  0.6580  0.7132  0.9302  0.6639  0.7697  0.8045  0.7828  0.8792  0.7118  0.9244  0.8160  0.9193  0.7978 
OHIT  0.6817  0.7737  0.9523  0.7025  0.8212  0.8487  0.8313  0.8489  0.7958  0.9268  0.9069  0.8933  0.8319  
Precision  SMOTE  0.5996  0.5332  0.8877  0.4766  0.7160  0.6576  0.6921  0.2346  0.5045  0.8379  0.2945  0.5183  0.5794 
OHIT  0.6223  0.5514  0.9200  0.5422  0.7368  0.7088  0.7287  0.2119  0.5607  0.8377  0.4032  0.4554  0.6066  
In terms of recall, specificity, and precision, the p-values of the Wilcoxon test between OHIT and SMOTE are , , and , respectively. 
Tables 3 and 4 respectively present the classification performances of all the compared algorithms on the unimodal and multimodal datasets, where Original represents SVM without any oversampling. For both data groups, Original shows the worst results on most of the datasets in terms of F-value and G-mean, while OHIT achieves the best average performance in all the metrics.
To verify whether OHIT can significantly outperform the other compared algorithms, we perform the Wilcoxon signed-ranks test on the classification results of Tables 3 and 4. The test results are summarized in Table 5, where “+” and “*” denote that the corresponding p-value is not greater than 0.05 and 0.1, respectively. From Table 5, one can see that most of the p-values in the significant tests do not exceed 0.05, and that there are more significant differences on the multimodal datasets than on the unimodal datasets.
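The Wilcoxon signed-ranks test applied here can be sketched as a minimal stdlib implementation using the normal approximation (without the tie and zero corrections of production implementations such as scipy.stats.wilcoxon):

```python
import math

def wilcoxon_signed_rank(a, b):
    """Two-sided Wilcoxon signed-ranks test sketch for paired results
    a, b (normal approximation; zero differences are dropped)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    # rank the absolute differences, averaging ranks over ties
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
```

When one method wins on every dataset the statistic piles onto one side and the p-value is small; when wins and losses balance, the p-value approaches 1.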
Methods  Unimodal data              Multimodal data
         F-measure  G-mean  AUC     F-measure  G-mean  AUC
—        0.6229   0.7290  0.7935   0.6641  0.8131  0.8751
—        0.5988   0.6977  0.7938   0.6523  0.7769  0.8475
—        0.5938   0.6974  0.7939   0.6619  0.7900  0.8519
—        0.6243   0.7316  0.7985   0.6972  0.8247  0.8815
Best results are highlighted in bold type. 
OHIT vs  Unimodal data              Multimodal data
         F-measure  G-mean  AUC     F-measure  G-mean  AUC
—        0.5771   0.1973  0.0039   0.021   0.0054  0.0244
—        0.1763   0.0923  0.0923   0.0034  0.0068  0.0049
—        0.021    9.8e-4  0.0356   0.0269  0.0313  0.0063
It is worth noting that no significant difference was found between OHIT and SMOTE on the unimodal data group. To investigate the performance differences of these two algorithms at a finer granularity, we compute their recall, specificity, and precision values on the unimodal datasets. The results are summarized in Table 6. According to Table 6, SMOTE performs better in recall, but does not statistically outperform OHIT in this metric, while OHIT obtains higher specificity and precision values on most of the datasets, and is significantly better than SMOTE in specificity (/precision) at a significance level of 0.05 (/0.1). From this result, we find that, compared to OHIT, SMOTE boosts the performance on the minority class more aggressively, but at the same time causes the misclassification of more majority samples. One main reason may be that high dimensionality aggravates the over-generalization problem of SMOTE. Since the space between two minority samples grows exponentially with dimensionality, the synthetic samples interpolated by SMOTE can fall in a huge region of high-dimensional space. Greatly expanding the minority class regions is beneficial for predicting the minority samples, but it also increases the risk of invading majority class regions.
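The effect of dimensionality on interpolation can be illustrated with a small simulation (ours, on hypothetical uniform data, not part of the experiments): the average distance between two random points, and hence the span of the segments SMOTE interpolates along, grows steadily with dimension.

```python
import math
import random

rng = random.Random(0)

def mean_pair_distance(dim, trials=2000):
    """Average Euclidean distance between two points drawn uniformly
    from the unit hypercube [0, 1]^dim."""
    total = 0.0
    for _ in range(trials):
        x = [rng.random() for _ in range(dim)]
        y = [rng.random() for _ in range(dim)]
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return total / trials

for dim in (2, 32, 512):
    print(dim, round(mean_pair_distance(dim), 2))
```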
Although the major advantage of OHIT is its capability of dealing with the multimodality of the minority class, OHIT also exhibits performance superiority on the unimodal datasets in comparison with MDO and INOS (Table 5). An inherent deficiency of MDO is that the covariance structure of the minority class is taken directly from the sample covariance matrix. INOS uses a regularization procedure to fix the unreliable eigenspectrum of the sample covariance matrix, but does not consider the negative influence of outliers on the estimation of the covariance matrix. Compared to INOS, OHIT can utilize DRSNN clustering to eliminate the outliers of the minority class.
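The shrinkage estimation OHIT relies on combines the sample covariance matrix with a structured target. A minimal sketch, assuming a diagonal target (sample variances on the diagonal) and a user-supplied shrinkage intensity lam; OHIT's actual estimator selects the intensity analytically, in the Ledoit-Wolf style:

```python
def shrinkage_covariance(X, lam):
    """Shrinkage estimate sketch: S* = (1 - lam) * S + lam * T, where
    S is the sample covariance of the rows of X and T is the diagonal
    target diag(S).  Diagonal entries are unchanged; off-diagonal
    entries are shrunk toward zero, which keeps S* well-conditioned
    when the sample size is small relative to the dimension."""
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    S = [[sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in X) / (n - 1)
          for j in range(d)] for i in range(d)]
    return [[S[i][j] if i == j else (1 - lam) * S[i][j]
             for j in range(d)] for i in range(d)]
```

With few minority samples, the raw off-diagonal entries of S are noisy; shrinking them toward zero trades a little bias for a large reduction in estimation variance.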
3.3. Evaluation of Separate OHIT Procedures
This experiment aims to evaluate the impact of DRSNN clustering and shrinkage estimation on the performance of OHIT. To this end, we compare OHIT with the following variants: 1) OHIT without DRSNN clustering (denoted by OHIT/DRSNN); 2) OHIT without using the shrinkage technique to improve the covariance matrix estimator (OHIT/shrinkage); 3) OHIT with the shrinkage estimate of the covariance matrix replaced by eigenspectrum regularization (OHIT with ER; the considered regularization is employed in INOS to alleviate the over-adapted problem of the sample covariance matrix).
Table 7 summarizes the average performance values of OHIT and its three variants (due to space limitations, the detailed experimental results are provided in Tables S1 and S2 in the supplementary material). The corresponding Wilcoxon test results between OHIT and each of its variants are presented in Table 8. We can find that OHIT is significantly better than all the variants in most of the cases, with more obvious advantages on the multimodal data group than on the unimodal data group. This indicates that both DRSNN clustering and the shrinkage estimation of the covariance matrix have positive effects on enabling OHIT to achieve better performance for high-dimensional imbalanced time series classification.
3.4. Comparison of Oversampling Mechanisms on a Toy Dataset
We visually compare OHIT and the other algorithms on a two-dimensional toy dataset. Fig. 2a illustrates a balanced distribution, where each class has three modes and each mode contains 500 samples. In Fig. 2b, the minority class, represented by blue pluses, randomly retains 50 samples per mode, so as to form an imbalanced class distribution. Figs. 2c, 2d, 2e, 2f, 2g, and 2h show the augmented data after applying ROS, SMOTE, MDO, INOS, MoGT, and OHIT to the minority class in sequence, where the introduced synthetic samples are denoted by red asterisks.
Based on these figures, the following observations can be made. 1) ROS does not effectively expand the regions of the minority class, as the generated synthetic samples are replications of original minority samples (Fig. 2c). 2) SMOTE interpolates synthetic samples between pairs of neighboring minority samples, which only considers the local characteristics of minority samples. Hence, the generated synthetic samples do not reflect the whole structures contained in the modes (Fig. 2d). 3) In MDO and INOS, the assumption that the minority class is unimodal can lead to an erroneous covariance matrix. The introduced synthetic samples can totally distort the original structure of each mode (Figs. 2e and 2f). 4) Although MoGT takes multimodality into account by building multiple Gaussian tree models for the minority class, the modes of the minority class are not captured correctly on this toy dataset (Fig. 2g). In fact, the authors of MoGT assign the number of Gaussian tree models manually when modelling the minority class. However, the number of modes is unknown in practice; the developed algorithm should be capable of detecting the modes of the minority class automatically. 5) Unlike MoGT, OHIT identifies all the modes correctly. Among all the compared algorithms, the data augmented by OHIT is the most similar to the original balanced data (Fig. 2a vs Fig. 2h).
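The mode-wise generation idea behind these observations can be sketched as follows, assuming cluster labels for the minority samples are already available (in OHIT they come from DRSNN clustering) and simplifying to an independent per-dimension Gaussian per mode (OHIT uses a shrinkage covariance estimate instead):

```python
import random

def per_mode_oversample(samples, labels, n_per_mode, rng=random.Random(0)):
    """Sketch of mode-wise Gaussian oversampling: group the minority
    samples by cluster label, then draw synthetic samples from each
    mode's own mean and per-dimension standard deviation, so each
    mode's structure is preserved separately."""
    modes = {}
    for x, c in zip(samples, labels):
        modes.setdefault(c, []).append(x)
    synthetic = []
    for pts in modes.values():
        d = len(pts[0])
        mean = [sum(p[j] for p in pts) / len(pts) for j in range(d)]
        std = [(sum((p[j] - mean[j]) ** 2 for p in pts)
                / max(len(pts) - 1, 1)) ** 0.5 for j in range(d)]
        for _ in range(n_per_mode):
            synthetic.append([rng.gauss(mean[j], std[j]) for j in range(d)])
    return synthetic
```

Unlike a single global Gaussian fit (the unimodal assumption of MDO and INOS), this keeps each mode's synthetic samples centered on that mode.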
4. Conclusion
Learning from imbalanced time-series data is challenging, since time series data tends to be high-dimensional and highly correlated across variables. In this study, we have proposed a structure-preserving oversampling method, OHIT, for the classification of imbalanced time-series data. To acquire the covariance structure of the minority class correctly, OHIT leverages a DRSNN clustering algorithm to capture the multimodality of the minority class in high-dimensional space, and uses the shrinkage technique of covariance matrix estimation to alleviate the problem of limited samples. We evaluated the effectiveness of OHIT on both unimodal and multimodal datasets. The experimental results showed that OHIT significantly outperforms existing typical oversampling solutions in most cases, and that both DRSNN clustering and the shrinkage technique are important for enabling OHIT to gain better performance in classifying imbalanced time-series data.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (Project No. 61872131). We thank the authors of MDO, INOS, and MoGT for sharing their algorithm codes with us.
References
To combat multi-class imbalanced problems by means of over-sampling techniques. IEEE Transactions on Knowledge and Data Engineering 28 (1), pp. 238–251.
Applying support vector machines to imbalanced datasets. In European Conference on Machine Learning, pp. 39–50.
A parsimonious mixture of Gaussian trees model for oversampling in imbalanced and multimodal time-series classification. IEEE Transactions on Neural Networks & Learning Systems 25 (12), pp. 2226–2239.
Integrated oversampling for imbalanced time series classification. IEEE Transactions on Knowledge and Data Engineering 25 (12), pp. 2809–2822.
SPO: structure preserving oversampling for imbalanced time series classification. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, pp. 1008–1013.
An oversampling method based on probability density estimation for imbalanced datasets classification. In Proceedings of the 2016 International Conference on Intelligent Information Processing, pp. 44.
Novel cost-sensitive approach to improve the multilayer perceptron performance on imbalanced data. IEEE Transactions on Neural Networks and Learning Systems 24 (6), pp. 888–899.
SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, pp. 321–357.
The UCR time series classification archive. July.
Cost-aware pre-training for multiclass cost-sensitive deep learning. arXiv preprint arXiv:1511.09337.
RACOG and wRACOG: two probabilistic oversampling techniques. IEEE Transactions on Knowledge and Data Engineering 27 (1), pp. 222–234.
A new shared nearest neighbor clustering algorithm and its applications. In Workshop on Clustering High Dimensional Data and its Applications at 2nd SIAM International Conference on Data Mining, pp. 105–115.
Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In SIAM International Conference on Data Mining.
A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Vol. 96, pp. 226–231.
ROC graphs: notes and practical considerations for researchers. Machine Learning 31 (1), pp. 1–38.
Regularized discriminant analysis. Journal of the American Statistical Association 84 (405), pp. 165–175.
Introduction to Statistical Pattern Recognition. Elsevier.
Can shared-neighbor distances defeat the curse of dimensionality? In International Conference on Scientific & Statistical Database Management.
Predicting seismic events in coal mines based on underground sensor measurements. Engineering Applications of Artificial Intelligence 64, pp. 83–94.
Clustering using a similarity measure based on shared near neighbors.
Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Transactions on Neural Networks and Learning Systems 29 (8), pp. 3573–3587.
Honey, I shrunk the sample covariance matrix. Social Science Electronic Publishing 30 (4), pp. 110–119.
Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. Journal of Empirical Finance 10 (5), pp. 603–621.
Positive unlabeled learning for time series classification. In Twenty-Second International Joint Conference on Artificial Intelligence.
Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery 2 (2), pp. 169–194.
A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology 4 (1), Article 32.
A simplified model for portfolio analysis. Management Science 9 (2), pp. 277–293.
GIR-based ensemble sampling approaches for imbalanced learning. Pattern Recognition 71, pp. 306–319.
ECG beats classification using multiclass support vector machines with error correcting output codes. Digital Signal Processing 17 (3), pp. 675–684.
Generalized models for the classification of abnormal movements in daily life and its applicability to epilepsy convulsion recognition. International Journal of Neural Systems 26 (06), pp. 1650037.
An under-sampling method based on fuzzy logic for large imbalanced dataset. In Fuzzy Systems (FUZZ-IEEE), 2014 IEEE International Conference on, pp. 1248–1252.
Fast time series classification using numerosity reduction. In International Conference.
CGMOS: certainty guided minority over-sampling. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pp. 1623–1631.
Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge & Data Engineering (1), pp. 63–77.
Minority oversampling for imbalanced ordinal regression. Knowledge-Based Systems 166, pp. 140–155.
Synthetic minority oversampling technique for multiclass imbalance problems. Pattern Recognition 72, pp. 327–340.
Density-ratio based clustering for discovering clusters with varying densities. Pattern Recognition 60, pp. 983–997.