I Introduction
Time Series Classification (TSC) is one of the most challenging tasks in data mining [12]. In recent years, with the remarkable success of deep neural networks (DNNs) in Computer Vision (CV) [19, 17, 34], many researchers have tried to employ DNNs in TSC due to the similarity between time-series data (one-dimensional sequences) and image data (two-dimensional sequences). However, through extensive experiments on various existing DNN-based TSC approaches, we found that DNNs easily overfit on datasets from the UCR archive [1]. Specifically, several experiments were conducted with typical DNNs, i.e., Fully Convolutional Networks (FCN) [37], Residual Networks (ResNet) [17, 37], and InceptionTime [13], on the UCR datasets. Taking InceptionTime as an example, only a minority of the datasets show a small gap between training and test accuracy, while on the remaining datasets the gap is substantial, as shown in Fig. 1(a). In addition, Fig. 1(b) gives the training and test accuracy with respect to epochs on a specific UCR dataset, where the gap emerges after a number of epochs and remains stationary until the end.

After thoroughly analyzing the UCR datasets, we claim that the difference between datasets in CV and those in the UCR archive contributes the most to the overfitting phenomenon. In detail, we can view this from the perspective of $k$-shot learning, where $k$
represents the number of training examples per class. In CV, datasets typically contain enough training samples for DNNs, e.g., MNIST [24] and CIFAR10 [23] are both many-shot learning datasets with tens of thousands of training samples in total, and ImageNet [11] is on average a many-shot learning dataset with more than 14 million training samples. However, in the UCR archive, many datasets are few-shot: a large portion of them have very few training samples, while only a small number have ample training data. For example, Fungi is a few-shot learning dataset, and DiatomSizeReduction contains only a handful of training samples. We summarize this as the few-shot problem in TSC. In summary, solving the few-shot problem is the key to improving accuracy. Interestingly, many state-of-the-art approaches adopted methods for alleviating overfitting without pointing out the overfitting problem, such as the Hierarchical Vote system for Collective Of Transformation-based Ensembles (HIVE-COTE) [25] with an ensemble of classifiers, InceptionTime [13] with an ensemble of models, and RandOm Convolutional KErnel Transform (ROCKET) [9] with regularization and cross validation. Among the aforementioned state-of-the-art methods, ROCKET has the best accuracy and inference time balance in practice.

In this paper, instead of employing the ordinary approaches for alleviating overfitting, e.g., weight regularization, Batch Normalization (BN) [20], Dropout [31], early stopping, etc., we first propose Label Smoothing based on InceptionTime [35] (LSTime) for improving the generalization ability of InceptionTime, as soft labels are closer to real life compared to hard labels. For instance, the true label of the handwritten digit in Fig. 2(a) is a single class, yet it intuitively also resembles another digit; thus, giving a hard label to that digit may cause information loss. Also, in stocks, there are many meaningful chart patterns, among which Head & Shoulders (H&S), Triple Top (TT), and Double Top (DT) [36] are similar. In Fig. 2(b), the H&S chart pattern is close to TT; therefore, we wish to keep its TT information. As a consequence, soft labels maintain more information compared to hard labels. However, we would like to obtain the soft labels automatically rather than set them manually. Thus, secondly, Knowledge Distillation based on InceptionTime [18] (KDTime) is leveraged to generate the soft labels with a teacher model, which is a pretrained network with a deep and complex architecture. The predicted labels from the teacher model represent the knowledge it has learned. After that, the knowledge (the predicted soft labels) can be distilled to help the training of a student model, which has a relatively smaller and simpler architecture. As a consequence, Knowledge Distillation can also reduce the inference time because of the student model. At last, we note that the teacher is not always correct, which means it may produce wrong soft labels and misguide the student. As the ground-truth labels have already been obtained, we propose to simply calibrate the wrong predicted soft labels, in order to maximize the accuracy of InceptionTime, called Knowledge Distillation with Calibration based on InceptionTime (KDCTime). In addition, KDCTime includes two optional calibrating strategies, i.e.,
KDC by Translating (KDCT) and KDC by Reordering (KDCR).

The InceptionTime model employed in LSTime, KDTime, and KDCTime is a single model instead of an ensemble, since the training and inference time are essential and indicate the feasibility of the model. Thus, compared to the original version, which ensembles several models with several Inception modules each, the InceptionTime model in this paper is a single one with a fixed number of Inception modules. In summary, the main contributions of this paper are as follows:

We discovered that DNN-based TSC approaches normally overfit on the UCR datasets, which is caused by the few-shot nature of those datasets. Thus, the most promising direction for improving the performance of DNNs is alleviating overfitting.

We combined Label Smoothing and Knowledge Distillation with InceptionTime, denoted LSTime and KDTime respectively. LSTime trains the InceptionTime model with manually controlled soft labels, while KDTime generates soft labels automatically with a teacher model.

We proposed KDCTime for calibrating incorrect soft labels predicted by the teacher model; it contains two optional strategies, i.e., KDCT and KDCR. As a consequence, KDCTime further improves the accuracy over KDTime.

We tested the accuracy, training time, and test time of ROCKET, InceptionTime, LSTime, KDTime, and KDCTime. The results show that, compared to ROCKET, KDCTime simultaneously improves the accuracy and reduces the inference time with an acceptable training time overhead. In conclusion, the performance of KDCTime is promising.
II Related Work
TSC, as a traditional time-series mining research direction, has been considered one of the most challenging problems [12]. Traditionally, the Nearest Neighbor (1NN) classifier based on the Dynamic Time Warping (DTW) distance has been shown to be a very promising approach [2]. Yet, the time complexity of DTW is unacceptable compared to the Euclidean Distance (ED), being quadratic rather than linear in the series length. Thus, many researchers have tried to accelerate DTW. Rakthanmanon et al. [28] proposed UCR-DTW by leveraging lower bounding and early abandoning. Sakurai et al. [30] proposed SPRING under the DTW distance, which is able to monitor time-series streams in real time. Gong et al. [16] proposed Forward-Propagation NSPRING (FPNS) to further accelerate SPRING. Nevertheless, those studies all concentrate on time complexity; the upper bound of their accuracy is the accuracy of the DTW distance itself.
In order to break through the accuracy bottleneck, researchers found that ensembling several classifiers could significantly improve the accuracy of TSC. Thus, Baydogan et al. [4] selected an ensemble of decision trees, and Kate [22] employed an ensemble of several NN classifiers with different distance measures, including the DTW distance. Those methods motivated the development of an ensemble of classifiers named Collective Of Transformation-based Ensembles (COTE) [3], which ensembles classifiers over different time-series representations instead of the same one. After that, Lines et al. [25] extended COTE by leveraging a new hierarchical structure with probabilistic voting, called HIVE-COTE, which is currently considered the state-of-the-art approach in accuracy. Nevertheless, in order to achieve such high accuracy, those methods sacrifice training and inference time, and most of them are impractical when datasets are large.

With the rapid development of deep learning, DNNs are widely applied in CV and are also increasingly employed in TSC
[37]. Cui et al. [8] proposed Multi-Scale Convolutional Neural Networks (MCNN), which transform the time-series into several feature vectors and feed those vectors into a CNN model. Wang et al. [37] implemented several DNN models originating from CV, i.e., Multi-Layer Perceptrons (MLP), Fully Convolutional Networks (FCN), and Residual Networks (ResNet), in order to test their performance in TSC, which provides a strong baseline for DNN-based approaches. Based on a more recent DNN structure, i.e., the Inception module [34], Karimi-Bidhendi et al. [21] proposed an approach that transforms time-series into feature maps using the Gramian Angular Difference Field (GADF) and finally feeds those maps to an InceptionNet pretrained for image recognition. By extending a more recent version of InceptionNet, i.e., Inception-v4 [33], Fawaz et al. [13] proposed InceptionTime, which ensembles Inception-based models to achieve a promising accuracy. The Multi-scale Attention Convolutional Neural Network (MACNN) [6] adopted an attention mechanism to further improve the accuracy of MCNN. Instead of utilizing convolutions as parts of the model, RandOm Convolutional KErnel Transform (ROCKET) [9] employed random convolutional kernels as feature extractors converting time-series into feature vectors, which are later fed to a ridge regressor. To the best of the authors' knowledge, ROCKET owns the best accuracy and inference time balance, while other DNN-based methods always require longer training and inference times. Actually, DNN-based methods always suffer from long training times, e.g., MACNN requires days of running time to train on datasets from the UCR archive. That builds a huge barrier for researchers to re-implement the approach.

We found that InceptionTime is a quite competitive approach. Using only a single InceptionTime model instead of an ensemble, it has a slower yet acceptable training time and an order of magnitude less inference time compared to ROCKET. Therefore, when InceptionTime includes only one model instead of an ensemble of several models, we would like to preserve its accuracy with the information obtained from soft labels. The idea of utilizing soft labels was proposed by Szegedy et al. [35], who control the smoothing level of soft labels with a manually set parameter. Yet, a better way to determine the smoothing level of soft labels is to generate them automatically with a teacher model and train a student model with those labels, an idea that comes from Knowledge Distillation (KD) [18].
Since then, many extensions of KD have been proposed. Some studies [29, 38] concentrated on letting the student model learn the feature maps, instead of the soft labels, of the teacher model. In addition, Svitov et al. [32] leveraged the labels predicted by the teacher model as class centers, instead of soft labels, to guide the training of the student model. Oki et al. [27] integrated KD into the triplet loss and utilized the predicted labels as anchor points for guiding the training of the student model. In this paper, instead of employing other types of KD methods, we still concentrate on label-based KD approaches in order to save execution time. At last, inspired by [7, 26], the gap between the student model and the teacher model should be small, or the student model would find it hard to mimic the teacher model. Therefore, in this paper, a shallower InceptionTime model is selected as the student model, and a deeper one is employed as the teacher model.
III Proposed Approaches
In this section, instead of concentrating on the model, all the proposed approaches are essentially centered on loss functions and labels. First, notations and definitions are given in Section III-A. After that, InceptionTime is briefly introduced in Section III-B. Then, Label Smoothing for InceptionTime (LSTime) is demonstrated in Section III-C. Next, Knowledge Distillation for InceptionTime (KDTime) is depicted in Section III-D. At last, Knowledge Distillation with Calibration for InceptionTime (KDCTime) is illustrated in Section III-E, which contains two strategies, i.e., Calibration by Translating (CT) and Calibration by Reordering (CR).

III-A Notations and Definitions
Definition 1
A time-series is defined as a vector $x = (x_1, x_2, \dots, x_n)$, where $x_j$ represents the $j$-th value of $x$. The corresponding class of $x$ is a scalar $c \in \{1, 2, \dots, C\}$, with $C$ classes in total.
Definition 2
A class label of $x$ is defined as a vector $y = (y_1, y_2, \dots, y_C)$, where $y_i$ represents the probability of $x$ belonging to class $i$. In addition, Eq. (1) always holds for any $y$.
$\sum_{i=1}^{C} y_i = 1, \quad y_i \ge 0$   (1)
Definition 3
The true label of $x$ is defined as a one-hot vector $y$, where all $y_i = 0$ except $y_c = 1$, called the hard label. The equation of $y$ is given in Eq. (2).
$y_i = \begin{cases} 1, & i = c \\ 0, & i \ne c \end{cases}$   (2)
Definition 4
A dataset $D = (X, Y)$ is a pair of sets, including a set of time-series $X$ and a set of true labels $Y$ respectively, where each time-series $x \in X$ corresponds to a true label $y \in Y$.
Definition 5
An InceptionTime model is treated as a function $f$ mapping an input $x$ into an output $z = f(x)$, where $f \in \mathcal{H}$ and $\mathcal{H}$ represents the hypothesis space, i.e., the space containing all possibilities of $f$.
Definition 6
The predicted label of $x$ is defined as $\hat{y} = \sigma(z)$, where $\sigma$ is the Softmax function, given in Eq. (3).
$\hat{y}_i = \sigma(z)_i = \frac{\exp(z_i)}{\sum_{j=1}^{C} \exp(z_j)}$   (3)
Definition 7
A loss function $\mathcal{L}$ is a function measuring the difference between the predicted label $\hat{y}$ and the true label $y$, in order to determine the performance of $f$.
Definition 8
The problem in this paper is defined as follows: given a dataset $D$, find an $f \in \mathcal{H}$ minimizing the predefined $\mathcal{L}$. Formally, it is demonstrated in Eq. (4).
$f^* = \arg\min_{f \in \mathcal{H}} \sum_{(x, y) \in D} \mathcal{L}(\sigma(f(x)), y)$   (4)
To this end, important notations are briefly summarized in Table I.
Notations  Definitions
$x$  A time-series
$y$  The true label w.r.t. $x$
$D$  A dataset
$X$  The set of time-series in $D$
$Y$  The set of labels in $D$
$f$  An InceptionTime model
$z$  The output of $f$
$\sigma$  The Softmax function
$\hat{y}$  The predicted label of $x$
$\mathcal{L}$  A loss function
III-B InceptionTime
The ordinary Softmax Cross Entropy loss is adopted in InceptionTime [13]. We first implemented the one-model version of InceptionTime with Inception modules, denoted $f_s$. Thus, the loss function of $f_s$ is given in Eq. (5).
$\mathcal{L}_{CE}(\hat{y}, y) = -\sum_{i=1}^{C} y_i \log \hat{y}_i$   (5)
where it is easy to see that the final loss is only related to $\hat{y}_c$, as all the other $y_i$ are $0$ (Eq. (2)). In other words, only the term of the true class $c$ survives the summation. Therefore, for simplicity, Eq. (5) can also be written as Eq. (6).
$\mathcal{L}_{CE}(\hat{y}, y) = -\log \hat{y}_c$   (6)
Note that Eq. (6) is the reason the one-hot class label is called a hard label, as it explicitly selects only the probability of $x$ belonging to class $c$, ignoring all other probabilities in the loss function. However, in more realistic scenarios, we believe such a deterministic case is rare. Hence, as also introduced in Section I, a soft version of $y$ is more feasible.
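As a quick sanity check of Eq. (6), the following plain-Python sketch (with illustrative logits, not values from the paper) shows that the cross entropy with a one-hot label collapses to the negative log-probability of the true class:

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(y, p):
    """Full cross entropy -sum_i y_i * log p_i."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi > 0)

logits = [2.0, 0.5, -1.0]        # model output z for one sample
p = softmax(logits)              # predicted label
y_hard = [1.0, 0.0, 0.0]         # one-hot hard label, true class c = 0

# With a hard label, the summation collapses to -log p_c (Eq. (6)).
assert abs(cross_entropy(y_hard, p) - (-math.log(p[0]))) < 1e-12
```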
III-C Label Smoothing for InceptionTime
Second, following the assumption in Section III-B, we implemented Softmax Cross Entropy with Label Smoothing (LS) [35] on the same model. The equation of the label-smoothed $y$ is given in Eq. (7), denoted $y'$.
$y'_i = (1 - \varepsilon)\, y_i + \frac{\varepsilon}{C}$   (7)
where $\varepsilon$ is the smoothing coefficient set by users, representing how much the label is smoothed. Note that after LS, $y'$ still satisfies Eq. (1). Alternatively, Eq. (7) can also be written as Eq. (8).
$y'_i = \begin{cases} 1 - \varepsilon + \frac{\varepsilon}{C}, & i = c \\ \frac{\varepsilon}{C}, & i \ne c \end{cases}$   (8)
As a consequence, the Softmax Cross Entropy loss with LS is given in Eq. (9).
$\mathcal{L}_{LS} = -(1 - \varepsilon) \log \hat{y}_c - \frac{\varepsilon}{C} \sum_{i=1}^{C} \log \hat{y}_i$   (9)
where the left part represents the loss from the hard label, while the right part is the loss from the soft (uniform) component. The smoothing coefficient $\varepsilon$ controls the weights of the losses from hard labels and soft labels.
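The smoothing of Eq. (7) can be sketched in a few lines of Python; the class count and the smoothing coefficient below are illustrative values, not the paper's tuned setting:

```python
def smooth_label(y_hard, eps):
    """Label Smoothing (Eq. (7)): mix the one-hot label with the uniform
    distribution over C classes; eps is the smoothing coefficient."""
    C = len(y_hard)
    return [(1.0 - eps) * y + eps / C for y in y_hard]

y = smooth_label([0.0, 1.0, 0.0], eps=0.2)
# The entries still sum to 1, so the smoothed label stays in the label space,
# and the true class keeps the largest probability.
assert abs(sum(y) - 1.0) < 1e-12
assert max(y) == y[1]
```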
Yet, manually controlling the smoothing level of labels by $\varepsilon$ is not the best solution, since, except for the true class, every entry of the smoothed label has the same value. Similar to hard labels, this kind of manually controlled soft label is not practical in the real world. Therefore, generating flexible soft labels by Knowledge Distillation is proposed next.
III-D Knowledge Distillation for InceptionTime
Third, we implemented Knowledge Distillation (KD) to generate soft labels in place of controlling them manually. Hinton et al. [18] proposed KD to generate soft labels with a teacher model, which has a cumbersome architecture with a large number of parameters. Intuitively, the teacher model has more potential to capture the knowledge in the training data because of its scale. The predicted labels from the teacher model can be regarded as the knowledge learned by it, denoted $\hat{y}^{(t)}$. Thus, $\hat{y}^{(t)}$ is an automatic soft label, in contrast to the manual soft label of LS. Note that the one-model version of InceptionTime with Inception modules is incorporated as the teacher model, denoted $f_t$.

In addition, instead of directly using the Softmax Cross Entropy loss, the Softmax with a temperature $T$ and the Kullback-Leibler (KL) divergence loss are adopted. The equation of the Softmax with $T$ is given in Eq. (10).
$\sigma_T(z)_i = \frac{\exp(z_i / T)}{\sum_{j=1}^{C} \exp(z_j / T)}$   (10)
where the temperature $T$ is a parameter for fine-tuning the smoothing level of the predicted labels from the teacher model and from the student model, denoted $\hat{y}^{(t)}_T$ and $\hat{y}^{(s)}_T$. Note that $T = 1$ keeps the labels unchanged, while $T < 1$ or $T > 1$ makes the labels steeper or smoother respectively. As an extreme example, if $T \to \infty$, we have $\sigma_T(z)_i \to 1/C$. To this end, the loss between $\hat{y}^{(t)}_T$ and $\hat{y}^{(s)}_T$ can be measured by the KL divergence, given in Eq. (11).
$\mathcal{L}_{KL} = \mathrm{KL}\big(\hat{y}^{(t)}_T \,\|\, \hat{y}^{(s)}_T\big) = \sum_{i=1}^{C} \hat{y}^{(t)}_{T,i} \log \frac{\hat{y}^{(t)}_{T,i}}{\hat{y}^{(s)}_{T,i}}$   (11)
After that, $\mathcal{L}_{KL}$, representing the loss on soft labels, and $\mathcal{L}_{CE}$, representing the loss on hard labels, are combined into a whole for training the student model, which has a relatively small architecture with fewer parameters. This procedure is regarded as distilling the knowledge from a teacher model into a student model, called KD, in order to preserve the accuracy of the teacher model while reducing its time and space complexity. Note that $f_s$, the one-model version of InceptionTime with Inception modules, is selected as the student model. The equation of the KD loss is given in Eq. (12).
$\mathcal{L}_{KD} = (1 - \alpha)\, \mathcal{L}_{CE} + \alpha T^2 \mathcal{L}_{KL}$   (12)
where $\alpha$ controls the weights of $\mathcal{L}_{CE}$ (Eq. (5)) and $\mathcal{L}_{KL}$ (Eq. (11)). Note that the factor $T^2$ is necessary because the scale of $\mathcal{L}_{KL}$ becomes smaller after fine-tuning by $T$. Thus, multiplying by $T^2$ brings it back to the same scale as $\mathcal{L}_{CE}$, so that the total loss has no preference between the two terms.
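A minimal plain-Python sketch of the KD loss follows. The temperature, the weighting coefficient, and the convention that the weight multiplies the soft-label term are assumptions in the standard Hinton et al. form, not necessarily the paper's exact parameterization:

```python
import math

def softmax_T(z, T):
    """Softmax with temperature T (Eq. (10)); T > 1 smooths, T < 1 sharpens."""
    m = max(z)
    e = [math.exp((v - m) / T) for v in z]
    s = sum(e)
    return [v / s for v in e]

def kl_div(p, q):
    """KL divergence KL(p || q) between teacher and student labels (Eq. (11))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kd_loss(student_logits, teacher_logits, true_class, T, alpha):
    """KD loss in the standard Hinton et al. form; alpha and T are
    hyperparameters to be tuned (values here are placeholders)."""
    # Hard-label term: cross entropy against the true class at T = 1.
    p1 = softmax_T(student_logits, 1.0)
    hard = -math.log(p1[true_class])
    # Soft-label term: KL between softened teacher and student outputs,
    # multiplied by T^2 to keep it on the same scale as the hard term.
    soft = kl_div(softmax_T(teacher_logits, T),
                  softmax_T(student_logits, T))
    return (1.0 - alpha) * hard + alpha * T * T * soft
```

With `alpha = 0` the loss reduces to the ordinary cross entropy, and with `alpha = 1` only the distillation term remains, which mirrors how the weight trades off the two parts in Eq. (12).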
Nevertheless, similar to teachers in real life, the teacher model is not guaranteed to be correct. Sometimes it may misguide the student model to wrong answers. To be specific, incorrect soft labels generated by the teacher model will also result in wrong labels predicted by the student model. In order to alleviate the effect of incorrect labels, KD adopts $\alpha$ and $T$. Nonetheless, that brings additional hyperparameters into the model. Therefore, we would like to propose a better method that alleviates the effect of incorrect labels without introducing additional hyperparameters.
III-E Knowledge Distillation with Calibration for InceptionTime
At last, we propose Knowledge Distillation with Calibration (KDC) to calibrate the incorrect soft labels generated by the teacher model before distillation. Note that the teacher and student models are $f_t$ and $f_s$ respectively (Section III-D). In this way, it is not necessary to employ $\alpha$ and $T$ in KDC. In order to calibrate the incorrect soft labels, all labels are regarded geometrically as vectors, including the hard label $y$ and the soft label $\hat{y}^{(t)}$ generated by the teacher model. From this point of view, according to Eq. (1), the feasible solution space of the labels is a triangular hyperplane, named the label space. In other words, all labels are located on a triangular hyperplane. Fig. 3(a) gives an example when $C = 2$: the triangular hyperplane is a 1-D line segment in 2-D space. Next, Fig. 3(b) shows another example when $C = 3$: the triangular hyperplane is a 2-D regular triangle in 3-D space. Last, the triangular hyperplane is a 3-D regular tetrahedron in 4-D space when $C = 4$; a 4-D space, however, cannot be plotted in figures. Note that distinct colors represent the areas of distinct classes. Thus, it is possible to calibrate $\hat{y}^{(t)}$ from its original position to a target position $\tilde{y}$ if $\hat{y}^{(t)}$ is located in the wrong area. In addition, all hard labels $y$ are located at the vertices of the triangular hyperplane, as marked in Fig. 3(a) and Fig. 3(b).

Therefore, our main task is to propose a proper method to move $\hat{y}^{(t)}$ to its correct area, with its new position located between $y$ and $\hat{y}^{(t)}$. The calibrated label is denoted $\tilde{y}$. To this end, two approaches for calibration are proposed: calibration by translating and calibration by reordering. Note that only incorrect predicted labels will be calibrated. Formally, given a $\hat{y}^{(t)}$ and its corresponding true class $c$, $\tilde{y}$ will be computed only when $\arg\max_i \hat{y}^{(t)}_i \ne c$; otherwise $\tilde{y} = \hat{y}^{(t)}$.
III-E1 Calibration by Translating
Calibration by Translating (CT) geometrically translates $\hat{y}^{(t)}$ from its original position to $\tilde{y}$, as shown in Eq. (13).
$\tilde{y} = \hat{y}^{(t)} + \lambda \big(y - \hat{y}^{(t)}\big)$   (13)
where $(y - \hat{y}^{(t)})$ represents the vector from $\hat{y}^{(t)}$ to $y$, while $\lambda$ is a calibration coefficient controlling the distance $\hat{y}^{(t)}$ moves towards $y$. It is easy to see that $\tilde{y} = \hat{y}^{(t)}$ when $\lambda = 0$, and $\tilde{y} = y$ when $\lambda = 1$; in the latter case, it is simply substitution instead of calibration.
Hence, $\lambda$ is the key coefficient defining the degree of calibration. We define $\lambda$ as in Eq. (14).
$\lambda = \frac{d_{\min}}{d}$   (14)
where $d_{\min}$ is the minimum possible distance between $y$ and an incorrectly located $\hat{y}^{(t)}$, and $d$ is the current distance between $y$ and $\hat{y}^{(t)}$. It is easy to see that $\lambda \le 1$ always holds.
To this end, we calculated $d_{\min}$ and obtained a magic number; the procedure of the calculation is given in Appendix A. As a consequence, Eq. (13) can also be rewritten as Eq. (15).
$\tilde{y} = \hat{y}^{(t)} + d_{\min} \cdot \frac{y - \hat{y}^{(t)}}{d}$   (15)
where the unit vector $(y - \hat{y}^{(t)})/d$ decides the direction of translating, while $d_{\min}$ determines the distance of translating. In this way, it ensures that all $\tilde{y}$ stay in the label space, since it guarantees $\lambda \le 1$. In addition, it also guarantees $\lambda > 0$, which means $\hat{y}^{(t)}$ will not stay unchanged.
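CT can be sketched as follows; here the calibration coefficient is passed in as an assumed, precomputed value rather than derived from the appendix's magic number:

```python
def calibrate_translate(y_pred, true_class, lam):
    """Calibration by Translating (Eq. (13)): move the incorrect soft label
    y_pred a fraction lam of the way toward the hard label of true_class.
    lam is a placeholder calibration coefficient in (0, 1]."""
    y_hard = [1.0 if i == true_class else 0.0 for i in range(len(y_pred))]
    # y_pred + lam * (y_hard - y_pred): a convex combination of the two labels.
    return [p + lam * (h - p) for p, h in zip(y_pred, y_hard)]

y = calibrate_translate([0.5, 0.3, 0.2], true_class=1, lam=0.5)
# The calibrated label stays on the label-space hyperplane (Eq. (1)) ...
assert abs(sum(y) - 1.0) < 1e-12
# ... and now lies in the correct class area.
assert max(y) == y[1]
```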
III-E2 Calibration by Reordering
Calibration by Reordering (CR) reprioritizes the values of an incorrect predicted label based on a specific strategy. Concretely, given a $\hat{y}^{(t)}$ and its corresponding true class $c$, its values will be re-sorted if $\arg\max_i \hat{y}^{(t)}_i \ne c$. Therefore, our main task is to design a reordering strategy.

Given a $\hat{y}^{(t)}$ awaiting reordering, the strategy is designed as follows: 1) $\hat{y}^{(t)}$ is sorted in descending order; the sorted vector is denoted $s$, where $s_1$ is the largest value, $s_2$ the second largest, and so on, i.e., $s_1 \ge s_2 \ge \dots \ge s_C$. 2) After defining a temporary value, the sorted values are reassigned so that each class other than $c$ keeps its relative rank while being shifted down by one position. 3) Assign the largest value $s_1$ to the true class $c$.
The whole procedure is given in Algorithm 1. It ensures that $\tilde{y}$ is located in the label space, since $\tilde{y}$ is only a reordered version of $\hat{y}^{(t)}$. In addition, it guarantees that $\tilde{y}$ is geometrically located in its class area, i.e., $\arg\max_i \tilde{y}_i = c$. As a consequence, $\hat{y}^{(t)}$ is successfully calibrated by reordering.
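A plausible implementation of the reordering strategy (one reading of Algorithm 1, not a verbatim copy of the paper's pseudocode) is:

```python
def calibrate_reorder(y_pred, true_class):
    """Calibration by Reordering: reassign the values of an incorrect soft
    label so the true class receives the largest probability while the
    multiset of values (the shape of the distribution) is unchanged."""
    order = sorted(range(len(y_pred)), key=lambda i: y_pred[i], reverse=True)
    values = [y_pred[i] for i in order]      # values in descending order
    out = list(y_pred)
    out[true_class] = values[0]              # true class takes the maximum
    # Remaining classes keep their relative ranking, shifted down one value.
    rest = [i for i in order if i != true_class]
    for rank, i in enumerate(rest):
        out[i] = values[rank + 1]
    return out

y = calibrate_reorder([0.5, 0.3, 0.2], true_class=1)
assert sorted(y) == sorted([0.5, 0.3, 0.2])  # same values, reordered
assert max(y) == y[1]                         # now in the correct class area
```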
III-E3 Analysis of CT and CR
We theoretically analyzed the difference between CT and CR in terms of calibration. To be specific, for a $\hat{y}^{(t)}$ that is not located in the correct area of the label space, KDC calibrates it to $\tilde{y}$ by either CT or CR, where $\tilde{y}$ is located in the correct area. It follows that there must exist a mapping of any label from the incorrect area to the correct area in the label space. Thus, we wish to understand this mapping by calculating the distance between $y$ and the labels after KDC.

Fig. 4 shows an example illustrating the color map of the distance between $y$ and the labels after KDC in the label space; the distance function is denoted $d(\cdot, \cdot)$. As the labels in the red area (correct area) do not require calibration, their distance is calculated directly as $d(y, \hat{y}^{(t)})$. On the contrary, the labels outside the red area (incorrect area) are calibrated to $\tilde{y}$, so their distance is calculated as $d(y, \tilde{y})$. As shown in Fig. 4(a), the labels close to the edge of the red area are mapped close to the vertex, i.e., the hard label $y$, while the labels far away from the red area are mapped to the edge of the red area. Yet in Fig. 4(b), the labels close to the edge of the red area are mapped close to the edge as well, while the labels far away from the red area are mapped close to $y$. As a consequence, CT keeps the relative position of incorrect labels unchanged, which means it preserves the spatial information of $\hat{y}^{(t)}$, while CR keeps the shape of the distribution of incorrect labels unchanged, which means it preserves the distributional information of $\hat{y}^{(t)}$.
To this end, we are in a position to define the loss function of KDC. Unlike the loss function of KD, we employ the KL divergence only, without the temperature and the cross-entropy part, as shown in Eq. (16).
$\mathcal{L}_{KDC} = \mathrm{KL}\big(\tilde{y} \,\|\, \hat{y}^{(s)}\big) = \sum_{i=1}^{C} \tilde{y}_i \log \frac{\tilde{y}_i}{\hat{y}^{(s)}_i}$   (16)
where $\tilde{y}$ is the calibrated label generated by the teacher model, while $\hat{y}^{(s)}$ is the label predicted by the student model. Compared to $\mathcal{L}_{KD}$, $\mathcal{L}_{KDC}$ does not contain any hyperparameter and computes only the KL divergence, which reduces both the computational time and the complexity of hyperparameter tuning.
Thus, the total process of KDC can be summarized in three steps: 1) train a teacher model with a heavier and more complex architecture on the true labels (hard labels), and generate the predicted labels with the teacher model; 2) calibrate the incorrect predicted labels; 3) train the student model, which has a relatively small and simple architecture, on the calibrated labels (soft labels). The algorithm of KDC is given in Algorithm 2.
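The three steps above can be sketched as follows; the model interfaces (`fit`, `predict_proba`, `fit_soft`) and names are hypothetical placeholders, not the paper's actual code:

```python
def kdc_train(X_train, y_train, teacher, student, calibrate):
    """Sketch of the KDC pipeline under assumed model interfaces."""
    # 1) Train the deeper teacher model on the hard labels, then let it
    #    predict soft labels for every training sample.
    teacher.fit(X_train, y_train)
    soft_labels = teacher.predict_proba(X_train)
    # 2) Calibrate only the incorrectly predicted soft labels, i.e. those
    #    whose argmax disagrees with the true class (by CT or CR).
    for i, (p, c) in enumerate(zip(soft_labels, y_train)):
        if p.index(max(p)) != c:
            soft_labels[i] = calibrate(p, c)
    # 3) Train the smaller student model on the calibrated soft labels
    #    (in the paper, with the KL-divergence loss of Eq. (16)).
    student.fit_soft(X_train, soft_labels)
    return student
```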
IV Experiments
We conduct the experiments on the UCR archive. Yet, some datasets are problematic, containing NaN (Not a Number) values due to missing data or varying time-series lengths; they are listed in Table II. Therefore, the remaining datasets are selected for the experiments.
Dataset Name  

AllGestureWiimoteX  DodgerLoopDay  GestureMidAirD1 
AllGestureWiimoteY  DodgerLoopGame  GestureMidAirD2 
AllGestureWiimoteZ  DodgerLoopWeekend  GestureMidAirD3 
MelbournePedestrian  PickupGestureWiimoteZ  GesturePebbleZ1 
ShakeGestureWiimoteZ  PLAID  GesturePebbleZ2 
In the experiments, ROCKET, Softmax Cross Entropy for InceptionTime (ITime), Label Smoothing for InceptionTime (LSTime), Knowledge Distillation for InceptionTime (KDTime), and KD with Calibration for InceptionTime (KDCTime) are compared, where KDCTime includes two calibrating methods, i.e., KDC by Translating (KDCT) and KDC by Reordering (KDCR). Except for ROCKET, all the aforementioned methods can be categorized as ITime-based approaches. First, in order to find the best hyperparameters for LSTime, KDTime, KDCT, and KDCR, we conducted hyperparameter studies for those approaches. Note that ITime does not have any extra hyperparameter to be tuned. After that, we tested the accuracy, training time, and test time of the aforementioned approaches.
Similar to [13], critical difference diagrams are drawn in this paper to better illustrate the results of the different approaches, as results on this many datasets are hard to depict clearly otherwise. A critical difference diagram is drawn by the following steps: 1) execute the Friedman test [14] to reject the null hypothesis; 2) perform the pairwise post-hoc analysis [5] with a Wilcoxon signed-rank test with Holm's alpha correction [15]; 3) visualize the statistical result following [10], where a thick horizontal line indicates that the connected approaches are not significantly different with respect to the results.

Our experiments are conducted on a computer equipped with an Intel Core i CPU, GB memory, and an NVIDIA GeForce RTX GPU. The operating system is Windows 10. Additionally, the development environment is Anaconda with Python and PyTorch.

IV-A Hyperparameter Study for ITime-based Approaches
In this section, we first searched the shared hyperparameters, i.e., batch size, number of epochs, and learning rate, on ITime. Those hyperparameters can then also be fixed for LSTime, KDTime, KDCT, and KDCR, since all these methods are based on InceptionTime. After that, the method-specific hyperparameters of LSTime, KDTime, KDCT, and KDCR were searched separately.
IV-A1 Batch Size
First, the batch size was set without searching. The reason is that a larger batch size is theoretically better, the extreme case being full batch. However, in deep learning tasks, full-batch training often makes it infeasible to load a large dataset into GPU memory. Additionally, the batch size and the number of epochs depend on each other, i.e., a larger batch size requires more epochs to converge, which means a longer training time. Thus, the batch size is usually chosen empirically from a few common values, and one such value was adopted in this paper.
IV-A2 Epoch
With a fixed batch size, we compared the accuracy over several different numbers of epochs. The critical difference diagram is given in Fig. 5, where the largest setting achieves the best accuracy, yet a smaller setting shows no critical difference from it. Since the training of DNN-based approaches is slow, the smaller number of epochs is selected in order to save training time. In addition, we also employed an early stopping strategy with a fixed patience to further reduce the training time and alleviate overfitting.
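The early stopping strategy can be sketched as below; the patience value and the accuracy history are illustrative only, not the paper's setting:

```python
class EarlyStopper:
    """Early stopping on validation accuracy with a fixed patience:
    stop once the accuracy has not improved for `patience` epochs."""
    def __init__(self, patience):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_acc):
        """Record one epoch's accuracy; return True when training should stop."""
        if val_acc > self.best:
            self.best = val_acc
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopper(patience=3)
history = [0.60, 0.70, 0.69, 0.68, 0.67]   # accuracy stops improving
stops = [stopper.step(a) for a in history]
assert stops == [False, False, False, False, True]
```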
IV-A3 Learning Rate
The accuracy of several distinct learning rates was tested. Moreover, learning rate decay was employed in order to stabilize the training process. We leveraged fixed step decay, also called piecewise constant decay, as the decay strategy, with a fixed step size and gamma; in other words, the learning rate is multiplied by the gamma factor once every fixed number of epochs, which yields a fixed number of decays over the whole training. The critical difference diagram is shown in Fig. 6, where one learning rate setting has the highest accuracy and is therefore employed in the paper.
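Fixed step decay can be sketched as a one-line schedule; the initial learning rate, gamma, and step size below are placeholders, not the paper's tuned values:

```python
def step_decay(lr0, gamma, step_size, epoch):
    """Piecewise constant (fixed step) decay: multiply the initial learning
    rate lr0 by gamma once every step_size epochs."""
    return lr0 * (gamma ** (epoch // step_size))

# With placeholder values lr0 = 1e-3, gamma = 0.5, step size = 100 epochs:
assert abs(step_decay(1e-3, 0.5, 100, 0) - 1e-3) < 1e-15
assert abs(step_decay(1e-3, 0.5, 100, 100) - 5e-4) < 1e-15
assert abs(step_decay(1e-3, 0.5, 100, 250) - 2.5e-4) < 1e-15
```

The same schedule is available in PyTorch as `torch.optim.lr_scheduler.StepLR(optimizer, step_size, gamma)`.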
IV-A4 The Smoothing Coefficient in LSTime
The smoothing coefficient $\varepsilon$ (Eq. (9)) represents the smoothing level of the labels. We tested the accuracy of different values of $\varepsilon$ in LSTime. The critical difference diagram is given in Fig. 7, where a moderate $\varepsilon$ gets the best accuracy. This indicates that $\varepsilon$ should be neither too small nor too large, since a small $\varepsilon$ gives the label little additional information, while a large $\varepsilon$ causes too much information loss from the original class. In addition, the accuracy of large $\varepsilon$ values is lower than that of small ones, which means the information from the original class is important, and it is not a good idea to completely abandon it.
IV-A5 $\alpha$ and $T$ in KDTime
KDTime contains two hyperparameters, $\alpha$ and $T$ (Eq. (12)), where $\alpha$ controls the weights of the losses from the hard label and the soft label respectively, and $T$ fine-tunes the smoothing level of the soft labels (the labels predicted by the teacher model). Thus, we tested the accuracy of distinct values of $\alpha$ in KDTime, and we also compared the accuracy of different values of $T$. The critical difference diagrams of $\alpha$ and $T$ are shown in Fig. 8 and Fig. 9. Fig. 8 shows that an intermediate $\alpha$ has the best accuracy, while values of $\alpha$ close to $0$ or $1$ reduce the accuracy. Similarly, Fig. 9 shows that an intermediate $T$ is the best, and the accuracy of KDTime decreases if $T$ is too large or too small.
To this end, all the ITime-based methods share the same batch size, number of epochs, and learning rate, while $\varepsilon$ in LSTime and $\alpha$ and $T$ in KDTime are set to the best values found above. Finally, Adam is adopted as the optimization algorithm to update the model.
IV-B Accuracy of Different Approaches
In this section, we compared the accuracy of ROCKET, ITime, LSTime, KDTime, KDCTime by Translating (KDCT), and KDCTime by Reordering (KDCR). The accuracy of each approach is averaged over multiple runs, and the standard deviation of these accuracies is also calculated. The results are listed in Table III (Appendix B). The number before the $\pm$ sign represents the accuracy, while the number after it is the standard deviation. Besides, bold numbers represent the best accuracy or standard deviation among all approaches.

As shown in Table III, KDCR gets the highest accuracy on the largest number of datasets, which shows that its accuracy is competitive and promising. Yet, its convergence is not as stable as ROCKET's, since ROCKET has the lowest standard deviation on the majority of datasets. Besides, a critical difference diagram is also given to better illustrate the results of Table III, shown in Fig. 10. Note that Fig. 10 also includes the accuracy of the teacher model used in KDTime, KDCT, and KDCR, which is an InceptionTime model with more Inception modules; it is also trained multiple times and the best run is selected as the teacher, which is why its accuracy is the best. In addition, KDCR and KDCT show no significant difference from KDTime. The reason is that the teacher model already achieves very high accuracy on many datasets, which suggests the majority of datasets have reached an accuracy bottleneck; in other words, only a small number of samples can be calibrated by KDCTime. Thus, we claim that the results on those datasets dominate the others in the critical difference diagram. Ignoring that part of the results, the accuracy of KDCR and KDCT is more promising, as shown in Fig. 11.
In summary, KDCR achieves the best accuracy, better than KDCT. The reason is that KDCR keeps the information carried by marginal labels. In Fig. 3, marginal labels are the labels located in the middle area of the label space. From Fig. 4(b), the labels in the middle area have a long distance to every class, so they do not deterministically belong to any class, meaning that they contain abundant information from other classes. KDCT, however, calibrates the marginal labels close to the correct class, which loses that information. In addition, KDCR calibrates labels that are close to other classes to be close to the correct one, as shown in Fig. 4(b), which eliminates the misguidance from deterministically incorrect labels. Nevertheless, KDCT keeps the information of those labels from the incorrect classes.
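The difference between the two calibration strategies can be made concrete with a hypothetical sketch: translating pushes probability mass toward the correct class (collapsing marginal structure), while reordering merely swaps the largest probability to the correct position, preserving the overall value distribution. The exact calibration rules in the paper may differ; `kdc_translate` and `kdc_reorder` below are our own illustrative names and forms.

```python
import numpy as np

def kdc_translate(soft, y):
    """KDCT-style sketch: shift probability mass onto the correct class y
    until it leads, then renormalize. Marginal structure is flattened."""
    p = soft.copy()
    shift = p.max() - p[y]      # mass needed for class y to become argmax
    p[y] += shift + 1e-6        # nudge past the previous leader
    return p / p.sum()

def kdc_reorder(soft, y):
    """KDCR-style sketch: swap the largest probability with the one at the
    correct class, keeping the multiset of probability values intact."""
    p = soft.copy()
    top = int(p.argmax())
    p[top], p[y] = p[y], p[top]
    return p
```

Both variants guarantee the correct class is the argmax after calibration, but only the reordering variant leaves the value distribution (and hence the "marginal" information) unchanged.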
IV-C Training and Test Time of Different Approaches
In this section, we compare the training and test time of ROCKET, ITime, LSTime, KDTime, and KDCTime, without distinguishing KDCT from KDCR, since their training and inference times show no significant difference. Instead of listing all the results in a table, comparison diagrams are used to better illustrate them.
Fig. 12 compares the training time of KDCTime with ROCKET, ITime, LSTime, and KDTime. Fig. 12(a) shows that KDCTime is much slower than ROCKET: on part of the datasets its training time is within one order of magnitude of ROCKET's, while on the remaining datasets it is more than an order of magnitude slower. That is because all ITime-based approaches, including KDCTime, employ Gradient Descent as the optimization algorithm, which requires inference over many training samples for every update and a large number of updates in total: a batch of samples per update, many updates per epoch, and many epochs overall. In contrast, ROCKET utilizes a ridge classifier and solves the ridge regression problem directly. Nevertheless, the training time of KDCTime is still acceptable, on the order of hours for the whole UCR archive even across repeated runs. Besides, as shown in Fig. 12(b) and (c), the training times of ITime and LSTime are similar to KDCTime's, since they share the same model, InceptionTime with Inception modules. Still, KDCTime needs an extra teacher model to guide the training of its student model, so it requires additional training time for the teacher. Note that the teacher only needs to be trained once; after it is obtained, the same teacher can be reused across multiple trainings of the student model. Fig. 12(d) shows that the training time of KDCTime does not differ significantly from KDTime's, as their models are identical, both require the teacher model, and both use Gradient Descent as the optimization algorithm. In conclusion, the training-time differences fall into three categories of methods: ROCKET, ITime-based methods without KD, and ITime-based approaches with KD.

Fig. 13 compares the test time of KDCTime with ROCKET, ITime, LSTime, and KDTime. Fig. 13(a) shows that KDCTime is much faster than ROCKET; depending on the dataset, its test time is one to several orders of magnitude faster. That is because in the test stage, without the computational cost of Gradient Descent, KDCTime only requires inference over the test samples, the same as ROCKET; in this scenario, the random convolutional kernels in ROCKET are much slower to compute than the Inception modules in KDCTime. In addition, as shown in Fig. 13(b), (c), and (d), the test times of the other approaches show no significant difference from KDCTime's, since their inference models are all InceptionTime with Inception modules.
As a consequence, the test-time differences fall into two groups: ROCKET on one side, and ITime-based approaches on the other, regardless of whether KD is used.
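The training-time asymmetry above hinges on ROCKET solving its ridge regression in closed form (one linear solve) rather than iterating over mini-batches. A minimal sketch of a closed-form ridge fit follows; this is generic linear algebra, not ROCKET's actual implementation, and `ridge_fit`/`ridge_predict` are illustrative names.

```python
import numpy as np

def ridge_fit(X, Y, alpha=1.0):
    """Solve the ridge regression normal equations
    (X^T X + alpha I) W = X^T Y in one shot -- no gradient steps."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def ridge_predict(X, W):
    """One-vs-rest classification: Y held one-hot labels, so predict
    by taking the argmax over the regression outputs."""
    return (X @ W).argmax(axis=1)
```

Because the fit is a single linear solve in the feature dimension, its cost does not grow with the number of epochs, which is why ROCKET's training is so much faster than gradient-based ITime variants.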
V Conclusion
In this paper, we showed that DNN-based TSC approaches easily overfit on the UCR datasets, which is caused by the few-shot problem in the UCR archive. Thus, to alleviate overfitting, Label Smoothing for InceptionTime (LSTime) was first proposed, utilizing soft labels. Next, instead of manually adjusting soft labels, Knowledge Distillation for InceptionTime (KDTime) was proposed to generate soft labels automatically. Finally, to rectify incorrectly predicted soft labels from the teacher model, KD with Calibration (KDC) was proposed, with two optional strategies: KDC by Translating (KDCT) and KDC by Reordering (KDCR).
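The two label-softening schemes recapped above can be contrasted in a few lines: label smoothing blends the one-hot target with a uniform distribution, while knowledge distillation takes temperature-scaled teacher predictions as the soft target. This is a hedged sketch of the standard formulations; the smoothing coefficient `eps` and temperature `T` are generic placeholders, not the paper's values.

```python
import numpy as np

def smooth_labels(hard, eps=0.1):
    """LSTime-style label smoothing: mix the one-hot label with the
    uniform distribution over the k classes."""
    k = hard.shape[-1]
    return (1 - eps) * hard + eps / k

def softmax(z, T=1.0):
    """Numerically stable temperature-scaled softmax."""
    e = np.exp((z - z.max(axis=-1, keepdims=True)) / T)
    return e / e.sum(axis=-1, keepdims=True)

def kd_targets(teacher_logits, T=4.0):
    """KDTime-style soft labels: the teacher's temperature-softened
    predictive distribution, no manual tuning per class."""
    return softmax(teacher_logits, T)
```

Both produce valid probability vectors; the difference is that distillation's soft labels carry class-similarity information learned by the teacher, whereas smoothing spreads mass uniformly.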
The experimental results show that the accuracy of KDCT and KDCR is promising, with KDCR achieving the highest. In addition, all InceptionTime-based (ITime-based) approaches, including KDCT and KDCR, are orders of magnitude faster than ROCKET at test time, since the ITime model is the dominant factor in inference time. The training time of ITime-based approaches is slower than ROCKET's, yet it remains within an acceptable range and is worthwhile for obtaining promising accuracy together with fast inference. Finally, KDCT and KDCR introduce no additional hyperparameters compared to ITime.
In the future, instead of concentrating only on loss functions and labels, we will explore various models, aiming to propose a brand-new model with high generalization capability.
Appendix A The procedure to calculate
Given a hard label and a soft label , where they both satisfy and . By treating and as vectors, we want to find the minimum distance between them when . Let , so that and . Let , so that . Therefore, we have the following optimization objective:
where we know and . Thus, . Since , we let . Thus, . In this way, our optimization objective can be rewritten as follows:
This is an optimization problem with inequality constraints. Therefore, we can define its Lagrangian function as:
where and are two sets of Lagrangian multipliers. By adopting the Karush-Kuhn-Tucker (KKT) conditions, we have:
Finally, by solving this system of equations, the objective attains its minimum value when , , and all other . As a result, can be calculated as follows:
Appendix B The accuracy of different approaches
The accuracy of the different approaches on the UCR datasets is given in Table III on the last page.
Datasets  ROCKET  ITime  LSTime  KDTime  KDCT  KDCR 
ACSF1  
Adiac  
ArrowHead  
Beef  
BeetleFly  
BirdChicken  
BME  
Car  
CBF  
Chinatown  
ChlorineConcentration  
CinCECGTorso  
Coffee  
Computers  
CricketX  
CricketY  
CricketZ  
Crop  
DiatomSizeReduction  
DistalPhalanxOutlineAgeGroup  
DistalPhalanxOutlineCorrect  
DistalPhalanxTW  
Earthquakes  
ECG200  
ECG5000  
ECGFiveDays  
ElectricDevices  
EOGHorizontalSignal  
EOGVerticalSignal  
EthanolLevel  
FaceAll  
FaceFour  
FacesUCR  
FiftyWords  
Fish  
FordA  
FordB  
FreezerRegularTrain  
FreezerSmallTrain  
Fungi  
GunPoint  
GunPointAgeSpan  
GunPointMaleVersusFemale  
GunPointOldVersusYoung  
Ham  
HandOutlines 