KDCTime: Knowledge Distillation with Calibration on InceptionTime for Time-series Classification

by Xueyuan Gong, et al.
Beijing Institute of Technology

Time-series classification approaches based on deep neural networks are prone to overfitting on UCR datasets, which is caused by the few-shot problem of those datasets. Therefore, to alleviate overfitting and thereby improve accuracy, we first propose Label Smoothing for InceptionTime (LSTime), which exploits the information of soft labels rather than hard labels alone. Next, instead of manually adjusting soft labels as in LSTime, Knowledge Distillation for InceptionTime (KDTime) is proposed to automatically generate soft labels via a teacher model. Finally, to rectify incorrectly predicted soft labels from the teacher model, Knowledge Distillation with Calibration for InceptionTime (KDCTime) is proposed, which contains two optional calibrating strategies, i.e. KDC by Translating (KDCT) and KDC by Reordering (KDCR). The experimental results show that the accuracy of KDCTime is promising, while its inference time is two orders of magnitude faster than ROCKET with an acceptable training time overhead.



I Introduction

Time-series Classification (TSC) is one of the most challenging tasks in data mining [12]. In recent years, with the remarkable success of deep neural networks (DNNs) in Computer Vision (CV) [19, 17, 34], many researchers have tried to employ DNNs in TSC due to the similarity between time-series data (one-dimensional sequences) and image data (two-dimensional sequences). However, through extensive experiments on various existing DNN-based TSC approaches, we found that DNNs are prone to overfitting on datasets from the UCR archive [1]. To be specific, several experiments were conducted for typical DNNs, i.e. Fully Convolutional Networks (FCN) [37], Residual Networks (ResNet) [17, 37], and InceptionTime [13], on the UCR datasets. Taking InceptionTime as an example, only a minority of datasets have a training-test accuracy gap below a small threshold, which means the gap on the remaining datasets is larger, in many cases substantially so, as shown in Fig. 1(a). In addition, Fig. 1(b) gives the training and test accuracy with respect to epochs on one specific UCR dataset, where the gap emerges within the first epochs and stays roughly stationary till the end.

(a) The gap between training and test accuracy on UCR datasets
(b) The training and test accuracy w.r.t. epochs on one dataset
Fig. 1: Two examples showing that InceptionTime, like other DNNs, is prone to overfitting on UCR datasets

After thoroughly analyzing the UCR datasets, we claim that the difference between datasets in CV and those in the UCR archive contributes the most to the overfitting phenomenon. In detail, we can view this from the perspective of k-shot learning, where k represents the number of training examples per class. In CV, datasets always contain enough training samples for DNNs, e.g. MNIST [24] and CIFAR10 [23] both provide thousands of training samples per class, and ImageNet [11] contains more than 14 million training samples in total. However, in the UCR archive, most datasets are few-shot, and many of them contain only a handful of training samples; for example, Fungi is an extreme few-shot dataset, and DiatomSizeReduction has very few training samples. We summarize this as the few-shot problem in TSC. In short, solving the few-shot problem is the key to improving accuracy. Interestingly, many state-of-the-art approaches adopt methods for alleviating overfitting without explicitly pointing out the overfitting problem, such as the Hierarchical Vote system for Collective Of Transformation-based Ensembles (HIVE-COTE) [25] with an ensemble of classifiers, InceptionTime [13] with an ensemble of models, and RandOm Convolutional KErnel Transform (ROCKET) [9] with regularization and cross validation. Among the aforementioned state-of-the-art methods, ROCKET has the best accuracy and inference time balance in practice.

In this paper, instead of employing ordinary approaches for alleviating overfitting, e.g. L1 and L2 regularization, Batch Normalization (BN) [20], Dropout [31], early stopping, etc., we first propose Label Smoothing based on InceptionTime [35] (LSTime) to improve the generalization ability of InceptionTime, as soft labels are closer to real-world situations than hard labels. For instance, as shown in Fig. 2(a), the handwritten digit has a single true label, yet it intuitively also resembles another digit; giving it a hard label may therefore cause information loss. Likewise, in stocks, there are many meaningful chart patterns, among which Head&Shoulders (H&S), Triple Top (TT), and Double Top (DT) [36] are similar. In Fig. 2(b), the H&S chart pattern is close to TT, so we wish to retain its resemblance to TT. Consequently, soft labels preserve more information than hard labels. However, we would like to obtain the soft labels automatically rather than set them manually. Thus, secondly, Knowledge Distillation based on InceptionTime [18] (KDTime) is leveraged to generate soft labels by a teacher model, which is a pre-trained network with a deep and complex architecture. The labels predicted by the teacher model represent the knowledge it has learned. That knowledge (the predicted soft labels) can then be distilled to guide the training of a student model, which owns a relatively smaller and simpler architecture. As a consequence, Knowledge Distillation can also reduce the inference time thanks to the student model. At last, we note that the teacher is not always correct, which means it may produce wrong soft labels and misguide the student. As the ground-truth labels have already been obtained, we propose to simply calibrate the wrongly predicted soft labels in order to maximize the accuracy of InceptionTime, called Knowledge Distillation with Calibration based on InceptionTime (KDCTime). In addition, KDCTime includes two optional calibrating strategies, i.e. KDC by Translating (KDCT) and KDC by Reordering (KDCR).

(a) Soft labels for CV
(b) Soft labels for TSC
Fig. 2: Two examples demonstrating the benefits of soft labels

The InceptionTime model employed in LSTime, KDTime, and KDCTime is a single model instead of an ensemble, since training and inference time are essential and indicate the practicality of the model. Thus, compared to its original version, an ensemble of 5 models with 6 Inception modules each, the InceptionTime model in this paper is a single network. In summary, the main contributions of this paper can be concluded as follows:

  • Discovered that TSC approaches based on DNNs normally overfit on the UCR datasets, which is caused by the few-shot problem of those datasets. Thus, the most promising direction to improve the performance of DNNs is alleviating overfitting.

  • Combined Label Smoothing and Knowledge Distillation with InceptionTime, denoted LSTime and KDTime respectively. LSTime trains the InceptionTime model with manually controlled soft labels, while KDTime generates soft labels automatically via the teacher model.

  • Proposed KDCTime for calibrating incorrect soft labels predicted by the teacher model, which contains two optional strategies, i.e. KDCT and KDCR. As a consequence, KDCTime further improves the accuracy of KDTime.

  • We have tested the accuracy, training time, and test time of ROCKET, InceptionTime, LSTime, KDTime, and KDCTime. The results show that, compared to ROCKET, KDCTime simultaneously improves accuracy and reduces inference time with an acceptable training time overhead. In conclusion, the performance of KDCTime is promising.

The remainder of this paper is organized as follows: The related work is reviewed in Section II. Next, LSTime, KDTime, and KDCTime are introduced in Section III. The experimental results are discussed in Section IV. Finally, Section V concludes the paper.

II Related Work

TSC, as a traditional time-series mining research direction, has been considered one of the most challenging problems [12]. Traditionally, the 1-Nearest Neighbor (1-NN) classifier based on Dynamic Time Warping (DTW) distance has been shown to be a very promising approach [2]. Yet, the time complexity of DTW is unacceptable compared to Euclidean Distance (ED), i.e. O(n²) compared to O(n). Thus, many researchers have tried to accelerate DTW. Rakthanmanon et al. [28] proposed UCR-DTW by leveraging lower bounding and early abandoning. Sakurai et al. [30] proposed SPRING under the DTW distance, which is able to monitor time-series streams in real time. Gong et al. [16] proposed Forward-Propagation NSPRING (FPNS) to further accelerate SPRING. Nevertheless, these works all concentrate on time complexity; the upper bound of their accuracy is the accuracy of the DTW distance itself.

In order to break through the accuracy bottleneck, researchers found that ensembling several classifiers could significantly improve the accuracy of TSC. Thus, Baydogan et al. [4] selected an ensemble of decision trees. Kate [22] employed an ensemble of several 1-NN classifiers with different distance measures, including the DTW distance. Those methods motivated the development of an ensemble of classifiers named Collective Of Transformation-based Ensembles (COTE) [3], which ensembles classifiers over different time-series representations instead of the same one. After that, Lines et al. [25] extended COTE by leveraging a new hierarchical structure with probabilistic voting, called HIVE-COTE, which is currently considered the state-of-the-art approach in accuracy. Nevertheless, in order to achieve such high accuracy, these methods sacrifice training and inference time, making most of them impractical when datasets are large.

With the rapid development of deep learning, DNNs are widely applied in CV and are also increasingly employed in TSC [37]. Cui et al. [8] proposed Multi-Scale Convolutional Neural Networks (MCNN), which transform the time-series into several feature vectors and feed those vectors into a CNN model. Wang et al. [37] implemented several DNN models originating from CV, i.e. MultiLayer Perceptrons (MLP), Fully Convolutional Networks (FCN), and Residual Networks (ResNet), to test their performance in TSC, which provides a strong baseline for DNN-based approaches. Based on a more recent DNN structure, i.e. the Inception module [34], Karimi-Bidhendi et al. [21] proposed an approach that transforms time-series into feature maps using the Gramian Angular Difference Field (GADF) and feeds those maps to an InceptionNet pre-trained for image recognition. By extending a more recent version of InceptionNet, i.e. Inception-v4 [33], Fawaz et al. [13] proposed InceptionTime, which ensembles Inception-based models to obtain a promising accuracy. The Multi-scale Attention Convolutional Neural Network (MACNN) [6] adopts an attention mechanism to further improve the accuracy of MCNN. Instead of utilizing convolutions as parts of the model, RandOm Convolutional KErnel Transform (ROCKET) [9] employs random convolutional kernels as feature extractors converting time-series into feature vectors, which are then fed to a ridge regressor. To the best of the authors' knowledge, ROCKET owns the best accuracy and inference time balance, while other DNN-based methods always require longer training and inference time. In fact, DNN-based methods often suffer from long training time, e.g. MACNN requires days of running time to train on the UCR archive, which builds a huge barrier for researchers wishing to reimplement the approach.

We found that InceptionTime is a quite competitive approach. Using only one InceptionTime model instead of an ensemble, it owns a slower yet acceptable training time and an inference time orders of magnitude lower than ROCKET. Therefore, when InceptionTime includes only one model instead of an ensemble, we would like to preserve its accuracy via the information obtained from soft labels. The idea of utilizing soft labels was proposed by Szegedy et al. [35], where the smooth level of soft labels is controlled by a manually set parameter. Yet, a better way to determine the smooth level of soft labels is to generate them automatically by a teacher model and train a student model with those labels, an idea that comes from Knowledge Distillation (KD) [18]. Since then, many extensions of KD have been proposed. Some works [29, 38] concentrate on letting the student model learn the feature maps, instead of the soft labels, of the teacher model. In addition, Svitov et al. [32] leveraged the labels predicted by the teacher model as class centers, instead of soft labels, to guide the training of the student model. Oki et al. [27] integrated KD into the triplet loss and utilized the predicted labels as anchor points for guiding the training of the student model. In this paper, instead of employing other types of KD methods, we concentrate on label-based KD approaches in order to save execution time. At last, inspired by [7, 26], the gap between the student model and the teacher model should be small, or the student model would find it hard to mimic the teacher model. Therefore, in this paper, a shallower InceptionTime model is selected as the student and a deeper one as the teacher.

III Proposed Approaches

In this section, instead of concentrating on the model itself, all the proposed approaches essentially center on loss functions and labels. First, notations and definitions are given in Section III-A. After that, InceptionTime is briefly introduced in Section III-B. Then, Label Smoothing for InceptionTime (LSTime) is demonstrated in Section III-C. Next, Knowledge Distillation for InceptionTime (KDTime) is depicted in Section III-D. At last, Knowledge Distillation with Calibration for InceptionTime (KDCTime) is illustrated in Section III-E, which contains two strategies, i.e. Calibration by Translating (CT) and Calibration by Reordering (CR).

III-A Notations and Definitions

Definition 1

A time-series is defined as a vector x = (x_1, x_2, …, x_n), where x_i represents the i-th value of x. The corresponding class of x is a scalar c ∈ {1, 2, …, C}, with C classes in total.

Definition 2

A class label of x is defined as a vector y = (y_1, y_2, …, y_C), where y_i represents the probability of x belonging to class i. In addition, Eq. (1) always holds for any y:

∑_{i=1}^{C} y_i = 1, with y_i ≥ 0. (1)

Definition 3

The true label of x is defined as a one-hot vector y, where all y_i = 0 except y_c = 1, called the hard label. The equation of y is given in Eq. (2):

y_i = 1 if i = c, and y_i = 0 otherwise. (2)

Definition 4

A dataset D = (X, Y) is a pair of sets, including a set of time-series X and a set of true labels Y respectively, where each time-series x ∈ X corresponds to a true label y ∈ Y.

Definition 5

An InceptionTime model is treated as a function f ∈ H mapping an input x into an output z = f(x), where H represents the hypothesis space, i.e. the space containing all possibilities of f.

Definition 6

The predicted label p is produced by normalizing z with the Softmax function σ (Eq. (3)):

p_i = σ(z)_i = e^{z_i} / ∑_{j=1}^{C} e^{z_j}. (3)

Thus, Eq. (1) always holds for p.

Definition 7

A loss function L is a function measuring the difference between the predicted label p and the true label y, in order to determine the performance of f.

Definition 8

The problem in this paper is defined as follows: given a dataset D = (X, Y), find an f minimizing the predefined L. Formally, it is demonstrated in Eq. (4):

f* = argmin_{f ∈ H} ∑_{(x, y) ∈ D} L(σ(f(x)), y). (4)


To this end, important notations are briefly summarized in Table I.

Notations  Definitions
x          A time-series
y          The true label w.r.t. x
D          A dataset
X          The set of time-series in D
Y          The set of true labels in D
f          An InceptionTime model
z          The output of f
σ          The Softmax function
p          The predicted label of x
L          A loss function
TABLE I: Notations and definitions

III-B InceptionTime

The ordinary Softmax Cross Entropy loss is adopted in InceptionTime [13]. We first implemented the single-model version of InceptionTime, denoted f. The loss function of f is given in Eq. (5):

L_CE(p, y) = −∑_{i=1}^{C} y_i log p_i, (5)

where it is easy to see that the final loss is only related to p_c, as all the other y_i are 0 (Eq. (2)). In other words, only the term of the true class c survives in the summation. Therefore, for simplicity, Eq. (5) can also be written as Eq. (6):

L_CE(p, y) = −log p_c. (6)

Note that Eq. (6) is the reason the one-hot class label is called a hard label: it explicitly selects only the probability of x belonging to class c, ignoring all other probabilities in the loss function. However, in a more realistic scenario, we believe such a deterministic case is rare. Hence, as also introduced in Section I, a soft version of y is more feasible in this case.
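The collapse of the full cross-entropy sum into the single true-class term can be checked numerically. A minimal pure-Python sketch (illustrative only, not the paper's implementation):

```python
import math

def softmax(z):
    """Normalize raw model outputs z into a probability vector (Eq. (3))."""
    m = max(z)                          # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(p, y):
    """Softmax cross entropy summed over all classes (Eq. (5))."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p))

z = [2.0, 0.5, -1.0]                    # illustrative outputs for a 3-class problem
p = softmax(z)
y = [1.0, 0.0, 0.0]                     # hard (one-hot) label, true class c = 0

# with a one-hot y, only the true-class term survives (Eq. (6)):
assert abs(cross_entropy(p, y) - (-math.log(p[0]))) < 1e-12
```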

III-C Label Smoothing for InceptionTime

Second, following the assumption in Section III-B, we implemented Softmax Cross Entropy with Label Smoothing (LS) [35], with f as the model. The label-smoothed y, denoted y^LS, is given in Eq. (7):

y^LS = (1 − α) y + (α / C) 1, (7)

where α is the smoothing coefficient set by users, representing how much the label is smoothed, and 1 is the all-ones vector. Note that after LS, y^LS still satisfies Eq. (1). Alternatively, Eq. (7) can also be written element-wise as Eq. (8):

y^LS_i = 1 − α + α/C if i = c, and y^LS_i = α/C otherwise. (8)

As a consequence, the Softmax Cross Entropy loss with LS is given in Eq. (9):

L_LS(p, y) = (1 − α)(−log p_c) + α(−(1/C) ∑_{i=1}^{C} log p_i), (9)

where the left part represents the loss from hard labels, while the right part is the loss from soft labels. The smoothing coefficient α controls the weights of the losses from hard labels and soft labels.

Yet, manually controlling the smoothed level of labels by α is not the best solution, since every smoothed label except the true-class entry has the same value. Similar to hard labels, this kind of manually controlled soft label is not practical in the real world. Therefore, generating flexible soft labels by Knowledge Distillation is proposed next.
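Label smoothing amounts to mixing the hard label with a uniform distribution; a minimal sketch (the α value below is illustrative, not the paper's tuned setting):

```python
def smooth_label(y, alpha):
    """Label smoothing (Eq. (7)): y_i -> (1 - alpha) * y_i + alpha / C."""
    C = len(y)
    return [(1 - alpha) * yi + alpha / C for yi in y]

y = [1.0, 0.0, 0.0, 0.0]               # hard label, true class c = 0
y_ls = smooth_label(y, alpha=0.4)
assert abs(sum(y_ls) - 1.0) < 1e-9     # the smoothed label still satisfies Eq. (1)
assert y_ls.index(max(y_ls)) == 0      # the true class keeps the largest value
```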

III-D Knowledge Distillation for InceptionTime

Third, we implemented Knowledge Distillation (KD) to generate soft labels in place of manually controlled ones. Instead of manually setting up the soft labels, Hinton et al. [18] proposed KD to generate soft labels by a teacher model, which owns a cumbersome architecture with a large number of parameters. Intuitively, the teacher model has more potential to capture the knowledge in the training data because of its scale. The labels predicted by the teacher model can be regarded as the knowledge it has learned, denoted p^t. Thus, p^t is an automatic soft label compared to the manual soft label y^LS. Note that a deeper single-model version of InceptionTime is incorporated as the teacher model, denoted f_t.

In addition, instead of directly using the Softmax Cross Entropy loss, Softmax with a temperature T and the Kullback-Leibler (KL) divergence loss are adopted. The equation of Softmax with T is given in Eq. (10):

σ(z; T)_i = e^{z_i / T} / ∑_{j=1}^{C} e^{z_j / T}, (10)

where the temperature T is a parameter for fine-tuning the smoothed level of the predicted labels from the teacher model and from the student model, denoted p^t and p^s. Note that T = 1 leaves the labels unchanged, while T < 1 or T > 1 makes the labels steeper or smoother respectively. For an extreme example, if T → ∞, the labels approach the uniform distribution. To this end, the difference between p^t and p^s can be measured by the KL divergence, given in Eq. (11):

L_KL(p^t, p^s) = ∑_{i=1}^{C} p^t_i log(p^t_i / p^s_i). (11)
After that, L_KL representing the loss of soft labels and L_CE representing the loss of hard labels are incorporated into a whole for training the student model, which has a relatively small architecture with fewer parameters. This procedure is regarded as distilling the knowledge from a teacher model into a student model, called KD, in order to preserve the accuracy of the teacher model while reducing its time and space complexity. Note that a shallower single-model version of InceptionTime is selected as the student model, denoted f_s. The KD loss is given in Eq. (12):

L_KD = (1 − λ) L_CE(p^s, y) + λ T² L_KL(p^t, p^s), (12)

where λ controls the weights of L_CE (Eq. (5)) and L_KL (Eq. (11)). Note that the factor T² is necessary because the scale of L_KL becomes smaller after fine-tuning by T; multiplying by T² keeps it on the same scale as L_CE, so that the total loss has no preference between the two terms.
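The distillation loss described above can be sketched as follows; the sample logits, the temperature, and the exact form of the weighting (here a coefficient called `lam`) are illustrative assumptions rather than the paper's settings:

```python
import math

def softmax_T(z, T):
    """Softmax with temperature T (Eq. (10)); T > 1 smooths, T < 1 sharpens."""
    e = [math.exp(v / T) for v in z]
    s = sum(e)
    return [v / s for v in e]

def kl_div(pt, ps):
    """KL divergence between teacher and student soft labels (Eq. (11))."""
    return sum(t * math.log(t / s) for t, s in zip(pt, ps) if t > 0)

def kd_loss(z_s, z_t, c, T, lam):
    """KD loss: hard-label cross entropy plus T^2-scaled soft-label KL.
    The (1 - lam) / lam balance is an assumed form of the weighting."""
    ce = -math.log(softmax_T(z_s, 1.0)[c])             # hard-label term
    kl = kl_div(softmax_T(z_t, T), softmax_T(z_s, T))  # soft-label term
    return (1 - lam) * ce + lam * (T ** 2) * kl

# a high temperature drives the teacher's soft labels toward uniform:
pt = softmax_T([5.0, 0.0, 0.0], T=100.0)
assert max(pt) - min(pt) < 0.05
assert kd_loss([1.0, 2.0, 0.0], [3.0, 0.5, 0.1], c=0, T=4.0, lam=0.9) > 0.0
```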

Nevertheless, similar to teachers in real life, the teacher model is not guaranteed to be correct. Sometimes it may misguide the student model to wrong answers. To be specific, incorrect soft labels generated by the teacher model will also result in wrong labels predicted by the student model. To alleviate the effect of incorrect labels, KD adopts T and λ. Nonetheless, that brings additional hyperparameters into the model. Therefore, we propose a better method to alleviate the effect of incorrect labels while not introducing additional hyperparameters.

III-E Knowledge Distillation with Calibration for InceptionTime

At last, we propose Knowledge Distillation with Calibration (KDC) to calibrate the incorrect soft labels generated by the teacher model before distillation. Note that the teacher and student models are f_t and f_s respectively (Section III-D). In this way, it is not necessary to employ T and λ in KDC. In order to calibrate the incorrect soft labels, all labels are regarded as vectors geometrically, including the hard label y and the soft label p generated by the teacher model. From this point of view, according to Eq. (1), the feasible solution space of a label is a triangular hyperplane, named the label space; in other words, all labels are located on this hyperplane. Fig. 3(a) gives an example when C = 2: the label space is a 1-D line segment in 2-D space. Next, Fig. 3(b) shows another example when C = 3: the label space is a 2-D regular triangle in 3-D space. Last, the label space is a 3-D regular tetrahedron in 4-D space when C = 4, although a 4-D space cannot be plotted in figures. Note that distinct colors represent the areas of distinct classes. Thus, it is possible to calibrate p from its original position to a target position p̂ if p is located in the wrong area. In addition, all hard labels y are located at the vertices of the label space, as marked in Fig. 3(a) and Fig. 3(b).

(a) The 1-D label space when C = 2
(b) The 2-D label space when C = 3
Fig. 3: Two examples showing the label spaces when C = 2 and C = 3

Therefore, our main task is to propose a proper method to move p into its correct area while its new position lies between p and y. The calibrated p is denoted p̂. To this end, two approaches for calibration are proposed: calibration by translating and calibration by reordering. Note that only incorrectly predicted labels will be calibrated. Formally, given a p and its corresponding y, p̂ will be computed only when argmax(p) ≠ c. In other words, only labels whose predicted class differs from the true class are calibrated.

III-E1 Calibration by Translating

Calibration by Translating (CT) geometrically translates p from its original position toward y, as shown in Eq. (13):

p̂ = p + β (y − p), (13)

where (y − p) represents the vector from p to y, while β is a calibration coefficient controlling the distance p moves towards y. It is easy to see that p̂ = p when β = 0, and p̂ = y when β = 1; in the latter case, it is simply substitution instead of calibration.

Hence, β is the key coefficient defining the degree of calibration. We define β as in Eq. (14):

β = d_min / d, (14)

where d_min is the minimum possible distance between p and y over all incorrectly predicted p, and d is the current distance between p and y. It is easy to see that β ≤ 1 always holds for incorrect p.

To this end, we calculated d_min and got the magic number √2/2. The procedure of the calculation is given in Appendix A. As a consequence, Eq. (13) can also be rewritten as Eq. (15):

p̂ = p + (√2/2) · (y − p) / ‖y − p‖, (15)

where the unit vector (y − p)/‖y − p‖ decides the direction of translating, while √2/2 determines the distance of translating. In this way, all p̂ are ensured to stay in the label space, since d ≥ √2/2 holds for any incorrect p. In addition, it also guarantees that p̂ ≠ p, so p will not stay unchanged.
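CT can be sketched in a few lines of Python, assuming the fixed translation distance is √2/2 (the minimum distance from a decision boundary to a vertex of the label space); the sample prediction is illustrative:

```python
import math

D_MIN = math.sqrt(2) / 2  # assumed minimum boundary-to-vertex distance

def calibrate_translate(p, c):
    """CT: translate p a fixed distance D_MIN toward the true-class vertex y."""
    y = [1.0 if i == c else 0.0 for i in range(len(p))]
    diff = [yi - pi for yi, pi in zip(y, p)]         # the vector y - p
    d = math.sqrt(sum(v * v for v in diff))          # current distance d
    return [pi + D_MIN * v / d for pi, v in zip(p, diff)]

# an incorrect prediction for true class 0 (its argmax is class 1):
p = [0.2, 0.7, 0.1]
p_hat = calibrate_translate(p, 0)
assert abs(sum(p_hat) - 1.0) < 1e-9    # still on the label hyperplane
assert p_hat.index(max(p_hat)) == 0    # moved into the correct class area
```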

III-E2 Calibration by Reordering

Calibration by Reordering (CR) reprioritizes the values of the incorrectly predicted label based on a specific strategy. Concretely, given a p and its corresponding y, the values of p will be re-sorted if argmax(p) ≠ c. Therefore, our main task is to design a reordering strategy.

Given a p awaiting reordering, the strategy is designed as follows: 1) p is sorted in descending order; the sorted vector is denoted p′, so that p′_1 ≥ p′_2 ≥ … ≥ p′_C, where p′_1 is the largest value, p′_2 the second largest, and so on (ties are broken arbitrarily). Let r be the rank of the true class c in this order. 2) After defining a temporary value t = p′_1, the value p′_{k+1} is assigned to p′_k, from k = 1, 2, all the way to r − 1. 3) Assign the value of t to p′_r, i.e. the position of the true class c.

The whole procedure is given in Algorithm 1. It ensures that p̂ is located in the label space, since p̂ is only a reordered version of p. In addition, it guarantees that p̂ is geometrically located in its class area; in other words, argmax(p̂) = c. As a consequence, p is successfully calibrated by reordering.

1: Input: a predicted label p and its corresponding true class c
2: Output: the reordered label p̂
3: Sort p in descending order to get p′ and the rank r of class c
4: t ← p′_1
5: for k = 1, 2, …, r − 1 do
6:     p′_k ← p′_{k+1}
7: p′_r ← t
8: Map the reordered values back to their class positions to obtain p̂
Algorithm 1 Algorithm of calibration by reordering
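A pure-Python sketch of this reordering; the rank bookkeeping and tie handling are assumptions consistent with the description above, not the paper's code:

```python
def calibrate_reorder(p, c):
    """CR: reassign the values of p so the true class c receives the largest
    value while every class that ranked above c slides down one rank; the
    multiset of probabilities is unchanged."""
    order = sorted(range(len(p)), key=lambda i: p[i], reverse=True)  # classes by rank
    r = order.index(c)                 # rank of the true class (0-based)
    p_hat = list(p)
    t = p[order[0]]                    # temporary value: the largest probability
    for k in range(r):                 # ranks above c each take the next value down
        p_hat[order[k]] = p[order[k + 1]]
    p_hat[c] = t                       # the true class receives the largest value
    return p_hat

p = [0.1, 0.5, 0.4]                    # incorrect: argmax is class 1, true class is 0
p_hat = calibrate_reorder(p, 0)
assert sorted(p_hat) == sorted(p)      # same values, only reordered
assert p_hat.index(max(p_hat)) == 0    # now located in the correct class area
```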

III-E3 Analysis of CT and CR

We theoretically analyzed the difference between CT and CR in terms of calibration. To be specific, for a p that is not located in the correct area of the label space, KDC calibrates it to p̂ by either CT or CR, where p̂ is located in the correct area. It follows that there must exist a mapping of any label from the incorrect area to the correct area of the label space. Thus, we examine this mapping by calculating the distance between the label and the hard label y after KDC.

Fig. 4 shows an example in 3-D space, illustrating the color map of the distance to y after KDC in the label space. The distance function is the Euclidean distance. As the p in the red area (correct area) do not require calibration, their distance is calculated directly as ‖p − y‖. On the contrary, the p outside the red area (incorrect area) are calibrated to p̂, so their distance is calculated as ‖p̂ − y‖. As shown in Fig. 4(a), under CT the p close to the edge of the red area are mapped close to the vertex, i.e. the hard label y, while the p far away from the red area are mapped to the edge of the red area. In Fig. 4(b), under CR the p close to the edge of the red area are mapped close to the edge, while the p far away from the red area are mapped close to y. As a consequence, CT keeps the relative position of incorrect p unchanged, which means it preserves the spatial information of p; CR keeps the shape of the distribution of incorrect p unchanged, which means it preserves the distributional information of p.

(a) Calibration by translating
(b) Calibration by reordering
Fig. 4: A 3-D example illustrating the color map of the distance to the hard label after KDC in the label space

To this end, we are in a position to define the loss function of KDC. Unlike the loss function of KD, we employ the KL divergence only, without temperature and without the cross-entropy part, as shown in Eq. (16):

L_KDC = ∑_{i=1}^{C} p̂_i log(p̂_i / p^s_i), (16)

where p̂ is the calibrated label generated by the teacher model, while p^s is the label generated by the student model. Compared to L_KD, L_KDC does not contain any hyperparameter and computes only the KL divergence, which reduces both computational time and the complexity of hyperparameter tuning.
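This loss is a plain KL divergence; a minimal pure-Python sketch:

```python
import math

def kdc_loss(p_hat, p_s):
    """KDC loss (Eq. (16)): plain KL divergence between the calibrated
    teacher label p_hat and the student prediction p_s; no temperature,
    no cross-entropy term, no hyperparameters."""
    return sum(t * math.log(t / s) for t, s in zip(p_hat, p_s) if t > 0)

assert kdc_loss([0.7, 0.2, 0.1], [0.7, 0.2, 0.1]) < 1e-12  # zero when they agree
assert kdc_loss([0.7, 0.2, 0.1], [0.2, 0.7, 0.1]) > 0.0    # positive otherwise
```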

Thus, the total process of KDC can be summarized in three steps: 1) train a teacher model with a heavier and more complex architecture on the true labels (hard labels), and generate the predicted labels with the teacher model; 2) calibrate the incorrectly predicted labels; 3) train the student model, which has a relatively small and simple architecture, with the calibrated labels (soft labels). The algorithm of KDC is given in Algorithm 2.

1: Input: the training data X and its corresponding labels Y
2: Output: the trained student model f_s
3: Initialize a teacher model f_t
4: Train f_t by X and Y to get the trained teacher f*_t
5: Generate the predicted labels P by f*_t
6: for each p ∈ P do
7:     if p and its corresponding y belong to distinct classes then
8:         Calibrate p to get p̂ by Eq. (15) or Algorithm 1
9: Initialize the student model f_s
10: Train f_s by X and the calibrated labels to get f*_s
Algorithm 2 Algorithm of KDCTime
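The calibration filter in the middle of Algorithm 2 can be sketched as follows; `swap_calibrate` is a deliberately trivial, hypothetical stand-in used only to demonstrate the control flow, not CT or CR:

```python
def calibrate_dataset(preds, classes, calibrate):
    """Keep correctly predicted soft labels as-is and calibrate the
    incorrect ones (those whose argmax disagrees with the true class).
    `calibrate` is whichever strategy is chosen (CT or CR)."""
    out = []
    for p, c in zip(preds, classes):
        if p.index(max(p)) != c:       # the teacher's prediction is wrong
            out.append(calibrate(p, c))
        else:
            out.append(list(p))
    return out

def swap_calibrate(p, c):
    """A trivial stand-in calibration: swap the max with the true class."""
    q = list(p)
    j = q.index(max(q))
    q[c], q[j] = q[j], q[c]
    return q

labels = calibrate_dataset([[0.8, 0.2], [0.3, 0.7]], [0, 0], swap_calibrate)
assert labels[0] == [0.8, 0.2]         # correct prediction left unchanged
assert labels[1] == [0.7, 0.3]         # incorrect prediction calibrated
```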

IV Experiments

We conduct the experiments on the UCR archive, which contains 128 datasets. Yet, the datasets listed in Table II are problematic, containing NaN (Not a Number) values due to missing data or varying time-series lengths. Therefore, the remaining datasets are selected for the experiments.

Dataset Name
AllGestureWiimoteX DodgerLoopDay GestureMidAirD1
AllGestureWiimoteY DodgerLoopGame GestureMidAirD2
AllGestureWiimoteZ DodgerLoopWeekend GestureMidAirD3
MelbournePedestrian PickupGestureWiimoteZ GesturePebbleZ1
ShakeGestureWiimoteZ PLAID GesturePebbleZ2
TABLE II: Problematic Datasets

In the experiments, ROCKET, Softmax cross entropy for InceptionTime (ITime), Label Smoothing for InceptionTime (LSTime), Knowledge Distillation for InceptionTime (KDTime), and KD with Calibration for InceptionTime (KDCTime) are compared, where KDCTime includes two calibrating methods, i.e. KDC by translating (KDCT) and KDC by reordering (KDCR). Except for ROCKET, all the aforementioned methods can be regarded as ITime-based approaches. First, in order to find the best hyperparameters for LSTime, KDTime, KDCT, and KDCR, we conducted hyperparameter studies for those approaches. Note that ITime does not have any hyperparameter to be tuned. After that, we tested the accuracy, training time, and test time of the aforementioned approaches.

Similar to [13], critical difference diagrams are drawn in this paper to better illustrate the results of different approaches, as results over many datasets are hard to depict clearly. A critical difference diagram is drawn by the following steps: 1) execute the Friedman test [14] to reject the null hypothesis; 2) perform the pairwise post-hoc analysis [5] by a Wilcoxon signed-rank test with Holm's alpha correction [15]; 3) visualize the statistical result as in [10], where a thick horizontal line indicates that the connected approaches are not significantly different with respect to the results.
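Step 2 relies on Holm's correction; a generic pure-Python sketch of the step-down procedure (not the paper's code, and the alpha value is the conventional default):

```python
def holm_correction(pvalues, alpha=0.05):
    """Holm's step-down procedure: sort p-values ascending and compare the
    k-th smallest against alpha / (m - k); stop at the first failure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    rejected = [False] * m
    for k, i in enumerate(order):
        if pvalues[i] <= alpha / (m - k):
            rejected[i] = True
        else:
            break                      # all remaining hypotheses are retained
    return rejected

# with three pairwise comparisons the smallest p-value is tested at alpha/3:
assert holm_correction([0.01, 0.04, 0.30]) == [True, False, False]
```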

Our experiments are conducted on a computer equipped with an Intel Core i-series CPU and an NVIDIA GeForce RTX GPU. The operating system is Windows 10. Additionally, the development environment is Anaconda with Python and PyTorch.
IV-A Hyperparameter Study for ITime-based Approaches

In this section, we first searched the common hyperparameters, i.e. batch size, number of epochs, and learning rate, on ITime. Those hyperparameters then carry over to LSTime, KDTime, KDCT, and KDCR, since all these methods are based on InceptionTime. After that, α in LSTime, T and λ in KDTime, and the remaining hyperparameters in KDCT and KDCR were searched separately.

IV-A1 Batch Size

First, the batch size was set without searching. The reason is that, in theory, the larger the batch size the better, the extreme case being full-batch training. However, in deep learning tasks, a very large batch can exceed the graphics memory of the GPU on large datasets. Additionally, the batch size and the number of epochs depend on each other, i.e. a larger batch size requires more epochs to converge, which means a longer training time. Thus, the batch size is always empirically set to one of a few small powers of two, and one such value was adopted in this paper.

IV-A2 Epoch

With the batch size fixed, we compared the accuracy of different numbers of epochs, which were , , , , and respectively. The critical difference diagram is given in Fig. 5, where epochs achieves the best accuracy. Yet, it shows no critical difference from epochs. Since the training time of DNN-based approaches is long, is selected as the number of epochs in order to save training time. In addition, we also employed an early stopping strategy with a patience of epochs to reduce the training time and alleviate overfitting.
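The early stopping strategy mentioned above can be sketched as a simple patience counter on the validation loss; the class below is an illustrative implementation, not the paper's code:

```python
class EarlyStopping:
    """Stop training once the validation loss has not improved
    for `patience` consecutive epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.counter = 0

    def step(self, val_loss):
        """Call once per epoch; returns True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.counter = 0  # reset on any improvement
        else:
            self.counter += 1
        return self.counter >= self.patience
```

In the training loop, `if stopper.step(val_loss): break` ends training early while the epoch budget acts as an upper bound.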

Fig. 5: Critical difference diagram for different epochs

IV-A3 Learning Rate

Distinct learning rates were tested, which were , , , , and . Moreover, learning rate decay was employed in order to stabilize the training process. We leveraged fixed-step decay, also called piecewise constant decay, as the decay strategy, where the step size is set to and gamma is set to . In other words, the learning rate is multiplied by every epochs. Note there are epochs, which ensures decays in total. The critical difference diagram is shown in Fig. 6, where the learning rate equal to has the highest accuracy and is thus employed in the paper.
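Fixed-step decay multiplies the learning rate by a constant factor gamma once every fixed number of epochs; in PyTorch this corresponds to `torch.optim.lr_scheduler.StepLR`. A stand-alone sketch of the schedule (the step size and gamma values in the test are placeholders, not the paper's settings):

```python
def stepped_lr(initial_lr, epoch, step_size, gamma):
    """Piecewise constant decay: the learning rate is multiplied
    by gamma once every step_size epochs."""
    return initial_lr * (gamma ** (epoch // step_size))

# PyTorch equivalent:
# scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=..., gamma=...)
```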

Fig. 6: Critical difference diagram for different learning rates

IV-A4 in LSTime

The smoothing coefficient (Eq. (9)) represents the smoothing level of the labels. We tested the accuracy of different values of in LSTime, which are , , , , and . The critical difference diagram is given in Fig. 7, where achieves the best accuracy. This indicates that should be neither too small nor too big, since a small gives the label little additional information, while a big causes too much information loss from the original class. In addition, the accuracy of a big is lower than that of a small , which means information from the original class is important, and it is not a good idea to abandon that information completely.
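Eq. (9) is not reproduced in this excerpt, so the sketch below uses the common label-smoothing formulation (the true class keeps 1 - eps and the other classes share eps uniformly), which conveys the same idea of trading hard-label confidence for soft-label information:

```python
import numpy as np

def smooth_labels(hard_labels, num_classes, eps):
    """Common label-smoothing variant (assumed, not the paper's Eq. (9)):
    the true class keeps 1 - eps; the remaining mass eps is spread
    uniformly over the other num_classes - 1 classes."""
    n = len(hard_labels)
    soft = np.full((n, num_classes), eps / (num_classes - 1))
    soft[np.arange(n), hard_labels] = 1.0 - eps
    return soft
```

Each row still sums to 1, so the smoothed labels remain valid distributions for cross entropy.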

Fig. 7: Critical difference diagram for different in LSTime

IV-A5 and in KDTime

KDTime contains two hyperparameters, and (Eq. (12)), where controls the weight between the hard-label loss and the soft-label loss, and fine-tunes the smoothing level of the soft labels (labels predicted by the teacher model). Thus, we tested the accuracy of distinct values of in KDTime, which are , , , , and , and also compared different values of , which are , , , , and . The critical difference diagrams of and are shown in Fig. 8 and Fig. 9. Fig. 8 shows that achieves the best accuracy; values of close to or reduce the accuracy. In addition, Fig. 9 shows that is the best. Similarly, the accuracy of KDTime decreases if is too big or too small.
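Eq. (12) itself is not shown in this excerpt; the sketch below follows the standard knowledge-distillation loss of Hinton et al., where a weight balances the hard-label cross entropy against a temperature-softened cross entropy toward the teacher's distribution. The paper's exact formulation may differ, and the T-squared factor is the conventional gradient rescaling, assumed here:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Numerically stable softmax with temperature T."""
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, hard_label, alpha, T):
    """Single-sample sketch: alpha * CE(hard label) plus
    (1 - alpha) * T^2 * CE(teacher's softened distribution),
    both evaluated on the student's predictions."""
    hard_ce = -np.log(softmax(student_logits)[hard_label])
    q_teacher = softmax(teacher_logits, T)
    log_q_student = np.log(softmax(student_logits, T))
    soft_ce = -np.sum(q_teacher * log_q_student)
    return alpha * hard_ce + (1.0 - alpha) * (T ** 2) * soft_ce
```

With the weight at one the loss reduces to plain cross entropy, recovering ITime; a higher temperature makes the teacher term smoother.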

Fig. 8: Critical difference diagram for different in KDTime
Fig. 9: Critical difference diagram for different in KDTime

To this end, we conclude that all ITime-based methods use the same batch size, number of epochs, and learning rate. In addition, in LSTime is set to , and and in KDTime are set to and respectively. Finally, Adam is adopted as the optimization algorithm to update the model.

IV-B Accuracy of Different Approaches

In this section, we compared the accuracy of ROCKET, ITime, LSTime, KDTime, KDCTime by Translating (KDCT), and KDCTime by Reordering (KDCR). The accuracy of each approach is averaged over runs, and the standard deviation of these accuracies is also calculated. The results are listed in Table III (Appendix B), where the number before the sign represents the mean accuracy and the number after it the standard deviation. The bold numbers mark the best accuracy or standard deviation among all approaches.

As shown in Table III, KDCR gets the highest accuracy on datasets, which shows that its accuracy is competitive and promising. Yet, its convergence is not as stable as that of ROCKET, since ROCKET has the lowest standard deviation on the majority of datasets. Besides, a critical difference diagram is given in Fig. 10 to better illustrate the results of Table III. Note that Fig. 10 also includes the accuracy of the teacher model used in KDTime, KDCT, and KDCR, which is an InceptionTime model with Inception modules. It is trained times and the best run is selected as the teacher, which is why its accuracy is the best. In addition, KDCR and KDCT show no significant difference from KDTime. The reason is that the teacher model reaches an accuracy of more than on datasets, which indicates that the majority of datasets have hit a bottleneck for improving accuracy. In other words, only a small number of samples can be calibrated by KDCTime. Thus, we argue that the results on those datasets dominate the other datasets in the critical difference diagram. When that part of the results is excluded, the accuracy of KDCR and KDCT is more promising, as shown in Fig. 11.

In summary, KDCR achieves the best accuracy, better than KDCT. The reason is that KDCR keeps the information from marginal labels. In Fig. 3, marginal labels are the labels located in the middle area of the label space. From Fig. 4(b), we know the labels located in the middle area have a long distance to , where they do not deterministically belong to any class, meaning that they contain abundant information from other classes. Yet, KDCT calibrates the marginal labels to be close to , which loses that information. In addition, KDCR calibrates the labels close to other classes to be close to the correct one, as shown in Fig. 4(b), which eliminates the misguidance from deterministically incorrect labels. Nevertheless, KDCT keeps the information of those labels from the incorrect classes.
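The reordering idea behind KDCR can be illustrated with a deliberately simplified sketch: when the teacher's top prediction is wrong, swap the probability of the predicted class with that of the true class, so the correct class becomes the maximum while every probability value (and hence the soft-label information) is preserved. The paper's actual procedure may reorder more than two entries; this shows only the core idea:

```python
import numpy as np

def kdc_reorder(soft_label, true_class):
    """Simplified KDCR-style calibration (assumed two-entry swap):
    if the teacher's argmax is not the true class, exchange the two
    probabilities, keeping the full set of probability values."""
    soft = np.asarray(soft_label, dtype=float).copy()
    pred = int(soft.argmax())
    if pred != true_class:
        # scalar indexing returns copies, so the tuple swap is safe
        soft[pred], soft[true_class] = soft[true_class], soft[pred]
    return soft
```

Because only positions change, the calibrated label remains a valid distribution and retains the teacher's uncertainty about the other classes, unlike a translation that concentrates mass on the true class.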

Fig. 10: Critical difference diagram illustrating the accuracy of different approaches
Fig. 11: Critical difference diagram illustrating the accuracy of three approaches on datasets with accuracy less than for KDTime

IV-C Training and Test Time of Different Approaches

In this section, we compared the training and test time of ROCKET, ITime, LSTime, KDTime, and KDCTime, without distinguishing KDCT and KDCR, as their training and inference times show no significant difference. Instead of listing all results in a table, pairwise comparison diagrams are used to better illustrate the results.

Fig. 12 demonstrates the training time of KDCTime compared with ROCKET, ITime, LSTime, and KDTime. Fig. 12(a) shows that KDCTime is much slower. In detail, the training time of KDCTime on datasets is less than one order of magnitude slower than ROCKET, while on datasets it is more than one order of magnitude slower. That is because all ITime-based approaches, including KDCTime, employ gradient descent as the optimization algorithm, which requires inference on many training samples for each update and a large number of updates in total, e.g. samples per update, many updates per epoch, and epochs in total. In contrast, ROCKET utilizes a ridge classifier and solves the ridge regression problem directly. However, the training time of KDCTime is still acceptable, requiring around hour to train on the UCR datasets and hours for runs in total. Besides, as shown in Fig. 12(b) and (c), the training time of ITime and LSTime is similar to that of KDCTime, since their model is the same, i.e. InceptionTime with Inception modules. Still, KDCTime needs an extra teacher model to guide the training of its student model, so it requires additional training time for the teacher. Note that the teacher model only needs to be trained once; once obtained, it can be reused across multiple trainings of the student model. Fig. 12(d) shows that the training time of KDCTime is of no significant difference from that of KDTime, as their models are the same, both require the teacher model, and both use gradient descent as the optimization algorithm. In conclusion, the differences in training time mainly appear across three categories of methods: ROCKET, ITime-based methods without KD, and ITime-based approaches with KD.

(a) KDCTime and ROCKET
(b) KDCTime and ITime
(c) KDCTime and LSTime
(d) KDCTime and KDTime
Fig. 12: The training time of KDCTime compared with ROCKET, ITime, LSTime, and KDTime

Fig. 13 demonstrates the test time of KDCTime compared with ROCKET, ITime, LSTime, and KDTime. Fig. 13(a) shows that KDCTime is much faster than ROCKET. To be specific, the test time of KDCTime on datasets is order of magnitude faster than ROCKET, on datasets it is orders of magnitude faster, and on datasets it is orders of magnitude faster. That is because, in the test stage, KDCTime no longer pays the computational cost of gradient descent and only requires inference on the test samples, the same as ROCKET. In this scenario, computing the random convolutional kernels in ROCKET is much slower than computing the Inception modules in KDCTime. In addition, as shown in Fig. 13(b), (c), and (d), the test times of the other ITime-based approaches show no difference from KDCTime, since their inference models are all InceptionTime with Inception modules. As a consequence, the test times can be categorized into two groups, ROCKET and ITime-based approaches, regardless of whether KD is used.

(a) KDCTime and ROCKET
(b) KDCTime and ITime
(c) KDCTime and LSTime
(d) KDCTime and KDTime
Fig. 13: The test time of KDCTime compared with ROCKET, ITime, LSTime, and KDTime

V Conclusion

In this paper, we observed that DNN-based TSC approaches easily overfit on the UCR datasets, which is caused by the few-shot problem of the UCR archive. Thus, in order to alleviate overfitting, Label Smoothing for InceptionTime (LSTime) was first proposed, utilizing soft labels. Next, instead of manually adjusting soft labels, Knowledge Distillation for InceptionTime (KDTime) was proposed to automatically generate soft labels. At last, in order to rectify the incorrectly predicted soft labels from the teacher model, KD with calibration (KDC) was proposed, which has two optional strategies, namely KDC by Translating (KDCT) and KDC by Reordering (KDCR).

The experimental results show that the accuracy of KDCT and KDCR is promising, with KDCR achieving the highest. In addition, all InceptionTime-based (ITime-based) approaches, including KDCT and KDCR, are orders of magnitude faster than ROCKET in test time, since the ITime model is the dominant factor in inference time. The training time of ITime-based approaches is longer than that of ROCKET, yet it is within an acceptable range and worthwhile in order to obtain a promising accuracy and fast inference. At last, KDCT and KDCR do not introduce any additional hyperparameter compared to ITime.

In the future, instead of concentrating only on loss functions and labels, we will explore various models in order to propose a new model with high generalization capability.

Appendix A The procedure to calculate

Given a hard label and a soft label , where they both satisfy and . By treating and as vectors, we want to find the minimum distance between them when . Let , so that and . Let , so that . Therefore, we have the following optimization objective:

where we know and . Thus, . Since , we let . Thus, . In this way, our optimization objective can be rewritten as follows:

This is an optimization problem with inequality constraints. Therefore, we can define its Lagrangian function as:

where and are two sets of Lagrangian multipliers. By adopting Karush-Kuhn-Tucker (KKT) Conditions, we have:

At last, after solving this system of equations, we know obtains the minimum value when , , and all other . As a result, can be calculated as follows:

Appendix B The accuracy of different approaches

The accuracy of different approaches on UCR datasets is given in Table III on the last page.