Time series classification for predictive maintenance on event logs

11/22/2020 ∙ by Antoine Guillaume, et al. ∙ Université d'Orléans ∙ Worldline

Time series classification (TSC) has gained a lot of attention in the past decade, and a number of methods for representing and classifying time series have been proposed. Nowadays, methods based on convolutional networks and ensemble techniques represent the state of the art for time series classification. Techniques transforming time series into images or text also provide reliable ways to extract meaningful features or representations of time series. We compare the state-of-the-art representation and classification methods on a specific application, namely predictive maintenance from sequences of event logs. The contributions of this paper are twofold: introducing a new dataset for predictive maintenance on automated teller machine (ATM) log data, and comparing the performance of different representation methods for predicting the occurrence of a breakdown. The problem is difficult because, unlike the classic case of predictive maintenance via sensor signals, we have sequences of discrete event logs occurring at any time, and the lengths of the sequences, which correspond to life cycles, vary a lot.






1 Introduction

The main goal of predictive maintenance (PdM) is to increase machine availability by minimizing unplanned maintenance due to machine failures. The benefits vary depending on the use case: for example, in a factory this leads to increased productivity, in a hospital it reduces downtime of machines that may be critical to patient health.

A classical example of PdM is the case of vibration monitoring ORHAN2006293. Consider an industrial machine equipped with sensors that can measure the amplitude of vibration emitted by a piece of the machine. This signal can be used to identify an ongoing degradation of this piece and, with knowledge of the breaking point (i.e. the amplitude at which the piece will likely break), to estimate the remaining time before failure. More complex systems can be monitored the same way by including more parameters, like temperature or pressure.

In this paper we consider cases where the machines we want to monitor are not equipped with sensors that capture meaningful features such as vibration, and where it is not reasonable or possible to install such sensors or to acquire their outputs. If such machines produce logs, for example for debugging purposes, we can use this source of data to try to identify relevant patterns in the logs, allowing us to predict equipment failures. In the case of automated teller machines (ATMs in the following), we cannot directly access the sensor outputs because they are handled by a process involving transformations between the multiple communication protocols inside the ATMs and the receiving servers.

The problem we address is that of PdM with log data as the sole source of information. In our application, logs are sequences of events occurring at irregular time stamps. The only available knowledge is the semantics of the events, that is, either the operations performed on the ATMs or information on the state of the system. An event is generally linked to a hardware module of an ATM and a level of gravity. As stated, such a framework is applicable to a wide range of machines such as smart devices, medical equipment, servers or anything emitting logs as part of its normal operating behavior.

The dataset we use in our application of PdM with log data from ATMs has the following properties:

  • It is composed of discrete time series representing the occurrences of a finite number of events through logs of event codes.

  • Event codes give information about the component on which an event occurs and on the gravity of that event.

  • The dataset is imbalanced, with 121 recorded failures out of 603 instances.

  • Data was collected over an 18 month period, with some ATMs integrated during this period. Thus, the time series range from a week to 18 months.

  • The frequencies of events vary between ATMs and with time.

As shown in SurveyPdM, most works in predictive maintenance either use regression to estimate the remaining useful life, or build a health indicator and estimate it based on recent data. Most of those approaches use data from sensors. This implies a dataset with meaningful features and knowledge of the failure time for all instances in the dataset, both of which we are lacking. To overcome these shortcomings, we chose to model PdM as a classification problem.

Many representation methods have been recently proposed for time series; some can be seen as feature extraction methods, allowing the use of classic (i.e. attribute-based) classifiers, some are dedicated to transforming time series into other time dependent representations. We review those representation methods and study their performance on our application dataset. Moreover, we study the impact of different encodings of the event codes on the results. Our contributions can be summarized as follows:

  • a new dataset for predictive maintenance (available on demand, for research purposes only),

  • a specific problem of predictive maintenance that has not yet been studied extensively,

  • a review of state-of-the-art representation and classification methods for time series,

  • a study of their performance on our data.

The paper is organized as follows. In Section 2, we introduce the problem of predictive maintenance and model it as a Time-Series Classification Task. In Section 3, we present related work on Predictive Maintenance. Section 4 is devoted to the presentation of the data. Section 5 proposes a review of representation and classification methods for time series. Section 6 is dedicated to experiments.

In an effort to promote reproducible research, all our experiments and datasets are available online at https://gogit.univ-orleans.fr/lifo/antoine-guillaume/tsc-pdm-event-log.

2 Predictive maintenance

The goal of PdM is to predict failures in order to avoid as much unplanned downtime as possible. Regression has been extensively used to estimate the remaining duration before failure for a hardware component. But in the ATM dataset, most instances have not yet experienced a failure, and we cannot compute the regression target without a timestamp indicating the failure. Such instances would have to be discarded in order to apply a classic regression method.

The dataset being very unbalanced, even censored regression methods, for example those based on survival analysis SurvivalPdM, seem hardly applicable. Thus, we preferred to model the task as a classification problem: given a time series up to time $t$, will a failure occur in the near future?

To frame the problem as a Time Series Classification task, we need to define some time intervals. We rely on the definitions given in logPdM but we adapt some of them to better fit our application.

Consider the representation given in Figure 1. We want to predict a failure that would occur in the near future, but early enough to leave time to perform maintenance before the actual breakdown. This is the role of the predictive padding: it quantifies the minimum time needed to perform maintenance. In the context of the modeling step, we define the time intervals as follows:

Figure 1: Illustration of the predictive maintenance intervals for TSC.

Let $X = \{x_1, \dots, x_n\}$ be a time series of length $n$ representing the log data of a machine, where $x_i = (t_i, e_i)$, with $t_i$ a timestamp and $e_i$ an event log.

  • Predictive padding ($pp$):

    The predictive padding is a quantity of time that expresses the minimum time needed to perform maintenance. To build a training/test instance for the dataset given $X$, we remove the data inside $pp$, and use our knowledge of the state of $X$ at $t_n$ to assign the label. The result is a couple $(X', y)$ with $X' = \{x_i \in X \mid t_i \le t_n - pp\}$ and $y$ a label giving the state of the machine at time $t_n$.

  • Infected interval ($ii$):

    The infected interval is located right after the failure. No data in this interval should be used for training or evaluation because it contains unusual data due to the breakdown, maintenance or restart procedure.

With the above definitions, we can now frame the PdM problem as a time series classification task: considering an input time series $X$ representing event logs that occurred up to time $t$, can we predict the occurrence of a failure on a component after time $t + pp$, where $pp$ is the predictive padding?

It would be interesting not only to learn a model to discriminate classes but also to identify failure signatures present in $X$ that will influence its state after time $t + pp$. We define failure signatures as patterns in the time or frequency domain that are characteristic of a degradation. This point is not addressed in this paper.
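As a rough illustration of how a labelled instance could be built from one life cycle, the sketch below drops every event that falls inside the predictive padding and attaches the label known at the end of the cycle. The function name `build_instance`, the tuple layout and the toy log are assumptions made for this example, not the authors' code; the 48-hour padding is the value used later in the protocol.

```python
from datetime import datetime, timedelta


def build_instance(events, end_time, failed, padding=timedelta(hours=48)):
    """Return (truncated_events, label) for one life cycle.

    events:   list of (timestamp, event_code) tuples (hypothetical layout).
    end_time: time at which the life cycle ends (failure or end of recording).
    failed:   True if the life cycle ended with a failure.
    """
    cutoff = end_time - padding
    # Remove every event that lies inside the predictive padding.
    kept = [(t, code) for (t, code) in events if t <= cutoff]
    label = 1 if failed else 0
    return kept, label


# Toy log, made up for the example.
log = [
    (datetime(2018, 4, 26, 9, 0, 0), 40000),
    (datetime(2018, 4, 27, 12, 41, 3), 2000),
    (datetime(2018, 4, 28, 12, 46, 34), 40000),
]
x, y = build_instance(log, end_time=datetime(2018, 4, 29, 10, 0, 0), failed=True)
```

With this toy log, only the first event is older than the 48-hour cutoff, so the instance keeps one event and carries the positive label.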

3 Related work

In this section we mostly focus on log-based predictive maintenance approaches and techniques. The representation and classification methods used in this study are presented in Section 5. For a more complete view of Time Series Classification, readers are referred to the archive referenced in TSCArchive.

Supervised methods are the most common approaches for Predictive Maintenance. In PdMARMA, the authors use the prediction of an Auto-Regressive Moving Average (ARMA) model based on historical data and statistics to transform an input time series into a set of features, which is then processed by a Principal Component Analysis (PCA) before being used as input to a regression model that estimates the remaining useful life. Other authors use event logs from ATMs to extract different types of features: statistics; pattern-based features, which target the order of occurrence of error messages and failure similarity; profile-based features, which take into account the physical characteristics of the ATM; and the survival probability from a Cox survival model. Those features are then given as input to classification models such as random forests. A similar approach uses a regression model to estimate the Remaining Useful Life (RUL); its authors also compare knowledge-based and automatic feature selection approaches.

Some Predictive Maintenance systems rely on Multiple Instance Learning (MIL) logPdM HarddriveMIL MILReg. Rather than learning on instances, MIL learns from bags of instances, and both classification and regression can be used in this context. A positive label is assigned to a bag if at least one instance in the bag is positive. For example, logPdM creates bags with a few days of logs per medical equipment.

Anomaly detection, notably applied to log data, can also be used in a PdM context. In eGFC, the authors present the evolving Gaussian Fuzzy Classifier (eGFC) approach for anomaly detection using logs from data centers. They use an evolving rule-based approach to classify incoming logs. Each rule $R^i$, with $i$ the rule identifier, is an ensemble of Gaussian membership functions $A_j^i = G(\mu_j^i, \sigma_j^i)$, with $\mu_j^i$ the modal value and $\sigma_j^i$ the dispersion of a variable $j$, together with a class $C^i$. The activation of a rule is given by a triangular norm of all $A_j^i$. Given this score, existing rules can be updated, or new rules can be created if no rule reaches an activation threshold. A distance measure is defined between rules to merge close ones, and rules that were not activated during multiple iterations can be deleted.

4 Dataset and Preprocessing

The application we consider is predictive maintenance for ATMs. For the sake of simplicity, we do not detail all the processes that, from a given mechanical event on an ATM, lead to the generation of the event logs. We only note that each mechanical event leads to the generation of a sequence of event logs, which depends on the gravity of the event and on the mechanical parts involved. Hence, we only present the features of the dataset that are necessary to understand this paper.

4.1 Initial data

The data we use in this paper come from logs of hardware events emitted by multiple ATMs. An event log is composed of a time-stamp and an event code (see Table 1), with an event code being a unique identifier for a type of hardware event. The dictionary of event codes is provided with the data.

time-stamp          | event code | Translation
2018-04-28 12:39:38 | 40000      | Withdrawal
2018-04-28 12:41:03 | 2000       | General State OK
2018-04-28 12:41:04 | 5000       | Distribution Module OK
2018-04-28 12:46:34 | 40000      | Withdrawal
Table 1: Example of logs from the ATM dataset

In our experiments, we focus on predicting failures of the distribution module, which is composed of multiple banknote boxes and a distribution circuit with suction cups and mechanical rollers.

The mechanical information recorded by sensors is sent to a computer inside the ATM, which produces logs that are sent to a platform, which in turn translates them into event codes. Unfortunately, we cannot access the original logs produced by the ATM. This explains why we consider only the logs of event codes. For simplicity, we only consider the distribution module in this study, but the same approach applies to predicting failures of other components.

The dataset contains the event logs from 486 ATMs, sliced into life cycles, each corresponding to a period of time during which the ATM has worked correctly without failure. The history of an ATM is composed of one or more life cycles per component, with each component having distinct life cycles. A life cycle is thus defined by a component, a start time and an end time corresponding to the failure of this component (see Figure 2 for an illustration of this process). Life cycles for the distribution module range from 7 days to 593 days, and from 865 to 178854 log entries (see Figure 3 for the distribution). The frequency of events varies over time, mostly due to ATM usage.

Right and left censoring are also present in the data. Right censoring corresponds to the cases where, when data collection stopped, some life cycles had not yet ended with a failure. The same reasoning applies to left censoring: when data recording started, some life cycles were already in progress. Hence, we do not have the data from the true beginning of those life cycles.

Figure 2: An illustration of the division into life cycles. Note that the infected interval corresponding to maintenance is removed independently of the origin of the failure, even if the failure does not originate from a hardware component (e.g. network failure).

4.2 Slicing strategy

Depending on the known or supposed degradation process of the monitored machine, we can also apply a particular slicing strategy to better frame the predictive task.

Whole life cycle

This is the basic case, where we consider the whole time series data of a life cycle. It is best suited if failure signatures (which are unknown) are assumed to be independent of time. With the removal of the predictive padding ($pp$) at the end of life cycles, a life cycle $X = \{x_1, \dots, x_n\}$, whose events $x_i$ carry timestamps $t_i$, is then defined as $X' = \{x_i \in X \mid t_i \le t_n - pp\}$.

Alternatively, if only recent data is relevant because failure signatures are assumed to happen in a given time window before a failure, we can slice time series into sub-sequences. Given a sub-sequence length $l$, we consider the life cycle $X'$ obtained after removing the predictive padding. We can then create sub-sequences starting from the end of $X'$: $S_1$ is the last sub-sequence of length $l$ of $X'$, $S_2$ the one immediately preceding it, and so on. $S_1$ inherits the label of $X'$, while the other sub-sequences are labeled as negatives.
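The sub-sequence strategy can be sketched as follows. This is only an illustration under stated assumptions: the sub-sequence length is counted in number of points, sub-sequences do not overlap, and a leftover prefix shorter than the requested length is discarded. The name `slice_from_end` is hypothetical.

```python
def slice_from_end(series, length, label):
    """Cut `series` into consecutive chunks of `length` points, from the end.

    Only the most recent chunk inherits `label`; older chunks are negatives (0).
    """
    chunks = []
    end = len(series)
    most_recent = True
    while end - length >= 0:
        chunk = series[end - length:end]
        chunks.append((chunk, label if most_recent else 0))
        most_recent = False
        end -= length
    return chunks


sub_sequences = slice_from_end([1, 2, 3, 4, 5, 6, 7], length=3, label=1)
```

On this toy series the result contains two chunks, `[5, 6, 7]` with the inherited label and `[2, 3, 4]` labeled as negative; the first point is dropped because it cannot fill a chunk.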

For the ATM dataset we use the whole time series strategy. We have no assumptions about failure signatures given only event logs, and we could not identify any significant pattern from visual inspection or analysis. For example, there is no clear correlation between the frequency of cash withdrawals and failure (e.g. a high withdrawal frequency does not necessarily mean more failures). The dataset consists of 482 life cycles that have not yet failed and 121 life cycles with failure. Figure 3 shows the distribution of life cycle durations for both classes and the mean time series length according to the number of days. Note that the events arrive at irregular timestamps and that their number per day varies.

Figure 3: The top graph shows the distribution of life cycle durations in days for life cycles with failure (orange) and yet without failure (blue). The bottom graph shows the mean time series length for bins of life cycle durations, again for both classes.

4.3 Pre-processing

Using the whole time series approach, the data is a set of univariate time series of varying length, from either the positive (failure) class or the negative (no failure) class. This dataset raises new challenges for time series classification:

  • The dataset is a sequence of event logs at irregular timestamps.

  • The sizes of life cycles have a large variance.

  • The variation in event frequencies between and within ATMs is mainly due to their location (local population, recurring events, nearby businesses, etc.), which causes multiple kinds of seasonality in the number of withdrawals. Preliminary experiments have shown no gain in clustering ATMs based on their withdrawal profile.

4.3.1 Series with varying lengths

Few methods can handle time series of different sizes. A basic approach consists in using the Dynamic Time Warping (DTW) distance with a k-NN classifier. Aside from the fact that computing DTW, even with speed-up techniques, is costly with time series as long as ours, there is no evidence of any gain from using DTW with time series of varying lengths. In DTWFacts the authors conduct many experiments on DTW; in one of them, they state that they were unable to create an artificial problem where the difference in accuracy between using DTW and reinterpolating the time series to the length of the shortest one in order to use the Euclidean distance was significant.

Figure 4: Visual comparison between an original time series of the ATM dataset with approximately 12000 points and the PAA transformation of size 1000

Considering the huge variance in length between our time series, options such as uniform scaling, noise or zero padding are not ideal tan2019time, as we have both a varying event frequency and different start and end points. For the same reasons, up-sampling the time series to the length of the longest ones does not seem reasonable either. We have therefore chosen to use Piecewise Aggregate Approximation (PAA) PAA to transform the varying-length series into series of the same length, making the assumption that the delay between two events is not significant in our application. As we are more interested in the shape of the time series than in their actual values, PAA seems a reasonable choice as it can keep this notion of shape, even if it implies information loss due to the averaging that it performs (see Figure 4).

4.3.2 Data encoding

We chose to represent the time series of event logs as a sequence of event codes. Four encodings are considered, and we investigate in Section 6 their relevance to our classification problem. Figure 5 illustrates the different encodings.

Figure 5: Visual comparison between the different data encodings after being processed by PAA

  • R1 converts the event codes into a range of $n_c$ consecutive integers, with $n_c$ the number of event codes. This is done by sorting the list of codes and assigning to each code its position in the sorted list.

  • R2 is the "raw" representation: it simply takes the numeric value of the code as it is. A code with value 4000 keeps this value.

  • R3 sorts the event codes by gravity, whereas R2 and R1 implicitly sort event codes by component because of how the dictionary is built. We take all codes of a same gravity level and assign them a value by sorting them as we did for all the codes in R1. We then repeat this for the next gravity level, adding an arbitrary value (here 200) for each level to separate the gravity levels. We chose to process gravity levels in the following order: KO, Warning, OK.

  • R4 assigns to each code a unique random value between 0 and $n_c$, the number of event codes.
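To make the four encodings concrete, here is a minimal sketch on a made-up code dictionary. The event codes, their gravity levels, the 0-based positions, the way the offset of 200 is applied (level index times offset plus rank) and the function names are all assumptions chosen for illustration; the actual dictionary and implementation may differ.

```python
import random


def encode_r1(codes):
    """R1: position of each code in the sorted list of codes (0-based here)."""
    return {code: rank for rank, code in enumerate(sorted(codes))}


def encode_r2(codes):
    """R2: raw numeric value of the code."""
    return {code: code for code in codes}


def encode_r3(codes, gravity, order=("KO", "Warning", "OK"), offset=200):
    """R3: rank codes inside each gravity level, levels separated by `offset`."""
    mapping = {}
    for level_index, level in enumerate(order):
        level_codes = sorted(c for c in codes if gravity[c] == level)
        for rank, code in enumerate(level_codes):
            mapping[code] = level_index * offset + rank
    return mapping


def encode_r4(codes, seed=0):
    """R4: a unique random value per code (a random permutation of the ranks)."""
    values = list(range(len(codes)))
    random.Random(seed).shuffle(values)
    return dict(zip(sorted(codes), values))


# Hypothetical dictionary: code -> gravity level.
toy_gravity = {2000: "OK", 2001: "Warning", 5000: "OK", 5002: "KO"}
toy_codes = list(toy_gravity)

r1 = encode_r1(toy_codes)
r3 = encode_r3(toy_codes, toy_gravity)
r4 = encode_r4(toy_codes)
```

In this toy case R1 keeps codes of the same component next to each other (2000 and 2001, then 5000 and 5002), whereas R3 groups them by gravity level first.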

5 Methods

In this section we briefly present the representation and classification techniques that we use for the experiments.

The focus of this paper being time series classification, we do not cover other techniques such as survival analysis, semi-supervised regression techniques or Health-Index estimation that could be used to tackle issues related to censored data.

5.1 Change of Representation techniques

In Section 4.3.1, we discussed why we use PAA to transform time series so as to obtain equal-length time series. In our experiments, the techniques discussed below are applied after performing PAA.

5.1.1 Time series approximation

In the following, we consider a time series $X = \{x_1, \dots, x_n\}$ of size $n$.

Piecewise Aggregate Approximation (PAA)

PAA approximates a time series by dividing it into fragments and outputting the mean value of each fragment. Given an output size $m$, PAA computes the approximation $\bar{X} = \{\bar{x}_1, \dots, \bar{x}_m\}$ of $X$ such that:

$$\bar{x}_j = \frac{m}{n} \sum_{i = \frac{n}{m}(j-1) + 1}^{\frac{n}{m} j} x_i$$

If $m$ is not a divisor of $n$, the remainder is generally spread across multiple fragments, so that at most $n \bmod m$ fragments have length $\lfloor n/m \rfloor + 1$.
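A minimal PAA sketch is given below. It assumes that, when the output size does not divide the input length, the remainder is absorbed by the first fragments (one extra point each); other conventions exist, and the function name `paa` is chosen for this example only.

```python
def paa(series, m):
    """Piecewise Aggregate Approximation of `series` into `m` mean values."""
    n = len(series)
    if not 0 < m <= n:
        raise ValueError("output size must be between 1 and len(series)")
    base, extra = divmod(n, m)
    approximation = []
    start = 0
    for j in range(m):
        # The first `extra` fragments receive one additional point.
        size = base + 1 if j < extra else base
        fragment = series[start:start + size]
        approximation.append(sum(fragment) / size)
        start += size
    return approximation


approx = paa([1, 2, 3, 4, 5, 6, 7], 3)
```

Here the seven points are split into fragments of sizes 3, 2 and 2, giving the means 2.0, 4.5 and 6.5.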

Symbolic Aggregate approXimation (SAX)

SAX is classically used to transform a time series of length $n$ into a string of arbitrary length $m$. It first uses PAA to produce a series $\bar{X}$ of length $m$ and then, given the size $a$ of the alphabet, it computes $a$ discretization bins on the values of $\bar{X}$. The result is a time series of length $m$ with $a$ different values, the bins making it possible to assign a particular letter of the alphabet to each value of $\bar{X}$. Multiple strategies exist to define the range of the bins:

  • Uniform range: all bins have identical ranges, obtained by dividing the range of the values of $\bar{X}$ into $a$ equal parts.

  • Percentile range: all bins contain the same number of points, the bin edges being defined using the percentiles of $\bar{X}$.

  • Normal range: here, bin edges are percentiles of a standard normal distribution.

Note that the alphabet can also be numeric, so the approximation can be used for numeric applications. Another discretization technique, called Symbolic Fourier Approximation (SFA) SFA, also exists. It relies on the discrete Fourier transform to extract Fourier coefficients from an input time series and transforms the series into a word of discretized Fourier coefficients.
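The three binning strategies described for SAX can be sketched as follows, starting from values that have already been reduced by PAA. This is an illustrative simplification: the percentile edges use a crude rank-based rule, the normal strategy only makes sense if the input has been standardized beforehand, and the alphabet is limited to 26 lowercase letters. The names `bin_edges` and `sax` are assumptions of this example.

```python
from bisect import bisect_right
from statistics import NormalDist


def bin_edges(values, a, strategy="uniform"):
    """Return the a - 1 inner bin edges for the chosen strategy."""
    if strategy == "uniform":
        low, high = min(values), max(values)
        return [low + (high - low) * k / a for k in range(1, a)]
    if strategy == "percentile":
        ordered = sorted(values)
        return [ordered[(len(ordered) * k) // a] for k in range(1, a)]
    if strategy == "normal":
        # Quantiles of a standard normal distribution.
        return [NormalDist().inv_cdf(k / a) for k in range(1, a)]
    raise ValueError(f"unknown strategy: {strategy}")


def sax(paa_values, a, strategy="uniform"):
    """Map each PAA value to a letter according to the bin it falls into."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    if not 2 <= a <= len(letters):
        raise ValueError("alphabet size must be between 2 and 26")
    edges = bin_edges(paa_values, a, strategy)
    return "".join(letters[bisect_right(edges, v)] for v in paa_values)


word_uniform = sax([0.0, 1.0, 2.0, 3.0], 2, "uniform")
word_normal = sax([-1.0, 0.5], 2, "normal")
```

Replacing the letters by their bin index would give the numeric variant mentioned above.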

5.1.2 Time series to images

Inspired by the recent success of image classification, methods transforming time series into images have been developed, making it possible to apply image classifiers to time series classification. In the experiments, we call an image transformation "flat" when we flatten the output image to obtain one feature per pixel, and use these pixels as independent features.

Gramian Angular Fields (GAF)

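GAF GramianField aims at creating a matrix containing temporal correlations between each pair of points of a time series $X = \{x_1, \dots, x_n\}$ of size $n$. The size of the matrix ($n \times n$) will be the size of the image, and each value will be encoded as a color, creating the image. First, the data is re-scaled to the range $[-1, 1]$ so that it can be used as an input of the arc-cosine function:

$$\tilde{x}_i = \frac{(x_i - \max(X)) + (x_i - \min(X))}{\max(X) - \min(X)}$$

Then, each point is converted into polar coordinates by:

$$\phi_i = \arccos(\tilde{x}_i), \qquad r_i = \frac{i}{n}$$

with $\phi_i$ the angular cosine and $r_i$ the radius. To build the matrix $G$, GAF takes the trigonometric sum (or difference) of each pair:

$$G_{i,j} = \cos(\phi_i + \phi_j) \quad \text{(or } G_{i,j} = \sin(\phi_i - \phi_j) \text{ for the difference)}$$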
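The summation variant of the Gramian Angular Field can be sketched in a few lines. The sketch assumes a min-max rescaling to [-1, 1] and ignores the radius, which is not needed to build the matrix; the function name `gasf` is an assumption of this example.

```python
import math


def gasf(series):
    """Gramian Angular Summation Field of a series, as a list of rows."""
    low, high = min(series), max(series)
    if high == low:
        raise ValueError("a constant series cannot be rescaled")
    # Min-max rescaling to [-1, 1].
    scaled = [((x - high) + (x - low)) / (high - low) for x in series]
    # Clamp to guard against rounding before taking the arc-cosine.
    angles = [math.acos(max(-1.0, min(1.0, s))) for s in scaled]
    return [[math.cos(a + b) for b in angles] for a in angles]


image = gasf([0, 1, 2])
```

The resulting matrix is symmetric, and its diagonal entry for the middle point equals cos(pi) = -1 because that point is rescaled to 0.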

Recurrence Plots

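ReccurencePlot proposes to build a matrix containing the pairwise distances between parts of a time series $X = \{x_1, \dots, x_n\}$ of length $n$. Those parts are called trajectories and they are defined as:

$$\vec{x}_i = (x_i, x_{i+\tau}, \dots, x_{i+(m-1)\tau}), \quad i \in \{1, \dots, n - (m-1)\tau\}$$

with $m$ the "dimension" parameter and $\tau$ the "time delay" parameter. Once those trajectories are computed, the Recurrence Matrix $R$ is defined as:

$$R_{i,j} = \Theta(\varepsilon - \lVert \vec{x}_i - \vec{x}_j \rVert)$$

with $\Theta$ the Heaviside function, which is a step function defined by $\Theta(z) = 0$ if $z < 0$ and $\Theta(z) = 1$ otherwise. The parameter $\varepsilon$ is a distance threshold: two trajectories are marked as recurrent when their distance does not exceed $\varepsilon$.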
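A small sketch of a binary recurrence plot follows. It assumes the Euclidean distance between trajectories and treats a distance equal to the threshold as a recurrence; the function and parameter names are chosen for this example.

```python
import math


def trajectories(series, dimension, delay):
    """Build the delayed trajectories of `series`."""
    count = len(series) - (dimension - 1) * delay
    return [
        [series[i + k * delay] for k in range(dimension)]
        for i in range(count)
    ]


def recurrence_matrix(series, dimension=2, delay=1, threshold=0.5):
    """Binary recurrence matrix: 1 where two trajectories are close enough."""
    points = trajectories(series, dimension, delay)
    return [
        [1 if math.dist(u, v) <= threshold else 0 for v in points]
        for u in points
    ]


plot = recurrence_matrix([0, 1, 0, 1, 0], dimension=2, delay=1, threshold=0.5)
```

For this alternating series, trajectories repeat every two steps, which produces a checkerboard pattern in the matrix.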

5.1.3 Sliding operations

Some techniques use sliding window operations on an input time series, to generate a new time series in which some patterns may appear, depending on the type of sliding operations.

Matrix Profile (MP)

MP ProfileMatrix takes as input a time series $X$ and a window size $w$, and generates a new time series $P$ in which $P_i$ represents the minimum distance between the sub-sequence $(x_i, \dots, x_{i+w-1})$ and any other sub-sequence of size $w$ inside $X$. The brute force approach being intractable, multiple algorithms providing speed-ups through lower bounding and other techniques have been developed, which make the time complexity of the MP linear in the input length. Many applications are possible using MP, notably motif and discord discovery, but we do not detail them as we only use MP as a time series transformation whose output is used as input for other methods.
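For intuition only, the brute-force computation mentioned above can be written directly from the definition. This sketch makes two simplifying assumptions that differ from common Matrix Profile implementations: it uses the plain Euclidean distance (no z-normalization), and it skips neighbours within a fixed exclusion zone to avoid trivial matches. It is quadratic in the number of sub-sequences and is not meant for series as long as those in the dataset.

```python
import math


def matrix_profile(series, window, exclusion=1):
    """Brute-force matrix profile with plain Euclidean distance."""
    count = len(series) - window + 1
    subs = [series[i:i + window] for i in range(count)]
    profile = []
    for i in range(count):
        best = math.inf
        for j in range(count):
            if abs(i - j) <= exclusion:
                continue  # skip the sub-sequence itself and its close neighbours
            best = min(best, math.dist(subs[i], subs[j]))
        profile.append(best)
    return profile


profile = matrix_profile([0, 1, 2, 0, 1, 2, 9, 9], window=3, exclusion=1)
```

In this toy series the pattern `0, 1, 2` occurs twice, so the profile is 0 at positions 0 and 3 (a motif), while the last window, which contains the two outlying values, has the largest profile value (a discord).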

Random Convolutional Kernel Transform (ROCKET)

ROCKET randomly initializes a huge number (10,000 by default) of convolutional kernels $k = (l, w, b, d, p)$, with $l$ the length of the kernel, $w$ its weights, $b$ the bias, $d$ the dilation and $p$ the padding. If we generate $K$ kernels, ROCKET outputs a 1-dimensional array containing $2K$ features, because 2 features are extracted for each convolution of the input by a kernel. For a time series $X$ and a kernel $k$, the output $c$ of the convolution is defined as:

$$c_i = b + \sum_{j=0}^{l-1} w_j \, x'_{i + j \cdot d}$$

with $X'$ equal to the input $X$ augmented with a padding of length $p$. The two features extracted from $c$ are the maximum value and the proportion of positive values (ppv). This operation is repeated for each of the $K$ kernels that were randomly initialized.
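The transform can be sketched as below. The function `apply_kernel` follows the description above (dilated convolution on a zero-padded input, then maximum and proportion of positive values). The sampling scheme in `random_kernel` (kernel lengths, centred Gaussian weights, uniform bias, exponential dilation scale, random padding) is an assumption made for this illustration and should be checked against the reference implementation; all names are hypothetical.

```python
import math
import random


def random_kernel(rng, input_length):
    """Sample (weights, bias, dilation, padding); the scheme is an assumption."""
    length = rng.choice((7, 9, 11))
    weights = [rng.gauss(0.0, 1.0) for _ in range(length)]
    mean = sum(weights) / length
    weights = [w - mean for w in weights]  # centre the weights
    bias = rng.uniform(-1.0, 1.0)
    max_exponent = math.log2((input_length - 1) / (length - 1))
    dilation = int(2 ** rng.uniform(0.0, max_exponent))
    padding = ((length - 1) * dilation) // 2 if rng.random() < 0.5 else 0
    return weights, bias, dilation, padding


def apply_kernel(series, weights, bias, dilation, padding):
    """Return (max, ppv) of the dilated convolution of `series` by one kernel."""
    padded = [0.0] * padding + list(series) + [0.0] * padding
    span = (len(weights) - 1) * dilation
    outputs = [
        bias + sum(w * padded[i + j * dilation] for j, w in enumerate(weights))
        for i in range(len(padded) - span)
    ]
    ppv = sum(1 for value in outputs if value > 0) / len(outputs)
    return max(outputs), ppv


def rocket_features(series, n_kernels=100, seed=0):
    """Concatenate the two features of each random kernel (2 * n_kernels values)."""
    rng = random.Random(seed)
    features = []
    for _ in range(n_kernels):
        features.extend(apply_kernel(series, *random_kernel(rng, len(series))))
    return features


toy_series = [math.sin(i / 3) for i in range(50)]
toy_features = rocket_features(toy_series, n_kernels=10)
```

The sketch assumes the input is at least as long as the longest kernel (11 points here). The resulting feature vector can then be passed to any attribute-based classifier, as done in the experiments.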

5.2 Classification techniques

We suppose that the reader has knowledge on the most widely used classification algorithms (i.e. k-nearest neighbors (KNN), support vector machines (SVM), random forest (RF), etc.) and focus on algorithms that can be used for time series classification.

Time series Forest (TSF)

TSF applies the principle of random forests to time series data. RF classically considers $\sqrt{p}$ features at each node to make a split, with $p$ the total number of features of the input. In TSF, $p$ is defined as the length of the input time series. A tree randomly samples $\sqrt{p}$ starting positions and $\sqrt{p}$ interval lengths for the input time series, giving $p$ intervals on which statistics like mean, standard deviation and slope are computed, giving us $3p$ candidates that we can consider for a split. The tree building algorithm is similar to RF, but it introduces a new splitting criterion called "Entrance", based on the information gain ($\Delta\mathrm{Entropy}$) and the margin:

$$E = \Delta\mathrm{Entropy} + \alpha \cdot \mathrm{Margin}, \qquad \mathrm{Margin} = \min_{i = 1, \dots, N} \left| f_k(X_i, t_1, t_2) - \tau \right|$$

with $N$ the number of training instances inside a tree node, $t_1$ and $t_2$ the start and the end point of the randomly sampled interval, $f_k$ the function computing the chosen statistic (i.e. either mean, std or slope) on the time series interval, and $\tau$ the splitting value. The authors describe the use of the $\mathrm{Margin}$ as a tiebreaker when candidates have the same $\Delta\mathrm{Entropy}$: either by using a small value of $\alpha$, or by storing the $\mathrm{Margin}$ value and using it only if a tie occurs on $\Delta\mathrm{Entropy}$.
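The interval statistics and the tie-breaking role of the margin can be illustrated with the following sketch. It assumes inclusive interval bounds, a least-squares slope, base-2 entropy and a small illustrative value for the margin weight; the function names are hypothetical.

```python
import math


def interval_stats(series, t1, t2):
    """Mean, standard deviation and least-squares slope of series[t1..t2]."""
    window = series[t1:t2 + 1]
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in window) / n)
    x_mean = (n - 1) / 2
    denom = sum((x - x_mean) ** 2 for x in range(n))
    slope = (
        sum((x - x_mean) * (v - mean) for x, v in enumerate(window)) / denom
        if denom else 0.0
    )
    return mean, std, slope


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    result = 0.0
    for c in set(labels):
        p = labels.count(c) / total
        result -= p * math.log2(p)
    return result


def entrance(feature_values, labels, threshold, alpha=1e-7):
    """Entropy gain of the split plus alpha times the margin to the threshold."""
    left = [y for f, y in zip(feature_values, labels) if f <= threshold]
    right = [y for f, y in zip(feature_values, labels) if f > threshold]
    n = len(labels)
    gain = entropy(labels)
    gain -= len(left) / n * entropy(left) if left else 0.0
    gain -= len(right) / n * entropy(right) if right else 0.0
    margin = min(abs(f - threshold) for f in feature_values)
    return gain + alpha * margin


values, classes = [1.0, 2.0, 8.0, 9.0], [0, 0, 1, 1]
wide = entrance(values, classes, threshold=5.0)
narrow = entrance(values, classes, threshold=3.0)
```

Both thresholds separate the two classes perfectly and therefore have the same entropy gain; the threshold at 5.0 scores slightly higher because it lies farther from the nearest training value.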

Random Interval Spectral forEst (RISE)

RISE uses the same approach of building trees based on features extracted from intervals, but extracts time and frequency domain features with methods such as auto-correlation or power spectrum.

Bag-of-SFA Symbols in Vector Space (BOSSVS)

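BOSSVS uses a text mining approach to solve the time series classification problem. From an input time series $X$, a word size $l$, an alphabet $\Sigma$ and a window size $w$, it produces a histogram of symbolic Fourier approximation (SFA) words. A distance to the classes is then computed from the tf-idf (term frequency, inverse document frequency) of the SFA words. More precisely, $X$ is divided into moving windows of size $w$, and each window is then transformed into an SFA word of size $l$ based on the Fourier coefficients discretized with $\Sigma$. The histogram of SFA words is then built with the words produced by each window. For a time series $X$, its histogram $B_X$ and an SFA word $S$, the term frequency (tf) of $S$ inside a class $C$ is defined as:

$$tf(S, C) = \begin{cases} 1 + \log\left(\sum_{X \in C} B_X(S)\right) & \text{if } \sum_{X \in C} B_X(S) > 0 \\ 0 & \text{otherwise} \end{cases}$$

with $B_X(S)$ the value of $S$ in the histogram of $X$. The inverse document frequency (idf) is defined as:

$$idf(S, C) = \log\left(1 + \frac{|\mathrm{CLASSES}|}{|\{C \mid X \in C \wedge B_X(S) > 0\}|}\right)$$

The idf of an SFA word reflects the inverse of the proportion of classes in which this word occurs: high values of idf mean that a word is specific to a few classes relative to the total number of classes. With $tfidf(S, C) = tf(S, C) \cdot idf(S, C)$, the BOSS-VS distance between a query time series $Q$ and a class $C$ can then be defined as a cosine similarity:

$$\mathrm{sim}(Q, C) = \frac{\sum_{S \in Q} tf(S, Q) \cdot tfidf(S, C)}{\sqrt{\sum_{S \in Q} tf(S, Q)^2} \, \sqrt{\sum_{S \in C} tfidf(S, C)^2}}$$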
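The tf-idf weighting and the cosine similarity can be illustrated on word histograms that are assumed to be already computed (the SFA step is skipped). The logarithmic forms of tf and idf used below are the ones commonly given for BOSS in Vector Space and are stated here as an assumption; the class names, words and function names are made up for the example.

```python
import math
from collections import Counter


def class_tf(histograms):
    """Term frequency of each word over all histograms of one class."""
    totals = Counter()
    for histogram in histograms:
        totals.update(histogram)
    return {word: 1 + math.log(count) for word, count in totals.items() if count > 0}


def tfidf_per_class(histograms_by_class):
    """tf-idf vector of every class, as {class: {word: weight}}."""
    tf = {label: class_tf(hists) for label, hists in histograms_by_class.items()}
    n_classes = len(tf)
    vectors = {}
    for label, weights in tf.items():
        vectors[label] = {}
        for word, value in weights.items():
            classes_with_word = sum(1 for other in tf.values() if word in other)
            vectors[label][word] = value * math.log(1 + n_classes / classes_with_word)
    return vectors


def cosine_similarity(query_histogram, class_vector):
    """Cosine similarity between the tf of a query and a class tf-idf vector."""
    query = {w: 1 + math.log(c) for w, c in query_histogram.items() if c > 0}
    dot = sum(v * class_vector.get(w, 0.0) for w, v in query.items())
    norm_q = math.sqrt(sum(v * v for v in query.values()))
    norm_c = math.sqrt(sum(v * v for v in class_vector.values()))
    return dot / (norm_q * norm_c) if norm_q and norm_c else 0.0


# Made-up histograms of SFA words for two classes.
training = {
    "failure": [Counter({"ab": 3, "ba": 1})],
    "no_failure": [Counter({"aa": 4, "ba": 1})],
}
vectors = tfidf_per_class(training)
query = Counter({"ab": 2})
scores = {label: cosine_similarity(query, vec) for label, vec in vectors.items()}
predicted = max(scores, key=scores.get)
```

The query only contains the word "ab", which appears in the "failure" class alone, so that class obtains the highest similarity.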

Time Series Combination of Heterogeneous and Integrated Embedding Forest (TS-CHIEF)

TS-CHIEF TSCHIEF is another ensemble, tree-based algorithm. The authors define multiple splitting methods at node level, based on popular algorithms used in TSC. To split a node, those different methods are randomly initialized multiple times, and the one that maximizes a splitting criterion is chosen as the splitting method for this node. The algorithms that are re-defined as splitting methods in TS-CHIEF are: Proximity Forest, BOSS-VS and RISE. The methodology is close to the one used in HIVE-COTE HIVECOTE, but with fewer splitting methods and random initialization, which yields a huge gain in time complexity without a significant loss in performance.

6 Experiments

The experiments we perform aim at answering the following questions.

  • What are the best combinations of representation techniques and classifiers for the problem at hand? (Subsection 6.3.1)

  • More precisely, given a family of representation techniques, which ones give the best results for the application? (Subsection 6.3.2)

  • What is the performance of the time series classifiers TSF, RISE, BOSSVS and TS-CHIEF? (Subsection 6.3.1)

  • Among the 4 encodings proposed in Section 4.3.2, which one is the most suited to our problem? (Subsection 6.3.3)

  • What is the influence of the predictive padding? (Subsection 6.3.4)

6.1 Protocol

Starting with the raw data described in Section 4, we apply a predictive padding of 48 hours and Piecewise Aggregate Approximation to all time series to reshape them to a size of 1000; time series shorter than 1000 points are discarded. For each combination of representation and algorithm we wish to test, we use 10-fold cross-validation to compute the balanced accuracy, the F1-score and the critical failure index (CFI), a metric that is not affected by the number of true negatives. We then repeat this protocol for each data encoding described in Section 4.3.2.
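As a reminder of how two of these metrics are obtained from a confusion matrix, the sketch below computes the balanced accuracy and the F1-score from raw counts. The counts are made up, the function assumes that no denominator is zero, and the CFI is left out because its formula is not given in this section.

```python
def balanced_accuracy_and_f1(tp, fp, tn, fn):
    """Balanced accuracy and F1-score from confusion-matrix counts."""
    recall = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    precision = tp / (tp + fp)
    balanced_accuracy = (recall + specificity) / 2
    f1 = 2 * precision * recall / (precision + recall)
    return balanced_accuracy, f1


# Hypothetical counts for one cross-validation fold.
fold_balanced_accuracy, fold_f1 = balanced_accuracy_and_f1(tp=8, fp=2, tn=40, fn=4)
```

Balanced accuracy averages the recall of both classes, which keeps the majority (no failure) class from dominating the score on an imbalanced dataset. The F1-score does not depend on the number of true negatives.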

The code for the experiments is available online at https://gogit.univ-orleans.fr/lifo/antoine-guillaume/tsc-pdm-event-log.

6.2 Methods

We aim at studying the value of representation techniques for our application of failure detection in ATMs. These methods can be distinguished according to the type of output they produce (time series, features or images), and we combine them with different classifiers depending on the type of output:

  • Feature-based representations (ROCKET, flattened images) generate a new set of features and are combined with classic classifiers: SVM, RF, Ridge and kNN with Euclidean distance.

  • Image representations (GAF, recurrence plots) are combined with ResNet50V2 ResNetV2, a deep residual network architecture.

  • Time series representations (SFA, SAX and Matrix Profile) generate a new time series and are combined with TSF, RISE, BOSSVS and kNN with Euclidean distance. We also use InceptionTime InceptionTime, an ensemble of deep Convolutional Neural Network (CNN) models using Inception modules.

  • Representation stacking is also tested, for example using Matrix Profile before ROCKET.

In the following, we call "method" a combination of representation and classification algorithms. For instance, SAX ROCKET KNN denotes the application of SAX on the initial time series, followed by the application of ROCKET, thus generating features that can be given as inputs to kNN.

6.3 Experimental results

6.3.1 Best Methods

In this section, we present the best methods for the ATM dataset based on our experimental results. The full results are available in Appendix A. Figure 6 shows the ranking of the top performing methods based on mean balanced accuracy over all data encodings (R1, R2, R3 and R4).

Figure 6: Average ranking of methods based on the mean balanced accuracy on all 4 data encodings (top 25 for readability)

The best algorithms are shown in Table 2. For the sake of simplicity, we do not include the random encoding (R4), which gives the worst results.

name                  | balanced accuracy R1 | balanced accuracy R2 | balanced accuracy R3
ROCKET SVM            | 0.875 (+/- 0.059)    | 0.865 (+/- 0.066)    | 0.902 (+/- 0.044)
TS-CHIEF              | 0.856 (+/- 0.060)    | 0.846 (+/- 0.056)    | 0.819 (+/- 0.070)
TSF                   | 0.761 (+/- 0.107)    | 0.760 (+/- 0.083)    | 0.842 (+/- 0.080)
MP RISE               | 0.770 (+/- 0.064)    | 0.758 (+/- 0.066)    | 0.823 (+/- 0.065)
Recurrence Flat RF    | 0.750 (+/- 0.074)    | 0.741 (+/- 0.077)    | 0.808 (+/- 0.072)
Recurrence ResNet50V2 | 0.759 (+/- 0.046)    | 0.753 (+/- 0.015)    | 0.805 (+/- 0.077)
MP kNN                | 0.733 (+/- 0.066)    | 0.726 (+/- 0.053)    | 0.792 (+/- 0.091)
Table 2: Performance of the best methods on the ATM dataset. R4 and many ROCKET variants are not included.

ROCKET-based methods are the top performers for our application. Figure 7 shows the two features extracted by one of the most discriminative kernels. Other methods also give good results, such as TS-CHIEF, TSF, Recurrence Plots and Matrix Profile combined with RISE or kNN.

Figure 7: The two features ppv on axis x and max on axis y, generated by the most discriminant kernel of ROCKET (the importance of this kernel is assessed by the sum of the information gain of its two features, max and ppv for RF). Color map represents the class for training data and the class probability predicted by ROCKET RF for test data
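The two features shown in Figure 7, ppv (proportion of positive values) and max, can be illustrated with a simplified sketch of a ROCKET-style transform. It follows the general recipe of random dilated convolution kernels but omits padding and is not the implementation used in the experiments; the function names, the number of kernels and the sine-wave input are all hypothetical.

```python
import math
import random


def apply_kernel(series, weights, bias, dilation):
    """Dilated 1D convolution without padding, reduced to (ppv, max)."""
    span = (len(weights) - 1) * dilation
    outputs = []
    for start in range(len(series) - span):
        total = bias
        for k, weight in enumerate(weights):
            total += weight * series[start + k * dilation]
        outputs.append(total)
    ppv = sum(1 for value in outputs if value > 0) / len(outputs)
    return ppv, max(outputs)


def random_kernel(rng, series_length):
    """Draw one random kernel: centred Gaussian weights, bias, dilation."""
    length = rng.choice([7, 9, 11])
    weights = [rng.gauss(0, 1) for _ in range(length)]
    mean = sum(weights) / length
    weights = [w - mean for w in weights]
    bias = rng.uniform(-1, 1)
    # Dilation is sampled on a log scale so the kernel still fits the series.
    max_exponent = math.log2((series_length - 1) / (length - 1))
    dilation = int(2 ** rng.uniform(0, max_exponent))
    return weights, bias, dilation


def rocket_features(series, n_kernels=100, seed=0):
    """Concatenate (ppv, max) over n_kernels random kernels."""
    rng = random.Random(seed)
    features = []
    for _ in range(n_kernels):
        weights, bias, dilation = random_kernel(rng, len(series))
        features.extend(apply_kernel(series, weights, bias, dilation))
    return features


# Hypothetical input: a sine wave standing in for an encoded event series.
demo_series = [math.sin(i / 3) for i in range(50)]
features = rocket_features(demo_series, n_kernels=10, seed=0)
```

Each kernel contributes two values, so 10 kernels give 20 features; a classifier such as SVM or RF is then trained on these feature vectors.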

Aside from predictive performance, another point to consider is the time complexity of the different methods. Experiments were run on a DELL PowerEdge R730 on Debian 9.x with 2 XEON E5-2630 Corei7 with 20 cores at 2.20GHz and 64GB of RAM. We recorded fit and prediction times on each cross-validation split. Table 3 gives the mean total time (fit + predict) in seconds, computed over all splits and all data encodings (R1, R2, R3 and R4). ResNet50V2 and InceptionTime were run using a Tesla P100.

name                   Mean fit + predict time (seconds)
MP KNN                 1.754(+/- 0.094)
TSF                    15.49(+/- 0.331)
ROCKET SVM             112.5(+/- 1.099)
Recurrence Flat RF     153.5(+/- 4.076)
Recurrence ResNet50V2  275.2(+/- 27.41)
MP RISE                434.1(+/- 32.84)
TS-CHIEF               14408(+/- 143.4)
Table 3: Run time of the top performing methods

Among the top methods, MP KNN is the fastest (under 2 seconds) and TSF also has a competitive run-time (about 15 seconds). ROCKET SVM, which has the best balanced accuracy, runs in less than 2 minutes. TS-CHIEF achieves good accuracy, but its run-time (about 4 hours) is far longer than that of the other methods.

6.3.2 Influence of representation techniques

We study the influence on performance of the different representation methods, presented in Subsection 5.1.

For time series representations, Matrix Profile (MP) has a positive impact on the performance of time series classifiers in most cases, as seen in Table 4 (TSF on R3 being the exception), whereas SAX and SFA do not improve performance.

name     balanced accuracy R1  balanced accuracy R2  balanced accuracy R3
MP TSF   0.775(+/- 0.067)      0.783(+/- 0.084)      0.796(+/- 0.074)
TSF      0.761(+/- 0.107)      0.760(+/- 0.083)      0.842(+/- 0.080)
MP RISE  0.770(+/- 0.064)      0.758(+/- 0.066)      0.823(+/- 0.065)
RISE     0.672(+/- 0.071)      0.683(+/- 0.090)      0.773(+/- 0.085)
MP kNN   0.733(+/- 0.066)      0.726(+/- 0.053)      0.792(+/- 0.091)
KNN      0.587(+/- 0.088)      0.614(+/- 0.069)      0.653(+/- 0.068)

Table 4: Effect of Matrix profile on time series classifiers
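To make concrete what the Matrix Profile representation hands to the classifiers compared in Table 4, here is a naive sketch: for every subsequence of length m, it stores the z-normalized Euclidean distance to its nearest non-overlapping neighbour. The quadratic loop, the exclusion zone of m // 2 and the toy series are assumptions made for illustration; the experiments presumably rely on an optimized library.

```python
import math


def znorm(window):
    """Z-normalize a window; a constant window maps to all zeros."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((v - mean) ** 2 for v in window) / len(window))
    if std == 0:
        return [0.0] * len(window)
    return [(v - mean) / std for v in window]


def matrix_profile(series, m):
    """Naive O(n^2 * m) matrix profile of `series` for window length m."""
    windows = [znorm(series[i:i + m]) for i in range(len(series) - m + 1)]
    exclusion = m // 2  # assumed exclusion zone to skip trivial matches
    profile = []
    for i, a in enumerate(windows):
        best = math.inf
        for j, b in enumerate(windows):
            if abs(i - j) <= exclusion:
                continue
            dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
            best = min(best, dist)
        profile.append(best)
    return profile


# Hypothetical toy series: a repeating pattern, then the same with a spike.
periodic = [0, 1, 0, -1] * 4
profile = matrix_profile(periodic, m=4)

spiked = list(periodic)
spiked[9] = 5
spiked_profile = matrix_profile(spiked, m=4)
```

In the periodic series every window has an exact repeat, so the profile is flat at zero; once a spike is inserted, the windows covering it have no close match and stand out. The output is itself a time series, which is why it can be passed on to TSF, RISE or kNN.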

Considering image representations, Recurrence Plot outperforms Gramian Angular Fields when used with ResNet50V2, as shown in Table 5. Recurrence Plot keeps this advantage even when used as a feature-based representation ("flat" image) (see Table 2). Figure 8 shows an image generated by each method for the same instance.

name                   balanced accuracy R1  balanced accuracy R2  balanced accuracy R3
Recurrence ResNet50V2  0.759(+/- 0.046)      0.753(+/- 0.015)      0.805(+/- 0.077)
Gramian ResNet50V2     0.737(+/- 0.059)      0.698(+/- 0.049)      0.761(+/- 0.042)
Table 5: Performance of the image representations
Figure 8: Example of image (128x128) generated by Gramian Angular Fields (left) and Recurrence Plot (right) using the ”jet” color map for the same instance.
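As an illustration of the image representations, the sketch below builds the simplest form of a recurrence plot (pairwise absolute differences compared with a threshold) and flattens it into a feature vector, as in the "Flat" variants. The threshold, the absence of time-delay embedding and the lack of resizing to 128x128 are simplifying assumptions; the settings used in the experiments may differ.

```python
def recurrence_plot(series, threshold):
    """Binary image: pixel (i, j) is 1 when series[i] and series[j] are close."""
    return [
        [1 if abs(a - b) <= threshold else 0 for b in series]
        for a in series
    ]


def flatten(image):
    """Turn the 2D image into a 1D feature vector (row-major order)."""
    return [pixel for row in image for pixel in row]


# Hypothetical toy series standing in for an encoded event sequence.
toy_series = [0, 1, 0, 2]
image = recurrence_plot(toy_series, threshold=0.5)
flat_features = flatten(image)
```

The image is symmetric with ones on the diagonal, and a series of length n yields n * n flat features, which is what classifiers such as RF or SVM receive in the flattened variants.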

For feature-based representations, ROCKET is the best one, as shown in Table 6. This performance is mostly due to the proportion of positive values (ppv) statistic. Figure 7 illustrates this point on the most discriminating kernel found by ROCKET. The representation based on flattened images can reach a balanced accuracy above 80% (Recurrence Flat RF on R3).

name                balanced accuracy R1  balanced accuracy R2  balanced accuracy R3
ROCKET SVM          0.875(+/- 0.059)      0.865(+/- 0.066)      0.902(+/- 0.044)
Recurrence Flat RF  0.750(+/- 0.074)      0.741(+/- 0.077)      0.808(+/- 0.072)
Gramian Flat RF     0.682(+/- 0.055)      0.654(+/- 0.086)      0.792(+/- 0.079)
Table 6: Performance of the feature based representations

Representation stacking does not have a positive impact compared to individual representation methods. For example, applying MP before any feature-based representation leads to worse results than without MP. Combining image representations also seems to be less efficient.

6.3.3 Influence of data encodings

In Subsection 4.3.2, we have proposed 4 encodings for the events emitted by the ATMs and we aim at studying their influence.

Table 7 summarizes the performance of each encoding through 4 criteria: the max and mean balanced accuracy obtained over all methods, the number of times the encoding has been ranked first and its mean rank. R3 is the best encoding, followed by R1 and R2, which are quite close, and R4 is the worst. From those results, we can deduce the following points:

  • Encoding event codes by their gravity level (R3) is more efficient than by their components (R1, R2).

  • The notion of scale between event code values, that is the choice between R1 and R2 encodings, is not significant for most methods. Note that we applied min-max normalization before running classification algorithms where the notion of scale plays an important part (SVM, KNN and Ridge). Thus, the choice between R1 and R2 can mostly influence the methods that perform a change of representation (for instance PAA).

  • The low results of the random representation (R4) imply that a relation does exist between some patterns of event codes and failures.

criterion       R1              R2              R3              R4
Mean BA         0.71(+/- 0.09)  0.71(+/- 0.08)  0.76(+/- 0.09)  0.60(+/- 0.08)
Max BA          0.875           0.893           0.902           0.770
Rank 1 (times)  2               6               34              1
Mean rank       2.27            2.53            1.30            3.88
Table 7: Performance of the data encodings on balanced accuracy (BA) and ranks within the same methods
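The rank-based criteria of Table 7 can be sketched as follows: within each method, the four encodings are ranked by balanced accuracy, then the ranks are averaged over methods. The three rows below are copied from Table 9, so the output only illustrates the computation and does not reproduce the values of Table 7, which cover all methods. How ties are handled in the paper is not stated; this sketch ignores them.

```python
ENCODINGS = ["R1", "R2", "R3", "R4"]

# Balanced accuracy per encoding for three methods (values from Table 9).
scores = {
    "ROCKET SVM": [0.875, 0.865, 0.902, 0.770],
    "ROCKET RF": [0.863, 0.893, 0.873, 0.750],
    "TSF": [0.761, 0.760, 0.842, 0.536],
}


def ranks_within_method(values):
    """Rank 1 goes to the highest score; ties are not handled."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


all_ranks = [ranks_within_method(values) for values in scores.values()]

mean_rank = {
    enc: sum(ranks[i] for ranks in all_ranks) / len(all_ranks)
    for i, enc in enumerate(ENCODINGS)
}
times_first = {
    enc: sum(ranks[i] == 1 for ranks in all_ranks)
    for i, enc in enumerate(ENCODINGS)
}
```

On this subset, R3 is ranked first for two of the three methods and R4 is always last, which matches the overall trend reported in Table 7.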

6.3.4 Influence of the predictive padding

We study the influence of the predictive padding size. We rely on two methods that are efficient in terms of both accuracy and run-time, namely ROCKET SVM and TSF.

Figure 9 shows the performance of ROCKET SVM and TSF with a predictive padding size ranging from 0 to 25 days. The impact on performance remains reasonable for both methods up to 5 days, and performance starts to decline faster around 20 days. Note that, as the predictive padding increases, the number of examples decreases, since examples with fewer than 1000 entries are dismissed. It is thus possible that some failure signatures are located inside those long predictive padding durations.

Figure 9: Influence of the predictive padding size on performance
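The interaction between padding size and the number of usable examples can be sketched as below. The sketch assumes that predictive padding removes the events of the last padding_days before the end of a life cycle and that a life cycle with fewer than 1000 remaining entries is dismissed; the exact definition is the one given earlier in the paper, and the function name, the event tuples and the hourly toy log are hypothetical.

```python
def apply_predictive_padding(events, end_time, padding_days, min_entries=1000):
    """Drop events in the last `padding_days` before `end_time` (in days).

    Returns the truncated event list, or None when fewer than
    `min_entries` events remain and the example is dismissed.
    """
    cutoff = end_time - padding_days
    kept = [(t, code) for t, code in events if t <= cutoff]
    return kept if len(kept) >= min_entries else None


# Hypothetical life cycle: one event per hour for 50 days (1200 events).
toy_events = [(i / 24, "EVT") for i in range(1200)]

no_padding = apply_predictive_padding(toy_events, end_time=50.0, padding_days=0)
five_days = apply_predictive_padding(toy_events, end_time=50.0, padding_days=5)
ten_days = apply_predictive_padding(toy_events, end_time=50.0, padding_days=10)
```

With this toy log, a 5-day padding still leaves enough entries, whereas a 10-day padding pushes the example below the 1000-entry limit and it is dropped, which mirrors the shrinking data set mentioned above.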

7 Conclusion & Future work

The goal of this study was to formulate a predictive maintenance problem based on time series of event logs and to conduct experiments to find the approaches best suited to the application (predicting failures of ATMs). Our experimental results show that using ROCKET with an SVM and a data encoding based on event gravity level seems to be the most efficient approach. Nevertheless, the features generated by ROCKET are not easily interpretable, which is a drawback since interpretability is an important property for predictive maintenance. A number of methods have recently been developed for time series interpretability timexplain; AgnosticTS. They are either global methods, which look at the relations learned by an estimator, or local methods, which study how an estimator behaves when its inputs are slightly modified. Global methods do not seem well suited, as the features used by a ROCKET-based estimator are not directly interpretable. Concerning local methods, studying the effects of small variations of the inputs on the relevant features generated by ROCKET is not trivial. We aim to develop a method that combines the efficiency of methods like ROCKET with interpretability.

Our current formulation of predictive maintenance as a time series classification problem imposes, by means of the predictive padding, a lower bound on the time before failure. It thus allows a failure to be predicted early enough to perform maintenance. We would now like to study the introduction of an upper bound, so that failures are not predicted too early, which would trigger "unnecessary" maintenance.

We thank Naly Raliravaka (LIFO) for his help on setting up the experimental environments and online repository. We thank members of the ATM team at equensWorldline, Arnaud Celier, Olivier Hubert, Armindo Martins, Philippe Carrez and Laurent Desnouailles for discussions about the ATMs, Frank Charles and Ghislaine Le Goff for their contribution on the ATM data acquisition, Didier Noellec and Eric Prieur for their support in the project management. This work was supported by the ANRT CIFRE grant n°2019/0281 in partnership with equensWorldline and the University of Orléans.


Appendix A Experiment Results

name Mean fit + predict time (seconds)
KNN 0.182(+/- 0.015)
MP KNN 1.754(+/- 0.094)
SFA KNN 2.424(+/- 0.298)
SAX KNN 2.715(+/- 0.065)
SFA TSF 2.808(+/- 0.414)
Recurrence Flat Ridge 3.880(+/- 0.306)
Gramian Flat Ridge 4.283(+/- 0.349)
Recurrence Flat KNN 4.898(+/- 0.261)
Recurrence Flat SVM 5.584(+/- 0.403)
Gramian Flat KNN 6.348(+/- 0.310)
Gramian Flat SVM 6.914(+/- 0.344)
Gramian + Recurrence KNN 7.579(+/- 0.320)
MP Gramian + Recurrence KNN 9.202(+/- 0.797)
Gramian + Recurrence SVM 11.40(+/- 0.437)
MP BOSSVS 13.10(+/- 0.281)
MP Gramian + Recurrence SVM 13.41(+/- 0.328)
BOSSVS 14.57(+/- 0.980)
SAX TSF 15.29(+/- 0.447)
TSF 15.49(+/- 0.331)
MP TSF 17.10(+/- 0.289)
MP ROCKET SVM 101.9(+/- 1.466)
MP ROCKET KNN 102.9(+/- 1.727)
MP ROCKET Ridge 105.8(+/- 0.855)
SAX ROCKET Ridge 106.9(+/- 1.285)
ROCKET KNN 107.8(+/- 1.182)
ROCKET Ridge 108.1(+/- 0.972)
SAX ROCKET SVM 109.8(+/- 1.442)
ROCKET SVM 112.5(+/- 1.099)
SAX ROCKET KNN 113.4(+/- 0.854)
Recurrence Flat RF 153.5(+/- 4.076)
Gramian Flat RF 161.8(+/- 4.689)
MP Gramian + Recurrence RF 260.4(+/- 6.999)
Gramian + Recurrence RF 269.8(+/- 6.082)
Recurrence ResNet50V2 275.2(+/- 27.41)
Gramian ResNet50V2 279.5(+/- 33.41)
ROCKET RF 281.1(+/- 8.008)
SAX ROCKET RF 283.1(+/- 7.444)
MP ROCKET RF 292.0(+/- 7.554)
InceptionTime 397.3(+/- 37.54)
SAX RISE 433.8(+/- 31.44)
MP RISE 434.1(+/- 32.84)
RISE 442.9(+/- 31.09)
TS-CHIEF 14408(+/- 143.4)
Table 8: Mean run time for all methods
name balanced accuracy R1 balanced accuracy R2 balanced accuracy R3 balanced accuracy R4
ROCKET SVM 0.875(+/- 0.059) 0.865(+/- 0.066) 0.902(+/- 0.044) 0.770(+/- 0.055)
ROCKET RF 0.863(+/- 0.056) 0.893(+/- 0.042) 0.873(+/- 0.050) 0.750(+/- 0.083)
ROCKET Ridge 0.839(+/- 0.040) 0.860(+/- 0.063) 0.863(+/- 0.058) 0.693(+/- 0.059)
SAX ROCKET RF 0.832(+/- 0.075) 0.830(+/- 0.076) 0.847(+/- 0.070) 0.620(+/- 0.065)
TSF 0.761(+/- 0.107) 0.760(+/- 0.083) 0.842(+/- 0.080) 0.536(+/- 0.056)
ROCKET KNN 0.771(+/- 0.084) 0.665(+/- 0.057) 0.839(+/- 0.078) 0.512(+/- 0.030)
MP ROCKET RF 0.795(+/- 0.088) 0.806(+/- 0.079) 0.837(+/- 0.072) 0.729(+/- 0.083)
MP ROCKET Ridge 0.788(+/- 0.075) 0.778(+/- 0.054) 0.832(+/- 0.062) 0.741(+/- 0.053)
SAX ROCKET SVM 0.812(+/- 0.107) 0.827(+/- 0.092) 0.830(+/- 0.068) 0.645(+/- 0.090)
SAX ROCKET Ridge 0.799(+/- 0.109) 0.795(+/- 0.085) 0.825(+/- 0.074) 0.676(+/- 0.044)
MP ROCKET KNN 0.778(+/- 0.067) 0.772(+/- 0.053) 0.824(+/- 0.070) 0.642(+/- 0.057)
MP RISE 0.770(+/- 0.064) 0.758(+/- 0.066) 0.823(+/- 0.065) 0.677(+/- 0.088)
MP ROCKET SVM 0.786(+/- 0.083) 0.767(+/- 0.079) 0.820(+/- 0.069) 0.769(+/- 0.079)
TS-CHIEF 0.856(+/- 0.060) 0.846(+/- 0.056) 0.819(+/- 0.070) 0.677(+/- 0.078)
Recurrence Flat RF 0.750(+/- 0.074) 0.741(+/- 0.077) 0.808(+/- 0.072) 0.650(+/- 0.063)
Gramian + Recurrence RF 0.732(+/- 0.054) 0.731(+/- 0.075) 0.805(+/- 0.092) 0.584(+/- 0.069)
SAX ROCKET KNN 0.726(+/- 0.085) 0.717(+/- 0.078) 0.805(+/- 0.089) 0.522(+/- 0.031)
Recurrence ResNet50V2 0.759(+/- 0.046) 0.753(+/- 0.015) 0.805(+/- 0.077) 0.600(+/- 0.076)
MP KNN 0.783(+/- 0.051) 0.765(+/- 0.062) 0.798(+/- 0.064) 0.646(+/- 0.010)
MP TSF 0.775(+/- 0.067) 0.783(+/- 0.084) 0.796(+/- 0.074) 0.684(+/- 0.082)
Gramian Flat RF 0.682(+/- 0.055) 0.654(+/- 0.086) 0.792(+/- 0.079) 0.503(+/- 0.026)
SAX TSF 0.744(+/- 0.099) 0.734(+/- 0.075) 0.789(+/- 0.079) 0.556(+/- 0.053)
InceptionTime 0.698(+/- 0.116) 0.755(+/- 0.079) 0.782(+/- 0.097) 0.590(+/- 0.053)
MP BOSSVS 0.753(+/- 0.075) 0.718(+/- 0.065) 0.779(+/- 0.075) 0.643(+/- 0.056)
MP Gramian + Recurrence SVM 0.724(+/- 0.063) 0.711(+/- 0.052) 0.779(+/- 0.073) 0.668(+/- 0.092)
Gramian + Recurrence SVM 0.710(+/- 0.094) 0.701(+/- 0.091) 0.774(+/- 0.092) 0.556(+/- 0.056)
MP Gramian + Recurrence RF 0.721(+/- 0.054) 0.720(+/- 0.066) 0.773(+/- 0.091) 0.682(+/- 0.096)
RISE 0.672(+/- 0.071) 0.683(+/- 0.090) 0.773(+/- 0.085) 0.613(+/- 0.060)
Gramian ResNet50V2 0.737(+/- 0.059) 0.698(+/- 0.049) 0.761(+/- 0.042) 0.505(+/- 0.055)
SAX RISE 0.634(+/- 0.070) 0.611(+/- 0.058) 0.734(+/- 0.096) 0.586(+/- 0.065)
Recurrence Flat SVM 0.688(+/- 0.087) 0.683(+/- 0.069) 0.728(+/- 0.118) 0.565(+/- 0.073)
Gramian Flat SVM 0.647(+/- 0.085) 0.642(+/- 0.085) 0.726(+/- 0.098) 0.504(+/- 0.048)
Recurrence Flat Ridge 0.689(+/- 0.069) 0.696(+/- 0.078) 0.723(+/- 0.089) 0.556(+/- 0.075)
Gramian Flat Ridge 0.666(+/- 0.079) 0.645(+/- 0.106) 0.711(+/- 0.096) 0.517(+/- 0.044)
KNN 0.562(+/- 0.026) 0.616(+/- 0.007) 0.706(+/- 0.013) 0.501(+/- 0.001)
Gramian + Recurrence KNN 0.665(+/- 0.081) 0.707(+/- 0.109) 0.692(+/- 0.075) 0.512(+/- 0.033)
Recurrence Flat KNN 0.650(+/- 0.068) 0.648(+/- 0.068) 0.670(+/- 0.077) 0.506(+/- 0.020)
Gramian Flat KNN 0.577(+/- 0.079) 0.650(+/- 0.075) 0.586(+/- 0.051) 0.496(+/- 0.046)
SAX KNN 0.619(+/- 0.028) 0.637(+/- 0.062) 0.586(+/- 0.039) 0.489(+/- 0.001)
MP Gramian + Recurrence KNN 0.522(+/- 0.032) 0.531(+/- 0.044) 0.543(+/- 0.036) 0.558(+/- 0.056)
SFA TSF 0.525(+/- 0.040) 0.548(+/- 0.038) 0.536(+/- 0.035) 0.496(+/- 0.021)
SFA KNN 0.544(+/- 0.015) 0.515(+/- 0.009) 0.526(+/- 0.028) 0.515(+/- 0.003)
BOSSVS 0.543(+/- 0.022) 0.549(+/- 0.022) 0.518(+/- 0.014) 0.521(+/- 0.018)
Table 9: Balanced Accuracy for all methods sorted by R3 score
name CFI R1 CFI R2 CFI R3 CFI R4
ROCKET SVM 0.350(+/- 0.094) 0.355(+/- 0.129) 0.294(+/- 0.106) 0.512(+/- 0.106)
ROCKET Ridge 0.399(+/- 0.070) 0.358(+/- 0.100) 0.356(+/- 0.119) 0.632(+/- 0.105)
ROCKET RF 0.395(+/- 0.123) 0.318(+/- 0.149) 0.366(+/- 0.115) 0.525(+/- 0.138)
TSF 0.508(+/- 0.199) 0.522(+/- 0.136) 0.370(+/- 0.144) 0.909(+/- 0.108)
MP ROCKET Ridge 0.479(+/- 0.126) 0.505(+/- 0.091) 0.386(+/- 0.140) 0.563(+/- 0.093)
ROCKET KNN 0.489(+/- 0.147) 0.670(+/- 0.103) 0.386(+/- 0.141) 0.967(+/- 0.057)
SAX ROCKET RF 0.454(+/- 0.131) 0.456(+/- 0.124) 0.388(+/- 0.125) 0.770(+/- 0.091)
SAX ROCKET SVM 0.431(+/- 0.196) 0.398(+/- 0.182) 0.397(+/- 0.135) 0.710(+/- 0.162)
TS-CHIEF 0.343(+/- 0.118) 0.364(+/- 0.108) 0.404(+/- 0.135) 0.651(+/- 0.144)
MP ROCKET RF 0.476(+/- 0.154) 0.459(+/- 0.144) 0.408(+/- 0.139) 0.576(+/- 0.150)
MP RISE 0.499(+/- 0.113) 0.524(+/- 0.113) 0.411(+/- 0.112) 0.644(+/- 0.169)
MP ROCKET KNN 0.482(+/- 0.124) 0.506(+/- 0.087) 0.412(+/- 0.122) 0.720(+/- 0.104)
SAX ROCKET Ridge 0.476(+/- 0.176) 0.479(+/- 0.148) 0.419(+/- 0.147) 0.667(+/- 0.080)
Recurrence Flat RF 0.522(+/- 0.136) 0.538(+/- 0.138) 0.420(+/- 0.126) 0.696(+/- 0.126)
MP ROCKET SVM 0.485(+/- 0.144) 0.519(+/- 0.137) 0.423(+/- 0.138) 0.533(+/- 0.117)
Gramian + Recurrence RF 0.549(+/- 0.105) 0.549(+/- 0.146) 0.424(+/- 0.163) 0.824(+/- 0.136)
SAX ROCKET KNN 0.566(+/- 0.155) 0.581(+/- 0.144) 0.439(+/- 0.149) 0.942(+/- 0.052)
Gramian Flat RF 0.656(+/- 0.084) 0.704(+/- 0.129) 0.454(+/- 0.132) 0.968(+/- 0.042)
MP TSF 0.500(+/- 0.132) 0.482(+/- 0.159) 0.458(+/- 0.138) 0.641(+/- 0.143)
MP KNN 0.500(+/- 0.052) 0.529(+/- 0.075) 0.463(+/- 0.083) 0.711(+/- 0.017)
SAX TSF 0.540(+/- 0.172) 0.565(+/- 0.138) 0.468(+/- 0.135) 0.877(+/- 0.095)
RISE 0.660(+/- 0.129) 0.645(+/- 0.163) 0.471(+/- 0.163) 0.768(+/- 0.117)
Recurrence ResNet50V2 0.502(+/- 0.086) 0.556(+/- 0.048) 0.477(+/- 0.104) 0.768(+/- 0.122)
MP BOSSVS 0.514(+/- 0.142) 0.577(+/- 0.117) 0.477(+/- 0.136) 0.711(+/- 0.105)
InceptionTime 0.628(+/- 0.230) 0.517(+/- 0.151) 0.487(+/- 0.171) 0.785(+/- 0.081)
MP Gramian + Recurrence SVM 0.592(+/- 0.107) 0.612(+/- 0.090) 0.498(+/- 0.118) 0.684(+/- 0.134)
MP Gramian + Recurrence RF 0.588(+/- 0.092) 0.587(+/- 0.118) 0.501(+/- 0.143) 0.644(+/- 0.172)
Gramian + Recurrence SVM 0.626(+/- 0.129) 0.639(+/- 0.124) 0.511(+/- 0.154) 0.846(+/- 0.071)
Gramian ResNet50V2 0.572(+/- 0.092) 0.620(+/- 0.073) 0.530(+/- 0.066) 0.887(+/- 0.073)
SAX RISE 0.728(+/- 0.129) 0.770(+/- 0.111) 0.548(+/- 0.175) 0.821(+/- 0.128)
Gramian Flat SVM 0.707(+/- 0.147) 0.722(+/- 0.132) 0.566(+/- 0.170) 0.917(+/- 0.052)
Recurrence Flat SVM 0.661(+/- 0.130) 0.666(+/- 0.114) 0.619(+/- 0.149) 0.830(+/- 0.092)
Gramian + Recurrence KNN 0.673(+/- 0.144) 0.622(+/- 0.151) 0.626(+/- 0.151) 0.937(+/- 0.056)
Gramian Flat Ridge 0.711(+/- 0.078) 0.729(+/- 0.103) 0.638(+/- 0.117) 0.878(+/- 0.044)
Recurrence Flat Ridge 0.680(+/- 0.063) 0.673(+/- 0.066) 0.647(+/- 0.110) 0.832(+/- 0.083)
KNN 0.806(+/- 0.046) 0.752(+/- 0.024) 0.675(+/- 0.047) 0.984(+/- 0.015)
Recurrence Flat KNN 0.712(+/- 0.111) 0.713(+/- 0.108) 0.685(+/- 0.109) 0.981(+/- 0.040)
SAX KNN 0.734(+/- 0.009) 0.704(+/- 0.046) 0.788(+/- 0.059) 0.900(+/- 0.009)
BOSSVS 0.784(+/- 0.044) 0.782(+/- 0.046) 0.792(+/- 0.046) 0.792(+/- 0.046)
Gramian Flat KNN 0.822(+/- 0.124) 0.731(+/- 0.085) 0.822(+/- 0.100) 0.908(+/- 0.051)
SFA KNN 0.870(+/- 0.025) 0.911(+/- 0.021) 0.880(+/- 0.023) 0.960(+/- 0.010)
MP Gramian + Recurrence KNN 0.942(+/- 0.062) 0.929(+/- 0.083) 0.909(+/- 0.069) 0.881(+/- 0.112)
SFA TSF 0.939(+/- 0.074) 0.893(+/- 0.069) 0.913(+/- 0.063) 0.972(+/- 0.033)
Table 10: CFI for all methods sorted by R3 score
name F1 score R1 F1 score R2 F1 score R3 F1 score R4
ROCKET SVM 0.783(+/- 0.070) 0.776(+/- 0.093) 0.823(+/- 0.070) 0.648(+/- 0.102)
ROCKET Ridge 0.747(+/- 0.055) 0.777(+/- 0.078) 0.776(+/- 0.088) 0.528(+/- 0.108)
ROCKET RF 0.746(+/- 0.093) 0.801(+/- 0.102) 0.769(+/- 0.079) 0.632(+/- 0.123)
TSF 0.636(+/- 0.175) 0.635(+/- 0.119) 0.763(+/- 0.104) 0.149(+/- 0.162)
ROCKET KNN 0.663(+/- 0.128) 0.485(+/- 0.122) 0.751(+/- 0.106) 0.057(+/- 0.098)
MP ROCKET Ridge 0.675(+/- 0.108) 0.656(+/- 0.081) 0.751(+/- 0.105) 0.601(+/- 0.087)
SAX ROCKET RF 0.696(+/- 0.109) 0.695(+/- 0.104) 0.751(+/- 0.091) 0.364(+/- 0.124)
SAX ROCKET SVM 0.704(+/- 0.163) 0.735(+/- 0.143) 0.743(+/- 0.101) 0.427(+/- 0.175)
TS-CHIEF 0.785(+/- 0.093) 0.771(+/- 0.085) 0.737(+/- 0.107) 0.500(+/- 0.154)
MP RISE 0.658(+/- 0.108) 0.635(+/- 0.107) 0.735(+/- 0.084) 0.503(+/- 0.176)
MP ROCKET RF 0.673(+/- 0.131) 0.690(+/- 0.119) 0.733(+/- 0.111) 0.580(+/- 0.144)
MP ROCKET KNN 0.672(+/- 0.115) 0.656(+/- 0.078) 0.732(+/- 0.092) 0.426(+/- 0.127)
Recurrence Flat RF 0.635(+/- 0.120) 0.619(+/- 0.123) 0.725(+/- 0.102) 0.452(+/- 0.134)
SAX ROCKET Ridge 0.668(+/- 0.162) 0.672(+/- 0.131) 0.724(+/- 0.113) 0.493(+/- 0.089)
MP ROCKET SVM 0.667(+/- 0.123) 0.637(+/- 0.128) 0.722(+/- 0.107) 0.627(+/- 0.111)
Gramian + Recurrence RF 0.614(+/- 0.097) 0.607(+/- 0.132) 0.717(+/- 0.130) 0.277(+/- 0.185)
SAX ROCKET KNN 0.586(+/- 0.169) 0.574(+/- 0.161) 0.705(+/- 0.131) 0.104(+/- 0.093)
Gramian Flat RF 0.505(+/- 0.094) 0.440(+/- 0.153) 0.695(+/- 0.120) 0.058(+/- 0.076)
MP KNN 0.665(+/- 0.046) 0.636(+/- 0.070) 0.694(+/- 0.070) 0.448(+/- 0.021)
MP TSF 0.655(+/- 0.121) 0.667(+/- 0.137) 0.692(+/- 0.112) 0.511(+/- 0.154)
SAX TSF 0.610(+/- 0.167) 0.593(+/- 0.135) 0.683(+/- 0.122) 0.205(+/- 0.147)
Recurrence ResNet50V2 0.659(+/- 0.082) 0.613(+/- 0.047) 0.679(+/- 0.090) 0.360(+/- 0.152)
MP BOSSVS 0.641(+/- 0.131) 0.583(+/- 0.121) 0.676(+/- 0.116) 0.437(+/- 0.132)
RISE 0.492(+/- 0.151) 0.499(+/- 0.200) 0.675(+/- 0.145) 0.361(+/- 0.146)
InceptionTime 0.504(+/- 0.235) 0.637(+/- 0.140) 0.661(+/- 0.152) 0.347(+/- 0.113)
MP Gramian + Recurrence SVM 0.570(+/- 0.111) 0.552(+/- 0.098) 0.659(+/- 0.102) 0.464(+/- 0.145)
MP Gramian + Recurrence RF 0.576(+/- 0.099) 0.573(+/- 0.128) 0.651(+/- 0.138) 0.503(+/- 0.163)
Gramian + Recurrence SVM 0.530(+/- 0.141) 0.517(+/- 0.135) 0.642(+/- 0.139) 0.260(+/- 0.105)
Gramian ResNet50V2 0.593(+/- 0.092) 0.545(+/- 0.075) 0.636(+/- 0.060) 0.194(+/- 0.111)
SAX RISE 0.409(+/- 0.171) 0.359(+/- 0.154) 0.602(+/- 0.164) 0.284(+/- 0.170)
Gramian Flat SVM 0.435(+/- 0.157) 0.419(+/- 0.148) 0.585(+/- 0.163) 0.147(+/- 0.087)
Recurrence Flat SVM 0.491(+/- 0.149) 0.489(+/- 0.129) 0.532(+/- 0.172) 0.279(+/- 0.126)
Gramian + Recurrence KNN 0.475(+/- 0.153) 0.530(+/- 0.158) 0.528(+/- 0.145) 0.112(+/- 0.100)
Gramian Flat Ridge 0.442(+/- 0.091) 0.415(+/- 0.132) 0.520(+/- 0.122) 0.213(+/- 0.067)
Recurrence Flat Ridge 0.480(+/- 0.073) 0.487(+/- 0.076) 0.511(+/- 0.118) 0.277(+/- 0.124)
KNN 0.321(+/- 0.065) 0.396(+/- 0.031) 0.488(+/- 0.054) 0.030(+/- 0.030)
Recurrence Flat KNN 0.435(+/- 0.128) 0.434(+/- 0.139) 0.466(+/- 0.139) 0.033(+/- 0.071)
SAX KNN 0.419(+/- 0.011) 0.453(+/- 0.055) 0.344(+/- 0.081) 0.181(+/- 0.015)
BOSSVS 0.351(+/- 0.061) 0.354(+/- 0.063) 0.340(+/- 0.063) 0.340(+/- 0.064)
Gramian Flat KNN 0.281(+/- 0.181) 0.415(+/- 0.110) 0.289(+/- 0.139) 0.163(+/- 0.088)
SFA KNN 0.228(+/- 0.039) 0.161(+/- 0.036) 0.212(+/- 0.037) 0.075(+/- 0.019)
MP Gramian + Recurrence KNN 0.101(+/- 0.109) 0.121(+/- 0.139) 0.158(+/- 0.118) 0.193(+/- 0.178)
SFA TSF 0.105(+/- 0.124) 0.184(+/- 0.115) 0.152(+/- 0.104) 0.050(+/- 0.063)
Table 11: F1 Score for all methods sorted by R3 score