Autoencoders (AE) have become the de-facto standard for anomaly detection within deep learning. In this pair of neural networks, one trains an encoder to map the features of the input data (i.e. time series) into the latent space. The decoder reconstructs the input from this representation as well as possible, under the constraint that the latent space is much smaller than the input domain. After training on data without anomalies (i.e. normal data), anomalies can be predicted by defining a threshold on the anomaly score (e.g. reconstruction loss). In recent years, variational AEs (VAE) (Kingma and Welling, 2013) have gained popularity, since they encode the data distribution in the latent space, rather than the raw features. This allows training on all variations of data and thus relieves the burden of filtering data beforehand. Furthermore, one can treat anomaly detection as a probability rather than a raw score (An and Cho, 2015).
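The thresholding step described above can be sketched as follows. This is a minimal illustration and not the procedure of any cited work: the function names, the use of the mean squared error as anomaly score, and the choice of an empirical quantile of the scores on normal data as threshold are assumptions made for this example.

```python
from typing import Sequence


def reconstruction_error(x: Sequence[float], x_hat: Sequence[float]) -> float:
    """Mean squared error between a series and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def fit_threshold(normal_scores: Sequence[float], quantile: float = 0.95) -> float:
    """Choose the threshold as an empirical quantile of scores on normal data."""
    ordered = sorted(normal_scores)
    index = min(int(quantile * len(ordered)), len(ordered) - 1)
    return ordered[index]


def is_anomaly(score: float, threshold: float) -> bool:
    """Flag a sample whose anomaly score exceeds the threshold."""
    return score > threshold
```

In practice the reconstruction `x_hat` would come from the trained decoder; here it is simply passed in as a sequence.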
Since time series are ubiquitous and present in a myriad of types for classification, we are interested in models beyond this binary classification task. With the development of semi-supervised generative models (Kingma et al., 2014), we are able to classify time series data while only having to label a smaller amount of data. However, we still need to know all manifestations of classes beforehand. On the other hand, we could cluster the data, needing no label information at all (Aghabozorgi et al., 2015). This, however, often comes with the drawback of lower classification accuracy and the need to manually annotate the found clusters.
To combine the benefits of the high classification accuracy in semi-supervised models with the ability to detect new classes, the hybrid approach of semi-unsupervised learning (Davidson et al., 2021; Willetts et al., 2020) has emerged. In this paper we present SuSL4TS, a convolutional Gaussian mixture model for semi-unsupervised learning on time series data. Figure 1 visualizes the basic principle of our approach.
Our contributions are twofold: (1) We present a model capable of semi-unsupervised time series classification from raw time series, partially on par with state of the art models, while only needing a limited amount of labels, (2) We show the efficacy of our approach on several benchmark datasets, and perform extensive experiments in this new domain for time series.
The remainder of this paper is structured as follows: after presenting related work in the field, we present the used datasets in Section 3. Section 4 illustrates the foundations of the used model, while Section 5 outlines our experiments. We conclude with a discussion (Section 6) of the experiments and outline future work in Section 7.
2. Related Work
Related work in time series classification is manifold, since most datasets and solutions are custom-made. A larger review and benchmark of different algorithms and datasets can be found in (Bagnall et al., 2017; Ruiz et al., 2021). The authors present algorithms suited for time series classification tasks in a univariate (Bagnall et al., 2017) and a multivariate (Ruiz et al., 2021) setting. Univariate refers to datasets in which only a single sensor is used for the classification, whereas multivariate considers multiple sensor readings at the same time. The best performing algorithms are often BOSS (Schäfer, 2015), COTE (Lines et al., 2018) and TS-Chief (Shifaz et al., 2020) in the univariate problems, whereas ROCKET (Dempster et al., 2020), WEASEL (Schäfer and Leser, 2017) and INCEPT-NET (Ismail Fawaz et al., 2020) show great performance in the multivariate problems. We refer the reader to the mentioned papers for a detailed explanation of the algorithms and specific datasets. A systematic analysis of related work can be found in (Lucas et al., 2019), in which the authors list time series classifiers based on their technique: distance-based, feature-based, ensemble and tree approaches. In general, most algorithms aim for fully supervised classification, whereas we aim to reduce the time spent on labeling datasets.
Since the field of semi-unsupervised learning is relatively new, there is limited related work in this domain. In (Willetts et al., 2018), the authors present a network capable of semi-unsupervised time series classification on human activity recognition. They compare their approach to the semi-supervised M2-model (Kingma et al., 2014), and show great performance even with classes hidden. Their approach extracts features from the time series and feeds them into a fully connected network. We are, however, interested in the use of the raw time series signals, and therefore focus on the convolutional approach presented in (Davidson et al., 2021).
3. Datasets
The following datasets were used for the experiments. Since we are especially interested in the learning paradigms of semi-supervised and semi-unsupervised learning, we hand-select only some datasets for our purposes. The main strengths of our approach are most prominent if we have a large dataset that would require huge efforts to label, and only a limited amount of known classes in comparison to the whole dataset. Therefore we use three datasets that meet our requirements to perform our experiments, both in the univariate and multivariate setting, stemming from different domains of data acquisition. The datasets are introduced in the following. A short tabular view is available in Table 1.
|Dataset||Features||Classes||Samples from||Sampling rate||Input size||Best accuracy|
|HAR||accelerometer & gyroscope data||6||subjects||(Multiclass SVM (Anguita et al., 2013))|
|ECG||electrocardiogram recording||5||subjects||(CNN (Kachuee et al., 2018))|
|El. Devices||electrical consumption||7||households||every||(BOSS (Shifaz et al., 2020))|
3.1. Human Activity Recognition
The Human Activity Recognition (HAR) dataset (Anguita et al., 2013) consists of data collected from accelerometer and gyroscope sensors in smartphones. The subjects (), aged 19 to 48, were tasked with performing various Activities of Daily Living (ADL) while carrying a smartphone on their waist. The subjects were instructed to perform six distinct ADL adhering to a defined protocol outlining the order of activities. The selected activities were standing, sitting, laying down, walking, walking downstairs and walking upstairs. Each activity was performed for , except walking up- and downstairs which only lasted . Each activity was performed twice throughout the routine and pauses separated activities. Linear acceleration and angular velocity in three axes were recorded at a sampling rate of . The data was then pre-processed for noise reduction. Additionally, gravitational and body motion were separated using a low-pass filter. A total of 9 signals were sampled with a window of with overlap (i.e. input is of size ). A feature vector was obtained from each sampling window. A total of 561 features were extracted with measures common in HAR literature, such as mean, correlation, signal magnitude area and autoregression coefficients, as well as energy of different frequency bands, frequency skewness and the angle between vectors like mean body acceleration and the y vector. The data was randomly divided into a 70/30 training/test split containing and samples respectively as shown in Table 3(a). The authors also used a multiclass SVM to achieve a classification accuracy on the dataset (Anguita et al., 2013).
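The windowing with overlap described above can be sketched for a single channel as follows. The function name and the parameter values in the usage note are hypothetical and are not the settings of the HAR dataset; a multivariate recording would apply the same slicing to each of its channels.

```python
from typing import List, Sequence


def sliding_windows(signal: Sequence[float], window: int, overlap: float) -> List[List[float]]:
    """Split a 1-D signal into fixed-size windows with fractional overlap in [0, 1)."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    # Number of samples the window advances between consecutive segments.
    step = max(1, int(round(window * (1.0 - overlap))))
    return [list(signal[start:start + window])
            for start in range(0, len(signal) - window + 1, step)]
```

For example, `sliding_windows(list(range(10)), window=4, overlap=0.5)` advances by two samples per window; trailing samples that do not fill a complete window are dropped.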
3.2. ECG Heartbeat Classification
The second dataset used was the MIT-BIH Arrhythmia Dataset (Kachuee et al., 2018). It consists of electrocardiogram (ECG) recordings from subjects recorded at a sampling rate. The recordings are grouped in five categories based on annotations by cardiologists. As can be seen in Table 3(b), the class frequency is skewed towards the N class, which is to be expected since it includes normal heart behavior. Further explanation of the individual classes can be found in (Kachuee et al., 2018). Furthermore, each entry in the set consists of a single heartbeat padded with zeroes to ensure consistent length (i.e. input is of size ). The authors of (Kachuee et al., 2018) propose a CNN architecture to classify the dataset in a fully supervised fashion, achieving an accuracy of .
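Zero-padding a single heartbeat to a fixed length, as described above, can be illustrated with a short sketch. The helper name and the target length used in the usage note are hypothetical and do not reflect the actual input size of the dataset.

```python
from typing import List, Sequence


def pad_with_zeros(beat: Sequence[float], length: int) -> List[float]:
    """Right-pad a heartbeat with zeros so that every sample has the same length."""
    if len(beat) > length:
        raise ValueError("beat is longer than the target length")
    return list(beat) + [0.0] * (length - len(beat))
```

For example, `pad_with_zeros([0.2, 0.5], 5)` keeps the two recorded values and appends three zeros.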
3.3. Electric Devices
The Electric Devices dataset contains measurements from households observing their electrical consumption. Samples are taken every from households. After pre-processing and resampling to averages, the samples have a length of values (i.e. input is of size ). We use the version from (Bagnall et al., 2012) (downloaded from https://timeseriesclassification.com), regrouping the originally ten classes to seven: kettle; immersion heater; washing machine; cold group; oven/cooker; screen group and dishwasher. The authors of (Shifaz et al., 2020) report the best accuracy by using BOSS at .
4. Gaussian Mixture Models
A variational autoencoder in general consists of an encoder , mapping the input data of dimensions into the latent space of dimensions (Kingma and Welling, 2013). The decoder, on the other hand, inverts this mapping, recreating the input data from the (compressed) latent encoding. This compressed space is often used for other downstream tasks in a second training step, for example classification or other information extraction tasks. The first step in this process is trained unsupervised, as it requires no annotated data, whereas the latter task is trained fully supervised.
This two-step process can be merged into one, by adapting the joint probability distribution, resulting in a Gaussian Mixture Deep Generative model (GMM) capable of learning semi-supervised classification (Kingma et al., 2014). With some further modification we can use the inductive bias requirement to perform semi-unsupervised classification tasks with GMMs (Davidson et al., 2021; Willetts et al., 2020, 2018).
GMM for Semi-unsupervised Classification
We take the convolutional model presented in (Davidson et al., 2021) and adapt it to perform time series classification on raw sensor signals. That is, we are interested in the improved pattern recognition and performance shown in (Davidson et al., 2021). Therefore we adapt their work and replace the 2d convolutional networks for image classification with 1d convolutions for raw time series. Additionally, we use the work shown in (Willetts et al., 2020) in two ways. Firstly, we use it as a reference in performance for the presented convolutional model. Secondly, we adapt their idea of the Gaussian regularization with the standard term provided by the Adam optimizer (Kingma and Ba, 2014). Our overall loss function can now be described as (Davidson et al., 2021)
refers to the labeled subset of the data, thus containing samples and their corresponding class (one-hot encoded). On the other hand, contains all unlabeled data, that is, all data that is to be mapped to the known classes (semi-supervised classification) and data stemming from possibly new classes (unsupervised clustering).
holds the trainable weights at epoch.
are hyperparameters weighting the entropy regularization, whereas the loss terms measure the evidence lower bound (ELBO) from the GMM model. All other loss terms, the network architecture and further details can be seen in (Davidson et al., 2021). This approach allows us to analyze our experiments in four learning regimes: unsupervised, semi-unsupervised, semi-supervised and fully supervised.
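To make the notion of an entropy term concrete, the following sketch computes the Shannon entropy of a predicted class distribution. It assumes that the regularization acts on categorical class probabilities; the exact form and weighting used in the loss are given in (Davidson et al., 2021) and are not reproduced here. The function name is hypothetical.

```python
import math
from typing import Sequence


def categorical_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a categorical distribution.

    Zero-probability entries contribute nothing, following the
    convention 0 * log(0) = 0.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

A uniform prediction over four classes yields the maximum value log(4), while a one-hot prediction yields zero, so the term measures how undecided the classifier is for a given sample.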
5. Experiments
Since SuSL4TS is capable of handling all learning paradigms, we conducted experiments in every setting. That is, we performed a parameter search for trials in the settings unsupervised learning (UL, labeled), semi-supervised (SSL, and labeled), semi-unsupervised (SuSL, and labeled) and supervised (SL, labeled). In the semi-unsupervised setting, we hid different classes. For the HAR dataset we tested three settings: (1) hiding all walking classes, (2) hiding all stationary classes, and (3) hiding one movement (walking) and one stationary (laying) class, while using the remainder semi-supervised. For the ECG dataset we used two hiding schemes: (1) omitting all normal heart beats, and (2) omitting classes Q and V. Within the electric devices dataset, we used two settings: (1) hiding classes 1–3, and (2) hiding classes 4–7. This split was chosen arbitrarily, since there is no inherent split in the data (classes are only labeled 1–7, while the mapping to the named version is missing). When hiding classes with different sizes, we used and of each class. That is, we did not use subsampling. Finally, each time series was scaled via standard scaling/z-normalization.
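The two preprocessing ideas in this setup, hiding classes and z-normalization, can be sketched as follows. This is an illustration under assumptions: the helper names, the use of `None` as marker for an unlabeled sample, the per-sample random draw and the per-series normalization are choices made for this example and are not taken from the published implementation.

```python
import random
from statistics import mean, pstdev
from typing import Hashable, List, Optional, Sequence, Set


def z_normalize(series: Sequence[float]) -> List[float]:
    """Scale one series to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]  # constant series carry no shape information
    return [(value - mu) / sigma for value in series]


def mask_labels(labels: Sequence[Hashable], hidden_classes: Set[Hashable],
                labeled_fraction: float, seed: int = 0) -> List[Optional[Hashable]]:
    """Remove the labels of hidden classes and keep only a share of the rest."""
    rng = random.Random(seed)
    masked: List[Optional[Hashable]] = []
    for label in labels:
        if label in hidden_classes or rng.random() >= labeled_fraction:
            masked.append(None)   # sample is treated as unlabeled
        else:
            masked.append(label)  # sample keeps its label
    return masked
```

With an empty `hidden_classes` set this reduces to the semi-supervised setting, and with `labeled_fraction=0.0` to the fully unsupervised one.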
For all settings and datasets we tested different types of networks. The first one is the fully convolutional approach, using a convolutional feature extractor and decoder (SuSL4TS). In the second setting, we tested a fully connected model, using MLPs in the encoder as well as the decoder (MLP SuSL). We also experimented with a mixture (convolutional encoder and linear decoder) but found its performance to consistently lie between the other two settings.
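The basic operation behind a convolutional feature extractor on raw time series can be illustrated in plain Python. The sketch below computes one output channel of a 1d convolution (implemented as cross-correlation without padding, as is common in deep learning frameworks); the function name is hypothetical, and the actual networks use learned kernels, several output channels and nonlinearities.

```python
from typing import List, Sequence


def conv1d_valid(channels: Sequence[Sequence[float]],
                 kernel: Sequence[Sequence[float]],
                 stride: int = 1) -> List[float]:
    """Cross-correlate a multichannel series with one kernel (no padding).

    channels: one sequence per sensor channel, all of equal length.
    kernel:   one weight sequence per channel, all of equal length k.
    The kernel slides along the time axis only and sums over all channels.
    """
    if len(channels) != len(kernel):
        raise ValueError("kernel needs one weight row per input channel")
    length, k = len(channels[0]), len(kernel[0])
    output = []
    for start in range(0, length - k + 1, stride):
        total = 0.0
        for series, weights in zip(channels, kernel):
            for offset in range(k):
                total += series[start + offset] * weights[offset]
        output.append(total)
    return output
```

In contrast to a 2d convolution on images, the kernel here moves in a single direction (time), while the sensor channels play the role of the image's color channels.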
All networks were implemented using the PyTorch (Paszke et al., 2019) framework. We chose the Adam optimizer (Kingma and Ba, 2014) for training and performed a Bayesian hyperparameter search using Optuna (Akiba et al., 2019) for each learning paradigm and dataset. The search space for each parameter can be seen in Table 2.
|kernel size||[3, 5, 7]|
The batch size was fixed to , meaning that each batch contained labeled examples and the same amount of unlabeled examples. In case of a size mismatch of labeled and unlabeled data, we re-sampled the smaller subset to fill the batches. We used a cosine annealing learning rate scheduler and trained for epochs. Predictions on the test set were done using weights of the last epoch resulting in the best accuracy on the validation set (= of the training set). We increase every epoch with a step size of , with a maximum of 1.
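The training schedule described above can be sketched in plain Python. The cosine formula is the standard closed form of cosine annealing without restarts; the linear ramp and the resampling helper are one possible reading of the text, and all function names, the minimum learning rate of zero and the concrete step size in the usage note are assumptions for illustration.

```python
import math
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def cosine_annealing(base_lr: float, epoch: int, total_epochs: int,
                     min_lr: float = 0.0) -> float:
    """Learning rate that decays from base_lr to min_lr along half a cosine wave."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * epoch / total_epochs))


def linear_ramp(epoch: int, step_size: float, maximum: float = 1.0) -> float:
    """Weight that grows by step_size per epoch and is capped at maximum."""
    return min(maximum, epoch * step_size)


def resample_to(items: Sequence[T], target_size: int, seed: int = 0) -> List[T]:
    """Fill a smaller subset up to target_size by sampling with replacement."""
    if len(items) >= target_size:
        return list(items)
    rng = random.Random(seed)
    return list(items) + rng.choices(items, k=target_size - len(items))
```

For instance, `linear_ramp(epoch, step_size=0.25)` reaches its cap of 1 after four epochs and stays there for the rest of training.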
6. Results & Discussion
The results of our experiments described in Section 5 can be seen in Table 3. Some tables are available in the online appendix (https://github.com/LSX-UniWue/SuSL4TS), as they only quantify the depicted observations.
|Baseline (SVM,(Anguita et al., 2013))|
|Baseline (MLP SuSL,(Willetts et al., 2018, 2020))|
|SuSL4TS (movement (h))|
|SuSL4TS (stationary (h))|
|SuSL4TS (walking,laying (h))|
|Baseline (CNN,(Kachuee et al., 2018))|
|Baseline (MLP SuSL,(Willetts et al., 2018, 2020))|
|SuSL4TS (q,v (h))|
|SuSL4TS (n (h))|
|El. Devices||Majority Vote|
|Baseline (BOSS,(Shifaz et al., 2020))|
|Baseline (MLP SuSL,(Willetts et al., 2018, 2020))|
|SuSL4TS (1–3 (h))|
|SuSL4TS (4–7 (h))|
Human Activity Recognition
When taking a closer look at the HAR dataset, we see that our approach is not able to perform on par with the fully supervised baseline (SVM, 92 vs 96). However, the SVM performs on extracted features of the signal, while we directly use the raw signal. The main difference in performance can be attributed to the classes sitting and standing, as can be seen in Table 3(k). They are easily confused with each other.
But even with fewer labels (SSL, and ), we achieve almost the same classification performance as with all labels (91 vs 92). Surprisingly the accuracy is higher with fewer labeled samples (92 vs 93), which can be attributed to the better recall with the class standing.
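Statements about confused classes and per-class recall are read off a confusion matrix. The following sketch shows how such counts and the recall of a single class can be computed; the helper names and the toy labels in the usage note are hypothetical and are not results from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def confusion_counts(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> Counter:
    """Count how often each (true class, predicted class) pair occurs."""
    return Counter(zip(y_true, y_pred))


def recall(y_true: Sequence[Hashable], y_pred: Sequence[Hashable], cls: Hashable) -> float:
    """Share of the samples of class cls that were predicted as cls."""
    counts = confusion_counts(y_true, y_pred)
    support = sum(n for (true_cls, _), n in counts.items() if true_cls == cls)
    return counts[(cls, cls)] / support if support else 0.0
```

For example, if two of three samples of a class are predicted correctly, `recall` returns 2/3 for that class, and the off-diagonal entries of `confusion_counts` show which class absorbed the remaining sample.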
In the unlabeled setting (i.e. time series clustering), the performance drops significantly (64 vs 92). Furthermore, there is no difference between the fully connected approach and the convolutional one.
In the first semi-unsupervised setting (walking classes hidden), we can see worse performance in both ( and ) settings than in the unsupervised clustering. This drop in accuracy is due to the fact that all walking related classes are classified as walking, and even standing is classified as walking (see Table 3(g)). In contrast to the unsupervised model, within the semi-unsupervised task both parts of the encoder, the labeled and the unlabeled leg, have to be trained, implying that more weights need to be tuned with only few labeled samples left. This setting is thus more complicated than the unsupervised task, and the models tend to group unknown classes into one unknown super-class.
On the other hand, when hiding all stationary classes (i.e. standing, laying and sitting), performance increases again (51 vs 88). Misclassifications are mostly confusions between sitting and standing (see Table 3(h)).
In the last semi-unsupervised setting (hiding walking and laying), performance is a little lower than hiding all static classes (77 vs 88). The drop in performance is due to the fact that the class standing is completely missed and most samples are classified as walking (see Table 3(i)).
All discussed observations for the HAR dataset are also visible in the visualization of the latent space shown in Figure 2.
We used UMAP (McInnes et al., 2018) with min_dist=0.99, n_neighbors=10 and metric=cosine to plot a two-dimensional manifold of the learned embedding.
ECG Heartbeat Classification
The classification results for the univariate ECG dataset are displayed in Table 3. When comparing with the fully supervised CNN architecture presented in (Kachuee et al., 2018), both semi-unsupervised approaches outperform the baseline (98 vs 96).
In the semi-supervised setting ( and , all classes known), there is almost no difference in terms of accuracy compared to the supervised settings. Again, accuracies are slightly higher when using fewer labels. Compared to the fully connected SuSL model, the convolutional feature extractor fares slightly better (97 vs 98).
In the unsupervised setting we can observe a performance drop, although not as large as for the HAR dataset (83 vs 63). Due to the highly skewed nature of the dataset, the unsupervised classification result is not much better than the majority vote (), suggesting that only the normal class was detected (see Table 3(t)).
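The majority vote baseline referred to above always predicts the most frequent training class. A minimal sketch follows; the function name and the toy labels in the usage note are hypothetical and do not reproduce the class distribution of the ECG dataset.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote_accuracy(train_labels: Sequence[Hashable],
                           test_labels: Sequence[Hashable]) -> float:
    """Accuracy of always predicting the most frequent training class."""
    majority_class = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for label in test_labels if label == majority_class)
    return hits / len(test_labels)
```

On a skewed dataset this baseline is already high, which is why an unsupervised accuracy close to it indicates that mostly the dominant class is being predicted.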
In the first semi-unsupervised setting (classes Q and V hidden), we can see a slight increase in performance in comparison to the unsupervised setting (84 vs 85). Both hidden classes are missed in the test set classification, and the difference in accuracy stems from better recall of classes F and S (see Tables 3(v) and 3(u)).
On the other hand, when hiding the normal class, performance increases above the majority vote (88 vs 83). The amount of available labels ( and ) does not greatly affect the predictions (88 vs 90). However, with available labels, the class F is not missed, whereas class V is missed in both settings (see Tables 3(x) and 3(w)).
Electric Devices
The classification results for the univariate electric devices dataset are displayed in Table 3. When comparing with the best performing model (BOSS) in the supervised settings, both neural network approaches perform worse (80 vs 70), where most misclassifications occur within the classes 3–5 (see Table 3(s)).
In the semi-supervised setting ( and , all classes known), we observe the same behavior as in the ECG dataset: classification accuracy remains similar to the fully supervised setting (68 vs 70). Since class 7 is completely missed () although it is detected in the validation set, the model likely overfitted (see Tables 3(r) and 3(q)).
In the unsupervised classification task, we can observe that two classes are missed in the test set (1, 7), but even predictions within the other classes are not very clear (52 vs 70). The missed classes only represent smaller portions of this dataset, suggesting the parameter search found models yielding higher accuracy when predicting mostly the larger classes (see Table 3(l)).
In the first semi-unsupervised setting (classes 1–3 hidden), we observe a larger dip in accuracy compared to the semi-supervised setting (55/59 vs 70). Within this dataset we can see an increase in performance when comparing the setting with available labels, in contrast to the configuration (55 vs 59, see Tables 3(n) and 3(m)). In both settings, one class is completely missed (1 or 7), but the main difference in performance is the higher precision at all other classes with labels available.
On the other hand, when hiding classes 4–7, we can observe a similar decline in performance (50/60 vs 70). With labels available, the model assigns mostly all missed samples to the same hidden class (5), while missing classes 1, 4 and 6–7 completely (see Table 3(o)). When presenting the model more labels, accuracy once again increases (similar to classes 1–3 hidden), and only class 1 is completely missed (see Table 3(p)).
Throughout all datasets and settings, we have made some observations applicable in general, which we will discuss now.
(1) In the multivariate dataset (i.e. HAR), the convolutional approach of SuSL4TS outperforms the fully connected version for semi-unsupervised learning. That is, it performs better in any setting we tested.
(2) In the univariate datasets, we can see a mixed picture. For the electric devices dataset, SuSL4TS performs better in any labeled setting by a large margin. Given the ECG dataset, both versions perform equally well, with no clear tendency.
(3) If specific classes are not known, we can see drastically different results in the classification. This is most prominent in the HAR dataset, where hiding the walking classes collapses predictions so that they perform worse than in the unsupervised setting.
(4) In the semi-unsupervised setting, we generally see an increased performance compared to the unsupervised setting with no labels at all only when the larger amount of labels is available (except HAR with all movements hidden).
(5) Generally, the semi-supervised performance (i.e. all classes known, only a limited amount of labels) is as good as that of the fully supervised setting.
In this paper we presented SuSL4TS, a convolutional Gaussian mixture model for performing semi-unsupervised time series classification. We showed the efficacy of our approach by comparing it with optimized methods on several benchmark datasets, while requiring no manual feature extraction. Especially in the semi-supervised settings, the model performs nearly as well as its fully supervised counterpart. When omitting specific classes (i.e. classes unknown a priori), accuracy can deviate strongly for certain combinations of labeled versus unlabeled data, in some cases showing lower performance than using no labels at all.
In future work, we will analyze the applicability of our approach on real-world, large-scale data. For example, we could test the highly skewed sensor data obtained from beehives (Zacepins et al., 2016; Kviesis and Zacepins, 2016; Davidson et al., 2020), or other highly skewed anomaly detection datasets (Ren et al., 2019), alleviating the burden of having to manually discern the different types of anomalies. On the other hand, we could use the normal class completely unlabeled and only annotate a few anomaly classes. This setting is similar to the presented ECG analysis. In a more complex setting of time series classification, we could try to classify audio files, either with extracted mel-spectrograms or the raw series (Kahl et al., 2020; Piczak, 2015; Davidson et al., 2020).
As mentioned, SuSL4TS is a generative model, thus we can draw random samples resembling the learned classes from the latent space. That enables us to generate samples for a given class to be used for other tasks or augment the labeled set in the whole dataset.
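Drawing a latent sample for a given class can be sketched as follows, assuming that each class is represented by a diagonal Gaussian component in the latent space. The function name and the container layout are hypothetical; in a trained model the component parameters would be learned, and the sampled vector would be passed through the decoder to obtain a time series, which is not modelled here.

```python
import random
from typing import Dict, Hashable, List, Optional, Sequence


def sample_latent(class_means: Dict[Hashable, Sequence[float]],
                  class_stds: Dict[Hashable, Sequence[float]],
                  label: Hashable, seed: Optional[int] = None) -> List[float]:
    """Draw one latent vector from the diagonal Gaussian component of a class."""
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma)
            for mu, sigma in zip(class_means[label], class_stds[label])]
```

Repeating the draw with different seeds yields varied latent vectors for the same class, which is what would allow augmenting the labeled set with generated samples.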
- Time-series clustering–a decade review. Information Systems 53, pp. 16–38. Cited by: §1.
- Optuna: a next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Cited by: §5, Table 2.
- Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE 2 (1), pp. 1–18. Cited by: §1.
- A public domain dataset for human activity recognition using smartphones. In Proceedings of the 21st international European symposium on artificial neural networks, computational intelligence and machine learning, pp. 437–442. Cited by: §3.1, Table 1, Table 3.
- Transformation based ensembles for time series classification. In Proceedings of the 2012 SIAM international conference on data mining, pp. 307–318. Cited by: §3.3.
- The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data mining and knowledge discovery 31 (3), pp. 606–660. Cited by: §2.
- Semi-unsupervised learning: an in-depth parameter analysis. In German Conference on Artificial Intelligence (Künstliche Intelligenz), pp. 51–66. Cited by: §1, §2, §4, §4.
- Anomaly detection in beehives using deep recurrent autoencoders. arXiv preprint arXiv:2003.04576. Cited by: §7.
- ROCKET: exceptionally fast and accurate time series classification using random convolutional kernels. Data Mining and Knowledge Discovery 34 (5), pp. 1454–1495. Cited by: §2.
- Inceptiontime: finding alexnet for time series classification. Data Mining and Knowledge Discovery 34 (6), pp. 1936–1962. Cited by: §2.
- ECG heartbeat classification: a deep transferable representation. In 2018 IEEE International Conference on Healthcare Informatics (ICHI), Cited by: §3.2, Table 1, §6, Table 3.
- Overview of birdclef 2020: bird sound recognition in complex acoustic environments. In CLEF 2020-11th International Conference of the Cross-Language Evaluation Forum for European Languages, Cited by: §7.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4, §5.
- Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §4.
- Semi-supervised learning with deep generative models. Advances in neural information processing systems 27. Cited by: §1, §2, §4.
- Application of neural networks for honey bee colony state identification. In 2016 17th International Carpathian Control Conference (ICCC), pp. 413–417. Cited by: §7.
- Time series classification with hive-cote: the hierarchical vote collective of transformation-based ensembles. ACM Transactions on Knowledge Discovery from Data 12 (5). Cited by: §2.
- Proximity forest: an effective and scalable distance-based classifier for time series. Data Mining and Knowledge Discovery 33 (3), pp. 607–635. Cited by: §2.
- Umap: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. Cited by: §6.
- Pytorch: an imperative style, high-performance deep learning library. Advances in neural information processing systems 32. Cited by: §5.
- ESC: dataset for environmental sound classification. In Proceedings of the 23rd ACM international conference on Multimedia, pp. 1015–1018. Cited by: §7.
- Time-series anomaly detection service at microsoft. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 3009–3017. Cited by: §7.
- The great multivariate time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery 35 (2), pp. 401–449. Cited by: §2.
- Multivariate time series classification with weasel+ muse. arXiv preprint arXiv:1711.11343. Cited by: §2.
- The boss is concerned with time series classification in the presence of noise. Data Mining and Knowledge Discovery 29 (6), pp. 1505–1530. Cited by: §2.
- TS-chief: a scalable and accurate forest algorithm for time series classification. Data Mining and Knowledge Discovery 34 (3), pp. 742–775. Cited by: §2, §3.3, Table 1, Table 3.
- Semi-unsupervised learning of human activity using deep generative models. arXiv preprint arXiv:1810.12176. Cited by: §2, §4, Table 3, 3(d), 3(e).
- Semi-unsupervised learning: clustering and classifying using ultra-sparse labels. In 2020 IEEE International Conference on Big Data (Big Data), pp. 5286–5295. Cited by: §1, §4, §4, Table 3, 3(d), 3(e).
- Remote detection of the swarming of honey bee colonies by single-point temperature monitoring. Biosystems engineering 148, pp. 76–80. Cited by: §7.