
Semi-unsupervised Learning for Time Series Classification

by   Padraig Davidson, et al.
University of Würzburg

Time series are ubiquitous, yet inherently hard to analyze and ultimately to label or cluster. With the rise of the Internet of Things (IoT) and its smart devices, data is collected in large amounts every second. The collected data is rich in information, as one can detect accidents (e.g. cars) in real time, or assess injury/sickness over a given time span (e.g. health devices). Due to their chaotic nature and the massive number of data points, time series are hard to label manually. Furthermore, new classes within the data could emerge over time (contrary to e.g. handwritten digits), which would require relabeling the data. In this paper we present SuSL4TS, a deep generative Gaussian mixture model for semi-unsupervised learning, to classify time series data. With our approach we can alleviate manual labeling steps, since we can detect sparsely labeled classes (semi-supervised) and identify emerging classes hidden in the data (unsupervised). We demonstrate the efficacy of our approach with established time series classification datasets from different domains.



1. Introduction

Autoencoders (AE) have become the de-facto standard for anomaly detection within deep learning. In this pair of neural networks, one trains an encoder to map the features of the input data (i.e. time series) into the latent space. The decoder reconstructs the input from this representation as well as possible, with the constraint that the latent space is much smaller than the input domain. After training on data without anomalies (i.e. normal data), anomalies can be predicted by defining a threshold on the anomaly score (e.g. reconstruction loss). In recent years, variational AEs (VAE) (Kingma and Welling, 2013) have gained popularity, since they encode the data distribution in the latent space, rather than the raw features. This allows training on all variations of data and thus relieves the burden of filtering data beforehand. Furthermore, one can see anomaly detection as a probability rather than a raw score (An and Cho, 2015).
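The thresholding step described above can be illustrated with a minimal sketch. It assumes that per-sample reconstruction errors from an already trained autoencoder are available; the rule "mean plus k standard deviations" and the names `fit_threshold` and `flag_anomalies` are illustrative assumptions, not taken from the cited works.

```python
import statistics


def fit_threshold(normal_errors, k=3.0):
    """Threshold = mean + k * std of reconstruction errors on normal data.

    The choice of k is an assumption made for this illustration.
    """
    mean = statistics.fmean(normal_errors)
    std = statistics.pstdev(normal_errors)
    return mean + k * std


def flag_anomalies(errors, threshold):
    """Mark every sample whose reconstruction error exceeds the threshold."""
    return [error > threshold for error in errors]


# Made-up reconstruction errors of an (assumed) trained autoencoder.
normal_errors = [0.10, 0.12, 0.09, 0.11, 0.10]
threshold = fit_threshold(normal_errors)
flags = flag_anomalies([0.10, 0.50], threshold)
```

With these toy values only the second sample is flagged, since its error lies far above the errors seen on normal data.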

Since time series are ubiquitous and present in a myriad of types for classification, we are interested in models beyond this binary classification task. With the development of semi-supervised generative models (Kingma et al., 2014), we are able to classify time series data while only having to label a small portion of the data. However, we still need to know all manifestations of classes beforehand. On the other hand, we could cluster the data, needing no label information at all (Aghabozorgi et al., 2015). This, however, often comes with the drawback of lower classification accuracy and the need to manually annotate the found clusters.

To combine the benefits of the high classification accuracy in semi-supervised models with the ability to detect new classes, the hybrid approach of semi-unsupervised learning (Davidson et al., 2021; Willetts et al., 2020) has emerged. In this paper we present SuSL4TS, a convolutional Gaussian mixture model for semi-unsupervised learning on time series data. Figure 1 visualizes the basic principle of our approach.

Figure 1. Semi-unsupervised learning for time series data (SuSL4TS). The model is tasked to classify the data on the left-hand side (a single multivariate time series), with only limited labels available for classes 1 and 2, and two completely unknown classes. The output on the right-hand side is the classified data with all four classes found. Within the model we can see four distinct clusters automatically found when zooming in on the latent space.

Our contributions are twofold: (1) We present a model capable of semi-unsupervised time series classification from raw time series, partially on par with state of the art models, while only needing a limited amount of labels, (2) We show the efficacy of our approach on several benchmark datasets, and perform extensive experiments in this new domain for time series.

The remainder of this paper is structured as follows: after presenting related work in the field, we present the used datasets in Section 3. Section 4 illustrates the foundations of the used model, while Section 5 outlines our experiments. We conclude with a discussion (Section 6) of the experiments and depict future work in Section 7.

2. Related Work

Related work in time series classification is manifold, since most datasets and solutions are custom. A larger review and benchmark of different algorithms and datasets can be found in (Bagnall et al., 2017; Ruiz et al., 2021). The authors present algorithms suited for time series classification tasks in a univariate (Bagnall et al., 2017) and a multivariate (Ruiz et al., 2021) setting. Univariate refers to datasets in which only a single sensor is used for the classification, whereas multivariate considers multiple sensor readings at the same time. The best performing algorithms are often BOSS (Schäfer, 2015), COTE (Lines et al., 2018) and TS-Chief (Shifaz et al., 2020) in the univariate problems, whereas ROCKET (Dempster et al., 2020), WEASEL (Schäfer and Leser, 2017) and INCEPT-NET (Ismail Fawaz et al., 2020) show great performance in the multivariate problems. We refer the reader to the mentioned papers for a detailed explanation of the algorithms and specific datasets. A systematic analysis of related work can be found in (Lucas et al., 2019), in which the authors list time series classifiers based on their technique: distance-based, feature-based, ensemble approaches and tree approaches. In general, most algorithms aim for fully supervised classification, whereas we aim to reduce the time spent labeling datasets.

Since the field of semi-unsupervised learning is relatively new, there is limited related work in this domain. In (Willetts et al., 2018), the authors present a network capable of semi-unsupervised time series classification on human activity recognition. They compare their approach to the semi-supervised M2-model (Kingma et al., 2014), and show great performance even with classes hidden. Their approach extracts features from the time series and uses them within a fully connected network. We are, however, interested in the use of the raw time series signals, and therefore focus on the convolutional approach presented in (Davidson et al., 2021).

3. Datasets

The following datasets were used for the experiments. Since we are especially interested in the learning paradigms of semi-supervised and semi-unsupervised learning, we hand-selected only some datasets for our purposes. The main strengths of our approach are most prominent if we have a large dataset that would require huge efforts to label, and only a limited number of known classes in comparison to the whole dataset. Therefore, we use three datasets to perform our experiments, both in the univariate and multivariate setting, stemming from different domains of data acquisition, which meet our requirements. The datasets are introduced in the following. A short tabular view is available in Table 1.

Dataset Features Classes Samples from Sampling rate Input size Best accuracy
HAR accelerometer & gyroscope data 6 subjects (Multiclass SVM (Anguita et al., 2013))
ECG electrocardiogram recording 5 subjects (CNN (Kachuee et al., 2018))
El. Devices electrical consumption 7 households every (BOSS (Shifaz et al., 2020))
Table 1. Datasets used for the experiments

3.1. Human Activity Recognition

The Human Activity Recognition (HAR) dataset (Anguita et al., 2013) consists of data collected from accelerometer and gyroscope sensors in smartphones. The subjects (), aged 19 to 48, were tasked with performing various Activities of Daily Living (ADL) while carrying a smartphone on their waist. The subjects were instructed to perform six distinct ADL adhering to a defined protocol outlining the order of activities. The selected activities were standing, sitting, laying down, walking, walking downstairs and walking upstairs. Each activity was performed for , except walking up- and downstairs which only lasted . Each activity was performed twice throughout the routine and pauses separated activities. Linear acceleration and angular velocity in three axes were recorded at a sampling rate of . The data was then pre-processed for noise reduction. Additionally, gravitational and body motion were separated using a low-pass filter. A total of 9 signals were sampled with a window of with overlap (i.e. input is of size ). A feature vector was obtained from each sampling window. A total of 561 features were extracted with measures common in HAR literature like mean, correlation, signal magnitude area and autoregression coefficients as well as energy of different frequency bands, frequency skewness and the angle between vectors like mean body acceleration and the y vector. The data was randomly divided into a 70/30 training/test split containing and samples respectively as shown in Table 3(a). The authors also used a multiclass SVM to achieve a classification accuracy on the dataset (Anguita et al., 2013).
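The windowing of raw signals described above can be sketched as follows. This is a minimal illustration, not the pre-processing code of the dataset authors: the function name `sliding_windows` is hypothetical, and the window length and overlap used in the example are small placeholder values rather than the settings of the HAR dataset.

```python
def sliding_windows(signal, window, overlap):
    """Split a series (list of time steps) into fixed-size, overlapping windows.

    `overlap` is the fraction of a window shared with its successor.
    Trailing time steps that do not fill a whole window are dropped.
    """
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    step = max(1, int(window * (1 - overlap)))
    return [
        signal[start:start + window]
        for start in range(0, len(signal) - window + 1, step)
    ]


# Toy multivariate series: 10 time steps with 2 channels each.
series = [[t, -t] for t in range(10)]
windows = sliding_windows(series, window=4, overlap=0.5)
```

Each resulting window keeps all channels of the series, so it can be fed directly to a model that operates on raw multivariate input.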

3.2. ECG Heartbeat Classification

The second dataset used was the MIT-BIH Arrhythmia Dataset (Kachuee et al., 2018). It consists of electrocardiogram (ECG) recordings from subjects recorded at a sampling rate. The recordings are grouped into five categories based on annotations by cardiologists. As can be seen in Table 3(b), the class frequency is skewed towards the N class, which is to be expected since it includes normal heart behavior. Further explanation of the individual classes can be found in (Kachuee et al., 2018). Furthermore, each entry in the set consists of a single heartbeat padded with zeroes to ensure consistent length (i.e. input is of size ). The authors of (Kachuee et al., 2018) propose a CNN architecture to classify the dataset in a fully supervised fashion, achieving an accuracy of .
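Zero padding to a consistent length, as mentioned above, can be sketched in a few lines. The helper `pad_with_zeros` and the target length used here are assumptions for illustration only; the actual input length of the dataset is not restated.

```python
def pad_with_zeros(beat, length):
    """Right-pad a single heartbeat with zeros up to a fixed length."""
    if len(beat) > length:
        raise ValueError("heartbeat is longer than the target length")
    return list(beat) + [0.0] * (length - len(beat))


# Made-up heartbeat samples and a placeholder target length.
padded = pad_with_zeros([0.3, 0.9, 0.2], length=5)
```

Padding on the right keeps the beginning of every heartbeat aligned, which is convenient for convolutional feature extractors.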

3.3. Electric Devices

The Electric Devices dataset contains measurements from households observing their electrical consumption. Samples are taken every from households. After pre-processing and resampling to averages, the samples have a length of values (i.e. input is of size ). We use the version from (Bagnall et al., 2012), regrouping the original ten classes into seven: kettle; immersion heater; washing machine; cold group; oven/cooker; screen group; and dishwasher. The authors of (Shifaz et al., 2020) report the best accuracy by using BOSS at .

4. Methodology

Gaussian Mixture Models

A variational autoencoder in general consists of an encoder mapping the input data into the latent space (Kingma and Welling, 2013). The decoder, on the other hand, inverts this mapping, recreating the input data from the (compressed) latent encoding. This compressed space is often used for other downstream tasks in a second training step, for example classification or other information extraction tasks. The first step in this process is trained unsupervised, as it requires no annotated data, whereas the latter task is trained fully supervised.

This two-step process can be merged into one by adapting the joint probability distribution, resulting in a Gaussian Mixture Deep Generative model (GMM) capable of learning semi-supervised classification (Kingma et al., 2014). With some further modification we can use the inductive bias requirement to perform semi-unsupervised classification tasks with GMMs (Davidson et al., 2021; Willetts et al., 2020, 2018).

GMM for Semi-unsupervised Classification

In this work, we build on the work presented in (Davidson et al., 2021; Willetts et al., 2020) and adapt it to perform time series classification on raw sensor signals. That is, we are interested in the improved pattern recognition and performance shown in (Davidson et al., 2021). Therefore, we adapt their work and replace the 2D convolutional networks for image classification with 1D convolutions for raw time series. Additionally, we use the work shown in (Willetts et al., 2020) in two ways. Firstly, we use it as a reference in performance for the presented convolutional model. Secondly, we adapt their idea of Gaussian regularization, using the standard term provided by the Adam optimizer (Kingma and Ba, 2014). Our overall loss function follows (Davidson et al., 2021).

In this loss, the labeled subset of the data contains samples and their corresponding class (one-hot encoded). The unlabeled subset, on the other hand, contains all unlabeled data, that is, all data that is to be mapped to the known classes (semi-supervised classification) and data stemming from possibly new classes (unsupervised clustering). The loss further depends on the trainable weights at a given epoch and on hyperparameters weighting the entropy regularization, whereas the remaining loss terms measure the evidence lower bound (ELBO) from the GMM model. All other loss terms, the network architecture and further details can be seen in (Davidson et al., 2021). This approach allows us to analyze our experiments in four learning regimes: unsupervised, semi-unsupervised, semi-supervised and fully supervised.
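To make the entropy regularization mentioned above more concrete, the following sketch computes the Shannon entropy of a predicted class distribution. It only illustrates the quantity itself; how the term is weighted and combined with the ELBO terms is defined in the cited works and is not reproduced here. The helper names `softmax` and `entropy` are assumptions for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p + eps) for p in probs)


# A maximally uncertain and a confident class prediction over four classes.
uniform = softmax([0.0, 0.0, 0.0, 0.0])
peaked = softmax([10.0, 0.0, 0.0, 0.0])
```

The uniform prediction attains the maximum entropy of log(4), while the peaked prediction has an entropy close to zero, so a weighted entropy term can be used to influence how confidently samples are assigned to classes.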

5. Experiments

Experimental Setup

Since SuSL4TS is capable of handling all learning paradigms, we have conducted experiments in every setting. That is, we performed a parameter search for trials in the settings unsupervised learning (UL, labeled), semi-supervised (SSL, and labeled), semi-unsupervised (SuSL, and labeled) and supervised (SL, labeled). In the semi-unsupervised setting, we hid different classes. For the HAR dataset we tested three settings: (1) hiding all walking classes, (2) hiding all stationary classes, and (3) hiding one movement (walking) and one stationary (laying) class, while using the remainder semi-supervised. For the ECG dataset we used two hiding schemes: (1) omitting all normal heart beats, and (2) omitting classes Q and V. Within the electric devices dataset, we used two settings: (1) hiding classes 1–3, and (2) hiding classes 4–7. This was chosen arbitrarily, since there is no inherent split in the data (classes are only labeled 1–7, while the mapping to the named version is missing). When hiding classes with different sizes, we used   and   of each class. That is, we did not use subsampling. Finally, each time series was scaled via standard scaling/z-normalization.
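The data preparation described above (hiding classes, keeping a fraction of labels per known class, and z-normalization) can be sketched as follows. This is an illustrative sketch under assumptions: the function names, the use of `None` for "unlabeled", the per-series normalization, and the label fraction in the example are all hypothetical choices, not the experimental code.

```python
import random
import statistics


def z_normalize(series):
    """Scale a series to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    if std == 0:
        return [0.0 for _ in series]
    return [(value - mean) / std for value in series]


def semi_unsupervised_split(labels, hidden_classes, labeled_fraction, seed=0):
    """Return one entry per sample: its label if kept, otherwise None.

    Hidden classes lose all labels; every other class keeps
    `labeled_fraction` of its labels (at least one).
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    kept = [None] * len(labels)
    for label, indices in by_class.items():
        if label in hidden_classes:
            continue
        n_keep = max(1, round(labeled_fraction * len(indices)))
        for index in rng.sample(indices, n_keep):
            kept[index] = label
    return kept


labels = [0] * 10 + [1] * 10 + [2] * 10
masked = semi_unsupervised_split(labels, hidden_classes={2}, labeled_fraction=0.2)
normalized = z_normalize([1.0, 2.0, 3.0])
```

In this toy example class 2 is hidden completely, while classes 0 and 1 each keep two of their ten labels; all remaining samples are treated as unlabeled.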

For all settings and datasets we tested different types of networks. The first one is the fully convolutional approach, using a convolutional feature extractor and decoder (SuSL4TS). In the second setting, we tested a fully connected model, using MLPs in the encoder as well as the decoder (MLP SuSL). We also experimented with a mixture (convolutional encoder and linear decoder) but found its performance to consistently lie between the other two settings.


All networks were implemented using the PyTorch (Paszke et al., 2019) framework. We chose the Adam optimizer (Kingma and Ba, 2014) for training and performed a Bayesian hyperparameter search using optuna (Akiba et al., 2019) for each learning paradigm and dataset. The search space for each parameter can be seen in Table 2.

Parameter Search space
lr loguniform()
layers randint(1,3)
filters 2**randint(5,7)
units 2**randint(5,11)
kernel size [3, 5, 7]
clipping 10**randint(-10, 0)
Table 2. Hyperparameter search spaces. Optimization is done with optuna (Akiba et al., 2019) for runs in each experiment.
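The search space of Table 2 can be mirrored with a small sampler. Note that this sketch uses plain random sampling from the Python standard library and is not the Bayesian search performed with optuna. The learning-rate bounds `lr_low` and `lr_high` are placeholder assumptions, and `randint` is assumed to include both bounds.

```python
import math
import random


def sample_config(rng, lr_low=1e-5, lr_high=1e-2):
    """Draw one configuration from a search space shaped like Table 2.

    lr_low and lr_high are hypothetical bounds chosen for illustration.
    """
    return {
        # log-uniform: sample uniformly in log space, then exponentiate
        "lr": math.exp(rng.uniform(math.log(lr_low), math.log(lr_high))),
        "layers": rng.randint(1, 3),
        "filters": 2 ** rng.randint(5, 7),
        "units": 2 ** rng.randint(5, 11),
        "kernel_size": rng.choice([3, 5, 7]),
        "clipping": 10.0 ** rng.randint(-10, 0),
    }


config = sample_config(random.Random(0))
```

Sampling exponents instead of raw values keeps the number of filters and units at powers of two and spreads the clipping values evenly across orders of magnitude.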

The batch size was fixed to , meaning that each batch contained labeled examples and the same amount of unlabeled examples. In case of a size mismatch of labeled and unlabeled data, we re-sampled the smaller subset to fill the batches. We used a cosine annealing learning rate scheduler and trained for epochs. Predictions on the test set were done using weights of the last epoch resulting in the best accuracy on the validation set (= of the training set). We increase every epoch with a step size of , with a maximum of 1.
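The three training mechanics named above (re-sampling the smaller subset, cosine annealing of the learning rate, and a weight that grows every epoch up to 1) can be sketched as follows. The function names are hypothetical, the ramp is assumed to start at 0, and the cosine formula is the common closed form without restarts; the concrete batch size, epoch count and step size are not restated here.

```python
import math
import random


def resample_to(items, size, rng):
    """Fill a smaller subset up to `size` by sampling with replacement."""
    if len(items) >= size:
        return list(items)
    extra = [rng.choice(items) for _ in range(size - len(items))]
    return list(items) + extra


def cosine_annealing(epoch, total_epochs, lr_max, lr_min=0.0):
    """Learning rate following half a cosine wave from lr_max to lr_min."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cosine)


def ramp_weight(epoch, step):
    """Weight increased by `step` per epoch, capped at 1 (start at 0 assumed)."""
    return min(1.0, epoch * step)


rng = random.Random(0)
labeled = resample_to([1, 2, 3], size=8, rng=rng)
schedule = [cosine_annealing(epoch, 10, lr_max=0.1) for epoch in range(11)]
```

Re-sampling the smaller subset means that labeled examples are typically seen several times per pass over the unlabeled data, which is the usual consequence of pairing a small labeled set with a large unlabeled one.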

6. Results & Discussion

The results of our experiments described in Section 5 can be seen in Table 3. Some tables are available in the online appendix, as they only quantify the depicted observations.

Dataset Model UL SuSL SSL SL
HAR Majority Vote
Baseline (SVM,(Anguita et al., 2013))
Baseline (MLP SuSL,(Willetts et al., 2018, 2020))
SuSL4TS (movement (h))
SuSL4TS (stationary (h))
SuSL4TS (walking,laying (h))
ECG Majority Vote
Baseline (CNN,(Kachuee et al., 2018))
Baseline (MLP SuSL,(Willetts et al., 2018, 2020))
SuSL4TS (q,v (h))
SuSL4TS (n (h))
El. Devices Majority Vote
Baseline (BOSS,(Shifaz et al., 2020))
Baseline (MLP SuSL,(Willetts et al., 2018, 2020))
SuSL4TS (1–3 (h))
SuSL4TS (4–7 (h))
Table 3. Results. Each block describes one dataset, and is subdivided with the baseline methods. For reference, we include the majority vote baseline, a fully supervised baseline, and a semi-unsupervised baseline with an MLP on the raw time series. For SuSL4TS, we provide performance of our model in the different learning paradigms, with two versions of the semi-supervised and semi-unsupervised. We report the accuracy on the test set for unsupervised (UL), semi-supervised (SSL), semi-unsupervised (SuSL) and supervised (SL) learning paradigms.
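The majority vote baseline listed in Table 3 can be computed as sketched below. This is an illustrative sketch that assumes the majority class is determined on the training labels and then predicted for every test sample; the label lists are made up.

```python
from collections import Counter


def majority_vote_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training class."""
    majority_class, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_class)
    return hits / len(test_labels)


# Toy, skewed label distribution.
train = ["N"] * 8 + ["S", "V"]
test = ["N", "N", "N", "S", "V"]
accuracy = majority_vote_accuracy(train, test)
```

On skewed datasets this baseline can already be high, which is why it serves as a reference point for the unsupervised results.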

Human Activity Recognition

When taking a closer look at the HAR dataset, we see that our approach is not able to perform on par with the fully supervised baseline (SVM, 92 vs 96). However, the SVM performs on extracted features of the signal, while we directly use the raw signal. The main difference in performance can be attributed to the classes sitting and standing, as can be seen in Table 3(k). They are easily confused with each other.

But even with fewer labels (SSL,   and   ), we achieve almost the same classification performance as with all labels (91 vs 92). Surprisingly the accuracy is higher with fewer labeled samples (92 vs 93), which can be attributed to the better recall with the class standing.

In the unlabeled setting (i.e. time series clustering), the performance drops significantly (64 vs 92). Furthermore, there is no difference between the fully connected approach and the convolutional one.

In the first semi-unsupervised setting (walking classes hidden), we can see worse performance in both (   and   ) settings than in the unsupervised clustering. This drop in accuracy is due to the fact that all walking-related classes are classified as walking, and even standing is classified as walking (see Table 3(g)). In contrast to the unsupervised model, within the semi-unsupervised task both parts of the encoder, the labeled and the unlabeled leg, have to be trained, implying that more weights need to be tuned with only few labeled samples left. This setting is thus more complicated than the unsupervised task and the models tend to group unknown classes into one unknown super-class.

On the other hand, when hiding all stationary classes (i.e. standing, laying and sitting), performance increases again (51 vs 88). Wrongfully assigned classes are mostly confusion of sitting and standing (see Table 3(h)).

In the last semi-unsupervised setting (hiding walking and laying), performance is a little lower than hiding all static classes (77 vs 88). The drop in performance is due to the fact that the class standing is completely missed and most samples are classified as walking (see Table 3(i)).
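Statements such as "a class is completely missed" or "two classes are confused" can be read off a confusion matrix and the per-class recall. The following sketch shows one way to compute both; the labels and predictions are made-up toy data and do not reproduce any result from the tables.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, prediction in zip(y_true, y_pred):
        matrix[index[truth]][index[prediction]] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [
        row[i] / sum(row) if sum(row) else 0.0
        for i, row in enumerate(matrix)
    ]


classes = ["sitting", "standing", "walking"]
y_true = ["sitting", "sitting", "standing", "standing", "walking", "walking"]
y_pred = ["sitting", "standing", "walking", "walking", "walking", "walking"]
matrix = confusion_matrix(y_true, y_pred, classes)
recall = per_class_recall(matrix)
missed = [c for c, r in zip(classes, recall) if r == 0.0]
```

In this toy example the class standing has a recall of zero, i.e. it is completely missed, while half of the sitting samples are confused with standing.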

All discussed observations for the HAR dataset are also visible in the visualization of the latent space shown in Figure 2.

(a) Unsupervised classification.
(b) Semi-supervised classification with labels.
(c) Fully supervised classification.
(d) Semi-unsupervised classification with and the hidden classes of Sitting (1), Standing (2) and Laying (0).
(e) Semi-unsupervised classification with and the hidden classes of Walking (3), Walking Downstairs (4) and Walking Upstairs (5).
(f) Semi-unsupervised classification with and the hidden classes of Laying (0) and Walking (3).
Figure 2. Embedding visualization. UMAP dimensionality reduction of the learned latent space on the test set for the HAR dataset.

We used UMAP (McInnes et al., 2018) (min_dist=0.99, n_neighbors=10, metric=cosine) to plot a two-dimensional manifold of the learned embedding.

ECG Heartbeat Classification

The classification results for the univariate ECG dataset are displayed in Table 3. When comparing with the fully supervised CNN architecture presented in (Kachuee et al., 2018), both semi-unsupervised approaches outperform the baseline (98 vs 96).

In the semi-supervised setting (   and   , all classes known), there is almost no difference in terms of accuracy compared to the supervised settings. Again, accuracies are slightly higher when using fewer labels. Compared to the fully connected SuSL model, the convolutional feature extractor fares slightly better (97 vs 98).

In the unsupervised setting we can observe a performance drop, although not as high as for the HAR dataset (83 vs 63). Due to the highly skewed nature of the dataset, the unsupervised classification task is not much better than the majority vote (), suggesting that only the normal class was detected (see Table 3(t)).

In the first semi-unsupervised setting (classes Q and V hidden), we can see a slight increase in performance in comparison to the unsupervised setting (84 vs 85). Both hidden classes are missed in the test set classification, and the differences in accuracy stem from better recall of classes F and S (see Tables 3(v) and 3(u)).

On the other hand, when hiding the normal class, performance increases above the majority vote (88 vs 83). The amount of available labels (   and   ) does not impact the predictions largely (88 vs 90). However, with available labels, the class F is not missed, class V is missed in both settings (see Tables 3(x) and 3(w)).

Electric Devices

The classification results for the univariate electric devices dataset are displayed in Table 3. When comparing with the best performing model (BOSS) in the supervised settings, both neural network approaches perform worse (80 vs 70), where most misclassifications occur within classes 3–5 (see Table 3(s)).

In the semi-supervised setting (   and   , all classes known), we observe the same behavior as in the ECG dataset: classification accuracy remains similar to the fully supervised setting (68 vs 70). Since class 7 is completely missed () in the test set although it is detected in the validation set, the model likely overfitted (see Tables 3(r) and 3(q)).

In the unsupervised classification task, we can observe that two classes are missed in the test set (1, 7), but even predictions within the other classes are not very clear (52 vs 70). The missed classes only represent smaller portions of this dataset, suggesting the parameter search found models yielding higher accuracy when predicting mostly the larger classes (see Table 3(l)).

In the first semi-unsupervised setting (classes 1–3 hidden), we observe a larger dip in accuracy compared to the semi-supervised setting (55/59 vs 70). Within this dataset we can see an increase in performance when comparing the setting with available labels, in contrast to the configuration (55 vs 59, see Tables 3(n) and 3(m)). In both settings, one class is completely missed (1 or 7), but the main difference in performance is the higher precision at all other classes with labels available.

On the other hand, when hiding classes 4–7, we can observe a similar decline in performance (50/60 vs 70). With labels available, the model assigns mostly all missed samples to the same hidden class (5), while missing classes 1, 4 and 6–7 completely (see Table 3(o)). When presenting the model more labels, accuracy once again increases (similar to classes 1–3 hidden), and only class 1 is completely missed (see Table 3(p)).

General Discussion

Throughout all datasets and settings, we have made some observations applicable in general, which we will discuss now.

(1) In the multivariate dataset (i.e. HAR), the convolutional approach of SuSL4TS outperforms the fully connected version for semi-unsupervised learning. That is, it performs better in any setting we tested.

(2) In the univariate datasets, we can see a mixed picture. For the electric devices dataset, SuSL4TS performs better in any labeled setting by a large margin. Given the ECG dataset, both versions perform equally well, with no clear tendency.

(3) If specific classes are not known, we can see drastically different results in the classification. Most prominent in the HAR dataset, as hiding the walking classes collapses predictions to perform worse than the unsupervised setting.

(4) In general, the semi-unsupervised setting shows increased performance compared to the unsupervised setting with no labels at all only when the larger amount of labels is available (except HAR with all movements hidden).

(5) Generally the semi-supervised (i.e. all classes known, only limited amount of labels) performance is as good as the fully supervised setting.

7. Summary

In this paper we presented SuSL4TS, a convolutional Gaussian mixture model for performing semi-unsupervised time series classification. We showed the efficacy of our approach by comparing it with optimized methods on several benchmark datasets, while requiring no manual feature extraction. Especially in the semi-supervised settings, the model performs nearly as well as its fully supervised counterpart. When omitting specific classes (i.e. classes unknown a priori), accuracy can deviate strongly for certain combinations of labeled versus unlabeled data, showing lower performance than using no labels at all.

In future work, we will analyze the applicability of our approach in real world, large scale data. For example, we could test the highly skewed sensor data obtained from beehives (Zacepins et al., 2016; Kviesis and Zacepins, 2016; Davidson et al., 2020), or other highly skewed anomaly detection datasets (Ren et al., 2019), alleviating the burden of having to manually discern the different types of anomalies. On the other hand, we could use the normal class completely unlabeled and only annotate a few anomaly classes. This dataset is similar to the presented ECG analysis. In a more complex setting of time series classification, we could try to classify audio files, either with extracted mel-spectrograms or the raw series (Kahl et al., 2020; Piczak, 2015; Davidson et al., 2020).

As mentioned, SuSL4TS is a generative model, thus we can draw random samples resembling the learned classes from the latent space. That enables us to generate samples for a given class to be used for other tasks or augment the labeled set in the whole dataset.
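Drawing samples for a given class can be illustrated with a minimal sketch. It assumes that each class is associated with a diagonal Gaussian in the latent space and that a trained decoder (not shown) would map the sampled latent vector back to a time series; the class means, standard deviations and the name `sample_latent` are made up for illustration.

```python
import random


def sample_latent(mean, std, rng):
    """Draw one latent vector from a diagonal Gaussian N(mean, std^2)."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


rng = random.Random(0)
# Hypothetical per-class latent parameters of a two-dimensional latent space.
class_means = {0: [0.0, 0.0], 1: [3.0, -3.0]}
class_stds = {0: [1.0, 1.0], 1: [0.5, 0.5]}

# Latent samples for class 1; a trained decoder would turn each into a series.
samples = [sample_latent(class_means[1], class_stds[1], rng) for _ in range(2000)]
average = [sum(s[d] for s in samples) / len(samples) for d in range(2)]
```

The empirical average of the samples lies close to the chosen class mean, so decoded samples would resemble the corresponding class, provided the decoder was trained well.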


  • S. Aghabozorgi, A. S. Shirkhorshidi, and T. Y. Wah (2015) Time-series clustering–a decade review. Information Systems 53, pp. 16–38. Cited by: §1.
  • T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama (2019) Optuna: a next-generation hyperparameter optimization framework. In Proceedings of the 25rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Cited by: §5, Table 2.
  • J. An and S. Cho (2015) Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE 2 (1), pp. 1–18. Cited by: §1.
  • D. Anguita, A. Ghio, L. Oneto, X. Parra Perez, and J. L. Reyes Ortiz (2013) A public domain dataset for human activity recognition using smartphones. In

    Proceedings of the 21th international European symposium on artificial neural networks, computational intelligence and machine learning

    pp. 437–442. Cited by: §3.1, Table 1, Table 3.
  • A. Bagnall, L. Davis, J. Hills, and J. Lines (2012) Transformation based ensembles for time series classification. In Proceedings of the 2012 SIAM international conference on data mining, pp. 307–318. Cited by: §3.3.
  • A. Bagnall, J. Lines, A. Bostrom, J. Large, and E. Keogh (2017) The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data mining and knowledge discovery 31 (3), pp. 606–660. Cited by: §2.
  • P. Davidson, F. Buckermann, M. Steininger, A. Krause, and A. Hotho (2021) Semi-unsupervised learning: an in-depth parameter analysis. In

    German Conference on Artificial Intelligence (Künstliche Intelligenz)

    pp. 51–66. Cited by: §1, §2, §4, §4.
  • P. Davidson, M. Steininger, F. Lautenschlager, K. Kobs, A. Krause, and A. Hotho (2020) Anomaly detection in beehives using deep recurrent autoencoders. arXiv preprint arXiv:2003.04576. Cited by: §7.
  • A. Dempster, F. Petitjean, and G. I. Webb (2020) ROCKET: exceptionally fast and accurate time series classification using random convolutional kernels. Data Mining and Knowledge Discovery 34 (5), pp. 1454–1495. Cited by: §2.
  • H. Ismail Fawaz, B. Lucas, G. Forestier, C. Pelletier, D. F. Schmidt, J. Weber, G. I. Webb, L. Idoumghar, P. Muller, and F. Petitjean (2020) Inceptiontime: finding alexnet for time series classification. Data Mining and Knowledge Discovery 34 (6), pp. 1936–1962. Cited by: §2.
  • M. Kachuee, S. Fazeli, and M. Sarrafzadeh (2018) ECG heartbeat classification: a deep transferable representation. In 2018 IEEE International Conference on Healthcare Informatics (ICHI), External Links: Document, Link Cited by: §3.2, Table 1, §6, Table 3.
  • S. Kahl, M. Clapp, W. Hopping, H. Goëau, H. Glotin, R. Planqué, W. Vellinga, and A. Joly (2020) Overview of birdclef 2020: bird sound recognition in complex acoustic environments. In CLEF 2020-11th International Conference of the Cross-Language Evaluation Forum for European Languages, Cited by: §7.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4, §5.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §4.
  • D. P. Kingma, S. Mohamed, D. Jimenez Rezende, and M. Welling (2014) Semi-supervised learning with deep generative models. Advances in neural information processing systems 27. Cited by: §1, §2, §4.
  • A. Kviesis and A. Zacepins (2016) Application of neural networks for honey bee colony state identification. In 2016 17th International Carpathian Control Conference (ICCC), pp. 413–417. Cited by: §7.
  • J. Lines, S. Taylor, and A. Bagnall (2018) Time series classification with hive-cote: the hierarchical vote collective of transformation-based ensembles. ACM Transactions on Knowledge Discovery from Data 12 (5). Cited by: §2.
  • B. Lucas, A. Shifaz, C. Pelletier, L. O’Neill, N. Zaidi, B. Goethals, F. Petitjean, and G. I. Webb (2019) Proximity forest: an effective and scalable distance-based classifier for time series. Data Mining and Knowledge Discovery 33 (3), pp. 607–635. Cited by: §2.
  • L. McInnes, J. Healy, and J. Melville (2018) UMAP: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. Cited by: §6.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. Advances in neural information processing systems 32. Cited by: §5.
  • K. J. Piczak (2015) ESC: dataset for environmental sound classification. In Proceedings of the 23rd ACM international conference on Multimedia, pp. 1015–1018. Cited by: §7.
  • H. Ren, B. Xu, Y. Wang, C. Yi, C. Huang, X. Kou, T. Xing, M. Yang, J. Tong, and Q. Zhang (2019) Time-series anomaly detection service at Microsoft. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 3009–3017. Cited by: §7.
  • A. P. Ruiz, M. Flynn, J. Large, M. Middlehurst, and A. Bagnall (2021) The great multivariate time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery 35 (2), pp. 401–449. Cited by: §2.
  • P. Schäfer and U. Leser (2017) Multivariate time series classification with WEASEL+MUSE. arXiv preprint arXiv:1711.11343. Cited by: §2.
  • P. Schäfer (2015) The BOSS is concerned with time series classification in the presence of noise. Data Mining and Knowledge Discovery 29 (6), pp. 1505–1530. Cited by: §2.
  • A. Shifaz, C. Pelletier, F. Petitjean, and G. I. Webb (2020) TS-CHIEF: a scalable and accurate forest algorithm for time series classification. Data Mining and Knowledge Discovery 34 (3), pp. 742–775. Cited by: §2, §3.3, Table 1, Table 3.
  • M. Willetts, A. Doherty, S. Roberts, and C. Holmes (2018) Semi-unsupervised learning of human activity using deep generative models. arXiv preprint arXiv:1810.12176. Cited by: §2, §4, Table 3, 3(d), 3(e).
  • M. Willetts, S. Roberts, and C. Holmes (2020) Semi-unsupervised learning: clustering and classifying using ultra-sparse labels. In 2020 IEEE International Conference on Big Data (Big Data), pp. 5286–5295. Cited by: §1, §4, §4, Table 3, 3(d), 3(e).
  • A. Zacepins, A. Kviesis, E. Stalidzans, M. Liepniece, and J. Meitalovs (2016) Remote detection of the swarming of honey bee colonies by single-point temperature monitoring. Biosystems engineering 148, pp. 76–80. Cited by: §7.


(a) Class distribution in the HAR dataset. Columns: Class, Training samples, Test samples. Classes listed: Walking up, Walking down.
(b) Class distribution in the MIT-BIH dataset. Columns: Class, Training samples, Test samples.
(c) Class distribution in the electric devices dataset. Columns: Class, Training samples, Test samples.
(d) Macro F1 scores and (e) weighted F1 scores. Columns in both tables: Dataset, Model, UL, SuSL, SSL, SL. Rows in both tables, where (h) marks the hidden classes:
  • HAR: Baseline (MLP SuSL; Willetts et al., 2018, 2020); SuSL4TS (movement (h)); SuSL4TS (stationary (h)); SuSL4TS (walking, laying (h))
  • ECG: Baseline (MLP SuSL; Willetts et al., 2018, 2020); SuSL4TS (q, v (h)); SuSL4TS (n (h))
  • El. Devices: Baseline (MLP SuSL; Willetts et al., 2018, 2020); SuSL4TS (1–3 (h)); SuSL4TS (4–7 (h))
(f) Confusion matrix, HAR, UL. Classes: L, Si, St, W, W Down, W Up. Accuracy: 0.6538853070919579
(g) Confusion matrix, HAR, SuSL (movement hidden). Classes: L, Si, St, W (h), W Down (h), W Up (h). Accuracy: 0.509331523583305
(h) Confusion matrix, HAR, SuSL (stationary hidden). Classes: L (h), Si (h), St (h), W, W Down, W Up. Accuracy: 0.8798778418730913
(i) Confusion matrix, HAR, SuSL (laying and walking hidden). Classes: L (h), Si, St, W (h), W Down, W Up. Accuracy: 0.7763827621309807
(j) Confusion matrix, HAR, SSL. Classes: L, Si, St, W, W Down, W Up. Accuracy: 0.9270444519850696
(k) Confusion matrix, HAR, SL. Classes: L, Si, St, W, W Down, W Up. Accuracy: 0.9317950458092976
(l) Confusion matrix, El. Devices, UL. Classes: 1, 2, 3, 4, 5, 6, 7. Accuracy: 0.5272986642458825
(m) Confusion matrix, El. Devices, SuSL (1–3 hidden). Classes: 1 (h), 2 (h), 3 (h), 4, 5, 6, 7. Accuracy: 0.5459732849176501
(n) Confusion matrix, El. Devices, SuSL (1–3 hidden). Classes: 1 (h), 2 (h), 3 (h), 4, 5, 6, 7. Accuracy: 0.5914926728050837
(o) Confusion matrix, El. Devices, SuSL (4–7 hidden). Classes: 1, 2, 3, 4 (h), 5 (h), 6 (h), 7 (h). Accuracy: 0.49513681753339384
(p) Confusion matrix, El. Devices, SuSL (4–7 hidden). Classes: 1, 2, 3, 4 (h), 5 (h), 6 (h), 7 (h). Accuracy: 0.5961613279730256
(q) Confusion matrix, El. Devices, SSL. Classes: 1, 2, 3, 4, 5, 6, 7. Accuracy: 0.6969264686811049
(r) Confusion matrix, El. Devices, SSL. Classes: 1, 2, 3, 4, 5, 6, 7. Accuracy: 0.6801971209959797
(s) Confusion matrix, El. Devices, SL. Classes: 1, 2, 3, 4, 5, 6, 7. Accuracy: 0.6996498508624044
(t) Confusion matrix, ECG, UL. Accuracy: 0.8328613192033619
(u) Confusion matrix, ECG, SuSL (Q, V hidden). Classes: N, S, V (h), F, Q (h). Accuracy: 0.8470217431025032
(v) Confusion matrix, ECG, SuSL (Q, V hidden). Classes: N, S, V (h), F, Q (h). Accuracy: 0.8425452219989037
(w) Confusion matrix, ECG, SuSL (N hidden). Classes: N (h), S, V, F, Q. Accuracy: 0.8826511967842134
(x) Confusion matrix, ECG, SuSL (N hidden). Classes: N (h), S, V, F, Q. Accuracy: 0.9041659053535538
(y) Confusion matrix, ECG, SSL. Accuracy: 0.9751964187831171
(z) Confusion matrix, ECG, SSL. Accuracy: 0.9727754430842317
(aa) Confusion matrix, ECG, SL. Accuracy: 0.9761099945185456