1 Introduction
Modern mechatronic systems contain a large number of internal sensor and control signals that can be accessed for example via fieldbus interfaces. In addition, these systems can be extended by external measuring systems, which in total leads to a heterogeneous set of multivariate time series signals (MTS), where heterogeneous here implies divergent sampling times, measuring resolutions or scale levels of the individual univariate signals.
In the course of progressive digitization, such system signals are increasingly used for classification tasks (such as alarm or condition monitoring) or for the regression of target variables that cannot be measured directly (e.g. process stability or quality). In the following, these are collectively referred to as estimation tasks.
The estimation methods primarily used for this purpose from the field of statistical or machine learning depend on extensive feature extraction
[1], especially in the case of a highdimensional, heterogeneous data basis. In the case of manual feature extraction, this requires a high level of technical expertise regarding the system at hand and is consequently accompanied by a high effort during implementation. Alternatively, supervised endtoend estimation methods from the field of deep learning explicitly manage without previously extracted features, but require a large amount of already labeled data for training
[2].Methods of representation learning
have recently emerged in the field of computer vision and speech recognition, enabling unsupervised feature extraction that outperforms the prediction performance of common manual and automated feature extraction methods in the case of only a small amount of existing labeled data
[3]. The application to mechatronic systems has so far been limited to specialized single solutions [4], which cannot be directly transferred to any arbitrary mechanical or electrical system.2 State of Research
2.1 Representation Learning
The term representation learning or feature learning covers methods that allow an automatic extraction of relevant features and thus make the step of manual feature engineering superfluous in principle. In the context of this work, the focus is on an unsupervised method that learns a compact representation of the MTS without knowledge of the target variable and thus only on the basis of the input time series. Originally derived from the field of computer vision, feature learning methods are increasingly adapted for estimation tasks based on time series data such as Chen et al. [4] using an autoencoder for feature extraction from the torque signals of a 6axis industrial robot to predict collisions in the workspace. Li et al. [5] and Jiang et al. [6] both use a Generative Adverssarial Network (GAN) for feature extraction from industrial time series data, while Franceschi et al. [3] use a pure encoderbased network in combination with a socalled
triplet loss function
for classification on various reference time series. In this work, we focus on an autoencoderbased approach because, in contrast to GANs for instance, these generally provide a better representation of the population of training data, often at the cost of worse performance than purely generative models [7] (i.e., in generating realistic new time series), but this is not the focus of this paper.Autoencoder
An autoencoder is an artificial neural network whose primary goal is to reconstruct an input signal
x (see Figure 1). The dimensionreduced latent variable (also bottelneck or latent space)(1) 
is the result of an encoder function with the parameters (weights)
and is used as a feature vector in the representation learning in this work. The latent variable subsequently passes through a decoder
(2) 
with parameters , which generates a reconstruction of the input signal.
In the present case of mostly realvalued input values, usually the mean squared error
is used as reconstruction error and serves as a loss function during optimization.
Besides fully connected or dense layers with nonlinear activation functions, more complex neural layers also find use. For example, Bianchi et al. use recurrent (
RNN) bidirectional layers in combination with an additional kernel loss function for feature extraction from MTS with missing values [8].Variational Autoencoder
An extension of the regular autoencoder is the Variational Autoencoder (VAE), which is mostly used as generative model in the field of computer vision. As an example of a Variational Bayes model, the VAE models the unknown distribution function of the input data using a model distribution [9]
. The stochastic decoder can therefore be understood as a conditional probability distribution
, which together with the prior distribution of the latent variable forms a generative model by factorizing the multivariate distribution(3) 
Similarly, the encoder represents an inference model that can be conceived as a conditional probability distribution of the latent variables given input data
. This approach leads by the application of the evidence lower bound, ELBO to the modified loss function [9](4) 
with the KullbackLeibler divergence
, which penalizes the deviation between a given prior distribution of the latent variableand its actual (empirical) distribution given by the encoder. The second term represents the reconstruction error. The standard multivariate normal distribution
is mostly used as prior for the latent variable. For the training using the standard stochastic gradient descent (
stochastic gradient descent, SGD) method, the gradient of the loss function is calculated for each minibatch of training data in order to perform minimization of the loss function as a function of the network parameters and (backpropagation). As can be seen in Figure 1(a), this is not possible when directly drawing, since the backpropagation is interrupted by the random variable
[11]. Only with a mathematically equivalent reparametrization by swapping out the random variation into the random variable (which is typically modeled as normally distributed), a backpropagation of the error through the encoder is possible (socalled reparameterization trick, see Figure 1(b)).3 Methodology
This section presents the suggested autoencoderbased model, as well as the embedding pipeline. This is followed by details on the experimental design, in particular the chosen comparison methods and the selected publicly available datasets.
3.1 Feature Extraction
Unsupervised feature extraction from MTS with the amount of univariate signals is performed using multiple VAEs trained separately for each univariate time series. This, in contrast to common twodimensional convolutional network architectures (convolutional neural networks, CNN), allows a parallelization of the training and an individual architecture of the networks, depending on the sampling rate of the heterogeneous signals^{1}^{1}1On the other hand, existing crosscorrelations between the individual univariate time series are no longer directly observable..
As an aggregated feature, the concatenation of the individual latent variables is used as input of the estimator. Table 1 lists the most important (hyper) parameters of the trained model. These were kept constant across all datasets in order to provide the best possible demonstration of generality.
(Hyper)parameter  Value  Remark 

Input dimension  : Maximum window length  
Compression ratio  Quotient of and  
Hidden layer  Two hidden layers  
Activation radio    resp. lin. for output layer 
Regularization    Early stopping and norm 
Optimizer  Adam  SGD optimizer [12] 
Batchsize  
Normalization    Instance or layer norm. 
Learning rate  No learning rate scheduler  
Max. epochs 
Note: early stopping 
Instead of specifying a constant dimension of the latent space, a constant compression rate is chosen so that the number of extracted features scales proportional to the sampling rate and signal duration. Furthermore, during training and inference, instance normalization is performed for each windowed input time series to account for divergent ranges of signal values and stabilize convergence during the training.
3.2 Pipeline
The sequence of modeling/training and inference steps is typical for a socalled semisupervised training procedure, consisting of unsupervised representation learning and supervised training of an estimator. (see Figure 3): After consecutive training of both models, they are applied unchanged during inference ( in this case on independent test datasets).
The amount of labeled training data is varied during the testing procedure (logarithmic scaling of quantity).
The feature learning method (VAE) is provided with all available training data without labels beforehand in order to learn representations. This corresponds to the realistic use case of having a large amount of raw data, but only a certain fraction of it has been labeled. For each dataset, method, and quantity of labeled training data,
replicates are performed so that empirical mean standard deviation of the particular performance metric can be calculated. The train/test split of the datasets was provided by the authors of the same.
Python 3.8^{2}^{2}2
In particular: torch 1.7.1, sktime 0.4.3 and sklearn 0.24.1.
on a computer with GPU support (CUDA 7.5) is used as experimental environment. The source code and links to the used datasets are made publicly available^{3}^{3}3https://github.com/MrPr3ntice/vae_rep_learn_mts.3.3 Comparison methods
Dataset  Train samples  Test samples  Number of channels  Duration per sample ()  Target value (type)  Reference 

Rolling bearing damages  1440  800  7  Rolling bearing condition (3 classes)  [18]  
Stepper motors  70152  23384  7  Operating condition (4 classes)  [19]  
Hydraulic system  1544  661  17  Cooler condition (3 classes), Hydraulic accumulator pressure (regr.)  [20] 
The presented representation learning model (VAE) as well as the semisupervised pipeline are compared with a feature extraction method from the current state of the art (Rocket
). Additionally, principal component analysis serves as a deterministic baseline comparison method. The Python library
tsfreshallows automatic extraction and selection of statistical time and frequency domain features, thus providing a comparison method from the field of manual feature extraction.
For all comparison methods, a ridge regressionbased estimator is used in that parameterization, as it is recommended as a favored predictor, especially by authors of several current feature extractors
[13, 14]. In case of the VAE, a support vector machine (SVM) with Gaussian kernel is used to better account for the forced normal distribution character of the latent variable^{4}^{4}4In particular, due to the symmetric kernel, closed intervals on one variable can be separated in case of classification..Pca
Using principal component analysis (PCA), time series data can be reduced in dimension by the help of singular value decomposition. PCA produces those orthogonal linear transformations of the input data which maximize the variance of the components of the target subspace in descending order. The obtained transformation can be used as a feature for classification or regression. In particular, PCA can be considered as a special case of a linear autoencoder without any hidden layers
[9], which is why it will serve as a comparative baseline method here.Statistical Features
The extraction of manually or automatically selected statistical features from time series for ML applications is widely used. The Python library tsfresh provides a collection of established extraction methods to automatically generate and select a variety of these features from time series data. The collection of features includes, for example, statistical ratios and correlations in both time and frequency domain. A complete overview of the extracted features can be taken from the libraries documentation [15]. In the present case, the default settings^{5}^{5}5extraction: efficient, selection: extract_relevant_feat ures() is used.
Rocket
Random convolutional kernel transform is a stateoftheart method that uses randomly sampled convolution kernels to extract features from time series. Subsequently, a Ridge regression^{6}^{6}6
A special case of Tikhonov regularization for linear regression, sometimes also referred to as
regularization [16]. It can be used in both classification and regression case. is trained with the thereby generated features and the known target values. Due to this straightforward setup, the computational cost of Rocket is lower than that of comparably performing methods [14]. To the state of this work, the method produces the best average classification results on the datasets of the UCR and UEA time series archive [17].3.4 Data sets
The number of publicly available data sets is limited, especially in the area of mechanical and electronic systems. For validation of the proposed method, data sets covering a wide range of mechatronic applications are selected from those available. All data sets consist of real measurement data from several sensors, which differ e.g. in sampling rates. This results in three multivariate data sets from heterogeneous time series, from which three classification tasks and one regression task are derived. An overview of the selected data sets can be found in Table 2.
Rolling bearing damages
A widespread use case for machine learning applications is the detection of different rolling bearing damages. A comprehensive reference data set representing this use case has been published recently by Leissmeier et al. [18]. In addition to highfrequency sampled measurements of motor currents and housing vibration (), additional, lowerfrequency data such as radial force () and temperature () are available for damage classification. The dataset includes different damage types of rolling bearings on the outer and inner ring as well as the data from undamaged bearings under different operating conditions. Further information on the data set can be found in [18].
Stepper motors
For stepper motor monitoring, there is a dataset published by Goubeaud et al. [19]. This includes measurements of current, voltage, and vibration (translational acceleration). The target variable is the operating mode of the stepper motor, which differentiates between clockwise and counterclockwise operation, as well as operation in the normal range and beyond the mechanical stop. A detailed description of the experimental setup and the data acquisition is given in the original publication [19].
Hydraulic system
The hydraulic system described by Helwig et al. [20] is equipped with a variety of different sensors. In addition to measurements of pressure and flow rates also temperature, current, and vibration are recorded.The sampling rates range from (temperature) to ^{7}^{7}7Compared to the original data set sampled at , pressure has been sampled down to in favor of the computation time. (pressure), resulting in a heterogeneous data set with a wide variety of sensor types, value range and sampling rate. In this work, the state of the cooler (fault classification) and the pressure in the hydraulic accumulator (regression) will be considered as target variables.
4 Results
The results are shown in Figure 4 (a)  (d). More detailed results can be retrieved from the Tables 3 and 4 in the appendix.
For the first three cases, the Rocket comparison method achieves the highest test results with respect to the mean accuracy. Only in the case of the pressure estimation in the hydraulic accumulator, VAE shows almost consistently the lowest RMSE. In this regard, it should be noted that Rocket was originally developed for classification tasks and has been applied for this task in majority. Principal component analysis and the statistical features determined by tsfresh are not competitive for the prediction tasks on the first two data sets; for classification on the hydraulic data, all methods achieve similarly low error rates for a higher fraction of labeled training data.
A direct comparison with the partly available results of the original publications of the data sets is not possible at this point, since the multivariate case of a training data with a varying number of labeled data, which is treated here, was not considered in most studies. Additionally a neutral selection and parameterization of the compared estimators has been used in this work, in order to ensure the best possible comparability between the methods and not the highest possible performance of the same.
Looking at the results for a small number of labeled training data, it can be seen for all data sets that the presented implementation of VAE forms a better predictive measure based on the extracted features in the range up to at least . This supports the research hypothesis that autoencoderbased representation learning methods for extracting features from heterogeneous multivariate time series are particularly suitable for very small amounts of existing labeled training data in the presence of a sufficiently large amount of unlabeled training data at the same time.
5 Conclusion
In this work, an unsupervised autoencoderbased feature extractor for estimation tasks on heterogeneous, multivariate time series of mechatronic data was presented and validated on three data sets from the scientific community. Even though the prediction quality based on the entire training data does not approach current stateoftheart feature extractors like Rocket, an increased quality for the case of a low amount of labeled training data with a high availability of unlabeled data could be shown.
The presented results were obtained using only processed (reference) data sets. Thus, a consequential step towards automated estimation methods for the condition monitoring in comparable applications is the adaptation of the presented method for nonpreprocessed raw data. Here, for example, the method of the denoising autoencoder may be considered as a possible extension for noisy or even corrupted input data.
Publication notice
A later version of this paper in german language was submitted to VDI Mechatronic Tagung 2021 and has been published in the conference proceedings.
References
 [1] Fawaz, H. I.; Forestier, G.; Weber, J.; Idoumghar, L.; Muller, P. A.: Deep learning for time series classification: a review. Data Mining and Knowledge Discovery. (2019), 33(4), pp. 917–963.
 [2] Lei, Y.; Jia, F.; Lin, J.; Xing, S.; Ding, S. X.: An intelligent fault diagnosis method using unsupervised feature learning towards mechanical big data. IEEE Transactions on Industrial Electronics. (2016), 63(5), pp. 3137–3147.
 [3] Franceschi, J.Y.; Dieuleveut, A.; Jaggi, M.: Unsupervised Scalable Representation Learning for Multivariate Time Series. Advances in Neural Information Processing Systems. (2019) 32, ISSN 10495258.

[4]
Chen, T.; Liu, X.; Xia, B.; Wang, W.; Lai, Y.:
Unsupervised Anomaly Detection of Industrial Robots Using SlidingWindow Convolutional Variational Autoencoder
. IEEE Access. (2020) 8, ISSN 21693536, pp. 47072–47081. 
[5]
Li, D.; Chen, D.; Jin, B.; Shi, L.; Goh, J.; Ng, S. K.:
MADGAN: Multivariate anomaly detection for time series data with generative adversarial networks
. In: International Conference on Artificial Neural Networks. (2019), Springer, Cham, pp. 703–716.  [6] Jiang, W.; Cheng, C.; Zhou, B.; Ma, G.; Yuan, Y.: A novel GANbased fault diagnosis approach for imbalanced industrial time series. (2019), arXiv preprint arXiv:1904.00575.

[7]
Grover, A.; Dhar, M.; Ermon, S.:
FlowGAN: Combining maximum likelihood and adversarial learning in generative models.
In: Proceedings of the AAAI Conference on Artificial Intelligence. (2018) 32, 1.
 [8] Bianchi, F. M.; Livi L.; Mikalsen, K. Ø.; Kampffmeyer M.; Jenssen, R.: Learning representations of multivariate time series with missing data. Pattern Recognition. (2019) 96, ISSN 00313203.
 [9] Kingma, D. P.; Welling, M.: An Introduction to Variational Autoencoders . Foundations and Trends in Machine Learning. (2019) 12, No. 4, pp. 307–392.
 [10] Kingma, D. P.; Welling, M.: Autoencoding Variational Bayes. (2013), arXiv preprint arXiv:1312.6114.
 [11] Rezende, D. J.; Mohamed, S., Wierstra, D.: Stochastic back propagation and approximate inference in deep generative models. In: International Conference on Machine Learning. (2014), pp. 1278–1286.
 [12] Kingma, D. P.; Ba, J.: Adam: A method for stochastic optimization. (2014), arXiv preprint arXiv:1412.6980.
 [13] Le Nguyen, T.; Gsponer, S.; Ilie, I.; O’Reilly, M.; Ifrim, G.: Interpretable time series classification using linear models and multiresolution multidomain symbolic representations. Data Mining and Knowledge Discovery. (2019) 33, ISSN 1573756X, pp. 1183–1222.
 [14] Dempster, A.; Petitjean, F.; Webb, G. I.: ROCKET: exceptionally fast and accurate time series classification using random convolutional kernels. Data Mining and Knowledge Discovery. (2020) 34, ISSN 1573756X, pp. 1454–1495.
 [15] Christ, M.; Braun, N.; Neuffer, J.; KempaLiehr A.W.: Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh – A Python package). Neurocomputing. (2018) 307, ISSN 1573756X, S. 72–77.
 [16] Kennedy, P.: A Guide to Econometrics. 5th edition. Cambridge: The MIT Press. (2003). ISBN: 026261183X, pp. 205–206.
 [17] Ruiz, A. P.; Flynn, M.; Large, J.; Middlehurst, M.; Bagnall, A.: The great multivariate time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery. (2020), Springer OA, ISSN: 1573756X, pp. 1–49.
 [18] Lessmeier, C.; Kimotho, J.; Zimmer,D.; Sextro, W.: Condition Monitoring of Bearing Damage in Electromechanical Drive Systems by Using Motor Current Signals of Electric Motors. European Conference of the Prognostics and Health Management Society. (2016).
 [19] Goubeaud, M.; Grunert, T.; Lützenkirchen, J.; Joußen, P.; Ghorban, F.; Kummert, A.: Introducing a New Benchmarked Dataset for Mechanical Stop Detection of Stepper Motors. 2020 27th IEEE International Conference on Electronics, Circuits and Systems (ICECS). (2020), pp. 1–4.
 [20] Helwig, N.; Pignanelli, E.; Schütze, A.: Condition monitoring of a complex hydraulic system using multivariate statistics. 2015 IEEE International Instrumentation and Measurement Technology Conference (I2MTC) Proceedings. (2015) 24, ISSN, pp. 210–215.
Appendix
% of labeled training samples ()  VAE + SVM  PCA + Ridge  tsfresh + Ridge  Rocket + Ridge 

Rolling bearing damages data set  
Stepper motors data set  
Hydraulic system dataset  
% of labeled training samples ()  VAE + SVR  PCA + Ridge  tsfresh + Ridge  Rocket + Ridge 
