Seglearn: A Python Package for Learning Sequences and Time Series

03/21/2018 ∙ by David M. Burns, et al. ∙ Sunnybrook Health Sciences Centre

Seglearn is an open-source python package for machine learning time series or sequences using a sliding window segmentation approach. The implementation provides a flexible pipeline for tackling classification, regression, and forecasting problems with multivariate sequence and contextual data. This package is compatible with scikit-learn and is listed under scikit-learn Related Projects. The package depends on numpy, scipy, and scikit-learn. Seglearn is distributed under the BSD 3-Clause License. Documentation includes a detailed API description, user guide, and examples. Unit tests provide a high degree of code coverage.


1 Introduction

Many real-world machine learning problems, e.g. voice recognition, human activity recognition, power systems fault detection, and stock price and temperature prediction, involve data that is captured as sequences over a period of time (Aha, 2018). Sequential data sets do not fit the standard supervised learning framework, where each sample $(x, y)$ within the data set is assumed to be independently and identically distributed (iid) from a joint distribution $P(x, y)$ (Bishop, 2011). Instead, the data consist of sequences of $(x, y)$ pairs, and nearby values of $(x, y)$ within a sequence are likely to be correlated with each other. Sequence learning exploits the sequential relationships in the data to improve algorithm performance.

2 Supported Problem Classes

Sequence data sets have a general formulation (Dietterich, 2002) as sequence pairs $\{(X_i, y_i)\}_{i=1}^{N}$, where each $X_i$ is a multivariate sequence with $T_i$ samples $\langle x_{i,1}, x_{i,2}, \ldots, x_{i,T_i} \rangle$ and each target $y_i$ is a univariate sequence with $T_i$ samples $\langle y_{i,1}, y_{i,2}, \ldots, y_{i,T_i} \rangle$. The targets can either be sequences of categorical class labels (for classification problems), or sequences of continuous data (for regression problems). The number of samples $T_i$ varies between the sequence pairs in the data set. Time series with a regular sampling period may be treated equivalently to sequences. Irregularly sampled time series are formulated with an additional sequence variable $t_i$ that increases monotonically and indicates the timing of samples in the data set $\{(t_i, X_i, y_i)\}_{i=1}^{N}$.

Important sub-classes of the general sequence learning problem are sequence classification and sequence prediction. In sequence classification problems (e.g. song genre classification), the target for each sequence is a fixed class label $y_i$ and the data takes the form $\{(X_i, y_i)\}_{i=1}^{N}$. Sequence prediction involves predicting a future value of the target $y_{i,t+f}$ or future values $\langle y_{i,t+1}, \ldots, y_{i,t+f} \rangle$, given $\langle x_{i,1}, \ldots, x_{i,t} \rangle$, $\langle y_{i,1}, \ldots, y_{i,t} \rangle$, and sometimes also $\langle x_{i,t+1}, \ldots, x_{i,t+f} \rangle$.
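
As a rough illustration of the sequence prediction setup, the sketch below pairs the input at each time step of a single series with the target value a fixed number of steps ahead. It uses only numpy; the function name `make_forecast_pairs` and the `horizon` parameter are hypothetical and are not part of seglearn.

```python
import numpy as np

def make_forecast_pairs(x, y, horizon):
    """Pair the input at time t with the target at time t + horizon.

    x: array of shape (T, D), y: array of shape (T,).
    Hypothetical helper for illustration only; not a seglearn function.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    if horizon < 1 or horizon >= len(y):
        raise ValueError("horizon must be between 1 and len(y) - 1")
    # Drop the last `horizon` inputs (no future target exists for them)
    # and the first `horizon` targets (no earlier input is paired with them).
    return x[:-horizon], y[horizon:]

x = np.arange(10, dtype=float).reshape(5, 2)   # 5 samples, 2 channels
y = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
x_in, y_future = make_forecast_pairs(x, y, horizon=2)
```

Each row of `x_in` is aligned with the target two steps later in `y_future`, so any fixed-length regressor could in principle be trained on the pairs.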

A final important generalization is the case where contextual data associated with each sequence, but not varying within the sequence, exists to support the machine learning algorithm performance. For example, an algorithm for reading electrocardiograms might be given access to laboratory data, the patient's age, or known medical diagnoses to assist with classifying the sequential data recorded from the leads.
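
One plausible in-memory layout for such a data set is sketched below with numpy: a list of variable-length multivariate sequences, one label per sequence, and one row of contextual variables per sequence. The values are synthetic, and the context columns (age, a diagnosis flag) are assumptions chosen only to mirror the electrocardiogram example; this is not seglearn's own data structure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three multivariate sequences with 6 channels and different lengths.
lengths = [120, 200, 95]
X = [rng.normal(size=(T, 6)) for T in lengths]

# One fixed class label per sequence (sequence classification).
y = np.array([0, 2, 1])

# Contextual data: one row per sequence, constant within the sequence
# (hypothetical columns: age in years, a binary diagnosis flag).
context = np.array([[54.0, 1.0], [31.0, 0.0], [67.0, 1.0]])

# Every sequence must come with exactly one label and one context row.
assert len(X) == len(y) == len(context)
n_samples = [len(seq) for seq in X]
```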

seglearn provides a flexible, user-friendly framework for learning time series and sequences in all of the above contexts. Transforms for sequence padding, truncation, and sliding window segmentation are implemented to fix sample number across all sequences in the data set. This permits utilization of many classical and modern machine learning algorithms that require fixed length inputs. Sliding window segmentation transforms the sequence data into a piecewise representation (segments), which is particularly effective for learning periodized sequences (Bulling et al., 2014). An interpolation transform is implemented for resampling irregularly sampled time series. The sequence or time series data can be learned directly with various neural network architectures (Lipton et al., 2015), or via a feature representation which greatly enhances performance of classical algorithms (Bulling et al., 2014).
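
The padding, truncation, and interpolation ideas can be sketched with plain numpy as follows. This is a minimal illustration under stated assumptions (zero padding at the end of the sequence, linear interpolation per channel); the helper names `pad_or_truncate` and `resample_uniform` are hypothetical and do not correspond to seglearn's transform classes.

```python
import numpy as np

def pad_or_truncate(seq, length, fill=0.0):
    """Return `seq` (shape (T, D)) cut or end-padded to `length` samples."""
    seq = np.asarray(seq, dtype=float)
    if len(seq) >= length:
        return seq[:length]
    pad = np.full((length - len(seq), seq.shape[1]), fill)
    return np.vstack([seq, pad])

def resample_uniform(t, seq, period):
    """Linearly interpolate an irregularly sampled series onto a regular grid."""
    t = np.asarray(t, dtype=float)
    seq = np.asarray(seq, dtype=float)
    # Small tolerance so the final time stamp is kept when it lies on the grid.
    t_new = np.arange(t[0], t[-1] + 1e-9, period)
    # np.interp is one-dimensional, so interpolate each channel separately.
    cols = [np.interp(t_new, t, seq[:, d]) for d in range(seq.shape[1])]
    return t_new, np.column_stack(cols)

t = np.array([0.0, 0.5, 2.0])             # irregular time stamps
seq = np.array([[0.0], [1.0], [4.0]])     # one channel
t_new, seq_new = resample_uniform(t, seq, period=1.0)
fixed = pad_or_truncate(seq_new, length=5)
```

After resampling, every series shares the same sampling period, and after padding or truncation every series has the same number of samples, which is what fixed-length estimators require.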

3 Installation

The seglearn source code is available at: https://github.com/dmbee/seglearn. It is operating system agnostic, and implemented purely in Python. The dependencies are numpy, scipy, and scikit-learn. The package can be installed using pip:
$ pip install seglearn

Alternatively, seglearn can be installed from the sources:
$ git clone https://github.com/dmbee/seglearn
$ cd seglearn
$ pip install .

Unit tests can be run from the root directory using pytest.

4 Implementation

Figure 1: Example seglearn pipelines for a) learning segment feature representations, b) learning segments directly. SVC: Support Vector Classifier, CNN: Convolutional Neural Network, RNN: Recurrent Neural Network.

The seglearn API was implemented for compatibility with scikit-learn and its existing framework for model evaluation and selection. The seglearn package provides means for handling sequence data, segmenting it, computing feature representations, and calculating train-test splits and cross-validation folds along the temporal axis. (Note that splitting time series data along the temporal axis violates the assumption of independence between train and test samples. However, this is useful in some cases, such as the analysis of a single series.) An iterable, indexable data structure is implemented to represent sequence data with supporting contextual data.
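
To make the idea of a split along the temporal axis concrete, the sketch below cuts every series at a fixed fraction of its length, using the early part for training and the late part for testing. It is a numpy-only illustration; the helper `temporal_split` and its `test_size` parameter are hypothetical and are not seglearn's own split utilities.

```python
import numpy as np

def temporal_split(X, y, test_size=0.25):
    """Split every series along time: early samples train, late samples test.

    X: list of arrays of shape (T_i, D); y: list of target sequences of
    length T_i. Hypothetical helper, not the seglearn implementation.
    """
    X_train, X_test, y_train, y_test = [], [], [], []
    for seq, target in zip(X, y):
        cut = int(round(len(seq) * (1.0 - test_size)))
        X_train.append(seq[:cut])
        X_test.append(seq[cut:])
        y_train.append(target[:cut])
        y_test.append(target[cut:])
    return X_train, X_test, y_train, y_test

X = [np.zeros((100, 3)), np.zeros((40, 3))]
y = [np.arange(100), np.arange(40)]
X_train, X_test, y_train, y_test = temporal_split(X, y, test_size=0.25)
train_lengths = [len(s) for s in X_train]
test_lengths = [len(s) for s in X_test]
```

Because the train and test parts come from the same series, samples near the cut are correlated, which is the independence caveat noted in the text.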

The seglearn functionality is provided within a scikit-learn pipeline, allowing the user to leverage scikit-learn transformer and estimator classes, which are particularly helpful in the feature representation approach to segment learning. Direct segment learning with neural networks is implemented in the pipeline using the keras package and its scikit-learn API. Examples of both approaches are provided in the documentation and example gallery. The integrated learning pipeline, from raw data to final estimator, can be optimized within the scikit-learn model_selection framework. This is important because segmentation parameters (e.g. window size, segment overlap) can have a significant impact on sequence learning performance (Burns et al., 2018; Bulling et al., 2014).

Sliding window segmentation transforms sequence data into a piecewise representation (segments), such that predictions are made and scored for all segments in the data set. Sliding window segmentation can be performed for data sets with a single target value per sequence, in which case that target value is mapped to all segments generated from the parent sequence. If the target for each sequence is also a sequence, the target is segmented as well, and various methods may be used to select a single target value from the target segment (e.g. mean value, middle value, last value). Alternatively, the target segment can be predicted directly if an estimator implementing sequence-to-sequence prediction is used.
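
The two target-handling cases described above can be sketched with numpy. This is an illustrative sketch, not the seglearn implementation: the helper names are hypothetical, the step between windows is assumed to be `width * (1 - overlap)`, and trailing samples that do not fill a whole window are assumed to be dropped.

```python
import numpy as np

def sliding_windows(seq, width, overlap=0.5):
    """Cut a (T, D) sequence into windows of `width` samples.

    Returns an array of shape (n_segments, width, D).
    """
    seq = np.asarray(seq)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the window width")
    step = max(1, int(round(width * (1.0 - overlap))))
    starts = range(0, len(seq) - width + 1, step)
    return np.stack([seq[s:s + width] for s in starts])

def segment_target(target, width, overlap=0.5, how="last"):
    """Segment a target sequence and reduce each window to one value."""
    column = np.asarray(target)[:, None]
    windows = sliding_windows(column, width, overlap)[:, :, 0]
    if how == "last":
        return windows[:, -1]
    if how == "middle":
        return windows[:, width // 2]
    if how == "mean":
        return windows.mean(axis=1)
    raise ValueError(f"unknown reduction: {how}")

seq = np.arange(20, dtype=float).reshape(10, 2)   # 10 samples, 2 channels
segments = sliding_windows(seq, width=4, overlap=0.5)

# Case 1: one label per sequence, copied to every segment.
labels = np.full(len(segments), 3)

# Case 2: the target is itself a sequence, reduced per window.
target = np.arange(10, dtype=float)
last_vals = segment_target(target, width=4, how="last")
mean_vals = segment_target(target, width=4, how="mean")
```

With 10 samples, a width of 4, and 50% overlap, the windows start at samples 0, 2, 4, and 6, so one sequence yields four training examples.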

A human activity recognition data set (Burns et al., 2018) consisting of inertial sensor data recorded by a smartwatch worn during shoulder rehabilitation exercises is provided with the source code to demonstrate the features and usage of the seglearn package.

5 Basic Example

This example demonstrates the use of seglearn for performing sequence classification with our smartwatch human activity recognition data set.
import seglearn as sgl
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

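# load the smartwatch data set and hold out a random subset of series for testing
data = sgl.load_watch()
X_train, X_test, y_train, y_test = train_test_split(data["X"], data["y"])

# pipeline: segment each series, compute segment features, scale, classify
clf = sgl.Pype([("seg", sgl.SegmentX(width=100, overlap=0.5)),
                ("features", sgl.FeatureRep()),
                ("scaler", StandardScaler()),
                ("rf", RandomForestClassifier())])

# fit on the training series and report segment-level accuracy on the test series
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print("accuracy score:", score)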

accuracy score: 0.7805084745762711

6 Comparison to other Software

Three other Python packages for performing machine learning on time series and sequences were identified: tslearn (Tavenard, 2017), cesium-ml (Naul et al., 2016), and tsfresh (Christ et al., 2018). These were compared to seglearn based on time series learning capabilities (Table 1), and performance (Table 2).

cesium-ml (v0.9.6) and tsfresh (v0.11.1) support feature representation learning of multivariate time series, and currently implement more features than seglearn does. However, the feature representation transformers are implemented as a pre-processing step, independent of the otherwise sklearn compatible pipeline. This design choice precludes end-to-end model selection. There are no examples of, or apparent support for, problems where the target is a sequence/time series, or for integration with deep learning models.

tslearn (v0.1.18.4) implements time-series specific classical algorithms for clustering, classification, and barycenter computation for time series with varying lengths. There is no support for feature representation learning, learning context data, or deep learning.

The performance comparison was conducted using our human activity recognition data set, which contains 140 multivariate time series with 6 channels sampled uniformly at 50 Hz and 7 activity classes. The series were all truncated to 4 seconds (200 samples). Classification accuracy was measured on 35 series held out for testing, with the remaining 105 used for training. seglearn, cesium-ml, and tsfresh were tested using the sklearn implementation of the SVM classifier with a radial basis function (RBF) kernel on 5 features (median, minimum, maximum, standard deviation, and skewness) calculated on each channel (total 30 features). tslearn was evaluated with its own SVM classifier implementing a global alignment kernel (Cuturi et al., 2007). The testing was performed using an Intel Core i7-4770 testbed with 16 GB of installed memory, on Linux Mint 18.3 with Python 2.7.12.
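
The 30-feature representation used in the comparison (5 statistics on each of 6 channels) can be sketched with numpy as below. The helper `segment_features` is hypothetical, the input is synthetic noise rather than the actual data set, and the exact definitions are assumptions: the standard deviation is the population form and the skewness is the biased estimator (third central moment divided by the cubed standard deviation), which may differ from what each package computes.

```python
import numpy as np

def segment_features(seg):
    """Compute 5 summary features per channel for one (T, D) series."""
    seg = np.asarray(seg, dtype=float)
    mean = seg.mean(axis=0)
    std = seg.std(axis=0)
    # Biased skewness: third central moment over the cubed standard deviation.
    # Constant channels have zero spread, so their skewness is set to 0.
    safe_std = np.where(std > 0, std, 1.0)
    skew = ((seg - mean) ** 3).mean(axis=0) / safe_std ** 3
    skew = np.where(std > 0, skew, 0.0)
    # Feature order: median, minimum, maximum, standard deviation, skewness.
    return np.concatenate([
        np.median(seg, axis=0),
        seg.min(axis=0),
        seg.max(axis=0),
        std,
        skew,
    ])

rng = np.random.default_rng(0)
series = rng.normal(size=(200, 6))     # 4 s at 50 Hz, 6 channels (synthetic)
features = segment_features(series)    # one row of the feature matrix
```

Stacking one such vector per series gives a fixed-width feature matrix that a classical classifier such as an SVM can consume regardless of the original series length.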

Classification accuracy was identical between cesium-ml, tsfresh, and seglearn (as they used the same features and classifier in the evaluation), and all three significantly exceeded the accuracy achieved with tslearn. seglearn significantly outperformed the other packages in terms of computation time.

tslearn cesium-ml tsfresh seglearn
Active development (2018)
Documentation
Unit Tests
Multivariate time series
Context data X X
Time series target X X X
Sliding window segmentation X X X
Temporal folds X X X
sklearn compatible model selection X X X
Feature representation learning X
Number of implemented features N/A 58 64 20
Deep learning X X X
Classification
Clustering
Regression
Forecasting X
Table 1: Comparison of time series learning package features for tslearn v0.1.18.4, cesium-ml v0.9.6, tsfresh v0.11.1 and seglearn v1.0.2.
                             tslearn   cesium-ml   tsfresh   seglearn
Classification accuracy      0.057     0.714       0.714     0.714
Computation time (seconds)   0.79      62.9        0.40      0.088
Table 2: Comparison of time series learning package performance on our human activity recognition dataset.

References