Meta-Learning for Few-Shot Time Series Classification

09/13/2019
by   Jyoti Narwariya, et al.
Tata Consultancy Services

Deep neural networks (DNNs) have achieved state-of-the-art results on time series classification (TSC) tasks. In this work, we focus on leveraging DNNs in the often-encountered practical scenario where access to labeled training data is difficult, and where DNNs would be prone to overfitting. We leverage recent advancements in gradient-based meta-learning, and propose an approach to train a residual neural network with convolutional layers as a meta-learning agent for few-shot TSC. The network is trained on a diverse set of few-shot tasks sampled from various domains (e.g. healthcare, activity recognition, etc.) such that it can solve a target task from another domain using only a small number of training samples from the target task. Most existing meta-learning approaches are limited in practice as they assume a fixed number of target classes across tasks. We overcome this limitation in order to train a common agent across domains, with each domain having a different number of target classes: we utilize a triplet-loss based learning procedure that does not require any constraints to be enforced on the number of classes for the few-shot TSC tasks. To the best of our knowledge, we are the first to use meta-learning based pre-training for TSC. Our approach sets a new benchmark for few-shot TSC, outperforming several strong baselines on few-shot tasks sampled from 41 datasets in the UCR TSC Archive. We observe that pre-training under the meta-learning paradigm allows the network to quickly adapt to new unseen tasks with a small number of labeled instances.


1. Introduction

Time series data is ubiquitous in the current digital era with several applications across domains such as forecasting, healthcare, equipment health monitoring, and meteorology among others. Time series classification (TSC) has several practical applications such as disease diagnosis from time series of physiological parameters (Che et al., 2018), classifying heart arrhythmias from ECG signals (Rajpurkar et al., 2017), and human activity recognition (Yang et al., 2015). Recently, deep neural networks (DNNs) such as those based on long short term memory networks (LSTMs) (Karim et al., 2018) and 1-dimensional convolution neural networks (CNNs) (Wang et al., 2017; Fawaz et al., 2018b; Kashiparekh et al., 2019) have achieved state-of-the-art results on TSC tasks. However, it is well-known that DNNs are prone to overfitting, especially when access to a large labeled training dataset is not available (Fawaz et al., 2018c; Kashiparekh et al., 2019).

A few recent attempts aim to address the issue of scarce labeled data for univariate TSC (UTSC) by leveraging transfer learning (Yosinski et al., 2014) via DNNs, e.g. (Malhotra et al., 2017; Serrà et al., 2018; Fawaz et al., 2018c; Kashiparekh et al., 2019). These approaches consider pre-training a deep network in an unsupervised (Malhotra et al., 2017) or supervised (Serrà et al., 2018; Fawaz et al., 2018c; Kashiparekh et al., 2019) manner using a large number of time series from diverse domains, and then fine-tuning the pre-trained model for the target task using labeled data from the target domain. However, these transfer learning approaches for TSC, based on pre-training a network on a large number of diverse time series tasks, do not necessarily guarantee a pre-trained model (or network initialization) that can be quickly fine-tuned with a very small number of labeled training instances, and rather rely on ad-hoc fine-tuning procedures.

Rather than learning a new task from scratch, humans leverage their pre-existing skills by fine-tuning and recombining them, and hence are highly data-efficient, i.e. can learn from as little as one example per category (Perfors and Tenenbaum, 2009). Meta-learning (Schmidhuber, 1987) approaches intend to take a similar approach for few-shot learning, i.e. learning a task from few examples. More recently, several approaches for few-shot learning for regression, image classification, and reinforcement learning domains have been proposed under the gradient-based meta-learning or the “learning to learn” framework, e.g. in (Finn et al., 2017; Nichol et al., 2018; Rusu et al., 2019). A neural network-based meta-learning model is explicitly trained to quickly learn a new task from a small amount of data. The model learns to solve several tasks sampled from a given distribution where each task is, for example, an image classification problem with few labeled examples. Since each task corresponds to a learning problem, performing well on a task corresponds to learning quickly.

Despite the advent of the aforementioned pre-trained models for time series, few-shot learning (i.e. learning from few, say five, examples per class) for TSC remains an important and unaddressed research problem. The goal of few-shot TSC is to train a model on a large number of diverse few-shot TSC tasks such that it can leverage this experience through the learned parameters, and quickly generalize to new tasks with a small number of labeled instances. More specifically, we train a residual network (ResNet) (Wang et al., 2017; Fawaz et al., 2018b) on several few-shot TSC tasks such that the ResNet thus obtained generalizes to solve new few-shot learning tasks. In contrast to existing methods for data-efficient transfer learning, our method provides a way to directly optimize the embedding itself for classification, rather than an intermediate bottleneck layer such as the ones proposed in (Malhotra et al., 2017; Serrà et al., 2018).

Key contributions of this work are:

  • We define the problem of few-shot learning for univariate TSC (UTSC), and propose a training and evaluation protocol for the same.

  • We propose a few-shot UTSC approach by training a ResNet to solve diverse few-shot UTSC tasks using a meta-learning procedure (Nichol et al., 2018). The ResNet thus obtained can be quickly adjusted (fine-tuned) on a new, previously unseen, UTSC task with few labeled examples per class.

  • As opposed to the fixed $N$-way classification setting in most existing few-shot methods, our approach can handle multi-way classification problems with a varying number of classes without introducing any additional task-specific parameters to be trained from scratch, such as those in the final classification layer (Serrà et al., 2018; Kashiparekh et al., 2019). In order to generalize across few-shot tasks with a varying number of classes, we leverage triplet loss (Weinberger et al., 2006; Schroff et al., 2015) for training the ResNet, which allows the same neural network architecture to be used across diverse applications.

  • Since the proposed approach uses triplet loss to learn a Euclidean embedding for time series, it can also be seen as a data-efficient metric learning procedure for time series that can learn from a very small number of labeled instances.

In the few-shot setting, we demonstrate that a vanilla nearest-neighbor classifier over the embeddings obtained using our approach outperforms existing nearest-neighbor classifiers based on the highly effective dynamic time warping (DTW) classifier (Bagnall et al., 2017) and even the state-of-the-art time series classifier BOSS (Schäfer, 2015). The rest of the paper is organized as follows: we contrast our work to existing literature in Section 2. We define the problem of few-shot learning for UTSC in Section 3. We then provide details of the neural network architecture used for training the few-shot learner in Section 4, followed by the details of the meta-learning based training algorithm for few-shot UTSC in Section 5. We provide details of the empirical evaluation of the proposed approach in Section 6 and conclude in Section 7.

2. Related Work

Several approaches have been proposed to deal with scarce labeled data for TSC, via data augmentation, warping, simulation, transfer learning, etc. in e.g. (Le Guennec et al., 2016; Cui et al., 2016; Fawaz et al., 2018a; Kashiparekh et al., 2019). Regularization in DNNs, e.g. decorrelating convolutional filter weights (Paneri et al., 2019), has been found to be effective for TSC and to avoid overfitting in scarce data scenarios. Iterative semi-supervised learning (Wei and Keogh, 2006) also addresses the scarce labeled data scenario by iteratively increasing the labeled set, but assumes availability of a relatively large amount of data, albeit initially unlabeled. In this work, we take a different route to deal with scarce labeled data scenarios and leverage gradient-based meta-learning to explicitly train a network to quickly adapt and solve new few-shot TSC tasks.

Transfer learning using pre-trained DNNs has been shown to achieve better classification performance than training DNNs from scratch for TSC: a few instances of pre-trained DNNs for TSC have been recently proposed in e.g. (Malhotra et al., 2017; Serrà et al., 2018; Kashiparekh et al., 2019). However, none of these methods are explicitly trained to quickly adapt to a target task and tend to rely on ad-hoc fine-tuning procedures. Furthermore, they do not study the extreme case of few-shot TSC: while (Malhotra et al., 2017) relies on training an SVM classifier on top of unsupervised embeddings obtained via a deep LSTM, (Serrà et al., 2018; Kashiparekh et al., 2019) rely on introducing and training a new final softmax layer from scratch for each new task. Our approach explicitly pre-trains a DNN using triplet loss to optimize for quick adaptation to a few-shot task. Moreover, unlike existing methods, our approach directly optimizes for time series embeddings over which the similarity of time series can be defined, and hence can work in a kNN setting without requiring the training of additional parameters like those of an SVM in (Malhotra et al., 2017), or those of a feedforward final layer in (Serrà et al., 2018; Kashiparekh et al., 2019).

Several approaches for few-shot learning have been recently introduced for image classification, regression, and reinforcement learning, e.g. (Vinyals et al., 2016; Finn et al., 2017; Nichol et al., 2018; Rusu et al., 2019). To the best of our knowledge, our work is the first attempt to study few-shot learning for TSC. We formulate the few-shot learning problem for UTSC, and build on top of the following recent advances in deep learning research to develop an effective few-shot approach for TSC: i) gradient-based meta-learning (Finn et al., 2017; Nichol et al., 2018), ii) residual network with convolutional layers for TSC (Wang et al., 2017), iii) leveraging multi-length filters to ensure generalizability of filters to tasks with varying time series length and temporal properties (Roy et al., 2018; Kashiparekh et al., 2019), and iv) triplet loss (Schroff et al., 2015) to ensure generalizability to tasks with varying number of classes without introducing any additional parameters.

Dynamic time warping (DTW) and its variants (Vintsyuk, 1968; Jeong et al., 2011) are known to be very robust and strong distance metric baselines for TSC over a diverse set of applications (Bagnall et al., 2017). However, it is also well-known that no single distance metric works well across scenarios as they lack the ability to leverage the data-distribution and properties of the task at hand (Wang et al., 2013; Bagnall et al., 2017). It has been shown that k-nearest-neighbor (kNN) TSC can be significantly improved by learning a distance metric from labeled examples (Mei et al., 2015; Abid and Zou, 2018). Similarly, modeling time series similarity using Siamese recurrent networks based supervised learning has been proposed in (Pei et al., 2016). CNNs trained using triplet loss for TSC have been very recently proposed for unsupervised learning in (Franceschi et al., 2019) and for supervised learning in (Brunel et al., 2019). However, to the best of our knowledge, none of the metric learning approaches consider pre-training a neural network that can be quickly fine-tuned for new TSC few-shot tasks.

3. Problem Definition

Consider a $K$-shot learning problem for UTSC sampled from a distribution $p(\mathcal{T})$ that requires learning a multi-way classifier for a test task given only $K$ labeled time series instances per class. Rather than training a classifier from scratch for the test task, the goal is to obtain a neural network with parameters $\phi$ that is trained to efficiently (e.g. in a few iterations of updates of $\phi$ via gradient descent) solve several $K$-shot learning tasks sampled from $p(\mathcal{T})$. These $K$-shot tasks are divided into three sets: a training meta-set $S^{tr}$, a validation meta-set $S^{va}$, and a testing meta-set $S^{te}$. The training meta-set is used to obtain the parameters $\phi$, the validation meta-set is used for model selection (hyperparameters for neural network training), and the testing meta-set is used only for final evaluation.

Each task instance in the training meta-set $S^{tr}$ and the validation meta-set $S^{va}$ consists of a labeled training set of univariate time series $D^{tr} = \{(x_i, y_i)\}_{i=1}^{N \times K}$, where $K$ is the number of univariate time series instances for each of the $N$ classes. Ignoring the sub- and super-scripts, each univariate time series is $x = (x_1, \ldots, x_T)$ with $x_t \in \mathbb{R}$ for $t = 1, \ldots, T$, where $T$ is the length of the time series, and $y \in \{1, \ldots, N\}$ is the class label. Unlike the tasks in $S^{tr}$ and $S^{va}$, which only contain a training set, each task in the testing meta-set $S^{te}$ also contains a testing set $D^{te}$ apart from a training set $D^{tr}$. The classes in $D^{tr}$ and $D^{te}$ are the same while classes across tasks are, in general, different. For any $x$ from $D^{te}$, the goal is to estimate the corresponding label $y$ by using an updated version of the network parameters $\phi$ obtained by fine-tuning the neural network using the labeled samples from $D^{tr}$. In other words, the training set $D^{tr}$ of a task is used for fine-tuning the neural network parameters $\phi$, while the corresponding testing set $D^{te}$ of the task is used for evaluation.

It is to be noted that the tasks in the three meta-sets correspond to time series from disjoint sets of classes, i.e. the classes in any task in the training meta-set are different from those of any task in the validation meta-set, and so on. In practice, we sample the tasks from diverse domains such as electric devices, motion capture, spectrographs, sensor readings, ECGs, simulated time series, etc. taken from the UCR TSC Archive (Chen et al., 2015). Each dataset, and in turn the tasks sampled from it, has a different notion of classes depending upon the domain, a different number of classes $N$, and a different time series length $T$.

4. Neural Network

As shown in Figure 1, we consider a ResNet consisting of multiple convolutional blocks with shortcut residual connections (He et al., 2016) between them, eventually followed by a global average pooling (GAP) layer such that the network does not have any feedforward layers at the end. Each convolutional block consists of a convolutional layer followed by a batch normalization (BN) layer (Ioffe and Szegedy, 2015) which acts as a regularizer. Each BN layer is in turn followed by a ReLU layer. We omit further architecture details and refer the reader to (Kashiparekh et al., 2019).

Figure 1. ResNet Architecture depicting two residual blocks each with two convolutional layers, and variable-length filters in each convolutional layer.

In order to quickly adapt to any unseen task, the neural network should be able to extract temporal features at multiple time scales and should ensure that the fine-tuned network can generalize to time series of varying lengths across tasks. We, therefore, use filters of multiple lengths in each convolutional block to capture temporal features at various time scales, as found to be useful in (Roy et al., 2018; Brunel et al., 2019; Kashiparekh et al., 2019).

In a nutshell, ResNet takes a univariate time series $x$ of any length $T$ as input and converts it to a fixed-dimensional feature vector $z \in \mathbb{R}^m$, where $m$ is the number of filters in the final convolutional layer. We denote the set of all the trainable parameters of the ResNet, consisting of filter weights and biases across convolutional layers and BN layer parameters, by $\phi$.
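To make the fixed-dimensional property concrete, the following is a minimal pure-Python sketch, not the architecture used in the paper: it applies a few filters of different lengths to a series, passes the result through ReLU, and reduces each feature map by global average pooling. Residual connections and batch normalization are omitted, and the moving-average filters and function names (`conv1d_same`, `embed`) are hypothetical choices made only for illustration.

```python
def conv1d_same(series, kernel):
    """1-D cross-correlation with zero padding so the output has the input's length."""
    k = len(kernel)
    left = (k - 1) // 2
    padded = [0.0] * left + list(series) + [0.0] * (k - 1 - left)
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(series))]


def embed(series, kernels):
    """One ReLU feature map per filter, each reduced by global average pooling."""
    features = []
    for kernel in kernels:
        fmap = [max(v, 0.0) for v in conv1d_same(series, kernel)]
        features.append(sum(fmap) / len(fmap))
    return features


# Hypothetical filters of three different lengths (simple moving averages).
kernels = [[1.0 / length] * length for length in (4, 8, 16)]

# Two series of different lengths map to vectors of the same dimension.
short = embed([float(t % 5) for t in range(30)], kernels)
long = embed([float(t % 7) for t in range(200)], kernels)
```

Because the pooling averages over the time axis, the embedding dimension depends only on the number of filters, which is what allows one set of parameters to be shared across datasets with different series lengths.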

Most ResNet implementations for TSC (Wang et al., 2017; Fawaz et al., 2018b; Serrà et al., 2018; Kashiparekh et al., 2019) use a feedforward layer followed by a softmax layer to eventually map the feature vector to class probabilities, and use cross-entropy loss for training. Further, when training the ResNet for multiple tasks with varying number of classes across tasks, a multi-head output with a different final feedforward layer for each task is typically used, e.g. as in (Serrà et al., 2018; Kashiparekh et al., 2019). However, in our setting, this implies a different feedforward layer for each new few-shot task, introducing at least $m \times N$ additional task-specific parameters for an $m$-dimensional feature vector and $N$ classes (when the GAP layer is followed by a single feedforward layer and a softmax layer) that need to be trained from scratch for each new few-shot task. This is not desirable in a few-shot learning setting given only a small number of samples per class, as this can lead to overfitting: this is one reason why most few-shot learning formulations, e.g. (Vinyals et al., 2016; Finn et al., 2017), consider a fixed number of target classes across tasks. However, we intend to learn a few-shot learning algorithm that overcomes this limitation. We propose using triplet loss (Weinberger et al., 2006; Schroff et al., 2015; Brunel et al., 2019) as the training objective, which allows for generalization to a varying number of classes without introducing any additional task-specific parameters, as detailed next.

4.1. Loss Function

Triplet loss relies on pairwise distance between representations of time series samples from within and across classes, irrespective of the number of classes. Using triplet loss at the time of fine-tuning for the test task, therefore, allows the network to adapt to a given few-shot classification task without introducing any additional task-specific parameters. Triplets consist of two matching time series and a non-matching time series such that the loss aims to separate the positive pair from the negative by a distance margin. Consider the set $\mathcal{P}$ of all valid triplets of time series for a training task, each of the form $(x^a, x^p, x^n)$ consisting of an anchor time series $x^a$, a positive time series $x^p$, and a negative time series $x^n$, where the positive time series is another instance from the same class as the anchor, while the negative is from a different class than the anchor. We aim to obtain corresponding representations $f(x^a; \phi)$, $f(x^p; \phi)$ and $f(x^n; \phi)$ such that the distance between the representations of an anchor and any positive time series is lower than the distance between the representations of the anchor and any negative time series.

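More specifically, we consider triplet loss based on Euclidean norm given by:

$$\|f(x^a; \phi) - f(x^p; \phi)\|_2^2 + \alpha < \|f(x^a; \phi) - f(x^n; \phi)\|_2^2, \quad \forall (x^a, x^p, x^n) \in \mathcal{P} \qquad (1)$$

where $f(x; \phi)$ is the representation of time series $x$ under network parameters $\phi$, $\mathcal{P}$ is the set of all valid triplets for the task, and $\alpha$ is the distance-margin between the positive and negative pairs. The loss to be minimized is then given by:

$$\mathcal{L} = \sum_{(x^a, x^p, x^n) \in \mathcal{P}} \left[ \|f(x^a; \phi) - f(x^p; \phi)\|_2^2 - \|f(x^a; \phi) - f(x^n; \phi)\|_2^2 + \alpha \right]_+ \qquad (2)$$

where $[z]_+ = \max(z, 0)$, such that only those triplets violating the constraint in Eq. 1 contribute to the loss. Note that since we use triplet loss for training, the number of instances per class must satisfy $K \geq 2$, so that every anchor has at least one positive.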
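As a rough illustration of the hinge-style triplet loss over all valid triplets, the sketch below enumerates every (anchor, positive, negative) combination for a small set of already-computed embeddings. It is a plain-Python toy under stated assumptions: the embedding values, the margins, and the names `sq_dist` and `triplet_loss` are hypothetical, and a real implementation would compute the embeddings with the network and differentiate through them.

```python
import itertools


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(embeddings, labels, margin):
    """Sum of hinge terms over all valid (anchor, positive, negative) triplets."""
    total = 0.0
    n = len(embeddings)
    for a, p in itertools.permutations(range(n), 2):
        if labels[a] != labels[p]:
            continue  # the positive must share the anchor's class
        for neg in range(n):
            if labels[neg] == labels[a]:
                continue  # the negative must come from a different class
            violation = (sq_dist(embeddings[a], embeddings[p])
                         - sq_dist(embeddings[a], embeddings[neg]) + margin)
            total += max(violation, 0.0)  # only violated triplets contribute
    return total


# Toy 2-way, 2-shot task with 2-D embeddings (illustrative values only).
emb = [[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]]
lab = [0, 0, 1, 1]
loss_small_margin = triplet_loss(emb, lab, margin=0.5)   # classes already separated
loss_large_margin = triplet_loss(emb, lab, margin=10.0)  # every triplet violates
```

Nothing in the function depends on how many distinct labels appear, which is the property that lets the same objective be used for tasks with different numbers of classes.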

5. Few-Shot Learning for UTSC

Figure 2. Few-Shot Training Approach.

We consider a meta-learning approach for few-shot UTSC based on Reptile (Nichol et al., 2018), a first-order gradient descent based meta-learning algorithm, and refer to that as FS-1. We also consider a simpler variant of this approach and refer to that as FS-2: similar to the training procedure of FS-1, FS-2 is also trained to solve multiple UTSC tasks but not explicitly trained in a manner that ensures quick adaptation to any new UTSC task. Except for the triplet loss, FS-2 is similar to (Serrà et al., 2018; Kashiparekh et al., 2019) in the way data is sampled and used for training.

5.1. FS-1

5.1.1. Objective

FS-1 learns an initialization for the parameters $\phi$ of the ResNet such that these parameters can be quickly optimized using gradient-based learning at test time to solve a new few-shot UTSC task, i.e. the model generalizes from a small number of examples from the test task. In order to learn the parameters $\phi$, we train the ResNet on a diverse set of UTSC tasks in the training meta-set with varying number of classes and time series lengths. As explained in Section 4, the same neural network parameters $\phi$ are shared across all tasks owing to the fact that: i. ResNet yields a fixed-dimensional representation for varying length time series, and ii. the nature of the loss function that does not require any changes due to the varying number of classes across tasks.

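Similar to (Finn et al., 2017; Nichol et al., 2018), we consider the following optimization problem: find an initial set of parameters $\phi$ for the ResNet, such that for a randomly sampled task $\mathcal{T}$ with corresponding loss $\mathcal{L}_{\mathcal{T}}$ as given in Eq. 2, the learner will have low loss after $k$ updates, such that:

$$\min_{\phi} \; \mathbb{E}_{\mathcal{T}} \left[ \mathcal{L}_{\mathcal{T}} \left( U^k_{\mathcal{T}}(\phi) \right) \right] \qquad (3)$$

where $U^k_{\mathcal{T}}$ is the operator (e.g. corresponding to Adam optimizer or SGD) that updates $\phi$ $k$ times using mini-batches from the labeled data of $\mathcal{T}$.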

5.1.2. Implementation Details

FS-1 sequentially samples few-shot tasks from the set of tasks in the training meta-set. As summarized in Algorithm 1 and depicted in Figure 2, the meta-learning procedure consists of $g$ meta-iterations. Each meta-iteration involves sampling $M$ $K$-shot tasks. Each task, in turn, is solved using $k$ steps of gradient-based optimization, e.g. using stochastic gradient descent (SGD) or Adam (Kingma and Ba, 2015); this, in turn, involves randomly sampling mini-batches from the instances in the task. Each task is associated with a triplet loss defined over the valid triplets as described in Section 4.1.

Given that each task has a varying number of instances owing to the varying number of classes $N$, we set the number of iterations for each task to $k = (N \times K / b) \times e$ (rounded to an integer), where $K$ is the number of instances per class, $b$ is the mini-batch size and $e$ is the number of epochs. Therefore, instead of fixing the number of iterations $k$ for each sampled task, we fix the number of epochs $e$ across datasets, such that the network is trained to adapt quickly in a fixed number of epochs, as described later. Also note that the number of triplets in each batch is significantly more than the number of unique time series in a mini-batch.

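Algorithm 1 Few-Shot UTSC Approach-1 (FS-1)
$\phi$: initial parameters of the ResNet
for meta-iteration $j = 1, \ldots, g$ do
     for $i = 1, \ldots, M$ do
         Sample a $K$-shot task $\mathcal{T}_i$
         Get number of classes $N_i$ for task $\mathcal{T}_i$
         Set $k = (N_i \times K / b) \times e$
         Compute $\tilde{\phi}_i = U^k_{\mathcal{T}_i}(\phi)$ using $k$ steps (mini-batches) of Adam to minimize loss $\mathcal{L}_{\mathcal{T}_i}$
     end for
     Update $\phi \leftarrow \phi + \epsilon \, \frac{1}{M} \sum_{i=1}^{M} (\tilde{\phi}_i - \phi)$
end for

Algorithm 2 Few-Shot UTSC Approach-2 (FS-2)
$\phi$: initial parameters of the ResNet
for iteration $j = 1, \ldots, g$ do
     for $i = 1, \ldots, M$ do
         Sample a $K$-shot task $\mathcal{T}_i$
         Get number of classes $N_i$ for task $\mathcal{T}_i$
         Set $k = (N_i \times K / b) \times e$
         Compute $\phi \leftarrow U^k_{\mathcal{T}_i}(\phi)$ using $k$ steps (mini-batches) of SGD or Adam to minimize loss $\mathcal{L}_{\mathcal{T}_i}$
     end for
end for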

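The filter weights of the ResNet are randomly initialized, e.g. via orthogonal initialization (Saxe et al., 2013). In the $j$-th meta-iteration, the ResNet for each of the $M$ tasks is initialized with the current parameters $\phi$. Each task $\mathcal{T}_i$ with its labeled data is solved by updating the parameters of the network $k$ times ($k = (N_i \times K / b) \times e$ for mini-batch size $b$ and $e$ epochs, where $N_i$ is the number of classes in $\mathcal{T}_i$) to obtain

$$\tilde{\phi}_i = U^k_{\mathcal{T}_i}(\phi) \qquad (4)$$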

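In practice, we use a batch version of the optimization problem in Equation 3 and use a meta-batch of $M$ tasks to update $\phi$ as follows:

$$\phi \leftarrow \phi + \epsilon \, \frac{1}{M} \sum_{i=1}^{M} (\tilde{\phi}_i - \phi) \qquad (5)$$

where $\tilde{\phi}_i$ are the parameters obtained after solving the $i$-th task and $\epsilon$ is the step size of the meta-update.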

Note that $k > 1$ implies that $\phi$ is updated using the updated values $\tilde{\phi}_i$ obtained after solving $M$ tasks for $k$ iterations each. It is because of this particular way of updating $\phi$, by internally solving multiple tasks, that this algorithm is considered an example of gradient descent based meta-learning. As shown in (Nichol et al., 2018), when performing multiple gradient updates as per Eqs. 4 and 5, i.e. having $k > 1$ while solving few-shot tasks, the expected update $\mathbb{E}_{\mathcal{T}}[U^k_{\mathcal{T}}(\phi)]$ is very different from taking a gradient step on the expected loss $\mathbb{E}_{\mathcal{T}}[\mathcal{L}_{\mathcal{T}}]$, i.e. having $k = 1$. In fact, it is easy to note that the update of $\phi$ consists of terms from the second-and-higher derivatives of $\mathcal{L}_{\mathcal{T}}$ due to the presence of derivatives of $\mathcal{L}_{\mathcal{T}}$ in $\tilde{\phi}_i$. Hence, the final solution using $k > 1$ is significantly different from the one obtained using $k = 1$.
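The difference between the two training loops can be sketched on a toy problem. The code below is an illustrative simplification, not the authors' implementation: each "task" is a quadratic loss $0.5\|\phi - c\|^2$ with a task-specific target $c$ instead of a triplet loss on a ResNet, the inner optimizer is plain SGD rather than Adam, and all names and hyperparameter values are hypothetical. `reptile_meta_step` moves the shared parameters toward the average of the task-adapted parameters, while `fs2_style_pass` keeps updating the same parameters task after task.

```python
import random


def inner_sgd(phi, target, k, lr):
    """k SGD steps on the toy task loss 0.5 * ||phi - target||^2."""
    phi = list(phi)
    for _ in range(k):
        grad = [p - t for p, t in zip(phi, target)]
        phi = [p - lr * g for p, g in zip(phi, grad)]
    return phi


def reptile_meta_step(phi, tasks, k, lr, epsilon):
    """Meta-update: move phi toward the mean of the task-adapted parameters."""
    adapted = [inner_sgd(phi, target, k, lr) for target in tasks]
    m = len(adapted)
    return [p + epsilon * sum(a[d] - p for a in adapted) / m
            for d, p in enumerate(phi)]


def fs2_style_pass(phi, tasks, k, lr):
    """Sequential variant: the same parameters are updated task after task."""
    for target in tasks:
        phi = inner_sgd(phi, target, k, lr)
    return phi


rng = random.Random(0)
# Hypothetical task distribution: targets drawn uniformly from [-1, 1]^2.
task_targets = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]

phi = [5.0, -5.0]
for _ in range(100):                      # meta-iterations
    batch = rng.sample(task_targets, 5)   # meta-batch of tasks
    phi = reptile_meta_step(phi, batch, k=3, lr=0.1, epsilon=0.5)
```

On this toy problem the meta-trained `phi` ends up close to the centre of the task targets, i.e. a point from which any sampled task can be reached in a few inner steps.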

5.1.3. Fine-tuning and inference in a test $K$-shot task

We denote the optimal parameters of the ResNet after meta-training as $\phi^*$, and use these as the initialization of the target task-specific ResNet. For any new $K$-shot $N$-way test task with labeled instances in its training set $D^{tr}$ and any test time series $x$ taken from its testing set $D^{te}$, first $\phi^*$ is updated to task-specific parameters $\tilde{\phi}$ using $D^{tr}$. The embedding of $x$ is then compared to the embeddings of all the samples in $D^{tr}$ using a 1NN classifier to get the class estimate.
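The inference step can be sketched as a nearest-neighbor lookup in embedding space. The snippet below assumes the embeddings have already been produced by the fine-tuned network; the vectors, class names, and the helper names `squared_distance` and `classify_1nn` are hypothetical and serve only to illustrate the idea.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def classify_1nn(query_embedding, support_embeddings, support_labels):
    """Return the label of the support embedding closest to the query."""
    nearest = min(
        range(len(support_embeddings)),
        key=lambda i: squared_distance(query_embedding, support_embeddings[i]),
    )
    return support_labels[nearest]


# Hypothetical embeddings of the labeled few-shot samples (three classes).
support = [[0.1, 0.2], [0.0, 0.3], [2.0, 2.1], [2.2, 1.9], [-3.0, 0.5]]
labels = ["walk", "walk", "run", "run", "sit"]

prediction = classify_1nn([1.9, 2.0], support, labels)
```

Since the classifier only compares distances, it needs no output layer whose size depends on the number of classes.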

5.2. FS-2

As shown in Algorithm 2, FS-2 is a simpler variant of FS-1 where, instead of updating the parameters $\phi$ by collectively using updated values from $M$ tasks, $\phi$ is continuously updated at each mini-batch irrespective of the task. As a result, the network is trained for a few iterations on a task, and then the task is changed. Unlike FS-1, FS-2 uses only the first-order derivatives of the loss $\mathcal{L}$.

6. Experimental Evaluation

6.1. Experimental Setup

6.1.1. Sampling few-shot UTSC tasks

We restrict the distribution of tasks to univariate TSC with a constraint on the maximum length of the time series such that . We sample tasks from the publicly available UCR Archive of UTSC datasets (Chen et al., 2015), where each dataset corresponds to an $N$-way multi-class classification task, and the number of classes $N$ and the length of time series $T$ vary across datasets. However, all the time series in any dataset are of the same length. Each time series is z-normalized using the mean and standard deviation of all the points in the time series.
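A minimal sketch of per-series z-normalization follows. The text does not say whether the population or sample standard deviation is used, so the population form is assumed here, and the handling of constant series (returning zeros) is likewise an assumption of this sketch.

```python
import math


def z_normalize(series):
    """Subtract the series mean and divide by its (population) standard deviation."""
    mean = sum(series) / len(series)
    variance = sum((v - mean) ** 2 for v in series) / len(series)
    std = math.sqrt(variance)
    if std == 0.0:
        # Assumption: a constant series is mapped to all zeros.
        return [0.0] * len(series)
    return [(v - mean) / std for v in series]


normalized = z_normalize([1.0, 2.0, 3.0, 4.0])
```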

Out of the total of 65 datasets on UCR Archive with , we use 18 datasets to sample tasks for the training meta-set and 6 datasets to sample tasks for the validation meta-set (dataset level splits are the same as in (Malhotra et al., 2017)). Any task in the training or validation meta-set has $K$ randomly sampled time series for each of the $N$ classes in the dataset. The remaining 41 datasets with length as listed in Table 1 are used to create tasks for the testing meta-set. As a result of this way of creating the training, validation and testing meta-sets, the classes in each meta-set are disjoint. However, the classes in the train and test sets of a task in the testing meta-set are, of course, the same.

Figure 3. Evaluation protocol for FS-1 and FS-2 on a UCR dataset. For ResNet, $\phi$ is randomly initialized for each task. $A_i$ is the accuracy on the $i$-th task.

Each dataset in the UCR Archive is an $N$-way classification problem with an original train and test split. As shown in Figure 3, we sample 100 $K$-shot tasks from each of the 41 datasets. Each task (out of the 100) sampled from a dataset contains $K$ samples from each of the $N$ classes for its training set $D^{tr}$ and $K^{te}$ samples from each of the $N$ classes for its testing set $D^{te}$; $D^{tr}$ and $D^{te}$ for each task are sampled from the respective original train and test split of the dataset. (We also considered the original test split for each test task during evaluation. We obtained similar conclusions under this evaluation strategy as well, and hence, omit those results for brevity.) The $K$ (or $K^{te}$) samples for each class in $D^{tr}$ (or $D^{te}$) are sampled uniformly from the entire set of samples of the respective class. While $D^{tr}$ is used for fine-tuning the meta-trained parameters $\phi^*$ to get task-specific parameters $\tilde{\phi}$, $D^{te}$ is used to evaluate the updated task-specific model. (Note that while the class distribution in the original dataset may not be uniform, each $K$-shot task consists of an equal number, i.e. $K$, of samples per class.)
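The task-sampling step described above can be sketched as drawing a fixed number of instances per class, uniformly and without replacement, from a split that may itself be imbalanced. The toy splits, the per-class test count of 3, and the function name `sample_k_shot` are hypothetical and only illustrate the protocol.

```python
import random


def sample_k_shot(instances_by_class, k, rng):
    """Draw k instances uniformly without replacement from every class."""
    task = []
    for label in sorted(instances_by_class):
        for series in rng.sample(instances_by_class[label], k):
            task.append((series, label))
    return task


rng = random.Random(0)

# Hypothetical, deliberately imbalanced train and test splits of one dataset.
train_split = {
    "a": [[float(i), float(i + 1)] for i in range(12)],
    "b": [[float(-i), 0.0] for i in range(7)],
    "c": [[0.5 * i, 1.0] for i in range(30)],
}
test_split = {
    "a": [[float(i), 2.0] for i in range(9)],
    "b": [[float(i), -2.0] for i in range(20)],
    "c": [[float(i), 0.0] for i in range(5)],
}

d_train = sample_k_shot(train_split, k=5, rng=rng)  # 5-shot training set
d_test = sample_k_shot(test_split, k=3, rng=rng)    # 3 test samples per class
```

Repeating the two calls with fresh random draws yields the multiple tasks per dataset over which accuracies are averaged.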

6.1.2. Hyperparameters for FS-1 and FS-2

On the basis of initial experiments on a subset of the training meta-set, we use the ResNet architecture with layers and 165 convolution filters per layer (33 filters each of length 4, 8, 16, 32, 64). We use the Adam optimizer with a learning rate of 0.0001 for updating $\phi$ on each task while using in the meta-update step in Equation 5. FS-1 and FS-2 are trained for a total of meta-iterations with meta-batch size of , and mini-batch size . We trained FS-1 and FS-2 using and for the tasks in training meta-set while is used for validation and testing meta-sets. across all experiments unless stated otherwise. We found the model with for tasks in training meta-set to be better based on average triplet loss on validation meta-set. We use epochs for solving each task while training FS-1 and FS-2 models. The number of epochs to be used while fine-tuning for tasks in the testing meta-set is chosen from the range 1-100 based on average triplet loss on tasks in the validation meta-set. We found 16 and 8 epochs to be best for FS-1 and FS-2 models, respectively. Therefore, $\phi^*$ is fine-tuned for the corresponding number of epochs for each task in the testing meta-set. For the triplet loss, we use .

6.1.3. Baselines Considered

For comparison, we consider the following baseline classifiers, each using 1NN as the final classifier over raw time series or extracted features (for DTW and BOSS, we use the implementations available at http://www.timeseriesclassification.com/code.php):

  1. ED: 1NN based on Euclidean distance is the simplest baseline considered, where a time series of length $T$ is represented by a fixed-dimensional vector of the same length. (Note: For any given dataset and subsequent tasks sampled from it, the length is the same across samples, and hence 1NN based on ED is applicable.)

  2. DTW: 1NN based on the dynamic time warping (DTW) approach is one of the highly effective and strong baselines for UTSC (Bagnall et al., 2017). We use leave-one-out cross-validation on the training set of each task to find the best warping window in the range , where is the window length and is the time series length.

  3. BOSS: Bag-of-SFA-Symbols (Schäfer, 2015) is a state-of-the-art time series feature extraction technique that provides time series representations while being tolerant to noise. BOSS provides a symbolic representation based on Symbolic Fourier Approximation (SFA) (Schäfer and Högqvist, 2012) on each fixed-length sliding window extracted from a time series while providing low pass filtering and quantization for noise reduction. The hyper-parameters, i.e. wordLength and normalization, are chosen based on leave-one-out cross validation over the ranges and respectively, while default values of the remaining hyper-parameters are used. 1NN is applied on the extracted features for the final classification decision.

  4. ResNet: Instead of using the meta-trained parameters $\phi^*$ obtained via FS-1 or FS-2 as a starting point for fine-tuning, we consider a ResNet-based baseline where the model is trained from scratch for each task using triplet loss. The architecture is the same as that used for FS-1 and FS-2 (also similar to state-of-the-art ResNet versions studied in (Wang et al., 2017; Fawaz et al., 2018b; Kashiparekh et al., 2019)). Given that each task has a very small number of training samples and the parameters are to be trained from scratch, ResNet architectures are likely to be prone to overfitting despite batch normalization. To mitigate this issue, apart from the same network architecture as FS-1 and FS-2, we also consider smaller networks with a smaller number of trainable parameters. More specifically, we considered four combinations resulting from number of layers and number of filters per layer , where and . We consider the model with the best overall results amongst these four combinations as the baseline, viz. number of layers = 2 and number of filters = 165. For fair comparison, each ResNet model is trained for 16 epochs, as for FS-1. (We also tried training till 32 epochs for ResNet and found insignificant improvement in results.)
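The ED and DTW baselines, together with leave-one-out selection of the warping window, can be sketched in plain Python as follows. This is an illustrative toy, not the implementation referenced above: the band is expressed as an absolute number of time steps, the pointwise cost is the squared difference, and the candidate windows 0 to 2 and the toy series are assumptions of this sketch. With a window of 0 and equal-length series, the DTW distance reduces to the Euclidean distance.

```python
import math


def dtw_distance(a, b, window):
    """DTW with a band of half-width `window` and squared pointwise cost."""
    n, m = len(a), len(b)
    window = max(window, abs(n - m))  # the band must at least cover the length gap
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


def loo_accuracy(train, window):
    """Leave-one-out 1NN accuracy on a labeled set for a given window."""
    correct = 0
    for i, (series, label) in enumerate(train):
        rest = train[:i] + train[i + 1:]
        pred = min(rest, key=lambda item: dtw_distance(series, item[0], window))[1]
        correct += pred == label
    return correct / len(train)


a = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
b = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]  # same bump, shifted by one step
euclidean = dtw_distance(a, b, window=0)
warped = dtw_distance(a, b, window=1)

train = [(a, "bump"), (b, "bump"),
         ([2.0, 1.0, 0.0, 0.0, 1.0, 2.0], "dip"),
         ([2.0, 2.0, 1.0, 0.0, 1.0, 2.0], "dip")]
best_window = max(range(0, 3), key=lambda w: loo_accuracy(train, w))
```

The shifted bump illustrates why a warping window helps: the point-by-point distance is 2.0, while allowing a one-step warp aligns the two series exactly.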

Figure 4. Classification accuracy rates comparison for 5-shot UTSC: (a) FS-1 vs ResNet, (b) FS-1 vs DTW, (c) FS-1 vs BOSS, (d) FS-1 vs FS-2. Each point in a scatter plot corresponds to a dataset.

6.1.4. Performance Metrics

Each task is evaluated using the classification accuracy rate on the test set: an inference is correct if the estimated label is the same as the ground-truth label. The performance result for each task equals the fraction of correctly classified test samples of that task. Further, we follow the methodology from (Demšar, 2006; Bagnall et al., 2017) to compare the proposed approach with the various baselines considered. For each dataset, we average the classification error results over 100 randomly sampled tasks (as described in Section 6.1.1). To study the relative performance of the approaches over multiple datasets, we compare classifiers by ranks using the Friedman test and a post-hoc pairwise Nemenyi test.
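
The rank-based comparison could be sketched as below. This is an illustrative pure-Python version under stated assumptions: the function names are hypothetical, the Friedman statistic and Nemenyi critical difference use the standard formulas described by Demšar (2006), and the critical value `q_alpha` must be looked up in the published table for the chosen significance level and number of classifiers (the value used in the example is the tabulated one for a 0.05 level and six classifiers, and should be verified before reuse).

```python
import math


def average_ranks(scores):
    """Rank one dataset's accuracies: rank 1 is the highest, ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def mean_ranks(table):
    """Average the per-dataset ranks; `table` has one row of accuracies per dataset."""
    per_dataset = [average_ranks(row) for row in table]
    return [sum(r[j] for r in per_dataset) / len(table) for j in range(len(table[0]))]


def friedman_statistic(mean_r, n_datasets):
    """Friedman chi-square statistic computed from mean ranks."""
    k = len(mean_r)
    spread = sum(r * r for r in mean_r) - k * (k + 1) ** 2 / 4
    return 12 * n_datasets / (k * (k + 1)) * spread


def nemenyi_cd(k, n_datasets, q_alpha):
    """Critical difference in mean rank for the Nemenyi post-hoc test."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))


# ECG5000 row of Table 1, in the order ED, DTW, BOSS, ResNet, FS-2, FS-1.
ecg5000_ranks = average_ranks([0.524, 0.494, 0.533, 0.533, 0.548, 0.533])
# [5.0, 6.0, 3.0, 3.0, 1.0, 3.0]

# Mean ranks reported in Table 1 for the 5-shot setting (41 datasets).
reported = [4.537, 3.463, 3.890, 3.305, 3.244, 2.561]
statistic = friedman_statistic(reported, n_datasets=41)
cd = nemenyi_cd(k=6, n_datasets=41, q_alpha=2.850)  # assumed tabulated value
```

Two approaches whose mean ranks differ by less than `cd` would not be considered significantly different under this test.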

Figure 5. Critical difference diagram comparing ranks of few-shot learning approaches (FS-1 and FS-2) with other baselines, with 5 samples per class used for fine-tuning.
Dataset N ED DTW BOSS ResNet FS-2 FS-1
50words 50 0.483 0.644 0.499 0.513 0.524 0.591
Adiac 37 0.538 0.540 0.709 0.539 0.674 0.671
Beef 5 0.618 0.626 0.701 0.519 0.595 0.653
BeetleFly 2 0.667 0.614 0.789 0.702 0.958 0.900
BirdChicken 2 0.468 0.496 0.921 0.692 1.000 0.929
Chlor.Conc. 3 0.339 0.338 0.356 0.342 0.331 0.329
Coffee 2 0.920 0.914 0.977 0.934 0.970 0.978
Cricket_X 12 0.348 0.567 0.491 0.555 0.544 0.594
Cricket_Y 12 0.375 0.556 0.461 0.505 0.516 0.562
Cricket_Z 12 0.357 0.560 0.481 0.523 0.541 0.598
Dist.Phal.O.A.G 3 0.710 0.698 0.658 0.709 0.705 0.664
Dist.Phal.O.C 2 0.571 0.583 0.575 0.609 0.569 0.588
Dist.Phal.TW 6 0.444 0.448 0.437 0.476 0.481 0.463
ECG200 2 0.771 0.755 0.728 0.712 0.738 0.758
ECG5000 5 0.524 0.494 0.533 0.533 0.548 0.533
ECGFiveDays 2 0.685 0.666 0.909 0.916 0.928 0.939
ElectricDevices 7 0.239 0.423 0.351 0.381 0.380 0.375
FaceAll 14 0.545 0.764 0.795 0.742 0.712 0.785
FaceFour 4 0.812 0.869 1.000 0.792 0.958 0.934
FordA 2 0.561 0.541 0.693 0.769 0.777 0.797
FordB 2 0.515 0.535 0.585 0.692 0.726 0.787
InsectW.B.Sound 11 0.489 0.473 0.398 0.485 0.452 0.487
Meat 3 0.919 0.919 0.876 0.559 0.880 0.890
MedicalImages 10 0.579 0.675 0.488 0.620 0.585 0.592
Mid.Phal.O.A.G 3 0.529 0.558 0.478 0.527 0.515 0.547
Mid.Phal.O.C 2 0.563 0.550 0.526 0.540 0.531 0.529
Mid.Phal.TW 6 0.338 0.339 0.348 0.341 0.351 0.353
PhalangesO.C 2 0.532 0.535 0.512 0.544 0.536 0.539
Prox.Phal.O.A.G 3 0.692 0.719 0.731 0.729 0.697 0.682
Prox.Phal.O.C 2 0.633 0.626 0.645 0.650 0.638 0.634
Prox.Phal.TW 6 0.427 0.445 0.419 0.517 0.411 0.432
Strawberry 2 0.682 0.671 0.714 0.722 0.755 0.741
SwedishLeaf 15 0.599 0.690 0.776 0.765 0.778 0.776
synthetic_control 6 0.736 0.958 0.867 0.960 0.948 0.971
Two_Patterns 4 0.361 0.970 0.692 0.874 0.811 0.831
uWave_X 8 0.591 0.615 0.479 0.598 0.546 0.606
uWave_Y 8 0.504 0.518 0.363 0.478 0.430 0.478
uWave_Z 8 0.536 0.551 0.489 0.570 0.541 0.599
wafer 2 0.922 0.922 0.936 0.911 0.894 0.892
Wine 2 0.496 0.493 0.571 0.562 0.631 0.578
yoga 2 0.505 0.525 0.548 0.501 0.546 0.528
W/T/L of FS-1 32/0/9 27/0/14 30/2/9 26/2/13 24/0/17 -
Mean Arithmetic Rank 4.537 3.463 3.890 3.305 3.244 2.561
Table 1. Comparison of classification accuracy rates for the 5-shot learning scenario. Best approach is marked in bold and second-best is underlined. N denotes the number of classes; FS-1 and FS-2 are the proposed approaches.
K ED DTW BOSS ResNet FS-2 FS-1
2 4.232 2.976 3.902 3.805 3.207 2.878
5 4.537 3.463 3.890 3.305 3.244 2.561
10 4.573 3.476 3.646 3.683 3.427 2.195
20 4.439 3.354 2.927 3.902 3.793 2.585
Table 2. Comparison of various approaches in terms of ranks over classification accuracy rates on all the 4100 tasks from 41 datasets with a varying number of training samples per class K. Best approach is marked in bold and second-best is underlined.
N Datasets ED DTW BOSS ResNet FS-2 FS-1
2-5 24 4.167 4.083 3.375 3.458 3.042 2.875
6-10 9 4.778 2.333 5.333 2.389 3.778 2.389
>10 8 5.375 2.875 3.812 3.902 3.875 1.812
Overall 41 4.537 3.463 3.890 3.305 3.244 2.561
Table 3. Comparison of ranks across datasets with a varying number of classes N in the 5-shot task. The second column gives the number of datasets in each group.

6.2. Results and Observations

  • As shown in Figure 5, we observe that FS-1 improves upon all the baselines considered for 5-shot tasks. The pairwise comparisons of FS-1 with other baselines in Figure 4 show significant gains in accuracy across many datasets. FS-1 has Win/Tie/Loss (W/T/L) counts of 26/2/13 when compared to the best non-few-shot-learning model, i.e. ResNet. On 27/41 datasets, FS-1 is amongst the top-2 models. Refer to Table 1 for detailed dataset-wise results. Our approach FS-2, which uses a simpler update rule than FS-1, is the second-best model but is very closely followed by the ResNet models trained from scratch.

  • To study the effect of the number of training samples per class K available in the end task, we consider K ∈ {2, 5, 10, 20} (with the rest of the task setup kept the same), and experiment under the same protocol of 4100 tasks (with 100 tasks sampled from each of the 41 datasets). As observed from the rank comparison in Table 2:

    • FS-1 is the best performing model, especially for 5 and 10-shot scenarios with large gaps in ranks.

    • When considering a very small number of training samples per class, i.e. K = 2, we observe that FS-1 is still the best model, although it is very closely followed by DTW. This is expected: given just two samples per class, it is very difficult to effectively learn any data distribution patterns, especially when the domain of the task is unseen during training. The fact that FS-1 and FS-2 still perform significantly better than ResNet models trained from scratch shows the generic nature of the filters learned during pre-training. As expected, data-intensive machine learning and deep learning models like BOSS and ResNet that are trained from scratch only on the target task data tend to overfit, and are even worse than DTW.

    • For tasks with a larger number of training samples per class, i.e. K = 20, FS-1 is still the best algorithm. As expected, the machine learning based state-of-the-art model BOSS performs better than the other baselines when sufficient training samples are available, and is closer to FS-1.

  • To study the generalizability of FS-1 to a varying number of classes N as a result of leveraging triplet loss, we group the datasets based on N. As shown in Table 3, we observe that FS-1 is consistently amongst the top-2 models across values of N. While FS-1 is significantly better than other algorithms for N between 2 and 5 and for N > 10, it is as good as the best algorithm, DTW, for N between 6 and 10.
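
As a small illustration of how the Win/Tie/Loss counts above are obtained, the following sketch recomputes the FS-1 vs ResNet comparison from the two corresponding columns of Table 1. The function name `win_tie_loss` is hypothetical; the accuracies are copied from the table.

```python
# (dataset, ResNet accuracy, FS-1 accuracy) copied from Table 1 (5-shot setting).
ROWS = [
    ("50words", 0.513, 0.591), ("Adiac", 0.539, 0.671),
    ("Beef", 0.519, 0.653), ("BeetleFly", 0.702, 0.900),
    ("BirdChicken", 0.692, 0.929), ("Chlor.Conc.", 0.342, 0.329),
    ("Coffee", 0.934, 0.978), ("Cricket_X", 0.555, 0.594),
    ("Cricket_Y", 0.505, 0.562), ("Cricket_Z", 0.523, 0.598),
    ("Dist.Phal.O.A.G", 0.709, 0.664), ("Dist.Phal.O.C", 0.609, 0.588),
    ("Dist.Phal.TW", 0.476, 0.463), ("ECG200", 0.712, 0.758),
    ("ECG5000", 0.533, 0.533), ("ECGFiveDays", 0.916, 0.939),
    ("ElectricDevices", 0.381, 0.375), ("FaceAll", 0.742, 0.785),
    ("FaceFour", 0.792, 0.934), ("FordA", 0.769, 0.797),
    ("FordB", 0.692, 0.787), ("InsectW.B.Sound", 0.485, 0.487),
    ("Meat", 0.559, 0.890), ("MedicalImages", 0.620, 0.592),
    ("Mid.Phal.O.A.G", 0.527, 0.547), ("Mid.Phal.O.C", 0.540, 0.529),
    ("Mid.Phal.TW", 0.341, 0.353), ("PhalangesO.C", 0.544, 0.539),
    ("Prox.Phal.O.A.G", 0.729, 0.682), ("Prox.Phal.O.C", 0.650, 0.634),
    ("Prox.Phal.TW", 0.517, 0.432), ("Strawberry", 0.722, 0.741),
    ("SwedishLeaf", 0.765, 0.776), ("synthetic_control", 0.960, 0.971),
    ("Two_Patterns", 0.874, 0.831), ("uWave_X", 0.598, 0.606),
    ("uWave_Y", 0.478, 0.478), ("uWave_Z", 0.570, 0.599),
    ("wafer", 0.911, 0.892), ("Wine", 0.562, 0.578),
    ("yoga", 0.501, 0.528),
]


def win_tie_loss(rows):
    """Count datasets where the second method beats, equals, or trails the first."""
    wins = sum(1 for _, base, ours in rows if ours > base)
    ties = sum(1 for _, base, ours in rows if ours == base)
    losses = sum(1 for _, base, ours in rows if ours < base)
    return wins, ties, losses


counts = win_tie_loss(ROWS)  # (26, 2, 13), matching the W/T/L row of Table 1
```

The two ties are ECG5000 and uWave_Y, where both methods report identical accuracy at the three-decimal precision of the table.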

6.2.1. Importance of fine-tuning different layers in deep ResNet

We also study the importance of fine-tuning different convolutional layers of FS-1. We consider four variants FS-1-l with l ∈ {1, 2, 3, 4}, where we freeze the parameters of the l lowermost convolutional layers of the pre-trained model while fine-tuning only the layers above them. From Figure 6, we observe that FS-1-1, i.e. where the filter weights of only the first convolutional layer are frozen while those of all higher layers are fine-tuned, performs better than the default FS-1 model where all layers are fine-tuned. On the other hand, freezing higher layers as well (FS-1-2 and FS-1-3) or freezing all the layers (FS-1-4, i.e. no fine-tuning on the target task) leads to a significant drop in classification performance. These results indicate that the first layer has learned generic features while being trained on a diverse set of few-shot tasks, and that the higher layers of the FS-1 model are important for quickly adapting to the target few-shot task.
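
The layer-freezing variants can be pictured with the toy sketch below. It is only an illustration under stated assumptions: the layer names, the flat weight lists, and the plain SGD update are hypothetical stand-ins and do not reflect the actual network or optimizer used for fine-tuning.

```python
def trainable_mask(layer_names, n_frozen):
    """Map each layer to True (fine-tuned) or False (frozen).

    `layer_names` is ordered from the lowermost layer upwards, so the first
    `n_frozen` entries are the ones kept fixed.
    """
    return {name: idx >= n_frozen for idx, name in enumerate(layer_names)}


def sgd_step(params, grads, mask, lr):
    """Apply one plain SGD update, leaving every frozen layer unchanged."""
    updated = {}
    for name, weights in params.items():
        if mask[name]:
            updated[name] = [w - lr * g for w, g in zip(weights, grads[name])]
        else:
            updated[name] = list(weights)
    return updated


# Hypothetical layer names and toy weights, for illustration only.
LAYERS = ["conv1", "conv2", "conv3", "conv4"]
params = {name: [1.0, 2.0] for name in LAYERS}
grads = {name: [2.0, 2.0] for name in LAYERS}

mask = trainable_mask(LAYERS, n_frozen=1)  # freeze only the first layer
params = sgd_step(params, grads, mask, lr=0.5)
```

Setting `n_frozen=0` corresponds to fine-tuning every layer, and `n_frozen=len(LAYERS)` to using the pre-trained model without any fine-tuning.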

Figure 6. Effect of freezing parameters of different layers while fine-tuning for target few-shot task using FS-1.

6.2.2. Few-shot learning to adapt to new classes for a given dataset

Dataset ED DTW BOSS ResNet FS-2 FS-1
50Words 0.614 0.812 0.713 0.733 0.719 0.784
Adiac 0.723 0.692 0.791 0.652 0.808 0.827
ShapesAll 0.854 0.897 0.942 0.915 0.924 0.958
Table 4. Results on 5-shot 5-way classification tasks using dataset-specific pre-training.

Apart from the above scenario where the UCR datasets used to sample tasks in training, validation and testing meta-sets are different, we also consider a scenario (similar to (Vinyals et al., 2016)) where there are a large number of classes within a TSC dataset, and the goal is to quickly adapt to a new set of classes given a model that has been pre-trained on another disjoint set of classes from the same dataset.

We consider three datasets with a large number of classes from the UCR Archive, namely, 50Words, Adiac and ShapesAll, containing 50, 37, and 60 classes, respectively. We use half of the classes (randomly chosen) to form the training meta-set, 1/4th of the classes for the validation meta-set, and the remaining 1/4th of the classes for the testing meta-set. We train the FS-1 and FS-2 models on 5-shot 5-way TSC tasks from the training meta-set. We choose the best meta-iteration based on the average triplet loss on the validation meta-set (also containing 5-shot 5-way classification tasks). Note that ED, DTW and BOSS are trained on the respective task from the testing meta-set only. Also, whenever the number of samples for a class is less than 5, we take all samples for that class in all tasks. The average classification accuracy rates on 100 5-shot 5-way tasks from the testing meta-set are shown in Table 4. We observe that FS-1 outperforms the other approaches, indicating the ability to quickly generalize to new classes for a given domain.
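
The class split and task sampling described above could look roughly like the sketch below. The function names and toy data are hypothetical, and the rounding for class counts not divisible by four (floor division, with the remainder going to the testing meta-set) is an assumption of this sketch and is not specified in the text. Only the training part of a task is sampled here; how the test samples of each task are drawn is left out.

```python
import random


def split_classes(classes, rng):
    """Shuffle classes, then split them into 1/2 train, 1/4 validation, rest test."""
    shuffled = list(classes)
    rng.shuffle(shuffled)
    n_train = len(shuffled) // 2
    n_val = len(shuffled) // 4
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def sample_task_train_set(samples_by_class, meta_classes, n_way, k_shot, rng):
    """Sample the training samples of one n_way, k_shot task from a meta-set.

    A class with fewer than k_shot samples contributes all of its samples.
    """
    task_classes = rng.sample(meta_classes, n_way)
    train = []
    for label in task_classes:
        pool = samples_by_class[label]
        chosen = list(pool) if len(pool) < k_shot else rng.sample(pool, k_shot)
        train.extend((series, label) for series in chosen)
    return train


rng = random.Random(0)
train_cls, val_cls, test_cls = split_classes(range(60), rng)  # 60 classes, as in ShapesAll

# Toy data: even-numbered classes have 3 samples, odd-numbered classes have 8.
toy = {
    c: [[float(c), float(i)] for i in range(3 if c % 2 == 0 else 8)]
    for c in range(60)
}
task = sample_task_train_set(toy, test_cls, n_way=5, k_shot=5, rng=rng)
```

Because the three class lists are disjoint, a model pre-trained on tasks from `train_cls` never sees the classes it is later fine-tuned and evaluated on.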

6.2.3. Non-few-shot learning scenario

Figure 7. Non-few-shot learning scenario using original train-test splits from UCR Archive: (a) critical difference diagram, (b) FS-1 vs second best method (ResNet).

We also evaluate FS-1 when sufficient labeled data is available for training, i.e. the standard non-few-shot learning scenario with original class distributions and train-test splits as provided in (Chen et al., 2015). As shown in Figure 7(a), we observe that the meta-learned FS-1 outperforms other approaches even in the non-few-shot scenario, indicating the benefit of meta-learning based initialization. Furthermore, when compared to the results in Figure 5, we observe an increased performance gap between the deep learning approaches (FS-1, FS-2 and ResNet) and the other approaches (BOSS, DTW, ED) due to the availability of sufficient training data. We provide a scatter-plot comparison of FS-1 with the second-best approach, ResNet, in Figure 7(b) and omit other dataset-wise results for lack of space.

7. Conclusion and Future Work

The ability to quickly adapt to any given time series classification task with a small number of labeled samples is an important capability with several practical applications. We have proposed a meta-learning approach for few-shot time series classification (TSC). It can also be seen as a data-efficient metric learning mechanism that leverages a pre-trained model. We have shown that it is possible to train a model on few-shot tasks from diverse domains such that the model acquires the ability to quickly generalize and solve few-shot tasks from previously unseen domains. By leveraging the triplet loss, we are able to generalize across classification tasks with different numbers of classes.

We hope that this work opens a promising direction for future research in meta-learning for time series modeling. In this work, we have explored first-order meta-learning algorithms. In the future, it would be interesting to explore more sophisticated meta-learning algorithms such as (Finn et al., 2017; Rusu et al., 2019; Finn et al., 2018) for few-shot TSC. A similar approach for time series forecasting would be interesting to explore as well.

References