I. Introduction
In the current digital era, time series data is ubiquitous owing to the widespread adoption of Internet of Things technology, with applications across several domains such as healthcare, equipment health monitoring, meteorology, and demand forecasting. Time series classification (TSC) has several practical applications, such as those in healthcare (e.g., real-time monitoring, disease diagnosis using time series of physiological parameters, classifying heart arrhythmia in ECG) and fault diagnostics using sensor data from equipment (e.g., determining the type of fault from sensor data).
Deep learning approaches, such as those based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs), have proven to be very effective for univariate time series classification (UTSC) [1, 2, 3]. Deep CNNs have yielded some of the state-of-the-art models [1, 3] for TSC. However, it is well-known that training deep networks requires significant hyperparameter tuning effort and expertise, demands high computational resources, and is prone to overfitting, especially when a large labeled training dataset is not available.
Transfer learning [4, 5] is known to be an effective way to address some of the above-mentioned challenges in training deep neural networks: it enables knowledge transfer from a source task with sufficient training instances to a related target task with fewer training instances, for example, by training a deep neural network model on source task(s) with large labeled data, and adapting this model for the target task using a small amount of labeled data from the target task. This approach of fine-tuning a pretrained network for the target task is often faster and easier than training a network from scratch (starting with randomly initialized weights), which typically requires computationally expensive hyperparameter tuning [5]. For example, it is well-established that training a deep CNN on a diverse set of images results in generic filters that can provide useful features for target tasks with images from unseen domains [6].
Recently, transfer learning for TSC using deep neural networks has been explored, e.g., using RNNs in [7, 8, 9] and using CNNs in [10, 11]. These approaches pretrain a deep network on time series from diverse domains, and then either use it as a time series feature extractor for the target task, as in TimeNet [7, 8], or use the pretrained network to initialize the parameters of the neural network for the target task [10, 11, 9]. When pretraining a deep network on time series from diverse domains, the rate of change of relevant information in the time series can vary significantly across tasks and domains. We note that CNN-based TSC architectures, as proposed in [1, 3], extract local information at only one time scale determined by a single fixed filter size, limiting the flexibility of the model. The filter size of a convolutional layer should, therefore, be chosen carefully to extract relevant features depending on the domain and target task. Indeed, hand-crafted transformations such as smoothing and down-sampling of time series to learn features at various time scales have been shown to be useful for TSC in Multi-scale CNNs [12]. We hypothesize that this aspect is even more relevant in a transfer learning setting when adapting a pretrained deep CNN with its convolutional filters to a target domain. For example, training a common network on diverse tasks whose time series lengths vary over a wide range warrants taking the varying relevant time scales into account: a short filter may be sufficient to capture relevant features for datasets with short time series, whereas a longer filter may be more appropriate for datasets with long time series.
In this work, we propose ConvTimeNet (CTN), a deep CNN-based transfer learning approach for UTSC. Every convolutional layer of CTN contains 1D convolutional filters of multiple lengths (similar to InceptionNet [13, 14]), resulting in filters that can capture features at multiple time scales. The key contributions of this work can be summarized as follows:

- We propose ConvTimeNet (CTN), a novel pretrained deep CNN for univariate time series classification tasks.
- We demonstrate that fine-tuning CTN for target tasks outperforms a deep CNN trained from scratch in terms of classification accuracy.
- We demonstrate that CTN can cater to diverse time series of varying lengths by using 1D convolutional filters of multiple lengths to capture features at different time scales.
- We demonstrate that fine-tuning CTN is computationally efficient compared to training a deep CNN from scratch.
- We report state-of-the-art results for UTSC on the UCR TSC benchmark [15], considering all datasets with time series length up to 512, while also significantly improving upon our RNN-based TimeNet [16].
II. Related Work
Several approaches for UTSC have been reviewed in [17]. A plethora of research exists on feature-based approaches, i.e., methods that extract a set of features representing time series patterns, as reviewed in [17]. COTE (Collective of Transformation-based Ensembles), ST (Shapelet Transform), PF (Proximity Forest) and BOSS (Bag-of-SFA-Symbols) are considered to be the state-of-the-art non-deep-learning algorithms for UTSC [17]. Although COTE is one of the most accurate classifiers, its training time complexity is large, growing rapidly with the number of training samples and the time series length. Whereas most of these approaches extract features using data from the UTSC task at hand, our proposed approach aims to learn generic multi-timescale features via the filters of a CNN, which can be useful on time series from unseen domains in a transfer learning setting.
Recently, several deep learning architectures based on long short-term memory (LSTM) networks, CNNs, and their combinations have been proposed for univariate TSC (e.g., [12, 1, 2]). To overcome overfitting issues and achieve better generalizability, data augmentation methods have been proposed: combining datasets of similar length across domains [18], using simulated data [19], window slicing, warping, and mixing [18, 12], etc. Decorrelating the filters of CNNs has also recently been shown to be effective in reducing overfitting [20]. In contrast, we consider transfer learning to achieve better generalizability, by pretraining a model on large labeled datasets and then fine-tuning it for the end (target) task with potentially less labeled data.
Several approaches for transfer learning exist in other domains such as computer vision and natural language processing, e.g., via fine-tuning [21, 22]. In the context of time series classification applications, only a few instances of leveraging transfer learning to achieve better generalizability with deep learning models have been considered, e.g., [7, 10, 11]. [23] reports significant improvements by combining pre-specified features from Fourier, wavelet, and other transformations of the time series signals with deep learning features from TimeNet [16]. However, none of these approaches consider the inherent need for multi-scale learning when training a common model across domains with widely varying time series. [14] attempts to address this need, but not in transfer learning scenarios. In this work, we propose CTN, which uses filters of multiple lengths, yielding significant improvements over fixed-filter-length CNN models for transfer learning.

III. ConvTimeNet
III-A Overview
Consider a univariate time series $x = (x_1, x_2, \ldots, x_T)$ with $x_t \in \mathbb{R}$ for $t = 1, \ldots, T$, $T$ being the length of the time series. Further, consider a labeled dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$ having $N$ samples, where the ground-truth class label $y_i \in \{1, \ldots, K\}$, $K$ being the number of classes. The goal of a UTSC model trained on $D$ is to predict a probability vector $\hat{\mathbf{y}} \in [0, 1]^{K}$ corresponding to the ground-truth one-hot vector for a test time series $x$. In this work, we propose CTN, a deep convolutional neural network (CNN) based UTSC model that is trained on a source set $\mathcal{D} = \{D_1, \ldots, D_M\}$ of $M$ UTSC source datasets. Once trained, CTN can be adapted to a new target UTSC task with a small labeled dataset via suitable fine-tuning.

III-B CTN Architecture
As depicted in Fig. 1, the architecture of CTN is fairly simple: it consists of multiple convolutional blocks followed by a Global Average Pooling (GAP) layer [24], as detailed next. CTN is trained via additional multi-head fully connected (FC) and softmax layers, as detailed in Section III-C.
III-B1 Convolutional blocks with filters of multiple lengths
Consider a CTN with $L$ convolutional blocks. The $l$-th convolutional block consists of 1D convolution filters of varying lengths (e.g., filters of exponentially varying lengths), as shown in Fig. 2. This allows CTN to extract and combine features from different time scales, improving the generalization of the network across diverse TSC tasks (as shown empirically in Section IV-E).
For a time series $x$ of length $T$, the input tensor to the $l$-th convolutional block is $\mathbf{z}^{l-1} \in \mathbb{R}^{T \times c_{l-1}}$ with $c_{l-1}$ channels (note: $c_0 = 1$, corresponding to the univariate input time series). A filter of length $k$ in layer $l$ is represented by a tensor $\mathbf{w}^{l}_{f} \in \mathbb{R}^{k \times c_{l-1}}$, where $f = 1, \ldots, c_l$ indexes the filters. The feature map obtained using the $f$-th filter is $\mathbf{a}^{l}_{f} = \mathbf{w}^{l}_{f} * \mathbf{z}^{l-1} + b^{l}_{f}$, where $*$ is the convolution operation and $b^{l}_{f}$ is a scalar bias (we use zero-padding to keep the lengths of the input and output the same). The output tensor consisting of the feature maps from all $c_l$ filters is represented by $\mathbf{A}^{l} \in \mathbb{R}^{T \times c_l}$. Note that the filter length $k$ varies across the filters of a block, e.g., as illustrated in Fig. 2. We use an equal number of filters for each length, so that there are $c_l / n_k$ filters for each of the $n_k$ distinct lengths.
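The sketch below (illustrative, not the authors' released code) shows such a multi-length convolutional block in PyTorch; the default kernel sizes are placeholders, since the paper's exact filter lengths are not reproduced here, while the 33 filters per length follow the hyperparameter section.

```python
import torch
import torch.nn as nn

class MultiLengthConvBlock(nn.Module):
    """Parallel 1D convolutions with different kernel sizes; their feature maps
    are concatenated along the channel dimension (Type-1 block: conv-BN-ReLU)."""
    def __init__(self, in_channels, filters_per_length=33,
                 kernel_sizes=(4, 8, 16, 32, 64)):   # kernel sizes are placeholders
        super().__init__()
        # padding="same" mirrors the zero-padding that keeps input/output lengths equal
        self.convs = nn.ModuleList(
            nn.Conv1d(in_channels, filters_per_length, k, padding="same")
            for k in kernel_sizes)
        out_channels = filters_per_length * len(kernel_sizes)
        self.bn = nn.BatchNorm1d(out_channels)
        self.relu = nn.ReLU()
        self.out_channels = out_channels

    def forward(self, x):                 # x: (batch, in_channels, T)
        z = torch.cat([conv(x) for conv in self.convs], dim=1)
        return self.relu(self.bn(z))      # (batch, out_channels, T)
```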
We use residual connections [25] across blocks to allow gradients to flow directly to lower layers, enabling the training of deep networks. Depending on whether the output of a block is to be added to the output of a previous layer via a residual connection or not, there are two types of convolutional blocks, Type-1 and Type-2, as shown in Fig. 2. For convolutional blocks of Type-1, $\mathbf{A}^{l}$ is passed through a batch normalization (BN) layer [26] and a Rectified Linear Unit (ReLU) layer (where $\mathrm{ReLU}(v) = \max(0, v)$ is applied element-wise) to obtain $\mathbf{z}^{l}$. The structure of the convolutional blocks of Type-2 differs from that of Type-1 in the sense that $\mathbf{A}^{l}$ is processed by the BN layer but not by a ReLU layer thereafter. Instead, a residual connection is used: the output of an earlier block is added element-wise to the BN output, after being processed via an optional convolutional layer to match dimensions and enable the element-wise addition, and the result is finally passed through a ReLU layer to obtain $\mathbf{z}^{l}$.

III-B2 GAP layer to obtain a fixed-dimensional vector for time series of varying lengths
For classification tasks, a standard CNN approach would flatten the output of the last convolutional layer to obtain a $T \cdot c_L$-dimensional vector, and further use FC layer(s) before a final softmax layer. For long time series, i.e., large $T$, this approach leads to a significantly large number of trainable parameters that grows linearly with $T$. Instead, we pass the output of the final convolutional block through a Global Average Pooling (GAP) layer that averages each feature map along the time dimension (as used in, e.g., [1] and [3]). More specifically, the GAP layer maps $\mathbf{z}^{L} \in \mathbb{R}^{T \times c_L}$ to a vector $\mathbf{h} \in \mathbb{R}^{c_L}$ by taking a simple average of the $T$ values in each of the $c_L$ feature maps, thereby drastically reducing the number of trainable parameters.
In a nutshell, CTN takes as input a univariate time series of length $T$ and converts it to a fixed-dimensional feature vector $\mathbf{h}$ of length $c_L$, which is subsequently passed to a multi-head FC layer followed by a softmax layer for training the various layers of CTN, as described next and summarized in Algorithm 1.
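The following sketch (again illustrative, reusing the block defined above) shows how the stacked blocks and the GAP layer turn a variable-length series into a fixed-dimensional vector; residual connections are omitted for brevity, and the four-block configuration follows the setup described later.

```python
class CTNBackbone(nn.Module):
    """Stacked multi-length convolutional blocks followed by global average pooling."""
    def __init__(self, num_blocks=4, filters_per_length=33,
                 kernel_sizes=(4, 8, 16, 32, 64)):   # kernel sizes are placeholders
        super().__init__()
        blocks, in_ch = [], 1                        # univariate input => 1 channel
        for _ in range(num_blocks):
            block = MultiLengthConvBlock(in_ch, filters_per_length, kernel_sizes)
            blocks.append(block)
            in_ch = block.out_channels
        self.blocks = nn.Sequential(*blocks)
        self.out_dim = in_ch                         # c_L, the GAP output dimension

    def forward(self, x):                            # x: (batch, 1, T), T may vary
        h = self.blocks(x)                           # (batch, c_L, T)
        return h.mean(dim=-1)                        # GAP over time -> (batch, c_L)
```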
III-C Training CTN
Hereafter, we use $\theta$ to refer to the set of all trainable parameters of CTN, consisting of the filter weights $\mathbf{w}^{l}_{f}$, the biases $b^{l}_{f}$, and the BN parameters, for $l = 1, \ldots, L$ and $f = 1, \ldots, c_l$. In order to learn the parameters $\theta$, we train CTN over the diverse set of time series classification tasks in $\mathcal{D}$, with varying numbers of classes and time series lengths, by adopting a multi-head learning strategy: the core neural network (CTN) is common across the source tasks, while the task-specific parameters $\mathbf{W}_j$ of the FC layer before the softmax layer are learned independently for each source task. A labeled training dataset $D_j$ ($j = 1, \ldots, M$) consists of $N_j$ samples and corresponds to a $K_j$-class classification problem. Since each dataset has a different number of classes, we use $M$ FC and softmax layers, one for each dataset, as shown in Fig. 3, with the $j$-th head mapping $\mathbf{h}$ to $K_j$ probability values.
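A minimal sketch of this multi-head setup (names are illustrative), sharing one CTN backbone across the source datasets:

```python
class MultiHeadCTN(nn.Module):
    """Shared CTN backbone with one FC head per source dataset."""
    def __init__(self, backbone, classes_per_dataset):
        super().__init__()
        self.backbone = backbone
        # Head j maps the GAP feature vector to K_j logits for dataset j.
        self.heads = nn.ModuleList(
            nn.Linear(backbone.out_dim, k) for k in classes_per_dataset)

    def forward(self, x, dataset_idx):
        h = self.backbone(x)                 # shared representation (batch, c_L)
        return self.heads[dataset_idx](h)    # logits for the selected dataset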
Since the number of samples can vary significantly across datasets, we train CTN on $n_b$ randomly sampled batches of size $B$ for each of the $M$ datasets in an epoch. Each epoch, therefore, considers $n_b \times B$ training samples from each dataset, resulting in a total of $M \times n_b \times B$ training samples per epoch. The order in which the datasets are iterated within an epoch is decided randomly. In turn, the batches of a dataset $D_j$ are processed together (one after the other), while updating $\theta$ and $\mathbf{W}_j$ in each iteration using stochastic gradient descent to minimize the cross-entropy loss:

$$\mathcal{L} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{k=1}^{K_j} y_{ik} \log(\hat{p}_{ik}) \qquad (1)$$

where $\hat{p}_{ik}$ is the probability that the $i$-th time series instance in the batch belongs to class $k$, with $y_{ik} = 1$ for the target class and $y_{ik} = 0$ otherwise.
Note that while the parameters $\theta$ are updated during each of the iterations in an epoch, the task-specific parameters $\mathbf{W}_j$ are updated only during the iterations of dataset $D_j$ in that epoch and stored thereafter, until they are reused and updated during the next epoch when processing the batches from $D_j$. By using the same filters for all the UTSC tasks, the learned filters are likely to capture generic time series trends, patterns, and features that are potentially useful for time series from other domains.
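An illustrative epoch loop following this description (data loaders, batch counts, and optimizer configuration are placeholders; each loader is assumed to yield at least `batches_per_dataset` batches):

```python
import random
import torch.nn.functional as F

def train_one_epoch(model, loaders, optimizer, batches_per_dataset):
    """loaders[j] yields (x, y) batches for source dataset j."""
    order = list(range(len(loaders)))
    random.shuffle(order)                      # random dataset order within the epoch
    for j in order:                            # batches of dataset j processed together
        batch_iter = iter(loaders[j])
        for _ in range(batches_per_dataset):
            x, y = next(batch_iter)            # x: (B, 1, T_j), y: class indices
            logits = model(x, dataset_idx=j)   # shared theta + dataset-specific head W_j
            loss = F.cross_entropy(logits, y)  # Eq. (1)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```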
III-D Fine-tuning CTN for a target dataset
We first describe how to fine-tune CTN for a new UTSC task, i.e., for a new dataset with a different set of target classes, and then discuss how to use this procedure (i) for finding the best parameters $\theta^{*}$ via hold-out validation, as well as (ii) for transfer learning. For a new UTSC task, we train a task-specific FC layer (followed by softmax) on top of the GAP layer of CTN, while also updating the parameters $\theta$, as shown in Fig. 4. The FC layer parameters and $\theta$ are updated together using the cross-entropy loss function as in Equation 1.

III-D1 Fine-tuning for validation
Rather than the standard approach of using held-out instances from the training datasets to build a validation set, we use different datasets for validation: we fine-tune CTN for held-out unseen datasets independently, one at a time, and then use the average loss across these validation datasets as the validation loss (described later in this section). This way of validating CTN mimics the transfer learning scenario where the goal is to adapt CTN to a new dataset, and therefore yields a CTN model that is likely to generalize to unseen tasks. Such an approach of defining validation tasks has been shown to be useful in transfer learning settings, e.g., [27].
More specifically, to obtain the best parameters during the iterative training process (refer to Algorithm 1), we use a (relatively smaller) validation set $\mathcal{V}$ of UTSC datasets such that $\mathcal{V} \cap \mathcal{D} = \emptyset$. Let $\theta_e$ represent the parameters of CTN at the end of the $e$-th training epoch. The time series instances in each validation dataset are divided into train, validate and test samples. For each dataset $D_v \in \mathcal{V}$, the parameters $\theta_e$ and the task-specific FC parameters $\mathbf{W}_v$ are fine-tuned using the train samples of $D_v$ via stochastic gradient descent for a fixed number of epochs. Using the updated $\theta$ and $\mathbf{W}_v$ from the fine-tuning epoch with minimum loss on the validate samples, we compute the test loss for $D_v$. The validation loss for CTN at the end of the $e$-th training epoch is then defined as the average of these test losses across all datasets in $\mathcal{V}$. The optimal parameters $\theta^{*}$ are chosen from the epoch where this validation loss is minimum, and represent the final parameters of CTN.
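The procedure can be sketched as follows, reusing the classes and imports from the sketches above (simplified: fine-tuning runs for a fixed number of epochs without per-task early stopping, and the learning rate is a placeholder):

```python
import copy

def validation_loss(backbone, val_tasks, finetune_epochs, lr=1e-4):
    """val_tasks: list of (train_loader, test_loader, num_classes) tuples.
    Returns the average test loss after fine-tuning a copy of CTN on each task."""
    losses = []
    for train_loader, test_loader, num_classes in val_tasks:
        model = copy.deepcopy(backbone)          # start from theta_e of the current epoch
        head = nn.Linear(model.out_dim, num_classes)
        opt = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=lr)
        for _ in range(finetune_epochs):         # fine-tune theta and W_v jointly
            for x, y in train_loader:
                loss = F.cross_entropy(head(model(x)), y)
                opt.zero_grad()
                loss.backward()
                opt.step()
        with torch.no_grad():                    # test loss for this validation dataset
            total = sum(F.cross_entropy(head(model(x)), y, reduction="sum").item()
                        for x, y in test_loader)
            count = sum(len(y) for _, y in test_loader)
        losses.append(total / count)
    return sum(losses) / len(losses)             # validation loss used for model selection
```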
III-D2 Transfer to a new task
The procedure for adapting/fine-tuning CTN for any new target task is similar to that used for validation: we use $\theta^{*}$ as the initial weights of CTN, randomly initialize the FC layer weights, and train them simultaneously for a fixed number of iterations using labeled data from the target task.
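A short usage sketch, building on the earlier sketches (the checkpoint name, learning rate, number of iterations, and data loader are placeholders; the optimization loop is the same as the inner loop of the validation sketch above):

```python
from itertools import cycle

backbone = CTNBackbone()
backbone.load_state_dict(torch.load("ctn_pretrained.pt"))     # pretrained theta*
head = nn.Linear(backbone.out_dim, num_target_classes)        # randomly initialized FC head
optimizer = torch.optim.Adam(
    list(backbone.parameters()) + list(head.parameters()), lr=1e-4)

for _, (x, y) in zip(range(num_finetune_iters), cycle(target_train_loader)):
    loss = F.cross_entropy(head(backbone(x)), y)               # fine-tune theta and W jointly
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```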
IV. Experimental Evaluation
TABLE I: Datasets used for training (top) and validation (bottom) of CTN (T: time series length, C: number of classes, N: number of instances).

Training datasets:
Dataset  T  C  N  Dataset  T  C  N
ItalyPowerDemand  24  2  1096  SonyAIBORobotSurfaceII  65  2  980
SonyAIBORobotSurface  70  2  621  TwoLeadECG  82  2  1162
FacesUCR  131  14  2250  Plane  144  7  210
Gun_Point  150  2  200  ArrowHead  251  3  211
WordSynonyms  270  25  905  ToeSegmentation1  277  2  268
Lightning7  319  7  143  ToeSegmentation2  343  2  166
DiatomSizeReduction  345  4  322  OSULeaf  427  6  442
Ham  431  2  214  Fish  463  7  350
ShapeletSim  500  2  200  ShapesAll  512  60  1200

Validation datasets:
Dataset  T  C  N  Dataset  T  C  N
MoteStrain  84  2  1272  CBF  128  3  930
Trace  275  4  200  Symbols  398  6  1020
Herring  512  2  128  Earthquakes  512  2  461
TABLE II: Classification error rates on the 41 test datasets (CTN-S and CTN-T averaged over three runs). Column order: Flat-COTE, BOSS, ResNet, CTN-S, CTN-T.

Dataset  COTE  BOSS  ResNet  CTN-S  CTN-T  Dataset  COTE  BOSS  ResNet  CTN-S  CTN-T
Adiac  0.21  0.24  0.17  0.17  0.16  50words  0.20  0.29  0.26  0.17  0.16  
Chlor.Conc.  0.27  0.34  0.16  0.14  0.17  Beef  0.13  0.20  0.25  0.31  0.26  
Cricket_X  0.19  0.26  0.21  0.14  0.14  BeetleFly  0.20  0.10  0.15  0.17  0.13  
Cricket_Y  0.17  0.25  0.20  0.14  0.14  BirdChicken  0.10  0.05  0.11  0.18  0.17  
Cricket_Z  0.19  0.25  0.19  0.12  0.13  Coffee  0.00  0.00  0.00  0.00  0.00  
Dist.Phal.O.A.G  0.25  0.25  0.20  0.21  0.18  Dist.Phal.O.C  0.24  0.27  0.20  0.22  0.21  
Dist.Phal.TW  0.30  0.32  0.24  0.27  0.26  ECG5000  0.05  0.06  0.07  0.06  0.06  
ECG200  0.12  0.13  0.12  0.14  0.08  ECGFiveDays  0.00  0.00  0.03  0.00  0.00  
ElectricDevices  0.29  0.20  0.27  0.29  0.30  FaceAll  0.08  0.22  0.17  0.20  0.21  
FordA  0.04  0.07  0.08  0.05  0.06  FaceFour  0.10  0.00  0.05  0.05  0.03  
FordB  0.20  0.29  0.09  0.08  0.08  InsectWingbeatSound  0.35  0.48  0.49  0.36  0.37  
Mid.Phal.O.A.G  0.36  0.46  0.27  0.29  0.28  MedicalImages  0.24  0.28  0.23  0.22  0.21  
Mid.Phal.O.C  0.20  0.22  0.19  0.19  0.19  Mid.Phal.TW  0.43  0.46  0.40  0.41  0.39  
PhalangesO.C  0.23  0.23  0.16  0.17  0.17  Meat  0.08  0.10  0.03  0.11  0.09  
Prox.Phal.O.A.G  0.15  0.17  0.15  0.16  0.16  Prox.Phal.O.C  0.13  0.15  0.08  0.10  0.09  
Prox.Phal.TW  0.22  0.20  0.21  0.22  0.22  Strawberry  0.05  0.02  0.04  0.03  0.03  
SwedishLeaf  0.05  0.08  0.04  0.04  0.04  synthetic_control  0.00  0.03  0.00  0.00  0.00  
Two_Patterns  0.00  0.01  0.00  0.00  0.00  uWave_X  0.18  0.24  0.22  0.17  0.17  
uWave_Y  0.24  0.32  0.33  0.24  0.23  uWave_Z  0.25  0.31  0.25  0.23  0.23  
wafer  0.00  0.01  0.00  0.00  0.00  Wine  0.35  0.26  0.26  0.17  0.17  
yoga  0.12  0.08  0.13  0.10  0.08  W/T/L of CTN-T  26/6/9  30/4/7  22/6/13  17/18/6  -
Mean Arithmetic Rank  3.22  3.91  2.88  2.72  2.27 
We empirically evaluate CTN from three perspectives: 1) classification performance: to evaluate whether fine-tuning CTN for a target task provides better accuracy than training a model from scratch; 2) computational efficiency: to evaluate whether CTN can be adapted quickly, in fewer iterations, compared to training a deep model from scratch; 3) ablation study: to understand the advantage of using multiple filter lengths in CTN. Additionally, we provide a qualitative analysis of the trained filters in CTN and useful insights into the interpretability of results in Section IV-F.
IV-A Dataset details
We train and test CTN on diverse, disjoint subsets of the datasets taken from the UCR TSC Archive Benchmark [15, 17, 3], belonging to seven diverse categories: Image Outline, Sensor Readings, Motion Capture, Spectrographs, ECG, Electric Devices, and Simulated Data. All time series are z-normalized, i.e., the mean and standard deviation of the values in any time series are 0 and 1, respectively. The time series length, the number of classes, and the number of labeled training instances all vary significantly across datasets. We use the same (random) split of training and validation datasets (refer to Table I) as used in [16] and detailed in [30], such that we have 18 datasets for training CTN and 6 for model selection. These 18 training and 6 validation datasets have time series length up to 512, and we therefore restrict testing to the remaining 41 datasets with length up to 512. For each training dataset, all the labeled train as well as test samples from the original train-test split in the archive are used for training CTN. For each of the 6 validation datasets and the 41 test datasets, we use the same train-test splits as provided in [17] while fine-tuning CTN using the train split of the respective dataset.
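For reference, z-normalization of a single series can be computed as follows (a minimal NumPy sketch; the epsilon guard is an implementation detail, not from the paper):

```python
import numpy as np

def z_normalize(x, eps=1e-8):
    """Return the series with zero mean and unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + eps)
```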
IV-B Hyperparameters
Based on preliminary experiments on a smaller subset of the training datasets to decide the number of layers and filters, we consider a CTN with four convolutional blocks; each convolutional layer consists of filters of five different lengths, with 33 filters per length, i.e., $c_l = 165$ filters in each convolutional block. We use the Adam optimizer to optimize the network weights, and orthogonal initialization of the convolutional filters in all our experiments. CTN was trained for a fixed number of epochs; during each epoch, for each dataset, we randomly chose a fixed number of batches. For the validation datasets, we fine-tune the CTN parameters and the task-specific parameters for a fixed number of epochs. While adapting CTN for each test dataset, the parameters of CTN and the FC layer are fine-tuned for a fixed number of iterations with a reduced learning rate.
IV-C Baselines considered
We refer to the proposed approach of fine-tuning the pretrained CTN for a target task as CTN-T (CTN-Transfer), and compare it to: (i) CTN-S (CTN architecture trained from Scratch): an exact replica of CTN with all parameters initialized randomly and trained independently for each test dataset; by doing so, any gains in performance obtained via CTN-T over CTN-S can be attributed to the pretrained filters in CTN. (ii) ResNet [3] as the state-of-the-art deep learning approach: ResNet is trained independently for each dataset and contains 11 layers, of which the first 9 are convolutional layers with shortcut residual connections between residual blocks (each block with 3 convolutional layers), followed by a GAP layer and an FC layer with softmax. (iii) Two non-deep-learning state-of-the-art techniques as baselines [17]: Flat-COTE (Collective of Transformation-based Ensembles) [28] and BOSS (Bag of SFA Symbols) [29].
For evaluating CTN-T and CTN-S on each test dataset, we use the entire train split for training and use the model parameters corresponding to the iteration with minimum training cross-entropy loss, following the same protocol as used in [1, 3]. (We additionally considered a stratified sampling approach that divides the train split of each dataset into 75%-25% training and validation samples, and still found the resulting variant of CTN-T to perform better than the BOSS, ResNet, and Flat-COTE methods used for comparison in Table II. However, this variant was worse than the CTN-T model that uses the entire train split for training, especially for three datasets, namely Beef, Chlor.Conc., and 50words. This can be attributed to the small number of training instances per class and/or the large diversity in patterns within samples of the same class.) We train three models for each dataset (with a randomly initialized FC layer for CTN-T and an entirely randomly initialized network for CTN-S), and report the average of the three error rates in Table II.
IV-D Observations
The comparison of classification error rates (fraction of wrongly classified instances) and the number of wins/ties/losses (W/T/L) is summarized in Table II. We make the following key observations:


CTN-T has a W/T/L of 17/18/6 against CTN-S, indicating that pretrained-network-based transfer learning (CTN-T) performs significantly better than training the CTN-like architecture from scratch (CTN-S), as also highlighted in Fig. 4(a). Further, CTN-T has a W/T/L of 22/6/13 compared to ResNet, demonstrating the advantage of leveraging a pretrained model. CTN-T has a mean arithmetic rank of 2.27 based on error rates, which is significantly better than both non-transfer-based deep learning approaches, i.e., CTN-S and ResNet.
Fig. 5: Scatter plots of classification error rates. 
As shown in Fig. 6, we observe that CTN-T performs significantly better than CTN-S and ResNet when the number of parameter updates is small, i.e., with fewer training/fine-tuning iterations. (Due to the random initialization of the FC layer, ResNet and CTN-T perform similarly at first, but CTN-T adapts quickly.) This suggests that starting from a pretrained model is computationally efficient compared to starting from scratch: CTN-T takes fewer iterations to reach its optimal classification performance while achieving better classification error rates, demonstrating the advantage of leveraging a pretrained network over a network trained from scratch.

CTN-T has a W/T/L of 26/6/9 compared to COTE. Given that COTE is extremely computationally expensive [17], training and deploying it in practical applications can be highly inefficient. On the other hand, training and inference in CTN-T are highly parallelizable, making it suitable for practical applications. Further, fine-tuning of the pretrained CTN is efficient and largely removes the need for hyperparameter tuning.
IV-E Ablation study: Does having different filter lengths help?
To evaluate the importance of having multiple filter lengths in a transfer learning setting with diverse datasets, we train four CTN-like architectures, each with a single fixed filter length used in all layers, while keeping the total number of trainable parameters approximately the same as in CTN by suitably adjusting the number of filters in each convolutional layer. We observe that CTN-T performs significantly better than any of these variants: CTN-T has a W/T/L of 24/10/7 compared to the best-performing fixed-length variant, as shown in Fig. 4(b). These results highlight the significance of having filters of multiple lengths in a transfer learning setting: multiple filter lengths help capture trends and patterns occurring at varying temporal resolutions, which would otherwise be difficult to capture via filters of a fixed length in a CNN model.
IV-F Analysis of CTN filters
IV-F1 Fine-tuning all layers vs. partial fine-tuning
Typically, the lower layers of a deep neural network tend to learn generic features while the higher layers tend to learn task-specific features. To analyze this behavior in CTN, we consider four variants where we freeze the parameters of the first, first two, first three, and all four convolutional blocks of CTN, respectively, while fine-tuning the remaining layers for a test dataset as described in Section III-D. We keep the parameters of the BN layers trainable and freeze only the convolutional filters, for the reasons explained in [31]. We observe an average improvement in classification performance of around 1% across the test datasets when freezing the first convolutional block, a drop of around 0.9% when freezing the first two or three blocks, and a significant drop when freezing all four convolutional blocks. These observations suggest that fine-tuning the final convolutional layer can be critical to obtain good task-specific models from the pretrained CTN. Further, the minor improvement from freezing the first block can be attributed to the fact that it may be capturing generic patterns relevant across datasets.
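A sketch of this partial fine-tuning, assuming the block structure from the earlier sketches (only the convolutional filters of the first k blocks are frozen; their BN parameters remain trainable):

```python
def freeze_conv_filters(backbone, k):
    """Freeze the convolutional filters of the first k blocks of the CTN backbone."""
    for block in list(backbone.blocks)[:k]:
        for conv in block.convs:
            for p in conv.parameters():
                p.requires_grad = False     # filters frozen; block.bn stays trainable
```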
IV-F2 Qualitative analysis of filters from the first layer of ConvTimeNet
For a given dataset with $N$ time series instances, we first find the most relevant first-layer filter $f^{*}$, i.e., the filter whose feature map attains the highest activation on average over the instances of the dataset:

$$f^{*} = \arg\max_{f} \; \frac{1}{N} \sum_{i=1}^{N} \max_{t} \; a^{1}_{f,t}(x_i) \qquad (2)$$
Fig. 8 depicts the filter weights for eight different test datasets before and after fine-tuning of CTN. We observe that the filters capture typical patterns encountered in time series, such as a sharp or gradual rise/fall, a rise followed by a fall, etc., further indicating that CTN learns generic features which do not change much on fine-tuning. We found different filters to be most relevant for different datasets. The patterns captured are illustrated in Fig. 7 using filter weights and the corresponding activations for sample time series.
Further, it is interesting to note that some of the filters in Fig. 8 may at first appear extremely noisy and not to capture any trend. However, ignoring the points in these filters whose weights are very close to zero yields meaningful patterns: such filters tend to capture trends over longer time spans while effectively skipping some of the steps (those with near-zero weights) in a time series. For example, the most relevant filter for the uWave_Y dataset effectively ignores some time steps within its window during the convolution operation, and therefore tries to capture coarser, higher-level temporal patterns rather than finer trends.
IV-F3 Interpretability via occlusion sensitivity
We provide a preliminary analysis of interpretability in terms of identifying the region(s) in a time series that are most relevant for a particular classification decision. We use the "Two Patterns" test dataset from the "Simulated" category as an illustrative example, owing to its ease of visual interpretability: "Two Patterns" has four classes corresponding to the possible combinations of the two patterns "up" and "down". Fig. 9(a) shows a (test) instance of the up-down class along with the two most relevant filters for this dataset, identified using Eq. 2. We observe that one filter captures the "up" trend while the other captures the "down" trend, with the maximum activation value occurring at the corresponding points in the time series, as depicted in Fig. 9(b). To find the regions of the time series used by the CTN-T classifier to arrive at its classification decision, we compute occlusion sensitivities [32] by occluding parts of the time series and observing the changes in the probability of the predicted class. Specifically, we slide a window over the time series and set the values within that window to a fixed baseline value. As soon as an important part of the time series is occluded, we expect a sharp drop in the probability of the predicted class. This change in probability as the window moves over time, i.e., the occlusion sensitivity (the difference between the probability of the predicted class without any occlusion and the probability after occluding the window), is shown in Fig. 9(c). We observe that as soon as the window covers the "up"/"down" trend in the time series, there is a sharp drop in the predicted-class probability, indicating that the network focuses on the correct regions of the time series when making its decisions (these regions also coincide with those highlighted by the most relevant filters for the dataset).
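A sketch of the occlusion-sensitivity computation, reusing the imports from the earlier sketches (the window length and the zero occlusion value are assumptions; `model` is assumed to map a series of shape (1, 1, T) to class logits):

```python
def occlusion_sensitivity(model, x, window=10):
    """Return the drop in predicted-class probability for each window position."""
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=-1)
        cls = probs.argmax(dim=-1)                     # predicted class
        p_ref = probs[0, cls]                          # probability without occlusion
        deltas = []
        T = x.shape[-1]
        for start in range(T - window + 1):
            x_occ = x.clone()
            x_occ[..., start:start + window] = 0.0     # occlude the window
            p_occ = torch.softmax(model(x_occ), dim=-1)[0, cls]
            deltas.append((p_ref - p_occ).item())      # large value => important region
    return deltas
```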
V. Conclusion and Future Work
We have proposed ConvTimeNet (CTN): a pretrained deep CNN for univariate time series classification. CTN leverages filters of multiple lengths to model temporal patterns at various scales from diverse time series across domains. Adapting a pretrained model like CTN to a target task via fine-tuning i) yields significantly better results compared to existing state-of-the-art time series classification approaches, ii) is computationally efficient, and iii) requires less deep learning expertise than training a deep network from scratch. In the future, we plan to train a bigger CTN model on a larger and more diverse dataset with longer time series. It will also be interesting to investigate whether the number of parameters updated during fine-tuning can be reduced to make fine-tuning even more efficient.
References
 [1] Z. Wang, W. Yan, and T. Oates, “Time series classification from scratch with deep neural networks: A strong baseline,” in Neural Networks (IJCNN), 2017 International Joint Conference on. IEEE, 2017, pp. 1578–1585.
 [2] F. Karim, S. Majumdar, H. Darabi, and S. Chen, "LSTM fully convolutional networks for time series classification," IEEE Access, vol. 6, pp. 1662–1669, 2018.
 [3] H. I. Fawaz, G. Forestier, J. Weber, L. Idoumghar, and P.A. Muller, “Deep learning for time series classification: a review,” arXiv preprint arXiv:1809.04356, 2018.
 [4] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
 [5] Y. Bengio, “Deep learning of representations for unsupervised and transfer learning,” in Proceedings of ICML Workshop on Unsupervised and Transfer Learning, 2012, pp. 17–36.
 [6] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

 [7] P. Malhotra, V. TV, L. Vig, P. Agarwal, and G. Shroff, "TimeNet: Pre-trained deep recurrent neural network for time series classification," in 25th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 2017, pp. 607–612.
 [8] P. Gupta, P. Malhotra, L. Vig, and G. Shroff, "Using features from pre-trained TimeNet for clinical predictions," in The 3rd International Workshop on Knowledge Discovery in Healthcare Data at IJCAI, 2018.
 [9] P. Gupta, P. Malhotra, L. Vig, and G. Shroff, “Transfer learning for clinical time series analysis using recurrent neural networks,” ACM SIGKDD workshop on Machine Learning for Medicine and Healthcare. arXiv preprint arXiv:1807.01705, 2018.
 [10] J. Serrà, S. Pascual, and A. Karatzoglou, “Towards a universal neural network encoder for time series,” arXiv preprint arXiv:1805.03908, 2018.
 [11] H. I. Fawaz, G. Forestier, J. Weber, L. Idoumghar, and P.A. Muller, “Transfer learning for time series classification,” arXiv preprint arXiv:1811.01533, 2018.
 [12] Z. Cui, W. Chen, and Y. Chen, “Multiscale convolutional neural networks for time series classification,” arXiv preprint arXiv:1603.06995, 2016.

 [13] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
 [14] S. Roy, I. Kiral-Kornek, and S. Harrer, "ChronoNet: A deep recurrent neural network for abnormal EEG identification," arXiv preprint arXiv:1802.00308, 2018.
 [15] Y. Chen, E. Keogh, B. Hu, N. Begum et al., "The UCR time series classification archive," July 2015, www.cs.ucr.edu/~eamonn/time_series_data/.
 [16] P. Malhotra, V. TV, L. Vig, P. Agarwal, and G. Shroff, “Timenet: Pretrained deep recurrent neural network for time series classification,” in Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), 2017.
 [17] A. Bagnall, J. Lines, A. Bostrom, J. Large, and E. Keogh, “The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances,” Data Mining and Knowledge Discovery, vol. 31, no. 3, pp. 606–660, 2017.
 [18] A. Le Guennec, S. Malinowski, and R. Tavenard, “Data augmentation for time series classification using convolutional neural networks,” in ECML/PKDD Workshop on Advanced Analytics and Learning on Temporal Data, 2016.
 [19] H. I. Fawaz, G. Forestier, J. Weber, L. Idoumghar, and P.A. Muller, “Data augmentation using synthetic data for time series classification with deep residual networks,” arXiv preprint arXiv:1808.02455, 2018.

 [20] K. Paneri, V. TV, P. Malhotra, L. Vig, and G. Shroff, "Regularizing fully convolutional networks for time series classification by decorrelating filters," in The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), 2019.
 [21] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
 [22] M. Long, Y. Cao, J. Wang, and M. I. Jordan, “Learning transferable features with deep adaptation networks,” arXiv preprint arXiv:1502.02791, 2015.
 [23] A. Ukil, P. Malhotra, S. Bandyopadhyay et al., “Fusing features based on signal properties and timenet for time series classification,” in Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), 2019.
 [24] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
 [25] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 [26] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
 [27] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra et al., “Matching networks for one shot learning,” in Advances in Neural Information Processing Systems, 2016, pp. 3630–3638.
 [28] A. Bagnall, J. Lines, J. Hills, and A. Bostrom, "Time-series classification with COTE: the collective of transformation-based ensembles," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 9, pp. 2522–2535, 2015.
 [29] P. Schäfer, "The BOSS is concerned with time series classification in the presence of noise," Data Mining and Knowledge Discovery, vol. 29, 2015.
 [30] P. Malhotra, V. TV, L. Vig, P. Agarwal, and G. Shroff, “Timenet: Pretrained deep recurrent neural network for time series classification,” in arXiv preprint arXiv:1706.08838, 2017.
 [31] C. Doersch and A. Zisserman, “Multitask selfsupervised visual learning,” The IEEE International Conference on Computer Vision (ICCV), 2017.
 [32] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European conference on computer vision. Springer, 2014, pp. 818–833.