Instance Explainable Temporal Network For Multivariate Timeseries

05/26/2020
by Naveen Madiraju, et al.

Although deep networks have been widely adopted, one of their shortcomings has been their black-box nature. One particularly difficult problem in machine learning is multivariate time series (MVTS) classification. MVTS data arise in many applications and are becoming ever more pervasive due to the explosive growth of sensors and IoT devices. Here, we propose a novel network (IETNet) that identifies the important channels in the classification decision for each instance of inference. This feature also enables identification and removal of non-predictive variables, which would otherwise lead to overfitting and/or an inaccurate model. IETNet is an end-to-end network that combines temporal feature extraction, variable selection, and joint variable interaction into a single learning framework. IETNet uses 1D convolutions for temporal features and a novel channel gate layer for variable-class assignment, with an attention layer that performs cross-channel reasoning in service of the classification objective. To gain insight into the learned temporal features and channels, we extract a region-of-interest attention map along both the time and channel axes. The viability of this network is demonstrated on multivariate time series data from N-body simulations and spacecraft sensor data.


1. Introduction

Deep learning has become the dominant approach to supervised learning of labeled data (LeCun et al., 2015) (Schmidhuber, 2015). One of the main drawbacks of deep networks has been the difficulty of making the underlying reasons and logic for their decisions understandable to humans. The primary focus of research, both in the design of networks and in their explainability, has been on images, sequence data, and unstructured data, while relatively little attention has been paid to learning of complex, multivariate time series (MVTS) data. MVTS encompasses many areas of science and engineering, such as financial trading, medical monitoring, and event detection (Aghabozorgi et al., 2015), and has become pervasive due to the rise of sensors/IoT devices.

This combination of factors has left a gap in technology for accurate learning of multivariate time series data. The problem of multivariate time series classification is particularly challenging. In imaging, one can have 3 channels representing color, and there is strong similarity in the learnable features across widely differing domains, such as classification of anomalies in medical imaging versus natural images. This enables transfer learning, where networks are trained on natural images, for which large labeled data sets exist, and fine-tuned on specific domain applications, where one is often faced with a paucity of labeled data. In contrast, no two MVTS are alike, and there is considerable variation in the relevant features, temporal scales, and dimensionality. Further, time series data from real-world applications are often noisy, can have temporal gaps, and may be collected from multiple sensors with different resolutions and data lengths. The data noise can be due to the data acquisition method and/or the inherent nature of the data (Antunes and Oliveira, 2001). As a result, while there exist heavily used standard labeled data sets for image classification, such as MNIST, ImageNet, and CIFAR10, there are no equivalent training sets for MVTS, and even if they were created, the associated lift from transfer learning is not expected to be as effective as in imaging problems.

Our contribution here is directed at developing the ability to disentangle, within MVTS classification, the relevant input channels that contribute to each instance of classification. This would be quite valuable across many application domains. For example, in financial markets, this could enable uncovering the key signals, from a large list, that contribute to the classification of a given market regime. Or in the medical domain, one can learn which clinical inputs from a patient led to a specific diagnosis.

2. Recent Work

Recently there has been rising interest in applying neural networks to time series applications (Madiraju et al., 2018) (Karim et al., 2018) (Song et al., 2018). Further, CNNs are increasingly taking the place of recurrent networks such as RNNs and LSTMs for time series data, and the optimal means of incorporating temporal information into CNNs remains an active area of research. Recently, a general architecture for sequence modeling with convolutional networks, the Temporal Convolution Network (TCN) (Bai et al., 2018), was proposed. That paper empirically shows that CNNs outperform LSTMs on a wide variety of benchmarks for time series applications. Use of a shared TCN also reduces the number of trainable parameters, which is useful in MVTS where there is a paucity of labeled data sets.

Despite the development of generalized architectures for univariate time series, very few translate to MVTS. This is partly due to the non-linear interaction among variables in MVTS. A common approach to MVTS is to cast the variables as separate channels into a CNN-type architecture, but the drawback is that such architectures do not fully account for non-local and non-linear interactions between channels. Recently, relational networks (Santoro et al., 2017) and their variants, including the transformer (Vaswani et al., 2017) and non-local networks (Wang et al., 2018), have become popular for reasoning tasks such as visual question answering. These architectures attain efficiency by reducing the number of parameters through dot products, and stability by using normalization and skip connections. This efficient use of parameters is ideal for multivariate applications, which require pairwise (or higher order) combinatorial reasoning. Variants of this architecture have been successfully applied to multimodal problems such as video-speech problems (Zadeh et al., 2018) (Tsai et al., 2018), but not to MVTS.

Our contribution is two-fold. First, we have adapted the transformer attention architecture to perform MVTS classification, modeling the interaction between the various channels. Second, we have incorporated this architecture into an end-to-end neural net that provides not just instance-specific but class-specific heatmaps of the contributing channels. This latter feature provides a useful level of explainability and insight into how the network makes its decisions. Previous work on explainability includes (Assaf and Schumann, 2019), which proposed a convolutional solution based on Grad-CAM to provide a heatmap. (Yuan et al., 2018) and (Xu et al., 2018) proposed attention-based architectures to perform classification and also gain some insight into the inner workings of the network. However, in an important distinction from our work, these networks do not provide class- and instance-specific channels of interest, which our novel network, IETNet, does.

3. Method

3.1. Feature Extractor

The first element of the network maps the input to feature representations. This is done using a shared Temporal Convolution Network (TCN), which extracts time domain features of each channel independently. As described in (Bai et al., 2018), we make use of causal 1D convolutions in the network: in each layer, the output at time $t$ is convolved only with inputs from time $t$ and earlier. Moreover, the architecture consists of dilated convolutions, which exponentially expand the receptive field of the network. When the dilation $d = 1$, the network reduces to a regular convolution. Using larger $d$'s enables the network to capture a wide range of inputs; standard practice is to stack exponentially increasing dilations, with $d = 2^i$ at layer $i$, which ensures the top layer of the network is able to see all of the input. We also make use of ReLU activations and skip connections to make the network more stable. Finally, average pooling is used to collapse the temporal axis for each variable. The number of layers and the complexity can be adjusted based on the problem. Figure 1 illustrates the resulting architecture. Note that each variable in the MVTS shares the same TCN network, which is fully convolutional, thereby effectively reducing the number of parameters. This addresses the attendant problem of overparametrization and resultant overfitting in MVTS data.
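To make the construction concrete, the sketch below assembles such a feature extractor in Keras. It is a minimal illustration, not the authors' code: the particular dilation stack (1, 2, 4, 8), the helper names, and the two-convolution residual layout are our assumptions, while the causal padding, shared per-channel application, ReLU activations, skip connections, and final average pooling follow the description above.

```python
# Minimal sketch of the shared TCN feature extractor (assumed layout).
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, dilation, filters=16, kernel_size=2):
    """Causal dilated convolutions with a skip connection."""
    y = layers.Conv1D(filters, kernel_size, padding="causal",
                      dilation_rate=dilation, activation="relu")(x)
    y = layers.Conv1D(filters, kernel_size, padding="causal",
                      dilation_rate=dilation, activation="relu")(y)
    if x.shape[-1] != filters:               # match widths for the addition
        x = layers.Conv1D(filters, 1, padding="same")(x)
    return layers.add([x, y])

def feature_extractor(input_len, dilations=(1, 2, 4, 8)):
    """One univariate channel in, one time-collapsed feature vector out.
    The same model instance is applied to every MVTS channel."""
    inp = layers.Input(shape=(input_len, 1))
    x = inp
    for d in dilations:                      # d = 2^i widens the receptive field
        x = residual_block(x, d)
    out = layers.GlobalAveragePooling1D()(x) # collapse the temporal axis
    return tf.keras.Model(inp, out)
```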

Figure 1. The feature extractor for the proposed algorithm. The temporal convolutional network is shared across all the variables. The architecture of the residual block is illustrated above. Blocks with various dilations are stacked together to extract features. The features along the temporal axis are collapsed, and each variable's features are concatenated as shown.

3.2. Channel Gate and Classification

The output of the previous layer is a time-collapsed feature vector. We now stack these vectors together for all channels. This multivariate feature matrix $F$ has dimensions $N_{ch} \times N_f$, where $N_{ch}$ is the number of channels and $N_f$ is the feature size. Up to this point, there is no interaction between the channels. In this layer, we want each class to choose which channels represent it the most. To realize this, we pass $F$ through a multi-headed dot-product attention. Here the attention performs pairwise reasoning between each and every channel. After entangling the channels, we use a feed-forward layer to collapse the features into a class score for each channel. We next perform a softmax along the channel axis to get the most useful channels for each class. We call this tensor the channel gate, with dimensions $N_{cls} \times N_{ch}$, where $N_{cls}$ is the number of classes; the channel scores sum to 1 for each class. Since our example in figure 2 is binary classification, we have one row for the channel gate. We now use this channel gate to filter the multivariate feature vector and then perform global average pooling to get the final class score. The architecture is illustrated in figure 2 for a binary classification problem.

Note that we want the attention only to create channel scores, which we then use to filter the multivariate features $F$. Hence the class score is highly dependent on which channels the attention layer chooses; in a sense, the network can only perform classification if it chooses the right channels.
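A sketch of this gate in Keras follows. The single head and feature size of 16 are taken from section 4.1; the function name, the placement of the residual connection, and the mean over features before gating are our assumptions about one plausible realization, not the authors' implementation.

```python
# Hedged sketch of the channel gate and classification head.
# F has shape (batch, n_channels, n_features).
import tensorflow as tf
from tensorflow.keras import layers

def channel_gate_head(F, n_classes, d_model=16):
    # Pairwise dot-product reasoning between each and every channel.
    A = layers.MultiHeadAttention(num_heads=1, key_dim=d_model)(F, F)
    A = layers.add([F, A])                        # skip connection
    # Feed-forward layer collapses features into a per-class score for
    # each channel; softmax along the channel axis yields the gate.
    scores = layers.Dense(n_classes)(A)           # (batch, n_channels, n_classes)
    gate = layers.Softmax(axis=1)(scores)         # sums to 1 per class
    # Gate the features, then pool globally to get the final class score.
    act = tf.reduce_mean(F, axis=-1)              # (batch, n_channels)
    logits = tf.einsum("bnc,bn->bc", gate, act)   # (batch, n_classes)
    return logits, gate                           # gate is the channel heatmap
```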

Figure 2. Illustration of the channel gate and classification for the N-body problem. The matrix at the top represents the time-collapsed multivariate features $F$, where red represents a higher magnitude of activation and blue a lower one. The attention performs pairwise reasoning between each and every channel. We then get a channel score using a feed-forward layer and a softmax. Finally, we perform gating and global pooling to get the class score. The channel gate here picked the second and fourth channels strongly.

4. Experiments and Analysis

4.1. Implementation

The TCN layer has 16 filters with a kernel size of 2, a stack of exponentially increasing dilations, and skip connections. Each variable in the MVTS shares the same TCN network, making it quite lightweight, with only 27,648 total weights in our implementation. As needed, deeper variants can readily be implemented owing to the modular structure of the network and the skip connections that enable stacking convolutions.

We used ReLU activations, Glorot normal initialization, and the Adam optimizer with a learning rate that cycles between (0.0001, 0.001) using the noam scheme (Vaswani et al., 2017). Dropout was applied during training. We used a publicly available implementation of the TCN (https://github.com/philipperemy/keras-tcn). For the multi-headed attention, we used the same feature size of 16 with ReLU activation and 1 head, adapting a publicly available implementation of the attention architecture (https://github.com/Kyubyong/transformer).
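One plausible reading of this schedule, sketched below, is the noam warmup/decay shape from Vaswani et al. (2017) clipped to the stated band; the warmup length here is our assumption.

```python
# Assumed noam-style learning rate confined to (1e-4, 1e-3).
import numpy as np

def noam_lr(step, d_model=16, warmup=4000, lo=1e-4, hi=1e-3):
    """Linear warmup followed by inverse-square-root decay, clipped."""
    step = max(step, 1)
    rate = d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
    return float(np.clip(rate, lo, hi))
```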

4.2. Evaluation Metrics

For classification, we use ROC curves and the confusion matrix. For evaluating the accuracy of the channel localization, we use mean average precision at $k$ retrieved objects, the standard evaluation metric in information retrieval. In our problem, we want to evaluate whether, given the predicted class, the model retrieves the relevant channels. We compute the average precision at various $k$ as

$$\mathrm{AP@}k = \frac{1}{\mathrm{GTP}} \sum_{i=1}^{k} \frac{\mathrm{TP}(i)}{i}\,\mathrm{rel}(i),$$

where $\mathrm{GTP}$ is the number of ground truth positives, $\mathrm{TP}(i)$ is the number of observed hits (true positives) among the top $i$, $\mathrm{rel}(i)$ indicates whether the channel at rank $i$ is relevant, and $k$ is the number of channels retrieved. Here $k$ is varied up to the number of relevant channels to retrieve, which can be set based on any prior knowledge of the problem or determined by counting the number of highly precise channels and ignoring the low-precision ones. We score the predictions by their confidence and take the top $k$ channels as the retrieved channels.
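The computation can be sketched as follows, under the standard information-retrieval definition above; the function and variable names are ours, not the paper's.

```python
def average_precision_at_k(ranked_channels, relevant, k):
    """AP@k: precision at each hit position among the top k, averaged over GTP."""
    hits, score = 0, 0.0
    for i, ch in enumerate(ranked_channels[:k], start=1):
        if ch in relevant:           # an observed hit / true positive
            hits += 1
            score += hits / i        # precision at rank i
    gtp = len(relevant)              # ground truth positives
    return score / gtp if gtp else 0.0

# Example: class 1 of the N-body data has 4 ground-truth channels (0-3).
print(average_precision_at_k([1, 3, 6, 0], relevant={0, 1, 2, 3}, k=4))  # 0.6875
```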

4.3. N-body

We created MVTS data using a two-dimensional N-body gravitational simulation. The data consists of 8 channels and 2 classes:

  1. class 0: All 8 channels are positions sampled from a 4-body simulation.

  2. class 1: The first 4 channels are positions of 2 bodies sampled from a 2-body simulation. The next 4 channels are positions of 2 bodies sampled from a 4-body simulation.

    With this data construct, the important channels in class 1 are the first 4 channels, which provides a way for us to assess the accuracy of the channel localization of IETNet.

The data is generated by running 2-body and 4-body simulations, respectively. The positions and velocities are randomly initialized, and we compute the positions for 2000 time steps using a gravitational constant of 1. These form the individual simulations. From these, the multivariate time series comprising the various classes are created by grouping the channels as described above. The training, test, and validation sets consist of sample sizes of 183, 244, and 183, respectively. The goal is to determine the efficacy of the channel localizer when the data is from the 2-body class.
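A minimal sketch of this data generation is shown below. The gravitational constant of 1, the 2000 time steps, and the class construction follow the description above; the time step, softening term, unit masses, and initialization range are our assumptions.

```python
# Hedged sketch of the N-body MVTS generation (unit masses assumed).
import numpy as np

def simulate(n_bodies, steps=2000, dt=1e-3, G=1.0, eps=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-1.0, 1.0, (n_bodies, 2))   # init range is assumed
    vel = rng.uniform(-1.0, 1.0, (n_bodies, 2))
    out = np.empty((steps, n_bodies, 2))
    for t in range(steps):
        d = pos[None, :, :] - pos[:, None, :]             # pairwise offsets
        r3 = (np.sum(d ** 2, axis=-1) + eps ** 2) ** 1.5  # softened |r|^3
        vel += G * np.sum(d / r3[..., None], axis=1) * dt
        pos += vel * dt
        out[t] = pos
    return out.reshape(steps, -1)      # channels: x/y positions per body

# class 1 sample: first 4 channels from a 2-body run, next 4 channels
# from 2 bodies of a 4-body run, as in the construction above.
sample = np.hstack([simulate(2), simulate(4, seed=1)[:, :4]])
```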

(a) Classification performance shows a very high degree of agreement between ground truth and predicted labels for the N-body simulation dataset.
(b) Channel localization by the model aggregated over all of the test data. We compute the average activation of the channel gate over the test set and then normalize the values by the test set size. As we can see, IETNet separates the variables by strongly picking the first few channels, which correspond to the 2-body simulation, as shown in the bottom panel. Likewise, for class 0 the network uses all of the channels to indicate the background class.
(c) Channels picked by the model on the test set for the N-body problem. We plot the mean average precision at various $k$'s. The model is seen to align with the ground truth, especially at the first few $k$'s. Class 1 has 4 ground truth channels, and therefore $k$ varies from 1 to 4.
Figure 3. N-Body data experiments

Classification performance of IETNet on the N-body problem is shown using the confusion matrix in figure 3(a). The model is seen to have very high accuracy, with only one misclassified example. Channel gate results of IETNet are shown in figure 3(b). Each horizontal bar shows the relative importance of the variables/channels for each class as picked by the channel gate, aggregated over the entire test set. As shown in the bottom horizontal bar in figure 3(b), the network has correctly picked channels from among the first four when the predicted class is 1 (2-body class). In the case of class 0 (4-body class), the network has picked channels beyond the first four. The ground truth for this class is less clear, but the chosen channels do make physical sense in that one needs to look at channels beyond the first four to identify the class.

Next, we use the mean average precision at $k$ retrieved objects to further assess the efficacy of the channel localizer. This is shown in figure 3(c), along with the standard deviation of the retrieved channels. We included the first four channels as part of the ground truth in the case of class 1. The model is observed to have very high precision when retrieving the top few channels, with the score gradually decreasing as more channels are retrieved. The trend shows high agreement of the retrieved channels with the ground truth channels.

4.4. Spacecraft Data

In the previous section, we demonstrated the technique using synthetic planetary data. Here, we apply the technique to challenging MVTS data collected from NASA's recent and ongoing Magnetospheric Multiscale Mission, which is obtaining high resolution data of the Earth's space environment.

In situ measurements of this multi-spacecraft mission, made through the magnetometer and plasma instruments on board each spacecraft, serve as probes of the space environment surrounding the spacecraft. This sensor data is challenging since the space environment is turbulent and has many embedded transients that can mask the events of interest. Among the event types of interest are the so-called flux transfer events (FTEs) (Russell and Elphic, 1979), which are formed by the magnetic reconnection process, a main driver of space weather effects.

Space physicists identify FTEs in the data by first transforming the raw magnetic field data into boundary normal coordinates based on a model of the Earth's magnetopause. The three components of the magnetic field $(B_x, B_y, B_z)$ are transformed into $(B_L, B_M, B_N)$, where $B_N$ is the component along the magnetopause normal, $B_L$ is tangential to the magnetopause, and $B_M$ forms the third orthogonal coordinate. In this transformed frame, FTEs exhibit a bipolar signature in $B_N$, which makes it easier to identify the FTEs visually (Fig. 4). It is important to note that it would be difficult to visually identify FTEs in the original frame, as evident in Fig. 4. As such, this data set is ideal for testing and validation of our approach for identification of important channels. The most important channel for identification of FTEs is $B_N$, and the model should highlight it as such. For the relative importance of various variables to the classification of FTEs, we refer the reader to (Karimabadi et al., 2009).
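For illustration, the transform itself is just a rotation of each magnetic field vector onto the model-derived axes, as in the hypothetical helper below; in practice the L, M, N axes come from a magnetopause model evaluated at the spacecraft location.

```python
# Illustrative boundary normal (LMN) transform; the axes are placeholders.
import numpy as np

def to_lmn(B_xyz, L, M, N):
    """Project B (T x 3, original frame) onto orthonormal L, M, N axes."""
    R = np.vstack([L, M, N])      # rows are the new basis vectors
    return B_xyz @ R.T            # columns: B_L, B_M, B_N (the bipolar FTE
                                  # signature appears in the B_N column)
```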

Our data consists of 15 variables: the magnetic field components in the original frame ($B_x$, $B_y$, $B_z$) and in the boundary normal frame ($B_L$, $B_M$, $B_N$) along with the field magnitude $|B|$; the ion velocity components in the original frame and their magnitude ($V_x$, $V_y$, $V_z$, $|V|$); the plasma density $N$; and the ion temperatures parallel and perpendicular to the magnetic field ($T_\parallel$, $T_\perp$) together with the total ion temperature $T$. Data is labeled by whether a given time window has FTE events (class 1) or no events (class 0). We do not specify the beginning or end of an event, and a given interval with FTEs may contain one or more FTEs. The labels were created by space physicists through visual inspection of the data.

Figure 4. An example of a flux transfer event (FTE), most visible due to its bipolar signature in $B_N$.

The data consists of 184 samples of class 0 and 227 samples of class 1 time series, each of length 1440. This data is divided into 295 training samples (169 of class 1), a validation set of 33 (20 of class 1), and a test set of 83 (38 of class 1).

In the first experiment on this data, we keep all 15 variables and check whether the network selects $B_N$ as the most important channel. One can imagine that the accuracy of the classifier could impact the accuracy of the channel importance component. To disentangle this effect, we first plot the ROC of the classifier on the test set. This is shown in figure 5(a), where the optimal operating point is marked in green (obtained using the validation set). The AUC is 0.84, significantly better than the AUC of 0.72 for a standard LSTM.

Using this operating point, we show in figure 5(b) the channel localization by the model aggregated over the entire test set. The effect of using different operating points is demonstrated in section 5.2. The top and bottom bars show the aggregated channel localization for class 0 (no event) and class 1 (event), respectively. For class 1, the network has picked the magnetic field channels, with the strongest importance given to $B_N$ as expected. Note that the second highest importance is given to the original frame component closest to $B_N$. In class 0 cases, there is nothing unique about $B_N$ or the other magnetic field components, and the network has correctly selected channels with plasma variables, such as density and temperature, as the most important.

The importance of the magnetic field variables in class 1 events, as identified by the model, is further illustrated in figure 5(c). We included the six magnetic field channels as part of the ground truth in the case of class 1. As we can see, the model has high precision when it retrieves the top few channels, and the score gradually decreases as more channels are retrieved. The trend shows high agreement of the retrieved channels with the ground truth channels. We also show the standard deviation of the hit rate across the test set for the retrieved channels, to give a better understanding of model performance.

(a) Classification performance shows a very high degree of agreement between ground truth and predicted labels for the NASA data.
(b) Channel localization by the model aggregated over all of the test data. We compute the average activation of the channel gate over the test set and then normalize the values by the test set size. As we can see, IETNet separates the variables by strongly picking the magnetic field channels for class 1, as shown in the bottom panel, while for class 0 the network relies on the plasma channels to indicate the background class.
(c) Channels picked by the model on the test set for the NASA problem. We plot the mean average precision at various $k$'s. The model agrees with the ground truth, especially at the first few $k$'s. We include the six magnetic field channels as part of the ground truth.
Figure 5. NASA data experiments

5. Discussion

5.1. Variable Persistence

To test the robustness of the channel localizer, we conducted several experiments in which we judiciously removed certain channels, retrained and reran the model, and examined the impact on the relative importance of the remaining channels. We saw in section 4.4 that $B_N$ and the original frame component closest to it are the two most important channels. Removing the latter, the new model still selects $B_N$ as the most prominent channel, as shown in figure 6(a). Similarly, removing $B_N$, the new model correctly selects the previously second most important channel as the most prominent. In our third experiment, we remove both of these channels. Interestingly, the model now selects plasma variables such as the density and temperatures as the most informative channels, as shown in figure 6(c). This makes sense from a physical understanding of FTEs: in the absence of the highly informative magnetic field components, one has to rely more on the plasma variables for identification of FTEs. Note that all 15 variables have predictive power, but the most prominent ones are the magnetic field variables.

(a) Removed the second most prominent channel. The model still picks $B_N$ as the top informative channel.
(b) Removed $B_N$, the most prominent channel. The model selects the previously second most important channel as the top informative channel.
(c) Removed both of the most prominent magnetic field channels. The model picks the plasma variables, the next most informative channels.
Figure 6. Variable Persistence

5.2. Impact of Operating Point

Next, we examine the impact of the operating point selection on the channel localization. Figure 7 shows the results at three operating points along the ROC curve. The channel importance for each instance is affected by the accuracy of the classifier on that instance, and similarly we would expect the aggregated channel importance to be a mix of the channel importance for classes 0 and 1, with the balance dependent on the operating point. The operating point sets the balance between the true positive and false positive rates. At a low threshold, the classifier has high sensitivity at the expense of a higher false positive rate; in such a case, one would expect the aggregated channel importance to have a stronger influence from class 0, with the opposite expected at a high threshold (low sensitivity but a low false positive rate). This is exactly what is observed. Recall that for class 0 the plasma variables are the most prominent, whereas for class 1 the magnetic field channels, led by $B_N$, are the most prominent. $B_N$ becomes increasingly dominant in the aggregated test set as one moves up the ROC curve, and past the optimal threshold it starts to decrease in importance relative to the plasma variables, while remaining an important variable.

(a) Low threshold
(b) Optimal threshold
(c) High threshold
Figure 7. Impact of operating point

6. Conclusion

Here we proposed a new neural network, IETNet, capable of identifying the most important channels for each classification instance of multivariate time series data. The efficacy of this network was demonstrated through two examples: the N-body problem and in situ spacecraft measurements from a recent NASA mission. Detailed analysis of the model on the N-body simulation and the NASA spacecraft sensor data reveals a high degree of agreement between our prior knowledge of the important channels and the channels picked by the model. As most natural stimuli are time-continuous and multivariate, the approach promises to be of great utility in real-world applications. We plan to extend this network so that, rather than the mean, it predicts the probability distribution function, which can then be used to further quantify the significance of each channel localization instance. Generalization to various publicly available data sets is another direction of future research.

References

  • S. Aghabozorgi, A. S. Shirkhorshidi, and T. Y. Wah (2015) Time-series clustering–a decade review. Information Systems 53, pp. 16–38. Cited by: §1.
  • C. M. Antunes and A. L. Oliveira (2001) Temporal data mining: an overview. In KDD workshop on temporal data mining, Vol. 1, pp. 13. Cited by: §1.
  • R. Assaf and A. Schumann (2019) Explainable deep neural networks for multivariate time series predictions. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 6488–6490. Cited by: §2.
  • S. Bai, J. Z. Kolter, and V. Koltun (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271. Cited by: §2, §3.1.
  • F. Karim, S. Majumdar, H. Darabi, and S. Harford (2018) Multivariate LSTM-FCNs for time series classification. arXiv preprint arXiv:1801.04503. Cited by: §2.
  • H. Karimabadi, T. Sipes, Y. Wang, B. Lavraud, and A. Roberts (2009) A new multivariate time series data analysis technique: automated detection of flux transfer events using cluster data. Journal of Geophysical Research: Space Physics 114 (A6). Cited by: §4.4.
  • Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §1.
  • N. S. Madiraju, S. M. Sadat, D. Fisher, and H. Karimabadi (2018) Deep temporal clustering: fully unsupervised learning of time-domain features. arXiv preprint arXiv:1802.01059. Cited by: §2.
  • C. T. Russell and R. Elphic (1979) ISEE observations of flux transfer events at the dayside magnetopause. Geophysical Research Letters 6 (1), pp. 33–36. Cited by: §4.4.
  • A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap (2017) A simple neural network module for relational reasoning. In Advances in neural information processing systems, pp. 4967–4976. Cited by: §2.
  • J. Schmidhuber (2015) Deep learning in neural networks: an overview. Neural networks 61, pp. 85–117. Cited by: §1.
  • H. Song, D. Rajan, J. J. Thiagarajan, and A. Spanias (2018) Attend and diagnose: clinical time series analysis using attention models. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.
  • Y. H. Tsai, P. P. Liang, A. Zadeh, L. Morency, and R. Salakhutdinov (2018) Learning factorized multimodal representations. arXiv preprint arXiv:1806.06176. Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2, §4.1.
  • X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803. Cited by: §2.
  • Y. Xu, S. Biswal, S. R. Deshpande, K. O. Maher, and J. Sun (2018) Raim: recurrent attentive and intensive model of multimodal patient monitoring data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2565–2573. Cited by: §2.
  • Y. Yuan, G. Xun, F. Ma, Y. Wang, N. Du, K. Jia, L. Su, and A. Zhang (2018) Muvan: a multi-view attention network for multivariate temporal data. In 2018 IEEE International Conference on Data Mining (ICDM), pp. 717–726. Cited by: §2.
  • A. Zadeh, P. P. Liang, N. Mazumder, S. Poria, E. Cambria, and L. Morency (2018) Memory fusion network for multi-view sequential learning. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.