1 Introduction
Changepoints, the times at which the structure or parameters of a system change, are critical to detect in many domains. In medicine, finance, climate science and other fields, these changes can indicate that important events have occurred (e.g. onset of illness or a financial crisis), or that a system has changed in critical ways (e.g. increasing illness severity). Both types of changes influence decisions about treatment and policies. Changepoint detection (CPD) aims to find these critical times automatically, but this is a challenging problem, as changes can result in complex patterns across multiple observed variables, which may also be interdependent. Further, not all changepoints lead to a sudden transition; many occur over a period of time (e.g. weight loss, transition between activities) and are harder to identify. Being able to detect such changes would therefore have wide applicability.
Both parametric and nonparametric solutions have been proposed for CPD. Parametric methods (Adams and MacKay, 2007; Montanez et al., 2015) often make strong assumptions about the data and most are context specific, so they face difficulty when changes result in complex temporal patterns that are hard to model manually. Nonparametric methods (Saatçi et al., 2010; Li et al., 2015) address this with engineered divergence metrics or kernel functions, but the choice of parameters or kernels significantly affects accuracy. More fundamentally, these methods focus on abrupt changes, while in real world applications like activity recognition or finding onset of illness, a change may be gradual and happen over different durations. Some methods exist to detect gradual changepoints (Bardwell and Fearnhead, 2017; Harel et al., 2014), but cannot handle changes occurring at arbitrary timescales. In applications such as detecting activity changes, though, how quickly someone transitions from sitting to standing should not affect CPD accuracy.
At the same time, Deep Neural Networks (DNNs) can learn functions automatically and be adapted to new tasks if there is sufficient training data, leading to their use in applications such as time series classification (Yang et al., 2015). DNNs have not yet been exploited for CPD, though, and face challenges in generalizing across scales, since they require training data for all possible transition speeds. Because such data can be costly or infeasible to collect, it is ideal to have a scale-invariant approach that can generalize beyond observed timescales.
To address this gap, we propose a novel DNN architecture for CPD in multivariate data. Our approach further makes two key contributions to neural network architecture, a trainable wavelet layer and Pyramid Recurrent Neural Networks (PRN), which provide scale invariance and allow better use of multivariate data where patterns appear at varying speeds. We focus here on CPD due to the significance of this task; however, the approach is highly general and may be applicable to classification problems and time series analysis more generally. On both simulated and real-world data, we show that our architecture allows more accurate detection of both abrupt and extremely gradual changes (e.g. weight loss), and further is scale invariant, allowing detection of changes at any timescale, regardless of those seen in training.
2 Related Work
Changepoint detection is a core problem for timeseries analysis. One approach is to use a model and find when observations deviate from what is predicted by the model. Bayesian Online CPD (BOCPD) (Adams and MacKay, 2007) detects changes in an online manner, but makes the limiting assumption that the time series between changes has a stationary exponentialfamily distribution. More generally, Bayesian techniques require full definition of the likelihood function (Montanez et al., 2015), which may be difficult to specify. Nonparametric models (Saatçi et al., 2010) increase flexibility, but also increase computational complexity. Gaussian Graphical Models (GGMs) move beyond the univariate case to detect changes in multivariate time series (Xuan and Murphy, 2007). GGM is an offline method that models the correlations between multivariate time series using a multivariate Gaussian. This method is closest to ours as it focuses on multivariate CPD, but unlike our approach it makes strong assumptions about the data distribution.
To eliminate the need to specify a model, model-free approaches have emerged, such as density-ratio estimation methods (Yamada et al., 2013; Liu et al., 2013; Kuncheva and Faithfull, 2014) and kernel methods (Harchaoui et al., 2009; Li et al., 2015), but these depend strongly on the chosen estimation methods or kernels, which might be domain specific. Idé et al. (2016) proposed an online method for CPD in multivariate data, which focuses on handling noisy data by first extracting features to capture major patterns. While this can overcome noise, it may miss more subtle or gradual changes. Other techniques define custom divergence functions, such as the difference in a covariance matrix (Barnett and Onnela, 2016); however, not all changes will result in a significant change in covariance, such as when there are small changes across a number of variables that in aggregate indicate a system change. Statistical methods have other limitations, such as reliance on the choice of kernels in MMD (Gretton et al., 2007), choice of parameters in Hotelling T-square (Chen and Gupta, 2000), or prior information in CUSUM (Page, 1954); or high computational complexity with large samples for generalized likelihood ratio (James et al., 1992). Thus such models cannot be readily used in a new domain without re-engineering the divergence or kernel functions.
Few methods were explicitly designed to detect gradual changes, though BOCPD has been extended this way by reformulating changes as segments instead of points (Bardwell and Fearnhead, 2017). Alternatively, gradual changes can be formulated as concept drifts (Harel et al., 2014). We instead develop scale-invariant models, which can handle short- and long-term temporal patterns, and can generalize to novel timescales without extra computation or training data.
Deep learning, which allows recognition of complex patterns in large datasets without engineering of features and metrics, provides a promising way to address the core challenges of CPD. CNNs can learn to extract increasingly abstract features from raw data through a stack of nonlinear convolutions, allowing, for example, recognition of hundreds of object types in natural images (Szegedy et al., 2015). RNNs further can learn complex temporal patterns in sequences of arbitrary length (e.g. to recognize human activities (Hammerla et al., 2016)), which are exactly the types of changes that are challenging for CPD.
Ideally, a CPD method should perform equally well on test data regardless of whether changes happen faster or slower than seen in training data. However, the fixed resolution of CNN and RNN architectures makes them sensitive to scale. While CNN extensions can model multiple scales simultaneously (Shen et al., 2015), this is not the same as scale invariance, as it simply concatenates features. Chung et al. (2016) introduce Hierarchical Multiscale Recurrent Neural Networks (HMRNN), which process a sequence through successive RNN layers at different resolutions. This can improve efficiency by detecting boundaries and only updating the RNN when a change is recognized. However, the layers of RNN resemble layers of convolution in CNNs (modeling the signal at a different abstraction level) and are not invariant to scale changes at the same abstraction level. Our proposed architecture, PRN, in contrast combines advantages of both CNN and RNN and augments them with scale invariance.
RNNs are naturally built to model long-term dependencies, as is necessary for recognizing gradual changes, but suffer from vanishing gradients. Extensions such as Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) address vanishing gradients, but still have limited memory space. Intuitively, information from an infinitely long sequence cannot be stored in a fixed-dimensional RNN cell. Skip RNNs were proposed to reduce computational complexity by skipping state updates while preserving the performance of baseline RNN models (Campos et al., 2017). The skip operation avoids redundant updates, but risks skipping temporal dependencies (especially long-term ones), which can hurt overall performance. To address this, recent work has augmented RNNs with various types of memory or stack (Sukhbaatar et al., 2015; Joulin and Mikolov, 2015), but these methods are not scale-invariant. Frameworks like feature pyramid networks (Lin et al., 2017) and wavelet CNN (Fujieda et al., 2018) have been proposed for images with different scales or resolutions, though neither is directly applicable to time series data. In contrast, our PRN models infinitely long sequences with its multiscale RNN, which forms a stack of memory cells in an arbitrary number of levels. A higher-level RNN cell in a stack has lower resolution, and thus can store longer dependencies at no additional computational cost, while a lower-level RNN cell has a high resolution and prevents the loss of details in the short term.
3 Method
We propose a new class of deep learning architectures called Pyramid Recurrent Neural Networks (PRNs). The model takes a multivariate time series and transforms it into a pyramid of multiscale feature maps using a trainable wavelet layer (NWL). All pyramid levels are processed in parallel using multiple streams of CNN with shared weights, yielding a pyramid of more abstract feature maps (DWN). Next, we build a multiscale RNN on top of the pyramid feature map to encode longer-term dependencies. The PRN output is used to detect changes at each time step with a binary classifier.
3.1 Deep Wavelet Neural Networks (DWN)
CNNs can learn to recognize complex patterns in multivariate time series, partly due to parameter sharing across time, which leads to shift invariance. CNNs are not scale-invariant, though, so a learned pattern cannot necessarily be recognized when it appears more gradually or quickly. To make CNNs scale-invariant, we introduce Deep Wavelet Neural Networks (DWN), which consist of a Neural Wavelet Layer (NWL) followed by parallel streams of CNN.
The NWL can be seen as a set of multiscale convolutions with trainable kernels, which are applied in parallel on each variable of the input time series. The input to the NWL is a multivariate time series with one value per variable at each timepoint. The NWL takes this series and produces multiple feature maps, which together form a pyramid of convolution responses. That is:
(1) 
An example is shown in Figure 1. Specifically, the NWL uses the filter bank technique (Mallat, 1999) for discrete wavelet transform. Given a pair of separating convolutional kernels (typically a lowpass and a highpass kernel), it convolves the signal with both, outputs the highpass response, and downsamples the lowpass response for the next iteration. It repeats this process and in each iteration outputs an upper level of the output pyramid. Although traditional wavelets such as Haar or Gabor (Mallat, 1999) can be used, we have experimentally found that initializing the filter banks with random numbers and training them using backpropagation with the rest of the network leads to higher accuracy.
More formally, the NWL is characterized by its trainable kernels, a lowpass and a highpass kernel of fixed size for each input variable. Given each channel of the input series, the NWL iteratively computes lowpass and highpass responses, starting with the first-level responses, which are:
(2) 
where the two operators denote convolution and a downsampling operation (e.g. implemented by linear interpolation). At each subsequent iteration of the wavelet transform, given the responses of the previous iteration, it computes new lowpass and highpass responses such that:
(3)
This operation is repeated for a prespecified number of iterations, or until the length of the responses becomes smaller than a threshold. The number of iterations is a hyperparameter that can be selected using cross-validation. A larger number of iterations (or a smaller threshold) results in a larger receptive field at the highest level of the pyramid, enabling the detection of more gradual patterns. However, a large number of iterations also brings more computation and requires a larger buffer in the case of online processing.
The outputs of each iteration for all variables can be concatenated to form
(4) 
where the operator denotes concatenation. The output of the NWL is the stack of these concatenated responses over all iterations. These are called the different levels of a pyramid throughout this paper. In the original filter bank method the last lowpass response is also stacked with the output, but we did not observe an improvement when including it.
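As a minimal sketch of the filter bank procedure described above, the following Python code builds a pyramid of highpass responses for each variable. The function names, the zero padding at the borders, and the downsampling by striding are illustrative assumptions rather than the exact formulation used here; in the NWL the kernels are initialized randomly and trained by backpropagation, and downsampling may instead use linear interpolation.

```python
def convolve_same(signal, kernel):
    """Slide a kernel over a 1-D list, zero-padding so the length is kept.

    As in most deep learning libraries, the kernel is not flipped.
    """
    half = len(kernel) // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + j - half
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def downsample(signal, ratio=2):
    """Keep every `ratio`-th sample (assumed stand-in for interpolation)."""
    return signal[::ratio]


def wavelet_pyramid(signal, lowpass, highpass, levels, min_length=2):
    """Return the highpass responses of one variable, finest level first."""
    pyramid = []
    current = list(signal)
    for _ in range(levels):
        if len(current) < min_length:
            break
        # Output the highpass response at the current resolution.
        pyramid.append(convolve_same(current, highpass))
        # Downsample the lowpass response for the next iteration.
        current = downsample(convolve_same(current, lowpass))
    return pyramid


def neural_wavelet_layer(channels, kernels, levels):
    """Apply one (lowpass, highpass) kernel pair per variable.

    Returns a list of levels; each level groups the responses of all variables.
    """
    per_channel = [
        wavelet_pyramid(x, lo, hi, levels)
        for x, (lo, hi) in zip(channels, kernels)
    ]
    return [list(level) for level in zip(*per_channel)]
```

With a series of 16 time steps and 3 levels, this sketch yields responses of length 16, 8, and 4, so each level is half as long as the one below it. The last lowpass response is discarded, matching the choice described above.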
The key advantage of a NWL over a conventional convolution layer is that a single wavelet can encode the input with multiple granularities at once, whereas a single convolution only encodes a single granularity. Although different layers of a CNN have different granularities, they encode the data at a different level of abstraction, and thus cannot simultaneously extract the same pattern at different scales. On the other hand, a single wavelet layer can encode changes with the same patterns at different paces, simultaneously into the same feature map, at different levels of the pyramid.
We will use the proposed NWL as a part of a larger, deeper architecture, which is described in the rest of this section. Hence, an important aspect of NWL is that it can be used as a layer of a deep network, in composition with other neural layer types such as convolutional and fully connected layers. For example, the input to a wavelet layer can be the output of a convolutional layer. Alternatively, to stack a convolutional layer on the output of a wavelet layer, one should apply the convolution on each level of the wavelet pyramid, resulting in a pyramidshaped output.
Accordingly, a network composed of one wavelet layer and an arbitrary number of other layers can take a multivariate time series as input, and produce a pyramid-shaped response as output. We refer to such a network architecture as a Deep Wavelet Neural Network (DWN). In this paper we use a specific form of DWN, which starts with a NWL, directly applied on the input time series, followed by parallel streams of CNN with shared parameters, each of which takes one level of the NWL pyramid. More specifically, we use a multi-layer CNN with a downsampling stride at each layer, so that the total downsampling factor is the product of the per-layer strides, and with a fixed number of feature maps at each layer. We apply that CNN in parallel on each level of the output pyramid of the NWL, so that each pyramid level is mapped to a corresponding, more abstract feature map.
3.2 Pyramid Recurrent Layer
The DWN output is a multiscale pyramid of sequential feature maps that encode shortterm temporal patterns at different times and scales. It is common to process sequential features using an RNN, to encode longerterm temporal patterns. However, conventional RNNs process a single sequence, not a multiscale pyramid of sequences. Similar to the need for a wavelet layer, RNNs are not scaleinvariant, so an RNN cannot necessarily recognize a temporally shortened or stretched instance of a learned pattern without having seen this scale in training. Further, RNNs fail to learn very gradual patterns, due to limited memory. While this can be addressed by memoryaugmented networks, they remain sensitive to scale.
To address these issues, we introduce a novel hierarchically connected variant of RNNs. Our proposed network, PRN, scans the multiscale output of a DWN, and simultaneously encodes temporal patterns at different scales. An RNN is applied in parallel on different levels of the input pyramid. On each level at each step, it takes as input the corresponding entry from the input pyramid, along with the most recent output of the RNN operating at the upper level. We concatenate those two vectors and feed as input to the RNN. We refer to this technique as Pyramid Recurrent Layer (PRL).
Given the value of the input pyramid at each level and time, and assuming a fixed downsampling ratio in the wavelet transform (i.e., each level of the pyramid is that many times as long as its upper level), we can write the recurrent state at each level and time as:
(5) 
where the outer function is a nonlinear activation function such as ReLU, and the three sets of weights are trainable parameters of this layer. These parameters define a linear transformation of the current state, past state, and higher-level state, as illustrated in Figure 2. Note that the proposed hierarchical structure is agnostic to the function of each cell. Although we used a simple RNN cell for illustration, we could use any variant of RNNs such as LSTM (Hochreiter and Schmidhuber, 1997) or Skip RNN (Campos et al., 2017) as the RNN cell.
The proposed architecture can be compared with an RNN operating on a single data sequence. If the data granularity is high, the RNN likely fails to model long-term dependencies, due to the well-known problem of vanishing gradients. One can lower the data granularity, so long-term patterns can be summarized in fewer steps, but this results in the loss of details. Accordingly, conventional RNNs were not designed to effectively detect both abrupt and gradual patterns at the same time. On the other hand, in the proposed PRL, each RNN unit is provided with inputs from the same level of granularity as well as the level above. The RNN that operates at the lowest level, in turn, receives information from all levels of granularity. Figure 2 illustrates the effect of forgetting using decreasing color saturation. While it is impossible to keep track of the past through the lower level alone, the information path from upper levels connects the past to the present in only three steps. This lets the PRL model long-term patterns, while it can still model fine details through the lower levels.
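The recurrence described in this section can be sketched as follows, using a simple RNN cell with ReLU activation. This is an illustration under stated assumptions, not the exact layer: the names `w_in`, `w_rec`, and `w_up` are hypothetical, the weights are assumed to be shared across levels, the top level receives a zero vector in place of a higher-level state, and the upper-level state for step t is taken at index t // ratio. Processing the coarsest level first is an offline simplification.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def pyramid_recurrent_layer(pyramid, w_in, w_rec, w_up, ratio=2):
    """Run a simple ReLU RNN over every pyramid level.

    pyramid[l][t] is the input vector at level l (0 = finest) and step t.
    w_in maps inputs, w_rec maps the past state, and w_up maps the state
    of the level above. Returns states[l][t], the hidden state vectors.
    """
    hidden = len(w_rec)
    states = [None] * len(pyramid)
    # Start at the coarsest level so upper states exist when needed.
    for level in reversed(range(len(pyramid))):
        prev = [0.0] * hidden
        level_states = []
        for t, x in enumerate(pyramid[level]):
            if level + 1 < len(pyramid):
                upper = states[level + 1]
                # Assumed alignment between a step and its upper-level step.
                up = upper[min(t // ratio, len(upper) - 1)]
            else:
                up = [0.0] * hidden
            pre = [
                a + b + c
                for a, b, c in zip(
                    matvec(w_in, x), matvec(w_rec, prev), matvec(w_up, up)
                )
            ]
            prev = [max(0.0, p) for p in pre]  # ReLU activation
            level_states.append(prev)
        states[level] = level_states
    return states
```

In this sketch the states at the finest level, `states[0]`, are the ones a classifier would read, since they receive information from every coarser level through the `w_up` path.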
3.3 Pyramid Recurrent Neural Networks (PRN)
We propose PRN as a composition of a DWN and a PRL. An input time series of arbitrary length is transformed through a DWN into a pyramid-shaped representation, which is then fed into a PRL. For CPD and other classification problems, a logistic regression layer is built on the output of the RNN cells that operate at the lowest level of the pyramid. This layer produces detection scores at each time step with the highest possible granularity. The detection score at each time step is:
(6)
where the outer function is the sigmoid function and its weights and bias are trainable parameters. The classification loss at each time step is the cross-entropy loss between the detection score and the ground truth at that time:
(7)
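As a hedged illustration of the logistic regression layer and the per-step loss described above, the sketch below computes a detection score from a lowest-level RNN state and the binary cross-entropy against a ground-truth label. The names `weights` and `bias` and the clipping constant `eps` are assumptions introduced for this example.

```python
import math


def detection_score(state, weights, bias):
    """Sigmoid of a linear function of the lowest-level RNN state."""
    z = sum(w * h for w, h in zip(weights, state)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(score, label, eps=1e-12):
    """Binary cross-entropy between a detection score and a 0/1 label."""
    # Clip the score so the logarithm stays finite.
    score = min(max(score, eps), 1.0 - eps)
    return -(label * math.log(score) + (1 - label) * math.log(1.0 - score))
```

An uninformative score of 0.5 gives a loss of log 2 for either label, which is a convenient reference value when monitoring training.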
We optimize this loss using stochastic gradient descent on the parameters of the classifier, the PRL, and the NWL.
4 Evaluation
We compare the proposed PRN to deep learning and CPD baselines. Using both simulated and real-world datasets, we show that PRNs can detect abrupt and gradual changes more accurately than existing methods and can be used for activity recognition by learning labels for different changes.
4.1 Datasets
Synthetic dataset
We create a synthetic dataset to evaluate accuracy at simultaneously detecting gradual and abrupt changes. We construct 2000 time series, each with 12 variables and 8192 time steps. Each time series is a combination of a Brownian process and white noise. Each has 4 randomly placed changepoints, which are defined as a shift in the mean of 4 randomly selected dimensions with random duration and size of change. A duration of 0 is an abrupt change, while longer durations provide more challenging cases to recognize. An example of the simulated time series, together with ground truth and detection results, is shown in Figure 3. We randomly split the series, with 1000 for training and 1000 for testing. To demonstrate robustness of the proposed method against variability in scale, we also do a split by scale, where all changes in one half are strictly more gradual than all in the other half.
OPPORTUNITY dataset
This activity recognition dataset (Chavarriaga et al., 2013) provides a challenging real-world test, as activity changes take place at varying durations. The data are on-body sensor recordings from 4 participants performing activities of daily living, such as cleaning a table. Each participant has 6 records (runs) of 20 min each. Values of 72 sensors (10 modalities) were recorded at 30 Hz, and manually labeled with 18 activities. Following Hammerla et al. (2016), we ignore variables with missing values, which leads to 79 variables for each record. We use run 2 of subject 1 for validation, runs 4 and 5 of subjects 2 and 3 for testing, and the rest for training. We consider the transition between two activities a changepoint, and use the activity labels from the data as ground truth.
Bee Waggle Dance dataset
This is our second real-world test case (Oh et al., 2008). The data include six videos of bee waggle dances (used to communicate with other bees) at 30 frames per second, with 3 variables encoding the bee’s position and head angle at each frame. Using the position and angle information, each frame is labeled as “turn left”, “turn right”, or “waggle dance.” Similar to OPPORTUNITY, we consider the transition between two activities a changepoint. We test our method and other baselines on sequence 1 of the bee data, training on time series from the first 256 frames and testing on the other 768 frames. We use this small training set for consistency with prior work (Saatçi et al., 2010), and as a challenging evaluation.
4.2 Baselines
We compare the proposed architecture to unsupervised CPD and supervised deeplearning baselines:
GGM (Xuan and Murphy, 2007) is related to BOCPD, a classic method for CPD, but was selected for fairer comparison as GGM is offline and allows multivariate time series.
CNN We use a CNN that takes a time series as input and predicts a sequence of detection scores for changes. Due to the widely used max-pooling layers, the output has a lower temporal granularity compared to the input. We refer to the ratio of the output length to the input length as the granularity ratio.
RCN We apply an RNN to the output of the CNN. The output has the same granularity as CNN, while each step of the output has a larger receptive field that encodes past data.
HMRNN We compare against HMRNN (Chung et al., 2016), which is multiscale and more efficient than RNN.
DWN Our DWN is formed by applying an NWL to the input time series and feeding the output pyramid levels to parallel branches of a CNN. The output of CNN is upsampled to have the same size and fused by arithmetic mean.
PRN We apply the complete proposed method which consists of a DWN followed by a Pyramid Recurrent Layer to fuse levels of the pyramid.
PRNS We replace the standard RNN cell in our PRN with a Skip RNN (Campos et al., 2017) to test whether a more efficient RNN can provide the same performance as PRN.
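For the DWN baseline above, the per-level CNN outputs are upsampled to a common size and fused by arithmetic mean. A minimal sketch of that fusion step follows; the upsampling method is not specified in the text, so nearest-neighbor repetition is assumed here, and the function name is hypothetical.

```python
def fuse_levels(level_scores):
    """Upsample each level to the longest length and average them.

    level_scores is a list of 1-D score sequences, one per pyramid level.
    Upsampling uses nearest-neighbor repetition (an assumption).
    """
    target = max(len(s) for s in level_scores)
    upsampled = []
    for s in level_scores:
        # Map each output step to the source step that covers it.
        upsampled.append(
            [s[min(t * len(s) // target, len(s) - 1)] for t in range(target)]
        )
    return [sum(column) / len(column) for column in zip(*upsampled)]
```

For example, fusing a level of length 4 with a level of length 2 repeats each value of the coarser level twice before averaging.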
4.3 Implementation details
We briefly summarize implementation of the baselines, and provide full details in the appendix. All of the deep-learning baselines share a core CNN architecture, with each convolution layer followed by max-pooling and ReLU activation, and output fed to a fully connected perceptron with sigmoid activation, which results in binary detection scores at each time step. The granularity ratio for this architecture is . For DWN, PRN, and PRNS, we used a 7-level wavelet with kernel size 3 for both synthetic and OPPORTUNITY data. Due to the small size and more abrupt transitions in the bee data, we used a 5-level wavelet with kernel size 3. For all datasets RCN and PRN used an LSTM cell with 256 hidden units, and HMRNN used 128 units (3 layers). At test time, the models take a time series and predict a sequence of detection scores. To detect changepoints, we apply non-maximum suppression with a sliding window, whose length controls how close two distinct detected changes can be, and filter the maximum values with a threshold. We evaluate AUC by iterating over this threshold. For GGM, we use the full covariance model to capture the correlations between features. We use a uniform prior as in Xuan and Murphy (2007), and set the pruning threshold to . GGM is unsupervised, but used the same test data as all other methods for fair comparison.
The real-world datasets (Bee and OPPORTUNITY) include diverse changepoints formed by transitions between many activities. Thus we use multitask learning, training the model to both detect changes and classify activities by changing the output dimension of the last fully connected layer to have N units (N = number of activities + 1: 19 for OPPORTUNITY, and 4 for Bee), with the first N-1 units predicting a log probability for each activity and the last unit the probability of a change.
As detected changepoints may not exactly match the true times, we use a tolerance parameter that sets how close a detected change must be to a true change to be considered correct. Precision is matched detections divided by all detections, and recall is matches divided by true changes.
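The tolerance-based evaluation above can be sketched as follows. The text does not state whether several detections may match the same true change, so this example assumes a greedy one-to-one matching in which each true change is matched at most once; the function name is illustrative.

```python
def precision_recall(detected, truth, tolerance):
    """Match detections to true changes within a tolerance (in time steps).

    Each true change can be matched by at most one detection (assumption).
    Returns (precision, recall).
    """
    unmatched = sorted(truth)
    matches = 0
    for d in sorted(detected):
        candidates = [t for t in unmatched if abs(t - d) <= tolerance]
        if candidates:
            # Match to the closest true change that is still available.
            closest = min(candidates, key=lambda t: abs(t - d))
            unmatched.remove(closest)
            matches += 1
    precision = matches / len(detected) if detected else 0.0
    recall = matches / len(truth) if truth else 0.0
    return precision, recall
```

Increasing the tolerance can only keep or raise the number of matches in this scheme, which is why AUC is reported separately for each tolerance value.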
4.4 Results
4.4.1 Synthetic data
Fig. 3 illustrates results for our scale-invariant PRN and the scale-sensitive CNN. CNN has a higher false positive rate, while also missing a change. While detected changes and ground truth are not always precisely aligned, small gaps are acceptable for gradual changes, where it can be hard to define a single moment when the change occurs.
We use three train-test splits to test extrapolation (from abrupt to gradual and vice versa) and ability to handle a mix of scales. First, real-world cases are likely to have a mix of scales in both training and test data, and in this case (fig. (c)) PRN and PRNS perform best, while HMRNN and GGM have the lowest AUC. HMRNN relies on correct detection of boundaries and is not invariant to scale changes at the same level of abstraction, leading to errors with gradual changes. GGM highlights the challenge of unsupervised learning with a small number of events per time series. In the two scale-variant splits, the model must extrapolate patterns from training data to novel scales. Comparing mixed scales to training on abrupt and testing on gradual changes (fig. (a)), we see that this is challenging for methods that are not scale-invariant, as shown by the drop in performance for CNN (from 41% to 15%) and RCN (from 39% to 11%) when the tolerance is 64 steps. While AUC of our methods (DWN and PRN) also decreases with this more difficult task, the drop is substantially lower for DWN (20%) due to the wavelet layer and shared parameters across scales (see Appendix for details). Finally, when training on gradual and testing on abrupt changes (fig. (b)), our approach again outperforms CNN and RCN due to its ability to generalize across scales. With a tolerance of 64 steps, AUC is higher for PRN (72%) and DWN (79%) compared to CNN (67%) and RCN (27%). Note that DWN outperforms PRN and PRNS, and similarly, CNN outperforms RCN, as recurrent architectures are generally less effective for this kind of extreme generalization. The performance of our DWN shows the effectiveness of the added wavelet layer in modeling both gradual and abrupt changes in time series.
4.4.2 OPPORTUNITY data
On this real-world activity data, PRN outperforms all other methods at all tolerance levels, as shown in fig. (a). In contrast to the synthetic data, PRNS now has significantly lower AUC than both PRN and DWN. It may be that Skip RNN is skipping important information encoded in our wavelet layer. HMRNN has the lowest AUC of all RNN-based methods, as it relies on distinct differences in distribution, which may not happen between activities. GGM had the lowest AUC for all cases, showing that supervised methods that can leverage data labels can provide better CPD. When the tolerance is 2 sec, a reasonable value for activity recognition, PRN has 82% AUC while DWN, RCN, CNN, PRNS, HMRNN, and GGM respectively achieve 75%, 74%, 69%, 47%, 24%, and 5%. Full results are in the Appendix.
The deep learning methods, PRN, PRNS, RCN, DWN, HMRNN, and CNN, respectively took 110, 105, 80, 69, 24, and 6 minutes to train and converge. Recurrent methods take longer due to backpropagation through time, but this only happens during training. DWN has a superior performance to RCN in most cases, while also being faster to train. While PRN has better performance, when computational complexity is higher priority, our DWN can be used instead.
4.4.3 Bee Waggle Dance data
On this second real-world dataset, PRN outperforms other methods when the tolerance is around 1 second, with AUC of 93% (fig. (b)). For a tolerance of around 2 seconds, PRN has 93% AUC while the next three best performing methods are RCN (84%), CNN (79%), and DWN (78%) (see Appendix). Further, changes in tolerance affect our method much less than others. For instance, when the tolerance is lowered from 32 to 16, the AUC of RCN drops significantly (from 84% to 18%), while AUC of PRN drops much less (from 93% to 61%). CNN has a dramatic drop in accuracy as the tolerance is lowered, suggesting it is consistently detecting changes with a delay. Thus, PRN is less sensitive to this parameter and more reliable for real-world cases. Similar to the OPPORTUNITY data, GGM has the lowest AUC for all tolerances, and HMRNN has the lowest AUC among deep learning methods. Both PRNS and HMRNN have higher max AUC on the Bee data (64% and 57%) than on OPPORTUNITY (51% and 30%) because the bee activities have more distinct boundaries than human ones. Thus, our PRN is more widely applicable.
5 Conclusion
We propose PRN, a new scale-invariant deep learning architecture, and show how it can detect changepoints ranging from abrupt to gradual in multimodality time series. The core is 1) DWN: a CNN with trainable wavelet layers that recognize short-term multiscale patterns; and 2) PRL: a pyramid-shaped RNN on top of the multiscale feature maps to simultaneously model long-term patterns and fuse multiscale information. Unlike existing DNNs, PRN can detect events involving short- and long-term patterns and extrapolate to scales not seen in training. Experiments on real and synthetic data demonstrate that PRN detects changes quickly, with lower sensitivity to parameters than other approaches. Future work will focus on handling missing and noisy labels with semi-supervised learning methods.
Appendix A Implementation details
All of the deep-learning baselines share a core CNN architecture on which the additional modules are built. We fix the architecture of the core CNN to be , where we use the notation for a convolution layer where is the kernel size, is the number of output feature maps, and is the pooling stride. Each convolution layer is followed by max-pooling and ReLU activation. The output of every baseline is fed to a fully connected perceptron with sigmoid activation, which results in binary detection scores at each time step. The granularity ratio for this architecture is . For DWN, PRN, and PRNS, we used a 7-level wavelet with kernel size 3 for both the synthetic and OPPORTUNITY datasets. For Bee Waggle Dance data, due to the small size and more abrupt transitions, we used a 5-level wavelet with kernel size 3. For all datasets RCN and PRN used an LSTM cell with 256 hidden units, and HMRNN used 128 units.
We train all models using Adam (Kingma and Ba, 2014) with an initial learning rate of 0.001 and early stopping to avoid overfitting. At test time, the models take a time series and predict a sequence of detection scores. To detect changepoints, we apply non-maximum suppression with a sliding window and filter the maximum values with a threshold. We evaluate AUC by iterating over this threshold. The window length is a hyperparameter that controls how close two distinct detected changes can be, and it is tuned for each method separately using cross-validation.
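The detection step described above can be sketched as a simple non-maximum suppression over the score sequence. This is an illustrative version only: the centered window and the handling of ties (every step of a flat peak is kept) are assumptions, and the function name is hypothetical.

```python
def detect_changepoints(scores, window, threshold):
    """Keep time steps that are local maxima within a centered window
    and whose score reaches the threshold."""
    half = window // 2
    detections = []
    for t, s in enumerate(scores):
        lo = max(0, t - half)
        hi = min(len(scores), t + half + 1)
        if s >= threshold and s == max(scores[lo:hi]):
            detections.append(t)
    return detections
```

Sweeping `threshold` from high to low yields progressively more detections, which is how a curve for AUC can be traced from a single score sequence.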
The real-world datasets (Bee data and OPPORTUNITY data) are more challenging than the synthetic data, as they include diverse changepoints formed by transitions between many activity types. To address this, we use multitask learning, training the model to both detect changes and classify activity by changing the output dimension of the last fully connected layer to have multiple units (19 for OPPORTUNITY data, and 4 for Bee data). For OPPORTUNITY data, the first 18 units predict a log probability for each activity and the last unit outputs the probability of a changepoint (for Bee data, these are 3 units and 1 unit). We define a softmax cross-entropy loss on the activity units and add it as a regularization term to the objective function. Multitask learning improved the results equally for all baselines, because the model has auxiliary information, namely the activity type and not just the existence of a change.
For GGM, we use the full covariance model instead of the independent features model to capture the correlations between features. We use a uniform prior as in (Xuan and Murphy, 2007), and set the pruning threshold to . Since there is no training for GGM, we evaluate the algorithm on the same test data as all other methods, for both the synthetic and real-world datasets.
We evaluate precision and recall, and report AUC. As detected changepoints may not exactly match the true changepoints, we use a tolerance parameter that sets how close a detected change must be to a true change to be considered a correct detection. We match each detected changepoint to the closest true changepoint within this tolerance, measured in time steps. Precision is the number of matched detections divided by the number of all detections, and recall is the number of matches divided by the number of true changes.
Appendix B Results detail
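As a reading aid for the tolerance values that index the result tables, the tolerance-based matching behind precision and recall can be sketched as follows. This is only an illustration under stated assumptions: the function name `precision_recall` is hypothetical, and the sketch assumes that each true changepoint can be matched by at most one detection, which the text does not specify.

```python
from typing import List, Tuple


def precision_recall(detected: List[int], true: List[int], tolerance: int) -> Tuple[float, float]:
    """Match each detection to the closest still-unmatched true changepoint
    within `tolerance` time steps, then compute precision and recall."""
    unmatched = sorted(true)
    matches = 0
    for d in sorted(detected):
        candidates = [t for t in unmatched if abs(t - d) <= tolerance]
        if candidates:
            closest = min(candidates, key=lambda t: abs(t - d))
            unmatched.remove(closest)  # assumption: one-to-one matching
            matches += 1
    precision = matches / len(detected) if detected else 0.0
    recall = matches / len(true) if true else 0.0
    return precision, recall


if __name__ == "__main__":
    detected = [10, 52, 200]
    true = [12, 50, 120]
    for tol in (1, 4):
        print(tol, precision_recall(detected, true, tol))
```

Larger tolerances can only add matches, so the scores in each table column rise as the tolerance grows.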
B.1 Synthetic data
Tables 1–3 show the AUC (Area Under the ROC Curve) results for synthetic data: Table 1 for the “train abrupt and test gradual” experiment, Table 2 for “train gradual and test abrupt”, and Table 3 for “train all and test all”.
Table 1: AUC on synthetic data, train abrupt and test gradual, by tolerance.
Tolerance  HMRNN  RCN    CNN    GGM    PRNS   DWN    PRN
8          0.01   0.002  0.003  0.007  0.006  0.005  0.003
16         0.01   0.008  0.011  0.010  0.023  0.016  0.011
32         0.01   0.026  0.042  0.012  0.076  0.063  0.043
64         0.011  0.107  0.153  0.012  0.271  0.213  0.155
128        0.013  0.391  0.421  0.013  0.625  0.546  0.503
256        0.064  0.783  0.757  0.016  0.849  0.822  0.863
512        0.120  0.824  0.801  0.023  0.862  0.835  0.876
Table 2: AUC on synthetic data, train gradual and test abrupt, by tolerance.
Tolerance  HMRNN  RCN    CNN    GGM    PRNS   DWN    PRN
8          0.002  0.001  0.013  0.007  0.033  0.102  0.027
16         0.007  0.003  0.049  0.011  0.115  0.301  0.100
32         0.017  0.014  0.188  0.013  0.347  0.599  0.376
64         0.040  0.269  0.665  0.013  0.689  0.787  0.724
128        0.140  0.650  0.797  0.014  0.811  0.822  0.816
256        0.144  0.795  0.814  0.018  0.833  0.833  0.830
512        0.148  0.817  0.830  0.026  0.834  0.834  0.830
Table 3: AUC on synthetic data, train all and test all, by tolerance.
Tolerance  HMRNN  RCN    CNN    GGM    PRNS   DWN    PRN
8          0.027  0.039  0.061  0.007  0.027  0.014  0.039
16         0.032  0.093  0.144  0.011  0.100  0.061  0.122
32         0.051  0.204  0.244  0.012  0.249  0.164  0.284
64         0.055  0.390  0.407  0.017  0.496  0.406  0.551
128        0.119  0.582  0.586  0.021  0.737  0.700  0.747
256        0.134  0.788  0.747  0.030  0.863  0.840  0.860
512        0.142  0.852  0.808  0.043  0.874  0.860  0.869
B.2 Real World data
B.2.1 OPPORTUNITY data
Table 4 shows the results for OPPORTUNITY data. Figure 6 shows example results for the OPPORTUNITY dataset tested using our scale-invariant PRN and the scale-sensitive CNN. In the time series, we see that CNN has a missed detection and at least one false positive around time 300, while PRN detects all changes close to their actual times. Overall, CNN has a higher false positive rate. While PRN’s detected changes and the ground truth are not always precisely aligned, the small gaps are acceptable in real-world data, where it can be hard to define a single moment when the change occurs.
Table 4: AUC on OPPORTUNITY data, by tolerance.
Tolerance  HMRNN  RCN    CNN    GGM    PRNS   DWN    PRN
2          0.013  0.036  0.024  0.007  0.007  0.034  0.040
4          0.028  0.077  0.066  0.016  0.024  0.093  0.104
8          0.053  0.213  0.197  0.022  0.068  0.234  0.297
16         0.143  0.513  0.442  0.027  0.236  0.515  0.601
32         0.187  0.713  0.629  0.032  0.418  0.712  0.773
64         0.240  0.744  0.687  0.046  0.471  0.753  0.815
128        0.300  0.771  0.710  0.065  0.507  0.759  0.833
B.2.2 Bee Waggle Dance data
Table 5 shows the results for Bee Waggle Dance data. Our proposed PRN has the highest AUC score for almost all tolerance values (the exceptions in Table 5 are tolerances 2 and 16, where HMRNN and CNN respectively score higher). Thus, our proposed PRN is less sensitive to the tolerance and more reliable for real-world cases.
Table 5: AUC on Bee Waggle Dance data, by tolerance.
Tolerance  HMRNN  RCN    CNN    GGM    PRNS   DWN    PRN
2          0.033  0.007  0.008  0.019  0.009  0.007  0.025
4          0.144  0.007  0.053  0.023  0.009  0.023  0.145
8          0.362  0.054  0.192  0.041  0.119  0.131  0.400
16         0.480  0.178  0.703  0.077  0.393  0.329  0.608
32         0.573  0.841  0.789  0.083  0.643  0.777  0.932
64         0.573  0.841  0.789  0.083  0.643  0.777  0.932
128        0.573  0.841  0.789  0.083  0.643  0.777  0.932
References
 Adams and MacKay [2007] Ryan Prescott Adams and David JC MacKay. Bayesian Online Changepoint Detection. arXiv preprint arXiv:0710.3742, 2007.
 Bardwell and Fearnhead [2017] Lawrence Bardwell and Paul Fearnhead. Bayesian Detection of Abnormal Segments in Multiple Time Series. Bayesian Analysis, (1):193–218, 2017.
 Barnett and Onnela [2016] Ian Barnett and Jukka-Pekka Onnela. Change Point Detection in Correlation Networks. Scientific Reports, 6:18893, 2016.
 Campos et al. [2017] Víctor Campos, Brendan Jou, Xavier Giró-i-Nieto, Jordi Torres, and Shih-Fu Chang. Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks. arXiv preprint arXiv:1708.06834, 2017.
 Chavarriaga et al. [2013] Ricardo Chavarriaga, Hesam Sagha, Alberto Calatroni, et al. The Opportunity Challenge: A Benchmark Database for On-body Sensor-based Activity Recognition. Pattern Recognition Letters, 34(15):2033–2042, 2013.
 Chen and Gupta [2000] Jie Chen and AK Gupta. Parametric Statistical Change Point Analysis. Springer, 2000.
 Chung et al. [2016] Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical Multiscale Recurrent Neural Networks. arXiv preprint arXiv:1609.01704, 2016.
 Fujieda et al. [2018] Shin Fujieda, Kohei Takayama, and Toshiya Hachisuka. Wavelet Convolutional Neural Networks. arXiv preprint arXiv:1805.08620, 2018.
 Gretton et al. [2007] Arthur Gretton, Karsten M Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex J Smola. A kernel method for the two-sample-problem. In NIPS, 2007.
 Hammerla et al. [2016] Nils Y. Hammerla, Shane Halloran, and Thomas Plötz. Deep, Convolutional, and Recurrent Models for Human Activity Recognition Using Wearables. In IJCAI, 2016.
 Harchaoui et al. [2009] Zaid Harchaoui, Eric Moulines, and Francis R Bach. Kernel change-point analysis. In NIPS, 2009.
 Harel et al. [2014] Maayan Harel, Shie Mannor, Ran El-Yaniv, and Koby Crammer. Concept Drift Detection through Resampling. In ICML, 2014.
 Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-term Memory. Neural Computation, 9(8):1735–1780, 1997.
 Idé et al. [2016] Tsuyoshi Idé, Dzung T Phan, and Jayant Kalagnanam. Change Detection Using Directional Statistics. In IJCAI, 2016.
 James et al. [1992] Barry James, Kang Ling James, and David Siegmund. Asymptotic Approximations for Likelihood Ratio Tests and Confidence Regions for a Change-point in the Mean of a Multivariate Normal Distribution. Statistica Sinica, pages 69–90, 1992.
 Joulin and Mikolov [2015] Armand Joulin and Tomas Mikolov. Inferring Algorithmic Patterns with Stack-augmented Recurrent Nets. In NIPS, 2015.
 Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kuncheva and Faithfull [2014] Ludmila I Kuncheva and William J Faithfull. PCA Feature Extraction for Change Detection in Multidimensional Unlabeled Data. IEEE Transactions on Neural Networks and Learning Systems, 25(1):69–80, 2014.
 Li et al. [2015] Shuang Li, Yao Xie, Hanjun Dai, and Le Song. M-statistic for Kernel Change-point Detection. In NIPS, 2015.
 Lin et al. [2017] Tsung-Yi Lin, Piotr Dollár, Ross B Girshick, et al. Feature Pyramid Networks for Object Detection. In CVPR, 2017.
 Liu et al. [2013] Song Liu, Makoto Yamada, Nigel Collier, and Masashi Sugiyama. Change-point Detection in Time-series Data by Relative Density-ratio Estimation. Neural Networks, 43:72–83, 2013.
 Mallat [1999] Stéphane Mallat. A Wavelet Tour of Signal Processing. Academic Press, 1999.
 Montanez et al. [2015] George D Montanez, Saeed Amizadeh, and Nikolay Laptev. Inertial Hidden Markov Models: Modeling Change in Multivariate Time Series. In AAAI, 2015.
 Oh et al. [2008] Sang Min Oh, James M Rehg, Tucker Balch, and Frank Dellaert. Learning and Inferring Motion Patterns using Parametric Segmental Switching Linear Dynamic Systems. IJCV, 77(1–3):103–124, 2008.
 Page [1954] Ewan S Page. Continuous inspection schemes. Biometrika, 41(1/2):100–115, 1954.
 Saatçi et al. [2010] Yunus Saatçi, Ryan D Turner, and Carl E Rasmussen. Gaussian Process Change Point Models. In ICML, 2010.
 Shen et al. [2015] Wei Shen, Mu Zhou, Feng Yang, et al. Multi-scale Convolutional Neural Networks for Lung Nodule Classification. In International Conference on Information Processing in Medical Imaging, 2015.
 Sukhbaatar et al. [2015] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end Memory Networks. In NIPS, 2015.
 Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, et al. Going Deeper with Convolutions. In CVPR, 2015.
 Xuan and Murphy [2007] Xiang Xuan and Kevin Murphy. Modeling Changing Dependency Structure in Multivariate Time Series. In ICML, 2007.
 Yamada et al. [2013] Makoto Yamada, Akisato Kimura, Futoshi Naya, and Hiroshi Sawada. Change-Point Detection with Feature Selection in High-Dimensional Time-Series Data. In IJCAI, 2013.
 Yang et al. [2015] Jianbo Yang, Minh Nhut Nguyen, Phyo Phyo San, et al. Deep Convolutional Neural Networks on Multichannel Time Series for Human Activity Recognition. In IJCAI, 2015.