Usage of specific attention improves change point detection

The change point is a moment of an abrupt alteration in the data distribution. Current methods for change point detection are based on recurrent neural networks suitable for sequential data. However, recent works show that transformers based on attention mechanisms perform better than standard recurrent models for many tasks. The benefit is most noticeable for longer sequences. In this paper, we investigate different attention mechanisms for the change point detection task and propose a specific form of attention tailored to the task at hand. We show that using this special form of attention outperforms state-of-the-art results.



1 Introduction

The goal of Change Point Detection (CPD) is to find the moment of data distribution shift. Such tasks appear in different areas, from monitoring systems to video analysis [aminikhanghahi2017survey] to the oil and gas industry [romanenkova2019real]. One of the recent methods for CPD [romanenkova2021principled] shows that a recurrent neural network model constructs meaningful representations and solves the problem better than non-principled approaches. The state-of-the-art model for sequential data is the Transformer [vaswani2017attention]. The main benefit of Transformers is their ability to work with long-range dependencies via the attention mechanism. The attention matrix specifies which points of the sequence the model should look at [vaswani2017attention]. A modification of the attention matrix allows faster and more efficient processing of longer sequences [tay2020long, tay2020efficient].

We propose specific attention mechanisms that work efficiently for the change point detection problem. By considering different autoregressive and non-autoregressive attention matrices, we highlight important properties of both the problem and transformer-based models. The proposed models work faster and directly incorporate the peculiarities of the problem at hand. They outperform existing results for the considered change point detection problem on semi-structured sequential data.

2 Related work

The attention mechanism is an important idea in deep learning for sequential data [vaswani2017attention]. It allows usage of the whole sequence and does not require a model to have a long memory. Instead, at each layer, the model looks at the whole sequence.

The transformer architecture based on attention shows impressive results in many problems related to the processing of sequential data, in particular NLP [fursov2021differentiable], [reis2021transformers] and computer vision.

However, direct application of this mechanism can be prohibitively expensive: the computational complexity of the vanilla attention is $O(n^2)$, where $n$ is the sequence length. A quest for more computationally efficient attention led to numerous ideas explored in reviews [tay2020long, tay2020efficient]. One of the key ideas in this area is to use so-called sparse attention: instead of using the whole attention matrix, we drop parts of it. This mechanism also allows highlighting specific parts of sequences.
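The idea of dropping parts of the attention matrix can be illustrated with a minimal NumPy sketch of scaled dot-product attention with a boolean mask. The function name `masked_attention` and the convention that `True` hides a position (the same convention as the masks in Section 4) are assumptions of this sketch, not the implementation used in the experiments. The sketch still builds the full score matrix, so it shows the masking semantics rather than the computational savings of efficient transformers.

```python
import numpy as np

def masked_attention(q, k, v, mask=None):
    """Scaled dot-product attention for a single sequence.

    q, k, v: arrays of shape (T, d).
    mask: optional boolean array of shape (T, T); mask[i, j] == True
          hides position j from query i. Every row is assumed to keep
          at least one visible position.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, -np.inf, scores)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
# Hide everything above the main diagonal (future positions).
causal = np.triu(np.ones((5, 5), dtype=bool), k=1)
out, weights = masked_attention(x, x, x, causal)
```

With the causal mask, the first query can only attend to itself, so the first output row equals the first row of `x`.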

One of the problem statements for sequential data is change point detection: we want to detect a change of distribution in a sequence as fast as possible [shiryaev2017stochastic]. Accurate solutions for such problems are vital in different areas, including software maintenance [artemov2016detecting] and the oil and gas industry [romanenkova2019real]. Simple statistics-based approaches are sufficient in many cases [van2020evaluation]. However, semi-structured data require deep models to provide reasonable quality of solutions, ranging from rather simple neural networks [hushchyn2020online] to more complex workflows [romanenkova2021principled, kail2021recurrent] and problem statements [sultani2018real].

For complex multi-step processing of video sequences, we still need efficient models that capture the essential properties of the data and detect change points efficiently, both in terms of model training and evaluation time and in terms of detection delay in a sequence. It seems natural to apply the efficient transformers paradigm to such data: attention restricted to the recent past is expected to meet these efficiency and quality requirements.

3 Problem statement

The solution of this problem with a neural network was investigated in [romanenkova2021principled]. Its working principle is as follows.

Let a dataset $D = \{(X_i, \theta_i)\}_{i=1}^{N}$ be a set of sequences, where each sequence $X_i$ has length $T$ and the corresponding change point $\theta_i$ lies in $[1, T]$ if the sequence contains a change point, and $\theta_i = \infty$ otherwise.

Let a random process $\{x_t\}_{t=1}^{T}$ of length $T$ be given, where $x_t$ is an observation at time $t$. The problem is to detect the true change moment $\theta$ as quickly as possible.

Let $p_t^i$ be the predicted probability of a change point at time moment $t$ for the $i$-th sequence.

Let the delay loss and the false alarm loss be defined as in [romanenkova2021principled], where a hyperparameter restricts the length of the considered part of the sequence for the delay loss.

According to [romanenkova2021principled], equation (3) is a lower bound for the expected value of the detection delay and equation (2) is a lower bound for the expected time to false alarm.

As shown in the paper above, there is a principled differentiable loss function for solving the CPD problem via a neural network; we refer to it as the CPD loss.
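To make the role of the predicted probabilities concrete, the sketch below computes a truncated expected detection delay and a truncated expected time to false alarm for a single sequence. It assumes that the predicted value at step $t$ is read as the probability of raising the first alarm at step $t$ given that no alarm was raised earlier. This reading, the 0-based indexing, the function names and the `horizon` argument are illustrative assumptions; the exact loss terms are those of [romanenkova2021principled] and may differ in detail.

```python
import numpy as np

def first_alarm_probs(p):
    """P(first alarm at step t) = p_t * prod_{k < t} (1 - p_k)."""
    p = np.asarray(p, dtype=float)
    no_alarm_before = np.concatenate(([1.0], np.cumprod(1.0 - p)[:-1]))
    return p * no_alarm_before

def expected_delay(p, theta, horizon):
    """Expected delay after the change at step theta, truncated at `horizon` steps."""
    window = np.asarray(p, dtype=float)[theta:theta + horizon]
    q = first_alarm_probs(window)
    delays = np.arange(len(window))
    # If no alarm fires inside the window, charge the full window length.
    return float(np.sum(delays * q) + len(window) * (1.0 - q.sum()))

def expected_time_to_false_alarm(p, theta):
    """Expected time of the first alarm raised before the change at step theta."""
    before = np.asarray(p, dtype=float)[:theta]
    q = first_alarm_probs(before)
    times = np.arange(len(before))
    # If no alarm fires before the change, the time to false alarm is theta.
    return float(np.sum(times * q) + len(before) * (1.0 - q.sum()))

# A detector that stays silent before the change at step 2 and fires right after it.
p = [0.0, 0.0, 1.0, 1.0]
delay = expected_delay(p, theta=2, horizon=2)
time_to_fa = expected_time_to_false_alarm(p, theta=2)
```

Under these assumptions a good detector has a small expected delay and a large expected time to false alarm, and both quantities are differentiable functions of the predicted probabilities.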



4 Methods

As there is a principled differentiable loss function for solving the CPD problem, we can train any model with this loss function. We want to expand the range of these models, and we start with a transformer. For comparison with [romanenkova2021principled], we use the RNN model as a baseline.

Intuitively, if a sequence has only one change point, we do not need to look at the entire sequence and can consider only some part of it. For this, we apply different masks to the input sequence. The mask is a square boolean matrix whose side equals the number of elements in the input sequence. Where True (or 1) stands, the corresponding value is not considered, i.e. the mask covers it. We have chosen several types of masks:

  1. lower triangular mask: consider only past points, which implies an online working mode. Every element above the main diagonal is covered.

  2. $n$-diagonal mask: $n$ elements around the diagonal are kept, so we consider the right and left neighbouring elements. It is an implementation of the "window" method, see [article] for more.

  3. $n$-diagonal mask plus lower triangular with a side of $m$ elements: points from the beginning of the sequence are added to those of the previous item. It seems that a model that also relies on the initial elements will give the best result.

  4. 1-diagonal mask plus lower triangular with a side of $n$ elements: this is almost the same as the previous item, but now we do not look at the points adjacent to the current one.
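The mask types listed above can be sketched in NumPy as follows, with `True` marking a covered position. Two points are assumptions of this sketch rather than details confirmed by the text: the band is parameterised by a hypothetical half-width `k` (position $i$ sees positions $j$ with $|i - j| \le k$), since the mapping from "$n$-diagonal" to a band width is not spelled out, and the lower triangular part is placed in the lower-left corner of the matrix, so that late positions see the beginning of the sequence. All function names are illustrative.

```python
import numpy as np

def lower_triangular_mask(T):
    """Hide the future: position i sees only positions j <= i."""
    return np.triu(np.ones((T, T), dtype=bool), k=1)

def band_mask(T, k):
    """Keep a band: position i sees positions j with |i - j| <= k."""
    i, j = np.indices((T, T))
    return np.abs(i - j) > k

def band_plus_corner_mask(T, k, m):
    """Band of half-width k plus a lower-left triangle with side m.

    The corner placement (i - j >= T - m) is an assumption of this sketch.
    """
    i, j = np.indices((T, T))
    visible = (np.abs(i - j) <= k) | (i - j >= T - m)
    return ~visible

# Masks for a toy sequence of length 6; 1 marks a covered position.
print(lower_triangular_mask(6).astype(int))
print(band_mask(6, 1).astype(int))
print(band_plus_corner_mask(6, 0, 2).astype(int))
```

Setting `k = 0` in `band_plus_corner_mask` corresponds to the last item of the list, where only the current point and the added early points stay visible.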

We investigate how these attention mechanisms can improve change point detection compared to an RNN model.

5 Results

5.1 Data

As the dataset to evaluate our approach, we use sequences of handwritten digits based on generative models trained on MNIST. To generate this dataset, we use sequences in the embedding space leading from one digit to another, obtained via a Conditional Variational Autoencoder [sohn2015learning] (CVAE). We take two points corresponding to a certain pair of digits and add the points from the line connecting these two initial points. For each sequence, we apply the decoder to get a sequence of images that closely reflect the corresponding digits. In the dataset, we put sequences with and without a change point. Our data consist of 1000 sequences of length 64. The dataset is balanced: the number of sequences with and without changes is equal.
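The step of adding points from the line connecting two embeddings can be illustrated with linear interpolation. In the sketch below, the latent dimension, the two latent vectors and the function name `latent_path` are hypothetical, and the trained CVAE decoder that would turn each latent point into an image is not shown.

```python
import numpy as np

def latent_path(z_start, z_end, length):
    """Return `length` points evenly spaced on the segment from z_start to z_end."""
    z_start = np.asarray(z_start, dtype=float)
    z_end = np.asarray(z_end, dtype=float)
    alphas = np.linspace(0.0, 1.0, length)[:, None]
    return (1.0 - alphas) * z_start + alphas * z_end

# Hypothetical 2-dimensional embeddings of two different digits.
z_a = np.array([0.0, 1.0])
z_b = np.array([2.0, -1.0])
path = latent_path(z_a, z_b, length=64)  # shape (64, 2)
```

Each row of `path` would then be passed through the decoder to obtain one image of the sequence.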

5.2 Metrics

We use metrics commonly used for the evaluation of change point detection algorithms [van2020evaluation]: F1 score, Covering, and Area under the detection curve (Area) [romanenkova2021principled]. Some of them are inspired by image segmentation, as these two problem statements have a lot in common [arbelaez2010contour].

Better methods have higher F1 score and Covering, but smaller Area. We refer an interested reader to the review [van2020evaluation] for a more detailed discussion of the metrics.
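As an illustration of the Covering metric, the sketch below follows the segment-overlap definition discussed in [van2020evaluation]: each true segment is matched with the predicted segment of highest Jaccard index, and the matches are averaged with weights proportional to the segment length. The function names and the convention that a change point is the index of the first element of a new segment are assumptions of this sketch.

```python
def segments(change_points, T):
    """Split indices 0..T-1 into consecutive segments at the given change points."""
    bounds = [0] + sorted(change_points) + [T]
    return [set(range(a, b)) for a, b in zip(bounds[:-1], bounds[1:]) if b > a]

def covering(true_cps, pred_cps, T):
    """Length-weighted best Jaccard overlap of true segments by predicted ones."""
    true_segs = segments(true_cps, T)
    pred_segs = segments(pred_cps, T)
    total = 0.0
    for a in true_segs:
        best = max(len(a & b) / len(a | b) for b in pred_segs)
        total += len(a) * best
    return total / T

exact = covering([5], [5], T=10)   # prediction matches the true change point
missed = covering([5], [], T=10)   # no change point predicted
```

A perfect segmentation gives a Covering of 1, and missing a change point in the middle of the sequence halves the value in this example.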

Method  F1 score  Covering  Area
RNN-based, CPD loss      
RNN-based, BCE loss      
Transformer without mask, CPD loss      
Transformer without mask, BCE loss      
Lower-triangular mask, CPD loss
Lower-triangular mask, BCE loss
2-diagonal mask, CPD loss      
2-diagonal mask, BCE loss      
8-diagonal mask, CPD loss      
1-diag. + 8-lower-triang., BCE loss      
1-diag. + 8-lower-triang., CPD loss      
3-diag. + 32-lower-triang., BCE loss      
Table 1: Comparison of the performance of different approaches, mean ± std. The best values for each type of loss are in bold. The second-best values for each type of loss are underlined.

5.3 Main results

We compare methods based on the attention mechanism and transformers with more common approaches for sequential data processing. Among different attention mechanisms, we consider sequential attention with a lower triangular attention matrix, as well as diagonal and tri-diagonal attention.

We present our main results in Table 1. As we see from the results, using the specific CPD loss together with attention improves the results compared to a Recurrent Neural Network and to the Binary Cross-Entropy loss.

Along with the table, we present figures with a variation of the main hyperparameter of our approach: how many off-diagonal elements we use. Figures 2-6 in the Appendix demonstrate the dynamics of the performance of our methods with respect to this hyperparameter. We consider F1 score, Area under the detection curve and Covering.

All the results for both the table and the figures were obtained by averaging over 10 runs. For all metrics, we see an improvement compared to a vanilla RNN approach. We also see that BCE loss is typically less stable than CPD loss. The diagonal mask with a lower-triangular part seems to perform the best among the introduced approaches if the correct window size is used.

6 Acknowledgements

The work of Alexey Zaytsev was supported by the Russian Foundation for Basic Research grant 20-01-00203. The work of Evgenya Romanenkova was supported by the Russian Science Foundation (project 20-71-10135).

7 Conclusions

We propose a method of change point detection based on an attention mechanism. We show that choosing an attention matrix with respect to the nature of the task solves the change point detection problem more efficiently. Choosing any reasonable matrix in combination with the principled loss function improves results in all metrics. However, the most successful masks are diagonal and diagonal with a small lower-triangular tail. We show that our approach outperforms the RNN-based state of the art by up to 15% in area under the detection curve, which means faster and more accurate predictions.



Below are the results of varying the main hyperparameter of our approach.

Figure 1: Dynamics of Covering for different mask sizes, BCE loss
Figure 2: Dynamics of Covering for different mask sizes, CPD loss
Figure 3: Dynamics of F1 score for different mask sizes, BCE loss
Figure 4: Dynamics of F1 score for different mask sizes, CPD loss
Figure 5: Dynamics of area under the detection curve for different mask sizes, BCE loss
Figure 6: Dynamics of area under the detection curve for different mask sizes, CPD loss