The goal of Change Point Detection (CPD) is to find the moment of a data distribution shift. Such tasks appear in different areas, from monitoring systems and video analysis [aminikhanghahi2017survey] to the oil and gas industry [romanenkova2019real]. One of the recent methods for CPD [romanenkova2021principled]
proves that a recurrent neural network model constructs meaningful representations and solves the problem better than non-principled approaches. The state-of-the-art model for sequential data is the Transformer [vaswani2017attention]. The main benefit of Transformers is their ability to work with long-range dependencies via the attention mechanism. The attention matrix specifies at which exact points the model should look [vaswani2017attention]. A modification of the attention matrix allows faster and more efficient processing of longer sequences [tay2020long, tay2020efficient].
We propose specific attention mechanisms that allow efficient work for the change point detection problem. By considering different autoregressive and non-autoregressive attention matrices, we highlight important properties of both the problem and transformer-based models. The proposed models work faster and directly incorporate the peculiarities of the problem at hand. They outperform existing results for the considered change point detection problem in semi-structured sequential data.
2 Related work
The attention mechanism is an important idea in deep learning for sequential data [vaswani2017attention]. It allows usage of the whole sequence and does not require a model to have a long memory: instead, at each layer, the model looks at the whole sequence.
The transformer architecture based on attention shows impressive results in many problems related to the processing of sequential data, in particular NLP [fursov2021differentiable, reis2021transformers] and computer vision [khan2021transformers].
However, direct application of this mechanism can be prohibitively expensive: the computational complexity of the vanilla attention is O(n^2), where n is the sequence length. The quest for more computationally effective attention has led to numerous ideas explored in the reviews [tay2020long, tay2020efficient]. One of the key ideas in this area is so-called sparse attention: instead of the whole attention matrix, we drop parts of it. This mechanism also allows highlighting specific parts of sequences.
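As a minimal sketch of the underlying mechanism (a NumPy illustration under our own naming, not code from any cited work), sparse attention can be emulated by masking the attention logits before the softmax, so that covered positions receive zero weight:

```python
import numpy as np

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention for a single sequence.
    mask[i, j] = True means position j is NOT attended to from position i."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)            # (n, n) attention logits
    logits = np.where(mask, -1e9, logits)    # covered entries get ~-inf
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v

n, d = 4, 8
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, n, d))
causal = np.triu(np.ones((n, n), dtype=bool), k=1)  # cover future positions
out = masked_attention(q, k, v, causal)
```

With the causal mask above, the first position can attend only to itself, so its output equals the first value vector; sparser masks reduce the number of logits that actually need to be computed.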
One of the problem statements for sequential data is change point detection: we want to detect a change of a distribution in a sequence as fast as possible [shiryaev2017stochastic]. Accurate solutions for such problems are vital in different areas, including software maintenance [artemov2016detecting] and the oil and gas industry [romanenkova2019real]. Simple statistics-based approaches are sufficient in many cases [van2020evaluation]. However, semi-structured data require deep models to provide reasonable quality of solutions, ranging from rather simple neural networks [hushchyn2020online] to more complex workflows [romanenkova2021principled, kail2021recurrent] and problem statements [sultani2018real].
For complex multi-step processing of sequences of videos, we still need efficient models that capture the essential properties of data and can detect change points: both in terms of model training and evaluation time and in terms of detection delay in a sequence. It seems natural to apply the efficient transformers paradigm to such data: attention that focuses on the recent past is expected to meet these efficiency and quality requirements.
3 Problem statement
The solution of this problem with a neural network was investigated in [romanenkova2021principled]. Its working principle is as follows.
Let a dataset D = {X_i}_{i=1}^{N} be a set of sequences, where each sequence X_i has the length T, and the corresponding change point is θ_i ∈ {1, …, T} if we have a change point in a sequence and θ_i = ∞ otherwise.
Let a random process X = (x_1, …, x_T) of length T be given, where x_t is an observation at time t. The problem is to detect the true change moment θ as quickly as possible.
Let p_i(t) be the predicted probability of the change point at a specified time moment t for the i-th sequence.
Following [romanenkova2021principled], two quantities are defined on these probabilities: the expected detection delay and the probability of a false alarm. A hyperparameter restricts the length of the considered part of the sequence for the delay loss. As it was shown in the paper above, combining these terms gives a principled differentiable loss function for solving the CPD problem via a neural network.
As there is a principled differentiable loss function for solving the CPD problem, we can train any model with it. We want to expand the range of these models, and we start with the transformer. We also apply different masks to the source sequence. To compare with [romanenkova2021principled], we use an RNN model as a baseline. Intuitively, if there is only one change point, we do not need to look at the entire sequence of data: some part of it is enough. For this, we use different masks for the input sequence. The mask is a boolean matrix with a size equal to the number of elements in the input sequence. Where True (or 1) stands, the value is not considered, i.e. the mask covers it. We have chosen several types of masks:
lower triangular mask — consider only past points, which implies an online working mode. The mask has the following form:
n-diagonal mask — n elements on the diagonal, so we consider right and left neighboring elements. It is an implementation of the "window" method, see more in [article]. The matrix for this mask is
n-diagonal mask plus lower triangular with a side of m elements — points from the beginning of the sequence are added to the previous mask. It seems that a model that also sees the initial elements will give the best result; the matrix for this mask is
1-diagonal mask plus lower triangular with a side of n elements — this is almost the same as the previous point, but now we do not look at the points adjacent to the current one.
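The masks above can be constructed, for instance, as boolean NumPy matrices; the helper names are ours, and we follow the convention stated above that True marks a covered (ignored) position. One possible reading of the band parameter is that positions with |i − j| < n stay visible:

```python
import numpy as np

def lower_triangular_mask(T):
    # cover strictly future positions -> online (causal) mode
    return np.triu(np.ones((T, T), dtype=bool), k=1)

def n_diagonal_mask(T, n):
    # keep a band of diagonals around the main one ("window" attention)
    i, j = np.indices((T, T))
    return np.abs(i - j) >= n

def diag_plus_lower_triangular(T, n, m):
    # band of diagonals plus the first m sequence elements always visible
    mask = n_diagonal_mask(T, n)
    mask[:, :m] = False
    return mask

# e.g. the "1-diag. + 3-lower-triang." variant for a sequence of length 8
mask = diag_plus_lower_triangular(8, n=1, m=3)
```

Such a boolean matrix can then be passed to an attention layer that supports source masks, so the model only attends to the uncovered positions.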
We investigate how attention mechanisms can improve the solution of the change point detection problem compared to an RNN model.
As the dataset to evaluate our approach, we use sequences of handwritten digits based on generative models trained on MNIST, similar to [romanenkova2021principled]. To generate this dataset, we use trajectories in the embedding space from one digit to another obtained via a Conditional Variational Autoencoder (CVAE) [sohn2015learning]. We take two points corresponding to a certain pair of digits and add the points from the line connecting these two initial points. For each sequence, we apply the decoder to get a sequence of images that closely reflect the corresponding digits. The dataset contains sequences with and without a change point. Our data consist of 1000 sequences of length 64. The dataset is balanced: the number of sequences with and without changes is equal.
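The generation procedure can be sketched as follows (a simplified illustration; `decode`, `z_from`, and `z_to` are hypothetical stand-ins for the CVAE decoder and two latent codes, and the identity decoder is used only to keep the sketch self-contained):

```python
import numpy as np

def make_sequence(z_from, z_to, T=64, change_at=None, decode=lambda z: z):
    """Stay at z_from before the change point, then move along the line
    from z_from to z_to; decode every latent point into an observation.
    change_at=None produces a sequence without a change point."""
    zs = []
    for t in range(T):
        if change_at is None or t < change_at:
            z = z_from
        else:
            # fraction of the way from z_from to z_to after the change
            alpha = (t - change_at + 1) / (T - change_at)
            z = (1 - alpha) * z_from + alpha * z_to
        zs.append(decode(z))
    return np.stack(zs)

rng = np.random.default_rng(0)
z0, z1 = rng.normal(size=(2, 16))  # two latent codes, e.g. two digits
seq = make_sequence(z0, z1, T=64, change_at=20)
```

With a real CVAE decoder in place of the identity function, each latent point yields an image, so the sequence gradually morphs from one digit into the other after the change point.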
We use metrics commonly used for the evaluation of change point detection algorithms [van2020evaluation]: F1 score, Covering, and Area under the detection curve (Area) [romanenkova2021principled]. Some of them are inspired by image segmentation, as these two problem statements share a lot in common [arbelaez2010contour].
Better methods have bigger F1 scores and Covering, but smaller Area. We refer an interested reader to the review [van2020evaluation] for a more detailed discussion of the metrics.
Table 1. The considered models and loss functions.

|Method|
|RNN-based, CPD loss|
|RNN-based, BCE loss|
|Transformer without mask, CPD loss|
|Transformer without mask, BCE loss|
|Low-triangular mask, CPD loss|
|Low-triangular mask, BCE loss|
|2-diagonal mask, CPD loss|
|2-diagonal mask, BCE loss|
|8-diagonal mask, CPD loss|
|1-diag. + 8-lower-triang., BCE loss|
|1-diag. + 8-lower-triang., CPD loss|
|3-diag. + 32-lower-triang., BCE loss|
5.3 Main results
We compare methods based on the attention mechanism and transformers with more common approaches for sequential data processing. Among different attention mechanisms, we consider sequential attention with a lower triangular attention matrix, along with diagonal and tri-diagonal attention.
We present our main results in Table 1. As we see from the results, usage of the specific loss with attention improves the results compared to a Recurrent Neural Network and the Binary Cross-Entropy loss.
Along with the table, we present figures with a variation of the main hyperparameter of our approach: how many off-diagonal elements we use. Figures 2–6 in the Appendix demonstrate the dynamics of the performance of our methods with respect to this hyperparameter. We consider F1 score, Area under the detection curve, and Covering.
All the results for both the table and figures were obtained by averaging over 10 runs. For all metrics, we see an improvement compared to a vanilla RNN approach. We also see that the BCE loss is typically less stable than the CPD loss. The diagonal mask combined with a lower-triangular part seems to perform the best among the introduced approaches if the correct window size is used.
The work of Alexey Zaytsev was supported by the Russian Foundation for Basic Research grant 20-01-00203. The work of Evgenya Romanenkova was supported by the Russian Science Foundation (project 20-71-10135).
We propose a method of change point detection based on an attention mechanism. We show that choosing an attention matrix with respect to the nature of the task solves the change point detection problem more efficiently. Choosing any reasonable matrix in combination with the principled loss function improves results in all metrics. However, the most successful methods are diagonal and diagonal with a small lower-triangular tail. We show that our approach outperforms the RNN-based state-of-the-art by up to 15% in area under the detection curve, which means faster and more accurate predictions.
These figures present results from the variation of the main hyperparameter of our approach.