## 1 Introduction

Point processes are widely used for modeling the so-called discrete events data, which consists of a series events times and associated additional information. Recently, a class of mutual exciting point processes, known as Hawkes process (Hawkes, 1971), has gained much popularity in the statistics and machine learning literature. Hawkes processes provide a flexible model for capturing spatio-temporal correlations, and have been successfully applied in a wide range of domains including seismology (Ogata, 1988, 1998), criminology (Mohler et al., 2011), epidemiology (Rizoiu et al., 2018), social networks (Yang and Zha, 2013), finance (Hawkes, 2018), and neural activity (Reynaud-Bouret et al., 2013).

Detection of abrupt changes in the Hawkes process is a fundamental problem, which aims to detect the change as quickly as possible subject to false alarm constraints. For instance, in sensor network monitoring, we would like to detect any changes as soon as possible using a stream of event data; such change may represent a shift in the systems status or event anomalies. The challenges for detecting changes in Hawkes processes include handling complex spatial and temporal dependence of the event data streaming and long-term dependency; we need to develop computationally efficient online algorithms with performance guarantees.

A motivating application for our study is neuronal network analysis, where we are interested in inferring the network connectivity and detecting neural state changes from experimental data – the neural spike train data, which records the sequence of times when a neuron fires an action potential. Hawkes processes provide an intuitive modeling framework for such data: its mutually exciting property naturally mimics neuron-to-neuron influence’s electrochemical dynamics. The model’s probabilistic nature can capture noisy influences on the network (due to unobserved neurons or external stimuli). There has been much work on applying Hawkes models for a range of neuroscience problems, e.g., functional connectivity (Lambert et al., 2018) and uncertainty quantification (Wang et al., 2020b). The problem of change-point detection is fundamental for neuronal network analysis, allowing researchers to identify when an external stimulus is internalized. These change-points often arise from sparse population code changes in the biological network (Tang et al., 2018)

. Sparse population coding provides a sparse and distributed representation of information over the biological network, with each neuron representing the presence (or absence) of a specific feature. Figure

1 illustrates this idea. Here, each dot represents a neuron in the visual cortex. Colored dots show neurons that respond to seeing a cat or a dog, and shared dots represent common features between both animals (e.g., mammal, pet). Identifying the change-point between two population codes allows scientists to investigate the relationship between stimulus and response timing and better understand each neuron’s role in the population code.While there has been much study for estimating Hawkes processes in the literature (see

Reinhart (2018) for a recent survey), change-point detection for Hawkes processes is much less studied. In Wang et al. (2020a), the offline change-point detection for high-dimensional Hawkes processes was studied, and the goal is to estimate the (multiple) change-points. In Rambaldi et al. (2018), a model selection scheme was proposed to identify the presence of exogenous events that increase the intensity of the Hawkes process for a given time period. A cumulant based multi-resolution segmentation algorithm was proposed in Zhou et al. (2020) to find the optimal partition of the nonstationary Hawkes process into several non-overlapping segments. On the contrary, we focus on sequential detection to detect the change as quickly as possible. Online change-point detection for Hawkes process was considered in Li et al. (2017), where the generalized likelihood ratio test was used to detect the change with unknown post-change parameters. The expectation-maximization (EM) algorithm was used in

Li et al. (2017) to estimate the unknown post-change parameter, and there is no recursive implementation.In this paper, we present a new CUSUM procedure for sequential change detection in Hawkes processes. The new CUSUM procedure is computationally efficient and memory-efficient as a recursive procedure, which is crucial for its online implementation. We present the theoretical properties of the CUSUM procedure. We also compare the detection procedures with existing change detection algorithms based on the generalized likelihood ratio statistic (Li et al., 2017) and a score statistic using numerical examples. Finally, we consider a real-world application of detecting neuronal network population code change. The results show that the CUSUM procedure outperforms the alternative methods.

The rest of the paper is organized as follows. Section 2 introduces the basics for the Hawkes process. Section 3 sets up the change-point detection problem. Section 4 presents the proposed CUSUM procedure and its theoretical properties. Section 5 discusses some alternative detection methods. Section 6 numerically compares their performance. Section 7 demonstrates a real-world application using neural spike train data. Section 8 concludes the paper with some discussions.

## 2 Preliminaries

A temporal point process is a random process whose realization consists of a sequence of discrete events occurring at times , with . Let the history be the sequence of times of events up to but not including time . Strictly speaking, is a filtration, that is, an increasing sequence of -algebras. Let represents the number of events before time , then is a counting process which can be defined as:

(2.1) |

where is the Dirac function. The sequence of discrete event times can be regarded as when the counting process has jumped.

A point process can be characterized by conditional intensity function, denoted as . Originally conditional intensity function is also called hazard function (Rasmussen, 2011) and is defined as

(2.2) |

where

is the probability density function of the next event time conditional on the past, and

is the associated conditional cumulative distribution function capturing the probability of the

-th event happening before time :Thus if we consider a small time interval , we have

### 2.1 One-Dimensional Point Processes

For one-dimensional Hawkes process, the intensity function takes the form (Hawkes, 1971):

(2.3) |

where is the base intensity, is the influence parameter, and is a normalized kernel function satisfying . A commonly used kernel function is the exponential kernel with . We assume to ensure a stationary process.

Given event times which happened before a given time , the log-likelihood function for the Hawkes process can be written as follows (see Daley and Vere-Jones (2003) for details):

(2.4) |

In case of the exponential kernel and a constant base intensity , (2.4) reads

(2.5) |

As we will see in the following, the log-likelihood plays a key role in sequential change detection procedures.

### 2.2 Network Point Processes

The multivariate Hawkes process on a network with nodes is represented by a series of event times together with their location , where is the event time and is the node on which the -th event occurs. Here we use to represent the set . The intensity function for node at time is

where is the base intensity at node , is the influence parameter from node to node , is a normalized kernel function, and is a counting process on node :

The log-likelihood function for the network setting up to time is given by:

(2.6) |

where is the matrix representation for the influence parameters.

###### Remark 2.1.

Note that the likelihood function decouples as the summation over nodes: it consists of the sum of the log-likelihood at each node. The intensity function only involves events observed on the neighbors of (i.e., the nodes that will influence node ). This property may enable us to develop distributed change-point detection procedure, i.e., the nodes can compute their likelihood in parallel and only need to communicate with their neighbors (if the neighborhood information is known).

## 3 Problem Set-up

The problem of change-point detection for Hawkes networks sets up as follows. Assume there may exist a true change-point time , and the event data follows one point process before the change-point and follows another point process afterward. We are particularly interested in two special cases: (i) The null point process is a Poisson process, whereas the alternative point process is a Hawkes point process; (ii) The null point process is a Hawkes point process, whereas the alternative point process is a different Hawkes point process, e.g., the influence parameter has been shifted. Note that the first scenario can be seen as a specific case since the Poisson process can be treated as a special Hawkes process with influence parameters setting to be 0.

Consider the hypothesis test to detect the temporal pattern shifts in the Hawkes process, assuming the Hawkes process is stationary and the change-point is an unknown variable, as follows:

(3.1) |

where denotes the true intensity for node at time . The pre-change parameter is usually given from our knowledge of the process or can be estimated from reference data. The post-change parameter is known in some scenarios, but more often it corresponds to an unexpected anomaly and we may not have enough data to estimate it in advance. Alternatively, we can treat the post-change parameters as the targeted smallest change to be detected.

A change detection procedure resolves the two hypothesis using a stopping time , which is a function of the event sequence as we will explain in the next section.

## 4 CUSUM Detection Procedure

In this section, we present the cumulative sum (CUSUM) statistics based on the log-likelihood ratio. The CUSUM procedure was first proposed in Page (1954), assuming both the pre- and post-change parameters. It is computationally efficient but less robust to parameter uncertainties (to make it more robust, we can use adaptive-CUSUM procedure such as that in Xie et al. (2020)). The CUSUM procedure is most commonly defined for i.i.d. observations, and recently there has also been much development in CUSUM for non i.i.d. observations (Tartakovsky et al., 2014; Xie et al., 2021).

We first define the log-likelihood ratio function that will be used as the main building block. For a hypothesized change-point , the log-likelihood ratio of the model (3.1) up to time can be derived as:

(4.1) |

where

is the intensity for node if the change-point happens at , and

is the intensity under the null hypothesis, where we use

to indicate that the change never happens.### 4.1 CUSUM Statistic

Given assumed post-change parameters, , the stopping time for CUSUM is given by

(4.2) |

where is the log-likelihood ratio statistic defined in (4.1), and is a pre-specified threshold. The procedure stops when the log-likelihood ratio from some hypothesized change-point exceeds threshold .

In contrast to the original CUSUM procedure (Page, 1954) where the samples are taken in a discrete-time fashion, here the CUSUM statistic is continuous-time and has memory. In particular, due to the memory of the process, the observations are non-i.i.d. and have complex dependence to the past. Therefore, we do not have a simple recursion as the vanilla version of the CUSUM procedure.

Next we derive an computationally efficient recursive algorithm for computing the CUSUM statistic for the network Hawkes process. We start with a lemma for the log-likelihood ratio which shows that, although the supremum of the log-likelihood ratio statistic over the unknown change-point appears to be on a continuum, it will be obtained at the observed event times.

###### Lemma 4.1.

Given the event times , for any fixed and , , there is

and

Lemma 4.1 says that we only need to consider the values of the log-likelihood evaluated as the past event times rather than a continum of possible values for ; this will greatly simplify the computation of the log-likelihood ratio statistic.

###### Proof..

For fixed event and any , the intensity at any time is a constant that does not depend on , and the first part of the log-likelihood ratio (4.1),

is also a constant. For the second part,

the integrand is no larger than 0 because for any . Therefore the supremum of is reached when , which is

For the supremum w.r.t. over , the equality can be derived using similar arguments. ∎

For practical reasons such as computational efficiency, we can simplify the calculation in in (4.2) which involves , by considering on a discretized grid with a pre-specified grid size , i.e., we only need to calculate the detection statistic for .

Finally, with the discretization for both and , the log-likelihood ratio and has the following relationship, given ,

(4.3) |

If we have access to the cumulative kernels

the recursion can be computed without numerical integration:

(4.4) |

The CUSUM algorithm is summarized in Algorithm 1.

###### Remark 4.1 (Choice of grid size ).

The grid size corresponds to the updating frequency of the CUSUM statistics and is an important parameter to the CUSUM algorithm. There is a trade-off in choosing the parameter : a large might result in a larger detection delay, while a very small might leads to unnecessary computational complexity. It would be interesting to develop a method to choose adaptively.

### 4.2 Memory-Efficient CUSUM

The exact CUSUM algorithm requires remembering the entire history of events because all the past events influence the intensity function. In practice, we choose to neglect those events that happened a while ago to save memory, assuming that their influence on the present and the future is small, while achieving a similar performance.

Consider the following truncated kernel with a width :

Under the truncated kernel, an event has no influence over the whole process after into the future, and we only need to keep events during in our memory for computation. With the truncated kernel , the intensity for node can be approximated by

(4.5) |

Here for all , only depends on event data during . Moreover, the intensity for does not depend on , which enables us to update the log-likelihood ratio recursively for small . If we also have access to the cumulative kernels the recursion step in (4.4) can be approximated by

(4.6) |

where , and the summation is taken only for event times during . The resulted CUSUM procedure is summarized in Algorithm 2.

We illustrate the effect of the truncation width using a numerical example. The model is described in Section 6, where the kernel functions are all exponential:

with . Figure 2 shows the comparison between CUSUM statistics with and without kernel truncation. Both statistics with truncated kernel have the same trend with the exact CUSUM. When , of the cumulative influence is preserved, and the statistic deviates from the exact CUSUM, which may result in a false alarm. When preserving of the cumulative influence, there is little difference between the two statistics.

### 4.3 Theoretical Properties of CUSUM

In this section, we study the theoretical properties of CUSUM (as presented in Algorithm 1). The analysis helps us determine the proper threshold for the detection algorithm without using the Monte Carlo method, which is computationally expensive and prohibitive for some practical problems.

Two widely used performance metrics for sequential change-point detection are: (i) Average run length (ARL), defined as the expected value of the stopping time when there is no change, i.e., , where is the probability measure on the sequence of event times when the change never occurs, and is the corresponding expectation; (ii) Expected detection delay (EDD), defined as the expected delay between the stopping time and the true change-point. Two common definitions for EDD can be found in Lorden (1971) and Pollak (1985), both of them consider the worst-case delay over all possible change-point values. In particular, if the true change-point is , then the EDD can be defined as , where denotes the probability measure on the observations when the change occurs at time , and denotes the corresponding expectation.

We establish a lower bound for ARL regarding the threshold . Typically for the i.i.d. observation setting, the ARL grows exponentially with respect to the threshold . Here although Hawkes process is a continuous time procedure, it is naturally discretized by the event times; interesting, we obtain a similar result for the ARL of our CUSUM procedure:

###### Theorem 4.1 (ARL of CUSUM).

In Algorithm 1 under , the number of events happened before satisfies

###### Proof..

For each fixed , is a martingale w.r.t. , with By Ville’s maximal inequality for non-negative supermartingales, we have

and for each ,

By union bound, for any ,

And by the definition of , there is always . Thereby we complete the proof. ∎

For i.i.d. observations, the EDD of CUSUM procedure is on the order of divided by the Kullback–Leibler (KL) divergence for the pre- and post-change distributions. Similar results can be obtained for CUSUM with non-i.i.d. observations (Tartakovsky et al., 2014). We expect a similar result may hold for the CUSUM procedure for Hawkes process, although the complete proof is complicated, which we leave for future research.

###### Remark 4.2 (EDD of CUSUM).

Let

be the asymptotic KL divergence between the post-change and pre-change processes (since the process the conditional intensity function is stochastic). For any change-point and any event data up to , we expect the EDD to be

Here is irrelevant with or . We show this for the one-dimensional Hawkes process, where the kernel function has finite support and is upper bounded.

The concentration bound for was derived by Reynaud-Bouret et al. (2007), where is a bounded function on event data during for some and translates the event time by . They also provide a bound on the number of events in every unit time, which is followed immediately by a bound on the intensity. To be specific, we have with probability for every , where is some large-enough constant. For , the log-likelihood ratio can be written as

For the first term, a concentration bound around can be derived using Reynaud-Bouret et al. (2007)’s argument, by first applying the bound on and above. We have for any ,

where depends on the model and . The second term is the average of a martingale, and will have a concentration bound around 0 by first applying the bound on and the number of events every unit time, followed by Hoeffding’s inequality. We have for any

where depends on the model and . With the two concentration inequality above, we can check that is . For , since the process restarts at , it can be regarded as a translation of the case . The only difference is that takes into consideration the events before . Since has finite support, such difference only exists for a limited time and will not make a huge difference to the analysis of the EDD.

The KL divergence between different Hawkes models using mean-field approximation is summarized in Li et al. (2017). In particular, the KL divergence for the model shown in (3.1) is given by

where is the expected intensity for post-change distributions, and is the expected intensity for pre-change distribution. Here

is the identity matrix,

is a vector of ones,

is the constant base intensity vector, and . Quantifying the KL-divergence between the pre- and post-change distributions can help us to understand whether a case is easy or difficult to detect.## 5 Alternative Detection Procedures

In reality, the post-change parameters are not always known due to the lack of anomaly data. There could be various types of anomalies, and we do not know what to expect. This section considers two other approaches to change-point detection on the Hawkes process: the score and the GLR statistics. Neither of them requires any knowledge regarding the post-change parameters.

### 5.1 Score Statistics

We consider the score statistics for constructing a detection procedure. The score statistic can detect any deviations from the null hypothesis (Xie and Siegmund, 2012). It is particularly suitable for detecting small deviation (i.e., locally most efficient) and does not require estimating post-change parameters. The score function is defined as the derivative of the log-likelihood as in (2.6) over the parameters , on which we would like to detect the change. Let , the score function on each node is

(5.1) |

since only depends on the parameter . Then

where and . In [Ogata (1978), Theorem 3.4], it is shown that the limiting distribution of the score function at the true parameter is Gaussian

where

is the Fisher information and is positive definite. Note that the Fisher information is a diagonal block matrix, since for any , , is a constant w.r.t. , and

Each block of corresponds to , the influence from all nodes to node . The limiting distribution of the score function at the true parameter can be shown to be

For change-point detection, we adopt the conventional sliding window approach, i.e., calculating the score statistic inside the sliding window for a suitably chosen window length , and raise the alarm whenever the statistic exceeds threshold :

With a sufficiently large window length, the score statistic under the null hypothesis should be around , the expected value of random variable. After the change-point, the expected score function at the pre-change parameters is no longer 0. We would expect the score statistic to be noticeably larger than , and thus the change is detected.

Like the CUSUM statistics, a memory-efficiency problem arises since the intensity depends on the whole history. We can again replace with using truncated kernels (similarly, for the GLR statistics in the next subsection). For practical reasons, we also need to choose a grid size and only compute the grid’s score statistics. We discuss the relation between grid size and EDD for a fixed ARL in Section 6.

### 5.2 GLR Statistics

When the post-change parameters are unknown, another way to perform change-point detection is to use the generalized likelihood ratio (GLR) statistic. The idea is to find the parameters that best fit the data and compare the likelihood ratio between the fitted parameters and the pre-change ones. The GLR statistic on Hawkes networks was discussed in Li et al. (2017). A sliding window of fixed length is adopted, and within each window, the log-likelihood ratio is defined as

where is the intensity assuming is the change-point and is the post-change parameter,

and is the parameter that maximizes the likelihood in the current window :

A change is detected when the log-likelihood ratio exceeds certain threshold :

For each window, a convex optimization problem is solved to find the maximum likelihood estimate that best fits the data. However, this operation makes the GLR statistic computationally more expensive than CUSUM and the score statistic. To address this issue, we can use from the previous window as an initialization for the gradient descent algorithm to find the MLE in the next step – this “warm-start” may lead to faster convergence for finding the MLE.

Compared with the score statistic, the GLR statistic is computationally more expensive, and it does not necessarily have better performance (which can be partly due to the noisy MLE estimates), as shown in our example in Section 6. However, the GLR statistic is numerically more stable than the score statistic, especially for large networks. The reason is that we usually may not estimate Fisher information with high accuracy. Even provided with the exact pre-change parameter , there is no close-form solution for the Fisher information, and it can only be estimated by simulation or real data. The score statistic involves inverting the Fisher information, which can suffer from a high condition number and numerical instability when we try to invert it.

## 6 Numerical Experiments

In this section, we compare several change-point detection procedures using a simulated example. Consider a small network of nodes, shown in Figure 3. The background intensity is proportional to the size of the node ranging from 0.5 to 1, and the edges indicate the directed influence between nodes. The edges in black are the pre-change parameters, while the edges in orange show a topological change between nodes 1,2, and 3. There are two emerging edges after the change-point, with all other edges remain the same.

Figure 4 shows the CUSUM, GLR, and score statistics, respectively, where the dashed line indicates the change-point. As expected, CUSUM grows steadily larger after the change-point. The score and GLR statistics drawn are based on the same sequence of events. They are generally larger after the change-point, while the change in the score statistic is more evident than in the GLR statistic.

(a) CUSUM | (b) Score stat | (c) GLR |

### 6.1 Performance Comparison

We compare the performance of different statistics by plotting EDD versus (which is supposed to be close to a straight line according to the theoretical analysis). First, we introduce a simple baseline: the Shewhart control chart (Shewhart, 1925, 1931), which counts the number of events in a sliding window and stops when the number of events falls out of a specific range:

This Shewhart chart can detect change in the average intensity. In this example, we only consider the case that the average intensity will be increased after the change-point, and thus choose a one-sided interval by letting .

The comparison results are shown in Figure 5. The CUSUM procedure with exact post-change parameters achieves the best performance, followed by the score statistic and the GLR statistic; all are better than the baseline. For the GLR, score, and Shewhart statistics, we use 60, 60, and 120 as their window lengths, respectively. This corresponds to the optimal window size (found numerically) for an ARL between 500 and 50000. This result shows that the score and GLR statistics use the event data more efficiently than the Shewhart chart, as the Shewhart chart needs a larger window length to identify the difference. Here we set the change-point to be slightly larger than the window length so that for small ARL, the EDD can be smaller than .

We also consider the CUSUM procedure when the assumed post-change parameter differs from the true one; thus, there is a model mismatch. While the real change happens on two pairs of nodes making the influence factor both 0.4, we consider a CUSUM procedure which assumes a correct post-change network topology; however, the magnitudes of the post-change parameters on the edges are 50% and 200% of the true ones, respectively. We also consider a CUSUM procedure that assumes the correct magnitude of the post-change parameter but an incorrect post-change network topology. In one case, only the change in the influence from is expected, and in another, a change in the influence from , , , is expected. The simulation results show that even with misspecified post-change parameters (magnitude or network topology), the CUSUM procedure can still achieve better performance than the GLR and the score procedures. This demonstrates that the CUSUM procedure is reasonably robust to the misspecification of the post-change parameters.

### 6.2 Effect of Grid Size

The choice of the grid size involves a trade-off between algorithm performance and complexity. In this example, we compare CUSUM, the GLR, and the score statistics with a grid size

ranging from 0.1 to 50. For the EDD, we assume that the change-point is uniformly distributed between two grid points. Figure

6 shows the effect of the grid size on CUSUM, the GLR, and the score procedure. We find that a large grid size will result in both larger ARL and EDD, and if we tune the threshold to fix the ARL, the EDD may still increase with a larger .To understand the effect of on computation complexity, we mainly consider the GLR statistic as the computation for the GLR is the most expensive. To solve the convex optimization problem for each window, we use the EM algorithm as described in Li et al. (2017) and terminate when the update in the log-likelihood is less than . Figure 6(c) shows the average iterations needed per window for different grid sizes.