NeuralNetworkPointProcess
code for "Fully Neural Network based Model for General Temporal Point Processes"
A temporal point process is a mathematical model for a time series of discrete events, which covers various applications. Recently, recurrent neural network (RNN) based models have been developed for point processes and have been found effective. RNN based models usually assume a specific functional form for the time course of the intensity function of a point process (e.g., exponentially decreasing or increasing with the time since the most recent event). However, such an assumption can restrict the expressive power of the model. We herein propose a novel RNN based model in which the time course of the intensity function is represented in a general manner. In our approach, we first model the integral of the intensity function using a feedforward neural network and then obtain the intensity function as its derivative. This approach enables us to both obtain a flexible model of the intensity function and exactly evaluate the log-likelihood function, which contains the integral of the intensity function, without any numerical approximations. Our model achieves competitive or superior performances compared to the previous state-of-the-art methods for both synthetic and real datasets.
The activity of many diverse systems is characterized as a sequence of temporally discrete events. The examples include financial transactions, communication in a social network, and user activity at a web site. In many cases, the occurrences of the event are correlated to each other in a certain manner, and information on future events may be extracted from the information of past events. Therefore, the appropriate modeling of the dependence of the event occurrence on the history of past events is important for understanding the system and predicting future events.
A temporal point process is a useful mathematical tool for modeling the time series of discrete events. In this framework, the dependence on the event history is characterized using a conditional intensity function that maps the history of the past events to the intensity function of the point process. The most common models, such as the Poisson process or the Hawkes process Hawkes ; Hawkes-sparse ; HIF , assume a specific parametric form for the conditional intensity function. Recently, Du et al. (2016) proposed a model based on a recurrent neural network (RNN) for point processes RMTPP , and variant models were further developed RNN2 ; NSR ; NeuralHawkes ; RLPP ; DRLPP ; RPP . In this approach, an RNN is used to obtain a compact representation of the event history. The conditional intensity function is then modeled as a function of the hidden state of the RNN. Consequently, the RNN based models outperform the parametric models in prediction performance.
Although such RNN based models aim to capture the dependence of the event occurrence on the event history in a general manner, a specific functional form is usually assumed for the time course of the conditional intensity function. For example, the model in RMTPP assumed that the conditional intensity function exponentially decreases or increases with the elapsed time from the most recent event until the next event. However, using such an assumption can limit the expressive ability of the model and potentially deteriorate the predictive skill if the employed assumption is incorrect. We herein generalize RNN based models such that the time evolution of the conditional intensity function is represented in a general manner. For this purpose, we formulate the conditional intensity function based on a neural network rather than assuming a specific functional form.
However, exactly evaluating the log-likelihood function for such a general model is generally intractable because the log-likelihood function of a temporal point process contains the integral of the conditional intensity function. To overcome this limitation, we first model the integral of the conditional intensity function using a feedforward neural network rather than directly modeling the conditional intensity function itself. Then, the conditional intensity function is obtained by differentiating it. This approach enables us to exactly evaluate the log-likelihood function of our general model without numerical approximations. Finally, we show the effectiveness of our proposed model by analyzing synthetic and real data.
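A temporal point process is a stochastic process that generates a sequence of discrete events at times $\{t_i\}_{i=1}^{n}$ in a given observation interval $[0, T]$. The process is characterized via a conditional intensity function $\lambda(t \mid H_t)$, which is the intensity function of the event at the time $t$ conditioned on the event history $H_t = \{t_i \mid t_i < t\}$ up to the time $t$, given as follows:

$$\lambda(t \mid H_t) = \lim_{\Delta \to 0} \frac{P(\text{one event occurs in } [t, t + \Delta) \mid H_t)}{\Delta}. \tag{1}$$

If the conditional intensity function is specified, the probability density function of the time $t_{i+1}$ of the next event, given the times $\{t_1, t_2, \ldots, t_i\}$ of the past events, is obtained as follows:

$$p(t_{i+1} \mid t_1, t_2, \ldots, t_i) = \lambda(t_{i+1} \mid H_{t_{i+1}}) \exp\left\{ -\int_{t_i}^{t_{i+1}} \lambda(t \mid H_t)\, dt \right\}, \tag{2}$$

where the exponential term in the right-hand side represents the probability that no events occur in $[t_i, t_{i+1})$. The probability density function to observe an event sequence $\{t_i\}_{i=1}^{n}$ is then obtained as follows:

$$p(\{t_i\}_{i=1}^{n}) = \prod_{i=1}^{n} \lambda(t_i \mid H_{t_i}) \exp\left\{ -\int_{0}^{T} \lambda(t \mid H_t)\, dt \right\}. \tag{3}$$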
The most basic example of a temporal point process is a stationary Poisson process, which assumes that the events are independent of each other. The conditional intensity function of the stationary Poisson process is given as $\lambda(t \mid H_t) = \lambda$. Another popular example is the Hawkes process Hawkes ; Hawkes-sparse ; HIF , which is a simple model of a self-exciting point process. The conditional intensity function of the Hawkes process is given as $\lambda(t \mid H_t) = \mu + \sum_{t_i < t} g(t - t_i)$, where $g(s)$ is a kernel function ($g(s) = 0$ if $s < 0$) that represents the triggering effect from the past event.
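As a concrete illustration of the Hawkes intensity and of the likelihood of an observed sequence, the following minimal sketch evaluates both for a single exponential kernel. The kernel form `alpha * beta * exp(-beta * s)` and the parameter names `mu`, `alpha`, and `beta` are assumptions chosen for illustration because they give a closed-form integral; they are not taken from the text.

```python
import numpy as np

def hawkes_intensity(t, events, mu, alpha, beta):
    """Conditional intensity mu + sum_{t_i < t} g(t - t_i).

    Assumed kernel (illustrative): g(s) = alpha * beta * exp(-beta * s).
    """
    past = events[events < t]
    return mu + np.sum(alpha * beta * np.exp(-beta * (t - past)))

def hawkes_log_likelihood(events, T, mu, alpha, beta):
    """Log-likelihood of a sequence observed in [0, T]:
    sum_i log lambda(t_i | H_{t_i}) - int_0^T lambda(t | H_t) dt.
    """
    log_term = sum(np.log(hawkes_intensity(t, events, mu, alpha, beta))
                   for t in events)
    # For the assumed exponential kernel the integral has a closed form.
    integral = mu * T + np.sum(alpha * (1.0 - np.exp(-beta * (T - events))))
    return log_term - integral

events = np.array([0.5, 1.2, 1.3, 4.0])  # hypothetical event times
print(hawkes_log_likelihood(events, T=5.0, mu=1.0, alpha=0.5, beta=2.0))
```

Setting `alpha = 0` removes the triggering term, so the function reduces to the stationary Poisson log-likelihood `n * log(mu) - mu * T`.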
The conditional intensity function, which maps the event history to the intensity function, plays a major role in the modeling of point processes. Du et al. (2016) proposed to use an RNN to model the conditional intensity function RMTPP . We first feed an input vector $x_i$, which extracts the information of the event time $t_i$, to the RNN. A simple form of the input is the inter-event interval as $x_i = (t_i - t_{i-1})$ or its logarithm as $x_i = (\log(t_i - t_{i-1}))$. A hidden state $h_i$ of the RNN is updated as follows:

$$h_i = f(W^h h_{i-1} + W^x x_i + b^h), \tag{4}$$

where $W^h$, $W^x$, and $b^h$ denote the recurrent weight matrix, input weight matrix, and bias term, respectively, and $f$ is an activation function. We here treat the hidden state of the RNN as a compact vector representation of the event history. The conditional intensity function is then formulated as a function of the elapsed time from the most recent event and the hidden state of the RNN, given as follows:

$$\lambda(t \mid H_t) = \phi(t - t_i \mid h_i), \tag{5}$$

where $\phi$ is a non-negative function referred to as a hazard function.
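The recurrent update of the hidden state can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the activation is taken to be tanh, the input is the logarithm of the inter-event interval, the first interval is measured from time 0, and the weights are random placeholders rather than trained values. The names `encode_history`, `W_h`, `W_x`, and `b_h` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 8  # hypothetical size for the example (the experiments use 64 units)
W_h = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # recurrent weight matrix
W_x = rng.normal(scale=0.1, size=(n_hidden, 1))         # input weight matrix
b_h = np.zeros(n_hidden)                                # bias term

def encode_history(event_times):
    """Return the hidden state after each event, using tanh as the activation."""
    intervals = np.diff(event_times, prepend=0.0)       # first interval starts at 0
    x = np.log(np.maximum(intervals, 1e-8))[:, None]    # log inter-event intervals
    h = np.zeros(n_hidden)
    states = []
    for x_i in x:
        h = np.tanh(W_h @ h + W_x @ x_i + b_h)          # recurrent update
        states.append(h)
    return np.stack(states)

states = encode_history(np.array([0.4, 1.1, 1.5, 3.0]))
print(states.shape)  # (4, 8)
```

Each row of `states` plays the role of the compact history representation that the hazard function is conditioned on.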
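Du et al. (2016) assumed the following form for the hazard function RMTPP , where $\tau$ denotes the elapsed time from the most recent event:

$$\phi(\tau \mid h_i) = \exp(w^t \tau + v^\phi \cdot h_i + b^\phi). \tag{6}$$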
The exponential function in the above equation is used to ensure the non-negativity of the intensity. In this model, the conditional intensity function exponentially decreases or increases with the elapsed time from the most recent event until the next event.
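A simplified model, where the conditional intensity function is constant over the period between the successive events, was considered in RLPP ; RPP . We here formulate such a model as a special case of the model of eq. (6), given as follows:

$$\phi(\tau \mid h_i) = \exp(v^\phi \cdot h_i + b^\phi). \tag{7}$$

In this model, the inter-event interval $t_{i+1} - t_i$ follows the exponential distribution with mean $1 / \exp(v^\phi \cdot h_i + b^\phi)$.

The log-likelihood function of the RNN based model can be obtained from eq. (3) as follows:

$$\log L(\{t_i\}) = \sum_i \left[ \log \phi(t_{i+1} - t_i \mid h_i) - \int_0^{t_{i+1} - t_i} \phi(\tau \mid h_i)\, d\tau \right]. \tag{8}$$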
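For the exponential and constant hazard functions, the integral in the log-likelihood has a closed form, which is what makes these two models tractable. The sketch below evaluates one term of the sum for a single inter-event interval. The scalar `c` stands for the history-dependent part of the exponent (the inner product with the hidden state plus the bias) and `w` for the coefficient of the elapsed time; both names are shorthand introduced here for illustration.

```python
import numpy as np

def exp_hazard_log_lik(tau, c, w):
    """Log-likelihood contribution of one interval tau for the hazard
    phi(s) = exp(w * s + c):  log phi(tau) - int_0^tau phi(s) ds.
    """
    log_phi = w * tau + c
    if abs(w) < 1e-12:
        # Constant hazard: the interval is exponentially distributed.
        integral = np.exp(c) * tau
    else:
        integral = np.exp(c) * (np.exp(w * tau) - 1.0) / w
    return log_phi - integral

# Hypothetical values for illustration.
print(exp_hazard_log_lik(tau=1.7, c=-0.4, w=0.8))
print(exp_hazard_log_lik(tau=2.0, c=np.log(0.5), w=0.0))
```

With `w = 0` and rate `exp(c) = 0.5`, the second call returns `log(0.5) - 1.0`, the log-density of an exponential distribution with mean 2 evaluated at 2.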
The parameter values of the model are estimated by maximizing the log-likelihood function. For this purpose, backpropagation through time (BPTT) is employed to obtain the gradient of the log-likelihood function. In BPTT, the RNN is unfolded into a feedforward network whose weights are shared across layers, and backpropagation is applied to the unfolded network. Although the hidden state $h_i$ of the RNN originally depends on all the preceding events, fully considering the history dependence for a long sequence is generally intractable. Only the dependence on a fixed number $d$ of the most recent events is considered herein to reduce the computational cost, as was done in RMTPP , where $d$ is a hyperparameter called the truncation depth. Namely, for each time index $i$, the hidden state $h_i$ is obtained by feeding the inputs from the most recent $d$ events to the RNN.

The main problem of the previous studies is that a specific functional form is usually assumed for the time course of the hazard function as in eqs. (6) or (7), which can miss the general dependence of the event occurrence on the past events. One may want to exploit a more complex model for the hazard function to generalize the model. However, such a complex model is generally intractable because the log-likelihood function in eq. (8) includes the integral of the hazard function. Although the integral may be approximately evaluated using numerical methods, the numerical approximations can deteriorate the fitting accuracy and be computationally expensive. This is the main limitation in the flexible modeling of the hazard function.
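Returning briefly to the truncation depth introduced above: one simple way to prepare training examples is to slide a window over the inter-event intervals, so that each example contains a fixed number of past intervals as RNN input together with the next interval as the prediction target. The function name `truncated_inputs` and the exact windowing are assumptions made for illustration and need not match the released code.

```python
import numpy as np

def truncated_inputs(event_times, d):
    """Split a sequence into (history, target) pairs with truncation depth d.

    Each history row holds the d most recent inter-event intervals, and the
    target is the interval that follows them.
    """
    intervals = np.diff(event_times)
    windows = np.lib.stride_tricks.sliding_window_view(intervals, d + 1)
    return windows[:, :d], windows[:, d]

# Hypothetical event times for illustration.
x, y = truncated_inputs(np.array([0.0, 1.0, 3.0, 6.0, 10.0, 15.0]), d=2)
print(x)  # rows: [1. 2.], [2. 3.], [3. 4.]
print(y)  # [3. 4. 5.]
```

Because every example has the same length, the examples can be stacked into batches and the unfolded network has a fixed depth.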
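Rather than directly modeling the hazard function, we herein propose to model the cumulative hazard function $\Phi(\tau \mid h_i)$, defined as follows:

$$\Phi(\tau \mid h_i) = \int_0^{\tau} \phi(s \mid h_i)\, ds. \tag{9}$$

The hazard function itself can then be obtained by differentiating the cumulative hazard function with respect to $\tau$ as follows:

$$\phi(\tau \mid h_i) = \frac{\partial}{\partial \tau} \Phi(\tau \mid h_i). \tag{10}$$

The log-likelihood function is now reformulated as follows using the cumulative hazard function:

$$\log L(\{t_i\}) = \sum_i \left[ \log \left\{ \frac{\partial}{\partial \tau} \Phi(\tau = t_{i+1} - t_i \mid h_i) \right\} - \Phi(t_{i+1} - t_i \mid h_i) \right]. \tag{11}$$

Therefore, this approach enables us to avoid the integral in the log-likelihood function, which is in contrast to eq. (8).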
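A small worked example may help to see why no integral is needed once the cumulative hazard function is available. The sketch below uses a Weibull cumulative hazard, chosen here only because its derivative and density are known in closed form; the history dependence is dropped, and the names `log_lik_term`, `k`, and `s` are hypothetical.

```python
import numpy as np

def log_lik_term(Phi, dPhi, tau):
    """One interval's log-likelihood from the cumulative hazard Phi and its
    derivative dPhi:  log dPhi(tau) - Phi(tau).  No integral is evaluated.
    """
    return np.log(dPhi(tau)) - Phi(tau)

# Assumed example: Weibull cumulative hazard with shape k and scale s.
k, s = 1.5, 2.0
Phi = lambda tau: (tau / s) ** k
dPhi = lambda tau: (k / s) * (tau / s) ** (k - 1.0)

tau = 1.3
weibull_log_pdf = np.log(k / s) + (k - 1.0) * np.log(tau / s) - (tau / s) ** k
print(log_lik_term(Phi, dPhi, tau), weibull_log_pdf)  # the two values agree
```

In the neural network based model the closed-form `Phi` is replaced by a network output and `dPhi` by its derivative with respect to the elapsed time.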
In the present study, we model the cumulative hazard function using a feedforward neural network (cumulative hazard function network; Fig. 1) for flexible modeling. The cumulative hazard function is a monotonically increasing function of $\tau$ and is positive-valued. The cumulative hazard function network is designed to reproduce these properties. The positivity of the network output can be ensured using an output unit whose activation function is positive-valued. To ensure monotonicity, we employ the idea used in monotonic ; NL : the weights of particular network connections are constrained to be positive (Fig. 1).
The detail of the cumulative hazard function network is described below. In the network, each unit receives the weighted sum of the inputs and applies an activation function to produce the output. The first hidden layer in the network receives the elapsed time $\tau$ and the hidden state $h_i$ of the RNN as the inputs (footnote 1: we may input $\log \tau$ to the first layer rather than $\tau$ if the variation of the inter-event interval is large). The weights of the connections from the elapsed time to the first hidden layer and all the connections from the hidden layers are constrained to be positive. The connections whose weights are constrained to be positive are indicated by the red lines in Fig. 1. The activation functions of the hidden units and the output unit are set to be the tanh function and the softplus function, $\log(1 + \exp(\cdot))$, respectively. In this setting, the network output is monotonically increasing with respect to the elapsed time $\tau$ and takes only positive values, which mimics the cumulative hazard function.
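The architecture described above can be sketched as a plain NumPy forward pass. This is an illustration under several assumptions: hidden units use tanh and the output unit uses softplus, positivity of the constrained weights is enforced by taking absolute values (a library implementation would more likely use a non-negativity constraint on the weights), and the weights are random placeholders. The class name `CumulativeHazardNet` and its attributes are hypothetical.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)  # numerically stable log(1 + exp(z))

class CumulativeHazardNet:
    """Two hidden layers; the output is positive and non-decreasing in tau."""

    def __init__(self, n_rnn, n_units=16, seed=0):
        rng = np.random.default_rng(seed)
        self.w_tau = rng.normal(size=n_units)             # tau -> layer 1 (constrained)
        self.W_h = rng.normal(size=(n_units, n_rnn))      # RNN state -> layer 1 (free)
        self.b1 = np.zeros(n_units)
        self.W2 = rng.normal(size=(n_units, n_units))     # layer 1 -> layer 2 (constrained)
        self.b2 = np.zeros(n_units)
        self.w_out = rng.normal(size=n_units)             # layer 2 -> output (constrained)
        self.b_out = 0.0

    def __call__(self, tau, h):
        # np.abs keeps the constrained weights positive (an assumption of this sketch).
        a1 = np.tanh(np.abs(self.w_tau) * tau + self.W_h @ h + self.b1)
        a2 = np.tanh(np.abs(self.W2) @ a1 + self.b2)
        return softplus(np.abs(self.w_out) @ a2 + self.b_out)

net = CumulativeHazardNet(n_rnn=4)
h = 0.1 * np.ones(4)                                      # hypothetical RNN state
values = np.array([net(t, h) for t in np.linspace(0.0, 3.0, 50)])
print(values.min() > 0, np.all(np.diff(values) >= -1e-12))
```

Only the weights on the path from the elapsed time to the output need to be positive; the weights applied to the RNN state are left unconstrained so that the history can raise or lower the hazard freely.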
The cumulative hazard function and the hazard function are now formulated as follows based on the output $Z(\tau, h_i)$ of the cumulative hazard function network:

$$\Phi(\tau \mid h_i) = Z(\tau, h_i), \tag{12}$$

$$\phi(\tau \mid h_i) = \frac{\partial}{\partial \tau} Z(\tau, h_i). \tag{13}$$

This also defines the log-likelihood function in eq. (11) of this model based on the network output. The differentiation term can be easily computed using a feedforward computational graph (see Supplementary Material for more details). Practically, one can obtain $\frac{\partial}{\partial \tau} Z(\tau, h_i)$ by simply using a gradient function implemented in a neural network library, such as TensorFlow, which automatically generates the corresponding computational graph. The gradient of the log-likelihood function can be obtained using backpropagation. Note that the source code of the proposed model is available online (footnote 2: https://github.com/omitakahiro/NeuralNetworkPointProcess).

The RNN based point process models were proposed in RMTPP . Most previous studies assumed a specific functional form for the time course of the hazard function. The exponential hazard function in eq. (6) is commonly assumed RMTPP ; DRLPP . Some studies RLPP ; RPP assumed the constant hazard function as in eq. (7), which is equivalent to assuming that the inter-event intervals follow the exponential distribution. In contrast to these studies, our model does not assume any specific functional form for the hazard function, and the time course of the hazard function is formulated in a general manner based on a neural network.
A few studies addressed the general modeling of the hazard function. Jing and Smola (2017) proposed to discretize the continuous hazard function to a piecewise constant function NSR , given as follows:
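$$\phi(\tau \mid h_i) = \mathrm{softplus}(v_j^\phi \cdot h_i + b_j^\phi) \tag{14}$$

for $(j - 1)\Delta \le \tau < j\Delta$ $(j = 1, 2, \ldots, L)$ for some choice of $\Delta$ and $L$. Mei and Eisner (2017) proposed a continuous-time long short-term memory (LSTM) where the output continuously evolves in time, and the conditional intensity function is given as a function of the output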
NeuralHawkes . For this model, the Monte Carlo method was used to approximate the integral. However, some studies NeuralHawkes ; RLPP found that the performance of the continuous-time LSTM model for the event time prediction was very similar to that of the model of Du et al. (2016).

In all cases, numerical approximations are used to evaluate the integral in the log-likelihood function in eq. (8). However, a numerical approximation is computationally expensive and can also affect the fitting accuracy. In contrast to these studies, the log-likelihood function of our general model can be exactly evaluated without a numerical approximation because the integral of the hazard function is modeled by a feedforward neural network in our approach. Therefore, a more accurate estimate can be efficiently obtained by our approach.
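To make the cost of a numerical approximation concrete, the sketch below estimates the integral of a hazard function over one interval with plain Monte Carlo sampling and compares it with the exact value. The hazard used here is a hypothetical exponential form picked only because its integral is known; the sampling scheme is a generic uniform estimator and is not claimed to match any of the cited implementations.

```python
import numpy as np

def mc_integral(phi, tau, n_samples, rng):
    """Monte Carlo estimate of int_0^tau phi(s) ds using uniform samples."""
    s = rng.uniform(0.0, tau, size=n_samples)
    return tau * np.mean(phi(s))

phi = lambda s: np.exp(-1.5 * s + 0.3)                    # hypothetical hazard
tau = 2.0
exact = np.exp(0.3) * (1.0 - np.exp(-1.5 * tau)) / 1.5    # closed form for this phi

rng = np.random.default_rng(0)
for n in (10, 1_000, 100_000):
    estimate = mc_integral(phi, tau, n, rng)
    print(n, abs(estimate - exact))
```

The error of such an estimator shrinks only with the square root of the number of samples, and every sample requires an extra evaluation of the hazard function for every event in the batch.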
In this section, we conduct experiments using synthetic and real data. We herein evaluate the predictive performances of the four RNN based point process models. The number of units in the RNN is fixed to 64 for all the models. The first two models are equipped with the constant hazard function in eq. (7) (constant model) and the exponential hazard function in eq. (6) (exponential model), respectively. The third model employs the piecewise constant hazard function in eq. (14) (piecewise constant model). We set to the maximum value of the inter-event interval in each dataset and use the condition . The fourth model employs the neural network based hazard function proposed in this study (neural network based model). For this model, we use two hidden layers for the cumulative hazard function network, and the number of units in each layer is 64. In this setting, the numbers of the parameters are almost the same between the third and fourth models.
Each dataset is divided into the training and test data. In the training phase, the model parameters are estimated using the training data. For this optimization, the Adam optimizer with learning rate , , and is used Adam , and the batch size is 256. We also choose the truncation depth from using of the training data (see Sec. 2.2 for more details on truncation depth). In the test phase, we evaluate the predictive performances of the trained models using the test data. In each time step, the probability density function of the time of the coming event given the past events is calculated from eq. (2), and scored by the negative log-likelihood for the actually observed (a smaller score means a better predictive performance). The score is finally averaged over the events in the test data. We performed these computations under a GPU environment provided by Google Colaboratory.
We use the synthetic data generated from the following stochastic processes. In this experiment, 100,000 events are generated from each process, and 80,000/20,000 events are used for training/testing.
Stationary Poisson Process (S-Poisson): The conditional intensity function is given as .
Non-stationary Poisson Process (N-Poisson): The conditional intensity function is given as .
Stationary Renewal Process (S-Renewal): In this process, the inter-event intervals are independent and identically distributed according to a given probability density function. We herein use the log-normal distribution with a mean of 1.0 and a standard deviation of 6.0 for this density. In this setting, a generated sequence exhibits bursty behavior: multiple events tend to occur in a short period and are followed by a long silent period, like the burst firing of biological neurons.
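A sequence of this kind can be simulated in a few lines. The sketch below assumes that the stated mean and standard deviation refer to the inter-event interval itself (not to its logarithm), converts them to the parameters of the underlying normal distribution, and accumulates the sampled intervals into event times. The helper name `lognormal_params` is hypothetical.

```python
import numpy as np

def lognormal_params(mean, std):
    """Parameters (mu, sigma) of log X such that X has the given mean and std."""
    sigma2 = np.log(1.0 + (std / mean) ** 2)
    mu = np.log(mean) - 0.5 * sigma2
    return mu, np.sqrt(sigma2)

mu, sigma = lognormal_params(mean=1.0, std=6.0)

rng = np.random.default_rng(0)
intervals = rng.lognormal(mean=mu, sigma=sigma, size=100_000)
event_times = np.cumsum(intervals)

print(mu, sigma)
print(np.median(intervals))
```

Under this assumption the median interval, `exp(mu) = 1 / sqrt(37)`, is far below the mean of 1.0, which is the bursty pattern described above: most intervals are short, and a few very long ones carry most of the elapsed time.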
Non-stationary Renewal Process (N-Renewal): A sequence following a non-stationary renewal process is obtained as follows trendrenewal : we first generate a sequence from a stationary renewal process, and then we rescale the time according to for a non-negative trend function
. We use the gamma distribution with a mean of 1.0 and a standard deviation of 0.5 to generate the stationary renewal process and set the trend function to
. In this process, an inter-event interval tends to be followed by the one with a similar length, but the expected length gradually varies in time.
Self-correcting Process (SC): The conditional intensity function is given as .
Hawkes Processes (Hawkes1 and Hawkes2): We use the Hawkes process, in which the kernel function is given by the sum of multiple exponential functions: the conditional intensity function is given by . For the Hawkes1 model, we set . For the Hawkes2 model, we set . Compared to the Hawkes1 model, the kernel function of the Hawkes2 model rapidly varies in time for small .
In addition to the four RNN based models, we evaluate the predictive performance of the true model, i.e., the model that generated the data, and use it as a reference. The scores of the RNN based models are standardized by subtracting the score of the true model. A value of 0 in the standardized score therefore corresponds to the score of the true model.
Figure 2 summarizes the performances of the four RNN based models for the synthetic datasets (a smaller score means a better performance). We first find that the proposed neural network based model performs competitively with or better than the other models. The neural network based model also performs robustly for all the datasets: its performance is always close to that of the true model. These results demonstrate that (i) the performance is improved by employing the neural network based hazard function and that (ii) our model is applicable to a diverse class of data generating processes.
The performances of the constant model and the exponential model critically depend on whether the hazard function is correctly specified. The constant hazard function is correct for the S-Poisson and N-Poisson processes. The exponential hazard function is correct for the S-Poisson, N-Poisson, and SC processes, and is approximately correct for the Hawkes1 process. In fact, these models perform similarly to the true model for the cases where the hazard function is correctly specified but perform poorly for the other cases. Figure 3 shows the estimated conditional intensity function and clearly demonstrates that the exponential model captures well the true conditional intensity function for the self-correcting process where the exponential hazard function is valid; however, it fails for the Hawkes2 process where the exponential hazard function is not valid. In contrast, our neural network based model can reproduce well the true model for both cases. In this manner, the constant and exponential models are sensitive to model misspecification.
The performance of the piecewise constant model is much worse than that of the neural network based model, particularly for the S-Renewal, N-Renewal, and Hawkes2 processes. For these processes, the variability of the inter-event intervals is large, and the conditional intensity function can rapidly vary for a short period after an event. For such cases, the piecewise constant approximation might not work well. The performance of the piecewise constant model would improve with a finer discretization; however, this increases the computational cost. In this experiment, the numbers of parameters are set to be almost the same between the piecewise constant model and the neural network based model. Even so, the neural network based model performs better, indicating that it uses its parameters more efficiently than the piecewise constant model.
We use the following real datasets for the next experiment.
Finance dataset: This dataset contains the trading records of Nikkei 225 mini, which is the most liquid futures contract in Asia Nikkei . The timestamps of 182,373 transactions in one day are analyzed, and the first and the last of the data are used for training and testing, respectively.
Emergency call dataset: This dataset contains the records of the police department calls for service in San Francisco police . Each record contains the timestamp and address from which the call was made. We prepare 100 separate sequences for the 100 most frequent addresses, which contain a total of 294,865 events. The first and the last of the events in the sequences are used for training and testing, respectively.
Meme dataset: MemeTracker tracks the popular phrases from numerous online resources such as news media and personal blogs meme . This dataset records the timestamps when the focused phrases appear on the internet. We first extract the 50 most frequent phrases and obtain the corresponding 50 separate sequences. We use 40 sequences out of 50 with 247,579 events for training and the remaining 10 sequences with 61,095 events for testing.
Music dataset: This dataset records the history of music listening of users at https://www.last.fm/ music . We prepare the 100 sequences for the 100 most active users in Jan 2009, which contain a total of 299,046 events. The first and the last of the events in the sequences are used for training and testing, respectively.
Figure 4 summarizes the performances for the real datasets. We find that the neural network based model exhibits a competitive or superior score as compared to those of the other models; this demonstrates the practical effectiveness of the proposed model. For the finance dataset, the performances of all the models are close to each other, implying that the constant or exponential hazard function reproduces the event occurrence process of the financial transactions. The neural network based model performs much better than the other models, particularly for the Meme and music datasets: the difference in the scores between the neural network based model and the other models is greater than 0.5. This difference should be significant (e.g., in the case of the right panel in Fig. 3, where the exponential model clearly fails to reproduce the true model, the score difference is about 0.4 between the true and exponential models). For the two datasets, the data contain inter-event intervals that are much longer than the average, and the variability of the inter-event intervals is large. Other than the neural network based model, the models presumably fail to adapt to such a feature.
In this study, we extended the RNN based point process models such that the time course of the hazard function is represented in a general manner based on a neural network. We then showed the effectiveness of our model by analyzing both synthetic and real data. Primary advantages of the proposed model are summarized as follows:
By using a feedforward neural network, our model can reproduce any time course of the hazard function in principle, i.e., the usefulness of fully neural network based modeling in point processes is indicated.
By modeling the cumulative hazard function rather than the hazard function itself, we can avoid the direct evaluation of the integral in the log-likelihood function. The log-likelihood function can be exactly and efficiently evaluated without relying on the numerical approximation because of this approach.
Moreover, the cumulative hazard function plays an important role in the diagnostic analysis etas ; trt ; omi . We did not consider herein the mark of each event (i.e., the information associated with each event other than the timestamp) because the primary contribution of this work is the development of a general model of the hazard function. However, our approach can be easily extended to marked temporal point processes as well.
T. O. and K. A. are supported by Kozo Keikaku Engineering Inc.
Shuang Li, Shuai Xiao, Shixiang Zhu, Nan Du, Yao Xie, and Le Song. Learning temporal point processes via reinforcement learning. In Advances in Neural Information Processing Systems, 2018.

In our method, we need to obtain not only $Z(\tau, h_i)$, which is given as the output of the cumulative hazard function network, but also its derivative $\frac{\partial}{\partial \tau} Z(\tau, h_i)$. A method for calculating $\frac{\partial}{\partial \tau} Z(\tau, h_i)$ is described below.
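We first formulate the operation of the cumulative hazard function network. We denote the output of the $l$th hidden layer in the cumulative hazard function network by $a^{(l)}$ and write the feedforward operation as follows:

$$a^{(l)} = f^{(l)}\left( W^{(l)} a^{(l-1)} + b^{(l)} \right), \tag{15}$$

where $f^{(l)}$, $W^{(l)}$, and $b^{(l)}$ represent the activation function, the weight matrix, and the bias term for the $l$th layer, respectively. The $0$th layer is the input layer, and we set $a^{(0)} = (\tau, h_i)$, where $h_i$ is the RNN output. The $K$th layer is the output layer, and we have $a^{(K)} = Z(\tau, h_i)$.

We here introduce an additional node $\dot{a}^{(l)}$ for each layer, which has the same dimension as $a^{(l)}$, and consider a feedforward operation for $\dot{a}^{(l)}$, given as follows:

$$\dot{a}^{(l)} = f'^{(l)}\left( W^{(l)} a^{(l-1)} + b^{(l)} \right) \odot \left( W^{(l)} \dot{a}^{(l-1)} \right), \tag{16}$$

where $f'^{(l)}$ is the derivative of $f^{(l)}$ and $\odot$ denotes the element-wise product. We also set $\dot{a}^{(0)} = (1, \mathbf{0})$, where $\mathbf{0}$ is a zero row vector whose dimension is the same as $h_i$. It is noted that the parameters are shared between eqs. (15) and (16).

It is now easy to show that the relation $\dot{a}^{(l)} = \frac{\partial}{\partial \tau} a^{(l)}$ holds. Therefore, we can obtain $\frac{\partial}{\partial \tau} Z(\tau, h_i)$ as the output of the feedforward computational graph defined by eq. (16), as $\frac{\partial}{\partial \tau} Z(\tau, h_i) = \dot{a}^{(K)}$. This also means that the gradient of $\frac{\partial}{\partial \tau} Z(\tau, h_i)$ with respect to the model parameters can be calculated using backpropagation.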
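The paired feedforward passes described above can be written as a short NumPy routine that propagates each layer's output together with its derivative with respect to the elapsed time. The sketch assumes tanh hidden units, a softplus output unit, random placeholder weights made positive by taking absolute values where the constraint applies, and a hypothetical function name `forward_with_derivative`; the final lines compare the propagated derivative with a finite difference.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # derivative of softplus

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2          # derivative of tanh

def forward_with_derivative(tau, h, layers):
    """Propagate the layer output a and its derivative a_dot = d a / d tau.

    layers is a list of (W, b, f, f_prime); the same W and b are used in
    both passes, so no extra parameters are introduced.
    """
    a = np.concatenate(([tau], h))                       # input layer (tau, h)
    a_dot = np.concatenate(([1.0], np.zeros_like(h)))    # d(tau, h)/d tau = (1, 0)
    for W, b, f, f_prime in layers:
        z = W @ a + b
        a, a_dot = f(z), f_prime(z) * (W @ a_dot)        # chain rule per layer
    return a[0], a_dot[0]

rng = np.random.default_rng(0)
n_rnn, n_units = 4, 8                                    # hypothetical sizes
W1 = rng.normal(scale=0.5, size=(n_units, 1 + n_rnn))
W1[:, 0] = np.abs(W1[:, 0])                              # positive weights on tau
W2 = np.abs(rng.normal(scale=0.3, size=(n_units, n_units)))
W3 = np.abs(rng.normal(scale=0.3, size=(1, n_units)))
layers = [(W1, np.zeros(n_units), np.tanh, d_tanh),
          (W2, np.zeros(n_units), np.tanh, d_tanh),
          (W3, np.zeros(1), softplus, sigmoid)]

h = rng.normal(size=n_rnn)                               # hypothetical RNN state
Z, dZ = forward_with_derivative(0.7, h, layers)
eps = 1e-5
finite_diff = (forward_with_derivative(0.7 + eps, h, layers)[0]
               - forward_with_derivative(0.7 - eps, h, layers)[0]) / (2 * eps)
print(Z, dZ, finite_diff)
```

In practice an automatic differentiation library builds this second pass on its own; writing it out by hand mainly shows that the derivative costs about as much as one extra forward pass.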