## Abstract

Inter-event times of various human behaviors are apparently non-Poissonian and obey long-tailed distributions, as opposed to the exponential distributions that correspond to Poisson processes. It has been suggested that human individuals may switch between different states, in each of which they are regarded as generating events according to a Poisson process. If this is the case, distributions of inter-event times should approximately obey a mixture of exponential distributions with different parameter values. In the present study, we introduce the minimum description length principle to compare mixtures of exponential distributions with different numbers of components (i.e., constituent exponential distributions). Because these distributions violate the identifiability property, one is mathematically not allowed to apply the Akaike or Bayes information criteria to their maximum likelihood estimator to carry out model selection. We overcome this theoretical barrier by applying the minimum description length principle to joint likelihoods of the data and latent variables. We show that mixtures of exponential distributions with a few components are selected, as opposed to more complex mixtures, in various data sets, and that the fitting accuracy is comparable to that of state-of-the-art algorithms for fitting power-law distributions to data. Our results lend support to Poissonian explanations of apparently non-Poissonian human behavior.

Keywords: temporal network, power-law distribution, Poisson process, model selection, normalized maximum likelihood

## 1 Introduction

Many social and economic processes are consequences of human behavior. Technological advances increasingly enable us to record various human behaviors in a quantitative manner. Such quantitative understanding often challenges traditional assumptions underlying mathematical modeling of human behavior, a major instance of which is the lack of Poissonian properties in a range of human behavioral data. In other words, when a sequence of time-stamped events, such as email correspondences, is observed for a human individual, the event sequence often deviates from a Poisson process, in which inter-event times independently obey an exponential distribution. Rather, empirical inter-event times from human behavior and other data often obey long-tailed distributions [1, 2]. Such non-Poissonian processes on nodes and edges are building blocks of temporal networks, on which dynamical processes such as epidemic processes often behave differently from the same processes on static networks [3, 4].

Two classes of models that generate long-tailed distributions of inter-event times are priority queue models and modulated Markov processes [3, 4]. In priority queue models, one assumes that a human individual is a queue that receives tasks across time and prioritizes some particular tasks for execution [1]. In contrast, in modulated Markov processes, one assumes that a human individual generates events according to a Poisson process (i.e., one determines whether to have an event right now at a constant rate without memory) but modulates the event rate. The event rate may be assumed to take one of a few values [5, 6, 7, 8, 9, 10] or continuously many values [11, 12]. When we assume a few states, each state corresponds to a Poisson process. An underlying assumption about human behavior is that an individual transits among a few distinguishable states, such as an active state and an inactive state. If this is the case, we should be able to fit a mixture of exponential distributions, rather than power-law distributions, to empirical distributions of inter-event times with a reasonable accuracy. This is because a subset of inter-event times, i.e., those produced in each state, is expected to obey an exponential distribution.

In fact, a mixture of a small number of exponential distributions and a power-law distribution with an exponential cutoff can look similar. In the present study, we fit mixtures of exponential distributions to several data sets of inter-event times. We perform model selection to determine the number of component exponential distributions fitted to the data and also compare the mixture of exponential distributions with two types of conventionally used power-law distributions. A technical challenge is that the maximum likelihood estimator of the mixture of exponential distributions does not satisfy asymptotic normality, so that one is not allowed to use the AIC or BIC (e.g., p. 203 in Ref. [13]). Therefore, based on the minimum description length criterion [14, 15], we perform the procedure called latent variable completion [16, 17, 18, 19] to derive variants of the minimum description length justified for mixtures of exponential distributions. We find that mixtures of up to three exponential distributions are the best performers in many cases. Python codes for estimating and selecting the mixture of exponential distributions are available at GitHub (https://github.com/naokimas/exp_mixture_model).

## 2 Methods

### 2.1 Mixture of exponential distributions and its maximum likelihood estimation

We fit mixtures of exponential distributions (we refer to them as EMMs, after exponential mixture models) to distributions of inter-event times, denoted by $\tau$, and compare the fit with the case of power-law distributions. We denote by $K$ the number of exponential distributions to be mixed, which we refer to as the number of component distributions. The mixing weight, i.e., the probability that the $k$th exponential distribution is used, is denoted by $\pi_k$ ($1 \le k \le K$). The probability density function for an EMM is given by

$$p(\tau) = \sum_{k=1}^{K} \frac{\pi_k}{\mu_k} e^{-\tau/\mu_k}, \quad (1)$$

where $\pi_k \ge 0$, $\sum_{k=1}^{K} \pi_k = 1$, and $\mu_k$ is the mean of the $k$th exponential distribution.

Given a series of inter-event times, $\tau_1, \ldots, \tau_n$, and the number of components, $K$, we estimate the values of the model parameters, $\{\pi_k, \mu_k\}_{k=1}^{K}$, as follows.

First, we run the expectation-maximization (EM) algorithm [20]. As an initial condition, we set $\pi_k^{(0)} = 1/K$ ($1 \le k \le K$), where the superscript indicates the iteration number in the EM procedure. We also draw $\mu_k^{(0)}$ ($1 \le k \le K$) such that $\ln \mu_k^{(0)}$ independently obeys the uniform density on $[\ln \tau_{\min}, \ln \tau_{\max}]$, where $\tau_{\min}$ and $\tau_{\max}$ are the minimum and maximum of the inter-event time in the given data set, respectively. In this manner, we intend to generate a sufficiently broad initial distribution of $\mu_k^{(0)}$ while a majority of the $\mu_k^{(0)}$ values is relatively small. Intuitively, a large value of $\mu_k^{(0)}$, which sometimes occurs, reflects the long tail of the distribution of inter-event times. We iterate the following steps by alternating the expectation (E) and maximization (M) steps. The E step is given by

$$r_{ik}^{(t)} = \frac{\pi_k^{(t)} \left(\mu_k^{(t)}\right)^{-1} e^{-\tau_i/\mu_k^{(t)}}}{\sum_{k'=1}^{K} \pi_{k'}^{(t)} \left(\mu_{k'}^{(t)}\right)^{-1} e^{-\tau_i/\mu_{k'}^{(t)}}}, \quad (2)$$

where $1 \le i \le n$ and $1 \le k \le K$. One can interpret $r_{ik}^{(t)}$ as the probability that $\tau_i$ is generated from the $k$th exponential distribution. The M step is given by

$$\pi_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} r_{ik}^{(t)}, \quad (3a)$$

$$\mu_k^{(t+1)} = \frac{\sum_{i=1}^{n} r_{ik}^{(t)} \tau_i}{\sum_{i=1}^{n} r_{ik}^{(t)}}. \quad (3b)$$

After iterating the E and M steps, one obtains $\hat{\pi}_k$ and $\hat{\mu}_k$, which constitute an approximate maximum likelihood estimator. This estimator corresponds to the case in which one does not estimate the latent variables that explicitly indicate which exponential distribution out of the $K$ exponential distributions has generated each $\tau_i$.
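The E and M steps above can be sketched in pure Python (a minimal illustration of Eqs. (2) and (3), not the authors' released implementation; function and variable names are ours). Responsibilities are computed in log space to avoid numerical underflow:

```python
import math
import random

def em_exponential_mixture(taus, k, n_iter=200, seed=0):
    """Fit a K-component exponential mixture by EM (a sketch of Eqs. (2)-(3))."""
    rng = random.Random(seed)
    n = len(taus)
    lo, hi = math.log(min(taus)), math.log(max(taus))
    pi = [1.0 / k] * k                                      # pi_k^(0) = 1/K
    mu = [math.exp(rng.uniform(lo, hi)) for _ in range(k)]  # log-uniform initial means
    for _ in range(n_iter):
        # E step: responsibility r[i][j] that tau_i came from component j,
        # computed in log space for numerical stability
        r = []
        for t in taus:
            logw = [math.log(pi[j]) - math.log(mu[j]) - t / mu[j] for j in range(k)]
            m = max(logw)
            w = [math.exp(lw - m) for lw in logw]
            s = sum(w)
            r.append([wj / s for wj in w])
        # M step: update mixing weights and component means
        for j in range(k):
            rj = max(sum(r[i][j] for i in range(n)), 1e-12)  # guard against empty components
            pi[j] = rj / n
            mu[j] = max(sum(r[i][j] * taus[i] for i in range(n)) / rj, 1e-12)
    return pi, mu
```

On a synthetic sample drawn from a two-component mixture, the recovered means approximate the true ones, although multiple restarts from different initial conditions are needed in practice to cope with local maxima.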

When the values of the latent variables are required, we compute them as follows. We denote the latent variable corresponding to each $\tau_i$ ($1 \le i \le n$) by $z_i$ ($z_i \in \{1, \ldots, K\}$). In other words, $\tau_i$ is estimated to be generated from the $z_i$th exponential distribution. We first estimate the latent variables by

$$\hat{z}_i = \arg\max_{1 \le k \le K} r_{ik}, \quad (4)$$

where $r_{ik}$ is the responsibility obtained from the EM algorithm. We then estimate the other parameters by

$$\tilde{\pi}_k = \frac{n_k}{n}, \quad (5)$$

where $1 \le k \le K$, $n_k$ is the number of inter-event times whose estimated latent variable is equal to $k$, and

$$\tilde{\mu}_k = \frac{1}{n_k} \sum_{i : \hat{z}_i = k} \tau_i. \quad (6)$$

The $\tilde{\pi}_k$ and $\tilde{\mu}_k$ values given by Eqs. (5) and (6) are a maximum likelihood estimator when the joint distribution of the inter-event times (i.e., $\tau_1$, $\ldots$, $\tau_n$) and the latent variables (i.e., $\hat{z}_1$, $\ldots$, $\hat{z}_n$) is estimated. They are different from the values estimated by the EM algorithm (i.e., $\hat{\pi}_k$ and $\hat{\mu}_k$).

To cope with the problem of local maxima, we estimate the values of $\hat{z}_i$, $\tilde{\pi}_k$, and $\tilde{\mu}_k$ ($1 \le i \le n$, $1 \le k \le K$) ten times starting from different initial conditions. Then, among the ten maximum likelihood estimators, we employ the one that has yielded the largest likelihood of the joint distribution of $\tau_1, \ldots, \tau_n$ and $\hat{z}_1, \ldots, \hat{z}_n$.
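This hard-assignment step (Eqs. (4)-(6)) can be sketched as follows, assuming `pi` and `mu` come from a previous EM run (a minimal illustration; function and variable names are ours):

```python
import math

def complete_latent(taus, pi, mu):
    """Hard-assign each inter-event time to its most probable component and
    re-estimate the parameters from the completed data (sketch of Eqs. (4)-(6))."""
    k = len(pi)
    # z_i maximizes the posterior probability of component membership (Eq. (4))
    z = []
    for t in taus:
        scores = [math.log(pi[j]) - math.log(mu[j]) - t / mu[j] for j in range(k)]
        z.append(max(range(k), key=lambda j: scores[j]))
    n = len(taus)
    counts = [z.count(j) for j in range(k)]
    pi_tilde = [c / n for c in counts]                   # Eq. (5)
    mu_tilde = [sum(t for t, zi in zip(taus, z) if zi == j) / c if c > 0 else float("nan")
                for j, c in enumerate(counts)]           # Eq. (6); nan marks unused components
    return z, pi_tilde, mu_tilde
```

Components with `nan` means are those to which no inter-event time is assigned; the criteria that follow discard such components when counting the effective number of components.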

### 2.2 Model selection criteria for mixtures of exponential distributions

We consider the following six model selection criteria.

#### 2.2.1 AIC and BIC

For a model with $k$ free parameters, maximized likelihood $\hat{L}$, and $n$ observations, the AIC and BIC take the standard forms $\mathrm{AIC} = -2 \ln \hat{L} + 2k$ and $\mathrm{BIC} = -2 \ln \hat{L} + k \ln n$. An EMM with $K$ components has $2K - 1$ free parameters ($K$ means and $K$ mixing weights constrained to sum to one).
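Assuming the standard definitions $\mathrm{AIC} = -2 \ln \hat{L} + 2k$ and $\mathrm{BIC} = -2 \ln \hat{L} + k \ln n$, and counting $2K - 1$ free parameters for a $K$-component EMM, the two criteria can be computed as follows (a sketch; helper names are ours):

```python
import math

def aic(loglik, n_params):
    """AIC = -2 ln L + 2k (standard definition)."""
    return -2.0 * loglik + 2.0 * n_params

def bic(loglik, n_params, n_obs):
    """BIC = -2 ln L + k ln n (standard definition)."""
    return -2.0 * loglik + n_params * math.log(n_obs)

def emm_free_params(k):
    """A K-component EMM has K means and K - 1 free mixing weights."""
    return 2 * k - 1
```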

#### 2.2.2 AIC and BIC with latent variable completion

An EMM is nonidentifiable; that is, the mapping between parameter sets and distributions is not one-to-one [13]. For example, if $K = 2$ and $\mu_1 = \mu_2$, the distribution of $\tau$ is the same exponential distribution regardless of the value of $\pi_1$. When a distribution is nonidentifiable, the maximum likelihood estimator is not asymptotically normal, and the conventional AIC and BIC, which are derived on the basis of asymptotic normality, lose justification [13].

A method for overcoming this theoretical barrier is to apply model selection criteria to a complete variable model, in which one completes the values of the latent variables, rather than to a marginalized model [16, 17, 18, 19]. We call this method latent variable completion [19] throughout this paper. Under latent variable completion, one explicitly incorporates the likelihood of the latent variables, $z_1, \ldots, z_n$, without marginalizing them, into a model selection criterion. Then, the joint distribution of $\tau_1, \ldots, \tau_n$ and $z_1, \ldots, z_n$ is an identifiable distribution, so that the use of the AIC and BIC is justified. In practice, latent variable completion is also better at describing some empirical data (for example, at estimating an appropriate number of components, $K$) than model selection criteria that do not use latent variable completion [16, 17].

We refer to the corresponding AIC and BIC as the AIC and BIC with latent variable completion [16, 18] and denote them by $\mathrm{AIC_{LVC}}$ and $\mathrm{BIC_{LVC}}$, respectively. We obtain

$$\mathrm{AIC_{LVC}} = -2 \ln P(\tau_1, \ldots, \tau_n, \hat{z}_1, \ldots, \hat{z}_n; \tilde{\theta}) + 2 (2\tilde{K} - 1) \quad (9)$$

and

$$\mathrm{BIC_{LVC}} = -2 \ln P(\tau_1, \ldots, \tau_n, \hat{z}_1, \ldots, \hat{z}_n; \tilde{\theta}) + (2\tilde{K} - 1) \ln n. \quad (10)$$

In Eqs. (9) and (10), $\tilde{\theta} = (\tilde{\pi}_1, \ldots, \tilde{\pi}_K, \tilde{\mu}_1, \ldots, \tilde{\mu}_K)$, derived in Eqs. (5) and (6), is the maximum likelihood estimator for the joint distribution of the inter-event times $\tau_1, \ldots, \tau_n$ and the latent variables $\hat{z}_1, \ldots, \hat{z}_n$. The likelihood in Eqs. (9) and (10) is given by

$$P(\tau_1, \ldots, \tau_n, \hat{z}_1, \ldots, \hat{z}_n; \tilde{\theta}) = \prod_{i=1}^{n} \frac{\tilde{\pi}_{\hat{z}_i}}{\tilde{\mu}_{\hat{z}_i}} e^{-\tau_i/\tilde{\mu}_{\hat{z}_i}}. \quad (11)$$

In Eqs. (9), (10), and (11), $\tilde{K}$ is the number of components that are used at least once under the estimated latent variable values. The latent variable $\hat{z}_i$ specifies which of the $K$ exponential distributions has produced $\tau_i$ for each $i$. Therefore, as a result of estimating the latent variables, a $k$th exponential distribution may not have any inter-event times belonging to it (i.e., $n_k = 0$). We exclude such exponential distributions, i.e., those with $n_k = 0$, and only consider the remaining $\tilde{K}$ exponential distributions. To calculate $\mathrm{AIC_{LVC}}$ and $\mathrm{BIC_{LVC}}$, we use $\tilde{K}$ instead of $K$ because $\tilde{K}$ is the actual number of components given the estimated latent variable values. For the same reason, we will use $\tilde{K}$ instead of $K$ in the two model selection criteria introduced in the following sections, which also explicitly use the estimated latent variables and latent variable completion.

#### 2.2.3 Normalized maximum likelihood codelength

We next introduce model selection criteria on the basis of the minimum description length (MDL) principle [15]. This principle asserts that the best model is the one that attains the shortest codelength required for encoding the data as well as the model itself. The codelength is the number of bits required for encoding the data into a binary sequence under the information-theoretic requirement that each codeword is uniquely decodable even without commas. The normalized maximum likelihood (NML) codelength is a minimum description length [23, 15]. The NML codelength minimizes the worst-case (i.e., over data sets) regret, which is equal to the actual codelength minus the ideal codelength [24]. The other two model selection criteria, which we will explain in the next two sections, are based on the NML. Therefore, we explain the NML codelength in this section. Although we assume that inter-event times are continuous-valued, we first explain the NML codelength for a sequence of discrete-valued variables for expository purposes.

To shorten the codelength for the given data $x$, one should in principle minimize the Shannon information given by $-\log P(x; \theta)$, where we assume throughout the present paper that $\log$ is in base 2 unless we specify the base. The minimizer is obviously given by the maximum likelihood estimator, $\hat{\theta}(x)$. However, the maximum likelihood estimator generally yields

$$\sum_{x} P(x; \hat{\theta}(x)) > 1 \quad (12)$$

because $\hat{\theta}(x)$ depends on $x$. In Eq. (12), the summation is taken over all possible values of $x$. Therefore, we use the normalized probability distribution given by

$$P_{\mathrm{NML}}(x) = \frac{P(x; \hat{\theta}(x))}{\sum_{x'} P(x'; \hat{\theta}(x'))} \quad (13)$$

to encode the data [25]. In other words, we set

$$L_{\mathrm{NML}}(x) = -\log P_{\mathrm{NML}}(x) = -\log P(x; \hat{\theta}(x)) + \log C, \quad (14)$$

where

$$C = \sum_{x'} P(x'; \hat{\theta}(x')) \quad (15)$$

is called the parametric complexity. In Eq. (14), the first term on the right-hand side is small when the model fits the data well, and the second term represents the complexity of the model. In general, $C$ tends to increase as the number of parameters increases because an increase in the number of parameters leads to the expansion of the parameter space in which one searches for $\hat{\theta}(x)$, which leads to an increase in $C$ [26].
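As a concrete discrete-valued example (ours, not from the paper), the parametric complexity of the Bernoulli model over binary sequences of length $n$ can be computed exactly by grouping the $2^n$ sequences by their number of ones:

```python
from math import comb

def bernoulli_parametric_complexity(n):
    """Sum of maximized likelihoods over all length-n binary sequences (Eq. (15)).
    For a sequence with m ones, the maximum likelihood parameter is m / n."""
    total = 0.0
    for m in range(n + 1):
        # comb(n, m) sequences share the same maximized likelihood;
        # Python evaluates 0.0 ** 0 as 1.0, which is the convention needed here
        total += comb(n, m) * (m / n) ** m * ((n - m) / n) ** (n - m)
    return total
```

For example, the complexity equals 2 for $n = 1$ and grows with $n$, which illustrates the role of the second term in Eq. (14).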

One can similarly derive the NML codelength for a sequence of continuous-valued variables by replacing the summation with an integral. For example, the equivalent of Eq. (15) in the case of continuous-valued variables is given by

$$C = \int P(x; \hat{\theta}(x))\, \mathrm{d}x. \quad (16)$$

In the remainder of this section, we explain the NML codelength for an exponential distribution [18], not for an EMM, for two reasons. First, analytically calculating the NML codelength for an EMM seems formidable. Second, we will use the NML codelength for single exponential distributions in deriving the two types of NML-based codelengths for EMMs in the following sections.

Consider an exponential distribution given by

$$p(\tau; \mu) = \frac{1}{\mu} e^{-\tau/\mu}. \quad (17)$$

For this distribution, we obtain

$$L_{\mathrm{NML}}(\tau_1, \ldots, \tau_n) = -\ln \prod_{i=1}^{n} p(\tau_i; \hat{\mu}) + \ln C(n), \quad (18a)$$

$$C(n) = \int_{Y} \prod_{i=1}^{n} p(\tau_i; \hat{\mu}(\tau_1, \ldots, \tau_n))\, \mathrm{d}\tau_1 \cdots \mathrm{d}\tau_n, \quad (18b)$$

where $Y$ is a region of $(\tau_1, \ldots, \tau_n)$ such that the maximum likelihood estimator given by

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} \tau_i \quad (19)$$

is contained in $[\mu_{\min}, \mu_{\max}]$. We have to confine the domain of integration to $Y$ in this manner because the integral in Eq. (18b) diverges if $Y$ is replaced by the entire space. In practice, if we take $\mu_{\min}$ small enough and $\mu_{\max}$ large enough, $\mu_{\min} \le \hat{\mu} \le \mu_{\max}$ would be satisfied for empirical sequences of inter-event times $\tau_1, \ldots, \tau_n$. The standard method to evaluate the integral in Eq. (18b) is to transform it into an integral in the parameter space [27, 18]. This method yields a manageable integral over $\mu$ [18]. Confining the integration to region $Y$ is a useful heuristic for avoiding such a divergence in NML-related calculations [27].

To incorporate the codelength necessary for encoding the values of $\mu_{\min}$ and $\mu_{\max}$ into the NML codelength, we rewrite $\mu_{\min}$ and $\mu_{\max}$ in terms of integers $m_1$ and $m_2$, respectively. Then, we encode $m_1$ and $m_2$ instead of $\mu_{\min}$ and $\mu_{\max}$.

#### 2.2.4 Normalized maximum likelihood codelength with latent variable completion

As the first NML-type model selection criterion for EMMs, we consider a latent variable completion of the NML codelength. We refer to the resulting criterion as NML-LVC and to the codelength as $L_{\mathrm{NML\text{-}LVC}}$. Although an analytical expression for the NML codelength for an EMM without latent variable completion is unavailable, we can analytically calculate the NML codelength for the joint distribution of the inter-event times and the latent variables.

Similar to the derivation in the case of a mixture of Gaussian distributions [18], we derive the codelength for EMMs as follows:

$$L_{\mathrm{NML\text{-}LVC}}(\tau_1, \ldots, \tau_n, \hat{z}_1, \ldots, \hat{z}_n) = -\ln P(\tau_1, \ldots, \tau_n, \hat{z}_1, \ldots, \hat{z}_n; \tilde{\theta}) + \ln C_{\mathrm{LVC}}(n, \tilde{K}), \quad (24)$$

where

$$C_{\mathrm{LVC}}(n, K) = \sum_{z_1=1}^{K} \cdots \sum_{z_n=1}^{K} \int_{\tilde{Y}} P(\tau_1, \ldots, \tau_n, z_1, \ldots, z_n; \tilde{\theta}(\tau_1, \ldots, \tau_n, z_1, \ldots, z_n))\, \mathrm{d}\tau_1 \cdots \mathrm{d}\tau_n. \quad (25)$$

We remind that $P(\tau_1, \ldots, \tau_n, \hat{z}_1, \ldots, \hat{z}_n; \tilde{\theta})$ is given by Eq. (11). Region $\tilde{Y}$ of $(\tau_1, \ldots, \tau_n)$ is defined such that each of $\tilde{\mu}_1$, $\ldots$, $\tilde{\mu}_K$ in the maximum likelihood estimator when the latent variables are explicitly estimated (i.e., Eq. (6)) is contained in $[\mu_{\min}, \mu_{\max}]$. Region $\tilde{Y}$ coincides with region $Y$ when $K = 1$. Similarly to the case of Eq. (18b), we confine the domain of integration to $\tilde{Y}$ to avoid the divergence of the integral in Eq. (25). We set

$$\mu_{\min} = \frac{\tau_{\min}}{m_1}, \quad (26)$$

where

$$m_1 = \left\lceil \frac{\tau_{\min}}{\min_{1 \le k \le K} \tilde{\mu}_k} \right\rceil, \quad (27)$$

and

$$\mu_{\max} = m_2 \tau_{\max}, \quad (28)$$

where

$$m_2 = \left\lceil \frac{\max_{1 \le k \le K} \tilde{\mu}_k}{\tau_{\max}} \right\rceil, \quad (29)$$

for two reasons. First, $\mu_{\min}$ is small when $m_1$ is large or $\tau_{\min}$ is small. Second, encoding of $\mu_{\min}$ and $\mu_{\max}$ is facilitated by the introduction of the integer variables, as we did for the NML codelength (section 2.2.3).

By substituting Eq. (11) in Eq. (25), one obtains the parametric complexity as follows:

$$C_{\mathrm{LVC}}(n, K) = \sum_{n_1 + \cdots + n_K = n} \frac{n!}{n_1! \cdots n_K!} \prod_{k=1}^{K} \left(\frac{n_k}{n}\right)^{n_k} C_1(n_k), \quad (30)$$

where

$$C_1(n_k) = \int_{Y} \prod_{i=1}^{n_k} p(\tau_i; \hat{\mu}(\tau_1, \ldots, \tau_{n_k}))\, \mathrm{d}\tau_1 \cdots \mathrm{d}\tau_{n_k} \quad (31)$$

is the parametric complexity for a single exponential distribution with $n_k$ observations. Note that we have used Eq. (21) to derive Eq. (31) and that $C_1(n)$ coincides with $C(n)$ given by Eq. (18b). Because we cannot calculate $C_{\mathrm{LVC}}(n, K)$ using Eq. (30) due to combinatorial explosion, we calculate it using the recursive relationship [18] given by

$$C_{\mathrm{LVC}}(n, K) = \sum_{r=0}^{n} \binom{n}{r} \left(\frac{r}{n}\right)^{r} \left(\frac{n - r}{n}\right)^{n - r} C_{\mathrm{LVC}}(r, K - 1)\, C_1(n - r). \quad (32)$$

The calculation of $C_{\mathrm{LVC}}(n, K)$ using this recursive relationship dominates the computation time for calculating $L_{\mathrm{NML\text{-}LVC}}$.

Finally, we add the codelength needed to encode the integers $m_1$ and $m_2$ to obtain

$$L_{\mathrm{NML\text{-}LVC}} = -\ln P(\tau_1, \ldots, \tau_n, \hat{z}_1, \ldots, \hat{z}_n; \tilde{\theta}) + \ln C_{\mathrm{LVC}}(n, \tilde{K}) + \ell(m_1) + \ell(m_2), \quad (33)$$

where $\ell(m)$ denotes the codelength of a prefix code for the positive integer $m$.

#### 2.2.5 Decomposed NML codelength

The second type of the NML codelength with latent variable completion, which we call the decomposed NML (DNML) codelength [19], also completes the latent variables and then calculates the NML codelength. It does so after decomposing the joint distribution into the distribution of the latent variables, $P(z_1, \ldots, z_n)$, and that of the observables conditioned on the latent variables, $P(\tau_1, \ldots, \tau_n \mid z_1, \ldots, z_n)$ [19]. The DNML codelength was originally formulated for the case in which both observables and latent variables are discrete [19]. Here we regard the inter-event time, $\tau$, as continuous-valued and the latent variable, $z$, as discrete-valued.

We start by considering

$$L_{\mathrm{DNML}}(\tau_1, \ldots, \tau_n, \hat{z}_1, \ldots, \hat{z}_n) = L_{\mathrm{NML}}(\tau_1, \ldots, \tau_n \mid \hat{z}_1, \ldots, \hat{z}_n) + L_{\mathrm{NML}}(\hat{z}_1, \ldots, \hat{z}_n). \quad (34)$$

Equation (22) yields

$$L_{\mathrm{NML}}(\tau_1, \ldots, \tau_n \mid \hat{z}_1, \ldots, \hat{z}_n) = \sum_{k=1}^{\tilde{K}} \left[ -\ln \prod_{i : \hat{z}_i = k} p(\tau_i; \tilde{\mu}_k) + \ln C_1(n_k) \right], \quad (35)$$

where $m_1$ and $m_2$, which enter through the restricted domain of integration in $C_1$, are given by Eqs. (27) and (29), respectively. It should be noted that we could prepare integers $m_1$ and $m_2$ for each $k$, similarly to Eq. (23), to derive an alternative of Eq. (35). However, that coding method requires $2\tilde{K}$ integers and is therefore less efficient than using only the two integers $m_1$ and $m_2$.

The NML codelength on the right-hand side of Eq. (34) is for a multinomial distribution of $\hat{z}_1, \ldots, \hat{z}_n$ and is given by

$$L_{\mathrm{NML}}(\hat{z}_1, \ldots, \hat{z}_n) = -\ln P(\hat{z}_1, \ldots, \hat{z}_n; \tilde{\pi}_1, \ldots, \tilde{\pi}_{\tilde{K}}) + \ln C_{\mathrm{mult}}(n, \tilde{K}). \quad (36)$$

In Eq. (36), $C_{\mathrm{mult}}(n, \tilde{K})$ is the parametric complexity for the multinomial distribution having $\tilde{K}$ elements. Using Eq. (5), one obtains

$$-\ln P(\hat{z}_1, \ldots, \hat{z}_n; \tilde{\pi}_1, \ldots, \tilde{\pi}_{\tilde{K}}) = -\sum_{k=1}^{\tilde{K}} n_k \ln \frac{n_k}{n}, \quad (37)$$

where $n_k$ is the number of inter-event times whose latent variable is equal to $k$. One can recursively calculate $C_{\mathrm{mult}}(n, \tilde{K})$ [28] by

$$C_{\mathrm{mult}}(n, 1) = 1, \quad (38a)$$

$$C_{\mathrm{mult}}(n, 2) = \sum_{r=0}^{n} \binom{n}{r} \left(\frac{r}{n}\right)^{r} \left(\frac{n - r}{n}\right)^{n - r}, \quad (38b)$$

$$C_{\mathrm{mult}}(n, \tilde{K}) = C_{\mathrm{mult}}(n, \tilde{K} - 1) + \frac{n}{\tilde{K} - 2}\, C_{\mathrm{mult}}(n, \tilde{K} - 2) \qquad (\tilde{K} \ge 3). \quad (38c)$$

The computational time for solving this set of recursive equations dominates the computational time for calculating $L_{\mathrm{DNML}}$.
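Assuming the standard Kontkanen-Myllymäki recursion for the multinomial parametric complexity (which we take Eqs. (38a)-(38c) and Ref. [28] to describe), the computation can be sketched in a few lines (function name is ours):

```python
from math import comb

def multinomial_complexity(n, k):
    """Parametric complexity of a k-category multinomial over n observations,
    via the recursion:
      C(n, 1) = 1,
      C(n, 2) = sum_r comb(n, r) (r/n)^r ((n-r)/n)^(n-r),
      C(n, k) = C(n, k-1) + n / (k-2) * C(n, k-2)  for k >= 3."""
    c_prev = 1.0  # C(n, 1)
    c_curr = sum(comb(n, r) * (r / n) ** r * ((n - r) / n) ** (n - r)
                 for r in range(n + 1))  # C(n, 2), computed in O(n) time
    if k == 1:
        return c_prev
    for j in range(3, k + 1):
        c_prev, c_curr = c_curr, c_curr + n / (j - 2) * c_prev
    return c_curr
```

For small cases the recursion agrees with brute-force enumeration over all category-count vectors, e.g., $C_{\mathrm{mult}}(2, 3) = 4.5$.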

Finally, similarly to the case of $L_{\mathrm{NML\text{-}LVC}}$, we add the codelength needed to encode the integers $m_1$ and $m_2$ to obtain

$$L_{\mathrm{DNML}} = L_{\mathrm{NML}}(\tau_1, \ldots, \tau_n \mid \hat{z}_1, \ldots, \hat{z}_n) + L_{\mathrm{NML}}(\hat{z}_1, \ldots, \hat{z}_n) + \ell(m_1) + \ell(m_2), \quad (39)$$

where $\ell(m)$ denotes the codelength of a prefix code for the positive integer $m$.

### 2.3 Power-law distributions

We also fitted two types of power-law distributions to the empirical distributions of inter-event times. The first type is the Pareto distribution, whose probability density function is given by

$$p(\tau) = \frac{\alpha}{\beta} \left(\frac{\tau}{\beta}\right)^{-\alpha - 1}, \quad (40)$$

which is defined for $\tau \ge \beta$. We use the maximum likelihood estimator given by [29, 30]

$$\hat{\beta} = \min \{\tau_1, \ldots, \tau_n\} \quad (41)$$

and

$$\hat{\alpha} = n \left[ \sum_{i=1}^{n} \ln \frac{\tau_i}{\hat{\beta}} \right]^{-1}. \quad (42)$$
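Eqs. (41) and (42) translate directly into code (a sketch; the function name is ours):

```python
import math

def fit_pareto(taus):
    """Maximum likelihood fit of the Pareto distribution:
    beta_hat = min tau_i (Eq. (41)),
    alpha_hat = n / sum_i ln(tau_i / beta_hat) (Eq. (42)).
    Assumes not all values are identical (otherwise the sum of logs is zero)."""
    beta = min(taus)
    n = len(taus)
    alpha = n / sum(math.log(t / beta) for t in taus)
    return alpha, beta
```

On a synthetic Pareto sample generated by inverse-transform sampling, the estimator recovers the exponent closely.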

The second estimator was the one proposed by Clauset et al. [31]. In this method, which we refer to as the PLFit algorithm, one selects the value of $\hat{\beta}$ such that the distribution of the data points satisfying $\tau_i \ge \hat{\beta}$ is as close as possible to a power law. The power-law exponent given the value of $\hat{\beta}$ is estimated by Eq. (42). It should be noted that the PLFit does not intend to fit a power-law or any other distribution to the data points whose values are less than $\hat{\beta}$. We used a publicly available Python implementation of the PLFit [32].
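The cited Python implementation is the natural tool in practice; the core idea, however, fits in a few lines. Below is a simplified sketch of the Clauset et al. procedure (ours, not the reference implementation): it scans candidate lower cutoffs and keeps the one minimizing the Kolmogorov-Smirnov distance between the empirical tail and the fitted Pareto distribution.

```python
import math

def plfit(taus):
    """Simplified PLFit sketch: for each candidate cutoff beta, estimate alpha by
    Eq. (42) on the tail taus >= beta, and keep the beta that minimizes the
    Kolmogorov-Smirnov distance between the empirical and fitted distributions."""
    best = None
    for beta in sorted(set(taus))[:-1]:
        tail = sorted(t for t in taus if t >= beta)
        n = len(tail)
        s = sum(math.log(t / beta) for t in tail)
        if s == 0.0:
            continue
        alpha = n / s
        # KS distance between the empirical CDF and the Pareto CDF 1 - (beta/t)^alpha
        ks = max(abs((i + 1) / n - (1.0 - (beta / t) ** alpha))
                 for i, t in enumerate(tail))
        if best is None or ks < best[0]:
            best = (ks, alpha, beta)
    ks, alpha, beta = best
    return alpha, beta
```

On data drawn from a pure power law, the chosen cutoff stays near the lower end of the sample and the exponent is recovered approximately.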

## 3 Results

### 3.1 Data

We used the following seven data sets of time-stamped event sequences obtained from human behavior. Basic properties of each data set are shown in Table 1.

| Data |  | $N$ | Time resolution | Mean ± std of the inter-event time |
|---|---|---|---|---|
| Office | 92 | 30 | 20 sec | sec |
| Hypertext | 112 | 72 | 20 sec | sec |
| Reality | 104 | 90 | 1.5 hour | day |
| Bitcoin | 2,969 | 35 | 128 sec | sec |
| Email | 777 | 415 | 1 sec | sec |
| College | 1,176 | 159 | 1 sec | sec |
| Sexual | 5,660 | 56 | 1 day | day |

$N$ represents the number of individuals with at least 100 inter-event times. The mean and standard deviation (abbreviated as std) are for the inter-event times and are calculated on the basis of the individuals having at least 100 inter-event times. For the Sexual data set, $N$, the mean, and the standard deviation are based on the individuals having at least 50 inter-event times.

First, we used the data that the SocioPatterns project collected from individuals in an office building that hosted scientific departments [33]. We refer to this data set as the Office data set. The individuals wore a sensor on their chest, with which physical proximity between pairs of individuals was detected. The recording lasted for two weeks. A time-stamped event was defined as a contact that lasted at least 20 seconds, which was the time resolution of the data. For each individual, we ignored the partner of the contact and used the beginning time of each contact as the time of the event. In this manner, we examined a sequence of time-stamped events for each individual in this and the following data sets. Note that, because a contact is a symmetric relationship between two individuals, any event between nodes $i$ and $j$ is considered for both $i$ and $j$.

The inter-event time was defined as the difference between the starting times of two consecutive events. There was no event during the night due to the circadian rhythm. The effect of the circadian rhythm on the analysis of inter-event times may be considerable but is beyond the scope of the present study. Therefore, we excluded the inter-event times when the two events belonged to different days unless otherwise stated. When there were multiple events in consecutive 20-second time windows between the same pair of individuals, we regarded them as constituting a single event lasting over 20 seconds rather than a sequence of events with inter-event times equal to 20 seconds. In practice, in this case, we discarded all but the first event in each such sequence of events in consecutive 20-second time windows and then calculated the inter-event times. We also aggregated events for a focal node that occurred at the same time with different partners into one event. Therefore, the minimum inter-event time is equal to 20 seconds. For the individuals that have at least 100 inter-event times, the mean and standard deviation of the inter-event times are shown in Table 1.
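The preprocessing described above can be sketched as follows (a simplified illustration with our own parameter names; it assumes integer timestamps in seconds and days aligned to midnight):

```python
def inter_event_times(times, resolution=20, day_length=86400):
    """Compute inter-event times as described above (a sketch; names are ours).
    Events at the same time are aggregated; events in consecutive
    `resolution`-second windows are merged into one event, keeping only the
    first; inter-event times spanning two different days are discarded."""
    times = sorted(set(times))  # aggregate simultaneous events
    starts = [t for i, t in enumerate(times)
              if i == 0 or t - times[i - 1] > resolution]
    return [b - a for a, b in zip(starts, starts[1:])
            if a // day_length == b // day_length]  # keep same-day pairs only
```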

Second, we used the Hypertext 2009 dynamic contact network (Hypertext for short) [34]. Under the SocioPatterns project, the data were collected from approximately 75% of the participants in the HT09 conference in Torino over three days. As was the case for the Office data set, we ignored inter-event times that span multiple days.

The third data set originates from the Reality Mining Project (Reality for short), which provides time-stamped physical proximity relationships between the participants of an experiment, who were students or faculty members of MIT. The data were obtained from Bluetooth logs of mobile phones [35]. We did not skip inter-event times overnight because the data were collected over months and the mean inter-event time was much longer than that for the Office and Hypertext data sets (Table 1). We used the data from September 2004 to May 2005, which were the months in each of which there was a sufficiently large number of events in total.

Fourth, Bitcoin OTC is an over-the-counter marketplace where users trade with bitcoin [36]. In Bitcoin OTC, users rate other users regarding trustworthiness. We neglected the rating score and whom a focal user rated. For each user $i$, we examined a series of time-stamped events, where an event was defined as a rating by user $i$ of anybody. We refer to this data set as Bitcoin.

The Email-Eu-core (Email for short) data set was collected from a large European research institution between October 2003 and May 2005 [37]. A node was an email address. An edge was defined between two nodes if and only if they sent emails to each other at least once. A time-stamped event for node $i$ in the network was defined by an email that $i$ sent to anybody, disregarding the identity of the recipient of the email.

The CollegeMsg (College for short) data were collected from students belonging to an online community at the University of California, Irvine, between April and December 2004 [38]. An event was defined as a message sent from a user to another user. We downloaded the Bitcoin, Email, and College data from the Stanford Network Analysis Project (SNAP) website (http://snap.stanford.edu/).

The sexual contact (Sexual for short) data set provides times of online sexual encounters between escorts and sex buyers [39]. We used event sequences for each sex buyer by regarding a commercial sexual activity with any escort as an event.

### 3.2 Fitting of the EMM to the individual with the largest number of events and comparison with power-law distributions

We fitted EMMs with different numbers of components, $K$, to the sequence of inter-event times obtained from each individual in each data set. We examined a range of values of $K$ in each case. Then, according to each of the six model selection criteria, AIC, BIC, $\mathrm{AIC_{LVC}}$, $\mathrm{BIC_{LVC}}$, NML-LVC, and DNML, we selected the best $K$ value.

Results of the model selection for the individual with the largest number of events in each data set are shown in Table 2. Table 2 indicates that the effective number of components finally chosen, $\tilde{K}$, is at most four in all but three out of the 42 combinations of the data set and the criterion. Furthermore, for four out of the seven data sets, $\tilde{K}$ is at most three regardless of the criterion. These results suggest that a mixture of a small number of exponential distributions may be a reasonable approximation to empirical distributions of inter-event times in many cases.

| Data | $n$ | AIC | BIC | $\mathrm{AIC_{LVC}}$ | $\mathrm{BIC_{LVC}}$ | NML-LVC | DNML |
|---|---|---|---|---|---|---|---|
| Office | 403 |  |  |  |  |  |  |
| Hypertext | 659 |  |  |  |  |  |  |
| Reality | 1,198 |  |  |  |  |  |  |
| Bitcoin | 604 |  |  |  |  |  |  |
| Email | 3,948 |  |  |  |  |  |  |
| College | 1,090 |  |  |  |  |  |  |
| Sexual | 118 |  |  |  |  |  |  |

Next, we compared the performance of the estimated EMMs and the power-law estimators in approximating the empirical distributions. We fitted power-law distributions because they have widely been applied to empirical distributions of inter-event times [1, 2, 3, 4]. For the same individuals as those used in Table 2, the empirical and estimated distributions of inter-event times are compared in Fig. 1. When the different model selection criteria yielded different EMMs (i.e., EMMs with different values of $K$ or $\tilde{K}$), we show all the selected EMMs overlaid in the figure. We compared the different distributions in terms of the survival probability (i.e., the probability that the inter-event time is larger than $\tau$) and the odds ratio, i.e., the survival probability divided by one minus the survival probability. Note that the odds ratio is particularly good at capturing differences between distributions at small values of $\tau$ [40].

Figure 1 elicits the following casual observations. First, in five data sets (i.e., except Bitcoin and Email), the right tails of the distribution seem to be more accurately approximated by an EMM than by the power-law distributions. This may be because the two power-law estimators employed here assume a strictly power-law tail, whereas empirical data usually have exponential cutoffs. For the Bitcoin and Email data sets, the power-law estimate obtained by the PLFit algorithm seems to be roughly as accurate as the EMMs in approximating the right tails of the distribution. Second, the odds ratio plots show that, at small $\tau$ values, the approximation looks the most accurate with the Pareto distribution in all but the College data set.
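The two comparison statistics can be computed from data as follows (a sketch; we take the odds ratio to mean $S(\tau) / (1 - S(\tau))$, where $S$ is the survival probability, which is our reading of the definition used here):

```python
def empirical_survival(taus, x):
    """S(x): fraction of inter-event times strictly larger than x."""
    return sum(t > x for t in taus) / len(taus)

def empirical_odds_ratio(taus, x):
    """S(x) / (1 - S(x)); emphasizes differences between distributions at small x.
    Undefined (division by zero) when every inter-event time exceeds x."""
    s = empirical_survival(taus, x)
    return s / (1.0 - s)
```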

To examine whether these casual observations extend to the entire set of inter-event times from the same individuals, we investigated the likelihood of the empirical data under each model. The likelihoods for the three models, i.e., the selected EMM, the maximum likelihood estimator of the Pareto distribution, and the PLFit, are compared in Table 3. Because the PLFit discards small inter-event times, we start by comparing the EMM and the Pareto distribution when all inter-event times are used. The table indicates that the EMM realizes a larger likelihood value than the Pareto distribution in five data sets and vice versa in the other two data sets (i.e., Hypertext and Reality). The EMM has more parameters than the Pareto distribution, which one may suspect to be a reason why the EMM fits the data better than the Pareto distribution in more data sets than vice versa. However, the results are qualitatively the same when the two models are compared in terms of the AIC and BIC (Table S1). We remark that the maximum likelihood estimator for the Pareto distribution is not asymptotically normal, so that the application of the AIC and BIC to the Pareto distribution, as well as to EMMs, is not justified. It should be noted that the other four criteria are not relevant to the Pareto distribution because it is not a latent variable model.

| Data | $n$ |  | EMM | Pareto |
|---|---|---|---|---|
| Office | 403 | 387 (96%) |  |  |
| Hypertext | 659 | 464 (70%) |  |  |
| Reality | 1,198 | 447 (37%) |  |  |
| Bitcoin | 604 | 570 (94%) |  |  |
| Email | 3,948 | 3,915 (99%) |  |  |
| College | 1,090 | 1,089 (100%) |  |  |
| Sexual | 118 | 108 (92%) |  |  |

| Data |  | EMM | Pareto | PLFit |
|---|---|---|---|---|
| Office | 296 (73%) |  |  |  |
| Hypertext | 337 (51%) |  |  |  |
| Reality | 116 (10%) |  |  |  |
| Bitcoin | 85 (14%) |  |  |  |
| Email | 253 (6%) |  |  |  |
| College | 983 (90%) |  |  |  |
| Sexual | 75 (64%) |  |  |  |
